Any organization that deals with customer, prospect, supplier, distributor, product and service information uses all kinds of data in its day-to-day business processes. Identifying a customer or a product within an automated system, by a specific ID number, a name or any other identifying feature, is a key issue in these processes. It is also a task that needs considerable attention, since the collection and management of data are inherently error-prone. People make mistakes, names are misheard, numbers are typed in the wrong order; there are simply too many causes of defective data and poor information quality.
The collective term ‘business data’ is often used without a precise notion of what business data actually contain. They are not just customer identification numbers and product codes. Naturally, the kind and the importance of the data used in a business process differ from organization to organization. However, a closer look at the seemingly endless variation shows that names and addresses of persons and organizations are as detailed and complicated as they are identifying. The following classification will show the details of names, addresses and complementary data.
Defining the data groups as precisely and in as much detail as possible is the first step towards useful interpretation. People, applying their natural language processing capabilities, structure information as they interpret it. They use their frame of reference, which includes their knowledge dictionary, their linguistic repository, statistical information and mathematical information.
Knowledge-based interpretation, incorporated in an automated system to solve data quality issues, must work in exactly the same way. Consider the following examples:
Peter Arnold Frank
If you had to interpret this name, you would probably (assuming you are of European or American origin) designate Peter as a given name, Arnold as a middle name (or second given name) and Frank as a surname. Of course, all three names are very common given names, and all three also exist as surnames. But the signification [given name-given name-surname] is by far the most probable one in this particular context.
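The reasoning above, in which every token is ambiguous on its own and the most probable combination wins, can be sketched in a few lines of Python. This is only an illustration: the `ROLE_SCORES` table, the `position_weight` function and all of the numbers are hypothetical stand-ins for a real knowledge dictionary and real name statistics.

```python
from itertools import product

# Hypothetical toy knowledge dictionary: how plausible each token is in each
# role. The numbers are illustrative assumptions, not real name statistics.
ROLE_SCORES = {
    "Peter":  {"given": 0.9, "surname": 0.1},
    "Arnold": {"given": 0.7, "surname": 0.3},
    "Frank":  {"given": 0.6, "surname": 0.4},
}

def position_weight(role, index, length):
    """Assumed Western convention: the surname tends to come last."""
    is_last = index == length - 1
    if role == "surname":
        return 0.9 if is_last else 0.1
    return 0.1 if is_last else 0.9  # role == "given"

def interpret(tokens):
    """Return the role sequence with the highest combined score."""
    best_roles, best_score = None, -1.0
    # Try every possible assignment of roles to tokens.
    for roles in product(("given", "surname"), repeat=len(tokens)):
        score = 1.0
        for i, (token, role) in enumerate(zip(tokens, roles)):
            score *= ROLE_SCORES[token][role] * position_weight(role, i, len(tokens))
        if score > best_score:
            best_roles, best_score = roles, score
    return best_roles

print(interpret(["Peter", "Arnold", "Frank"]))
```

With these assumed scores, the sequence given name, given name, surname comes out on top, even though each token could also fill the other role.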
Mohammad Ouazzani Benhaddou
This name seems to have a structure similar to that of the name above. However, we will probably (and often unconsciously) interpret it differently. This happens because our frame of reference tells us that the name is most likely of Arab origin, and that names from that region of the world follow different naming conventions. Although the name Mohammad Ouazzani Benhaddou does not carry an identification mark such as “attention, this is a name of Arab origin!”, we will take precisely this origin into account when interpreting the name.
Chr. London Int. Transp. Co.
This example may seem puzzling at first, since most of the words are abbreviations (which are very common in organization names) and the one word that is not an abbreviation, London, is ambiguous. In this case, London is probably a surname. Chr. is most likely the abbreviation of a given name such as Christopher. The abbreviations Int. Transp. Co. most likely stand for International Transport Company.
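A minimal sketch of this kind of dictionary-driven interpretation is given below. The `ABBREVIATIONS` table, the category labels and the single context rule (an unknown word directly after a given name is treated as a surname) are assumptions made for this illustration; a real knowledge repository would be far larger and would weigh several competing readings.

```python
# Hypothetical abbreviation dictionary: token -> (category, assumed expansion).
ABBREVIATIONS = {
    "chr.":    ("given name", "Christopher"),
    "int.":    ("organization term", "International"),
    "transp.": ("organization term", "Transport"),
    "co.":     ("legal form", "Company"),
}

def interpret_org_name(name):
    """Label each token and expand known abbreviations.

    Returns a list of (original token, category, expanded form) tuples.
    """
    labelled = []
    for token in name.split():
        category, expansion = ABBREVIATIONS.get(token.lower(), ("unknown", token))
        labelled.append([token, category, expansion])

    # Assumed context rule: an unknown word that directly follows a given
    # name is most likely a surname (here: "London" after "Chr.").
    for i in range(1, len(labelled)):
        if labelled[i][1] == "unknown" and labelled[i - 1][1] == "given name":
            labelled[i][1] = "surname"

    return [tuple(item) for item in labelled]

result = interpret_org_name("Chr. London Int. Transp. Co.")
for token, category, expansion in result:
    print(f"{token:8} {category:18} {expansion}")
```

The context rule is what resolves the ambiguity of London: without the preceding given name, the dictionary alone would leave the word unclassified.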
The examples above show that a knowledge repository can be very useful in interpretation (recall the use of your own frame of reference and knowledge dictionary mentioned above). Of course, automated interpretation based on natural language needs additional help to perform as well as humans do. But the creation of the knowledge universe is the starting point for answering that short question: Who is who and what is what in my database?
Holger Wandt was Principal Advisor at Human Inference. He joined Human Inference in 1991. As a linguist, he was one of the pioneers of the interpretation and matching technology in the data quality product suite. In his position as Principal Advisor he was responsible for conveying vision to customers and partners and for promoting ideas and vision to industry boards, thought communities, universities and analyst firms.