February 8, 2010
A sentiment repeated again and again in connection with Data (Information) Quality improvement goes like this:
“It’s a myth that Data Quality improvement is all about technology”.
In fact you see the same said about a lot of other disciplines, such as:
- “It’s a myth that Master Data Management is all about technology”.
- “It’s a myth that Business Intelligence is all about technology”.
- “It’s a myth that Customer Relationship Management is all about technology”.
I have a problem with that: I have never heard anyone say that DQ/MDM/BI/CRM… is all about technology, and I have never seen anyone write so.
When I make the above remark the reaction is almost always this:
“Of course not, but I have seen a lot of projects carried out as if they were all about technology – and of course they failed”.
Unquestionably true.
But the next question is then about root cause. Why did those projects seem to be all about technology? I think it was:
- Poor project management or
- Bad balance between business and IT involvement or
- Immature technology alienating business users.
In my eyes there is no myth that Data Quality (and a lot of other things) is all about technology. It’s a myth that it’s a myth.
Data Governance, Data Quality Tools | Tagged: Business intelligence, CRM, MDM, Technology, User involvement |
Posted by Henrik Liliendahl Sørensen
February 4, 2010
If I enjoy a restaurant meal, it is basically unimportant to me which raw ingredients were used, where they came from and which tools the chef used while preparing the meal. My concerns are whether the taste meets my expectations, whether the plate looks delicious in my eyes, whether the waiter seems nice and so on.
This is comparable to when we talk about information quality. The raw data quality and the tools available for exposing the data as tasty information in a given context are basically not important to the information consumer.
But in our daily work you and I may be the information chef. In that position we have to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.
Let’s look at some analogies.
Best before
Fresh raw ingredients are similar to up-to-date raw data. Raw data also has a best before date, depending on the nature of the data. Raw data older than that date may be spiced up but will eventually make bad-tasting information.
One-stop-shopping
Buying all your raw ingredients and tools for preparing food – or taking the shortcut with ready-made cookie-cutter stuff – from a huge supermarket is fast and easy (never mind that the basket usually also gets filled with a lot of other products not on the shopping list).
A good chef always selects the raw ingredients from the best specialized suppliers and uses what he considers the most professional tools in the preparation.
Making information from raw data has the same options.
Compliance
Governments around the world have for a long time implemented regulations and inspections regarding food, mainly focused on receiving, handling and storing raw ingredients.
The same is now going on regarding data. Regulations and inspections will naturally be directed at data as it is created, stored and handled.
Diversity
Have you ever tried to prepare your favorite national meal in a foreign country?
Many times this is not straightforward. Some raw ingredients are simply not available, and even some tools may not be among the kitchen equipment.
When making information from raw data under varying international conditions you often face the same kind of challenges.
Data Governance, Data Quality Tools, External Reference Data | Tagged: Compliance, Fit for purpose, The world |
January 27, 2010
At the risk of having the comment area on this blog filled up with SQL statements, I will follow the track and tone from the last post, called Create Table Homo_Sapiens.
In the last post some challenges around modelling people in databases were discussed, with a focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.
Companies often act in the same role as individual people in business processes – not least in the role of a customer. Companies also behave like persons in a lot of ways, such as being born (established), changing name, relocating, marrying (mergers and acquisitions), divorcing (splits) and dying (dissolution).
All over the world a lot of people spend their days entering and updating the data held on business partners in numerous databases. The worldwide sum of B2B connections, with a customer and a vendor each entering and maintaining data about the other, resembles (though less aggressively) the story of the grains on a chessboard:
- 2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
- 3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
- 4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
- 5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
- n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows
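The row counts above follow directly from the closing formula. As a minimal sketch (the function name `master_data_rows` is made up for this illustration), the arithmetic can be checked like this:

```python
def master_data_rows(n: int) -> int:
    """Rows held in total when n companies all trade with each other.

    Each company keeps (n - 1) customer rows and (n - 1) vendor rows.
    """
    if n < 0:
        raise ValueError("n must not be negative")
    customers = n * (n - 1)  # every company lists every other company as a customer
    vendors = n * (n - 1)    # ... and as a vendor
    return customers + vendors


for n in (2, 3, 4, 5):
    print(n, "companies:", master_data_rows(n), "rows")
```

The growth is quadratic rather than the doubling of the chessboard story, which is why it is "less aggressive".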
Last time I checked, the D&B WorldBase held more than 150 million companies. Some are dissolved and (fortunately?) not everyone does business with everyone – but as said, the sum of B2B connections is huge and the work of entering and maintaining the master data seems overwhelming.
If we look at one single company and how it may be represented differently in databases around the world, taking only basic data such as name and address into account, there will be lots of variations. Even in the same table the same real world company often occupies several rows spelled differently.
One of the most effective methods for avoiding duplicates in company master data is plugging into a business directory. By using an externally sourced company ID as a key in your master data you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.
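To illustrate the idea of keying on an external company ID, here is a minimal sketch. The field name `external_id`, the ID values and the sample rows are all invented for the example; a real business directory will have its own identifiers and formats.

```python
from collections import defaultdict

# Invented sample rows: two spellings of the same real world company
# share one (hypothetical) directory ID.
rows = [
    {"name": "Acme Corp", "address": "1 Main St", "external_id": "100000001"},
    {"name": "ACME Corporation", "address": "1 Main Street", "external_id": "100000001"},
    {"name": "Bolt Ltd", "address": "5 Side Rd", "external_id": "100000002"},
]


def group_by_external_id(rows):
    """Collect all rows that point at the same external company ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["external_id"]].append(row)
    return dict(groups)


groups = group_by_external_id(rows)
# Any ID with more than one row is a duplicate candidate for one golden record.
duplicates = {key: group for key, group in groups.items() if len(group) > 1}
print(sorted(duplicates))
```

The differently spelled names never have to be compared with each other; the shared key does the work.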

Data Architecture, External Reference Data, Master Data | Tagged: B2B, Duplicates, One version of the truth, Real world objects |
January 23, 2010
Create Table is a basic statement in the SQL language, which is the most widespread computer language used when structuring data in databases.
The most common entities in databases around the world must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens but are usually called Customer, Member, Citizen, Patient, Contact and so on.
The most common data quality issues around are related to accuracy, validity, timeliness, completeness and not least uniqueness in the data we hold about people.
In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:
- Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
- Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.
As surrogate keys obviously don’t help with real world uniqueness, and there is no common global natural key for all human beings on the earth, we have a challenge in creating a good primary key for a Homo Sapiens table.
Inside a given country we have different forms of citizen IDs (national identification numbers), with terms of use that vary widely between countries. But even in Scandinavia, where I live and where unique citizen IDs are in widespread use, most tables that could have the name Homo_Sapiens cannot use a citizen ID as the (unique) primary key, for several reasons, one being that the data is simply not present in a lot of situations.
Most often we name the tables holding data about human beings after the role people act in within the purpose of use for the data we collect. For example, a Customer table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase, or a business owner making both private purchases and business purchases.
Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that an individual may have acted in, we have a need for viewing these data as related to single human beings. The methods for facilitating this come in different flavours, such as:
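The difference between the two key types can be shown with a small sketch using Python's built-in `sqlite3` module. The table layout, the column names and the `ID-0001` value are assumptions made up for this illustration, not a recommended model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE homo_sapiens (
        person_id  INTEGER PRIMARY KEY,  -- surrogate key, no real world meaning
        citizen_id TEXT UNIQUE,          -- natural key, often not available (NULL)
        full_name  TEXT NOT NULL
    )
""")

insert = "INSERT INTO homo_sapiens (citizen_id, full_name) VALUES (?, ?)"

# Without a citizen ID the surrogate key happily accepts the same person twice.
conn.execute(insert, (None, "Jane Doe"))
conn.execute(insert, (None, "Jane Doe"))

# With a citizen ID present, the natural key rejects the second row.
conn.execute(insert, ("ID-0001", "John Doe"))
try:
    conn.execute(insert, ("ID-0001", "Jon Doe"))
    natural_key_blocked = False
except sqlite3.IntegrityError:
    natural_key_blocked = True

row_count = conn.execute("SELECT COUNT(*) FROM homo_sapiens").fetchone()[0]
print(natural_key_blocked, row_count)
```

The natural key only protects the rows where it is filled in, which is exactly the situation described above.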
- Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities). This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
- Identity Resolution is about the same but – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
- Master Data Hierarchy Management again has the same goal of establishing a golden copy of collected data, with an emphasis on reflecting the complex structure of relationships in the real world as well as the related history.
Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. I wonder about the odds of buy-in from other business and technical delegates.
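As a deliberately naive sketch of the classic deduplication flavour, the snippet below links records whose normalised name and address are identical and treats the first record in each group as the parent row. The sample records and the `match_key` rule are invented for illustration; real matching tools use far more tolerant comparison than this exact-key approach.

```python
import re
from collections import defaultdict


def match_key(name, address):
    """Build a crude match key: lower case, no punctuation, single spaces."""
    text = f"{name} {address}".lower()
    text = re.sub(r"[^a-z0-9 ]", "", text)  # note: also drops non-ASCII letters
    return " ".join(text.split())


# Invented sample records: (record id, name, address).
records = [
    (1, "John  Smith", "12 High St."),
    (2, "john smith", "12 High St"),
    (3, "Mary Jones", "3 Low Rd"),
]


def deduplicate(records):
    """Map each parent record id to the ids of the records linked to it."""
    groups = defaultdict(list)
    for record_id, name, address in records:
        groups[match_key(name, address)].append(record_id)
    return {ids[0]: ids[1:] for ids in groups.values()}


golden = deduplicate(records)
print(golden)
```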
Data Architecture, Data Matching, Master Data | Tagged: B2C, Data model, Duplicates, One version of the truth, Real world objects |
January 17, 2010
The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden the road signs point to the Swedish variant of the name, Köpenhamn. When the new bridge from Germany to east Denmark is finished, the road signs on the German side will point to Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.
These language variants of city (and other) names are a challenge in data matching.
If a human is doing the matching, the match may be made because that person knows about the language variations. This is a strength in human processing. But it is also a weakness in human processing: if another person doesn’t know about the variations, the matching becomes inconsistent because the same results are not repeated.
Computerized match processing may handle the challenge in different ways, including:
- The data model may reflect the real world by having places described by multiple names in given languages.
- Some data matching solutions use synonym listing for this challenge.
- Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection and the varying place names then will be included in the next automated match.
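The synonym listing approach can be sketched in a few lines. The table below only holds the name variants mentioned in this post, and the function names are made up for the example; a real solution would need a much larger, maintained list.

```python
import unicodedata

# Name variants from this post, all mapped to one canonical place name.
PLACE_SYNONYMS = {
    "copenhagen": "copenhagen",
    "københavn": "copenhagen",
    "köpenhamn": "copenhagen",
    "kopenhagen": "copenhagen",
    "copenhague": "copenhagen",
    "copenaghen": "copenhagen",
    "hafnia": "copenhagen",
}


def canonical_place(name):
    """Return the canonical name if the variant is known, else the cleaned input."""
    key = unicodedata.normalize("NFC", name).strip().casefold()
    return PLACE_SYNONYMS.get(key, key)


def same_place(a, b):
    """Two place names match when they resolve to the same canonical name."""
    return canonical_place(a) == canonical_place(b)


print(same_place("København", "Copenhague"))
print(same_place("Copenhagen", "Malmö"))
```

Unlike a human matcher, the lookup gives the same answer every time, but only for the variants somebody has put into the list.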
As globalization moves forward, data matching solutions have to deal with diversity in data. A solution may have worked wonders yesterday with domestic data but be useless tomorrow with international data.
Data Architecture, Data Matching, Data Quality Tools | Tagged: Copenhagen, Data model, One version of the truth, Technology |
January 9, 2010
If you ask me the question “How many people live in your town?” I could give you a correct answer that is 5,000 % off from what you are looking for.
I live in Greve Municipality in Denmark. Population close to 48,000. Greve is a suburb south of Copenhagen. According to Wikipedia, the Copenhagen urban area has a population of 1.2 million and the Copenhagen metro area has a population of 1.9 million people.
The Copenhagen metro area stretches from 40 km (25 miles) south of the city to 40 km (25 miles) north at Elsinore and Kronborg Castle (immortalized in Shakespeare’s Hamlet – always remember to include Shakespeare in a blog).
Furthermore: from Copenhagen you can look across the water to the east and see Sweden and the city of Malmoe. The Copenhagen-Malmoe bi-national urban agglomeration has a total population of 2.5 million people.
The real data quality issue in my initial question is not the precision, validity and timeliness of the number given in the answer but the shared understanding of the label attached to the number.
I noticed that Wikipedia has developed a good metadata habit when stating town populations giving 3 distinct labels: City, Urban and Metro.
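As a rough check of the "5,000 %" remark, the population figures quoted above can be compared with the Greve figure. This is only arithmetic on the rounded numbers given in this post; the function name is made up for the example.

```python
# Rounded populations as quoted in this post, keyed by the label they belong to.
populations = {
    "Greve Municipality": 48_000,
    "Copenhagen urban area": 1_200_000,
    "Copenhagen metro area": 1_900_000,
    "Copenhagen-Malmoe agglomeration": 2_500_000,
}


def percent_of(value, base):
    """Express value as a percentage of base."""
    return value / base * 100


base = populations["Greve Municipality"]
for label, population in populations.items():
    print(f"{label}: {percent_of(population, base):.0f} % of the Greve figure")
```

With these figures the widest label comes out at a little over 5,000 % of the narrowest one, so every answer is "correct" for its own label.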

Data Governance, Metadata | Tagged: Copenhagen, Fit for purpose, One version of the truth, The world |
January 1, 2010
Also this year I have made this New Year’s resolution: I will try to avoid stupid mistakes that actually are easily avoidable.
Just before Christmas 2009 I made such a mistake in my professional work.
It’s not that I don’t have a lot of excuses. Sure I have.
The job was a very small assignment doing what my colleagues and I have done a lot of times before: an Excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had been given a discount price. As usual it had to be finished very quickly.
I was very busy before Christmas – but accepted this minor trivial assignment.
When the Excel sheet arrived it looked pretty straightforward. Some names of healthcare organizations and healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export, suppressing merge/purge candidates, and delivered back (what I thought was) a clean sheet.
But the client came back to me. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. Because I didn’t ask her (as I usually do) a few obvious questions about what would constitute a duplicate. I have even recently blogged about the very challenge I missed, which I call “the echo problem”.
The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear once.
Now, this wasn’t an MDM project where you have to build complex hierarchy structures but one of those many downstream cleansing jobs. Yes, they exist and I predict they will continue to do so in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.
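The point about what constitutes a duplicate can be made concrete with a small sketch. The names, organizations and field names below are invented; the only thing that changes between the two runs is which fields define the duplicate key.

```python
# Invented sample rows: one professional holding two positions.
rows = [
    {"person": "Dr. Anna Berg", "organization": "City Hospital"},
    {"person": "Dr. Anna Berg", "organization": "Berg Private Clinic"},
    {"person": "Dr. Lars Holm", "organization": "City Hospital"},
]


def dedupe(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[field].strip().casefold() for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Duplicate = same person at the same organization: nothing is removed.
per_position = dedupe(rows, ("person", "organization"))
# Duplicate = same person, wherever they work: one row per professional.
per_person = dedupe(rows, ("person",))
print(len(per_position), len(per_person))
```

Both results are "clean"; only the second one is fit for the purpose the client had in mind.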
Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.
Data Matching, Data Quality Tools, Master Data | Tagged: Cleansing, Duplicates, Fit for purpose, odqc, The echo problem |
December 8, 2009
There are plenty of data quality issues related to phone numbers in party master data. Even though a phone number should be far less fuzzy than names and addresses, I have spent lots of time having fun with these calling digits.
Challenges include:
- Completeness – Missing values
- Precision – Inclusion of country codes, area codes, extensions
- Reliability – Real world alignment, pseudo numbers: 1234.., 555…
- Timeliness – Outdated and converted numbers
- Conformity – Formatting of numbers
- Uniqueness – Handling shared numbers and multiple numbers per party entity
You may work with improving phone number quality with these approaches:
Profiling:
Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at:
- Count of filled values
- Minimum and maximum lengths
- Represented formats – best inspected per country if international data
- Minimum and maximum values – highlighting invalid numbers
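The profiling measures listed above could be sketched roughly as follows. The sample values are invented, and the format pattern simply replaces every digit with a `9`; treat it as one possible way of doing this, not a finished profiler.

```python
import re
from collections import Counter


def profile_phone_numbers(values):
    """Return basic profiling measures for a list of raw phone number values."""
    filled = [v.strip() for v in values if v and v.strip()]
    lengths = [len(v) for v in filled]
    # Format pattern: every digit becomes 9, everything else is kept as is.
    formats = Counter(re.sub(r"\d", "9", v) for v in filled)
    # Numeric value of the digits only, to expose suspiciously small or large numbers.
    digit_strings = [re.sub(r"\D", "", v) for v in filled]
    numeric = [int(d) for d in digit_strings if d]
    return {
        "total": len(values),
        "filled": len(filled),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "formats": dict(formats),
        "min_value": min(numeric, default=None),
        "max_value": max(numeric, default=None),
    }


sample = ["+45 12 34 56 78", "12345678", "", None, "(555) 1234"]
profile = profile_phone_numbers(sample)
print(profile)
```

For international data the same function can be run once per country, so that the formats are inspected per country as suggested above.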
Validation:
National number plans can be used as a basis for a next-level check of reliability – both in batch cleansing of a current population and for upstream prevention with new entries. Here numbers not conforming to valid lengths and ranges can be marked.
You may also make some classification indicating whether it is a fixed-line number or a cell number – but the boundaries are not totally clear in many cases.
In many countries a fixed-line number includes an area code indicating the location.
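A length check against a number plan might look like the sketch below. The `NUMBER_PLANS` table is a tiny hand-made stand-in with a single entry (Danish numbers have eight digits and country code 45); a real check would load complete, current number plans including valid ranges.

```python
import re

# Illustrative stand-in for real national number plans (assumption: lengths only).
NUMBER_PLANS = {
    "DK": {"country_code": "45", "national_lengths": {8}},
}


def is_valid_phone(raw, country):
    """Check the national part of a number against the valid lengths for a country."""
    plan = NUMBER_PLANS[country]
    text = raw.strip()
    digits = re.sub(r"\D", "", text)
    if text.startswith("+") or text.startswith("00"):
        if text.startswith("00"):
            digits = digits[2:]  # drop the international call prefix
        if not digits.startswith(plan["country_code"]):
            return False  # number belongs to another country
        digits = digits[len(plan["country_code"]):]
    return len(digits) in plan["national_lengths"]


for number in ["+45 12 34 56 78", "0045 12345678", "12345678", "1234", "+46 12345678"]:
    print(number, is_valid_phone(number, "DK"))
```

The same function can mark invalid numbers in a batch run or reject them at entry time.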
Match and enrichment:
Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories having phone numbers and thereby enriching your data and improving completeness.
Reality check:
Then of course you may call the number and confirm whether you are reaching the right person (or organization). I have, though, never been involved in such an activity or been called by someone only asking if I am who I am.
Data Matching, Data Quality Tools, External Reference Data, Master Data | Tagged: Cleansing, Prevention |