The Myth about a Myth

February 8, 2010

A sentiment repeated again and again related to Data (Information) Quality improvement goes like this:

“It’s a myth that Data Quality improvement is all about technology”.

In fact you see the same related to a lot of other disciplines as:

  • “It’s a myth that Master Data Management is all about technology”.
  • “It’s a myth that Business Intelligence is all about technology”.
  • “It’s a myth that Customer Relationship Management is all about technology”.

I have a problem with that: I have never heard anyone say that DQ/MDM/BI/CRM… is all about technology and I have never seen anyone writing so.

When I make the above remark the reaction is almost always this:

“Of course not, but I have seen a lot of projects carried out as if they were all about technology – and of course they failed”.

Unquestionable true.

But the next question is then about root cause. Why did those projects seem to be all about technology? I think it was:

  • Poor project management or
  • Bad balance between business and IT involvement or
  • Immature technology alienating business users.

In my eyes there is no myth about that Data Quality (and a lot of other things) is all about technology. It’s a myth it’s a myth.


Bon Appetit

February 4, 2010

If I enjoy a restaurant meal it is basically unimportant to me what raw ingredients from where were used and which tools the chef used during preparing the meal. My concerns are whether the taste meet my expectations, the plate looks delicious in my eyes, the waiter seems nice and so on.

This is comparable to when we talk about information quality. The raw data quality and the tools available for exposing the data as tasty information in a given context is basically not important to the information consumer.

But in the daily work you and I may be the information chef. In that position we have to be very much concerned about the raw data quality and the tools available for what may be similar to rinsing, slicing, mixing and boiling food.

Let’s look at some analogies.

Best before

Fresh raw ingredients is similar to actualized raw data. Raw data also has a best before date depending on the nature of the data. Raw data older than that date may be spiced up but will eventually make bad tasting information.

One-stop-shopping

Buying all your raw ingredients and tools for preparing food – or taking the shortcut with ready made cookie cutting stuff – from a huge supermarket is fast and easy (and then never mind the basket usually also is filled with a lot of other products not on the shopping list).

A good chef always selects the raw ingredients from the best specialized suppliers and uses what he consider the most professional tools in the preparing process.

Making information from raw data has the same options.

Compliance

Governments around the world has for long time implemented regulations and inspection regarding food mainly focused at receiving, handling and storing raw ingredients.

The same is now going on regarding data. Regulations and inspections will naturally be directed at data as it is originated, stored and handled. 

Diversity

Have you ever tried to prepare your favorite national meal in a foreign country?

Many times this is not straightforward. Some raw ingredients is simply not available and even some tools may not be among the kitchen equipment.   

When making information from raw data under varying international conditions you often face the same kind of challenges.


Information and Data Quality Blog Carnival, February 2010

February 2, 2010


El Festival del IDQ Bloggers is another name for the monthly recurring post of selected (actually rather submitted) blog posts on information and data quality started last year by the IAIDQ.

This is the February 2010 edition covering posts published in December 2009 and January 2010.

I will go straight to the point:

Daragh O Brien shared the story about a leading Irish Hospital that has come under scrutiny for retaining data without any clear need. This highlights an important relationship between Data Protection/Privacy and Information Quality. Daragh’s post explores some of this relationship through the “Information Quality Lense”. Here’s the story: Personal Data – an Asset we hold on Trust.

Former Publicity Director of the IAIDQ, Daragh has over a decade of coal-face experience in Information Quality Management at the tactical and strategic levels from the Business perspective. He is the Taoiseach (Irish for chieftain) of Castlebridge Associates. Since 2006 he has been writing and presenting about legal issues in Information Quality amongst other topics. 

Jim Harris is an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality. Obsessive-Compulsive Data Quality is an independent blog offering a vendor-neutral perspective on data quality.

If you are a data quality professional, know the entire works by Shakespeare by heart and are able to wake up at night and promptly explain the theories of Einstein you probably know Jim’s blogging. On the other hand: If you don’t know Shakespeare, don’t understand Einstein, then: Jim to the rescue. Read The Dumb and Dumber Guide to Data Quality.

In another post Jim discusses the out-of-box-experience (OOBE) provided by data quality (DQ) software under the title: OOBE-DQ, Where Are You? Jim also posted part 8 of Adventures in Data Profiling – a great series of knowledge sharing on this important discipline within data quality improvement.

Phil Wright is a consultant based in London, UK who specialises in Business Intelligence and Data Quality Management.  With 10 years experience within the Telecommunications and Financial Services Industries, Phil has implemented data quality management programs, led data cleansing exercises and enabled organisations to realise their data management strategy.

The Data Factotum blog is a new blog in the Data Quality blogosphere, but Phil has kick started with 9 great posts during the first month. A balanced approach to scoring data quality is the start of a series on the topic of using the balanced scoreboard concept in measuring data quality.

Jan Erik Ingvaldsen is a colleague and good friend of mine. In a recent market competition scam cheap flight tickets from Norwegian Air Shuttle was booked by employees from competitor Cimber Sterling using all kinds of funny names. As usual Jan Erik not only has a nose for a good story but he is also able to propose the solutions as seen here in Detecting Scam and Fraud.

In his position as Nordic Sales Manager at Omikron Data Quality Jan Erik actually is a frequent flyer at Norwegian Air Shuttle. Now he is waiting whether he will be included on their vendor list or on the no-fly list.

William Sharp is a writer on technology focused blogs with an emphasis on data quality and identity resolution.

Informatica Data Quality Workbench Matching Algorithms is part of a series of postings were William details the various algorithms available in Informatica Data Quality (IDQ) Workbench. In this post William start by giving a quick overview of the algorithms available and some typical uses for each. The subsequent postings gets more detailed and outline the math behind the algorithm and will finally be finished up with some baseline comparisons using a single set of data.

Personally I really like this kind of ready made industrial espionage.

IQTrainwrecks hosted the previous blog carnival edition. From this source we also has a couple of postings. 

The first was submitted by Grant Robinson, the IAIDQ’s Director of Operations. He shares an amusing but thought provoking story about the accuracy of GPS systems and on-line maps based on his experiences working in Environmental Sciences. Take a dive in the ocean…

Also it is hard to avoid including the hapless Slovak border police and their accidental transportation of high explosives to Dublin due to a breakdown in communication and a reliance on inaccurate contact information. Read all about it.

And finally, we have the post about the return of the Y2k Bug as systems failed to properly handle the move into a new decade, highlighting the need for tactical solutions to information quality problems to be kept under review in a continuous improvement culture in case the problem reoccurs in a different way. Why 2K?

If you missed them, here’s a full list of previous carnival posts:

April 2009 on Obsessive-Compulsive Data Quality by Jim Harris

May 2009 on The DOBlog by Daragh O Brien

June 2009 on Data Governance and Data Quality Insider by Steve Sarsfield

July 2009 on AndrewBrooks.co.uk by Andrew Brooks

August 2009 on The DQ Chronicle by William E Sharp

September 2009 on Data Quality Edge by Daniel Gent

October 2009 on Tooling around in the IBM Infosphere by Vincent McBurney

November 2009 on IQTrainwrecks.com by IAIDQ


Select Company_ID from External_Source where possible

January 27, 2010

With the risk of having the comment area on this blog filled up with SQL statements I will follow the track and tone from the last post called Create Table Homo_Sapiens.

In the last post some challenges around modelling people in databases was discussed with focus on uniqueness. Now we will have a look at the same challenges with companies – the other big part of party master data.

Companies often act in the same role as individual people in business processes – not at least in the role as a customer. Companies also behave as persons in a lot of ways like being born (establish), change name, relocate, marry (mergers and acquisitions), divorce (split) and decease (dissolve).

All over the world a lot of people spend the days entering and updating the data held on business partners in numerous databases. The world wide sum of B2B connections between a customer and a vendor each entering and maintaining the data about the other resembles (though less aggressive) the grains on a chessboard story:

  • 2 companies both exchanging goodies makes 1+1 customers and 1+1 vendors = 4 rows
  • 3 companies all exchanging goodies makes 2+2+2 customers and 2+2+2 vendors = 12 rows
  • 4 companies all exchanging goodies makes 3+3+3+3 customers and 3+3+3+3 vendors = 24 rows
  • 5 companies all exchanging goodies makes 4+4+4+4+4 customers and 4+4+4+4+4 vendors = 40 rows
  • n companies all exchanging goodies makes n*(n-1) customers and n*(n-1) vendors = 2*n*(n-1) rows

Last time I checked the D&B WorldBase held more the 150 millions companies. Some are dissolved and fortunately? everyone doesn’t do business with everyone – but as said, the sum of B2B connections is huge and the work in entering and maintaining the master data seems overwhelming.

If we look at one single company and how it may be represented differently in databases around only taking basic data as name and address into account, there will be lots of variations. Even in the same table the same real world company often occupies several rows spelled differently.

One of the most effective methods for avoiding duplicates of company master data is plugging into a business directory. By using an external sourced company ID as a key in your master data you are able to hold a golden record of that real world entity. As a bonus you are offered updates and access to a lot of additional data you would never be able to collect yourself.       


Create Table Homo_Sapiens

January 23, 2010

Create Table is a basic statement in the SQL language which is the most widespread computer language used when structuring data in databases.

The most common entity in databases around must be rows representing real world human beings (Homo Sapiens) and the different groups we form. Tables for that could have the name Homo_Sapiens but is usually called Customer, Member, Citizen, Patient, Contact and so on.

The most common data quality issues around is related to accuracy, validity, timeliness, completeness and not at least uniqueness with the data we hold about people.

In databases tables are supposed to have a unique primary key. There are two basic types of primary keys:

  • Surrogate keys are typically numbers with no relation (and binding) to the real world. They are made invisible to the users of the applications operating on the database.
  • Natural keys are derived from existing codes or other data identifying an entity in the real world or made for that purpose. They are visible to users and part of electronic, written and verbal communication.

As surrogate keys obviously don’t help with real world uniqueness and there are no common global natural key for all human beings on the earth we have a challenge in creating a good primary key for a Homo Sapiens table.

Inside a given country we have different forms of citizen ID’s (national identification number) with very varying terms of use between the countries. But even in Scandinavia where I live and we have widespread use of unique citizen ID’s most tables that could have the name Homo_Sapiens cannot use a Citizen ID as (unique) primary key for several reasons as well as that data is not present in a lot of situations.    

Most often we name the tables holding data about human beings by the role people will act in within the purpose of use for the data we collect. For example Customer Table. A customer may be an individual but also a household or a business entity. A human being may be a private consumer but also an employee at a business making a purchase or a business owner making both private purchases and business purchases.

Every business activity always comes down to interacting with individual persons. But as our data is collected for the different roles that individual may have acted in, we have a need for viewing these data related to single human beings. The methods for facilitating this have different flavours as:

  • Deduplication is the classic term used for describing processes where records are linked, merged or purged in order to make a golden copy having only one (parent) database row for each individual person (and other legal entities). This is usually done by matching data elements in internal tables with names and addresses within a given organisation.
  • Identity Resolution is about the same but  – if a distinction is considered to exist – uses a wider range of data, rules and functionality to relate collected data rows to real world entities. In my eyes exploiting external reference data will add considerable efficiency in the years to come within deduplication / identity resolution.
  • Master Data Hierarchy Management again have the same goal of establishing a golden copy of collected data by emphasising on reflecting the complex structure of relationships in the real world as well as the related history.

Next time I am involved in a data modelling exercise I will propose a Homo_Sapiens table. Wonder about the odds for buy in from other business and technical delegates.


Diversity in City Names

January 17, 2010

The metro area I live in is called Copenhagen – in English. The local Danish name is København. When I go across the bridge to Sweden the road signs points at the Swedish variant of the name being Köpenhamn. When the new bridge from Germany to east Denmark is finished the road signs on the German side will point at Kopenhagen. A flight from Paris has the destination Copenhague. From Rome it is Copenaghen. The Latin name is Hafnia.

These language variants of city (and other) names is a challenge in data matching.

If a human is doing the matching the match may be done because that person knows about the language variations. This is a strength in human processing. But it is also a weakness in human processing if another person don’t know about the variations and thereby the matching will be inconsistent by not repeating the same results.

Computerized match processing may handle the challenge in different ways, including:

  • The data model may reflect the real world by having places described by multiple names in given languages.
  • Some data matching solutions use synonym listing for this challenge.
  • Probabilistic learning is another way. The computer finds a similarity between two sets of data describing an entity but with a varying place name. A human may confirm the connection and the varying place names then will be included in the next automated match.

As globalization moves forward data matching solutions has to deal with diversity in data. A solution may have made wonders yesterday with domestic data but be useless tomorrow with international data.



A New Year Resolution

January 1, 2010

Also for this year I have made this New Year resolution: I will try to avoid stupid mistakes that actually are easily avoidable.

Just before Christmas 2009 I made such a mistake in my professional work.

It’s not that I don’t have a lot of excuses. Sure I have.

The job was a very small assignment doing what my colleagues and I have done a lot of times before: An excel sheet with names, addresses, phone numbers and e-mails was to be cleansed for duplicates. The client had got a discount price. As usual it had to be finished very quickly.

I was very busy before Christmas – but accepted this minor trivial assignment.

When the excel sheet arrived it looked pretty straight forward. Some names of healthcare organizations and healthcare professionals working there. I processed the sheet in the Omikron Data Quality Center, scanned the result and found no false positives, made the export with suppressing merge/purge candidates and delivered back (what I thought was) a clean sheet.

But the client got back. She had found at least 3 duplicates in the not so clean sheet. Embarrassing. Because I didn’t ask her (as I use to do) a few obvious questions about what will constitute a duplicate. I have even recently blogged about the challenge that I call “the echo problem” I missed.

The problem is that many healthcare professionals have several job positions. Maybe they have a private clinic besides positions at one or several different hospitals. And for this particular purpose a given healthcare professional should only appear ones. 

Now, this wasn’t a MDM project where you have to build complex hierarchy structures but one of those many downstream cleansing jobs. Yes, they exist and I predict they will continue to do in the decade beginning today. And sure, I could easily make a new process ending in a clean sheet fit for that particular purpose based on the data available.

Next time, this year, I will get the downstream data quality job done right the first time so I have more time for implementing upstream data quality prevention in state of the art MDM solutions.


2010 predictions

December 21, 2009

Today this blog has been live for ½ year, Christmas is just around the corner in countries with Christian cultural roots and a new year – even decade – is closing in according to the Gregorian calendar.

It’s time for my 2010 predictions.

Football

Over at the Informatica blog Chris Boorman and Joe McKendrick are discussing who’s going to win next years largest sport event: The football (soccer) World Cup. I don’t think England, USA, Germany (or my team Denmark) will make it. Brazil takes a co-favorite victory – and home team South Africa will go to the semi-finals.

Climate

Brazil and South Africa also had main roles in the recent Climate Summit in my hometown Copenhagen. Despite heavy executive buy-in a very weak deal with no operational Key Performance Indicators was reached here. Money was on the table – but assigned to reactive approaches.

Our hope for avoiding climate catastrophes is now related to national responsibility and technological improvements.

Data Quality

Reactive approach, lack of enterprise wide responsibility and reliance on technological improvements are also well known circumstances in the realm of data quality.

I think we have to deal with this also next year. We have to be better at working under these conditions. That means being able to perform reactive projects faster and better while also implementing prevention upstream. Aligning people, processes and technology is a key as ever in doing that. 

Some areas where we will see improvements will in my eyes be:

  • Exploiting rich external reference data
  • International capabilities
  • Service orientation
  • Small business support
  • Human like technology

The page Data Quality 2.0 has more content on these topics.

Merry Christmas and a Happy New Year.


Phony Phones and Real Numbers

December 8, 2009

There are plenty of data quality issues related to phone numbers in party master data. Despite that a phone number should be far less fuzzy than names and addresses I have spend lots of time having fun with these calling digits.

Challenges includes:

  • Completeness – Missing values
  • Precision – Inclusion of country codes, area codes, extensions
  • Reliability – Real world alignment, pseudo numbers: 1234.., 555…
  • Timeliness – Outdated and converted numbers
  • Conformity – Formatting of numbers
  • Uniqueness – Handling shared numbers and multiple numbers per party entity

You may work with improving phone number quality with these approaches:

Profiling:

Here you establish some basic ideas about the quality of a current population of phone numbers. You may look at:

  • Count of filled values
  • Minimum and maximum lengths
  • Represented formats – best inspected per country if international data
  • Minimum and maximum values – highlighting invalid numbers

Validation:

National number plans can be used as a basis for next level check of reliability – both in batch cleansing of a current population and for an upstream prevention with new entries. Here numbers not conforming to valid lengths and ranges can be marked.

Also you may make some classification telling about if it is a fixed net number or cell number – but boundaries are not totally clear in many cases.

In many countries a fixed net number includes an area code telling about place.

Match and enrichment:

Names and addresses related to missing and invalid phone numbers may be matched with phone books and other directories having phone numbers and thereby enriching your data and improving completeness.

Reality check:

Then you of course may call the number and confirm whether you are reaching the right person (or organization). I have though never been involved in such an activity or been called by someone only asking if I am who I am.