Data Quality Tools Revealed

To be honest: Data Quality tools today solve only a few of the data quality problems you have. On the other hand, the few problems they do solve may be solved very well, and they cannot be solved by any other line of products, nor in any practical way by humans in any quantity or quality.

Data Quality tools mainly support you with automation of:

• Data Profiling and
• Data Matching

Data Profiling

Data profiling is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources, in order to measure data quality and find critical areas that may harm your business. For more on the subject I recommend the introduction provided by Jim Harris in his post “Getting Your Data Freq On”, which is followed by his series “Adventures in Data Profiling”, parts 1 to 8.
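As a rough illustration of what such a tool automates, here is a minimal profiling sketch in Python using pandas; the column name and the format-mask rules are invented for the example and do not reflect how any particular product works.

```python
# Minimal column profiling sketch: fill rate, cardinality, top values and
# top formats. Purely illustrative; real profiling tools do far more.
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    non_null = series.dropna().astype(str)
    # Reduce each value to a format mask: letters -> A, digits -> 9, rest kept.
    masks = non_null.map(lambda v: "".join(
        "A" if c.isalpha() else "9" if c.isdigit() else c for c in v))
    return {
        "rows": len(series),
        "fill_rate": round(series.notna().mean(), 3),
        "distinct_values": int(non_null.nunique()),
        "top_values": non_null.value_counts().head(5).to_dict(),
        "top_formats": masks.value_counts().head(5).to_dict(),
    }

if __name__ == "__main__":
    phones = pd.Series(["555-1234", "555 1234", None, "5551234", "555-1234"],
                       name="phone")
    print(profile_column(phones))
```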

Saying that you can’t use other product lines for data profiling is actually only partly true. You may come a long way by using features in popular database managers, as demonstrated in Rich Murnane’s blog post “A very inexpensive way to profile a string field in Oracle”. But for full automation and a full set of out-of-the-box functionality, a data profiling tool will be necessary.
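To show how far plain database features can take you, here is a hedged sketch in Python using the standard library’s sqlite3 module; the customer table and email column are made up for the example.

```python
# Sketch: profiling a string field with nothing but SQL and the standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (email TEXT);
    INSERT INTO customer VALUES
        ('a@example.com'), ('b@example.com'), (''), (NULL), ('not-an-email');
""")

# Frequency distribution of value lengths: a crude but useful first profile.
for length, freq in conn.execute(
        "SELECT LENGTH(email), COUNT(*) FROM customer GROUP BY 1 ORDER BY 2 DESC"):
    print(f"length={length}: {freq} rows")

# Completeness: how many rows are NULL or blank.
nulls, blanks = conn.execute(
    "SELECT SUM(email IS NULL), SUM(email = '') FROM customer").fetchone()
print(f"nulls={nulls}, blanks={blanks}")
```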

The data profiling tool market, unlike the data matching market, is also characterized by the existence of open source tools. Talend is the leading one; another is DataCleaner, created by my fellow countryman Kasper Sørensen.

I take the emergence of open source solutions in the realm of data profiling as a sign that this is the technically easiest part of building data quality tools.

Data Matching

Data matching is the ability to compare records that are not exactly the same but are so similar that we may conclude that they represent the same real world object.
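As a toy example of that idea, the few lines of Python below score two near-identical party records with the standard library’s difflib; the records and the reading of a “high” score are assumptions for illustration only.

```python
# Toy similarity score for two records that are not exactly the same.
from difflib import SequenceMatcher

a = "Acme Corporation, 123 Main Street, Springfield"
b = "ACME Corp., 123 Main St., Springfield"
score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(f"similarity: {score:.2f}")  # a high score suggests the same real world object
```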

Here too, some popular database managers have built-in functionality like the fuzzy grouping and fuzzy lookup in MS SQL Server. But in order to really automate data matching processes you need a dedicated tool equipped with advanced algorithms and comprehensive functionality for candidate selection, similarity assignment and survivorship settlement.
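The toy Python sketch below walks through those three steps on a few invented records; the blocking key (postal code), the similarity measure, the 0.7 threshold and the “most recently updated wins” survivorship rule are all assumptions chosen for illustration.

```python
# Toy matching pipeline: candidate selection (blocking), similarity assignment
# and survivorship. All thresholds and rules are illustrative assumptions.
from collections import defaultdict
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Corp", "zip": "1000", "updated": "2010-01-05"},
    {"id": 2, "name": "ACME Corporation", "zip": "1000", "updated": "2010-02-01"},
    {"id": 3, "name": "Beta Ltd", "zip": "2000", "updated": "2009-12-01"},
]

# 1. Candidate selection: only compare records sharing a blocking key (zip).
blocks = defaultdict(list)
for record in records:
    blocks[record["zip"]].append(record)

def name_similarity(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

# 2. Similarity assignment: score candidate pairs and keep the likely matches.
matches = []
for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if name_similarity(block[i], block[j]) > 0.7:
                matches.append((block[i], block[j]))

# 3. Survivorship settlement: keep the most recently updated record of a pair.
for a, b in matches:
    survivor = max((a, b), key=lambda r: r["updated"])
    print(f"merge ids {a['id']} and {b['id']}; survivor is id {survivor['id']}")
```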

Data matching tools are essential for processing large numbers of data rows within a short timeframe, for example when purging duplicates before marketing campaigns or merging duplicates in migration projects.

Matching technology is also increasingly implemented as what is often described as a firewall, where possible new entries are compared to existing rows in the database as upstream prevention against duplication.
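A hedged sketch of such a firewall check might look like the Python below; the 0.85 threshold and the flat string records are assumptions, not a recipe from any particular product.

```python
# Sketch of a matching "firewall": flag a new entry that is suspiciously
# similar to an existing row before it is inserted. Illustrative only.
from difflib import SequenceMatcher

existing_rows = [
    "Acme Corporation, 1000 Copenhagen",
    "Beta Ltd, 2000 Frederiksberg",
]

def admit(candidate: str, threshold: float = 0.85) -> bool:
    """Return True if the candidate does not look like a duplicate."""
    for row in existing_rows:
        score = SequenceMatcher(None, candidate.lower(), row.lower()).ratio()
        if score >= threshold:
            print(f"possible duplicate of {row!r} (score {score:.2f})")
            return False
    return True

if admit("ACME Corp., 1000 Copenhagen"):
    existing_rows.append("ACME Corp., 1000 Copenhagen")
```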

Besides handling duplicates, matching techniques are used for correcting postal addresses against official postal references and for matching data sets against reference databases like B2B and B2C party data directories, as well as against product data systems, all in order to enrich your data and keep it accurate and timely.

Automating matching is in no way straightforward, and solutions are constantly confronted with the balancing act of producing a sufficient number of true positives without creating too many false positives.
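The small Python sketch below is one way to picture that balancing act: an invented, hand-labelled sample of record pairs is scored at a few thresholds to show how true positives and false positives move in opposite directions. It is a demonstration of the trade-off, not a benchmark of any tool.

```python
# Illustrating the threshold trade-off between true and false positives.
# The labelled pairs are invented purely for demonstration.
from difflib import SequenceMatcher

# (record A, record B, whether a human judged them to be the same party)
labelled_pairs = [
    ("Acme Corp", "ACME Corporation", True),
    ("Jon Smith", "John Smith", True),
    ("Beta Ltd", "Beta Logistics Ltd", False),
    ("Ann Olsen", "Anna Nielsen", False),
]

for threshold in (0.6, 0.8, 0.95):
    tp = fp = fn = 0
    for a, b, same in labelled_pairs:
        predicted = SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
        tp += predicted and same          # found a real duplicate
        fp += predicted and not same      # flagged a pair that is not a duplicate
        fn += (not predicted) and same    # missed a real duplicate
    print(f"threshold {threshold}: true positives={tp}, "
          f"false positives={fp}, missed duplicates={fn}")
```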


11 thoughts on “Data Quality Tools Revealed”

  1. Jim Harris 15th February 2010 / 15:34

    Henrik,

    Nice summary of the data profiling and data matching tool space.

    I definitely agree that data profiling is the technically easiest part of data quality, which explains the tool diversity, legitimate non-tool options, and early adoption of open source alternatives.

    Best Regards,

    Jim

    P.S. Thanks for mentioning my data profiling posts!

    🙂

  2. Henrik Liliendahl Sørensen 16th February 2010 / 09:28

    Thanks Jim.

    Maybe the difficulties in data matching are why I seem to have spent most of my career in that discipline and still don’t feel that I know all about it. If anyone else shares the same interest, the LinkedIn Data Matching group is a place to share experiences.

  3. Mark Besaans 17th February 2010 / 09:24

    The approach to data quality can drive your tool development, selection and use.

    I’ll explain: If the business drives the quality initiative, they may be focussed on the semantics (meaning) of the data and have several syntactic (what’s visible) representations of the same semantics. On the other hand, it may be an IT initiative where IT operational excellence is the focus. If data quality is driven by the audit community, what is important is that the IT processes are robust.

    Some tools are good at representing syntax & semantics in the user interface and are chosen because the interface paradigm matches that of the tool user, not necessarily because the tool provides flexibility of application.

    For example: I developed data-qa/meta-qa as a free tool for enterprise IT people who are driven by top level business management to ensure data quality. There was no budget for tools and spreadsheets were used to compare samples of data. Once a method of working was established, it was easy to see what was required: a generic enterprise tool that produces a spreadsheet as output, but uses the power and flexibility of SQL as a driver. The IT staff paradigm was SQL. The business paradigm is spreadsheets. The tool matches the paradigms of both and is flexible enough to compare any data source with a target, integrated with the batch scheduler and delivering spreadsheets by email.

    So we can see that the approach to data quality has a great influence on tool selection, development and use.

  4. Henrik Liliendahl Sørensen 18th February 2010 / 12:31

    Thanks Mark for sharing.

    I had a look at your pages about DataQa, which is, as I understand it, another great example of a data profiling service, here with the emphasis on comparing disparate sources.

    • Mark Besaans 3rd April 2012 / 11:59

      Correct. A newer version by another contributor is diffkit.

  5. Ashley 22nd February 2010 / 11:49

    Very nice article.That’s a great info. Thanks for sharing, really like your view. Awesome explanations about data profiling and data matching.I got more ideas about that.
    Keep posting the good work.

  6. Henrik Liliendahl Sørensen 22nd February 2010 / 13:03

    Ashley, thanks for the comment. In a bit of a funny way, your comment is written in a style that may be characteristic of spam comments, since WordPress placed it in the spam folder (probably using a kind of matching). But I surely recognize your reference to data profiling and matching, which are also core features in the Dataladder product in your name link. I am looking forward to learning about your ideas.

  7. Duane Morrison Smith 25th June 2010 / 01:31

    Like any subject there can be both a simplistic and complex view. Data Profiling by comparison to Data Matching might seem like a simpler subject, however both can be over simplified if one is not prepared to go deep into the subjects. Likewise from a technology perspective, an advanced algorithm may be a complex way of looking for a match, but becomes absolutely useless if you are simply considering the wrong fields, or how the content is structured, and further what is actually in the fields, hence the value of data profiling. A first glance at both Talend and DataCleaner might put them both in the category of Data Profiling tools, when one is simply a data profiling and validation tool and the other is a full scale ETL platform. While I think the profiling dashboards in DataCleaner are great, it would seem that Talend is a more robust platform in the Data Integration & Data Management space. What you may need from a tool is very dependent on what your requirement is. For example an IT department may be looking to perform a data migration quickly and may use an ETL tool to map their source system to their target system and may or may not consider data profiling important to their project and are happy to write their SQL and DTS packages. Whereas a Data Quality Manager may deem this to be a mandatory requirement and may want a purpose built application to monitor the state of Data Quality over time considering content (Data Profiling) and uniqueness (Data Deduplication & Matching).

  8. Henrik Liliendahl Sørensen 25th June 2010 / 06:00

    Thanks Duane for adding these wise words on the subject.

  9. data matching 28th October 2010 / 07:57

    I couldn’t agree more with this line from the post:

    “Automating matching is in no way straightforward, and solutions are constantly confronted with the balancing act of producing a sufficient number of true positives without creating too many false positives.”

    Advanced algorithms make automation shine; poor ones can make your data matching project a nightmare!

  10. data cleansing 26th October 2012 / 20:10

    Great article, this! Just shows what benefits can be gained from using good data quality and matching tools.
