Candidate Selection in Deduplication

When a recruiter or a hiring manager fills a job position, it is basically done by getting in a number of candidates and then choosing the best fit among them. This does not, of course, rule out that there may have been someone better suited among all the people who never made it into the candidate pool.

We have the same problem in data matching when we are deduplicating, consolidating or matching for other purposes.

Let’s look at the following example. We have two names and addresses:

Banca di Toscana Società per azioni
Machiavelli 12
IT 51234 Firenze

Vanca di Toscana SpA
12, Via Niccolò Machiavelli
Florence
Italy

A human or a mature computerized matching engine will be able to decide, with more or less confidence, that this is the same real-world entity, depending on how much of the following knowledge is taken into consideration:

  • The ISO country code for Italy is IT
  • Florence is the English name for the city called Firenze in Italian
  • In Italian (as in Spanish, Germanic and Slavic cultures) the house number is written after the street name (opposite to English and French conventions)
  • In Italian you sometimes omit “Via” (Italian for way) and the first name when a street is named after a person
  • “Società per azioni”, abbreviated SpA or S.p.A., is an Italian legal form

But another question is whether the two records will even be compared. Due to the diversity issues mentioned above and the typo in the first letter of the name in the second record, no ordinary sorting mechanism on the original data will bring the two records into the same range.

If one record is in a table with 1,000,000 rows and the other record is in another table with 1,000,000 rows, the option of comparing every row with every row makes a Cartesian product of 1,000,000,000,000 similarity assignments, which is not practical. Nor is a real-time check against 1,000,000 rows for every new entry a practical option.

I have worked with the following techniques for overcoming this challenge:

Parsing and standardization

The address part of the example data may be parsed and standardized (including using geographical reference data) so that it is put into the same format, like:

IT, 51234, Via Niccolo Machiavelli, 12

Then you are able to compare rows at a chosen geographical depth, such as all rows sharing the same entrance, street or postal code.

This technique is, however, heavily dependent on accurate and precise original addresses, and it works best when applied separately for each culture.
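
As a rough sketch of the idea (the field names and the simplistic standardization rules below are mine, just for illustration; a real engine would use a proper address parser and geographical reference data), candidate selection can then group rows on a blocking key such as country + postal code and only compare rows within the same group:

    from collections import defaultdict
    from itertools import combinations

    def standardize(row):
        # Toy standardization: lower-case the street, drop a leading "via"
        # and normalize the country code and the postal code.
        street = row["street"].lower().removeprefix("via ").strip()
        return {
            "country": row["country"].upper(),
            "postal": row["postal"].strip(),
            "street": street,
            "house_no": row["house_no"],
        }

    def candidate_pairs(rows, key_fields=("country", "postal")):
        # Group rows on the blocking key and only pair rows within a block,
        # avoiding the full Cartesian product.
        blocks = defaultdict(list)
        for row in rows:
            std = standardize(row)
            key = tuple(std[field] for field in key_fields)
            blocks[key].append(row)
        for block in blocks.values():
            yield from combinations(block, 2)

When deduplicating a single table of 1,000,000 rows spread over, say, 10,000 postal codes of roughly equal size, this brings the work down from about 500 billion pairs to around 10,000 × (100 choose 2) ≈ 50 million pairs, at the price of never comparing rows whose postal codes differ.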

Fuzzy search

Here you make use of the same fuzzy techniques used in similarity assignment, but applied already when searching for candidates.
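
A minimal sketch of the idea (my own illustration, using character trigrams as an inverted index): “Banca di Toscana” and “Vanca di Toscana” still share most of their trigrams, so the record is selected as a candidate despite the typo in the first letter:

    from collections import defaultdict

    def trigrams(text):
        # Lower-cased character trigrams, padded so the start of the string
        # also produces grams.
        padded = "  " + text.lower() + " "
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    def build_index(names):
        # Inverted index: trigram -> set of record ids containing it.
        index = defaultdict(set)
        for rec_id, name in names.items():
            for gram in trigrams(name):
                index[gram].add(rec_id)
        return index

    def fuzzy_candidates(query, index, min_shared=5):
        # Return record ids sharing at least min_shared trigrams with the query.
        counts = defaultdict(int)
        for gram in trigrams(query):
            for rec_id in index.get(gram, ()):
                counts[rec_id] += 1
        return {rec_id for rec_id, shared in counts.items() if shared >= min_shared}

    names = {1: "Banca di Toscana Società per azioni", 2: "Gruppo Bancario XYZ"}
    index = build_index(names)
    print(fuzzy_candidates("Vanca di Toscana SpA", index))   # {1}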

Probabilistic learning

If some variations of the same name or address have earlier been accepted as being the same, these variations may be recorded and used in future searching.
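
A simple sketch of the idea (just an in-memory lookup for illustration; a real engine would persist the confirmed variants and probably score them):

    from collections import defaultdict

    class VariantMemory:
        # Remembers name/address variations that were earlier confirmed as
        # referring to the same entity, so future searches can be expanded.

        def __init__(self):
            self.variants = defaultdict(set)

        def learn(self, value_a, value_b):
            # Record that two values were accepted as being the same.
            self.variants[value_a].add(value_b)
            self.variants[value_b].add(value_a)

        def expand(self, value):
            # Return the value plus all previously confirmed variants of it.
            return {value} | self.variants[value]

    memory = VariantMemory()
    memory.learn("Firenze", "Florence")
    memory.learn("Società per azioni", "SpA")
    print(memory.expand("Florence"))   # {'Florence', 'Firenze'}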

Hybrid

As always in data quality automation, combining all the different techniques in a given implementation gives you better margins.
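
For example (continuing the illustrative sketches above), a hybrid candidate selection can simply take the union of the candidates each technique proposes and leave the final decision to the similarity scoring:

    def hybrid_candidates(record, selectors):
        # Union of the candidate ids proposed by each selection technique;
        # the similarity scoring afterwards decides which candidates really match.
        candidates = set()
        for selector in selectors:
            candidates |= selector(record)
        return candidates

    # Stand-ins for the real selectors, e.g. postal code blocking, trigram
    # search and learned-variant lookup, each returning a set of record ids.
    def same_postal(record): return {101, 102}
    def trigram_search(record): return {102, 250}
    def learned_variants(record): return {250}

    print(hybrid_candidates({"name": "Vanca di Toscana SpA"},
                            [same_postal, trigram_search, learned_variants]))
    # {101, 102, 250}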

13 thoughts on “Candidate Selection in Deduplication”

  1. William Sharp 21st February 2010 / 14:12

    Henrik,
    There is just no substitute for standardization/data governance/master data management, is there?
    I see the complex nature of European data highlighting this fact; however, I know the same to be true of data in the US.
    Excellent post to illustrate the complexities of matching on a large scale!
    Kind regards,
    William

  2. Anand 21st February 2010 / 14:29

    Well explained – a major issue.

  3. Jim Harris 21st February 2010 / 17:05

    Excellent post Henrik!

    Most vendors exalt their Algorithm as The King of their data matching solution, while ignoring Candidate Selection – The Prince.

    I love listening to vendors argue about their comparison algorithms. These arguments usually include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, and bipartite graphs.

    However, as you rightfully point out, the comparison algorithm isn’t going to matter if you don’t properly select the records to be compared, and Cartesian comparisons (comparing every record to every other record) are definitely not practical.

    Most vendors still rely on primitive sort-merge techniques for candidate selection, which leaves many false negatives among the records not matched, especially because they were never even compared.

    This is why I laugh at most of the “my comparison algorithm is better than your comparison algorithm” arguments among vendors, since most of their solutions are only capable of producing high quality match results when the input data is prepared using parsing and standardization, which ironically creates data structures where even a sort-merge could find most of the obvious matches.

    I am neither trying to dismiss the importance of the algorithm nor accuse data matching vendors of Machiavellian sales tactics.

    I am simply expressing my appreciation for a well-written blog post.

    Best Regards,

    Jim

  4. Henrik Liliendahl Sørensen 21st February 2010 / 17:56

    Thanks William, Anand and Jim.

    William, dealing with multi-cultural party data surely highlights some challenges, but as you point out, often these challenges also exist in single-cultural data but are hidden behind all the good matching.

    Jim, Candidate Selection being The Prince of matching not addressed by vendors using Machiavellian sales tactics – it doesn’t get much better.

  5. Patrick Austermann 23rd February 2010 / 14:57

    Great post and comments … and an important topic.

    Candidate selection and matching algorithms are not independent. What good is a matching algorithm that follows one paradigm if the candidate selection follows another? In fact, why should the user even have to worry about candidate selection in the first place? You have a query, and your solution should return the closest matches. Period.

    If you assume that your matching algorithm correctly ranks and/or scores matches (ideally both), then the candidate selection needs to tease out enough close matches so that all of the desired results are guaranteed (or reasonably close to certain) to be in the candidate set. And candidate selection has to be precise enough to make the downstream matching computationally feasible.

    For example, if you decide to consider string edit distance as the metric for measuring similarity, as you said, a sorted list is just too primitive (as is a B-tree or any of the common prefix-based approaches).

    Oh how I would love to see a non-biased benchmark across various domains where all vendors can have a bake-off and the customers can then grade solutions along their individual priorities!

  6. Steve Sarsfield 23rd February 2010 / 15:02

    No doubt – standardization is the key to matching. If we can parse our data into discrete standardized components, the match algorithms don’t have to work so hard, nor do we have to work so hard to tune them.

  7. Olivier Mathurin 23rd February 2010 / 16:02

    Great post Henrik! Agree with Patrick’s comment on strong relation between candidate selection and matching algorithms.

    However, to be able to overcome the Country/Language/Alphabet challenge at candidate selection level, you would need to maintain different mappings between each combination (Japan/日本/Nihon/Nippon/日本国/Japon/…), a task that would require a huge number of mappings (and related maintenance!).

    Just as canonical data models are used to avoid point-to-point data model mappings, an abstract layer would represent each local variant. This abstract layer would also reduce the complexity of candidate selection by reducing the number of syntactic requests you would need against your database.

    Such an abstract layer could be built in two ways:
    1) by forcing a chosen syntax for each legal area (forcing "Japan" to be changed into "日本国" for any "address located in the so-called-Japan country")
    – Incoming data are translated by the standardization process,
    – Displaying the data back to the user remains a challenge (not everybody would know that 日本国 means Japan when it is displayed on a screen, so don't store it! 🙂 ),

    2) by choosing an abstract semantic representation:
    – the standardization software would need to translate from the input format into its semantic representation (日本国 –> {JAPAN}).
    – Because semantic technologies use relationships to describe objects and their world, enrichment could be done by semantic augmentation of the object’s attributes ({JAPAN} –> {EAST-ASIA}), improving matching accuracy because it is based on much more metadata and on relationships between the objects and other instances.
    – A semantic representation would also be part of the solution for displaying the address in a readable format to any human in front of his screen, thanks to dictionaries mapping concepts into words.

    For me, candidate selection and efficient Matching techniques would need such an abstract layer to overcome the i18n challenge, and semantic web technologies provide tools for it.
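
    As a toy sketch of what such an abstract layer could look like (purely my own illustration, not any particular product): surface forms map to one concept, the concept carries relations usable for enrichment, and dictionaries map the concept back to a readable form:

        SURFACE_TO_CONCEPT = {
            "Japan": "{JAPAN}", "Japon": "{JAPAN}", "Nihon": "{JAPAN}",
            "Nippon": "{JAPAN}", "日本": "{JAPAN}", "日本国": "{JAPAN}",
        }
        CONCEPT_RELATIONS = {"{JAPAN}": {"part_of": "{EAST-ASIA}"}}
        CONCEPT_TO_DISPLAY = {"{JAPAN}": {"en": "Japan", "ja": "日本"}}

        def to_concept(surface_form):
            # Translate an input value into its semantic representation, if known.
            return SURFACE_TO_CONCEPT.get(surface_form, surface_form)

        def display(concept, language="en"):
            # Render the concept back into a readable form for the user.
            return CONCEPT_TO_DISPLAY.get(concept, {}).get(language, concept)

        print(to_concept("日本国"))            # {JAPAN}
        print(display(to_concept("Japon")))   # Japan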

    There is a place for a third-party service out there … (is AddressDoctor able to do more than validation and cleansing?)

    My 50 cents:
    “Banca di Toscana …”
    “Vanca di Toscana …”
    Shouldn’t any candidate selection algorithm also be able to identify wrong manual entries, based on keyboard configuration and key distance? 🙂

  8. Patrick Austermann 23rd February 2010 / 19:58

    Olivier, you are bringing up a very important point: beyond the, shall we call it “orthographic” matching, is a rich semantic layer that we need to tap into for better accuracy.

    As this might be of general interest in this discussion, I’d like to share what drove the design of our products at Netrics. I don’t mean to hijack the thread though.

    Address standardization is one aspect that has been discussed here. The use of semantic equivalence classes is another – and they occur across many domains. Let’s say a term is a set of tokens and represents a concept. Sets of terms make up an equivalence class. Let’s call a set of semantic equivalences a Thesaurus.

    “Bob” = “Robert” = “Rob”
    “III” = “I I I” = “Third”
    “ST” = “Street”
    “ST” = “Saint”
    “Hypertension” = “high blood pressure”
    “Hypertention” = “high blood pressure”
    “scarlet” = “Red” = “Brick”
    “scarlet” = “scarlata” = “saqirlat” (fine and expensive woollen cloth in Medieval England; per Wikipedia, see http://en.wikipedia.org/wiki/Scarlet_(cloth))
    “Japan” = “日本” = “Nihon” = “Nippon” = “日本国” = “Japon” (example from Olivier Mathurin)

    These examples and other common use cases give rise to the following general requirements for semantic equivalences / thesauri:

    – Use different thesauri on different fields (in the same query)
    – Same term may have different equivalences in different domains (“ST” above)
    – Equivalence classes are non-transitive – you can use “scarlet” and “brick” interchangeably, but not “red” and “saqirlat” – so you can’t normalize data and queries to “scarlet”
    – You may want the thesauri to be case insensitive and ignore punctuation and certain other characters
    – Language and culture independence – you want to be able to use any Unicode term
    – Possibly penalize a substitution vs. the original form, so when looking for Bob Smith, you want the records with exactly Bob Smith first, then those with Robert Smith etc. – and you want to tailor this to the problem and maybe not penalize at all.
    – When you have several multi-token terms, they may partially overlap and thus potentially create an exponential number of possible combinations. You need to intelligently select the best set of applicable substitutions without too much overhead
    – Error-tolerance – you don’t want to maintain all expected misspellings of these terms. There are just too many ways to butcher words.

    And of course all of this needs to be taken into consideration when selecting candidates and in the final matching/scoring.
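
    To make the non-transitivity requirement concrete, here is a toy sketch (my own illustration, not product code) where equivalence classes are stored as sets of terms and a term only expands within the classes it belongs to, never across classes that merely share a member:

        THESAURUS = [
            {"scarlet", "red", "brick"},
            {"scarlet", "scarlata", "saqirlat"},
            {"st", "street"},
            {"st", "saint"},
        ]

        def expand(term):
            # Case-insensitive lookup: all terms sharing an equivalence class
            # with the given term, but no transitive closure across classes.
            t = term.lower()
            expansions = {t}
            for eq_class in THESAURUS:
                if t in eq_class:
                    expansions |= eq_class
            return expansions

        print(expand("Red"))        # {'red', 'scarlet', 'brick'} but not 'saqirlat'
        print(expand("saqirlat"))   # {'saqirlat', 'scarlata', 'scarlet'} but not 'red'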

    Furthermore, if you ever want to change a thesaurus at runtime (say you just discovered, while reviewing matches, a nickname for Peggy you forgot earlier), you don’t want to re-index the whole table or manage rule bases in all the apps that use the matching functionality on the query side. You should just update the thesaurus once, and the next query 10 milliseconds later should use it. And for some queries the user may choose not to use it.

    Now another related topic is how to deal with terms that carry more or less significance. Say we are comparing “ABC Corporation” and “BCD Corporation” in a list of 18 million company names. By pretty much any string distance metric, these two company names are nearly identical. Most human users would say “Huh?” (or, “I disagree with that assessment”). It turns out that the very common word “Corporation” has very little Information (in the Information Theory sense of the word).
    Similarly, when matching music band names, certain words are stop words – “Beatles” vs. “The Beatles”. But wait – we can’t just delete these tokens – what about the band “The The”? What about comparing “ABC Corporation” and “ABC LLC”?

    You can associate a weight to terms to control how much they contribute to the score calculation (e.g. weight of “Corporation” = 30% of a different term of the same length) so that “ABC Corporation” and “BCD Corporation” turn out to be scored as quite dissimilar entities.
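
    For example, a deliberately simplified weighting (my own sketch, just to show the effect):

        TERM_WEIGHTS = {"corporation": 0.3, "llc": 0.3, "the": 0.3}

        def weight(token):
            return TERM_WEIGHTS.get(token.lower(), 1.0)

        def weighted_similarity(name_a, name_b):
            # Weighted token overlap: shared weight divided by total weight.
            tokens_a = {t.lower() for t in name_a.split()}
            tokens_b = {t.lower() for t in name_b.split()}
            shared = sum(weight(t) for t in tokens_a & tokens_b)
            total = sum(weight(t) for t in tokens_a | tokens_b)
            return shared / total if total else 0.0

        print(weighted_similarity("ABC Corporation", "BCD Corporation"))   # ~0.13
        print(weighted_similarity("ABC Corporation", "ABC LLC"))           # ~0.63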

    And this is only the tip of the iceberg – what other semantic features are relevant in matching? Ontologies certainly are one that has been alluded to … (I used a case-based reasoning tool from a German company, Empolis, called “Orenge” many years ago that excelled at this).

  9. Henrik Liliendahl Sørensen 24th February 2010 / 08:28

    Thanks Patrick, Steve and Olivier for sharing your knowledge.

    The best part of blogging is when the comments add a lot to your original thoughts.

    Candidate selection (and similarity scoring) is surely an art of balancing the best from standardization and semantics while taking performance into consideration. The trick is also to be able to automate the balancing.

  10. Henrik Liliendahl Sørensen 25th February 2010 / 07:17

    On LinkedIn, in The Data Quality Association Group, Elaine Medina says:

    Olivier Mathurin & Patrick Austermann – the engineers where I work work hard to achieve excellence in matching, but I am uncertain whether they deem semantic technologies profitable, scalable or implementable. However, I am completely convinced that semantic processing (if we have not already done so) would be beneficial. I have acquired a strong interest in semantic processing over the past year, since I have been tasked with some Data Stewarding responsibilities. I have learned of the limitations of syntactic processing and the challenges that engineers contend with in scoring and standardization. While I am not an engineer (I am an analyst), from what I have experienced, those challenges have led me to believe that implementing semantic technologies is not only beneficial but something that should at least be considered.

  11. Patrick Austermann 25th February 2010 / 14:45

    Henrik,

    Humans are vastly superior to machines in many tasks partly because they can bring a very rich set of semantics to bear when solving problems.

    To take your example – it is very likely that

    Machiavelli 12

    and

    12, Via Niccolò Machiavelli

    represent the same concept – that particular street address. If we don’t use semantics when making that determination then I don’t know what.

    BTW the above interpretation also depends on context – this one happens to occur in what we believe to be an address.
    Machiavelli 12 could also mean many other things. It could be the name of a music band or a gang of criminals with twelve members, an Italian supercar etc. We derive meaning from context and knowledge.

    Semantics is the study of meaning in language (and let’s assume we are dealing with language when looking at data records). Therefore parsing, standardization and anything else that makes assumptions about the data and its meaning is semantics.

    In matching, you can only get so far by doing pattern matching in all its different flavors, from the very humble string edit distance to advanced pattern matching. Eventually you will hit the semantic wall and you can only increase accuracy if you supplement your capabilities.

    (here, I define accuracy as the capability to approximate a human expert’s judgement).

    So what is there beyond the orthographic?

    Semantic technologies come at many levels.

    I mentioned parsing, standardization, thesauri, weighted terms.

    In structured data matching, other important factors are weights, thresholds and how to deal with absent values. What does it mean for a comparison if some data is absent? If the SSN is missing? If the gender flag indicates the person is female – how do we deal with last name differences? Rule-based systems are inherently flawed because they require "all" rules to be explicitly enumerated – but the space of rules cannot be enumerated. And humans, our gold standard, don’t follow a finite set of "rules". I’ll rant more about rules another day.

    For some, semantics means reasoning based on tremendous ontologies – see http://wordnet.princeton.edu/

    And for _some_ problems that’s an area worth exploring (but it is of somewhat limited use if you are only matching names and addresses … although … that’s also a topic for another post).

    What I am trying to say with all this is that we live in a world full of semantics and we need to apply them where appropriate to solve problems, including data quality and matching. Ignoring the rich possibilities of semantics would be a bit like hopping on one foot because one foot is sufficient to get around.

  12. John O'Gorman 25th February 2010 / 19:27

    Hello again, Henrik;

    We’re going to see this ‘semantic’ thread over and over again, I think. There are enough recognizable and repeating patterns in language – and by extension information / data quality – (almost all of them mentioned in your original post and by the respondents, BTW) to warrant the submission of a methodology that begins to crack this nut.

    I have submitted an article for publication that begins to outline such a methodology. If you or your esteemed colleagues would like an advance copy, please e-mail me at jogorman@tiberon-ia.com and I will happily provide them with one. It’s a start.

  13. Henrik Liliendahl Sørensen 26th February 2010 / 06:23

    Patrick, thanks again, I like your remark:

    “Ignoring the rich possibilities of semantics would be a bit like hopping on one foot because one foot is sufficient to get around.”

    John, thanks for joining, I know the semantic topic is close to your heart. Please forward your article.

    What I am aiming at is precisely to understand semantics, outline a methodology and apply it to technology for a concrete purpose, which could be selecting candidates in large-scale name and address matching.
