Having the right element to the left

Name, address and place are core attributes in almost any database. You may atomize these attributes into smaller slices, but when you do: mind the sequence.

When working with data matching and party master data management, some of the frequently exposed issues are:

Person name

Often a person name is split into first name and last name, but even when assigning these labels you are on slippery ground. Examples:

  • In some cultures, as in East Asia, the family name is written first and the given name is written last.
  • Some notations indicate that the given name isn’t the first element:
    • “DUPONT Michel” is a customary French way of indicating that the family name is the first element
    • “Smith, John” is a universal way of indicating that the family name is the first element

Besides that, we have issues with middle names and other three-part naming, and with salutations, education and job titles being mixed up in name fields.
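The two notation hints listed above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete name parser: the function name `split_person_name`, the `ParsedName` type and the rule labels are hypothetical, and the sketch only handles two-part names. When no hint is present it falls back to given name first and says so in the `rule` field, because the order cannot be known without cultural context.

```python
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    given: Optional[str]
    family: Optional[str]
    rule: str  # which notation hint (if any) decided the order


def split_person_name(raw: str) -> ParsedName:
    """Split a two-part name using notation hints only (hypothetical helper)."""
    text = " ".join(raw.split())  # collapse repeated whitespace

    # "Smith, John": a comma signals that the family name comes first.
    if "," in text:
        family, given = (part.strip() for part in text.split(",", 1))
        return ParsedName(given or None, family or None, "comma")

    tokens = text.split(" ")

    # "DUPONT Michel": an all-uppercase first token signals family name first.
    if len(tokens) == 2 and tokens[0].isupper() and not tokens[1].isupper():
        return ParsedName(tokens[1], tokens[0], "uppercase-family")

    # No hint: the order is an assumption, so the rule label flags it for review.
    if len(tokens) == 2:
        return ParsedName(tokens[0], tokens[1], "assumed-given-first")

    # Middle names, titles and anything else are out of scope for this sketch.
    return ParsedName(None, None, "unparsed")
```

Calling `split_person_name("DUPONT Michel")` and `split_person_name("Smith, John")` both put the family name in the `family` field, while a name from a family-name-first culture written without any hint ends up flagged as `assumed-given-first` rather than silently trusted.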

Street address

Most of the world is divided into two “street address” cultures:

  • In the Americas you write the house number in front of the street name if you are north of the Rio Grande (US and CA), but you write the house number after the street name if you are south of the Rio Grande (MeXico, BRazil, ARgentina and almost any other country).
  • In Europe you write the house number in front of the street name if you are on the British Isles or in France, but you write the house number after the street name if you are in almost any other country.
  • The rest of the world is also divided in writing street addresses.

Besides that we have other ways of writing addresses like the block style in Japan.
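As a rough sketch of what minding the sequence means when splitting a street line, the house number position can be looked up per country before parsing. The country sets below are illustrative assumptions that only cover a handful of the countries mentioned above, the function name `split_street_line` is hypothetical, and block-style addresses are deliberately left unparsed.

```python
import re
from typing import Optional, Tuple

# Illustrative, incomplete lookup tables (assumptions, not a reference list).
NUMBER_FIRST = {"US", "CA", "GB", "IE", "FR"}
NUMBER_LAST = {"MX", "BR", "AR", "DE", "DK"}


def split_street_line(line: str, country: str) -> Tuple[str, Optional[str]]:
    """Return (street_name, house_number); house_number is None if not found."""
    code = country.upper()
    text = " ".join(line.split())  # collapse repeated whitespace

    if code in NUMBER_FIRST:
        # Number, optional letter suffix, then the street name: "10 Downing Street"
        match = re.fullmatch(r"(\d+\w*)\s+(.+)", text)
        if match:
            return match.group(2), match.group(1)
    elif code in NUMBER_LAST:
        # Street name, then number with optional suffix: "Avenida Paulista 1578"
        match = re.fullmatch(r"(.+?)\s+(\d+\w*)", text)
        if match:
            return match.group(1), match.group(2)

    # Unknown country or no number in the expected position: leave the line whole.
    return text, None
```

The point of the design is that the same string is read differently depending on the country code, and that anything not matching the expected sequence is kept intact instead of being split wrongly.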

Place

Most countries have a postal code system – even Ireland will have that soon.

Despite the fact that a city name in most cases can be obtained by looking up the postal code, we often do store the city name anyway – for those cases where it can’t.

And if the postal code and the city name are in one string: Oh yes, in some cultures you write the city name in front of the postal code and in other cultures you do it the opposite way. And oh no: It doesn’t necessarily follow the sequence of the house number and street name.
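A combined postal code and city string can be split along the same lines. The sketch below assumes a small, hand-picked set of countries and deliberately simplified postal code patterns (the GB pattern in particular is an approximation); the names `POSTAL_PATTERNS`, `CODE_FIRST` and `split_place` are hypothetical. Note that France appears here with the postal code first although it writes the house number first, while Great Britain writes both the house number first and the postal code last.

```python
import re
from typing import Optional, Tuple

# Simplified postal code patterns for a few countries (assumptions for illustration).
POSTAL_PATTERNS = {
    "DE": r"\d{5}",
    "FR": r"\d{5}",
    "DK": r"\d{4}",
    "GB": r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}",
    "US": r"\d{5}(?:-\d{4})?",
}
# Countries in this sketch that write the postal code before the city name.
CODE_FIRST = {"DE", "FR", "DK"}


def split_place(line: str, country: str) -> Tuple[Optional[str], str]:
    """Return (postal_code, city); postal_code is None if it cannot be found."""
    code = country.upper()
    text = " ".join(line.split())  # collapse repeated whitespace
    pattern = POSTAL_PATTERNS.get(code)
    if pattern is None:
        return None, text

    if code in CODE_FIRST:
        match = re.fullmatch(rf"({pattern})\s+(.+)", text)
        if match:
            return match.group(1), match.group(2)
    else:
        # City first; for US lines the state abbreviation stays with the city.
        match = re.fullmatch(rf"(.+?),?\s+({pattern})", text)
        if match:
            return match.group(2), match.group(1)

    return None, text
```

For example, `split_place("75008 Paris", "FR")` and `split_place("London SW1A 2AA", "GB")` both return the postal code as the first element, even though the two strings carry it at opposite ends.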

In a blog post written a while ago we also had a look into postal address hierarchy, granularity, precision and history.


9 thoughts on “Having the right element to the left”

  1. Graham Rhind 27th February 2010 / 11:05

    Good post, Henrik – anything we can do to hammer home the truth about diverse name and address patterns is welcome.

    In a world where many people are blind to diversity in so much, including personal name and address patterns, I think it’s better to state the true extent of the diversity, however overwhelming that might appear, than to tiptoe around it.

    Based just on the type of element and their positions in an address, there are at least 131 address formats covering the whole world, and around 40 personal name formats (I’m discovering more on an almost daily basis) – understanding given name/family name order won’t help with many of those formats. Your comment about the Americas (though I note the “most”) neglects Belize, Guyana, French Guiana and the very many and diverse formats of the Caribbean countries and territories. There are also those (numerous) countries that do not have street addresses at all – they use mailing addresses such as post office boxes, and those can be numeric, alphabetic or a mixture of both. And finally there are those countries which use non-geographic addressing, usually for large mail users. And though most countries do now have postal codes, a large minority (around 60) do not, and should not be ignored.

    We are much more diverse than we think! Let’s celebrate our differences! 🙂

  2. Rich Murnane 27th February 2010 / 14:43

    Like most things in life, parsing out name and address “pieces” from longer strings isn’t as easy as we’d all expect. Thanks Henrik for continuing to illustrate the challenges we face when asked to take on these tasks.

  3. Henrik Liliendahl Sørensen 27th February 2010 / 15:03

    Thanks Graham and Rich.

    Whether you have to deal with diversity in names and addresses – and in what detail – may depend on things like:

    • Do you only have domestic data or do you work with international data?
    • Even if you only have domestic data – do you live in a country with several cultures?
    • Even if you only deal with domestic data, maybe the data entry is done abroad

    No doubt globalization changes the master data game, from data governance policies to concrete data quality matching tool implementation. On the latter, we had a very intense debate in a recent post called Candidate Selection in Deduplication.

  4. Garnie Bolling 27th February 2010 / 17:22

    Henrik, another excellent post.

    When the world went flat, it added a level of complexity that keeps the DQ experts busy, trying to “discover” the column and order to the left.

    My family has a mixed culture… imagine “Lee” … some think Oriental, while others remember the Lee family in the United States (like Robert E Lee). So imagine the fun there.

    Tools have come a long way. What do you like to use? I am interested in your thoughts. Folks like Informatica have lots of good tools, and so does IBM, with Global Name Recognition and ID Resolution… but the WHERE clause is still a tough one since a country can have multiple regions with different WHERE definitions.

    Can you share your process in tools or selection of tools…I am curious. Again, excellent post.

  5. William Sharp 27th February 2010 / 17:30

    Since I predominately deal with US addresses, my intimate knowledge of this subject deals more with the name aspects of this post.
    In addition to the ordering of names, as you’ve illustrated here, there is the recent trend (in the US at least) for a married woman to “hyphenate” her last name. This adds complexity to what, as Rich points out, most people think is a straightforward task. As is often the case, the “devil is in the details”.
    Thanks for another illuminating post, Henrik. I gain insight into the complexities of data standardization/parsing/cleansing with each post you present.

  6. Henrik Liliendahl Sørensen 27th February 2010 / 17:39

    Garnie and William, thanks for joining and kind words.

    Lately I have been using Omikron WorldMatch a lot – also since I have worked for Omikron during the last 4 years.

  7. Thorsten 27th February 2010 / 19:58

    Henrik,

    Good post. One thing that hit me somewhat unexpectedly was the different formats and lengths of postal codes. So I was “burned once”, and now I’m a bit more cautious when dealing with names and addresses.

    I’d be interested in what Google does in Google Maps. It seems they can work with almost any format and translate it to coordinates.

    Again, good post showing some problems we’re facing on a regular basis.

    Thorsten

  8. Henrik Liliendahl Sørensen 27th February 2010 / 21:51

    Thorsten, thanks, you are quite right when pointing at what is possible on services like Google Maps and even what is possible on an everyday device like a car GPS – and what is apparently not possible in large enterprise customer databases.

  9. Graham Rhind 27th February 2010 / 22:40

    @Thorsten

    A lot of people look at Google Maps as some kind of miraculous address parsing tool that has somehow overcome the problems of diverse address structures throughout the world.

    I see it as a rather simple application. Remember, when people are looking for a point on a map they are only considering geographic addresses (so that’s the complexity of mailing addresses and large-user addresses gone immediately). They also never need to add any sub-building or internal information (building names, flat numbers, staircase numbers, relative positions and so on). In fact, all people need to enter to get a map is a thoroughfare name, a building number, a place name and, as an extra, a postal code. These are actually very easily parsed and compared to Google’s extensive databases. Enterprise databases have a lot more address information to deal with, with its associated problems.

    Also, most enterprises don’t have the financial resources that Google can throw at such issues …
