Name matching

Here is a quiz: where does Karimbux Zainoelbaks come from? While you’re thinking about that, I want to write today about multi-cultural name recognition software and, specifically, about Language Analysis Systems (LAS), which is the leading (only?) vendor in this field.

Of course, the first question is whether there is actually a market for name recognition and matching, as distinct from name and address matching. I have to say that I wouldn’t have thought so prior to discussing this with LAS – but my eyes have been opened.

How are you getting on with the quiz? Here’s a clue: Karimbux would seem to be made up from Karim and a suffix. What does Karim suggest? Somewhere Islamic?

LAS’ contention is that in an increasingly global world you need to understand a name’s context before you can start to do any matching. For example, in the United States it is typical for Americans of European descent to have a first name, a middle name and a surname.

However, this is not the case in most of the rest of the world. Take the name “Salih, Hajj Abdul Rahman”: if you didn’t know better you would do name matching against all four of these elements. However, Hajj is a title, and Abdul simply means “servant of” and is followed by one of the 99 names of God. So Hajj and Abdul (and their variants, since they are spelled in multiple ways across multiple countries) are no use for name matching; the analysis should be done against “Salih” and “Rahman”.
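To make that concrete, here is a minimal sketch in Python of stripping titles and particles before matching. It is purely illustrative and is not LAS's method: the lookup sets TITLES and PARTICLES and the function significant_tokens are hypothetical names of my own, and a real system would need far larger, culture-specific lists.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
# Real lists would be far larger and would vary by culture and transliteration.
TITLES = {"hajj", "haji", "hadji"}
PARTICLES = {"abdul", "abdel", "abd"}
NON_MATCHING = TITLES | PARTICLES


def significant_tokens(name: str) -> list[str]:
    """Return the name elements worth matching on, in their original order."""
    # Split on anything that is not a letter (commas, spaces, digits, ...).
    tokens = re.findall(r"[^\W\d_]+", name)
    # Drop titles and particles, comparing case-insensitively.
    return [t for t in tokens if t.lower() not in NON_MATCHING]


print(significant_tokens("Salih, Hajj Abdul Rahman"))  # ['Salih', 'Rahman']
```

Note that the variant spellings have to be listed explicitly here, which is exactly the sort of knowledge that a large, curated name database is supposed to supply.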

To take another example, there are many Chinese languages, but they use the same script. As a result, names that sound very different may be written the same way, which means that simply romanising the names will miss actual matches and create false positives. Similarly, using phonetic techniques such as Soundex or Metaphone will also create many false positives.

Incidentally, that’s not the only problem with Soundex, which regularly misses matches and identifies false ones. Arguably, this is not surprising since the technique is over 80 years old – it was patented in 1918 and later used to index US census records – but it has barely moved forward since. Indeed, there have been no patents awarded in this area since then because they have all been deemed derivative, but LAS is now moving things forward with half-a-dozen patents pending.
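For anyone who has not met Soundex, the following is a small sketch of the standard American Soundex rules in Python, written for illustration only. The names CODES and soundex are my own; this is not taken from any particular product. It shows both failure modes: a false match and a missed match.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no code.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ("" if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(c, "")
        if code and code != prev:
            result += code              # skip repeats of the same code
        prev = code                     # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))        # R163 R163 (false match)
print(soundex("Catherine"), soundex("Katherine"))  # C365 K365 (missed match)
```

Because the first letter is kept verbatim, any variation in the initial defeats the match, while quite different names collapse onto the same code.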

Back to the quiz: have you figured out the Dutch connection in the surname? Is it the “Z” at the beginning, or maybe the “baks”, that, to British ears, suggests Boers, which in turn leads back to Dutch origins? In any case, we now have Dutch Muslims; so where would they be from?

To get back to LAS, the company has been around for some 20 years, so it is by no means the new kid on the block. Moreover, it is widely used throughout the US federal government as well as in commercial organisations such as AcXiom. It is also a partner of the Entity Analytic Solutions group (previously SRD) within IBM, which I have written about recently. Its software is also being embedded into both the Group 1 and FirstLogic data quality offerings, though whether this will change now that both of these will be part of Pitney Bowes remains to be seen.

So, got the answer? Indonesia, right? The Dutch East Indies, yes? No, actually. If you were using the LAS database (which has a billion names in it) you would know that Karimbux is a Pakistani name and that the one place where there is a significant mix of Pakistanis and Dutch is actually half the world away in the Dutch West Indies and, specifically, in Surinam.

To conclude, this was a real-life enquiry, and it illustrates the need for contextual understanding in name matching. If this is a significant issue for you then I strongly suggest a close look at what LAS is doing.