The problem with data quality solutions part 3

So far in this series of articles I have discussed the failures of traditional data quality tools when it comes to matching in general, and product and complex data matching in particular. However, these aren’t the only areas in which they fall down: they are not very good at dealing with names either (which makes one wonder what they are good at).

Suppose you are Chinese and you go to live in America. Do you keep your Chinese name? Do you anglicise it? If so, how? Do you reverse your names so that your forename goes first? Now consider a data quality solution trying to match in these circumstances. Or think about criminals with 30 different aliases: how do you match these names?
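To make the difficulty concrete, here is a minimal sketch in Python using only the standard library. The names are made up for illustration, and the token-sorting trick is a generic technique rather than the method used by any product mentioned in this article. It shows that ignoring word order copes with reversed names but does nothing for anglicised forenames or aliases.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case, replace punctuation with spaces, and sort the tokens
    so that word order no longer matters."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(sorted(cleaned.split()))


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def order_insensitive_similarity(a: str, b: str) -> float:
    """Similarity after normalising both names (hypothetical helper)."""
    return similarity(normalise(a), normalise(b))


if __name__ == "__main__":
    # Hypothetical records for the same person.
    pairs = [
        ("Zhang Wei", "Wei Zhang"),      # family name first vs. forename first
        ("Zhang, Wei", "Wei Zhang"),     # punctuation plus reordering
        ("Zhang Wei", "William Zhang"),  # anglicised forename
    ]
    for left, right in pairs:
        naive = similarity(left.lower(), right.lower())
        sorted_tokens = order_insensitive_similarity(left, right)
        print(f"{left!r} vs {right!r}: naive={naive:.2f}, token-sorted={sorted_tokens:.2f}")
```

The first two pairs become exact matches once the tokens are sorted, but the anglicised name still scores below a perfect match, and an unrelated alias would score lower still. Handling those cases needs knowledge about names themselves, not just string manipulation.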

Fortunately, the data quality fraternity (or some of it) has owned up to this gap in its capabilities. Thus IBM bought LAS (now Global Name Recognition) and Informatica more recently acquired Identity Systems, though the other vendors in the market remain out in the cold in this regard.

However, if you have read the previous articles in this series you will know that a lack of ability with names is the least of my concerns about data quality tools: my real worry is that all the leading products have been built using out-of-date technology that has now been superseded.

In the second article I mentioned Silver Creek, which uses a semantic approach, and Netrics is another alternative to the traditional tools. Both of these products feature self-learning capabilities (as does Zoomix, recently acquired by Microsoft) that improve the efficiency of the match process over time while reducing the amount of human involvement that is required.

It is not that these products are new (Silver Creek has been around for a number of years and Netrics has 150-odd customers), but I have now got to the point where I think we need the existing market to be radically disrupted. Current products are being incrementally improved but incremental improvements are not enough: we need dramatic improvements. Otherwise, most companies will continue to rely on (ineffective) manual efforts for data cleansing because they can’t see the cost benefits of moving to inadequate pattern-based matching products, and I am not sure I can blame them.

If data quality is the huge issue we all say it is, and it is, then we owe it to users to provide them with technologies that actually help them to resolve those problems rather than just a sop, which is what they are getting in most cases. Leading vendors need to recognise that the likes of Netrics and Silver Creek offer way better technology than they do, and they need to buy or build comparable capability as soon as possible if they are not to continue disappointing the market.