There are actually a number of problems with traditional data quality solutions (which I will return to) but in this article I am going to focus on what is arguably the basis of all data quality, which is matching name and address details.
The problem with traditional approaches to matching is that they assume you know what you are looking for: that you have some sort of pattern against which you can match. For example, if you are doing address matching, you expect that the data is presented in a certain format: house number, street, town, county, postcode and so forth, in that order (if you are in the UK—it might be a different order elsewhere). But suppose that it isn't: then the pattern matching breaks down.
Another (and probably bigger) issue is the relative weight you give to different elements of the match process. Assuming you have managed to sort your addresses into a consistent format (which can be a substantial feat) then is having the same street more important than having the same house number? Is street address more important than name? Which elements of the name are more important?
The way that most products work is that they have some element of statistical analysis to which you add business rules that define these weightings, which makes the whole thing what is known as "probabilistic matching". That is, you end up with a certain probability that you have a match. However, the fly in the ointment is those business rules. Some solutions come with pre-built rules, but you still need to apply your own weights. And how do you know what those weights should be? Short answer: guess. Then you can tinker with your guesses and see if your match percentage improves (while ensuring that false positives don't go up).
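To make this concrete, here is a minimal sketch of weighted probabilistic matching. The field names, weights and threshold are illustrative guesses of my own—exactly the kind of numbers the analyst would otherwise have to tune by hand—not any vendor's actual rules:

```python
def field_similarity(a: str, b: str) -> float:
    """Crude similarity: 1.0 on exact match (case/space-insensitive), else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

# Hand-guessed business-rule weights: is surname worth more than street?
WEIGHTS = {"surname": 0.4, "street": 0.3, "house_number": 0.2, "town": 0.1}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities, in [0, 1]."""
    return sum(w * field_similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in WEIGHTS.items())

a = {"surname": "Smith", "street": "High Street", "house_number": "12", "town": "Leeds"}
b = {"surname": "Smith", "street": "High St",     "house_number": "12", "town": "Leeds"}

score = match_score(a, b)   # "High Street" vs "High St" scores 0, so score = 0.7
is_match = score >= 0.5     # the threshold is yet another tunable guess
```

Notice how brittle this is: a routine abbreviation ("St" for "Street") silently costs 0.3 of the score, and whether that sinks the match depends entirely on the guessed weights and threshold.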
Further, you will need different rule sets for different entities. For example, the relative weights (and the information against which you match) may be different for suppliers, customers, patients and so on.
All of which means a lot of manual work, not just to begin with but on an ongoing basis. And this isn't just for building and maintaining the rule sets but also for manually resolving the matches that the software isn't good enough to automate.
The problem isn't the tools; the problem is the technology that underpins the tools. Pattern matching—at least as it is implemented in most products—simply isn't good enough.
One company offering an alternative approach is Netrics. This company's solution is based on two engines. The first of these is the Netrics Matching Engine. This doesn't use conventional pattern matching but instead uses mathematical modelling based on bipartite graphs, an example of which is shown below.
The point here is that it is not bound by any expected pattern: it can match against any subsets of the data that it can see.
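Netrics has not published the details of its engine, but the general idea of matching via a bipartite graph can be sketched as follows: treat the tokens of one record and the tokens of another as the two sides of the graph, weight each edge by a token similarity, and find the pairing that maximises total similarity—regardless of the order the tokens arrive in. The brute-force search and the toy similarity function here are my own illustrative choices:

```python
from itertools import permutations

def token_sim(a: str, b: str) -> float:
    """Toy edge weight: Jaccard overlap of the characters in each token."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def best_bipartite_score(tokens_a: list, tokens_b: list) -> float:
    """Maximum-weight bipartite matching by exhaustive search (fine for
    short records): pair each token on one side with at most one token
    on the other so that the total edge weight is as large as possible."""
    if len(tokens_a) > len(tokens_b):
        tokens_a, tokens_b = tokens_b, tokens_a
    best = 0.0
    for perm in permutations(range(len(tokens_b)), len(tokens_a)):
        best = max(best, sum(token_sim(tokens_a[i], tokens_b[j])
                             for i, j in enumerate(perm)))
    return best / max(len(tokens_a), 1)

# Same address, different field order: positional pattern matching would
# misfire, but the bipartite matching pairs each token with its twin.
a = "12 High Street Leeds".split()
b = "Leeds High Street 12".split()
score = best_bipartite_score(a, b)  # every token finds its counterpart
```

A production engine would use a polynomial-time assignment algorithm rather than enumerating permutations, but the point survives the simplification: no expected field order is assumed anywhere.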
The second engine is the Netrics Decision Engine. This models the way that people make decisions on the data in order to automate the process of linking records without having to build and manage complex rule sets. In particular, this engine is self-learning. That is, its decision-making capability will improve over time: tell it that x matches y and it will remember that, self-adjusting its internal weights so that you don't have to. In other words, the hit rate goes up and the manual effort involved goes down.
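Again, this is not Netrics' published algorithm, but the self-adjusting idea can be illustrated with a perceptron-style update: whenever a human reviewer overrules the engine's verdict on a candidate pair, the per-field weights are nudged in the direction of the correction, so nobody has to guess them up front:

```python
def predict(weights: list, sims: list, threshold: float = 0.5) -> bool:
    """Current verdict: weighted similarity score against a fixed threshold."""
    return sum(w * s for w, s in zip(weights, sims)) >= threshold

def learn(weights: list, sims: list, is_match: bool,
          rate: float = 0.1, threshold: float = 0.5) -> list:
    """Adjust weights only when the prediction disagrees with the human verdict."""
    if predict(weights, sims, threshold) != is_match:
        sign = 1.0 if is_match else -1.0
        weights = [w + sign * rate * s for w, s in zip(weights, sims)]
    return weights

# Fields: (surname, street, house_number); start from a flat, unguessed prior.
weights = [0.33, 0.33, 0.33]
# A reviewer rejects a pair that agreed only on street and house number...
weights = learn(weights, [0.0, 1.0, 1.0], is_match=False)
# ...and confirms a pair that agreed mainly on surname. The weights adapt;
# the analyst never has to tune them by hand.
weights = learn(weights, [1.0, 0.2, 0.0], is_match=True)
```

After these two corrections the surname weight has risen and the street and house-number weights have fallen—the engine has inferred, from feedback alone, the relative importance the analyst would otherwise have had to guess.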
It seems to me that these two aspects of Netrics' solution are the key to next-generation data quality solutions: improved matching with less human involvement or, to put it another way, better results at lower cost.