I recently wrote about the deficiencies of traditional data quality tools when it comes to data matching, and how the conventional pattern-based approach, with its user-defined rules for weights, simply can't hack it. In this article I want to take that further and consider the difficulties that arise when you get beyond names, addresses and other relatively simple data and start to consider complex data such as products.
Here the problems are the same but magnified. Not only may you have strings of descriptive data, but embedded within them may be technical terms (and symbols), weights and measures, and abbreviations. Often the same product code is used for different parts in different countries. And then there are foreign languages to consider. Not surprisingly, one company I recently spoke to was getting only a 35% match rate for its product data using conventional approaches to data quality. In other words, almost two thirds of matches had to be identified manually.
With figures like that it's perhaps no surprise that most data quality vendors still argue that their biggest competitor is hand coding.
However, that doesn't mean there isn't a solution. That same company today is getting a match rate of better than 85% using Silver Creek's software.
As I discussed in my last article, Silver Creek doesn't use a pattern-based approach to matching but instead focuses on semantics: understanding the meaning of the data rather than its ability to fit a pattern. Doing this makes it easy to see that "mtr" = "motor" = "moteur" and so on. What's more, the software is self-learning, so you don't have to manually define and maintain rules for the weights that determine how probable a match is.
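To make the idea concrete, here is a minimal sketch of semantics-first matching. It is not Silver Creek's actual implementation (and a real system would learn its synonym mappings rather than hard-code them); the synonym table and function names are invented for illustration. The point is that each token is normalised to a canonical concept before comparison, so "mtr", "motor" and "moteur" all compare as equal:

```python
# Hypothetical synonym table mapping surface forms (abbreviations,
# foreign-language terms) to a canonical concept. Invented for
# illustration; a semantic matcher would learn these from data.
SYNONYMS = {
    "mtr": "motor",
    "moteur": "motor",
    "kg": "kilogram",
    "kilo": "kilogram",
}

def canonicalise(description: str) -> set:
    """Map each token in a product description to its canonical concept."""
    tokens = description.lower().replace(",", " ").split()
    return {SYNONYMS.get(t, t) for t in tokens}

def semantic_match(a: str, b: str) -> float:
    """Jaccard similarity over canonical concepts rather than raw strings."""
    ca, cb = canonicalise(a), canonicalise(b)
    return len(ca & cb) / len(ca | cb) if (ca | cb) else 0.0

# "Mtr 5 kg" and "Moteur 5 kilo" share no raw tokens, yet every token
# canonicalises to the same concept set, so they match perfectly.
print(semantic_match("Mtr 5 kg", "Moteur 5 kilo"))  # 1.0
```

A pure string or pattern comparison would score these two descriptions as almost entirely dissimilar; the normalisation step is what recovers the match.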
It isn't, of course, that the traditional vendors in this field are not doing something about product data. The typical approach is to add parsing to their existing offerings so that they can support unstructured (and multi-lingual) text. What this effectively does is use the parsing to recognise patterns within the text, in a similar way to the sorts of products that extract "context" from documents. Certainly, as far as the data quality vendors are concerned, this has improved their capabilities and may even make them reasonably suitable for simple product environments. However, these solutions are still based on pattern recognition and will always require a lot of manual effort (and, therefore, expense). In my opinion, pattern recognition has limited utility when it comes to all but the very simplest matching. Moreover, traditional approaches do not incorporate self-learning.
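For contrast, here is a deliberately simplified caricature of the pattern-based approach: a single regular expression that parses descriptions of the form "name, number, unit". The pattern and examples are my own invention, not taken from any vendor's product, but they show the characteristic brittleness: data that fits the expected shape parses cleanly, while an equally valid description in a different order silently fails and falls through to manual review.

```python
import re

# One hand-written pattern: "<name> <value> <unit>", units limited to a
# fixed list. Every new layout, unit or language needs another rule.
PATTERN = re.compile(
    r"^(?P<name>[a-z ]+)\s+(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>kg|mm|w)$",
    re.IGNORECASE,
)

def parse(description: str):
    """Return the extracted fields, or None when the pattern fails."""
    m = PATTERN.match(description.strip())
    return m.groupdict() if m else None

print(parse("steel bolt 25 mm"))   # fits the pattern: fields extracted
print(parse("25 mm steel bolt"))   # same part, different order: None
```

Each failure like the second one is a record someone has to match by hand, which is exactly where figures like the 35% match rate quoted earlier come from.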
The bottom line is that pattern recognition is simply not good enough. Moreover, the whole probabilistic, statistical, rules-based approach that typifies conventional data quality products is time-consuming, expensive and inadequate. We need a new approach, and Silver Creek's technology exemplifies one.