Data quality, as a discipline, is typically used within three types of environments. The first is within migration projects, whether those are purely for data migration or whether they are embedded within larger application migration projects. In either case, the requirement is to ensure that the data loaded into the new system is of sufficiently high quality to be fit for purpose. This environment has traditionally been regarded as project-based, and data quality itself has therefore been assumed to be project-based: getting the new system up and running was the only thing that was important, and the fact that data typically deteriorates (in the absence of countermeasures) at around 2% per month was neither here nor there.
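To put the 2% figure in context, here is a rough calculation of what it implies once a migration project has finished. It is only a sketch: it assumes that the deterioration compounds, so that each month 2% of the records that are still accurate go stale. The article quotes the monthly rate but does not say how it accumulates, and the function name is purely illustrative.

```python
# Assumption: "2% per month" compounds, i.e. each month 2% of the
# records that are still accurate become inaccurate.
MONTHLY_DECAY = 0.02


def fraction_still_accurate(months: int, monthly_decay: float = MONTHLY_DECAY) -> float:
    """Share of records still accurate after `months` with no countermeasures."""
    return (1.0 - monthly_decay) ** months


def report(periods=(6, 12, 24)) -> None:
    """Print the accurate and degraded shares for a few time horizons."""
    for months in periods:
        accurate = fraction_still_accurate(months)
        print(f"after {months:2d} months: {accurate:.1%} accurate, "
              f"{1 - accurate:.1%} degraded")
```

Calling report() prints the figures for six, twelve and twenty-four months. Under the compounding assumption, a little over a fifth of the records would have degraded a year after go-live.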
The second sort of environment in which data quality procedures have typically been used is alongside traditional ETL tools in conjunction with data warehouse loading. Historically this was a batch process that was repeated on a regular basis. As a result, data quality is usually applied in the initial analysis and then the same processes are applied on an ongoing basis, on the assumption that the errors found in the original data will recur in new batches of data. Unfortunately, that is not necessarily the case.
Both of these first two uses of data quality work on a post-event basis. However, just as it is better to prevent fraud before it happens than to try to detect it afterwards (which means that you don't get your money back), it is better to prevent errors in the data in the first place. This leads me on to the third type of data quality, which is applied at source on an on-going basis, rather than through the sort of project-based approach used in traditional methods. As an example, if you are going to implement master data management (MDM) in order to ensure synchronisation across multiple CRM systems then it is important that all data is consistently of a high quality on an ongoing basis: otherwise you lose part of the value of the whole thing. This can only be done through the use of data quality on a real-time basis.
In addition, it is worth noting that real-time data quality is not simply a question of applying this at the point of entry: it also applies at other stages within the data lifecycle, such as whenever data is being synchronised across systems or shared between systems, databases or applications (for example, in a federated query).
What do you need for this real-time data quality that you don't need for project-based data quality? Well, the first thing is obviously the ability to deploy data quality as and when needed. In today's world this means that you need support for web services within a service-oriented architecture (SOA) so that data quality can be embedded within whatever processes may require use of the technology. Secondly, the data quality software needs to be flexible: you cannot assume that the data quality issues that you meet when you set up the MDM system will be the same issues that arise during live operations. As an example, if you are using MDM for product information management and you start working with a new supplier, then new types of matching may be required because, for example, this supplier is German and provides product details in that language which, perhaps, you have not dealt with before. In effect, you start providing data as a service.
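To make the idea of embedding data quality in a process more concrete, here is a minimal sketch of a cleansing routine exposed as a service that any application can call. It rests on several assumptions: it uses plain HTTP with JSON rather than any particular SOA stack, it relies only on Python's standard library, and the cleanse rule (collapsing stray whitespace) and all of the names are hypothetical stand-ins for whatever a real data quality product would do.

```python
import io
import json
from wsgiref.simple_server import make_server


def cleanse(record: dict) -> dict:
    """Hypothetical cleansing rule: trim and collapse whitespace in text fields."""
    return {key: " ".join(value.split()) if isinstance(value, str) else value
            for key, value in record.items()}


def app(environ, start_response):
    """WSGI endpoint: POST a JSON record and receive the cleansed record."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
        return [b"POST a JSON record"]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        record = json.loads(environ["wsgi.input"].read(length) or b"{}")
    except json.JSONDecodeError:
        record = None
    if not isinstance(record, dict):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"body must be a JSON object"]
    body = json.dumps(cleanse(record)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_in_process(record: dict) -> dict:
    """Exercise the endpoint without opening a network socket."""
    payload = json.dumps(record).encode("utf-8")
    environ = {"REQUEST_METHOD": "POST",
               "CONTENT_LENGTH": str(len(payload)),
               "wsgi.input": io.BytesIO(payload)}
    chunks = app(environ, lambda status, headers: None)
    return json.loads(b"".join(chunks))


def serve(port: int = 8000) -> None:
    """Run the service locally; the port number is an arbitrary choice."""
    with make_server("localhost", port, app) as server:
        server.serve_forever()
```

The point of the design is that the cleansing logic lives in one place and is reached over a standard interface, so a CRM screen, an MDM synchronisation job and a batch load can all call the same rules.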
The third thing you need is for the software to be able to automate data quality processes as much as possible. This is because real-time, on-going data quality is about data cleansing and matching at the point of data entry: in other words, this happens within the user domain rather than within the IT department. Therefore, there must be as little for the user to do as is practicable and, furthermore, what the user does have to do must be as simple as possible.
One company that is specifically targeting the market for ongoing as opposed to project-based data quality is Zoomix. I have written about the company previously and I described it as bringing automation to the data quality market. It does this by applying a data-centric (rather than a rules-centric) approach to data quality. Zoomix then employs a range of linguistic, semantic and statistical algorithms that are combined (automatically) in an optimal manner for different types of data quality decision. In other words, Zoomix can recognise the context of the data that it has to parse (including such things as which language the information is in) as a part of the data quality process, rather than treating all data as inherently the same (which is the case for most traditional products). This is what allows it to optimise the use of its various algorithms.
Further, Zoomix employs what it calls a self-learning dictionary. What this means is that if the software identifies a potential match of, say, 65%, and the user then agrees that this is indeed a match, then Zoomix will update its knowledge repository accordingly. Subsequently, the software will apply this fact automatically the next time that similar conditions occur, without recourse to the user. So, for example, the system will learn that EDT is the same thing as Eau de Toilette and that oz equals ounces and blau is blue. Moreover, when similar, as opposed to identical, situations arise Zoomix is clever enough (it adjusts its internal rules and knowledge) to recognise this fact and to address these situations with relevant sets of rules. As time progresses, the software becomes more and more accurate and automated, and requires less and less human input.
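The feedback loop described here can be illustrated with a toy example. This is emphatically not Zoomix's implementation, whose internals are not described in this article: it is a sketch that assumes a simple character-similarity score from Python's standard difflib module, two arbitrary thresholds, and a hypothetical class that remembers the equivalences a user has confirmed.

```python
from difflib import SequenceMatcher


class LearningMatcher:
    """Toy matcher that remembers user-confirmed equivalences."""

    def __init__(self, auto_accept: float = 0.9, review_floor: float = 0.5):
        self.known = {}                   # learned variant -> canonical form
        self.auto_accept = auto_accept    # at or above this: accept without asking
        self.review_floor = review_floor  # between floor and accept: ask the user

    @staticmethod
    def _key(text: str) -> str:
        """Normalise case and whitespace so lookups are consistent."""
        return " ".join(text.lower().split())

    def score(self, value: str, canonical: str) -> float:
        """Return 1.0 for a learned equivalence, else a similarity ratio."""
        if self.known.get(self._key(value)) == self._key(canonical):
            return 1.0
        return SequenceMatcher(None, self._key(value), self._key(canonical)).ratio()

    def decide(self, value: str, canonical: str) -> str:
        """Classify a candidate pair as 'match', 'ask_user' or 'no_match'."""
        similarity = self.score(value, canonical)
        if similarity >= self.auto_accept:
            return "match"
        if similarity >= self.review_floor:
            return "ask_user"
        return "no_match"

    def confirm(self, value: str, canonical: str) -> None:
        """Record the user's decision so the pair matches automatically in future."""
        self.known[self._key(value)] = self._key(canonical)
```

With these thresholds, decide("blau", "blue") returns "ask_user" the first time. Once the user has called confirm("blau", "blue"), the same pair is matched automatically. What this sketch does not attempt is the generalisation to similar, as opposed to identical, situations that the article attributes to Zoomix.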
Going back to the algorithms used by Zoomix, it is worth commenting further on the product's semantic capabilities, since these are continuously improved through the self-learning capabilities just described. In particular, while other tools may have semantic capabilities that deal with the structure of data, Zoomix extends these so that they relate to the attributes of the entities under consideration (such as volume, where ounces and oz are equivalent, colour and so forth). Ultimately, with its ability to combine these semantic algorithms with statistical and linguistic algorithms, plus its self-learning capabilities that address the dynamic nature of the data, Zoomix should provide improved results in less time.
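As an illustration of what attribute-level semantics means in practice, the following sketch compares two quantities by their meaning rather than by their spelling. It is an assumption-laden toy rather than a description of the product: the synonym table, the regular expression and the function names are all hypothetical, and it handles only a number followed by a unit.

```python
import re

# Hypothetical synonym table mapping unit spellings to one canonical name.
UNIT_SYNONYMS = {
    "oz": "ounce", "ounce": "ounce", "ounces": "ounce",
    "ml": "millilitre", "millilitre": "millilitre", "millilitres": "millilitre",
}

# A number, optional whitespace, a unit word and an optional trailing full stop.
_QUANTITY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\.?\s*$")


def parse_quantity(text: str):
    """Return (amount, canonical_unit), or None if the text is not understood."""
    found = _QUANTITY.match(text)
    if not found:
        return None
    unit = UNIT_SYNONYMS.get(found.group(2).lower())
    if unit is None:
        return None
    return float(found.group(1)), unit


def same_quantity(left: str, right: str) -> bool:
    """True when both values parse to the same amount and canonical unit."""
    parsed_left, parsed_right = parse_quantity(left), parse_quantity(right)
    return parsed_left is not None and parsed_left == parsed_right
```

Here same_quantity("8 oz", "8.0 Ounces") is true even though the two strings share very few characters, while same_quantity("8 oz", "8 ml") is false even though they differ by only two. A purely structural or character-based comparison would rank those two pairs the other way round.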
Another point worth mentioning is that Zoomix is suitable for use with all types of data. Of course, all vendors claim this but it is often not as true as one might like. Many other products in the market were developed specifically for name matching and cleansing in the first instance and they are not very good at matching and cleansing product data, particularly where the data is complex. Zoomix, on the other hand, was developed with all sorts of data in mind but especially complex, highly variable data.
While Zoomix can certainly be used for all sorts of data quality implementations, including project-based environments, the company (which is Israeli-based with offices in the UK and the United States) is focusing on the use of its technology in on-going situations where its software is embedded into relevant business processes and, in particular, it is targeting the MDM market. With MDM increasingly being recognised as a cross-silo environment that spans all types of data (customers, suppliers, locations, products, financial data, contracts and so on) and which requires on-going maintenance of high quality data on a real-time basis, Zoomix is well placed.
Zoomix have made available a freely downloadable PDF version of this article.