Having set out my stall with regard to conventional data cleansing and matching products (a report card that would read "tries hard, could do better") I now want to turn my attention to data profiling. These sorts of tools do two things. First, they analyse the data for errors and, second they analyse the data to discover the relationships that exist between different data elements.
In so far as the analysis of data for errors (null fields, default fields, data not matching the defined datatype and so on) is concerned, most products do a pretty good job and, from a functional point of view (that is, not including considerations such as ease of use) are much of a muchness. On the other hand, where I think they may be a significant difference between products is in their ability to discover relationships. However, I will not comment on this now as I am conducting research into this issue and plan to publish a detailed report in the New Year.
Where I will comment is on the importance of understanding relationships not just between data elements but also between data and applications and even between data and the business.
Understanding data relationships is arguably the most important factor whenever you are moving and transforming data, especially in data migration and data archiving environments but also for moving data into a warehouse and similar applications. The reason for this is easy to see if you consider the effects of a broken relationship: orphaned customers with no orders aren't going to make your application software work well. It would be nice if you could see these relationships (which effectively comprise business entities) represented graphically.
Unfortunately, discovering relationships is not just about profiling your database. There may be relationships that exist across data sources (and types of data source) that you need to understand; and then there is the application factor. While it may not be theoretically correct from a purist data management perspective the fact is that many data relationships are defined within applications so, in one way or another, you really need to discover these.
Moreover, data-application relationships don't just have implications when you are moving data. For example, it is not hard to imagine a scenario where valid data for one application is invalid for another. A typical case might be where one application required a five digit numeric field and another application requires the same five numbers plus an additional two alphabetic characters. So, here's a question for data quality vendors: can your software tell the difference? It would also be nice to know how often different applications address the same data because then you would know how important the issue was.
Data relationships don't end with data and applications. It would be useful to understand, for example, who owns the data and who is a stakeholder. If you know what applications access data you may be able to infer the latter but does customer service history, for example, belong with CRM data or is it owned by the service department? You probably can't determine this automatically but you would certainly like to record it for data governance purposes.
And finally, relationships are not just important for IT but also for the business. As we have noted, a set of relationships effectively defines a business entity and these are the domain of data stewards and business analysts as much as they are of IT, so these should be capable of being presented in a business-friendly manner.
Now, I know of tools that do every one of things just outlined. Unfortunately, they are provided by disparate vendors with a variety of products, a number of which would not even be categorised as data profiling tools. During my current research I am hoping to hear that suppliers are expanding their capabilities into the areas just discussed but I expect I am destined to be disappointed, in which case a multi-tool approach may be required: I'll let you know.