The problem with data quality solutions, part 5

Finally (I think), there is one more problem with standard data quality solutions and, specifically, with most data profiling products. I have previously noted that some data cleansing products offer less automation than they might, for example when it comes to improving match capabilities over time, unless self-learning capabilities are built into the product. Similar considerations apply to data profiling.

There are three particular instances worth discussing. The first is the matching of primary keys to foreign keys, which underpins the relationship discovery I discussed in my last article. What most tools do is identify all potential primary keys and then let you direct the tool to test potential target tables by looking for matching foreign keys. As you can imagine, this is a tedious and highly manual task, especially if you have a lot of tables to go through. However, it is entirely possible to have an engine that automatically compares all possible matches and proposes the best primary-foreign key pairs for you. Of course, this doesn’t eliminate the need for manual oversight, but it does make the task substantially easier and faster.
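
To make the idea concrete, here is a minimal sketch of what "compare all possible matches and propose the best pairs" might look like. It is not how any particular product works: the table layout (a dictionary of column names to value lists), the function names and the `min_coverage` threshold are all assumptions made for illustration. A column is treated as a candidate primary key if its values are unique and non-null, and a candidate foreign key is scored by the share of its distinct values that appear in that primary key.

```python
def candidate_primary_keys(tables):
    """Yield (table, column) pairs whose values are all present and unique."""
    for table_name, columns in tables.items():
        for column_name, values in columns.items():
            if values and None not in values and len(set(values)) == len(values):
                yield table_name, column_name


def propose_key_pairs(tables, min_coverage=0.9):
    """Score every candidate primary key against every column in the other tables.

    Coverage is the fraction of distinct foreign-key values found in the
    primary key. The 0.9 default is an arbitrary, illustrative threshold.
    """
    proposals = []
    for pk_table, pk_column in candidate_primary_keys(tables):
        pk_values = set(tables[pk_table][pk_column])
        for fk_table, columns in tables.items():
            if fk_table == pk_table:
                continue
            for fk_column, values in columns.items():
                fk_values = {v for v in values if v is not None}
                if not fk_values:
                    continue
                coverage = len(fk_values & pk_values) / len(fk_values)
                if coverage >= min_coverage:
                    proposals.append(
                        (coverage, f"{pk_table}.{pk_column}", f"{fk_table}.{fk_column}")
                    )
    # Best-covered pairs first, ready for a human to review.
    return sorted(proposals, reverse=True)


# Hypothetical sample data, invented for this sketch.
tables = {
    "customers": {"id": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cy", "Di"]},
    "orders": {
        "order_id": [10, 11, 12, 13, 14],
        "customer_id": [1, 1, 3, 4, 4],
        "amount": [5.5, 7.5, 5.5, 9.5, 2.5],
    },
}
key_pairs = propose_key_pairs(tables)
```

With this sample data the only proposal is `customers.id` paired with `orders.customer_id`. The output is a ranked list of suggestions rather than a verdict, which is where the manual oversight comes in.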

Secondly, one of the most useful features of data profiling tools is the ability to discover potentially redundant columns, either within a database or across databases. When you are looking across multiple data sources, automating redundancy analysis is particularly valuable because you may need to examine hundreds or thousands of columns to find the overlapping data. Unfortunately, most tools only allow you to select one column at a time and then direct the tool to look in specific tables or columns to check for overlaps. What do you do when you have hundreds of tables and thousands of columns (and this is a frequent issue) across perhaps dozens of data sources? This process needs to be completely automated and, again, this kind of automation is perfectly feasible given today’s desktop computing power.
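
As a rough illustration of automated redundancy analysis, the sketch below compares every column with every other column instead of asking you to pick one at a time. The column naming scheme (`source.table.column`), the overlap measure and the `min_overlap` threshold are assumptions for the purposes of the example, and a brute-force pairwise comparison like this would need sampling or hashing to cope with thousands of large columns.

```python
from itertools import combinations


def find_overlapping_columns(columns, min_overlap=0.8):
    """Compare every pair of columns and report those sharing most of their values.

    Overlap is measured as shared distinct values divided by the size of the
    smaller column, so a column fully contained in a larger one scores 1.0.
    """
    distinct = {
        name: {v for v in values if v is not None}
        for name, values in columns.items()
    }
    results = []
    for (name_a, values_a), (name_b, values_b) in combinations(distinct.items(), 2):
        if not values_a or not values_b:
            continue
        overlap = len(values_a & values_b) / min(len(values_a), len(values_b))
        if overlap >= min_overlap:
            results.append((round(overlap, 2), name_a, name_b))
    return sorted(results, reverse=True)


# Hypothetical columns drawn from two made-up data sources.
columns = {
    "crm.customers.email": ["a@x.com", "b@x.com", "c@x.com"],
    "billing.accounts.contact_email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "billing.accounts.country": ["UK", "US", "UK", "DE"],
}
overlaps = find_overlapping_columns(columns)
```

Here the two email columns are flagged as potentially redundant and the country column is not. Dividing by the smaller column is a design choice: it catches the case where one source holds a subset of another, which a symmetric measure would understate.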

Finally, there is the question of discovering matching keys that allow you to align the rows of multiple data sources with each other, in order to discover any complex transformations that may relate these sources, whether the data is stored in different formats or in completely different data structures. Nothing else is possible until the rows are aligned across systems: only then can you identify any patterns that exist and use those patterns to discover cross-source transformation logic, column by column. Once again, you want to automate as much of the process as possible.
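
The two steps described here, aligning rows and then testing transformations column by column, can be sketched as follows. This is a deliberately simplified illustration: it assumes the matching key pair has already been found (for instance by an overlap analysis), and the small list of candidate transformations, like every name in the code, is invented for the example. A real engine would search a far richer space of rules.

```python
def align_rows(source, target, source_key, target_key):
    """Pair each source row with the target row that shares its key value."""
    index = {row[target_key]: row for row in target}
    return [
        (row, index[row[source_key]])
        for row in source
        if row[source_key] in index
    ]


# A tiny, illustrative set of transformations to test against aligned rows.
CANDIDATES = {
    "identity": lambda v: v,
    "upper": lambda v: v.upper() if isinstance(v, str) else None,
    "cents_to_units": lambda v: v / 100 if isinstance(v, (int, float)) else None,
}


def discover_transformations(pairs, source_columns, target_columns):
    """For each target column, find a source column and rule that hold on every aligned row."""
    found = {}
    for target_column in target_columns:
        for source_column in source_columns:
            for rule_name, rule in CANDIDATES.items():
                holds = pairs and all(
                    rule(s[source_column]) == t[target_column] for s, t in pairs
                )
                if holds:
                    # Keep the first rule that fits; later matches are ignored.
                    found.setdefault(target_column, (source_column, rule_name))
    return found


# Hypothetical rows from two systems that store the same customers differently.
source = [
    {"cust_no": 1, "surname": "smith", "balance_cents": 1250},
    {"cust_no": 2, "surname": "jones", "balance_cents": 300},
]
target = [
    {"id": 2, "LAST_NAME": "JONES", "balance": 3.0},
    {"id": 1, "LAST_NAME": "SMITH", "balance": 12.5},
]
pairs = align_rows(source, target, "cust_no", "id")
rules = discover_transformations(
    pairs, ["surname", "balance_cents"], ["LAST_NAME", "balance"]
)
```

On this sample the sketch reports that `LAST_NAME` is the upper-cased `surname` and that `balance` is `balance_cents` divided by 100. Neither rule could be tested at all without the alignment step, since the rows arrive in a different order in each source.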

Now, I know of only one company that does all of these things and, indeed, it would claim that it is the only company that does any of them, at least with the level of automation I’ve mentioned. That company is Exeros.

However, these superior capabilities are not the only reason to be discussing Exeros at this time. The other is that it has recently announced a partnership under which CA and CA’s partners will resell Exeros’ tools (X-Profiler, Discovery, Validator and the soon-to-be-released Unified Schema Builder, which allows prototyping against an empty schema) alongside the CA ERwin Data Modeler product.

This has a couple of interesting ramifications, leaving aside the fact that Exeros is likely to acquire a lot of new users. The first of these is that Exeros software will now be priced much more aggressively than was previously the case, to fit in with CA’s pricing model. However, more interesting than that, from my point of view, is the question of why you would want to combine data profiling with data modelling rather than with data cleansing. The answer is threefold.

Firstly, as I discussed in my last article, relationships are important beyond the realm of pure data quality. Secondly, it should be obvious that you can use discovered relationships to enrich entity-relationship (E-R) diagrams. Thirdly, E-R diagrams provide a way to visualise relationships (and the data or business entities defined by those relationships) that is sadly lacking in most data profiling tools (another problem).

To conclude, there are a number of things that you can do with Exeros that you either can’t do with other tools or can’t do as well. Thus it is the capabilities of Exeros that highlight the deficiencies of its competitors.