Is data profiling enough?

Data profiling tools do two things: they discover facts about data values and they discover relationships between different data items. As I have commented before, they are typically not very good at the latter, at least when those relationships exist across data sources rather than within a single data source.

One thing that data profiling tools don’t do is discover the rules that apply to data values or, at least, not all of those rules. They can tell you whether a value meets a specific database constraint or other rule that is explicit within the database, but they cannot discover value rules that are not explicit. At best you can use data profiling to infer an implicit rule, provided that you know what you are looking for or, at least, that you are looking for something.
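To make "inferring an implicit rule" concrete, here is a minimal sketch of the kind of pattern analysis a profiling tool typically performs. The account codes, the `shape` helper and the idea of treating the dominant pattern as a candidate rule are all assumptions made for illustration; no particular product is implied.

```python
from collections import Counter

def shape(value: str) -> str:
    """Reduce a value to its character pattern: digits become 9, letters become A."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # punctuation is kept as-is
    return "".join(out)

def profile_patterns(values):
    """Count how often each pattern occurs, most common first."""
    return Counter(shape(v) for v in values).most_common()

# Hypothetical account codes; assume the column has no declared format constraint.
account_codes = ["AB-1042", "CD-2210", "EF-0007", "GH-9931", "77120", "JK-4410"]

patterns = profile_patterns(account_codes)
dominant, count = patterns[0]

# Values that do not follow the dominant pattern are candidates for review.
outliers = [v for v in account_codes if shape(v) != dominant]

print(dominant, count, outliers)
```

Note what this does and does not give you: it suggests that codes probably follow one format, but only a person who is looking for a rule will read the result that way, and it says nothing about rules that depend on business logic rather than on the shape of the value.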

This is an issue because many rules that apply to data values are defined within application code and not in the database at all. A simple example would be a credit limit. You can profile and cleanse all you like, but if you don’t understand what the credit limit is supposed to be for a given customer then you can never guarantee the accuracy of your data.
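A small sketch may help show how a rule can live in application code while the database knows nothing about it. Everything here is hypothetical: the `customer` table, the tier names and the caps in `MAX_CREDIT_BY_TIER` are invented for the example, and SQLite is used only because it needs no setup.

```python
import sqlite3

# Hypothetical business rule that exists only in application code.
MAX_CREDIT_BY_TIER = {"standard": 5_000, "gold": 20_000}

def set_credit_limit(conn, customer_id, tier, limit):
    """Application-level check; nothing in the schema enforces the cap."""
    if limit > MAX_CREDIT_BY_TIER[tier]:
        raise ValueError(f"limit {limit} exceeds cap for tier {tier!r}")
    conn.execute(
        "INSERT INTO customer (id, tier, credit_limit) VALUES (?, ?, ?)",
        (customer_id, tier, limit),
    )

conn = sqlite3.connect(":memory:")
# The only explicit constraint is that the limit is not negative.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, tier TEXT NOT NULL, "
    "credit_limit INTEGER NOT NULL CHECK (credit_limit >= 0))"
)

# Written through the application, so the cap is checked.
set_credit_limit(conn, 1, "standard", 4_000)

# A bulk load that bypasses the application: valid for the database,
# wrong for the business.
conn.execute("INSERT INTO customer VALUES (2, 'standard', 50000)")

rows = conn.execute(
    "SELECT id, tier, credit_limit FROM customer ORDER BY id"
).fetchall()

# Only someone who knows the application rule can spot the bad row.
violations = [r for r in rows if r[2] > MAX_CREDIT_BY_TIER[r[1]]]
print(violations)
```

The second row satisfies every constraint the database declares, so a tool that only looks at the database has no reason to flag it.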

So, how do you fill this gap? REVER, a Belgian company, has come up with what it calls program profiling as a complementary capability to data profiling. This software inspects the application (and DDL) source code to look for the data structures and data rules that exist within those applications. It then does the usual sort of thing that you do with data profiling: it compares the actual data with the discovered rules and reports all the exceptions so that you can go and fix them. The data doesn’t have to be in a relational database and could equally well be in something hierarchical, for example.
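The compare-and-report step can be sketched roughly as follows. This is not REVER's implementation, which the article does not describe; the `Rule` type, the rule names, their "source" locations and the sample records are all assumptions. The rules are written by hand here to stand in for whatever a program profiling step would recover from source code, and the records are nested to echo the point about hierarchical data.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    name: str
    source: str                    # where the rule was found (hypothetical)
    check: Callable[[dict], bool]  # True when the record satisfies the rule

# Stand-ins for rules recovered from application code.
RULES = [
    Rule("status-code", "hypothetical: customer maintenance program",
         lambda c: c["status"] in {"A", "S", "C"}),
    Rule("open-orders-within-limit", "hypothetical: order entry program",
         lambda c: sum(o["amount"] for o in c["orders"]) <= c["credit_limit"]),
]

# Hierarchical records: each customer owns a list of orders.
customers = [
    {"id": 1, "status": "A", "credit_limit": 4000,
     "orders": [{"no": 10, "amount": 1500}, {"no": 11, "amount": 2000}]},
    {"id": 2, "status": "X", "credit_limit": 1000,
     "orders": [{"no": 12, "amount": 700}, {"no": 13, "amount": 900}]},
]

def find_exceptions(records, rules):
    """Return (record id, rule name) for every rule a record breaks."""
    return [
        (rec["id"], rule.name)
        for rec in records
        for rule in rules
        if not rule.check(rec)
    ]

exceptions = find_exceptions(customers, RULES)
for rec_id, rule_name in exceptions:
    print(f"customer {rec_id} violates {rule_name}")
```

Keeping the source location alongside each rule is a design choice for the sketch: when an exception is reported, whoever fixes it can see which piece of code the rule came from.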

It sounds simple. It sounds obvious. Why has no one done it before? I guess because it’s been hard enough to sell the idea of data quality to the business and, even more so, data profiling. However, in many ways program profiling should be an easier sell than data profiling. The problem with data profiling is that too many people think they can do it manually, which they can’t. Well, they can, but it’s incredibly boring so they don’t or, at least, not enough. However, I don’t think anyone would believe that they could do program profiling manually. So, once you’ve bought into the idea of program profiling there should be no further obstacles to going ahead with it.

Of course, it may sound simple and obvious but there’s some clever technology under the hood, though you probably don’t care about that. The fact that it can understand complex things like “redefines” in COBOL is important from a technological perspective but all that really matters is that the results are comprehensive and easy to work with. Which they are.

I like it. It seems like a natural complement that fills a gap data profiling leaves open. It should be applicable to pretty much all data quality environments, but especially to consolidations and migrations, and particularly to legacy migrations (because of the support for non-relational environments).