Open Source Data Quality

While there are a number of open source ETL (extract, transform and load) vendors, I had not encountered an open source data quality solution until I recently spoke with Infosolve Technologies. However, Infosolve is not your typical open source vendor.

Infosolve in fact has two products: OpenDQ and OpenCDI (data quality and customer data integration respectively), where the latter leverages the former. So, how does Infosolve differ from other open source vendors?

The biggest difference between Infosolve and the rest of the open source community is that Infosolve does not believe that you can make any money by simply having a download site and then trying to sell support or services on the back of that download. Infosolve believes that you need to do the complete reverse: go out and sell your professional services, in this case for data quality, through a direct sales force, and then implement your solution for the customer on a “free” open source platform. In other words, as I have remarked before, Infosolve is simply using open source as a different licensing model. Typical service engagements range from three weeks to nine months, though the company informs me that it hopes shortly to sign a two-year engagement.

In addition to its own direct sales force, Infosolve is also exploiting the channel: partnering with systems integrators and sub-licensing OpenDQ to other open source (and, for that matter, non-open source) vendors and ISVs.

Staying with the open source discussion, Infosolve is a Sun partner: OpenDQ runs on Sun grid technology and, in particular, is available via Sun’s utility computing offering. This means that you can have OpenDQ hosted for you using a utility-based approach that is charged by the hour, at a low hourly rate. Infosolve refers to this open source, utility-based model as a “zero-based data solution”.

This means that, apart from the initial professional services engagement (to determine and set up appropriate data quality business rules, for example) and any ongoing service fees, you will have more or less zero costs for the whole project. Strictly speaking the cost is more than zero but, at the hourly utility rate, not much more. You can of course run the OpenDQ software on your own hardware should you prefer to do that.

On the technical side, OpenDQ is tightly integrated with Pentaho’s data integration (formerly KETTLE) product, but perhaps most interesting is the fact that the company will shortly be introducing support for unstructured data. This is important for data other than names and addresses, such as product data, where information about products often comes into the organisation in an unstructured format. The company will be using natural language processing to support unstructured data, which is probably the best approach to take.

The introduction of unstructured support is interesting, not just because it is clear that product data quality is becoming more of an issue but also because it suggests (and I want to make it clear here that this is my own inference) that Infosolve may introduce an OpenPIM (product information management) product to go alongside its OpenCDI offering. That, of course, raises the whole question of open source MDM (master data management). While that is a discussion for another day, I see no reason why Infosolve shouldn’t be as successful with MDM as it is with data quality.