Data preparation: new or old hat?

Content Copyright © 2016 Bloor. All Rights Reserved.
Also posted on: The IM Blog

Most of the world thinks that data preparation is a new market, and it is. But Experian argues that people have been doing this for a while or, more specifically, that business people have been doing data preparation in a self-service manner – which is the core of the issue – for quite some time, without necessarily knowing that that was what they were doing. Of course, one could argue that business analysts using Excel were doing data preparation, but a) that is hardly an automated process and b) it isn’t repeatable unless you use a tool like Ormetis that can capture Excel workflows. Rather, Experian’s argument is that users of Experian Pandora have been doing data preparation for years. I should further qualify this by saying that the claim is specific to Pandora and would not apply to its competitors.

This argument is worth examining, but perhaps I had better refresh readers’ memories as to Pandora itself. It was originally developed by X88, which was acquired (though Experian doesn’t like the use of that word) in late 2014. Pandora is a data profiling, cleansing, prototyping and transformation tool that is built on top of a proprietary, associative database.

To return to the central question, certainly Pandora has a reputation as a tool that can be very easily installed and used, without extensive IT involvement. And the actual users of the product are often business people rather than IT personnel. So far, so good. By its very nature the product is good at profiling data, cleansing it, transforming it and so on. You can blend data from different sources and the product has workflow capabilities similar to those in Alteryx, though they are not visual in the same way that Alteryx’s workflows are.

In fact, visualisation is one weakness of the product, though this is likely to be overcome shortly: Pandora has JDBC connectivity to a variety of data sources and allows ODBC access for business intelligence and visualisation tools. This means that things like data profiles, monitoring results and even the data itself can be easily visualised.

There are a couple of other areas where Pandora does not have all the capabilities one might like from a data preparation perspective. First, the product’s support for Hadoop is currently limited to connectivity, though Experian is working on developing this further. Second, there is no recommendation engine that will do things like suggest an appropriate join column across tables. Having said that, one of the strengths of Pandora is its ability to recognise all relationships across tables (whether within a single data source or across data sources). So the basic functionality is there; it just needs to be automated a bit more. In fact, one of the potential strengths of Pandora in this area is its associative database: this should make it easy to explore relationships, and we trust that whatever visualisation is added will have graph-style capabilities so that relationships can be explored in this way. That, incidentally, would be something that most other data preparation vendors cannot do.
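To make the idea of a join-column recommendation concrete, here is a minimal sketch in Python. It is an illustration only and says nothing about how Pandora or any other product works internally: the function name `suggest_join_columns`, the overlap score and the sample tables are all assumptions made for the example. The approach is simply to compare the distinct values in every pair of columns across two tables and rank the pairs by how much they overlap.

```python
from itertools import product


def suggest_join_columns(left, right, min_score=0.5):
    """Rank column pairs by how much their distinct values overlap.

    `left` and `right` are hypothetical in-memory tables, each a dict
    mapping a column name to its list of values.
    Score = |A & B| / min(|A|, |B|): the share of the smaller column's
    distinct values that also appear in the other column.
    """
    candidates = []
    for (lcol, lvals), (rcol, rvals) in product(left.items(), right.items()):
        # Ignore missing values when comparing columns.
        a = {v for v in lvals if v is not None}
        b = {v for v in rvals if v is not None}
        if not a or not b:
            continue
        score = len(a & b) / min(len(a), len(b))
        if score >= min_score:
            candidates.append((lcol, rcol, round(score, 2)))
    # Best-scoring pairs first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Made-up sample data for illustration.
customers = {
    "customer_id": [1, 2, 3, 4],
    "name": ["Ann", "Bob", "Cy", "Di"],
}
orders = {
    "order_id": [100, 101, 102],
    "cust_ref": [1, 1, 3],
    "amount": [9.5, 20.0, 7.25],
}

suggestions = suggest_join_columns(customers, orders)
print(suggestions)
```

With this sample data the only pair that clears the threshold is `customer_id` and `cust_ref`. A real recommendation engine would presumably also weigh data types, column names and cardinality, and would need to sample rather than scan large tables.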

So, yes, Pandora can be regarded as a self-service data preparation tool. It has the advantage that it is a mature product, with the sort of features (security, for example) that come with maturity. Is it complete? No. But then there aren’t any complete data preparation tools: this is a market that has yet to mature. Does it have potential? Yes. The company has some work to do, and it is behind some of its competitors with regard to some features of the product. On the other hand, Pandora is not just a data preparation tool: it’s also a platform for data profiling, cleansing, prototyping, migration and so on.

The market is clearly breaking into three sectors: business intelligence suppliers with data preparation, data quality suppliers with data preparation, and pure-play vendors. A recent report from another analyst firm suggested that users prefer the first of these options. However, this may be coloured by three facts: firstly, that there are more analytic vendors in the market than data quality suppliers; secondly, that the analytic companies have been involved in this market for longer; and, thirdly, that users – with the possible exception of Pandora users – are simply not used to the idea of self-service data quality, as opposed to self-service analytics. In any case, Experian is targeting the second category, although Pandora has a Java engine inside it that supports R, so it is not a million miles away from analytics.

Which of these sectors will prevail remains to be seen, but Experian is clearly backing the data quality horse.