Taming the data preparation beast

Content Copyright © 2014 Bloor. All Rights Reserved.
Also posted on: The IM Blog

It is generally reckoned that the identification and preparation of data by data scientists takes up around 80% of the time required for analysis. This figure is much the same as it was when data mining first came to the fore twenty years ago. It did improve for a time, as vendors of data mining tools built relevant capabilities into their product lines, but it has deteriorated again now that we are dealing with big data in all its variety and volume, as opposed to nicely organised relational data.

As a result, there is an emerging market for data preparation platforms that make it easier and faster to conduct preparation as a part of discovery processes. For example, Actian is targeting this market with what used to be Pervasive DataRush, and I have recently written about the intentions of both Trifacta and Informatica to move in this direction. However, perhaps the most interesting innovation in this area is coming from Tamr (formerly Data-Tamer), whose eponymous product has just been announced.

Before I get into what Tamr does I have to mention that the company is the latest venture by Mike Stonebraker and Andy Palmer, who have previously worked together to found both (HP) Vertica and VoltDB. Mike (a professor at MIT) was also the original developer (along with Eugene Wong) of Ingres and subsequently PostgreSQL (in both cases, while he was at Berkeley) as well as StreamBase (now part of TIBCO). In other words, this new company has a distinguished provenance, has already raised significant capital and, if history is anything to go by, will undoubtedly be successful even before we consider the merits of the product offering itself.

Tamr is the result of four years of development (two at MIT and two commercially, working with beta customers) and has been designed as an enterprise-wide data preparation platform. The company calls it a data curation platform. If you want to get into the detail of what Tamr is about, I suggest you read the research paper by Stonebraker et al, which you can download from http://www.tamr.com/about-us/ (scroll down), but I’ll try to summarise the main points.

The first key element of Tamr is a set of machine learning algorithms, combined with expert sourcing, that can automate matching processes both within and across data sources while using targeted human insight to improve precision. This will be useful for things such as database consolidations, master data management and post-M&A activity as well as big data discovery. As their name suggests, these algorithms get better over time and can make predictions when you add a new source, thereby reducing the time spent on manual (workflow-based) review.
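To make the idea of combining automated matching with targeted human review a little more concrete, here is a minimal sketch in Python. It is not Tamr's algorithm: I have swapped in a simple string similarity score from the standard library as a stand-in for a learned model, and the thresholds, the function names (similarity, triage) and the sample company names are all assumptions made up for illustration. The point is only the shape of the workflow: confident pairs are decided automatically and the uncertain middle band is routed to an expert.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen for illustration only.
AUTO_ACCEPT = 0.90  # at or above this score, treat the pair as a match
AUTO_REJECT = 0.50  # below this score, treat the pair as different


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1].

    This is a stand-in for a trained matching model, not machine learning.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs):
    """Sort candidate record pairs into automatic decisions and expert review."""
    decisions = {"match": [], "no_match": [], "ask_expert": []}
    for left, right in pairs:
        score = similarity(left, right)
        if score >= AUTO_ACCEPT:
            decisions["match"].append((left, right))
        elif score < AUTO_REJECT:
            decisions["no_match"].append((left, right))
        else:
            # Uncertain band: this is where targeted human insight is requested.
            decisions["ask_expert"].append((left, right))
    return decisions


if __name__ == "__main__":
    # Made-up candidate pairs drawn from two imaginary sources.
    candidates = [
        ("Acme Corp", "ACME Corp"),
        ("Acme Corporation", "Acme Corp"),
        ("Acme Corp", "Zyx Ltd"),
    ]
    for outcome, items in triage(candidates).items():
        print(outcome, items)
```

In a system of the kind described above, the experts' answers would presumably be fed back as training examples so that the uncertain band shrinks over time; that feedback loop is omitted here to keep the sketch short.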

Secondly, the product is based on a triple store, which currently resides on top of a relational database, though in due course we can expect an HDFS version to be made available. The triple store holds a record of how all the data connects (effectively creating a schema for related data: customers, for example), so you should be able to explore such connections using any relevant BI tool that supports graphs. BI will also be useful for exploring which entities are most significant and which attributes are the most valuable, both of which will help chief data officers prioritise relevant analytic projects.
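For readers who have not come across triple stores, the following Python sketch shows the general idea: every fact is stored as a (subject, predicate, object) triple, and you explore connections by querying with some positions left open. This is an in-memory toy, not how Tamr is implemented, and the class names, the maps_to and has_attribute predicates and the source column names are all hypothetical.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """A toy in-memory triple store with wildcard pattern queries."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def query(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list:
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t
            for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


if __name__ == "__main__":
    store = TripleStore()
    # Hypothetical source columns mapped to a unified customer attribute.
    store.add("crm.cust_name", "maps_to", "customer.name")
    store.add("billing.client", "maps_to", "customer.name")
    store.add("customer", "has_attribute", "customer.name")

    # Which source columns feed the unified customer name?
    for triple in store.query(predicate="maps_to", obj="customer.name"):
        print(triple.subject)
```

Because each triple is an edge between two nodes, the collection as a whole forms a graph, which is why a graph-capable BI tool is a natural way to explore it.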

Thirdly, there is the question of how Tamr works with unstructured data. To put it simply, it relies on partner systems to tag the data so that it effectively becomes semi-structured, which makes it much easier to work with.

One final point is that Tamr is not necessarily competitive with other products in this space. For example, you could have Tamr and Trifacta working together, with Tamr predicting (suggesting) the transforms that a user of Data Wrangler might deploy.

To conclude, what sets Tamr apart from other companies moving into this space is the triple store that underpins it and, even more importantly, the machine learning algorithms combined with expert sourcing that will help to automate the data curation process at scale. Given its founders’ history, asking if the product will be successful is a bit like a 1960s pundit wondering if the next Beatles single will be a hit: it looks like a sure thing.