Try wrangling your data

Content Copyright © 2014 Bloor. All Rights Reserved.
Also posted on: The IM Blog

The biggest problem with the work that data scientists/miners do is not analysing the data or discovering the patterns that they are looking for. No, the biggest headache they have is preparing the raw data in the first place. They have to select the fields that they want to analyse, ensuring that these are relevant to the business context being investigated, and they will often need to combine fields, cleanse the data, transform it and so on.
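To make that kind of preparation concrete, here is a minimal sketch in plain Python. The field names, rules and data below are invented purely for illustration; real projects use dedicated tooling rather than hand-rolled scripts like this.

```python
import csv
import io

# A raw export with more fields than the analysis needs, inconsistent
# formatting, and missing values (all of this data is invented).
raw = """id,first,last,region,revenue,notes
1,Ada,Lovelace,EMEA,1200,vip
2,alan,turing,,950,
3,Grace,HOPPER,AMER,,new
"""

rows = list(csv.DictReader(io.StringIO(raw)))

prepared = []
for r in rows:
    prepared.append({
        # Combine two fields into one analysable value.
        "name": f"{r['first'].title()} {r['last'].title()}",
        # Cleanse: supply a default for missing values.
        "region": r["region"] or "UNKNOWN",
        # Transform: cast to a numeric type, treating blanks as zero.
        "revenue": float(r["revenue"] or 0),
    })

print(prepared[1])
```

Only the fields relevant to the business question survive, and even this toy version needs selection, combination, cleansing and transformation steps, which is exactly where the time goes.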

If you are not involved in that process it may sound easy. It isn't. Or, actually, it is, but it's very time consuming. The typical estimate is that 80% of the total time taken to analyse data is spent on preparing it.

Now, this issue has been addressed in conventional, structured environments. Data mining products from companies like SAS and IBM (SPSS) have tools and facilities to enable this sort of preparation and substantially reduce the time involved. However, that is not the case when you are working with unstructured or machine-generated data in a Hadoop environment. Which is where Trifacta and its product come in.

Trifacta provides a visual environment for exploring the data. However, more than that, it provides what the company calls Predictive Interaction™. This is a bit like predictive text, but it predicts potential transforms rather than words. Moreover, it gets better over time. In addition, there is a Transform Editor that allows you to do things like splitting and joining columns, inserting default values and so on; there is a language called Wrangle that can be used to build regular expressions; and there is a prebuilt library of functions (around 120 of them) that you can use for things like converting case.
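Wrangle is a domain-specific language and its syntax is not reproduced here, but the kinds of transforms described (splitting and joining columns, regular expressions, case conversion) can be illustrated in generic Python with invented data:

```python
import re

# An invented, messy record of the sort these transforms target.
record = {"full_name": "HOPPER, grace", "phone": "tel: 555-0100"}

# Split one column into two on a delimiter.
last, first = [p.strip() for p in record["full_name"].split(",")]

# Case conversion, a staple of prebuilt function libraries.
first, last = first.title(), last.title()

# A regular expression pulls the digits out of a messy field.
phone = re.search(r"\d{3}-\d{4}", record["phone"]).group()

# Join columns back into a single cleaned field.
clean = {"name": f"{first} {last}", "phone": phone}
print(clean)  # {'name': 'Grace Hopper', 'phone': '555-0100'}
```

The value of a tool like Trifacta is that it predicts and applies steps like these interactively, rather than leaving you to write them by hand.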

In practice, you run a job locally or on Hadoop, uploading a sample of the data that you are working with. Once you have manipulated the sample to your satisfaction, you run the transforms across the complete data set to see whether there are any mismatches or anomalies in the source data that weren't in the sample. This is particularly useful for really large data sets. When mismatches or anomalies are detected, you can explore them iteratively and refine the transformations through the Transform Editor.
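That sample-first workflow can be sketched generically: develop a transform against a sample, then run it over the full set and collect anything that does not conform. This is pure illustration, not Trifacta's API; the data and the `transform` function are invented.

```python
def transform(row):
    """A transform developed interactively against the sample:
    it assumes 'amount' is always a numeric string."""
    return {"amount": float(row["amount"])}

sample = [{"amount": "10.5"}, {"amount": "3"}]
assert all(transform(r) for r in sample)  # looks fine on the sample

# The full data set contains values the sample never showed us.
full = sample + [{"amount": "12"}, {"amount": "N/A"}]

clean, anomalies = [], []
for row in full:
    try:
        clean.append(transform(row))
    except ValueError:
        anomalies.append(row)  # surface these for further wrangling

print(len(clean), len(anomalies))  # 3 1
```

The anomalous rows are exactly what you would then feed back into another round of editing, which is the iterative loop the product supports.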

I’ve seen a demonstration of Trifacta and I have to say that it looks impressive, or perhaps I should say that it seems easy to use, which is the point. In particular, the Predictive Interaction™ genuinely appears to work, which is more than one can say for some predictive text.

I think there’s a real case for implementing Trifacta (or anything else like it, if such a thing exists) as a layer between Hadoop and whatever BI tools you are using, as well as for data scientists. The thing about BI today is that there’s lots of self-service and lots of visualisation, but if the business users don’t understand the data, if it’s not in context, then a lot of that functionality is going to be wasted. Trifacta provides a way for data scientists to put the data into context for business users as well as for themselves.