This page was archived on 16th April, 2019 and is no longer actively maintained.

Data Preparation (self-service)

This page has been archived and merged; please visit the Data Discovery and Catalogues or Trust in Data pages for new content.

For many years, data preparation tools have been provided to assist with preparing relational data prior to data mining. These have mostly been embedded within data mining products or tightly integrated with such offerings. A class of (often) stand-alone products is now emerging that offers the same sort of capabilities but across all types of data (structured, semi-structured and unstructured) and across all data sources (both internal to the company and external), and which is primarily targeted at business analysts. Such tools are typically referred to as self-service data preparation platforms. A common application would be exploration of a “data lake”, or use in big data environments more generally.

As their name implies, the key ingredient of self-service data preparation platforms is their ability to allow knowledgeable users who are not IT experts to combine, transform and cleanse relevant data prior to analysis: to “prepare” it. Most tools in this category are targeted at business analysts, but there are products aimed more at data scientists. How much expertise users require will depend on the particular offering.

There are various functions you may need to perform in order to get data ready for analysis. You may need to cleanse data, you may need to remove or complete empty fields, you may need to de-duplicate data. You will commonly need to join data from diverse sources, which will mean identifying a join key (some common information, such as an email address) and possibly transforming the data so that it is in a consistent format. You may need to pivot data or de-pivot it, or aggregate data. Self-service data preparation provides all these facilities.
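The functions listed above can be illustrated with a minimal sketch in plain Python. This is not how any particular product works: the sample records, the field names (`email`, `region`, `quarter`, `amount`) and all helper functions are hypothetical, chosen only to show cleansing, completing empty fields, de-duplication, joining on a key and pivoting.

```python
from collections import defaultdict

# Hypothetical sample data: a customer list and an order list from two sources.
crm = [
    {"email": "Ann@Example.com ", "name": "Ann", "region": "EMEA"},
    {"email": "bob@example.com", "name": "Bob", "region": ""},
    {"email": "ann@example.com", "name": "Ann", "region": "EMEA"},
]
orders = [
    {"email": "ann@example.com", "quarter": "Q1", "amount": 120.0},
    {"email": "ann@example.com", "quarter": "Q2", "amount": 80.0},
    {"email": "BOB@example.com", "quarter": "Q1", "amount": 50.0},
]

def normalise_email(value):
    """Transform the join key into a consistent format."""
    return value.strip().lower()

def cleanse(rows, default_region="UNKNOWN"):
    """Normalise the email field and complete empty region fields."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the source data is left untouched
        row["email"] = normalise_email(row["email"])
        if not row["region"]:
            row["region"] = default_region
        cleaned.append(row)
    return cleaned

def deduplicate(rows, key):
    """Keep the first row seen for each value of the key."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())

def join_on_email(left, right, key="email"):
    """Inner join: attach the matching left-hand row to each right-hand row."""
    index = {row[key]: row for row in left}
    joined = []
    for row in right:
        match = index.get(normalise_email(row[key]))
        if match is not None:
            joined.append({**match, **row, key: match[key]})
    return joined

def pivot(rows, index, column, value):
    """Turn distinct `column` values into keys, summing `value` per cell."""
    table = defaultdict(dict)
    for row in rows:
        cell = table[row[index]]
        cell[row[column]] = cell.get(row[column], 0.0) + row[value]
    return dict(table)

customers = deduplicate(cleanse(crm), "email")
prepared = join_on_email(customers, orders)
by_quarter = pivot(prepared, "email", "quarter", "amount")
```

In a self-service platform each of these steps would typically be selected or suggested through the user interface rather than hand-coded; the sketch only makes the underlying operations concrete.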

However, the products in this category do not simply provide these facilities with a friendly user interface: they also help with relevant tasks. For example, if you want to combine two sets of data the software will be able to suggest what fields you should join the data on, and it may recognise that you have missing demographic data for some of your customers and suggest where you could get that data from. Typically such products have both semantic and machine learning capabilities built in. The former helps the software to make recommendations to you, while the latter means that these recommendations improve over time. The machine learning can also draw on how your colleagues are using the platform.
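To give a feel for what a join suggestion might involve, here is a deliberately simple sketch that ranks pairs of fields by how much their values overlap. It is a stand-in heuristic (Jaccard similarity over distinct values), not the semantic or machine learning techniques that the products themselves use; the function name `suggest_join_keys`, the `min_overlap` threshold and the sample data are all assumptions made for illustration.

```python
def suggest_join_keys(left, right, min_overlap=0.5):
    """Rank (left field, right field) pairs by Jaccard overlap of their values."""
    def column_values(rows):
        # Collect the distinct, normalised values seen in each field.
        columns = {}
        for row in rows:
            for name, value in row.items():
                columns.setdefault(name, set()).add(str(value).strip().lower())
        return columns

    left_cols, right_cols = column_values(left), column_values(right)
    suggestions = []
    for lname, lvals in left_cols.items():
        for rname, rvals in right_cols.items():
            union = lvals | rvals
            score = len(lvals & rvals) / len(union) if union else 0.0
            if score >= min_overlap:
                suggestions.append((score, lname, rname))
    # Highest-scoring candidate pairs first.
    return sorted(suggestions, reverse=True)

# Hypothetical data sets whose join fields have different names.
customers = [
    {"email": "ann@example.com", "name": "Ann"},
    {"email": "bob@example.com", "name": "Bob"},
]
orders = [
    {"contact": "ANN@example.com", "amount": 120.0},
    {"contact": "bob@example.com", "amount": 50.0},
    {"contact": "cy@example.com", "amount": 10.0},
]

suggestions = suggest_join_keys(customers, orders)
```

Here the only pair above the threshold is `email` and `contact`, which share two of three distinct values. A real platform would go further, for instance by recognising that both fields contain email addresses even when few values match.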

An additional capability is that data preparation platforms audit what you are doing. Not only does this have compliance implications (IT can see that you are not doing anything you shouldn’t be doing), it also means that if you decide to put a particular data preparation process into formal production (that is, build a formal IT process) then all the logic of the process has already been captured, making IT development efforts much simpler. Finally, the fact that IT can see what data is being analysed, and how, means that IT can get a better handle on the sorts of services it might need to provide to the business.
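The idea that the logic of a process is captured as a by-product of doing the work can be sketched as follows. The `RecordedPipeline` class and its methods are hypothetical and far simpler than a real platform's audit facilities; the point is only that each step is stored with a description, so the same object yields both the prepared data and a trail that IT could review or reimplement.

```python
class RecordedPipeline:
    """Hypothetical pipeline that records every preparation step it applies."""

    def __init__(self):
        self.steps = []  # list of (description, function) pairs

    def add(self, description, func):
        self.steps.append((description, func))
        return self  # allow chained calls

    def run(self, rows):
        # Apply the recorded steps in the order they were added.
        for _, func in self.steps:
            rows = func(rows)
        return rows

    def audit_trail(self):
        # Human-readable record of what was done, in order.
        return [f"{i}. {desc}" for i, (desc, _) in enumerate(self.steps, start=1)]

pipeline = RecordedPipeline()
pipeline.add(
    "lower-case email",
    lambda rows: [{**r, "email": r["email"].lower()} for r in rows],
)
pipeline.add(
    "drop rows without email",
    lambda rows: [r for r in rows if r["email"]],
)

result = pipeline.run([{"email": "Ann@Example.com"}, {"email": ""}])
trail = pipeline.audit_trail()
```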

Self-service data preparation is primarily aimed at business analysts or data scientists in the business units that are going to use it, because it allows them to prepare data and perform ad hoc analyses without being reliant on IT. However, there are also major benefits for IT, not just in being able to monitor what users are doing but also in reducing IT workload. Without platforms of this type, IT will have to use integration tools (whether data integration or data virtualisation), data quality, data governance and other such products, all of which take time and effort.

It is perhaps worth commenting that self-service data preparation platforms are, in effect, data governance platforms: it is just that the governance aspects are hidden from the end users.

This whole market is emerging. At present different vendors are targeting different sectors of the market: in particular business analysts versus data scientists, and data preparation versus data unification (where the emphasis is on merging very many datasets). There are certainly constituencies for all of these but over time we expect the distinctions between products to blur as different vendors extend their current capabilities into other areas.

There are a few major vendors, notably IBM, Informatica and Progress, that are active in this space, while most of the others are smaller and/or start-ups. In this last category we would include Paxata, Trifacta and Tamr. There are also two vendors, Alteryx and Clearstory, that go a step further than data preparation and actually offer some level of analytics as well. However, even these vendors partner with Tableau, Qlik and other business intelligence vendors for interactive functionality and extended visualisation.


  • FreeSight
  • Trifacta
  • Unifi
  • Waterline Data

These organisations are also known to offer solutions:

  • Advizor
  • Alation
  • Alteryx
  • Clearstory
  • Datawatch
  • IBM
  • Informatica
  • Oracle
  • Paxata
  • QlikTech
  • Rocket Software
  • SAP
  • SAS
  • Tamr
  • Trillium Software



Data Catalogues

Data catalogues are hot. Why? Why should you care? What can they do for you?

Managing data lakes: building a business case

This is a companion paper to one we published in 2017. We outline a methodology for building a business case in support of implementing suitable data lake management software.

What’s Hot in Data

In this paper, we have identified the potential significance of a wide range of data-based technologies that impact on the move to a data-driven environment.

Managing Data Lakes

This paper discusses why data lakes need to be managed and the sorts of capabilities that are required to manage them.