Data Preparation (self-service)
Analyst Coverage: Philip Howard
For many years, data preparation tools have been available to help prepare relational data prior to data mining. These have mostly been embedded within data mining products or tightly integrated with such offerings. A class of (often) stand-alone products is now emerging that offers the same sort of capabilities but across all types of data (structured, semi-structured and unstructured) and across all data sources (both internal to the company and external), and which is primarily targeted at business analysts. Such tools are typically referred to as self-service data preparation platforms. A common application would be exploration of a “data lake”, or use in big data environments more generally.
As their name implies, the key ingredient of data preparation platforms is their ability to provide self-service capabilities that allow knowledgeable users who are not IT experts to combine, transform and cleanse relevant data prior to analysis: to “prepare” it. Most tools in this category are targeted at business analysts, but there are products aimed more at data scientists. How much expertise users need will depend on the particular offering.
There are various functions you may need to perform in order to get data ready for analysis. You may need to cleanse data, you may need to remove or complete empty fields, you may need to de-duplicate data. You will commonly need to join data from diverse sources, which will mean identifying a join key (some common information, such as an email address) and possibly transforming the data so that it is in a consistent format. You may need to pivot data or de-pivot it, or aggregate data. Self-service data preparation provides all these facilities.
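To make these functions concrete, the following is a minimal sketch in plain Python of what such a preparation flow does by hand. It is not how any particular product works: the sample records, the field names (email, region, amount) and the helper functions are all hypothetical, chosen only to illustrate cleansing, completing empty fields, de-duplicating, joining on a key and aggregating.

```python
from collections import defaultdict

# Hypothetical sample data: two sources that share an email address as join key.
crm = [
    {"email": " Ann@Example.com ", "region": "North"},
    {"email": "bob@example.com", "region": ""},
    {"email": "ann@example.com", "region": "North"},
]
orders = [
    {"email": "ANN@example.com", "amount": 120.0},
    {"email": "bob@example.com", "amount": 40.0},
    {"email": "ann@example.com", "amount": 30.0},
]

def normalise_email(value):
    """Transform the join key into a consistent format."""
    return value.strip().lower()

def prepare_customers(rows, default_region="Unknown"):
    """Cleanse, complete empty fields and de-duplicate on the join key."""
    seen = {}
    for row in rows:
        email = normalise_email(row["email"])
        region = row["region"].strip() or default_region
        # Keep the first record seen for each email (a simple de-duplication rule).
        seen.setdefault(email, {"email": email, "region": region})
    return list(seen.values())

def join_and_aggregate(customers, order_rows):
    """Join orders to customers on email, then total the amounts per region."""
    region_by_email = {c["email"]: c["region"] for c in customers}
    totals = defaultdict(float)
    for order in order_rows:
        email = normalise_email(order["email"])
        if email in region_by_email:  # inner join: skip unmatched orders
            totals[region_by_email[email]] += order["amount"]
    return dict(totals)

customers = prepare_customers(crm)
totals_by_region = join_and_aggregate(customers, orders)
```

A self-service platform would typically expose each of these steps through its user interface rather than through code; the sketch only shows the kind of logic involved.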
However, the products in this category do not simply provide these facilities with a friendly user interface; they will also help with relevant tasks. For example, if you want to combine two sets of data, the software will be able to suggest which fields you should join the data on, and it may recognise that you have missing demographic data for some of your customers and suggest where you could get that data from. Typically, such products have both semantics and machine learning capabilities built in. The former helps the software to make recommendations to you, while the latter means that these recommendations improve over time. Machine learning can also make use of knowledge about how your colleagues are using the platform.
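As an illustration of the idea of suggesting join fields, the sketch below ranks pairs of columns by how much their values overlap (Jaccard similarity). This is a deliberately naive heuristic and an assumption on our part: commercial platforms rely on semantics and machine learning, and nothing here reflects a specific vendor's method. The function name and the sample data are hypothetical.

```python
def suggest_join_keys(left_rows, right_rows):
    """Rank column pairs by the overlap (Jaccard similarity) of their values."""
    def column_values(rows):
        # Collect the distinct, normalised values found in each column.
        columns = {}
        for row in rows:
            for name, value in row.items():
                columns.setdefault(name, set()).add(str(value).strip().lower())
        return columns

    left_cols = column_values(left_rows)
    right_cols = column_values(right_rows)
    suggestions = []
    for left_name, left_vals in left_cols.items():
        for right_name, right_vals in right_cols.items():
            union = left_vals | right_vals
            score = len(left_vals & right_vals) / len(union) if union else 0.0
            if score > 0:
                suggestions.append((score, left_name, right_name))
    # Highest-scoring candidate pairs first.
    return sorted(suggestions, reverse=True)

# Hypothetical sample data with differently named key columns.
customers = [
    {"customer_email": "ann@example.com", "city": "Leeds"},
    {"customer_email": "bob@example.com", "city": "York"},
]
orders = [
    {"contact": "Ann@Example.com", "amount": 120},
    {"contact": "bob@example.com", "amount": 40},
]
best = suggest_join_keys(customers, orders)[0]
```

Here the only pair of columns with any values in common is customer_email and contact, so that pair is the top suggestion even though the column names differ.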
An additional capability is that data preparation platforms audit what you are doing. Not only does this have compliance implications (IT can see that you are not doing anything you shouldn’t be doing), it also means that if you decide to put a particular data preparation process into formal production (that is, build a formal IT process) then all the logic of the process has already been captured, making IT development efforts much simpler. Finally, the fact that IT can see what data is being analysed, and how, means that IT can get a better handle on the sorts of services it might need to provide to the business.
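The value of capturing the logic can be sketched as follows. This is a hypothetical illustration only: the class AuditedPreparation, its methods and the step functions are invented for the example and do not correspond to any product's API. Each step is recorded as it is applied, so the same sequence can later be replayed on new data.

```python
class AuditedPreparation:
    """Applies named steps to rows and records each one in an audit log."""

    def __init__(self, user):
        self.user = user
        self.steps = []      # (name, function) pairs, in the order applied
        self.audit_log = []  # one entry per applied step

    def apply(self, name, func, rows):
        result = func(rows)
        self.steps.append((name, func))
        self.audit_log.append(
            {"user": self.user, "step": name,
             "rows_in": len(rows), "rows_out": len(result)}
        )
        return result

    def replay(self, rows):
        """Re-run the captured logic on new data, as a production job might."""
        for _, func in self.steps:
            rows = func(rows)
        return rows

def drop_missing_email(rows):
    return [r for r in rows if r.get("email")]

def lowercase_email(rows):
    return [{**r, "email": r["email"].lower()} for r in rows]

prep = AuditedPreparation(user="analyst_1")
data = [{"email": "Ann@Example.com"}, {"email": ""}]
data = prep.apply("drop_missing_email", drop_missing_email, data)
data = prep.apply("lowercase_email", lowercase_email, data)

# The recorded steps can be replayed on a different dataset.
replayed = prep.replay([{"email": "BOB@example.com"}, {"email": None}])
```

The audit log gives IT a record of who did what to how many rows, while the stored steps are the starting point for a formal production process.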
Self-service data preparation is primarily aimed at business analysts or data scientists in the business units that are going to use it, because it allows them to prepare and perform ad hoc analyses without being reliant on IT. However, there are also major benefits for IT, not just in being able to monitor what users are doing but also in reducing IT workload. Without platforms of this type, IT will have to use integration tools (whether data integration or data virtualisation), data quality, data governance and other such products, all of which take time and effort.
This whole market is emerging. At present different vendors are targeting different sectors of the market: in particular business analysts versus data scientists, and data preparation versus data unification (where the emphasis is on merging very many datasets). There are certainly constituencies for all of these but over time we expect the distinctions between products to blur as different vendors extend their current capabilities into other areas.
There are a few major vendors, notably IBM, Informatica and Progress, that are active in this space, while most of the others are smaller and/or start-ups. In this last category we would include Paxata, Trifacta and Tamr. There are also two vendors, Alteryx and Clearstory, that go a step further than data preparation and actually offer some level of analytics as well. However, even these vendors partner with Tableau, Qlik and other business intelligence vendors for interactive functionality and extended visualisation.