
Trust in Data


Trusting your data is essential if you are going to make business decisions based on it. Various tools enable that trust, specifically data profiling, data quality and data preparation tools.

Data profiling tools may be used to statistically analyse the content of data sources, to discover where errors exist and to monitor (typically via a dashboard) the current status of errors within a particular data source. They may also be used to discover any relationships that exist within and across data sources (see Data Discovery and Cataloguing). Data quality includes capabilities such as data matching (discovering duplicated records) and data enrichment (adding, say, geocoding or business data from the Internet), as well as data cleansing. Data quality is required for data governance and master data management (MDM). Some data quality products have specific capabilities to support, for example, data stewards and/or facilities such as issue tracking.

Data preparation takes the principles of data profiling and data quality and applies them to data that is typically, but not always, held within a data lake. As the name implies, the key ingredient of data preparation platforms is their self-service capability, which allows knowledgeable users who are not IT experts to profile, combine, transform and cleanse relevant data prior to analysis: to “prepare” it. Tools in this category are targeted at business analysts and/or data scientists and work across all types of data (structured, semi-structured and unstructured) and across all data sources (both internal to the company and external).

One further element of trust in data concerns the training data used to support algorithmic processing, and ensuring that that data is unbiased. This is discussed in Machine Learning & AI.

Data profiling collects statistics, classically on a column-by-column basis: details such as minimum and maximum values, number of times a value appears, number of nulls, invalid datatypes and so on. In other words, it both detects errors and creates profiles – often expressed as histograms – of the data being examined. Relevant tools also typically have the ability to monitor these statistics on an ongoing basis.
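
The kind of column statistics described above can be sketched in a few lines of Python. This is an illustrative simplification, not any vendor's API; the sample data and the assumption that the column should be numeric are invented for the example:

```python
from collections import Counter

def profile_column(values):
    """Profile one column: null count, min/max, invalid datatypes and
    value frequencies (the raw material for a histogram)."""
    nulls = sum(1 for v in values if v is None or v == "")
    non_null = [v for v in values if v not in (None, "")]
    # assumption for this sketch: the column is expected to be numeric
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    invalid_types = len(non_null) - len(numeric)
    freq = Counter(non_null)  # frequency counts, histogram-style
    return {
        "nulls": nulls,
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "invalid_types": invalid_types,
        "top_values": freq.most_common(3),
    }

ages = [34, 41, None, 34, "forty", 29, None]
print(profile_column(ages))
# reports 2 nulls, min 29, max 41, and 1 invalid datatype ("forty")
```

A monitoring dashboard would simply re-run such a profile on a schedule and track how the numbers change over time.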

Data quality products provide tools to perform various automated or semi-automated tasks that ensure that data is as accurate, up-to-date and complete as you need it to be. This may, of course, be different for different types of data: you want your corporate financial figures to be absolutely accurate, but a margin of error is probably acceptable when it comes to mailing lists. Data quality products provide a range of functions. A relevant tool might simply alert you that there is an invalid postal code and then leave you to fix it; or the software, perhaps integrated with a relevant ERP or CRM product, might prevent the entry of an invalid postal code altogether, prompting the user to re-enter that data. Some functions, such as adding a geocode to a location, can be completely automated while others will always require manual intervention. For example, the software can identify potentially duplicate records for you and calculate the probability of a match, but a business user or data steward will be required to actually approve the match.
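
The duplicate-matching step just described can be illustrated with a small Python sketch. Commercial products use far more sophisticated probabilistic matching; here, per-field string similarity from the standard library stands in for it, and the records and thresholds are invented:

```python
from difflib import SequenceMatcher

def match_probability(rec_a, rec_b):
    """Score two records for likely duplication by averaging per-field
    string similarity (a stand-in for a product's probabilistic matching)."""
    scores = [SequenceMatcher(None, a.lower(), b.lower()).ratio()
              for a, b in zip(rec_a, rec_b)]
    return sum(scores) / len(scores)

a = ("John Smith", "12 High Street", "SW1A 1AA")
b = ("Jon Smith", "12 High St", "SW1A 1AA")
score = match_probability(a, b)

# in the middle band, flag for a data steward to approve rather than
# auto-merging or discarding (band boundaries are illustrative)
needs_review = 0.7 <= score < 0.95
print(round(score, 2), needs_review)
```

The important design point is the one made above: the software computes the probability, but the merge decision in the grey zone is routed to a human.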

Data preparation helps to get data – typically in a data lake – ready for analysis. This requires that the data is profiled and cleansed. You will also commonly need to join data from diverse sources, which will mean identifying a join key (via some common information, such as an email address) and transforming the data so that it is in a consistent format. You may also need to pivot or de-pivot data, or aggregate it. Self-service data preparation provides all these facilities.
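
A minimal sketch of those preparation steps, in plain Python: join two sources on an email key, normalise the key to a consistent format, and aggregate. The field names and sample records are invented for illustration:

```python
# two sources to be combined, with inconsistently-cased join keys
crm = [
    {"email": "ann@example.com", "region": "North"},
    {"email": "bob@example.com", "region": "South"},
]
orders = [
    {"email": "ANN@example.com", "amount": 120},
    {"email": "ann@example.com", "amount": 80},
    {"email": "bob@example.com", "amount": 50},
]

# transform: normalise the join key so the two sources line up
by_email = {c["email"].lower(): c for c in crm}

# join + aggregate: total order value per region
totals = {}
for o in orders:
    region = by_email[o["email"].lower()]["region"]
    totals[region] = totals.get(region, 0) + o["amount"]

print(totals)  # {'North': 200, 'South': 50}
```

A self-service tool does exactly this kind of work, but through a visual interface rather than code, and with the lookup failing gracefully when keys do not match.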

However, the products in this category do not simply provide these facilities with a friendly user interface; they will also help with relevant tasks. For example, if you want to combine two sets of data the software will be able to suggest which fields you should join the data on, and it may recognise that you have missing demographic data for some of your customers and suggest where you could get that data from. Typically such products have both semantics and machine learning capabilities built in. The former helps the software to make recommendations to you while the latter means that these recommendations improve over time. Machine learning can also make use of its knowledge of how your colleagues are using the platform.
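
One simple way such a join-key recommendation could work is by measuring how many distinct values each pair of columns shares. This is a deliberately naive sketch of the idea (real products add semantic typing and learned models); the tables, column names and threshold are invented:

```python
def suggest_join_keys(table_a, table_b, min_overlap=0.5):
    """Suggest candidate join columns by the fraction of distinct
    values each column pair has in common."""
    suggestions = []
    for col_a, vals_a in table_a.items():
        for col_b, vals_b in table_b.items():
            sa, sb = set(vals_a), set(vals_b)
            overlap = len(sa & sb) / min(len(sa), len(sb))
            if overlap >= min_overlap:
                suggestions.append((col_a, col_b, round(overlap, 2)))
    # strongest candidates first
    return sorted(suggestions, key=lambda s: -s[2])

customers = {"email": ["ann@x.com", "bob@x.com"], "city": ["Leeds", "York"]}
clicks = {"user_email": ["ann@x.com", "cat@x.com"], "page": ["/home", "/buy"]}
print(suggest_join_keys(customers, clicks))
# only email/user_email overlap, so that pair is suggested
```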

An additional capability is that data preparation platforms audit what you are doing. Not only does this have compliance implications (IT can see that you are not doing anything you shouldn’t be doing), it also means that if you decide to put a particular data preparation process into formal production (that is, build a formal IT process) then all the logic of the process has already been captured, making IT development efforts much simpler. Finally, the fact that IT can see what data is being analysed, and how, means that IT can get a better handle on the sorts of services it might need to provide to the business.

Data profiling and data quality are both essential to good data governance and to any project involving the migration or movement of data, including MDM implementations. Quite simply, they are fundamental to the business being able to trust the data upon which it makes business decisions.

Data profiling can not only establish the scale of data quality problems, but will also help data stewards to monitor the situation on an ongoing basis. Going further, data quality is about ensuring that data is fit for purpose; that it is accurate, timely and complete enough relative to the use to which it is put. As a technology, data quality can be applied either after the fact or in a preventative manner.

Self-service data preparation, on the other hand, is primarily aimed at business analysts or data scientists in the business units that are going to use it, because it allows them to prepare data and perform ad hoc analyses without being reliant on IT. However, there are also major benefits for IT, not just in being able to monitor what users are doing but also in reducing IT workload. It is perhaps worth commenting that self-service data preparation platforms are, in effect, data governance platforms: it is just that the governance aspects are hidden from the end users.

Both data profiling and data quality are mature technologies and the most significant trend here is to implement machine learning within these products, especially for matching where machine learning can help to reduce the number of false positives/negatives. The data discovery component of data profiling has also become significant thanks to compliance requirements such as GDPR. Data preparation tools are also leveraging more machine learning as well as embedding more collaborative features. Products are increasingly being combined either into analytics tools or with data catalogues, or both.

In general, the market is split between those companies that focus solely on providing data profiling and/or data quality and those that also offer ETL (extract, transform and load), MDM (master data management) or both. Some of these “platforms” have been built from the ground up, such as that from SAS, while others consist more of disparate parts that have been loosely bolted together. As far as data preparation is concerned, this market is being targeted by traditional data quality vendors, pure-play data preparation providers and business intelligence companies that have either built or bought (Qlik acquired Podium Data) relevant products. It is difficult to see how many of the pure-play suppliers can survive in the longer term, and we expect more consolidation.


  • Ab Initio
  • Alex Solutions
  • Ataccama
  • BigID
  • Cloudera
  • Data Ladder
  • Datactics
  • Dataguise
  • DQ Global
  • Experian
  • FreeSight
  • GDE
  • Global IDs
  • Ground Labs
  • Hitachi
  • IBM
  • Infogix
  • Informatica
  • IRI
  • MarkLogic
  • MENTIS
  • Oracle
  • Pitney Bowes
  • Qlik
  • SAP
  • SAS
  • SEEKER
  • Silwood Technology
  • Solix
  • Syncsort
  • Syniti
  • Talend
  • Trifacta
  • Trillium Software
  • Unifi
  • Waterline Data

These organisations are also known to offer solutions:

  • Actian
  • Advizor
  • Alation
  • Alteryx
  • BackOffice Associates
  • BDQ
  • Broadcom
  • Clavis
  • Clearstory
  • CloverDX
  • Datacleaner
  • DataLynx
  • Datamartist
  • Datawatch
  • Datiris
  • FICO (InfoGlide)
  • Innovative Software
  • iWay
  • Melissa Data
  • Microsoft
  • Paxata
  • Pentaho
  • QlikTech
  • Rever
  • Rocket Software
  • Sypherlink
  • Tamr
  • Uniserv

Managing data lakes: building a business case

This is a companion paper to one we published in 2017. We outline a methodology for building a business case in support of implementing suitable data lake management software.

Dataguise DgSecure

Dataguise started by offering sensitive data discovery and masking but now includes, additionally, encryption/decryption and extensive reporting.

MENTIS iDiscover

MENTIS iDiscover offers a range of modules that cover all necessary functions for discovering, protecting and monitoring sensitive data.

BigID

BigID specialises in (sensitive) data discovery and classification and unlike other vendors has been built from the ground up to concentrate in this area.

Ab Initio Semantic Discovery

Ab Initio Semantic Discovery is a data discovery solution offered as part of Ab Initio’s broader data management platform.

Data Catalogues

Data catalogues are hot. Why? Why should you care? What can they do for you?

Experian SCV

Experian SCV consists of several components, the main one of which is Experian Aperture Data Studio along with various other elements.

Experian Aperture Data Studio

Experian Aperture Data Studio is a data quality and enrichment platform that includes data profiling and data quality analysis.