
Trust in Data


Trusting your data is essential if you are going to make business decisions based on that information. Various tools enable that trust: specifically, data profiling, data quality and data preparation tools.

Data profiling tools may be used to statistically analyse the content of data sources, to discover where errors exist and to monitor (typically via a dashboard) the current status of errors within a particular data source. They may also be used to discover any relationships that exist within and across data sources (see Data Discovery and Cataloguing). Data quality includes capabilities such as data matching (discovering duplicated records) and data enrichment (adding, say, geocoding or business data from the Internet), as well as data cleansing. Data quality is required for data governance and master data management (MDM). Some data quality products have specific capabilities to support, for example, data stewards and/or facilities such as issue tracking.

As far as data preparation is concerned, it takes the principles of data profiling and data quality and applies them to data that is typically, but not always, held within a data lake. As their name implies, the key ingredient of data preparation platforms is their self-service capabilities, which allow knowledgeable users (though not IT experts) to profile, combine, transform and cleanse relevant data prior to analysis: to “prepare” it. Tools in this category are targeted at business analysts and/or data scientists and work across all types of data (structured, semi-structured and unstructured) and across all data sources (both internal to the company and external).

One further element of trust in data concerns the training data used to support algorithmic processing, and ensuring that that data is unbiased. This is discussed in Machine Learning & AI.

Data profiling collects statistics, classically on a column-by-column basis: details such as minimum and maximum values, the number of times a value appears, the number of nulls, invalid datatypes and so on. In other words, it both detects errors and creates profiles – often expressed as histograms – of the data being examined. Relevant tools also typically have the ability to monitor these statistics on an ongoing basis.
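As a minimal sketch of the statistics described above, the following (hypothetical) `profile_column` function collects nulls, invalid datatypes, minimum and maximum values, and value frequencies (the basis for a histogram) for a single column:

```python
from collections import Counter

def profile_column(values, expected_type=float):
    """Collect simple profiling statistics for one column of raw data."""
    stats = {"nulls": 0, "invalid": 0, "values": []}
    freq = Counter()
    for v in values:
        if v is None or v == "":
            stats["nulls"] += 1            # missing value
            continue
        try:
            parsed = expected_type(v)
        except (TypeError, ValueError):
            stats["invalid"] += 1          # wrong datatype for this column
            continue
        stats["values"].append(parsed)
        freq[parsed] += 1
    stats["min"] = min(stats["values"], default=None)
    stats["max"] = max(stats["values"], default=None)
    stats["frequencies"] = dict(freq)      # basis for a histogram
    return stats

# Example: one column containing a null and a bad datatype
report = profile_column(["10", "25", None, "abc", "10"])
# report["nulls"] == 1 and report["invalid"] == 1
```

A real profiling tool would run such checks continuously against live sources and surface the results in a dashboard; this sketch only shows the per-column mechanics.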

Data quality products provide tools to perform various automated or semi-automated tasks that ensure that data is as accurate, up-to-date and complete as you need it to be. This may, of course, differ for different types of data: you want your corporate financial figures to be absolutely accurate, but a margin of error is probably acceptable when it comes to mailing lists. Data quality products provide a range of functions. A relevant tool might simply alert you that there is an invalid postal code and then leave you to fix it; or the software, perhaps integrated with a relevant ERP or CRM product, might prevent the entry of an invalid postal code altogether, prompting the user to re-enter that data. Some functions, such as adding a geocode to a location, can be completely automated, while others will always require manual intervention. For example, when identifying potentially duplicate records the software can do this for you, and calculate the probability of a match, but it will require a business user or data steward to actually approve the match.
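The two kinds of check described above can be sketched as follows, assuming a simplified UK-style postcode pattern (not the full official specification) and a crude string-similarity score for duplicate detection; the function names are illustrative, not from any product:

```python
import re
from difflib import SequenceMatcher

# Simplified UK postcode pattern (illustrative only, not the full spec)
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.I)

def check_postcode(record):
    """Alert-style check: flag an invalid postal code, leaving the fix to a user."""
    return bool(UK_POSTCODE.match(record.get("postcode", "")))

def match_probability(rec_a, rec_b):
    """Crude duplicate score over name and postcode; a steward approves the match."""
    sim = lambda x, y: SequenceMatcher(None, x.lower(), y.lower()).ratio()
    return 0.7 * sim(rec_a["name"], rec_b["name"]) + \
           0.3 * sim(rec_a["postcode"], rec_b["postcode"])

a = {"name": "Acme Ltd", "postcode": "SW1A 1AA"}
b = {"name": "ACME Limited", "postcode": "SW1A 1AA"}
# A high score marks a likely duplicate, pending steward approval
likely_duplicate = match_probability(a, b) > 0.7
```

Note the division of labour: the software computes the probability, but the final merge decision stays with a business user or data steward, exactly as described above.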

Data preparation helps to get data – typically in a data lake – ready for analysis. This requires that the data is profiled and cleansed. You will also commonly need to join data from diverse sources, which will mean identifying a join key (via some common information, such as an email address) and transforming the data so that it is in a consistent format. You may also need to pivot data or de-pivot it, or aggregate data. Self-service data preparation provides all these facilities.
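A minimal sketch of the join-and-aggregate steps above, using an email address as the join key and normalising it into a consistent format first (the data and function name are hypothetical):

```python
def prepare(customers, orders):
    """Join two sources on a shared email key, normalising the key first."""
    norm = lambda e: e.strip().lower()          # consistent format for the join key
    by_email = {norm(c["email"]): c for c in customers}
    joined = []
    for o in orders:
        c = by_email.get(norm(o["email"]))
        if c:                                    # drop orders with no matching customer
            joined.append({"name": c["name"], "amount": o["amount"]})
    # Aggregate: total spend per customer
    totals = {}
    for row in joined:
        totals[row["name"]] = totals.get(row["name"], 0) + row["amount"]
    return totals

customers = [{"name": "Ann", "email": "Ann@x.com"}]
orders = [{"email": "ann@x.com ", "amount": 10},
          {"email": "ann@x.com", "amount": 5}]
totals = prepare(customers, orders)   # {"Ann": 15}
```

Without the normalisation step, "Ann@x.com" and "ann@x.com " would fail to join: a trivial illustration of why consistent formats matter before analysis.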

However, the products in this category do not simply provide these facilities with a friendly user interface; they will also help with relevant tasks. For example, if you want to combine two sets of data the software will be able to suggest which fields you should join the data on, and it may recognise that you have missing demographic data for some of your customers and suggest where you could get that data from. Typically such products have both semantics and machine learning capabilities built in. The former helps the software to make recommendations to you, while the latter means that these recommendations improve over time. Machine learning can also make use of its knowledge of how your colleagues are using the platform.
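One simple way such a join-key recommendation could work, sketched here under the assumption that candidate columns are scored by how much their values overlap (real products use richer semantics and learned models):

```python
def suggest_join_keys(table_a, table_b, threshold=0.5):
    """Recommend column pairs whose values overlap enough to serve as a join key."""
    suggestions = []
    for col_a, vals_a in table_a.items():
        for col_b, vals_b in table_b.items():
            overlap = len(set(vals_a) & set(vals_b))
            score = overlap / max(len(set(vals_a)), 1)   # fraction of A covered by B
            if score >= threshold:
                suggestions.append((col_a, col_b, round(score, 2)))
    return sorted(suggestions, key=lambda s: -s[2])      # best candidates first

a = {"email": ["ann@x.com", "bob@y.com"], "region": ["N", "S"]}
b = {"contact": ["ann@x.com", "bob@y.com", "cat@z.com"], "code": ["N1", "S2"]}
ranked = suggest_join_keys(a, b)   # ("email", "contact") ranks first
```

A machine learning layer would then refine these rankings over time, for instance by observing which suggested joins your colleagues actually accept.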

An additional capability is that data preparation platforms audit what you are doing. Not only does this have compliance implications (IT can see that you are not doing anything you shouldn’t be doing), it also means that if you decide to put a particular data preparation process into formal production (that is, build a formal IT process) then all the logic of the process has already been captured, making IT development efforts much simpler. Finally, the fact that IT can see what data is being analysed, and how, means that IT can get a better handle on the sorts of services it might need to provide to the business.
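The audit idea can be sketched as a wrapper that records every preparation step as it is applied, so the captured log is itself the logic IT would later productionise (class and step names here are purely illustrative):

```python
class AuditedPrep:
    """Apply preparation steps while recording each one for later review."""
    def __init__(self, data):
        self.data = data
        self.log = []                      # captured logic of the process

    def apply(self, name, fn):
        """Run one named transformation over every row and log it."""
        self.data = [fn(row) for row in self.data]
        self.log.append({"step": name})
        return self                        # allow chained steps

prep = AuditedPrep([{"name": " Ann "}])
prep.apply("trim_names", lambda r: {**r, "name": r["name"].strip()})
# prep.log now records the "trim_names" step that was applied
```

Replaying `prep.log` against fresh data is, in effect, the formal production process; nothing has to be reverse-engineered from the analyst's ad hoc work.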

Data profiling and data quality are both essential to good data governance and to any project involving the migration or movement of data, including MDM implementations. Quite simply, they are fundamental to the business being able to trust the data upon which it makes business decisions.

Data profiling can not only establish the scale of data quality problems, but will also help data stewards to monitor the situation on an ongoing basis. Going further, data quality is about ensuring that data is fit for purpose: that it is accurate, timely and complete enough relative to the use to which it is put. As a technology, data quality can be applied either after the fact or in a preventative manner.

Self-service data preparation, on the other hand, is primarily aimed at business analysts or data scientists in the business units that are going to use it, because it allows them to prepare data and perform ad hoc analyses without being reliant on IT. However, there are also major benefits for IT, not just in being able to monitor what users are doing but also in reducing IT workload. It is perhaps worth commenting that self-service data preparation platforms are, in effect, data governance platforms: it is just that the governance aspects are hidden from the end users.

Both data profiling and data quality are mature technologies and the most significant trend here is to implement machine learning within these products, especially for matching where machine learning can help to reduce the number of false positives/negatives. The data discovery component of data profiling has also become significant thanks to compliance requirements such as GDPR. Data preparation tools are also leveraging more machine learning as well as embedding more collaborative features. Products are increasingly being combined either into analytics tools or with data catalogues, or both.

In general, the market is split between those companies that just focus on providing data profiling and/or data quality and those that also offer either ETL (extract, transform and load) or MDM (master data management) or both. Some of these “platforms” have been built from the ground up, such as that from SAS, while others consist more of disparate bits that have been loosely bolted together. As far as data preparation is concerned, this market is being targeted by traditional data quality vendors, pure-play data preparation providers, and business intelligence companies that have either built or bought (Qlik acquired Podium Data) relevant products. It is difficult to see how many of the pure-play suppliers can survive in the longer term, and we expect more consolidation.


  • Ab Initio
  • Alex Solutions
  • BigID
  • Dataguise
  • Experian
  • FreeSight
  • GDE
  • Global IDs
  • Ground Labs
  • IBM
  • Infogix
  • Informatica
  • MENTIS
  • Pitney Bowes
  • SAP
  • SAS
  • SEEKER
  • Silwood Technology
  • Solix
  • Syncsort
  • Syniti
  • Trifacta
  • Trillium Software
  • Unifi
  • Waterline Data

These organisations are also known to offer solutions:

  • Actian
  • Advizor
  • Alation
  • Alteryx
  • Ataccama
  • BackOffice Associates
  • BDQ
  • Broadcom
  • Clavis
  • Clearstory
  • CloverETL
  • Datacleaner
  • Datactics
  • DataLynx
  • Datamartist
  • Datawatch
  • Datiris
  • FICO (InfoGlide)
  • Hitachi Vantara
  • Innovative Software
  • iWay
  • Melissa Data
  • Microsoft
  • Oracle
  • Paxata
  • Pentaho
  • QlikTech
  • Rever
  • Rocket Software
  • Sypherlink
  • Talend
  • Tamr
  • Uniserv
What’s Hot in Data

In this paper, we have identified the potential significance of a wide range of data-based technologies that impact on the move to a data-driven environment.

GDPR and the MENTIS Data and Application Security Platform

This paper considers the requirements of GDPR and then discusses how MENTIS meets those requirements.

Managing Data Lakes

This paper discusses why data lakes need to be managed and the sorts of capabilities that are required to manage them.

Discovering data occurrence

In this paper, we examine how to find occurrences of sensitive data and we consider the different techniques that are currently available.

The Sensitive Data Lifecycle: IBM vs Informatica vs MENTIS

This paper compares the capabilities of IBM, Informatica and MENTIS for the discovery and governance of sensitive data.

Ground Labs Enterprise Recon

Enterprise Recon offers enterprise-wide discovery of personal, PII and PCI data in accordance with a wide range of compliance regulations.

Alex Solutions

Alex Solutions is an enterprise-level solution for metadata management, as well as data governance, data stewardship, data cataloguing, and data quality.

MENTIS iDiscover

MENTIS iDiscover offers a range of modules that cover all necessary functions for discovering, protecting and monitoring sensitive data.