Trust in Data
Analyst Coverage: Philip Howard
Trusting your data is essential if you are going to make business decisions based on that information, and there are various tools that enable that trust: specifically, data profiling, data quality and data preparation tools.
Data profiling tools may be used to statistically analyse the content of data sources, to discover where errors exist and to monitor (typically via a dashboard) the current status of errors within a particular data source. They may also be used to discover any relationships that exist within and across data sources (see Data Discovery and Cataloguing). Data quality includes capabilities such as data matching (discovering duplicated records) and data enrichment (adding, say, geocoding or business data from the Internet), as well as data cleansing. Data quality is required for data governance and master data management (MDM). Some data quality products have specific capabilities to support, for example, data stewards and/or facilities such as issue tracking.
As far as data preparation is concerned, this takes the principles of data profiling and data quality and applies them to data that is, typically but not always, held within a data lake. As their name implies, the key ingredient of data preparation platforms is their ability to provide self-service capabilities that allow knowledgeable users who are not IT experts to profile, combine, transform and cleanse relevant data prior to analysis: to “prepare” it. Tools in this category are targeted at business analysts and/or data scientists and work across all types of data (structured, semi-structured and unstructured) and across all data sources (both internal to the company and external).
One further element of trust in data is with respect to training data to support algorithmic processing, and ensuring that the data is unbiased. This is discussed in Machine Learning & AI.
Data profiling collects statistics, classically on a column-by-column basis: details such as minimum and maximum values, the number of times a value appears, the number of nulls, invalid datatypes and so on. In other words, it both detects errors and creates profiles – often expressed as histograms – of the data being examined. Relevant tools also typically have the ability to monitor these statistics on an ongoing basis.
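The column-level statistics described above can be sketched in a few lines. The following Python example is illustrative only – the function and field names are our own, not those of any particular product, and real profiling tools also infer datatypes, patterns and cross-column relationships:

```python
from collections import Counter

def profile_column(values):
    """Collect simple column-level profile statistics: null count,
    min/max, distinct count and a value-frequency histogram."""
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),   # missing entries
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "distinct": len(set(non_null)),         # cardinality
        "histogram": Counter(non_null),         # value frequencies
    }

# Example: profile a column of ages with some missing values
ages = [34, 29, None, 34, 41, None, 29, 34]
stats = profile_column(ages)
```

A monitoring dashboard would simply recompute these statistics on a schedule and track how they drift over time.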
Data quality products provide tools to perform various automated or semi-automated tasks that ensure that data is as accurate, up-to-date and complete as you need it to be. This may, of course, be different for different types of data: you want your corporate financial figures to be absolutely accurate, but a margin of error is probably acceptable when it comes to mailing lists. Data quality products provide a range of functions. A relevant tool might simply alert you that there is an invalid postal code and then leave you to fix that; or the software, perhaps integrated with a relevant ERP or CRM product, might prevent the entry of an invalid postal code altogether, prompting the user to re-enter that data. Some functions, such as adding a geocode to a location, can be completely automated while others will always require manual intervention. For example, when identifying potentially duplicate records the software can do this for you, and calculate the probability of a match, but it will require a business user or data steward to actually approve the match.
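The probabilistic matching step described above can be illustrated with a simple per-field similarity score. This is a sketch using Python's standard library – the threshold and field names are hypothetical, and commercial products use weighted, field-specific comparators rather than raw string similarity:

```python
from difflib import SequenceMatcher

def match_score(rec_a, rec_b, fields=("name", "email")):
    """Average per-field string similarity between two records.
    A score near 1.0 suggests a likely duplicate; the final merge
    decision is left to a business user or data steward."""
    sims = [
        SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f in fields
    ]
    return sum(sims) / len(sims)

a = {"name": "Jon Smith", "email": "j.smith@example.com"}
b = {"name": "John Smith", "email": "j.smith@example.com"}
score = match_score(a, b)
# Above a (hypothetical) threshold, flag the pair for steward review
needs_review = score > 0.8
```

Note that the software only ranks the candidate pairs; approving the match remains a human decision, exactly as described above.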
Data preparation helps to get data – typically in a data lake – ready for analysis. This requires that the data is profiled and cleansed. You will also commonly need to join data from diverse sources, which will mean identifying a join key (via some common information, such as an email address) and transforming the data so that it is in a consistent format. You may also need to pivot data or de-pivot it, or aggregate data. Self-service data preparation provides all these facilities.
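As a sketch of the join-and-transform step, the following Python example joins two small datasets on a normalised email key. The dataset and field names are illustrative, not drawn from any particular tool:

```python
def join_on_email(customers, orders):
    """Join two datasets on a shared email key, normalising case and
    whitespace so inconsistently formatted values still match."""
    index = {c["email"].strip().lower(): c for c in customers}
    joined = []
    for order in orders:
        key = order["email"].strip().lower()
        if key in index:                      # inner join on the key
            joined.append({**index[key], **order, "email": key})
    return joined

customers = [{"email": "Ana@Example.com", "name": "Ana"}]
orders = [{"email": "ana@example.com ", "total": 42.0},
          {"email": "bob@example.com", "total": 10.0}]
result = join_on_email(customers, orders)
```

The normalisation (trimming and lower-casing) is the “transform to a consistent format” step; without it, the differently formatted email values would fail to match.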
However, the products in this category do not simply provide these facilities with a friendly user interface, they will also help with relevant tasks. For example, if you want to combine two sets of data the software will be able to suggest what fields you should join the data on, and it may recognise that you have missing demographic data for some of your customers and suggest where you could get that data from. Typically, such products have both semantic and machine learning capabilities built in. The former help the software to make recommendations to you, while the latter means that these recommendations improve over time. Machine learning can also draw on knowledge of how your colleagues are using the platform.
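A crude stand-in for the join-field recommendation described above is to rank column pairs by how much their values overlap. This naive Python sketch uses Jaccard similarity on toy data (all names are made up); real products layer semantics and machine learning on top of statistics like these:

```python
def suggest_join_keys(rows_a, rows_b):
    """Rank pairs of columns from two datasets by Jaccard overlap of
    their values; the top pair is the suggested join key."""
    suggestions = []
    for col_a in rows_a[0]:
        vals_a = {row[col_a] for row in rows_a}
        for col_b in rows_b[0]:
            vals_b = {row[col_b] for row in rows_b}
            overlap = len(vals_a & vals_b) / len(vals_a | vals_b)
            suggestions.append((overlap, col_a, col_b))
    return sorted(suggestions, reverse=True)

crm = [{"cust_id": "C1", "email": "ana@example.com"},
       {"cust_id": "C2", "email": "bob@example.com"}]
web = [{"visitor": "V9", "contact": "ana@example.com"},
       {"visitor": "V7", "contact": "bob@example.com"}]
best = suggest_join_keys(crm, web)[0]   # highest-overlap column pair
```

Here the email and contact columns share all their values, so they come out on top even though the column names differ, which is precisely why such recommendations are useful to a non-expert user.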
An additional capability is that data preparation platforms audit what you are doing. Not only does this have compliance implications (IT can see that you are not doing anything you shouldn’t be doing), it also means that if you decide to put a particular data preparation process into formal production (that is, build a formal IT process) then all the logic of the process has already been captured, making IT development efforts much simpler. Finally, the fact that IT can see what data is being analysed and how means that IT can get a better handle on the sorts of services it might need to provide to the business.
Data profiling and data quality are both essential to good data governance and to any project involving the migration or movement of data, including MDM implementations. Quite simply, they are fundamental to the business being able to trust the data upon which it makes business decisions.
Data profiling can not only establish the scale of data quality problems, it will also help data stewards to monitor the situation on an ongoing basis. Going further, data quality is about ensuring that data is fit for purpose: that it is accurate, timely and complete enough relative to the use to which it is put. As a technology, data quality can be applied either after the fact or in a preventative manner.
Self-service data preparation, on the other hand, is primarily aimed at business analysts or data scientists in the business units that are going to use it, because it allows them to prepare data and perform ad hoc analyses without being reliant on IT. However, there are also major benefits for IT, not just in being able to monitor what users are doing but also in reducing IT workload. It is perhaps worth commenting that self-service data preparation platforms are, in effect, data governance platforms: it is just that the governance aspects are hidden from the end users.
Both data profiling and data quality are mature technologies and the most significant trend here is to implement machine learning within these products, especially for matching where machine learning can help to reduce the number of false positives/negatives. The data discovery component of data profiling has also become significant thanks to compliance requirements such as GDPR. Data preparation tools are also leveraging more machine learning as well as embedding more collaborative features. Products are increasingly being combined either into analytics tools or with data catalogues, or both.
In general, the market is split between those companies that just focus on providing data profiling and/or data quality and those that also offer either ETL (extract, transform and load) or MDM (master data management) or both. Some of these “platforms” have been built from the ground up, such as that from SAS, while some others consist more of disparate bits that have been loosely bolted together. As far as data preparation is concerned, this market is being targeted by traditional data quality vendors, pure-play data preparation providers and business intelligence companies that have either built or bought (Qlik acquired Podium Data) relevant products. It is difficult to see how many of the pure-play suppliers can survive in the longer term, and we expect more consolidation.