Data Profiling and Discovery
Analyst Coverage: Philip Howard
Data profiling collects statistics about the validity of data, while data discovery identifies relationships between different data elements, either within a single database or across databases. Data profiling also provides the ability to monitor relevant statistics on an ongoing basis.
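As a rough illustration of the kind of statistics a profiling tool gathers, the following Python sketch profiles a single column. It is not taken from any product discussed here: the function name, the sample postcodes and the simplified postcode pattern are all assumptions made for the example.

```python
import re
from collections import Counter


def profile_column(values, pattern=None):
    """Return simple profiling statistics for one column of values.

    Hypothetical helper for illustration only; real tools collect far more.
    """
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v is not None and v != ""]
    stats = {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1),
    }
    if pattern is not None:
        # Count populated values that do not fully match the expected format.
        regex = re.compile(pattern)
        stats["invalid"] = sum(
            1 for v in non_null if not regex.fullmatch(str(v))
        )
    return stats


# Sample data and a deliberately simplified UK postcode pattern (assumptions).
postcodes = ["SW1A 1AA", "EC1A 1BB", "", "12345", "SW1A 1AA", None]
report = profile_column(postcodes, pattern=r"[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}")
print(report)
```

Running the same function on a schedule and comparing the results over time is one simple way to approximate the ongoing monitoring described above.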
Historically, data profiling tools were capable of discovering relationships within a single database but not across databases, while data discovery tools focused only on discovering relationships. While there are still pure-play profiling and pure-play discovery tools that do not perform the other function, for all intents and purposes the two markets have now merged.
There are also specific tools (for example, from Rever and Silwood) that focus on particular aspects of discovery.
Data profiling is typically used as a precursor to either data cleansing, because it identifies where errors exist, or data masking, because it can discover where personally identifiable and similar information is stored. It is also used by data stewards and business analysts to monitor data quality on an ongoing basis.
Data discovery is used with data migration, data archival, test data management and other technologies where it is important to understand the (referentially intact) business entities that you are managing or manipulating. This emphasis on business entities is also important in supporting collaboration between the business and IT, because it is at this level that business analysts understand the data. Data discovery is also important in implementing MDM (master data management) because it enables the discovery of such things as matching keys and provides precedence analysis.
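One common way to find candidate relationships of this kind is to look for columns whose values are all contained in a unique column of another table. The Python sketch below shows that idea under stated assumptions: the table layout, names and sample data are invented for illustration, and commercial discovery tools use considerably more sophisticated techniques.

```python
def candidate_keys(tables):
    """Return (table, column) pairs whose non-null values are all distinct."""
    keys = []
    for table_name, columns in tables.items():
        for column_name, values in columns.items():
            present = [v for v in values if v is not None]
            if present and len(set(present)) == len(present):
                keys.append((table_name, column_name))
    return keys


def candidate_relationships(tables):
    """Find columns whose values all appear in a candidate key elsewhere."""
    found = []
    for key_table, key_column in candidate_keys(tables):
        key_values = set(tables[key_table][key_column])
        for table_name, columns in tables.items():
            if table_name == key_table:
                continue  # only look for relationships between tables
            for column_name, values in columns.items():
                present = {v for v in values if v is not None}
                if present and present <= key_values:
                    found.append(
                        ((table_name, column_name), (key_table, key_column))
                    )
    return found


# Hypothetical tables; they could equally come from different databases.
tables = {
    "customers": {"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]},
    "orders": {"order_id": [10, 11, 12, 13], "customer_id": [1, 1, 3, 2]},
}
relationships = candidate_relationships(tables)
print(relationships)
```

A match found this way is only a candidate: value overlap can occur by coincidence, so results would normally be reviewed by someone who understands the business entities involved.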
One major new area of applicability for data profiling (or, more accurately, discovery) has emerged recently, aimed at what might be called “understanding data landscapes”. This applies to very large enterprises that have hundreds or thousands of databases and simply want to understand the relationships that exist across those databases.
Data profiling and discovery is essential to good data governance and to any project involving the migration or movement of data, including MDM implementations. For data governance, data profiling can not only establish the scale of data quality problems but will also help data stewards to monitor the situation on an ongoing basis.
For data movement/migration projects, the proper use of data profiling and discovery can help to determine the scale of the issues that you will face during the project in question. In practice, therefore, you should aim to profile your data and discover your relationships, at least on a first-cut basis, before finalising your project budget and timescales. Project managers and project sponsors should therefore care about this technology.
As noted, there has been a trend towards merging profiling and discovery capabilities, although some vendors have stayed within one area or the other. Another trend is that a number of suppliers (for example, Ataccama and X88), as well as the traditional open source vendors, are now offering free downloads of their profiling software in order to increase interest.
Support for NoSQL and other big data sources is another emerging trend. In general this tends to be limited to Hadoop at present, although Talend offers more extensive support. We can expect to see more of this in the future.
There have been no great shifts in the vendor landscape in the recent past, although the trend towards embedding more discovery capabilities continues. One major new company has entered the market: Experian Data Quality, previously Experian QAS. This company is white-labelling X88’s Pandora product and is moving towards a general-purpose data quality offering rather than being limited to name and address cleansing.
That said, we remain disappointed that IBM continues to offer two products in this space, one associated with its Optim family of products (that is, for data archival and the like) and another associated with InfoSphere and DataStage for data integration. This tends to mean that the latter is recommended for data migration and MDM projects when you should really use Data Discovery, which is one of the best discovery products on the market. We should say, however, that while this is the most egregious example, IBM is not alone in its confusion.
Finally, we mentioned the emergence of landscape discovery as a new implementation possibility for data profiling and discovery: at present, as far as we are aware, only Global IDs is actively targeting this market.