Anatomy of a data management platform

Following on from our previous blog on the recent trends within the data management space, this blog will introduce some of the most significant capabilities that a data management platform can (and, frequently, should) possess, then proceed to demonstrate why those things are important, particularly in light of the aforementioned trends. That said, data management is a broad space, so only the most notable capabilities will be included.

On a general note, it is worth watching out for high levels of automation, ease of use (and by extension, self-service and collaboration), performance and connectivity within any prospective data management platforms. While these are not capabilities per se, they are certainly important qualities for a platform to have.

Data discovery and cataloguing

Data discovery and data cataloguing enable you to find what data you have and determine how it is related. Discovery is often provided as part of data profiling (which itself exists to generate a bird’s-eye view of your data in terms of its structure, content and so on). It reveals where different datasets are located, and the relationships that exist between them, within and across multiple (heterogeneous) data sources. More broadly, it is a fundamental tool for understanding your data landscape. Sensitive data discovery is a notable subcategory that is particularly concerned with locating and classifying personal or otherwise sensitive data within your organisation so that it can be appropriately protected for the sake of data privacy, security, and regulatory compliance.

Data discovery is also used to build data catalogues. These provide a repository of information about your data assets (that is, their metadata): what data is held, where it is located, what format it is in, and within which domains it is relevant. As much of this information as possible should be collected automatically, and it may be classified further by geography, time, access control, and so on. Catalogues are indexed and searchable, and support self-service and collaboration. More comprehensive catalogues will ingest metadata from various derived sources, such as analytical reports and dashboards, in addition to the physical sources of your data. Catalogues are commonly used in conjunction with data preparation tools, and are important for supporting data governance and collaborative, self-service-based data access.
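To make the idea of a catalogue entry more concrete, here is a minimal sketch in Python. It is an illustration only: the `CatalogueEntry` fields, the `search` helper and the example locations are all invented for this example, and real catalogues hold far richer metadata and populate it automatically.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """Hypothetical metadata record describing one data asset."""
    name: str
    location: str          # where the data is held
    data_format: str       # what format it is in
    domains: list = field(default_factory=list)  # where it is relevant
    tags: list = field(default_factory=list)     # further classification


def search(catalogue, term):
    """Return the names of entries whose name, domains or tags contain the term."""
    term = term.lower()
    return [
        entry.name
        for entry in catalogue
        if term in entry.name.lower()
        or any(term in value.lower() for value in entry.domains + entry.tags)
    ]


# Example entries; the locations are made up for illustration.
catalogue = [
    CatalogueEntry("customers", "postgres://crm/public.customers", "table",
                   ["sales"], ["personal-data"]),
    CatalogueEntry("web_clicks", "s3://logs/clicks/", "parquet",
                   ["marketing"], []),
]

print(search(catalogue, "personal"))   # ['customers']
print(search(catalogue, "marketing"))  # ['web_clicks']
```

Even in a toy like this, the value comes from the metadata being searchable; the hard part in practice is keeping it populated without manual effort.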

The major issue to be concerned about with data discovery, and by extension data cataloguing, is the method by which your data is discovered, classified, and catalogued. Methods run the gamut from very simple, basic discovery (such as column name matching) to highly sophisticated, multivariate matching that leverages machine learning and other advanced techniques. The former is more common, but the latter is by far the more accurate. Likewise, as far as catalogues are concerned, the more automation (which frequently involves machine learning, for making automated recommendations, say), the better. Indeed, some level of automation is practically essential, because populating or updating a catalogue manually is so laborious and time-consuming that you might as well not have one at all.
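As a rough illustration of the simple end of that spectrum, column name matching can be sketched in a few lines of Python. The patterns and column names below are assumptions made up for this example, not taken from any product.

```python
import re

# Hypothetical patterns: column-name fragments that hint at sensitive data.
SENSITIVE_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "date_of_birth": re.compile(r"dob|birth", re.IGNORECASE),
}


def classify_columns(column_names):
    """Return {column_name: category} for columns whose names match a pattern."""
    findings = {}
    for name in column_names:
        for category, pattern in SENSITIVE_NAME_PATTERNS.items():
            if pattern.search(name):
                findings[name] = category
                break  # first matching category wins
    return findings


columns = ["customer_id", "Email_Address", "dob", "contact_ref"]
print(classify_columns(columns))
# {'Email_Address': 'email', 'dob': 'date_of_birth'}
```

Suppose `contact_ref` actually holds phone numbers: name matching would miss it entirely, because it never looks at the values. That is the kind of gap the more sophisticated, content-aware techniques are meant to close.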

Data movement and virtualisation

Data movement does what its name suggests: it moves data. There are multiple ways of doing so, and of avoiding doing so. The most common ways of moving data are data integration, replication, change data capture and stream processing.

Data integration is for moving data from one place to another while (possibly) transforming and enriching it in the process. The technologies involved include ETL (extract, transform and load), ELT (extract, load and transform) and variations thereof. Data replication, on the other hand, along with change data capture and associated techniques, is essentially about copying data with less need for transformation. Data integration is a broad capability, but has particularly notable uses as part of data migration and analytics, in the latter case for moving data from an operational database to a data warehouse.
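The ETL pattern can be sketched with nothing more than Python's built-in `sqlite3` module. This is a toy under stated assumptions: the `orders` and `fact_orders` tables, their columns and the sample rows are invented here, and two in-memory databases stand in for the operational database and the data warehouse.

```python
import sqlite3


def run_etl(source, target):
    """Extract orders from the source, clean them up, load them into the target."""
    # Extract: read the raw rows from the operational database.
    rows = source.execute(
        "SELECT id, country, amount_pence FROM orders"
    ).fetchall()

    # Transform: normalise country codes and convert pence to pounds.
    transformed = [
        (order_id, country.strip().upper(), pence / 100)
        for order_id, country, pence in rows
    ]

    # Load: write the cleaned rows into the warehouse table.
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
    target.commit()
    return len(transformed)


# Stand-in operational database with slightly messy data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount_pence INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, " gb ", 1250), (2, "fr", 800)]
)

# Stand-in data warehouse.
target = sqlite3.connect(":memory:")
loaded = run_etl(source, target)

print(loaded)  # 2
print(target.execute("SELECT * FROM fact_orders ORDER BY id").fetchall())
# [(1, 'GB', 12.5), (2, 'FR', 8.0)]
```

In an ELT variant, the raw rows would be loaded into the target first and the clean-up would then run there, typically as SQL inside the warehouse itself.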

Stream processing, by contrast, is typically about moving data (and often large quantities of it, or at least quantities that could be considered large when taken in aggregate over a period of time) in real time, often without requiring any transformation of the data, though relevant products typically have some processing capabilities. Typical uses include moving log data and ingesting sensor data. These platforms may also be used to support streaming analytics (which is to say, performing analytics on streaming data).

The foremost technology for avoiding data movement is data virtualisation. Sometimes called data federation, this makes all data, regardless of where it is located and regardless of what format it is in, look as if it is in one place and in a consistent format. You can then easily merge that data into applications or queries without physically moving it. This may be important where you do not own the data, when you cannot move it for security reasons, or simply because it would be too expensive to move. The big disadvantage of data virtualisation is that it cannot address use cases where moving the data is the entire point, such as in migrations to the cloud, but for the use cases where it does apply it will frequently be very appealing. Performance can also be a challenge, although this will vary depending on the use case and the technology available.

Data assurance, quality and governance

Data assurance is a combination of different aspects of data management that collectively allow you to trust that your data is secure, compliant, and of a high quality (among other things). And trusting your data, and data assurance by extension, is essential if you are going to make business decisions based on that information. Accordingly, data assurance encompasses such areas as data quality and data governance.

Data quality products provide tools to perform various automated or semi-automated tasks that ensure that your data is as accurate, up-to-date and complete as you need it to be. This may be different for different types of data: you will need your corporate financial figures to be absolutely accurate, but a margin of error is probably acceptable when it comes to, say, tracking page impressions, where there are no GDPR-like compliance risks to factor in. Data quality tools also provide capabilities such as data matching and deduplication, data enrichment, and data cleansing.

Relevant products vary in terms of functionality, from simply alerting you that there is (for instance) an invalid postal code, to preventing the entry of an invalid postal code altogether by prompting the user to re-enter that data. They also vary in terms of support for proactive data quality measures (like performing quality checks at regular intervals) vs. reactive data quality measures (such as performing quality checks automatically on data as it is ingested). Some functions, such as adding a geocode to a location, can be completely automated while others will always require manual intervention. For example, when identifying potentially duplicate records, the software can do this for you and calculate the probability of a match, but it may require a business user or data steward to actually approve said match.
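The duplicate-matching workflow described above might look something like the following sketch. Note the assumptions: it uses the string similarity ratio from Python's standard `difflib` module as a crude stand-in for a match probability (real products use far more sophisticated matching), and the threshold and sample records are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Assumed cut-off: pairs scoring at or above this go to a data steward.
REVIEW_THRESHOLD = 0.8


def similarity(a, b):
    """Case-insensitive similarity score between 0 and 1 (not a true probability)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(records, threshold=REVIEW_THRESHOLD):
    """Return (id_a, id_b, score) for every pair that looks like a duplicate."""
    candidates = []
    for (id_a, name_a), (id_b, name_b) in combinations(records.items(), 2):
        score = similarity(name_a, name_b)
        if score >= threshold:
            candidates.append((id_a, id_b, round(score, 2)))
    return candidates


records = {1: "Jonathan Smith", 2: "Jonathon Smith", 3: "Maria Garcia"}

# The software proposes matches; a person still approves or rejects each one.
for id_a, id_b, score in candidate_duplicates(records):
    print(f"Records {id_a} and {id_b} may be duplicates (score {score}): needs review")
```

The point of the design is the division of labour: the scoring is automated, but nothing is merged until a business user or data steward has signed it off.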

Moreover, data quality is instrumental to data governance. In fact, a complete data governance solution will consist of, at the very least, policy management, data stewardship, and data quality capabilities. Policy management encompasses the formal creation of policy, the decision-making process that has led to your policy being what it is, and the management and, if needed, alteration of existing policy. Data stewardship is primarily the ability to monitor your data and ensure that it complies with the policies you have laid out, ideally in an automated fashion. Often, this will involve moving from policy in the abstract to implementing concrete and enforceable business rules. Lastly, data quality is as described above, in this case composed of the implementation and enforcement of the rules created via data stewardship.

The idea is to take a top-down approach to monitoring and curating your data, in which you decide on a robust, consistent and universal set of data policies that then filter down to all of your data assets systemically and (at least to an extent) automatically. This is particularly important for enabling data privacy and regulatory compliance at the enterprise level.

Data privacy, protection and regulatory compliance

This article has already covered several elements of data privacy, owing to the fact that it both relies on data discovery specifically and is an outgrowth of data assurance in general. Despite this, it is sufficiently important that it is worth describing in its own section. Data privacy is about protecting the privacy of the sensitive data within your organisation, and by extension the people (customers, employees, and so on) that data is (most of the time) associated with. This means you will need to discover which data is sensitive and where it is, then protect and govern it appropriately. This can be achieved through a combination of (sensitive) data discovery and data governance, both of which are discussed above, as well as protective measures for sensitive data, such as encryption or data masking.

Data masking in particular involves replacing sensitive data with desensitised, random or redacted data. The sophistication of this process varies, and it can be done either statically (meaning the data is replaced at the database level, usually irreversibly: useful for test data and not much else) or dynamically (the data itself remains unchanged, but only a masked version of that data is exposed to users without the appropriate privileges to access the real data).
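The difference between the two approaches is easiest to see side by side. The sketch below is a simplified illustration: the masking rule, the function names and the sample record are all assumptions made for this example, and a plain Python list stands in for the database.

```python
import copy


def mask_email(email):
    """Keep the first character and the domain; redact the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def mask_statically(rows):
    """Static masking: overwrite the stored values, so the originals are lost."""
    for row in rows:
        row["email"] = mask_email(row["email"])


def read_dynamically(rows, user_is_privileged):
    """Dynamic masking: stored data is untouched; unprivileged users see a masked view."""
    if user_is_privileged:
        return [dict(row) for row in rows]
    return [{**row, "email": mask_email(row["email"])} for row in rows]


production = [{"id": 1, "email": "alice@example.com"}]

# Dynamic: the unprivileged view is masked, but production is unchanged.
print(read_dynamically(production, user_is_privileged=False))
# [{'id': 1, 'email': 'a***@example.com'}]
print(production)
# [{'id': 1, 'email': 'alice@example.com'}]

# Static: mask a copy destined for a test environment.
test_copy = copy.deepcopy(production)
mask_statically(test_copy)
print(test_copy)
# [{'id': 1, 'email': 'a***@example.com'}]
```

Once `mask_statically` has run, there is no way to recover the original address from `test_copy`, which is exactly why static masking suits test data.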

Managing the lifecycle of your sensitive data is another important area of data privacy and regulatory compliance, especially with, say, GDPR. This includes elements of consent management (managing the permissions that let you store the data in the first place), data retention (keeping track of how long you’re permitted to keep the data for) and data retirement (when and how you’re going to dispose of the data, whether through archival or deletion, once those permissions have elapsed).

There are multiple reasons to care about data privacy, but the most obvious and significant is regulatory compliance, and avoiding both the fines and the reputational damage that come with failing to achieve it. Given the proliferation of international regulations (GDPR, CCPA, LGPD, and more) that have come about in recent years, this is a hot issue for a lot of organisations, and with good reason.

–

In summary, the four most important areas of data management are, at present, data discovery, data movement, data assurance, and data privacy. The vast majority of effective, complete data management solutions will include these capabilities, and they form the core of many data management platforms. Although there certainly are other significant areas in data management – data delivery and data access come to mind – none of them are quite so essential.