Waterline Data was founded in 2013, backed by Menlo Ventures. Since then, it has accrued a number of additional investors, including Partech Ventures, Jackson Square Ventures and Infosys. The company is headquartered in Mountain View, California, and its international office is in London.
Lumada Data Catalog
Last Updated: 22nd June 2020
Mutable Award: Gold 2020
The Lumada Data Catalog is a data catalogue targeted at both the enterprise data lake and traditional data environments. It provides a complete solution for data discovery, cataloguing and compliance on these platforms, and is particularly notable for its discovery process, which is based around using machine-learning driven “data fingerprinting” to tag data consistently and intelligently. This process is further enhanced by the collaborative and crowd-sourcing capabilities the product provides. Moreover, as we discuss in this paper, it can also be applied to specifically discovering sensitive data.
The Lumada Data Catalog supports a variety of data sources, including most major relational databases, several cloud-based products including Amazon S3, Microsoft Azure, and Google Cloud Platform, on-premises Hadoop-based big data platforms, and a variety of structured and semi-structured file formats including Avro, Parquet, delimited files, JSON, XML and others.
Hitachi Vantara correctly recognises that the majority of enterprises will have far too much data for searching for sensitive data manually to be a viable prospect. On the other hand, automated discovery can sometimes be too simplistic to determine whether data is truly sensitive, especially when that data is only sensitive indirectly. The company’s solution to this problem is data fingerprinting, its bespoke, machine learning driven AI for data discovery and tagging.
Data fingerprinting works by creating a “fingerprint” for each of your data fields. Each fingerprint is a collection of metadata – in some cases, more than 100 pieces of metadata – which includes information about both the content of your data as well as the context it exists in. Data fingerprinting can then intelligently and automatically classify and tag each of your fields based on its fingerprint. Once your data has been classified, it is exposed to your users as part of the data catalogue and is open to curation and crowdsourcing, the former of which drives data fingerprinting machine learning, thus improving its accuracy over time. Figure 1 illustrates how fields in a data set are tagged by data fingerprinting, with different confidence levels based on the likelihood that a particular tag applies.
For the purposes of identifying sensitive data, a number of possible classifications are of interest, and in fact over 300 pre-defined and pre-trained tags are provided to identify and classify data that is sensitive under GDPR alone. What’s more, Lumada Data Catalog allows you to leverage tag-based rules to automatically tag data sets based on the fields they contain. For example, you could automatically tag your data sets as sensitive under GDPR if they contain fields tagged as first and last name, and where a field tagged as country contains data points corresponding to a European country. As you can tell from this example, these rules can mix data and metadata checks. What’s more, they are applied irrespective of data source, data format, and field names.
However, what if first and last names are in one table and country information is in another? With Lumada, you can visually explore any data tables that are related to the sensitive data you have already identified, as shown in Figure 2. This can be used to discover yet more sensitive data, and moreover, the software will automatically discover potential join conditions between these data sets and virtually join them using those conditions. This allows you to view them as a single data set. In turn, this can reveal additional sensitivity within your data that is only evident when it is seen in this way.
In addition, Lumada allows you to add custom properties to your data sets, which you can subsequently search on or filter by. For sensitive data, the utility here is that you can use these custom properties to store compliance metadata, such as the business purpose of your sensitive data. The product also provides visual, traversable tracking of data lineage, see Figure 3, which can either be imported from an existing source of lineage or inferred from your fingerprints. This capability may be useful for, say, locating data movement in and out of the EU. Finally, all of the searches and metadata available in the Lumada Data Catalog – including custom properties such as compliance metadata – are exposed via REST APIs, enabling integration with other compliance products.
Automated data discovery is essential for classifying your data consistently and comprehensively at an enterprise level, and this is just as true for sensitive data as it is for any other kind. Therefore, as a data catalogue, it should come as no surprise that Hitachi Vantara positions these features prominently. In this regard, Lumada’s adoption of machine learning in the form of data fingerprinting and the fingerprint system is notable.
The fingerprint system also has a number of features that are of particular benefit to sensitive data discovery. Fingerprints themselves contain a wealth of metadata, much of which concerns the context in which your data exists, and the rules engine that the product uses is able to operate in response to a combination of both metadata and data. Since whether or not any particular piece of data is sensitive can be highly complex and contextual, these are useful features for identifying the sensitive data in your environment.
The Bottom Line
The Lumada Data Catalog provides formidable sensitive data discovery capabilities as part of a data catalogue. If you are in the market for the latter, Lumada is a strong choice for the former.
Mutable Award: Gold 2020
Waterline Data Catalog
Last Updated: 26th September 2017
Waterline was founded on the realisation that the rise of big data and data lakes, and the consequent increase in the volume and variety of data available, would quickly lead to an impenetrable, unusable morass of dark data - now colloquially known as a data swamp - without proper management. As a result, its product, Waterline Data Catalog, is targeted at managing the enterprise data lake, covering Hadoop and traditional data environments, offering a complete solution for data discovery, cataloguing and compliance both on the data lake or on the more traditional relational database.
Waterline's self-described mission is to help connect the right people to the right data. Waterline recognises that there is too much data volume and variety for humans to manually catalogue. As a result, it has created an approach called "data fingerprinting" that uses machine learning algorithms to automate the consistent tagging of data attributes with commonly used business terms. At the same time, Waterline recognises that the people in your organisation, particularly but not exclusively your data stewards, still know a great deal about your data. Consequently, Waterline seeks to extract that knowledge and formalise it so that it can be accessed freely within your organisation. This has led to augmenting the automated discovery technology, which helps populate the catalogue very quickly, with collaboration and crowd-sourcing capabilities. For instance, users are encouraged to leave ratings and reviews on data sources, and data stewards, in particular, can annotate data sources in order to guide and help other users.
While Waterline uses a direct sales model it also has a significant partner network. These partners fall into four categories: value added resellers and systems integrators, which resell the product both across the USA and internationally, namely Infosys, Wipro, Leidos and T-Systems; pure consulting partners, which include Deloitte, CapGemini and others; platform partners, such as Cloudera, Hortonworks, MapR, Amazon and so forth; and technology partners like Trifacta, Privitar, Syncsort and Paxata, amongst others.
Waterline has customers spanning many industries, including automotive, aerospace, banking, government, healthcare, insurance, and life sciences. Some of their leading customers include McDonald's, Hewlett Packard Enterprise, NASA, Airbus, Intel, Starbucks, Creditsafe, Kaiser Permanente, Nordea and Santander. Use cases for the product range from self-service analytics to data governance and compliance, to data consolidation and rationalisation.
Waterline Data Catalog features data profiling, data discovery and global search based on Apache Solr, all of which leverage the compute power across your data lake systems, including both data lakes (Apache Hadoop, with support for Apache Spark) while also connecting directly to relational databases (via a plugin architecture) for cataloguing purposes. Lineage can be imported from other sources (for instance, Cloudera Navigator or Apache Atlas) or derived directly from your data (and corrected manually if needed). Note that the native use of Hadoop and Spark make it possible for Waterline Data Catalog to scale to meet the needs of even the largest data lakes.
The software also includes collaborative capabilities, such as crowdsourced ratings, reviews and annotations, and has sophisticated data matching capabilities that leverage machine learning to automatically suggest business terms - known as 'tags' - for fields within your data sources. Notably, this is done by examining the data itself, rather than simply the field name. Users with appropriate authority can accept or reject these suggestions. If they are accepted, they are added to the field as a custom attribute. What's more, due to the machine learning component, these suggestions will become more accurate over time as the system 'learns' how your organisation interacts with your data. The tags themselves are stored within your system, each associated with a particular domain. Built-in tags exist in a default domain and additional prefab domains, that contain common tags used in a particular space, such as 'GDPR' or 'Retail', are available. Existing business glossaries or taxonomies can be imported and if additional terms and domains are needed they can be created manually. Waterline also features automated tag-based data access control and integration with Apache Ranger and Cloudera Sentry is provided.
In addition to the support, training and other services provided directly by Waterline the company also partners with several professional service and consulting firms to help their customers implement Waterline products and integrate them into their existing data environment. These firms currently include: Infosys, T-Systems, Deloitte, Capgemini and Leidos.