Analyst Coverage: Philip Howard and Daniel Howard
Hitachi Vantara is a wholly owned subsidiary of Hitachi Ltd, the Japanese multinational founded in 1910. Hitachi Vantara was formed in September 2017 by merging Hitachi Data Systems, Hitachi Insight Group and Pentaho (a previous acquisition). In January 2020, Hitachi Vantara integrated with Hitachi Consulting and began operating two distinct business units: digital infrastructure and digital solutions. The company has more than 10,000 employees and is headquartered in Santa Clara, California, but has operations in over 45 countries worldwide.
Hitachi Vantara Lumada
Last Updated: 3rd March 2021
The Lumada DataOps Suite from Hitachi Vantara has five components, as illustrated in Figure 1. The company also offers Pentaho Enterprise Edition (currently in version 9.1), and some of the capability offered by Lumada is based on Pentaho’s (open source) technology. This is especially true of the Data Integration and Enterprise Analytics components shown in this diagram. It is less true of the other elements of the Suite, particularly the Data Catalog (based on the acquisition of Waterline Data). While the Data Catalog has not yet been fully integrated with Pentaho, substantial elements of this integration (for example, registering objects into the catalogue from Pentaho and writing metadata, as well as reading capabilities) are already in place.
The other important point to note about Figure 1 is the support for Lidar and other devices that are not usually supported by data integration vendors. This is indicative of Hitachi Vantara’s industrial focus and the company specifically targets manufacturing, logistics and Internet of Things applications as well as more traditional industries such as financial services, telecommunications, and healthcare.
“The functionality that Pentaho gives over proprietary vendors provides huge cost savings.”
“CERN’s systems need to manage high volumes of confidential information on its employees and their families, so security, data governance, and data integrity are all paramount. After a review of five different proprietary and open source platforms, Pentaho emerged as best adapted to our needs.”
“Hitachi Vantara has shown that it’s a true partner to customers. Pentaho is a great tool that’s evolved to meet the challenges of real people.”
The Lumada DataOps Suite aims to provide the functionality illustrated in Figure 2. That is, it is intended to support everything from the ingestion of data through to its analysis and presentation in dashboards. That said, while it includes some capabilities for data profiling within its data cataloguing product (see below), it does not offer data quality or governance capabilities, for which it relies upon partners such as Collibra. It likewise relies on partners for data preparation and data masking. In addition, Lumada Data Optimizer for Hadoop offers orthogonal capability by providing intelligent data tiering for HDFS environments. The company is also a Cloudera partner.
As far as data integration is concerned, Figure 3 provides a snapshot of the capabilities offered. What it doesn’t tell you is that there is a distinction between Pentaho and Lumada Data Integration: the latter consists of Pentaho Data Integration combined with Dataflow Studio, a web and microservices-based application that enables scalable execution of dataflow jobs and is aimed at business users rather than only at IT staff. Nor does it tell you that Lumada Edge Intelligence (IoT Edge) provides a lightweight environment that supports the hosting of AI and machine learning in edge devices, along with real-time KPI-based dashboarding, alerting, and support for streaming data. More generally, streaming is supported in all its varieties: micro-batch or real-time, continuous or event-driven, as you require. Also not mentioned is the fact that the company offers more than 200 connectors; support for Jupyter Notebooks along with R, Python and TensorFlow; an extensible framework that includes SDKs, APIs, and plug-ins; and change data capture (CDC).
Finally, we should discuss the Lumada (previously Waterline) Data Catalog. This is targeted at both enterprise data lakes and traditional data environments. It provides a complete solution for data discovery, cataloguing and compliance on these platforms, and is particularly notable for its discovery process, which uses a bespoke, machine learning driven technique called “data fingerprinting” to tag data consistently and intelligently.
The discovery and tagging solution works by creating a “fingerprint” for each of your data fields. Each fingerprint is a collection of metadata – in some cases, more than 100 pieces of metadata – which includes information about both the content of your data and the context it exists in. Lumada Data Catalog can then intelligently and automatically classify and tag each of your fields based on its fingerprint. Once your data has been classified, it is exposed to your users as part of the data catalogue and is open to curation and crowdsourcing, the former of which feeds the machine learning algorithm, thus improving its accuracy over time. Different confidence levels can be attached to tags based on the likelihood that a particular tag applies.
The biggest differentiator for Hitachi Vantara’s Lumada DataOps Suite is its end-to-end capabilities from federated edge functions through to multi-cloud centralised operations. Other data integration vendors do not typically support sensors and other devices to the extent that Hitachi Vantara does. Conversely, the suppliers that focus on IoT and similar environments do not generally offer strong support for data integration in the traditional sense of that term. We should add that the acquisition of Waterline Data in March 2020 adds significantly to the capabilities that Hitachi Vantara can offer, and we would not be surprised to see, and indeed would recommend, further relevant acquisitions in the future.
The Bottom Line
Because Lumada is modular, and because Hitachi Vantara can leverage partners when it needs to (depending on the use case), we regard the DataOps Suite as a platform play. We particularly like it within the context of distributed deployments, where it can leverage its major differentiators.
Lumada Data Catalog
Last Updated: 22nd June 2020
Mutable Award: Gold 2020
The Lumada Data Catalog is a data catalogue targeted at both the enterprise data lake and traditional data environments. It provides a complete solution for data discovery, cataloguing and compliance on these platforms, and is particularly notable for its discovery process, which is based around using machine-learning driven “data fingerprinting” to tag data consistently and intelligently. This process is further enhanced by the collaborative and crowd-sourcing capabilities the product provides. Moreover, as we discuss in this paper, it can also be applied to specifically discovering sensitive data.
The Lumada Data Catalog supports a variety of data sources, including most major relational databases, several cloud-based products including Amazon S3, Microsoft Azure, and Google Cloud Platform, on-premises Hadoop-based big data platforms, and a variety of structured and semi-structured file formats including Avro, Parquet, delimited files, JSON, XML and others.
Hitachi Vantara correctly recognises that the majority of enterprises have far too much data for a manual search for sensitive data to be viable. On the other hand, automated discovery can sometimes be too simplistic to determine whether data is truly sensitive, especially when that data is only sensitive indirectly. The company’s solution to this problem is data fingerprinting, its bespoke, machine learning driven approach to data discovery and tagging.
Data fingerprinting works by creating a “fingerprint” for each of your data fields. Each fingerprint is a collection of metadata – in some cases, more than 100 pieces of metadata – which includes information about both the content of your data and the context it exists in. Data fingerprinting can then intelligently and automatically classify and tag each of your fields based on its fingerprint. Once your data has been classified, it is exposed to your users as part of the data catalogue and is open to curation and crowdsourcing, the former of which feeds the machine learning behind data fingerprinting, thus improving its accuracy over time. Figure 1 illustrates how fields in a data set are tagged by data fingerprinting, with different confidence levels based on the likelihood that a particular tag applies.
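To make the idea concrete, the following is a minimal sketch of a field fingerprint and confidence-scored tagging. It is not Hitachi Vantara's algorithm: the `Fingerprint` class, the handful of metrics, the hand-written scoring weights and the tag names are all assumptions chosen for illustration, whereas the product is described as using machine learning over far richer metadata.

```python
import re
from dataclasses import dataclass

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

@dataclass
class Fingerprint:
    """A tiny, hypothetical fingerprint: content metrics plus one piece of context."""
    field_name: str        # context: what the field is called
    count: int             # content: number of non-null values
    distinct_ratio: float  # content: share of values that are unique
    mean_length: float     # content: average string length
    digit_ratio: float     # content: share of characters that are digits
    email_ratio: float     # content: share of values shaped like an email address

def fingerprint(field_name, values):
    """Compute the fingerprint of one field from its values."""
    values = [str(v) for v in values if v is not None]
    n = len(values)
    if n == 0:
        return Fingerprint(field_name, 0, 0.0, 0.0, 0.0, 0.0)
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    return Fingerprint(
        field_name=field_name,
        count=n,
        distinct_ratio=len(set(values)) / n,
        mean_length=total_chars / n,
        digit_ratio=digits / total_chars if total_chars else 0.0,
        email_ratio=sum(bool(EMAIL_RE.match(v)) for v in values) / n,
    )

def suggest_tags(fp, threshold=0.5):
    """Return {tag: confidence} for tags whose confidence meets the threshold.

    The weights below are hand-picked assumptions standing in for a trained model.
    """
    name_hint = 1.0 if "mail" in fp.field_name.lower() else 0.0
    scores = {
        # Mostly content-driven, with a small boost from the field name.
        "email": 0.8 * fp.email_ratio + 0.2 * name_hint,
        # Digit-heavy values of plausible phone-number length.
        "phone_number": fp.digit_ratio if 7 <= fp.mean_length <= 15 else 0.0,
    }
    return {tag: round(conf, 2) for tag, conf in scores.items() if conf >= threshold}

# Example: a field whose name gives no hint, but whose content mostly looks like emails.
contact_fp = fingerprint(
    "contact", ["a@example.com", "b@example.org", "not-an-email", "c@example.net"]
)
contact_tags = suggest_tags(contact_fp)
```

In this sketch the tag is suggested from content alone, because the field name "contact" carries no hint; a curator accepting or rejecting that suggestion is the kind of feedback the text describes as improving accuracy over time.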
For the purposes of identifying sensitive data, a number of possible classifications are of interest, and in fact over 300 pre-defined and pre-trained tags are provided to identify and classify data that is sensitive under GDPR alone. What’s more, Lumada Data Catalog allows you to leverage tag-based rules to automatically tag data sets based on the fields they contain. For example, you could automatically tag your data sets as sensitive under GDPR if they contain fields tagged as first and last name, and where a field tagged as country contains data points corresponding to a European country. As you can tell from this example, these rules can mix data and metadata checks. What’s more, they are applied irrespective of data source, data format, and field names.
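The GDPR rule described above can be sketched as follows. This is an illustration only, not the product's rule syntax: the data structure, the `is_gdpr_sensitive` function, the tag names and the (deliberately incomplete) country list are all assumptions.

```python
# Illustrative subset only; a real rule would use a complete, maintained list.
EUROPEAN_COUNTRIES = {"France", "Germany", "Spain", "Italy", "Netherlands"}

def is_gdpr_sensitive(fields):
    """Decide whether a data set should be tagged as sensitive under GDPR.

    `fields` maps a field name to {"tags": set of tags, "values": list of values}.
    The rule mixes a metadata check (which tags are present) with a data check
    (whether a country-tagged field actually contains a European country).
    """
    all_tags = set().union(*(f["tags"] for f in fields.values()))
    has_names = {"first_name", "last_name"} <= all_tags
    has_european_country = any(
        "country" in f["tags"] and any(v in EUROPEAN_COUNTRIES for v in f["values"])
        for f in fields.values()
    )
    return has_names and has_european_country

# Field names are deliberately cryptic: the rule looks only at tags and values.
customers = {
    "fn": {"tags": {"first_name"}, "values": ["Ana", "Luc"]},
    "ln": {"tags": {"last_name"}, "values": ["Lopez", "Martin"]},
    "ctry": {"tags": {"country"}, "values": ["Spain", "Canada"]},
}
overseas_customers = {
    "fn": {"tags": {"first_name"}, "values": ["Joe"]},
    "ln": {"tags": {"last_name"}, "values": ["Smith"]},
    "ctry": {"tags": {"country"}, "values": ["Canada", "USA"]},
}
```

Because the check never inspects field names or storage details, the same rule would apply to any source once its fields have been tagged, which is the property the text highlights.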
However, what if first and last names are in one table and country information is in another? With Lumada, you can visually explore any data tables that are related to the sensitive data you have already identified, as shown in Figure 2. This can be used to discover yet more sensitive data, and moreover, the software will automatically discover potential join conditions between these data sets and virtually join them using those conditions. This allows you to view them as a single data set. In turn, this can reveal additional sensitivity within your data that is only evident when it is seen in this way.
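One simple way to picture join discovery is value overlap between columns: if nearly all the values in one column also appear in another, the pair is a plausible join condition. The sketch below uses that heuristic; it is our assumption for illustration, not a description of how Lumada discovers joins, and the function names and the 0.8 threshold are hypothetical.

```python
def column_values(rows, column):
    """Set of non-null values found in one column of a list of row dicts."""
    return {row[column] for row in rows if row.get(column) is not None}

def candidate_joins(left_rows, right_rows, min_overlap=0.8):
    """Return (left_column, right_column, overlap) triples, best candidates first.

    Overlap is the number of shared values divided by the size of the smaller
    column's value set.
    """
    left_cols = list(left_rows[0]) if left_rows else []
    right_cols = list(right_rows[0]) if right_rows else []
    candidates = []
    for lcol in left_cols:
        lvals = column_values(left_rows, lcol)
        for rcol in right_cols:
            rvals = column_values(right_rows, rcol)
            if not lvals or not rvals:
                continue
            overlap = len(lvals & rvals) / min(len(lvals), len(rvals))
            if overlap >= min_overlap:
                candidates.append((lcol, rcol, overlap))
    return sorted(candidates, key=lambda c: c[2], reverse=True)

def virtual_join(left_rows, right_rows, lcol, rcol):
    """Inner-join two row lists in memory, leaving the source data untouched."""
    index = {}
    for row in right_rows:
        index.setdefault(row[rcol], []).append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[lcol], [])]

# Names live in one table and countries in another, as in the scenario above.
people = [
    {"person_id": 1, "first_name": "Ana", "last_name": "Lopez"},
    {"person_id": 2, "first_name": "Joe", "last_name": "Smith"},
]
addresses = [
    {"pid": 1, "country": "Spain"},
    {"pid": 2, "country": "Canada"},
]
joins = candidate_joins(people, addresses)
combined = virtual_join(people, addresses, joins[0][0], joins[0][1])
```

Viewed as a single data set, `combined` holds first name, last name and country together, so a sensitivity rule that needs all three can now apply even though neither source table would have triggered it alone.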
In addition, Lumada allows you to add custom properties to your data sets, which you can subsequently search on or filter by. For sensitive data, the utility here is that you can use these custom properties to store compliance metadata, such as the business purpose of your sensitive data. The product also provides visual, traversable tracking of data lineage (see Figure 3), which can either be imported from an existing source of lineage or inferred from your fingerprints. This capability may be useful for, say, locating data movement in and out of the EU. Finally, all of the searches and metadata available in the Lumada Data Catalog – including custom properties such as compliance metadata – are exposed via REST APIs, enabling integration with other compliance products.
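As a rough illustration of the lineage use case, lineage can be treated as a directed graph of data sets, with a region recorded for each one (for instance as a custom property). The sketch below is an assumption-laden example rather than the product's lineage model or API: the edge list, the `region_of` mapping and both function names are hypothetical.

```python
from collections import deque

def downstream(lineage, start):
    """All data sets reachable from `start` by following lineage edges."""
    children = {}
    for source, target in lineage:
        children.setdefault(source, []).append(target)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def cross_border_edges(lineage, region_of, eu="EU"):
    """Lineage edges where data moves into or out of the EU."""
    return [
        (source, target)
        for source, target in lineage
        if (region_of[source] == eu) != (region_of[target] == eu)
    ]

# Hypothetical lineage: (source data set, target data set).
lineage = [
    ("crm_eu", "staging_eu"),
    ("staging_eu", "warehouse_us"),
    ("warehouse_us", "report_us"),
]
region_of = {
    "crm_eu": "EU",
    "staging_eu": "EU",
    "warehouse_us": "US",
    "report_us": "US",
}
affected = downstream(lineage, "crm_eu")
crossings = cross_border_edges(lineage, region_of)
```

Here `crossings` isolates the single hop where data leaves the EU, while `affected` lists every data set that ultimately derives from the EU source.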
Automated data discovery is essential for classifying your data consistently and comprehensively at an enterprise level, and this is just as true for sensitive data as it is for any other kind. It should therefore come as no surprise that Hitachi Vantara positions these features prominently in its data catalogue. In this regard, Lumada’s adoption of machine learning in the form of data fingerprinting is notable.
The fingerprint system also has a number of features that are of particular benefit to sensitive data discovery. Fingerprints themselves contain a wealth of metadata, much of which concerns the context in which your data exists, and the rules engine that the product uses is able to operate in response to a combination of both metadata and data. Since whether or not any particular piece of data is sensitive can be highly complex and contextual, these are useful features for identifying the sensitive data in your environment.
The Bottom Line
The Lumada Data Catalog provides formidable sensitive data discovery capabilities as part of a data catalogue. If you are in the market for the latter, Lumada is a strong choice for the former.