Analyst Coverage: Philip Howard
Data movement does what its name suggests, and there are multiple ways of doing so and of avoiding doing so. The most common ways of moving data are via data integration, replication, change data capture and stream processing, but we also consider data warehouse (and data lake) automation tools to be primarily concerned with data movement.
Data integration is a set of capabilities that allow data that is in one place to be moved to another place. The technologies involved include ETL (extract, transform and load), ELT (extract and load the data before transforming it) and variations thereof. Data replication, on the other hand, along with change data capture and associated techniques, is essentially about copying data without any need for transformation.
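The difference between ETL and ELT can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the target system; the table names, the sample rows and the `parse_amount` helper are assumptions made purely for illustration and do not represent any particular product.

```python
import sqlite3

# Hypothetical source rows: (customer name, amount as text with a currency symbol).
SOURCE_ROWS = [("alice", "$10.50"), ("bob", "$3.25")]


def parse_amount(text):
    """Transformation step: strip the currency symbol and convert to cents."""
    return round(float(text.lstrip("$")) * 100)


def etl(rows):
    """ETL: transform outside the target, then load only the clean rows."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE sales (customer TEXT, cents INTEGER)")
    transformed = [(name.title(), parse_amount(amount)) for name, amount in rows]
    target.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
    query = "SELECT customer, cents FROM sales ORDER BY customer"
    return target.execute(query).fetchall()


def elt(rows):
    """ELT: load the raw rows first, then transform inside the target with SQL."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    target.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    target.execute(
        "CREATE TABLE sales AS "
        "SELECT upper(substr(customer, 1, 1)) || substr(customer, 2) AS customer, "
        "CAST(round(CAST(substr(amount, 2) AS REAL) * 100) AS INTEGER) AS cents "
        "FROM raw_sales"
    )
    query = "SELECT customer, cents FROM sales ORDER BY customer"
    return target.execute(query).fetchall()
```

Both functions should produce the same `sales` table; what differs is where the transformation runs. In the ELT case the raw data remains available in the target, which is one reason the approach suits data lakes.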
There are multiple use cases for data integration but two of the most notable are to support data migration and to move data from an operational database to a data warehouse. In the latter case, data warehouse and data lake automation tools extend this capability by understanding the relevant schemas and helping to automate (generating relevant transformation and load scripts) the creation of the said warehouses and lakes. Moreover, they typically use replication and/or change data capture in order to ensure that the target system is kept up to date.
Stream processing platforms, on the other hand, are typically about moving large quantities of data in real time, often without requiring any transformation of the data, though relevant products typically have some processing capabilities. Typical examples are moving log data and ingesting sensor data. These platforms may also be used to support streaming analytics.
Technologies for the avoidance of data movement include both data virtualisation and so-called HTAP (hybrid transactional and analytic processing) environments, where a single database supports both transactional processing and analytics. With HTAP, as everything is performed in one place, there is, obviously, no need for data movement.
Data virtualisation, sometimes called data federation, makes all data, regardless of where it is located and regardless of what format it is in, look as if it is in one place and in a consistent format. Note that this is not necessarily ‘your’ data: it may also include data held by partners or data on web sites, and it may be data that is on premises or data in the Cloud. Also bear in mind that when we say “regardless of format” we literally mean that, so we would include relational data, data in Hadoop, XML and JSON documents, flat file data, spreadsheet data and so on. Given that you have all of this data looking as if it is in a consistent format, you can then easily merge that data into applications or queries without physically moving it. This may be important where you do not own the data, when you cannot move it for security reasons, or simply because it would be too expensive to physically move the data. Thus data virtualisation supports the concept of data federation (query across multiple heterogeneous platforms) as well as mash-ups.
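As a rough illustration of the idea, the sketch below presents a relational table, a JSON document and a flat file as a single stream of rows in one format, reading each source in place rather than copying it into a common store. The sources, the field names (`customerId`, `fullName`) and all of the function names are hypothetical; real data virtualisation products add query optimisation, push-down and security that are not shown here.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sources; in practice these would live in separate systems.
JSON_TEXT = '[{"customerId": 2, "fullName": "Bob"}]'
CSV_TEXT = "id,name\n3,Carol\n"


def make_database():
    """Create a small relational source holding one customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    return conn


def relational_rows(conn):
    """Adapter: expose the relational table in the common row format."""
    for customer_id, name in conn.execute("SELECT id, name FROM customers"):
        yield {"id": customer_id, "name": name}


def json_rows(text):
    """Adapter: map the JSON document's field names onto the common format."""
    for doc in json.loads(text):
        yield {"id": doc["customerId"], "name": doc["fullName"]}


def csv_rows(text):
    """Adapter: read the flat file and convert the id column to an integer."""
    for record in csv.DictReader(io.StringIO(text)):
        yield {"id": int(record["id"]), "name": record["name"]}


def federated_customers(conn, json_text, csv_text):
    """Present all three sources as one virtual table; nothing is copied."""
    yield from relational_rows(conn)
    yield from json_rows(json_text)
    yield from csv_rows(csv_text)


def find_customer(customer_id):
    """Query the virtual table without knowing which source holds the row."""
    conn = make_database()
    for row in federated_customers(conn, JSON_TEXT, CSV_TEXT):
        if row["id"] == customer_id:
            return row["name"]
    return None
```

The caller of `find_customer` sees one consistent row format and never needs to know which underlying source answered the query.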
The first difference between the different types of data movement products is with respect to transformation. Replication is a process by which data is copied, often for availability and disaster recovery purposes, without requiring any transformation. It is commonly used in distributed database environments and may be provided through either synchronous or asynchronous means. Change data capture is similar, and may be used to support replication, but is essentially about supporting real-time updates to data, where that data is stored in multiple places and you need to propagate changes from an originating source.
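The propagation of changes from an originating source can be sketched as follows. This is a simplified illustration only: it derives change events by comparing two snapshots, whereas commercial change data capture tools commonly read the database transaction log instead. The function names and the event format are assumptions made for the example.

```python
def capture_changes(before, after):
    """Compare two snapshots (dicts keyed by primary key) and emit change events."""
    changes = []
    for key in sorted(after.keys() - before.keys()):
        changes.append(("insert", key, after[key]))
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            changes.append(("update", key, after[key]))
    for key in sorted(before.keys() - after.keys()):
        changes.append(("delete", key, None))
    return changes


def apply_changes(replica, changes):
    """Apply the change events to a replica; no transformation is involved."""
    for operation, key, row in changes:
        if operation == "delete":
            replica.pop(key, None)
        else:
            replica[key] = row
    return replica
```

For instance, if the source moves from `{1: "a", 2: "b"}` to `{2: "B", 3: "c"}`, only three small events (one insert, one update, one delete) need to travel to the replica rather than a full copy of the data.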
Stream processing platforms typically have some processing capabilities, so it is possible to perform simple transformations using these products. However, this is not their main raison d’être, which is to allow the movement of large volumes of data (typically such things as sensor data, stock ticks, web clicks or log data) where there are low latency performance requirements. Unlike the other technologies discussed here, they are not usually used for moving data from one database to another but, rather, from sources that are more disparate and less structured.
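The kind of simple in-stream processing referred to above might look like the following sketch, which computes a rolling average over sensor readings as they arrive. It uses a plain Python generator as a stand-in for a streaming platform; the function name and the window size are illustrative assumptions, and no real platform API is implied.

```python
from collections import deque


def rolling_average(readings, window=3):
    """Yield the average of the most recent `window` readings for each new one."""
    recent = deque(maxlen=window)  # older readings drop out automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)
```

Because the function is a generator, each reading is processed as it arrives and only the current window is held in memory, which mirrors the low latency, high volume character of stream processing.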
Finally, data integration tools, and their extension into data warehouse automation, are used when it is necessary to combine and use disparate data that is in different formats and you need that data in a consistent format for processing purposes. The classical use case for this was in moving, and transforming, data from production systems to data warehouses. However, there are many other use cases, including B2B exchange and support for data preparation tools within data lakes.
Data movement, in all of its forms, is an enabling technology rather than a solution in its own right. It is used to populate data warehouses and to exchange information with business partners and between applications; it is used to provide high availability; it may be used to support data preparation; and, in the case of streaming platforms, it may be the basis for implementing machine learning and analytics in-stream, as well as for supporting applications, such as predictive maintenance, within the context of the Internet of Things.
In addition, these techniques may be built into other broader tools, in particular, the data warehouse automation tools already discussed.
The emergence of streaming platforms such as Kafka and Flink has radically changed the data movement landscape over the last few years. Moreover, with the roll-out of 5G we can only expect even greater adoption of these technologies. In the more traditional data integration space the major trend is to move towards more cloud-centric provisioning rather than requiring an in-house deployment, as well as requirements for integration across hybrid cloud environments. At the same time, there is a continuing shift away from ETL towards ELT (especially where data lakes are involved), while more and more database vendors are moving into the HTAP space, thereby removing the need for data movement. On the growth side, as graph databases become more popular there is increasing demand for tools that enable transformations from relational into graph structures.
The market for data integration remains dominated by traditional vendors. Some of these are in more than one space: for example, Attunity is a major provider of data warehouse automation as well as change data capture, while IBM, Informatica and others provide both data integration and data federation. IBM, Oracle and so forth are also in the HTAP space. In the graph space (see above) Neo4j has introduced its own ETL tool as well as integrating with Pentaho’s Data Integration product (previously known as Kettle). As far as streaming platforms are concerned, data Artisans (commercial supporter of Flink) has changed its name to Ververica and been acquired by Alibaba.
With respect to data virtualisation, an increasing number of database vendors, including NoSQL providers, are building such capabilities into their products, typically using user-defined functions alongside external table capabilities. IBM has recently introduced extended capabilities with its Queryplex product.