The data replication market

There are loads of vendors that provide data replication for back-up and recovery purposes, but not many that (also) target operational and business intelligence environments. For example, Symantec, EMC and Novell are all in the former camp but they don’t target BI. In fact, there are only a little more than half a dozen suppliers that support real-time query and operational use across the increasingly large volumes of data that need to be processed. However, it is in this area that we are really seeing growth, with increasing demand for real-time analytics as well as for things like zero-downtime migrations and master data management.

Not surprisingly, when a market starts to get hot, we see a lot of action. It probably started when IBM acquired DataMirror, and then Oracle bought GoldenGate and, most recently, Informatica has gained WisdomForce. IBM has also announced (at last) its roadmap for integrating the DataMirror products, and Attunity has just launched its Replicate product (it did previously offer some replication capability but it was very limited – this new product is much more ambitious). In addition, HVR and HiT Software (part of BackOffice Associates) are two further independent vendors, and there are also a couple of open source projects (DBReplicator and Daffodil Replicator), though it is unclear how useful these will be in a BI environment. Microsoft is also a player, though its capabilities are limited beyond a purely Microsoft environment. However, the big beast in the replication market is Sybase, with over 3,000 installations. And we can expect that to grow significantly now that Sybase is an SAP company.

The key feature for replication in a business intelligence/data warehousing market is heterogeneity. Not only are you likely to have varied operational data sources from which you want to extract data in real-time (via log-based change data capture) but the growth in the number of vendors in the data warehousing market means that you also need to support a wide range of suppliers at the back end. Gone are the days when you could simply support Oracle, DB2, SQL Server and Teradata: you now need support for IBM Netezza, EMC Greenplum, HP Vertica and myriad others. Not to mention the requirements for replicating to/from the cloud.

A major trend appears to be away from a traditional scripting approach for developing your replication mappings and towards a more graphical one. For example, Informatica uses this method, as does Attunity. Given that this is how you build mappings in other environments (data modelling, data transformations and so on), there seems to be no good reason why you should be forced to write scripts when that’s unnecessary.

Of course, performance is the be-all and end-all of replication, so you can’t just rely on ODBC and JDBC connectors. You need purpose-built CDC (change data capture) at the source, and you need to be able to leverage native APIs and native fast loaders at the target in order to get maximum throughput. This is why heterogeneity is difficult: each source and target you support needs its own native capture or load path rather than a generic connector. And, of course, you will need to support parallelism for the same reason.
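To make the parallelism point a little more concrete, here is a minimal sketch of one common way to apply captured changes in parallel: split the change stream into lanes by key, so that all changes for a given row stay in order within one lane, and apply the lanes concurrently. Everything here is an assumption for illustration; the tuple layout, the function names and the in-memory dictionaries standing in for a target's native loader are hypothetical and not taken from any vendor's product. Python threads are used only to show the structure, not to demonstrate real throughput.

```python
from concurrent.futures import ThreadPoolExecutor

# Each change is a hypothetical tuple: (log_position, operation, key, value).


def partition(changes, lanes):
    """Split changes into lanes by key so each key's changes stay in log order."""
    buckets = [[] for _ in range(lanes)]
    for change in changes:  # changes are assumed to arrive in log order
        key = change[2]
        buckets[key % lanes].append(change)
    return buckets


def apply_lane(lane):
    """Apply one lane sequentially and return the rows that lane owns."""
    rows = {}
    for _position, op, key, value in lane:
        if op == "delete":
            rows.pop(key, None)
        else:  # treat anything else as an insert-or-update
            rows[key] = value
    return rows


def parallel_apply(changes, lanes=4):
    """Apply all lanes concurrently, then merge their disjoint results."""
    buckets = partition(changes, lanes)
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        results = list(pool.map(apply_lane, buckets))
    merged = {}
    for rows in results:
        merged.update(rows)  # lanes own disjoint keys, so nothing is overwritten
    return merged


changes = [
    (1, "upsert", 1, "a"),
    (2, "upsert", 2, "b"),
    (3, "upsert", 1, "a2"),
    (4, "delete", 2, None),
    (5, "upsert", 5, "e"),
]
result = parallel_apply(changes)
print(result)
```

The trade-off in this sketch is that ordering is only preserved per key, not across keys, so a design that needs transactional consistency across rows would need something more careful than this.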

You should also bear in mind that data integration (ETL/ELT), data replication and data federation are all complementary. It is common, for example, to use ETL/ELT to bulk load data into your warehouse while simultaneously using data replication for loading real-time data. It is also realistic to dispense with ETL processes completely, provided that there is no requirement for complex or extensive transformations. In other words, you could use data replication on its own.
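As a rough illustration of how bulk loading and replication can work side by side, the sketch below loads a snapshot taken at a known log position and then applies only the changes recorded after that position. It is a simplified sketch under stated assumptions: the snapshot structure, the numeric log positions and the function names are all hypothetical, and a plain dictionary stands in for the warehouse.

```python
# Each change is a hypothetical tuple: (log_position, operation, key, value).


def bulk_load(snapshot):
    """Stand-in for an ETL/ELT bulk load of a consistent source snapshot."""
    return dict(snapshot["rows"]), snapshot["position"]


def catch_up(target, change_log, applied_position):
    """Apply changes newer than applied_position; return the new position."""
    for position, op, key, value in change_log:
        if position <= applied_position:
            continue  # already reflected in the bulk-loaded snapshot
        if op == "delete":
            target.pop(key, None)
        else:  # treat anything else as an insert-or-update
            target[key] = value
        applied_position = position
    return applied_position


# Snapshot taken when the source log had reached position 2.
snapshot = {"position": 2, "rows": {1: "a", 2: "b"}}
change_log = [
    (1, "upsert", 1, "a"),
    (2, "upsert", 2, "b"),
    (3, "upsert", 1, "a2"),
    (4, "upsert", 3, "c"),
    (5, "delete", 2, None),
]

warehouse, position = bulk_load(snapshot)
position = catch_up(warehouse, change_log, position)
print(position, warehouse)
```

Where no complex transformations are needed, the same catch-up loop could in principle be started from an empty target at position zero, which corresponds to using replication on its own.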

I shall be writing more about data replication in the weeks to come – in the short term specifically about the new offerings from Informatica and Attunity – but it is clear that this is a hot topic and one that is attracting more attention from users and vendors alike.