Real-time data integration has become an increasingly significant issue over the last few years, supporting predictive analytics, operational business intelligence, real-time dashboards, risk management, zero-downtime migrations, service-oriented architectures and so on.
However, despite its increasing popularity, one of the most common assumptions about real-time data movement is rarely scrutinised or discussed: that the means of supporting it (typically change data capture) is exclusively an adjunct to traditional batch methods of data integration, a side dish rather than the main course. In other words, batch capabilities are typically thought of as coming first, via standard ETL (extract, transform and load) tools, and real-time data integration is then implemented as an accessory, whether for trickle-feeding data into a warehouse or for other purposes. What is not generally considered, except perhaps for specific projects, is the idea of implementing real-time capabilities on their own, without any use of batch processing.
What would be the value of this?
Well, consider the characteristics of batch loading. Put simply, you have to load and transform all of the data you need to move within the confines of your batch window. This means you must have enough hardware and processing power to handle these peak loads with acceptable performance. Worse, batch windows are narrowing, so that these peaks are, relatively speaking, becoming increasingly onerous and demanding, and the processing power required to service them is increasing correspondingly.
This is where batch loading stumbles over its own proverbial feet and real-time processing begins to overtake it: where batch loading takes a specific amount of time of concentrated processing, real-time data integration divides the same effective amount of processing requirement over a much longer period of time. As a simple (and simplistic) example, if twelve hours of batch loading are required each week, replacing this with a real-time approach would mean that those twelve hours of work would be spread out over the whole week of 168 hours. In other words, you require 14 times (168 divided by 12) more processing power to handle peak processing loads in this particular batch environment than you do if exclusively moving the data in real-time.
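The arithmetic behind that 14-times figure can be sketched in a few lines. The 12-hour weekly batch window is the article's own example; the workload size below is a hypothetical number used purely to illustrate the ratio, which is independent of it.

```python
# Peak-capacity comparison: batch window vs. real-time trickle feed.
# The 12-hour window comes from the article; weekly_work_units is a
# hypothetical workload chosen only to make the arithmetic concrete.

HOURS_PER_WEEK = 168
batch_window_hours = 12          # concentrated batch processing per week
weekly_work_units = 1_000_000    # hypothetical rows to move each week

# Batch: all of the week's work must be done inside the window.
batch_peak_rate = weekly_work_units / batch_window_hours

# Real-time: the same work is spread evenly across the whole week.
realtime_peak_rate = weekly_work_units / HOURS_PER_WEEK

ratio = batch_peak_rate / realtime_peak_rate
print(f"Batch needs {ratio:.0f}x the peak capacity of real-time")
```

Whatever the workload, the ratio collapses to 168 / 12 = 14, which is why the batch environment must be provisioned for fourteen times the peak throughput.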
Of course there is more to it than this but, simplistically, it should be clear that the argument is valid: real-time data integration requires less computing power than a traditional approach, whether that is real-time data movement combined with batch, or batch on its own. And that translates into less hardware, less space on the data centre floor, lower power requirements and less need for cooling. So real-time is greener than batch, and there is a good case for using it as a standard approach to integration even if we didn't have an increasing need for real-time information.
This article was co-authored by Daniel Howard.