Applying data warehousing principles to event processing

It is now widely agreed that data warehousing environments need a massively parallel processing (MPP) architecture to handle very large data volumes. The important factor is that processing happens locally, close to each disk, which improves performance compared with moving all of the data to a central point and then trying to process it there.

Now consider (complex) event processing environments where you have very large quantities of data. Shouldn’t the same thing apply? For most of the vendors in the market the answer is no. However, these suppliers are typically focused almost exclusively on applications within capital markets. Here you have the advantage that there is, in effect, a single source of data. Yes, there are separate feeds from Bloomberg, Reuters and others, and they need to be filtered, but essentially they can be treated as homogeneous. As a result, given sufficient processing power, the conventional event processing vendors can handle the volumes of data well enough. In other words, if we think about this in database terms, there is a CPU issue but not an I/O issue.

Now consider non-financial applications such as those based on automatic number plate recognition (ANPR), surveillance, GPS, RFID or other sensor data. In these environments the data sources are often widely distributed, and each location may be generating very large volumes of data. So here we have, potentially, both a processing and an I/O issue.

Well, if MPP is the solution to the combined I/O and processing limitations of data warehousing, shouldn’t an analogous approach also apply to event processing in distributed environments?

This is the approach that Event Zero, an Australian company, has taken with its Event Processing Network (EPN). It offers a distributed solution that works in a fashion analogous to an MPP-based warehouse. That is, it has not implemented a traditional hub and spoke architecture, in which the spokes collect the data and pass it on to the hub for processing. Instead, it applies the principle of local processing: as much data as possible is processed locally before anything is passed to the centre, which is still needed for tasks such as aggregation. Of course, this also makes the whole environment much more flexible because it is easy and inexpensive to add new local processors as required.
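To make the contrast with hub and spoke concrete, here is a minimal sketch in Python of the general principle. It is not EPN's API, which I have not reproduced here: the function names, the record layout and the ANPR watchlist scenario are all assumptions made purely for illustration. Each site filters its own raw readings and sends only a compact summary to the centre, which then aggregates.

```python
from collections import Counter

# Hypothetical illustration only: names and data layout are assumptions,
# not part of any real product's interface.

def process_locally(readings, watchlist):
    """Runs at one site: count all readings, keep only watchlist matches."""
    hits = [r for r in readings if r["plate"] in watchlist]
    return {"total": len(readings), "hits": hits}

def aggregate_centrally(summaries):
    """Runs at the centre: combine the small per-site summaries."""
    total = sum(s["total"] for s in summaries)
    hits_per_plate = Counter(h["plate"] for s in summaries for h in s["hits"])
    return total, hits_per_plate

# Raw readings stay at the site that captured them.
site_a = [
    {"plate": "ABC123", "site": "A"},
    {"plate": "XYZ789", "site": "A"},
]
site_b = [
    {"plate": "ABC123", "site": "B"},
    {"plate": "JKL456", "site": "B"},
    {"plate": "MNO321", "site": "B"},
]
watchlist = {"ABC123"}

# Only the summaries cross the network to the centre.
summaries = [process_locally(site, watchlist) for site in (site_a, site_b)]
total, hits_per_plate = aggregate_centrally(summaries)
```

In this sketch five readings are captured but only two records travel to the centre, and adding a new site means adding one more call to the local function rather than resizing the hub.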

There is another point. It is often of crucial importance that you understand the sequence and timing of events in an event stream. Now, that’s easy enough if you are dealing with stock ticks, but it is significantly harder if you are dealing with distributed data sources. It is not impossible when using a monolithic approach, but it is going to be much easier with a solution like EPN, where such facilities have been designed into the product from the outset.
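As a rough illustration of the sequencing problem, the following Python sketch merges events from two sources into a single time-ordered stream. It says nothing about how EPN itself does this. It assumes, hypothetically, that every event is timestamped at its source, that the source clocks are synchronised, and that each local stream is already in time order; the Event type and the camera names are invented for the example.

```python
import heapq
from typing import NamedTuple

class Event(NamedTuple):
    timestamp: float  # assumed to be stamped at the source, clocks in sync
    source: str
    payload: str

def merge_in_time_order(*streams):
    """Lazily merge per-source streams that are each already time-ordered."""
    return heapq.merge(*streams, key=lambda e: e.timestamp)

camera_1 = [Event(1.0, "camera_1", "ABC123"), Event(4.0, "camera_1", "XYZ789")]
camera_2 = [Event(2.5, "camera_2", "ABC123"), Event(3.0, "camera_2", "JKL456")]

ordered = list(merge_in_time_order(camera_1, camera_2))
```

The hard part in practice is everything this sketch assumes away: clock drift between locations, network delay and late-arriving events, which is why it matters whether such facilities are designed in from the start.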

As far as I know, EPN’s architecture is unique. More to the point, it appears to have substantial advantages in distributed event processing environments. I also like the way in which the company is focusing on solutions (which include ANPR, environmental monitoring, security, transport/traffic systems and utilities, amongst others) rather than just on its platform. Although EPN was only launched in April 2008 (development has been ongoing since 2004), I expect it to have significant success.