The problem with SIEM 2


In the last article in this series I discussed what SIEM products do and asked why the suppliers don’t go and talk to data integration vendors. In this article I want to discuss data storage.

The first thing to appreciate is that event data (and a log is just a series of events) does not come neatly packaged. There are two things wrong with it: first, it comes in different formats from different sources and, second, it is more or less unreadable. So you have to parse and ‘normalise’ the data: put it into a consistent format that is more easily read. Most vendors do this by storing the data twice, once in raw and once in normalised form, though a few produce normalised versions of the data on the fly. Of those that store normalised data, many suppliers, though by no means all, only store it in aggregated form, so if you need to drill down to individual events then those events either won’t be normalised or will have to be normalised at the point when you want to read them.
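To make the normalisation point concrete, here is a minimal sketch in Python of what parsing two raw events from different sources into one consistent set of fields might look like. The sample log lines, field names and formats are all illustrative assumptions of mine, not any particular vendor's schema.

```python
import re

# Two raw events in different source formats (illustrative samples only,
# not any vendor's actual output).
RAW_EVENTS = [
    'Jan 12 03:14:07 fw01 DROP src=10.0.0.5 dst=192.168.1.9 proto=TCP dpt=445',
    '2010-01-12T03:14:09Z,webproxy,alice,GET,http://example.com/login,403',
]

def normalise(raw):
    """Map a raw event string onto one consistent set of fields."""
    syslog = re.match(r'(\w{3} +\d+ [\d:]+) (\S+) (\w+) (.*)', raw)
    if syslog:
        ts, host, action, rest = syslog.groups()
        kv = dict(pair.split('=', 1) for pair in rest.split())
        return {'timestamp': ts, 'source': host, 'action': action,
                'src_ip': kv.get('src'), 'dst_ip': kv.get('dst')}
    # Otherwise assume the CSV-style proxy format shown above.
    ts, host, user, method, url, status = raw.split(',')
    return {'timestamp': ts, 'source': host, 'action': method,
            'user': user, 'url': url, 'status': int(status)}

for event in RAW_EVENTS:
    print(normalise(event))
```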

In practice, data is typically stored as follows: raw data is inserted into an indexed flat file system, while (aggregated) normalised data is held in a database such as Oracle, MySQL, DB2 or whatever; often these two systems are loaded in parallel. Some companies use flat file systems throughout. The reason for using flat files for long-term storage is that they give good levels of compression (and therefore disk space and other savings) at the back-end, while databases are used at the front-end because they can handle high transaction rates (an event in this sense being just like a transaction) and are pretty good at rapid query processing against aggregated data.
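As a rough illustration of that two-tier pattern, the sketch below appends each raw event to a compressed flat file while folding a normalised summary into a relational table. SQLite is used purely so the example is self-contained and runnable; the vendors in question would use Oracle, MySQL, DB2 and so on, and the table layout is an assumption of mine.

```python
import gzip
import sqlite3

# Front-end summary store (sqlite3 here only so the example is runnable).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE event_summary '
           '(hour TEXT, source TEXT, action TEXT, event_count INTEGER)')

def store(raw, normalised):
    # Tier 1: append the raw event, compressed, for long-term retention.
    with gzip.open('raw_events.log.gz', 'at') as archive:
        archive.write(raw + '\n')
    # Tier 2: fold the normalised event into an hourly aggregate that the
    # front-end can query quickly (assumes an ISO-style timestamp).
    hour = normalised['timestamp'][:13]
    updated = db.execute(
        'UPDATE event_summary SET event_count = event_count + 1 '
        'WHERE hour = ? AND source = ? AND action = ?',
        (hour, normalised['source'], normalised['action']))
    if updated.rowcount == 0:
        db.execute('INSERT INTO event_summary VALUES (?, ?, ?, 1)',
                   (hour, normalised['source'], normalised['action']))
    db.commit()

# Example: store one event in both tiers.
store('2010-01-12T03:14:09Z,webproxy,alice,GET,http://example.com/login,403',
      {'timestamp': '2010-01-12T03:14:09Z', 'source': 'webproxy', 'action': 'GET'})
```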

What neither of these (flat files or transactional databases) will be very good at is supporting the sort of complex analytics that you might want to use to investigate potential fraud, say. To be fair, not every user wants to do this, though it’s my opinion that more and more will want to as time goes on. The logical approach would therefore be to use an analytic warehouse for storing the data. If the warehouse uses column-based (or equivalent) compression, which DB2 and Netezza both do, as well as the column-based databases, then you should get at least as good compression as you do with flat files; and if you use a database that doesn’t employ indexes (Netezza and the columnar vendors) then you should get a better net effect overall. You shouldn’t need to pre-calculate aggregates, because these systems have proven that they can calculate aggregates on the fly, and you won’t need a two-tier storage strategy (which is complicated, not to mention daft) because analytic warehouses are currently capable of ingesting data much faster than any of the SIEM products. In our survey the highest load rates we found were around 4TB per day: analytic warehouses can often load that much per hour!
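For contrast with the two-tier sketch above, here is what the single-tier alternative looks like: every normalised event lands in one detail table and aggregates are computed at query time. Again SQLite stands in only to keep the example runnable; the argument is about columnar analytic warehouses such as Vertica or Netezza, and the schema and query are hypothetical.

```python
import sqlite3

# One detail table holds every normalised event; there is no separate
# aggregate store to maintain.
wh = sqlite3.connect(':memory:')
wh.execute('CREATE TABLE events '
           '(timestamp TEXT, source TEXT, action TEXT, src_ip TEXT, dst_ip TEXT)')

# ... events stream in continuously ...

# An ad-hoc investigative query: which sources dropped the most connections
# overnight? The aggregate is computed on the fly, and drilling down to the
# underlying events is just another WHERE clause away.
top_droppers = wh.execute(
    "SELECT source, COUNT(*) AS drops "
    "FROM events "
    "WHERE action = 'DROP' AND timestamp BETWEEN ? AND ? "
    "GROUP BY source ORDER BY drops DESC LIMIT 10",
    ('2010-01-12T00:00:00Z', '2010-01-12T06:00:00Z')).fetchall()
print(top_droppers)
```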

So, it would make much more sense to use an analytic warehouse at the back-end: you still get high compression rates but you also get better loading performance and better analytic performance. Yet only SenSage and LogMatrix use such an approach, the former based on proprietary column-based technology and the latter on the Vertica Analytic Warehouse. I am pleased to say, however, that some of the other vendors have cottoned on to this: we are aware of discussions with Greenplum and InfoBright, to name just two, so there may well be an emerging trend in this direction.

And finally, while I am on the subject, there are a number of products that do not even support SQL. Very few can import data mining models from the likes of SAS or IBM SPSS, and only LogRhythm and Tier-3 support PMML (Predictive Model Markup Language), which is the standard for transporting such models. The other guys need to get with it and, if they insist on using flat files, they should at least introduce support for MapReduce: the extra performance this gives is one of the reasons why Splunk, which is complementary to (or in part competitive with) many of the SIEM vendors, has been so successful.
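To illustrate what MapReduce over flat log files means in practice, here is a toy map/reduce sketch in Python: per-file “map” work is farmed out across worker processes and the partial results are merged in a “reduce” step. The file names, log format and “FAILED LOGIN” marker are made up for the example; this is not Splunk’s or any SIEM vendor’s actual implementation.

```python
from collections import Counter
from multiprocessing import Pool

def map_failed_logins(path):
    """Map step: count failed logins per user within one flat log file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if 'FAILED LOGIN' in line:
                user = line.rsplit('user=', 1)[-1].strip()
                counts[user] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-file counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == '__main__':
    # Hypothetical flat log files; each is processed by a separate worker.
    files = ['auth-2010-01-10.log', 'auth-2010-01-11.log']
    with Pool() as pool:
        print(reduce_counts(pool.map(map_failed_logins, files)))
```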