Content Copyright © 2006 Bloor. All Rights Reserved.
I have recently returned from Netezza’s second annual conference. This was well attended, with nearly all of the company’s customers (around 75) being represented, as well as a significant number of both prospects and partners. It was very (to use a technical term) buzzy and there was a degree of enthusiasm that I have rarely encountered. However, what was most interesting for me was the number of things I had not previously appreciated about Netezza’s technical capabilities. And, of course, its roadmap for the future (though I can’t say too much about that).
To begin with, there is the question of indexes. Data warehouse appliances in general, and Netezza in particular, tend to be typecast by detractors as only being good for large table scans, on the grounds that they do not support indexes and therefore cannot run complex joins. In the case of Netezza, at any rate, this is misleading, because it uses what might be described as an anti-index, which is called a zonemap. When you load data in, say, time order, the zonemap breaks it down into blocks and stores the details of the first and last record in each block (a much lower overhead than an index). When you run a query, you then read only the blocks that contain the data you are interested in, ignoring all the others. This ability to limit the data you read makes joins much more effective than they would otherwise be. In its roadmap, Netezza described future approaches that will further reduce the amount of data you need to read.
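The zonemap idea can be sketched in a few lines. To be clear, the names, block size and data below are all invented for illustration; this is not Netezza's actual on-disk format:

```python
# Toy illustration of a zonemap ("anti-index"). All names and sizes
# here are hypothetical, chosen only to show the pruning idea.
BLOCK_SIZE = 4  # records per block (tiny, purely for illustration)

def build_zonemap(records):
    """Split time-ordered records into blocks, keeping only each
    block's first and last value -- far cheaper than a full index."""
    zonemap = []
    for i in range(0, len(records), BLOCK_SIZE):
        block = records[i:i + BLOCK_SIZE]
        zonemap.append((block[0], block[-1], i))  # (first, last, offset)
    return zonemap

def blocks_to_scan(zonemap, lo, hi):
    """Return offsets of blocks that could hold values in [lo, hi];
    every other block is never read at all."""
    return [off for first, last, off in zonemap
            if not (last < lo or first > hi)]

sales_times = list(range(100))       # records loaded in time order
zm = build_zonemap(sales_times)      # 25 tiny (first, last) entries
hits = blocks_to_scan(zm, 42, 49)    # only 3 of 25 blocks need reading
```

The point is that the zonemap itself is tiny (two values per block) and is consulted instead of an index: blocks whose range cannot contain the answer are simply never read.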
Another interesting thing to come out of the conference was that a number of Netezza customers have stopped using aggregates as a result of implementing Netezza. For example, Carphone Warehouse told me that it was both faster and more accurate to calculate directly from the raw data. As aggregates are a major issue for database administrators, being able to get rid of them (or, at least, minimise their use) is a significant benefit. Not that Netezza eschews aggregates altogether: more than one user employs a data warehouse appliance (not only from Netezza) as an aggregating engine sitting in front of a third-party enterprise data warehouse. I will discuss this further in a subsequent article.
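The accuracy point can be illustrated with a toy example (all data invented): a pre-built aggregate goes stale the moment the raw data is corrected, whereas calculating directly from the raw rows is always current:

```python
# Hypothetical illustration of why pre-built aggregates can drift from
# the truth. The data and store names are invented.
raw_sales = [("london", 100), ("leeds", 250), ("london", 75)]

# Pre-computed aggregate, built at load time.
aggregate = {}
for store, amount in raw_sales:
    aggregate[store] = aggregate.get(store, 0) + amount

# A late-arriving correction updates the raw data only.
raw_sales.append(("london", -25))

from_raw = sum(a for s, a in raw_sales if s == "london")  # correct: 150
from_agg = aggregate["london"]                            # stale: 175
```

Keeping the two in step is exactly the maintenance burden that DBAs complain about; if the raw scan is fast enough, the aggregate can simply be dropped.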
And while we are talking about enterprise data warehouses (EDWs), there are several arguments put against using a data warehouse appliance as an EDW. The first is that you can’t use an appliance for complex joins but, as discussed above, this is less and less true, at least as far as Netezza is concerned. Secondly, there is the issue that the large EDW vendors provide pre-built data models. However, one of the things that Netezza has not made much of is that it has partners providing exactly this sort of capability (typically built on either a star or snowflake schema). And, thirdly, there is the question of managing mixed workloads. Here, Netezza offers guaranteed resource allocation (floors, though not yet ceilings), short query bias, materialised views and prioritisation.
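As an illustration of what short query bias means in practice, here is a minimal sketch (the cost model and query names are invented, not Netezza's scheduler): pending queries are ordered so that cheap, likely-interactive work is not stuck behind long scans:

```python
# Toy sketch of "short query bias" in a mixed workload. The cost
# estimates and query names are hypothetical.
import heapq

def schedule(queries):
    """queries: list of (name, estimated_seconds). Returns a run order
    with the cheapest (likely interactive) queries first."""
    heap = [(cost, name) for name, cost in queries]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

order = schedule([("nightly_scan", 3600), ("dashboard", 2), ("report", 40)])
```

A real workload manager would combine this bias with the resource floors mentioned above, so that no class of work is starved entirely.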
Another area in which Netezza has been hiding its light under a bushel is FPGAs (field programmable gate arrays), which it uses to process data as it is streamed off disk. This point is important to understand. Most data warehouse appliances (and, indeed, conventional products) use a caching architecture, whereby data is read from disk and then held in cache for processing. Netezza, on the other hand, queries the data as it comes off disk, before passing the results on to memory. In other words, it uses a streaming architecture in which the data is streamed through the queries (whose programs have been loaded into the FPGA) rather than being stored (even if only in memory) and then queried.
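The difference between the two architectures can be caricatured as follows (a toy sketch, not Netezza code): the caching approach materialises everything before filtering, whereas the streaming approach applies the query to each record as it arrives, so only results ever reach memory:

```python
# Toy contrast of caching vs streaming query execution. Names and the
# simulated "disk" are hypothetical.
def read_disk():
    """Simulates records streaming off disk one at a time."""
    for row in range(100_000):
        yield row

def cached_query(pred):
    """Caching architecture: read everything into memory, then query."""
    cache = list(read_disk())          # all rows held before filtering
    return sum(1 for r in cache if pred(r))

def streaming_query(pred):
    """Streaming architecture: the predicate (the 'query in the FPGA')
    is applied to each record as it arrives off disk."""
    return sum(1 for r in read_disk() if pred(r))
```

Both return the same answer, but the streaming version never holds more than one record at a time; in hardware, that is what lets the filter run at disk speed.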
There are several points to make about this. The first is that you can get much better performance with this sort of approach than with a conventional one. For example, it is stream-based processing that is used for algorithmic trading, where processing requirements are of the order of 150,000 transactions per second. The second is that FPGAs are the natural way of handling streaming environments; they are widely used for voice and video streaming, for example. They are not yet used for event stream processing, but we know of one vendor that plans to do exactly that. Because they are manufactured in volume for such markets, FPGAs are very much a commodity item. Those of us working in more conventional environments may not think of them that way, but they are as much of a commodity as, say, an Intel processor.
And talking of processors, the other choice Netezza has made that may seem odd to some people is that it employs a PowerPC chip rather than an Intel (or AMD) part. Again, this is a commodity device, widely used in small-footprint equipment primarily because of its low power consumption. To be specific, a Netezza Snippet Processing Unit (where a snippet is the compiled SQL query that data is streamed through) requires just 30 watts. A complete Netezza rack with 112 of these and 16.5TB of disk (of which 5.5TB is user data) requires little more than 4kW and produces around 12,000 BTU/hr of heat. Given the power and cooling issues afflicting most data centres today, this is a substantial advantage, as are the reduced floor space requirements.
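As a back-of-the-envelope check, those figures hang together: 112 SPUs at 30 watts account for about 3.4kW on their own, and since 1 watt of sustained draw is roughly 3.412 BTU/hr of heat, 12,000 BTU/hr implies around 3.5kW of total draw, consistent with "little more than 4kW" once hosts, fans and disks are added:

```python
# Sanity check on the quoted rack figures.
spus = 112
watts_per_spu = 30
spu_power = spus * watts_per_spu      # 3,360 W for the SPUs alone

# 1 W of sustained draw is about 3.412 BTU/hr of heat, so the quoted
# 12,000 BTU/hr implies roughly 3.5 kW of total draw.
implied_watts = 12_000 / 3.412
```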
Returning to FPGAs for a moment, their price/performance is following a curve similar to that of conventional processors: both performance and price are expected to improve five-fold by 2010, as is the amount of logic that you can put on an FPGA. This last point is particularly important because it will enable Netezza to introduce even more functionality into the FPGA in the future.
Even with the current FPGAs, Netezza plans to introduce features that will increase raw scan-rate performance, tactical query performance and advanced analytic performance. The advanced analytic capabilities will be made available to partners rather than end users, and will allow predictive analytics vendors (such as SPSS or SAS) to embed, say, scoring capabilities directly into the FPGA, which should provide significant performance advantages.
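To see why embedded scoring is attractive, consider this sketch (the model, its coefficients and the threshold are all invented for illustration): records are scored as they stream past, so only interesting rows ever leave the scan:

```python
# Toy sketch of scoring pushed into the scan itself, rather than
# extracting data to a separate analytics tool. The model is invented.
import math

def score(row):
    """Toy logistic score on one numeric feature."""
    return 1.0 / (1.0 + math.exp(-(0.1 * row - 5)))

def scan_and_score(rows, threshold=0.5):
    """Stream rows through the scoring 'snippet', keeping only hits."""
    return [r for r in rows if score(r) >= threshold]

hits = scan_and_score(range(100))
```

In the appliance, the scoring logic would be compiled into the FPGA alongside the filter, so the model sees every row at disk speed without any data movement.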
Another potential use of the functionality embedded in the FPGA would be to implement column-level encryption. This would be useful for companies in the data aggregation and resale market, for example, because you could use different encryption techniques for each customer’s data. Encryption in general is not available and is not currently on the roadmap. While I would like to see it, it is arguably unnecessary: given the structure of a Netezza appliance, you would need some seriously good hacking skills to read a Netezza disk, even if you could get at one. Column-level encryption on its own may therefore be good enough.
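The key-per-customer idea might look something like the following sketch. To be clear, this toy XOR construction is for illustration only and is not a secure cipher (a real implementation would use a vetted algorithm such as AES), and all names here are invented:

```python
# Toy sketch of column-level encryption with a different key per
# customer. NOT a secure scheme; it only shows key-per-customer usage.
import hashlib

def keystream(key, length):
    """Derive a byte stream from the customer key (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor_column(value, key):
    """Encrypt/decrypt one column value with the customer's keystream."""
    data = value if isinstance(value, bytes) else value.encode()
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

keys = {"customer_a": b"key-a", "customer_b": b"key-b"}
ct_a = xor_column("card-4111", keys["customer_a"])
ct_b = xor_column("card-4111", keys["customer_b"])
# same plaintext, different ciphertexts; applying xor_column again
# with the same key decrypts
```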
To conclude, I was surprised by this conference, not just by the enthusiasm of the attendees but also by some of the functionality that Netezza can offer, which I don’t think it has done a good job of explaining to the market. It has, for obvious reasons, concentrated on performance, price and reduced cost of ownership but, to take TCO as an example, it has tended to focus on the removal of indexes and tuning without discussing its advantages when it comes to aggregates. Similarly, it hasn’t really explained why using FPGAs is a good idea, it hasn’t made it clear that zonemaps are a form of anti-index, and it hasn’t talked much about its advantages in the data centre. Given all of this, and adding in the rich set of new features in the company’s roadmap (a number of which I have not mentioned), there is no reason to expect Netezza to do anything but go from strength to strength.