The optimal warehouse

Suppose you wanted the best possible combination of features to provide ultimate performance in data warehousing? Actually, you probably do. What would this look like?

Well, to begin with you would probably want a hybrid storage architecture. Columns are great for large table scans but they are not a lot of use for real-time updates, look-ups or queries involving just a few rows. So, it would probably be a good idea to have a row store as well as a column store.

Secondly, you would want to maximise the use of available chip technology. Standard chips (not multi-cores) have lots of parallel capabilities but databases cannot typically make use of these. For example, suppose you load all the relevant data into memory so that there is no I/O effect, then a typical hand written C++ query will run orders of magnitude faster than a typical database query. You would really like to get close to C++ performance if you can.

Next, of course you would want to support parallelism across any multi-core CPUs within each processor you are using, as well as across processors. And you would want to support pipeline parallelism (where you can start off the next part of a query before the previous part has finished).

And, finally (there are probably more things to consider but this will do for now) you would not only support direct attached storage (DAS) for the fastest possible I/O but also a storage area network (SAN) for its high availability characteristics. However, the latter does not perform as well as DAS so, again, you would want some sort of hybrid mechanism to ensure best possible performance at the same time as providing all that resilience.

The good news is that there are companies doing all of these things though no one company is doing all of them.

On the storage side, SAP BW already has a hybrid architecture while Ingres in conjunction with VectorWise recently announced that it would also be supporting a similar column/row architecture. Of course, there are companies that having been addressing the large table scan issue in other ways, such as Netezza, and Oracle with Exadata, but both of these rely on hardware/software approaches rather than pure software.

As for the exploitation of chips is concerned this was precisely the big news from Ingres and VectorWise. By using vector processing techniques they are able to get much better performance out of their CPUs. Note, however, that this is not a vector database in the sense of storing data in arrays, merely in processing terms. I suppose you could argue that this is what Netezza has been doing at the FPGA level; and anyone who supports compiled queries will get better performance than otherwise, but probably not approaching the performance that VectorWise is claiming.

More general parallelism is of course widespread though the companies that have built cross-core parallelism have not as yet (as far as I know) extended this to cross-processor parallelism. However, it seems an obvious next step.

Finally, in so far as hybrid DAS and SAN support with optimal performance is concerned, this is just what ParAccel has introduced in its latest release.

So, the omens are looking good. This is just as well. With Ingres/VectorWise entering the market this means there will be 24 vendors in the market by my reckoning, not counting specialised plays, and that’s far too many. Who’s going to disappear? The ones that are not flexible enough to cope with the sort of developments just discussed.