The Ingres VectorWise project

In case you missed it, Ingres and VectorWise recently announced that they will bring a joint offering to the data warehousing market next year. Most of the press and other commentary on this announcement has focused on the use of vector processing (hence the name VectorWise) to improve computational efficiency. However, there is much more to it than that.

VectorWise was founded last year as a spin-off from the Centrum Wiskunde & Informatica (CWI) in Amsterdam, which originated the MonetDB column-based database project. The founders of VectorWise, Marcin Zukowski and Peter Boncz, both worked on that project, but VectorWise is based specifically on work done by Marcin for his PhD thesis; see “MonetDB/X100 – A DBMS in the CPU Cache” by Zukowski et al. (http://sites.computer.org/debull/A05june/mzukowski.ps).

In so far as the commercialisation of this research is concerned, there are in fact four major elements: computational efficiency in the CPU, use of columns, compression and co-operative scans.

I will not spend long on computational efficiency because it has been widely covered in the media. Put simply, by using vector processing that takes advantage of the characteristics of modern CPUs, you get much closer to the compiled performance you would expect when using C, say, rather than SQL: orders of magnitude faster for computationally intensive tasks. What I haven’t seen widely reported, however, is that VectorWise intends to make this transparent to users, so you should be able to continue using SQL just as you always did and simply get much faster results.
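To make the idea of vector processing a little more concrete, here is a minimal sketch in Python. It is my own illustration, not VectorWise code: the function names, the query and the vector size of 1024 are all assumptions. It contrasts evaluating a filter-and-sum one row at a time with evaluating it over small vectors of column values, where each primitive does one simple thing in a tight loop.

```python
from typing import Iterator, List, Sequence

VECTOR_SIZE = 1024  # hypothetical vector length, chosen only for illustration


def revenue_tuple_at_a_time(prices: Sequence[int], quantities: Sequence[int],
                            min_qty: int) -> int:
    """Evaluate the filter, the multiply and the sum one row at a time."""
    total = 0
    for price, qty in zip(prices, quantities):
        if qty >= min_qty:
            total += price * qty
    return total


def vectors(column: Sequence[int], size: int = VECTOR_SIZE) -> Iterator[Sequence[int]]:
    """Split a column into consecutive vectors of at most `size` values."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def select_ge(vec: Sequence[int], threshold: int) -> List[int]:
    """Primitive: positions within the vector whose value is >= threshold."""
    return [i for i, value in enumerate(vec) if value >= threshold]


def multiply_selected(left: Sequence[int], right: Sequence[int],
                      selected: Sequence[int]) -> List[int]:
    """Primitive: element-wise product, computed only at the selected positions."""
    return [left[i] * right[i] for i in selected]


def revenue_vector_at_a_time(prices: Sequence[int], quantities: Sequence[int],
                             min_qty: int, size: int = VECTOR_SIZE) -> int:
    """Same query, but each primitive processes a whole vector per call."""
    total = 0
    for price_vec, qty_vec in zip(vectors(prices, size), vectors(quantities, size)):
        selected = select_ge(qty_vec, min_qty)
        total += sum(multiply_selected(price_vec, qty_vec, selected))
    return total
```

Both functions return the same answer. In Python you would not see the speed-up, of course; the point is only the shape of the execution. In a compiled engine the per-vector primitives become tight loops over arrays that fit in the CPU cache, which is where the efficiency is claimed to come from.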

The second major part of this announcement is that the combined Ingres/VectorWise product that comes to market will use a hybrid architecture with both row-based and column-based storage. Quite when you would use one rather than the other has not been announced, but it is not hard to guess. For all their benefits, columns are not very good for real-time updating, for look-ups (when supporting MDM, for instance) or for queries where only a few rows need to be read. It is not hard to think of ways in which you might split, or replicate, data storage to optimise different sorts of operation.
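The difference between the two layouts can be sketched with a toy model. This is purely my own illustration and says nothing about how the Ingres/VectorWise product will actually store data; the function names and the counting scheme are assumptions. Each function returns its result together with a rough count of how much storage it had to touch.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def to_columns(rows: List[Row]) -> Dict[str, List[object]]:
    """Pivot a list of row records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def lookup_in_rows(rows: List[Row], position: int) -> Tuple[Row, int]:
    """Fetch one record from row storage: a single record is visited."""
    return dict(rows[position]), 1


def lookup_in_columns(columns: Dict[str, List[object]], position: int) -> Tuple[Row, int]:
    """Fetch one record from column storage: every column must be visited."""
    record = {name: values[position] for name, values in columns.items()}
    return record, len(columns)


def total_in_rows(rows: List[Row], field: str) -> Tuple[int, int]:
    """Sum one field from row storage: every field of every row is read past."""
    touched = sum(len(row) for row in rows)
    return sum(row[field] for row in rows), touched


def total_in_columns(columns: Dict[str, List[object]], field: str) -> Tuple[int, int]:
    """Sum one field from column storage: only that column's values are read."""
    values = columns[field]
    return sum(values), len(values)
```

In this model a look-up is cheapest against the row layout, while an aggregate over a single field is cheapest against the column layout, which is the sort of split one would guess a hybrid architecture is designed to exploit.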

The last two features of what VectorWise is doing seem to have been largely ignored. The first is compression. Everybody does compression nowadays, so it might appear to be no big deal, but VectorWise is focusing not on optimum compression ratios (after all, how cheap is storage these days?) but on de-compression speed. It will be interesting to see the figures when the product is actually released. The trade-off is between I/O on the one hand (less aggressive compression means more data to read) and in-processor performance on the other (faster de-compression): the gain from the latter will need to exceed the cost of the former, but presumably that’s the whole point.
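The trade-off can be put into a back-of-the-envelope model. The sketch below is mine and deliberately crude: it assumes that reading and de-compressing do not overlap, so scan time is simply I/O time plus de-compression time, and every figure in it is hypothetical rather than a VectorWise measurement.

```python
def scan_seconds(raw_gb: float, compression_ratio: float,
                 io_gb_per_s: float, decompress_gb_per_s: float) -> float:
    """Estimated time to scan `raw_gb` of data stored in compressed form.

    Assumes I/O and de-compression happen one after the other (no overlap).
    """
    io_time = (raw_gb / compression_ratio) / io_gb_per_s  # read the compressed bytes
    cpu_time = raw_gb / decompress_gb_per_s               # expand back to raw size
    return io_time + cpu_time


# Hypothetical figures, for illustration only.
RAW_GB = 100.0
IO_GB_PER_S = 0.5

# Heavier scheme: better ratio, slower to de-compress.
heavy = scan_seconds(RAW_GB, compression_ratio=5.0,
                     io_gb_per_s=IO_GB_PER_S, decompress_gb_per_s=0.2)

# Lighter scheme: worse ratio, much faster to de-compress.
light = scan_seconds(RAW_GB, compression_ratio=3.0,
                     io_gb_per_s=IO_GB_PER_S, decompress_gb_per_s=2.0)

print(f"heavy: {heavy:.1f}s, light: {light:.1f}s")
```

With these made-up numbers the lighter scheme reads more from disk but wins comfortably overall, because the time saved in the processor is larger than the extra I/O. Change the figures and the answer can go the other way, which is why the real benchmarks will matter.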

Finally, VectorWise is working on co-operative scans. It isn’t the first to do so (SQL Server and DB2, for example, both have capabilities in this area) but VectorWise is aiming at something more sophisticated. The idea is that when two queries are both scanning the same data, you read the data only once and pass what you have read to both queries, thus providing significant I/O reductions.
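The basic idea of sharing a scan can be sketched as follows. This shows only the simple case described above, where each block is read once and handed to every interested query; it does not attempt whatever more sophisticated scheduling VectorWise has in mind, and the class and function names are hypothetical.

```python
from typing import Callable, Dict, List

Block = List[int]
# A query is modelled as a fold: (accumulator so far, next block) -> new accumulator.
Query = Callable[[int, Block], int]


class CountingTable:
    """A table stored as blocks that counts how many block reads it serves."""

    def __init__(self, blocks: List[Block]) -> None:
        self.blocks = blocks
        self.reads = 0

    def read_block(self, index: int) -> Block:
        self.reads += 1  # stands in for one physical I/O
        return self.blocks[index]


def independent_scans(table: CountingTable, queries: Dict[str, Query]) -> Dict[str, int]:
    """Each query scans the whole table by itself."""
    results = {}
    for name, query in queries.items():
        acc = 0
        for i in range(len(table.blocks)):
            acc = query(acc, table.read_block(i))
        results[name] = acc
    return results


def shared_scan(table: CountingTable, queries: Dict[str, Query]) -> Dict[str, int]:
    """Each block is read once and passed to every query."""
    accs = {name: 0 for name in queries}
    for i in range(len(table.blocks)):
        block = table.read_block(i)
        for name, query in queries.items():
            accs[name] = query(accs[name], block)
    return accs
```

Run two queries over a three-block table and the independent version performs six block reads where the shared version performs three, with identical results. The saving grows with the number of concurrent queries scanning the same data.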

Of course, we will have to wait for all of this to come to fruition but it looks very promising. There is still plenty of warehousing market left to address and I am impressed with what VectorWise is doing. Moreover, its partnership with Ingres will enable it to reach a much bigger audience more quickly than would otherwise be the case.