Data warehousing update: Vertica

Vertica, as most readers probably know by
now, offers a column-based approach to data warehousing. To be honest, I have
written about the advantages of using columns to support analytic environments
so many times since the late 90s (when I first encountered it) that I have got
bored with explaining it. If you don’t understand that using columns means
better performance, less disk space (partly because data is easier to compress
by column and partly because you don’t need indexes) and less administration,
then you haven’t been listening.
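
For anyone who would still like the compression point illustrated, here is a minimal sketch in Python. It is not how Vertica or any other product stores data: the sample rows and the function names (to_columns, rle_encode, rle_decode) are my own assumptions, and run-length encoding is just one simple technique that works well on a column of repeated values.

```python
from itertools import groupby

# Hypothetical sample rows: (customer_id, region, status)
rows = [
    (1, "EU", "active"),
    (2, "EU", "active"),
    (3, "EU", "closed"),
    (4, "US", "active"),
    (5, "US", "active"),
    (6, "US", "active"),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def rle_encode(values):
    """Run-length encode a list as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in pairs for _ in range(length)]


ids, regions, statuses = to_columns(rows)
encoded_regions = rle_encode(regions)    # six values collapse to two runs
encoded_statuses = rle_encode(statuses)  # three runs: less ordered, less gain
```

Stored by row, the region values are interleaved with everything else; stored by column they sit next to one another, which is why the six region values collapse into two runs.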

In any case, there are so many column-based
databases now (Sybase IQ, Vertica, ParAccel, Calpont soon, Alterian, Kx
Systems, SAND, Sensage) that this can now be regarded as a standard market in
its own right.

So, leaving aside the column versus row
argument, what is different about Vertica? The short answer is that it has
been designed to support a grid architecture (connected by one or two Gigabit
Ethernet interconnects) that exploits many low cost nodes, each with local disk
storage, rather than necessarily being implemented on top of a conventional
architecture with massively parallel processing at the back-end. You can also implement
Vertica on top of a SAN configured as direct attached storage if you wish.

However, it is the way that Vertica uses
this grid that is important. Vertica distributes data across the nodes in the
grid using what it calls “projections,” which are effectively the same thing as
materialised views. As I mentioned above, column databases like Vertica
compress data very aggressively (often achieving 90% compression), and Vertica
will use as much of the space saved as you give it to store multiple sets
of overlapping columns (projections), with different sort orders, on different
nodes in the grid. Put simply, this means that you can have a group
of columns sorted in one order on one node and in another order on another
node.
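
To make the idea concrete, the following Python sketch keeps two overlapping copies of the same data in different sort orders, one per node. Everything here is an assumption for illustration: the sample sales rows, the node names and the build_projection helper are hypothetical, and this is not Vertica's syntax or storage format.

```python
# Hypothetical table rows: (sale_date, store, amount)
COLUMNS = ("sale_date", "store", "amount")
sales = [
    ("2008-01-03", "Leeds", 40),
    ("2008-01-01", "York", 25),
    ("2008-01-02", "Leeds", 10),
    ("2008-01-01", "Bath", 30),
]


def build_projection(rows, columns, sort_key):
    """Copy the chosen columns and store them sorted by sort_key."""
    positions = [COLUMNS.index(name) for name in columns]
    projected = [tuple(row[p] for p in positions) for row in rows]
    key_positions = [columns.index(name) for name in sort_key]
    return sorted(projected, key=lambda row: tuple(row[p] for p in key_positions))


# Two overlapping projections, each held on a different (hypothetical) node.
nodes = {
    "node1": build_projection(sales, ("sale_date", "store", "amount"), ("sale_date",)),
    "node2": build_projection(sales, ("store", "amount"), ("store", "amount")),
}
```

Both nodes hold the store and amount columns, so either copy can answer a query on them, and the overlap is also what provides the redundancy.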

Vertica uses the active redundancy built
into the projections to parallelise querying (for better concurrency) and to
support failover and recovery. So, the idea is that if you have a query that
involves a join across several columns then your query is directed to the node
that has the particular sort order needed for that combination of columns, or
as close to it as possible. Clearly, if columns are pre-sorted in a way that
best suits a particular query then you will get much better performance.
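
A rough sketch of that routing decision, in Python, might look like the following. The scoring rule (count how many leading sort columns the query also uses) is my own simplification and not Vertica's optimiser; the projection catalogue, node names and the functions prefix_match and choose_node are all hypothetical.

```python
def prefix_match(sort_order, query_columns):
    """Count how many leading sort columns the query also uses."""
    count = 0
    for column in sort_order:
        if column not in query_columns:
            break
        count += 1
    return count


def choose_node(projections, query_columns):
    """Pick the node whose projection covers the query and whose
    sort order matches the query's columns most closely."""
    wanted = set(query_columns)
    candidates = {
        node: spec
        for node, spec in projections.items()
        if wanted <= set(spec["columns"])
    }
    if not candidates:
        return None  # no single projection holds every column needed
    return max(
        candidates,
        key=lambda node: prefix_match(candidates[node]["sort_order"], wanted),
    )


# Hypothetical catalogue of which projection lives on which node.
projections = {
    "node1": {"columns": ("sale_date", "store", "amount"), "sort_order": ("sale_date",)},
    "node2": {"columns": ("store", "amount", "sale_date"), "sort_order": ("store", "amount")},
}

by_store = choose_node(projections, ["store", "amount"])
by_date = choose_node(projections, ["sale_date", "amount"])
```

Here a query on store and amount goes to node2, while one on sale_date and amount goes to node1, in each case because that node's sort order is the closest fit.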

Of course there are other notable features
of Vertica: it processes the data while it is still compressed (though it is not
alone in this); it makes use of in-memory capabilities for writing data to the
database; and it comes with its own DBDesigner, which will generate an optimised
physical schema based on the logical data model, a training set of data and
queries, the number of nodes in the grid, and the number of concurrent node
failures the database must be able to tolerate to provide a highly available environment.
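
As an illustration of what processing data while still compressed can mean, here is a small Python sketch that answers a count and a sum directly from run-length encoded columns. It assumes run-length encoding purely for simplicity; I am not claiming this is the encoding or the execution method Vertica uses, and the column contents and function names are hypothetical.

```python
def count_equal(rle_column, target):
    """Count rows equal to target by adding up run lengths,
    without expanding the runs back into individual values."""
    return sum(length for value, length in rle_column if value == target)


def rle_sum(rle_column):
    """Sum a numeric column by multiplying each value by its run length."""
    return sum(value * length for value, length in rle_column)


# Hypothetical columns, already stored as (value, run_length) pairs.
region = [("EU", 3), ("US", 3)]
quantity = [(5, 4), (2, 1), (7, 1)]

us_rows = count_equal(region, "US")
total_quantity = rle_sum(quantity)
```

The work done is proportional to the number of runs rather than the number of rows, which is where the benefit comes from.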

A further particularly nice feature that
will be introduced in the future is intelligent query monitoring. Put simply,
this will monitor the actual queries that hit the database so
that you can build up an understanding of the pattern of those queries, which
can then be linked back to the projections in use, along with DBDesigner, so
that you can optimise sort orders, for example, to best meet today’s query
patterns. Note that the potential for this capability was designed into the
environment in advance: it would be extremely difficult to retrofit such a
facility.
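
Since this facility has not yet been released, the following Python sketch is only my guess at the general shape of the idea: record which columns each query touches, then report the most common combinations as candidates for new sort orders. The query log format and the functions column_usage and suggest_candidates are hypothetical and do not describe Vertica's design.

```python
from collections import Counter


def column_usage(query_log):
    """Count how often each combination of columns is queried.
    Column names are sorted so that order within a query is ignored."""
    return Counter(tuple(sorted(columns)) for columns in query_log)


def suggest_candidates(query_log, top_n=2):
    """Return the most frequently queried column combinations."""
    return [columns for columns, _ in column_usage(query_log).most_common(top_n)]


# Hypothetical log: the columns each observed query filtered or joined on.
query_log = [
    ["store", "sale_date"],
    ["sale_date", "store"],
    ["customer"],
    ["store", "sale_date"],
    ["customer"],
    ["amount"],
]

candidates = suggest_candidates(query_log)
```

A real implementation would need to weigh query cost and the space available for extra projections, not just frequency.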

On the commercial side, the product is
available either as pure software (including a free “try before you buy” offer)
or as a pre-installed package on HP hardware with Red Hat Linux. Something over
half of the company’s beta sites have so far converted into paying customers
and I understand that some of these are now fully deployed as production
systems. At present the company’s largest configurations are measured in tens
of terabytes (raw data: the actual size will be significantly less than this
because of compression) and Vertica expects to exceed 100TB within 12 months.
On the user scalability side the company reports successful tests with hundreds
of concurrent users.

All in all then: a very promising start.
But given the advantages of column-based approaches I am not surprised.