Database clustering

For what seems a very long time, Oracle has dominated the market for clustered databases with its Real Application Clusters (RAC), but things appear to be changing and, indeed, the market for clustering seems to be polarising.

There are three main reasons for wanting to implement clustering in association with databases. The first is performance, because you can spread the processing load across multiple nodes; the second is scalability, since you can expect to get reasonable performance across a bigger system; and the third is high availability or, more particularly, continuous availability (the difference being that the former protects you against unplanned outages whereas the latter means that you can also handle planned outages).

For some vendors the emphasis seems to be mainly on availability. For example, xkoto (which provides clustering for DB2) was initially set up to focus on DB2 performance, but when I met with them last week their position had changed: their view now was that “performance is nice to have but the real issue is continuous availability”.

In a similar vein, Sybase announced its ASE Cluster Edition around the turn of this year. Again, the emphasis is as much on continuous availability as it is on performance, though here Sybase is also focusing on its advantages over Oracle RAC, with the company claiming a much lower administrative overhead with capabilities such as automated load balancing and failover.

However, there are companies where the emphasis is still more on performance. In a slightly different market, I attended IBM’s EMEA Information on Demand conference last week and one of the presentations was on implementing Information Server across a grid (or cluster). Actually, this is quite a cool capability for a data integration product because you don’t have the same availability issues: you don’t have to fail over user connections, for example. If you’ve got a parallel ETL (extract, transform and load) job running and one of the streams breaks, it’s not a big deal; you just re-assign that stream to another node.
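The re-assignment idea for a parallel ETL job can be sketched roughly as follows. This is only a minimal illustration of the general technique, not a description of how Information Server implements it: the function `run_partitioned_job`, the `NodeFailure` exception, the node names and the `failing_nodes` parameter (used here to simulate a broken stream) are all hypothetical, and the "nodes" are simulated in a single process.

```python
from collections import deque


class NodeFailure(Exception):
    """Raised when a (simulated) node cannot finish its partition."""


def run_partitioned_job(partitions, nodes, transform, failing_nodes=frozenset()):
    """Apply `transform` to every row of every partition.

    Partitions are handed to nodes round-robin. If a node fails, it is
    dropped from the pool and its partition is queued again so that
    another node picks it up. `failing_nodes` is an assumption of this
    sketch: it simply marks which simulated nodes break.
    Returns (results, assignments), both keyed by partition index.
    """
    healthy = deque(nodes)
    pending = deque(enumerate(partitions))
    results = {}
    assignments = {}

    while pending:
        if not healthy:
            raise RuntimeError("no healthy nodes left to run the job")

        index, rows = pending.popleft()
        node = healthy[0]
        healthy.rotate(-1)  # round-robin: next partition goes to the next node

        try:
            if node in failing_nodes:
                raise NodeFailure(node)
            results[index] = [transform(row) for row in rows]
            assignments[index] = node
        except NodeFailure:
            # The stream broke: retire the node and re-assign its partition.
            healthy.remove(node)
            pending.append((index, rows))

    return results, assignments


if __name__ == "__main__":
    results, assignments = run_partitioned_job(
        partitions=[[1, 2], [3, 4], [5, 6]],
        nodes=["node-a", "node-b", "node-c"],
        transform=lambda value: value * 10,
        failing_nodes={"node-b"},
    )
    for index in sorted(results):
        print(index, assignments[index], results[index])
```

In this run the partition first given to the failing node is simply queued again and completed by one of the surviving nodes; no user session has to be moved, which is why availability is less of a concern for this kind of workload than it is for a clustered database.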

DATAllegro, of course, also supports the use of grid computing for both performance and scalability reasons while Vertica also supports clustering.

However, perhaps most interesting in the data warehousing space is that EXASOL has taken a completely different approach, building an extension to the Linux kernel, called EXACluster OS, that underpins its data warehouse. This potentially supports hundreds if not thousands of nodes (with built-in load balancing; the goal is to optimise performance across low-cost hardware) while still offering continuous availability.

Actually, this last approach strikes me as being entirely sensible: why not leave clustering to the operating system rather than implement it in the database? The big advantage is that you can boot the whole cluster in one operation rather than having to boot each node separately. And, of course, you can build other associated functionality into the operating system (which EXASOL is doing). So if anybody is thinking about how they can leverage clustering in their database they could do worse than speak to EXASOL about licensing their operating system and having that do it instead.