SQL and NoSQL

It is apparent that there is significant demand for SQL in the NoSQL world. Whether it is HQL (the Hive version of SQL) or CQL (the Cassandra version) there are clearly people who like a declarative, as opposed to a procedural, approach to accessing big data.

However, there is a big problem here: the reason why SQL works as well as it does in traditional, usually relational, environments is that the databases it is accessing have optimisers. Those optimisers fundamentally do two things: first, they rewrite crappy SQL code into something that is as efficient as it can be. That’s a big advantage: anybody can write poor code, whether you’re using MapReduce or SQL, but with an optimiser that gets handled for you. Of course, there’s a downside: it encourages sloppy programming but that’s another story.

The second thing that a database optimiser does for you is to determine how best to perform joins and other SQL functions. Of course, you may not have the possibility of a join within a NoSQL database but that in itself is a restriction. Anyway, the optimiser is also optimising the performance of all the other SQL capabilities that are available and some of these, at least, will be available in NoSQL databases.

So the bottom line is that you aren’t going to get much in the way of performance out of simply bolting SQL onto a NoSQL source unless you also design in an optimiser.

Interestingly, this is essentially what EMC has done with Hawq SQL in its Pivotal HD product, which is essentially a re-working of the EMC Greenplum parallel engine to support Hadoop. By coming from a conventional database background that already has an optimiser, extending or re-working it to support Hadoop environments is a great deal easier than simply creating a Hadoop optimiser from scratch, and this is precisely what Hawq offers.

Going a step further, it is worth remembering that Teradata’s polymorphic storage for Aster Data provides you with the ability to use either a native Aster storage engine or an HDFS storage engine (or both) under Aster Data. And, of course, to do this effectively you need to extend the optimiser so that it supports the whole environment. See also my article “DB2: a relational epithet is no longer enough” in which I described the fact that DB2 already has three storage engines, two of which are NoSQL engines (XML storage and a graph store – though you would typically use XQuery and SPARQL to access these), and that it is likely to add more such stores in the future. Again, one of the reasons why adding new storage engines is no trivial matter is precisely because of the need to extend the optimiser to understand the different storage. However, once you do so you are likely to get much better performance than you would do from a NoSQL database with SQL but no optimiser.