Hive, DataRush and Hadoop

Hive provides SQL access to big data stored in Hadoop. However, it is extremely limited. For example only equijoins are supported, neither indexes nor temporal types (dates, timestamps etc) are supported, sub-queries in WHERE clauses are not allowed, and ORDER BY is run on a single reducer, which means that it is very slow. And this is just a few examples. Moreover, native Hive isn’t even multi-threaded (though ZettaSet’s implementation is). So, it is limited and performs badly.

You could say the same about Hadoop itself. If there is one thing that it is not famous for it is its performance (or maybe lack of manageability, ease of use or high availability-unless you use a distribution from someone like MapR or complementary software from a company such as ZettaSet).

So performance is an issue both for Hive and Hadoop.

Pervasive Software is aiming to address both of these issues for anything data-intensive, through the use of DataRush. I have written about this before, as has my colleague David Norfolk, but a brief refresh is in order. Basically, DataRush is a high performance engine that you could use for just about anything, but in this case is used to speed up the performance of either Hive or Hadoop or both. DataRush is seriously (and I mean seriously) fast and the reason is because it has been designed to exploit all the parallelism implicit in multi-core servers.

As an aside, this is a major problem with almost all software vendors: they may write great software that parallelises across multiple servers but almost no-one writes software that can parallelise across multiple cores as well and scale to take advantage of however many cores you have. What’s worse, performance typically degrades as you add cores if the software has not been designed for multicore. To take a simple example, my son frequently complains that one of his favourite games runs slower on his quad core laptop than it did when the game first came out running on a single core. There is an overhead involved in having multiple cores and unless software is written specifically to take account of those cores then it will slow down.

Anyway, back to DataRush. DataRush is an engine that takes care of all that multi-core parallelism for you and lets you scale to however many cores you want (the company has 48 core servers running in its labs). Basically, you write your application and DataRush runs it for you so that all that parallelism is hidden. Of course, the reason why most software developers don’t write for multi-cores is that it’s difficult. DataRush hides this complexity from you. In the case of Hive, getting back to the point, Pervasive reckons that in its first release, what it calls TurboRush for Hive will run queries three times faster than using Hive on its own. And you don’t have to understand DataRush programming: that’s all hidden away. Note that this doesn’t fix the deficiencies in Hive: just makes it run faster.

On the Hadoop front you can take either of two approaches. Either you can use it to make MapReduce run faster by calling out to DataRush or you can implement DataRush in a distributed fashion in lieu of MapReduce, in conjunction with HDFS. Using the former approach runs some 4 times faster according to the Malstone B test (a standard benchmark for performance-looking for security intrusion information within web logs) while the latter approach is an order of magnitude faster (10 minutes against 0.5Tb versus 33 minutes versus 135 minutes with the same Hadoop cluster).

These figures are impressive. However, it’s not just the performance. Of course you can build a 4,000 node Hadoop cluster but if you can scale up the number of cores in each server, by using DataRush, then you will need fewer servers: wouldn’t you rather have a 500 node cluster with 32 core machines than those 4,000 quad core machines? It will cost you less money and take up less floor space, require less cooling and less power. Combine that with a ten or twelve times performance boost and you could be talking about a cluster of servers that is only measured in tens not hundreds or thousands and that makes even more sense.

I have been impressed with DataRush for some time but, to be honest, it seemed like a solution looking for a problem. Well, the performance of Hive and Hadoop (and HBase-you can use DataRush as a high speed loader) is a problem that needs addressing and DataRush is doing exactly that.