Another choice for Hadoop

Content Copyright © 2012 Bloor. All Rights Reserved.
Also posted on: The IM Blog

I have written about RainStor previously. Until now it has had a single product, now known as RainStor Data Retention, but it has announced a second: RainStor Data Analytics for Hadoop.

First, a reiteration of what the basic product does. It provides a highly compressed file system. Two features are worth mentioning. The first is that if you are using RainStor for relational data (typically, for application retirement or archival) then RainStor ingests the schema as well as the data. It then supports schema evolution, so that you can run queries as of a point in time (that is, you can look at the data exactly as it would have appeared at that time). Secondly, it includes a query engine that supports (translates) incoming SQL so that you can run conventional business intelligence environments against RainStor. Note that its SQL support is fully ANSI compliant.

Apart from archival (for which RainStor partners with both Teradata and Informatica – RainStor is embedded in the latter’s archival offering) the major market for RainStor has been retention of call detail records (CDRs), IP detail records (IPDRs), log data and so forth. The product’s compression rates are impressive in these areas: for example, 35 times compression for CDRs, 20 times for IPDRs and 15 times for log data. For certain types of data, compression rates can be 40 times or more. One of its customers is ingesting 17 billion IPDRs per day, which is the highest ingestion rate I have heard of from any vendor.

However, we are here to discuss RainStor and Hadoop. What the company has done is to allow the storage of RainStor partitions within HDFS (the Hadoop Distributed File System) and it has added native MapReduce support to its existing SQL capabilities. The company has also partnered with Cloudera, Hortonworks and MapR so, in the last case at least, you don’t have to worry about single points of failure in your Hadoop cluster.

There are four interesting things about this implementation. The first is the compression that you can get. Typical Hadoop implementations use pretty inefficient compression algorithms (something equivalent to zip files) that get you around a 3 times compression ratio. Taking the 15 times that RainStor achieves for log data as its worst case, that means that using RainStor you only need around 20% (3 divided by 15) of the disk capacity that you would require otherwise, and it could be very much less than this. That’s a significant saving.
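The arithmetic behind the 20% figure can be sketched as follows. This is only an illustration using the ratios quoted in this article; the function name is made up for the example, and it assumes that a compression ratio means original size divided by compressed size and that everything else (replication, for instance) is the same in both cases.

```python
# Ratios quoted in the article (original size / compressed size).
HADOOP_TYPICAL_RATIO = 3
RAINSTOR_RATIOS = {"CDRs": 35, "IPDRs": 20, "log data": 15}


def relative_disk_needed(rainstor_ratio, baseline_ratio=HADOOP_TYPICAL_RATIO):
    """Fraction of the baseline's disk needed to hold the same raw data.

    Hypothetical helper: compressed size is raw / ratio, so the fraction
    is (raw / rainstor_ratio) / (raw / baseline_ratio).
    """
    return baseline_ratio / rainstor_ratio


for kind, ratio in RAINSTOR_RATIOS.items():
    share = relative_disk_needed(ratio)
    print(f"{kind}: {share:.1%} of the disk a 3x-compressed store would need")
```

On these numbers log data is the worst case at 20%, while CDRs need under 9% of the disk.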

The second interesting feature of RainStor Data Analytics is that RainStor, as you might expect from its use for the retention of CDRs and so forth, understands timestamps, which Hadoop does not. In this sense the use of RainStor makes it a competitor to Cassandra and other column-family databases.

Thirdly, if you are using RainStor in HDFS then you can choose what data you want to flatten. So, because RainStor understands SQL, you can store relevant data in tabular format if you want to and you can flatten other data where that is not relevant. And, of course, the SQL supported by RainStor means that you can perform operations such as joins, which would not otherwise be possible. It is also worth noting that RainStor files are treated as first-class objects within the context of Hadoop and MapReduce, so there is no need to change existing scripts (in Pig, say) beyond changing a single parameter that points the query to RainStor partitions.

Finally, it is worth commenting on query performance. RainStor uses what are known as Bloom filters (named for the inventor Burton H. Bloom) which are space-efficient probabilistic data structures that are used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Put simply, these filters tell the system where the data is not (rather like a Netezza ZoneMap) so that the software looks for the data it needs only within relevant data blocks. The advantage of using these Bloom filters (which, technically, are based on bit vectors) is that they greatly increase performance and, at the same time, they require much less management and overhead than indexes. Note that the filters, like the data, are compressed. The upshot of all of this is that you get much better performance than you would when using HDFS on its own. Of course, you will get improved performance from simply having better compression but the use of Bloom filters takes this a significant step further.
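To make the idea concrete, here is a minimal sketch of a Bloom filter used to skip data blocks. It is a generic textbook version, not RainStor's implementation: the class, the block names, the filter size and the choice of salted SHA-256 hashes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Set membership test: false positives possible, false negatives not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # If any bit is unset, the item was definitely never added.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical data blocks, each with its own filter built at load time.
blocks = {
    "block-0": ["alice", "bob"],
    "block-1": ["carol", "dave"],
    "block-2": ["erin", "frank"],
}
filters = {}
for name, values in blocks.items():
    bf = BloomFilter()
    for value in values:
        bf.add(value)
    filters[name] = bf


def find(value):
    """Scan only the blocks whose filter says the value might be present."""
    candidates = [name for name, bf in filters.items() if bf.might_contain(value)]
    # A false positive only costs one wasted scan; the scan gives the real answer.
    return [name for name in candidates if value in blocks[name]]


print(find("carol"))
```

A block whose filter answers "no" is never read, which is the sense in which the filter tells the system where the data is not. Unlike an index, nothing has to be maintained beyond setting bits as data is loaded.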

This is an impressive list of features – (much) less disk, more capability and better performance – what’s not to like? Perhaps the most obvious thing is that you might want to be able to combine MapReduce and SQL within a single query but I understand that this is on the company’s roadmap. It might also make sense for RainStor to implement support for IBM’s GPFS as well as HDFS. The only other point worth making is that RainStor offers an “append only” data store. That is, you cannot update records. This is necessary for compliance reasons in the company’s current markets. For many Hadoop implementations this will not be an issue (event data is not typically updated) although there may be additional housekeeping: removal of previous data and writing of new data, rather than just updating. It is also possible to conceive of environments where you might genuinely want to update data, in which case the data will need to be stored outside of RainStor. Leaving these considerations aside, RainStor looks like a very competitive option for supporting Hadoop.