Taking the big out of data

One of the problems with Hadoop is that its security isn’t great. Provided you have some governance in place, this may not matter when you are processing social media data, where the task is simply to pick up stray social security or credit card numbers and mask them. It matters a great deal, however, if sensitive information is intrinsic to the data you are trying to analyse or, indeed, if you want to use Hadoop to archive data that needs to be secure. It’s not that you don’t get any security with Hadoop (Kerberos authentication is fairly standard, for example), but what is there is not as detailed as it needs to be.

RainStor, in its latest 5.5 release, has set out to resolve this issue. It now offers not just Kerberos but also LDAP and Active Directory support. Perhaps more importantly, all data is compressed (compression is RainStor’s strong suit: 20 to 40 times is not untypical) and encrypted in memory before being loaded into HDFS. In addition, the company offers data masking for both SQL and MapReduce functions. While this is not intended to be a full-blown data masking tool, it will be useful for masking data in, say, log files. Masking can be done consistently, so that the same piece of data is always masked in the same way. Finally, as far as security is concerned, RainStor has extended the tamper-proofing technology that it uses in its other products to the Hadoop environment, with MD5 fingerprinting, and it has added a record-level delete capability. Where this is used, there are facilities to prevent the deletion of records that are subject to legal hold.
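To make the idea of consistent masking concrete, here is a minimal sketch in Python. It is not RainStor’s implementation, which is not described here; it simply assumes a keyed hash (HMAC-SHA256) as the masking function and a regular expression for US social security numbers in log lines. The names `SECRET_KEY`, `mask_value` and `mask_log_line` are hypothetical.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration only; a real deployment would manage this securely.
SECRET_KEY = b"example-key-not-for-production"

# Assumed pattern: US social security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_value(value: str) -> str:
    """Return a deterministic token: the same input always yields the same mask."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "MASK_" + digest[:12]


def mask_log_line(line: str) -> str:
    """Replace every SSN-like string in a log line with its consistent mask."""
    return SSN_PATTERN.sub(lambda match: mask_value(match.group(0)), line)


if __name__ == "__main__":
    print(mask_log_line("user 123-45-6789 logged in"))
    print(mask_log_line("user 123-45-6789 logged out"))
```

Because the mask depends only on the key and the input, the two log lines above carry the same token for the same number, so masked records can still be joined or counted without exposing the original value.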

The second major feature of this release is free-text search, which is new to RainStor. This combines some of the parsing aspects of Lucene with the Bloom filters that are at the heart of RainStor. In layman’s terms, the way this works is somewhat similar to (IBM) Netezza’s Zonemaps or the data skipping that IBM introduced with BLU Acceleration for DB2. That is, the system knows where relevant data is not, so it skips those partitions when searching for information. According to RainStor, this results in performance that is one to two orders of magnitude better than in previous versions.
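The skipping idea can be illustrated with a small Python sketch. This is an assumption-laden toy rather than RainStor’s design: it keeps one Bloom filter per partition, uses plain whitespace tokenisation as a stand-in for Lucene-style parsing, and the names `BloomFilter`, `build_index` and `search` are hypothetical. A Bloom filter can say “definitely not here” or “possibly here”, so a partition is scanned only in the second case.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter that stores its bits in a Python integer."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, token: str):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_index(partitions: dict) -> dict:
    """Build one Bloom filter per partition from its whitespace-split tokens."""
    index = {}
    for name, records in partitions.items():
        bloom = BloomFilter()
        for record in records:
            for token in record.lower().split():
                bloom.add(token)
        index[name] = bloom
    return index


def search(partitions: dict, index: dict, term: str):
    """Return matching records and the names of the partitions actually scanned."""
    term = term.lower()
    hits, scanned = [], []
    for name, records in partitions.items():
        if not index[name].might_contain(term):
            continue  # the filter rules this partition out, so skip it
        scanned.append(name)
        hits.extend(r for r in records if term in r.lower().split())
    return hits, scanned


if __name__ == "__main__":
    data = {"p1": ["error disk full", "login ok"], "p2": ["payment accepted"]}
    print(search(data, build_index(data), "payment"))
```

The filter never produces a false negative, so skipping is safe; an occasional false positive only costs an unnecessary scan of a partition that turns out to hold no match.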

I’ve been a fan of RainStor for some time, and it’s definitely worth a look if you want to store data very efficiently without losing the ability to query it using SQL (or a standard BI tool that works with that SQL, which conforms to the SQL ’92 standard) and, in the case of Hadoop, MapReduce. The other thing I like is the company’s new tagline, “taking the big out of data”: it’s about time somebody did.