Big Data roundup

Various companies have recently been active making announcements in the big data space (aren’t they always?). Some of them, such as MapR and RainStor, I have written about recently. Most notable of the latest batch, from my point of view, is the announcement of a partnership between Neo4j and Talend, with the latter announcing a connector for Neo4j that will work with both the Talend Platform for Big Data and Talend Open Studio for Big Data. Along with Global IDs’ partnership with YarcData, this illustrates that people are starting to take graph databases seriously. Indeed, on that topic, I asked SAS about graph databases at its recent analyst conference and, while the company is not in a position to announce anything yet, it confirmed that it sees graphs as an important space for analytics.

Back in the more traditional (if it’s not too early to use that term) big data area of Hadoop, Informatica (which has yet to make any announcements about graph databases, even though the CMO of YarcData was a long-time Informatica employee) has announced that Zettaset is embedding Informatica PowerCenter Big Data Edition (which has also just been announced) in Zettaset Orchestrator, Zettaset’s Hadoop cluster management solution.

Incidentally, I have a nit to pick here. I really wish that companies would name their products appropriately. If we contrast Talend and Informatica with respect to big data, the former supports Hadoop, MongoDB and Neo4j (and, for all I know, maybe some other NoSQL platforms as well). The Informatica PowerCenter Big Data Edition, on the other hand, supports Hadoop. Period. It would be better named the PowerCenter Hadoop Edition. Bearing in mind that www.nosql-database.org listed 150 NoSQL databases the last time I looked (and this list was not complete), and that pretty much all of these would fall into the category of big data, support for less than 1% of the available platforms hardly merits the broad claim Informatica makes in its product name. While Talend does not exactly cover the whole space either, it at least offers the virtue of diversity. PS: I’m not picking on Informatica particularly here; pretty much all the other vendors could be tarred with the same brush. For example, MapR’s M7 Big Data Platform is a Hadoop-only platform, not a big data platform.

Finally, Splunk has announced that Hunk is in beta. As you might guess, this is Splunk for Hadoop. The company has implemented what it calls virtual indexing, which allows the entire Splunk stack (including the Splunk Search Processing Language, SPL) to operate on data stored in a Hadoop cluster. This virtual indexing is patent pending and is interesting not just for its own sake but also in light of RainStor’s recent announcement about Hadoop indexing. If multiple vendors are creating their own indexing techniques for Hadoop (and they are), is there a case for some standardisation?