Graph update

This is the first of several articles that I intend to write about developments in the Graph and RDF database spaces. This piece will be general but others will focus on specific vendors and their products.

Perhaps the most interesting development is the JanusGraph project (see www.janusgraph.org). This is an open source, distributed graph database hosted by the Linux Foundation, and the project is sponsored by IBM, GRAKN.AI (also a very interesting hypergraph development, about which I will write in a later article), Hortonworks and Google, amongst others. It has emerged from the rubble of the Titan graph database project. Titan was originally developed by Aurelius before that company was acquired by DataStax. DataStax took Titan to release status 1.0 and then unleashed it into the open source community, while reusing the intellectual property it had acquired to build its own graph database on top of DataStax Cassandra. Since then, Titan has withered on the vine: there have been virtually no contributions to its code base. As John Cleese might say, “it is a deceased parrot”.

However, that’s not the whole story, because a number of IT vendors, not least IBM, had embedded Titan into their own products. Not surprisingly, these companies want a viable platform going forward, hence the interest in JanusGraph. In this context, it is worth commenting that one of the big advantages of Titan, and now JanusGraph, is that both run on multiple storage platforms: in the case of JanusGraph, notably Cassandra, HBase and BerkeleyDB. Note that JanusGraph supports the Apache TinkerPop project, with Gremlin as its graph query language.

To move on to something completely different: a question I get asked a lot concerns performance and scalability for graph and RDF databases. This isn’t a straightforward question to answer, because it depends on what you are doing. If we take RDF databases, for example, there are vendors like Ontotext (about which more in my next article, but which has recently introduced significant performance improvements) that are focused on operational applications, and then there are companies like Cray that are essentially offering a cognitive data platform. In the latter case, you can have hundreds of terabytes of memory, so discussions about performance are moot (in the US sense of that word, meaning a waste of time). Actually, I hear the question more with respect to graph databases but, again, it depends. For example, you could host JanusGraph (or Titan for that matter) on ScyllaDB (see www.bloorresearch.com/blog/im-blog/scylladb) instead of Cassandra, and you’d expect to get considerably better performance. Then there’s BlazeGraph running across NVIDIA GPUs. As for Neo4j, I will be discussing this issue specifically in a forthcoming article.

Finally, another issue that often rears its head is the divergence between graph and RDF databases. The suggestion is that the former are increasingly focusing on data scientists, while the latter are more oriented towards data standards and models that are driven by open data, and which can perhaps be described as “semantic graphs”. I think there is some truth in this, but there is also a third category, which relates to the cognitive computing that Cray is addressing, as are both GRAKN.AI and Franz with AllegroGraph (which is a quad store and which supports both the graph and RDF models). It’s also simplistic to categorise all graph databases as targeting data scientists: that certainly isn’t the case for Neo4j, which remains primarily operational.
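
For readers less familiar with the two models, the difference between a property graph and a quad store can be sketched in a few lines of Python. This is purely illustrative and is not how AllegroGraph or any other product works internally: the names Quad, Edge and edge_to_quads, the made-up edge identifier and the sample data are all my own assumptions. It shows one relationship held as a property graph edge (with key/value properties attached) and the same information flattened into subject/predicate/object statements, each tagged with a named graph.

```python
from collections import namedtuple

# RDF-style statement: subject, predicate, object, plus a named-graph label
# (the fourth element is what makes it a "quad" rather than a triple).
Quad = namedtuple("Quad", ["subject", "predicate", "obj", "graph"])

# Property-graph-style edge: two endpoints, a label, and key/value properties.
Edge = namedtuple("Edge", ["source", "label", "target", "properties"])


def edge_to_quads(edge, graph="default"):
    """Flatten one property-graph edge into a list of quads.

    The relationship itself becomes one statement. Each edge property becomes
    a further statement about a hypothetical identifier for the edge, which
    is an assumption of this sketch rather than any standard mapping.
    """
    edge_id = f"{edge.source}-{edge.label}-{edge.target}"
    quads = [Quad(edge.source, edge.label, edge.target, graph)]
    for key, value in sorted(edge.properties.items()):
        quads.append(Quad(edge_id, key, value, graph))
    return quads


# Illustrative data only.
edge = Edge("alice", "knows", "bob", {"since": 2015})
quads = edge_to_quads(edge)
for quad in quads:
    print(quad)
```

The point is simply that the same facts can be represented either way; the models differ in what they make convenient, which is one reason why products that support both are possible.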

A couple of years ago, Forrester predicted that a quarter of all enterprises would be using graph (or RDF) databases by the end of this year. I don’t think that’s going to happen, at least in terms of direct licenses (as opposed to graph databases being embedded in other vendor products), but the market is certainly still growing, and continues to be interesting.