Graph databases and NoSQL

This is the second of five articles on graph databases. In this article I am going to talk more about graph databases in general and how they differ from other approaches. In particular, to date, most of the emphasis within the “big data” story has been on Hadoop, Cassandra and maybe MongoDB. Relatively little has been said about graph databases. However, it is arguable that graph databases will have a bigger impact on the database landscape than Hadoop or its competitors.

Strictly speaking, a graph database is a NoSQL database, but this is a case where strictly speaking is not very useful. There are two things that tend to typify NoSQL databases in people’s minds: the first is that Hadoop and its allies are optimised to run on low cost clusters of commodity hardware, and the second is that they use MapReduce to parallelise processing across the cluster. This works because these NoSQL databases are effectively doing either statistical analysis or search, and there is only a limited shipment of data across the network. This isn’t the case with graph databases, especially where you are looking for patterns of relationships for analytic purposes.

The point to understand about graph databases, especially when it comes to analytics, is that the more nodes you have in your graph, the richer the environment becomes and the more information you can get out of it. How much more is a matter for debate: Metcalfe’s Law (which is actually no more than a hypothesis) suggests that the value of a network grows approximately in proportion to the square of the number of nodes (actually n × (n-1)). However, this has been disputed, not least because some connections (relationships) between nodes are more valuable than others. Other researchers have suggested that n log(n) would be a more appropriate figure. The answer is probably somewhere in between, but there seems no doubt that the more information you can collect, the more value you can extract. So, at least for analytics, graph-based data is a big data problem.
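To give a feel for how far apart those two estimates are, here is a minimal Python sketch that compares them. The function names are my own, and the choice of natural logarithm for the n log(n) estimate is an assumption (the base only changes the figure by a constant factor); neither formula is anything more than the hypotheses described above.

```python
import math


def metcalfe_value(n: int) -> int:
    """Metcalfe-style estimate of network value: n * (n - 1)."""
    return n * (n - 1)


def nlogn_value(n: int) -> float:
    """Alternative estimate of network value: n * log(n) (natural log assumed)."""
    return n * math.log(n)


def growth_ratio(value_fn, n: int, factor: int = 10) -> float:
    """How many times the estimated value grows when n is multiplied by `factor`."""
    return value_fn(n * factor) / value_fn(n)


# Compare how each estimate reacts to a tenfold increase in the number of nodes.
for n in (1_000, 1_000_000):
    print(
        f"n={n:>9,}  "
        f"n(n-1) grows x{growth_ratio(metcalfe_value, n):.1f}  "
        f"n log(n) grows x{growth_ratio(nlogn_value, n):.1f}"
    )
```

Starting from 1,000 nodes, a tenfold increase in nodes multiplies the n × (n-1) estimate by roughly 100 but the n log(n) estimate by only about 13. Either way the value grows faster than the node count itself, which is the point that matters here.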

Now, you have to bear in mind that processing graph data consists of traversing relationships. If you implemented this on a cluster then those relationships would frequently span different servers within the cluster, which would slow down processing as the network became a bottleneck. For this reason a scale-out approach to supporting graph databases doesn’t work, and current vendors scale up rather than out. And, because you are scaling up, you don’t need MapReduce: parallelism can be built in.
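The effect is easy to illustrate with a toy model. The Python sketch below is purely illustrative and rests on two assumptions of mine: relationships are generated at random between nodes, and each node is placed on a server by taking its id modulo the number of servers. Real graphs usually have more locality than this, and a smarter placement scheme would do better, so treat the numbers as a rough worst-case picture rather than a benchmark.

```python
import random


def random_edges(num_nodes: int, num_edges: int, seed: int = 0):
    """Generate a list of (source, target) node-id pairs chosen uniformly at random."""
    rng = random.Random(seed)
    return [
        (rng.randrange(num_nodes), rng.randrange(num_nodes))
        for _ in range(num_edges)
    ]


def cross_server_fraction(edges, num_servers: int) -> float:
    """Fraction of edges whose endpoints sit on different servers.

    Assumed placement rule: node i lives on server i % num_servers.
    """
    if not edges:
        return 0.0
    crossing = sum(1 for u, v in edges if u % num_servers != v % num_servers)
    return crossing / len(edges)


# Hypothetical graph: 10,000 nodes and 50,000 random relationships.
edges = random_edges(10_000, 50_000)
for servers in (1, 2, 4, 16):
    fraction = cross_server_fraction(edges, servers)
    print(f"{servers:>2} servers: {fraction:.1%} of relationships cross the network")
```

Under these assumptions the share of relationships that cross the network is close to 1 - 1/k for k servers, so adding servers makes each traversal step more likely, not less, to involve a network hop.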

As an aside, this has an important corollary. I mentioned in my previous article that DB2 now supports logical graph storage. It can do this because it primarily uses a scale-up model (barring pureScale). However, Oracle (RAC), Vertica and others that use a clustered approach to warehousing are unlikely to be able to compete because the performance will suck.

Having said this, I know that Neo4j is working on ways to minimise the need to cross between servers in a cluster when traversing relationships, but it remains to be seen when this will be available and how effective it will be. My guess would be that the bigger your data and the richer your relationships, the less effective it will be.

Speaking of Neo4j, one other essential difference between Hadoop and its ilk and graph databases is that the former are primarily seen as environments for processing queries, whereas at least some graph databases can be used for transaction processing. Neo4j, for example, supports ACID-compliant transactions and XA-compliant two-phase commit. So Neo4j might be better equated with a NewSQL database, except that it can also handle significant query processing.

Having said all the foregoing, there are two things that graph and NoSQL databases have in common. The first is that neither requires a schema. The second is that there are significant open source developments in this area led, as always, by Apache. Other projects include Affinity, Nuvala, Stig and Pegasus, among others. The most notable of these developments is SPARQL (the W3C standard query language for RDF data), which is the graph equivalent of SQL (as if you hadn’t guessed). While SPARQL is supported by both Neo4j and YarcData (and IBM), in neither case is it the preferred method for developing queries. Indeed, I have gained the distinct impression that neither Neo Technology nor YarcData is much impressed by it at present, though SPARQL is maturing and the functionality will evolve as more people use and support it.

So the bottom line is that it is not very useful to think of graph databases as a type of NoSQL database. Certainly they have things in common, such as not being relational, but then Adabas is not relational and you wouldn’t call it a NoSQL database. Graph databases deserve to be treated as a technology in their own right and not be lumped in with something that is fundamentally different.