Graph databases and the warehouse

This is the last in a series of five articles about graph databases. I have previously described what they are, how they relate to NoSQL databases such as Hadoop, and what they are used for. I have specifically highlighted Neo4j and uRiKA, where the former supports both transactional and query processing and the latter is focused exclusively on pattern recognition and analytics against very large datasets. I have suggested that graph databases have a significant role to play in analytic environments where it is relationships that are the most important thing that you want to investigate. If I am right in that supposition then there are important implications for the future of data warehousing. That is what I want to discuss in this article.

It is worth recapitulating what different types of database are good for. Briefly:

Traditional (relational) data warehouses are good for high performance, OLAP, complex and ad hoc analytics running in real-time or batch against structured or semi-structured data.
Hadoop is inexpensive, schema-free, can handle any type of data and is batch-based. Performance is nothing to write home about and nor is management. It does not support ad hoc queries and is best for statistical analysis, aggregation and search rather than complex analytics.
Cassandra is essentially similar to Hadoop except that it can handle real-time queries and natively supports time series, which is not typically the case for relational environments (Informix is the exception).
Graph databases are schema-free and can handle any type of data and support real-time complex analytics against relationship-based information. Like relational databases they scale up rather than out so are relatively expensive compared to Hadoop or Cassandra.

So graph databases enable a relationship-based focus that other approaches do not handle easily. Their other major advantages over relational databases are that they are schema-free and will work with unstructured as well as structured data. Compared to Hadoop and Cassandra they enable complex analytics against the data and, with respect to the former, graph databases can operate in real-time.

We have already got used to the idea that Hadoop will, in all probability, sit alongside your conventional warehousing environment. Where time is important (for example, in smart metering applications or for log analysis) you might have Cassandra and/or Informix running in conjunction with your warehouse. So it is hardly a stretch to think that you might have a graph database such as uRiKA sitting alongside your warehouse instead of, or as well as, any of the above.

Of course, we should consider the possibility that traditional vendors will implement graph store capabilities in their databases. IBM has already done this with DB2, providing the ability to view relational data as if it were graph-based. In internal tests IBM has compared the performance of this approach with Jena TDB, which is an Apache open source graph database, and thinks it is up to three times faster, depending on what you are doing. However, when it comes to large scale graph-based analytics, comparing Jena TDB with uRiKA is pretty much like comparing an Access database to Teradata, so I think IBM is a long way from being competitive for this sort of application (though it may well be suitable in transactional and hybrid environments).
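To make the idea of viewing relational data as if it were a graph a little more concrete, here is a minimal sketch. It is not IBM's implementation and says nothing about how DB2 does this internally: it uses SQLite purely for illustration, and the table name knows, its columns src and dst, and the helper reachable are all hypothetical. Each row is treated as an edge, and a recursive query walks those edges.

```python
import sqlite3

# Hypothetical relational table in which every row is one relationship (edge).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("eve", "ann")],
)


def reachable(conn, start):
    """Return everyone reachable from `start` by following knows edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(person) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) removes duplicates, so cycles terminate
            SELECT knows.dst
            FROM knows JOIN reach ON knows.src = reach.person
        )
        SELECT person FROM reach WHERE person <> ? ORDER BY person
        """,
        (start, start),
    ).fetchall()
    return [row[0] for row in rows]


print(reachable(conn, "ann"))
```

Every additional hop here is another join against the same table, which is one way of seeing why a layer like this may be adequate for modest workloads but is a different proposition from a purpose-built engine when the graph is very large.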

However, there is another point to make. While DB2 can reasonably implement graph-store capabilities, we can absolutely rule that out for the likes of Oracle and Vertica, which use clustered technology to support their warehouses. As I pointed out in an earlier article, in a graph database you navigate along relationships, and you cannot shard or partition relationships in the way that you can with data per se. Spread a graph across a cluster and the network will become a bottleneck, which is why graph databases are all scale-up based, at least at present.
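The partitioning problem can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than a model of any particular product: the placement rule shard_of (node id modulo the number of shards) and the function count_hops are hypothetical. It walks a small graph breadth-first and counts how many of the relationships followed stay on one shard and how many cross between shards, where each crossing would mean a trip over the network in a clustered system.

```python
from collections import deque


def shard_of(node, num_shards):
    """Hypothetical placement rule: node id modulo the number of shards."""
    return node % num_shards


def count_hops(edges, start, num_shards):
    """Breadth-first traversal that tallies local vs. cross-shard hops."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    seen = {start}
    queue = deque([start])
    local = remote = 0
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            queue.append(neighbour)
            # A hop is "remote" when the two ends live on different shards.
            if shard_of(node, num_shards) == shard_of(neighbour, num_shards):
                local += 1
            else:
                remote += 1
    return local, remote


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
print(count_hops(edges, start=0, num_shards=1))  # (5, 0): everything is local
print(count_hops(edges, start=0, num_shards=3))  # (1, 4): most hops cross shards
```

On a single node every hop is local. With the same graph spread over three shards, four of the five hops followed cross a shard boundary. A cleverer placement rule could reduce that for one particular query, but the data can only be placed one way, and queries against relationships can start anywhere.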

So even if, at some point in the future, DB2 might be able to cater to large scale, high performance relationship-based analytics, there will still be a significant part of the data warehouse landscape that will not be able to do so because of its fundamental architecture.

The bottom line is that we are going to see a proliferation in the adoption not just of Hadoop but also of graph databases, to be used in conjunction with each other and with traditional data warehouses. This in turn reinforces the whole concept of the logical data warehouse (see my article “The EDW is dead”), which, in turn, means consideration of what you need to support a distributed, heterogeneous warehousing environment. That’s maybe a subject for another article, but you will certainly need to be able to model the whole melange, you will need various ways of moving data within the environment, and you will probably want an abstraction layer (data virtualisation/federation) that allows you to view the whole thing as a single entity. This, right now, is a drawback, because none of the integration, warehousing or virtualisation vendors have so far looked at much beyond Hadoop. I don’t know of anyone with explicit connectors to any graph databases right now, which means that if graph databases are going to be as important as I think they are, then the integration vendors need to get their fingers out.