DB2: a relational epithet is no longer enough

IBM’s Information on Demand conference was, as usual, informative, though it’s sometimes difficult to keep track of all the new releases unless you attend the press conference (which I don’t), and even then you don’t hear about what was released the previous week.

Anyway, there was lots of good stuff. However, the most interesting thing, to me at any rate, concerned old stuff: specifically, the triple store feature in DB2 10. When I was briefed about this back in the spring I was told that the data was still stored relationally but was, in effect, tagged. And, of course, DB2 supports SPARQL for querying this data. I subsequently wrote about the feature in these terms, and no amendments came back from IBM (I always check technical features before publishing details about them). I also wrote that I didn’t expect great performance, precisely because the data was still being stored relationally.
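For anyone who hasn’t met a triple store before, the idea can be sketched in a few lines of Python. This is purely illustrative and says nothing about how DB2 implements the feature: the sample data, the predicate names and the `match` helper are all my own inventions. Each fact is a (subject, predicate, object) triple, and a query is a pattern in which some positions are left open, which is roughly what a SPARQL pattern such as `?s :made_by :ibm` expresses.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample facts, invented for illustration only.
TRIPLES: List[Triple] = [
    ("db2", "is_a", "database"),
    ("db2", "made_by", "ibm"),
    ("jazz", "made_by", "ibm"),
    ("jazz", "stores_data_in", "db2"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the pattern.

    A position left as None acts as a wildcard, much like a
    variable in a SPARQL triple pattern.
    """
    for subj, pred, obj in triples:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)


# "Which things are made by IBM?" (subject left open)
made_by_ibm = sorted(t[0] for t in match(TRIPLES, p="made_by", o="ibm"))
print(made_by_ibm)
```

A real triple store would index the triples rather than scan them, but the shape of the data and of the query is the same.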

It turns out that this was all hogwash. In fact, the data in the triple store is not stored relationally but as an encoded vector, which is a different thing altogether. As the person who told me this was Curt Cotner, who is CTO for IBM Database Servers, I am inclined to believe him. I think the problem is that not enough people within IBM appreciate how important triple stores (otherwise known as graph or RDF databases) are going to be, and so they haven’t felt the need to understand how they truly work. My personal view is that they will be the next big thing after all the fuss about Hadoop has died down. Yet I’ve been at several IBM events where DB2 10 has been discussed and this feature has not even been mentioned.

It is interesting to explore why IBM has implemented this triple store: it did so because IBM Rational asked for it. Rational was developing its Rational Jazz product, which is a collaborative repository for development objects, and could not get it to perform using traditional database technology. So Rational approached Curt’s group for help, and that’s why the triple store was built into DB2. Tivoli is also making use of this storage mechanism. It should be noted that DB2 is not a full graph database at this point, as it lacks the inference engine that would generally be included in such an environment, but it is likely that DB2 will be integrated with one of the open source engines of this type in due course.
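To make the point about the missing inference engine concrete, here is a minimal sketch in Python of what such an engine adds. It is an assumption-laden toy, not a description of any IBM or open source product: the `is_a` and `subclass_of` predicates, the sample facts and the `infer_types` function are all hypothetical. The single rule it applies is: if X is a Y, and Y is a subclass of Z, then X is also a Z. Triples that were never stored explicitly become queryable.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical facts, invented for illustration only.
STORED: Set[Triple] = {
    ("item_42", "is_a", "defect"),
    ("defect", "subclass_of", "work_item"),
    ("work_item", "subclass_of", "artifact"),
}


def infer_types(triples: Iterable[Triple]) -> Set[Triple]:
    """Return the stored triples plus every 'is_a' triple derivable
    by following 'subclass_of' links, repeating until nothing new
    can be added."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        snapshot = list(facts)  # iterate over a copy while adding to facts
        for s, p, o in snapshot:
            if p != "is_a":
                continue
            for s2, p2, o2 in snapshot:
                if p2 == "subclass_of" and s2 == o:
                    derived = (s, "is_a", o2)
                    if derived not in facts:
                        facts.add(derived)
                        changed = True
    return facts


inferred = infer_types(STORED)
# Triples that were derived rather than stored:
print(sorted(inferred - STORED))
```

Without inference, a query for everything that is an "artifact" would return nothing from the stored facts above; with it, item_42 qualifies.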

There’s an interesting sidebar to this triple store implementation, as it means that DB2 now effectively has three different storage engines (relational, XML and triple store), each with its own access mechanism: SQL, XQuery and SPARQL respectively. Now, if you’ve got three storage engines, why not four or five? In fact, IBM is already working on a JSON store, accessible via SQL and with a callable interface like MongoDB’s. So, why not a Hadoop storage engine (HDFS or GPFS) as well? You’d get the low-cost clustered hardware advantage and the advantage of “schema later”, but you’d have a single management environment across multiple storage engines. Okay, don’t expect this tomorrow, but I think this is the direction in which IBM is going to move: the addition of the triple store to DB2 is not only important in its own right; I think it is also a pointer to the future.

There is one outstanding question: if DB2 already has three distinct storage mechanisms, and is likely to have more in the future, is it still valid to call it a relational database management system? Isn’t it a general-purpose database management system now, or even just a data management system?