Graphs and GPUs

Scalable Graph Technologies (SYSTAP) was founded in 2006. Unfortunately, it made the mistake of calling its graph database BigData. Of course, when the product was introduced nobody had yet heard of big data, so that wasn’t a problem. But they soon did, and then it was. This was not because the product isn’t highly scalable, nor because it doesn’t focus on complex analytics across very large datasets (it does both), but because it isn’t what people were expecting (Hadoop?). Anyway, the company has now changed the product’s name to BlazeGraph and has introduced MapGraph. The latter, in particular, is aimed at the very largest and most intractable graph problems, as witnessed by the fact that SYSTAP has gained funding from DARPA (the Defense Advanced Research Projects Agency) and the US government.

BlazeGraph supports both RDF (Resource Description Framework) and property graph approaches. It supports SPARQL, and quads as well as triples. You can deploy it embedded, with high availability, or as a scale-out solution. It has a shared-nothing architecture with in-memory processing, and ZooKeeper provides its high availability characteristics. It is focused on analytics rather than transaction processing. Okay: that’s fine. Interesting, but not especially exciting.
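
For readers less familiar with the triples-versus-quads distinction, here is a minimal Python sketch. It uses plain tuples rather than BlazeGraph's own API, and the `example.org` URIs, the `match` helper and the graph names are all invented for illustration. A quad is simply a triple plus a fourth element naming the graph the statement belongs to.

```python
from collections import namedtuple

# A triple is a (subject, predicate, object) statement.
Triple = namedtuple("Triple", "subject predicate object")
# A quad adds a fourth element: the named graph (context) holding the statement.
Quad = namedtuple("Quad", "subject predicate object graph")

EX = "http://example.org/"  # hypothetical namespace, for illustration only

quads = [
    Quad(EX + "alice", EX + "knows", EX + "bob", EX + "graph/hr"),
    Quad(EX + "alice", EX + "knows", EX + "carol", EX + "graph/crm"),
    Quad(EX + "bob", EX + "knows", EX + "carol", EX + "graph/hr"),
]


def match(quads, subject=None, predicate=None, obj=None, graph=None):
    """Return quads matching a pattern; None is a wildcard, like a SPARQL variable."""
    pattern = (subject, predicate, obj, graph)
    return [
        q for q in quads
        if all(want is None or want == got for want, got in zip(pattern, q))
    ]


def to_triples(quads):
    """Drop the graph component, leaving ordinary triples."""
    return {Triple(q.subject, q.predicate, q.object) for q in quads}


# Whom does alice know, in any graph?
alice_knows = match(quads, subject=EX + "alice", predicate=EX + "knows")
# Which statements come from the (hypothetical) HR graph only?
hr_only = match(quads, graph=EX + "graph/hr")
```

The fourth element is what lets you ask where a statement came from, which a triple on its own cannot tell you.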

What is exciting is the MapGraph technology. The problem with graph analytics, even if you are using in-memory techniques (as BlazeGraph does), is that large-scale analyses suffer from cache thrashing (data being swapped in and out of cache) because graph traversals have very little locality of reference. In practical terms, this means that you are limited by memory bandwidth rather than by compute. One approach to this issue is to have loads of memory, but that is expensive, and CPU main memory bandwidth is slow relative to cache. With MapGraph, SYSTAP has instead partnered with Nvidia to use its GPUs (graphics processing units), which offer much higher memory bandwidth, to resolve this issue. And it’s much less expensive: SYSTAP estimates that a single $1,000 GPU can store and process a million graph edges with greatly reduced bandwidth problems.
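
The locality problem can be illustrated with a rough Python sketch. This is not a benchmark and has nothing to do with MapGraph's internals: it builds a random graph in compressed sparse row (CSR) form and uses the distance between consecutively accessed array positions as a crude stand-in for cache behaviour. The graph size, the seed and the `mean_jump` measure are all assumptions made for the illustration.

```python
import random


def build_csr(num_vertices, edge_list):
    """Pack adjacency lists into compressed sparse row (CSR) arrays."""
    degree = [0] * num_vertices
    for src, _ in edge_list:
        degree[src] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * len(edge_list)
    cursor = offsets[:-1].copy()  # next free slot for each source vertex
    for src, dst in edge_list:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def mean_jump(indices):
    """Average distance between consecutively accessed array positions."""
    jumps = [abs(b - a) for a, b in zip(indices, indices[1:])]
    return sum(jumps) / len(jumps)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 10_000
edge_list = [(rng.randrange(n), rng.randrange(n)) for _ in range(50_000)]
offsets, targets = build_csr(n, edge_list)

# Walking the edge array is sequential: each access sits next to the last one.
sequential = mean_jump(list(range(len(targets))))
# Looking up per-vertex state for each edge's target hops all over the vertex array.
scattered = mean_jump(targets)
```

Scanning the edges moves one position at a time, whereas following them to their target vertices jumps thousands of positions on average in this random graph. That scattered access pattern is what defeats caches and makes memory bandwidth the bottleneck.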

In practice, MapGraph is still in beta (the company is looking for users to join its beta program) and there are some usability issues that have yet to be ironed out: for example, you can’t use SPARQL. Instead, the company has developed a vertex-centric API that allows you to write your own algorithms in a manner similar to Pregel.
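
To give a flavour of what "vertex-centric, in a manner similar to Pregel" means, here is a small single-machine Python sketch. It is not MapGraph's API: the `run_vertex_program` engine and the `bfs_compute` function are hypothetical names written for this illustration. The idea is that you write only a per-vertex function, and the engine runs it in synchronised rounds (supersteps), passing messages along edges until none remain.

```python
import math
from collections import defaultdict


def run_vertex_program(edges, compute, initial_messages, max_supersteps=50):
    """Tiny Pregel-style engine (illustrative only, runs on one machine).

    edges: dict mapping each vertex to a list of its out-neighbours
    compute(vertex, value, messages) -> (new_value, message_or_None)
    """
    values = {v: math.inf for v in edges}
    inbox = dict(initial_messages)
    for _ in range(max_supersteps):
        if not inbox:  # no messages in flight, so every vertex has halted
            break
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            new_value, outgoing = compute(vertex, values[vertex], messages)
            values[vertex] = new_value
            if outgoing is not None:
                for neighbour in edges[vertex]:
                    outbox[neighbour].append(outgoing)
        inbox = outbox
    return values


def bfs_compute(vertex, value, messages):
    """Per-vertex logic for breadth-first search: keep the smallest depth seen."""
    best = min(messages)
    if best < value:
        return best, best + 1  # improved: tell neighbours they are one hop further
    return value, None  # no improvement: stay silent


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
depths = run_vertex_program(graph, bfs_compute, {"a": [0]})
```

Here `depths` ends up holding each vertex's hop count from `a`. The appeal of the model is that the per-vertex function knows nothing about how the graph is partitioned or scheduled, which is what leaves an engine free to spread the work across many cores.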

From a licensing point of view, BlazeGraph is open source under a GPLv2 license. There is a version of MapGraph available under the Apache 2 license. Future versions and multi-GPU installations are proprietary.

I think MapGraph is seriously interesting, and I am pleased to hear that SPARQL support is on the company’s roadmap for release in the near future. Even without that, this is a genuine competitor to Cray (Urika) and IBM Watson (something I will address more broadly in a separate article), and it is likely to be significantly less expensive.