BSP and Graphs

In 2010, Google published a paper on Pregel (the river that flows through what used to be Konigsberg and is now Kaliningrad and where Leonard Euler constructed the first graph, back in the 18th century) which is a graph analytic environment based around BSP (Bulk Synchronous Parallel—a parallel computing model—see previous article: Teradata Aster 6.0), and designed for ‘big graph’ problems. Basically, the way that it works is that you give Pregel a directed graph, it runs your computation at each vertex (vertices are assigned to nodes in the cluster), you run this until each vertex votes to halt (this voting constitutes the barrier referred to in my previous article) and then Pregel gives you a directed graph back. Other things that you need to know are that vertices are stored in a relational database and rows in BigTable, and that there is master node that partitions the initial graph and is in control of the whole process. There is a C++ API but, in terms of processing, you have to think from a vertex perspective so a change in mind set is required.

More recently, Apache has started the Giraph project, now in version 1.0.0, which is also based on BSP but on HDFS as a store instead of BigTable. Moreover, it is a non-directed environment, as is Teradata Aster SQL-GR. This is important. Directed graph environments, such as that provided by Neo4j, are good for supporting things like master data management, resource management for the semantic web and also for query environments where you know what you are looking for. For exploration – the sort of things that data scientists are supposed to do – you will need a non-directed approach because you do not know about the relationships in the data in advance.

Teradata is the most recent vendor to take up the BSP cause with Teradata Aster SQL-GR. However, unlike other BSP implementations (Pregel, Giraph and GoldenOrb – an open source version of Pregel) processing does not have to take place in-memory. Teradata is claiming that this is a scalability advantage. I don’t see that. Processing in-memory (which is also what YarcData Urika does, which is also non-directed) is optimal for performance. Allowing the possibility of disk-based processing gives you more flexibility but it doesn’t mean you’ve got greater scalability: it means better cost/scalability given that you are happy with performance degradation. So I don’t see being non-memory-bound as a particularly big deal: certainly, SQL-GR is going to be more flexible (you can choose to scale up memory or not) but I think it’s a minor point rather than a major one.

I must return to directed graphs. The point about directed graphs is that you have a good idea of how to partition the data and the best approach for assigning (in the case of Pregel and the like) vertices to processing nodes. This allows you to minimise network traffic, which can kill performance if you get too much of it. When you need to explore the data because you don’t know what your graphs look like, then you can’t do that minimisation. In this sort of circumstance I am not sure how much the BSP approach helps (I would be interested in relevant comments here—I don’t claim that this is my area of expertise) and I would have thought that a scale-up architecture—as in Urika—would be more appropriate.

It is becoming clear that there are a hierarchy of capabilities here: there are companies that have added graph capabilities to their existing data model (Oracle, MarkLogic, Objectivity and so on), there are companies with genuine graph stores (IBM, Neo4J, YarcData and lots of open source vendors) and there are companies/products that have built additional infrastructure on top of their native storage capabilities to support graph processing (Teradata, Giraph and so forth). Then we also have different parallel processing models (BSP and otherwise) as well as the directed versus non-directed debate. This creates lots of room for FUD (fear, uncertainty and doubt) and it will be some time before the requirements of this market become firmly established and understood. For my next major project I intend to research and write an appropriate report on graph databases and graph analytics.