The language of graphs

SPARQL (SPARQL Protocol and RDF Query Language) is the language most widely supported by graph database vendors, which is not the same as being the most widely used (it may well not be). That holds regardless of how we define “database” in this context: that is, irrespective of whether data is actually stored as triples, as a property graph, or using some other storage mechanism.

As you might infer from its linguistic similarity to SQL, SPARQL is a declarative language. That is, you don’t have to know where the data is in order to create and run queries. However, just as with SQL and relational databases, the performance of those queries therefore depends on the database and, in particular, on the database optimiser. Unfortunately, while relational databases have sophisticated optimisers, graph databases typically do not (Neo4j is an exception). The same, it has to be said, applies to NoSQL databases in general: you may be able to run SQL (or HiveQL) against Hadoop, for example, but without an optimiser performance is still going to suffer.

The second issue with SPARQL is that, as the “R” implies, it was designed for RDF (Resource Description Framework), which is the basis of the Semantic Web. It wasn’t designed for business intelligence and analytics. Moreover, while RDF stores may have their place in supporting Web 3.0, for most commercial applications of graph technology there is a clear shift towards property graphs.

The difference between a property graph and a triple store is that in a property graph the nodes and edges may carry values (properties) of their own, whereas in a triple store every attribute is expressed as a further triple. As a result, property graphs are much more practical for general-purpose business uses: they are much more compact, and the graph does not grow like Topsy every time you add a new attribute (or value).
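To make the contrast concrete, here is a minimal sketch in Python. It is not how any particular product stores data: the identifiers, names, and helper functions (`add_attribute_triples`, `add_attribute_property`) are all hypothetical and exist only to show how adding one attribute affects each representation.

```python
# Hypothetical example data: one person who works for one company.

# Triple-store style: every attribute is its own (subject, predicate, object) triple.
triples = [
    ("person:1", "type", "Person"),
    ("person:1", "name", "Alice"),
    ("person:1", "age", 34),
    ("company:1", "type", "Company"),
    ("company:1", "name", "Acme"),
    ("person:1", "worksFor", "company:1"),
]

# Property-graph style: attributes live on the nodes and edges themselves.
nodes = {
    "person:1": {"label": "Person", "name": "Alice", "age": 34},
    "company:1": {"label": "Company", "name": "Acme"},
}
edges = [
    # The edge carries its own property ("since") directly.
    {"source": "person:1", "target": "company:1", "label": "WORKS_FOR", "since": 2012},
]


def add_attribute_triples(triples, subject, key, value):
    """In the triple representation, a new attribute is a whole new triple."""
    return triples + [(subject, key, value)]


def add_attribute_property(nodes, node_id, key, value):
    """In the property-graph representation, a new attribute extends an existing node."""
    updated = dict(nodes)
    updated[node_id] = {**nodes[node_id], key: value}
    return updated


triples_after = add_attribute_triples(triples, "person:1", "email", "alice@example.com")
nodes_after = add_attribute_property(nodes, "person:1", "email", "alice@example.com")

print(len(triples), "->", len(triples_after), "triples")
print(len(nodes), "->", len(nodes_after), "nodes")
```

In this toy model the triple list gains an element for each new attribute, while the number of nodes in the property graph stays the same. Note also that the `since` value sits directly on the edge in the property-graph version; the plain triple list above has no equivalent place to put it without introducing additional triples.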

So, property graphs are becoming the popular option. But that means that SPARQL, developed to support RDF or triple stores, is not particularly well suited to supporting property graphs. So what language do you use?

Generally speaking, the answer is to use a procedural language such as Gremlin (which is a scripting language based on Groovy). However, this has all the drawbacks of being procedural, and there are also portability issues associated with Groovy and Gremlin. As far as I know, the only company that has a declarative language for property graphs is Neo Technology, which has developed Cypher alongside its database optimiser.
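The procedural-versus-declarative distinction can be sketched as follows. This is plain Python over a hypothetical toy graph, not Gremlin, and the function name `friends_of_friends` is made up for illustration. The Cypher text is included only as a string for comparison and is not executed; it assumes `Person` nodes with a `name` property and `KNOWS` relationships.

```python
# Hypothetical toy graph: who knows whom, stored as adjacency lists.
knows = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Erin"],
    "Dave": [],
    "Erin": [],
}


def friends_of_friends(graph, start):
    """Procedural style: the caller spells out every step of the traversal."""
    results = []
    for friend in graph.get(start, []):      # step 1: follow outgoing edges
        for fof in graph.get(friend, []):    # step 2: follow them a second time
            if fof not in results:           # step 3: de-duplicate by hand
                results.append(fof)
    return results


# Declarative style: the query states the pattern to match and leaves the
# "how" to the database and its optimiser. Shown for comparison only.
DECLARATIVE_EQUIVALENT = """
MATCH (a:Person {name: 'Alice'})-[:KNOWS]->()-[:KNOWS]->(fof)
RETURN DISTINCT fof.name
"""

print(friends_of_friends(knows, "Alice"))
```

In the procedural version, the order of the loops and the de-duplication strategy are fixed by whoever wrote the traversal. In the declarative version, those decisions are left to the engine, which is why the quality of the optimiser matters so much.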

The problem, from my point of view, is that Cypher is proprietary. Neo is considering making it open, and assures me that it re-evaluates the question on a regular basis, but that’s not going to happen in the immediate future. While it may be good for Neo to be the only vendor in this position, my opinion is that it would serve the market well if Cypher were made open and more widely available sooner rather than later. I know that Neo disagrees with me on this: its view is that it doesn’t want the language bogged down in standards discussions at this stage. Neo4j would still have the advantage of a database optimiser, but I think that the general availability of a declarative language would help to drive the market.

This Post Has One Comment

Anonymous says:
4th January 2016 at 12:15 pm
I agree with the assessment of the options. I too wish Cypher were not proprietary. Personally, I find Cypher easier to use, but it isn’t as available or as standard as SPARQL. I hoped to get from this article ideas about whether and how property graphs are used in business to make decisions. I haven’t seen any yet in all the research I’ve done. There are many hypothetical ones in market literature and demonstrations. Maybe I’m missing something.
