Neo4j 2.0

Content Copyright © 2014 Bloor. All Rights Reserved.
Also posted on: The IM Blog

While I don’t have any statistics to prove it, Neo Technology is almost certainly the market leader in the graph database space. However, that rather depends on what you mean by the graph database space. Neo characterises the market as consisting of graph databases and graph compute engines. The latter include products such as Teradata Aster SQL-GR (which is not actually a graph database but which supports graph analytics) and YarcData, which is focused on large-scale, complex data discovery problems (the sort of things that data scientists do). Neo is not in the compute engine space; it is ACID compliant and is focused as much on writes and updates as it is on reads. In other words, Neo4j is a more general-purpose offering: it is entirely suitable for many analytic functions but not for the equivalent of data mining. At least, not at this stage.

Most recently, Neo Technology has released version 2.0 of Neo4j, its graph database. Perhaps the most important thing to know about this release, apart from the new browser-based user interface, is that the company has introduced the concept of labelling along with enhancements to its declarative language, Cypher, and its database optimiser. Note that the fact that Neo4j has a database optimiser is really important: there’s not much point in having a declarative language without one. As far as I know, it’s the only product in its market that has both (or either).

In order to understand labelling you first have to understand the concept of a property graph. Unlike in an RDF database or triple store, with a property graph you can attach values or weights to both the edges and nodes in the graph. This helps to prevent node proliferation and, as a result, it means that graph traversal will perform much better (because there is less traversal to accomplish when you have fewer nodes).
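To make the node-proliferation point concrete, here is a minimal sketch in plain Python. It is not Neo4j's API or storage model; the `Node` and `Edge` classes, the `ACTED_IN` relationship name and the sample data are assumptions made purely for illustration. It models the same fact twice: once as a property graph, where the role is a value on the edge, and once as simple subject-predicate-object triples, where the role needs an intermediate node to hang from.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex that can carry arbitrary key/value properties."""
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship that can also carry properties."""
    source: int
    target: int
    rel_type: str
    properties: dict = field(default_factory=dict)


# Property graph: two nodes, one edge, and the role sits on the edge.
tom = Node(1, {"name": "Tom Hanks"})
film = Node(2, {"title": "Forrest Gump", "released": 1994})
acted_in = Edge(tom.node_id, film.node_id, "ACTED_IN", {"role": "Forrest Gump"})

property_graph_nodes = [tom, film]
property_graph_edges = [acted_in]

# Triple-style modelling (illustrative only): an edge cannot carry a
# value, so the performance becomes a resource of its own and the role
# is attached to that extra resource.
triples = [
    ("tom", "name", "Tom Hanks"),
    ("film", "title", "Forrest Gump"),
    ("film", "released", 1994),
    ("performance", "actor", "tom"),
    ("performance", "film", "film"),
    ("performance", "role", "Forrest Gump"),
]
triple_resources = {subject for subject, _, _ in triples}

print(len(property_graph_nodes), "nodes in the property graph")
print(len(triple_resources), "resources in the triple version")
```

The extra resource in the triple version is one more hop for any traversal that wants to get from the actor to the film, which is the cost the paragraph above describes.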

The idea behind labelling is that you can have a node (vertex) for, say, Tom Hanks, and then you can label that node as “actor” or “person” or “Oscar winner” or “director” as appropriate, so that you can distinguish between different groupings of nodes. Without this labelling the most common alternative approach is to create a node specifying the grouping (for example, Actor), and then link the “Tom Hanks” node to it with an “is a” relationship. In other words, labelling is another way to increase the compactness of graphs and, as a result, should further improve performance.
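The difference between the two approaches can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions rather than how Neo4j implements labels: the dictionary layout, the helper names `nodes_with_label` and `members_of`, and the second sample person are all hypothetical.

```python
# With labels: each grouping is just a tag on the node itself.
labelled_nodes = {
    1: {"name": "Tom Hanks",
        "labels": {"Person", "Actor", "Director", "OscarWinner"}},
    2: {"name": "Meg Ryan", "labels": {"Person", "Actor"}},
}


def nodes_with_label(nodes, label):
    """Return the names of all nodes carrying the given label."""
    return sorted(n["name"] for n in nodes.values() if label in n["labels"])


# Without labels: every grouping is a node of its own, and membership
# is expressed with an "is a" edge from the member to the grouping.
plain_nodes = {
    1: "Tom Hanks",
    2: "Meg Ryan",
    3: "Person",
    4: "Actor",
    5: "Director",
    6: "OscarWinner",
}
is_a_edges = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4)]


def members_of(group_id):
    """Follow the 'is a' edges backwards from a grouping node."""
    return sorted(plain_nodes[src] for src, dst in is_a_edges if dst == group_id)


print(nodes_with_label(labelled_nodes, "Actor"))
print(members_of(4))
print(len(plain_nodes) - len(labelled_nodes), "extra grouping nodes,",
      len(is_a_edges), "extra edges")
```

Both lookups return the same people, but the unlabelled version carries one extra node per grouping and one extra edge per membership.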

Labelling not only simplifies graphs, but it also makes it possible to add constraints to nodes: a new concept for graph databases, which typically have been characterised as schemaless. This new feature, combined with new Cypher language features, allows a unique “PersonID” to be populated (for example) whenever a node of type Person is created. This capability is (as far as I know) unique to Neo4j 2.0. Other graph databases and query languages typically require that indexes be manually populated “out of band”, often by resorting to Java code.
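As a rough illustration of why a label makes such a constraint possible, the following Python sketch ties a uniqueness rule to a label and keeps the supporting index up to date as part of node creation, rather than out of band. The `LabelledStore` class and its method names are hypothetical and say nothing about Neo4j's internals or Cypher syntax; the sketch also only indexes nodes created after the constraint is declared.

```python
class LabelledStore:
    """Toy node store with per-label uniqueness constraints."""

    def __init__(self):
        self.nodes = []
        # Maps (label, property) to an index of {value: node id}.
        self.unique = {}

    def add_unique_constraint(self, label, prop):
        """Declare that `prop` must be unique among nodes with `label`."""
        self.unique.setdefault((label, prop), {})

    def create_node(self, labels, **props):
        """Create a node, enforcing constraints and updating indexes."""
        applicable = [
            (key, index) for key, index in self.unique.items()
            if key[0] in labels and key[1] in props
        ]
        # Check every applicable constraint before changing anything.
        for (label, prop), index in applicable:
            if props[prop] in index:
                raise ValueError(
                    f"{label}.{prop}={props[prop]!r} already exists")
        node_id = len(self.nodes)
        self.nodes.append({"labels": set(labels), "props": props})
        # The index is populated as a side effect of the write.
        for (_, prop), index in applicable:
            index[props[prop]] = node_id
        return node_id

    def find(self, label, prop, value):
        """Look a node up through the constraint's index."""
        node_id = self.unique[(label, prop)].get(value)
        return None if node_id is None else self.nodes[node_id]


store = LabelledStore()
store.add_unique_constraint("Person", "person_id")
store.create_node({"Person", "Actor"}, person_id=1, name="Tom Hanks")

try:
    store.create_node({"Person"}, person_id=1, name="Someone Else")
except ValueError as err:
    print("rejected:", err)

# A node without the Person label is outside the constraint's scope.
store.create_node({"Film"}, person_id=1, title="Forrest Gump")
```

Without a label there would be nothing to scope the rule to: the store would have no way of knowing which nodes count as a Person.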

I said at the beginning of this piece that Neo4j is almost certainly the market leader in this space. The sort of innovative thinking behind the introduction of a feature such as labelling goes a long way to explaining why it is in that position.