Analyst Coverage: Philip Howard
A graph database is one that stores data in terms of entities and the relationships between entities. There are three types of graph database: property graph databases, triple stores and conventional databases that provide some graph capabilities. Triple stores are often referred to as RDF (resource description framework) databases. In a property graph, properties may be assigned to entities, to their relationships, or to both. The main difference between a property graph product and a triple store is that the former supports index-free adjacency (which means you can traverse a graph without needing an index) and the latter doesn’t. That said, a number of RDF products also support properties.
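The distinction between the two data models can be sketched in a few lines of Python. This is an illustrative toy only, not how any particular product stores data: the `Node` and `Edge` classes, the `neighbours` and `objects` helpers and the sample data are all hypothetical names introduced for the example. In the property graph half, each node holds direct references to its edges, which is the idea behind index-free adjacency; in the triple half, facts are plain subject-predicate-object statements that have to be searched.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)
    # Direct references to outgoing edges: following them needs no index lookup.
    out_edges: list = field(default_factory=list)


@dataclass
class Edge:
    label: str
    target: Node
    # In a property graph, relationships can carry properties too.
    properties: dict = field(default_factory=dict)


def neighbours(node, label):
    """Traverse from a node by following its own edge references."""
    return [e.target.name for e in node.out_edges if e.label == label]


# Property graph view (hypothetical sample data).
alice = Node("alice", {"age": 34})
bob = Node("bob")
alice.out_edges.append(Edge("knows", bob, {"since": 2015}))

# Triple view of the same facts: (subject, predicate, object) statements.
triples = {("alice", "knows", "bob"), ("alice", "age", 34)}


def objects(triples, subject, predicate):
    """Answer a query by scanning the statements (a real store would use indexes)."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


print(neighbours(alice, "knows"))           # ['bob']
print(objects(triples, "alice", "knows"))   # ['bob']
```

Note that the `since` property on the relationship has no direct counterpart in the plain triple set above, which is one reason some RDF products add property support.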
Both graph and RDF databases may be native products, or they may be built on top of other database types. Most commonly, these underlying databases are forms of NoSQL database, though there are some relational implementations.
RDF databases target semantic processing, often with the ability to combine information across structured and unstructured data. Both property graph and RDF databases may be ACID compliant and both are frequently targeted at transactional environments. All graph products target analytics but different products are targeted at operational analytics (those suitable for transactional environments) or are pure-play analytic databases. In this last category there is also a distinction between vendors targeting known-known problems as opposed to those that also cover known-unknowns and those tackling unknown-unknowns: the most intractable of all.
Given that both graph and RDF databases target transactional environments and have query processing capabilities, they are an obvious candidate for supporting hybrid processing, whereby the database is used for both transactional/operational processing and real-time analytics. Compared to some other approaches to this, graphs have the major advantage that the data only needs to be stored once. Both concurrent analytics (where the analytics are separate from operational processes, for example in supporting real-time dashboards) and in-process analytics (where the analytics are embedded in real-time operational processing) may be supported. In the latter case, there are a variety of graph algorithms supported by vendors that may be implemented for machine learning purposes.
Graph databases handle a class of issues that are too structured for NoSQL and too diverse for relational technologies. Common use cases, especially for analytics, are anything to do with networks, whether these be pipelines, communications or criminals. Relational databases, by contrast, natively model only one-to-one, many-to-one and one-to-many relationships. They do not cater well for problems (such as bill of materials, a classic case) that are many-to-many, and they also perform poorly when queries involve (perhaps multiple) self-joins. For these types of requirements graph databases not only perform far better than relational databases but also allow some types of query that are simply not possible otherwise. Semantic query support tends to be particularly strong in triple stores.
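To make the bill of materials case concrete, the following Python sketch treats a small, made-up parts list as a graph and "explodes" an assembly into its base components by traversal. In a relational schema the same question would typically need a self-join for each level of the hierarchy. The `BOM` data and the `build_graph` and `explode` functions are hypothetical names invented for illustration, and the sketch assumes the structure contains no cycles.

```python
from collections import defaultdict

# Hypothetical bill of materials: (assembly, component, quantity per assembly).
# "bolt" is used by both "wheel" and "frame", so the relationship is many-to-many.
BOM = [
    ("bike", "wheel", 2),
    ("bike", "frame", 1),
    ("wheel", "spoke", 32),
    ("wheel", "tyre", 1),
    ("wheel", "bolt", 2),
    ("frame", "bolt", 4),
]


def build_graph(edges):
    """Store each assembly's components as an adjacency list."""
    graph = defaultdict(list)
    for parent, child, qty in edges:
        graph[parent].append((child, qty))
    return graph


def explode(graph, item, multiplier=1, totals=None):
    """Walk the graph depth-first, accumulating quantities of leaf components."""
    if totals is None:
        totals = defaultdict(int)
    components = graph.get(item, [])
    if not components:
        # A leaf: a base part with no sub-components.
        totals[item] += multiplier
        return totals
    for child, qty in components:
        explode(graph, child, multiplier * qty, totals)
    return totals


graph = build_graph(BOM)
print(dict(explode(graph, "bike")))  # {'spoke': 64, 'tyre': 2, 'bolt': 8}
```

The traversal depth does not need to be known in advance, which is the property that makes this kind of query awkward to express as a fixed number of joins.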
Another major point is that research suggests that graph visualisations are very easy and intuitive for users. It is also worth noting that many (not all) graph products are schema-free. This means that if you want to change the structure of the environment you simply add a new entity or relationship as required and do not have to explicitly implement a schema change. This is a major advantage over relational databases.
This market is continuing to emerge and there are a number of open-source projects and vendors, not all of which will survive. Conversely, there are companies that have been in this space for more than a decade, so the technology is not entirely new. One noticeable trend is for triple store vendors to add support for property graphs. This may explain why, according to www.db-engines.com, RDF databases have recently overtaken property graph databases in terms of interest. Another possible explanation is that RDF databases are especially suited to the creation of Knowledge Graphs, which are becoming increasingly popular.
Another trend is towards multi-model implementations. This is where the database supports graph technology as just one of possibly several views into the data. A major consideration with such offerings is the extent to which these different representations can work together. Some vendors require, for example, a different API to be used for each model type supported, whereas others have integrated their environment so that the different models are effectively transparent to one another.
One major issue that has yet to be finalised is language support. SPARQL (SPARQL protocol and RDF query language) is a W3C standard and is a declarative language, but by no means all vendors support it. In general, RDF vendors support SPARQL but property graph vendors do not, though there are exceptions to this. In the property graph space the Gremlin graph traversal language is part of the Apache TinkerPop project and is supported by some vendors, while other suppliers have adopted their own “SQL-like” languages. Also with significant traction is openCypher, which is a declarative language (Gremlin is only partially so). ANSI has a working party to define SQL extensions to support graph processing, while there is also an initiative to create a standardised GQL (graph query language). It is also worth noting that GraphQL, which is an open-source project, is gaining traction as a graph API to replace REST.
Finally, while it is too early to call this a trend, one vendor has introduced a graph capability based on adjacency matrices rather than adjacency lists. If this proves successful, and early results suggest that it will, then we are likely to see this being more widely adopted as it promises much better performance.
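The difference between the two representations can be illustrated with a short Python sketch. It is not based on any vendor's implementation: the sample graph and the `step_list` and `step_matrix` functions are hypothetical. Both functions perform a single traversal step (finding every node reachable in one hop from a set of starting nodes), but the matrix version expresses that step as a Boolean matrix-vector product.

```python
# Hypothetical directed graph with four nodes, numbered 0 to 3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
n = 4

# Adjacency list: each node maps to the nodes it points at.
adj_list = {i: [] for i in range(n)}
for src, dst in edges:
    adj_list[src].append(dst)

# Adjacency matrix: entry [src][dst] is 1 when an edge src -> dst exists.
adj_matrix = [[0] * n for _ in range(n)]
for src, dst in edges:
    adj_matrix[src][dst] = 1


def step_list(adj_list, frontier):
    """One hop using adjacency lists: follow each frontier node's edges."""
    return {dst for src in frontier for dst in adj_list[src]}


def step_matrix(adj_matrix, frontier_vec):
    """One hop as a Boolean matrix-vector product over a 0/1 frontier vector."""
    size = len(adj_matrix)
    return [
        int(any(frontier_vec[src] and adj_matrix[src][dst] for src in range(size)))
        for dst in range(size)
    ]


print(step_list(adj_list, {0}))               # {1, 2}
print(step_matrix(adj_matrix, [1, 0, 0, 0]))  # [0, 1, 1, 0]
```

One reason the matrix form can be attractive is that a traversal step becomes a linear algebra operation, which lends itself to optimised and parallel numerical routines. Whether that translates into the performance gains described above depends on the implementation, and this pure-Python version says nothing about real-world speed.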
We would argue that the market leaders in this space continue to be Neo4j and Ontotext (GraphDB), which are graph and RDF database providers respectively. However, www.db-engines.com suggests that MarkLogic is the leader in the RDF space. This is a question of definition: GraphDB is a pure-play RDF database with multi-model capabilities while MarkLogic is a multi-model database with an underlying XML engine that offers RDF capabilities. In any case, they are both leading vendors in this space, along with Amazon Neptune.
Perhaps the most exciting development in this space has been the introduction by TigerGraph of the “Million Dollar Challenge”, offering prizes for the most innovative uses of graph technology. This is not the first such prize (YarcData offered something similar when it first entered the market) but it is to be welcomed nonetheless. In other announcements, Franz has enhanced its GraphQL capabilities, among other developments, while TigerGraph has released an ML Workbench. MarkLogic has acquired Smartlogic to extend its metadata management capabilities, and Neo4j is now offering graph data science as a service.