This is the first of five articles on graph databases. In this one I will discuss the technology in a general sense and consider why they are potentially "the next big thing". In the second article I will consider the similarities and differences between graph databases and NoSQL databases such as Hadoop and Cassandra, in the third and fourth articles I will discuss Neo4j (from Neo Technologies) and uRiKA (from YarcData) respectively, where the former is probably the leading general-purpose (that is, supporting both transactional and query-based applications) graph database and the latter is focused on graph analytics. Finally, in the fifth article I will consider how take-up of graph databases may impact the data warehousing market.
So, what is a graph database? Or, to get us started, what is a graph? Because it is not a graphical representation of an equation. A graph, sometimes also known as a network (but network database would be confusing) consists of nodes and edges. The very first graph of this sort was created by Leonhard Euler in the 18th century to resolve the Königsberg (now Kaliningrad) bridge problem. The question to be answered was whether it was possible to traverse all the bridges in Königsberg (which at that time was based on both banks of the Pregel and on two islands in the middle of the river) just once and ending where you started (the answer was no). Euler generalised the problem by replacing the land masses with nodes and the bridge with edges.
Graph theory has widespread applicability in, for example, operational research, where it is used to design networks of various sorts, such as road networks, pipeline networks and so forth, as well as large-scale IT networks (including the Internet itself). Major concerns in such applications are to a) discover the most cost-effective implementation of the network and b) to discover the fastest route across the network between any two particular nodes. Another major application is in circuit board design.
In so far as graph databases are concerned, a node represents an entity (a person or thing) and an edge a relationship. Note that relationships (and therefore edges) may be one way or two-way. For example, if you follow me on Twitter but I don't follow you then I can influence you but you may have no way to influence me. In addition, graph databases support the use of attributes alongside both relationships and entities. So relationships can be qualified in some way. For example, suppose that you want to understand who influences whom in potential buying situations. Most of us are not influenced by a single person but by multiple people and some people are given more credence than others, so you might want to apply a weighting to a relationship. Or, if you have a relationship that involves ownership, say, then there are lots of things that you might own: a cell phone, a car, a laptop and so on, and these can be used as qualifiers to the ownership relationship. Pursuing this ownership example further, you might also apply attributes to the entity that you own, for example applying a model and year to your Chevrolet.
Basically, graphs offer a holistic view of the relationships that an entity participates in. Applications that are interested in such relationships will typically be either about managing those relationships or discovering relationships that were not previously known. Such use cases are common: for example, a significant part of master data management consists of hierarchy management. Conversely, security services want to discover and understand the relationships that exist between criminals and/or terrorists. Other possible uses would be in network management; SIEM (security information and event management) where using graphs might make more sense than current approaches; bioinformatics; medicine (The Mayo Clinic is a YarcData customer) and capital markets; as well as various social media environments.
It is also worth noting that the semantic web (Web 3.0) is predicated upon the use of the Resource Description Framework (RDF) and RDF statements are, in effect, graphs. As a result you may see references to RDF databases but this is just another way of saying graph database. So there is also the potential for graph databases to support applications that leverage the semantic web as this comes more and more into play.
In essence, the way that a graph database works (I will talk about this further in subsequent articles) is that it stores entities and relationships, as discussed, but its processing is along the edges of the graph. This turns conventional approaches to data storage on its head. In a relational database, for example, the heart of the system is its entities (tables) and you only use relationships (primary/foreign keys) to get to another entity: what you are doing is processing data. In a graph database you are processing relationships.
Finally, why do I suggest that graph databases may be the next big thing? There are two reasons. The first is that they provide a more focused approach for addressing issues with regard to relationships than either Hadoop or traditional approaches. And, as one vendor put it to me, "understanding relationships is the best way of looking at almost any question". That makes sense to me. Secondly, some big hitters are starting to enter this field: YarcData is actually a spin-off from Cray, and the result of years of research into this area in partnership with various government security organisations; I am also aware that at least one of the major data integration companies is looking into building connectors into this space; and IBM has just released graph store capability in its latest release of DB2. Note that this isn't a graph database per se but you can tag selected relational data to make it look as if the data is graphical. In other words it supports a graph-based logical view of the data. IBM also supports SPARQL, which I'll discuss in a subsequent article, an open source language for querying graph databases.