Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries
Towards Understanding Modern Graph Processing, Storage, and Analytics

MACIEJ BESTA, EMANUEL PETER, Department of Computer Science, ETH Zurich
ROBERT GERSTENBERGER, Department of Computer Science, ETH Zurich
MARC FISCHER, PRODYNA (Schweiz) AG
MICHAŁ PODSTAWSKI, Future Processing
CLAUDE BARTHELS, GUSTAVO ALONSO, Department of Computer Science, ETH Zurich
TORSTEN HOEFLER, Department of Computer Science, ETH Zurich

Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 45 graph database systems are presented and compared, including Neo4j, OrientDB, or Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases.

CCS Concepts: • General and reference → Surveys and overviews; • Information systems → Data management systems; Graph-based database models; Data structures; DBMS engine architectures; Database query processing; Parallel and distributed DBMSs; Database design and models; Distributed database transactions; • Theory of computation → Data modeling; Data structures and algorithms for data management; Distributed algorithms; • Computer systems organization → Distributed architectures;

Additional Key Words and Phrases: Graphs, Graph Databases, NoSQL Stores, Graph Database Management Systems, Graph Models, Data Layout, Graph Queries, Graph Transactions, Graph Representations, RDF, Labeled Property Graph, Triple Stores, Key-Value Stores, RDBMS, Wide-Column Stores, Document Stores

ACM Reference format:
Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. 2020. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries. 37 pages.

arXiv:1910.09017v4 [cs.DB] 3 Apr 2020




1 INTRODUCTION
Graph processing is behind numerous problems in computing, for example in medicine, machine learning, computational sciences, and others [108, 127]. Graph algorithms are inherently difficult to design because of challenges such as the large sizes of processed graphs (up to trillions of edges [22, 52, 123, 168]), little locality, or irregular communication patterns [127, 185]. The difficulties are increased by the fact that many such graphs are also dynamic (their structure changes over time) and have rich properties associated with vertices and edges.

Graph databases¹ such as Neo4j [164] emerged to enable storing, processing, and analyzing large, evolving, and rich graph datasets. In contrast to traditional relational databases, the underlying data layout in graph databases usually does not follow a fixed schema based on tables that implement relations. Instead, the layout often exploits the fact that a graph consists of vertices and edges between vertices. Although a graph can also be modeled with tables corresponding to vertices and edges, various graph algorithms or queries, for example traversals of a graph, benefit from storing a graph as an adjacency list array, where the neighbors of each vertex can be accessed with a simple memory lookup through a pointer attached to each vertex [164].

Graph databases face unique challenges due to the overall properties of irregular graph computations combined with the demand for low latency and high throughput of graph queries that can be both local (i.e., accessing or modifying a small part of the graph, for example a single edge) and global (i.e., accessing or modifying a large part of the graph, for example all the edges). Many of these challenges belong to the following areas: “general design” (i.e., what is the most advantageous general structure of a graph database engine), “data models and organization” (i.e., how to model and store the underlying graph dataset), “data distribution” (i.e., whether and how to distribute the data across multiple servers), and “transactions and queries” (i.e., how to query the underlying graph dataset to extract useful information). This distinction is illustrated in Figure 1. In this work, we present the first survey and taxonomy of these system aspects of graph databases. In general, we provide the following contributions:

• We provide the first taxonomy of graph databases, identifying and analyzing key dimensions in the design of graph databases: (1) general database engine, (2) data model, (3) data organization, (4) data distribution, (5) query execution, and (6) type of transactions.
• We use our taxonomy to survey, categorize, and compare 45 graph database systems.
• We discuss in detail the design of selected graph databases.
• We outline related domains, such as queries and workloads in graph databases.
• We discuss future challenges in the design of graph databases.

Graph Databases vs. NoSQL Stores and Other Database Systems. NoSQL stores address various deficiencies of relational database systems. An example of such a deficiency is limited support for flexible data models [60]. Graph databases such as Neo4j can be seen as one particular type of NoSQL store; these systems are sometimes referred to as “native” graph databases [164]. Other types of NoSQL systems include wide-column stores, document stores, and general key-value stores [60]. In this survey, we focus on any database system that enables storing and processing graphs, including native graph databases and other types of NoSQL stores, relational databases, object-oriented databases, and others. Figure 2 shows the types of considered systems.

¹ Lists of graph databases can be found at http://nosql-database.org, https://database.guide, https://www.g2crowd.com/categories/graph-databases, https://www.predictiveanalyticstoday.com/top-graph-databases, and https://db-engines.com/en/ranking/graph+dbms.

[Figure 1: overview diagram. Data models and representations (§ 2): graph data models (§ 2.1), graph representations (§ 2.2), non-graph data models (§ 2.3). Taxonomy and key dimensions of systems (§ 3): general system category (§ 3.1), data model (§ 3.2), data organization (§ 3.3), data distribution (§ 3.4), query execution (§ 3.5), transaction types (§ 3.6), language support (§ 3.7). Design details of selected graph databases (§ 4): RDF stores (§ 4.1), tuple stores (§ 4.2), key-value stores (§ 4.3), document stores (§ 4.4), wide-column stores (§ 4.5), RDBMS for graphs (§ 4.6), object-oriented databases (§ 4.7), native graph systems (§ 4.8), data hubs (§ 4.9). Queries and workloads (§ 5). Related domains covered in other surveys: integrity constraints, query languages, data models, history, and compression of graph databases.]
Fig. 1. The illustration of the considered areas of graph databases.

[Figure 2: types of database systems. Hierarchical and network systems (no longer actively developed), relational systems, object-oriented systems, NewSQL systems, and NoSQL systems (key-value stores, document stores, wide-column stores, native graph stores, tuple stores). RDF systems can be implemented as NoSQL or as traditional RDBM systems. The focus of this survey is any system used as a graph database: we consider native graph stores and parts of other domains related to storing and processing graphs.]
Fig. 2. The illustration of the considered types of databases.

Graph Databases vs. Graph Streaming Frameworks. In graph streaming [24], the input graph is passed as a stream of updates, which allows edges to be added and removed in a simple way. Graph databases are related to graph streaming in that they face graph updates of various types. Still, they usually deal with complex graph models (such as the Labeled Property Graph [4] or the Resource Description Framework [56]) where both vertices and edges may be of different types and may be associated with arbitrary properties. In contrast, graph streaming frameworks such as STINGER [70] focus on simple graph models where edges or vertices may have weights and, in some cases, simple additional properties such as time stamps. Moreover, challenges in the design of graph databases include transactional support, a topic of little relevance to graph streaming frameworks.

What Is the Scope of Related Surveys? There exist several surveys dedicated to the theory of graph databases. In 2008, Angles et al. [6] described the history of graph databases and, in particular, the data models, data structures, query languages, and integrity constraints used. In 2017, Angles et al. [4] analyzed query languages for graph databases in more detail, taking both an edge-labeled and a property graph model into account and studying queries such as graph pattern matching and navigational expressions. In 2018, Bonifati et al. [40] provided an in-depth investigation into querying graphs, focusing on numerous aspects of query specification and execution. Moreover, there are surveys that focus on NoSQL stores [60, 80, 91] and RDF [150]. There is no survey dedicated to the systems aspects of graph databases, except for several brief papers that cover small parts of the domain [110, 112, 117, 120, 135, 136, 152, 156, 170, 189, 190].

2 GRAPHS AND GRAPH MODELS IN THE LANDSCAPE OF GRAPH DATABASES
We start with graph models (§ 2.1) and representations (§ 2.2) used in graph databases. We summarize the key symbols and abbreviations in Table 1.

Symbol          Description
G               A graph $G = (V, E)$ where $V$ is a set of vertices and $E$ is a set of edges.
M               The adjacency matrix of a graph $G = (V, E)$.
n, m            The count of vertices and edges in a graph $G$; $|V| = n$, $|E| = m$.
d̄, d            The average degree and the maximum degree in a given graph, respectively.
P(S) = 2^S      The power set of $S$: a set that contains all possible subsets of $S$.
AM              The Adjacency Matrix representation. $M \in \{0, 1\}^{n,n}$, $M_{u,v} = 1 \Leftrightarrow (u, v) \in E$.
AL, A_u         The Adjacency List representation and the adjacency list of a vertex $u$; $v \in A_u \Leftrightarrow (u, v) \in E$.
LPG, RDF        Labeled Property Graph (§ 2.1.3) and Resource Description Framework (§ 2.1.5).
KV, RDBMS       Key-Value store (§ 4.3) and Relational Database Management Systems (§ 4.6).
OODBMS          Object-Oriented Database Management Systems (§ 4.7).
OLTP, OLAP      Online Transaction Processing (§ 3.6) and Online Analytics Processing (§ 3.6).
ACID            Transaction guarantees (Atomicity, Consistency, Isolation, Durability).

Table 1. The most relevant symbols and abbreviations used in this work.

2.1 Graph Models
We start with graph models used by the surveyed systems.

2.1.1 Simple Graph Model. A graph $G$ can be modeled as a tuple $(V, E)$ where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. $G = (V, E)$ can also be denoted as $G(V, E)$. We have $|V| = n$ and $|E| = m$. For a directed $G$, an edge $e = (u, v) \in E$ is a tuple of two vertices, where $u$ is the out-vertex (also called “source”) and $v$ is the in-vertex (also called “destination”). If $G$ is undirected, an edge $e = \{u, v\} \in E$ is a set of two vertices. Finally, a weighted graph $G$ is modeled with a triple $(V, E, w)$; $w : E \to \mathbb{R}$ maps edges to weights.

A simple graph model is often used in graph processing frameworks such as Pregel [128] or STINGER [70]. However, it is not commonly used with graph databases. Instead, it is a basis for more complex models, such as the Labeled Property Graph.
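The definitions above map directly onto built-in set and dictionary types. The following Python sketch is only an illustration; the integer vertex IDs and the helper name `is_valid_edge_set` are assumptions made for this example and do not come from any surveyed system.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Directed graph: E is a set of ordered pairs (u, v), u = source, v = destination.
V: Set[int] = {1, 2, 3}
E_directed: Set[Tuple[int, int]] = {(1, 2), (2, 3)}

# Undirected graph: each edge is a two-element set {u, v}.
E_undirected: Set[FrozenSet[int]] = {frozenset({1, 2}), frozenset({2, 3})}

# Weighted graph (V, E, w): w maps every edge to a real-valued weight.
w: Dict[Tuple[int, int], float] = {(1, 2): 0.5, (2, 3): 2.0}

n, m = len(V), len(E_directed)  # |V| = n, |E| = m


def is_valid_edge_set(vertices, edges) -> bool:
    """Hypothetical helper: check that E is a subset of V x V."""
    return all(u in vertices and v in vertices for (u, v) in edges)
```

Using `frozenset` for undirected edges makes `{u, v}` and `{v, u}` the same edge, which mirrors the set-based definition.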

2.1.2 Hypergraph Model. A hypergraph $H$ generalizes a simple graph: any of its edges can join any number of vertices. Formally, a hypergraph is also modeled as a tuple $(V, E)$ with $V$ being a set of vertices. $E$ is defined as $E \subseteq \mathcal{P}(V) \setminus \{\emptyset\}$ and it contains hyperedges: non-empty subsets of $V$.

Hypergraphs are rarely used in graph databases and graph processing systems. In this survey, we describe a system called HyperGraphDB (§ 4.3.2) that focuses on storing and querying hypergraphs.
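To make the hyperedge definition concrete, the sketch below stores each hyperedge as a non-empty `frozenset` of vertices. It is a minimal illustration under assumed names (`power_set`, `is_hypergraph`) and is not related to how HyperGraphDB stores data.

```python
from itertools import chain, combinations


def power_set(s):
    """All subsets of s as frozensets, i.e., the power set P(S)."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(c) for c in subsets}


V = {"a", "b", "c"}
# Hyperedges: non-empty subsets of V; one edge may join any number of vertices.
E = {frozenset({"a", "b", "c"}), frozenset({"a", "c"}), frozenset({"b"})}


def is_hypergraph(vertices, hyperedges) -> bool:
    """Check that every hyperedge is a non-empty subset of the vertex set,
    without enumerating the (exponentially large) power set."""
    return all(len(e) > 0 and e <= vertices for e in hyperedges)
```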


2.1.3 Labeled Property Graph Model. The classical graph model, a tuple $G = (V, E)$, is adequate for many problems such as computing vertex centralities [41]. However, it is not rich enough to model various real-world problems. This is why graph databases often use the Labeled Property Graph Model (LPG), sometimes simply called the property graph [4]. In LPG, one augments the simple graph model $(V, E)$ with labels that define different subsets (or classes) of vertices and edges. Furthermore, every vertex and edge can have any number of properties [195] (often also called attributes). A property is a pair (key, value), where key identifies a property and value is the corresponding value of this property [49]. Formally, an LPG is defined as a tuple

$(V, E, L, l_V, l_E, K, W, p_V, p_E)$

$L$ is the set of labels. $l_V : V \mapsto \mathcal{P}(L)$ and $l_E : E \mapsto \mathcal{P}(L)$ are labeling functions. Note that $\mathcal{P}(L)$ is the power set of $L$, denoting all the possible subsets of $L$. Thus, each vertex and edge is mapped to a subset of labels. Next, a vertex as well as an edge can be associated with any number of properties. We model a property as a key-value pair $p = (key, value)$, where $key \in K$ and $value \in W$. $K$ and $W$ are sets of all possible keys and values. Finally, $p_V(u)$ denotes the set of property key-value pairs of the vertex $u$, and $p_E(e)$ denotes the set of property key-value pairs of the edge $e$. An example LPG is in Figure 3. All systems considered in this work use some variant of the LPG, with the exception of RDF systems or when explicitly discussed.
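The LPG tuple can be sketched as a small Python class whose fields follow the symbols of the formal definition. This is an illustrative sketch only: the class, the string identifiers `v1`, `v2`, and `e1`, and the direction of the `knows` edge are assumptions, and storing properties in a dictionary restricts each key to one value per vertex or edge, which is narrower than a general set of key-value pairs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Tuple


@dataclass
class LPG:
    V: Set[str] = field(default_factory=set)                      # vertices
    E: Dict[str, Tuple[str, str]] = field(default_factory=dict)   # edge id -> (source, destination)
    l_V: Dict[str, Set[str]] = field(default_factory=dict)        # vertex -> subset of labels
    l_E: Dict[str, Set[str]] = field(default_factory=dict)        # edge -> subset of labels
    p_V: Dict[str, Dict[str, Any]] = field(default_factory=dict)  # vertex -> {key: value}
    p_E: Dict[str, Dict[str, Any]] = field(default_factory=dict)  # edge -> {key: value}

    def add_vertex(self, v, labels=(), **props):
        self.V.add(v)
        self.l_V[v] = set(labels)
        self.p_V[v] = dict(props)

    def add_edge(self, e, src, dst, labels=(), **props):
        self.E[e] = (src, dst)
        self.l_E[e] = set(labels)
        self.p_E[e] = dict(props)


# Part of the social network example: two persons and a "knows" edge.
g = LPG()
g.add_vertex("v1", labels={"Person"}, name="Alice", age=21)
g.add_vertex("v2", labels={"Person"}, name="Bob", age=24)
g.add_edge("e1", "v1", "v2", labels={"knows"}, since="09.08.2007")
```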

2.1.4 Variants of Labeled Property Graph Model. Several databases support variants of LPG. First, Neo4j [164] (a graph database described in detail in § 4.8.1) supports an arbitrary number of labels for vertices. However, it allows for only one label (called edge-type) per edge. Next, ArangoDB [11] (a graph database described in detail in § 4.4.2) allows for only one label per vertex (vertex-type) and one label per edge (edge-type). This facilitates the separation of vertices and edges into different document collections. Finally, edge-labeled graphs [4] do not allow for any properties and use labels in a restricted way. Specifically, only edges have labels and each edge has exactly one label. Formally, $G = (V, E, L)$, where $V$ is the set of vertices and $E \subseteq V \times L \times V$ is the set of edges. Note that this definition enables two vertices to be connected by multiple edges with different labels.
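The edge-labeled variant is the simplest to write down, because an edge is just a triple. The short sketch below uses made-up vertex and label names and a hypothetical helper `labels_between`; it shows how the definition permits parallel edges with different labels.

```python
V = {"alice", "bob"}
L = {"knows", "follows"}
# Each edge is a triple (u, label, v) with exactly one label and no properties.
E = {("alice", "knows", "bob"), ("alice", "follows", "bob")}


def labels_between(edges, u, v):
    """Labels of all edges that go from u to v."""
    return {label for (a, label, b) in edges if a == u and b == v}
```

Here `labels_between(E, "alice", "bob")` yields both labels, since the two triples differ only in their label component.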

[Figure 3: example LPG with four vertices, :Person (name = Alice, age = 21), :Person (name = Bob, age = 24), :Message:Post (title = Holidays, text = We had...), and :Message:Comment (text = Wow! ...), connected by edges labeled :knows (since = 09.08.2007), :hasCreator (twice), and :replyOf.]
Fig. 3. The illustration of an example Labeled Property Graph (LPG). Vertices and edges can have labels (bold, prefixed with colon) and properties (key = value). We present a subgraph of a social network, where a person can know other persons, post messages, and comment on others’ messages.

2.1.5 Resource Description Framework (RDF). The Resource Description Framework (RDF) [56] is a collection of specifications for representing information. It was introduced by the World Wide Web Consortium (W3C) in 1999 and the latest version (1.1) of the RDF specification was published in 2014. Its goal is to enable a simple format that allows for easy data exchange between different formats of data. It is especially useful as a description of irregularly connected data. The core part of the RDF model is a collection of triples. Each triple consists of a subject, a predicate, and an object. Thus, RDF databases are also often called triple stores (or triplestores). Subjects can either be identifiers (called Uniform Resource Identifiers (URIs)) or blank nodes (which are dummy identifiers for internal use). Objects can be URIs, blank nodes, or literals (which are simple values). With triples, one can connect identifiers with identifiers or identifiers with literals. The connections are named with another URI (the predicate). RDF triples can be formally described as

$(s, p, o) \in (URI \cup blank) \times URI \times (URI \cup blank \cup literal)$

$s$ represents a subject, $p$ models a predicate, and $o$ represents an object. $URI$ is a set of Uniform Resource Identifiers; $blank$ is a set of blank node identifiers, which internally substitute for URIs to allow for more complex data structures; $literal$ is a set of literal values [3, 97, 150].
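The position constraints in the formula above can be checked mechanically. The sketch below is a minimal illustration and not an RDF library: the classes `URI`, `Blank`, and `Literal`, the function `is_valid_triple`, and the `http://example.org/...` identifiers are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Blank:
    label: str


@dataclass(frozen=True)
class Literal:
    value: object


def is_valid_triple(s, p, o) -> bool:
    """Subject: URI or blank node; predicate: URI only;
    object: URI, blank node, or literal."""
    return (isinstance(s, (URI, Blank))
            and isinstance(p, URI)
            and isinstance(o, (URI, Blank, Literal)))


alice = URI("http://example.org/alice")  # hypothetical identifiers
name = URI("http://example.org/name")
triples = {(alice, name, Literal("Alice"))}
```

A literal in the subject position or a blank node in the predicate position is rejected, which matches the three factors of the Cartesian product.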

2.1.6 Transformations between LPG and RDF. To represent a Labeled Property Graph in the RDF model, LPG vertices are mapped to URIs (❶) and then RDF triples are used to link those vertices with their LPG properties by representing a property key and a property value with, respectively, an RDF predicate and an RDF object (❷). For example, for a vertex with an ID vertex-id and a corresponding property with a key property-key and a value property-value, one creates an RDF triple (vertex-id, property-key, property-value). Similarly, one can represent edges from the LPG graph model in the RDF model by giving each edge the URI status (❸), and by linking edge properties with specific edges analogously to vertices: (edge-id, property-key, property-value) (❹). Then, one has to use two triples to connect each edge to any of its adjacent vertices (❺). Finally, LPG labels can also be transformed into RDF triples in a way similar to that of properties [107], by creating RDF triples for vertices (❻) and edges (❼) such that the predicate becomes a “label” URI and contains the string name of this label. Figure 4 shows an example of transforming an LPG graph into RDF triples.
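The seven steps above can be sketched as one short function. This is an illustrative sketch under several assumptions: plain strings stand in for URIs, the input layout and the function name `lpg_to_rdf` are made up for this example, the predicate names "from", "to", and "label" follow the URIs listed in the caption of Figure 4, and the direction of the example edge is assumed.

```python
def lpg_to_rdf(vertices, edges):
    """vertices: {vid: (labels, props)}; edges: {eid: (src, dst, labels, props)}.
    Returns a set of (subject, predicate, object) triples."""
    triples = set()
    for vid, (labels, props) in vertices.items():         # step 1: vid acts as the vertex URI
        for key, value in props.items():                  # step 2: vertex properties
            triples.add((vid, key, value))
        for label in labels:                              # step 6: vertex labels
            triples.add((vid, "label", label))
    for eid, (src, dst, labels, props) in edges.items():  # step 3: eid acts as the edge URI
        for key, value in props.items():                  # step 4: edge properties
            triples.add((eid, key, value))
        triples.add((eid, "from", src))                   # step 5: two triples link the edge
        triples.add((eid, "to", dst))                     #         to its adjacent vertices
        for label in labels:                              # step 7: edge labels
            triples.add((eid, "label", label))
    return triples


vertices = {"V1": ({"Person"}, {"name": "Alice", "age": 21}),
            "V2": ({"Person"}, {"name": "Bob", "age": 24})}
edges = {"E1": ("V1", "V2", {"knows"}, {"since": "09.08.2007"})}
rdf = lpg_to_rdf(vertices, edges)
```

For this input the function emits ten triples: three per vertex (two properties and one label) and four for the edge (one property, two endpoint links, and one label).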

[Figure 4: an LPG graph (two :Person vertices, Alice with age = 21 and Bob with age = 24, connected by a :knows edge with since = 09.08.2007) shown next to the corresponding RDF graph, in which two V-ID nodes of type vertex and one E-ID node of type edge are linked through the predicates “from”, “to”, “name”, “age”, “since”, and “label”; numbered markers 1 to 7 indicate the transformation steps.]
Fig. 4. Comparison of an LPG and an RDF graph: a transformation from LPG to RDF. “V-ID”, “E-ID”, “age”, “name”, “type”, “from”, “to”, “since” and “label” are RDF URIs. Numbers in black circles refer to transformation steps in § 2.1.6.

If all vertices and edges only have one label, one can omit the triples for labels and store the label (e.g., “Person”) together with the vertex or the edge name (“V-ID” and “E-ID”) in the identifier. We illustrate a corresponding example in Figure 5.

Transforming RDF data into the LPG model is more complex, since RDF predicates, which would normally be translated into edges, are URIs. Thus, while deriving an LPG graph from an RDF graph, one must map edges to vertices and link such vertices, otherwise the resulting LPG graph may be disconnected. There are several schemes for such an RDF to LPG transformation, for example deriving an LPG graph which is bipartite, at the cost of an increased graph size [97]. Details and examples are provided in a report by Hayes [97].

[Figure 5: the same LPG graph as in Figure 4 (Alice, Bob, and a :knows edge) shown next to an RDF graph in which the labels are folded into the identifiers “Person/V-ID” and “knows/E-ID”; the nodes are linked through the predicates “from”, “to”, “name”, “age”, “since”, and “type”.]
Fig. 5. Comparison of an LPG and an RDF graph: a transformation from LPG to RDF, given vertices and edges have only one label. “Person/V-ID”, “knows/E-ID”, “age”, “name”, “type”, “from”, “to” and “since” are RDF URIs.

2.2 Fundamental Graph Representations
We also summarize fundamental graph representations. A graph $G$ can be represented in various ways. Three common representations are the adjacency matrix format (AM), the adjacency list format (AL), and the edge list format (EL). We illustrate these representations in Figure 6.

In the AM format, a matrix $M \in \{0, 1\}^{n,n}$ determines the connectivity of vertices: $M_{u,v} = 1 \Leftrightarrow (u, v) \in E$. In the AL format, each vertex $u$ has an associated adjacency list $A_u$. This adjacency list maintains the IDs of all vertices adjacent to $u$. Each adjacency list is often stored as a contiguous array of vertex IDs. We have $v \in A_u \Leftrightarrow (u, v) \in E$.

AM uses $O(n^2)$ space and can check the connectivity of two vertices in $O(1)$ time. AL requires $O(n + m)$ space and can check connectivity in $O(|A_u|) \subseteq O(d)$ time.

EL is similar to AL in the asymptotic time and space complexity as well as the general design. The main difference is that each edge is stored explicitly, with both its source and destination vertex. In AL and EL, a potential cause for inefficiency is scanning all edges to find neighbors of a given vertex. To alleviate this, index structures are employed [35].
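The three representations and their lookup costs can be compared side by side in a few lines. The sketch below is an illustration only: it assumes a directed graph with vertices numbered from 0 to n - 1, and all function names are introduced for this example.

```python
def build_representations(n, edges):
    """Build AM, AL, and EL for a directed graph with vertices 0..n-1."""
    am = [[0] * n for _ in range(n)]  # n x n matrix: O(n^2) space
    al = [[] for _ in range(n)]       # one neighbor array per vertex: O(n + m) space
    for u, v in edges:
        am[u][v] = 1
        al[u].append(v)
    el = sorted(edges)                # one explicit (source, destination) tuple per edge
    return am, al, el


def connected_am(am, u, v):
    return am[u][v] == 1              # O(1): a single cell lookup


def connected_al(al, u, v):
    return v in al[u]                 # O(|A_u|): scan the adjacency list of u


def neighbors_el(el, u):
    # Without an index, finding the neighbors of u scans all m tuples.
    return [dst for src, dst in el if src == u]


am, al, el = build_representations(4, [(0, 1), (0, 2), (2, 3)])
```

The full scan in `neighbors_el` is the inefficiency that index structures (or an offset array over a sorted edge list) are meant to remove.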

Many graph databases (e.g., OrientDB (§ 4.4.1), MS Graph Engine (§ 4.3.1), Dgraph [62], JanusGraph (§ 4.5.1), Neo4j (§ 4.8.1)) use variants of AL since it makes traversing neighborhoods efficient and straightforward [164]. No systems that we analyzed use an uncompressed AM as it is inefficient with $O(n^2)$ space, especially for sparse graphs. Systems using AM focus on compression of the adjacency matrix [30], trying to mitigate storage and query overheads (e.g., GBase § 4.8.3).

2.3 Non-Graph Data Models Used in Graph Databases
There exist data models that do not specifically target graphs but are used in various systems to model and store graphs. These models include collections of key-value pairs, documents, and tuples (used in different types of NoSQL stores), relations and tables (used in traditional relational databases), and objects (used in object-oriented databases). Different details of these models and the database systems based on them are described in other surveys, for example in a recent publication on NoSQL stores by Davoudian et al. [60]. Thus, we omit extensive discussions and instead offer brief summaries, focusing on how they are used to model or represent graphs.

2.3.1 Collection of Key-Value Pairs. Key-value stores are the simplest NoSQL stores [60]. Here, the data is stored as a collection of key-value pairs, with the focus on high-performance and highly scalable lookups based on keys. The exact form of both keys and values depends on a specific system or an application. Keys can be simple (e.g., a URI or a hash) or structured. Values are often encoded as byte arrays (i.e., the structure of values is usually schema-less). However, a key-value store can also impose some additional data layout, structuring the schema-less values [60].

[Figure 6: an example input graph with 16 vertices shown in three representations; n: number of vertices, m: number of edges, d: maximum graph degree. Adjacency Matrix: an n x n matrix; for an unweighted graph a cell is one bit, for a weighted graph a cell is one integer. Adjacency List: pointers from vertices to their neighborhoods; neighborhoods can be arbitrary structures (e.g., arrays or lists) and can be sorted or unsorted; the number of all elements in the adjacency data structure is 2m (undirected graph) and m (directed graph). Edge List (sorted and unsorted): one tuple corresponds to one edge; the number of tuples is 2m (undirected) or m (directed); the offset array is optional.]
Fig. 6. Illustration of fundamental graph representations: Adjacency Matrix, Adjacency List, and Edge List.

Due to the general nature of key-value stores, there can be many ways of representing a graph as a collection of key-value pairs. For example, one can use vertex labels as keys and encode the neighborhoods of vertices as values. We describe several concrete example systems in § 4.3.
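One possible encoding of this idea is sketched below, with a Python dictionary standing in for the key-value store. The byte-string keys, the JSON encoding of neighborhoods, and the function names are assumptions for this illustration and do not reflect the layout of any system in § 4.3.

```python
import json

store = {}  # stand-in for a key-value store with opaque byte-string keys and values


def put_neighbors(vertex_id, neighbors):
    """Key: the vertex identifier; value: its serialized neighborhood."""
    store[vertex_id.encode()] = json.dumps(sorted(neighbors)).encode()


def get_neighbors(vertex_id):
    raw = store.get(vertex_id.encode())
    return json.loads(raw) if raw is not None else []


put_neighbors("alice", {"bob", "carol"})
put_neighbors("bob", {"alice"})
```

Because the store sees values only as bytes, retrieving a neighborhood is a single key lookup, while any filtering inside the neighborhood happens in the application after decoding.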

2.3.2 Collection of Documents. A document is a fundamental storage unit in a class of NoSQL databases called document stores [60]. These documents are stored in collections. Multiple collections of documents constitute a database. A document is encoded using a selected standard semi-structured format, e.g., JSON [42] or XML [43]. Document stores extend key-value stores in that a document can be seen as a value that has a certain flexible schema. This schema consists of attributes, where each attribute has a name along with one or more values. Such a structure based on documents with attributes allows for various value types, key-value pair storage, and recursive data storage (attribute values can be lists or key-value dictionaries).

In all surveyed systems, each vertex is stored in a vertex document. The capability of documents to store key-value pairs is used to store vertex labels and properties within the corresponding vertex document. The details of edge storage, however, are system-dependent: edges can be stored in the document corresponding to the source vertex of each edge, or in the documents of the destination vertices. As documents do not impose any restriction on what key-value pairs can be stored, vertices and edges may have different sets of properties.
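As a minimal sketch of one such layout, the example below keeps each edge inside the document of its source vertex. The attribute names (`labels`, `out_edges`, `to`) and the helper `out_neighbors` are hypothetical; real document stores choose their own conventions.

```python
import json

# A collection of vertex documents; here edges live in the source vertex's document.
vertices = {
    "v1": {"labels": ["Person"], "name": "Alice", "age": 21,
           "out_edges": [{"label": "knows", "to": "v2", "since": "09.08.2007"}]},
    "v2": {"labels": ["Person"], "name": "Bob", "age": 24, "out_edges": []},
}


def out_neighbors(collection, vid):
    """IDs of the vertices reachable over the edges stored with vertex vid."""
    return [edge["to"] for edge in collection[vid]["out_edges"]]


# Documents are encoded in a semi-structured format such as JSON.
serialized = json.dumps(vertices["v1"])
```

Nested lists and dictionaries carry the edge properties, which relies on the recursive data storage that documents allow.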

2.3.3 Collection of Tuples. Tuples are a basis of NoSQL stores called tuple stores. A tuple store generalizes an RDF store: RDF stores are restricted to triples (or, in some cases, 4-tuples, also referred to as quads) whereas tuple stores can contain tuples of an arbitrary size. Thus, the number of elements in a tuple is not fixed and can vary, even within a single database. Each tuple has an ID which may also be a direct memory pointer.

A collection of tuples can model a graph in different ways. For example, one tuple of size n can store pointers to other tuples that contain neighborhoods of vertices. The exact mapping between such tuples and graph data is specific to different databases; we describe examples in § 4.2.

2.3.4 Collection of Tables. Tables are the basis of Relational Database Management Systems (RDBMS) [15, 55, 98]. Tables consist of rows and columns. Each row represents a single data element, for example a car. A single column usually defines a certain data attribute, for example the color of a car. Some columns can define unique IDs of data elements, called primary keys. Primary keys can be used to implement relations between data elements. A one-to-one or a one-to-many relation can be implemented with a single additional column that contains the copy of a primary key of the related data element (such a primary key copy is called the foreign key). A many-to-many relation can be implemented with a dedicated table containing foreign keys of related data elements.

To model a graph as a collection of tables, one can implement vertices and edges as rows in two separate tables. Each vertex has a unique primary key that constitutes its ID. Edges can relate to their source or destination vertices by referring to their primary keys (as foreign keys). LPG labels and properties, as well as RDF predicates, can be modeled with additional columns [196, 198].
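The two-table scheme described above can be sketched with SQLite from the Python standard library. The table and column names (`vertices`, `edges`, `src`, `dst`) and the helper `out_neighbors` are assumptions made for this illustration; real systems differ in how they model labels and properties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vertices (
        id    INTEGER PRIMARY KEY,  -- primary key serves as the vertex ID
        label TEXT,                 -- LPG label modeled as a column
        name  TEXT                  -- one LPG property modeled as a column
    );
    CREATE TABLE edges (
        id    INTEGER PRIMARY KEY,
        src   INTEGER NOT NULL REFERENCES vertices(id),  -- foreign key
        dst   INTEGER NOT NULL REFERENCES vertices(id),  -- foreign key
        label TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO vertices VALUES (?, ?, ?)",
    [(1, "Person", "Alice"), (2, "Person", "Bob")],
)
conn.execute("INSERT INTO edges VALUES (1, 1, 2, 'knows')")


def out_neighbors(conn, vertex_id):
    """Resolve outgoing edges of a vertex with a join over the foreign keys."""
    rows = conn.execute(
        "SELECT v.name FROM edges e JOIN vertices v ON v.id = e.dst "
        "WHERE e.src = ? ORDER BY v.id",
        (vertex_id,),
    )
    return [name for (name,) in rows]
```

In this sketch every traversal step is a join between the edge table and the vertex table.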

2.3.5 Collection of Objects. Finally, one can use collections of objects in Object-Oriented Database Management Systems (OODBMS) [14] to model graphs. Here, data elements and their relations are implemented as objects linked with some form of pointers. The details of modeling graphs as objects heavily depend on specific designs. We provide these details for selected systems in § 4.7.

3 TAXONOMY OF GRAPH DATABASE SYSTEMS

We now describe how we categorize graph database systems considered in this survey. The main identified aspects are: (1) general types of database systems (§ 3.1), (2) supported data models (§ 3.2), (3) used forms of data organization (§ 3.3), (4) enabled methods for distributing data (§ 3.4), (5) provided schemes for query execution (§ 3.5), and (6) supported transaction types (§ 3.6). Figure 7 illustrates the general types of considered databases together with certain aspects of data models and organization. Figure 8 summarizes all elements of the proposed taxonomy.

3.1 Types of Graph Database Systems

We survey existing graph database systems [2, 9, 11, 12, 16, 37, 45, 47, 58, 62, 77, 79, 86, 105, 113, 122, 130, 132, 139, 143, 146, 147, 157, 161–164, 175, 186, 193, 194, 196, 197] and we identify types of graph databases that differ in their general design. Some systems use a certain backend technology, adapting this backend to storing graph data, and adding a frontend to query the graph data. Examples of such systems are ❶ RDF systems (also called triple stores), ❷ tuple stores, ❸ document stores, ❹ key-value stores, ❺ wide-column stores, ❻ Relational Database Management Systems (RDBMS), or ❼ Object-Oriented Database Management Systems (OODBMS). Some of these categories fall into the domain of NoSQL stores (❶ – ❺; RDF systems can also be implemented as, e.g., RDBMS). Some graph databases are designed specifically for maintaining and querying graphs; we call such systems ❽ native graph databases (or native graph stores); they are also often categorized as NoSQL stores [60]. Finally, we consider a group of designs called the ❾ data hub; they enable using many different storage backends, facilitating storing data in different formats and models. Figure 7 illustrates these systems; they are discussed in more detail in § 4.

3.2 Data Models

We also investigate what data models are supported by different graph databases. Here, we use the models described in § 2. In addition, we call a system Multi Model if it allows for more than one data model, for example when it directly supports both LPG and RDF.

3.3 Data Organization

Next, while surveying databases, we consider different aspects of data organization.

3.3.1 Dividing Data into Records. Many systems (called record based) divide data into small units called records. A certain number of records is kept together in one contiguous block in memory or on disk to enhance data access locality. Some systems allow variable-sized records (e.g., ArangoDB), others only enable fixed-sized records (e.g., Neo4j). Key-value stores usually maintain a single value in a single record, while in document stores a document is a record.

Systems that use records for graph storage often use one or more records per vertex (these records are sometimes referred to as vertex records). Neo4j uses multiple fixed-size records for vertices, while document databases use one document per vertex (e.g., ArangoDB). Edges are sometimes stored in the same record together with the associated (source or destination) vertices (e.g., Titan or JanusGraph). Otherwise, edges are stored in separate edge records (e.g., ArangoDB).
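A generic property of fixed-size records, not stated in the text above and not tied to any surveyed system, is that the byte offset of a record follows directly from its slot number, so records in a contiguous block can be located without a lookup structure. The sketch below illustrates this with a hypothetical 12-byte vertex record; the field layout (vertex ID, ID of the first edge, label ID) is an assumption made for this example.

```python
import struct

# Hypothetical fixed-size vertex record: three little-endian unsigned 32-bit
# integers (vertex ID, ID of the first edge, label ID), 12 bytes in total.
RECORD = struct.Struct("<III")


def write_record(block, slot, vertex_id, first_edge_id, label_id):
    """Write a record at the offset implied by its slot number."""
    RECORD.pack_into(block, slot * RECORD.size, vertex_id, first_edge_id, label_id)


def read_record(block, slot):
    """Read the record stored in the given slot of the contiguous block."""
    return RECORD.unpack_from(block, slot * RECORD.size)


# One contiguous block that holds four records.
block = bytearray(RECORD.size * 4)
write_record(block, 2, 42, 7, 1)
```

With variable-sized records, the same access would need either a scan or an additional structure that records where each record starts.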

3.3.2 Storing Data in Index Structures. Some databases (called index based) store data in index structures that are used to maintain locations of different parts of the data for faster data access. In such cases, the index does not point to a certain data record but the index itself contains the desired data. Example systems with such functionality are Sparksee/DEX and Cray Graph Engine. To maintain indices, the former uses bitmaps and B+ trees while the latter uses hash tables.

3.3.3 Enabling Lightweight Edges. Some systems (e.g., OrientDB) allow edges without labels or properties to be stored as lightweight edges. Such edges are stored in the records of the corresponding source and/or destination vertices. These lightweight edges are represented by the ID of their destination vertex, or by a pointer to this vertex. This can save storage space and accelerate resolving different graph queries such as verifying connectivity of two vertices [46].

3.3.4 Linking Records with Direct Pointers. In record based systems, vertices and edges are stored in records. To enable efficient resolution of connectivity queries (i.e., verifying whether two vertices are connected), these records have to point to other records. One option is to store direct pointers (i.e., memory addresses) to the respective connected records. For example, an edge record can store direct pointers to vertex records with adjacent vertices. Another option is to assign each record a unique ID and use these IDs instead of direct pointers to refer to other records. On the one hand, this requires an additional indexing structure to find the physical location of a record based on its ID. On the other hand, if the physical location changes, it is usually easier to update the indexing structure instead of changing all associated direct pointers.


[Figure 7: one panel per system category, each with the model used, a sketch of the data organization, and example systems.
RDF Store (Triple Store): subject URIs are linked to object URIs via predicates. Examples: AllegroGraph, Cray Graph Engine.
Tuple Store: vertices and edges are stored in tuples, linked via pointers or IDs of other tuples. Examples: WhiteDB, Graphd.
Document Store: vertices and edges are encoded in documents (e.g., JSON) and linked via pointers or document IDs. Examples: OrientDB, ArangoDB, Azure Cosmos DB, FaunaDB.
Key-Value Store: vertices and edges are encoded in values and indexed by keys (IDs). Examples: Dgraph, HyperGraphDB, MS Graph Engine.
Wide-Column Store: a vertex is stored in a row and it is indexed by a unique ID; its properties, labels, and adjacent edges are stored in row cells. Examples: Titan, JanusGraph, DSE Graph.
Row RDBMS: vertices and edges are stored in rows of two row-oriented tables. Example: Oracle Spatial and Graph.
Column RDBMS: vertices and edges are stored in rows of two column-oriented tables. Example: SAP HANA.
Object-Oriented DBMS: vertices and edges are stored in Java, C#, ... language objects. Examples: Objectivity ThingSpan, VelocityGraph.
Native Graph Store: custom database systems, optimized for graph storage and traversal queries. Examples: Sparksee/DEX, TigerGraph, GraphBase, Memgraph, Neo4j, PGX.
Data Hub: combines multiple models and/or storage schemes. Examples: Cayley, InfoGrid, MarkLogic, OpenLink Virtuoso, Stardog.]

Fig. 7. Overview of different categories of graph database systems, with examples.


[Figure 8: taxonomy tree with the following branches and options.
Types of Systems (What is a general type of a database?): RDF store, tuple store, document store, key-value store, wide-column store, RDBMS, OODBMS, native graph store, data hub.
Data Models (What models of graph data are supported?): RDF triples, LPG, tuples, documents, key-value pairs, tables, objects, hypergraph, multiple models.
Data Organization: Types of Records (fixed sized, variable sized); Linking Records (with direct pointers, with IDs or references); Index Structures (yes, no); Representations (adjacency matrix, adjacency list, edge list); Lightweight Edges (yes, no); Edge Records (within vertex records, within edge records).
Data Distribution: Distributed Mode (yes, no); Data Replication (yes, no); Data Sharding (yes, no).
Query Execution: Concurrent Execution (yes, no); Parallelization (yes, no).
Transaction Support: ACID (yes fully, yes partially, no); Processing Type (OLAP, OLTP, OLAP & OLTP).
Language Support: SPARQL, Gremlin, Cypher, SQL, GraphQL, other.]

Fig. 8. Overview of the identified taxonomy of graph databases.


A given system can also use direct pointers to avoid maintaining an additional dedicated indexing structure to traverse the graph. Note that an index may still be used to find a vertex; using direct pointers in this context means that only the structure of the adjacency data has no additional index. Using direct pointers can accelerate graph traversals [164], as additional index traversals are avoided. However, when the adjacency data needs to be updated, usually a large number of pointers need to be updated as well, generating additional overhead [12].
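The trade-off between direct pointers and record IDs can be illustrated with a small in-memory sketch. Python object references play the role of direct pointers and a dictionary plays the role of the indexing structure; the class `Record` and the helper names are assumptions made for this example.

```python
class Record:
    """Hypothetical vertex record that links to neighbors in two ways."""

    def __init__(self, record_id):
        self.record_id = record_id
        self.neighbor_ptrs = []  # direct references to other Record objects
        self.neighbor_ids = []   # record IDs, resolved through an index


# Indexing structure: record ID -> current location (here, the object itself).
index = {}


def add_record(record_id):
    record = Record(record_id)
    index[record_id] = record
    return record


def neighbors_via_pointers(record):
    """Follow direct pointers; no index lookup is needed."""
    return list(record.neighbor_ptrs)


def neighbors_via_ids(record, index):
    """Resolve each neighbor ID through the index (one extra lookup per hop)."""
    return [index[neighbor_id] for neighbor_id in record.neighbor_ids]


a, b = add_record(1), add_record(2)
a.neighbor_ptrs.append(b)
a.neighbor_ids.append(2)

# Simulate relocating record 2: only the index entry is updated.
b_moved = Record(2)
index[2] = b_moved
```

After the relocation, the ID-based link resolves to the new location, whereas the direct pointer stored in record 1 still refers to the old object and would have to be rewritten along with every other pointer to record 2.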

3.4 Data Distribution

We call a system distributed or multi-server if it can run on multiple servers (also called compute nodes) connected with a network. In such systems, data may be replicated [81] (maintaining copies of the dataset at each server), or the system may allow for sharding [74] (data fragmentation, i.e., storing only a part of the given dataset on one server). Replication often allows for more fault tolerance [73], while sharding reduces the amount of used memory per node and can improve performance [73].
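The difference between the two placement schemes can be sketched as follows. Hash-based placement is only one possible sharding policy and is an assumption of this example, as are the function names; the surveyed systems may use other partitioning schemes.

```python
import zlib


def shard_of(vertex_id, num_servers):
    """Pick a server for a vertex with a deterministic hash (assumed policy)."""
    return zlib.crc32(str(vertex_id).encode("utf-8")) % num_servers


def place_sharded(vertex_ids, num_servers):
    """Sharding: every server stores only a part of the dataset."""
    servers = [set() for _ in range(num_servers)]
    for vertex_id in vertex_ids:
        servers[shard_of(vertex_id, num_servers)].add(vertex_id)
    return servers


def place_replicated(vertex_ids, num_servers):
    """Replication: every server stores a full copy of the dataset."""
    return [set(vertex_ids) for _ in range(num_servers)]


vertices = list(range(100))
sharded = place_sharded(vertices, 4)
replicated = place_replicated(vertices, 4)
```

In this sketch the sharded placement stores each vertex exactly once across all servers, while the replicated placement stores it once per server, which mirrors the memory versus fault-tolerance trade-off described above.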

3.5 Query Execution

Query execution on multiple servers can be enabled in various ways. We define concurrent execution as the execution of separate queries at the same time. Concurrent execution of queries can lead to higher throughput. We also define parallel execution as the parallelized execution of a single query, possibly on more than one server or compute node. Parallel execution can lead to lower latencies for queries that can be parallelized.
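The two definitions can be contrasted with a toy example on a single machine. The edge list, the out-degree query, and the function names are assumptions made for this sketch, and Python threads are used only to show the structure of the two schemes, not to demonstrate actual speedups.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy edge list of (source, destination) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]


def out_degree(vertex, edge_list):
    """A simple query: count the outgoing edges of one vertex."""
    return sum(1 for source, _ in edge_list if source == vertex)


def concurrent_queries(vertices):
    """Concurrent execution: several separate queries run at the same time."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda v: out_degree(v, edges), vertices))


def parallel_query(vertex, num_parts=2):
    """Parallel execution: one query is split over partitions of the data."""
    parts = [edges[i::num_parts] for i in range(num_parts)]
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partial_counts = pool.map(lambda part: out_degree(vertex, part), parts)
        return sum(partial_counts)
```

The first function improves throughput by overlapping independent queries, while the second targets the latency of a single query by combining partial results.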

3.6 Types of Transactions

Many graph database systems support transactions.

3.6.1 Support for ACID. ACID [100] (Atomicity, Consistency, Isolation, Durability) is a well-known set of properties that database transactions uphold in many database systems. Different graph databases explicitly ensure some or all of the ACID properties.

3.6.2 Support for OLTP vs. OLAP. Some systems are oriented towards Online Transaction Processing (OLTP) [155], where the focus is on executing many smaller, interactive, transactional queries. Other systems are designed for Online Analytical Processing (OLAP) [155]: they execute analytics queries that span whole graphs, usually taking more time than OLTP operations. Analytics queries are often parallelized to minimize their latency. In this survey, we focus on OLTP systems and not on OLAP graph processing engines. However, some of the systems we survey also provide certain OLAP capabilities, in addition to their OLTP functionalities.

3.7 Query Language Support

Although we do not focus on graph database languages, we report which query languages are supported by each considered graph database system. We consider the leading languages such as SPARQL [153], Gremlin [165], Cypher [78, 88, 101], and SQL [59]. We also mention other system-specific languages such as GraphQL [96] and support for APIs from languages such as C++ or Java (see footnote 2). Note that mapping graph queries to SQL was also addressed in past work [182].

4 DATABASE SYSTEMS

We survey and describe selected graph database systems with respect to the proposed taxonomy. In each system category, we describe several example systems, focusing on the associated graph model, data and storage organization, data distribution, and query execution. We select the described

2 We bring to the reader’s attention a manifesto on creating GQL, a standardized graph query language (https://gql.today).


Columns: Graph Database System; Model; Records; Storing Edges; Data Distribution & Query Execution; Additional remarks.
Sub-columns: MM, LPG, RDF, FS, VS, DP, AL, SE, SV, LW, MN, RP, SH, CE, PE, TR, OLTP, OLAP.

❶ RDF STORES (TRIPLE STORES) (§ 4.1). The main data model used: RDF triples (§ 2.1.5).
BlazeGraph [37] ∗ ∗ ? ? ? ? ? ? ∗BlazeGraph uses RDF*, an extension of RDF (details in § 4.1).
Cray Graph Engine [162] ∗ ∗ ∗RDF triples are stored in hashtables.
AllegroGraph [79] ∗ ? ∗Triples are stored as integers (RDF strings are mapped to integers).
Amazon Neptune [2] ∗ ? ? ∗LPG is enabled via Gremlin [166].
AnzoGraph [47] ∗ ? ? ∗Data model and schema support SPARQL.
Apache Marmotta [9] ∗ ? ? ? ∗The structure of data records is based on that of different RDBMS systems (H2 [142], PostgreSQL [141], MySQL [68]).
BrightstarDB [143] ? ? ? ? ? ? ? —
Ontotext GraphDB [146] ? ? ? ? —
Profium Sense [157] ∗ ? ? ? ? ? Profium Sense is “AI powered”. ∗The format used is called JSON-LD: JSON for vertices and RDF for edges.
TripleBit [197] ∗ ‡ ? ? The data organization uses compression. ∗Strings are mapped to variable size integers. ‡Described as future work.

❷ TUPLE STORES (§ 4.2). The main data model used: tuples (§ 2.3.3).
WhiteDB [194] ∗ ∗ ‡ ‡ ‡ ‡ † ? ∗Implicit support for triples of integers. ‡Implementable by the user. †Transactions use a global shared/exclusive lock.
Graphd [86] ∗ ? ? ? ? ? ‡ ? ? Backend of Google Freebase. ∗Implicit support for triples. ‡Subset of ACID.

❸ DOCUMENT STORES (§ 4.4). The main data model used: documents (§ 2.3.2).
ArangoDB [11] ∗ ∗Uses a hybrid index for retrieving edges.
OrientDB [45] ∗ ‡ ∗AL contains RIDs (i.e., physical locations) of edge and vertex records. ‡Sharding is defined by the user. OrientDB supports JSON and it offers certain object-oriented capabilities.
Azure Cosmos DB [139] ? —
Bitsy [122] The system is disk based and uses JSON files. The storage only allows for appending data.
FaunaDB [77] ∗ ‡ ∗Document, RDBMS, graph, and “time series”. ‡Adjacency lists are separately precomputed.

❹ KEY-VALUE STORES (§ 4.3). The main data model used: key-value pairs (§ 2.3.1).
HyperGraphDB [105] ∗ ‡ † ∗A Hypergraph model. ‡The system uses an incidence index to retrieve edges of a vertex. †Support for ACI only.
MS Graph Engine [175] ∗ ‡ ∗Schema is defined by Trinity Specification Language (TSL). ‡AL contains IDs of edges and/or vertices.
Dgraph [62] Dgraph is based on Badger [61].
RedisGraph [158, 161, 179] ∗ ‡ RedisGraph is based on Redis [160]. ‡The OLAP part uses GraphBLAS [115].

❺ WIDE-COLUMN STORES (§ 4.5). The main data model used: key-value pairs and tables (§ 2.3.1, § 2.3.4).
JanusGraph [16] JanusGraph is the continuation of Titan.
Titan [16] Titan enables various backends (e.g., Cassandra [121]).
DSE Graph (DataStax) [58] ∗ ? ‡ DSE Graph is based on Cassandra [121]. ∗Enabled by Gremlin query language. ‡Support for AID, Consistency is configurable.
HGraphDB [163] ? ? ∗ HGraphDB uses TinkerPop3 with HBase [82]. ∗ACID is supported only within a row.

Table 2. Comparison of graph databases. Bolded systems are described in more detail in the corresponding sections. MM: A system is multi model. LPG, RDF: A system supports, respectively, the Labeled Property Graph and RDF without prior data transformation. FS, VS: Data records are fixed size and variable size, respectively. DP: A system can use direct pointers to link records. This enables storing and traversing adjacency data without maintaining indices. AL: Edges are stored in the adjacency list format. SE: Edges can be stored in a separate edge record. SV: Edges can be stored in a vertex record. LW: Edges can be lightweight (containing just a vertex ID or a pointer, both stored in a vertex record). MN: A system can operate in a Multi Server (distributed) mode. RP: Given a distributed mode, a system enables Replication of datasets. SH: Given a distributed mode, a system enables Sharding of datasets. CE: Given a distributed mode, a system enables Concurrent Execution of multiple queries. PE: Given a distributed mode, a system enables Parallel Execution of single queries on multiple nodes/CPUs. TR: Support for ACID Transactions. OLTP: Support for Online Transaction Processing. OLAP: Support for Online Analytical Processing. Symbols indicate that a system offers a given feature, offers a given feature in a limited way, or does not offer a given feature. “?”: Unknown.


Columns: Graph Database System; Model; Records; Storing Edges; Data Distribution & Query Execution; Additional remarks.
Sub-columns: MM, LPG, RDF, FS, VS, DP, AL, SE, SV, LW, MN, RP, SH, CE, PE, TR, OLTP, OLAP.

❻ RELATIONAL DBMS (RDBMS) (§ 4.6). The main data model used: tables (implementing relations) (§ 2.3.4).
Oracle Spatial & Graph [148] ∗ ?∗ ?∗ ∗ ∗LPG and RDF use row-oriented storage. The system can also run on top of PGX [102] (effectively as a native graph database).
AgensGraph [36] ? ? ? AgensGraph is based on PostgreSQL.
FlockDB [188] ? ? ? FlockDB is based on MySQL. The system focuses on “shallow” graph queries, such as finding mutual friends.
MS SQL Server 2017 [140] ? ? ? The system uses an SQL graph extension.
OQGRAPH [129] ?∗ ?∗ ∗ ? OQGRAPH uses MariaDB [19]. ∗OQGRAPH uses row-oriented storage.
SAP HANA [171] ∗ ∗ ∗ ∗SAP HANA is column-oriented, edges and vertices are stored in rows. SAP HANA can be used with a dedicated graph engine [169] and it offers certain capabilities of a JSON document store [171].

❼ OBJECT-ORIENTED DATABASES (OODBMS) (§ 4.7). The main data model used: objects (§ 2.3.5).
VelocityGraph [193] The system is based on VelocityDB [192].
Objectivity ThingSpan [145] ? ? ? ? ? ? The system is based on ObjectivityDB [90].

❽ NATIVE GRAPH DATABASES (§ 4.8). The main data model used: LPG (§ 2.1.3, § 2.1.4).
Neo4j [164] Neo4j is provided as a cloud service by a system called Graph Story [87].
GBase [113] ∗ ‡ ? ? ? ? ? ∗GBase supports simple graphs only (§ 2.1.1). ‡GBase stores the AM sparsely.
Sparksee/DEX [132] ∗ ∗ ‡ ∗The system uses maps only. ‡Bitmaps are used for connectivity.
GraphBase [76] ∗ ? ? ? ? ? ? ? ? ∗No support for edge properties, only two types of edges available.
Memgraph [138] ? ? ? ? ? ? ∗ ‡ ∗This feature is under development. ‡Available only for some algorithms.
TigerGraph [187] ? ? ? ? ? ? ? —
Weaver [67] ? ? ? ? ? ? ? —

❾ DATA HUBS (§ 4.9). The main data model used: several different ones.
MarkLogic [130] ∗ ? ∗ ? Supported storage/models: relational tables, RDF, various documents. ∗Vertices are stored as documents, edges are stored as RDF triples.
OpenLink Virtuoso [147] ∗ ? ? ‡ Supported storage/models: relational tables and RDF triples. ‡This feature can be used with relational data only.
Cayley [50] ? ? ? ? ? ? ? ∗ ? ? Supported storage/models: relational tables, RDF, document, key-value. ∗This feature depends on the backend.
InfoGrid [104] ? ? ? ? ∗ ? ? Supported storage/models: relational tables, Hadoop’s filesystem, grid storage. ∗A weaker consistency model is used instead of ACID.
Stardog [181] ∗ ∗ ? ∗ ? Supported storage/models: relational tables, documents. ∗RDF is simulated on relational tables.

Table 3. Comparison of graph databases. Bolded systems are described in more detail in the corresponding sections. MM: A system is multi model. LPG, RDF: A system supports, respectively, the Labeled Property Graph and RDF without prior data transformation. FS, VS: Data records are fixed size and variable size, respectively. DP: A system can use direct pointers to link records. This enables storing and traversing adjacency data without maintaining indices. AL: Edges are stored in the adjacency list format. SE: Edges can be stored in a separate edge record. SV: Edges can be stored in a vertex record. LW: Edges can be lightweight (containing just a vertex ID or a pointer, both stored in a vertex record). MN: A system can operate in a Multi Server (distributed) mode. RP: Given a distributed mode, a system enables Replication of datasets. SH: Given a distributed mode, a system enables Sharding of datasets. CE: Given a distributed mode, a system enables Concurrent Execution of multiple queries. PE: Given a distributed mode, a system enables Parallel Execution of single queries on multiple nodes/CPUs. TR: Support for ACID Transactions. OLTP: Support for Online Transaction Processing. OLAP: Support for Online Analytical Processing. Symbols indicate that a system offers a given feature, offers a given feature in a limited way, or does not offer a given feature. “?”: Unknown.


Columns: Graph Database System; Graph database query language (SPARQL, Gremlin, Cypher, SQL, GraphQL, Progr. API); Other languages and additional remarks.

❶ RDF STORES (TRIPLE STORES) (§ 4.1).
AllegroGraph – – – – – –
Amazon Neptune – – – – –
AnzoGraph – – – – –
Apache Marmotta – – – – – Apache Marmotta also supports its native LDP and LDPath languages.
BlazeGraph – – – – –
BrightstarDB – – – – – –
Cray Graph Engine – – – – – –
Ontotext GraphDB – – – – – –
Profium Sense – – – – – –
TripleBit – – – – – –

❷ TUPLE STORES (§ 4.2).
Graphd – – – – – – Graphd uses MQL [86].
WhiteDB – – – – – (C, Python) –

❸ DOCUMENT STORES (§ 4.4).
OrientDB ∗ – ∗∗ ∗An SQL extension for graph queries. ∗∗OrientDB offers bindings to Java, C, JS, PHP, .NET, Python, and others.
ArangoDB – – – – – ArangoDB uses AQL (ArangoDB Query Language).
Azure Cosmos DB – – – – –
Bitsy – – – – – Bitsy also supports other Tinkerpop-compatible languages such as SQL2Gremlin and Pixy.
FaunaDB – – – – – –

❹ KEY-VALUE STORES (§ 4.3).
MS Graph Engine – – – – – – MS Graph Engine uses LINQ [175].
HyperGraphDB – – – – – (Java) –
Dgraph – – – – ∗ – ∗A variant of GraphQL.
RedisGraph – – – – – –

❺ WIDE-COLUMN STORES (§ 4.5).
Titan – – – – – –
JanusGraph – – – – – –
DSE Graph (DataStax) – – – – – DSE Graph also supports CQL [58].
HGraphDB – – – – – –

❻ RELATIONAL DBMS (RDBMS) (§ 4.6).
Oracle Spatial and Graph – – ∗ – – ∗PGQL [191], an SQL-like graph query language.
MS SQL Server 2017 – – – ∗ – – ∗Transact-SQL.
SAP HANA – – – ∗ – ∗∗ ∗SAP HANA offers bindings to Rust, ODBC, and others. ∗∗GraphScript, a domain-specific graph query language.
FlockDB – – – – – FlockDB uses the Gizzard framework and MySQL.
AgensGraph – – ∗ ∗∗ – – ∗A variant called openCypher [89, 131]. ∗∗ANSI-SQL.
OQGRAPH – – – – – –

❼ OBJECT-ORIENTED DATABASES (OODBMS) (§ 4.7).
Objectivity ThingSpan – – – – – – Objectivity ThingSpan uses a native DO query language [145].
VelocityGraph – – – – – (.NET) –

❽ NATIVE GRAPH DATABASES (§ 4.8).
Neo4j – ∗ – ∗∗ ∗∗∗ ∗Gremlin is supported as a part of TinkerPop integration. ∗∗GraphQL supported with the GRANDstack layer. ∗∗∗Neo4j can be embedded in Java applications.
Gbase – – – – – –
GraphBase – – – – – – GraphBase uses its native query language.
Memgraph – – ∗ – – – ∗openCypher.
Sparksee/DEX – – – (.NET)∗ – ∗Sparksee/DEX also supports C++, Python, Objective-C, and Java APIs.
TigerGraph – – – – – – TigerGraph uses GSQL [187].
Weaver – – – – – (Python) –

❾ DATA HUBS (§ 4.9).
Cayley – ∗ – – – ∗Cayley supports Gizmo, a Gremlin dialect [50]. Cayley also uses MQL [50].
MarkLogic – – – – – – MarkLogic uses XQuery [38].
OpenLink Virtuoso – – – – OpenLink Virtuoso also supports XQuery [38], XPath v1.0 [54], and XSLT v1.0 [114].
Stardog ∗ – – – ∗Stardog supports the PathQuery extension [181].

Table 4. Support for different graph database query languages in different graph database systems. “Progr. API” determines whether a given system supports formulating queries using some native programming language such as C++. Symbols indicate that a system supports a given language or supports a given language in a limited way. “–”: A system does not support a given language.


systems based on the DB-Engines Ranking [178] and the availability of technical details about the design of the given system. Tables 2 and 3 illustrate the details of different graph database systems, including the ones described in this section. The tables indicate which features are supported by which systems. We use three symbols to indicate that a given system offers a given feature, offers a given feature in a limited way, and does not offer a given feature, respectively. “?” indicates we were unable to infer this information based on the available documentation (see footnote 3). We report the support for different graph query languages in Table 4.

4.1 RDF Stores (Triple Stores)

RDF stores, also called triple stores, implement the Resource Description Framework (RDF) model (§ 2.1.5). These systems organize data into triples. We now describe in more detail a selected recent RDF store, Cray Graph Engine (§ 4.1.1). We also provide more details on two other systems, AllegroGraph and BlazeGraph, focusing on variants of the RDF model used in these systems (§ 4.1.2).

4.1.1 Cray Graph Engine. In a recent paper [162], Cray presented its Cray Graph Engine (CGE), which is able to load, maintain, and query databases with a trillion RDF triples. The architecture of CGE is based on Coarray C++ [184], Cray’s language and infrastructure that makes memory resources of multiple compute nodes available as a single global address space. This facilitates parallel programming in a distributed environment [26, 83, 173].

CGE does not store triples but quads (4-tuples), where the fourth element is a graph ID. Thus, one can store multiple graphs in one CGE database. CGE optimizes the way in which it stores strings from its triples/quads. Storing multiple long strings per triple/quad would be inefficient, considering the fact that many triples/quads may share strings. Therefore, CGE maintains a dictionary that maps strings to unique 48-bit integer identifiers (HURIs). For this, two distributed hashtables are used (one for mapping strings to HURIs and one for mapping HURIs to strings). In the loading process, the strings are sorted and then assigned to HURIs. This allows integer comparisons (equal, greater, smaller, etc.) to be used instead of more expensive string comparisons.
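The dictionary encoding described above can be sketched on a single machine as follows. This is a simplified illustration and not CGE's implementation: two Python dictionaries stand in for the two distributed hashtables, consecutive integers stand in for HURIs, and the function and variable names are assumptions made for this example.

```python
def build_dictionary(strings):
    """Sort the distinct strings and assign consecutive integer IDs.

    Because IDs follow the sorted order, comparing two IDs gives the same
    result as comparing the two strings they stand for.
    """
    ordered = sorted(set(strings))
    str_to_id = {s: i for i, s in enumerate(ordered)}
    id_to_str = {i: s for s, i in str_to_id.items()}
    return str_to_id, id_to_str


# Quads: (subject, predicate, object, graph ID), all given as strings here.
quads = [
    ("alice", "knows", "bob", "g1"),
    ("bob", "knows", "carol", "g1"),
    ("alice", "name", "Alice", "g2"),
]

str_to_id, id_to_str = build_dictionary(term for quad in quads for term in quad)

# Each quad is now four small integers instead of four strings.
encoded = [tuple(str_to_id[term] for term in quad) for quad in quads]
decoded = [tuple(id_to_str[i] for i in quad) for quad in encoded]
```

Each shared string is stored once in the dictionary, and the encoded quads only repeat its integer identifier.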

Quads in CGE are grouped by their predicate and the identifier of the graph that they are a part of. Thus, only a pair with a subject and an object needs to be stored for one such group of quads. These subject/object pairs are stored in hashtables (one hashtable per group). Since each subject and object is represented as a 48-bit HURI, the subject/object pairs can be packed into 12 bytes and stored in a 32-bit unsigned integer array, ultimately reducing the amount of needed storage.
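The arithmetic behind the packing is that two 48-bit identifiers occupy 96 bits, i.e., 12 bytes or three 32-bit words. The sketch below illustrates this together with the grouping by predicate and graph ID. The exact bit layout, the use of a flat Python list per group instead of a hashtable, and all names are assumptions made for this example.

```python
from collections import defaultdict

MASK48 = (1 << 48) - 1
MASK32 = (1 << 32) - 1


def pack_pair(subject_id, object_id):
    """Pack two 48-bit IDs (96 bits) into three 32-bit words (assumed layout)."""
    for value in (subject_id, object_id):
        if not 0 <= value <= MASK48:
            raise ValueError("identifier does not fit into 48 bits")
    combined = (subject_id << 48) | object_id
    return [(combined >> 64) & MASK32, (combined >> 32) & MASK32, combined & MASK32]


def unpack_pair(words):
    """Inverse of pack_pair: recover the subject and object IDs."""
    combined = (words[0] << 64) | (words[1] << 32) | words[2]
    return (combined >> 48) & MASK48, combined & MASK48


def group_quads(quads):
    """Group quads by (predicate, graph ID); keep only packed subject/object pairs."""
    groups = defaultdict(list)
    for subject_id, predicate_id, object_id, graph_id in quads:
        groups[(predicate_id, graph_id)].extend(pack_pair(subject_id, object_id))
    return groups


def pairs_in_group(words):
    """Decode a flat word list back into (subject, object) pairs."""
    return [unpack_pair(words[i:i + 3]) for i in range(0, len(words), 3)]


groups = group_quads([(1, 10, 2, 100), (3, 10, 4, 100), (1, 11, 3, 100)])
```

In this sketch the predicate and graph ID are stored once per group rather than once per quad, which is the source of the storage reduction mentioned above.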

The computation in CGE is distributed over the participating processes. To minimize the amount of all-to-all communication, query results are aggregated locally and, whenever possible, each process only communicates with a few peers to avoid network congestion.

CGE uses SPARQL [174] for querying. The CGE authors focused on developing high-performance scan, join, and merge operations needed in many queries. The execution of all these operations is distributed over the processes. Scans scale well [133] due to their high locality, as each process participating in this operation reads mostly local data. Merge needs to access most or all the data from all other processes. Therefore, it requires all-to-all communication [99], limiting scalability.

CGE can only execute one query at a time. Currently, no parallel transactions are possible (including parallel reads). If one uses a front-end server that allows multiple queries to be submitted in parallel to the CGE execution engine, the execution of these queries is serialized. Thus, CGE allows for the parallel execution of single queries, but not for the concurrent execution of multiple queries.

4.1.2 Other RDF Graph Databases. There exist many other RDF graph databases. We briefly describe two systems that extend the original RDF model: AllegroGraph and BlazeGraph.

3 We encourage participation in this survey. In case the reader is in possession of additional information relevant for the tables, the authors would welcome the input.

First, some RDF stores allow for attaching attributes to a triple explicitly. AllegroGraph [79] allows an arbitrary set of attributes to be defined per triple when the triple is created. However, these attributes are immutable. Figure 9 presents an example RDF graph with such attributes. This figure uses the same LPG graph as in previous examples provided in Figure 4 and Figure 5, which contain example transformations from the LPG into the original RDF model.

[Figure 9: two panels. “LPG graph”: two :Person vertices (name = Alice, age = 21; name = Bob, age = 24) connected by a :knows edge with since = 09.08.2007. “RDF graph, with triple attributes”: two Person/V-ID resources with “type”, “name”, and “age” triples, connected by a “knows” triple that carries the triple attribute {since: 09.08.2007}.]

Fig. 9. Comparison of an LPG graph and an RDF graph: a transformation from LPG to RDF with triple attributes. We represent the triple attributes as a set of key-value pairs. “Person/V-ID”, “age”, “name”, “type”, and “knows” are RDF URIs. The transformation uses the assumption that there is one label per vertex and edge.

Second, BlazeGraph [37] implements RDF* [94, 95], an augmentation of RDF that allows for attaching triples to triple predicates (see Figure 10). Vertices can use triples for storing labels and properties, analogously as with the plain RDF. However, with RDF*, one can represent LPG edges more naturally than in the plain RDF. Specifically, edges can be stored as triples, and edge properties can be linked to the edge triple via other triples.
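To make the idea of linking edge properties to an edge triple more concrete, the following Python sketch encodes an LPG edge in an RDF*-like form using nested tuples. This is only an illustration under stated assumptions: the function name and the identifiers "Person/1" and "Person/2" are hypothetical, and placing the edge triple in the subject position of the property triple is a choice of this sketch, not a statement about BlazeGraph's storage.

```python
# Hypothetical sketch: an LPG edge expressed as an edge triple plus
# one additional triple per edge property that refers to the edge triple.


def lpg_edge_to_rdf_star(src, label, dst, properties):
    """Return the edge triple followed by its property triples."""
    edge = (src, label, dst)
    return [edge] + [(edge, key, value) for key, value in properties.items()]


triples = lpg_edge_to_rdf_star(
    "Person/1", "knows", "Person/2", {"since": "09.08.2007"}
)
```

In plain RDF, the same edge property would require an intermediate resource that represents the edge, which is the less natural encoding that RDF* avoids.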

[Figure 10: two panels. “LPG graph”: two :Person vertices (name = Alice, age = 21; name = Bob, age = 24) connected by a :knows edge with since = 09.08.2007. “RDF* graph”: two Person/V-ID resources with “type”, “name”, and “age” triples, connected by a “knows” triple; a “since” triple with the value 09.08.2007 is attached to the “knows” triple (a triple attached to a triple).]

Fig. 10. Comparison of an LPG graph and an RDF* graph: a transformation from LPG to RDF* that enables attaching triples to triple predicates. “Person/V-ID”, “age”, “name”, “type”, “since”, and “knows” are RDF URIs. The transformation uses the assumption that there is one label per vertex and edge.

4.2 Tuple Stores

A tuple store is a generalization of an RDF store. RDF stores are restricted to triples (or quads, as in CGE) whereas tuple stores can maintain tuples of arbitrary sizes, as detailed in § 2.3.3.

[Figure 11: a key-value table that maps a key (atom ID) to a value (ID list or binary data). A node ID maps to (type ID, value ID). A link ID maps to (type ID, value ID, node ID, node ID, ...). A value ID maps to (value ID, ..., value ID) or to binary data.]

Fig. 11. An example utilization of key-value stores for maintaining hypergraphs in HyperGraphDB.

4.2.1 WhiteDB. WhiteDB [194] is a tuple store written in C that uses shared-memory parallelism. WhiteDB’s API enables allocating new records (tuples) with an arbitrary tuple length (number of tuple elements). Small values and pointers to other tuples are stored directly in a given field. Large strings are kept in a separate store. Each large value is only stored once, and a reference counter keeps track of how many tuples refer to it at any time. WhiteDB only provides database-wide locking with a reader-writer lock [159]. As an alternative to locking the whole database, one can also update fields of tuples atomically (set, compare and set, add). WhiteDB itself does not enforce consistency; it is up to the user to use locks and atomics correctly. Finally, WhiteDB only enables accessing single tuple records; there is no higher-level query engine or graph API that would allow one to, for example, execute a query that fetches all neighbors of a given vertex. However, one can still use tuples as vertex and edge storage, linking them to one another via memory pointers.
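The following Python sketch mimics these ideas in a toy, single-threaded form. It does not use the WhiteDB API: the class, the length threshold for "large" strings, and the use of record indices instead of memory pointers are all assumptions made for illustration. The sketch shows a reference-counted string store and how a neighborhood query has to be written by hand on top of plain tuples.

```python
# Hypothetical toy tuple store; not the WhiteDB API.


class TupleStore:
    LARGE = 16  # assumed length threshold for "large" strings

    def __init__(self):
        self.records = []   # record ID -> list of fields
        self.strings = []   # string ID -> large string (stored once)
        self.refcount = []  # string ID -> number of referring fields
        self._index = {}    # large string -> string ID

    def _field(self, value):
        """Store large strings out of line and count references."""
        if isinstance(value, str) and len(value) >= self.LARGE:
            sid = self._index.get(value)
            if sid is None:
                sid = len(self.strings)
                self._index[value] = sid
                self.strings.append(value)
                self.refcount.append(0)
            self.refcount[sid] += 1
            return ("str", sid)
        return value  # small values are stored directly in the field

    def create(self, *fields):
        self.records.append([self._field(f) for f in fields])
        return len(self.records) - 1


def neighbors(store, v):
    """Without a query engine, scan all records for edges leaving v."""
    return [r[2] for r in store.records if r[0] == "edge" and r[1] == v]


store = TupleStore()
bio = "a long biography string"
alice = store.create("vertex", "Alice", bio)
bob = store.create("vertex", "Bob", bio)
edge = store.create("edge", alice, bob)  # fields link to other records
```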

4.3 Key-Value Stores

One can also explicitly use key-value (KV) stores for maintaining a graph. We provide details of using a collection of key-value pairs to model a graph in § 2.3.1. Here, we describe select KV stores used as graph databases: MS Graph Engine (also called Trinity) and HyperGraphDB.

4.3.1 Microsoft’s Graph Engine (Trinity). Microsoft’s Graph Engine [175] is based on a distributed KV store called Trinity. Trinity implements a globally addressable distributed RAM storage. Processes use one-sided communication to access one another’s data directly [83]. Trinity can be deployed on InfiniBand [103] to leverage Remote Direct Memory Access (RDMA) [83].

In Trinity, keys are called cell IDs and values are called cells. Cells are associated with a schema that is defined using the Trinity Specification Language (TSL) [175]. TSL enables defining the structure of cells similarly to C structs. For example, a cell can hold data items of different data types, including IDs of other cells. MS Graph Engine introduces a graph storage layer on top of the Trinity KV storage layer. Vertices are stored in cells, where a dedicated field contains a vertex ID or a hash of this ID. Edges adjacent to a given vertex v are stored as a list of IDs of v’s neighboring vertices, directly in v’s cell. However, if an edge holds rich data, such an edge (together with the associated data) can also be stored in a separate dedicated cell.
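As a rough analogy, the cell layout described above can be sketched with Python dataclasses. This is not TSL and not the Graph Engine API: the class and field names are hypothetical, and a local dict stands in for the distributed KV store that maps cell IDs to cells.

```python
# Hypothetical sketch of vertex and edge cells; not TSL.
from dataclasses import dataclass, field


@dataclass
class VertexCell:
    vertex_id: int
    name: str
    neighbors: list = field(default_factory=list)   # cell IDs of neighbors
    rich_edges: list = field(default_factory=list)  # cell IDs of edge cells


@dataclass
class EdgeCell:
    source: int
    target: int
    since: str  # example of rich edge data


store = {}  # cell ID -> cell (stands in for the distributed KV store)
store[1] = VertexCell(1, "Alice", neighbors=[2])
store[2] = VertexCell(2, "Bob")
store[100] = EdgeCell(1, 2, "09.08.2007")  # edge with rich data
store[1].rich_edges.append(100)


def out_neighbors(store, cell_id):
    """Follow the neighbor IDs kept directly in the vertex cell."""
    return [store[n] for n in store[cell_id].neighbors]
```

Keeping plain neighbor IDs inside the vertex cell makes simple traversals cheap, while the separate edge cell is only touched when the edge data is needed.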

To keep the size of the exchanged messages small, Trinity maintains special accessors that allow for accessing single attributes within a cell without needing to load the complete cell. This lowers the I/O cost for many operations that do not need the whole cells.

4.3.2 HyperGraphDB. HyperGraphDB [105] stores hypergraphs (we define hypergraphs in § 2.1.2). The basic building blocks of HyperGraphDB are atoms, the values of the KV store. Every atom has a cryptographically strong ID. This reduces the chance of collisions (i.e., creating identical IDs for different graph elements by different peers in a distributed environment). Both hypergraph vertices and hyperedges are atoms. Thus, they have their own unique IDs. An atom of a hyperedge stores a list of IDs corresponding to the vertices connected by this hyperedge. Vertices and hyperedges also have a type ID and they can store additional data (such as properties) in a recursive structure (referenced by a value ID). This recursive structure contains value IDs identifying other atoms (with other recursive structures) or binary data. Figure 11 shows an example of how a KV store is used to represent a hypergraph in HyperGraphDB.
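The atom layout can be sketched in a few lines of Python. This is an illustration only, not the HyperGraphDB API: the helper names are hypothetical, a dict stands in for the KV store, and random UUIDs stand in for the cryptographically strong IDs.

```python
# Hypothetical sketch of atoms in a KV store; not the HyperGraphDB API.
import uuid

store = {}  # atom ID -> value (an ID list or binary data)


def put(value):
    """Store a value under a fresh random ID (assumed ID scheme)."""
    atom_id = uuid.uuid4().hex
    store[atom_id] = value
    return atom_id


def add_vertex(type_id, payload):
    """A vertex atom holds a type ID and a value ID."""
    return put([type_id, put(payload)])


def add_hyperedge(type_id, payload, vertex_ids):
    """A hyperedge atom additionally lists the connected vertices."""
    return put([type_id, put(payload), *vertex_ids])


person = put(b"Person")      # type atoms, stored as binary data here
meeting = put(b"Meeting")
a = add_vertex(person, b"Alice")
b = add_vertex(person, b"Bob")
c = add_vertex(person, b"Carol")
e = add_hyperedge(meeting, b"2007-08-09", [a, b, c])
```

A single hyperedge atom connects three vertices here, which would take several edges in an ordinary graph model.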

HyperGraphDB maintains an incidence index that maps a vertex ID to the IDs of all hyperedges containing this vertex. This allows for an efficient traversal from a vertex to a hyperedge. Next, a type index enables retrieving all vertices and hyperedges with a certain type ID. Finally, a value index allows one to quickly find atoms that carry a specific payload (identified with a value ID).

4.4 Document Stores

In document stores, a fundamental storage unit is a document, described in § 2.3.2. We select two document stores for a more detailed discussion, OrientDB and ArangoDB.

4.4.1 OrientDB. In OrientDB [45], every document d has a Record ID (RID), consisting of the ID of the collection of documents where d is stored, and the position (also referred to as the offset) within this collection. Pointers (called links) between documents are represented using these unique RIDs.

OrientDB [45] introduces regular edges and lightweight edges. Regular edges are stored in an edge document and can have their own associated key/value pairs (e.g., to encode edge properties or labels). Lightweight edges, on the other hand, are stored directly in the document of the adjacent (source or destination) vertex. Such edges do not have any associated key/value pairs. They constitute simple pointers to other vertices, and they are implemented as document RIDs. Thus, a vertex document not only stores the labels and properties of the vertex, but also a list of lightweight edges (as a list of RIDs of the documents associated with neighboring vertices), and a list of pointers to the attached regular edges (as a list of RIDs of the documents associated with these regular edges). Each regular edge has pointers (RIDs) to the documents storing the source and the destination vertex. Each vertex stores a list of links (RIDs) to its incoming and outgoing edges.
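A minimal Python sketch of this document layout follows. It is not OrientDB code: the collection IDs, the field names ("out", "in", "light"), and the helper functions are hypothetical. An RID is modeled as a (collection ID, position) pair, and both edge kinds are followed by resolving RIDs.

```python
# Hypothetical sketch of RID-linked documents; not the OrientDB API.

collections = {10: [], 11: []}  # assumed IDs: 10 = vertices, 11 = edges


def insert(collection_id, doc):
    """Append a document and return its RID (collection ID, position)."""
    collections[collection_id].append(doc)
    return (collection_id, len(collections[collection_id]) - 1)


def load(rid):
    """Resolve an RID to the document it points to."""
    collection_id, position = rid
    return collections[collection_id][position]


def new_vertex(name, age):
    return insert(10, {"name": name, "age": age,
                       "out": [], "in": [], "light": []})


alice = new_vertex("Alice", 21)
bob = new_vertex("Bob", 24)

# A regular edge is its own document and can carry properties.
knows = insert(11, {"label": "knows", "since": "09.08.2007",
                    "out": alice, "in": bob})
load(alice)["out"].append(knows)
load(bob)["in"].append(knows)

# A lightweight edge is just the RID of the neighboring vertex.
load(alice)["light"].append(bob)

via_regular = load(load(load(alice)["out"][0])["in"])["name"]
via_light = load(load(alice)["light"][0])["name"]
```

Following the regular edge takes two RID lookups (edge document, then vertex document), whereas the lightweight edge takes one, at the price of carrying no properties.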

Figure 12 contains an example of using documents for representing vertices, regular edges, and lightweight edges in OrientDB. Figure 13 shows example vertex and edge documents.

OrientDB provides distributed transactions with ACID semantics. OrientDB offers a form of sharding, in which not all collections of documents have to be copied on each server. However, OrientDB does not enable sharding of the collections themselves (i.e., distributing one collection across many servers). If an individual collection grows large, it is the responsibility of the user to partition the collection to avoid any additional overheads.

[Figure 12: vertex 1 (name: Alice, age: 21) and vertex 2 (name: Bob, age: 24), connected by a lightweight edge and by a regular edge of type “knows” that is stored as edge 1 (since: 09.08.2007) with “in” and “out” links.]

Fig. 12. Two vertex documents connected with a lightweight edge and a regular edge (knows) in OrientDB.

[Figure 13: a vertex document holds attributes (key/value), incoming edge RIDs, outgoing edge RIDs, and lightweight edges (vertex RIDs). A regular edge document holds attributes (key/value), an incoming vertex RID, and an outgoing vertex RID.]

Fig. 13. Example OrientDB vertex and edge documents (complex JSON documents are also supported).

4.4.2 ArangoDB. ArangoDB [11, 12] keeps its documents in a binary format called VelocyPack, which is a compacted implementation of JSON documents. Documents can be stored in different collections and have a _key attribute which is a unique ID within a given collection. Unlike in OrientDB, these IDs are not direct memory pointers. Documents in these collections are indexed using a hashtable, where the _key attribute serves as the hashtable key.

For maintaining graphs, ArangoDB uses vertex collections and edge collections. The former are regular document collections with vertex documents. Vertex documents store no information about edges attached to the associated vertices. This has the advantage that a vertex document does not have to be modified when one adds or removes edges. The latter store edge documents. Edge documents have two particular properties: _from and _to, which are the IDs of the documents associated with the two vertices connected by a given edge.

To find a particular vertex or an edge document, one can query the respective collection’s hashtable using the document _key. To find the edges of a vertex, one would have to scan all edge documents since vertex documents store no information about edges. To avoid this, there exists a hybrid index, which allows one to efficiently retrieve a list of edge documents with a particular _from or _to attribute. This enables fast reconstruction of the adjacency list of a given vertex.

A traversal over the neighbors of a given vertex works as follows. First, given the _key of a vertex v, ArangoDB finds all v’s adjacent edges using the hybrid index. Next, the system retrieves the corresponding edge documents and fetches all the associated _to properties. Finally, the _to properties serve as the new _key properties when searching for the neighboring vertices. An optimization in ArangoDB’s design prevents reading vertex documents and enables directly accessing one edge document based on the vertex ID within another edge document. This may improve cache efficiency and thus reduce query execution time [12].

ArangoDB’s indexing may accelerate edge removal. If the edges to be removed were stored in a vertex document, then the document would have to be modified, generating overheads. It could become expensive when removing a vertex with a large degree, as the documents of all the neighboring vertices would have to be altered. ArangoDB’s design avoids such modifications of vertex documents. Figure 14 shows examples of ArangoDB’s vertex and edge indexing structures.
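The traversal and edge removal described above can be sketched as follows. This is not ArangoDB code: plain Python dicts stand in for the collections and for the hybrid edge index, the helper names are hypothetical, and _from/_to hold bare _key values for brevity.

```python
# Hypothetical sketch of vertex/edge collections with an edge index.
from collections import defaultdict

vertices = {                    # vertex collection: _key -> document
    "alice": {"name": "Alice"},
    "bob": {"name": "Bob"},
    "carol": {"name": "Carol"},
}
edges = {}                      # edge collection: _key -> document
from_index = defaultdict(list)  # _from value -> keys of edge documents
to_index = defaultdict(list)    # _to value -> keys of edge documents


def add_edge(key, src, dst):
    edges[key] = {"_from": src, "_to": dst}
    from_index[src].append(key)
    to_index[dst].append(key)


def remove_edge(key):
    """Only the edge document and the index change; no vertex document."""
    doc = edges.pop(key)
    from_index[doc["_from"]].remove(key)
    to_index[doc["_to"]].remove(key)


def out_neighbors(key):
    """Index lookup, then edge documents, then vertex documents."""
    return [vertices[edges[e]["_to"]]["name"] for e in from_index[key]]


add_edge("e1", "alice", "bob")
add_edge("e2", "alice", "carol")
before = out_neighbors("alice")
remove_edge("e1")
after = out_neighbors("alice")
```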

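[Figure 14: two panels. Left, “edge index (hybrid)”: a hash over _from and _to values that leads to a linked list of edges. Right, “vertex index”: a hash over _key values that leads to a vertex document.]

Fig. 14. Left: an edge index (hybrid: a hashtable and linked lists). Right: a vertex index (a simple hashtable).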

One can use different collections of documents to store different edge types (e.g., “friend_of” or “likes”). When retrieving edges conditioned on some edge type (e.g., “friend_of”), one does not have to traverse the whole adjacency list (all “friend_of” and “likes” edges). Instead, one can target the collection with the edges of the specific edge type (“friend_of”).

ArangoDB can be executed in a distributed mode with replication and data sharding. Sharding is done by partitioning document collections using the _key attribute. Thus, given the _key attribute, one can straightforwardly locate a host with the queried document.

4.5 Wide-Column Stores

Wide-column stores combine different features of key-value stores and relational tables. On one hand, a wide-column store maps keys to rows (a KV store that maps keys to values). Every row can have an arbitrary number of cells and every cell constitutes a key-value pair. Thus, a row contains a mapping of cell keys to cell values, effectively making a wide-column store a two-dimensional KV store (a row key and a cell key both identify a specific value). On the other hand, a wide-column store is a table, where cell keys constitute column names. However, unlike in a relational database, the names and the format of columns may differ between rows within the same table. We illustrate an example subset of rows and their cells in a wide-column store in Figure 15.

[Figure 15: a table of rows, each consisting of a key followed by cells; every cell is a (cell key | value) pair. Rows are sorted by key, and the cells within a row are sorted by cell key.]

Fig. 15. An illustration of wide-column stores: mapping keys to rows and column-keys to cells within the rows.

4.5.1 Titan and JanusGraph. Titan [16] is a graph database that was built by Aurelius, now owned by DataStax. The Linux Foundation develops JanusGraph [186], a continuation of Titan. Both are graph databases built on top of wide-column stores. They can use different wide-column stores as backends, for example Apache Cassandra [7]. They additionally use document-oriented search engines (e.g., Elasticsearch [72] or Apache Solr [10]) to accelerate lookups by labels.

In both systems, when storing a graph, each row represents a vertex. Each vertex property and adjacent edge is stored in a separate cell. One edge is thus encoded in a single cell, including all the properties of this edge. Since cells in each row are sorted by the cell key, this sorting order can be used to find cells efficiently. For graphs, cell keys for properties and edges are chosen such that after sorting the cells, the cells storing properties come first, followed by the cells containing edges, see Figure 16. Since rows are ordered by the key, both systems straightforwardly partition tables into so-called tablets, which can then be distributed over multiple data servers.
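The effect of choosing cell keys so that properties sort before edges can be illustrated with a small Python sketch. It does not reproduce the actual key encoding of Titan or JanusGraph: the numeric prefixes, the tuple-shaped keys, and the helper names are assumptions made for illustration.

```python
# Hypothetical sketch of one row per vertex with sorted cell keys.

PROPERTY, EDGE = 0, 1  # assumed key prefixes: properties sort first

table = {}  # row key (vertex ID) -> {cell key: cell value}


def put_property(vertex_id, name, value):
    table.setdefault(vertex_id, {})[(PROPERTY, name)] = value


def put_edge(vertex_id, label, neighbor_id, **edge_properties):
    # The whole edge, including its properties, lives in a single cell.
    table.setdefault(vertex_id, {})[(EDGE, label, neighbor_id)] = edge_properties


def row(vertex_id):
    """Return the cells of a row in cell-key order."""
    return sorted(table[vertex_id].items(), key=lambda cell: cell[0])


def edges(vertex_id):
    """Edge cells form the tail of the sorted row."""
    return [cell for cell in row(vertex_id) if cell[0][0] == EDGE]


# Cells are inserted in mixed order; sorting restores the layout.
put_edge(1, "likes", 3)
put_property(1, "name", "Alice")
put_edge(1, "knows", 2, since="09.08.2007")
put_property(1, "age", 21)

keys = [key for key, _ in row(1)]
```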

[Figure 16: a table with one row per vertex ID, with rows sorted by vertex ID. Within each row, cells are sorted by cell key: the property cells come first, followed by the edge cells.]

Fig. 16. An illustration of Titan and JanusGraph: using wide-column stores for storing graphs. The illustration is inspired by and adapted from [176].

4.6 Relational Database Management Systems

Relational Database Management Systems (RDBMS) store data in two-dimensional tables with rows and columns, described in more detail in the corresponding data model section in § 2.3.4.

There are two types of RDBMS: column RDBMS (not to be confused with wide-column stores) and row RDBMS (also referred to as column-oriented or columnar and row-oriented). They differ in physical data persistence. Row RDBMS store table rows in consecutive memory blocks. Column RDBMS, on the other hand, store table columns contiguously. Row RDBMS are more efficient when only a few rows need to be retrieved, but with all their columns. Conversely, column RDBMS are more efficient when many rows need to be retrieved, but only with a few columns. Graph database solutions that use RDBMS as their backends use both row RDBMS (e.g., Oracle Spatial and Graph [148], OQGRAPH built on MariaDB [129]) and column RDBMS (e.g., SAP HANA [171]).

Graph analytics based on RDBMS has become a separate and growing area of research. Zhao et al. [198] study the general use of RDBMS for graphs. They define four new relational algebra operations for modeling graph operations. They show how to define these four operations with six smaller building blocks: basic relational algebra operations, such as group-by and aggregation. Xirogiannopoulos et al. [196] describe GraphGen, an end-to-end graph analysis framework that is built on top of an RDBMS. GraphGen supports graph queries through so-called Graph-Views that define graphs as transformations over underlying relational datasets. This provides a graph modeling abstraction, and the underlying representation can be optimized independently.
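To give a concrete impression of a graph kept in relational tables, the following sketch uses Python's built-in sqlite3 module. The schema, the table and column names, and the sample data are assumptions made for illustration; the sketch is unrelated to the specific systems and operators cited above. Neighborhood queries become joins, and each additional hop adds another self-join of the edge table.

```python
# Hypothetical sketch: a small graph stored in two relational tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertex (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE edge (src INTEGER, dst INTEGER, label TEXT, since TEXT);
    CREATE INDEX edge_src ON edge (src);
""")
conn.executemany(
    "INSERT INTO vertex VALUES (?, ?, ?)",
    [(1, "Alice", 21), (2, "Bob", 24), (3, "Carol", 30)],
)
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?, ?)",
    [(1, 2, "knows", "09.08.2007"), (2, 3, "knows", "01.01.2010")],
)


def neighbors(conn, vertex_id):
    """One-hop out-neighbors: a join of the edge and vertex tables."""
    rows = conn.execute(
        "SELECT v.name FROM edge e JOIN vertex v ON v.id = e.dst "
        "WHERE e.src = ? ORDER BY v.id",
        (vertex_id,),
    )
    return [name for (name,) in rows]


def two_hop(conn, vertex_id):
    """Two-hop out-neighbors: a self-join of the edge table."""
    rows = conn.execute(
        "SELECT v.name FROM edge e1 "
        "JOIN edge e2 ON e2.src = e1.dst "
        "JOIN vertex v ON v.id = e2.dst "
        "WHERE e1.src = ? ORDER BY v.id",
        (vertex_id,),
    )
    return [name for (name,) in rows]
```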

4.6.1 Oracle Spatial and Graph. Oracle Spatial and Graph [148] is a database management platform built on top of Oracle Database. It provides a rich set of tools for administration and analysis of graph data. Oracle Spatial and Graph comes with a range of built-in parallel graph algorithms (e.g., for community detection, path finding, traversals, link prediction, or PageRank). Both LPG and RDF models are supported. Rows of RDBMS tables constitute vertices and relationships between these rows form edges. Associated properties and attributes are stored as key-value pairs in separate structures.

Querying graphs in Oracle Spatial and Graph is possible using PGQL [191], a declarative, SQL-like, graph pattern matching query language. PGQL allows for querying both data stored on disk in Oracle Database and in-memory parts of graph datasets. In-memory analysis is boosted by the possibility to switch the underlying relational storage with the native graph storage provided by the PGX processing engine [71, 102, 167]. In such a configuration, Oracle Spatial and Graph effectively becomes a native graph database. PGX comes in two variants, PGX.D and PGX.SM, that offer distributed and shared-memory processing capabilities, respectively [102]. Support for SQL and SPARQL (for RDF graphs) is also included. Moreover, the offered Java API implements Apache TinkerPop interfaces, including the Gremlin API. Additionally, to further accelerate searching using properties, Oracle Spatial and Graph utilizes Apache Lucene [134] and optionally Apache SolrCloud [119] capabilities.

4.7 Object-Oriented Databases

Object-oriented database management systems (OODBMS) [14] enable modeling, storing, and managing data in the form of language objects used in object-oriented programming languages. We summarize such objects in § 2.3.5.

4.7.1 VelocityGraph. VelocityGraph [193] is a graph database relying on the VelocityDB [192] distributed object database. VelocityGraph edges, vertices, as well as edge or vertex properties are stored in C# objects that contain references to other objects. To handle this structure, VelocityGraph introduces abstractions such as VertexType, EdgeType, and PropertyType. Each object has a unique object identifier (Oid), pointing to its location in physical storage. Each vertex and edge has one type (label). Properties are stored in dictionaries. Vertices keep the attached edges in collections.

4.8 Native Graph Databases

Graph database systems described in the previous sections are all based on some database backend that was not originally built just for managing graphs. In what follows, we describe native graph databases: systems that were specifically built to maintain and process graphs.

4.8.1 Neo4j: Direct Pointers. Neo4j [164] is the most popular graph database system, according to different database rankings (see the links on page 2). Neo4j implements the LPG model using a storage design based on fixed-size records. A vertex v is represented with a vertex record, which stores (1) v’s labels, (2) a pointer to a linked list of v’s properties, (3) a pointer to the first edge attached to v, and (4) some flags. An edge e is represented with an edge record, which stores (1) e’s edge type (a label), (2) a pointer to a linked list of e’s properties, (3) pointers to the two vertex records that represent the vertices attached to e, (4) pointers to the adjacency lists (ALs) of both adjacent vertices, and (5) some flags. Each property record can store up to four properties, depending on the size of the property value. Large values (e.g., long strings) are stored in a separate dynamic store. Storing properties outside vertex and edge records allows those records to be small. Moreover, if no properties are accessed in a query, they are not loaded at all. The AL of a vertex is implemented as a doubly linked list. An edge is stored once, but is part of two such linked lists (one list for each attached vertex). Thus, an edge has two pointers to the previous edges and two pointers to the next edges. Figure 17 outlines the Neo4j design; Figure 18 shows the details of vertex and edge records.
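The way a single edge record participates in two doubly linked adjacency lists can be sketched as follows. This is a simplified illustration, not Neo4j's storage code: the field names loosely follow Figure 18, record IDs are list indices, properties and flags are omitted, and self-loops are excluded by assumption.

```python
# Hypothetical sketch of vertex and edge records linked by record IDs.
from dataclasses import dataclass

NIL = -1  # marks the end of a list


@dataclass
class VertexRecord:
    label: str
    first_edge: int = NIL


@dataclass
class EdgeRecord:
    rel_type: str
    first_vertex: int
    second_vertex: int
    first_prev: int = NIL   # list of first_vertex
    first_next: int = NIL
    second_prev: int = NIL  # list of second_vertex
    second_next: int = NIL


vertices, edges = [], []


def add_vertex(label):
    vertices.append(VertexRecord(label))
    return len(vertices) - 1


def _link(edge_id, vertex_id):
    """Insert the edge at the head of the vertex's adjacency list."""
    edge = edges[edge_id]
    head = vertices[vertex_id].first_edge
    if edge.first_vertex == vertex_id:
        edge.first_next = head
    else:
        edge.second_next = head
    if head != NIL:
        old = edges[head]
        if old.first_vertex == vertex_id:
            old.first_prev = edge_id
        else:
            old.second_prev = edge_id
    vertices[vertex_id].first_edge = edge_id


def add_edge(rel_type, a, b):
    assert a != b, "self-loops are not handled in this sketch"
    edges.append(EdgeRecord(rel_type, a, b))
    edge_id = len(edges) - 1
    _link(edge_id, a)
    _link(edge_id, b)
    return edge_id


def neighbors(v):
    """Walk v's adjacency list by following record IDs only."""
    result, edge_id = [], vertices[v].first_edge
    while edge_id != NIL:
        edge = edges[edge_id]
        if edge.first_vertex == v:
            result.append(edge.second_vertex)
            edge_id = edge.first_next
        else:
            result.append(edge.first_vertex)
            edge_id = edge.second_next
    return result


alice, bob, carol = (add_vertex("Person") for _ in range(3))
add_edge("knows", alice, bob)
add_edge("knows", alice, carol)
add_edge("knows", carol, bob)
```

The traversal touches only the records of the visited vertex and its edges, which is why its cost depends on the vertex degree rather than on the graph size.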

[Figure 17: vertex 1 (name: Alice, age: 21) and vertex 2 (name: Bob, age: 24), each with a linked list of vertex properties, connected by a “knows” edge. The edge links to the previous edges and to the next edges in the neighborhoods of the attached vertices.]

Fig. 17. Summary of the Neo4j structure: two vertices linked by a relationship “knows”. Both vertices maintain linked lists of properties. The relationships are parts of two doubly linked lists, one linked list per attached vertex.

[Figure 18: record layouts with byte offsets. A vertex record (offsets 1, 5, 9, 14) holds inUse, nextEdgeID (a link to the first edge record), nextPropID (a linked list of property records, each holding four property blocks), labels, and flags. An edge record (offsets 1, 5, 9, 13, 17, 21, 25, 29, 33) holds inUse, firstVertex and secondVertex (links to the attached vertices), relType, firstPrevEdgeID and firstNextEdgeID (pointers into the doubly linked adjacency list of the first attached vertex), secondPrevEdgeID and secondNextEdgeID (pointers into the doubly linked adjacency list of the second attached vertex), nextPropID, and flags.]

Fig. 18. An overview of the Neo4j vertex and edge records.

A core concept in Neo4j is using direct pointers [164]: a vertex stores pointers to the physical locations of its neighbors. Thus, for neighborhood queries or traversals, one needs no index and can instead follow direct pointers (except for the root vertices in traversals). Consequently, the query complexity does not depend on the graph size. Instead, it only depends on how large the visited subgraph is⁴.

Neo4j supports replication and provides a certain level of support for sharding. Specifically, the user can partition the graph and store each partition in a separate database, limiting data redundancy.

⁴ That said, if the graph does not fit into the main memory, the execution speed heavily depends on caching and cache pre-warming, i.e., the running time may significantly increase.

4.8.2 Sparksee/DEX: B+ Trees and Bitmaps. Sparksee is a graph database system that was formerly known as DEX [132]. Sparksee implements the LPG model in the following way. Vertices and edges (both are called objects) are identified by unique IDs. For each property name, there is an associated B+ tree that maps vertex and edge IDs to the respective property values. The reverse mapping from a property value to vertex and edge IDs is maintained by a bitmap, where a bit set to one indicates that the corresponding ID has some property value. Labels and vertices and edges are mapped to each other in a similar way. Moreover, for each vertex, two bitmaps are stored. One bitmap indicates the incoming edges, and one indicates the outgoing edges attached to every vertex. Furthermore, two B+ trees maintain the information about what vertices an edge is connected to (one tree for each edge direction). Figure 19 illustrates example mappings.

[Figure 19: maps for attributes and for labels, each consisting of a B+ tree that maps an edge or vertex ID to a value or a label, and of bitmaps (e.g., 0001001000001) for the reverse direction; and maps for vertex/edge connectivity (in and out directions) that relate edge IDs and vertex IDs.]

Fig. 19. Sparksee maps for attributes, labels, and vertex/edge connectivity. All mappings are bidirectional.

Sparksee is one of the few systems that are not record based. Instead, Sparksee uses maps implemented as B+ trees [66] and bitmaps. The use of bitmaps allows for some operations to be performed as bit-level operations. For example, if one wants to find all vertices with certain values of properties such as “age” and “first name”, one can simply find two bitmaps associated with the “age” and the “first name” properties, and then derive a third bitmap that is a result of applying a bitwise AND operation to the two input bitmaps.

Uncompressed bitmaps could grow unmanageably in size. As most graphs are sparse, bitmaps indexed by vertices or edges mostly contain zeros. To alleviate large sizes of such sparse bitmaps, they are cut into 32-bit clusters. If a cluster contains a non-zero bit, it is stored explicitly. The bitmap is then represented by a collection of (cluster-id, bit-data) pairs. These pairs are stored in a sorted tree structure to allow for efficient lookup, insertion, and deletion.

Heavy use of indexing has its costs. Instead of constant-time pointer chasing, index lookups cost logarithmic time in the size of the stored graph. However, bitmaps in Sparksee can at least be stored in linear space (in the number of bits set to one). Thus, finding all edges attached to a vertex can be done in linear time with respect to the number of edges to be retrieved. This is possible due to the bitmap compression used in Sparksee: each stored 32-bit cluster has at least one non-zero entry (stores at least one edge ID).
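The cluster-based compression and the bitwise AND described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Sparksee code: the function names are hypothetical, and a dict keyed by cluster ID stands in for the sorted tree of (cluster-id, bit-data) pairs.

```python
# Hypothetical sketch of sparse bitmaps cut into 32-bit clusters.

CLUSTER_BITS = 32


def compress(object_ids):
    """Build {cluster ID: 32-bit word}, keeping only non-zero clusters."""
    clusters = {}
    for object_id in object_ids:
        cluster_id, bit = divmod(object_id, CLUSTER_BITS)
        clusters[cluster_id] = clusters.get(cluster_id, 0) | (1 << bit)
    return clusters


def bitmap_and(a, b):
    """Bitwise AND of two compressed bitmaps, cluster by cluster."""
    result = {}
    for cluster_id in a.keys() & b.keys():
        word = a[cluster_id] & b[cluster_id]
        if word:  # all-zero clusters are never stored
            result[cluster_id] = word
    return result


def object_ids(bitmap):
    """List the IDs whose bits are set, in increasing order."""
    ids = []
    for cluster_id in sorted(bitmap):
        word = bitmap[cluster_id]
        for bit in range(CLUSTER_BITS):
            if (word >> bit) & 1:
                ids.append(cluster_id * CLUSTER_BITS + bit)
    return ids


# Bitmaps of the objects that have a given property value.
age_21 = compress([3, 40, 1000, 100000])
name_alice = compress([3, 41, 1000])
both = bitmap_and(age_21, name_alice)
```

Although the largest ID is 100000, the first bitmap stores only four 32-bit words, one per non-empty cluster.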

4.8.3 GBase: Sparse Adjacency Matrix Format. GBase [113] is a system that can only represent the structure of a directed graph; it stores neither properties nor labels. The goal of GBase is to maintain a compression of the adjacency matrix of a graph such that one can efficiently retrieve all incoming and outgoing edges of a selected vertex without the prohibitive O(n^2) matrix storage overheads. Simultaneously, using the adjacency matrix enables verifying in O(1) time whether two arbitrary vertices are connected. To compress the adjacency matrix, GBase cuts it into K^2 square blocks (there are K blocks along each row and column). Thus, queries that fetch in- and out-neighbors of each vertex require fetching only K blocks. The parameter K can be optimized for specific databases. When K grows larger, one has to retrieve more small files (assuming one block is stored in one file). When K becomes smaller, there are fewer files but they become larger, generating overheads. Further optimizations can be made when blocks contain either only zeroes or only ones; this enables higher compression rates.
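The block decomposition can be illustrated with a short Python sketch. It is not GBase code and it omits the actual compression of block contents: the function names are hypothetical, and each non-empty block is kept as a set of edges in a dict, standing in for one file per block.

```python
# Hypothetical sketch: an n x n adjacency matrix cut into K x K blocks.


def build_blocks(n, K, edge_list):
    """Group edges by block; empty blocks are not stored at all."""
    size = -(-n // K)  # ceil(n / K): side length of one block
    blocks = {}        # (block row, block column) -> set of (u, v)
    for u, v in edge_list:
        blocks.setdefault((u // size, v // size), set()).add((u, v))
    return size, blocks


def out_neighbors(v, K, size, blocks):
    """Read the K blocks of v's block row."""
    block_row = v // size
    return sorted(
        w
        for block_col in range(K)
        for (u, w) in blocks.get((block_row, block_col), ())
        if u == v
    )


def in_neighbors(v, K, size, blocks):
    """Read the K blocks of v's block column."""
    block_col = v // size
    return sorted(
        u
        for block_row in range(K)
        for (u, w) in blocks.get((block_row, block_col), ())
        if w == v
    )


n, K = 8, 2
size, blocks = build_blocks(n, K, [(0, 1), (0, 5), (6, 0), (3, 0)])
```

With K = 2, each neighborhood query inspects two of the four blocks; the fourth block contains no edges and is therefore never stored.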

4.9 Data Hubs

Data hubs are systems that enable using multiple data models and corresponding storage designs. They often combine relational databases with RDF, document, and key-value stores. This can be beneficial for applications that require a variety of data models, because it provides a variety of storage options in a single unified database management system. One can keep using RDBMS features, upon which many companies heavily rely, while also storing graph data.

4.9.1 OpenLink Virtuoso. OpenLink Virtuoso [147] provides RDBMS, RDF, and document capabilities by connecting to a variety of storage systems. Graphs are stored in the RDF format only, thus the whole discussion from § 2.1.5 also applies to Virtuoso RDF.

4.9.2 MarkLogic. MarkLogic [130] models graphs with documents for vertices, therefore allowing an arbitrary number of properties for vertices. However, it uses RDF triples for edges.

4.10 Discussion and Takeaways

In this section, we summarize selected aspects of data organization in graph database systems. For a detailed description and analysis of all the considered aspects, see Section 3 and Tables 2 and 3.

4.10.1 Graph Models. There is no one standard graph model, but two models have proven to be popular: RDF and LPG. RDF is a well-defined standard. However, it only supports simple triples (subject, predicate, object) representing edges from subject identifiers via predicates to objects. LPG is more popular (cf. Tables 2–3) and it allows vertices and edges to have labels and properties, thus enabling more natural data modeling. However, it is not standardized, and there are many variants (cf. § 2.1.4). For example, some systems limit the number of labels to just one. MarkLogic (§ 4.9.2) allows properties for vertices but none for edges, and thus can be viewed as a combination of LPG (vertices) and RDF (edges). Data stored in the LPG model can be converted to RDF, as described in § 2.1.6. To benefit from different LPG features while keeping RDF advantages such as simplicity, some researchers proposed and implemented modifications to RDF. Examples are triple attributes or attaching triples to other triples (described in § 4.1.2).

There are very few systems that use neither RDF nor LPG. HyperGraphDB uses the hypergraph model and GBase uses a simple directed graph model without any labels or properties.

4.10.2 Storage Designs. Most graph database systems are built upon existing storage designs, including key-value stores, wide-column stores, RDBMS, and others. The advantage of using existing storage designs is that these systems are usually mature and well-tested. The disadvantage is that they may not be perfectly optimized for graph data and graph queries. This is what native graph databases attempt to address.

One can distinguish record based systems from purely indexing based systems. Most systems are record based: they store data in consecutive memory blocks called records. These memory blocks may be unstructured, as is the case with key-value stores. They can also be structured: document databases often use the JSON format, wide-column stores have a key-value mapping inside each row, row-oriented RDBMS divide each row into columns, OODBMS impose some class definition, and tuple stores as well as some RDF stores use tuples. Record based systems store vertices in records, most often one vertex per record. Some systems store edges in separate records, others store them together with the attached vertices. If one wants to find a property of a particular vertex, one has to find a record containing the vertex. The searched property is either stored directly in that record, or its location is accessible via a pointer.
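A record based organization can be sketched in a few lines of Python. The sketch assumes one record per vertex, edges kept in separate records, and Python dictionaries standing in for the storage layer; the names `vertex_records`, `edge_records`, `get_vertex_property`, and `get_whole_vertex` are hypothetical.

```python
# One record per vertex; edges live in separate records and are
# referenced from the vertex record by edge ID (an assumed layout).
vertex_records = {
    1: {"labels": ["Person"], "props": {"name": "Alice", "age": 30}, "out": [10]},
    2: {"labels": ["Person"], "props": {"name": "Bob", "age": 25}, "out": []},
}
edge_records = {
    10: {"label": "KNOWS", "src": 1, "dst": 2, "props": {"since": 2015}},
}


def get_vertex_property(vid, key):
    """Locate the record first, then read the property from it."""
    record = vertex_records[vid]
    return record["props"].get(key)


def get_whole_vertex(vid):
    """All data of the vertex comes back with a single record lookup."""
    return vertex_records[vid]
```

Reading one property still fetches the whole record, while reading the entire vertex costs no more than that single lookup.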


Some systems do not use records at all; we call them purely indexing based. Sparksee and some RDF systems as well as column-oriented RDBMS do not store information about vertices and edges in dedicated records. Instead, they maintain data structures (usually in the form of indices) for each property or label. The information about a given vertex is thus distributed over different structures. If one wants to find a property of a particular vertex, one has to query the associated data structure (index) for that property and find the value for the given vertex. Examples of index structures used for this purpose are B+ trees (in Sparksee) or hashtables (in some RDF systems).
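A purely indexing based organization can be sketched as follows. The sketch assumes that plain Python dictionaries and sets stand in for the B+ trees or hashtables mentioned above; the names `property_index`, `label_index`, and the helper functions are hypothetical.

```python
from collections import defaultdict

# One index per property key and per label; no per-vertex record exists.
property_index = defaultdict(dict)   # property key -> {vertex ID -> value}
label_index = defaultdict(set)       # label -> {vertex IDs}


def add_vertex(vid, labels, props):
    for label in labels:
        label_index[label].add(vid)
    for key, value in props.items():
        property_index[key][vid] = value


def get_vertex_property(vid, key):
    """A single-property read touches exactly one index."""
    return property_index[key].get(vid)


def get_whole_vertex(vid):
    """Reassembling a vertex requires probing every index."""
    labels = sorted(l for l, vids in label_index.items() if vid in vids)
    props = {k: idx[vid] for k, idx in property_index.items() if vid in idx}
    return {"labels": labels, "props": props}


add_vertex(1, ["Person"], {"name": "Alice", "age": 30})
add_vertex(2, ["Person", "Admin"], {"name": "Bob"})
```

Here `get_vertex_property` is cheap, whereas `get_whole_vertex` has to visit all label and property indices, which mirrors the distribution of vertex information over several structures.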

4.10.3 Data Layout. Data layout determines how vertices and edges are represented and encoded in records. Some systems enable highly flexible structure inside their records. For example, document databases that use JSON or wide-column stores such as Titan and JanusGraph allow for different key-value mappings for each vertex and edge. Other record based systems are more fixed in their structure. For example, in OODBMS, one has to define a class for each configuration of vertex and edge properties. In RDBMS, one has to define tables for each vertex or edge type.

Another aspect of a graph data layout is the design of the adjacency between records. One can either assign each record an ID and then link records to one another via IDs, or one can use direct memory pointers. Using IDs requires an indexing structure to find the physical storage address of a record associated with a particular ID. Direct memory pointers do not require an index for a traversal from one record to its adjacent records. Note that an index might still be used, for example to retrieve a vertex with a particular property value (in this context, direct pointers only facilitate resolving adjacency queries between vertices).
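The two adjacency designs can be contrasted in a small Python sketch. It assumes that Python object references play the role of direct memory pointers and that a dictionary plays the role of the ID index; `Record`, `id_index`, and the helper functions are hypothetical names.

```python
class Record:
    def __init__(self, rid, name):
        self.rid = rid
        self.name = name
        self.neighbor_ids = []    # adjacency expressed via record IDs
        self.neighbor_refs = []   # adjacency expressed via direct references


# The ID index maps a record ID to the record (its "storage address").
id_index = {}


def make_record(rid, name):
    record = Record(rid, name)
    id_index[rid] = record
    return record


def link(a, b):
    a.neighbor_ids.append(b.rid)
    a.neighbor_refs.append(b)


def neighbors_via_ids(record):
    # Every neighbor costs one index lookup.
    return [id_index[rid].name for rid in record.neighbor_ids]


def neighbors_via_refs(record):
    # No index is consulted; the reference leads straight to the record.
    return [n.name for n in record.neighbor_refs]


def find_by_name(name):
    # Finding a start vertex by property value still needs a scan or an index.
    return next(r for r in id_index.values() if r.name == name)


alice, bob, carol = (make_record(i, n) for i, n in enumerate(["Alice", "Bob", "Carol"]))
link(alice, bob)
link(alice, carol)
```

Both `neighbors_via_ids(alice)` and `neighbors_via_refs(alice)` return the same neighbors; only the first consults the index on every hop.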

4.10.4 Data Organization vs. Query Performance. Record based systems usually deliver higher performance for queries that need to retrieve all or most information about a vertex or an edge. They are more efficient because the required data is stored in consecutive memory blocks. In purely indexing based systems, one queries a data structure per property, which results in a more random access pattern. On the other hand, if one only wants to retrieve single properties of vertices or edges, purely indexing based systems may only have to retrieve a single value. In contrast, many record based systems cannot retrieve only parts of records, fetching more data than necessary.

Furthermore, a decision on whether to use IDs versus direct memory pointers to link records depends on the read/write ratio of the workload for the given system. In the former case, one has to use an indexing structure to find the address of the record. This slows down read queries compared to following direct pointers. However, write queries can be more efficient with the use of IDs instead of pointers. For example, when a record has to be moved to a new address, all pointers to this record need to be updated to reflect this new address. IDs could remain the same; only the indexing structure needs to modify the address of the given record.
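The write-side trade-off can be made concrete with a toy Python model of record relocation. The sketch assumes integer "addresses" as keys of a dictionary that stands in for storage, and it finds incoming pointers by a full scan (a real system might keep back-pointers instead); all names are hypothetical.

```python
storage = {}     # address -> record
id_index = {}    # record ID -> address (needed only for ID-based links)


def store(address, rid, payload):
    storage[address] = {"rid": rid, "payload": payload,
                        "id_links": [], "ptr_links": []}
    id_index[rid] = address


def move_record(rid, new_address):
    """Relocate a record; return the number of updates per linking scheme."""
    old_address = id_index[rid]
    storage[new_address] = storage.pop(old_address)

    # ID-based links: only the index entry changes.
    id_index[rid] = new_address
    id_updates = 1

    # Pointer-based links: every pointer to the old address must be fixed.
    ptr_updates = 0
    for record in storage.values():
        for i, address in enumerate(record["ptr_links"]):
            if address == old_address:
                record["ptr_links"][i] = new_address
                ptr_updates += 1
    return id_updates, ptr_updates


store(100, "a", "Alice")
store(200, "b", "Bob")
store(300, "c", "Carol")
# Both a and b link to c, once by ID and once by address.
for source in (100, 200):
    storage[source]["id_links"].append("c")
    storage[source]["ptr_links"].append(300)

updates = move_record("c", 400)
```

In this toy setting, the ID-based scheme needs one index update regardless of how many records link to the moved record, while the pointer-based scheme needs one update per incoming pointer.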

5 GRAPH DATABASE QUERIES AND WORKLOADS

We provide a taxonomy of graph database queries and workloads. First, we categorize them using the scope of the accessed graph and thus, implicitly, the amount of accessed data (§ 5.1). We then outline the classification from the LDBC Benchmark [5] (§ 5.2). Next, the distinction into OLTP and OLAP is outlined (§ 5.3). Finally, a categorization of graph queries based on the matched patterns (§ 5.4) and loading input datasets into the database (§ 5.5) are discussed. Figure 20 summarizes all elements of the proposed taxonomy.

We omit detailed discussions and examples as they are provided in different associated papers [4, 5, 8, 13, 18, 48, 53, 65, 75, 106, 109, 124, 125, 136, 183]. Instead, our goal is to deliver a broad overview and taxonomy, and point the reader to the detailed material available elsewhere.


[Figure 20: Taxonomy of Graph Database Queries. Scope of Access (§ 5.1): Local Queries, Neighborhood Queries, Traversals, Global Analytics Queries. LDBC Workloads (§ 5.2): Interactive (Short Read-Only, Complex Read-Only, Transactional Update), Business Intelligence, Graph Analytics. Online vs. Offline (§ 5.3): OLTP, OLAP. Matched Patterns (§ 5.4): Simple Pattern Matching, Complex Pattern Matching, Navigational Pattern Matching, Path. General Scenario (§ 5.5): Input Loading, Input Accessing.]

Fig. 20. Taxonomy of different graph database queries and workloads.

5.1 Scopes of Graph Queries

We describe queries in the increasing order of their scope. We focus on the LPG model, see § 2.1.3. Figure 21 depicts the scope of graph queries.

[Figure 21: Query scopes ordered by increasing scope and complexity: local queries, neighborhood queries, traversals, and global analytics, accessing data that ranges from single vertices and edges through subgraphs to the whole graph. Legend: V = vertex, E = edge, P = property, L = label.]

Fig. 21. Illustration of different query scopes when accessing a Labeled Property Graph.

5.1.1 Local Queries. Local queries involve single vertices or edges. For example, given a vertex or an edge ID, one may want to retrieve the labels and properties of this vertex or edge. Other examples include fetching the value of a given property (given the property key), deriving the set of all labels, or verifying whether a given vertex or an edge has a given label (given the label name). These queries are used in social network workloads [13, 18] (e.g., to fetch the profile information of a user) and in benchmarks [109] (e.g., to measure the vertex look-up time).
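The local queries listed above can be sketched over a toy in-memory LPG vertex store. The dictionary layout and the function names below are assumptions made for illustration only.

```python
# A toy vertex store: vertex ID -> labels and properties (assumed layout).
vertices = {
    7: {"labels": {"Person", "Student"}, "props": {"name": "Alice", "age": 23}},
}


def labels_and_properties(vid):
    """Retrieve all labels and properties of a vertex given its ID."""
    v = vertices[vid]
    return v["labels"], v["props"]


def property_value(vid, key):
    """Fetch the value of one property given the property key."""
    return vertices[vid]["props"].get(key)


def has_label(vid, label):
    """Verify whether the vertex carries the given label."""
    return label in vertices[vid]["labels"]
```

Each function touches exactly one vertex, which is what makes these queries local in scope.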

5.1.2 Neighborhood Queries. Neighborhood queries retrieve all edges attached to a given vertex, or the vertices adjacent to a given edge. This query can be further restricted by, for example, retrieving only the edges with a specific label. Similarly to local queries, social networks often require a retrieval of the friends of a given person, which results in querying the local neighborhood [13, 18].
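A minimal Python sketch of neighborhood queries over an edge list follows. The tuple layout `(edge ID, source, target, label)` and the function names are assumptions; a real system would use an adjacency structure rather than a scan.

```python
edges = [
    # (edge ID, source, target, label)
    (1, "alice", "bob", "FRIEND"),
    (2, "alice", "carol", "FRIEND"),
    (3, "alice", "acme", "WORKS_AT"),
    (4, "bob", "alice", "FRIEND"),
]


def incident_edges(vertex, label=None):
    """All edges attached to `vertex`, optionally restricted to one label."""
    return [e for e in edges
            if vertex in (e[1], e[2]) and (label is None or e[3] == label)]


def endpoints(edge_id):
    """The two vertices adjacent to a given edge."""
    for eid, src, dst, _ in edges:
        if eid == edge_id:
            return src, dst
    raise KeyError(edge_id)


def friends(person):
    """Targets of outgoing FRIEND edges, as in a social network lookup."""
    return sorted(dst for _, src, dst, label in edges
                  if src == person and label == "FRIEND")
```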


5.1.3 Traversals. In a traversal query, one explores a part of the graph beyond a single neighborhood. These queries usually start at a single vertex (or a small set of vertices) and traverse some graph part. We call the initial vertex or the set of vertices the anchor or root of the traversal. Queries can restrict what edges or vertices can be retrieved or traversed. As this is a common graph database task, this query is also used in different performance benchmarks [53, 65, 109].
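A traversal with an anchor set, a depth bound, and an edge-label restriction can be sketched as a breadth-first search. The adjacency layout and the parameter names (`roots`, `max_depth`, `allowed_labels`) are assumptions for illustration.

```python
from collections import deque

# vertex -> list of (neighbor, edge label); an assumed toy adjacency list.
adjacency = {
    "a": [("b", "FRIEND"), ("c", "FOLLOWS")],
    "b": [("d", "FRIEND")],
    "c": [("d", "FRIEND")],
    "d": [("e", "FRIEND")],
    "e": [],
}


def traverse(roots, max_depth, allowed_labels=None):
    """Breadth-first traversal from the anchor vertices.

    Returns a dict mapping each reached vertex to its hop distance.
    """
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == max_depth:
            continue  # do not expand beyond the depth bound
        for neighbor, label in adjacency[vertex]:
            if allowed_labels is not None and label not in allowed_labels:
                continue  # edge filtered out by the label restriction
            if neighbor not in depth:
                depth[neighbor] = depth[vertex] + 1
                queue.append(neighbor)
    return depth
```

For example, `traverse(["a"], 2)` reaches `b` and `c` at distance 1 and `d` at distance 2, while restricting the traversal to `FRIEND` edges skips `c` entirely.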

5.1.4 Global Graph Analytics. Finally, we identify graph analytics queries, often referred to as OLAP, which by definition consider the whole graph (not necessarily every property but all vertices and edges). Different benchmarks [21, 48, 65, 136] take these large-scale queries into account since they are used in different fields such as threat detection [69] or computational chemistry [17]. As indicated in Tables 2 and 3, many graph databases support such queries. Graph processing systems such as Pregel [128] or Giraph [8] focus specifically on resolving OLAP queries [93]. Example queries include resolving global pattern matching [51, 177], shortest paths [64], max-flow or min-cut [57], minimum spanning trees [118], diameter, eccentricity, connected components, PageRank [151], and many others. Some traversals can also be global (e.g., finding all shortest paths of unrestricted length), thus falling into the category of global analytics queries.
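As one example of a query that touches every vertex and edge, connected components of an undirected graph can be computed with a union-find structure. This is a minimal single-machine sketch and not the method of any particular system discussed here; the function and variable names are hypothetical.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex ID in its component."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, w in edges:
        root_u, root_w = find(u), find(w)
        if root_u != root_w:
            # Attach the larger root under the smaller one, so that the
            # root of a component is always its smallest vertex ID.
            parent[max(root_u, root_w)] = min(root_u, root_w)

    return {v: find(v) for v in vertices}


vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(vertices, edges)
```

In contrast to the local and neighborhood queries above, the result depends on all edges of the graph, which is why such queries are typically not answered at interactive speed on large datasets.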

5.2 Classes of Graph Workloads

We also outline an existing taxonomy of graph database workloads that is provided as a part of the LDBC benchmarks [5]. LDBC is an effort by academia and industry to establish a set of standard benchmarks for measuring the performance of graph databases. The effort currently specifies interactive workloads, business intelligence workloads, and graph analytics workloads.

5.2.1 Interactive Workloads. A part of LDBC called the Social Network Benchmark (SNB) [75] identifies and analyzes interactive workloads that can collectively be described as either read-only queries or simple transactional updates. They are divided into three further categories. First, short read-only queries start with a single graph element (e.g., a vertex) and look up its neighbors or conduct small traversals. Second, complex read-only queries traverse larger parts of the graph; they are used in the LDBC benchmark to not just assess the efficiency of the data retrieval process but also the quality of query optimizers. Finally, transactional update queries insert either a single element (e.g., a vertex), possibly together with its adjacent edges, or a single edge. This workload tests common graph database operations such as the look up of a friend profile in a social network.

5.2.2 Business Intelligence Workloads. Next, LDBC identifies business intelligence (BI) workloads [183], which fetch large data volumes, spanning large parts of a graph. In contrast to the interactive workloads, the BI workloads heavily use summarization and aggregation operations such as sorting, counting, or deriving minimum, maximum, and average values. They are read-only. The LDBC specification provides an extensive list of BI workloads that were selected so that different performance aspects of a database are properly stressed when benchmarking.
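The flavor of such aggregation-heavy, read-only queries can be conveyed with a small Python sketch that groups persons by city and derives the count, minimum, maximum, and average number of friends, sorted by the average. This is an illustrative query invented for this sketch, not one of the BI queries from the LDBC specification; the data and names are hypothetical.

```python
from collections import defaultdict

persons = {1: "Zurich", 2: "Zurich", 3: "Krakow", 4: "Krakow", 5: "Krakow"}
friend_edges = [(1, 2), (1, 3), (3, 4), (3, 5), (4, 5)]   # undirected


def friend_stats_by_city():
    """Per city: (city, persons, min degree, max degree, average degree)."""
    degree = defaultdict(int)
    for u, v in friend_edges:
        degree[u] += 1
        degree[v] += 1

    per_city = defaultdict(list)
    for person, city in persons.items():
        per_city[city].append(degree[person])

    rows = [(city, len(d), min(d), max(d), sum(d) / len(d))
            for city, d in per_city.items()]
    # Sort by average degree, highest first.
    return sorted(rows, key=lambda row: row[4], reverse=True)


rows = friend_stats_by_city()
```

The query reads every person and every friendship edge once and performs no updates, in line with the read-only nature of BI workloads.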

5.2.3 Graph Analytics Workloads. Finally, the LDBC effort comes with a graph analytics benchmark [106], where six graph algorithms are proposed as a standard benchmark for a graph analytics part of a graph database. These algorithms are “Breadth-First Search, PageRank [151], weakly connected components [85], community detection using label propagation [39], deriving the local clustering coefficient [172], and computing single-source shortest paths [106]” [106].
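As an example of one of these kernels, the local clustering coefficient of a vertex can be sketched as follows. The sketch assumes an undirected simple graph stored as symmetric neighbor sets and uses the textbook definition (fraction of neighbor pairs that are themselves connected); the benchmark's own definition, in particular for directed graphs, may differ in detail.

```python
from itertools import combinations


def local_clustering_coefficient(adjacency, v):
    """Fraction of pairs of neighbors of `v` that are connected."""
    neighbors = adjacency[v]
    k = len(neighbors)
    if k < 2:
        return 0.0   # convention assumed here for degree < 2
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


# Symmetric neighbor sets of a small undirected graph.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

For vertex `a`, only one of the three neighbor pairs (`b`, `c`) is connected, so its coefficient is 1/3, whereas `b` has a coefficient of 1.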

5.2.4 Scope of LDBC Workloads. The LDBC interactive workloads correspond to local queries, neighborhood queries, and traversals. The LDBC business intelligence workloads range from traversals to global graph analytics queries. The LDBC graph analytics benchmark corresponds to global graph analytics.


5.3 Interactive Transactional Querying (OLTP) and Offline Graph Processing (OLAP)

One can also distinguish between Online Transactional Processing (OLTP) and Offline Analytical Processing (OLAP). Typically, OLTP workloads consist of many queries local in scope, such as neighborhood queries, certain restricted traversals, or lookups, inserts, deletes, and updates of single vertices and edges. OLTP workloads correspond to LDBC’s interactive workloads. They are usually executed with some transactional guarantees. The goal is to achieve high throughput and answer the queries at interactive speed (low latency). OLAP workloads have been a subject of numerous research efforts in the last decade [20, 31, 32, 111, 137, 180]. They are usually not processed at interactive speeds, as the queries are inherently complex and global in scope. They correspond to LDBC’s graph analytics and to certain BI workloads.

5.4 Graph Patterns and Navigational Expressions

Angles et al. [4] inspected in detail the theory of graph queries. In one identified family of graph queries, called simple graph pattern matching, one prescribes a graph pattern (e.g., a specification of a class of subgraphs) that is then matched to the graph maintained by the database, searching for the occurrences of this pattern. This query can be extended with aggregation and a projection function to so-called complex graph pattern matching. Furthermore, path queries allow searching for paths of arbitrary distances in the graph. One can also combine complex graph pattern matching and path queries, resulting in navigational graph pattern matching, in which a graph pattern can be applied recursively on the parts of the path.
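The four query families can be illustrated on a tiny labeled graph. The following Python sketch hard-codes one pattern and is only meant to show how the families build on each other; it does not follow the formal semantics of Angles et al. [4], and the data and function names are hypothetical.

```python
labels = {"ann": "Person", "ben": "Person", "cat": "Person", "acme": "Company"}
edges = {("ann", "KNOWS", "ben"), ("ben", "KNOWS", "cat"),
         ("cat", "WORKS_AT", "acme")}


def match_knows_coworker():
    """Simple pattern: (x:Person)-[KNOWS]->(y:Person)-[WORKS_AT]->(c:Company)."""
    return sorted(
        (x, y, c)
        for (x, l1, y) in edges if l1 == "KNOWS"
        for (y2, l2, c) in edges if l2 == "WORKS_AT" and y2 == y
        if labels[x] == labels[y] == "Person" and labels[c] == "Company"
    )


def count_matches_per_company():
    """Complex pattern: the same match plus projection and aggregation."""
    counts = {}
    for _, _, company in match_knows_coworker():
        counts[company] = counts.get(company, 0) + 1
    return counts


def reachable_via(start, label):
    """Path query: vertices reachable over one or more `label` edges."""
    seen, frontier = set(), [start]
    while frontier:
        v = frontier.pop()
        for (src, l, dst) in edges:
            if src == v and l == label and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


def navigational_match(start):
    """Navigational pattern: (start)-[KNOWS+]->(y)-[WORKS_AT]->(c)."""
    return sorted((y, c) for y in reachable_via(start, "KNOWS")
                  for (src, l, c) in edges if src == y and l == "WORKS_AT")
```

The simple pattern finds only `ben`, who directly knows an employee of `acme`, while the navigational variant also connects `ann` to `acme` through a `KNOWS` path of length two.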

5.5 Input Loading

Finally, certain benchmarks also analyze bulk input loading [53, 65, 109]. Specifically, given an input dataset, they measure the time to load this dataset into a database. This scenario is common when data is migrated between systems.
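Measuring load time can be sketched in Python as follows. The sketch assumes an in-memory adjacency list as a stand-in for the database and a synthetic ring graph as the input dataset; `bulk_load` is a hypothetical helper and not part of any cited benchmark.

```python
import time
from collections import defaultdict


def bulk_load(edge_list):
    """Load an edge list into an in-memory adjacency structure and time it."""
    adjacency = defaultdict(list)
    start = time.perf_counter()
    for src, dst in edge_list:
        adjacency[src].append(dst)
    elapsed = time.perf_counter() - start
    return adjacency, elapsed


# Synthetic input: a directed ring with 1000 vertices.
dataset = [(i, (i + 1) % 1000) for i in range(1000)]
graph, seconds = bulk_load(dataset)
```

A real measurement would additionally account for parsing, index construction, and durability, which this sketch omits.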

6 CHALLENGES

There are numerous research challenges related to the design of graph database systems.

First, establishing a single graph model for these systems is far from being complete. While LPG is used most often, (1) its definition is very broad and it is rarely fully supported, and (2) RDF is also often used in the context of storing and managing graphs. Moreover, it is unclear what the precise relationships are between a selected graph model and the corresponding consequences for storage and performance tradeoffs when executing different types of workloads.

Second, a clear identification of the most advantageous design choices for different use cases is yet to be determined. As illustrated in this survey, existing systems support a plethora of forms of data organization, and it is not clear which ones are best for many scenarios. A strongly related challenge is the best design for a system that supports both OLAP and OLTP workloads, ensuring high throughput and low latency for both of them.

Third, while there exists past research into the impact of the underlying network on the performance of a distributed graph analytics framework [149], little has been done to investigate this performance relationship in the context of graph database workloads. To the best of our knowledge, there are no efforts into developing a topology-aware or routing-aware data distribution scheme for graph databases, especially in the context of recently proposed data center and high-performance computing network topologies [27, 116] and routing architectures [33, 84, 126].

Moreover, in contrast to general static graph processing and graph streaming, little research exists into accelerating graph databases using different types of hardware architectures, accelerators, and hardware-related designs, for example FPGAs [23, 34], designs related to network interface cards [29, 63], hardware transactions [28], and others [1, 25]. In addition, a related research direction focuses on re-using different concepts from general distributed graph processing in the domain of graph databases, and vice versa [92].

Finally, many research challenges in the design of graph databases are related specifically to the design of NoSQL stores. These challenges are discussed in more detail in recent work [60] and include efficient data partitioning [44, 144, 154], user-friendly query formulation, high-performance transaction processing, and ensuring security in the form of authentication and encryption.

7 CONCLUSION

Graph databases constitute an important area of academic research and different industry efforts. They are used to maintain, query, and analyze numerous datasets in different domains in industry and academia. Many graph databases of different types have been developed. They use many data models and representations, they are constructed using miscellaneous design choices, and they enable a large number of queries and workloads. In this work, we provide the first survey and taxonomy of this rich graph database landscape. Our work can be used not only by researchers willing to learn more about this fascinating subject, but also by architects, developers, and project managers who want to select the most advantageous graph database system or design.

ACKNOWLEDGEMENTS

We thank Gabor Szarnyas for extensive feedback, and Hannes Voigt, Matteo Lissandrini, Daniel Ritter, Lukas Braun, Janez Ales, Nikolay Yakovets, and Khuzaima Daudjee for insightful remarks.

REFERENCES

[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News, 43(3):105–117, 2016.

[2] Amazon. Amazon Neptune. Available at https://aws.amazon.com/neptune/.

[3] R. Angles, M. Arenas, P. Barceló, P. Boncz, G. Fletcher, C. Gutierrez, T. Lindaaker, M. Paradies, S. Plantikow, J. Sequeda, et al. G-CORE: A core for future graph query languages. In International Conference on Management of Data, pages 1421–1432, 2018.

[4] R. Angles, M. Arenas, P. Barceló, A. Hogan, J. Reutter, and D. Vrgoč. Foundations of Modern Query Languages for Graph Databases. In ACM Comput. Surv., 50(5):68:1–68:40, 2017.

[5] R. Angles, P. Boncz, J. Larriba-Pey, I. Fundulaki, T. Neumann, O. Erling, P. Neubauer, N. Martinez-Bazan, V. Kotsev, and I. Toma. The linked data benchmark council: a graph and RDF industry benchmarking effort. ACM SIGMOD Record, 43(1):27–31, 2014.

[6] R. Angles and C. Gutierrez. Survey of Graph Database Models. In ACM Comput. Surv., pages 1:1–1:39, 2008.

[7] Apache. Apache Cassandra. Available at https://cassandra.apache.org/.

[8] Apache. Apache Giraph. Available at https://giraph.apache.org/.

[9] Apache. Apache Marmotta. Available at http://marmotta.apache.org/.

[10] Apache. Apache Solr. Available at https://lucene.apache.org/solr/.

[11] ArangoDB Inc. ArangoDB. Available at https://docs.arangodb.com/3.3/Manual/DataModeling/Concepts.html.

[12] ArangoDB Inc. ArangoDB: Index Free Adjacency or Hybrid Indexes for Graph Databases. Available at https://www.arangodb.com/2016/04/index-free-adjacency-hybrid-indexes-graph-databases/.

[13] T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan. Linkbench: a database benchmark based on the facebook social graph. In ACM SIGMOD, pages 1185–1196. ACM, 2013.

[14] M. Atkinson, F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier, and S. Zdonik. The Object-Oriented Database System Manifesto. In DOOD, pages 223–40, 1989.

[15] P. Atzeni and V. De Antonellis. Relational database theory. Benjamin/Cummings Redwood City, CA, 1993.

[16] Aurelius. Titan Data Model. Available at http://s3.thinkaurelius.com/docs/titan/1.0.0/data-model.html.

[17] A. T. Balaban. Applications of graph theory in chemistry. J. of Chem. Inf. Comp. Sciences, 25(3):334–343, 1985.

[18] S. Barahmand and S. Ghandeharizadeh. BG: A Benchmark to Evaluate Interactive Social Networking Actions. In CIDR, 2013.

[19] D. Bartholomew. Mariadb vs. MySQL. Dostopano, 7(10):2014, 2012.

[20] O. Batarfi, R. El Shawi, A. G. Fayoumi, R. Nouri, A. Barnawi, S. Sakr, et al. Large scale graph processing systems: survey and an experimental evaluation. Cluster Computing, 18(3):1189–1213, 2015.

[21] S. Beamer, K. Asanović, and D. Patterson. The gap benchmark suite. arXiv preprint arXiv:1508.03619, 2015.


[22] M. Besta et al. Slim graph: Practical lossy graph compression for approximate graph processing, storage, and analytics. 2019.

[23] M. Besta, M. Fischer, T. Ben-Nun, J. De Fine Licht, and T. Hoefler. Substream-centric maximum matchings on fpga. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 152–161. ACM, 2019.

[24] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler. Practice of streaming and dynamic graphs: Concepts, models, systems, and parallelism. arXiv preprint arXiv:1912.12740, 2019.

[25] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler. Slim noc: A low-diameter on-chip network topology for high energy efficiency and scalability. In ACM SIGPLAN Notices, volume 53, pages 43–55. ACM, 2018.

[26] M. Besta and T. Hoefler. Fault tolerance for remote memory access programming models. In Proceedings of the 23rd international symposium on High-performance parallel and distributed computing, pages 37–48. ACM, 2014.

[27] M. Besta and T. Hoefler. Slim fly: A cost effective low-diameter network topology. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 348–359. IEEE Press, 2014.

[28] M. Besta and T. Hoefler. Accelerating irregular computations with hardware transactional memory and active messages. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 161–172. ACM, 2015.

[29] M. Besta and T. Hoefler. Active access: A mechanism for high-performance distributed data-centric computations. In Proceedings of the 29th ACM on International Conference on Supercomputing, pages 155–164. ACM, 2015.

[30] M. Besta and T. Hoefler. Survey and taxonomy of lossless graph compression and space-efficient graph representations. arXiv preprint arXiv:1806.01799, 2018.

[31] M. Besta, F. Marending, E. Solomonik, and T. Hoefler. Slimsell: A vectorizable graph representation for breadth-first search. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 32–41. IEEE, 2017.

[32] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. To push or to pull: On reducing communication and synchronization in graph computations. In International Symposium on High-Performance Parallel and Distributed Computing, pages 93–104. ACM, 2017.

[33] M. Besta, M. Schneider, K. Cynk, M. Konieczny, E. Henriksson, S. Di Girolamo, A. Singla, and T. Hoefler. Fatpaths: Routing in supercomputers, data centers, and clouds with low-diameter networks when shortest paths fall short. arXiv preprint arXiv:1906.10885, 2019.

[34] M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler. Graph processing on fpgas: Taxonomy, survey, challenges. arXiv preprint arXiv:1903.06697, 2019.

[35] M. Besta, D. Stanojevic, T. Zivic, J. Singh, M. Hoerold, and T. Hoefler. Log (graph): a near-optimal high-performance graph representation. In PACT, pages 7–1, 2018.

[36] Bitnine Global Inc. AgensGraph. Available at https://bitnine.net/agensgraph-2/.

[37] Blazegraph. BlazeGraph DB. Available at https://www.blazegraph.com/.

[38] S. Boag, D. Chamberlin, M. F. Fernández, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. Xquery 1.0: An xml query language. 2002.

[39] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In International conference on World wide web, pages 587–596. ACM, 2011.

[40] A. Bonifati, G. Fletcher, H. Voigt, and N. Yakovets. Querying graphs. Synthesis Lectures on Data Management, 10(3):1–184, 2018.

[41] U. Brandes. A faster algorithm for betweenness centrality*. Journal of Mathematical Sociology, 25(2):163–177, 2001.

[42] T. Bray. The JavaScript object notation (JSON) data interchange format. 2014.

[43] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible markup language (XML) 1.0, 2000.

[44] A. Buluç, H. Meyerhenke, I. Safro, P. Sanders, and C. Schulz. Recent advances in graph partitioning. In Algorithm Engineering, pages 117–158. Springer, 2016.

[45] Callidus Software Inc. OrientDB. Available at https://orientdb.com.

[46] Callidus Software Inc. OrientDB: Lightweight Edges. Available at https://orientdb.com/docs/3.0.x/java/Lightweight-Edges.html.

[47] Cambridge Semantics. AnzoGraph. Available at https://www.cambridgesemantics.com/product/anzograph/.

[48] M. Capotă, T. Hegeman, A. Iosup, A. Prat-Pérez, O. Erling, and P. Boncz. Graphalytics: A big data benchmark for graph-processing platforms. In Proceedings of the GRADES’15, page 7. ACM, 2015.

[49] A. Castelltort and A. Laurent. Representing history in graph-oriented nosql databases: A versioning system. In Eighth International Conference on Digital Information Management (ICDIM 2013), pages 228–234. IEEE, 2013.

[50] Cayley. CayleyGraph. Available at https://cayley.io/ and https://github.com/cayleygraph/cayley.

[51] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast Graph Pattern Matching. ICDE, pages 913–922, 2008.


[52] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at facebook-scale. Proceedings of the VLDB Endowment, 8(12):1804–1815, 2015.

[53] M. Ciglan, A. Averbuch, and L. Hluchy. Benchmarking traversal operations over graph databases. In 2012 IEEE 28th International Conference on Data Engineering Workshops, pages 186–189. IEEE, 2012.

[54] J. Clark, S. DeRose, et al. Xml path language (xpath) version 1.0, 1999.

[55] E. F. Codd. Relational database: a practical foundation for productivity. In Readings in Artificial Intelligence and Databases, pages 60–68. Elsevier, 1989.

[56] R. Cyganiak, D. Wood, and M. Lanthaler. RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/rdf11-concepts/.

[57] G. B. Dantzig and D. R. Fulkerson. On the max flow min cut theorem of networks. RAND CORP, 1955.

[58] DataStax, Inc. DSE Graph (DataStax). Available at https://www.datastax.com/.

[59] C. J. Date and H. Darwen. A Guide to the SQL Standard, volume 3. Addison-Wesley New York, 1987.

[60] A. Davoudian, L. Chen, and M. Liu. A survey on NoSQL stores. ACM Computing Surveys (CSUR), 51(2):40, 2018.

[61] Dgraph Labs Inc. BadgerDB. https://dbdb.io/db/badgerdb.

[62] Dgraph Labs, Inc. DGraph. Available at https://dgraph.io/, https://docs.dgraph.io/design-concepts.

[63] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek, M. Besta, L. Benini, D. Roweth, and T. Hoefler. Network-accelerated non-contiguous memory transfers. arXiv preprint arXiv:1908.08590, 2019.

[64] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, pages 269–271, 1959.

[65] D. Dominguez-Sal, P. Urbón-Bayes, A. Giménez-Vanó, S. Gómez-Villamor, N. Martínez-Bazan, and J.-L. Larriba-Pey. Survey of graph database performance on the HPC scalable graph analysis benchmark. In International Conference on Web-Age Information Management, pages 37–48. Springer, 2010.

[66] C. Douglas. Ubiquitous B-Tree. In ACM Comput. Surv., pages 121–137, 1979.

[67] A. Dubey, G. D. Hill, R. Escriva, and E. G. Sirer. Weaver: A High-performance, Transactional Graph Database Based on Refinable Timestamps. In VLDB, pages 852–863, 2016. Available at http://weaver.systems/.

[68] P. DuBois and M. Foreword By-Widenius. MySQL. New Riders Publishing, 1999.

[69] W. Eberle, J. Graves, and L. Holder. Insider threat detection using a graph-based approach. Journal of Applied Security Research, 6(1):32–81, 2010.

[70] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. Stinger: High performance data structure for streaming graphs. In 2012 IEEE Conference on High Performance Extreme Computing, pages 1–5. IEEE, 2012.

[71] H. El Maazouz, G. Wachsmuth, M. Sevenich, D. Chiadmi, S. Hong, and H. Chafi. A dsl-based framework for performance assessment. In International Conference Europe Middle East & North Africa Information Systems and Technologies to Support Learning, pages 260–270. Springer, 2019.

[72] Elastic. Elasticsearch. Available at https://www.elastic.co/products/elasticsearch.

[73] R. Elmasri and S. B. Navathe. Advantages of Distributed Databases. In Fundamentals of Database Systems, 6th edition, chapter 25.1.5, page 882. Pearson, 2011.

[74] R. Elmasri and S. B. Navathe. Data Fragmentation. In Fundamentals of Database Systems, chapter 25.4.1, pages 894–897. Pearson, 2011.

[75] O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat, M.-D. Pham, and P. Boncz. The LDBC Social Network Benchmark: Interactive Workload. In SIGMOD, pages 619–630, 2015.

[76] FactNexus. GraphBase. Available at https://graphbase.ai/.

[77] Fauna. FaunaDB. Available at https://fauna.com/.

[78] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, and A. Taylor. Cypher: An evolving query language for property graphs. In ACM SIGMOD, pages 1433–1445. ACM, 2018.

[79] Franz Inc. AllegroGraph. Available at https://franz.com/agraph/allegrograph/.

[80] S. K. Gajendran. A survey on NoSQL databases. University of Illinois, 2012.

[81] H. Garcia-Molina, J. D. Ullman, and J. Widom. Data Replication. In Database Systems: The Complete Book, 1st Edition, chapter 19.4.3, page 1021. Pearson, 2002.

[82] L. George. HBase: the definitive guide: random access to your planet-size data. O’Reilly Media, Inc., 2011.

[83] R. Gerstenberger, M. Besta, and T. Hoefler. Enabling highly-scalable remote memory access programming with mpi-3 one sided. Scientific Programming, 22(2):75–91, 2014.

[84] S. Ghorbani, Z. Yang, P. Godfrey, Y. Ganjali, and A. Firoozshahian. Drill: Micro load balancing for low-latency data center networks. In Conference of the ACM Special Interest Group on Data Communication, pages 225–238. ACM, 2017.

[85] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler. Communication-avoiding parallel minimum cuts and connected components. In ACM SIGPLAN Notices, volume 53, pages 219–232. ACM, 2018.

[86] Google. Graphd. Available at https://github.com/google/graphd.


[87] Graph Story Inc. Graph Story. Available at https://www.graphstory.com/.

[88] A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Schuster, P. Selmer, and H. Voigt. Updating graph databases with Cypher. Proceedings of the VLDB Endowment, 12(12):2242–2254, 2019.

[89] A. Green, M. Junghanns, M. Kießling, T. Lindaaker, S. Plantikow, and P. Selmer. openCypher: New directions in property graph querying. In EDBT, pages 520–523, 2018.

[90] L. Guzenda. Objectivity/DB-a high performance object database architecture. In Workshop on High performance object databases, 2000.

[91] J. Han, E. Haihong, G. Le, and J. Du. Survey on NoSQL database. In 2011 6th international conference on pervasive computing and applications, pages 363–366. IEEE, 2011.

[92] M. Han and K. Daudjee. Providing serializability for Pregel-like graph processing systems. In EDBT, pages 77–88, 2016.

[93] M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin. An experimental comparison of Pregel-like graph processing systems. Proceedings of the VLDB Endowment, 7(12):1047–1058, 2014.

[94] O. Hartig. Reconciliation of RDF* and Property Graphs. Available at https://arxiv.org/pdf/1409.3288.pdf.

[95] O. Hartig. RDF* and SPARQL*: An alternative approach to annotate statements in RDF. In International Semantic Web Conference (Posters, Demos & Industry Tracks), 2017.

[96] O. Hartig and J. Pérez. Semantics and complexity of GraphQL. In Proceedings of the 2018 World Wide Web Conference, pages 1155–1164. International World Wide Web Conferences Steering Committee, 2018.

[97] J. Hayes. A Graph Model for RDF. Diploma thesis, Technische Universität Darmstadt, Universidad de Chile, 2004.

[98] J. M. Hellerstein and M. Stonebraker. Readings in database systems. MIT Press, 2005.

[99] T. Hoefler, J. Dinan, R. Thakur, B. Barrett, P. Balaji, W. Gropp, and K. Underwood. Remote memory access programming in MPI-3. ACM Transactions on Parallel Computing, 2(2):9, 2015.

[100] J. A. Hoffer, V. Ramesh, and H. Topi. Modern database management. Upper Saddle River, NJ: Prentice Hall, 2011.

[101] F. Holzschuher and R. Peinl. Performance of graph query languages: comparison of Cypher, Gremlin and native access in Neo4j. In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pages 195–204. ACM, 2013.

[102] S. Hong, S. Depner, T. Manhardt, J. Van Der Lugt, M. Verstraaten, and H. Chafi. PGX.D: a fast distributed graph processing engine. In ACM/IEEE Supercomputing, page 58. ACM, 2015.

[103] InfiniBand Trade Association. InfiniBand: Architecture specification 1.3. 2015.

[104] InfoGrid. The InfoGrid Graph Database.

[105] B. Iordanov. HyperGraphDB: A Generalized Graph Database. In WAIM, pages 25–36, 2010.

[106] A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T. Manhardt, H. Chafi, M. Capotă, N. Sundaram, M. Anderson, I. G. Tănase, Y. Xia, L. Nai, and P. Boncz. LDBC Graphalytics: A Benchmark for Large-scale Graph Analysis on Parallel and Distributed Platforms. In VLDB, pages 1317–1328, 2016.

[107] Jesús Barrasa. RDF Triple Stores vs. Labeled Property Graphs: What’s the Difference? Available at https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/.

[108] B. Jiang. A short note on data-intensive geospatial computing. In Information Fusion and Geographic Information Systems, pages 13–17. Springer, 2011.

[109] S. Jouili and V. Vansteenberghe. An empirical comparison of graph databases. In 2013 International Conference on Social Computing, pages 708–715. IEEE, 2013.

[110] M. Junghanns, A. Petermann, M. Neumann, and E. Rahm. Management and analysis of big graph data: current systems and open challenges. In Handbook of Big Data Technologies, pages 457–505. Springer, 2017.

[111] V. Kalavri, V. Vlassov, and S. Haridi. High-level programming abstractions for distributed graph processing. IEEE Transactions on Knowledge and Data Engineering, 30(2):305–324, 2017.

[112] R. K. Kaliyar. Graph databases: A survey. In ICCCA, pages 785–790, 2015.

[113] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos. Gbase: An efficient analysis platform for large graphs. In PVLDB, pages 637–650, 2012.

[114] M. Kay. XSLT: programmer’s reference. Wrox Press Ltd., 2001.

[115] J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, et al. Mathematical foundations of the GraphBLAS. In High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2016.

[116] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. In 2008 International Symposium on Computer Architecture, pages 77–88. IEEE, 2008.

[117] V. Kolomicenko. Analysis and Experimental Comparison of Graph Databases. Charles University in Prague, 2013. Master Thesis.

[118] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Amer. Math. Soc., pages 48–50, 1956.

[119] R. Kuć. Apache Solr 4 Cookbook. Packt Publishing Ltd, 2013.

[120] V. Kumar and A. Babu. Domain suitable graph database selection: A preliminary report. In 3rd International Conference on Advances in Engineering Sciences & Applied Mathematics, London, UK, pages 26–29, 2015.

[121] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[122] LambdaZen LLC. Bitsy. Available at https://github.com/lambdazen/bitsy and https://bitbucket.org/lambdazen/bitsy/wiki/Home.

[123] H. Lin, X. Zhu, B. Yu, X. Tang, W. Xue, W. Chen, L. Zhang, T. Hoefler, X. Ma, X. Liu, et al. Shentu: processing multi-trillion edge graphs on millions of cores in seconds. In ACM/IEEE Supercomputing, page 56. IEEE Press, 2018.

[124] M. Lissandrini, M. Brugnara, and Y. Velegrakis. An evaluation methodology and experimental comparison of graph databases. Technical report, University of Trento, 04 2017. Available at https . . . , 2017.

[125] M. Lissandrini, M. Brugnara, and Y. Velegrakis. Beyond macrobenchmarks: microbenchmark-based graph database evaluation. Proceedings of the VLDB Endowment, 12(4):390–403, 2018.

[126] Y. Lu, G. Chen, B. Li, K. Tan, Y. Xiong, P. Cheng, J. Zhang, E. Chen, and T. Moscibroda. Multi-path transport for RDMA in datacenters. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 357–371, 2018.

[127] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. W. Berry. Challenges in Parallel Graph Processing. Par. Proc. Let., 17(1):5–20, 2007.

[128] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In ACM SIGMOD, SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM.

[129] MariaDB. OQGRAPH. Available at https://mariadb.com/kb/en/library/oqgraph-storage-engine/.

[130] MarkLogic Corporation. MarkLogic. Available at https://www.marklogic.com.

[131] J. Marton, G. Szárnyas, and D. Varró. Formalising openCypher graph queries in relational algebra. In European Conference on Advances in Databases and Information Systems, pages 182–196. Springer, 2017.

[132] N. Martínez-Bazan, V. Muntés-Mulero, S. Gómez-Villamor, M. Águila Lorente, D. Dominguez-Sal, and J.-L. Larriba-Pey. Efficient Graph Management Based On Bitmap Indices. In IDEAS, pages 110–119, 2012.

[133] K. J. Maschhoff, R. Vesse, S. R. Sukumar, M. F. Ringenburg, and J. Maltby. Quantifying performance of CGE: A unified scalable pattern mining and search system. In Cray User Group Conference (CUG’17), Seattle, WA, 2017.

[134] M. McCandless, E. Hatcher, and O. Gospodnetic. Lucene in action: covers Apache Lucene 3.0. Manning Publications Co., 2010.

[135] R. McColl, D. Ediger, J. Poovey, D. Campbell, and D. Bader. A brief study of open source graph databases. arXiv preprint arXiv:1309.2675, 2013.

[136] R. C. McColl, D. Ediger, J. Poovey, D. Campbell, and D. A. Bader. A performance evaluation of open source graph databases. In Proceedings of the first workshop on Parallel programming for analytics applications, pages 11–18. ACM, 2014.

[137] R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys (CSUR), 48(2):25, 2015.

[138] Memgraph Ltd. Memgraph. Available at https://memgraph.com/.

[139] Microsoft. Azure Cosmos DB. Available at https://azure.microsoft.com/en-us/services/cosmos-db/.

[140] Microsoft. Microsoft SQL Server 2017. Available at https://www.microsoft.com/en-us/sql-server/sql-server-2017.

[141] B. Momjian. PostgreSQL: introduction and concepts, volume 192. Addison-Wesley New York, 2001.

[142] T. Mueller. H2 database engine, 2006.

[143] Networked Planet Limited. BrightstarDB. Available at http://brightstardb.com/.

[144] D. Nicoara, S. Kamali, K. Daudjee, and L. Chen. Hermes: Dynamic partitioning for distributed social network graph databases. In EDBT, pages 25–36, 2015.

[145] Objectivity Inc. ThingSpan. Available at https://www.objectivity.com/products/thingspan/.

[146] Ontotext. GraphDB. Available at https://www.ontotext.com/products/graphdb/.

[147] OpenLink. Virtuoso. Available at https://virtuoso.openlinksw.com/.

[148] Oracle. Oracle Spatial and Graph. Available at https://www.oracle.com/database/technologies/spatialandgraph.html.

[149] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun. Making sense of performance in data analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 293–307, 2015.

[150] M. T. Özsu. A Survey of RDF Data Management Systems. CoRR, abs/1601.00707, 2016.

[151] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[152] N. Patil, P. Kiran, N. Kavya, and K. Naresh Patel. A Survey on Graph Database Management Techniques for Huge Unstructured Data. International Journal of Electrical and Computer Engineering, 81:1140–1149, 2018.

[153] J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Transactions on Database Systems (TODS), 34(3):16, 2009.

[154] F. Petroni, L. Querzoni, K. Daudjee, S. Kamali, and G. Iacoboni. HDRF: Stream-based partitioning for power-law graphs. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 243–252, 2015.

[155] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 1–2. ACM, 2009.

[156] J. Pokorny. Graph databases: their power and limitations. In IFIP International Conference on Computer Information Systems and Industrial Management, pages 58–69. Springer, 2015.

[157] Profium. Profium Sense. Available at https://www.profium.com/en/.

[158] A. Rana. Detailed Introduction: Redis Modules, from Graphs to Machine Learning (Part 1), 2 2019.

[159] M. Raynal. Using Semaphores to Solve the Readers-Writers Problem. In Concurrent Programming: Algorithms, Principles, and Foundations, chapter 3.2.4, pages 74–78. Springer, 2013.

[160] Redis Labs. Redis. Available at https://redis.io/.

[161] Redis Labs. RedisGraph. Available at https://oss.redislabs.com/redisgraph/.

[162] C. D. Rickett, U.-U. Haus, J. Maltby, and K. J. Maschhoff. Loading and Querying a Trillion RDF triples with Cray Graph Engine on the Cray XC. Cray Users Group, 2018.

[163] Robert Yokota. HGraphDB. Available at https://github.com/rayokota/hgraphdb.

[164] I. Robinson, J. Webber, and E. Eifrem. Graph database internals. In Graph Databases, Second Edition, chapter 7, pages 149–170. O’Reilly, 2015.

[165] M. A. Rodriguez. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages, pages 1–10. ACM, 2015.

[166] M. A. Rodriguez. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages, pages 1–10. ACM, 2015.

[167] N. P. Roth, V. Trigonakis, S. Hong, H. Chafi, A. Potter, B. Motik, and I. Horrocks. PGX.D/Async: A scalable distributed graph pattern matching engine. In International Workshop on Graph Data-management Experiences & Systems, pages 1–6, 2017.

[168] A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 410–424. ACM, 2015.

[169] M. Rudolf, M. Paradies, C. Bornhövd, and W. Lehner. The graph story of the SAP HANA database. Datenbanksysteme für Business, Technologie und Web (BTW) 2037, 2013.

[170] S. Sahu, A. Mhedhbi, S. Salihoglu, J. Lin, and M. T. Özsu. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. In VLDB, 11(4):420–431, 2017.

[171] SAP. SAP HANA. Available at https://www.sap.com/products/hana.html and https://help.sap.com/.

[172] S. E. Schaeffer. Graph clustering. Computer science review, 1(1):27–64, 2007.

[173] P. Schmid, M. Besta, and T. Hoefler. High-performance distributed RMA locks. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pages 19–30. ACM, 2016.

[174] A. Seaborne and S. Harris. SPARQL 1.1 query language. Available at http://www.w3.org/TR/2013/REC-sparql11-query-20130321/.

[175] B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In SIGMOD, pages 505–516, 2013.

[176] S. Sharma. Cassandra Design Patterns. Packt Publishing Ltd, 2014.

[177] K. Singh and V. Singh. Graph pattern matching: A brief survey of challenges and research directions. In INDIACom, pages 199–204, 2016.

[178] solid IT gmbh. DB-Engines: Ranking of Graph DBMS. Available at https://db-engines.com/en/ranking/graph+dbms.

[179] solid IT gmbh. System Properties Comparison: Neo4j vs. Redis. Available at https://db-engines.com/en/system/Neo4j%3BRedis.

[180] E. Solomonik, M. Besta, F. Vella, and T. Hoefler. Scaling betweenness centrality using communication-efficient sparse matrix multiplication. In International Conference for High Performance Computing, Networking, Storage and Analysis, page 47. ACM, 2017.

[181] Stardog Union. Stardog. Available at https://www.stardog.com/.

[182] B. A. Steer, A. Alnaimi, M. A. Lotz, F. Cuadrado, L. M. Vaquero, and J. Varvenne. Cytosm: Declarative property graph queries without data migration. In International Workshop on Graph Data-management Experiences & Systems, pages 1–6, 2017.

[183] G. Szárnyas, A. Prat-Pérez, A. Averbuch, J. Marton, M. Paradies, M. Kaufmann, O. Erling, P. Boncz, V. Haprian, and J. B. Antal. An Early Look at the LDBC Social Network Benchmark’s Business Intelligence Workload. GRADES-NDA, pages 9:1–9:11, 2018.

[184] A. T. Tan and H. Kaiser. Extending C++ with Co-array Semantics. In ARRAY, pages 63–68, 2016.

[185] A. Tate, A. Kamil, A. Dubey, A. Größlinger, B. Chamberlain, B. Goglin, C. Edwards, C. J. Newburn, D. Padua, D. Unat, et al. Programming abstractions for data locality. PADAL Workshop 2014, 2014.

[186] The Linux Foundation. JanusGraph. Available at http://janusgraph.org/.

[187] TigerGraph. TigerGraph. Available at https://www.tigergraph.com/.

[188] Twitter. FlockDB. Available at https://github.com/twitter-archive/flockdb.

[189] N. Tyagi and N. Singh. Graph database-an overview of its applications and its types.

[190] A. Vaikuntam and V. K. Perumal. Evaluation of contemporary graph databases. In Proceedings of the 7th ACM India Computing Conference, page 6. ACM, 2014.

[191] O. van Rest, S. Hong, J. Kim, X. Meng, and H. Chafi. PGQL: a property graph query language. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, pages 1–6, 2016.

[192] VelocityDB Inc. VelocityDB. Available at https://velocitydb.com/.

[193] VelocityDB Inc. VelocityGraph. Available at https://velocitydb.com/VelocityGraph.aspx.

[194] WhiteDB Team. WhiteDB. Available at http://whitedb.org/ or https://github.com/priitj/whitedb.

[195] Y. Xia, I. G. Tanase, L. Nai, W. Tan, Y. Liu, J. Crawford, and C.-Y. Lin. Explore efficient data organization for large scale graph analytics and storage. In Big Data (Big Data), 2014 IEEE International Conference on, pages 942–951. IEEE, 2014.

[196] K. Xirogiannopoulos, V. Srinivas, and A. Deshpande. GraphGen: Adaptive Graph Processing Using Relational Databases. In GRADES, pages 9:1–9:7, 2017.

[197] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. TripleBit: a fast and compact system for large scale RDF data. Proceedings of the VLDB Endowment, 6(7):517–528, 2013.

[198] K. Zhao and J. X. Yu. All-in-One: Graph Processing in RDBMSs Revisited. In SIGMOD, pages 1165–1180, 2017.