
Research Collection

Master Thesis

A High-level Graph Query Language Interface for Differential Dataflow

Author(s): Striebel, Lukas

Publication Date: 2017

Permanent Link: https://doi.org/10.3929/ethz-a-010889948

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library


Master’s Thesis Nr. 164

Systems Group, Department of Computer Science, ETH Zurich

A High-level Graph Query Language Interface for Differential Dataflow

by

Lukas Striebel

Supervised by

Dr. Desislava Dimitrova

Dr. Ioannis Liagouris

Prof. Timothy Roscoe

October 2016 - April 2017


Abstract

In today’s world, computer networks have grown immensely in terms of size and complexity. In an attempt to manage the dynamics and scale of these networks, software-defined networking was introduced, an approach that aims to supervise the behavior of the network through open interfaces. In order to deal with the challenges that arise in controlling software-defined networks, complicated analysis of the characteristics and structures of these networks has to be done. In particular, the network controller has to be capable of matching user-defined patterns, finding the shortest paths across the network and extracting subgraphs that conform to specific requirements. In this thesis, we present an interface of the Property Graph Query Language for Differential Dataflow, capable of fulfilling these needs. The Property Graph Query Language enables us to express the three aforementioned tasks in a formal way, whereas Differential Dataflow provides the proper tools to conduct the analysis.


Acknowledgements

First and foremost, I would like to thank my supervisors John Liagouris, Desislava Dimitrova and Moritz Hoffmann for their continuous support and help. They provided many meaningful suggestions and proposals when I was facing seemingly irresolvable problems. Their expertise and experience were invaluable.

I would also like to thank Professor Roscoe for taking the time to supervise my thesis and for offering me the opportunity to work on this highly challenging project.


Contents

1 Introduction
    1.1 Motivation
    1.2 Structure of the Thesis

2 Background
    2.1 The Property Graph Query Language
        2.1.1 Path Clause
        2.1.2 Select Clause
        2.1.3 Aggregation
        2.1.4 Where Clause
        2.1.5 Solution Modifier Clause
    2.2 Dataflow Programming
    2.3 Differential Dataflow
    2.4 The Rust Programming Language
    2.5 Nom
        2.5.1 Macros
        2.5.2 Functions

3 Parser
    3.1 Path Clause
    3.2 Select Clause
    3.3 Where Clause
        3.3.1 Path Constraints
        3.3.2 Value Constraints
    3.4 Solution Modifier Clause

4 Query Engine
    4.1 Graph Model
    4.2 Construction of the Dataflow
        4.2.1 Constraint sorting
        4.2.2 Construction of the execution plan
        4.2.3 Selection on the vertex set
        4.2.4 Execution of the plan
        4.2.5 Applying the projection and aggregation
        4.2.6 Evaluating an expression

5 Experimental Evaluation
    5.1 Execution
    5.2 Hardware
    5.3 Topologies
    5.4 Overview
    5.5 Query 1
    5.6 Query 2
    5.7 Query 3
    5.8 Query 4
    5.9 Query 5
    5.10 Query 6
    5.11 Query 7

6 Related Work
    6.1 SPARQL
    6.2 Green-Marl
    6.3 Gremlin
    6.4 SQLGraph
    6.5 GraphiQL
    6.6 Overview

7 Summary
    7.1 Conclusion
    7.2 Future Work
        7.2.1 Shortest Path
        7.2.2 Additional Joins
        7.2.3 Query Optimizer
        7.2.4 Changes in the Graph


List of Figures

2.1 Timely Dataflow Example
4.1 Sample Graph
4.2 Query Evaluation Overview
5.1 Fat Tree Topology Example
5.2 Jellyfish Topology Example
5.3 Query Results Overview
5.4 Dataflow Query 1
5.5 Boxplots for Query 1
5.6 Dataflow Query 2
5.7 Boxplots for Query 2
5.8 Dataflow Query 3
5.9 Boxplots for Query 3
5.10 Dataflow Query 4
5.11 Boxplots for Query 4
5.12 Dataflow Query 5
5.13 Boxplots for Query 5
5.14 Dataflow Query 6
5.15 Boxplots for Query 6
5.16 Dataflow Query 7
5.17 Boxplots for Query 7


List of Listings

3.1 PGQL Query
3.2 Path Clause
3.3 Select Clause
3.4 Vertex Parser
3.5 Edge Parser
3.6 Folding Connections
3.7 Literals
3.8 Folding Expressions
3.9 Solution Modifier
3.10 Group By
3.11 Order By
3.12 Limit and Offset
4.1 Graph Parser
4.2 Graph Node
4.3 Labels
4.4 Attributes
4.5 Graph Edge
4.6 Constraint Sorting
4.7 Plan Construction
4.8 Vertex Selection
4.9 Plan Execution
4.10 Projection
4.11 Expression Evaluation


1 Introduction

In this chapter, we describe the motivation behind the thesis and the problems we aim to solve. Furthermore, we outline the structure of the thesis.

1.1 Motivation

In our modern digital age, the Internet is the largest source of data and knowledge. The websites of large companies are accessed thousands of times a day. Furthermore, the access requests do not only grow in number, but also become more complex, and company management demands ever more sophisticated handling of them. In order to cope with this huge amount of client requests, large server farms were built. In these farms, hundreds of servers and switches are connected, and requests have to travel across multiple machines until they are answered. Thus, the need arises for a system that directs and balances the requests. Additionally, this system should be able to retrieve information from the network and execute algorithms such as Dijkstra’s shortest path. The results of these operations are needed to make the right decision when facing difficult choices in controlling the distribution and processing of the user requests.

The first step to deal with this challenge is to model this complicated network of computers as a property graph. A property graph is a graph which allows its vertices and edges to possess an unlimited number of attributes. Traditionally, very few attributes are allowed in a graph, e.g. a weight attribute on the edges. Our property graph, however, allows attributes of type string, float and boolean.

In a second step, an algorithm determines how the requests should best be directed in order to not overload any part of the network. This thesis focuses on the second step and proposes a program that can evaluate queries on a property graph, in order to determine weak links and find the optimal distribution of the work load.

This thesis combines the new query language PGQL, which is designed to unify graph queries, with a powerful analytics framework called Differential Dataflow. This framework provides us with the necessary tools to load a property graph into memory, construct a query plan, execute this plan on the graph and obtain the desired answer to the user’s request. However, Differential Dataflow does not provide a declarative language for property graph management. In this thesis, we fill that gap.

We aimed to develop a program that is capable of fulfilling three main needs. In its current state, however, our query engine does not yet provide all three of them. First of all, the query engine should be able to match a given pattern against a given graph. For example, we would like to identify all routers that run on outdated firmware. Next, we mark those routers for update and ensure that all traffic is diverted. To complete this task, we need the second and third desired capabilities of our program: path queries and subgraph construction. Unfortunately, our current program does not support these two functionalities. They are left as an opportunity for future work.

1.2 Structure of the Thesis

The thesis has the following structure:

• First, we give an overview of the technologies used in the thesis in Chapter 2.

• We continue by giving a detailed explanation of the query parser in Chapter 3.

• In Chapter 4, we explain how the program sets up the dataflows according to the query.

• Next, we describe the experiments that were run in order to determine the performance of the program in Chapter 5.

• An overview of other graph query languages and query plan generators can be found in Chapter 6.

• Finally, in Chapter 7, we draw our conclusion, provide an outlook and show potential for future work.


2 Background

In this chapter, the relevant technologies used in the thesis are explained.

2.1 The Property Graph Query Language

The Property Graph Query Language (PGQL) was developed by Oracle [17]. PGQL enables developers to write intuitive path queries over a property graph. The syntax of PGQL greatly resembles that of SQL. A PGQL query consists of four clauses, two of which are optional:

• Path Clause (optional)

• Select Clause (required)

• Where Clause (required)

• Solution Modifier Clause (optional)

In the following, we give a short overview of the four clauses. We go into more detail in Chapter 3, in which we explain precisely how each clause is parsed.

2.1.1 Path Clause

In the Path Clause, custom path patterns are defined, which can then be reused in the Where Clause.

Example:

PATH connects_to := (:Generator) --> (:Connector WITH status = 'OPERATIVE')

In this example, we define connects_to as an edge between a generator and an operative connector.

2.1.2 Select Clause

The PGQL Select Clause bears great similarity to its SQL counterpart. Here, the set of properties one wishes to retrieve is defined. A selection is one of the following:

• everything, indicated by the use of *

• certain attributes of edges and/or vertices

• aggregations of certain attributes of edges and/or vertices

Example:

SELECT v.name, AVG(v.age)

In this example, we retrieve the names and the average age of all vertices that fit our criteria.

2.1.3 Aggregation

As shown in the previous example, aggregation functions can be used in the Select Clause. Currently, five different types of aggregation are supported:

• COUNT, returns the number of tuples in the solution.

• MAX, returns the maximum value of an attribute in any tuple. The specified attribute has to be numeric.

• MIN, returns the minimum value of an attribute in any tuple. The specified attribute has to be numeric.

• SUM, returns the sum of an attribute over all the tuples. The specified attribute has to be numeric.

• AVG, returns the average of an attribute over all the tuples. The specified attribute has to be numeric.
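
The five aggregates can be pictured as simple folds over one numeric column of the solution. The following is a minimal sketch in plain Rust and not code from the query engine; the names Aggregate and aggregate are hypothetical, and returning None for an empty column is an assumption of this sketch.

```rust
/// Hypothetical enumeration of the five aggregates listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Aggregate {
    Count,
    Max,
    Min,
    Sum,
    Avg,
}

/// Applies one aggregate to a numeric column of the solution.
/// COUNT is defined for any input; the other aggregates return
/// `None` for an empty column (an assumption of this sketch).
fn aggregate(kind: Aggregate, values: &[f32]) -> Option<f32> {
    match kind {
        Aggregate::Count => Some(values.len() as f32),
        _ if values.is_empty() => None,
        Aggregate::Max => Some(values.iter().cloned().fold(f32::MIN, f32::max)),
        Aggregate::Min => Some(values.iter().cloned().fold(f32::MAX, f32::min)),
        Aggregate::Sum => Some(values.iter().sum()),
        Aggregate::Avg => Some(values.iter().sum::<f32>() / values.len() as f32),
    }
}
```

For the ages 20, 30 and 40, for instance, aggregate(Aggregate::Avg, ...) yields 30.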

2.1.4 Where Clause

The most complex part of the query is the Where Clause. In this clause, all the requirements that the edges and vertices of the result set have to fulfill are specified.

Example:

WHERE v.name = 'Alice', v.age > 30, w.name != 'Bob', v -> w

In this example, we want to find a vertex called Alice which is older than thirty and has an edge to another vertex whose name is not Bob. We achieve this by defining a list of constraints, for example: v.name = 'Alice'.

There can be an arbitrary number of these constraints, and they can of course be more complex than in the example above.
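
To make the semantics of the example concrete, the following sketch evaluates exactly these four constraints on a tiny in-memory graph. It is an illustration in plain Rust under simplifying assumptions, not the dataflow-based evaluation of the query engine; the Vertex struct, the edge list representation and the function match_example are hypothetical.

```rust
/// Hypothetical vertex with the two attributes used in the example query.
struct Vertex {
    name: &'static str,
    age: u32,
}

/// Returns all index pairs (v, w) that satisfy the example constraints:
/// v.name = 'Alice', v.age > 30, w.name != 'Bob', v -> w.
fn match_example(vertices: &[Vertex], edges: &[(usize, usize)]) -> Vec<(usize, usize)> {
    edges
        .iter()
        .cloned()
        // The path constraint v -> w is given by the edge itself;
        // the value constraints are checked on both endpoints.
        .filter(|&(v, w)| {
            vertices[v].name == "Alice" && vertices[v].age > 30 && vertices[w].name != "Bob"
        })
        .collect()
}
```

With the vertices Alice (35), Bob (40) and Carol (28) and the edges Alice -> Bob, Alice -> Carol and Carol -> Alice, only the pair (Alice, Carol) satisfies all constraints.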

2.1.5 Solution Modifier Clause

Finally, the result set can be modified in the Solution Modifier Clause. This clause can be broken down into four parts, all of which are extremely similar to their SQL counterparts:

• Group by: The developer specifies the attributes by which the results are grouped for the aggregation in the Select Clause.

• Order by: A set of attributes that determine the order of the results, either ascending or descending.

• Limit: Denotes the maximum cardinality of the result set.

• Offset: Denotes how many tuples of the result set shall be skipped.

The official PGQL specification can be accessed online at [6].
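
The effect of Order by, Offset and Limit can be sketched with standard iterator adaptors. The function below is a hypothetical simplification in plain Rust: it orders by a single numeric key, always applies all three modifiers and leaves out Group by, whereas a real query may order by several attributes and omit any modifier.

```rust
use std::cmp::Ordering;

/// Applies ORDER BY (on the numeric second column), OFFSET and LIMIT
/// to a result set of (name, value) rows.
fn apply_modifiers(
    mut rows: Vec<(String, f32)>,
    ascending: bool,
    offset: usize,
    limit: usize,
) -> Vec<(String, f32)> {
    rows.sort_by(|a, b| {
        // Values that cannot be compared (NaN) are treated as equal.
        let ord = a.1.partial_cmp(&b.1).unwrap_or(Ordering::Equal);
        if ascending { ord } else { ord.reverse() }
    });
    // OFFSET skips leading tuples, LIMIT bounds the cardinality.
    rows.into_iter().skip(offset).take(limit).collect()
}
```

Sorting first and only then skipping and truncating mirrors the SQL semantics the text refers to: Offset and Limit act on the already ordered result.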

2.2 Dataflow Programming

Traditionally, a program is modeled by the programmer as an explicit list of instructions to be executed by the computer. However, there are certain programming paradigms that do not conform to this idea. Dataflow-oriented programming is one of them. In the dataflow programming paradigm, a program is modeled as a directed graph of data flowing between operations. Rather than explicitly defining the order of execution as most other programming paradigms do, dataflow programming specifies a network of nodes, between which the data is exchanged.

In program execution, this means that a node remains idle as long as its input buffer is empty. Only after the node receives an input does it begin to execute the operations that are defined in the program. Once the computation is completed, the resulting data is forwarded to the next node. Each node can be viewed as a black box, since the programmers do not have to worry about what happens inside. Their responsibilities lie only in connecting the nodes in the correct manner and describing how the data should be manipulated.
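
The execution model described above can be imitated with threads and channels from the Rust standard library. The sketch below is only an illustration of the paradigm and does not use Timely or Differential Dataflow; the function run_pipeline and its two nodes are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Builds a two-node dataflow (double, then add one) and feeds `inputs`
/// through it. Each node blocks on its input channel and only computes
/// when data arrives.
fn run_pipeline(inputs: Vec<i32>) -> Vec<i32> {
    let (source_tx, double_rx) = mpsc::channel::<i32>();
    let (double_tx, add_rx) = mpsc::channel::<i32>();
    let (add_tx, sink_rx) = mpsc::channel::<i32>();

    // Node 1: idle until a value arrives, then doubles and forwards it.
    let double = thread::spawn(move || {
        for x in double_rx {
            double_tx.send(x * 2).unwrap();
        }
    });
    // Node 2: idle until node 1 produces output, then adds one.
    let add_one = thread::spawn(move || {
        for x in add_rx {
            add_tx.send(x + 1).unwrap();
        }
    });

    for x in inputs {
        source_tx.send(x).unwrap();
    }
    // Closing the source lets the nodes shut down one after the other.
    drop(source_tx);

    let outputs: Vec<i32> = sink_rx.iter().collect();
    double.join().unwrap();
    add_one.join().unwrap();
    outputs
}
```

The caller only wires the nodes together and supplies the input; the order in which the two nodes actually run is decided by the arrival of data, not by the program text.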

2.3 Differential Dataflow

Differential Dataflow is a data-parallel programming framework designed to efficiently process large volumes of data and to quickly respond to arbitrary changes in input collections [12]. The main contributor to Differential Dataflow is Frank McSherry.

Differential Dataflow is an extension of Timely Dataflow, which was first introduced by Derek Murray and is implemented in the Naiad system [2]. Differential Dataflow provides a number of operators such as join, group, map and filter, which are used to express the dataflow. These operators are data-parallel and functional, which means their output is computed by partitioning the input data according to a key and then applying the desired function to each partition.

Figure 2.1: An example of a Timely Dataflow. Source: [13]

Differential Dataflow has its own unique way of dealing with dynamic data. Rather than overwriting the entire data every time a change occurs, Differential Dataflow only keeps track of the changes made to the data. This allows for huge time savings when executing the same operation multiple times on different data, since the results of previous computations can be reused. For this reason, all records in Differential Dataflow are accompanied by a timestamp.

Differential Dataflow is open-source and available for download at [1].
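
The idea of processing only the changes can be illustrated with a per-key count that is maintained from timestamped updates. This is a deliberately simplified sketch in plain Rust and not the Differential Dataflow API; the types Update and IncrementalCount are hypothetical.

```rust
use std::collections::HashMap;

/// One change to the input collection: `diff` is +1 for an insertion and
/// -1 for a removal, tagged with the logical time of the change.
struct Update {
    key: &'static str,
    time: u64,
    diff: i64,
}

/// Per-key count that is maintained incrementally. Only the changes are
/// processed; previously computed counts are reused.
#[derive(Default)]
struct IncrementalCount {
    counts: HashMap<&'static str, i64>,
    latest_time: u64,
}

impl IncrementalCount {
    /// Applies a batch of updates and returns the keys whose count was
    /// touched, together with their new counts.
    fn apply(&mut self, batch: &[Update]) -> Vec<(&'static str, i64)> {
        let mut changed: Vec<&'static str> = Vec::new();
        for update in batch {
            *self.counts.entry(update.key).or_insert(0) += update.diff;
            self.latest_time = self.latest_time.max(update.time);
            if !changed.contains(&update.key) {
                changed.push(update.key);
            }
        }
        changed.into_iter().map(|key| (key, self.counts[key])).collect()
    }
}
```

After an initial batch, a later batch that removes a single record only touches the count of that record’s key; all other counts are left as they were.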

2.4 The Rust Programming Language

Rust is a fairly new programming language. Its origin lies within Mozilla, where former employee Graydon Hoare started development in 2006. In 2009, Mozilla officially began sponsoring the project. It took another six years until the first stable version of the compiler was released. Nowadays most of the development is done by volunteers from the Rust community. The Rust compiler is self-hosted, meaning it is written in Rust as well. Rust won the award for Most Loved Programming Language in 2016, granted by the Stack Overflow community.

Rust has been praised because it combines guaranteed memory safety with a minimal runtime. Other prominent features are "zero-cost abstractions, move semantics, threads without data races, trait-based generics, pattern matching, type inference and efficient C bindings" [4]. All these properties make Rust an ideal language for building a large-scale analytic system.

As mentioned before, one of Rust’s most prominent features is memory safety. This means that it should never be possible to dereference a null pointer or a pointer to an object that has already been deallocated. Out-of-bounds array accesses can still be attempted, however, since these checks have to be done at runtime.

In this thesis, we will often encounter recursive objects such as trees. A struct cannot contain itself by value, since its size would be infinite, so the children have to be wrapped in an object of type Box. Since Rust does not have a null type, a missing child is expressed with Option.

    struct BinaryTree {
        // Box places the child on the heap, so the struct has a finite size;
        // Option represents a missing child.
        left_child: Option<Box<BinaryTree>>,
        right_child: Option<Box<BinaryTree>>,
    }
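
As a usage illustration, a recursive function over such a boxed tree might look as follows. This is a sketch with hypothetical names (Tree, leaf, depth) and is not taken from the query engine.

```rust
/// A binary tree whose children are optional, heap-allocated subtrees.
struct Tree {
    left: Option<Box<Tree>>,
    right: Option<Box<Tree>>,
}

/// Creates a node without children.
fn leaf() -> Tree {
    Tree { left: None, right: None }
}

/// Number of nodes on the longest path from this node down to a leaf.
fn depth(node: &Tree) -> usize {
    // `as_ref` borrows the boxed child instead of moving it out.
    let left = node.left.as_ref().map_or(0, |child| depth(child));
    let right = node.right.as_ref().map_or(0, |child| depth(child));
    1 + left.max(right)
}
```

The recursion ends at nodes whose children are None, which is the role a null pointer would play in other languages.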

Rust is completely open-source, and a large part of the code was written by members of the community. It is currently licensed under the Apache License¹ and the MIT License².

2.5 Nom

Our query parser is built using Nom [9]. Nom was developed by Geoffroy Couprie. It is a byte-oriented, zero-copy, streaming parser library written in Rust. The library provides macros, functions and enumerations which facilitate the parsing process. As of today, there exist dozens of parsers written with Nom, for example for gif, csv and mp4 files. The source code of these parsers can be found on GitHub. Nom is open-source and available for download at [3].

2.5.1 Macros

The most commonly used macros were:

• tag!: Parses a specific string. The parser fails if the string is not found. If it is found, the parser consumes it and returns the matched text, which is usually discarded.

Example:

    tag!("Find this String")

This parser terminates successfully if the input is "Find this String".

• do_parse!: Takes a list of parsers as input, applies them sequentially and finally returns a value, typically a tuple. In older versions of Nom, this macro was called chain!. The parsers are separated by ">>", or by "~" in the case of chain!. The very last entry in the input list defines exactly what the parser should return. When using chain!, the last parser is followed by a comma, and the sign || announces that the next entry defines the returned value.

¹ https://www.apache.org/licenses/LICENSE-2.0

² https://opensource.org/licenses/MIT

Example:

    do_parse!(
        hello: tag!("Hello") >>
        space >>
        world: tag!("World!") >>
        (1)
    )

In this example, we first parse the string "Hello", then arbitrarily many spaces, and then the string "World!". We name the first parser "hello" and the second one "world". This allows us to use the return values of these parsers, although in this example they are not needed. Finally, we return the integer 1.

• alt_complete!: Takes a set of parsers as input and applies them until one succeeds. Returns the result of the first successful parser. The different parsers are separated by the symbol |.

Example:

    alt_complete!(
        tag!("Alice") |
        tag!("Bob")
    )

Here we are trying to find either the string "Alice" or "Bob".

• opt!: Takes a parser as input and makes it optional. Returns None if the parser was unsuccessful, or Some otherwise.

Example:

    do_parse!(
        tag!("Hello") >>
        space >>
        opt!(tag!("Alice")) >>
        ()
    )

This parser accepts either "Hello" or "Hello Alice" as input and returns nothing.

• named!: Facilitates the creation of new custom parsers. The first argument is the name of the new parser, whereas the second one defines what the parser is supposed to recognize. If a value should be returned, its type can be specified after the name. This is not mandatory: if the type is omitted, the parser is assumed to return a byte slice.


Example:

    named!(line_end,
        alt_complete!(
            tag!("\r\n") |
            tag!("\n")
        )
    );

Here we define a parser that recognizes the end of a line. We use the alt_complete! macro because Windows and Linux disagree on the characters which denote the beginning of a new line.

• many0!, many1!: Applies the parser 0 or more times, respectively 1 or more times.

Example:

    many0!(
        tag!("Hello")
    )

This parser accepts "", "Hello", "HelloHello" and so forth.
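
The behaviour of these combinators can be approximated in a few lines of plain Rust. The functions below are a hand-written sketch that imitates the semantics of tag!, alt_complete!, opt! and many0! on string slices; they are hypothetical helpers, they do not use Nom, and Nom’s actual macros operate on byte slices and report errors in a richer way.

```rust
/// Result of a parser: on success, the remaining input and the parsed value.
type ParseResult<'a, T> = Option<(&'a str, T)>;

/// Like tag!: succeeds if the input starts with `expected` and consumes it.
fn tag<'a>(expected: &str, input: &'a str) -> ParseResult<'a, &'a str> {
    if input.starts_with(expected) {
        let (matched, rest) = input.split_at(expected.len());
        Some((rest, matched))
    } else {
        None
    }
}

/// Like alt_complete!: tries each alternative in order, returns the first success.
fn alt<'a>(alternatives: &[&str], input: &'a str) -> ParseResult<'a, &'a str> {
    alternatives.iter().filter_map(|t| tag(t, input)).next()
}

/// Like opt!: never fails, yields `None` if the tag is absent.
fn opt<'a>(expected: &str, input: &'a str) -> (&'a str, Option<&'a str>) {
    match tag(expected, input) {
        Some((rest, matched)) => (rest, Some(matched)),
        None => (input, None),
    }
}

/// Like many0!: applies the tag zero or more times and collects the matches.
fn many0<'a>(expected: &str, mut input: &'a str) -> (&'a str, Vec<&'a str>) {
    let mut matches = Vec::new();
    // An empty tag would never consume input, so it is not repeated.
    while !expected.is_empty() {
        match tag(expected, input) {
            Some((rest, matched)) => {
                matches.push(matched);
                input = rest;
            }
            None => break,
        }
    }
    (input, matches)
}
```

Each function returns the remaining input alongside its result, which is what allows parsers to be chained one after the other, as do_parse! does.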

2.5.2 Functions

The following useful functions were either provided by Nom or implemented by us:

• space: Recognizes white space and skips it.

• char_only: Recognizes alphabetical characters (no special characters) and returns them as a string slice.

• digit: Recognizes digits and returns them as a String.

• float: Recognizes floating-point numbers. This is not the same as the digit function, since we have to recognize the symbols "." and "-". Also, the return type is f32 instead of String.

• boolean: Recognizes booleans (true or false).
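
To illustrate why float differs from digit, the following sketch recognizes an optional "-", a run of digits and an optional fractional part, and converts the result to f32. It is written in plain Rust without Nom and is an assumption about the accepted syntax, not the implementation used in the thesis.

```rust
/// Recognizes an optional '-', digits, and an optional fractional part at
/// the start of `input`. Returns the remaining input and the parsed value.
fn float(input: &str) -> Option<(&str, f32)> {
    let bytes = input.as_bytes();
    let mut end = 0;
    if bytes.first() == Some(&b'-') {
        end += 1;
    }
    let digits_start = end;
    while end < bytes.len() && bytes[end].is_ascii_digit() {
        end += 1;
    }
    if end == digits_start {
        return None; // at least one digit is required
    }
    // Accept a fractional part only if a digit follows the '.'.
    if end + 1 < bytes.len() && bytes[end] == b'.' && bytes[end + 1].is_ascii_digit() {
        end += 1;
        while end < bytes.len() && bytes[end].is_ascii_digit() {
            end += 1;
        }
    }
    let value = input[..end].parse::<f32>().ok()?;
    Some((&input[end..], value))
}
```

Unlike a pure digit recognizer, this function has to look one character ahead at the "." to decide whether it belongs to the number.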


3 Parser

In this chapter, we describe how our parser is built. As mentioned in section 2.1, the parser has to recognize the following four clauses: path definitions (listing 3.1, line 3), select (line 5), where (line 7) and solution modifier (line 9). Finally, it returns a Query object (line 10), which stores the results of the parsers in its attributes.

The main parse function therefore looks as follows:

     1 named!(pgql_query<Query>,
     2     do_parse!(
     3         paths: opt!(path_clause) >>
     4         space >>
     5         select: select_clause >>
     6         space >>
     7         vvhere: where_clause >>
     8         space >>
     9         solmod: opt!(solution_modifier_clause) >>
    10         (Query { select: select, vvhere: vvhere, paths: paths, solmod: solmod})
    11     )
    12 );

Listing 3.1: PGQL Query

3.1 Path Clause

Strictly speaking, the Path Clause is but a list of definitions. Each path definition starts with the keyword ‘path’ (listing 3.2, line 14) and a name (line 16), followed by ‘:=’ (line 18) and completed by a path description (line 20). This path description is of type Connection, which will be analyzed in more detail in later sections. Finally, the definition is ended by the beginning of a new line (line 6). The final result of the path clause is a vector of pairs, each consisting of a String and a Connection.

The paths defined in this clause can then be reused in the Where Clause. However, our program is not yet capable of evaluating path queries. Nevertheless, the parser in charge of recognizing these path queries is completed and working as intended.


     1 named!(path_clause<Vec<(String, Connection)> >,
     2     many0!(
     3         chain!(
     4             path: path ~
     5             opt!(space) ~
     6             tag!("\n"),
     7             || path
     8         )
     9     )
    10 );
    11
    12 named!(path<(String, Connection)>,
    13     chain!(
    14         tag_no_case_s!("path") ~
    15         space ~
    16         name: char_only ~
    17         opt!(space) ~
    18         tag!(":=") ~
    19         opt!(space) ~
    20         pattern: path_pattern,
    21         || (String::from(name), pattern)
    22     )
    23 );

Listing 3.2: Path Clause
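
The shape of a single path definition can also be sketched without Nom. The function below is a hypothetical plain-Rust helper that splits one definition into its name and its still unparsed pattern; unlike the listing, it does not build a Connection, and allowing an underscore in the name is an assumption made so that the connects_to example from section 2.1.1 is accepted.

```rust
/// Splits one path definition of the form `PATH name := pattern` into its
/// name and its (unparsed) pattern. Returns `None` if the line is malformed.
fn parse_path_definition(line: &str) -> Option<(String, String)> {
    let line = line.trim();
    // The keyword is matched case-insensitively, as with tag_no_case_s!.
    let keyword = line.get(..4)?;
    if !keyword.eq_ignore_ascii_case("path") {
        return None;
    }
    let rest = &line[4..];
    // At least one white space character must follow the keyword.
    if !rest.starts_with(char::is_whitespace) {
        return None;
    }
    let (name, pattern) = rest.split_once(":=")?;
    let name = name.trim();
    // Assumption: names consist of letters and underscores only.
    if name.is_empty() || !name.chars().all(|c| c.is_ascii_alphabetic() || c == '_') {
        return None;
    }
    Some((name.to_string(), pattern.trim().to_string()))
}
```

The returned pair corresponds to the (String, Connection) pair of the listing, with the pattern left as text.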

3.2 Select Clause

The Select Clause of PGQL is very similar to the SQL one. The clause starts with the keyword ‘Select’ (listing 3.3, line 3), followed by a list of attributes (line 5). To retrieve everything, the user specifies the asterisk sign (line 18). Aggregate functions like ‘Sum’, ‘Avg’ etc. may also be used (line 12). Attributes may be renamed with the keyword ‘as’, but this function is currently not supported. The final result is a vector of SelectElem, which is an enumeration. A SelectElem is either Star, an Attribute or an Aggregation.

1 named!(select_clause<Vec<SelectElem> >,

2 chain!(

3 tag_no_case_s!("select") ~

4 space ~

5 select_elems: many1!(select_elem),

6 || select_elems

7 )

8 );

9

10 named!(select_elem<SelectElem>,

11 alt_complete!(

12 aggregate => {|aggregation| SelectElem::Aggregation(aggregation)} |

13 variable_name => {

14 |v| {let (name, field) = v;

15 let a = Attribute{name: name, field: field};

16 SelectElem::Attribute(a)}

17 } |

18 tag_s!("*") => {|_| SelectElem::Star}

19 )

20 );

Listing 3.3: Select Clause
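The three kinds of SelectElem can be sketched as a plain Rust enumeration. The exact definitions are not shown in the text, so the field layout of `Attribute`, the variants of `Aggregation` and the helper `parse_select_elem` are assumptions made for illustration. The sketch uses only the standard library and tries aggregates before plain attributes, mirroring the ordering in listing 3.3.

```rust
#[derive(Debug, PartialEq)]
struct Attribute {
    name: String,  // variable name, e.g. "v"
    field: String, // attribute name, e.g. "age"
}

// Assumed variants; the text only names Sum, Avg and Count explicitly.
#[derive(Debug, PartialEq)]
enum Aggregation {
    Count(Attribute),
    Sum(Attribute),
    Avg(Attribute),
}

#[derive(Debug, PartialEq)]
enum SelectElem {
    Star,
    Attribute(Attribute),
    Aggregation(Aggregation),
}

/// Parses `variable.field` into an Attribute.
fn parse_attribute(text: &str) -> Option<Attribute> {
    let (name, field) = text.split_once('.')?;
    if name.is_empty() || field.is_empty() {
        return None;
    }
    Some(Attribute { name: name.to_string(), field: field.to_string() })
}

/// Classifies a single select element (hypothetical, std-only helper).
fn parse_select_elem(token: &str) -> Option<SelectElem> {
    let token = token.trim();
    if token == "*" {
        return Some(SelectElem::Star);
    }
    // Aggregates such as `sum(v.ram)` are tried before plain attributes.
    if let Some(inner) = token.strip_suffix(')') {
        let (func, arg) = inner.split_once('(')?;
        let attr = parse_attribute(arg.trim())?;
        return match func.trim().to_ascii_lowercase().as_str() {
            "count" => Some(SelectElem::Aggregation(Aggregation::Count(attr))),
            "sum" => Some(SelectElem::Aggregation(Aggregation::Sum(attr))),
            "avg" => Some(SelectElem::Aggregation(Aggregation::Avg(attr))),
            _ => None,
        };
    }
    parse_attribute(token).map(SelectElem::Attribute)
}
```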

3.3 Where Clause

The Where Clause consists of a list of constraints. Each constraint defines either a path or a value requirement that has to be fulfilled by the respective vertex or edge.

3.3.1 Path Constraints

Path Constraints are patterns which are matched against the graph. If a set of vertices does not fit the pattern, it is discarded.

A typical Path Constraint requires a vertex to have certain edges to other vertices. For example, the constraint (v) -> (u) requires that the vertex v has a direct edge to the vertex u.

Vertices are essential for the Path Constraints. A vertex (listing 3.4) is delimited by parentheses (lines 3 and 27) and its properties are specified inside the parentheses (or at a later point as a separate constraint). There are three possible properties, all of them optional:

• The name of the vertex. If no name is provided, the vertex is anonymous (line 5).

• A set of label constraints, recognized by the : sign (line 7). Label constraints differ from value constraints since the labels are defined in a very particular way, unlike the attributes. See section 4.1 for more details.

• A set of inlined value constraints, recognized by the keyword WITH (line 9). This is just syntactic sugar; for example, the two formulations (v WITH age > 10) and (v), v.age > 10 are equivalent.

After parsing the vertex, we have to distinguish our next steps based on how many optional parameters were provided. This is done by matching the return type of opt!, which is Option. An Option is either nothing, called None in Rust, or something, called Some(value). In case no parameters were provided, we return an anonymous vertex with no constraints. Otherwise we set the attributes of the vertex based on what was provided by the parsers.

1 named!(query_vertex<Vertex>,

2 delimited!(

3 char!('('),

4 do_parse!(

5 vertex_name: opt!(char_only) >>

6 opt!(space) >>

7 label: opt!(label_constraint) >>

8 opt!(space) >>

9 inlined: opt!(inlined_constraints) >>

10 ({

11 let mut constraints: Vec<Expr> = match inlined {

12 Some(value) => value,

13 None => Vec::new()

14 };

15 match label {

16 Some(value) => {constraints.push(value);},

17 None => {}

18 }

19 match vertex_name {

20 Some(name) => {Vertex{name: String::from(name),

21 anonymous: false, constraints: constraints }},

22 None => Vertex{name: String::from(""),

23 anonymous: true, constraints: constraints }

24 }

25 })

26 ),

27 char!(')')

28 )

29 );

Listing 3.4: Vertex Parser
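The handling of the three optional parts can also be written with the combinators that Option provides. The following sketch is not the thesis code: `build_vertex` is a hypothetical helper, and constraints are represented as plain strings in place of the Expr type.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Vertex {
    name: String,
    anonymous: bool,
    // Stand-in for Vec<Expr>; plain strings keep the sketch self-contained.
    constraints: Vec<String>,
}

/// Builds a Vertex from the three optional parser results.
fn build_vertex(
    name: Option<&str>,
    label: Option<String>,
    inlined: Option<Vec<String>>,
) -> Vertex {
    // No inlined constraints means an empty vector.
    let mut constraints = inlined.unwrap_or_default();
    // The label constraint, if present, is appended to the same vector.
    if let Some(label) = label {
        constraints.push(label);
    }
    Vertex {
        // A vertex without a name is anonymous and gets an empty name.
        anonymous: name.is_none(),
        name: name.unwrap_or("").to_string(),
        constraints,
    }
}
```

Calling `build_vertex(None, None, None)` returns an anonymous vertex without constraints, which corresponds to the case in which no parameters were provided.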

The connection between two vertices can be a very simple or a very sophisticated edge with many attributes. An edge (listing 3.5) is defined by an arrow sign -> and its properties are specified inside the brackets. If there are no properties to be defined, the brackets are omitted (line 3 & 5). There are three parameters, all of them optional:

• The name of the edge. If no name is provided, the edge is anonymous (line 8).

• One label constraint, recognized by the : sign (line 10).

• A set of inlined value constraints (line 12).

1 named!(query_edge<Edge>,

2 alt_complete!(

3 tag!("-->") => { |_| Edge {name: String::from(""),

4 inverted: false, constraints: Vec::new() } } |

5 delimited!(

6 tag!("-["),

7 do_parse!(

8 edge_name: opt!(char_only) >>

9 opt!(space) >>

10 label: opt!(label_constraint) >>

11 opt!(space) >>

12 inlined: opt!(inlined_constraints) >>

13 ({

14 let mut constraints: Vec<Expr> = match inlined {

15 Some(value) => value,

16 None => Vec::new()

17 };

18 match label {

19 Some(value) => {constraints.push(value);},

20 None => {}

21 }

22 match edge_name {

23 Some(name) => {Edge{name: String::from(name),

24 inverted: false, constraints: constraints }},

25 None => Edge{name: String::from(""),

26 inverted: false, constraints: constraints }

27 }

28 })

29 ),

30 tag!("]->")

31 )

32 )

33 );

Listing 3.5: Edge Parser

PGQL allows chaining an unlimited number of (vertex, edge, vertex) tuples in a single constraint, i.e. after parsing an initial vertex, arbitrarily many vertices, each preceded by an edge, may follow (listing 3.6). To process this chain of tuples, a fold function (line 13) has to be used. While iterating through this vector, we clone the target vertex so that it can serve as the source of the next connection, then create a new connection and push it into another vector (line 21). This vector is then returned at the end of the function (line 24).

1 named!(path_pattern<Vec<Connection> >,

2 chain!(

3 initial: query_vertex ~

4 remainder: many1!(

5 chain!(

6 space ~

7 edge: query_edge ~

8 space ~

9 target: query_vertex,

10 || (edge, target)

11 )

12 ), ||

13 fold_connection(initial, remainder)

14 )

15 );

16

17 fn fold_connection(initial: Vertex, remainder: Vec<(Edge, Vertex)>) -> Vec<Connection> {

18 let mut result = Vec::new();

19 let mut previous = initial;

20 for (edge, vertex) in remainder{

21 result.push(Connection{source: previous, target: vertex.clone(), edge: edge});

22 previous = vertex;

23 }

24 return result;

25 }

Listing 3.6: Folding Connections
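To make the folding step concrete, the following self-contained sketch repeats fold_connection with reduced stand-in types (only a name per vertex and edge, which is an assumption made for brevity). For the pattern (a) -[x]-> (b) -[y]-> (c) it produces two connections, and the vertex b appears both as target of the first and as source of the second.

```rust
// Reduced stand-ins for the thesis types; only the name is kept.
#[derive(Debug, Clone, PartialEq)]
struct Vertex {
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Edge {
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Connection {
    source: Vertex,
    target: Vertex,
    edge: Edge,
}

/// Turns an initial vertex and a list of (edge, vertex) pairs into a list
/// of connections in which consecutive connections share a vertex.
fn fold_connection(initial: Vertex, remainder: Vec<(Edge, Vertex)>) -> Vec<Connection> {
    let mut result = Vec::with_capacity(remainder.len());
    let mut previous = initial;
    for (edge, vertex) in remainder {
        // The target is cloned because it becomes the next source.
        result.push(Connection { source: previous, target: vertex.clone(), edge });
        previous = vertex;
    }
    result
}

// Hypothetical convenience constructors for the example.
fn vertex(name: &str) -> Vertex {
    Vertex { name: name.to_string() }
}

fn edge(name: &str) -> Edge {
    Edge { name: name.to_string() }
}
```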

3.3.2 Value Constraints

Value Constraints are constraints on attributes of the vertex, e.g. name = ‘Alice’ or age < 40. Every Value Constraint has to include one or more vertex attributes and one or more Literals. Literals are raw values and come in three types:

• Strings e.g. ‘Alice’ (listing 3.7, line 2)

• Floats e.g. 40 (line 3)

• Booleans, which are either true or false (line 4)

1 pub enum Literal {

2 Str(String),

3 Float(f32),

4 Boolean(bool),

5 }

6

7 named!(literal<Literal>,

8 alt_complete!(

9 float => { |f| Literal::Float(f) } |

10 boolean => { |b| Literal::Boolean(b) } |

11 string => { |s| Literal::Str(String::from(s)) }

12

13 )

14 );

Listing 3.7: Literals

Possible operators are:

• arithmetic: addition, subtraction, multiplication, division, modulo

• comparative: equals, not equals, greater, greater equals, smaller, smaller equals

• boolean: and, or, not

• string: like. The like operator is denoted as ∼= and tests if a string is contained within another string. For example, ‘lice’ ∼= ‘Alice’ returns true, whereas ‘Alice’ ∼= ‘lice’ does not.
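The containment semantics of the like operator can be sketched in one line with the standard library. The function name `like` and the argument order (left operand first) are assumptions chosen to match the example above.

```rust
/// Returns true if the left operand occurs somewhere inside the right one,
/// i.e. `left ~= right` in the notation used above.
fn like(left: &str, right: &str) -> bool {
    right.contains(left)
}
```

With this definition, `like("lice", "Alice")` holds, while `like("Alice", "lice")` does not.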

Expressions play a vital part in the parsing of a query. Since expressions can be nested arbitrarily many times, it is necessary to fold them after parsing. During this folding process, the vector of expressions is converted into a tree of expressions and only the root is returned. For simplicity, only the addition and subtraction expressions are shown in listing 3.8.

1 named!(expression< Expr >, chain!(

2 initial: operand ~

3 remainder: many0!(

4 alt!(

5 chain!(opt!(space) ~ tag!("+") ~ opt!(space) ~ op: operand,

6 || (Oper::Add, op)) |

7 chain!(opt!(space) ~ tag!("-") ~ opt!(space) ~ op: operand,

8 || (Oper::Sub, op))

9 )

10 ), ||

11 fold_exprs(initial, remainder)

12 )

13 );

14

15 fn fold_exprs(initial: Expr, remainder: Vec<(Oper, Expr)>) -> Expr {

16 remainder.into_iter().fold(initial, |acc, pair| {

17 let (oper, expr) = pair;

18 match oper {

19 Oper::Add => Expr::Add(Box::new(acc), Box::new(expr)),

20 Oper::Sub => Expr::Sub(Box::new(acc), Box::new(expr)),

21 }

22 })

23 }

Listing 3.8: Folding Expressions
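The effect of the fold is easiest to see on a small example. The sketch below uses a reduced Expr type with a numeric leaf (`Expr::Num` and the helper `eval` are assumptions added for illustration). Folding the operand 10 with the remainder [(Sub, 4), (Add, 1)] yields a left-associative tree, so the expression is read as (10 - 4) + 1.

```rust
// Reduced expression type; Num is a stand-in for literals and attributes.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(f32),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Oper {
    Add,
    Sub,
}

/// Folds the flat list of (operator, operand) pairs into a tree.
/// The accumulator always becomes the left child, so the tree grows to the left.
fn fold_exprs(initial: Expr, remainder: Vec<(Oper, Expr)>) -> Expr {
    remainder.into_iter().fold(initial, |acc, (oper, expr)| match oper {
        Oper::Add => Expr::Add(Box::new(acc), Box::new(expr)),
        Oper::Sub => Expr::Sub(Box::new(acc), Box::new(expr)),
    })
}

/// Evaluates the tree recursively (illustration only).
fn eval(expr: &Expr) -> f32 {
    match expr {
        Expr::Num(n) => *n,
        Expr::Add(l, r) => eval(l) + eval(r),
        Expr::Sub(l, r) => eval(l) - eval(r),
    }
}
```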

3.4 Solution Modifier Clause

The Solution Modifier Clause consists of three parts:

• Group By Clause (listing 3.9, line 3)

• Order By Clause (line 5)

• Offset and Limit Clause (line 7)

All clauses are optional and can be omitted.

1 named!(solution_modifier<(GroupBy, OrderBy, (i32, i32))>,

2 chain!(

3 group: group_by ~

4 space ~

5 order: order_by ~

6 space ~

7 limit: limit_offset,

8 || (group, order, limit)

9 )

10 );

Listing 3.9: Solution Modifier

The Group By clause parses a number of attributes (listing 3.10, line 7), which are returned as a vector.

1 named!(group_by<GroupBy>,

2 chain!(

3 tag_no_case_s!("group by") ~

4 space ~

5 variable_names: many1!(

6 chain!(

7 variable_name: attribute ~

8 opt!(space) ~

9 opt!(char!(',')),

10 || variable_name

11 )

12 ),

13 || GroupBy{ group: variable_names}

14 )

15 );

Listing 3.10: Group By

The Order By clause parses a number of variable fields, which are returned as a vector. Every attribute can be sorted in either ascending or descending order (listing 3.11, line 6 & 7).

1 named!(order_term<OrderTerm>,

2 chain!(

3 variable_name: attribute ~

4 space ~

5 asc: alt!(

6 tag_no_case_s!("asc") => { |_| true } |

7 tag_no_case_s!("desc") => { |_| false } |

8 tag!("") => { |_| true }

9 ),

10 || {OrderTerm {variable_name: variable_name, asc: asc}}

11 )

12 );

Listing 3.11: Order By
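The defaulting behaviour of the order term (ascending unless ‘desc’ is given) can be sketched without Nom. The helper `parse_order_term` is hypothetical and keeps the attribute as a plain string; unlike listing 3.11, it also accepts a term without trailing whitespace.

```rust
#[derive(Debug, PartialEq)]
struct OrderTerm {
    variable_name: String, // stand-in for the parsed attribute
    asc: bool,
}

/// Parses `<attribute> [asc|desc]`; a missing direction means ascending.
fn parse_order_term(text: &str) -> Option<OrderTerm> {
    let mut parts = text.split_whitespace();
    let variable_name = parts.next()?.to_string();
    let asc = match parts.next() {
        None => true,
        Some(dir) if dir.eq_ignore_ascii_case("asc") => true,
        Some(dir) if dir.eq_ignore_ascii_case("desc") => false,
        // Anything else is not a valid direction.
        Some(_) => return None,
    };
    // Reject trailing tokens after the direction.
    if parts.next().is_some() {
        return None;
    }
    Some(OrderTerm { variable_name, asc })
}
```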

The Limit and Offset clause parses two integers (listing 3.12, line 5 & 9), which are returned as a pair (line 10).

1 named!(limit_offset<(i32, i32)>,

2 chain!(

3 tag_no_case_s!("limit") ~

4 space ~

5 limit: unsigned_int ~

6 space ~

7 tag_no_case_s!("offset") ~

8 space ~

9 offset: unsigned_int,

10 || (limit, offset)

11 )

12 );

Listing 3.12: Limit and Offset

4 Query Engine

In this chapter we discuss how we evaluate the queries we get from the parser on our graph.

4.1 Graph Model

In Differential Dataflow, a graph is represented as a collection of vertices and a collection of edges. This separation allows us to easily apply selections on either collection. Vertex and edge are both structs that possess attributes, IDs and labels. The attributes are originally stored in a HashMap and are converted into a vector once they are loaded into Differential Dataflow. This conversion is currently required by Differential Dataflow. At a future point in time, this might no longer be necessary.

The loading of the graph accounts for a major part of the entire program execution time. To load a graph from a text file into Differential Dataflow, an entirely new parser had to be written. This parser was also built using Nom. In order for the loader to properly recognize a graph, the graph has to be supplied in the Vertex-Edge-List format [5]. This format gives us a text file (.txt), which contains two lists. First, all the vertices and their attributes are described. The vertices are ordered by their ID in ascending order. The IDs have to start at 1 and must be contiguous.

Figure 4.1 shows a small sample graph. The corresponding text file is depicted below.

Small sample graph

1 * { 'Server' } name:'server' ip:'192.168.0.0' ram:4

2 * { 'VM' } name:'vm' ip:'192.168.0.1' ram:8

3 * { 'Firewall' } name:'firewall' ip:'192.168.0.2' ram:16.5

4 * { 'Host' } name:'host' ip:'192.168.0.3' ram:32.5

5 * { 'Switch' } name:'center' ip:'192.168.0.4' ram:64.5

1 2 'connects' bandwidth:1.5 utilization:0.14

2 4 'connects' bandwidth:2.1 utilization:0.22

5 1 'connects' bandwidth:1.1 utilization:0.38

5 2 'connects' bandwidth:1.3 utilization:0.58

5 3 'connects' bandwidth:2.3 utilization:0.79

5 4 'connects' bandwidth:3.0 utilization:0.03

Figure 4.1: Sample Graph

A complete graph consists of at least one node (listing 4.1, line 3) and at least one edge (line 4). Therefore, the graph parser looks as follows:

1 named!(graph_parser<Graph>,

2 chain!(

3 nodes: many1!(node) ~

4 edges: many1!(edge),

5 || Graph {nodes: nodes, edges: edges}

6 )

7 );

Listing 4.1: Graph Parser

Each vertex begins with an integer (listing 4.2, line 3), representing the ID of the vertex, followed by an asterisk (line 5). Subsequently the labels of the vertex are presented (line 7), delimited by curly brackets. At least one label has to be provided, whereas an upper limit does not exist. Finally, all the attributes of the vertex are presented in a JSON-like fashion (line 9).

1 named!(node<Node>,

2 chain!(

3 id: unsigned_int ~

4 space ~

5 char!('*') ~

6 space ~

7 labels: labels ~

8 space ~

9 attr: attributes ~

10 line_end,

11 || {

12 let mut map = HashMap::with_capacity(attr.len());

13 for elem in attr{

14 let (name, value) = elem;

15 map.insert(String::from(name), value);

16 }

17 Node {id: id, label: labels, attribute_values: map }

18 }

19 )

20

21 );

Listing 4.2: Graph Node

The labels are just a vector of strings (listing 4.3, line 7), delimited by curly brackets (line 3 & 11).

1 named!(labels<Vec <String> >,

2 chain!(

3 char!('{') ~

4 opt!(space) ~

5 labels: many1!(

6 chain!(

7 s: string ~

8 opt!(space),

9 || String::from(s)

10 )) ~

11 char!('}'),

12 || labels

13 )

14 );

Listing 4.3: Labels

Each attribute begins with a name (listing 4.4, line 4), followed by a : (line 6) and terminated by the actual value (line 8).

1 named!(attributes< Vec< (&str, Literal) > >,

2 many0!(

3 chain!(

4 name: char_only ~

5 opt!(space) ~

6 char!(':') ~

7 opt!(space) ~

8 value: literal ~

9 opt!(space),

10 || (name, value)

11 )

12 )

13 );

Listing 4.4: Attributes
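As an illustration of the attribute format, the following std-only sketch parses an attribute list such as the one in the sample graph file into a HashMap. It is not the Nom parser from the listings: the helper names are hypothetical, values are assumed to contain no whitespace, and strings are assumed to be enclosed in straight single quotes. The literal alternatives are tried in the same order as in listing 3.7 (float, boolean, string).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Str(String),
    Float(f32),
    Boolean(bool),
}

/// Parses a single raw value into a Literal.
fn parse_literal(text: &str) -> Option<Literal> {
    // Numbers are tried first.
    if let Ok(number) = text.parse::<f32>() {
        return Some(Literal::Float(number));
    }
    // Then the two boolean keywords.
    match text {
        "true" => return Some(Literal::Boolean(true)),
        "false" => return Some(Literal::Boolean(false)),
        _ => {}
    }
    // Finally a string delimited by single quotes.
    let inner = text.strip_prefix('\'')?.strip_suffix('\'')?;
    Some(Literal::Str(inner.to_string()))
}

/// Parses `name:value name:value ...` into a map from name to Literal.
/// Assumption: neither names nor values contain whitespace.
fn parse_attributes(text: &str) -> Option<HashMap<String, Literal>> {
    let mut map = HashMap::new();
    for token in text.split_whitespace() {
        let (name, value) = token.split_once(':')?;
        map.insert(name.to_string(), parse_literal(value)?);
    }
    Some(map)
}
```

Applied to the attributes of the first vertex of the sample graph, the map contains the string attributes name and ip and the numeric attribute ram.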

The edges of the graph are parsed in a very similar way. There are two main differences: instead of an asterisk, a second integer is given denoting the ID of the target vertex (listing 4.5, line 5), while the first integer determines the ID of the source vertex. Additionally, edges must have exactly one label. The rest remains unchanged in comparison to the vertex parser.

1 named!(edge<GraphEdge>,

2 chain!(

3 source: unsigned_int ~

4 space ~

5 target: unsigned_int ~

6 space ~

7 label: string ~

8 space ~

9 attr: attributes ~

10 line_end,

11 || {

12 let mut map = HashMap::with_capacity(attr.len());

13 for elem in attr{

14 let (name, value) = elem;

15 map.insert(String::from(name), value);

16 }

17 GraphEdge {source: source, target: target,

18 label:String::from(label), attribute_values: map }

19 }

20 )

21 );

Listing 4.5: Graph Edge

4.2 Construction of the Dataflow

The biggest challenge in the thesis was translating the PGQL query into a dataflow. Figure 4.2 shows the different phases our program runs through. In the following sections we elaborate on each step. Currently no optimization is performed; this is left as future work.

Figure 4.2: An overview of the query evaluation process

As described in chapter 3, a query contains a vector of projections and a vector of constraints. We chose to apply the projections at the latest possible point in time. Thus, our evaluation begins with the second vector.

4.2.1 Constraint sorting

First, we iterate through the vector and sort the constraints by type:

• If the constraint is a path pattern, it is put into a separate vector (listing 4.6, line 5).

• If the constraint is an expression, we determine the name of the involved vertex or edge by recursively exploring the expression (line 7). Once the variable name is known, the expression is pushed into a HashMap of vectors (line 10 & 14). For each variable a separate vector with expressions exists. This grouping helps tremendously when filtering the vertex set later on.

1 let mut connections = Vec::new();

2 let mut selections : HashMap<String, Vec<Expr> > = HashMap::new();

3 for constraint in &vvhere{

4 match constraint {

5 &Constraint::PathPattern(ref pattern) => connections.push(pattern),

6 &Constraint::Expr(ref expr) => {

7 let name = explore_expr((*expr).clone());

8 let mut new = false;

9 match selections.get_mut(&name) {

10 Some(vec) => vec.push((*expr).clone()),

11 None => new = true,

12 }

13 if new {

14 selections.insert(name, vec![(*expr).clone()]);

15 }

16 },

17 }

18 }

Listing 4.6: Constraint Sorting
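The get_mut-then-insert pattern in lines 8 to 15 can be expressed more compactly with the entry API of the standard HashMap. The sketch below is an alternative formulation, not the thesis code: expressions are plain strings and the variable name is assumed to be already known, where the listing derives it with explore_expr.

```rust
use std::collections::HashMap;

/// Groups expressions by the variable they refer to.
/// Each input pair is (variable name, expression text).
fn group_by_variable(constraints: Vec<(String, String)>) -> HashMap<String, Vec<String>> {
    let mut selections: HashMap<String, Vec<String>> = HashMap::new();
    for (variable, expr) in constraints {
        // entry() finds the vector for this variable or creates an empty one.
        selections.entry(variable).or_default().push(expr);
    }
    selections
}
```

The expressions of one variable keep the order in which they appeared in the query.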

4.2.2 Construction of the execution plan

Once we have iterated over the entire vector, we begin the construction of an execution plan. We iterate through all the path patterns (listing 4.7, line 1), which we have previously stored in a separate vector. We then decide how to construct the plan. There are three possibilities:

• Neither the source nor the target of the edge has appeared before (line 2). In this case, we pull the source and the target set and join them with the edge set.

• The source of the edge has already appeared before (line 15). In this case, we pull the target set, map the result we carried onto the correct ID and then join them with the edge set.

• Both the source and the target of the edge have appeared before (line 28). In this case, we are dealing with a loop in the query. Thus, we do not need to join any sets, but we can just enforce an additional constraint on the IDs.

1 for connection in connections {

2 if !used_fields.contains(&connection.source.name) {

3 execution_plan.push(

4 PhyPlan{

5 name: vec![connection.source.name, connection.target.name],

6 left: connection.source.name,

7 right: connection.target.name,

8 join_id: 0,

9 join: true,

10 filter_id: -1, // some invalid number

11 constraints: connection.edge.constraints,

12 }

13 );

14 }

15 else if !used_fields.contains(&connection.target.name) {

16 execution_plan.push(

17 PhyPlan{

18 name: vec![connection.source.name, connection.target.name],

19 left: connection.source.name,

20 right: connection.target.name,

21 join_id: *names.get(&connection.source.name).unwrap(),

22 join: true,

23 filter_id: -1, // some invalid number

24 constraints: connection.edge.constraints,

25 }

26 );

27 }

28 else {

29 execution_plan.push(

30 PhyPlan{

31 name: vec![connection.source.name, connection.target.name],

32 left: connection.source.name,

33 right: connection.target.name,

34 join_id: *names.get(&connection.source.name).unwrap(),

35 join: false,

36 filter_id: *names.get(&connection.target.name).unwrap(),

37 constraints: connection.edge.constraints,

38 }

39 );

40 }

41

42 }

Listing 4.7: Plan Construction
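The three cases can be sketched as a small classification function. This is a simplified illustration under stated assumptions: connections are reduced to (source name, target name) pairs, the type `Step` and the function `classify` are hypothetical, and each variable is assumed to be assigned the position at which it first enters the result vector. The branch order follows listing 4.7, so a connection whose source is new is always treated as the first case.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Step {
    /// Source not seen before: join source set, edge set and target set.
    JoinBoth,
    /// Source seen, target new: extend the carried result via `join_id`.
    ExtendFrom { join_id: usize },
    /// Both seen: a loop in the query, only an ID check is needed.
    CloseCycle { join_id: usize, filter_id: usize },
}

/// Decides for every connection which of the three cases applies.
fn classify(connections: &[(&str, &str)]) -> Vec<Step> {
    // Maps a variable name to its position in the result vector.
    let mut positions: HashMap<String, usize> = HashMap::new();
    let mut steps = Vec::new();
    for &(source, target) in connections {
        let seen_source = positions.get(source).copied();
        let seen_target = positions.get(target).copied();
        let step = match (seen_source, seen_target) {
            (None, _) => {
                let next = positions.len();
                positions.insert(source.to_string(), next);
                let next = positions.len();
                positions.entry(target.to_string()).or_insert(next);
                Step::JoinBoth
            }
            (Some(join_id), None) => {
                let next = positions.len();
                positions.insert(target.to_string(), next);
                Step::ExtendFrom { join_id }
            }
            (Some(join_id), Some(filter_id)) => Step::CloseCycle { join_id, filter_id },
        };
        steps.push(step);
    }
    steps
}
```

For the triangle pattern (a) -> (b), (b) -> (c), (c) -> (a), the three connections fall into the three cases in order.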

4.2.3 Selection on the vertex set

The next step is to apply all the selections we have sorted beforehand on the vertex set. For each non-anonymous vertex, we filter the entire set (listing 4.8, line 3) and save the result in a HashMap (line 6). The key in this HashMap is the name of the vertex.

1 let mut results = HashMap::new();

2 for (name, filter) in selections {

3 let result = vertices.filter(move |x| {

4 check_node(&x, &filter)

5 });

6 results.insert(name, result);

7 }

Listing 4.8: Vertex Selection

4.2.4 Execution of the plan

After constructing the plan, we iterate through all the steps we assembled in section 4.2.2 (listing 4.9, line 2). In the beginning, our result, which is a collection of vectors of vertices, is empty (line 1). We begin by pulling the source’s vertex collection from the HashMap we defined in section 4.2.3 (line 4 & 8). The HashMap’s key is the name of the vertex. Once we have the correct vertex collection, we join it with the edge set (line 13) and map the result (line 15) in order to keep track of the vertex ID of the target. We then repeat this step for the target vertex. We end up with a vector of vertices, which we store as the result.

If there are more elements in the execution plan, we repeat the previous steps until we are done. The only difference is that we re-use our previous result instead of pulling a new collection from our vertex HashMap.

1 let mut result = None;

2 for step in execution_plan {

3 if step.join && step.join_id == 0 {

4 let sources = match plans.get(&step.left){

5 None => vertices,

6 Some(list) => list,

7 };

8 let targets = match plans.get(&step.right){

9 Some(list) => list,

10 None => vertices,

11 };

12 result = Some(sources.map(|x| (x.id, x))

13 .join(&edges.filter(move |x| check_edge(&x, &step.constraints))

14 .map(|x| (x.source,x.target)))

15 .map(|(_,v1,v2)| (v2,v1))

16 .join(&targets.map(|x| (x.id, x)))

17 .map(|(_,v1,v2)| vec![v1,v2]));

18 }

19 else if !step.join {

20 let int = step.join_id;

21 let int2 = step.filter_id;

22 result = Some(result.unwrap().map(move |x| (x[int].id, x))

23 .join(&edges.filter(move |x| check_edge(&x, &step.constraints))

24 .map(|x| (x.source,x.target)))

25 .filter(move |x| {let &( _, ref vec, ref id) = x; vec[int2].id == *id})

26 .map(|(_,v1,_)| v1));

27 }

28 else {

29 let targets = match plans.get(&step.right){

30 Some(list) => list,

31 None => vertices,

32 };

33 result = Some(result.unwrap().map(move |vec| (vec[step.join_id].id, vec))

34 .join(&edges.filter(move |x| check_edge(&x, &step.constraints))

35 .map(|x| (x.source,x.target)))

36 .map(|(_,v1,v2)| (v2,v1))

37 .join(&targets.map(|x| (x.id, x)))

38 .map(|(_, mut v1,v2)| {v1.push(v2);v1}));

39 }

40 }

Listing 4.9: Plan Execution
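The first branch of the listing (lines 3 to 18) performs two joins: sources with edges on the source ID, and the intermediate result with targets on the target ID. The following sketch reproduces that logic on plain vectors with a hash lookup. It only illustrates the join semantics and does not use Differential Dataflow; `Node`, `index_by_id` and `join_step` are reduced, hypothetical stand-ins.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Node {
    id: u32,
    name: String,
}

/// Indexes a vertex set by ID, the plain-vector analogue of map(|x| (x.id, x)).
fn index_by_id(nodes: &[Node]) -> HashMap<u32, Node> {
    nodes.iter().map(|node| (node.id, node.clone())).collect()
}

/// Joins the (already filtered) source set with the edges and the target set.
/// Every surviving edge contributes one vector [source, target].
fn join_step(sources: &[Node], edges: &[(u32, u32)], targets: &[Node]) -> Vec<Vec<Node>> {
    let sources = index_by_id(sources);
    let targets = index_by_id(targets);
    edges
        .iter()
        .filter_map(|&(source_id, target_id)| {
            // An edge survives only if both endpoints are in their sets.
            let source = sources.get(&source_id)?;
            let target = targets.get(&target_id)?;
            Some(vec![source.clone(), target.clone()])
        })
        .collect()
}
```

If the source set has been restricted to the vertex vm of the sample graph, only the edge from vertex 2 to vertex 4 survives the join.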

4.2.5 Applying the projection and aggregation

The final step is to apply the projections and the aggregation that were parsed and passed along. In order to get the desired result, we iterate over this vector (listing 4.10, line 1) and take the appropriate action depending on what attribute is asked for by the user. In case of an aggregation (line 11), we first group the records (line 14), and then count, sum etc. the values (line 15 – 19).

1 for projection in &select{

2 match projection {

3 &SelectElem::Star => {

4 list.inspect(|&(ref x,_)| println!("{:?}", x));

5 },

6 &SelectElem::Attribute(ref attr) => {

7 let field = string_to_static_str(attr.field.clone());

8 let id = *(names.get(&attr.name).unwrap());

9 list.inspect(move |&(ref x,_)| println!("{:?}", x[id].get(&field)));

10 },

11 &SelectElem::Aggregation(ref aggr) => {

12 match aggr {

13 &Aggregation::Count(ref attr) => {

14 list.map(move |x| (1, x)).group(move |key, vals, output| {

15 let mut count = 0;

16 for _ in vals {

17 count = count + 1;

18 }

19 output.push(((),count));

20 }).inspect(|&(_, ref x)| println!("{:?}", x));

21 }

22 }

23 }

24 }

25 }

Listing 4.10: Projection

4.2.6 Evaluating an expression

It is of great importance to evaluate the expressions formulated in the query in the dataflow. Since expressions can be nested an unlimited number of times, we evaluate them with a recursive function. This function first evaluates the left-hand side, then the right-hand side, and finally applies the desired operator. Listing 4.11 shows a few example operators. Most operators work exactly like the Equal (line 3), NotEqual (line 5) and Add operators (line 9). Exceptions are the Not operator (line 7), which has only one operand, and expressions which can be evaluated immediately. This category includes the Literal (line 11), the Attribute of a vertex (line 12), the built-in functions (line 18) and the label operator (line 8). The latter simply checks whether the requested label is present in the vertex’s label vector.

1 fn evaluate_expr (constraint: Expr, node: &Node) -> Literal {

2 match constraint{

3 Expr::Equal(left, right) => evaluate_expr(*left, node) ==

4 evaluate_expr(*right, node),

5 Expr::NotEqual(left, right) => evaluate_expr(*left, node) !=

6 evaluate_expr(*right, node),

7 Expr::Not(value) => evaluate_expr(*value, node).not(),

8 Expr::Label(label) => node.label.contains(&label),

9 Expr::Add(left, right) => evaluate_expr(*left, node).

10 add(evaluate_expr(*right, node)),

11 Expr::Literal(value) => value,

12 Expr::Attribute(attribute) => {

13 match node.attribute_values.get(&attribute.field) {

14 Some(literal) => (*literal).clone(),

15 None => panic!("Field {:?} does not exist!", &attribute.field)

16 }

17 },

18 Expr::BuiltIn(_, function) => {

19 match function {

20 BuiltIn::Label => Literal::Str(node.label[0].clone()),

21 BuiltIn::Id => Literal::Float(node.id as f32),

22 _ => panic!("Function {:?} not supported for nodes", function)

23 }

24 }

25 }

26 }

Listing 4.11: Expression Evaluation
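The recursion can be illustrated with a reduced, self-contained evaluator. The sketch below differs from listing 4.11 in several assumed details: it only covers five expression kinds, it wraps comparison results in Literal::Boolean so that every branch returns a Literal, it takes the attribute map directly in place of a Node, and it reports a missing field as an error where the listing panics.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Str(String),
    Float(f32),
    Boolean(bool),
}

// Reduced expression type covering one operator of each kind.
enum Expr {
    Literal(Literal),
    Attribute(String),
    Add(Box<Expr>, Box<Expr>),
    Equal(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Recursively evaluates an expression against the attributes of one vertex.
fn evaluate(expr: &Expr, attributes: &HashMap<String, Literal>) -> Result<Literal, String> {
    match expr {
        // Leaves can be evaluated immediately.
        Expr::Literal(value) => Ok(value.clone()),
        Expr::Attribute(field) => attributes
            .get(field)
            .cloned()
            .ok_or_else(|| format!("Field {:?} does not exist", field)),
        // Binary operators evaluate the left side, then the right side.
        Expr::Add(left, right) => {
            match (evaluate(left, attributes)?, evaluate(right, attributes)?) {
                (Literal::Float(a), Literal::Float(b)) => Ok(Literal::Float(a + b)),
                _ => Err("Add expects two numbers".to_string()),
            }
        }
        Expr::Equal(left, right) => Ok(Literal::Boolean(
            evaluate(left, attributes)? == evaluate(right, attributes)?,
        )),
        // Not has a single operand.
        Expr::Not(inner) => match evaluate(inner, attributes)? {
            Literal::Boolean(value) => Ok(Literal::Boolean(!value)),
            _ => Err("Not expects a boolean".to_string()),
        },
    }
}
```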

5 Experimental Evaluation

In this chapter we present the experiments that were run in order to determine how fast our query engine is.

5.1 Execution

Since our program is a module for Differential Dataflow, the Differential Dataflow repository has to be present in order to execute the program.

Execution Example The program can be run using the following command:

$ cargo run --release --example pgql graph.txt query.txt true -- -w 32

The --release argument enables compiler optimizations and should speed up the execution of the program. The --example pgql argument specifies which program to run. The remaining 4 arguments are parsed by our program. First, we give the name of the graph we want to parse. This argument has to be a text file with the graph in edge-list format (see section 4.1). The second argument gives a text file containing all the queries that the user wishes to evaluate. Next, a boolean specifies whether the edges in the graph are directed or not. In case they are not, all the edges in the graph text file will be cloned, inverted and added to the edge set. This means an edge from A to B possesses the exact same properties as an edge from B to A. If this is not desired, then this argument should be set to false. Finally, we provide a number of arguments needed by the timely framework (see footnote 3). First, we define how many worker threads should be used to evaluate the query with the -w argument. This number is expected to be less than or equal to the number of cores of the machine. Alternatively, one can add a -n argument to specify the number of processes and a -p argument to denote the ID of this process. If the program is to be distributed over multiple machines, the -h argument should specify a text file in which all the hostnames and ports are defined.
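The cloning and inverting of edges for undirected graphs can be sketched as follows. `GraphEdge` is reduced to three fields and `add_reverse_edges` is a hypothetical helper; the actual loader is not shown in the text.

```rust
#[derive(Debug, Clone, PartialEq)]
struct GraphEdge {
    source: u32,
    target: u32,
    label: String,
}

/// For every edge A -> B, also emits B -> A with identical properties.
fn add_reverse_edges(edges: &[GraphEdge]) -> Vec<GraphEdge> {
    let mut all = Vec::with_capacity(edges.len() * 2);
    for edge in edges {
        all.push(edge.clone());
        // Swap the endpoints and keep every other field of the original edge.
        all.push(GraphEdge {
            source: edge.target,
            target: edge.source,
            ..edge.clone()
        });
    }
    all
}
```

The resulting edge set is twice as large as the list in the text file.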

Benchmark Example We used a Python script to automate the benchmarking of our program.

$ python run_benchmark.py graph.txt query.txt

This script takes as input the name of the topology text file and the name of the query text file. It will then launch 50 runs for all possible worker configurations. We start with one single worker and then increase the number of workers in powers of two, until we reach the maximum amount, which is the number of physical cores present in the machine. The result of each run is written into a separate text file. After the experiments have concluded, we bring the results together and conduct our analysis.

3 http://www.frankmcsherry.org/timely-dataflow/timely/execute/fn.execute_from_args.html
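The sequence of worker configurations can be sketched as a short function. The benchmark script itself is written in Python and is not shown, so this Rust sketch is only an illustration; it assumes that the sequence stops at the largest power of two that does not exceed the core count, and the name `worker_counts` is hypothetical.

```rust
/// Returns 1, 2, 4, ... up to the largest power of two <= max_cores.
fn worker_counts(max_cores: usize) -> Vec<usize> {
    let mut counts = Vec::new();
    let mut workers: usize = 1;
    while workers <= max_cores {
        counts.push(workers);
        // Stop if doubling would overflow.
        match workers.checked_mul(2) {
            Some(next) => workers = next,
            None => break,
        }
    }
    counts
}
```

Under this assumption, a machine with 48 cores is benchmarked with 1, 2, 4, 8, 16 and 32 workers.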

5.2 Hardware

All the experiments were conducted on an ETH server. This machine possesses 4 x 12 AMD Opteron 6174 processors, each one running at 2.2 GHz. The operating system was Debian 7.0 and the installed memory was 128 GB of RAM.

5.3 Topologies

We select two distinctly different topologies: Fat Tree [8] and Jellyfish [15]. Fat Tree is a modification of the leaf-spine structure, commonly found in data centers, and it is thus a practically deployable solution.

Figure 5.1: Example Fat Tree Topology: 16 hosts and 20 4-port switches. Source: [8]

The other topology, Jellyfish, is inspired by work on random graphs, which shows that random structures can deliver shorter average paths. However, it is more of an ideal-case reference than a realistic topology.

Figure 5.2: Example Jellyfish Topology: 16 hosts and 20 4-port switches. Source: [15]

In the experiment, four differently sized topologies were used:

Large Fat Tree. The large Fat Tree topology is the largest one of the four topologies. It contains 55’000 vertices, of which 2’880 are switches. Furthermore, the graph possesses a total of 50’000 bidirectional edges. Since one bidirectional edge is represented by two directed edges, there are a total of 100’000 edges in the collection.

Small Fat Tree. The small Fat Tree topology has the same vertex-to-edge ratio as the large Fat Tree, but it is overall much smaller. It contains 17’665 vertices, 1’280 of which are switches. Furthermore, the graph possesses a total of 16’400 bidirectional edges.

Large Jellyfish. The large Jellyfish topology matches the large Fat Tree in terms of vertices, but contains far fewer edges. It contains 56’160 vertices, of which 864 are switches. Furthermore, the graph possesses a total of 6’910 bidirectional edges.

Small Jellyfish. The small Jellyfish topology is the smallest one of the four topologies. It contains 16’770 vertices, of which 390 are switches. Furthermore, the graph possesses a total of 2’145 bidirectional edges.

In all topologies, weight is an attribute of the edges and is uniformly distributed from 1 to 9. Consequently, the selection ’weight < 2’ removes around 89% of all edges. A smaller number of edges greatly shortens the duration of any subsequent join. Since joining the entire vertex and edge sets is a very costly operation, queries with multiple but restricted joins can be faster than queries that contain fewer joins but do not include any restrictions on the edge set.
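The effect of the weight selections can be checked with a short calculation. The sketch below assumes that the weights are integers from 1 to 9, each equally likely; the text only states "uniformly distributed from 1 to 9", so the integer assumption is ours, although it is consistent with the 89% figure.

```python
from fractions import Fraction

WEIGHTS = range(1, 10)  # integer weights 1..9, assumed equally likely


def selectivity(predicate):
    """Fraction of edges that satisfy the predicate on the weight."""
    kept = sum(1 for w in WEIGHTS if predicate(w))
    return Fraction(kept, len(WEIGHTS))


kept_lt_2 = selectivity(lambda w: w < 2)  # 1/9 of the edges survive
kept_lt_4 = selectivity(lambda w: w < 4)  # 3/9 of the edges survive
kept_gt_8 = selectivity(lambda w: w > 8)  # 1/9 of the edges survive
```

Under this assumption ’weight < 2’ and ’weight > 8’ each keep one ninth of the edges (roughly 11%), while ’weight < 4’ keeps one third.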

Furthermore, it is important to consider that the attributes of vertices and edges in a Differential Dataflow collection are stored in a vector of (String, Literal) tuples. We chose this suboptimal implementation since HashMaps do not work well as data in Differential Dataflow collections. However, in order to perform a selection on either set, this vector has to be transformed back into a HashMap for each edge or vertex. This process can be quite time-consuming; a query is therefore expected to be slower the more selections it contains.
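The cost described above can be made concrete with a small Python analogue. It is only a sketch: the list of (name, value) tuples stands in for the vector of (String, Literal) tuples, a dict stands in for the HashMap, and the function name `select` is hypothetical.

```python
def select(elements, attribute, predicate):
    """Keep the elements whose attribute satisfies the predicate.

    Each element is (id, attributes), where attributes is a list of
    (name, value) tuples.
    """
    result = []
    for element_id, attributes in elements:
        lookup = dict(attributes)  # rebuilt for every element: the costly step
        if attribute in lookup and predicate(lookup[attribute]):
            result.append((element_id, attributes))
    return result


# Hypothetical vertex collection.
vertices = [
    (1, [("name", "s1"), ("position", "access")]),
    (2, [("name", "s2"), ("position", "core")]),
]
access = select(vertices, "position", lambda p: p == "access")
```

Because the map is rebuilt once per element and per selection, every additional selection adds another full conversion pass over the collection it is applied to.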

It should also be noted that a single edge in the query triggers two joins in the evaluation. The reason is that we have to join the edge collection twice with the vertex collection, once for the source and once for the target, in order to determine all adjacent vertices of a given vertex.
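Why one query edge leads to two joins can be seen in the following sketch. It uses a generic in-memory hash join in Python rather than Differential Dataflow's join operator, and the tiny vertex and edge collections are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Generic equi-join: pair every left row with every matching right row."""
    index = defaultdict(list)
    for row in right:
        index[right_key(row)].append(row)
    return [(l, r) for l in left for r in index[left_key(l)]]


vertices = [(1, "access"), (2, "distribution"), (3, "core")]  # (id, position)
edges = [(1, 2), (2, 3)]                                       # (source, target)

# First join: attach the source vertex to every edge.
with_source = hash_join(edges, vertices, lambda e: e[0], lambda v: v[0])
# Second join: attach the target vertex to every (edge, source) pair.
with_both = hash_join(with_source, vertices, lambda p: p[0][1], lambda v: v[0])

# Each match is a (source vertex, target vertex) pair for the pattern (u) -> (v).
matches = [(src, dst) for ((_, src), dst) in with_both]
```

A query with two edges repeats this pattern and therefore needs four joins, and a query with three edges needs six.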

In total, we ran seven different queries, with each subsequent query being more complex than the preceding one. For each query, the resulting dataflow is visualized on the following pages. The numbers on the edges indicate the size of the (intermediate) result set on the large Fat Tree.

5.4 Overview

The following table shows the seven queries we decided to run.

| | Syntax | Description |
|---|---|---|
| Query 1 | SELECT u.name WHERE u.label() = ’switch’, u.position = ’access’ | Retrieve all the names of switches that are in the access position. |
| Query 2 | SELECT u.name WHERE (u) -[e with weight < 4]-> (v) | Retrieve all the names of vertices that have an edge with weight smaller than 4. |
| Query 3 | SELECT u.name WHERE (u:switch) -> (v with position = ’distribution’) | Retrieve all the names of switches that are connected to a distribution vertex. |
| Query 4 | SELECT u.name WHERE (u with position = ’distribution’) -[e with weight > 8]-> (v with position = ’access’) | Retrieve all the names of distribution vertices that have a connection with weight greater than 8 to an access vertex. |
| Query 5 | SELECT u.name WHERE (u WITH position = ’access’) -> (v WITH position = ’distribution’) -> (w WITH position = ’core’) | Retrieve all the names of access vertices that are connected to a distribution vertex, which in turn is connected to a core vertex. |
| Query 6 | SELECT u.name WHERE (u WITH position = ’access’) -[e with weight < 2]-> (v WITH position = ’distribution’) -[f with weight > 8]-> (w WITH position = ’core’) | Retrieve all the names of access vertices that have a connection with weight smaller than 2 to a distribution vertex, which in turn has a connection with weight greater than 8 to a core vertex. |
| Query 7 | SELECT u.name WHERE (u WITH position = ’access’) -[e with weight < 2]-> (v WITH position = ’distribution’) -[f with weight < 2]-> (w WITH position = ’core’) -[g with weight > 8]-> (x WITH position = ’distribution’) | Retrieve all the names of access vertices that have a connection with weight smaller than 2 to a distribution vertex, which in turn has a connection with weight smaller than 2 to a core vertex, which then has a connection with weight greater than 8 to another distribution vertex. |

Table 1: The seven queries used in the evaluation

Figure 5.3 shows the results of the seven queries run on the different topologies with 32 workers.

Figure 5.3: An overview of the seven queries, run with 32 workers


Throughout all seven queries, the following pattern is noticeable: the number of workers influences the latency in an unexpected way, especially for queries 1, 4 and 6. In a perfect world, the query latency would be inversely proportional to the number of workers. However, we only observe this behavior for query 5, and only for the Fat Tree topologies.

More often than not, we observed that an increase in worker threads leads to a higher latency, especially when the latency is already quite low, in the range of 50 to 200 milliseconds. We reason that this happens because, at that level, the communication overhead outweighs the time savings achieved by parallelism. This claim is additionally supported by the fact that query 5, the one query that does benefit from more workers, has one of the longest latencies, ranging up to 10 seconds. When evaluating queries with lower latency, the contention between the workers increases and we frequently observe that the best performance is achieved when running the query with just a single worker.

Furthermore, graphs with smaller sizes show larger variance than their larger counterparts when run with more than four workers. This behavior is due to the fact that the time spent coordinating the different threads plays a bigger role in small graphs. The time savings achieved by parallelization are outweighed by the communication overhead.
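The trade-off between parallel speedup and coordination cost can be illustrated with a toy model. The model and all constants below are invented for illustration only; they are not fitted to the measurements in this chapter. It assumes that the work divides perfectly among the workers and that every additional worker adds a fixed coordination cost.

```python
def modeled_latency(work_ms, overhead_ms, workers):
    """Toy model: perfectly divisible work plus a per-worker coordination cost."""
    return work_ms / workers + overhead_ms * (workers - 1)


def best_worker_count(work_ms, overhead_ms, candidates=(1, 2, 4, 8, 16, 32)):
    """Candidate worker count with the lowest modeled latency."""
    return min(candidates, key=lambda w: modeled_latency(work_ms, overhead_ms, w))


# Hypothetical short query: little work, so coordination dominates.
short_query_best = best_worker_count(work_ms=50, overhead_ms=30)
# Hypothetical long query: enough work to amortize the coordination cost.
long_query_best = best_worker_count(work_ms=10_000, overhead_ms=5)
```

In this model a short query is fastest with a single worker, while a long-running query keeps improving as workers are added, which is the qualitative pattern described above.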

The following pages contain the dataflows executing the query and the distribution of evaluation times for each of the seven queries. While describing the dataflows, we use the following notation:

Projections are denoted by the symbol π. They always occur at the very end of a query.

Selections are denoted by the symbol σ.

Joins are denoted by the symbol ⋈.

5.5 Query 1

SELECT u.name WHERE u.label() = ’switch’, u.position = ’access’

Peak Memory: 377’840 kB
Result size: 1’152

Our first query consists of only 2 simple Selections and 0 Joins. One pass through the vertices collection is enough to produce the result. This query is important because it shows us a lower bound on how fast queries can be. Since the query is quite fast (50 milliseconds on the smaller topologies), parallelization does not bear fruit here. As a matter of fact, increasing the number of workers only slows down the evaluation.
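The structure of this query, two selections followed by a projection in a single pass, can be sketched in plain Python. The dictionary layout of the vertices and the sample data are assumptions for illustration; the actual evaluation runs as a Differential Dataflow computation.

```python
def query1(vertices):
    """Names of all vertices labeled 'switch' whose position is 'access'."""
    return [
        v["name"]                       # projection (pi)
        for v in vertices               # single pass over the vertex set
        if v["label"] == "switch"       # first selection (sigma)
        and v["position"] == "access"   # second selection (sigma)
    ]


# Hypothetical vertex set.
vertices = [
    {"name": "sw1", "label": "switch", "position": "access"},
    {"name": "sw2", "label": "switch", "position": "core"},
    {"name": "h1", "label": "host", "position": "access"},
]
names = query1(vertices)
```

No vertex has to be compared with any other vertex or edge, which is why this query needs no join and serves as a baseline for the remaining queries.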

Dataflow

Figure 5.4: Dataflow Query1

Figure 5.5: Boxplots for Query 1. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

5.6 Query 2

SELECT n.name WHERE (n) -[e with weight < 4]-> (m)

Peak Memory: 3’597’452 kB
Result size: 36’542

The second query contains 2 Joins and 1 Selection. The joins are not very expensive since we only use about 30% of the edges. The query is therefore only marginally slower than the first one. Again, we observe that query times increase when we run the query with more than 8 workers. This most likely happens because of the communication overhead.

Dataflow

Figure 5.6: Dataflow Query2

Figure 5.7: Boxplots for Query 2. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

5.7 Query 3

SELECT n.name WHERE (n:switch) -> (m with position = ’distribution’)

Peak Memory: 1’433’376 kB
Result size: 55’296

The third query contains 2 Joins and 2 Selections. We expected this query to be a little slower since we are using all the edges in the joins, but it is actually a bit faster than Query 2 on the large Fat Tree. Once again, there is not much work to share here; adding more workers is therefore only beneficial for the large Fat Tree topology.

Dataflow

Figure 5.8: Dataflow Query3

Figure 5.9: Boxplots for Query 3. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

5.8 Query 4

SELECT n.name WHERE (n with position = ’distribution’)

-[e with weight > 8]-> (m with position = ’access’)

Peak Memory: 1’114’896 kB
Result size: 3’097

Our fourth query has 2 Joins and 3 Selections. Even though this query has more constraints than Queries 2 and 3, it is faster since we only use about 10% of the edges in the joins. When studying the effect of parallelization, the same pattern as in the previous queries is observed.

Dataflow

Figure 5.10: Dataflow Query4

Figure 5.11: Boxplots for Query 4. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

5.9 Query 5

SELECT v.name WHERE (u WITH position = ’access’)

-> (v WITH position = ’distribution’)

-> (w WITH position = ’core’)

Peak Memory: 3’797’976 kB
Result size: 663’552

The last three queries all contain multiple query edges and are much slower than the first four. The fifth query contains 4 Joins and 3 Selections. Since there is no selection on the edges, this particular query is quite slow even though it has "only" 4 joins. It should be noted that this query has a gigantic intermediate result set of 1.5 billion tuples on the large Fat Tree. This is most likely the reason why the query is the slowest of all for the large Fat Tree topology. In the other three topologies, query 7 is actually a little slower. For the first time, we observe a smooth decrease in query time as we add more worker threads, at least for the Fat Tree topologies.

Dataflow

Figure 5.12: Dataflow Query5

Figure 5.13: Boxplots for Query 5. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

5.10 Query 6

SELECT v.name WHERE (u WITH position = ’access’)

-[e with weight < 2]-> (v WITH position = ’distribution’)

-[f with weight > 8]-> (w WITH position = ’core’)

Peak Memory: 1’573’940 kB
Result size: 8’334

The sixth query consists of 4 Joins and 5 Selections. It is almost a copy of query 5, but we add new restrictions to the edges. As in query 4, the evaluation time goes down since the edge set used for the joins is a lot smaller now. Since we have far fewer tuples in the 3rd and 4th join, this query is faster than the previous one. Query times stay nearly the same regardless of how many worker threads we use. Exceptions are the small Fat Tree and Jellyfish when run with 32 workers.

Dataflow

Figure 5.14: Dataflow Query6

Figure 5.15: Boxplots for Query 6. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

5.11 Query 7

SELECT v.name WHERE (u WITH position = ’access’)

-[e with weight < 2]-> (v WITH position = ’distribution’)

-[f with weight < 2]-> (w WITH position = ’core’)

-[g with weight < 2]-> (x WITH position = ’distribution’)

Peak Memory: 2’210’824 kB
Result size: 49’956

The last query contains 6 Joins and 7 Selections. This is probably the most interesting result. We add two more joins to the query, but we dramatically restrict the number of edges used in all the joins. This leads to a surprisingly low evaluation time, almost on the same level as query 5 for the three smaller topologies. For the largest one it is actually faster than query 5. This demonstrates how big the impact of the joins is on the evaluation time. Once again, parallelization shows no benefit for the small topologies and, surprisingly, none for the large Jellyfish topology either. However, the large Fat Tree query times decrease with the number of workers, except when run with 32 workers.

Dataflow

Figure 5.16: Dataflow Query7

Figure 5.17: Boxplots for Query 7. Panels: (a) Large Fat Tree, (b) Small Fat Tree, (c) Large Jellyfish, (d) Small Jellyfish.

6 Related Work

In this chapter we give an overview of other existing graph query languages and query evaluation engines. There are many languages and engines developed for the processing of graphs and graph algorithms. PGQL is the leading one that combines graph pattern matching, graph traversals and graph construction. Although the language’s syntax supports all three actions, the official engine PGX developed by Oracle still lacks critical functionality to match the syntax. This is why we decided to base our evaluation engine on PGQL. Below we present other related works on graph processing.

6.1 SPARQL

The Resource Description Framework (RDF, see footnote 4) is a directed, labeled graph data format which is widely used to represent information on the Internet. The SPARQL Protocol And RDF Query Language (SPARQL, a recursive acronym) was developed as a query language for RDF. A great advantage of SPARQL is that queries can be expressed across diverse data sources, meaning the data does not have to be stored natively as RDF; it can also be viewed as RDF via middleware. SPARQL queries return either result sets or RDF graphs. SPARQL offers support for querying required and optional graph patterns. It is also possible to express label and value constraints in SPARQL queries.

The major difference between SPARQL and PGQL is the format of the graph. SPARQL only supports graphs in RDF format, whereas PGQL requires the graph to be in either the edge-list or adjacency-list format. [7]

6.2 Green-Marl

Green-Marl is a domain-specific language (DSL) developed by Stanford University in collaboration with Oracle. Green-Marl provides many high-level language constructs that allow developers to intuitively describe their graph queries. At the same time, the data-level parallelism inherent in the algorithms is exposed rather than concealed within the compiler. Green-Marl has its own compiler, which translates the algorithmic description written in Green-Marl into an efficient C++ program by exploiting the exposed data-level parallelism. In addition, the Green-Marl compiler applies a set of optimizations that exploit the semantic knowledge encoded in the Green-Marl DSL. With Green-Marl, almost any graph analysis algorithm can be written very intuitively. Experimental results show that the program generated by the compiler from the user’s description performs on par with highly tuned hand-coded implementations. [10]

4 https://www.w3.org/RDF/

6.3 Gremlin

Developed by the Apache Software Foundation, Gremlin is a query language as well as a graph traversal machine. The graph traversal machine Gremlin consists of three parts that continuously interact with each other: first the graph, second the traversal and finally the set of traversers. The traversers move along the graph according to the instructions which were specified by the user in the traversal. Gremlin returns as its result the locations of all halted traversers. Gremlin imposes no restrictions on the graph format. Any supporting graph computing system suffices to run a Gremlin machine.

The language Gremlin is a functional language which enables the user to specify the rules of the traversal. Both imperative and declarative querying are supported by Gremlin. [14]

6.4 SQLGraph

SQLGraph is a graph store that combines existing relational optimizers with a novel schema, attempting to achieve better performance for graph storage and querying than other popular graph stores. This new schema holds the information about the vertices’ neighbors in a typical relational storage, while the attributes of the vertices and edges are kept in a separate JSON storage. This particular design has many computational benefits compared to a purely relational or purely JSON solution. Queries are originally written in Gremlin and then translated without side effects into SQL, in order to make use of the powerful relational query optimizers that have been built in the last decades. [16]

6.5 GraphiQL

GRAPHiQL is an intuitive query language for graph analytics. Its goal is to allow developers to express their queries in terms of nodes and edges rather than the tables and joins used in relational databases. In order to achieve this goal, GRAPHiQL provides developers with key graph constructs such as looping, recursion, and neighborhood operations. Graph programs are compiled into efficient SQL queries at runtime. These SQL queries can be run on any relational database. However, GRAPHiQL has its own unique way of describing graph data, the syntax of which is reminiscent of JSON. [11]

6.6 Overview

The following table depicts the most interesting features of each query language. It is notable that, apart from SPARQL and GraphiQL, no language supports graph construction, not even PGQL. Nevertheless, we deemed PGQL the best choice for our query language.

| Query Language | PGQL | Gremlin | GraphiQL | SPARQL | Green-Marl | SQLGraph |
|---|---|---|---|---|---|---|
| Language | Java | Java | Javascript | Mainly Java | C++ | Java |
| Operating System | Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Windows, Linux, Mac | Linux | Windows, Linux, Mac |
| Open Source | Parser only | yes | yes | some implementations | yes | yes |
| Graph Format | Property Graph | RDF and Property Graph | JSON-like Property Graph | RDF | Property Graph | Property Graph |
| Pattern Matching | yes | yes | yes | yes | yes | yes |
| Path queries | yes | yes | yes | yes | yes | yes |
| Subgraph construction | no, although implementation is planned for future releases | no | yes | yes | no | no |

Table 2: Overview of Frameworks


7 Summary

7.1 Conclusion

In this thesis we presented a high-level interface which is capable of evaluating PGQL queries in a dataflow environment. We showed how our program parses queries and graphs, and then evaluates these queries on the graph.

Furthermore, we provided a thorough evaluation of the performance of our program to showcase its competitiveness.

7.2 Future Work

Due to technical difficulties, we were unable to implement all the functionality we had in mind. In this section we show which challenges still lie ahead.

7.2.1 Shortest Path

Unfortunately, there is currently no support for evaluating path expressions. This is a rather pressing issue, since paths are of great interest in network routing. As a first step, the evaluator should be extended with the functionality to find the shortest path. Subsequently, special properties of the paths can be taken into consideration. These include forbidding the visitation of certain vertices or requiring that a certain type of vertex, e.g. a firewall, is included in the path at least once.
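As an illustration of what such an extension would have to compute, the following is a minimal Python sketch of a shortest path search that avoids a set of forbidden vertices. It uses Dijkstra's algorithm on an in-memory edge list; it is not a dataflow implementation, and the function name, the `(source, target, weight)` layout and the sample graph are assumptions.

```python
import heapq


def shortest_path(edges, source, target, forbidden=frozenset()):
    """Dijkstra on weighted directed edges, skipping forbidden vertices.

    edges is a list of (source, target, weight) with non-negative weights.
    Returns (total_weight, path), or None if the target is unreachable.
    """
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append((dst, weight))

    queue = [(0, source, [source])]
    settled = set()
    while queue:
        cost, vertex, path = heapq.heappop(queue)
        if vertex == target:
            return cost, path
        if vertex in settled:
            continue
        settled.add(vertex)
        for neighbor, weight in adjacency.get(vertex, []):
            if neighbor not in settled and neighbor not in forbidden:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


# Hypothetical four-vertex graph.
edges = [("a", "b", 1), ("b", "d", 1), ("a", "c", 2), ("c", "d", 3)]
direct = shortest_path(edges, "a", "d")
detour = shortest_path(edges, "a", "d", forbidden={"b"})
```

The requirement that a path must pass through a certain type of vertex is not covered by this sketch and would need an additional constraint on the search state.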

7.2.2 Additional Joins

Currently, joins only happen when an edge is placed between two vertices in the query. Unfortunately, certain problems stem from this. For example, the query:

SELECT * WHERE v.age = u.age

cannot be evaluated by the current version of the engine. This query requires us to join the vertex set with itself on the age attribute, which in itself does not pose a problem. The underlying difficulty is to recognize that there is an actual need to join the sets. Therefore, in order to fix this issue, we would have to extend the handling of the value constraints so that we can recognize these special requirements.
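The join that such a query requires can be sketched in Python as a self-join of the vertex set on one attribute. The function name and data layout are hypothetical, and the sketch assumes that a vertex is not paired with itself, which the query above does not specify.

```python
from collections import defaultdict


def self_join_on(vertices, attribute):
    """All ordered pairs (v, u) of distinct vertices with equal attribute values."""
    groups = defaultdict(list)
    for vertex_id, attributes in vertices:
        if attribute in attributes:
            groups[attributes[attribute]].append(vertex_id)
    return sorted(
        (v, u)
        for ids in groups.values()
        for v in ids
        for u in ids
        if v != u
    )


# Hypothetical vertex set.
vertices = [(1, {"age": 30}), (2, {"age": 25}), (3, {"age": 30})]
pairs = self_join_on(vertices, "age")
```

The join itself is straightforward; the missing piece in the engine is the analysis step that detects a constraint spanning two vertex variables and plans this join.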

7.2.3 Query Optimizer

In the current version of the program, all queries are evaluated in the same fashion. We start with the selections on the vertex set, join the results with the edge set, and finally apply the projections. Pushing the selections down is usually good practice; however, under certain conditions this approach is suboptimal. Therefore, the implementation of a query optimizer is absolutely necessary once we get to larger graphs and more complicated queries.

7.2.4 Changes in the Graph

Throughout all the experiments, our graphs never changed. Since this program was built on Differential Dataflow, which excels at dealing with dynamic graphs, it would make a lot of sense to have periodic input that adds or removes parts of the graph. In a real-life application, these changes represent, for example, the failure of a switch or the addition of previously shut down machines to the network.
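Such graph changes can be thought of as a stream of additions and retractions. The sketch below models a collection as a multiset and applies (record, diff) updates to it in plain Python. It is a simplification: Differential Dataflow additionally attaches a logical time to every update and recomputes query results incrementally, neither of which is shown here, and all names are hypothetical.

```python
from collections import Counter


def apply_changes(collection, changes):
    """Apply (record, diff) updates to a multiset of records.

    A diff of +1 adds one copy of the record, -1 removes one copy.
    Records whose count drops to zero disappear from the collection.
    """
    updated = Counter(collection)
    for record, diff in changes:
        updated[record] += diff
        if updated[record] == 0:
            del updated[record]
    return updated


edges = Counter({("s1", "s2"): 1, ("s2", "s3"): 1})
# Switch s2 fails, so both of its links are retracted; a new link s1 -> s3 appears.
changes = [(("s1", "s2"), -1), (("s2", "s3"), -1), (("s1", "s3"), +1)]
edges_after = apply_changes(edges, changes)
```

A benchmark with dynamic graphs would feed batches of such changes to the edge collection and measure how long the query results take to update.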


References

[1] Github Frank McSherry Differential Dataflow. https://github.com/frankmcsherry/differential-dataflow. Accessed: 2017-03-30.

[2] Github Frank McSherry Timely Dataflow. https://github.com/frankmcsherry/timely-dataflow. Accessed: 2017-03-30.

[3] Github Geoffroy Couprie Nom. https://github.com/geal/nom. Accessed: 2017-03-30.

[4] Official Rust Website. https://www.rust-lang.org/en-US/. Accessed: 2017-03-30.

[5] Oracle Edge-List Format. https://docs.oracle.com/cd/E56133_01/2.3.1/reference/loader/file-system/plain-text-formats.html. Accessed: 2017-03-30.

[6] Oracle Official PGQL Specifications. http://pgql-lang.org/spec/1.0/. Accessed: 2017-03-30.

[7] Sparql Query Language. https://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/. Accessed: 2017-03-30.

[8] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, SIGCOMM ’08, pages 63–74, New York, NY, USA, 2008. ACM.

[9] G. Couprie. Nom, a byte oriented, streaming, zero copy, parser combinators library in Rust. In 2015 IEEE Security and Privacy Workshops, pages 142–148, May 2015.

[10] Sungpack Hong, Hassan Chafi, Eric Sedlar, and Kunle Olukotun. Green-Marl: A DSL for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 349–362, New York, NY, USA, 2012. ACM.

[11] A. Jindal and S. Madden. GRAPHiQL: A graph intuitive query language for relational databases. In 2014 IEEE International Conference on Big Data (Big Data), pages 441–450, Oct 2014.

[12] Frank McSherry, Derek Murray, Rebecca Isaacs, and Michael Isard. Differential dataflow. In Proceedings of CIDR 2013, January 2013.

[13] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 439–455, New York, NY, USA, 2013. ACM.

[14] Marko A. Rodriguez. The Gremlin graph traversal machine and language. CoRR, abs/1508.03843, 2015.

[15] Ankit Singla, Chi-Yao Hong, Lucian Popa, and P. Brighten Godfrey. Jellyfish: Networking data centers randomly. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 225–238, San Jose, CA, 2012. USENIX.

[16] Wen Sun, Achille Fokoue, Kavitha Srinivas, Anastasios Kementsietsidis, Gang Hu, and Guotong Xie. SQLGraph: An efficient relational-based property graph store. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1887–1901, New York, NY, USA, 2015. ACM.

[17] Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. PGQL: A property graph query language. In Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, GRADES ’16, pages 7:1–7:6, New York, NY, USA, 2016. ACM.