incquery-d: distributed incremental graph queries

Budapest University of Technology and Economics

Department of Measurement and Information Systems

DISTRIBUTED INCREMENTAL GRAPH QUERIES

Gábor Szárnyas, Dániel Varró

2 February, 2015

22nd Minisymposium of the Department of Measurement and Information Systems

MOTIVATION

Performance issues

Agile Model-Driven Development

Modeling

Code generation

Testing

Early validations

Transformations

Scalability challenges

Model Sizes

Models = graphs with 100M–1B elements

o Car industry

o Avionics

o Software analysis

o Cyber-physical systems

Source: Markus Scheidgen, Automated and Transparent Model Fragmentation for Persisting Large Models, 2012

application           model size
software models       10^8
sensor data           10^9
geo-spatial models    10^12

Validation may take hours

MDE

Scalability

Incrementality

Incremental queries

Incremental transformation

Storing partial results

Tracking changes

Motivating Example

Pattern for an AUTOSAR validation constraint

Communication channel

Logical signal Mapping Physical signal

Invalid submodel

Validation

Valid submodel

Antijoin

Join

Join

o Fill indexer nodes

o Store interim results

o Read result set

o Edit model

o Propagating changes

o Read result set

Rete Algorithm

Communication channel

Logical signal Mapping Physical signal

Result set
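The steps on this slide (fill indexers, store interim results, propagate changes) can be sketched with a small in-memory Rete network. This is a minimal sketch and not the EMF-INCQUERY implementation: the class names, the three edge types, and the constraint itself (a mapped physical signal without a communication channel is invalid) are assumptions made for illustration, and the network is reduced to one join and one antijoin.

```python
from collections import defaultdict

class Node:
    """Base class: forwards (tuple, sign) changes to child nodes."""
    def __init__(self):
        self.children = []  # pairs of (child node, input side "L" or "R")
    def connect(self, child, side="L"):
        self.children.append((child, side))
    def emit(self, tup, sign):
        for child, side in self.children:
            child.receive(side, tup, sign)

class Indexer(Node):
    """Input node: caches the tuples of one edge type (hypothetical layout)."""
    def __init__(self):
        super().__init__()
        self.tuples = set()
    def insert(self, tup):
        if tup not in self.tuples:
            self.tuples.add(tup)
            self.emit(tup, +1)
    def delete(self, tup):
        if tup in self.tuples:
            self.tuples.remove(tup)
            self.emit(tup, -1)

class Join(Node):
    """Stores both inputs and joins tuples that agree on one key column."""
    def __init__(self, left_key, right_key):
        super().__init__()
        self.key = {"L": left_key, "R": right_key}
        self.memory = {"L": defaultdict(set), "R": defaultdict(set)}
    def receive(self, side, tup, sign):
        value = tup[self.key[side]]
        bucket = self.memory[side][value]
        (bucket.add if sign > 0 else bucket.discard)(tup)
        other = "R" if side == "L" else "L"
        for match in list(self.memory[other][value]):
            left, right = (tup, match) if side == "L" else (match, tup)
            rest = tuple(v for i, v in enumerate(right) if i != self.key["R"])
            self.emit(left + rest, sign)

class Antijoin(Node):
    """Passes left tuples that have no right tuple with the same key."""
    def __init__(self, left_key, right_key):
        super().__init__()
        self.key = {"L": left_key, "R": right_key}
        self.left, self.right = defaultdict(set), defaultdict(set)
    def receive(self, side, tup, sign):
        value = tup[self.key[side]]
        if side == "L":
            (self.left[value].add if sign > 0 else self.left[value].discard)(tup)
            if not self.right[value]:
                self.emit(tup, sign)
            return
        blocked_before = bool(self.right[value])
        (self.right[value].add if sign > 0 else self.right[value].discard)(tup)
        blocked_after = bool(self.right[value])
        if blocked_before != blocked_after:  # first blocker arrived or last one left
            for left_tup in list(self.left[value]):
                self.emit(left_tup, -1 if blocked_after else +1)

class Production(Node):
    """Terminal node: holds the current result set."""
    def __init__(self):
        super().__init__()
        self.results = set()
    def receive(self, side, tup, sign):
        (self.results.add if sign > 0 else self.results.discard)(tup)

def build_network():
    signal_mapping = Indexer()    # (logical signal, mapping)   -- assumed edge type
    mapping_physical = Indexer()  # (mapping, physical signal)  -- assumed edge type
    physical_channel = Indexer()  # (physical signal, channel)  -- assumed edge type
    join, antijoin = Join(1, 0), Antijoin(2, 0)
    production = Production()
    signal_mapping.connect(join, "L")
    mapping_physical.connect(join, "R")
    join.connect(antijoin, "L")
    physical_channel.connect(antijoin, "R")
    antijoin.connect(production)
    return signal_mapping, mapping_physical, physical_channel, production

signal_mapping, mapping_physical, physical_channel, production = build_network()
signal_mapping.insert(("L1", "M1"))
mapping_physical.insert(("M1", "P1"))
print(production.results)              # P1 has no channel: one invalid match
physical_channel.insert(("P1", "C1"))  # edit the model
print(production.results)              # the change propagated: no match left
```

Only the affected tuples travel through the network after an edit; the stored interim results make a full re-evaluation unnecessary.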

CURRENT STATE OF RESEARCH

EMF-INCQUERY

Rete-based incremental graph query engine

Open source Eclipse project

Typical use cases

o Validation

o Incremental model transformation

o Model synchronization

Single Workstation Limitations

Most tools handle only models with <1M elements before exhausting resources

Best tools: <10M model elements

JVM limitations: cannot handle 15+ GB heap memory efficiently

Proposed solution

o Horizontal scaling: distributed system

Problem Statement

Scalability

Scalable storage

Scalable query engine

Distributed NoSQL databases

Distributed INCQUERY:

INCQUERY-D

Complex queries

Big models

Goals of INCQUERY-D

Objectives

o Distributed incremental pattern matching

o Adapting EMF-INCQUERY’s tooling to distributed DBs

o Executed over a cloud infrastructure (COTS hardware)

Achieve scalability by avoiding memory bottleneck

o Sharding separately

• Data

• Indexers

• Query network

o In memory

• Index + query

RESEARCH QUESTIONS AND RESULTS

Architecture and Data Representation

Is it possible to build a query engine which works on various backends using different data representation formats?

Is it possible to serve multiple users concurrently?

INCQUERY-D Architecture

Server 1

Database shard 1

Server 2

Database shard 2

Server 3

Database shard 3

Transaction

In-memory EMF model

Database shard 0

Server 0

Rete net

Indexer layer

EMF-INCQUERY INCQUERY-D

Distributed query evaluation network

Distributed indexer

Model access adapter

Indexing

Indexer Indexer Indexer Indexer

Join

Join

Antijoin

In-memory storage

Distributed indexing, notification

Production network

• Stores intermediate query results

• Propagates changes

Distributed persistent storage

Distributed production network

• Each intermediate node can be allocated to a different host

• Remote internode communication
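Remote internode communication can be illustrated by letting nodes exchange serialized change messages through per-host mailboxes. This is an in-process simulation under stated assumptions, not the INCQUERY-D messaging layer: the `Cluster` class, the node ids, the host names, and the JSON message layout are all hypothetical.

```python
import json
from collections import deque

class Cluster:
    """In-process stand-in for hosts that exchange serialized change messages."""
    def __init__(self):
        self.mailboxes = {}        # host name -> queue of JSON strings
        self.location = {}         # node id -> host name
        self.handlers = {}         # node id -> callable(cluster, sign, tup)
        self.remote_messages = 0   # messages that crossed a host boundary

    def allocate(self, node_id, host, handler):
        """Place a node on a host; any node can live on any host."""
        self.mailboxes.setdefault(host, deque())
        self.location[node_id] = host
        self.handlers[node_id] = handler

    def send(self, target, sign, tup, sender=None):
        """Serialize a change (+1 insert, -1 delete) and queue it at the target's host."""
        if sender is not None and self.location[sender] != self.location[target]:
            self.remote_messages += 1
        message = json.dumps({"target": target, "sign": sign, "tuple": list(tup)})
        self.mailboxes[self.location[target]].append(message)

    def run(self):
        """Deliver messages until every mailbox is empty."""
        while any(self.mailboxes.values()):
            for mailbox in self.mailboxes.values():
                while mailbox:
                    msg = json.loads(mailbox.popleft())
                    handler = self.handlers[msg["target"]]
                    handler(self, msg["sign"], tuple(msg["tuple"]))

results = set()

def indexer(cluster, sign, tup):
    # Forward every change to the production node, wherever it is allocated.
    cluster.send("production", sign, tup, sender="indexer")

def production(cluster, sign, tup):
    (results.add if sign > 0 else results.discard)(tup)

cluster = Cluster()
cluster.allocate("indexer", "server1", indexer)
cluster.allocate("production", "server0", production)
cluster.send("indexer", +1, ("L1", "M1"))
cluster.send("indexer", +1, ("L2", "M2"))
cluster.send("indexer", -1, ("L1", "M1"))
cluster.run()
print(results, cluster.remote_messages)
```

Because nodes address each other by id only, moving a node to another host changes the `location` table and nothing else.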

Scalable Incremental Query Evaluation

Is it possible to utilise an incremental query evaluation algorithm in a distributed system for high performance query evaluation?

How can we benchmark a distributed system in a reproducible manner?

Benchmark Results for Revalidation

Quick response time for models with 88M elements

Different characteristics

Dimensions of Scalability

Infrastructure

o Number of machines

o Available memory / CPU

o Network performance

o Number of concurrent users

Model

o Model size

o Model characteristics

Queries

o Number of queries

o Query complexity

Optimisation and Dynamic Reconfiguration

How can we scale and optimise such a system?

How can the system adapt to the changes

o in the system?

o in the cloud environment?

How can we estimate the resources required by a certain setup?

Dynamic Resource Allocation

[Figure: Rete nodes (four Indexers, Joins, Antijoin) allocated across Server 0 to Server 3, with Δ markers and memory usage labels (10%, 70%, 60%, 80%, 90%, 25%, 75%).]
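One way to think about allocating Rete nodes by memory usage is a greedy placement: take the largest nodes first and put each on the server that ends up with the lowest relative load. This is an illustrative heuristic, not the allocation strategy of INCQUERY-D; the function name, the memory estimates (in GB), and the server capacities are hypothetical.

```python
def allocate(node_memory, server_capacity):
    """Greedy placement of Rete nodes on servers.

    node_memory: estimated memory per node (assumed known, e.g. from monitoring)
    server_capacity: available memory per server, in the same unit
    Returns (placement, load), where load is the used fraction per server.
    """
    used = {server: 0 for server in server_capacity}
    placement = {}
    # Largest nodes first; ties are broken by name to keep the result deterministic.
    for node, mem in sorted(node_memory.items(), key=lambda kv: (-kv[1], kv[0])):
        candidates = [s for s in server_capacity
                      if used[s] + mem <= server_capacity[s]]
        if not candidates:
            raise RuntimeError(f"no server has room for {node}")
        best = min(candidates,
                   key=lambda s: ((used[s] + mem) / server_capacity[s], s))
        placement[node] = best
        used[best] += mem
    load = {s: used[s] / server_capacity[s] for s in server_capacity}
    return placement, load

# Hypothetical estimates for the network on the slide.
node_memory = {"indexer1": 2, "indexer2": 2, "indexer3": 2, "indexer4": 2,
               "join1": 6, "join2": 5, "antijoin": 3}
server_capacity = {"server0": 8, "server1": 8, "server2": 8, "server3": 8}

placement, load = allocate(node_memory, server_capacity)
for server in sorted(load):
    nodes = sorted(n for n, s in placement.items() if s == server)
    print(server, f"{load[server]:.0%}", nodes)
```

Rerunning the placement with updated estimates gives a new target allocation; how nodes would be migrated between servers at runtime is outside this sketch.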

Conclusion

MDE raises Big Data research questions

Horizontal scaling is one way to query large models

Theoretical challenges

o Distributed pattern matching algorithm

o Data representation

o Dynamic resource allocation

Practical challenges

o Integrating technologies: database, messaging framework, monitoring, user interface, etc.

o High performance query evaluation
