INCQUERY-D: Distributed Incremental Graph Queries
Budapest University of Technology and Economics
Department of Measurement and Information Systems
DISTRIBUTED INCREMENTAL GRAPH QUERIES
Gábor Szárnyas, Dániel Varró
2 February, 2015
22nd Minisymposium of the Department of Measurement and Information Systems
MOTIVATION
Performance issues
Agile Model-Driven Development
Modeling
Code generation
Testing
Early validations
Transformations
Scalability challenges
Model Sizes
Models = graphs with 100M–1B elements
o Car industry
o Avionics
o Software analysis
o Cyber-physical systems
Source: Markus Scheidgen, Automated and Transparent Model Fragmentation for Persisting Large Models, 2012
application          model size
software models      10^8
sensor data          10^9
geo-spatial models   10^12
Validation may take hours
MDE
Scalability
Incrementality
Incremental queries
Incremental transformation
Storing partial results
Tracking changes
Motivating Example
Pattern for an AUTOSAR validation constraint
Communication channel
Logical signal Mapping Physical signal
Invalid submodel
Validation
Valid submodel
Antijoin
Join
Join
Fill indexer nodes
Store interim results
Read result set
Edit model
Propagating changes
Read result set
Rete Algorithm
Communication channel
Logical signal Mapping Physical signal
Result set
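The steps listed for the Rete algorithm (fill the indexers, store interim results, propagate changes, read the result set) can be sketched in a few lines. This is a minimal single-process sketch, not the EMF-INCQUERY implementation. The class names, the three input relations and the constraint itself ("a logical signal on a communication channel must be mapped to a physical signal") are assumptions made for illustration, and the network uses one join and one antijoin rather than the two joins shown on the slide.

```python
from collections import defaultdict

class Node:
    """Forwards change notifications (+1 insert, -1 delete) to child nodes."""
    def __init__(self):
        self.children = []  # (child node, input slot) pairs
    def connect(self, child, side="left"):
        self.children.append((child, side))
    def emit(self, sign, tup):
        for child, side in self.children:
            child.update(side, sign, tup)
    def insert(self, tup):  # used on input nodes, which stand in for indexers
        self.emit(+1, tup)
    def delete(self, tup):
        self.emit(-1, tup)

class Memory:
    """Stored tuples of one input slot, indexed by join key."""
    def __init__(self, key):
        self.key = key
        self.index = defaultdict(set)
    def apply(self, sign, tup):
        """Store or drop tup; return False when the change is a no-op."""
        bucket = self.index[self.key(tup)]
        if (sign > 0) == (tup in bucket):
            return False
        if sign > 0:
            bucket.add(tup)
        else:
            bucket.remove(tup)
        return True

class Join(Node):
    """Emits left + right for every pair of tuples that agree on the key."""
    def __init__(self, left_key, right_key):
        super().__init__()
        self.memory = {"left": Memory(left_key), "right": Memory(right_key)}
    def update(self, side, sign, tup):
        own = self.memory[side]
        other = self.memory["right" if side == "left" else "left"]
        if not own.apply(sign, tup):
            return
        for match in list(other.index[own.key(tup)]):
            self.emit(sign, tup + match if side == "left" else match + tup)

class Antijoin(Node):
    """Emits left tuples that have no right tuple with the same key."""
    def __init__(self, left_key, right_key):
        super().__init__()
        self.left, self.right = Memory(left_key), Memory(right_key)
    def update(self, side, sign, tup):
        if side == "left":
            if self.left.apply(sign, tup) and not self.right.index[self.left.key(tup)]:
                self.emit(sign, tup)
            return
        key = self.right.key(tup)
        was_blocked = bool(self.right.index[key])
        if not self.right.apply(sign, tup):
            return
        is_blocked = bool(self.right.index[key])
        if was_blocked != is_blocked:  # only the first/last blocker matters
            for left_tup in list(self.left.index[key]):
                self.emit(-1 if is_blocked else +1, left_tup)

class ResultSet(Node):
    def __init__(self):
        super().__init__()
        self.tuples = set()
    def update(self, side, sign, tup):
        if sign > 0:
            self.tuples.add(tup)
        else:
            self.tuples.discard(tup)

def build_network():
    """Hypothetical relations: (channel, logical), (mapping, logical), (mapping, physical)."""
    channel_signal, mapping_logical, mapping_physical = Node(), Node(), Node()
    mapped = Join(lambda t: t[0], lambda t: t[0])        # join on the mapping
    unmapped = Antijoin(lambda t: t[1], lambda t: t[1])  # compare logical signals
    invalid = ResultSet()
    mapping_logical.connect(mapped, "left")
    mapping_physical.connect(mapped, "right")
    channel_signal.connect(unmapped, "left")
    mapped.connect(unmapped, "right")
    unmapped.connect(invalid)
    return channel_signal, mapping_logical, mapping_physical, invalid

channel_signal, mapping_logical, mapping_physical, invalid = build_network()
channel_signal.insert(("ch1", "ls1"))
channel_signal.insert(("ch1", "ls2"))
mapping_logical.insert(("m1", "ls1"))
mapping_physical.insert(("m1", "ps1"))
print(invalid.tuples)  # only ("ch1", "ls2") is unmapped
mapping_physical.delete(("m1", "ps1"))
print(invalid.tuples)  # the edit makes ("ch1", "ls1") invalid again
```

Each edit touches only the tuples that match the changed one, so the result set is kept up to date without re-running the query over the whole model.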
CURRENT STATE OF RESEARCH
EMF-INCQUERY
Rete-based incremental graph query engine
Open source Eclipse project
Typical use cases
o Validation
o Incremental model transformation
o Model synchronization
Single Workstation Limitations
Most tools only work for models with <1M elements due to resource exhaustion
Best tools: <10M model elements
JVM limitations: cannot handle 15+ GB heap memory efficiently
Proposed solution
o Horizontal scaling: distributed system
Problem Statement
Scalability
Scalable storage
Scalable query engine
Distributed NoSQL databases
Distributed INCQUERY:
INCQUERY-D
Complex queries
Big models
Goals of INCQUERY-D
Objectives
o Distributed incremental pattern matching
o Adapting EMF-INCQUERY’s tooling to distributed DBs
o Executed over a cloud infrastructure (COTS hardware)
Achieve scalability by avoiding memory bottleneck
o Sharding separately
• Data
• Indexers
• Query network
o In memory
• Index + query
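Sharding the data, the indexers and the query network separately means keeping three independent placement decisions. The sketch below is one possible way to record them, not something the slides specify: hash partitioning for database shards, and the `Deployment` class with its field names, are assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass, field

def shard_of(element_id: str, shard_count: int) -> int:
    """Stable hash partitioning: the same id always lands on the same shard."""
    return zlib.crc32(element_id.encode("utf-8")) % shard_count

@dataclass
class Deployment:
    """Three independent placements: database shards, indexers, Rete nodes."""
    shard_count: int
    indexer_host: dict = field(default_factory=dict)  # element type -> host
    rete_host: dict = field(default_factory=dict)     # Rete node name -> host

    def locate(self, element_id: str, element_type: str) -> dict:
        """Where an element is persisted and which host indexes its type."""
        return {
            "database_shard": shard_of(element_id, self.shard_count),
            "indexer_host": self.indexer_host[element_type],
        }

# Hypothetical setup: 3 database shards, indexers and Rete nodes placed by hand.
deployment = Deployment(
    shard_count=3,
    indexer_host={"LogicalSignal": "server1", "Mapping": "server2"},
    rete_host={"join1": "server2", "antijoin": "server0"},
)
print(deployment.locate("signal-42", "LogicalSignal"))
```

Because the three maps are independent, an indexer or a Rete node can be moved to another host without repartitioning the persisted data.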
RESEARCH QUESTIONS AND RESULTS
Architecture and Data Representation
Is it possible to build a query engine which works on various backends using different data representation formats?
Is it possible to serve multiple users concurrently?
INCQUERY-D Architecture
Server 1
Database shard 1
Server 2
Database shard 2
Server 3
Database shard 3
Transaction
In-memory EMF model
Database shard 0
Server 0
Rete net
Indexer layer
EMF-INCQUERY INCQUERY-D
Distributed query evaluation network
Distributed indexer
Model access adapter
Indexing
Indexer Indexer Indexer Indexer
Join
Join
Antijoin
In-memory storage
Distributed indexing, notification
Production network
• Stores intermediate query results
• Propagates changes
Distributed persistent storage
Distributed production network
• Each intermediate node can be allocated to a different host
• Remote internode communication
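The idea that each node of the production network can live on a different host and talk to the others through messages can be sketched with a small in-process simulation. This is a sketch under stated assumptions, not the INCQUERY-D implementation: the `Cluster` class, the host names and the forwarding handlers are hypothetical, and per-host mailboxes stand in for a real messaging framework.

```python
from collections import defaultdict, deque

class Cluster:
    """Simulated hosts with mailboxes; nodes are handlers placed on hosts."""
    def __init__(self):
        self.placement = {}                  # node name -> host name
        self.handlers = {}                   # node name -> function(change)
        self.mailboxes = defaultdict(deque)  # host name -> pending messages
        self.remote_messages = 0

    def deploy(self, node, host, handler):
        self.placement[node] = host
        self.handlers[node] = handler

    def send(self, sender, target, change):
        # A message is remote when sender and target sit on different hosts.
        if sender is not None and self.placement[sender] != self.placement[target]:
            self.remote_messages += 1
        self.mailboxes[self.placement[target]].append((target, change))

    def run(self):
        """Deliver messages until every mailbox is empty."""
        while any(self.mailboxes.values()):
            for host in list(self.mailboxes):
                box = self.mailboxes[host]
                while box:
                    target, change = box.popleft()
                    for next_node, out in self.handlers[target](change):
                        self.send(target, next_node, out)

def make_pipeline(join_host):
    """indexer -> join -> result; the join logic is omitted, it only forwards."""
    cluster, results = Cluster(), set()

    def store(change):
        sign, tup = change
        if sign > 0:
            results.add(tup)
        else:
            results.discard(tup)
        return []

    cluster.deploy("indexer", "server1", lambda change: [("join", change)])
    cluster.deploy("join", join_host, lambda change: [("result", change)])
    cluster.deploy("result", "server0", store)
    return cluster, results

cluster, results = make_pipeline(join_host="server2")
cluster.send(None, "indexer", (+1, ("ch1", "ls1")))
cluster.run()
print(results, cluster.remote_messages)
```

Placing the join on the same host as the indexer (`join_host="server1"`) removes one remote hop per change, which is the kind of trade-off a node allocation has to weigh against memory limits.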
Scalable Incremental Query Evaluation
Is it possible to utilise an incremental query evaluation algorithm in a distributed system for high performance query evaluation?
How can we benchmark a distributed system in a reproducible manner?
Benchmark Results for Revalidation
Quick response time for models with 88M elements
Different characteristics
Dimensions of Scalability
Infrastructure
o Number of machines
o Available memory / CPU
o Network performance
o Number of concurrent users
Model
o Model size
o Model characteristics
Queries
o Number of queries
o Query complexity
Optimisation and Dynamic Reconfiguration
How can we scale and optimise such a system?
How can the system adapt to the changes
o in the system?
o in the cloud environment?
How can we estimate the resources required by a certain setup?
Dynamic Resource Allocation
[Figure: the Rete network (four Indexer nodes, Join nodes and an Antijoin node) allocated across Server 0 to Server 3, with change sets (Δ) flowing between nodes. Memory usage labels shown: 10%, 70%, 60%, 80%, 90%, 25%, 75%.]
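Dynamic resource allocation can be illustrated with a simple greedy rule: when a server's memory usage exceeds a threshold, move one of its Rete nodes to the least loaded server. The slides do not describe the actual reallocation strategy, so the function below, its parameters and the example figures are all assumptions made for illustration.

```python
def rebalance(allocation, node_memory, servers, capacity, threshold=0.8):
    """Greedy sketch: move nodes off servers whose memory usage exceeds the threshold.

    allocation maps node -> server and is updated in place.
    Returns the list of (node, source, target) moves that were made.
    """
    limit = threshold * capacity
    moves = []
    for _ in range(len(allocation)):  # hard cap on the number of moves
        usage = {server: 0.0 for server in servers}
        for node, server in allocation.items():
            usage[server] += node_memory[node]
        source = max(servers, key=usage.get)  # most loaded server
        target = min(servers, key=usage.get)  # least loaded server
        if usage[source] <= limit:
            break  # nothing is overloaded
        # Try the largest node first; it must fit on the target without overloading it.
        movable = sorted(
            (n for n, s in allocation.items() if s == source),
            key=lambda n: (-node_memory[n], n),
        )
        node = next((n for n in movable if usage[target] + node_memory[n] <= limit), None)
        if node is None:
            break  # no safe move exists
        allocation[node] = target
        moves.append((node, source, target))
    return moves

# Hypothetical memory footprints in GB, on servers with 10 GB each.
allocation = {"indexer_a": "s1", "join_1": "s1", "indexer_b": "s2", "antijoin": "s0"}
node_memory = {"indexer_a": 3, "join_1": 6, "indexer_b": 2, "antijoin": 1}
print(rebalance(allocation, node_memory, ["s0", "s1", "s2", "s3"], capacity=10))
# [('join_1', 's1', 's3')]
```

Here `s1` starts at 90% usage, and moving `join_1` to the empty `s3` brings every server under the 80% threshold. A real system would also have to migrate the node's stored interim results and account for the extra network traffic.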
Conclusion
MDE raises Big Data questions for research
Horizontal scaling is one way to query large models
Theoretical challenges
o Distributed pattern matching algorithm
o Data representation
o Dynamic resource allocation
Practical challenges
o Integrating technologies: database, messaging framework, monitoring, user interface, etc.
o High performance query evaluation
Ω