reference representation in large metamodel-based datasets
This presentation was held at the BigMDE Workshop (at STAF) in Budapest, 2013.
Markus Scheidgen
Model representations for large meta-model based data-sets
■ Introduction: Technological spaces and model representations
■ Comparison of representations
■ Implementation
■ Application
Introduction: Technological Spaces
[Diagram: technological spaces and the activities that connect them]
Software Models: model-transformation/-constraints/-queries
Code: static analysis/compilation/refactoring
XML: persistence / exchange; processing (e.g. DOM/JAXB); exchange (e.g. in web-services); XSLT/XSL/XQuery/XPath
Databases: persistence/versioning; processing (via ORMs, e.g. JPA); SQL
Objects (e.g. POJOs): running programs
Connecting activities: reverse engineering, code generation, debugging/profiling, reflection, runtime modeling
Other data
Introduction: State of the Art
[Diagram: pairs of description level / instance level]
Meta-Models / Models
Schemas / XML
Grammars / Code
Classes / Objects
ER-Schemas / Relational Data
visualization and editing by human users
processing in computer programs
exchange
large data-sets/persistence and querying
Introduction: New Class of DBMS
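[Diagram: pairs of description level / instance level, extended by big-data counterparts]
Meta-Models / Models
Schemas / XML
Grammars / Code
Classes / Objects
ER-Schemas / Relational Data
Big Data, Graphs
ER-Schemas / Big Relational Data
?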
Representation: Strategies
Objects (columns) vs. references (rows):
                  Object-by-object    Fragments
Part-of-source    Morsa, (Java)       XMI, EMF-Frag
Relations         CDO                 ?
Representation: Object-by-object vs. Fragmentation (considering traversal, theoretical results)
[Plot: execution time t (in ms) over the number of loaded objects l, logarithmic axes; legend: fragment size f from 1e+00 to 1e+06. Marked: no fragmentation (f=m), optimal fragmentation, total fragmentation (f=1).]
Representation: Object-by-object vs. Fragmentation (considering traversal, theoretical results vs. implementation)
[Two plots: execution time t (in ms) over the number of loaded objects l, logarithmic axes. First plot: fragment size f from 1e+00 to 1e+06, with no fragmentation (f=m), optimal fragmentation and total fragmentation (f=1) marked. Second plot: fragment size f from 1e+01 to 1e+05, with optimal fragmentation marked.]
Representation: Object-by-object vs. Fragmentation (considering traversal, implementation with actual model)
■ Model traversal of Grabats models with four different sizes and different characteristics
[Two bar charts over the model sets set0 to set4. First chart: objects per second (in 10^4, 0 to 8) for XMI, CDO, Morsa, EMFFrag coarse and EMFFrag fine; two bars are marked "not measured – extrapolated". Second chart: number of fragments (10^3 to 10^7, logarithmic) for CDO/Morsa, EMFFrag coarse and EMFFrag fine.]
Representation: Object-by-object vs. Fragmentation (considering query, implementation with actual model)
■ Query of Grabats models with four different sizes and different characteristics
[Two bar charts over the model sets set0 to set4. First chart: number of fragments (10^3 to 10^7, logarithmic) for CDO/Morsa, EMFFrag coarse and EMFFrag fine. Second chart: execution time (in s, 0 to 350) for XMI, CDO w/o SQL, CDO, Morsa w/o index, Morsa, EMFFrag coarse and EMFFrag fine; four bars are marked "not measured – extrapolated".]
Representation: Part-of-source vs. Relations (real implementation, artificial model)
[Two plots: execution time in ms (10^1 to 10^4) over the number of outgoing references (10^0 to 10^6), logarithmic axes. First plot: part-of-source implementation. Second plot: relation implementation with individual access. Each plot has two curves: access of one outgoing reference, and traversal of all outgoing references.]
Representation: Part-of-source vs. Relations (real implementation, artificial model)
[Two plots: execution time in ms (10^1 to 10^4) over the number of outgoing references (10^0 to 10^6), logarithmic axes. First plot: part-of-source implementation. Second plot: relation implementation with scanning. Each plot has two curves: access of one outgoing reference, and traversal of all outgoing references.]
Implementation: EMF-Fragments
[Diagram: EMF-Fragments technology stack]
map/reduce (Hadoop)
“Share Nothing” nodes (cluster, ad hoc network)
DFS (HDFS)
key-value store (HBase)
structured data, data-sets
applications, meta-model
structured data, model transformations
Implementation: Datastore mapping
[Diagram: mapping of a metamodel and its model onto the datastore, showing regular containment, part-of-source fragmentation and relation-based fragmentation]
Implementation: Meta-model-based declaration of representations
[Class diagram: Project, Package, CompilationUnit, Class, Field, Method and Call, connected by to-many (*) references. Three references carry the stereotype «fragments»; Call is marked «relation».]
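A declaration of this kind can be read as a lookup table that decides, per reference, where the reference targets are stored. The following Python sketch illustrates the idea with a plain dictionary standing in for the key-value store. The reference names `calls` and `parameters`, the key layout and both function names are hypothetical; none of them are taken from EMF-Fragments.

```python
# Hypothetical declaration: (class, reference) -> representation.
# References that are not listed are stored as part of their source object.
DECLARATION = {
    ("Method", "calls"): "relation",
}


def store_object(store, oid, cls, attributes, references):
    """Write one object; each reference goes where the declaration says."""
    record = dict(attributes)
    for name, targets in references.items():
        if DECLARATION.get((cls, name)) == "relation":
            # Relation: one store entry per reference target,
            # addressable without loading the source object.
            for index, target in enumerate(targets):
                store[(oid, name, index)] = target
            store[(oid, name, "size")] = len(targets)
        else:
            # Part of source: the targets travel with the source object.
            record[name] = list(targets)
    store[oid] = record


def get_reference(store, oid, cls, name, index):
    """Read a single reference target from wherever it was stored."""
    if DECLARATION.get((cls, name)) == "relation":
        return store[(oid, name, index)]
    return store[oid][name][index]


store = {}
store_object(store, "m1", "Method", {"name": "run"},
             {"calls": ["m2", "m3"], "parameters": ["p1"]})
```

With the relation representation, reading one target touches one small entry; with part-of-source, the whole source record (including all targets) is read first.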
Implementation: Architecture
[Class diagram: architecture of EMF-Fragments, distinguishing the visible API, EMF-Fragments classes and regular EMF classes]
FragmentedModel extends Resource
ResourceSet
FObject extends EObject «reflective feature delegation»
FStore extends EStore «singleton, stateless»
Fragment extends Resource
FInternalObject extends DynamicEObject
URIHandler
DataStore (database)
EList, EObjectEList, FValueSetList
Associations between the classes are marked «derived» and «delegates».
Applications: Mining and Analyzing Software Repositories
■ Software repositories contain more information than the current software code:
  ■ “developers who changed class/method/statement X also changed class/method/statement Y”
  ■ this information leads to knowledge about dependencies that cannot be determined through static or even dynamic analysis
  ■ this can be used to
    • predict/find bugs
    • understand/improve the code-base
  ■ dependency information should be stored as relational data
■ When a piece of software evolves, its metrics change. Such dynamic metrics describe software better than static code metrics. This could lead to a better assessment of methodologies or a better understanding of software engineering in general.
Applications: Mining and Analyzing Software Repositories
■ JGit: Java implementation of the Git version control system
■ MoDisco: Reverse engineering framework for Eclipse Java projects based on EMF
■ EMF-Compare: Determines matches and differences between models
■ EMF-Fragments: My own framework for large models
■ over 300 Git repositories with Eclipse plug-ins that constitute the whole Eclipse Foundation source base as “example” data-set
Applications: Model of a Software Repository
[Diagram: a software repository as model and metamodel. Model: revisions B1.R1 to B1.R4 and B2.R1 to B2.R2, each with its compilation units (A, B, C, D). Metamodel: Repository, Revision (prev/next) and Diff from JGit; CompilationUnit, Model, Package, Class, ... from MoDisco; further references include usageInPackageAccess, package and extends. References carry the stereotypes «fragmentation», «relation» or «relation,fragmentation».]
Summary
■ Choosing the right representation makes a difference
■ Meta-model-based declaration of representations works (might not be good enough)
■ There are applications that can benefit from different representations
Objects (columns) vs. references (rows):
                  Object-by-object    Fragments
Part-of-source    Morsa, (Java)       XMI, EMF-Frag
Relations         CDO                 ?
Backup
Possible Approaches: Different Target Platforms
[Diagram: target platforms for persisting models]
Schemas / XML: XMI, XMI+Resources
ER-Schemas / Relational Data (ACID, structured data): ORM
Big Data, Graphs (BASE; CAP-Theorem [1])
ER-Schemas / Big Relational Data (BASE, structured data): ORM? [2]
[1] Eric A. Brewer: Towards robust distributed systems; 19th ACM Symposium on Principles of Distributed Computing, 2000
[2] K. Barmpis and D. S. Kolovos: Comparative Analysis of Data Persistence Technologies for Large-Scale Models; XM 2012
Possible Approaches: Different Types of Mapping
[Diagram: types of mapping onto ER-Schemas / Relational Data and onto big-data stores]
per object mapping (relational data): fast query, slow traversal, slow entry, (fine transactions)
per object mapping (big data) [1]: fast query, slow traversal, slow entry, (fine transactions)
fragmentation (big data): slow query, fast traversal, fast entry, (coarse transactions)
[1] Javier Espinazo-Pagán, Jesús Sánchez Cuadrado, Jesús García Molina: Morsa, A Scalable Approach for Persisting and Accessing Large Models; MoDELS 2011
Fragmentation: Types of references
■ organizing large artifacts in different resources is already implemented in EMF
■ resources are loaded if necessary; objects in unloaded resources are represented by proxy objects
■ objects in different resources (like all related objects) are related through references; therefore, models are fragmented along references
■ EMF-Fragments automatically fragments large models based on annotations in the meta-model
■ resources are identified via URIs and can be serialized (e.g. as XMI); therefore, resources can be stored in a key-value store
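The combination of fragmenting references, URIs as keys and proxies that load on demand can be sketched in a few lines. The Python sketch below is an illustration under simplifying assumptions, not the EMF-Fragments implementation: a dictionary stands in for the key-value store, JSON stands in for XMI, and the feature names in `FRAGMENTING`, the URI scheme and the class `LazyObject` are hypothetical.

```python
import json

# Hypothetical annotation: containment features that start a new fragment.
FRAGMENTING = {"packages", "units"}


def save(obj, uri, store):
    """Serialize obj under uri; children behind fragmenting features
    are written to their own store entries."""
    record = {}
    for feature, value in obj.items():
        if feature in FRAGMENTING:
            refs = []
            for index, child in enumerate(value):
                child_uri = f"{uri}/{feature}.{index}"
                save(child, child_uri, store)
                refs.append(child_uri)  # only the URI stays in the parent
            record[feature] = refs
        else:
            record[feature] = value  # regular content stays in this fragment
    store[uri] = json.dumps(record)


class LazyObject:
    """Proxy that loads its fragment from the store on first access."""

    def __init__(self, uri, store, loaded):
        self.uri, self.store, self.loaded = uri, store, loaded
        self._record = None

    def get(self, feature):
        if self._record is None:
            self._record = json.loads(self.store[self.uri])
            self.loaded.append(self.uri)  # trace of fragment loads
        value = self._record[feature]
        if feature in FRAGMENTING:
            return [LazyObject(u, self.store, self.loaded) for u in value]
        return value


model = {"name": "project", "packages": [
    {"name": "a", "units": [{"name": "A.java", "classes": [{"name": "A"}]}]},
    {"name": "b", "units": []},
]}
store = {}
save(model, "model", store)

loaded = []
root = LazyObject("model", store, loaded)
packages = root.get("packages")        # loads only the root fragment
first_name = packages[0].get("name")   # loads the fragment of package "a"
```

In this example the store ends up with four entries, and reading the name of the first package loads two of them; the compilation unit and the second package stay unloaded.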
Fragmentation: Types of references
[Diagram: three kinds of to-many (*) references]
■ normal references
■ fragmenting references («fragments»)
■ large value sets
Applications
■ HWL sensor and network operation data (or experiment data in general)
  ■ realtime persistence required ➜ fast data entry
  ■ hierarchical structured data (different sensors and other data sources) ➜ meta-modeling
  ■ queries for experiments, sensors, specific time periods ➜ only coarse simple queries
  ■ traversal of larger sub-trees, mostly applications based on data aggregation
  ■ actual demand for big-data depends on size of sensor network ➜ scalability
■ CityGML models (or geo-spatial data in general)
  ■ standardized as XML-schemas ➜ XML based data
  ■ special proprietary indexes (e.g. spatial indexes like R-trees) and corresponding queries
  ■ rather query intense applications
  ■ actual demand for big-data depends on LOD (level of detail) of the models ➜ scalability
■ Software Engineering
  ■ Code/Model Version Control
  ■ Mining Software Repositories (MSR)
  ■ revisions of AST-trees and differences between AST-trees ➜ existing meta-model based frameworks (e.g. designed for reverse engineering purposes)
  ■ large number of revisions causes many large value sets
  ■ queries for revisions, compilation-units ➜ rather coarse queries
  ■ aggregations and statistics ➜ can be expressed in an OCL-like language
  ■ immediate demand for processing in (at least smaller) clusters
  ■ has to be mixed with relational data for some applications
Applications: Scientific Data
[Diagram: data from the wireless sensor network (WSN), available via Click and as XML documents, is turned into models by xml-to-model and text-to-model transformations]
Applications: CityGML
■ XML-based standard ➜ meta-models can be generated (1-to-1 mapping)
■ different standards define XML-schemas that extend each other: GML⇽CityGML⇽extensions
■ transparent use of spatial indexes
■ map onto existing platforms (e.g. SpatialHadoop)
■ use existing implementations and persist into the key-value store
■ extensions to CityGML can be used to reference CityGML-models as spatial context for sensor data
Research Overview
[Diagram: a DATA ANALYSIS FRAMEWORK connecting WIRELESS SENSOR NETWORKS and GEO INFORMATION SYSTEMS]
Wireless sensor networks: sensor data, heterogeneous networks, mesh-networks, cellular-networks
Geo information systems: spatial data, regular databases, spatial databases
Data analysis framework: distributed data stores, distributed analysis, data homogenisation, domain-specific analysis languages
HWL: Commodity Hardware
‣ 120+ nodes
‣ indoor and outdoor
‣ dense and sparse
‣ short and long links
‣ stationary and mobile nodes
[Map of the test site: node positions 1 to 4 and 6 to 9, a 10 m scale, and the directions "towards Groß-Berliner Damm" and "towards the institute"]
Markus Scheidgen: HWL – A High-Performance Wireless Sensor Research Network
Experiments: The Test Site
§ simplest case: two-lane, newly paved road
§ spatially equally distributed nodes on both sides of the road
§ 2x5 nodes
§ homogeneous test-bed: same nodes, equally calibrated, same stone ground
§ one camera to record control data
[Plot: single-sided amplitude spectrum, |Y(f)| (0 to 450) over frequency (0 to 200 Hz), for channels X, Y and Z]
[Plot: time signal of all 3 channels, accelerator value (−2 to 2) over time sample (1/400 sec, 0 to 3000), for channels X, Y and Z]
Experiments: Example Data
[Figures: example data, shown as amplitudes and as frequencies]
Experiment: Algorithm
§ Similar to earthquake detection: comparison of short and long moving averages (S=0.2s, L=4s)
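\[ s_x = x\text{th acceleration value} \tag{1} \]
\[ \mathrm{mavg}(s_x, W) = \frac{\sum_{i=x-W}^{x} s_i}{W} \tag{2} \]
\[ s_x = |s_x - \mathrm{avg}(s_x, L)| \tag{3} \]
\[ w^S_x = \mathrm{mavg}(s_x, S) \tag{4} \]
\[ w^L_x = \mathrm{mavg}(s_x, L) \tag{5} \]
\[ \Delta w = w^S_x - w^L_x \tag{6} \]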
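The comparison of a short and a long moving average can be sketched as follows. This is a minimal Python illustration, not the code used in the experiments: the trailing window that is shorter at the start of the signal, the function names and the `threshold` parameter are assumptions. With the 400 Hz sampling rate shown on the time axis of the example data, S = 0.2 s and L = 4 s would correspond to windows of 80 and 1600 samples.

```python
from collections import deque


def moving_average(samples, window):
    """Trailing moving average over the last `window` samples.
    Assumption: at the start of the signal the window is simply shorter."""
    averages = []
    buffer = deque()
    total = 0.0
    for value in samples:
        buffer.append(value)
        total += value
        if len(buffer) > window:
            total -= buffer.popleft()
        averages.append(total / len(buffer))
    return averages


def detect(samples, short_window, long_window, threshold):
    """Return the sample indices where the short-term average exceeds
    the long-term average by more than `threshold` (an assumed parameter)."""
    # Step (3): subtract the long-term average and rectify.
    baseline = moving_average(samples, long_window)
    rectified = [abs(s - b) for s, b in zip(samples, baseline)]
    # Steps (4) to (6): short and long moving averages and their difference.
    w_short = moving_average(rectified, short_window)
    w_long = moving_average(rectified, long_window)
    delta = [a - b for a, b in zip(w_short, w_long)]
    return [i for i, d in enumerate(delta) if d > threshold]


# Synthetic signal: silence, a short oscillating burst, silence.
signal = [0.0] * 100 + [1.0, -1.0] * 10 + [0.0] * 100
events = detect(signal, short_window=4, long_window=40, threshold=0.3)
```

The short average reacts quickly to the burst while the long average lags behind, so the difference is large only while the burst is recent.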
Data Management
[Diagram: technological infrastructure (zigbee, wifi, cellular, internet) and logical infrastructure (sensors, information, visualization, actions)]
[Diagram: the networks (zigbee, wifi, cellular, internet) and the software engineering layers built on them, ordered from generic to domain specific]
Layers shown: CPUs, radios, hard drives; machine code, network protocols; programming languages, distributed programming models, data bases, data representation; algorithms, processes; information/knowledge; DSL
Complex Data Types
➡ complex data structures
➡ lots of links between data objects
➡ evolving structures
➡ requires a type safe programming environment that promotes re-use
Large Amounts of Data
➡ a certain amount of data needs to be stored per second (HWL: 120 nodes)
   ~140×10^3 data objects per second
   ~7 MB/s serialized
➡ a certain amount of data needs to be stored all together (24h)
   ~12×10^9 data objects
   ~600 GB serialized
➡ Data analysis must complete in reasonable time; for live applications, in real time.
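The 24h totals follow from the per-second rates. The short Python calculation below reproduces them, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and a constant data rate over the whole day; the variable names are only illustrative.

```python
objects_per_second = 140e3      # ~140×10^3 data objects per second (from the slide)
bytes_per_second = 7e6          # ~7 MB/s serialized (from the slide)
seconds_per_day = 24 * 60 * 60  # 86,400 s

objects_per_day = objects_per_second * seconds_per_day   # about 12×10^9
bytes_per_day = bytes_per_second * seconds_per_day       # about 600 GB
bytes_per_object = bytes_per_second / objects_per_second # average serialized size

print(f"{objects_per_day / 1e9:.1f}e9 objects per day")
print(f"{bytes_per_day / 1e9:.0f} GB per day")
print(f"{bytes_per_object:.0f} bytes per object")
```

The rates also imply an average serialized size of about 50 bytes per data object.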
From Click to ClickWatch
[Diagram: Click API software with Elements, their Handlers, a CompoundHandler and the Network Interface]
Complex Data Types: Meta-Modeling
This [ ] happens all the time in software modeling
state charts, class diagrams, MSCs, OCL
context Foo
self.properties->foreach(a | a.x != a.y)
Eclipse Modeling Framework (EMF)
➡ Distributed storage and links between different types of data are only a simple extension of existing technology: multi-resource persistence is already implemented
Large Amounts of Data: Problem Statement
[Diagram: technology stack, from generic to domain specific]
“Share Nothing” nodes (cluster, ad hoc network)
DFS (HDFS)
key-value-store [1] (HBase)
map/reduce [2] (Hadoop)
hierarchical data (XML, OGC standards)
data series (sensor data)
signal analysis, statistics, sensor-fusion
[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A distributed storage system for structured data (awarded best paper!). In Brian N. Bershad and Jeffrey C. Mogul, editors, OSDI, pages 205–218. USENIX Association, 2006.
[2] Jeffrey Dean and Sanjay Ghemawat. Map/reduce: Simplified data processing on large clusters. In OSDI, pages 137–150. USENIX Association, 2004.
Large Amounts of Data: Approach
[Diagram: technology stack of the approach]
map/reduce (Hadoop)
“Share Nothing” nodes (cluster, ad hoc network)
DFS (HDFS)
key-value-store (HBase)
hierarchical data (XML, OGC standards)
data series (sensor data)
signal analysis, statistics, sensor-fusion
meta-model
structured data, model transformations