Aules d’Empresa 2011: DEX
Contents
• Graph database
• Motivation
• DEX
• Experiments
Graph database
What is a graph database?
Data and schema are represented by graphs.
• Nodes, edges, and properties.
Data manipulation is expressed as graph operations.
Integrity constraints enforce graph consistency.
Motivation
Trends in current data sets:
• A higher degree of connectivity among entities.
• A higher degree of complexity of data models.
• Decentralization of data generation: users provide content.
Requirements:
• Queries with different flavors: structural queries (not based on the schema) and link analysis.
• Management of unstructured data.
• Flexible schemas.
Scenarios
• Social networks: MySpace, Facebook, Flickr …
• Information networks: bibliographic databases (DBLP, Scopus …) and on-line encyclopedias (Wikipedia …).
• Technological networks: electric power grids, airline routes, telephone networks …
• Biological networks: genomics, chemical structures …
Why not RDBMS?
The classical relational model is:
• Inefficient for unstructured data or flexible schemas: the schema is fixed in advance and based on relations (tables).
• Inefficient for structural queries: they require intensive use of join operations.
DEX, a graph database
DEX is a programming library for managing graph databases.
It focuses on:
• Very large datasets.
• High-performance query processing.
Basic concepts
A programming library for managing persistent and temporary graphs.
Data model: a typed, attributed, directed multigraph.
• Node and edge instances belong to a type (label).
• Node and edge instances have attribute values.
• Edges can be directed or undirected.
• Multiple edges may connect the same two nodes.
Types of edges:
• Materialized: directed or undirected.
• Virtual: constrained by the values of two attributes (foreign keys); used only for navigation.
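The data model above can be illustrated with a small in-memory sketch. This is not the DEX API: the class `Graph`, its methods, and the sample types (`PERSON`, `CITY`, `KNOWS`, `EMAILED`) are hypothetical names chosen for illustration only. It shows typed and attributed nodes and edges, parallel edges, an endpoint integrity check, and a virtual edge resolved by matching attribute values.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Graph:
    """Toy typed, attributed multigraph (illustrative only, not the DEX API)."""
    node_type: dict = field(default_factory=dict)   # node id -> type label
    node_attrs: dict = field(default_factory=dict)  # node id -> {attribute: value}
    edges: dict = field(default_factory=dict)       # edge id -> (type, tail, head, directed)
    edge_attrs: dict = field(default_factory=dict)  # edge id -> {attribute: value}
    _ids: count = field(default_factory=count)      # shared id generator

    def add_node(self, type_label, **attrs):
        nid = next(self._ids)
        self.node_type[nid] = type_label
        self.node_attrs[nid] = attrs
        return nid

    def add_edge(self, type_label, tail, head, directed=True, **attrs):
        # Integrity constraint: both endpoints must already exist.
        if tail not in self.node_type or head not in self.node_type:
            raise KeyError("edge endpoints must be existing nodes")
        eid = next(self._ids)
        self.edges[eid] = (type_label, tail, head, directed)
        self.edge_attrs[eid] = attrs
        return eid

    def neighbors(self, node, edge_type):
        """Nodes reachable from `node` over one materialized edge of a type."""
        found = set()
        for etype, tail, head, directed in self.edges.values():
            if etype != edge_type:
                continue
            if tail == node:
                found.add(head)
            elif head == node and not directed:
                found.add(tail)  # undirected edges are navigable both ways
        return found

    def virtual_neighbors(self, node, attr, target_type, target_attr):
        """Virtual edge: nodes whose attribute equals this node's attribute."""
        value = self.node_attrs[node].get(attr)
        if value is None:
            return set()
        return {n for n, t in self.node_type.items()
                if t == target_type and n != node
                and self.node_attrs[n].get(target_attr) == value}


g = Graph()
alice = g.add_node("PERSON", name="Alice", city_id=1)
bob = g.add_node("PERSON", name="Bob", city_id=2)
bcn = g.add_node("CITY", id=1, name="Barcelona")
g.add_edge("KNOWS", alice, bob, directed=False, since=2009)
g.add_edge("EMAILED", alice, bob)
g.add_edge("EMAILED", alice, bob)  # multigraph: a second, parallel edge
```

Nothing is stored for the virtual edge: `g.virtual_neighbors(alice, "city_id", "CITY", "id")` is computed on demand from attribute values, which is one way to read "just for navigation".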
A graph model
Software architecture
Software architecture
Java library: jdex.jar (public API).
Native library:
• Linux: libjdex.so
• Windows: jdex.dll
System requirements:
• Java Runtime Environment, v1.5 or higher.
• Operating system: Windows (32 bits) or Linux (32 and 64 bits).
Application architecture
[Figure: two application layouts across the Presentation, Network, Application Logic, and Data layers.
Desktop application: Java Swing application, DEX API, DEX, graphs, data sources.
Web application: browser (HTML + JavaScript), Internet, servlet, DEX API, DEX, graphs, data sources.
Arrow labels: "Query" and "Load and Query".]
Experiments
Five categories:
• Bulk load performance.
• Core operations performance and memory usage.
• Scalability.
• Comparison with other approaches: relational (MySQL) and OIM.
• Query performance analysis.
Different datasets:
• Wikipedia.
• IMDb, the Internet Movie Database.
• XMark, a standard and scalable benchmark for XML.
• LUBM, a benchmark to evaluate the performance of RDF repositories.
• R-MAT, a synthetic scale-free network.
Load performance
                            IMDb     Wikipedia   XMark     LUBM
DbGraph (GB)                0.40     9.69        12.19     6.13
Ratio DbGraph/raw data      2.64     3.86        4.38      1.14
Objects (millions)          14.65    486.45      343.77    215.06
Time (hours)                0.08     26.55       2.61      4.22
Speed (objs/sec)            50349    4885        36521     10074
Memory in bitmaps (%)       39.58    39.12       33.32     34.11
Memory in maps (%)          60.42    60.88       66.68     65.89
Test platform:
• Single CPU with 4096 KB of cache, 2 GB of RAM, and 80 GB of disk.
• Operating system: Linux Debian etch 4.0.
• DEX buffer pool: 1.5 GB max.
Operations performance and memory usage
                                           Bitmaps                   Maps
Query              Time (s)   Results      64K pages   Operations    64K pages   Operations
Q1 – count         0.0029     16986429     1           1             4           1
Q2 – scan          3.2000     16986429     66          16986430      4           3
Q3 – select        0.8000     4583294      20          4583295       8           3
Q4 – projection    33.2000    4583294      20          4583295       2156        18333175
Q5 – combine       0.0050     1            22          2             38          6
Q6 – explode       0.0057     462          2           463           18          929
Q7 – values        0.0110     253          12          253           7           781
Benchmark: Wikipedia with more than 200 million nodes and edges.
Scalability
XMark over 5 different scale factors, ranging from 0.1 (110 MB) to 25 (2.78 GB).
[Figure: bar chart of the scale ratio (y-axis, 0 to 10) for queries Q1 to Q20 at sf=0.1, sf=1, sf=5, sf=10, and sf=25; one off-scale bar is labeled 46.35.]
                      SF=0.1    SF=1      SF=5       SF=10      SF=25
Graph size (MB)       63.9      546.3     2596.9     5093.9     12480.4
I/O (MB)              0.0       0.0       40.5       185.7      890.0
Objects (millions)    1.38      13.71     68.75      137.47     343.77
Load (secs.)          16.84     172.15    928.6      1934.14    5121.17
Optimize (secs.)      1.74      22.54     243.54     807.44     4292.38
Total (secs.)         18.58     194.69    1172.14    2741.58    9413.55
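The Total row is the sum of the Load and Optimize rows. The following sketch recomputes it from those two rows and derives a load throughput in objects per second. The throughput definition (objects divided by total time) and the names `summarize` and `rows` are assumptions made for this illustration, not something the slides define.

```python
# Figures copied from the Objects, Load, and Optimize rows of the XMark table.
scale_factors = [0.1, 1, 5, 10, 25]
objects_millions = [1.38, 13.71, 68.75, 137.47, 343.77]
load_secs = [16.84, 172.15, 928.6, 1934.14, 5121.17]
optimize_secs = [1.74, 22.54, 243.54, 807.44, 4292.38]


def summarize(objects_m, load, optimize):
    """Return (total seconds, objects per second) for one scale factor."""
    total = round(load + optimize, 2)
    throughput = objects_m * 1e6 / total  # assumed definition of throughput
    return total, throughput


rows = [summarize(o, l, z)
        for o, l, z in zip(objects_millions, load_secs, optimize_secs)]

for sf, (total, rate) in zip(scale_factors, rows):
    print(f"SF={sf:<4} total={total:>8.2f} s  throughput={rate:>9.0f} objects/s")
```

Read this way, the total time grows faster than the number of objects, so throughput falls as the scale factor increases.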
R-MAT scalability
Scale   Nodes   Edges    Load (sec)   Edges/sec   GB     Q1         %visited   Traversals   Trav/sec
25      29M     268M     4372         61398.78    11     1465.82    85         529M         361K
26      58M     536M     9499         56518.68    21     3134       85         1058M        337K
27      116M    1073M    20336        52800.05    41.0   6888.90    85.01      2118M        307K
28      230M    2147M    54146        39660.98    83     14323.90   84.62      4236M        295K
29      457M    4294M    225202       19071.62    162
Comparison with Other Approaches
Comparison with a relational database (MySQL) and with an Oriented Incidence Matrix (OIM).
Query              MySQL       OIM         DEX
Q1 – count         20.380      17.347      0.001
Q2 – scan          32.760      174.635     3.137
Q3 – select        7.340       5.430       0.837
Q4 – projection    17.340      43.699      33.192
Q5 – combine       0.740       2.612       0.005
Q6 – explode       0.070       202.070     0.006
Q7 – values        12.1280     20.774      0.011
Q8 – hub           > 3 hours   > 3 hours   624.681

                    MySQL    OIM      DEX
Data (GB)           27.36    54       9.69
Ratio overhead      10.9     21.51    3.96
Load time (secs)    52891    17453    95579
Comparison with Neo4j
                 Neo4j       DEX 4.0
Size (GB)        82          16.98
Load time (h)    8.22        2.25
Q1 (s)           32230.00    118.93
Q2 (s)           24832.00    205.97
Q3 (s)           2045.00     10.68
Q4 (s)           34882.00    146.77
Q5 (s)           32539.00    141.06
Q6 (s)           > 1 week    7518.06
Queries:
• Query 1: max-outdegree + SPT
• Query 2: paper recommender (2-hops)
• Query 3: pattern matching
• Query 4: for each language, number of papers and images
• Query 5: for each paper, materialize number of images
• Query 6: delete papers with no images
Another comparison with a RDBMS
Datasets:
• D1: synthetic data generated with R-MAT, scale factor 16 (524K edges).
• D2: synthetic data generated with R-MAT, scale factor 18 (2M edges).
• D1 and D2 contain only nodes and edges, with no attributes.
• R-MAT generates scale-free networks.
Queries:
• Q1: 3 hops from a given node.
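The slides do not show how the R-MAT datasets were generated. As a rough illustration of the general R-MAT idea, the sketch below builds each edge by choosing one quadrant of the adjacency matrix per bit of the node id. The quadrant probabilities `a`, `b`, `c`, the function name `rmat_edges`, and the edge factor of 8 are assumptions, not values taken from the slides; an edge factor of 8 is at least consistent with the figures above, since 2^16 × 8 = 524,288.

```python
import random
from collections import Counter


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, seed=0):
    """Yield (src, dst) pairs over 2**scale node ids using R-MAT recursion.

    a, b, c are the assumed probabilities of the top-left, top-right, and
    bottom-left quadrants; the bottom-right quadrant gets the remainder.
    """
    rng = random.Random(seed)
    for _ in range(num_edges):
        src = dst = 0
        for _level in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass            # top-left: both bits stay 0
            elif r < a + b:
                dst |= 1        # top-right
            elif r < a + b + c:
                src |= 1        # bottom-left
            else:
                src |= 1        # bottom-right
                dst |= 1
        yield src, dst


# Small run: scale 10 with an assumed edge factor of 8.
SCALE = 10
edges = list(rmat_edges(SCALE, 8 * 2 ** SCALE))
out_degree = Counter(src for src, _ in edges)
print("edges:", len(edges), "max out-degree:", max(out_degree.values()))
```

Because the quadrants are unequal, a few node ids collect far more edges than the average, which is the skew that makes the 3-hop query expensive from well-connected start nodes.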
Another comparison with RDBMS
Test:
• Execute Q1 for 5 specific nodes.
• These query nodes have a significant number of outgoing edges: some tens at scale factor 16 and some hundreds at scale factor 18.
Results:
• Scale factor 16: about 160K nodes reached.
• Scale factor 18: about 600K nodes reached.
Another comparison with RDBMS
Schema:
  CREATE TABLE `edges` (
    `src` int(11) NOT NULL,
    `dst` int(11) NOT NULL,
    INDEX `srcI` (`src`) USING BTREE,
    INDEX `dstI` (`dst`) USING BTREE
  ) ENGINE=InnoDB;
Query (node stands for the id of the start node):
  SELECT DISTINCT c.dst
  FROM edges AS a, edges AS b, edges AS c
  WHERE a.dst = b.src AND b.dst = c.src AND a.src = node;
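To make the semantics of the query concrete, the sketch below runs an equivalent three-way self-join on a tiny edge list and compares it with a plain frontier expansion over adjacency sets. It uses SQLite from the Python standard library instead of MySQL/InnoDB, so the index syntax is adapted; the sample edges and the function names are hypothetical, and the traversal is a generic illustration rather than the way DEX evaluates the query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy graph as (src, dst) pairs.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (6, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER NOT NULL, dst INTEGER NOT NULL)")
conn.execute("CREATE INDEX srcI ON edges (src)")
conn.execute("CREATE INDEX dstI ON edges (dst)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)

THREE_HOPS = """
    SELECT DISTINCT c.dst
    FROM edges AS a
    JOIN edges AS b ON a.dst = b.src
    JOIN edges AS c ON b.dst = c.src
    WHERE a.src = ?
"""


def three_hops_sql(node):
    """Nodes at the end of some 3-edge path from `node`, via the self-join."""
    return {row[0] for row in conn.execute(THREE_HOPS, (node,))}


adjacency = defaultdict(set)
for src, dst in edges:
    adjacency[src].add(dst)


def three_hops_traversal(node, hops=3):
    """Same result by expanding the frontier one hop at a time."""
    frontier = {node}
    for _ in range(hops):
        frontier = {nxt for cur in frontier for nxt in adjacency.get(cur, ())}
    return frontier


print(three_hops_sql(1), three_hops_traversal(1))
```

Both functions return the end points of paths with exactly three edges. Like the SQL query, the traversal does not exclude nodes already seen at a shorter distance.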
Results
Test platform:
• MacBook, 2.4 GHz Intel Core 2 Duo (Mac OS X 10.6).
• Up to 1 GB of memory for the MySQL buffer pool.
Results:
Test T1       MySQL      DEX
Dataset D1    1m 57s     9s
Dataset D2    13m 36s    34s
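The relative speedup follows directly from the two timings per dataset. A minimal sketch of that arithmetic, where the helper `to_seconds` and the dictionary layout are illustrative names only:

```python
import re


def to_seconds(text):
    """Convert a duration such as '1m 57s' or '34s' to whole seconds."""
    minutes = re.search(r"(\d+)m", text)
    seconds = re.search(r"(\d+)s", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


# (MySQL, DEX) timings copied from the results table.
timings = {"D1": ("1m 57s", "9s"), "D2": ("13m 36s", "34s")}

speedups = {name: to_seconds(mysql) / to_seconds(dex)
            for name, (mysql, dex) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: DEX is {factor:.0f}x faster than MySQL")
```

With the figures in the table, this gives a factor of 13 for D1 (117 s against 9 s) and 24 for D2 (816 s against 34 s).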
Any questions?
DAMA Group Web Site: www.dama.upc.edu
Sparsity Web Site: www.sparsity-technologies.com