pradeep kumar, h. howie huang the george washington …...5,6 1,2 3,4 2,4 0,1 0,3 1,3 -2,4 2,5 3,6 t...
TRANSCRIPT
GraphOne: A Data Store for Real-time
Analytics on Evolving Graphs
Pradeep Kumar, H. Howie Huang
The George Washington University
Graph Applications: Big Data and Scientific Applications
2
▪ Basic Building Block for Modern Trade
▪ Social Network, Navigation, E-Commerce, etc.
▪ Metadata Store for Big Data
Li et al. Science’04
▪ Scientific Applications▪ Protein-Protein Interaction, Metabolic Network, etc.
uploads
Graph Applications: Big Data and Scientific Applications
3
▪ Basic Building Block for Modern Trade
▪ Social Network, Navigation, E-Commerce, etc.
▪ Metadata Store for Big Data
Li et al. Science’04
▪ And Many More▪ Identification of Vulnerable code, Cyber Attack
▪ Knowledge Graph
▪ Key for Data Governance, e.g. GDPR
▪ Scientific Applications▪ Protein-Protein Interaction, Metabolic Network, etc.
Definitions
▪ Topological Graph: Static (or slowly changing) Relationships▪ Vertices: Switches and Computers
▪ Edges: Network Bandwidth between Vertices
▪ Batch Analytics on Topological Graph▪ Network Infrastructure Management
▪ PageRank, Breadth First Traversal, etc.
4
(b) A Stream Graph after Time t: Useful for finding all the infected machines
t0 = t - δ
t2= t + δ
t1= t2 + δ
t3= t2 - δ
t5= t2+ δ
t3= t - δ
t
(a) A Computer Network Showing Articulation Points
Articulation Points
* Grades’15 Paper from Pacific Northwest National Laboratory; LANL Net-flow dataset, 2015 and 2017; computers and Security’15, …
▪ Streaming Graphs: Faster Arrival Rate▪ Vertices: Users, Processes, Computers + Ports, etc.
▪ Edges: Network Data Flow, Authentication Logs, Process Events
▪ Stream Analytics on Streaming Graph▪ Identification of Security Risks*
Observation: Prior Works have Specialized/Private Data Stores
5
Research Question: How to perform diverse set of real-time analytics in presence of high arrival rate of graph data?
Solutions Data Ingestion Data Access Type
Graph Databases (For Short Queries)
Fine-grained(e.g., Neo4j)
Whole Data
Dynamic Graph Engines(For Batch Analytics)
Fine-grained(e.g., Stinger)
Whole Data
Coarse-grained(e.g., Graphchi)
Whole Data
Stream Graph Engines(For Stream Analytics)
Coarse-grained(e.g. Naiad)
Streaming Access Of Whole Data
Coarse-grained(e.g. GraphTau)
Streaming Access from a Time Window
GraphOne Both All
Findings:
▪ Specialized/Private Data Stores
▪ Integration of Many Systems
▪ Excessive Data Duplication
▪ Weakest Link Problem
Abstracting Data Store away from Graph Analytics Engine
Data Sources
+
Graph Data StoreStream
Analytics
Actionable Knowledge
Anomaly
Community
Ranking
Batch Analytics
6
Graph Analytics Framework
Graph Data Store
Idea: Abstracting Data Store away from Graph Analytics Engine
Data Sources
+Graph Data Store
GraphOne (FAST’19)
Stream Analytics
Actionable Knowledge
Anomaly
Community
Ranking
Batch Analytics
7
Results:▪ Over Neo4j and SQLite: 2 to 3 orders of higher ingestion rate
▪ Over Stinger: 5.36x more ingestion rate, over 3x speedup for analytics
▪ Over Static Graph Engine: No Pre-processing Cost
Graph Analytics Framework
GraphOne: Background and Architecture
8
(a) Example graph
0,1,w0 3,4,w1 1,4,w2 2,4,w3 1,2,w4 4,5,w5
1,w00
1
2
3
4
5
0,w0 4,w22,w4
1,w4 4,w3
4,w1
3,w1 2,w31,w2 5,w5
4,w5
Vertex Array Edge Arrays
2
3
1
54
0w0 w4
w2 w3
w1 w5
Opportunity: Hybrid Store
(c) Adjacency List
▪ Indexed Data Structure
▪ Coarse-grained Versioning
(b) Edge List
▪ Log: Captures Temporal Ordering
▪ Fine-grained Versioning is Free
Stream Analytics
Batch Analytics
Graph DataUpdates
Data Management
Hybrid Store
Archiving
Adjacency Store
New log Old log
Circular Edge Log
Logging
Data Durability NVMe
Fin
e-g
rain
ed
Coarse-grained
Data Visibility
Two New Abstractions:
Data Visibility: Analytics
choose their Ingestion type
GraphOne: Background and Architecture
9
(a) Example graph
0,1,w0 3,4,w1 1,4,w2 2,4,w3 1,2,w4 4,5,w5
1,w00
1
2
3
4
5
0,w0 4,w22,w4
1,w4 4,w3
4,w1
3,w1 2,w31,w2 5,w5
4,w5
Vertex Array Edge Arrays
2
3
1
54
0w0 w4
w2 w3
w1 w5
Opportunity: Hybrid Store
(c) Adjacency List
▪ Indexed Data Structure
▪ Coarse-grained Versioning
(b) Edge List
▪ Log: Captures Temporal Ordering
▪ Fine-grained Versioning is Free
Two New Abstractions:
Data Visibility: Analytics
choose their Ingestion type
Graph View:
▪ Static View: Batch Analytics
▪ Stream View: Stream Analytics
Data Management
Hybrid Store
GraphView(Data Visibility)
Static View
Archiving
Adjacency Store
Stream View
New log Old log
Circular Edge Log
Stream Analytics
Batch Analytics
Logging
Data Durability NVMe
Graph DataUpdates
Fin
e-g
rain
ed
Coarse-grained
Hybrid Store Details
10
Edge Deletion at t7
Edge Arrays
Vertex Array
Multi-version Degree Array
3 2
6
5
4
1 4
2
0
1
2
3
4
5
6
0
1
2
3
4
5
6
2,S0
1,S0
1,S0
2,S0
1,S0
1,S0
5,6 1,2 3,4 2,4
t0 t1 t2 t3
S0, 4Global
Snapshot List
Circular Edge Log2
3
1
54
0
6
t4t1
t2
t3
t0
t5t6
t8
t9
S1, 8
0,1 0,3 1,3 -2,4
t4 t5 t6 t7
Adjacency List
Hybrid Store Details
11
Edge Deletion at t7
Vertex Array
Multi-version Degree Array
Adjacency List
3 2
6
5
4
1 4
2
10
1
2
3
4
5
6
3
0
0 1
2,S10
1
2
3
4
5
6
3,S1
3,S1
2,S0
1,S0
1,S0
2,S0
1,S0
1,S0
-1
-13,S1
3,S1
3
S1, 8
S0, 4Global
Snapshot List
5,6 1,2 3,4 2,4 0,1 0,3 1,3 -2,4 2,5 3,6
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
Non-archived
edgesCircular Edge Log2
3
1
54
0
6
t4t1
t2
t3
t0
t5t6
t8
t9
Stream View
Static View
Whole Data Access
Data Access from a Time-Window
GraphView
handle create-static-view(flags)status delete-static-view (handle)count get-nebr-degree(handle, vertex-id)count get-nebrs(handle, vertex-id, ptr)
status update-view(handle)bool has-vertex-changed(handle, vertex-id)
Edge Arrays
Hybrid Store: Optimizing the Archiving Phase
12
edges with source vertex range [v0, v1)
Adjacency list Data
edges with sourcevertex range [v1, v2)
v0
v1
v2
Non-atomicArchiving
Non-atomicArchiving
Tail Head
Edge Log
Local Buffers
Local Buffers
EdgeSharding
0 20 40 60 80 100
Archiving with EdgeSharding
Archiving with AtomicInstruction
Run Time (sec)
Sharding Filling Degree Filling Edge Array
Archiving Phase
Machine: 2 Intel Xeon CPU socket▪ 14 cores each, ▪ Hyperthreading enabledDataset: Kron-28 Graph▪ 256 Million Vertex, ▪ 4 Billion Edges 0
10
20
30
40
1 2 4 8 16 32 56
Arc
hiv
ing
Rat
eM
illio
ns
Thread Count
Hybrid Store: Optimizing Memory Usage
▪ Edge Log : Circular
▪ Degree Array: Older Versions Reused
13
OptimizationsChain Count Memory
Needed(GB)
Average Maximum
Baseline System 29.18 65536 148.73
Static GraphOne 0.45 1 33.81
+Cacheline Size allocation 2.96 65536 47.42
+Hub vertex Handling 2.47 3998 45.79
Chainededge arrays
Vertex array
Multi-version Degree array
3 2
6
5
4
1 4
2
10
1
2
3
4
5
6
3
0
0 1
2,S10
1
2
3
4
5
6
3,S1
3,S1
2,S0
1,S0
1,S0
2,S0
1,S0
1,S0
-1
-13,S1
3,S1
3
02468
10
Archiving Rate BFS PageRank 1-Hop
Spe
ed
up
Baseline +Cacheline Sized +Hub Vertex
▪ Edge Arrays: Many Memory Links▪ Cacheline Size Memory Allocation▪ Hub Vertex Handling
020406080100120140
0
0.25
0.5
0.75
1
1.25
Arc
hiv
ing
Rat
e
(Ed
ge/s
)
Mill
ion
s
Alg
ori
thm
Ru
n t
ime
(N
orm
aliz
ed
)
Non-archived Edge Count (Log axis with base 2)
PageRank BFS 1 Hop Archiving Rate
210 212 214 216 218 220 222 224 226 228 230 232
Results: Hybrid Store Composition and Ingestion Rate
▪ Max Non-archived Edge count: ▪ 217 edges
▪ Archiving Threshold: ▪ 216 edges
▪ Archive Rate: ▪ 43.68 Million edges/s
14
▪ Faster Ingestion?
▪ Logging rate:
▪ ~80 Million edges/sec
▪ Higher archiving threshold
Graph Name Vertex Count (in Millions)
Edge Count(in Millions)
Rates (Million Edges/s)
Logging Archiving Ingestion
Twitter (D) 52.58 1963.26 82.62 47.98 66.39
Friendster (D) 68.35 2586.15 82.85 49.32 60.40
Subdomain (D) 101.72 2043.20 82.86 43.43 68.25
Kron-28 (U) 256.00 4096.00 79.23 43.68 52.39
LANL (D) 0.16 1521.19 35.98 28.91 26.99
D = Directed, U = Undirected
Results: Dynamic Graph Systems
15
048
1216
Updates BFS PageRank
Spe
ed
up
Stinger GraphOne
1.0E+0
1.0E+2
1.0E+4
1.0E+6
1.0E+8
Neo4j GraphOne
Up
dat
e R
ate
(e
dge
s/se
c)
1.0E+0
1.0E+2
1.0E+4
1.0E+6
1.0E+8
SQLite GraphOne
Up
dat
e R
ate
(e
dge
s/se
c)
Against Stinger▪ 5.36x Higher Ingestion Rate▪ 12.76x and 3.18x Speedup for BFS and PageRank
Against SQLite and Neo4j: ▪ 2 to 3 Orders of Magnitude Higher Ingestion Rate
Dataset: RMAT Graph▪ 4 Million Vertices, 64 Million Edges▪ 8 bytes weight▪ 40 Million updates, 2.5 Millions deletions
Experimental Setup▪ Two Intel Xeon CPU E5-2683 sockets with 14 Core each▪ 500GB Memory▪ Samsung 950 Pro NVMe SSD, 512GB
Results: Cost of Data Management Over Static Graph Engine
▪ Static GraphOne: For Static Graphs▪ Builds graph in entirety, Like Galois
▪ GraphOne: For Dynamic/Streaming Graphs ▪ Builds graph one edge at a time
16
0
0.25
0.5
0.75
1
1.25
Twitter Friendster Subdomain Kron-28-16 Kron-21-16
Spe
ed
up
(Co
mp
are
d t
o
Stat
ic G
rap
hO
ne
) BFS PageRank 1-Hop
▪ 17% less performance for real-world graphs▪ Avoids pre-processing cost completely
▪ 32.73 second pre-processing cost, 34.12x of BFS
GraphOne: Conclusion
▪ Batch and Stream Analytics on Evolving Graphs
▪ Hybrid Store: Two Abstractions
▪ Data Visibility and GraphView
▪ Optimizations: ▪ Edge Sharding and Memory Allocation
▪ Results:▪ Over Neo4j and SQLite: 2 to 3 orders of higher
ingestion rate
▪ Over Stinger: 5.36x more ingestion rate, over 3x speedup for analytics
▪ Over Static Graph Engine: No Pre-processing Cost
▪ A Real-System you can use
▪ Can convert text/logs/csv/json to graph
17
Data Management
Hybrid Store
GraphView(Data Visibility)
Static View
Archiving
Adjacency Store
Stream View
New log Old log
Circular Edge Log
Stream Analytics
Batch Analytics
Logging
Data Durability NVMe
Graph DataUpdates
Fin
e-g
rain
ed
Coarse-grained
Thank [email protected]