
Post on 16-Jul-2020


RStream: Marrying Relational Algebra with Streaming for Efficient Graph Mining on a Single Machine

Kai Wang (1), Zhiqiang Zuo (2), John Thorpe (1), Tien Quang Nguyen (3), Guoqing Harry Xu (1)

(1) UCLA, (2) Nanjing University, (3) Facebook

Big Graph

• Graph Datasets

• Graph Systems: GraphChi, GridGraph

Graph Analytical Problems

• Graph Computation: iterative value computation

- PageRank

- Connected Component

- Systems: GraphChi (Think Like a Vertex)

• Graph Mining: discover structural patterns

- Frequent Subgraph Mining

- Clique Finding

- Systems: ?

Existing Mining Systems

• Enumerate all possible subgraphs

• For each subgraph, check if it matches the pattern

• Pattern is application-specific (Clique finding, motif counting, frequent subgraph mining)
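The enumerate-then-check loop described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it enumerates vertex subsets rather than the embeddings a real mining system would explore, the pattern is fixed to a k-clique, and the function name `find_cliques` and the toy graph are hypothetical.

```python
from itertools import combinations

def find_cliques(edges, k):
    """Enumerate every k-vertex subset and keep those that form a clique."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    matches = []
    for candidate in combinations(sorted(adj), k):          # enumerate
        # check: every pair of vertices in the candidate must be connected
        if all(b in adj[a] for a, b in combinations(candidate, 2)):
            matches.append(candidate)
    return matches

# Toy undirected graph (hypothetical data): one triangle plus a tail edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangles = find_cliques(toy_edges, 3)
print(triangles)
```

Swapping the check for a different predicate is what makes the pattern application-specific; the enumeration part stays the same.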


Existing Datalog Systems

• Relational predicates

- TC(a, b, c) :- R(a, b), a < b, R(b, c), b < c, R(c, a)

- count TC(a, b, c)

• Relational algebra enables composition of small structures into big structures
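To make the TC rule concrete, here is a minimal sketch that evaluates it directly over a relation R held as a Python set. It is not how a Datalog engine executes the rule; the function name `triangle_tuples` and the toy relation (an undirected graph stored in both directions) are assumptions for illustration.

```python
def triangle_tuples(R):
    """Evaluate TC(a, b, c) :- R(a, b), a < b, R(b, c), b < c, R(c, a)."""
    R = set(R)
    out = {}                       # index R by its first column
    for a, b in R:
        out.setdefault(a, set()).add(b)
    tc = set()
    for a, b in R:                 # R(a, b)
        if a < b:                  # a < b
            for c in out.get(b, ()):             # R(b, c)
                if b < c and (c, a) in R:        # b < c, R(c, a)
                    tc.add((a, b, c))
    return tc

# Hypothetical toy relation: triangle 1-2-3 plus edge 3-4, both directions.
undirected = [(1, 2), (2, 3), (1, 3), (3, 4)]
R = {(u, v) for u, v in undirected} | {(v, u) for u, v in undirected}
tc = triangle_tuples(R)
print(len(tc))   # count TC(a, b, c)
```

The ordering constraints a < b < c make each triangle appear exactly once, which is why the count of TC tuples equals the number of triangles.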

Challenges in Graph Mining

• # of subgraphs grows exponentially with the size of subgraphs

Size of subgraphs   # of subgraphs
1                   4k
2                   22k
3                   335k
4                   7.8M
5                   117M
6                   1.7B

Arabesque [CHC Teixeira et al., SOSP’15]

Problems with Distributed Mining Systems

• Suffer from large startup and communication overhead

- Arabesque on 10-node cluster, 35s startup, 3s execution

• Need enterprise clusters with large amounts of memory

- DistGraph on 128-node cluster, 32,768GB memory

• Poor load balancing due to dynamic working sets

- some nodes out of memory, other nodes with memory usage < 10%

Problems with Datalog Systems

• Programming model is not expressive enough for complex graph mining algorithms

Thoughts and Insight

• Not all users have access to an enterprise cluster

• Many users are domain experts with limited background in hosting a cluster

• Drawbacks of distributed mining systems: large startup overhead, underutilized CPUs, poor load balancing

Increasingly large SSDs

Our Proposal: RStream

A single-machine, out-of-core graph mining system

• A simple and expressive API

- Gather-Apply-Scatter + Relational Algebra => GRAS

• An efficient runtime engine

- implements relational algebra with streaming

GAS

• Gather information from neighbor vertices

• Apply and update the vertex property

• Scatter information to neighbor vertices
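As an illustration of the three GAS steps, the sketch below runs connected components by minimum-label propagation. The choice of algorithm, the function name `gas_connected_components`, and the toy graph are assumptions made for this example; the slides do not prescribe them.

```python
def gas_connected_components(vertices, edges, max_iters=100):
    """Label every vertex with the smallest vertex ID in its component."""
    label = {v: v for v in vertices}
    for _ in range(max_iters):
        # Scatter: every vertex sends its current label to its neighbors.
        inbox = {v: [] for v in vertices}
        for u, v in edges:
            inbox[v].append(label[u])
            inbox[u].append(label[v])
        changed = False
        for v in vertices:
            # Gather: combine the labels received from neighbor vertices.
            best = min(inbox[v], default=label[v])
            # Apply: update the vertex property if a smaller label arrived.
            if best < label[v]:
                label[v] = best
                changed = True
        if not changed:          # converged: no vertex changed its label
            break
    return label

# Hypothetical toy graph with two components: {1, 2, 3} and {4, 5}.
labels = gas_connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels)
```

Each pass only updates per-vertex values, which is why GAS suits iterative value computation but has no direct way to build up larger structures.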

GRAS

• GAS: supports iterative graph processing

• Relational Algebra: enables composition of structures

• GRAS: iterative composition of structures

Edge Streaming

• Use streaming to reduce I/O costs

• Sequentially access (larger) datasets from disk, randomly access (smaller) datasets held in memory

X-Stream [A Roy et al., SOSP’13]

Edge Streaming

A graph is partitioned into streaming partitions. Each streaming partition contains a vertex table, an edge table, and an update table.

Vertex Table
VID   Value
1     1
2     2

Edge Table
Src   Dest
1     4
2     5

Update Table
Value   Dest
1       4
2       5

Streaming for Scatter/Gather

Scatter (streaming load, then shuffle)

- Edge Table (src, dest): (1, 2), (2, 5)

- Vertex Table (ID, value): (1, a), (2, b)

- Update tuples produced (value, dest): (a, 2), (b, 5)

- Shuffle: (a, 2) goes to the Update Table of Streaming Partition 1, (b, 5) to the Update Table of Streaming Partition 2

Gather/Apply (streaming load)

- Update Table (value, dest): (a, 2)

- Vertex Table (ID, value): (1, a), (2, b)

- Result: a+b at vertex 2
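The scatter, shuffle, and gather/apply flow on this slide can be sketched as follows. This is a simplified in-memory model, not the RStream engine: the interval-based partitioning rule, the `part_size` parameter, the function names, and the numeric stand-ins for a and b are all assumptions.

```python
def scatter(vertex_table, edge_stream, num_partitions, part_size):
    """Stream edges sequentially, look up each source value in memory, and
    shuffle the (value, dest) update to the partition that owns dest."""
    update_tables = [[] for _ in range(num_partitions)]
    for src, dest in edge_stream:                 # sequential access
        value = vertex_table[src]                 # random access in memory
        owner = (dest - 1) // part_size           # assumed partitioning rule
        update_tables[owner].append((value, dest))
    return update_tables

def gather_apply(vertex_table, update_stream):
    """Stream updates sequentially and add each value into its destination."""
    for value, dest in update_stream:
        vertex_table[dest] = vertex_table.get(dest, 0) + value
    return vertex_table

# Partition 1 owns vertices 1-3, partition 2 owns 4-6 (hypothetical sizes).
vertices = {1: 10, 2: 20}                          # a = 10, b = 20
updates = scatter(vertices, [(1, 2), (2, 5)], num_partitions=2, part_size=3)
after = gather_apply(dict(vertices), updates[0])   # gather within partition 1
print(updates, after)
```

Only the vertex table of the current partition needs random access; edges and updates are read front to back, which is the access pattern the edge-streaming slide argues for.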

RStream API

[Figure: an RStream program is a sequence of phases, each of which is a Scatter, GatherApply, or Relational phase.]

Example: Triangle Counting

Phases: Scatter, R1, R2

• Scatter

edge table (src, dest): (1, 4), (2, 5), …
update table1 (c1, c2): (1, 4), (2, 5), …

• R1: (a, b) ⋈ (b, c) → (a, b, c)

update table1 (c1, c2): (1, 4), (2, 5), …
⋈ edge table (src, dest): (4, 9), (5, 8), …
update table2 (c1, c2, c3): (1, 4, 9), (2, 5, 8), …

• R2: (a, b, c) ⋈ (c, a) → (a, b, c, a)

update table2 (c1, c2, c3): (1, 4, 9), (2, 5, 8), …
⋈ edge table (src, dest): (9, 1), (8, 2), …
output: (1, 4, 9), (2, 5, 8)

vertex table (VID, value): (1, 4), (2, 5), …
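The three phases of this example can be traced with materialized tables, as in the sketch below. The function names are hypothetical, the tables live in memory rather than on disk, and the `a < b < c` filter is borrowed from the Datalog TC rule shown earlier so that each triangle is reported once; the slide itself does not show how duplicates are handled.

```python
def scatter(edge_table):
    """Scatter: copy every edge (src, dest) into update table1 as (c1, c2)."""
    return list(edge_table)

def r1(update_table1, edge_table):
    """R1: (a, b) join (b, c) -> (a, b, c), matching c2 against src."""
    by_src = {}
    for src, dest in edge_table:
        by_src.setdefault(src, []).append(dest)
    return [(a, b, c)
            for a, b in update_table1
            for c in by_src.get(b, [])
            if a < b < c]            # assumed ordering filter, drops rotations

def r2(update_table2, edge_table):
    """R2: (a, b, c) join (c, a): keep tuples whose closing edge exists."""
    edges = set(edge_table)
    return [(a, b, c) for a, b, c in update_table2 if (c, a) in edges]

# Edge values taken from the slide's tables (directed edges).
edge_table = [(1, 4), (2, 5), (4, 9), (5, 8), (9, 1), (8, 2)]
update_table1 = scatter(edge_table)
update_table2 = r1(update_table1, edge_table)
triangles = r2(update_table2, edge_table)
print(update_table2, triangles)
```

Each relational phase consumes one update table and produces the next, so longer patterns are obtained by chaining more joins.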

Outline

• How to provide a general programming interface for graph mining algorithms?

• How to implement relational operators efficiently for graphs?

Streaming for Join Operator

Update Table (C1, C2) ⋈ Edge Table (Src, Dest) → Update Table (C1, C2, C3)

Update Table     Edge Table     Update Table (output)
C1  C2           Src  Dest      C1  C2  C3
3   1            1    2         3   1   2
6   2            2    5         6   2   5

Steps shown: Load, Streaming (Locality-Aware Join), Shuffle, across Streaming Partition 1 and Streaming Partition 2.
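One way to read the Load, Streaming, and Shuffle labels is sketched below: per partition, the edge table is loaded into an in-memory index, the update tuples are streamed past it, and each new tuple is shuffled to the partition that owns its newest vertex. Which table is loaded and which is streamed is an assumption here, as are the function name, the interval partitioning rule, and `part_size`.

```python
def streaming_join(update_tables, edge_tables, part_size):
    """Join each partition's update tuples with its edge table on the last
    column, then shuffle the results to their next owning partition."""
    next_tables = [[] for _ in edge_tables]
    for updates, edges in zip(update_tables, edge_tables):
        # Load: index this partition's edges by source vertex in memory.
        by_src = {}
        for src, dest in edges:
            by_src.setdefault(src, []).append(dest)
        # Streaming: scan the update tuples sequentially and probe the index.
        for tup in updates:
            for dest in by_src.get(tup[-1], []):
                new_tup = tup + (dest,)
                # Shuffle: route the new tuple by its newest vertex.
                next_tables[(dest - 1) // part_size].append(new_tup)
    return next_tables

# Slide values; partition 1 owns vertices 1-3, partition 2 owns 4-6 (assumed).
update_tables = [[(3, 1), (6, 2)], []]
edge_tables = [[(1, 2), (2, 5)], []]
joined = streaming_join(update_tables, edge_tables, part_size=3)
print(joined)
```

Because every tuple in a partition joins only against that partition's edges, the probe side stays small enough for memory while the large side is read sequentially.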

Structural Information

• Joining subgraph (1, 2, 3) with edge (3, 4) gives 1 2 3 3 4, stored as update tuple 1 2 3 4

• Joining subgraph (1, 2, 3) with edge (2, 4) gives 1 2 3 2 4, stored as update tuple 1 2 3 4

• Same update tuples, different subgraphs

• Structural info is missing!

Missing Structural Information

• Identical tuples may represent different structures

• Different tuples may represent identical structures

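Adding Structural Info

• Encodes the history of joins in update tuples

Subgraph                     Update tuple      Index
edge 6-8                     6 8               0 1
after joining 7 at vertex 8  6 8 7(1)          0 1 2
after joining 5 at vertex 8  6 8 5(1) 7(1)     0 1 2 3

The number in parentheses is the index of the column that the vertex was joined on.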
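A minimal sketch of this encoding follows: every vertex added by a join carries the index of the column it was joined on, so the edges of the subgraph can be recovered from the tuple alone. The representation as (vertex, index) pairs and the helper names are assumptions, and new vertices are simply appended, so the column order may differ from the slide's last tuple.

```python
def start(src, dest):
    """Initial update tuple for edge (src, dest); dest attaches to index 0."""
    return ((src, None), (dest, 0))

def extend(tup, join_index, new_vertex):
    """Join on column join_index and record that index with the new vertex."""
    return tup + ((new_vertex, join_index),)

def edges_of(tup):
    """Recover the subgraph's edges from the recorded join history."""
    return {frozenset((tup[idx][0], vid)) for vid, idx in tup if idx is not None}

def show(tup):
    """Render in the slide's style: first edge plain, later vertices as v(i)."""
    return " ".join(str(v) if pos < 2 else f"{v}({i})"
                    for pos, (v, i) in enumerate(tup))

t1 = extend(start(6, 8), 1, 7)            # join 7 on the column holding 8
t2 = extend(t1, 1, 5)                     # join 5 on the same column
print(show(t1), "|", show(t2))

# The ambiguous case: vertex 4 joined at vertex 3 versus at vertex 2.
base = extend(start(1, 2), 1, 3)
via_3 = extend(base, 2, 4)
via_2 = extend(base, 1, 4)
print(show(via_3), "|", show(via_2))
```

With the index attached, the two tuples that previously collapsed to 1 2 3 4 are now distinguishable, and `edges_of` returns different edge sets for them.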

Is Join Enough?

• Join grows a subgraph from one of its vertices

• For Frequent Subgraph Mining, we need to explore all possibilities of existing subgraphs

• A different way of joining to grow a subgraph from all of its vertices

Join on All Columns

• Joins update table with edge table on every column

[Figure: starting from the edge (1, 2), joining on every column grows the subgraph from each of its vertices, producing (1, 2, 3) and (1, 2, 4), and then larger subgraphs that add vertices 5 to 10.]
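The difference from an ordinary join can be sketched as below: instead of extending a tuple only from one column, every column is used as a join key. Treating edges as undirected, skipping vertices already in the tuple, and the function name `join_all_columns` are assumptions for this illustration.

```python
def join_all_columns(update_table, edge_table):
    """Grow every tuple by one vertex, trying each of its columns as the key."""
    adj = {}
    for u, v in edge_table:                    # undirected adjacency (assumed)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    grown = []
    for tup in update_table:
        for vertex in tup:                     # join on every column
            for nbr in sorted(adj.get(vertex, ())):
                if nbr not in tup:             # only add genuinely new vertices
                    grown.append(tup + (nbr,))
    return grown

# Hypothetical toy graph: vertex 3 hangs off vertex 1, vertex 4 off vertex 2.
edge_table = [(1, 2), (1, 3), (2, 4)]
grown = join_all_columns([(1, 2)], edge_table)
print(grown)
```

A join on the last column alone would only find (1, 2, 4); joining on all columns also finds (1, 2, 3), which is what Frequent Subgraph Mining needs in order to explore every extension of an existing subgraph.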

Automorphism and Isomorphism

• Different threads can generate identical (automorphic) update tuples

- e.g., thread 1 and thread 2 both generate 1 2 3

• Select and keep one, remove all the other duplicates

• Different tuples may belong to the same isomorphism class

- e.g., the subgraphs on vertices 1 2 3 and on vertices 4 5 6 have the same shape; Aggregation yields (shape, 2)

• Aggregate to count the number of each distinct shape

Arabesque [CHC Teixeira et al., SOSP’15]
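Both steps can be sketched for small patterns: duplicates are removed by comparing edge sets, and shapes are grouped by a canonical form found by trying every relabeling. The brute-force canonical form is a stand-in chosen for clarity and only practical for a handful of vertices; it is not claimed to be the method used by RStream or Arabesque, and all names below are hypothetical.

```python
from collections import Counter
from itertools import permutations

def canonical_shape(edges):
    """Smallest relabeled edge list over all vertex permutations."""
    verts = sorted({v for e in edges for v in e})
    best = None
    for perm in permutations(range(len(verts))):
        relabel = dict(zip(verts, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v])))
                            for u, v in edges))
        if best is None or form < best:
            best = form
    return best

def aggregate(subgraphs):
    """Drop duplicate subgraphs, then count subgraphs per distinct shape."""
    unique = {frozenset(frozenset(e) for e in sg) for sg in subgraphs}
    return Counter(canonical_shape(sg) for sg in unique)

found = [
    [(1, 2), (2, 3), (1, 3)],     # triangle 1-2-3 from thread 1
    [(2, 3), (1, 3), (1, 2)],     # the same triangle again from thread 2
    [(4, 5), (5, 6), (4, 6)],     # a different triangle, same shape
]
counts = aggregate(found)
print(counts)
```

The duplicate from thread 2 is removed first, so the triangle shape ends up with a count of 2, matching the aggregation result on the slide.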

Evaluation

• Platform

- 10-node cluster, 5TB SSD

- Each node: 2 Xeon(R) CPU E5-2640 v3 processors, 32GB memory

• Applications

- Triangle Counting

- Transitive Closure

- N-Clique Finding

- N-Motif Counting

- Frequent Subgraph Mining

• Input graphs

Graphs        #Edges   #Vertices
Citeseer      4,732    3,312
Mico          1.1M     100K
Patents       14M      2.7M
LiveJournal   69M      4.8M
Orkut         117M     3M
UK-2005       936M     39.5M

Comparisons with Mining Systems

Application         System         Citeseer   Mico     Patent
Triangle Counting   RStream        0.04       15.8     6.7
                    Arabesque-10   38.1       43.1     114.9
5-Clique            RStream        0.01       115.1    35.3
                    Arabesque-10   42.8       132      174.5
3-FSM 1K            RStream        0.06       351.7    383.7
                    Arabesque-10   35.6       5790.1   -
                    ScaleMine-10   1.2        802.6    -
                    DistGraph-10   0.4        -        -

RStream outperforms Arabesque by 60.9x, ScaleMine by 12.1x, and DistGraph by 7.2x

Comparisons with Mining Systems

[Figure: FSM on patent graph. Running time (seconds, axis from 0 to 2000) of RStream, ScaleMine, and Arabesque for subgraph size - support settings 3-10K, 3-15K, 3-20K, 4-15K, 4-20K, 4-25K, 5-15K, 5-20K, 5-25K.]

Comparisons with Datalog Systems

Triangle Counting

System          LiveJournal   Orkut
RStream         87            827.4
BigDatalog-10   94.8          1205.3
BigDatalog-5    109.6         1850.3
BigDatalog-1    567.3         -
SociaLite       896.1         -

[Figure: Transitive Closure. Time (seconds, axis from 0 to 1,000) for BD-1, BD-5, BD-10, SL, and RS; one bar exceeds the axis and is labeled 8,021.]

Size of Intermediate Data

4-Motif Counting on Mico

Phase   #MB
0       16.5
1       2086
2       886378
3       672194
Total   1.49TB

13MB initial graph: 68182X
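The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes binary units (1 TB = 1024 × 1024 MB) and reads the 68182X factor as the largest phase divided by the 13MB initial graph; both readings are inferred from the numbers rather than stated on the slide.

```python
# Per-phase intermediate data sizes in MB, copied from the table.
phase_mb = {0: 16.5, 1: 2086, 2: 886378, 3: 672194}

total_mb = sum(phase_mb.values())
total_tb = total_mb / 1024 / 1024          # assumed binary units

initial_graph_mb = 13
peak_ratio = max(phase_mb.values()) / initial_graph_mb

print(f"total: {total_mb} MB = {total_tb:.2f} TB")
print(f"largest phase vs. initial graph: {int(peak_ratio)}X")
```

Under these assumptions the per-phase sizes sum to the stated 1.49TB, and phase 2 alone accounts for the 68182X blow-up over the input.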

Conclusions

RStream: A single-machine, out-of-core graph mining system

• A simple and expressive API

- GAS + Relational Algebra => GRAS

• An efficient runtime engine

- implements relational algebra with tuple streaming

https://github.com/rstream-system