a visual and statistical benchmark for graph sampling methods fangyan zhang 1 song zhang 1 pak chung...

27
A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi State University 2 Pacific Northwest National Laboratory

Upload: madeline-todd

Post on 20-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

A Visual and Statistical Benchmark for Graph Sampling Methods

Fangyan Zhang1 Song Zhang1 Pak Chung Wong2 J. Edward Swan II1 T.J. Jankun-Kelly1

1Mississippi State University 2Pacific Northwest National Laboratory

Page 2: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Outline

Introduction

Motivation

Problems

Evaluation

KS-Distance

Graph Type and Properties

Sampling Methods

Node sampling

Edge sampling

Topology Based Sampling

Comparison of Sampling Results

Statistical Comparison

Visual Comparison

Efficiency Comparison

Sampling on Large Graph

Node Sampling

Out Degree Distribution

Conclusion

Page 3: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Introduction

Motivation

Visualization

To display all nodes and edges is impossible.

Estimation or calculation

To calculate graph properties on a large graph

is costly.

Data

No complete data.

To obtain data is time-consuming.

Page 4: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Introduction

Problems

Given a huge graph, how to get a representative sample?

Given several sampling methods, which sampling method is best?

How to compare those sampling results?

How do we measure success?

Page 5: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Outline

Introduction

Motivation

Problems

Evaluation

KS-Distance

Graph Type and Properties

Sampling Methods

Node sampling

Edge sampling

Topology Based Sampling

Comparison of Sampling Results

Statistical Comparison

Visual Comparison

Efficiency Comparison

Sampling on Large Graph

Node Sampling

Out Degree Distribution

Conclusion

Page 6: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Evaluation

Kolmogorov-Smirnov (KS) D-statistic

Dn= supx|Fn(x)-F(x)| supx: supremum of the set of distance. Fn and F are distribution function.

To evaluate the similarity between original and sampled graph

To calculate KS value on graph properties.

Page 7: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Evaluation

Graph properties(10): Directed Graph

In-degree distribution (InDD)

Out-degree distribution (OutDD)

Betweenness centrality distribution (BCB)

Average neighbor degree distribution (ANDD)

In-degree centrality distribution (InDCD)

Out-degree centrality distribution (OutDCD)

Edge betweenness centrality distribution (EBCD)

Weakly connected component distribution(WCCD)

Hops distribution (HD)

Hops distribution in largest weakly connected component (HLCCD

Graph properties(7): Undirected Graph

Degree distribution (DD)

Betweenness centrality distribution (BB)

Clustering coefficient (CCD)

Average neighbor degree distribution (ANDD)

Degree centrality distribution (DCD)

Edge betweenness centrality distribution (EBCD)

Hop distribution (HD)

Page 8: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Outline

Introduction

Motivation

Problems

Evaluation

KS-Distance

Graph Type and Properties

Sampling Methods

Node sampling

Edge sampling

Topology Based Sampling

Comparison of Sampling Results

Statistical Comparison

Visual Comparison

Efficiency Comparison

Sampling on Large Graph

Node Sampling

Out Degree Distribution

Conclusion

Page 9: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Sampling algorithms

Node Sampling

Random node (RN)

Random node-edge (RNE)

Random node-neighbour (RNN)

Streaming nodes (SN)

Edge Sampling

Random edge (RE)

Induced edge (IE)

Streaming edge (SE)

Topology Based Sampling

Breadth-first (BF)

Depth-first (DF)

Random first (RF)

Snowball (SB)

Random walk (RW)

Random walk with escape (RWE)

Forest fire (FF)

Page 10: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Outline

Introduction

Motivation

Problems

Evaluation

KS-Distance

Graph Type and Properties

Sampling Methods

Node sampling

Edge sampling

Topology Based Sampling

Comparison of Sampling Results

Statistical Comparison

Visual Comparison

Efficiency Comparison

Sampling on Large Graph

Node Sampling

Out Degree Distribution

Conclusion

Page 11: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Comparison of Sampling Results

Undirected Graph American Airlines connection data

235 nodes 1297 edges

Simulated random undirected Graph

Graph 1: 500 nodes 1260 edges

Graph 2: 500 nodes 1238 edges

Graph 3: 750 nodes 2871 edges

Graph 4: 1000 nodes 4989 edges

Graph 5: 1000 nodes 5092 edges

Graph 6: 1250 nodes 7884 edges

Directed Graph VAST

1214 nodes 15653 edges

Simulated random directed Graph

Graph 1: 500 nodes 1284 edges

Graph 2: 500 nodes 1262 edges

Graph 3: 750 nodes 2859 edges

Graph 4: 1000 nodes 4900 edges

Graph 5: 1000 nodes 4954 edges

Graph 6: 1250 nodes 7732 edges

Simulated Graph: Random Graph created by Erdős–Rényi model

Page 12: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Statistical Comparison of Sampling Results

Comparison Categories (4):Between directed and undirected Graph

American airline connection data and VAST data

Between different graph type of undirected or directed graphs

American airline connection data and Simulated undirected graph

Between multiple graphs of same type but different sizes.

Simulated data with different number of nodes(500, 750, 1000, 1250)

Between multiple graphs of same size and same type

Simulated Graph(500 VS 500, 1000 VS 1000)

Page 13: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Category 1: Between directed and undirected Graph (Undirected Graph)

Data: American Airlines connection data. Sampling rate: average(10%-50%)

Page 14: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Category 1: Between directed and undirected Graph (directed Graph)

Data: VAST 2013 Netflow data. Sampling rate: average(10%-50%)

Page 15: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Category 2: between two undirected graphs with different graph type

Data: American Airline connection data VS Simulated 500_1. Sampling rate: average(10%-50%)

Page 16: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Category 3: between multiple graphs of same type but different sizes

Data: Simulated data 500_1 VS 1250. Sampling rate: average(10%-50%)

Page 17: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Category 4: between multiple graphs of same size and same type

Data: Simulated data 500_1 VS 500_2. Sampling rate: average(10%-50%)

Page 18: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Category 4: between multiple graphs of same size and same type

Data: Simulated data 1000_1 VS 1000_2. Sampling rate: average(10%-50%)

Page 19: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Visual Comparison of Sampling Results Visual Comparison

Data: American Airlines connection data

Graph Type: undirected graph

Sampling rate: 10% on edges

Visual Comparison Technique

Fix nodes location, label size, color etc.

Analysis

Edge-related sampling methods are biased towards high-degree nodes. For example, random edge sampling, induced edge sampling, streaming edge sampling are easy to sample high degree nodes (136,50, 80,130,70 etc.)

Original

Nodes:235 Edges:1297

FF

Nodes:27 Edges:131

RN

Nodes:58 Edges:125

RE

Nodes:117 Edges:129

SN

Nodes:97 Edges:197

SE

Nodes: 83 Edges:129

RW

Nodes: 74 Edges:129

IE

Nodes: 28 Edges: 127

Page 20: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Efficiency Comparison

RN RNERNN SN RE IE SE BF DF SB RW RWE RF FF0.00000

5.00000

10.00000

15.00000

20.00000

25.00000

30.00000

Sampling methods

Exe

cuti

on t

ime

(sec

ond)

Execution Time

Data

simulated undirected graph data

Sampling rate

Average of sampling result with sampling rate range from 10% to 50% on edges.

Page 21: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Outline

Introduction

Motivation

Problems

Evaluation

KS-Distance

Graph Type and Properties

Sampling Methods

Node sampling

Edge sampling

Topology Based Sampling

Comparison of Sampling Results

Statistical Comparison

Visual Comparison

Efficiency Comparison

Sampling on Large Graph

Node Sampling

Out Degree Distribution

Conclusion

Page 22: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Sampling on Big Graph

Graph GenerationErdős–Rényi modelparallel algorithm (by MPi4py)Shadow ІІ. Used 100 nodes, 2000 processors, each node:

512GB memory. Size:

1 billion nodes ~500 billion edges.

Graph Storage: about 10TB

Page 23: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Sampling on Big Graph Original Graph

Out Degree distribution

Page 24: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Sampling on Big Graph

Node sampling. Sampling Rate: 10 %

Out Degree Distribution:

Page 25: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Outline

Introduction

Motivation

Problems

Evaluation

KS-Distance

Graph Type and Properties

Sampling Methods

Node sampling

Edge sampling

Topology Based Sampling

Comparison of Sampling Results

Statistical Comparison

Visual Comparison

Efficiency Comparison

Sampling on Large Graph

Node Sampling

Out Degree Distribution

Conclusion

Page 26: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Conclusion

No sampling method works well for all graphs.

In visual comparison, the consistent graph layout facilitates comparison.

The benchmark helps users choose proper sampling methods in applications.

The benchmark provides an avenue to explore big graph.

Page 27: A Visual and Statistical Benchmark for Graph Sampling Methods Fangyan Zhang 1 Song Zhang 1 Pak Chung Wong 2 J. Edward Swan II 1 T.J. Jankun-Kelly 1 1 Mississippi

Questions?

Thanks!

AcknowledgementTHIS WORK IS SUPPORTED BY THE PACIFIC NORTHWEST NATIONAL LABORATORY UNDER THE U.S. DEPARTMENT OF ENERGY CONTRACT DE-AC05-76RL01830