Streaming Graph Analytics for Massive Graphs
Jason Riedy, David A. Bader, David Ediger
Georgia Institute of Technology
10 July 2012



Outline

Motivation
• Where graphs appear... (hint: everywhere)
• Data volumes and rates of change
• Why analyze data streams?

Technical
• Overall streaming approach
• Data structure benchmark
• Clustering coefficients
• Connected components
• Common aspects and questions

Session outline


Exascale Data Analysis

Health care: finding outbreaks, population epidemiology
Social networks: advertising, searching, grouping
Intelligence: decisions at scale, regulating algorithms
Systems biology: understanding interactions, drug design
Power grid: disruptions, conservation
Simulation: discrete events, cracking meshes

The data is full of semantically rich relationships. Graphs! Graphs! Graphs!


Graphs are pervasive

• Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.

• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

Astrophysics
Problem: outlier detection. Challenges: massive datasets, temporal variation. Graph problems: matching, clustering.

Bioinformatics
Problem: identifying target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.

Social informatics
Problem: emergent behavior, information spread. Challenges: new analysis, data uncertainty. Graph problems: clustering, flows, shortest paths.


These are not easy graphs.
Yifan Hu's (AT&T) visualization of the LiveJournal data set


But no shortage of structure...

Protein interactions, Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.

Jason’s network via LinkedIn Labs

• Globally, there rarely are good, balanced separators in the scientific computing sense.

• Locally, there are clusters or communities and many levels of detail.


Also no shortage of data...

Existing (some out-of-date) data volumes

NYSE: 1.5 TB generated daily into a maintained 8 PB archive

Google: "several dozen" 1 PB data sets (CACM, Jan. 2010)

LHC: 15 PB per year (avg. 41 TB daily) http://public.web.cern.ch/public/en/lhc/Computing-en.html

Wal-Mart: 536 TB, 1B entries daily (2006)

eBay: 2 PB traditional DB plus 6.5 PB streaming; 17 trillion records, 1.5B records/day; each web click is 50-150 details. http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

Facebook: 845M users... and growing.

• All data is rich and semantic (graphs!) and changing.

• Base data rates include items and not relationships.


General approaches

• High-performance static graph analysis
  • Develop techniques that apply to unchanging massive graphs.
  • Provides useful after-the-fact information, starting points.
  • Serves many existing applications well: market research, much bioinformatics, ...

• High-performance streaming graph analysis
  • Focus on the dynamic changes within massive graphs.
  • Find trends or new information as they appear.
  • Serves upcoming applications: fault or threat detection, trend analysis, ...

Both are very important to different areas. The remaining focus is on streaming.

Note: Not CS theory streaming, but analysis of streaming data.


Why analyze data streams?

Data volumes

NYSE 1.5TB daily

LHC 41TB daily

Facebook Who knows?

Data transfer
• 1 Gb Ethernet: 8.7 TB daily at 100%; 5-6 TB daily realistic
• Multi-TB storage on 10GE: 300 TB daily read, 90 TB daily write
• CPU ↔ memory (QPI, HT): 2 PB/day at 100%

Data growth

• Facebook: > 2×/yr

• Twitter: > 10×/yr

• Growing sources: bioinformatics, µsensors, security

Speed growth

• Ethernet/IB/etc.: 4× in the next 2 years. Maybe.
• Flash storage, direct: 10× write, 4× read. Relatively huge cost.


Overall streaming approach

Protein interactions, Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.

Jason’s network via LinkedIn Labs

Assumptions

• A graph represents some real-world phenomenon.
• But not necessarily exactly!
• Noise comes from lost updates, partial information, ...


• We target massive, "social network" graphs.
• Small diameter, power-law degrees.
• Small changes in massive graphs often are unrelated.


• The graph changes, but we don't need a continuous view.
• We can accumulate changes into batches...
• But not so many that it impedes responsiveness.


Difficulties for performance

• What partitioning methods apply?
  • Geometric? Nope.
  • Balanced? Nope.
  • Is there a single, useful decomposition? Not likely.
• Some partitions exist, but they don't often help with balanced bisection or memory locality.
• Performance needs new approaches, not just standard scientific computing methods.

Jason’s network via LinkedIn Labs


STING’s focus

[Diagram: source data flowing into STING, which exchanges summaries, predictions, and actions with visualization, simulation/query, and control components.]

• STING manages queries against changing graph data.
  • Visualization and control often are application-specific.
• Ideal: maintain many persistent graph analysis kernels.
  • Keep one current snapshot of the graph resident.
  • Let kernels maintain smaller histories.
  • Also (a harder goal), coordinate the kernels' cooperation.
• Gather data into a typed graph structure, STINGER.


STINGER

STING Extensible Representation:

• Rule #1: No explicit locking. Rely on atomic operations.
• Massive graph: scattered updates and scattered reads rarely conflict.
• Use time stamps for some view of time.
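As a concrete illustration, here is a minimal lock-free insertion sketch in the spirit of Rule #1, using GCC atomic builtins of the kind available to the gcc/OpenMP build described later. The edge_slot layout and try_insert helper are illustrative assumptions, not STINGER's actual API.

```c
/* Sketch only: claim an edge slot with compare-and-swap and bump a
   counter with atomic add; no mutexes anywhere. */
#include <stdint.h>

typedef struct {
  int64_t dst;        /* neighbor id, or -1 when the slot is empty */
  int64_t timestamp;  /* "some view of time" via time stamps */
} edge_slot;

int64_t ninsertions;  /* aggregate statistic, updated atomically */

int try_insert(edge_slot *slot, int64_t d, int64_t t) {
  /* Claim the slot only if it is still empty; scattered updates to a
     massive graph rarely contend on the same slot. */
  if (__sync_bool_compare_and_swap(&slot->dst, (int64_t)-1, d)) {
    slot->timestamp = t;   /* plain store; readers tolerate staleness */
    __sync_fetch_and_add(&ninsertions, 1);
    return 1;
  }
  return 0;  /* lost the race; the caller probes the next slot */
}
```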


Initial results

Prototype STING and STINGER

Background benchmark for STINGER maintenance, plus monitoring the following properties:

1. clustering coefficients, and
2. connected components.

High-level

• Support high rates of change, over 10k updates per second.

• Performance scales somewhat with available processing.

• Gut feeling: Scales as much with sockets as cores.

http://www.cc.gatech.edu/stinger/


Experimental setup

Unless otherwise noted

Line       Model     Speed (GHz)  Sockets  Cores
Nehalem    X5570     2.93         2        4
Westmere   E7-8870   2.40         4        10

• Westmere loaned by Intel (thank you!)

• All memory: 1067MHz DDR3, installed appropriately

• Implementations: OpenMP, gcc 4.6.1, Linux ≈ 3.0 kernel

• Artificial graph and edge stream generated by R-MAT [Chakrabarti, Zhan, & Faloutsos]; see the sketch after this list.
  • Scale x, edge factor f ⇒ 2^x vertices, ≈ f · 2^x edges.
  • Edge actions: 7/8 insertions, 1/8 deletions.
  • Results over five batches of edge actions.

• Caveat: no vector instructions or low-level optimizations yet.

• Portable: OpenMP, Cray XMT family
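For reference, a minimal R-MAT edge-generator sketch; the quadrant probabilities (0.55, 0.10, 0.10, 0.25) and the scale/edge-factor values below are illustrative, not necessarily the talk's settings.

```c
/* R-MAT [Chakrabarti, Zhan, & Faloutsos]: recursively pick one of four
   quadrants of the 2^scale x 2^scale adjacency matrix until a single
   cell (u, v) remains. */
#include <stdio.h>
#include <stdlib.h>

static void rmat_edge(int scale, double a, double b, double c,
                      long *u, long *v) {
  *u = 0; *v = 0;
  for (int bit = 0; bit < scale; bit++) {
    double r = (double)rand() / RAND_MAX;
    *u <<= 1; *v <<= 1;
    if (r < a) { /* top-left quadrant: neither bit set */ }
    else if (r < a + b) *v |= 1;        /* top-right */
    else if (r < a + b + c) *u |= 1;    /* bottom-left */
    else { *u |= 1; *v |= 1; }          /* bottom-right */
  }
}

int main(void) {
  const int scale = 10, f = 16;  /* 2^10 = 1024 vertices, ~f * 2^10 edges */
  srand(1);
  for (long e = 0; e < (long)f << scale; e++) {
    long u, v;
    rmat_edge(scale, 0.55, 0.10, 0.10, &u, &v);  /* d = 0.25 implicitly */
    if (e < 4) printf("edge %ld: %ld -> %ld\n", e, u, v);
  }
  return 0;
}
```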


STINGER insertion / removal rates

Edges per second

On E7-8870, 80 hardware threads but four sockets:

[Plot: aggregate STINGER updates per second (log scale, 10^4 to 10^7) vs. number of threads (20 to 80), one series per batch size from 100 to 3,000,000 edge actions.]


Seconds per batch, “latency”

[Plot: seconds per batch processed (log scale, 10^-3 to 10^0) vs. number of threads, one series per batch size from 100 to 3,000,000.]


Clustering coefficients

• Used to measure "small-world-ness" [Watts & Strogatz] and potential community structure
• Larger clustering coefficient ⇒ more inter-connected
• Roughly the ratio of the number of actual to potential triangles

[Diagram: vertex v with neighbors i, j, m, and n; i and j are adjacent, closing the triplet i – v – j, while m – v – n remains open.]

• Defined in terms of triplets.

• i – v – j is a closed triplet (triangle).

• m – v – n is an open triplet.

• Clustering coefficient: # of closed triplets / total # of triplets
• Computed locally around v or globally for the entire graph.
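In symbols, with T(v) the number of triangles containing v and d_v the degree of v (a standard formulation consistent with the definition above):

```latex
C_v = \frac{\#\{\text{closed triplets centered at } v\}}
           {\#\{\text{triplets centered at } v\}}
    = \frac{2\,T(v)}{d_v (d_v - 1)},
\qquad
C = \frac{\#\{\text{closed triplets}\}}{\#\{\text{triplets}\}}
  = \frac{3 \times \#\{\text{triangles}\}}{\#\{\text{triplets}\}}.
```

The factor of 3 appears because each triangle contributes one closed triplet centered at each of its three vertices.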


Updating triangle counts

Given: edge {u, v} to be inserted (+) or deleted (-).

Approach: search for vertices adjacent to both u and v; update counts on those vertices and on u and v.

Three methods

Brute force: intersect the neighbor lists of u and v by iterating over each, O(d_u d_v) time.

Sorted list: sort u's neighbors; for each neighbor of v, check membership in the sorted list.

Compressed bits: summarize u's neighbors in a bit array, reducing each check of v's neighbors to O(1) time; approximate with Bloom filters. [MTAAP10]

All rely on atomic addition.
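A minimal sketch of the sorted-list method for an insertion; the array names (nbr, deg, ntri) are illustrative assumptions, not STINGER's API. A deletion runs the same search with atomic subtraction.

```c
/* Update per-vertex triangle counts for an inserted edge {u, v}.
   nbr[w] is w's sorted neighbor array, deg[w] its degree, and
   ntri[w] its triangle count. */
static int in_sorted(const int *a, int n, int x) {
  int lo = 0, hi = n - 1;                /* binary search */
  while (lo <= hi) {
    int mid = lo + (hi - lo) / 2;
    if (a[mid] == x) return 1;
    if (a[mid] < x) lo = mid + 1; else hi = mid - 1;
  }
  return 0;
}

void triangles_insert(int u, int v, int **nbr, const int *deg, int *ntri) {
  int closed = 0;
  for (int i = 0; i < deg[v]; i++) {     /* scan v's neighbors */
    int w = nbr[v][i];
    if (w != u && in_sorted(nbr[u], deg[u], w)) {
      /* w completes triangle u-v-w; atomic addition, as noted above */
      #pragma omp atomic
      ntri[w] += 1;
      closed++;
    }
  }
  #pragma omp atomic
  ntri[u] += closed;
  #pragma omp atomic
  ntri[v] += closed;
}
```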


Batches of 10k actions

[Plot: updates per second (metric plus STINGER maintenance) vs. threads for brute force, Bloom filter, and sorted list, on a 4 x E7-8870 and a 2 x X5570; graph size scale 22, edge factor 16. Labeled rates peak near 3.4e+04 (brute force), 1.3e+05 (Bloom filter), and 1.4e+05 (sorted list).]


Different batch sizes

[Plot: updates per second (metric plus STINGER, log scale 10^3.5 to 10^5.5) vs. threads for brute force, Bloom filter, and sorted list, one row per batch size (100; 1,000; 10,000), on the 4 x E7-8870 and 2 x X5570; graph size scale 22, edge factor 16.]


Different batch sizes: Latency

[Plot: seconds between updates (metric plus STINGER, log scale 10^-3 to 10^0) vs. threads, same layout: one row per batch size (100; 1,000; 10,000), three methods, both machines; graph size scale 22, edge factor 16.]


Connected components

• Maintain a mapping from vertex to component.
• Global property, unlike triangle counts.
• In "scale-free" social networks:
  • often one big component, and
  • many tiny ones.
• Edge changes often sit within components.
• Remaining insertions merge components (see the sketch below).
• Deletions are more difficult...
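A minimal sketch of the insertion side under the mapping just described; comp[] maps each vertex to a component label, and the linear relabeling scan stands in for a real implementation's cheaper merge machinery (per-component lists or union-find).

```c
/* Handle an inserted edge {u, v} against the vertex-to-component mapping. */
void components_insert(int u, int v, int *comp, int nv) {
  int cu = comp[u], cv = comp[v];
  if (cu == cv) return;         /* common case: edge sits within a component */
  for (int w = 0; w < nv; w++)  /* rare case: the insertion merges two components */
    if (comp[w] == cv) comp[w] = cu;
}
```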


Connected components: Deleted edges

The difficult case
• Very few deletions matter.
• Determining which matter may require a large graph search, or re-running static component detection.
• (Long history; see related work in [MTAAP11].)
• Coping mechanisms:
  • heuristics, and
  • a second level of batching.


Deletion heuristics

Rule out effect-less deletions
• Use the spanning-tree by-product of static connected-component algorithms.
• Ignore deletions when one of the following occurs:
  1. The deleted edge is not in the spanning tree.
  2. The endpoints share a common neighbor*.
  3. The loose endpoint can reach the root*.
• In the last two (*), also fix the spanning tree.

Rules out 99.7% of deletions. (A sketch of the first two checks follows.)
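This sketch assumes a parent[] spanning-tree array (parent[root] == root) and plain neighbor lists; the names are illustrative, not the [MTAAP11] code. Rule 3 (walking the loose endpoint toward the root) and the spanning-tree repairs are omitted.

```c
/* Return 0 when deleting {u, v} provably cannot split a component. */
int deletion_may_matter(int u, int v, const int *parent,
                        int **nbr, const int *deg) {
  /* Rule 1: a non-tree edge never disconnects anything. */
  if (parent[u] != v && parent[v] != u) return 0;
  /* Rule 2: a shared neighbor w gives an immediate reroute u-w-v. */
  for (int i = 0; i < deg[u]; i++)
    for (int j = 0; j < deg[v]; j++)
      if (nbr[u][i] == nbr[v][j]) return 0;
  return 1;  /* survivors go to the search / second-level batching path */
}
```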


Connected components: Performance

On E7-8870

[Plot: connected-component updates per second (log scale, 10^3 to 10^5) vs. number of threads (20 to 80), one series per batch size from 100 to 3,000,000.]


Connected components: Latency

On E7-8870

[Plot: seconds per batch processed (log scale, 10^-1.5 to 10^1) vs. number of threads, one series per batch size from 100 to 3,000,000.]


Common aspects

• Each parallelizes sufficiently well over the affected vertices V', those touched by new or removed edges.
• Total amount of work is O(Vol(V')) = O(∑_{v ∈ V'} deg(v)); see the sketch after this list.

• Our in-progress work on refining or re-agglomerating communities with updates also is O(Vol(V')).
• How many interesting graph properties can be updated with O(Vol(V')) work?
• Do these parallelize well?
• The hidden constant, and how quickly performance becomes asymptotic, determine the metric update rate. What implementation techniques bash down the constant?
• How sensitive are these metrics to noise and error?
• How quickly can we "forget" data and still maintain metrics?
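As a small illustration of the cost measure, a sketch that computes Vol(V') for a batch of edge actions; the names are illustrative, and real code would fold this accounting into the update itself.

```c
/* Sum deg(v) over the distinct endpoints touched by the batch. */
#include <stdint.h>

int64_t affected_volume(const int (*batch)[2], int nactions,
                        const int64_t *deg, unsigned char *seen, int nv) {
  int64_t vol = 0;
  for (int v = 0; v < nv; v++) seen[v] = 0;   /* clear scratch marks */
  for (int k = 0; k < nactions; k++)
    for (int side = 0; side < 2; side++) {
      int v = batch[k][side];
      if (!seen[v]) { seen[v] = 1; vol += deg[v]; }
    }
  return vol;
}
```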


Session outline

• Fast Counting of Patterns in Graphs. Ali Pinar, C. Seshadhri, and Tamara G. Kolda, Sandia National Laboratories, USA

• Statistical Models and Methods for Anomaly Detection in Large Graphs. Nicholas Arcolano, Massachusetts Institute of Technology, USA

• Perfect Power Law Graphs: Generation, Sampling, Construction and Fitting. Jeremy Kepner, Massachusetts Institute of Technology, USA


Acknowledgment of support
