
LLNL, Feb. '07 C. Faloutsos 1

School of Computer Science, Carnegie Mellon

Mining static and time-evolving graphs

Christos Faloutsos

Carnegie Mellon University

Overview

• Mining Static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
  – Tensors + intrusion detection
  – Sparse graphs
• Other topics
  – Graph sampling
  – Graph generators


CePS

• w/ Hanghang Tong, KDD 2006


Center-Piece Subgraph (CePS)

• Given Q query nodes
• Find Center-piece (size limited by budget b)
• Input of CePS
  – Q query nodes
  – Budget b
  – k: SoftAND coefficient
• App.
  – Social Networks
  – Law Enforcement, …

[Figure: example graphs with query nodes A, B and C]

Challenges in CePS

• Q1: How to measure the importance?
• Q2: How to extract connection subgraph?
• Q3: How to do it efficiently?

An Illustrating Example

[Figure: example graph with nodes 1 to 13]

• Start from node 1
• Move randomly to a neighbor
• With some probability p, return to node 1

Score of node j: Prob(RW will finally stay at j)
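The walk described above can be simulated directly. The following is a minimal sketch, not the method used in the talk: the 5-node graph, the restart probability of 0.2 and the step count are illustrative assumptions (the edges of the 13-node example are not recoverable from the slide), and `simulate_rwr` is a hypothetical helper name. It estimates each node's score as the fraction of time the walker spends there.

```python
import random

def simulate_rwr(neighbors, start, restart_prob, n_steps, seed=0):
    """Estimate RWR scores as the fraction of steps the walker spends at each node."""
    rng = random.Random(seed)
    visits = {node: 0 for node in neighbors}
    current = start
    for _ in range(n_steps):
        if rng.random() < restart_prob:
            current = start                           # restart: jump back to the query node
        else:
            current = rng.choice(neighbors[current])  # move to a uniformly random neighbor
        visits[current] += 1
    return {node: count / n_steps for node, count in visits.items()}

# Hypothetical undirected graph, given as adjacency lists (an assumption for this sketch).
graph = {
    1: [2, 3],
    2: [1, 3],
    3: [1, 2, 4],
    4: [3, 5],
    5: [4],
}

scores = simulate_rwr(graph, start=1, restart_prob=0.2, n_steps=200_000)
for node, score in sorted(scores.items()):
    print(f"node {node}: {score:.3f}")
```

Nodes near the start node collect more visits than distant ones, which is the property the score is meant to capture.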

Individual Score Calculation

          Q1      Q2      Q3
Node 1    0.5767  0.0088  0.0088
Node 2    0.1235  0.0076  0.0076
Node 3    0.0283  0.0283  0.0283
Node 4    0.0076  0.1235  0.0076
Node 5    0.0088  0.5767  0.0088
Node 6    0.0076  0.0076  0.1235
Node 7    0.0088  0.0088  0.5767
Node 8    0.0333  0.0024  0.1260
Node 9    0.1260  0.0024  0.0333
Node 10   0.1260  0.0333  0.0024
Node 11   0.0333  0.1260  0.0024
Node 12   0.0024  0.1260  0.0333
Node 13   0.0024  0.0333  0.1260

[Figure: the example graph, each node labeled with its score]


Individual Score matrix


AND: Combining Scores

• Q: How to combine scores?
• A: Multiply
• … = prob. 3 random particles coincide on node j
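As a small worked sketch of the multiplication rule, the block below takes the individual scores from the score table (rows are nodes 1 to 13, columns are Q1, Q2, Q3) and multiplies across each row. The array name `scores` and the use of NumPy are choices made for this illustration only.

```python
import numpy as np

# Individual RWR scores from the slides: rows = nodes 1..13, columns = Q1, Q2, Q3.
scores = np.array([
    [0.5767, 0.0088, 0.0088],
    [0.1235, 0.0076, 0.0076],
    [0.0283, 0.0283, 0.0283],
    [0.0076, 0.1235, 0.0076],
    [0.0088, 0.5767, 0.0088],
    [0.0076, 0.0076, 0.1235],
    [0.0088, 0.0088, 0.5767],
    [0.0333, 0.0024, 0.1260],
    [0.1260, 0.0024, 0.0333],
    [0.1260, 0.0333, 0.0024],
    [0.0333, 0.1260, 0.0024],
    [0.0024, 0.1260, 0.0333],
    [0.0024, 0.0333, 0.1260],
])

# AND score of node j: product of its scores with respect to every query node.
and_scores = scores.prod(axis=1)

for node, value in enumerate(and_scores, start=1):
    print(f"node {node:2d}: {value:.4e}")
```

Under this rule a node scores well only if it is reasonably close to all three query nodes at once.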


K_SoftAnd: Combine Scores

Generalization – softAND:

We want nodes close to k of Q (k<Q) query nodes.

Q: How to do that?

A: Prob(at least k-out-of-Q will meet each other at j)
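One way to evaluate "at least k-out-of-Q" is sketched below. It assumes the Q random particles move independently, so that the individual scores of a node can be treated as independent probabilities; the slides do not spell out the formula, and `k_softand` is a hypothetical helper name. The function builds the distribution of "how many particles are at node j" one query at a time and sums the tail.

```python
def k_softand(probs, k):
    """P(at least k of the independent events with probabilities `probs` occur)."""
    # dist[m] = probability that exactly m of the events seen so far occurred
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, mass in enumerate(dist):
            nxt[m] += mass * (1.0 - p)   # this particle is not at node j
            nxt[m + 1] += mass * p       # this particle is at node j
        dist = nxt
    return sum(dist[k:])

# Scores of node 3 with respect to Q1, Q2, Q3, taken from the score table.
node3 = [0.0283, 0.0283, 0.0283]

print("3_SoftAnd (AND):", k_softand(node3, 3))
print("2_SoftAnd      :", k_softand(node3, 2))
print("1_SoftAnd (OR) :", k_softand(node3, 1))
```

With k = Q this reduces to the plain product (the AND query), and with k = 1 it is the probability that at least one particle is at the node (an OR query).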

AND query vs. K_SoftAnd query

[Figure: the example graph with query nodes 1, 5 and 7, scored two ways. Panels: AND query (scores x 1e-4); 2_SoftAnd query]

[Figure: the example graph, each node labeled with its 1_SoftAnd score]

1_SoftAnd query = OR query

Challenges in CePS

• Q1: How to measure the importance?
• Q2: How to extract connection subgraph?
• Q3: How to do it efficiently?

“Extract” Alg.

• Goal
  – Maximize total scores, and
  – ‘Appropriate’ connections
• How to… “Extract” Alg.
  – Dynamic Programming
  – Greedy Alg.
    • Pick up a promising node
    • Find the ‘best’ path

[Figure: example graph and the extracted subgraph]

Challenges in CePS

• Q1: How to measure the importance?
• Q2: How to extract connection subgraph?
• Q3: How to do it efficiently?

Graph Partition: Efficiency Issue

• Straightforward way
  – Q linear systems (one per query node)
  – linear in # of edges
• Observation
  – Skewed dist.
• How to…
  – Graph partition


Even better:

• We can correct for the deleted edges (Tong+, ICDM’06, best paper award)

• But let’s omit the details

Experimental Setup

• Dataset
  – DBLP/authorship
  – Author-Paper
  – 315k nodes
  – 1,800k edges

Experimental Setup

• We want to check
  – Do the goodness criteria make sense?
  – Does the “extract” alg. capture most of the important nodes/edges?
  – Efficiency

Case Study: AND query

[Figure: center-piece subgraph for the AND query, with numeric edge labels. Authors shown: R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Heikki Mannila, Christos Faloutsos, Padhraic Smyth, Corinna Cortes, Daryl Pregibon]

Case Study: 2_SoftAnd query

[Figure: center-piece subgraph for the 2_SoftAnd query, with numeric edge labels and two groups labeled ‘Statistic’ and ‘database’. Authors shown: R. Agrawal, Jiawei Han, V. Vapnik, M. Jordan, H.V. Jagadish, Laks V.S. Lakshmanan, Umeshwar Dayal, Bernhard Scholkopf, Peter L. Bartlett, Alex J. Smola]

Running Time vs. Quality for Fast CePS

[Plot: quality vs. running time; annotated point: ~90% quality at 6:1 speedup]

Conclusion

• Q1: How to measure the importance?
• A1: RWR + K_SoftAnd
• Q2: How to find connection subgraph?
• A2: “Extract” Alg.
• Q3: How to do it efficiently?
• A3: Graph Partition (Fast CePS)
  – ~90% quality
  – 6:1 speedup

Overview

• Mining Static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
• Other topics

Random walk with restart

Ranking vector r_4 (query: node 4)

Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02

[Figure: 12-node example graph colored by score; more red, more relevant]

• Nearby nodes, higher scores

Computing RWR

$$\vec{r}^{(t+1)} = 0.9\,\tilde{W}\,\vec{r}^{(t)} + 0.1\,\vec{e}_4$$

• Ranking vector r (n x 1): (0.13, 0.10, 0.13, 0.22, 0.13, 0.05, 0.05, 0.08, 0.04, 0.03, 0.04, 0.02)
• Starting vector e_4 (n x 1): 1 at node 4, 0 elsewhere
• Adjacency matrix W~ (n x n), column-normalized:

       1    2    3    4    5    6    7    8    9    10   11   12
 1     0    1/3  1/3  1/3  0    0    0    0    0    0    0    0
 2     1/3  0    1/3  0    0    0    0    1/4  0    0    0    0
 3     1/3  1/3  0    1/3  0    0    0    0    0    0    0    0
 4     1/3  0    1/3  0    1/4  0    0    0    0    0    0    0
 5     0    0    0    1/3  0    1/2  1/2  1/4  0    0    0    0
 6     0    0    0    0    1/4  0    1/2  0    0    0    0    0
 7     0    0    0    0    1/4  1/2  0    0    0    0    0    0
 8     0    1/3  0    0    1/4  0    0    0    1/2  0    1/3  0
 9     0    0    0    0    0    0    0    1/4  0    1/3  0    0
10     0    0    0    0    0    0    0    0    1/2  0    1/3  1/2
11     0    0    0    0    0    0    0    1/4  0    1/3  0    1/2
12     0    0    0    0    0    0    0    0    0    1/3  1/3  0

[Figure: the 12-node example graph]
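The update on this slide can be run as written. The sketch below rebuilds the 12-node graph from the nonzero pattern of the slide's matrix (the edge list is read off that pattern), iterates the update with the 0.9 / 0.1 coefficients shown, and compares against the closed-form solution of the same linear system. Variable names are choices made for this illustration.

```python
import numpy as np

# Undirected edges of the 12-node example, read off the nonzero pattern of the matrix.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 8), (3, 4), (4, 5), (5, 6), (5, 7),
         (5, 8), (6, 7), (8, 9), (8, 11), (9, 10), (10, 11), (10, 12), (11, 12)]
n = 12
A = np.zeros((n, n))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1.0

W = A / A.sum(axis=0)      # column-normalize: every column sums to 1

c = 0.9                    # weight of the walk step; restart weight is 1 - c = 0.1
e = np.zeros(n)
e[4 - 1] = 1.0             # starting vector: query node 4

# Iterative solution: r(t+1) = c * W r(t) + (1 - c) * e
r = e.copy()
for _ in range(1000):
    r = c * W @ r + (1 - c) * e

# Closed form of the fixed point: r = (1 - c) * (I - c W)^(-1) e
r_exact = (1 - c) * np.linalg.solve(np.eye(n) - c * W, e)

print(np.round(r, 2))
```

The printed vector can be compared against the ranking vector on the slide; the closed form is what the "precompute everything" alternative would store for every possible query node.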


Alternatives

• On-the-fly: precompute nothing -> slow

• Precompute a little, and adjust on-the-fly

• Precompute everything -> O(N*N) space

Computing RWR

[Figure: the 12-node example graph]

• Break into ‘communities’


FastRWR

• Instead of ONE BIG (and dense) inverted matrix

• Several, smaller matrices, plus info about the ‘bridges’
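The slides omit the details, so the following is only a sketch of how "several smaller matrices plus bridge information" can reproduce the big inverse. It assumes hand-picked communities on the 12-node example, an exact SVD of the cross-community ('bridge') part of the matrix, and the Sherman-Morrison-Woodbury identity to combine the pieces; none of these choices is confirmed by the slides, and `fast_rwr` is a hypothetical name.

```python
import numpy as np

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 8), (3, 4), (4, 5), (5, 6), (5, 7),
         (5, 8), (6, 7), (8, 9), (8, 11), (9, 10), (10, 11), (10, 12), (11, 12)]
n, c = 12, 0.9
A = np.zeros((n, n))
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1.0
W = A / A.sum(axis=0)                      # column-normalized adjacency matrix

# Hand-picked 'communities' (0-based node indices); an assumption for this sketch.
communities = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]

# Split W into a within-community part W1 (block diagonal) and a bridge part W2.
W1 = np.zeros_like(W)
for block in communities:
    idx = np.ix_(block, block)
    W1[idx] = W[idx]
W2 = W - W1

# Pre-compute: invert each small block of (I - c W1) separately.
Q1 = np.zeros_like(W)
for block in communities:
    idx = np.ix_(block, block)
    Q1[idx] = np.linalg.inv(np.eye(len(block)) - c * W1[idx])

# Pre-compute: factor the bridges as W2 = U S V, keeping only nonzero singular values.
U, s, Vt = np.linalg.svd(W2)
keep = s > 1e-10
U, S, V = U[:, keep], np.diag(s[keep]), Vt[keep, :]
Lam = np.linalg.inv(np.linalg.inv(S) - c * V @ Q1 @ U)

def fast_rwr(query):
    """Query time: only matrix-vector products with the pre-computed pieces."""
    e = np.zeros(n)
    e[query] = 1.0
    q1e = Q1 @ e
    return (1 - c) * (q1e + c * Q1 @ (U @ (Lam @ (V @ q1e))))

r_fast = fast_rwr(3)                       # query node 4 (0-based index 3)
r_direct = (1 - c) * np.linalg.solve(np.eye(n) - c * W, np.eye(n)[3])
print("max abs difference:", np.max(np.abs(r_fast - r_direct)))
```

Because the SVD here is exact, the result agrees with the direct solve up to rounding. Truncating the SVD to fewer singular values would shrink the stored pieces further at the cost of some accuracy.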

Query Time vs. Pre-Compute Time

[Plot: log query time vs. log pre-compute time]

• Quality: 90%+
• On-line:
  – Up to 150x speedup
• Pre-computation:
  – Two orders saving

Query Time vs. Pre-Storage

[Plot: log query time vs. log storage]

• Quality: 90%+
• On-line:
  – Up to 150x speedup
• Pre-storage:
  – Three orders saving

Conclusion

• FastRWR
  – Good accuracy (90%+)
  – 150x speed-up: query time
  – Orders of magnitude saving: pre-compute & storage

Overview

• Mining Static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
• Other topics


Best-effort Sub-Graph Matching, on Attributed Graphs

‘Best-effort’: problem dfn.

• Nodes have one (categorical) attribute
• Query: e.g., loop -> ‘money laundering’

[Figure: synthetic data]

[Figure: loop query and its results]

[Figure: star query and its results]

DBLP dataset

• Authorship Graph
  – Nodes: authors
  – Edges: # of co-authored papers
  – Attributes: Conference and Year
• ~300k nodes, ~1m edges

Line Query: Results

[Figure: line query over four “People” nodes with attributes STOC05, SIGMOD96, ICML93 and ISBMS, and the matching subgraphs. Authors shown in the results: Moses Charikar, Surajit Chaudhuri, Usama M. Fayyad, Sidney Fels, Pietro Perona, Max Welling, Geoffrey E. Hinton, Gagan Aggarwal, Hector Garcia Molina, Sebastian Thrun, Gábor Székely, Wolfram Burgard, Hans Burkhardt, Haymo Kurz, James Davis]

Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.

Star-Query: Results

[Figure: star query over “People” nodes with attributes IAT, PODS and ISBMS, and the matching subgraphs. Authors shown in the results: Li Yan Yuan, Xiaobo Li, Xianchang Wang, Huowang Chen, Lei Xu, Jia-Huai You, Haixun Wang, Reinhard Männer, Bing Liu, Zhong-Fei (Mark) Zhang, Philip S. Yu]

Footnote for results: red nodes are qualifying nodes; white nodes are intermediate nodes.

Loop-Query: Results

[Figure: loop query over four “People” nodes with attributes ICML93, RECOMB00, INFOCOM00 and KDD96, and the matching subgraph. Authors shown in the results: Stan Matwin, Richard M. Karp, Scott Shenker, Haym Hirsh, Amir Ben Dor, Nathalie Japkowicz, Yonatan Aumann, Amy P. Felty, Andrew W Appel, Kai Lei]

P.I.T. Terrorist Relations

• Nodes: terrorist relationships
  – Attributes: Family, Contact, Colleague, Congregate
• Edges: two relationships share a common person
• ~1k nodes and ~8k edges

Star-Query: Results

[Figure: star query over the attributes Contact, Family, Colleague and Congregate, and the matching relationship nodes (labeled by id)]

Overview

• Mining Static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
  – Tensors + intrusion detection
  – Other tools (MDL)
• Other topics

Tensors for time evolving graphs

• [Jimeng Sun+, KDD’06]
• [Jimeng Sun+, SDM’07]
• [CF, Kolda, Sun, SDM’07 tutorial]

Social network analysis

• Static: find community structures
• Dynamic: monitor community structure evolution; spot abnormal individuals; abnormal time-stamps

[Figure: authors x keywords matrices from 1990 to 2004, with DB and DM communities marked]

Network Forensics

• Directional network flows
• A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
• Task: identify abnormal traffic patterns and find out the cause

[Figure: source x destination traffic matrices. Panels: normal traffic; abnormal traffic]


Tensors - outline

• Motivation

• Main ideas

• Experiments

Static case

• For a timestamp, data can be modeled using a tensor (matrix == 2-mode tensor)

[Figure: Location x Type matrix at Time = 0; types shown: temperature, light]

Dynamic case: Tensor streams

[Figure: Location x Type matrix at Time = 0; types shown: temperature, light]


Dynamic Data model: Tensor streams

• Streams come with structure
  – (time, location, sensor-modality)
  – (time, host-id, measurement-type)

[Figure: Location x Type matrices stacked along time; one stream highlighted: (Jimeng’s Desk, light)]


What is the factor?

• Factor is a set of 1D summaries
• Multi-linear approximation on all aspects

[Figure: 1st factor shown as 1D summaries along Location (close to window vs. away from window), Type, and Time (day vs. night)]
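To make "a factor is a set of 1D summaries" concrete, here is a small sketch under assumed data: a synthetic time x location x type tensor built from three made-up 1D profiles plus noise. It recovers one summary vector per aspect from the leading singular vector of each unfolding and rebuilds the tensor from their outer product. This is a plain rank-1 illustration, not the algorithm from the cited papers; all profile values and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D summaries along each aspect (mode) of a sensor tensor.
time_profile = np.array([0.1, 0.2, 1.0, 1.0, 0.2, 0.1])   # e.g. night, day, night
location_profile = np.array([1.0, 0.8, 0.3, 0.2])          # e.g. close to vs. away from a window
type_profile = np.array([1.0, 0.6, 0.1])                   # e.g. three sensor types

# A tensor dominated by one factor (outer product of the summaries) plus small noise.
tensor = np.einsum("i,j,k->ijk", time_profile, location_profile, type_profile)
tensor += 0.01 * rng.standard_normal(tensor.shape)

def leading_vector(t, mode):
    """Leading left singular vector of the unfolding of tensor t along `mode`."""
    unfolded = np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, 0]

u_time, u_loc, u_type = (leading_vector(tensor, m) for m in range(3))

# Scaling factor: projection of the tensor onto the three 1D summaries.
scale = np.einsum("ijk,i,j,k->", tensor, u_time, u_loc, u_type)
approx = scale * np.einsum("i,j,k->ijk", u_time, u_loc, u_type)

rel_error = np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)
print(f"relative reconstruction error: {rel_error:.3f}")
```

Three short vectors and one scaling number stand in for the whole tensor, which is why a factor is easy to plot and interpret one aspect at a time.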


Tensors - outline

• Motivation

• Main ideas

• Experiments

WTA on real sensor data

• 1st factor consists of the main trends:
  – Daily periodicity on time
  – Uniform on all locations
  – Temp, Light and Volt are positively correlated with each other, and negatively correlated with Humid

[Figure: 1st factor (scaling factor 250), plotted along type, location and time]

WTA on real sensor data (cont.)

• 2nd factor captures an atypical trend:
  – Uniform across all time
  – Concentrated on 3 locations
  – Mainly due to voltage
• Interpretation: two sensors have low battery, and the other one has high battery.

[Figure: 2nd factor (scaling factor 154), plotted along type, location and time]

Application 1: Multiway latent semantic indexing (LSI)

• Projection matrices specify the clusters
• Core tensors give cluster activation level

[Figure: authors x keyword tensors from 1990 to 2004 with projection matrices U_authors and U_keyword. Labels shown: DB, DM, Michael Stonebraker, Philip Yu, Query, Pattern]

Bibliographic data (DBLP)

• Papers from VLDB and KDD conferences
• Construct 2nd order tensors with yearly windows with <author, keywords>
• Each tensor: 4584 x 3741
• 11 timestamps (years)

Multiway LSI

Authors | Keywords | Year | Group
michael carey, michael stonebraker, h. jagadish, hector garcia-molina | queri, parallel, optimization, concurr, objectorient | 1995 | DB
surajit chaudhuri, mitch cherniack, michael stonebraker, ugur çetintemel | distribut, systems, view, storage, servic, process, cache | 2004 | DB
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal | streams, pattern, support, cluster, index, gener, queri | 2004 | DM

• Two groups are correctly identified: Databases and Data mining
• People and concepts are drifting over time

Application 2: Network Anomaly Detection

• Anomaly detection
  – Reconstruction error driven
  – Multiple resolution
• Data
  – TCP flows collected at CMU backbone
  – Raw data 500GB with compression
  – Construct 3rd order tensors with hourly windows with <source, destination, port #>
  – 1200 timestamps (hours)
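A reconstruction-error detector can be sketched as follows. Everything here is an assumption made for illustration: the traffic is synthetic, each hourly window is a 2nd order source x destination matrix rather than the 3rd order tensor used in the talk, and the projection matrices are learned by a plain SVD on a few early "normal" hours. The idea shown is only that a window which does not fit the learned low-rank pattern has a large reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_dst, n_hours = 8, 10, 12

# Hypothetical "normal" traffic: one stable source x destination pattern per hour.
src_profile = rng.random(n_src)
dst_profile = rng.random(n_dst)
stream = [
    np.outer(src_profile, dst_profile) * (1 + 0.1 * rng.standard_normal())
    + 0.01 * rng.standard_normal((n_src, n_dst))
    for _ in range(n_hours)
]

# Inject a scanner at hour 7: source 3 suddenly contacts every destination.
anomalous_hour = 7
stream[anomalous_hour][3, :] += 1.0

def leading_subspace(matrix, rank):
    """Orthonormal basis of the top-`rank` left singular subspace."""
    u, _, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank]

# Learn one projection matrix per aspect from the first few (normal) hours.
train = stream[:5]
U = leading_subspace(np.hstack(train), rank=1)                  # source side
V = leading_subspace(np.hstack([x.T for x in train]), rank=1)   # destination side

def reconstruction_error(x):
    """Relative error after projecting x onto the learned subspaces and back."""
    approx = U @ (U.T @ x @ V) @ V.T
    return np.linalg.norm(x - approx) / np.linalg.norm(x)

errors = np.array([reconstruction_error(x) for x in stream])
flagged = int(np.argmax(errors))
print("errors per hour:", np.round(errors, 3))
print("flagged hour:", flagged)
```

Thresholding `errors` answers "when"; inspecting the residual `x - approx` for the flagged hour points at the rows and columns responsible, which answers "where".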

Network anomaly detection

• Identify when and where anomalies occurred.
• Prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin).

[Figure: reconstruction error vs. time (hour), with scanners marked; source x destination matrices for an abnormal and a normal hour]

Computational cost

[Plots: CPU time for the 3rd order network tensor and for the 2nd order DBLP tensor]

• OTA is the offline tensor analysis
• Performance metric: CPU time (sec)
• Observations:
  – DTA and STA are orders of magnitude faster than OTA
  – The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)

Accuracy comparison

• Performance metric: the ratio of reconstruction error between DTA/STA and OTA; fixing the error of OTA to 20%
• Observation: DTA performs very close to OTA in both datasets; STA performs worse in DBLP due to the bigger changes.

[Plots: 3rd order network tensor; 2nd order DBLP tensor]

InteMon: intelligent monitoring system on large clusters

[VLDB06 demo]
[Operating System Review 06]


System Architecture

Case 1: Environmental Monitoring

• Abnormal dehumidification and reheating cycle is identified

[Figure: temperature and humidity time series]

Overview

• Mining Static graphs
  – CenterPiece Subgraphs (CePS)
  – Fast RWR computation
  – ‘best-effort’ subgraph matching (in progress)
• Mining time-evolving graphs
  – Tensors + intrusion detection
  – Other tools (MDL)
• Other topics

Parameter-free mining

• Using MDL, to
  – Find ‘natural’ communities
  – Find ‘natural’ cut-points
• (under submission)


MDL mining on time-evolving graph (Enron emails)

Overview

• Mining Static graphs
• Mining time-evolving graphs
• Other topics
  – Graph sampling
  – Graph generators


Realistic graph generation

• Kronecker graphs [Leskovec+, PKDD’05]

• [Leskovec+, under review]

Why fit graph models?

• Parameters tell us about the structure of a graph
• Extrapolation: given a graph today, how will it look in a year?
• Sampling: can I get a smaller graph with similar properties?
• Anonymization: instead of releasing the real graph (e.g., email network), we can release a synthetic version of it

Experiments on real AS graph

[Plots: degree distribution; hop plot; network value; adjacency matrix eigenvalues]


Intro to Kronecker graphs

Problem Definition

• Given a growing graph with count of nodes N1, N2, …
• Generate a realistic sequence of graphs that will obey all the patterns
• Idea: Self-similarity
  – Leads to power laws
  – Communities within communities
  – …

Recursive Graph Generation

• There are many obvious (but wrong) ways
  – Does not obey Densification Power Law
  – Has increasing diameter
• Kronecker Product is exactly what we need

[Figure: initial graph and its recursive expansion]

Kronecker Product – a Graph

[Figure: initial graph and intermediate stage, each shown with its adjacency matrix]

Kronecker Product – a Graph

• Continuing to multiply with G1 we obtain G4, and so on …

[Figure: G4 adjacency matrix]
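The repeated multiplication can be sketched with NumPy's `np.kron`. The 3-node initiator matrix below is a hypothetical example (the slide's own G1 is not recoverable from the transcript), and `kronecker_power` is a helper name chosen for this illustration.

```python
import numpy as np

# Hypothetical 3-node initiator graph G1 (with self-loops), as an adjacency matrix.
G1 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])

def kronecker_power(initiator, k):
    """Return the k-th Kronecker power of the initiator adjacency matrix."""
    result = initiator
    for _ in range(k - 1):
        result = np.kron(result, initiator)   # G_k = G_(k-1) (x) G1
    return result

for k in range(1, 5):
    Gk = kronecker_power(G1, k)
    print(f"G{k}: {Gk.shape[0]} nodes, {int(Gk.sum())} nonzero entries")
```

With this initiator the node count grows as 3^k and the number of nonzero entries as 7^k, so each step reproduces the previous graph's structure inside itself, which is the self-similarity the slides appeal to.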

Conclusions

• Static graphs: Random Walks, ‘CePS’, best-effort sub-graph matching
• Dynamic graphs: Tensors (intrusion/change detection)
• Graph generation: Kronecker

References

• Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast Random Walk with Restart and Its Applications. ICDM 2006, Hong Kong.
• Hanghang Tong, Christos Faloutsos. Center-Piece Subgraphs: Problem Definition and Fast Solutions. KDD 2006, Philadelphia, PA.
• Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. KDD 2005, Chicago, IL ("Best Research Paper" award).
• Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. ECML/PKDD 2005, Porto, Portugal.
• Jimeng Sun, Dacheng Tao, Christos Faloutsos. Beyond Streams and Graphs: Dynamic Tensor Analysis. KDD 2006, Philadelphia, PA.
• Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM, Minneapolis, Minnesota, Apr 2007.

Thank you!

Contact info: {christos, htong, jimeng, jure} <at> cs.cmu.edu

www.cs.cmu.edu/~christos

(w/ papers, datasets, code, etc.)