graph mining: laws, generators and tools - stanford university

105
CMU SCS Graph Mining: Laws, Generators and Tools Christos Faloutsos CMU

Upload: others

Post on 03-Feb-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

Graph Mining: Laws, Generators

and Tools

Christos Faloutsos

CMU

Page 2: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 2

Thanks

ā€¢ Michael Mahoney

ā€¢ Lek-Heng Lim

ā€¢ Petros Drineas

ā€¢ Gunnar Carlsson

Page 3: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 4

Outline

ā€¢ Problem definition / Motivation

ā€¢ Static & dynamic laws; generators

ā€¢ Tools: CenterPiece graphs; Tensors

ā€¢ Other projects (Virus propagation, e-bay

fraud detection)

ā€¢ Conclusions

Page 4: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 5

Motivation

Data mining: ~ find patterns (rules, outliers)

ā€¢ Problem#1: How do real graphs look like?

ā€¢ Problem#2: How do they evolve?

ā€¢ Problem#3: How to generate realistic graphs

TOOLS

ā€¢ Problem#4: Who is the ā€˜master-mindā€™?

ā€¢ Problem#5: Track communities over time

Page 5: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 6

Problem#1: Joint work with

Dr. Deepayan Chakrabarti

(CMU/Yahoo R.L.)

Page 6: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 7

Graphs - why should we care?

Internet Map

[lumeta.com]

Food Web

[Martinez ā€™91]

Protein Interactions

[genomebiology.com]

Friendship Network

[Moody ā€™01]

Page 7: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 8

Graphs - why should we care?

ā€¢ IR: bi-partite graphs (doc-terms)

ā€¢ web: hyper-text graph

ā€¢ ... and more:

D1

DN

T1

TM

... ...

Page 8: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 9

Graphs - why should we care?

ā€¢ network of companies & board-of-directors

members

ā€¢ ā€˜viralā€™ marketing

ā€¢ web-log (ā€˜blogā€™) news propagation

ā€¢ computer network security: email/IP traffic

and anomaly detection

ā€¢ ....

Page 9: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 10

Problem #1 - network and graph

mining

ā€¢ How does the Internet look like?

ā€¢ How does the web look like?

ā€¢ What is ā€˜normalā€™/ā€˜abnormalā€™?

ā€¢ which patterns/laws hold?

Page 10: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 11

Graph mining

ā€¢ Are real graphs random?

Page 11: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 12

Laws and patterns

ā€¢ Are real graphs random?

ā€¢ A: NO!!

ā€“ Diameter

ā€“ in- and out- degree distributions

ā€“ other (surprising) patterns

Page 12: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 13

Solution#1

ā€¢ Power law in the degree distribution

[SIGCOMM99]

log(rank)

log(degree)

-0.82

internet domains

att.com

ibm.com

Page 13: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 14

Solution#1ā€Ÿ: Eigen Exponent E

ā€¢ A2: power law in the eigenvalues of the adjacency matrix

E = -0.48

Exponent = slope

Eigenvalue

Rank of decreasing eigenvalue

May 2001

Page 14: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 15

Solution#1ā€Ÿ: Eigen Exponent E

ā€¢ [Papadimitriou, Mihail, ā€™02]: slope is Ā½ of rank exponent

E = -0.48

Exponent = slope

Eigenvalue

Rank of decreasing eigenvalue

May 2001

Page 15: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 16

But:

How about graphs from other domains?

Page 16: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 17

The Peer-to-Peer Topology

ā€¢ Count versus degree

ā€¢ Number of adjacent peers follows a power-law

[Jovanovic+]

Page 17: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 18

More power laws:

citation counts: (citeseer.nj.nec.com 6/2001)

log(#citations)

log(count)

Ullman

Page 18: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 19

More power laws:

ā€¢ web hit counts [w/ A. Montgomery]

Web Site Traffic

log(in-degree)

log(count)

Zipf

userssites

``ebayā€™ā€™

Page 19: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 20

epinions.com

ā€¢ who-trusts-whom

[Richardson +

Domingos, KDD

2001]

(out) degree

count

trusts-2000-people user

Page 20: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 21

Motivation

Data mining: ~ find patterns (rules, outliers)

ā€¢ Problem#1: How do real graphs look like?

ā€¢ Problem#2: How do they evolve?

ā€¢ Problem#3: How to generate realistic graphs

TOOLS

ā€¢ Problem#4: Who is the ā€˜master-mindā€™?

ā€¢ Problem#5: Track communities over time

Page 21: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 22

Problem#2: Time evolution

ā€¢ with Jure Leskovec

(CMU/MLD)

ā€¢ and Jon Kleinberg (Cornell ā€“

sabb. @ CMU)

Page 22: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 23

Evolution of the Diameter

ā€¢ Prior work on Power Law graphs hints

at slowly growing diameter:

ā€“ diameter ~ O(log N)

ā€“ diameter ~ O(log log N)

ā€¢ What is happening in real data?

Page 23: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 24

Evolution of the Diameter

ā€¢ Prior work on Power Law graphs hints

at slowly growing diameter:

ā€“ diameter ~ O(log N)

ā€“ diameter ~ O(log log N)

ā€¢ What is happening in real data?

ā€¢ Diameter shrinks over time

Page 24: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 25

Diameter ā€“ ArXiv citation graph

ā€¢ Citations among

physics papers

ā€¢ 1992 ā€“2003

ā€¢ One graph per

year

time [years]

diameter

Page 25: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 26

Diameter ā€“ ā€œAutonomous

Systemsā€

ā€¢ Graph of Internet

ā€¢ One graph per

day

ā€¢ 1997 ā€“ 2000

number of nodes

diameter

Page 26: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 27

Diameter ā€“ ā€œAffiliation Networkā€

ā€¢ Graph of

collaborations in

physics ā€“ authors

linked to papers

ā€¢ 10 years of data

time [years]

diameter

Page 27: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 28

Diameter ā€“ ā€œPatentsā€

ā€¢ Patent citation

network

ā€¢ 25 years of data

time [years]

diameter

Page 28: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 29

Temporal Evolution of the Graphs

ā€¢ N(t) ā€¦ nodes at time t

ā€¢ E(t) ā€¦ edges at time t

ā€¢ Suppose that

N(t+1) = 2 * N(t)

ā€¢ Q: what is your guess for

E(t+1) =? 2 * E(t)

Page 29: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 30

Temporal Evolution of the Graphs

ā€¢ N(t) ā€¦ nodes at time t

ā€¢ E(t) ā€¦ edges at time t

ā€¢ Suppose that

N(t+1) = 2 * N(t)

ā€¢ Q: what is your guess for

E(t+1) =? 2 * E(t)

ā€¢ A: over-doubled!

ā€“ But obeying the ``Densification Power Lawā€™ā€™

Page 30: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 31

Densification ā€“ Physics Citations

ā€¢ Citations among physics papers

ā€¢ 2003:

ā€“ 29,555 papers, 352,807 citations

N(t)

E(t)

??

Page 31: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 32

Densification ā€“ Physics Citations

ā€¢ Citations among physics papers

ā€¢ 2003:

ā€“ 29,555 papers, 352,807 citations

N(t)

E(t)

1.69

Page 32: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 33

Densification ā€“ Physics Citations

ā€¢ Citations among physics papers

ā€¢ 2003:

ā€“ 29,555 papers, 352,807 citations

N(t)

E(t)

1.69

1: tree

Page 33: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 34

Densification ā€“ Physics Citations

ā€¢ Citations among physics papers

ā€¢ 2003:

ā€“ 29,555 papers, 352,807 citations

N(t)

E(t)

1.69clique: 2

Page 34: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 35

Densification ā€“ Patent Citations

ā€¢ Citations among

patents granted

ā€¢ 1999

ā€“ 2.9 million nodes

ā€“ 16.5 million

edges

ā€¢ Each year is a

datapoint N(t)

E(t)

1.66

Page 35: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 36

Densification ā€“ Autonomous Systems

ā€¢ Graph of

Internet

ā€¢ 2000

ā€“ 6,000 nodes

ā€“ 26,000 edges

ā€¢ One graph per

day

N(t)

E(t)

1.18

Page 36: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 37

Densification ā€“ Affiliation

Network

ā€¢ Authors linked

to their

publications

ā€¢ 2002

ā€“ 60,000 nodes

ā€¢ 20,000 authors

ā€¢ 38,000 papers

ā€“ 133,000 edgesN(t)

E(t)

1.15

Page 37: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 38

Motivation

Data mining: ~ find patterns (rules, outliers)

ā€¢ Problem#1: How do real graphs look like?

ā€¢ Problem#2: How do they evolve?

ā€¢ Problem#3: How to generate realistic graphs

TOOLS

ā€¢ Problem#4: Who is the ā€˜master-mindā€™?

ā€¢ Problem#5: Track communities over time

Page 38: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 39

Problem#3: Generation

ā€¢ Given a growing graph with count of nodes N1,

N2, ā€¦

ā€¢ Generate a realistic sequence of graphs that will

obey all the patterns

Page 39: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 40

Problem Definition

ā€¢ Given a growing graph with count of nodes N1, N2, ā€¦

ā€¢ Generate a realistic sequence of graphs that will obey all the patterns

ā€“ Static Patterns

Power Law Degree Distribution

Power Law eigenvalue and eigenvector distribution

Small Diameter

ā€“ Dynamic Patterns

Growth Power Law

Shrinking/Stabilizing Diameters

Page 40: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 41

Problem Definition

ā€¢ Given a growing graph with count of nodes

N1, N2, ā€¦

ā€¢ Generate a realistic sequence of graphs that

will obey all the patterns

ā€¢ Idea: Self-similarity

ā€“ Leads to power laws

ā€“ Communities within communities

ā€“ ā€¦

Page 41: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 42Adjacency matrix

Kronecker Product ā€“ a Graph

Intermediate stage

Adjacency matrix

Page 42: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 43

Kronecker Product ā€“ a Graph

ā€¢ Continuing multiplying with G1 we obtain G4 and

so on ā€¦

G4 adjacency matrix

Page 43: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 44

Kronecker Product ā€“ a Graph

ā€¢ Continuing multiplying with G1 we obtain G4 and

so on ā€¦

G4 adjacency matrix

Page 44: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 45

Kronecker Product ā€“ a Graph

ā€¢ Continuing multiplying with G1 we obtain G4 and

so on ā€¦

G4 adjacency matrix

Page 45: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 46

Properties:

ā€¢ We can PROVE that

ā€“ Degree distribution is multinomial ~ power law

ā€“ Diameter: constant

ā€“ Eigenvalue distribution: multinomial

ā€“ First eigenvector: multinomial

ā€¢ See [Leskovec+, PKDDā€™05] for proofs

Page 46: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 47

Problem Definition

ā€¢ Given a growing graph with nodes N1, N2, ā€¦

ā€¢ Generate a realistic sequence of graphs that will obey all

the patterns

ā€“ Static Patterns

Power Law Degree Distribution

Power Law eigenvalue and eigenvector distribution

Small Diameter

ā€“ Dynamic Patterns

Growth Power Law

Shrinking/Stabilizing Diameters

ā€¢ First and only generator for which we can prove

all these properties

Page 47: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 48

Stochastic Kronecker Graphs

ā€¢ Create N1N1 probability matrix P1

ā€¢ Compute the kth Kronecker power Pk

ā€¢ For each entry puv of Pk include an edge

(u,v) with probability puv

0.4 0.2

0.1 0.3

P1

Instance

Matrix G2

0.16 0.08 0.08 0.04

0.04 0.12 0.02 0.06

0.04 0.02 0.12 0.06

0.01 0.03 0.03 0.09

Pk

flip biased

coins

Kronecker

multiplication

skip

Page 48: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 49

Experiments

ā€¢ How well can we match real graphs?

ā€“ Arxiv: physics citations:

ā€¢ 30,000 papers, 350,000 citations

ā€¢ 10 years of data

ā€“ U.S. Patent citation network

ā€¢ 4 million patents, 16 million citations

ā€¢ 37 years of data

ā€“ Autonomous systems ā€“ graph of internet

ā€¢ Single snapshot from January 2002

ā€¢ 6,400 nodes, 26,000 edges

ā€¢ We show both static and temporal patterns

Page 49: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 50

(Q: how to fit the parmā€Ÿs?)

A:

ā€¢ Stochastic version of Kronecker graphs +

ā€¢ Max likelihood +

ā€¢ Metropolis sampling

ā€¢ [Leskovec+, ICMLā€™07]

Page 50: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 51

Experiments on real AS graphDegree distribution Hop plot

Network valueAdjacency matrix eigen values

Page 51: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 52

Conclusions

ā€¢ Kronecker graphs have:

ā€“ All the static properties

Heavy tailed degree distributions

Small diameter

Multinomial eigenvalues and eigenvectors

ā€“ All the temporal properties

Densification Power Law

Shrinking/Stabilizing Diameters

ā€“ We can formally prove these results

Page 52: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 53

Motivation

Data mining: ~ find patterns (rules, outliers)

ā€¢ Problem#1: How do real graphs look like?

ā€¢ Problem#2: How do they evolve?

ā€¢ Problem#3: How to generate realistic graphs

TOOLS

ā€¢ Problem#4: Who is the ā€˜master-mindā€™?

ā€¢ Problem#5: Track communities over time

Page 53: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 54

Problem#4: MasterMind ā€“ ā€žCePSā€Ÿ

ā€¢ w/ Hanghang Tong,

KDD 2006

ā€¢ htong <at> cs.cmu.edu

Page 54: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 55

Center-Piece Subgraph(Ceps)

ā€¢ Given Q query nodes

ā€¢ Find Center-piece ( )

ā€¢ App.

ā€“ Social Networks

ā€“ Law Inforcement, ā€¦

ā€¢ Idea:

ā€“ Proximity -> random walk with restarts

A C

B

A C

B

A C

B

b

Page 55: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 56

Case Study: AND query

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

Page 56: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 57

Case Study: AND query

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V.

Jagadish

Laks V.S.

Lakshmanan

Heikki

Mannila

Christos

Faloutsos

Padhraic

Smyth

Corinna

Cortes

15 1013

1 1

6

1 1

4 Daryl

Pregibon

10

2

1

13

1

6

Page 57: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 58

Case Study: AND query

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V.

Jagadish

Laks V.S.

Lakshmanan

Heikki

Mannila

Christos

Faloutsos

Padhraic

Smyth

Corinna

Cortes

15 1013

1 1

6

1 1

4 Daryl

Pregibon

10

2

1

13

1

6

Page 58: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 59

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V.

Jagadish

Laks V.S.

Lakshmanan

Umeshwar

Dayal

Bernhard

Scholkopf

Peter L.

Bartlett

Alex J.

Smola

1510

13

3 3

52

2

327

42_SoftAnd query

ML/Statistics

databases

Page 59: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 60

Conclusions

ā€¢ Q1:How to measure the importance?

ā€¢ A1: RWR+K_SoftAnd

ā€¢ Q2:How to do it efficiently?

ā€¢ A2:Graph Partition (Fast CePS)

ā€“ ~90% quality

ā€“ 150x speedup (ICDMā€™06, b.p. award)

A C

B

Page 60: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 61

Outline

ā€¢ Problem definition / Motivation

ā€¢ Static & dynamic laws; generators

ā€¢ Tools: CenterPiece graphs; Tensors

ā€¢ Other projects (Virus propagation, e-bay

fraud detection)

ā€¢ Conclusions

Page 61: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 62

Motivation

Data mining: ~ find patterns (rules, outliers)

ā€¢ Problem#1: How do real graphs look like?

ā€¢ Problem#2: How do they evolve?

ā€¢ Problem#3: How to generate realistic graphs

TOOLS

ā€¢ Problem#4: Who is the ā€˜master-mindā€™?

ā€¢ Problem#5: Track communities over time

Page 62: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 63

Tensors for time evolving graphs

ā€¢ [Jimeng Sun+

KDDā€™06]

ā€¢ [ ā€œ , SDMā€™07]

ā€¢ [ CF, Kolda, Sun,

SDMā€™07 tutorial]

Page 63: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 64

Social network analysis

ā€¢ Static: find community structures

DB

Auth

or s

Keywords1990

Page 64: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 65

Social network analysis

ā€¢ Static: find community structures

DB

Auth

or s

1990

1991

1992

Page 65: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 66

Social network analysis

ā€¢ Static: find community structures

ā€¢ Dynamic: monitor community structure evolution;

spot abnormal individuals; abnormal time-stamps

DB

Auth

ors

Keywords

DM

DB

1990

2004

Page 66: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 67

DB

DM

Application 1: Multiway latent

semantic indexing (LSI)

DB

2004

1990Michael

Stonebraker

QueryPattern

Ukeyword

au

tho

rs

keyword

Uau

tho

rs

ā€¢ Projection matrices specify the clusters

ā€¢ Core tensors give cluster activation level

Philip Yu

Page 67: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 68

Bibliographic data (DBLP)

ā€¢ Papers from VLDB and KDD conferences

ā€¢ Construct 2nd order tensors with yearly windows with <author, keywords>

ā€¢ Each tensor: 45843741

ā€¢ 11 timestamps (years)

Page 68: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 69

Multiway LSI

Authors Keywords Year

michael carey, michael

stonebraker, h. jagadish,

hector garcia-molina

queri,parallel,optimization,concurr,

objectorient

1995

surajit chaudhuri,mitch

cherniack,michael

stonebraker,ugur etintemel

distribut,systems,view,storage,servic,pr

ocess,cache

2004

jiawei han,jian pei,philip s. yu,

jianyong wang,charu c. aggarwal

streams,pattern,support, cluster,

index,gener,queri

2004

ā€¢ Two groups are correctly identified: Databases and Data mining

ā€¢ People and concepts are drifting over time

DM

DB

Page 69: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 70

Network forensics

ā€¢ Directional network flows

ā€¢ A large ISP with 100 POPs, each POP 10Gbps link

capacity [Hotnets2004]

ā€“ 450 GB/hour with compression

ā€¢ Task: Identify abnormal traffic pattern and find out the

cause

normal trafficabnormal traffic

des

tin

atio

n

source

des

tin

atio

n

source(with Prof. Hui Zhang and Dr. Yinglian Xie)

Page 70: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 71

Conclusions

Tensor-based methods (WTA/DTA/STA):

ā€¢ spot patterns and anomalies on time

evolving graphs, and

ā€¢ on streams (monitoring)

Page 71: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 72

Motivation

Data mining: ~ find patterns (rules, outliers)

ā€¢ Problem#1: How do real graphs look like?

ā€¢ Problem#2: How do they evolve?

ā€¢ Problem#3: How to generate realistic graphs

TOOLS

ā€¢ Problem#4: Who is the ā€˜master-mindā€™?

ā€¢ Problem#5: Track communities over time

Page 72: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 73

Outline

ā€¢ Problem definition / Motivation

ā€¢ Static & dynamic laws; generators

ā€¢ Tools: CenterPiece graphs; Tensors

ā€¢ Other projects (Virus propagation, e-bay

fraud detection, blogs)

ā€¢ Conclusions

Page 73: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 74

Virus propagation

ā€¢ How do viruses/rumors propagate?

ā€¢ Blog influence?

ā€¢ Will a flu-like virus linger, or will it

become extinct soon?

Page 74: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 75

The model: SIS

ā€¢ ā€˜Fluā€™ like: Susceptible-Infected-Susceptible

ā€¢ Virus ā€˜strengthā€™ s= b/d

Infected

Healthy

NN1

N3

N2

Prob. b

Prob. d

Page 75: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 76

Epidemic threshold t

of a graph: the value of t, such that

if strength s = b / d < t

an epidemic can not happen

Thus,

ā€¢ given a graph

ā€¢ compute its epidemic threshold

Page 76: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 77

Epidemic threshold t

What should t depend on?

ā€¢ avg. degree? and/or highest degree?

ā€¢ and/or variance of degree?

ā€¢ and/or third moment of degree?

ā€¢ and/or diameter?

Page 77: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 78

Epidemic threshold

ā€¢ [Theorem] We have no epidemic, if

Ī²/Ī“ <Ļ„ = 1/ Ī»1,A

Page 78: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 79

Epidemic threshold

ā€¢ [Theorem] We have no epidemic, if

Ī²/Ī“ <Ļ„ = 1/ Ī»1,A

largest eigenvalue

of adj. matrix Aattack prob.

recovery prob.epidemic threshold

Proof: [Wang+03]

Page 79: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 80

Experiments (Oregon)

0

100

200

300

400

500

0 250 500 750 1000

Time

Nu

mb

er

of

Infe

cte

d N

od

es

Ī“: 0.05 0.06 0.07

Oregon

Ī² = 0.001

b/d > Ļ„

(above threshold)

b/d = Ļ„

(at the threshold)

b/d < Ļ„

(below threshold)

Page 80: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 81

Outline

ā€¢ Problem definition / Motivation

ā€¢ Static & dynamic laws; generators

ā€¢ Tools: CenterPiece graphs; Tensors

ā€¢ Other projects (Virus propagation, e-bay

fraud detection, blogs)

ā€¢ Conclusions

Page 81: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 82

E-bay Fraud detection

w/ Polo Chau &

Shashank Pandit, CMU

Page 82: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 83

E-bay Fraud detection

ā€¢ lines: positive feedbacks

ā€¢ would you buy from him/her?

Page 83: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 84

E-bay Fraud detection

ā€¢ lines: positive feedbacks

ā€¢ would you buy from him/her?

ā€¢ or him/her?

Page 84: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 85

E-bay Fraud detection - NetProbe

Page 85: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 86

Outline

ā€¢ Problem definition / Motivation

ā€¢ Static & dynamic laws; generators

ā€¢ Tools: CenterPiece graphs; Tensors

ā€¢ Other projects (Virus propagation, e-bay

fraud detection, blogs)

ā€¢ Conclusions

Page 86: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 87

Blog analysis

ā€¢ with Mary McGlohon (CMU)

ā€¢ Jure Leskovec (CMU)

ā€¢ Natalie Glance (now at Google)

ā€¢ Mat Hurst (now at MSR)

[SDMā€™07]

Page 87: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 88

Cascades on the Blogosphere

B1B2

B4B3

a

b c

d

e1

B1 B2

B4B3

1

1

2

3

1

Blogosphere

blogs + posts

Blog network

links among blogs

Post network

links among posts

Q1: popularity-decay of a post?

Q2: degree distributions?

Page 88: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 89

Q1: popularity over time

Days after post

Post popularity drops-off ā€“ exponentially?

days after post

# in links

1 2 3

Page 89: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 90

Q1: popularity over time

Days after post

Post popularity drops-off ā€“ exponentially?

POWER LAW!

Exponent?

# in links

(log)

1 2 3 days after post

(log)

Page 90: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 91

Q1: popularity over time

Days after post

Post popularity drops-off ā€“ exponentially?

POWER LAW!

Exponent? -1.6 (close to -1.5: Barabasiā€™s stack model)

# in links

(log)

1 2 3

-1.6

days after post

(log)

Page 91: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 92

Q2: degree distribution

44,356 nodes, 122,153 edges. Half of blogs

belong to largest connected component.

blog in-degree

count

B

1

B

2

B

4

B

3

11

2

3

1

??

Page 92: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 93

Q2: degree distribution

44,356 nodes, 122,153 edges. Half of blogs

belong to largest connected component.

blog in-degree

count

B

1

B

2

B

4

B

3

11

2

3

1

Page 93: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 94

Q2: degree distribution

44,356 nodes, 122,153 edges. Half of blogs

belong to largest connected component.

blog in-degree

count

in-degree slope: -1.7

out-degree: -3

ā€˜rich get richerā€™

Page 94: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 95

Next steps:

ā€¢ edges with categorical attributes and/or time-stamps

ā€¢ nodes with attributes

ā€¢ scalability (hadoop ā€“ PetaByte scale)

ā€“ first eigenvalue; diameter [done]

ā€“ rest eigenvalues; community detection [to be done]

ā€“ modularity, anomalies etc etc

ā€¢ visualization (-> summarization)

Page 95: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 96

E.g.: self-* system @ CMU

ā€¢ >200 nodes

ā€¢ target: 1 PetaByte

Page 96: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 97

D.I.S.C.

ā€¢ ā€˜Data Intensive Scientific Computingā€™

[R. Bryant, CMU]

ā€“ ā€˜big dataā€™

ā€“ http://www.cs.cmu.edu/~bryant/pubdir/cmu-

cs-07-128.pdf

Page 97: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 98

Scalability

ā€¢ Google: > 450,000 processors in clusters of ~2000 processors each

ā€¢ Yahoo: 5Pb of data [Fayyad, KDDā€™07]

ā€¢ Problem: machine failures, on a daily basis

ā€¢ How to parallelize data mining tasks, then?

ā€¢ A: map/reduce ā€“ hadoop (open-source clone) http://hadoop.apache.org/

Barroso, Dean, Hƶlzle, ā€œWeb Search for a Planet: The Google Cluster Architectureā€ IEEE Micro 2003

Page 98: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 99

2ā€Ÿ intro to hadoop

ā€¢ master-slave architecture; n-way replication (default n=3)

ā€¢ ā€˜group byā€™ of SQL (in parallel, fault-tolerant way)

ā€¢ e.g, find histogram of word frequency

ā€“ slaves compute local histograms

ā€“ master merges into global histogram

select course-id, count(*)

from ENROLLMENT

group by course-id

Page 99: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 100

2ā€Ÿ intro to hadoop

ā€¢ master-slave architecture; n-way replication (default n=3)

ā€¢ ā€˜group byā€™ of SQL (in parallel, fault-tolerant way)

ā€¢ e.g, find histogram of word frequency

ā€“ slaves compute local histograms

ā€“ master merges into global histogram

select course-id, count(*)

from ENROLLMENT

group by course-idmap

reduce

Page 100: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 101

OVERALL CONCLUSIONS

ā€¢ Graphs: Self-similarity and power laws

work, when textbook methods fail!

ā€¢ New patterns (shrinking diameter!)

ā€¢ New generator: Kronecker

ā€¢ SVD / tensors / RWR: valuable tools

ā€¢ hadoop/mapReduce for scalability

Page 101: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 102

References

ā€¢ Hanghang Tong, Christos Faloutsos, and Jia-Yu

Pan Fast Random Walk with Restart and Its

Applications ICDM 2006, Hong Kong.

ā€¢ Hanghang Tong, Christos Faloutsos Center-Piece

Subgraphs: Problem Definition and Fast

Solutions, KDD 2006, Philadelphia, PA

Page 102: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 103

References

ā€¢ Jure Leskovec, Jon Kleinberg and Christos Faloutsos Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations KDD 2005, Chicago, IL. ("Best Research Paper" award).

ā€¢ Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication(ECML/PKDD 2005), Porto, Portugal, 2005.

Page 103: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 104

References

ā€¢ Jure Leskovec and Christos Faloutsos, Scalable Modeling of Real Graphs using Kronecker Multiplication, ICML 2007, Corvallis, OR, USA

ā€¢ Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang and Christos Faloutsos NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks WWW 2007, Banff,

Alberta, Canada, May 8-12, 2007.

ā€¢ Jimeng Sun, Dacheng Tao, Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis, KDD 2006, Philadelphia, PA

Page 104: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 105

References

ā€¢ Jimeng Sun, Yinglian Xie, Hui Zhang, Christos

Faloutsos. Less is More: Compact Matrix

Decomposition for Large Sparse Graphs, SDM,

Minneapolis, Minnesota, Apr 2007. [pdf]

Page 105: Graph Mining: Laws, Generators and Tools - Stanford University

CMU SCS

MMDS 08 C. Faloutsos 106

Contact info:

www. cs.cmu.edu /~christos

(w/ papers, datasets, code, etc)