graph mining: laws, generators and tools - stanford university
TRANSCRIPT
CMU SCS
Graph Mining: Laws, Generators
and Tools
Christos Faloutsos
CMU
CMU SCS
MMDS 08 C. Faloutsos 2
Thanks
ā¢ Michael Mahoney
ā¢ Lek-Heng Lim
ā¢ Petros Drineas
ā¢ Gunnar Carlsson
CMU SCS
MMDS 08 C. Faloutsos 4
Outline
ā¢ Problem definition / Motivation
ā¢ Static & dynamic laws; generators
ā¢ Tools: CenterPiece graphs; Tensors
ā¢ Other projects (Virus propagation, e-bay
fraud detection)
ā¢ Conclusions
CMU SCS
MMDS 08 C. Faloutsos 5
Motivation
Data mining: ~ find patterns (rules, outliers)
ā¢ Problem#1: How do real graphs look like?
ā¢ Problem#2: How do they evolve?
ā¢ Problem#3: How to generate realistic graphs
TOOLS
ā¢ Problem#4: Who is the āmaster-mindā?
ā¢ Problem#5: Track communities over time
CMU SCS
MMDS 08 C. Faloutsos 6
Problem#1: Joint work with
Dr. Deepayan Chakrabarti
(CMU/Yahoo R.L.)
CMU SCS
MMDS 08 C. Faloutsos 7
Graphs - why should we care?
Internet Map
[lumeta.com]
Food Web
[Martinez ā91]
Protein Interactions
[genomebiology.com]
Friendship Network
[Moody ā01]
CMU SCS
MMDS 08 C. Faloutsos 8
Graphs - why should we care?
ā¢ IR: bi-partite graphs (doc-terms)
ā¢ web: hyper-text graph
ā¢ ... and more:
D1
DN
T1
TM
... ...
CMU SCS
MMDS 08 C. Faloutsos 9
Graphs - why should we care?
ā¢ network of companies & board-of-directors
members
ā¢ āviralā marketing
ā¢ web-log (āblogā) news propagation
ā¢ computer network security: email/IP traffic
and anomaly detection
ā¢ ....
CMU SCS
MMDS 08 C. Faloutsos 10
Problem #1 - network and graph
mining
ā¢ How does the Internet look like?
ā¢ How does the web look like?
ā¢ What is ānormalā/āabnormalā?
ā¢ which patterns/laws hold?
CMU SCS
MMDS 08 C. Faloutsos 11
Graph mining
ā¢ Are real graphs random?
CMU SCS
MMDS 08 C. Faloutsos 12
Laws and patterns
ā¢ Are real graphs random?
ā¢ A: NO!!
ā Diameter
ā in- and out- degree distributions
ā other (surprising) patterns
CMU SCS
MMDS 08 C. Faloutsos 13
Solution#1
ā¢ Power law in the degree distribution
[SIGCOMM99]
log(rank)
log(degree)
-0.82
internet domains
att.com
ibm.com
CMU SCS
MMDS 08 C. Faloutsos 14
Solution#1ā: Eigen Exponent E
ā¢ A2: power law in the eigenvalues of the adjacency matrix
E = -0.48
Exponent = slope
Eigenvalue
Rank of decreasing eigenvalue
May 2001
CMU SCS
MMDS 08 C. Faloutsos 15
Solution#1ā: Eigen Exponent E
ā¢ [Papadimitriou, Mihail, ā02]: slope is Ā½ of rank exponent
E = -0.48
Exponent = slope
Eigenvalue
Rank of decreasing eigenvalue
May 2001
CMU SCS
MMDS 08 C. Faloutsos 16
But:
How about graphs from other domains?
CMU SCS
MMDS 08 C. Faloutsos 17
The Peer-to-Peer Topology
ā¢ Count versus degree
ā¢ Number of adjacent peers follows a power-law
[Jovanovic+]
CMU SCS
MMDS 08 C. Faloutsos 18
More power laws:
citation counts: (citeseer.nj.nec.com 6/2001)
log(#citations)
log(count)
Ullman
CMU SCS
MMDS 08 C. Faloutsos 19
More power laws:
ā¢ web hit counts [w/ A. Montgomery]
Web Site Traffic
log(in-degree)
log(count)
Zipf
userssites
``ebayāā
CMU SCS
MMDS 08 C. Faloutsos 20
epinions.com
ā¢ who-trusts-whom
[Richardson +
Domingos, KDD
2001]
(out) degree
count
trusts-2000-people user
CMU SCS
MMDS 08 C. Faloutsos 21
Motivation
Data mining: ~ find patterns (rules, outliers)
ā¢ Problem#1: How do real graphs look like?
ā¢ Problem#2: How do they evolve?
ā¢ Problem#3: How to generate realistic graphs
TOOLS
ā¢ Problem#4: Who is the āmaster-mindā?
ā¢ Problem#5: Track communities over time
CMU SCS
MMDS 08 C. Faloutsos 22
Problem#2: Time evolution
ā¢ with Jure Leskovec
(CMU/MLD)
ā¢ and Jon Kleinberg (Cornell ā
sabb. @ CMU)
CMU SCS
MMDS 08 C. Faloutsos 23
Evolution of the Diameter
ā¢ Prior work on Power Law graphs hints
at slowly growing diameter:
ā diameter ~ O(log N)
ā diameter ~ O(log log N)
ā¢ What is happening in real data?
CMU SCS
MMDS 08 C. Faloutsos 24
Evolution of the Diameter
ā¢ Prior work on Power Law graphs hints
at slowly growing diameter:
ā diameter ~ O(log N)
ā diameter ~ O(log log N)
ā¢ What is happening in real data?
ā¢ Diameter shrinks over time
CMU SCS
MMDS 08 C. Faloutsos 25
Diameter ā ArXiv citation graph
ā¢ Citations among
physics papers
ā¢ 1992 ā2003
ā¢ One graph per
year
time [years]
diameter
CMU SCS
MMDS 08 C. Faloutsos 26
Diameter ā āAutonomous
Systemsā
ā¢ Graph of Internet
ā¢ One graph per
day
ā¢ 1997 ā 2000
number of nodes
diameter
CMU SCS
MMDS 08 C. Faloutsos 27
Diameter ā āAffiliation Networkā
ā¢ Graph of
collaborations in
physics ā authors
linked to papers
ā¢ 10 years of data
time [years]
diameter
CMU SCS
MMDS 08 C. Faloutsos 28
Diameter ā āPatentsā
ā¢ Patent citation
network
ā¢ 25 years of data
time [years]
diameter
CMU SCS
MMDS 08 C. Faloutsos 29
Temporal Evolution of the Graphs
ā¢ N(t) ā¦ nodes at time t
ā¢ E(t) ā¦ edges at time t
ā¢ Suppose that
N(t+1) = 2 * N(t)
ā¢ Q: what is your guess for
E(t+1) =? 2 * E(t)
CMU SCS
MMDS 08 C. Faloutsos 30
Temporal Evolution of the Graphs
ā¢ N(t) ā¦ nodes at time t
ā¢ E(t) ā¦ edges at time t
ā¢ Suppose that
N(t+1) = 2 * N(t)
ā¢ Q: what is your guess for
E(t+1) =? 2 * E(t)
ā¢ A: over-doubled!
ā But obeying the ``Densification Power Lawāā
CMU SCS
MMDS 08 C. Faloutsos 31
Densification ā Physics Citations
ā¢ Citations among physics papers
ā¢ 2003:
ā 29,555 papers, 352,807 citations
N(t)
E(t)
??
CMU SCS
MMDS 08 C. Faloutsos 32
Densification ā Physics Citations
ā¢ Citations among physics papers
ā¢ 2003:
ā 29,555 papers, 352,807 citations
N(t)
E(t)
1.69
CMU SCS
MMDS 08 C. Faloutsos 33
Densification ā Physics Citations
ā¢ Citations among physics papers
ā¢ 2003:
ā 29,555 papers, 352,807 citations
N(t)
E(t)
1.69
1: tree
CMU SCS
MMDS 08 C. Faloutsos 34
Densification ā Physics Citations
ā¢ Citations among physics papers
ā¢ 2003:
ā 29,555 papers, 352,807 citations
N(t)
E(t)
1.69clique: 2
CMU SCS
MMDS 08 C. Faloutsos 35
Densification ā Patent Citations
ā¢ Citations among
patents granted
ā¢ 1999
ā 2.9 million nodes
ā 16.5 million
edges
ā¢ Each year is a
datapoint N(t)
E(t)
1.66
CMU SCS
MMDS 08 C. Faloutsos 36
Densification ā Autonomous Systems
ā¢ Graph of
Internet
ā¢ 2000
ā 6,000 nodes
ā 26,000 edges
ā¢ One graph per
day
N(t)
E(t)
1.18
CMU SCS
MMDS 08 C. Faloutsos 37
Densification ā Affiliation
Network
ā¢ Authors linked
to their
publications
ā¢ 2002
ā 60,000 nodes
ā¢ 20,000 authors
ā¢ 38,000 papers
ā 133,000 edgesN(t)
E(t)
1.15
CMU SCS
MMDS 08 C. Faloutsos 38
Motivation
Data mining: ~ find patterns (rules, outliers)
ā¢ Problem#1: How do real graphs look like?
ā¢ Problem#2: How do they evolve?
ā¢ Problem#3: How to generate realistic graphs
TOOLS
ā¢ Problem#4: Who is the āmaster-mindā?
ā¢ Problem#5: Track communities over time
CMU SCS
MMDS 08 C. Faloutsos 39
Problem#3: Generation
ā¢ Given a growing graph with count of nodes N1,
N2, ā¦
ā¢ Generate a realistic sequence of graphs that will
obey all the patterns
CMU SCS
MMDS 08 C. Faloutsos 40
Problem Definition
ā¢ Given a growing graph with count of nodes N1, N2, ā¦
ā¢ Generate a realistic sequence of graphs that will obey all the patterns
ā Static Patterns
Power Law Degree Distribution
Power Law eigenvalue and eigenvector distribution
Small Diameter
ā Dynamic Patterns
Growth Power Law
Shrinking/Stabilizing Diameters
CMU SCS
MMDS 08 C. Faloutsos 41
Problem Definition
ā¢ Given a growing graph with count of nodes
N1, N2, ā¦
ā¢ Generate a realistic sequence of graphs that
will obey all the patterns
ā¢ Idea: Self-similarity
ā Leads to power laws
ā Communities within communities
ā ā¦
CMU SCS
MMDS 08 C. Faloutsos 42Adjacency matrix
Kronecker Product ā a Graph
Intermediate stage
Adjacency matrix
CMU SCS
MMDS 08 C. Faloutsos 43
Kronecker Product ā a Graph
ā¢ Continuing multiplying with G1 we obtain G4 and
so on ā¦
G4 adjacency matrix
CMU SCS
MMDS 08 C. Faloutsos 44
Kronecker Product ā a Graph
ā¢ Continuing multiplying with G1 we obtain G4 and
so on ā¦
G4 adjacency matrix
CMU SCS
MMDS 08 C. Faloutsos 45
Kronecker Product ā a Graph
ā¢ Continuing multiplying with G1 we obtain G4 and
so on ā¦
G4 adjacency matrix
CMU SCS
MMDS 08 C. Faloutsos 46
Properties:
ā¢ We can PROVE that
ā Degree distribution is multinomial ~ power law
ā Diameter: constant
ā Eigenvalue distribution: multinomial
ā First eigenvector: multinomial
ā¢ See [Leskovec+, PKDDā05] for proofs
CMU SCS
MMDS 08 C. Faloutsos 47
Problem Definition
ā¢ Given a growing graph with nodes N1, N2, ā¦
ā¢ Generate a realistic sequence of graphs that will obey all
the patterns
ā Static Patterns
Power Law Degree Distribution
Power Law eigenvalue and eigenvector distribution
Small Diameter
ā Dynamic Patterns
Growth Power Law
Shrinking/Stabilizing Diameters
ā¢ First and only generator for which we can prove
all these properties
CMU SCS
MMDS 08 C. Faloutsos 48
Stochastic Kronecker Graphs
ā¢ Create N1N1 probability matrix P1
ā¢ Compute the kth Kronecker power Pk
ā¢ For each entry puv of Pk include an edge
(u,v) with probability puv
0.4 0.2
0.1 0.3
P1
Instance
Matrix G2
0.16 0.08 0.08 0.04
0.04 0.12 0.02 0.06
0.04 0.02 0.12 0.06
0.01 0.03 0.03 0.09
Pk
flip biased
coins
Kronecker
multiplication
skip
CMU SCS
MMDS 08 C. Faloutsos 49
Experiments
ā¢ How well can we match real graphs?
ā Arxiv: physics citations:
ā¢ 30,000 papers, 350,000 citations
ā¢ 10 years of data
ā U.S. Patent citation network
ā¢ 4 million patents, 16 million citations
ā¢ 37 years of data
ā Autonomous systems ā graph of internet
ā¢ Single snapshot from January 2002
ā¢ 6,400 nodes, 26,000 edges
ā¢ We show both static and temporal patterns
CMU SCS
MMDS 08 C. Faloutsos 50
(Q: how to fit the parmās?)
A:
ā¢ Stochastic version of Kronecker graphs +
ā¢ Max likelihood +
ā¢ Metropolis sampling
ā¢ [Leskovec+, ICMLā07]
CMU SCS
MMDS 08 C. Faloutsos 51
Experiments on real AS graphDegree distribution Hop plot
Network valueAdjacency matrix eigen values
CMU SCS
MMDS 08 C. Faloutsos 52
Conclusions
ā¢ Kronecker graphs have:
ā All the static properties
Heavy tailed degree distributions
Small diameter
Multinomial eigenvalues and eigenvectors
ā All the temporal properties
Densification Power Law
Shrinking/Stabilizing Diameters
ā We can formally prove these results
CMU SCS
MMDS 08 C. Faloutsos 53
Motivation
Data mining: ~ find patterns (rules, outliers)
ā¢ Problem#1: How do real graphs look like?
ā¢ Problem#2: How do they evolve?
ā¢ Problem#3: How to generate realistic graphs
TOOLS
ā¢ Problem#4: Who is the āmaster-mindā?
ā¢ Problem#5: Track communities over time
CMU SCS
MMDS 08 C. Faloutsos 54
Problem#4: MasterMind ā āCePSā
ā¢ w/ Hanghang Tong,
KDD 2006
ā¢ htong <at> cs.cmu.edu
CMU SCS
MMDS 08 C. Faloutsos 55
Center-Piece Subgraph(Ceps)
ā¢ Given Q query nodes
ā¢ Find Center-piece ( )
ā¢ App.
ā Social Networks
ā Law Inforcement, ā¦
ā¢ Idea:
ā Proximity -> random walk with restarts
A C
B
A C
B
A C
B
b
CMU SCS
MMDS 08 C. Faloutsos 56
Case Study: AND query
R. Agrawal Jiawei Han
V. Vapnik M. Jordan
CMU SCS
MMDS 08 C. Faloutsos 57
Case Study: AND query
R. Agrawal Jiawei Han
V. Vapnik M. Jordan
H.V.
Jagadish
Laks V.S.
Lakshmanan
Heikki
Mannila
Christos
Faloutsos
Padhraic
Smyth
Corinna
Cortes
15 1013
1 1
6
1 1
4 Daryl
Pregibon
10
2
1
13
1
6
CMU SCS
MMDS 08 C. Faloutsos 58
Case Study: AND query
R. Agrawal Jiawei Han
V. Vapnik M. Jordan
H.V.
Jagadish
Laks V.S.
Lakshmanan
Heikki
Mannila
Christos
Faloutsos
Padhraic
Smyth
Corinna
Cortes
15 1013
1 1
6
1 1
4 Daryl
Pregibon
10
2
1
13
1
6
CMU SCS
MMDS 08 C. Faloutsos 59
R. Agrawal Jiawei Han
V. Vapnik M. Jordan
H.V.
Jagadish
Laks V.S.
Lakshmanan
Umeshwar
Dayal
Bernhard
Scholkopf
Peter L.
Bartlett
Alex J.
Smola
1510
13
3 3
52
2
327
42_SoftAnd query
ML/Statistics
databases
CMU SCS
MMDS 08 C. Faloutsos 60
Conclusions
ā¢ Q1:How to measure the importance?
ā¢ A1: RWR+K_SoftAnd
ā¢ Q2:How to do it efficiently?
ā¢ A2:Graph Partition (Fast CePS)
ā ~90% quality
ā 150x speedup (ICDMā06, b.p. award)
A C
B
CMU SCS
MMDS 08 C. Faloutsos 61
Outline
ā¢ Problem definition / Motivation
ā¢ Static & dynamic laws; generators
ā¢ Tools: CenterPiece graphs; Tensors
ā¢ Other projects (Virus propagation, e-bay
fraud detection)
ā¢ Conclusions
CMU SCS
MMDS 08 C. Faloutsos 62
Motivation
Data mining: ~ find patterns (rules, outliers)
ā¢ Problem#1: How do real graphs look like?
ā¢ Problem#2: How do they evolve?
ā¢ Problem#3: How to generate realistic graphs
TOOLS
ā¢ Problem#4: Who is the āmaster-mindā?
ā¢ Problem#5: Track communities over time
CMU SCS
MMDS 08 C. Faloutsos 63
Tensors for time evolving graphs
ā¢ [Jimeng Sun+
KDDā06]
ā¢ [ ā , SDMā07]
ā¢ [ CF, Kolda, Sun,
SDMā07 tutorial]
CMU SCS
MMDS 08 C. Faloutsos 64
Social network analysis
ā¢ Static: find community structures
DB
Auth
or s
Keywords1990
CMU SCS
MMDS 08 C. Faloutsos 65
Social network analysis
ā¢ Static: find community structures
DB
Auth
or s
1990
1991
1992
CMU SCS
MMDS 08 C. Faloutsos 66
Social network analysis
ā¢ Static: find community structures
ā¢ Dynamic: monitor community structure evolution;
spot abnormal individuals; abnormal time-stamps
DB
Auth
ors
Keywords
DM
DB
1990
2004
CMU SCS
MMDS 08 C. Faloutsos 67
DB
DM
Application 1: Multiway latent
semantic indexing (LSI)
DB
2004
1990Michael
Stonebraker
QueryPattern
Ukeyword
au
tho
rs
keyword
Uau
tho
rs
ā¢ Projection matrices specify the clusters
ā¢ Core tensors give cluster activation level
Philip Yu
CMU SCS
MMDS 08 C. Faloutsos 68
Bibliographic data (DBLP)
ā¢ Papers from VLDB and KDD conferences
ā¢ Construct 2nd order tensors with yearly windows with <author, keywords>
ā¢ Each tensor: 45843741
ā¢ 11 timestamps (years)
CMU SCS
MMDS 08 C. Faloutsos 69
Multiway LSI
Authors Keywords Year
michael carey, michael
stonebraker, h. jagadish,
hector garcia-molina
queri,parallel,optimization,concurr,
objectorient
1995
surajit chaudhuri,mitch
cherniack,michael
stonebraker,ugur etintemel
distribut,systems,view,storage,servic,pr
ocess,cache
2004
jiawei han,jian pei,philip s. yu,
jianyong wang,charu c. aggarwal
streams,pattern,support, cluster,
index,gener,queri
2004
ā¢ Two groups are correctly identified: Databases and Data mining
ā¢ People and concepts are drifting over time
DM
DB
CMU SCS
MMDS 08 C. Faloutsos 70
Network forensics
ā¢ Directional network flows
ā¢ A large ISP with 100 POPs, each POP 10Gbps link
capacity [Hotnets2004]
ā 450 GB/hour with compression
ā¢ Task: Identify abnormal traffic pattern and find out the
cause
normal trafficabnormal traffic
des
tin
atio
n
source
des
tin
atio
n
source(with Prof. Hui Zhang and Dr. Yinglian Xie)
CMU SCS
MMDS 08 C. Faloutsos 71
Conclusions
Tensor-based methods (WTA/DTA/STA):
ā¢ spot patterns and anomalies on time
evolving graphs, and
ā¢ on streams (monitoring)
CMU SCS
MMDS 08 C. Faloutsos 72
Motivation
Data mining: ~ find patterns (rules, outliers)
ā¢ Problem#1: How do real graphs look like?
ā¢ Problem#2: How do they evolve?
ā¢ Problem#3: How to generate realistic graphs
TOOLS
ā¢ Problem#4: Who is the āmaster-mindā?
ā¢ Problem#5: Track communities over time
CMU SCS
MMDS 08 C. Faloutsos 73
Outline
ā¢ Problem definition / Motivation
ā¢ Static & dynamic laws; generators
ā¢ Tools: CenterPiece graphs; Tensors
ā¢ Other projects (Virus propagation, e-bay
fraud detection, blogs)
ā¢ Conclusions
CMU SCS
MMDS 08 C. Faloutsos 74
Virus propagation
ā¢ How do viruses/rumors propagate?
ā¢ Blog influence?
ā¢ Will a flu-like virus linger, or will it
become extinct soon?
CMU SCS
MMDS 08 C. Faloutsos 75
The model: SIS
ā¢ āFluā like: Susceptible-Infected-Susceptible
ā¢ Virus āstrengthā s= b/d
Infected
Healthy
NN1
N3
N2
Prob. b
Prob. d
CMU SCS
MMDS 08 C. Faloutsos 76
Epidemic threshold t
of a graph: the value of t, such that
if strength s = b / d < t
an epidemic can not happen
Thus,
ā¢ given a graph
ā¢ compute its epidemic threshold
CMU SCS
MMDS 08 C. Faloutsos 77
Epidemic threshold t
What should t depend on?
ā¢ avg. degree? and/or highest degree?
ā¢ and/or variance of degree?
ā¢ and/or third moment of degree?
ā¢ and/or diameter?
CMU SCS
MMDS 08 C. Faloutsos 78
Epidemic threshold
ā¢ [Theorem] We have no epidemic, if
Ī²/Ī“ <Ļ = 1/ Ī»1,A
CMU SCS
MMDS 08 C. Faloutsos 79
Epidemic threshold
ā¢ [Theorem] We have no epidemic, if
Ī²/Ī“ <Ļ = 1/ Ī»1,A
largest eigenvalue
of adj. matrix Aattack prob.
recovery prob.epidemic threshold
Proof: [Wang+03]
CMU SCS
MMDS 08 C. Faloutsos 80
Experiments (Oregon)
0
100
200
300
400
500
0 250 500 750 1000
Time
Nu
mb
er
of
Infe
cte
d N
od
es
Ī“: 0.05 0.06 0.07
Oregon
Ī² = 0.001
b/d > Ļ
(above threshold)
b/d = Ļ
(at the threshold)
b/d < Ļ
(below threshold)
CMU SCS
MMDS 08 C. Faloutsos 81
Outline
ā¢ Problem definition / Motivation
ā¢ Static & dynamic laws; generators
ā¢ Tools: CenterPiece graphs; Tensors
ā¢ Other projects (Virus propagation, e-bay
fraud detection, blogs)
ā¢ Conclusions
CMU SCS
MMDS 08 C. Faloutsos 82
E-bay Fraud detection
w/ Polo Chau &
Shashank Pandit, CMU
CMU SCS
MMDS 08 C. Faloutsos 83
E-bay Fraud detection
ā¢ lines: positive feedbacks
ā¢ would you buy from him/her?
CMU SCS
MMDS 08 C. Faloutsos 84
E-bay Fraud detection
ā¢ lines: positive feedbacks
ā¢ would you buy from him/her?
ā¢ or him/her?
CMU SCS
MMDS 08 C. Faloutsos 85
E-bay Fraud detection - NetProbe
CMU SCS
MMDS 08 C. Faloutsos 86
Outline
ā¢ Problem definition / Motivation
ā¢ Static & dynamic laws; generators
ā¢ Tools: CenterPiece graphs; Tensors
ā¢ Other projects (Virus propagation, e-bay
fraud detection, blogs)
ā¢ Conclusions
CMU SCS
MMDS 08 C. Faloutsos 87
Blog analysis
ā¢ with Mary McGlohon (CMU)
ā¢ Jure Leskovec (CMU)
ā¢ Natalie Glance (now at Google)
ā¢ Mat Hurst (now at MSR)
[SDMā07]
CMU SCS
MMDS 08 C. Faloutsos 88
Cascades on the Blogosphere
B1B2
B4B3
a
b c
d
e1
B1 B2
B4B3
1
1
2
3
1
Blogosphere
blogs + posts
Blog network
links among blogs
Post network
links among posts
Q1: popularity-decay of a post?
Q2: degree distributions?
CMU SCS
MMDS 08 C. Faloutsos 89
Q1: popularity over time
Days after post
Post popularity drops-off ā exponentially?
days after post
# in links
1 2 3
CMU SCS
MMDS 08 C. Faloutsos 90
Q1: popularity over time
Days after post
Post popularity drops-off ā exponentially?
POWER LAW!
Exponent?
# in links
(log)
1 2 3 days after post
(log)
CMU SCS
MMDS 08 C. Faloutsos 91
Q1: popularity over time
Days after post
Post popularity drops-off ā exponentially?
POWER LAW!
Exponent? -1.6 (close to -1.5: Barabasiās stack model)
# in links
(log)
1 2 3
-1.6
days after post
(log)
CMU SCS
MMDS 08 C. Faloutsos 92
Q2: degree distribution
44,356 nodes, 122,153 edges. Half of blogs
belong to largest connected component.
blog in-degree
count
B
1
B
2
B
4
B
3
11
2
3
1
??
CMU SCS
MMDS 08 C. Faloutsos 93
Q2: degree distribution
44,356 nodes, 122,153 edges. Half of blogs
belong to largest connected component.
blog in-degree
count
B
1
B
2
B
4
B
3
11
2
3
1
CMU SCS
MMDS 08 C. Faloutsos 94
Q2: degree distribution
44,356 nodes, 122,153 edges. Half of blogs
belong to largest connected component.
blog in-degree
count
in-degree slope: -1.7
out-degree: -3
ārich get richerā
CMU SCS
MMDS 08 C. Faloutsos 95
Next steps:
ā¢ edges with categorical attributes and/or time-stamps
ā¢ nodes with attributes
ā¢ scalability (hadoop ā PetaByte scale)
ā first eigenvalue; diameter [done]
ā rest eigenvalues; community detection [to be done]
ā modularity, anomalies etc etc
ā¢ visualization (-> summarization)
CMU SCS
MMDS 08 C. Faloutsos 96
E.g.: self-* system @ CMU
ā¢ >200 nodes
ā¢ target: 1 PetaByte
CMU SCS
MMDS 08 C. Faloutsos 97
D.I.S.C.
ā¢ āData Intensive Scientific Computingā
[R. Bryant, CMU]
ā ābig dataā
ā http://www.cs.cmu.edu/~bryant/pubdir/cmu-
cs-07-128.pdf
CMU SCS
MMDS 08 C. Faloutsos 98
Scalability
ā¢ Google: > 450,000 processors in clusters of ~2000 processors each
ā¢ Yahoo: 5Pb of data [Fayyad, KDDā07]
ā¢ Problem: machine failures, on a daily basis
ā¢ How to parallelize data mining tasks, then?
ā¢ A: map/reduce ā hadoop (open-source clone) http://hadoop.apache.org/
Barroso, Dean, Hƶlzle, āWeb Search for a Planet: The Google Cluster Architectureā IEEE Micro 2003
CMU SCS
MMDS 08 C. Faloutsos 99
2ā intro to hadoop
ā¢ master-slave architecture; n-way replication (default n=3)
ā¢ āgroup byā of SQL (in parallel, fault-tolerant way)
ā¢ e.g, find histogram of word frequency
ā slaves compute local histograms
ā master merges into global histogram
select course-id, count(*)
from ENROLLMENT
group by course-id
CMU SCS
MMDS 08 C. Faloutsos 100
2ā intro to hadoop
ā¢ master-slave architecture; n-way replication (default n=3)
ā¢ āgroup byā of SQL (in parallel, fault-tolerant way)
ā¢ e.g, find histogram of word frequency
ā slaves compute local histograms
ā master merges into global histogram
select course-id, count(*)
from ENROLLMENT
group by course-idmap
reduce
CMU SCS
MMDS 08 C. Faloutsos 101
OVERALL CONCLUSIONS
ā¢ Graphs: Self-similarity and power laws
work, when textbook methods fail!
ā¢ New patterns (shrinking diameter!)
ā¢ New generator: Kronecker
ā¢ SVD / tensors / RWR: valuable tools
ā¢ hadoop/mapReduce for scalability
CMU SCS
MMDS 08 C. Faloutsos 102
References
ā¢ Hanghang Tong, Christos Faloutsos, and Jia-Yu
Pan Fast Random Walk with Restart and Its
Applications ICDM 2006, Hong Kong.
ā¢ Hanghang Tong, Christos Faloutsos Center-Piece
Subgraphs: Problem Definition and Fast
Solutions, KDD 2006, Philadelphia, PA
CMU SCS
MMDS 08 C. Faloutsos 103
References
ā¢ Jure Leskovec, Jon Kleinberg and Christos Faloutsos Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations KDD 2005, Chicago, IL. ("Best Research Paper" award).
ā¢ Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication(ECML/PKDD 2005), Porto, Portugal, 2005.
CMU SCS
MMDS 08 C. Faloutsos 104
References
ā¢ Jure Leskovec and Christos Faloutsos, Scalable Modeling of Real Graphs using Kronecker Multiplication, ICML 2007, Corvallis, OR, USA
ā¢ Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang and Christos Faloutsos NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks WWW 2007, Banff,
Alberta, Canada, May 8-12, 2007.
ā¢ Jimeng Sun, Dacheng Tao, Christos Faloutsos Beyond Streams and Graphs: Dynamic Tensor Analysis, KDD 2006, Philadelphia, PA
CMU SCS
MMDS 08 C. Faloutsos 105
References
ā¢ Jimeng Sun, Yinglian Xie, Hui Zhang, Christos
Faloutsos. Less is More: Compact Matrix
Decomposition for Large Sparse Graphs, SDM,
Minneapolis, Minnesota, Apr 2007. [pdf]
CMU SCS
MMDS 08 C. Faloutsos 106
Contact info:
www. cs.cmu.edu /~christos
(w/ papers, datasets, code, etc)