ds504/cs586: big data analyticsyli15/courses/ds504fall16/slides/bda16fall-8... · ds504/cs586: big...

36
DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua Li Welcome to Time: 6:00pm –8:50pm Mon. and Wed. Location: SL105 Spring 2016

Upload: others

Post on 11-Jun-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

DS504/CS586: Big Data Analytics Graph Mining II

Prof. Yanhua Li

Welcome to

Time: 6:00pm –8:50pm Mon. and Wed. Location: SL105

Spring 2016

Page 2: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Reading assignments We will increase the bar a little bit Please add more of your ideas, and share them with us in the class.

Page 3: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Project 1 Team 1: Utilizing the data from Allstate to predict the possible Bodily Injury Liability Insurance claim payments that the company may pay in line with the vehicle features. Team 2: 3D Video Growing Trend on YouTube

Page 4: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Project 1 Team 3: Designing a sampling method that can estimate the available capacity of hotels in a large area and a certain time range as accurate as possible. Team 4: A huge number of professional as well as amateur programmers use Stackoverflow.com to find solutions to their questions regularly. Thus, our team seeks to give such users a general idea of which skills to learn to create their projects and how to manage their skills by offering a project-related heat map of programming languages and knowledge points.

Page 5: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Project 1 Team 5: MEASURING RESTAURANT DIVERSITY INDEX FOR DIFFERENT CITIES in Yelp Team 6: We will estimate the number of valid and invalid users among the population. Lastly, we will analyze other related web site statistics such as passive and active users.

Page 6: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Graph Data

6

 Graphsareeverywhere.

Program Flow

Ecological Network

Biological Network

Social Network

Chemical Network Web Graph

Page 7: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Complex Graphs

7

 Real-life graph contains complex contents – labels associated with nodes, edges and graphs.

Node Labels:

Location, Gender, Charts, Library, Events, Groups,

Journal, Tags, Age, Tracks.

Page 8: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

8

# of Users # of Links Facebook 400 Million 52K Million Twitter 105 Million 10K Million LinkedIn 60 Million 0.9K Million Last.FM 40 Million 2K Million

LiveJournal 25 Million 2K Million del.icio.us 5.3 Million 0.7K Million

DBLP 0.7 Million 8 Million

Large Graphs

  Large Scale Graphs.

Page 9: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Mining in Big Graphs

v  Network Statistic Analysis (last lecture) § Network Size § Degree distribution.

v  Node Ranking (this lecture) § Identifying most important/influential nodes § Viral Marketing, resource allocation

Page 10: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Characterize Node Importance

v Rank the webpages in search engine.

v Viral Marketing, resource allocation

v Open a new restaurant, find the optimal location

v …

Page 11: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

11

Ranking nodes on an undirected graph Node Degree Stationary distribution

}  Local Importance }  Global Importance

11

}  d(5)=4 }  d(3)=3 }  d(4)=2 }  d(2)=2 }  d(1)=2 }  d(6)=1

1

2

3 4

5 6

1

2

3 4

5 6

}  π(5)=4/14 }  π(3)=3/14 }  π(4)=2/14 }  π(2)=2/14 }  π(1)=2/14 }  π(6)=1/14

}  |V|=6 }  |E|=7

They are equivalent.

Connected Graphs

Page 12: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

12

Ranking nodes on a directed graph Node in & out Degree Stationary distribution

}  Local Importance }  Global Importance

}  din(3)=3; dout(5)=3; }  din(5)=2; dout(3)=2; }  din(1)=2; dout(1)=2; }  din(2)=2; dout(4)=2; }  din(4)=1; dout(2)=1; }  din(6)=1; dout(6)=1;

1

2

3 4

5 6

1

2

3 4

5 6

}  π(5)=? }  π(4)=? }  π(3)=? }  π(2)=? }  π(1)=? }  π(6)=?

They are equivalent?

Strongly Connected Graphs & Aperiodic

Page 13: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Random Walk (Undirected Graph) v  Adjacency matrix

v  Transition Probability Matrix

v  |E|: number of links v  Stationary Distribution

1

4 3

2

D =

3 0 0 00 2 0 00 0 3 00 0 0 2

!

"

####

$

%

&&&&

Undirected

A =

0 1 1 11 0 1 01 1 0 11 0 1 0

!

"

####

$

%

&&&&

Symmetric

P = A•D−1 =

0 1/ 3 1/ 3 1/ 31/ 2 0 1/ 2 01/ 3 1/ 3 0 1/ 31/ 2 0 1/ 2 0

"

#

$$$$

%

&

''''

π i =di2 E

Pij =1ki

}  π(1)=3/10 }  π(3)=3/10 }  π(2)=2/10 }  π(4)=2/10

xt ,i = xt−1, j p jij∑

Connected Graphs

Page 14: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Random Walk (directed graph) v  Adjacency matrix

v  Transition Probability Matrix

v  |E|: number of directed links v  Stationary Distribution

1

4 3

2

D =

2 0 0 00 1 0 00 0 3 00 0 0 1

!

"

####

$

%

&&&&

A =

0 1 1 00 0 1 01 1 0 11 0 0 0

!

"

####

$

%

&&&&

Asymmetric

P = A•D−1 =

0 1/ 2 1/ 2 01/ 2 0 1/ 2 01/ 3 1/ 3 0 1/ 31 0 0 0

"

#

$$$$

%

&

''''

π i ≠di2 E

Pij =1kout ,i

Strongly Connected Graphs & Aperiodic

}  π(1)=6/18=1/3 }  π(2)=4/18=2/9 }  π(3)=3/18=1/6 }  π(4)=5/18

xt ,i = xt−1, j p jij∑

Page 15: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

15

Ranking nodes in a directed graph Node in & out Degree Stationary distribution

}  Local Importance }  Global Importance

15

}  din(3)=3; dout(5)=3; }  din(5)=2; dout(3)=2; }  din(1)=2; dout(1)=2; }  din(2)=2; dout(4)=2; }  din(4)=1; dout(2)=1; }  din(6)=1; dout(6)=1;

1

2

3 4

5 6

1

2

3 4

5 6

}  π(1)=5/16 }  π(3)=1/4 }  π(2)=3/16 }  π(4)=1/8 }  π(5)=3/32 }  π(6)=1/32

They are no longer equivalent.

Strongly Connected Graphs & Aperiodic

Page 16: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

directed graphs v  Periodic v  vs v  Aperiodic Graphs

§  The greatest common divisor of the lengths of its cycles is one or not

v  Disconnected graph v  vs v  Connected graph

§  Strongly Connected §  vs §  Weakly Connected

v  Ergodic: Strongly Connected and Aperiodic

1

4 3

2

Strongly Connected Graphs & Aperiodic

1

4 3

2

Page 17: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

WhyThisOrder?

Page 18: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

18

Ranking nodes in a directed graph (II) PageRank HITS

}  Random Walk }  with Random Jumps

}  Hub & Authority

18

}  R(3)=?; }  R(5)=?; }  R(1)=?; }  R(2)=?; }  R(4)=?; }  R(6)=?;

1

2

3 4

5 6

1

2

3 4

5 6

They are no longer equivalent.

}  Ra(3)=?; Rh(5)=?; }  Ra(5)=?; Rh(3)=?; }  Ra(1)=?; Rh(1)=?; }  Ra(2)=?; Rh(4)=?; }  Ra(4)=?; Rh(2)=?; }  Ra(6)=?; Rh(6)=?;

Page 19: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Naïve PageRank v  Adjacency matrix

v  Transition Probability Matrix

v  Stationary Distribution

v  Disconnected Graph & Random surfing behaviors

1

4 3

2

D =

2 0 0 00 1 0 00 0 3 00 0 0 1

!

"

####

$

%

&&&&

A =

0 1 1 00 0 1 01 1 0 11 0 0 0

!

"

####

$

%

&&&&

P = A•D−1 =

0 1/ 2 1/ 2 01/ 2 0 1/ 2 01/ 3 1/ 3 0 1/ 31 0 0 0

"

#

$$$$

%

&

''''

Pij =1kout ,i

Ri = Rj p jij∑

Ri = π i

}  π(1)=6/18=1/3 }  π(2)=4/18=2/9 }  π(3)=3/18=1/6 }  π(4)=5/18

Page 20: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Standard PageRank v  Adjacency matrix

v  Transition Probability Matrix (d=0.85)

v  Stationary Distribution (J is all-1 matrix).

v  Convergence §  Leading eigenvector of Ppr

1

4 3

2

D =

2 0 0 00 1 0 00 0 3 00 0 0 1

!

"

####

$

%

&&&&

A =

0 1 1 00 0 1 01 1 0 11 0 0 0

!

"

####

$

%

&&&&

P = A•D−1 =

0 1/ 2 1/ 2 01/ 2 0 1/ 2 01/ 3 1/ 3 0 1/ 31 0 0 0

"

#

$$$$

%

&

''''

Pij =1kout ,i

Ri = d Rj p jij∑ + (1− d ) 1n

Ri = π pr ,i Ppr = d •P + (1− d) 1nJ =

0.0375 0.4625 0.4625 0.0375 0.0375 0.0375 0.0375 0.8875 0.3208 0.3208 0.0375 0.3208 0.8875 0.0375 0.0375 0.0375

"

#

$$$$

%

&

''''

Page 21: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

v  How to quantify the importance as a hub and authority separately?

Page 22: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Hub & Authority (HITS) v  Adjacency matrix

v  Hub and authority §  Initial Step:

§  Each step with normalization:

v  Convergence §  hub and authority are the left and right singular vector of the

adjacency matrix A.

1

4 3

2

D =

2 0 0 00 1 0 00 0 3 00 0 0 1

!

"

####

$

%

&&&&

A =

0 1 1 00 0 1 01 1 0 11 0 0 0

!

"

####

$

%

&&&&

hub( p) =1;auth( p) =1;

hub( p) = auth(i)i=1

n∑ ;

auth( p) = hub(i)i=1

n∑ ;

hub( p) = hub( p)

hub(i)2i=1

n∑

;

auth( p) = auth( p)

auth(i)2i=1

n∑

;

Page 23: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

A Note on Maximizing the Spread of Influence in Social Networks

E. Even-Dar and A. Shapira

Page 24: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Social Influence

Page 25: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Social Influence

Collaboration networks

Microblogs Social

networks

Location Based Services

Sharing sites

Instant Messaging

Page 26: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Social Influence

Page 27: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Voter Influence Model

Word of mouth effect!

Opinion diffusions Switch opinions back and forth

[1] P. Clifford and A. Sudbury. A model for spatial conflict. Biometrika, 60(3):581, 1973.

Bob

Alice David

D

Randomly selecting one neighbor to adopt its opinion

Page 28: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Influence Maximization

Goal: Maximize the number of future red nodes

Budget: Selecting k individuals as initial red seeds

[15] E. Even-Dar and A. Shapira. A note on maximizing the spread of influence in social networks. In WINE, 2007.

Assumption: Uniform cost of selecting each initial seed

Page 29: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Formulation

xt+1(i) = xt ( j)p jij:aij>0∑

At step t>0, i i ( )tx i 1 ( )tx i−

At step t+1,

pij = aij / aijj∈V∑

3

1

4

2

6

5

Influence at step t: 0( ) ( )t ti V

f x x i∈

=∑Influence contribution:

00 0 0max : lim ( ) ( )ttx

f x f x→∞

−Long term 0

0 0 0max : ( ) ( )txf x f x−Short term

i

Probability of node i being red at step t: ( )tx i

Page 30: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Formulation (Random Walk) Influence at step t: 1xt

T

Influence contribution:

maxx0: limt→∞1xt

TLong term

maxx0:1xt

TShort term

Matrix form: xt = x0Pt

limt→∞xt = limt→∞

x0Pt = π

Influence contribution:

maxx0: limt→∞ft (x0 )− f0 (x0 ) = x0π

T − f0 (x0 )Long term

00 0 0max : ( ) ( )tx

f x f x−Short term

is a column vector, which is the transpose of row vector xtT xt

Page 31: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Influence Maximization

Goal: Maximize the number of future red nodes

Budget: Selecting C for initial red seeds

[15] E. Even-Dar and A. Shapira. A note on maximizing the spread of influence in social networks. In WINE, 2007.

Assumption: Heterogeneous costs of selecting different initial seeds (ci)

Knapsack problem

Page 32: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Knapsack problem Weight = Influence Value/ Stationary distribution Size = Cost ci of choosing a node ni

Page 33: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

One-way connection

Randomly select an out-going neighbor

Adopt the opinion of one of the outgoing neighbors.

What if directed social graph

Page 34: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Directed and Signed Networks

Friend

Foe

One-way signed connection

Randomly select an out-going neighbor

Adopt the opposite opinion of foe, the same opinion of friend

[WSDM'13] Yanhua Li, Wei Chen, Yajun Wang, Zhi-Li Zhang, Influence Diffusion Dynamics and Influence Maximization in Social Networks with Friend and Foe Relationships.The 6th ACM International Conference on Web Search and Data Mining, February 4-8, 2013, Rome, Italy.

Page 35: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

Any Comments & Critiques?

Page 36: DS504/CS586: Big Data Analyticsyli15/courses/DS504Fall16/slides/BDA16Fall-8... · DS504/CS586: Big Data Analytics Graph Mining II Prof. Yanhua L i Welcome to Time: 6:00pm –8:50pm

36

Next Class: Graph Mining (Presentation)

v  Do assigned readings before class

v  Submit reviews/critiques

v  Attend in-class discussions