instructor: shengyu zhang 1. location change for the final 2 classes nov 17: yia 404 (yasumoto...

38
Instructor: Shengyu Zhang 1

Upload: rudolf-clarke

Post on 18-Jan-2018

222 views

Category:

Documents


0 download

DESCRIPTION

Social network Extensively studied by social scientists for decades.  Usually small datasets. Social networks on Internet are gigantic  Facebook, Twitter, LinkedIn, WeChat, Weibo, … A large class of tasks/studies are about the influence and information propagation. A typical task: select some seed customers and let them influence others. 3

TRANSCRIPT

Page 1: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

1

Instructor: Shengyu Zhang

Page 2: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

2

Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International

Academic Park 康本國際學術園 )

Nov 24: No class. Conference leave.

Dec 1: YIA 508 (Yasumoto International Academic Park 康本國際學術園 )

Page 3: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

3

Social network

Extensively studied by social scientists for decades. Usually small datasets.

Social networks on Internet are gigantic Facebook, Twitter, LinkedIn, WeChat, Weibo, …

A large class of tasks/studies are about the influence and information propagation.

A typical task: select some seed customers and let them influence others.

Page 4: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

4

Motivating examples

Adoption of smart phones. Good: easy access to Internet, many cool apps, etc. Bad: expensive, absorbing too much time, …

Once you start to use smart phones, it’s hard to go back. There are not even many choices of traditional phones.

Similar adoption: Religion, new idea, virus, … This lecture focuses on progressive models: once

a node becomes active, it stays active. There are also non-progressive models.

Page 5: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

5

Popular models

Social network: a directed graph . Note that the edges are directed:

How much an individual can influence another individual is generally different than how much can influence . --- Just think of stars and fans.

We consider the scenario where the diffusion proceeds in discrete time steps.

Page 6: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

6

Each node has two states: inactive and active. inactive: the node hasn’t adopted smart phones. active: the node has adopted smart phones.

Start from , a seed set. All nodes in are active.

Nodes in influence some of their neighbors, who then become active. Who are exactly the influenced ones depends on

the variant of the model.

Page 7: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

7

model

These new active nodes further influence some of their neighbors, and so on,

until no more nodes are influenced, reaching a set . “Final active set”.

For a social graph , a stochastic diffusion model specifies how active sets , for all , is generated, given the initial seed set .

Page 8: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

8

Model 1: IC

Independent cascade (IC) model.

Every edge has an associated influence probability � Specifying the extent to which node can influence

node .

Page 9: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

9

For each time step , the set is generated as follows. For each node , for each edge where is inactive,

becomes active with probability . This is then put in set of active nodes in time . Different edges influence independently.

For each inactive node , if it has many neighbors : as long as one such successfully influences , becomes active.

Page 10: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

10

An equivalent model

Given a graph , we mark each edge of as either live or blocked. .

The subgraph where contains all the live edges.

The step- active set is

The final active set is defined as

Page 11: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

11

This model is equivalent to the IC model. In IC, each edge is “used” only once.

Flip a coin to decide whether the edge “works”. Success with probability .

Thus we can just flip all the coins at the beginning, and then later follow the outcomes.

Page 12: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

12

Model 2: LT

Linear threshold (LT) model. In many situations, multiple and independent

sources are needed for an individual to be convinced to adopt some idea.

E.g. Seeing 1/3 of your friends using smart phone, you made the decision.

Page 13: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

13

In linear threshold model, every edge has a influence weight ,

indicating the importance of on influencing . The weights are normalized s.t. , the sum of

weights of all incoming edges is at most 1

Page 14: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

14

Each node has a threshold , model the likelihood that is influenced by its

active neighbors. A large value of means that many active

neighbors are required in order to activate . Specifically, is activated if

Page 15: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

15

Recall . Since may be smaller than , it’s possible that

can’t be activated even if all its in-neighbors are active.

Corresponding to the people who just don’t want smart phones.

Page 16: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

16

Diffusion process in LT model

First each node independently selects a threshold uniformly at random in .

At each time step , Set For each inactive node , if the total weight of the

edges from its active in-neighbors is at least , i.e. , then becomes active (and is added into ).

Note: all the randomness is in the threshold selection. Once this is done, the rest diffusion process is all deterministic.

Page 17: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

17

An equivalent model: LT’

Similar to IC model, LT also has an equivalent model, in which the live edges are selected at the beginning.

For each , among all incoming edges , we will select at most one to be live.

is the one with probability . The set of live edges gives a subgraph .

Page 18: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

18

If , there is a chance that no incoming edge is live. which happens with probability

For any , the active set is set to be .

The set reachable from within steps in . The final active set is .

Page 19: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

19

This new model is equivalent to the LT model: Suppose that at time , the current active set is ,

and we want to show that any node is activated at the same probability as in LT model.

Suppose is the set of active incoming neighbors. ()

In LT: is activated with probability . In LT’: is reached from if some is selected to be

the live incoming neighbor of , which happens with probability .

Page 20: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

20

task

Suppose we have a budget for seeds. That is, .

The main task is to find a seed set so that it influence as many other nodes as possible.

Since the influence propagation is a random process, we like to maximize

the expectation of size of final active set.

Page 21: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

21

Monotonicity

as a function of set has two important properties.

Definition. A function on subsets of is monotone if

for any subsets , . Theorem. (as a function of set ) is monotone. This is pretty intuitive: More seeds generate

more active nodes (in expectation).

Page 22: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

22

Submodularity

Definition. A function on subsets of is submodular if and ,

Diminishing marginal returns: marginal contribution of when added to marginal contribution of to a smaller .

A dollar to a millionaire counts less than a dollar to a beggar.

Theorem. (as a function of set ) is submodular.

Page 23: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

23

Fact: linear combination of submodular functions with non-negative coefficients is also submodular. Set of submodular functions is closed under non-

negative linear combination. --- (1) --- (2) Consider where . gives

Page 24: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

24

Theorem. is submodular. .

Proof. Consider the equivalent model of selecting subgraph at the beginning.

Since , a non-negative linear combination of for different subgraphs of .

It’s enough to prove submodularity for , for each fixed .

Page 25: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

25

Recall that contains vertices reachable from .

We’ll show that for any , it holds that () which implies , submodularity of .

For (), any LHS that is reachable from but not from , must be reachable from . So is reachable from . And is not reachable from since .

Thus is reachable from RHS.

Page 26: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

26

Influence maximization

Our task of influence maximization: Given a social graph , a stochastic diffusion model, a budget , find a seed set with to maximize .

Namely, find an . A related problem of influence spread

computation: Given , a diffusion model, and a seed set , compute .

Page 27: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

27

Bad news: #P-hard

Theorem. Both influence maximization and influence spread computation problems are #P-hard. In both IC and LT models. #P-hard even if .

Recall NP-complete problem SAT: decide whether a given CNF formula has a satisfying assignment.

#P-complete problem #SAT: decide how many satisfying assignments does a given CNF formula have.

Clearly #P is harder than NP: If one can count the number of solutions, then it’s trivial to see whether it is 0 or not.

Page 28: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

28

Good news

The hardness comes from two sources Combinatorial nature. Influence computation.

The first can be partly overcome by a greedy approximation algorithm.

The second can be overcome by Monte Carlo simulation.

Page 29: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

29

Greedy algorithm

Input: , where , and is a monotone and submodular set function

Output: a subset

Algorithm: for to do

Take any

return

Page 30: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

30

Basically, in each round , we take an element with a largest marginal contribution to with respect to the current .

Repeat this until we select elements. Theorem. The algorithm outputs a set with

where is the optimal value .

Page 31: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

31

Proof. Suppose the algorithm selects the elements in that order,

and an optimal solution is . Let and . Since is monotone, we have

. Note that by our notation, .

Page 32: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

32

. Recall submodularity:

. Take , and , we have

Greedy algorithm selects the max marginal contribution, so

is just . Thus

.

Page 33: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

33

. Applying this argument on , we have

Repeat times, we have. Rearranging it yields

Page 34: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

34

Multiply both sides by , list the inequality for all , and sum them up. We get

,

as claimed.

Page 35: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

35

Recall that as a set function is monotone and submodular.

Apply this greedy algorithm enables us to find a seed set with , where is the optimal value .

Page 36: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

36

But there is one catch. In the algorithm we used this step:

Take any But is hard to compute! Solution: Monte Carlo simulation. For any seed set , run the diffusion process

starting from enough number of times to get a good estimate to .

Page 37: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

37

Putting everything together

With details omitted, here is the final result.

Theorem. We have an algorithm with parameters achieving -approximation ratio in time , for both IC and LT models.

Page 38: Instructor: Shengyu Zhang 1. Location change for the final 2 classes Nov 17: YIA 404 (Yasumoto International Academic Park 康本國際學術園 ) Nov 24: No class

38

Summary

Two diffusion models: IC and LT.

Influence maximization and influence spread computation problems are both #P-hard. In IC and LT models.

There exist -approximation algorithms with polynomial running time. In IC and LT models.