Recommendation in Advertising and Social Networks

Deepayan Chakrabarti ([email protected])

This presentation

1) Content Match [KDD 2007]: How can we estimate the click-through rate (CTR) of an ad on a page?

~10^6 ads, ~10^9 pages

CTR for ad j on page i

This presentation

1) Estimating CTR for Content Match [KDD ‘07]

2) Theoretical underpinnings [COLT ‘10 best student paper]

Goal: Suggest friends

Represent relationships as a graph
Recommendation = Link Prediction
Many useful heuristics exist
Why do these heuristics work?

Estimating CTR for Content Match

Contextual Advertising

Show an ad on a webpage (“impression”)
Revenue is generated if a user clicks
Problem: Estimate the click-through rate (CTR) of an ad on a page

~10^6 ads, ~10^9 pages

CTR for ad j on page i

Estimating CTR for Content Match

Why not use the MLE?

1. Few (page, ad) pairs have N>0
2. Very few have c>0 as well
3. MLE does not differentiate between 0/10 and 0/100

We have additional information: hierarchies
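A small illustration of point 3. It assumes N is the number of impressions and c the number of clicks for a (page, ad) pair, which the slide does not spell out; the helper name `mle_ctr` is hypothetical.

```python
from typing import Optional


def mle_ctr(clicks: int, impressions: int) -> Optional[float]:
    """Maximum-likelihood CTR estimate c/N; undefined (None) when N == 0."""
    if impressions == 0:
        return None
    return clicks / impressions


# Both pairs receive the same estimate, although the second pair
# was shown ten times as often without a single click.
print(mle_ctr(0, 10), mle_ctr(0, 100))
```

The estimate carries no notion of how much evidence stands behind it, which is the gap the hierarchy is meant to fill.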

Estimating CTR for Content Match

Use an existing, well-understood hierarchy
Categorize ads and webpages to leaves of the hierarchy
CTR estimates of siblings are correlated
The hierarchy allows us to aggregate data
Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions

Estimating CTR for Content Match

Region Hierarchy: a cross-product of the page hierarchy and the ad hierarchy

Region = (page node, ad node)
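A minimal sketch of the cross-product idea. The toy hierarchies and the helper names are hypothetical, and two choices here are assumptions rather than something the slide states: regions pair nodes at the same depth, and a region's parent pairs the parents of its two nodes.

```python
from itertools import product

# Hypothetical toy hierarchies, stored as child -> parent (roots map to None).
page_parent = {"pages": None, "sports": "pages", "news": "pages"}
ad_parent = {"ads": None, "shoes": "ads", "travel": "ads"}


def depth(node, parent):
    """Number of edges between a node and the root of its hierarchy."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def regions_at_level(level):
    """All (page node, ad node) pairs whose nodes both sit at this depth."""
    pages = [p for p in page_parent if depth(p, page_parent) == level]
    ads = [a for a in ad_parent if depth(a, ad_parent) == level]
    return list(product(pages, ads))


def parent_region(region):
    """Assumed parent of a region: the pair of the two parents."""
    page, ad = region
    return (page_parent[page], ad_parent[ad])


print(regions_at_level(1))
print(parent_region(("sports", "shoes")))
```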

[Figure: page hierarchy and ad hierarchy, Level 0 to Level i, with one region highlighted]

Estimating CTR for Content Match

Our Approach

Data Transformation
Model
Model Fitting

Data Transformation

Problem:

Solution: Freeman-Tukey transform

Differentiates regions with 0 clicks
Variance stabilization:
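The transform's formula did not survive on the slide. The sketch below uses one commonly cited form of the Freeman-Tukey double square-root transform applied to a rate c/N; treat the exact form, and the function name `freeman_tukey`, as assumptions rather than the talk's definition.

```python
import math


def freeman_tukey(clicks: int, impressions: int) -> float:
    """Assumed form: 0.5 * (sqrt(c/N) + sqrt((c+1)/N)), for N > 0."""
    n = impressions
    return 0.5 * (math.sqrt(clicks / n) + math.sqrt((clicks + 1) / n))


# Unlike the raw rate c/N, zero-click regions are no longer all mapped to 0:
# fewer impressions leave a larger transformed value.
print(freeman_tukey(0, 10), freeman_tukey(0, 100))
```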

Model

Goal: Smoothing across siblings in hierarchy [Huang+Cressie/2000]

1. Each region has a latent state Sr
2. yr is independent of the hierarchy given Sr
3. Sr is drawn from its parent Spa(r)

[Figure: a parent state Sparent at level i with child states S1, S2, S3, S4 at level i+1 (latent) and observations y1, y2, y4 (observable)]

Model

[Figure: graphical model linking Spa(r) to Sr with variance Wr, and Sr to yr with variance Vr and coefficients βr on features ur; the parent carries the corresponding ypa(r), Vpa(r), Wpa(r), βpa(r), upa(r)]

However, learning Wr, Vr and βr for each region is clearly infeasible

Assumptions:
All regions at the same level ℓ share the same W(ℓ) and β(ℓ)

Vr = V/Nr for some constant V, since

Model

Implications: W(ℓ) determines the degree of smoothing

Large W(ℓ): Sr varies greatly from Spa(r); each region learns its own Sr (no smoothing)
W(ℓ) → 0: all Sr are identical; a regression model on features ur is learnt (maximum smoothing)
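The two extremes can be seen in a one-region sketch. It assumes a scalar Gaussian version of the model with no features: Sr ~ N(Spa(r), W) and yr ~ N(Sr, Vr), with the parent state treated as known. The function name is hypothetical.

```python
def posterior_state(y_r: float, v_r: float, s_parent: float, w: float) -> float:
    """E[S_r | y_r, S_pa(r)] when S_r ~ N(S_pa(r), w) and y_r ~ N(S_r, v_r)."""
    gain = w / (w + v_r)  # 0 -> trust the parent, 1 -> trust the observation
    return s_parent + gain * (y_r - s_parent)


# w = 0: the region collapses onto its parent (maximum smoothing).
print(posterior_state(0.3, 0.01, 0.1, 0.0))
# Very large w: the region keeps its own observation (no smoothing).
print(posterior_state(0.3, 0.01, 0.1, 1e9))
```

Intermediate values of w interpolate between the two, weighted by how noisy the observation is.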

Model

Implications: W(ℓ) determines the degree of smoothing
Var(Sr) increases from root to leaf
Better estimates at coarser resolutions

Model

Implications: W(ℓ) determines the degree of smoothing
Var(Sr) increases from root to leaf
Correlations among siblings at level ℓ depend only on the level of their least common ancestor

[Figure: Corr(·,·) compared for two pairs of regions, one pair more strongly correlated than the other]

Estimating CTR for Content Match

Our Approach

Data Transformation (Freeman-Tukey)
Model (Tree-structured Markov Chain)
Model Fitting

Model Fitting

Fitting using a Kalman filtering algorithm
Filtering: Recursively aggregate data from leaves to root
Smoothing: Propagate information from root to leaves

Complexity: linear in the number of regions, for both time and space

[Figure: filtering runs up the tree, smoothing runs down]

Model Fitting

Fitting using a Kalman filtering algorithm
Filtering: Recursively aggregate data from leaves to root
Smoothing: Propagate information from root to leaves

The Kalman filter requires knowledge of β, V, and W, so EM is wrapped around the Kalman filter
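The two passes can be sketched for a scalar Gaussian tree. This is not the authors' implementation: it drops the features β, uses a single W for every level, takes V and W as given (no EM), and puts a flat prior on the root. The `Region` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    y: Optional[float]   # transformed observation, None if unobserved
    n: float             # impressions N_r, so the observation variance is V / N_r
    children: List["Region"] = field(default_factory=list)
    up_prec: float = 0.0   # precision of the subtree's evidence about S_r
    up_mean: float = 0.0
    post_mean: float = 0.0
    post_var: float = 0.0


def filter_up(r: Region, V: float, W: float) -> None:
    """Leaves-to-root pass: summarize each subtree as a Gaussian message about S_r."""
    prec, weighted = 0.0, 0.0
    if r.y is not None:
        prec += r.n / V
        weighted += r.y * r.n / V
    for c in r.children:
        filter_up(c, V, W)
        if c.up_prec > 0:
            p = 1.0 / (1.0 / c.up_prec + W)  # child evidence is blurred by W
            prec += p
            weighted += p * c.up_mean
    r.up_prec = prec
    r.up_mean = weighted / prec if prec > 0 else 0.0


def smooth_down(r: Region, V: float, W: float,
                down_prec: float = 0.0, down_mean: float = 0.0) -> None:
    """Root-to-leaves pass: merge subtree evidence with the message from above."""
    prec = r.up_prec + down_prec
    if prec <= 0:
        raise ValueError("no evidence reaches this region")
    r.post_mean = (r.up_prec * r.up_mean + down_prec * down_mean) / prec
    r.post_var = 1.0 / prec
    for c in r.children:
        # Remove the child's own contribution before sending information back down.
        c_p = 1.0 / (1.0 / c.up_prec + W) if c.up_prec > 0 else 0.0
        cav_prec = prec - c_p
        if cav_prec > 0:
            cav_mean = (prec * r.post_mean - c_p * c.up_mean) / cav_prec
            smooth_down(c, V, W, 1.0 / (1.0 / cav_prec + W), cav_mean)
        else:
            smooth_down(c, V, W, 0.0, 0.0)


a, b = Region(y=0.1, n=100.0), Region(y=0.3, n=100.0)
root = Region(y=None, n=0.0, children=[a, b])
filter_up(root, V=1.0, W=1e-4)
smooth_down(root, V=1.0, W=1e-4)
print(root.post_mean, a.post_mean, b.post_mean)
```

Each region is visited once per pass, which matches the linear time and space stated on the slide.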

Experiments

503M impressions
7-level hierarchy, of which the top 3 levels were used
Zero clicks in 76% of regions in level 2 and 95% of regions in level 3
Full dataset DFULL, and a 2/3 sample DSAMPLE

Experiments

Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE
Some of these regions, R>0, get clicks in DFULL
A good model should predict higher CTRs for R>0 than for the other regions in R

Experiments

We compared 4 models:
TS: our tree-structured model
LM (level-mean): each level smoothed independently
NS (no smoothing): CTR proportional to 1/Nr
Random: Assuming |R>0| is given, randomly predict the membership of R>0 out of R

Experiments

[Figure: comparison of TS, Random, LM, and NS]

Experiments

MLE=0 everywhere, since 0 clicks were observed
What about estimated CTR?

[Figure: Estimated CTR vs. Impressions, two panels: No Smoothing (NS) and Our Model (TS)]

Variability from coarser resolutions
Close to MLE for large N

Estimating CTR for Content Match

We presented a method to estimate rates of extremely rare events at multiple resolutions under severe sparsity constraints

Key points:
Tree-structured generative model
Extremely fast parameter fitting

Theoretical underpinnings

1) Estimating CTR for Content Match [KDD ‘07]

2) Theoretical underpinnings of link prediction [COLT ‘10 best student paper]

Link Prediction

Which pair of nodes {i,j} should be connected?

Goal: Recommend a movie

[Figure: Alice, Bob, Charlie]

Link Prediction

Which pair of nodes {i,j} should be connected?

Goal: Suggest friends

Link Prediction Heuristics

Predict link between nodes
Connected by the shortest path
With the most common neighbors (length 2 paths)
More weight to low-degree common nbrs (Adamic/Adar)
With more short paths (e.g. length 3 paths), with exponentially decaying weights to longer paths (Katz measure)
…

[Figure: Alice, Bob, Charlie. Prolific common friends (1000 followers): less evidence. Less prolific (3 followers): much more evidence]
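The heuristics listed above can be sketched on a small undirected graph stored as a dict of neighbor sets. The function names are hypothetical. The Adamic/Adar score uses the usual 1/log(degree) weight, and the Katz score is truncated at `max_len` and counts walks (which may revisit nodes), as the standard matrix-power definition does.

```python
import math


def common_neighbors(adj, i, j):
    """Number of length-2 paths between i and j."""
    return len(adj[i] & adj[j])


def adamic_adar(adj, i, j):
    """Common neighbors weighted by 1/log(degree): low-degree ones count more.

    In an undirected graph a common neighbor of two distinct nodes has
    degree >= 2, so the logarithm is never zero.
    """
    return sum(1.0 / math.log(len(adj[k])) for k in adj[i] & adj[j])


def katz(adj, i, j, beta=0.1, max_len=4):
    """Truncated Katz: sum over l of beta**l * (number of length-l walks i -> j)."""
    walks = {i: 1}  # walks of the current length from i to each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for u, count in walks.items():
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0) + count
        walks = nxt
        score += beta ** length * walks.get(j, 0)
    return score


# Hypothetical toy graph.
adj = {
    "alice": {"bob", "carol"},
    "dave": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "erin": {"carol"},
}
print(common_neighbors(adj, "alice", "dave"))
print(adamic_adar(adj, "alice", "dave"))
print(katz(adj, "alice", "dave"))
```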

Previous Empirical Studies*

[Figure: link prediction accuracy* for Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

How do we justify these observations?

Especially if the graph is sparse

Link Prediction – Generative Model

Unit volume universe

Model:
1. Nodes are uniformly distributed points in a latent space
2. This space has a distance metric
3. Points close to each other are likely to be connected in the graph
   Logistic distance function (Raftery+/2002)

[Figure: link probability vs. distance, passing from 1 through ½ at radius r; higher probability of linking at smaller distances; α determines the steepness]

Link prediction ≈ find the nearest neighbor who is not currently linked to the node.
Equivalent to inferring distances in the latent space
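A minimal sketch of sampling a graph from this kind of model. The logistic form 1/(1 + exp(α(d − r))) is an assumption chosen to match the slide's description (probability ½ at distance r, steepness set by α); the function names, the unit cube as the universe, and the Euclidean metric are likewise illustrative choices.

```python
import math
import random


def link_probability(d: float, r: float, alpha: float) -> float:
    """Assumed logistic link function: 1/2 at d == r, steeper for larger alpha."""
    z = alpha * (d - r)
    if z > 700:  # exp would overflow; the probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def sample_graph(n: int, dim: int, r: float, alpha: float, seed: int = 0):
    """Place n points uniformly in the unit cube and link pairs independently."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if rng.random() < link_probability(d, r, alpha):
                edges.add((i, j))
    return points, edges


points, edges = sample_graph(n=50, dim=2, r=0.2, alpha=40.0)
print(len(edges))
```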




Common Neighbors

Pr2(i,j) = Pr(common neighbor | dij)

$$\Pr\nolimits_2(i,j) = \int \Pr(i \sim k \mid d_{ik})\,\Pr(j \sim k \mid d_{jk})\,P(d_{ik}, d_{jk} \mid d_{ij})\;\mathrm{d}d_{ik}\,\mathrm{d}d_{jk}$$

Product of two logistic probabilities, integrated over a volume determined by dij

As α → ∞, Logistic → Step function

Much easier to analyze!

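Common Neighbors

Everyone has same radius r

Unit volume universe
V(r) = volume of radius r in D dims
η = Number of common neighbors

$$\Pr\nolimits_2(i,j) = A(r, r, d_{ij})$$

$$P\left[\frac{\eta}{N} - \epsilon \le A(r, r, d_{ij}) \le \frac{\eta}{N} + \epsilon\right] \ge 1 - 2\delta$$

# common nbrs gives a bound on distance

$$2r\left(1 - \left(\frac{\eta/N + \epsilon}{V(r)}\right)^{1/D}\right) \le d_{ij} \le 2r\sqrt{1 - \left(\frac{\eta/N - \epsilon}{V(r)}\right)^{2/D}}$$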
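For intuition, here is the D = 2 case, reading A(r, r, dij) as the area shared by two discs of radius r centered at i and j (an interpretation of the slide's notation, not something it states). The function name `lens_area` is hypothetical. The area shrinks as the centers move apart, which is why a common-neighbor count carries information about distance.

```python
import math


def lens_area(r: float, d: float) -> float:
    """Area shared by two discs of radius r whose centers are d apart (D = 2)."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - 0.5 * d * math.sqrt(4 * r * r - d * d)


for d in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(d, lens_area(1.0, d))
```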

Common Neighbors

OPT = node closest to i
MAX = node with max common neighbors with i

Theorem: w.h.p.

$$d_{OPT} \le d_{MAX} \le d_{OPT} + 2\left[\epsilon / V(1)\right]^{1/D}$$

Link prediction by common neighbors is asymptotically optimal

Common Neighbors: Distinct Radii

Node k has radius rk.
i → k if dik ≤ rk (Directed graph)
rk captures popularity of node k

Type 1: i ← k → j, radii ri and rj, A(ri, rj, dij)
Type 2: i → k ← j, radii rk and rk, A(rk, rk, dij)

Type 2 common neighbors

Example graph:
N1 nodes of radius r1 and N2 nodes of radius r2
r1 << r2

η1 ~ Bin[N1, A(r1, r1, dij)]
η2 ~ Bin[N2, A(r2, r2, dij)]

Pick d* to maximize Pr[η1, η2 | dij]

w(r1) E[η1|d*] + w(r2) E[η2|d*] = w(r1) η1 + w(r2) η2

Weighted common neighbors
Inversely related to d*

Type 2 common neighbors

r is close to max radius

D1

deg

constr

constw(r)

Real world graphs generally fall in this range

i

rkk

j

Presence of common neighbor is very informative

Absence is very informative

Adamic/Adar

1/r


ℓ-hop Paths

Common neighbors = 2 hop paths

For longer paths:
Bounds are weaker
For ℓ’ ≥ ℓ we need ηℓ’ >> ηℓ to obtain similar bounds
This justifies the exponentially decaying weight given to longer paths by the Katz measure

[Equation: bound on dij in terms of r, η, N, and δ]

Summary

Three key ingredients

1. Closer points are likelier to be linked.
   Small World Model: Watts & Strogatz, 1998; Kleinberg, 2001
2. Triangle inequality holds
   Necessary to extend to ℓ-hop paths
3. Points are spread uniformly at random
   Otherwise properties will depend on location as well as distance

Summary

[Figure: link prediction accuracy chart, annotated with the points below]

The number of paths matters, not the length
For large dense graphs, common neighbors are enough
Differentiating between different degrees is important
In sparse graphs, length 3 or more paths help in prediction.

Conclusions

Discussed two problems

1. Estimating CTR for Content Match
   Combat sparsity by hierarchical smoothing
2. Theoretical underpinnings
   Latent space model
   Link prediction ≈ finding nearest neighbors in this space

Other Work

Web Search
Finding Quicklinks [WWW ‘09]
Titles for Quicklinks [KDD ‘08]
Incorporating tweets into search results [ICWSM ‘11]
Website clustering [WWW ‘10]
Webpage segmentation [WWW ‘08]
Template detection [WWW ‘07]
Finding hidden query aspects [KDD ‘09]

Computational Advertising
Combining IR with click feedback [WWW ‘08]
Multi-armed bandits using hierarchies [SDM ‘07, ICML ‘07]
“Mortal” multi-armed bandits [NIPS ‘08]
Traffic Shaping [EC ‘12]

Graph Mining
Epidemic thresholds [SRDS ‘03, Infocom ‘07]
Non-parametric prediction in dynamic graphs
Graph sampling [ICML ‘11]
Graph generation models [SDM ‘04, PKDD ‘05, JMLR ‘10]
Community detection [KDD ‘04, PKDD ‘04]

Advertising Setting

Content

match ad

Display Content Match

Sponsored Search

48

Advertising Setting

Pick ads

Text ads

Match ads to the content

Display Content Match

Sponsored Search

49



“Weighted” common neighbors: Predict (i,j) pairs with highest Σ w(r)η(r)

w(r): Weight for nodes of radius r
η(r): # common neighbors of radius r




Estimating CTR for Content Match

Level i





Region

53

Estimating CTR for Content Match Level 0

Level i





Page classes Ad classes

Region
