Recommendation in Advertising and Social Networks
Deepayan Chakrabarti ([email protected])
This presentation
1) Content Match [KDD 2007]: How can we estimate the click-through rate
(CTR) of an ad on a page?
~10^6 ads, ~10^9 pages
CTR for ad j on page i
This presentation
1) Estimating CTR for Content Match [KDD ‘07]
2) Theoretical underpinnings of link prediction [COLT ‘10 best student paper]
Represent relationships as a graph
Recommendation = Link Prediction
Many useful heuristics exist
Why do these heuristics work?
Goal: Suggest friends
Estimating CTR for Content Match
Contextual Advertising
Show an ad on a webpage (“impression”)
Revenue is generated if a user clicks
Problem: Estimate the click-through rate (CTR) of an ad on a page
~10^6 ads, ~10^9 pages
CTR for ad j on page i
Estimating CTR for Content Match
Why not use the MLE?
1. Few (page, ad) pairs have N > 0
2. Very few have c > 0 as well
3. MLE does not differentiate between 0/10 and 0/100
We have additional information: hierarchies
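A minimal sketch of the third point, assuming N is the number of impressions and c the number of clicks for a (page, ad) pair (the slide does not spell this out). The function name `mle_ctr` is hypothetical and only serves the illustration.

```python
def mle_ctr(clicks, impressions):
    """Maximum-likelihood CTR estimate: clicks / impressions."""
    if impressions == 0:
        return None  # undefined: the pair was never shown
    return clicks / impressions


few = mle_ctr(0, 10)     # 0 clicks in 10 impressions
many = mle_ctr(0, 100)   # 0 clicks in 100 impressions
unseen = mle_ctr(0, 0)   # no impressions at all
```

Both `few` and `many` come out as exactly zero, although 100 click-free impressions are stronger evidence of a low CTR than 10.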
Estimating CTR for Content Match
Use an existing, well-understood hierarchy
Categorize ads and webpages to leaves of the hierarchy
CTR estimates of siblings are correlated
The hierarchy allows us to aggregate data
Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions
Estimating CTR for Content Match
Region Hierarchy: a cross-product of the page hierarchy and the ad hierarchy
Region = (page node, ad node)
[Figure: page hierarchy and ad hierarchy side by side, from Level 0 down to Level i; a region pairs a page node with an ad node]
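One way to picture the region hierarchy is the sketch below. It assumes that a region pairs a page node and an ad node taken from the same level, and that a region's parent is the pair of the two parents. The toy category names and the helper functions are hypothetical.

```python
from itertools import product

# Hypothetical toy hierarchies, stored as child -> parent (root has parent None).
page_parent = {"pages": None, "sports": "pages", "news": "pages",
               "sports/golf": "sports", "news/local": "news"}
ad_parent = {"ads": None, "travel": "ads", "autos": "ads",
             "travel/air": "travel", "autos/suv": "autos"}


def depth(node, parent):
    """Number of edges between node and the root of its hierarchy."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def regions_at_level(level):
    """All (page node, ad node) pairs whose nodes both sit at `level`."""
    pages = [p for p in page_parent if depth(p, page_parent) == level]
    ads = [a for a in ad_parent if depth(a, ad_parent) == level]
    return list(product(pages, ads))


def region_parent(region):
    """Parent region: the pair formed by the two parent nodes."""
    page, ad = region
    if page_parent[page] is None or ad_parent[ad] is None:
        return None
    return (page_parent[page], ad_parent[ad])
```

Clicks and impressions observed for a leaf region can then be summed into `region_parent(region)`, which is how coarser levels accumulate enough data for reliable estimates.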
Estimating CTR for Content Match
Our Approach
Data Transformation
Model
Model Fitting
Data Transformation
Problem:
Solution: Freeman-Tukey transform
Differentiates regions with 0 clicks
Variance stabilization
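The formula on this slide did not survive in the transcript. The sketch below uses a common form of the Freeman-Tukey transform for a rate, the average of sqrt(c/N) and sqrt((c+1)/N); treating this as the form used in the talk is an assumption. The function name is hypothetical.

```python
import math


def freeman_tukey(clicks, impressions):
    """Freeman-Tukey style transform of a rate (assumed form):
    0.5 * (sqrt(c / N) + sqrt((c + 1) / N))."""
    n = impressions
    return 0.5 * (math.sqrt(clicks / n) + math.sqrt((clicks + 1) / n))


y_10 = freeman_tukey(0, 10)    # 0 clicks in 10 impressions
y_100 = freeman_tukey(0, 100)  # 0 clicks in 100 impressions
```

Unlike the MLE, the transformed values differ: the region with 100 click-free impressions gets the smaller value.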
Model
Goal: Smoothing across siblings in the hierarchy [Huang+Cressie/2000]
1. Each region has a latent state S_r
2. y_r is independent of the hierarchy given S_r
3. S_r is drawn from its parent S_pa(r)
[Figure: latent states S_parent at Level i and S_1 to S_4 at Level i+1; observable values y_1, y_2, y_4]
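Model
[Figure: S_pa(r) generates S_r with variance W_r; S_r generates y_r with variance V_r and coefficients β_r on features u_r; the parent has its own y_pa(r), V_pa(r), W_pa(r), β_pa(r), and u_pa(r)]
However, learning W_r, V_r and β_r for each region is clearly infeasible
Assumptions:
All regions at the same level ℓ share the same W^(ℓ) and β^(ℓ)
V_r = V/N_r for some constant V, since the variance of the transformed y_r shrinks in proportion to 1/N_r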
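One way to write the model out, under the assumptions listed on the slide. The additive Gaussian form is an assumption based on the diagram labels (variance, coefficient, features) and on the tree-structured models of Huang and Cressie; the slides do not show the equations.

```latex
\begin{align*}
y_r &= u_r^{\top}\beta^{(\ell)} + S_r + \epsilon_r,
      & \epsilon_r &\sim \mathcal{N}\!\left(0,\; V/N_r\right),\\
S_r &= S_{\mathrm{pa}(r)} + w_r,
      & w_r &\sim \mathcal{N}\!\left(0,\; W^{(\ell)}\right).
\end{align*}
```

Here ℓ is the level of region r, so only one W^(ℓ) and one β^(ℓ) per level, plus the single constant V, need to be learned.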
Model
Implications:
The variance W^(ℓ) determines the degree of smoothing
Large W^(ℓ): S_r varies greatly from S_pa(r); each region learns its own S_r (no smoothing)
W^(ℓ) near 0: all S_r are identical; a regression model on features u_r is learnt (maximum smoothing)
Var(S_r) increases from root to leaf, so estimates are better at coarser resolutions
Correlations among siblings at level ℓ depend only on the level of their least common ancestor: two regions whose least common ancestor lies deeper in the tree are more strongly correlated than two regions whose least common ancestor lies nearer the root
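The variance and correlation claims follow from the recursion S_r = S_pa(r) + w_r with independent noise, which is assumed here. In the sketch below the list `W` of per-level variances is hypothetical, with `W[0]` standing in for the variance of the root state.

```python
def state_variance(level, w):
    """Var(S_r) for a region at `level`: the root variance w[0] plus the
    innovation variances w[1..level] added on the way down."""
    return sum(w[: level + 1])


def sibling_correlation(level, lca_level, w):
    """Correlation of two distinct regions at `level` whose least common
    ancestor sits at `lca_level` (strictly above `level`)."""
    # The two states share the ancestor's state and then add independent
    # noise, so their covariance equals the ancestor's variance.
    return state_variance(lca_level, w) / state_variance(level, w)


W = [1.0, 0.5, 0.25, 0.125]  # hypothetical variances for levels 0..3
```

With these numbers, two level-3 regions that share a level-2 ancestor have correlation 1.75/1.875, while two that only share a level-1 ancestor have correlation 1.5/1.875.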
Estimating CTR for Content Match
Our Approach
Data Transformation (Freeman-Tukey)
Model (Tree-structured Markov Chain)
Model Fitting
Model Fitting
Fitting using a Kalman filtering algorithm
Filtering: Recursively aggregate data from leaves to root
Smoothing: Propagate information from root to leaves
Complexity: linear in the number of regions, for both time and space
The Kalman filter requires knowledge of β, V, and W
EM is wrapped around the Kalman filter
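The two passes can be illustrated on the smallest possible tree: one root with leaf children. This is a simplified sketch, not the implementation from the talk. It assumes Gaussian noise, no features (β = 0), known V and W (the talk learns them with EM), and a hypothetical prior on the root state.

```python
def fit_two_level(y, n, V, W, prior_mean=0.0, prior_var=1.0):
    """Posterior means for a root state and its leaf states.

    Assumed model: S_root ~ N(prior_mean, prior_var),
    S_r = S_root + w_r with Var(w_r) = W,
    y_r = S_r + e_r with Var(e_r) = V / n[r].
    """
    # Filtering (leaves -> root): each leaf observation is a noisy view of
    # the root with variance W + V/n_r; combine them by precision weighting.
    precision = 1.0 / prior_var
    weighted = prior_mean / prior_var
    for y_r, n_r in zip(y, n):
        leaf_var = W + V / n_r
        precision += 1.0 / leaf_var
        weighted += y_r / leaf_var
    root_mean = weighted / precision

    # Smoothing (root -> leaves): move each leaf from the root estimate
    # toward its own observation by the gain W / (W + V/n_r).
    leaf_means = []
    for y_r, n_r in zip(y, n):
        gain = W / (W + V / n_r)
        leaf_means.append(root_mean + gain * (y_r - root_mean))
    return root_mean, leaf_means


root, leaves = fit_two_level(y=[0.05, 0.30], n=[10, 100000], V=1.0, W=0.01)
```

The leaf with only 10 impressions is pulled strongly toward the root estimate, while the leaf with 100000 impressions stays close to its own observation. Each pass visits every region once, which matches the linear complexity stated above.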
Experiments
503M impressions
7-level hierarchy, of which the top 3 levels were used
Zero clicks in 76% of regions in level 2 and 95% of regions in level 3
Full dataset DFULL, and a 2/3 sample DSAMPLE
Experiments
Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE
Some of these regions (R>0) get clicks in DFULL
A good model should predict higher CTRs for R>0 than for the other regions in R
Experiments
We compared 4 models:
TS: our tree-structured model
LM (level-mean): each level smoothed independently
NS (no smoothing): CTR proportional to 1/N_r
Random: assuming |R>0| is given, randomly predict the membership of R>0 out of R
Experiments
[Figure: comparison of TS, LM, NS, and Random]
Experiments
MLE = 0 everywhere, since 0 clicks were observed
What about estimated CTR?
[Figure: estimated CTR vs. impressions, in two panels: No Smoothing (NS) and Our Model (TS)]
Variability from coarser resolutions
Close to MLE for large N
Estimating CTR for Content Match
We presented a method to estimate rates of extremely rare events at multiple resolutions under severe sparsity constraints
Key points:
Tree-structured generative model
Extremely fast parameter fitting
Theoretical underpinnings
1) Estimating CTR for Content Match [KDD ‘07]
2) Theoretical underpinnings of link prediction [COLT ‘10 best student paper]
Link Prediction
Which pair of nodes {i,j} should be connected?
Goal: Recommend a movie
[Figure: example graph with Alice, Bob, and Charlie]
Goal: Suggest friends
Link Prediction Heuristics
Predict link between nodes
Connected by the shortest path
With the most common neighbors (length 2 paths)
More weight to low-degree common neighbors (Adamic/Adar)
With more short paths (e.g. length 3 paths), with exponentially decaying weights to longer paths (Katz measure)
Example (Alice, Bob, Charlie): a prolific common friend with 1000 followers gives less evidence; a less prolific one with 3 followers gives much more evidence
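The heuristics listed above can be sketched on a small undirected graph. The toy graph, the function names, and the parameter values are hypothetical. The Adamic/Adar weight 1/log(degree) is the standard definition, and the Katz score is truncated at `max_len` and counts walks, so repeated nodes are allowed.

```python
import math

# Hypothetical undirected toy graph as adjacency sets.
graph = {
    "alice": {"bob", "hub"},
    "bob": {"alice", "carol", "hub"},
    "carol": {"bob", "hub"},
    "hub": {"alice", "bob", "carol", "dave", "erin"},
    "dave": {"hub"},
    "erin": {"hub"},
}


def common_neighbors(g, i, j):
    """Number of nodes adjacent to both i and j."""
    return len(g[i] & g[j])


def adamic_adar(g, i, j):
    """Each common neighbor k contributes 1 / log(degree of k)."""
    return sum(1.0 / math.log(len(g[k])) for k in g[i] & g[j])


def katz(g, i, j, beta=0.1, max_len=4):
    """Sum over l = 1..max_len of beta**l * (number of length-l walks i -> j)."""
    counts = {i: 1}  # one walk of length 0, ending at i
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, c in counts.items():
            for nb in g[node]:
                nxt[nb] = nxt.get(nb, 0) + c
        counts = nxt
        score += beta ** length * counts.get(j, 0)
    return score
```

For the pair (alice, carol), the low-degree common neighbor bob contributes more to the Adamic/Adar score than the high-degree node hub, which mirrors the "less prolific, much more evidence" point.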
Previous Empirical Studies*
[Figure: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
How do we justify these observations?
Especially if the graph is sparse
Link Prediction – Generative Model
Unit volume universe
Model:
1. Nodes are uniformly distributed points in a latent space
2. This space has a distance metric
3. Points close to each other are likely to be connected in the graph
Logistic distance function (Raftery+/2002)
[Figure: probability of linking as a function of distance, higher for closer points, passing through ½ at radius r; α determines the steepness]
Link prediction ≈ find the nearest neighbor who is not currently linked to the node
Equivalent to inferring distances in the latent space
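A sketch of a logistic link function that matches the figure labels (probability ½ at distance r, steepness controlled by α). The exact parametrization 1 / (1 + exp(α (d − r))) is an assumption, and the function name is hypothetical.

```python
import math


def link_probability(d, r, alpha):
    """Assumed form: P(i ~ j | d_ij = d) = 1 / (1 + exp(alpha * (d - r)))."""
    z = alpha * (d - r)
    # Evaluate the logistic function without overflowing math.exp.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


at_radius = link_probability(0.2, 0.2, alpha=5.0)
closer = link_probability(0.1, 0.2, alpha=5.0)
farther = link_probability(0.3, 0.2, alpha=5.0)
```

As α grows, the curve approaches a step function: nodes within distance r link almost surely and nodes beyond r almost never.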
Previous Empirical Studies*
[Figure: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Especially if the graph is sparse
Common Neighbors
Pr_2(i,j) = Pr(common neighbor | d_ij)
Pr_2(i,j) = ∫ Pr(i~k | d_ik) Pr(j~k | d_jk) P(d_ik, d_jk | d_ij) d(d_ik) d(d_jk)
Product of two logistic probabilities, integrated over a volume determined by d_ij
As α → ∞, Logistic → Step function
Much easier to analyze!
Common Neighbors
Everyone has the same radius r
Pr_2(i,j) = A(r, r, d_ij)
η = number of common neighbors
V(r) = volume of radius r in D dimensions
Unit volume universe
P[ η/N − ε ≤ A(r, r, d_ij) ≤ η/N + ε ] ≥ 1 − 2δ
# common neighbors gives a bound on distance:
2r (1 − ((η/N + ε)/V(r))^(1/D)) ≤ d_ij ≤ 2r √(1 − ((η/N − ε)/V(r))^(2/D))
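In the step-function limit, a node k is a common neighbor of i and j exactly when it lies within distance r of both, so A(r, r, d_ij) is the volume of the intersection of two balls. The sketch below works this out for D = 2 using the standard formula for the overlap of two equal circles. Restricting to two dimensions and ignoring boundary effects of the unit-volume universe are assumptions made for illustration.

```python
import math


def lens_area(r, d):
    """Area of the intersection of two radius-r disks whose centers are
    a distance d >= 0 apart (the D = 2 case of A(r, r, d))."""
    if d >= 2 * r:
        return 0.0
    return (2 * r * r * math.acos(d / (2 * r))
            - 0.5 * d * math.sqrt(4 * r * r - d * d))


def expected_common_neighbors(n, r, d):
    """With n nodes placed uniformly in a unit-area universe, each one is a
    common neighbor with probability lens_area(r, d)."""
    return n * lens_area(r, d)
```

Because the overlap shrinks monotonically as d grows, observing many common neighbors implies a small distance, which is the intuition behind the bound on d_ij.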
Common Neighbors
OPT = node closest to i
MAX = node with max common neighbors with i
Theorem: w.h.p., d_OPT ≤ d_MAX ≤ d_OPT + 2[ε/V(1)]^(1/D)
Link prediction by common neighbors is asymptotically optimal
Common Neighbors: Distinct Radii
Node k has radius r_k
i → k if d_ik ≤ r_k (directed graph)
r_k captures popularity of node k
Type 1: i ← k → j, governed by the radii of i and j: A(r_i, r_j, d_ij)
Type 2: i → k ← j, governed by the radius of k: A(r_k, r_k, d_ij)
Type 2 common neighbors
Example graph: N_1 nodes of radius r_1 and N_2 nodes of radius r_2, with r_1 << r_2
η_1 ~ Bin[N_1, A(r_1, r_1, d_ij)]
η_2 ~ Bin[N_2, A(r_2, r_2, d_ij)]
Pick d* to maximize Pr[η_1, η_2 | d_ij]
w(r_1) E[η_1 | d*] + w(r_2) E[η_2 | d*] = w(r_1) η_1 + w(r_2) η_2
Weighted common neighbors
Inversely related to d*
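A numerical sketch of picking d*, assuming D = 2, hard-threshold links, and no boundary effects. It maximizes the binomial log-likelihood of the observed counts over a grid of distances. The counts, radii, and function names are hypothetical, and the grid search stands in for the analytical treatment in the talk.

```python
import math


def lens_area(r, d):
    """Overlap area of two radius-r disks at center distance d (D = 2)."""
    if d >= 2 * r:
        return 0.0
    return (2 * r * r * math.acos(d / (2 * r))
            - 0.5 * d * math.sqrt(4 * r * r - d * d))


def log_likelihood(d, counts):
    """counts: list of (eta, N, r) triples, one per radius class.
    The binomial coefficient is dropped because it does not depend on d."""
    total = 0.0
    for eta, n, r in counts:
        p = min(max(lens_area(r, d), 1e-12), 1 - 1e-12)  # keep logs finite
        total += eta * math.log(p) + (n - eta) * math.log(1 - p)
    return total


def most_likely_distance(counts, steps=2000):
    """Grid search for the d in [0, 2 * r_max] with the highest likelihood."""
    r_max = max(r for _, _, r in counts)
    grid = [2 * r_max * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda d: log_likelihood(d, counts))


# Hypothetical observations: (eta, N, r) for a small and a large radius class.
d_near = most_likely_distance([(2, 1000, 0.05), (102, 1000, 0.2)])
d_far = most_likely_distance([(0, 1000, 0.05), (40, 1000, 0.2)])
```

A pair with more common neighbors, and in particular with any small-radius common neighbors, is assigned a smaller d*, in line with the inverse relation stated above.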
Type 2 common neighbors
[Figure: weight w(r) as a function of radius r, with Adamic/Adar and 1/r shown for comparison; the formula labels involve const, r, deg, and D]
r is close to max radius
Real world graphs generally fall in this range
Presence of common neighbor is very informative
Absence is very informative
Previous Empirical Studies*
[Figure: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Especially if the graph is sparse
ℓ-hop Paths
Common neighbors = 2-hop paths
For longer paths:
[Equation: bound on d_ij in terms of r, g, η, N, and δ]
Bounds are weaker
For ℓ’ ≥ ℓ we need η_ℓ’ >> η_ℓ to obtain similar bounds
This justifies the exponentially decaying weight given to longer paths by the Katz measure
Summary
Three key ingredients
1. Closer points are likelier to be linked (Small World Model: Watts, Strogatz, 1998; Kleinberg 2001)
2. Triangle inequality holds: necessary to extend to ℓ-hop paths
3. Points are spread uniformly at random: otherwise properties will depend on location as well as distance
Summary
[Figure: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
The number of paths matters, not the length
For large dense graphs, common neighbors are enough
Differentiating between different degrees is important
In sparse graphs, length 3 or more paths help in prediction
Conclusions
Discussed two problems
1. Estimating CTR for Content Match
Combat sparsity by hierarchical smoothing
2. Theoretical underpinnings
Latent space model
Link prediction ≈ finding nearest neighbors in this space
Other Work
Web Search:
Finding Quicklinks [WWW ‘09]
Titles for Quicklinks [KDD ‘08]
Incorporating tweets into search results [ICWSM ‘11]
Website clustering [WWW ‘10]
Webpage segmentation [WWW ‘08]
Template detection [WWW ‘07]
Finding hidden query aspects [KDD ’09]
Computational Advertising:
Combining IR with click feedback [WWW ‘08]
Multi-armed bandits using hierarchies [SDM ‘07, ICML ‘07]
“Mortal” multi-armed bandits [NIPS ‘08]
Traffic Shaping [EC ‘12]
Graph Mining:
Epidemic thresholds [SRDS ‘03, Infocom ‘07]
Non-parametric prediction in dynamic graphs
Graph sampling [ICML ‘11]
Graph generation models [SDM ‘04, PKDD ‘05, JMLR ‘10]
Community detection [KDD ‘04, PKDD ‘04]
Advertising Setting
Display, Content Match, Sponsored Search
Pick ads
Text ads
Match ads to the content
[Figure: a page showing a content match ad]
Common Neighbors: Distinct Radii
Node k has radius r_k
i → k if d_ik ≤ r_k (directed graph)
r_k captures popularity of node k
“Weighted” common neighbors: predict (i,j) pairs with highest Σ w(r) η(r)
w(r) = weight for nodes of radius r
η(r) = # common neighbors of radius r
Estimating CTR for Content Match
Region Hierarchy: a cross-product of the page hierarchy and the ad hierarchy
Region = (page node, ad node)
[Figure: page hierarchy (page classes) and ad hierarchy (ad classes), from Level 0 down to Level i; a region pairs a page node with an ad node]