Contextual Advertising by Combining Relevance with Click Feedback
Deepak Agarwal
Joint work with Deepayan Chakrabarti & Vanja Josifovski
Yahoo! Research
WWW’08, Beijing, China
24th April, 2008
Outline
Motivating Application, Challenges: Contextual Advertising
Semantic versus Predictive models: Pros, Cons
Our Approach: Blend Semantic with Predictive
Model Description: Logistic Regression, Feature Selection; model structure amenable to fast scoring at run time
Experimental Results
Ongoing work
Outline 1
Motivating Application, Background and Challenges
Motivating Application
Problem: Match ads to queries
Sponsored Search: The query is a short piece of text input by the user. User intent is better expressed; less noisy.
Contextual Advertising: The query is a webpage. Generally long and noisy; user intent is less clear. A harder matching problem.
Challenges
Serve ads to maximize revenue (CTR): serve the most relevant ads in a given context
User feedback in the form of clicks in different contexts
Automation is a must for profitability: billions of opportunities; millions of ads
High volume, low marginal cost → lucrative business
Automation through algorithms/models
Accuracy: massive data; scalable procedures
Structure of models: scoring ads under strict latency requirements (~few ms)
Classical Approach: Semantic
Serve shoe ads on shoe pages
Models: Information Retrieval
Get relevant docs (ads) for a query (webpage)
Simple vector space model
$q = (t_1, w_1; \ldots; t_n, w_n)$; $a = (a_1, v_1; \ldots; a_m, v_m)$
$\mathrm{Cos}(q, a) = \sum_{s \in q \cap a} w_s v_s / (|q|\,|a|)$
w's, v's: tf-idf weights
Frequency: reward in doc; penalize in corpus
Higher score → more relevance
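The vector space score above can be sketched in a few lines of Python. This is a minimal illustration, not the production system: the helper names (`tfidf_vector`, `cosine`), the idf form log(N/df), and the toy document frequencies are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Map a token list to a sparse {term: tf * idf} vector.

    idf = log(n_docs / df) is an assumed, common variant; terms with
    no document frequency are dropped.
    """
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t])
            for t, c in tf.items() if doc_freq.get(t)}

def cosine(q, a):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(q[s] * a[s] for s in q.keys() & a.keys())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    if norm_q == 0 or norm_a == 0:
        return 0.0
    return dot / (norm_q * norm_a)

# Hypothetical corpus statistics: "the" occurs in every document, so idf = 0.
doc_freq = {"shoe": 10, "run": 20, "golf": 5, "the": 100}
n_docs = 100
page = tfidf_vector(["shoe", "run", "the", "shoe"], doc_freq, n_docs)
shoe_ad = tfidf_vector(["shoe", "run"], doc_freq, n_docs)
golf_ad = tfidf_vector(["golf", "the"], doc_freq, n_docs)
```

With these toy numbers the shoe ad scores higher than the golf ad on the shoe page, because the only term the golf ad shares with the page is a corpus-wide word whose idf is zero.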
Semantic: Pros & Cons
Pros
Training: simple, scalable
Vocabulary (stop-words; stemming); Corpus
Serving with low latency: evaluates millions of candidate ads in a few ms
Clever algorithms (Broder et al.)
Cons
Does not always capture context
Clicks? Better?
Active user feedback: can we use it?
Predictive Approach: Clicks
New, challenging research area
Learn from historic clicks on ads
Indicator of overall relevance
Rank ads by CTR = P(Click | Ad, context)
Estimating CTR is a difficult statistical problem
High-dimensional and sparse (too many combinations)
(Page, Ad) → (Page Features, Ad Features)
Bias-variance tradeoff when selecting features
Coarse features are stable but less precise; fine features have high variance
Statistical Challenges (contd.)
Retrospective data is biased
If I never showed ads with the word “Rolex” on pages with the word “Golf”, how will I learn this match?
What is irrelevant? Labeling negatives.
“I never click on ads, no matter what”
Good models may be complex
Scalability while training (grid computing helps)
Serving: not all models are index friendly
Quick evaluation at serve time improves the system
When Semantic meets Predictive
Semantic provides domain knowledge
Feature selection driven by semantic knowledge
Predictive “enhances” semantic
“Correction” terms added to semantic to match click feedback; fall back on semantic when the signal is weak
Model scalable (grid computing)
Fast to evaluate at run time
Faster → more candidates evaluated at serve time
Accuracy versus coverage
Outline 2
Modeling Approach
Predictive Regression model
Region-specific splitting for page and ad
Page “regions”: title, headers, boldfaces, metadata, etc.
Ad “regions”: title, body, etc.
Features: words, phrases, classes in different regions
Word matches in the title are more important than in the body
Illustration: word features; title regions
Extension to multiple regions and multiple feature types is routine
Experiments to appear in a future version
Logistic Regression: Word features
Model clicks/non-clicks: Logistic Regression
Training & test data: events with clicks only
$y_{ij} \sim \mathrm{Ber}(p_{ij})$
Terms of the model for the CTR $p_{ij}$ (annotations on the slide's equation):
Main effect for page (overall popularity)
Main effect for ad (overall popularity)
Interaction effect (words shared by page and ad)
Model parameters
Gaussian priors on model parameters: penalize sparse features
Feature weights “correct” relevance
$M_{p,w} = \mathrm{tf}_{p,w}\,\mathbf{1}(w \in p)$
$M_{a,w} = \mathrm{tf}_{a,w}\,\mathbf{1}(w \in a)$
$I_{p,a,w} = \mathrm{tf}_{p,w} \cdot \mathrm{tf}_{a,w}\,\mathbf{1}(w \in p)\,\mathbf{1}(w \in a)$
So IR-based term frequency measures are taken into account
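The slide's equation did not survive in this transcript, so the following is only a sketch of how the three annotated terms could combine. It assumes the linear form $\mathrm{logit}(p_{ij}) = \mu + \sum_w \alpha_w M_{p,w} + \sum_w \beta_w M_{a,w} + \sum_w \delta_w I_{p,a,w}$. The coefficient names `mu`, `alpha`, `beta`, `delta` and the function name `ctr_estimate` are assumptions for illustration; only the features M and I come from the slide.

```python
import math

def ctr_estimate(page_tf, ad_tf, mu, alpha, beta, delta):
    """Estimated click probability from word features.

    page_tf, ad_tf: {word: term frequency} for the page and the ad.
    mu: intercept; alpha, beta, delta: {word: coefficient} (assumed names).
    """
    logit = mu
    for w, tf in page_tf.items():
        logit += alpha.get(w, 0.0) * tf                     # page main effect, M_{p,w}
    for w, tf in ad_tf.items():
        logit += beta.get(w, 0.0) * tf                      # ad main effect, M_{a,w}
    for w in page_tf.keys() & ad_tf.keys():
        logit += delta.get(w, 0.0) * page_tf[w] * ad_tf[w]  # interaction, I_{p,a,w}
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients for a single selected word.
alpha, beta, delta = {"shoe": 0.1}, {"shoe": 0.2}, {"shoe": 0.5}
page = {"shoe": 2, "run": 1}
p_match = ctr_estimate(page, {"shoe": 1}, -4.0, alpha, beta, delta)
p_no_match = ctr_estimate(page, {"golf": 1}, -4.0, alpha, beta, delta)
```

Words outside the selected vocabulary simply contribute nothing, which is why a fallback to the semantic score is needed when few selected words appear.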
How to select words?
Word selection
Overall, nearly 110k words in our training data
Stop word removal, stemming
Learning parameters for each word would be expensive and would overfit
We use simple feature selection strategies: select top-k
Word Selection: data based
Define an interaction measure for each word
Higher values for words which have higher-than-expected CTR when they occur on both page and ad
Remove words served or clicked only a few times, for robustness
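The slide does not give the formula for the interaction measure. One plausible form, shown here purely as an assumption, is the CTR when the word is on both page and ad divided by the product of the CTRs when it is on the page and when it is on the ad. The thresholds `min_views` and `min_clicks` and the layout of `stats` are hypothetical as well.

```python
def interaction_scores(stats, min_views=100, min_clicks=5):
    """Score words by observed vs. expected CTR (assumed ratio form).

    stats: {word: {"both": (clicks, views), "page": (clicks, views),
                   "ad": (clicks, views)}}
    Words served or clicked too rarely are dropped for robustness.
    """
    scores = {}
    for word, s in stats.items():
        clicks_both, views_both = s["both"]
        if views_both < min_views or clicks_both < min_clicks:
            continue
        clicks_page, views_page = s["page"]
        clicks_ad, views_ad = s["ad"]
        if min(clicks_page, views_page, clicks_ad, views_ad) <= 0:
            continue  # avoid division by zero on degenerate counts
        ctr_both = clicks_both / views_both
        expected = (clicks_page / views_page) * (clicks_ad / views_ad)
        scores[word] = ctr_both / expected
    return scores

def top_k(scores, k):
    """Return the k highest-scoring words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy counts, invented for illustration only.
stats = {
    "rolex": {"both": (30, 1000), "page": (50, 10000), "ad": (40, 10000)},
    "free": {"both": (20, 2000), "page": (500, 10000), "ad": (400, 10000)},
    "rare": {"both": (1, 10), "page": (2, 50), "ad": (1, 40)},
}
scores = interaction_scores(stats)
selected = top_k(scores, 1)
```

Here "rare" is filtered out by the count thresholds, and "rolex" ranks first because its joint CTR is far above what its separate page and ad CTRs would suggest.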
Word selection (contd.)
Word selection: relevance based
Average tf-idf score of each word: pages and ads
Higher values imply higher relevance
Ranked by geometric mean: tf-idf on page and ad
Ranked by tf-idf on page and ad; take the union
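The two relevance-based rankings can be sketched as follows, assuming the average tf-idf of each word has already been computed separately over pages and over ads. The function names and the toy averages are hypothetical.

```python
import math

def rank_by_geometric_mean(avg_page, avg_ad, k):
    """Top-k words by the geometric mean of page and ad average tf-idf."""
    words = avg_page.keys() & avg_ad.keys()
    score = {w: math.sqrt(avg_page[w] * avg_ad[w]) for w in words}
    return sorted(score, key=score.get, reverse=True)[:k]

def rank_by_union(avg_page, avg_ad, k):
    """Union of the top-k words on pages and the top-k words on ads."""
    top_page = sorted(avg_page, key=avg_page.get, reverse=True)[:k]
    top_ad = sorted(avg_ad, key=avg_ad.get, reverse=True)[:k]
    return set(top_page) | set(top_ad)

# Toy averages, invented for illustration only.
avg_page = {"golf": 4.0, "shoe": 1.0, "rolex": 0.5}
avg_ad = {"golf": 1.0, "shoe": 3.0, "rolex": 4.5}
by_gm = rank_by_geometric_mean(avg_page, avg_ad, 2)
by_union = rank_by_union(avg_page, avg_ad, 1)
```

The geometric mean favors words that are reasonably strong on both sides, while the union keeps words that are strong on either side alone, so the two schemes can select different vocabularies.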
Best Word Selection scheme
Word selection: two methods
Data based
Relevance based
We picked the top 1000 words by each measure
Data-based methods give better results
[Figure: precision versus recall for the word selection methods]
Semantic similarity score
Word features have low coverage; fallback mechanism to semantic similarity
Map cosine on logit scale?
Create score bins: 100 points per bin
Mean score vs logit(CTR): quadratic relationship
[Figure: logit(p_ij) versus cosine score]
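The binning step described above could look roughly like this. It is a sketch under assumptions: events are (cosine score, clicked) pairs, bins hold a fixed number of points after sorting by score, and the +0.5 smoothing in the CTR is added here only to avoid taking the log of zero. The quadratic is fitted by ordinary least squares; the slide does not say how the curve was fitted.

```python
import math

def bin_logit_ctr(events, bin_size=100):
    """Sort events by score, cut into equal-size bins, and return
    [(mean score, logit of smoothed CTR)] with one entry per bin."""
    events = sorted(events)
    points = []
    for start in range(0, len(events) - bin_size + 1, bin_size):
        chunk = events[start:start + bin_size]
        mean_score = sum(score for score, _ in chunk) / bin_size
        clicks = sum(clicked for _, clicked in chunk)
        ctr = (clicks + 0.5) / (bin_size + 1.0)  # smoothing is an assumption
        points.append((mean_score, math.log(ctr / (1.0 - ctr))))
    return points

def fit_quadratic(points):
    """Least-squares fit of y = c0 + c1*x + c2*x^2; returns [c0, c1, c2]."""
    # Augmented normal equations [A^T A | A^T y] for the basis (1, x, x^2).
    m = [[0.0] * 4 for _ in range(3)]
    for x, y in points:
        basis = (1.0, x, x * x)
        for i in range(3):
            for j in range(3):
                m[i][j] += basis[i] * basis[j]
            m[i][3] += basis[i] * y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

# Toy events: low-score impressions click 10% of the time, high-score 50%.
events = [(0.1, 1)] * 10 + [(0.1, 0)] * 90 + [(0.9, 1)] * 50 + [(0.9, 0)] * 50
binned = bin_logit_ctr(events)
```

At least three bins with distinct mean scores are needed before `fit_quadratic` has a unique solution.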
Incorporating similarity
Quadratic relationship used in two ways:
Put in cosine and cosine² as features
Add as offset: Prior log-odds
Similar Results
Scalable Training
Fast Implementation
Training: Hadoop implementation of Logistic Regression
[Diagram: data → random data splits → iterative Newton-Raphson on each split → mean and variance estimates → combine estimates → learned model params]
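The slides do not say how the per-split estimates are combined. A common choice, assumed here only for illustration, is an inverse-variance weighted average of the per-split means for each parameter. The function name `combine_estimates` is hypothetical.

```python
def combine_estimates(split_estimates):
    """Combine per-split (mean, variance) estimates of one parameter.

    Uses inverse-variance weighting (an assumption, not taken from the
    slides): more certain splits get more weight. Returns the combined
    mean and its variance. Variances must be positive.
    """
    weights = [1.0 / variance for _, variance in split_estimates]
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, split_estimates)) / total
    return mean, 1.0 / total

# Two splits with equal variance average evenly.
equal = combine_estimates([(1.0, 1.0), (3.0, 1.0)])
# A noisier split (variance 3.0) pulls the result less.
unequal = combine_estimates([(0.0, 1.0), (4.0, 3.0)])
```

Because each split is fitted independently, this combination step is a natural reduce stage after the per-split fits run in parallel.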
Outline 3
Fast Evaluation at Serve Time
Efficient Score Evaluation
Problem: for a page visit, select the top-n ads using the scoring formula
Why hard: only a few ms; too many ads to evaluate
Rich literature in IR to solve this problem
Efficient solutions for vector space models through “posting lists”: <term, sorted list of doc IDs containing the term>
Interaction terms in the regression model are motivated by this
Document at a time (DAAT) strategy
Posting lists: sorted doc IDs for each query term
Evaluates each doc containing at least one query term, one at a time
Stop early if it is clear the doc can’t make it into the top n
System sparse, few correlations; efficiency through approximations
Efficient evaluation through two-stage procedure (Broder et al.)
[Figure: posting lists of doc IDs labeled x1 to x4, with upper bounds U1 to U4; a heap holds the current top-n, with θ = min-score; CurrDoc = 1]
Approximate: x1*U1 + x2*U2 + x3*U3 + x4*U4 > θ
WAND iterator traverses posting lists very efficiently by skipping unnecessary docs
Efficiency depends on upper bounds for terms
Figure annotation: U1 + U2 + U3 > θ; U1 + U2 <= θ
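The two-stage idea (a cheap upper-bound check against θ, then full scoring only for survivors) can be sketched as below. This is a simplified document-at-a-time loop, not the WAND iterator of Broder et al.: it visits every candidate ad instead of skipping ahead in the posting lists via a pivot. All names and data are hypothetical, and weights are assumed non-negative so that the bound is valid.

```python
import heapq

def top_n_ads(query_weights, postings, upper_bounds, n):
    """Select the n best ads for a page.

    query_weights: {term: weight of the term on the page}, non-negative
    postings:      {term: {ad_id: weight of the term in the ad}}
    upper_bounds:  {term: U_t}, with U_t >= every ad weight in postings[term]
    Returns (top-n list of (score, ad_id), best first; full evaluations done).
    """
    terms = [t for t in query_weights if t in postings]
    # Candidates: ads containing at least one query term, in doc-ID order.
    candidates = sorted({ad for t in terms for ad in postings[t]})
    heap = []  # min-heap of (score, ad_id); heap[0][0] plays the role of theta
    full_evaluations = 0
    for ad in candidates:
        present = [t for t in terms if ad in postings[t]]
        theta = heap[0][0] if len(heap) == n else float("-inf")
        # Stage 1: approximate score from per-term upper bounds.
        bound = sum(query_weights[t] * upper_bounds[t] for t in present)
        if bound <= theta:
            continue  # cannot enter the top n; skip full scoring
        # Stage 2: exact score.
        full_evaluations += 1
        score = sum(query_weights[t] * postings[t][ad] for t in present)
        if len(heap) < n:
            heapq.heappush(heap, (score, ad))
        elif score > theta:
            heapq.heapreplace(heap, (score, ad))
    return sorted(heap, reverse=True), full_evaluations

# Toy index, invented for illustration only.
query = {"shoe": 2.0, "run": 1.0}
postings = {"shoe": {1: 0.9, 3: 0.2, 7: 0.5}, "run": {3: 0.4, 5: 0.3},
            "golf": {2: 1.0}}
upper = {"shoe": 0.9, "run": 0.4, "golf": 1.0}
top, evaluated = top_n_ads(query, postings, upper, n=1)
```

In this toy run four ads contain a query term, but only two are fully scored; the other two are rejected by the upper-bound check alone. Tighter upper bounds reject more candidates, which matches the point that efficiency depends on the bounds.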
Efficiency of procedure
Efficiency through document skipping
Must be able to compute upper bounds quickly
Match scoring formula should not use arbitrary features (“word X in query AND word Y in ad”)
Such pairwise (“cross-product”) checks may get costly
Large posting lists; too many evaluations
Upper bounds are crucial to performance
Large → false positives; small → false negatives
We are using upper bounds recommended in the literature
A more efficient implementation is the subject of future research
System Architecture: scoring at serve time
Fast Implementation Testing
Main effect for ads is used in ordering of ads in postings list (static)
Interaction effect is used to modify the idf-table of words (static)
Main effect for pages does not play a role in ad serving (page is given)
Building postings lists
Outline 4
Experiments and Results, Summary and Ongoing Work
Experiments
[Figure: precision versus recall]
25% lift in precision at 10% recall
Low recall region
Computed precision-recall for several splits
Results are statistically significant
Experiments
Increasing the number of words from 1000 to 3400 led to only marginal improvement
Diminishing returns
System already performs close to its limit, without needing more training
Changing the training time period changes the word list; we update our posting lists periodically
Summary
Matching ads to pages is a challenging problem
We provide an approach that blends semantic similarity and predictive models in a scalable fashion
Our approach is index friendly
Experimental results on a large-scale system show significant improvement
We can only improve the relevance models
Ongoing Work
Change in training data changes word set
Working on more robust word feature selection
Clustering words
Efficient indexing strategies through better upper bound estimates for WAND
Expanding feature sets to include neighborhoods of words in posting lists
Balance between accuracy and WAND efficiency
Isotonic regression on cosine similarity