
Contextual Advertising by Combining Relevance with Click Feedback

Deepak Agarwal

Joint work with Deepayan Chakrabarti & Vanja Josifovski

Yahoo! Research

WWW’08, Beijing, China

24th April, 2008

Outline

Motivating Application, Challenges Contextual Advertising

Semantic versus Predictive models Pros, Cons

Our Approach: Blend Semantic with Predictive

Model Description Logistic Regression, Feature Selection Model structure amenable to fast scoring at run time

Experimental Results Ongoing work

Outline 1

Motivating Application, Background and Challenges

Motivating Application Problem: Match ads to queries

Sponsored Search: The query is a short piece of text input by the user User intent better expressed; less noisy

Contextual Advertising: The query is a webpage Generally long, noisy, user intent less clear Harder matching problem

Challenges

Serve ads to maximize revenue (CTR) Serve most relevant ads in a given context

User Feedback in the form of Clicks in different context

Automation a must for profitability Billions of opportunities; millions of ads

High volume, low marginal cost →lucrative business Automation through Algorithms/Models

Accuracy: Massive data; scalable procedures Structure of Models: Scoring ads under strict latency requirements (~few ms)

Classical Approach: Semantic Serve Shoe ads on Shoe pages Models: Information Retrieval

Get relevant docs (ads) for a query (webpage) Simple vector space model

$q = (t_1, w_1; \ldots; t_n, w_n)$; $a = (a_1, v_1; \ldots; a_m, v_m)$

$\mathrm{Cos}(q, a) = \sum_{s \in q \cap a} w_s a_s / (|q|\,|a|)$; w's, a's: tf-idf;

Frequency: reward in doc; penalize in corpus Higher score →More relevance
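A minimal sketch of the vector space scoring above, assuming sparse vectors stored as dictionaries and the common idf variant log(N/df). The function names, the idf variant, and the data layout are illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter

def tf_idf_vector(tokens, doc_freq, n_docs):
    """Map a token list to {term: tf * idf}, with idf = log(n_docs / df).

    Terms missing from doc_freq are dropped (assumed out of vocabulary).
    """
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / doc_freq[t])
            for t, c in tf.items() if doc_freq.get(t)}

def cosine(q, a):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    # Only terms shared by the page (query) and the ad contribute to the dot product.
    dot = sum(w * a[t] for t, w in q.items() if t in a)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    if norm_q == 0.0 or norm_a == 0.0:
        return 0.0
    return dot / (norm_q * norm_a)
```

Terms that are frequent in a document but rare in the corpus get the largest weights, which is the "reward in doc; penalize in corpus" behaviour described above.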

Semantic: Pros & Cons

Pros

Training: simple, scalable

Vocabulary (stop-words; stemming); Corpus

Serving with low latency: evaluates millions of candidate ads in a few ms Clever algorithms (Broder et al.)

Cons Does not always capture context

Clicks? Better?

Active user feedback Can we use it ?

Predictive Approach: Clicks New challenging research area Learn from historic clicks on ads

Indicator of overall relevance Rank ads by CTR = P(Click|Ad,context)

Estimating CTR difficult statistical problem High-dim, sparseness (too many combinations) (Page,Ad)→(Page Features, Ad Features)

Bias-Variance Tradeoff when selecting features Coarse is stable but less precise; fine has high variance

Statistical Challenges (contd.) Retrospective data biased

I never showed ads with word “Rolex” on pages with word “Golf”, how will I learn this match?

What is irrelevant? Labeling negatives. I never click on ads no matter what

Good models may be complex Scalability while training (Grid computing helps) Serving: All models are not index friendly

Quick evaluation during serve time improves system

When Semantic meets Predictive Semantic provides domain knowledge

Feature selection driven by semantic knowledge Predictive “enhances” semantic

“correction” terms to semantic to match click feedback fallback on semantic when signal weak

Model scalable (Grid computing) Fast to evaluate at run time

Faster→More candidates evaluated at serve time Accuracy versus Coverage

Outline 2

Modeling Approach

Predictive Regression model Region specific splitting for page and ad

Page “regions”: Title, headers, boldfaces, metadata, etc.

Ad “regions”: title, body, etc

Features: words, phrases, classes in different regions. Word matches in title more important than in the body

Illustration: word features; title regions Extension to multiple regions, multiple feature types routine

Experiments to appear in a future version

Logistic Regression: Word features Model clicks/non-clicks: Logistic Regression

Training & test data: events with clicks only

$y_{ij} \sim \mathrm{Ber}(p_{ij})$

[Model equation annotations: CTR; main effect for page (overall popularity); main effect for ad (overall popularity); interaction effect (words shared by page and ad); model parameters]

Gaussian priors on model parameters: penalizes sparse features
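The model equation itself did not survive in this transcript; only its annotations did. One plausible additive form on the logit scale that matches those annotations is sketched below. The symbols $\mu$, $\alpha_w$, $\beta_w$, $\delta_w$ and $\sigma^2$ are assumed names for illustration, and $M$ and $I$ are the word features the deck defines.

```latex
\begin{align*}
y_{ij} &\sim \mathrm{Ber}(p_{ij}) \\
\operatorname{logit}(p_{ij}) &= \mu
  + \underbrace{\sum_{w} \alpha_w M_{p_i,w}}_{\text{main effect for page}}
  + \underbrace{\sum_{w} \beta_w M_{a_j,w}}_{\text{main effect for ad}}
  + \underbrace{\sum_{w} \delta_w I_{p_i,a_j,w}}_{\text{interaction effect}} \\
\alpha_w,\ \beta_w,\ \delta_w &\sim N(0, \sigma^2)
\end{align*}
```

Under zero-mean Gaussian priors, weights are shrunk toward zero, and the shrinkage is strongest for words with little supporting data.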

Feature weights “correct” relevance

$M_{p,w} = tf_{p,w}\,1(w \in p)$; $M_{a,w} = tf_{a,w}\,1(w \in a)$; $I_{p,a,w} = tf_{p,w} \cdot tf_{a,w}\,1(w \in p)\,1(w \in a)$

So, IR-based term frequency measures are taken into account
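A small sketch of how the three feature families could be computed and combined into a CTR prediction. It assumes an additive logit model with an intercept `mu` and per-word weight dictionaries `alpha` (page), `beta` (ad) and `delta` (interaction). These names and the additive form are assumptions for illustration, not confirmed by the slides.

```python
import math

def word_features(page_tf, ad_tf, vocab):
    """Return (M_p, M_a, I) restricted to the selected words in vocab.

    page_tf / ad_tf map word -> term frequency; absent words have tf 0,
    which plays the role of the indicator 1(w in p) or 1(w in a).
    """
    m_p = {w: page_tf[w] for w in vocab if w in page_tf}
    m_a = {w: ad_tf[w] for w in vocab if w in ad_tf}
    inter = {w: page_tf[w] * ad_tf[w]
             for w in vocab if w in page_tf and w in ad_tf}
    return m_p, m_a, inter

def predict_ctr(page_tf, ad_tf, mu, alpha, beta, delta):
    """P(click | page, ad) under the assumed additive logit model."""
    vocab = set(alpha) | set(beta) | set(delta)
    m_p, m_a, inter = word_features(page_tf, ad_tf, vocab)
    eta = mu
    eta += sum(alpha.get(w, 0.0) * x for w, x in m_p.items())
    eta += sum(beta.get(w, 0.0) * x for w, x in m_a.items())
    eta += sum(delta.get(w, 0.0) * x for w, x in inter.items())
    return 1.0 / (1.0 + math.exp(-eta))
```

A word contributes to the interaction term only when it appears on both the page and the ad, so an ad with no shared words is scored by the main effects alone.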

How to select words? Word selection

Overall, nearly 110k words in our training data Stop word removal, stemming

Learning parameters for each word would be: Expensive, overfits

We use simple feature selection strategies Select top-k

Word Selection: data based Define an interaction measure for each word

Higher values for words which have higher-than-expected CTR when they occur on both page and ad

Remove words served or clicked few times for robustness
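The slides do not give the formula for the interaction measure. As a stand-in, the sketch below scores each word by the CTR of impressions where it occurs on both page and ad, divided by the overall CTR, and drops words with too few views or clicks. The ratio, the thresholds `min_views` and `min_clicks`, and the event layout are all assumptions.

```python
from collections import defaultdict

def interaction_scores(events, min_views=2, min_clicks=1):
    """events: iterable of (page_words, ad_words, clicked) with clicked in {0, 1}.

    Returns {word: CTR when word is on both page and ad / overall CTR}.
    """
    views = defaultdict(int)
    clicks = defaultdict(int)
    total_views = 0
    total_clicks = 0
    for page_words, ad_words, clicked in events:
        total_views += 1
        total_clicks += clicked
        for w in set(page_words) & set(ad_words):
            views[w] += 1
            clicks[w] += clicked
    if total_views == 0 or total_clicks == 0:
        return {}
    base_ctr = total_clicks / total_views
    # Robustness filter: ignore words served or clicked too few times.
    return {w: (clicks[w] / views[w]) / base_ctr
            for w in views
            if views[w] >= min_views and clicks[w] >= min_clicks}

def top_k_words(scores, k):
    """Select the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

A score above 1 means the word co-occurring on page and ad is associated with a higher-than-baseline CTR in this simplified measure.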

Word selection (contd.)

Word selection: relevance based

Average tfidf score of each word : pages and ads Higher values imply higher relevance

Ranked by geometric mean: tfidf on page and ad

Ranked by tfidf on page and ad; take the union
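The two relevance-based rankings can be sketched as follows, assuming the average tf-idf of each word over pages and over ads has already been computed into two dictionaries. The function and argument names are illustrative.

```python
import math

def rank_by_geometric_mean(page_tfidf, ad_tfidf, k):
    """Top-k words by sqrt(avg tf-idf on pages * avg tf-idf on ads).

    Only words seen on both sides get a score; a word missing from one
    side would have geometric mean zero anyway.
    """
    common = set(page_tfidf) & set(ad_tfidf)
    score = {w: math.sqrt(page_tfidf[w] * ad_tfidf[w]) for w in common}
    return sorted(score, key=lambda w: (-score[w], w))[:k]

def union_of_top_k(page_tfidf, ad_tfidf, k):
    """Rank separately on pages and on ads, then take the union."""
    def top(d):
        return sorted(d, key=lambda w: (-d[w], w))[:k]
    return set(top(page_tfidf)) | set(top(ad_tfidf))
```

The geometric mean favours words that are relevant on both sides, whereas the union also keeps words that rank highly on only one side.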

Best Word Selection scheme Word selection Two methods

Data based Relevance based

We picked the top 1000 words by each measure

Data-based methods give better results

[Figure: precision vs. recall for the word selection schemes]

Semantic similarity score Word features have low coverage; fallback mechanism to semantic similarity Map cosine on logit scale? Create score bins 100 points per bin Mean score vs logit(CTR) Quadratic relationship

[Figure: logit(p_ij) plotted against cosine score]
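The binning step can be sketched as below: sort events by cosine score, cut them into equal-sized bins, and compute the mean score and the empirical logit of CTR per bin. The 0.5 smoothing term is an assumption added here so that bins with zero or all clicks stay finite; the slides do not say how such bins were handled.

```python
import math

def binned_logit_ctr(records, bin_size=100):
    """records: iterable of (cosine_score, clicked) with clicked in {0, 1}.

    Returns one (mean_score, logit_ctr) pair per full bin of bin_size points.
    """
    ordered = sorted(records)
    bins = []
    for start in range(0, len(ordered) - bin_size + 1, bin_size):
        chunk = ordered[start:start + bin_size]
        mean_score = sum(s for s, _ in chunk) / bin_size
        clicks = sum(c for _, c in chunk)
        # Smoothed CTR estimate (assumption) to avoid log(0).
        p = (clicks + 0.5) / (bin_size + 1.0)
        bins.append((mean_score, math.log(p / (1.0 - p))))
    return bins
```

Plotting the returned pairs gives the "mean score vs logit(CTR)" curve, to which a quadratic can then be fitted.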

Incorporating similarity Quadratic relationship used in two ways

Put in cosine and cosine2 as features

Add as offset: Prior log-odds

Similar Results
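The two options can be written side by side. Here $c_{ij}$ denotes the cosine score of page $i$ and ad $j$, and $\eta_{ij}$ stands for the word-feature part of the model. The coefficient names $\gamma_1, \gamma_2, b_0, b_1, b_2$ are assumed for illustration.

```latex
\begin{align*}
\text{(1) as features:}\quad
  \operatorname{logit}(p_{ij}) &= \eta_{ij} + \gamma_1 c_{ij} + \gamma_2 c_{ij}^2
  && (\gamma_1, \gamma_2 \text{ learned jointly with the word weights}) \\
\text{(2) as offset:}\quad
  \operatorname{logit}(p_{ij}) &= \underbrace{b_0 + b_1 c_{ij} + b_2 c_{ij}^2}_{\text{prior log-odds, held fixed}} + \eta_{ij}
  && (b_0, b_1, b_2 \text{ taken from the binned quadratic fit})
\end{align*}
```

In both cases, when no selected word matches and $\eta_{ij}$ carries little signal, the prediction is driven by the cosine terms, which is the fallback behaviour described earlier.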

Scalable Training

Fast Implementation Training: Hadoop implementation of Logistic Regression

[Diagram: Data → random data splits → iterative Newton-Raphson → mean and variance estimates → combine estimates → learned model params]
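The slides do not say how the per-split estimates are combined. One common choice, shown here purely as an assumption, is inverse-variance weighting of the mean and variance estimates that each split produces for a parameter. The per-split Newton-Raphson fit is omitted.

```python
def combine_estimates(split_estimates):
    """Combine per-split (mean, variance) estimates of one parameter.

    Uses inverse-variance weighting (an assumed scheme): splits with a
    tighter estimate get more weight in the combined mean.
    """
    weights = [1.0 / var for _, var in split_estimates]
    total = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, split_estimates)) / total
    return mean, 1.0 / total

def combine_model(per_split_params):
    """per_split_params: list of {param_name: (mean, variance)}, one per split."""
    names = set().union(*per_split_params)
    return {name: combine_estimates([p[name] for p in per_split_params if name in p])
            for name in names}
```

Because each split is fitted independently, the fits map naturally onto parallel tasks and the combination step is a cheap reduction.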

Outline 3

Fast Evaluation at Serve Time

Efficient Score Evaluation

Problem: For a page visit; select top-n ads using scoring formula Why hard: Only a few ms; too many ads to evaluate Rich literature in IR to solve this problem

Efficient solutions for vector space models through “posting lists” <term, sorted list of doc IDs containing the term>

Interaction terms in regression model motivated by this

Document at a time (DAAT) strategy Posting lists: sorted doc IDs for each query term Evaluates each doc containing at least one query term one at a time; stop prematurely if clear doc can’t make it to top n System sparse, few correlations; efficiency through approximations
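A minimal sketch of document-at-a-time evaluation over posting lists, keeping the top-n scores in a min-heap. It scores every candidate exhaustively and leaves out the early-termination step; the data layout (`postings` as term to sorted `(doc_id, weight)` pairs) is an assumption for illustration.

```python
import heapq

def daat_top_n(postings, query_weights, n):
    """Return the n best (score, doc_id) pairs, highest score first.

    postings: {term: [(doc_id, weight), ...]} with doc_ids ascending.
    query_weights: {term: weight} for the terms of the query (page).
    """
    # One cursor per query term, ordered by the doc ID it points at.
    cursors = [(plist[0][0], term, 0)
               for term, plist in postings.items()
               if term in query_weights and plist]
    heapq.heapify(cursors)
    top = []  # min-heap; top[0] holds the current minimum score
    while cursors:
        doc_id = cursors[0][0]
        score = 0.0
        # Consume every posting that refers to the current document.
        while cursors and cursors[0][0] == doc_id:
            _, term, pos = heapq.heappop(cursors)
            score += query_weights[term] * postings[term][pos][1]
            if pos + 1 < len(postings[term]):
                heapq.heappush(cursors, (postings[term][pos + 1][0], term, pos + 1))
        if len(top) < n:
            heapq.heappush(top, (score, doc_id))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, doc_id))
    return sorted(top, reverse=True)
```

The minimum score in the heap is the threshold a new document must beat, which is the quantity that pruning strategies compare against.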

Efficient evaluation through two-stage procedure (Broder et al.)

[Figure: posting lists of doc IDs for terms x1 to x4 with term upper bounds U1 to U4; a heap holds the top-n, with θ = min-score]

Approximate: x1*U1+x2*U2+x3*U3+x4*U4 > θ

WAND Iterator traverses posting list very efficiently by skipping unnecessary docs Efficiency depends on upper bounds for terms

[Figure example, CurrDoc=1: U1 + U2 <= θ; U1+U2+U3 > θ]
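The skipping idea can be illustrated with the pivot-selection step of a WAND-style iterator: order the terms by the doc ID their cursor points at, accumulate upper bounds, and stop at the first term where the running sum exceeds θ. This is a simplified sketch of that single step, with hypothetical cursor positions and bounds, not a full iterator.

```python
def find_pivot(cursors, upper_bounds, theta):
    """Return the pivot doc ID, or None if no document can beat theta.

    cursors: {term: doc_id the term's posting-list cursor points at}.
    upper_bounds: {term: maximum contribution of that term to any score}.
    Documents with IDs below the pivot cannot score above theta, because
    the terms that could match them have upper bounds summing to <= theta.
    """
    accumulated = 0.0
    for term, doc_id in sorted(cursors.items(), key=lambda kv: kv[1]):
        accumulated += upper_bounds[term]
        if accumulated > theta:
            return doc_id
    return None
```

For example, with cursors at doc IDs 1, 3, 7 and 32 and bounds such that U1 + U2 <= θ but U1 + U2 + U3 > θ, the pivot is the doc ID under the third cursor, and the first two cursors can jump straight to it.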

Efficiency of procedure Efficiency through document skipping Must be able to compute upper bounds quickly

Match scoring formula should not use arbitrary features (“word X in query AND word Y in ad”)

Such pairwise (“cross-product”) checks may get costly Large posting lists; too many evaluations

Upper bounds crucial to performance Large→False +ve’s; Small→False –ve’s We are using upper bounds recommended in literature

More efficient implementation subject of future research

System Architecture: scoring at serve time

Fast Implementation Testing

Main effect for ads is used in ordering of ads in postings list (static)

Interaction effect is used to modify the idf-table of words (static)

Main effect for pages does not play a role in ad serving (page is given)
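A short sketch of why the page main effect can be dropped at serve time: it adds the same constant to every candidate ad for a given page, so the ranking is unchanged. The names `delta` (per-word interaction weights) and `ad_main_effect`, and the additive score, are assumptions for illustration.

```python
def serve_score(page_tf, ad_tf, ad_main_effect, delta):
    """Ranking score for one ad on a given page (page main effect omitted)."""
    interaction = sum(delta.get(w, 0.0) * tf * ad_tf[w]
                      for w, tf in page_tf.items() if w in ad_tf)
    return ad_main_effect + interaction

def rank_ads(page_tf, ads, delta, page_main_effect=0.0):
    """ads: {ad_id: (ad_tf, ad_main_effect)}. Returns ad IDs, best first.

    page_main_effect is accepted only to show it does not alter the order.
    """
    scored = [(page_main_effect + serve_score(page_tf, ad_tf, main, delta), ad_id)
              for ad_id, (ad_tf, main) in ads.items()]
    return [ad_id for _, ad_id in sorted(scored, reverse=True)]
```

Both remaining terms depend only on the ad or on per-word weights, so they can be precomputed when the index is built rather than at request time.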

Building postings lists

Outline 4

Experiments and Results, Summary and Ongoing Work

Experiments

[Figure: precision vs. recall curves; low recall region highlighted]

25% lift in precision at 10% recall

Computed precision-recall for several splits Results statistically significant
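For reference, precision at a given recall level can be computed from scored test events as sketched below. Treating clicks as positives and ranking by model score is an assumption about the evaluation setup; the function name and data layout are illustrative.

```python
def precision_at_recall(scored, recall_level):
    """scored: list of (score, clicked) with clicked in {0, 1}.

    Walks down the ranking and returns precision at the first cutoff
    whose recall reaches recall_level, or None if there are no clicks.
    """
    ranked = sorted(scored, key=lambda r: -r[0])
    total_clicks = sum(c for _, c in ranked)
    if total_clicks == 0:
        return None
    hits = 0
    for k, (_, clicked) in enumerate(ranked, start=1):
        hits += clicked
        if hits / total_clicks >= recall_level:
            return hits / k
    return None
```

A lift figure then compares this precision for two models at the same recall level, for example the ratio of the two values minus one.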

Experiments

Increasing the number of words from 1000 to 3400 led to only marginal improvement Diminishing returns System already performs close to its limit, without needing more training Changing the training time period changes the word list; we update our posting lists periodically

Summary

Matching ads to pages challenging problem We provide an approach that blends semantic similarity and predictive models in a scalable fashion Our approach index friendly Experimental results on large scale system show significant improvement We can only improve the relevance models

Ongoing Work

Change in training data changes word set Working on more robust word feature selection

Clustering words

Efficient indexing strategies through better upper bound estimates for WAND

Expanding feature sets to include neighborhoods of words in posting lists Balance between accuracy and WAND efficiency

Isotonic regression on cosine similarity
