

Tutorial on Learning to Rank

Ambuj Tewari

Department of Statistics, Department of EECS, University of Michigan

January 13, 2015 / MLSS Austin


Outline

1 Learning to rank as supervised ML
  Features
  Label and Output Spaces
  Performance Measures
  Ranking functions

2 A brief survey of ranking methods
  Reduction approach
  Boosting approach
  Large margin approach
  Optimization approach

3 Theory for learning to rank
  Generalization error bounds
  Consistency

4 Pointers to advanced topics
  Online learning to rank
  Beyond relevance: diversity and freshness
  Large-scale learning to rank


Rankings are everywhere

Web search: PageRank and HITS
Sports: FIFA rankings (int'l soccer), ICC rankings (int'l cricket), FIDE rankings (int'l chess), BCS ratings (college football)
Academia: US News & World Report, Times Higher Education
Democracy and development: voting systems, UN Human Development Index
Recommender systems: movies (Netflix, IMDB), music (Pandora), consumer goods (Amazon)


Learning to rank objects

Creating a ranking requires a lot of effort
Can we have computers learn to rank objects?
Will still need some human supervision
Pose learning to rank as a supervised ML problem


Ranking as a canonical learning problem


Ranking as a canonical learning problem

NIPS Workshops: 2002, 2005, 2009, 2014
SIGIR Workshops: 2007, 2008, 2009
Journal of Information Retrieval Special Issue in 2010
Yahoo! Learning to Rank Challenge: 2010
http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf

LEarning TO Rank (LETOR) and Microsoft Learning to Rank (MSLR) datasets: 2007-2010
http://research.microsoft.com/en-us/um/beijing/projects/letor/


Learning to rank as supervised ML

Supervised Machine Learning: construct a mapping

f : I → O

based on input-output pairs (training examples)

(X_1, Y_1), \ldots, (X_N, Y_N)   (the training set)

where X_n ∈ I, Y_n ∈ O
Want f(X) to be a good predictor of Y for unseen X
In ranking problems, O is all possible rankings of a given set of objects (webpages, movies, etc.)


A learning-to-rank training example

Query: "MLSS Austin"
Candidate webpages and relevance scores:

https://www.cs.utexas.edu/mlss: 2 or "good"
http://mlss.cc/: 2 or "good"
https://mailman.cc.gatech.edu/.../000796.html: 1 or "fair"
www.councilofmls.com: 0 or "bad"

Think: example X = (query, (URL1, URL2, ..., URL4)), and label R = (2, 2, 1, 0)


Typical ML cycle

1 Feature construction: find a way to map (query, webpage) into R^p
    Each example (query, m webpages) gets mapped to X_n ∈ I = R^{m×p}
2 Obtain labels: get R_n ∈ {0, 1, 2, ..., R_max}^m from human judgements
3 Train ranker: get f : I → O = S_m, where S_m is the set of m-permutations
    Training is usually done by solving some optimization problem on the training set
4 Evaluate performance: using some performance measure, evaluate the ranker on a "test set"


Joint query, document features

Number of query terms that occur in document

Document length

Sum/min/max/mean/variance of term frequencies

Sum/min/max/mean/variance of tf-idf

Cosine similarity between query and document

Any model of P(R = 1|Q,D) gives a feature (e.g., BM25)

Number of slashes in the URL, length of the URL, PageRank, site-level PageRank

Query-URL click count, user-URL click count
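As an illustration of how a few such features might be computed, here is a minimal Python sketch; the whitespace tokenization, the add-one-smoothed idf, and the function name query_document_features are assumptions made for this example, not the exact recipe used in LETOR/MSLR.

```python
import math
from collections import Counter

def query_document_features(query, doc, collection):
    """Toy joint query-document features: term-match counts, length, tf and tf-idf stats."""
    q_terms, d_terms = query.lower().split(), doc.lower().split()
    d_counts = Counter(d_terms)
    tfs = [d_counts[t] for t in q_terms]
    # add-one smoothed idf over the document collection
    idfs = [math.log((len(collection) + 1) / (1 + sum(t in d.lower().split() for d in collection)))
            for t in q_terms]
    tfidfs = [tf * idf for tf, idf in zip(tfs, idfs)]
    return {
        "matched_query_terms": sum(tf > 0 for tf in tfs),
        "doc_length": len(d_terms),
        "tf_sum": sum(tfs), "tf_max": max(tfs), "tf_min": min(tfs),
        "tfidf_sum": sum(tfidfs), "tfidf_max": max(tfidfs),
    }

docs = ["machine learning summer school austin", "austin city guide", "learning to rank tutorial"]
print(query_document_features("mlss austin", docs[0], docs))
```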


A slightly more general supervised ML setting

For a future query and webpage list, we want to rank the webpages
Don't necessarily want to predict the relevance scores
Let f (the ranking function) take values in the space of rankings
So now, training examples (X, Y) ∈ I × L, where L = label space
The function f to be learned maps I to O (note L ≠ O)
E.g., (1, 2, 2, 0, 0) ∈ L and (d_1 → 3, d_2 → 1, d_3 → 2, d_4 → 4, d_5 → 5) ∈ O


Notation for rankings

We'll think of m (the number of documents) as fixed, but in reality it varies over examples
We'll denote a ranking using a permutation σ ∈ S_m (the set of all m! permutations)
σ(i) is the rank/position of document i
σ⁻¹(j) is the document at rank/position j


Notation for rankings

Suppose we have 3 documents d_1, d_2, d_3 and we want them to be ranked thus:

d_2
d_3
d_1

So the ranks are d_1 : 3, d_2 : 1, d_3 : 2
This means σ(1) = 3, σ(2) = 1, σ(3) = 2
Also, σ⁻¹(1) = 2, σ⁻¹(2) = 3, σ⁻¹(3) = 1.
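A small Python sketch of this notation, assuming 1-indexed documents and ranks as on the slide:

```python
# sigma[i] is the rank of document i+1 (documents and ranks are 1-indexed on the slide)
sigma = [3, 1, 2]          # d1 -> rank 3, d2 -> rank 1, d3 -> rank 2

def inverse(sigma):
    """sigma_inv[j-1] = the document placed at rank j."""
    sigma_inv = [0] * len(sigma)
    for doc, rank in enumerate(sigma, start=1):
        sigma_inv[rank - 1] = doc
    return sigma_inv

print(inverse(sigma))      # [2, 3, 1]: rank 1 holds d2, rank 2 holds d3, rank 3 holds d1
```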


Notation for relevance scores/labels

Relevance labels can be binary or multi-graded
In the binary case, documents are either relevant or irrelevant (1 or 0)
In the multi-graded case, relevance labels are from {0, 1, 2, ..., R_max}
Typically, R_max ≤ 4 (at most 5 levels of relevance)
A relevance score/label vector R is an m-vector
R(i) gives the relevance of document i


Performance measures in ranking

Performance measures could be gains (to be maximized) or losses (to be minimized)
They take a relevance vector R and a ranking σ as arguments and produce a non-negative gain/loss
Some performance measures are only defined for binary relevance
Others can handle more general multi-graded relevance


Performance measure: RR

Reciprocal rank (RR) is defined for binary relevance only
RR is the reciprocal of the rank of the first relevant document according to the ranking

RR(\sigma, R) = \frac{1}{\min\{\sigma(i) : R(i) = 1\}}

Example: R = (0, 1, 0, 1), σ = (1, 3, 4, 2), then RR = 1/2

original    ranked
d1 : 0      d1 : 0
d2 : 1      d4 : 1   ← RR = 1/2
d3 : 0      d2 : 1
d4 : 1      d3 : 0
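A minimal Python version of RR under these conventions (returning 0 when no document is relevant is an assumption; the slide leaves that case undefined):

```python
def reciprocal_rank(sigma, R):
    """RR = 1 / (rank of the first relevant document); sigma[i] is the rank of document i+1."""
    relevant_ranks = [rank for rank, rel in zip(sigma, R) if rel == 1]
    return 1.0 / min(relevant_ranks) if relevant_ranks else 0.0

print(reciprocal_rank((1, 3, 4, 2), (0, 1, 0, 1)))   # 0.5, as in the example above
```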


Performance measure: ERR

Expected Reciprocal Rank (ERR) generalizes RR to multi-graded relevance

Convert relevance scores into probabilities (e.g., by dividing by R_max)

Imagine a user going down the ranked list and stopping with probability given by the relevance scores

ERR is the expected reciprocal rank of the document at which the user stops

Example: R = (0, 2, 0, 1), σ = (1, 3, 4, 2), then ERR = 1/2 · 1/2 + 1/3 · 1/2 = 5/12

original    ranked    cond. stopping prob.    stopping prob.
d1 : 0      d1 : 0    0                       0
d2 : 2      d4 : 1    1/2                     1 · 1/2
d3 : 0      d2 : 2    2/2                     1 · 1/2 · 2/2
d4 : 1      d3 : 0    0                       0


Performance measure: ERR

ERR(\sigma, R) = \sum_{j=1}^{m} \frac{1}{j} \cdot \underbrace{\left( \prod_{k=1}^{j-1} \big(1 - g(R(\sigma^{-1}(k)))\big) \right)}_{\text{prob. of not stopping earlier}} \cdot \underbrace{g(R(\sigma^{-1}(j)))}_{\text{prob. of stopping at pos. } j}

Recall: R(σ⁻¹(j)) is the relevance of the document at position j
g is any function used to convert relevance scores into stopping probabilities
E.g., g(r) = r / R_max or g(r) = (2^r − 1) / 2^{R_max}
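A direct translation of this formula into Python, assuming sigma[i] is the (1-indexed) rank of document i+1 and that g is supplied by the caller; it reproduces the 5/12 from the worked example above.

```python
def expected_reciprocal_rank(sigma, R, g):
    """ERR for a ranking sigma and relevance vector R, with stopping-probability map g."""
    by_rank = sorted(range(len(R)), key=lambda i: sigma[i])   # document indices, rank 1 to m
    err, p_not_stopped = 0.0, 1.0
    for j, i in enumerate(by_rank, start=1):
        stop = g(R[i])                                        # prob. of stopping at position j
        err += (1.0 / j) * p_not_stopped * stop
        p_not_stopped *= (1.0 - stop)
    return err

R_max = 2
print(expected_reciprocal_rank((1, 3, 4, 2), (0, 2, 0, 1), g=lambda r: r / R_max))  # 5/12 ≈ 0.4167
```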


Performance measure: Precision@K

Precision@K is for binary relevance only
Precision@K is the fraction of relevant documents in the top K

Prec@K(\sigma, R) = \frac{1}{K} \sum_{j=1}^{K} R(\sigma^{-1}(j))

original    ranked    Prec@K
d1 : 0      d1 : 0    0/1
d2 : 1      d4 : 1    1/2
d3 : 0      d2 : 1    2/3
d4 : 1      d3 : 0    2/4
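A one-function Python sketch of Precision@K under the same conventions:

```python
def precision_at_k(sigma, R, k):
    """Fraction of relevant documents among the top k positions."""
    by_rank = sorted(range(len(R)), key=lambda i: sigma[i])
    return sum(R[i] for i in by_rank[:k]) / k

sigma, R = (1, 3, 4, 2), (0, 1, 0, 1)
print([precision_at_k(sigma, R, k) for k in (1, 2, 3, 4)])   # [0.0, 0.5, ~0.667, 0.5]
```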


Performance measure: AP

Average Precision (AP) is also for binary relevance
It is the average of Precision@K over positions of relevant documents only

AP(\sigma, R) = \frac{1}{\sum_{i=1}^{m} R(i)} \sum_{j : R(\sigma^{-1}(j)) = 1} \text{Prec@}j

original    ranked    Prec@K
d1 : 0      d1 : 0    0/1
d2 : 1      d4 : 1    1/2 ←
d3 : 0      d2 : 1    2/3 ←        AP = (1/2 + 2/3) / 2
d4 : 1      d3 : 0    2/4
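A matching Python sketch of AP, averaging Prec@j over the positions that hold relevant documents:

```python
def average_precision(sigma, R):
    """Average of Prec@j over positions j holding relevant documents."""
    by_rank = sorted(range(len(R)), key=lambda i: sigma[i])
    precs, hits = [], 0
    for j, i in enumerate(by_rank, start=1):
        hits += R[i]
        if R[i] == 1:
            precs.append(hits / j)
    return sum(precs) / sum(R)

print(average_precision((1, 3, 4, 2), (0, 1, 0, 1)))   # (1/2 + 2/3) / 2 ≈ 0.583
```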


Performance measure: DCG

Discounted Cumulative Gain (DCG) is for multi-graded relevance
Contributions of relevant documents further down the ranking are discounted more

DCG(\sigma, R) = \sum_{i=1}^{m} \frac{f(R(i))}{g(\sigma(i))}

f, g are increasing functions
Standard choice: f(r) = 2^r − 1 and g(j) = log_2(1 + j)


Performance measure: NDCG

Normalized DCG (NDCG), like DCG, is for multi-graded relevance
As the name suggests, NDCG normalizes DCG to keep the gain bounded by 1

NDCG(\sigma, R) = \frac{DCG(\sigma, R)}{Z(R)}

The normalization constant

Z(R) = \max_{\sigma} DCG(\sigma, R)

can be computed quickly by sorting the relevance vector
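A sketch of DCG and NDCG with the standard f and g, computing Z(R) by sorting the relevance vector as the slide suggests:

```python
import math

def dcg(sigma, R, f=lambda r: 2 ** r - 1, g=lambda j: math.log2(1 + j)):
    """DCG with the standard gain f and discount g; sigma[i] is the rank of document i+1."""
    return sum(f(R[i]) / g(sigma[i]) for i in range(len(R)))

def ndcg(sigma, R):
    """Normalize by the DCG of the best ranking, obtained by sorting documents by relevance."""
    best_sigma = [0] * len(R)
    for rank, i in enumerate(sorted(range(len(R)), key=lambda i: -R[i]), start=1):
        best_sigma[i] = rank
    z = dcg(best_sigma, R)
    return dcg(sigma, R) / z if z > 0 else 0.0

sigma, R = (1, 3, 4, 2), (0, 2, 0, 1)
print(dcg(sigma, R), ndcg(sigma, R))
```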


Ranking functions

A ranking function maps the m query-document p-dimensional feature vectors into a permutation

X = \begin{pmatrix} \phi(q, d_1)^\top \\ \vdots \\ \phi(q, d_m)^\top \end{pmatrix}   (an m×p matrix)   ⟶   d_5 ⇔ σ(5) = 1, ..., d_12 ⇔ σ(12) = m

So mathematically it is a map from R^{m×p} to S_m


Ranking functions from scoring functions

However, it is hard to learn functions that map into combinatorial spaces
In binary classification, we often learn a real-valued function and threshold it to output a class label (positive/negative)
In ranking, we do the same but replace thresholding by sorting
For example, let f : R^p → R be a scoring function

(φ(q, d_1)^⊤, φ(q, d_2)^⊤, φ(q, d_3)^⊤)   ⟹ (apply f) ⟹   (−2.3, 4.2, 1.9)   ⟹ (sort) ⟹   d_2 ⇔ σ(2) = 1, d_3 ⇔ σ(3) = 2, d_1 ⇔ σ(1) = 3
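A small NumPy sketch of ranking-by-sorting: argsort of the negated scores lists documents from best to worst, and scattering ranks back gives σ.

```python
import numpy as np

def rank_by_score(scores):
    """Turn real-valued scores into a ranking: sigma[i] = rank of document i+1 (1 = best)."""
    order = np.argsort(-np.asarray(scores))     # document indices from best to worst
    sigma = np.empty(len(scores), dtype=int)
    sigma[order] = np.arange(1, len(scores) + 1)
    return sigma

print(rank_by_score([-2.3, 4.2, 1.9]))          # [3 1 2]: d2 first, d3 second, d1 last
```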


Recap

Training data consists of query-document features and relevance scores

X = \begin{pmatrix} \phi(q, d_1)^\top \\ \vdots \\ \phi(q, d_m)^\top \end{pmatrix}   (an m×p matrix),   R = \begin{pmatrix} R(1) \\ \vdots \\ R(m) \end{pmatrix}

We will use the training data to learn a scoring function f that will be used to rank documents for a new query:

\begin{pmatrix} s_1 \\ \vdots \\ s_m \end{pmatrix} = \begin{pmatrix} f(\phi(q^{new}, d_1^{new})) \\ \vdots \\ f(\phi(q^{new}, d_m^{new})) \end{pmatrix}

The scores can be sorted to get a ranking whose performance will be evaluated using a measure such as ERR, AP, NDCG

A brief survey of ranking methods

Many nice, innovative methods have been developed
A partial list:

AdaRank, BoltzRank, Cranking, FRank, Infinite push,
LambdaMART, LambdaRank, ListMLE, ListNET, McRank,
MPRank, p-norm push, OWPC, PermuRank, PRank,
RankBoost, Ranking SVM, RankNet, SmoothRank, SoftRank,
SVM-Combo, SVM-MAP, SVM-MRR, SVM-NDCG, SVRank


The landscape of ideas

Reduction: reduce ranking to a simpler problem (e.g., classification or regression)
Boosting: combine weak rankers and boost them into strong ones
Large margin approach: extend the "large-margin" story behind SVMs to the ranking case
Direct/indirect optimization: choose an appropriate optimization problem based on the performance measure one wishes to optimize

Not all ranking methods neatly fall into one of these categories!


Regression

Perhaps the simplest approach: regress the relevance score on query-document features
Feed examples of the form (φ(q, d_i), R(i)) to your favorite regression algorithm
Advantage: can use already-built regression tools
Disadvantage: there's no real reason for us to predict relevance scores; we're just building a ranking function


Linear regression example

The simplest scoring functions are linear functions:

f(\phi(q, d)) = w^\top \phi(q, d) = \sum_{j=1}^{p} w_j \phi_j(q, d)

Using linear scoring functions in a regression approach reduces ranking to linear regression:

w = \operatorname*{argmin}_{w} \sum_{n=1}^{N} \sum_{i=1}^{m} \big( R_n(i) - w^\top \phi(q_n, d_{n,i}) \big)^2

For a new query-document example, output a ranking by sorting (w^\top \phi(q, d_i))_{i=1}^{m}.
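A toy end-to-end sketch of this pointwise regression reduction on synthetic data (the data-generating process below is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, p = 20, 5, 4                         # queries, documents per query, features
X = rng.normal(size=(N, m, p))             # X[n] holds phi(q_n, d_{n,i}) as rows
w_true = rng.normal(size=p)
R = (X @ w_true > 0.5).astype(float) * 2   # synthetic relevance labels in {0, 2}

# Pointwise reduction: stack all (feature vector, relevance) pairs and solve least squares
w_hat, *_ = np.linalg.lstsq(X.reshape(-1, p), R.reshape(-1), rcond=None)

# Rank the documents of a new query by sorting the predicted scores
X_new = rng.normal(size=(m, p))
print(np.argsort(-(X_new @ w_hat)) + 1)    # document indices from best to worst (1-indexed)
```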


Classification

If d_i has higher relevance than d_j for a query q, then feed examples of the form (φ(q, d_i) − φ(q, d_j), +1) to your favorite classification algorithm
Advantage: can use already-built classification tools
Disadvantage: not clear how to use pairwise classification to create a ranking


Logistic regression example

A logistic regression approach reduces ranking to:

w = \operatorname*{argmin}_{w} \sum_{n=1}^{N} \sum_{R_n(i) > R_n(j)} \ell_{\log}\big( w^\top (\phi(q, d_i) - \phi(q, d_j)), +1 \big)

ℓ_log is the logistic loss

\ell_{\log}(w^\top x, y) = \log\big( 1 + \exp(-y \cdot w^\top x) \big)

Other losses could also be used:

\ell_{\mathrm{hinge}}(w^\top x, y) = \max\{0, 1 - y \cdot w^\top x\}


Logistic regression: producing a ranking

On a new triple (q, d_i, d_j), the trained classifier will be able to tell you whether d_i is more relevant than d_j for the query q
How to resolve conflicts (such as cycles) and output a ranking of d_1, ..., d_m?
Run quicksort!

  Choose a pivot document d_i at random
  Use the classifier to partition the other documents into those more relevant (D_>) and less relevant (D_<) than d_i
  Return quicksort(D_>), d_i, quicksort(D_<)
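A sketch of this quicksort idea, with a pairwise preference oracle standing in for the trained classifier (here the oracle is just a hidden score table, an assumption made for illustration):

```python
import random

def quicksort_rank(docs, prefer):
    """Order docs using a (possibly inconsistent) pairwise preference oracle.
    prefer(a, b) returns True if a should be ranked above b."""
    if len(docs) <= 1:
        return list(docs)
    pivot = random.choice(docs)
    rest = [d for d in docs if d is not pivot]
    above = [d for d in rest if prefer(d, pivot)]
    below = [d for d in rest if not prefer(d, pivot)]
    return quicksort_rank(above, prefer) + [pivot] + quicksort_rank(below, prefer)

# toy preference oracle standing in for the pairwise classifier
scores = {"d1": 0.1, "d2": 0.9, "d3": 0.4}
print(quicksort_rank(list(scores), lambda a, b: scores[a] > scores[b]))   # ['d2', 'd3', 'd1']
```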


Boosting in classification

Simple rules of thumb can often predict a little better than random guessing:
"$" in subject line ⇒ email is spam
How to combine simple rules of thumb into a powerful classifier (say with 90% accuracy)?
Adaboost: perhaps the most popular boosting algorithm


Adaboost

Training data (X_1, Y_1), ..., (X_N, Y_N)
Start from the uniform distribution P_1(n) = 1/N for all 1 ≤ n ≤ N
At iteration t = 1, 2, ..., T:

  Train a weak classifier h_t on the training data using weights from P_t
  Choose step-size α_t and set

  f_t = \sum_{s=1}^{t} \alpha_s h_s

  Update P_t to P_{t+1} so that P_{t+1} puts more weight on examples where h_t performs badly

Final classifier is (the sign of) \sum_{t=1}^{T} \alpha_t h_t.


Adarank

Training data (X_1, R_1), ..., (X_N, R_N)
Start from the uniform distribution P_1(n) = 1/N for all 1 ≤ n ≤ N
At iteration t = 1, 2, ..., T:

  Train a weak ranker h_t on the training data using weights from P_t
  Choose step-size α_t and set

  f_t = \sum_{s=1}^{t} \alpha_s h_s

  Update P_t to P_{t+1} so that P_{t+1} puts more weight on examples where f_t performs badly

Final ranker is (the sorted order given by) \sum_{t=1}^{T} \alpha_t h_t.


Adarank: weak ranker

Assume that RL (ranking loss) is the target loss we wish to minimize (e.g., 1-NDCG)

The weak ranker can simply be a single feature (e.g., PageRank) that minimizes the loss, i.e.,

\operatorname*{argmin}_{k \in \{1, \ldots, p\}} \sum_{n=1}^{N} P_t(n) \, RL(X_n e_k, R_n)

X_n e_k = \begin{pmatrix} \phi(q_n, d_{n,1})^\top \\ \vdots \\ \phi(q_n, d_{n,m})^\top \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix}   (the 1 is in the k-th position)


Adarank: distribution update

The next distribution over training examples is obtained using:

P_{t+1}(n) = \frac{\exp\big( -RL(f_t(X_n), R_n) \big)}{Z_{t+1}}

The normalization Z_{t+1} ensures that we have a probability distribution

Z_{t+1} = \sum_{n=1}^{N} \exp\big( -RL(f_t(X_n), R_n) \big)
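Putting the last two slides together, here is a toy AdaRank-style sketch. Assumptions made for illustration: NDCG as the performance measure, a constant step size instead of the α_t that AdaRank derives from the weak ranker's weighted performance, and re-weighting by exp of the negated per-query NDCG so that badly ranked queries gain weight, as the slide intends.

```python
import numpy as np

def ndcg(scores, rel):
    """NDCG of the ranking induced by 'scores' against graded relevance 'rel'."""
    order = np.argsort(-scores)
    disc = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = ((2.0 ** rel[order] - 1) * disc).sum()
    ideal = ((2.0 ** np.sort(rel)[::-1] - 1) * disc).sum()
    return dcg / ideal if ideal > 0 else 0.0

def adarank(X, R, T=20, alpha=0.3):
    """X: (N, m, p) query-document features, R: (N, m) relevance labels."""
    N, m, p = X.shape
    P = np.full(N, 1.0 / N)                  # distribution over training queries
    w = np.zeros(p)                          # combined ranker: score = X @ w
    for _ in range(T):
        # weak ranker = the single feature with the smallest weighted loss (1 - NDCG)
        loss = [sum(P[n] * (1 - ndcg(X[n, :, k], R[n])) for n in range(N)) for k in range(p)]
        w[int(np.argmin(loss))] += alpha
        # re-weight queries: low NDCG under the current combined ranker => more weight
        perf = np.array([ndcg(X[n] @ w, R[n]) for n in range(N)])
        P = np.exp(-perf)
        P /= P.sum()
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5, 4))
R = (X[:, :, 0] + 0.3 * rng.normal(size=(30, 5)) > 0).astype(float)  # feature 0 is informative
print(adarank(X, R))
```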


Support Vector Machine

SVMs solve the following problem (‖w‖₂² := w^⊤ w):

\min_{w} \; \underbrace{\|w\|_2^2}_{\text{L2 regularization}} + \underbrace{C}_{\text{tuning param.}} \sum_{i=1}^{n} \xi_i

subject to

\xi_i \geq 0

\underbrace{Y_i w^\top X_i}_{\text{margin}} \geq 1 - \underbrace{\xi_i}_{\text{slack variable}}


Ranking SVM

Consider a query q_n and documents d_{n,i} and d_{n,j} such that

R_n(i) > R_n(j)

We would like to have

w^\top \phi(q_n, d_{n,i}) > w^\top \phi(q_n, d_{n,j})

As in the SVM, we can introduce a slack variable ξ_{n,i,j} with the constraints

\xi_{n,i,j} \geq 0

w^\top \phi(q_n, d_{n,i}) \geq w^\top \phi(q_n, d_{n,j}) + 1 - \xi_{n,i,j}


Ranking SVM

Ranking SVM solves the following problem

\min_{w} \|w\|_2^2 + C \sum_{n=1}^{N} \sum_{R_n(i) > R_n(j)} \xi_{n,i,j}

subject to the constraints

\xi_{n,i,j} \geq 0

w^\top \phi(q_n, d_{n,i}) \geq w^\top \phi(q_n, d_{n,j}) + 1 - \xi_{n,i,j}

This is actually also a reduction approach!


Ranking SVM as regularized loss minimization

Note that the minimum possible value of ξ subject to ξ ≥ 0 and ξ ≥ t is

\max\{t, 0\} =: [t]_+

This observation lets us write the ranking SVM optimization as

\min_{w} \; \underbrace{\|w\|_2^2}_{\text{regularizer}} + C \sum_{n=1}^{N} \underbrace{\sum_{R_n(i) > R_n(j)} \big[ 1 + w^\top \phi(q_n, d_{n,j}) - w^\top \phi(q_n, d_{n,i}) \big]_+}_{\text{loss on example } n}


Regularized loss minimization

Note that

X_n w = \begin{pmatrix} \phi(q_n, d_{n,1})^\top w \\ \vdots \\ \phi(q_n, d_{n,m})^\top w \end{pmatrix}

Ranking SVM problem (where λ = 1/C):

\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} \ell(X_n w, R_n)

The loss in ranking SVM is

\ell(s, R) = \sum_{R(i) > R(j)} [1 + s_j - s_i]_+
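A sketch of minimizing this regularized pairwise-hinge objective by plain (sub)gradient descent on synthetic data; real Ranking SVM solvers use specialized QP/dual methods, so this is only illustrative.

```python
import numpy as np

def pairwise_hinge(Xn, Rn, w):
    """Ranking SVM loss on one query, sum over R(i) > R(j) of [1 + s_j - s_i]_+, and a subgradient."""
    s = Xn @ w
    loss, grad = 0.0, np.zeros_like(w)
    for i in range(len(Rn)):
        for j in range(len(Rn)):
            if Rn[i] > Rn[j] and 1 + s[j] - s[i] > 0:
                loss += 1 + s[j] - s[i]
                grad += Xn[j] - Xn[i]
    return loss, grad

def ranking_svm(X, R, lam=0.1, lr=0.01, epochs=50):
    """Minimize lam * ||w||^2 + sum_n loss_n by (sub)gradient descent."""
    w = np.zeros(X.shape[-1])
    for _ in range(epochs):
        grad = 2 * lam * w
        for Xn, Rn in zip(X, R):
            grad += pairwise_hinge(Xn, Rn, w)[1]
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5, 3))
R = (X[:, :, 1] > 0).astype(int)            # second feature drives relevance in this toy data
print(ranking_svm(X, R))
```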


Why not directly optimize performance measures?

Let RL be some actual loss you want to minimize (e.g., 1-NDCG)
Then why not solve the following?

\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} RL(\operatorname{argsort}(X_n w), R_n)

argsort(t) is a ranking σ that puts the elements of t in (decreasing) sorted order:

t_{\sigma^{-1}(1)} \geq t_{\sigma^{-1}(2)} \geq \ldots \geq t_{\sigma^{-1}(m)}


Why not directly optimize performance measures?

The function

w ↦ RL(argsort(Xw), R)

is not a nice one (from an optimization perspective)!
In the neighborhood of any point, it is either flat or discontinuous
It can only take one of at most m! different values
Minimizing a sum of such functions is computationally intractable


A smoothing approach

Instead of the deterministic score s = Xw, think of smoothed score distributions:

S ∼ N(s, σ² I_{m×m})

This gives us a smoothed ranking loss

RL_σ(s, R) = E_S[RL(argsort(S), R)]

We can then optimize (using, say, gradient descent):

\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} RL_\sigma(X_n w, R_n)
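A Monte Carlo sketch of the smoothed loss RL_σ at a single score vector, using 1 − NDCG as the ranking loss; the toy inputs and sample count are arbitrary, and a real optimizer would also need gradients (e.g., via the closed forms used by SmoothRank).

```python
import numpy as np

def one_minus_ndcg(scores, rel):
    order = np.argsort(-scores)
    disc = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = ((2.0 ** rel[order] - 1) * disc).sum()
    ideal = ((2.0 ** np.sort(rel)[::-1] - 1) * disc).sum()
    return 1 - dcg / ideal if ideal > 0 else 0.0

def smoothed_loss(s, rel, sigma, n_samples=2000, seed=0):
    """Monte Carlo estimate of RL_sigma(s, R) = E[RL(argsort(S), R)], S ~ N(s, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    draws = s + sigma * rng.normal(size=(n_samples, len(s)))
    return np.mean([one_minus_ndcg(d, rel) for d in draws])

s = np.array([2.0, 1.0, 0.0])
rel = np.array([0.0, 2.0, 1.0])
for sig in (0.01, 0.5, 2.0):
    print(sig, smoothed_loss(s, rel, sig))   # larger sigma => smoother, flatter objective
```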


Choosing the smoothing parameter

As σ grows, the optimization problem becomes easier and we're less prone to overfitting
As σ shrinks, the smoothed loss approximates the underlying loss better
Annealing approach of SmoothRank:
  Initialize w_0 (e.g., using linear regression)
  Choose a large σ_1
  For t = 1, 2, ...:
    Starting from w_{t−1}, solve the regularized RL_{σ_t} minimization using (some version of) gradient descent
    Let the new solution be w_t and set σ_{t+1} = σ_t / 2


Convex surrogates

The ranking SVM problem is convex, hence easy to optimize
But it has no direct relation to a performance measure
The smoothing approach starts from a performance measure
But it solves smooth, non-convex problems (can get stuck in a local optimum)
Given a hard-to-optimize performance measure, can we indirectly minimize it via some convex surrogate?


Recap

We can try to reduce ranking to another ML problem: e.g., regression or classification

We can try to boost weak rankers into stronger ones

The margin maximization approach of SVMs leads naturally to Ranking SVM

Ranking SVM solves a regularized loss minimization

We can directly try to minimize the target performance measure
(hard because of flatness/discontinuity; might get stuck in a local minimum even after smoothing)

Can we solve convex problems (no non-global local minima) and yet retain a principled connection to a target performance measure?

Theory for learning to rank

Theory: still a long way to go

An amazing amount of applied work exists
Yet, a lot of theoretical work remains to be done

"However, sometimes benchmark experiments are not as reliable as expected due to the small scales of the training and test data. In this situation, a theory is needed to guarantee the performance of an algorithm on infinite unseen data."

– O. Chapelle, Y. Chang, and T.-Y. Liu, "Future Directions in Learning to Rank" (2011)


Statistical Ranking Theory: Challenges

Combinatorial nature of the output space
  Size of the prediction space is m! (m = number of things being ranked)
No canonical performance measure
  Many choices: ERR, MAP, (N)DCG
  Even with a fixed choice, how to come up with "good" convex surrogates
Dealing with diverse types of feedback models
  relevance scores, pairwise feedback, click-through rates, preference DAGs
Large-scale challenges
  guarantees for parallel, distributed, online/streaming approaches


Our focus

Human feedback in the form of relevance scores
Ranking methods based on regularized convex loss minimization

\min_{w} \; \lambda \|w\|_2^2 + \sum_{n=1}^{N} \ell(X_n w, R_n)

A few commonly used performance measures: NDCG, MAP
Generalization error bounds and consistency


Standard Setup

Standard assumption in theory: the examples

(X_1, R_1), \ldots, (X_N, R_N)

are drawn iid from an underlying (unknown) distribution
Consider the constrained form of loss minimization

\hat{w} = \operatorname*{argmin}_{\|w\|_2 \leq B} \; \frac{1}{N} \sum_{n=1}^{N} \ell(X_n w, R_n)


Risk

Risk of w: performance of w on "infinite unseen data"

R(w) = E_{X,R}[\ell(Xw, R)]

Note: cannot be computed without knowledge of the underlying distribution
Empirical risk of w: performance of w on the training set

\hat{R}(w) = \frac{1}{N} \sum_{n=1}^{N} \ell(X_n w, R_n)

Note: can be computed using just the training data


Generalization error bound

We say that ŵ is an empirical risk minimizer (ERM)

\hat{w} = \operatorname*{argmin}_{\|w\|_2 \leq B} \hat{R}(w)

We're typically interested in bounds on the generalization error

R(\hat{w})

For instance, we expect the generalization error to be larger than

\hat{R}(\hat{w})

but by how much?


A downward bias

Let us show that R̂(ŵ) is, in expectation, less than R(ŵ)

Let w* be the scoring function with least expected loss

w^\star = \operatorname*{argmin}_{\|w\|_2 \leq B} R(w)

\hat{R}(\hat{w}) \leq \hat{R}(w^\star)          by definition of ERM
E[\hat{R}(\hat{w})] \leq E[\hat{R}(w^\star)]    taking expectations
E[\hat{R}(\hat{w})] \leq R(w^\star)             since for every fixed w, E[\hat{R}(w)] = R(w)
E[\hat{R}(\hat{w})] \leq E[R(\hat{w})]          by definition of w^\star


Generalization error bound

We might therefore want a generalization error bound of the form

R(\hat{w}) \leq \hat{R}(\hat{w}) + \text{stuff}

[Figure: the risk R(w) and the empirical risk R̂(w) plotted as curves over w, with ŵ marked and the gap R(ŵ) − R̂(ŵ) indicated.]


Generalization error bound

A useless generalization error bound:

R(\hat{w}) \leq \hat{R}(\hat{w}) + |R(\hat{w}) - \hat{R}(\hat{w})|

Better?

R(\hat{w}) \leq \hat{R}(\hat{w}) + \max_{\|w\|_2 \leq B} |R(w) - \hat{R}(w)|


Bootstrapping

The difference R(w) − R̂(w) compares the expected loss of w under

(X, R) ∼ underlying dist.    vs.    (X, R) ∼ (X_n, R_n) with weight 1/N

Let us replace the underlying dist. with the sample, and the sample with a (weighted) sub-sample!

(X, R) ∼ (X_n, R_n) with weight 1/N

vs.

(X, R) ∼ (X_n, R_n) with weight 2/N if ε_n = +1 and weight 0 if ε_n = −1

where the ε_n are symmetric signed Bernoulli, or Rademacher, random variables


Weighted subsample: an example

Suppose the original sample with weights is

(X_1, R_1) : 1/3,   (X_2, R_2) : 1/3,   (X_3, R_3) : 1/3

Suppose ε_1 = +1, ε_2 = −1, ε_3 = +1
Then the weighted sub-sample is

(X_1, R_1) : 2/3,   (X_3, R_3) : 2/3


Towards a generalization error bound

Define the weighted average under the sub-sample

\hat{R}'(w) = \frac{1}{N} \sum_{n=1}^{N} (1 + \varepsilon_n) \, \ell(X_n w, R_n)

Replace

\max_{\|w\|_2 \leq B} |R(w) - \hat{R}(w)|

with

\max_{\|w\|_2 \leq B} |\hat{R}(w) - \hat{R}'(w)|


A data-dependent generalization error bound

With probability at least 1 − δ,

R(\hat{w}) \leq \underbrace{\hat{R}(\hat{w})}_{\text{shrinks as } B \uparrow} + \underbrace{\max_{\|w\|_2 \leq B} |\hat{R}(w) - \hat{R}'(w)|}_{\text{grows as } B \uparrow} + \sqrt{\frac{\log(1/\delta)}{N}}

Can be used for data-dependent selection of B


Rademacher complexity

Note:

\hat{R}(w) - \hat{R}'(w) = \frac{1}{N} \sum_{n=1}^{N} \big( 1 - (1 + \varepsilon_n) \big) \ell(X_n w, R_n) = \frac{1}{N} \sum_{n=1}^{N} -\varepsilon_n \, \ell(X_n w, R_n)

Therefore, with high probability,

R(\hat{w}) \leq \hat{R}(\hat{w}) + \max_{\|w\|_2 \leq B} \left| \frac{1}{N} \sum_{n=1}^{N} \varepsilon_n \, \ell(X_n w, R_n) \right| + \sqrt{\frac{\log(1/\delta)}{N}}

The quantity

E_{\varepsilon_1, \ldots, \varepsilon_N} \left[ \max_{\|w\|_2 \leq B} \left| \frac{1}{N} \sum_{n=1}^{N} \varepsilon_n \, \ell(X_n w, R_n) \right| \right]

is called the Rademacher complexity
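A sketch of estimating this quantity by Monte Carlo over the Rademacher variables, with the maximum taken over a finite pool of candidate w's as a stand-in for the ball ‖w‖₂ ≤ B (an assumption that keeps the inner maximization trivial):

```python
import numpy as np

def pairwise_hinge(s, rel):
    """The ranking SVM loss l(s, R) from the earlier slides."""
    return sum(max(0.0, 1 + s[j] - s[i])
               for i in range(len(rel)) for j in range(len(rel)) if rel[i] > rel[j])

def rademacher_estimate(X, R, candidates, n_draws=200, seed=0):
    """Monte Carlo estimate of E_eps[ max_w | (1/N) sum_n eps_n * l(X_n w, R_n) | ]
    over a finite candidate set of w's."""
    rng = np.random.default_rng(seed)
    N = len(X)
    L = np.array([[pairwise_hinge(Xn @ w, Rn) for Xn, Rn in zip(X, R)] for w in candidates])
    vals = []
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=N)
        vals.append(np.max(np.abs(L @ eps / N)))
    return float(np.mean(vals))

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4, 3))
R = (X[:, :, 0] > 0).astype(int)
candidates = [rng.normal(size=3) for _ in range(25)]
print(rademacher_estimate(X, R, candidates))
```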


How does Rademacher complexity scale?

Rademacher complexity depends on

  The loss function ℓ: a more "complicated" loss ⟹ bigger complexity
  The size B of the w's: larger B ⟹ bigger complexity
  The size of the features, say ‖φ(q, d_i)‖₂ ≤ R: larger R ⟹ bigger complexity
  The sample size N: larger N ⟹ smaller complexity

Roughly speaking, it scales as:

\mathrm{Lip}(\ell) \cdot B \cdot R \cdot \sqrt{\frac{\log(1/\delta)}{N}}


Lipschitz constant

Fix an arbitrary relevance vector R
How much does ℓ(s, R) change as we change s?
The Lipschitz constant Lip(ℓ) is the smallest L such that

\forall R, \; |\ell(s, R) - \ell(s', R)| \leq L \cdot \|s - s'\|_\infty

where

\|s - s'\|_\infty = \max_{i=1}^{m} |s_i - s'_i|


Another type of generalization bound

So far we have seen bounds of the form

R(\hat{w}) \leq \hat{R}(\hat{w}) + \text{stuff}

However, \hat{R}(\hat{w}) \leq \hat{R}(w^\star) and \hat{R}(w^\star) concentrates around R(w^\star)

Therefore, it is also easy to derive bounds of the form

R(\hat{w}) \leq R(w^\star) + \text{stuff}


Generalization bounds: recap

Risk R(w) is the expected loss of w on infinite, unseen data

Empirical risk R̂(w) is the average loss on the training data

A w that minimizes the loss on the training data is called an empirical risk minimizer (ERM)

Saw two types of generalization error bounds:

R(\hat{w}) \leq \hat{R}(\hat{w}) + \text{stuff}

R(\hat{w}) \leq \underbrace{R(w^\star)}_{\text{minimum possible risk}} + \text{stuff}

Rademacher complexity can be used to give data-dependent generalization bounds for the ERM


Richer function classes for more data

As we see more data, it is natural to enlarge the function class over which the ERM is computed

\hat{f}_n = \operatorname*{argmin}_{f \in \mathcal{F}_n} \frac{1}{N} \sum_{n=1}^{N} \ell(f(X_n), R_n) = \operatorname*{argmin}_{f \in \mathcal{F}_n} \hat{R}(f)

where

f(X_n) = \begin{pmatrix} f(\phi(q_n, d_{n,1})) \\ \vdots \\ f(\phi(q_n, d_{n,m})) \end{pmatrix}

E.g., as n grows, F_n = larger decision trees, deeper neural networks, SVMs with less regularization


Estimation error vs approximation error

Let f*(F_n) be the best function in F_n and f* be the best overall function:

f^\star(\mathcal{F}_n) = \operatorname*{argmin}_{f \in \mathcal{F}_n} R(f), \qquad f^\star = \operatorname*{argmin}_{f} R(f)

We can decompose the excess risk

R(\hat{f}_n) - R(f^\star) = \underbrace{R(\hat{f}_n) - R(f^\star(\mathcal{F}_n))}_{\text{estimation error}} + \underbrace{R(f^\star(\mathcal{F}_n)) - R(f^\star)}_{\text{approximation error}}


Consistency

If F_n grows too quickly, the estimation error will not go to zero
If F_n grows too slowly, the approximation error will not go to zero
By judiciously growing F_n, we can guarantee that both errors go to zero
This results in consistency:

R(\hat{f}_n) → R(f^\star) as n → ∞


What did we achieve?

We said that we will use a nice (smooth/convex) loss ℓ as a surrogate and train a scoring function f̂_n on finite data using ℓ

\hat{f}_n = \operatorname*{argmin}_{f \in \mathcal{F}_n} \frac{1}{N} \sum_{n=1}^{N} \ell(f(X_n), R_n)

Enlarging F_n neither too quickly nor too slowly with n will give us

R(\hat{f}_n) \to R(f^\star) as n \to \infty

Great, but how good is f̂_n according to ranking performance measures (NDCG, ERR, AP)?


Risk under ranking loss

Recall that $R(f)$ is the risk of $f$ under the surrogate $\ell$:

$$R(f) = \mathbb{E}_{X,R}[\ell(f(X), R)]$$

Similarly, there is $L(f)$, the risk of $f$ under a target ranking loss RL:

$$L(f) = \mathbb{E}_{X,R}[\mathrm{RL}(\operatorname{argsort}(f(X)), R)]$$

The argsort is required because the first argument of RL has to be a ranking, not a score vector
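To make this concrete, here is a minimal numpy sketch (not from the tutorial) that computes the target loss for one query, assuming a linear scoring function, graded relevance, and RL = 1 − NDCG; the helper names are made up for illustration.

import numpy as np

def dcg(ranking, rel):
    """DCG of a ranking, where ranking[i] is the index of the document placed at position i."""
    gains = 2.0 ** rel[ranking] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(ranking) + 2))
    return float(np.sum(gains * discounts))

def target_loss(scores, rel):
    """RL(argsort(f(X)), R) with RL = 1 - NDCG: scores are turned into a ranking by argsort."""
    ranking = np.argsort(-scores)           # sort documents by decreasing score
    ideal = np.argsort(-rel)                # ideal ordering for the normalization
    idcg = dcg(ideal, rel)
    return 1.0 - dcg(ranking, rel) / idcg if idcg > 0 else 0.0

# One query with m = 5 documents and p = 3 features
X = np.array([[0.1, 0.9, 0.0],
              [0.8, 0.2, 0.1],
              [0.4, 0.4, 0.5],
              [0.0, 0.1, 0.9],
              [0.7, 0.6, 0.3]])
R = np.array([2.0, 0.0, 1.0, 3.0, 0.0])
w = np.array([0.2, -0.1, 1.0])              # linear scoring function f(x) = w . x
print(target_loss(X @ w, R))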


Calibration

Fix a target loss RL (e.g., $1 - \mathrm{NDCG}$) and a surrogate $\ell$ (e.g., the RankingSVM loss)
Suppose we know

$$R(f_N) \xrightarrow{N \to \infty} \min_{f} R(f)$$

Does this automatically guarantee the following?

$$L(f_N) \xrightarrow{N \to \infty} \min_{f} L(f)$$

If yes, then we say that the surrogate $\ell$ is calibrated w.r.t. the target loss RL


Calibration in binary classification

Consider the simpler setting of binary classification where $X_i \in \mathbb{R}^p$ and $Y_i \in \{+1, -1\}$
We learn real-valued functions $f : \mathbb{R}^p \to \mathbb{R}$ and take the sign to output a class label
Given a surrogate $\ell$ (hinge loss, logistic loss), we can define

$$R(f) = \mathbb{E}_{X,Y}[\ell(f(X), Y)]$$

Target loss is often just the 0-1 loss:

$$L(f) = \mathbb{E}_{X,Y}[\mathbf{1}(Y f(X) \le 0)]$$
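As a small illustration (not part of the tutorial), the two empirical risks can be computed side by side for a linear classifier; the hinge surrogate and the toy data below are just one possible choice.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y = np.sign(X @ w_true + 0.3 * rng.normal(size=100))       # labels in {+1, -1}

w = np.array([0.8, -1.5, 0.0])                              # a candidate scorer f(x) = w . x
margins = Y * (X @ w)
surrogate_risk = np.mean(np.maximum(0.0, 1.0 - margins))    # empirical hinge risk
zero_one_risk = np.mean(margins <= 0)                       # empirical 0-1 risk
print(surrogate_risk, zero_one_risk)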


Calibration in binary classification

Suppose $\ell(Y, f(X)) = \varphi(Y f(X))$ (a margin loss)
For a margin-based convex surrogate $\ell$, when does

$$R(f_N) \xrightarrow{N \to \infty} \min_{f} R(f)$$

imply

$$L(f_N) \xrightarrow{N \to \infty} \min_{f} L(f) \;?$$

Surprisingly simple answer: $\varphi'(0) < 0$
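This condition is easy to check numerically. The sketch below (illustrative, assuming the standard margin forms hinge $\varphi(t) = \max(0, 1-t)$, logistic $\varphi(t) = \log(1 + e^{-t})$, and squared $\varphi(t) = (1-t)^2$) estimates $\varphi'(0)$ by a central difference.

import numpy as np

surrogates = {
    "hinge":    lambda t: np.maximum(0.0, 1.0 - t),
    "logistic": lambda t: np.log1p(np.exp(-t)),
    "squared":  lambda t: (1.0 - t) ** 2,
}

h = 1e-6
for name, phi in surrogates.items():
    slope_at_zero = (phi(h) - phi(-h)) / (2 * h)    # central-difference estimate of phi'(0)
    print(f"{name:8s} phi'(0) ~ {slope_at_zero:+.3f}   calibrated: {slope_at_zero < 0}")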


Examples of calibrated surrogates

[Figure: the 0-1 loss together with three calibrated convex surrogates (hinge, logistic, squared), plotted as functions of the margin $Y f(X)$.]
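A sketch like the following matplotlib snippet (illustrative, not the original plotting code) reproduces the figure; the 1/log 2 scaling of the logistic loss is one common convention and an assumption here.

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-0.5, 2.0, 400)                                   # margin Y f(X)
plt.plot(t, (t <= 0).astype(float), label="0-1 loss")
plt.plot(t, np.maximum(0.0, 1.0 - t), label="Hinge")
plt.plot(t, np.log1p(np.exp(-t)) / np.log(2), label="Logistic")   # scaled so it equals 1 at t = 0
plt.plot(t, (1.0 - t) ** 2, label="Squared")
plt.xlabel("Margin Yf(X)")
plt.ylim(0.0, 2.5)
plt.legend()
plt.show()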


Calibration w.r.t. DCG

Recall that DCG is defined as:

$$\mathrm{DCG}(\sigma, R) = \sum_{i=1}^{m} f(R(i))\, g(\sigma(i))$$

Let $\eta$ be a distribution over relevance vectors $R$
The maximizer $\sigma(\eta)$ of

$$\sigma \mapsto \mathbb{E}_{R \sim \eta}[\mathrm{DCG}(\sigma, R)]$$

is $\operatorname{argsort}(\mathbb{E}_{R \sim \eta}[f(R)])$, i.e., the permutation that sorts documents by decreasing expected gain
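A brute-force sanity check of this fact on a tiny example (illustrative; it assumes one common choice of gain, $f(r) = 2^r - 1$, and of discount, $1/\log_2(1 + \text{position})$):

import itertools
import numpy as np

def dcg(sigma, rel):
    """DCG(sigma, R), where sigma[i] is the document placed at position i (0-based)."""
    gains = 2.0 ** rel[np.array(sigma)] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(sigma) + 2))
    return float(np.sum(gains * discounts))

# A small distribution eta over relevance vectors for m = 3 documents
eta = [(0.6, np.array([3.0, 1.0, 0.0])),
       (0.4, np.array([0.0, 2.0, 2.0]))]

expected_gain = sum(p * (2.0 ** r - 1.0) for p, r in eta)          # E_{R ~ eta}[f(R)]
best_by_formula = tuple(int(i) for i in np.argsort(-expected_gain))

best_by_search = max(itertools.permutations(range(3)),
                     key=lambda s: sum(p * dcg(s, r) for p, r in eta))
print(best_by_formula, best_by_search)                             # the two orderings agree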


Calibration w.r.t. DCG

Consider the score vector s(η) that minimizes

$$s \mapsto \mathbb{E}_{R \sim \eta}[\ell(s, R)]$$

Necessary and sufficient condition for DCG calibration: for every $\eta$, $s(\eta)$ has the same sorted order as $\mathbb{E}_{R \sim \eta}[f(R)]$

A similar condition can be derived for NDCG by taking the normalization into account


Calibration w.r.t. DCG: Example

Consider the regression-based surrogate

$$\ell(s, R) = \sum_{i=1}^{m} (s_i - f(R_i))^2$$

It is easy to see that $s(\eta)$ is exactly $\mathbb{E}_{R \sim \eta}[f(R)]$

Thus, this regression-based surrogate is DCG calibrated
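A small numerical check of this claim (illustrative; the toy distribution, the gain $f(r) = 2^r - 1$, and plain gradient descent are all assumptions made for the example):

import numpy as np

eta = [(0.6, np.array([3.0, 1.0, 0.0])),      # a toy distribution over relevance vectors
       (0.4, np.array([0.0, 2.0, 2.0]))]
gain = lambda r: 2.0 ** r - 1.0

# Minimize s -> E_{R ~ eta}[ sum_i (s_i - f(R_i))^2 ] by gradient descent
s = np.zeros(3)
for _ in range(2000):
    grad = sum(p * 2.0 * (s - gain(r)) for p, r in eta)
    s -= 0.05 * grad

expected_gain = sum(p * gain(r) for p, r in eta)      # E_{R ~ eta}[f(R)]
print(np.round(s, 3), np.round(expected_gain, 3))     # s(eta) equals the expected gains,
                                                      # so it induces the DCG-optimal order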


Calibration w.r.t. ERR, AP

The (N)DCG case might lead one to believe that convex calibrated surrogates always exist
Surprisingly, this is not the case!
It is known that there do not exist any convex calibrated surrogates for ERR and AP
Caveat: the non-existence result assumes that we're using argsort to move from scores to rankings


Consistency: recap

Consistency is an asymptotic (sample size $N \to \infty$) property of learning algorithms

A consistent algorithm's performance reaches that of the "best" ranker as it sees more and more data

A learning algorithm that minimizes a surrogate $\ell$ has to balance the estimation error vs. approximation error trade-off

If the trade-off is managed well, the algorithm will be consistent in the sense of $\ell$

If this also implies consistency w.r.t. a target loss RL, we say $\ell$ is calibrated w.r.t. RL

Calibration in ranking is trickier than in classification: easy in some cases (e.g., NDCG), not so easy in others (e.g., ERR, MAP)


4 Pointers to advanced topics
Online learning to rank
Beyond relevance: diversity and freshness
Large-scale learning to rank


The online protocol

Instead of training on a fixed batch dataset, ranking could occur in a more online/incremental setting

For $t = 1, \ldots, N$:
Ranker sees the feature vectors $X_t \in \mathbb{R}^{m \times p}$ (one row per document)
Ranker outputs a scoring function $f_t : \mathbb{R}^p \to \mathbb{R}$
Relevance vector $R_t$ is revealed
Ranker suffers loss $\ell(f_t(X_t), R_t)$


Regret

In online settings, we often look at regret:

$$\sum_{t=1}^{N} \underbrace{\ell(f_t(X_t), R_t)}_{\text{algorithm's loss}} \;-\; \min_{f \in \mathcal{F}} \sum_{t=1}^{N} \underbrace{\ell(f(X_t), R_t)}_{\text{best loss in hindsight}}$$

If $\ell$ is convex and we're using linear scoring functions $f(x) = w^\top x$, then we can use tools from online convex optimization (OCO)
A great OCO introduction is available at http://ocobook.cs.princeton.edu/
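As a concrete (and purely illustrative) sketch of this setting, the snippet below runs online gradient descent with linear scores and a per-query squared-loss surrogate, then compares the cumulative loss with that of the best fixed $w$ in hindsight (computed by least squares); the step size and data are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
N, m, p = 200, 10, 5
stream = [(rng.normal(size=(m, p)), rng.integers(0, 3, size=m).astype(float))
          for _ in range(N)]                        # (X_t, R_t) pairs

loss = lambda s, r: np.mean((s - r) ** 2)           # convex surrogate on the score vector

w, step, alg_loss = np.zeros(p), 0.05, 0.0
for X_t, R_t in stream:                             # the online protocol
    s_t = X_t @ w                                   # ranker plays f_t(x) = w_t . x
    alg_loss += loss(s_t, R_t)                      # relevance revealed, loss suffered
    grad = 2.0 * X_t.T @ (s_t - R_t) / m            # gradient of the surrogate at w_t
    w -= step * grad                                # online gradient descent update

# Best fixed linear scorer in hindsight, via least squares over the whole stream
X_all = np.vstack([X for X, _ in stream])
R_all = np.concatenate([R for _, R in stream])
w_star = np.linalg.lstsq(X_all, R_all, rcond=None)[0]
best_loss = sum(loss(X @ w_star, R) for X, R in stream)
print("regret:", alg_loss - best_loss)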


Diversity

Focusing solely on relevance can lead to redundancy in search results
A query might be ambiguous: jaguar (car vs. animal)
Simply listing highly relevant results for one facet of the query isn't good
Need to encourage diversity in the search results
New performance measures/methods/theoretical framework needed


Freshness

Consider ranking news items or tweets for queries involving recent events
Need to promote more recent webpages
But only if the query is indeed recency sensitive
Different time scales: for an annual event (year) vs. breaking news (days/hours)
Possible approach: classify the query as recency-sensitive or not and then route it to an appropriate ranker


Large-scale learning to rank

Gradient descent batch optimization of a regularized objective

$$\lambda \|w\|_2^2 + \sum_{n=1}^{N} \ell(X_n w, R_n)$$

can easily be parallelized on a cluster
Ensemble-based methods like bagging can train rankers on different partitions of the data in parallel
These ideas have been used for supervised ML but not much in ranking
See the Apache Spark project: https://spark.apache.org/docs/latest/mllib-guide.html
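A toy sketch of the data-parallel idea using Python's multiprocessing rather than a cluster framework: each worker computes the gradient of the unregularized loss on its partition, and the driver sums the pieces and adds the regularization term. The squared loss, the partition scheme, and the step size are all assumptions made for the example.

import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Gradient of sum_n ||X_n w - R_n||^2 over one partition of the data."""
    w, partition = args
    g = np.zeros_like(w)
    for X_n, R_n in partition:
        g += 2.0 * X_n.T @ (X_n @ w - R_n)
    return g

def full_gradient(w, partitions, lam, pool):
    grads = pool.map(partial_gradient, [(w, part) for part in partitions])
    return 2.0 * lam * w + sum(grads)                 # gradient of lambda ||w||^2 plus the data terms

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [(rng.normal(size=(10, 5)), rng.normal(size=10)) for _ in range(400)]
    partitions = [data[i::4] for i in range(4)]       # 4 partitions, one per worker
    w, lam, step = np.zeros(5), 0.1, 1e-4
    with Pool(4) as pool:
        for _ in range(100):                          # batch gradient descent
            w -= step * full_gradient(w, partitions, lam, pool)
    print(w)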


Summary

Ranking problems arise in a variety of contexts
Learning to rank can be, and has been, viewed as a supervised ML problem
We surveyed a wide landscape of ranking methods
We discussed generalization error bounds for and consistency of ranking methods
We saw a few advanced topics in learning to rank that are being actively investigated


Suggested activities

Get the LETOR 3.0 or 4.0 data sets and try to reproduce the reported baselines
Download the large MSLR-WEB30K data set and see how long it takes to run your favorite learning to rank algorithm on it
Read Chapter 9 (Ranking) of Foundations of Machine Learning by Mohri, Rostamizadeh and Talwalkar. Solve as many exercises as you can!
Read Chapter 11 (Learning to rank) of Boosting: Foundations and Algorithms by Schapire and Freund. Solve as many exercises as you can!
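If you try the first two activities: the LETOR and MSLR files are plain text in the SVMlight format with query ids, which scikit-learn can read directly (a sketch; the file path is a placeholder for wherever you unpack the data).

from sklearn.datasets import load_svmlight_file

# Each line looks like: "<relevance> qid:<query id> 1:<feat 1> 2:<feat 2> ..."
X, y, qid = load_svmlight_file("MSLR-WEB30K/Fold1/train.txt", query_id=True)
print(X.shape, y.shape, qid.shape)   # sparse feature matrix, relevance labels, query ids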


Thanks

Thank you for your attention!

Questions/comments: [email protected]
