Learning to rank as supervised ML · A brief survey of ranking methods · Theory for learning to rank · Pointers to advanced topics · Summary
Tutorial on Learning to Rank
Ambuj Tewari
Department of Statistics and Department of EECS, University of Michigan
January 13, 2015 / MLSS Austin
Ambuj Tewari Learning to Rank Tutorial
Outline
1. Learning to rank as supervised ML
   Features · Label and Output Spaces · Performance Measures · Ranking functions
2. A brief survey of ranking methods
   Reduction approach · Boosting approach · Large margin approach · Optimization approach
3. Theory for learning to rank
   Generalization error bounds · Consistency
4. Pointers to advanced topics
   Online learning to rank · Beyond relevance: diversity and freshness · Large-scale learning to rank
Rankings are everywhere
- Web search: PageRank and HITS
- Sports: FIFA rankings (int'l soccer), ICC rankings (int'l cricket), FIDE rankings (int'l chess), BCS ratings (college football)
- Academia: US News & World Report, Times Higher Education
- Democracy and development: voting systems, UN Human Development Index
- Recommender systems: movies (Netflix, IMDB), music (Pandora), consumer goods (Amazon)
Learning to rank objects
- Creating a ranking requires a lot of effort
- Can we have computers learn to rank objects?
- Will still need some human supervision
- Pose learning to rank as a supervised ML problem
Ranking as a canonical learning problem
- NIPS Workshops: 2002, 2005, 2009, 2014
- SIGIR Workshops: 2007, 2008, 2009
- Journal of Information Retrieval special issue in 2010
- Yahoo! Learning to Rank Challenge: 2010
  http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf
- LEarning TO Rank (LETOR) and Microsoft Learning to Rank (MSLR) datasets: 2007-2010
  http://research.microsoft.com/en-us/um/beijing/projects/letor/
Learning to rank as supervised ML
Supervised Machine Learning: construct a mapping

    f : I → O

based on input-output pairs (the training set)

    (X_1, Y_1), . . . , (X_N, Y_N)

where X_n ∈ I and Y_n ∈ O.
- Want f(X) to be a good predictor of Y for unseen X
- In ranking problems, O is the set of all possible rankings of a given set of objects (webpages, movies, etc.)
A learning-to-rank training example
Query: “MLSS Austin”
Candidate webpages and relevance scores:
- https://www.cs.utexas.edu/mlss: 2 or “good”
- http://mlss.cc/: 2 or “good”
- https://mailman.cc.gatech.edu/.../000796.html: 1 or “fair”
- www.councilofmls.com: 0 or “bad”
Think: example X = ( query, (URL1, URL2, . . ., URL4) ) and label R = (2, 2, 1, 0)
Typical ML cycle
1. Feature construction: find a way to map (query, webpage) into R^p
   Each example (query, m webpages) gets mapped to X_n ∈ I = R^{m×p}
2. Obtain labels: get R_n ∈ {0, 1, 2, . . . , R_max}^m from human judgements
3. Train ranker: get f : I → O = S_m, where S_m is the set of m-permutations
   Training is usually done by solving some optimization problem on the training set
4. Evaluate performance: using some performance measure, evaluate the ranker on a “test set”
Joint query, document features
Number of query terms that occur in document
Document length
Sum/min/max/mean/variance of term frequencies
Sum/min/max/mean/variance of tf-idf
Cosine similarity between query and document
Any model of P(R = 1|Q,D) gives a feature (e.g., BM25)
Number of slashes in URL, length of URL, PageRank, site-level PageRank
Query-URL click count, user-URL click count
A slightly more general supervised ML setting
- For a future query and webpage-list, we want to rank the webpages
- Don't necessarily want to predict the relevance scores
- Let f (the ranking function) take values in the space of rankings
- So now, training examples (X, Y) ∈ I × L where L = label space
- The function f to be learned maps I to O (note L ≠ O)
- E.g. (1, 2, 2, 0, 0) ∈ L and (d1 → 3, d2 → 1, d3 → 2, d4 → 4, d5 → 5) ∈ O
Notation for rankings
- We'll think of m (the number of documents) as fixed, but in reality it varies over examples
- We'll denote a ranking using a permutation σ ∈ S_m (the set of all m! permutations)
- σ(i) is the rank/position of document i
- σ⁻¹(j) is the document at rank/position j
Notation for rankings
Suppose we have 3 documents d1, d2, d3 and we want them ranked thus: d2, then d3, then d1.
- So the ranks are d1 : 3, d2 : 1, d3 : 2
- This means σ(1) = 3, σ(2) = 1, σ(3) = 2
- Also, σ⁻¹(1) = 2, σ⁻¹(2) = 3, σ⁻¹(3) = 1
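This bookkeeping is easy to get wrong, so here is the same example as a tiny Python sketch (variable names are mine):

```python
# sigma maps document -> rank; both are 1-indexed as on the slide
sigma = {1: 3, 2: 1, 3: 2}  # d1 has rank 3, d2 has rank 1, d3 has rank 2

# sigma^{-1} maps rank -> document, obtained by flipping each pair
sigma_inv = {rank: doc for doc, rank in sigma.items()}

assert sigma_inv[1] == 2  # rank 1 is held by d2
assert sigma_inv[2] == 3  # rank 2 is held by d3
assert sigma_inv[3] == 1  # rank 3 is held by d1
```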
Notation for relevance scores/labels
- Relevance labels can be binary or multi-graded
- In the binary case, documents are either relevant or irrelevant (1 or 0)
- In the multi-graded case, relevance labels are from {0, 1, 2, . . . , R_max}
- Typically, R_max ≤ 4 (at most 5 levels of relevance)
- A relevance score/label vector R is an m-vector
- R(i) gives the relevance of document i
Performance measures in ranking
- Performance measures can be gains (to be maximized) or losses (to be minimized)
- They take a relevance vector R and a ranking σ as arguments and produce a non-negative gain/loss
- Some performance measures are defined only for binary relevance
- Others can handle more general multi-graded relevance
Performance measure: RR
- Reciprocal rank (RR) is defined for binary relevance only
- RR is the reciprocal of the rank of the first relevant document according to the ranking

    RR(σ, R) = 1 / min{ σ(i) : R(i) = 1 }

Example: R = (0, 1, 0, 1), σ = (1, 3, 4, 2), then RR = 1/2

    original    ranked
    d1 : 0      d1 : 0
    d2 : 1      d4 : 1   ← RR = 1/2
    d3 : 0      d2 : 1
    d4 : 1      d3 : 0
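As a sanity check, the definition translates into a few lines of Python (a sketch; the function name and list encoding are mine — `sigma` is a 0-indexed list of 1-based ranks, `R` a 0-indexed list of binary relevances):

```python
def reciprocal_rank(sigma, R):
    """RR(sigma, R) = 1 / min{ sigma[i] : R[i] = 1 }.

    sigma[i] is the 1-based rank of document i; R[i] is its binary relevance.
    """
    return 1.0 / min(sigma[i] for i in range(len(R)) if R[i] == 1)

# slide example: R = (0,1,0,1), sigma = (1,3,4,2) -> RR = 1/2
rr = reciprocal_rank([1, 3, 4, 2], [0, 1, 0, 1])
```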
Performance measure: ERR
Expected Reciprocal rank (ERR) generalizes RR to multi-gradedrelevance
Convert relevance scores into probability (e.g., dividing by Rmax)
Imagine a user going down the ranked list and stopping withprobability given by relevance scores
ERR is the expected reciprocal rank of the document at whichthe user stops
Example: R = (0, 2, 0, 1), σ = (1, 3, 4, 2), then ERR = (1/2)·(1/2) + (1/3)·(1/2) = 5/12

    original    ranked    cond. stopping prob.    stopping prob.
    d1 : 0      d1 : 0    0                       0
    d2 : 2      d4 : 1    1/2                     1 · 1/2
    d3 : 0      d2 : 2    2/2                     1 · 1/2 · 2/2
    d4 : 1      d3 : 0    0                       0

(using g(r) = r / R_max with R_max = 2)
Performance measure: ERR
    ERR(σ, R) = Σ_{j=1}^{m} (1/j) · [ Π_{k=1}^{j−1} (1 − g(R(σ⁻¹(k)))) ] · g(R(σ⁻¹(j)))

where the product is the probability of not stopping earlier and g(R(σ⁻¹(j))) is the probability of stopping at position j.
- Recall: R(σ⁻¹(j)) is the relevance of the document at position j
- g is any function used to convert relevance scores into stopping probabilities
- E.g., g(r) = r / R_max or g(r) = (2^r − 1) / 2^{R_max}
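The formula can be checked against the slides' example R = (0, 2, 0, 1), σ = (1, 3, 4, 2) with a short Python sketch (function and argument names are mine):

```python
def err(sigma, R, g):
    """ERR(sigma, R) = sum_j (1/j) * prod_{k<j}(1 - g(R[s(k)])) * g(R[s(j)]),
    where s(j) = sigma^{-1}(j) is the document at rank j."""
    m = len(R)
    sigma_inv = [0] * m
    for doc, rank in enumerate(sigma):
        sigma_inv[rank - 1] = doc          # document sitting at each rank
    total, p_reached = 0.0, 1.0            # p_reached = prob. of not stopping earlier
    for j in range(1, m + 1):
        p_stop = g(R[sigma_inv[j - 1]])    # conditional stopping prob. at rank j
        total += (1.0 / j) * p_reached * p_stop
        p_reached *= 1.0 - p_stop
    return total
```

With g(r) = r/2 this reproduces the worked value 5/12.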
Performance measure: Precision@K
- Precision@K is for binary relevance only
- Precision@K is the fraction of relevant documents in the top K

    Prec@K(σ, R) = (1/K) Σ_{j=1}^{K} R(σ⁻¹(j))

    original    ranked    Prec@K
    d1 : 0      d1 : 0    0/1
    d2 : 1      d4 : 1    1/2
    d3 : 0      d2 : 1    2/3
    d4 : 1      d3 : 0    2/4
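A direct transcription in Python (a sketch with my own names), reproducing the Prec@K column above:

```python
def precision_at_k(sigma, R, K):
    """Prec@K(sigma, R) = (1/K) * sum_{j=1..K} R[sigma^{-1}(j)] for binary R."""
    sigma_inv = [0] * len(R)
    for doc, rank in enumerate(sigma):
        sigma_inv[rank - 1] = doc          # document at each rank
    return sum(R[sigma_inv[j]] for j in range(K)) / K
```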
Performance measure: AP
- Average Precision (AP) is also for binary relevance
- It is the average of Prec@K over the positions of relevant documents only

    AP(σ, R) = ( 1 / Σ_{i=1}^{m} R(i) ) · Σ_{j : R(σ⁻¹(j)) = 1} Prec@j

    original    ranked    Prec@K
    d1 : 0      d1 : 0    0/1
    d2 : 1      d4 : 1    1/2 ←
    d3 : 0      d2 : 1    2/3 ←     AP = (1/2 + 2/3) / 2
    d4 : 1      d3 : 0    2/4
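The same example gives AP = (1/2 + 2/3)/2; a Python sketch (names mine):

```python
def average_precision(sigma, R):
    """AP(sigma, R): mean of Prec@j over the ranks j holding a relevant document."""
    m = len(R)
    sigma_inv = [0] * m
    for doc, rank in enumerate(sigma):
        sigma_inv[rank - 1] = doc
    precs, hits = [], 0
    for j in range(1, m + 1):
        if R[sigma_inv[j - 1]] == 1:       # relevant document at rank j
            hits += 1
            precs.append(hits / j)         # Prec@j = relevant hits in top j, over j
    return sum(precs) / len(precs)
```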
Performance measure: DCG
- Discounted Cumulative Gain (DCG) is for multi-graded relevance
- Contributions of relevant documents further down the ranking are discounted more

    DCG(σ, R) = Σ_{i=1}^{m} f(R(i)) / g(σ(i))

- f, g are increasing functions
- Standard choice: f(r) = 2^r − 1 and g(j) = log₂(1 + j)
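With the standard choices of f and g, DCG is a one-liner; a Python sketch (names mine):

```python
import math

def dcg(sigma, R):
    """DCG(sigma, R) = sum_i (2^R[i] - 1) / log2(1 + sigma[i]),
    using the standard f(r) = 2^r - 1 and g(j) = log2(1 + j)."""
    return sum((2 ** R[i] - 1) / math.log2(1 + sigma[i]) for i in range(len(R)))
```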
Performance measure: NDCG
- Normalized DCG (NDCG), like DCG, is for multi-graded relevance
- As the name suggests, NDCG normalizes DCG to keep the gain bounded by 1

    NDCG(σ, R) = DCG(σ, R) / Z(R)

- The normalization constant Z(R) = max_σ DCG(σ, R) can be computed quickly by sorting the relevance vector
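A sketch in Python (names mine); the normalizer is obtained by sorting documents by decreasing relevance, as noted above:

```python
import math

def ndcg(sigma, R):
    """NDCG(sigma, R) = DCG(sigma, R) / max over permutations of DCG."""
    def dcg_val(s):
        return sum((2 ** R[i] - 1) / math.log2(1 + s[i]) for i in range(len(R)))
    # the maximizing permutation ranks documents by decreasing relevance
    order = sorted(range(len(R)), key=lambda i: -R[i])
    ideal = [0] * len(R)
    for rank, doc in enumerate(order, start=1):
        ideal[doc] = rank
    return dcg_val(sigma) / dcg_val(ideal)
```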
Ranking functions
A ranking function maps m query-document p-dimensional feature vectors into a permutation:

    X = [ φ(q, d1)ᵀ ; . . . ; φ(q, dm)ᵀ ]   (an m×p matrix)   →   e.g. d5 ⇔ σ(5) = 1, . . . , d12 ⇔ σ(12) = m

So mathematically it is a map from R^{m×p} to S_m.
Ranking functions from scoring functions
- However, it is hard to learn functions that map into combinatorial spaces
- In binary classification, we often learn a real-valued function and threshold it to output a class label (positive/negative)
- In ranking, we do the same but replace thresholding by sorting
- For example, let f : R^p → R be a scoring function:

    [ φ(q, d1)ᵀ ; φ(q, d2)ᵀ ; φ(q, d3)ᵀ ]  —f→  (−2.3, 4.2, 1.9)  —sort→  d2 ⇔ σ(2) = 1, d3 ⇔ σ(3) = 2, d1 ⇔ σ(1) = 3
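The sort step can be written once and reused; a Python sketch (name mine) reproducing the example's permutation:

```python
def rank_by_scores(scores):
    """Sort documents by decreasing score; return sigma with sigma[i] = rank of doc i."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # best doc first
    sigma = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        sigma[doc] = rank
    return sigma

# scores (-2.3, 4.2, 1.9) give sigma = (3, 1, 2), as on the slide
sigma = rank_by_scores([-2.3, 4.2, 1.9])
```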
Recap
Training data consists of query-document features and relevance scores:

    X = [ φ(q, d1)ᵀ ; . . . ; φ(q, dm)ᵀ ]   (an m×p matrix),    R = (R(1), . . . , R(m))

We will use the training data to learn a scoring function f that is then used to rank documents for a new query:

    (s1, . . . , sm) = ( f(φ(q_new, d_new_1)), . . . , f(φ(q_new, d_new_m)) )

Scores can be sorted to get a ranking whose performance will be evaluated using a measure such as ERR, AP, or NDCG.
Outline
1. Learning to rank as supervised ML
   Features · Label and Output Spaces · Performance Measures · Ranking functions
2. A brief survey of ranking methods
   Reduction approach · Boosting approach · Large margin approach · Optimization approach
3. Theory for learning to rank
   Generalization error bounds · Consistency
4. Pointers to advanced topics
   Online learning to rank · Beyond relevance: diversity and freshness · Large-scale learning to rank
Many nice, innovative methods have been developedA partial list:
AdaRank, BoltzRank, Cranking, FRank, Infinite Push, LambdaMART, LambdaRank, ListMLE, ListNet, McRank, MPRank, p-norm push, OWPC, PermuRank, PRank, RankBoost, Ranking SVM, RankNet, SmoothRank, SoftRank, SVM-Combo, SVM-MAP, SVM-MRR, SVM-NDCG, SVRank
The landscape of ideas
- Reduction: reduce ranking to a simpler problem (e.g., classification or regression)
- Boosting: combine weak rankers and boost them into strong ones
- Large margin approach: extend the "large-margin" story behind SVMs to the ranking case
- Direct/indirect optimization: choose an appropriate optimization problem based on the performance measure one wishes to optimize
Not all ranking methods neatly fall into one of these categories!
Regression
- Perhaps the simplest approach: regress the relevance score on query-document features
- Feed examples of the form (φ(q, d_i), R(i)) to your favorite regression algorithm
- Advantage: can use already-built regression tools
- Disadvantage: there's no real reason for us to predict relevance scores; we're just building a ranking function
Linear regression example
The simplest scoring functions are linear functions:

    f(φ(q, d)) = wᵀφ(q, d) = Σ_{j=1}^{p} w_j φ_j(q, d)

Using linear scoring functions in a regression approach reduces ranking to linear regression:

    ŵ = argmin_w Σ_{n=1}^{N} Σ_{i=1}^{m} ( R_n(i) − wᵀφ(q_n, d_{n,i}) )²

For a new query and its documents, output a ranking by sorting (wᵀφ(q, d_i))_{i=1}^{m}.
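A minimal pure-Python sketch of this reduction (gradient descent on the squared loss rather than a closed-form least-squares solve; all names are illustrative):

```python
def fit_regression_ranker(examples, p, lr=0.1, epochs=200):
    """Minimize sum over (phi, r) pairs of (w.phi - r)^2 by gradient descent.

    examples: list of (phi, r) with phi a length-p feature list and r the
    relevance score. A sketch of the reduction idea, not a tuned implementation.
    """
    w = [0.0] * p
    for _ in range(epochs):
        grad = [0.0] * p
        for phi, r in examples:
            err = sum(wj * xj for wj, xj in zip(w, phi)) - r
            for j in range(p):
                grad[j] += 2 * err * phi[j]      # d/dw_j of the squared error
        w = [wj - lr * gj / len(examples) for wj, gj in zip(w, grad)]
    return w
```

Sorting a new query's documents by decreasing wᵀφ(q, d_i) then yields the ranking.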
Classification
- If d_i has higher relevance than d_j for a query q, then feed examples of the form (φ(q, d_i) − φ(q, d_j), +1) to your favorite classification algorithm
- Advantage: can use already-built classification tools
- Disadvantage: not clear how to use pairwise classification to create a ranking
Logistic regression example
A logistic regression approach reduces ranking to:

    ŵ = argmin_w Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} ℓ_log( wᵀ(φ(q, d_i) − φ(q, d_j)), +1 )

ℓ_log is the logistic loss:

    ℓ_log(wᵀx, y) = log(1 + exp(−y · wᵀx))

Other losses could also be used:

    ℓ_hinge(wᵀx, y) = max{0, 1 − y · wᵀx}
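The inner objective can be sketched in Python (names mine; this computes the summed pairwise logistic loss for one query, which an optimizer would then minimize over w):

```python
import math

def pairwise_logistic_loss(w, features, R):
    """Sum of log(1 + exp(-(w.phi_i - w.phi_j))) over pairs with R[i] > R[j].

    features[i] is phi(q, d_i). Correctly ordered pairs contribute a small
    loss; badly mis-ordered pairs a large one.
    """
    scores = [sum(wj * xj for wj, xj in zip(w, phi)) for phi in features]
    loss = 0.0
    for i in range(len(R)):
        for j in range(len(R)):
            if R[i] > R[j]:
                loss += math.log(1.0 + math.exp(-(scores[i] - scores[j])))
    return loss
```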
Logistic regression: producing a ranking
- On a new triple (q, d_i, d_j), the trained classifier can tell you whether d_i is more relevant than d_j for the query q
- How to resolve conflicts (such as cycles) and output a ranking of d1, . . . , dm? Run quicksort!
  - Choose a pivot document d_i at random
  - Use the classifier to partition the other documents into more relevant (D_>) and less relevant (D_<) than d_i
  - Return quicksort(D_>), d_i, quicksort(D_<)
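A sketch of this quicksort in Python (names mine; `prefer(a, b)` stands for the trained pairwise classifier):

```python
import random

def quicksort_rank(docs, prefer):
    """Order docs from most to least relevant using a pairwise preference function.

    prefer(a, b) -> True if the classifier says a is more relevant than b.
    Even if prefer has cycles, quicksort still outputs some total order.
    """
    if len(docs) <= 1:
        return list(docs)
    pivot_idx = random.randrange(len(docs))
    pivot = docs[pivot_idx]
    rest = docs[:pivot_idx] + docs[pivot_idx + 1:]
    more = [d for d in rest if prefer(d, pivot)]      # D_>
    less = [d for d in rest if not prefer(d, pivot)]  # D_<
    return quicksort_rank(more, prefer) + [pivot] + quicksort_rank(less, prefer)
```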
Boosting in classification
- Simple rules of thumb can often predict a little better than random guessing: "$" in subject line ⇒ email is spam
- How to combine simple rules of thumb into a powerful classifier (say with 90% accuracy)?
- Adaboost: perhaps the most popular boosting algorithm
Adaboost
- Training data (X_1, Y_1), . . . , (X_N, Y_N)
- Start from the uniform distribution P_1(n) = 1/N for all 1 ≤ n ≤ N
- At iteration t = 1, 2, . . . , T:
  - Train a weak classifier h_t on the training data using weights from P_t
  - Choose a step-size α_t and set f_t = Σ_{s=1}^{t} α_s h_s
  - Update P_t to P_{t+1} so that P_{t+1} puts more weight on examples where h_t performs badly
- The final classifier is (the sign of) Σ_{t=1}^{T} α_t h_t
Adarank
- Training data (X_1, R_1), . . . , (X_N, R_N)
- Start from the uniform distribution P_1(n) = 1/N for all 1 ≤ n ≤ N
- At iteration t = 1, 2, . . . , T:
  - Train a weak ranker h_t on the training data using weights from P_t
  - Choose a step-size α_t and set f_t = Σ_{s=1}^{t} α_s h_s
  - Update P_t to P_{t+1} so that P_{t+1} puts more weight on examples where f_t performs badly
- The final ranker is (the sorted order given by) Σ_{t=1}^{T} α_t h_t
Adarank: weak ranker
- Assume that RL (ranking loss) is the target loss we wish to minimize (e.g., 1 − NDCG)
- The weak ranker can simply be a single feature (e.g., PageRank), chosen to minimize the weighted loss:

    argmin_{k ∈ {1, . . . , p}}  Σ_{n=1}^{N} P_t(n) · RL(X_n e_k, R_n)

- Here e_k is the unit vector with a 1 in the k-th position, so X_n e_k is the k-th column of X_n: the documents of query n scored by feature k alone
Adarank: distribution update
The next distribution over the training examples puts more weight on queries where f_t does badly:

    P_{t+1}(n) = exp( RL(f_t(X_n), R_n) ) / Z_{t+1}

(equivalently exp(−G(f_t(X_n), R_n)) up to normalization, where G = 1 − RL is the gain). The normalization Z_{t+1} ensures that we have a probability distribution:

    Z_{t+1} = Σ_{n=1}^{N} exp( RL(f_t(X_n), R_n) )
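The whole loop can be compressed into a sketch. This is illustrative, not the exact AdaRank specification: it uses single-feature weak rankers, a toy pairwise-accuracy gain in place of NDCG, and a simplified step-size formula; all names are mine:

```python
import math

def pair_accuracy(scores, R):
    """Toy performance measure: fraction of correctly ordered pairs (1 is best)."""
    pairs = [(i, j) for i in range(len(R)) for j in range(len(R)) if R[i] > R[j]]
    if not pairs:
        return 1.0
    return sum(scores[i] > scores[j] for i, j in pairs) / len(pairs)

def adarank(X_list, R_list, gain=pair_accuracy, T=5):
    """Sketch of the AdaRank-style boosting loop.

    X_list[n] is an m x p feature matrix (list of rows) for query n,
    R_list[n] its relevance vector, gain the performance measure G = 1 - RL.
    Returns a list of (alpha_t, feature index) pairs defining the final ranker.
    """
    N, p = len(X_list), len(X_list[0][0])
    P = [1.0 / N] * N                                   # uniform initial weights
    ensemble = []
    for _ in range(T):
        # weak ranker h_t: the single feature with the best weighted gain under P_t
        def weighted_gain(k):
            return sum(P[n] * gain([row[k] for row in X_list[n]], R_list[n])
                       for n in range(N))
        k = max(range(p), key=weighted_gain)
        g = weighted_gain(k)
        # step size: larger when the weak ranker's weighted gain is higher
        alpha = 0.5 * math.log((g + 1e-12) / (1.0 - g + 1e-12))
        ensemble.append((alpha, k))
        # re-weight: queries the current ensemble f_t ranks badly get more mass
        f_scores = [[sum(a * row[j] for a, j in ensemble) for row in X_list[n]]
                    for n in range(N)]
        wts = [math.exp(-gain(f_scores[n], R_list[n])) for n in range(N)]
        P = [wt / sum(wts) for wt in wts]
    return ensemble
```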
Support Vector Machine
SVMs solve the following problem (with ‖w‖₂² := wᵀw):

    min_w  ‖w‖₂² + C Σ_{i=1}^{n} ξ_i

subject to, for every example i,

    ξ_i ≥ 0,    Y_i wᵀX_i ≥ 1 − ξ_i

Here ‖w‖₂² is the L2 regularization term, C a tuning parameter, Y_i wᵀX_i the margin, and ξ_i a slack variable.
Ranking SVM
Consider a query q_n and documents d_{n,i} and d_{n,j} such that

    R_n(i) > R_n(j)

We would like to have

    wᵀφ(q_n, d_{n,i}) > wᵀφ(q_n, d_{n,j})

As in the SVM, we can introduce a slack variable ξ_{n,i,j} with the constraints

    ξ_{n,i,j} ≥ 0,    wᵀφ(q_n, d_{n,i}) ≥ wᵀφ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}
Ranking SVM
Ranking SVM solves the following problem:

    min_w  ‖w‖₂² + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} ξ_{n,i,j}

subject to the constraints

    ξ_{n,i,j} ≥ 0,    wᵀφ(q_n, d_{n,i}) ≥ wᵀφ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}

This is actually also a reduction approach!
Ranking SVM as regularized loss minimization
- Note that the minimum possible value of ξ subject to ξ ≥ 0 and ξ ≥ t is max{t, 0} =: [t]₊
- This observation lets us write the ranking SVM optimization as

    min_w  ‖w‖₂² + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} [ 1 + wᵀφ(q_n, d_{n,j}) − wᵀφ(q_n, d_{n,i}) ]₊

  where ‖w‖₂² is the regularizer and the inner sum is the loss on example n
Regularized loss minimization
Note that

    X_n w = ( φ(q_n, d_{n,1})ᵀw, . . . , φ(q_n, d_{n,m})ᵀw )

Ranking SVM problem (where λ = 1/C):

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} ℓ(X_n w, R_n)

The loss in ranking SVM is

    ℓ(s, R) = Σ_{R(i) > R(j)} [ 1 + s_j − s_i ]₊
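The per-query loss ℓ(s, R) is easy to transcribe; a Python sketch (name mine):

```python
def ranking_svm_loss(s, R):
    """loss(s, R) = sum over pairs with R[i] > R[j] of [1 + s[j] - s[i]]_+ .

    A pair costs nothing once the more relevant document's score beats the
    less relevant one's by a margin of at least 1.
    """
    return sum(max(0.0, 1.0 + s[j] - s[i])
               for i in range(len(R)) for j in range(len(R)) if R[i] > R[j])
```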
Why not directly optimize performance measures?
- Let RL be some actual loss you want to minimize (e.g., 1 − NDCG)
- Then why not solve the following?

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} RL( argsort(X_n w), R_n )

- argsort(t) is a ranking σ that puts the elements of t in decreasing sorted order:

    t_{σ⁻¹(1)} ≥ t_{σ⁻¹(2)} ≥ . . . ≥ t_{σ⁻¹(m)}
Why not directly optimize performance measures?
The function

    w ↦ RL( argsort(Xw), R )

is not a nice one from an optimization perspective:
- In the neighborhood of any point, it is either flat or discontinuous
- It can take only one of at most m! different values
- Minimizing a sum of such functions is computationally intractable
A smoothing approach
Instead of a deterministic score s = Xw, think of smoothed score distributions:

    S ∼ N(s, σ² I_{m×m})

This gives us a smoothed ranking loss

    RL_σ(s, R) = E_S[ RL(argsort(S), R) ]

We can then optimize (using, say, gradient descent):

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} RL_σ(X_n w, R_n)
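The smoothed loss can be estimated by Monte Carlo; a Python sketch (names mine, including the toy `top_one_loss` used only for illustration):

```python
import random

def smoothed_loss(s, R, ranking_loss, sigma=1.0, num_samples=2000, seed=0):
    """Monte Carlo estimate of RL_sigma(s, R) = E[ RL(argsort(S), R) ],
    with S ~ N(s, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        S = [si + rng.gauss(0.0, sigma) for si in s]        # perturbed scores
        order = sorted(range(len(S)), key=lambda i: -S[i])  # docs best-to-worst
        perm = [0] * len(S)
        for rank, doc in enumerate(order, start=1):
            perm[doc] = rank                                # perm = argsort(S)
        total += ranking_loss(perm, R)
    return total / num_samples

def top_one_loss(perm, R):
    """Toy ranking loss: 0 if the top-ranked document is a most relevant one."""
    return 0.0 if R[perm.index(1)] == max(R) else 1.0
```

Larger σ blurs the flat/discontinuous landscape more, which is exactly the trade-off the next slide discusses.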
Choosing the smoothing parameter
- As σ grows, the optimization problem becomes easier and we're less prone to overfitting
- As σ shrinks, the smoothed loss approximates the underlying loss better
- Annealing approach of SmoothRank:
  - Initialize w_0 (e.g., using linear regression) and choose a large σ_1
  - For t = 1, 2, . . .: starting from w_{t−1}, solve the regularized RL_{σ_t} minimization using (some version of) gradient descent; let the new solution be w_t and set σ_{t+1} = σ_t / 2
Convex surrogates
- The ranking SVM problem is convex, hence easy to optimize, but has no direct relation to a performance measure
- The smoothing approach starts from a performance measure but solves smooth, non-convex problems (it can get stuck in a local optimum)
- Given a hard-to-optimize performance measure, can we indirectly minimize it via some convex surrogate?
Recap
We can try to reduce ranking to another ML problem: e.g., regression orclassification
We can try to boost weak rankers into stronger ones
The margin maximization approach of SVMs leads naturally to RankingSVM
Ranking SVM solves regularized loss minimization
We can directly try to minimize the target performance measure (hard because of flatness/discontinuity; might get stuck in a local minimum even after smoothing)
Can we solve convex problems (no non-global local minima) and yetretain a principled connection to a target performance measure?
Outline
1. Learning to rank as supervised ML
   Features · Label and Output Spaces · Performance Measures · Ranking functions
2. A brief survey of ranking methods
   Reduction approach · Boosting approach · Large margin approach · Optimization approach
3. Theory for learning to rank
   Generalization error bounds · Consistency
4. Pointers to advanced topics
   Online learning to rank · Beyond relevance: diversity and freshness · Large-scale learning to rank
Theory: still a long way to go
- An amazing amount of applied work exists
- Yet, a lot of theoretical work remains to be done
“However, sometimes benchmark experiments are not asreliable as expected due to the small scales of the training andtest data. In this situation, a theory is needed to guarantee theperformance of an algorithm on infinite unseen data.”
–O. Chapelle, Y. Chang, and T.-Y. Liu“Future Directions in Learning to Rank” (2011)
Statistical Ranking Theory: Challenges
- Combinatorial nature of the output space: the size of the prediction space is m! (m = number of things being ranked)
- No canonical performance measure: many choices (ERR, MAP, (N)DCG); even with a fixed choice, how to come up with "good" convex surrogates?
- Dealing with diverse types of feedback: relevance scores, pairwise feedback, click-through rates, preference DAGs
- Large-scale challenges: guarantees for parallel, distributed, online/streaming approaches
Our focus
- Human feedback in the form of relevance scores
- Ranking methods based on regularized convex loss minimization:

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} ℓ(X_n w, R_n)

- A few commonly used performance measures: NDCG, MAP
- Generalization error bounds and consistency
Standard Setup
Standard assumption in theory: examples
(X_1, R_1), …, (X_N, R_N)

are drawn i.i.d. from an underlying (unknown) distribution
Consider the constrained form of loss minimization:

ŵ = argmin_{‖w‖₂ ≤ B}  (1/N) Σ_{n=1}^N ℓ(X_n w, R_n)
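The constrained minimizer above can be approximated numerically with projected gradient descent: take gradient steps on the empirical risk, then project w back onto the ball ‖w‖₂ ≤ B. A minimal numpy sketch (an illustration, not from the tutorial), using a squared surrogate ℓ(s, R) = ‖s − R‖² and made-up synthetic data, since the slides leave ℓ abstract:

```python
import numpy as np

def project_ball(w, B):
    """Project w onto the Euclidean ball of radius B."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def erm_projected_gd(X_list, R_list, B, lr=0.01, steps=500):
    """Constrained ERM via projected gradient descent.

    Minimizes (1/N) sum_n loss(X_n w, R_n) over ||w||_2 <= B, with the
    illustrative squared surrogate loss(s, R) = ||s - R||^2.
    """
    N = len(X_list)
    w = np.zeros(X_list[0].shape[1])
    for _ in range(steps):
        grad = sum(2.0 * X.T @ (X @ w - R) for X, R in zip(X_list, R_list)) / N
        w = project_ball(w - lr * grad, B)
    return w

# Synthetic data: 20 queries, 5 documents each, 3 features per (query, doc) pair
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.8])                    # hypothetical "true" scorer
X_list = [rng.normal(size=(5, 3)) for _ in range(20)]
R_list = [X @ w_true + 0.1 * rng.normal(size=5) for X in X_list]
w_hat = erm_projected_gd(X_list, R_list, B=1.0)
```

The projection step is what enforces ‖w‖₂ ≤ B; everything else is plain gradient descent on the empirical risk.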
Risk
Risk of w: performance of w on “infinite unseen data”

R(w) = E_{X,R}[ℓ(Xw, R)]

Note: cannot be computed without knowledge of the underlying distribution
Empirical risk of w: performance of w on the training set

R̂(w) = (1/N) Σ_{n=1}^N ℓ(X_n w, R_n)
Note: can be computed using just the training data
Generalization error bound
We say that ŵ is an empirical risk minimizer (ERM):

ŵ = argmin_{‖w‖₂ ≤ B} R̂(w)

We’re typically interested in bounds on the generalization error R(ŵ)
For instance, we expect the generalization error R(ŵ) to be larger than the empirical risk R̂(ŵ), but by how much?
A downward bias
Let us show that R̂(ŵ) is, in expectation, less than R(ŵ)
Let w⋆ be the scoring function with the least expected loss:

w⋆ = argmin_{‖w‖₂ ≤ B} R(w)

R̂(ŵ) ≤ R̂(w⋆)            by definition of ERM
E[R̂(ŵ)] ≤ E[R̂(w⋆)]      taking expectations
E[R̂(ŵ)] ≤ R(w⋆)          ∵ ∀w, E[R̂(w)] = R(w)
E[R̂(ŵ)] ≤ E[R(ŵ)]        by definition of w⋆
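The downward bias is easy to see numerically. The sketch below (an illustration, not from the tutorial) runs an unconstrained least-squares ERM on many simulated training sets and compares the average training risk R̂(ŵ) with a large held-out estimate of the true risk R(ŵ):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_train, n_test, trials = 5, 30, 2000, 200
w_star = rng.normal(size=p)                             # hypothetical true scorer

train_risks, true_risks = [], []
for _ in range(trials):
    X = rng.normal(size=(n_train, p))
    y = X @ w_star + rng.normal(size=n_train)           # noisy relevance scores
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # ERM under squared loss
    train_risks.append(np.mean((X @ w_hat - y) ** 2))   # empirical risk of ERM
    Xt = rng.normal(size=(n_test, p))                   # fresh data approximates
    yt = Xt @ w_star + rng.normal(size=n_test)          # the true risk
    true_risks.append(np.mean((Xt @ w_hat - yt) ** 2))
```

On average the training risk comes out below the held-out risk, which is the E[R̂(ŵ)] ≤ E[R(ŵ)] inequality in action.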
Generalization error bound
We might therefore want a generalization error bound ofthe form
R(ŵ) ≤ R̂(ŵ) + stuff
[Figure: risk R(w) and empirical risk R̂(w) plotted as curves against w, marking ŵ, w⋆, and the gap R(ŵ) − R̂(ŵ).]
Generalization error bound
A useless generalization error bound:
R(ŵ) ≤ R̂(ŵ) + |R(ŵ) − R̂(ŵ)|

Better?

R(ŵ) ≤ R̂(ŵ) + max_{‖w‖₂ ≤ B} |R(w) − R̂(w)|
Bootstrapping
The difference R(ŵ) − R̂(ŵ) compares the expected loss of ŵ under

(X, R) ∼ underlying distribution   vs.   (X, R) ∼ (X_n, R_n) each with weight 1/N

Let us replace the underlying distribution with the sample, and the sample with a (weighted) sub-sample!

(X, R) ∼ (X_n, R_n) each with weight 1/N   vs.   (X, R) ∼ (X_n, R_n) with weight 2/N if ε_n = +1, and 0 if ε_n = −1

where the ε_n are symmetric signed Bernoulli, or Rademacher, random variables
Weighted subsample: an example
Suppose original sample with weights is
(X_1, R_1) : 1/3,  (X_2, R_2) : 1/3,  (X_3, R_3) : 1/3

Suppose ε_1 = +1, ε_2 = −1, ε_3 = +1
Then the weighted sub-sample is

(X_1, R_1) : 2/3,  (X_3, R_3) : 2/3
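A quick numeric check of the reweighting (an illustrative sketch, not from the tutorial): drawing the ε_n and forming the 2/N-or-0 weights shows that each point's weight is 1/N in expectation, matching the original sample weights:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 3, 200_000
eps = rng.choice([-1, 1], size=(trials, N))    # Rademacher signs, one row per draw
weights = np.where(eps == 1, 2.0 / N, 0.0)     # weight 2/N if eps_n = +1, else 0
avg_weight = weights.mean(axis=0)              # Monte Carlo estimate of E[weight]
```

Each coordinate of `avg_weight` is close to 1/N = 1/3, so the symmetrized sub-sample matches the uniform sample weights in expectation.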
Towards a generalization error bound
Define the weighted average over the sub-sample:

R̂′(w) = (1/N) Σ_{n=1}^N (1 + ε_n) ℓ(X_n w, R_n)

Replace

max_{‖w‖₂ ≤ B} |R(w) − R̂(w)|

with

max_{‖w‖₂ ≤ B} |R̂(w) − R̂′(w)|
A data-dependent generalization error bound
With probability at least 1− δ
R(ŵ) ≤ R̂(ŵ) + max_{‖w‖₂ ≤ B} |R̂(w) − R̂′(w)| + √(log(1/δ)/N)

(the first term shrinks as B ↑, the second grows as B ↑)
Can be used for data-dependent selection of B
Rademacher complexity
Note:
R̂(w) − R̂′(w) = (1/N) Σ_{n=1}^N (1 − (1 + ε_n)) ℓ(X_n w, R_n) = −(1/N) Σ_{n=1}^N ε_n ℓ(X_n w, R_n)

Therefore, with high probability,

R(ŵ) ≤ R̂(ŵ) + max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^N ε_n ℓ(X_n w, R_n) | + √(log(1/δ)/N)

The quantity

E_{ε_1,…,ε_N} [ max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^N ε_n ℓ(X_n w, R_n) | ]

is called the Rademacher complexity
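Rademacher complexity is easy to estimate by Monte Carlo when the inner maximum has a closed form. The sketch below (an illustration, not from the tutorial) does this for the plain linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B} rather than the loss-composed class, since there Cauchy-Schwarz gives the maximum exactly:

```python
import numpy as np

def rademacher_complexity_linear(X, B, trials=2000, seed=0):
    """Monte Carlo estimate of E_eps[ max_{||w||_2 <= B} |(1/N) sum_n eps_n <w, x_n>| ].

    For the linear class the inner maximum has the closed form
    (B/N) * || sum_n eps_n x_n ||_2, by Cauchy-Schwarz.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    vals = [B / N * np.linalg.norm(rng.choice([-1.0, 1.0], size=N) @ X)
            for _ in range(trials)]
    return float(np.mean(vals))

rng = np.random.default_rng(3)
N, p, B = 100, 10, 2.0
X = rng.normal(size=(N, p))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_n||_2 <= R = 1
est = rademacher_complexity_linear(X, B)
```

With B = 2, R = 1, and N = 100, the estimate lands just below B·R/√N = 0.2, consistent with the scaling discussed below.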
How does Rademacher complexity scale?
Rademacher complexity depends on
Loss function ℓ: more “complicated” loss =⇒ bigger complexity
Size B of the w’s: larger B =⇒ bigger complexity
Size of features, say ‖φ(q, d_i)‖₂ ≤ R: larger R =⇒ bigger complexity
Sample size N: larger N =⇒ smaller complexity

Roughly speaking, scales as:

Lip(ℓ) · B · R · √(log(1/δ)/N)
Lipschitz constant
Fix an arbitrary relevance vector R
How much does ℓ(s, R) change as we change s?
The Lipschitz constant Lip(ℓ) is the smallest L such that

∀R, |ℓ(s, R) − ℓ(s′, R)| ≤ L · ‖s − s′‖∞

where ‖s − s′‖∞ = max_{i=1,…,m} |s_i − s′_i|
Another type of generalization bound
So far we have seen bounds of the form
R(ŵ) ≤ R̂(ŵ) + stuff

However, R̂(ŵ) ≤ R̂(w⋆), and R̂(w⋆) concentrates around R(w⋆)
Therefore, it is also easy to derive bounds of the form

R(ŵ) ≤ R(w⋆) + stuff
Generalization bounds: recap
Risk R(w) is the expected loss of w on infinite, unseen data
Empirical risk R̂(w) is the average loss of w on the training data
A ŵ that minimizes loss on the training data is called an empirical risk minimizer (ERM)
Saw two types of generalization error bounds:

R(ŵ) ≤ R̂(ŵ) + stuff

R(ŵ) ≤ R(w⋆) + stuff    (R(w⋆) is the minimum possible risk)
Rademacher complexity can be used to give data dependentgeneralization bounds for ERM
Richer function classes for more data
As we see more data, it is natural to enlarge the function class over which the ERM is computed:

f̂_n = argmin_{f ∈ F_n} (1/N) Σ_{n=1}^N ℓ(f(X_n), R_n) = argmin_{f ∈ F_n} R̂(f)

where f(X_n) = ( f(φ(q_n, d_{n,1})), …, f(φ(q_n, d_{n,m})) )

E.g., as n grows, F_n = larger decision trees, deeper neural networks, SVMs with less regularization
Estimation error vs approximation error
Let f⋆(F_n) be the best function in F_n and f⋆ be the best overall function:

f⋆(F_n) = argmin_{f ∈ F_n} R(f),   f⋆ = argmin_f R(f)

We can decompose the excess risk:

R(f̂_n) − R(f⋆) = [R(f̂_n) − R(f⋆(F_n))]  +  [R(f⋆(F_n)) − R(f⋆)]
                  (estimation error)        (approximation error)
Consistency
If F_n grows too quickly, the estimation error will not go to zero
If F_n grows too slowly, the approximation error will not go to zero
By judiciously growing F_n, we can guarantee that both errors go to zero
This results in consistency:

R(f̂_n) → R(f⋆) as n → ∞
What did we achieve?
We said that we will use a nice (smooth/convex) loss ℓ as a surrogate and train a scoring function f̂_n on finite data using ℓ:

f̂_n = argmin_{f ∈ F_n} (1/N) Σ_{n=1}^N ℓ(f(X_n), R_n)

Enlarging F_n neither too quickly nor too slowly with n will give us

R(f̂_n) → R(f⋆) as n → ∞

Great, but how good is f̂_n according to ranking performance measures (NDCG, ERR, AP)?
Risk under ranking loss
Recall that R(f) is the risk of f under the surrogate ℓ:

R(f) = E_{X,R}[ℓ(f(X), R)]

Similarly, there is L(f), the risk of f under a target ranking loss RL:

L(f) = E_{X,R}[RL(argsort(f(X)), R)]

argsort is required because the first argument of RL has to be a ranking, not a score vector
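As a concrete instance (an illustrative sketch, not from the tutorial; the 2^r − 1 gain and log₂ discount are one common convention, not fixed by the slides), here is the target loss 1 − NDCG computed from a score vector via argsort:

```python
import numpy as np

def dcg(ranking, rel):
    """DCG of a ranking (ranking[i] = index of the document placed at position i),
    with gain 2^r - 1 and a 1/log2(position + 1) discount."""
    gains = 2.0 ** rel[ranking] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    return float(gains @ discounts)

def ndcg_loss(scores, rel):
    """Target ranking loss 1 - NDCG for the ranking induced by the scores."""
    ranking = np.argsort(-scores)   # argsort: highest score placed first
    ideal = np.argsort(-rel)        # the ideal ranking sorts by relevance
    return 1.0 - dcg(ranking, rel) / dcg(ideal, rel)

rel = np.array([3.0, 1.0, 2.0])
perfect = ndcg_loss(np.array([9.0, 1.0, 5.0]), rel)  # scores in the same order as rel
worse = ndcg_loss(np.array([1.0, 9.0, 5.0]), rel)    # scores that invert the order
```

Scores that sort the documents in relevance order give loss 0; any other order gives a strictly positive loss.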
Calibration
Fix a target loss RL (e.g., 1 − NDCG) and a surrogate ℓ (e.g., the Ranking SVM loss)
Suppose we know

R(f̂_n) → min_f R(f) as n → ∞

Does this automatically guarantee the following?

L(f̂_n) → min_f L(f) as n → ∞

If yes, then we say that the surrogate ℓ is calibrated w.r.t. the target loss RL
Calibration in binary classification
Consider the simpler setting of binary classification, where X_i ∈ R^p and Y_i ∈ {+1, −1}
We learn real-valued functions f : R^p → R and take the sign to output a class label
Given a surrogate ℓ (hinge loss, logistic loss), we can define

R(f) = E_{X,Y}[ℓ(f(X), Y)]

The target loss is often just the 0-1 loss:

L(f) = E_{X,Y}[1(Y f(X) ≤ 0)]
Calibration in binary classification
Suppose ℓ(f(X), Y) = φ(Y f(X)) (a margin loss)
For a margin-based convex surrogate ℓ, when does

R(f̂_n) → min_f R(f) as n → ∞

imply

L(f̂_n) → min_f L(f) as n → ∞ ?
Surprisingly simple answer: φ′(0) < 0
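The φ′(0) < 0 condition is easy to check numerically with central differences (an illustrative sketch): the hinge, logistic, and squared margin losses all satisfy it, so all three are calibrated for the 0-1 loss:

```python
import numpy as np

# Common margin-based surrogates phi(t), evaluated at the margin t = Y f(X)
surrogates = {
    "hinge": lambda t: np.maximum(0.0, 1.0 - t),
    "logistic": lambda t: np.log1p(np.exp(-t)),
    "squared": lambda t: (1.0 - t) ** 2,
}

h = 1e-6
derivs = {name: (phi(h) - phi(-h)) / (2.0 * h)   # central difference at t = 0
          for name, phi in surrogates.items()}
```

All three derivatives come out negative (hinge: −1, logistic: −1/2, squared: −2), so the calibration criterion holds for each surrogate.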
Examples of calibrated surrogates
[Figure: the 0-1 loss together with the hinge, logistic, and squared surrogate losses, plotted against the margin Y f(X).]
Calibration w.r.t. DCG
Recall that DCG is defined as:
DCG(σ, R) = Σ_{i=1}^m f(R(i)) g(σ(i))

Let η be a distribution over relevance vectors R
The maximizer σ(η) of

σ ↦ E_{R∼η}[DCG(σ, R)]

is argsort(E_{R∼η}[f(R)])
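This optimality is easy to verify by brute force for small m (an illustrative check; the distribution η and the gain f(r) = 2^r − 1 below are made-up choices): compare the argsort ordering against an exhaustive search over all m! orderings:

```python
import numpy as np
from itertools import permutations

def dcg(order, gains):
    """DCG of an ordering: order[i] is the document placed at position i."""
    disc = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    return float(gains[list(order)] @ disc)

rng = np.random.default_rng(4)
m = 4
rels = rng.integers(0, 4, size=(6, m)).astype(float)   # support of eta: 6 relevance vectors
probs = rng.dirichlet(np.ones(6))                      # their probabilities under eta
gain = lambda r: 2.0 ** r - 1.0                        # the gain function f

def exp_dcg(order):
    """Expected DCG of a fixed ordering under eta."""
    return sum(p * dcg(order, gain(r)) for p, r in zip(probs, rels))

exp_gain = probs @ gain(rels)                          # E_{R~eta}[f(R)], per document
sort_val = exp_dcg(np.argsort(-exp_gain))              # ordering from argsort
best_val = max(exp_dcg(o) for o in permutations(range(m)))  # brute-force optimum
```

The argsort ordering attains the brute-force maximum: expected DCG is linear in the per-document expected gains, so sorting them against the decreasing discount is optimal (the rearrangement inequality).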
Calibration w.r.t. DCG
Consider the score vector s(η) that minimizes

s ↦ E_{R∼η}[ℓ(s, R)]

Necessary and sufficient condition for DCG calibration: for every η, s(η) has the same sorted order as E_{R∼η}[f(R)]

A similar condition can be derived for NDCG by taking the normalization into account
Calibration w.r.t. DCG: Example
Consider the regression based surrogate
ℓ(s, R) = Σ_{i=1}^m (s_i − f(R_i))²

It is easy to see that s(η) is exactly E_{R∼η}[f(R)]
Thus, this regression based surrogate is DCG calibrated
Calibration w.r.t. ERR, AP
The (N)DCG case might lead one to believe that convex calibrated surrogates always exist
Surprisingly, this is not the case!
It is known that there do not exist any convex calibrated surrogates for ERR and AP
Caveat: the non-existence result assumes that we’re using argsort to move from scores to rankings
Consistency: recap
Consistency is an asymptotic (sample size N → ∞) property of learning algorithms

A consistent algorithm’s performance reaches that of the “best” ranker as it sees more and more data

A learning algorithm that minimizes a surrogate ℓ has to balance the estimation error vs. approximation error trade-off
If trade-off is managed well, it will have consistency in the sense of `
If this also implies consistency w.r.t. a target loss RL, we say ` iscalibrated w.r.t. RL
Calibration in ranking is trickier than in classification: easy in somecases (e.g. NDCG), not so easy in others (e.g., ERR, MAP)
Outline
1 Learning to rank as supervised ML
Features
Label and Output Spaces
Performance Measures
Ranking functions
2 A brief survey of ranking methods
Reduction approach
Boosting approach
Large margin approach
Optimization approach
3 Theory for learning to rank
Generalization error bounds
Consistency
4 Pointers to advanced topics
Online learning to rank
Beyond relevance: diversity and freshness
Large-scale learning to rank
The online protocol
Instead of training on a fixed batch dataset, ranking could occur in a more online/incremental setting:

For t = 1, …, N:
  Ranker sees feature vectors X_t ∈ R^{m×p}
  Ranker outputs a scoring function f_t : R^p → R
  Relevance vector R_t is revealed
  Ranker suffers loss ℓ(f_t(X_t), R_t)
Regret
In online settings, we often look at regret:
Σ_{t=1}^N ℓ(f_t(X_t), R_t)  −  min_{f ∈ F} Σ_{t=1}^N ℓ(f(X_t), R_t)
(algorithm’s loss)            (best loss in hindsight)

If ℓ is convex and we’re using linear scoring functions f(x) = w⊤x, then we can use tools from online convex optimization (OCO)
Great OCO introduction here: http://ocobook.cs.princeton.edu/
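A minimal OCO-style sketch (an illustration, not from the tutorial): online projected gradient descent with linear scorers and a squared per-round loss, measuring regret against the best fixed w in hindsight. The comparator here is the unconstrained least-squares solution, which only makes the regret harder to keep small:

```python
import numpy as np

def online_gd_regret(X_seq, R_seq, B, eta):
    """Online projected gradient descent with linear scorers s_t = X_t w_t and
    an illustrative squared loss; returns cumulative regret vs the best fixed w."""
    w = np.zeros(X_seq[0].shape[1])
    alg_loss = 0.0
    for X, R in zip(X_seq, R_seq):
        alg_loss += np.sum((X @ w - R) ** 2)   # suffer this round's loss
        grad = 2.0 * X.T @ (X @ w - R)         # gradient of the round's loss
        w = w - eta * grad
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm                      # project back onto ||w||_2 <= B
    # best fixed comparator in hindsight (unconstrained least squares here)
    Xall, Rall = np.vstack(X_seq), np.concatenate(R_seq)
    w_best, *_ = np.linalg.lstsq(Xall, Rall, rcond=None)
    return alg_loss - np.sum((Xall @ w_best - Rall) ** 2)

rng = np.random.default_rng(5)
T, m, p = 400, 5, 3
w_true = rng.normal(size=p)
w_true /= np.linalg.norm(w_true)
X_seq = [rng.normal(size=(m, p)) for _ in range(T)]
R_seq = [X @ w_true + 0.1 * rng.normal(size=m) for X in X_seq]
regret = online_gd_regret(X_seq, R_seq, B=2.0, eta=0.01)
```

The cumulative regret grows sublinearly in T for this kind of algorithm, so the average per-round regret vanishes as T grows.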
Diversity
Focusing solely on relevance can lead to redundancy in search results
A query might be ambiguous: jaguar (car vs. animal)
Simply listing highly relevant results for one facet of the query isn’t good
Need to encourage diversity in the search results
New performance measures/methods/theoretical framework needed
Freshness
Consider ranking news items or tweets for queries involving recent events
Need to promote more recent webpages, but only if the query is indeed recency sensitive
Different time scales: an annual event (year) vs. breaking news (days/hours)
Possible approach: classify the query as recency-sensitive or not, and then route it to an appropriate ranker
Large-scale learning to rank
Gradient descent batch optimization of a regularized objective

λ‖w‖₂² + Σ_{n=1}^N ℓ(X_n w, R_n)

can easily be parallelized on a cluster
Ensemble-based methods like bagging can train rankers on different partitions of the data in parallel
These ideas have been used for supervised ML but not much in ranking
See the Apache Spark project: https://spark.apache.org/docs/latest/mllib-guide.html
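The "partial gradients per partition, then combine" idea behind cluster parallelization can be sketched in miniature with threads (an illustration; a real cluster setup, e.g. Spark, distributes the partitions across machines):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_grad(chunk, w):
    """Gradient contribution of one data partition (illustrative squared surrogate)."""
    g = np.zeros_like(w)
    for X, R in chunk:
        g += 2.0 * X.T @ (X @ w - R)
    return g

def grad_parallel(data, w, lam, workers=4):
    """Gradient of lam*||w||^2 + sum_n loss(X_n w, R_n): partial sums over
    partitions are computed in parallel, then combined."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        total = sum(ex.map(lambda c: partial_grad(c, w), chunks))
    return 2.0 * lam * w + total

rng = np.random.default_rng(6)
data = [(rng.normal(size=(4, 3)), rng.normal(size=4)) for _ in range(40)]
w = rng.normal(size=3)
lam = 0.5
g_par = grad_parallel(data, w, lam)
g_seq = 2.0 * lam * w + sum(2.0 * X.T @ (X @ w - R) for X, R in data)
```

The parallel and sequential gradients agree up to floating-point reordering, which is exactly why the batch objective is embarrassingly parallel over the data.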
Summary
Ranking problems arise in a variety of contexts
Learning to rank can be, and has been, viewed as a supervised ML problem
We surveyed a wide landscape of ranking methods
We discussed generalization error bounds for, and consistency of, ranking methods
We saw a few advanced topics in learning to rank that are being actively investigated
Suggested activities
Get the LETOR 3.0 or 4.0 data sets and try to reproduce the reported baselines
Download the large MSLR-WEB30k data set and see how long it takes to run your favorite learning to rank algorithm on it
Read Chapter 9 (Ranking) of Foundations of Machine Learning by Mohri, Rostamizadeh, and Talwalkar. Solve as many exercises as you can!
Read Chapter 11 (Learning to rank) of Boosting: Foundations and Algorithms by Schapire and Freund. Solve as many exercises as you can!
Thanks
Thank you for your attention!
Questions/comments: [email protected]