Learning to rank as supervised ML · A brief survey of ranking methods · Theory for learning to rank · Pointers to advanced topics · Summary
Tutorial on Learning to Rank
Ambuj Tewari
Department of Statistics and Department of EECS, University of Michigan
January 13, 2015 / MLSS Austin
Ambuj Tewari Learning to Rank Tutorial
Outline
1. Learning to rank as supervised ML
   Features · Label and Output Spaces · Performance Measures · Ranking functions
2. A brief survey of ranking methods
   Reduction approach · Boosting approach · Large margin approach · Optimization approach
3. Theory for learning to rank
   Generalization error bounds · Consistency
4. Pointers to advanced topics
   Online learning to rank · Beyond relevance: diversity and freshness · Large-scale learning to rank
Rankings are everywhere
- Web search: PageRank and HITS
- Sports: FIFA rankings (int'l soccer), ICC rankings (int'l cricket), FIDE rankings (int'l chess), BCS ratings (college football)
- Academia: US News & World Report, Times Higher Education
- Democracy and development: voting systems, UN Human Development Index
- Recommender systems: movies (Netflix, IMDB), music (Pandora), consumer goods (Amazon)
Learning to rank objects
- Creating a ranking requires a lot of effort
- Can we have computers learn to rank objects?
- Will still need some human supervision
- Pose learning to rank as a supervised ML problem
Ranking as a canonical learning problem
- NIPS Workshops: 2002, 2005, 2009, 2014
- SIGIR Workshops: 2007, 2008, 2009
- Journal of Information Retrieval special issue in 2010
- Yahoo! Learning to Rank Challenge: 2010
  http://jmlr.csail.mit.edu/proceedings/papers/v14/chapelle11a/chapelle11a.pdf
- LEarning TO Rank (LETOR) and Microsoft Learning to Rank (MSLR) datasets: 2007-2010
  http://research.microsoft.com/en-us/um/beijing/projects/letor/
Learning to rank as supervised ML
Supervised Machine Learning: construct a mapping

    f : I → O

based on input-output pairs (the training set)

    (X_1, Y_1), . . . , (X_N, Y_N)

where X_n ∈ I and Y_n ∈ O.
- Want f(X) to be a good predictor of Y for unseen X
- In ranking problems, O is the set of all possible rankings of a given set of objects (webpages, movies, etc.)
A learning-to-rank training example
Query: “MLSS Austin”
Candidate webpages and relevance scores:
- https://www.cs.utexas.edu/mlss: 2 or “good”
- http://mlss.cc/: 2 or “good”
- https://mailman.cc.gatech.edu/.../000796.html: 1 or “fair”
- www.councilofmls.com: 0 or “bad”
Think: example X = ( query, (URL1, URL2, . . ., URL4) ) and label R = (2, 2, 1, 0)
Typical ML cycle
1. Feature construction: find a way to map (query, webpage) into R^p
   Each example (query, m webpages) gets mapped to X_n ∈ I = R^{m×p}
2. Obtain labels: get R_n ∈ {0, 1, 2, . . . , R_max}^m from human judgements
3. Train ranker: get f : I → O = S_m, where S_m is the set of m-permutations
   Training is usually done by solving some optimization problem on the training set
4. Evaluate performance: using some performance measure, evaluate the ranker on a “test set”
Joint query, document features
Number of query terms that occur in document
Document length
Sum/min/max/mean/variance of term frequencies
Sum/min/max/mean/variance of tf-idf
Cosine similarity between query and document
Any model of P(R = 1|Q,D) gives a feature (e.g., BM25)
Number of slashes in URL, length of URL, PageRank, site-level PageRank
Query-URL click count, user-URL click count
A slightly more general supervised ML setting
- For a future query and webpage-list, we want to rank the webpages
- Don't necessarily want to predict the relevance scores
- Let f (the ranking function) take values in the space of rankings
- So now, training examples (X, Y) ∈ I × L where L = label space
- The function f to be learned maps I to O (note L ≠ O)
- E.g. (1, 2, 2, 0, 0) ∈ L and (d1 → 3, d2 → 1, d3 → 2, d4 → 4, d5 → 5) ∈ O
Notation for rankings
- We'll think of m (the number of documents) as fixed, but in reality it varies over examples
- We'll denote a ranking using a permutation σ ∈ S_m (the set of all m! permutations)
- σ(i) is the rank/position of document i
- σ⁻¹(j) is the document at rank/position j
Notation for rankings
Suppose we have 3 documents d1, d2, d3 and we want them ranked thus: d2, then d3, then d1.
- So the ranks are d1 : 3, d2 : 1, d3 : 2
- This means σ(1) = 3, σ(2) = 1, σ(3) = 2
- Also, σ⁻¹(1) = 2, σ⁻¹(2) = 3, σ⁻¹(3) = 1
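This bookkeeping is easy to get wrong, so here is the same example as a tiny Python sketch (variable names are mine):

```python
# sigma maps document -> rank; both are 1-indexed as on the slide
sigma = {1: 3, 2: 1, 3: 2}  # d1 has rank 3, d2 has rank 1, d3 has rank 2

# sigma^{-1} maps rank -> document, obtained by flipping each pair
sigma_inv = {rank: doc for doc, rank in sigma.items()}

assert sigma_inv[1] == 2  # rank 1 is held by d2
assert sigma_inv[2] == 3  # rank 2 is held by d3
assert sigma_inv[3] == 1  # rank 3 is held by d1
```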
Notation for relevance scores/labels
- Relevance labels can be binary or multi-graded
- In the binary case, documents are either relevant or irrelevant (1 or 0)
- In the multi-graded case, relevance labels are from {0, 1, 2, . . . , R_max}
- Typically, R_max ≤ 4 (at most 5 levels of relevance)
- A relevance score/label vector R is an m-vector
- R(i) gives the relevance of document i
Performance measures in ranking
- Performance measures can be gains (to be maximized) or losses (to be minimized)
- They take a relevance vector R and a ranking σ as arguments and produce a non-negative gain/loss
- Some performance measures are defined only for binary relevance
- Others can handle more general multi-graded relevance
Performance measure: RR
- Reciprocal rank (RR) is defined for binary relevance only
- RR is the reciprocal of the rank of the first relevant document according to the ranking

    RR(σ, R) = 1 / min{ σ(i) : R(i) = 1 }

Example: R = (0, 1, 0, 1), σ = (1, 3, 4, 2), then RR = 1/2

    original    ranked
    d1 : 0      d1 : 0
    d2 : 1      d4 : 1   ← RR = 1/2
    d3 : 0      d2 : 1
    d4 : 1      d3 : 0
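As a sanity check, the definition translates into a few lines of Python (a sketch; the function name and list encoding are mine — `sigma` is a 0-indexed list of 1-based ranks, `R` a 0-indexed list of binary relevances):

```python
def reciprocal_rank(sigma, R):
    """RR(sigma, R) = 1 / min{ sigma[i] : R[i] = 1 }.

    sigma[i] is the 1-based rank of document i; R[i] is its binary relevance.
    """
    return 1.0 / min(sigma[i] for i in range(len(R)) if R[i] == 1)

# slide example: R = (0,1,0,1), sigma = (1,3,4,2) -> RR = 1/2
rr = reciprocal_rank([1, 3, 4, 2], [0, 1, 0, 1])
```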
Performance measure: ERR
Expected Reciprocal rank (ERR) generalizes RR to multi-gradedrelevance
Convert relevance scores into probability (e.g., dividing by Rmax)
Imagine a user going down the ranked list and stopping withprobability given by relevance scores
ERR is the expected reciprocal rank of the document at whichthe user stops
Example: R = (0, 2, 0, 1), σ = (1, 3, 4, 2), then ERR = (1/2)·(1/2) + (1/3)·(1/2) = 5/12

    original    ranked    cond. stopping prob.    stopping prob.
    d1 : 0      d1 : 0    0                       0
    d2 : 2      d4 : 1    1/2                     1 · 1/2
    d3 : 0      d2 : 2    2/2                     1 · 1/2 · 2/2
    d4 : 1      d3 : 0    0                       0

(using g(r) = r / R_max with R_max = 2)
Performance measure: ERR
    ERR(σ, R) = Σ_{j=1}^{m} (1/j) · [ Π_{k=1}^{j−1} (1 − g(R(σ⁻¹(k)))) ] · g(R(σ⁻¹(j)))

where the product is the probability of not stopping earlier and g(R(σ⁻¹(j))) is the probability of stopping at position j.
- Recall: R(σ⁻¹(j)) is the relevance of the document at position j
- g is any function used to convert relevance scores into stopping probabilities
- E.g., g(r) = r / R_max or g(r) = (2^r − 1) / 2^{R_max}
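The formula can be checked against the slides' example R = (0, 2, 0, 1), σ = (1, 3, 4, 2) with a short Python sketch (function and argument names are mine):

```python
def err(sigma, R, g):
    """ERR(sigma, R) = sum_j (1/j) * prod_{k<j}(1 - g(R[s(k)])) * g(R[s(j)]),
    where s(j) = sigma^{-1}(j) is the document at rank j."""
    m = len(R)
    sigma_inv = [0] * m
    for doc, rank in enumerate(sigma):
        sigma_inv[rank - 1] = doc          # document sitting at each rank
    total, p_reached = 0.0, 1.0            # p_reached = prob. of not stopping earlier
    for j in range(1, m + 1):
        p_stop = g(R[sigma_inv[j - 1]])    # conditional stopping prob. at rank j
        total += (1.0 / j) * p_reached * p_stop
        p_reached *= 1.0 - p_stop
    return total
```

With g(r) = r/2 this reproduces the worked value 5/12.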
Performance measure: Precision@K
- Precision@K is for binary relevance only
- Precision@K is the fraction of relevant documents in the top K

    Prec@K(σ, R) = (1/K) Σ_{j=1}^{K} R(σ⁻¹(j))

    original    ranked    Prec@K
    d1 : 0      d1 : 0    0/1
    d2 : 1      d4 : 1    1/2
    d3 : 0      d2 : 1    2/3
    d4 : 1      d3 : 0    2/4
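A direct transcription in Python (a sketch with my own names), reproducing the Prec@K column above:

```python
def precision_at_k(sigma, R, K):
    """Prec@K(sigma, R) = (1/K) * sum_{j=1..K} R[sigma^{-1}(j)] for binary R."""
    sigma_inv = [0] * len(R)
    for doc, rank in enumerate(sigma):
        sigma_inv[rank - 1] = doc          # document at each rank
    return sum(R[sigma_inv[j]] for j in range(K)) / K
```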
Performance measure: AP
- Average Precision (AP) is also for binary relevance
- It is the average of Prec@K over the positions of relevant documents only

    AP(σ, R) = ( 1 / Σ_{i=1}^{m} R(i) ) · Σ_{j : R(σ⁻¹(j)) = 1} Prec@j

    original    ranked    Prec@K
    d1 : 0      d1 : 0    0/1
    d2 : 1      d4 : 1    1/2 ←
    d3 : 0      d2 : 1    2/3 ←     AP = (1/2 + 2/3) / 2
    d4 : 1      d3 : 0    2/4
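The same example gives AP = (1/2 + 2/3)/2; a Python sketch (names mine):

```python
def average_precision(sigma, R):
    """AP(sigma, R): mean of Prec@j over the ranks j holding a relevant document."""
    m = len(R)
    sigma_inv = [0] * m
    for doc, rank in enumerate(sigma):
        sigma_inv[rank - 1] = doc
    precs, hits = [], 0
    for j in range(1, m + 1):
        if R[sigma_inv[j - 1]] == 1:       # relevant document at rank j
            hits += 1
            precs.append(hits / j)         # Prec@j = relevant hits in top j, over j
    return sum(precs) / len(precs)
```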
Performance measure: DCG
- Discounted Cumulative Gain (DCG) is for multi-graded relevance
- Contributions of relevant documents further down the ranking are discounted more

    DCG(σ, R) = Σ_{i=1}^{m} f(R(i)) / g(σ(i))

- f, g are increasing functions
- Standard choice: f(r) = 2^r − 1 and g(j) = log₂(1 + j)
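With the standard choices of f and g, DCG is a one-liner; a Python sketch (names mine):

```python
import math

def dcg(sigma, R):
    """DCG(sigma, R) = sum_i (2^R[i] - 1) / log2(1 + sigma[i]),
    using the standard f(r) = 2^r - 1 and g(j) = log2(1 + j)."""
    return sum((2 ** R[i] - 1) / math.log2(1 + sigma[i]) for i in range(len(R)))
```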
Performance measure: NDCG
- Normalized DCG (NDCG), like DCG, is for multi-graded relevance
- As the name suggests, NDCG normalizes DCG to keep the gain bounded by 1

    NDCG(σ, R) = DCG(σ, R) / Z(R)

- The normalization constant Z(R) = max_σ DCG(σ, R) can be computed quickly by sorting the relevance vector
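A sketch in Python (names mine); the normalizer is obtained by sorting documents by decreasing relevance, as noted above:

```python
import math

def ndcg(sigma, R):
    """NDCG(sigma, R) = DCG(sigma, R) / max over permutations of DCG."""
    def dcg_val(s):
        return sum((2 ** R[i] - 1) / math.log2(1 + s[i]) for i in range(len(R)))
    # the maximizing permutation ranks documents by decreasing relevance
    order = sorted(range(len(R)), key=lambda i: -R[i])
    ideal = [0] * len(R)
    for rank, doc in enumerate(order, start=1):
        ideal[doc] = rank
    return dcg_val(sigma) / dcg_val(ideal)
```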
Ranking functions
A ranking function maps m query-document p-dimensional feature vectors into a permutation:

    X = [ φ(q, d1)ᵀ ; . . . ; φ(q, dm)ᵀ ]   (an m×p matrix)   →   e.g. d5 ⇔ σ(5) = 1, . . . , d12 ⇔ σ(12) = m

So mathematically it is a map from R^{m×p} to S_m.
Ranking functions from scoring functions
- However, it is hard to learn functions that map into combinatorial spaces
- In binary classification, we often learn a real-valued function and threshold it to output a class label (positive/negative)
- In ranking, we do the same but replace thresholding by sorting
- For example, let f : R^p → R be a scoring function:

    [ φ(q, d1)ᵀ ; φ(q, d2)ᵀ ; φ(q, d3)ᵀ ]  —f→  (−2.3, 4.2, 1.9)  —sort→  d2 ⇔ σ(2) = 1, d3 ⇔ σ(3) = 2, d1 ⇔ σ(1) = 3
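The sort step can be written once and reused; a Python sketch (name mine) reproducing the example's permutation:

```python
def rank_by_scores(scores):
    """Sort documents by decreasing score; return sigma with sigma[i] = rank of doc i."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # best doc first
    sigma = [0] * len(scores)
    for rank, doc in enumerate(order, start=1):
        sigma[doc] = rank
    return sigma

# scores (-2.3, 4.2, 1.9) give sigma = (3, 1, 2), as on the slide
sigma = rank_by_scores([-2.3, 4.2, 1.9])
```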
Recap
Training data consists of query-document features and relevance scores:

    X = [ φ(q, d1)ᵀ ; . . . ; φ(q, dm)ᵀ ]   (an m×p matrix),    R = (R(1), . . . , R(m))

We will use the training data to learn a scoring function f that is then used to rank documents for a new query:

    (s1, . . . , sm) = ( f(φ(q_new, d_new_1)), . . . , f(φ(q_new, d_new_m)) )

Scores can be sorted to get a ranking whose performance will be evaluated using a measure such as ERR, AP, or NDCG.
Outline
1. Learning to rank as supervised ML
   Features · Label and Output Spaces · Performance Measures · Ranking functions
2. A brief survey of ranking methods
   Reduction approach · Boosting approach · Large margin approach · Optimization approach
3. Theory for learning to rank
   Generalization error bounds · Consistency
4. Pointers to advanced topics
   Online learning to rank · Beyond relevance: diversity and freshness · Large-scale learning to rank
Many nice, innovative methods have been developedA partial list:
AdaRank, BoltzRank, Cranking, FRank, Infinite Push, LambdaMART, LambdaRank, ListMLE, ListNet, McRank, MPRank, p-norm push, OWPC, PermuRank, PRank, RankBoost, Ranking SVM, RankNet, SmoothRank, SoftRank, SVM-Combo, SVM-MAP, SVM-MRR, SVM-NDCG, SVRank
The landscape of ideas
- Reduction: reduce ranking to a simpler problem (e.g., classification or regression)
- Boosting: combine weak rankers and boost them into strong ones
- Large margin approach: extend the "large-margin" story behind SVMs to the ranking case
- Direct/indirect optimization: choose an appropriate optimization problem based on the performance measure one wishes to optimize
Not all ranking methods neatly fall into one of these categories!
Regression
- Perhaps the simplest approach: regress the relevance score on query-document features
- Feed examples of the form (φ(q, d_i), R(i)) to your favorite regression algorithm
- Advantage: can use already-built regression tools
- Disadvantage: there's no real reason for us to predict relevance scores; we're just building a ranking function
Linear regression example
The simplest scoring functions are linear functions:

    f(φ(q, d)) = wᵀφ(q, d) = Σ_{j=1}^{p} w_j φ_j(q, d)

Using linear scoring functions in a regression approach reduces ranking to linear regression:

    ŵ = argmin_w Σ_{n=1}^{N} Σ_{i=1}^{m} ( R_n(i) − wᵀφ(q_n, d_{n,i}) )²

For a new query and its documents, output a ranking by sorting (wᵀφ(q, d_i))_{i=1}^{m}.
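A minimal pure-Python sketch of this reduction (gradient descent on the squared loss rather than a closed-form least-squares solve; all names are illustrative):

```python
def fit_regression_ranker(examples, p, lr=0.1, epochs=200):
    """Minimize sum over (phi, r) pairs of (w.phi - r)^2 by gradient descent.

    examples: list of (phi, r) with phi a length-p feature list and r the
    relevance score. A sketch of the reduction idea, not a tuned implementation.
    """
    w = [0.0] * p
    for _ in range(epochs):
        grad = [0.0] * p
        for phi, r in examples:
            err = sum(wj * xj for wj, xj in zip(w, phi)) - r
            for j in range(p):
                grad[j] += 2 * err * phi[j]      # d/dw_j of the squared error
        w = [wj - lr * gj / len(examples) for wj, gj in zip(w, grad)]
    return w
```

Sorting a new query's documents by decreasing wᵀφ(q, d_i) then yields the ranking.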
Classification
- If d_i has higher relevance than d_j for a query q, then feed examples of the form (φ(q, d_i) − φ(q, d_j), +1) to your favorite classification algorithm
- Advantage: can use already-built classification tools
- Disadvantage: not clear how to use pairwise classification to create a ranking
Logistic regression example
A logistic regression approach reduces ranking to:

    ŵ = argmin_w Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} ℓ_log( wᵀ(φ(q, d_i) − φ(q, d_j)), +1 )

ℓ_log is the logistic loss:

    ℓ_log(wᵀx, y) = log(1 + exp(−y · wᵀx))

Other losses could also be used:

    ℓ_hinge(wᵀx, y) = max{0, 1 − y · wᵀx}
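The inner objective can be sketched in Python (names mine; this computes the summed pairwise logistic loss for one query, which an optimizer would then minimize over w):

```python
import math

def pairwise_logistic_loss(w, features, R):
    """Sum of log(1 + exp(-(w.phi_i - w.phi_j))) over pairs with R[i] > R[j].

    features[i] is phi(q, d_i). Correctly ordered pairs contribute a small
    loss; badly mis-ordered pairs a large one.
    """
    scores = [sum(wj * xj for wj, xj in zip(w, phi)) for phi in features]
    loss = 0.0
    for i in range(len(R)):
        for j in range(len(R)):
            if R[i] > R[j]:
                loss += math.log(1.0 + math.exp(-(scores[i] - scores[j])))
    return loss
```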
Logistic regression: producing a ranking
- On a new triple (q, d_i, d_j), the trained classifier can tell you whether d_i is more relevant than d_j for the query q
- How to resolve conflicts (such as cycles) and output a ranking of d1, . . . , dm? Run quicksort!
  - Choose a pivot document d_i at random
  - Use the classifier to partition the other documents into more relevant (D_>) and less relevant (D_<) than d_i
  - Return quicksort(D_>), d_i, quicksort(D_<)
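A sketch of this quicksort in Python (names mine; `prefer(a, b)` stands for the trained pairwise classifier):

```python
import random

def quicksort_rank(docs, prefer):
    """Order docs from most to least relevant using a pairwise preference function.

    prefer(a, b) -> True if the classifier says a is more relevant than b.
    Even if prefer has cycles, quicksort still outputs some total order.
    """
    if len(docs) <= 1:
        return list(docs)
    pivot_idx = random.randrange(len(docs))
    pivot = docs[pivot_idx]
    rest = docs[:pivot_idx] + docs[pivot_idx + 1:]
    more = [d for d in rest if prefer(d, pivot)]      # D_>
    less = [d for d in rest if not prefer(d, pivot)]  # D_<
    return quicksort_rank(more, prefer) + [pivot] + quicksort_rank(less, prefer)
```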
Boosting in classification
- Simple rules of thumb can often predict a little better than random guessing: "$" in subject line ⇒ email is spam
- How to combine simple rules of thumb into a powerful classifier (say with 90% accuracy)?
- Adaboost: perhaps the most popular boosting algorithm
Adaboost
- Training data (X_1, Y_1), . . . , (X_N, Y_N)
- Start from the uniform distribution P_1(n) = 1/N for all 1 ≤ n ≤ N
- At iteration t = 1, 2, . . . , T:
  - Train a weak classifier h_t on the training data using weights from P_t
  - Choose a step-size α_t and set f_t = Σ_{s=1}^{t} α_s h_s
  - Update P_t to P_{t+1} so that P_{t+1} puts more weight on examples where h_t performs badly
- The final classifier is (the sign of) Σ_{t=1}^{T} α_t h_t
Adarank
- Training data (X_1, R_1), . . . , (X_N, R_N)
- Start from the uniform distribution P_1(n) = 1/N for all 1 ≤ n ≤ N
- At iteration t = 1, 2, . . . , T:
  - Train a weak ranker h_t on the training data using weights from P_t
  - Choose a step-size α_t and set f_t = Σ_{s=1}^{t} α_s h_s
  - Update P_t to P_{t+1} so that P_{t+1} puts more weight on examples where f_t performs badly
- The final ranker is (the sorted order given by) Σ_{t=1}^{T} α_t h_t
Adarank: weak ranker
- Assume that RL (ranking loss) is the target loss we wish to minimize (e.g., 1 − NDCG)
- The weak ranker can simply be a single feature (e.g., PageRank), chosen to minimize the weighted loss:

    argmin_{k ∈ {1, . . . , p}}  Σ_{n=1}^{N} P_t(n) · RL(X_n e_k, R_n)

- Here e_k is the unit vector with a 1 in the k-th position, so X_n e_k is the k-th column of X_n: the documents of query n scored by feature k alone
Adarank: distribution update
The next distribution over the training examples puts more weight on queries where f_t does badly:

    P_{t+1}(n) = exp( RL(f_t(X_n), R_n) ) / Z_{t+1}

(equivalently exp(−G(f_t(X_n), R_n)) up to normalization, where G = 1 − RL is the gain). The normalization Z_{t+1} ensures that we have a probability distribution:

    Z_{t+1} = Σ_{n=1}^{N} exp( RL(f_t(X_n), R_n) )
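The whole loop can be compressed into a sketch. This is illustrative, not the exact AdaRank specification: it uses single-feature weak rankers, a toy pairwise-accuracy gain in place of NDCG, and a simplified step-size formula; all names are mine:

```python
import math

def pair_accuracy(scores, R):
    """Toy performance measure: fraction of correctly ordered pairs (1 is best)."""
    pairs = [(i, j) for i in range(len(R)) for j in range(len(R)) if R[i] > R[j]]
    if not pairs:
        return 1.0
    return sum(scores[i] > scores[j] for i, j in pairs) / len(pairs)

def adarank(X_list, R_list, gain=pair_accuracy, T=5):
    """Sketch of the AdaRank-style boosting loop.

    X_list[n] is an m x p feature matrix (list of rows) for query n,
    R_list[n] its relevance vector, gain the performance measure G = 1 - RL.
    Returns a list of (alpha_t, feature index) pairs defining the final ranker.
    """
    N, p = len(X_list), len(X_list[0][0])
    P = [1.0 / N] * N                                   # uniform initial weights
    ensemble = []
    for _ in range(T):
        # weak ranker h_t: the single feature with the best weighted gain under P_t
        def weighted_gain(k):
            return sum(P[n] * gain([row[k] for row in X_list[n]], R_list[n])
                       for n in range(N))
        k = max(range(p), key=weighted_gain)
        g = weighted_gain(k)
        # step size: larger when the weak ranker's weighted gain is higher
        alpha = 0.5 * math.log((g + 1e-12) / (1.0 - g + 1e-12))
        ensemble.append((alpha, k))
        # re-weight: queries the current ensemble f_t ranks badly get more mass
        f_scores = [[sum(a * row[j] for a, j in ensemble) for row in X_list[n]]
                    for n in range(N)]
        wts = [math.exp(-gain(f_scores[n], R_list[n])) for n in range(N)]
        P = [wt / sum(wts) for wt in wts]
    return ensemble
```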
Support Vector Machine
SVMs solve the following problem (with ‖w‖₂² := wᵀw):

    min_w  ‖w‖₂² + C Σ_{i=1}^{n} ξ_i

subject to, for every example i,

    ξ_i ≥ 0,    Y_i wᵀX_i ≥ 1 − ξ_i

Here ‖w‖₂² is the L2 regularization term, C a tuning parameter, Y_i wᵀX_i the margin, and ξ_i a slack variable.
Ranking SVM
Consider a query q_n and documents d_{n,i} and d_{n,j} such that

    R_n(i) > R_n(j)

We would like to have

    wᵀφ(q_n, d_{n,i}) > wᵀφ(q_n, d_{n,j})

As in the SVM, we can introduce a slack variable ξ_{n,i,j} with the constraints

    ξ_{n,i,j} ≥ 0,    wᵀφ(q_n, d_{n,i}) ≥ wᵀφ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}
Ranking SVM
Ranking SVM solves the following problem:

    min_w  ‖w‖₂² + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} ξ_{n,i,j}

subject to the constraints

    ξ_{n,i,j} ≥ 0,    wᵀφ(q_n, d_{n,i}) ≥ wᵀφ(q_n, d_{n,j}) + 1 − ξ_{n,i,j}

This is actually also a reduction approach!
Ranking SVM as regularized loss minimization
- Note that the minimum possible value of ξ subject to ξ ≥ 0 and ξ ≥ t is max{t, 0} =: [t]₊
- This observation lets us write the ranking SVM optimization as

    min_w  ‖w‖₂² + C Σ_{n=1}^{N} Σ_{R_n(i) > R_n(j)} [ 1 + wᵀφ(q_n, d_{n,j}) − wᵀφ(q_n, d_{n,i}) ]₊

  where ‖w‖₂² is the regularizer and the inner sum is the loss on example n
Regularized loss minimization
Note that

    X_n w = ( φ(q_n, d_{n,1})ᵀw, . . . , φ(q_n, d_{n,m})ᵀw )

Ranking SVM problem (where λ = 1/C):

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} ℓ(X_n w, R_n)

The loss in ranking SVM is

    ℓ(s, R) = Σ_{R(i) > R(j)} [ 1 + s_j − s_i ]₊
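The per-query loss ℓ(s, R) is easy to transcribe; a Python sketch (name mine):

```python
def ranking_svm_loss(s, R):
    """loss(s, R) = sum over pairs with R[i] > R[j] of [1 + s[j] - s[i]]_+ .

    A pair costs nothing once the more relevant document's score beats the
    less relevant one's by a margin of at least 1.
    """
    return sum(max(0.0, 1.0 + s[j] - s[i])
               for i in range(len(R)) for j in range(len(R)) if R[i] > R[j])
```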
Why not directly optimize performance measures?
- Let RL be some actual loss you want to minimize (e.g., 1 − NDCG)
- Then why not solve the following?

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} RL( argsort(X_n w), R_n )

- argsort(t) is a ranking σ that puts the elements of t in decreasing sorted order:

    t_{σ⁻¹(1)} ≥ t_{σ⁻¹(2)} ≥ . . . ≥ t_{σ⁻¹(m)}
Why not directly optimize performance measures?
The function

    w ↦ RL( argsort(Xw), R )

is not a nice one from an optimization perspective:
- In the neighborhood of any point, it is either flat or discontinuous
- It can take only one of at most m! different values
- Minimizing a sum of such functions is computationally intractable
A smoothing approach
Instead of a deterministic score s = Xw, think of smoothed score distributions:

    S ∼ N(s, σ² I_{m×m})

This gives us a smoothed ranking loss

    RL_σ(s, R) = E_S[ RL(argsort(S), R) ]

We can then optimize (using, say, gradient descent):

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} RL_σ(X_n w, R_n)
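The smoothed loss can be estimated by Monte Carlo; a Python sketch (names mine, including the toy `top_one_loss` used only for illustration):

```python
import random

def smoothed_loss(s, R, ranking_loss, sigma=1.0, num_samples=2000, seed=0):
    """Monte Carlo estimate of RL_sigma(s, R) = E[ RL(argsort(S), R) ],
    with S ~ N(s, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        S = [si + rng.gauss(0.0, sigma) for si in s]        # perturbed scores
        order = sorted(range(len(S)), key=lambda i: -S[i])  # docs best-to-worst
        perm = [0] * len(S)
        for rank, doc in enumerate(order, start=1):
            perm[doc] = rank                                # perm = argsort(S)
        total += ranking_loss(perm, R)
    return total / num_samples

def top_one_loss(perm, R):
    """Toy ranking loss: 0 if the top-ranked document is a most relevant one."""
    return 0.0 if R[perm.index(1)] == max(R) else 1.0
```

Larger σ blurs the flat/discontinuous landscape more, which is exactly the trade-off the next slide discusses.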
Choosing the smoothing parameter
- As σ grows, the optimization problem becomes easier and we're less prone to overfitting
- As σ shrinks, the smoothed loss approximates the underlying loss better
- Annealing approach of SmoothRank:
  - Initialize w_0 (e.g., using linear regression) and choose a large σ_1
  - For t = 1, 2, . . .: starting from w_{t−1}, solve the regularized RL_{σ_t} minimization using (some version of) gradient descent; let the new solution be w_t and set σ_{t+1} = σ_t / 2
Convex surrogates
- The ranking SVM problem is convex, hence easy to optimize, but has no direct relation to a performance measure
- The smoothing approach starts from a performance measure but solves smooth, non-convex problems (it can get stuck in a local optimum)
- Given a hard-to-optimize performance measure, can we indirectly minimize it via some convex surrogate?
Recap
We can try to reduce ranking to another ML problem: e.g., regression orclassification
We can try to boost weak rankers into stronger ones
The margin maximization approach of SVMs leads naturally to RankingSVM
Ranking SVM solves regularized loss minimization
We can directly try to minimize the target performance measure (hard because of flatness/discontinuity; might get stuck in a local minimum even after smoothing)
Can we solve convex problems (no non-global local minima) and yetretain a principled connection to a target performance measure?
Outline
1. Learning to rank as supervised ML
   Features · Label and Output Spaces · Performance Measures · Ranking functions
2. A brief survey of ranking methods
   Reduction approach · Boosting approach · Large margin approach · Optimization approach
3. Theory for learning to rank
   Generalization error bounds · Consistency
4. Pointers to advanced topics
   Online learning to rank · Beyond relevance: diversity and freshness · Large-scale learning to rank
Theory: still a long way to go
- An amazing amount of applied work exists
- Yet, a lot of theoretical work remains to be done
“However, sometimes benchmark experiments are not asreliable as expected due to the small scales of the training andtest data. In this situation, a theory is needed to guarantee theperformance of an algorithm on infinite unseen data.”
–O. Chapelle, Y. Chang, and T.-Y. Liu“Future Directions in Learning to Rank” (2011)
Statistical Ranking Theory: Challenges
- Combinatorial nature of the output space: the size of the prediction space is m! (m = number of things being ranked)
- No canonical performance measure: many choices (ERR, MAP, (N)DCG); even with a fixed choice, how to come up with "good" convex surrogates?
- Dealing with diverse types of feedback: relevance scores, pairwise feedback, click-through rates, preference DAGs
- Large-scale challenges: guarantees for parallel, distributed, online/streaming approaches
Our focus
- Human feedback in the form of relevance scores
- Ranking methods based on regularized convex loss minimization:

    min_w  λ‖w‖₂² + Σ_{n=1}^{N} ℓ(X_n w, R_n)

- A few commonly used performance measures: NDCG, MAP
- Generalization error bounds and consistency
Standard Setup
Standard assumption in theory: examples
(X_1, R_1), …, (X_N, R_N)

are drawn i.i.d. from an underlying (unknown) distribution
Consider the constrained form of loss minimization:

ŵ = argmin_{‖w‖₂ ≤ B}  (1/N) Σ_{n=1}^N ℓ(X_n w, R_n)
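The constrained minimizer above can be approximated numerically with projected gradient descent: take gradient steps on the empirical risk, then project w back onto the ball ‖w‖₂ ≤ B. A minimal numpy sketch (an illustration, not from the tutorial), using a squared surrogate ℓ(s, R) = ‖s − R‖² and made-up synthetic data, since the slides leave ℓ abstract:

```python
import numpy as np

def project_ball(w, B):
    """Project w onto the Euclidean ball of radius B."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def erm_projected_gd(X_list, R_list, B, lr=0.01, steps=500):
    """Constrained ERM via projected gradient descent.

    Minimizes (1/N) sum_n loss(X_n w, R_n) over ||w||_2 <= B, with the
    illustrative squared surrogate loss(s, R) = ||s - R||^2.
    """
    N = len(X_list)
    w = np.zeros(X_list[0].shape[1])
    for _ in range(steps):
        grad = sum(2.0 * X.T @ (X @ w - R) for X, R in zip(X_list, R_list)) / N
        w = project_ball(w - lr * grad, B)
    return w

# Synthetic data: 20 queries, 5 documents each, 3 features per (query, doc) pair
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.8])                    # hypothetical "true" scorer
X_list = [rng.normal(size=(5, 3)) for _ in range(20)]
R_list = [X @ w_true + 0.1 * rng.normal(size=5) for X in X_list]
w_hat = erm_projected_gd(X_list, R_list, B=1.0)
```

The projection step is what enforces ‖w‖₂ ≤ B; everything else is plain gradient descent on the empirical risk.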
Risk
Risk of w: performance of w on “infinite unseen data”

R(w) = E_{X,R}[ℓ(Xw, R)]

Note: cannot be computed without knowledge of the underlying distribution
Empirical risk of w: performance of w on the training set

R̂(w) = (1/N) Σ_{n=1}^N ℓ(X_n w, R_n)
Note: can be computed using just the training data
Generalization error bound
We say that ŵ is an empirical risk minimizer (ERM):

ŵ = argmin_{‖w‖₂ ≤ B} R̂(w)

We’re typically interested in bounds on the generalization error R(ŵ)
For instance, we expect the generalization error R(ŵ) to be larger than the empirical risk R̂(ŵ), but by how much?
A downward bias
Let us show that R̂(ŵ) is, in expectation, less than R(ŵ)
Let w⋆ be the scoring function with the least expected loss:

w⋆ = argmin_{‖w‖₂ ≤ B} R(w)

R̂(ŵ) ≤ R̂(w⋆)            by definition of ERM
E[R̂(ŵ)] ≤ E[R̂(w⋆)]      taking expectations
E[R̂(ŵ)] ≤ R(w⋆)          ∵ ∀w, E[R̂(w)] = R(w)
E[R̂(ŵ)] ≤ E[R(ŵ)]        by definition of w⋆
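The downward bias is easy to see numerically. The sketch below (an illustration, not from the tutorial) runs an unconstrained least-squares ERM on many simulated training sets and compares the average training risk R̂(ŵ) with a large held-out estimate of the true risk R(ŵ):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_train, n_test, trials = 5, 30, 2000, 200
w_star = rng.normal(size=p)                             # hypothetical true scorer

train_risks, true_risks = [], []
for _ in range(trials):
    X = rng.normal(size=(n_train, p))
    y = X @ w_star + rng.normal(size=n_train)           # noisy relevance scores
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # ERM under squared loss
    train_risks.append(np.mean((X @ w_hat - y) ** 2))   # empirical risk of ERM
    Xt = rng.normal(size=(n_test, p))                   # fresh data approximates
    yt = Xt @ w_star + rng.normal(size=n_test)          # the true risk
    true_risks.append(np.mean((Xt @ w_hat - yt) ** 2))
```

On average the training risk comes out below the held-out risk, which is the E[R̂(ŵ)] ≤ E[R(ŵ)] inequality in action.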
Generalization error bound
We might therefore want a generalization error bound ofthe form
R(ŵ) ≤ R̂(ŵ) + stuff
[Figure: risk R(w) and empirical risk R̂(w) plotted as curves against w, marking ŵ, w⋆, and the gap R(ŵ) − R̂(ŵ).]
Generalization error bound
A useless generalization error bound:
R(ŵ) ≤ R̂(ŵ) + |R(ŵ) − R̂(ŵ)|

Better?

R(ŵ) ≤ R̂(ŵ) + max_{‖w‖₂ ≤ B} |R(w) − R̂(w)|
Bootstrapping
The difference R(ŵ) − R̂(ŵ) compares the expected loss of ŵ under

(X, R) ∼ underlying distribution   vs.   (X, R) ∼ (X_n, R_n) each with weight 1/N

Let us replace the underlying distribution with the sample, and the sample with a (weighted) sub-sample!

(X, R) ∼ (X_n, R_n) each with weight 1/N   vs.   (X, R) ∼ (X_n, R_n) with weight 2/N if ε_n = +1, and 0 if ε_n = −1

where the ε_n are symmetric signed Bernoulli, or Rademacher, random variables
Weighted subsample: an example
Suppose original sample with weights is
(X_1, R_1) : 1/3,  (X_2, R_2) : 1/3,  (X_3, R_3) : 1/3

Suppose ε_1 = +1, ε_2 = −1, ε_3 = +1
Then the weighted sub-sample is

(X_1, R_1) : 2/3,  (X_3, R_3) : 2/3
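A quick numeric check of the reweighting (an illustrative sketch, not from the tutorial): drawing the ε_n and forming the 2/N-or-0 weights shows that each point's weight is 1/N in expectation, matching the original sample weights:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 3, 200_000
eps = rng.choice([-1, 1], size=(trials, N))    # Rademacher signs, one row per draw
weights = np.where(eps == 1, 2.0 / N, 0.0)     # weight 2/N if eps_n = +1, else 0
avg_weight = weights.mean(axis=0)              # Monte Carlo estimate of E[weight]
```

Each coordinate of `avg_weight` is close to 1/N = 1/3, so the symmetrized sub-sample matches the uniform sample weights in expectation.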
Towards a generalization error bound
Define the weighted average over the sub-sample:

R̂′(w) = (1/N) Σ_{n=1}^N (1 + ε_n) ℓ(X_n w, R_n)

Replace

max_{‖w‖₂ ≤ B} |R(w) − R̂(w)|

with

max_{‖w‖₂ ≤ B} |R̂(w) − R̂′(w)|
A data-dependent generalization error bound
With probability at least 1− δ
R(ŵ) ≤ R̂(ŵ) + max_{‖w‖₂ ≤ B} |R̂(w) − R̂′(w)| + √(log(1/δ)/N)

(the first term shrinks as B ↑, the second grows as B ↑)
Can be used for data-dependent selection of B
Rademacher complexity
Note:
R̂(w) − R̂′(w) = (1/N) Σ_{n=1}^N (1 − (1 + ε_n)) ℓ(X_n w, R_n) = −(1/N) Σ_{n=1}^N ε_n ℓ(X_n w, R_n)

Therefore, with high probability,

R(ŵ) ≤ R̂(ŵ) + max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^N ε_n ℓ(X_n w, R_n) | + √(log(1/δ)/N)

The quantity

E_{ε_1,…,ε_N} [ max_{‖w‖₂ ≤ B} | (1/N) Σ_{n=1}^N ε_n ℓ(X_n w, R_n) | ]

is called the Rademacher complexity
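Rademacher complexity is easy to estimate by Monte Carlo when the inner maximum has a closed form. The sketch below (an illustration, not from the tutorial) does this for the plain linear class {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B} rather than the loss-composed class, since there Cauchy-Schwarz gives the maximum exactly:

```python
import numpy as np

def rademacher_complexity_linear(X, B, trials=2000, seed=0):
    """Monte Carlo estimate of E_eps[ max_{||w||_2 <= B} |(1/N) sum_n eps_n <w, x_n>| ].

    For the linear class the inner maximum has the closed form
    (B/N) * || sum_n eps_n x_n ||_2, by Cauchy-Schwarz.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    vals = [B / N * np.linalg.norm(rng.choice([-1.0, 1.0], size=N) @ X)
            for _ in range(trials)]
    return float(np.mean(vals))

rng = np.random.default_rng(3)
N, p, B = 100, 10, 2.0
X = rng.normal(size=(N, p))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # enforce ||x_n||_2 <= R = 1
est = rademacher_complexity_linear(X, B)
```

With B = 2, R = 1, and N = 100, the estimate lands just below B·R/√N = 0.2, consistent with the scaling discussed below.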
How does Rademacher complexity scale?
Rademacher complexity depends on
Loss function ℓ: more “complicated” loss =⇒ bigger complexity
Size B of the w’s: larger B =⇒ bigger complexity
Size of features, say ‖φ(q, d_i)‖₂ ≤ R: larger R =⇒ bigger complexity
Sample size N: larger N =⇒ smaller complexity

Roughly speaking, scales as:

Lip(ℓ) · B · R · √(log(1/δ)/N)
Lipschitz constant
Fix an arbitrary relevance vector R
How much does ℓ(s, R) change as we change s?
The Lipschitz constant Lip(ℓ) is the smallest L such that

∀R, |ℓ(s, R) − ℓ(s′, R)| ≤ L · ‖s − s′‖∞

where ‖s − s′‖∞ = max_{i=1,…,m} |s_i − s′_i|
Another type of generalization bound
So far we have seen bounds of the form
R(ŵ) ≤ R̂(ŵ) + stuff

However, R̂(ŵ) ≤ R̂(w⋆), and R̂(w⋆) concentrates around R(w⋆)
Therefore, it is also easy to derive bounds of the form

R(ŵ) ≤ R(w⋆) + stuff
Generalization bounds: recap
Risk R(w) is the expected loss of w on infinite, unseen data
Empirical risk R̂(w) is the average loss of w on the training data
A ŵ that minimizes loss on the training data is called an empirical risk minimizer (ERM)
Saw two types of generalization error bounds:

R(ŵ) ≤ R̂(ŵ) + stuff

R(ŵ) ≤ R(w⋆) + stuff    (R(w⋆) is the minimum possible risk)
Rademacher complexity can be used to give data dependentgeneralization bounds for ERM
Richer function classes for more data
As we see more data, it is natural to enlarge the function class over which the ERM is computed:

f̂_n = argmin_{f ∈ F_n} (1/N) Σ_{n=1}^N ℓ(f(X_n), R_n) = argmin_{f ∈ F_n} R̂(f)

where f(X_n) = ( f(φ(q_n, d_{n,1})), …, f(φ(q_n, d_{n,m})) )

E.g., as n grows, F_n = larger decision trees, deeper neural networks, SVMs with less regularization
Estimation error vs approximation error
Let f⋆(F_n) be the best function in F_n and f⋆ be the best overall function:

f⋆(F_n) = argmin_{f ∈ F_n} R(f),   f⋆ = argmin_f R(f)

We can decompose the excess risk:

R(f̂_n) − R(f⋆) = [R(f̂_n) − R(f⋆(F_n))]  +  [R(f⋆(F_n)) − R(f⋆)]
                  (estimation error)        (approximation error)
Consistency
If F_n grows too quickly, the estimation error will not go to zero
If F_n grows too slowly, the approximation error will not go to zero
By judiciously growing F_n, we can guarantee that both errors go to zero
This results in consistency:

R(f̂_n) → R(f⋆) as n → ∞
What did we achieve?
We said that we will use a nice (smooth/convex) loss ℓ as a surrogate and train a scoring function f̂_n on finite data using ℓ:

f̂_n = argmin_{f ∈ F_n} (1/N) Σ_{n=1}^N ℓ(f(X_n), R_n)

Enlarging F_n neither too quickly nor too slowly with n will give us

R(f̂_n) → R(f⋆) as n → ∞

Great, but how good is f̂_n according to ranking performance measures (NDCG, ERR, AP)?
Risk under ranking loss
Recall that R(f) is the risk of f under the surrogate ℓ:

R(f) = E_{X,R}[ℓ(f(X), R)]

Similarly, there is L(f), the risk of f under a target ranking loss RL:

L(f) = E_{X,R}[RL(argsort(f(X)), R)]

argsort is required because the first argument of RL has to be a ranking, not a score vector
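As a concrete instance (an illustrative sketch, not from the tutorial; the 2^r − 1 gain and log₂ discount are one common convention, not fixed by the slides), here is the target loss 1 − NDCG computed from a score vector via argsort:

```python
import numpy as np

def dcg(ranking, rel):
    """DCG of a ranking (ranking[i] = index of the document placed at position i),
    with gain 2^r - 1 and a 1/log2(position + 1) discount."""
    gains = 2.0 ** rel[ranking] - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    return float(gains @ discounts)

def ndcg_loss(scores, rel):
    """Target ranking loss 1 - NDCG for the ranking induced by the scores."""
    ranking = np.argsort(-scores)   # argsort: highest score placed first
    ideal = np.argsort(-rel)        # the ideal ranking sorts by relevance
    return 1.0 - dcg(ranking, rel) / dcg(ideal, rel)

rel = np.array([3.0, 1.0, 2.0])
perfect = ndcg_loss(np.array([9.0, 1.0, 5.0]), rel)  # scores in the same order as rel
worse = ndcg_loss(np.array([1.0, 9.0, 5.0]), rel)    # scores that invert the order
```

Scores that sort the documents in relevance order give loss 0; any other order gives a strictly positive loss.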
Calibration
Fix a target loss RL (e.g., 1 − NDCG) and a surrogate ℓ (e.g., the Ranking SVM loss)
Suppose we know

R(f̂_n) → min_f R(f) as n → ∞

Does this automatically guarantee the following?

L(f̂_n) → min_f L(f) as n → ∞

If yes, then we say that the surrogate ℓ is calibrated w.r.t. the target loss RL
Calibration in binary classification
Consider the simpler setting of binary classification, where X_i ∈ R^p and Y_i ∈ {+1, −1}
We learn real-valued functions f : R^p → R and take the sign to output a class label
Given a surrogate ℓ (hinge loss, logistic loss), we can define

R(f) = E_{X,Y}[ℓ(f(X), Y)]

The target loss is often just the 0-1 loss:

L(f) = E_{X,Y}[1(Y f(X) ≤ 0)]
Calibration in binary classification
Suppose ℓ(f(X), Y) = φ(Y f(X)) (a margin loss)
For a margin-based convex surrogate ℓ, when does

R(f̂_n) → min_f R(f) as n → ∞

imply

L(f̂_n) → min_f L(f) as n → ∞ ?
Surprisingly simple answer: φ′(0) < 0
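The φ′(0) < 0 condition is easy to check numerically with central differences (an illustrative sketch): the hinge, logistic, and squared margin losses all satisfy it, so all three are calibrated for the 0-1 loss:

```python
import numpy as np

# Common margin-based surrogates phi(t), evaluated at the margin t = Y f(X)
surrogates = {
    "hinge": lambda t: np.maximum(0.0, 1.0 - t),
    "logistic": lambda t: np.log1p(np.exp(-t)),
    "squared": lambda t: (1.0 - t) ** 2,
}

h = 1e-6
derivs = {name: (phi(h) - phi(-h)) / (2.0 * h)   # central difference at t = 0
          for name, phi in surrogates.items()}
```

All three derivatives come out negative (hinge: −1, logistic: −1/2, squared: −2), so the calibration criterion holds for each surrogate.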
Examples of calibrated surrogates
[Figure: the 0-1 loss together with the hinge, logistic, and squared surrogate losses, plotted against the margin Y f(X).]
Calibration w.r.t. DCG
Recall that DCG is defined as:
DCG(σ, R) = Σ_{i=1}^m f(R(i)) g(σ(i))

Let η be a distribution over relevance vectors R
The maximizer σ(η) of

σ ↦ E_{R∼η}[DCG(σ, R)]

is argsort(E_{R∼η}[f(R)])
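This optimality is easy to verify by brute force for small m (an illustrative check; the distribution η and the gain f(r) = 2^r − 1 below are made-up choices): compare the argsort ordering against an exhaustive search over all m! orderings:

```python
import numpy as np
from itertools import permutations

def dcg(order, gains):
    """DCG of an ordering: order[i] is the document placed at position i."""
    disc = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    return float(gains[list(order)] @ disc)

rng = np.random.default_rng(4)
m = 4
rels = rng.integers(0, 4, size=(6, m)).astype(float)   # support of eta: 6 relevance vectors
probs = rng.dirichlet(np.ones(6))                      # their probabilities under eta
gain = lambda r: 2.0 ** r - 1.0                        # the gain function f

def exp_dcg(order):
    """Expected DCG of a fixed ordering under eta."""
    return sum(p * dcg(order, gain(r)) for p, r in zip(probs, rels))

exp_gain = probs @ gain(rels)                          # E_{R~eta}[f(R)], per document
sort_val = exp_dcg(np.argsort(-exp_gain))              # ordering from argsort
best_val = max(exp_dcg(o) for o in permutations(range(m)))  # brute-force optimum
```

The argsort ordering attains the brute-force maximum: expected DCG is linear in the per-document expected gains, so sorting them against the decreasing discount is optimal (the rearrangement inequality).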
Calibration w.r.t. DCG
Consider the score vector s(η) that minimizes

s ↦ E_{R∼η}[ℓ(s, R)]

Necessary and sufficient condition for DCG calibration: for every η, s(η) has the same sorted order as E_{R∼η}[f(R)]

A similar condition can be derived for NDCG by taking the normalization into account
Calibration w.r.t. DCG: Example
Consider the regression based surrogate
ℓ(s, R) = Σ_{i=1}^m (s_i − f(R_i))²

It is easy to see that s(η) is exactly E_{R∼η}[f(R)]
Thus, this regression based surrogate is DCG calibrated
Calibration w.r.t. ERR, AP
The (N)DCG case might lead one to believe that convex calibrated surrogates always exist
Surprisingly, this is not the case!
It is known that there do not exist any convex calibrated surrogates for ERR and AP
Caveat: the non-existence result assumes that we’re using argsort to move from scores to rankings
Consistency: recap
Consistency is an asymptotic (sample size N → ∞) property of learning algorithms

A consistent algorithm’s performance reaches that of the “best” ranker as it sees more and more data

A learning algorithm that minimizes a surrogate ℓ has to balance the estimation error vs. approximation error trade-off
If trade-off is managed well, it will have consistency in the sense of `
If this also implies consistency w.r.t. a target loss RL, we say ` iscalibrated w.r.t. RL
Calibration in ranking is trickier than in classification: easy in somecases (e.g. NDCG), not so easy in others (e.g., ERR, MAP)
Outline
1 Learning to rank as supervised ML
Features
Label and Output Spaces
Performance Measures
Ranking functions
2 A brief survey of ranking methods
Reduction approach
Boosting approach
Large margin approach
Optimization approach
3 Theory for learning to rank
Generalization error bounds
Consistency
4 Pointers to advanced topics
Online learning to rank
Beyond relevance: diversity and freshness
Large-scale learning to rank
The online protocol
Instead of training on a fixed batch dataset, ranking could occur in a more online/incremental setting:

For t = 1, …, N:
  Ranker sees feature vectors X_t ∈ R^{m×p}
  Ranker outputs a scoring function f_t : R^p → R
  Relevance vector R_t is revealed
  Ranker suffers loss ℓ(f_t(X_t), R_t)
Regret
In online settings, we often look at regret:
Σ_{t=1}^N ℓ(f_t(X_t), R_t)  −  min_{f ∈ F} Σ_{t=1}^N ℓ(f(X_t), R_t)
(algorithm’s loss)            (best loss in hindsight)

If ℓ is convex and we’re using linear scoring functions f(x) = w⊤x, then we can use tools from online convex optimization (OCO)
Great OCO introduction here: http://ocobook.cs.princeton.edu/
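A minimal OCO-style sketch (an illustration, not from the tutorial): online projected gradient descent with linear scorers and a squared per-round loss, measuring regret against the best fixed w in hindsight. The comparator here is the unconstrained least-squares solution, which only makes the regret harder to keep small:

```python
import numpy as np

def online_gd_regret(X_seq, R_seq, B, eta):
    """Online projected gradient descent with linear scorers s_t = X_t w_t and
    an illustrative squared loss; returns cumulative regret vs the best fixed w."""
    w = np.zeros(X_seq[0].shape[1])
    alg_loss = 0.0
    for X, R in zip(X_seq, R_seq):
        alg_loss += np.sum((X @ w - R) ** 2)   # suffer this round's loss
        grad = 2.0 * X.T @ (X @ w - R)         # gradient of the round's loss
        w = w - eta * grad
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm                      # project back onto ||w||_2 <= B
    # best fixed comparator in hindsight (unconstrained least squares here)
    Xall, Rall = np.vstack(X_seq), np.concatenate(R_seq)
    w_best, *_ = np.linalg.lstsq(Xall, Rall, rcond=None)
    return alg_loss - np.sum((Xall @ w_best - Rall) ** 2)

rng = np.random.default_rng(5)
T, m, p = 400, 5, 3
w_true = rng.normal(size=p)
w_true /= np.linalg.norm(w_true)
X_seq = [rng.normal(size=(m, p)) for _ in range(T)]
R_seq = [X @ w_true + 0.1 * rng.normal(size=m) for X in X_seq]
regret = online_gd_regret(X_seq, R_seq, B=2.0, eta=0.01)
```

The cumulative regret grows sublinearly in T for this kind of algorithm, so the average per-round regret vanishes as T grows.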
Diversity
Focusing solely on relevance can lead to redundancy in search results
A query might be ambiguous: jaguar (car vs. animal)
Simply listing highly relevant results for one facet of the query isn’t good
Need to encourage diversity in the search results
New performance measures/methods/theoretical framework needed
Freshness
Consider ranking news items or tweets for queries involving recent events
Need to promote more recent webpages, but only if the query is indeed recency sensitive
Different time scales: an annual event (year) vs. breaking news (days/hours)
Possible approach: classify the query as recency-sensitive or not, and then route it to an appropriate ranker
Large-scale learning to rank
Gradient descent batch optimization of a regularized objective

λ‖w‖₂² + Σ_{n=1}^N ℓ(X_n w, R_n)

can easily be parallelized on a cluster
Ensemble-based methods like bagging can train rankers on different partitions of the data in parallel
These ideas have been used for supervised ML but not much in ranking
See the Apache Spark project: https://spark.apache.org/docs/latest/mllib-guide.html
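The "partial gradients per partition, then combine" idea behind cluster parallelization can be sketched in miniature with threads (an illustration; a real cluster setup, e.g. Spark, distributes the partitions across machines):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_grad(chunk, w):
    """Gradient contribution of one data partition (illustrative squared surrogate)."""
    g = np.zeros_like(w)
    for X, R in chunk:
        g += 2.0 * X.T @ (X @ w - R)
    return g

def grad_parallel(data, w, lam, workers=4):
    """Gradient of lam*||w||^2 + sum_n loss(X_n w, R_n): partial sums over
    partitions are computed in parallel, then combined."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        total = sum(ex.map(lambda c: partial_grad(c, w), chunks))
    return 2.0 * lam * w + total

rng = np.random.default_rng(6)
data = [(rng.normal(size=(4, 3)), rng.normal(size=4)) for _ in range(40)]
w = rng.normal(size=3)
lam = 0.5
g_par = grad_parallel(data, w, lam)
g_seq = 2.0 * lam * w + sum(2.0 * X.T @ (X @ w - R) for X, R in data)
```

The parallel and sequential gradients agree up to floating-point reordering, which is exactly why the batch objective is embarrassingly parallel over the data.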
Summary
Ranking problems arise in a variety of contexts
Learning to rank can be, and has been, viewed as a supervised ML problem
We surveyed a wide landscape of ranking methods
We discussed generalization error bounds for, and consistency of, ranking methods
We saw a few advanced topics in learning to rank that are being actively investigated
Suggested activities
Get the LETOR 3.0 or 4.0 data sets and try to reproduce the reported baselines
Download the large MSLR-WEB30k data set and see how long it takes to run your favorite learning to rank algorithm on it
Read Chapter 9 (Ranking) of Foundations of Machine Learning by Mohri, Rostamizadeh, and Talwalkar. Solve as many exercises as you can!
Read Chapter 11 (Learning to rank) of Boosting: Foundations and Algorithms by Schapire and Freund. Solve as many exercises as you can!
Thanks
Thank you for your attention!
Questions/comments: [email protected]