Learning to Rank
Temporal Information Retrieval
pointwise, pairwise and listwise approaches
Avishek Anand
Road so Far - Ranking in IR
• Ranking documents is important for coping with information overload: quickly finding the documents which are “relevant” for the query
• Interpretations and modelling of relevance
• Geometric interpretation — Vector Space Similarity, Okapi BM25

$$sim(d, q) = \frac{\sum_{i \in V} w_d(i)\, w_q(i)}{\sqrt{\sum_{i \in V} w_d(i)^2}\; \sqrt{\sum_{i \in V} w_q(i)^2}}$$

• Probabilistic interpretation — Probabilistic IR, Statistical LM

$$P(D|Q) \propto P(Q|D) \cdot P(D)$$

Jelinek-Mercer (linear interpolation) smoothing, with a document contribution and a corpus contribution:

$$P(Q|D) = \prod_{w_i \in Q} \big[\, \lambda \cdot P(w_i|D) + (1 - \lambda) \cdot P(w_i|C) \,\big]$$

Dirichlet smoothing:

$$P(Q|D) = \prod_{w_i \in Q} \frac{tf(w_i; D) + \mu \cdot P(w_i|C)}{|D| + \mu}$$
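As a concrete illustration of the Dirichlet-smoothed query likelihood above, here is a minimal Python sketch; the corpus statistics and documents are hypothetical toy values.

import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, corpus_term_probs, mu=2000.0):
    """Log query likelihood log P(Q|D) with Dirichlet smoothing."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = corpus_term_probs.get(w, 1e-9)  # background model P(w|C); floor avoids log(0)
        score += math.log((tf[w] + mu * p_wc) / (doc_len + mu))
    return score

# Hypothetical toy example
corpus_probs = {"learning": 0.01, "rank": 0.005, "retrieval": 0.008, "to": 0.05, "for": 0.04}
doc = "learning to rank for retrieval".split()
print(dirichlet_score(["learning", "rank"], doc, corpus_probs))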
Other Ranking features
• Link analysis — PageRank, HITS
• Temporal ranking
• Recency ranking
• Historical features
• Query similarity to the document title
• Anchor text similarity
• Proximity of the query terms in the document
• Any other features that you can think of?

How do you design a ranking function if you have many such features?
Learning for IR
• What if we can learn a model from a set of features?
• Focus on developing interesting and novel features, and less on how to put them together
• Why weren't learning approaches used earlier?
• Limited training data
• Machine learning techniques were not good enough
• Traditional IR methods had a small number of features and worked well
• Few features => easily hand-tunable
• Not many specialised IR problems
Supervised learning
• Given a set of training examples, how do we learn a model to predict, classify, or rank objects?
• Classical tasks: regression, classification
• Terminology: input space, output space, hypothesis, loss function
• Input space: (query, document) pairs
• Output space: {relevant, non-relevant}
Classification for IR
[Figure: relevant and non-relevant documents in feature space]

• Given a set of relevant and non-relevant documents, find the hypothesis which best separates them
• Minimises a loss - e.g. the number of misclassified documents, or convex surrogates of it (squared loss, hinge loss, cross-entropy)
• Some classical classification approaches: Naive Bayes, decision trees, SVMs
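A minimal sketch of this classification setup, using scikit-learn's LinearSVC as one off-the-shelf large-margin classifier; the (query, document) feature vectors and labels are hypothetical.

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature vectors for (query, document) pairs,
# e.g. [BM25 score, title similarity, PageRank]
X = np.array([[2.3, 0.8, 0.5],
              [0.1, 0.0, 0.2],
              [1.9, 0.6, 0.9],
              [0.3, 0.1, 0.1]])
y = np.array([1, -1, 1, -1])  # +1 relevant, -1 non-relevant

clf = LinearSVC(C=1.0)  # soft-margin linear SVM
clf.fit(X, y)
print(clf.decision_function([[1.5, 0.5, 0.4]]))  # signed distance to the separator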
Support Vector Machines

[Figure: relevant vs. non-relevant documents in feature space, with candidate separating hyperplanes]
Support Vector Machines
Large Margin Classifiers
Support Vector Machines
$$\langle w, x \rangle + b \le -1 \qquad\qquad \langle w, x \rangle + b \ge 1$$

$$f(x) = \langle w, x \rangle + b$$

linear function; $x$ is the feature vector for the (q, d) pair; $w$ is the normal vector of the separator; $\langle \cdot, \cdot \rangle$ is the dot product

Assume: linearly separable
Support Vector Machines
$$\langle w, x \rangle + b = -1 \qquad\qquad \langle w, x \rangle + b = 1$$

margin:

$$\left\langle x_+ - x_-,\, \frac{w}{2\|w\|} \right\rangle = \frac{1}{2\|w\|}\Big[ \big[\langle x_+, w \rangle + b\big] - \big[\langle x_-, w \rangle + b\big] \Big] = \frac{1}{\|w\|}$$
Support Vector Machines
$$\langle w, x \rangle + b = -1 \qquad\qquad \langle w, x \rangle + b = 1$$

optimization problem:

$$\underset{w,b}{\text{maximize}} \;\; \frac{1}{\|w\|} \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$
Support Vector Machines
$$\langle w, x \rangle + b = -1 \qquad\qquad \langle w, x \rangle + b = 1$$

optimization problem:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$

where $y_i = +1$ if relevant, $-1$ otherwise.
Support Vector Machines
• Weight vector w is a weighted linear combination of instances
• Only points on the margin matter (ignore the rest and get the same solution)
• Only inner products matter
• Quadratic program
• Keeps instances away from the margin
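The first and third bullets reflect the dual form of this quadratic program; as a standard sketch:

$$w = \sum_i \alpha_i\, y_i\, x_i \qquad\qquad f(x) = \sum_i \alpha_i\, y_i\, \langle x_i, x \rangle + b$$

with $\alpha_i > 0$ only for the support vectors, i.e. the points on the margin.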
Soft Margins
Convex optimization problem

$$\langle w, x \rangle + b \le -1 + \xi \qquad\qquad \langle w, x \rangle + b \ge 1 - \xi$$

Minimize the amount of slack.
Soft Margins
• Hard margin problem

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1$$

• With slack variables

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

What's the loss function here?
Loss Function Perspective
• Binary loss: $\mathbf{1}\{y f(x) < 0\}$, a step function of the margin $y f(x)$
• Soft margin loss (hinge loss): $\max(0,\, 1 - y f(x))$, a convex upper bound on the binary loss

[Figure: binary loss and hinge loss plotted against the margin y f(x)]
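A small sketch comparing the two losses as a function of the classifier output f(x), for a positive example (plain Python, hypothetical margin values):

def binary_loss(y, fx):
    return 1.0 if y * fx < 0 else 0.0  # 0-1 loss: 1 iff the prediction has the wrong sign

def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)  # convex upper bound on the 0-1 loss

for fx in [-1.5, -0.2, 0.5, 1.0, 2.0]:  # hypothetical outputs, true label y = +1
    print(fx, binary_loss(1, fx), hinge_loss(1, fx))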
Evaluation and Relevance Judgements
• Construct a test set containing a large number of (randomly sampled) queries, their associated documents, and a relevance judgement for each (query, document) pair
• A relevance judgement is an assessment of the relevance of a document for a query
• Degree of relevance
• Binary: relevant vs. irrelevant
• Multiple ordered categories: Perfect > Excellent > Good > Fair > Bad
• Pairwise preference
• Document A is more relevant than document B
• Total order
• Documents are ranked as {A, B, C, …} according to their relevance
Learning to Rank Framework
[Figure: the learning-to-rank framework]
Learning to Rank Approaches
• Pointwise approaches
• Input space: single documents — the feature vector of one (query, document) pair
• Output space: scores or relevance classes
• Pairwise approaches
• Input space: document pairs (d1, d2) such that d1 > d2
• Output space: a preference (yes/no) for a given document pair
• Listwise approaches
• Input space: the set of documents {d} for a query
• Output space: a permutation — a ranking of the documents
• Directly optimize an IR quality measure — MAP, NDCG, etc.
Pointwise Ranking - large margin ranking
• Margin: $\sum_k (b_k - a_k)$ (fixed margin strategy)
• Constraints: every document is correctly placed in its corresponding category

Original SVM formulation:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$
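The slide adapts the formulation above to ordered relevance categories. As one simple pointwise alternative, here is a minimal sketch of regression on graded labels (not the ordinal large-margin formulation itself; data and features are hypothetical, using scikit-learn's Ridge):

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical (query, document) feature vectors and graded relevance labels
X = np.array([[2.3, 0.8], [0.1, 0.0], [1.9, 0.6], [0.3, 0.1]])
grades = np.array([2, 0, 1, 0])

model = Ridge(alpha=1.0).fit(X, grades)
scores = model.predict(X)
print(np.argsort(-scores))  # documents ordered by predicted grade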
Problems with Pointwise ranking
• Some queries have a large number of documents and dominate the learned hypothesis
• Does not optimise standard IR measures like NDCG, MAP, ERR, etc.
• The position of a document in the ranked list is not modelled by the loss function
• Approaches might put more emphasis on unimportant documents
• Sampling is essential (if non-relevant documents >> relevant documents)
Pairwise Ranking

• Convert the input into preferences
• Use the preferences as training input
• Define a loss L(·) on the difference vector $(x_1 - x_2)$ and minimise the loss while learning f(·)

[Figure: a query with documents d1…d6, converted into preference pairs]

(d1 > d2), (d1 > d5), (d1 > d6), (d1 > d3), …

[RankBoost, 2003]
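A minimal sketch of the conversion step, turning hypothetical graded judgements for one query into preference pairs:

from itertools import combinations

# Hypothetical graded judgements for one query: document id -> relevance grade
grades = {"d1": 2, "d2": 1, "d3": 0, "d4": 1, "d5": 0, "d6": 0}

# Emit a preference (a > b) for every pair of documents with different grades
pairs = [(a, b) if ga > gb else (b, a)
         for (a, ga), (b, gb) in combinations(grades.items(), 2)
         if ga != gb]
print(pairs)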
Pairwise Ranking — Ranking SVM

Original SVM formulation:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

Ranking SVM:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \sum_{u,v} \xi^{(i)}_{u,v} \quad \text{subject to} \quad y^{(i)}\,\big[\langle w,\, x^{(i)}_u - x^{(i)}_v \rangle + b\big] \ge 1 - \xi^{(i)}_{u,v} \;\text{ and }\; \xi^{(i)}_{u,v} \ge 0$$

Each training instance is now a difference vector $x^{(i)}_u - x^{(i)}_v$ over a pair of documents $(u, v)$ for query $i$, with one slack variable per pair.
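A minimal sketch of this pairwise reduction, with scikit-learn's LinearSVC standing in for a dedicated Ranking SVM solver; features and grades are hypothetical:

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical features and grades for the documents of one query
X = np.array([[2.3, 0.8], [0.1, 0.0], [1.9, 0.6], [0.3, 0.1]])
g = np.array([2, 0, 1, 0])

# Difference vectors x_u - x_v for every pair with g[u] > g[v],
# plus the mirrored negative example to balance the classes
diffs, ys = [], []
for u in range(len(X)):
    for v in range(len(X)):
        if g[u] > g[v]:
            diffs.append(X[u] - X[v]); ys.append(1)
            diffs.append(X[v] - X[u]); ys.append(-1)

clf = LinearSVC(C=1.0, fit_intercept=False)  # the bias b cancels in the differences
clf.fit(np.array(diffs), np.array(ys))
print(X @ clf.coef_.ravel())  # w·x as the ranking score per document

Since the bias cancels when subtracting two constraints, the intercept can be dropped and the learned w scores single documents directly.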
Pros and Cons - Pairwise Approach
• Pros
• One step closer to ranking documents than predicting class labels
• Cons
• The distribution of document pairs is more skewed than the distribution of documents across queries
• Some queries have many more document pairs than others
• Still does not optimise IR measures
• Rank-ignorant — (d1 > d2) does not encode which ranks are being compared: rank 1 vs. rank 2, or rank 99 vs. rank 1000
Pairwise Ranking — IR SVM
IR SVM addresses these problems by re-weighting the hinge loss over document pairs: pairs involving important rank positions receive a higher cost, and each query's pairs are normalised so that queries with many pairs do not dominate.
Listwise Approaches
• Directly optimise the IR evaluation measures: NDCG, ERR, MAP
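For reference, a small sketch of NDCG@k, one of the measures listwise methods target directly; this uses the common exponential-gain, log-discount form, and the grades are toy values:

import math

def dcg(grades, k):
    # graded gain 2^g - 1, discounted by log2 of the rank position
    return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, k):
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1, 0], k=4))  # grades of one query's results, in ranked order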
Learning to Rank — In Action
1. Extract features from the document collection and the queries
2. Write them to a feature file
3. Create (train) a model
4. Execute the model on test data to obtain results
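As a small sketch of the glue between steps 2 and 3, here is a parser for one line of a feature file. The SVMlight/LETOR-style format assumed here is the one consumed by packages such as Ranking SVM and RankLib; an example follows on the next slide.

def parse_letor_line(line):
    """Parse one line of an SVMlight/LETOR-style feature file:
    '<grade> qid:<id> <feat>:<val> ... # comment' (format assumed here)."""
    line = line.split("#")[0].strip()  # drop the trailing comment
    parts = line.split()
    grade = int(parts[0])
    qid = parts[1].split(":")[1]
    feats = {int(k): float(v) for k, v in (p.split(":") for p in parts[2:])}
    return grade, qid, feats

print(parse_letor_line("2 qid:1 1:2.3 2:0.8 3:0.5 # doc1"))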
Learning to Rank — In Action
Example of a feature file
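The example itself is not preserved in the transcript; the following lines illustrate the SVMlight/LETOR-style format with hypothetical values. Each line holds a relevance grade, a query id, numbered feature:value pairs, and an optional comment.

2 qid:1 1:2.3 2:0.8 3:0.5 # doc1, query 1
0 qid:1 1:0.1 2:0.0 3:0.2 # doc2, query 1
1 qid:2 1:1.9 2:0.6 3:0.9 # doc7, query 2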
References and Further Readings
Books
Liu, Tie-Yan. Learning to Rank for Information Retrieval. Vol. 13. Springer, 2011.
Li, Hang. "Learning to Rank for Information Retrieval and Natural Language Processing." Synthesis Lectures on Human Language Technologies 4.1 (2011): 1-113.

Helpful pages
http://en.wikipedia.org/wiki/Learning_to_rank

Packages
Ranking SVM: http://svmlight.joachims.org/
RankLib: http://people.cs.umass.edu/~vdang/ranklib.html

Data sets
LETOR: http://research.microsoft.com/en-us/um/beijing/projects/letor/
Yahoo! Learning to Rank Challenge: http://learningtorankchallenge.yahoo.com