Learning to Rank
Temporal Information Retrieval
pointwise, pairwise and listwise approaches
Avishek Anand
Road so Far - Ranking in IR
• Ranking documents is important for coping with information overload: quickly finding the documents which are “relevant” for the query
• Interpretations and modelling of relevance
• Geometric interpretation — Vector Space Similarity, Okapi BM25

$$sim(d, q) = \frac{\sum_{i \in V} w_d(i)\, w_q(i)}{\sqrt{\sum_{i \in V} w_d(i)^2}\; \sqrt{\sum_{i \in V} w_q(i)^2}}$$

• Probabilistic interpretation — Probabilistic IR, Statistical LM

$$P(D|Q) \propto P(Q|D) \cdot P(D)$$

Jelinek-Mercer (linear interpolation) smoothing, with a document contribution and a corpus contribution:

$$P(Q|D) = \prod_{w_i \in Q} \big[\, \lambda \cdot P(w_i|D) + (1 - \lambda) \cdot P(w_i|C) \,\big]$$

Dirichlet smoothing:

$$P(Q|D) = \prod_{w_i \in Q} \frac{tf(w_i; D) + \mu \cdot P(w_i|C)}{|D| + \mu}$$
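As a concrete illustration of the Dirichlet-smoothed query likelihood above, here is a minimal Python sketch; the corpus statistics and documents are hypothetical toy values.

import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, corpus_term_probs, mu=2000.0):
    """Log query likelihood log P(Q|D) with Dirichlet smoothing."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = corpus_term_probs.get(w, 1e-9)  # background model P(w|C); floor avoids log(0)
        score += math.log((tf[w] + mu * p_wc) / (doc_len + mu))
    return score

# Hypothetical toy example
corpus_probs = {"learning": 0.01, "rank": 0.005, "retrieval": 0.008, "to": 0.05, "for": 0.04}
doc = "learning to rank for retrieval".split()
print(dirichlet_score(["learning", "rank"], doc, corpus_probs))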
Other Ranking features
• Link analysis — PageRank, HITS
• Temporal ranking
• Recency ranking
• Historical features
• Query similarity to the document title
• Anchor text similarity
• Proximity of the query terms in the document
• Any other features that you can think of?

How do you design a ranking function if you have many such features?
Learning for IR
• What if we can learn a model from a set of features?
• Focus on developing interesting and novel features, and less on how to put them together
• Why weren't learning approaches used earlier?
• Limited training data
• Machine learning techniques were not good enough
• Traditional IR methods had a small number of features and worked well
• Few features => easily hand-tunable
• Not many specialised IR problems
Supervised learning
• Given a set of training examples, how do we learn a model to predict, classify, or rank objects?
• Classical tasks: regression, classification
• Terminology: input space, output space, hypothesis, loss function
• Input space: (query, document) pairs
• Output space: {relevant, non-relevant}
Classification for IR
[Figure: relevant and non-relevant documents in feature space]

• Given a set of relevant and non-relevant documents, find the hypothesis which best separates them
• Minimises a loss - e.g. the number of misclassified documents, or convex surrogates of it (squared loss, hinge loss, cross-entropy)
• Some classical classification approaches: Naive Bayes, decision trees, SVMs
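A minimal sketch of this classification setup, using scikit-learn's LinearSVC as one off-the-shelf large-margin classifier; the (query, document) feature vectors and labels are hypothetical.

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature vectors for (query, document) pairs,
# e.g. [BM25 score, title similarity, PageRank]
X = np.array([[2.3, 0.8, 0.5],
              [0.1, 0.0, 0.2],
              [1.9, 0.6, 0.9],
              [0.3, 0.1, 0.1]])
y = np.array([1, -1, 1, -1])  # +1 relevant, -1 non-relevant

clf = LinearSVC(C=1.0)  # soft-margin linear SVM
clf.fit(X, y)
print(clf.decision_function([[1.5, 0.5, 0.4]]))  # signed distance to the separator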
Support Vector Machines

[Figure: relevant vs. non-relevant documents in feature space, with candidate separating hyperplanes]
Support Vector Machines
Large Margin Classifiers
Support Vector Machines
$$\langle w, x \rangle + b \le -1 \qquad\qquad \langle w, x \rangle + b \ge 1$$

$$f(x) = \langle w, x \rangle + b$$

linear function; $x$ is the feature vector for the (q, d) pair; $w$ is the normal vector of the separator; $\langle \cdot, \cdot \rangle$ is the dot product

Assume: linearly separable
Support Vector Machines
$$\langle w, x \rangle + b = -1 \qquad\qquad \langle w, x \rangle + b = 1$$

margin:

$$\left\langle x_+ - x_-,\, \frac{w}{2\|w\|} \right\rangle = \frac{1}{2\|w\|}\Big[ \big[\langle x_+, w \rangle + b\big] - \big[\langle x_-, w \rangle + b\big] \Big] = \frac{1}{\|w\|}$$
Support Vector Machines
$$\langle w, x \rangle + b = -1 \qquad\qquad \langle w, x \rangle + b = 1$$

optimization problem:

$$\underset{w,b}{\text{maximize}} \;\; \frac{1}{\|w\|} \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$
Support Vector Machines
$$\langle w, x \rangle + b = -1 \qquad\qquad \langle w, x \rangle + b = 1$$

optimization problem:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle x_i, w \rangle + b] \ge 1$$

where $y_i = +1$ if relevant, $-1$ otherwise.
Support Vector Machines
• Weight vector w is a weighted linear combination of instances
• Only points on the margin matter (ignore the rest and get the same solution)
• Only inner products matter
• Quadratic program
• Keeps instances away from the margin
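The first and third bullets reflect the dual form of this quadratic program; as a standard sketch:

$$w = \sum_i \alpha_i\, y_i\, x_i \qquad\qquad f(x) = \sum_i \alpha_i\, y_i\, \langle x_i, x \rangle + b$$

with $\alpha_i > 0$ only for the support vectors, i.e. the points on the margin.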
Soft Margins
Convex optimization problem

$$\langle w, x \rangle + b \le -1 + \xi \qquad\qquad \langle w, x \rangle + b \ge 1 - \xi$$

Minimize the amount of slack.
Soft Margins
• Hard margin problem

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1$$

• With slack variables

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

What's the loss function here?
Loss Function Perspective
• Binary loss: $\mathbf{1}\{y f(x) < 0\}$, a step function of the margin $y f(x)$
• Soft margin loss (hinge loss): $\max(0,\, 1 - y f(x))$, a convex upper bound on the binary loss

[Figure: binary loss and hinge loss plotted against the margin y f(x)]
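A small sketch comparing the two losses as a function of the classifier output f(x), for a positive example (plain Python, hypothetical margin values):

def binary_loss(y, fx):
    return 1.0 if y * fx < 0 else 0.0  # 0-1 loss: 1 iff the prediction has the wrong sign

def hinge_loss(y, fx):
    return max(0.0, 1.0 - y * fx)  # convex upper bound on the 0-1 loss

for fx in [-1.5, -0.2, 0.5, 1.0, 2.0]:  # hypothetical outputs, true label y = +1
    print(fx, binary_loss(1, fx), hinge_loss(1, fx))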
Evaluation and Relevance Judgements
• Construct a test set containing a large number of (randomly sampled) queries, their associated documents, and a relevance judgement for each (query, document) pair
• A relevance judgement is an assessment of the relevance of a document for a query
• Degree of relevance
• Binary: relevant vs. irrelevant
• Multiple ordered categories: Perfect > Excellent > Good > Fair > Bad
• Pairwise preference
• Document A is more relevant than document B
• Total order
• Documents are ranked as {A, B, C, …} according to their relevance
Learning to Rank Framework
[Figure: the learning-to-rank framework]
Learning to Rank Approaches
• Pointwise approaches
• Input space: single documents — the feature vector of one (query, document) pair
• Output space: scores or relevance classes
• Pairwise approaches
• Input space: document pairs (d1, d2) such that d1 > d2
• Output space: a preference (yes/no) for a given document pair
• Listwise approaches
• Input space: the set of documents {d} for a query
• Output space: a permutation — a ranking of the documents
• Directly optimize an IR quality measure — MAP, NDCG, etc.
Pointwise Ranking - large margin ranking
• Margin: $\sum_k (b_k - a_k)$ (fixed margin strategy)
• Constraints: every document is correctly placed in its corresponding category

Original SVM formulation:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$
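The slide adapts the formulation above to ordered relevance categories. As one simple pointwise alternative, here is a minimal sketch of regression on graded labels (not the ordinal large-margin formulation itself; data and features are hypothetical, using scikit-learn's Ridge):

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical (query, document) feature vectors and graded relevance labels
X = np.array([[2.3, 0.8], [0.1, 0.0], [1.9, 0.6], [0.3, 0.1]])
grades = np.array([2, 0, 1, 0])

model = Ridge(alpha=1.0).fit(X, grades)
scores = model.predict(X)
print(np.argsort(-scores))  # documents ordered by predicted grade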
Problems with Pointwise ranking
• Some queries have a large number of documents and dominate the learned hypothesis
• Does not optimise standard IR measures like NDCG, MAP, ERR, etc.
• The position of a document in the ranked list is not modelled by the loss function
• Approaches might put more emphasis on unimportant documents
• Sampling is essential (if non-relevant documents >> relevant documents)
Pairwise Ranking

• Convert the input into preferences
• Use the preferences as training input
• Define a loss L(·) on the difference vector $(x_1 - x_2)$ and minimise the loss while learning f(·)

[Figure: a query with documents d1…d6, converted into preference pairs]

(d1 > d2), (d1 > d5), (d1 > d6), (d1 > d3), …

[RankBoost, 2003]
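A minimal sketch of the conversion step, turning hypothetical graded judgements for one query into preference pairs:

from itertools import combinations

# Hypothetical graded judgements for one query: document id -> relevance grade
grades = {"d1": 2, "d2": 1, "d3": 0, "d4": 1, "d5": 0, "d6": 0}

# Emit a preference (a > b) for every pair of documents with different grades
pairs = [(a, b) if ga > gb else (b, a)
         for (a, ga), (b, gb) in combinations(grades.items(), 2)
         if ga != gb]
print(pairs)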
Pairwise Ranking — Ranking SVM

Original SVM formulation:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,[\langle w, x_i \rangle + b] \ge 1 - \xi_i \;\text{ and }\; \xi_i \ge 0$$

Ranking SVM:

$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2}\|w\|^2 + C \sum_i \sum_{u,v} \xi^{(i)}_{u,v} \quad \text{subject to} \quad y^{(i)}\,\big[\langle w,\, x^{(i)}_u - x^{(i)}_v \rangle + b\big] \ge 1 - \xi^{(i)}_{u,v} \;\text{ and }\; \xi^{(i)}_{u,v} \ge 0$$

Each training instance is now a difference vector $x^{(i)}_u - x^{(i)}_v$ over a pair of documents $(u, v)$ for query $i$, with one slack variable per pair.
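A minimal sketch of this pairwise reduction, with scikit-learn's LinearSVC standing in for a dedicated Ranking SVM solver; features and grades are hypothetical:

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical features and grades for the documents of one query
X = np.array([[2.3, 0.8], [0.1, 0.0], [1.9, 0.6], [0.3, 0.1]])
g = np.array([2, 0, 1, 0])

# Difference vectors x_u - x_v for every pair with g[u] > g[v],
# plus the mirrored negative example to balance the classes
diffs, ys = [], []
for u in range(len(X)):
    for v in range(len(X)):
        if g[u] > g[v]:
            diffs.append(X[u] - X[v]); ys.append(1)
            diffs.append(X[v] - X[u]); ys.append(-1)

clf = LinearSVC(C=1.0, fit_intercept=False)  # the bias b cancels in the differences
clf.fit(np.array(diffs), np.array(ys))
print(X @ clf.coef_.ravel())  # w·x as the ranking score per document

Since the bias cancels when subtracting two constraints, the intercept can be dropped and the learned w scores single documents directly.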
Pros and Cons - Pairwise Approach
• Pros
• One step closer to ranking documents than predicting class labels
• Cons
• The distribution of document pairs is more skewed than the distribution of documents across queries
• Some queries have many more document pairs than others
• Still does not optimise IR measures
• Rank-ignorant — (d1 > d2) does not encode which ranks are being compared: rank 1 vs. rank 2, or rank 99 vs. rank 1000
Pairwise Ranking — IR SVM
IR SVM addresses these problems by re-weighting the hinge loss over document pairs: pairs involving important rank positions receive a higher cost, and each query's pairs are normalised so that queries with many pairs do not dominate.
Listwise Approaches
• Directly optimise the IR evaluation measures: NDCG, ERR, MAP
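For reference, a small sketch of NDCG@k, one of the measures listwise methods target directly; this uses the common exponential-gain, log-discount form, and the grades are toy values:

import math

def dcg(grades, k):
    # graded gain 2^g - 1, discounted by log2 of the rank position
    return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, k):
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1, 0], k=4))  # grades of one query's results, in ranked order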
Learning to Rank — In Action
1. Extract features from the document collection and the queries
2. Write them to a feature file
3. Create (train) a model
4. Execute the model on test data to obtain results
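As a small sketch of the glue between steps 2 and 3, here is a parser for one line of a feature file. The SVMlight/LETOR-style format assumed here is the one consumed by packages such as Ranking SVM and RankLib; an example follows on the next slide.

def parse_letor_line(line):
    """Parse one line of an SVMlight/LETOR-style feature file:
    '<grade> qid:<id> <feat>:<val> ... # comment' (format assumed here)."""
    line = line.split("#")[0].strip()  # drop the trailing comment
    parts = line.split()
    grade = int(parts[0])
    qid = parts[1].split(":")[1]
    feats = {int(k): float(v) for k, v in (p.split(":") for p in parts[2:])}
    return grade, qid, feats

print(parse_letor_line("2 qid:1 1:2.3 2:0.8 3:0.5 # doc1"))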
Learning to Rank — In Action
Example of a feature file
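The example itself is not preserved in the transcript; the following lines illustrate the SVMlight/LETOR-style format with hypothetical values. Each line holds a relevance grade, a query id, numbered feature:value pairs, and an optional comment.

2 qid:1 1:2.3 2:0.8 3:0.5 # doc1, query 1
0 qid:1 1:0.1 2:0.0 3:0.2 # doc2, query 1
1 qid:2 1:1.9 2:0.6 3:0.9 # doc7, query 2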
References and Further Readings
Books
Liu, Tie-Yan. Learning to Rank for Information Retrieval. Vol. 13. Springer, 2011.
Li, Hang. "Learning to Rank for Information Retrieval and Natural Language Processing." Synthesis Lectures on Human Language Technologies 4.1 (2011): 1-113.

Helpful pages
http://en.wikipedia.org/wiki/Learning_to_rank

Packages
Ranking SVM: http://svmlight.joachims.org/
RankLib: http://people.cs.umass.edu/~vdang/ranklib.html

Data sets
LETOR: http://research.microsoft.com/en-us/um/beijing/projects/letor/
Yahoo! Learning to Rank Challenge: http://learningtorankchallenge.yahoo.com