Learning to Rank - ibisml.org/ibis2009/pdf-invited/hangli1.pdf
TRANSCRIPT
Talk Outline
• What is Learning to Rank
• Learning to Rank Methods
– Ranking SVM
– IR SVM
– ListMLE
– AdaRank
• Learning to Rank Theory
• Learning to Rank Applications
• Future Directions of Learning to Rank Research
Ranking Problem: Example = Document Retrieval
[Figure: a query q is issued against a document collection D = {d_1, d_2, ..., d_N}; the ranking function f(q, d) scores each document, and the documents are sorted by f(q, d_i) into a ranking of documents, based on relevance, importance, preference.]
Probabilistic Model
[Figure: given query q and documents d_1, d_2, ..., d_N, each document d_i is ranked according to the probability P(r | q, d_i), where r ∈ R is a relevance grade (e.g., r ∈ {1, 0}); sorting by this probability yields the ranking of documents.]
BM25
[Figure: query q and documents d_1, d_2, ..., d_N are scored by the BM25 ranking function, which produces the ranking of documents.]

BM25(q, d) = Σ_{w ∈ q ∩ d} ( (k_1 + 1) · tf(w) ) / ( k_1 · ((1 − b) + b · dl/avgdl) + tf(w) )

where tf(w) is the frequency of word w in d, dl is the length of d, avgdl is the average document length, and k_1, b are parameters.
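The scoring formula above can be sketched in a few lines. This is a minimal sketch: the parameter values k1 = 2.0 and b = 0.75 are common defaults rather than values from the talk, and the IDF factor of full BM25 is omitted to match the formula shown.

```python
from collections import Counter

def bm25_score(query_terms, doc_terms, avgdl, k1=2.0, b=0.75):
    """Score a document for a query with the simplified BM25 above:
    sum over matching terms of (k1+1)*tf(w) / (k1*((1-b) + b*dl/avgdl) + tf(w))."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    norm = k1 * ((1 - b) + b * dl / avgdl)
    return sum((k1 + 1) * tf[w] / (norm + tf[w])
               for w in query_terms if tf[w] > 0)
```

Documents are then sorted by this score, highest first.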
Learning to Rank
[Figure: training data consists of queries q_1, ..., q_m, each with its retrieved documents d_{i,1}, ..., d_{i,n_i}; the Learning System learns a ranking model from this data, and the Ranking System applies the model to a new query q_{m+1} and its documents drawn from D = {d_1, d_2, ..., d_N}.]
Learning to Rank
[Figure: queries q_1, ..., q_m with their documents d_{i,j} are fed to the Learning System, which outputs a ranking function f(q, d); the Ranking System then sorts the documents of a new query q_{m+1} by their scores f(q_{m+1}, d_{m+1,j}).]
Training Process
[Figure: for each training query q_i with documents d_{i,1}, ..., d_{i,n_i}: 1. Data Labeling (rank) attaches a relevance grade y_{i,j} to each document, giving pairs (d_{i,j}, y_{i,j}); 2. Feature Extraction maps each (q_i, d_{i,j}) to a feature vector x_{i,j}, giving pairs (x_{i,j}, y_{i,j}); 3. Learning trains the ranking model f(x) on this data.]
Testing Process
[Figure: for a new query q_{m+1} with documents d_{m+1,1}, ..., d_{m+1,n_{m+1}}: 1. Data Labeling (rank) attaches grades, giving (d_{m+1,j}, y_{m+1,j}); 2. Feature Extraction gives (x_{m+1,j}, y_{m+1,j}); 3. Ranking with f(x) sorts the documents by their scores f(x_{m+1,j}); 4. Evaluation compares the predicted ranking with the labels to produce the evaluation result.]
Notes
• Features are functions of the query and document
• A query and its associated documents form a group
• Groups are i.i.d. data
• Feature vectors within a group are not i.i.d.
• The ranking model is a function of the features
• Several data labeling methods exist (labeling of ranks is used here as the example)
Recent Trends on Learning to Rank
• Successfully applied to search
• Hot topic in Information Retrieval and Machine Learning
– Over 100 publications at SIGIR, ICML, NIPS, etc.
– 2 sessions at SIGIR every year
– 3 SIGIR workshops
– Special issue at Information Retrieval Journal
– LETOR benchmark dataset, over 1,000 downloads
http://research.microsoft.com/en-us/um/beijing/projects/letor/index.html
Issues in Learning to Rank
• Data Labeling
• Feature Extraction
• Evaluation Measure
• Learning Method (Model, Loss Function, Algorithm)
Data Labeling Methods
• Labeling of Ranks
– Multiple levels (e.g., relevant, partially relevant, irrelevant)
– Widely used in IR
• Labeling of Ordered Pairs
– Ordered pairs between documents (e.g. A>B, B>C)
– Implicit relevance judgment: derived from click-through data
• Creation of List
– List (or permutation) of documents is given
– Ideal but difficult to implement
Feature Extraction
[Figure: for one query and documents A, B, C, each document is represented as a feature vector combining query-document features (e.g., BM25) with query-independent document features (e.g., PageRank).]
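A minimal sketch of the figure's idea: each document's feature vector concatenates query-document features (here BM25) with query-independent document features (here PageRank). All numeric values below are made up for illustration.

```python
def make_feature_vector(query_doc_features, doc_features):
    """Concatenate query-dependent and query-independent features
    into one feature vector for a (query, document) pair."""
    return list(query_doc_features) + list(doc_features)

# Hypothetical feature values for documents A, B, C under one query.
x_a = make_feature_vector([24.5], [0.45])   # [BM25, PageRank]
x_b = make_feature_vector([18.2], [0.80])
x_c = make_feature_vector([5.1], [0.12])
```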
Example Features
• Relevance: BM25
• Relevance: proximity
• Relevance: query exactly occurs in document
• Importance: PageRank
Evaluation Measures
• Important to rank top results correctly
• Measures
– NDCG (Normalized Discounted Cumulative Gain)
– MAP (Mean Average Precision)
– MRR (Mean Reciprocal Rank)
– WTA (Winners Take All)
– Kendall’s Tau
NDCG (cont'd)
• Example: perfect ranking
– (3, 3, 2, 2, 1, 1, 1) relevance grades, r = 3, 2, 1
– (7, 7, 3, 3, 1, 1, 1) gain: 2^{r(j)} − 1
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount: 1/log(1 + j)
– (7, 18.11, 24.11, …) DCG
– (1/7, 1/18.11, 1/24.11, …) normalizing factor n_j
– (1, 1, 1, 1, 1, 1, 1) NDCG for perfect ranking
• Definition: NDCG(j) = n_j · Σ_{i=1}^{j} (2^{r(i)} − 1) / log(1 + i)
NDCG (cont'd)
• Example: imperfect ranking
– (2, 3, 2, 3, 1, 1, 1) relevance grades
– (3, 7, 3, 7, 1, 1, 1) gain
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount
– (3, 14.11, 20.11, …) DCG
– (0.43, 0.78, 0.83, …) NDCG
• An imperfect ranking decreases NDCG
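The two examples can be checked with a short sketch. Log base 2 is assumed here, which matches the discount sequence (1, 0.63, 0.5, …) on the slides; the intermediate DCG values may differ from the slide's numbers depending on the logarithm and accumulation convention.

```python
import math

def dcg(grades):
    """DCG with gain 2^r - 1 and position discount 1/log2(1 + j)."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(grades, start=1))

def ndcg_at(grades, k):
    """NDCG@k: DCG of the top k divided by the ideal DCG of the top k."""
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal

perfect = [3, 3, 2, 2, 1, 1, 1]
imperfect = [2, 3, 2, 3, 1, 1, 1]
```

The perfect ranking gets NDCG 1 at every position; the imperfect one starts at 3/7 ≈ 0.43, matching the slide.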
Relations with Other Learning Tasks
• vs Classification: no need to predict a category
• vs Regression: no need to predict the value of f(q, d)
• vs Ordinal regression: relative ranking order is more important
• Learning to rank can be approximated by classification, regression, or ordinal regression
Ordinal Regression (Ordinal Classification)
• Categories are ordered
– 5, 4, 3, 2, 1
– e.g., rating restaurants
• Prediction
– Map to ordered categories
Learning to Rank Methods
• Pointwise Approach
– Subset Ranking [Cossock and Zhang, 2006]: Regression
– SVM [Nallapati, 2004]: Binary Classification Using SVM
– McRank [Li et al 2007]: Multi-Class Classification Using Boosting Tree
– Prank [Crammer and Singer 2002]: Ordinal Regression Using Perceptron
– Large Margin [Shashua & Levin 2002]: Ordinal Regression Using SVM
Learning to Rank Methods
• Pairwise Approach
– Ranking SVM: Pairwise Classification Using SVM
– RankBoost [Freund et al 2003]: Pairwise Classification Using Boosting
– RankNet [Burges et al 2005]: Pairwise Classification Using Neural Net
– Frank [Tsai et al 2007]: Pairwise Classification Using Fidelity Loss and Neural Net
– GBRank [Zheng et al 2007]: Pairwise Regression Using Boosting Tree
– IR SVM [Cao et al 2006]: Cost-sensitive Pairwise Classification Using SVM
– Multiple SVMs [Qin et al 2007]: Multiple SVMs
Learning to Rank Methods
• Listwise Approach
– ListNet [Cao et al 2007]: Probabilistic Ranking Model
– ListMLE [Xia et al 2008]: Probabilistic Ranking Model
– AdaRank [Xu and Li 2007]: Direct Optimization of Evaluation Measure
– SVM-MAP [Yue et al 2007]: Direct Optimization of Evaluation Measure
– PermuRank [Xu et al 2008]: Direct Optimization of Evaluation Measure
– SoftRank [Taylor et al 2008]: Approximation of Evaluation Measure
– LambdaRank [Burges et al 2007]: Using Implicit Loss Function
Learning to Rank Methods
• Other Methods
– K-Nearest Neighbor Ranker [Geng et al 2008]
– Semi-Supervised Learning [Jin et al 2008]
Evaluation Results
• The pairwise and listwise approaches perform better than the pointwise approach
• The listwise approach performs better than the pairwise approach in most cases
• Listwise approach: ListMLE, ListNet, AdaRank, PermuRank, SVM-MAP
• Pairwise approach: Ranking SVM, RankNet, RankBoost
• Pointwise approach: Linear Regression
Transforming Ranking to Pairwise Classification
• Input space: X
• Ranking function: f : X → ℝ
• Ranking: x_i ≻ x_j ⟺ f(x_i; w) > f(x_j; w)
• Linear ranking function: f(x; w) = ⟨w, x⟩
• Transforming to pairwise classification: x_i ≻ x_j ⟺ ⟨w, x_i − x_j⟩ > 0 ⟺ f(x_i; w) − f(x_j; w) > 0
• Each pair yields an example (x_i − x_j, z) with z = +1 if x_i ≻ x_j and z = −1 if x_j ≻ x_i
Transformed Pairwise Classification Problem
[Figure: for a ranking x_1 ≻ x_2 ≻ x_3, the difference vectors x_1 − x_2, x_1 − x_3, x_2 − x_3 are positive examples (+1) and x_2 − x_1, x_3 − x_1, x_3 − x_2 are negative examples (−1), separated by f(x; w).]
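The transformation can be sketched directly: given feature vectors listed in their ground-truth ranking order, every pair yields one positive example and its mirrored (redundant) negative example.

```python
from itertools import combinations

def to_pairwise(ranked_vectors):
    """Turn a ranked list of feature vectors (best first) into pairwise
    classification data: (x_i - x_j, +1) when x_i is ranked above x_j,
    plus the mirrored (x_j - x_i, -1)."""
    examples = []
    for xi, xj in combinations(ranked_vectors, 2):
        diff = [a - b for a, b in zip(xi, xj)]
        examples.append((diff, +1))
        examples.append(([-d for d in diff], -1))
    return examples
```

As noted on the next slide, the negative half can be discarded without losing information.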
Ranking SVM
• Pairwise classification on differences of feature vectors
• Corresponding positive and negative examples
• Negative examples are redundant and can be discarded
• Hyperplane passes through the origin
• Soft margin and kernels can be used
• Ranking SVM = pairwise classification SVM
Learning of Ranking SVM
min_w Σ_{i=1}^{l} [1 − z_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩]_+ + λ‖w‖², where [s]_+ = max(0, s)

Equivalent quadratic program:

min_{w,ξ} (1/2)‖w‖² + C Σ_{i=1}^{l} ξ_i
s.t. z_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩ ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l
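A toy sketch of the first (unconstrained) form, minimized by subgradient descent with NumPy. The learning rate, regularization strength, and epoch count are illustrative choices, not values from the talk.

```python
import numpy as np

def train_ranking_svm(diffs, z, lam=0.01, lr=0.1, epochs=200):
    """Minimize sum_i [1 - z_i <w, x_i>]_+ + lam * ||w||^2 by
    subgradient descent, where each x_i is a difference
    x_i^(1) - x_i^(2) of feature vectors and z_i is +1 or -1."""
    X = np.asarray(diffs, dtype=float)
    z = np.asarray(z, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = z * (X @ w) < 1                      # margin violations
        grad = -(z[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        w -= lr * grad
    return w
```

There is no bias term, so the separating hyperplane passes through the origin, as the slide notes.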
Problems with Ranking SVM
• Not sufficient emphasis on correct ranking at the top
r: relevant, p: partially relevant, i: irrelevant
ranking 1: p r p i i i i
ranking 2: r p i p i i i
Ranking 2 should be better than ranking 1, but Ranking SVM views them as the same
• Numbers of pairs vary across queries
q1: r p p i i i i
q2: r r p p p i i i i i
q1 pairs: 2·(r, p) + 4·(r, i) + 8·(p, i) = 14
q2 pairs: 6·(r, p) + 10·(r, i) + 15·(p, i) = 31
Ranking SVM is biased toward q2
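The pair counts in the second example can be verified mechanically; r, p, i are the three relevance grades from the slide.

```python
from itertools import combinations

GRADE = {"r": 2, "p": 1, "i": 0}

def count_pairs(labels):
    """Count document pairs with different relevance grades, i.e. the
    pairs a query contributes to Ranking SVM's training set."""
    return sum(1 for a, b in combinations(labels, 2)
               if GRADE[a] != GRADE[b])
```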
IR SVM
• Solves the two problems of Ranking SVM
• Higher weight τ_{k(i)} on important rank pairs (pairs of higher relevance grades)
• Normalization weight μ_{q(i)} on the pairs of each query
• IR SVM = Ranking SVM with a modified hinge loss
Modified Hinge Loss Function

min_w Σ_{i=1}^{l} μ_{q(i)} τ_{k(i)} [1 − z_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩]_+ + λ‖w‖²

[Figure: plot of the loss against z·f(x^{(1)} − x^{(2)}), showing hinge losses whose slopes are scaled by the weights (e.g., 2, 1, 0.5).]
Learning of IR SVM

min_w Σ_{i=1}^{l} μ_{q(i)} τ_{k(i)} [1 − z_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩]_+ + λ‖w‖²

Equivalent quadratic program:

min_{w,ξ} (1/2)‖w‖² + Σ_{i=1}^{l} C_i ξ_i, with C_i = C · μ_{q(i)} τ_{k(i)}
s.t. z_i ⟨w, x_i^{(1)} − x_i^{(2)}⟩ ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, l
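The modified objective can be sketched as a weighted hinge loss. Here tau and mu are per-pair weights supplied by the caller (e.g., mu_i = 1 / number of pairs in pair i's query); the arrays in the test are illustrative.

```python
import numpy as np

def ir_svm_loss(w, diffs, z, tau, mu, lam=0.01):
    """IR SVM objective (sketch):
    sum_i mu_i * tau_i * [1 - z_i <w, x_i>]_+ + lam * ||w||^2,
    where tau_i weights important grade pairs and mu_i normalizes
    across queries with different numbers of pairs."""
    X = np.asarray(diffs, dtype=float)
    hinge = np.maximum(0.0, 1.0 - np.asarray(z) * (X @ w))
    return float((np.asarray(mu) * np.asarray(tau) * hinge).sum()
                 + lam * np.dot(w, w))
```

Setting every tau_i and mu_i to 1 recovers the plain Ranking SVM loss.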
Plackett-Luce Model (Permutation Probability)
• The probability of permutation π is defined as

P(π) = ∏_{i=1}^{n} s_{π(i)} / (Σ_{j=i}^{n} s_{π(j)})

• Example:

P(ABC) = s_A / (s_A + s_B + s_C) · s_B / (s_B + s_C) · s_C / s_C
       = P(A ranked No.1) · P(B ranked No.2 | A ranked No.1) · P(C ranked No.3 | A ranked No.1, B ranked No.2)
Properties of Plackett-Luce Model
• Objects: A, B, C
• Scores: s_A = 5, s_B = 3, s_C = 1
• Property 1: P(ABC) is largest, P(CBA) is smallest
• Property 2: swapping B and C in ABC lowers the probability: P(ABC) > P(ACB)
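Both properties can be checked numerically with the slide's scores s_A = 5, s_B = 3, s_C = 1:

```python
from itertools import permutations

def pl_probability(scores_in_order):
    """Plackett-Luce probability of ranking objects in the given order:
    the product over positions of the object's score divided by the
    total score of the objects not yet placed."""
    p = 1.0
    for i, s in enumerate(scores_in_order):
        p *= s / sum(scores_in_order[i:])
    return p

s = {"A": 5.0, "B": 3.0, "C": 1.0}
probs = {"".join(pi): pl_probability([s[o] for o in pi])
         for pi in permutations("ABC")}
```

P(ABC) comes out largest and P(CBA) smallest, and the six probabilities sum to 1, since the model is a distribution over all permutations.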
Plackett-Luce Model (Top-k Probability)
• Computation of permutation probabilities is intractable
• Top-k probability
– Define the top-k subgroup G(o_1, …, o_k) containing all permutations whose top k objects are o_1, …, o_k
– P(G(o_1, …, o_k)) = ∏_{i=1}^{k} s_{o_i} / (Σ_{j=i}^{n} s_{o_j})
– Time complexity of computation drops from n! to n!/(n − k)!
• Example: P(G(A)) = s_A / (s_A + s_B + s_C)
ListMLE
• Parameterized Plackett-Luce model: s = exp(f(x; w))

P(G(x_1, …, x_k)) = ∏_{i=1}^{k} s_{x_i} / (Σ_{j=i}^{n} s_{x_j})

• Maximum likelihood estimation: minimize the negative log-likelihood

L(w) = −log ∏_{i=1}^{k} exp(f(x_i; w)) / (Σ_{j=i}^{n} exp(f(x_j; w)))
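The ListMLE loss for one query can be sketched with a numerically stable log-sum-exp. The input scores are f(x_i; w) listed in the ground-truth ranking order (best first); this is the full-permutation (k = n) case.

```python
import numpy as np

def listmle_loss(scores):
    """Negative log Plackett-Luce likelihood of the ground-truth order
    with s_i = exp(score_i):
    L = sum_i [ logsumexp(score_i, ..., score_n) - score_i ]."""
    scores = np.asarray(scores, dtype=float)
    loss = 0.0
    for i in range(len(scores)):
        tail = scores[i:]
        m = tail.max()                    # stabilize the exponentials
        loss += m + np.log(np.exp(tail - m).sum()) - scores[i]
    return loss
```

A model that agrees with the ground-truth order (decreasing scores) gets a lower loss than one that reverses it.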
Listwise Loss
[Slide: training data for the listwise loss: queries q_1, …, q_m, each with labeled feature vectors (x_{i,1}, y_{i,1}), …, (x_{i,n_i}, y_{i,n_i}).]
Theoretical Results on AdaRank
• The training error is continuously reduced during the learning phase.
Learning to Rank Theory
• Pairwise Approach
– Generalization Analysis [Lan et al 2008]
• Listwise Approach
– Generalization Analysis [Lan et al 2009]
– Consistency Analysis [Xia et al 2008]
Learning to Rank Applications
• Search [Burges et al 2005]
• Collaborative Filtering [Freund et al 2003]
• Key Phrase Extraction [Jiang et al 2009]
New Issues to be Further Studied
• Learning from implicit data
– Automatically generate labeled data from implicit feedback
• Model (feature) learning
– Automatically learn features such as BM25
• Global ranking
– Using features of current document as well as relations with other documents
New Issues to be Further Studied (cont'd)
• Query-dependent ranking
– Creating different ranking models for different queries (in search)
• New applications
– Machine Translation
Takeaway Message
• Learning to Rank = Machine Learning Task
• Different from classification, regression, ordinal regression
• Learning to Rank has been successfully applied to search
• Existing approaches: pointwise, pairwise, listwise
• Many open problems