Learning to Rank: From Pairwise Approach to Listwise Approach


Page 1: Learning to Rank - From pairwise approach to listwise


Learning to Rank: From Pairwise Approach to Listwise Approach

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li

Hasan Hüseyin Topcu

Learning to Rank

Page 2: Learning to Rank - From pairwise approach to listwise

Outline

• Related Work
• Learning System
• Learning to Rank
• Pairwise vs. Listwise Approach
• Experiments
• Conclusion

Page 3: Learning to Rank - From pairwise approach to listwise

Related Work

• Pairwise Approach: the learning task is formalized as classification of object pairs into two categories (correctly ranked and incorrectly ranked).

• Classification methods:
  • Ranking SVM (Herbrich et al., 1999); Joachims (2002) applied Ranking SVM to Information Retrieval
  • RankBoost (Freund et al., 1998)
  • RankNet (Burges et al., 2005)

Page 4: Learning to Rank - From pairwise approach to listwise

Learning System

Page 5: Learning to Rank - From pairwise approach to listwise

Learning System

[Figure: learning-system pipeline, with stages: training data and data preprocessing; how objects are identified; how instances are modeled; learning algorithms (SVM, ANN, Boosting); evaluation with test data. Adapted from Pattern Classification (Duda, Hart, Stork).]

Page 6: Learning to Rank - From pairwise approach to listwise

Ranking  

Page 7: Learning to Rank - From pairwise approach to listwise

Learning to Rank

Page 8: Learning to Rank - From pairwise approach to listwise

Learning to Rank

• A number of queries are provided.

• Each query is associated with a perfect ranking list of documents (the ground truth).

• A ranking function is created using the training data such that the model can precisely predict the ranking lists.

• A loss function is optimized for learning. Note that the loss function for ranking is slightly different in the sense that it makes use of sorting.

Page 9: Learning to Rank - From pairwise approach to listwise

Training Process

Page 10: Learning to Rank - From pairwise approach to listwise

Data Labeling

• Explicit human judgments (Perfect, Excellent, Good, Fair, Bad)

• Implicit relevance judgments: derived from click data (search log data)

• Ordered pairs between documents (A > B)

• Lists of judgments (scores)

Page 11: Learning to Rank - From pairwise approach to listwise

Features  

Page 12: Learning to Rank - From pairwise approach to listwise

Pairwise Approach

• In learning, the training data instances are document pairs.

Page 13: Learning to Rank - From pairwise approach to listwise

Pairwise Approach

• Collects document pairs from the ranking list and assigns a label to each pair (a sketch of this pair generation follows this slide).

• Labels: +1 if the score of A > B, and -1 if A < B.

• Formalizes the problem of learning to rank as binary classification.

• Examples: Ranking SVM, RankBoost, and RankNet.
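As an illustration (not from the paper or slides), a minimal Python sketch of how one query's judged document list can be turned into pairwise training instances; the function name and data layout are my own assumptions:

```python
from itertools import combinations

def make_pairwise_instances(features, judgments):
    """Turn one query's document list into labeled pairs.

    features:  list of feature vectors, one per document
    judgments: list of relevance scores, one per document
    Returns (x_j, x_k, label) tuples with label +1 if document j is judged
    more relevant than document k, and -1 otherwise; ties are skipped.
    """
    instances = []
    for j, k in combinations(range(len(features)), 2):
        if judgments[j] == judgments[k]:
            continue  # equally judged pairs carry no preference signal
        label = +1 if judgments[j] > judgments[k] else -1
        instances.append((features[j], features[k], label))
    return instances

# Example: three documents retrieved for one query
pairs = make_pairwise_instances([[0.9, 0.1], [0.4, 0.3], [0.2, 0.8]],
                                [2, 0, 1])
print(pairs)  # three labeled pairs
```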

Page 14: Learning to Rank - From pairwise approach to listwise

Pairwise Approach Drawbacks

• The objective of learning is formalized as minimizing errors in the classification of document pairs rather than minimizing errors in the ranking of documents.

• The training process is computationally costly, as the number of document pairs is very large.

Page 15: Learning to Rank - From pairwise approach to listwise

Pairwise Approach Drawbacks

• Treats document pairs across different grades (labels) equally (Ex. 1).

• The number of generated document pairs varies largely from query to query, which results in training a model biased toward queries with more document pairs (Ex. 2; illustrated below).
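A quick numeric illustration (my own, not from the slides) of this bias: a query with n judged documents can generate up to n(n-1)/2 pairs, so the pair count grows quadratically and queries with long result lists dominate a pair-level loss.

```python
def max_pairs(n):
    # Upper bound on document pairs for a query with n judged documents
    return n * (n - 1) // 2

for n in (5, 50, 500):
    print(n, "documents ->", max_pairs(n), "pairs")
# 5 -> 10, 50 -> 1225, 500 -> 124750
```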

Page 16: Learning to Rank - From pairwise approach to listwise

Listwise Approach

• Training data instances are document lists.

• The objective of learning is formalized as minimization of the total losses with respect to the training data.

• The listwise loss function uses probability models: Permutation Probability and Top One Probability.

Learning to Rank: From Pairwise Approach to Listwise Approach (excerpt from the paper)

We describe in detail the listwise approach. In the following description, we use a superscript to indicate the index of queries and a subscript to indicate the index of documents for a specific query.

In training, a set of queries $Q = \{q^{(1)}, q^{(2)}, \cdots, q^{(m)}\}$ is given. Each query $q^{(i)}$ is associated with a list of documents $d^{(i)} = (d^{(i)}_1, d^{(i)}_2, \cdots, d^{(i)}_{n^{(i)}})$, where $d^{(i)}_j$ denotes the $j$-th document and $n^{(i)}$ denotes the size of $d^{(i)}$. Furthermore, each list of documents $d^{(i)}$ is associated with a list of judgments (scores) $y^{(i)} = (y^{(i)}_1, y^{(i)}_2, \cdots, y^{(i)}_{n^{(i)}})$, where $y^{(i)}_j$ denotes the judgment on document $d^{(i)}_j$ with respect to query $q^{(i)}$. The judgment $y^{(i)}_j$ represents the relevance degree of $d^{(i)}_j$ to $q^{(i)}$, and can be a score explicitly or implicitly given by humans. For example, $y^{(i)}_j$ can be the number of clicks on $d^{(i)}_j$ when $d^{(i)}_j$ is retrieved and returned for query $q^{(i)}$ at a search engine (Joachims, 2002). The assumption is that the higher the click-on rate observed for $d^{(i)}_j$ and $q^{(i)}$, the stronger the relevance between them.

A feature vector $x^{(i)}_j$ is created from each query-document pair $(q^{(i)}, d^{(i)}_j)$, $i = 1, 2, \cdots, m$; $j = 1, 2, \cdots, n^{(i)}$. Each list of features $x^{(i)} = (x^{(i)}_1, \cdots, x^{(i)}_{n^{(i)}})$ and the corresponding list of scores $y^{(i)} = (y^{(i)}_1, \cdots, y^{(i)}_{n^{(i)}})$ then form an 'instance'. The training set can be denoted as $T = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$.

We then create a ranking function $f$; for each feature vector $x^{(i)}_j$ (corresponding to document $d^{(i)}_j$) it outputs a score $f(x^{(i)}_j)$. For the list of feature vectors $x^{(i)}$ we obtain a list of scores $z^{(i)} = (f(x^{(i)}_1), \cdots, f(x^{(i)}_{n^{(i)}}))$. The objective of learning is formalized as minimization of the total losses with respect to the training data,

$$\sum_{i=1}^{m} L(y^{(i)}, z^{(i)}) \qquad (1)$$

where $L$ is a listwise loss function.

In ranking, when a new query $q^{(i')}$ and its associated documents $d^{(i')}$ are given, we construct feature vectors $x^{(i')}$ from them and use the trained ranking function to assign scores to the documents $d^{(i')}$. Finally, we rank the documents $d^{(i')}$ in descending order of the scores. We call the learning problem described above the listwise approach to learning to rank.

By contrast, in the pairwise approach, a new training data set $T'$ is created from $T$, in which each feature vector pair $x^{(i)}_j$ and $x^{(i)}_k$ forms a new instance, where $j \neq k$, and $+1$ is assigned to the pair if $y^{(i)}_j$ is larger than $y^{(i)}_k$, otherwise $-1$. It turns out that the training data $T'$ is a data set of binary classification. A classification model like SVM can be created. As explained in Section 1, although the pairwise approach has advantages, it also suffers from drawbacks. The listwise approach can naturally deal with these problems, which will be made clearer in Section 6.

4. Probability Models

We propose using probability models to calculate the listwise loss function in Eq. (1). Specifically, we map a list of scores to a probability distribution using one of the probability models described in this section and then take any metric between two probability distributions as a loss function. The two models are referred to as permutation probability and top one probability.

4.1. Permutation Probability

Suppose that the set of objects to be ranked is identified with the numbers $1, 2, \ldots, n$. A permutation $\pi$ on the objects is defined as a bijection from $\{1, 2, \ldots, n\}$ to itself. We write the permutation as $\pi = \langle \pi(1), \pi(2), \ldots, \pi(n) \rangle$. Here, $\pi(j)$ denotes the object at position $j$ in the permutation. The set of all possible permutations of $n$ objects is denoted as $\Omega_n$.

Suppose that there is a ranking function which assigns scores to the $n$ objects. We use $s$ to denote the list of scores $s = (s_1, s_2, \ldots, s_n)$, where $s_j$ is the score of the $j$-th object. Hereafter we sometimes use the ranking function and the list of scores given by the ranking function interchangeably.

We assume that there is uncertainty in the prediction of ranking lists (permutations) using the ranking function. In other words, any permutation is assumed to be possible, but different permutations may have different likelihoods calculated based on the ranking function. We define the permutation probability so that it has desirable properties for representing the likelihood of a permutation (ranking list), given the ranking function.

Definition 1. Suppose that $\pi$ is a permutation on the $n$ objects, and $\phi(\cdot)$ is an increasing and strictly positive function. Then, the probability of permutation $\pi$ given the list of scores $s$ is defined as

$$P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}$$

where $s_{\pi(j)}$ is the score of the object at position $j$ of permutation $\pi$.

Let us consider an example with three objects $\{1, 2, 3\}$ having scores $s = (s_1, s_2, s_3)$. The probabilities of permutations $\pi = \langle 1, 2, 3 \rangle$ and $\pi' = \langle 3, 2, 1 \rangle$ are calculated as follows:

$$P_s(\pi) = \frac{\phi(s_1)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)},$$

$$P_s(\pi') = \frac{\phi(s_3)}{\phi(s_1) + \phi(s_2) + \phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_1) + \phi(s_2)} \cdot \frac{\phi(s_1)}{\phi(s_1)}.$$
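A small sketch of Definition 1 in Python, assuming φ(x) = exp(x) (the choice also used by ListNet) and plain lists; the helper name is my own:

```python
import math

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi) per Definition 1: the product over positions j of
    phi(s_pi(j)) / sum_{k >= j} phi(s_pi(k)).
    `perm` lists 0-based object indices from top to bottom."""
    prob = 1.0
    for j in range(len(perm)):
        numer = phi(scores[perm[j]])
        denom = sum(phi(scores[perm[k]]) for k in range(j, len(perm)))
        prob *= numer / denom
    return prob

s = [3.0, 1.0, 0.5]                           # scores of objects 1, 2, 3
print(permutation_probability(s, [0, 1, 2]))  # pi  = <1, 2, 3>: most likely order
print(permutation_probability(s, [2, 1, 0]))  # pi' = <3, 2, 1>: much less likely
```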


Page 17: Learning to Rank - From pairwise approach to listwise

Permutation Probability

• Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA

• Suppose a ranking function assigns scores sA, sB and sC to the objects

• Permutation probability: the likelihood of a permutation

• P(ABC) > P(CBA) if sA > sB > sC

Page 18: Learning to Rank - From pairwise approach to listwise

Top One Probability

• Objects: {A, B, C}; permutations: ABC, ACB, BAC, BCA, CAB, CBA

• Suppose a ranking function assigns scores sA, sB and sC to the objects

• The top one probability of an object is the probability of its being ranked on top, given the scores of all the objects

• P(A) = P(ABC) + P(ACB)

• Notice that to calculate the n top one probabilities this way, we would still need to calculate n! permutation probabilities (the sketch below checks this against the closed form given in the paper):
  • P(A) = P(ABC) + P(ACB)
  • P(B) = P(BAC) + P(BCA)
  • P(C) = P(CBA) + P(CAB)
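A sketch (assuming φ(x) = exp(x)) that computes the top one probabilities by brute force over all n! permutations and compares them with the closed form stated in the paper, where the top one probability of object j equals φ(s_j) / Σ_k φ(s_k):

```python
import math
from itertools import permutations

def perm_prob(scores, perm, phi=math.exp):
    # P_s(pi) from Definition 1
    p = 1.0
    for j in range(len(perm)):
        p *= phi(scores[perm[j]]) / sum(phi(scores[perm[k]])
                                        for k in range(j, len(perm)))
    return p

scores = {"A": 2.0, "B": 1.0, "C": 0.0}
names = list(scores)
vals = [scores[n] for n in names]

# Brute force: sum permutation probabilities grouped by the top object
brute = {n: 0.0 for n in names}
for perm in permutations(range(len(names))):
    brute[names[perm[0]]] += perm_prob(vals, list(perm))

# Closed form: P(j ranked top) = phi(s_j) / sum_k phi(s_k)
z = sum(math.exp(v) for v in vals)
closed = {n: math.exp(scores[n]) / z for n in names}

print(brute)   # e.g. P(A) = P(ABC) + P(ACB)
print(closed)  # matches the brute-force values
```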

Page 19: Learning to Rank - From pairwise approach to listwise

Listwise Loss Function

• With the use of top one probabilities, given two lists of scores we can use any metric to represent the distance between the two score lists.

• For example, when we use Cross Entropy as the metric, the listwise loss function becomes L(y, z) = -Σj Py(j) log Pz(j) (see the sketch below).

• Ground truth ABCD vs. ranking output ACBD or ABDC: an error near the top of the list should be penalized more heavily.
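A minimal sketch of that loss in Python, assuming φ = exp so the top one probabilities are a softmax over the scores; the numeric score lists below are my own stand-ins for the ABCD example:

```python
import math

def top_one_probs(scores):
    # phi = exp, so the top one probabilities form a softmax over the scores
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_cross_entropy(y_scores, z_scores):
    """L(y, z) = -sum_j P_y(j) * log P_z(j)."""
    p_y = top_one_probs(y_scores)
    p_z = top_one_probs(z_scores)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))

truth = [3.0, 2.0, 1.0, 0.0]  # ground truth order A > B > C > D
acbd = [3.0, 1.0, 2.0, 0.0]   # output ACBD: swap near the top
abdc = [3.0, 2.0, 0.0, 1.0]   # output ABDC: swap near the bottom
print(listwise_cross_entropy(truth, acbd))  # larger loss
print(listwise_cross_entropy(truth, abdc))  # smaller loss
```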

Page 20: Learning to Rank - From pairwise approach to listwise

ListNet

• Learning method: ListNet

• Optimizes the listwise loss function based on top one probability, with a Neural Network as the model and Gradient Descent as the optimization algorithm.

• A Linear Network Model is used for simplicity: y = wᵀx + b (an illustrative gradient step is sketched below).
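An illustrative sketch (not the authors' code) of one ListNet gradient-descent step for a single query, using the linear model f(x) = w·x + b and the cross-entropy loss on top one probabilities with φ = exp; the function names, toy data, and learning rate are assumptions:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_step(w, b, x_list, y_scores, lr=0.1):
    """One gradient step on one query for a linear scoring model."""
    z_scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in x_list]
    p_y = softmax(y_scores)   # target top one probabilities
    p_z = softmax(z_scores)   # model top one probabilities
    # For softmax + cross entropy, d(loss)/d(score_j) = p_z[j] - p_y[j]
    grad_s = [pz - py for pz, py in zip(p_z, p_y)]
    grad_w = [sum(g * x[i] for g, x in zip(grad_s, x_list))
              for i in range(len(w))]
    grad_b = sum(grad_s)
    new_w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    return new_w, b - lr * grad_b

# Toy query: three documents with two features each, judgments 2 > 1 > 0
docs = [[1.0, 0.2], [0.5, 0.5], [0.1, 0.9]]
w, b = [0.0, 0.0], 0.0
for _ in range(100):
    w, b = listnet_step(w, b, docs, [2.0, 1.0, 0.0])
print(w, b)  # learned weights rank the first document highest
```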

Page 21: Learning to Rank - From pairwise approach to listwise

ListNet  

Page 22: Learning to Rank - From pairwise approach to listwise

Ranking Accuracy

• ListNet vs. RankNet, Ranking SVM, and RankBoost

• 3 datasets: TREC 2003, OHSUMED, and CSearch
  • TREC 2003: relevance judgments (Relevant and Irrelevant), 20 extracted features
  • OHSUMED: relevance judgments (Definitely Relevant, Positively Relevant, and Irrelevant), 30 features
  • CSearch: relevance judgments from 4 ('Perfect Match') to 0 ('Bad Match'), 600 features

• Evaluation measures: Normalized Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP); NDCG@n is sketched below
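For reference, a small NDCG@n helper using a common convention (gain 2^rel - 1, log2 position discount); this is my own sketch, and the paper's exact gain/discount convention may differ:

```python
import math

def dcg_at_n(relevances, n):
    # Discounted cumulative gain over the first n results
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:n]))

def ndcg_at_n(relevances, n):
    # Normalize by the DCG of the ideal (sorted) ordering
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0

# Relevance grades of documents in the order the model ranked them
print(ndcg_at_n([2, 0, 1, 0, 2], n=5))
```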

Page 23: Learning to Rank - From pairwise approach to listwise

Experiments

• NDCG@n on TREC

Page 24: Learning to Rank - From pairwise approach to listwise

Experiments

• NDCG@n on OHSUMED

Page 25: Learning to Rank - From pairwise approach to listwise

Experiments

• NDCG@n on CSearch

Page 26: Learning to Rank - From pairwise approach to listwise

Conclusion

• Discussed:
  • Learning to Rank
  • The pairwise approach and its drawbacks
  • The listwise approach, which outperforms the existing pairwise approaches

• Evaluation of the paper:
  • A linear neural network model is used. What about a non-linear model?
  • The listwise loss function (the probability models) is the key issue.

Page 27: Learning to Rank - From pairwise approach to listwise

References

• Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), Zoubin Ghahramani (Ed.). ACM, New York, NY, USA, 129-136. DOI=10.1145/1273496.1273513 http://doi.acm.org/10.1145/1273496.1273513

• Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions 94-D(10): 1854-1862 (2011)

• Hang Li. Learning to Rank. Microsoft Research Asia. ACL-IJCNLP 2009 Tutorial. Aug. 2, 2009. Singapore

Page 28: Learning to Rank - From pairwise approach to listwise