
Page 1: Learning to Rank

2005.12.27

Learning to Rank

Ming-Feng Tsai

National Taiwan University

Page 2: Learning to Rank


Ranking

Ranking vs. Classification

Training samples are not independent and identically distributed. The training criterion is not compatible with the IR evaluation criterion.

Many ML approaches have been applied to ranking:

RankSVM: T. Joachims, SIGKDD 2002 (SVM Light)

RankBoost: Y. Freund, R. Iyer, et al., Journal of Machine Learning Research, 2003

RankNet: C.J.C. Burges et al., ICML 2005 (MSN Search)

Page 3: Learning to Rank


Motivation

RankNet

Pros: probabilistic ranking model; good properties.

Cons: training is not efficient; the training criterion is not compatible with the IR evaluation criterion.

Motivation: build on the probabilistic ranking model, and improve both efficiency and the loss function.

Page 4: Learning to Rank


Probabilistic Ranking Model

Model the posterior $P(x_i \triangleright x_j)$ by $P_{ij}$. The map from outputs to probabilities is modeled using a sigmoid function.

Define $o_i = f(x_i)$ and $o_{ij} = f(x_i) - f(x_j)$; then

$$P_{ij} = \frac{e^{o_{ij}}}{1 + e^{o_{ij}}}.$$
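As a minimal illustration (not from the slides), this pairwise probability can be computed from two real-valued scores $f(x_i)$ and $f(x_j)$:

```python
import math

def pairwise_probability(score_i: float, score_j: float) -> float:
    """P_ij = sigmoid(o_ij) with o_ij = f(x_i) - f(x_j)."""
    o_ij = score_i - score_j
    return 1.0 / (1.0 + math.exp(-o_ij))  # equivalent to e^{o_ij} / (1 + e^{o_ij})

# A higher-scored document is preferred with probability > 0.5.
print(pairwise_probability(2.0, 1.0))  # ~0.73
print(pairwise_probability(1.0, 1.0))  # 0.5
```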

Properties of the combined probabilities:

Consistency requirement: if P(A>B) = 0.5 and P(B>C) = 0.5, then P(A>C) = 0.5.

Confidence, or lack of confidence, builds as expected: if P(A>B) = 0.6 and P(B>C) = 0.6, then P(A>C) > 0.6.

Page 5: Learning to Rank


Probabilistic Ranking Model

Cross-entropy loss function: let $\bar{P}_{ij}$ denote the desired target values; the total cost is the sum of the pairwise costs.

RankNet applies this loss function with a neural network (a back-propagation network).

Here, the same loss function is applied with an additive model.

$$C_{ij} = C(o_{ij}) = -\bar{P}_{ij}\log P_{ij} - (1-\bar{P}_{ij})\log(1-P_{ij}) = -\bar{P}_{ij}\,o_{ij} + \log\!\left(1 + e^{o_{ij}}\right)$$

$$C = \sum_{ij} C_{ij}$$
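A minimal sketch of this pairwise cost, assuming the score difference $o_{ij}$ and the target $\bar{P}_{ij}$ are given:

```python
import math

def cross_entropy_pair_loss(o_ij: float, target_p: float) -> float:
    """C_ij = -P̄_ij * o_ij + log(1 + e^{o_ij})."""
    return -target_p * o_ij + math.log(1.0 + math.exp(o_ij))

# The total cost is the sum over all pairs; (o_ij, target) values are illustrative.
pairs = [(1.5, 1.0), (-0.3, 1.0), (0.2, 0.0)]
print(sum(cross_entropy_pair_loss(o, p) for o, p in pairs))
```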

Page 6: Learning to Rank


Derivation of cross entropy loss function for Additive Model

Starting from the pairwise cross-entropy loss

$$C_{ij} = -\bar{P}_{ij}\,o_{ij} + \log\!\left(1 + e^{o_{ij}}\right), \qquad o_{ij} = f(x_i) - f(x_j),$$

let the additive model be updated as $f_k(x) = f_{k-1}(x) + \alpha_k h_k(x)$, so that

$$o_{ij} = f_{k-1}(x_i) - f_{k-1}(x_j) + \alpha_k\bigl(h_k(x_i) - h_k(x_j)\bigr).$$

Denote $f_{ij} = f_{k-1}(x_i) - f_{k-1}(x_j)$ and $h_{ij} = h_k(x_i) - h_k(x_j)$. The objective for the $k$-th round is

$$J(\alpha_k) = \sum_{ij}\Bigl[-\bar{P}_{ij}\bigl(f_{ij} + \alpha_k h_{ij}\bigr) + \log\!\left(1 + e^{f_{ij} + \alpha_k h_{ij}}\right)\Bigr].$$

Setting $\partial J / \partial \alpha_k = 0$ gives

$$\sum_{ij} h_{ij}\left(\frac{e^{f_{ij} + \alpha_k h_{ij}}}{1 + e^{f_{ij} + \alpha_k h_{ij}}} - \bar{P}_{ij}\right) = 0.$$

Letting $a_{ij} = \bar{P}_{ij}$, $b_{ij} = h_{ij}$, $c_{ij} = e^{f_{ij}}$ and $x = e^{\alpha_k}$, this equation can be solved for $x$ (and hence for $\alpha_k$) only with some relaxations.
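Since the slides note that this equation is solvable only with some relaxations, a simple illustrative alternative (an assumption of this sketch, not the slides' method) is to pick $\alpha_k$ by a one-dimensional numerical search over $J(\alpha_k)$, which is convex in $\alpha_k$:

```python
import math

def round_objective(alpha, pairs):
    """J(alpha) = sum over pairs of -P̄*(f_ij + a*h_ij) + log(1 + e^{f_ij + a*h_ij}).

    Each pair is (f_ij, h_ij, target_p), built from the previous ensemble output
    f_{k-1} and the candidate weak learner h_k.
    """
    total = 0.0
    for f_ij, h_ij, target_p in pairs:
        o = f_ij + alpha * h_ij
        total += -target_p * o + math.log(1.0 + math.exp(o))
    return total

def golden_section_search(fn, lo, hi, tol=1e-6):
    """Minimize a unimodal 1-D function on [lo, hi]."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - phi * (b - a), a + phi * (b - a)
        if fn(c) < fn(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

# Hypothetical pair statistics (f_ij, h_ij, target P̄_ij):
pairs = [(0.2, 1.0, 1.0), (-0.5, -1.0, 0.0), (0.1, 1.0, 0.8)]
alpha_k = golden_section_search(lambda a: round_objective(a, pairs), 0.0, 5.0)
print(alpha_k)
```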

Page 7: Learning to Rank


Candidates of Loss Functions

Cross entropy

KL-divergence: this loss function is equivalent to cross entropy (they differ only by the entropy term $H(X)$, which is fixed for given targets).

Information radius: KL-divergence and cross entropy are asymmetric; the information radius is symmetric, that is, IRad(p, q) = IRad(q, p).

Minkowski norm: this seems simpler than cross entropy in the mathematical derivation for boosting.

$$H(X; q) = H(X) + D(p \,\|\, q) = -\sum_{x \in X} p(x)\log q(x)$$

$$D(p \,\|\, q) = \sum_{x \in X} p(x)\log\frac{p(x)}{q(x)}$$

$$\mathrm{IRad}(p, q) = D\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + D\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$$

$$L(p, q) = \sum_{x \in X}\bigl|p(x) - q(x)\bigr|$$
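For two Bernoulli distributions $p = (\bar{P}_{ij}, 1-\bar{P}_{ij})$ and $q = (P_{ij}, 1-P_{ij})$, these candidates can be compared with a small sketch (illustrative only; the $r$ parameter of the Minkowski norm is an assumption):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_radius(p, q):
    """IRad(p, q) = D(p || (p+q)/2) + D(q || (p+q)/2); symmetric in p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl_divergence(p, m) + kl_divergence(q, m)

def minkowski_norm(p, q, r=1):
    """L_r(p, q) = (sum_x |p(x) - q(x)|^r)^(1/r)."""
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

p = [0.9, 0.1]   # desired pair probability P̄_ij and its complement
q = [0.6, 0.4]   # model's pair probability P_ij and its complement
print(kl_divergence(p, q), information_radius(p, q), minkowski_norm(p, q))
```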

Page 8: Learning to Rank


Fidelity Loss Function

Fidelity: a more reasonable loss function, inspired by quantum computation.

It holds the same properties as the probabilistic ranking model proposed by Burges et al.

New properties: F(p, q) = F(q, p); the loss is bounded between 0 and 1; it attains the minimum loss value 0; the loss converges.

$$F(p, q) = 1 - \sum_{x \in X} \sqrt{p(x)\,q(x)}$$

$$F_{ij} = 1 - \left(\bar{P}_{ij}\cdot\frac{e^{o_{ij}}}{1 + e^{o_{ij}}}\right)^{1/2} - \left((1-\bar{P}_{ij})\cdot\frac{1}{1 + e^{o_{ij}}}\right)^{1/2}$$
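A minimal sketch of this pair-level fidelity loss, assuming $o_{ij}$ and the target $\bar{P}_{ij}$ are given:

```python
import math

def fidelity_pair_loss(o_ij: float, target_p: float) -> float:
    """F_ij = 1 - sqrt(target * P_ij) - sqrt((1 - target) * (1 - P_ij)),
    with P_ij = e^{o_ij} / (1 + e^{o_ij})."""
    p_ij = 1.0 / (1.0 + math.exp(-o_ij))
    return 1.0 - math.sqrt(target_p * p_ij) - math.sqrt((1.0 - target_p) * (1.0 - p_ij))

# The loss lies in [0, 1] and reaches 0 when P_ij matches the target exactly.
print(fidelity_pair_loss(0.0, 0.5))          # 0.0  (P_ij = 0.5 matches target 0.5)
print(fidelity_pair_loss(math.log(4), 0.8))  # ~0.0 (P_ij = 0.8 matches target 0.8)
print(fidelity_pair_loss(-5.0, 1.0))         # ~0.92 (confidently mis-ordered pair)
```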

Page 9: Learning to Rank


Fidelity Loss Function: Properties

Total loss function

Pair-level loss is considered, e.g., the loss between the rankings (5, 4, 3, 2, 1) and (4, 3, 2, 1, 0) is zero.

Query-level loss is also considered.

More penalty for pairs with a larger grade gap, e.g., (5, 0) versus (5, 4).

$$F_Q = \frac{1}{|Q|}\sum_{q}\frac{1}{\#\text{pairs in } q}\sum_{ij \in q} F_{ij}$$

          query1    query2    Loss
Case 1    1000      0         0.5
Case 2    990       10        0.005
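The query-level normalization above can be sketched as follows (the data layout and names are illustrative assumptions):

```python
def query_normalized_loss(losses_by_query):
    """F_Q = (1 / |Q|) * sum over queries q of (sum of pair losses in q) / (# pairs in q).

    `losses_by_query` maps a query id to the list of its pair-level fidelity losses,
    so every query contributes equally regardless of how many pairs it has.
    """
    per_query = [sum(losses) / len(losses) for losses in losses_by_query.values()]
    return sum(per_query) / len(per_query)

# A query with many pairs no longer dominates a query with few pairs.
print(query_normalized_loss({"q1": [1.0] * 1000, "q2": [0.0] * 10}))  # 0.5
print(query_normalized_loss({"q1": [0.2] * 1000, "q2": [0.4] * 10}))  # 0.3
```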

Page 10: Learning to Rank


Derivation for Additive Model

The total loss to minimize is

$$J = \sum_{q}\frac{1}{|Q|\cdot\#\text{pairs in } q}\sum_{ij \in q} F_{ij}.$$

With the additive update $f_k(x) = f_{k-1}(x) + \alpha_k h_k(x)$, and writing $f_{ij} = f_{k-1}(x_i) - f_{k-1}(x_j)$ and $h_{ij} = h_k(x_i) - h_k(x_j)$, each pairwise loss becomes

$$F_{ij}(\alpha_k) = 1 - \left(\bar{P}_{ij}\cdot\frac{e^{f_{ij} + \alpha_k h_{ij}}}{1 + e^{f_{ij} + \alpha_k h_{ij}}}\right)^{1/2} - \left((1-\bar{P}_{ij})\cdot\frac{1}{1 + e^{f_{ij} + \alpha_k h_{ij}}}\right)^{1/2}.$$

We denote

$$D(i, j) = \frac{1}{|Q|\cdot\#\text{pairs in } q},$$

so that

$$J(\alpha_k) = \sum_{ij} D(i, j)\left[1 - \left(\bar{P}_{ij}\cdot\frac{e^{f_{ij} + \alpha_k h_{ij}}}{1 + e^{f_{ij} + \alpha_k h_{ij}}}\right)^{1/2} - \left((1-\bar{P}_{ij})\cdot\frac{1}{1 + e^{f_{ij} + \alpha_k h_{ij}}}\right)^{1/2}\right].$$

Page 11: Learning to Rank


Derivation for Additive Model

Differentiating $J(\alpha_k)$ with respect to $\alpha_k$ (writing $z_{ij} = f_{ij} + \alpha_k h_{ij}$):

$$\frac{\partial J}{\partial \alpha_k} = \sum_{ij} D(i, j)\,\frac{h_{ij}}{2}\left[\frac{\sqrt{1-\bar{P}_{ij}}\;e^{z_{ij}} - \sqrt{\bar{P}_{ij}}\;e^{z_{ij}/2}}{\bigl(1 + e^{z_{ij}}\bigr)^{3/2}}\right].$$

Setting $\partial J / \partial \alpha_k = 0$, defining a pair weight $W_{ij}$ that collects the factors depending only on $D(i, j)$, $\bar{P}_{ij}$ and $f_{ij}$, and applying some relaxations, the optimal coefficient takes a closed form of the type

$$\alpha_k = \frac{1}{2}\ln\!\left(\,\cdot\,\right)$$

expressed in terms of the pair weights $W_{ij}$.

Page 12: Learning to Rank


FRank

1 | |{(( , ), )}, ... , {(( , ), )}ij iji j Q i jq x x P q x x P

1( , )

| | | # of pairs |q

D i jQ

1

( ) ( )T

t tt

H x h x

Algorithm: FRankGiven: ground truth

Initialize:

For t=1,2, …, T(a) For each weak learner candidate hi(x)

(a.1) Compute optimal αt,i

(a.2) Compute the fidelity loss(b) Choose the weak learner ht,i(x) with the minimal loss as ht(x)

(c) Choose the corresponding αt,i as αt

(d) Update pair weight by Wi,j

Output the final ranking
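A simplified sketch of this loop, assuming weak-learner candidates are given as per-document output arrays and that each candidate's α is chosen by a crude grid search rather than the closed-form solution derived above (both are assumptions of this sketch; pair weights stay implicit because the loss is recomputed from the current ensemble scores):

```python
import math

def fidelity_loss(pairs, scores, alpha, h_values):
    """Query-normalized fidelity loss of the ensemble after adding alpha * h."""
    total = 0.0
    for query_pairs in pairs.values():
        q_loss = 0.0
        for i, j, target_p in query_pairs:
            o = (scores[i] + alpha * h_values[i]) - (scores[j] + alpha * h_values[j])
            p = 1.0 / (1.0 + math.exp(-o))
            q_loss += 1.0 - math.sqrt(target_p * p) - math.sqrt((1.0 - target_p) * (1.0 - p))
        total += q_loss / len(query_pairs)
    return total / len(pairs)

def frank_train(pairs, candidates, num_docs, rounds=10):
    """pairs: {query: [(i, j, target_P_ij), ...]}; candidates: list of weak-learner
    output arrays h(x) indexed by document id."""
    alpha_grid = [0.05 * k for k in range(1, 41)]         # crude 1-D search for alpha
    scores = [0.0] * num_docs
    ensemble = []
    for _ in range(rounds):
        best = None
        for h_values in candidates:                       # (a) each weak-learner candidate
            for alpha in alpha_grid:                      # (a.1) alpha for this candidate
                loss = fidelity_loss(pairs, scores, alpha, h_values)  # (a.2) its loss
                if best is None or loss < best[0]:
                    best = (loss, alpha, h_values)
        _, alpha_t, h_t = best                            # (b), (c) keep the best pair
        ensemble.append((alpha_t, h_t))
        scores = [s + alpha_t * h for s, h in zip(scores, h_t)]   # update H(x)
    return ensemble
```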

Page 13: Learning to Rank


Implementation

Finished:
Threshold fast implementation: about 4 times faster
Alpha fast implementation: about 120 seconds faster per weak learner (3w)
Total-loss fast implementation: about 3 times faster
Resume training

Plan:
Multi-threaded implementation (parallel computation)
Margin consideration (fast, but with some loss)

Page 14: Learning to Rank


Preliminary Experimental Results

Data set: BestRank competition
Training data: about 2,500 queries
Validation data: about 1,100 queries
Features: 64 features

Evaluation: NDCG

Page 15: Learning to Rank


Preliminary Experimental Results

Results of Validation Data

Page 16: Learning to Rank


Next step

Page 17: Learning to Rank


Interesting Analogy

Loss function: pair-level loss; query-level loss; other considerations

Learning model: boosting (additive model), LogitBoost, Boosted Lasso, SVM, neural network

The whole new model: the dependence of retrieved web pages

Page 18: Learning to Rank


Pairwise Training

Ranking is reduced to a classification problem by using pairwise items as training samples.

This increases the data complexity from O(n) to O(n²).

Suppose there are n samples evenly distributed over k ranks; the total number of pairwise samples is then roughly n²/k (see the sketch below).

[Diagram: pairwise model F(x_i, x_j) taking a document pair vs. pointwise model F(x_i) taking a single document]
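As a concrete illustration of this blow-up, a small sketch that generates pairwise training samples from graded judgments (the data layout and the restriction to cross-grade pairs are assumptions of this sketch):

```python
from itertools import combinations

def make_pairs(docs):
    """docs: list of (feature_vector, grade). Returns ((x_i, x_j), P̄_ij = 1) samples
    where x_i has the strictly higher grade; same-grade pairs are skipped."""
    pairs = []
    for (x_a, g_a), (x_b, g_b) in combinations(docs, 2):
        if g_a > g_b:
            pairs.append(((x_a, x_b), 1.0))
        elif g_b > g_a:
            pairs.append(((x_b, x_a), 1.0))
    return pairs

# n = 6 documents on k = 3 grades -> 12 cross-grade pairs (vs. 6 pointwise samples).
docs = [([0.1], 2), ([0.4], 2), ([0.2], 1), ([0.3], 1), ([0.0], 0), ([0.5], 0)]
print(len(make_pairs(docs)))  # 12
```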

Page 19: Learning to Rank


Pairwise Training

F(x, y) is a more general function than F(x) − F(y). Find the properties that should be modeled by F(x, y):

Nonlinear relations between x and y, e.g., margin(r1, r30) > margin(r1, r10) > margin(r21, r30), …

Page 20: Learning to Rank


Pairwise Testing

In the testing phase, a ranking must be reconstructed from a partial-order graph, which may be inconsistent and incomplete.

Topological sorting can handle only a DAG (in linear time). Problems:

Inconsistency: how to find the best spanning tree

Incompleteness: how to deal with nodes without labels
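For the consistent and complete case, a minimal sketch that rebuilds a ranking from pairwise preferences via topological sorting (Kahn's algorithm); the inconsistent and incomplete cases are exactly the open problems listed above:

```python
from collections import defaultdict, deque

def rank_from_preferences(items, prefers):
    """prefers: list of (a, b) meaning 'a should rank above b'.
    Works only when the preference graph is a DAG that covers all items."""
    out_edges = defaultdict(list)
    in_degree = {x: 0 for x in items}
    for a, b in prefers:
        out_edges[a].append(b)
        in_degree[b] += 1
    queue = deque(x for x in items if in_degree[x] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in out_edges[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(items):
        raise ValueError("preference graph contains a cycle (inconsistent pairs)")
    return order

print(rank_from_preferences(["d1", "d2", "d3"],
                            [("d1", "d2"), ("d2", "d3"), ("d1", "d3")]))
# ['d1', 'd2', 'd3']
```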

Page 21: Learning to Rank


Pairwise Spanning Tree: Related Content

Colley’s Bias Free College Football Ranking Method

Tree Reconstruction via partial order

Page 22: Learning to Rank


Thank you for your attention. Q&A

Page 23: Learning to Rank


Additive Model AdaBoost

Construct a classifier H(x) from a linear combination of base classifiers h(x).

In order to obtain the optimal base classifiers {h_t(x)} and linear combination coefficients {α_t}, we need to minimize the training error.

For binary classification problems (labels +1 or −1), the training error of the classifier H(x) can be written as

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x) = H_{T-1}(x) + \alpha_T h_T(x)$$

$$err = \frac{1}{N}\sum_{i=1}^{N} I\bigl(\mathrm{sign}(H(x_i)) \ne y_i\bigr)$$

Page 24: Learning to Rank


Additive Model AdaBoost

For simplicity of computation, AdaBoost uses the exponential cost function as the objective.

The exponential cost function upper-bounds the training error err:

$$err \le \frac{1}{N}\sum_{i=1}^{N} e^{-y_i H_T(x_i)}
= \frac{1}{N}\sum_{i=1}^{N} e^{-y_i H_{T-1}(x_i)}\Bigl\{ e^{-\alpha_T}\, I\bigl(h_T(x_i) = y_i\bigr) + e^{\alpha_T}\, I\bigl(h_T(x_i) \ne y_i\bigr) \Bigr\}$$

where the function $I$ is defined as

$$I(\text{condition}) = \begin{cases} 1, & \text{if the condition holds} \\ 0, & \text{otherwise} \end{cases}$$

Page 25: Learning to Rank


Additive Model AdaBoost

By setting the derivative of the equation above with respect to αT to be zero, we have the expression as follows:

$$\alpha_T = \frac{1}{2}\ln\frac{\sum_{i=1}^{N} e^{-y_i H_{T-1}(x_i)}\, I\bigl(h_T(x_i) = y_i\bigr)}{\sum_{i=1}^{N} e^{-y_i H_{T-1}(x_i)}\, I\bigl(h_T(x_i) \ne y_i\bigr)}$$

With the data distribution defined as

$$W_i^{T} = \frac{e^{-y_i H_{T-1}(x_i)}}{\sum_{j=1}^{N} e^{-y_j H_{T-1}(x_j)}},$$

the linear combination coefficient $\alpha_T$ can be written as

$$\alpha_T = \frac{1}{2}\ln\frac{\sum_{i=1}^{N} W_i^{T}\, I\bigl(h_T(x_i) = y_i\bigr)}{\sum_{i=1}^{N} W_i^{T}\, I\bigl(h_T(x_i) \ne y_i\bigr)} = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_T}{\epsilon_T}\right),$$

where $\epsilon_T$ stands for the weighted error rate of the base classifier $h_T(x)$ under the weight distribution $W^{T}$ in iteration $T$.

Page 26: Learning to Rank


Additive Model (AdaBoost)

Given: $(x_i, y_i)$, $i = 1, \ldots, N$

1. Initialize: $W_i^1 = 1/N$
2. For t = 1, 2, ..., T:
   (a) Train a weak learner $h_t$ using the distribution $W^t$
   (b) Compute $\epsilon_t = \sum_{i=1}^{N} W_i^t\, I\bigl(h_t(x_i) \ne y_i\bigr)$
   (c) Compute $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
   (d) Update $W_i^{t+1} = W_i^t\, e^{-\alpha_t y_i h_t(x_i)}$ (then renormalize so that $\sum_i W_i^{t+1} = 1$)
3. Output $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
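A compact sketch of this procedure using threshold (decision-stump) weak learners; the stump representation and the tiny data set are assumptions of this sketch, not part of the slides:

```python
import math

def predict(stump, x):
    threshold, sign = stump
    return sign * (1 if x > threshold else -1)

def weighted_error(stump, X, y, w):
    return sum(wi for xi, yi, wi in zip(X, y, w) if predict(stump, xi) != yi)

def train_adaboost(X, y, stumps, rounds):
    """X: 1-D feature values, y: labels in {-1, +1},
    stumps: (threshold, sign) weak learners h(x) = sign if x > threshold else -sign."""
    n = len(X)
    w = [1.0 / n] * n                                        # 1. initialize W_i^1 = 1/N
    ensemble = []
    for _ in range(rounds):                                  # 2. for t = 1..T
        best = min(stumps, key=lambda s: weighted_error(s, X, y, w))   # (a)/(b)
        eps = weighted_error(best, X, y, w)                  # weighted error under w
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))  # (c) 1/2 ln((1-eps)/eps)
        ensemble.append((alpha, best))
        # (d) W_i <- W_i * exp(-alpha * y_i * h_t(x_i)), then renormalize
        w = [wi * math.exp(-alpha * y[i] * predict(best, X[i])) for i, wi in enumerate(w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble                                          # 3. H(x) = sign(sum alpha_t h_t(x))

# Tiny example: positives above 0.5, negatives below.
X, y = [0.1, 0.4, 0.6, 0.9], [-1, -1, 1, 1]
stumps = [(0.25, 1), (0.5, 1), (0.75, 1)]
model = train_adaboost(X, y, stumps, rounds=3)
score = lambda x: sum(a * predict(h, x) for a, h in model)
print([1 if score(x) > 0 else -1 for x in X])                # [-1, -1, 1, 1]
```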


Page 27: Learning to Rank


NDCG

K. Järvelin and J. Kekäläinen, ACM Transactions on Information Systems, 2002.

Example: assume that relevance scores 0–3 are used.

G' = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0, …>

Cumulated Gain (CG):

CG' = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16, …>

$$CG[i] = \begin{cases} G[1], & \text{if } i = 1 \\ CG[i-1] + G[i], & \text{otherwise} \end{cases}$$

Page 28: Learning to Rank


NDCG

Discounted Cumulated Gain (DCG), let b = 2:

DCG' = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, …>

Normalized Discounted Cumulated Gain (NDCG), using the ideal vector:

I' = <3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, …>

CGI' = <3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, …>

DCGI' = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, …>

NDCG' = <1, 0.83, 0.89, 0.73, 0.62, 0.6, 0.69, 0.76, 0.89, 0.84, …>

$$DCG[i] = \begin{cases} G[1], & \text{if } i = 1 \\ DCG[i-1] + G[i]/\log_b i, & \text{otherwise} \end{cases}$$
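A small sketch that recomputes these vectors from G' and I' with b = 2; the position-wise ratio CG'/CGI' reproduces the NDCG' vector listed above, while the DCG-based ratio is printed alongside for comparison:

```python
import math

def cg_vector(gains):
    """CG[i] = G[1] if i == 1 else CG[i-1] + G[i]."""
    out = []
    for g in gains:
        out.append(g if not out else out[-1] + g)
    return out

def dcg_vector(gains, b=2):
    """DCG[i] = G[1] if i == 1 else DCG[i-1] + G[i] / log_b(i)  (1-based i)."""
    out = []
    for i, g in enumerate(gains, start=1):
        out.append(float(g) if i == 1 else out[-1] + g / math.log(i, b))
    return out

G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
I = [3, 3, 3, 2, 2, 2, 1, 1, 1, 1]

print(cg_vector(G))                              # [3, 5, 8, 8, 8, 9, 11, 13, 16, 16]
print([round(v, 2) for v in dcg_vector(G)])      # [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, ...]

ncg = [c / ci for c, ci in zip(cg_vector(G), cg_vector(I))]
ndcg = [d / di for d, di in zip(dcg_vector(G), dcg_vector(I))]
print([round(v, 2) for v in ncg])                # [1.0, 0.83, 0.89, 0.73, 0.62, ...]
print([round(v, 2) for v in ndcg])               # [1.0, 0.83, 0.87, 0.78, 0.71, ...]
```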
