# Learning to Rank (Purdue University)


1. Learning to Rank

Learning to rank, or machine-learned ranking, is central to the construction of information retrieval systems. The following picture shows a general learning-to-rank framework.

In the picture above, q_1, q_2, q_3, …, q_n represent the search queries. Each query has a set of associated documents, represented in the picture by feature vectors labeled x_j^{(i)}. Feature vectors reflect the relevance of the documents to the query; typical features include the frequency of query terms in the document. The labels y^{(1)}, y^{(2)}, y^{(3)}, …, y^{(n)} give the preference label for each query-document pair. The goal is to build a model that predicts the ground-truth labels of test data as accurately as possible, in terms of a loss function. Generally speaking, ranking algorithms can be grouped into three approaches: the pointwise approach, the pairwise approach, and the listwise approach. This project explores multiple ranking algorithms across these three approaches in terms of accuracy and efficiency.
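As a rough illustration (not from the report), the sketch below contrasts the units over which the three approaches define their losses for a single query, using a squared error for pointwise, a RankNet-style logistic pair loss for pairwise, and a ListNet-style cross-entropy for listwise; all function names and values are illustrative.

```python
import math

# Illustrative losses for a single query, with model scores s and ground-truth
# relevance labels y. Pointwise scores each document alone, pairwise each
# ordered pair, and listwise the list as a whole.
def pointwise_loss(s, y):
    # per-document squared error (regression on the label)
    return sum((si - yi) ** 2 for si, yi in zip(s, y)) / len(s)

def pairwise_loss(s, y):
    # logistic loss on every pair where document i should outrank document j
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
    return sum(math.log1p(math.exp(-(s[i] - s[j]))) for i, j in pairs) / max(len(pairs), 1)

def listwise_loss(s, y):
    # cross-entropy between the softmax distributions of labels and scores
    def softmax(v):
        m = max(v)
        e = [math.exp(x - m) for x in v]
        z = sum(e)
        return [x / z for x in e]
    return -sum(p * math.log(q) for p, q in zip(softmax(y), softmax(s)))
```

For the same query, a model that orders the documents correctly drives the pairwise and listwise losses down, while the pointwise loss additionally demands calibrated score values.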

2. Datasets

The dataset used in this project is the "Yahoo! Learning to Rank Challenge" dataset. It consists of three subsets: training data, validation data, and test data. The data format for each subset is shown as follows [Chapelle and Chang, 2011]:

Each line has three parts: a relevance level, a query id, and a feature vector. There are five levels of relevance, from 0 (least relevant) to 4 (most relevant). Each query has a unique id, and there are 699 features in this dataset, labeled from 1 to 699. The value of each feature is normalized to a float in the [0, 1] range.
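Based on the format described above (relevance level, query id, then feature:value pairs in SVMlight/LETOR style), a minimal parser might look as follows; the exact line layout is an assumption here, so treat this as a sketch to be checked against the dataset's README.

```python
# Sketch of parsing one line of the dataset, assuming the SVMlight/LETOR-style
# layout "<relevance> qid:<id> <feature>:<value> ..." described above.
def parse_line(line):
    parts = line.strip().split()
    relevance = int(parts[0])            # 0 (least relevant) to 4 (most relevant)
    qid = parts[1].split(":", 1)[1]      # unique query id
    features = {}                        # sparse feature-id -> value map
    for token in parts[2:]:
        fid, value = token.split(":")
        features[int(fid)] = float(value)
    return relevance, qid, features

rel, qid, feats = parse_line("3 qid:10 1:0.5 7:0.25 699:1.0")
```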

3. Experiment Setup

3.1. Library and measurement function

In this project, I use RankLib to run all the experiments. There are many ways to measure the performance of a ranking model, such as ERR, DCG, NDCG, and MAP [Chapelle et al., 2009]. I choose Normalized Discounted Cumulative Gain (NDCG) as the measure function, which quantifies the usefulness, or gain, of a document based on its position in the result list. The equations to compute NDCG are:

NDCG_k = DCG_k / IDCG_k

where

DCG_k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log₂(i + 1)

IDCG_k = Σ_{i=1}^{|REL|} (2^{rel_i} − 1) / log₂(i + 1)

Here rel_i is the relevance level of the document at position i, and IDCG_k is the DCG of the ideal ordering, i.e. the documents sorted by decreasing relevance, with |REL| denoting that list truncated at position k.
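The NDCG formulas above can be sketched directly in Python (illustrative, not the RankLib implementation; `rels` is the list of relevance levels in the order the model ranked the documents):

```python
import math

# DCG@k with exponential gain: sum of (2^rel - 1) / log2(i + 1) over positions i.
def dcg(rels, k):
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))

# NDCG@k: DCG of the model's ordering divided by DCG of the ideal ordering.
def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0
```

A perfect ranking yields NDCG@k = 1, and pushing a highly relevant document down the list lowers the score.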

3.2. Expanding missing data

The original dataset has 699 features. However, an entry might not contain values for all of them; a missing feature value is indicated as NaN. This makes dimension reduction harder, because PCA cannot be applied to a dataset containing unknown values. So I need to fill in the missing data using the properties of the existing data. To do this, I first read 200 features, then select the features in which the proportion of NaN is no more than 50%, because if the proportion is too high the imputed values might not be accurate. The algorithm I use is Alternating Least Squares (ALS). This can easily be done in MATLAB with the following two lines of code:

[coeff,score,latent,tsquared,explained,mu] = pca(y,'algorithm','als');
t = score*coeff' + repmat(mu,size(y,1),1);

[Figure: data before expansion and data after expansion]

From these two pictures, we can see that after expansion all NaN values are replaced by predicted values, while the existing data remain unchanged.
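For readers without MATLAB, a rough NumPy analogue is sketched below. Note that this is iterative low-rank (SVD) imputation in the spirit of ALS, not MATLAB's exact `pca(...,'algorithm','als')` algorithm, and the rank and iteration count are illustrative choices:

```python
import numpy as np

# Fill NaN entries of X with a rank-r approximation fitted to the observed
# entries: initialize missing values with column means, then repeatedly
# refit a truncated SVD and overwrite only the missing positions.
def als_impute(X, r=5, iters=50):
    mask = ~np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    filled = np.where(mask, X, col_means)        # start NaNs at column means
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :r] * s[:r]) @ Vt[:r]     # current rank-r reconstruction
        filled = np.where(mask, X, approx)       # keep observed, update missing
    return filled
```

On exactly low-rank data this converges to the consistent completion, mirroring the report's observation that existing values are preserved while NaNs are replaced by predictions.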

3.3. Dimension reduction

Taking all the features in the dataset into account would be very time consuming, so it is necessary to reduce the dimension of the dataset. I use PCA for dimension reduction on the expanded data, keeping the first 20 principal components, which capture the largest variability. Before PCA, it is necessary to standardize the data. I use z-score standardization, which can easily be done in MATLAB with the following command:

X_train = zscore(X_train);

3.4. k-fold cross-validation

In this project, 10-fold cross-validation is used for parameter tuning. For each ranking algorithm, I first decide its major parameters and the ranges within which they can vary. Then 10-fold cross-validation is run on each set of parameters: the whole set is split into a training set (90%) and a validation set (10%). I first apply PCA to the training set; the PCA output is then used to project the validation set. This process is repeated ten times, and the mean and standard deviation of NDCG are returned. Considering the running time, I only use the first 2000 entries of the whole dataset.
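The per-fold procedure (fit z-score and PCA statistics on the 90% training split only, then project the 10% validation split with them) can be sketched in NumPy as follows; the random data here are purely illustrative:

```python
import numpy as np

# Standardize and project one train/validation fold: all statistics (mean,
# std, principal directions) come from the training split, as in Section 3.4.
def pca_fold(X_train, X_val, n_components=20):
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                   # guard against constant features
    Z_train = (X_train - mu) / sigma          # z-score from training statistics
    Z_val = (X_val - mu) / sigma              # same transform for validation
    _, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
    W = Vt[:n_components].T                   # top principal directions
    return Z_train @ W, Z_val @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
folds = np.array_split(rng.permutation(len(X)), 10)   # 10-fold index split
val_idx, train_idx = folds[0], np.concatenate(folds[1:])
P_train, P_val = pca_fold(X[train_idx], X[val_idx])
```

Fitting the transform on the training split alone avoids leaking validation information into the projection.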

3.5. Parameter tuning

Based on the introduction of each algorithm, I choose its hyperparameters. There are several ways to find the best parameters, such as grid search or random search, but direct parameter tuning would be very time consuming because I do not know in advance in which regions the best parameters lie. So I first do single-parameter tuning based on cross-validation. Then I choose some promising values for each parameter and run a grid search, that is, test all combinations of these values. The best parameters are determined using the bias-variance tradeoff. To clearly show the bias and variance, I draw a normal distribution curve for each set of parameters using the corresponding mean and variance of NDCG.
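The second stage, exhaustively testing all combinations of a few promising values, can be sketched as below; `evaluate` is a hypothetical stand-in for a cross-validated NDCG run, and the grid values and toy objective are illustrative:

```python
from itertools import product

# Exhaustive grid search: evaluate every combination of candidate values
# and keep the best-scoring parameter set.
def grid_search(evaluate, grid):
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        candidate = dict(zip(grid.keys(), values))
        score = evaluate(candidate)        # e.g. mean NDCG@10 over 10 folds
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

grid = {"log_lr": [-12, -10, -8], "layers": [1, 2, 3]}
best, score = grid_search(lambda p: -(p["log_lr"] + 10) ** 2 - p["layers"], grid)
```

The cost grows multiplicatively with each added parameter, which is why the report narrows the candidate values with single-parameter tuning first.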

3.6. Comparison among ranking algorithms

After obtaining the best parameters, I compare the performance of the algorithms under those parameters by testing them on the same unseen test data. I also gradually increase the size of the training set to see how the ranking accuracy changes, and use box plots to analyze how sensitive the different algorithms are to the dataset dimension.

4. Parameter tuning and k-fold cross-validation

4.1. RankNet

4.1.1. Introduction

RankNet is a pairwise ranking algorithm, which means its loss function is defined on pairs of documents or URLs. For a given query, each pair of documents U_i, U_j with different relevance levels is chosen, and y_i, y_j denote the scores computed by the ranking model. Let U_i ▷ U_j denote the event that y_i > y_j. A posterior probability is then defined as:

P_ij = P(U_i ▷ U_j) = 1 / (1 + e^(−(y_i − y_j)))

Then the cost function of this pair is defined as:

C_ij = −P̄_ij log P_ij − (1 − P̄_ij) log(1 − P_ij)

where P̄_ij is the known probability that U_i should be ranked higher than U_j. Let S_ij be 1 if U_i is labeled more relevant, −1 if U_j is labeled more relevant, and 0 when U_i and U_j have the same relevance label. Then P̄_ij can be written as:

P̄_ij = (1/2)(1 + S_ij)

Therefore, the cost function becomes:

C_ij = (1/2)(1 − S_ij)(y_i − y_j) + log(1 + e^(−(y_i − y_j)))

It is easy to see that the cost is large when the model predicts U_i ▷ U_j while U_j is actually known to be more relevant. The idea of learning via gradient descent is to update the weights (model parameters) with the gradient of the cost function:

w_k ← w_k − η ( (∂C_ij/∂y_i)(∂y_i/∂w_k) + (∂C_ij/∂y_j)(∂y_j/∂w_k) )

where η is a positive learning rate. A neural network is then used to optimize the ranking model, minimizing the cost function with the back-propagation equations [Orr and Müller, 2003]. Typically, the accuracy of the trained model is affected by the learning rate, the number of hidden layers, and the number of hidden nodes per layer.
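A single gradient step of this pairwise cost can be sketched for a linear scoring model y = w·x; the real RankNet uses a neural network, and the linear model and values here are purely illustrative:

```python
import numpy as np

# One gradient step on the pairwise cost
#   C = 0.5*(1 - S_ij)*(y_i - y_j) + log(1 + exp(-(y_i - y_j)))
# for a linear scorer y = w @ x, so dy_i/dw = x_i and dy_j/dw = x_j.
def ranknet_step(w, x_i, x_j, S_ij, eta=0.01):
    y_i, y_j = w @ x_i, w @ x_j
    diff = y_i - y_j
    cost = 0.5 * (1 - S_ij) * diff + np.log1p(np.exp(-diff))
    # dC/dy_i = 0.5*(1 - S_ij) - 1/(1 + e^diff); dC/dy_j is its negative
    lam = 0.5 * (1 - S_ij) - 1.0 / (1.0 + np.exp(diff))
    w = w - eta * lam * (x_i - x_j)
    return w, cost
```

Repeating the step with S_ij = 1 pushes the score of U_i above that of U_j, driving the pair's cost toward zero.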

4.1.2. Single parameter tuning

[Figure: RankNet single-parameter tuning curves. Panels plot the mean of (1 − NDCG@10) and the standard deviation of NDCG@10, for training and validation, against log(learning rate) from −14 to −6, the number of hidden layers from 1 to 10, and the number of nodes per layer from 5 to 25.]

4.1.3. Grid search

The promising values I chose based on the results above are listed below.