Learning to Rank - Purdue University
1. Learning to Rank
Learning to rank, also called machine-learned ranking, plays a central role in building information retrieval systems. The following picture shows a general learning to rank framework.
In the picture above, $q_1, q_2, q_3, \ldots, q_n$ represent the search queries. Each query has a set of associated documents, represented by feature vectors in this picture and labeled as $x_{ij}$. Feature vectors reflect the relevance of the documents to the query. Typical features include the frequency of the query terms in the document. $y^{(1)}, y^{(2)}, y^{(3)}, \ldots, y^{(m)}$ represent the preference labels for each query-document pair. The goal is to build a model that predicts the ground-truth labels of the test data as accurately as possible, as measured by a loss function. Generally speaking, ranking algorithms can be grouped into three approaches: the pointwise approach, the pairwise approach and the listwise approach. This project explores multiple ranking algorithms across these three approaches in terms of accuracy and efficiency.
2. Datasets
The dataset I use in this project is the “Yahoo! Learning to Rank Challenge” dataset. It consists of three subsets: training data, validation data and test data. The data format for each subset is shown as follows [Chapelle and Chang, 2011]:
Each line has three parts: a relevance level, a query id and a feature vector. There are five levels of relevance, from 0 (least relevant) to 4 (most relevant). Each query has a unique id, and there are 699 features in this dataset, labeled from 1 to 699. The value of each feature is normalized to a float in the [0,1] range.
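The three-part line layout can be illustrated with a small parser. This is only a sketch: it assumes an SVMlight-style line of the form `<relevance> qid:<id> <feature>:<value> ...`, which is not spelled out in the text above, and the name `parse_line` and the sample line are made up for illustration.

```python
import math

def parse_line(line, num_features=699):
    """Parse '<relevance> qid:<id> <fid>:<value> ...' into (relevance, qid, features)."""
    tokens = line.split()
    relevance = int(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    # Features that do not appear on the line are marked as NaN (missing).
    features = [math.nan] * num_features
    for token in tokens[2:]:
        fid, value = token.split(":", 1)
        features[int(fid) - 1] = float(value)  # feature ids start at 1
    return relevance, qid, features

# Made-up example line, not taken from the dataset.
rel, qid, feats = parse_line("2 qid:17 1:0.25 3:0.9 699:0.5")
```

Features that are absent from a line come back as NaN here, which matches the way missing values are described later in the report.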
3. Experiments Setup
3.1. Library and measurement function
In this project, I use RankLib for all the experiments. There are many ways to measure the performance of a ranking model, such as ERR, DCG, NDCG and MAP [Chapelle et al., 2009]. I choose Normalized Discounted Cumulative Gain (NDCG) as the measure function, which measures the usefulness, or gain, of a document based on its position in the result list. The equations to compute NDCG are:
$$NDCG_p = \frac{DCG_p}{IDCG_p}$$
where:
$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$
$$IDCG_p = \sum_{i=1}^{|REL|} \frac{2^{rel_i} - 1}{\log_2 (i + 1)}$$
Here $rel_i$ is the relevance level of the document at position $i$ and $|REL|$ is the size of the ideal list of documents ordered by relevance.
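As a rough illustration of how NDCG at a cutoff can be computed, the sketch below uses the exponential gain $2^{rel_i} - 1$ with the $\log_2(i + 1)$ discount for both DCG and IDCG. That is an assumption made here so that a perfect ranking scores exactly 1; it is one common variant and not necessarily the exact computation used in the experiments. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """DCG with gain 2^rel - 1 and discount log2(i + 1) for positions i = 1..k."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """relevances: labels of the documents in the order the model ranked them."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([4, 3, 2, 0], k=10)   # already sorted by relevance
reversed_order = ndcg_at_k([0, 2, 3, 4], k=10)
```

A list that is already sorted by relevance gets the maximum score, and any other ordering of the same labels scores lower.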
3.2. Expanding missing data
The original dataset has 700 features. However, an entry might not contain values for all features. A missing feature value is indicated as NaN. This makes dimension reduction harder, because it cannot be applied to a dataset containing unknown values. So I need to fill in the missing data by using the properties of the existing data. To do this, I first read 200 features. Then I select the features in which the proportion of NaN is no more than 50%, because if the proportion is too high, the predicted values might not be accurate. The algorithm I use is called Alternating Least Squares (ALS). This process can easily be done in MATLAB with the following two lines of code:
[coeff,score,latent,tsquared,explained,mu] = pca(y,'algorithm','als'); % PCA via ALS accepts NaN entries
t = score*coeff' + repmat(mu,size(y,1),1); % reconstruct the data; size(y,1) is the number of entries
[Figures: data before expansion; data after expansion]
From these two pictures, we can see that after expansion, all NaN values are replaced by predicted data, while the existing data remain the same.
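The feature selection step that precedes ALS (keep a feature only if at most 50% of its values are NaN) can be sketched as follows. The function name `select_features` and the toy matrix are illustrative assumptions; the ALS step itself is not reproduced here.

```python
import math

def select_features(rows, max_nan_fraction=0.5):
    """Return indices of columns whose share of NaN values is at most max_nan_fraction."""
    n_rows = len(rows)
    keep = []
    for col in range(len(rows[0])):
        n_missing = sum(1 for row in rows if math.isnan(row[col]))
        if n_missing / n_rows <= max_nan_fraction:
            keep.append(col)
    return keep

nan = math.nan
# Toy data: columns 1 and 3 are 75% missing, columns 0 and 2 are mostly present.
data = [
    [0.1, nan, 0.3, nan],
    [0.2, nan, nan, 0.8],
    [0.4, 0.5, 0.6, nan],
    [0.7, nan, 0.9, nan],
]
kept = select_features(data)
```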
3.3. Dimension reduction
Taking all the features in the dataset into account would be very time consuming; therefore, it is necessary to reduce the dimension of the dataset. I use PCA to do the dimension reduction on the expanded data. The first 20 principal components, those with the largest variance, are kept as the new features. Before PCA, it is necessary to standardize the data. I use Z-score standardization to do that, which can easily be done in MATLAB with the following command:
X_train=zscore(X_train);
3.4. k-fold cross-validation
In this project, 10-fold cross-validation is used for parameter tuning. For each ranking algorithm, I first decide its major parameters and the range within which they can change. Then, a 10-fold cross-validation is run on each set of parameters. In each fold, the whole set is split into two sets, a training set (90%) and a validation set (10%). I first apply PCA to the training set; the output of PCA is then used to project the validation set. This process is repeated ten times, and the mean and standard deviation of NDCG are returned. Considering the running time, I only use the first 2000 entries of the whole dataset.
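The fold handling described above can be sketched as follows. This is a simplified illustration under stated assumptions: folds are contiguous blocks, `evaluate` is a hypothetical callback standing in for "fit the preprocessing and the ranker on the training fold, then score the validation fold", and `toy_evaluate` uses a z-score on a single column instead of PCA and NDCG.

```python
import statistics

def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds and yield (train_idx, val_idx)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def cross_validate(data, evaluate, k=10):
    """evaluate(train, val) -> score; returns mean and std dev of the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        scores.append(evaluate(train, val))
    return statistics.mean(scores), statistics.stdev(scores)

def toy_evaluate(train, val):
    # Stand-in for the real pipeline: statistics come from the training fold only
    # and are then applied to the validation fold.
    mu = statistics.mean(train)
    sigma = statistics.stdev(train)
    scaled = [(x - mu) / sigma for x in val]
    return statistics.mean(scaled)

folds = list(k_fold_indices(2000, k=10))
mean_score, std_score = cross_validate([float(i % 7) for i in range(2000)], toy_evaluate)
```

The point of the sketch is that the preprocessing is fitted on the 90% training part only, so no information from the validation fold leaks into the transformation.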
3.5. Parameters tuning
Based on the introduction of each algorithm, I choose its hyper-parameters. There are several ways to find the best parameters, such as grid search or random search. But a direct grid search over all parameters would be very time consuming, because I do not know in advance in which regions the best parameters lie. So I first tune each parameter separately based on cross-validation. Then I choose a few promising values for each parameter and do a grid search, that is, I test all combinations of these values. The best parameters are determined by the bias-variance tradeoff. To clearly show the bias and variance, I draw a normal distribution curve for each set of parameters by using the corresponding mean and variance of NDCG.
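The second stage (grid search over a few candidate values, then one normal curve per combination) might look like the sketch below. Everything named here is an assumption for illustration: `fake_cv` is a made-up stand-in for a real 10-fold cross-validation run, and its numbers are not experimental results.

```python
import itertools
from statistics import NormalDist

def grid_search(candidates, cv_result):
    """candidates: dict name -> list of values; cv_result(params) -> (mean_error, std)."""
    names = list(candidates)
    results = {}
    for combo in itertools.product(*(candidates[n] for n in names)):
        results[combo] = cv_result(dict(zip(names, combo)))
    return results

def normal_curve(mean, std, xs):
    """Density values used to draw one curve per parameter combination."""
    dist = NormalDist(mean, std)
    return [dist.pdf(x) for x in xs]

def fake_cv(params):
    # Hypothetical stand-in for cross-validation; returns (mean of 1-NDCG, std dev).
    mean_error = 0.25 - 0.001 * params["layers"] - 0.0005 * params["nodes"]
    return mean_error, 0.03

grid = grid_search({"layers": [3, 6, 10], "nodes": [13, 17, 25]}, fake_cv)
best = min(grid, key=lambda combo: grid[combo][0])  # smallest mean error (bias)
curve = normal_curve(*grid[best], xs=[0.1, 0.2, 0.3])
```

Here the best combination is picked by the smallest mean error only; in the report the spread of each curve is also taken into account when choosing.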
3.6. Comparison among ranking algorithms
After getting the best parameters, I compare the performance of the algorithms under those parameters by testing them on the same unseen test data. I also gradually increase the size of the training set to see how the ranking accuracy changes, and I use box plots to analyze the sensitivity of the different algorithms with respect to the dataset.
4. Parameter tuning and k-fold cross-validation
4.1. RankNet
4.1.1. Introduction
RankNet is a pairwise ranking algorithm, which means its loss function is defined on a pair of documents or URLs. For a given query, each pair of documents or URLs $U_i$, $U_j$ with different relevance levels is chosen. Let $y_i$, $y_j$ be the labels computed by the ranking model, and let $U_i \rhd U_j$ denote the event that $y_i > y_j$. Then a posterior is defined as:
$$P_{ij} = P(U_i \rhd U_j) = \frac{1}{1 + e^{-(y_i - y_j)}}$$
Then, define the cost function of this pair to be:
$$C_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log(1 - P_{ij})$$
$\bar{P}_{ij}$ is the known probability that $U_i$ is ranked higher than $U_j$. Let $S_{ij}$ be defined as 1 if $U_i$ is labeled more relevant, -1 if $U_j$ is labeled more relevant and 0 when $U_i$ and $U_j$ are labeled with the same relevance. Then $\bar{P}_{ij}$ can be defined as:
$$\bar{P}_{ij} = \frac{1}{2}(1 + S_{ij})$$
Therefore, the cost function becomes:
$$C_{ij} = \frac{1}{2}(1 - S_{ij})(y_i - y_j) + \log(1 + e^{-(y_i - y_j)})$$
It is easy to see that the cost is larger when the model predicts $U_i \rhd U_j$ while $U_j$ is actually known to be more relevant. The idea of learning via gradient descent is to update the weights (model parameters) with the gradient of the cost function:
$$w_k \to w_k - \eta \left( \frac{\partial C_{ij}}{\partial y_i} \frac{\partial y_i}{\partial w_k} + \frac{\partial C_{ij}}{\partial y_j} \frac{\partial y_j}{\partial w_k} \right)$$
where $\eta$ is a positive learning rate. A neural network is then used as the ranking model and is trained with the back-propagation equations to minimize the cost function [Orr and Müller, 2003]. Typically, the accuracy of the trained model is affected by the learning rate, the number of hidden layers and the number of hidden nodes per layer.
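The pairwise cost and one gradient step can be checked numerically with a short sketch. To keep it small it assumes a linear scorer $y = w \cdot x$ instead of a neural network with hidden layers, so it illustrates the cost and the update rule rather than the full RankNet model; all function names are illustrative.

```python
import math

def ranknet_cost(y_i, y_j, s_ij):
    """C_ij = 0.5 * (1 - S_ij) * (y_i - y_j) + log(1 + exp(-(y_i - y_j)))."""
    diff = y_i - y_j
    return 0.5 * (1 - s_ij) * diff + math.log1p(math.exp(-diff))

def ranknet_cost_cross_entropy(y_i, y_j, s_ij):
    """The same cost written as a cross entropy between target and predicted probability."""
    p = 1.0 / (1.0 + math.exp(-(y_i - y_j)))
    p_bar = 0.5 * (1 + s_ij)
    return -p_bar * math.log(p) - (1 - p_bar) * math.log(1 - p)

def sgd_step(w, x_i, x_j, s_ij, eta):
    """One weight update for a linear scorer y = w . x (assumption: no hidden layers)."""
    y_i = sum(a * b for a, b in zip(w, x_i))
    y_j = sum(a * b for a, b in zip(w, x_j))
    # dC/dy_i = 0.5 * (1 - S_ij) - 1 / (1 + exp(y_i - y_j)), and dC/dy_j = -dC/dy_i.
    dc_dyi = 0.5 * (1 - s_ij) - 1.0 / (1.0 + math.exp(y_i - y_j))
    # For a linear scorer, dy_i/dw_k = x_i[k] and dy_j/dw_k = x_j[k].
    return [wk - eta * dc_dyi * (xi - xj) for wk, xi, xj in zip(w, x_i, x_j)]

w0 = [0.0, 0.0]
x_i, x_j = [1.0, 0.0], [0.0, 1.0]   # U_i is labeled more relevant, so S_ij = 1
cost_before = ranknet_cost(0.0, 0.0, 1)
w1 = sgd_step(w0, x_i, x_j, s_ij=1, eta=0.1)
cost_after = ranknet_cost(w1[0], w1[1], 1)  # here y_i = w1[0] and y_j = w1[1]
```

The two cost functions agree numerically, which mirrors the algebraic step from the cross-entropy form to the simplified form, and a single update lowers the cost of the pair.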
4.1.2. Single parameter tuning
[Figures: RankNet single parameter tuning. Each panel plots the mean of (1-NDCG@10) or the standard deviation of NDCG@10 on the training and validation sets against log(learning rate), the number of hidden layers, or the number of nodes per layer.]
4.1.3. Grid search
The candidate values I chose based on the results above are listed below.
Parameter         Candidate values
Learning rate     0.000025, 0.0016
Hidden layers     3, 6, 10
Nodes per layer   13, 17, 25
[Figure: normal distribution curves for all combinations of (hidden layers, nodes per layer, learning rate).]
This picture shows the bias and variance of all combinations of parameters. We can see that the variance and bias cannot both be the smallest simultaneously. I finally choose (3,25,0.0016) as the best parameter set because it gives the smallest bias, while its variance is not that large.
4.2. MART
4.2.1. Introduction
Multiple Additive Regression Trees (MART) is an algorithm that uses gradient boosted decision trees for prediction tasks [Friedman, 2001]. Its output function $F_N(x)$ is a linear combination of a set of regression trees:
$$F_N(x) = \sum_{i=1}^{N} \alpha_i f_i(x)$$
where $f_i(x)$ is the output function of a single regression tree and $\alpha_i$ is its weight. The cost $C$ is then viewed as a function of the output function:
$$C = C_0 + \frac{\partial C}{\partial F} \delta F$$
To reduce the cost $C$, $\delta F \propto -\eta \frac{\partial C}{\partial F}$ for a suitable step size, which is then used as the weight. The step size for the $j$th leaf node of the $m$th tree is:
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} \log\left(1 + e^{-2 y_i (F_{m-1}(x_i) + \gamma)}\right)$$
The accuracy of MART is affected by the number of regression trees, the number of leaf nodes per tree and the learning rate [Wu et al., 2010].
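The additive structure $F_N(x) = \sum_i \alpha_i f_i(x)$ can be illustrated with a very small boosting loop. This sketch makes several simplifying assumptions that differ from the description above: it uses squared error instead of the logistic loss in the step size formula, each tree has only two leaves, there is a single input feature, and the learning rate plays the role of the weight of every tree.

```python
def fit_stump(xs, residuals):
    """Best single-split regression tree (two leaves) on one feature, least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # exclude the maximum so both leaves are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """F(x) = sum of learning_rate * f_i(x); each f_i is fitted to the negative gradient."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * t(x) for t in trees)

# Toy data, made up for illustration.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
model = boost(xs, ys)
baseline_mse = sum(y ** 2 for y in ys) / len(ys)  # error of the all-zero start
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

The number of trees and the learning rate in this toy loop correspond to two of the three hyper-parameters tuned for MART; the number of leaves is fixed at two here.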
4.2.2. Single parameter tuning
[Figures: MART single parameter tuning. Each panel plots the mean of (1-NDCG@10) or the standard deviation of NDCG@10 on the training and validation sets against the number of trees, the number of leaves for each tree, or the learning rate.]
To avoid over-fitting, it is suggested that the learning rate be no more than 0.1 [Li et al., 2007].
4.2.3. Grid search
The candidate values I chose based on the results above are listed below.
Parameter              Candidate values
Trees                  500, 700, 1800
Leaves for each tree   13, 18
Learning rate          0.06, 0.09
[Figure: normal distribution curves for all combinations of (trees, leaves for each tree, learning rate).]
This picture shows the bias and variance of all combinations of parameters. We can see that the variance and bias cannot both be the smallest simultaneously. Therefore, I choose (700,18,0.06) as the best parameters because this set has the smallest bias and a quite small variance.
4.3. LambdaMART
4.3.1. Introduction
An algorithm called LambdaRank is obtained by specifying the gradients directly instead of deriving them from a cost function. After the gradients, denoted as $\lambda_{ij}$, are calculated, there exists an implicit cost function $C_{ij}$ such that:
$$\lambda_{ij} = \frac{\partial C_{ij}(y_i - y_j)}{\partial y_i}$$
And the weight update becomes:
$$w_k \to w_k + \eta \frac{\partial C_{ij}}{\partial w_k}$$
LambdaMART is an algorithm that combines LambdaRank and MART. For each pair $U_i$, $U_j$ with $U_i \rhd U_j$, the gradients are defined as:
$$\lambda_{ij} = \frac{-\Delta Z_{ij}}{1 + e^{y_i - y_j}}$$
where $\Delta Z_{ij}$ is the change in the ranking measure (for example NDCG) caused by swapping the rank positions of $U_i$ and $U_j$. And the cost function becomes:
$$C = \sum_{\{i,j\} \rightleftarrows I} \Delta Z_{ij} \log\left(1 + e^{-(y_i - y_j)}\right)$$
where the sum runs over the pairs $\{i,j\}$ with different relevance levels. The step size can be computed by using a Newton approximation. Similarly to MART, the accuracy of LambdaMART also depends on the number of regression trees, the number of leaf nodes per tree and the learning rate.
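How the pairwise gradients accumulate per document can be sketched for a single query. The sketch assumes that $Z$ is NDCG with gain $2^{rel} - 1$, uses the magnitude of the change from a swap, and sums $\lambda_{ij}$ into one value per document; these choices and all names are assumptions made for illustration.

```python
import math

def dcg_gain(rel, position):
    """Contribution to DCG of a document with label rel at a 1-based position."""
    return (2 ** rel - 1) / math.log2(position + 1)

def lambdas_for_query(scores, labels):
    """Accumulate one lambda per document from all pairs with labels[i] > labels[j]."""
    n = len(scores)
    order = sorted(range(n), key=lambda d: -scores[d])
    position = {doc: rank + 1 for rank, doc in enumerate(order)}
    ideal = sum(dcg_gain(rel, pos + 1)
                for pos, rel in enumerate(sorted(labels, reverse=True)))
    lambdas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue
            # Magnitude of the NDCG change if documents i and j swapped rank positions.
            before = dcg_gain(labels[i], position[i]) + dcg_gain(labels[j], position[j])
            after = dcg_gain(labels[i], position[j]) + dcg_gain(labels[j], position[i])
            delta = abs(after - before) / ideal
            lam = -delta / (1.0 + math.exp(scores[i] - scores[j]))
            lambdas[i] += lam   # gradient of the implicit cost w.r.t. the score of i
            lambdas[j] -= lam   # the less relevant document gets the opposite sign
    return lambdas

# The relevant document (label 2) is currently scored below the irrelevant one.
lams = lambdas_for_query(scores=[0.0, 1.0], labels=[2, 0])
```

With this sign convention a negative value for the more relevant document means that raising its score lowers the implicit cost.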
4.3.2. Single parameter tuning
[Figures: LambdaMART single parameter tuning. Each panel plots the mean of (1-NDCG@10) or the standard deviation of NDCG@10 on the training and validation sets against the number of trees, the number of leaves per tree, or the learning rate.]
4.3.3. Grid search
The candidate values I chose based on the results above are listed below.
Parameter       Candidate values
Trees           1900
Leaves          5, 14, 20
Learning rate   0.03, 0.06, 0.09
[Figure: normal distribution curves for all combinations of (trees, leaves, learning rate).]
This picture shows the bias and variance of all combinations of parameters. We can see that the variance and bias cannot both be the smallest simultaneously. Since the variance overall is not so large, I choose (1900,14,0.03) as the best parameters because this set has the smallest bias.
4.4. RankBoost
RankBoost is a pairwise ranking algorithm. Its idea is to transform the ranking problem into a binary classification problem on instance pairs, and then to adopt a boosting approach. It trains one weak ranker at each iteration and then combines these rankers into the final ranking function. The hyper-parameter I choose for RankBoost is the number of threshold candidates to search. Since the total number of features after reduction is 20, the threshold should also be no more than 20. The results below show that 13 is a good value for the threshold.
[Figures: RankBoost (threshold). The panels plot the mean of (1-NDCG@10) and the standard deviation of NDCG@10 on the training and validation sets against the threshold.]
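What "threshold candidates" means for a weak ranker can be sketched as follows. The sketch assumes a weak ranker of the form $h(x) = 1$ if a feature value exceeds a threshold and 0 otherwise, evenly spaced candidate thresholds, and a selection score $r$ that weighs how often the better document of a pair is placed above the worse one. Only the selection of one weak ranker is shown, not the full boosting loop, and the names and toy data are illustrative.

```python
def candidate_thresholds(values, n_candidates):
    """n_candidates evenly spaced thresholds strictly between min and max of a feature."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_candidates + 1)
    return [lo + step * (t + 1) for t in range(n_candidates)]

def best_weak_ranker(feature, pairs, weights, n_candidates):
    """Pick the threshold maximizing |r|, r = sum of w * (h(better) - h(worse))."""
    best_theta, best_r = None, 0.0
    for theta in candidate_thresholds(feature, n_candidates):
        h = [1.0 if x > theta else 0.0 for x in feature]
        r = sum(w * (h[better] - h[worse])
                for (worse, better), w in zip(pairs, weights))
        if abs(r) > abs(best_r):
            best_theta, best_r = theta, r
    return best_theta, best_r

# Toy feature values for four documents and (worse, better) index pairs.
feature = [0.1, 0.4, 0.35, 0.8]
pairs = [(0, 1), (0, 3), (2, 3), (2, 1)]
weights = [0.25] * 4   # uniform distribution over the pairs
theta, r = best_weak_ranker(feature, pairs, weights, n_candidates=3)
```

More candidates give a finer search over thresholds at a higher cost per iteration, which is the tradeoff behind tuning this hyper-parameter.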
4.5. ListNet
ListNet is a listwise ranking algorithm that optimizes a listwise loss function based on the top-one probability, with a neural network as the model and gradient descent as the optimization algorithm. The learning rate is chosen as the hyper-parameter. The results below show that a good learning rate is 0.00001.
[Figures: ListNet (learning rate). The panels plot the mean of (1-NDCG@10) and the standard deviation of NDCG@10 on the training and validation sets against the learning rate.]
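The listwise loss based on the top-one probability can be sketched as a cross entropy between two softmax distributions, one built from the labels and one from the model scores. This is a minimal illustration of the loss only; the neural network and its training loop are left out, and the function names are assumptions.

```python
import math

def top_one_probabilities(values):
    """Softmax: probability of each document being ranked first."""
    shift = max(values)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listnet_loss(scores, labels):
    """Cross entropy between the top-one distributions of the labels and of the scores."""
    target = top_one_probabilities(labels)
    predicted = top_one_probabilities(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

# Scores that agree with the labels give a lower loss than reversed scores.
good = listnet_loss(scores=[3.0, 1.0, 0.0], labels=[4, 2, 0])
bad = listnet_loss(scores=[0.0, 1.0, 3.0], labels=[4, 2, 0])
```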
4.6. Random Forests
4.6.1. Introduction
Random forests is an ensemble learning method for classification, regression and other tasks. It constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The hyper-parameters I choose for random forests include the number of bags, the number of trees in each bag, the number of leaves for each tree and the learning rate.
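The bagging idea behind the "number of bags" parameter can be sketched with bootstrap samples and averaged predictions. This is a generic illustration under simplifying assumptions: each bag holds a single two-leaf regression tree on one feature, whereas the setup above uses several trees per bag with many leaves. All names and the toy data are made up.

```python
import random
import statistics

def fit_stump(sample):
    """One-split regression tree on (x, y) pairs; a constant if no split exists."""
    xs = sorted({x for x, _ in sample})
    overall = statistics.mean(y for _, y in sample)
    best = (float("inf"), None, overall, overall)
    for threshold in xs[:-1]:
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        lm, rm = statistics.mean(left), statistics.mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    if threshold is None:
        return lambda x: overall
    return lambda x: lm if x <= threshold else rm

def fit_bagged_trees(data, n_bags, seed=0):
    """Train one tree per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_bags):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        trees.append(fit_stump(sample))
    return lambda x: statistics.mean(tree(x) for tree in trees)

data = [(0.1, 0.0), (0.2, 0.0), (0.3, 1.0), (0.6, 3.0), (0.8, 4.0), (0.9, 4.0)]
forest = fit_bagged_trees(data, n_bags=25)
```

Averaging over bags reduces the variance of the individual trees, which is why the number of bags is one of the tuned parameters.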
4.6.2. Single parameter tuning
[Figures: Random Forests single parameter tuning. Each panel plots the mean of (1-NDCG@10) or the standard deviation of NDCG@10 on the training and validation sets against the number of bags, the number of trees in each bag, the number of leaves for each tree, or log(learning rate).]
4.6.3. Grid search
The candidate values I chose based on the results above are listed below.
Parameter           Candidate values
Bags                220, 420
Trees in each bag   7, 10
Leaves              60, 240
Learning rate       0.025, 1.6
[Figure: normal distribution curves for all combinations of (bags, trees in each bag, leaves, learning rate).]
From this figure, we can find that the best parameter set is (220,7,60,1.6).
5. Comparison of ranking algorithms
5.1. Accuracy comparison
Using the results from parameter tuning, I can now test the performance of these algorithms on unseen test data. I gradually increase the training size to see how their performance changes.
[Figures: 1-NDCG versus training size, and runtime (sec) versus training size, for MART, RankNet, RankBoost, LambdaMART, ListNet and Random Forests.]
From these two figures, we can find that RankNet and ListNet are the two worst ranking algorithms in both ranking accuracy and running time. If we only consider the accuracy, Random Forests is the best algorithm. If we also consider the running time, MART and Random Forests are the two best algorithms.
5.2. Sensitivity analysis
The process is very similar to cross-validation. I first divide the whole dataset into 20 subsets. Then I leave one subset out and train the model on the remaining 19 subsets. Then I use the trained model to calculate NDCG@10 and MAP on the same test data. Finally, I use box plots to analyze the results.
[Figures: box plots of MAP and NDCG@10 for MART, RankNet, RankBoost, LambdaMART, ListNet and Random Forests.]
From these two pictures, we can find that LambdaMART and ListNet are less sensitive because they have smaller interquartile ranges (IQR). But based on the position of its box, ListNet has pretty bad performance. On the other hand, the results of Random Forests and MART not only show good performance but are also quite concentrated. Therefore, I think they are the best two ranking algorithms.
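The leave-one-subset-out procedure and the IQR used to read the box plots can be sketched as follows. The subsets are assumed to be contiguous blocks, and the two score lists at the end are made-up numbers that only show how a narrower spread gives a smaller IQR; they are not results from the experiments.

```python
import statistics

def leave_one_subset_out(data, n_subsets=20):
    """Yield the training data with one of n_subsets contiguous subsets removed."""
    size = len(data) // n_subsets
    for s in range(n_subsets):
        start = s * size
        stop = start + size if s < n_subsets - 1 else len(data)
        yield data[:start] + data[stop:]

def interquartile_range(scores):
    """IQR = Q3 - Q1, the height of the box in a box plot."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return q3 - q1

train_sets = list(leave_one_subset_out(list(range(100))))

# Hypothetical scores of two models over repeated runs.
narrow = [1, 2, 3, 4, 5]
wide = [1, 3, 5, 7, 9]
less_sensitive = interquartile_range(narrow) < interquartile_range(wide)
```

Each of the 20 training sets yields one NDCG@10 and one MAP value per algorithm, and the IQR of those 20 values is what the box height shows.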
6. Reference
1) Chapelle, O., and Y. Chang (2011), Yahoo! Learning to Rank Challenge Overview, paper presented at Yahoo! Learning to Rank Challenge.
2) Chapelle, O., D. Metzler, Y. Zhang, and P. Grinspan (2009), Expected reciprocal rank for graded relevance, paper presented at Proceedings of the 18th ACM conference on Information and knowledge management, ACM.
3) Friedman, J. H. (2001), Greedy function approximation: a gradient boosting machine, Annals of statistics, 1189-1232.
4) Li, P., Q. Wu, and C. J. Burges (2007), Mcrank: Learning to rank using multiple classification and gradient boosting, paper presented at Advances in neural information processing systems.
5) Orr, G. B., and K.-R. Müller (2003), Neural networks: tricks of the trade, Springer.
6) Wu, Q., C. J. Burges, K. M. Svore, and J. Gao (2010), Adapting boosting for information retrieval measures, Information Retrieval, 13(3), 254-270.