
ACM International Conference on the Theory of Information Retrieval
University of Delaware, Newark, DE, USA, September 13-16, 2016

Fast Feature Selection Algorithms for Learning to Rank

Andrea Gigli
Department of Computer Science, University of Pisa & ISTI – CNR, Pisa

Franco Maria Nardini, Claudio Lucchese, Raffaele Perego
ISTI – CNR, Pisa & istella*, Pisa

Outline

• Introduction
• Proposed Feature Selection Algorithms (FSA)
• Application to Learning to Rank


How to Rank Documents using Supervised Learning

[Diagram: labelled training examples feed a Learning System, which produces the Ranking System used at prediction time over the Indexed Documents; the figure distinguishes the training phase from the prediction phase.]

Training data: queries $q_i$, their associated documents $d_{i,j}$, and relevance judgements $y_{i,j}$.

• $q_i$: query $i$
• $d_{i,j}$: document $j$ associated with query $i$
• $y_{i,j}$: relevance label for the $j$-th document associated with the $i$-th query
• $f(x_{i,j}, w)$: scoring function

Learning to Rank

Each document $d_{i,j}$ retrieved for query $q_i$ is represented by a feature vector $x_{i,j} = (x_{i,j}^{(1)}, x_{i,j}^{(2)}, \dots, x_{i,j}^{(K)})$, paired with the query-document label $y_{i,j}$.

The goal is to learn a scoring function such that $f(x_{i,j}, w) \approx y_{i,j}$.

$K$ is in the order of hundreds or thousands.
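To make the notation concrete, here is a toy example with invented feature values and a linear scoring function; the function actually learned in the experiments below is a LambdaMART ensemble, not a linear model.

```python
import numpy as np

# Toy query q_i with three candidate documents and K = 4 features (values invented).
X_i = np.array([[0.2, 1.0, 0.0, 3.5],   # x_{i,1}
                [0.9, 0.1, 1.0, 0.7],   # x_{i,2}
                [0.4, 0.8, 0.0, 2.1]])  # x_{i,3}
y_i = np.array([2, 0, 1])               # relevance labels y_{i,j}

w = np.array([0.5, 1.0, -0.2, 0.1])     # parameters of a toy linear scoring function
scores = X_i @ w                        # f(x_{i,j}, w) for each document
ranking = np.argsort(scores)[::-1]      # documents sorted by decreasing predicted score
print("ranking:", ranking, "labels in that order:", y_i[ranking])
```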

Outline

• Introduction
• Proposed Feature Selection Algorithms (FSA)
• Application to Learning to Rank

Proposed Algorithms for Feature Selection

We propose the following algorithms:

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection

• We compare them with the Greedy search Algorithm for feature Selection (GAS) proposed by Geng, Liu, Qin, Li (SIGIR 2007).
• All the competing FSAs belong to the filter-methods family.
• All competing FSAs aim to maximise the importance of the selected features w.r.t. the relevance judgements and to minimise the similarity among the selected features.
• Both X-GAS and GAS require hyper-parameter calibration.

Proposed Algorithms for Feature Selection

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection: N-GAS

The feature graph is built and the subset S of n = 4 selected features is initialized. Each node carries the importance of a feature w.r.t. the query-document relevance judgements (e.g., the importance of the 8th feature); each edge carries the similarity between two features (e.g., between the 6th and the 7th).

• Start by adding the node with the highest importance to S (Node ❶ in this example).
• Let a be the node with the lowest similarity w.r.t. the node just added to S, and let b be the node with the highest similarity w.r.t. a. From (a, b), select the node with the highest importance and add it to S.
• Repeat the previous step from the node just added until |S| = n. In the example, later iterations add Node ❷ (selected from the pair (❷, ❸)) and Node ❹ (selected from the pair (❹, ❽)).
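A minimal sketch of the N-GAS loop as described above. The function and argument names are mine, and details not stated on the slides (e.g., restricting candidates to not-yet-selected features and tie-breaking) are assumptions.

```python
import numpy as np

def n_gas(importance, similarity, n):
    """N-GAS sketch: importance is a length-K array (e.g., NDCG@10 of each feature
    used alone as a ranker), similarity a K x K matrix (e.g., Spearman rank
    correlation). Returns the indices of the n selected features."""
    importance = np.asarray(importance)
    K = len(importance)
    selected = [int(np.argmax(importance))]              # most important feature first
    while len(selected) < n:
        remaining = [i for i in range(K) if i not in selected]
        if len(remaining) == 1:
            selected.append(remaining[0])
            break
        last = selected[-1]
        a = min(remaining, key=lambda i: similarity[last, i])     # least similar to the last pick
        b = max((i for i in remaining if i != a),
                key=lambda i: similarity[a, i])                   # most similar to a
        selected.append(a if importance[a] >= importance[b] else b)  # keep the more important
    return selected
```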

Proposed Algorithms for Feature Selection

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection: X-GAS

The feature graph is built and the subset S of n = 4 selected features is initialized.

• Start by adding the node with the highest importance to S (Node ❶ in this example).
• Select the 50% of nodes least similar to the node just added (this fraction is the filter parameter of the algorithm); from this selection, take the node with the highest importance and add it to S.
• Repeat the previous step from the node just added until |S| = n. In the example, later iterations add Node ❸ and then Node ❹.
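A corresponding sketch of X-GAS; the filter parameter p (the fraction of least-similar nodes kept at each step, 50% in the example above) is the hyper-parameter that needs calibration. Names and rounding details are mine.

```python
import numpy as np

def x_gas(importance, similarity, n, p=0.5):
    """X-GAS sketch: at every step keep only the fraction p of the remaining
    features least similar to the last selected one, then add the most
    important of those candidates to S."""
    importance = np.asarray(importance)
    K = len(importance)
    selected = [int(np.argmax(importance))]              # most important feature first
    while len(selected) < n:
        remaining = [i for i in range(K) if i not in selected]
        last = selected[-1]
        k = max(1, int(np.ceil(p * len(remaining))))      # size of the filtered candidate pool
        candidates = sorted(remaining, key=lambda i: similarity[last, i])[:k]
        selected.append(max(candidates, key=lambda i: importance[i]))  # most important candidate
    return selected
```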

Proposed Algorithms for Feature Selection

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection: H-GAS

[Figure: dendrogram over the features. The features are grouped by hierarchical clustering according to their pairwise similarity, and the grouping guides the greedy selection of the n = 4 features (nodes 1, 5, 8 and 4 in this example).]
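The transcript does not spell out the H-GAS selection rule, so the sketch below is one plausible reading under an explicit assumption: features are clustered by agglomerative clustering on the distance D(f_i, f_j) = 1 − S(f_i, f_j) ("single" or "ward" linkage, the two variants appearing in the result tables), the dendrogram is cut into n clusters, and the most important feature of each cluster is selected.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def h_gas(importance, similarity, n, method="single"):
    """H-GAS-style sketch (assumption: one feature is selected per cluster)."""
    importance = np.asarray(importance)
    distance = 1.0 - np.asarray(similarity, dtype=float)   # D = 1 - S
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)          # condensed distance vector
    Z = linkage(condensed, method=method)                   # "single" or "ward" linkage
    labels = fcluster(Z, t=n, criterion="maxclust")         # cut the dendrogram into n clusters
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        selected.append(int(members[np.argmax(importance[members])]))  # most important per cluster
    return selected
```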

Outline

• Introduction
• Proposed Feature Selection Algorithms (FSA)
• Application to Learning to Rank

Application to Web Search Engine Data

• Bing data: http://research.microsoft.com/en-us/projects/mslr/
• Yahoo! data: http://webscope.sandbox.yahoo.com

Yahoo! dataset   Train      Validation   Test
# queries        19,944     2,994        6,983
# urls           473,134    71,083       165,660
# features       519

Bing dataset     Train      Validation   Test
# queries        18,919     6,306        6,306
# urls           723,412    235,259      241,521
# features       136

Experimental Framework

• Importance, $I(f_i)$: NDCG@10 obtained using each feature $f_i$ alone as a ranking model
• Similarity, $S(f_i, f_j)$: Spearman rank correlation coefficient
• Distance, $D(f_i, f_j) = 1 - S(f_i, f_j)$
• L2R algorithm: LambdaMART
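A minimal sketch of how the importance and similarity measures can be computed, assuming per-query feature matrices and label vectors; the exponential-gain NDCG formula and the handling of queries with no relevant documents are my assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg_at_k(scores, labels, k=10):
    """NDCG@k for one query: rank by scores, gain = 2^label - 1 (assumed)."""
    order = np.argsort(scores)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = np.sum((2.0 ** labels[order] - 1) * discounts)
    ideal = np.sort(labels)[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1) / np.log2(np.arange(2, len(ideal) + 2)))
    return dcg / idcg if idcg > 0 else 0.0

def feature_importance(X_queries, y_queries, j):
    """I(f_j): mean NDCG@10 obtained using feature j alone as the ranking model."""
    return float(np.mean([ndcg_at_k(X[:, j], y) for X, y in zip(X_queries, y_queries)]))

def feature_similarity(X, i, j):
    """S(f_i, f_j): Spearman rank correlation between the two feature columns."""
    rho, _ = spearmanr(X[:, i], X[:, j])
    return rho
```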

Experimental Protocol

1. Select a subset of n < K features using a given FSA.
2. Train LambdaMART using the n selected features.
3. Measure LambdaMART performance on the test set.
4. Compare the FSAs using the average NDCG@10.

Repeat for different n in {5%K, 10%K, 20%K, 30%K, 40%K, 50%K, 75%K, K}, and repeat from step 1 for each FSA.

Results on the “Bing” dataset

[Plot: NDCG@10 as a function of the feature subset size, expressed as a percentage of the full feature set size K, for each FSA.]

Results on the “Yahoo!” dataset

[Plot: NDCG@10 as a function of the feature subset size, expressed as a percentage of the full feature set size K, for each FSA.]

Results (NDCG@10)

Bing dataset

Feature subset size   5%        10%       20%       30%       40%       100%
N-GAS                 0.4011▼   0.4459    0.4710    0.4739▼   0.4813    0.4863
X-GAS, p = 0.05       0.4376▲   0.4528    0.4577▼   0.4825    0.4834    0.4863
H-GAS, "single"       0.4423▲   0.4643▲   0.4870▲   0.4854    0.4848    0.4863
H-GAS, "ward"         0.4289    0.4434▼   0.4820    0.4879    0.4853    0.4863
GAS, c = 0.01         0.4294    0.4515    0.4758    0.4848    0.4863    0.4863

Yahoo! dataset

Feature subset size   5%        10%       20%       30%       40%       100%
N-GAS                 0.7430▼   0.7601    0.7672    0.7717    0.7724    0.7753
X-GAS, p = 0.8        0.7655    0.7666    0.7723    0.7742    0.7751    0.7753
H-GAS, "single"       0.7350▼   0.7635    0.7666    0.7738    0.7742    0.7753
H-GAS, "ward"         0.7570▼   0.7626    0.7704    0.7743    0.7755    0.7753
GAS, c = 0.01         0.7628    0.7649    0.7671    0.7730    0.7737    0.7753

Conclusion

• X-GAS and H-GAS show performance greater than or equal to that of the benchmark model (GAS).
• H-GAS and N-GAS are more efficient than the others because they do not require any hyper-parameter calibration.
• Future work:
  – experiments on the new LtR dataset provided by istella* (http://blog.istella.it/istella-learning-to-rank-dataset/)
  – application to other ML contexts, sorting problems and ensemble learning.

ACM International Conference on the Theory of Information Retrieval
University of Delaware, Newark, DE, USA, September 13-16, 2016

Thank you, and special thanks to ACM-SIGIR for the Travel Grant support.

Andrea Gigli
Email: [email protected]
Twitter: @andrgig
http://www.slideshare.net/andrgig