
ACM International Conference on the Theory of Information Retrieval
University of Delaware, Newark, DE, USA, September 13-16, 2016

Fast Feature Selection Algorithms for Learning to Rank

Andrea Gigli
Department of Computer Science, University of Pisa & ISTI – CNR, Pisa

Franco Maria Nardini, Claudio Lucchese, Raffaele Perego
ISTI – CNR, Pisa & istella*, Pisa

Outline

• Introduction
• Proposed Feature Selection Algorithms (FSA)
• Application to Learning to Rank


How to Rank Documents using Supervised Learning

[Diagram: labelled training examples feed a Learning System, which produces the Ranking System used at prediction time over the Indexed Documents; the figure distinguishes the training phase from the prediction phase.]

Training data: queries $q_i$, their associated documents $d_{i,j}$, and relevance judgements $y_{i,j}$.

• $q_i$: query $i$
• $d_{i,j}$: document $j$ associated with query $i$
• $y_{i,j}$: relevance label for the $j$-th document associated with the $i$-th query
• $f(x_{i,j}, w)$: scoring function

Learning to Rank

Each document $d_{i,j}$ retrieved for query $q_i$ is represented by a feature vector $x_{i,j} = (x_{i,j}^{(1)}, x_{i,j}^{(2)}, \dots, x_{i,j}^{(K)})$, paired with the query-document label $y_{i,j}$.

The goal is to learn a scoring function such that $f(x_{i,j}, w) \approx y_{i,j}$.

$K$ is in the order of hundreds or thousands.
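To make the notation concrete, here is a toy example with invented feature values and a linear scoring function; the function actually learned in the experiments below is a LambdaMART ensemble, not a linear model.

```python
import numpy as np

# Toy query q_i with three candidate documents and K = 4 features (values invented).
X_i = np.array([[0.2, 1.0, 0.0, 3.5],   # x_{i,1}
                [0.9, 0.1, 1.0, 0.7],   # x_{i,2}
                [0.4, 0.8, 0.0, 2.1]])  # x_{i,3}
y_i = np.array([2, 0, 1])               # relevance labels y_{i,j}

w = np.array([0.5, 1.0, -0.2, 0.1])     # parameters of a toy linear scoring function
scores = X_i @ w                        # f(x_{i,j}, w) for each document
ranking = np.argsort(scores)[::-1]      # documents sorted by decreasing predicted score
print("ranking:", ranking, "labels in that order:", y_i[ranking])
```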

Outline

• Introduction
• Proposed Feature Selection Algorithms (FSA)
• Application to Learning to Rank

Proposed Algorithms for Feature Selection

We propose the following algorithms:

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection

• We compare them with the Greedy search Algorithm for feature Selection (GAS) proposed by Geng, Liu, Qin, Li (SIGIR 2007).
• All the competing FSAs belong to the filter-methods family.
• All competing FSAs aim to maximise the importance of the selected features w.r.t. the relevance judgements and to minimise the similarity among the selected features.
• Both X-GAS and GAS require hyper-parameter calibration.

Proposed Algorithms for Feature Selection

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection: N-GAS

The feature graph is built and the subset S of n = 4 selected features is initialized. Each node carries the importance of a feature w.r.t. the query-document relevance judgements (e.g., the importance of the 8th feature); each edge carries the similarity between two features (e.g., between the 6th and the 7th).

• Start by adding the node with the highest importance to S (Node ❶ in this example).
• Let a be the node with the lowest similarity w.r.t. the node just added to S, and let b be the node with the highest similarity w.r.t. a. From (a, b), select the node with the highest importance and add it to S.
• Repeat the previous step from the node just added until |S| = n. In the example, later iterations add Node ❷ (selected from the pair (❷, ❸)) and Node ❹ (selected from the pair (❹, ❽)).
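A minimal sketch of the N-GAS loop as described above. The function and argument names are mine, and details not stated on the slides (e.g., restricting candidates to not-yet-selected features and tie-breaking) are assumptions.

```python
import numpy as np

def n_gas(importance, similarity, n):
    """N-GAS sketch: importance is a length-K array (e.g., NDCG@10 of each feature
    used alone as a ranker), similarity a K x K matrix (e.g., Spearman rank
    correlation). Returns the indices of the n selected features."""
    importance = np.asarray(importance)
    K = len(importance)
    selected = [int(np.argmax(importance))]              # most important feature first
    while len(selected) < n:
        remaining = [i for i in range(K) if i not in selected]
        if len(remaining) == 1:
            selected.append(remaining[0])
            break
        last = selected[-1]
        a = min(remaining, key=lambda i: similarity[last, i])     # least similar to the last pick
        b = max((i for i in remaining if i != a),
                key=lambda i: similarity[a, i])                   # most similar to a
        selected.append(a if importance[a] >= importance[b] else b)  # keep the more important
    return selected
```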

Proposed Algorithms for Feature Selection

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection: X-GAS

The feature graph is built and the subset S of n = 4 selected features is initialized.

• Start by adding the node with the highest importance to S (Node ❶ in this example).
• Select the 50% of nodes least similar to the node just added (this fraction is the filter parameter of the algorithm); from this selection, take the node with the highest importance and add it to S.
• Repeat the previous step from the node just added until |S| = n. In the example, later iterations add Node ❸ and then Node ❹.
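A corresponding sketch of X-GAS; the filter parameter p (the fraction of least-similar nodes kept at each step, 50% in the example above) is the hyper-parameter that needs calibration. Names and rounding details are mine.

```python
import numpy as np

def x_gas(importance, similarity, n, p=0.5):
    """X-GAS sketch: at every step keep only the fraction p of the remaining
    features least similar to the last selected one, then add the most
    important of those candidates to S."""
    importance = np.asarray(importance)
    K = len(importance)
    selected = [int(np.argmax(importance))]              # most important feature first
    while len(selected) < n:
        remaining = [i for i in range(K) if i not in selected]
        last = selected[-1]
        k = max(1, int(np.ceil(p * len(remaining))))      # size of the filtered candidate pool
        candidates = sorted(remaining, key=lambda i: similarity[last, i])[:k]
        selected.append(max(candidates, key=lambda i: importance[i]))  # most important candidate
    return selected
```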

Proposed Algorithms for Feature Selection

• Naïve Greedy search Algorithm for feature Selection (N-GAS)
• eXtended naïve Greedy search Algorithm for feature Selection (X-GAS)
• Hierarchical clustering Greedy search Algorithm for feature Selection (H-GAS)

Proposed Algorithms for Feature Selection: H-GAS

[Figure: dendrogram over the features. The features are grouped by hierarchical clustering according to their pairwise similarity, and the grouping guides the greedy selection of the n = 4 features (nodes 1, 5, 8 and 4 in this example).]
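The transcript does not spell out the H-GAS selection rule, so the sketch below is one plausible reading under an explicit assumption: features are clustered by agglomerative clustering on the distance D(f_i, f_j) = 1 − S(f_i, f_j) ("single" or "ward" linkage, the two variants appearing in the result tables), the dendrogram is cut into n clusters, and the most important feature of each cluster is selected.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def h_gas(importance, similarity, n, method="single"):
    """H-GAS-style sketch (assumption: one feature is selected per cluster)."""
    importance = np.asarray(importance)
    distance = 1.0 - np.asarray(similarity, dtype=float)   # D = 1 - S
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)          # condensed distance vector
    Z = linkage(condensed, method=method)                   # "single" or "ward" linkage
    labels = fcluster(Z, t=n, criterion="maxclust")         # cut the dendrogram into n clusters
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        selected.append(int(members[np.argmax(importance[members])]))  # most important per cluster
    return selected
```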

Outline

• Introduction
• Proposed Feature Selection Algorithms (FSA)
• Application to Learning to Rank

Application to Web Search Engine Data

• Bing data: http://research.microsoft.com/en-us/projects/mslr/
• Yahoo! data: http://webscope.sandbox.yahoo.com

Yahoo! dataset   Train      Validation   Test
# queries        19,944     2,994        6,983
# urls           473,134    71,083       165,660
# features       519

Bing dataset     Train      Validation   Test
# queries        18,919     6,306        6,306
# urls           723,412    235,259      241,521
# features       136

Experimental Framework

• Importance, $I(f_i)$: NDCG@10 obtained using each feature $f_i$ alone as a ranking model
• Similarity, $S(f_i, f_j)$: Spearman rank correlation coefficient
• Distance, $D(f_i, f_j) = 1 - S(f_i, f_j)$
• L2R algorithm: LambdaMART
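A minimal sketch of how the importance and similarity measures can be computed, assuming per-query feature matrices and label vectors; the exponential-gain NDCG formula and the handling of queries with no relevant documents are my assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg_at_k(scores, labels, k=10):
    """NDCG@k for one query: rank by scores, gain = 2^label - 1 (assumed)."""
    order = np.argsort(scores)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = np.sum((2.0 ** labels[order] - 1) * discounts)
    ideal = np.sort(labels)[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1) / np.log2(np.arange(2, len(ideal) + 2)))
    return dcg / idcg if idcg > 0 else 0.0

def feature_importance(X_queries, y_queries, j):
    """I(f_j): mean NDCG@10 obtained using feature j alone as the ranking model."""
    return float(np.mean([ndcg_at_k(X[:, j], y) for X, y in zip(X_queries, y_queries)]))

def feature_similarity(X, i, j):
    """S(f_i, f_j): Spearman rank correlation between the two feature columns."""
    rho, _ = spearmanr(X[:, i], X[:, j])
    return rho
```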

Experimental Protocol

1. Select a subset of n < K features using a given FSA.
2. Train LambdaMART using the n selected features.
3. Measure LambdaMART performance on the test set.
4. Compare the FSAs using the average NDCG@10.

Repeat for different n in {5%K, 10%K, 20%K, 30%K, 40%K, 50%K, 75%K, K}, and repeat from step 1 for each FSA.

Results on the “Bing” dataset

[Plot: NDCG@10 as a function of the feature subset size, expressed as a percentage of the full feature set size K, for each FSA.]

Results on the “Yahoo!” dataset

[Plot: NDCG@10 as a function of the feature subset size, expressed as a percentage of the full feature set size K, for each FSA.]

Results (NDCG@10)

Bing dataset

Feature subset size   5%        10%       20%       30%       40%       100%
N-GAS                 0.4011▼   0.4459    0.4710    0.4739▼   0.4813    0.4863
X-GAS, p = 0.05       0.4376▲   0.4528    0.4577▼   0.4825    0.4834    0.4863
H-GAS, "single"       0.4423▲   0.4643▲   0.4870▲   0.4854    0.4848    0.4863
H-GAS, "ward"         0.4289    0.4434▼   0.4820    0.4879    0.4853    0.4863
GAS, c = 0.01         0.4294    0.4515    0.4758    0.4848    0.4863    0.4863

Yahoo! dataset

Feature subset size   5%        10%       20%       30%       40%       100%
N-GAS                 0.7430▼   0.7601    0.7672    0.7717    0.7724    0.7753
X-GAS, p = 0.8        0.7655    0.7666    0.7723    0.7742    0.7751    0.7753
H-GAS, "single"       0.7350▼   0.7635    0.7666    0.7738    0.7742    0.7753
H-GAS, "ward"         0.7570▼   0.7626    0.7704    0.7743    0.7755    0.7753
GAS, c = 0.01         0.7628    0.7649    0.7671    0.7730    0.7737    0.7753

Conclusion

• X-GAS and H-GAS show performance greater than or equal to that of the benchmark model (GAS).
• H-GAS and N-GAS are more efficient than the others because they do not require any hyper-parameter calibration.
• Future work:
  – experiments on the new LtR dataset provided by istella* (http://blog.istella.it/istella-learning-to-rank-dataset/)
  – application to other ML contexts, sorting problems and ensemble learning.

ACM International Conference on the Theory of Information Retrieval
University of Delaware, Newark, DE, USA, September 13-16, 2016

Thank you, and special thanks to ACM-SIGIR for the Travel Grant support.

Andrea Gigli
Email: [email protected]
Twitter: @andrgig
http://www.slideshare.net/andrgig