TRANSCRIPT
Machine Learning and Applications to Social Media
Hang Li
Huawei Technologies
CCF ADL 2014 Beijing
Aug. 10, 2014
Outline
1. Introduction
2. Learning to Rank
3. Learning to Match
4. Applications to Social Media
5. Summary
Two Challenges: Information Overload and Information Shortage
Video: Weibo Robot
Main Features of Weibo Robot V1
• Features Developed
– Following People
– Re-Tweeting (Forwarding Tweets)
– Generating Short Comments
• Features To Be Developed
– Generating Original Tweets
– Generating Long Comments 小诺_Noah
2. Learning to Rank
2.1. Overview of Learning to Rank
Ranking Plays Key Role in Many Applications
Ranking Problem: Example = Document Retrieval
• Query: q
• Documents: D = {d_1, d_2, ..., d_N}
• Ranking model: f(q, d)
• Output: ranking of documents d_{q,1}, d_{q,2}, ..., d_{q,n_q}, sorted by f(q, d)
• Ranking based on relevance, importance, preference
Ranking Problem Example = Recommender System
Item1 Item2 Item3 ... ItemN
User1 5 4
User2 1 2 2
... ? ? ?
UserM 4 3
Ranking Problem Example = Machine Translation
• Source-language sentence: f
• Generative Model: produces ranked sentence candidates in target language e_1, e_2, ..., e_1000
• Re-Ranking Model: outputs re-ranked top sentence in target language ẽ
Ranking Problem Example = Meta Search
Learning to Rank
• Definition 1 (in broad sense)
Learning to rank = any machine learning technology for ranking problem
• Definition 2 (in narrow sense)
Learning to rank = machine learning technology for ranking creation and ranking aggregation
• This tutorial takes Definition 2
Taxonomy of Problems in Learning to Rank
Ranking Creation (with Local Ranking Model)
• Requests: S = {q_1, q_2, ..., q_i, ..., q_M}
• Objects: O = {o_1, o_2, ..., o_j, ..., o_N}
• For request q_i, the associated objects are O_i = {o_{i,1}, o_{i,2}, ..., o_{i,n_i}}
• Local ranking model: f(q, o)
• Output: o_{i,1}, o_{i,2}, ..., o_{i,n_i} ranked by f(q_i, o_{i,1}) ≥ f(q_i, o_{i,2}) ≥ ... ≥ f(q_i, o_{i,n_i})
Learning for Ranking Creation
• Creating a ranking list of offerings based on request and offerings
• Feature-based
• Usually local ranking model
• Usually supervised learning
Ranking Aggregation
Learning for Ranking Aggregation
• Aggregating a ranking list from multiple ranking lists of offerings
• Ranking based
• Usually global ranking model
• Both supervised and unsupervised learning
2.2. Problem and Approaches of Learning to Rank
Ranking Problem: Example = Document Search
• Query: q
• Documents: D = {d_1, d_2, ..., d_N}
• Ranking model: f(q, d)
• Output: ranking of documents d_{q,1}, d_{q,2}, ..., d_{q,n_q}, sorted by f(q, d)
• Ranking based on relevance, importance, preference
Traditional Approach = Probabilistic Model
• Query: q; documents: d_1, d_2, ..., d_N
• Relevance: r ∈ {0, 1}; probabilistic model: P(r | q, d)
• Ranking of documents: d_1 ~ P(r | q, d_1), d_2 ~ P(r | q, d_2), ..., d_n ~ P(r | q, d_n), sorted by probability of relevance
BM25 [Robertson & Walker 94]
• Query: q; documents: d_1, d_2, ..., d_N
• Ranking function:
  BM25(q, d) = Σ_{w ∈ q ∩ d} idf(w) · (k + 1) tf(w) / ( k ((1 − b) + b · dl / avgdl) + tf(w) )
  where tf(w) is the frequency of w in d, dl is the length of d, avgdl is the average document length, and k, b are parameters
• Ranking of documents: d_1 ~ P(r | q, d_1), ..., d_n ~ P(r | q, d_n), with r ∈ {0, 1}; the BM25 score serves as the ranking function
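As a quick illustration, the ranking function above can be sketched in a few lines; here idf is supplied as a given per-term weight, and the documents, weights, and parameter values are all illustrative.

```python
# A minimal sketch of BM25 scoring: term-frequency saturation with
# document-length normalization, weighted by a supplied idf value.
import math

def bm25(query, doc, idf, avgdl, k=1.2, b=0.75):
    dl = len(doc)
    score = 0.0
    for w in set(query):
        tf = doc.count(w)
        if tf == 0:
            continue
        score += idf.get(w, 0.0) * (k + 1) * tf / (k * ((1 - b) + b * dl / avgdl) + tf)
    return score

# toy collection: the document containing the rarer term should rank higher
docs = [["cat", "sat", "mat"], ["dog", "sat", "log", "dog"]]
avgdl = sum(len(d) for d in docs) / len(docs)
idf = {"cat": 1.0, "dog": 1.0, "sat": 0.1}
scores = [bm25(["cat", "sat"], d, idf, avgdl) for d in docs]
assert scores[0] > scores[1]
```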
PageRank [Page et al, 1999]
• P(d_i) = α Σ_{d_j ∈ M(d_i)} P(d_j) / L(d_j) + (1 − α) · (1 / n)
  where M(d_i) is the set of pages linking to d_i, L(d_j) is the number of outlinks of d_j, n is the total number of pages, and α is the damping factor
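The formula above can be computed by power iteration; the link graph below is illustrative.

```python
# A small sketch of PageRank by power iteration on the formula above.

def pagerank(links, alpha=0.85, iters=100):
    """links[j] = list of pages that page j links to."""
    n = len(links)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - alpha) / n] * n          # teleportation term (1 - alpha) / n
        for j, outs in enumerate(links):
            for i in outs:
                new[i] += alpha * p[j] / len(outs)  # d_j shares its score over its outlinks
        p = new
    return p

# page 0 is linked by pages 1 and 2; page 1 is linked by pages 0 and 2
links = [[1], [0], [0, 1]]
p = pagerank(links)
assert abs(sum(p) - 1.0) < 1e-6   # scores form a distribution
assert p[0] > p[2]                # page 0 receives the most links
```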
New Approach = Learning to Rank
• Training data: queries q_1, ..., q_m; for each query q_i, its documents d_{i,1}, d_{i,2}, ..., d_{i,n_i}, drawn from the collection D = {d_1, d_2, ..., d_N}
• Learning System: learns a ranking model f(q, d) from the training data
• Ranking System: for a new query q_{m+1}, ranks its documents d_{m+1,1}, d_{m+1,2}, ..., d_{m+1,n_{m+1}} by f(q_{m+1}, d_{m+1,1}), f(q_{m+1}, d_{m+1,2}), ..., f(q_{m+1}, d_{m+1,n_{m+1}})
Training Process
1. Data Labeling (rank): for each training query q_i (i = 1, ..., m), its documents d_{i,1}, d_{i,2}, ..., d_{i,n_i} are labeled with grades y_{i,1}, y_{i,2}, ..., y_{i,n_i}
2. Feature Extraction: each labeled query-document pair becomes a labeled feature vector, giving (x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), ..., (x_{i,n_i}, y_{i,n_i})
3. Learning: a ranking model f(x) is learned from the labeled feature vectors
Testing Process
1. Data Labeling (rank): the documents d_{m+1,1}, ..., d_{m+1,n_{m+1}} of test query q_{m+1} are labeled with grades y_{m+1,1}, ..., y_{m+1,n_{m+1}}
2. Feature Extraction: the query-document pairs become feature vectors x_{m+1,1}, ..., x_{m+1,n_{m+1}}
3. Ranking with f(x): the documents are ranked by f(x_{m+1,1}), f(x_{m+1,2}), ..., f(x_{m+1,n_{m+1}})
4. Evaluation: the predicted ranking is compared against the labels to produce the evaluation result
Notes
• Features are functions of query and document
• Query and associated documents form a group
• Groups are i.i.d. data
• Feature vectors within group are not i.i.d. data
• Ranking model is function of features
• Several data labeling methods (here labeling of grade)
Issues in Learning to Rank
• Data Labeling
• Feature Extraction
• Evaluation Measure
• Learning Method (Model, Loss Function, Algorithm)
Data Labeling Problem
• E.g., relevance of documents (Doc A, Doc B, Doc C) w.r.t. the query
Data Labeling Methods
• Labeling of Grades
– Multiple levels (e.g., relevant, partially relevant, irrelevant)
– Widely used in IR
• Labeling of Ordered Pairs
– Ordered pairs between documents (e.g. A>B, B>C)
– Implicit relevance judgment: derived from click-through data
• Creation of List
– List (or permutation) of documents is given
– Ideal but difficult to implement
Implicit Relevance Judgment
• Ranking of documents at search system: Doc A, Doc B, Doc C
• Users often clicked on Doc B
• Derived ordered pair: B > A
Feature Extraction
• For the query and each of Doc A, Doc B, Doc C, compute query-document features (e.g., BM25) and document features (e.g., PageRank)
• The features of each query-document pair form a feature vector
Example Features
Evaluation Measures
• Important to rank top results correctly
• Measures:
  – NDCG (Normalized Discounted Cumulative Gain)
  – MAP (Mean Average Precision)
  – MRR (Mean Reciprocal Rank)
  – WTA (Winners Take All)
  – Kendall’s Tau
NDCG
• Evaluating ranking using labeled grades
• NDCG at position j:
  NDCG(j) = n_j Σ_{i=1}^{j} (2^{r(i)} − 1) / log2(1 + i)
  where r(i) is the grade at position i and n_j is the normalizing factor, chosen so that the perfect ranking has NDCG(j) = 1
NDCG (cont’)
• Example: perfect ranking, with grades r ∈ {3, 2, 1}
  – (3, 3, 2, 2, 1, 1, 1) grades
  – (7, 7, 3, 3, 1, 1, 1) gain: 2^{r(j)} − 1
  – (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount: 1 / log2(1 + j)
  – (7, 11.41, 12.91, …) DCG
  – (1/7, 1/11.41, 1/12.91, …) normalizing factor n_j
  – (1, 1, 1, 1, 1, 1, 1) NDCG for perfect ranking: n_j Σ_{i=1}^{j} (2^{r(i)} − 1) / log2(1 + i)
NDCG (cont’)
• Example: imperfect ranking
– (2, 3, 2, 3, 1, 1, 1)
– (3, 7, 3, 7, 1, 1, 1) Gain
– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) Position discount
– (3, 7.41, 8.91, … ) DCG
– (1/7, 1/11.41, 1/12.91, …) normalizing factor
– (0.43, 0.65, 0.69, ….) NDCG
• Imperfect ranking decreases NDCG
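The perfect and imperfect rankings above can be checked with a short sketch of the NDCG formula (base-2 logarithm, gain 2^r − 1, discount 1/log2(1 + position)):

```python
# Sketch of the NDCG computation from the worked example above.
import math

def dcg(grades, j):
    # DCG at position j: sum of gain (2^r - 1) times discount 1/log2(1 + i)
    return sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(grades[:j], start=1))

def ndcg(grades, ideal, j):
    # normalize by the DCG of the perfect ranking at the same position
    return dcg(grades, j) / dcg(ideal, j)

ideal = [3, 3, 2, 2, 1, 1, 1]
imperfect = [2, 3, 2, 3, 1, 1, 1]
assert abs(ndcg(ideal, ideal, 7) - 1.0) < 1e-9
assert round(ndcg(imperfect, ideal, 1), 2) == 0.43
assert round(ndcg(imperfect, ideal, 3), 2) == 0.69
```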
Relations with Other Learning Tasks
• vs Classification: no need to predict category
• vs Regression: no need to predict value of f(q, d)
• vs Ordinal regression: relative ranking order is more important
• Learning to rank can be approximated by classification, regression, ordinal regression
Ordinal Regression (Ordinal Classification)
• Categories are ordered
– 5, 4, 3, 2, 1
– e.g., rating restaurants
• Prediction
– Map to ordered categories
Three Major Approaches
• Pointwise approach
• Pairwise approach
• Listwise approach

Categorization of Learning to Rank Methods
• Methods in each approach can be SVM based, Boosting based, Neural Network based, or others
Pointwise Approach
• Transforming ranking to regression, classification, or ordinal classification
• Query-document group structure is ignored
Pairwise Approach
• Transforming ranking to pairwise classification
• Query-document group structure is ignored
Listwise Approach
• List as instance
• Query-document group structure is used
• Straightforwardly represents learning to rank problem
Evaluation Results
• Pairwise approach and listwise approach perform better than pointwise approach
• LambdaMART performed best in the Yahoo! Learning to Rank Challenge
• No significant difference between pairwise and listwise methods
2.3. Methods of Learning to Rank
Ranking SVM
Pairwise Classification
• Converting document list to document pairs
• Example: for a query with Doc A (grade 3), Doc B (grade 2), Doc C (grade 1), the pairs are A > B, A > C, and B > C
Transforming Ranking to Pairwise Classification
• Input space: X
• Ranking function: f: X → R
• Ranking: x_i ≻ x_j ⇔ f(x_i; w) > f(x_j; w)
• Linear ranking function: f(x; w) = ⟨w, x⟩
• Transforming to pairwise classification: x_i ≻ x_j ⇔ ⟨w, x_i − x_j⟩ > 0
• Classification example: (z, y) = (x_i − x_j, +1) if x_i ≻ x_j, and (x_i − x_j, −1) if x_j ≻ x_i
Ranking Problem
• Feature vectors x_1 (grade 3), x_2 (grade 2), x_3 (grade 1)
• Ranking by projection onto weight vector w
Transformed Pairwise Classification Problem
• Positive examples (+1): x_1 − x_3, x_2 − x_3, x_1 − x_2
• Negative examples (−1, redundant): x_3 − x_1, x_3 − x_2, x_2 − x_1
• Classifier: f(x; w), a separating hyperplane through the origin
Ranking SVM (Herbrich et al., 1999)
• Pairwise classification on differences of feature vectors
• Corresponding positive and negative examples
• Negative examples are redundant and can be discarded
• Hyperplane passes through the origin
• Soft margin and kernel can be used
• Ranking SVM = pairwise classification SVM
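A minimal sketch of the idea: build the positive difference examples and minimize the hinge loss by plain subgradient descent (a stand-in for the QP solver of an actual SVM); the feature vectors are illustrative.

```python
# Sketch of the pairwise transformation behind Ranking SVM.

def make_pairs(vectors, grades):
    # create (x_i - x_j, +1) for every pair with grade_i > grade_j;
    # negative examples are redundant and omitted
    pairs = []
    for xi, gi in zip(vectors, grades):
        for xj, gj in zip(vectors, grades):
            if gi > gj:
                pairs.append(([a - b for a, b in zip(xi, xj)], 1))
    return pairs

def train(pairs, dim, epochs=100, lr=0.1):
    # subgradient descent on the hinge loss sum_i [1 - y_i <w, z_i>]_+
    w = [0.0] * dim
    for _ in range(epochs):
        for z, y in pairs:
            if y * sum(wk * zk for wk, zk in zip(w, z)) < 1:
                w = [wk + lr * y * zk for wk, zk in zip(w, z)]
    return w

X = [[3.0, 0.1], [2.0, 0.3], [1.0, 0.2]]   # toy feature vectors
grades = [3, 2, 1]
w = train(make_pairs(X, grades), dim=2)
scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
assert scores[0] > scores[1] > scores[2]   # learned order matches grades
```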
Learning of Ranking SVM
• Hinge loss form:
  min_w Σ_{i=1}^{l} [1 − y_i ⟨w, x_i^(1) − x_i^(2)⟩]_+ + λ ||w||²
  where [s]_+ = max(0, s)
• Equivalent QP form:
  min_{w,ξ} (1/2) ||w||² + C Σ_{i=1}^{N} ξ_i
  s.t. y_i ⟨w, x_i^(1) − x_i^(2)⟩ ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., N
IR SVM
Cost-sensitive Pairwise Classification
• Converting document list to document pairs, as in Ranking SVM
• Example: Doc A (grade 3), Doc B (grade 2), Doc C (grade 1)
• Pairs separating the top document (A > B, A > C) are critical; the pair B > C is not critical
Problems with Ranking SVM
• Not sufficient emphasis on correct ranking at the top (grades: 3, 2, 1):
  ranking 1: 2 3 2 1 1 1 1
  ranking 2: 3 2 1 2 1 1 1
  Ranking 2 should be better than ranking 1, but Ranking SVM views them as the same
• Numbers of pairs vary according to queries:
  q1: 3 2 2 1 1 1 1, number of pairs: 2 + 4 + 8 = 14
  q2: 3 3 2 2 2 1 1 1 1 1, number of pairs: 6 + 10 + 15 = 31
  Ranking SVM is biased toward q2
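The pair counts above follow from counting, for each query, every document pair whose grades differ; a quick check:

```python
# Each pair of documents with different grades yields one ordered training
# pair, so the count is a sum over grade pairs of frequency products.
from collections import Counter
from itertools import combinations

def num_pairs(grades):
    counts = Counter(grades)
    return sum(counts[a] * counts[b] for a, b in combinations(counts, 2))

assert num_pairs([3, 2, 2, 1, 1, 1, 1]) == 14           # q1
assert num_pairs([3, 3, 2, 2, 2, 1, 1, 1, 1, 1]) == 31  # q2
```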
IR SVM (Cao et al., 2006)
• Solving the two problems of Ranking SVM
• Higher weight τ_{k(i)} on important grade pairs, where k(i) is the grade-pair type of pair i
• Normalization weight μ_{q(i)} on pairs within query q(i)
• IR SVM = Ranking SVM using modified hinge loss
Modified Hinge Loss Function
  min_w Σ_{i=1}^{l} τ_{k(i)} μ_{q(i)} [1 − y_i ⟨w, x_i^(1) − x_i^(2)⟩]_+ + λ ||w||²
  (Figure: hinge loss as a function of y f(x^(1) − x^(2)); the weights scale the slope of the loss for different pair types)
Learning of IR SVM
• Loss minimization form:
  min_w Σ_{i=1}^{l} τ_{k(i)} μ_{q(i)} [1 − y_i ⟨w, x_i^(1) − x_i^(2)⟩]_+ + λ ||w||²
• Equivalent QP form:
  min_{w,ξ} (1/2) ||w||² + Σ_{i=1}^{l} C_i ξ_i, with C_i = τ_{k(i)} μ_{q(i)} C
  s.t. y_i ⟨w, x_i^(1) − x_i^(2)⟩ ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., l
AdaRank
Listwise Loss
• Training data grouped by query:
  q_1: (x_{1,1}, y_{1,1}), (x_{1,2}, y_{1,2}), ..., (x_{1,n_1}, y_{1,n_1})
  ...
  q_m: (x_{m,1}, y_{m,1}), (x_{m,2}, y_{m,2}), ..., (x_{m,n_m}, y_{m,n_m})
• The loss is defined on the entire ranked list of each query
AdaRank (Xu and Li, 2007)
• Optimizing exponential loss function
• Algorithm: AdaBoost-like algorithm for ranking
Loss Function of AdaRank
• Exponential loss defined on an evaluation measure E:
  L = Σ_{i=1}^{m} exp(−E(π(q_i), y_i))
• E can be any evaluation measure taking values in [−1, +1]
AdaRank Algorithm
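The algorithm itself is not transcribed here; the following is a hedged sketch of the AdaBoost-like loop described in Xu & Li (2007), with hypothetical single-feature weak rankers and a toy ±1 measure (all names and data illustrative).

```python
# AdaRank-style loop: keep a weight distribution P over queries, each round
# add the weak ranker with the best weighted performance, then re-weight
# queries by the exponential loss of the current ensemble.
import math

def adarank(queries, weak_rankers, evaluate, rounds=10):
    m = len(queries)
    P = [1.0 / m] * m
    ensemble = []                     # list of (alpha, ranker)
    for _ in range(rounds):
        h = max(weak_rankers,
                key=lambda r: sum(p * evaluate(r, q) for p, q in zip(P, queries)))
        perf = [evaluate(h, q) for q in queries]
        num = sum(p * (1 + e) for p, e in zip(P, perf))
        den = sum(p * (1 - e) for p, e in zip(P, perf))
        if den <= 0:                  # weak ranker is perfect on weighted data
            ensemble.append((1.0, h))
            break
        ensemble.append((0.5 * math.log(num / den), h))
        scores = [sum(a * evaluate(r, q) for a, r in ensemble) for q in queries]
        Z = sum(math.exp(-s) for s in scores)
        P = [math.exp(-s) / Z for s in scores]
    return ensemble

# toy usage: documents are (features, grade); weak ranker k sorts by feature k;
# the measure is +1 if the feature reproduces the grade order, else -1
queries = [
    [([1.0, 0.0], 2), ([0.5, 1.0], 1)],
    [([0.9, 0.2], 2), ([0.2, 0.8], 1)],
]

def eval_by_feature(k, q):
    ranked = sorted(q, key=lambda d: d[0][k], reverse=True)
    grades = [g for _, g in ranked]
    return 1.0 if grades == sorted(grades, reverse=True) else -1.0

ens = adarank(queries, weak_rankers=[0, 1], evaluate=eval_by_feature, rounds=3)
assert ens[0][1] == 0  # feature 0 ranks both toy queries perfectly
```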
3. Learning to Match
3-1. Semantic Matching in Search
Query Document Mismatch is Biggest Challenge in Search
Query Document Mismatch
• Same intent can be represented by different queries (representations)
• Search is still mainly based on term level matching
• Query document mismatch occurs when searcher and author use different representations
Examples of Query Document Mismatch

Query                              Document                    Term Matching   Semantic Matching
seattle best hotel                 seattle best hotels         no              yes
pool schedule                      swimmingpool schedule       no              yes
natural logarithm transformation   logarithm transformation    partial         yes
china kong                         china hong kong             partial         no
why are windows so expensive       why are macs so expensive   partial         no
Matching at Different Levels
• Term matching works at the form level; semantic matching works at the phrase, sense, topic, and structure levels
• Corresponding techniques: spelling error correction (form), phrase identification (phrase), similar query finding (sense), topic identification (topic), structure identification (structure)

Query Understanding
• Example query: "michael jordan berkele"
  – Term/form: michael jordan berkeley (after spelling error correction)
  – Phrase: michael jordan; berkeley
  – Sense: similar query: michael i. jordan
  – Topic: machine learning, berkeley
  – Structure: main phrase: michael jordan

Document Understanding
• Example document: "Homepage of Michael Jordan / Michael Jordan is Professor in the Department of Electrical Engineering ..."
  – Phrase: michael jordan, professor, department, electrical engineering
  – Key phrase: michael jordan, professor, electrical engineering
  – Topic: machine learning, berkeley
  – Structure: main phrase in title: michael jordan

Query Document Matching
• Query representation: query form: michael jordan berkeley; similar query: michael i jordan; main phrase: michael jordan; phrase: michael jordan, berkeley; topic: machine learning
• Document representation: document: michael jordan homepage; main phrase in title: michael jordan; key phrase: michael jordan, berkeley; phrase: michael jordan, professor, department of electrical engineering; topic: machine learning, berkeley
• Semantic matching between the two representations is used in relevance ranking
Matching in Different Ways
• No transformation: match q with d directly
• Query reformulation: transform q to q', match q' with d
• Document transformation: transform d to d', match q with d'
• Query and document transformation: map both q and d to a common representation c
Web Search System
• Offline: Crawling the Web → Document Understanding → Indexing → Index
• Online: User issues a query through the User Interface → Query Understanding → Retrieving from Index → Query Document Matching → Ranking
Machine Learning for Query Document Matching in Web Search
Learning for Matching between Query and Document
• Learning matching function f_M(q, d) or p_M(r | q, d)
• Using training data (q_1, d_1, r_1), ..., (q_N, d_N, r_N)
• q_1, ..., q_N and d_1, ..., d_N can be ids or feature vectors; r_1, ..., r_N can be binary or numerical values
• Using relations in data and/or prior knowledge
Challenges in Machine Learning for Matching
• How to leverage relations in data
• How to handle complicated data (e.g., long queries, questions)
• How to scale up
• How to leverage prior knowledge
• How to deal with tail
Long Tail Challenge
• Head pages have rich anchor texts and click data
• Tail queries and pages suffer more from mismatch
• Problem of propagating information and knowledge from head to tail
Relation between Matching and Ranking
• In traditional IR: ranking = matching, e.g.,
  f(q, d) = f_BM25(q, d) or f(q, d) = f_LMIR(q, d) = P(q | d)
• Web search:
  – Ranking and matching become separated
  – Learning to rank becomes state-of-the-art
  – Matching = feature learning for ranking, e.g., f(q, d) = f(f_BM25(q, d), g_PageRank(d), ...)
Matching vs Ranking

            Matching                                     Ranking
Prediction  Matching degree between query and document   Ranking list of documents
Model       f(q, d)                                      f(q, d_1), f(q, d_2), ..., f(q, d_n)
Challenge   Mismatch                                     Correct ranking on top

In search, first matching and then ranking
Matching Functions as Features in Learning to Rank
• Term level matching: f_BM25(q, d)
• Phrase level matching: f_P(q, d)
• Sense level matching: f_S(q, d)
• Topic level matching: f_T(q, d)
• Structure level matching: f_C(q, d)
• Term level matching after query transformation (spelling, stemming): f_BM25(q', d), where q → q'
Approaches to Learning for Matching Between Query and Document
• Matching by Query Reformulation
• Matching with Term Dependency Model
• Matching with Translation Model
• Matching with Topic Model
• Matching with Latent Space Model
3-2. Overview of Learning to Match
Matching between Heterogeneous Data is Everywhere
• Matching between user and product (collaborative filtering)
• Matching between text and image (image annotation)
• Matching between people (dating)
• Matching between languages (machine translation)
• Matching between receptor and ligand (drug design)
Formulation of Learning Problem
• Learning matching function f(x, y)
• Training data: (x_1, y_1, r_1), ..., (x_N, y_N, r_N)
• Generated according to x ~ P(X), y ~ P(Y | X), r ~ P(R | X, Y)

Formulation of Learning Problem (cont’)
• Loss function: L(r, f(x, y))
• Risk function: R(f) = ∫_{X×Y×R} L(r, f(x, y)) dP(x, y, r)
• Objective function in learning: min_{f ∈ F} Σ_{i=1}^{N} L(r_i, f(x_i, y_i)) + Ω(f)
Matching Problem: Instance Matching
• Instances x_1, ..., x_m and y_1, ..., y_n with observed matching degrees (e.g., 5, 4, 1, ...) on some pairs
• Can be represented as matching between nodes in a bipartite graph

Matching Problem: Feature Matching
• Instances are represented by features
• Can be represented as matching between objects in two spaces

Matching Problem: Structure Matching
• Instances have internal structures
• Can be represented as matching between structures in two spaces
3-3. Methods of Learning to Match
Regularized Matching in Latent Space (RMLS)
Matching in Latent Space (Wu et al., WSDM 2013, JMLR 2013)
• Motivation
  – Matching between query and document in latent space
• Assumption
  – Queries have similarity
  – Documents have similarity
  – Click-through data represent “similarity” relations between queries and documents
• Approach
  – Projection to latent space
  – Regularization or constraints
• Results
  – Significantly enhances accuracy of query document matching
Matching in Latent Space
• Queries q_1, q_2, ..., q_m in query space; documents d_1, d_2, ..., d_n in document space
• Mappings L_q and L_d project queries and documents into a common latent space, where matching is computed
• Matching between heterogeneous data in general; example: image annotation, projecting keywords (hook, fishing, singer, soldier, warrior, microphone) and images into latent space
IR Models as Similarity Functions (Xu and Li 2010)
• Queries q_1, ..., q_m and documents d_1, ..., d_n are mapped into the same new space of unigrams
• Conventional IR models (VSM, BM25, LM, MRF) are similarity functions in this space
• The mapping functions are diagonal matrices
Partial Least Square (PLS)
• Setting
  – Two spaces: q ∈ Q, d ∈ D
• Input
  – Training data: {(q_i, d_i, c_i)}, i = 1, ..., N
• Output
  – Matching function f(q, d)
• Assumption
  – Two linear (and orthonormal) transformations L_q, L_d
  – Dot product as similarity function: f(q, d) = ⟨L_q q, L_d d⟩
• Optimization
  (L_q, L_d) = argmax Σ_i c_i ⟨L_q q_i, L_d d_i⟩  s.t. L_q^T L_q = I, L_d^T L_d = I
Solution of Partial Least Square
• Non-convex optimization
• Global optimal solution exists
• Global optimum can be found by solving SVD (Singular Value Decomposition)
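A sketch of that SVD solution for latent dimension 1: power iteration recovers the leading singular vectors of the weighted cross-product matrix M = Σ_i c_i q_i d_i^T; the click data below is illustrative.

```python
# Power iteration for the leading left/right singular vectors of M,
# i.e., the rank-1 PLS solution.

def top_singular_vectors(M, iters=200):
    m, n = len(M), len(M[0])
    v = [1.0 / n ** 0.5] * n
    u = [0.0] * m
    for _ in range(iters):
        u = [sum(M[r][j] * v[j] for j in range(n)) for r in range(m)]   # M v
        nu = sum(x * x for x in u) ** 0.5
        u = [x / nu for x in u]
        v = [sum(M[r][j] * u[r] for r in range(m)) for j in range(n)]   # M^T u
        nv = sum(x * x for x in v) ** 0.5
        v = [x / nv for x in v]
    return u, v

# toy click data: one-hot queries and documents with click counts c_i
queries = [[1, 0], [0, 1]]
docs = [[1, 0, 0], [0, 1, 0]]
clicks = [3, 1]
M = [[sum(c * q[r] * d[j] for q, d, c in zip(queries, docs, clicks))
      for j in range(3)] for r in range(2)]
u, v = top_singular_vectors(M)
# the dominant direction follows the most-clicked (query, document) pair
assert abs(u[0]) > 0.99 and abs(v[0]) > 0.99
```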
Regularized Mapping to Latent Space (RMLS)
• Setting
  – Two spaces: q ∈ Q, d ∈ D
• Input
  – Training data: {(q_i, d_i, c_i)}, i = 1, ..., N
• Output
  – Matching function f(q, d)
• Assumption
  – Two linear (and sparse) transformations L_q, L_d
  – Dot product as similarity function: f(q, d) = ⟨L_q q, L_d d⟩
• Optimization
  (L_q, L_d) = argmax Σ_i c_i ⟨L_q q_i, L_d d_i⟩  s.t. |l_q| ≤ λ_q, |l_d| ≤ λ_d, ||l_q|| ≤ θ_q, ||l_d|| ≤ θ_d
  where l_q and l_d are row vectors of L_q and L_d
Solution of Regularized Mapping to Latent Space (RMLS)
• Coordinate descent; repeat:
  – Fix L_q, update L_d
  – Fix L_d, update L_q
• No guarantee to find global optimum
• Updates can be parallelized by rows
Comparison between PLS and RMLS

                     PLS                            RMLS
Assumption           Orthogonality                  L1 and L2 regularization
Optimization Method  Singular Value Decomposition   Coordinate Descent
Optimality           Global optimum                 Local optimum
Efficiency           Low                            High
Scalability          Low                            High
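To make the latent-space idea concrete, here is a minimal sketch of the matching function f(q, d) = ⟨L_q q, L_d d⟩ with hand-picked (not learned) projection matrices; all values are illustrative.

```python
# Two linear maps project query and document term vectors into a shared
# k-dimensional space; the matching score is the projections' dot product.

def project(L, x):
    # L: k x n matrix, x: n-vector -> k-vector
    return [sum(row[j] * x[j] for j in range(len(x))) for row in L]

def match(Lq, Ld, q, d):
    u, v = project(Lq, q), project(Ld, d)
    return sum(a * b for a, b in zip(u, v))

# toy example: 3-term query space, 4-term document space, k = 2
Lq = [[1.0, 0.0, 0.5],
      [0.0, 1.0, 0.0]]
Ld = [[0.8, 0.0, 0.0, 0.2],
      [0.0, 0.9, 0.1, 0.0]]
q = [1, 0, 1]       # query contains terms 1 and 3
d1 = [1, 0, 0, 1]   # document aligned with latent dimension 1, like q
d2 = [0, 1, 1, 0]   # document aligned with latent dimension 2
assert match(Lq, Ld, q, d1) > match(Lq, Ld, q, d2)
```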
String Re-writing Kernel
Learning for String Re-writing
• Training data:
  ((distance between sun and earth), (how far sun earth), +1)
  ((distance between beijing and shanghai), (how far is beijing from shanghai), +1)
  ...
  ((distance between moon and earth), (how far sun earth), -1)
• Learning System produces a Model; the Prediction System outputs +1/-1 for a new pair, e.g., ((distance between moon and earth), (how far is earth from moon))
String Re-writing Kernel (Bu et al., 2012)
• Training data: ((s_1, t_1), y_1), ..., ((s_n, t_n), y_n)
• Model: f(s, t) = sign( Σ_{i=1}^{n} α_i y_i K((s_i, t_i), (s, t)) + b )
• String Re-writing Kernel: K((s_i, t_i), (s, t))
Re-Writing Rule
• A rule is applied to re-writing of substrings
• Re-writing rule: "* wrote *" → "* was written by *"
• Applies to: "Shakespeare wrote Hamlet in 16 century" → "Hamlet was written by Shakespeare"
• Applies to: "Cao Xueqin wrote Dream of the Red Chamber" → "Dream of the Red Chamber was written by Cao Xueqin"
• Does not relate the pair ("Cao Xueqin wrote Dream of the Red Chamber", "Hamlet was written by Shakespeare"), since the wildcards cannot be bound consistently
Re-Writing Rule (2)
• Multiple re-writing rules may be applied to the same pair
• Example: "* wrote * in 16 century" → "* was written by *" also re-writes "Shakespeare wrote Hamlet in 16 century" into "Hamlet was written by Shakespeare"
Definition of String Re-writing Kernel (SRK)
• K((s_1, t_1), (s_2, t_2)) = ⟨Φ(s_1, t_1), Φ(s_2, t_2)⟩ = Σ_{r ∈ R} φ_r(s_1, t_1) φ_r(s_2, t_2)
• Φ(s, t) = (φ_r(s, t))_{r ∈ R}, where φ_r(s, t) weights the number of times rule r re-writes s into t, with weight λ ∈ (0, 1]
Similarity between Two Re-writings
• Similarity between two re-writings is measured with respect to re-writing rules, e.g., r_100: "* wrote *" → "* was written by *" and r_1001: "* wrote * in 16 century" → "* was written by *"
• (Shakespeare wrote Hamlet in 16 century, Hamlet was written by Shakespeare) matches both r_100 and r_1001; (Cao Xueqin wrote Dream of the Red Chamber, Dream of the Red Chamber was written by Cao Xueqin) matches only r_100
• The corresponding rule-indicator vectors, e.g., (0, ..., 1, ..., 1, ...) and (0, ..., 1, ..., 0, ...), determine the inner-product similarity
A Sub-Class of String Re-writing Kernel
• Advantage: matching between informally written sentences, such as long queries in search, can be effectively performed
• Challenges:
  – Number of re-writing rules is infinite
  – Number of applicable rules increases exponentially with sentence length
• Our approach: a sub-class, kb-SRK
Definition of kb-SRK
• Special class of SRK
• Re-writing rules in kb-SRK
– String patterns in rule are of length k
– Wildcard ? only substitutes a single character
– Alignment between string patterns is bijective
• Example rule: "? wrote ?" → "? described ?"
Reformulation of kb-SRK
• Definition:
  φ_r(s, t) = Σ_{α ∈ k-grams(s)} Σ_{β ∈ k-grams(t)} φ̄_r(α, β)
  where φ̄_r(α, β) is the weight λ if rule r matches the k-gram pair (α, β), and 0 otherwise
• Since the summations are exchangeable, the kernel has the equivalent formulation:
  K((s_1, t_1), (s_2, t_2)) = Σ_{α_1 ∈ k-grams(s_1)} Σ_{β_1 ∈ k-grams(t_1)} Σ_{α_2 ∈ k-grams(s_2)} Σ_{β_2 ∈ k-grams(t_2)} k̄((α_1, β_1), (α_2, β_2))
  where k̄((α_1, β_1), (α_2, β_2)) = Σ_{r ∈ R} φ̄_r(α_1, β_1) φ̄_r(α_2, β_2)
Similarity between Re-writings of k-gram Pairs
• Two k-gram pairs: (s_1, t_1) = (abbccbb, cbcbbcb) and (s_2, t_2) = (abcccdd, cbccdcd)
• Rule "a b ? c c ? ?" → "c b c ? ? c ?" is applicable to both pairs
• Rule "a b ? c c ? ?" → "c ? c b ? c ?" is not applicable to the second pair
• Only the rules applicable to both k-gram pairs need to be considered
Efficiently Calculating Number of Applicable Rules on Combined k-gram Pairs
• Combine the two k-gram pairs position by position into "doubles": for the source side, (a,a)(b,b)(b,c)(c,c)(c,c)(b,d)(b,d); for the target side, (c,c)(b,b)(c,c)(b,c)(b,d)(c,c)(b,d)
• A rule applicable to both pairs corresponds to choosing, per double, whether to substitute it by a wildcard: identical doubles (e.g., (a,a), (c,c)) can be either substituted or not; non-identical doubles (e.g., (b,c), (b,d)) must be substituted
• The number of applicable rules can therefore be computed in closed form from the counts of identical and non-identical doubles
Time Complexity of Calculation of kb-SRK
• Inner loop: time complexity polynomial in k
• Outer loop: empirical time complexity O(l² k), where l = max(|s_1|, |t_1|, |s_2|, |t_2|)
4. Applications to Social Media
4.1 Tweet Recommendation
Tweet Recommendation
• Users u_1, ..., u_m and tweets i_1, ..., i_n form a bipartite graph; observed entries (e.g., 1s) indicate which tweets users have retweeted
• Task: predict the missing entries, i.e., recommend tweets to users
SVD Feature: Feature-based Matrix Factorization (Chen et al., 2012)
• Model: r(u, i) = g(u, i) + ⟨L_u u, L_i i⟩, combining a global feature term and matching in latent space
• Loss function: squared loss
• Algorithm: gradient descent

Matching between User and Item
• g: global feature vector; u: user vector; i: item vector
• ⟨L_u u, L_i i⟩: matching in latent space
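A minimal sketch of the latent-matching part of such a model: matrix factorization trained by gradient descent on squared loss (biases and global features omitted; the data is illustrative).

```python
# Factor an observed user-item matrix into U V^T by stochastic gradient
# descent on squared loss with L2 regularization.
import random

def factorize(ratings, m, n, k=2, lr=0.05, reg=0.02, epochs=1000, seed=0):
    """ratings: list of (user, item, value). Returns U (m x k), V (n x k)."""
    rnd = random.Random(seed)
    U = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    V = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):  # gradient step on both factors
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# toy user-tweet matrix with a few observed entries
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 1), (1, 2, 2), (2, 1, 4), (2, 2, 3)]
U, V = factorize(ratings, m=3, n=3)
pred = sum(a * b for a, b in zip(U[0], V[0]))
assert abs(pred - 5) < 1.5  # fitted prediction for an observed entry
```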
4.2 Short Text Conversation
Short Text Conversation
• One step toward the Turing Test; an important subtask of dialogue
• Human: a message; Computer: a reply
• Massive amount of data available, e.g.:
  Post: Our paper entitled learning to rank has been accepted by ACL.
  Reply: Congratulations! It is a great achievement
  Post: We are lucky. Our paper has been accepted by SIGIR this year. We are going to present it.
  Reply: Great news! Please accept my congrats!
  Post: The PC of WSDM notified us that our paper has been accepted.
  Reply: Awesome! It is a great achievement
System of Retrieval-based Short Text Conversation
• Given a post, find the most suitable response; take it as a search problem
• Offline: build an index of posts and responses from a large repository of post-response pairs; train models by learning to match and learning to rank
• Online: short text retrieval from the index → retrieved posts and responses → matching → matched responses → ranking → ranked responses → best response
Learning to Match for Short Text Conversation
• Learning System: trained on text-response pairs (t_1, c_1), (t_2, c_2), ..., (t_n, c_n) to produce a matching Model
• Matching System: uses the model to score a new pair (t_{n+1}, c_{n+1})
Linear Matching Model (RMLS) (Wang et al. EMNLP 2013)
• Short text = sum of word vectors
• Matching degree between short texts = Inner product of their document vectors
• Word vectors are learned from short text data
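A small sketch of this linear model: each short text is the sum of its word vectors, and the matching degree is their inner product; the word vectors below are illustrative stand-ins for learned ones.

```python
# Sum-of-word-vectors text representation with inner-product matching.

def text_vector(words, word_vecs, dim=2):
    v = [0.0] * dim
    for w in words:
        for f, x in enumerate(word_vecs.get(w, [0.0] * dim)):
            v[f] += x
    return v

def match(post, response, word_vecs):
    p, r = text_vector(post, word_vecs), text_vector(response, word_vecs)
    return sum(a * b for a, b in zip(p, r))

# toy vectors: "paper/accepted/congrats" share a latent direction
word_vecs = {
    "paper": [1.0, 0.0], "accepted": [1.0, 0.2], "congrats": [0.9, 0.1],
    "rain": [0.0, 1.0], "umbrella": [0.1, 1.0],
}
post = ["paper", "accepted"]
assert match(post, ["congrats"], word_vecs) > match(post, ["umbrella"], word_vecs)
```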
Deep Matching Model (Lu & Li, NIPS 2013)
• Take interactions between texts as input
• Learning features in different granularities using LDA
• Learning parameters using back-propagation
Representing Posts and Comments as Bags of Words
• Post "Mats catch mice" with comment "Great mats" → bag of words: P_mats, P_catch, P_mice, C_great, C_mats
• Post "Mats chase mice" with comment "Poor mice" → bag of words: P_mats, P_chase, P_mice, C_poor, C_mice
• Words in post and comment are viewed as different words (prefixes P_ and C_)
Constructing Topics Using Latent Dirichlet Allocation
• Bags of words: (P_mats, P_catch, P_mice, C_great, C_mats), (P_mats, P_chase, P_mice, C_poor, C_mice), ...
• Learned topics, e.g., (P_mats, P_mice, ..., C_mice, C_mats) and (P_chase, P_catch, ..., C_great, C_poor, ...)
• Topics in different granularities form different hidden layers
Architecture of Deep Matching Model
• Input: the (Q, A) interaction space between the binary word vectors of question and answer, e.g.:
  Q: “What are the famous local specialties of Beijing?” (北京有什么出名的特产?)
  A: “Roast duck! If you want it hot, go to a roast duck restaurant such as Quanjude; vacuum-packed ones are available in supermarkets” (烤鸭啊,想吃热乎的去烤鸭店,如全聚德,真空包装的超市就有)
• Multiple layers of local matching models, e.g.:
  Local Model 1: (specialty, local product, taste, ...) || (tofu, roast duck, sweet, game, glutinous rice, ...)
  Local Model 2: (journey, arrangement, location, ...) || (distance, safety, tunnel, highway, air ticket, ...)
• The top layer outputs the matching degree
5. Summary
Summary
• Learning to rank and learning to match are key technologies for many applications including search and recommendation
• Learning to Rank: Ranking SVM, IR SVM, AdaRank
• Learning to Match: Regularized Matching in Latent Space, String Re-writing Kernel
• Many applications on social media can be constructed with learning to rank and learning to match techniques
References
• Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technology, Lecture 12, Morgan & Claypool Publishers, 2011.
• Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions on Information and Systems, E94-D(10), 2011.
• Jun Xu and Hang Li. AdaRank: A Boosting Algorithm for Information Retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR’07), 391-398, 2007.
• Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. Adapting Ranking SVM to Document Retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR’06), 186-193, 2006.
• Hang Li, Jun Xu. Semantic Matching in Search. Foundations and Trends in Information Retrieval, Now Publishers, 2014, ISBN 978-1-60198-804-1.
• Wei Wu, Zhengdong Lu, Hang Li. Learning Bilinear Model for Matching Queries and Documents. Journal of Machine Learning Research (JMLR), 14: 2519-2548, 2013.
• Fan Bu, Hang Li, Xiaoyan Zhu. String Re-Writing Kernel. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL’12), 449-458, 2012.
• Zhengdong Lu, Hang Li. A Deep Architecture for Matching Short Texts. In Advances in Neural Information Processing Systems 26 (NIPS 2013), 1367-1375.