
Page 1:

metric learning course

RI course, Master DAC, UPMC (built from the ECML-PKDD 2015 tutorial by A. Bellet and M. Cord)

1. Introduction

2. Linear metric learning

3. Nonlinear extensions

4. Large-scale metric learning

5. Metric learning for structured data

6. Generalization guarantees

1

Page 2:

introduction

Page 3:

motivation

• Similarity / distance judgments are essential components of many human cognitive processes (see e.g., [Tversky, 1977])

• Compare perceptual or conceptual representations

• Perform recognition, categorization...

• Underlie most machine learning and data mining techniques

3

Page 4:

motivation

Nearest neighbor classification

[Figure: unlabeled query point "?" among labeled points]

4

Page 5:

motivation

Clustering

5

Page 6:

motivation

Information retrieval

Query document

Most similar documents

6

Page 7:

motivation

Data visualization

[Figure: t-SNE visualization of the MNIST digits 0-9; image taken from [van der Maaten and Hinton, 2008]]

7

Page 8:

motivation

• Choice of similarity is crucial to performance

• Humans weight features differently depending on context [Nosofsky, 1986, Goldstone et al., 1997]

• Facial recognition vs. determining facial expression

• Fundamental question: how to appropriately measure similarity or distance for a given task?

• Metric learning → infer this automatically from data

• Note: we will refer to distance or similarity interchangeably as "metric"

8

Page 9:

metric learning in a nutshell

Metric Learning

9

Page 10:

metric learning in a nutshell

Basic recipe

1. Pick a parametric distance or similarity function

• Say, a distance function $D_M(x, x')$ parameterized by $M$

2. Collect similarity judgments on data pairs/triplets

• $\mathcal{S} = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$
• $\mathcal{D} = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are dissimilar}\}$
• $\mathcal{R} = \{(x_i, x_j, x_k) : x_i \text{ is more similar to } x_j \text{ than to } x_k\}$

3. Estimate parameters s.t. metric best agrees with judgments

• Solve an optimization problem of the form

$$\hat{M} = \arg\min_M \Big[ \underbrace{\ell(M, \mathcal{S}, \mathcal{D}, \mathcal{R})}_{\text{loss function}} + \underbrace{\lambda\, \mathrm{reg}(M)}_{\text{regularization}} \Big]$$

10

Page 11:

linear metric learning

Page 12:

mahalanobis distance learning

• Mahalanobis distance: $D_M(x, x') = \sqrt{(x - x')^T M (x - x')}$ where $M = \mathrm{Cov}(X)^{-1}$ is the inverse of the covariance matrix estimated over a data set $X$

12

Page 13:

mahalanobis distance learning

• Mahalanobis (pseudo) distance:

$$D_M(x, x') = \sqrt{(x - x')^T M (x - x')}$$

where $M \in \mathbb{S}_+^d$ is a symmetric PSD $d \times d$ matrix

• Equivalent to Euclidean distance after a linear projection:

$$D_M(x, x') = \sqrt{(x - x')^T L^T L (x - x')} = \sqrt{(Lx - Lx')^T (Lx - Lx')}$$

• If $M$ has rank $k \le d$, $L \in \mathbb{R}^{k \times d}$ reduces the data dimension

13
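To make the equivalence concrete, here is a minimal numpy sketch (my own illustration, not code from the course) checking that $D_M$ with $M = L^T L$ equals the Euclidean distance after projecting with $L$:

```python
# Minimal sketch: Mahalanobis distance from a PSD matrix M, and the same
# value computed as a Euclidean distance after the linear projection L.
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
L = rng.normal(size=(k, d))      # projection matrix, rank k <= d
M = L.T @ L                      # PSD by construction

def mahalanobis(x, x2, M):
    diff = x - x2
    return np.sqrt(diff @ M @ diff)

x, x2 = rng.normal(size=d), rng.normal(size=d)
print(mahalanobis(x, x2, M))             # D_M(x, x')
print(np.linalg.norm(L @ x - L @ x2))    # ||Lx - Lx'||_2 -- same value
```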

Page 14:

mahalanobis distance learning

A first approach [Xing et al., 2002]

• Targeted task: clustering with side information

Formulation

$$\max_{M \in \mathbb{S}_+^d} \ \sum_{(x_i, x_j) \in \mathcal{D}} D_M(x_i, x_j) \qquad \text{s.t.} \ \sum_{(x_i, x_j) \in \mathcal{S}} D_M^2(x_i, x_j) \le 1$$

• Convex in M and always feasible (take M = 0)

• Solved with projected gradient descent

• Time complexity of projection onto $\mathbb{S}_+^d$ is $O(d^3)$

• Only looks at sums of distances (a projected-gradient sketch follows)

14
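A rough sketch of the corresponding solver step (my own illustration, not the authors' code; the unit constraint on similar pairs is omitted and would be handled by a second projection):

```python
# Projected gradient ascent on the sum of D_M over dissimilar pairs,
# followed by a projection onto the PSD cone; the O(d^3) cost comes from
# the eigendecomposition inside the projection.
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto S^d_+ by clipping negative eigenvalues."""
    eigval, eigvec = np.linalg.eigh(M)               # O(d^3)
    return (eigvec * np.clip(eigval, 0, None)) @ eigvec.T

def xing_step(M, dissim_pairs, lr=0.1):
    """One ascent step; dissim_pairs is a list of (x_i, x_j) numpy arrays."""
    grad = np.zeros_like(M)
    for xi, xj in dissim_pairs:
        diff = xi - xj
        dm = np.sqrt(max(diff @ M @ diff, 1e-12))
        grad += np.outer(diff, diff) / (2 * dm)      # d/dM of sqrt(diff^T M diff)
    return project_psd(M + lr * grad)
```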

Page 15:

mahalanobis distance learning

Large Margin Nearest Neighbor [Weinberger et al., 2005]

• Targeted task: k-NN classification

• Constraints derived from labeled data

• $\mathcal{S} = \{(x_i, x_j) : y_i = y_j,\ x_j \text{ belongs to the } k\text{-neighborhood of } x_i\}$
• $\mathcal{R} = \{(x_i, x_j, x_k) : (x_i, x_j) \in \mathcal{S},\ y_i \ne y_k\}$

15

Page 16:

mahalanobis distance learning

Large Margin Nearest Neighbor [Weinberger et al., 2005]

Formulation

$$\min_{M \in \mathbb{S}_+^d,\ \xi \ge 0} \ (1 - \mu) \sum_{(x_i, x_j) \in \mathcal{S}} D_M^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.}\quad D_M^2(x_i, x_k) - D_M^2(x_i, x_j) \ge 1 - \xi_{ijk} \qquad \forall (x_i, x_j, x_k) \in \mathcal{R}$$

$\mu \in [0, 1]$: trade-off parameter

• Number of constraints on the order of $kn^2$

• Solver based on projected gradient descent with a working set

• Simple alternative: only consider the closest "impostors"

• Chicken-and-egg situation: which metric should be used to build the constraints?

• Possible overfitting in high dimensions

16
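For concreteness, a small sketch (my own rendering, not the reference solver) of the LMNN objective for a fixed $M$, with the slack variables replaced by their optimal values, i.e., hinge terms:

```python
# LMNN objective: pull term over target neighbors + hinge push term over
# triplets (x_i, x_j, x_k) with y_i != y_k.
import numpy as np

def d2(M, a, b):
    diff = a - b
    return diff @ M @ diff

def lmnn_objective(M, S, R, mu=0.5):
    """S: list of (x_i, x_j) target-neighbor pairs; R: list of triplets."""
    pull = sum(d2(M, xi, xj) for xi, xj in S)
    push = sum(max(0.0, 1.0 + d2(M, xi, xj) - d2(M, xi, xk))
               for xi, xj, xk in R)          # hinge = optimal slack xi_ijk
    return (1 - mu) * pull + mu * push
```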

Page 17:

mahalanobis distance learning

Large Margin Nearest Neighbor [Weinberger et al., 2005]

[Figure: test image with its nearest neighbor before and after training; bar chart of testing error rates (%) on MNIST, NEWS, ISOLET, BAL, FACES, IRIS, and WINE for k-NN with the LMNN metric, k-NN with the Euclidean distance, and multiclass SVM]

17

Page 18:

mahalanobis distance learning

Interesting regularizers

• Add regularization term to prevent overfitting

• Simple choice: $\|M\|_F^2 = \sum_{i,j=1}^d M_{ij}^2$ (squared Frobenius norm)

• Used in [Schultz and Joachims, 2003] and many others

• LogDet divergence (used in ITML [Davis et al., 2007])

$$D_{ld}(M, M_0) = \mathrm{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - d = \sum_{i,j} \frac{\lambda_i}{\theta_j} (v_i^T u_j)^2 - \sum_i \log\left(\frac{\lambda_i}{\theta_i}\right) - d$$

where $M = V \Lambda V^T$ and $M_0 = U \Theta U^T$ is PD

• Remains close to a good prior metric $M_0$ (e.g., the identity)

• Implicitly ensures that $M$ is PD

• Convex in $M$ (the determinant of a PD matrix is log-concave)

• Efficient Bregman projections in $O(d^2)$

18
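A small numpy illustration of the LogDet divergence (my own sketch, assuming PD inputs):

```python
# D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d; zero iff M == M0,
# and it grows as M drifts away from the prior.
import numpy as np

def logdet_div(M, M0):
    d = M.shape[0]
    A = M @ np.linalg.inv(M0)
    sign, logdet = np.linalg.slogdet(A)
    return np.trace(A) - logdet - d

I = np.eye(3)
print(logdet_div(I, I))        # 0.0
print(logdet_div(2 * I, I))    # 3 * (1 - ln 2) > 0
```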

Page 19:

mahalanobis distance learning

Interesting regularizers

• Mixed $L_{2,1}$ norm: $\|M\|_{2,1} = \sum_{i=1}^d \|M_i\|_2$
  • Tends to zero out entire columns → feature selection
  • Used in [Ying et al., 2009]
  • Convex but nonsmooth
  • Efficient proximal gradient algorithms (see e.g., [Bach et al., 2012])

• Trace (or nuclear) norm: $\|M\|_* = \sum_{i=1}^d \sigma_i(M)$
  • Favors low-rank matrices → dimensionality reduction
  • Used in [McFee and Lanckriet, 2010]
  • Convex but nonsmooth
  • Efficient Frank-Wolfe algorithms [Jaggi, 2013]

19

Page 20:

linear similarity learning

• The Mahalanobis distance satisfies the distance axioms
  • Nonnegativity, symmetry, triangle inequality
  • Natural regularization, required by some applications

• In practice, these axioms may be violated
  • By human similarity judgments (see e.g., [Tversky and Gati, 1982])
  • By some good visual recognition systems [Scheirer et al., 2014]

• Alternative: learn a bilinear similarity function $S_M(x, x') = x^T M x'$
  • See [Chechik et al., 2010, Bellet et al., 2012b, Cheng, 2013]
  • No PSD constraint on $M$ → computational benefits
  • Theory of learning with arbitrary similarity functions [Balcan and Blum, 2006]

20

Page 21:

nonlinear extensions

Page 22:

beyond linearity

• So far, we have essentially been learning a linear projection

• Advantages
  • Convex formulations
  • Robustness to overfitting

• Drawback
  • Inability to capture nonlinear structure

[Figure: linear metric vs. kernelized metric vs. multiple local metrics]

22

Page 23:

kernelization of linear methods

Definition (Kernel function)

A symmetric function $K$ is a kernel if there exists a mapping function $\phi : \mathcal{X} \to \mathcal{H}$ from the instance space $\mathcal{X}$ to a Hilbert space $\mathcal{H}$ such that $K$ can be written as an inner product in $\mathcal{H}$:

$$K(x, x') = \langle \phi(x), \phi(x') \rangle.$$

Equivalently, $K$ is a kernel if it is positive semi-definite (PSD), i.e.,

$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0$$

for all finite sequences of $x_1, \dots, x_n \in \mathcal{X}$ and $c_1, \dots, c_n \in \mathbb{R}$.

23

Page 24:

kernelization of linear methods

Kernel trick for metric learning

• Notations
  • Kernel $K(x, x') = \langle \phi(x), \phi(x') \rangle$, training data $\{x_i\}_{i=1}^n$
  • $\phi_i \stackrel{\text{def}}{=} \phi(x_i) \in \mathbb{R}^D$, $\Phi \stackrel{\text{def}}{=} [\phi_1, \dots, \phi_n] \in \mathbb{R}^{D \times n}$

• Mahalanobis distance in kernel space:

$$D_M^2(\phi_i, \phi_j) = (\phi_i - \phi_j)^T M (\phi_i - \phi_j) = (\phi_i - \phi_j)^T L^T L (\phi_i - \phi_j)$$

• Setting $L^T = \Phi U^T$, where $U \in \mathbb{R}^{n \times n}$, we get

$$D_M^2(\phi(x), \phi(x')) = (k - k')^T M (k - k')$$

• $M = U^T U \in \mathbb{R}^{n \times n}$, $k = \Phi^T \phi(x) = [K(x_1, x), \dots, K(x_n, x)]^T$

• Justified by a representer theorem [Chatpatanasiri et al., 2010]

24
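A toy sketch of the trick (my own illustration, assuming an RBF kernel): the learned $n \times n$ metric only ever touches a new point through its kernel vector $k$:

```python
# Kernelized Mahalanobis distance: all computations go through
# k = [K(x_1, x), ..., K(x_n, x)]^T, never through phi(x) itself.
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_vec(X_train, x, kernel=rbf):
    return np.array([kernel(xi, x) for xi in X_train])

def kernel_sqdist(M, X_train, x, x2):
    """D_M^2 in kernel space, with M = U^T U an n x n matrix."""
    diff = kernel_vec(X_train, x) - kernel_vec(X_train, x2)
    return diff @ M @ diff
```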

Page 25:

learning a nonlinear metric

• More flexible approach: learn a nonlinear mapping $\phi$ to optimize

$$D_\phi(x, x') = \|\phi(x) - \phi(x')\|_2$$

• Possible parameterizations for $\phi$:

• Regression trees [Kedem et al., 2012]

• Deep neural nets [Chopra et al., 2005, Hu et al., 2014]

25

Page 26:

Metric Learning in CV

• Pairwise constraints for learning

8

Page 27:

Deep Metric Learning

9

• Deep metric learning optimization
  – Siamese architecture [LeCun, NIPS 1993]

Page 28:

Deep Metric Learning

10

• Deep metric learning optimization

[credit: Y. LeCun 05]

Page 29:

Deep Metric Learning

11

• Deep metric learning optimization

[Y. LeCun, CVPR 2005, 2006] DrLIM scheme

Similar to the LMNN procedure:

Y = 0 for similar pairs

Y = 1 for dissimilar pairs
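For reference, a short numpy sketch (not the original DrLIM code) of the contrastive loss from the DrLIM paper [Hadsell, Chopra, LeCun, CVPR 2006], using the Y convention above:

```python
# Contrastive loss: pull similar pairs (Y=0) together, push dissimilar
# pairs (Y=1) apart until they are at least a margin m away.
import numpy as np

def contrastive_loss(emb1, emb2, Y, m=1.0):
    """emb1, emb2: embeddings phi(x), phi(x'); Y: 0 similar, 1 dissimilar."""
    d = np.linalg.norm(emb1 - emb2)
    return (1 - Y) * 0.5 * d ** 2 + Y * 0.5 * max(0.0, m - d) ** 2
```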

Page 30:

Deep Metric Learning

12

• Siamese network for pairwise comparison: DDML approach

[Credit: Hu, CVPR 2014]

Page 31:

Deep Metric Learning

13

• DDML optimization [Hu, CVPR 2014]:

Page 32:

Results on the face verification problem

14

2 images => same face?

Labeled Faces in the Wild (LFW) -- 27 SIFT descriptors concatenated

10-fold cross-validation (600 pairs per fold)

Classical errors: about 15% better with metric learning

Method    Accuracy (%)
ITML      76.2 ± 0.5
LDML      77.5 ± 0.5
PCCA      82.2 ± 0.4
Fantope   83.5 ± 0.5

Page 33:

People verification

16

Page 34:

Visualization

17

Deep learning for data visualisation

Metric vs. feature learning

[LeCun, CVPR 2006]

Page 35:

Deep Metric Learning

18

=> Many different contexts provide training data

Metric learning for geo-localization: [Le Barz, ICIP 2015], from the LMNN scheme

Robotics applications: [Carlevaris-Bianco, IROS 2014], from the DrLIM scheme

Page 36:

Feature learning for image retrieval

19

[Figure: query images and top-5 results retrieved with the learned distance vs. the Euclidean distance]

Page 37:

Feature learning for image retrieval

22

[Figure: query images and top-5 results retrieved with the learned distance vs. the Euclidean distance]

Page 38:

learning multiple local metrics

• Simple linear metrics perform well locally

• Idea: different metrics for different parts of the space

• Various issues
  • How to split the space?
  • How to avoid blowing up the number of parameters to learn?
  • How to make local metrics "mutually comparable"?
  • ...

26

Page 39:

learning multiple local metrics

Multiple Metric LMNN [Weinberger and Saul, 2009]

• Group data into C clusters

• Learn a metric for each cluster in a coupled fashion

Formulation

$$\min_{\substack{M_1, \dots, M_C \in \mathbb{S}_+^d \\ \xi \ge 0}} \ (1 - \mu) \sum_{(x_i, x_j) \in \mathcal{S}} D_{M_{C(x_j)}}^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.}\quad D_{M_{C(x_k)}}^2(x_i, x_k) - D_{M_{C(x_j)}}^2(x_i, x_j) \ge 1 - \xi_{ijk} \qquad \forall (x_i, x_j, x_k) \in \mathcal{R}$$

• Remains convex

• Computationally more expensive than standard LMNN

• Subject to overfitting

• Many parameters

• Lack of smoothness in metric change

27

Page 40:

learning multiple local metrics

Sparse Compositional Metric Learning [Shi et al., 2014]

• Learn a metric for each point in feature space

• Use the following parameterization

D2w (x , x 0) = (x � x 0)T

KX

k=1

wk(x)bkbTk

!(x � x 0),

• bkbTk : rank-1 basis (generated from training data)

• wk(x) = (aTk x + ck)

2: weight of basis k

• A 2 Rd⇥K and c 2 R

K : parameters to learn

28
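A compact sketch of this parameterization (my own, with made-up shapes):

```python
# Pointwise local metric: D_w^2(x, x') = sum_k w_k(x) * (b_k^T (x - x'))^2,
# with data-dependent nonnegative weights w_k(x) = (a_k^T x + c_k)^2.
import numpy as np

def scml_sqdist(x, x2, B, A, c):
    """B: (K, d) matrix whose rows are the bases b_k; A: (d, K); c: (K,)."""
    w = (A.T @ x + c) ** 2           # w_k(x) >= 0 by construction
    proj = B @ (x - x2)              # b_k^T (x - x') for every k
    return float(np.sum(w * proj ** 2))
```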

Page 41:

learning multiple local metrics

Sparse Compositional Metric Learning [Shi et al., 2014]

Formulation

$$\min_{\tilde{A} \in \mathbb{R}^{(d+1) \times K}} \ \sum_{(x_i, x_j, x_k) \in \mathcal{R}} \big[ 1 + D_w^2(x_i, x_j) - D_w^2(x_i, x_k) \big]_+ + \lambda \|\tilde{A}\|_{2,1}$$

• $\tilde{A}$: stacking of $A$ and $c$

• $[\cdot]_+ = \max(0, \cdot)$: hinge loss

• Nonconvex problem

• Adapts to geometry of data

• More robust to overfitting

• Limited number of parameters

• Basis selection

• Metric varies smoothly over the feature space

29

Page 42:

large-scale metric learning

Page 43:

main challenges

• How to deal with large datasets?

• Number of similarity judgments can grow as $O(n^2)$ or $O(n^3)$

• How to deal with high-dimensional data?

• Cannot store a $d \times d$ matrix

• Cannot afford computational complexity in $O(d^2)$ or $O(d^3)$

31

Page 44:

case of large n

Online learning

OASIS [Chechik et al., 2010]

• Set $M_0 = I$

• At step $t$, receive $(x_i, x_j, x_k) \in \mathcal{R}$ and update by solving

$$M_t = \arg\min_{M, \xi} \ \tfrac{1}{2}\|M - M_{t-1}\|_F^2 + C\xi \quad \text{s.t.}\quad 1 - S_M(x_i, x_j) + S_M(x_i, x_k) \le \xi,\ \ \xi \ge 0$$

• $S_M(x, x') = x^T M x'$, $C$: trade-off parameter

• Closed-form solution at each iteration (sketched below)

• Trained with 160M triplets in 3 days on 1 CPU

32
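A sketch of one such update (my rendering of the passive-aggressive closed form, not the released OASIS code):

```python
# One OASIS step on a triplet (x_i more similar to x_j than to x_k),
# with the bilinear similarity S_M(x, x') = x^T M x'.
import numpy as np

def oasis_step(M, xi, xj, xk, C=0.1):
    loss = 1.0 - xi @ M @ xj + xi @ M @ xk
    if loss <= 0:                        # margin satisfied: passive, no update
        return M
    V = np.outer(xi, xj - xk)            # gradient direction
    tau = min(C, loss / np.sum(V ** 2))  # aggressive step size, capped by C
    return M + tau * V
```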

Page 45:

case of large n

Stochastic and distributed optimization

• Assume metric learning problem of the form

$$\min_M \ \frac{1}{|\mathcal{R}|} \sum_{(x_i, x_j, x_k) \in \mathcal{R}} \ell(M, x_i, x_j, x_k)$$

• Can use Stochastic Gradient Descent

• Use a random sample (mini-batch) to estimate gradient

• Better than full gradient descent when n is large

• Can be combined with distributed optimization

• Distribute triplets on workers

• Each worker uses a mini-batch to estimate the gradient

• A coordinator averages the estimates and updates $M$

33
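A generic sketch of this setup (assumptions: hinge triplet loss, plain Mahalanobis parameterization, and no PSD projection, for brevity):

```python
# Mini-batch SGD on the averaged triplet loss.
import numpy as np

def triplet_subgrad(M, xi, xj, xk):
    """Subgradient of [1 + D_M^2(xi, xj) - D_M^2(xi, xk)]_+ w.r.t. M."""
    dij, dik = xi - xj, xi - xk
    if 1 + dij @ M @ dij - dik @ M @ dik <= 0:
        return np.zeros_like(M)
    return np.outer(dij, dij) - np.outer(dik, dik)

def sgd(M, triplets, lr=0.01, batch=32, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(len(triplets), size=batch)      # random mini-batch
        g = sum(triplet_subgrad(M, *triplets[i]) for i in idx) / batch
        M = M - lr * g
    return M
```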

Page 46:

case of large d

Simple workarounds

• Learn a diagonal matrix

• Used in [Xing et al., 2002, Schultz and Joachims, 2003]

• Learn d parameters

• Only a weighting of features...

• Learn metric after dimensionality reduction (e.g., PCA)

• Used in many papers

• Potential loss of information

• Learned metric difficult to interpret

34

Page 47:

case of large d

Matrix decompositions

• Low-rank decomposition $M = L^T L$ with $L \in \mathbb{R}^{r \times d}$
  • Used in [Goldberger et al., 2004]
  • Learn $r \times d$ parameters
  • Generally nonconvex, must tune $r$

• Rank-1 decomposition $M = \sum_{k=1}^K w_k b_k b_k^T$

• Used in SCML [Shi et al., 2014]

• Learn K parameters

• Hard to generate good bases in high dimensions

35

Page 48:

metric learning for structured data

Page 49:

motivation

• Each data instance is a structured object

• Strings: words, DNA sequences

• Trees: XML documents

• Graphs: social network, molecules

ACGGCTT

• Metrics on structured data are convenient

• Act as proxy to manipulate complex objects

• Can use any metric-based algorithm

37

Page 50:

motivation

• Could represent each object by a feature vector

• Idea behind many kernels for structured data

• Could then apply standard metric learning techniques

• Potential loss of structural information

• Instead, focus on edit distances

• Directly operate on structured object

• Variants for strings, trees, graphs

• Natural parameterization by cost matrix

38

Page 51:

string edit distance

• Notations

• Alphabet $\Sigma$: finite set of symbols

• String $\mathbf{x}$: finite sequence of symbols from $\Sigma$

• $|\mathbf{x}|$: length of string $\mathbf{x}$

• $: empty string / symbol

Definition (Levenshtein distance)

The Levenshtein string edit distance between $\mathbf{x}$ and $\mathbf{x}'$ is the length of the shortest sequence of operations (called an edit script) turning $\mathbf{x}$ into $\mathbf{x}'$. Possible operations are insertion, deletion and substitution of symbols.

• Computed in $O(|\mathbf{x}| \cdot |\mathbf{x}'|)$ time by Dynamic Programming (DP); a sketch follows

39
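A standard DP sketch (my own, with a cost-matrix hook reused on the next page; with unit costs it computes exactly the Levenshtein distance):

```python
# Edit distance by dynamic programming; cost maps (u, v) pairs over
# Sigma + {'$'} to nonnegative costs ('$' = empty symbol), None = unit costs.
def edit_distance(x, y, cost=None):
    if cost is None:
        cost_fn = lambda u, v: 0 if u == v else 1
    else:
        cost_fn = lambda u, v: cost[(u, v)]
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost_fn(x[i - 1], '$')       # deletions
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost_fn('$', y[j - 1])       # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + cost_fn(x[i - 1], '$'),          # delete
                          D[i][j - 1] + cost_fn('$', y[j - 1]),          # insert
                          D[i - 1][j - 1] + cost_fn(x[i - 1], y[j - 1])) # substitute
    return D[n][m]

print(edit_distance("abb", "aa"))   # 2 with unit costs
```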

Page 52:

string edit distance

Parameterized version

• Use a nonnegative $(|\Sigma| + 1) \times (|\Sigma| + 1)$ matrix $C$
  • $C_{ij}$: cost of substituting symbol $i$ with symbol $j$

Example 1: Levenshtein distance

C   $  a  b
$   0  1  1
a   1  0  1
b   1  1  0

⟹ edit distance between abb and aa is 2 (needs at least two operations)

Example 2: specific costs

C    $  a   b
$    0  2  10
a    2  0   4
b   10  4   0

⟹ edit distance between abb and aa is 10 (a → $, b → a, b → a)

40
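Plugging the two cost matrices above into the edit_distance sketch from the previous page reproduces the claimed values (my illustration; costs are symmetric here):

```python
# '$' is the empty symbol; dict keys are (u, v) operation pairs.
sigma = ['$', 'a', 'b']
levenshtein = {(u, v): 0 if u == v else 1 for u in sigma for v in sigma}
specific = {('$', '$'): 0,  ('$', 'a'): 2, ('$', 'b'): 10,
            ('a', '$'): 2,  ('a', 'a'): 0, ('a', 'b'): 4,
            ('b', '$'): 10, ('b', 'a'): 4, ('b', 'b'): 0}

print(edit_distance("abb", "aa", levenshtein))  # 2
print(edit_distance("abb", "aa", specific))     # 10 (a -> $, b -> a, b -> a)
```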

Page 53:

large-margin edit distance learning

GESL [Bellet et al., 2012a]

• Inspired from successful algorithms for non-structured data

• Large-margin constraints

• Convex optimization

• Requires a key simplification: fix the edit script

$$e_C(\mathbf{x}, \mathbf{x}') = \sum_{u, v \in \Sigma \cup \{\$\}} C_{uv} \cdot \#_{uv}(\mathbf{x}, \mathbf{x}')$$

• $\#_{uv}(\mathbf{x}, \mathbf{x}')$: number of times the operation $u \to v$ appears in the Levenshtein script

• $e_C$ is a linear function of the costs

41
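A minimal sketch of this linearity (my own; `levenshtein_script` is a hypothetical helper assumed to return the list of (u, v) operations of one optimal unit-cost script):

```python
# e_C is a fixed linear function of the costs once the script is frozen:
# count each operation u -> v, then take a dot product with C.
from collections import Counter

def e_C(x, y, C, levenshtein_script):
    counts = Counter(levenshtein_script(x, y))       # #_{uv}(x, x')
    return sum(C[op] * n for op, n in counts.items())
```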

Page 54:

large-margin edit distance learning

GESL [Bellet et al., 2012a]

Formulation

$$\min_{C \ge 0,\ \xi \ge 0,\ B_1 \ge 0,\ B_2 \ge 0} \ \sum_{i,j} \xi_{ij} + \beta \|C\|_F^2$$
$$\text{s.t.}\quad e_C(x_i, x_j) \ge B_1 - \xi_{ij} \quad \forall (x_i, x_j) \in \mathcal{D}$$
$$\qquad\ \ e_C(x_i, x_j) \le B_2 + \xi_{ij} \quad \forall (x_i, x_j) \in \mathcal{S}$$
$$\qquad\ \ B_1 - B_2 = \gamma$$

$\gamma$: margin parameter

• Convex, less costly, and makes use of negative (dissimilar) pairs

• Straightforward adaptation to trees and graphs

• Less general than a proper edit distance

• Chicken-and-egg situation similar to LMNN

42

Page 55:

large-margin edit distance learning

Application to word classification [Bellet et al., 2012a]

[Figure: classification accuracy (y-axis, 60-78%) vs. size of the cost training set (x-axis, 0-2000) for Oncina and Sebban, GESL, and the Levenshtein distance]

43

Page 56:

generalization guarantees

Page 57:

statistical view of supervised metric learning

• Training data $T_n = \{z_i = (x_i, y_i)\}_{i=1}^n$
  • $z_i \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$
  • $\mathcal{Y}$: discrete label set
  • Independent draws from an unknown distribution $\mu$ over $\mathcal{Z}$

• Minimize the regularized empirical risk

$$R_n(M) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(M, z_i, z_j) + \lambda\, \mathrm{reg}(M)$$

• Hope to achieve a small expected risk

$$R(M) = \mathbb{E}_{z, z' \sim \mu}\big[\ell(M, z, z')\big]$$

• Note: this can be adapted to triplets

45

Page 58:

statistical view of supervised metric learning

• Standard statistical learning theory: sum of i.i.d. terms

• Here, $R_n(M)$ is a sum of dependent terms!

• Each training point involved in several pairs

• Corresponds to practical situation

• Need specific tools to get around this problem

• Uniform stability

• Algorithmic robustness

46

Page 59:

uniform stability

Definition ([Jin et al., 2009])

A metric learning algorithm has uniform stability in $\beta/n$, where $\beta$ is a positive constant, if

$$\forall (T_n, z),\ \forall i:\quad \sup_{z_1, z_2} \big|\ell(M_{T_n}, z_1, z_2) - \ell(M_{T_n^{i,z}}, z_1, z_2)\big| \le \frac{\beta}{n}$$

• $M_{T_n}$: metric learned from $T_n$

• $T_n^{i,z}$: set obtained by replacing $z_i \in T_n$ by $z$

• If $\mathrm{reg}(M) = \|M\|_F^2$, under mild conditions on $\ell$, the algorithm has uniform stability [Jin et al., 2009]

• Applies for instance to GESL [Bellet et al., 2012a]

• Does not apply to other (sparse) regularizers

47

Page 60:

uniform stability

Generalization bound

Theorem ([Jin et al., 2009])

For any metric learning algorithm with uniform stability $\beta/n$, with probability $1 - \delta$ over the random sample $T_n$, we have:

$$R(M_{T_n}) \le R_n(M_{T_n}) + \frac{2\beta}{n} + (2\beta + B)\sqrt{\frac{\ln(2/\delta)}{2n}}$$

$B$: problem-dependent constant

• Standard bound in $O(1/\sqrt{n})$

48

Page 61:

algorithmic robustness

Definition ([Bellet and Habrard, 2015])

A metric learning algorithm is $(K, \epsilon(\cdot))$-robust for $K \in \mathbb{N}$ and $\epsilon : (\mathcal{Z} \times \mathcal{Z})^n \to \mathbb{R}$ if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^K$, such that the following holds for all $T_n$:

$$\forall (z_1, z_2) \in T_n^2,\ \forall z, z' \in \mathcal{Z},\ \forall i, j \in [K]:\quad \text{if } z_1, z \in C_i \text{ and } z_2, z' \in C_j, \text{ then } |\ell(M_{T_n}, z_1, z_2) - \ell(M_{T_n}, z, z')| \le \epsilon(T_n^2)$$

[Figure: classic robustness (one region $C_i$) vs. robustness for metric learning (pairs of regions $C_i$, $C_j$)]

49

Page 62:

algorithmic robustness

Generalization bound

Theorem ([Bellet and Habrard, 2015])

If a metric learning algorithm is $(K, \epsilon(\cdot))$-robust, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:

$$R(M_{T_n}) \le R_n(M_{T_n}) + \epsilon(T_n^2) + 2B\sqrt{\frac{2K \ln 2 + 2\ln(1/\delta)}{n}}$$

• Wide applicability
  • Mild assumptions on $\ell$
  • Any norm regularizer: Frobenius, $L_{2,1}$, trace...

• Bounds are loose
  • $\epsilon(T_n^2)$ can be made as small as needed by increasing $K$
  • But $K$ is potentially very large and hard to estimate

50