TRANSCRIPT
metric learning course
Information Retrieval course, Master DAC, UPMC (built from an ECML-PKDD 2015 tutorial by A. Bellet and M. Cord)
1. Introduction
2. Linear metric learning
3. Nonlinear extensions
4. Large-scale metric learning
5. Metric learning for structured data
6. Generalization guarantees
introduction
motivation
• Similarity / distance judgments are essential components of many human cognitive processes (see e.g., [Tversky, 1977])
• Compare perceptual or conceptual representations
• Perform recognition, categorization...
• Underlie most machine learning and data mining techniques
motivation
Nearest neighbor classification
motivation
Clustering
motivation
Information retrieval
(figure: a query document and its most similar retrieved documents)
motivation
Data visualization
(figure: t-SNE visualization of the MNIST digits 0-9, image taken from [van der Maaten and Hinton, 2008])
motivation
• Choice of similarity is crucial to the performance
• Humans weight features differently depending on context [Nosofsky, 1986, Goldstone et al., 1997]
  • Facial recognition vs. determining facial expression
• Fundamental question: how to appropriately measure similarity or distance for a given task?
• Metric learning → infer this automatically from data
• Note: we will refer to distance and similarity interchangeably as "metric"
metric learning in a nutshell
Metric Learning
metric learning in a nutshell
Basic recipe
1. Pick a parametric distance or similarity function
• Say, a distance function $D_M(x, x')$ parameterized by $M$
2. Collect similarity judgments on data pairs/triplets
• $S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$
• $D = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are dissimilar}\}$
• $R = \{(x_i, x_j, x_k) : x_i \text{ is more similar to } x_j \text{ than to } x_k\}$
3. Estimate parameters s.t. metric best agrees with judgments
• Solve an optimization problem of the form
$$\hat{M} = \arg\min_{M} \ \underbrace{\ell(M, S, D, R)}_{\text{loss function}} + \underbrace{\lambda\,\mathrm{reg}(M)}_{\text{regularization}}$$
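As a rough illustration of this recipe (a sketch, not from the course; the function name, margins and step sizes are all illustrative choices), here is how step 3 might look for a diagonal metric with a simple margin-based loss and Frobenius regularization:

```python
import numpy as np

def learn_diagonal_metric(S, D, dim, lr=0.01, lam=0.1, epochs=200):
    """Toy instance of the recipe: learn a diagonal PSD metric M = diag(w)
    by (projected) gradient descent on a simple margin-based loss."""
    w = np.ones(dim)  # start from the squared Euclidean distance
    for _ in range(epochs):
        grad = 2 * lam * w  # gradient of the Frobenius regularizer
        for x, x2 in S:     # pull similar pairs below squared distance 1
            if w @ (x - x2) ** 2 > 1:
                grad += (x - x2) ** 2
        for x, x2 in D:     # push dissimilar pairs above squared distance 2
            if w @ (x - x2) ** 2 < 2:
                grad -= (x - x2) ** 2
        w = np.maximum(w - lr * grad, 0)  # projection keeps M PSD
    return w
```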
linear metric learning
mahalanobis distance learning
• Mahalanobis distance: $D_M(x, x') = \sqrt{(x - x')^T M (x - x')}$, where $M = \mathrm{Cov}(X)^{-1}$ and $\mathrm{Cov}(X)$ is the covariance matrix estimated over a data set $X$
mahalanobis distance learning
• Mahalanobis (pseudo) distance:
$$D_M(x, x') = \sqrt{(x - x')^T M (x - x')}$$
where $M \in \mathbb{S}_+^d$ is a symmetric PSD $d \times d$ matrix
• Equivalent to Euclidean distance after linear projection $M = L^T L$:
$$D_M(x, x') = \sqrt{(x - x')^T L^T L (x - x')} = \sqrt{(Lx - Lx')^T (Lx - Lx')}$$
• If $M$ has rank $k \le d$, then $L \in \mathbb{R}^{k \times d}$ reduces the data dimension
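A quick numerical sanity check of this equivalence (a sketch with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
L = rng.normal(size=(k, d))  # projection to a k-dimensional space
M = L.T @ L                  # induced PSD matrix of rank <= k

x, x2 = rng.normal(size=d), rng.normal(size=d)
diff = x - x2
assert np.isclose(np.sqrt(diff @ M @ diff),        # Mahalanobis with M
                  np.linalg.norm(L @ x - L @ x2))  # Euclidean after projection
```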
mahalanobis distance learning
A first approach [Xing et al., 2002]
• Targeted task: clustering with side information
Formulation
$$\max_{M \in \mathbb{S}_+^d} \ \sum_{(x_i, x_j) \in D} D_M(x_i, x_j) \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in S} D_M^2(x_i, x_j) \le 1$$

• Convex in $M$ and always feasible (take $M = 0$)
• Solved with projected gradient descent
• Time complexity of the projection onto $\mathbb{S}_+^d$ is $O(d^3)$
• Only looks at sums of distances
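The projection step onto $\mathbb{S}_+^d$ mentioned above can be sketched as follows (the eigendecomposition is what makes it $O(d^3)$):

```python
import numpy as np

def project_psd(M):
    """Projection of a symmetric matrix onto the PSD cone: zero out the
    negative eigenvalues. The eigendecomposition costs O(d^3)."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return (eigvecs * np.maximum(eigvals, 0)) @ eigvecs.T
```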
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
• Targeted task: k-NN classification
• Constraints derived from labeled data
• $S = \{(x_i, x_j) : y_i = y_j,\ x_j \text{ belongs to the } k\text{-neighborhood of } x_i\}$
• $R = \{(x_i, x_j, x_k) : (x_i, x_j) \in S,\ y_i \ne y_k\}$
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
Formulation
$$\min_{M \in \mathbb{S}_+^d,\ \xi \ge 0} \ (1 - \mu) \sum_{(x_i, x_j) \in S} D_M^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.} \quad D_M^2(x_i, x_k) - D_M^2(x_i, x_j) \ge 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R$$

$\mu \in [0, 1]$: trade-off parameter
• Number of constraints on the order of $kn^2$
• Solver based on projected gradient descent with a working set
• Simple alternative: only consider the closest "impostors"
• Chicken-and-egg situation: which metric should be used to build the constraints?
• Possible overfitting in high dimensions
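For illustration, the LMNN objective for a fixed metric $M$ can be evaluated as below (a sketch, not the authors' solver; the pair set $S$ and triplet set $R$ are assumed precomputed):

```python
import numpy as np

def lmnn_objective(M, S, R, X, mu=0.5):
    """LMNN objective for a fixed PSD matrix M.
    S: list of (i, j) target-neighbor pairs; R: list of (i, j, k) triplets."""
    def d2(a, b):
        diff = X[a] - X[b]
        return diff @ M @ diff
    pull = sum(d2(i, j) for i, j in S)            # pull target neighbors close
    push = sum(max(0.0, 1 + d2(i, j) - d2(i, k))  # hinge on the unit margin
               for i, j, k in R)
    return (1 - mu) * pull + mu * push
```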
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
(figure: a test image, its nearest neighbor before training, and its nearest neighbor after training)
(figure: test error rates (%) of k-NN with the Euclidean distance, k-NN with LMNN, and multiclass SVM on MNIST, NEWS, ISOLET, BAL, FACES, IRIS and WINE)
mahalanobis distance learning
Interesting regularizers
• Add a regularization term to prevent overfitting
• Simple choice: $\|M\|_F^2 = \sum_{i,j=1}^d M_{ij}^2$ (Frobenius norm)
• Used in [Schultz and Joachims, 2003] and many others
• LogDet divergence (used in ITML [Davis et al., 2007]):
$$D_{ld}(M, M_0) = \mathrm{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - d = \sum_{i,j} \frac{\lambda_i}{\theta_j} (v_i^T u_j)^2 - \sum_i \log\left(\frac{\lambda_i}{\theta_i}\right) - d$$
where $M = V \Lambda V^T$ and $M_0 = U \Theta U^T$ (with $M_0$ PD)
  • Remain close to a good prior metric $M_0$ (e.g., identity)
  • Implicitly ensures that $M$ is PD
  • Convex in $M$ (the determinant of a PD matrix is log-concave)
  • Efficient Bregman projections in $O(d^2)$
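A minimal sketch of the LogDet divergence computation (assuming both matrices are PD):

```python
import numpy as np

def logdet_divergence(M, M0):
    """LogDet (Burg) divergence D_ld(M, M0) between PD matrices."""
    A = M @ np.linalg.inv(M0)
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0  # holds when M and M0 are PD
    return np.trace(A) - logdet - M.shape[0]
```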
mahalanobis distance learning
Interesting regularizers
• Mixed $L_{2,1}$ norm: $\|M\|_{2,1} = \sum_{i=1}^d \|M_i\|_2$
  • Tends to zero out entire columns → feature selection
  • Used in [Ying et al., 2009]
  • Convex but nonsmooth
  • Efficient proximal gradient algorithms (see e.g., [Bach et al., 2012])
• Trace (or nuclear) norm: $\|M\|_* = \sum_{i=1}^d \sigma_i(M)$
  • Favors low-rank matrices → dimensionality reduction
  • Used in [McFee and Lanckriet, 2010]
  • Convex but nonsmooth
  • Efficient Frank-Wolfe algorithms [Jaggi, 2013]
linear similarity learning
• The Mahalanobis distance satisfies the distance axioms
  • Nonnegativity, symmetry, triangle inequality
  • Natural regularization, required by some applications
• In practice, these axioms may be violated
  • By human similarity judgments (see e.g., [Tversky and Gati, 1982])
  • By some good visual recognition systems [Scheirer et al., 2014]
• Alternative: learn a bilinear similarity function $S_M(x, x') = x^T M x'$
  • See [Chechik et al., 2010, Bellet et al., 2012b, Cheng, 2013]
  • No PSD constraint on $M$ → computational benefits
  • Theory of learning with arbitrary similarity functions [Balcan and Blum, 2006]
nonlinear extensions
beyond linearity
• So far, we have essentially been learning a linear projection
• Advantages
  • Convex formulations
  • Robustness to overfitting
• Drawback
  • Inability to capture nonlinear structure

(figure: linear metric vs. kernelized metric vs. multiple local metrics)
kernelization of linear methods
Definition (Kernel function)
A symmetric function $K$ is a kernel if there exists a mapping function $\phi : \mathcal{X} \to \mathcal{H}$ from the instance space $\mathcal{X}$ to a Hilbert space $\mathcal{H}$ such that $K$ can be written as an inner product in $\mathcal{H}$:
$$K(x, x') = \langle \phi(x), \phi(x') \rangle.$$
Equivalently, $K$ is a kernel if it is positive semi-definite (PSD), i.e.,
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0$$
for all finite sequences $x_1, \ldots, x_n \in \mathcal{X}$ and $c_1, \ldots, c_n \in \mathbb{R}$.
kernelization of linear methods
Kernel trick for metric learning
• Notations
  • Kernel $K(x, x') = \langle \phi(x), \phi(x') \rangle$, training data $\{x_i\}_{i=1}^n$
  • $\phi_i \stackrel{\text{def}}{=} \phi(x_i) \in \mathbb{R}^D$, $\Phi \stackrel{\text{def}}{=} [\phi_1, \ldots, \phi_n] \in \mathbb{R}^{D \times n}$
• Mahalanobis distance in kernel space:
$$D_M^2(\phi_i, \phi_j) = (\phi_i - \phi_j)^T M (\phi_i - \phi_j) = (\phi_i - \phi_j)^T L^T L (\phi_i - \phi_j)$$
• Setting $L^T = \Phi U^T$ with $U \in \mathbb{R}^{r \times n}$, we get
$$D_M^2(\phi(x), \phi(x')) = (k - k')^T M (k - k')$$
  • $M = U^T U \in \mathbb{R}^{n \times n}$, $k = \Phi^T \phi(x) = [K(x_1, x), \ldots, K(x_n, x)]^T$
• Justified by a representer theorem [Chatpatanasiri et al., 2010]
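A small sketch of the resulting kernelized distance, assuming an RBF kernel (all function names are illustrative):

```python
import numpy as np

def rbf(x, Y, gamma=1.0):
    """Kernel values K(x, y_i) against the rows of Y."""
    return np.exp(-gamma * ((Y - x) ** 2).sum(axis=1))

def kernelized_dist2(M, X_train, x, x2, gamma=1.0):
    """Squared kernelized Mahalanobis distance (k - k')^T M (k - k'),
    where M is an n x n PSD matrix over the n training points."""
    diff = rbf(x, X_train, gamma) - rbf(x2, X_train, gamma)
    return diff @ M @ diff
```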
learning a nonlinear metric
• More flexible approach: learn a nonlinear mapping $\phi$ to optimize
$$D_\phi(x, x') = \|\phi(x) - \phi(x')\|_2$$
• Possible parameterizations for $\phi$:
• Regression trees [Kedem et al., 2012]
• Deep neural nets [Chopra et al., 2005, Hu et al., 2014]
Metric Learning in CV
• Pairwise constraints for learning
Deep Metric Learning
• Deep metric learning optimization
  – Siamese architecture [LeCun NIPS 1993]
Deep Metric Learning
• Deep metric learning optimization
(figure credit: Y. LeCun 05)
Deep Metric Learning
• Deep metric learning optimization: DrLIM scheme [Y. LeCun CVPR 05, 06]
Similar to the LMNN procedure:
Y = 0 for similar pairs
Y = 1 for dissimilar pairs
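The pairwise loss behind DrLIM can be sketched as follows (the margin value is an illustrative choice); similar pairs (Y = 0) are pulled together while dissimilar pairs (Y = 1) are pushed beyond the margin:

```python
def contrastive_loss(d, y, margin=1.0):
    """DrLIM-style pairwise loss.
    d: Euclidean distance between the two embeddings;
    y: 0 for a similar pair, 1 for a dissimilar pair."""
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2
```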
Deep Metric Learning
• Siamese network for pairwise comparison: the DDML approach
(figure credit: Hu CVPR 2014)
Deep Metric Learning
• DDML optimization [Hu CVPR 2014]:
Results on face verification problem
2 images => same face?
Labeled Faces in the Wild (LFW) -- 27 SIFT descriptors concatenated
10-fold cross-validation (600 pairs per fold)
Classical errors: about 15% better with metric learning

Method     Accuracy (in %)
ITML       76.2 ± 0.5
LDML       77.5 ± 0.5
PCCA       82.2 ± 0.4
Fantope    83.5 ± 0.5
People verification
Visualization
Deep learning for data visualization
Metric vs. feature learning [LeCun CVPR 2006]
Deep Metric Learning
=> Many different contexts provide training data
Metric learning for geo-localization: [Le Barz ICIP 2015], from the LMNN scheme
Robotics applications: [Carlevaris-Bianco IROS 2014], from the DrLIM scheme
Feature learning for image retrieval
(figure: queries with top-5 results under the learned distance vs. the Euclidean distance)
Feature learning for image retrieval
(figure: additional queries with top-5 results under the learned distance vs. the Euclidean distance)
learning multiple local metrics
• Simple linear metrics perform well locally
• Idea: different metrics for different parts of the space
• Various issues
  • How to split the space?
  • How to avoid blowing up the number of parameters to learn?
  • How to make local metrics "mutually comparable"?
  • ...
learning multiple local metrics
Multiple Metric LMNN [Weinberger and Saul, 2009]
• Group data into C clusters
• Learn a metric for each cluster in a coupled fashion
Formulation
$$\min_{M_1, \ldots, M_C,\ \xi \ge 0} \ (1 - \mu) \sum_{(x_i, x_j) \in S} D^2_{M_{C(x_j)}}(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.} \quad D^2_{M_{C(x_k)}}(x_i, x_k) - D^2_{M_{C(x_j)}}(x_i, x_j) \ge 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R$$

where $C(x)$ denotes the cluster of $x$
• Remains convex
• Computationally more expensive than standard LMNN
• Subject to overfitting
• Many parameters
• Lack of smoothness in metric change
learning multiple local metrics
Sparse Compositional Metric Learning [Shi et al., 2014]
• Learn a metric for each point in feature space
• Use the following parameterization:
$$D_w^2(x, x') = (x - x')^T \left( \sum_{k=1}^K w_k(x)\, b_k b_k^T \right) (x - x')$$
• $b_k b_k^T$: rank-1 basis (generated from training data)
• $w_k(x) = (a_k^T x + c_k)^2$: weight of basis $k$
• $A \in \mathbb{R}^{d \times K}$ and $c \in \mathbb{R}^K$: parameters to learn
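A minimal sketch of this parameterization (array shapes and names are illustrative):

```python
import numpy as np

def scml_dist2(x, x2, B, A, c):
    """Squared SCML distance with K rank-1 bases.
    B: (K, d) array whose rows are the b_k; A: (d, K) and c: (K,)
    parameterize the nonnegative local weights w_k(x)."""
    w = (A.T @ x + c) ** 2        # w_k(x) = (a_k^T x + c_k)^2
    proj = B @ (x - x2)           # b_k^T (x - x')
    return float(w @ proj ** 2)   # sum_k w_k(x) (b_k^T (x - x'))^2
```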
learning multiple local metrics
Sparse Compositional Metric Learning [Shi et al., 2014]
Formulation
$$\min_{\tilde{A} \in \mathbb{R}^{(d+1) \times K}} \ \sum_{(x_i, x_j, x_k) \in R} \left[ 1 + D_w^2(x_i, x_j) - D_w^2(x_i, x_k) \right]_+ + \lambda \|\tilde{A}\|_{2,1}$$

• $\tilde{A}$: stacking of $A$ and $c$
• $[\cdot]_+ = \max(0, \cdot)$: hinge loss
• Nonconvex problem
• Adapts to geometry of data
• More robust to overfitting
• Limited number of parameters
• Basis selection
• Metric varies smoothly over the feature space
large-scale metric learning
main challenges
• How to deal with large datasets?
  • Number of similarity judgments can grow as $O(n^2)$ or $O(n^3)$
• How to deal with high-dimensional data?
  • Cannot store a $d \times d$ matrix
  • Cannot afford computational complexity in $O(d^2)$ or $O(d^3)$
case of large n
Online learning
OASIS [Chechik et al., 2010]
• Set $M_0 = I$
• At step $t$, receive $(x_i, x_j, x_k) \in R$ and update by solving
$$M_t = \arg\min_{M, \xi} \ \frac{1}{2}\|M - M_{t-1}\|_F^2 + C\xi \quad \text{s.t.} \ 1 - S_M(x_i, x_j) + S_M(x_i, x_k) \le \xi,\ \ \xi \ge 0$$
• $S_M(x, x') = x^T M x'$, $C$: trade-off parameter
• Closed-form solution at each iteration
• Trained with 160M triplets in 3 days on 1 CPU
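The closed-form solution is a passive-aggressive step; a sketch (the value of $C$ is illustrative):

```python
import numpy as np

def oasis_step(M, xi, xj, xk, C=0.1):
    """One OASIS update on a triplet (xi, xj, xk): xi should be more
    similar to xj than to xk under S_M(x, x') = x^T M x'."""
    loss = max(0.0, 1.0 - xi @ M @ xj + xi @ M @ xk)
    if loss > 0:
        V = np.outer(xi, xj - xk)            # gradient of the hinge w.r.t. M
        tau = min(C, loss / (V ** 2).sum())  # aggressive step, capped by C
        M = M + tau * V
    return M
```

Since the update touches a single rank-1 direction per triplet, each step is cheap, which is what makes the 160M-triplet scale feasible.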
case of large n
Stochastic and distributed optimization
• Assume a metric learning problem of the form
$$\min_M \ \frac{1}{|R|} \sum_{(x_i, x_j, x_k) \in R} \ell(M, x_i, x_j, x_k)$$
• Can use Stochastic Gradient Descent
• Use a random sample (mini-batch) to estimate gradient
• Better than full gradient descent when n is large
• Can be combined with distributed optimization
• Distribute triplets on workers
• Each worker uses a mini-batch to estimate the gradient
• A coordinator averages the estimates and updates the metric
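A minimal single-machine sketch of this stochastic approach (all names and hyperparameters illustrative; each worker in the distributed variant would compute the same mini-batch gradient):

```python
import numpy as np

def sgd_metric(X, R, lr=0.01, batch=64, steps=1000, seed=0):
    """Mini-batch SGD on a triplet hinge loss over D_M^2.
    R: integer array of shape (num_triplets, 3) holding (i, j, k)."""
    rng = np.random.default_rng(seed)
    M = np.eye(X.shape[1])
    for _ in range(steps):
        G = np.zeros_like(M)
        for i, j, k in R[rng.choice(len(R), size=batch)]:
            dij, dik = X[i] - X[j], X[i] - X[k]
            if 1 + dij @ M @ dij - dik @ M @ dik > 0:  # active hinge
                G += np.outer(dij, dij) - np.outer(dik, dik)
        M -= lr * G / batch  # a PSD projection could be applied periodically
    return M
```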
case of large d
Simple workarounds
• Learn a diagonal matrix
• Used in [Xing et al., 2002, Schultz and Joachims, 2003]
• Learn d parameters
• Only a weighting of features...
• Learn metric after dimensionality reduction (e.g., PCA)
• Used in many papers
• Potential loss of information
• Learned metric difficult to interpret
case of large d
Matrix decompositions
• Low-rank decomposition $M = L^T L$ with $L \in \mathbb{R}^{r \times d}$
  • Used in [Goldberger et al., 2004]
  • Learn $r \times d$ parameters
  • Generally nonconvex, must tune $r$
• Rank-1 decomposition $M = \sum_{k=1}^K w_k b_k b_k^T$
  • Used in SCML [Shi et al., 2014]
  • Learn $K$ parameters
  • Hard to generate good bases in high dimensions
metric learning
for structured data
motivation
• Each data instance is a structured object
• Strings: words, DNA sequences
• Trees: XML documents
• Graphs: social network, molecules
ACGGCTT
• Metrics on structured data are convenient
• Act as proxy to manipulate complex objects
• Can use any metric-based algorithm
motivation
• Could represent each object by a feature vector
• Idea behind many kernels for structured data
• Could then apply standard metric learning techniques
• Potential loss of structural information
• Instead, focus on edit distances
• Directly operate on structured object
• Variants for strings, trees, graphs
• Natural parameterization by cost matrix
string edit distance
• Notations
• Alphabet $\Sigma$: finite set of symbols
• String $x$: finite sequence of symbols from $\Sigma$
• $|x|$: length of string $x$
• $: empty string / symbol

Definition (Levenshtein distance)
The Levenshtein string edit distance between x and x' is the length of the shortest sequence of operations (called an edit script) turning x into x'. Possible operations are insertion, deletion and substitution of symbols.

• Computed in $O(|x| \cdot |x'|)$ time by Dynamic Programming (DP)
string edit distance
Parameterized version
• Use a nonnegative $(|\Sigma| + 1) \times (|\Sigma| + 1)$ matrix $C$
  • $C_{ij}$: cost of substituting symbol $i$ with symbol $j$
Example 1: Levenshtein distance

C   $  a  b
$   0  1  1
a   1  0  1
b   1  1  0

=> edit distance between abb and aa is 2 (needs at least two operations)

Example 2: specific costs

C   $  a   b
$   0  2  10
a   2  0   4
b  10  4   0

=> edit distance between abb and aa is 10 (a → $, b → a, b → a)
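Both examples can be checked with a standard DP implementation of the cost-parameterized edit distance (a sketch; index 0 plays the role of $):

```python
import numpy as np

def edit_distance(x, y, C, alphabet):
    """Edit distance between x and y under a cost matrix C, where index 0
    stands for $: C[u][0] is a deletion cost, C[0][v] an insertion cost."""
    idx = {s: i + 1 for i, s in enumerate(alphabet)}
    n, m = len(x), len(y)
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + C[idx[x[i - 1]]][0]
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + C[0][idx[y[j - 1]]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(
                D[i - 1, j] + C[idx[x[i - 1]]][0],                    # deletion
                D[i, j - 1] + C[0][idx[y[j - 1]]],                    # insertion
                D[i - 1, j - 1] + C[idx[x[i - 1]]][idx[y[j - 1]]])    # substitution
    return D[n, m]

C_lev = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
C_spec = np.array([[0, 2, 10], [2, 0, 4], [10, 4, 0]])
assert edit_distance("abb", "aa", C_lev, "ab") == 2    # Example 1
assert edit_distance("abb", "aa", C_spec, "ab") == 10  # Example 2
```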
large-margin edit distance learning
GESL [Bellet et al., 2012a]
• Inspired from successful algorithms for non-structured data
• Large-margin constraints
• Convex optimization
• Requires a key simplification: fix the edit script
$$e_C(x, x') = \sum_{u,v \in \Sigma \cup \{\$\}} C_{uv} \cdot \#_{uv}(x, x')$$
• $\#_{uv}(x, x')$: number of times the operation $u \to v$ appears in the Levenshtein script
• $e_C$ is a linear function of the costs
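A sketch of how $e_C$ could be computed: backtrace one optimal Levenshtein script to obtain the counts $\#_{uv}$, then take a linear function of the costs (matches $u \to u$ are not counted here, a simplifying assumption):

```python
import numpy as np

def levenshtein_counts(x, y, alphabet):
    """#uv matrix: how often each operation u -> v occurs in one optimal
    Levenshtein script between x and y (index 0 stands for $)."""
    idx = {s: i + 1 for i, s in enumerate(alphabet)}
    n, m = len(x), len(y)
    D = np.zeros((n + 1, m + 1), dtype=int)
    D[:, 0], D[0, :] = np.arange(n + 1), np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                          D[i - 1, j - 1] + (x[i - 1] != y[j - 1]))
    counts = np.zeros((len(alphabet) + 1, len(alphabet) + 1))
    i, j = n, m
    while i > 0 or j > 0:  # backtrace one optimal script
        if i > 0 and j > 0 and D[i, j] == D[i - 1, j - 1] + (x[i - 1] != y[j - 1]):
            if x[i - 1] != y[j - 1]:
                counts[idx[x[i - 1]], idx[y[j - 1]]] += 1  # substitution u -> v
            i, j = i - 1, j - 1
        elif i > 0 and D[i, j] == D[i - 1, j] + 1:
            counts[idx[x[i - 1]], 0] += 1  # deletion u -> $
            i -= 1
        else:
            counts[0, idx[y[j - 1]]] += 1  # insertion $ -> v
            j -= 1
    return counts

def gesl_score(C, x, y, alphabet):
    """e_C(x, y) = sum_uv C_uv * #uv(x, y): linear in the costs C."""
    return float((C * levenshtein_counts(x, y, alphabet)).sum())
```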
large-margin edit distance learning
GESL [Bellet et al., 2012a]
Formulation
$$\min_{C \ge 0,\ \xi \ge 0,\ B_1 \ge 0,\ B_2 \ge 0} \ \sum_{i,j} \xi_{ij} + \lambda \|C\|_F^2$$
$$\text{s.t.} \quad e_C(x_i, x_j) \ge B_1 - \xi_{ij} \quad \forall (x_i, x_j) \in D$$
$$\qquad\ \ e_C(x_i, x_j) \le B_2 + \xi_{ij} \quad \forall (x_i, x_j) \in S$$
$$\qquad\ \ B_1 - B_2 = \gamma$$

$\gamma$: margin parameter

• Convex, less costly, and makes use of negative pairs
• Straightforward adaptation to trees and graphs
• Less general than a proper edit distance
• Chicken-and-egg situation similar to LMNN
large-margin edit distance learning
Application to word classification [Bellet et al., 2012a]
(figure: classification accuracy (%) vs. size of the cost training set, comparing GESL, Oncina and Sebban, and the Levenshtein distance)
generalization guarantees
statistical view of supervised metric learning
• Training data $T_n = \{z_i = (x_i, y_i)\}_{i=1}^n$
  • $z_i \in Z = X \times Y$, with $Y$ a discrete label set
  • independent draws from an unknown distribution $\mu$ over $Z$
• Minimize the regularized empirical risk
$$R_n(M) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(M, z_i, z_j) + \lambda\,\mathrm{reg}(M)$$
• Hope to achieve a small expected risk
$$R(M) = \mathbb{E}_{z, z' \sim \mu}\left[\ell(M, z, z')\right]$$
• Note: this can be adapted to triplets
statistical view of supervised metric learning
• Standard statistical learning theory: sum of i.i.d. terms
• Here $R_n(M)$ is a sum of dependent terms!
  • Each training point is involved in several pairs
  • Corresponds to the practical situation
• Need specific tools to get around this problem
  • Uniform stability
  • Algorithmic robustness
uniform stability
Definition ([Jin et al., 2009])
A metric learning algorithm has uniform stability $\kappa/n$, where $\kappa$ is a positive constant, if
$$\forall (T_n, z),\ \forall i, \quad \sup_{z_1, z_2} \left|\ell(M_{T_n}, z_1, z_2) - \ell(M_{T_n^{i,z}}, z_1, z_2)\right| \le \frac{\kappa}{n}$$

• $M_{T_n}$: metric learned from $T_n$
• $T_n^{i,z}$: set obtained by replacing $z_i \in T_n$ with $z$
• If $\mathrm{reg}(M) = \|M\|_F^2$, under mild conditions on $\ell$, the algorithm has uniform stability [Jin et al., 2009]
• Applies for instance to GESL [Bellet et al., 2012a]
• Does not apply to other (sparse) regularizers
uniform stability
Generalization bound
Theorem ([Jin et al., 2009])
For any metric learning algorithm with uniform stability $\kappa/n$, with probability $1 - \delta$ over the random sample $T_n$, we have:
$$R(M_{T_n}) \le R_n(M_{T_n}) + \frac{2\kappa}{n} + (2\kappa + B)\sqrt{\frac{\ln(2/\delta)}{2n}}$$
$B$: problem-dependent constant

• Standard bound in $O(1/\sqrt{n})$
algorithmic robustness
Definition ([Bellet and Habrard, 2015])
A metric learning algorithm is $(K, \epsilon(\cdot))$-robust for $K \in \mathbb{N}$ and $\epsilon : (Z \times Z)^n \to \mathbb{R}$ if $Z$ can be partitioned into $K$ disjoint sets, denoted $\{C_i\}_{i=1}^K$, such that the following holds for all $T_n$:
$$\forall (z_1, z_2) \in T_n^2,\ \forall z, z' \in Z,\ \forall i, j \in [K]: \ \text{if } z_1, z \in C_i \text{ and } z_2, z' \in C_j, \text{ then}$$
$$\left|\ell(M_{T_n}, z_1, z_2) - \ell(M_{T_n}, z, z')\right| \le \epsilon(T_n^2)$$

(figure: classic robustness vs. robustness for metric learning)
algorithmic robustness
Generalization bound
Theorem ([Bellet and Habrard, 2015])
If a metric learning algorithm is $(K, \epsilon(\cdot))$-robust, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:
$$R(M_{T_n}) \le R_n(M_{T_n}) + \epsilon(T_n^2) + 2B\sqrt{\frac{2K \ln 2 + 2\ln(1/\delta)}{n}}$$

• Wide applicability
  • Mild assumptions on $\ell$
  • Any norm regularizer: Frobenius, $L_{2,1}$, trace...
• Bounds are loose
  • $\epsilon(T_n^2)$ can be made as small as needed by increasing $K$
  • But $K$ is potentially very large and hard to estimate