TRANSCRIPT
metric learning course
Information Retrieval course, Master DAC, UPMC (built from an ECML-PKDD 2015 tutorial by A. Bellet and M. Cord)
1. Introduction
2. Linear metric learning
3. Nonlinear extensions
4. Large-scale metric learning
5. Metric learning for structured data
6. Generalization guarantees
introduction
motivation
• Similarity / distance judgments are essential components of many human cognitive processes (see e.g., [Tversky, 1977])
• Compare perceptual or conceptual representations
• Perform recognition, categorization...
• Underlie most machine learning and data mining techniques
motivation
Nearest neighbor classification
motivation
Clustering
motivation
Information retrieval
(figure: a query document and its most similar retrieved documents)
motivation
Data visualization
(figure: t-SNE visualization of the MNIST digits 0-9, image taken from [van der Maaten and Hinton, 2008])
motivation
• Choice of similarity is crucial to the performance
• Humans weight features differently depending on context [Nosofsky, 1986, Goldstone et al., 1997]
  • Facial recognition vs. determining facial expression
• Fundamental question: how to appropriately measure similarity or distance for a given task?
• Metric learning → infer this automatically from data
• Note: we will refer to distance and similarity interchangeably as "metric"
metric learning in a nutshell
Metric Learning
metric learning in a nutshell
Basic recipe
1. Pick a parametric distance or similarity function
• Say, a distance function $D_M(x, x')$ parameterized by $M$
2. Collect similarity judgments on data pairs/triplets
• $S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$
• $D = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are dissimilar}\}$
• $R = \{(x_i, x_j, x_k) : x_i \text{ is more similar to } x_j \text{ than to } x_k\}$
3. Estimate parameters s.t. metric best agrees with judgments
• Solve an optimization problem of the form
$$\hat{M} = \arg\min_{M} \ \underbrace{\ell(M, S, D, R)}_{\text{loss function}} + \underbrace{\lambda\,\mathrm{reg}(M)}_{\text{regularization}}$$
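As a rough illustration of this recipe (a sketch, not from the course; the function name, margins and step sizes are all illustrative choices), here is how step 3 might look for a diagonal metric with a simple margin-based loss and Frobenius regularization:

```python
import numpy as np

def learn_diagonal_metric(S, D, dim, lr=0.01, lam=0.1, epochs=200):
    """Toy instance of the recipe: learn a diagonal PSD metric M = diag(w)
    by (projected) gradient descent on a simple margin-based loss."""
    w = np.ones(dim)  # start from the squared Euclidean distance
    for _ in range(epochs):
        grad = 2 * lam * w  # gradient of the Frobenius regularizer
        for x, x2 in S:     # pull similar pairs below squared distance 1
            if w @ (x - x2) ** 2 > 1:
                grad += (x - x2) ** 2
        for x, x2 in D:     # push dissimilar pairs above squared distance 2
            if w @ (x - x2) ** 2 < 2:
                grad -= (x - x2) ** 2
        w = np.maximum(w - lr * grad, 0)  # projection keeps M PSD
    return w
```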
linear metric learning
mahalanobis distance learning
• Mahalanobis distance: $D_M(x, x') = \sqrt{(x - x')^T M (x - x')}$, where $M = \mathrm{Cov}(X)^{-1}$ and $\mathrm{Cov}(X)$ is the covariance matrix estimated over a data set $X$
mahalanobis distance learning
• Mahalanobis (pseudo) distance:
$$D_M(x, x') = \sqrt{(x - x')^T M (x - x')}$$
where $M \in \mathbb{S}_+^d$ is a symmetric PSD $d \times d$ matrix
• Equivalent to Euclidean distance after linear projection $M = L^T L$:
$$D_M(x, x') = \sqrt{(x - x')^T L^T L (x - x')} = \sqrt{(Lx - Lx')^T (Lx - Lx')}$$
• If $M$ has rank $k \le d$, then $L \in \mathbb{R}^{k \times d}$ reduces the data dimension
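A quick numerical sanity check of this equivalence (a sketch with arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 3
L = rng.normal(size=(k, d))  # projection to a k-dimensional space
M = L.T @ L                  # induced PSD matrix of rank <= k

x, x2 = rng.normal(size=d), rng.normal(size=d)
diff = x - x2
assert np.isclose(np.sqrt(diff @ M @ diff),        # Mahalanobis with M
                  np.linalg.norm(L @ x - L @ x2))  # Euclidean after projection
```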
mahalanobis distance learning
A first approach [Xing et al., 2002]
• Targeted task: clustering with side information
Formulation
$$\max_{M \in \mathbb{S}_+^d} \ \sum_{(x_i, x_j) \in D} D_M(x_i, x_j) \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in S} D_M^2(x_i, x_j) \le 1$$

• Convex in $M$ and always feasible (take $M = 0$)
• Solved with projected gradient descent
• Time complexity of the projection onto $\mathbb{S}_+^d$ is $O(d^3)$
• Only looks at sums of distances
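The projection step onto $\mathbb{S}_+^d$ mentioned above can be sketched as follows (the eigendecomposition is what makes it $O(d^3)$):

```python
import numpy as np

def project_psd(M):
    """Projection of a symmetric matrix onto the PSD cone: zero out the
    negative eigenvalues. The eigendecomposition costs O(d^3)."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return (eigvecs * np.maximum(eigvals, 0)) @ eigvecs.T
```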
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
• Targeted task: k-NN classification
• Constraints derived from labeled data
• $S = \{(x_i, x_j) : y_i = y_j,\ x_j \text{ belongs to the } k\text{-neighborhood of } x_i\}$
• $R = \{(x_i, x_j, x_k) : (x_i, x_j) \in S,\ y_i \ne y_k\}$
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
Formulation
$$\min_{M \in \mathbb{S}_+^d,\ \xi \ge 0} \ (1 - \mu) \sum_{(x_i, x_j) \in S} D_M^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.} \quad D_M^2(x_i, x_k) - D_M^2(x_i, x_j) \ge 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R$$

$\mu \in [0, 1]$: trade-off parameter
• Number of constraints on the order of $kn^2$
• Solver based on projected gradient descent with a working set
• Simple alternative: only consider the closest "impostors"
• Chicken-and-egg situation: which metric should be used to build the constraints?
• Possible overfitting in high dimensions
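For illustration, the LMNN objective for a fixed metric $M$ can be evaluated as below (a sketch, not the authors' solver; the pair set $S$ and triplet set $R$ are assumed precomputed):

```python
import numpy as np

def lmnn_objective(M, S, R, X, mu=0.5):
    """LMNN objective for a fixed PSD matrix M.
    S: list of (i, j) target-neighbor pairs; R: list of (i, j, k) triplets."""
    def d2(a, b):
        diff = X[a] - X[b]
        return diff @ M @ diff
    pull = sum(d2(i, j) for i, j in S)            # pull target neighbors close
    push = sum(max(0.0, 1 + d2(i, j) - d2(i, k))  # hinge on the unit margin
               for i, j, k in R)
    return (1 - mu) * pull + mu * push
```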
mahalanobis distance learning
Large Margin Nearest Neighbor [Weinberger et al., 2005]
(figure: a test image, its nearest neighbor before training, and its nearest neighbor after training)
(figure: test error rates (%) of k-NN with the Euclidean distance, k-NN with LMNN, and multiclass SVM on MNIST, NEWS, ISOLET, BAL, FACES, IRIS and WINE)
mahalanobis distance learning
Interesting regularizers
• Add a regularization term to prevent overfitting
• Simple choice: $\|M\|_F^2 = \sum_{i,j=1}^d M_{ij}^2$ (Frobenius norm)
• Used in [Schultz and Joachims, 2003] and many others
• LogDet divergence (used in ITML [Davis et al., 2007]):
$$D_{ld}(M, M_0) = \mathrm{tr}(M M_0^{-1}) - \log\det(M M_0^{-1}) - d = \sum_{i,j} \frac{\lambda_i}{\theta_j} (v_i^T u_j)^2 - \sum_i \log\left(\frac{\lambda_i}{\theta_i}\right) - d$$
where $M = V \Lambda V^T$ and $M_0 = U \Theta U^T$ (with $M_0$ PD)
  • Remain close to a good prior metric $M_0$ (e.g., identity)
  • Implicitly ensures that $M$ is PD
  • Convex in $M$ (the determinant of a PD matrix is log-concave)
  • Efficient Bregman projections in $O(d^2)$
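A minimal sketch of the LogDet divergence computation (assuming both matrices are PD):

```python
import numpy as np

def logdet_divergence(M, M0):
    """LogDet (Burg) divergence D_ld(M, M0) between PD matrices."""
    A = M @ np.linalg.inv(M0)
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0  # holds when M and M0 are PD
    return np.trace(A) - logdet - M.shape[0]
```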
mahalanobis distance learning
Interesting regularizers
• Mixed $L_{2,1}$ norm: $\|M\|_{2,1} = \sum_{i=1}^d \|M_i\|_2$
  • Tends to zero out entire columns → feature selection
  • Used in [Ying et al., 2009]
  • Convex but nonsmooth
  • Efficient proximal gradient algorithms (see e.g., [Bach et al., 2012])
• Trace (or nuclear) norm: $\|M\|_* = \sum_{i=1}^d \sigma_i(M)$
  • Favors low-rank matrices → dimensionality reduction
  • Used in [McFee and Lanckriet, 2010]
  • Convex but nonsmooth
  • Efficient Frank-Wolfe algorithms [Jaggi, 2013]
linear similarity learning
• The Mahalanobis distance satisfies the distance axioms
  • Nonnegativity, symmetry, triangle inequality
  • Natural regularization, required by some applications
• In practice, these axioms may be violated
  • By human similarity judgments (see e.g., [Tversky and Gati, 1982])
  • By some good visual recognition systems [Scheirer et al., 2014]
• Alternative: learn a bilinear similarity function $S_M(x, x') = x^T M x'$
  • See [Chechik et al., 2010, Bellet et al., 2012b, Cheng, 2013]
  • No PSD constraint on $M$ → computational benefits
  • Theory of learning with arbitrary similarity functions [Balcan and Blum, 2006]
nonlinear extensions
beyond linearity
• So far, we have essentially been learning a linear projection
• Advantages
  • Convex formulations
  • Robustness to overfitting
• Drawback
  • Inability to capture nonlinear structure

(figure: linear metric vs. kernelized metric vs. multiple local metrics)
kernelization of linear methods
Definition (Kernel function)
A symmetric function $K$ is a kernel if there exists a mapping function $\phi : \mathcal{X} \to \mathcal{H}$ from the instance space $\mathcal{X}$ to a Hilbert space $\mathcal{H}$ such that $K$ can be written as an inner product in $\mathcal{H}$:
$$K(x, x') = \langle \phi(x), \phi(x') \rangle.$$
Equivalently, $K$ is a kernel if it is positive semi-definite (PSD), i.e.,
$$\sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \ge 0$$
for all finite sequences $x_1, \ldots, x_n \in \mathcal{X}$ and $c_1, \ldots, c_n \in \mathbb{R}$.
kernelization of linear methods
Kernel trick for metric learning
• Notations
  • Kernel $K(x, x') = \langle \phi(x), \phi(x') \rangle$, training data $\{x_i\}_{i=1}^n$
  • $\phi_i \stackrel{\text{def}}{=} \phi(x_i) \in \mathbb{R}^D$, $\Phi \stackrel{\text{def}}{=} [\phi_1, \ldots, \phi_n] \in \mathbb{R}^{D \times n}$
• Mahalanobis distance in kernel space:
$$D_M^2(\phi_i, \phi_j) = (\phi_i - \phi_j)^T M (\phi_i - \phi_j) = (\phi_i - \phi_j)^T L^T L (\phi_i - \phi_j)$$
• Setting $L^T = \Phi U^T$ with $U \in \mathbb{R}^{r \times n}$, we get
$$D_M^2(\phi(x), \phi(x')) = (k - k')^T M (k - k')$$
  • $M = U^T U \in \mathbb{R}^{n \times n}$, $k = \Phi^T \phi(x) = [K(x_1, x), \ldots, K(x_n, x)]^T$
• Justified by a representer theorem [Chatpatanasiri et al., 2010]
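A small sketch of the resulting kernelized distance, assuming an RBF kernel (all function names are illustrative):

```python
import numpy as np

def rbf(x, Y, gamma=1.0):
    """Kernel values K(x, y_i) against the rows of Y."""
    return np.exp(-gamma * ((Y - x) ** 2).sum(axis=1))

def kernelized_dist2(M, X_train, x, x2, gamma=1.0):
    """Squared kernelized Mahalanobis distance (k - k')^T M (k - k'),
    where M is an n x n PSD matrix over the n training points."""
    diff = rbf(x, X_train, gamma) - rbf(x2, X_train, gamma)
    return diff @ M @ diff
```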
learning a nonlinear metric
• More flexible approach: learn a nonlinear mapping $\phi$ to optimize
$$D_\phi(x, x') = \|\phi(x) - \phi(x')\|_2$$
• Possible parameterizations for $\phi$:
• Regression trees [Kedem et al., 2012]
• Deep neural nets [Chopra et al., 2005, Hu et al., 2014]
Metric Learning in CV
• Pairwise constraints for learning
Deep Metric Learning
• Deep metric learning optimization
  – Siamese architecture [LeCun NIPS 1993]
Deep Metric Learning
• Deep metric learning optimization
(figure credit: Y. LeCun 05)
Deep Metric Learning
• Deep metric learning optimization: DrLIM scheme [Y. LeCun CVPR 05, 06]
Similar to the LMNN procedure:
Y = 0 for similar pairs
Y = 1 for dissimilar pairs
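The pairwise loss behind DrLIM can be sketched as follows (the margin value is an illustrative choice); similar pairs (Y = 0) are pulled together while dissimilar pairs (Y = 1) are pushed beyond the margin:

```python
def contrastive_loss(d, y, margin=1.0):
    """DrLIM-style pairwise loss.
    d: Euclidean distance between the two embeddings;
    y: 0 for a similar pair, 1 for a dissimilar pair."""
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, margin - d) ** 2
```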
Deep Metric Learning
• Siamese network for pairwise comparison: the DDML approach
(figure credit: Hu CVPR 2014)
Deep Metric Learning
• DDML optimization [Hu CVPR 2014]:
Results on face verification problem
2 images => same face?
Labeled Faces in the Wild (LFW) -- 27 SIFT descriptors concatenated
10-fold cross-validation (600 pairs per fold)
Classical errors: about 15% better with metric learning

Method     Accuracy (in %)
ITML       76.2 ± 0.5
LDML       77.5 ± 0.5
PCCA       82.2 ± 0.4
Fantope    83.5 ± 0.5
People verification
Visualization
Deep learning for data visualization
Metric vs. feature learning [LeCun CVPR 2006]
Deep Metric Learning
=> Many different contexts provide training data
Metric learning for geo-localization: [Le Barz ICIP 2015], from the LMNN scheme
Robotics applications: [Carlevaris-Bianco IROS 2014], from the DrLIM scheme
Feature learning for image retrieval
(figure: queries with top-5 results under the learned distance vs. the Euclidean distance)
Feature learning for image retrieval
(figure: additional queries with top-5 results under the learned distance vs. the Euclidean distance)
learning multiple local metrics
• Simple linear metrics perform well locally
• Idea: different metrics for different parts of the space
• Various issues
  • How to split the space?
  • How to avoid blowing up the number of parameters to learn?
  • How to make local metrics "mutually comparable"?
  • ...
learning multiple local metrics
Multiple Metric LMNN [Weinberger and Saul, 2009]
• Group data into C clusters
• Learn a metric for each cluster in a coupled fashion
Formulation
$$\min_{M_1, \ldots, M_C,\ \xi \ge 0} \ (1 - \mu) \sum_{(x_i, x_j) \in S} D^2_{M_{C(x_j)}}(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.} \quad D^2_{M_{C(x_k)}}(x_i, x_k) - D^2_{M_{C(x_j)}}(x_i, x_j) \ge 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R$$

where $C(x)$ denotes the cluster of $x$
• Remains convex
• Computationally more expensive than standard LMNN
• Subject to overfitting
• Many parameters
• Lack of smoothness in metric change
learning multiple local metrics
Sparse Compositional Metric Learning [Shi et al., 2014]
• Learn a metric for each point in feature space
• Use the following parameterization:
$$D_w^2(x, x') = (x - x')^T \left( \sum_{k=1}^K w_k(x)\, b_k b_k^T \right) (x - x')$$
• $b_k b_k^T$: rank-1 basis (generated from training data)
• $w_k(x) = (a_k^T x + c_k)^2$: weight of basis $k$
• $A \in \mathbb{R}^{d \times K}$ and $c \in \mathbb{R}^K$: parameters to learn
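A minimal sketch of this parameterization (array shapes and names are illustrative):

```python
import numpy as np

def scml_dist2(x, x2, B, A, c):
    """Squared SCML distance with K rank-1 bases.
    B: (K, d) array whose rows are the b_k; A: (d, K) and c: (K,)
    parameterize the nonnegative local weights w_k(x)."""
    w = (A.T @ x + c) ** 2        # w_k(x) = (a_k^T x + c_k)^2
    proj = B @ (x - x2)           # b_k^T (x - x')
    return float(w @ proj ** 2)   # sum_k w_k(x) (b_k^T (x - x'))^2
```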
learning multiple local metrics
Sparse Compositional Metric Learning [Shi et al., 2014]
Formulation
$$\min_{\tilde{A} \in \mathbb{R}^{(d+1) \times K}} \ \sum_{(x_i, x_j, x_k) \in R} \left[ 1 + D_w^2(x_i, x_j) - D_w^2(x_i, x_k) \right]_+ + \lambda \|\tilde{A}\|_{2,1}$$

• $\tilde{A}$: stacking of $A$ and $c$
• $[\cdot]_+ = \max(0, \cdot)$: hinge loss
• Nonconvex problem
• Adapts to geometry of data
• More robust to overfitting
• Limited number of parameters
• Basis selection
• Metric varies smoothly over the feature space
large-scale metric learning
main challenges
• How to deal with large datasets?
  • Number of similarity judgments can grow as $O(n^2)$ or $O(n^3)$
• How to deal with high-dimensional data?
  • Cannot store a $d \times d$ matrix
  • Cannot afford computational complexity in $O(d^2)$ or $O(d^3)$
case of large n
Online learning
OASIS [Chechik et al., 2010]
• Set $M_0 = I$
• At step $t$, receive $(x_i, x_j, x_k) \in R$ and update by solving
$$M_t = \arg\min_{M, \xi} \ \frac{1}{2}\|M - M_{t-1}\|_F^2 + C\xi \quad \text{s.t.} \ 1 - S_M(x_i, x_j) + S_M(x_i, x_k) \le \xi,\ \ \xi \ge 0$$
• $S_M(x, x') = x^T M x'$, $C$: trade-off parameter
• Closed-form solution at each iteration
• Trained with 160M triplets in 3 days on 1 CPU
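The closed-form solution is a passive-aggressive step; a sketch (the value of $C$ is illustrative):

```python
import numpy as np

def oasis_step(M, xi, xj, xk, C=0.1):
    """One OASIS update on a triplet (xi, xj, xk): xi should be more
    similar to xj than to xk under S_M(x, x') = x^T M x'."""
    loss = max(0.0, 1.0 - xi @ M @ xj + xi @ M @ xk)
    if loss > 0:
        V = np.outer(xi, xj - xk)            # gradient of the hinge w.r.t. M
        tau = min(C, loss / (V ** 2).sum())  # aggressive step, capped by C
        M = M + tau * V
    return M
```

Since the update touches a single rank-1 direction per triplet, each step is cheap, which is what makes the 160M-triplet scale feasible.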
case of large n
Stochastic and distributed optimization
• Assume a metric learning problem of the form
$$\min_M \ \frac{1}{|R|} \sum_{(x_i, x_j, x_k) \in R} \ell(M, x_i, x_j, x_k)$$
• Can use Stochastic Gradient Descent
• Use a random sample (mini-batch) to estimate gradient
• Better than full gradient descent when n is large
• Can be combined with distributed optimization
• Distribute triplets on workers
• Each worker uses a mini-batch to estimate the gradient
• A coordinator averages the estimates and updates the metric
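A minimal single-machine sketch of this stochastic approach (all names and hyperparameters illustrative; each worker in the distributed variant would compute the same mini-batch gradient):

```python
import numpy as np

def sgd_metric(X, R, lr=0.01, batch=64, steps=1000, seed=0):
    """Mini-batch SGD on a triplet hinge loss over D_M^2.
    R: integer array of shape (num_triplets, 3) holding (i, j, k)."""
    rng = np.random.default_rng(seed)
    M = np.eye(X.shape[1])
    for _ in range(steps):
        G = np.zeros_like(M)
        for i, j, k in R[rng.choice(len(R), size=batch)]:
            dij, dik = X[i] - X[j], X[i] - X[k]
            if 1 + dij @ M @ dij - dik @ M @ dik > 0:  # active hinge
                G += np.outer(dij, dij) - np.outer(dik, dik)
        M -= lr * G / batch  # a PSD projection could be applied periodically
    return M
```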
case of large d
Simple workarounds
• Learn a diagonal matrix
• Used in [Xing et al., 2002, Schultz and Joachims, 2003]
• Learn d parameters
• Only a weighting of features...
• Learn metric after dimensionality reduction (e.g., PCA)
• Used in many papers
• Potential loss of information
• Learned metric difficult to interpret
case of large d
Matrix decompositions
• Low-rank decomposition $M = L^T L$ with $L \in \mathbb{R}^{r \times d}$
  • Used in [Goldberger et al., 2004]
  • Learn $r \times d$ parameters
  • Generally nonconvex, must tune $r$
• Rank-1 decomposition $M = \sum_{k=1}^K w_k b_k b_k^T$
  • Used in SCML [Shi et al., 2014]
  • Learn $K$ parameters
  • Hard to generate good bases in high dimensions
metric learning
for structured data
motivation
• Each data instance is a structured object
• Strings: words, DNA sequences
• Trees: XML documents
• Graphs: social network, molecules
ACGGCTT
• Metrics on structured data are convenient
• Act as proxy to manipulate complex objects
• Can use any metric-based algorithm
motivation
• Could represent each object by a feature vector
• Idea behind many kernels for structured data
• Could then apply standard metric learning techniques
• Potential loss of structural information
• Instead, focus on edit distances
• Directly operate on structured object
• Variants for strings, trees, graphs
• Natural parameterization by cost matrix
string edit distance
• Notations
• Alphabet $\Sigma$: finite set of symbols
• String $x$: finite sequence of symbols from $\Sigma$
• $|x|$: length of string $x$
• $: empty string / symbol

Definition (Levenshtein distance)
The Levenshtein string edit distance between x and x' is the length of the shortest sequence of operations (called an edit script) turning x into x'. Possible operations are insertion, deletion and substitution of symbols.

• Computed in $O(|x| \cdot |x'|)$ time by Dynamic Programming (DP)
string edit distance
Parameterized version
• Use a nonnegative $(|\Sigma| + 1) \times (|\Sigma| + 1)$ matrix $C$
  • $C_{ij}$: cost of substituting symbol $i$ with symbol $j$
Example 1: Levenshtein distance

C   $  a  b
$   0  1  1
a   1  0  1
b   1  1  0

=> edit distance between abb and aa is 2 (needs at least two operations)

Example 2: specific costs

C   $  a   b
$   0  2  10
a   2  0   4
b  10  4   0

=> edit distance between abb and aa is 10 (a → $, b → a, b → a)
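Both examples can be checked with a standard DP implementation of the cost-parameterized edit distance (a sketch; index 0 plays the role of $):

```python
import numpy as np

def edit_distance(x, y, C, alphabet):
    """Edit distance between x and y under a cost matrix C, where index 0
    stands for $: C[u][0] is a deletion cost, C[0][v] an insertion cost."""
    idx = {s: i + 1 for i, s in enumerate(alphabet)}
    n, m = len(x), len(y)
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + C[idx[x[i - 1]]][0]
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + C[0][idx[y[j - 1]]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(
                D[i - 1, j] + C[idx[x[i - 1]]][0],                    # deletion
                D[i, j - 1] + C[0][idx[y[j - 1]]],                    # insertion
                D[i - 1, j - 1] + C[idx[x[i - 1]]][idx[y[j - 1]]])    # substitution
    return D[n, m]

C_lev = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
C_spec = np.array([[0, 2, 10], [2, 0, 4], [10, 4, 0]])
assert edit_distance("abb", "aa", C_lev, "ab") == 2    # Example 1
assert edit_distance("abb", "aa", C_spec, "ab") == 10  # Example 2
```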
large-margin edit distance learning
GESL [Bellet et al., 2012a]
• Inspired from successful algorithms for non-structured data
• Large-margin constraints
• Convex optimization
• Requires a key simplification: fix the edit script
$$e_C(x, x') = \sum_{u,v \in \Sigma \cup \{\$\}} C_{uv} \cdot \#_{uv}(x, x')$$
• $\#_{uv}(x, x')$: number of times the operation $u \to v$ appears in the Levenshtein script
• $e_C$ is a linear function of the costs
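A sketch of how $e_C$ could be computed: backtrace one optimal Levenshtein script to obtain the counts $\#_{uv}$, then take a linear function of the costs (matches $u \to u$ are not counted here, a simplifying assumption):

```python
import numpy as np

def levenshtein_counts(x, y, alphabet):
    """#uv matrix: how often each operation u -> v occurs in one optimal
    Levenshtein script between x and y (index 0 stands for $)."""
    idx = {s: i + 1 for i, s in enumerate(alphabet)}
    n, m = len(x), len(y)
    D = np.zeros((n + 1, m + 1), dtype=int)
    D[:, 0], D[0, :] = np.arange(n + 1), np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j] + 1, D[i, j - 1] + 1,
                          D[i - 1, j - 1] + (x[i - 1] != y[j - 1]))
    counts = np.zeros((len(alphabet) + 1, len(alphabet) + 1))
    i, j = n, m
    while i > 0 or j > 0:  # backtrace one optimal script
        if i > 0 and j > 0 and D[i, j] == D[i - 1, j - 1] + (x[i - 1] != y[j - 1]):
            if x[i - 1] != y[j - 1]:
                counts[idx[x[i - 1]], idx[y[j - 1]]] += 1  # substitution u -> v
            i, j = i - 1, j - 1
        elif i > 0 and D[i, j] == D[i - 1, j] + 1:
            counts[idx[x[i - 1]], 0] += 1  # deletion u -> $
            i -= 1
        else:
            counts[0, idx[y[j - 1]]] += 1  # insertion $ -> v
            j -= 1
    return counts

def gesl_score(C, x, y, alphabet):
    """e_C(x, y) = sum_uv C_uv * #uv(x, y): linear in the costs C."""
    return float((C * levenshtein_counts(x, y, alphabet)).sum())
```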
large-margin edit distance learning
GESL [Bellet et al., 2012a]
Formulation
$$\min_{C \ge 0,\ \xi \ge 0,\ B_1 \ge 0,\ B_2 \ge 0} \ \sum_{i,j} \xi_{ij} + \lambda \|C\|_F^2$$
$$\text{s.t.} \quad e_C(x_i, x_j) \ge B_1 - \xi_{ij} \quad \forall (x_i, x_j) \in D$$
$$\qquad\ \ e_C(x_i, x_j) \le B_2 + \xi_{ij} \quad \forall (x_i, x_j) \in S$$
$$\qquad\ \ B_1 - B_2 = \gamma$$

$\gamma$: margin parameter

• Convex, less costly, and makes use of negative pairs
• Straightforward adaptation to trees and graphs
• Less general than a proper edit distance
• Chicken-and-egg situation similar to LMNN
large-margin edit distance learning
Application to word classification [Bellet et al., 2012a]
(figure: classification accuracy (%) vs. size of the cost training set, comparing GESL, Oncina and Sebban, and the Levenshtein distance)
generalization guarantees
statistical view of supervised metric learning
• Training data $T_n = \{z_i = (x_i, y_i)\}_{i=1}^n$
  • $z_i \in Z = X \times Y$, with $Y$ a discrete label set
  • independent draws from an unknown distribution $\mu$ over $Z$
• Minimize the regularized empirical risk
$$R_n(M) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \ell(M, z_i, z_j) + \lambda\,\mathrm{reg}(M)$$
• Hope to achieve a small expected risk
$$R(M) = \mathbb{E}_{z, z' \sim \mu}\left[\ell(M, z, z')\right]$$
• Note: this can be adapted to triplets
statistical view of supervised metric learning
• Standard statistical learning theory: sum of i.i.d. terms
• Here $R_n(M)$ is a sum of dependent terms!
  • Each training point is involved in several pairs
  • Corresponds to the practical situation
• Need specific tools to get around this problem
  • Uniform stability
  • Algorithmic robustness
uniform stability
Definition ([Jin et al., 2009])
A metric learning algorithm has uniform stability $\kappa/n$, where $\kappa$ is a positive constant, if
$$\forall (T_n, z),\ \forall i, \quad \sup_{z_1, z_2} \left|\ell(M_{T_n}, z_1, z_2) - \ell(M_{T_n^{i,z}}, z_1, z_2)\right| \le \frac{\kappa}{n}$$

• $M_{T_n}$: metric learned from $T_n$
• $T_n^{i,z}$: set obtained by replacing $z_i \in T_n$ with $z$
• If $\mathrm{reg}(M) = \|M\|_F^2$, under mild conditions on $\ell$, the algorithm has uniform stability [Jin et al., 2009]
• Applies for instance to GESL [Bellet et al., 2012a]
• Does not apply to other (sparse) regularizers
uniform stability
Generalization bound
Theorem ([Jin et al., 2009])
For any metric learning algorithm with uniform stability $\kappa/n$, with probability $1 - \delta$ over the random sample $T_n$, we have:
$$R(M_{T_n}) \le R_n(M_{T_n}) + \frac{2\kappa}{n} + (2\kappa + B)\sqrt{\frac{\ln(2/\delta)}{2n}}$$
$B$: problem-dependent constant

• Standard bound in $O(1/\sqrt{n})$
algorithmic robustness
Definition ([Bellet and Habrard, 2015])
A metric learning algorithm is $(K, \epsilon(\cdot))$-robust for $K \in \mathbb{N}$ and $\epsilon : (Z \times Z)^n \to \mathbb{R}$ if $Z$ can be partitioned into $K$ disjoint sets, denoted $\{C_i\}_{i=1}^K$, such that the following holds for all $T_n$:
$$\forall (z_1, z_2) \in T_n^2,\ \forall z, z' \in Z,\ \forall i, j \in [K]: \ \text{if } z_1, z \in C_i \text{ and } z_2, z' \in C_j, \text{ then}$$
$$\left|\ell(M_{T_n}, z_1, z_2) - \ell(M_{T_n}, z, z')\right| \le \epsilon(T_n^2)$$

(figure: classic robustness vs. robustness for metric learning)
algorithmic robustness
Generalization bound
Theorem ([Bellet and Habrard, 2015])
If a metric learning algorithm is $(K, \epsilon(\cdot))$-robust, then for any $\delta > 0$, with probability at least $1 - \delta$ we have:
$$R(M_{T_n}) \le R_n(M_{T_n}) + \epsilon(T_n^2) + 2B\sqrt{\frac{2K \ln 2 + 2\ln(1/\delta)}{n}}$$

• Wide applicability
  • Mild assumptions on $\ell$
  • Any norm regularizer: Frobenius, $L_{2,1}$, trace...
• Bounds are loose
  • $\epsilon(T_n^2)$ can be made as small as needed by increasing $K$
  • But $K$ is potentially very large and hard to estimate