ms publisher template for reu posteryhuang/poster/ecml2013.pdf · title: ms publisher template for...

Reduced-Rank Local Distance Metric Learning

Y. Huang acknowledges partial support from a UCF Graduate College Presidential Fellowship and National Science Foundation (NS F) grant No. 1200566.

C. Li acknowledges partial support from NSF grants No. 0806931 and No. 0963146.

M. Georgiopoulos acknowledges partial support from NSF grants No. 1161228 and No. 0525429.

G. G. Anagnostopoulos acknowledges partial support from NSF grant No. 1263011.

MACHINE LEARNING LABORATORY @ UCF

Yinjie Huang(*), Cong Li(*), Michael Georgiopoulos(*) and Georgios C. Anagnostopoulos(**)

(*)Department of EECS, University of Central Florida, Orlando, Florida, Department of EECS, University of Central Florida, Orlando, Florida,

(**)Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, Florida

W e propose a new method for local metric learning based on a conical combination

of Mahalanobis metrics and pair-wise similarities between the data. Its formula-

tion allows for controlling the rank of the metrics' weight matrices. We also offer a con-

vergent algorithm for training the associated model. Experimental results on a collection

of classification problems imply that the new method may offer notable performance ad-

vantages over alternative metric learning approaches that have recently appeared in the

literature.

Abstract

References

Introduction

Conclusions

Many Machine Learning Problems and algorithms entail the computation of distanc-

es , for example, k-nearest neighbor and k-Means algorithm.

Euclidean distance metric, or a weighted variation, the Mahalanobis metric, are most

encountered. However, the fixed metric may not perform well for all problems.

Data-driven approaches to infer the best metric for a given dataset is called metric

learning.

Through side information (similar or dissimilar) from training data, a weight matrix of

Mahalanobis metric can be learned to move similar data points closer while mapping

dissimilar points apart. This is done so that an eventual application of a KNN decision

rule exhibits improved performance.

Many current metric learning model focus only on learning one single, global metric.

However, this global metric may not be well-suited in some settings that entail multi-

modality or non-linearity.

Problem Formulation

Algorithm

Experiments

Through learning local metrics that takes into account the location and similarity char-

acteristics of the data, we may have better KNN classification performance as illustrat-

ed in Fig 1.

Fig. 1. The dataset that illustrates the potential advantages of learning a local metric instead of a global one. (a) Original data

distribution. (b) Data distribution in the feature space obtained by learning a global metric. (c) Data distribution in the feature

space obtained by learning a local metric

Contributions

This work has been accepted as an oral presentation in the European Conference on Machine Learning and Principles and Practic e of Knowledge Discovery in Databases (ECML/PKDD 2013) in Prague, Czech Republic, Sep 23th—27th, 2013.

We propose a new local metric learning approach—Reduced-Rank Local Metric Learn-

ing ( ).

The local metric in our method is modeled as a conical combination of Mahalanobis

metrics.

The proposed model is able to control the rank of the involved linear mappings

through a sparsity-inducing matrix norm.

We supply an algorithm for training the model and prove that the set of fixed points of

the algorithm includes the Karush-Kuhn-Tucker (KKT) points.

Running on 9 benchmarks and compare the results with other recent global or

local metric learning methods, our method exhibits the best classification accuracy in 7

out of the 9 datasets.

Let for any positive integer and we have a training set

and corresponding pair-wise sample similarities arranged in a matrix as

side information.

If and are similar, then ; if other wise, then .

Mahalanobis distance between two samples and is defined

as , where is a positive semi-definite matrix

(denoted as ), which we will refer to as the weight matrix of the metric.

If , the previous metric becomes the usual Euclidean distance.

Being positive semi-definite, the weight matrix can be expressed as ,

where with . The previously defined distance can be expressed

as: .

In our work, we assume the metric involved is expressed as a conical combinations

of Mahalanobis metrics. We also define a vector for each local metric .

The element of this vector may be regarded as a measure of how important met-

ric is, when computing distances involving the training sample.

We constrain the vectors to belong to ,

where ’ ‘ denotes component-wiseordering. The fact that the ‘s need to sum up to

the all-ones vector forces at least one metric to be relevant.

The weight matrix for each pair of training samples is given as . We

could formulate the following minimization problem:

where and denotes the rank of matrix . Controls the

penalty of violating the previous desiteratum. The final term controls the rank.

The above formulation could be simplified. Let be the hinge function de-

fined as for all . Obviously, . Besides, we

could also replace the final term into the nuclear norm of , which is the sum of ‘s

singular values. The simplified formulation is:

where denotes the nuclear norm; And , is a singular value.

We use two steps algorithm to solve problem (2). Step one, the problem is minimized

over while is fixed. Step two, are considered fixed, the problem is a non-

convex problem w.r.t .

For the first step, problem (3) becomes an unconstrained minimization problem and

has the form , where is the parameter we trying to minimize over. The

hinge loss function is non-differentiable and is a non-smooth, convex regu-

larization term. We use the Proximal Subgradient Descent (PSD) to solve this problem.

For second step, we consider a matrix associated to the metric, whose element is

defined as: . Then problem (3) becomes .

Let be the vector that results from concatenating all individual vectors into

a single vector.

Define the matrix . Based on the previous definitions,

the cost function becomes and the constraint becomes:

, where and denotes the Kronecker

product and is the identity matrix.

The above cost function is non-convex since is almost always indefinite. We employ

a Majorization Minimization (MM) approach.

(1)

(3)

(6)

MM first requires identifying a function of that majorizes the objective function.

Let , where is the largest eigenvalue. Based on this, we define a

matrix is negative semi-definite. Since , we have

that for all and equality only if . So we could

use the following minimization problem to approximate the true non-convex problem:

The above problem is solvable as the nex theorem implies:

Theorem 1. Let , and . The unique minimizer of

has the form , where is the element of and is the

Lagrange multiplier vector associated to the equality constraint.

Algorithm is summarized

in Algorithm 1.

We also prove the follow-

ing theorem:

Theorem 3. Algorithm 1

yields a convergent, non-

increasing sequence of cost

function values relevant to

Problem (3). Furthermore, the set of fixed points of the iterative map embodied by Algo-

rithm 1 includes the KKT points of Problem (3).

The dataset we used are from the

UCI machine learning repository

and the Delve Dataset Collection.

We assign the testing sample ‘s

vector the value of the correspond-

ing vector associated to ‘s nearest

(in terms of Euclidean distance)

training sample.

Table 1. Details of benchmark data sets. For the Letter and Pendigits da-

tasets, only 4 and 5 classes were considered respectively.

Experiments 1: Number of local metrics

We show the performance of varies with respect to the number of local metrics K.

the results are shown in Fig. 2. We claim the number of metrics might mot necessarily be

equal to the number of classes of dataset.

Fig. 2. classification accuracy results on the 9 benchmark datasets for varying number K of local metrics. #C indicates the

number of classes of each dataset.

Experiments 2: Comparisons

We compared with other metric learning algorithms, including Euclidean met-

ric KNN, ITML [6], LMNN [5], LMNN-MM [9], GLML [11] and PLML [12].

Both ITML and LMNN learn a global metric, while LMNN-MM, GLML and PLML are

local metric learning algorithms.

After the metrics are learned, the KNN classifier is utilized for classification with k

(number of nearest neighbors) set to 5.

For each dataset, we use K’s optimal value as established in the previous series of ex-

periments, while the regularization parameter was chosen via a validation procedure

over the set {0.01,0.1,1,10,100}.

For pair-wise model comparisons, we employed McNemar’s test. Since there are 7 al-

gorithms to be compared, we use Holm’s step-down procedure as a multiple hypothe-

sis testing method to control the Family-Wise Error Rate (FWER) [22] of the resulting

pair-wise McNemar’s tests. The results are reported in Table 2. Table 2. Percent accuracy results of 7 algorithms on 9 benchmark datasets. For each dataset, the statistically best and compara-

ble results for a family-wise significance level of 0.05 are highlighted in boldface. All algorithms are ranked from best to worst;

algorithms share the same rank, if their performance is statistically comparable.

We proposed a new local metric learning model, namely Reduced-Rank Local Metric

Learning ( ).

In order to solve our proposed formulation, a two-step algorithm is showcased, which

iteratively solves two sub-problems in an alternating fashion.

We have demonstrated that our algorithm converges and that its fixed points include

the Karush-Kuhn-Tucker (KKT) points of our proposed formulation.

In the first experiment, we varied the number of local metrics K and discussed the in-

fluence of K on classification accuracy.

In the second experiment, we compared with other metric learning algorithms

and demonstrated that our proposed method is highly competitive.

[1] Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning with application to clustering with side-information. In: Neural Information Processing Systems Foundation

(NIPS), MIT Press (2002) 505-512

[2] Shalev-Shwartz, S., Singer, Y., Ng, A.Y.: Online and batch learning of pseudometrics. In: International Conference on Machine Learning (ICML), ACM (2004)

[3] Chopra, S., Hadsell, R., Lecun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Computer Vision and Pattern Recognition (CVPR), IEEE Press

(2005) 539-546

[4] Goldberger, J., Roweis, S. Hinton, G., Salakhutdinov, R.: Neighborhood components analysis. In: Neural Information Processing Systems Foundation (NIPS), MIT Press (2004) 513-520

[5] Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. In: Neural Information Processing Systems Foundation (NIPS), MIT

Press (2006)

[6] Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric learning. In: International Conference on Machine Learning (ICML), ACM (2007) 209-216

[7] Yang, L., Jin, R., Sukthankar, R., Liu, Y.: An efficient algorithm for local distance metric learning. In: AAAI conference on Artificial Intelligence (AAAI), AAAI Press (2006)

[8] Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Ingelligence 18(6) (June 1996) 607-616

[9] Weinberger, K., Saul, L.: Fast solvers and efficient implementations for distance metric learning. In: International Conference on Machine Learning (ICML), ACM (2008) 1160-1167

[10] Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: International Conference on Machine Learning (ICML), ACM(2004) 81-

88

[11] Noh, Y.K., Zhang, B.T., Lee, D.D.: Generative local metric learning for nearest neighbor classification. In: Neural Information Processing Systems Foundation (NIPS), MIT Press (2010)

[12] Wang, J., Kalousis, A., Woznica, A.: Parametric local metric learning for nearest neighbor classification. In: Neural Information Processing systems Foundation (NIPS), MIT Press(2012)

1610-1618

[13] Candes, E.J., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. CoRR abs/0903.1476 (2009)

[14] Candes, E.J., Recht, B.: Exact matrix completion via convex optimization. CoRR abs/0805.4471 (2008)

[15] Rakotomamonjy, A., Flamary, R., Gasso, G., Ganu, S.: lp-lq penalty for sparse linear and sparse multiple kernel multi-task learning. IEEE Transactions on Neural Networks 22 (2011)

1307-1320

[16] Chen, X., Pan, W., Kwok, J.t., Carbonell, J.G.: Accelerated gradient method for multi-task sparse learning problem. In: International Conference on Data Mining (ICDM), IEEE Computer

Society (2009) 746-751

[17] Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR) 10 (December 2009) 2899-2934

[18] Balaji, R., Bapat, R.: On Euclidean distance matrices. Linear Algebra and its Applications 424(1) (2007) 108-117

[19] Hunter, D.R., Lange, K.: A tutorial on mm algorithms. The American Statistician 58(1) (2004)

[20] Langford, J., Li, L., Zhang. T.: Sparse online learning via truncated gradient. Journal of Machine Learning Research (JMLR) 10 (2009) 777-801

[21] Shalev-Shwartz, S., Tewari, A.: Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research (JMLR) 12 (2011) 1865-1892

[22] Hochberg, Y., Tamhane, A.C.: Multiple comparison procedures. John Wiley & Sons, Inc., New York, NY, USA (1987)

ms publisher template for reu posteryhuang/poster/ecml2013.pdf · title: ms publisher template for...

Documents