ICML 2004, Banff, Alberta, Canada

Learning Larger Margin Machine Locally and Globally

Kaizhu Huang ([email protected])
Haiqin Yang, Irwin King, Michael R. Lyu
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong

July 5, 2004



Learning Larger Margin Machine Locally and Globally

Contributions
Background:
– Linear Binary Classification
– Motivation

Maxi-Min Margin Machine (M4)
– Model Definition
– Geometrical Interpretation
– Solving Methods
– Connections with Other Models
– Nonseparable Case
– Kernelizations

Experimental Results
Future Work
Conclusion


Contributions

Theory: A unified model of the Support Vector Machine (SVM), the Minimax Probability Machine (MPM), and Linear Discriminant Analysis (LDA).

Practice: A sequential Conic Programming problem.


Background: Linear Binary Background: Linear Binary ClassificationClassification

Given two classes of data sampled from x and y, we try to find a linear decision plane w^T z + b = 0 that correctly discriminates x from y.

w^T z + b < 0: z is classified as y;
w^T z + b > 0: z is classified as x;
w^T z + b = 0: the decision hyperplane.
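As a concrete illustration (our addition, not from the slides), the decision rule is a one-liner once some criterion has produced w and b:

```python
import numpy as np

def classify(z, w, b):
    # Assign z by the sign of w^T z + b; w and b are assumed to come
    # from whatever training criterion is chosen below.
    return "x" if np.dot(w, z) + b > 0 else "y"
```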

Since only partial information (a finite sample) is available, we need a criterion for choosing among the hyperplanes that separate the data.

[Figure: two classes of points, x and y, separated by a hyperplane w^T z + b = 0]



Background: Support Vector Machine


Support Vector Machines (SVM): the optimal hyperplane is the one that maximizes the margin between the two classes of data.


The boundary of SVM is exclusively determined by a few critical points called support vectors.

All other points are totally irrelevant to the decision plane.

SVM discards global information

[Figure: SVM decision plane w^T z + b = 0 for classes x and y, with the margin and the support vectors marked]
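A quick scikit-learn illustration (our addition, not the authors') of the point above: only the support vectors pin down the decision plane.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(f"{len(clf.support_)} of {len(X)} points determine the plane")
# Deleting any non-support point and refitting leaves w and b unchanged.
```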


Learning Locally and Globally



Along the dashed axis, the y data have a larger data trend (spread) than the x data. Therefore, a more reasonable hyperplane would lie closer to the x data, rather than locating itself in the middle of the two classes as SVM does.

[Figure: the SVM hyperplane in the middle of the two classes vs. a more reasonable hyperplane lying closer to the x data]



M4: Learning Locally and Globally
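The model displayed on this slide is lost from the transcript. As a hedged reconstruction from the accompanying ICML 2004 paper, M4 maximizes the worst-case margin measured relative to each class's covariance:

$$\max_{\rho,\,\mathbf{w}\neq\mathbf{0},\,b}\;\rho \quad \text{s.t.} \quad \frac{\mathbf{w}^{T}\mathbf{x}_{i}+b}{\sqrt{\mathbf{w}^{T}\Sigma_{x}\mathbf{w}}}\;\ge\;\rho,\; i=1,\dots,N_{x}, \qquad \frac{-(\mathbf{w}^{T}\mathbf{y}_{j}+b)}{\sqrt{\mathbf{w}^{T}\Sigma_{y}\mathbf{w}}}\;\ge\;\rho,\; j=1,\dots,N_{y},$$

where Σx and Σy are the covariance estimates of the two classes. Every training point contributes a constraint (local), while the covariances carry the global data trend (global).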


M4: Geometric Interpretation


M4: Solving Method

Divide and Conquer:

If we fix ρ to a specific value ρn, the problem reduces to checking whether ρn satisfies the following constraints:
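The displayed constraints are lost from the transcript; following the reconstructed model above, they read:

$$\mathbf{w}^{T}\mathbf{x}_{i}+b\;\ge\;\rho_{n}\sqrt{\mathbf{w}^{T}\Sigma_{x}\mathbf{w}}, \qquad -(\mathbf{w}^{T}\mathbf{y}_{j}+b)\;\ge\;\rho_{n}\sqrt{\mathbf{w}^{T}\Sigma_{y}\mathbf{w}} \quad \text{for all } i,\,j.$$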

If yes, we increase ρn; otherwise, we decrease it.

Second Order Cone Programming Problem!!!
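A minimal sketch of this fixed-ρ check in Python with cvxpy (our illustration; the authors solved M4 with SeDuMi). The max-slack normalization that rules out the trivial solution w = 0 is our own assumption, not something stated on the slides.

```python
import cvxpy as cp

def margin_slack(rho, X, Y, Sx_half, Sy_half):
    # X, Y: (N, d) arrays for the two classes; Sx_half, Sy_half: matrix
    # square roots of the class covariance estimates. rho is achievable
    # iff the optimal slack s is >= 0.
    d = X.shape[1]
    w, b, s = cp.Variable(d), cp.Variable(), cp.Variable()
    cons = [cp.norm(w, 2) <= 1]  # normalization: forbids the w = 0 escape
    # w^T x_i + b - rho * sqrt(w^T Sx w) >= s for every x_i, and
    # symmetrically for every y_j (both are SOC-representable).
    cons += [X @ w + b - rho * cp.norm(Sx_half @ w, 2) >= s]
    cons += [-(Y @ w + b) - rho * cp.norm(Sy_half @ w, 2) >= s]
    prob = cp.Problem(cp.Maximize(s), cons)
    prob.solve()
    return s.value, w.value, b.value
```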


M4: Solving Method (Cont’)

Iterate the following two Divide and Conquer steps:

Sequential Second Order Cone Programming Problem!!!
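A sketch of the outer loop, assuming a bisection line search over ρ (the slides say only "increase/decrease", so the bracketing interval and tolerance here are our choices); it reuses margin_slack from the sketch above.

```python
def solve_m4(X, Y, Sx_half, Sy_half, rho_max=10.0, tol=1e-4):
    # Bisection on rho: each feasibility check solves one SOCP.
    lo, hi, best = 0.0, rho_max, None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        s, w, b = margin_slack(mid, X, Y, Sx_half, Sy_half)
        if s is not None and s >= 0:
            lo, best = mid, (w, b)   # mid is satisfied: increase rho
        else:
            hi = mid                 # violated: decrease rho
    return lo, best
```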



M4: Solving Method (Cont’)


M4: Links with MPM

Span all the data points and add them together:
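A sketch of this step, reconstructed from the M4 constraints above: summing the x-side constraints over all Nx points and dividing by Nx gives

$$\frac{1}{N_{x}}\sum_{i=1}^{N_{x}}\left(\mathbf{w}^{T}\mathbf{x}_{i}+b\right)\;\ge\;\rho\sqrt{\mathbf{w}^{T}\Sigma_{x}\mathbf{w}} \;\;\Longrightarrow\;\; \mathbf{w}^{T}\bar{\mathbf{x}}+b\;\ge\;\rho\sqrt{\mathbf{w}^{T}\Sigma_{x}\mathbf{w}},$$

and symmetrically for the y side, so the constraints collapse to ones involving only the means and covariances.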

Exactly MPM Optimization Problem!!!


M4: Links with MPM (Cont’)

Remarks: the procedure is not reversible; MPM is a special case of M4.

MPM focuses on building the decision boundary GLOBALLY, i.e., it depends exclusively on the means and covariances. However, the means and covariances may not be accurately estimated.


M4: Links with SVM

If one assumes Σx = Σy = I, the constraints simplify: the magnitude of w can scale up without influencing the optimization.
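A hedged reconstruction of the omitted derivation: with Σx = Σy = I the constraints become

$$\mathbf{w}^{T}\mathbf{x}_{i}+b\;\ge\;\rho\,\lVert\mathbf{w}\rVert, \qquad -(\mathbf{w}^{T}\mathbf{y}_{j}+b)\;\ge\;\rho\,\lVert\mathbf{w}\rVert.$$

Since rescaling (w, b) leaves the problem unchanged, one may fix ρ‖w‖ = 1; maximizing ρ then amounts to minimizing ‖w‖ subject to w^T xi + b ≥ 1 and -(w^T yj + b) ≥ 1.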

Support Vector Machines!!!

SVM is a special case of M4.


M4: Links with SVM (Cont’)

These two assumptions of SVM are inappropriate:

Assumption 1: the two classes share a common covariance, Σx = Σy.
Assumption 2: the common covariance is the identity, Σ = I.


M4: Links with LDA

If one assumes Σx = Σy = (Σ*x + Σ*y)/2 and performs a procedure similar to that for MPM, M4 reduces to LDA.
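A hedged reconstruction of the resulting criterion: averaging the constraints as in the MPM derivation, with the common covariance Σ = (Σ*x + Σ*y)/2, gives w^T x̄ + b ≥ ρ√(w^T Σ w) and -(w^T ȳ + b) ≥ ρ√(w^T Σ w). Adding the two shows that maximizing ρ maximizes, up to a constant factor,

$$\frac{\mathbf{w}^{T}\left(\bar{\mathbf{x}}-\bar{\mathbf{y}}\right)}{\sqrt{\mathbf{w}^{T}\left(\Sigma^{*}_{x}+\Sigma^{*}_{y}\right)\mathbf{w}}},$$

which is the Fisher LDA criterion.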


M4: Links with LDA (Cont’)

Assumption: Σx = Σy = (Σ*x + Σ*y)/2. Still inappropriate?


Nonseparable Case

Introduce slack variables:
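The slide's display is lost. One standard way to soften the M4 constraints with slack variables, given here only as a hedged sketch (the exact penalty form is our assumption), is

$$\max_{\rho,\,\mathbf{w},\,b,\,\boldsymbol{\xi}\ge\mathbf{0}}\;\rho - C\sum_{k}\xi_{k} \quad \text{s.t.} \quad \mathbf{w}^{T}\mathbf{x}_{i}+b\;\ge\;\rho\sqrt{\mathbf{w}^{T}\Sigma_{x}\mathbf{w}}-\xi_{i}, \qquad -(\mathbf{w}^{T}\mathbf{y}_{j}+b)\;\ge\;\rho\sqrt{\mathbf{w}^{T}\Sigma_{y}\mathbf{w}}-\xi_{N_{x}+j},$$

with one slack per point and C trading the margin against constraint violations; for fixed ρ each subproblem remains an SOCP, hence the line search below.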

How to solve? Line search + Second Order Cone Programming.


Nonlinear Classifier: Kernelization

• Map the data to a higher dimensional feature space Rf:

xi → φ(xi), yj → φ(yj)

• Construct the linear decision plane f(γ, b) = γ^T z + b in the feature space Rf, with γ ∈ Rf, b ∈ R.
• In Rf, we need to solve the corresponding optimization problem.

• However, we do not want to solve this in an explicit form of φ. Instead, we want to solve it in a kernelized form:

K(z1, z2) = φ(z1)^T φ(z2)
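A small illustration (not from the slides) of the kernel trick: the optimization only ever needs the inner products φ(z1)^T φ(z2), so one supplies a Gram matrix instead of forming φ explicitly. Here the RBF kernel, with gamma as an assumed parameter:

```python
import numpy as np

def rbf_gram(Z1, Z2, gamma=1.0):
    # K(z1, z2) = exp(-gamma * ||z1 - z2||^2), computed via the
    # expansion ||z1||^2 + ||z2||^2 - 2 z1 . z2
    sq = (np.sum(Z1**2, axis=1)[:, None]
          + np.sum(Z2**2, axis=1)[None, :]
          - 2.0 * Z1 @ Z2.T)
    return np.exp(-gamma * sq)
```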


Nonlinear Classifier: Kernelization


Nonlinear Classifier: Kernelization

Notation


Experimental Results

Toy example: two Gaussian data sets with different data trends.


Data sets: UCI Machine Learning Repository
Procedure: 10-fold cross validation
Solving packages: SVM: LibSVM 2.4; M4: SeDuMi 1.05; MPM: MPM 1.0

In linear cases, M4 outperforms SVM and MPM.
In Gaussian kernel cases, M4 is slightly better than or comparable to SVM, because:
(1) sparsity in the feature space results in inaccurate estimation of the covariance matrices;
(2) kernelization may not preserve the topology of the original data: maximizing the margin in the feature space does not necessarily maximize the margin in the original space.



From Simon Tong et al., Restricted Bayes Optimal Classifiers, AAAI, 2000.

An example illustrating that maximizing the margin in the feature space does not necessarily maximize the margin in the original space.



Future Work

Speeding up M4
– The solution contains support vectors: can we exploit this sparsity as has been done in SVM?
– Can we reduce redundant points?

How to impose constraints on the kernelization to keep the topology of the data?

Generalization error bound? SVM and MPM both have error bounds.

How to extend to multi-category classification?


Conclusion

Proposed a new large margin classifier, M4, which learns the decision boundary both locally and globally.

Built theoretical connections with other models: a unified model of SVM, MPM, and LDA.

Developed a sequential Second Order Cone Programming algorithm for M4.

Experimental results demonstrated the advantages of our new model.


Thanks!
