Multiple Kernel Learning

Marius Kloft, Technische Universität Berlin (now: Courant Institute / Sloan-Kettering Institute)

TRANSCRIPT

Page 1: Multiple Kernel Learning

Multiple Kernel Learning
Marius Kloft, Technische Universität Berlin (now: Courant Institute / Sloan-Kettering Institute)

Page 2: Multiple Kernel Learning

Machine Learning

• Aim

▫ Learning the relation of two random quantities X and Y from observations (x1, y1), ..., (xn, yn)

• Kernel-based learning: the predictor is built from a kernel function k (see the sketch below)

• Example

▫ Object detection in images
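
For reference, this is the standard form of a kernel-based predictor, as in support vector machines (a generic sketch; the exact formula from the slide is not preserved in this transcript):

```latex
% Kernel predictor learned from observations (x_i, y_i), i = 1, ..., n,
% where k(x, x') = <psi(x), psi(x')> is the kernel:
f(x) \;=\; \sum_{i=1}^{n} \alpha_i \, k(x_i, x) \;+\; b
```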

Page 3: Multiple Kernel Learning

Multiple Views / Kernels

• Multiple views of an image, e.g.: shape, space, color

• How to combine the views? Weightings. (Lanckriet, 2004)
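
Concretely, a weighting combines the per-view Gram matrices into a single kernel, K = Σ_m θ_m K_m. A minimal numpy sketch (the views and weights below are made-up placeholders):

```python
import numpy as np

def combine_kernels(kernel_matrices, weights):
    """Weighted sum of Gram matrices: K = sum_m theta_m * K_m."""
    K = np.zeros_like(kernel_matrices[0])
    for K_m, theta_m in zip(kernel_matrices, weights):
        K += theta_m * K_m
    return K

# Example: three views (shape, space, color) of 4 data points.
rng = np.random.default_rng(0)
kernels = []
for _ in range(3):
    X = rng.normal(size=(4, 10))       # view-specific features
    kernels.append(X @ X.T)            # linear kernel per view
theta = np.array([0.5, 0.3, 0.2])      # view weights
K = combine_kernels(kernels, theta)
```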

Page 4: Multiple Kernel Learning

Computation of Weights?

• State of the art (Lanckriet et al., Bach et al., 2004)

▫ Sparse weights: kernels / views are completely discarded

▫ But why discard information?

Page 5: Multiple Kernel Learning

From Vision to Reality?

• State of the art: sparse methods

▫ empirically ineffective (Gehler et al., Noble et al., Cortes et al., Shawe-Taylor et al., etc., NIPS WS 2008, ICML, ICCV, ...)

• New methodology

▫ learning bounds with rates up to O(M/n)

▫ more efficient and more effective in applications

Page 6: Multiple Kernel Learning

Presentation of Methodology
Non-sparse Multiple Kernel Learning

Page 7: Multiple Kernel Learning

New Methodology (Kloft et al., NIPS 2009, ECML 2009/10, JMLR 2011)

• Computation of weights?

▫ Model: kernel = weighted combination of the base kernels

▫ Mathematical program: optimization over the weights; a convex problem (see the sketch below)

• Generalized formulation

▫ arbitrary loss

▫ arbitrary norms, e.g. lp-norms: ||θ||_p = (Σ_m θ_m^p)^(1/p)

▫ the 1-norm leads to sparsity
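
A common way to write the resulting convex program, in the style of the lp-norm MKL primal of Kloft et al. (JMLR 2011); the notation here is reconstructed, not copied from the slide:

```latex
% lp-norm MKL primal: jointly optimize the predictor (w, b) and the
% kernel weights theta under an lp-norm constraint.
\min_{\mathbf{w},\, b,\; \theta \ge 0,\; \|\theta\|_p \le 1} \;\;
\frac{1}{2} \sum_{m=1}^{M} \frac{\|w_m\|_2^2}{\theta_m}
\; + \; C \sum_{i=1}^{n}
\ell\!\Big( \textstyle\sum_{m=1}^{M} \langle w_m, \psi_m(x_i) \rangle + b,\; y_i \Big)

% For p = 1 the constraint induces sparse weights theta;
% for p > 1 all kernels keep non-zero weight (non-sparse MKL).
```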

Page 8: Multiple Kernel Learning

Theoretical Analysis (Kloft & Blanchard, NIPS 2011, JMLR 2012; Kloft, Rückert, Bartlett, ECML 2010)

• Theoretical foundations

▫ Active research topic (NIPS workshop 2009)

▫ We show:

Theorem (Kloft & Blanchard). The local Rademacher complexity of MKL is bounded in terms of the spectra of the kernels.

• Corollaries (learning bounds)

▫ Bound depends on the kernel spectra

▫ Improves on the best previously known rate (Cortes, Mohri, Rostamizadeh, ICML 2010)
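
For context, the quantity in the theorem is the local Rademacher complexity; its standard definition, with σ_i i.i.d. Rademacher signs (textbook material, not transcribed from the slide):

```latex
% Local Rademacher complexity of a function class F at radius r:
R_r(F) \;=\; \mathbb{E}\, \sup_{f \in F:\; P f^2 \le r} \;
\frac{1}{n} \sum_{i=1}^{n} \sigma_i \, f(x_i)

% "Local": the supremum runs only over functions of small variance
% (P f^2 <= r), which yields faster rates than the global complexity.
```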

Page 9: Multiple Kernel Learning

Proof (Sketch)

1. Relate the original class to the centered class

2. Bound the complexity of the centered class

3. Apply the Khintchine-Kahane and Rosenthal inequalities

4. Bound the complexity of the original class

5. Relate the bound to the truncation of the spectra of the kernels

Page 10: Multiple Kernel Learning

Optimization (Kloft et al., JMLR 2011)

• Algorithms

1. Newton method

2. Sequential quadratically constrained programming with level-set projections

3. Block-coordinate descent algorithm (sketch; a Python version follows below):

   Alternate:
     solve (P) w.r.t. w
     solve (P) w.r.t. θ (analytical update)
   until convergence (convergence proved)

• Implementation

▫ In C++ ("SHOGUN Toolbox"), with Matlab/Octave/Python/R support

▫ Runtime: ~1-2 orders of magnitude faster than prior MKL solvers
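
A minimal sketch of the block-coordinate descent in Python (assumptions: hinge loss via scikit-learn's SVC, and the analytical weight update theta_m ∝ ||w_m||^(2/(p+1)) of Kloft et al., JMLR 2011; this is illustrative, not the SHOGUN implementation):

```python
import numpy as np
from sklearn.svm import SVC

def lp_mkl(kernels, y, p=2.0, C=1.0, n_iter=20):
    """Alternate an SVM solve (theta fixed) with the analytical
    update of the kernel weights theta (w fixed)."""
    M = len(kernels)
    theta = np.full(M, M ** (-1.0 / p))    # feasible start: ||theta||_p = 1
    for _ in range(n_iter):
        K = sum(t * Km for t, Km in zip(theta, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        sv = svm.support_
        a = svm.dual_coef_.ravel()          # y_i * alpha_i on support vectors
        # From the dual expansion: ||w_m||^2 = theta_m^2 * a^T K_m[sv, sv] a
        w_norms = np.array([
            t * np.sqrt(max(a @ Km[np.ix_(sv, sv)] @ a, 0.0))
            for t, Km in zip(theta, kernels)
        ])
        # Analytical update, then renormalize so that ||theta||_p = 1.
        theta = w_norms ** (2.0 / (p + 1))
        theta /= np.linalg.norm(theta, ord=p) + 1e-12
    return theta, svm
```

For large p the update drives the weights toward uniformity (a plain SVM on the sum kernel); for p → 1 it concentrates weight on few kernels, matching the sparsity discussion above.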

Page 11: Multiple Kernel Learning

Toy Experiment

• Design

▫ Two 50-dimensional Gaussians; mean μ1 := binary vector; zero-mean features irrelevant

▫ 50 training examples; one linear kernel per feature

▫ Six scenarios: vary the % of irrelevant features (sparsity: 0%, 44%, 64%, 82%, 92%, 98%); variance chosen so that the Bayes error stays constant

• Results

[Figure: test error (0% to 50%) plotted against sparsity (0%, 44%, 64%, 82%, 92%, 98%) for the different methods]

▫ Choice of p crucial; optimality of p depends on the true sparsity

▫ SVM fails in the sparse scenarios

▫ 1-norm best in the sparsest scenario only

▫ p-norm MKL proves robust in all scenarios

• Bounds

▫ Can minimize the bounds w.r.t. p

▫ The bounds reflect the empirical results well
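
A sketch of the data-generating design as described on the slide (the exact mean and variance values are not given in the transcript, so the numbers below are placeholders):

```python
import numpy as np

def make_toy(n=50, d=50, n_relevant=10, sigma=3.0, seed=0):
    """Two d-dimensional Gaussians: the first n_relevant features carry
    a binary mean vector, the remaining features are zero-mean noise."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[:n_relevant] = 1.0                  # binary mean vector
    y = rng.choice([-1, 1], size=n)        # class labels
    X = y[:, None] * mu + sigma * rng.normal(size=(n, d))
    # One linear kernel per feature, as in the experiment:
    kernels = [np.outer(X[:, j], X[:, j]) for j in range(d)]
    return X, y, kernels

X, y, kernels = make_toy()
```

Sweeping n_relevant from d down to 1 reproduces the six sparsity scenarios.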

Page 12: Multiple Kernel Learning

Applications

• Lesson learned:

▫ Optimality of a method depends on the true underlying sparsity of the problem

• Applications studied (arranged by degree of non-sparsity):

▫ Computer vision: visual object recognition

▫ Bioinformatics: subcellular localization; protein fold prediction; DNA transcription start site detection; metabolic network reconstruction

Page 13: Multiple Kernel Learning

Application Domain: Computer Vision

• Visual object recognition

▫ Aim: annotation of visual media (e.g., images)

▫ Motivation: content-based image retrieval

▫ Example categories: aeroplane, bicycle, bird

Page 14: Multiple Kernel Learning

Application Domain: Computer Vision

• Visual object recognition

▫ Aim: annotation of visual media (e.g., images)

▫ Motivation: content-based image retrieval

• Multiple kernels based on

▫ color histograms

▫ shapes (gradients)

▫ local features (SIFT words)

▫ spatial features

Page 15: Multiple Kernel Learning

Application Domain: Computer Vision

• Datasets

▫ 1. VOC 2009 challenge: 7054 train / 6925 test images; 20 object categories (aeroplane, bicycle, ...)

▫ 2. ImageCLEF2010 challenge: 8000 train / 10000 test images, taken from Flickr; 93 concept categories (partylife, architecture, skateboard, ...)

• 32 kernels

▫ 4 types:

                      Bag of words    Global histogram
    Gradients         BoW-SIFT        Ho(O)G
    Color histograms  BoW-C           HoC

▫ varied over color channel combinations and spatial tilings (levels of a spatial pyramid)

▫ one-vs.-rest classifier for each visual concept
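
Kernels over histogram features of this kind are often χ² kernels in practice; a minimal scikit-learn sketch (the χ² choice is an assumption for illustration and is not stated on the slide):

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

# Toy histogram features (e.g., L1-normalized bag-of-words counts).
rng = np.random.default_rng(0)
H = rng.random((6, 300))
H /= H.sum(axis=1, keepdims=True)

# chi2_kernel computes k(x, y) = exp(-gamma * sum_j (x_j - y_j)^2 / (x_j + y_j)).
K = chi2_kernel(H, gamma=1.0)
```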

Page 16: Multiple Kernel Learning

Application Domain: Computer Vision

• Preliminary results (Binder, Kloft, et al., 2010, 2011):

               SVM      MKL
    VOC2009    55.85    56.76
    CLEF2010   36.45    37.02

▫ using BoW-S only gives worse results → BoW-S alone is not sufficient

• Challenge results

▫ Employed our approach in the ImageCLEF2011 Photo Annotation challenge

▫ achieved the winning entries in 3 categories

Page 17: Multiple Kernel Learning

Application Domain: Computer Vision

• Why can MKL help?

▫ some images are better captured by certain kernels

▫ different images may have different kernels that capture them well

• Experiment (a sketch of the measurement follows below)

▫ Disagreement of single-kernel classifiers

▫ BoW-S kernels induce more or less the same predictions
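
One simple way to quantify that disagreement is the pairwise fraction of test points on which two single-kernel classifiers differ; a sketch (the prediction arrays are made-up placeholders):

```python
import numpy as np

def disagreement_matrix(predictions):
    """D[i, j] = fraction of test points where classifiers i and j
    predict different labels."""
    P = np.asarray(predictions)        # shape: (n_classifiers, n_test)
    M = P.shape[0]
    D = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            D[i, j] = np.mean(P[i] != P[j])
    return D

# Three made-up single-kernel classifiers on 8 test points:
preds = [[ 1, 1, -1, 1, -1, -1,  1, 1],
         [ 1, 1, -1, 1, -1,  1,  1, 1],
         [-1, 1,  1, 1, -1, -1, -1, 1]]
print(disagreement_matrix(preds))
```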

Page 18: Multiple Kernel Learning

Application Domain: Genetics

• Detection of transcription start sites

• by means of kernels based on:

▫ sequence alignments

▫ distribution of nucleotides (downstream, upstream)

▫ folding properties (binding energies and angles)

• Empirical analysis (Kloft et al., NIPS 2009, JMLR 2011)

▫ detection accuracy (AUC): higher accuracies than sparse MKL and ARTS (small n)

▫ ARTS = winner of an international comparison of 19 models (Abeel et al., 2009)
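
Sequence kernels in this setting are often spectrum (k-mer count) kernels; a minimal sketch (the slide only names the kernel families, so this particular kernel is an illustrative stand-in):

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """k-mer spectrum kernel: inner product of k-mer count vectors."""
    count_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    count_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Counter returns 0 for missing k-mers, so summing over count_s suffices.
    return sum(count_s[kmer] * count_t[kmer] for kmer in count_s)

print(spectrum_kernel("ACGTACGT", "ACGTTTAC", k=3))   # -> 5
```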

Page 19: Multiple Kernel Learning

Application Domain: Genetics

• Theoretical analysis

▫ impact of the lp-norm on the bound

▫ confirms the experimental results:

  stronger theoretical guarantees for the proposed approach (p > 1)

  the empirically and the theoretically optimal p approximately coincide

• Empirical analysis (Kloft et al., NIPS 2009, JMLR 2011)

▫ detection accuracy (AUC): higher accuracies than sparse MKL and ARTS (small n)

▫ ARTS = winner of an international comparison of 19 models (Abeel et al., 2009)

Page 20: Multiple Kernel Learning

Application Domain: Pharmacology

• Protein fold prediction

▫ Prediction of the fold class of a protein

  Fold class is related to the protein's function (e.g., important for drug design)

▫ Data set and kernels from Y. Ying

  27 fold classes; fixed train and test sets

  12 biologically inspired kernels (e.g. hydrophobicity, polarity, van der Waals volume)

• Results

▫ Accuracy: 1-norm MKL and SVM on par

▫ p-norm MKL performs best: 6% higher accuracy than the baselines

Page 21: Multiple Kernel Learning

Further Applications

• Arranged by degree of non-sparsity:

▫ Computer vision: visual object recognition

▫ Bioinformatics: subcellular localization; protein fold prediction; DNA transcription and splice site detection; metabolic network reconstruction

Page 22: Multiple Kernel Learning

Conclusion: Non-sparse Multiple Kernel Learning

• Training with > 100,000 data points and > 1,000 kernels

• Sharp learning bounds

• Applications

▫ Visual object recognition: winner of the ImageCLEF 2011 Photo Annotation challenge

▫ Computational biology: gene detector competitive with the winner of an international comparison

Page 23: Multiple Kernel Learning

Thank you for your attention

I will be happy to answer any additional questions

Page 24: Multiple Kernel Learning

References

▫ Abeel, Van de Peer, Saeys (2009).  Toward a gold standard for promoter prediction evaluation. Bioinformatics.

▫ Bach (2008).  Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research (JMLR).

▫ Kloft, Brefeld, Laskov, and Sonnenburg (2008). Non-sparse Multiple Kernel Learning. NIPS Workshop on Kernel Learning.

▫ Kloft, Brefeld, Sonnenburg, Laskov, Müller, and Zien (2009). Efficient and Accurate Lp-norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2009).

▫ Kloft, Rückert, and Bartlett (2010). A Unifying View of Multiple Kernel Learning. ECML.

▫ Kloft, Blanchard (2011). The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2011).

▫ Kloft, Brefeld, Sonnenburg, and Zien (2011). Lp-Norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 12(Mar):953-997.

▫ Kloft and Blanchard (2012). On the Convergence Rate of Lp-norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 13(Aug):2465-2502.

▫ Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan (2004). Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research (JMLR).