Learning the Kernel: Theory and Applications

Yiming Ying

Department of Engineering Mathematics – University of Bristol

February 2009

Outline

1 Motivation for Learning the Kernel

2 Regularization Framework for Kernel Learning

3 Statistical Generalization Analysis

4 Data Integration via KL-divergence

5 Experiments

6 Conclusion and Outlook

Yeast Protein Function Prediction (CYGD): Multi-label Outputs

Protein functional classes: metabolism, membrane, protein synthesis, etc.

Functional characteristics: protein-protein interactions, mRNA expression patterns, amino acid sequences, etc.

Protein Fold Recognition (SCOP): Multi-class Outputs

Protein fold: major secondary structures and the same topology.

Fold characteristics: physicochemical and structural properties of amino acids, etc.

Gene Network Inference: Graph-structured Outputs

Protein-protein interaction or metabolic network (yeast).

Gene expression measurements; phylogenetic profiles; location of proteins/enzymes in the cell.

[Figure from J.P. Vert]

Practical Problems and Goals

Goal I: best prediction from a single data source.

Goal II: integration of multiple data sources.

Challenging issues: structured outputs (protein function prediction, network inference), a very large number of classes (protein fold recognition), and multiple data sources.

SVM: linear maximum margin

Hyperplanes: $\langle w, x\rangle + b = \pm 1$

Decision plane: $\langle w, x\rangle + b = 0$

Margin: $\frac{2}{\|w\|}$

Primal problem:
\[
\min_{w,b}\ \|w\|^2
\quad \text{s.t.}\quad \langle w, x_i\rangle + b \ge 1 \text{ if } y_i = 1,\qquad \langle w, x_i\rangle + b \le -1 \text{ if } y_i = -1.
\]

Dual problem:
\[
\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle
\quad \text{s.t.}\quad \sum_{i=1}^n \alpha_i y_i = 0,\ \alpha_i \ge 0\ \forall i.
\]
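For concreteness, a minimal sketch (not from the slides) of the linear maximum-margin picture using scikit-learn; the toy data, the large value of C used to approximate the hard-margin primal, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC  # sketch only; assumes scikit-learn is available

# Toy, linearly separable data (illustrative, not from the slides).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

# A large C approximates the hard-margin primal: min ||w||^2 s.t. y_i(<w,x_i> + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()        # normal vector of the decision plane <w,x> + b = 0
b = clf.intercept_[0]
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("alpha_i * y_i on the support vectors:", clf.dual_coef_.ravel())
```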

SVM: Dual transition to nonlinear case

Feature map: $\phi: X \to F$, $x \mapsto \phi(x)$.

Dual formulation of the SVM (a QP):
\[
\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j \langle \phi(x_i), \phi(x_j)\rangle
\quad \text{s.t.}\quad \sum_{i=1}^n \alpha_i y_i = 0,\ 0 \le \alpha_i\ \forall i.
\]

SVM: Regularization

Kernel: $K(x,t) = \langle \phi(x), \phi(t)\rangle$, symmetric, continuous, positive semi-definite (p.s.d.).

Gram matrix in the dual formulation of the SVM: $\bigl(K(x_i, x_j)\bigr)_{i,j=1}^n$ is p.s.d.

Reproducing kernel Hilbert space $\mathcal{H}_K$.

SVM with regularization:
\[
\min_{f\in\mathcal{H}_K,\ b\in\mathbb{R},\ \xi}\ \|f\|_K^2 + C\sum_{i=1}^n \xi_i
\quad \text{s.t.}\quad y_i\bigl(f(x_i)+b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0\ \ \forall\, i = 1,2,\dots,n.
\]
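Both the dual and the regularized problem touch the data only through the Gram matrix, which is what makes kernel substitution possible. A minimal sketch, assuming scikit-learn is available; the Gaussian kernel, toy data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # sketch; assumes scikit-learn is available

def gaussian_gram(X, Z, sigma=1.0):
    """Gram matrix of K(x, t) = exp(-sigma * ||x - t||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sigma * d2)

rng = np.random.RandomState(0)
X_train = rng.randn(40, 3)
y_train = np.array([1] * 20 + [-1] * 20)
X_test = rng.randn(10, 3)

K_train = gaussian_gram(X_train, X_train)            # (K(x_i, x_j))_{i,j=1}^n
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

# Prediction likewise needs only kernel evaluations between test and training points.
K_test = gaussian_gram(X_test, X_train)
print(clf.predict(K_test))
```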

Examples of Kernels

Linear kernel: $K(x,t) = x^\top t$.

Generalized Gaussian kernel: $K(x,t) = e^{-(x-t)^\top A (x-t)}$ with $A$ p.s.d.

Diffusion kernel on a graph: $e^{\beta\Delta}$, with $\Delta$ the graph Laplacian.

Sequence kernels, e.g. over strings such as CHART, CAT, CART.
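A hedged sketch of how such kernels can be computed in practice; the function names are illustrative, the k-mer counter is a simple spectrum-style stand-in for the sequence kernel hinted at by CHART/CAT/CART, and sign conventions for the graph Laplacian in the diffusion kernel vary.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential, used for the diffusion kernel

def linear_kernel(x, t):
    return x @ t

def generalized_gaussian_kernel(x, t, A):
    """K(x, t) = exp(-(x - t)^T A (x - t)) with A positive semi-definite."""
    d = x - t
    return np.exp(-d @ A @ d)

def diffusion_kernel(laplacian, beta=1.0):
    """Diffusion kernel exp(beta * Delta) for a graph Laplacian Delta
    (passed in with whatever sign convention you use)."""
    return expm(beta * laplacian)

def kmer_kernel(s, t, k=2):
    """Spectrum-style sequence kernel: counts shared length-k substrings."""
    def kmers(seq):
        counts = {}
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] = counts.get(seq[i:i + k], 0) + 1
        return counts
    cs, ct = kmers(s), kmers(t)
    return sum(c * ct.get(m, 0) for m, c in cs.items())

print(kmer_kernel("CHART", "CART"))  # shared 2-mers are "AR" and "RT", so this prints 2
```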

Problem I: Classical Model Selection

Problem I

For example, how can we automatically tune the hyper-parameter $\sigma$ of the Gaussian kernel $K_\sigma(x,t) = e^{-\sigma\|x-t\|^2}$?

The non-automatic way: cross-validation on the training data.
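A minimal sketch of the non-automatic route, grid search with cross-validation over $\sigma$, assuming scikit-learn (whose rbf kernel $e^{-\gamma\|x-t\|^2}$ uses gamma in the role of $\sigma$); the data and grid values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.array([1] * 50 + [-1] * 50)

# scikit-learn's "rbf" kernel is exp(-gamma * ||x - t||^2): gamma plays the role of sigma.
grid = {"gamma": np.logspace(-3, 2, 6), "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print("selected (sigma, C):", search.best_params_)
```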

Problem II: Data Integration

Problem II

How can we integrate different biological data sources?

Regularization Framework for Kernel Learning

Maximizing the margin over all kernel spaces [Micchelli and Pontil, 2005; Wu, Ying and Zhou, 2007]:
\[
\min_{K\in\mathcal{K}}\ \min_{f\in\mathcal{H}_K}\ \|f\|_K^2 + C\sum_{i=1}^n \bigl(1 - y_i f(x_i)\bigr)_+
\]

General dual problem:
\[
\min_{K\in\mathcal{K}}\ \max_{\alpha}\ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C\ \forall i.
\]

Specific Examples of Kernel Learning

Hyper-parameter tuning [Chapelle & Vapnik et al., 2002]:

Candidate kernels: $\mathcal{K} = \{e^{-\sigma\|x-t\|^2} : \sigma \in (0,\infty)\}$.

Dual problem:
\[
\min_{\sigma>0}\ \max_{\alpha}\ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j e^{-\sigma\|x_i - x_j\|^2}
\quad \text{s.t.}\quad 0 \le \alpha_i \le C\ \forall i.
\]

Optimization: gradient descent.
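A naive sketch of gradient descent on the min-max problem above; this illustrates the displayed formulation rather than Chapelle and Vapnik's exact procedure, which optimizes an error bound. For fixed $\sigma$ the inner maximization is a standard SVM dual, and by Danskin's theorem the derivative of its optimal value with respect to $\sigma$ can be taken at the inner maximizer.

```python
import numpy as np
from sklearn.svm import SVC  # the inner max over alpha is solved by a standard SVM

def sq_dists(X):
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

def tune_sigma(X, y, sigma=1.0, C=1.0, lr=0.1, n_steps=50):
    """Naive projected gradient descent on sigma for
    min_sigma max_alpha sum_i alpha_i - sum_ij alpha_i alpha_j y_i y_j exp(-sigma * d_ij)."""
    D = sq_dists(X)
    n = X.shape[0]
    for _ in range(n_steps):
        K = np.exp(-sigma * D)
        svc = SVC(kernel="precomputed", C=C).fit(K, y)
        ay = np.zeros(n)                      # alpha_i * y_i, zero off the support set
        ay[svc.support_] = svc.dual_coef_.ravel()
        # Danskin: d/dsigma of the inner optimum = sum_ij (a_i y_i)(a_j y_j) d_ij exp(-sigma d_ij)
        grad = ay @ (D * K) @ ay
        sigma = max(sigma - lr * grad, 1e-6)  # keep sigma > 0
    return sigma
```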

Specific Examples of Kernel Learning (cont.)

Linear combination of candidate kernels [Lanckriet et al., 2004]:

Candidate kernels: $\mathcal{K} = \{\sum_{\ell=1}^m \lambda_\ell K_\ell : \sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0\}$.

Dual problem (mini-max):
\[
\min_{\lambda}\ \max_{\alpha}\ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j \sum_{\ell} \lambda_\ell K_\ell(x_i^\ell, x_j^\ell)
\quad \text{s.t.}\quad \sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0,\ \ 0 \le \alpha_i \le C\ \forall i.
\]
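A hedged sketch of one alternating step for this mini-max problem, using an illustrative projected-gradient update rather than the SDP/QCQP algorithms of Lanckriet et al.: for fixed $\lambda$ the inner problem is an ordinary SVM on the combined Gram matrix, and the per-kernel quantities $\sum_{i,j}\alpha_i\alpha_j y_i y_j K_\ell(x_i,x_j)$ drive the update of $\lambda$.

```python
import numpy as np
from sklearn.svm import SVC

def project_simplex(v):
    """Euclidean projection onto {lambda : sum_l lambda_l = 1, lambda_l >= 0}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def mkl_step(Ks, y, lam, C=1.0, lr=0.1):
    """One alternating step: SVM on sum_l lam_l K_l, then projected gradient on lambda."""
    n = len(y)
    K = sum(l * Kl for l, Kl in zip(lam, Ks))
    svc = SVC(kernel="precomputed", C=C).fit(K, y)
    ay = np.zeros(n)
    ay[svc.support_] = svc.dual_coef_.ravel()        # alpha_i * y_i
    # d(dual value)/d lambda_l = - sum_ij (a_i y_i)(a_j y_j) K_l(x_i, x_j)   (Danskin)
    grad = np.array([-ay @ Kl @ ay for Kl in Ks])
    return project_simplex(lam - lr * grad)
```

Iterating mkl_step until $\lambda$ stabilizes gives a simple alternating scheme in the spirit of the gradient-based MKL solvers mentioned later (e.g. SimpleMKL), though those use more careful step-size rules.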

Specific Examples of Kernel Learning (cont.)

Equivalent to a nonparametric group lasso [Micchelli and Pontil, 2005; Bach, 2008]:
\[
\min_{f_\ell \in \mathcal{H}_{K_\ell},\ \forall \ell}\ C\sum_{i=1}^n \Bigl(1 - y_i \sum_{\ell} f_\ell(x_i^\ell)\Bigr)_+ + \Bigl(\sum_{\ell} \|f_\ell\|_{K_\ell}\Bigr)^2
\]

Statistical Generalization Analysis

Challenging problem: how do we characterize the complexity and the risk of overfitting?

Hypothesis space $\mathcal{H} = \bigl\{\mathcal{H}_K : K \in \mathcal{K}\bigr\}$: the larger $\mathcal{H}$ is, the more complex the model and the greater the risk of overfitting.

Statistical Generalization Analysis: Empirical process

Solution space: $f_{\mathbf{z}} \in \mathcal{B} := \{\,\|f\|_K \le 1 : f \in \mathcal{H}_K,\ K \in \mathcal{K}\,\}$.

Uniform Glivenko-Cantelli (uGC) class (uniform convergence): for any $\varepsilon > 0$, there holds
\[
\sup_{P}\ \Pr\Bigl\{\ \sup_{m \ge \ell}\ \sup_{f\in\mathcal{B}}\ \Bigl|\frac{1}{m}\sum_{i=1}^m f(x_i) - \int_X f(x)\,dP(x)\Bigr| > \varepsilon \Bigr\} \to 0, \quad \text{as } \ell \to \infty.
\]

Characterization by the candidate kernels [Ying and Zhou, 2004]: $\mathcal{B}$ is uGC if and only if the $V_\gamma$-dimension of $\mathcal{K}_X$ is finite for every $\gamma > 0$, where $\mathcal{K}_X := \bigl\{K(\cdot, x) : K \in \mathcal{K},\ x \in X\bigr\}$.

Statistical Generalization Analysis: Shattering dimension

$V_\gamma$ dimension [Alon et al., 1997; Anthony and Bartlett, 1999]

A set $A \subset X$ is $V_\gamma$-shattered by $\mathcal{G}$ if there is a number $b \in \mathbb{R}$ such that, for every subset $E$ of $A$, there exists $f_E \in \mathcal{G}$ with $f_E(x) + b \ge \gamma$ for every $x \in E$ and $f_E(x) + b \le -\gamma$ for every $x \in A \setminus E$. The $V_\gamma$ dimension of $\mathcal{G}$ is the maximal cardinality of a set $A \subset X$ that is $V_\gamma$-shattered by $\mathcal{G}$.

[Figure: $V_\gamma$-shattering by two-dimensional linear functions]

Statistical Generalization Analysis: Generalization bounds

Rough summary of the generalization bound [Ying and Zhou, 2004; Ying and Campbell, 2008]:
\[
\text{True Err} \;\le\; \text{Train Err} + \Bigl(\frac{d_{\mathcal{K}}}{\text{training sample size} \times \text{margin}^2}\Bigr)^{\frac{1}{2}},
\]
where $d_{\mathcal{K}}$ is the pseudo-dimension of $\mathcal{K}$ (that is, $\lim_{\gamma\to 0^+} V_\gamma$).

Examples

Single kernel: $d_{\mathcal{K}} = 1 \Rightarrow$ the VC theory for the SVM.

$\mathcal{K} = \{e^{-\sigma\|x-t\|^2} : \sigma > 0\} \Rightarrow d_{\mathcal{K}} \le 2$.
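A purely illustrative plug-in of this bound; the sample size and margin below are hypothetical numbers, not taken from the slides. With the Gaussian family above ($d_{\mathcal{K}} \le 2$), $n = 1000$ training points and a margin of $0.5$,
\[
\Bigl(\frac{d_{\mathcal{K}}}{n \times \text{margin}^2}\Bigr)^{\frac{1}{2}} \le \Bigl(\frac{2}{1000 \times 0.25}\Bigr)^{\frac{1}{2}} = \sqrt{0.008} \approx 0.09,
\]
so the capacity term contributes roughly $0.09$ to the bound and shrinks at the usual $O(1/\sqrt{n})$ rate as the sample size grows.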

Related Work on Learning the Combination of Kernels

Lanckriet et al. (2004), for binary SVM classification: SDP, QCQP.

Bach et al. (2004): SMO.
Sonnenburg et al. (2006): SILP.
Mini-max problem: various other approaches from the optimization literature.

Probabilistic Bayesian models:

Girolami and Rogers (2005): hierarchical Bayesian model.
Girolami and Zhong (2007): Gaussian process prior.
Damoulas and Girolami (2008): multinomial probit model (VBKC).
Damoulas, Ying, Girolami and Campbell (2008): RVM.

Kernel Learning via KL-divergence

Learning the kernel matrix by minimizing the KL-divergence [Lawrence and Sanguinetti, 2004]:
\[
\mathrm{KL}\bigl(\mathcal{N}(0,\mathbf{K}_y)\,\|\,\mathcal{N}(0,\mathbf{K}_x)\bigr)
:= \tfrac{1}{2}\mathrm{Tr}\bigl(\mathbf{K}_y\mathbf{K}_x^{-1}\bigr) + \tfrac{1}{2}\log|\mathbf{K}_x| - \tfrac{1}{2}\log|\mathbf{K}_y|,
\]
up to an additive constant ($-n/2$) that does not depend on $\mathbf{K}_x$.
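A direct numpy transcription of this expression (a sketch: it assumes $\mathbf{K}_x$ is invertible and, like the slide, omits the additive constant).

```python
import numpy as np

def kl_zero_mean_gaussians(Ky, Kx):
    """KL( N(0, Ky) || N(0, Kx) ), up to an additive constant independent of Kx."""
    Kx_inv = np.linalg.inv(Kx)
    _, logdet_Kx = np.linalg.slogdet(Kx)
    _, logdet_Ky = np.linalg.slogdet(Ky)
    return 0.5 * np.trace(Ky @ Kx_inv) + 0.5 * logdet_Kx - 0.5 * logdet_Ky
```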

Kernel Learning via KL-divergence (cont.)

Input kernel matrix $\mathbf{K}_x = \sum_\ell \lambda_\ell \mathbf{K}_\ell$, with the $\ell$-th data source encoded into $\mathbf{K}_\ell$.

Output kernel matrix $\mathbf{K}_y$ derived from the label information.

Data integration via KL-divergence [Ying, Huang and Campbell, 2009]:
\[
\begin{aligned}
\min_{\lambda}\quad & \mathrm{Tr}\Bigl(\mathbf{K}_y\bigl(\textstyle\sum_{\ell\in\mathbb{N}_m}\lambda_\ell \mathbf{K}_\ell + \sigma I_n\bigr)^{-1}\Bigr) + \log\Bigl|\textstyle\sum_{\ell\in\mathbb{N}_m}\lambda_\ell \mathbf{K}_\ell + \sigma I_n\Bigr| \\
\text{s.t.}\quad & \textstyle\sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0,
\end{aligned}
\]
where $\sigma > 0$ is a small number added to avoid singularity.

Why KL-divergence

The group lasso works effectively to remove noisy features and leads to a sparse solution; however, this sparsity assumption can be inappropriate in practice.

The criterion is easily adapted to different learning scenarios via the output kernel matrix $\mathbf{K}_y$ (a short sketch of both constructions follows below):

Multi-label outputs $\mathbf{y}$ (an $n \times T$ matrix): $\mathbf{K}_y := \mathbf{y}\mathbf{y}^\top$.

Structured outputs (e.g. network inference): the diffusion kernel $\mathbf{K}_y := e^{\beta\Delta}$, with $\Delta$ the graph Laplacian.

The KL-divergence based criterion is parameter-free.
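A short sketch of the two output-kernel constructions mentioned above; the function names are illustrative, and the Laplacian sign convention used for the diffusion kernel is only one common choice.

```python
import numpy as np
from scipy.linalg import expm

def multilabel_output_kernel(Y):
    """K_y = Y Y^T for an n x T multi-label indicator matrix Y."""
    Y = np.asarray(Y, dtype=float)
    return Y @ Y.T

def diffusion_output_kernel(adjacency, beta=1.0):
    """K_y = exp(beta * Delta) for graph-structured outputs, with Delta a graph
    Laplacian (here Delta = A - D, the negative semi-definite convention)."""
    A = np.asarray(adjacency, dtype=float)
    D = np.diag(A.sum(axis=1))
    return expm(beta * (A - D))
```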

Optimization Formulation

Let
\[
f(\lambda) := \mathrm{Tr}\Bigl(\mathbf{K}_y\bigl(\textstyle\sum_{\ell\in\mathbb{N}_m}\lambda_\ell \mathbf{K}_\ell + \sigma I_n\bigr)^{-1}\Bigr)
\quad\text{and}\quad
g(\lambda) := -\log\Bigl|\textstyle\sum_{\ell\in\mathbb{N}_m}\lambda_\ell \mathbf{K}_\ell + \sigma I_n\Bigr|.
\]

Theorem. KL-divergence based kernel learning is a convex-concave (difference of convex) problem:
\[
\min_{\sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0}\ f(\lambda) - g(\lambda).
\]

Difference of convex (DC) Programming

DC programming (the concave-convex procedure) [Hoang, 1995; Yuille et al., 2003]:

Given a stopping criterion $\varepsilon > 0$.

Initialize $\lambda$, e.g. $\lambda_\ell = \frac{1}{m}$.

Iteration:
\[
\lambda^{(t+1)} = \arg\min\Bigl\{\, f(\lambda) - g(\lambda^{(t)}) - \nabla g(\lambda^{(t)})\bigl(\lambda - \lambda^{(t)}\bigr) \;:\; \textstyle\sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0 \Bigr\}.
\]

Stop when $\bigl(f(\lambda^{(t)}) - g(\lambda^{(t)})\bigr) - \bigl(f(\lambda^{(t+1)}) - g(\lambda^{(t+1)})\bigr) \le \varepsilon$.
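A minimal numpy/scipy sketch of this DC loop, assuming the input Gram matrices $\mathbf{K}_\ell$ and the output kernel $\mathbf{K}_y$ are given as numpy arrays. The convex subproblem is handed to a generic SLSQP solver here rather than the SILP/QCQP reformulation on the next slide, so this illustrates the iteration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def combined(lam, Ks, sigma):
    """M(lambda) = sum_l lambda_l K_l + sigma * I."""
    return sum(l * K for l, K in zip(lam, Ks)) + sigma * np.eye(Ks[0].shape[0])

def f(lam, Ky, Ks, sigma):
    return np.trace(Ky @ np.linalg.inv(combined(lam, Ks, sigma)))

def grad_g(lam, Ks, sigma):
    # g(lambda) = -log|M(lambda)|, so dg/dlambda_l = -Tr(M^{-1} K_l)
    Minv = np.linalg.inv(combined(lam, Ks, sigma))
    return np.array([-np.trace(Minv @ K) for K in Ks])

def dc_step(lam_t, Ky, Ks, sigma):
    """lambda^{t+1} = argmin f(lambda) - <grad g(lambda^t), lambda> over the simplex."""
    gg = grad_g(lam_t, Ks, sigma)
    obj = lambda l: f(l, Ky, Ks, sigma) - gg @ l     # constants of the linearization dropped
    cons = ({"type": "eq", "fun": lambda l: l.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * len(Ks)
    return minimize(obj, lam_t, method="SLSQP", bounds=bounds, constraints=cons).x

def kl_div_kernel_learning(Ky, Ks, sigma=1e-3, eps=1e-5, max_iter=100):
    lam = np.full(len(Ks), 1.0 / len(Ks))            # lambda_l = 1/m
    F = lambda l: f(l, Ky, Ks, sigma) + np.linalg.slogdet(combined(l, Ks, sigma))[1]
    for _ in range(max_iter):
        new = dc_step(lam, Ky, Ks, sigma)
        if F(lam) - F(new) <= eps:                   # the slide's stopping rule
            return new
        lam = new
    return lam
```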

Subproblem in Each Iteration

Consider
\[
\lambda^{(t+1)} = \arg\min\Bigl\{\, f(\lambda) - g(\lambda^{(t)}) - \nabla g(\lambda^{(t)})\bigl(\lambda - \lambda^{(t)}\bigr) \;:\; \textstyle\sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0 \Bigr\}.
\]

SILP or QCQP reformulation:
\[
\begin{aligned}
\arg\max_{\lambda,\gamma}\quad & \gamma \\
\text{s.t.}\quad & \textstyle\sum_\ell \lambda_\ell = 1,\ \lambda_\ell \ge 0, \\
& \gamma - \textstyle\sum_\ell \lambda_\ell \Bigl(\mathrm{Tr}\bigl(\alpha\alpha^\top \mathbf{K}_\ell\bigr) + \tfrac{\partial g(\lambda^{(t)})}{\partial \lambda_\ell}\Bigr) \le \mathrm{Tr}\bigl(-2\alpha^\top A + \sigma\,\alpha^\top\alpha\bigr) \quad \forall\alpha,
\end{aligned}
\]
where $\mathbf{K}_y = AA^\top$ with an $n \times s$ matrix $A$ and $s = \mathrm{rank}(\mathbf{K}_y)$.

Protein Fold Recognition [Ding et al. 2001; Shen et al. 2006; Damoulas and Girolami, 2008]

27 SCOP folds (classes), 311 proteins for training and 381 proteins for testing, with 12 different data sources:

Amino-acid composition (C): second-order polynomial kernel
Predicted secondary structure (S): second-order polynomial kernel
Hydrophobicity (H): second-order polynomial kernel
Polarity (P): second-order polynomial kernel
van der Waals volume (V): second-order polynomial kernel
Polarizability (Z): second-order polynomial kernel
Four pseudo-amino-acid compositions (L1, L4, L14, L30): second-order polynomial kernels
Two local sequence alignments using Smith-Waterman scores (SW1, SW2): pairwise kernels

Comparative Kernel Learning Algorithms

MKLdiv [our method]: kernel weights $\lambda$ learned by KL-divergence, then fed into a one-against-all multi-class SVM.

SimpleMKL [Rakotomamonjy et al., 2007]: SVM-based.

MKL-RKDA [Ye et al., 2008]: kernel learning based on discriminant analysis.

VBKC [Damoulas and Girolami, 2008]: Bayesian model based on multinomial probit regression.

Overall Features Performance

Feature set   MKLdiv   SimpleMKL   VBKC         MKL-RKDA   Ding et al.
C             51.69    51.83       51.2 ± 0.5   45.43      44.9
S             40.99    40.73       38.1 ± 0.3   38.64      35.6
H             36.55    36.55       32.5 ± 0.4   34.20      36.5
P             35.50    35.50       32.2 ± 0.3   30.54      32.9
V             37.07    37.85       32.8 ± 0.3   30.54      35
Z             37.33    36.81       33.2 ± 0.4   30.28      32.9
L1            44.64    45.16       41.5 ± 0.5   36.55      -
L4            44.90    44.90       41.5 ± 0.4   38.12      -
L14           43.34    43.34       38 ± 0.2     40.99      -
L30           31.59    31.59       32 ± 0.2     36.03      -
SW1           62.92    62.40       59.8 ± 1.9   61.87      -
SW2           63.96    63.44       49 ± 0.7     64.49      -
Overall       73.36    66.57       70           68.40      56
Average       68.40    68.14       -            66.06      -

Kernel weights of Overall Features

Performance of Sequentially Added Features

Features                       MKLdiv   SimpleMKL   MKL-RKDA
SW1                            62.92    62.40       61.87
SW1S                           65.27    64.22       64.75
SW1SW2S                        67.10    64.75       64.49
SW1SW2CS                       73.36    65.01       67.62
SW1SW2CSH                      74.67    66.31       67.88
SW1SW2CSHP                     74.93    66.31       69.71
SW1SW2CSHPZ                    75.19    68.92       66.05
SW1SW2CSHPZV                   74.41    66.31       69.19
SW1SW2CSHPZVL1                 73.10    66.84       68.66
SW1SW2CSHPZVL1L4               72.84    67.10       67.62
SW1SW2CSHPZVL1L4L14            72.58    66.84       69.19
SW1SW2CSHPZVL1L4L14L30         73.36    66.57       68.40

Kernel Weights of Dominant Features

Yeast Protein Functional Prediction (Lanckriet et al. 2004)

Data sources and kernels:

SW: protein sequences with Smith-Waterman Score

B: protein sequences with BLAST

Pfam: protein sequences with Pfam HMM

FFT: hydropathy profile with FFT

LI: protein interactions with linear kernel

Diff: protein interactions with diffusion kernel

E: gene expression with radial basis kernel

MKLdiv performance on membrane protein prediction (binary)

Conclusion and Outlook

We saw:

Kernel learning is well motivated.

Sound theoretical results: regularization theory and statistical learning theory.

Novel data integration via KL-divergence, implemented by DC programming.

State-of-the-art performance on protein fold recognition.

Future work:

More investigation into structured-output datasets.

Systematic study of the relation between the KL-divergence based criterion.

Relevant Work

Yiming Ying, Kaizhu Huang, and Colin Campbell. Enhanced protein fold recognition through a novel data integration approach. Preprint, 2009.

Yiming Ying and Colin Campbell. The Rademacher chaos complexity of learning the kernel. Submitted, 2008.

Theodoros Damoulas, Yiming Ying, Mark A. Girolami, and Colin Campbell. Inferring sparse kernel combinations and relevance vectors: an application to subcellular localization of proteins. In Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA 2008).

Yiming Ying and Ding-Xuan Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8: 249-276, 2007.

Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Multi-kernel regularized classifiers. Journal of Complexity, 23: 108-134, 2007.

Available at http://www.cs.ucl.ac.uk/staff/Y.Ying

Thank you for your attention!
