
Page 1

Optimization for Machine Learning
SMO-MKL and Smoothing Strategies

S.V.N. (vishy) Vishwanathan
Purdue University, [email protected]

June 2, 2011

Pages 2–5

Motivation

Binary Classification

[Figure: linearly separable data with labels y_i = −1 and y_i = +1, the decision hyperplane {x | ⟨w, x⟩ + b = 0}, and the margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = +1}; x_1 and x_2 lie on the two margin hyperplanes.]

The margin width follows from

⟨w, x_1⟩ + b = +1
⟨w, x_2⟩ + b = −1
⟨w, x_1 − x_2⟩ = 2
⟨w/‖w‖, x_1 − x_2⟩ = 2/‖w‖

Page 6

Motivation

Linear Support Vector Machines

Optimization Problem

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0

Pages 7–9

Motivation

The Kernel Trick

[Figure: a dataset that is not linearly separable in the (x, y) plane; after adding the feature x² + y², the two classes become separable by a hyperplane.]

Page 10

Motivation

Kernel Trick

Optimization Problem

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0

Pages 11–12

Motivation

Kernel Trick

Optimization Problem (dual)

max_α  −(1/2) αᵀHα + 1ᵀα
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0

H_ij = y_i y_j ⟨φ(x_i), φ(x_j)⟩ = y_i y_j k(x_i, x_j)
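As a concrete illustration (ours, not from the slides), a minimal NumPy sketch of these dual ingredients, assuming a Gaussian RBF base kernel; the function names are our own:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def dual_hessian(K, y):
    # H[i, j] = y_i * y_j * k(x_i, x_j)
    return (y[:, None] * y[None, :]) * K

def dual_objective(alpha, H):
    # -(1/2) alpha^T H alpha + 1^T alpha
    return -0.5 * (alpha @ H @ alpha) + alpha.sum()
```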

Pages 13–14

Motivation

Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer
- Cook up as many (base) kernels as you can
- Compute a data-dependent kernel function as a linear combination of base kernels:

k(x, x′) = ∑_k d_k k_k(x, x′)  s.t.  d_k ≥ 0
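On precomputed base Gram matrices, the combination is a one-liner; a sketch of ours (`combined_gram` and `base_grams` are assumed names):

```python
import numpy as np

def combined_gram(base_grams, d):
    # k(x, x') = sum_k d_k k_k(x, x') with d_k >= 0,
    # expressed on precomputed base Gram matrices K_k.
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError("kernel weights d_k must be nonnegative")
    return sum(dk * Kk for dk, Kk in zip(d, base_grams))
```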

Page 15

Motivation

Object Detection

Localize a specified object of interest, if it exists, in a given image.

Page 16

Motivation

Some Examples of MKL Detection

[Figure: example images showing objects localized by the MKL detector.]

Page 17

Motivation

Summary of Our Results

Sonar dataset with 800 kernels:

         Training Time (s)      # Kernels Selected
  p      SMO-MKL   Shogun       SMO-MKL   Shogun
  1.1    4.71      47.43        91.20     258.00
  1.33   3.21      19.94        248.20    374.20
  2.0    3.39      34.67        661.20    664.80

Web dataset (≈50,000 points, 50 kernels): ≈30 minutes

Sonar with a hundred thousand kernels:
- Precomputed kernels: ≈8 minutes
- Kernels computed on-the-fly: ≈30 minutes

Pages 18–19

Motivation

Setting up the Optimization Problem (I)

The Setup
- We are given n kernel functions k_1, …, k_n with corresponding feature maps φ_1(·), …, φ_n(·)
- We are interested in deriving the stacked feature map

φ(x) = (√d_1 φ_1(x), …, √d_n φ_n(x))  ⟹  w = (w_1, …, w_n)

Page 20

Motivation

Setting up the Optimization Problem

Optimization Problem

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0

Page 21

Motivation

Setting up the Optimization Problem

Optimization Problem

min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0
      d_k ≥ 0  for all k

Page 22

Motivation

Setting up the Optimization Problem

Optimization Problem

min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  y_i(∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0
      d_k ≥ 0  for all k

Page 23

Motivation

Setting up the Optimization Problem

Optimization Problem (substituting w_k ← √d_k w_k moves the weights into the regularizer)

min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖²/d_k + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  y_i(∑_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0
      d_k ≥ 0  for all k

Page 24

Motivation

Setting up the Optimization Problem

Optimization Problem

min_d max_α  −(1/2) ∑_k d_k αᵀH_kα + 1ᵀα + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0
      d_k ≥ 0

Page 25

Motivation

Saddle Point Problem

[Figure: a 3-D saddle surface illustrating the min–max structure of the problem.]

Page 26

Motivation

Solving the Saddle Point

Saddle Point Problem

min_d max_α  −(1/2) ∑_k d_k αᵀH_kα + 1ᵀα + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0
      d_k ≥ 0

Page 27

Our Approach

The Key Insight

Eliminate d: the minimization over d ≥ 0 can be carried out in closed form, leaving the dual

max_α  D(α) := −(1/(8ρ)) (∑_k (αᵀH_kα)^q)^{2/q} + 1ᵀα
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0

where 1/p + 1/q = 1.

Not a QP, but very close to one!
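A sketch of ours (not the talk's released code) that evaluates D(α) and recovers one consistent choice of kernel weights from the closed-form minimization over d via Hölder's inequality; `grams` holds the matrices H_k, and the weight formula assumes p > 1 and every αᵀH_kα > 0:

```python
import numpy as np

def D(alpha, grams, rho, p):
    # D(alpha) = -(1/(8 rho)) * (sum_k (alpha^T H_k alpha)^q)^(2/q) + 1^T alpha
    q = p / (p - 1.0)  # conjugate exponent, 1/p + 1/q = 1
    s = np.array([alpha @ H @ alpha for H in grams])
    return -((s ** q).sum()) ** (2.0 / q) / (8.0 * rho) + alpha.sum()

def kernel_weights(alpha, grams, rho, p):
    # Minimizer over d >= 0 of the page-24 objective:
    # d_k = s_k^(q-1) / (2 rho ||s||_q^(q-2)), with s_k = alpha^T H_k alpha.
    # For p = q = 2 this reduces to d_k = s_k / (2 rho).
    q = p / (p - 1.0)
    s = np.array([alpha @ H @ alpha for H in grams])
    norm_q = ((s ** q).sum()) ** (1.0 / q)
    return s ** (q - 1.0) / (2.0 * rho * norm_q ** (q - 2.0))
```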

Page 28

Our Approach

SMO-MKL: High-Level Overview

max_α  D(α) := −(1/(8ρ)) (∑_k (αᵀH_kα)^q)^{2/q} + 1ᵀα
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0

Algorithm
- Choose two variables α_i and α_j to optimize
- Solve the one-dimensional reduced optimization problem
- Repeat until convergence
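Schematically, the outer loop looks as follows (a sketch of ours, not the released implementation); `select_pair` and `solve_reduced` are placeholders for the working-set selection and the one-dimensional solve described on the next slide:

```python
def smo_outer_loop(alpha, select_pair, solve_reduced, max_iter=100000, tol=1e-6):
    # Generic SMO skeleton: pick two dual variables, solve the reduced
    # one-dimensional problem along the direction that preserves
    # sum_i alpha_i y_i = 0, and repeat until the predicted gain is tiny.
    for _ in range(max_iter):
        i, j, predicted_gain = select_pair(alpha)
        if predicted_gain < tol:
            break
        alpha = solve_reduced(alpha, i, j)
    return alpha
```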

Pages 29–30

Our Approach

SMO-MKL: High-Level Overview

Selecting the Working Set
- Compute the directional derivative and directional Hessian
- Greedily select the variables

Solving the Reduced Problem
- Analytic solution for p = q = 2 (a one-dimensional quartic)
- For other values of p, use Newton–Raphson
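For the Newton–Raphson case, a safeguarded one-dimensional sketch (our illustration): given the first and second derivative of the reduced objective along the chosen direction, it takes Newton steps and falls back to bisection whenever a step leaves the feasible bracket.

```python
def newton_1d(grad, hess, lo, hi, tol=1e-10, max_iter=50):
    # Solve grad(t) = 0 for t in [lo, hi], assuming the reduced objective
    # is concave so that grad is decreasing on the bracket.
    t = 0.5 * (lo + hi)
    for _ in range(max_iter):
        g = grad(t)
        if abs(g) < tol:
            break
        if g > 0:
            lo = t  # maximizer lies to the right
        else:
            hi = t  # maximizer lies to the left
        h = hess(t)
        t_newton = t - g / h if h != 0 else None
        t = t_newton if t_newton is not None and lo < t_newton < hi else 0.5 * (lo + hi)
    return t
```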

Page 31

Experiments

Generalization Performance

[Bar chart, Australian dataset: test accuracy (%) on the y-axis (80–90) for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}; bars compare SMO-MKL and Shogun.]

Page 32

Experiments

Generalization Performance

[Bar chart, Ionosphere dataset: test accuracy (%) on the y-axis (88–94) for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}; bars compare SMO-MKL and Shogun.]

Pages 33–34

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of training examples (≈10^3.5 to 10^4.5); SMO-MKL compared with Shogun. The observed scaling of SMO-MKL is roughly O(n^1.9).]

Page 35

Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of training examples.]

Pages 36–37

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of kernels (10^4 to 10^5); SMO-MKL scales roughly linearly in the number of kernels (the O(n) reference line).]

Page 38

Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of training examples; SMO-MKL tracks an O(n) reference.]

Pages 39–40

Smoothing Strategy

The Story so Far

Kernels
- Kernel algorithms are very successful in tackling small to medium sized problems
- Small number of dimensions
- The number of training examples = number of dual variables

However…
- Datasets are growing in size
- In some applications, such as Natural Language Processing, the training examples have a large number of dimensions
- Kernels do not help!

Page 41

Smoothing Strategy

Summary of Our Results

Webspam trigram: 300,000 / 50,000 examples, 16,609,143 dimensions
- Train a linear SVM in 10 minutes
- 80 seconds on 8 cores

Epsilon: 400,000 / 100,000 examples, 2,000 dimensions (dense)
- Train a linear SVM in 4 minutes
- ROCArea in 7 minutes

KDD Cup 2010: 19,264,097 / 748,401 examples, 29,890,095 dimensions
- Train a linear SVM in 5 minutes
- ROCArea in 1 hour

Page 42

Smoothing Strategy

Regularized Risk Minimization

min_w  J(w) := (λ/2)‖w‖² + (1/m) ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩)

The first term is the regularizer; the second is the empirical risk Remp.
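For reference, the objective is straightforward to evaluate; a NumPy sketch of ours:

```python
import numpy as np

def hinge_risk(w, X, y):
    # Remp(w) = (1/m) sum_i max(0, 1 - y_i <w, x_i>)
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def J(w, X, y, lam):
    # J(w) = (lam/2) ||w||^2 + Remp(w)
    return 0.5 * lam * (w @ w) + hinge_risk(w, X, y)
```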

Page 43

Smoothing Strategy

Non-Smooth Loss

[Plot: the hinge loss max(0, 1 − y⟨w, x⟩) as a function of y⟨w, x⟩; piecewise linear with a kink at y⟨w, x⟩ = 1.]

Page 44

Smoothing Strategy

Key Issues

- The objective function is non-smooth
- State-of-the-art algorithms provably require Ω(1/(λε)) iterations
- Some algorithms do not work well in a distributed setting
- Can we do better?

Page 45

Smoothing Strategy

Key Idea [Nesterov]

[Plot: the hinge loss together with a smooth approximation, as functions of y⟨w, x⟩.]

We want to do this smoothing in a principled way.

Pages 46–48

Smoothing Strategy

Our Approach

Regularized Risk Minimization

min_w  J(w) := (λ/2)‖w‖² + Remp(w),   Remp(w) = (1/m) ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩)

Replace Remp with a smooth surrogate:

min_w  J(w) := (λ/2)‖w‖² + g*_µ(Aᵀw)

Approximating Remp
- g*_µ(Aᵀw) is smooth with a Lipschitz continuous gradient
- g*_µ(Aᵀw) uniformly approximates Remp:  |Remp(w) − g*_µ(Aᵀw)| ≤ µ  for all w

Page 49

Smoothing Strategy

Convergence Theorem [Nesterov]

Theorem
Suppose one can find g*_µ(Aᵀw) which uniformly approximates Remp. Then there exists an optimization procedure which converges to an ε-accurate solution of the regularized risk minimization problem in

O*( √D ‖A‖ min{ 1/ε, 1/√(λε) } )    (1)

iterations. Here D is a geometric constant that depends only on g*_µ (and is independent of ε and λ).

Pages 50–51

Smoothing Strategy

Designing g*_µ

High-Level Recipe
- Choose g and A such that Remp = g*(Aᵀw)
- Choose a strongly convex function d (modulus of strong convexity σ, with d(·) ≤ D)
- Set g*_µ = (g + µd)*

Pitfalls
- √D ‖A‖ needs to be independent of m, the training set size
- The gradient of g*_µ must be easy to compute

Page 52

Smoothing Strategy

Smoothing the Binary Hinge Loss

Implementing the Recipe

g(α) = −(1/m) ∑_i α_i   if α_i ∈ [0, 1] for all i,   ∞ otherwise

Choose A = −(1/m) [y_1 x_1, …, y_m x_m]  (the i-th column of A is −y_i x_i / m)

Verify that Remp = (1/m) ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩) = g*(Aᵀw)
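The identity can be checked numerically; a sketch of ours that evaluates g*(Aᵀw) directly from its closed form g*(u) = ∑_i max(0, u_i + 1/m):

```python
import numpy as np

def g_star_of_ATw(w, X, y):
    # With A = -(1/m)[y_1 x_1, ..., y_m x_m]: (A^T w)_i = -(y_i <w, x_i>)/m,
    # and g*(u) = sum_i max(0, u_i + 1/m), which equals Remp(w).
    m = len(y)
    u = -(y * (X @ w)) / m
    return np.sum(np.maximum(0.0, u + 1.0 / m))

# Example check against the direct hinge-loss computation:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.where(rng.normal(size=50) > 0, 1.0, -1.0)
w = rng.normal(size=5)
remp = np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))
assert np.isclose(g_star_of_ATw(w, X, y), remp)
```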

Page 53

Smoothing Strategy

Smoothing the Binary Hinge Loss

Two variants: Square norm

Let d(α) = (1/2m) ∑_i α_i². Then

g*_µ(Aᵀw) = (1/m) ∑_i h(y_i⟨w, x_i⟩)   where

h(u) = 0               if u > 1
       (1 − u)²/(2µ)   if u ∈ [1 − µ, 1]
       1 − u − µ/2     if u < 1 − µ
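The three branches translate directly into code; a sketch of ours:

```python
import numpy as np

def smoothed_hinge(u, mu):
    # Square-norm smoothing of the hinge loss:
    #   0                   for u > 1
    #   (1 - u)^2 / (2 mu)  for 1 - mu <= u <= 1
    #   1 - u - mu/2        for u < 1 - mu
    u = np.asarray(u, dtype=float)
    return np.where(u > 1.0, 0.0,
           np.where(u >= 1.0 - mu,
                    (1.0 - u) ** 2 / (2.0 * mu),
                    1.0 - u - mu / 2.0))
```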

Page 54

Smoothing Strategy

Visualizing the Loss

[Plot: the hinge loss and its square-norm smoothing as functions of y⟨w, x⟩; the smoothed loss is quadratic on [1 − µ, 1] and differs from the hinge loss by at most µ/2 elsewhere.]

Pages 55–56

Smoothing Strategy

Smoothing the Binary Hinge Loss

Two variants: Entropy

Let d(α) = (1/m) ∑_i [α_i log α_i + (1 − α_i) log(1 − α_i) + log 2]. Then

g*_µ(Aᵀw) = (1/m) ∑_i l(y_i⟨w, x_i⟩)   where

l(u) = µ ( log(1 + exp((1 − u)/µ)) − log 2 )

or, equivalently,

l(u) = µ log( exp(−1/µ) + exp(−u/µ) ) + 1 − µ log 2
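The same for the entropy variant, written with `logaddexp` so the exponential never overflows; a sketch of ours:

```python
import numpy as np

def entropy_smoothed_hinge(u, mu):
    # l(u) = mu * (log(1 + exp((1 - u)/mu)) - log 2),
    # using logaddexp(0, z) = log(1 + exp(z)) for numerical stability.
    u = np.asarray(u, dtype=float)
    return mu * (np.logaddexp(0.0, (1.0 - u) / mu) - np.log(2.0))
```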

Page 57

Smoothing Strategy

Visualizing the Loss

[Plot: the hinge loss and its entropy smoothing as functions of y⟨w, x⟩; the entropy-smoothed loss is everywhere smooth and approaches the hinge loss as µ → 0.]
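Putting the pieces together: a minimal accelerated-gradient sketch for minimizing J(w) = (λ/2)‖w‖² + (1/m) ∑_i h(y_i⟨w, x_i⟩) with the square-norm smoothed hinge. This is our own illustration, not the talk's implementation; in particular the step size uses the bound L ≤ λ + ‖X‖₂²/(µm), which is an assumption of the sketch.

```python
import numpy as np

def smoothed_risk_grad(w, X, y, mu):
    # h'(u) = 0 (u > 1), -(1 - u)/mu (1 - mu <= u <= 1), -1 (u < 1 - mu)
    m = len(y)
    u = y * (X @ w)
    hp = np.where(u > 1.0, 0.0,
         np.where(u >= 1.0 - mu, -(1.0 - u) / mu, -1.0))
    return X.T @ (hp * y) / m

def accelerated_gradient(X, y, lam, mu, steps=500):
    # Nesterov's accelerated gradient on the smoothed objective.
    m, n = X.shape
    L = lam + np.linalg.norm(X, 2) ** 2 / (mu * m)  # Lipschitz bound (assumed)
    w = np.zeros(n)
    v = np.zeros(n)
    for t in range(steps):
        g = lam * v + smoothed_risk_grad(v, X, y, mu)
        w_next = v - g / L
        v = w_next + (t / (t + 3.0)) * (w_next - w)  # standard momentum weight
        w = w_next
    return w
```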

Page 58

Smoothing Strategy

Other Extensions

Multivariate Performance Scores
- Multiclass
- ROCArea
- Precision-Recall Break-Even Point (PRBEP)

Pages 59–61

Experiments

Webspam (Trigram)

300,000 / 50,000 examples, 16,609,143 dimensions

[Plot, λ = 10⁻⁴: objective function versus CPU time in seconds.]
[Plot, λ = 10⁻⁴: test accuracy versus CPU time in seconds.]
[Plot, λ = 10⁻⁸: test accuracy versus CPU time in seconds.]

Pages 62–63

Experiments

Webspam (Trigram), distributed across 8 processors

[Plot: objective function versus CPU time in seconds.]
[Plot: test accuracy versus CPU time in seconds.]

Pages 64–65

Experiments

Epsilon

400,000 / 100,000 examples, 2,000 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus Liblinear.]
[Plot: test accuracy versus CPU time in seconds; Smooth Loss versus Liblinear.]

Pages 66–67

Experiments

KDD Cup 2010 (Bridge to Algebra)

19,264,097 / 748,401 examples, 29,890,095 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus Liblinear.]
[Plot: test accuracy versus CPU time in seconds; Smooth Loss versus Liblinear.]

Pages 68–69

Experiments

Epsilon (ROCArea)

400,000 / 100,000 examples, 2,000 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus BMRM.]
[Plot: test ROCArea versus CPU time in seconds; Smooth Loss versus BMRM.]

Pages 70–71

Experiments

KDD Cup 2010 (Bridge to Algebra), ROCArea

19,264,097 / 748,401 examples, 29,890,095 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus BMRM.]
[Plot: test ROCArea versus CPU time in seconds; Smooth Loss versus BMRM.]

Pages 72–73

Experiments

Covertype (PRBEP)

522,911 / 58,101 examples, 54 dimensions, λ = 2⁻¹⁷

[Plot: objective function versus CPU time in seconds; Smooth Loss versus BMRM.]
[Plot: PRBEP versus CPU time in seconds; Smooth Loss versus BMRM.]

Page 74

Conclusion

Things I did not talk about
- Bregman divergence (entropy) regularization in MKL
- Convergence of SMO-MKL
- Extensions of our smoothing formulation to other losses

- Designing specialized optimizers for machine learning problems is important
- One can scale to larger and more interesting problems
- Smoothing is not the answer to everything!

Page 75

References

S. V. N. Vishwanathan, Zhaonan Sun, Nino Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361–2369.
Code available for download from http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html

Xinhua Zhang, Ankan Saha, and S. V. N. Vishwanathan. Smoothing Multivariate Performance Measures. Submitted to UAI.
Code coming soon…

Pages 76–77

Team

Acknowledgments