
Page 1

Optimization for Machine Learning
SMO-MKL and Smoothing Strategies

S.V.N. (vishy) Vishwanathan
Purdue University, [email protected]

June 2, 2011

Pages 2–5

Motivation

Binary Classification

[Figure: linearly separable data with labels y_i = −1 and y_i = +1, the decision hyperplane {x | ⟨w, x⟩ + b = 0}, and the margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = +1}; x_1 and x_2 lie on the two margin hyperplanes.]

The margin width follows from

⟨w, x_1⟩ + b = +1
⟨w, x_2⟩ + b = −1
⟨w, x_1 − x_2⟩ = 2
⟨w/‖w‖, x_1 − x_2⟩ = 2/‖w‖

Page 6

Motivation

Linear Support Vector Machines

Optimization Problem

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0

Pages 7–9

Motivation

The Kernel Trick

[Figure: a dataset that is not linearly separable in the (x, y) plane; after adding the feature x² + y², the two classes become separable by a hyperplane.]

Page 10

Motivation

Kernel Trick

Optimization Problem

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0

Pages 11–12

Motivation

Kernel Trick

Optimization Problem (dual)

max_α  −(1/2) αᵀHα + 1ᵀα
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0

H_ij = y_i y_j ⟨φ(x_i), φ(x_j)⟩ = y_i y_j k(x_i, x_j)
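As a concrete illustration (ours, not from the slides), a minimal NumPy sketch of these dual ingredients, assuming a Gaussian RBF base kernel; the function names are our own:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def dual_hessian(K, y):
    # H[i, j] = y_i * y_j * k(x_i, x_j)
    return (y[:, None] * y[None, :]) * K

def dual_objective(alpha, H):
    # -(1/2) alpha^T H alpha + 1^T alpha
    return -0.5 * (alpha @ H @ alpha) + alpha.sum()
```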

Pages 13–14

Motivation

Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer
- Cook up as many (base) kernels as you can
- Compute a data-dependent kernel function as a linear combination of base kernels:

k(x, x′) = ∑_k d_k k_k(x, x′)  s.t.  d_k ≥ 0
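On precomputed base Gram matrices, the combination is a one-liner; a sketch of ours (`combined_gram` and `base_grams` are assumed names):

```python
import numpy as np

def combined_gram(base_grams, d):
    # k(x, x') = sum_k d_k k_k(x, x') with d_k >= 0,
    # expressed on precomputed base Gram matrices K_k.
    d = np.asarray(d, dtype=float)
    if np.any(d < 0):
        raise ValueError("kernel weights d_k must be nonnegative")
    return sum(dk * Kk for dk, Kk in zip(d, base_grams))
```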

Page 15

Motivation

Object Detection

Localize a specified object of interest, if it exists, in a given image.

Page 16

Motivation

Some Examples of MKL Detection

[Figure: example images showing objects localized by the MKL detector.]

Page 17

Motivation

Summary of Our Results

Sonar dataset with 800 kernels:

         Training Time (s)      # Kernels Selected
  p      SMO-MKL   Shogun       SMO-MKL   Shogun
  1.1    4.71      47.43        91.20     258.00
  1.33   3.21      19.94        248.20    374.20
  2.0    3.39      34.67        661.20    664.80

Web dataset (≈50,000 points, 50 kernels): ≈30 minutes

Sonar with a hundred thousand kernels:
- Precomputed kernels: ≈8 minutes
- Kernels computed on-the-fly: ≈30 minutes

Pages 18–19

Motivation

Setting up the Optimization Problem (I)

The Setup
- We are given n kernel functions k_1, …, k_n with corresponding feature maps φ_1(·), …, φ_n(·)
- We are interested in deriving the stacked feature map

φ(x) = (√d_1 φ_1(x), …, √d_n φ_n(x))  ⟹  w = (w_1, …, w_n)

Page 20

Motivation

Setting up the Optimization Problem

Optimization Problem

min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0

Page 21

Motivation

Setting up the Optimization Problem

Optimization Problem

min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i
s.t.  y_i(∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0
      d_k ≥ 0  for all k

Page 22

Motivation

Setting up the Optimization Problem

Optimization Problem

min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖² + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  y_i(∑_k √d_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0
      d_k ≥ 0  for all k

Page 23

Motivation

Setting up the Optimization Problem

Optimization Problem (substituting w_k ← √d_k w_k moves the weights into the regularizer)

min_{w,b,ξ,d}  (1/2) ∑_k ‖w_k‖²/d_k + C ∑_{i=1}^m ξ_i + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  y_i(∑_k ⟨w_k, φ_k(x_i)⟩ + b) ≥ 1 − ξ_i  for all i
      ξ_i ≥ 0
      d_k ≥ 0  for all k

Page 24

Motivation

Setting up the Optimization Problem

Optimization Problem

min_d max_α  −(1/2) ∑_k d_k αᵀH_kα + 1ᵀα + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0
      d_k ≥ 0

Page 25

Motivation

Saddle Point Problem

[Figure: a 3-D saddle surface illustrating the min–max structure of the problem.]

Page 26

Motivation

Solving the Saddle Point

Saddle Point Problem

min_d max_α  −(1/2) ∑_k d_k αᵀH_kα + 1ᵀα + (ρ/2) (∑_k d_k^p)^{2/p}
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0
      d_k ≥ 0

Page 27

Our Approach

The Key Insight

Eliminate d: the minimization over d ≥ 0 can be carried out in closed form, leaving the dual

max_α  D(α) := −(1/(8ρ)) (∑_k (αᵀH_kα)^q)^{2/q} + 1ᵀα
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0

where 1/p + 1/q = 1.

Not a QP, but very close to one!
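A sketch of ours (not the talk's released code) that evaluates D(α) and recovers one consistent choice of kernel weights from the closed-form minimization over d via Hölder's inequality; `grams` holds the matrices H_k, and the weight formula assumes p > 1 and every αᵀH_kα > 0:

```python
import numpy as np

def D(alpha, grams, rho, p):
    # D(alpha) = -(1/(8 rho)) * (sum_k (alpha^T H_k alpha)^q)^(2/q) + 1^T alpha
    q = p / (p - 1.0)  # conjugate exponent, 1/p + 1/q = 1
    s = np.array([alpha @ H @ alpha for H in grams])
    return -((s ** q).sum()) ** (2.0 / q) / (8.0 * rho) + alpha.sum()

def kernel_weights(alpha, grams, rho, p):
    # Minimizer over d >= 0 of the page-24 objective:
    # d_k = s_k^(q-1) / (2 rho ||s||_q^(q-2)), with s_k = alpha^T H_k alpha.
    # For p = q = 2 this reduces to d_k = s_k / (2 rho).
    q = p / (p - 1.0)
    s = np.array([alpha @ H @ alpha for H in grams])
    norm_q = ((s ** q).sum()) ** (1.0 / q)
    return s ** (q - 1.0) / (2.0 * rho * norm_q ** (q - 2.0))
```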

Page 28

Our Approach

SMO-MKL: High-Level Overview

max_α  D(α) := −(1/(8ρ)) (∑_k (αᵀH_kα)^q)^{2/q} + 1ᵀα
s.t.  0 ≤ α_i ≤ C
      ∑_i α_i y_i = 0

Algorithm
- Choose two variables α_i and α_j to optimize
- Solve the one-dimensional reduced optimization problem
- Repeat until convergence
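Schematically, the outer loop looks as follows (a sketch of ours, not the released implementation); `select_pair` and `solve_reduced` are placeholders for the working-set selection and the one-dimensional solve described on the next slide:

```python
def smo_outer_loop(alpha, select_pair, solve_reduced, max_iter=100000, tol=1e-6):
    # Generic SMO skeleton: pick two dual variables, solve the reduced
    # one-dimensional problem along the direction that preserves
    # sum_i alpha_i y_i = 0, and repeat until the predicted gain is tiny.
    for _ in range(max_iter):
        i, j, predicted_gain = select_pair(alpha)
        if predicted_gain < tol:
            break
        alpha = solve_reduced(alpha, i, j)
    return alpha
```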

Pages 29–30

Our Approach

SMO-MKL: High-Level Overview

Selecting the Working Set
- Compute the directional derivative and directional Hessian
- Greedily select the variables

Solving the Reduced Problem
- Analytic solution for p = q = 2 (a one-dimensional quartic)
- For other values of p, use Newton–Raphson
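For the Newton–Raphson case, a safeguarded one-dimensional sketch (our illustration): given the first and second derivative of the reduced objective along the chosen direction, it takes Newton steps and falls back to bisection whenever a step leaves the feasible bracket.

```python
def newton_1d(grad, hess, lo, hi, tol=1e-10, max_iter=50):
    # Solve grad(t) = 0 for t in [lo, hi], assuming the reduced objective
    # is concave so that grad is decreasing on the bracket.
    t = 0.5 * (lo + hi)
    for _ in range(max_iter):
        g = grad(t)
        if abs(g) < tol:
            break
        if g > 0:
            lo = t  # maximizer lies to the right
        else:
            hi = t  # maximizer lies to the left
        h = hess(t)
        t_newton = t - g / h if h != 0 else None
        t = t_newton if t_newton is not None and lo < t_newton < hi else 0.5 * (lo + hi)
    return t
```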

Page 31

Experiments

Generalization Performance

[Bar chart, Australian dataset: test accuracy (%) on the y-axis (80–90) for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}; bars compare SMO-MKL and Shogun.]

Page 32

Experiments

Generalization Performance

[Bar chart, Ionosphere dataset: test accuracy (%) on the y-axis (88–94) for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}; bars compare SMO-MKL and Shogun.]

Pages 33–34

Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of training examples (≈10^3.5 to 10^4.5); SMO-MKL compared with Shogun. The observed scaling of SMO-MKL is roughly O(n^1.9).]

Page 35

Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of training examples.]

Pages 36–37

Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of kernels (10^4 to 10^5); SMO-MKL scales roughly linearly in the number of kernels (the O(n) reference line).]

Page 38

Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Log–log plot: CPU time in seconds versus number of training examples; SMO-MKL tracks an O(n) reference.]

Pages 39–40

Smoothing Strategy

The Story so Far

Kernels
- Kernel algorithms are very successful in tackling small to medium sized problems
- Small number of dimensions
- The number of training examples = number of dual variables

However…
- Datasets are growing in size
- In some applications, such as Natural Language Processing, the training examples have a large number of dimensions
- Kernels do not help!

Page 41

Smoothing Strategy

Summary of Our Results

Webspam trigram: 300,000 / 50,000 examples, 16,609,143 dimensions
- Train a linear SVM in 10 minutes
- 80 seconds on 8 cores

Epsilon: 400,000 / 100,000 examples, 2,000 dimensions (dense)
- Train a linear SVM in 4 minutes
- ROCArea in 7 minutes

KDD Cup 2010: 19,264,097 / 748,401 examples, 29,890,095 dimensions
- Train a linear SVM in 5 minutes
- ROCArea in 1 hour

Page 42

Smoothing Strategy

Regularized Risk Minimization

min_w  J(w) := (λ/2)‖w‖² + (1/m) ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩)

The first term is the regularizer; the second is the empirical risk Remp.
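For reference, the objective is straightforward to evaluate; a NumPy sketch of ours:

```python
import numpy as np

def hinge_risk(w, X, y):
    # Remp(w) = (1/m) sum_i max(0, 1 - y_i <w, x_i>)
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def J(w, X, y, lam):
    # J(w) = (lam/2) ||w||^2 + Remp(w)
    return 0.5 * lam * (w @ w) + hinge_risk(w, X, y)
```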

Page 43

Smoothing Strategy

Non-Smooth Loss

[Plot: the hinge loss max(0, 1 − y⟨w, x⟩) as a function of y⟨w, x⟩; piecewise linear with a kink at y⟨w, x⟩ = 1.]

Page 44

Smoothing Strategy

Key Issues

- The objective function is non-smooth
- State-of-the-art algorithms provably require Ω(1/(λε)) iterations
- Some algorithms do not work well in a distributed setting
- Can we do better?

Page 45

Smoothing Strategy

Key Idea [Nesterov]

[Plot: the hinge loss together with a smooth approximation, as functions of y⟨w, x⟩.]

We want to do this smoothing in a principled way.

Pages 46–48

Smoothing Strategy

Our Approach

Regularized Risk Minimization

min_w  J(w) := (λ/2)‖w‖² + Remp(w),   Remp(w) = (1/m) ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩)

Replace Remp with a smooth surrogate:

min_w  J(w) := (λ/2)‖w‖² + g*_µ(Aᵀw)

Approximating Remp
- g*_µ(Aᵀw) is smooth with a Lipschitz continuous gradient
- g*_µ(Aᵀw) uniformly approximates Remp:  |Remp(w) − g*_µ(Aᵀw)| ≤ µ  for all w

Page 49

Smoothing Strategy

Convergence Theorem [Nesterov]

Theorem
Suppose one can find g*_µ(Aᵀw) which uniformly approximates Remp. Then there exists an optimization procedure which converges to an ε-accurate solution of the regularized risk minimization problem in

O*( √D ‖A‖ min{ 1/ε, 1/√(λε) } )    (1)

iterations. Here D is a geometric constant that depends only on g*_µ (and is independent of ε and λ).

Pages 50–51

Smoothing Strategy

Designing g*_µ

High-Level Recipe
- Choose g and A such that Remp = g*(Aᵀw)
- Choose a strongly convex function d (modulus of strong convexity σ, with d(·) ≤ D)
- Set g*_µ = (g + µd)*

Pitfalls
- √D ‖A‖ needs to be independent of m, the training set size
- The gradient of g*_µ must be easy to compute

Page 52

Smoothing Strategy

Smoothing the Binary Hinge Loss

Implementing the Recipe

g(α) = −(1/m) ∑_i α_i   if α_i ∈ [0, 1] for all i,   ∞ otherwise

Choose A = −(1/m) [y_1 x_1, …, y_m x_m]  (the i-th column of A is −y_i x_i / m)

Verify that Remp = (1/m) ∑_{i=1}^m max(0, 1 − y_i⟨w, x_i⟩) = g*(Aᵀw)
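The identity can be checked numerically; a sketch of ours that evaluates g*(Aᵀw) directly from its closed form g*(u) = ∑_i max(0, u_i + 1/m):

```python
import numpy as np

def g_star_of_ATw(w, X, y):
    # With A = -(1/m)[y_1 x_1, ..., y_m x_m]: (A^T w)_i = -(y_i <w, x_i>)/m,
    # and g*(u) = sum_i max(0, u_i + 1/m), which equals Remp(w).
    m = len(y)
    u = -(y * (X @ w)) / m
    return np.sum(np.maximum(0.0, u + 1.0 / m))

# Example check against the direct hinge-loss computation:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.where(rng.normal(size=50) > 0, 1.0, -1.0)
w = rng.normal(size=5)
remp = np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))
assert np.isclose(g_star_of_ATw(w, X, y), remp)
```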

Page 53

Smoothing Strategy

Smoothing the Binary Hinge Loss

Two variants: Square norm

Let d(α) = (1/2m) ∑_i α_i². Then

g*_µ(Aᵀw) = (1/m) ∑_i h(y_i⟨w, x_i⟩)   where

h(u) = 0               if u > 1
       (1 − u)²/(2µ)   if u ∈ [1 − µ, 1]
       1 − u − µ/2     if u < 1 − µ
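The three branches translate directly into code; a sketch of ours:

```python
import numpy as np

def smoothed_hinge(u, mu):
    # Square-norm smoothing of the hinge loss:
    #   0                   for u > 1
    #   (1 - u)^2 / (2 mu)  for 1 - mu <= u <= 1
    #   1 - u - mu/2        for u < 1 - mu
    u = np.asarray(u, dtype=float)
    return np.where(u > 1.0, 0.0,
           np.where(u >= 1.0 - mu,
                    (1.0 - u) ** 2 / (2.0 * mu),
                    1.0 - u - mu / 2.0))
```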

Page 54

Smoothing Strategy

Visualizing the Loss

[Plot: the hinge loss and its square-norm smoothing as functions of y⟨w, x⟩; the smoothed loss is quadratic on [1 − µ, 1] and differs from the hinge loss by at most µ/2 elsewhere.]

Pages 55–56

Smoothing Strategy

Smoothing the Binary Hinge Loss

Two variants: Entropy

Let d(α) = (1/m) ∑_i [α_i log α_i + (1 − α_i) log(1 − α_i) + log 2]. Then

g*_µ(Aᵀw) = (1/m) ∑_i l(y_i⟨w, x_i⟩)   where

l(u) = µ ( log(1 + exp((1 − u)/µ)) − log 2 )

or, equivalently,

l(u) = µ log( exp(−1/µ) + exp(−u/µ) ) + 1 − µ log 2
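The same for the entropy variant, written with `logaddexp` so the exponential never overflows; a sketch of ours:

```python
import numpy as np

def entropy_smoothed_hinge(u, mu):
    # l(u) = mu * (log(1 + exp((1 - u)/mu)) - log 2),
    # using logaddexp(0, z) = log(1 + exp(z)) for numerical stability.
    u = np.asarray(u, dtype=float)
    return mu * (np.logaddexp(0.0, (1.0 - u) / mu) - np.log(2.0))
```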

Page 57

Smoothing Strategy

Visualizing the Loss

[Plot: the hinge loss and its entropy smoothing as functions of y⟨w, x⟩; the entropy-smoothed loss is everywhere smooth and approaches the hinge loss as µ → 0.]
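Putting the pieces together: a minimal accelerated-gradient sketch for minimizing J(w) = (λ/2)‖w‖² + (1/m) ∑_i h(y_i⟨w, x_i⟩) with the square-norm smoothed hinge. This is our own illustration, not the talk's implementation; in particular the step size uses the bound L ≤ λ + ‖X‖₂²/(µm), which is an assumption of the sketch.

```python
import numpy as np

def smoothed_risk_grad(w, X, y, mu):
    # h'(u) = 0 (u > 1), -(1 - u)/mu (1 - mu <= u <= 1), -1 (u < 1 - mu)
    m = len(y)
    u = y * (X @ w)
    hp = np.where(u > 1.0, 0.0,
         np.where(u >= 1.0 - mu, -(1.0 - u) / mu, -1.0))
    return X.T @ (hp * y) / m

def accelerated_gradient(X, y, lam, mu, steps=500):
    # Nesterov's accelerated gradient on the smoothed objective.
    m, n = X.shape
    L = lam + np.linalg.norm(X, 2) ** 2 / (mu * m)  # Lipschitz bound (assumed)
    w = np.zeros(n)
    v = np.zeros(n)
    for t in range(steps):
        g = lam * v + smoothed_risk_grad(v, X, y, mu)
        w_next = v - g / L
        v = w_next + (t / (t + 3.0)) * (w_next - w)  # standard momentum weight
        w = w_next
    return w
```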

Page 58

Smoothing Strategy

Other Extensions

Multivariate Performance Scores
- Multiclass
- ROCArea
- Precision-Recall Break-Even Point (PRBEP)

Pages 59–61

Experiments

Webspam (Trigram)

300,000 / 50,000 examples, 16,609,143 dimensions

[Plot, λ = 10⁻⁴: objective function versus CPU time in seconds.]
[Plot, λ = 10⁻⁴: test accuracy versus CPU time in seconds.]
[Plot, λ = 10⁻⁸: test accuracy versus CPU time in seconds.]

Pages 62–63

Experiments

Webspam (Trigram), distributed across 8 processors

[Plot: objective function versus CPU time in seconds.]
[Plot: test accuracy versus CPU time in seconds.]

Pages 64–65

Experiments

Epsilon

400,000 / 100,000 examples, 2,000 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus Liblinear.]
[Plot: test accuracy versus CPU time in seconds; Smooth Loss versus Liblinear.]

Pages 66–67

Experiments

KDD Cup 2010 (Bridge to Algebra)

19,264,097 / 748,401 examples, 29,890,095 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus Liblinear.]
[Plot: test accuracy versus CPU time in seconds; Smooth Loss versus Liblinear.]

Pages 68–69

Experiments

Epsilon (ROCArea)

400,000 / 100,000 examples, 2,000 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus BMRM.]
[Plot: test ROCArea versus CPU time in seconds; Smooth Loss versus BMRM.]

Pages 70–71

Experiments

KDD Cup 2010 (Bridge to Algebra), ROCArea

19,264,097 / 748,401 examples, 29,890,095 dimensions, λ = 10⁻⁸

[Plot: objective function versus CPU time in seconds; Smooth Loss versus BMRM.]
[Plot: test ROCArea versus CPU time in seconds; Smooth Loss versus BMRM.]

Pages 72–73

Experiments

Covertype (PRBEP)

522,911 / 58,101 examples, 54 dimensions, λ = 2⁻¹⁷

[Plot: objective function versus CPU time in seconds; Smooth Loss versus BMRM.]
[Plot: PRBEP versus CPU time in seconds; Smooth Loss versus BMRM.]

Page 74

Conclusion

Things I did not talk about
- Bregman divergence (entropy) regularization in MKL
- Convergence of SMO-MKL
- Extensions of our smoothing formulation to other losses

- Designing specialized optimizers for machine learning problems is important
- One can scale to larger and more interesting problems
- Smoothing is not the answer to everything!

Page 75

References

S. V. N. Vishwanathan, Zhaonan Sun, Nino Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361–2369.
Code available for download from http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html

Xinhua Zhang, Ankan Saha, and S. V. N. Vishwanathan. Smoothing Multivariate Performance Measures. Submitted to UAI.
Code coming soon…

Pages 76–77

Team

Acknowledgments