Optimization for Machine Learning: SMO-MKL and Smoothing Strategies
S.V.N. (vishy) Vishwanathan
Purdue University
[email protected]
June 2, 2011
Motivation
Binary Classification
[Figure: linearly separable data, classes $y_i = -1$ and $y_i = +1$; hyperplanes $\{x \mid \langle w, x \rangle + b = 0\}$, $\{x \mid \langle w, x \rangle + b = -1\}$, $\{x \mid \langle w, x \rangle + b = +1\}$; margin points $x_1$, $x_2$]

$$\langle w, x_1 \rangle + b = +1, \qquad \langle w, x_2 \rangle + b = -1$$
$$\implies \langle w, x_1 - x_2 \rangle = 2 \implies \left\langle \frac{w}{\|w\|},\, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}$$
Linear Support Vector Machines
Optimization Problem
$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \;\; y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i \;\; \text{for all } i, \qquad \xi_i \ge 0$$
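To make the objective concrete, here is a minimal sketch (not from the talk) that minimizes the equivalent unconstrained hinge-loss form of this problem, $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(\langle w, x_i\rangle + b))$, by subgradient descent in NumPy. The function name and step-size schedule are illustrative assumptions.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, steps=1000, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + C*sum_i max(0, 1 - y_i*(<w,x_i> + b)).

    X: (m, d) data matrix, y: (m,) labels in {-1, +1}. Illustrative sketch only.
    """
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for t in range(steps):
        margins = y * (X @ w + b)
        viol = margins < 1                      # examples with nonzero hinge loss
        # Subgradient: w minus C times the sum of y_i * x_i over violators
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        step = lr / (1 + t * lr)                # decaying step size (assumption)
        w -= step * gw
        b -= step * gb
    return w, b

# Tiny usage example on synthetic separable data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = train_linear_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```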
The Kernel Trick
[Figure, three builds: data that is not linearly separable in the original $(x, y)$ coordinates becomes linearly separable after mapping to $(x, x^2 + y^2)$]
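The slide's picture can be reproduced numerically. The following sketch (my own illustration, not the talk's code) shows that a cluster-vs-ring dataset that no linear classifier separates in $(x, y)$ becomes linearly separable under the map $(x, y) \mapsto (x, x^2 + y^2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Inner cluster (label -1) and surrounding ring (label +1)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.3, (100, 2))
ring = np.column_stack([2 * np.cos(theta), 2 * np.sin(theta)]) + rng.normal(0, 0.1, (100, 2))
X = np.vstack([inner, ring])
y = np.array([-1] * 100 + [1] * 100)

# Feature map phi(x, y) = (x, x^2 + y^2): radius information becomes one coordinate
Phi = np.column_stack([X[:, 0], (X ** 2).sum(axis=1)])

# A threshold on the second feature already separates the classes
threshold = 1.0  # any value between the squared radii of the two clusters works
print("accuracy of the linear rule in feature space:",
      np.mean(np.sign(Phi[:, 1] - threshold) == y))
```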
Kernel Trick

Optimization Problem (primal, with feature map $\phi$)

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \;\; y_i(\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i \;\; \text{for all } i, \qquad \xi_i \ge 0$$
Optimization Problem (dual)

$$\max_{\alpha} \;\; -\frac{1}{2}\alpha^\top H \alpha + \mathbf{1}^\top \alpha$$
$$\text{s.t.} \;\; 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0$$

where $H_{ij} = y_i y_j \langle \phi(x_i), \phi(x_j) \rangle = y_i y_j k(x_i, x_j)$; the kernel $k$ lets us avoid computing $\phi$ explicitly.
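As a concrete illustration (mine, not the talk's), the dual matrix $H$ can be assembled from any kernel; here with an RBF kernel:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), computed without explicit loops."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
y = rng.choice([-1, 1], size=6)

K = rbf_kernel(X)
H = np.outer(y, y) * K        # H_ij = y_i y_j k(x_i, x_j)

# H is positive semidefinite whenever K is, so the dual is a concave QP
print("min eigenvalue of H:", np.linalg.eigvalsh(H).min())
```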
Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer
- Cook up as many (base) kernels as you can
- Compute a data-dependent kernel function as a linear combination of base kernels

$$k(x, x') = \sum_k d_k\, k_k(x, x') \quad \text{s.t.} \quad d_k \ge 0$$
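A sketch (illustrative, not the authors' code) of forming such a combination from base kernel matrices:

```python
import numpy as np

def combined_kernel(Ks, d):
    """Return sum_k d_k * K_k for base kernel matrices Ks and weights d >= 0."""
    d = np.asarray(d, dtype=float)
    assert (d >= 0).all(), "MKL weights must be nonnegative"
    return sum(dk * K for dk, K in zip(d, Ks))

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T

# Base kernels: a linear kernel and two RBF kernels at different bandwidths
Ks = [X @ X.T, np.exp(-0.5 * d2), np.exp(-2.0 * d2)]
K = combined_kernel(Ks, d=[0.2, 0.5, 0.3])
print(K.shape)  # (5, 5); still a valid (PSD) kernel since every d_k >= 0
```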
Object Detection
Localize a specified object of interest if it exists in a given image
Some Examples of MKL Detection

[Images: example detections]
Summary of Our Results
Sonar Dataset with 800 kernels
p      Training Time (s)          # Kernels Selected
       SMO-MKL    Shogun          SMO-MKL    Shogun
1.1    4.71       47.43           91.20      258.00
1.33   3.21       19.94           248.20     374.20
2.0    3.39       34.67           661.20     664.80
Web dataset (≈50,000 points, 50 kernels): ≈30 minutes
Sonar with a hundred thousand kernels:
- Precomputed kernels: ≈8 minutes
- Kernels computed on the fly: ≈30 minutes
Setting up the Optimization Problem – I

The Setup
We are given $n$ kernel functions $k_1, \ldots, k_n$ with corresponding feature maps $\phi_1(\cdot), \ldots, \phi_n(\cdot)$.
We are interested in deriving the feature map

$$\phi(x) = \begin{bmatrix} \sqrt{d_1}\,\phi_1(x) \\ \vdots \\ \sqrt{d_n}\,\phi_n(x) \end{bmatrix} \implies w = \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}$$
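Under this construction, the inner product in the stacked feature space is exactly the weighted sum of base kernels, $\langle \phi(x), \phi(x') \rangle = \sum_k d_k k_k(x, x')$. A quick numerical check (an illustration using two assumed toy feature maps):

```python
import numpy as np

rng = np.random.default_rng(4)
x, xp = rng.normal(size=3), rng.normal(size=3)
d = np.array([0.5, 2.0])

# Two toy base feature maps (assumptions for illustration)
phi1 = lambda v: v                                  # linear features
phi2 = lambda v: np.concatenate([v ** 2, [1.0]])    # squared features plus a bias

# Stacked map: each block scaled by sqrt(d_k)
phi = lambda v: np.concatenate([np.sqrt(d[0]) * phi1(v), np.sqrt(d[1]) * phi2(v)])

lhs = phi(x) @ phi(xp)
rhs = d[0] * (phi1(x) @ phi1(xp)) + d[1] * (phi2(x) @ phi2(xp))
print(np.isclose(lhs, rhs))  # True: stacking scaled maps sums the kernels
```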
Setting up the Optimization Problem

Start from the kernelized primal:

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \;\; y_i(\langle w, \phi(x_i) \rangle + b) \ge 1 - \xi_i \;\; \text{for all } i, \qquad \xi_i \ge 0$$
Splitting $w$ into per-kernel blocks $w_k$ and introducing weights $d_k$:

$$\min_{w, b, \xi, d} \;\; \frac{1}{2}\sum_k \|w_k\|^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \;\; y_i\Big(\sum_k \sqrt{d_k}\,\langle w_k, \phi_k(x_i) \rangle + b\Big) \ge 1 - \xi_i \;\; \text{for all } i$$
$$\xi_i \ge 0, \qquad d_k \ge 0 \;\; \text{for all } k$$
Adding a squared $\ell_p$-norm regularizer on the kernel weights:

$$\min_{w, b, \xi, d} \;\; \frac{1}{2}\sum_k \|w_k\|^2 + C \sum_{i=1}^{m} \xi_i + \frac{\rho}{2}\Big(\sum_k d_k^p\Big)^{\frac{2}{p}}$$
$$\text{s.t.} \;\; y_i\Big(\sum_k \sqrt{d_k}\,\langle w_k, \phi_k(x_i) \rangle + b\Big) \ge 1 - \xi_i \;\; \text{for all } i$$
$$\xi_i \ge 0, \qquad d_k \ge 0 \;\; \text{for all } k$$
Substituting $w_k \leftarrow \sqrt{d_k}\,w_k$ moves the weights into the regularizer:

$$\min_{w, b, \xi, d} \;\; \frac{1}{2}\sum_k \frac{\|w_k\|^2}{d_k} + C \sum_{i=1}^{m} \xi_i + \frac{\rho}{2}\Big(\sum_k d_k^p\Big)^{\frac{2}{p}}$$
$$\text{s.t.} \;\; y_i\Big(\sum_k \langle w_k, \phi_k(x_i) \rangle + b\Big) \ge 1 - \xi_i \;\; \text{for all } i$$
$$\xi_i \ge 0, \qquad d_k \ge 0 \;\; \text{for all } k$$
Dualizing in $\alpha$ (for fixed $d$) yields the saddle-point problem:

$$\min_{d} \max_{\alpha} \;\; -\frac{1}{2}\sum_k d_k\, \alpha^\top H_k \alpha + \mathbf{1}^\top \alpha + \frac{\rho}{2}\Big(\sum_k d_k^p\Big)^{\frac{2}{p}}$$
$$\text{s.t.} \;\; 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0, \qquad d_k \ge 0$$

where $(H_k)_{ij} = y_i y_j k_k(x_i, x_j)$.
Saddle Point Problem

[Figure: 3-D surface of the objective over $(d, \alpha)$, showing a saddle point]
Solving the Saddle Point

We must solve the saddle-point problem above: an outer minimization over the kernel weights $d$ of an inner maximization over the dual variables $\alpha$.
Our Approach

The Key Insight

Eliminate $d$: minimizing over $d$ in closed form leaves a problem in $\alpha$ alone:

$$\max_{\alpha} \;\; D(\alpha) := -\frac{1}{8\rho}\Big(\sum_k \big(\alpha^\top H_k \alpha\big)^q\Big)^{\frac{2}{q}} + \mathbf{1}^\top \alpha$$
$$\text{s.t.} \;\; 0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0, \qquad \frac{1}{p} + \frac{1}{q} = 1$$

Not a QP, but very close to one!
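The slides state the eliminated form without the intermediate step; here is a sketch of the derivation (my reconstruction, writing $s_k := \alpha^\top H_k \alpha \ge 0$):

```latex
\min_{d \ge 0} \; -\frac{1}{2}\sum_k d_k s_k + \frac{\rho}{2}\Big(\sum_k d_k^p\Big)^{\frac{2}{p}}
% Write d = c\,u with c \ge 0, u \ge 0, \|u\|_p = 1. The objective becomes
%   -\tfrac{c}{2}\langle u, s \rangle + \tfrac{\rho}{2} c^2,
% minimized at c^* = \langle u, s \rangle / (2\rho) with value
%   -\langle u, s \rangle^2 / (8\rho).
% By Hölder's inequality, \max\{\langle u, s \rangle : \|u\|_p = 1,\ u \ge 0\}
% equals \|s\|_q with 1/p + 1/q = 1 (all s_k \ge 0 since each H_k is PSD), so
\;=\; -\frac{1}{8\rho}\Big(\sum_k s_k^q\Big)^{\frac{2}{q}}.
```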
SMO-MKL: High Level Overview

Algorithm (applied to the problem above):
- Choose two variables $\alpha_i$ and $\alpha_j$ to optimize
- Solve the one-dimensional reduced optimization problem
- Repeat until convergence
Selecting the Working Set:
- Compute the directional derivative and directional Hessian
- Greedily select the variables

Solving the Reduced Problem:
- Analytic solution for $p = q = 2$ (a one-dimensional quartic)
- For other values of $p$, use Newton–Raphson
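Below is a self-contained sketch of this scheme (an illustration, not the authors' SMO-MKL code): working-set selection is random rather than gradient-greedy, and the reduced problem is solved with a bounded scalar minimizer instead of the analytic quartic or Newton–Raphson step.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def D(alpha, Hs, rho, q):
    """Dual objective 1'a - (1/(8*rho)) * (sum_k (a'H_k a)^q)^(2/q)."""
    s = np.array([alpha @ H @ alpha for H in Hs])
    return alpha.sum() - (s ** q).sum() ** (2.0 / q) / (8.0 * rho)

def smo_mkl_sketch(Hs, y, C=1.0, rho=1.0, q=2.0, iters=500, seed=0):
    m = len(y)
    rng = np.random.default_rng(seed)
    alpha = np.zeros(m)  # feasible start: box and sum_i alpha_i y_i = 0 both hold
    for _ in range(iters):
        i, j = rng.choice(m, size=2, replace=False)
        # Direction alpha_i += y_i*t, alpha_j -= y_j*t preserves sum_i alpha_i y_i;
        # the box constraints 0 <= alpha <= C bound the step t.
        lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] > 0 else (alpha[i] - C, alpha[i])
        lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] > 0 else (-alpha[j], C - alpha[j])
        lo, hi = max(lo_i, lo_j), min(hi_i, hi_j)
        if hi - lo < 1e-12:
            continue
        def neg_obj(t):
            a = alpha.copy()
            a[i] += y[i] * t
            a[j] -= y[j] * t
            return -D(a, Hs, rho, q)
        t = minimize_scalar(neg_obj, bounds=(lo, hi), method="bounded").x
        if -neg_obj(t) > D(alpha, Hs, rho, q):   # accept only improving steps
            alpha[i] += y[i] * t
            alpha[j] -= y[j] * t
    return alpha

# Usage on toy data with two RBF base kernels
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
Hs = [np.outer(y, y) * np.exp(-g * d2) for g in (0.1, 1.0)]
alpha = smo_mkl_sketch(Hs, y)
print("D(alpha) =", D(alpha, Hs, rho=1.0, q=2.0))
```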
Experiments
Generalization Performance
[Bar charts: test accuracy (%) of SMO-MKL vs. Shogun for p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0} on Australian (80–90% range) and Ionosphere (88–94% range)]
Scaling with Training Set Size
Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1
[Log-log plot: CPU time (s) vs. number of training examples ($10^{3.5}$–$10^{4.5}$) for SMO-MKL and Shogun; SMO-MKL's observed scaling is roughly $O(n^{1.9})$]
On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Log-log plot: CPU time (s) vs. number of training examples ($10^{3.5}$–$10^{4.5}$)]
Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Log-log plot: CPU time (s) vs. number of kernels ($10^4$–$10^5$) for SMO-MKL, with an $O(n)$ reference line]
On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Log-log plot: CPU time (s) for SMO-MKL with an $O(n)$ reference line; the x-axis is labeled "Number of Training Examples" and spans $10^{0.6}$–$10^{1.6}$]
Smoothing Strategy
The Story so Far

Kernels
Kernel algorithms are very successful in tackling small- to medium-sized problems:
- Small number of dimensions
- The number of training examples = the number of dual variables

However . . .
- Datasets are growing in size
- In some applications, such as Natural Language Processing, the training examples have a large number of dimensions
- Kernels do not help!
Summary of Our Results

Webspam trigram: 300,000 / 50,000 examples, 16,609,143 dimensions
- Train a linear SVM in 10 minutes
- 80 seconds on 8 cores

Epsilon: 400,000 / 100,000 examples, 2,000 dimensions (dense)
- Train a linear SVM in 4 minutes
- ROCArea in 7 minutes

KDD Cup 2010: 19,264,097 / 748,401 examples, 29,890,095 dimensions
- Train a linear SVM in 5 minutes
- ROCArea in 1 hour
Regularized Risk Minimization

$$\min_w \; J(w) := \underbrace{\frac{\lambda}{2}\|w\|^2}_{\text{Regularizer}} + \underbrace{\frac{1}{m}\sum_{i=1}^{m} \max(0,\, 1 - y_i \langle w, x_i \rangle)}_{\text{Risk: } R_{\mathrm{emp}}}$$
Non-Smooth Loss

[Plot: the hinge loss against $y\langle w, x\rangle$; piecewise linear with a kink at 1]
Key Issues
- The objective function is non-smooth
- State-of-the-art algorithms provably require $\Omega\!\left(\frac{1}{\lambda\varepsilon}\right)$ iterations
- Some algorithms do not work well in a distributed setting

Can we do better?
Key Idea [Nesterov]

[Plot: a smooth approximation of the hinge loss against $y\langle w, x\rangle$]

We want to do this smoothing in a principled way.
Our Approach

Regularized Risk Minimization

$$\min_w \; J(w) := \frac{\lambda}{2}\|w\|^2 + \underbrace{\frac{1}{m}\sum_{i=1}^{m}\max(0,\, 1 - y_i\langle w, x_i\rangle)}_{R_{\mathrm{emp}}}$$

Approximating $R_{\mathrm{emp}}$
- $g^*_\mu(A^\top w)$ is smooth with a Lipschitz continuous gradient
- $g^*_\mu(A^\top w)$ uniformly approximates $R_{\mathrm{emp}}$:
$$\big|R_{\mathrm{emp}}(w) - g^*_\mu(A^\top w)\big| \le \mu \quad \text{for all } w$$
Replacing $R_{\mathrm{emp}}$ by its smooth approximation gives

$$\min_w \; J(w) := \frac{\lambda}{2}\|w\|^2 + g^*_\mu(A^\top w)$$
Convergence Theorem [Nesterov]

Theorem. Suppose one can find $g^*_\mu(A^\top w)$ which uniformly approximates $R_{\mathrm{emp}}$. Then there exists an optimization procedure which converges to an $\varepsilon$-accurate solution of the regularized risk minimization problem in

$$O^*\!\left(\sqrt{D}\,\|A\| \min\left\{\frac{1}{\varepsilon},\, \frac{1}{\sqrt{\lambda\varepsilon}}\right\}\right) \tag{1}$$

iterations. Here $D$ is a geometric constant that depends only on $g^*_\mu$ (and is independent of $\varepsilon$ and $\lambda$).
Designing $g^*_\mu$

High-Level Recipe
- Choose $g$ and $A$ such that $R_{\mathrm{emp}}(w) = g^*(A^\top w)$
- Choose a strongly convex function $d$ with modulus of strong convexity $\sigma$ and $d(\cdot) \le D$
- Set $g^*_\mu = (g + \mu d)^*$

Pitfalls
- $\sqrt{D}\,\|A\|$ must be independent of $m$, the training set size
- The gradient of $g^*_\mu$ must be easy to compute
Smoothing the Binary Hinge Loss

Implementing the Recipe

$$g(\alpha) = \begin{cases} -\frac{1}{m}\sum_i \alpha_i & \text{if } \alpha_i \in [0,1] \text{ for all } i \\ \infty & \text{otherwise} \end{cases}$$

Choose $A = -\frac{1}{m} Y X = -\frac{1}{m}\big[y_1 x_1, \ldots, y_m x_m\big]$ (one column per example).

Verify that $R_{\mathrm{emp}}(w) = \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle) = g^*(A^\top w)$.
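The verification, which the slide leaves as an exercise, is a short conjugate computation (my sketch):

```latex
g^*(A^\top w)
  = \sup_{\alpha \in [0,1]^m} \big\langle A^\top w, \alpha \big\rangle - g(\alpha)
  = \sup_{\alpha \in [0,1]^m} \frac{1}{m} \sum_{i=1}^m
      \alpha_i \big(1 - y_i \langle w, x_i \rangle\big)
% Each coordinate maximizes independently: take alpha_i = 1 when the
% bracket is positive and alpha_i = 0 otherwise, which yields
  = \frac{1}{m} \sum_{i=1}^m \max\big(0,\, 1 - y_i \langle w, x_i \rangle\big)
  = R_{\mathrm{emp}}(w).
```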
Smoothing the Binary Hinge Loss

Two variants: Square norm

Let $d(\alpha) = \frac{1}{2m}\sum_i \alpha_i^2$. Then

$$g^*_\mu(A^\top w) = \frac{1}{m}\sum_i h(y_i\langle w, x_i\rangle), \quad \text{where} \quad h(u) = \begin{cases} 0 & \text{if } u > 1 \\ \frac{(1-u)^2}{2\mu} & \text{if } u \in [1-\mu,\, 1] \\ 1 - u - \frac{\mu}{2} & \text{if } u < 1-\mu \end{cases}$$
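A vectorized sketch of this smoothed (Huber-style) hinge, with a numerical check that for this variant the uniform gap to the hinge is at most $\mu/2$ (an illustration, not the authors' code):

```python
import numpy as np

def smoothed_hinge_sq(u, mu):
    """Square-norm-smoothed hinge h(u): quadratic on [1-mu, 1], linear below."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 1, 0.0,
           np.where(u >= 1 - mu, (1 - u) ** 2 / (2 * mu), 1 - u - mu / 2))

u = np.linspace(-3, 3, 10001)
hinge = np.maximum(0, 1 - u)
for mu in (0.5, 0.1, 0.01):
    gap = np.abs(hinge - smoothed_hinge_sq(u, mu)).max()
    print(f"mu={mu}: max |hinge - h| = {gap:.4f} (<= mu/2 = {mu / 2})")
```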
Visualizing the Loss

[Plot: the square-norm-smoothed loss against $y\langle w, x\rangle$]
Smoothing the Binary Hinge Loss

Two variants: Entropy

Let $d(\alpha) = \frac{1}{m}\sum_i \big[\alpha_i \log \alpha_i + (1-\alpha_i)\log(1-\alpha_i) + \log 2\big]$. Then

$$g^*_\mu(A^\top w) = \frac{1}{m}\sum_i l(y_i\langle w, x_i\rangle), \quad \text{where} \quad l(u) = \mu\left(\log\Big(1 + \exp\Big(\frac{1-u}{\mu}\Big)\Big) - \log 2\right)$$
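A sketch of this entropy-smoothed hinge; computing it via `np.logaddexp` avoids overflow when $(1-u)/\mu$ is large (an illustration, not the authors' code):

```python
import numpy as np

def smoothed_hinge_ent(u, mu):
    """Entropy-smoothed hinge l(u) = mu * (log(1 + exp((1-u)/mu)) - log 2)."""
    u = np.asarray(u, dtype=float)
    # logaddexp(0, z) = log(1 + exp(z)), computed stably for large |z|
    return mu * (np.logaddexp(0.0, (1 - u) / mu) - np.log(2))

u = np.linspace(-3, 3, 7)
for mu in (0.5, 0.1):
    print(f"mu={mu}:", np.round(smoothed_hinge_ent(u, mu), 4))
# As mu -> 0 the values approach the hinge max(0, 1-u) (within mu*log 2)
print("hinge:", np.maximum(0, 1 - u))
```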
Visualizing the Loss

[Plot: the entropy-smoothed loss against $y\langle w, x\rangle$]
Other Extensions

Multivariate Performance Scores:
- Multiclass
- ROCArea
- Precision-Recall Break-Even Point (PRBEP)
Experiments
Webspam (Trigram)

300,000 / 50,000 examples, 16,609,143 dimensions

[Plots vs. CPU time (s): objective function and test accuracy for $\lambda = 10^{-4}$, test accuracy for $\lambda = 10^{-8}$ (each over 0–2,000 s), and objective function and test accuracy with training distributed across 8 processors (over 0–200 s)]
Epsilon

400,000 / 100,000 examples, 2,000 dimensions, $\lambda = 10^{-8}$

[Plots vs. CPU time (s), Smooth Loss vs. Liblinear: objective function and test accuracy (accuracy axis spans 65–90%)]
KDD Cup 2010 (Bridge to Algebra)

19,264,097 / 748,401 examples, 29,890,095 dimensions, $\lambda = 10^{-8}$

[Plots vs. CPU time (s), Smooth Loss vs. Liblinear: objective function and test accuracy (70–90%)]
Epsilon (ROCArea)

400,000 / 100,000 examples, 2,000 dimensions, $\lambda = 10^{-8}$

[Plots vs. CPU time (s), Smooth Loss vs. BMRM: objective function and test ROCArea (75–95)]
KDD Cup 2010 (Bridge to Algebra) ROCArea

19,264,097 / 748,401 examples, 29,890,095 dimensions, $\lambda = 10^{-8}$

[Plots vs. CPU time (s), Smooth Loss vs. BMRM: objective function and test ROCArea (70–85)]
Covertype (PRBEP)

Covertype: 522,911 / 58,101 examples, 54 dimensions, $\lambda = 2^{-17}$

[Plots vs. CPU time (s), Smooth Loss vs. BMRM: objective function and PRBEP (78–84)]
Conclusion

Things I did not talk about:
- Bregman divergence (entropy) regularization in MKL
- Convergence of SMO-MKL
- Extensions of our smoothing formulation to other losses

Designing specialized optimizers for machine learning problems is important. One can scale to larger and more interesting problems. Smoothing is not the answer to everything!
References

S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361–2369. Code available from http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html

Xinhua Zhang, Ankan Saha, and S. V. N. Vishwanathan. Smoothing Multivariate Performance Measures. Submitted to UAI. Code coming soon . . .
Team
Acknowledgments