Efficient Classification Based on Sparse Regression
Pardis Noorzad
Department of Computer Engineering and Information Technology
Amirkabir University of Technology
Supervisor: Prof. Mohammad Rahmati
MSc Thesis Defense – July 17, 2012
Outline

Motivation I
  - SVM, its Advantages and its Limitations
  - Related Work
Proposal I
  - Theory: Square Loss for Classification
  - Theory: Linear Least Squares Regression and Regularization
  - ℓ1-regularized Square Loss Minimization for Classification
Motivation II
  - Sparse Coding
  - Sparse Representation Classification
  - Extension to Regression: SPARROW
Proposal II
  - ℓ1-regularized Square Loss Minimization for Reconstruction
  - Empirical Evaluation of SPARROW
  - kNN vs SRC
Conclusions
SVM, its Advantages and its Limitations
The SVM Optimization Problem
minimize    (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i
subject to  y_i(w·x_i − b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  for i = 1, …, n.

This can alternatively be written as

minimize    (1/2)∥w∥^2 + C ∑_{i=1}^{n} ξ_i
subject to  ξ_i ≥ max(0, 1 − y_i(w·x_i − b))  for i = 1, …, n.

Introduce the notation

ξ_i ≥ [1 − y_i(w·x_i − b)]_+,  where [x]_+ = max(0, x).
SVM, its Advantages and its Limitations
The SVM Optimization Problem: Continued
We will be working with the following formulation:

min  (1/2)∥w∥^2 + C ∑_{i=1}^{n} [1 − y_i(w·x_i − b)]_+

which contains two terms (see the sketch below):

- ℓ2-regularization of the weights
- the hinge loss
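To make the two terms concrete, here is a minimal NumPy sketch of this objective; the function name and the toy data are illustrative, not from the thesis code.

    import numpy as np

    def svm_objective(w, b, X, y, C):
        """(1/2)||w||^2 + C * sum_i [1 - y_i (w.x_i - b)]_+"""
        margins = y * (X @ w - b)                # y_i (w.x_i - b)
        hinge = np.maximum(0.0, 1.0 - margins)   # the [.]_+ notation above
        return 0.5 * (w @ w) + C * hinge.sum()   # l2 term + hinge term

    # toy check: at w = 0, b = 0 every hinge term equals 1, so the value is C * n
    X = np.array([[1.0, 2.0], [-1.0, -1.5]])
    y = np.array([1.0, -1.0])
    print(svm_objective(np.zeros(2), 0.0, X, y, C=1.0))  # 2.0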
SVM, its Advantages and its Limitations
SVM is popular
- the SVM solution is sparse, due to the hinge loss
- SVM uses only a subset of the training samples, called the support vectors, for prediction:

  f(x) = ∑_{i=1}^{n} α_i y_i x_i·x − b

- SVM can be employed for nonlinear classification, due to the kernel trick (see the sketch below):

  f(x) = ∑_{i=1}^{n} α_i y_i k(x_i, x) − b

- SVM has excellent generalization performance, due to the ℓ2-regularization of the weights
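As a sketch of the prediction rule just shown: the following assumes α and b come from an already-trained SVM, and uses an RBF kernel as one illustrative choice of k.

    import numpy as np

    def rbf_kernel(u, v, gamma=1.0):
        return np.exp(-gamma * np.sum((u - v) ** 2))

    def svm_decision(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
        # only the support vectors (alpha_i > 0) contribute to the sum
        return sum(a * yi * kernel(xi, x)
                   for a, yi, xi in zip(alpha, y_train, X_train) if a > 0) - b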
SVM, its Advantages and its Limitations
BUT...
- the SVM solution, α, is usually not sparse enough
- sparsity is an issue because of
  1. classification/testing/prediction time
  2. classifier space
- more importantly, the sparsity of α is not easily controlled
Figure: Number of support vectors vs. the hyperparameter C on the (a) Australian and (b) Heart datasets. The graphs show that the number of support vectors does not have a meaningful relation to the hyperparameter C in the SVM optimization.

Figure: Number of support vectors vs. the hyperparameter C on the (a) Ionosphere and (b) Liver disorders datasets. The graphs show that the number of support vectors does not have a meaningful relation to the hyperparameter C in the SVM optimization.
SVM, its Advantages and its Limitations
BUT...
- the SVM solution, α, is usually not sparse enough
- sparsity is an issue because of
  1. classification/testing/prediction time
  2. classifier space
- more importantly, the sparsity of α is not controllable
- the number of support vectors grows with the sample size
Figure: Number of support vectors vs. the number of samples n on the (a) Adult 1 and (b) Adult 4 datasets. The graphs show that the number of support vectors increases as the number of samples grows.

Figure: Number of support vectors vs. the number of samples n on the (a) Covertype and (b) Ionosphere datasets. The graphs show that the number of support vectors increases as the number of samples grows.
SVM, its Advantages and its Limitations
BUT...
- the SVM solution, α, is usually not sparse enough
- sparsity is an issue because of
  1. classification/testing/prediction time
  2. classifier space
- more importantly, the sparsity of α is not controllable
- the number of support vectors grows with the sample size
- for most real applications, linear SVM is used
  - linear SVM is faster to train and test
  - especially for OVA or OVO
  - high-dimensional data is sparse
Related Work
Related work in square loss minimization for classification
- In his PhD thesis, Rifkin (2002) claims the hinge loss is not the secret to SVM's success
- Rifkin proposes Regularized Least Squares Classification (RLSC):

  min_w ∥y − Xw∥^2 + λ∥w∥^2

- and the nonlinear case (both have closed-form solutions; see the sketch below):

  min_c ∥y − Kc∥^2 + λ c^T K c

- BUT
  - the resulting classifier is not sparse
  - nonlinear RLSC takes longer than SVM to train
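A minimal sketch of the two closed forms, assuming the kernel matrix K is positive definite (setting the gradient of the kernel objective to zero gives (K + λI)c = y):

    import numpy as np

    def rlsc_linear(X, y, lam):
        # w = (X^T X + lam I)^{-1} X^T y
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    def rlsc_kernel(K, y, lam):
        # (K + lam I) c = y, assuming K is positive definite
        n = K.shape[0]
        return np.linalg.solve(K + lam * np.eye(n), y)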
Related Work
Related work in ℓ1-regularization for sparse classifiers
- Yuan et al. (2010) compare several sparse linear classifiers:

  min_w ∥w∥_1 + C ∑_{i=1}^{n} ξ(w; x_i, y_i)

- with the logistic, hinge, and squared hinge losses:
  - ξ_log(w; x_i, y_i) = log(1 + exp(−y w^T x))
  - ξ_L1(w; x_i, y_i) = max(1 − y w^T x, 0)
  - ξ_L2(w; x_i, y_i) = max(1 − y w^T x, 0)^2
- BUT
  - they don't consider those optimizing the square loss
  - the square loss has several good computational and statistical properties
Theory: Square Loss for Classification
Classification (Devroye et al., 1996)

- Find a function g : R^p → {−1, 1} which takes an observation x ∈ R^p and assigns it to y ∈ {−1, 1}
- g is called a classifier
- Probability of error, or probability of misclassification:

  L(g) = P{g(X) ≠ Y}

- The optimal classifier g* is

  g* = argmin_{g : R^p → {1, …, M}} P{g(X) ≠ Y}

- and is called the Bayes classifier.
Theory: Square Loss for Classification
Empirical Risk Minimization
- Minimizing L(g) is only possible with knowledge of the joint distribution of X and Y
- Given T_n = {(x_i, y_i) : i = 1, …, n}, a set of n observations assumed to be sampled i.i.d. from the distribution of (X, Y),
- an estimate of L(g) is

  L_n(g) = (1/n) ∑_{i=1}^{n} I{g(x_i) ≠ y_i}

- called the empirical error; minimizing it directly is computationally intractable
Theory: Square Loss for Classification
Empirical Risk Minimization: Continued
- Consider classifiers of the form

  g_f(x) = −1 if f(x) < 0, and 1 otherwise

- where f : R^p → R is a real-valued function in F
- The probability of error of g_f is

  L(g_f) = L(f) = P{sgn(f(X)) ≠ Y} = P{Y f(X) ≤ 0} = E[I{Y f(X) ≤ 0}]

- The quantity y f(x) is called the margin
Theory: Square Loss for Classification
Empirical Risk Minimization: Continued
- Given T_n, one can estimate L(f) by

  L_n(f) = (1/n) ∑_{i=1}^{n} I{y_i f(x_i) ≤ 0}

- where I{y f(x) ≤ 0} is the 0-1 loss function
- minimizing the empirical error is computationally intractable
- so we seek to minimize a smooth convex upper bound of the 0-1 loss
Theory: Square Loss for Classification
Convex Loss
- The cost functional becomes

  A(f) = E[ϕ(Y f(X))]

- with its corresponding empirical form being

  A_n(f) = (1/n) ∑_{i=1}^{n} ϕ(y_i f(x_i)).
Theory: Square Loss for Classification
Convex Loss: Continued
- One can prove that the minimizer f*(x) of A(f) is such that the induced classifier

  g*(x) = −1 if f*(x) < 0, and 1 otherwise

- is the Bayes classifier (Zhang, 2004; Boucheron et al., 2005), thereby proving Fisher consistency of convex cost functions
Theory: Square Loss for Classification
Table: Well-known convex loss functions and their corresponding minimizing function.

Loss function name     Form of ϕ(v)         Form of f*_ϕ(η)
Square loss            (1 − v)^2            2η − 1
Hinge loss             max(0, 1 − v)        sign(2η − 1)
Squared hinge loss     max(0, 1 − v)^2      2η − 1
Logistic loss          ln(1 + exp(−v))      ln(η / (1 − η))
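A quick numeric check of the table's first row: at a point x with η = P(Y = 1 | x), the conditional square-loss risk of predicting v is η(1 − v)^2 + (1 − η)(1 + v)^2, whose minimizer should be 2η − 1. The grid search below is illustrative only.

    import numpy as np

    eta = 0.7
    v = np.linspace(-2.0, 2.0, 100001)
    risk = eta * (1 - v) ** 2 + (1 - eta) * (1 + v) ** 2  # E[phi(Y v) | x]
    print(v[np.argmin(risk)])  # ~0.4, i.e., 2 * eta - 1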
Theory: Square Loss for Classification
Two important insights
- Convex cost functions are all Fisher consistent.
- SVM estimates sign(2η − 1), whereas a least squares classifier estimates 2η − 1,
  - thus giving us information about the confidence of its predictions,
  - making it more suitable for OVA
Figure: A comparison of convex loss functions (hinge, squared hinge, logistic, and square) plotted against the margin. The misclassification loss is also shown.
Theory: Linear Least Squares Regression and Regularization
The Setting
- We have the linear inverse problem

  b = Ax

  where b ∈ R^n, x ∈ R^p, A ∈ R^{n×p}, and x is unknown.

- Many problems in ML are linear inverse problems, e.g.,
  - regression and classification: y = Xa, where a is unknown
  - sparse coding: x = Da, where a is unknown
Theory: Linear Least Squares Regression and Regularization
Solution: Take I
a = X^{−1} y

- What's the problem here?
- X is almost never invertible in our problems:
  - it needs to be square
  - it needs to have full column rank
Theory: Linear Least Squares Regression and Regularization
Ill-posedness
- Case I: If n = p or n > p, we say that the system of equations is overdetermined.
  - In this case, an exact solution to the inverse problem does not exist.
- Case II: If n < p, the system is underdetermined,
  - and there exist infinitely many solutions.
Theory: Linear Least Squares Regression and Regularization
Solution: Case I, Take II
- Instead of solving the equations y = Xa exactly, minimize only the residual:

  min_a ∥y − Xa∥_2^2

- this yields an approximate solution to the inverse problem, i.e.,

  a = (X^T X)^{−1} X^T y

- The solution exists if X^T X is invertible, i.e., X must have full column rank (see the sketch below)
- otherwise, the least squares solution is no better than the original problem, which is the situation in Case II.
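A minimal sketch of this solution on synthetic data; np.linalg.lstsq is shown alongside the normal equations because it also copes with rank deficiency.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 5))          # n > p with full column rank (Case I)
    a_true = np.arange(1.0, 6.0)
    y = X @ a_true + 0.01 * rng.standard_normal(50)

    a_normal = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
    a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # robust to rank deficiency
    print(np.allclose(a_normal, a_lstsq))            # True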
Theory: Linear Least Squares Regression and Regularization
Regularization

- Regularize to incorporate a priori assumptions about the size and smoothness of the solution,
  - e.g., by using the ℓ2 norm as the measure of size
- Regularization is done using one of the following schemes:

  min (1/2)∥y − Xa∥_2^2  s.t.  ∥a∥_1 ≤ T

  min ∥a∥_1  s.t.  ∥y − Xa∥_2^2 ≤ ϵ

  min (1/2)∥y − Xa∥_2^2 + λ∥a∥_1   (Lagrangian form)

- Note that the schemes are equivalent in theory but not in practice, since the relations between T, ϵ, and λ are unknown.
Theory: Linear Least Squares Regression and Regularization
Solution: Take III
- Regularize, i.e.,

  min (1/2)∥y − Xa∥_2^2 + λ∥a∥_2^2

- called ridge regression, with the unique solution

  a* = (X^T X + λI)^{−1} X^T y.

- Note that X^T X + λI is nonsingular even when X^T X is singular (see the sketch below).
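A small sketch of the nonsingularity point on synthetic data: with n < p, X^T X is rank-deficient, yet the ridge system still has a unique solution.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 30))      # n = 10 < p = 30, so X^T X is singular
    y = rng.standard_normal(10)
    lam = 0.1

    print(np.linalg.matrix_rank(X.T @ X))  # 10, less than p = 30
    a_star = np.linalg.solve(X.T @ X + lam * np.eye(30), X.T @ y)  # unique solution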
Theory: Linear Least Squares Regression and Regularization
When n ≪ p: The High-dimensional Problem
- The standard procedure is to constrain with sparsity.
- To measure sparsity, we introduce the ℓ0 quasi-norm,

  ∥a∥_0 = #{i : a_i ≠ 0}.

- The problem becomes

  min ∥a∥_0  s.t.  y = Xa.

- Because of the combinatorial nature of the ℓ0 norm, ℓ0-regularization is intractable.
Theory: Linear Least Squares Regression and Regularization
Solution: Convex Relaxation
- Basis pursuit (Chen et al., 1995):

  min ∥a∥_1  s.t.  y = Xa

- which is a linear program for which a tractable algorithm exists, in this case the primal-dual interior point method,
  - which solves the approximate problem exactly
- To allow for some noise, Chen et al. proposed basis pursuit de-noising, also called the lasso (Tibshirani, 1996):

  min (1/2)∥y − Xa∥_2^2 + λ∥a∥_1.
Theory: Linear Least Squares Regression and Regularization
Power family of penalties: ℓp norms raised to the pth power
∥a∥_p^p = ∑_i |a_i|^p

- For 1 ≤ p < ∞, the above is convex.
- 0 < p ≤ 1 is the range of p useful for measuring sparsity.
Figure: As p goes to 0, |x|^p approaches the indicator function of x ≠ 0, and ∥x∥_p^p approaches a count of the nonzeros in x (Bruckstein et al., 2009).
ℓ1-regularized Square Loss Minimization for Classification
- We train a classifier sign(f(x)), where f(x) = w^T x + b,
- using the lasso optimization (see the sketch below)
- we find the best λ using cross-validation on the training set
- we know that if we start with the smallest λ in cross-validation, then we have the most compact classifier possible
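A minimal sketch of the proposed classifier, assuming scikit-learn is available: fit the lasso to ±1 labels and classify by the sign of the fit. Note that sklearn's Lasso scales the square loss by 1/(2n), so its alpha corresponds to the λ above only up to that factor.

    import numpy as np
    from sklearn.linear_model import Lasso

    def fit_lasso_classifier(X, y, lam):
        # y must be coded as -1/+1; lam plays the role of lambda in the text
        return Lasso(alpha=lam).fit(X, y)

    def predict_labels(model, X):
        return np.sign(model.predict(X))

    # the size of the resulting classifier is the number of nonzero coefficients:
    # np.count_nonzero(model.coef_)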
Figure: Number of nonzero coefficients vs. the regularization parameter λ on the (a) Colon and (b) Ionosphere datasets. The number of nonzero elements of the solution increases as we increase λ, thus providing us with a means to control the sparsity of the solution.

Figure: Number of nonzero coefficients vs. the regularization parameter λ on the Mushrooms dataset. Again, the number of nonzero elements of the solution increases as we increase λ, thus providing us with a means to control the sparsity of the solution.
ℓ1-regularized Square Loss Minimization for Classification
Table: Data set information: n denotes the number of samples and p denotes the dimension of the samples.

Data set         n         p
adult1           1,605     123
adult4           4,781     123
adult7           16,100    123
australian       690       14
colon            62        2,000
covertype        581,012   54
diabetes         768       8
heart            270       13
ionosphere       351       34
liverdisorders   345       6
mushrooms        8,124     112
ℓ1-regularized Square Loss Minimization for Classification
Table: A comparison of three other classifiers with the lasso on seven data sets. The hyperparameter is denoted by C or λ, depending on the algorithm. The number of nonzero elements in the solution vector is denoted by #nz. The percentage of correctly classified testing samples is denoted by Acc.

                     lasso             SVM               ℓ1-reg L2SVM      ℓ1-reg logreg
Dataset           λ    Acc  #nz     C    Acc  #nz     C    Acc  #nz     C    Acc  #nz
australian        10   86   14      2    86   68      20   86   14      5    87   14
colon             1    77   16      1    87   11      10   75   112     10   83   91
diabetes          10   80   8       1    75   105     10   77   8       10   76   8
heart             6    87   13      1.5  84   30      10   80   13      5    83   13
ionosphere        1    77   16      1    87   11      10   75   28      2    82   31
liverdisorders    2    46   6       2    62   70      5    66   6       5    67   6
mushrooms         2    48   13      2    100  90      10   100  96      20   100  95
ℓ1-regularized Square Loss Minimization for Classification
Table: Results for ridge regression. The solution is dense, and hence the number of nonzero elements equals the number of features.

Dataset           λ    Acc   #nz
australian        30   86    14
colon             6    87    2,000
diabetes          40   76    8
heart             9    86    13
ionosphere        10   74    34
liverdisorders    8    35    6
mushrooms         20   49    112
Representation by sparse approximation

- To motivate this idea, let's look at
  - feature learning with sparse coding, and
  - sparse representation classification (SRC), an example of exemplar-based sparse approximation
Sparse Coding
Unsupervised feature learning: Application to image classification
x = Da
- An example is the recent work by Coates and Ng (2011),
  - where x is the input feature vector, e.g., a vectorized image patch or a SIFT descriptor,
  - a is the higher-dimensional sparse representation of x,
  - and D is usually learned
Figure: Image classification (Coates and Ng, 2011).
Sparse Representation Classification
Multiclass classification (Wright et al., 2009)
- D := {(x_i, y_i) : x_i ∈ R^m, y_i ∈ {1, …, c}, i ∈ {1, …, N}}
- Given a test sample z (see the sketch below):
  1. Solve min_{α ∈ R^N} ∥α∥_1 subject to ∥z − Dα∥_2^2 ≤ σ
  2. Define {α_y : y ∈ {1, …, c}}, where [α_y]_i = α_i if x_i belongs to class y, and 0 otherwise
  3. Construct X(α) := {x_y(α) = Dα_y : y ∈ {1, …, c}}
  4. Predict y := argmin_{y ∈ {1, …, c}} ∥z − x_y(α)∥_2^2
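A hedged sketch of these four steps. The constrained ℓ1 problem of step 1 is replaced by its Lagrangian (lasso) form, so lam stands in for the noise level σ and would need tuning; D is the m × N matrix whose columns are the training samples.

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_predict(z, D, labels, lam=0.01):
        # step 1: sparse code of z over the dictionary of training samples
        alpha = Lasso(alpha=lam, fit_intercept=False).fit(D, z).coef_
        best_y, best_res = None, np.inf
        for y in np.unique(labels):
            a_y = np.where(labels == y, alpha, 0.0)  # step 2: class-y coefficients
            res = np.sum((z - D @ a_y) ** 2)         # step 3: class reconstruction
            if res < best_res:                       # step 4: smallest residual wins
                best_y, best_res = y, res
        return best_y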
Extension to Regression: SPARROW
Global methods
- In parametric approaches, the form of the regression function is known
- e.g., in multiple linear regression (MLR) we assume

  f(z) = ∑_{j=1}^{M} β_j z_j + ϵ

- we can also add higher-order terms and still have a model that is linear in the parameters β_j, γ_j:

  f(z) = ∑_{j=1}^{M} (β_j z_j + γ_j z_j^2) + ϵ
Extension to Regression: SPARROW
Local methods
- A successful nonparametric approach to regression is local estimation (Hastie and Loader, 1993; Hardle and Linton, 1994; Ruppert and Wand, 1994)
- In local methods:

  f(z) = ∑_{i=1}^{N} l_i(z) y_i + ϵ
Extension to Regression: SPARROW
Local methods: Continued
- E.g., in k-nearest neighbor regression (k-NNR),

  f(z) = ∑_{i=1}^{N} ( α_i(z) / ∑_{p=1}^{N} α_p(z) ) y_i

- where α_i(z) := I_{N_k(z)}(x_i)
- and N_k(z) ⊂ D is the set of the k nearest neighbors of z
Extension to Regression: SPARROW
Local methods: Continued
- In weighted k-NNR (Wk-NNR),

  f(z) = ∑_{i=1}^{N} ( α_i(z) / ∑_{p=1}^{N} α_p(z) ) y_i

- α_i(z) := S(z, x_i)^{−1} I_{N_k(z)}(x_i)
- where S(z, x_i) = (z − x_i)^T V^{−1} (z − x_i) is the scaled Euclidean distance (see the sketch below)
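A small sketch covering both estimators; k-NNR is the special case of uniform weights. V is assumed diagonal (per-feature scales), one common choice the slides leave open.

    import numpy as np

    def wknn_regress(z, X, y, k, V_diag, weighted=True):
        S = np.sum((X - z) ** 2 / V_diag, axis=1)  # scaled squared distances S(z, x_i)
        nn = np.argsort(S)[:k]                     # the neighborhood N_k(z)
        a = 1.0 / np.maximum(S[nn], 1e-12) if weighted else np.ones(k)
        return np.sum(a * y[nn]) / np.sum(a)       # normalized weighted average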
Extension to Regression: SPARROW
Local methods: Continued
- In local methods: estimate the regression function locally by a simple parametric model
- In local polynomial regression: estimate the regression function locally by a Taylor polynomial
- This is what happens in SPARROW, as we will explain
ℓ1-regularized Square Loss Minimization for Reconstruction
Sparrow
ℓ1-regularized Square Loss Minimization for Reconstruction
I meant this sparrow
ℓ1-regularized Square Loss Minimization for Reconstruction
SPARROW is a local method
- Before we get into the details,
- let's see a few examples showing the benefits of local methods
- then we'll talk about SPARROW
Figure: Our generated dataset: y_i = f(x_i) + ϵ_i, where f(x) = (x^3 + x^2) I(x) + sin(x) I(−x).

Figure: Multiple linear regression with first-, second-, and third-order terms.

Figure: ϵ-support vector regression with an RBF kernel.

Figure: 4-nearest neighbor regression.
ℓ1-regularized Square Loss Minimization for Reconstruction
Effective weights in SPARROW
- In local methods:

  f(z) = ∑_{i=1}^{N} l_i(z) y_i

- Now we define l_i(z)
ℓ1-regularized Square Loss Minimization for Reconstruction
Local estimation by a Taylor polynomial
- To obtain the local quadratic estimate of the regression function at z,
- we can approximate f(x) about z by a second-degree Taylor polynomial:

  f(x) ≈ f(z) + (x − z)^T θ_z + (1/2)(x − z)^T H_z (x − z)

- where θ_z := ∇f(z) is the gradient of f(x) and H_z := ∇²f(z) is its Hessian, both evaluated at z
ℓ1-regularized Square Loss Minimization for Reconstruction
Local estimation by a Taylor polynomial: Continued
We need to solve the locally weighted least squares problem

min_{f(z), θ_z, H_z} ∑_{i∈Ω} α_i(z) [ y_i − f(z) − (x_i − z)^T θ_z − (1/2)(x_i − z)^T H_z (x_i − z) ]^2
ℓ1-regularized Square Loss Minimization for Reconstruction
Local estimation by a Taylor polynomial: Continued
- This can be expressed as

  min_{Θ_z} ∥ A_z^{1/2} [ y − X_z Θ_z ] ∥_2^2

- where a_ii = α_i and y := [y_1, y_2, …, y_N]^T
- X_z is the design matrix whose ith row is [ 1, (x_i − z)^T, vech^T((x_i − z)(x_i − z)^T) ]
- and the parameter supervector is Θ_z := [ f(z), θ_z, vech(H_z) ]^T
ℓ1-regularized Square Loss Minimization for Reconstruction
Local estimation by a Taylor polynomial: Continued
- The parameters are defined by the least squares solution:

  Θ_z = (X_z^T A_z X_z)^{−1} X_z^T A_z y

- And so the local quadratic estimate is (see the sketch below)

  f(z) = e_1^T (X_z^T A_z X_z)^{−1} X_z^T A_z y

- Since f(z) = ∑_{i=1}^{N} l_i(z) y_i, the ith effective weight for SPARROW is

  l_i(z, D) = e_i^T A_z^T X_z (X_z^T A_z X_z)^{−1} e_1
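A minimal sketch of this estimate: build X_z and A_z as defined above and read the fitted f(z) off the first parameter. It assumes enough observations carry nonzero weight for X_z^T A_z X_z to be invertible.

    import numpy as np

    def vech(M):
        # stack the lower triangle of a symmetric matrix into a vector
        return M[np.tril_indices(M.shape[0])]

    def local_quadratic_estimate(z, X, y, alpha):
        rows = [np.concatenate(([1.0], x - z, vech(np.outer(x - z, x - z))))
                for x in X]
        Xz = np.array(rows)
        A = np.diag(alpha)                       # observation weights alpha_i(z)
        theta = np.linalg.solve(Xz.T @ A @ Xz, Xz.T @ A @ y)
        return theta[0]                          # e_1^T Theta_z, the estimate of f(z)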
ℓ1-regularized Square Loss Minimization for Reconstruction
Local estimation by a Taylor polynomial: Continued
- The local constant regression estimate is

  f(z) = (1^T A_z 1)^{−1} 1^T A_z y = ∑_{i∈Ω} α_i(z) y_i / ∑_{k∈Ω} α_k(z).

- Look familiar?
ℓ1-regularized Square Loss Minimization for Reconstruction
Observation weights in SPARROW
- To find the α_i, we solve the following problem (Chen et al., 1995):

  min_{s ∈ R^N} ∥s∥_1  subject to  ∥z − Ds∥_2^2 / ∥z∥_2^2 ≤ ϵ^2

- where ϵ^2 > 0 limits the signal-to-approximation-error ratio
- and D := [ x_1/∥x_1∥_2, x_2/∥x_2∥_2, …, x_N/∥x_N∥_2 ]
- Finally, the ith observation weight in SPARROW is (see the sketch below)

  α_i(z) := [ S(z, x_i) / min_{j∈Ω} S(z, x_j) ]^{−1} s_i / ∥z∥_2
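A hedged sketch of these weights: the constrained problem is again replaced by a lasso with an untuned lam, and S is taken as the plain squared Euclidean distance (V = I), so this is illustrative rather than a faithful reimplementation.

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparrow_weights(z, X, lam=0.01):
        # columns of D are the normalized training samples x_i / ||x_i||_2
        D = (X / np.linalg.norm(X, axis=1, keepdims=True)).T
        s = Lasso(alpha=lam, fit_intercept=False).fit(D, z).coef_
        S = np.sum((X - z) ** 2, axis=1)         # S(z, x_i) with V = I
        return (S.min() / np.maximum(S, 1e-12)) * s / np.linalg.norm(z)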
Empirical Evaluation of SPARROW
Table: Summary of the four datasets we test. The last column indicates the tuned parameter k used in the experiments involving k-NNR and Wk-NNR.

Dataset    # observations (N)    # attributes (M)    k
abalone    4,177                 8                   9
bodyfat    252                   14                  4
housing    506                   13                  2
mpg        392                   7                   4
Figure: Boxplots of the 10-fold cross-validation estimate of mean squared error (100 independent runs) for MLR, NWR, LLKR, k-NNR, Wk-NNR, and C-SPARROW on the (a) Abalone, (b) Bodyfat (MSE × 10^−5), (c) Housing, and (d) MPG datasets. Each box delimits the 25th to 75th percentiles, and the red line marks the median. Extrema are marked by whiskers, and outliers by pluses.
Empirical Evaluation of SPARROW
Linear SPARROW
- L-SPARROW should perform better than C-SPARROW because it is a higher-order model
- But a problem with higher-order models is that their solutions can become unstable
- We resolve the problem by solving

  min_{Θ_z} ∥ A_z^{1/2} [ y − X_z Θ_z ] ∥_2^2 + λ∥Θ_z∥_2^2

- The solution becomes

  Θ_z = (X_z^T A_z X_z + λI)^{−1} X_z^T A_z y.
Empirical Evaluation of SPARROW
Table: A comparison of the MSE estimates obtained by 10 trials of 10-fold cross-validation of C-SPARROW, and of L-SPARROW without and with ridge regression, on the four datasets. The last column denotes the ridge parameter used to obtain the L-SPARROW estimate.

Dataset    C-SPARROW     L-SPARROW     L-SPARROW w/ RR    λ
abalone    5             16            988                10^−3
bodyfat    5 × 10^−5     35 × 10^−5    960 × 10^−5        10^−6
housing    10            45            4,304              10^−4
mpg        7             8             6,335              10^−3
kNN vs SRC
Table: A comparison of kNN and SRC on five multiclass classification data sets.

Dataset    n      p     #classes    k     kNN    SRC
dna        2,000  180   3           125   86     86
glass      214    9     6           2     70     65
iris       150    4     3           6     95     72
vowel      528    10    11          2     94     84
wine       178    13    3           7     97     99
Conclusions
- ℓ1-regularized square loss minimization for classification is a success, both computationally and statistically
- ℓ1-regularized square loss minimization for reconstruction is not worth it:
- simpler methods like kNN classification and Wk-NNR are at least as good
Recommendations for Future Work

- Replace dictionary learning and sparse coding with k-means and kNN for feature learning in image classification tasks
- Replace the x_i's with ϕ(x_i) to get nonlinear classification
- Perform an analysis of the computational complexity of ℓ1-regularized square loss minimization methods, like (Yuan et al., 2010)
- Try the elastic net (Prof. Rahmati's suggestion):

  min ∥y − Xa∥_2^2 + λ_2∥a∥_2^2 + λ_1∥a∥_1

- Prof. Ebadzadeh's initial proposal on regularizing α, the SVM dual variable, has been done before by Osuna and Girosi (1999) in a paper entitled "Reducing the run-time complexity in support vector machines"
References
Stephane Boucheron, Olivier Bousquet, and Gabor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Alfred M. Bruckstein, David L. Donoho, and Michael Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.

Scott S. Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. Technical Report 479, Department of Statistics, Stanford University, May 1995.

Adam Coates and Andrew Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning (ICML), pages 921–928, 2011.

L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
W. Hardle and O. Linton. Applied nonparametric methods. Technical Report 1069, Yale University, 1994.
T. J. Hastie and C. Loader. Local regression: Automatic kernel carpentry. Statistical Science, 8(2):120–129, 1993.

Edgar E. Osuna and Federico Girosi. Reducing the run-time complexity in support vector machines. In Bernhard Scholkopf, Christopher J. C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods, pages 271–283. MIT Press, Cambridge, MA, USA, 1999.

Ryan Rifkin. Everything old is new again: a fresh look at historical approaches in machine learning. PhD thesis, Sloan School of Management, Massachusetts Institute of Technology, 2002.

D. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, 22:1346–1370, 1994.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996.
John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:210–227, 2009.

Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison of optimization methods and software for large-scale ℓ1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–134, March 2004.
Acknowledgements
- Committee:
  - Prof. Mohammad Rahmati
  - Prof. Narollah Moghaddam Charkari
  - Prof. Mohammad Mehdi Ebadzadeh
  - Prof. Saeed Shiry
- Special thanks to Sheida Bijani, Isaac Nickaein, and Mina Shirvani
- Prof. Bob Sturm
- Last, and certainly not least, my parents and my brother
In memory of Uncle Asadollah Noorzad.