Study of Sparse Classifier Design Algorithms


Page 1: Study of Sparse Classifier Design Algorithms

Study of Sparse Classifier Design Algorithms

Sachin Nagargoje, 08449

Advisor : Prof. Shirish Shevade

20th June 2013

Page 2: Study of Sparse Classifier Design Algorithms

Outline

Introduction
Sparsity w.r.t. features
◦ Using regularizer/penalty
  Traditional regularizer/penalty
  Other regularizer/penalty
  SparseNet
Sparsity w.r.t. support vectors / basis points
◦ Various techniques
◦ SVM with L1 regularizer
◦ Greedy methods
Proposed methods
Experimental results
Conclusion / Future work

2

Page 3: Study of Sparse Classifier Design Algorithms

3

Introduction

Page 4: Study of Sparse Classifier Design Algorithms

What is Sparsity?

Sparsity w.r.t. features in the model
◦ e.g., number of non-zero coefficients of the model

Sparsity w.r.t. Support Vectors

4

Support vectors: x1, …, xd

Vapnik 1992; Vapnik et al., 1995

Page 5: Study of Sparse Classifier Design Algorithms

Need for Sparsity?
• Faster prediction
• Decreases the complexity of the model
• In the case of sparsity w.r.t. features, to remove:
  – Redundant features
  – Irrelevant features
  – Noisy features
• As the number of features increases:
  – Data becomes sparse in high dimensions
  – It becomes difficult to achieve low generalization error

5

Page 6: Study of Sparse Classifier Design Algorithms

Traditional ways to achieve Sparsity

• Filter
  – Select features before the ML algorithm is run
  • E.g. rank features and eliminate
• Wrapper
  – Find the best subset of features using ML techniques
  • E.g. forward selection, random selection
• Embedded
  – Feature selection as part of the ML algorithm
  • E.g. L1-regularized linear regression

6

Page 7: Study of Sparse Classifier Design Algorithms

Sparsity w.r.t. features

7

Page 8: Study of Sparse Classifier Design Algorithms

Using Regularizer/Penalty

Data: x = [x1, x2, …, xn]; labels: y = [y1, y2, …, yn]^T; model: w = [w1, w2, …, wp]
A type of embedded approach
E.g. in the case of linear least-squares regression
R represents the regularizer, e.g. L0 or L1
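The objective itself appears as an image in the original slide; for linear least squares with a regularizer R it presumably takes the standard form:

```latex
\min_{w \in \mathbb{R}^p} \;\; \frac{1}{2} \sum_{i=1}^{n} \bigl( y_i - w^{\top} x_i \bigr)^2 \;+\; \lambda\, R(w)
```

where λ ≥ 0 controls the trade-off between data fit and sparsity of w.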

8
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.

Page 9: Study of Sparse Classifier Design Algorithms

Traditional regularizers

L0 Penalty

L1 Penalty
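The penalty definitions were shown as images on the slide; in the usual notation they are:

```latex
R_{L_0}(w) \;=\; \|w\|_0 \;=\; \#\{\, j : w_j \neq 0 \,\},
\qquad
R_{L_1}(w) \;=\; \|w\|_1 \;=\; \sum_{j=1}^{p} |w_j|
```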

9

Page 10: Study of Sparse Classifier Design Algorithms

Traditional regularizers (contd.)

Example: consider a rainfall prediction problem, and assume both models below have the same training error.

Model 1: L0 penalty = 1 + 1 + 1 + 1 + 1 = 5; L1 penalty = |3| + |-5| + |8| + |-4| + |1| = 21

Model 2: L0 penalty = 1 + 0 + 1 + 1 + 0 = 3; L1 penalty = |-20| + |0| + |7| + |18| + |0| = 45

Since L1 both shrinks and selects, it can prefer the denser model: here Model 1 has the smaller L1 penalty even though Model 2 is sparser.

10

Rainfall Prediction

Feature        Model 1   Model 2
Temperature        3       -20
Outlook           -5         0
Pressure           8         7
Wind              -4        18
Humidity           1         0
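As a quick sanity check of the table, this short Python snippet (illustrative only) reproduces the penalty values quoted above:

```python
model_1 = [3, -5, 8, -4, 1]    # Temperature, Outlook, Pressure, Wind, Humidity
model_2 = [-20, 0, 7, 18, 0]

def l0_penalty(w):
    # number of non-zero coefficients
    return sum(1 for v in w if v != 0)

def l1_penalty(w):
    # sum of absolute values of the coefficients
    return sum(abs(v) for v in w)

print(l0_penalty(model_1), l1_penalty(model_1))  # 5 21
print(l0_penalty(model_2), l1_penalty(model_2))  # 3 45
```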

Page 11: Study of Sparse Classifier Design Algorithms

Other regularizer

11

Page 12: Study of Sparse Classifier Design Algorithms

MC+
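The penalty formula is not reproduced in the transcript; SparseNet works with the MC+ (minimax concave) family of Zhang (2010), usually written as:

```latex
P_{\lambda,\gamma}(|w|)
\;=\; \lambda \int_{0}^{|w|} \Bigl( 1 - \frac{x}{\gamma\lambda} \Bigr)_{+} dx
\;=\;
\begin{cases}
\lambda |w| - \dfrac{w^{2}}{2\gamma}, & |w| \le \gamma\lambda,\\[4pt]
\dfrac{\gamma\lambda^{2}}{2}, & |w| > \gamma\lambda,
\end{cases}
\qquad \gamma > 1 .
```

As γ → ∞ the penalty approaches the L1 penalty, and as γ → 1+ it approaches hard thresholding (an L0-like penalty), so the family interpolates between the two traditional regularizers.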

12

Page 13: Study of Sparse Classifier Design Algorithms

SparseNet

Uses coordinate descent with a non-convex penalty
Let us consider the least-squares problem for a single-feature data matrix:
It has a closed-form solution as:
Our goal is to minimize:
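The three equations referred to above are images in the original deck; assuming a single feature vector x and the notation of the previous slides, they are presumably:

```latex
% Univariate least squares and its closed-form solution:
\min_{w} \; \tfrac{1}{2}\,\| y - x w \|_2^2
\qquad\Longrightarrow\qquad
\hat{w} \;=\; \frac{x^{\top} y}{x^{\top} x}

% Penalized objective minimized for this single feature:
\min_{w} \; \tfrac{1}{2}\,\| y - x w \|_2^2 \;+\; \lambda\, P\bigl(|w| ; \lambda ; \gamma\bigr)
```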

13
Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.

Page 14: Study of Sparse Classifier Design Algorithms

SparseNet (cont.)
• Let us define a soft-threshold operator as below:
• There are three cases here: w > 0, w < 0, w = 0
• Convert the multiple-feature function into a single-feature function
• Apply coordinate descent

14
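The operator definition on this slide is an image; for the L1 penalty (and a standardized feature) the soft-threshold operator covering the three cases above is:

```latex
S(\hat{w}, \lambda)
\;=\; \operatorname{sign}(\hat{w})\,\bigl( |\hat{w}| - \lambda \bigr)_{+}
\;=\;
\begin{cases}
\hat{w} - \lambda, & \hat{w} > \lambda,\\
\hat{w} + \lambda, & \hat{w} < -\lambda,\\
0, & |\hat{w}| \le \lambda .
\end{cases}
```

For the MC+ penalty, SparseNet replaces this with a generalized threshold operator that interpolates between soft and hard thresholding.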

Rahul Mazumder, Jerome Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with non-convex penalties, 2009

Page 15: Study of Sparse Classifier Design Algorithms

SparseNet (cont.)

Now let us extend the problem to a data matrix with multiple features.
The soft-threshold operator function therefore becomes:
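The multi-feature operator is again an image in the original slide; assuming standardized columns, the coordinate-wise update has the usual partial-residual form:

```latex
\tilde{w}_j \;\leftarrow\; S_{\gamma}\!\Bigl( \sum_{i=1}^{n} x_{ij}\, r_i^{(j)}, \; \lambda \Bigr),
\qquad
r_i^{(j)} \;=\; y_i - \sum_{k \neq j} x_{ik}\, \tilde{w}_k ,
```

i.e., each coordinate is refit against the residual of all the other coordinates, cycling over j until convergence.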

15
- Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
- Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Technical report, Annals of Applied Statistics, 2007.

Page 16: Study of Sparse Classifier Design Algorithms

SparseNet (Algorithm)
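The algorithm box is not reproduced in the transcript. As an illustration of the coordinate-descent machinery, here is a minimal lasso-style sketch in Python; the full SparseNet algorithm additionally sweeps a grid of γ values for the MC+ penalty and warm-starts along the λ path, and the names and defaults below are my own:

```python
import numpy as np

def soft_threshold(z, lam):
    # L1 soft-threshold operator S(z, lambda)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1.
    Assumes the columns of X are standardized (mean 0, unit norm)."""
    n, p = X.shape
    w = np.zeros(p)
    r = y - X @ w                      # current residual
    for _ in range(n_iters):
        for j in range(p):
            r = r + X[:, j] * w[j]     # remove coordinate j from the fit
            z = X[:, j] @ r            # univariate least-squares solution
            w[j] = soft_threshold(z, lam)
            r = r - X[:, j] * w[j]     # put the updated coordinate back
    return w
```

Replacing soft_threshold with the MC+ threshold operator gives the inner loop of the non-convex variant.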

16

Page 17: Study of Sparse Classifier Design Algorithms

SparseNet with L1 Penalty

Using the L1 penalty

17
Slice Localization Dataset, A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.

Page 18: Study of Sparse Classifier Design Algorithms

SparseNet with MC+ Penalty

Using MC+ Penalty

18
Slice Localization Dataset, A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.

Page 19: Study of Sparse Classifier Design Algorithms

Sparsity w.r.t. Support Vectors / Basis Points

19

Page 20: Study of Sparse Classifier Design Algorithms

Sparsity w.r.t. Support Vectors

Kernel-based learning algorithms
f(x) is a linear combination of terms of the form:
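The expression shown on the slide is an image; for a kernel K, the decision function presumably has the standard expansion:

```latex
f(x) \;=\; \sum_{i=1}^{n} \alpha_i \, K(x, x_i) \;+\; b ,
```

so sparsity w.r.t. support vectors means that most coefficients α_i are zero and prediction only touches the few points with α_i ≠ 0.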

20

Page 21: Study of Sparse Classifier Design Algorithms

Various techniques

Support Vector Machine (SVM)
◦ SVM with L1 penalty
Greedy methods (wrapper):
◦ Kernel Matching Pursuit (KMP)
◦ Building SVM with sparser complexity (Keerthi et al.)
Proposed method:
◦ Preprocess the training points using filtering, then apply wrapper methods

21

Page 22: Study of Sparse Classifier Design Algorithms

SVM with L1 regularizer

• Settings: data
• SVM optimization:
• SVM with L1 penalty:
• Solved using linear programming
• Settings used:
  – Lambda: [1/100, 1/10, 1, 10, 100]; Sigma: [1/16, 1/4, 1, 4, 16]

22
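The optimization problems on this slide are images. A common linear-programming formulation of a kernel machine with an L1 penalty on the expansion coefficients (a sketch, not necessarily the exact formulation used here) is:

```latex
\min_{\alpha,\, b,\, \xi}\;\; \lambda \sum_{j=1}^{n} |\alpha_j| \;+\; \sum_{i=1}^{n} \xi_i
\quad\text{s.t.}\quad
y_i \Bigl( \sum_{j=1}^{n} \alpha_j K(x_i, x_j) + b \Bigr) \;\ge\; 1 - \xi_i,
\qquad \xi_i \ge 0, \;\; i = 1, \dots, n .
```

Writing each α_j as the difference of two non-negative variables turns the absolute values into linear terms, which is what makes the problem solvable by linear programming.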

Page 23: Study of Sparse Classifier Design Algorithms

Decision Boundaries and Support Vectors

23

SVM with L1 regularizer

RBF Kernel on dummy data

Poly & RBF Kernel on Banana data

Page 24: Study of Sparse Classifier Design Algorithms

24

SVM with L1 regularizer

Our formulation gave better (sparser) results than the standard SVM

Datasets

Page 25: Study of Sparse Classifier Design Algorithms

25

Greedy methods

Page 26: Study of Sparse Classifier Design Algorithms

Kernel Matching Pursuit

Inspired by the signal processing community
◦ Decomposes any signal into a linear expansion of waveforms selected from a dictionary of functions
The set of basis points is constructed in a greedy fashion
Removes the requirement of positive definiteness of the kernel matrix
Allows us to directly control the sparsity (in terms of the number of support vectors)

26
Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 2002.

Page 27: Study of Sparse Classifier Design Algorithms

Kernel Matching Pursuit

Setup:
◦ D, a finite dictionary of functions
◦ l = number of training points
◦ n = number of support vectors chosen so far
◦ At the (n+1)-th step, the next basis function and its weight are chosen such that:
◦ Predictor:
  where the γ_k are the indexes of the support vectors
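The missing symbols above were equation images; following Vincent and Bengio (2002), the greedy step and the predictor are presumably:

```latex
(g_{n+1},\, w_{n+1}) \;=\; \arg\min_{g \in \mathcal{D},\; w \in \mathbb{R}}
\;\; \bigl\| y - \bigl( f_n + w\, g \bigr) \bigr\|^2,
\qquad
f_N(x) \;=\; \sum_{k=1}^{N} w_k \, K(x, x_{\gamma_k}),
```

where D = { K(·, x_i) : i = 1, …, l } and γ_1, …, γ_N are the indexes of the chosen support vectors.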

27
Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 2002.

Page 28: Study of Sparse Classifier Design Algorithms

Basis Points versus Support Vectors

28

- Dataset: http://mldata.org/repository/data/viewslug/banana-ida/
- S. Sathiya Keerthi, et al. Building support vector machines with reduced classifier complexity. JMLR, 2006.
- Vladimir Vapnik, Steven E. Golowich, and Alex J. Smola. Support vector method for function approximation, regression estimation and signal processing. NIPS, 1996.

Basis Points / Support Vectors

Page 29: Study of Sparse Classifier Design Algorithms

29

Proposed methods

Page 30: Study of Sparse Classifier Design Algorithms

Two-step process:
Step 1: Choose a subset of the training set using
◦ Modified BIRCH clustering algorithm
◦ K-means clustering
◦ GMM clustering
Step 2: Apply a greedy algorithm
◦ Kernel Matching Pursuit (KMP)
◦ Building SVM with sparser complexity (Keerthi et al.)
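To make the two-step idea concrete, here is a rough Python sketch using k-means for step 1 and a simple KMP-style greedy selection (with back-fitting) for step 2; the RBF kernel, the helper names, and the parameters are illustrative, not the exact implementations used in the experiments:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def select_basis_greedy(X, y, n_candidates=50, n_basis=10, gamma=1.0):
    """Step 1: cluster the training points; use the centroids as candidate basis points.
    Step 2: greedily pick basis points (KMP-style) that most reduce the squared error."""
    centroids = KMeans(n_clusters=n_candidates, n_init=10).fit(X).cluster_centers_
    K = rbf_kernel(X, centroids, gamma=gamma)       # n_samples x n_candidates
    chosen = []
    residual = y.astype(float).copy()
    for _ in range(n_basis):
        # score each candidate column by its correlation with the current residual
        scores = np.abs(K.T @ residual) / (np.linalg.norm(K, axis=0) + 1e-12)
        scores[chosen] = -np.inf                    # do not reselect a basis point
        j = int(np.argmax(scores))
        chosen.append(j)
        # back-fitting: refit all weights on the chosen columns
        w, *_ = np.linalg.lstsq(K[:, chosen], y, rcond=None)
        residual = y - K[:, chosen] @ w
    return centroids[chosen], w
```

Step 1 restricts the dictionary from all l training points to a much smaller set of cluster centroids, which is what saves kernel evaluations in the greedy search.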

30

Proposed methods

[Flow diagram: Training points → Modified BIRCH / K-means / GMM clustering → Basis points → KMP / Keerthi et al. → Model]

- S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 2006.
- Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation, 2002.

Page 31: Study of Sparse Classifier Design Algorithms

BIRCH basics

Balanced Iterative Reducing and Clustering using Hierarchies
Uses one scan over the dataset, and therefore suits large datasets
Each cluster's CF vector is defined as (N, LS, SS): N = number of data points, LS = linear sum, SS = squared sum
Merging of two clusters:
◦ CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
CF Tree
◦ Height-balanced tree
◦ Two factors:
  B (branching factor): each non-leaf node contains at most B entries [CFi, childi], i = 1..B, where CFi is the sub-cluster represented by childi; a leaf node contains at most L entries [CFi], i = 1..L
  T (threshold): radius/diameter of a cluster
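A minimal Python sketch of the clustering-feature bookkeeping described above; the class name and the radius formula follow the usual BIRCH definitions, and the threshold test itself would live in the CF-tree insertion code:

```python
import numpy as np

class ClusteringFeature:
    """BIRCH clustering feature (N, LS, SS) for the points absorbed so far."""

    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1                 # N: number of points
        self.ls = p.copy()         # LS: linear sum of the points
        self.ss = float(p @ p)     # SS: sum of squared norms of the points

    def merge(self, other):
        # CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # average distance of member points from the centroid
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))
```

During insertion, a new point is absorbed into the closest leaf sub-cluster only if the merged CF still satisfies the threshold T; otherwise a new sub-cluster (and possibly a node split) is created, as in the examples that follow.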

31

Page 32: Study of Sparse Classifier Design Algorithms

BIRCH Example: Insertion into the CF Tree (B = 3, L = 3)

[CF-tree diagram: a root pointing to leaf nodes LN1, LN2, LN3, which hold sub-clusters sc1-sc8; a new subcluster is being inserted]

32
- www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt
- Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96.

Page 33: Study of Sparse Classifier Design Algorithms

BIRCH Example (cont.)

[CF-tree diagram: after the new subcluster is inserted, leaf node LN1 is split into LN1' and LN1"]

Here, the branching factor of the leaf node exceeds 3, so LN1 is split.

33

Page 34: Study of Sparse Classifier Design Algorithms

BIRCH Example (cont.)

[CF-tree diagram: the root now holds LN1', LN1", LN2, LN3 and is itself split into non-leaf nodes NLN1 and NLN2]

Here, the branching factor of the non-leaf node exceeds 3, so the root is split and the height of the CF tree increases by one.

34

Page 35: Study of Sparse Classifier Design Algorithms

BIRCH Example (cont.)

[CF-tree diagram: a new point arrives and is routed down the tree, root → NLN1/NLN2 → leaf nodes → sub-clusters sc1-sc8]

35

Page 36: Study of Sparse Classifier Design Algorithms

BIRCH Example (cont.)

[CF-tree diagram: the new point falls inside an existing leaf node and forms a new subcluster sc9]

Here, an alien point falls inside a leaf node, so the node is broken into parts. The branching factor of the leaf node exceeds 3, so LN3 should split.

36

Page 37: Study of Sparse Classifier Design Algorithms

Clusters using modified BIRCH

[Scatter plot: clusters found by modified BIRCH, with their centroids marked]

39

Page 38: Study of Sparse Classifier Design Algorithms

41

Experiments

Page 39: Study of Sparse Classifier Design Algorithms

42

Datasets Used

Page 40: Study of Sparse Classifier Design Algorithms

Modified BIRCH with KMP

Column A: After Modified BIRCH, optimization using KMP's basic algorithm
Column B: KMP with the basic algorithm
Column C: After Modified BIRCH, optimization using KMP's back-fitting algorithm
Column D: KMP with the back-fitting algorithm

Dataset          A #Basis  A Test Acc   B #Basis  B Test Acc   C #Basis  C Test Acc   D #Basis  D Test Acc
banana              44.1     0.7896        40       0.8893       81.44     0.8746        80       0.8758
breast-cancer       28.5     0.7273        40       0.7169       81.7      0.7156        80       0.7052
diabetis            48.9     0.7610        47       0.7653      185.7      0.7473       187       0.7497
flare-solar         78.5     0.6053       133       0.6593      311.2      0.5923       266       0.6210
german              75.4     0.7553       140       0.7683      252.3      0.7230       280       0.7487
heart               28.9     0.7910        34       0.8300       61.5      0.8060        68       0.7990
image              267.6     0.9567       260       0.9468      275.7      0.9566       260       0.9275
ringnorm           128.4     0.8993       160       0.8529      133.5      0.9104       160       0.8562
splice             385.9     0.7847       400       0.8726      394.4      0.8106       400       0.7033
thyroid             33       0.9533        56       0.9547       44.9      0.9493        28       0.9453
titanic              5.8     0.7701         8       0.7755       48.3      0.7783        60       0.7727
twonorm             21.3     0.9680        40       0.9746      103.6      0.9597        80       0.9544
waveform            74.7     0.7810        80       0.8909       92.3      0.8667        80       0.8717
svmguide1          295       0.9693       309       0.9693      740        0.9658       618       0.9663
svmguide3          297       0.8780       249       0.7073

43

Our formulation gave decent results (highlighted in red in the original slide)

Page 41: Study of Sparse Classifier Design Algorithms

Multi-class Modified BIRCH with KMP

44

All multi-class datasets gave better results

Page 42: Study of Sparse Classifier Design Algorithms

Modified BIRCH with Keerthi et al.'s method

Column A: Using modified BIRCH and Keerthi et al.'s method
Column B: Using Keerthi et al.'s method

Dataset          A #Basis   A Test Acc   B #Basis   B Test Acc
banana              16.38     0.8798        20        0.8879
breast-cancer       16.30     0.7195        20        0.7221
diabetis            36.40     0.7593        23        0.7757
flare-solar         63.00     0.6063        67        0.6678
german              50.60     0.7503        70        0.7660
heart               11.10     0.8070         9        0.8330
image               55.00     0.8671        65        0.9440
ringnorm           133.50     0.7776        80        0.9829
splice              78.90     0.7312       100        0.8421
thyroid              9.00     0.9347         7        0.9467
titanic             10.30     0.7635         8        0.7753
twonorm             20.70     0.9675        20        0.9757
waveform            20.63     0.8665        20        0.8962
gisette            275.00     0.9190       300        0.9740
svmguide1          154.00     0.9648       154        0.9678
svmguide3           72.00     0.7317        62        0.7317
w1a                144.00     0.9702       124        0.9766
w2a                131.00     0.9706       174        0.9796
w3a                196.00     0.9711       246        0.9808
w4a                274.00     0.9702       368        0.9825
w5a                369.00     0.9705       494        0.9730
w6a                571.00     0.9713       859        0.9733
w7a                793.00     0.9711      1235        0.9747
w8a              1,370.00     0.9740      2487        0.9823

45

Page 43: Study of Sparse Classifier Design Algorithms

K-means and GMM with KMP

Column A: SVM
Column B: Using K-means to find basis points
Column C: Using GMM clustering to find basis points

Dataset          A Test Acc   A SVs    B Test Acc   B SVs    C Test Acc   C SVs
banana             0.8881     186.76     0.8870      41        0.8851      41
breast-cancer      0.7425     131.64     0.7104      10.7      0.7104      40.4
diabetis           0.7532     275.45     0.7370      24.3      0.7397      24.3
german             0.7657     449.12     0.7413     141        0.7500      71
heart              0.8265     102.69     0.7870       9        0.8030       9
image              0.9202     441.6      0.9527     260.7      0.9385     260.7
ringnorm           0.9817     102.59     0.7633      21        0.7474      21
splice             0.8868     694.65     0.8288      51        0.8328      51
thyroid            0.9511      40.45     0.9227       8        0.9120      29
titanic            0.7726      72.32     0.7607       2.5      0.7157       8.5
twonorm            0.9727     126.52     0.9701      21        0.9698      21
waveform           0.9013     158.42     0.8878      21        0.8876      21

46

Clustering gave a sparser model with lower test-set accuracy (except for the entries highlighted in blue in the original slide)

Page 44: Study of Sparse Classifier Design Algorithms

Conclusion

Studied various sparse classifier design algorithms.
Better results were obtained using SVM with the L1 penalty.
Modified BIRCH with KMP:
◦ gave decent results on binary datasets
◦ gave good results on multi-class datasets
◦ saved kernel calculations (and time): almost ~1/5th of the actual time
Clustering is an easy (though time-consuming) way to choose basis points, but not very effective.
Future work:
◦ Explore greedy embedded sparse multi-class classification with different loss functions, e.g. logistic loss
◦ Explore such techniques for semi-supervised learning

47

Page 45: Study of Sparse Classifier Design Algorithms

Thank You.

48