TRANSCRIPT
“Study on Parallel SVM Based on MapReduce”
Kuei-Ti Lu
03/12/2015
Support Vector Machine (SVM)
• Used for
  – Classification
  – Regression
• Applied in
  – Network intrusion detection
  – Image processing
  – Text classification
  – …
libSVM
• Library for support vector machines
• Integrates different types of SVMs
Types of SVMs Supported by libSVM
• For support vector classification
  – C-SVC
  – Nu-SVC
• For support vector regression
  – Epsilon-SVR
  – Nu-SVR
• For distribution estimation
  – One-class SVM
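To make the list above concrete, here is a minimal sketch instantiating each variant. It uses scikit-learn's libSVM-backed classes as a stand-in (an assumption on my part; the paper works with libSVM directly), with purely illustrative toy data:

```python
import numpy as np
from sklearn.svm import SVC, NuSVC, SVR, NuSVR, OneClassSVM

# Toy 1-D data; values are purely illustrative.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(C=1.0).fit(X, y)               # C-SVC
nu_clf = NuSVC(nu=0.5).fit(X, y)         # Nu-SVC
reg = SVR(C=1.0, epsilon=0.1).fit(X, y)  # Epsilon-SVR
nu_reg = NuSVR(C=1.0, nu=0.5).fit(X, y)  # Nu-SVR
outlier = OneClassSVM(nu=0.5).fit(X)     # one-class SVM (distribution estimation)

print(clf.predict([[2.5]]))  # point near the class-1 cluster
```

The C/Nu pairs solve the same problems under different parameterizations: C bounds each multiplier directly, while nu bounds the fraction of margin errors and support vectors.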
C-SVC
• Goal: Find the separating hyperplane that maximizes the margin
• Support vectors: data points closest to the separating hyperplane
C-SVC
• Primal form

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

$$\text{s.t.}\quad y_i(w^T x_i + b) \ge 1 - \xi_i,\quad i = 1,\dots,n$$

$$\xi_i \ge 0,\quad i = 1,\dots,n$$

• Dual form (derived using Lagrange multipliers)

$$\min_{a}\ \frac{1}{2}\sum_{i}\sum_{j} a_i a_j y_i y_j\, k(x_i, x_j) - \sum_{i} a_i$$

$$\text{s.t.}\quad y^T a = 0$$

$$0 \le a_i \le C,\quad i = 1,\dots,l$$
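A small check of the picture above: on a linearly separable toy problem, the trained model's support vectors are exactly the points on the margin, and interior points get zero Lagrange multipliers. This sketch uses scikit-learn's libSVM-backed SVC (an assumption; any libSVM interface behaves the same way):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters; the margin is set by the innermost points.
X = np.array([[0, 0], [0, 1], [1, 0],    # class -1
              [3, 3], [3, 4], [4, 3]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Support vectors are the training points closest to the separating
# hyperplane; a point like (0, 0), strictly inside, gets a_i = 0.
print(clf.support_vectors_)
```

Here the maximum-margin hyperplane is pinned by (0, 1), (1, 0), and (3, 3), so only those three points survive as support vectors.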
Speedup
• Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) increases
• Need efficient algorithms and implementations to apply SVMs to large-scale data mining
• => Parallel SVM
Parallel SVM Methods
• Message Passing Interface (MPI)
  – Efficient for computation-intensive problems
  – Ex. simulation
• MapReduce
  – Can be used for data-intensive problems
• …
Other Speedup Techniques
• Chunking: optimize subsets of the training data iteratively until the global optimum is reached
  – Ex. Sequential Minimal Optimization (SMO)
    • Uses a chunk size of 2 vectors
• Eliminate non-support vectors early
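The early-elimination idea rests on a useful property: retraining on only the support vectors of a first pass reproduces the original decision function, since the discarded points have zero Lagrange multipliers. A sketch with scikit-learn's libSVM-backed SVC (my assumption; data is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Retrain using only the support vectors found by the first pass.
sv_idx = full.support_
small = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

probe = np.array([[0.0, 0.0], [4.0, 4.0]])
print(len(sv_idx), "of", len(X), "vectors kept")
print(full.predict(probe), small.predict(probe))  # identical predictions
```

This is exactly what makes the cascade in the next slide sound: each reduce step may discard non-support vectors without changing the eventual model.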
This Paper’s Approach
1. Partition and distribute the data to the nodes
2. Map class: train each subSVM to find the support vectors for its subset of the data
3. Reduce class: combine the support vectors of each 2 subSVMs
4. If more than 1 SVM remains, go to step 2
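The steps above can be sketched as a single-process cascade. This is only an illustration of the control flow: scikit-learn's SVC stands in for libSVM, a plain Python loop stands in for the Twister MapReduce runtime, and all names and data are my own:

```python
import numpy as np
from sklearn.svm import SVC

def train_sub_svm(X, y):
    """'Map' step: train a sub-SVM, return only its support vectors."""
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_svm(X, y, n_parts=4):
    # Step 1: partition the data across "nodes" (strided so each part
    # contains both classes).
    parts = [(X[i::n_parts], y[i::n_parts]) for i in range(n_parts)]
    # Steps 2-4: map (train sub-SVMs), reduce (merge pairs), repeat.
    while len(parts) > 1:
        svs = [train_sub_svm(px, py) for px, py in parts]
        parts = [(np.vstack([svs[i][0], svs[i + 1][0]]),
                  np.concatenate([svs[i][1], svs[i + 1][1]]))
                 for i in range(0, len(svs), 2)]
    fx, fy = parts[0]
    return SVC(kernel="linear", C=1.0).fit(fx, fy)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)
model = cascade_svm(X, y)
print(model.predict([[0, 0], [5, 5]]))
```

In the real system each `train_sub_svm` call runs on a separate node, and only the (much smaller) support-vector sets travel over the network between iterations.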
Twister
• Supports iterative MapReduce
• More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce
Computation Complexity
With N training vectors split across m first-layer nodes, each sub-SVM starts on N/m vectors, the merged subsets double at every layer, and there are log₂ m + 1 layers. If training on n vectors costs O(n²) and moving the data costs O(n_trans), the total cost is at most

$$\sum_{i=0}^{\log_2 m} \frac{m}{2^i}\, O\!\left(\left(\frac{2^i N}{m}\right)^2\right) + O(n_{\mathrm{trans}})$$

(an upper bound: after the first layer each merged set keeps only support vectors, usually far fewer than 2^i N/m vectors)
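To see the shape of the layered cost, the sum can be tabulated numerically. The quadratic cost model and the sizes N and m below are illustrative assumptions, and the per-layer vector count is the no-elimination upper bound, which is why the last layer dominates:

```python
import math

# Assumption: training a sub-SVM on n vectors costs ~ n^2 operations.
N, m = 1_000_000, 64   # total training vectors, first-layer nodes (illustrative)

total = 0
for i in range(int(math.log2(m)) + 1):
    nodes = m // 2**i        # sub-SVMs on layer i
    n = 2**i * N // m        # vectors per sub-SVM (upper bound: in practice
                             # merged sets keep only support vectors)
    layer = nodes * n**2
    total += layer
    print(f"layer {i}: {nodes:2d} x {n:>9,} vectors -> {layer:.2e} ops")
print(f"worst-case total {total:.2e} ops; the last layer dominates")
```

Without support-vector shrinkage the bound is no better than a monolithic SVM, which is why eliminating non-support vectors early, and why the last-layer time noted in the evaluations, matter so much.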
Evaluations
• Number of nodes
• Training time
• Accuracy = # correctly predicted data / # total testing data × 100%
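The accuracy metric above is a plain percentage; a one-function sketch (labels here are illustrative):

```python
# Accuracy = # correctly predicted / # total testing data * 100%
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) * 100  # percent

print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))  # 3 of 4 correct -> 75.0
```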
Adult Data Analysis
• Binary classification
• Correlation between attribute variable X and class variable Y used to select attributes

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
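The selection rule above can be sketched with numpy: rank attributes by |ρ| against the class variable and keep the top-k. The data and the cutoff k = 2 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500).astype(float)   # binary class variable Y
X = np.column_stack([
    y + rng.normal(0, 0.5, 500),    # informative attribute
    rng.normal(0, 1, 500),          # pure noise attribute
    -y + rng.normal(0, 0.8, 500),   # informative, negatively correlated
])

# rho_{X_j,Y} = cov(X_j, Y) / (sigma_{X_j} * sigma_Y), per attribute
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.argsort(-np.abs(corr))[:2]   # keep the top-2 attributes
print(sorted(keep.tolist()))           # the two informative columns
```

Note the absolute value: a strongly negative correlation (column 2 here) is just as informative as a positive one.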
Adult Data Analysis
• Computation cost concentrates on training
• Data transfer time cost is minor
• Last-layer computation time depends on α and β instead of the number of nodes (1 node only)
• Feature selection reduces computation greatly but does not reduce accuracy very much
Forest Cover Type Classification
• Multiclass classification
  – Use k(k − 1)/2 binary SVMs as a k-class SVM
  – 1 binary SVM for each pair of classes
  – Use maximum voting to determine the class
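The one-vs-one scheme above can be sketched directly: train one binary SVM per class pair, then count votes. scikit-learn's SVC stands in for each binary libSVM model (an assumption), and the three clusters are illustrative:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_ovo(X, y):
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):   # k(k-1)/2 binary SVMs
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])
    return classes, models

def predict_ovo(classes, models, X):
    votes = np.zeros((len(X), len(classes)), dtype=int)
    idx = {c: i for i, c in enumerate(classes)}
    for pair, m in models.items():
        for row, pred in enumerate(m.predict(X)):
            votes[row, idx[pred]] += 1
    return classes[np.argmax(votes, axis=1)]  # maximum voting

# Three well-separated clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)
classes, models = train_ovo(X, y)
print(predict_ovo(classes, models, np.array([[0, 0], [3, 3], [6, 6]])))
```

With k = 3 classes this trains 3 pairwise models; each test point collects up to k − 1 votes for its true class, so ties are rare in practice.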
Forest Cover Type Classification
• Correlation between attribute variable X and class variable Y used to select attributes
• Attribute variables are normalized to [0, 1]
$$x_{norm} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
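Applied per attribute (column), the normalization looks like this; the sample matrix is illustrative:

```python
import numpy as np

def min_max_normalize(X):
    # x_norm = (x - x_min) / (x_max - x_min), computed per column
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
print(min_max_normalize(X))  # each column scaled to [0, 1]
```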
Forest Cover Type Classification
• Last layer computation time depends on α and β instead of number of nodes (1 node only)
• Feature selection reduces computation greatly but does not reduce accuracy very much
Heart Disease Classification
• Binary classification
• Data replicated different numbers of times to compare results for different sample sizes
Heart Disease Classification
• When the sample size is too big, the data can't be processed with 1 node because of the memory constraint
• Training time decreases little when number of nodes > 8
Conclusion
• Classical SVM is impractical for large-scale data
• Need parallel SVM
• This paper proposes a model based on iterative MapReduce
• Shows the model is efficient for data-intensive problems
References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in PDPTA, Las Vegas, NV, 2012.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in ISA, Busan, Korea, 2008, pp. 166-171.
Q & A