“Study on Parallel SVM Based on MapReduce” Kuei-Ti Lu 03/12/2015


Page 1:

“Study on Parallel SVM Based on MapReduce”

Kuei-Ti Lu, 03/12/2015

Page 2:

Support Vector Machine (SVM)

• Used for
  – Classification
  – Regression

• Applied in
  – Network intrusion detection
  – Image processing
  – Text classification
  – …

Page 3:

libSVM

• Library for support vector machines
• Integrates different types of SVMs

Page 4:

Types of SVMs Supported by libSVM

• For support vector classification
  – C-SVC
  – Nu-SVC

• For support vector regression
  – Epsilon-SVR
  – Nu-SVR

• For distribution estimation
  – One-class SVM

(A usage sketch of these five types follows.)
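As an illustration only (not from the paper), the sketch below maps these five formulations onto scikit-learn's svm module, which wraps libSVM; the toy arrays X, y, and t are hypothetical.

```python
# Illustration only: the five SVM types listed above, via scikit-learn's
# libSVM-backed classes. The toy data below is hypothetical.
import numpy as np
from sklearn import svm

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])            # class labels (classification)
t = np.array([0.1, 0.9, 2.1, 2.9])    # real-valued targets (regression)

svm.SVC(C=1.0).fit(X, y)              # C-SVC
svm.NuSVC(nu=0.5).fit(X, y)           # Nu-SVC
svm.SVR(epsilon=0.1).fit(X, t)        # Epsilon-SVR
svm.NuSVR(nu=0.5).fit(X, t)           # Nu-SVR
svm.OneClassSVM(nu=0.5).fit(X)        # One-class SVM (distribution estimation)
```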

Page 5:

C-SVC

• Goal: Find the separating hyperplane that maximizes the margin

• Support vectors: data points closest to the separating hyperplane

Page 6:

C-SVC

• Primal form

• Dual form (derived using Lagrange multipliers)

Primal form:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

$$\text{s.t.}\quad y_i\big(w^T\phi(x_i) + b\big) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n$$

Dual form:

$$\min_{a}\ \frac{1}{2}\sum_{i,j} a_i a_j\, y_i y_j\, k(x_i, x_j) - \sum_i a_i$$

$$\text{s.t.}\quad y^T a = 0,\qquad 0 \le a_i \le C,\qquad i = 1,\dots,l$$
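A minimal worked example, assuming scikit-learn's libSVM-backed SVC as a stand-in: it fits a linear C-SVC on six hypothetical points and inspects the support vectors and the dual coefficients y_i·a_i, which the dual form above bounds by C.

```python
# Illustration (hypothetical points): fit a linear C-SVC and inspect the
# support vectors and the dual variables bounded by C in the dual form above.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the training points closest to the hyperplane
print(clf.dual_coef_)         # y_i * a_i per support vector, |y_i * a_i| <= C
```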

Page 7:

Speedup

• Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) increases

• Efficient algorithms and implementations are needed to apply SVMs to large-scale data mining

• => Parallel SVM

Page 8:

Parallel SVM Methods

• Message Passing Interface (MPI)
  – Efficient for computation-intensive problems
    • Ex. simulation

• MapReduce
  – Can be used for data-intensive problems

• …

Page 9:

Other Speedup Techniques

• Chunking: optimize subsets of the training data iteratively until the global optimum is reached
  – Ex. Sequential Minimal Optimization (SMO)
    • Uses a chunk size of 2 vectors

• Eliminate non-support vectors early

Page 10:

This Paper’s Approach

1. Partition and distribute the data to the nodes
2. Map class: train each subSVM to find the support vectors for its subset of the data
3. Reduce class: combine the support vectors of every two subSVMs
4. If more than one SVM remains, go to step 2 (see the sketch below)
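The sketch below is a single-process illustration of this loop. Assumptions, not the paper's code: scikit-learn's SVC stands in for libSVM, plain Python lists stand in for Twister's map and reduce tasks, the partition count is a power of two, and every partition contains both classes.

```python
# Single-process sketch of the cascade described above (illustrative only).
import numpy as np
from sklearn.svm import SVC

def train_sub_svm(X, y):
    """Map step: train one subSVM and keep only its support vectors."""
    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

def combine(a, b):
    """Reduce step: merge the support vectors of two subSVMs."""
    return np.vstack([a[0], b[0]]), np.concatenate([a[1], b[1]])

def cascade_svm(X, y, num_partitions=4):
    # 1. Partition and distribute the data (here: a simple row split).
    parts = list(zip(np.array_split(X, num_partitions),
                     np.array_split(y, num_partitions)))
    while len(parts) > 1:                      # 4. more than one SVM left?
        # 2. Map: train each subSVM, keep its support vectors.
        sv = [train_sub_svm(Xp, yp) for Xp, yp in parts]
        # 3. Reduce: combine the support vectors of every two subSVMs.
        parts = [combine(sv[i], sv[i + 1]) for i in range(0, len(sv), 2)]
    # Train the final SVM on the surviving support vectors.
    return SVC(kernel="rbf", C=1.0).fit(parts[0][0], parts[0][1])
```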

Page 11:

Twister

• Supports iterative MapReduce

• More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce

Page 12:

Computation Complexity

• Each of the m first-layer subSVMs trains on roughly N / m of the N training vectors; later layers work on the merged support-vector sets of size N_i
• The total cost sums, over the log₂ m layers of the cascade, the O(n²) training cost of that layer's subSVMs plus the data-transfer cost O(n_trans)

Page 13:

Evaluations

• Number of nodes
• Training time
• Accuracy = # correctly predicted data / # total testing data × 100% (see the snippet below)
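A minimal sketch of this accuracy computation; the label arrays are hypothetical.

```python
# Hypothetical test labels, just to show the accuracy formula above.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])   # actual classes of the testing data
y_pred = np.array([1, 0, 0, 1, 0])   # classes predicted by the SVM
accuracy = np.mean(y_pred == y_true) * 100.0   # 4 of 5 correct -> 80.0 %
```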

Page 14:

Adult Data Analysis

• Binary classification
• Correlation between attribute variable X and class variable Y used to select attributes

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y},\qquad \operatorname{cov}(X,Y) = E\big[(X-\mu_X)(Y-\mu_Y)\big]$$
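A small sketch of this kind of correlation-based attribute selection, assuming NumPy arrays X (samples × attributes) and y (class labels); the function name and the choice to keep the top-k attributes are assumptions for illustration.

```python
# Sketch of correlation-based attribute selection; X, y, and k are assumed inputs.
import numpy as np

def select_attributes(X, y, k):
    """Keep the k attribute columns with the largest |corr(X_j, Y)|."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:k]    # indices of the k best attributes
    return X[:, keep], keep
```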

Page 15:

Adult Data Analysis

• Computation cost concentrates on training

• Data transfer time cost is minor
• Last layer computation time depends on α and β rather than the number of nodes (the last layer uses 1 node only)

• Feature selection reduces computation greatly but does not reduce accuracy very much

Page 16:

Forest Cover Type Classification

• Multiclass classification
  – Use k(k − 1)/2 binary SVMs as a k-class SVM
  – 1 binary SVM for each pair of classes
  – Use maximum voting to determine the class (see the sketch below)
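A compact sketch of this one-vs-one scheme with maximum voting; scikit-learn's SVC is used here only as a generic binary learner (for multiclass problems SVC itself already applies one-vs-one internally). The function names are illustrative, not the paper's code.

```python
# Sketch of one-vs-one classification with maximum voting.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, y):
    """Train k(k-1)/2 binary SVMs, one per pair of classes."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="rbf", C=1.0).fit(X[mask], y[mask])
    return models

def predict_by_voting(models, X):
    """Each binary SVM votes for one class; the class with the most votes wins."""
    classes = sorted({c for pair in models for c in pair})
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    for clf in models.values():
        pred = clf.predict(X)
        for ci, c in enumerate(classes):
            votes[:, ci] += (pred == c)
    return np.array(classes)[votes.argmax(axis=1)]
```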

Page 17:

Forest Cover Type Classification

• Correlation between attribute variable X and class variable Y used to select attributes

• Attribute variables are normalized to [0, 1]

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
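A one-function sketch of this min-max normalization, assuming a NumPy attribute matrix with one row per sample and one column per attribute.

```python
# Min-max normalization of every attribute column to [0, 1].
import numpy as np

def min_max_normalize(X):
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)   # assumes x_max > x_min per column
```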

Page 18:

Forest Cover Type Classification

• Last layer computation time depends on α and β rather than the number of nodes (the last layer uses 1 node only)

• Feature selection reduces computation greatly but does not reduce accuracy very much

Page 19:

Heart Disease Classification

• Binary classification
• Data replicated different numbers of times to compare results for different sample sizes

Page 20:

Heart Disease Classification

• When the sample size is too big, the data cannot be processed with 1 node because of the memory constraint
• Training time decreases only slightly when the number of nodes exceeds 8

Page 21:

Conclusion

• Classical SVM is impractical for large-scale data
• Parallel SVM is needed
• This paper proposes a model based on iterative MapReduce
• The model is shown to be efficient for data-intensive problems

Page 22:

References

[1] Z. Sun and G. Fox, "Study on Parallel SVM Based on MapReduce," in PDPTA, Las Vegas, NV, 2012.

[2] C. Lin et al., "Anomaly Detection Using LibSVM Training Tools," in ISA, Busan, Korea, 2008, pp. 166-171.

Page 23:

Q & A