TRANSCRIPT
“Study on Parallel SVM Based on MapReduce”
Kuei-Ti Lu
03/12/2015
Support Vector Machine (SVM)
• Used for
  – Classification
  – Regression
• Applied in
  – Network intrusion detection
  – Image processing
  – Text classification
  – …
libSVM
• Library for support vector machines
• Integrates different types of SVMs
Types of SVMs Supported by libSVM
• For support vector classification
  – C-SVC
  – Nu-SVC
• For support vector regression
  – Epsilon-SVR
  – Nu-SVR
• For distribution estimation
  – One-class SVM
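To make the list above concrete, here is a minimal sketch instantiating each variant. It uses scikit-learn's libSVM-backed classes as a stand-in (an assumption on my part; the paper works with libSVM directly), with purely illustrative toy data:

```python
import numpy as np
from sklearn.svm import SVC, NuSVC, SVR, NuSVR, OneClassSVM

# Toy 1-D data; values are purely illustrative.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(C=1.0).fit(X, y)               # C-SVC
nu_clf = NuSVC(nu=0.5).fit(X, y)         # Nu-SVC
reg = SVR(C=1.0, epsilon=0.1).fit(X, y)  # Epsilon-SVR
nu_reg = NuSVR(C=1.0, nu=0.5).fit(X, y)  # Nu-SVR
outlier = OneClassSVM(nu=0.5).fit(X)     # one-class SVM (distribution estimation)

print(clf.predict([[2.5]]))  # point near the class-1 cluster
```

The C/Nu pairs solve the same problems under different parameterizations: C bounds each multiplier directly, while nu bounds the fraction of margin errors and support vectors.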
C-SVC
• Goal: Find the separating hyperplane that maximizes the margin
• Support vectors: data points closest to the separating hyperplane
C-SVC
• Primal form

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

$$\text{s.t.}\quad y_i(w^T x_i + b) \ge 1 - \xi_i,\quad i = 1,\dots,n$$

$$\xi_i \ge 0,\quad i = 1,\dots,n$$

• Dual form (derived using Lagrange multipliers)

$$\min_{a}\ \frac{1}{2}\sum_{i}\sum_{j} a_i a_j y_i y_j\, k(x_i, x_j) - \sum_{i} a_i$$

$$\text{s.t.}\quad y^T a = 0$$

$$0 \le a_i \le C,\quad i = 1,\dots,l$$
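A small check of the picture above: on a linearly separable toy problem, the trained model's support vectors are exactly the points on the margin, and interior points get zero Lagrange multipliers. This sketch uses scikit-learn's libSVM-backed SVC (an assumption; any libSVM interface behaves the same way):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters; the margin is set by the innermost points.
X = np.array([[0, 0], [0, 1], [1, 0],    # class -1
              [3, 3], [3, 4], [4, 3]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Support vectors are the training points closest to the separating
# hyperplane; a point like (0, 0), strictly inside, gets a_i = 0.
print(clf.support_vectors_)
```

Here the maximum-margin hyperplane is pinned by (0, 1), (1, 0), and (3, 3), so only those three points survive as support vectors.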
Speedup
• Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) increases
• Need efficient algorithms and implementations to apply SVMs to large-scale data mining
• => Parallel SVM
Parallel SVM Methods
• Message Passing Interface (MPI)
  – Efficient for computation-intensive problems
  – Ex. simulation
• MapReduce
  – Can be used for data-intensive problems
• …
Other Speedup Techniques
• Chunking: optimize subsets of the training data iteratively until the global optimum is reached
  – Ex. Sequential Minimal Optimization (SMO)
    • Uses a chunk size of 2 vectors
• Eliminate non-support vectors early
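The early-elimination idea rests on a useful property: retraining on only the support vectors of a first pass reproduces the original decision function, since the discarded points have zero Lagrange multipliers. A sketch with scikit-learn's libSVM-backed SVC (my assumption; data is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

full = SVC(kernel="linear", C=1.0).fit(X, y)

# Retrain using only the support vectors found by the first pass.
sv_idx = full.support_
small = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

probe = np.array([[0.0, 0.0], [4.0, 4.0]])
print(len(sv_idx), "of", len(X), "vectors kept")
print(full.predict(probe), small.predict(probe))  # identical predictions
```

This is exactly what makes the cascade in the next slide sound: each reduce step may discard non-support vectors without changing the eventual model.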
This Paper’s Approach
1. Partition and distribute the data to the nodes
2. Map class: train each subSVM to find the support vectors for its subset of the data
3. Reduce class: combine the support vectors of each 2 subSVMs
4. If more than 1 SVM remains, go to step 2
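The steps above can be sketched as a single-process cascade. This is only an illustration of the control flow: scikit-learn's SVC stands in for libSVM, a plain Python loop stands in for the Twister MapReduce runtime, and all names and data are my own:

```python
import numpy as np
from sklearn.svm import SVC

def train_sub_svm(X, y):
    """'Map' step: train a sub-SVM, return only its support vectors."""
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_svm(X, y, n_parts=4):
    # Step 1: partition the data across "nodes" (strided so each part
    # contains both classes).
    parts = [(X[i::n_parts], y[i::n_parts]) for i in range(n_parts)]
    # Steps 2-4: map (train sub-SVMs), reduce (merge pairs), repeat.
    while len(parts) > 1:
        svs = [train_sub_svm(px, py) for px, py in parts]
        parts = [(np.vstack([svs[i][0], svs[i + 1][0]]),
                  np.concatenate([svs[i][1], svs[i + 1][1]]))
                 for i in range(0, len(svs), 2)]
    fx, fy = parts[0]
    return SVC(kernel="linear", C=1.0).fit(fx, fy)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)
model = cascade_svm(X, y)
print(model.predict([[0, 0], [5, 5]]))
```

In the real system each `train_sub_svm` call runs on a separate node, and only the (much smaller) support-vector sets travel over the network between iterations.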
Twister
• Supports iterative MapReduce
• More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce
Computation Complexity
With N training vectors split across m first-layer nodes, each sub-SVM starts on N/m vectors, the merged subsets double at every layer, and there are log₂ m + 1 layers. If training on n vectors costs O(n²) and moving the data costs O(n_trans), the total cost is at most

$$\sum_{i=0}^{\log_2 m} \frac{m}{2^i}\, O\!\left(\left(\frac{2^i N}{m}\right)^2\right) + O(n_{\mathrm{trans}})$$

(an upper bound: after the first layer each merged set keeps only support vectors, usually far fewer than 2^i N/m vectors)
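To see the shape of the layered cost, the sum can be tabulated numerically. The quadratic cost model and the sizes N and m below are illustrative assumptions, and the per-layer vector count is the no-elimination upper bound, which is why the last layer dominates:

```python
import math

# Assumption: training a sub-SVM on n vectors costs ~ n^2 operations.
N, m = 1_000_000, 64   # total training vectors, first-layer nodes (illustrative)

total = 0
for i in range(int(math.log2(m)) + 1):
    nodes = m // 2**i        # sub-SVMs on layer i
    n = 2**i * N // m        # vectors per sub-SVM (upper bound: in practice
                             # merged sets keep only support vectors)
    layer = nodes * n**2
    total += layer
    print(f"layer {i}: {nodes:2d} x {n:>9,} vectors -> {layer:.2e} ops")
print(f"worst-case total {total:.2e} ops; the last layer dominates")
```

Without support-vector shrinkage the bound is no better than a monolithic SVM, which is why eliminating non-support vectors early, and why the last-layer time noted in the evaluations, matter so much.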
Evaluations
• Number of nodes
• Training time
• Accuracy = # correctly predicted data / # total testing data × 100%
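The accuracy metric above is a plain percentage; a one-function sketch (labels here are illustrative):

```python
# Accuracy = # correctly predicted / # total testing data * 100%
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) * 100  # percent

print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))  # 3 of 4 correct -> 75.0
```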
Adult Data Analysis
• Binary classification
• Correlation between attribute variable X and class variable Y used to select attributes

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
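The selection rule above can be sketched with numpy: rank attributes by |ρ| against the class variable and keep the top-k. The data and the cutoff k = 2 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500).astype(float)   # binary class variable Y
X = np.column_stack([
    y + rng.normal(0, 0.5, 500),    # informative attribute
    rng.normal(0, 1, 500),          # pure noise attribute
    -y + rng.normal(0, 0.8, 500),   # informative, negatively correlated
])

# rho_{X_j,Y} = cov(X_j, Y) / (sigma_{X_j} * sigma_Y), per attribute
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.argsort(-np.abs(corr))[:2]   # keep the top-2 attributes
print(sorted(keep.tolist()))           # the two informative columns
```

Note the absolute value: a strongly negative correlation (column 2 here) is just as informative as a positive one.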
Adult Data Analysis
• Computation cost concentrates on training
• Data transfer time cost is minor
• Last-layer computation time depends on α and β instead of the number of nodes (1 node only)
• Feature selection reduces computation greatly but does not reduce accuracy very much
Forest Cover Type Classification
• Multiclass classification
  – Use k(k − 1)/2 binary SVMs as a k-class SVM
  – 1 binary SVM for each pair of classes
  – Use maximum voting to determine the class
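The one-vs-one scheme above can be sketched directly: train one binary SVM per class pair, then count votes. scikit-learn's SVC stands in for each binary libSVM model (an assumption), and the three clusters are illustrative:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def train_ovo(X, y):
    classes = np.unique(y)
    models = {}
    for a, b in combinations(classes, 2):   # k(k-1)/2 binary SVMs
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])
    return classes, models

def predict_ovo(classes, models, X):
    votes = np.zeros((len(X), len(classes)), dtype=int)
    idx = {c: i for i, c in enumerate(classes)}
    for pair, m in models.items():
        for row, pred in enumerate(m.predict(X)):
            votes[row, idx[pred]] += 1
    return classes[np.argmax(votes, axis=1)]  # maximum voting

# Three well-separated clusters, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)
classes, models = train_ovo(X, y)
print(predict_ovo(classes, models, np.array([[0, 0], [3, 3], [6, 6]])))
```

With k = 3 classes this trains 3 pairwise models; each test point collects up to k − 1 votes for its true class, so ties are rare in practice.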
Forest Cover Type Classification
• Correlation between attribute variable X and class variable Y used to select attributes
• Attribute variables are normalized to [0, 1]
$$x_{norm} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
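Applied per attribute (column), the normalization looks like this; the sample matrix is illustrative:

```python
import numpy as np

def min_max_normalize(X):
    # x_norm = (x - x_min) / (x_max - x_min), computed per column
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
print(min_max_normalize(X))  # each column scaled to [0, 1]
```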
Forest Cover Type Classification
• Last layer computation time depends on α and β instead of number of nodes (1 node only)
• Feature selection reduces computation greatly but does not reduce accuracy very much
Heart Disease Classification
• Binary classification
• Data replicated different numbers of times to compare results for different sample sizes
Heart Disease Classification
• When the sample size is too big, the data can't be processed with 1 node because of the memory constraint
• Training time decreases little when number of nodes > 8
Conclusion
• Classical SVM is impractical for large-scale data
• Need parallel SVM
• This paper proposes a model based on iterative MapReduce
• Shows the model is efficient for data-intensive problems
References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in PDPTA, Las Vegas, NV, 2012.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in ISA, Busan, Korea, 2008, pp. 166-171.
Q & A