TRANSCRIPT
Haifa Research Lab
© 2008 IBM Corporation
Parallel streaming decision trees
Yael Ben-Haim & Elad Yom-Tov
Presented by: Yossi Richter
Why decision trees?
Simple classification model, short testing time
Understandable by humans
BUT:
– Difficult to train on large data (need to sort each feature)
Previous work
Presorting (SLIQ, 1996)
Approximations (BOAT, 1999) (CLOUDS, 1997)
Parallel (e.g. SPRINT 1996)
– Vertical parallelism
– Task parallelism
– Hybrid parallelism
Streaming
– Minibatch (SPIES, 2003)
– Statistics-based (pCLOUDS, 1999)
Streaming parallel decision tree
Iterative parallel decision tree
[Diagram: master/worker timeline. The master initializes the root node; each worker builds histograms over its own data; the master merges the histograms and computes the node splits; workers then build histograms for the new nodes, repeating until convergence.]
Building an on-line histogram
A histogram is a list of pairs (p1, m1) … (pn, mn)
Initialize: c=0, p=[ ], m=[ ]
For each data point p:
– If p == p_j for some j <= c:
  • m_j = m_j + 1
– Otherwise:
  • Add a bin (p, 1) to the histogram
  • c = c + 1
  • If c > max_bins:
    – Merge the two closest bins in the histogram
    – c = max_bins
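The update procedure above can be sketched in Python. This is an illustrative implementation, not the toolbox's code: the histogram is kept as a sorted list of (value, count) pairs, and the two closest bins are fused into their count-weighted average.

```python
import bisect

def update(hist, p, max_bins):
    """Add data point p to `hist`, a sorted list of (value, count)
    pairs, keeping at most `max_bins` bins."""
    for i, (q, m) in enumerate(hist):
        if q == p:
            # p matches an existing bin value: just increment its count.
            hist[i] = (q, m + 1)
            return hist
    # Otherwise insert a new unit bin, keeping the list sorted by value.
    bisect.insort(hist, (p, 1))
    if len(hist) > max_bins:
        # Over budget: merge the two closest bins into their
        # count-weighted average.
        i = min(range(len(hist) - 1),
                key=lambda j: hist[j + 1][0] - hist[j][0])
        (q1, m1), (q2, m2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [((q1 * m1 + q2 * m2) / (m1 + m2), m1 + m2)]
    return hist
```

For example, feeding the points 1, 2, 2, 3, 10 with max_bins = 3 fuses the bins at 1 and 2 into a single bin (5/3, 3).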
Merging two histograms
Concatenate the two histogram lists, creating a list of length c
Repeat until c <= max_bins
– Merge the two closest bins
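This merge step can be sketched in Python, assuming histograms are sorted lists of (value, count) pairs and the two closest bins are fused into their count-weighted average (an illustrative sketch, not the toolbox's code):

```python
def merge(hist1, hist2, max_bins):
    """Merge two histograms (sorted lists of (value, count) pairs),
    then fuse the two closest bins until at most max_bins remain."""
    hist = sorted(hist1 + hist2)       # concatenated list of length c
    while len(hist) > max_bins:        # repeat until c <= max_bins
        # Find the adjacent pair with the smallest gap between values.
        i = min(range(len(hist) - 1),
                key=lambda j: hist[j + 1][0] - hist[j][0])
        (q1, m1), (q2, m2) = hist[i], hist[i + 1]
        # Replace the pair by its count-weighted average.
        hist[i:i + 2] = [((q1 * m1 + q2 * m2) / (m1 + m2), m1 + m2)]
    return hist
```

Because the merged result preserves total counts, a master node can combine workers' histograms without ever seeing the raw data points.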
Example of the histogram
[Figure: example histogram built on-line — 50 bins, 1000 data points]
Pruning
Taken from the MDL-based SLIQ algorithm
Consists of two phases:
– Tree construction
– Bottom-up pass on the complete tree
During tree construction, for each tree node, set
c_leaf = 1 + (number of samples that reached the node and do not belong to the majority class)
The bottom-up pass:
– For each leaf, set c_both = c_leaf
– For each internal node for which c_both(left) and c_both(right) have been assigned, set c_both = 2 + c_both(left) + c_both(right)
– c_leaf is small when:
  • Only a few samples reach the node
  • A substantial portion of the samples that reach it belongs to the majority class
– If c_leaf < c_both (i.e., the subtree does not contribute much information) then:
  • Prune the subtree (the node becomes a leaf)
  • Set c_both = c_leaf
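The bottom-up pass lends itself to a short recursive sketch. The node layout is illustrative (a dict with c_leaf and optional left/right children), not the toolbox's representation:

```python
def prune(node):
    """Bottom-up pass: return the node's final c_both, pruning
    subtrees in place whenever c_leaf < c_both."""
    if 'left' not in node:                 # leaf
        node['c_both'] = node['c_leaf']
        return node['c_both']
    # Internal node: the children's c_both must be assigned first.
    c_both = 2 + prune(node['left']) + prune(node['right'])
    if node['c_leaf'] < c_both:
        # Subtree does not contribute much information: prune it,
        # turning this node into a leaf.
        del node['left'], node['right']
        node['c_both'] = node['c_leaf']
    else:
        node['c_both'] = c_both
    return node['c_both']
```

For instance, a root with c_leaf = 3 whose two leaf children each have c_leaf = 5 yields a subtree cost of 2 + 5 + 5 = 12 > 3, so the subtree is pruned.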
IBM Parallel Machine Learning toolbox
A toolbox for conducting large-scale machine learning
– Supports architectures ranging from single machines with multiple cores to large distributed clusters
Works by distributing the computations across multiple nodes
– Allows for rapid learning of very large datasets
Includes state-of-the-art machine learning algorithms for:
– Classification: support vector machines (SVM), decision trees
– Regression: linear and SVM
– Clustering: k-means, fuzzy k-means, kernel k-means, Iclust
– Feature reduction: principal component analysis (PCA) and kernel PCA
Includes an API for adding algorithms
Freely available from alphaWorks
Joint project of the Haifa Machine Learning group and the Watson Data Analytics group
[Figure: k-means on Blue Gene — speedup (compared to a single node) vs. number of processors; speedup axis 0–18, processor axis 0–1200.]
[Diagram: master/worker timeline — the master initializes parameters; workers compute the kernel matrix and local updates over their data; the master computes a global update; the loop repeats until convergence.]
(Shameless PR slide)
Results: Comparing single node solvers
Dataset       Examples        Features   Standard tree   SPDT
Adult         32561 (16281)   105        17.7            15.7
Isolet        6238 (1559)     617        18.7            14.6
Letter        20000           16         7.5             8.6
Nursery       12960           25         1.0             2.6
Page blocks   5473            10         3.1             3.1
Pen digits    7494 (3498)     16         4.6             5.4
Spambase      4601            57         8.4             10.5
No statistically significant difference between the two.
Ten-fold cross-validation, unless a test/train partition exists (test-set size in parentheses).
Results: Pruning
Dataset       Standard tree   SPDT before pruning   SPDT after pruning   Tree size before   Tree size after
Adult         17.7            15.7                  14.3                 1645               409
Isolet        18.7            14.6                  17.8                 211                141
Letter        7.5             8.6                   9.3                  135                67
Nursery       1.0             2.6                   3.2                  178                167
Page blocks   3.1             3.1                   3.4                  55                 36
Pen digits    4.6             5.4                   5.8                  89                 81
Spambase      8.4             10.5                  11.4                 572                445
Up to 80% reduction in tree size.
Speedup (Strong scalability)
[Figures: speedup curves for Alpha and Beta]
Speedup improves with data size!
Weak scalability
[Figures: weak-scalability curves for Alpha and Beta]
Scalability improves with the number of processors!
Algorithm complexity
Summary
An efficient new algorithm for parallel streaming decision trees
Results as good as single-node trees, but with scalability that improves with the data size and the number of processors
Ongoing work: proving that the algorithm's output is only ε-different from that of the standard decision tree algorithm
Thank You
תודה (Hebrew: "Toda") · Merci (French) · Grazie (Italian) · Gracias (Spanish) · Obrigado (Portuguese) · Danke (German) · Kiitos (Finnish)
[The slide also shows "thank you" in Japanese, Russian, Arabic, Traditional and Simplified Chinese, Thai, Korean, and Danish; those scripts did not survive extraction.]