TRANSCRIPT
Haifa Research Lab
© 2008 IBM Corporation
Parallel streaming decision trees
Yael Ben-Haim & Elad Yom-Tov
Presented by: Yossi Richter
Why decision trees?
Simple classification model, short testing time
Understandable by humans
BUT:
– Difficult to train on large data (need to sort each feature)
Previous work
Presorting (SLIQ, 1996)
Approximations (BOAT, 1999) (CLOUDS, 1997)
Parallel (e.g. SPRINT 1996)
– Vertical parallelism
– Task parallelism
– Hybrid parallelism
Streaming
– Minibatch (SPIES, 2003)
– Statistics-based (pCLOUDS, 1999)
Streaming parallel decision tree
Iterative parallel decision tree
[Diagram: master/worker timeline. The master initializes the root node; each worker builds histograms over its own data; the master merges the histograms and computes the node splits; workers then build histograms for the new nodes, repeating until convergence.]
Building an on-line histogram
A histogram is a list of pairs (p1, m1) … (pn, mn)
Initialize: c=0, p=[ ], m=[ ]
For each data point p:
– If p == p_j for some j <= c:
  • m_j = m_j + 1
– Otherwise:
  • Add a bin (p, 1) to the histogram
  • c = c + 1
  • If c > max_bins:
    – Merge the two closest bins in the histogram
    – c = max_bins
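The update procedure above can be sketched in Python. This is an illustrative implementation, not the toolbox's code: the histogram is kept as a sorted list of (value, count) pairs, and the two closest bins are fused into their count-weighted average.

```python
import bisect

def update(hist, p, max_bins):
    """Add data point p to `hist`, a sorted list of (value, count)
    pairs, keeping at most `max_bins` bins."""
    for i, (q, m) in enumerate(hist):
        if q == p:
            # p matches an existing bin value: just increment its count.
            hist[i] = (q, m + 1)
            return hist
    # Otherwise insert a new unit bin, keeping the list sorted by value.
    bisect.insort(hist, (p, 1))
    if len(hist) > max_bins:
        # Over budget: merge the two closest bins into their
        # count-weighted average.
        i = min(range(len(hist) - 1),
                key=lambda j: hist[j + 1][0] - hist[j][0])
        (q1, m1), (q2, m2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [((q1 * m1 + q2 * m2) / (m1 + m2), m1 + m2)]
    return hist
```

For example, feeding the points 1, 2, 2, 3, 10 with max_bins = 3 fuses the bins at 1 and 2 into a single bin (5/3, 3).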
Merging two histograms
Concatenate the two histogram lists, creating a list of length c
Repeat until c <= max_bins
– Merge the two closest bins
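This merge step can be sketched in Python, assuming histograms are sorted lists of (value, count) pairs and the two closest bins are fused into their count-weighted average (an illustrative sketch, not the toolbox's code):

```python
def merge(hist1, hist2, max_bins):
    """Merge two histograms (sorted lists of (value, count) pairs),
    then fuse the two closest bins until at most max_bins remain."""
    hist = sorted(hist1 + hist2)       # concatenated list of length c
    while len(hist) > max_bins:        # repeat until c <= max_bins
        # Find the adjacent pair with the smallest gap between values.
        i = min(range(len(hist) - 1),
                key=lambda j: hist[j + 1][0] - hist[j][0])
        (q1, m1), (q2, m2) = hist[i], hist[i + 1]
        # Replace the pair by its count-weighted average.
        hist[i:i + 2] = [((q1 * m1 + q2 * m2) / (m1 + m2), m1 + m2)]
    return hist
```

Because the merged result preserves total counts, a master node can combine workers' histograms without ever seeing the raw data points.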
Example of the histogram
[Figure: example histogram built on-line — 50 bins, 1000 data points]
Pruning
Taken from the MDL-based SLIQ algorithm
Consists of two phases:
– Tree construction
– Bottom-up pass on the complete tree
During tree construction, for each tree node, set
c_leaf = 1 + (number of samples that reached the node and do not belong to the majority class)
The bottom-up pass:
– For each leaf, set c_both = c_leaf
– For each internal node for which c_both(left) and c_both(right) have been assigned, set c_both = 2 + c_both(left) + c_both(right)
– c_leaf is small when:
  • Only a few samples reach the node
  • A substantial portion of the samples that reach it belongs to the majority class
– If c_leaf < c_both (i.e., the subtree does not contribute much information) then:
  • Prune the subtree (the node becomes a leaf)
  • Set c_both = c_leaf
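The bottom-up pass lends itself to a short recursive sketch. The node layout is illustrative (a dict with c_leaf and optional left/right children), not the toolbox's representation:

```python
def prune(node):
    """Bottom-up pass: return the node's final c_both, pruning
    subtrees in place whenever c_leaf < c_both."""
    if 'left' not in node:                 # leaf
        node['c_both'] = node['c_leaf']
        return node['c_both']
    # Internal node: the children's c_both must be assigned first.
    c_both = 2 + prune(node['left']) + prune(node['right'])
    if node['c_leaf'] < c_both:
        # Subtree does not contribute much information: prune it,
        # turning this node into a leaf.
        del node['left'], node['right']
        node['c_both'] = node['c_leaf']
    else:
        node['c_both'] = c_both
    return node['c_both']
```

For instance, a root with c_leaf = 3 whose two leaf children each have c_leaf = 5 yields a subtree cost of 2 + 5 + 5 = 12 > 3, so the subtree is pruned.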
IBM Parallel Machine Learning toolbox
A toolbox for conducting large-scale machine learning
– Supports architectures ranging from single machines with multiple cores to large distributed clusters
Works by distributing the computations across multiple nodes
– Allows for rapid learning of very large datasets
Includes state-of-the-art machine learning algorithms for:
– Classification: support vector machines (SVM), decision trees
– Regression: linear and SVM
– Clustering: k-means, fuzzy k-means, kernel k-means, Iclust
– Feature reduction: principal component analysis (PCA) and kernel PCA
Includes an API for adding algorithms
Freely available from alphaWorks
Joint project of the Haifa Machine Learning group and the Watson Data Analytics group
[Figure: k-means on Blue Gene — speedup (compared to a single node) vs. number of processors; speedup axis 0–18, processor axis 0–1200.]
[Diagram: master/worker timeline — the master initializes parameters; workers compute the kernel matrix and local updates over their data; the master computes a global update; the loop repeats until convergence.]
(Shameless PR slide)
Results: Comparing single node solvers
Dataset       Examples        Features   Standard tree   SPDT
Adult         32561 (16281)   105        17.7            15.7
Isolet        6238 (1559)     617        18.7            14.6
Letter        20000           16         7.5             8.6
Nursery       12960           25         1.0             2.6
Page blocks   5473            10         3.1             3.1
Pen digits    7494 (3498)     16         4.6             5.4
Spambase      4601            57         8.4             10.5
No statistically significant difference between the two.
Ten-fold cross-validation, unless a test/train partition exists (test-set size in parentheses).
Results: Pruning
Dataset       Standard tree   SPDT before pruning   SPDT after pruning   Tree size before   Tree size after
Adult         17.7            15.7                  14.3                 1645               409
Isolet        18.7            14.6                  17.8                 211                141
Letter        7.5             8.6                   9.3                  135                67
Nursery       1.0             2.6                   3.2                  178                167
Page blocks   3.1             3.1                   3.4                  55                 36
Pen digits    4.6             5.4                   5.8                  89                 81
Spambase      8.4             10.5                  11.4                 572                445
Up to 80% reduction in tree size.
Speedup (Strong scalability)
[Figures: speedup curves for Alpha and Beta]
Speedup improves with data size!
Weak scalability
[Figures: weak-scalability curves for Alpha and Beta]
Scalability improves with the number of processors!
Algorithm complexity
Summary
An efficient new algorithm for parallel streaming decision trees
Results as good as single-node trees, but with scalability that improves with the data size and the number of processors
Ongoing work: proving that the algorithm's output is only ε-different from that of the standard decision tree algorithm
Thank You
תודה (Hebrew: "Toda") · Merci (French) · Grazie (Italian) · Gracias (Spanish) · Obrigado (Portuguese) · Danke (German) · Kiitos (Finnish)
[The slide also shows "thank you" in Japanese, Russian, Arabic, Traditional and Simplified Chinese, Thai, Korean, and Danish; those scripts did not survive extraction.]