
Big Data Paradigms in Python
San Diego Data Science and R Users Group | January 2014

Thank you to our sponsors:

Kevin Davenport | http://kldavenport.com | [email protected] | @KevinLDavenport


Setting up your environment

Anaconda: a completely free, enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing.

Spend time writing code, not working to set up a system. Create interactive plots in your browser with Bokeh or D3.

http://ipython.org/notebook.html http://docs.continuum.io/anaconda/index.html http://pandas.pydata.org http://scikit-learn.org/stable/


Anaconda

$ conda update conda

$ conda update anaconda

$ conda update numpy

$ conda update bokeh

$ conda update numba

$ conda install ggplot


Wakari (Continuum's hosted, browser-based Python data analysis environment)


Two Worlds

Big Data:
• Size
• Computing clusters
• Distributed storage

Everything else:
• Commodity hardware
• Python programming


Tools

scikit-learn
• Python ML library built on NumPy, SciPy, and matplotlib

joblib
• Run Python functions as pipeline jobs

The scikit-learn estimator API:
• Learn a model from the data: estimator.fit(X_train, y_train)
• Predict using the learned model: estimator.predict(X_test)
• Test goodness of fit: estimator.score(X_test, y_test)
• Apply a change of representation: estimator.transform(X)
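A minimal sketch of that estimator API in use; LogisticRegression and the toy data here are stand-ins for illustration, not from the original slides:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: 200 observations, 5 features, binary labels
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

estimator = LogisticRegression()
estimator.fit(X_train, y_train)          # learn a model from the data
pred = estimator.predict(X_test)         # predict using the learned model
acc = estimator.score(X_test, y_test)    # test goodness of fit (mean accuracy)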


Design

1. Debugging: %debug, %time, %timeit, %lprun, %prun, %mprun, %memit
2. Dependency hell: Homebrew, Anaconda, Enthought
3. Get to NumPy as fast as possible: optimized C drivers/connectors
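For example, inside an IPython/Jupyter session (the % magics are IPython features, and this snippet is only an illustration):

import numpy as np
x = np.random.rand(10 ** 6)

%timeit x ** 2            # best-of-several timing for a single expression
%time x.sort()            # wall-clock time for one statement
%prun np.linalg.norm(x)   # function-level profile of a call
# %lprun, %mprun and %memit require the line_profiler / memory_profiler extensions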


Efficient Data Handling

1. On-the-fly data reduction
2. On-line algorithms
3. Parallel processing patterns
4. Caching


Efficient Data Handling Schemes: 1. On-the-fly data reduction


Simpler Case for ML

“Data-driven” work needs ML because of the curse of dimensionality


Manner of Big

1. Large n (many observations)
2. Large m (many features / descriptors)


Less data = less work

1. Big data is often I/O bound
2. Layer memory access:
   • CPU cache
   • RAM
   • Local disks
   • Distant storage


Dropping Data in a Loop

1. Take a random subset/sample of the data
2. Apply the algorithm to that subset
3. Aggregate results across subsets (a sketch follows below)

- Run the loop in parallel
- Exploit redundancy across observations
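A minimal sketch of that pattern; the subset size and number of repeats are arbitrary illustration choices, and the "algorithm" here is just a mean:

import numpy as np

rng = np.random.RandomState(0)
data = rng.rand(10 ** 6)                 # stand-in for a large array

estimates = []
for _ in range(20):                      # each iteration is independent, so the loop parallelizes
    subset = rng.choice(data, size=10000, replace=False)    # 1. random subset
    estimates.append(subset.mean())                          # 2. apply the algorithm
result = np.mean(estimates)                                   # 3. aggregate across subsets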


Bootstrap Aggregating (Bagging)

[Diagram: draw a sample from the data, then resample that sample with replacement]


Dimension Reduction

• Random projections (averaging features): sklearn.random_projection
• Fast (sub-optimal) clustering of features: sklearn.cluster.WardAgglomeration
• Hashing (observations of varying size, e.g. words): sklearn.feature_extraction.text.HashingVectorizer


Gaussian Random Projection
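The slide's code is not preserved in this transcript; a minimal sketch of what sklearn.random_projection provides (shapes and n_components are arbitrary):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X = np.random.rand(500, 10000)                         # many features
transformer = GaussianRandomProjection(n_components=100)
X_small = transformer.fit_transform(X)                 # (500, 100): cheap, approximately distance-preserving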


PCA using randomized SVD
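Again, the original code is not preserved; one way to get PCA via randomized SVD in scikit-learn (older releases exposed this as RandomizedPCA, newer ones as a solver option on PCA):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(2000, 500)
pca = PCA(n_components=50, svd_solver='randomized', random_state=0)
X_reduced = pca.fit_transform(X)    # approximate principal components from a randomized SVD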


Efficient Data Handling Schemes: 2. On-line algorithms


Convergence

1. With i.i.d. sampling, estimates converge to expectations under the distribution of interest
2. Mini-batch: bunch observations together; a trade-off between memory usage and vectorization (sketch below)
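A minimal sketch of the mini-batch idea with an incrementally trainable estimator; SGDClassifier and the synthetic batches are my own illustration, not named in the deck:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])            # must be declared on the first partial_fit call

for _ in range(100):                   # stream of mini-batches, never all in memory at once
    X_batch = np.random.rand(256, 20)                    # 256 observations per batch
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)   # update the model incrementally

Larger batches vectorize better but use more memory; smaller batches do the opposite.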


Batch


http://scikit-learn.org/stable/modules/clustering.html


Batch vs. mini-batch timing:

Vanilla: 50.9 ms
Mini-batch: 15.1 ms
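Those timings likely compare full-batch and mini-batch clustering; a hedged reconstruction of such a comparison with scikit-learn (the estimators and parameters are my assumption, and the numbers will differ by machine):

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.random.rand(10000, 20)

km = KMeans(n_clusters=8, n_init=3).fit(X)                               # "vanilla" batch K-Means
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1000, n_init=3).fit(X)    # updates centers from small chunks

In a notebook, wrapping each fit in %timeit reproduces the kind of comparison shown above.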


Efficient Data Handling Schemes: 3. Parallel processing patterns


Embarrassingly Parallel Loops

[Diagram: six tasks (A-F) scheduled onto 4 UEs, contrasting poor load balance with good load balance]

Unit of Execution - a collection of concurrently-executing entities, usually either processes or threads


joblib

Running Python functions as pipeline jobs:
• Don't change your code
• No dependencies
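A minimal joblib sketch; the function and arguments are arbitrary examples:

from joblib import Parallel, delayed

def process(i):
    # any plain Python function; joblib dispatches each call to a worker
    return i ** 2

# eight independent calls across four workers, with no change to process() itself
results = Parallel(n_jobs=4)(delayed(process)(i) for i in range(8))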


Return vs Yield Quick Review
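The review code itself is not preserved in this transcript; the usual contrast, as a minimal example of my own:

def squares_return(n):
    # builds the whole list in memory before returning it
    return [i ** 2 for i in range(n)]

def squares_yield(n):
    # a generator: produces one value at a time, on demand
    for i in range(n):
        yield i ** 2

total = sum(squares_yield(10 ** 6))   # iterates lazily; the full list never exists in memory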


scikit-learn Integration

• Cross-validation: cross_val_score(model, X, y, n_jobs=4, cv=3)
• Grid search: GridSearchCV(model, param_grid, n_jobs=4, cv=3).fit(X, y)
• Random forests: RandomForestClassifier(n_jobs=4).fit(X, y)
                  ExtraTreesClassifier(n_jobs=4).fit(X, y)
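A runnable sketch of that n_jobs pattern; the data and parameter grid are placeholders of my own (in current scikit-learn these helpers live in sklearn.model_selection):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X = np.random.rand(500, 10)
y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50)
scores = cross_val_score(model, X, y, cv=3, n_jobs=4)                               # folds evaluated in parallel
grid = GridSearchCV(model, {'max_depth': [3, 5, None]}, cv=3, n_jobs=4).fit(X, y)   # grid points in parallel
forest = RandomForestClassifier(n_estimators=50, n_jobs=4).fit(X, y)                # trees grown in parallel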


In-memory case: replicate the full dataset (all data, all labels to predict) across workers and train forest models in parallel, seeding each model with a different random_state integer; then combine the results: clf = (clf_1 + clf_2 + clf_3).

http://ogrisel.com/ http://scikit-learn.org/dev/modules/ensemble.html#random-forests


Too-large-for-memory case: partition the dataset instead of replicating it (Data 1/Labels 1, Data 2/Labels 2, Data 3/Labels 3), train one forest model per partition in parallel, again seeding each model with a different random_state integer, and combine: clf = (clf_1 + clf_2 + clf_3).

http://ogrisel.com/ http://scikit-learn.org/dev/modules/ensemble.html#random-forests
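One hedged way to realize that picture; the partitioning, seeding, and probability-averaging combination below are my own illustration rather than the original code:

import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier

def fit_partition(X_part, y_part, seed):
    # each worker sees only its own partition and its own random seed
    return RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_part, y_part)

X = np.random.rand(3000, 10)
y = (X[:, 0] > 0.5).astype(int)
partitions = np.array_split(np.arange(len(X)), 3)        # Data 1/2/3 and Labels 1/2/3

forests = Parallel(n_jobs=3)(
    delayed(fit_partition)(X[idx], y[idx], seed) for seed, idx in enumerate(partitions)
)

# combine clf_1, clf_2, clf_3 by averaging their class-probability estimates
proba = np.mean([clf.predict_proba(X) for clf in forests], axis=0)
pred = proba.argmax(axis=1)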


Efficient Data Handling Schemes: 4. Caching


Memoize
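The memoization code is not preserved in this transcript; disk-backed caching with joblib.Memory is the usual tool, shown here as a minimal sketch with a made-up expensive function:

import numpy as np
from joblib import Memory

memory = Memory('./joblib_cache', verbose=0)   # results are cached to disk

@memory.cache
def expensive_transform(x):
    # pretend this is slow; calling again with the same input reads from the cache
    return np.vander(x, 500)

x = np.random.rand(2000)
a = expensive_transform(x)    # computed and cached
b = expensive_transform(x)    # loaded from the on-disk cache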


Don't underestimate the cost of complexity, whether cognitive, maintenance, mutability, or portability.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth


Please Donate to numfocus.org