Simple big data, in Python

Gaël Varoquaux

Post on 02-Jul-2015

DESCRIPTION

This is a high-level big picture talk on machine learning, Python, and patterns of big data

TRANSCRIPT

Page 1: Simple big data, in Python

Simple big data, in Python

Gael Varoquaux

Page 2: Simple big data, in Python

This is a lie!

Page 3: Simple big data, in Python

Please allow me to introduce myself
I’m a man of wealth and taste
I’ve been around for a long, long year

Physicist gone bad
Neuroscience, machine learning

Worked in a software startup
Enthought: scientific computing consulting in Python

Coder (done my share of mistakes)
Mayavi, scikit-learn, joblib...

Scipy community
Chair of scipy and EuroScipy conferences

Researcher (PI) at INRIA

G Varoquaux 2

Page 4: Simple big data, in Python

1 Machine learning in 2 words

2 Scikit-learn

3 Big data on a budget

4 The bigger picture: a community

Page 5: Simple big data, in Python

1 Machine learning in 2 words

Page 6: Simple big data, in Python

1 A historical perspective

Artificial Intelligence (the 80s)
Building decision rules: Eatable? Mobile? Tall?

Machine learning (the 90s)
Learn these from observations

Statistical learning (2000s)
Model the noise in the observations

Big data (today)
Many observations, simple rules

“Big data isn’t actually interesting without machine learning”
Steve Jurvetson, VC, Silicon Valley

Page 12: Simple big data, in Python

1 Machine learning

Example: face recognition

Andrew Bill Charles Dave

?

Page 13: Simple big data, in Python

1 A simple method

1 Store all the known (noisy) images and the names that go with them.

2 For a new (noisy) image, find the stored image that is most similar.

“Nearest neighbor” method

How many errors on already-known images? ... 0: no errors

Test data ≠ train data
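
The point of this slide can be checked in a few lines. What follows is a minimal sketch, not code from the talk: the two-class 2-D data, the noise level and the helper names are all made up for illustration. It stores the training points, labels a query with the label of the closest stored point, and counts errors on known data versus new data.

```python
import random


def nearest_neighbor(train_points, train_labels, query):
    """Return the label of the stored point closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(train_points)),
               key=lambda i: squared_distance(train_points[i], query))
    return train_labels[best]


def make_noisy_samples(rng, n_per_class, noise):
    """Hypothetical data: two classes centred on (0, 0) and (1, 1)."""
    points, labels = [], []
    for label, centre in enumerate([(0.0, 0.0), (1.0, 1.0)]):
        for _ in range(n_per_class):
            points.append(tuple(c + rng.gauss(0, noise) for c in centre))
            labels.append(label)
    return points, labels


rng = random.Random(0)
X_train, y_train = make_noisy_samples(rng, 50, noise=0.7)
X_test, y_test = make_noisy_samples(rng, 50, noise=0.7)

# On already-known points the closest stored point is the point itself.
train_errors = sum(nearest_neighbor(X_train, y_train, x) != y
                   for x, y in zip(X_train, y_train))
# On new points the noise makes some neighbors carry the wrong label.
test_errors = sum(nearest_neighbor(X_train, y_train, x) != y
                  for x, y in zip(X_test, y_test))
print("errors on known data:", train_errors)
print("errors on new data:", test_errors)
```

The error count on known data says nothing about new data, which is why the two must be kept apart.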

Page 15: Simple big data, in Python

1 1st problem: noise

Signal unrelated to the prediction problem

[Figure: prediction accuracy (axis from 0.3 to 1.0) as a function of noise level (axis from 0.0 to 3.0)]

Page 16: Simple big data, in Python

1 2nd problem: number of variables

Finding a needle in a haystack

[Figure: prediction accuracy (axis from 0.65 to 0.95) as a function of the useful fraction of the frame (axis from 1 to 10)]

Page 17: Simple big data, in Python

1 Machine learning

Example: face recognition

Andrew Bill Charles Dave

?

Learning from numerical descriptors

Difficulties: i) noise, ii) number of features

“supervised” task: known labels
“unsupervised” task: unknown labels

Page 19: Simple big data, in Python

1 Machine learning: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Which model to prefer?

Page 20: Simple big data, in Python

1 Machine learning: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Problem of “over-fitting”

Minimizing error is not always the best strategy (learning noise)

Test data ≠ train data

Page 21: Simple big data, in Python

1 Machine learning: regression

A single descriptor: one dimension

[Figure: two plots of y against x]

Prefer simple models = concept of “regularization”

Balance the number of parameters to learn with the amount of data
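
One common way to "prefer simple models" is to penalize large coefficients. The sketch below is an illustration under assumptions, not the slide's own example: it fits a single slope with an L2 (ridge) penalty, and the data, the true slope of 2 and the penalty strength `alpha` are all hypothetical.

```python
import random


def fit_slope(xs, ys, alpha=0.0):
    """Least-squares slope of y ~ w * x with an L2 penalty alpha * w**2.

    Closed form: w = sum(x * y) / (sum(x * x) + alpha).
    alpha = 0 gives ordinary least squares; larger alpha shrinks w.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


rng = random.Random(0)
# Hypothetical data: true slope 2, only a few noisy observations.
xs = [rng.uniform(-1, 1) for _ in range(5)]
ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]

w_plain = fit_slope(xs, ys)             # minimizes the training error only
w_regularized = fit_slope(xs, ys, 1.0)  # trades training error for a smaller w
print(w_plain, w_regularized)
```

With few observations the penalty dominates and pulls the slope towards zero; as the amount of data grows, `sum(x * x)` grows and the same `alpha` matters less, which is one reading of "balance the number of parameters with the amount of data".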

Page 23: Simple big data, in Python

1 Machine learning: regression

A single descriptor: one dimension

[Figure: y against x]

Two descriptors: 2 dimensions

[Figure: y against X_1 and X_2]

More parameters ⇒ need more data

“curse of dimensionality”
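
A rough way to see the "curse of dimensionality": with a fixed number of observations, points drift apart as descriptors are added, so each point has fewer close neighbors to learn from. The sketch below is only an illustration; uniform random points and the sizes used are assumptions.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each point to its closest other point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


# Same number of observations, more and more descriptors.
for n_dims in (1, 2, 10, 50):
    print(n_dims, round(mean_nearest_neighbor_distance(200, n_dims), 3))
```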

Page 24: Simple big data, in Python

1 Supervised learning: classification

Predicting categories, e.g. numbers

[Figure: axes X1 and X2]

Page 25: Simple big data, in Python

1 Unsupervised learning

Stock market structure

Unlabeled data more common than labeled data

Page 28: Simple big data, in Python

1 Recommender systems

Andrew ? ? ? ?
Bill ?? ? ? ?
Charles ? ? ? ??
Dave ? ? ? ?
Edie ?? ? ? ?

Little overlap between users

Page 29: Simple big data, in Python

1 Machine learning

Challenges

Statistics

Computational

Page 30: Simple big data, in Python

1 Petty day-to-day technicalities

Buggy code

Slow code

Lead data scientist leaves

New intern to train

I don’t understand the code I wrote a year ago

An in-house data science squad

Difficulties: recruitment, limited resources (people & hardware)

Risks: bus factor, technical debt

We need big data (and machine learning) on a tight budget

Page 33: Simple big data, in Python

2 Scikit-learn

Machine learning without learning the machinery

© Theodore W. Gray

Page 34: Simple big data, in Python

2 My stack

Python, what else?
- Interactive language
- Easy to read / write
- General purpose

Page 35: Simple big data, in Python

2 My stack

Python, what else?

The scientific Python stack: numpy arrays, pandas, ...

It’s about plugging things together

Page 36: Simple big data, in Python

2 scikit-learn vision

Machine learning for all
- No specific application domain
- No requirements in machine learning

High-quality software library
- Interfaces designed for users

Community-driven development
- BSD licensed, very diverse contributors

http://scikit-learn.org

Page 37: Simple big data, in Python

2 A Python library

A library, not a program
- More expressive and flexible
- Easy to include in an ecosystem

As easy as py

from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

Page 38: Simple big data, in Python

2 Very rich feature set

Supervised learning
- Decision trees (Random-Forest, Boosted Tree)
- Linear models
- SVM

Unsupervised learning
- Clustering
- Dictionary learning
- Outlier detection

Model selection
- Built-in cross-validation
- Parameter optimization

Page 39: Simple big data, in Python

2 Computational performance

              scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
SVM           5.2            9.47    17.5      11.52    40.48   5.63
LARS          1.17           105.3   -         37.35    -       -
Elastic Net   0.52           73.7    -         1.44     -       -
kNN           0.57           1.41    -         0.56     0.58    1.36
PCA           0.18           -       -         8.93     0.47    0.33
k-Means       1.34           0.79    ∞         -        35.75   0.68

Algorithmic optimizations

Minimizing data copies

Random Forest fit time (s)

Scikit-Learn-RF    203.01     (Scikit-Learn: Python, Cython)
Scikit-Learn-ETs   211.53
OpenCV-RF          4464.65    (OpenCV: C++)
OpenCV-ETs         3342.83
OK3-RF             1518.14    (OK3: C)
OK3-ETs            1711.94
Weka-RF            1027.91    (Weka: Java)
R-RF               13427.06   (randomForest: R, Fortran)
Orange-RF          10941.72   (Orange: Python)

Figure: Gilles Louppe

Page 43: Simple big data, in Python

3 Big data on a budget

Biggish

“Big data”:
- Petabytes...
- Distributed storage
- Computing cluster

Mere mortals:
- Gigabytes...
- Python programming
- Off-the-shelf computers

Simple data processing patterns

Page 44: Simple big data, in Python

“Big data”, but big how?

2 scenarios:
- Many observations – samples, e.g. Twitter
- Many descriptors per observation – features, e.g. brain scans

Page 46: Simple big data, in Python

3 On-line algorithms

Process the data one sample at a time

Compute the mean of a gazillion numbers

Hard?

No: just do a running mean
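
A minimal sketch of the running mean: each value is seen once and only a count and the current mean are kept in memory. The function name and the example stream are illustrative, not from the talk.

```python
def running_mean(stream):
    """Mean of an iterable, seeing each value once and storing two numbers."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the mean towards the new value by 1/count.
        mean += (value - mean) / count
    return mean


# The generator is consumed lazily; the numbers never sit in memory together.
print(running_mean(x * 0.5 for x in range(1_000_000)))
```

The same update works when the values arrive from a file or a socket, since nothing requires knowing the total length in advance.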

Page 47: Simple big data, in Python

3 On-line algorithms

Converges to expectations

Mini-batch = a bunch of observations, for vectorization

Example: K-Means clustering

X = np.random.normal(size=(10000, 200))

scipy.cluster.vq.kmeans(X, 10, iter=2)
11.33 s

sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)
0.62 s

Page 48: Simple big data, in Python

3 On-the-fly data reduction

Big data is often I/O bound

Layers of memory access:
- CPU caches
- RAM
- Local disks
- Distant storage

Less data also means less work

Page 49: Simple big data, in Python

3 On-the-fly data reduction

Dropping data

1 loop: take a random fraction of the data

2 run algorithm on that fraction

3 aggregate results across sub-samplings

Looks like bagging: bootstrap aggregation

Exploits redundancy across observations

Run the loop in parallel
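
The three steps can be sketched as follows. This is an illustration under assumptions: the data, the 1% fraction, the number of rounds and the choice of the median as "algorithm" are hypothetical, and the sketch samples without replacement, whereas the bootstrap proper draws with replacement.

```python
import random
import statistics


def subsampled_estimate(data, estimator, fraction=0.01, n_rounds=10, seed=0):
    """Run `estimator` on random fractions of `data` and average the results."""
    rng = random.Random(seed)
    size = max(1, int(len(data) * fraction))
    estimates = []
    for _ in range(n_rounds):
        subset = rng.sample(data, size)       # 1. take a random fraction
        estimates.append(estimator(subset))   # 2. run the algorithm on it
    return statistics.fmean(estimates)        # 3. aggregate the results


rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(100_000)]
approx = subsampled_estimate(data, statistics.median)
exact = statistics.median(data)
print(approx, exact)
```

Each round is independent of the others, so the loop body is a natural unit to hand to parallel workers.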

Page 50: Simple big data, in Python

3 On-the-fly data reduction

Random projections (will average features)
sklearn.random_projection
random linear combinations of the features

Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy

Hashing when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
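
To make the hashing idea concrete, here is a simplified sketch of the "hashing trick". It is not scikit-learn's implementation: `HashingVectorizer` uses a different hash function and has options not shown here, and `zlib.crc32`, the function name and the vector length are stand-ins chosen for illustration.

```python
import zlib


def hash_features(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-length count vector."""
    vector = [0] * n_features
    for token in tokens:
        # The column depends only on the token itself: no vocabulary to store.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


short_doc = "big data".split()
long_doc = "simple big data in python with simple patterns".split()
print(hash_features(short_doc))
print(hash_features(long_doc))
```

Because no state is learned from the data, two workers hashing different documents produce compatible vectors without talking to each other. The price is that distinct tokens can collide in the same column.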

Page 51: Simple big data, in Python

3 On-the-fly data reduction

Example: randomized SVD (random projection)
sklearn.utils.extmath.randomized_svd

X = np.random.normal(size=(50000, 200))

%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop

%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop

%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop

linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738

linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925

Page 52: Simple big data, in Python

3 Data-parallel computing

I tackle only embarrassingly parallel problems
Life is too short to worry about deadlocks

Stratification to follow the statistical dependencies and the data storage structure

Batch size scaled by the relevant cache pool
- Too fine ⇒ overhead
- Too coarse ⇒ memory shortage

joblib.Parallel
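
The batching pattern can be sketched with the standard library alone. This is a stand-in for illustration, not joblib: it uses `concurrent.futures.ThreadPoolExecutor` where the slide names `joblib.Parallel`, and the batch size and the work function are hypothetical. Threads in CPython do not speed up pure-Python arithmetic, so the sketch shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Independent unit of work: no shared state, hence no locks."""
    return sum(x * x for x in batch)


data = list(range(10_000))
# Too small a batch means dispatch overhead, too large means memory pressure.
batches = make_batches(data, batch_size=1_000)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(process_batch, batches))

total = sum(partial_sums)
print(total)
```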

Page 53: Simple big data, in Python

3 Caching

Minimizing data-access latency

Never computing the same thing twice

joblib.cache
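
The principle can be sketched with an in-memory cache from the standard library. This is a stand-in, not joblib: `functools.lru_cache` keeps results in RAM for the life of the process, whereas joblib's caching persists results to disk. The function and its cost are hypothetical.

```python
import functools

calls = []  # records every real execution of the function


@functools.lru_cache(maxsize=None)
def expensive_transform(n):
    """Hypothetical costly computation."""
    calls.append(n)
    return sum(i * i for i in range(n))


first = expensive_transform(100_000)
second = expensive_transform(100_000)  # served from the cache, not recomputed
print(len(calls))
```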

Page 54: Simple big data, in Python

3 Fast access to data

Stored representation consistent with access patterns

Compression to limit bandwidth usage
- CPUs are faster than data access

Data access speed is often more of a limitation than raw processing power

joblib.dump/joblib.load
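
A rough sketch of trading CPU time for bandwidth, using only the standard library. It is a stand-in for illustration, not joblib's implementation: `gzip` plus `pickle` play the role of `joblib.dump` with compression enabled, and the helper names and the redundant example data are hypothetical.

```python
import gzip
import os
import pickle
import tempfile


def dump_compressed(obj, path):
    """Pickle `obj` and write it gzip-compressed to `path`."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Read back an object written by dump_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical, highly redundant data: compresses very well.
data = [i % 10 for i in range(100_000)]
raw_size = len(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl.gz")
    dump_compressed(data, path)
    compressed_size = os.path.getsize(path)
    restored = load_compressed(path)

print(raw_size, compressed_size)
```

Fewer bytes cross the disk or the network, at the cost of some CPU work on each side. How much is gained depends on how redundant the data really is.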

Page 55: Simple big data, in Python

3 Biggish iron

Our new box: 15 k€
- 48 cores
- 384G RAM
- 70T storage (SSD cache on RAID controller)

Gets our work done faster than our 800 CPU cluster

It’s the access patterns!

“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12

Page 56: Simple big data, in Python

4 The bigger picture: a community

Helping your future self

Page 57: Simple big data, in Python

4 Community-based development in scikit-learn

Huge feature set: benefits of a large team

Project growth:
- More than 200 contributors
- ∼ 12 core contributors
- 1 full-time INRIA programmer from the start

Estimated cost of development: $6 million
COCOMO model, http://www.ohloh.net/p/scikit-learn

Page 58: Simple big data, in Python

4 The economics of open source

Code maintenance too expensive to be alone

scikit-learn ∼ 300 email/month
nipy ∼ 45 email/month
joblib ∼ 45 email/month
mayavi ∼ 30 email/month

“Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.”

Your “benefits” come from a fraction of the code
- Data loading? Maybe?
- Standard algorithms? Nah

Share the common code...
...to avoid dying under code

Code becomes less precious with time
And somebody might contribute features

Page 60: Simple big data, in Python

4 Many eyes make code fast

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer

Page 61: Simple big data, in Python

4 Communities increase the knowledge pool

Even if you don’t do software, you should worry about communities

More experts in the package
⇒ Easier recruitment

The future is open
⇒ Enhancements are possible

Meetups, conferences, are where new ideas are born

Page 62: Simple big data, in Python

4 6 steps to a community-driven project

1 Focus on quality

2 Build great docs and examples

3 Use github

4 Limit the technicality of your codebase

5 Releasing and packaging matter

6 Focus on your contributors, give them credit, decision power

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire

Page 63: Simple big data, in Python

4 Core project contributors

Credit: Fernando Perez, Gist 5843625

Page 64: Simple big data, in Python

4 The tragedy of the commons

Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.

Wikipedia

Make it work, make it right, make it boring

Core projects (boring) taken for granted
⇒ Hard to fund, less excitement

They need citation, in papers & on corporate web pages

Page 65: Simple big data, in Python

Simple big data, in Python

Beyond the lie

Machine learning gives value to (big) data

Python + scikit-learn =
- from interactive data processing (IPython notebook)
- to crazy big problems (Python + spark)

Big data will require you to understand the data-flow patterns (access, parallelism, statistics)

Big data community addresses the human factor

@GaelVaroquaux