Strata London, May 2015 - TRANSCRIPT
Deep learning made doubly easy with reusable deep features
Carlos Guestrin - CEO, Dato; Amazon Professor of Machine Learning, University of Washington
Successful apps in 2015 must be intelligent
Machine learning: key to next-gen apps
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social nets
• …
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
The ML pipeline circa 2013:
DATA → ML algorithm → "My curve is better than your curve" → Write a paper
2015: Production ML pipeline
DATA
Your W
eb S
erv
ice o
r In
tellig
ent A
pp
MLAlgorith
m
Data cleaning
& feature
eng
Offline eval &
Parameter search
Deploy model
Data engineering Data intelligence Deployment
Using deep learning
Goal: Platform to help implement, manage, optimize entire pipeline
Today’s talk
• Features in ML
• Neural networks
• Deep learning for computer vision
• Deep learning made easy with deep features
• Applications to text data
• Deployment in production
Features are key to machine learning
Simple example: Spam filtering
• A user walks into an email… will she think it's spam?
• What's the probability the email is spam?
Input x: text of email, user info, source info
MODEL → Output: probability of y (spam? yes / no)
Feature engineering: the painful black art of transforming raw inputs into useful inputs for the ML algorithm
• E.g., important words, stemming text, complex transformations of inputs, …
Input x: text of email, user info, source info
Feature extraction: Φ(x)
MODEL → Output: probability of y (spam? yes / no)
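The feature extraction step Φ(x) can be made concrete with a small sketch. The specific features below (spam-word counts, exclamation marks, capitalization) are illustrative assumptions, not taken from the talk:

```python
import re

# Hypothetical hand-engineered spam features: each entry of phi(x)
# is one transformation of the raw email text.
SPAM_WORDS = {"free", "winner", "viagra", "prize"}

def phi(email_text):
    words = re.findall(r"[a-z']+", email_text.lower())
    tokens = email_text.split()
    return {
        "num_words": len(words),
        "num_spam_words": sum(w in SPAM_WORDS for w in words),
        "num_exclamations": email_text.count("!"),
        "all_caps_ratio": sum(t.isupper() for t in tokens) / max(len(tokens), 1),
    }

features = phi("FREE prize!!! You are a WINNER")
print(features["num_spam_words"], features["num_exclamations"])  # → 3 3
```

A classifier then consumes this feature dictionary instead of the raw text; the "black art" is choosing which transformations actually help.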
Neural networks
Learning *very* non-linear features
Linear classifiers
• Most common classifiers: logistic regression, SVMs, …
• Decision boundary corresponds to a hyperplane: a line in high-dimensional space
w0 + w1 x1 + w2 x2 = 0
Classify 1 if w0 + w1 x1 + w2 x2 > 0; classify 0 if w0 + w1 x1 + w2 x2 < 0
Graph representation of classifier: useful for defining neural networks
Inputs x1, x2, …, xd (plus a constant 1) feed the output node y with weights w0, w1, …, wd:
w0 + w1 x1 + w2 x2 + … + wd xd > 0 → output 1; < 0 → output 0
What can a linear classifier represent?
x1 OR x2: y = 1 if −0.5 + x1 + x2 > 0 (weights w0 = −0.5, w1 = 1, w2 = 1)
x1 AND x2: y = 1 if −1.5 + x1 + x2 > 0 (weights w0 = −1.5, w1 = 1, w2 = 1)
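The OR and AND weight settings can be verified directly with a minimal linear threshold unit:

```python
def linear_unit(weights, bias, inputs):
    # Output 1 if bias + w·x > 0, else 0.
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

def OR(x1, x2):
    return linear_unit([1, 1], -0.5, [x1, x2])

def AND(x1, x2):
    return linear_unit([1, 1], -1.5, [x1, x2])

print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # → [0, 1, 1, 1]
print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # → [0, 0, 0, 1]
```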
What can’t a simple linear classifier represent?
XOR: the counterexample to everything
Need non-linear features
Solving the XOR problem: Adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
Hidden units (thresholded to 0 or 1):
• z1 = x1 AND NOT x2: weights w0 = −0.5, w1 = 1, w2 = −1
• z2 = NOT x1 AND x2: weights w0 = −0.5, w1 = −1, w2 = 1
Output: y = z1 OR z2: weights w0 = −0.5, and weight 1 on each of z1, z2
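This construction can be checked with a minimal two-layer threshold network, using exactly the weights from the slide:

```python
def step(v):
    # Hard threshold: outputs are thresholded to 0 or 1.
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    z1 = step(-0.5 + 1 * x1 - 1 * x2)    # z1 = x1 AND NOT x2
    z2 = step(-0.5 - 1 * x1 + 1 * x2)    # z2 = NOT x1 AND x2
    return step(-0.5 + 1 * z1 + 1 * z2)  # y = z1 OR z2

print([xor_net(a, b) for a in (0, 1) for b in (0, 1)])  # → [0, 1, 1, 0]
```

Adding the hidden layer is what makes the non-linearly-separable XOR representable: each hidden unit carves out one half-space, and the output unit combines them.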
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years; fell into "disfavor" in the 90s
• In the last few years, a big resurgence
- Impressive accuracy on several benchmark problems
- Powered by huge datasets, GPUs, & modeling/learning algorithm improvements
[Diagram: inputs x1, x2 (plus constant 1) feed hidden units z1, z2, which feed output y]
Applications to computer vision (or the deep devil is in the deep details)
Image features
• Features = local detectors, combined to make a prediction
• E.g., eye, eye, nose, and mouth detectors combine to predict: face!
• (In reality, features are more low-level)
Many hand-created features exist…
Standard image classification approach
Input → Extract features → Use simple classifier (e.g., logistic regression, SVMs) → Car?
Many hand-created features exist… but they are very painful to design
Use a neural network to learn the features: each layer learns features at a different level of abstraction
Many tricks needed to work well…
• Different types of layers, connections, … needed for high accuracy (Krizhevsky et al. '12)
Sample performance results
• Traffic sign recognition (GTSRB): 99.2% accuracy
• House number recognition (Google): 94.3% accuracy
• Krizhevsky et al. '12: 60M parameters, won the 2012 ImageNet competition (1.5M images, 1000 categories)
Application to scene parsing
Challenges of deep learning
Deep learning workflow
Lots of labeled data → split into 80% training set / 20% validation set
Learn deep neural net model on the training set → validate → adjust hyper-parameters, model architecture, … → repeat
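The split at the heart of this loop can be sketched in plain Python; the 80/20 fractions come from the slide, while the data itself is a stand-in:

```python
import random

def train_val_split(examples, val_fraction=0.2, seed=0):
    # Shuffle a copy, then hold out the last val_fraction as validation data.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))      # stand-in for labeled examples
train, val = train_val_split(data)
print(len(train), len(val))  # → 80 20
```

Each candidate architecture or hyper-parameter setting is trained on `train` and scored on `val`; only the best-scoring configuration moves forward.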
Deep learning score card
Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains on computer vision, speech recognition, some text analysis
• Potential for much more impact
Cons
• Computationally really expensive
• Requires a lot of data for high accuracy
• Extremely hard to tune: choice of architecture, parameter types, hyperparameters, learning algorithm, …
• Computational cost + so many choices = incredibly hard to tune
Deep features: deep learning + transfer learning
Change image classification approach?
Input → Extract features → Use simple classifier (e.g., logistic regression, SVMs) → Car?
Can we learn features from data, even when we don't have enough data or time?
Transfer learning: use data from one domain to help learn on another
• Lots of data: learn a neural net → great accuracy on cat vs. dog
• Some data: use that neural net as a feature extractor + simple classifier → great accuracy on 101 categories
Old idea, explored for deep learning by Donahue et al. '14
What's learned in a neural net
For a neural net trained on Task 1 (cat vs. dog):
• Early layers are more generic and can be used as a feature extractor
• Last layers are very specific to Task 1 and should be ignored for other tasks
Transfer learning in more detail…
Take the neural net trained for Task 1 (cat vs. dog):
• Keep the generic early-layer weights fixed!
• For Task 2 (predicting 101 categories), learn only the end part: a simple classifier (e.g., logistic regression, SVMs) → Class?
Careful where you cut: the last few layers tend to be too specific to the original task (e.g., too specific for car detection); use the earlier layers!
Transfer learning with deep features
Some labeled data → extract features with a neural net trained on a different task → split 80% training set / 20% validation set → learn simple model → validate → deploy in production
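The essence of this workflow (freeze a pretrained feature extractor, train only a simple classifier on top) can be sketched in miniature. Here the "pretrained" extractor is a hypothetical stand-in built from fixed threshold units, and the top classifier is a perceptron, purely for illustration:

```python
def step(v):
    return 1 if v > 0 else 0

# Stand-in for the frozen lower layers of a net trained on another task:
# three fixed threshold-unit "deep features" (hypothetical).
def deep_features(x):
    return [step(x - 1.0), step(x - 2.0), step(x - 3.0)]

# New task with only a little labeled data: classify whether x > 2.
data = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]

# Train only the simple top classifier (a perceptron) on the frozen features.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(10):
    for x, y in data:
        f = deep_features(x)
        err = y - step(b + sum(wi * fi for wi, fi in zip(w, f)))
        b += err
        w = [wi + err * fi for wi, fi in zip(w, f)]

preds = [step(b + sum(wi * fi for wi, fi in zip(w, deep_features(x))))
         for x, _ in data]
print(preds)  # → [0, 0, 1, 1]
```

Only `w` and `b` are learned; `deep_features` never changes, which is exactly why the new task needs so little data and training time.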
How general are deep features?
Applications to text data
Simple text classification with bag of words
Example word-count vector: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, …, Zaire 0
One "feature" per word → use simple classifier (e.g., logistic regression, SVMs) → Class?
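The bag-of-words vector can be computed in a few lines; the vocabulary below reuses the slide's example words, and the sample sentence is invented to match its counts:

```python
import re
from collections import Counter

# One "feature" per vocabulary word (example words from the slide).
vocab = ["aardvark", "about", "africa", "all", "anxious",
         "apple", "gas", "oil", "zaire"]

def bag_of_words(text):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocab]

vec = bag_of_words("About oil in Africa: all about gas, all of it")
print(vec)  # → [0, 2, 1, 2, 0, 0, 1, 1, 0]
```

Note the vector length equals the vocabulary size, and words outside the vocabulary (like "in" or "of") are simply dropped.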
Word2Vec: neural network for finding a high-dimensional representation per word (Mikolov et al. '13)
Skip-gram model: from a word, predict nearby words in the sentence
Example: "Awesome deep learning talk at Strata" - the neural net maps each word to its own 300-dim representation, which can be viewed as deep features
Related words are placed nearby in the high-dimensional space
(Projecting the 300-dim space into 2 dims with PCA; Mikolov et al. '13)
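Generating the skip-gram training pairs (center word → nearby word) is the simple part of Word2Vec; a minimal sketch with a window of 2:

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, emit (center, context) pairs for every
    # other word within `window` positions of it.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "awesome deep learning talk at strata".split()
pairs = skipgram_pairs(sentence)
```

The neural net is then trained to predict the context word from the center word; its learned hidden representation is the 300-dim embedding.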
Text classification with word embeddings
Instead of one count per word (aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, Zaire 0), embed each word into the 300-dim space
Classifier (e.g., logistic regression, SVMs) with 300 × number_of_words parameters → Class?
Practical example
Blog corpus: closest words to "LOL" in the 300-dim space: Haha, Yea, Hahaha, Hahah, Lisxc, Umm, Hehe, laughingoutloud
Predicts gender of author with 79% accuracy
Deploying ML in production
DATA → ML algorithm → custom model → deployment?
• Data scientist writes a spec; another team (data engineers, data architects, DevOps, app developers) implements it in a 'production' language:
- 6-12 months
- Stale/irrelevant model/approach
- 2 teams maintaining 2 systems
App ↔ API
ML deployment requirements
• Easy to integrate: REST API
• Scalable
• Fault tolerant
• Flexible: any model, any Python
[Diagram: the app calls a load balancer in front of replicated API + cache nodes, each serving GLC/Dato models running in Python]
Do-It-Yourself
• Web service layer: Tornado, Flask, Keen, Django, …
• Caching layer: Redis, Cassandra, Memcached, DynamoDB, MySQL, …
• Logs: Logback, LogStash, Splunk, Loggly, …
• Metrics: AWS CloudWatch, Mixpanel, Librato, …
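A bare-bones version of such a web service layer can be sketched with only the standard library; the `/predict` route and the toy spam model are hypothetical placeholders, and a real deployment would use one of the frameworks above plus the caching, logging, and metrics layers:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model: any callable from a feature dict to a prediction.
def predict(features):
    return {"spam": features.get("num_links", 0) > 3}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature payload and return the model's prediction.
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for this sketch

def serve(port=8080):
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

`serve()` blocks, so in practice each replica would run under a process manager with a load balancer in front, which is exactly the architecture the DIY stack above assembles by hand.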
… or use Dato Predictive Services
Your web service or intelligent app ↔ Dato Predictive Services (caching layer + predictive object server + ML model)
Serves predictions in a robust, scalable, incremental fashion
Swap in a better ML model at any time; serve any model: GraphLab Create, scikit-learn, Python, …
• Out-of-core computation
• Tools for feature engineering
• Rich data type support
• Models built for scale
• App-oriented toolkits
• Advanced ML & extensible
• Deploy models as low-latency REST services
• Same code for distributed computation
• Elastically scale up or out with one command
• Job monitoring & model management
• Deploy existing Python code & models
• Run on AWS EC2 or Hadoop YARN
Dato Platform
[Diagram: Clean → Learn → Deploy across GraphLab Create (SFrame, SGraph, Canvas, Create Engine, Machine Learning Toolkits, SDK), Dato Distributed (Distributed Engine, Job Client, Job Mgmt), and Dato Predictive Services (Predictive Engine, REST Client, Model Mgmt)]
Summary
Deep learning made easy with deep features
• Deep learning: exciting ML development, but slow, lots of tuning, needs lots of data
• Deep features: reuse deep models for new domains
- Needs less data
- Faster training times
- Much simpler tuning
- Can still achieve excellent performance