Strata London, May 2015 - TRANSCRIPT
Deep learning made doubly easy with reusable deep features
Carlos Guestrin - CEO, Dato; Amazon Professor of Machine Learning, University of Washington
Successful apps in 2015 must be intelligent
Machine learning: key to next-gen apps
• Recommenders
• Fraud detection
• Ad targeting
• Financial models
• Personalized medicine
• Churn prediction
• Smart UX (video & text)
• Personal assistants
• IoT
• Social nets
• …
Last decade: Data management
Last 5 years: Traditional analytics
Now: Intelligent apps
The ML pipeline circa 2013:
DATA → ML algorithm → "My curve is better than your curve" → Write a paper
2015: Production ML pipeline
DATA
Your W
eb S
erv
ice o
r In
tellig
ent A
pp
MLAlgorith
m
Data cleaning
& feature
eng
Offline eval &
Parameter search
Deploy model
Data engineering Data intelligence Deployment
Using deep learning
Goal: Platform to help implement, manage, optimize entire pipeline
Today’s talk
• Features in ML
• Neural networks
• Deep learning for computer vision
• Deep learning made easy with deep features
• Applications to text data
• Deployment in production
Features are key to machine learning
Simple example: Spam filtering
• A user walks into an email… will she think it's spam?
• What's the probability the email is spam?
Input x: text of email, user info, source info
MODEL → Output: probability of y (spam? yes / no)
Feature engineering: the painful black art of transforming raw inputs into useful inputs for the ML algorithm
• E.g., important words, stemming text, complex transformations of inputs, …
Input x: text of email, user info, source info
Feature extraction: Φ(x)
MODEL → Output: probability of y (spam? yes / no)
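The feature extraction step Φ(x) can be made concrete with a small sketch. The specific features below (spam-word counts, exclamation marks, capitalization) are illustrative assumptions, not taken from the talk:

```python
import re

# Hypothetical hand-engineered spam features: each entry of phi(x)
# is one transformation of the raw email text.
SPAM_WORDS = {"free", "winner", "viagra", "prize"}

def phi(email_text):
    words = re.findall(r"[a-z']+", email_text.lower())
    tokens = email_text.split()
    return {
        "num_words": len(words),
        "num_spam_words": sum(w in SPAM_WORDS for w in words),
        "num_exclamations": email_text.count("!"),
        "all_caps_ratio": sum(t.isupper() for t in tokens) / max(len(tokens), 1),
    }

features = phi("FREE prize!!! You are a WINNER")
print(features["num_spam_words"], features["num_exclamations"])  # → 3 3
```

A classifier then consumes this feature dictionary instead of the raw text; the "black art" is choosing which transformations actually help.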
Neural networks
Learning *very* non-linear features
Linear classifiers
• Most common classifiers: logistic regression, SVMs, …
• Decision boundary corresponds to a hyperplane: a line in high-dimensional space
w0 + w1 x1 + w2 x2 = 0
Classify 1 if w0 + w1 x1 + w2 x2 > 0; classify 0 if w0 + w1 x1 + w2 x2 < 0
Graph representation of classifier: useful for defining neural networks
Inputs x1, x2, …, xd (plus a constant 1) feed the output node y with weights w0, w1, …, wd:
w0 + w1 x1 + w2 x2 + … + wd xd > 0 → output 1; < 0 → output 0
What can a linear classifier represent?
x1 OR x2: y = 1 if −0.5 + x1 + x2 > 0 (weights w0 = −0.5, w1 = 1, w2 = 1)
x1 AND x2: y = 1 if −1.5 + x1 + x2 > 0 (weights w0 = −1.5, w1 = 1, w2 = 1)
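The OR and AND weight settings can be verified directly with a minimal linear threshold unit:

```python
def linear_unit(weights, bias, inputs):
    # Output 1 if bias + w·x > 0, else 0.
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

def OR(x1, x2):
    return linear_unit([1, 1], -0.5, [x1, x2])

def AND(x1, x2):
    return linear_unit([1, 1], -1.5, [x1, x2])

print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # → [0, 1, 1, 1]
print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # → [0, 0, 0, 1]
```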
What can’t a simple linear classifier represent?
XOR: the counterexample to everything
Need non-linear features
Solving the XOR problem: Adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
Hidden units (thresholded to 0 or 1):
• z1 = x1 AND NOT x2: weights w0 = −0.5, w1 = 1, w2 = −1
• z2 = NOT x1 AND x2: weights w0 = −0.5, w1 = −1, w2 = 1
Output: y = z1 OR z2: weights w0 = −0.5, and weight 1 on each of z1, z2
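This construction can be checked with a minimal two-layer threshold network, using exactly the weights from the slide:

```python
def step(v):
    # Hard threshold: outputs are thresholded to 0 or 1.
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    z1 = step(-0.5 + 1 * x1 - 1 * x2)    # z1 = x1 AND NOT x2
    z2 = step(-0.5 - 1 * x1 + 1 * x2)    # z2 = NOT x1 AND x2
    return step(-0.5 + 1 * z1 + 1 * z2)  # y = z1 OR z2

print([xor_net(a, b) for a in (0, 1) for b in (0, 1)])  # → [0, 1, 1, 0]
```

Adding the hidden layer is what makes the non-linearly-separable XOR representable: each hidden unit carves out one half-space, and the output unit combines them.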
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years; fell into "disfavor" in the 90s
• In the last few years, a big resurgence
- Impressive accuracy on several benchmark problems
- Powered by huge datasets, GPUs, & modeling/learning algorithm improvements
[Diagram: inputs x1, x2 (plus constant 1) feed hidden units z1, z2, which feed output y]
Applications to computer vision (or the deep devil is in the deep details)
Image features
• Features = local detectors, combined to make a prediction
• E.g., eye, eye, nose, and mouth detectors combine to predict: face!
• (In reality, features are more low-level)
Many hand-created features exist…
Standard image classification approach
Input → Extract features → Use simple classifier (e.g., logistic regression, SVMs) → Car?
Many hand-created features exist… but they are very painful to design
Use a neural network to learn the features: each layer learns features at a different level of abstraction
Many tricks needed to work well…
• Different types of layers, connections, … needed for high accuracy (Krizhevsky et al. '12)
Sample performance results
• Traffic sign recognition (GTSRB): 99.2% accuracy
• House number recognition (Google): 94.3% accuracy
• Krizhevsky et al. '12: 60M parameters, won the 2012 ImageNet competition (1.5M images, 1000 categories)
Application to scene parsing
Challenges of deep learning
Deep learning workflow
Lots of labeled data → split into 80% training set / 20% validation set
Learn deep neural net model on the training set → validate → adjust hyper-parameters, model architecture, … → repeat
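The split at the heart of this loop can be sketched in plain Python; the 80/20 fractions come from the slide, while the data itself is a stand-in:

```python
import random

def train_val_split(examples, val_fraction=0.2, seed=0):
    # Shuffle a copy, then hold out the last val_fraction as validation data.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

data = list(range(100))      # stand-in for labeled examples
train, val = train_val_split(data)
print(len(train), len(val))  # → 80 20
```

Each candidate architecture or hyper-parameter setting is trained on `train` and scored on `val`; only the best-scoring configuration moves forward.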
Deep learning score card
Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains on computer vision, speech recognition, some text analysis
• Potential for much more impact
Cons
• Computationally really expensive
• Requires a lot of data for high accuracy
• Extremely hard to tune: choice of architecture, parameter types, hyperparameters, learning algorithm, …
• Computational cost + so many choices = incredibly hard to tune
Deep features: deep learning + transfer learning
Change image classification approach?
Input → Extract features → Use simple classifier (e.g., logistic regression, SVMs) → Car?
Can we learn features from data, even when we don't have enough data or time?
Transfer learning: use data from one domain to help learn on another
• Lots of data: learn a neural net → great accuracy on cat vs. dog
• Some data: use that neural net as a feature extractor + simple classifier → great accuracy on 101 categories
Old idea, explored for deep learning by Donahue et al. '14
What's learned in a neural net
For a neural net trained on Task 1 (cat vs. dog):
• Early layers are more generic and can be used as a feature extractor
• Last layers are very specific to Task 1 and should be ignored for other tasks
Transfer learning in more detail…
Take the neural net trained for Task 1 (cat vs. dog):
• Keep the generic early-layer weights fixed!
• For Task 2 (predicting 101 categories), learn only the end part: a simple classifier (e.g., logistic regression, SVMs) → Class?
Careful where you cut: the last few layers tend to be too specific to the original task (e.g., too specific for car detection); use the earlier layers!
Transfer learning with deep features
Some labeled data → extract features with a neural net trained on a different task → split 80% training set / 20% validation set → learn simple model → validate → deploy in production
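The essence of this workflow (freeze a pretrained feature extractor, train only a simple classifier on top) can be sketched in miniature. Here the "pretrained" extractor is a hypothetical stand-in built from fixed threshold units, and the top classifier is a perceptron, purely for illustration:

```python
def step(v):
    return 1 if v > 0 else 0

# Stand-in for the frozen lower layers of a net trained on another task:
# three fixed threshold-unit "deep features" (hypothetical).
def deep_features(x):
    return [step(x - 1.0), step(x - 2.0), step(x - 3.0)]

# New task with only a little labeled data: classify whether x > 2.
data = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]

# Train only the simple top classifier (a perceptron) on the frozen features.
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(10):
    for x, y in data:
        f = deep_features(x)
        err = y - step(b + sum(wi * fi for wi, fi in zip(w, f)))
        b += err
        w = [wi + err * fi for wi, fi in zip(w, f)]

preds = [step(b + sum(wi * fi for wi, fi in zip(w, deep_features(x))))
         for x, _ in data]
print(preds)  # → [0, 0, 1, 1]
```

Only `w` and `b` are learned; `deep_features` never changes, which is exactly why the new task needs so little data and training time.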
How general are deep features?
Applications to text data
Simple text classification with bag of words
Example word-count vector: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, …, Zaire 0
One "feature" per word → use simple classifier (e.g., logistic regression, SVMs) → Class?
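The bag-of-words vector can be computed in a few lines; the vocabulary below reuses the slide's example words, and the sample sentence is invented to match its counts:

```python
import re
from collections import Counter

# One "feature" per vocabulary word (example words from the slide).
vocab = ["aardvark", "about", "africa", "all", "anxious",
         "apple", "gas", "oil", "zaire"]

def bag_of_words(text):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocab]

vec = bag_of_words("About oil in Africa: all about gas, all of it")
print(vec)  # → [0, 2, 1, 2, 0, 0, 1, 1, 0]
```

Note the vector length equals the vocabulary size, and words outside the vocabulary (like "in" or "of") are simply dropped.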
Word2Vec: neural network for finding a high-dimensional representation per word (Mikolov et al. '13)
Skip-gram model: from a word, predict nearby words in the sentence
Example: "Awesome deep learning talk at Strata" - the neural net maps each word to its own 300-dim representation, which can be viewed as deep features
Related words are placed nearby in the high-dimensional space
(Projecting the 300-dim space into 2 dims with PCA; Mikolov et al. '13)
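Generating the skip-gram training pairs (center word → nearby word) is the simple part of Word2Vec; a minimal sketch with a window of 2:

```python
def skipgram_pairs(tokens, window=2):
    # For each center word, emit (center, context) pairs for every
    # other word within `window` positions of it.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "awesome deep learning talk at strata".split()
pairs = skipgram_pairs(sentence)
```

The neural net is then trained to predict the context word from the center word; its learned hidden representation is the 300-dim embedding.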
Text classification with word embeddings
Instead of one count per word (aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, Zaire 0), embed each word into the 300-dim space
Classifier (e.g., logistic regression, SVMs) with 300 × number_of_words parameters → Class?
Practical example
Blog corpus: closest words to "LOL" in the 300-dim space: Haha, Yea, Hahaha, Hahah, Lisxc, Umm, Hehe, laughingoutloud
Predicts gender of author with 79% accuracy
Deploying ML in production
DATA → ML algorithm → custom model → deployment?
• Data scientist writes a spec; another team (data engineers, data architects, DevOps, app developers) implements it in a 'production' language:
- 6-12 months
- Stale/irrelevant model/approach
- 2 teams maintaining 2 systems
App ↔ API
ML deployment requirements
• Easy to integrate: REST API
• Scalable
• Fault tolerant
• Flexible: any model, any Python
[Diagram: the app calls a load balancer in front of replicated API + cache nodes, each serving GLC/Dato models running in Python]
Do-It-Yourself
• Web service layer: Tornado, Flask, Keen, Django, …
• Caching layer: Redis, Cassandra, Memcached, DynamoDB, MySQL, …
• Logs: Logback, LogStash, Splunk, Loggly, …
• Metrics: AWS CloudWatch, Mixpanel, Librato, …
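A bare-bones version of such a web service layer can be sketched with only the standard library; the `/predict` route and the toy spam model are hypothetical placeholders, and a real deployment would use one of the frameworks above plus the caching, logging, and metrics layers:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model: any callable from a feature dict to a prediction.
def predict(features):
    return {"spam": features.get("num_links", 0) > 3}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON feature payload and return the model's prediction.
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for this sketch

def serve(port=8080):
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

`serve()` blocks, so in practice each replica would run under a process manager with a load balancer in front, which is exactly the architecture the DIY stack above assembles by hand.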
… or use Dato Predictive Services
Your web service or intelligent app ↔ Dato Predictive Services (caching layer + predictive object server + ML model)
Serves predictions in a robust, scalable, incremental fashion
Swap in a better ML model at any time; serve any model: GraphLab Create, scikit-learn, Python, …
• Out-of-core computation
• Tools for feature engineering
• Rich data type support
• Models built for scale
• App-oriented toolkits
• Advanced ML & extensible
• Deploy models as low-latency REST services
• Same code for distributed computation
• Elastically scale up or out with one command
• Job monitoring & model management
• Deploy existing Python code & models
• Run on AWS EC2 or Hadoop YARN
Dato Platform
[Diagram: Clean → Learn → Deploy across GraphLab Create (SFrame, SGraph, Canvas, Create Engine, Machine Learning Toolkits, SDK), Dato Distributed (Distributed Engine, Job Client, Job Mgmt), and Dato Predictive Services (Predictive Engine, REST Client, Model Mgmt)]
Summary
Deep learning made easy with deep features
• Deep learning: exciting ML development, but slow, lots of tuning, needs lots of data
• Deep features: reuse deep models for new domains
- Needs less data
- Faster training times
- Much simpler tuning
- Can still achieve excellent performance