courtesy of xkcd.com. used with permission. - ocw.mit.edu · nguyen a, yosinski j, clune j. deep...
TRANSCRIPT
Image Classification via Deep Learning
Ryan Alexander, Ishwarya Ananthabhotla, Julian Brown, Henry Nassif, Nan Ma, Ali Soylemezoglu
2
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
3
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
4
Deep Learning Refers to...
Machine Learning algorithms designed to
extract high-level abstractions from data
via multi-layered processing architectures
using nonlinear transformations at each layer
5
Human Visual System• Distributed Hierarchical processing in the primate cerebral cortex (1991)
6© AAAS. All rights reserved. This content is excluded from our Creative Commonsllicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
How To Classify a Face?
• Identify where the face region is• Foreground Extraction• Edge Detection
• Classify features of the face• Identify and describe eyes, nose, mouth areas
• Look at face as a collection of those features
© ACM, Inc. All rights reserved. This contentis excluded from our Creative Commonslicense. For more information, seehttps://ocw.mit.edu/help/faq-fair-use/.
7
7
Common Architectures
•Deep Convolutional Neural Networks (CNNs)•Deep Belief Networks (DBNs)•Recurrent Neural Network
8
Common Architectures
•Deep Convolutional Neural Networks (CNNs)•Deep Belief Networks (DBNs)•Recurrent Neural Network
9
ImageNet Competition Through Time
10
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
11
Classic Classification -- Feature Engineering
12
What if the techniques could be “learned”?
© ACM, Inc. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
13
Step 1: Convolution - Definition
Informal Definition: Procedure where two sources of information are intertwined.
Formal Definition :
Discrete :
Continuous :
14
Convolution - Example
Assume the following kernel/filter :
1 0 1
0 1 0
1 0 1
15
Convolution
16
1 1 1
1 -8 1
1 1 1
1 2 1
0 0 0
-1 -2 -1
.0113 .0838 .0113
.0838 .6193 .0838
.0113 .0838 .0113
17
More Information? Fourier Transform!Sum of a set of sinusoidal gratings differing in spatial frequency, orientation,
amplitude, phase
18
Fourier Transform● Fourier Transform image itself is weird to visualize -- Phase and Magnitude!
● Magnitude -- orientation information at all spatial scales
● Phase -- contour information
19
20
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
21
Why Neural Net
Hubel & Wiesel (1959, 1962)
22© source unknown. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
The Structure of a Neuron
input_1
input_2
input_n
. . .
Sum Σx (weight_2) Activation output
23
Combining Neurons into Nets
24
Convolution Step
input_1
input_2
input_n
. . .
Sum Σx (weight_2) Activation output
Convolution Step (dot product between filter and input)
25
Convolutional Layer
26
Activation Step
input_1
input_2
input_n
. . .
Sum Σx (weight_2) Activation output
Activation Step
27
Activation Layer
28
CNN overview
Convolution and
Activation Subsampling
29
Activation Step
Each neuron adds up its inputs, and then feeds the sum into a function -- the activation function -- to determine the neuron's output.
Eg : Sigmoid, tanh, ReLu
30
Activation functions - sigmoid
31
Activation function - tanh
32
Activation function - ReLu
33
Non-linearity Constraint
Activation function is to introduce non-linearity into the network
Without a nonlinear activation function in the network, NN, no matter how many layers it has, will behave like a linear system and we will not be able to mimic a ‘complicated’ function
A neural network may very well contain neurons with linear activation functions, such as in the output layer, but these require the company of neurons with a nonlinear activation function in other parts of the network.
34
Convolution Step
An RGB image is represented by a 3 dimensional matrix
The first channel holds the ‘R’ value of each pixel
The second channel holds the ‘G’ value of each pixel
The third channel holds the ‘B’ value of each pixel
Eg: A 32x32 image is represented by a 32x32x3 matrix
35
32x32x3
Filter 5x5x3
36
32x32x337
Input Volume vs Output Volume for convolution
W1
H1
D1
W2
H2
D2
W2 = W1 - (filter width) + 1
H2 = H1 - (filter height) + 1
D2 = 1 (D1 = filter depth)
Input Output
38
Neurons Activation Map
28x28x1 28x28x139
32x32x3 ( 28x28x1 ) * 5 40
Parameters
Input volume: 32x32x3
Filter size : 5x5x3
Size of 1 activation map: 28*28*1
Depth of first layer: 5
Total Number of neurons: 28*28*5 = 3920
Weights per neuron: 5*5*3 = 75
Total Number of parameters: 75*3920 = 294 000
Neurons Activation Map
41
CNN overview
Convolution and
Activation Subsampling
42
SubsamplingObjectives:
Reduce the size of input/feature space
Keep output of the most responsive neuron of the given interest region.
Common Methods:
- Max Pooling
- Average Pooling
This involves splitting up the matrix of filter outputs into small non-overlapping grids and taking the maximum/average
43
44
Max Pooling
45
Input Volume vs Output Volume for Max Pooling
W1
H1
D1
W2
H2
D2
W2 = W1 - (pool width) + 1
H2 = H1 - (pool height) + 1
D2 = D1
Input Output
46
CNN overview
Convolution and
Activation Subsampling
?
47
Fully Connected LayerNeurons in fully connected layers have full connections to all activations in the previous layer
...
48
SoftmaxTypically, output layer has one neuron corresponding to each label/class
The softmax function, or normalized exponential, "squashes" multi-dimensional vector of arbitrary real values to a multi-dimensional vector of values in the range (0, 1) that add up to 1.
49
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
50
Train the Network (setup the problem)- The training is in fact to find a set of weights (for the filters) that minimize the cost
functions, C(w,b).
- Normally, gradient descent algorithm is used to find the optimal
- Therefore, we need to find ∂C/∂wljk and ∂C/∂blj,and we update the weights and bias by:
51
Train the Network (compute the gradient)
- Traditionally, for one training data, If using conventional method (central difference) and we have a million weights, the cost function, C(w,b), will need to be calculated a million times !!
- How can we just calculate C(w,b) once? -- (Backpropagation Algorithm, Rumelhart, Hinton, and Williams, 1986).
52
Backward Propagation of Errors
53
Backward Propagation of Errors
54
Backward Propagation of Errors
55
Backward Propagation of Errors (put it together)
- Proof: http://neuralnetworksanddeeplearning.com/chap2.html
56
Backward Propagation of Errors (put it together)- Tutorial: http://neuralnetworksanddeeplearning.com/chap2.html
57
Backward Propagation of Errors (put it together)
58
Backward Propagation of Errors (put it together)
59
Backward Propagation of Errors (put it together)
60
Train the network (Initializing Weights)
Initialization is need for the gradient descent algorithm and it is critical for the learning performance:
61
Initial Weights
We want to stay away from the saturation area.
Suppose there is n weights coming in one NeuronBest strategy is: Normal(0, )
62
Example architecture
Alex Net, 61 millions weights
63
© ACM. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Preprocessing Tricks and TipsSuppose we have dataset X = [N X D], where N is number of data points, and D is their dimensionality
1. Mean Image Subtraction: Subtraction of the mean across each individual feature in dataset
2. Normalization for Dimension: Division by standard deviation
64
Preprocessing Tricks and Tips3. Principle Component Analysis (PCA) for dimensionality reduction
- Generate covariance matrix across the data- SVD factorization- Decorrelation, rotation into Eigenbasis- Choose a top-k eigenvalues: X’ = [NxK]
4. Whitening- Divide by eigenvalues (square roots of singular val)- Result: Zero mean, Identity Covariance
65
Data Augmentation
1. Rotations
2. Reflections
3. Scaling
4. Cropping
5. Color space remapping
6. Randomization!
66
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
67
Revisiting the ImageNet Competition (ILSVRC 2010)
Model Top-1 error rate Top-5 error rate
Sparse coding 0.47 0.28
SIFT + FVs 0.46 0.26
CNNs 0.37 0.17
68
Krizhevsky, Alex et al. Imagenet classification with deep convolutional neural nets. NIPS 2012 69
© ACM. All rights reserved. This content is excluded from our Creative Commonsicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.l
Google Street View House Numbers
70© Goodfellow, Ian et al. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/
“Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks” by Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet
Courtesy of Goodfellow, Ian et al. Used with permission.71
Recognizing Hand Gestures-HCI application
N. Jawad, D. Frederick, A. Gianni, C. Dan and M. Ueli, “Max-pooling convolutional neural networks for vision-based hand gesture recognition”, IEEE International Conference on Signal and Image Processing Applications, 2011.
72
© IEEE. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Extended Image Classification: Video Classification
Extend image classification by adding temporal component to classify videos
Note that this adds additional complexity, but the underlying system is the same: Convolutional Neural Nets
73
©IEEE. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Karpathy, Andrej, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. "Large-Scale Video Classification with Convolutional Neural Networks." 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014)
74
New Text
74
Overview
• What is Deep Learning?
• Image Processing
• CNN Architecture
• Training Process
• Image Classification Results
• Limitations
75
Even the Best have Issues
Microsoft won the most recent ImageNet competition and currently holds the state-of-the-art implementation
They can recognize 1000 categories of images, extremely reliably.
However:
1000 categories does not cover as many objects as you might expect.
Uses 1.28 million images to train
Takes weeks to train on multiple GPUs, with heavy optimization
http://arxiv.org/abs/1512.0338576
Szegedy et al. Intriguing Properties of Neural Networks. 2014.
Courtesy of Szegedy, Christian et al. License: CC-BY.
77
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR ’15), IEEE, 2015.
78
©IEEE. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR ’15), IEEE, 2015.
Gradient Ascent
79
© IEEE. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR ’15), IEEE, 2015.
Indirect Encoding
80
© IEEE. All rights reserved. This content is excluded from our Creative Commonslicense. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
81
82
83
84
85
86
© Stuart McMillen. All rights reserved. This content is excluded from our CreativeCommons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
• Deep Learning is a powerful tool that relies on many iterations of processing
• CNNs outperform all other algorithms for image classification because of the image processing power of convolutional filters
• Backpropagation is used to efficiently train CNNs
• CNNs need tons of data and processing power
Takeaways
87
Getting Started With Deep Learning
88
ReferencesImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
Approximation by Superpositions of a Sigmoidal Function. G. Cybenko https://www.dartmouth.edu/~gvc/Cybenko_MCSS.pdf
Backpropagation Tutorial http://neuralnetworksanddeeplearning.com/chap2.html
https://papers.nips.cc/paper/877-the-softmax-nonlinearity-derivation-using-statistical-mechanics-and-useful-properties-as-a-multiterminal-analog-circuit-element.pdf
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR ’15), IEEE, 2015.
“Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks” by Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet
Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. http://arxiv.org/abs/1512.03385 89
Appendix
90
Backward Propagation of Errors
- The gradient of weights and bias can be found by back chaining the auxiliary variable, defined as:
- By chain rule:
- The back propagate it (chain rule again):
91
Backward Propagation of Errors (put it together)
92
Train the Network (put it together)
93
94© Tim Dettmers. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
95Courtesy of Tim Dettmers. Used with permission.
Convolution: Filters An output pixel’s value is some function of the corresponding input pixel’s
neighbors
Examples:
Smooth, sharpen, contrast, shift
Enhance edges
Detect particular orientations
Approach: Kernel Convolution
1/9 = 40
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 90 90 90 90 90 90 0
0 0 0 90 90 90 90 90 90 0
0 0 0 90 90 90 90 90 90 0
0 0 0 90 90 90 90 90 90 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
1 1 1
1 1 1
1 1 1
96
Convolution for 2D matrices
Given two three-by-three matrices, one a kernel, and the other an image piece, convolution is the process of multiplying entries and summing
The output of this operation constitutes the input to a single neuron in the following layer.
97
MIT OpenCourseWarehttps://ocw.mit.edu
16.412J / 6.834J Cognitive RoboticsSpring 2016
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.