Ceng 783 – Deep Learning
Week 8 – Applications of Convolutional Neural Networks
Fall 2017
Emre Akbas
3
Today
A brief note on normalization
Applications of ConvNets – Image classification (ResNets, ResNeXt,
DenseNet)– Object detection– Artistic style transfer– Image segmentation (FCNs, “deconvolution”)– Visualizing ConvNet classifications– ConvNets for NLP
4
NormalizationRemember batch norm? What happens as m gets smaller?Also, what do you do at test time?
5[From “Group normalization”, Wu et al. ECCV 2018]
6
[From “Group normalization”, Wu et al. ECCV 2018]
7
Applications of ConvNets
8
Image classification
9
Image ClassificationILSVRC benchmark/challenge– ImageNET dataset: 1.2 million images, 1000
categories– Since 2010– The task: given an image, make 5 predictions for the
dominant object in the image. If one of them is correct, then it is counted as a success.
10
Image ClassificationILSVRC benchmark/challenge– ImageNET dataset: 1.2 million images, 1000
categories– Since 2010– The task: given an image, make 5 predictions for the
dominant object in the image. If one of them is correct, then it is counted as a success.
Newer datasets since then– e.g. Google's Open Image Dataset
● 9 million images, 6000 categories [link]● annotated with image-level labels, object bounding boxes,
object segmentation masks, and visual relationships.
11
The second success story
3.10.2016 CEng 783 - Deep Learning - Fall 2016 11
Source: G. Hinton’s talk at the Royal Society, May 22, 2015. https://youtu.be/izrG86jycck
Slide
from
1st w
eek
12
Top-5 error rate over time● 2012: AlexNet 16.5%● 2013: ZF 11.7%● 2014: VGG 7.3%
2014: GoogLeNet 6.7%● 2015: ResNet 3.6%● Aug 2016: 3.1%
GoogLeNet-v4
Human error rate: 5.1% [http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/]
13
AlexNet [Krizhevsky et al. NIPS 2012]
5 convolutional layers 3 full-connected layers
Each convolutional layer consists of:convolution + ReLU + normalization + max-pooling
14
Top-5 error rate over time● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
● 2014: VGG 7.3%
2014: GoogLeNet 6.7%
● 2015: ResNet 3.6%
● Today (Aug 2016) 3.1%GoogLeNet-v4
“It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.”
[Source: http://cs231n.github.io/convolutional-networks/#case]
15
Top-5 error rate over time● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
2014: GoogLeNet 6.7%
● 2015: ResNet 3.6%
● Today (Aug 2016) 3.1%GoogLeNet-v4
“Main contribution: depth is critical. They used 16 layers.
Extremely homogeneous architecture: only 3x3 convolutions and 2x2 pooling.
But, very expensive to evaluate and requires more memory.[Source: http://cs231n.github.io/convolutional-networks/#case]
16
VGG or VGGnet
[Figure by D. Frossard]
VGG (Visual Geometry Group at Oxford)
[Simonyan & Zisserman , 2014]
VGG16 VGG19
Is the network above a VGG16 or a VGG19?
17
Top-5 error rate over time● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
2014: GoogLeNet 6.7% [Szegedy et al. (2015)]
● 2015: ResNet 3.6%
● Today (Aug 2016) 3.1%GoogLeNet-v4
“Main contribution: Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M. Increased # layers to 22).
Uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. [Source: http://cs231n.github.io/convolutional-networks/#case]
18
Inception module
[Figure from Szegedy et al. (2015)]Multiscale processing + wider network
But very costly!! Solution is to reduce dimension (next slide)
19
Inception module
Dimension is reduced using 1x1 convolutions.
20
Top-5 error rate over time● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
2014: GoogLeNet 6.7% [Szegedy et al. (2015)]
● 2015: ResNet 3.6%
● 2016: 3.1% [Szegedy et al. (2016)]GoogLeNet-v4
v4 of Google's Inception Network is the best right now (as of Fall 2016). Uses better crafted inception modules + residual connections.
21
Top-5 error rate over time● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
2014: GoogLeNet 6.7% [Szegedy et al. (2015)]
● 2015: ResNet 3.6% [He et al. (2015)]
● 2016: 3.1% [Szegedy et al. (2016)]GoogLeNet-v4
Last ILSVRC was held in 2017. top-5 error rate was 2.3%
22
Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop
23
Is learning better networks as simple as stacking more layers?
Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop
24Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop
25
Plain network Residual network
F (x)=W 2σ(W 1 x)
26
Plain network Residual network
F (x)=W 2σ(W 1 x)
Skip connection
27Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop
28Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop
29
Think about how the skip connection helps with the “vanishing gradients” problem.
31
ResNets have had huge impact.
“Deep residual learning for image recognition”
K He, X Zhang, S Ren, J Sun
CVPR 2016
More than 32K citations (source: Google Scholar).
ResNets have inspired other architectures.
32
ResNeXt [Xie et al. CVPR 2017]
“Aggregated Residual Transformations for Deep Neural Networks”
33
ResNeXt [Xie et al. CVPR 2017]
“Aggregated Residual Transformations for Deep Neural Networks”
34
DenseNet [Huang et al. CVPR 2017]
“Densely Connected Convolutional Networks”● Traditional CNNs with L layers
have L connections.● DenseNets have L(L+1) direct
connections.
35
DenseNet [Huang et al. CVPR 2017]
“Densely Connected Convolutional Networks”● Traditional CNNs with L layers
have L connections.● DenseNets have L(L+1) direct
connections. ● For each layer, the feature-maps
of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers.
● DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
36
DenseNet [Huang et al. CVPR 2017]
On CIFAR10
On ImageNet
37
Object detection
38
Object detection
Task: given an image and an object class, find its instance(s):
aeroplane
Desired result:
39
Colab Notebook on basic object detection:
http://bit.do/basicod
40
The ConvNet approach to object detection
Image
Object proposals or candidates● Generic (class independent)● Typically around 1000s (much larger
than the # of object instances in the image, but much smaller than the # of total sliding windows in the image)
Each proposal
ConvNet
Estimated class of the proposal
E.g. Fast RCNN (Girshick, ICCV 2015) uses image segmentation (the Selective Search algorithm) to obtain object proposals.
42
Novelty
Replaces the segmentation-based, separate “object proposal generator” with a neural network called the “Region Proposal Network (RPN)”.
Uses a ConvNet for both
● Generating object proposals,
● And classifying them.
NIPS, June 2015; IEEE TPAMI, June 2017
43
Faster RCNN’s architecture:
Region Proposal Network:
Code is available on GitHub.
CPU only mode takes 17 seconds per image. Using an optimized linear algebra package (OpenBLAS), this time goes down to 4 seconds.
On Tesla K40 GPU, it takes 0.25 seconds per image.
44
Other state-of-the-art object detectors
● One-stage detection family– e.g. RetinaNet [Focal loss for dense object
detection, Lin et al. ICCV 2017 – Best paper award]
● R-FCN● Bottom-up object detectors:
– CornerNet, Law and Deng, ECCV 2018 – ExtremeNet, Zhou et al. CVPR 2019
45
Artistic Style Transfer
46
Gatys et al. (2016) “Image style transfer using convolutional neural networks”
Input 1: an image Input 2: a painting
Output
47
Gatys et al. (2016) “Image style transfer using convolutional neural networks”
48
Gatys et al. (2016) “Image style transfer using convolutional neural networks”
49
ConvNet1 is used to capture style
ConvNet2 captures content
ConvNets are VGGnets
ConvNet3 generates output image by using back-propagation to minimize L
total
by changing x.
50
Self-driving cars
51[Bojarski et al. 2016] Source of images
52
54
Input image Ground-truth label map
55
Fully convolutional networks (FCN)[Long et al. CVPR 2015]
>12K citations
56
Need for upsampling
57
Need for upsampling
● Most deep learning libraries have support for plain upsampling: nearest, linear, bilinear, bicubic, etc.
● If you want to make it learnable, two possible ways are: – Upsample + conv– “Devoncolution” (or transposed convolution)
58
Upsample + conv
SegNet: Badrinarayanan et al. TPAMI 2017]
59
“Deconvolution” (Transposed Conv.)
60
61
Now, what would Now, what would happen if the filter is happen if the filter is larger than the input larger than the input map?map?
62
Visual explanations for ConvNet classification
63
There are several ways to visualize what CNNs learn.
● Directly plotting the learned filters:
[From http://cs231n.github.io/understanding-cnn/]
64
● Showing the layer activations, i.e. feature maps
[From http://cs231n.github.io/understanding-cnn/]
A very good example:
3D visualization of a CNN on handwritten digit classification: http://scs.ryerson.ca/~aharley/vis/conv/
65
● Retrieving images that maximally activate a neuron
[From http://cs231n.github.io/understanding-cnn/]
66
● Occluding parts of the image
[From http://cs231n.github.io/understanding-cnn/]
67
Another way: Grad-CAM
● Gradient weighted class activation mapping [Selvaraju et al. (2016)].
● Applicable to any ConvNet (that was trained for image classification)
● Uses class-specific gradient information flowing into the final convolutional layer.
● Given an image and a class label, visually explains which part of the image is responsible for the class label.
68
Grad-CAM for “cat”
Grad-CAM for “dog”
69
How does it work?
y: predicted class vectorA: final conv layer (has k channels)
First, train a network for image classification. Then,
70
● ConvNets are not just for images
● They are used for speech/audio. e.g. WaveNet
● Even for natural language understanding/processing (a simple example in the next slide)
● We will study “transformers” after the RNN chapter.
71[Figure taken from wildml.com]
Each row represents a word (word2vec)
Words sharing common context have similar word2vec vectors. Also learned by a neural network.
Sentiment analysis
73
References● Badrinarayanan, V., Kendall, A. and Cipolla, R., 2017. Segnet: A deep convolutional
encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12), pp.2481-2495.
● Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J, Zhang X. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. 2016 Apr 25.
● Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" J. Machine Learning Research. 2: 265–292.
● Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.
● Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414-2423).
● Glorot, X., & Bengio, Y. (2010, May). Understanding the difficulty of training deep feedforward neural networks. In Aistats (Vol. 9, pp. 249-256).
● Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.● He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).
● He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR 2016.
74
References● Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z,
Song Y, Guadarrama S, Murphy K. Speed/accuracy trade-offs for modern convolutional object detectors. CVPR 2017.
● Huang, G., Liu, Z., van der Maaten, L. and Weinberger, K.Q. Densely connected convolutional networks. In CVPR 2017.
● Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
● Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation.Neural networks, 1(4), 295–307.
● Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
● Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
● Law, H. and Deng, J., 2018. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 734-750).
● Long, J., Shelhamer, E. and Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
75
References● Mohamed, A. R., Dahl, G., & Hinton, G. (2009, December). Deep belief
networks for phone recognition. In Nips workshop on deep learning for speech recognition and related applications (Vol. 1, No. 9, p. 39).
● Ronneberger, O., Fischer, P. and Brox, T., 2015, October. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.
● S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 1 2017.
● Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. arXiv preprint arXiv:1610.02391.
● Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
● Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
76
● Szegedy, C., Ioffe, S., & Vanhoucke, V. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.
● Uijlings, J. R., van de Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International journal of computer vision, 104(2), 154-171.
● Wu, Y. and He, K., 2018. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 3-19).
● Xie, S., Girshick, R., Dollár, P., Tu, Z. and He, K., 2017. Aggregated residual transformations for deep neural networks. In CVPR 2017.
● Zeiler, M. D., & Fergus, R. (2014, September). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833).
● Zhou, X., Zhuo, J. and Krahenbuhl, P., 2019. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 850-859).
References