
  • CEng 783 – Deep Learning

    Week 8 – Applications of Convolutional Neural Networks

    Fall 2017

    Emre Akbas

  • 3

    Today

    A brief note on normalization

    Applications of ConvNets
    – Image classification (ResNets, ResNeXt, DenseNet)
    – Object detection
    – Artistic style transfer
    – Image segmentation (FCNs, “deconvolution”)
    – Visualizing ConvNet classifications
    – ConvNets for NLP

  • 4

    Normalization

    Remember batch norm? What happens as m (the mini-batch size) gets smaller? Also, what do you do at test time? (See the sketch below.)
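    To make the two questions concrete, here is a minimal batch-norm sketch (plain PyTorch, illustrative only; not the lecture's own code). The statistics are computed over the m examples of the mini-batch, and running averages are kept for test time:

```python
import torch

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    # x: (m, d) mini-batch of m examples with d features each.
    if training:
        # Normalize with the statistics of the current mini-batch.
        # As m gets smaller, these estimates get noisier and noisier.
        mean = x.mean(dim=0)
        var = x.var(dim=0, unbiased=False)
        # Keep exponential moving averages for use at test time.
        running_mean.mul_(1 - momentum).add_(momentum * mean)
        running_var.mul_(1 - momentum).add_(momentum * var)
    else:
        # At test time there may be no batch at all (m = 1), so we
        # fall back on the running statistics collected in training.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta   # learned per-feature scale and shift
```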

  • 5 [From “Group normalization”, Wu et al. ECCV 2018]

  • 6

    [From “Group normalization”, Wu et al. ECCV 2018]
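    Group norm sidesteps the small-m problem by computing statistics over groups of channels within each example, so nothing depends on the batch size. A minimal sketch of the computation (NCHW layout assumed; illustrative, not the paper's reference code):

```python
import torch

def group_norm(x, gamma, beta, G=32, eps=1e-5):
    # x: (N, C, H, W). Statistics are computed per sample over
    # groups of C // G channels, so they are independent of the
    # batch size N -- unlike batch norm.
    N, C, H, W = x.shape
    x = x.view(N, G, C // G, H, W)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)
    x = x.view(N, C, H, W)
    # gamma, beta: per-channel affine parameters of shape (1, C, 1, 1).
    return gamma * x + beta
```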

  • 7

    Applications of ConvNets

  • 8

    Image classification

  • 9

    Image Classification

    ILSVRC benchmark/challenge
    – ImageNet dataset: 1.2 million images, 1000 categories
    – Held since 2010
    – The task: given an image, make 5 predictions for the dominant object in the image. If one of them is correct, it is counted as a success.

  • 10

    Image Classification (cont’d)

    Newer datasets have appeared since then, e.g. Google’s Open Images Dataset [https://storage.googleapis.com/openimages/web/index.html]
    ● 9 million images, 6000 categories
    ● annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships.

  • 11

    The second success story

    [Slide from the 1st week]

    Source: G. Hinton’s talk at the Royal Society, May 22, 2015. https://youtu.be/izrG86jycck

  • 12

    Top-5 error rate over time
    ● 2012: AlexNet 16.5%
    ● 2013: ZF 11.7%
    ● 2014: VGG 7.3%
    ● 2014: GoogLeNet 6.7%
    ● 2015: ResNet 3.6%
    ● Aug 2016: GoogLeNet-v4 3.1%

    Human error rate: 5.1% [http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/]

  • 13

    AlexNet [Krizhevsky et al. NIPS 2012]

    5 convolutional layers + 3 fully-connected layers

    Each convolutional layer consists of: convolution + ReLU + normalization + max-pooling (sketched below)
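    A sketch of one such layer in PyTorch (the sizes are those of AlexNet's first layer; treat the exact hyperparameters as illustrative):

```python
import torch.nn as nn

# One AlexNet-style convolutional layer:
# convolution + ReLU + local response normalization + max-pooling.
conv_block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
```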

  • 14

    Top-5 error rate over time
    ● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
    ● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
    ● 2014: VGG 7.3%
    ● 2014: GoogLeNet 6.7%
    ● 2015: ResNet 3.6%
    ● Aug 2016: GoogLeNet-v4 3.1%

    ZF net: “It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.” [Source: http://cs231n.github.io/convolutional-networks/#case]

  • 15

    Top-5 error rate over time
    ● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
    ● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
    ● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
    ● 2014: GoogLeNet 6.7%
    ● 2015: ResNet 3.6%
    ● Aug 2016: GoogLeNet-v4 3.1%

    VGG: “Main contribution: depth is critical. They used 16 layers. Extremely homogeneous architecture: only 3x3 convolutions and 2x2 pooling. But, very expensive to evaluate and requires more memory.” [Source: http://cs231n.github.io/convolutional-networks/#case]
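    To see what this homogeneity means in code, here is a sketch of one VGG “stage” (channel counts are illustrative; the real configurations use 64, 128, 256, 512, 512):

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """A VGG stage: num_convs 3x3 convolutions (stride 1, padding 1,
    so spatial size is preserved), each followed by ReLU, then a 2x2
    max-pool that halves the resolution."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

stage1 = vgg_stage(3, 64, num_convs=2)   # e.g. the first stage of VGG16
```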

  • 16

    VGG or VGGNet (Visual Geometry Group at Oxford)

    [Simonyan & Zisserman, 2014]

    [Figure by D. Frossard: https://www.cs.toronto.edu/~frossard/post/vgg16/]

    Is the network in the figure a VGG16 or a VGG19?

  • 17

    Top-5 error rate over time
    ● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
    ● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
    ● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
    ● 2014: GoogLeNet 6.7% [Szegedy et al. (2015)]
    ● 2015: ResNet 3.6%
    ● Aug 2016: GoogLeNet-v4 3.1%

    GoogLeNet: “Main contribution: Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Increased # layers to 22. Uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much.” [Source: http://cs231n.github.io/convolutional-networks/#case]

  • 18

    Inception module

    Multiscale processing + a wider network [Figure from Szegedy et al. (2015)]

    But very costly! The solution is to reduce dimensionality (next slide).

  • 19

    Inception module

    Dimensionality is reduced using 1x1 convolutions (see the sketch below).
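    A sketch of the module with the 1x1 reductions in place (channel counts are those of the paper's first module, “inception 3a”; ReLUs omitted for brevity):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Each branch first shrinks depth with a cheap 1x1 convolution,
    then the branch outputs are concatenated along the channel dim."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)       # 1x1
        self.b2 = nn.Sequential(                            # 1x1 -> 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(                            # 1x1 -> 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(                            # pool -> 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x),
                          self.b3(x), self.b4(x)], dim=1)
```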

  • 20

    Top-5 error rate over time
    ● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
    ● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
    ● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
    ● 2014: GoogLeNet 6.7% [Szegedy et al. (2015)]
    ● 2015: ResNet 3.6%
    ● 2016: GoogLeNet-v4 3.1% [Szegedy et al. (2016)]

    v4 of Google’s Inception network was the best as of Fall 2016. It uses better-crafted inception modules + residual connections.

  • 21

    Top-5 error rate over time
    ● 2012: AlexNet 16.5% [Krizhevsky et al. (2012)]
    ● 2013: ZF 11.7% [Zeiler & Fergus (2014)]
    ● 2014: VGG 7.3% [Simonyan & Zisserman (2014)]
    ● 2014: GoogLeNet 6.7% [Szegedy et al. (2015)]
    ● 2015: ResNet 3.6% [He et al. (2015)]
    ● 2016: GoogLeNet-v4 3.1% [Szegedy et al. (2016)]

    The last ILSVRC was held in 2017; the winning top-5 error rate was 2.3%.

  • 22

    Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop

  • 23

    Is learning better networks as simple as stacking more layers?

    Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop

    http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

  • 24

    Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop

    http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

  • 25

    Plain network vs. residual network

    F(x) = W_2 σ(W_1 x)

    http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

  • 26

    Plain network vs. residual network

    F(x) = W_2 σ(W_1 x)

    Skip connection: the block outputs H(x) = F(x) + x (sketched below)
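    A minimal residual block in PyTorch (identity shortcut only; the real networks also use batch norm and projection shortcuts when shapes change):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The stacked layers learn F(x) = W2 σ(W1 x); the skip
    connection adds x back, so the block outputs H(x) = F(x) + x."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(f + x)                   # H(x) = F(x) + x
```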

  • 27

    Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop

  • 28

    Slide from Kaiming He's talk at ICCV 2015 ImageNet and COCO joint workshop

    http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

  • 29

    Think about how the skip connection helps with the “vanishing gradients” problem. (Hint: since H(x) = F(x) + x, the gradient ∂H/∂x = ∂F/∂x + I contains an identity term, so gradients can flow backwards through the shortcut unattenuated.)

    http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

  • 31

    ResNets have had huge impact.

    “Deep residual learning for image recognition”

    K He, X Zhang, S Ren, J Sun

    CVPR 2016

    More than 32K citations (source: Google Scholar).

    ResNets have inspired other architectures.

  • 32

    ResNeXt [Xie et al. CVPR 2017] “Aggregated Residual Transformations for Deep Neural Networks”

  • 33

    ResNeXt [Xie et al. CVPR 2017] “Aggregated Residual Transformations for Deep Neural Networks”

  • 34

    DenseNet [Huang et al. CVPR 2017] “Densely Connected Convolutional Networks”

    ● Traditional CNNs with L layers have L connections.

    ● DenseNets have L(L+1)/2 direct connections.

  • 35

    DenseNet [Huang et al. CVPR 2017] “Densely Connected Convolutional Networks”

    ● Traditional CNNs with L layers have L connections.

    ● DenseNets have L(L+1)/2 direct connections.

    ● For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers (sketched below).

    ● DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
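    A sketch of a dense block showing the concatenation pattern (plain 3x3 convolutions only; the real networks use BN-ReLU-Conv ordering and 1x1 bottlenecks, omitted here):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Layer i sees the concatenation of the block input and the
    feature-maps of all preceding layers; its own growth_rate new
    feature-maps are appended for all later layers."""
    def __init__(self, in_ch, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1)
            for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)
```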

  • 36

    DenseNet [Huang et al. CVPR 2017]

    [Results figures: on CIFAR-10 and on ImageNet]

  • 37

    Object detection

  • 38

    Object detection

    Task: given an image and an object class (e.g. “aeroplane”), find its instance(s).

    [Figure: an example image and the desired detection result]

  • 39

    Colab Notebook on basic object detection:

    http://bit.do/basicod

  • 40

    The ConvNet approach to object detection

    Pipeline: image → object proposals/candidates → ConvNet on each proposal → estimated class of the proposal

    Object proposals or candidates:
    ● Generic (class-independent)
    ● Typically around 1000s (much larger than the # of object instances in the image, but much smaller than the # of total sliding windows in the image)

    A minimal sketch of this pipeline is given below.
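    A hypothetical sketch of this loop in the R-CNN spirit. Note that `propose_boxes` and `classifier` are placeholder names, not library functions: the former stands in for a real proposal method (e.g. selective search), the latter for a trained classification ConvNet:

```python
import torch
import torch.nn.functional as F

def detect(image, propose_boxes, classifier, crop_size=224):
    """image: a (C, H, W) tensor. Each proposal is cropped, warped
    to a fixed size, and classified independently by the ConvNet."""
    detections = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in propose_boxes(image):
            crop = image[:, y1:y2, x1:x2].unsqueeze(0)     # (1, C, h, w)
            crop = F.interpolate(crop, size=(crop_size, crop_size),
                                 mode='bilinear', align_corners=False)
            scores = classifier(crop)                      # (1, num_classes)
            cls = scores.argmax(dim=1).item()
            detections.append(((x1, y1, x2, y2), cls))
    return detections
```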
