cs6501: deep learning for visual recognition...
TRANSCRIPT
![Page 1: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/1.jpg)
CS6501: Deep Learning for Visual RecognitionCNN Architectures
![Page 2: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/2.jpg)
Today’s Class
Recap
• The Convolutional Layer
• Spatial Pooling Operations
CNN Architectures
• LeNet (LeCun et al 1998)
• AlexNet (Krizhesvky et al 2012)
• VGG-Net (Simonyan and Zisserman, 2014)
![Page 3: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/3.jpg)
Automatic Differentiation
You only need to write code for the forward pass,backward pass is computed automatically.
Pytorch (Facebook -- mostly):
Tensorflow (Google -- mostly):
DyNet (team includes UVA Prof. Yangfeng Ji):
https://pytorch.org/
https://www.tensorflow.org/
http://dynet.io/
![Page 4: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/4.jpg)
Defining a Model in Pytorch (Two Layer NN)
![Page 5: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/5.jpg)
1. Creating Model, Loss, Optimizer
![Page 6: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/6.jpg)
2. Running forward and backward on a batch
Compare this to what we had to do for toynn
![Page 7: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/7.jpg)
Convolutional Layer
![Page 8: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/8.jpg)
Convolutional Layer
Weights
![Page 9: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/9.jpg)
Convolutional Layer
4
Weights
![Page 10: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/10.jpg)
Convolutional Layer
4 1
Weights
![Page 11: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/11.jpg)
Convolutional Layer (with 4 filters)
Input: 1x224x224 Output: 4x224x224
if zero padding,and stride = 1
weights:4x1x9x9
![Page 12: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/12.jpg)
Convolutional Layer (with 4 filters)
Input: 1x224x224 Output: 4x112x112
if zero padding,but stride = 2
weights:4x1x9x9
![Page 13: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/13.jpg)
Convolutional Layer in pytorch
in_channels (e.g. 3 for RGB inputs)
out_channels (equals the number of convolutional filters for this layer)
out_channels x
in_channels
kernel_size
kernel_size
Input Output
![Page 14: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/14.jpg)
Convolutional Network: LeNet
Yann LeCun
![Page 15: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/15.jpg)
LeNet in Pytorch
![Page 16: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/16.jpg)
SpatialMaxPooling Layer
take the max in this neighborhood
888
8 8
![Page 17: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/17.jpg)
LeNet Summary
• 2 Convolutional Layers + 3 Linear Layers
• + Non-linear functions: ReLUs or Sigmoids+ Max-pooling operations
![Page 18: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/18.jpg)
New Architectures Proposed
• Alexnet (Kriszhevsky et al NIPS 2012) [Required Reading]
• VGG (Simonyan and Zisserman 2014)
• GoogLeNet (Szegedy et al CVPR 2015)
• ResNet (He et al CVPR 2016)
• DenseNet (Huang et al CVPR 2017)
![Page 19: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/19.jpg)
Convolutional Layers as Matrix Multiplication
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
![Page 20: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/20.jpg)
Convolutional Layers as Matrix Multiplication
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
![Page 21: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/21.jpg)
Convolutional Layers as Matrix Multiplication
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
Pros?Cons?
![Page 22: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/22.jpg)
CNN Computations are Computationally Expensive• However highly parallelizable• GPU Computing is used in practice (Why is GPU computing Good?)• CPU Computing in fact is prohibitive for training these models
![Page 23: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/23.jpg)
ILSVRC: Imagenet Large Scale Visual Recognition Challenge
[Russakovsky et al 2014]
![Page 24: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/24.jpg)
The Problem: ClassificationClassify an image into 1000 possible classes:
e.g. Abyssinian cat, Bulldog, French Terrier, Cormorant, Chickadee, red fox, banjo, barbell, hourglass, knot, maze, viaduct, etc.
cat, tabby cat (0.71)Egyptian cat (0.22)red fox (0.11)…..
![Page 25: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/25.jpg)
The Data: ILSVRC
Imagenet Large Scale Visual Recognition Challenge (ILSVRC): Annual Competition
1000 Categories
~1000 training images per Category
~1 million images in total for training
~50k images for validation
Only images released for the test set but no annotations,
evaluation is performed centrally by the organizers (max 2 per week)
![Page 26: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/26.jpg)
The Evaluation Metric: Top K-error
cat, tabby cat (0.61)Egyptian cat (0.22)red fox (0.11)Abyssinian cat (0.10)French terrier (0.03)…..
True label: Abyssinian cat
Top-1 error: 1.0 Top-1 accuracy: 0.0
Top-2 error: 1.0 Top-2 accuracy: 0.0
Top-3 error: 1.0 Top-3 accuracy: 0.0
Top-4 error: 0.0 Top-4 accuracy: 1.0
Top-5 error: 0.0 Top-5 accuracy: 1.0
![Page 27: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/27.jpg)
Top-5 error on this competition (2012)
![Page 28: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/28.jpg)
Alexnet (Krizhevsky et al NIPS 2012)
![Page 29: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/29.jpg)
Alexnet
https://www.saagie.com/fr/blog/object-detection-part1
![Page 30: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/30.jpg)
Pytorch Code for Alexnet
• In-class analysis
https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py
![Page 31: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/31.jpg)
Dropout Layer
Srivastava et al 2014
![Page 32: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/32.jpg)
What is happening?
https://www.saagie.com/fr/blog/object-detection-part1
![Page 33: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/33.jpg)
Feature extraction
(SIFT)
Feature encoding
(Fisher vectors)
Classification(SVM or softmax)
SIFT + FV + SVM (or softmax)
Convolutional Network(includes both feature extraction and classifier)
Deep Learning
![Page 34: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/34.jpg)
Preprocessing and Data Augmentation
![Page 35: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/35.jpg)
Preprocessing and Data Augmentation
256
256
![Page 36: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/36.jpg)
Preprocessing and Data Augmentation
224x224
![Page 37: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/37.jpg)
Preprocessing and Data Augmentation
224x224
![Page 38: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/38.jpg)
True label: Abyssinian cat
![Page 39: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/39.jpg)
•Using ReLUs instead of Sigmoid or Tanh•Momentum + Weight Decay•Dropout (Randomly sets Unit outputs to zero during training) •GPU Computation!
Other Important Aspects
![Page 40: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/40.jpg)
VGG Network
https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py
Simonyan and Zisserman, 2014.
Top-5:
https://arxiv.org/pdf/1409.1556.pdf
![Page 41: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/41.jpg)
BatchNormalization Layer
![Page 42: CS6501: Deep Learning for Visual Recognition CNNArchitecturesvicente/deeplearning/2019/slides/lecture09.pdf · Convolutional Layer (with 4 filters) Input: 1x224x224 Output: 4x224x224](https://reader036.vdocuments.mx/reader036/viewer/2022071014/5fcd084fb50bf004de473717/html5/thumbnails/42.jpg)
Questions?
42