multi-scale iclr context 16 aggregation by … · 2018-11-08 · segmentation, surface normal,...

Multi-Scale Context Aggregation by Dilated Convolutions Yi-An Lai 2018. 03. 07 ICLR 16 http://goo.gl/XLtqZB


TRANSCRIPT

Page 1:

Multi-Scale Context Aggregation by Dilated Convolutions

Yi-An Lai

2018. 03. 07

ICLR 16

http://goo.gl/XLtqZB

Page 2:

Outline

● Situations and Results (4)

● Dilated, Explained (3)

● Context Module and the Front End (6)

● Show and Justify (8)

● Extensions (2)

UCSD CSE

Page 3:

S1: Dense Prediction?

● One pixel, one prediction: pixel-wise

○ Segmentation, surface normal, depth estimation, style transfer

● Common ConvNets are NOT tailored for this

○ LeNet, AlexNet, VGG: image-wise classification

“-”: Redundant components?

“+”: Better networks?

Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup. E. Simo-Serra, et al. (2016)

Page 4:

S2: What we want

● Customized architecture, tailored module

○ Full-resolution output

● Multiscale contexts: receptive field (RF)

○ LeNet, et al.: stacking + pooling / sub-sampling: RF⬆ but DROP details

■ Conflict: multiscale vs dense prediction

https://guillaumebrg.wordpress.com/2016/02/13/adopting-the-vgg-net-approach-more-layers-smaller-filters/

Page 5:

S3: What others did

● Noh et al.: Encoder-Decoder

● Farabet et al.: Rescaled images

Heavy downsampling necessary?

Separate handling necessary?

Page 6:

S4: What we’ve done

● Why not modify the architecture?

○ Replace fully-connected layers with conv (previous works)

○ Remove redundant components: the last 2 pooling layers

● Why not invent a new module?

○ Multiscale-capable, resolution-preserving

https://www.techkingdom.org/single-post/2017/11/07/Machine-Learning-with-Python-Image-Classifier-using-VGG16-Model---Coming-Soon

Page 7:

D1: Dilated Conv

● Vanilla Conv:

● l-Dilated Conv: filter element spacing

l = 1, l = 2, l = 4

Q1: What’s the RF here?
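A minimal sketch of what "filter element spacing" means, in plain Python on a 1D signal. It assumes the definition from the dilated-convolution paper, where output position p sums F(s)·k(t) over all s + l·t = p. The function names `dilated_conv1d` and `dilated_span` are illustrative and do not come from the slides or the authors' code.

```python
def dilated_span(kernel_size, dilation):
    """Number of input positions covered by one dilated filter."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode l-dilated convolution of a 1D signal.

    The kernel has odd length 2r+1 and is indexed t = -r..r. Output
    position p sums signal[s] * kernel[t] over all s + dilation * t == p.
    """
    r = len(kernel) // 2
    reach = dilation * r  # how far the dilated filter extends on each side
    out = []
    for p in range(reach, len(signal) - reach):
        total = 0.0
        for t in range(-r, r + 1):
            s = p - dilation * t
            total += signal[s] * kernel[t + r]
        out.append(total)
    return out
```

With dilation 1 this is an ordinary convolution. A larger dilation reads the same three weights from inputs that are further apart, so the span grows without adding parameters.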

Page 8:

D2: Dilated Conv

● Red: receptive field

● F1: dilation = 1

● F2: dilation = 2

● F3: dilation = 4

● RF grows exponentially when stacking layers with dilations 1, 2, 4, 8, …

● (i+1)th layer RF:

● Multiscale + full resolution!

Recent Advances in Convolutional Neural Networks. J. Gu et al. (2015)

(Whiteboard explain)
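The exponential growth can be checked with a few lines of arithmetic. This sketch assumes 3 x 3 filters and one side length per layer. The closed form (2^(i+2) − 1) per side for the (i+1)th layer comes from the paper, not from the slide text, and the helper names are illustrative.

```python
def stacked_receptive_fields(dilations, kernel_size=3):
    """Side length of the receptive field after each stacked layer."""
    rf = 1
    sizes = []
    for d in dilations:
        rf += d * (kernel_size - 1)  # each layer widens the RF by d * (k - 1)
        sizes.append(rf)
    return sizes


def closed_form(i):
    """Paper's closed form for the (i+1)th layer with dilation 2**i."""
    return 2 ** (i + 2) - 1
```

For comparison, four vanilla 3 x 3 layers (all dilations 1) only reach a side length of 9, while dilations 1, 2, 4, 8 reach 31 with the same number of parameters.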

Page 9:

D3: WaveNet - 1D Dilated Conv

● Causal/Online version

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

[Figure labels: dilation 2, dilation 4, dilation 8; receptive field]
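A rough sketch of the causal variant, assuming a WaveNet-style filter of length 2 and zero padding on the left. It is an illustration of the idea and not DeepMind's implementation; `causal_dilated_conv1d` and `stack` are hypothetical names.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Causal dilated convolution: output[t] only sees x[t], x[t - d], ...

    Positions before the start of the signal are treated as zeros.
    """
    k = len(weights)
    out = []
    for t in range(len(x)):
        total = 0.0
        for j, w in enumerate(weights):
            idx = t - (k - 1 - j) * dilation  # the last weight reads x[t] itself
            if idx >= 0:
                total += w * x[idx]
        out.append(total)
    return out


def stack(x, dilations, weights=(1.0, 1.0)):
    """Apply one causal layer per dilation, in order."""
    for d in dilations:
        x = causal_dilated_conv1d(x, weights, d)
    return x
```

Feeding an impulse through dilations 1, 2, 4, 8 shows the two properties on the slide: nothing appears before the impulse (causal), and the response lasts 16 steps, which is the receptive field of the stack.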

Page 10:

So Far, So Good ?

Page 11:

C1: The Module

● Goal: performance⬆ using multiscale contexts

● All dilated conv = multiscale + full resolution

● (W, H, C) in, (W, H, C) out → plug into any layer!

○ Equip feature maps with multiscale contextual info

● Module:

Image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/convolutional_neural_networks.html

Layers 1 - 7: + ReLU

Equal #channels

Q2: Why stop RF expansion at layer 6?
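The module's layer table did not survive in this transcript. The sketch below assumes the configuration reported in the paper: seven 3 x 3 layers with dilations 1, 1, 2, 4, 8, 16, 1, followed by a 1 x 1 layer. `CONTEXT_LAYERS` and `module_receptive_fields` are illustrative names.

```python
# (kernel_size, dilation) per layer; assumed from the paper, not from the slide.
CONTEXT_LAYERS = [(3, 1), (3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 1), (1, 1)]


def module_receptive_fields(layers=CONTEXT_LAYERS):
    """Side length of the receptive field after each layer of the module."""
    rf = 1
    sizes = []
    for kernel_size, dilation in layers:
        rf += dilation * (kernel_size - 1)
        sizes.append(rf)
    return sizes
```

Under this assumption the receptive field is 65 x 65 after layer 6, which already spans a 64 x 64 feature map. That is one plausible reading of Q2: further expansion would only look at padding.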

Page 12:

C2: Initializations - inspiration

● RNN initialized with identity matrix + ReLU (red) → beats LSTM

“A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.” Q. V. Le, et al.

[Figure labels: LSTM, IRNN; the IRNN recurrent weights are initialized to the identity]

Image: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html

Page 13:

C3: Initializations

● Basic: #channels stay the same

○ Identity initialization → “copy & paste”

● Large: #channels double

○ Copy & paste, but with scaling

○ All other weights: instead of zeros, break symmetry → randomly initialized and small (<< 0.5)

[Figure: filter weights with a single 1 and 0 elsewhere]

Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. W. Shi, et al. (2016)
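A small NumPy sketch of the "copy & paste" behaviour of the Basic initialization. It assumes 3 x 3 filters with a single 1 at the center of filter b for input channel b. The naive `dilated_conv2d_same` helper is written for this illustration only (it uses cross-correlation with zero padding), and the scaled Large variant is not covered.

```python
import numpy as np


def identity_init(channels, kernel_size=3):
    """Weights of shape (out, in, k, k): 1 at the center where out == in."""
    k = np.zeros((channels, channels, kernel_size, kernel_size))
    center = kernel_size // 2
    for b in range(channels):
        k[b, b, center, center] = 1.0
    return k


def dilated_conv2d_same(x, k, dilation):
    """Naive dilated cross-correlation that keeps the spatial size.

    x has shape (channels, height, width); zero padding is an assumption.
    """
    c_out, _, ks, _ = k.shape
    pad = dilation * (ks // 2)
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(ks):
        for j in range(ks):
            # Shifted view of the padded input for filter tap (i, j).
            patch = xp[:, i * dilation:i * dilation + h, j * dilation:j * dilation + w]
            out += np.einsum('oc,chw->ohw', k[:, :, i, j], patch)
    return out
```

With these weights every layer passes its input through unchanged, whatever the dilation. Since the feature maps entering a ReLU are then non-negative copies, the module starts out as an identity mapping and training only has to learn the corrections.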

Pages 14-18:

C4: Front End - Modified VGG 16

1. Input 900 x 900 x 3 (reflection padding)

2. No padding in the middle at all

3. Last 2 pooling layers → dilated conv

4. Fully connected → conv

5. Softmax: 21 classes → 64 x 64 x 21

[Figure labels: Dilation: 2; Kernel: 7 x 7, Dilation: 4; C=4096, C=4096; C=21; C=21]

Prediction: upsample to original size (500 x 500)

Page 19:

C4: Front End - VGG 16

Q3: Where did the authors plug in the module?

A. after blue 1

B. after blue 2

C. after blue 3

D. after brown

[Figure label: The module]

Page 20:

C5: Front End Performance

→ Modified architecture performs well!

Page 21:

C6: Front End

More accurate

More details

Page 22:

S1: Training

● VOC 2012 + COCO

○ COCO: images with at least 1 class in VOC

○ 2-stage: (VOC + COCO) → VOC

○ Front end only: 71.3% on test → our modifications are justified

● Module: context aggregation

○ Add padding to the inputs (64 x 64 x C). Why? → The RF is 67 x 67, and the output must stay 64 x 64 x C

○ Train front end → fix it → train module (joint training gives about the same result)

○ Goal: testing the module + pairing with structured prediction (could be jointly learned)
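The padding arithmetic can be sketched as follows. The layer list is an assumption taken from the paper (seven 3 x 3 layers with dilations 1, 1, 2, 4, 8, 16, 1, then a 1 x 1 layer), and the helper names are illustrative.

```python
# (kernel_size, dilation) per layer; assumed from the paper, not from the slide.
CONTEXT_LAYERS = [(3, 1), (3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 1), (1, 1)]


def total_shrink(layers=CONTEXT_LAYERS):
    """Pixels lost per spatial dimension when no layer pads its input."""
    return sum(dilation * (kernel_size - 1) for kernel_size, dilation in layers)


def output_size(input_size, pad_per_side, layers=CONTEXT_LAYERS):
    """Spatial size after the module for a square input padded once up front."""
    return input_size + 2 * pad_per_side - total_shrink(layers)
```

The module removes 66 pixels per dimension (one less than the 67 x 67 receptive field), so an unpadded 64 x 64 map would vanish. Padding by 33 on each side gives 130 x 130 in and 64 x 64 out.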

Page 23:

Warning of New Stuff !

Pages 24-28:

S2: Wait, structured prediction?

● Inputs and outputs: objects with structures

○ Object: sequence, boxes, images, list, tree…

○ Need a more powerful F to define the “compatibility” of (X, Y)

Examples: Retrieval, Speech, Detection, Pose

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

Page 29:

S3: How?

● Structured prediction in only 3 steps

● As easy as putting an elephant into a refrigerator

Pages 30-32:

S4: 3 Steps

1. Evaluation: what does the “compatibility” F(X, Y) look like?

2. Inference: solve the “arg max” problem

○ Detection Y: all possible bounding boxes

○ Summarization Y: all combinations of sentences in a doc

3. Training: given data (xi, yi), learn F such that

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

Energy-based model: https://cs.nyu.edu/~yann/research/ebm/
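The three steps can be made concrete with a toy example. Everything here is an assumption made for illustration: a linear F(x, y) = w · φ(x, y), a three-label output space small enough to enumerate, and a perceptron-style update. The training condition on the slide did not survive extraction; the sketch uses the usual one for this framework, F(xi, yi) > F(xi, y) for every y ≠ yi.

```python
LABELS = ["cat", "dog", "bird"]  # toy output space Y, small enough to enumerate


def phi(x, y):
    """Joint feature vector: copy x into the slot that belongs to label y."""
    vec = [0.0] * (len(x) * len(LABELS))
    start = LABELS.index(y) * len(x)
    vec[start:start + len(x)] = x
    return vec


def F(w, x, y):
    """Step 1, evaluation: linear compatibility score of the pair (x, y)."""
    return sum(wi * fi for wi, fi in zip(w, phi(x, y)))


def infer(w, x):
    """Step 2, inference: arg max of F over all candidate outputs."""
    return max(LABELS, key=lambda y: F(w, x, y))


def train(data, epochs=20):
    """Step 3, training: push F(x, y_true) above the best wrong output."""
    w = [0.0] * (len(data[0][0]) * len(LABELS))
    for _ in range(epochs):
        for x, y_true in data:
            rival = max((y for y in LABELS if y != y_true), key=lambda y: F(w, x, y))
            if F(w, x, y_true) <= F(w, x, rival):
                good, bad = phi(x, y_true), phi(x, rival)
                w = [wi + g - b for wi, g, b in zip(w, good, bad)]
    return w


DATA = [([1.0, 0.0, 0.0], "cat"), ([0.0, 1.0, 0.0], "dog"), ([0.0, 0.0, 1.0], "bird")]
```

In real structured prediction Y is far too large to enumerate (all bounding boxes, all label maps), which is why the inference step is the hard part and why CRF-style models come with specialised inference.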

Page 33:

End of Warning

Page 34:

S5: Structured predictions used

● Goal: refine details to improve performance

● CRF: Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. Krähenbühl, et al. (2011)

○ Pairwise links between pixels

○ Capture fine edge details with long-range dependencies

● RNN-CRF: Conditional Random Fields as Recurrent Neural Networks. Zheng, et al. (2015)

○ Interpret CRFs as RNNs

○ Can be plugged into CNNs and jointly learned

● Q4: What are our inputs and outputs for structured prediction here?

● More about Conditional Random Fields (CRFs)

○ Machine Learning: A Probabilistic Perspective

○ http://dontloo.github.io/blog/CRF/

○ http://davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/
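For reference, the "pairwise links between pixels" correspond to the energy of the fully connected CRF as defined in the Krähenbühl paper. The notation below follows that paper and is not on the slide: x_i is the label of pixel i, p_i its position, I_i its color, μ a label compatibility function, and w and θ learned or tuned parameters.

```latex
E(\mathbf{x}) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)

\psi_p(x_i, x_j) = \mu(x_i, x_j) \left[
    w^{(1)} \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2}
                         -\frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2} \right)
  + w^{(2)} \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2} \right)
\right]
```

The unary term ψ_u would come from the network's per-pixel class scores, and the first Gaussian kernel is what lets nearby pixels of similar color pull each other toward the same label.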

Pages 35-36:

S6: Evaluate Module

Effective & Synergistic

Corrections

Details

Page 37:

S7: Now the real test

● Yellow: off by a margin

● Green: over by a margin

→ Context module boosts performance

(Large)

Page 38:

Demo videos (Only front end + context): https://www.youtube.com/user/karol32768/videos

Page 39:

S8: Other datasets

● Urban scenes: CamVid, KITTI, Cityscapes

● Adjust #layers in the module w.r.t. input size

● Outperforms all prior works

● Code: https://github.com/fyu/dilation

Pages 40-41:

E1: Strengths & Weaknesses

● Strengths

○ Multiscale contextual information

○ Full-resolution context module can be plugged into any layer

○ Architecture from classification tasks is modified and justified for segmentation

● Weaknesses

○ Only 64 x 64 output size: loses details → helped by CRF

○ Full resolution requires more memory

○ Performance⬆ due to more parameters rather than multiscale info?

→ No experiment comparing against a module with all vanilla conv

Page 42:

E2: Got ideas?

1. Multi-head self-attention rather than dilated conv for multiscale info?

○ Attention Is All You Need. A. Vaswani, et al. (2017)

○ Multi-Context Attention for Human Pose Estimation. X. Chu, et al. (2017)

https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/

Pages 43-44:

E2: Got ideas?

1. Multi-head self-attention rather than dilated conv for multiscale info?

○ Attention Is All You Need. A. Vaswani, et al. (2017)

2. Striding as “learnable” pooling → replace all pooling layers

3. Other variants:

○ Architecture: Densely Connected Convolutional Networks. G. Huang, et al. (2016)

○ Objective: Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation. M. A. Rahman, et al. (2016)

https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/

2x2 kernel with stride = 2
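To see why a strided convolution can be called "learnable" pooling, here is a small NumPy sketch on a single-channel map. A 2 x 2 kernel with stride 2 whose weights are all 0.25 reproduces 2 x 2 average pooling; training is then free to move the weights away from that starting point. The function names are illustrative.

```python
import numpy as np


def strided_conv2d(x, kernel, stride):
    """Valid cross-correlation of a 2D map with the given stride."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out


def average_pool2d(x, size):
    """Non-overlapping average pooling with a square window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).mean(axis=(1, 3))
```

Max pooling has no such fixed-weight equivalent, so replacing it with a strided convolution changes the function the layer computes and not only its parameterisation.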

Page 45:

Thanks

http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/

Neural Machine Translation in Linear Time. N. Kalchbrenner, et al. (2017)