multi-scale iclr context 16 aggregation by … · 2018-11-08 · segmentation, surface normal,...

Multi-Scale Context

Aggregation by Dilated

Convolutions Yi-An Lai

2018. 03. 07

ICLR 16

http://goo.gl/XLtqZB

Outline● Situations and Results (4)

● Dilated, Explained (3)

● Context Module and the Front End (6)

● Show and Justify (8)

● Extensions (2) UCSD CSE


S1: Dense Prediction?● One pixel, one prediction: Pixel-wise

○ Segmentation, surface normal, depth estimation, style transfer

● Common ConvNets NOT tailored for this○ LeNet, AlexNet, VGG: image-wise classifications “－”: Redundant components?

“＋”: Better networks?

Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup. E Simo-Serra, et al. (2016)


S2: What we want● Customized architecture, tailored module

○ Full-resolution output

● Multiscale contexts: receptive field (RF)○ LeNet, et al. — Stacking + Pooling / Sub-sampling — RF⬆� ¢µ¥ DROP details

■ Conflicts: multiscale vs dense prediction

https://guillaumebrg.wordpress.com/2016/02/13/adopting-the-vgg-net-approach-more-layers-smaller-filters/


S3: What others did● Noh et al. — Encoder-Decoder

● Farabet et al. — Rescaled imagesHeavy downsampling necessary?

Separate handling necessary?


S4: What we’ve done● Why not modify architecture

○ Replace Fully-connected with Conv (previous works)○ Remove redundant components — last 2 pooling

● Why not invent new module○ Multiscale-capable, resolution-preserving

https://www.techkingdom.org/single-post/2017/11/07/Machine-Learning-with-Python-Image-Classifier-using-VGG16-Model---Coming-Soon


D1: Dilated Conv● Vanilla Conv:

● l-Dilated Conv: filter element spacing

l = 1 l = 2

l = 4

Q1: What’s the RF here ?


D2: Dilated Conv● Red: receptive field

● F1: dilation = 1● F2: dilation = 2● F3: dilation = 4

● Exponential RF as stacking with dilations 1, 2, 4, 8, …

● (i+1)th layer RF:

● Multiscale + full resolution!

Recent Advances in Convolutional Neural Networks. J. Gu et al. (2015)

(Whiteboard explain)


D3: WaveNet - 1D Dilated Conv ● Causal/Online version

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

dilation 4

dilation 2

dilation 8

Receptive field


So Far, So Good ?

C1: The Module ● Goal: performance⬆� using multiscale contexts

● All Dilated Conv = Multiscale + Full-resolution

● (W, H, C) in, (W, H, C) out → Plug into any layer!○ Equip feature maps with multiscale contextual info

● Module:

Image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/convolutional_neural_networks.html

1 - 7: + ReLU

Equal #channel

Q2: Why stop RF expansion at layer 6 ?

C2: Initializations - inspiration● RNN initialized with identity matrix + ReLU (Red) → Beat LSTM

“A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.” QV Le, et al

LSTM

IRNN

Image: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html

I-RNN

identity

C3: Initializations● Basic: #channel the same

○ Identity initializations → “copy & paste”

● Large: #channel double○ Copy & paste but with scaling○ All other zeros → break symmetry

→ Randomly initialized and small (<< 0.5)

1

0

0

0

Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. W. Shi, et al. (2016)

C4: Front End - Modified VGG 161. Input 900 x 900 x 3 (reflection padding)

Dilation: 2 Kernal: 7 x 7Dilation: 4

C=4096 C=4096

C=21

C=21


2. No padding in the middle at all


C=4096 C=4096

C=21

C=21



3. Last 2 pooling→ dilated


C=4096 C=4096

C=21

C=21





4. Fully Connected → Conv

C=4096 C=4096

C=21

C=21






5. Softmax: 21 classes → 64 x 64 x 21

C=4096 C=4096

C=21

C=21

Prediction: upsample to original size (500 x 500)

C4: Front End - VGG 161. Input 900 x 900 x 3 (reflection padding)





C=4096 C=4096

C=21

Q3: Where did authors plug in the module ?A. after blue1 B. after blue 2C. after blue 3D. after brown

The module

C=21

5. Softmax: 21 classes → 64 x 64 x 21

Prediction: upsample to original size (500 x 500)

C5: Front End Performance

→ Modified architecture performs good!

C6: Front End

More accurate

More details

S1: Training ● VOC 2012 + COCO

○ COCO: images with at least 1 class in VOC

○ 2-stage: (VOC + COCO) → VOC

○ Front End only: 71.3% test → Our modifications justified

● Module: context aggre.○ Add padding for inputs (64 x 64 x C), why? → RF 67 x 67, to output 64 x 64 x C

○ Train Front End → Fixed → Train module (Joint ~ the same)

○ Goal: Testing the module + Pairing with structured prediction (Could be jointly learned)

Warning of New Stuff !

S2: Wait, structured prediction? ● Inputs and outputs: objects with structures

○ Object: sequence, boxes, images, list, tree…

○ Need more powerful F to define “compatibility” of (X, Y)




Retrieval

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD 15_2.html




Retrieval

Speech





Retrieval

Speech

Detection





Retrieval

Speech

Detection


Pose

S3: How?● Structured prediction in only 3 steps

● As easy as putting an elephant into a refrigerator

S4: 3 Steps1. Evaluation: how does “compatibility” F(X, Y) look like?

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html Energy-based model: https://cs.nyu.edu/~yann/research/ebm/


1. Inference: Solve “arg max” problem○ Detection Y: all possible bounding boxes○ Summarization Y: all combinations of sentences in a doc



1. Inference: Solve “arg max” problem○ Detection Y: all possible bounding boxes○ Summarization Y: all combinations of sentences in a doc

2. Training: Given data(xi, yi), learn F such that


End of Warning

S5: Structured predictions used● Goal: refine details to improve performance

● CRF: Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. Krähenbühl, et al. (2011)

○ Pairwise links between pixels○ Capture fine edge details with long range dependencies

● RNN-CRF: Conditional Random Fields as Recurrent Neural Networks. Zheng, et al. (2015)

○ Interpret CRFs as RNN○ Can be plug into CNNs and jointly learned

● Q4: What are our inputs and outputs for structured prediction here?

● More about Conditional Random Fields (CRFs)○ Machine Learning: A Probabilistic Perspective ○ http://dontloo.github.io/blog/CRF/○ http://davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/

S6: Evaluate ModuleEffective & Synergistic

S5: Evaluate ModuleEffective & SynergisticCorrections Details

S7: Now the real test● Yellow: off by a margin● Green: over by a margin

→ Context module boost performance

(Large)

Demo videos (Only front end + context): https://www.youtube.com/user/karol32768/videos

S8: Other datasets● Urban Scenes: CamVid, KITTI, Cityscapes

● Adjust #layer in the module w.r.t. input size

● Outperform all prior works

● Code: https://github.com/fyu/dilation

E1: Strength & Weakness● Strength

○ Multiscale contextual information

○ Full-resolution context module can be plugged into any layer

○ Architecture from classification tasks is modified and justified for segmentation

E1: Strength & Weakness● Strength

○ Multiscale contextual information

○ Full-resolution context module can be plugged into any layer

○ Architecture from classification tasks is modified and justified for segmentation

● Weakness○ Only 64 * 64 output size: lose details → Helped by CRF

○ Full-resolution requires more memory usage

○ Performance⬆� ¢ϖ more parameters rather than multiscale info?

→ No experiment to compare with another module with all vanilla conv

E2: Got ideas?1. Multi-head self-attention rather than dilated

conv for multiscale info?○ Attention Is All You Need. A. Vaswani, et al. (2017)○ Multi-Context Attention for Human Pose Estimation. X.

Chu, et al. (2017)

https://pythonmachinelearning.pro/introduction-to-convolutional-neural-networks-for-vision-tasks/


conv for multiscale info?○ Attention Is All You Need. A. Vaswani, et al. (2017)

2. Striding as “learnable” pooling→ replace all pooling layers


2x2 kernel with stride = 2


conv for multiscale info?○ Attention Is All You Need. A. Vaswani, et al. (2017)

2. Striding as “learnable” pooling→ replace all pooling layers

3. Other variants:○ Architecture: Densely Connected Convolutional

Networks. G. Huang, et al. (2016)

○ Objective: Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation. MA Rahman, et al. (2016)


2x2 kernel with stride = 2

Thanks

http://musyoku.github.io/2016/09/18/wavenet-a-generative-model-for-raw-audio/Neural Machine Translation in Linear Time. N. Kalchbrenner, et al. (2017)

multi-scale iclr context 16 aggregation by … · 2018-11-08 · segmentation, surface normal,...

Documents