
Page 1: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Pyramid Scene Parsing Network

Hengshuang Zhao¹, Jianping Shi², Xiaojuan Qi¹, Xiaogang Wang¹, Jiaya Jia¹

¹The Chinese University of Hong Kong, ²SenseTime Group Limited

Presentation: Shunta Saito

Slide: Powered by Deckset

Page 2: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Summary

• Introduce the Pyramid Pooling Module for better context aggregation with sub-region awareness

Page 3: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Why did I choose this paper?

• Presented at CVPR 2017

• 1st place in ImageNet Scene Parsing Challenge 2016 (ADE20K)

• Was 1st place on the Cityscapes leaderboard

• Now it's in 2nd place (I noticed this last week!)

Page 4: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Agenda

1. Common building blocks in semantic segmentation

2. Major Issue

3. Prior Work

4. Pyramid Pooling Module

5. Experimental results

Page 5: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Semantic Segmentation

• Predict pixel-wise labels from natural images

• Each pixel in an image belongs to an object class

• So it's not instance-aware!

Page 6: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Common Building Blocks (1)

Fully convolutional network (FCN)1

• A deep convolutional neural network which doesn't include any fully-connected layers

• Almost all recent methods are based on FCN

• Typically pre-trained on ImageNet under a classification problem setting

1 "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015

Page 7: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Common Building Blocks (2)

Dilated convolution1

• Widen receptive field without reducing feature map resolution

• Important for leveraging global context prior efficiently

1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html

(c) Preferred Networks 7
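To make the receptive-field claim concrete, here is a minimal 1-D NumPy sketch of my own (not the paper's code): spacing the kernel taps `dilation` samples apart widens the receptive field without adding parameters or reducing resolution.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    # Valid-padding 1-D convolution whose taps are spaced `dilation`
    # samples apart; the receptive field is (len(w) - 1) * dilation + 1.
    k = len(w)
    rf = (k - 1) * dilation + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - rf + 1)
    ])

x = np.arange(16, dtype=float)
w = np.array([1.0, 1.0, 1.0])
print(dilated_conv1d(x, w, dilation=1))  # receptive field 3
print(dilated_conv1d(x, w, dilation=4))  # receptive field 9, same 3 weights
```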

Page 8: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Common Building Blocks (3)

Multi-scale feature ensemble

• Higher-layer features contain more semantic meaning and less location information

• Combining multi-scale features can improve performance1

1 "Hypercolumns for Object Segmentation and Fine-grained Localization", CVPR 2015

Page 9: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Common Building Blocks (4)

Conditional random field (CRF)

• Post-processing to refine the segmentation result (DeepLab1)

• Some follow-up methods refined the network via end-to-end modeling (DPN2, CRF as RNN3, Detections and Superpixels4)

1 "Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs", ICLR 2015

2 "Semantic Image Segmentation via Deep Parsing Network", ICCV 2015

3 "Conditional Random Fields as Recurrent Neural Networks", ICCV 2015

4 "Higher Order Conditional Random Fields in Deep Neural Networks", ECCV 2016

Page 10: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Common Building Blocks (5)

Global average pooling (GAP)

• ParseNet1 showed that global average pooling with FCN can improve semantic segmentation results

• But the global descriptors used in the paper are not representative enough for some challenging datasets like ADE20K

1 "ParseNet: Looking Wider to See Better", arXiv 2015

Page 11: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Major Issue (1)

Mismatched relationship

• Co-occurrent visual patterns imply some contexts

• e.g., an airplane is likely to fly in the sky, not over a road

• Lack of the ability to collect contextual information increases the chance of misclassification

• In the right figure, FCN predicts the boat in the yellow box as a "car" based on its appearance


Page 12: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Major Issue (2)

Confusing Classes

• There are confusing classes in major datasets: field and earth; mountain and hill; wall, house, building and skyscraper, etc.

• Even the expert human annotator still makes a 17.6% pixel error on ADE20K1

• FCN predicts the object in the box as part of skyscraper and part of building but the whole object should be either skyscraper or building, not both

• Utilizing the relationship between classes is important

1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html

(c) Preferred Networks 12

Page 13: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Major Issue (3)

Inconspicuous Classes

• Small objects like streetlight and signboard are inconspicuous and hard to find while they may be important

• Big objects may get discontinuous predictions; here FCN couldn't correctly label the pillow, which has a similar appearance to the sheet

• To improve performance on small or very big objects, more attention should be paid to sub-regions

Page 14: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Summary of Issues

• Use co-occurrent visual patterns as context

• Consider relationship between classes

• More attention should be paid to sub-regions

Page 15: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Prior Work

Global Average Pooling (GAP)1

• The receptive field of ResNet is already larger than the input image, so GAP sounds good for summarizing all the information

• But pixels in an image may belong to various objects of different sizes, so directly fusing them into a single vector may lose the spatial relations and cause ambiguity

1 "ParseNet: Looking Wider to See Better", arXiv 2015
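The loss of spatial relations is easy to see in a NumPy sketch of this kind of GAP fusion (my own illustration; ParseNet also L2-normalizes the features, which is omitted here): the global descriptor is one vector tiled over the whole map, so every location receives identical context.

```python
import numpy as np

def global_context_concat(fmap):
    # fmap: C x H x W feature map. Global average pooling yields one
    # C-vector; tiling it back and concatenating gives every pixel the
    # same context, discarding spatial relations between sub-regions.
    c, h, w = fmap.shape
    g = fmap.mean(axis=(1, 2), keepdims=True)   # C x 1 x 1
    g = np.broadcast_to(g, (c, h, w))           # tile back to C x H x W
    return np.concatenate([fmap, g], axis=0)    # 2C x H x W

print(global_context_concat(np.random.rand(4, 8, 8)).shape)  # (8, 8, 8)
```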

Page 16: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Prior Work

Spatial Pyramid Pooling (SPP)1

• Pooling with different kernel/stride sizes to the feature maps

• Then flatten and concatenate the pooling results to make a fixed-length representation

• There is still context information loss

1 "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition", ECCV 2014
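A NumPy sketch of the flatten-and-concatenate step (my own illustration; average pooling is used here for simplicity, while the original SPP uses max pooling):

```python
import numpy as np

def spatial_pyramid_pool(fmap, bins=(1, 2, 4)):
    # Pool a C x H x W map into b x b grids for each pyramid level and
    # flatten, producing a fixed-length vector for any input resolution.
    feats = []
    for b in bins:
        rows = np.array_split(np.arange(fmap.shape[1]), b)
        cols = np.array_split(np.arange(fmap.shape[2]), b)
        for r in rows:
            for s in cols:
                feats.append(fmap[:, r][:, :, s].mean(axis=(1, 2)))
    return np.concatenate(feats)  # length = C * sum(b * b for b in bins)

print(spatial_pyramid_pool(np.random.rand(8, 21, 21)).shape)  # (168,)
```

The output length is independent of H and W, which is what made SPP attractive, but flattening discards where each sub-region response came from.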

Page 17: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Pyramid Pooling Module

• A hierarchical global prior, containing information at different scales and varying among different sub-regions

• The Pyramid Pooling Module for the global scene prior is constructed on top of the final-layer feature map

Page 18: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Pyramid Pooling Module

• Use a 1x1 conv to reduce the number of channels of each pooled map

• Then upsample (bilinear) them to the same size and concatenate all, as sketched below
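A rough Chainer sketch of the module as described above (my own reading of the slides, assuming the input spatial size is divisible by each bin count; the BN and ReLU after each 1x1 conv are omitted for brevity):

```python
import chainer
import chainer.functions as F
import chainer.links as L

class PyramidPoolingModule(chainer.Chain):
    def __init__(self, in_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        with self.init_scope():
            # One 1x1 conv per level reduces channels to 1/len(bins) of
            # the input, so concatenating the input with the 4 branches
            # doubles the channel count.
            for i in range(len(bins)):
                setattr(self, 'conv%d' % i,
                        L.Convolution2D(in_channels,
                                        in_channels // len(bins), 1))

    def __call__(self, x):
        n, c, h, w = x.shape
        ys = [x]
        for i, b in enumerate(self.bins):
            ksize = (h // b, w // b)                 # pool into a b x b grid
            y = F.average_pooling_2d(x, ksize, stride=ksize)
            y = getattr(self, 'conv%d' % i)(y)       # 1x1 channel reduction
            y = F.resize_images(y, (h, w))           # bilinear upsampling
            ys.append(y)
        return F.concat(ys, axis=1)                  # fuse with the input map
```

With a 2048-channel ResNet feature map this gives 2048 + 4 x 512 = 4096 channels feeding the final convolution.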

Page 19: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Implementation details (1)

• Average pooling is applied at four levels, producing 1x1, 2x2, 3x3, and 6x6 output bins (kernel size and stride are set accordingly)

• A pre-trained ResNet model with dilated convolution is used as the feature extractor (the output size is 1/8 of the input image)

• They use two losses:

1. softmax loss between the final output and the labels

2. softmax loss between an intermediate output of ResNet and the labels1 (weighted by 0.4)

1 "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks", ECCV 2016
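A sketch of how the two losses combine during training (function and variable names are mine; the auxiliary branch is discarded at test time):

```python
import chainer.functions as F

def training_loss(main_logits, aux_logits, labels, aux_weight=0.4):
    # Main softmax cross-entropy on the final output plus the
    # down-weighted auxiliary loss from the intermediate ResNet stage.
    return (F.softmax_cross_entropy(main_logits, labels)
            + aux_weight * F.softmax_cross_entropy(aux_logits, labels))
```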

Page 20: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Implementation details (2)

Optimization

• MomentumSGD with weight decay (momentum: 0.9, weight decay: 0.0001)

• LR scheduling: the "poly" policy, $lr = lr_{base} \cdot (1 - \frac{iter}{iter_{max}})^{power}$, with $lr_{base} = 0.01$ and $power = 0.9$
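The schedule is simple enough to state as runnable code (a direct transcription of the formula above):

```python
def poly_lr(iteration, max_iter, base_lr=0.01, power=0.9):
    # "Poly" policy: decay from base_lr toward 0 over max_iter iterations.
    return base_lr * (1.0 - iteration / max_iter) ** power

print(poly_lr(0, 150000))       # 0.01 at the start
print(poly_lr(75000, 150000))   # ~0.0054 at the halfway point
```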

Page 21: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Implementation details (3)

Training iterations:

• ADE20K: 150K

• PASCAL VOC: 30K

• Cityscapes: 90K

Data augmentation (see the sketch after this list):

• Random mirror

• Random resize between 0.5 and 2

• Random rotation between -10 and 10 degrees

• Random Gaussian blur (ADE20K and PASCAL VOC only)
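A sketch of that pipeline using scipy.ndimage (the blur probability and sigma are my assumptions; the slide only gives the transform types and ranges):

```python
import random
from scipy import ndimage

def augment(img, label):
    # img: H x W x 3 float array, label: H x W int array. Labels are
    # resampled with order=0 (nearest) so class ids are never blended.
    if random.random() < 0.5:                    # random mirror
        img, label = img[:, ::-1].copy(), label[:, ::-1].copy()
    s = random.uniform(0.5, 2.0)                 # random resize in [0.5, 2]
    img = ndimage.zoom(img, (s, s, 1), order=1)
    label = ndimage.zoom(label, (s, s), order=0)
    a = random.uniform(-10.0, 10.0)              # random rotation in degrees
    img = ndimage.rotate(img, a, reshape=False, order=1)
    label = ndimage.rotate(label, a, reshape=False, order=0)
    if random.random() < 0.5:                    # Gaussian blur (ADE20K / VOC)
        img = ndimage.gaussian_filter(img, sigma=(1.0, 1.0, 0.0))
    return img, label
```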

Page 22: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Implementation details (4)

• An appropriately large "cropsize" can yield good performance

• The "batchsize" in the batch normalization layers is of great importance

Cropsize:

• ADE20K: 473 x 473

• PASCAL VOC: 473 x 473

• Cityscapes: 713 x 713

Batchsize: 16 for all datasets

Page 23: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Implementation details (5)

Distributed Batch Normalization

• To increase the "batchsize" in batch normalization layers, they used a custom BN layer applied to data gathered from multiple GPUs using OpenMPI

• We have Akiba-san's implementation of distributed batch normalization!

Page 24: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

ImageNet Scene Parsing Challenge 2016

• Dataset: ADE20K

• 150 classes and 1,038 image-level labels

• 20,000/2,000/3,000 pixel-level annotated images for train/val/test

Page 25: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Ablation Study for Pyramid Pooling Module

• Average pooling works better than max pooling in all settings

• Pooling with pyramid parsing outperforms that using global pooling

• With dimension reduction (DR; reducing the number of channels after pyramid pooling), the performance is further enhanced


Page 26: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Ablation Study for Auxiliary Loss

• They set the auxiliary loss weight between 0 and 1 and compared the final results

• A weight of 0.4 yields the best performance

Page 27: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Ablation Study for the ResNet Part

Deeper is better


Page 28: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

More Detailed Performance Analysis

Additional processing and its improvement (mIoU, %):

• Data augmentation (DA): +1.54

• Auxiliary loss (AL): +1.41

• Pyramid pooling module (PSP): +4.45

• Deeper ResNet (50 → 269): +2.13

• Multi-scale testing (MS): +1.13

• For multi-scale testing, they create predictions at 6 different scales (0.5, 0.75, 1, 1.25, 1.5, and 1.75) and take the average of them, as sketched below
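A sketch of the test-time averaging, where `predict_fn` is a hypothetical single-scale inference function returning an H x W x n_class probability map:

```python
import numpy as np
from scipy import ndimage

SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75)

def multiscale_predict(predict_fn, img):
    # Run inference at six scales and average the per-class probability
    # maps after resizing them back to the original resolution.
    h, w = img.shape[:2]
    probs = []
    for s in SCALES:
        scaled = ndimage.zoom(img, (s, s, 1), order=1)
        p = predict_fn(scaled)
        p = ndimage.zoom(p, (h / p.shape[0], w / p.shape[1], 1), order=1)
        probs.append(p)
    return np.mean(probs, axis=0)  # final labels: argmax over classes
```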

Page 29: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Results on PASCAL VOC 2012

• Extended with the Semantic Boundaries Dataset (SBD)1, they used 10582, 1449, and 1456 images for train/val/test

• Mismatched relationship: for "aeroplane" and "sky" in the second and third rows, PSPNet finds missing parts

• Confusing classes: for "cows" in row one, the baseline model treats it as "horse" and "dog" while PSPNet corrects these errors

• Inconspicuous objects: for "person", "bottle" and "plant" in the following rows, PSPNet performs well on these small-size-object classes compared to the baseline model

1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html

Page 30: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Results on PASCAL VOC 2012

• Comparing PSPNet with previous best-performing methods on the test set under two settings, i.e., with or without pre-training on the MS-COCO dataset

Page 31: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Results on Cityscapes

• The Cityscapes dataset consists of 2975, 500, and 1525 train/val/test images (19 classes)

• 20,000 coarsely annotated images are also available (in the table below, ‡ means they are used)

Page 32: [unofficial] Pyramid Scene Parsing Network (CVPR 2017)

Thank you for your attention

• The official repository doesn't include any training code

• My own implementation of both training and testing is ready:

• mitmul/segmentation: https://github.pfidev.jp/mitmul/segmentation

• Now I'm training a model to ensure reproducibility

• Once the reproduction work is finished, I'll send the code to ChainerCV

• Training on the Cityscapes dataset takes over 20 days using 8 GPUs even with a ResNet50-based PSPNet (they used ResNet101 for Cityscapes)

• Now ChainerMN is a necessary tool for such large-scale datasets and deep models

• So, we need more GPU machines connected to each other with InfiniBand