
Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds

Francis Engelmann*, Theodora Kontogianni*, Alexander Hermans, Bastian Leibe
Computer Vision Group, Visual Computing Institute, RWTH Aachen University
Mies-van-der-Rohe Str. 15, D-52074 Aachen, Germany
http://www.vision.rwth-aachen.de

Contact: M.Sc. Francis Engelmann, M.Sc. Theodora Kontogianni
Phone: +49 241 80 20 {760, 759}
E-Mail: {engelmann, kontogianni}@vision.rwth-aachen.de
Website: http://www.vision.rwth-aachen.de/publication/00146/

This project has received funding from the European Research Council (ERC), CV-SUPER (ERC-2012-StG-307432).

Abstract

Deep learning approaches have made tremendous progress in the field of semantic segmentation over the past few years. However, most current approaches operate in the 2D image space. Direct semantic segmentation of unstructured 3D point clouds is still an open research problem. The recently proposed PointNet architecture presents an interesting step ahead in that it can operate on unstructured point clouds, achieving decent segmentation results. However, it subdivides the input points into a grid of blocks and processes each such block individually. In this paper, we investigate how such an architecture can be extended to incorporate larger-scale spatial context. We build upon PointNet and propose two extensions that enlarge the receptive field over the 3D scene. We evaluate the proposed strategies on challenging indoor and outdoor datasets and show improved results in both scenarios.

Evaluation

Datasets and evaluation metrics: We evaluate our method on S3DIS [2] and Virtual KITTI [3] in terms of mean IoU, overall pixel accuracy, and average class accuracy. Both datasets include per-point semantic class annotations.
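For concreteness, here is a minimal sketch of the mean IoU metric as it is commonly defined (our own illustration, not code from the poster): per class c, IoU_c = TP_c / (TP_c + FP_c + FN_c), averaged over the classes.

```python
import torch

def mean_iou(pred: torch.Tensor, gt: torch.Tensor, num_classes: int) -> float:
    """Mean intersection-over-union over per-point class labels.

    For each class c: IoU_c = TP_c / (TP_c + FP_c + FN_c); classes absent
    from both prediction and ground truth are skipped before averaging."""
    ious = []
    for c in range(num_classes):
        tp = ((pred == c) & (gt == c)).sum().item()  # true positives
        fp = ((pred == c) & (gt != c)).sum().item()  # false positives
        fn = ((pred != c) & (gt == c)).sum().item()  # false negatives
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)

# Example: per-point predictions vs. ground truth for 6 points, 3 classes.
pred = torch.tensor([0, 0, 1, 1, 2, 2])
gt   = torch.tensor([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, gt, num_classes=3))  # averages IoU_0, IoU_1, IoU_2
```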

Our Method

[Figure: simplified PointNet [1]. A shared MLP maps the N x D input points to point features (N x D'); max-pooling (M) yields a 1 x D' global feature, which is stacked (S) and concatenated (C) back onto the point features (N x D+D'); a final MLP produces the N x M per-point output scores.]

[Figure: input-level context. Multi-scale blocks (same position, different scales) and grid blocks (different positions, same scale) are each processed by a shared MLP (64, 128) and max-pooled (M); the resulting block features are stacked (S) and concatenated (C) into a 1 x 384 multi-scale block feature that is appended to the point features (N x 128+384). Consolidation Unit (CU): an MLP produces N x O point features, max-pooling (M) gives a 1 x O block feature, and stacking (S) plus concatenation (C) yield N x 2O consolidated features (O = 256, then O = 128); a final MLP (M) outputs the N x M scores.]

[Figure: output-level context. Recurrent Consolidation Unit (RCU): shared MLPs (128, 64) followed by max-pooling (M) produce one 1 x 64 feature per block; an unrolled GRU-RNN consumes this sequence of block features, and its outputs are stacked (S) and concatenated (C) back onto the point features before a shared final MLP (M) produces the N x M output scores.]

Legend: M = max pool, S = stack, C = concatenate.
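To make the Consolidation Unit concrete, here is a minimal sketch, assuming PyTorch; the O = 256 and O = 128 sizes follow the figure, while the class and variable names are ours, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ConsolidationUnit(nn.Module):
    """N x O point features -> max-pooled 1 x O block feature (M),
    stacked (S) over all points and concatenated (C) back: N x 2O."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # Shared MLP, applied to every point independently.
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.mlp(x)                  # (N, O) per-point features
        g = f.max(dim=0).values          # (O,) block feature via max pool (M)
        g = g.unsqueeze(0).expand_as(f)  # stack (S) it onto all N points
        return torch.cat([f, g], dim=1)  # concatenate (C): (N, 2O)

# Two consecutive units with O = 256 and O = 128, as in the figure.
cu = nn.Sequential(ConsolidationUnit(128, 256), ConsolidationUnit(512, 128))
points = torch.rand(1024, 128)           # N = 1024 points, 128-dim features
print(cu(points).shape)                  # torch.Size([1024, 256])
```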

References

[1] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D Semantic Parsing of Large-Scale Indoor Spaces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Problem Statement

[Figure: unstructured 3D point cloud semantic segmentation. Given an unstructured 3D point cloud as input, the task is to predict a semantic class label for every point; panels show input, ground truth, and output.]

Quantitative Results

[Tables: mean IoU, overall pixel accuracy, and average class accuracy on S3DIS [2] and Virtual KITTI [3].]

Qualitative Results

[Figure: qualitative results on S3DIS [2] and Virtual KITTI [3]; rows show input, ground truth, and output, comparing the panels "Geometry Only" and "Geometry + Appearance". A separate row shows geometry-only results: output vs. ground truth.]

Contributions

We present different mechanisms to increase and consolidate spatial context:

· We present two mechanisms that increase the spatial context for semantic segmentation of 3D point clouds: input-level and output-level context.
· We verify experimentally that our proposed extensions achieve improved results on challenging indoor and outdoor datasets.
· We show competitive results using only geometric input features (no color).


Background: PointNet (Simplified) [1]

· Main idea of PointNet: compute a global feature summarizing a set of unordered point features using max-pooling (M).
· Prediction is based on point features representing local context, concatenated (C) with the global feature representing neighboring context.
· The neighborhood is spatially limited up to a certain radius.
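As a quick sanity check of the main idea above (a generic PyTorch example of ours, not from the poster): max-pooling is a symmetric function, so the pooled global feature is invariant to the order of the points, which is what lets the network consume unordered point sets.

```python
import torch

feats = torch.rand(5, 64)         # 5 points with 64-dim features
perm  = feats[torch.randperm(5)]  # same points, shuffled order

# Reordering the points leaves the max-pooled global feature unchanged.
assert torch.equal(feats.max(dim=0).values, perm.max(dim=0).values)
```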

In short, input-level context enlarges what the network sees at its input by combining multi-scale blocks and grid blocks with Consolidation Units (CU), while output-level context consolidates information across blocks at the output with a Recurrent Consolidation Unit (RCU), an unrolled GRU-RNN over the sequence of block features.
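The recurrent consolidation idea admits a similarly small sketch, again assuming PyTorch (the 1 x 64 block-feature size follows the figure; the sequence length and names are illustrative).

```python
import torch
import torch.nn as nn

# An unrolled GRU runs over the sequence of per-block features, so each
# block's consolidated feature can depend on the other blocks in the scene.
gru = nn.GRU(input_size=64, hidden_size=64)

blocks = torch.rand(8, 1, 64)  # 8 blocks, batch size 1, 1 x 64 features each
consolidated, _ = gru(blocks)  # (8, 1, 64): one updated feature per block
# Each consolidated feature is then concatenated back onto that block's
# point features before the final per-point classification MLP.
```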