multi-layer depth and epipolar transformers 3d scene … · 2019. 9. 26. · viewer-centered vs....

75
3D Scene Reconstruction with Multi-layer Depth and Epipolar Transformers to appear, ICCV 2019

Upload: others

Post on 23-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

3D Scene Reconstruction withMulti-layer Depth and Epipolar Transformers

to appear, ICCV 2019

Page 2: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Goal: 3D scene reconstruction from a single RGB image

RGB Image 3D Scene Reconstruction(SUNCG Ground Truth)

Page 3: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

z

y

x

z

y

x

Object-centered

Viewer-centered

Multi-surface Voxels

Pixels, voxels, and views: A study of shape representationsfor single view 3D object shape prediction (CVPR 18. Shin, Fowlkes, Hoiem)

Question: What effect does shape representation have on prediction?

Page 4: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Coordinate system is an important part of shape representation

z

yx

z

yx

CVPR 18

Page 5: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Synthetic training data

CVPR 18

Page 6: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Multi-surface Prediction

Surfaces vs. voxels for 3D object shape prediction

RGB Image

Predicted Mesh

Predicted Voxels

3D Reconstruction

3D Convolution (most common approach)

2D Conv.

CVPR 18

Page 7: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

z

y

x

z

y

x

Object-centered

Viewer-centered

Multi-surface Voxels

Question: What effect does shape representation have on prediction?

CVPR 18

Page 8: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Network architecture for surface predictionCVPR 18

Page 9: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Experiments

• Three difficulty settings (how well does the prediction generalize?)– Novel view: new view of model that is in training set

– Novel model: new model from a category that is in training set

– Novel category: new model from a category that is not in the training set

• Evaluation metrics: Mesh surface distance, Voxel IoU, Depth L1 error

• Same procedure applied in all four cases.

CVPR 18

Page 10: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

What effect does coordinate system have on prediction?

Voxel IoU (mean, higher is better) Depth error (mean, lower is better)

Viewer-centered vs. Object-centered

CVPR 18

Page 11: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

What effect does shape representation have on prediction?

Voxel IoU (mean, higher is better)

Voxels vs. multi-surface

Surface distance (mean, lower is better)

CVPR 18

Page 12: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

CVPR 18

Page 13: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is
Page 14: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Input GT Object-centered prediction (3D-R2N2)

Inspiring examples from 3D-R2N2's Supplementary Material

Page 15: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is
Page 16: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Shape representation is important in learning and prediction.

• Viewer-centered representation generalizes better to difficult input, such as, novel object categories.

• 2.5D surfaces (depth and segmentation) tend to generalize better than voxels and predicts higher fidelity shapes (thin structures)

2.5D segmentation, depth

Page 17: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Viewer-centered vs. Object-centered: Human vision

- Tarr and Pinker 1: Found that human perception is largely tied to viewer-centered coordinate, in experiments on 2D symbols

- McMullen and Farah 2: Object-centered coordinates seem to play more of a role for familiar exemplars, in line drawing experiments.

- We do not claim our computational approach has any similarity to human visual processing.

[1]: M. J. Tarr and S. Pinker. When does human object recognition use a viewer-centered reference frame? Psychological Science, 1(4):253–256, 1990

[2]: P. A. McMullen and M. J. Farah. Viewer-centered and object-centered representations in the recognition of naturalistic line drawings. Psychological Science, 2(4):275–278, 1991.

Page 18: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Follow-up work (Tatarchenko et al., CVPR 19):

- They observe that SoA single-view 3D object reconstruction methods actually perform image classification, and retrieval performance is just as good.

- Following our CVPR 18 work, they recommend the use of viewer-centered coordinate frames.

Page 19: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Follow-up work (Zhang et al., NIPS 18 oral):

- Zhang et al. performs single-view reconstruction of objects in novel categories.

- Their viewer-centered approach achieves SoA results.

- Following our CVPR 18 work, they experiment with both object-centered and viewer-centered models and validate our findings.

Page 20: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

How can we extend viewer-centered, surface-based object representationsto whole scenes?

Page 21: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Predicted Depth(2.5D Surface)

Viewer-centered visible geometryinference

GT Depth

Pixel-wise error

Evaluation

What about the rest of the scene?

Background: Typical monocular depth estimation pipeline

Page 22: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Predicted Depth

2.5D in relation to 3D

Predicted Depth as 3D mesh Ground Truth 3D Mesh

Evaluation

- 3D requires predicting both visible and occluded surfaces!

Page 23: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Multi-layer Depth

Page 24: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Synthetic dataset

CAD model of 3D Scene(SUNCG Ground Truth, CVPR 17)

RGB RenderingPhysically-based rendering(PBRS, CVPR 17)

Page 25: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Object First-hit Depth Layer

Learning Target:

“Traditional depth image with segmentation”

D1z

Page 26: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Object Instance-exit Depth Layer

Learning Target:

“Back of the first object instance”

D2

Page 27: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Room Envelope Depth Layer

Learning Target:

D5

Page 28: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Predicted Multi-layer Depthand Semantic Segmentation

Input RGB ImageEncoder-decoder

Multi-layer Surface Prediction

Page 29: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Input RGB Image

Surface Reconstruction from multi-layer depth

Multi-layerDepth Predictionand Segmentation

Multi-layer Surface Prediction

Page 30: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

3D scene geometry from depth (2.5D)

- How much geometric information is present in a depth image?

RGB image (2D) 2.5D depth

Mesh representation of a synthetically generated depth image (SUNCG).

Page 31: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Epipolar Feature Transformers

Page 32: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Multi-layer is not enough. Motivation for multi-view prediction

2.5D (objects only) Multiple layers of 2.5D Multiple views of 2.5DIncluding a top-down view

Ground truth depth visualization

Page 33: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Multi-view prediction from a single image:Epipolar Feature Transformer Networks

Page 34: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Multi-view prediction from a single image:Epipolar Feature Transformer Networks

Page 35: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

TransformedDepthFeature Map

TransformedSegmentation Feature Map

Transformed RGB

“Best Guess”Depth

Frustum Mask

Transformed Virtual View Features

3 channels

1 channel

1 channel

48 channels

64 channels

Virtual View Surface Prediction

Virtual Viewpoint Proposal

(tx, ty, tz, θ, σ)

117 channels total

Page 36: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Height Map Prediction

Transformed Virtual View Features

Ground Truth L1 Error Map

Page 37: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Frontal Multi-layer Prediction

Height Map Prediction

Input Image

Virtual View Surface Reconstruction

Frontal View Surface Reconstruction

Multi-layer Multi-view Inference

Page 38: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Network architecture for multi-layer depth prediction

Page 39: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Network architecture for multi-layer semantic segmentation

Page 40: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Network architecture for virtual camera pose proposal

Page 41: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Network architecture for virtual view surface prediction

Page 42: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Network architecture for virtual view semantic segmentation

Page 43: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is
Page 44: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Layer-wise cumulative surface coverage

Page 45: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Results

Page 46: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Input View / Alternate viewpoint

Page 47: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Input View / Alternate viewpoint

Page 48: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Previous state-of-the-art based on object detection and volumetric object shape prediction

- CVPR 2018- "Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene" by Tulsiani et al.

- 3D scene geometry prediction from a single RGB image

Page 49: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Object-based reconstruction is sensitive to detection and pose estimation errors

Our viewer-centered, end-to-end scene surface prediction

Object-detection-based state of the art (Tulsiani et al., CVPR 18)

Page 50: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Results on real-world images: object detection error and geometry

Page 51: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Results on real-world images

Page 52: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Results on real-world images

Page 53: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Predicted 3D Mesh Ground Truth 3D Mesh

Quantitative Evaluation Metric

“Inlier” Threshold:

Page 54: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Surface CoveragePrecision-Recall Metrics

GT Surface from SUNCG

Predicted Surface

Page 55: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Surface CoveragePrecision-Recall Metrics

GT Surface from SUNCG

Predicted Surface

i.i.d. point sampling on predicted mesh(with constant density ρ = 10000 points per unit area, m2 in real world scale)

Page 56: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Surface CoveragePrecision-Recall Metrics

GT Surface from SUNCG

Predicted Surface

Closest distance from point to surface, within threshold

Number of points within threshold ( )

Total number of sampled points ( + )

Precision =

“Inlier” Threshold:

Page 57: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Surface CoveragePrecision-Recall Metrics

GT Surface from SUNCG

Predicted Surface

Page 58: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Surface CoveragePrecision-Recall Metrics

GT Surface from SUNCG

Predicted Surface

i.i.d. point sampling on GT mesh(with constant density ρ = 10000 points per unit area, m2 in real world scale)

Page 59: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Surface CoveragePrecision-Recall Metrics

GT Surface from SUNCG

Predicted Surface

Closest distance from point to surface, within threshold

“Inlier” Threshold:

Number of points within threshold ( )

Total number of sampled points ( + )

Recall =

Page 60: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Our multi-layer, virtual-view depthsvs. Object detection based state-of-the-art, 2018

Multi-layer + virtual-view (ours) Multi-layer + virtual-view (ours)

Page 61: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Layer-wise evaluation

Page 62: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Top-down virtual-view prediction improves both precision and recall

(Match threshold of 5cm)

Page 63: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Synthetic-to-real transfer of 3D scene geometry on ScanNet

We measure recovery of true object surfaces and room layouts within the viewing frustum(threshold of 10cm).

Page 64: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Our fully convolutional, viewer-centered inference of 3D scene geometry

Output

We project the center of each voxel into the input camera, and the voxel is marked occupied if the depth value falls in the first object interval (D1, D2) or the occluded object interval (D3, D4).

Page 65: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 66: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 67: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Input Image

Voxelization of multi-layer depth maps

Occluded structure

Output

Page 68: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 69: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 70: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 71: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Voxelization of multi-layer depth maps

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Input Image

Output

Page 72: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 73: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

SemanticSegmentation

Front Surfaces Back Surfaces

VisibleObjects

InvisibleObjects

High resolution voxels

Voxelization of multi-layer depth maps

Input Image

Output

Page 75: Multi-layer Depth and Epipolar Transformers 3D Scene … · 2019. 9. 26. · Viewer-centered vs. Object-centered: Human vision - Tarr and Pinker 1: Found that human perception is

Conclusion- Multi-layer and virtual-view prediction from a single image

- Surface-based accuracy evaluation

- Synthetic-to-real transfer of 3D scene geometry prediction, evaluated quantitatively

- Geometric comparison with detection-based voxel prediction methodsReal-world Input Image Real-world Ground Truth

@DaeyunShin

Code and dataset coming soon.Follow on Twitter for updates!