learning less is more - heidelberg collaboratory for image ...€¦ · learning less is more - 6d...

1
Learning Less is More - 6D Camera Localization via 3D Surface Regression Eric Brachmann and Carsten Rother Heidelberg University (HCI/IWR) This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 647769) Code and trained models: Problem Statement Contributions Estimate the 6D camera pose (position + orientation) relative to a known scene from a single RGB image. We show that learning less is more. See right: - Red: Learn everything. CNN predicts pose directly. - Orange: Learn two components of a geometric pipeline. Our previous work [Bra17]. - Cyan: Learn one component of a geometric pipeline. This work. - Green: Ground truth camera path. Estimated Camera Pose Input Fully differentiable, robust pose optimization without learnable parameters on top of learned scene coordinate regression Learning scene coordinate regression without a 3D scene model or depth maps Stable end-to-end training due to new approximation of refinement gradients, and controlling the entropy of pose hypotheses We exceed state-of-the-art on camera localization on three datasets (indoor and outdoor) Previous Work: Differentiable RANSAC (DSAC) [Bra17] Reprojection Errors of 2 1 3 4 2 Input RGB Scene Coordinate () Regression [Sho13] Hypothesis Sampling Scoring () Hypothesis Selection Refinement () ( 1 ) ( 4 ) ( 2 ) ( 3 ) Probabilistic Pose Selection = , where ~(|) (|) = exp / exp(( )) DSAC Learning Objective ~(|,) ( , , ) = ~(|) . log + . Reprojection Errors of 2 1 3 4 2 ( 1 ) ( 4 ) ( 2 ) ( 3 ) Updated Pipeline 640x480 80x60 Learned Not learned but differentiable Fully Convolutional Network Architecture Learning w/o a 3D model Differentiable Refinement Soft Inlier Count Entropy Control Learning without a 3D Scene Model Training scene coordinate regression in 3 stages: Input RGB Ground Truth Scene Coordinates 1) Assume constant depth 2) Optimize Re- projection Error 3) End-to-end training (DSAC) 1) min σ , with = , ,,1 2) min σ ∗−1 Hypothesis Score: Soft Inlier Count Reprojection Error: , = −1 Inlier Count: , - not differentiable Soft In. Count: sig( − , ) differentiable Previously: learned - hard to regularize, overfits Hypothesis Score: Entropy Control , = exp(( ,) σ exp(( ,) ( 1 ) ( 4 ) ( 2 ) ( 3 ) ( 1 ) ( 4 ) ( 2 ) ( 3 ) ( 1 ) ( 4 ) ( 2 ) ( 3 ) Hypothesis Distribution: High Entropy: No separation, Instable Low Entropy: Few Learning Signals, Instable Medium Entropy: Rich Learning Signals, Stable Entropy: =− , log , Keep target entropy during training by adjusting : argmin | | Differentiable Refinement Refinement optimizes re- projection errors of inlier set : = argmin , 2 Gauss-Netwon update step: + = , ≈− O , Last update: = O O , , with O = =∞ () Gradient approximation: Results 38,6% 55,9% 60,4% 62,5% 76,1% 0% 20% 40% 60% 80% ORB+PNP [Sho13] DSAC (w/ 3D Model) Our (w/o 3D Model) DSAC (w/ Depth) Our (w/ 3D Model) % Correct Test Frames 7Scenes [Sho13] Results (Error < 5cm,5°) Avg. Median Err. w/ 3D Model w/o 3D Model PoseNet [Ken17] 1.43m, 2.9° 1.63m, 2.8° Active Search [Sat16] 0.29m, 0.6° - DSAC [Bra17] 0.31m, 0.8° - Our 0.14m, 0.3° 0.19m, 0.5° Cambridge Landmarks [Ken15] Results DSAC (w/ 3D Model) Our (w/ 3D Model) Our (w/o 3D Model) [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13 [Ken15] PoseNet: A Convolutional Network for Real-Time 6-DOF Camera RelocalizationKendall et al., ICCV’15 [Ken17] “Geometric Loss Functions for Camera Pose Regression with Deep Learning” Kendall and Cipolla, CVPR 2017 [Sat16] Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization” Sattler et al., PAMI 2016 [Bra17] “DSAC - Differentiable RANSAC for Camera Localization”, Brachmann et al., CVPR’17 Training Set (2 Images Total) Test Image Estimation with Learned Score (DSAC) Estimation with Soft Inlier Count (Our) Estimated Camera Poses 3D Model Overlay DSAC Our 3D Model Overlay

Upload: others

Post on 13-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning Less is More - Heidelberg Collaboratory for Image ...€¦ · Learning Less is More - 6D Camera Localization via 3D Surface Regression Eric Brachmann and Carsten Rother Heidelberg

Learning Less is More - 6D Camera Localization via 3D Surface RegressionEric Brachmann and Carsten Rother

Heidelberg University (HCI/IWR)

This project has received funding from the European ResearchCouncil (ERC) under the European Union’s Horizon 2020 researchand innovation program (grant agreement No 647769)

Code and trained models:

Problem Statement

Contributions

Estimate the 6D camera pose (position + orientation)

relative to a known scene from a single RGB image.

We show that learning less is more. See right:

- Red: Learn everything.

CNN predicts pose directly.

- Orange: Learn two components of a geometric pipeline.

Our previous work [Bra17].

- Cyan: Learn one component of a geometric pipeline.

This work.

- Green: Ground truth camera path.

Estimated Camera PoseInput

• Fully differentiable, robust pose

optimization without learnable

parameters on top of learned scene

coordinate regression

• Learning scene coordinate regression

without a 3D scene model or depth

maps

• Stable end-to-end training due to new

approximation of refinement gradients,

and controlling the entropy of pose

hypotheses

• We exceed state-of-the-art on camera

localization on three datasets (indoor

and outdoor)

Previous Work: Differentiable RANSAC (DSAC) [Bra17]

Reprojection

Errors of 𝐡2

𝐰

𝐡1

𝐡3

𝐡4

𝐡2

Input RGB Scene Coordinate (𝐲)Regression [Sho13]

Hypothesis

Sampling

Scoring (𝑠) Hypothesis

Selection

Refinement (𝐑)

𝐯መ𝐡

𝑠(𝐡1)

𝑠(𝐡4)

𝑠(𝐡2)

𝑠(𝐡3)

𝐲

Probabilistic Pose Selection

መ𝐡𝐰 = 𝐡𝑗𝐰, where 𝑗~𝑃(𝑗|𝐰)

𝑃(𝑗|𝐰) = exp 𝑠 𝐡𝑗𝑤 /

𝑘exp(𝑠(𝐡𝑘

𝐰))

DSAC Learning Objective

𝜕

𝜕𝐰𝔼𝑗~𝑃(𝑗|𝐰,𝐯) 𝓁(𝐑 𝐡𝑗

𝐰, 𝐰 , 𝐡∗) =

𝔼𝑗~𝑃(𝑗|𝐰) 𝓁 . 𝜕𝜕𝐰

log 𝑃 𝑗 𝐰 + 𝜕𝜕𝐰𝓁 .

Reprojection

Errors of 𝐡2

𝐡1

𝐡3

𝐡4

𝐡2

መ𝐡

𝑠(𝐡1)

𝑠(𝐡4)

𝑠(𝐡2)

𝑠(𝐡3)

Updated Pipeline

64

0x4

80

80

x6

0

Learned Not learned but differentiable

Fully Convolutional

Network ArchitectureLearning w/o

a 3D model

Differentiable

RefinementSoft Inlier

Count

Entropy

Control

Learning without a 3D Scene Model

Training scene coordinate regression in 3 stages:

Input RGB Ground Truth

Scene Coordinates

1) Assume

constant depth 𝑑

2) Optimize Re-

projection Error3) End-to-end

training (DSAC)

1) min σ𝑖 𝐲𝑖 𝐰 − 𝐲𝑖∗ , with 𝐲𝑖

∗ = 𝐡∗ 𝑑𝑥𝑖𝑓,𝑑𝑦𝑖𝑓, 𝑑, 1

𝑇2) min σ𝑖 𝐶𝐡∗−1𝐲𝑖 𝐰 − 𝐩𝑖

Hypothesis Score: Soft Inlier Count

Reprojection Error:

𝑟𝑖 𝐡,𝐰

= 𝐶𝐡−1𝐲𝑖 𝐰 − 𝐩𝑖

Inlier Count: 𝑠 𝐡 = σ𝑖 𝟙 𝜏 − 𝑟𝑖 𝐡,𝐰 - not differentiable

Soft In. Count: 𝑠 𝐡 = σ𝑖 sig(𝜏 − 𝛽𝑟𝑖 𝐡,𝐰 )− differentiable

Previously: learned 𝑠 𝐡 - hard to regularize, overfits

Hypothesis Score: Entropy Control

𝑃 𝑗 𝐰, 𝛼 =exp(𝛼𝑠(𝐡𝑗,𝐰)

σ𝑘 exp(𝛼𝑠(𝐡𝑘,𝐰)

𝑠(𝐡1)

𝑠(𝐡4)

𝑠(𝐡2)

𝑠(𝐡3)

𝑠(𝐡1)

𝑠(𝐡4)

𝑠(𝐡2)

𝑠(𝐡3)

𝑠(𝐡1)

𝑠(𝐡4)

𝑠(𝐡2)

𝑠(𝐡3)

Hypothesis Distribution:

High Entropy:

No separation,

Instable

Low Entropy:

Few Learning

Signals, Instable

Medium Entropy:

Rich Learning

Signals, Stable

Entropy:

𝑆 𝛼 = −

𝑗

𝑃 𝑗 𝐰, 𝛼 log 𝑃 𝑗 𝐰, 𝛼

Keep target entropy 𝑆∗during training by adjusting 𝛼: argmin𝛼|𝑆 𝛼 − 𝑆∗|

Differentiable Refinement

Refinement 𝐑 optimizes re-

projection errors 𝐫ℐ of inlier set ℐ: 𝐑 𝐡 = argmin𝐡′ 𝐫ℐ 𝐡′, 𝐰 2

Gauss-Netwon update step: 𝐑𝒕+𝟏 = 𝐑𝒕 − 𝐽𝐫𝑇𝐽𝐫

−𝟏𝐽𝐫𝑇𝐫ℐ 𝐑𝒕, 𝐰

𝜕

𝜕𝐰𝐑 𝐡 ≈ − 𝐽𝐫

𝑇𝐽𝐫−𝟏𝐽𝐫

𝑇𝜕

𝜕𝐰𝐫ℐ 𝐡O, 𝐰

Last update: 𝐑 𝐡 = 𝐡O − 𝐽𝐫𝑇𝐽𝐫

−𝟏𝐽𝐫𝑇𝐫ℐ 𝐡O, 𝐰 , with 𝐡O = 𝐑𝒕=∞ (𝐡)

Gradient approximation:

Results

38,6%

55,9%

60,4%

62,5%

76,1%

0% 20% 40% 60% 80%

ORB+PNP [Sho13]

DSAC (w/ 3D Model)

Our (w/o 3D Model)

DSAC (w/ Depth)

Our (w/ 3D Model)

% Correct Test Frames

7Scenes [Sho13] Results (Error < 5cm,5°)

Avg. Median Err. w/ 3D Model w/o 3D Model

PoseNet [Ken17] 1.43m, 2.9° 1.63m, 2.8°

Active Search [Sat16] 0.29m, 0.6° -

DSAC [Bra17] 0.31m, 0.8° -

Our 0.14m, 0.3° 0.19m, 0.5°

Cambridge Landmarks [Ken15] Results

DSAC (w/ 3D Model) Our (w/ 3D Model) Our (w/o 3D Model)

[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR’13

[Ken15] “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization” Kendall et al., ICCV’15

[Ken17] “Geometric Loss Functions for Camera Pose Regression with Deep Learning” Kendall and Cipolla, CVPR 2017

[Sat16] “Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization” Sattler et al., PAMI 2016

[Bra17] “DSAC - Differentiable RANSAC for Camera Localization”, Brachmann et al., CVPR’17

Training Set (2 Images Total)

Test

Image

Estimation with Learned Score (DSAC) Estimation with Soft Inlier Count (Our)

Estimated Camera Poses 3D Model Overlay

DSACOur

3D Model Overlay