

Source: sunw.csail.mit.edu/2014/papers2/02_Zhang_SUNw.pdf

PanoContext: A Whole-room 3D Context Model for Panoramic Scene Understanding

Yinda Zhang¹ Shuran Song¹ Ping Tan² Jianxiong Xiao¹

¹Princeton University ²National University of Singapore

Abstract

We observe that the small field of view of standard cameras is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we propose a whole-room context model in 3D for a 360° full-view panorama. From an input panorama, our method outputs a 3D bounding box of the room and all major objects inside, together with their semantic categories (Fig. 1). To train our model, we construct an annotated panorama dataset and reconstruct the 3D model from a single view using manual annotation. Experiments show that our model can recognize objects using only 3D contextual information, without any image feature for categorization, and still achieve performance comparable to the state-of-the-art object detector that uses only image features.

1. Introduction

While the past decade has witnessed rapid progress on bottom-up object detection methods, the improvement brought by top-down context cues has been rather limited. In contrast, there is strong psychophysical evidence that context plays a crucial role in scene understanding for humans. We believe that one of the main reasons for this gap is that the field of view (FOV) of a typical camera is only about 15% of that of the human vision system. Therefore, we advocate the use of panoramic images in scene understanding, which nowadays can be easily obtained with camera arrays, special lenses, and automatic image stitching.

2. Method and Results

Our method first generates scene hypotheses (room layout and objects) in a bottom-up fashion from image evidence, and then evaluates them holistically using top-down information learned from our dataset. In a panorama, we can see the whole scene, and characteristic scene objects such as beds and sofas are usually visible despite occlusion, so we can jointly optimize the room layout and object detection to exploit the contextual information in a variety of ways, at its full strength.

[Figure 1 panels: Input: a single-view panorama; Output: object detection; Output: 3D reconstruction. Object labels: bed, sofa, nightstand, painting, tv, mirror, door, window, desk, chair.]

Figure 1. Input and output. Taking a full-view panorama as input, our algorithm can detect all the objects inside the panorama and represent each of them as a bounding box in 3D, which also enables 3D reconstruction from a single view.
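The generate-then-evaluate procedure described in this section can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the `SceneHypothesis` class, its two score fields, and the weighted-sum combination are hypothetical names and choices introduced here. The text only says that hypotheses are generated bottom-up from image evidence and evaluated holistically with top-down information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SceneHypothesis:
    """Hypothetical container for one whole-room hypothesis (layout plus objects)."""
    name: str
    evidence_score: float  # bottom-up: how well the hypothesis fits the image evidence
    context_score: float   # top-down: how plausible the 3D arrangement is


def holistic_score(h: SceneHypothesis, context_weight: float = 1.0) -> float:
    # Assumed combination rule: a weighted sum of the two cues.
    return h.evidence_score + context_weight * h.context_score


def rank_hypotheses(hyps: List[SceneHypothesis],
                    context_weight: float = 1.0) -> List[SceneHypothesis]:
    # Sort so that the hypothesis with the highest holistic score comes first.
    return sorted(hyps, key=lambda h: holistic_score(h, context_weight), reverse=True)


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    candidates = [
        SceneHypothesis("bed floating mid-room", 0.9, 0.1),
        SceneHypothesis("bed against wall, nightstand beside it", 0.8, 0.7),
    ]
    print(rank_hypotheses(candidates)[0].name)
```

In this toy setup the hypothesis with slightly weaker image evidence wins because its arrangement is more plausible in context; setting `context_weight` to 0 reduces the ranking to bottom-up evidence alone.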

Some results for both bedrooms and living rooms are shown in Fig. 2, where we can see that the algorithm performs reasonably well. Using only 3D contextual information, without any image feature for categorization, we can still achieve performance comparable to state-of-the-art object detectors (DPM) that use image features.
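To give a feel for categorization without image features, the sketch below labels a 3D box purely from its geometry relative to the room. Everything in it is an assumption made for illustration: the five-number feature vector, the exemplar values (invented, not taken from the annotated dataset), and the nearest-neighbour rule, which stands in for whatever context model the method actually learns.

```python
import math
from typing import Dict, Tuple

# Hypothetical features, all in metres:
# (width, depth, height, height of box bottom above floor, distance to nearest wall)
Features = Tuple[float, float, float, float, float]

# Invented exemplars for illustration only; NOT values from the paper's dataset.
EXEMPLARS: Dict[str, Features] = {
    "bed":        (1.6, 2.0, 0.5, 0.0, 0.0),
    "nightstand": (0.5, 0.4, 0.5, 0.0, 0.0),
    "painting":   (0.8, 0.05, 0.6, 1.4, 0.0),
    "chair":      (0.5, 0.5, 0.9, 0.0, 0.8),
}


def categorize(box: Features) -> str:
    # Nearest-neighbour rule: return the label whose exemplar is closest
    # to the query box in Euclidean distance over the geometric features.
    return min(EXEMPLARS, key=lambda label: math.dist(EXEMPLARS[label], box))


if __name__ == "__main__":
    # A large, low box touching a wall is closest to the "bed" exemplar.
    print(categorize((1.5, 1.9, 0.55, 0.0, 0.1)))
```

No pixel values enter the decision; the label depends only on where the box sits in the room and how large it is.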

[Figure 2 legend: room, painting, nightstand, door, mirror, cabinet, tv, desk, window, bed, chair, wardrobe, sofa, tv stand, coffee table, dining table, end table.]

Figure 2. Example results. The first column is the input panorama and the output object detection results. The second column shows generated cuboid hypotheses. The third column is the results visualized in 3D.
