Reconstructing PASCAL VOC

Sara Vicente*† (University College London), João Carreira*‡ (ISR-Coimbra),
Lourdes Agapito (University College London), Jorge Batista (ISR-Coimbra)

{s.vicente,l.agapito}@cs.ucl.ac.uk   {joaoluis,batista}@isr.uc.pt

1. Introduction

We consider the problem of populating object category detection datasets with dense 3D reconstructions of all object instances. If solved, this would further enable geometry-oriented approaches to object recognition and scene understanding that have recently seen renewed interest. These approaches hold the promise of achieving joint recognition and reconstruction, but still suffer from the problem that led to their demise in the early 90s [5]: 3D model acquisition is expensive, as it requires manual design or 3D sensors. The aim of our work is to alleviate this problem.

Detection datasets comprise different object instances from the same class, which may have very diverse appearance, scale, image location, pose and articulation. Performing class-based reconstruction from only 2D data in such datasets is a challenging problem that has not been addressed before. We assume here that we can access each object’s class label, figure-ground segmentation and around 10 annotated class-specific keypoints of easily identifiable parts, such as “left mirror” for cars or “nose tip” for aeroplanes. This type of input is illustrated in Fig. 1 and is available [2] for popular datasets such as PASCAL VOC [1], which we used in our experiments.

The next section describes how we attack this problem and the last section shows experimental results. Our full source code and data are available online¹.

2. Proposed approach

Our 3D reconstruction approach assumes that at least some instances of the same class have a similar 3D shape, so that standard multiview geometry applies as if they were the same object – we call such instances surrogates. We propose a feedforward strategy with two phases: first, orthographic cameras for all objects in a class are estimated using both keypoint and silhouette information; then a sampling-based approach employing a novel variation of visual hull reconstruction is used to produce dense per-object 3D reconstructions.

* First two authors contributed equally.
† Now with Anthropics Technology.
‡ Now with the EECS department at UC Berkeley.

¹ http://www.isr.uc.pt/~joaoluis/carvi

Figure 1. Example inputs of our algorithm, here dog images from the PASCAL VOC dataset and their associated figure-ground segmentations and keypoints.

2.1. Camera viewpoint estimation and refinement

The first step of our algorithm is to estimate the camera viewpoint for each of the instances using a factorization-based rigid structure from motion (SFM) algorithm [4].

The SFM algorithm finds a rigid 3D shape, common to all object instances, that can be seen as a rough “mean shape” for the class. This 3D shape is a reconstruction of the keypoints annotated in the images and is thus very sparse. The SFM algorithm also provides an estimate of the viewpoint for each specific instance, but this estimate is based only on the keypoints visible in that instance. We employ an additional refinement step that forces all keypoints – even those which are not visible and hence not annotated – to reproject inside the silhouette.
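As a rough illustration of this step (not the algorithm of [4], which additionally handles missing keypoints and degenerate sequences), a Tomasi-Kanade style rank-3 factorization of the keypoint matrix could be sketched as follows. All names are illustrative, the metric upgrade is omitted, and every keypoint is assumed to be observed in every image.

    import numpy as np

    def factorization_sfm(keypoints):
        """Sketch of orthographic factorization SFM.

        keypoints: (n_instances, n_keypoints, 2) image coordinates of the
        class-specific keypoints, assumed fully observed.
        Returns per-instance 2x3 affine cameras, 2D translations and a
        3 x n_keypoints rigid "mean shape" for the class.
        """
        n, k, _ = keypoints.shape
        # Stack x- and y-rows into a 2n x k measurement matrix and remove
        # each image's centroid (the orthographic translation).
        translations = keypoints.mean(axis=1)
        W = keypoints.transpose(0, 2, 1).reshape(2 * n, k)
        W = W - W.mean(axis=1, keepdims=True)

        # Under orthography a rigid shape makes W rank 3: factorize by SVD.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :3] * np.sqrt(s[:3])            # stacked camera rows
        S = np.sqrt(s[:3])[:, None] * Vt[:3]     # sparse mean shape

        # Rows 2i and 2i+1 of M form instance i's camera. A metric upgrade
        # (making each row pair orthonormal) and the silhouette-based
        # viewpoint refinement described above are omitted in this sketch.
        cameras = M.reshape(n, 2, 3)
        return cameras, translations, S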

2.2. Visual hull based object reconstruction

After jointly estimating camera viewpoints for all instances in each class, we reconstruct the 3D shape of each object using shape information borrowed from other exemplars in the same class.

In a dataset as diverse as VOC, it is reasonable to assume that for every instance there are at least a few shape surrogates. If these surrogate instances were identified, an approximate reconstruction based on the visual hull of the object could be recovered, but this identification is usually hard when viewpoints are far apart. We aim to overcome this problem by using a sampling strategy: we repeatedly sample groups of three views, the reference image plus two potential surrogates.
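As a toy sketch of this sampling idea, one could bias surrogate choice toward viewpoints far from the reference, assuming the per-instance orthographic cameras have been upgraded to have orthonormal rows; the criterion we actually use, based on principal directions, is described below. All names here are illustrative.

    import numpy as np

    def viewing_direction(camera):
        """Optical axis implied by a metric 2x3 orthographic camera
        (cross product of its two orthonormal rows)."""
        r1, r2 = camera
        d = np.cross(r1, r2)
        return d / np.linalg.norm(d)

    def sample_triplets(cameras, ref_idx, n_samples, seed=None):
        """Draw (reference, surrogate, surrogate) triplets, favouring
        surrogates whose viewing direction differs from the reference."""
        rng = np.random.default_rng(seed)
        dirs = np.array([viewing_direction(c) for c in cameras])
        # Score each candidate by angular distance to the reference view
        # (the absolute value treats mirrored viewpoints as equally
        # useful, since flipped silhouettes can also be intersected).
        score = 1.0 - np.abs(dirs @ dirs[ref_idx])
        score[ref_idx] = 0.0                  # never pick the reference
        prob = score / score.sum()
        triplets = []
        for _ in range(n_samples):
            a, b = rng.choice(len(cameras), size=2, replace=False, p=prob)
            triplets.append((ref_idx, int(a), int(b)))
        return triplets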

Our reconstruction algorithm has the following steps: (1) sampling of shape surrogates, (2) visual hull based reconstruction, and (3) selection of the best reconstruction from the pool.

Figure 2. Illustration of the imprinted visual hull reconstruction method, for two different triplets corresponding to the same reference instance (in black). The reconstructions are obtained by intersecting the three instances shown and their left-right flipped versions.

Sampling shape surrogates. We choose surrogates with a viewpoint distinct from the reference image, so as to maximise the number of background voxels carved away. We also restrict surrogate sampling to the subset of objects pictured near three informative orthogonal viewpoints, determined using PCA on the SFM keypoints – the principal directions.

Imprinted visual hull reconstruction. A clear inefficiency of using the standard visual hull algorithm in our setting is that there is no guarantee that the visual hull is silhouette-consistent with the reference instance, i.e. that for every foreground pixel in the mask there will be an active voxel reprojecting onto it. This happens because the algorithm trusts all silhouettes equally. We use a variation of the original formulation that does not have this problem, which we denote imprinted visual hull reconstruction. The new formulation finds a reconstruction that respects exactly the silhouette of the reference image.

Reconstruction selection. Once all reconstruction proposals have been computed from the different sampled triplets, the final step is to choose the best reconstruction for the reference instance. We use a simple selection criterion that reflects a natural assumption: reconstructions should be similar to the average shape of their object class. This is operationalized by comparing the projections of a reconstruction onto the principal directions with the average figure-ground segmentations of objects pictured in those viewpoints.
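A minimal sketch of the underlying carving computation is given below. The imprinting is approximated crudely here by restoring carved voxels along the rays of any reference foreground pixels left uncovered, and the left-right flipped silhouettes used in our actual formulation are omitted; the projection helper and all names are assumptions for illustration only.

    import numpy as np

    def project(camera, points):
        """Orthographic projection of 3D points (N x 3) with a 2x3 matrix
        R and 2D translation t, returning integer pixel coordinates."""
        R, t = camera
        return np.round(points @ R.T + t).astype(int)

    def inside(mask, px):
        """True for projected pixels that land on foreground in mask."""
        h, w = mask.shape
        ok = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        out = np.zeros(len(px), dtype=bool)
        out[ok] = mask[px[ok, 1], px[ok, 0]]
        return out

    def imprinted_visual_hull(voxels, cameras, masks, ref=0):
        """voxels: N x 3 grid centres; cameras: (R, t) pairs; masks:
        boolean silhouettes. Returns a boolean occupancy vector."""
        # Standard visual hull: keep only voxels that project to
        # foreground in every silhouette of the sampled triplet.
        occ = np.ones(len(voxels), dtype=bool)
        for cam, mask in zip(cameras, masks):
            occ &= inside(mask, project(cam, voxels))

        # Crude "imprinting": find reference foreground pixels with no
        # active voxel reprojecting onto them, and restore the carved
        # voxels along their viewing rays so the reference silhouette
        # is respected exactly.
        ref_px = project(cameras[ref], voxels)
        covered = np.zeros_like(masks[ref], dtype=bool)
        kept = ref_px[occ]
        on = inside(masks[ref], kept)
        covered[kept[on, 1], kept[on, 0]] = True
        uncovered = masks[ref] & ~covered
        occ |= inside(uncovered, ref_px)
        return occ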

3. Experiments

We consider the subset of 9,087 fully visible objects in 5,363 images from the 20,775 objects and 10,803 fully annotated images available in the PASCAL VOC 2012 training data. VOC has 20 classes, including highly articulated ones (dogs, cats, people), vehicles (cars, trains, bicycles) and indoor objects (dining tables, potted plants) in realistic images drawn from Flickr. We reconstructed all the objects for all the classes and show example outputs in Fig. 3.

Figure 3. Examples of our reconstructions for some of the PASCAL VOC categories. For each object we show the original image, the original image with the reconstruction overlaid, and two other viewpoints of our reconstruction. Blue is closer to the camera, red is farther (best seen in color).

Synthetic test data. We also performed a quantitative evaluation on synthetic test images with segmentations and keypoints similar to those in VOC, and found that the average root mean squared error across all classes was 6.96. The classes with the lowest error were car (3.04), aeroplane (3.58) and motorbike (4.08), while classes such as trains and animals did worse, which agrees with our qualitative perception of results on the VOC dataset.

References

[1] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.

[2] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.

[3] D. Hoiem and S. Savarese. Representations and techniques for 3D object recognition and scene interpretation, volume 15. Morgan & Claypool Publishers, 2011.

[4] M. Marques and J. P. Costeira. Estimating 3D shape from degenerate sequences with missing data. CVIU, 2008.

[5] J. L. Mundy. Object recognition in the geometric era: A retrospective. In Toward Category-Level Object Recognition, 2006.