Reconstructing PASCAL VOC

Sara Vicente*† (University College London), João Carreira*‡ (ISR-Coimbra),
Lourdes Agapito (University College London), Jorge Batista (ISR-Coimbra)

{s.vicente,l.agapito}@cs.ucl.ac.uk   {joaoluis,batista}@isr.uc.pt

1. Introduction

We consider the problem of populating object category detection datasets with dense 3D reconstructions of all object instances. If solved, this would further enable geometry-oriented approaches to object recognition and scene understanding that have recently seen renewed interest. These approaches hold the promise of achieving joint recognition and reconstruction, but still suffer from the problem that led to their demise in the early 90s [5]: 3D model acquisition is expensive, as it requires manual design or 3D sensors. The aim of our work is to alleviate this problem.

Detection datasets comprise different object instances from the same class, which may have very diverse appearance, scale, image location, pose and articulation. Performing class-based reconstruction from only 2D data in such datasets is a challenging problem that has not been addressed before. We assume here that we can access each object’s class label, figure-ground segmentation and around 10 annotated class-specific keypoints of easily identifiable parts, such as “left mirror” for cars or “nose tip” for aeroplanes. This type of input is illustrated in Fig. 1 and is available [2] for popular datasets such as PASCAL VOC [1], which we used in our experiments.

The next section describes how we attack this problem and the last section shows experimental results. Our full source code and data are available online¹.

2. Proposed approach

Our 3D reconstruction approach assumes that at least some instances of the same class have a similar 3D shape, so that standard multiview geometry applies as if they were the same object – we call such instances surrogates. We propose a feedforward strategy with two phases: first, orthographic cameras for all objects in a class are estimated using both keypoint and silhouette information; then a sampling-based approach employing a novel variation of visual hull reconstruction is used to produce dense per-object 3D reconstructions.

* First two authors contributed equally.
† Now with Anthropics Technology.
‡ Now with the EECS department at UC Berkeley.

¹ http://www.isr.uc.pt/~joaoluis/carvi

Figure 1. Example inputs of our algorithm, here dog images from the PASCAL VOC dataset and their associated figure-ground segmentations and keypoints.

2.1. Camera viewpoint estimation and refinement

The first step of our algorithm is to estimate the camera viewpoint for each of the instances using a factorization-based rigid structure from motion (SFM) algorithm [4].

The SFM algorithm finds a rigid 3D shape, common to all object instances, that can be seen as a rough “mean shape” for the class. This 3D shape is a reconstruction of the keypoints annotated in the images and is thus very sparse. The SFM algorithm also provides an estimate of the viewpoint for each specific instance, but this estimate is based only on the keypoints visible in that instance. We employ an additional refinement step that forces all keypoints – even those which are not visible and hence not annotated – to reproject inside the silhouette.
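As a rough illustration of this step (not the algorithm of [4], which additionally handles missing keypoints and degenerate sequences), a Tomasi-Kanade style rank-3 factorization of the keypoint matrix could be sketched as follows. All names are illustrative, the metric upgrade is omitted, and every keypoint is assumed to be observed in every image.

    import numpy as np

    def factorization_sfm(keypoints):
        """Sketch of orthographic factorization SFM.

        keypoints: (n_instances, n_keypoints, 2) image coordinates of the
        class-specific keypoints, assumed fully observed.
        Returns per-instance 2x3 affine cameras, 2D translations and a
        3 x n_keypoints rigid "mean shape" for the class.
        """
        n, k, _ = keypoints.shape
        # Stack x- and y-rows into a 2n x k measurement matrix and remove
        # each image's centroid (the orthographic translation).
        translations = keypoints.mean(axis=1)
        W = keypoints.transpose(0, 2, 1).reshape(2 * n, k)
        W = W - W.mean(axis=1, keepdims=True)

        # Under orthography a rigid shape makes W rank 3: factorize by SVD.
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        M = U[:, :3] * np.sqrt(s[:3])            # stacked camera rows
        S = np.sqrt(s[:3])[:, None] * Vt[:3]     # sparse mean shape

        # Rows 2i and 2i+1 of M form instance i's camera. A metric upgrade
        # (making each row pair orthonormal) and the silhouette-based
        # viewpoint refinement described above are omitted in this sketch.
        cameras = M.reshape(n, 2, 3)
        return cameras, translations, S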

2.2. Visual hull based object reconstruction

After jointly estimating camera viewpoints for all instances in each class, we reconstruct the 3D shape of each object using shape information borrowed from other exemplars in the same class.

In a dataset as diverse as VOC, it is reasonable to assume that for every instance there are at least a few shape surrogates. If these surrogate instances were identified, an approximate reconstruction based on the visual hull of the object could be recovered, but this identification is usually hard when viewpoints are far apart. We aim to overcome this problem by using a sampling strategy: we repeatedly sample groups of three views, the reference image plus two potential surrogates.
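As a toy sketch of this sampling idea, one could bias surrogate choice toward viewpoints far from the reference, assuming the per-instance orthographic cameras have been upgraded to have orthonormal rows; the criterion we actually use, based on principal directions, is described below. All names here are illustrative.

    import numpy as np

    def viewing_direction(camera):
        """Optical axis implied by a metric 2x3 orthographic camera
        (cross product of its two orthonormal rows)."""
        r1, r2 = camera
        d = np.cross(r1, r2)
        return d / np.linalg.norm(d)

    def sample_triplets(cameras, ref_idx, n_samples, seed=None):
        """Draw (reference, surrogate, surrogate) triplets, favouring
        surrogates whose viewing direction differs from the reference."""
        rng = np.random.default_rng(seed)
        dirs = np.array([viewing_direction(c) for c in cameras])
        # Score each candidate by angular distance to the reference view
        # (the absolute value treats mirrored viewpoints as equally
        # useful, since flipped silhouettes can also be intersected).
        score = 1.0 - np.abs(dirs @ dirs[ref_idx])
        score[ref_idx] = 0.0                  # never pick the reference
        prob = score / score.sum()
        triplets = []
        for _ in range(n_samples):
            a, b = rng.choice(len(cameras), size=2, replace=False, p=prob)
            triplets.append((ref_idx, int(a), int(b)))
        return triplets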

Our reconstruction algorithm has the following steps: (1) sampling of shape surrogates, (2) visual hull based reconstruction, and (3) selection of the best reconstruction from the pool.

Figure 2. Illustration of the imprinted visual hull reconstruction method, for two different triplets corresponding to the same reference instance (in black). The reconstructions are obtained by intersecting the three instances shown and their left-right flipped versions.

Sampling shape surrogates. We choose surrogates with a viewpoint distinct from the reference image, so as to maximise the number of background voxels carved away. We also restrict surrogate sampling to the subset of objects pictured near three informative orthogonal viewpoints, determined using PCA on the SFM keypoints – the principal directions.

Imprinted visual hull reconstruction. A clear inefficiency of using the standard visual hull algorithm in our setting is that there is no guarantee that the visual hull is silhouette-consistent with the reference instance, i.e. that for every foreground pixel in the mask there will be an active voxel reprojecting onto it. This happens because the algorithm trusts all silhouettes equally. We use a variation of the original formulation that does not have this problem, which we denote imprinted visual hull reconstruction. The new formulation finds a reconstruction that respects exactly the silhouette of the reference image.

Reconstruction selection. Once all reconstruction proposals have been computed from the different sampled triplets, the final step is to choose the best reconstruction for the reference instance. We use a simple selection criterion that reflects a natural assumption: reconstructions should be similar to the average shape of their object class. This is operationalized by comparing the projections of a reconstruction onto the principal directions with the average figure-ground segmentations of objects pictured in those viewpoints.
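A minimal sketch of the underlying carving computation is given below. The imprinting is approximated crudely here by restoring carved voxels along the rays of any reference foreground pixels left uncovered, and the left-right flipped silhouettes used in our actual formulation are omitted; the projection helper and all names are assumptions for illustration only.

    import numpy as np

    def project(camera, points):
        """Orthographic projection of 3D points (N x 3) with a 2x3 matrix
        R and 2D translation t, returning integer pixel coordinates."""
        R, t = camera
        return np.round(points @ R.T + t).astype(int)

    def inside(mask, px):
        """True for projected pixels that land on foreground in mask."""
        h, w = mask.shape
        ok = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        out = np.zeros(len(px), dtype=bool)
        out[ok] = mask[px[ok, 1], px[ok, 0]]
        return out

    def imprinted_visual_hull(voxels, cameras, masks, ref=0):
        """voxels: N x 3 grid centres; cameras: (R, t) pairs; masks:
        boolean silhouettes. Returns a boolean occupancy vector."""
        # Standard visual hull: keep only voxels that project to
        # foreground in every silhouette of the sampled triplet.
        occ = np.ones(len(voxels), dtype=bool)
        for cam, mask in zip(cameras, masks):
            occ &= inside(mask, project(cam, voxels))

        # Crude "imprinting": find reference foreground pixels with no
        # active voxel reprojecting onto them, and restore the carved
        # voxels along their viewing rays so the reference silhouette
        # is respected exactly.
        ref_px = project(cameras[ref], voxels)
        covered = np.zeros_like(masks[ref], dtype=bool)
        kept = ref_px[occ]
        on = inside(masks[ref], kept)
        covered[kept[on, 1], kept[on, 0]] = True
        uncovered = masks[ref] & ~covered
        occ |= inside(uncovered, ref_px)
        return occ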

3. Experiments

We consider the subset of 9,087 fully visible objects in 5,363 images from the 20,775 objects and 10,803 fully annotated images available in the PASCAL VOC 2012 training data. VOC has 20 classes, including highly articulated ones (dogs, cats, people), vehicles (cars, trains, bicycles) and indoor objects (dining tables, potted plants) in realistic images drawn from Flickr. We reconstructed all the objects for all the classes and show example outputs in Fig. 3.

Figure 3. Examples of our reconstructions for some of the PASCAL VOC categories. For each object we show the original image, the original image with the reconstruction overlaid, and two other viewpoints of our reconstruction. Blue is closer to the camera, red is farther (best seen in color).

Synthetic test data. We also performed a quantitative evaluation on synthetic test images with segmentations and keypoints similar to those in VOC, and found that the average root mean squared error across all classes was 6.96. The classes with the lowest error were car (3.04), aeroplane (3.58) and motorbike (4.08), while classes such as trains and animals did worse, which agrees with our qualitative perception of results on the VOC dataset.

References

[1] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.

[2] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.

[3] D. Hoiem and S. Savarese. Representations and techniques for 3D object recognition and scene interpretation, volume 15. Morgan & Claypool Publishers, 2011.

[4] M. Marques and J. P. Costeira. Estimating 3D shape from degenerate sequences with missing data. CVIU, 2008.

[5] J. L. Mundy. Object recognition in the geometric era: A retrospective. In Toward Category-Level Object Recognition, 2006.