

Source: sunw.csail.mit.edu/2014/papers2/53_Hejrati_SUNw.pdf

Analysis by Synthesis: 3D Object Recognition by Object Reconstruction

Mohsen Hejrati, University of California, Irvine
[email protected]

Deva Ramanan, University of California, Irvine
[email protected]

Abstract

We introduce a new approach for recognizing and reconstructing 3D objects in images. Our approach is based on an analysis by synthesis strategy. A forward synthesis model constructs possible geometric interpretations of the world, and then selects the interpretation that best agrees with the measured visual evidence. The forward model synthesizes visual templates defined on invariant (HOG) features. These visual templates are discriminatively trained to be accurate for inverse estimation. We introduce an efficient “brute-force” approach to inference that searches through a large number of candidate reconstructions, returning the optimal one. One benefit of such an approach is that recognition is inherently (re)constructive. We show state-of-the-art performance for detection and reconstruction on two challenging 3D object recognition datasets of cars and cuboids.

1. Problem and Proposed Method

We focus on the task of recognizing and reconstructing objects in images. Specifically, we describe a single model that simultaneously detects instances of general object categories, and reports a detailed 3D reconstruction of each instance. Our approach is based on an analysis by synthesis strategy. A forward synthesis model constructs possible geometric interpretations of the world, and then selects the interpretation that best agrees with the measured visual evidence. One benefit of such an approach is that recognition is inherently (re)constructive.

Challenges: Though attractive, an “inverse rendering” approach to computer vision is wildly challenging for two primary reasons. (1) It is difficult to build accurate generative models that capture the full complexity of the visual world. (2) Even given such a model, inverting it is difficult because the problem is fundamentally ill-posed (different reconstructions may generate similar images) and full of local minima (implying local search will fail).

Our approach: Our approach addresses both difficulties. (1) Instead of generating pixel values, we use forward models to synthesize visual templates defined on invariant (HOG) features. These visual templates are discriminatively trained to be accurate for inverse estimation. (2) We describe a “brute-force” approach to inference that efficiently searches through a large number of candidate reconstructions, returning the optimal one (or multiple likely candidates, if desired).

Figure 1. Recognition + reconstructions from our method. The left column shows the test image with the recognized + reconstructed object overlaid on it. The right column illustrates the associated template that triggered the detection. Our method can recognize objects from various viewpoints and shapes, and is robust to heavy occlusion. Because every synthesized template has a 3D shape, recognition is inherently reconstructive.
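The “brute-force” step above amounts to scoring every synthesized template against the image's feature map and keeping the best match. A minimal sketch of that idea (our own simplification, not the paper's implementation: a single image location and fixed-size templates, with a plain dot product standing in for the full sliding-window search over locations and scales):

```python
import numpy as np

def best_reconstruction(feature_map, templates):
    """Brute-force inference sketch: score every synthesized template
    against an image feature map (dot product of HOG-like features)
    and return the index of the best-scoring candidate reconstruction.

    feature_map: (H, W, C) image features
    templates:   (K, H, W, C) synthesized templates, all the same size
                 here -- a simplification for illustration."""
    # Sum template * feature products over the spatial and channel axes,
    # leaving one score per template.
    scores = np.tensordot(templates, feature_map, axes=([1, 2, 3], [0, 1, 2]))
    return int(np.argmax(scores)), scores
```

Because each template carries its own 3D shape, the returned index directly names a reconstruction, which is the sense in which recognition is “(re)constructive.”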

2. Results

Datasets: We evaluate our model using two object detection datasets. The SUN primitive dataset [4] contains 785 images with 1269 annotated cuboids. The UCI-Car dataset [2] contains 500 images from PASCAL VOC2011 containing 723 cars with detailed landmark annotations.

Synthesis strategies: We explored numerous strategies for constructing a set of 3D shapes. First, our Exemplar model uses the shapes encountered in the training set of annotated images, augmented with synthetic camera translations.




…synthesize templates that are detailed enough to perform 3D reconstruction, while this may be difficult for view-based approaches (since many views may be needed).

Geometric indexing: Historically, many model-based recognition systems proceeded by aligning 3D models to image data. Typically, a sparse set of local features is initially detected, after which alignment is treated as a feature correspondence problem. Efficient correspondence search is implemented through affine/projective invariant indexing [6], geometric hashing [12], or simple enumeration [7]. Hough transforms use sparse feature detections to vote in a shape parameter space [2], resulting in 2D implicit shape models [1, 18]. Our part indexing scheme is similar in spirit, except that we use a dense set of part responses to cast votes for a discrete set of candidate 3D reconstructions.

Model synthesis: Synthetic parametric models of object categories are an active area of research in the graphics community. Approaches include procedural grammars [14], morphable basis shapes [3], and component/part-based models [11]. We make use of morphable models to represent categorical shape variation, as in [9]. Following past work, we show that such basis models can be learned from 2D annotations using techniques adapted from non-rigid structure from motion (SFM) [20, 19]. Our models differ from past work in that we use a geometric model to synthesize a large set of exemplars, which are then used for “brute-force” matching. From this perspective, our approach is similar to work that relies on a non-parametric model of object shape [?].

3. Synthesis model

We begin by defining a parametric model for constructing 3D shapes from a particular object category (such as cars). We will then sample from this parametric family to define a large set of candidate 3D reconstructions. Our 3D shape model should capture nonrigid shape variation within an object category (sedans and SUVs look different) and viewpoint variation. To do so, we make use of morphable basis models [3, 9] that model any 3D shape instance as a linear combination of 3D basis shapes. Our shape model should also encode changes in appearance due to geometric variation (wheels look different when foreshortened). To do so, we learn separate local templates for landmarks conditioned on their 3D geometry (learning separate templates for frontal vs. foreshortened wheels).

Figure 2. Following [9], we use basis shapes to model different types of cars, like sedans in the first column and SUVs in the fourth.

Shape parameters: We represent the 3D shape of an object with a set of $N$ 3D keypoints, represented as $B \in \mathbb{R}^{3 \times N}$. Given a set of $n_B$ basis shapes $\{B_j\}$ and coefficients $\alpha$, we synthesize a 3D shape $B$ as follows:

$$B = B_0 + \sum_{j=1}^{n_B} \alpha_j B_j, \quad \text{where } B, B_j \in \mathbb{R}^{3 \times N} \tag{1}$$

where $B_0$ is the mean 3D shape. We visualize our shape basis in Fig. 2.
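Eq. (1) is just a weighted sum of arrays, and can be sketched in a few lines of NumPy (the array sizes and toy values below are our own illustration, not the paper's data):

```python
import numpy as np

def synthesize_shape(B0, basis, alpha):
    """Morphable shape model of Eq. (1): B = B0 + sum_j alpha_j * B_j.

    B0:    (3, N) mean 3D shape
    basis: (n_B, 3, N) stack of basis shapes {B_j}
    alpha: (n_B,) shape coefficients
    Returns the synthesized (3, N) shape B."""
    # Contract alpha against the first (basis) axis of the stack.
    return B0 + np.tensordot(alpha, basis, axes=1)

# Toy example: N = 4 keypoints, n_B = 2 basis shapes.
B0 = np.zeros((3, 4))
basis = np.stack([np.ones((3, 4)), 2 * np.ones((3, 4))])
B = synthesize_shape(B0, basis, np.array([0.5, 0.25]))  # all entries 1.0
```

Sweeping `alpha` over a range of coefficient vectors yields the family of candidate 3D shapes used later for synthesis.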

Camera parameters: We transform $B$ into camera coordinates by rotating by $R$ and translating by $t$:

$$P = t + RB, \quad t \in \mathbb{R}^3, \; R \in \mathbb{R}^{3 \times 3} \tag{2}$$

When augmented with camera intrinsic parameters (the focal length), the set of camera parameters is $(t, R, f)$. We now summarize our parameters with a vector $\theta$:

$$\theta = \begin{bmatrix} \text{Shape} & \text{Camera} \end{bmatrix} \tag{3}$$
$$\text{Shape} = \begin{bmatrix} \alpha_1 & \alpha_2 & \ldots & \alpha_k \end{bmatrix}, \quad \text{Camera} = \begin{bmatrix} t & R & f \end{bmatrix}$$

Forward projection: Given a parameter vector, we generate a set of 2D keypoints, scales, and local mixtures with the following:

$$\text{Render}(\theta) = \{(z_i, m_i) : i = 1 \ldots N\} \tag{4}$$

$$z_i = (x_i, y_i, \sigma_i) = \left( \frac{f\,p^i_x}{p^i_z}, \; \frac{f\,p^i_y}{p^i_z}, \; \frac{f}{p^i_z} \right) \tag{5}$$
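Putting Eqs. (1)–(5) together, the forward model is a short function: synthesize a shape, move it into camera coordinates, and perspectively project each keypoint. A hedged NumPy sketch (the local mixture labels $m_i$, which depend on the learned appearance model, are omitted; all values below are illustrative):

```python
import numpy as np

def render(B0, basis, alpha, R, t, f):
    """Forward model of Eqs. (1)-(5): synthesize a 3D shape, transform
    it into camera coordinates (P = t + R B), and perspectively project
    each keypoint, with scale sigma_i = f / p_z^i.
    Returns an (N, 3) array of rows (x_i, y_i, sigma_i)."""
    B = B0 + np.tensordot(alpha, basis, axes=1)   # Eq. (1)
    P = t.reshape(3, 1) + R @ B                   # Eq. (2)
    x = f * P[0] / P[2]                           # Eq. (5)
    y = f * P[1] / P[2]
    sigma = f / P[2]
    return np.stack([x, y, sigma], axis=1)

# Toy usage: identity rotation, two keypoints pushed 4 units along z.
z = render(np.zeros((3, 2)), np.ones((1, 3, 2)), np.array([1.0]),
           np.eye(3), np.array([0.0, 0.0, 4.0]), f=2.0)
```

Each parameter vector $\theta = (\alpha, R, t, f)$ fed through this function yields one candidate 2D configuration, i.e., one synthesized template layout.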

[Figure 2: pipeline diagram with panels “Appearance Model”, “3D Synthesis Engine”, and “Parametric Templates”.]

Figure 2. We describe a method for synthesizing a large set of discriminative templates, each associated with a candidate 3D reconstruction of an object (in this case, cars). Our model makes use of a generative 3D shape model to synthesize a large collection of 2D landmarks, which in turn specify rules for composing 2D templates out of a common pool of parts.

[Figure 3: two line plots — “Box detection” (Average Precision vs. Time in seconds) and “Box landmark reconstruction” (Average Accuracy vs. Time in seconds) — comparing the Param Synth, Exemp, and Oracle Synth curves against DPM, DPM + Regress, Tree, and Xiao et al. baselines.]

Figure 3. Detection (left) and reconstruction accuracy (right) versus running time of our method and other baselines, including DPMs [1], supervised-tree models [2], and multi-view star models [2]. Points correspond to different (constant-time) baselines, while curves correspond to our models. Because our models can process a variable number of synthesized templates, we sweep over K ∈ {20, 50, 100, 500, 1000, 4000} templates to generate the curves. Our box detection and reconstruction results (43% and 48%) nearly double the best previously-reported performance from Xiao et al. [4] (24% and 38%), while being 10× faster.

Exemplar Synthesis augments this set with additional exemplar shapes. We implement this strategy by learning a model with a subset of training images, but using the larger (full) set of keypoint annotations. This mimics scenarios where we have access to a limited amount of image data, but a larger set of keypoint annotations. Parametric Synthesis constructs a shape set by discretely enumerating θ_i over bounded parameter ranges. Finally, Oracle Synthesis uses shapes extracted from annotated test data. We use this upper bound on performance (given the “perfect” synthesis strategy) for additional analysis.
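Parametric Synthesis, as described above, is a plain grid enumeration over θ. A minimal sketch (the particular parameter ranges and step sizes below are our own placeholders, not the paper's):

```python
import itertools
import numpy as np

# Discretely enumerate theta = (alpha, azimuth, focal length) over
# bounded ranges. Grids here are illustrative placeholders.
alphas = [np.array(a) for a in itertools.product([-1.0, 0.0, 1.0], repeat=2)]
azimuths = np.deg2rad(np.arange(0, 360, 45))   # 8 viewpoints
focals = [1.0, 2.0]                            # 2 focal lengths

# Every combination is one candidate 3D reconstruction / template.
candidates = [(alpha, az, f)
              for alpha in alphas
              for az in azimuths
              for f in focals]
# 9 coefficient vectors x 8 azimuths x 2 focal lengths = 144 candidates.
```

Each tuple in `candidates` would then be passed through the forward model to synthesize one template, which is how the number of templates K scales with the grid resolution.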

Benchmark results: Fig. 3 plots performance for box detection and localization. Exemplars almost double the best previously-reported numbers in [4], in terms of detection (43% vs. 24%) and landmark reconstruction (48% vs. 38%). Interestingly, the tree model of [2] outperforms [4], perhaps due to its modeling of local part mixtures. Our models even surpass [2], while directly reporting 3D reconstructions and while being 10× faster. Exemplar and Parametric Synthesis perform similarly for low numbers of templates, but Exemplars do better with more templates, particularly with respect to reconstruction accuracy. These results suggest that our parametric model is not capturing true shape statistics. For example, people may take pictures of certain objects from iconic viewpoints. Such dependencies are not modeled by Parametric Synthesis, but are captured by Exemplars and Exemplar Synthesis. We find similar results for car detection and reconstruction, but refer the reader to our full paper [3] for additional discussion.

Anytime recognition/reconstruction: Our models have a free parameter K, the number of enumerated shapes. Both performance and run-time computation increase with K. When comparing to baselines with fixed run-time costs, we plot performance as a function of run-time, measured in seconds per image. All methods are run on the same physical system (a 12-core Intel 3.5 GHz processor). Our plots reveal that a simple re-ordering of shapes in a coarse-to-fine fashion (with hierarchical clustering) can be used for any-time analysis. For example, after enumerating the first K = 50 coarse shapes, one can still obtain 37% box landmark reconstruction accuracy (which in turn improves as more shapes are enumerated).
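The coarse-to-fine re-ordering can be approximated in a few lines. The paper uses hierarchical clustering; the sketch below substitutes a greedy farthest-point ordering, a simpler stand-in with the same flavor (early shapes spread coarsely over the set, later shapes fill in fine detail):

```python
import numpy as np

def coarse_to_fine_order(shapes):
    """Greedy farthest-point ordering: each next shape is the one most
    distant from every shape already emitted. A simple stand-in for the
    hierarchical-clustering re-ordering described in the text.

    shapes: (K, D) array of flattened 3D shapes.
    Returns a list of indices; truncating it at any K gives an anytime
    operating point."""
    K = shapes.shape[0]
    order = [0]                                    # arbitrary seed
    # dist[i] = distance from shape i to the nearest chosen shape.
    dist = np.linalg.norm(shapes - shapes[0], axis=1)
    for _ in range(K - 1):
        nxt = int(np.argmax(dist))                 # farthest from chosen set
        order.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(shapes - shapes[nxt], axis=1))
    return order
```

Evaluating templates in this order means that stopping after the first K = 50 entries still covers the shape space coarsely, which is the behavior exploited in the anytime experiment above.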

References

[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE PAMI, 32(9), 2010.
[2] M. Hejrati and D. Ramanan. Analyzing 3D objects in cluttered images. In NIPS, 2012.
[3] M. Hejrati and D. Ramanan. Analysis by synthesis: 3D object recognition by object reconstruction. In CVPR, 2014.
[4] J. Xiao, B. Russell, and A. Torralba. Localizing 3D cuboids in single-view images. In NIPS, 2012.