
Page 1: Motion from Structure (MfS): Searching for 3D Objects in ...humansensing.cs.cmu.edu/sites/default/files/paper403.pdfSearching for 3D Objects in Cluttered Point Trajectories Jayakorn

Motion from Structure (MfS): Searching for 3D Objects in Cluttered Point Trajectories

Jayakorn Vongkulbhisal†,‡, Ricardo Cabral†,‡, Fernando De la Torre‡, João P. Costeira†
†ISR - Instituto Superior Técnico, Lisboa, Portugal
‡Carnegie Mellon University, Pittsburgh, PA, USA

[email protected], [email protected], [email protected], [email protected]

Abstract

Object detection has been a long standing problem in computer vision, and state-of-the-art approaches rely on the use of sophisticated features and/or classifiers. However, these learning-based approaches heavily depend on the quality and quantity of labeled data, and do not generalize well to extreme poses or textureless objects.

In this work, we explore the use of 3D shape models to detect objects in videos in an unsupervised manner. We call this problem Motion from Structure (MfS): given a set of point trajectories and a 3D model of the object of interest, find a subset of trajectories that correspond to the 3D model and estimate its alignment (i.e., compute the motion matrix). MfS is related to Structure from Motion (SfM) and motion segmentation problems: unlike SfM, the structure of the object is known but the correspondence between the trajectories and the object is unknown; unlike motion segmentation, the MfS problem incorporates 3D structure, providing robustness to tracking mismatches and outliers. Experiments illustrate how our MfS algorithm outperforms alternative approaches in both synthetic data and real videos extracted from YouTube.

1. Introduction

Object detection is one of the most common tasks that humans perform. We are constantly looking at and detecting people, roads, chairs, or automobiles. Yet, much of how we perceive objects so accurately and with so little apparent effort remains a mystery. Although much progress has been made in the last few years, we are still far from developing algorithms that can match human performance. Most previous efforts in the area of object detection have emphasized 2D approaches built on different types of features and classifiers. Features have included shape (e.g., contours) and/or appearance descriptors (e.g., SIFT, deep learning features). Classifiers have included Support Vector Machines (SVM),

Figure 1. Motion from Structure (MfS) problem: Given a set of point trajectories (a) and a 3D model of an object (b), find a subset of trajectories that are aligned with the model. Our approach splits this combinatorial problem into two parts. First, we reduce the potential candidate matching points by finding trajectories that compose the convex hull of objects (red squares in (c)). Second, a subset of the remaining trajectories is aligned with the 3D model (d).

Boosting or Gaussian Processes. While 2D approaches are fast and have been shown to scale well to a large number of objects, they typically depend heavily on texture regions, lack robustness to viewpoints and incorporate video information in a heuristic manner. Surprisingly, little attention has been paid to the problem of 3D object detection, especially since acquiring 3D models has become more accessible due to the ubiquity of RGBD cameras and crowd-sourced repositories such as 3D Warehouse [21, 29].

To address the limitations of 2D approaches for detection, this paper proposes a new problem that we call Motion from Structure (MfS). Given a set of point trajectories containing multiple objects and a 3D model of an object (i.e., a point cloud), MfS finds a subset of trajectories that are well-aligned (i.e., via a motion matrix) with the 3D model. Consider Fig. 1; Fig. 1a shows the point trajectories and Fig. 1b plots several 3D views of the object of interest to detect. The goal of MfS is to solve the alignment between the 3D model and its trajectories (see Fig. 1d). MfS differs from 2D approaches to object detection in several aspects. First, MfS provides a natural framework to incorporate geometry into object detection in video. It computes the motion using a 3D model, which makes it more robust to strong viewpoint changes. Second, the MfS problem is unsupervised, so there is no need for an expensive and error-prone labeling process. Finally, in contrast to image-based approaches which rely on similar appearance, MfS allows 3D models to be aligned to objects with the same shape but no similarity in texture.

The main challenge of MfS lies in the lack of discriminative features (i.e., it is an unsupervised method without any appearance features) to solve for the correspondence between trajectories and the queried model. Recall that most detection problems [7, 36] use supervised detectors for parts of the objects. Since the number of trajectories can be significantly larger than the number of vertices representing the object, the problem is NP-hard in general. To alleviate the combinatorial problem, we propose a two-step approach to MfS: (1) reducing the number of trajectories by considering only the tracked points on the convex hull of an object (see red squares in Fig. 1c), which provides a compact representation of objects in the scene; and (2) aligning the remaining tracks with the 3D model via a guided sampling approach (see Fig. 1d).

The MfS problem is related to two well-known problems in computer vision: Structure from Motion (SfM) and motion segmentation (MSeg). Similar to SfM, the MfS problem starts from point trajectories, but, unlike SfM, in MfS a 3D model of the object of interest is given. However, in the MfS problem the point trajectories that correspond to the 3D object are unknown. MfS finds not only the subset of trajectories corresponding to the object, but also its alignment (i.e., the motion matrix). MfS also relates to the MSeg problem, which groups trajectories corresponding to the same rigid object. MSeg may seem to be an intuitive approach to solving MfS: we could use MSeg methods to group trajectories into objects, fit the 3D model to each group, and select the one with the least error. However, each of these steps is prone to errors by itself, and as we will show in our experimental validation, MSeg-based algorithms may lead to worse solutions for MfS than our approach. Moreover, in MSeg the number of clusters needs to be known a priori, and MSeg methods are susceptible to outliers, which are typically present due to background motion, camera blur, and noise. On the other hand, our approach performs neither clustering nor 3D reconstruction.

2. Related Work

Our work has a direct connection to Structure from Motion (SfM) and Motion Segmentation (MSeg). In this section, we briefly review these two problems and other works that perform partial matching and pose estimation.

Structure from Motion (SfM) Let W ∈ R^{2F×n} (see notation¹) be a matrix containing n feature tracks of an object in F image frames. Tomasi and Kanade [25] showed that under an orthographic camera model W lies in a four-dimensional subspace, leading to the factorization

W = MS, (1)

where M ∈ R^{2F×4} is the F-frame motion matrix and S ∈ R^{4×n} contains the 3D homogeneous coordinates of the vertices. This factorization relates feature tracks in a sequence of 2D images to its 3D reconstruction. Since this factorization was first proposed, it has been extended to handle different camera models, articulated and non-rigid objects, outlier handling, missing data, and other feature types [1, 12, 13, 20, 28].
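As a concrete illustration, the rank-4 factorization in (1) can be computed with a truncated SVD. This is a minimal NumPy sketch, not the authors' implementation; the function name and toy sizes are ours.

```python
import numpy as np

def factorize_tracks(W, rank=4):
    """Return M, S with W ~ M @ S via a truncated SVD (Tomasi-Kanade style)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :rank] * s[:rank]   # absorb singular values into the motion factor
    S = Vt[:rank, :]
    return M, S

# Toy example: synthesize noiseless tracks from a random motion and shape.
rng = np.random.default_rng(0)
F, n = 5, 12
M_true = rng.standard_normal((2 * F, 4))
S_true = np.vstack([rng.standard_normal((3, n)), np.ones((1, n))])
W = M_true @ S_true              # rank-4 track matrix, shape (2F, n)
M, S = factorize_tracks(W)
print(np.allclose(M @ S, W))     # True: W is recovered exactly
```

Note that M and S are recovered only up to an invertible 4×4 ambiguity; metric upgrade methods resolve it in practice.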

Motion Segmentation (MSeg) Given a matrix W containing tracks of multiple objects, the goal of MSeg is to group the tracks of different objects. Since the tracks of each object lie in a low-dimensional subspace [25], subspace clustering techniques have been proposed as a solution to MSeg. Previous works in this direction include the use of the shape interaction matrix [4], generalized principal component analysis [30], and principal subspace angles [31], with extensions to handle articulated and non-rigid objects, and also segmentation of dense trajectories [33]. Recent approaches such as sparse subspace clustering [6] and low rank representation [15] impose sparsity or low rank priors to recover the affinity matrix in a convex formulation. However, MSeg only clusters tracks into groups with low-dimensional subspaces, and it is not clear how to incorporate 3D models of objects as additional information.

Track-based information processing Beyond reconstruction and MSeg, feature tracks have also been used for other higher-level computer vision tasks, such as foreground object segmentation [24], action recognition [17], object detection [35], and object recognition [19]. However, most of the approaches to detection and recognition still require classifiers to be learned from training data, which has to be manually labeled and can be laborious to obtain.

Correspondence and pose estimation Instead of using labeled data, 3D models provide compact representations of

¹Bold capital letters denote a matrix X, bold lower-case letters a column vector x. x_i represents the ith column of the matrix X. x_ij denotes the scalar in the ith row and jth column of the matrix X. All non-bold letters represent scalars. 1_{m×n}, 0_{m×n} ∈ R^{m×n} are matrices of ones and zeros. I_n ∈ R^{n×n} is an identity matrix. diag(X) and tr(X) denote the vector of diagonal entries and the trace of matrix X, respectively.


an object's shape. To incorporate 3D models for object detection or recognition, pose estimation must be performed to align the models to the images. In the case of single images, several works utilize both appearance features (e.g., SIFT) and 3D shape as input [3, 21, 22, 37]. These approaches require the correspondence between appearance features of 3D models and the images to be established prior to estimating pose. However, without appearance features, there are far fewer cues for obtaining the correspondence, and hence both correspondence and pose must be solved simultaneously [5, 18].

In the case of videos, motion can be used to provide information to solve for pose and correspondence. Toshev et al. [27] proposed to synthesize different views from 3D models and match their silhouettes to a moving object segmented from a video. On the other hand, we aim to directly align 3D point clouds to feature tracks from videos using the subspace relation in (1). When the number of tracks is the same as that of the model vertices, the correspondence can be obtained by permuting the nullspace of the shape matrix to match the track matrix [11, 16]. However, there is no obvious way to extend these approaches to deal with extra tracks from videos. To deal with such cases, [34] included texture features as input and sought the correspondence that minimizes the nuclear norm of the selected tracks and features.

Recently, Zhou and De la Torre [36] proposed a method to select a subset of trajectories that aligns body configurations with motion capture data. However, their selection of trajectories uses supervised information. Our work tackles MfS by reducing the candidates for matching to the convex hull of objects. Although the idea of using convex hulls has appeared in image registration [8, 32], their focus was on registering two sets of 2D point clouds, whereas we register the 3D structure by observing 2D tracks from videos.

Despite the name resemblance, we note that our problem is different from [14], which proposed a large-scale 3D reconstruction approach for rolling-shutter cameras, while our goal is to register 3D models onto 2D tracks.

3. The Motion from Structure Problem

The Motion from Structure (MfS) problem can be stated as follows. Given a measurement matrix

W = \begin{bmatrix} \mathbf{x}^1_1 & \mathbf{x}^1_2 & \cdots & \mathbf{x}^1_n \\ \mathbf{x}^2_1 & \mathbf{x}^2_2 & \cdots & \mathbf{x}^2_n \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}^F_1 & \mathbf{x}^F_2 & \cdots & \mathbf{x}^F_n \end{bmatrix} \in \mathbb{R}^{2F \times n},  (2)

where x_i^f ∈ R² is the 2D coordinate of the ith feature track in frame f, and a shape matrix S ∈ R^{4×m} containing m 3D vertices of an object of interest, find the subset of tracks in W that belongs to the object. MfS is different

from MSeg as it does not require tracks to be clustered into groups beforehand, but rather finds a partial permutation matrix P ∈ {0, 1}^{n×m} and a motion matrix M ∈ R^{2F×4} that align a given 3D object to a subset of tracks. Assuming no noise and no missing data, the relationship between W, P, and S is given by

WP = MS. (3)

Assuming that the motion is scaled orthographic, we can expand M as

M = \begin{bmatrix} \alpha_1 M^1 & \mathbf{b}^1 \\ \vdots & \vdots \\ \alpha_F M^F & \mathbf{b}^F \end{bmatrix},  (4)

where M^f ∈ R^{2×3}, b^f ∈ R², and α_f ∈ R are the rotation, translation, and scaling components in frame f.
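The block structure of (4) can be made concrete with a small sketch. The helper below assembles M from per-frame rotations, translations, and scales; the z-axis rotation and all numeric values are illustrative choices of ours, not the paper's.

```python
import numpy as np

def orthographic_rows(angle):
    """First two rows of a z-axis rotation matrix (a valid M^f)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0]])

def build_motion(angles, translations, scales):
    """Stack per-frame blocks [alpha_f * M^f | b^f] into M of shape (2F, 4)."""
    blocks = [np.hstack([alpha * orthographic_rows(a), b.reshape(2, 1)])
              for a, b, alpha in zip(angles, translations, scales)]
    return np.vstack(blocks)

scales = [0.5, 1.0, 2.0]
M = build_motion(np.linspace(0.0, 1.0, 3),        # 3 frames
                 [np.array([1.0, 2.0])] * 3,      # same translation each frame
                 scales)
print(M.shape)  # (6, 4)
```

Each 2×3 rotation block satisfies M^f M^{f⊤} = I₂ once the scale α_f is divided out, which is exactly the constraint used later in (6).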

3.1. Problem Analysis

In this section, we analyze the challenges of the MfS problem.

Complexity due to the large number of tracks: Objects in general videos typically contain complex texture, resulting in a large number of additional tracks (besides the tracks belonging to the 3D object of interest). Since we only use the shape of the object, the MfS problem is formulated as a combinatorial problem whose complexity increases with the number of trajectories.

Lack of appearance features: The lack of appearance features (e.g., SIFT) makes each track indistinguishable from the others, and prevents us from obtaining putative matches between them and the 3D model. This difference sets MfS apart from general pose estimation problems, where distinctive appearance features play a significant role in initializing the pose. Without these putative matches, pose estimation is in general an ill-posed problem, as it is not clear how to obtain the alignment.

Self-occlusion of 3D model: Generally, a 3D object in a real scene is subject to self-occlusion. This means that not all vertices of S in (3) should be matched to the tracks. Since spurious matches can easily lead to bad alignments, handling self-occlusion is an important issue that needs to be addressed.

Coherent motion trajectories: Although the tracks are indistinct, the coherent motion of tracks from the same object allows them to be clustered into groups (e.g., by MSeg). In addition, they also encode information about the shape of each object (e.g., as in (1)). These two properties provide enough structure for us to solve the MfS problem.

4. Solving MfS

In order to tackle MfS, we make two assumptions: (1) the target objects have motion independent from the other objects in the scene, and (2) the camera model is scaled orthographic. These assumptions approximate typical scenes well and are generally adopted in solving MSeg [6, 15]. However, MfS is different, as it does not require all tracks to be clustered into groups, but rather selects the tracks belonging to a given shape. To this end, we propose a two-step approach. In the first step, we solve a convex optimization problem to select a subset of special tracks constituting the convex hulls of moving objects. We call these tracks Support Tracks (STs). Since STs constitute the convex hulls of objects, they are more likely to be matched to the target shape. Selecting STs before matching reduces both the false matches and the complexity of the problem. In the second step, we align the 3D model to the STs using a guided sampling procedure, where the selected STs are more likely to come from the same object. The following sections provide details on the cost function and the algorithms.

4.1. What is a good alignment?

Defining a good metric for alignment is critical for MfS, especially since the vertices in the given 3D model may not exactly match the tracks of the target object. Rather than solving for both P and M in (3), we say that an alignment is good when there exists a matrix M such that the convex hulls² of the back-projection of S and all tracks that undergo the same motion³ are similar (see Fig. 2):

π_f(W_M) ∼ π_f(MS),  (5)

where π_f(·) returns the convex hull in frame f of the input track matrix, and W_M is a matrix containing the columns of W that lie in span(M). Specifically, we use the intersection-over-union (IoU) between π_f(W_M) and π_f(MS) to measure alignment. This leads to the following optimization problem:

\max_M \sum_{f=1}^F \frac{\mathrm{Area}(\pi_f(W_M) \cap \pi_f(MS))}{\mathrm{Area}(\mathrm{conv}(\pi_f(W_M) \cup \pi_f(MS)))},  (6)

subject to M^f M^{f\top} = I_2, \quad f = 1, \ldots, F,

where the relation between M^f and M is given in (4), conv(·) returns the convex hull of the input set, and Area(·) returns the area of the input convex hull. The cost function in (6) possesses many desirable properties, namely it (1) uses motion and shape to measure alignment, (2) is insensitive to self-occlusion of the 3D model since alignment is measured based on regions, and (3) takes into account the coherent motion of tracks. Note that P is not included in (6), but it can be recovered after obtaining M.

²In general, other types of regions can be used. However, since we are given point clouds without any relations between the vertices, the convex hull provides the most detailed description of the region.

³We say a group of tracks undergoes the same motion when there is a motion matrix M that can well explain them (see (1)).
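The per-frame score in (6) can be approximated numerically. The sketch below rasterizes the two convex hulls on a grid and takes the ratio of the intersection area to the area of the convex hull of the union; the grid resolution and the SciPy-based point-in-hull test are our choices, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import ConvexHull

def in_hull(points, hull, tol=1e-9):
    # Rows of hull.equations are [a, b, c]; interior points satisfy
    # a*x + b*y + c <= 0 for every facet.
    return np.all(points @ hull.equations[:, :2].T
                  + hull.equations[:, 2] <= tol, axis=1)

def hull_alignment_score(P, Q, res=200):
    """Area(conv(P) ∩ conv(Q)) / Area(conv(P ∪ Q)), on a res x res grid."""
    hp, hq = ConvexHull(P), ConvexHull(Q)
    allpts = np.vstack([P, Q])
    hu = ConvexHull(allpts)                       # convex hull of the union
    lo, hi = allpts.min(axis=0), allpts.max(axis=0)
    xs, ys = np.meshgrid(np.linspace(lo[0], hi[0], res),
                         np.linspace(lo[1], hi[1], res))
    grid = np.column_stack([xs.ravel(), ys.ravel()])
    inter = in_hull(grid, hp) & in_hull(grid, hq)
    union = in_hull(grid, hu)
    return inter.sum() / union.sum()

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(hull_alignment_score(square, square))        # identical hulls: ~1.0
print(hull_alignment_score(square, square + 2.0))  # disjoint hulls: 0.0
```

An exact polygon-clipping routine would avoid the grid approximation; the rasterized version is just easier to verify at a glance.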


Figure 2. Example of good and bad alignments in a frame f. The yellow dots are coherent tracks. The convex hulls shown are those of W_M (yellow), MS (green), their intersection (red), and union (blue). An alignment is good when the ratio between intersection and union is high. (Best viewed in color.)

Solving (6) requires estimating the motion M, which involves sampling 4 matches between tracks from W and vertices of S. However, the total number of candidate pairs is on the order of O(n⁴k⁴), while the number of correct pairs can be significantly smaller. In the next section, we propose an approach that uses convex optimization to identify special tracks constituting the visible part of objects' convex hulls. We refer to these tracks as support tracks (STs). Selecting STs as candidates for matching reduces the problem complexity, while also retaining useful tracks that are more likely to match a given shape.

4.2. Identifying support tracks (STs)

To identify STs in W, we exploit the idea that such tracks cannot be represented by convex combinations of tracks of the same object. Inspired by sparse subspace clustering (SSC) [6], we formulate the identification of STs and non-STs as

\min_{C,E} \; \mathrm{tr}(C) + \mu f(E)  (7)

subject to \; E = W - WC, \quad \mathbf{1}_n^\top C = \mathbf{1}_n^\top, \quad C \geq 0_{n \times n},

where C ∈ R^{n×n} is the matrix containing the coefficients for representing each track as a convex combination of the others, and E ∈ R^{2F×n} allows for errors. The error function f(·) can be the ℓ1 norm, ℓ2,1 norm, or squared Frobenius norm. The intuition of (7) is akin to the self-expressiveness property in [6], which states that each vector in a subspace can be represented as a linear combination of the others in the same subspace. By using convex combinations rather than linear ones, STs can be identified as the tracks that cannot be represented by other tracks in the same subspace. Fig. 3 illustrates the idea behind (7).
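The self-expressiveness test behind (7) can be illustrated on a toy 2D example: a point is a support track exactly when it cannot be written as a convex combination of the others with (near) zero error. The sketch below tests each point with a small ℓ1-error LP rather than solving the single joint program (7), so it is an illustration of the idea, not the paper's solver; all names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def convex_representation_error(X, i):
    """Minimum l1 error of writing column i of X as a convex
    combination of the remaining columns (a small LP)."""
    d, n = X.shape
    others = np.delete(X, i, axis=1)                       # d x (n-1)
    # Variables: [c (n-1), e_pos (d), e_neg (d)], all >= 0.
    cost = np.concatenate([np.zeros(n - 1), np.ones(2 * d)])
    A_eq = np.hstack([others, np.eye(d), -np.eye(d)])      # others@c + e+ - e- = x_i
    A_eq = np.vstack([A_eq, np.concatenate([np.ones(n - 1),
                                            np.zeros(2 * d)])])
    b_eq = np.concatenate([X[:, i], [1.0]])                # sum(c) = 1
    return linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun

# Square corners plus two interior points: only corners are "STs".
X = np.array([[0.0, 2.0, 2.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 2.0, 2.0, 1.0, 1.0]])
is_st = [convex_representation_error(X, i) > 1e-6 for i in range(6)]
print(is_st)  # [True, True, True, True, False, False]
```

The interior points have (numerically) zero representation error, while each corner lies outside the convex hull of the remaining points and so incurs a strictly positive error.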

There are three subtle but non-trivial differences between (7) and SSC, in both the tasks they try to accomplish and the



Figure 3. (Left) Since tracks (yellow circles) of independently moving objects lie in independent subspaces, any tracks that cannot be expressed as a convex combination of other tracks are considered STs (red squares) that constitute the convex hull of objects (see Fig. 1c). (Right) By solving (7), these STs induce 1's in the diagonal of C.

variables involved. First, SSC was proposed to perform subspace clustering, where spectral clustering is performed on C to obtain the clusters. On the contrary, our formulation focuses on both C and E, which encode the information for distinguishing STs from non-STs. Second, C is nonnegative and its columns sum to 1, which makes them convex combination coefficients instead of linear ones. As C is nonnegative, ‖C‖₁ in SSC's objective function becomes the constant value n, and thus can be removed. Lastly, we penalize tr(C) as opposed to restricting diag(C) = 0_n. We show in Theorem 1 that at the global optimum diag(C) evaluates to a binary vector and indicates whether a track is an ST or not.

Theorem 1. Let C* and E* be the optimal solutions of (7), and let Ē be the optimal solution for the variable E of (7) with diag(C) = 0_n as an additional constraint. The relation between c*_ii, µ, and f(ē_i) for any i can be stated as follows:

c^*_{ii} \in \begin{cases} \{0\} & f(\bar{\mathbf{e}}_i) < \mu^{-1}, \\ [0, 1] & f(\bar{\mathbf{e}}_i) = \mu^{-1}, \\ \{1\} & \text{otherwise}. \end{cases}  (8)

Proof. It can be seen that (7) separates column-wise. For each column i, the objective is a tradeoff between c_ii and µf(e_i). If µf(ē_i) > 1, then it costs less to set c*_ii = 1 with e*_i = 0_2F. On the other hand, if µf(ē_i) < 1, then setting c*_ii = 0 and e*_i = ē_i costs less. Finally, if µf(ē_i) = 1, then the cost can be kept constant at 1 by letting c*_ii take any value in [0, 1] while f(e*_i) = (1 − c*_ii) f(ē_i).

Corollary 2. For any µ, there exists an optimal solution of (7) where diag(C*) is a binary vector.

In Theorem 1, ē_i is the error of representing w_i with a convex combination of strictly columns other than w_i. We can see that µ acts as a threshold for determining the value of c*_ii. Specifically, if the error f(ē_i) is larger than µ⁻¹, it costs less to represent w_i by itself, resulting in c*_ii = 1. On the other hand, if the error is smaller than µ⁻¹, then it costs less for w_i to be represented by other columns, resulting in c*_ii = 0. For our problem, this implies that w_i with c*_ii = 1 is an ST, while w_i with c*_ii = 0 is not.

By noting that µ is simply a threshold value, we can solve (7) for all values of µ in a single optimization run. This is done by solving (7) with diag(C) = 0_n as an additional constraint to obtain Ē. Since c*_ii = 1 and e*_i = 0_2F for all i with f(ē_i) > µ⁻¹, determining the order in which tracks gradually become STs as µ increases is equivalent to sorting f(ē_i) in descending order. Thus, instead of setting the value of µ, we can specify the number of desired STs p by selecting the p tracks with the highest f(ē_i).

After obtaining the STs, we proceed to estimate the motion matrix. To do so, we need to sample 4 STs and match them to 4 vertices from S. Although the number of tracks is reduced from n to p, the total number of matches can still be very large. To increase the likelihood of selecting good matches, it is beneficial to sample 4 STs that are more likely to come from the same object instead of sampling uniformly.

4.3. Guided Sampling

We propose to sequentially sample STs using a random-walk strategy, where the transition probability of sampling ST i after ST j depends on how likely they are to come from the same object. In this work, we measure this value by representing each non-ST as a convex combination of only STs⁴, and counting the number of non-STs that each pair of STs mutually represents.

To represent all non-STs by only STs, we leverage the structure of C* obtained in the previous section; the constraints of (7) enforce C* to be a stochastic matrix with p ones on its diagonal. In essence, C* is the transition matrix of an absorbing Markov chain with p absorbing nodes corresponding to the p STs. From Markov chain theory [9], the limiting matrix of C*,

\bar{C} = \lim_{t \to \infty} (C^*)^t,  (9)

can be interpreted as the convex coefficients representing non-STs by only STs. Let d_ik = I(c̄_{i′k} > 0), where i′ is the index of ST i in the original W, and I(·) is the indicator function, which returns 1 if the argument is true and 0 otherwise. We calculate the transition probability of sampling ST j after ST i using the Jaccard similarity:

\Pr(i \to j) = \begin{cases} \frac{1}{z_i} \frac{\sum_{k=1}^n \min(d_{ik}, d_{jk})}{\sum_{k=1}^n \max(d_{ik}, d_{jk})} & i \neq j, \\ 0 & i = j, \end{cases}  (10)

where z_i is a normalization factor. Note that Pr(i → j) is not equal to Pr(j → i). For the 3D model, 4 vertices are uniformly sampled to match the 4 sampled STs, without any special procedure.

⁴By the independent motion assumption, STs should represent only non-STs in the same subspace.
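Eqs. (9)-(10) can be sketched on a toy absorbing chain. The threshold used below to binarize C̄ is a small constant chosen for this toy (the implementation details later suggest 2/p); the matrices are hand-made and the helper names are ours.

```python
import numpy as np

def jaccard_transition(D):
    """Transition probabilities of Eq. (10) from a binary matrix D whose
    row i marks the tracks that ST i helps represent."""
    p = D.shape[0]
    P = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            if i != j:
                P[i, j] = (np.minimum(D[i], D[j]).sum()
                           / np.maximum(D[i], D[j]).sum())
        if P[i].sum() > 0:
            P[i] /= P[i].sum()           # z_i normalization
    return P

# Toy column-stochastic C* with STs at columns 0 and 1 (absorbing nodes).
C = np.array([[1., 0., .5, 1., 0.],
              [0., 1., .5, 0., 1.],
              [0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
Cbar = np.linalg.matrix_power(C, 1000)   # Eq. (9): limiting matrix
D = (Cbar[:2, :] > 0.1).astype(float)    # binarize the two ST rows
P_trans = jaccard_transition(D)
print(P_trans[0, 1])  # 1.0: ST 1 is the only other ST reachable from ST 0
```

With only two STs, each walks deterministically to the other; with more STs, pairs that mutually represent many of the same non-STs receive proportionally higher transition probability.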

Given 4 matches between tracks and 3D vertices, we first estimate the affine transformation between them, then project the rotation part of each frame onto the Stiefel manifold (see the supplementary material for more detail). The M that maximizes (6) can be used to recover P in (3) by selecting the tracks in W closest to MS.
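A sketch of this per-frame step under our reading of the procedure: fit the 2×4 affine block from the 4 matches, then take the SVD-based projection of its 2×3 part onto the Stiefel manifold (the nearest matrix with orthonormal rows). The function names and toy data are ours, not the paper's.

```python
import numpy as np

def estimate_frame_motion(w, S4):
    """w: 2x4 image points; S4: 4x4 homogeneous coords of 4 vertices.
    Returns the scale, the orthonormal-row rotation part, and translation."""
    A = w @ np.linalg.inv(S4)            # affine block [alpha*M^f | b^f]
    R, b = A[:, :3], A[:, 3]
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return s.mean(), U @ Vt, b           # scale, Stiefel projection, b^f

# Toy check: synthesize one frame from a known scaled-orthographic motion.
rng = np.random.default_rng(1)
V = rng.standard_normal((3, 4))                    # 4 model vertices
S4 = np.vstack([V, np.ones((1, 4))])
Mf_true = np.eye(3)[:2]                            # identity rotation rows
w = 2.0 * Mf_true @ V + np.array([[3.0], [4.0]])   # alpha = 2, b = (3, 4)
alpha, Mf, b = estimate_frame_motion(w, S4)
print(np.allclose(alpha, 2.0), np.allclose(Mf @ Mf.T, np.eye(2)))
```

The SVD projection U Vᵀ is the polar (orthogonal) factor of the 2×3 block, so M^f M^{f⊤} = I₂ holds exactly, matching the constraint in (6).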

The algorithm for MfS is summarized in Alg. 1.

Algorithm 1 Motion from Structure

Input: W, S, λ0, µ, nIter
Output: P, M

Step 1: Identifying STs
1: Compute C and E with (7)
2: Select the columns of W indicated by diag(C) = 1 as STs

Step 2: Guided Sampling
3: Compute C̄ with (9)
4: Compute the transition matrix with (10)
5: for i = 1 to nIter do
6:   Uniformly sample a ST
7:   Use guided sampling to obtain another 3 STs
8:   Uniformly sample 4 vertices from S
9:   Estimate motion matrix M
10:  Keep M if it has the highest value in (6)
11: end for
12: Compute P by selecting the tracks closest to MS
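Lines 6-7 of Alg. 1 can be sketched as a random walk on the transition matrix. This simplified version takes one walk step per draw and omits the two-step walk and restart behaviour described in the implementation details below; the example transition matrix is made up.

```python
import numpy as np

def sample_support_tracks(P, k=4, rng=None):
    """Sample k distinct STs: uniform start, then random-walk steps on
    the transition matrix P, zeroing columns of already-chosen STs."""
    if rng is None:
        rng = np.random.default_rng()
    n_st = P.shape[0]
    current = int(rng.integers(n_st))
    chosen = [current]
    Pw = P.astype(float).copy()
    while len(chosen) < k:
        Pw[:, current] = 0.0                   # forbid resampling (Sec. 4.4)
        row = Pw[current]
        if row.sum() == 0.0:                   # dead end: pick uniformly
            current = int(rng.choice([i for i in range(n_st)
                                      if i not in chosen]))
        else:
            current = int(rng.choice(n_st, p=row / row.sum()))
        chosen.append(current)
    return chosen

# Two tightly linked STs (0, 1), a looser triple (2, 3, 4): walks tend to
# stay within a group, mimicking STs from the same object.
P = np.array([[0.0, 0.8, 0.1, 0.1, 0.0],
              [0.8, 0.0, 0.1, 0.1, 0.0],
              [0.1, 0.1, 0.0, 0.4, 0.4],
              [0.1, 0.1, 0.4, 0.0, 0.4],
              [0.0, 0.0, 0.5, 0.5, 0.0]])
sts = sample_support_tracks(P, k=4, rng=np.random.default_rng(0))
print(len(set(sts)))  # 4 distinct STs
```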

4.4. Implementation details

This section provides the implementation details for each step of our algorithm.

Identifying STs: We use the ℓ1 norm as the error function f(·) in (7). The optimization problem is solved using the Gurobi optimizer [10].

Guided sampling: In (9), rather than using an eigendecomposition to compute C̄, we perform 1000 rounds of power iteration on C*, which results in a similar C̄ while being faster than the eigendecomposition. When calculating d_ik, it is likely that most c̄_{i′k} will not be exactly zero. Instead, we set the threshold in I(·) to 2/p, where p is the number of STs. During the sampling, we perform two steps of random walk before selecting a ST, and allow the walker to restart at any sampled ST with equal probability [26]. We prevent the sampled STs from being resampled by setting Pr(i → j) to 0 for all sampled STs j.

5. Experiments

We evaluated our algorithm on synthetic data and real videos downloaded from YouTube. Recall that MfS is a new problem, and there is no existing baseline algorithm. Hence, we constructed two baselines to compare our algorithm against. (a) All-R: this baseline directly samples 4 tracks and 3D vertices uniformly to form the matches required for estimating M. (b) ST-R: for this approach, we use (7) to first identify STs, then uniformly sample 4 track-vertex pairs. We refer to our ST selection and guided sampling as STGS-R.

5.1. Synthetic data

In this section, we used synthetic data to quantitatively evaluate the performance of our approach (see Fig. 4a). We generated a set of 15-frame tracks with three moving shapes: a cube (8 corners), a double pyramid (6 corners), and a cuboid (8 corners). A number of internal vertices (IVs) were generated randomly inside each shape to imitate non-STs. For each shape, we generated a motion matrix M_gt comprising rotation, translation, and scaling. 200 additional tracks were generated as background tracks. We added a single random translation to all tracks to simulate camera motion, and perturbed each track with Gaussian noise. To model outliers, we set the background tracks to follow the first shape that comes close to them. This imitates real scenarios where parts of the background get occluded by moving objects, causing background tracks to follow foreground objects instead.

We evaluated our algorithm in two situations. First, we tested the robustness of MfS to the density of tracks by varying the number of IVs for each object from 20 to 100 (corners inclusive). Second, we randomly removed tracks belonging to the corners of each shape with probability from 0 to 100 percent, to imitate situations where corner vertices may not be tracked. In the second case, we set the number of IVs to 100. The simulation was repeated for 100 rounds per setting. In all experiments, we selected 10% and 20% of tracks as STs for ST-R and STGS-R and sampled track-vertex pairs 5 × 10⁴ times on each set of STs, while we sampled 1 × 10⁵ times for All-R. The sample that yielded the highest cost in (6) was returned as the result of each algorithm.

We used three metrics to measure performance. The first metric is the IoU between the projections due to M and Mgt, defined as:

\[
\mathrm{IoU} = \sum_{f=1}^{F} \frac{\mathrm{Area}\big(\pi_f(\mathbf{M}_{gt}\mathbf{S}) \cap \pi_f(\mathbf{M}\mathbf{S})\big)}{\mathrm{Area}\big(\mathrm{conv}(\pi_f(\mathbf{M}_{gt}\mathbf{S}) \cup \pi_f(\mathbf{M}\mathbf{S}))\big)}. \tag{11}
\]
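Each summand of (11) can be evaluated for convex 2D projections with standard geometric primitives: the shoelace formula for areas, Sutherland-Hodgman clipping for the intersection, and Andrew's monotone chain for the convex hull of the union. The sketch below is ours (all function names are assumptions, not the authors' code):

```python
def area(poly):
    """Shoelace area of a polygon given as a list of (x, y) vertices."""
    s = sum(poly[i][0] * poly[(i + 1) % len(poly)][1]
            - poly[(i + 1) % len(poly)][0] * poly[i][1]
            for i in range(len(poly)))
    return abs(s) / 2.0

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _line_intersect(p1, p2, p3, p4):
    """Intersection point of the infinite lines p1-p2 and p3-p4."""
    d = (p1[0] - p2[0]) * (p3[1] - p4[1]) - (p1[1] - p2[1]) * (p3[0] - p4[0])
    a = p1[0] * p2[1] - p1[1] * p2[0]
    b = p3[0] * p4[1] - p3[1] * p4[0]
    return ((a * (p3[0] - p4[0]) - (p1[0] - p2[0]) * b) / d,
            (a * (p3[1] - p4[1]) - (p1[1] - p2[1]) * b) / d)

def clip(subject, clipper):
    """Sutherland-Hodgman: intersect two convex CCW polygons."""
    out = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, out = out, []
        if not inp:
            break
        s = inp[-1]
        for e in inp:
            if _cross(a, b, e) >= 0:           # e inside the half-plane of edge a->b
                if _cross(a, b, s) < 0:
                    out.append(_line_intersect(s, e, a, b))
                out.append(e)
            elif _cross(a, b, s) >= 0:
                out.append(_line_intersect(s, e, a, b))
            s = e
    return out

def hull(points):
    """Andrew's monotone chain convex hull (returns CCW vertices)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lo, hi = [], []
    for p in pts:
        while len(lo) >= 2 and _cross(lo[-2], lo[-1], p) <= 0:
            lo.pop()
        lo.append(p)
    for p in reversed(pts):
        while len(hi) >= 2 and _cross(hi[-2], hi[-1], p) <= 0:
            hi.pop()
        hi.append(p)
    return lo[:-1] + hi[:-1]

def frame_iou(proj_gt, proj_est):
    """One summand of (11): intersection area over area of conv(union)."""
    inter = clip(proj_gt, proj_est)
    denom = area(hull(list(proj_gt) + list(proj_est)))
    return (area(inter) / denom) if inter and denom else 0.0
```

The full metric is then the sum of `frame_iou` over all F frames.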

The second metric is segmentation precision, defined as the number of correctly selected tracks divided by the total number of selected tracks (i.e., the number of rows of WM). The last metric is segmentation recall, defined as the number of correctly selected tracks divided by the number of tracks of the correct object.
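These two metrics reduce to simple set counts over track indices; a sketch (function and argument names are ours):

```python
def segmentation_scores(selected, correct):
    """Precision and recall of the selected track indices against the
    ground-truth track indices of the correct object."""
    selected, correct = set(selected), set(correct)
    tp = len(selected & correct)  # correctly selected tracks
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(correct) if correct else 0.0
    return precision, recall
```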


Figure 4. Synthetic experiment. (a) A synthetic scene comprising tracks from a cube (red), a double pyramid (blue), a cuboid (green), and background (gray). Outlier tracks are shown in dark gray. The performance (IoU, precision, and recall for All-R, ST-R, and STGS-R) was measured by varying (b) the number of internal vertices, and (c) the probability of having each corner track removed.

Fig. 4 shows the results of the synthetic experiment. The performance of All-R is very low, implying that randomly selecting matches is not a good approach to MfS. This is due to the challenging nature of MfS: the ratio of good matches to all matches is very low. Improvements due to ST selection and guided sampling can be seen in the performance gaps between the three algorithms. STGS-R performs very well regardless of the number of IVs. On the other hand, since the shapes are similar, removing corners caused the performance to drop. In real cases, we would expect at least 50% of corners to be present, in which case the performance is still in an acceptable range.
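A back-of-envelope calculation (ours, under an independence assumption and approximate counts from the synthetic setup) illustrates why: a uniformly drawn track-vertex pair is only correct when the track is one of the model's corner tracks and is paired with its one matching vertex, and all 4 pairs in a sample must be correct at once.

```python
# Approximate track count from the synthetic setup (our assumption):
# corners of the three shapes + 100 IVs each + 200 background tracks.
n_tracks = (8 + 6 + 8) + 3 * 100 + 200   # ~522
n_corners = 8                             # cube model vertices

# P(one pair correct) ~ P(track is a cube corner) * P(paired with its vertex)
p_pair = (n_corners / n_tracks) * (1.0 / n_corners)
# P(all 4 pairs correct), treating the draws as independent.
p_sample = p_pair ** 4
expected_draws = 1.0 / p_sample

print(f"p_pair ~ {p_pair:.1e}, p_sample ~ {p_sample:.1e}, "
      f"expected draws ~ {expected_draws:.1e}")
```

The expected number of draws before a fully correct sample is on the order of 10^10, far beyond the 1 × 10⁵ samples budgeted for All-R, which is consistent with its low performance.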

5.2. Real videos

In this section, we provide a qualitative evaluation of our approach using videos collected by us and videos downloaded from YouTube (see Fig. 5). The videos are 10 to 97 frames long, and are extremely challenging since they contain multiple moving objects, camera motion, zoom changes, and perspective effects. These geometric changes make the scenes very dynamic and features difficult to track over long periods of time. We detected feature points using the Shi-Tomasi feature detector [23] and used the Lucas-Kanade tracker [2] to track them, resulting in 490 to 1014 tracks for each video (see Col. 3, Fig. 5). Note that there are multiple outlier tracks caused by features corresponding to occlusion boundaries. We downloaded 3D models that approximate the target objects from [29], and generated the shape matrix by manually selecting 8 to 18 vertices that represent well the convex hull of each 3D model (see Col. 2, Fig. 5). For the algorithm settings, we selected 10% of tracks as STs, and sampled 1 × 10⁶ track-vertex pairs for all algorithms.

Col. 4 and 5 of Fig. 5, respectively, show the selected STs and the alignment results for the real video experiment. We encourage the readers to see the results in the supplementary material to appreciate the difficulty of the data. As can be seen in Col. 4 of Fig. 5, our ST selection can reliably select tracks on the convex hull of objects, which significantly reduces the complexity of the problem. This strategy, combined with guided sampling and the appropriate cost function in (6), effectively allows us to solve for the correspondence between the 3D models and the trajectories. With only a few vertices representing the convex hulls of objects, our MfS algorithm can correctly find an alignment under self-occlusion without knowing which vertices are occluded. It can also handle different camera motion effects and imprecise 3D models (Fig. 5, different models of harrier (row 2) and cars (row 5)). Due to space restrictions, we include in the supplementary document additional results with comparisons to baselines, and several examples where MSeg-based approaches cannot solve this problem. It is important to remind the reader that MfS is a different problem from MSeg, but we use MSeg as a baseline since the problems are related.

While the method has worked well in most of the sequences that we have tried, the last row of Fig. 5 shows a failure case of our algorithm. In this case, the camera was panning from left to right, allowing only a portion of the background to be fully tracked. Rather than aligning to one of the cars, the 3D model is aligned to the background tracks because they coincidentally form a rectangular shape similar to the top view of the car's 3D model. In such cases, tracks alone may not provide enough information to obtain the correct result. Additional information, such as expected orientation (e.g., wheels should point down), can be incorporated to reject wrong solutions during the estimation of M.

Figure 5. Results from the real video experiment. Column 1: First frame of the videos. Column 2: 3D models with the vertices represented by S shown as dark green dots. Column 3: Tracks are shown with yellow lines. Recall that we only use tracks and 3D models as input, not the images. Column 4: Selected STs. Column 5: Results of alignment by backprojecting the 3D models to the video; the STs in this frame are shown as red points. We reduce the image intensity of columns 3 to 5 for visualization purposes. (Best viewed in color.)

6. Conclusions

This paper proposes a new problem, Motion from Structure (MfS), where a given 3D model is aligned in an unsupervised manner to tracks in a video. A two-step approach is proposed: convex hulls of objects are discovered in the first step, then guided sampling is performed for alignment in the second step. Our approach does not require segmentation or 3D reconstruction of objects, and thus bypasses their drawbacks. We tested our approach on synthetic data and real videos, and showed that it outperformed baseline approaches.

One limitation of using convex hulls for alignment is that the convex hull of some objects may be symmetric, which may lead to incorrect alignment due to oversimplification of shapes. We also notice that some corners were not identified as STs. This occurs due to (1) close proximity between tracks, allowing them to represent each other well, and (2) violation of the independent subspace assumption. Further improvements include incorporating other information (e.g., orientation) and handling incomplete tracks (i.e., missing data). We will address these issues in future work.

Acknowledgments

This research was partially supported by Fundação para a Ciência e a Tecnologia (project FCT [UID/EEA/50009/2013]) and a PhD grant from the Carnegie Mellon-Portugal program.

References

[1] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. TPAMI, 33(7):1442–1456, 2011.
[2] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. IJCV, 56(3):221–255, 2004.
[3] A. Collet, D. Berenson, S. S. Srinivasa, and D. Ferguson. Object recognition and full pose registration from a single image for robotic manipulation. In ICRA, 2009.
[4] J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 29(3):159–179, 1998.
[5] P. David, D. Dementhon, R. Duraiswami, and H. Samet. SoftPOSIT: Simultaneous pose and correspondence determination. IJCV, 59(3):259–284, 2004.
[6] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. TPAMI, 35(11):2765–2781, 2013.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.
[8] A. Goshtasby and G. C. Stockman. Point pattern matching using convex hull edges. IEEE Trans. Systems, Man, and Cybernetics, 15(5):631–637, 1985.
[9] C. M. Grinstead and J. L. Snell. Introduction to Probability. American Mathematical Society, 1997.
[10] Gurobi Optimization Inc. Gurobi optimizer reference manual, 2015.
[11] Y. Igarashi and K. Fukui. 3D object recognition based on canonical angles between shape subspaces. In ACCV, 2010.
[12] D. W. Jacobs. Linear fitting with missing data for structure-from-motion. CVIU, 82:206–2012, 1997.
[13] Q. Ke and T. Kanade. Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In CVPR, 2005.
[14] B. Klingner, D. Martin, and J. Roseborough. Street view motion-from-structure-from-motion. In ICCV, 2013.
[15] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. TPAMI, 35(1):171–184, 2013.
[16] M. Marques, M. Stošic, and J. Costeira. Subspace matching: Unique solution to point matching with geometric constraints. In ICCV, 2009.
[17] P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In ICCV, 2009.
[18] F. Moreno-Noguer, V. Lepetit, and P. Fua. Pose priors for simultaneously solving alignment and correspondence. In ECCV, 2008.
[19] B. Ommer, T. Mader, and J. M. Buhmann. Seeing the objects behind the dots: Recognition in videos from a moving camera. IJCV, 83(1):57–71, 2009.
[20] L. Quan and T. Kanade. A factorization method for affine structure from line correspondences. In CVPR, 1996.
[21] S. Satkin, J. Lin, and M. Hebert. Data-driven scene understanding from 3D models. In BMVC, 2012.
[22] S. Savarese and L. Fei-Fei. 3D generic object categorization, localization and pose estimation. In ICCV, 2007.
[23] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[24] S. W. Sun, Y. C. F. Wang, F. Huang, and H. Y. M. Liao. Moving foreground object detection via robust SIFT trajectories. Journal of Visual Communication and Image Representation, 24(3):232–243, 2013.
[25] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. IJCV, 9(2):137–154, 1992.
[26] H. Tong, C. Faloutsos, and J. Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.
[27] A. Toshev, A. Makadia, and K. Daniilidis. Shape-based object recognition in videos using 3D synthetic object models. In CVPR, 2009.
[28] P. Tresadern and I. Reid. Articulated structure from motion by factorization. In CVPR, 2005.
[29] Trimble Navigation Ltd. 3D Warehouse. http://3dwarehouse.sketchup.com/.
[30] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). TPAMI, 27(12):1945–1959, 2005.
[31] J. Yan and M. Pollefeys. A factorization-based approach for articulated non-rigid shape, motion and kinematic chain recovery from video. TPAMI, 30(5):865–877, 2008.
[32] X. Y. Yu, H. Sun, and J. Chen. Points matching via iterative convex hull vertices pairing. In Int. Conf. Machine Learning and Cybernetics, 2005.
[33] L. Zelnik-Manor, M. Machline, and M. Irani. Multibody factorization with uncertainty: Revisiting motion consistency. IJCV, 68(1):27–41, 2006.
[34] Z. Zeng, T. H. Chan, K. Jia, and D. Xu. Finding correspondence from multiple images via sparse and low-rank decomposition. In ECCV, 2012.
[35] M. Zhai, L. Chen, J. Li, M. Khodabandeh, and G. Mori. Object detection in surveillance video from dense trajectories. In MVA, 2015.
[36] F. Zhou and F. De la Torre. Spatio-temporal matching for human detection in video. In ECCV, 2014.
[37] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3D representations for object recognition and modeling. TPAMI, 35:2608–2623, 2013.