
Online 6 DOF Augmented Reality Registration from Natural Features

Kar Wee Chia, Adrian David Cheok, Simon J.D. Prince
Dept. of Electrical and Computer Engineering
National University of Singapore
{engp1679, adriancheok, elesp}@nus.edu.sg

Abstract

We present a complete scalable system for 6 d.o.f. camera tracking based on natural features. Crucially, the calculation is based only on pre-captured reference images and previous estimates of the camera pose and is hence suitable for online applications. We match natural features in the current frame to two spatially separated reference images. We overcome the wide baseline matching problem by matching to the previous frame and transferring point positions to the reference images. We then minimize deviations from the two-view and three-view constraints between the reference images and the current frame as a function of the camera position parameters. We stabilize this calculation using a recursive form of temporal regularization that is similar in spirit to the Kalman filter. We can track camera pose over hundreds of frames and realistically integrate virtual objects with only slight jitter.

Keywords: augmented reality, natural feature tracking, visual registration, camera pose estimation.

1. Introduction

For three-dimensional (3-D) Augmented Reality (AR) applications, accurate measurements of the camera pose (i.e. position and orientation) relative to the real world are required for the proper registration of virtual objects. AR tracking based on fiducial markers in the scene has been highly successful [11, 12]. Markers are constructed so that they are easily detected in each image frame and, given some a priori information about the shape or positions of the markers, the relative pose of the camera can be easily determined. This paper is concerned with the considerably harder task of measuring camera pose accurately by tracking natural point features in the scene alone (see Figure 1).

Most previous work on natural feature tracking in AR has only attempted to track two-dimensional (2-D) features across a sequence of images. This is considerably simpler than the full 3-D problem and can readily be achieved in real time. The recovered 2-D motion field can be used to estimate the change in positions of labels for 2-D geographic labeling applications: one approach is to measure the optical flow between adjacent image frames [13, 16]. For the special case in which the camera motion is pure rotation or the viewed scene is planar, the 2-D positions of corresponding features in two different camera views are related by a homography. This relationship is exploited in [14], where 3-D AR registration from natural features has been achieved, but only when the scene is principally planar.

Figure 1. We aim to track the camera position from frame to frame so that virtual content can be realistically introduced without placing any markers in the scene. In this sequence, the cube stays on the corner of the book as the camera moves forward and right.

There are two challenges in camera pose tracking from natural features. First, we must establish which features correspond to which between different frames from the same sequence. Second, we must estimate the change in camera pose between frames based on the change in positions of these features. Usually, the 3-D positions of the features themselves are estimated simultaneously.

The optimal way to recover camera motion is by offline techniques which use the whole image sequence (e.g. [5]). The scene is typically assumed to be static and rigid. Global bundle adjustment techniques attempt to simultaneously estimate the structure of the scene and the camera motion. At the core of these methods is a non-linear minimization of a cost function based on reprojection errors of estimated 3-D points across the whole video sequence. An excellent review of the theory of bundle adjustment is found in [15]. Despite their excellent performance, such batch processing methods are clearly not suitable for real-time online AR applications.

Various incremental motion estimation approaches have also been proposed for time-critical applications. A key issue here is to ensure that incremental motion estimates at different points in the time series are compatible with a single 3-D structure. Unfortunately, recursive online estimation of this structure as in [4] is impractically slow. Avidan and Shashua [3] introduced the operation of "threading" adjacent two-view motion estimates. This technique is repeatedly applied to a sliding window of view triplets to recover the camera trajectory. Fitzgibbon and Zisserman [7] apply bundle adjustment techniques to a sliding window of triplets of images using point matches across all three images. Zhang and Shan [19] present a similar scheme in which both two-view and three-view constraints are used in the bundle adjustment.

Although the above methods are suitable for online implementation, their incremental nature means that errors in camera motion estimates inevitably accrue over time: the current camera position is calculated by concatenating transformations between adjacent frames in the sequence, so each new estimation error is added to all of those that came before. This is unacceptable in AR applications, where small registration errors are very noticeable and very accurate estimates must be maintained over thousands of frames.

Our approach is based on always calculating camera motion relative to two or more pre-captured reference image frames of the scene. This has the advantage of preventing a gradual increase in the camera position error. Camera pose relative to the reference frames is computed through the minimization of a simple cost function based on two-view epipolar and three-view constraints on feature position. We use time-series information to provide the starting point for this minimization and to regularize the error surface when the incoming data is impoverished (see Figure 2).

The structure of the paper is as follows: In the next section we give an overview of our approach. Section 3 introduces the two-view and three-view constraints and discusses how to estimate the camera position by minimizing deviations from these constraints.

Figure 2. The problem is to estimate the transformation matrix, Tk, between the camera and the scene for the current frame, Vk. The system matches the current frame to two stored reference frames, VA and VB, to determine the camera position. This estimation problem is regularized using data from previous frames in the time series.

Section 4 describes the temporal regularization method used to stabilize the computation of the motion estimates. Section 5 presents a summary of the recursive time-series approach used to estimate the motion parameters. Finally, we describe how the feature correspondence problem between the current frame and the reference frames can be solved rapidly and accurately.

2. Overview of approach

We take two spatially separated photos of the area in which we intend to work. We call these photos reference image frames VA and VB (see Figure 2). A high quality set of point feature [8] matches is obtained for these two images. The matching technique described in [18] is suitable. The two images are also calibrated, meaning that their camera poses, TA and TB, relative to the virtual objects are known. These can be computed by placing a fiducial marker [11] in the scene for these two frames only. Hence, upon entering the system, we have two reference frames, a set of point matches between these frames, an accurate estimate of the relative orientation and position of these two frames, and the positions of the virtual contents relative to the two frames. All of these things can be calculated very accurately and reliably using offline processing.

As the camera moves in the scene, a set of two-view point matches is computed for each incoming image frame Vk and one of the reference frames, VA or VB. This matching process is described in section 6.


Given a set of point matches, the standard technique to estimate the relative camera motion between Vk and VA is a two-view algorithm [9, 17], which essentially calculates the fundamental matrix between the two frames and extracts the rotation and translation parameters from this matrix. These initial motion estimates are then used as the starting point in a non-linear optimization of what we term the "two-view cost function". For a given set of motion parameters, each point in the reference frame is constrained to lie on a particular (epipolar) line in the current image. This cost function is essentially the sum of the deviations of the matching points from these lines across all the point matches. The translation vector can only be obtained up to scale, but this scaling factor can be easily recovered given prior knowledge of the depth of any one point in the two images.

In our algorithm, we simplify this system by using motion estimates based upon the previous frames ..., Vk−2, Vk−1 as the starting point for the non-linear minimization. We simultaneously minimize the two-view cost function between the current frame and the first reference frame (VA and Vk) and between the current frame and the second reference frame (VB and Vk). We also minimize the "three-view cost function" – here, a given set of motion parameters predicts exactly where a point occurring in both the reference frames should lie in the current image. The three-view cost function consists of the deviation between this prediction and the actual position of the matched corner.

Since the camera transformation TA between VA and the virtual object is known, the camera transformation between the current frame and the virtual object is easily computed using the estimated camera motion between the current frame and VA. We can then proceed to introduce the virtual object realistically into the current frame.
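As an illustrative sketch of this last composition step only (assuming 4x4 homogeneous transforms that map virtual-object coordinates into the respective camera frames; the function names below are ours, not part of the system):

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a translation 3-vector into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def current_object_pose(T_A, T_Ak):
    """T_A maps virtual-object coordinates into reference frame V_A;
    T_Ak is the estimated motion from V_A to the current frame V_k.
    Their composition gives the pose used to render the augmentation."""
    return T_Ak @ T_A
```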

3. Estimation of relative camera motion

This section describes the two-view and three-view constraints used to construct the cost function which is minimized as a function of the 6 motion parameters.

3.1. Two-view constraints

Figure 3 shows the two-view epipolar geometry between a reference image frame VA and the current image frame Vk. Consider an imaged point in the reference frame, pA, which is the projection of a 3-D world point P into the reference image. From the reference frame alone, we know that the location of P must lie somewhere along the line projecting from the reference camera center through the image point pA, although we have no way of establishing the depth. Assuming that we know the Euclidean transformation TAk between the two camera positions, we can project this line into the current image frame Vk. The two-view constraint means that the corresponding point pk in the current frame must lie along this line.

Figure 3. The two-view (epipolar) constraint. The three-dimensional point P, which projects to point pA in one image, must project to somewhere along a fixed line in a second image. This line is known as an epipolar line and is determined by the relative pose of the two cameras.

In this paper, we leverage this relationship to estimate the unknown camera transformation TAk given a set of point matches. In particular, we minimize the distance between the point position in the current image and the predicted epipolar line as a function of the elements of TAk. We now give a more precise definition of the cost function for this minimization.

For the j-th point match (pAj, pkj) between frames VA and Vk corresponding to a 3-D point, the two-view constraint may be expressed as:

\tilde{p}_{kj}^T F \tilde{p}_{Aj} = 0        (1)

Where the tilde (~) denotes homogeneous coordinates and the superscript T represents matrix transpose. F is the fundamental matrix describing the epipolar geometry. When the camera is calibrated, the fundamental matrix F has the following form:

F = A^{-T} [t_{Ak}]_{\times} R_{Ak} A^{-1}        (2)

Where A is the intrinsic parameter matrix of the calibrated camera, R_{Ak} is the rotation matrix relating frames VA and Vk, and [t_{Ak}]_{\times} is the skew-symmetric matrix formed from the translation vector t_{Ak}. Thus, for a given set of motion parameters, the fundamental matrix can be calculated for a calibrated camera.
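A minimal sketch of equation (2) in NumPy (our own helper names; A is the 3x3 intrinsic matrix, R_Ak and t_Ak the rotation and translation):

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_motion(A, R_Ak, t_Ak):
    """Equation (2): F = A^{-T} [t_Ak]_x R_Ak A^{-1} for a calibrated camera."""
    A_inv = np.linalg.inv(A)
    return A_inv.T @ skew(t_Ak) @ R_Ak @ A_inv
```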

Proceedings of the International Symposium on Mixed and Augmented Reality (ISMAR’02) 0-7695-1781-1/02 $17.00 © 2002 IEEE

Figure 4. Two-view constraint between the reference frame, VA (left), and the current frame, Vk (right). The current estimate of the translation and rotation between the frames defines the fundamental matrix, F. This maps a given point, pAj, in the reference image to a line in the current image. The transpose of the fundamental matrix maps each point in the current image to a line in the reference image. If the motion parameters are correct, then these lines should pass through the corresponding points in the other image. We minimize the perpendicular distance from the lines to their corresponding points as a function of the motion parameters.

A consequence of the constraint equation (1) is that, given a point pAj in VA, the corresponding point pkj must lie along an epipolar line in Vk. Consideration of equation (1) reveals that this epipolar line is given by F \tilde{p}_{Aj}. Similarly, given a point pkj in Vk, the corresponding point pAj must lie along the epipolar line in VA, which is given by F^T \tilde{p}_{kj}.

We can now construct a function, φ, associated with the matched point pair, which is the distance of the points from their respective predicted epipolar lines (see Figure 4):

φ_A(\tilde{p}_{kj}, \tilde{p}_{Aj}, T_{Ak}) = d(\tilde{p}_{kj}, F\tilde{p}_{Aj})^2 + d(\tilde{p}_{Aj}, F^T\tilde{p}_{kj})^2        (3)

Where T_{Ak} = [R_{Ak} | t_{Ak}] and F is given by equation (2). The distance terms are given by:

d(\tilde{p}_{kj}, F\tilde{p}_{Aj}) = \frac{\tilde{p}_{kj}^T F \tilde{p}_{Aj}}{\sqrt{(F\tilde{p}_{Aj})_1^2 + (F\tilde{p}_{Aj})_2^2}}

d(\tilde{p}_{Aj}, F^T\tilde{p}_{kj}) = \frac{\tilde{p}_{Aj}^T F^T \tilde{p}_{kj}}{\sqrt{(F^T\tilde{p}_{kj})_1^2 + (F^T\tilde{p}_{kj})_2^2}}        (4)

Where, for example, (F\tilde{p}_{Aj})_i denotes the i-th component of the vector F\tilde{p}_{Aj}. The cost function φ_A is termed the symmetric epipolar distance. Given a set of point matches across two images, we minimize the cost φ_A as a function of the 6 motion parameters to generate a motion estimate.
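The symmetric epipolar distance of equations (3) and (4) can be evaluated compactly; the following sketch (our own helper names, with homogeneous points as length-3 arrays) is one way it might be written:

```python
import numpy as np

def point_line_distance(p, l):
    """Signed distance of homogeneous point p from line l (eq. 4)."""
    return (p @ l) / np.sqrt(l[0] ** 2 + l[1] ** 2)

def symmetric_epipolar_cost(p_k, p_A, F):
    """Equation (3) for one match: squared distances to both epipolar lines.
    p_k, p_A are homogeneous image points (x, y, 1); F is the fundamental matrix."""
    return (point_line_distance(p_k, F @ p_A) ** 2 +
            point_line_distance(p_A, F.T @ p_k) ** 2)

def two_view_cost(points_k, points_A, F):
    """Sum of phi_A over all matches; minimized over the 6 motion parameters."""
    return sum(symmetric_epipolar_cost(p_k, p_A, F)
               for p_k, p_A in zip(points_k, points_A))
```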

In fact, it transpires that the above criterion can determine the rotation correctly, but can only estimate the translation vector up to an unknown scaling factor. We resolve this situation by also matching to points in VB. This second cost function φ_B creates a second, independent family of potential solutions.

3.2. Three-view constraints

A second way to resolve this ambiguity is to use the three-view constraint. Since we have two reference images, we can identify some corner matches which are present in all three frames. Since we know the relative position of the two reference frames, we know the 3-D position of any feature that is found in both images. Hence, for a given set of motion parameters, we can predict where this feature will lie in the current frame. If the motion parameters are correct, this point should coincide with the point position in Vk. One way to think about this geometrically is to consider two two-view constraints. Each defines an epipolar line. The point must lie at the intersection of these epipolar lines (see Figure 5). In a similar way to the two-view constraint, this operation can be repeated three times to project the point from each of the three pairs of images into the third. Hence, the cost function for the three-view matching contains three terms:

ψ(\tilde{p}_{kj}, \tilde{p}_{Aj}, \tilde{p}_{Bj}, T_{Ak}) = d(\tilde{p}_{kj}, F_{Ak}\tilde{p}_{Aj} \times F_{Bk}\tilde{p}_{Bj})^2
    + d(\tilde{p}_{Aj}, F_{Ak}^T\tilde{p}_{kj} \times F_{AB}^T\tilde{p}_{Bj})^2
    + d(\tilde{p}_{Bj}, F_{Bk}^T\tilde{p}_{kj} \times F_{AB}\tilde{p}_{Aj})^2        (5)


Figure 5. Three-view cost function. For a point pkj in the current frame Vk (centre) matched to both point pAj in reference frame VA (left) and point pBj in reference frame VB (right), a further constraint is imposed. For a given motion estimate, each match defines an epipolar line in the current frame. The intersection of these lines produces a predicted point position for that motion estimate. The three-view cost function is the Euclidean distance from pkj to this predicted point. Notice that this constraint is stronger than the two two-view constraints alone. It is possible for the perpendicular distance to each epipolar line to be small, but the distance to the point of intersection to be large.

Where d(a, b) is simply the Euclidean distance between image points a and b, and × represents the vector cross product.

In practice, it is better to implement the three-view constraint using the trifocal tensor, since there are some degenerate conditions where the epipolar lines are parallel and hence do not meet at a point, but the point transfer is still perfectly well defined. The trifocal tensor is a geometric entity relating points or lines across three images. It can be defined entirely using the motion parameters relating these three views and is the three-view equivalent of the fundamental matrix. See [10] for a review.
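For intuition only, the epipolar-line-intersection form of the point transfer in equation (5) might be evaluated as below (a sketch with our own names; the actual implementation uses the trifocal tensor precisely because this form breaks down when the two lines are near parallel):

```python
import numpy as np

def transfer_point(p_A, p_B, F_Ak, F_Bk):
    """Predicted position in the current frame V_k of a point matched in both
    reference frames: the intersection of the two epipolar lines. The cross
    product of two homogeneous lines gives their intersection point."""
    line_from_A = F_Ak @ p_A
    line_from_B = F_Bk @ p_B
    intersection = np.cross(line_from_A, line_from_B)
    # If the lines are (near) parallel, intersection[2] -> 0 and this transfer
    # degenerates; this is the case handled properly by the trifocal tensor.
    return intersection / intersection[2]

def three_view_residual(p_k, p_A, p_B, F_Ak, F_Bk):
    """Euclidean distance between the observed point and the transferred point."""
    predicted = transfer_point(p_A, p_B, F_Ak, F_Bk)
    return np.linalg.norm(p_k[:2] - predicted[:2])
```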

4. Temporal regularization

We now create a measurement vector z_k which incorporates all of the above information. Specifically, z_k is a column vector whose elements are the two-view costs between the first reference frame and the current frame, the two-view costs between the second reference frame and the current frame, and the three-view costs. The simplest possible solution to the pose estimation problem would be to minimize the quantity

\varepsilon = z_k^T z_k        (6)

as a function of the rotation and translation parameters. This minimization can indeed provide an estimate of the camera movement, but it fails to make use of the fact that the camera position in each frame is closely related to that in the preceding frames in the time sequence. In general, minimization of this function will produce a very noisy sequence. Moreover, in a real sequence there may be the occasional mismatched point, which will cause the motion estimate for that frame to be erroneous. If the number of point matches becomes small, or matching to one frame fails, or the geometrical distribution of the 3-D points is not sufficiently general, the above function may not have a unique global minimum. Even if none of these things occur, the error surface may be extremely flat near the minimum as the rotation and translation parameters may trade off against one another. Very small amounts of noise in the measurements may hence result in quite large variations in the solution.

In order to guard against this ill-conditioning, we regularize the solution by imposing some prior knowledge to ensure that the error surface has a well-defined global minimum in a "likely" area of the motion parameter space. We use the motion parameter estimates from previous frames in the series to define what is "likely". It transpires that the camera motion parameters are highly predictable from frame to frame. Figure 6 shows part of a real sequence of AR head motion data. The majority of the variation in the data can be predicted by a simple time-series model.

We denote the vector of motion parameters (the state) by x_k and its prediction from the previous frames by \bar{x}_k. We now minimize the function:

\varepsilon = \| z_k - \hat{z}(x_k) \|_{M_{meas}} + \alpha \| x_k - \bar{x}_k \|_{M_{prior}}        (7)

Where the notation \|a\|_B denotes the magnitude of the vector a measured using the distance metric B, i.e. a^T B a. The first term is the sum of the squared deviations of the measurements from the predictions. Since the predicted distance from the epipolar line/transferred point is always zero, the first term is identical to the function in equation (6). The second term is a regularization term that favors state vectors x_k which are close to the prior predicted value \bar{x}_k. Hence, even when the error surface due to the first term is flat near the minimum, the component due to the second term will ensure there is a distinct solution.


Figure 6. Camera translation was measured using ARToolkit for a 30 second sequence of an observer viewing a small VRML model on the desk in front of him. We fit a third-order ARMA model to the first half of the dataset. The plot above shows the values for one of the translation components from the second half of the dataset (solid line) together with predictions generated from this model (dotted line). The model successfully predicts 76% of the actual data variance. Similar results were found for the rotational motion components.

The constant α controls the relative contribution of the prior knowledge about the solution and the measured data.

We minimize the above expression over the 6 translation and rotation parameters, using the predicted value \bar{x}_k as a starting point. We use a Levenberg-Marquardt procedure to seek the current estimate \hat{x}_k. The Jacobian of expression (7) can either be estimated using finite differences or calculated explicitly as in [19].
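One way such a regularized minimization could be set up with an off-the-shelf Levenberg-Marquardt solver is sketched below. This is not the authors' code, only an illustration: measurement_residuals is a hypothetical stand-in for the stacked two-view and three-view deviations, and the weighted norms of equation (7) are realized by pre-multiplying residuals with matrix square roots of M_meas and M_prior.

```python
import numpy as np
from scipy.optimize import least_squares

def regularized_estimate(measurement_residuals, x_pred,
                         W_meas_sqrt, W_prior_sqrt, alpha):
    """Minimize equation (7) with Levenberg-Marquardt.

    measurement_residuals(x): hypothetical function returning the stacked
        two-view and three-view deviations for a 6-vector of motion parameters
        x (e.g. a Rodrigues rotation vector followed by the translation).
    x_pred: predicted state used both as prior and as the starting point.
    W_meas_sqrt, W_prior_sqrt: matrix square roots of M_meas and M_prior."""
    def residuals(x):
        r_meas = W_meas_sqrt @ measurement_residuals(x)
        r_prior = np.sqrt(alpha) * (W_prior_sqrt @ (x - x_pred))
        return np.concatenate([r_meas, r_prior])

    result = least_squares(residuals, x0=x_pred, method='lm')
    return result.x, result.jac   # estimate and Jacobian of weighted residuals
```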

5. Recursive estimation

Some readers may have noticed that the temporal regularization method described above is closely related to the Kalman filter. We now describe a recursive estimation method for the state vector. We represent the knowledge about the state of the system (motion parameters) at frame Vk in terms of the current state estimate \hat{x}_k and the uncertainty of that estimate, which is described by the covariance matrix P_k. We assume for simplicity that the prediction model is a linear function, B, of the previous state, \hat{x}_{k-1}. Deviations from this prediction are assumed to take the form of Gaussian noise with covariance Q. Given the state and covariance in the previous frame, \hat{x}_{k-1} and P_{k-1}, we use this model to predict the state \bar{x}_k in the current frame and estimate the uncertainty of this prediction, \bar{P}_k. We use this prediction to regularize the solution for the state, as in equation (7). Finally, we assess the gradient of the function around the minimum and use this as an estimate of the uncertainty of the state estimate. The complete algorithm can be summarized as follows:

State Prediction:       \bar{x}_k = B \hat{x}_{k-1}

Covariance Prediction:  \bar{P}_k = B P_{k-1} B^T + Q

State Update:           \hat{x}_k = \arg\min_{x_k} \varepsilon = \| z_k - \hat{z}(x_k) \|_{M_{meas}} + \alpha \| x_k - \bar{x}_k \|_{\bar{P}_k^{-1}}

Covariance Update:      P_k = (J^T \Sigma^{-1} J)^{-1}, \quad \Sigma = \begin{bmatrix} M_{meas}^{-1} & 0 \\ 0 & \bar{P}_k \end{bmatrix}        (8)

Here, J is the Jacobian of the minimization expression evaluated at the final position \hat{x}_k. The temporal regularization is very closely related to the Kalman filter algorithm. Indeed, it can be shown that the Kalman filter is an exact solution of the error metric in equation (7) for a linear system. The extended Kalman filter can be thought of as one iteration of a Gauss-Newton optimization of this error metric. Hence, the iterated extended Kalman filter also seeks the minimum of this function.
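A sketch of one cycle of this recursion under our own naming assumptions: regularized_solve stands in for the Levenberg-Marquardt minimization of equation (7) with M_prior set to the inverse of the predicted covariance, and is assumed to return the estimate together with the Jacobian J of the stacked (measurement, prior) residuals at the minimum.

```python
import numpy as np

def recursive_step(x_prev, P_prev, B, Q, M_meas, regularized_solve):
    """One predict/update cycle mirroring equation (8)."""
    # State and covariance prediction
    x_bar = B @ x_prev
    P_bar = B @ P_prev @ B.T + Q

    # State update: non-linear minimization started from the prediction
    x_hat, J = regularized_solve(x_bar, P_bar)

    # Covariance update: P_k = (J^T Sigma^{-1} J)^{-1}
    n_meas = J.shape[0] - x_hat.shape[0]      # measurement rows precede prior rows
    Sigma = np.zeros((J.shape[0], J.shape[0]))
    Sigma[:n_meas, :n_meas] = np.linalg.inv(M_meas)
    Sigma[n_meas:, n_meas:] = P_bar
    P = np.linalg.inv(J.T @ np.linalg.inv(Sigma) @ J)
    return x_hat, P
```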

5.1. Parameter estimation

In order to get optimal motion estimates, it is important to choose the parameters carefully. We estimate the measurement metric M_meas by considering the accuracy of our corner detection routine under noisy conditions (see Figure 7). We parameterize corner accuracy in terms of the RMS contrast of the local region, since this must be computed anyway in the feature matching stage. We assume that the noise on each corner position is independent, so the matrix M_meas is a diagonal matrix. Each entry is the reciprocal of the estimated corner position variance.

The simplest possible motion model is to assume that the position is the same as in the previous frame with some added jitter, which has covariance Q. In this case, the state prediction matrix B is simply the identity matrix.


Figure 7. Plot of corner accuracy as a function of the RMS contrast of the 15 × 15 pixel region around the corner. Accuracy was estimated from an ideal square corner contaminated with independent Gaussian noise (s.d. = 5) at each pixel. The variation in the corner estimate asymptotes at approximately 0.16 pixels. We use these data as estimates of our measurement accuracy.

We estimate the covariance Q by measuring a real sequence of head movements using ARToolkit [1]. We have also experimented with using second-order ARMA models to predict the head position and orientation, with improved results. Again, we estimate the parameters of the model from pre-captured head-movement records.

The parameter α controls the relative importance of the data and the prior model. For small values, the motion estimates are noisy, but for large values they become overly smooth and fail to capture real variation in the head position. In principle, α should be set to one, but we have found that larger values are useful as they make the system more robust to outlying measurements.

6. Correspondence problem

We now turn to the problem of matching corners across multiple images. For each incoming frame Vk, we must identify which corners correspond to which in each of the reference frames.

An effective method for stereo matching is to use the epipolar constraint (see [10] for a detailed description) combined with the robust statistical procedure RANSAC [6]. An initial set of point matches is made. A minimal subset (8 point matches) of this initial set is used to calculate the epipolar geometry between the two images. This proposed geometry is assessed by considering how many of the other matches are in agreement. We repeat this procedure until a set with high support is found. It is assumed that all of the matches that are in agreement with this geometry are correct.

The success of this approach depends largely on how many correct inliers there are in the initial set of matches. Unfortunately, although matching to fixed reference images for each frame removes the problem of a gradual drift in the position of the virtual object, it somewhat aggravates the correspondence problem. Instead of matching to the previous frame in the image sequence, we now need to match to these reference images, which may be far from the current position. Under these "wide baseline" conditions, the initial set of matches may be very poor, causing the method to fail. Alternative matching techniques (e.g. [18]) can be used, but their computational cost is too high for real-time implementation.
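For reference, the standard RANSAC step described above can be sketched with OpenCV's built-in robust fundamental-matrix estimator (this is not the authors' implementation; pts_ref and pts_cur are hypothetical N x 2 arrays of putative corner matches):

```python
import cv2
import numpy as np

def robust_epipolar_matches(pts_ref, pts_cur, reproj_thresh=1.0):
    """Fit the epipolar geometry with RANSAC and keep only the inlier matches."""
    F, inlier_mask = cv2.findFundamentalMat(
        pts_ref, pts_cur, cv2.FM_RANSAC, reproj_thresh, 0.99)
    inliers = inlier_mask.ravel().astype(bool)
    return F, pts_ref[inliers], pts_cur[inliers]
```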

Figure 8. Given a feature in the current image Vk, we aim to find the corresponding point in the reference image VB. To help constrain the search, we construct an indirect mapping from Vk to Vk−1 and then from Vk−1 to VB. We can approximate the former mapping by a homography, since the camera translation is likely to be very small between the frames. We already have a list of point correspondences between the previous frame and the reference. We interpolate between these known correspondences to estimate the new position. We now consider corner matches only in the region of this estimate (white square).

We propose a novel method to dramatically increase the proportion of initially correct matches, which exploits the fact that each frame is part of a time series (see also Figure 8). We assume that the previous frame has been matched correctly to each of the reference frames.

Figure 9. Example results. Image x-position of the cube vertex relative to the desired position over one hundred frames. The desired position was assessed by hand. Apart from a constant offset due to an initial error in the position, the cube vertex closely follows the desired trajectory.

We first construct a one-to-one mapping from the current frame to the previous frame by calculating the best homography between the images. Although the change in image feature position cannot be exactly described by a homography, it is a reasonable approximation, as the camera movement is very small over the time period of a single frame. In practice, it is always possible to compute this relation.

We then construct a one-to-one mapping between the previous frame and the reference frame by treating the known matches between these frames as samples of a continuous vector-valued function and interpolating between these samples. The simplest way to do this is to assume that each point in the previous image has the same 2-D image position change as the nearest known match when it is transferred to the reference image. This is approximately equivalent to assuming that the local depth in the image varies smoothly. By chaining these two mappings, we can transfer any corner in the current image to an approximate position in the reference frame.

If we only consider corners in the reference image that are in the neighborhood of this prediction as potential matches, we can considerably increase the initial proportion of good matches. The particular initial match is chosen so as to maximize the cross-correlation of the area surrounding the corners. In this way, we typically generate sets of initial matches which are already ∼80% correct. This decreases the number of trials required in the RANSAC procedure and reduces the possibility of any remaining false matches.
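The two-stage transfer of Figure 8 might be sketched as follows (illustrative only, with our own names and assumptions; the homography step uses OpenCV and the nearest-match lookup uses a k-d tree):

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree

def predict_reference_positions(corners_cur, matched_cur, matched_prev,
                                prev_matched, ref_matched):
    """Approximate reference-image positions for corners of the current frame V_k.

    corners_cur  : N x 2 corners in V_k whose reference positions we want
    matched_cur  : K x 2 corners in V_k with known correspondences matched_prev
                   (K x 2) in the previous frame V_{k-1}
    prev_matched : L x 2 corners in V_{k-1} already matched to the reference
    ref_matched  : L x 2 corresponding corner positions in the reference image
    """
    # Step 1: V_k -> V_{k-1}. A homography is a reasonable approximation because
    # the camera motion over a single frame is very small.
    H, _ = cv2.findHomography(matched_cur, matched_prev, cv2.RANSAC, 3.0)
    cur_in_prev = cv2.perspectiveTransform(
        corners_cur.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)

    # Step 2: V_{k-1} -> reference. Each transferred corner borrows the 2-D
    # displacement of its nearest known match (i.e. piecewise-constant
    # interpolation of the displacement field, assuming smoothly varying depth).
    offsets = ref_matched - prev_matched
    _, nearest = cKDTree(prev_matched).query(cur_in_prev)
    return cur_in_prev + offsets[nearest]
```

The returned positions define the search regions (the white square of Figure 8) within which candidate corners are compared by cross-correlation.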

However, one limitation of the proposed matching method is that the initial camera position should be close to one of the reference frames so that we do not have a "wide baseline" problem for the first image frame.

7. Results

The system currently runs at ∼10 Hz on a fast desktop PC with 320×240 pixel images. Typically, we find about 60 two-view matches to each reference frame and 30 three-view matches across all three frames. We have successfully tracked the camera position over sequences as long as 600 frames. In one sequence, we aligned the virtual cube with the corner of a card marker. We compare the true position of the corner to the predicted position from our tracking system. The first 100 frames of these data are depicted in Figure 9. Our system correctly follows the position of the corner and maintains the alignment extremely well. Video footage of several sequences can be downloaded from [2].

8. Discussion

To summarize, we have presented a system for natural feature tracking in augmented reality. We estimate the current camera position relative to pre-captured reference frames by matching corners across the frames and minimizing a cost function based on two-view and three-view relations. We apply a method based on time series to stabilize these estimates.

Only a few previous attempts have been made to implement online 6 d.o.f. motion tracking based on natural features alone in a general environment (e.g. [19]). To the best of our knowledge, all of these methods suffer from the inevitable drift that results from chaining together camera transformations along the time series. We remove this problem by always estimating the camera pose relative to fixed reference images. This aggravates the correspondence problem, but we resolve this difficulty by using information derived from previous correspondences in the time series.

A strong advantage of this system is scalability. It is quite possible to store more than two reference images. By accurately pre-calculating the geometric relationship between a large number of such images, we could potentially perform natural feature tracking over wide areas. The current camera position could be used to determine which of the many possible reference images to match to.

Moreover, the system can work with completely general scenes, but does not fail if the scene structure should become degenerate (e.g. all points fall on a plane). A further advantage is the robustness of the system: since the matching is based on the RANSAC method, it is tolerant to changes or movement in parts of the image. Hence, the only requirements are that the scene be mostly static and rigid and contain enough texture to reliably identify corners across images.

Acknowledgements

This work was funded by the Defence Science and Technology Agency (DSTA), Land Systems Division, Singapore, http://www.dsta.gov.sg.

References

[1] http://www.hitl.washington.edu/artoolkit/.

[2] http://mixedreality.nus.edu.sg/.

[3] S. Avidan and A. Shashua. Threading fundamental matrices. Proceedings of the 5th European Conference on Computer Vision, pages 124-140, June 1998.

[4] T. J. Broida, S. Chandrashekhar, and R. Chellapa. Recursive 3-D motion estimation from a monocular image sequence. IEEE Trans. Aerospace and Electronic Systems, 26(4):639-656, July 1990.

[5] K. Cornelis, M. Pollefeys, M. Vergauwen, and L. Van Gool. Augmented reality using uncalibrated video sequences. Technical Report KUL/ESAT/PSI/0002, PSI-ESAT, K.U. Leuven, Belgium, 2000.

[6] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. Assoc. Comp. Mach., 24:381-395, 1981.

[7] A. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open image sequences. Proceedings of the 5th European Conference on Computer Vision, pages 311-326, June 1998.

[8] C. J. Harris and M. Stephens. A combined corner and edge detector. Proc. 4th Alvey Vision Conference, pages 147-151, 1988.

[9] R. Hartley. In defense of the eight-point algorithm. IEEE PAMI, 19:580-593, October 1997.

[10] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[11] H. Kato and M. Billinghurst. Marker tracking and HMD calibration for a video-based augmented reality conferencing system. Proc. IWAR, pages 85-94, October 1999.

[12] U. Neumann and Y. Cho. A self-tracking augmented reality system. Proc. VRST, pages 109-115, July 1996.

[13] U. Neumann and S. You. Natural feature tracking for augmented reality. IEEE Transactions on Multimedia, 1(1):53-64, 1999.

[14] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. Proc. ISAR, pages 120-128, 2000.

[15] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, pages 298-372, September 1999.

[16] S. You, U. Neumann, and R. Azuma. Orientation tracking for outdoor augmented reality registration. IEEE Computer Graphics and Applications, 19(6):36-42, November 1999.

[17] Z. Zhang. A new multistage approach to motion and structure estimation: from essential parameters to Euclidean motion via fundamental matrix. Technical report, INRIA, June 1996.

[18] Z. Zhang, O. Faugeras, and Q. T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Technical report, INRIA, May 1994.

[19] Z. Zhang and Y. Shan. Incremental motion estimation through local bundle adjustment. Technical Report MSR-TR-01-54, Microsoft Research, May 2001.
