

Object Tracking by Comparing Multiple Viewpoint Images to CG Images

Takayuki Moritani, Shinsaku Hiura, and Kosuke Sato

Graduate School of Engineering Science, Osaka University, Toyonaka, 560-8531 Japan

SUMMARY

There has recently been much research into real-time tracking of object location and orientation. However, many current methods use explicit image features such as corners, edges, or silhouettes, and are therefore not well suited to tracking curved, smooth objects from which it is difficult to extract features. On the other hand, with methods that estimate the movement of an object from the optical flow obtained from the pixel intensity gradient, it is difficult to avoid errors accumulated during tracking. Therefore, in this paper we propose a method for estimating the movement of an object with six degrees of freedom without extracting features from the images. In particular, the difference between real images and CG (computer graphics) images is minimized based on the principle of intensity gradient methods. The CG images are generated from an object model consisting of shape and color information together with the location and orientation parameters of the object. We show that by using images obtained simultaneously from multiple cameras at different angles, the stability of this tracking method is improved. We present experiments in which we construct an object tracking system that uses multiple cameras and show that we can perform object tracking with six degrees of freedom at approximately five frames per second for a rigid, arbitrarily curved object from which it is difficult to extract features. © 2006 Wiley Periodicals, Inc. Syst Comp Jpn, 37(13): 28–39, 2006; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20546

Key words: object tracking; multiple viewpoint images; intensity images; intensity-gradient methods; model-based analysis.

© 2006 Wiley Periodicals, Inc. Systems and Computers in Japan, Vol. 37, No. 13, 2006. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J88-D-II, No. 5, May 2005, pp. 876–885.

1. Introduction

Image-based measurement systems have already been used for many tasks in constrained environments such as industrial automation. In recent years, however, as humanoid technology progresses and greater importance is placed on crime prevention and security, demand has grown for image analysis technology that can deal with a variety of objects in more general environments. At the same time, computational technology has developed significantly, and it is now possible to process large volumes of highly redundant data. Consequently, it has become possible to use computers to solve noisy and highly nondeterministic problems, and there is now a demand for flexible new video image analysis methods that make broader use of this capability. In this context, the object tracking task, in which changes in an object's location and orientation are calculated from images, has become a core focus of attention and has been tackled by many researchers to date.

In general, range images are effective for obtaining the geometrical information that describes an object's location and orientation in a three-dimensional scene. In previous work we have shown that range images can be used effectively for this task [2, 3], performing object tracking with six degrees of freedom in real time for arbitrarily curved rigid objects using a device [1] that can measure range images at several tens of frames per second. In addition, Sumi and colleagues [5] have performed real-time object tracking with six degrees of freedom by applying stereo image processing with the high-performance three-dimensional vision system VVV [4]. However, range finders that can obtain range images at high speed are not widely available, and there are many cases in which stable range images cannot be obtained from passive stereo measurements because of the shape or texture of the target object.

For these reasons, approaches based on intensity images, which can be obtained in a short period of time, are more generally employed for object tracking. However, much of the research in this area tracks only the location of a target object and does not extend to detecting the object's orientation [6]. Many researchers have used more detailed object models in order to detect both location and orientation. Armstrong and Zisserman performed real-time object tracking with six degrees of freedom for polyhedral objects with well-defined edges by comparing edge information extracted from intensity images to a three-dimensional shape model [7]. In addition, Drummond and Cipolla have performed real-time object tracking by extracting edges using multiple cameras, showing that images obtained from multiple viewpoints are effective for this task [8]. From a different perspective, Horn and Weldon have proposed an intensity gradient-based method that calculates object motion without extracting features from the target object [9]. However, because this method has no model of the target object, tracking errors accumulate as tracking proceeds, and it is difficult to calculate changes in the orientation of the object.

We can consider these different methods within Marr's schematic model of visual information processing [11] (see Fig. 1). Horn and Weldon's method attempts to estimate the highest-order information from the images in a bottom-up fashion, as shown by path (a) in Fig. 1. Since it makes no use of knowledge about the target object, it is difficult for this method to track objects stably. In comparison, the methods of Armstrong and Zisserman and of Drummond and Cipolla, shown by path (b) in Fig. 1, compare the distribution of features extracted from an image with that generated by a model and adjust the location and orientation parameters until the two match. In other words, these methods are a type of model-based analysis, and since they avoid difficult subtasks such as feature interpretation, they are comparatively stable, although, like the above methods, they require the extraction of features from the images. Unfortunately, it is not always possible to extract features stably, and the ease of feature extraction can vary significantly with the target object. In particular, feature extraction is difficult to apply to arbitrarily curved objects with smooth texture, such as human bodies.

However, in recent years top-down methods have become more prevalent, and verification at the image level, using the image synthesis process itself as the model, has become more widely accepted. This has been proposed as appearance-based analysis, motivated by its practical applicability and by considerations related to the structure of the visual cortex and neurocomputation [10]. In this paper we therefore propose a method for estimating the movement of a target object over a short time interval by directly comparing the brightness differences that arise from movement between images generated from the model and the actual images obtained from the cameras. The method uses raw intensity images as obtained from standard cameras and applies no feature extraction to them. In other words, as shown by path (c) in Fig. 1, the method avoids restrictions on the target and ill-posed subproblems by performing processing at a lower level; as a result we believe it is more versatile in the range of objects to which it can be applied. However, since a single viewpoint may show little sign of some movements of an object, we extend the method to multiple viewpoint images and show that this increases the stability with which objects can be tracked.

Fig. 1. Visual information processing.

2. Principles of Object Tracking

2.1. Estimating motion parameters using a gradient-based method

Variation in pixel intensity between successive frames of a captured video sequence arises when the object being tracked moves through the environment. Here we assume that the distribution of pixel intensities on the surface of the target object does not change of itself over time, and that such variation is due only to movement. Based on this assumption, gradient-based methods estimate optical flow from the variation in pixel intensity. Let the brightness of point (x, y) on an image at time t be represented by I(x, y, t), and assume that after an increment of time δt this point has moved to (x + δx, y + δy). Then the following equation holds:

    I(x + δx, y + δy, t + δt) = I(x, y, t)                                (1)

By making a first-order approximation using a Taylor series expansion and rearranging, we obtain the following optical flow constraint equation [12]:

    (∂I/∂x)(dx/dt) + (∂I/∂y)(dy/dt) + ∂I/∂t = 0                           (2)

This equation expresses, for each point on the image plane, the relationship between the optical flow and the variation in pixel intensity. Since there are two flow parameters to be calculated at each point of the image but only one intensity value per point, it is well known that this equation cannot be solved without some additional constraint. However, if we consider the problem of tracking a rigid object, then although the movement of each point is not identical, the rigidity of the target constrains the movements with respect to one another. Therefore, below we consider the relationship between the movement of the object and the variation in pixel intensity more directly.

Assuming there is no variation in the lighting conditions or the surrounding environment, the appearance of a rigid object in three-dimensional space is uniquely determined by the six parameters defining its location and orientation, and a single intensity image corresponds to each such pose. If we define image space as an N-dimensional space in which an image of N pixels is expressed as a vector of N intensity values, then an image is a single point in this space. In other words, a single point in image space is determined by the six parameters describing the location and orientation of the target object. Let I(p1, . . . , p6) be the image vector in image space at time t. If we assume that the object's movement is continuous and that its surface texture is smooth, then the trajectory of the image vector is also continuous and smooth over time. In this case, the following equation holds, in a similar way to the optical flow constraint of Eq. (2):

    dI/dt = Σi (∂I/∂pi)(dpi/dt),   i = 1, . . . , 6                       (3)

Here we quantize the temporal axis and approximate using finite differences. That is, over an increment of time the parameters expressing the location and orientation of the target object change by ∆p1, . . . , ∆p6, and, letting the corresponding variation in the image be ∆I, we obtain the following approximation:

    ∆I ≈ Σi (∂I/∂pi) ∆pi                                                  (4)

∆I is obtained from the variation in pixel intensity between successive images. Since in general the number of pixels N is much larger than the number of degrees of freedom of movement, if ∂I/∂pi can be obtained, the six movement parameters can be calculated directly.

Comparing Eqs. (2) and (3), their signs differ, but this is simply because in Eq. (2) the rate of change in pixel intensity produced when the target moves a unit distance along the X direction has the opposite sign to the spatial gradient of the image along the X axis. Assuming that the flow is uniform over the image when evaluating Eq. (2) corresponds, in Eq. (3), to assuming that the target is parallel to the image plane and moves parallel to that plane.

2.2. Determining the movement parameters using CG images

In order to calculate the object's location and orientation parameters from the variation in pixel intensity, the partial derivatives on the right-hand side of Eq. (4) must be estimated numerically. We assume that CG images can be accurately generated from a model of the target object for arbitrary locations and orientations, that is, that we can freely compute the image vector I(p1, . . . , p6) in image space for arbitrary parameters. The partial derivative coefficients are then calculated with the following approximation using a finite increment δ:

    ∂I/∂pi ≈ [ I(p1, . . . , pi + δ/2, . . . , p6) − I(p1, . . . , pi − δ/2, . . . , p6) ] / δ          (5)

We now describe the method we use for calculating the movement parameters. The relationship between the world coordinate system in which the target object moves and each camera coordinate system is first calibrated, and we assume these relations are known. Let I[t] be the CG image created from the CG model of the target object at the location and orientation it has at time t. Taking the location and orientation of the CG model at that point as a base, let the CG images created by a virtual parallel translation of a small increment δt/2 in the positive and negative directions along the X axis of the world coordinate system be I+Xt and I−Xt, respectively. Similarly, let the CG images created by parallel translations along the Y and Z axes in both directions be I+Yt, I−Yt, I+Zt, and I−Zt, respectively, and let the images generated by rotations of a small increment δr/2 in the positive and negative directions about each axis be I+Xr, I−Xr, I+Yr, I−Yr, I+Zr, and I−Zr, respectively. Taking the input from the cameras after an increment of time δt as I[t+δt], we obtain the following equation based on Eq. (4):

    I[t+δt] − I[t] ≈ (I+Xt − I−Xt)/δt · ∆Xt + (I+Yt − I−Yt)/δt · ∆Yt + (I+Zt − I−Zt)/δt · ∆Zt
                    + (I+Xr − I−Xr)/δr · ∆Xr + (I+Yr − I−Yr)/δr · ∆Yr + (I+Zr − I−Zr)/δr · ∆Zr          (6)

Here ∆Xt, ∆Yt, ∆Zt, ∆Xr, ∆Yr, and ∆Zr are estimates of the parallel translation and rotation along each axis, and the target's movement can be estimated by calculating these values.

Here if we let

    D = I[t+δt] − I[t]
    G = [ (I+Xt − I−Xt)/δt  (I+Yt − I−Yt)/δt  (I+Zt − I−Zt)/δt  (I+Xr − I−Xr)/δr  (I+Yr − I−Yr)/δr  (I+Zr − I−Zr)/δr ]
    E = (∆Xt, ∆Yt, ∆Zt, ∆Xr, ∆Yr, ∆Zr)^T                                  (7)

then

    D = G E                                                               (8)

holds. Here G is an N by 6 matrix and therefore not square, so E is calculated by the method of least squares. Premultiplying both sides by G^T, we obtain the following normal equations:

    G^T D = G^T G E                                                       (9)

When the inverse of G^T G exists, the following least squares solution can be calculated:

    E = (G^T G)^{-1} G^T D                                                (10)

In this way we can estimate the movement parameters that express the amount of motion along each axis. By combining the calculated movement parameters with the location and orientation of the CG model, we can express the current location and orientation of the target and, in addition, use them to create the next CG images. Because the rotation angle in a single iteration is assumed to be minute and, as described below, the error is minimized by iterative computation, we do not need to consider the order of composition of the rotations.

Here the image error vector D is not the difference between two successive input images but the difference between an input image I[t+δt] and a CG image I[t]. Therefore, even when there are errors in the estimated movement due, for example, to image noise or violations of the linearity assumption, these errors are not accumulated, and by iterating the computation the CG model approaches the location and orientation of the target object. Whether local optima exist depends on the surface and texture of the target object as well as its orientation, so it is difficult to discuss this issue theoretically; however, if the initial value is sufficiently close to the true value, they do not pose a problem.
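To make the update step concrete, the following is a minimal sketch in Python/NumPy of Eqs. (5)–(10) for a single camera. The function render() is a hypothetical stand-in for the CG image generator described above (not part of the paper's implementation), and the pose vector is assumed to be ordered (Xt, Yt, Zt, Xr, Yr, Zr).

```python
import numpy as np

def render(params):
    """Hypothetical CG generator: returns an intensity image of the model at the given pose."""
    raise NotImplementedError("stand-in for the CG image synthesis step")

def estimate_update(params, observed, delta_t=0.005, delta_r=0.01):
    """One linearized update: build G by central differences (Eq. (5)) and
    solve D = G E in the least-squares sense (Eqs. (8)-(10))."""
    base = render(params).ravel().astype(float)
    D = observed.ravel().astype(float) - base            # image error vector D

    steps = [delta_t] * 3 + [delta_r] * 3                 # translation / rotation increments
    cols = []
    for i, d in enumerate(steps):
        plus, minus = params.copy(), params.copy()
        plus[i] += d / 2.0                                 # virtual move in the + direction
        minus[i] -= d / 2.0                                # virtual move in the - direction
        cols.append((render(plus).ravel() - render(minus).ravel()) / d)
    G = np.stack(cols, axis=1)                             # N x 6 matrix of Eq. (7)

    E, *_ = np.linalg.lstsq(G, D, rcond=None)              # least-squares solution, Eq. (10)
    return params + E                                      # small motions composed additively
```

In tracking, this update would be applied repeatedly to each new frame so that the model pose converges toward the observed pose, as described above.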

3. Estimating Movement Using Multiple Viewpoint Images

3.1. The necessity of multiple viewpoint images

In general it is difficult to accurately estimate the movement of an object in three-dimensional space using only a single viewpoint. An example of movement that is difficult to measure from a single viewpoint is shown in Fig. 2, where the target moves along the optical axis. In particular, when the distance between the camera and the target is much larger than the distance the target moves, in other words when the projection is close to orthographic, there is only a slight change in the apparent size of the object, which does not reflect its true movement, and there is insufficient cue in the images. In other words, movement along the optical axis produces little change in appearance.

Fig. 2. Effectiveness of multiple viewpoint images.

In our proposed method we use a three-dimensional model to generate CG, but to estimate the movement of the object we use only the intensity images from the cameras; unlike a range image, no pixel carries explicit information about the distance to the object or its shape. Therefore, the method faces the problem that if there is no significant variation in the images input from the camera, or if for some reason the movement corresponding to a particular variation in the images cannot be determined uniquely, the movement estimate tends to become inaccurate and unstable. We can deal with these issues, however, by using multiple viewpoint images from multiple cameras, since motion that is difficult for one camera to detect will be detected more accurately by another.

Stereo vision is a widely used measurement method that employs multiple cameras, but our proposed method performs no image processing that relies on relations between images, such as searching for corresponding points. Therefore, the cameras can be positioned without constraints with respect to one another; in particular, they need not be placed close together as in the stereo method. Rather, we believe it is best to arrange the cameras so as to obtain the greatest variation among their images.

3.2. Motion estimation by combining information from multiple viewpoint images

Based on the principle described in the previous section, we can calculate the movement parameters by combining information from images obtained from multiple cameras. Calculating the movement parameters of the target object with our proposed method can be regarded as a minimization problem solved by Newton's method with numerically approximated derivatives. In other words, we calculate the difference between the CG images generated by the model and the input images obtained from the cameras, and perform object tracking by iteratively adjusting the location and orientation of the model so as to minimize this error. Extending the approach directly to multiple viewpoint images, it is natural to calculate the error between the model's image and each camera's image in the same way as before, and to estimate the movement parameters by minimizing the total of these errors. This can be realized simply by combining the images obtained from all of the cameras and applying the same parameter estimation principle as before, treating them as one large image.

First we consider the case where n input images are obtained from n cameras. Given the matrices Di and Gi obtained by applying Eq. (7) to the input image from the i-th camera (i = 1, . . . , n), we can create the new matrices Dall and Gall by simply stacking these as

    Dall = [ D1^T  D2^T  · · ·  Dn^T ]^T,    Gall = [ G1^T  G2^T  · · ·  Gn^T ]^T          (11)

By substituting both of these into Eq. (10), we can compute the movement parameters that minimize the sum of squared intensity errors over every pixel of the combined image.
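A minimal sketch of this stacking (Eq. (11)), assuming the per-camera D_i and G_i have already been computed as in the single-view case:

```python
import numpy as np

def estimate_multiview(D_list, G_list):
    """Stack per-camera error vectors and finite-difference matrices and solve jointly."""
    D_all = np.concatenate(D_list)       # all error vectors, one after another
    G_all = np.vstack(G_list)            # (sum of N_i) x 6 matrix
    E, *_ = np.linalg.lstsq(G_all, D_all, rcond=None)
    return E                             # combined least-squares estimate of the motion
```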

3.3. Acceleration by distributed processing

One practical problem in performing object tracking with multiple cameras is that there is a limit to the number of cameras that can be connected to a single computer. In addition, as the number of viewpoints increases, the amount of data to be processed grows dramatically, and a high-performance computer becomes essential. To deal with this situation, distributed computer vision systems have been proposed that pair computers and cameras one-to-one and connect them via a network [13, 14]. The advantages of such a system are that it is easy to increase the number of cameras and that the overall computational performance does not degrade as cameras are added. In this paper we also construct a distributed camera system by connecting one camera to each computer and joining the computers via a network.

The Di described in the previous section is an N-dimensional vector (N is the number of pixels) and Gi is an N by 6 matrix; these represent extremely large amounts of data. Therefore, if the movement parameters are calculated by combining the images into a single image, the communication load becomes large because of the huge amount of data that must be transferred, and the tracking speed drops if all the computation is performed on a host computer. We therefore reduce the computational load and the communication overhead by distributed processing. First, we compute Gi^T Di and Gi^T Gi on each machine from the Di and Gi obtained there; the resulting vector and matrix are then sent to the host computer. The host computer then calculates the movement parameters that minimize the sum of squared errors using the following equation, obtained by substituting Eq. (11) into Eq. (10) and expanding:

    E = ( Σi Gi^T Gi )^{-1} ( Σi Gi^T Di ),   i = 1, . . . , n            (12)

Since Gi^T Di is a six-dimensional vector and Gi^T Gi is a 6 by 6 matrix, we obtain a significant reduction in communication overhead. In addition, the computation performed on the host computer consists only of small procedures such as inverting a 6 by 6 matrix, so the processing is distributed at the same time. A comparison of the computational load and the amount of data communicated under the different processing methods is shown in Fig. 3.

Fig. 3. Comparison of computation and communication involved in centralized and distributed processing.
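The following sketch illustrates the distributed form of Eq. (12). Each camera node reduces its data to a 6 by 6 matrix and a 6-vector, and the host only sums these and solves a 6 by 6 system. Network transport is omitted, and the function names are illustrative, not from the paper.

```python
import numpy as np

def node_message(G_i, D_i):
    """Computed on each camera node: only these small quantities are transmitted."""
    return G_i.T @ G_i, G_i.T @ D_i        # 6 x 6 matrix and 6-vector

def host_estimate(messages):
    """Computed on the host: accumulate the per-node terms and solve for E (Eq. (12))."""
    GtG = sum(m for m, _ in messages)      # sum of Gi^T Gi
    GtD = sum(v for _, v in messages)      # sum of Gi^T Di
    return np.linalg.solve(GtG, GtD)       # movement parameters E
```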

3.4. Evaluating the effectiveness of using multiple viewpoint images

In this paper, since we estimate the movement parameters by the method of least squares, a solution to Eq. (10) must exist. In other words, since G is an N by 6 matrix (N ≥ 6), rank G = 6 is a condition for the existence of a solution. Here G is given by Eq. (7); its column vectors are the changes in the computer-generated image produced by virtual movement along each axis. In general, under perspective projection, the variations in the image due to parallel translation and rotation of the object are independent of one another, so this condition holds. However, in the single-viewpoint situation described in Section 3.1, some column vectors of G are close to the zero vector, or some columns are nearly linearly dependent. When the matrix contains column vectors that are nearly linearly dependent, the solution becomes overly sensitive to additive error in D, the calculated movement parameters become unstable, and the matrix G itself can be said to be unstable.

In this context, in order to discuss the stability of the movement parameters, it is important to evaluate G, and in particular G^T G, since this determines whether the inverse exists. Here we use the condition number, which has been used as an index of the characteristics of a matrix in Ref. 15. When the eigenvalues of the 6 by 6 real symmetric matrix G^T G are λi (i = 1, . . . , 6), the condition number (CN) is defined by

    CN = max_i λi / min_i λi                                              (13)

When G is an unstable matrix, the condition number is large, since one of the eigenvalues of G^T G is close to zero. Conversely, when the column vectors of G form an orthonormal basis, all the eigenvalues are equal and the condition number is 1; in this case we obtain a stable solution subject only to the smallest level of error. In other words, we can use the CN to evaluate the stability of the estimated movement parameters.
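A short sketch of the condition number of Eq. (13), computed from the eigenvalues of G^T G:

```python
import numpy as np

def condition_number(G):
    """CN = largest eigenvalue of G^T G divided by the smallest (Eq. (13))."""
    eig = np.linalg.eigvalsh(G.T @ G)      # eigenvalues of the symmetric 6 x 6 matrix
    return eig.max() / eig.min()           # a large value indicates an unstable estimate
```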

4. Object Tracking Experiments and Evaluations

4.1. System architecture and object tracking processing

The architecture of the object tracking system used in this paper is shown in Fig. 4. The system consists of four cameras (Sony EVI-G20) and a projector (Epson ELP-703); a range finder is constructed from one camera and the projector. One camera and the projector are controlled by a single computer (see Table 1), and we create a cluster made up of four such computers. The space within which object tracking is performed is approximately 1 m² in area with a height of 0.6 m. The distance from the center of this space to the cameras is 2.4 m.

Fig. 4. Multicamera system architecture.

Table 1. Basic specifications of the computer and the programming environment

We divide the processing involved in tracking a moving object into preprocessing (offline processing) and the object tracking itself (online processing). In preprocessing we first obtain the camera and projector parameters by calibration using reference objects. Each set of parameters is represented by a 3 by 4 matrix; these describe the relationship between the spatial coordinate system determined by the reference objects and the camera coordinate systems, as well as that between the projector coordinate system and the camera coordinate systems. We then take three-dimensional shape measurements using the combination of the projector and a camera as a range finder, and at the same time acquire the texture of the target object from the images obtained by the cameras. We use gray code pattern projection [16] to obtain the shape measurements. To extract the target object from its background we use the difference in pixel intensity between the object and the background; likewise, we exclude from the model dark areas and areas of shadow on the target object caused by the projector lighting.
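For reference, the 3 by 4 matrices recovered by this calibration map homogeneous world points to image points in the usual way; the following sketch shows that mapping (the matrix values are whatever calibration produces, not values from the paper):

```python
import numpy as np

def project(P, X_world):
    """Project a 3D world point through a calibrated 3 x 4 projection matrix P."""
    u, v, w = P @ np.append(X_world, 1.0)   # homogeneous image coordinates
    return np.array([u / w, v / w])          # pixel coordinates
```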

Next we describe the flow of processing involved in object tracking (Fig. 5). In the method proposed in this paper, when the location and orientation of the actual object and of the CG model are the same, the CG image and the input image from the cameras should ideally match exactly, both geometrically and in pixel intensity. We simplify the problem of achieving this match by using the same cameras for object tracking that were used to generate the target model. Specifically, geometric matching is achieved without sacrificing speed by setting the coordinate transformation matrix in the graphics hardware to the camera parameters obtained during preprocessing when drawing the computer graphics. In addition, the matching of pixel intensities is simplified by texture mapping the camera image directly onto the CG model's surface. This approach is simpler in that it requires neither identification of the reflection properties of the target object nor intensity calibration of the sensor, but it means that a point on the model is always drawn with the same pixel intensity regardless of the object's orientation. This would not be a problem if the lighting conditions were uniform and the object's surface produced only diffuse reflections, but this assumption does not hold for the objects we use and our experimental environment. This point is discussed in more detail in Section 4.5.

Fig. 5. Flowchart of object tracking processing.

Next, we calculate the motion parameters so that the CG image matches the input image via the difference image computed between them. At this stage, because the outline of the object moves when the object moves, pixels arise in the vicinity of the outline that correspond to the object in one of the two differenced images and to the background in the other. We exclude such pixels: when generating the CG images we simultaneously create an image mask representing the region drawn by the CG model, and by using only the region given by the logical product of the masks when estimating the movement parameters, we can perform object tracking without being affected by the background.
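A minimal sketch of this masking, assuming boolean masks of the rendered region are available for the two images being differenced (the variable names are illustrative):

```python
import numpy as np

def masked_difference(input_image, cg_image, mask_a, mask_b):
    """Use only pixels covered by the model in both masks (their logical product)."""
    valid = np.logical_and(mask_a, mask_b)
    D = (input_image.astype(float) - cg_image.astype(float))[valid]
    return D, valid                          # error vector restricted to valid pixels
```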

Finally, once the movement parameters have been calculated, the CG model is moved by that amount. Then, by repeating the same procedure with each new image, we can track the object.

4.2. Evaluating the effectiveness of multiple viewpoint images

We calculated G and then G^T G using images obtained from a single viewpoint and from multiple viewpoints, respectively, and evaluated the stability of the estimated movement parameters using the condition number described in Section 3.4. The results are shown in Table 2, where δ is the amount of virtual movement applied to the CG model when generating the CG images. The condition number is smaller when multiple viewpoint images are used, regardless of the value of δ. Moreover, the stability of the parameters obtained by the method of least squares was also increased by the use of multiple viewpoint images, and we can therefore conclude that this approach enables accurate and stable object tracking.

4.3. Object tracking results

We performed object tracking experiments using the multiple camera system described in the previous section. The object used in these experiments is shown in Fig. 6. Its dimensions are 42 cm by 28 cm by 42 cm (width × depth × height); it is composed of smooth, unstructured curved surfaces, has no strong features, and its texture is smooth, making it a difficult object to track with standard feature-based methods. The images obtained from each of the cameras are shown in Fig. 7; the results of object tracking using all the images are shown in Fig. 8. In each image the estimated location and orientation of the CG model are drawn as a wire frame. Figure 8(a) shows the initial position; Fig. 8(b) shows tracking of a parallel translation within the X–Y plane combined with a rotation; Fig. 8(c) shows tracking of rotations around arbitrary axes. The system succeeds in tracking this solid object with six degrees of freedom of movement. The frame rate with an image size of QVGA (320 × 240) was approximately 5 frames per second.

Fig. 8. Results of object tracking.

4.4. Accuracy evaluations

First we evaluated object tracking under parallel translations using a slide stage. We translate the object 10 cm along one of the axes from the initial position and record the estimated distance moved once the tracking algorithm converges. We then translate it a further 10 cm from this position and record the estimate again. Repeating this process, we record the estimated distance moved at 10 cm intervals up to a total of 80 cm. These results are shown in Fig. 9(a). We then perform a similar evaluation with rotations using a turntable. Starting from the base orientation, we rotate the object in 2 degree steps up to 16 degrees around the Z axis, which is perpendicular to the bottom face of the object and passes through its center of gravity, recording the estimated rotation at each step. These results are shown in Fig. 9(b). In each graph the horizontal axis shows the true distance moved or angle rotated and the vertical axis shows the corresponding estimated value.

The error in estimating translations was within 1.5 cm, and for rotations it was within 0.3 degree. This can be regarded as a relatively high level of accuracy considering the size of the space within which tracking was performed, the distances between the cameras and the object, and the size of the object itself. Since this method continually minimizes the error between the CG images generated by the model and the input images, estimation errors do not accumulate. In addition, when the object is stationary, or when the processing speed is sufficient to keep up with the object's movement, accuracy improves further because the CG model converges to the correct location.

Fig. 9. Accuracy evaluations.

Table 2. Comparison of the condition number of GTG

Fig. 6. Object used in tracking experiments.

Fig. 7. Images obtained from four different viewpoints.


4.5. Considerations regarding the difference between the input image and CG image

As described in Section 4.1, we use a simple texture mapping technique to make the pixel intensities of the input image and the CG image match in our experiments. In addition, because the cameras used for modeling the target object are also used as is for object tracking, we do not need to calibrate the sensitivity, color response, or γ characteristics of the sensor. However, because this technique draws each point on the CG model with a constant intensity regardless of orientation, differences can arise between the input image and the CG image owing to self-shadowing on the object, specular reflections, or nonuniform lighting; these would not be issues if the lighting were uniform and the object's surface produced only diffuse reflections. To solve these problems, there are methods that use CG models able to reproduce higher order optical phenomena based on more detailed shape and lighting models. In Section 2.1 we explained our method within the context of gradient-based techniques; however, since numerically we reduce the problem of minimizing the error between the input image and the CG image to that of calculating the location and orientation parameters, we believe our method could also be applied to CG models that include more complex optical phenomena. In general, however, it is not easy to completely model the bidirectional reflectance distribution function (BRDF) at each point of the target object, and the question of how simple a model suffices for real-life object tracking must be answered on the basis of practical benefits and experimental results. Therefore, in this section we experimentally measure the difference between the input image and the CG image and discuss these issues.

In Fig. 10 we show the error between the input and CG images when the orientations of the target object and of the CG model are made to differ from one another (we use the same target object, room, and cameras as in the experiments described earlier in this section). The movements applied to the target object were rotations rather than parallel translations, since rotations produce more conspicuous variation in shadows and movement of reflecting surfaces. The error value is the average of the absolute differences of the RGB values over all the pixels used in object tracking (the area covered by the CG image). As can be seen from Fig. 6, the target object produces specular reflections, and because the neck area is concave there is no shortage of shadow variation around the shoulders.

Fig. 10. Difference between input images and CG images.
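The error measure just described can be sketched as follows, assuming (H, W, 3) RGB images and a boolean mask of the area covered by the CG model:

```python
import numpy as np

def mean_abs_rgb_error(input_image, cg_image, cg_mask):
    """Average absolute RGB difference over the pixels covered by the CG model."""
    diff = np.abs(input_image.astype(float) - cg_image.astype(float))
    return diff[cg_mask].mean()              # mean over masked pixels and all channels
```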

From Fig. 10 we can see that when the object is rotated from its initial position, the difference between the images gradually increases. However, when the orientation of the CG model matches that of the actual object (the region marked by the diagonal line in Fig. 10), the difference does not increase conspicuously. In contrast, when there is a discrepancy between the orientation of the CG model and that of the actual object, the error between the images increases sharply regardless of the angle through which the object has been rotated, and the error surface is smooth with no spurious extrema. In addition, regardless of the angle of rotation, the trough in the error surface remains V-shaped, and since the slope of the surface becomes steeper as the trough is approached, convergence will be fast regardless of the angle of rotation and will not be susceptible to the effects of noise. The accuracy evaluations in Fig. 9 also show that higher-order reflection phenomena do not have a significant effect.

Why such characteristics are observed will need to be examined further in future work; one reason that can be given for the method's robustness is that we evaluate a large number of pixels over the whole of the target object. In template matching, spurious extrema become smaller and the match more stable as the template is enlarged; the robustness of our method appears to result from a similar effect. In addition, since our method accurately reproduces the geometric variation when generating images, convexity and texture details must match simultaneously over the whole image, and we believe this is what gives rise to the steep trough in the error surface. Relating this to existing methods, our method may be considered close to treating the entire plane orthogonal to the camera as a single template and calculating a two-dimensional parallel translation. However, as can be seen from the figure, there are almost no spurious extrema, so a full search is not necessary; the method performs well with a combination of linear estimation and iterative optimization.

In these experiments we used a target with a poorly defined texture; when the target has a richer texture, the area in the vicinity of the convergence point will form an even sharper trough. In addition, with the exception of objects with a strongly repetitive design, significant spurious extrema in the vicinity of the convergence point will be rare.

5. Conclusion

In this paper we have proposed a novel gradient-based method for versatile moving object tracking and shown that, by using multiple cameras, this method is capable of tracking smooth solid objects with six degrees of freedom of movement. The method avoids a common task in computer vision, namely the determination of corresponding points. Instead, the movement of an object is tracked by directly minimizing the error between a CG image, based on shape and color information for the target object, and the input images from the cameras. In our experimental system we found that texture mapping, the technique we used to normalize pixel intensities so that an object's movement can be estimated directly from intensity variation, was sensitive to variation in the environmental lighting; however, we believe this can be overcome by measuring the lighting variation and reproducing it in the CG model and CG images. In a similar way, variation in reflective surfaces and shadows on the target object may also be handled directly if they can be expressed and generated within the CG model. The method is not yet sufficiently fast in execution; however, improvements in CG generation hardware and computational performance are currently dramatic, and we therefore envisage further acceleration as procedures that previously required a great deal of time become realizable in short periods.

Although at present the method is limited to tracking rigid bodies, in the future we hope to extend it by adding the ability to update the model as part of the online processing; we believe that, with additional physical constraints, it will then be possible to use the method to track articulated objects such as humans. In addition, we plan to build more practical systems that use multiple cameras not simply for tracking but as a unified movement recognition system that performs every stage from detecting a target object through three-dimensional shape measurement to movement recognition.

REFERENCES

1. Yokoyama A, Sato K, Togahara T, Inokuchi S. Real-time range imaging using an adjustment-free photo VLSI: Silicon rangefinder. Trans IEICE 1996;J79-D-II:1492–1500. (in Japanese)

2. Hiura S, Yamaguchi A, Sato K, Inokuchi S. Real-time tracking of free form object based on measurement and synthesis of range image sequence. Trans IEICE 1997;J80-D-II:1539–1546. (in Japanese)

3. Hiura S, Yamaguchi A, Sato K, Inokuchi S. Real-time tracking of free form object by range and intensity image fusion. Trans IEICE 1997;J80-D-II:2904–2911. (in Japanese)

4. Tomita F. VVV: A high performance 3-dimensional vision system. J Inst Inf Process 2001;42:370–375. (in Japanese)

5. Sumi Y, Ishiyama Y, Tomita F. Hyper frame vision: A real-time vision system for 6-DOF object localization. Proc ICPR02, Vol. III, p 577–580.

6. Matsumura A, Iwai Y, Yachida M. Multiple person tracking using flesh color information. Tech Rep Inst Inf Process, CVIM-133-18, p 133–138, 2002. (in Japanese)

7. Armstrong M, Zisserman A. Robust object tracking. Proc ACCV95, Vol. 1, p 58–62.

8. Drummond T, Cipolla R. Real-time visual tracking of complex structures. IEEE Trans Pattern Anal Mach Intell 2002;24:932–946.

9. Horn B, Weldon E. Direct methods for recovering motion. Int J Comput Vis 1988;2:51–76.

10. Kawato M, Inui T. Computational theory of the large visual cortex. Trans IEICE 1990;J73-D-II:1111–1121. (in Japanese)

11. Marr D. Vision. W.H. Freeman; 1982.

12. Yachida M. Robot vision. Shoko-do; 1990.

13. Taniguchi R, Wada T. A real-time multiple viewpoint image processing system using a PC cluster: A real-time motion capturing and 3-dimensional reproduction system. J Robot Soc Japan 2001;19:427–432. (in Japanese)

14. Matsuyama T. The distributed and collaborative vision project. J Robot Soc Japan 2001;19:416–419. (in Japanese)

15. Varga RS. Matrix iterative analysis. Prentice–Hall; 1972.

16. Inokuchi S, Sato K. Three dimensional image measurements. Shoko-do; 1990.


AUTHORS (from left to right)

Takayuki Moritani (student member) graduated from the School of Engineering Science at Osaka University in 2002, completed the preliminary doctoral program in 2004, and is currently in the advanced doctoral program.

Shinsaku Hiura (member) completed his studies in the School of Engineering Science at Osaka University in three years, skipping the final year to enter the Graduate School of Engineering Science in 1993; he completed the master's program in 1995 and completed his doctorate ahead of schedule in 1997. He then became a research assistant in the Graduate School of Engineering at Kyoto University. In 1999 he became a research assistant in the Graduate School of Engineering Science at Osaka University, and was appointed an assistant professor there in 2003. His research interests are three-dimensional image measurement and processing and related areas in virtual reality and communication. In 1993 he received the Achievement Award of the Kansai section of the Joint Convention of the Institutes of Electrical Engineers of Japan. In 2000 he received the Commended Paper Award at the Image Sensing Symposium of Japan. He holds a D.Eng. degree, and is a member of the Information Processing Society of Japan and the Virtual Reality Society of Japan.

Kosuke Sato (member) graduated from the School of Engineering Science at Osaka University in 1983 and completed the master's program in 1985. In 1986 he became a research assistant in the Department of Engineering Science at Osaka University. In 1988 he was a visiting researcher at the Carnegie Mellon University Robotics Institute. In 1992 he became a lecturer in the School of Engineering Science at Osaka University. In 1994 he became an assistant professor at the Graduate School of Information Science at Nara Institute of Science and Technology. In 1999 he became an assistant professor in the Graduate School of Engineering Science at Osaka University, and in 2003 he became a professor at that institution. His research interests are three-dimensional image measurement, virtual reality, and other applications of video information media such as digital archiving. In 1987 he received the Shinohara Memorial Prize of the Institute of Information Processing of Japan. He is a member of the Information Processing Society of Japan, the Color Science Association of Japan, the Virtual Reality Society of Japan, and IEEE.
