

Vergence Micromovements and Depth Perception

Antonio Francisco*
CVAP, Royal Institute of Technology (KTH)

S-100 44 Stockholm, Sweden

Abstract

A new approach to stereo vision is proposed in which 3D depth information is recovered using continuous vergence angle control with simultaneous local correspondence response. This technique relates elements with the same relative position in the left and right images for a continuous sequence of vergence angles. The approach considers the extremely fine vergence movements (micromovements) about a given fixation point within the depth-of-field boundaries. It allows the recovery of 3D depth information given the knowledge of the geometry of the system and a sequence of pairs $[\alpha_i, C_i]$, where $\alpha_i$ is the $i$-th vergence angle and $C_i$ is the $i$-th matrix of correspondence responses. Due to its local operation characteristics, the resulting algorithms are implemented in a modular hardware scheme using transputers. Unlike currently used algorithms, there is no need to compute depth from disparity values, at the cost of acquiring a sequence of images during the micromovements. Experimental results from physiology and psychophysics suggest that the approach is biologically plausible. The approach therefore proposes a functional correlation between vergence micromovements, depth perception, stereo acuity and stereo fusion.

The perception of the 3D distance (depth) of objects using stereo images has been studied by many researchers for a long time. Some of these studies use vergence camera systems [1], integrating position control, image acquisition and depth processing in the modality of vision system named "active vision" [2]. Following this line of research, the present work analyses the correlation between real-time depth acquisition and the extremely fine vergence movements (micromovements) of the cameras about the fixation point. We assume that these movements are synchronized between the two cameras.

The continuous vergence micromovements differ from the vergence, translation and rotation movements used in other methods to fixate the cameras on a new fixation point. Those methods (using particular techniques such as multi-resolution) compute a depth map, or depth directly, from the acquisition of the left and right images at this fixation point; i.e., using two images and some stored information (estimation) about the depth map, they are able to infer the current depth at the correctly matched image points. Generally,

* Researcher at the National Institute of Space Research (INPE), Sao Jose dos Campos, Sao Paulo, Brazil. The support from the Swedish National Board for Industrial and Technical Development, NUTEK, is gratefully acknowledged. I would like to thank Prof. Ruzena Bajcsy and Prof. Jan-Olof Eklundh for their support of the development of this work, as well as Kourosh Pahlavan, Akihiro Horii and Thomas Uhlin for valuable help when using the KTH head-eye system.

BMVC 1992 doi:10.5244/C.6.38



the strategies used in such methods search for correspondence matches along epipolar lines. The depth is therefore calculated using the disparity information between the left-right matched points and the geometry of the camera system.

The current approach uses neither epipolar lines nor disparities to calculate the depth of any 3D point. The depth is determined by the geometry of the camera system (mainly, the vergence angle) and by the relative position of a pixel with respect to the image plane. The procedure can be described simply as follows: micromovements of the two cameras occur about the fixation point. For each left-image point on the left image plane, the vergence angle and the "correspondence response" between this point and the right-image point at the same relative position on the right image plane are stored. For each left point, using these "correspondence response" signals and the camera geometry, the depth of the 3D point where the correspondence response reaches the highest level is calculated.

The approach is therefore functionally different from previous ones in the sense that the depth is calculated locally for each point (without searching along epipolar lines), at the cost of acquiring a sequence of images during the micromovements. The objective here is to clarify how to calculate depth using micromovements.
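A minimal sketch of this per-pixel procedure is given below. It is illustrative only: the function name is hypothetical, the placeholder pixel-difference response stands in for the actual correlation operator of section 4, and the final angle-to-depth conversion assumes an on-axis point under symmetric fixation.

```python
import numpy as np

def depth_from_micromovements(left_seq, right_seq, alphas, b):
    """Per-pixel depth from a micromovement sequence (hypothetical sketch).

    left_seq, right_seq: arrays of shape (n_angles, H, W), one image pair
    per vergence angle; alphas: vergence angles in radians; b: baseline.
    """
    # Correspondence response C_i at each vergence angle alpha_i. Here a
    # placeholder similarity between same-position pixels; the paper's
    # windowed correlation operator appears in section 4.
    responses = -np.abs(left_seq.astype(float) - right_seq.astype(float))
    # For every pixel, keep the vergence angle with the peak response ...
    alpha_star = np.asarray(alphas)[responses.argmax(axis=0)]
    # ... and convert it to depth (on-axis, symmetric-fixation geometry).
    return b / (2.0 * np.tan(alpha_star))
```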

The paper covers the theoretical background, the experimental results and the biological support for depth acquisition from the vergence micromovements approach. The theoretical and simulation parts of this work were developed [3] during my stay at the General Robotics and Active Sensory Perception (GRASP) laboratory, University of Pennsylvania, USA. The experiments to validate the approach have been done using the KTH head-eye system [4].

1 The stereo vision system and the horopter

Each lens of the right and left camera is considered to be thin and ideal, in the sense that an object at a distance $d_{obj}$ (the object distance) from the principal plane has its image (with inverted direction) at a distance $d_{in}$ (the image distance) from this plane. The relationship between these two distances and the focal length of the lens is given by the Gaussian Lens Equation:

$$\frac{1}{f} = \frac{1}{d_{obj}} + \frac{1}{d_{in}} \tag{1}$$
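As a quick numeric check (using the GRASP-platform values that appear later in figure 4, not a computation stated in the text):

$$d_{in} = \frac{f\,d_{obj}}{d_{obj} - f} = \frac{65 \times 1920}{1920 - 65} \approx 67.3\ \text{mm} \qquad (f = 65\ \text{mm},\; d_{obj} = 15b = 1920\ \text{mm})$$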

With respect to the camera platform, symmetric fixation in the visual plane is assumed. Therefore the vergence angles of the two cameras have the same value $\alpha$, and the point being fixated is in the visual (horizontal) plane of the cameras. With this assumption, any camera torsion (about the axis connecting the lens center and the image plane center) is considered to be zero. According to the symmetric fixation model (figure 1), associated with each image plane there is a coordinate system with its origin at the image plane center. The lens centers are separated by a baseline $b$. These coordinate systems define the left projection $(x_{pl}, y_{pl})$ and the right projection $(x_{pr}, y_{pr})$ of a point in space. To identify the 3D position $(X_0, Y_0, Z_0)^T$ of a point $P_0$ in space, a global coordinate system $xyz$ is used (figure 1).



Figure 1: Stereo camera geometry: (a) Perspective view, (b) Top view

In order to adopt the same terminology used in the human vision field, we will review some concepts from physiology and psychophysics. The horopter defines the set of points in space for which the binocular disparity is zero [5]. The point horopter is the locus of zero disparities for a point stimulus, where both horizontal and vertical disparities are zero. (There has been a considerable amount of confusion in the literature caused by laxity in defining the horopter [5].)

We consider the ideal point horopter composed of points with zero horizontal and vertical disparities for symmetric fixation in the visual plane, with any position, torsion and optical aberrations assumed to be absent. As described in [5], any point off the point horopter in space (off-axis points) projects to the two image planes with horizontal and vertical disparities. With vergence, the points at the distance corresponding to the point horopter would nullify the horizontal disparity. Note that in the ideal case nothing can be done to nullify the vertical disparity produced by off-axis points being necessarily closer to one eye than the other, with a resulting difference in the projection angle in the two eyes. The present analysis concerning the horopter is based on zero horizontal disparity.

For the zero horizontal disparity case we can define a 3D "intersection" point of the left and right optic axes (same $x$ and $z$ and different $y$s) passing through the same corresponding element $(x_p, y_p)^T$ in the image plane coordinate systems. The coordinate $(x_t, y_t, z_t)^T$ of this "intersection" point [3] is:

$$\begin{pmatrix} x_t \\ y_t \\ z_t \end{pmatrix} = \begin{pmatrix} \dfrac{b}{2}\,\dfrac{\tan(\alpha+\theta_l)-\tan(\alpha-\theta_l)}{\tan(\alpha+\theta_l)+\tan(\alpha-\theta_l)} \\[2ex] \dfrac{y_l + y_r}{2} \\[1.5ex] \dfrac{b}{\tan(\alpha+\theta_l)+\tan(\alpha-\theta_l)} \end{pmatrix} \tag{2}$$

where $y_l$ and $y_r$ are the $y$ coordinates of the left and right optic axes at the planar intersection, and

$$\alpha = \arctan\!\left(\frac{b}{2\,d_{obj}}\right), \qquad \theta_l = -\arctan\!\left(\frac{x_p}{d_{in}}\right), \qquad d_{in} \text{ given by eq. (1).} \tag{3}$$

The above equation for $y_t$ was deduced considering the average of the left and right $y$ coordinates of the left and right optic axes at the "intersection" point. This assumption introduces an error into the present analysis. In order to evaluate the size of this error with respect to the length of a photo-receptor element, the difference between the projections on the two image planes of a point $P_0$ at $(x_t, y_t, z_t)^T$ (eq. 2) is analyzed. Note that the projections are determined by the intersection of the right and left optic axes, passing through the left and right lens centers respectively, with the image planes. Let us denote the intersection of the left optic axis with the left image plane as $(x_{pl}, y_{pl})^T$ and the intersection of the right optic axis with the respective



Figure 2: Polar coordinate system of the image plane ($x_p = d_{in}\tan(\gamma)\cos(\varphi)$, $y_p = d_{in}\tan(\gamma)\sin(\varphi)$)

image plane as $(x_{pr}, y_{pr})^T$. It is possible [3] to define the following Euclidean projection deviation ($dev$) for a given point $P_0$ as:

$$dev = \sqrt{(x_{pl} - x_{pr})^2 + (y_{pl} - y_{pr})^2} \tag{4}$$

with the in-plane components of the projections given by

$$x_{pl} = -d_{in}\tan\!\left(\arctan\!\left(\frac{x_t + b/2}{z_t}\right) - \alpha\right), \qquad x_{pr} = -d_{in}\tan\!\left(-\arctan\!\left(\frac{b/2 - x_t}{z_t}\right) + \alpha\right) \tag{5, 6}$$

the $y$ components following from the perspective projection of $y_t$ through the respective lens centers, and

$$\theta_l = -\arctan\!\left(\frac{x_{pl}}{d_{in}}\right), \qquad \theta_r = \arctan\!\left(\frac{x_{pr}}{d_{in}}\right) \tag{7}$$

The $dev$ analysis is done considering the object distance as a multiple of the baseline ($d_{obj} = k_b\, b$) and the image planes mapped by a polar coordinate system (figure 2). Having deduced all the equations needed for the $dev$ analysis, we can show some simulated results for the human visual system and the GRASP platform system. The procedure for the $dev$ analysis can be summarized in the following simulation steps:

• for given $k_b$, $\gamma$, $\varphi$, $f$, $b$: compute $d_{obj}$, $\alpha$, $d_{in}$ (eq. 3), $x_p$, $y_p$ (figure 2), and $(x_t, y_t, z_t)^T$ (eq. 2);

• using $P_0 = (x_t, y_t, z_t)^T$, compute $x_{pl}$, $y_{pl}$, $x_{pr}$, $y_{pr}$ (eqs. 5 to 7) and then $dev$ (eq. 4);

• plot the results of $dev$ normalized with respect to the distance between centers of adjacent photo-receptor elements, $d_{ce}$. Note that $d_{ce}$ is a constant for most machine vision systems ($d_{ce}(.)$) and is a function of $\gamma$ for the human visual system ($d_{ce}(\gamma)$) [3].
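The steps above can be condensed into a short simulation. The sketch below is not the original code: it re-derives the "intersection" point and the projections from explicit 3D ray geometry (a non-inverting pinhole convention) rather than from the closed forms of eqs. (2) and (5)-(7), and the function and variable names are illustrative assumptions.

```python
import numpy as np

def dev_simulation(k_b, gamma, phi, f, b):
    """Normalized-deviation simulation step (illustrative reconstruction).
    Angles in radians, lengths in mm."""
    d_obj = k_b * b                           # object distance
    alpha = np.arctan(b / (2 * d_obj))        # vergence angle, eq. (3)
    d_in = f * d_obj / (d_obj - f)            # image distance, eq. (1)
    # Pixel at the same relative polar position on both image planes (fig. 2).
    x_p = d_in * np.tan(gamma) * np.cos(phi)
    y_p = d_in * np.tan(gamma) * np.sin(phi)

    def ray(cx, a):
        # Ray through pixel (x_p, y_p) of the camera at (cx, 0, 0), whose
        # optic axis is rotated by a about the vertical axis (verging inward).
        d = np.array([x_p * np.cos(a) + d_in * np.sin(a),
                      y_p,
                      -x_p * np.sin(a) + d_in * np.cos(a)])
        return np.array([cx, 0.0, 0.0]), d / np.linalg.norm(d)

    (o_l, d_l), (o_r, d_r) = ray(-b / 2, alpha), ray(b / 2, -alpha)
    # "Intersection" point of eq. (2): cross the rays in the x-z plane and
    # average the two ray heights to obtain y_t.
    A = np.array([[d_l[0], -d_r[0]], [d_l[2], -d_r[2]]])
    t_l, t_r = np.linalg.solve(A, (o_r - o_l)[[0, 2]])
    p_l, p_r = o_l + t_l * d_l, o_r + t_r * d_r
    P0 = np.array([p_l[0], (p_l[1] + p_r[1]) / 2, p_l[2]])

    def project(cx, a):
        # Perspective projection of P0 onto that camera's image plane,
        # the counterpart of eqs. (5)-(7).
        p = P0 - np.array([cx, 0.0, 0.0])
        xc = p[0] * np.cos(a) - p[2] * np.sin(a)   # world -> camera frame
        zc = p[0] * np.sin(a) + p[2] * np.cos(a)
        return d_in * xc / zc, d_in * p[1] / zc

    (x_pl, y_pl), (x_pr, y_pr) = project(-b / 2, alpha), project(b / 2, -alpha)
    return np.hypot(x_pl - x_pr, y_pl - y_pr)      # dev, eq. (4)

# GRASP-like parameters (figure 3.b regime), normalized by d_ce = 0.03 mm:
print(dev_simulation(15, np.deg2rad(2), np.deg2rad(45), 65.0, 128.0) / 0.03)
```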

The results of the simulation shown in Figure 3.a imply that the highest $dev$ occurs when $\varphi = k \cdot 90° + 45°$ ($k = 0, 1, \ldots$). Therefore all other simulations are done with $\varphi = 45°$. A conclusion from Figure 3.a is that the normalized $dev$ is smaller for the human visual system, since $d_{ce}(\gamma)$ increases in the periphery. This characteristic, implicit in the human visual system, tends to diminish $dev$. Another feature of the human visual system that tends to diminish $dev$ is the known difference between nasal and temporal retinal eccentricity (nasal larger than temporal) for every pair of corresponding points. This difference in eccentricity could explain the deviation of the empirical horopter [5] from the Vieth-Müller circle as well as the necessity to diminish $dev$.

For the GRASP system we want to know how $dev$ varies with object distance. From the analysis of Figure 3.b we see that $dev$ decreases with object distance.



Figure 3: Normalized deviation as a function of: (a) $\varphi$ ($\gamma = 2°$, $b = 65$ mm, $f = 17$ mm, $d_{obj} = 2b$), (b) $k_b$ ($d_{obj} = k_b b$, $\varphi = 45°$, $b = 128$ mm, $f = 65$ mm)


Figure 4: Point horopter for the GRASP platform system ($b = 128$ mm, $f = 65$ mm, $d_{obj} = 15b$, $d_{ce}(.) = 30.0 \times 10^{-3}$ mm): (a) Top view, (b) Perspective view

Therefore the next set of simulations is done with $d_{obj} = 15b$, which ensures a small $dev$ for $\varphi = 45°$. The $dev$ has been investigated at its maximum value, in spite of the zero value of this deviation on the image plane coordinate axes ($\varphi = k \cdot 90°$, $k = 0, 1, \ldots$) for any value of $\gamma$, $d_{obj}$, $b$ and $f$.

Having shown that the normalized $dev$ is very small when $d_{obj}$ is greater than fifteen times the length of the baseline, the point horopter is plotted using equation 2 for this value of $d_{obj}$. Almost all the parameters of the above equations can be computed directly (like $d_{in}$ and $\alpha$) from the defined values of $d_{obj}$, $b$ and $f$. The only two parameters that do not have a defined range are $x_p$ and $y_p$. In the present paper, 80 photo-receptor elements are used as the distance from the image plane center to the periphery of the workspace being analyzed. Therefore, a square workspace of side 160 pixels centered on the image plane is assumed. This range of $x_p$ and $y_p$ gives the point horopter plotted in Figure 4 for the GRASP platform system. It can be seen that the point horopter is a surface in space.
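A sketch of how such a horopter surface can be generated over the pixel workspace is shown below, using the reconstructed closed form of eq. (2); the function name and the grid construction are illustrative assumptions.

```python
import numpy as np

def horopter_depth(b, f, d_obj, d_ce, half_width=80):
    """Point-horopter depth z_t over a square pixel workspace (sketch)."""
    alpha = np.arctan(b / (2 * d_obj))          # vergence angle, eq. (3)
    d_in = f * d_obj / (d_obj - f)              # image distance, eq. (1)
    # 160 x 160 pixel workspace centered on the image plane; d_ce converts
    # photo-receptor indices to metric image-plane coordinates.
    x_p = np.arange(-half_width, half_width) * d_ce
    theta_l = -np.arctan(x_p / d_in)            # eq. (3)
    # Depth of the "intersection" point, the z_t component of eq. (2); in
    # this reconstruction z_t varies with x_p, while y_t sweeps the surface
    # vertically.
    return b / (np.tan(alpha + theta_l) + np.tan(alpha - theta_l))

# GRASP platform parameters from figure 4:
z_t = horopter_depth(b=128.0, f=65.0, d_obj=15 * 128.0, d_ce=30.0e-3)
```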

2 Micromovements

The shape of the point horopter has been analyzed for a given vergence angle calculated from the object distance under fixation. The main analysis is now conducted for a number of vergence angles $\alpha_i$ about the fixation point in the visual plane, described by the following equation:

$$\alpha_i = \arctan\!\left(\frac{b}{2\,d_{obj}}\right) + \xi_i \tag{8}$$



Figure 5: Set of point horopter surfaces for the GRASP platform system ($\xi_i \in [-0.54°, 0.54°]$): (a) Top view, (b) Expanded perspective view

where $\xi_i$ is a small angle increment (positive or negative) that describes the micromovement about the fixation point (first part of equation 8). The set of $\alpha_i$ about a given fixation point determines a complete micromovement cycle.

Figure 5 shows the locus of "intersection" points in 3D space for the GRASP platform system for a given micromovement cycle. The surfaces shown correspond to the set of point horopter surfaces generated for each vergence angle $\alpha_i$. As can be seen in Figure 5, the locus of all the "intersection" points forms a volume in 3D space. Therefore, any object inside this volume can have its depth measurements determined by the response of a local correspondence operator to the continuous vergence angle control. Remember that this operator relates elements (image plane points) with the same horizontal and vertical distance from the centers of the left and right image planes. It is possible to use a local correspondence operator since we assume that $d_{obj}$ is greater than fifteen baselines, implying a small $dev$ (see previous section).

The errors in stereo (along the $z$ axis) with the present approach are due to the vergence angle quantization (angle steps fixed by $\xi_i$). These errors differ from the quantization errors due to discrete photo-elements in cameras, which are a common characteristic of other stereoscopic methods. As described in [6], the errors due to photo-receptor quantization are significant and increase with the distance from the object to the camera system. The present approach allows us to overcome the photo-receptor quantization limitation by using a sequence of pairs $[\alpha_i, C_i]$, where $\alpha_i$ is the $i$-th vergence angle and $C_i$ is the $i$-th matrix of correspondence responses. Although the present analysis considers only the micromovements in the visual plane (horizontal micromovements), the human eye system performs micromovements in a vertical plane including the visual axis as well as rotations about the visual axis itself [7].
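As a rough order-of-magnitude check (our own estimate, not a figure from the paper), the depth step induced by one vergence step $\Delta\alpha$ follows from differentiating the on-axis depth $d = b/(2\tan\alpha)$; for the KTH setup of section 4 ($b = 200$ mm, $d_{obj} \approx 3000$ mm, $\Delta\alpha = 26$ arc sec $\approx 1.26 \times 10^{-4}$ rad):

$$\Delta d \approx \frac{b}{2\sin^2\alpha}\,\Delta\alpha \approx \frac{2\,d_{obj}^2}{b}\,\Delta\alpha = \frac{2 \times 3000^2}{200} \times 1.26 \times 10^{-4} \approx 11\ \text{mm},$$

which is consistent with the step-shaped surfaces visible in Figures 7.c and 8.c.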

3 Biological support of the micromovements

The following discussion of eye movements according to physiological and psychophysical experiments is offered as a working hypothesis, useful for understanding the role of the micromovements in depth perception. Physiological results [8, 7] show that the human eye performs fine movements during the process of fixation on a single point, which are collectively called physiological nystagmus. Physiological nystagmus is composed of three different kinds of movements: (1) high-frequency tremor, (2) slow drifts, and (3) rapid binocular flicks. The drift and flick movements occur in opposing directions and produce convergence/divergence waves of the eyes in a similar way to the micromovements studied in the previous sections.



Figure 6: Correspondence operator response (over a complete micromovement cycle)

Assuming the vergence micromovement mechanism as the basis of depth perception, it is easy to understand the phenomenon of stereoacuity (depth or stereoscopic acuity, stereopsis). As well described in [8, 5], it is almost incredible that most observers under normal conditions can discriminate a difference in depth corresponding to an angular disparity (interocular disparity) of about 10 arc sec. The best values reported in the literature have been obtained with the Howard-Dolman apparatus, devised by Howard in 1919; the best observers achieve a 75% discrimination level close to 2 arc seconds in that experiment. Most remarkably, this disparity value is much smaller than the distance between cone centers at the central part of the fovea (≈ 22 arc sec).

We suggest that this high sensitivity to slight disparity can be explained by the correlation between depth perception and the vergence eye micromovements, and not by a capacity of the human visual system to spatially detect disparity on the retinas. The idea of an angular disparity that can be detected spatially by the visual system is thus replaced by a local approach in which the human visual system determines depth values from the highest peak of correspondence response (figure 6) during a complete micromovement cycle (section 2). The highest peak of correspondence occurs when there is no spatial disparity between the left and right stimulus of elements with the same relative position on both retinas, i.e., when the spatial disparity is cancelled for a given vergence angle.

Another phenomenon that can be explained by the present approach is known in the literature [8] as Panum's fusional area: the range of interocular disparities within which objects viewed with both eyes on corresponding retinal regions appear single. This area is such that fusion occurs (only one dot is seen) when two points that are perceived in different eyes fall closer together in the combined view. Note that these two points can be seen through an uncrossed (left and right optic axes do not cross) or crossed disparity. The classical static limit for Panum's area, the mean crossed-to-uncrossed range of horizontal disparities, is reported as being 14 arc min. The experiments described in [9] support the existence of binocular fusion as a unique category of sensory performance, disconfirming several non-fusional explanations of single vision. While the range of binocular disparities allowing fusion (Panum's fusional horizontal diameter) is typically in the region of 14 arc min, stereoscopic depth can be perceived from a disparity 500 times smaller.

In the present approach, the phenomena of binocular fusion and stereoscopic depth are assumed to be supported by the mechanism of vergence eye micromovements about a fixation point. In this way, the fusion area dimension is determined by the range of a complete micromovement cycle (section 2).



Figure 7: Two-planes experiment (33 vergence steps): (a) original image, (b) perspective view of the acquired object, (c) horizontal cut of the acquired object, (d) perspective view pasted with real grey values from the original image

It is important to point out that the classical value of Panum's fusional horizontal radius (the average of the crossed and uncrossed disparities), 7.0 arc min, coincides with the micromovement range value described in [7]. Note that Panum's fusional horizontal radius must be compared with the total range of a monocular micromovement reported in [7] for the two definitions to be coherent. In the present analysis the vertical fusion radius is not considered, since this radius follows the monocular spatial resolution limit of the retina [10]. As a conclusion, "real" binocular fusion is assumed to occur only between cells adjacent on the horizontal axis of the retinas, and binocular vertical fusion is a result of the monocular fusion mechanism.

4 Experimental results

In order to validate the micromovements approach experimentally, a practical implementation was done using the KTH head-eye system. This active vision system is composed of several motors, two cameras and two camera lens controllers. The system is connected to the VME bus of a SUN SPARCstation via a transputer board. Our main control was over the vergence motors, image acquisition, zoom and focus. Initially, we did not use the transputer board to execute the algorithms; instead, we performed the experiments by acquiring and storing a sequence of image pairs on the SPARCstation for further processing. The two experiments described below consist of an object located in front of the head-eye system, around 3000 mm from the baseline. The set-up is done by choosing a value for the focal length (zoom) and adjusting manually the focus and the initial fixation point over the central part of the object surface. These values of zoom and focus are kept constant during the experiment. A program sweeps the object by changing the vergence angle between the initial vergence value and a final value determined by the number of steps specified. For each vergence angle, two images (from the left and right cameras) are stored on the SPARCstation. After the acquisition of the desired number of image pairs, we compute the correspondence response for every pixel $(x, y)$ of the acquired left and right images.



Figure 8: The box-plane-cylinder experiment (41 vergence steps): (a) original image, (b) perspective view of the acquired object, (c) horizontal cut of the object, (d) perspective view pasted with real grey values from the original image

The correspondence response is computed with the following correlation operator:

$$C(x, y) = \frac{E[\mathit{Left}(x,y)\,\mathit{Right}(x,y)] - E[\mathit{Left}(x,y)]\,E[\mathit{Right}(x,y)]}{\sigma[\mathit{Left}(x,y)]\,\sigma[\mathit{Right}(x,y)]}$$

where $E[\,\cdot\,]$ and $\sigma[\,\cdot\,]$ denote the mean and the standard deviation computed over the operator window centered at $(x, y)$. For every pixel, using the vergence angle $\alpha^*$ that gives the highest correspondence response for that pixel and using eq. (8), we compute the object depth $d_{obj}$ for that pixel.
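A compact sketch of this computation is given below. It is our reconstruction, not the transputer implementation: scipy's uniform_filter stands in for the windowed means, and the angle-to-depth conversion inverts the first term of eq. (8) for the on-axis case.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def correspondence_response(left, right, win=21):
    """Windowed normalized cross-correlation between same-position pixels
    of the left and right images (the correlation operator above)."""
    L, R = left.astype(float), right.astype(float)
    mL, mR = uniform_filter(L, win), uniform_filter(R, win)
    cov = uniform_filter(L * R, win) - mL * mR          # E[LR] - E[L]E[R]
    var_L = np.maximum(uniform_filter(L * L, win) - mL**2, 1e-12)
    var_R = np.maximum(uniform_filter(R * R, win) - mR**2, 1e-12)
    return cov / np.sqrt(var_L * var_R)                 # / (sigma_L sigma_R)

def depth_map(lefts, rights, alphas, b):
    """Per-pixel depth: keep the vergence angle alpha* with the highest
    correspondence response, then d_obj = b / (2 tan(alpha*))."""
    C = np.stack([correspondence_response(l, r)
                  for l, r in zip(lefts, rights)])      # (n_angles, H, W)
    alpha_star = np.asarray(alphas)[C.argmax(axis=0)]
    return b / (2.0 * np.tan(alpha_star))
```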

The following parameters are common to both experiments: $b = 200$ mm, $f = 30$ mm, $d_{obj} = 15b$, a work window of 200x200 pixels, an operator size of 21x21 pixels and a vergence step resolution of 26 arc sec (imposed by the head-eye system). Our first experiment was done using two vertical planes as the object being viewed. Figure 7.a shows the left image of the object before the vergence sweep. Figure 7.b shows the perspective view of the acquired object after the vergence sweep and the use of equation 8. The darker grey patches of the object are farther from the baseline than the lighter grey patches. Figure 7.c shows a horizontal cut of the acquired object, giving a correct idea of the object being viewed. The last picture is the perspective view pasted with the real grey values instead of the depth-coded grey values. The second experiment is shown in Figure 8. The object is composed of a box (left side of the object), a cylinder (right side) and an inclined newspaper. Figure 8.c gives an idea of the object used.

The step-shaped surfaces seen in Figures 7.c and 8.c are a consequence of the vergence step resolution. The processing time using this scheme was about one hour for the entire 200x200-pixel window. Currently, the correspondence operation is executed on the transputer board, with the image split among four transputers before the correspondence operation. Using this new scheme the experiment takes around 20 seconds. We are not yet using the full power of our transputer board, since we did not have enough time to implement it. In spite of the great improvement from using the transputer board, our goal has not been reached yet, since we want to compute depth at frame rate.



5 Conclusion

The continuous vergence micromovements approach makes it possible to overcome the physical limitation of the photo-receptor dimension (CCD element or cone) on depth perception. Moreover, there is no need to compute depth from disparity values, since the disparity is cancelled by the vergence micromovements. Note that the stereoscopic matching problem still exists, since it is possible to have two or more correspondence peaks with similar values for an element of the correspondence matrix.

The highlight of this new approach is the use of vergence micromovements as a mechanism to nullify the disparity between the left and right visual stimulus at the same retinal locus. The concept of a "neural structure spread spatially" in the visual system to perceive depth via measurement of disparity is thus replaced by a "neural structure connected locally with the neighborhood" of each retinal locus.

References

[1] E. P. Krotkov. Exploratory visual sensing for determining spatial layout with an agile stereo camera system. PhD thesis, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA, 1987.

[2] R. Bajcsy. Active perception vs. passive perception. In Proc. Workshop on Computer Vision, pages 55-59, Bellaire, MI, October 1985.

[3] A. Francisco. The role of vergence micromovements on depth perception. Technical Report MS-CIS-91-37, GRASP LAB, CIS, University of Pennsylvania, Philadelphia, PA, USA, 1991.

[4] K. Pahlavan and J.-O. Eklundh. A head-eye system - analysis and design. In Computer Vision, Graphics, and Image Processing: Image Understanding, (to appear), July 1992.

[5] C. M. Schor and K. J. Ciuffreda. Vergence eye movements: basic and clinical aspects. Butterworth, 1983.

[6] F. Solina. Errors in stereo due to quantization. Technical Report MS-CIS-85-34, GRASP LAB, CIS, University of Pennsylvania, Philadelphia, PA, USA, 1985.

[7] R. W. Ditchburn. Eye-movements in relation to retinal action. Optica Acta, 1(4):171-176, 1955.

[8] J. W. Kling and L. A. Riggs. Experimental psychology. Holt, Rinehart and Winston, Inc., 1971.

[9] T. Heckmann and C. M. Schor. Panum's fusional area estimated with a criterion-free technique. Perception & Psychophysics, 45(4):297-306, 1989.

[10] C. Schor, I. Wood, and J. Ogawa. Binocular sensory fusion is limited by spatial resolution. Vision Res., 24(7):661-665, 1984.