
Network 2 (1991) 423-442. Printed in the UK

Stereo integration, mean field theory and psychophysics

Alan L Yuille†, Davi Geiger‡ and Heinrich H Bülthoff§∥

† Division of Applied Science, Harvard University, Cambridge, MA 02138, USA
‡ Siemens Research Center, Princeton, NJ 08540, USA
§ Department of Cognitive Science and Linguistic Sciences, Brown University, Providence, RI 02912, USA

Received 22 March 1991

Abstract. We describe a theoretical formulation for stereo in terms of the Bayesian approach to vision and relate it to psychophysical experiments. The formulation enables us to integrate the depth information from different types of matching primitives, or from different vision modules. We solve the correspondence problem using compatibility constraints between features and prior assumptions on the interpolated surfaces that result from the matching. We use techniques from statistical physics to show how our theory relates to previous work. Finally we show that, by a suitable choice of prior assumptions about surfaces, the theory is consistent with some psychophysical experiments which investigate the relative importance of different matching primitives.

1. Introduction

Computational vision (Marr 1982) attempts to construct theoretical models of visual processes and relate them to psychophysical and physiological experiments. With the vast number of research papers on these topics it is desirable to develop an overall framework for such theories.

In this paper we introduce a theoretical formulation for stereo in terms of the Bayesian approach to vision. We show that this formalism is rich enough to contain most of the elements used in standard stereo theories. Techniques from statistical physics are used to show its relations to previous theories. Finally we show that the theory is consistent with some psychophysical experiments.

The fundamental issues of stereo are: (i) what primitives are matched between the two images, (ii) what a priori assumptions are made about the scene to determine the matching and thereby compute the depth, and (iii) how is the geometry and calibration of the stereo system determined? For this paper we assume that (iii) is solved, and so the corresponding epipolar lines (see e.g. Horn 1986) between the two images are known. Thus we use the epipolar line constraint to reduce the matching problem to a one-dimensional search. Some support for this is given by the work of Bülthoff and Fahle (1989) and Bülthoff et al (1991), described in more detail in section 5.

∥ To whom correspondence should be addressed.

0954-898X/91/040423+20$03.50 © 1991 IOP Publishing Ltd


Our framework combines cues from different matching primitives to obtain depth perception. These primitives can be weighted according to their robustness. For example, depth estimates obtained by matching intensity are sometimes unreliable since small fluctuations in intensity (due to illumination or detector noise) might lead to large fluctuations in depth. Therefore intensity matches are often less reliable than estimates from matching edges. The formalism can also be extended to incorporate information from other depth modules (see psychophysical experiments in Bülthoff and Mallot 1987, 1988). It also provides a model for sensor fusion that can include domain-dependent knowledge (Clark and Yuille 1990).

Unlike many previous theories of stereo that first solved the correspondence problem and then constructed a surface by interpolation (Grimson 1981), we propose combining the two stages. The correspondence problem is solved to give the disparity field that best satisfies the a priori constraints. Our model involves the interaction of several processes. We will introduce it in three stages at different levels of complexity in sections 2.2, 2.3 and 2.4. It involves fields for matching, disparity and discontinuities.

Using standard techniques from statistical physics we can eliminate certain fields and obtain effective energies for the remaining fields (see Geiger and Girosi 1991). By these methods we can relate our framework to the apparently rather different cooperative stereo algorithms (Dev 1975, Marr and Poggio 1976) and the disparity gradient limit theories (Prazdny 1985, Pollard et al 1985). These techniques also suggest novel algorithms for stereo computation that incorporate constraints about the set of possible matches differently from the cooperative stereo algorithms.

We relate this model to psychophysical experiments (Bülthoff and Mallot 1988, Bülthoff and Fahle 1989, Bülthoff et al 1991) in which perceived depth for different matching primitives and disparity gradients are precisely measured. These experiments suggest that several types of primitive are used for correspondence, but that some primitives are better than others. Our model is consistent with the data from these experiments assuming standard prior assumptions about the surfaces.

Some of the cost functions formulated here, but without the statistical methods or the Bayesian framework, were developed in collaboration with Gennert (Gennert et al 1990) based on a model for long range motion correspondence by Yuille and Grzywacz (1989).

This paper uses differential operators to impose prior constraints on surfaces. For the discretized case this typically corresponds to Markov random fields. We would like to emphasize, however, that our choice of priors is empirical and is motivated by consistency with a specific class of psychophysical experiments. Our framework allows for a far more general class of prior. In a specific domain, such as modelling thin plates (Terzopoulos 1983), it is possible to deduce an optimal prior from first principles but we doubt that this is possible in general. An attractive possibility (Clark and Yuille 1990) is to use domain-specific priors, if domain knowledge is available, or competitive and adaptive priors. In our model the data terms, which enforce matching between features or intensity, are chosen assuming Gaussian errors in the data measurements (Cernushi-Frias et al 1989).

The plan of this paper is as follows. In section 2 we review the Bayesian approach to vision and describe our framework. Section 3 introduces techniques from statistical physics and uses them to analyse the theory. This analysis is used in section 4 to compare the model to existing stereo theories and to suggest novel algorithms. Finally, in section 5, we relate the theory to psychophysical data.


2. The Bayesian approach to stereo

There has been a vast amount of work on stereo. For overviews, see Grimson (1981) and Barnard and Fischler (1982). We first briefly review the problem of stereo and give an overview of our theory. We describe this theory in terms of a cost function and finally put it into a probabilistic framework.

2.1. The matching problem

The input to any binocular stereo system is a pair of images. The task is to match primitives of the two images, thereby solving the correspondence problem. The depth of objects in the scene can then be determined by triangulation, assuming the orientations of the cameras (and other camera parameters) are known. In many stereo theories the disparity, the relative distance between matched features, is first computed. The depth of the feature from the fixation point is then, to first-order approximation, linearly dependent on its disparity.

There are several choices of matching primitives such as edges and peaks in image intensity, intensity itself, or varieties of measures of texture. It is unclear which primitives the human visual system uses. Psychophysics described in section 5 suggests that at least edges and image intensity are used as primitives.

It is desirable to build a stereo theory that is capable of using all these different types of primitives. This will allow us to reduce the complexity of the correspondence problem and will enhance the robustness of the theory and its applicability to natural images. However, not all primitives are equally reliable. A small fluctuation in the image intensity might lead to a large change in the measured disparity for a system that matches intensity. Thus image intensity is usually less reliable than features such as edges.

Some assumptions about the scene are usually necessary to solve the correspondence problem. These can be thought of as natural constraints (Marr 1982) and are needed because of the ill-posed nature of vision (Poggio et al 1985). There are two types of assumption: (i) assumptions about the matching primitives, for example the compatibility constraint that similar features match, and (ii) assumptions about surfaces. For (ii) one typically assumes that either the surface is close to the fixation point (the disparity is small) or that the surface's orientation is close to the fronto-parallel plane (the disparity gradient is small) with a few possible discontinuities. We discuss specific smoothness measures in section 2.2.

Our theory requires both assumptions, though their relative importance depends on the scene. If the features in the scene are sufficiently different then assumption (i) is often sufficient to obtain a good match. If all features are very similar, assumption (ii) is necessary. For really jagged surfaces, such as Bryce Canyon, stereo data cannot be fitted to smooth surfaces, even with discontinuities. In such scenes compatibility constraints are far more important than prior surface assumptions. Fortunately the more jagged the surface the more structure there will be in the image and the more powerful the compatibility constraint. Note that although there may be severe correspondence problems for simple edge features, there will usually be unique matches for clusters of edges or other primitives including intensity. In general we require that the matching is chosen to obtain the smoothest possible surface, so interpolation and matching are performed simultaneously. The next section formalizes these ideas.

The word 'smoothness' is often used loosely in computer vision. Typically a cost function imposing smoothness corresponds to using a quadratic approximation to the cost function for a membrane or a thin plate (Grimson 1981, Terzopoulos 1983). This approximation is only valid for small depth gradients. For the membrane cost function the approximation will cause a bias towards the fronto-parallel plane, rather than causing smoothness. In this paper we will still use the term smoothness, but it should be emphasized that some smoothness cost functions will enforce fronto-parallel biases.

2.2. The first level: matching field and disparity field

The basic idea is that there are several possible primitives that could be used for matching and that these all contribute to a disparity field $d(x)$. This disparity field exists even where there is no source of data. The primitives we will consider here are features, such as edges in image brightness. Edges typically correspond to object boundaries, and other significant events in the image. Other primitives, such as peaks in the image brightness or texture features, can also be added. In the following we describe the theory for the one-dimensional case.

We assume that the edges and other features have already been extracted from the image in a preprocessing stage. The matching elements in the left eye consist of the features $\{x_{i_L}\}$, for $i_L = 1, \ldots, N_l$. The right eye contains features $\{x_{a_R}\}$, for $a_R = 1, \ldots, N_r$. We define a set of binary matching elements $\{V_{i_L a_R}\}$, the matching field, such that $V_{i_L a_R} = 1$ if point $i_L$ in the left eye corresponds to point $a_R$ in the right eye, and $V_{i_L a_R} = 0$ otherwise. A compatibility field $\{A_{i_L a_R}\}$ is defined to be small if $i_L$ and $a_R$ are compatible (i.e. features of the same type) and large if they are incompatible (an edge cannot match a peak). For example, for random dot stereograms the features are all equally compatible and we can set $A_{i_L a_R} = 1$ for all $i_L$, $a_R$.

We now define a cost function $E(d, V)$ of the disparity field and the matching elements. We will interpret this in terms of Bayesian probability theory in the next section. This will suggest several methods of estimating the fields $d$, $V$ given the data. A standard maximum likelihood estimator is to minimize $E(d, V)$ with respect to $d$, $V$.

The first term gives a contribution to the disparity obtained from matching $i_L$ to $a_R$. Its precise form as a quadratic function assumes that the uncertainty in the position of the features can be modelled by independent Gaussian distributions. The second term encourages features to have a single match. The third term imposes a prior constraint on the disparity field, expressed through a suitable differential operator $S$.
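As a minimal sketch of a cost function with the three terms just described, the following Python evaluates one possible discretized first-level energy. The exact functional form is an assumption made for illustration: the disparity is sampled on an integer grid, the data term is taken as $A V (d(x_{i_L}) - (x_{a_R} - x_{i_L}))^2$, the uniqueness term penalizes row and column sums of $V$ that differ from one (with a hypothetical weight `lam`), and the prior is the membrane term weighted by `gamma`.

```python
import numpy as np

def first_level_energy(d, x_left, x_right, V, A, lam, gamma):
    """Evaluate an assumed discretized form of the first-level cost E(d, V).

    d       : disparity field sampled at integer positions 0..len(d)-1
    x_left  : integer positions of left-eye features (index i)
    x_right : integer positions of right-eye features (index a)
    V       : binary matching field, V[i, a] = 1 if i matches a
    A       : compatibility field, small for compatible pairs
    lam     : weight of the uniqueness term (hypothetical name)
    gamma   : weight of the smoothness prior
    """
    d = np.asarray(d, dtype=float)
    xl = np.asarray(x_left)
    xr = np.asarray(x_right)
    # Data term: the disparity at a left feature should equal the offset of its match.
    offset = xr[None, :] - xl[:, None]            # shape (N_l, N_r)
    residual = d[xl][:, None] - offset
    data = np.sum(A * V * residual ** 2)
    # Uniqueness term: each feature should have exactly one match.
    unique = np.sum((V.sum(axis=1) - 1) ** 2) + np.sum((V.sum(axis=0) - 1) ** 2)
    # Prior: membrane smoothness, sum_k (d_k - d_{k+1})^2.
    smooth = np.sum(np.diff(d) ** 2)
    return data + lam * unique + gamma * smooth

# Two features, both shifted by a constant disparity of 2 pixels.
x_left, x_right = [2, 6], [4, 8]
A = np.ones((2, 2))                      # all features equally compatible
d_flat = np.full(10, 2.0)                # constant disparity field
V_good = np.eye(2)                       # correct matches
V_bad = np.array([[0.0, 1.0], [1.0, 0.0]])  # crossed matches
E_good = first_level_energy(d_flat, x_left, x_right, V_good, A, lam=1.0, gamma=1.0)
E_bad = first_level_energy(d_flat, x_left, x_right, V_bad, A, lam=1.0, gamma=1.0)
```

With the constant disparity field, the correct matching has zero cost in this sketch while the crossed matching is penalized by the data term alone, since both matchings satisfy uniqueness.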

The coefficient $\gamma$ determines the amount of a priori knowledge required. If all the features in the left eye have only one compatible feature in the right eye then little a priori knowledge is needed and $\gamma$ may be small. If all the features are compatible then there is matching ambiguity which the a priori knowledge is needed to resolve, requiring a larger value of $\gamma$ and therefore more smoothing. In section 5 we show that this gives a possible explanation for some psychophysical experiments.

Compatibility is enforced multiplicatively in the first term of (1). An alternative would be to choose

The theory can be extended to two dimensions in a straightforward way. The matching elements $V_{i_L a_R}$ must be constrained to allow only for matches that use the epipolar line constraint. The disparity field will have a smoothness constraint perpendicular to the epipolar line which will enforce figural continuity.

Finally, and perhaps most importantly, we must choose a form for the differential operator $S$ which imposes the prior constraints. Marr (1982) proposed that, to make stereo correspondence unambiguous, the human visual system assumes that the world consists of smooth surfaces. This suggests that we should choose a smoothness operator that encourages the disparity to vary smoothly spatially. In practice the assumptions used in Marr and Poggio's two theories of stereo are somewhat stronger. The first theory (Marr and Poggio 1976) encourages matches with constant disparity, thereby enforcing a bias to the fronto-parallel plane. The second theory (Marr and Poggio 1979) uses a coarse-to-fine strategy to match nearby points, therefore encouraging matches with least disparity and by that giving a bias toward the fixation plane.

The simplest smoothness constraint corresponds to approximating a membrane surface. This gives an operator $S = \partial/\partial x$ and leads to a discretized term $\sum_k (d_k - d_{k+1})^2$. We will use this operator as a default choice for our theory. We will use it in conjunction with discontinuity fields, introduced in the next section, which break the smoothness constraint. This specific choice of operator is consistent with the experiments described in section 5.
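The fronto-parallel bias of the membrane term can be seen in a small numerical sketch. The function name and the two test surfaces below are illustrative choices, not part of the theory: a constant disparity field costs nothing, while a perfectly planar but slanted field is penalized in proportion to the square of its slope.

```python
import numpy as np

def membrane_cost(d):
    """Discretized membrane prior: sum_k (d_k - d_{k+1})^2."""
    d = np.asarray(d, dtype=float)
    return float(np.sum(np.diff(d) ** 2))

x = np.arange(11)
fronto_parallel = np.full(11, 3.0)   # constant disparity
slanted = 0.5 * x                    # planar surface with disparity gradient 0.5
cost_flat = membrane_cost(fronto_parallel)
cost_slant = membrane_cost(slanted)
```

Here `cost_flat` is zero and `cost_slant` is 10 steps times 0.25, so minimizing this term pulls the solution toward constant disparity rather than toward smoothness in general.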

We would like to reiterate that the choices of prior are motivated by consistency with the psychophysical experiments described in section 5. Alternate choices, which need not necessarily be implemented by differential operators, may be preferable for computer vision systems.

2.3. The second level: adding discontinuity fields

The first level theory is easy to analyse but makes the a priori assumption that the disparity field is smooth everywhere, which is false at object boundaries. There are several standard ways to allow smoothness constraints to break (Blake 1983, Geman and Geman 1984, Mumford and Shah 1989). We introduce a discontinuity field $l(x)$ represented by a set of curves $C$.

Introducing the discontinuity fields C gives a cost function


where smoothness is not enforced across the curves $C$ and $M(C)$ is the cost for enforcing breaks.
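A minimal one-dimensional sketch of such a breakable prior is given below. Its form is an assumption in the style of standard line-process models: a binary variable `breaks[k]` switches off the membrane term between samples `k` and `k+1`, and the break cost $M(C)$ is taken to be a hypothetical constant `mu` per break.

```python
import numpy as np

def prior_with_breaks(d, breaks, gamma, mu):
    """Membrane prior that is switched off wherever a break is placed.

    breaks[k] = 1 removes the smoothness penalty between d[k] and d[k+1];
    each break costs mu (an assumed form for M(C)).
    """
    d = np.asarray(d, dtype=float)
    l = np.asarray(breaks, dtype=float)
    smooth = np.sum((1.0 - l) * np.diff(d) ** 2)
    return float(gamma * smooth + mu * np.sum(l))

d_step = np.array([0.0, 0.0, 0.0, 4.0, 4.0, 4.0])   # a depth edge
no_break = np.zeros(5)
one_break = np.array([0, 0, 1, 0, 0])                # break placed at the edge
cost_smooth = prior_with_breaks(d_step, no_break, gamma=1.0, mu=2.0)
cost_broken = prior_with_breaks(d_step, one_break, gamma=1.0, mu=2.0)
```

In this example placing a break at the depth edge is cheaper than smoothing across it whenever the squared jump exceeds `mu / gamma`.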

2.4. The third level: adding intensity terms

The final version of the theory couples intensity based and feature based stereo. Psychophysical results (see section 5.2) suggest that this is necessary. Our cost function becomes

The new addition is the intensity term. Its precise form as a quadratic penalty term can be derived (Cernushi-Frias et al 1989) from a model assuming that the two images are corrupted by independent Gaussian noise.
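A quadratic intensity term of this kind can be sketched as a sum of squared differences between the left image and the right image sampled at the disparity-shifted position. The form $\sum_x (L(x) - R(x + d(x)))^2$, the integer disparities and the handling of pixels shifted out of view are all assumptions made for this illustration.

```python
import numpy as np

def intensity_term(left, right, d):
    """Sum over x of (L(x) - R(x + d(x)))^2 for integer disparities d."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    d = np.asarray(d, dtype=int)
    x = np.arange(len(left))
    target = x + d
    valid = (target >= 0) & (target < len(right))   # skip pixels shifted out of view
    diff = left[x[valid]] - right[target[valid]]
    return float(np.sum(diff ** 2))

left = np.array([0.0, 1.0, 5.0, 2.0, 0.0, 0.0, 0.0])
right = np.roll(left, 2)                 # right image: left image shifted by 2 pixels
cost_true = intensity_term(left, right, np.full(7, 2))   # correct disparity
cost_zero = intensity_term(left, right, np.zeros(7))     # wrong disparity
```

With noise-free images the correct disparity gives zero cost; with noisy intensities the term would fluctuate, which is why it is combined with the smoothness prior.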

If certain terms are set to zero in (3) it reduces to previous cost function formulations of stereo. If the second and fourth terms are kept, without allowing discontinuities, it is similar to work by Gennert (1987) and Barnard (1989). If we add the fifth term, and allow discontinuities, we get connections to some work described in Yuille (1989) (done in collaboration with T Poggio). The third term will again be removed in the final version of the theory.

Thus the cost function (3) reduces to well known stereo theories in certain limits. It also shows how it is possible to combine feature and brightness data in a natural manner.

2.5. The Bayesian formulation

Given a cost function model one can define a corresponding statistical theory. It is now convenient to consider the cost function as an energy function, though it need not correspond to the energy of any known physical system. If the energy $E(d, V, C)$ depends on three fields, $d$ (the disparity field), $V$ (the matching field) and $C$ (the discontinuities), then (using the Gibbs distribution; see Parisi (1988)) the probability of a particular state of the system is defined by

$$P(d, V, C \mid g) = \frac{e^{-\beta E(d, V, C)}}{Z} \qquad (4)$$

where $g$ is the data, $\beta$ is the inverse of the temperature parameter and $Z$ is the partition function (a normalization constant).

Using the Gibbs distribution we can interpret the results in terms of Bayes' formula (Bayes 1783)

$$P(d, V, C \mid g) = \frac{P(g \mid d, V, C)\, P(d, V, C)}{P(g)} \qquad (5)$$

where $P(g \mid d, V, C)$ is the probability of the data $g$ given a scene $d, V, C$, $P(d, V, C)$ is the a priori probability of the scene and $P(g)$ is the a priori probability of the data. Note that $P(g)$ appears in the above formula as a normalization constant, so its value can be determined if $P(g \mid d, V, C)$ and $P(d, V, C)$ are assumed known.

The fundamental concept for our model is the probability distribution $P(d, V, C \mid g)$. The Gibbs distribution is merely a convenient tool to help specify it. Observe that quadratic energy terms will correspond to Gaussian probability distributions and that adding energy terms corresponds to multiplying probabilities.

The Gibbs distribution implies that every state of the system has a finite probability of occurring. The more likely ones are those with low energy. This statistical approach is attractive because the $\beta$ parameter gives us a measure of the uncertainty of the model through the temperature parameter $T = 1/\beta$. At zero temperature ($\beta \to \infty$) there is no uncertainty. In this case the lowest energy state is the only state of the system that has non-zero probability. Therefore the state that globally minimizes $E(d, V, C)$ has probability one. In some degenerate situations, however, there could be more than one global minimum of $E(d, V, C)$.

Minimizing the energy function will correspond to finding the most probable state, the maximum a posteriori estimator, and is independent of the value of $\beta$. The mean field solution,

is more general and reduces to the most probable solution as $T \to 0$. It corresponds to defining the solution to be the mean fields, the averages of the $f$ and $l$ fields over the probability distribution. This enables us to obtain different solutions depending on the uncertainty.
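The relation between the mean field solution and the most probable state can be illustrated on a toy system with three states. The energies and the disparity value attached to each state are hypothetical numbers chosen for this sketch; only the Gibbs form $e^{-\beta E}/Z$ is taken from the text.

```python
import numpy as np

def gibbs(energies, beta):
    """Gibbs probabilities P_s = exp(-beta * E_s) / Z over a finite set of states."""
    e = np.asarray(energies, dtype=float)
    w = np.exp(-beta * (e - e.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

states_d = np.array([1.0, 2.0, 4.0])    # hypothetical disparity value of each state
energies = np.array([1.0, 0.0, 3.0])    # hypothetical energies; state 1 is the minimum

p_hot = gibbs(energies, beta=0.01)      # high temperature: nearly uniform
p_cold = gibbs(energies, beta=50.0)     # low temperature: concentrated on the minimum

mean_hot = float(np.sum(states_d * p_hot))    # mean field estimate at high T
mean_cold = float(np.sum(states_d * p_cold))  # approaches the MAP value 2.0
```

At high temperature the mean is a blend of all states, while as $T \to 0$ it converges to the disparity of the lowest energy state.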

In this paper we concentrate on the mean quantities of the field. An additional justification for using the mean field is that it represents the minimum variance Bayes estimator (Gelb 1974). More precisely, the variance of the field $d$ is given by

where $\bar d$ is the centre of the variance and $\sum_{d,V,C}$ represents the sum over all the possible configurations of $d$, $V$, $C$. Minimizing $\mathrm{Var}(d : \bar d)$ with respect to all possible values of $\bar d$ we obtain

This implies that the minimum variance estimator is given by the mean field value.
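The step from the variance to the mean field value can be written out as follows. This is a sketch that assumes the variance has the usual form of an expected squared deviation about the centre $\bar d$ under the distribution $P(d, V, C \mid g)$, and uses only the normalization $\sum_{d,V,C} P(d, V, C \mid g) = 1$.

```latex
\begin{align}
\mathrm{Var}(d : \bar d) &= \sum_{d,V,C} (d - \bar d)^2 \, P(d, V, C \mid g), \\
\frac{\partial}{\partial \bar d} \mathrm{Var}(d : \bar d)
  &= -2 \sum_{d,V,C} (d - \bar d) \, P(d, V, C \mid g) = 0
  \;\Longrightarrow\;
  \bar d = \sum_{d,V,C} d \, P(d, V, C \mid g).
\end{align}
```

The second derivative with respect to $\bar d$ equals $2$, so the stationary point is a minimum.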

3. Statistical mechanics and mean field theory

In this section we describe techniques from statistical physics which can be applied to the Gibbs distribution (4). There are two main uses of these techniques: (i) we can obtain methods for finding optimal estimators of the probability distributions (4), and (ii) we can eliminate (or average out) different fields from the energy function to obtain effective energies depending on only some fields (therefore relating our model to previous theories).


3.1. Optimization using statistical mechanics

One can exploit statistical physics to estimate the most probable states of the probability distribution (4) by, for example, using Monte Carlo techniques (Metropolis et al 1953) and the simulated annealing approach (Kirkpatrick et al 1983). The drawback of these methods is the amount of computer time needed for the implementation.

Alternatively we can use (Yuille et al 1989) mean field theory methods to obtain deterministic algorithms. This can be thought of as deterministic annealing and it has been successfully applied to a variety of problems (for example Durbin and Willshaw 1987, Geiger and Girosi 1991, Simic 1990, Peterson and Soderberg 1989). When using these mean field methods it is important not to impose global constraints on variables (such as those requiring each feature point to have a unique match) by energy biases, sometimes called weak constraints. Instead it is better to use strong constraints which enforce the constraints by restricting the space of possible configurations of the system when estimating the mean fields. This is illustrated by the far greater empirical success of the elastic net model (Durbin and Willshaw 1987) over the Hopfield and Tank (1985) model, which uses energy biases, as an algorithm for the travelling salesman problem.
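What a strong constraint does to a mean field computation can be sketched for a simplified case. Assume, purely for illustration, an energy that is linear in the matching field, $E = \sum_{ia} V_{ia} c_{ia}$ with hypothetical costs `cost[i, a]`, and restrict the configurations so that every left feature has exactly one match. Summing the Gibbs weights over these one-hot rows gives a normalized exponential (softmax) for the mean of each row, so uniqueness per row holds exactly at every temperature rather than being encouraged by a penalty.

```python
import numpy as np

def mean_match_strong(cost, beta):
    """Mean field of V when each row is restricted to exactly one match.

    Summing exp(-beta * sum_ia V_ia * cost_ia) over one-hot rows gives,
    for each row, a softmax over the candidate matches.
    """
    c = np.asarray(cost, dtype=float)
    z = -beta * (c - c.min(axis=1, keepdims=True))   # shift rows for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)

cost = np.array([[0.1, 2.0, 3.0],
                 [2.5, 0.2, 1.0]])       # hypothetical matching costs
V_warm = mean_match_strong(cost, beta=0.1)    # soft assignment at high temperature
V_cold = mean_match_strong(cost, beta=100.0)  # nearly binary at low temperature
```

Lowering the temperature gradually from `V_warm` toward `V_cold` is the deterministic annealing idea; this sketch enforces only the row constraint and ignores the coupling to the disparity field.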

In this paper we are not explicitly concerned with finding the optimal estimators for our theory. Initial simulations using mean field theory were successful but more traditional algorithms, such as dynamic programming, may be preferable.

3.2. Unification by averaging out fields

A second use of statistical physics techniques is to average out fields to obtain different, but related, theories. Such methods have been recently used to show (Geiger and Girosi 1991, Geiger and Yuille 1991) that several seemingly different approaches to image segmentation are closely related. They have been used (Simic 1990, Yuille 1990) to show the close relationship between the Hopfield and Tank (1985) and the elastic net (Durbin and Willshaw 1987) approaches to the travelling salesman problem. These techniques are closely related to the Legendre transform.

We illustrate these methods by considering the first-level theory. It can be expressed in terms of an energy $E(d, V)$ (1) and corresponds to a probability distribution

Two alternative theories can be obtained by computing the marginal probability distributions $P_D(d) = \sum_V P(d, V)$ and $P_M(V) = \int_d P(d, V)$. Equivalently we can write the partition function $Z = \int_d \sum_V \exp(-\beta E(d, V))$ and by integrating out the $d$ field or summing out the $V$ field we can obtain partition functions for systems containing only $V$ variables or $d$ variables. Using standard statistical physics techniques we can differentiate these partition functions to obtain consistency equations for the mean values of the $V$ and $d$ fields respectively. Deterministic annealing can then be used to find solutions to these equations (Yuille et al 1989).

We now show explicitly how to derive the theory $P_M(V)$. This theory is similar to cooperative stereo algorithms (see section 4).

Since $E(d, V)$ is quadratic in $d$ it follows that $P(d, V)$ is a product of Gaussians in $d$. Thus $P(d, V)$ can be integrated with respect to $d$ by completing the square.


This is equivalent to minimizing $E(d, V)$ with respect to $d$, solving for $d$ as a function of $V$ to obtain $d^*(V)$, and writing

where $E_{\mathrm{eff}}(V) = E(d^*(V), V)$. For the first-level theory, assuming all features are compatible, our energy function is

with Euler-Lagrange equations

The solutions of this equation are given by

where the $G(x, x_{i_L})$ are the Green functions of the operator $S^2$ and the $\alpha_{i_L}$ obey

Substituting this back into the energy function (assuming that each feature is matched precisely once, the uniqueness constraint) eliminates the disparity field and yields

This calculation shows that the disparity field is, strictly speaking, unnecessary since the theory can be formulated as in (15). We discuss the connection to cooperative stereo algorithms in section 4. Note that the global constraints are imposed by energy biases†.
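The claim that integrating out a quadratic field is equivalent to minimizing over it can be checked numerically in a scalar analogue. The coefficients below are hypothetical, and a single variable $d$ stands in for the whole disparity field; the actual calculation involves Green functions of the smoothness operator.

```python
import numpy as np

# Hypothetical coefficients of a scalar quadratic energy E(d) = a d^2 - 2 b d + c.
a, b, c, beta = 2.0, 3.0, 5.0, 1.5

def energy(d):
    return a * d ** 2 - 2.0 * b * d + c

d_star = b / a                 # minimizer of the quadratic
E_eff = energy(d_star)         # effective energy, equal to c - b^2 / a

# Numerical integral of exp(-beta E(d)) over d on a wide, fine grid (Riemann sum).
grid = np.linspace(d_star - 20.0, d_star + 20.0, 400001)
step = grid[1] - grid[0]
numeric = float(np.sum(np.exp(-beta * energy(grid))) * step)

# Closed form obtained by completing the square:
# integral = sqrt(pi / (beta a)) * exp(-beta E(d_star)).
closed = float(np.sqrt(np.pi / (beta * a)) * np.exp(-beta * E_eff))
```

The two values agree, and the prefactor $\sqrt{\pi/(\beta a)}$ does not depend on $b$ or $c$. In this scalar sketch, then, all dependence on the remaining variables enters through the minimized energy.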

In Yuille et al (1989) we show how to average out other fields for the first-, second- and third-level theories.

† A similar calculation (Yuille and Grzywacz 1989) showed that minimal mapping theory (Ullman 1979) was a special case of the motion coherence theory.


4. Comparisons with other theories

In the previous section we have used methods from statistical physics to analyse our theory. In particular, we have shown how we can eliminate some fields to obtain equivalent, but apparently different, theories. In this section we show that this helps compare our work to previous theories.

4.1. The cooperative stereo algorithm

The cooperative stereo algorithm (Dev 1975, Marr and Poggio 1976) contains binary units $C_{ia}$ for each lattice point $i$ in the left image and $a$ in the right image. Note that, unlike our $V_{i_L a_R}$, these units do not occur only where there are features. The $C_{ia}$ are initialized to be 1 if there are features at $i$ and $a$ in the left and right images, and therefore a potential match; otherwise they are 0.

The algorithm corresponds to doing iterative gradient descent with an energy function

where $T_{ijab}$ imposes excitation for matches with similar disparity and inhibition for more than one match per feature. A specific choice of $T_{ijab}$ might be

We now compare the cooperative stereo algorithm to our theory without line processors. This has the energy function

Comparing the two energy functions we see several similarities. There is a general excitation for a match in the direction of constant disparity, due to the first term, and inhibition of matching in the viewing directions, due to the second term. Both theories suffer from imposing global constraints as energy biases (see section 3.1) and hence suffer from the same weaknesses as that of Hopfield and Tank (1985). The principal difference is that our method only has matching elements at feature points, rather than everywhere in the image.

The cooperative stereo algorithm can be seen (Yuille et al 1989) to be the zero-temperature limit of a statistical physics algorithm where global constraints are imposed by energy biases. Thus the algorithm can be improved in two ways: (i) introducing a temperature and using deterministic annealing, (ii) imposing the global constraints by the strong methods described in section 3.1.
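The flavour of a cooperative update can be conveyed by a small sketch on a single epipolar line. This is an illustrative threshold rule in the spirit of the cooperative algorithms, not a reproduction of any published parameter set: units `C[i, a]` are excited by active neighbours of the same disparity `a - i`, inhibited by other active units sharing the same left position `i` or right position `a`, and biased by the initial state `C0`. The names `eps`, `theta` and `radius` are hypothetical.

```python
import numpy as np

def cooperative_step(C, C0, eps=0.5, theta=2.5, radius=2):
    """One synchronous threshold update of binary match units C[i, a]."""
    n_l, n_r = C.shape
    new = np.zeros_like(C)
    for i in range(n_l):
        for a in range(n_r):
            # Excitation: neighbours along the line of constant disparity a - i.
            excite = 0.0
            for k in range(-radius, radius + 1):
                j, b = i + k, a + k
                if k != 0 and 0 <= j < n_l and 0 <= b < n_r:
                    excite += C[j, b]
            # Inhibition: other units in the same row (same i) or column (same a).
            inhibit = C[i, :].sum() + C[:, a].sum() - 2.0 * C[i, a]
            total = excite - eps * inhibit + C0[i, a]
            new[i, a] = 1.0 if total >= theta else 0.0
    return new

# Six features at zero disparity (the diagonal) plus two false targets.
C0 = np.eye(6)
C0[1, 4] = 1.0
C0[4, 1] = 1.0
C1 = cooperative_step(C0.copy(), C0)
C2 = cooperative_step(C1, C0)
```

In this example the false targets receive no support from neighbours of equal disparity and are switched off after one step, leaving the constant-disparity matches as a fixed point. Uniqueness is only encouraged through the inhibition weight, which is the energy-bias form of constraint discussed above.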

The disparity gradient limit theories (Pollard et al 1985, Prazdny 1985) also can be thought of as extremizing a cost function similar to (18), the difference being that the support for a specific match is over all matches in the neighbourhood whose disparity gradient is below a certain cut-off value. Thus the directions of excitation are more general than for the cooperative stereo algorithm, and the theory therefore makes fewer prior assumptions about the surface.

Moreover these theories also use a version of strong constraints for matching (instead of using energy biases). The strategy is to fix the indices $i$, $a$ and then for each $j$ in a suitable neighbourhood of $i$ to do winner-take-all on $b$ (imposing a unique winner). This gives the 'support' for $i$ to be matched to $a$. Winner-take-all is again employed to find the best match for $i$. The process is repeated for matching from the other eye and the results are checked for consistency.

The disparity gradient limit theories seem preferable to the original cooperative stereo algorithms since they have a more general prior and use a version of strong constraints. Moreover computer implementations of the theory give impressive results on real data (Frisby 1991). Despite the many advantages of the disparity gradient theories we argue that it is preferable to represent depth explicitly by surfaces, using more than one surface (Yuille et al 1990) if transparency occurs. Such a representation in terms of surfaces is easier to integrate with other depth information and can be used to incorporate domain specific knowledge (Clark and Yuille 1990). Moreover (see the next section) it seems that the disparity gradient theories cannot account for the experimental data reported here.

5. Comparisons to psychophysics

In this section we describe the relationship between our theory and psychophysics. We will chiefly be concerned with two experiments (Bülthoff and Fahle 1989, Bülthoff and Mallot 1987) which provide measurements of the perceived depth as a function of the real depth and the matching primitives by use of reference systems (or depth probes). These experiments are particularly useful for our purposes because of (i) the quantitative depth information they supply, and (ii) their investigation of which features are used for matching and how the perceived depth depends on these features.

One general conclusion from these experiments is that objects perceived stereoscopically tend to be biased toward the fronto-parallel plane and the degree of this bias depends on the features being matched. This is in general agreement with our theory in which the disparity smoothness term causes such a bias with a magnitude depending on the features. We should emphasize that smoothness is required for two related purposes: (i) reducing the matching ambiguity for features, and (ii) reliably matching the intensity. The more incompatible the features (as imposed by the compatibility terms) then the less the matching ambiguity and hence the less smoothness required. Smoothness is required for intensity matching to reduce the effect of noise fluctuations in the intensity values.

We emphasize, as discussed in section 2.1, that since quadratic smoothness constraints only impose smoothness approximately (Grimson 1981, Terzopoulos 1983) these constraints can give the fronto-parallel biases required. In this paper we make no attempt to derive these constraints from first principles but simply compare the effectiveness of different constraints for consistency with the experiments.


5.1. The disparity gradient experiments

For these experiments (Bülthoff and Fahle 1989) the observer was asked to estimate the disparity of stereo stimuli relative to a set of reference lines (figure 1). The stimuli were either lines or pairs of dots or simple geometric features. The stimuli were shown in three different orientations (0°, 45° and 90° angle of the line stimuli or the line joining the pairs of features). The perceived depth was plotted as a function of the disparity and the disparity gradient. These were calculated, assuming features at $x_1$, $x_2$ ($x_2 > x_1$) and $y_1$, $y_2$ ($y_2 > y_1$) in the left and right eyes, using the formulae (Burt and Julesz 1980) for: (i) disparities $d_1 = (x_1 - y_1)$, $d_2 = (x_2 - y_2)$, (ii) the binocular disparity $d_b = d_1 + d_2$, and (iii) the disparity gradient $d_{\mathrm{grad}} = d_b / \{(x_1 + y_1) - (x_2 + y_2)\}$.

Figure 1. Illustration of the disparity gradient experiment (left-eye and right-eye panels). Stimuli (points, lines or simple symbols as shown here) are projected together with depth reference lines on a CRT monitor for short-term presentation (7 s). The following parameters were varied in different experiments: disparity, disparity gradient and orientation (here shown with 0°). The experimental equivalent can be experienced by uncrossed fusion of the left and right stereo images. The square symbol should appear behind the zero disparity plane at about the height of the third line from the top (15 arc min). Redrawn from Bülthoff et al (1991).

The experiments showed that the perceived disparity decreased with increasing disparity gradient. This effect was: (i) strongest for horizontal lines, (ii) strong for pairs of dots, (iii) weak for dissimilar features (e.g. different symbols as in figure 1) and (iv) weak for non-horizontal lines (figure 2).

Our explanation assumes these effects are due to the matching strategy and is based on the second-level theory, with energy function given by (2). The idea is that the smoothness term (the third term) is required to give unique matching but that its importance, measured by $\gamma$, increases as the features become more similar and the matching ambiguity increases. For random dot stereograms all features in one image are compatible with all features in the other, so a lot of smoothness is required. However, if the features are sufficiently different (perhaps pre-attentively discriminable) then there is no matching ambiguity, so the correct disparities are obtained. If the features are similar then smoothness (or some other a priori assumption) must be used to obtain a unique match, leading to biases toward the fronto-parallel plane. The greater the similarity between features the more the need for smoothness and therefore the stronger the bias toward the fronto-parallel plane. The discontinuity field is switched on at both the points, ensuring that smoothness is only imposed between the two points. Thus the two points are considered the boundaries of an object and only the object itself is smoothed. We assume $\gamma$ is a global variable but do not give a prescription for calculating it here. We are investigating the possibility of estimating it using standard statistical techniques such as cross validation and Bayesian techniques.

[Figure 2 panels: symbols at 90°, 45° and 0°; points at 90°, 45° and 0°; lines at 90°, 45° and 0°.]

Figure 2. Disparity gradient and matching primitives: perceived depth decreases with increasing depth gradient and depends largely on the matching primitive (points, lines or symbols) and orientation (0°, 45°, 90°) relative to the epipolar line. Each data item represents the mean of nine different disparities (3-27 arc min) tested with ten subjects. The standard errors of the means are in the order of the symbol size. Redrawn from Bülthoff and Fahle (1991).

We now analyse the second-level theory and show that it predicts the falloff of perceived disparity with disparity gradient, provided we choose the smoothness operator to be the first derivative of the disparity. The change of rate of falloff for different types of features is due to varying $\gamma$ as described above.

Suppose we have features $x_1$, $x_2$ in the left eye and $y_1$, $y_2$ in the right eye. We define matching elements $V_{ai}$ so that $V_{ai} = 1$ if point $a$ in the first image matches point $i$ in the second image, $V_{ai} = 0$ otherwise. The matching strategy assumes we minimize

with respect to the $V_{ai}$ and the disparity field $d(x)$. We assume $d(x)$ is only defined between the dots, due to the discontinuities at the boundaries, and $L$ is a differential operator. Apparently only some operators $L$ are consistent with the experiments.

Assume that $L = \partial/\partial x$, which corresponds to a smoothness term

i.e. just minimizing the gradient of the disparity field. Suppose the minimum energy state corresponds to matching $x_1$ with $y_1$ and $x_2$ with $y_2$. Then we have points $z_1 = x_1/2 + y_1/2$ and $z_2 = x_2/2 + y_2/2$ with measured disparities $\alpha_1 = x_1 - y_1$ and $\alpha_2 = x_2 - y_2$ (assume $z_2 > z_1$).

The perceived disparity will correspond to the field $d(x)$. It is obtained by minimizing

The minimum will occur when $d(x)$ is a straight line passing through $d(z_1)$ at $z_1$ and $d(z_2)$ at $z_2$ (where $d(z_1)$ and $d(z_2)$ are to be determined). This gives

Minimizing with respect to $d(z_1)$ and $d(z_2)$ gives, assuming $\gamma \ll (z_2 - z_1)$,

Thus the perceived disparity equals the true disparity minus $2\gamma$ times the disparity gradient. This seems in general agreement with the experiments. As the features become less distinguishable, more smoothness is required to give an unambiguous match, causing $\gamma$ to be increased and increasing the gradient of the falloff.
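The displayed equations of this derivation do not survive in the present copy. The following is a reconstruction consistent with the surrounding text, not a verbatim restoration. It assumes unit weight on the data terms, weight $\gamma$ on the smoothness term, measured disparities written $\alpha_1$, $\alpha_2$, and matched points located at $z_{ai} = (x_a + y_i)/2$ (a notation introduced here only for illustration).

```latex
\begin{align}
% General form (assumed): data terms for the matched features plus smoothness.
E[V, d] &= \sum_{a,i} V_{ai}\,\bigl(d(z_{ai}) - (x_a - y_i)\bigr)^2
           + \gamma \int (L d)^2 \, dx \\
% With L = \partial/\partial x and the matches x_1 <-> y_1, x_2 <-> y_2:
E[d] &= \bigl(d(z_1) - \alpha_1\bigr)^2 + \bigl(d(z_2) - \alpha_2\bigr)^2
        + \gamma \int_{z_1}^{z_2} \Bigl(\frac{\partial d}{\partial x}\Bigr)^2 dx \\
% A straight line between the end points has constant slope, so
E &= \bigl(d(z_1) - \alpha_1\bigr)^2 + \bigl(d(z_2) - \alpha_2\bigr)^2
     + \gamma\,\frac{\bigl(d(z_2) - d(z_1)\bigr)^2}{z_2 - z_1} \\
% Setting the derivatives with respect to d(z_1) and d(z_2) to zero:
d(z_1) &= \alpha_1 + \gamma\,\frac{d(z_2) - d(z_1)}{z_2 - z_1}, \qquad
d(z_2) = \alpha_2 - \gamma\,\frac{d(z_2) - d(z_1)}{z_2 - z_1} \\
% Subtracting and expanding for \gamma \ll z_2 - z_1:
d(z_2) - d(z_1) &= \frac{\alpha_2 - \alpha_1}{1 + 2\gamma/(z_2 - z_1)}
  \approx (\alpha_2 - \alpha_1) - 2\gamma\,\frac{\alpha_2 - \alpha_1}{z_2 - z_1}
\end{align}
```

Under these assumptions the last line matches the stated result, with the disparity gradient read as $(\alpha_2 - \alpha_1)/(z_2 - z_1)$: the perceived disparity difference is the true one reduced by $2\gamma$ times that gradient.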

The results are not consistent with several possible choices of the smoothness operator, such as the second derivative of the disparity $\partial^2 d/\partial x^2$. It is straightforward to calculate that this choice will give $d(z_1) = \alpha_1$ and $d(z_2) = \alpha_2$, and therefore does not bias toward the fronto-parallel plane. It is likely that the smoothness operator $S$ must contain a $\partial/\partial x$ term to ensure the observed fronto-parallel bias.
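As a numerical check of this claim, the sketch below discretizes a two-point energy and solves it for both operators. It assumes an energy of the form (d(z1) - alpha1)^2 + (d(z2) - alpha2)^2 + gamma * integral of (L d)^2, which is a reconstruction rather than the paper's displayed equation. The function name `perceived_disparity`, the grid size and the sample values are hypothetical.

```python
import numpy as np

def perceived_disparity(alpha1, alpha2, z1, z2, gamma, order, n=51):
    """Minimise (d(z1)-alpha1)^2 + (d(z2)-alpha2)^2 + gamma * int (L d)^2 dx.

    order=1 uses L = d/dx, order=2 uses L = d^2/dx^2, both by finite differences.
    Returns the disparity field d sampled on n grid points from z1 to z2.
    """
    h = (z2 - z1) / (n - 1)
    # Rows of D are finite-difference stencils of the requested order.
    D = np.diff(np.eye(n), n=order, axis=0) / h**order
    # The data terms only involve the two end points (the matched features).
    P = np.zeros((n, n))
    P[0, 0] = P[-1, -1] = 1.0
    b = np.zeros(n)
    b[0], b[-1] = alpha1, alpha2
    # The energy is quadratic in d, so its minimum solves a linear system.
    return np.linalg.solve(P + gamma * h * (D.T @ D), b)

# Hypothetical values: disparities in arc min, positions in the same units.
alpha1, alpha2, z1, z2, gamma = 0.0, 10.0, 0.0, 20.0, 1.0
for order in (1, 2):
    d = perceived_disparity(alpha1, alpha2, z1, z2, gamma, order)
    print(f"order {order}: perceived disparity difference = {d[-1] - d[0]:.4f}")
```

With the first-derivative operator the end-point difference shrinks by the factor 1/(1 + 2*gamma/(z2 - z1)), whereas the second-derivative operator leaves it unchanged, because a straight line through the measured disparities already has zero second derivative.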

5.2. The matching primitive experiments

These experiments (Bülthoff and Mallot 1987, 1988) compared the relative effectiveness of image intensity and edges as matching primitives. The stimuli were chosen to give a three-dimensional perception of an ellipsoid. The observer used a stereo depth probe to make a pointwise estimate of the perceived shape (figure 3(a)). The data show that depth can be derived from images with disparate shading even in the absence of disparate edges. The perceived depth, however, was weaker for shading disparities (about seventy per cent of the true depth).

Putting in edges or features helped improve the accuracy of the depth perception. But in some cases these additional features appeared to decouple from the intensity and were perceived to lie above the depth surface generated from the intensity disparities. These results are again in general agreement with our model. The edges give good estimates of disparity and so little a priori smoothness is required. The disparity estimates from the intensity, however, are far less reliable because small fluctuations of intensity might yield large fluctuations in the disparity. Therefore more a priori smoothness is required to obtain a stable result. This causes a weaker perception of depth or a bias to fronto-parallel.
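A toy illustration of this trade-off follows, under assumptions that are not part of the theory above: dense disparity measurements along a single epipolar line, a first-difference smoothness penalty with weight `gamma`, and a semicircular depth profile standing in for the ellipsoid. The name `smoothed_depth` and all numerical values are hypothetical.

```python
import numpy as np

def smoothed_depth(a, gamma):
    """Minimise sum_i (d_i - a_i)^2 + gamma * sum_i (d_{i+1} - d_i)^2 over d."""
    n = len(a)
    D = np.diff(np.eye(n), axis=0)  # first-difference operator, shape (n-1, n)
    # Zero gradient of the quadratic energy gives (I + gamma * D^T D) d = a.
    return np.linalg.solve(np.eye(n) + gamma * (D.T @ D), a)

# Hypothetical stimulus: semicircular depth profile sampled at 41 points.
x = np.linspace(-1.0, 1.0, 41)
true_depth = np.sqrt(np.clip(1.0 - x**2, 0.0, None))

for gamma in (0.0, 5.0, 50.0):
    d = smoothed_depth(true_depth, gamma)
    ratio = (d.max() - d.min()) / (true_depth.max() - true_depth.min())
    print(f"gamma = {gamma:5.1f}: recovered / true depth range = {ratio:.2f}")
```

In this sketch a larger `gamma` flattens the recovered profile, so the recovered depth range is a smaller fraction of the true one. This is only meant to mirror qualitatively the weaker depth reported for shading disparities, not to reproduce the measured percentages.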


[Figure 3 panel labels: with edges (E+), no edges (E-); curve labels: edges, smooth, ideal.]

Figure 3. Edge versus intensity-based stereo: (a) Ellipsoidal surfaces with or without edge information were stereoscopically displayed on a CRT monitor. The perceived depth map of flat-shaded (edges) and smooth-shaded (no edges) ellipsoids of rotation was measured with a local stereo depth probe for different elongations c of the main axis. The main axis was perpendicular to the display screen, i.e. the objects were viewed end-on. (b) The coefficient of the first principal component is used as a global measure of perceived elongation for different displayed elongations of the depth map data. Less reliable information (smooth surfaces without edges) puts more weight on surface priors and leads to a fronto-parallel bias of surface perception. Redrawn from Bülthoff and Mallot (1988).


The use of the intensity peak as a matching feature is vital (at least for the edgeless case) since it ensures that the image intensity is accurately matched. Some stereo theories based purely on intensity give an incorrect match for these stimuli (Bülthoff and Yuille 1990). For the ellipsoids, however, the peak is difficult to localize and depth estimates based on it are not very reliable. Thus the peak cannot pull the rest of the surface to the true depth.

Bülthoff and Mallot (1988) found that pulling up did occur for the edgeless case if a dot was added at the peaks of the images (figure 4). This is consistent with our theory since, unlike the peaks, the dots are easily localized and matching them would give a good depth estimate. Our present theory, however, is not consistent with a perception that sometimes occurred for this stimulus. Some observers saw the dots as lying above the surface rather than being part of it. This may be explained by the extension of our theory to transparent surfaces (Yuille et al 1990).

[Figure 4 panel titles: Surface interpolation and pulling. Upper panels: dense edge-based and intensity-based stereo; sparse edge-based and intensity-based stereo; intensity-based stereo (perceived depth: 70%); sparse edge-based stereo and shape from shading. Lower panels: stereo edge and shading, subjects ISA & HAM: 83%; stereo edge and shading, subject HHB: 41%.]

Figure 4. The four upper figures show surface interpolation for: (i) dense edges and intensity-based stereo, (ii) sparse edges and intensity-based stereo, (iii) intensity-based stereo, and (iv) sparse edges in stereo and shape-from-shading. The two lower figures illustrate pulling: an edge token in front of intensity patterns with no relative disparity (no intensity-based stereo) pulls the surface up (left figure) but can sometimes cause a transparency percept (right figure) of the token lying in front of the intensity surface.


5.3. Summary of psychophysics

We have shown that the experiments described above can be explained in terms of a Bayesian model (of our type) with suitably chosen priors. It is unclear whether other existing stereo theories could explain them. Most theories, such as the coarse-to-fine algorithm (Marr and Poggio 1979) and the disparity gradient limit theories (Prazdny 1985, Pollard et al 1985), are designed to get the correct depth if the correspondence problem is correctly solved. Therefore these theories will not give the types of fronto-parallel biases reported here.

Other examples of fronto-parallel bias occur in the psychophysics literature. A similar poor perception of horizontal disparity gradients was reported previously by Seagrim (1967). Our theory might provide an alternative explanation to Rogers and Graham's finding of anisotropies in the perception of three-dimensional surfaces†. In their remarkable paper on a depth equivalent of the Craik-O'Brien-Cornsweet illusion they demonstrate that for a horizontal Cornsweet depth profile the flanks of the surface are perceived at different depth levels (see figure 5), while this is not the case for a vertical depth profile (Rogers and Graham 1983). Instead of their explanation in terms of expansion and shear transformations we can explain the horizontal/vertical anisotropy with our second-level theory. For the horizontal case the line processors preserve the steep disparity gradient at the centre of the signal while the smoothness terms drag the left-hand side up and the right-hand side down, thereby making the left-hand side appear closer to the viewer than the right-hand side. In our theory smoothness is enforced most strongly in the horizontal direction, so as to resolve the matching ambiguity along the epipolar lines, and is weaker in the vertical direction where it is only needed to enforce similarity of matching between neighbouring epipolar lines (figural continuity). Thus for the vertical case we would only expect weak smoothing, which is consistent with their experiments.

Figure 5. Part (a) shows the depth profile of the Rogers and Graham experiment. If the profile is displayed horizontally (i.e. so that it projects to corresponding epipolar lines in the two images) then our theory would predict smoothing the signal, except at places with large disparity gradients, thereby giving the perception in (b), which is similar to the experimental results. If the profile is displayed vertically then our theory would predict little smoothing, which again is consistent with the experiment. Note that the lower the temperature at which we run our system the more we will preserve the large disparity gradients.

† We are grateful to G Mitchison for suggesting this connection.


For another class of experiments (Mitchison and Westheimer 1990) there is a very strong bias toward the fronto-parallel plane which seems inconsistent with our theory (at least it is stronger than we would expect). Their explanation, based on invariance under head-eye movements, is however unable to account for our effects. There seems no way to get the differential fronto-parallel bias depending on the form of the features discussed in section 5.1 nor to obtain the squashing of the ellipsoid described in section 5.2. Since our theory and Mitchison and Westheimer's are complementary it seems likely that they could be combined in an extended version that includes head and eye movements.

6. Conclusion

We have derived a theory of stereo on theoretical grounds using the Bayesian approach to vision. This theory can incorporate most of the desirable elements of stereo and it is closely related to several existing theories. The theory can combine information from matching different primitives, which is desirable on computational and psychophysical grounds. The formulation can be extended to include monocular depth cues, domain-dependent priors and competitive priors (Clark and Yuille 1990).

A basic assumption of our work is that correspondence and interpolation should be performed simultaneously. This is related to the experimental and theoretical work of Mitchison (1988) and Mitchison and McKee (1987) which suggests that the initial matching uses a planarity assumption.

The use of mean field theory allows us to average out fields and to make mathematical connections between different formulations of stereo. It also suggests novel algorithms for computing the estimators (see section 3.2). We argue that these algorithms are likely to be more effective than existing cooperative algorithms which impose global constraints with energy biases.

Finally, and perhaps most interestingly, the theory is consistent with some psychophysical experiments (Bülthoff and Mallot 1987, 1988, Bülthoff and Fahle 1989, 1991) for a suitable choice of priors. We are currently investigating this further. More complex scenes probably require more sophisticated priors than those which we have used to explain our experiments on simple stimuli. Our ongoing research in computer graphics psychophysics linked to Bayesian computer vision attempts to generalize these models to more natural scenes.

Acknowledgments

ALY would like to acknowledge support from the Brown/Harvard/MIT Center for Intelligent Control Systems with US Army Research Office grant DAAL03-86-K-0171 and from DARPA with contract AFOSR-89-0506. HHB's work at MIT was supported by the Office of Naval Research, Cognitive and Behavioural Sciences Division. Some of these ideas were initially developed with Mike Gennert. We would like to thank Jim Clark, Manfred Fahle, Norberto Grzywacz, Stephan Mallat, David Mumford and Tommy Poggio for many helpful discussions.

References

Barnard S 1989 Stochastic stereo matching over scale Int. J. Computer Vision 3 17-33

Barnard S T and Fischler M A 1982 Computational stereo ACM Comput. Surveys 14 553-72

Bayes T 1783 An essay towards solving a problem in the doctrine of chances Phil. Trans. R. Soc. 53 370-418

Blake A 1983 The least disturbance principle and weak constraints Pattern Recog. Lett. 1 393-9

Bülthoff H H and Fahle M 1989 Disparity gradients and depth scaling Report Artificial Intelligence Laboratory, MIT AI TR no 1175

Bülthoff H H, Fahle M and Wegmann M 1991 Disparity gradients and depth scaling Perception 20 145-53

Bülthoff H H and Mallot H A 1987 Interaction of different modules in depth perception Proc. 1st Int. Conf. on Computer Vision pp 295-305

- 1988 Interaction of depth modules: stereo and shading J. Opt. Soc. Am. 5 1749-58

Bülthoff H H and Yuille A 1990 Bayesian models for seeing shapes and depth Harvard Robotics Laboratory Technical Report 90-11

Burt P and Julesz B 1980 A disparity gradient limit for binocular fusion Science 208 615-7

Cernushi-Frias B, Cooper D R, Hung Y-P and Belhumeur P N 1989 Towards a model-based Bayesian theory for estimating and recognizing parameterized 3D objects using two or more images taken from different positions IEEE Trans. Pattern Anal. Machine Intell. PAMI-11 1028-52

Clark J and Yuille A 1990 Data Fusion for Sensory Information Processing Systems (Boston, MA: Kluwer Academic)

Dev P 1975 Perception of depth surfaces in random-dot stereograms: A neural model Int. J. Man-Machine Stud. 7 511-28

Durbin R and Willshaw D 1989 An analog approach to the travelling salesman problem using an elastic net method Nature 326 689-91

Frisby J 1991 Computational Models of Visual Processing ed M Landy and A Movshon (Cambridge, MA: MIT Press)

Geiger D and Girosi F 1991 Parallel and deterministic algorithms from MRFs: integration and surface reconstruction IEEE Trans. Pattern Anal. Machine Intell. PAMI-13 401-12

Geiger D and Yuille A 1991 A common framework for image segmentation Int. J. Computer Vision 6 227-33

Gelb A 1974 Applied Optimal Estimation (Cambridge, MA: MIT Press)

Geman S and Geman D 1984 Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images IEEE Trans. Pattern Anal. Machine Intell. PAMI-6 721-41

Gennert M 1987 A computational framework for understanding problems in stereo vision PhD thesis MIT

Gennert M, Ren B and Yuille A L 1990 Stereo matching by energy function minimization Proc. SPIE: Visual Communications and Image Processing (Boston, November 1990) vol 1383

Grimson W E L 1981 From Images to Surfaces (Cambridge, MA: MIT Press)

Hopfield J J and Tank D W 1985 Neural computation of decisions in optimization problems Biological Cybernetics 52 141-52

Horn B K P 1986 Robot Vision (Cambridge, MA: MIT Press)

Kirkpatrick S, Gelatt C D and Vecchi M P 1983 Optimization by simulated annealing Science 220 671-80

Marr D 1982 Vision (San Francisco, CA: Freeman)

Marr D and Poggio T 1976 Cooperative computation of stereo disparity Science 194 283-7

- 1979 A computational theory of human stereo vision Proc. R. Soc. B 204 301-28

Metropolis N, Rosenbluth A, Rosenbluth M, Teller A and Teller E 1953 Equation of state calculations by fast computing machines J. Phys. Chem. 21 1087-91

Mitchison G J 1988 Planarity and segmentation in stereoscopic matching Perception 17 753-82

Mitchison G J and McKee S 1987 The resolution of ambiguous stereoscopic matches by interpolation Vision Res. 27 285-94

Mitchison G J and Westheimer 1988 Vision: Coding and Efficiency ed C Blakemore (Cambridge: Cambridge University Press)

Mumford D and Shah J 1989 Optimal approximation of piecewise smooth functions and associated variational problems Commun. Pure Appl. Math. 42 577-685

Parisi G 1988 Statistical Field Theory (Reading, MA: Addison-Wesley)

Peterson C and Soderberg B 1989 A new method for mapping optimization problems onto neural networks Int. J. Neural Systems 1 3-22

Poggio T, Torre V and Koch C 1985 Computational vision and regularization theory Nature 317 314-9

Pollard S B, Mayhew J E W and Frisby J P 1985 A stereo correspondence algorithm using a disparity gradient limit Perception 14 449-70

Prazdny K 1985 Detection of binocular disparities Biol. Cybern. 52 93-9

Rogers B J and Graham M E 1983 Anisotropies in the perception of three-dimensional surfaces Science 221 1409-11

Seagrim G N 1967 Stereoscopic vision and aniseikonic lenses I Br. J. Psych. 58 337-50

Simic P 1990 Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimization Network 1 89-103

Terzopoulos D 1983 Multilevel computational processes for visual surface reconstruction Comput. Vision, Graphics Image Processing 24 52-96

Ullman S 1979 The Interpretation of Motion (Cambridge, MA: MIT Press)

Yuille A L 1989 Energy functions for early vision and analog networks Biol. Cybern. 61 115-23

- 1990 Generalized deformable models, statistical physics, and matching problems Neural Comput. 2 1-24

Yuille A, Geiger D and Bülthoff H H 1989 Stereo integration, mean field theory and psychophysics Harvard Robotics Laboratory Technical Report 89-11

Yuille A L and Grzywacz N M 1989 A winner-take-all mechanism based on presynaptic inhibition feedback Neural Comput. 1 334-47

Yuille A, Yang T and Geiger D 1990 Robust statistics, transparency and correspondence Harvard Robotics Laboratory Technical Report 90-7
