Stereo Accuracy for Collision Avoidance
Waqar Khan, John Morris and Reinhard Klette
Department of Computer Science, Tamaki Campus,
The University of Auckland, Auckland 1142, New Zealand.
[email protected], [j.morris,r.klette]@auckland.ac.nz
Abstract—In a stereo configuration, the measurable disparity values are integral, so the measurable depths are discrete. This can create a trap for a safety system whose purpose is to estimate the trajectory of a moving object and issue an early warning. The accuracy of this estimate is determined by the samples which have different measurable depths. The change in measurable depths becomes obvious for closer regions, but due to the limited extent of the stereo common field of view at these distances, the object might not be in the common field of view. A velocity estimation algorithm has been created which takes into account the constraints of stereo while accurately estimating the object's trajectory. From examination of various scenarios, we show that a stereo system can be misleading when estimating an object's trajectory: it may predict that an object on a collision course is 'safe'.
I. INTRODUCTION
On the surface, stereo photogrammetry would appear to have
the right characteristics for an autonomous vehicle navigation
system - the depth accuracy is low at far distances and improves
as an object gets closer, enabling increasingly precise
estimates of the likelihood of a collision to be made as the
object comes closer [1]. However, since measured disparity
values are integral, measured depths belong to a
discrete set [2]. This introduces a subtle trap: in the distance,
when it would be desirable for the system to provide an early
warning to allow more time to plan avoidance strategies, an
object which is on a collision course might appear to be not
moving at all with respect to the system. This problem is
a general one for all vision guided autonomous navigation,
but, because of its enormous social and economic costs, the
examples in this paper are based on a vision-based driver
assistance system which warns a driver about an imminent
collision [1], [3], [4], [5].
In this paper, we model the stereo system with respect to
its ability to avoid collisions in dynamic environments. A
driver assistance system is the source of the numeric examples,
but the model is applicable to any system moving through a
changing environment [1], [5]. A key problem is the rapidly
decreasing accuracy of distance measurements derived from a
stereo system as distance to an ’object of interest’ increases.
Thus we are concerned to determine the accuracy of the
estimated trajectory of the object compared to its actual
path for various scenarios. Note that other techniques, in
particular laser range finders, do not suffer from this problem
to the same degree, although their return signals weaken
rapidly as distance increases, so that distant
objects with poor optical scattering ability might be
Fig. 1. Canonical stereo configuration: sets of rays are drawn through pixels of the virtual image planes. Scene points at which these rays intersect are imaged onto the centres of image pixels. Examination of the diagram shows that the horizontal lines pass through points of equal disparity. The degrading depth resolution (larger δZ) for larger Z values is evident.
ignored altogether [6]. However, they require significant time
to provide a dense environment map covering all potential
hazards in a wide field of view, whereas recently developed
stereo correspondence hardware can process megapixel images
in real time [7].
II. STEREO GEOMETRY
Let Ω be a W × H image domain for both a left (L) and
a right (R) image of a rectified stereo pair in the canonical
configuration, see Figure 1 [7]. Assume that we have calcu-
lated a disparity, d(x, y), at each pixel p = (x, y) ∈ Ω, using
some stereo correspondence algorithm. The depth of an object
point, projected onto the pixel at location p = (x, y), is:
Z(x, y) = (f · b) / (τ · d(x, y))    (1)
where f is the focal length of both left and right cameras, b is the length of the baseline, τ is the pixel size, and d(x, y) is the disparity. (If f is also measured in pixels, then τ = 1.)
In a stereo configuration, the depth resolution degrades - δZ increases - as the distance from the system increases. Since the
sensor is composed of discrete pixels, the measurable
depths form a discrete set. The distance between adjacent
lines of equal disparity (i.e. for disparities d and d+1 in Figure
978-1-4244-4698-8/09/$25.00 ©2009 IEEE
24th International Conference Image and Vision Computing New Zealand (IVCNZ 2009)
- 67 -
1) is
δZ(d) = (f · b / τ) · (1/d − 1/(d + 1))
More precisely, the uncertainty in distance to a 3D point,
appearing to lie at disparity d, is:
ΔZ(d) = δZ(d)/2 + δZ(d − 1)/2    (2)
       = (f · b / τ) · 1/((d + 1)(d − 1))    (3)
       = Z · d / ((d + 1)(d − 1))    (4)
       = O(Z²)    (5)
Therefore, the accuracy of a measurable depth degrades as the
distance to the object increases.
Although the accuracy improves at smaller distances, this comes at the cost of a higher disparity range 0 . . . dmax. This
range is proportional to the memory required for a real-time
(yet accurate) hardware implementation of a stereo matching
algorithm such as dynamic programming, belief propagation or
graph cuts [7], [8], [9]. For example, the memory needed for belief
propagation stereo is O(5WHdmax) [10].
Due to memory and time constraints, any practical algorithm
limits the range of disparities it can handle to some 0 . . . dmax.
This in turn further constrains the region in which depths can
be measured to that part of the common field of view (CFoV)
beyond Zmin(dmax), with
Zmin(dmax) = (f · b) / (τ · dmax)
See Figure 2. An exclusion zone of radius rexc is chosen
more for psychological reasons - the necessity to avoid stress
induced by near misses - than for our vehicle's actual extent.
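The formulas of this section are easy to evaluate numerically. The following sketch is our own illustration, not the authors' code; the parameter values are assumptions taken from the Figure 3 caption (f = 4.6 mm, b = 308 mm, τ = 6 μm).

```python
# Sketch of the Section II formulas, using the stereo parameters from
# the Figure 3 caption.  These values are illustrative; any consistent
# unit system works.

F = 4.6e-3      # focal length (m)
B = 308e-3      # baseline (m)
TAU = 6e-6      # pixel size (m)

def depth(d):
    """Measurable depth Z for integer disparity d, Eq. (1)."""
    return F * B / (TAU * d)

def delta_z(d):
    """Depth step between the equal-disparity lines d and d + 1."""
    return (F * B / TAU) * (1.0 / d - 1.0 / (d + 1))

def uncertainty(d):
    """Depth uncertainty Delta-Z(d), Eqs. (2)-(4)."""
    return 0.5 * (delta_z(d) + delta_z(d - 1))

def z_min(d_max):
    """Closest measurable depth for a disparity range 0 ... d_max."""
    return F * B / (TAU * d_max)
```

With these parameters, disparity d = 12 corresponds to a depth of about 19.7 m, where the step to d = 13 is already about 1.5 m - consistent with the large plateaus visible in Figure 3.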
III. VELOCITY ESTIMATION MODEL
A. Assumptions
1) Matching noise: A stereo matching algorithm will generate
some incorrect disparities. This ‘noise’ is usually removed
through Kalman filters [4], [11], [12], [13]. We assume that
the estimated disparity is already noise free.
2) Discrete steps: The discrete steps in measured depths are
related to the pixel size τ . Therefore improved depth accuracy
could be achieved either through a sensor with smaller pixels,
or through a matching algorithm with sub-pixel accuracy [4].
In either case, the depth would always be measured in discrete
steps, so the argument set out here still applies, but at a
different scale.
3) Units: The object’s location is measured in metres (m),
while its speed is measured in metres per second (ms−1).
B. Model
The key task is to estimate the collision point and time with all
potential colliders. For an object at O, travelling with velocity
Fig. 2. Coordinate system based on a point on the leading edge of our vehicle. The figure also shows an exclusion zone of radius rexc.
V = (Vx, Vz), the time tcross to enter our path and become a possible collision is
tcross = (Ox − rexc)/Vx    (6)
where rexc is the radius of the exclusion zone; see Figure 2.
The time for our vehicle to reach the collision point Oc = (0, rexc) is
tfx = (Oz − rexc)/Vz
The opposing vehicle will reach a point safely behind our vehicle in
tbx = (Oz + rexc)/Vz
Clearly, if tfx < tcross < tbx, then our system should generate
an avoidance strategy - in a driver assistance system, this
will simply be a warning to the driver but in completely
autonomous systems, a procedure for planning a new path
should be invoked.
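Under the stated model, the decision rule above is compact enough to sketch directly. The function below is our own illustration (names and sample values are ours); for simplicity it treats the speeds as positive closing magnitudes, which is an assumption about the sign convention rather than something the paper specifies.

```python
# Minimal sketch of the collision-window test of Section III-B.
# Ox, Oz: object position; vx, vz: lateral and longitudinal closing
# speeds (treated as positive magnitudes here); r_exc: exclusion-zone
# radius.  All names and sample values are illustrative.

def should_warn(Ox, Oz, vx, vz, r_exc):
    """True iff t_fx < t_cross < t_bx, i.e. the object crosses our
    path while the crossing region is still occupied."""
    t_cross = (abs(Ox) - r_exc) / vx   # object reaches our path, Eq. (6)
    t_fx = (Oz - r_exc) / vz           # front of the crossing window
    t_bx = (Oz + r_exc) / vz           # point safely behind our vehicle
    return t_fx < t_cross < t_bx
```

For example, an object 50 m ahead and 16 m to the side, closing at (2, 7) ms−1 with a 2 m exclusion zone, falls inside the warning window; the same object starting 30 m to the side does not.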
If the stereo system can process nf frames per second, then
velocity components could be estimated as the difference in
perceived/apparent positions at intervals of 1/nf seconds.
However, due to the discrete nature of measurable depths,
this velocity estimate may include very high errors in the
longitudinal velocity [4] - for example, an object may appear
to stay at the same distance for several frames, leading to an
apparent V_z^app = 0 ms−1. An example is shown in Figure 3
(detailed discussion later in Scenario 2 of Section III-D).
Note that there are 36 samples (captured images and therefore
position estimates for the opposing object) between S1 and
S2. For these initial 36 samples, the measured depth does not
appear to change - see Figure 3. However, because of the depth
error, possible trajectories for this object cover a wide range.
Since Oz could lie anywhere between
Zu = Z(d) + δZ(d − 1)/2
and
Zl = Z(d) − δZ(d)/2
for n frames or time t(n) = n/nf, the apparent speed V_z^app
in the Z direction lies in the following interval:
−(nf/n)(Zu − Zl) < V_z^app < (nf/n)(Zu − Zl)    (7)
This could be simplified to
−(nf/2n)[δZ(d) + δZ(d − 1)] < V_z^app < (nf/2n)[δZ(d) + δZ(d − 1)]    (8)
so that the error in speed equals
ΔV_z^app = (nf/n)[δZ(d) + δZ(d − 1)]
This error decreases as 1/n as the number of frames n increases, but it remains substantial for over 25 frames in
our example - see Figure 3. Note that for these 36 frames, the
system cannot be sure whether the object is approaching or
moving away! As soon as a change in disparity is observed,
however, the velocity error drops dramatically.
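The width of interval (8) can be tabulated with a few lines. The sketch below is ours, again using the Figure 3 stereo parameters as assumed values:

```python
# Apparent longitudinal-speed error bound from Section III-B, using
# the Figure 3 stereo parameters (illustrative values).

F, B, TAU = 4.6e-3, 308e-3, 6e-6   # focal length, baseline, pixel size (m)

def delta_z(d):
    """Depth step between disparities d and d + 1."""
    return (F * B / TAU) * (1.0 / d - 1.0 / (d + 1))

def speed_error(d, n, nf=25):
    """Width Delta-V_z^app of the apparent-speed interval after n
    frames with unchanged disparity d."""
    return (nf / n) * (delta_z(d) + delta_z(d - 1))
```

The bound shrinks as 1/n: waiting twice as many frames without a disparity change halves the uncertainty, which is why the error stays large until the first change in disparity.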
C. Constraints
Figure 3 shows that the distance resolution (indicated by the
step heights in the diagram) improves as the object becomes
closer. However, the system has several constraints.
1) Safe braking distance: A guidance system should make a
decision about the trajectory of an opposing object before a
critical distance - the safe braking or stopping distance.1
For a motor vehicle, the stopping distance Db can be estimated
from the speed Vi and the road friction coefficient ω. For a
vehicle stopping with locked tires, so that the only stopping
force is friction between the tires and the road, the energy
balance after distance Db is:
−ω · m · g · Db = −m · Vi² / 2
where m denotes the mass, and g denotes the gravitational
constant. Thus, the braking distance Db equals
Db = Vi² / (2ωg)    (9)
For safety, we should assume the worst braking conditions
with ω ∼ 0.4.
1In a more complex guidance system, this critical distance might be theturning radius or other dynamic characteristic of the vehicle. See also http://www.csgnetwork.com/stopdistcalc.html.
Fig. 3. Actual and apparent object position vs time: stereo configuration - f = 4.6 mm, b = 308 mm, W = 640, H = 480, τ = 6 μm, nf = 25; object initially at 20 m moving with velocity (−2, −7). Dd = 10 m is the braking distance for our vehicle moving at 7 ms−1. The point at which the object leaves the CFoV is marked.
A safe system must also consider a driver decision time td and
a response time tr. Altogether, the effective stopping distance
equals
Dd = Db + Vi(td + tr)
Thus, our system must identify the potential obstacle when
Oz > Dd.
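As a quick sketch (ours, not the authors' code), Dd follows from Eq. (9) plus the delay terms; ω = 0.4 is the worst-case friction stated in the text, while the decision and response times below are illustrative assumptions.

```python
# Effective stopping distance Dd = Db + Vi * (td + tr), Section III-C.
# omega = 0.4 is the worst-case friction assumed in the text; the
# decision/response times are illustrative assumptions.

G = 9.81  # gravitational acceleration (m s^-2)

def braking_distance(v, omega=0.4):
    """Db = v^2 / (2 * omega * g), Eq. (9)."""
    return v * v / (2.0 * omega * G)

def effective_stopping_distance(v, t_decision=1.0, t_response=0.5,
                                omega=0.4):
    """Dd = Db + v * (td + tr)."""
    return braking_distance(v, omega) + v * (t_decision + t_response)
```

At 7 ms−1 the friction-only braking distance is about 6.2 m, so the driver delay terms contribute a substantial share of the effective stopping distance.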
2) Baseline: Depth accuracy can be improved by increasing
the baseline, but then the CFoV shifts further away from the
system, reducing its effectiveness at slower speeds. For a
moving platform, the baseline is usually constrained by the
vehicle's width.
3) Common Field of View: The system can only measure the
depth of an object which lies in the CFoV. For an approaching
object, the CFoV extent at its distance shrinks with every
frame, so any improvement in depth accuracy with decreasing
distance may be negated by the object leaving the CFoV.
The extent E(Z) of the CFoV equals
E(Z) = Z · τ · (W − d) / f    (10)
An object is in the CFoV when |Ox| < E(Z)/2. The time that the
object will remain in the CFoV depends upon the ratio Vz/Vx
of longitudinal to lateral velocity. If this ratio is greater than
tan θ (where θ is the half-angle of the CFoV; see Figure 2),
then an object in CFoV will leave it. This occurs in Scenarios
2, 3 and 4 as discussed in Section III-D.
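The CFoV membership test of Eq. (10) can be sketched directly; the following is our illustration, with the Figure 3 optics as assumed parameters.

```python
# CFoV extent E(Z) = Z * tau * (W - d) / f, Eq. (10), and the
# membership test |Ox| < E(Z)/2.  Parameters follow the Figure 3
# caption and are illustrative.

F, TAU, W = 4.6e-3, 6e-6, 640   # focal length (m), pixel size (m), width (px)

def cfov_extent(Z, d):
    """Lateral extent of the common field of view at depth Z,
    for an object seen at disparity d."""
    return Z * TAU * (W - d) / F

def in_cfov(Ox, Z, d):
    """True while the object's lateral offset stays inside the CFoV."""
    return abs(Ox) < cfov_extent(Z, d) / 2.0
```

At Z = 20 m and d = 12, for instance, the CFoV is roughly 16.4 m wide, so an object 4 m off-axis is visible to both cameras while one 9 m off-axis is not.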
The braking-distance and CFoV constraints are both illustrated in Figure 3: Dd corresponds
to the safe braking distance, and the point at which the object
leaves the CFoV (after 63 samples) is marked. Thus, the improved
stereo accuracy at closer distances is unusable. The stereo
system's optical configuration can be altered to change the
depth resolution at any point, but, except for increasing the
number of pixels in the sensor, this will involve some trade-off
between resolution and other factors, such as the closest working
distance (Zmin(dmax) in Figure 2) or the extent of the CFoV.
A system in which the configuration was adapted dynamically
to various conditions (rather like the swiveling of our eyes
to fixate on an object) was not considered feasible due to
the problems associated with calibrating moving components
subject to continual vibration. Therefore the velocity estimator
must work within fixed stereo constraints.
D. Model Generation
To illustrate the various facets of the velocity estimation
model, we consider four different scenarios. The optical
system’s parameters for these scenarios are as follows:
Focal length f: 4.9 mm
Baseline b: 308 mm
Camera sensor width W: 640 pixels
Camera sensor height H: 480 pixels
Frame rate nf: 25 frames/s
Maximum disparity dmax: 64
Minimum measurable depth Zmin(dmax): 39.5 m
Angular field of view θ: 2.2°
The initial positions and velocities for the scenarios are
labeled in Figure 2. We considered the following scenarios:
1) Passing in front. This is the simplest scenario in which
the object remains in the CFoV until it passes safely in
front of us. The algorithm must accurately determine the
object trajectory before the object comes closer than the
safe braking distance Dd and thus avoid a false warning.
The object is initially placed at O1 = (4, 100), moving
with a constant velocity V1 = (−2, −20). Our vehicle
is moving at 16 ms−1, leading to a safe braking distance
of Dd = 41.1 m.
2) Collision. This scenario models a collision situation in
which the object is initially in the CFoV and leaves it,
but remains in the field of view of one camera.
This scenario challenges the algorithm, which
must recognize the collision situation from limited
stereo samples. The moving object is initially placed
at O2 = (−19, 56), with V2 = (2, −7). Our vehicle is
moving at 7 ms−1, leading to a safe braking distance of
Dd = 10 m.
3) Passing behind our vehicle. This scenario models a
no-collision situation in which the object is initially
in the CFoV but leaves it and passes behind us. This
scenario challenges the algorithm, which
must determine the object trajectory before the object
leaves the CFoV and thus avoid a false warning. The
moving object is initially placed at O3 = (14, 64), with
V3 = (−2, −16.7). Our vehicle is moving at 16.7 ms−1,
leading to a safe braking distance of Dd = 44.3 m.
4) Second collision. This scenario models a collision
situation in which the object is initially placed at O4 =
(17, 48) in the CFoV, with V4 = (−2, −7). Our vehicle is
moving at 7 ms−1, leading to a safe braking distance
of 10 m. This scenario challenges the
algorithm, which must recognize the collision situation
from limited stereo samples.
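The plateaus of Figure 3 can be reproduced with a few lines. The sketch below is our own illustration: it quantizes the true depth to the nearest measurable depth each frame, using the Figure 3 stereo parameters and a Scenario-2-like closing motion (both assumed, not normative).

```python
# Simulate the quantized depths an ideal (noise-free) stereo system
# would report for an approaching object, reproducing the step
# plateaus of Figure 3.  Parameters follow the Figure 3 caption; the
# motion values are illustrative.

F, B, TAU, NF = 4.6e-3, 308e-3, 6e-6, 25

def measured_depth(z_true):
    """Depth after rounding the disparity to the nearest integer pixel."""
    d = round(F * B / (TAU * z_true))
    return F * B / (TAU * d)

z0, vz = 20.0, -7.0   # start 20 m away, closing at 7 m/s
apparent = [measured_depth(z0 + vz * k / NF) for k in range(25)]
# Consecutive frames often report the same depth: the plateaus.
```

Printing `apparent` shows runs of identical values even though the true depth decreases every frame, which is exactly the effect that makes the apparent longitudinal speed zero for many samples.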
IV. VELOCITY ESTIMATION ALGORITHM
We consider the longitudinal speed
|V_z^e| = ΔZ(d) / Δt
where Δt = n/nf is the time for which the object's disparity did
not change. Since the direction of motion is initially unknown,
the error |Vz| − |V_z^e| in V_z^e may be positive or negative.
Therefore the error just before the change in apparent depth
is observed is 0 ± 44 ms−1! See Figure 4(a).
If |Vz| > 0 then there will eventually be a sample with a
different disparity. This 'core-sample' (and future such samples)
is the key to accurate velocity estimation. The initial core-sample
allows the algorithm to estimate the longitudinal
speed and direction of the moving object. For Scenario 1, before
the core-sample the error in estimated velocity is 0 ± 44 ms−1,
which reduces to 3 ms−1 after the first core-sample; see Figure
4.
An observed change in disparity at a core-sample serves
two purposes. Firstly, it results in a dramatic drop in the
apparent velocity error; see Figure 4 (e), (f) and (g). Secondly,
it is used to estimate the sample at which the next change
in disparity is expected. If, however, the estimated velocity is
found to be inconsistent with new measurements, it is modified
on the basis of the two most recent core-samples.
Scenario 3 shows an example of this situation, where the
estimated velocity is faster than the actual velocity; see Figure
4 (c). Based upon the estimated velocity and the known stereo
parameters, the algorithm estimates the sample sc at which
the next core-sample should be found. For the samples before
sc, the algorithm estimates the object trajectory by taking the
centre of the most recently observed step as the reference point; see
Figure 4 (f). At sample sc, the change in disparity d does not
occur as anticipated: an inconsistency is found, so the
estimate is modified to ΔZ(d)/Δt, where Δt is the time since the most
recent core-sample.
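The core-sample update can be sketched as follows. This is our reconstruction of the idea, not the authors' code; `measured_d` is a hypothetical per-frame disparity sequence and the helper names are ours.

```python
# Sketch of the core-sample speed estimator of Section IV: whenever
# the integer disparity changes, the elapsed time since the previous
# change yields a refreshed |Vz| estimate.

def estimate_speeds(measured_d, nf, depth):
    """Return the |Vz| estimate produced at each core-sample.

    measured_d: per-frame integer disparities; nf: frame rate;
    depth: callable mapping a disparity to a depth (Eq. (1))."""
    estimates, last_change = [], 0
    for k in range(1, len(measured_d)):
        if measured_d[k] != measured_d[k - 1]:      # a core-sample
            dt = (k - last_change) / nf             # time since last change
            dz = abs(depth(measured_d[k]) - depth(measured_d[k - 1]))
            estimates.append(dz / dt)
            last_change = k
    return estimates
```

As in the text, the estimate before the first core-sample is unusable; each subsequent core-sample refines both the speed and the predicted sample of the next disparity change.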
From the trajectories of Scenarios 1 and 2, it would seem
that a single core-sample is sufficient for accurate trajectory
estimation; see Figure 4 (d) and (e). However, in Scenario 3,
at least two core-samples are necessary for accurate trajectory
estimation; see Figure 4 (f). The difference here is that the
constrained optical system does not know the actual position
of the object within the region of uncertainty ΔZ(d). Therefore,
the distance Dc the object covered before the core-sample has
a large error. If this distance is approximately equal to ΔZ(d),
the situation reduces to Scenarios 1 and 2 (see Figure 4 (d) and
(e)); however, if Dc < ΔZ(d), it reduces to Scenarios 3
and 4 (see Figure 4 (f) and Figure 5).
In Scenario 4, the system only sees a single core-sample before
the object leaves the CFoV (see Figure 5). The outcome is that the
estimated velocity is much higher than the actual one, and the
system will classify this as a safe situation in which the object
Fig. 4. Accuracy of estimated velocities and their trajectories vs time for Scenarios 1, 2 and 3. The error in estimated velocity is computed as |Vz| − |V_z^e|; the range of possible estimates is also shown. Before the core-samples, the direction of V_z^e is unknown. (a) Error for Scenario 1 improves from 0 ± 44 ms−1 to 3 ms−1 at the first core-sample, then to 1 ms−1 at the second core-sample, and then to 0.3 ms−1 at the third core-sample. (b) Error for Scenario 2 improves from 0 ± 15 ms−1 to 0.3 ms−1 at the first core-sample and then to −0.1 ms−1 at the second core-sample. (c) Error for Scenario 3 improves from 0 ± 59 ms−1 to 18 ms−1 at the first core-sample, then, due to inconsistency of the estimated velocity with new measurements, to 2 ms−1, and further to −0.2 ms−1 after a new core-sample. Actual and apparent trajectories: (d) Scenario 1, (e) Scenario 2, (f) Scenario 3.
Fig. 5. Actual, apparent and estimated trajectories vs time for Scenario 4; the object at (17, 48) m, moving with (−2, −7) ms−1, is processed for only 8 samples before it exits the CFoV. The object is actually going to collide but is classified as a safe situation in which the object passes safely in front of us.
safely passes in front of us, whereas it is in fact a collision situation.
In order to avoid such situations, the safety system will have
to use monocular vision as well. Stereo can assist by providing
the initial cues - object segmentation, the size of the object at the
identified range of distances, and ground plane estimation to
locate objects of interest - while motion (optical flow) would
act as an additional cue for trajectory estimation.
V. CONCLUSION
In this paper, we identify a critical limitation of object trajec-
tory estimation via stereo. Accuracy of a stereo based velocity
estimation algorithm improves as an object approaches the
optical system, but due to stereo (e.g., common field of view,
baseline) and system (e.g., safe braking distance) constraints,
accuracy at closer distances is sometimes not usable. These
limitations of discrete stereo become a threat for the safety
system. From examination of several scenarios, we show that a
stereo system could be misleading while estimating the object
trajectory. An approaching object on a collision course could
be reported as a safe situation. Reliable warning systems must
use models which allow for the errors highlighted here - and use additional cues, e.g. information from monocular regions.
For example, in Scenario 4 (Figure 2), the object, having been
detected from stereo analysis, could still be tracked through
the monocular region after it leaves the common field of view.
REFERENCES
[1] G. N. DeSouza and A. C. Kak, "Vision for mobile robot navigation: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 237–267, 2002.
[2] L. Matthies and S. A. Shafer, "Error modeling in stereo navigation," IEEE Journal of Robotics and Automation, vol. 3, no. 3, pp. 239–248, 1987.
[3] M. Bertozzi, A. Broggi, A. Fascioli, and S. Nichele, "Stereo vision-based vehicle detection," in IEEE Intell. Vehicles Symposium, 2000, pp. 39–44.
[4] U. Franke and A. Joos, "Real-time stereo vision for urban traffic scene understanding," in IEEE Intell. Vehicles Symposium, 2000, pp. 273–278.
[5] D. Murray and J. Little, "Using real-time stereo vision for mobile robot navigation," Autonomous Robots, vol. 8, no. 2, pp. 161–171, 2000.
[6] Y. Negishi, J. Miura, and Y. Shirai, "Mobile robot navigation in unknown environments using omnidirectional stereo and laser range finder," in Proc. IEEE Int. Conf. on Intelligent Robots and Systems, 2004, pp. 250–255.
[7] J. Morris, K. Jawed, and G. Gimel'farb, "Intelligent vision: A first step - real time stereovision," in Proc. 11th Int. Conf. on Advanced Concepts for Intell. Vision Systems, ser. LNCS, vol. 5807. Springer, 2009, pp. 355–366.
[8] P. Felzenszwalb and D. Huttenlocher, "Efficient belief propagation for early vision," Int. Journal of Computer Vision, vol. 70, no. 1, pp. 41–54, 2006.
[9] M. F. Tappen and W. T. Freeman, "Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters," in Int. Conf. on Computer Vision, vol. 2, 2003, pp. 900–907.
[10] S. Park and H. Jeong, "A high-speed parallel architecture for stereo matching," in Advances in Visual Computing, ser. LNCS, vol. 4291. Springer, 2006, pp. 334–342.
[11] L. Matthies, T. Kanade, and R. Szeliski, "Kalman filter-based algorithms for estimating depth from image sequences," Int. Journal of Computer Vision, vol. 3, no. 3, pp. 209–238, 1989.
[12] A. Tirumalai, B. G. Schunck, and R. C. Jain, "Dynamic stereo with self-calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 12, pp. 1184–1189, 1992.
[13] A. Suppes, F. Suhling, and M. Hotter, "Robust obstacle detection from stereoscopic image sequences using Kalman filtering," in 23rd DAGM-Symposium on Pattern Recog., ser. LNCS, vol. 2191, 2001, pp. 385–391.