Stereo Accuracy for Collision Avoidance
Waqar Khan, John Morris and Reinhard Klette
Department of Computer Science, Tamaki Campus,
The University of Auckland, Auckland 1142, New Zealand.
[email protected], [j.morris,r.klette]@auckland.ac.nz
Abstract—In a stereo configuration, the measurable disparity values are integral, so the measurable depths are discrete. This can create a trap for a safety system whose purpose is to estimate the trajectory of a moving object and issue an early warning. The accuracy of this estimate is determined by the samples which have different measurable depths. The change in measurable depths becomes obvious for closer regions, but due to the limited extent of the stereo common field of view at these distances, the object might not be in the common field of view. A velocity estimation algorithm has been created which takes into account the constraints of stereo while accurately estimating the object's trajectory. From examination of various scenarios, we show that a stereo system can be misleading when estimating an object's trajectory: it may predict that an object on a collision course is 'safe'.
I. INTRODUCTION
On the surface, stereo photogrammetry would appear to have
the right characteristics for an autonomous vehicle navigation
system - the depth accuracy is low at far distances and improves
as an object gets closer, enabling increasingly precise
estimates of the likelihood of a collision to be made as the
object comes closer [1]. However, since measured disparity
values are integral, measured depths belong to a
discrete set [2]. This introduces a subtle trap: in the distance,
when it would be desirable for the system to provide an early
warning to allow more time to plan avoidance strategies, an
object which is on a collision course might appear to be not
moving at all with respect to the system. This problem is
a general one for all vision guided autonomous navigation,
but, because of its enormous social and economic costs, the
examples in this paper are based on a vision-based driver
assistance system which warns a driver about an imminent
collision [1], [3], [4], [5].
In this paper, we model the stereo system with respect to
its ability to avoid collisions in dynamic environments. A
driver assistance system is the source of the numeric examples,
but the model is applicable to any system moving through a
changing environment [1], [5]. A key problem is the rapidly
decreasing accuracy of distance measurements derived from a
stereo system as distance to an ’object of interest’ increases.
Thus we are concerned to determine the accuracy of the
estimated trajectory of the object compared to its actual
path for various scenarios. Note that other techniques, in
particular laser range finders, do not suffer from this problem
to the same degree, although their return signals weaken
rapidly as distance increases, so that distant
objects with poor optical scattering ability might be
Fig. 1. Canonical stereo configuration: sets of rays are drawn through pixels of the virtual image planes. Scene points at which these rays intersect are imaged onto the centres of image pixels. Examination of the diagram shows that the horizontal lines pass through points of equal disparity. The degrading depth resolution (larger δZ) for larger Z values is evident.
ignored altogether [6]. However, they require significant time
to provide a dense environment map covering all potential
hazards in a wide field of view, whereas recently developed
stereo correspondence hardware can process megapixel images
in real time [7].
II. STEREO GEOMETRY
Let Ω be a W × H image domain for both a left (L) and
a right (R) image of a rectified stereo pair in the canonical
configuration, see Figure 1 [7]. Assume that we have calcu-
lated a disparity, d(x, y), at each pixel p = (x, y) ∈ Ω, using
some stereo correspondence algorithm. The depth of an object
point, projected onto the pixel at location p = (x, y), is:
Z(x, y) = (f · b) / (τ · d(x, y))    (1)
where f is the focal length of both left and right cameras, b is the length of the baseline, τ is the pixel size, and d(x, y) is the disparity. (If f is also measured in pixels, then τ = 1.)
In a stereo configuration, the depth resolution degrades - δZ increases - as the distance from the system increases. Since the
sensor is composed of discrete pixels, the measurable
depths form a discrete set. The distance between adjacent
lines of equal disparity (i.e. for disparities d and d+1 in Figure
978-1-4244-4698-8/09/$25.00 ©2009 IEEE
24th International Conference Image and Vision Computing New Zealand (IVCNZ 2009)
- 67 -
1) is
δZ(d) = (f · b / τ) · (1/d − 1/(d + 1))
More precisely, the uncertainty in distance to a 3D point,
appearing to lie at disparity d, is:
ΔZ(d) = δZ(d)/2 + δZ(d − 1)/2    (2)
       = (f · b / τ) · 1/((d + 1)(d − 1))    (3)
       = Z · d / ((d + 1)(d − 1))    (4)
       = O(Z²)    (5)
Therefore, the accuracy of a measurable depth degrades as the
distance to the object increases.
Although the accuracy improves at smaller distances, this comes at the cost of a higher disparity range 0 . . . dmax. This
range is proportional to the memory required for a real-time
(yet accurate) hardware implementation of a stereo matching
algorithm such as dynamic programming, belief propagation or
graph cuts [7], [8], [9]. For example, the memory needed for belief
propagation stereo is O(5WHdmax) [10].
Due to memory and time constraints, any practical algorithm
limits the range of disparities it can handle to some 0 . . . dmax.
This in turn further constrains the region in which depths can
be measured to that part of the common field of view (CFoV)
beyond Zmin(dmax), with
Zmin(dmax) = (f · b) / (τ · dmax)
See Figure 2. An exclusion zone of radius rexc is chosen
more for psychological reasons - the necessity to avoid stress
induced by near misses - than for our vehicle's actual extent.
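The formulas of this section are easy to evaluate numerically. The following sketch is our own illustration, not the authors' code; the parameter values are assumptions taken from the Figure 3 caption (f = 4.6 mm, b = 308 mm, τ = 6 μm).

```python
# Sketch of the Section II formulas, using the stereo parameters from
# the Figure 3 caption.  These values are illustrative; any consistent
# unit system works.

F = 4.6e-3      # focal length (m)
B = 308e-3      # baseline (m)
TAU = 6e-6      # pixel size (m)

def depth(d):
    """Measurable depth Z for integer disparity d, Eq. (1)."""
    return F * B / (TAU * d)

def delta_z(d):
    """Depth step between the equal-disparity lines d and d + 1."""
    return (F * B / TAU) * (1.0 / d - 1.0 / (d + 1))

def uncertainty(d):
    """Depth uncertainty Delta-Z(d), Eqs. (2)-(4)."""
    return 0.5 * (delta_z(d) + delta_z(d - 1))

def z_min(d_max):
    """Closest measurable depth for a disparity range 0 ... d_max."""
    return F * B / (TAU * d_max)
```

With these parameters, disparity d = 12 corresponds to a depth of about 19.7 m, where the step to d = 13 is already about 1.5 m - consistent with the large plateaus visible in Figure 3.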
III. VELOCITY ESTIMATION MODEL
A. Assumptions
1) Matching noise: A stereo matching algorithm will generate
some incorrect disparities. This ‘noise’ is usually removed
through Kalman filters [4], [11], [12], [13]. We assume that
the estimated disparity is already noise free.
2) Discrete steps: The discrete steps in measured depths are
related to the pixel size τ . Therefore improved depth accuracy
could be achieved either through a sensor with smaller pixels,
or through a matching algorithm with sub-pixel accuracy [4].
In either case, the depth would always be measured in discrete
steps, so the argument set out here still applies, but at a
different scale.
3) Units: The object’s location is measured in metres (m),
while its speed is measured in metres per second (ms−1).
B. Model
The key task is to estimate the collision point and time with all
potential colliders. For an object at O, travelling with velocity
Fig. 2. Coordinate system based on a point on the leading edge of our vehicle. The figure also shows an exclusion zone of radius rexc.
V = (Vx, Vz), the time tcross to enter our path and become a possible collision is
tcross = (Ox − rexc)/Vx    (6)
where rexc is the radius of the exclusion zone; see Figure 2.
The time for our vehicle to reach the collision point Oc = (0, rexc) is
tfx = (Oz − rexc)/Vz
The opposing vehicle will reach a point safely behind our vehicle in
tbx = (Oz + rexc)/Vz
Clearly, if tfx < tcross < tbx, then our system should generate
an avoidance strategy - in a driver assistance system, this
will simply be a warning to the driver but in completely
autonomous systems, a procedure for planning a new path
should be invoked.
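Under the stated model, the decision rule above is compact enough to sketch directly. The function below is our own illustration (names and sample values are ours); for simplicity it treats the speeds as positive closing magnitudes, which is an assumption about the sign convention rather than something the paper specifies.

```python
# Minimal sketch of the collision-window test of Section III-B.
# Ox, Oz: object position; vx, vz: lateral and longitudinal closing
# speeds (treated as positive magnitudes here); r_exc: exclusion-zone
# radius.  All names and sample values are illustrative.

def should_warn(Ox, Oz, vx, vz, r_exc):
    """True iff t_fx < t_cross < t_bx, i.e. the object crosses our
    path while the crossing region is still occupied."""
    t_cross = (abs(Ox) - r_exc) / vx   # object reaches our path, Eq. (6)
    t_fx = (Oz - r_exc) / vz           # front of the crossing window
    t_bx = (Oz + r_exc) / vz           # point safely behind our vehicle
    return t_fx < t_cross < t_bx
```

For example, an object 50 m ahead and 16 m to the side, closing at (2, 7) ms−1 with a 2 m exclusion zone, falls inside the warning window; the same object starting 30 m to the side does not.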
If the stereo system can process nf frames per second, then
velocity components could be estimated as the difference in
perceived/apparent positions at intervals of 1/nf seconds.
However, due to the discrete nature of measurable depths,
this velocity estimate may include very high errors in the
longitudinal velocity [4] - for example, an object may appear
to stay at the same distance for several frames, leading to an
apparent V_z^app = 0 ms−1. An example is shown in Figure 3
(detailed discussion later in Scenario 2 of Section III-D).
Note that there are 36 samples (captured images and therefore
position estimates for the opposing object) between S1 and
S2. For these initial 36 samples, the measured depth does not
appear to change - see Figure 3. However, because of the depth
error, possible trajectories for this object cover a wide range.
Since Oz could lie anywhere between
Zu = Z(d) + δZ(d − 1)/2
and
Zl = Z(d) − δZ(d)/2
for n frames or time t(n) = n/nf, the apparent speed V_z^app
in the Z direction lies in the following interval:
−(nf/n)(Zu − Zl) < V_z^app < (nf/n)(Zu − Zl)    (7)
This could be simplified to
−(nf/2n)[δZ(d) + δZ(d − 1)] < V_z^app < (nf/2n)[δZ(d) + δZ(d − 1)]    (8)
so that the error in speed equals
ΔV_z^app = (nf/n)[δZ(d) + δZ(d − 1)]
This error decreases as 1/n as the number of frames n increases, but it remains substantial for over 25 frames in
our example - see Figure 3. Note that for these 36 frames, the
system cannot be sure whether the object is approaching or
moving away! As soon as a change in disparity is observed,
however, the velocity error drops dramatically.
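The width of interval (8) can be tabulated with a few lines. The sketch below is ours, again using the Figure 3 stereo parameters as assumed values:

```python
# Apparent longitudinal-speed error bound from Section III-B, using
# the Figure 3 stereo parameters (illustrative values).

F, B, TAU = 4.6e-3, 308e-3, 6e-6   # focal length, baseline, pixel size (m)

def delta_z(d):
    """Depth step between disparities d and d + 1."""
    return (F * B / TAU) * (1.0 / d - 1.0 / (d + 1))

def speed_error(d, n, nf=25):
    """Width Delta-V_z^app of the apparent-speed interval after n
    frames with unchanged disparity d."""
    return (nf / n) * (delta_z(d) + delta_z(d - 1))
```

The bound shrinks as 1/n: waiting twice as many frames without a disparity change halves the uncertainty, which is why the error stays large until the first change in disparity.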
C. Constraints
Figure 3 shows that the distance resolution (indicated by the
step heights in the diagram) improves as the object becomes
closer. However, the system has several constraints.
1) Safe braking distance: A guidance system should make a
decision about the trajectory of an opposing object before a
critical distance - the safe braking or stopping distance.1
For a motor vehicle, the stopping distance Db can be estimated
from the speed Vi and the road friction coefficient ω. For a
vehicle stopping with locked tires, so that the only stopping
force is friction between the tires and the road, the energy
balance after distance Db is:
−ω · m · g · Db = −m · Vi² / 2
where m denotes the mass, and g denotes the gravitational
constant. Thus, the braking distance Db equals
Db = Vi² / (2ωg)    (9)
For safety, we should assume the worst braking conditions
with ω ∼ 0.4.
1In a more complex guidance system, this critical distance might be theturning radius or other dynamic characteristic of the vehicle. See also http://www.csgnetwork.com/stopdistcalc.html.
Fig. 3. Actual and apparent object position vs time: stereo configuration - f = 4.6 mm, b = 308 mm, W = 640, H = 480, τ = 6 μm, nf = 25; object initially at 20 m moving with velocity (−2, −7). Dd = 10 m is the braking distance for our vehicle moving at 7 ms−1. The point at which the object leaves the CFoV is marked.
A safe system must also consider a driver decision time td and
a response time tr. Altogether, the effective stopping distance
equals
Dd = Db + Vi(td + tr)
Thus, our system must identify the potential obstacle when
Oz > Dd.
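As a quick sketch (ours, not the authors' code), Dd follows from Eq. (9) plus the delay terms; ω = 0.4 is the worst-case friction stated in the text, while the decision and response times below are illustrative assumptions.

```python
# Effective stopping distance Dd = Db + Vi * (td + tr), Section III-C.
# omega = 0.4 is the worst-case friction assumed in the text; the
# decision/response times are illustrative assumptions.

G = 9.81  # gravitational acceleration (m s^-2)

def braking_distance(v, omega=0.4):
    """Db = v^2 / (2 * omega * g), Eq. (9)."""
    return v * v / (2.0 * omega * G)

def effective_stopping_distance(v, t_decision=1.0, t_response=0.5,
                                omega=0.4):
    """Dd = Db + v * (td + tr)."""
    return braking_distance(v, omega) + v * (t_decision + t_response)
```

At 7 ms−1 the friction-only braking distance is about 6.2 m, so the driver delay terms contribute a substantial share of the effective stopping distance.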
2) Baseline: Depth accuracy can be improved by increasing
the baseline, but then the CFoV shifts further away from the
system, reducing its effectiveness at slower speeds. For a
moving platform, the baseline is usually constrained by the
vehicle's width.
3) Common Field of View: The system can only measure the
depth of an object which lies in the CFoV. For an approaching
object, the CFoV extent at its distance shrinks with every
frame, so any improvement in depth accuracy with decreasing
distance may be negated by the object leaving the CFoV.
The extent E(Z) of the CFoV equals
E(Z) = Z · τ · (W − d) / f    (10)
An object is in the CFoV when |Ox| < E(Z)/2. The time that the
object will remain in the CFoV depends upon the ratio Vz/Vx
of longitudinal to lateral velocity. If this ratio is greater than
tan θ (where θ is the half-angle of the CFoV; see Figure 2),
then an object in CFoV will leave it. This occurs in Scenarios
2, 3 and 4 as discussed in Section III-D.
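The CFoV membership test of Eq. (10) can be sketched directly; the following is our illustration, with the Figure 3 optics as assumed parameters.

```python
# CFoV extent E(Z) = Z * tau * (W - d) / f, Eq. (10), and the
# membership test |Ox| < E(Z)/2.  Parameters follow the Figure 3
# caption and are illustrative.

F, TAU, W = 4.6e-3, 6e-6, 640   # focal length (m), pixel size (m), width (px)

def cfov_extent(Z, d):
    """Lateral extent of the common field of view at depth Z,
    for an object seen at disparity d."""
    return Z * TAU * (W - d) / F

def in_cfov(Ox, Z, d):
    """True while the object's lateral offset stays inside the CFoV."""
    return abs(Ox) < cfov_extent(Z, d) / 2.0
```

At Z = 20 m and d = 12, for instance, the CFoV is roughly 16.4 m wide, so an object 4 m off-axis is visible to both cameras while one 9 m off-axis is not.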
The braking-distance and CFoV constraints are both illustrated in Figure 3: Dd corresponds
to the safe braking distance, and the point at which the object
leaves the CFoV (after 63 samples) is marked. Thus, the improved
stereo accuracy at closer distances is unusable. The stereo
system's optical configuration can be altered to change the
depth resolution at any point, but, except for increasing the
number of pixels in the sensor, this will involve some trade-off
between resolution and other factors, such as the closest working
distance (Zmin(dmax) in Figure 2) or the extent of the CFoV.
A system in which the configuration was adapted dynamically
to various conditions (rather like the swiveling of our eyes
to fixate on an object) was not considered feasible due to
the problems associated with calibrating moving components
subject to continual vibration. Therefore the velocity estimator
must work within fixed stereo constraints.
D. Model Generation
To illustrate the various facets of the velocity estimation
model, we consider four different scenarios. The optical
system’s parameters for these scenarios are as follows:
Focal length f: 4.9 mm
Baseline b: 308 mm
Camera sensor width W: 640 pixels
Camera sensor height H: 480 pixels
Frame rate nf: 25 frames/s
Maximum disparity dmax: 64
Minimum measurable depth Zmin(dmax): 39.5 m
Angular field of view θ: 2.2°
The initial positions and velocities for the scenarios are
labeled in Figure 2. We considered the following scenarios:
1) Passing in front. This is the simplest scenario in which
the object remains in the CFoV until it passes safely in
front of us. The algorithm must accurately determine the
object trajectory before the object comes closer than the
safe braking distance Dd and thus avoid a false warning.
The object is initially placed at O1 = (4, 100), moving
with a constant velocity V1 = (−2, −20). Our vehicle
is moving at 16 ms−1, leading to a safe braking distance
of Dd = 41.1 m.
2) Collision. This scenario models a collision situation in
which the object is initially in the CFoV and leaves it,
but remains in the field of view of one camera.
This scenario challenges the algorithm, which
must recognize the collision situation from limited
stereo samples. The moving object is initially placed
at O2 = (−19, 56), with V2 = (2, −7). Our vehicle is
moving at 7 ms−1, leading to a safe braking distance of
Dd = 10 m.
3) Passing behind our vehicle. This scenario models a
no-collision situation in which the object is initially
in the CFoV but leaves it and passes behind us. This
scenario challenges the algorithm, which
must determine the object trajectory before the object
leaves the CFoV and thus avoid a false warning. The
moving object is initially placed at O3 = (14, 64), with
V3 = (−2, −16.7). Our vehicle is moving at 16.7 ms−1,
leading to a safe braking distance of Dd = 44.3 m.
4) Second collision. This scenario models a collision
situation in which the object is initially placed at O4 =
(17, 48) in the CFoV, with V4 = (−2, −7). Our vehicle is
moving at 7 ms−1, leading to a safe braking distance
of 10 m. This scenario challenges the
algorithm, which must recognize the collision situation
from limited stereo samples.
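The plateaus of Figure 3 can be reproduced with a few lines. The sketch below is our own illustration: it quantizes the true depth to the nearest measurable depth each frame, using the Figure 3 stereo parameters and a Scenario-2-like closing motion (both assumed, not normative).

```python
# Simulate the quantized depths an ideal (noise-free) stereo system
# would report for an approaching object, reproducing the step
# plateaus of Figure 3.  Parameters follow the Figure 3 caption; the
# motion values are illustrative.

F, B, TAU, NF = 4.6e-3, 308e-3, 6e-6, 25

def measured_depth(z_true):
    """Depth after rounding the disparity to the nearest integer pixel."""
    d = round(F * B / (TAU * z_true))
    return F * B / (TAU * d)

z0, vz = 20.0, -7.0   # start 20 m away, closing at 7 m/s
apparent = [measured_depth(z0 + vz * k / NF) for k in range(25)]
# Consecutive frames often report the same depth: the plateaus.
```

Printing `apparent` shows runs of identical values even though the true depth decreases every frame, which is exactly the effect that makes the apparent longitudinal speed zero for many samples.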
IV. VELOCITY ESTIMATION ALGORITHM
We consider the longitudinal speed
|V_z^e| = ΔZ(d) / Δt
where Δt = n/nf is the time for which the object's disparity did
not change. Since the direction of motion is initially unknown,
the error |Vz| − |V_z^e| in V_z^e may be positive or negative.
Therefore the error just before the change in apparent depth
is observed is 0 ± 44 ms−1! See Figure 4(a).
If |Vz| > 0 then there will eventually be a sample with a
different disparity. This 'core-sample' (and future such samples)
is the key to accurate velocity estimation. The initial core-sample
allows the algorithm to estimate the longitudinal
speed and direction of the moving object. For Scenario 1, before
the core-sample the error in estimated velocity is 0 ± 44 ms−1,
which reduces to 3 ms−1 after the first core-sample; see Figure
4.
An observed change in disparity at a core-sample serves
two purposes. Firstly, it results in a dramatic drop in the
apparent velocity error; see Figure 4 (e), (f) and (g). Secondly,
it is used to estimate the sample at which the next change
in disparity is expected. If, however, the estimated velocity is
found to be inconsistent with new measurements, it is modified
on the basis of the two most recent core-samples.
Scenario 3 shows an example of this situation, where the
estimated velocity is faster than the actual velocity; see Figure
4 (c). Based upon the estimated velocity and the known stereo
parameters, the algorithm estimates the sample sc at which
the next core-sample should be found. For the samples before
sc, the algorithm estimates the object trajectory by taking the
centre of the most recently observed step as the reference point; see
Figure 4 (f). At sample sc, the change in disparity d does not
occur as anticipated: an inconsistency is found, so the
estimate is modified to ΔZ(d)/Δt, where Δt is the time since the most
recent core-sample.
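The core-sample update can be sketched as follows. This is our reconstruction of the idea, not the authors' code; `measured_d` is a hypothetical per-frame disparity sequence and the helper names are ours.

```python
# Sketch of the core-sample speed estimator of Section IV: whenever
# the integer disparity changes, the elapsed time since the previous
# change yields a refreshed |Vz| estimate.

def estimate_speeds(measured_d, nf, depth):
    """Return the |Vz| estimate produced at each core-sample.

    measured_d: per-frame integer disparities; nf: frame rate;
    depth: callable mapping a disparity to a depth (Eq. (1))."""
    estimates, last_change = [], 0
    for k in range(1, len(measured_d)):
        if measured_d[k] != measured_d[k - 1]:      # a core-sample
            dt = (k - last_change) / nf             # time since last change
            dz = abs(depth(measured_d[k]) - depth(measured_d[k - 1]))
            estimates.append(dz / dt)
            last_change = k
    return estimates
```

As in the text, the estimate before the first core-sample is unusable; each subsequent core-sample refines both the speed and the predicted sample of the next disparity change.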
From the trajectories of Scenarios 1 and 2, it would seem
that a single core-sample is sufficient for accurate trajectory
estimation; see Figure 4 (d) and (e). However, in Scenario 3,
at least two core-samples are necessary for accurate trajectory
estimation; see Figure 4 (f). The difference here is that the
constrained optical system does not know the actual position
of the object within the region of uncertainty ΔZ(d). Therefore,
the distance Dc the object covered before the core-sample has
a large error. If this distance is approximately equal to ΔZ(d),
the situation reduces to Scenarios 1 and 2 (see Figure 4 (d) and
(e)); however, if Dc < ΔZ(d), it reduces to Scenarios 3
and 4 (see Figure 4 (f) and Figure 5).
In Scenario 4, the system only sees a single core-sample before
the object leaves the CFoV (see Figure 5). The outcome is that the
estimated velocity is much higher than the actual one, and the
system will classify this as a safe situation in which the object
Fig. 4. Accuracy of estimated velocities and their trajectories vs time for Scenarios 1, 2 and 3. The error in estimated velocity is computed as |Vz| − |V_z^e|; the range of possible estimates is also shown. Before the core-samples, the direction of V_z^e is unknown. (a) Error for Scenario 1 improves from 0 ± 44 ms−1 to 3 ms−1 at the first core-sample, then to 1 ms−1 at the second core-sample, and then to 0.3 ms−1 at the third core-sample. (b) Error for Scenario 2 improves from 0 ± 15 ms−1 to 0.3 ms−1 at the first core-sample and then to −0.1 ms−1 at the second core-sample. (c) Error for Scenario 3 improves from 0 ± 59 ms−1 to 18 ms−1 at the first core-sample, then, due to inconsistency of the estimated velocity with new measurements, to 2 ms−1, and further to −0.2 ms−1 after a new core-sample. Actual and apparent trajectories: (d) Scenario 1, (e) Scenario 2, (f) Scenario 3.
Fig. 5. Actual, apparent and estimated trajectories vs time for Scenario 4; the object at (17, 48) m, moving with (−2, −7) ms−1, is processed for only 8 samples before it exits the CFoV. The object is actually going to collide but is classified as a safe situation in which the object passes safely in front of us.
safely passes in front of us, whereas it is in fact a collision situation.
In order to avoid such situations, the safety system will have
to use monocular vision as well. Stereo can assist by providing
the initial cues - object segmentation, the size of the object at the
identified range of distances, and ground plane estimation to
locate objects of interest - while motion (optical flow) would
act as an additional cue for trajectory estimation.
V. CONCLUSION
In this paper, we identify a critical limitation of object trajec-
tory estimation via stereo. Accuracy of a stereo based velocity
estimation algorithm improves as an object approaches the
optical system, but due to stereo (e.g., common field of view,
baseline) and system (e.g., safe braking distance) constraints,
accuracy at closer distances is sometimes not usable. These
limitations of discrete stereo become a threat for the safety
system. From examination of several scenarios, we show that a
stereo system could be misleading while estimating the object
trajectory. An approaching object on a collision course could
be reported as a safe situation. Reliable warning systems must
use models which allow for the errors highlighted here - and use additional cues, e.g. information from monocular regions.
For example, in Scenario 4 (Figure 2), the object, having been
detected from stereo analysis, could still be tracked through
the monocular region after it leaves the common field of view.
REFERENCES
[1] G. N. DeSouza and A. C. Kak, "Vision for mobile robot navigation: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 237–267, 2002.
[2] L. Matthies and S. A. Shafer, "Error modeling in stereo navigation," IEEE Journal of Robotics and Automation, vol. 3, no. 3, pp. 239–248, 1987.
[3] M. Bertozzi, A. Broggi, A. Fascioli, and S. Nichele, "Stereo vision-based vehicle detection," in IEEE Intell. Vehicles Symposium, 2000, pp. 39–44.
[4] U. Franke and A. Joos, "Real-time stereo vision for urban traffic scene understanding," in IEEE Intell. Vehicles Symposium, 2000, pp. 273–278.
[5] D. Murray and J. Little, "Using real-time stereo vision for mobile robot navigation," Autonomous Robots, vol. 8, no. 2, pp. 161–171, 2000.
[6] Y. Negishi, J. Miura, and Y. Shirai, "Mobile robot navigation in unknown environments using omnidirectional stereo and laser range finder," in Proc. IEEE Int. Conf. on Intelligent Robots and Systems, 2004, pp. 250–255.
[7] J. Morris, K. Jawed, and G. Gimel'farb, "Intelligent vision: A first step - real time stereovision," in Proc. 11th Int. Conf. on Advanced Concepts for Intell. Vision Systems, ser. LNCS, vol. 5807. Springer, 2009, pp. 355–366.
[8] P. Felzenszwalb and D. Huttenlocher, "Efficient belief propagation for early vision," Int. Journal of Computer Vision, vol. 70, no. 1, pp. 41–54, 2006.
[9] M. F. Tappen and W. T. Freeman, "Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters," in Int. Conf. on Computer Vision, vol. 2, 2003, pp. 900–907.
[10] S. Park and H. Jeong, "A high-speed parallel architecture for stereo matching," in Advances in Visual Computing, ser. LNCS, vol. 4291. Springer, 2006, pp. 334–342.
[11] L. Matthies, T. Kanade, and R. Szeliski, "Kalman filter-based algorithms for estimating depth from image sequences," Int. Journal of Computer Vision, vol. 3, no. 3, pp. 209–238, 1989.
[12] A. Tirumalai, B. G. Schunck, and R. C. Jain, "Dynamic stereo with self-calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 12, pp. 1184–1189, 1992.
[13] A. Suppes, F. Suhling, and M. Hotter, "Robust obstacle detection from stereoscopic image sequences using Kalman filtering," in 23rd DAGM-Symposium on Pattern Recog., ser. LNCS, vol. 2191, 2001, pp. 385–391.