DYNAMIC STEREO VISION FOR INTERSECTION ASSISTANCE
1 Franke, Uwe*, 2 Rabe, Clemens, 1 Gehrig, Stefan, 3 Badino, Hernan, 1 Barth, Alexander
1 Daimler AG, Group Research, Germany; 2 University of Kiel; 3 University of Frankfurt
KEYWORDS – environment perception, driver assistance, intersection assistance, stereo
vision, space-time stereo.
ABSTRACT
More than one third of all traffic accidents with injuries occur in urban areas, especially at
intersections. Therefore, a driver assistance system supporting the driver in cities is highly
desirable and has tremendous potential to reduce the number of collisions at intersections.
A suitable system for such complex situations requires a comprehensive understanding of the
scene. This implies a precise estimation of the free space and the reliable detection and
tracking of other moving traffic participants. Since the goal of accident free traffic requires a
sensor with high spatial and temporal resolution, stereo vision will play an important role in
future driver assistance systems.
Most known stereo systems concentrate on single image pairs. However, in intelligent vehicle
applications, image sequences have to be analyzed. This contribution shows that a smart fusion
of stereo vision and motion analysis (optical flow) gives much better results than classical
frame-by-frame reconstructions. The basic idea is to track points with depth known from
stereo vision over two or more consecutive frames and to fuse the spatial and temporal
information using Kalman filters. The result is an improved accuracy of the 3D position and,
at the same time, an estimation of the 3D motion of the considered point. This approach,
called 6D-Vision, enables the detection of moving objects even if they are partially hidden.
From static points, very accurate occupancy grids are built, and a global optimization
technique delivers a robust estimation of the free space. Pixels moving in the world are
clustered into objects, which are then tracked over time in order to estimate their motion
state and to predict their paths. This allows for powerful collision avoidance systems:
pedestrians crossing the street are detected before they enter the lane; the same holds for
vehicles approaching from the side, which are not detectable by common radar systems. Since we
are able to estimate the yaw rate of oncoming traffic, the prediction is not restricted to
straight motion; potential collisions with turning traffic, especially at intersections, can
also be detected.
Urban vision calls for a large field of view. Within the German project AKTIV, a fisheye
stereo camera system with a field of view of up to 150 degrees is under development. If the
6D-Vision principle is applied to these images, laterally entering vehicles are also detectable.
FISITA 2008 World Automotive Congress, Munich, Germany, September 2008.
INTRODUCTION
Stereo vision is a research topic with a long history; see (1) for an overview. For a long
time, correlation-based techniques were commonly used. They deliver precise and reliable
measurements in real-time on a PC or on dedicated hardware. Recently, much progress has been
achieved in dense stereo. Especially the work of Hirschmueller (2) paves the road towards
real-time solutions: his so-called "Semi-Global Matching" algorithm delivers near-optimum
solutions at the computational expense of a classical correlation scheme. New sub-pixel
algorithms (3) reduce the distance noise significantly and further push the limits of stereo
for a given camera system.
Fig. 1 compares the results obtained by a common correlation-based scheme with those of a
modern dense stereo algorithm. The colors encode distance: the warmer the color, the closer
the point. The results differ not only in density; note the differences in low-contrast areas
such as the building and the road surface.
Using stereo vision, the three-dimensional structure of the scene is easily obtained. The
standard approach for free space analysis and obstacle detection is as follows: After
rectification, the stereo correspondences are computed. Then, all 3D points are projected onto
an occupancy grid. In a third step, this grid is segmented and potential obstacles are tracked
over time in order to verify their existence and to estimate their motion state.
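As an illustration of the projection step in this pipeline, the following sketch accumulates a
disparity image into a bird's-eye occupancy grid. The camera parameters and all names are
illustrative, not taken from the paper, and the removal of ground points that a real system
performs first is omitted:

```python
import numpy as np

def occupancy_grid(disparity, f=820.0, B=0.30, u0=320.0,
                   grid_res=0.2, x_range=20.0, z_range=60.0):
    """Accumulate stereo disparities into a 2D occupancy grid of hit counts.

    disparity: HxW array in pixels (values <= 0.5 are treated as invalid).
    Grid cells are grid_res x grid_res metres; x is lateral, z longitudinal.
    """
    h, w = disparity.shape
    nx, nz = int(2 * x_range / grid_res), int(z_range / grid_res)
    grid = np.zeros((nz, nx))
    u = np.tile(np.arange(w), h)              # pixel column of every disparity value
    d = disparity.ravel()
    valid = d > 0.5
    Z = f * B / d[valid]                      # triangulation: depth from disparity
    X = (u[valid] - u0) * Z / f               # lateral position from pixel column
    ix = ((X + x_range) / grid_res).astype(int)
    iz = (Z / grid_res).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iz >= 0) & (iz < nz)
    np.add.at(grid, (iz[keep], ix[keep]), 1)  # count hits per cell
    return grid
```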
This strategy ignores the strong correlation of successive frames and the information
contained within. This paper describes an efficient exploitation of the correlation in time. It
leads to more precise and stable results, and allows estimating the motion state of single
image points even before the objects are detected. This "track-before-detect" approach
distinguishes between static and moving pixels before any segmentation has been performed.
Using static points, very accurate occupancy grids are generated while moving points can be
easily grouped.
The paper is organized as follows: First we sketch the problems in stereo vision and show that
the uncertainties of occupancy grids are significantly reduced if the stereo information is
integrated over time. Then, we introduce a Kalman filter based integration of stereo and
optical flow allowing for the direct measurement of 3D-position and 3D-motion of all tracked
image points (6D-Vision). The following section describes the motion state estimation of
oncoming vehicles at intersections. Finally, we highlight the potential of fisheye cameras for
intersection assistance and give results.
Fig. 1: Correlation based stereo (left) vs. dense stereo (right). Red encodes close, green encodes far points. Note
the higher density especially in low-contrast areas like the road or the building on the right side.
STEREO VISION AND FREE SPACE ANALYSIS
Given a carefully rectified stereo image pair (i.e. all lens distortions have been corrected
and the epipolar lines coincide with the image rows), stereo vision aims to find corresponding
features in the left and right images along the epipolar lines. From the disparities, i.e. the
offsets between corresponding points, the world position can be easily derived.
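For a rectified pair with focal length $f$ (in pixels), baseline $B$ and principal point
$(u_0, v_0)$, the standard triangulation relations read

\[ Z = \frac{f\,B}{d}, \qquad X = \frac{(u - u_0)\,Z}{f}, \qquad Y = \frac{(v - v_0)\,Z}{f}. \]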
Nevertheless, the task is not as simple as it sounds: periodic structures can cause false
correspondences, hidden points are hard to identify, areas with low or even no contrast are
difficult to evaluate, and illumination differences demand a robust similarity measure.
Besides the mentioned epipolar constraint, other constraints like the ordering
constraint, the uniqueness constraint, the smoothness constraint or the recently introduced
gravitational constraint (3) help to solve those problems.
Since the relative orientation of a stereo camera system cannot be assumed to be constant over
time, a slow on-line calibration is necessary. Recently, Dang (4) proposed a scheme that
solves this task robustly.
As mentioned in the introduction, it is common to accumulate all 3D points above ground in a
stochastic occupancy grid. Figure 2 shows such a grid obtained for the urban situation
considered in the sequel. The origin of the coordinate system is centered at our own vehicle.
Our standard stereo camera system has a baseline of 30 cm and an angle of view of 42 degrees;
the imagers have VGA resolution.
Fig. 2: Occupancy grids of an urban situation. Left: stereo image pair with enlarged bicyclist. Middle: the
stochastic occupancy grid based on a single image pair. Right: the improved accuracy obtained by the temporal
integration described in the text. Note the decreased uncertainty, especially at larger distances.
It becomes obvious that the uncertainty of stereo depth measurements increases quadratically
with distance. Therefore, the bicyclist (enlarged in the left image) at around 60 m appears
highly blurred in the occupancy grid. Free space analysis on such occupancy grids is not very
reliable, so we are looking for strategies to reduce the uncertainty.
One way to reduce the disparity noise is to track features in the images over multiple frames.
If the disparity measurements are uncorrelated, the variance decreases with 1/N, where N is
the number of images. The 6D-Vision algorithm described below exploits this fact.
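For the sample mean of $N$ uncorrelated disparity measurements $d_1, \dots, d_N$ with equal
variance $\sigma^2$, this is simply

\[ \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} d_i\right)
   = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(d_i) = \frac{\sigma^2}{N}. \]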
Fortunately, tracking becomes redundant in static scenes when the ego-motion of the camera
is known a priori. This is beneficial since it allows working with dense stereo disparity maps
despite the real-time constraint. Disparity measurements which are consistent over time are
considered to belong to the same world point, and the disparity variance is reduced
accordingly. This stereo integration requires three main steps:
- Prediction: the current integrated disparity and variance image are predicted. This is
  equivalent to computing the expected optical flow and disparity based on ego-motion. Our
  prediction of the variance image includes the addition of a driving noise parameter that
  models the uncertainties of the system, such as ego-motion inaccuracy.
- Measurement: disparity and variance images are computed based on the current left and right
  images.
- Update: if the measured disparity confirms its prediction, both are fused, reducing the
  variance of the estimation. The consistency of the disparity is verified using a standard
  3-sigma test.
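A minimal per-pixel sketch of the fusion in the update step, assuming prediction and
measurement have already been computed as described above; the driving noise value and all
names are illustrative:

```python
import numpy as np

def update_disparity(d_pred, var_pred, d_meas, var_meas, q=0.05):
    """Fuse predicted and measured disparity images (all arrays of equal shape).

    q is the driving noise added during prediction to model system
    uncertainties such as ego-motion inaccuracy.
    """
    var_pred = var_pred + q
    innovation = d_meas - d_pred
    # standard 3-sigma test: fuse only where the measurement confirms the prediction
    consistent = innovation**2 <= 9.0 * (var_pred + var_meas)
    k = var_pred / (var_pred + var_meas)            # Kalman-style gain
    d_new = np.where(consistent, d_pred + k * innovation, d_meas)
    var_new = np.where(consistent, (1.0 - k) * var_pred, var_meas)
    return d_new, var_new
```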
Figure 2 shows an example of the improvement achieved: the occupancy grid shown at the right
was computed with an integrated disparity image. Note the significantly reduced uncertainties
of the registered 3D points. The bicyclist, approximately 60 meters away, is marked in the
images.
The occupancy grids shown above are in Cartesian coordinates. However, Cartesian space is
not a suitable space to compute the free space because the search must be done in the
direction of rays leaving the camera. The set of rays must span the whole grid. This leads to
discretization problems. A more appropriate space is the polar space. In polar coordinates
every grid column is, by definition, already in the direction of a ray. Therefore, searching for
obstacles in the ray direction is straightforward. For the computation of free space the first
step is to transform the Cartesian grid to a polar grid by applying a remapping operation. The
polar representation we use is a Column/Disparity occupancy grid; for a detailed discussion
see (5). The resulting grid is shown in the middle image of Figure 3.
Fig. 3: Free space computation. The green carpet shows the computed available free space. The free space is
obtained by applying dynamic programming on a Column/Disparity occupancy grid, which is a remapping of the
Cartesian depth map shown at the right. The free space resulting from the dynamic programming is shown over
the grids.
In the polar representation, the task is to find the first visible obstacle in the positive
direction of depth. All the space in front of the first occupied cell is considered free
space. The desired solution forms a path from left to right, segmenting the polar grid into
two regions. Instead of simply thresholding each column, dynamic programming is used. The
method based on dynamic programming has the following properties:
- Global optimization: each column is not considered independently, but as part of a global
  optimization problem which is solved optimally.
- Spatial and temporal smoothness of the solution: spatial smoothness is imposed by penalizing
  depth discontinuities, while temporal smoothness is imposed by penalizing deviations of the
  current solution from the prediction.
- Preservation of spatial and temporal discontinuities: saturating the spatial and temporal
  costs allows the preservation of discontinuities.
Figure 3 shows the result of the dynamic programming applied to the considered scene. For
more details on this analysis see (6).
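A simplified sketch of such a dynamic programming pass over the polar (column x depth-bin)
grid follows; the temporal smoothness term is omitted for brevity, and the cost terms and
names are illustrative:

```python
import numpy as np

def free_space_dp(cost, smooth=0.5, sat=3.0):
    """Find the first-obstacle depth bin per column via dynamic programming.

    cost: [n_cols, n_bins] data cost of placing the free space border at each
    depth bin (low where the polar grid indicates an occupied cell).
    smooth: penalty per bin of depth discontinuity between neighbour columns.
    sat: saturation of the smoothness cost, preserving real discontinuities.
    Returns one depth-bin index per column (the free space border).
    """
    n_cols, n_bins = cost.shape
    bins = np.arange(n_bins)
    acc = cost[0].copy()                      # accumulated cost of the best path
    back = np.zeros((n_cols, n_bins), dtype=int)
    for c in range(1, n_cols):
        # saturated transition cost between previous bin j and current bin i
        trans = np.minimum(smooth * np.abs(bins[:, None] - bins[None, :]), sat)
        total = acc[:, None] + trans          # indexed [prev_bin, cur_bin]
        back[c] = np.argmin(total, axis=0)
        acc = total[back[c], bins] + cost[c]
    # backtrack the globally optimal left-to-right path
    path = np.empty(n_cols, dtype=int)
    path[-1] = int(np.argmin(acc))
    for c in range(n_cols - 1, 0, -1):
        path[c - 1] = back[c, path[c]]
    return path
```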
6D-VISION
Until now, we assumed the world to be static and showed how to combine successive stereo
image pairs to reduce the variance of the free space estimation. This information can be used
for obstacle detection and obstacle avoidance in a straightforward manner, since all non-free
space is considered an obstacle.
However, the world is not completely static and a system for obstacle detection has to cope
with moving objects and precisely estimate their movements to predict potential collisions. A
common approach is to analyze the occupancy grid and to track isolated objects over time.
The major disadvantage of this algorithm is that the segmentation of isolated objects is
difficult in scenes consisting of multiple nearby objects.
Fig. 4: Dangerous traffic scene. The left image shows a pedestrian appearing behind a standing car. The
corresponding stereo reconstruction is shown in the center image. Red encodes close, green encodes far points.
The optical flow field is shown in the right image. Here red lines encode large image displacements, green small
displacements.
This problem is illustrated in Figure 4: Here the pedestrian appears behind the standing car
and runs towards the street. In the center image, the reconstructed stereo information is shown
using the red to green color encoding scheme. Here, the points belonging to the pedestrian can
hardly be distinguished from the points on the standing car. A segmentation based on this
information alone will therefore merge the pedestrian and the standing car into a single
static object.
In the right image, the optical flow between the last and the current frame is shown. The color
encodes the length of the displacement vector: red encodes large image displacements, green
small displacements. Here the pedestrian and the standing car can easily be distinguished.
This leads to the main idea of the 6D-Vision algorithm: Track an image point in one camera
from frame to frame and calculate its stereo disparity. Together with the known motion of the
ego-vehicle, the movement of the corresponding world point can be calculated. In practice, a
direct motion calculation based on two consecutive frames is extremely noisy. Therefore, the
obtained measurements are filtered by a Kalman filter.
Since we allow the observer to move, we fix the origin of the coordinate system to the car.
The state vector of the Kalman filter consists of the world point in the car coordinate system,
and its corresponding velocity vector. The six-dimensional state vector
$(X, Y, Z, \dot{X}, \dot{Y}, \dot{Z})$ gives this algorithm its name: 6D-Vision. The
mathematical details are given in (7).
The measurement vector used in the update step of the Kalman filter is $(u, v, d)$, with u and
v being the current image coordinates of the tracked image point and d its corresponding
disparity. As the perspective projection formulae are non-linear, we have to apply the
Extended Kalman filter.
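A sketch of this nonlinear measurement model and its Jacobian, as the Extended Kalman filter
requires them; the pinhole parameters f, B, u0, v0 are illustrative assumptions, and the
coordinate conventions of the actual system follow (7):

```python
import numpy as np

def measure(state, f=820.0, B=0.30, u0=320.0, v0=240.0):
    """Project a 6D state (X, Y, Z, Vx, Vy, Vz) to the measurement (u, v, d)."""
    X, Y, Z = state[:3]
    return np.array([f * X / Z + u0,    # image column
                     f * Y / Z + v0,    # image row
                     f * B / Z])        # disparity

def measure_jacobian(state, f=820.0, B=0.30):
    """3x6 Jacobian of the measurement; the velocity part is zero."""
    X, Y, Z = state[:3]
    H = np.zeros((3, 6))
    H[0, 0] = f / Z;  H[0, 2] = -f * X / Z**2   # du/dX, du/dZ
    H[1, 1] = f / Z;  H[1, 2] = -f * Y / Z**2   # dv/dY, dv/dZ
    H[2, 2] = -f * B / Z**2                     # dd/dZ
    return H
```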
A block diagram of the algorithm is shown in Figure 5. In every cycle, a new stereo image pair
is obtained. In the left image, appropriate features (e.g. edges, corners) are detected and
tracked over time. In the current application we use a version of the Kanade-Lucas-Tomasi
tracker (8), which provides sub-pixel accuracy and tracks the features robustly over long
image sequences. The disparities of all tracked features are determined in the stereo module.
After this step, the estimated 3D position of each feature is known. Together with the
ego-motion, the measurements of the tracking and stereo modules are passed to the Kalman
filter system, which updates the state estimates. For the analysis of the next image pair, the
acquired 6D information is used to predict the image positions of the tracked features. This
improves the tracking performance with respect to speed and robustness. In addition, the
predicted depth information is used to improve the stereo calculation.
The motion of a vehicle is by no means purely translational; it exhibits strong pitch and roll
components. In order to compensate for these disturbances, a precise ego-motion analysis is
advisable. If stereo tracks are available, the full motion state (6 degrees of freedom) can be
obtained from vision alone. The powerful real-time algorithm we use is described in (9).
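The actual algorithm is the one described in (9); purely as an illustration of how a 6-DoF
rigid motion can be fitted to stereo tracks, here is a least-squares (Kabsch/Horn style)
sketch. In practice, a robust outer loop such as RANSAC would reject tracks on moving objects:

```python
import numpy as np

def rigid_motion(P_prev, P_cur):
    """Least-squares rotation R and translation t with P_cur ~= R @ p + t.

    P_prev, P_cur: [N, 3] arrays of the same static 3D points in two frames.
    """
    cp, cc = P_prev.mean(axis=0), P_cur.mean(axis=0)
    H = (P_prev - cp).T @ (P_cur - cc)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # correct an improper rotation (reflection) if necessary
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cc - R @ cp
    return R, t
```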
Fig. 5: 6D-Vision block diagram
Fig. 6: Estimation results for the pedestrian from Figure 4. The time between the images is (from left to right) 0,
80, 160 and 240 ms. The vectors point to the predicted position of the corresponding world point in 0.5 s. The
color encodes the distance of the points.
The result of this algorithm is shown in Figure 6. From left to right, the estimation results
for the pedestrian from Figure 4 are shown at 0, 80, 160 and 240 ms relative to the first
appearance of the pedestrian. The estimated velocity vectors point to the predicted position
of the corresponding world point in 0.5 s. The colors encode the distance of the points. It
can be seen that this rich information helps to detect the moving pedestrian and, at the same
time, provides a first prediction of his movement.
OBJECT TRACKING
6D-Vision is a powerful method to extract linear point motion in the 3D world. A group of 6D
vectors corresponding to adjacent 3D points with similar 3D motion vectors is likely to belong
to the same object and can thus be used to generate object hypotheses. However, due to the
linear motion model of the single points, predicting the motion of such object hypotheses
without any further constraints is also limited to linear motion. For vehicles, especially
during turning maneuvers, the predicted driving path can then be imprecise and may lead to
misinterpretations.
In (10), a vision-based approach for estimating the nonlinear motion state of vehicles from a
moving platform is proposed. Objects are represented by a 3D point cloud combined with a state
vector including object pose and dynamics. It is assumed that vehicles can be distinguished
from other moving objects such as pedestrians, e.g. based on the dimension and velocity of a
cluster of 6D vectors. The dynamics of a vehicle is approximated by a coordinated turn motion
model, which restricts lateral movements to a circular path based on velocity and yaw rate.
Moving the point cloud in the world induces changes in the image plane, which are observed in
terms of optical flow and disparity changes. An Extended Kalman filter is used to solve the
inverse problem, i.e. to relate these observations in the image to a movement of the point
cloud in the world.
Fig. 7: Turning vehicle at an intersection. The orange box indicates the estimated position and orientation. The
red arrow indicates the predicted driving path assuming constant motion.
All points are referred to a local object coordinate system defined for each tracked vehicle. It
is assumed that the real position of a point within the object coordinate system does not
change over time (rigid body assumption) and that the object's structure is well described by
the point cloud. In practice, one has to deal with noisy observations of these points; neither
the exact position of a single point nor the overall structure is known. However, it is
possible to refine the object point cloud over time based on the accumulated noisy
observations of the single points.
Fig. 7 shows a typical situation at an intersection. The orange box shows the current position
of the oncoming car. The complete motion state of the turning vehicle has been estimated
based on the stereo tracks. Assuming constant motion, the green arrow in front of the car
indicates the expected circular driving path for the next second.
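As a sketch of the coordinated turn prediction underlying such a path, the following
propagates a simplified vehicle state (position, heading psi, velocity v, yaw rate omega); the
state layout is illustrative, and the acceleration state estimated in (10) is omitted:

```python
import numpy as np

def coordinated_turn_predict(state, dt):
    """Propagate (x, z, psi, v, omega) along a circular arc for dt seconds."""
    x, z, psi, v, omega = state
    if abs(omega) < 1e-6:                    # straight-line limit
        return np.array([x + v * dt * np.cos(psi),
                         z + v * dt * np.sin(psi), psi, v, omega])
    r = v / omega                            # turn radius from velocity and yaw rate
    return np.array([x + r * (np.sin(psi + omega * dt) - np.sin(psi)),
                     z - r * (np.cos(psi + omega * dt) - np.cos(psi)),
                     psi + omega * dt, v, omega])
```

Applying this prediction repeatedly over small time steps traces the expected circular driving
path, e.g. for the one-second horizon shown in Fig. 7.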
The proposed system is able to estimate the motion state of vehicles at urban intersections,
including velocity, yaw rate, and acceleration as well as position and orientation, and
currently runs at 25 Hz on our demonstrator car UTA. The filter can easily be extended by
additional measurements, for example radar information such as relative velocity or distance.
FISHEYE STEREO FOR INTERSECTION ASSISTANCE
Common stereo camera systems have opening angles of around 40 degrees. Simple investigations
reveal that this angle must be increased to about 150 degrees if dangerous situations at
intersections, e.g. vehicles approaching from the side, are to be recognized.
Fisheye lenses – in contrast to standard wide angle lenses – have the advantage of a constant
resolution over the whole field of view. Currently, we use 150 degree lenses. A typical image
is shown in Fig. 8. The computation can be limited to 400 lines of the 1628x1236 imager. In
the first step, the images are rectified based on calibration data obtained in an offline
process; for details see (11). This allows using the free space analysis and 6D-Vision
described above without any changes. The rectification step works with a cylindrical camera
model instead of the pinhole model in order to obtain a bounded image size.
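A sketch of how such a cylindrical rectification table could be built, assuming an equidistant
fisheye model (image radius r = f_fish * theta); the calibration and lens model actually used
are described in (11):

```python
import numpy as np

def cylindrical_remap(w, h, f_cyl, f_fish, cx, cy):
    """Build a lookup table mapping cylindrical-image pixels to fisheye pixels.

    Cylindrical model: column u is linear in azimuth, row v is perspective on
    the cylinder, which keeps the image bounded for a 150 degree field of view.
    The fisheye is assumed equidistant (radius r = f_fish * theta). The
    returned maps can be used e.g. with cv2.remap to rectify the image.
    """
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    phi = (u - w / 2.0) / f_cyl                       # azimuth per column
    dx, dy, dz = np.sin(phi), (v - h / 2.0) / f_cyl, np.cos(phi)
    norm = np.sqrt(dx**2 + dy**2 + dz**2)
    theta = np.arccos(np.clip(dz / norm, -1.0, 1.0))  # angle to the optical axis
    rho = np.maximum(np.hypot(dx, dy), 1e-12)         # guard the image center
    r = f_fish * theta                                # equidistant projection
    map_u = (cx + r * dx / rho).astype(np.float32)
    map_v = (cy + r * dy / rho).astype(np.float32)
    return map_u, map_v
```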
Figure 8 shows a situation at a pedestrian crossing. The computed free space is overlaid in
green.
Fig. 8: Free space analysis for the pedestrian crossing situation.
Figure 9 shows a second intersection scene, where a vehicle approaches quickly from the right,
having the right-of-way. Note the position of the vehicle at its initial detection: it is
first detected at 15 m longitudinal and 22 m lateral distance, i.e. 26 m Euclidean distance.
An earlier detection was impossible due to occlusion by a wall, visible at the right edge of
the image.
Fig. 9: 6D-Vision result for a scene with a vehicle approaching fast from the right while having the right-of-way.
The significant lateral motion is detected within 4 frames. The arrow length shows the predicted position in 0.5 s.
The arrow color encodes distance: red is near, green is far away.
Fig. 10: Situation shown in Fig. 9, two seconds later, after our own vehicle has stopped.
The actual object detection is done via direction and position analysis of the 6D vectors (see
the previous section). Figure 10 shows the same scene two seconds later: the ego-vehicle has
almost stopped, while the vehicle from the right was able to continue.
SUMMARY
Vehicles acting in a dynamic environment must be able to detect any static or moving obstacle.
This implies that a stereo vision algorithm has to optimally exploit the spatial and temporal
information contained in the image sequence.
As shown in the paper, precision and robustness of 3D reconstructions are significantly
improved if the stereo information is appropriately integrated over time. This requires
knowledge of the ego-motion, which in turn can be efficiently computed from 3D-tracks. It
turns out that the obtained ego-motion data outperforms commonly used inertial sensors. The
obtained depth maps show less noise and uncertainties than those generated by simple frame-
by-frame analysis. A dynamic programming approach allows for determining the free space
without any error-prone obstacle threshold. The algorithm runs in real-time on a PC and has
proven its robustness in daily traffic, including night-time driving and heavy rain.
The requirement to detect small or partly hidden moving objects from a moving observer calls
for a fusion of stereo and optical flow. This leads to the 6D-Vision approach, which
simultaneously estimates the position and motion of each observed image point. Since the
fusion is based on Kalman filters, the information contained in multiple frames is integrated.
This leads to a more robust and precise estimation than differential approaches such as the
pure evaluation of the optical flow on consecutive image pairs. Grouping this 6D information
is very reliable and enables the fast detection of moving objects, which can be further
tracked using appropriate dynamic models. The same concept is applied to cameras with fisheye
lenses.
Practical tests confirm that a crossing cyclist at an intersection is detected within 4-5 frames.
The implementation on a 3.2GHz Pentium 4 proves real-time capability. Currently, we select
and track about 2000 image points at 25Hz (the images have VGA resolution).
REFERENCES
(1) D. Scharstein, R. Szeliski: "A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms". IJCV 47(1), 2002, pp. 7-42.
(2) H. Hirschmueller: "Accurate and efficient stereo processing by semi-global matching and
mutual information". CVPR 2005, San Diego, CA, Volume 2, June 2005, pp. 807-814.
(3) S. Gehrig, U. Franke: "Improving Stereo Sub-Pixel Accuracy for Long Range Stereo".
Workshop on Virtual Representations and Modeling of Large-Scale Environments (VRML), ICCV
2007, Rio de Janeiro, 2007.
(4) T. Dang, C. Hoffmann: "Tracking Camera Parameters of an Active Stereo Rig". 28th Annual
Symposium of the German Association for Pattern Recognition (DAGM 2006), Berlin, September
12-14, 2006.
(5) H. Badino, U. Franke, R. Mester: "Free space computation using stochastic occupancy grids
and dynamic programming". Workshop on Dynamical Vision, ICCV 2007, Rio de Janeiro, 2007.
(6) U. Franke, S. Gehrig, H. Badino, C. Rabe: "Towards Optimal Stereo Analysis of Image
Sequences". RobotVision 2008, Auckland, February 18-20, 2008.
(7) U. Franke, C. Rabe, H. Badino, S. Gehrig: "6D-Vision: Fusion of Stereo and Motion for
Robust Environment Perception". 27th DAGM Symposium, 2005, pp. 216-223. ISBN 3-540-28703-5.
(8) J. Shi, C. Tomasi: "Good Features to Track". IEEE Conference on Computer Vision and
Pattern Recognition, 1994, pp. 593-600.
(9) H. Badino, U. Franke, C. Rabe, S. Gehrig: "Stereo-vision based detection of moving objects
under strong camera motion". VISAPP, Setubal, Portugal, February 2006.
(10) A. Barth, U. Franke: "Where will the Oncoming Vehicle be the Next Second?". IEEE
Intelligent Vehicles Symposium (IV 2008), Eindhoven, June 4-6, 2008.
(11) S. Gehrig, C. Rabe, L. Krüger: "6D Vision Goes Fisheye for Intersection Assistance".
Canadian Conference on Computer and Robot Vision (CRV 2008), Windsor, May 2008.