DYNAMIC STEREO VISION FOR INTERSECTION ASSISTANCE
1 Franke, Uwe*, 2 Rabe, Clemens, 1 Gehrig, Stefan, 3 Badino, Hernan, 1 Barth, Alexander
1 Daimler AG, Group Research, Germany; 2 University of Kiel; 3 University of Frankfurt
KEYWORDS – environment perception, driver assistance, intersection assistance, stereo
vision, space-time stereo.
ABSTRACT
More than one third of all traffic accidents with injuries occur in urban areas, especially at
intersections. Therefore, a driver assistance system supporting the driver in cities is highly
desirable and has tremendous potential to reduce the number of collisions at intersections.
A suitable system for such complex situations requires a comprehensive understanding of the
scene. This implies a precise estimation of the free space and the reliable detection and
tracking of other moving traffic participants. Since the goal of accident free traffic requires a
sensor with high spatial and temporal resolution, stereo vision will play an important role in
future driver assistance systems.
Most known stereo systems concentrate on single image pairs. However, in intelligent vehicle
applications, image sequences have to be analyzed. This contribution shows that a smart fusion
of stereo vision and motion analysis (optical flow) gives much better results than classical
frame-by-frame reconstructions. The basic idea is to track points with depth known from
stereo vision over two or more consecutive frames and to fuse the spatial and temporal
information using Kalman filters. The result is an improved accuracy of the 3D position and,
at the same time, an estimation of the 3D motion of the considered point. This approach,
called 6D-Vision, enables the detection of moving objects even if they are partially hidden.
From static points, very accurate occupancy grids are built, and a global optimization
technique delivers a robust estimation of the free space. Pixels moving in the world are
clustered into objects, which are then tracked over time in order to estimate their motion
state and to predict their paths. This allows for powerful collision avoidance systems:
pedestrians crossing the street are detected before they enter the lane; the same holds for
vehicles approaching from the side, which are not detectable by common radar systems. Since we
are able to estimate the yaw rate of oncoming traffic, the prediction is not restricted to
straight motion; potential collisions with turning traffic, especially at intersections, can
also be detected.
Urban vision calls for a large field of view. Within the German project AKTIV, a fisheye
stereo camera system with a field of view of up to 150 degrees is under development. If the
6D-Vision principle is applied to these images, laterally entering vehicles are also detectable.
FISITA 2008 World Automotive Congress, Munich, Germany, September 2008.
INTRODUCTION
Stereo vision is a research topic with a long history; see (1) for an overview. For a long
time, correlation-based techniques were commonly used. They deliver precise and reliable
measurements in real-time on a PC or on dedicated hardware. Recently, much progress has been
achieved in dense stereo. Especially the work of Hirschmueller (2) paves the road towards
real-time solutions: his so-called "Semi-Global Matching" algorithm delivers near-optimum
solutions at the computational expense of a classical correlation scheme. New sub-pixel
algorithms (3) reduce the distance noise significantly and further push the limits of stereo
for a given camera system.
Fig. 1 compares the results obtained by a common correlation-based scheme with those of a
modern dense stereo algorithm. The colors encode distance: the warmer the color, the closer
the point. The results differ not only in density; note the differences in low-contrast areas
such as the building and the road surface.
Using stereo vision, the three-dimensional structure of the scene is easily obtained. The
standard approach for free space analysis and obstacle detection is as follows: After
rectification, the stereo correspondences are computed. Then, all 3D points are projected onto
an occupancy grid. In a third step, this grid is segmented and potential obstacles are tracked
over time in order to verify their existence and to estimate their motion state.
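As an illustration of the projection step in this pipeline, the following sketch accumulates a
disparity image into a bird's-eye occupancy grid. The camera parameters and all names are
illustrative, not taken from the paper, and the removal of ground points that a real system
performs first is omitted:

```python
import numpy as np

def occupancy_grid(disparity, f=820.0, B=0.30, u0=320.0,
                   grid_res=0.2, x_range=20.0, z_range=60.0):
    """Accumulate stereo disparities into a 2D occupancy grid of hit counts.

    disparity: HxW array in pixels (values <= 0.5 are treated as invalid).
    Grid cells are grid_res x grid_res metres; x is lateral, z longitudinal.
    """
    h, w = disparity.shape
    nx, nz = int(2 * x_range / grid_res), int(z_range / grid_res)
    grid = np.zeros((nz, nx))
    u = np.tile(np.arange(w), h)              # pixel column of every disparity value
    d = disparity.ravel()
    valid = d > 0.5
    Z = f * B / d[valid]                      # triangulation: depth from disparity
    X = (u[valid] - u0) * Z / f               # lateral position from pixel column
    ix = ((X + x_range) / grid_res).astype(int)
    iz = (Z / grid_res).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iz >= 0) & (iz < nz)
    np.add.at(grid, (iz[keep], ix[keep]), 1)  # count hits per cell
    return grid
```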
This strategy ignores the strong correlation of successive frames and the information
contained within. This paper describes an efficient exploitation of the correlation in time. It
leads to more precise and stable results, and allows estimating the motion state of single
image points even before the objects are detected. This "track-before-detect" approach
distinguishes between static and moving pixels before any segmentation has been performed.
Using static points, very accurate occupancy grids are generated while moving points can be
easily grouped.
The paper is organized as follows: First we sketch the problems in stereo vision and show that
the uncertainties of occupancy grids are significantly reduced if the stereo information is
integrated over time. Then, we introduce a Kalman filter based integration of stereo and
optical flow allowing for the direct measurement of 3D-position and 3D-motion of all tracked
image points (6D-Vision). The following section describes the motion state estimation of
oncoming vehicles at intersections. Finally, we highlight the potential of fisheye cameras for
intersection assistance and give results.
Fig. 1: Correlation based stereo (left) vs. dense stereo (right). Red encodes close, green encodes far points. Note
the higher density especially in low-contrast areas like the road or the building on the right side.
STEREO VISION AND FREE SPACE ANALYSIS
Given a carefully rectified stereo image pair (i.e. all lens distortions have been corrected
and the epipolar lines coincide with the image rows), stereo vision aims to find corresponding
features in the left and right images along the epipolar lines. From the disparities, i.e. the
offsets between corresponding points, the world position can be easily derived.
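For a rectified pair with focal length $f$ (in pixels), baseline $B$ and principal point
$(u_0, v_0)$, the standard triangulation relations read

\[ Z = \frac{f\,B}{d}, \qquad X = \frac{(u - u_0)\,Z}{f}, \qquad Y = \frac{(v - v_0)\,Z}{f}. \]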
Nevertheless, the task is not as simple as it sounds: periodic structures can cause false
correspondences, hidden points are hard to identify, areas with low or even no contrast are
difficult to evaluate, and illumination differences demand a robust similarity measure.
Besides the mentioned epipolar constraint, other constraints like the ordering
constraint, the uniqueness constraint, the smoothness constraint or the recently introduced
gravitational constraint (3) help to solve those problems.
Since the relative orientation of a stereo camera system cannot be assumed to be constant over
time, a slow on-line calibration is necessary. Recently, Dang (4) proposed a scheme that
solves this task robustly.
As mentioned in the introduction, it is common to accumulate all 3D points above ground in a
stochastic occupancy grid. Figure 2 shows such a grid obtained for the urban situation
considered in the sequel. The origin of the coordinate system is centered at our own vehicle.
Our standard stereo camera system has a baseline of 30 cm and an angle of view of 42 degrees;
the imagers have VGA resolution.
Fig. 2: Occupancy grids of an urban situation. Left: stereo image pair with enlarged bicyclist. Middle: the
stochastic occupancy grid based on a single image pair. Right: the improved accuracy obtained by the temporal
integration described in the text. Note the decreased uncertainty, especially at larger distances.
It becomes obvious that the uncertainty of stereo depth measurements increases quadratically
with distance. Therefore, the bicyclist (enlarged in the left image) at around 60 m appears
highly blurred in the occupancy grid. Free space analysis on such occupancy grids is not very
reliable, so we are looking for strategies to reduce the uncertainty.
One way to reduce the disparity noise is to track features in the images over multiple frames.
If the disparity measurements are uncorrelated, the variance decreases with 1/N, where N is
the number of images. The 6D-Vision algorithm described below exploits this fact.
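For the sample mean of $N$ uncorrelated disparity measurements $d_1, \dots, d_N$ with equal
variance $\sigma^2$, this is simply

\[ \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} d_i\right)
   = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(d_i) = \frac{\sigma^2}{N}. \]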
Fortunately, tracking becomes redundant in static scenes when the ego-motion of the camera
is known a priori. This is beneficial since it allows working with dense stereo disparity maps
despite the real-time constraint. Disparity measurements which are consistent over time are
considered to belong to the same world point, and the disparity variance is reduced
accordingly. This stereo integration requires three main steps:
- Prediction: the current integrated disparity and variance image are predicted. This is
  equivalent to computing the expected optical flow and disparity based on ego-motion. Our
  prediction of the variance image includes the addition of a driving noise parameter that
  models the uncertainties of the system, such as ego-motion inaccuracy.
- Measurement: disparity and variance images are computed based on the current left and right
  images.
- Update: if the measured disparity confirms its prediction, both are fused, reducing the
  variance of the estimation. The consistency of the disparity is verified using a standard
  3-sigma test.
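A minimal per-pixel sketch of the fusion in the update step, assuming prediction and
measurement have already been computed as described above; the driving noise value and all
names are illustrative:

```python
import numpy as np

def update_disparity(d_pred, var_pred, d_meas, var_meas, q=0.05):
    """Fuse predicted and measured disparity images (all arrays of equal shape).

    q is the driving noise added during prediction to model system
    uncertainties such as ego-motion inaccuracy.
    """
    var_pred = var_pred + q
    innovation = d_meas - d_pred
    # standard 3-sigma test: fuse only where the measurement confirms the prediction
    consistent = innovation**2 <= 9.0 * (var_pred + var_meas)
    k = var_pred / (var_pred + var_meas)            # Kalman-style gain
    d_new = np.where(consistent, d_pred + k * innovation, d_meas)
    var_new = np.where(consistent, (1.0 - k) * var_pred, var_meas)
    return d_new, var_new
```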
Figure 2 shows an example of the improvement achieved: the occupancy grid shown at the right
was computed with an integrated disparity image. Note the significantly reduced uncertainties
of the registered 3D points. The bicyclist, approximately 60 meters away, is marked in the
images.
The occupancy grids shown above are in Cartesian coordinates. However, Cartesian space is
not a suitable space to compute the free space because the search must be done in the
direction of rays leaving the camera. The set of rays must span the whole grid. This leads to
discretization problems. A more appropriate space is the polar space. In polar coordinates
every grid column is, by definition, already in the direction of a ray. Therefore, searching for
obstacles in the ray direction is straightforward. For the computation of free space the first
step is to transform the Cartesian grid to a polar grid by applying a remapping operation. The
polar representation we use is a Column/Disparity occupancy grid; for a detailed discussion
see (5). The resulting grid is shown in the middle image of Figure 3.
Fig. 3: Free space computation. The green carpet shows the computed available free space. The free space is
obtained by applying dynamic programming on a Column/Disparity occupancy grid, which is a remapping of the
Cartesian depth map shown at the right. The free space resulting from the dynamic programming is shown over
the grids.
In the polar representation, the task is to find the first visible obstacle in the positive
direction of depth. All the space in front of the first occupied cell is considered free
space. The desired solution forms a path from left to right, segmenting the polar grid into
two regions. Instead of simply thresholding each column, dynamic programming is used. The
method based on dynamic programming has the following properties:
- Global optimization: each column is not considered independently, but as part of a global
  optimization problem which is solved optimally.
- Spatial and temporal smoothness of the solution: spatial smoothness is imposed by penalizing
  depth discontinuities, while temporal smoothness is imposed by penalizing deviations of the
  current solution from the prediction.
- Preservation of spatial and temporal discontinuities: saturating the spatial and temporal
  costs allows the preservation of discontinuities.
Figure 3 shows the result of the dynamic programming applied to the considered scene. For
more details on this analysis see (6).
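A simplified sketch of such a dynamic programming pass over the polar (column x depth-bin)
grid follows; the temporal smoothness term is omitted for brevity, and the cost terms and
names are illustrative:

```python
import numpy as np

def free_space_dp(cost, smooth=0.5, sat=3.0):
    """Find the first-obstacle depth bin per column via dynamic programming.

    cost: [n_cols, n_bins] data cost of placing the free space border at each
    depth bin (low where the polar grid indicates an occupied cell).
    smooth: penalty per bin of depth discontinuity between neighbour columns.
    sat: saturation of the smoothness cost, preserving real discontinuities.
    Returns one depth-bin index per column (the free space border).
    """
    n_cols, n_bins = cost.shape
    bins = np.arange(n_bins)
    acc = cost[0].copy()                      # accumulated cost of the best path
    back = np.zeros((n_cols, n_bins), dtype=int)
    for c in range(1, n_cols):
        # saturated transition cost between previous bin j and current bin i
        trans = np.minimum(smooth * np.abs(bins[:, None] - bins[None, :]), sat)
        total = acc[:, None] + trans          # indexed [prev_bin, cur_bin]
        back[c] = np.argmin(total, axis=0)
        acc = total[back[c], bins] + cost[c]
    # backtrack the globally optimal left-to-right path
    path = np.empty(n_cols, dtype=int)
    path[-1] = int(np.argmin(acc))
    for c in range(n_cols - 1, 0, -1):
        path[c - 1] = back[c, path[c]]
    return path
```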
6D-VISION
Until now, we assumed the world to be static and showed how to combine successive stereo
image pairs to reduce the variance of the free space estimation. This information can be used
for obstacle detection and obstacle avoidance in a straightforward manner, since all non-free
space is considered an obstacle.
However, the world is not completely static and a system for obstacle detection has to cope
with moving objects and precisely estimate their movements to predict potential collisions. A
common approach is to analyze the occupancy grid and to track isolated objects over time.
The major disadvantage of this algorithm is that the segmentation of isolated objects is
difficult in scenes consisting of multiple nearby objects.
Fig. 4: Dangerous traffic scene. The left image shows a pedestrian appearing behind a standing car. The
corresponding stereo reconstruction is shown in the center image. Red encodes close, green encodes far points.
The optical flow field is shown in the right image. Here red lines encode large image displacements, green small
displacements.
This problem is illustrated in Figure 4: Here the pedestrian appears behind the standing car
and runs towards the street. In the center image, the reconstructed stereo information is shown
using the red to green color encoding scheme. Here, the points belonging to the pedestrian can
hardly be distinguished from the points on the standing car. A segmentation based on this
information alone will therefore merge the pedestrian and the standing car into a single
static object.
In the right image, the optical flow between the last and the current frame is shown. The color
encodes the length of the displacement vector: red encodes large image displacements, green
small displacements. Here the pedestrian and the standing car can easily be distinguished.
This leads to the main idea of the 6D-Vision algorithm: Track an image point in one camera
from frame to frame and calculate its stereo disparity. Together with the known motion of the
ego-vehicle, the movement of the corresponding world point can be calculated. In practice, a
direct motion calculation based on two consecutive frames is extremely noisy. Therefore, the
obtained measurements are filtered by a Kalman filter.
Since we allow the observer to move, we fix the origin of the coordinate system to the car.
The state vector of the Kalman filter consists of the world point in the car coordinate system,
and its corresponding velocity vector. The six-dimensional state vector
$(X, Y, Z, \dot{X}, \dot{Y}, \dot{Z})$ gives this algorithm its name: 6D-Vision. The
mathematical details are given in (7).
The measurement vector used in the update step of the Kalman filter is $(u, v, d)$, with u and
v being the current image coordinates of the tracked image point and d its corresponding
disparity. As the perspective projection formulae are non-linear, we have to apply the
Extended Kalman filter.
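A sketch of this nonlinear measurement model and its Jacobian, as the Extended Kalman filter
requires them; the pinhole parameters f, B, u0, v0 are illustrative assumptions, and the
coordinate conventions of the actual system follow (7):

```python
import numpy as np

def measure(state, f=820.0, B=0.30, u0=320.0, v0=240.0):
    """Project a 6D state (X, Y, Z, Vx, Vy, Vz) to the measurement (u, v, d)."""
    X, Y, Z = state[:3]
    return np.array([f * X / Z + u0,    # image column
                     f * Y / Z + v0,    # image row
                     f * B / Z])        # disparity

def measure_jacobian(state, f=820.0, B=0.30):
    """3x6 Jacobian of the measurement; the velocity part is zero."""
    X, Y, Z = state[:3]
    H = np.zeros((3, 6))
    H[0, 0] = f / Z;  H[0, 2] = -f * X / Z**2   # du/dX, du/dZ
    H[1, 1] = f / Z;  H[1, 2] = -f * Y / Z**2   # dv/dY, dv/dZ
    H[2, 2] = -f * B / Z**2                     # dd/dZ
    return H
```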
A block diagram of the algorithm is shown in Figure 5. In every cycle, a new stereo image pair
is obtained. In the left image, appropriate features (e.g. edges, corners) are detected and
tracked over time. In the current application we use a version of the Kanade-Lucas-Tomasi
tracker (8), which provides sub-pixel accuracy and tracks the features robustly over long
image sequences. The disparities of all tracked features are determined in the stereo module.
After this step, the estimated 3D position of each feature is known. Together with the
ego-motion, the measurements of the tracking and stereo modules are passed to the Kalman
filter system, which updates the state estimates. For the analysis of the next image pair, the
acquired 6D information is used to predict the image positions of the tracked features. This
improves the tracking performance with respect to speed and robustness. In addition, the
predicted depth information is used to improve the stereo calculation.
The motion of a vehicle is by no means purely translational; it exhibits strong pitch and roll
components. In order to compensate for these disturbances, a precise ego-motion analysis is
advisable. If stereo tracks are available, the full motion state (6 degrees of freedom) can be
obtained from vision alone. The powerful real-time algorithm we use is described in (9).
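The actual algorithm is the one described in (9); purely as an illustration of how a 6-DoF
rigid motion can be fitted to stereo tracks, here is a least-squares (Kabsch/Horn style)
sketch. In practice, a robust outer loop such as RANSAC would reject tracks on moving objects:

```python
import numpy as np

def rigid_motion(P_prev, P_cur):
    """Least-squares rotation R and translation t with P_cur ~= R @ p + t.

    P_prev, P_cur: [N, 3] arrays of the same static 3D points in two frames.
    """
    cp, cc = P_prev.mean(axis=0), P_cur.mean(axis=0)
    H = (P_prev - cp).T @ (P_cur - cc)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # correct an improper rotation (reflection) if necessary
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cc - R @ cp
    return R, t
```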
Fig. 5: 6D-Vision block diagram
Fig. 6: Estimation results for the pedestrian from Figure 4. The time between the images is (from left to right) 0,
80, 160 and 240 ms. The vectors point to the predicted position of the corresponding world point in 0.5 s. The
color encodes the distance of the points.
The result of this algorithm is shown in Figure 6. From left to right, the estimation results
for the pedestrian from Figure 4 are shown at 0, 80, 160 and 240 ms relative to the first
appearance of the pedestrian. The estimated velocity vectors point to the predicted position
of the corresponding world point in 0.5 s. The colors encode the distance of the points. It
can be seen that this rich information helps to detect the moving pedestrian and, at the same
time, provides a first prediction of his movement.
OBJECT TRACKING
6D-Vision is a powerful method to extract linear point motion in the 3D world. A group of 6D
vectors corresponding to adjacent 3D points with similar 3D motion vectors is likely to belong
to the same object and can thus be used to generate object hypotheses. However, due to the
linear motion model of the single points, predicting the motion of such object hypotheses
without any further constraints is also limited to linear motion. For vehicles, especially
during turning maneuvers, the predicted driving path can then be imprecise and may lead to
misinterpretations.
In (10), a vision-based approach for estimating the nonlinear motion state of vehicles from a
moving platform is proposed. Objects are represented by a 3D point cloud combined with a state
vector including object pose and dynamics. It is assumed that vehicles can be distinguished
from other moving objects such as pedestrians, e.g. based on the dimension and velocity of a
cluster of 6D vectors. The dynamics of a vehicle is approximated by a coordinated turn motion
model, which restricts lateral movements to a circular path based on velocity and yaw rate.
Moving the point cloud in the world induces changes in the image plane, which are observed in
terms of optical flow and disparity changes. An Extended Kalman filter is used to solve the
inverse problem, i.e. to relate these observations in the image to a movement of the point
cloud in the world.
Fig. 7: Turning vehicle at an intersection. The orange box indicates the estimated position and orientation. The
red arrow indicates the predicted driving path assuming constant motion.
All points are referred to a local object coordinate system defined for each tracked vehicle. It
is assumed that the real position of a point within the object coordinate system does not
change over time (rigid body assumption) and that the object's structure is well described by
the point cloud. In practice, one has to deal with noisy observations of these points; neither
the exact position of a single point nor the overall structure is known. However, it is
possible to refine the object point cloud over time based on the accumulated noisy
observations of the single points.
Fig. 7 shows a typical situation at an intersection. The orange box shows the current position
of the oncoming car. The complete motion state of the turning vehicle has been estimated
based on the stereo tracks. Assuming constant motion, the green arrow in front of the car
indicates the expected circular driving path for the next second.
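As a sketch of the coordinated turn prediction underlying such a path, the following
propagates a simplified vehicle state (position, heading psi, velocity v, yaw rate omega); the
state layout is illustrative, and the acceleration state estimated in (10) is omitted:

```python
import numpy as np

def coordinated_turn_predict(state, dt):
    """Propagate (x, z, psi, v, omega) along a circular arc for dt seconds."""
    x, z, psi, v, omega = state
    if abs(omega) < 1e-6:                    # straight-line limit
        return np.array([x + v * dt * np.cos(psi),
                         z + v * dt * np.sin(psi), psi, v, omega])
    r = v / omega                            # turn radius from velocity and yaw rate
    return np.array([x + r * (np.sin(psi + omega * dt) - np.sin(psi)),
                     z - r * (np.cos(psi + omega * dt) - np.cos(psi)),
                     psi + omega * dt, v, omega])
```

Applying this prediction repeatedly over small time steps traces the expected circular driving
path, e.g. for the one-second horizon shown in Fig. 7.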
The proposed system is able to estimate the motion state of vehicles at urban intersections,
including velocity, yaw rate, and acceleration as well as position and orientation, and
currently runs at 25 Hz on our demonstrator car UTA. The filter can easily be extended by
additional measurements, for example radar information such as relative velocity or distance.
FISHEYE STEREO FOR INTERSECTION ASSISTANCE
Common stereo camera systems have opening angles of around 40 degrees. Simple investigations
reveal that this angle must be increased to about 150 degrees if dangerous situations at
intersections, e.g. vehicles approaching from the side, are to be recognized.
Fisheye lenses – in contrast to standard wide angle lenses – have the advantage of a constant
resolution over the whole field of view. Currently, we use 150 degree lenses. A typical image
is shown in Fig. 8. The computation can be limited to 400 lines of the 1628x1236 imager. In
the first step, the images are rectified based on calibration data obtained in an offline
process; for details see (11). This allows using the free space analysis and 6D-Vision
described above without any changes. The rectification step works with a cylindrical camera
model instead of the pinhole model in order to obtain a bounded image size.
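A sketch of how such a cylindrical rectification table could be built, assuming an equidistant
fisheye model (image radius r = f_fish * theta); the calibration and lens model actually used
are described in (11):

```python
import numpy as np

def cylindrical_remap(w, h, f_cyl, f_fish, cx, cy):
    """Build a lookup table mapping cylindrical-image pixels to fisheye pixels.

    Cylindrical model: column u is linear in azimuth, row v is perspective on
    the cylinder, which keeps the image bounded for a 150 degree field of view.
    The fisheye is assumed equidistant (radius r = f_fish * theta). The
    returned maps can be used e.g. with cv2.remap to rectify the image.
    """
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    phi = (u - w / 2.0) / f_cyl                       # azimuth per column
    dx, dy, dz = np.sin(phi), (v - h / 2.0) / f_cyl, np.cos(phi)
    norm = np.sqrt(dx**2 + dy**2 + dz**2)
    theta = np.arccos(np.clip(dz / norm, -1.0, 1.0))  # angle to the optical axis
    rho = np.maximum(np.hypot(dx, dy), 1e-12)         # guard the image center
    r = f_fish * theta                                # equidistant projection
    map_u = (cx + r * dx / rho).astype(np.float32)
    map_v = (cy + r * dy / rho).astype(np.float32)
    return map_u, map_v
```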
Figure 8 shows a situation at a pedestrian crossing. The computed free space is overlaid in
green.
Fig. 8: Free space analysis for the pedestrian crossing situation.
Figure 9 shows a second intersection scene, where a vehicle approaches quickly from the right,
having the right-of-way. Note the position of the vehicle at its initial detection: it is
first detected at 15 m longitudinal and 22 m lateral distance, i.e. 26 m Euclidean distance.
An earlier detection was impossible due to occlusion by a wall, visible at the right edge of
the image.
Fig. 9: 6D-Vision result for a scene with a vehicle approaching fast from the right while having the right-of-way.
The significant lateral motion is detected within 4 frames. The arrow length shows the predicted position in 0.5 s.
The arrow color encodes distance: red is near, green is far away.
Fig. 10: Situation shown in Fig. 9, two seconds later, after our own vehicle has stopped.
The actual object detection is done via direction and position analysis of the 6D vectors (see
the previous section). Figure 10 shows the same scene two seconds later: the ego-vehicle has
almost stopped, while the vehicle from the right was able to continue.
SUMMARY
Vehicles acting in a dynamic environment must be able to detect any static or moving obstacle.
This implies that a stereo vision algorithm has to optimally exploit the spatial and temporal
information contained in the image sequence.
As shown in the paper, precision and robustness of 3D reconstructions are significantly
improved if the stereo information is appropriately integrated over time. This requires
knowledge of the ego-motion, which in turn can be efficiently computed from 3D-tracks. It
turns out that the obtained ego-motion data outperforms commonly used inertial sensors. The
obtained depth maps show less noise and uncertainties than those generated by simple frame-
by-frame analysis. A dynamic programming approach allows for determining the free space
without any error-prone obstacle threshold. The algorithm runs in real-time on a PC and has
proven its robustness in daily traffic, including night-time driving and heavy rain.
The requirement to detect small or partly hidden moving objects from a moving observer calls
for a fusion of stereo and optical flow. This leads to the 6D-Vision approach, which
simultaneously estimates the position and motion of each observed image point. Since the
fusion is based on Kalman filters, the information contained in multiple frames is integrated.
This leads to a more robust and precise estimation than differential approaches such as the
pure evaluation of the optical flow on consecutive image pairs. Grouping this 6D information
is very reliable and enables the fast detection of moving objects, which can be further
tracked using appropriate dynamic models. The same concept is applied to cameras with fisheye
lenses.
Practical tests confirm that a crossing cyclist at an intersection is detected within 4-5 frames.
The implementation on a 3.2GHz Pentium 4 proves real-time capability. Currently, we select
and track about 2000 image points at 25Hz (the images have VGA resolution).
REFERENCES
(1) D. Scharstein, R. Szeliski: "A taxonomy and evaluation of dense two-frame stereo
correspondence algorithms". IJCV 47(1), 2002, pp. 7-42.
(2) H. Hirschmueller: "Accurate and efficient stereo processing by semi-global matching and
mutual information". CVPR 2005, San Diego, CA, Volume 2, June 2005, pp. 807-814.
(3) S. Gehrig, U. Franke: "Improving Stereo Sub-Pixel Accuracy for Long Range Stereo".
Workshop on Virtual Representations and Modeling of Large-Scale Environments (VRML), ICCV
2007, Rio de Janeiro, 2007.
(4) T. Dang, C. Hoffmann: "Tracking Camera Parameters of an Active Stereo Rig". 28th Annual
Symposium of the German Association for Pattern Recognition (DAGM 2006), Berlin, September
12-14, 2006.
(5) H. Badino, U. Franke, R. Mester: "Free space computation using stochastic occupancy grids
and dynamic programming". Workshop on Dynamical Vision, ICCV 2007, Rio de Janeiro, 2007.
(6) U. Franke, S. Gehrig, H. Badino, C. Rabe: "Towards Optimal Stereo Analysis of Image
Sequences". RobotVision 2008, Auckland, February 18-20, 2008.
(7) U. Franke, C. Rabe, H. Badino, S. Gehrig: "6D-Vision: Fusion of Stereo and Motion for
Robust Environment Perception". 27th DAGM Symposium, 2005, pp. 216-223. ISBN 3-540-28703-5.
(8) J. Shi, C. Tomasi: "Good Features to Track". IEEE Conference on Computer Vision and
Pattern Recognition, 1994, pp. 593-600.
(9) H. Badino, U. Franke, C. Rabe, S. Gehrig: "Stereo-vision based detection of moving objects
under strong camera motion". VISAPP, Setubal, Portugal, February 2006.
(10) A. Barth, U. Franke: "Where will the Oncoming Vehicle be the Next Second?". IEEE
Intelligent Vehicles Symposium (IV 2008), Eindhoven, June 4-6, 2008.
(11) S. Gehrig, C. Rabe, L. Krüger: "6D Vision Goes Fisheye for Intersection Assistance".
Canadian Conference on Computer and Robot Vision (CRV 2008), Windsor, May 2008.