
Lost and Found: Detecting Small Road Hazards for Self-Driving Vehicles

Peter Pinggera¹,³,∗, Sebastian Ramos¹,²,∗, Stefan Gehrig¹, Uwe Franke¹, Carsten Rother², Rudolf Mester³

arXiv:1609.04653v1 [cs.CV] 15 Sep 2016

Abstract— Detecting small obstacles on the road ahead is a critical part of the driving task which has to be mastered by fully autonomous cars. In this paper, we present a method based on stereo vision to reliably detect such obstacles from a moving vehicle.

The proposed algorithm performs statistical hypothesis tests in disparity space directly on stereo image data, assessing free-space and obstacle hypotheses on independent local patches. This detection approach does not depend on a global road model and handles both static and moving obstacles.

For evaluation, we employ a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixel-wise annotations of obstacle and free-space and provide a thorough comparison to several stereo-based baseline methods. The dataset will be made available to the community to foster further research on this important topic.⁴

The proposed approach outperforms all considered baselines in our evaluations on both pixel and object level and runs at frame rates of up to 20 Hz on 2 megapixel stereo imagery. Small obstacles down to the height of 5 cm can successfully be detected at 20 m distance at low false positive rates.

I. INTRODUCTION

The detection of small obstacles or debris on the road is a crucial task for autonomous driving. In the US, 25000 crashes per year due to road debris were reported in 2004 [1], and approximately 150 people were killed by accidents involving lost hazardous cargo in 2011 [2]. In Austria, a dedicated initiative was launched in 2006 to detect hazardous cargo in tunnels to improve traffic safety [3].

The task of detecting small but potentially hazardous cargo on the road proves to be quite difficult, even for experienced human drivers (see e.g. Fig. 1). Different sensor types can be applied to the problem, from passive cameras to active radar or lidar sensors. While active range sensors provide high accuracy in terms of point-wise distance and velocity measurement, they typically suffer from low resolution and high cost. In contrast, cameras provide very high spatial resolution at relatively low cost. However, the detection of small obstacles is a very challenging problem from a computer vision perspective, since the considered objects cover very small image areas and come in all possible shapes and appearances.

¹Environment Perception, Daimler R&D, Sindelfingen, Germany, {firstname.lastname}@daimler.com
²Computer Vision Lab Dresden, TU Dresden, Germany, [email protected]
³VSI Lab, Computer Science Dept., Goethe Univ., Frankfurt, Germany, [email protected]
∗The first two authors contributed equally to this work.
⁴http://www.6d-vision.com/lostandfounddataset

[Fig. 1. Exemplary scene from the Lost and Found dataset and corresponding results of the proposed approach (panels: left stereo input image; ground truth annotation; detection results of the FPHT-CStix method, color-coded by distance; 3D view). The setup includes static and moving hard-to-detect obstacles such as a EUR-pallet and a soccer ball (blue annotations), as well as a non-hazardous flat piece of wood (yellow annotation). Ground truth free-space is shown in purple.]

Stereo or multi-camera setups allow for the computation of dense range maps of the observed environment and are increasingly popular for application in self-driving cars and mobile robots in general. Unfortunately, an inherent drawback of stereo vision systems is the comparatively low accuracy of distance measurements, especially at long ranges. However, accuracy is crucial for the timely detection of small obstacles and an appropriate response by safety-critical moving platforms.

In this paper, we build upon previous work of [4] and extend it in two ways: We present a reparametrization of the underlying problem to boost efficiency, achieving a speed-up by a factor of 10 while keeping the quality of the results at the highest level. In addition, inspired by the established Stixel representation [5], we introduce a mid-level obstacle representation based on the original point-based output, resulting in improved robustness and compactness that significantly aids further processing steps.

Along with these contributions, we introduce Lost and Found, the first dataset dedicated to visual lost-cargo detection, to push forward research on these typically under-represented but critical events. In our detailed evaluation, we introduce suitable metrics derived from related computer vision problems with the application focus in mind.

The paper is organized as follows: Section II lists relevant related work concerning lost cargo detection. The proposed and baseline methods are explained in Section III. Our new dataset, an extensive evaluation of the presented methods and a discussion of the results are covered in Section IV. The final section comprises conclusions and future work.

II. RELATED WORK

There exists a significant amount of literature on obstacle detection in general, spanning a variety of application areas. However, the literature focusing on the detection of small obstacles on the road is quite limited. Most relevant are camera-based methods for the detection and localization of generic obstacles in 3D space, using small-baseline stereo setups (< 25 cm) on autonomous cars.

Many obstacle detection schemes are based on the flat-world assumption, modeling free-space or ground as a single planar surface and characterizing obstacles by their height over ground [6], [7], [8]. Geometric deviations from the reference plane can be estimated either from a precomputed point cloud, directly from image data [9], or via mode extraction from the v-disparity histogram on multiple scales [10]. To become independent of the often violated flat-world assumption, more sophisticated ground profile models have been introduced, from piece-wise planar longitudinal profiles [11] to clothoids [12] and splines [13]. Also, parameter-free ground profile models have been investigated, using multiple filter steps and adaptive thresholding in the v-disparity domain [14].

The survey in [15] presents an overview of several stereo-based generic obstacle detection approaches that have proven to perform very well in practice. The methods are grouped into different obstacle representation categories and include Stixels [5], Digital Elevation Maps (DEM) [16] and geometric point clusters [17], [18]. Notably, all of the methods rely on precomputed stereo disparity maps. We select the Stixel method [5] as well as the point clustering method of [17], [18] to serve as baselines during our experimental evaluation.

The Stixel algorithm distinguishes between a global ground surface model and a set of rectangular vertical obstacle segments, providing a compact and robust representation of the 3D scene. In [17], [18], the geometric relation between 3D points is used to detect and cluster obstacle points.

The above methods yield robust detection results based on generic geometric criteria and perform best for detecting medium-sized objects at close to medium range. Detection performance and localization accuracy drop with increasing distance and decreasing object sizes.

Our present work builds on the approach presented in [4], originally devised for high sensitivity in long range detection tasks. Obstacle and free-space hypotheses are tested against each other using local plane models, optimized directly on the underlying image data.

Several works use appearance cues in addition to geometric information to improve generic obstacle detection. Detection of lost cargo can also be mapped to the task of anomaly detection [19]. There, every patch that deviates from the learned road appearance is considered a potential hazard. Due to the lack of 3D information, this method also triggers on patches of harmless flat objects. In [20], color and texture cues are used for stereo-based obstacle detection, incorporating the cues into the Stixel framework of [5]. For off-road driving, basic deep learning techniques have been employed to detect driveable regions [21]. There, stereo-based obstacle detection acts as a supervisor for collecting training samples in the short range while the classifier yields predictions for the long range. A similar idea has been pursued in [22]; however, instead of learning-based predictions, a spectral clustering of superpixels is applied for the long-range prediction.

Overall, the type of the objects to be detected is by definition unknown, which makes the direct application of common machine learning-based approaches difficult.

III. METHODS

A. Direct Planar Hypothesis Testing (PHT)

In this work we build upon and extend the geometric obstacle detection approach proposed in [4]. We refer to this method as Direct Planar Hypothesis Testing (PHT), since the detection task is formulated as a statistical hypothesis testing problem on the image data.

Free-space is represented by the null hypothesis $\mathcal{H}_f$, while obstacles correspond to the alternative hypothesis $\mathcal{H}_o$. The hypotheses are characterized by constraints on the orientations of local 3D plane models, each plane being defined by a parameter vector $\vec{\theta}$ comprised of the normal vector $\vec{n}$ and the normal distance $D$ from the origin: $\vec{\theta} = (n_X, n_Y, n_Z, D)^T$. The parameter spaces of $\mathcal{H}_f$ and $\mathcal{H}_o$ are bounded by the maximum allowed deviations $\bar{\varphi}_i$, $i = f, o$, of the normal vectors from their reference orientation (see Fig. 2a).

For each local image patch an independent Generalized Likelihood Ratio Test (GLRT) is formulated using the Maximum Likelihood Estimates (MLE) $\hat{\vec{\theta}}_i$ of the respective hypothesis parameter vector. The decision for $\mathcal{H}_o$ is then taken based on the threshold $\gamma$ according to the criterion

$\ln\big( p(\vec{I}; \hat{\vec{\theta}}_o, \mathcal{H}_o) \big) - \ln\big( p(\vec{I}; \hat{\vec{\theta}}_f, \mathcal{H}_f) \big) > \ln(\gamma).$  (1)

The likelihood terms $p(\cdot)$ are derived directly from a statistical model of the stereo image data $\vec{I}$, where the left and right intensity values $I_l(\vec{x})$ and $I_r(\vec{x})$ represent noisy samples of the observed image signal $f$ at position $\vec{x} = (x, y)^T$, with local bias $\alpha(\vec{x})$ and zero-mean noise samples $\eta(\vec{x})$:

$I_l(\vec{x}) = f(\vec{x}) + \alpha_l(\vec{x}) + \eta(\vec{x})$  (2)

$I_r\big( W(\vec{x}, \vec{\theta}) \big) = f(\vec{x}) + \alpha_r(\vec{x}) + \eta(\vec{x}).$  (3)

[Fig. 2. Geometric models for point-wise obstacle detection criteria. (a) Direct Planar Hypothesis Testing (PHT) [4]: the angles $\varphi_f$ and $\varphi_o$ constrain the allowed plane normal orientations; the $Z$ axis represents the optical axis of the camera. (b) Point Compatibility (PC) [17], [18]: any point $P_2$ lying within the given truncated cone based at point $P_1$ is labeled as obstacle and the points are said to be compatible, i.e. are part of an obstacle cluster.]

The warp $W$ transforms the image coordinates $\vec{x}$ from the left to the right image, according to the plane model of the true hypothesis and the camera parameters $P = K[R \,|\, \vec{t}\,]$. For the models used in [4] this warp represents a multiplication by the plane-induced homography $H = K\big(R - \frac{1}{D}\,\vec{t}\,\vec{n}^T\big)K^{-1}$.

Each hypothesis' contribution to (1) then reduces to a sum $F$ of pixel-wise residuals over the local patch area $\Omega$:

$F_i = \sum_{\vec{x} \in \Omega} \rho\Big( I_r\big( W(\vec{x}, \vec{\theta}_i) \big) - \hat{f}(\vec{x}) \Big).$  (4)

The loss function $\rho$ depends on the assumed noise model, and the unknown image signal $f$ is estimated as the mean of the concurrently realigned input images. Finding the MLE $\hat{\vec{\theta}}_i$ corresponds to a non-linear optimization problem with simple bound constraints:

$\hat{\vec{\theta}}_i \leftarrow \arg\min_{\vec{\theta}_i} F_i \quad \text{s.t.} \quad |\varphi_i| \le \bar{\varphi}_i.$  (5)

The optimization problem is solved iteratively using a projected Levenberg-Marquardt method, similar to the local approach proposed in [23].

Note that only reliable decisions are reported, i.e. for patches in sufficiently textured image areas, where the minimum eigenvalue of the approximate Hessian is sufficiently large during optimization.
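To make the decision rule concrete, the following sketch evaluates the GLRT (1) for a single patch under the Gaussian noise model used for PHT and FPHT, where maximizing the likelihood reduces to minimizing the sum of squared residuals $F_i$ of Eq. (4). The optimizer fit_plane_constrained is a hypothetical stand-in for the projected Levenberg-Marquardt routine, and the noise variance is assumed to be absorbed into the threshold:

```python
def glrt_patch_decision(patch_left, patch_right, fit_plane_constrained,
                        bounds_free, bounds_obstacle, ln_gamma):
    """Sketch of the per-patch GLRT under a Gaussian noise model.

    With squared loss, the log-likelihood ratio of Eq. (1) reduces
    (up to the noise variance, assumed absorbed into ln_gamma) to the
    difference of the minimal residual sums F_i of the two constrained
    fits. fit_plane_constrained is assumed to return the minimal
    residual F_i for the MLE plane parameters within the given bounds.
    """
    F_free = fit_plane_constrained(patch_left, patch_right, bounds_free)
    F_obst = fit_plane_constrained(patch_left, patch_right, bounds_obstacle)
    # Decide for the obstacle hypothesis if its fit is sufficiently better
    return (F_free - F_obst) > ln_gamma
```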

B. Fast Direct Planar Hypothesis Testing (FPHT)

The PHT method provides high flexibility in terms of both model parameters and multi-camera configurations. However, for calibrated stereo cameras a simplified parametrization can be utilized, reducing the number of free parameters and the complexity of the optimization problem.

Thus we propose a Fast Direct Planar Hypothesis Testing (FPHT) method that exploits such a reparametrization, resulting in a computational speed-up of one order of magnitude without sacrificing detection performance in practice. The proposed reparametrization is based on considering
• rectified stereo image pairs and
• plane models without yaw or roll angles, i.e. $n_X = 0$.

Under these assumptions, computation of the warp $W$ is simplified significantly, since a plane with $n_X = 0$ can be represented by a line in stereo disparity space [11]:

$W(\vec{x}, \vec{\theta}) = \begin{pmatrix} x - d \\ y \end{pmatrix} = \begin{pmatrix} x - (a\bar{y} + b) \\ y \end{pmatrix}.$  (6)

Disparity is denoted by $d$, while $\bar{y} = y - y_c + h/2$ represents the normalized vertical image coordinate, with the patch center position $y_c$ and height $h$.

The new parameter vector $\vec{\theta}^* = (a, b)^T$ consists only of the disparity slope $a$ and offset $b$, which directly relate to the 3D plane parameters as

$a = -n_Y \frac{f_x}{f_y} \frac{B}{D}, \qquad b = -\frac{B}{D}\Big( n_Y (y_c - y_0) \frac{f_x}{f_y} + n_Z f_x \Big),$  (7)

where the camera's focal lengths, vertical principal point and baseline are denoted by $f_x$, $f_y$, $y_0$ and $B$, respectively.
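A direct transcription of Eq. (7), useful e.g. for initializing the optimization from a plane estimate; note that the sign conventions follow our reconstruction of the equation from the garbled source and should be checked against the original paper:

```python
def plane_to_disparity_line(nY, nZ, D, fx, fy, y0, yc, B):
    """Map a 3D plane with nX = 0 to a line d = a * ybar + b in
    disparity space, following Eq. (7) as reconstructed above.
    """
    a = -nY * (fx / fy) * (B / D)
    b = -(B / D) * (nY * (yc - y0) * fx / fy + nZ * fx)
    return a, b
```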

In contrast to the PHT method, where the parameters are bounded by globally valid box constraints, we have to consider the fact that two planes with the same slope angle $\varphi_i$ in 3D space, but with different normal distances $D$ to the camera, will in general have different disparity offsets $b$ as well as different slopes $a$ in disparity space. It is therefore not possible to specify independent global bounds on $a$ and $b$. Instead, we formulate the constraints by plugging the original bounds on the normal vector $\vec{n}$ into the linear relation

$a = \frac{b}{(y_0 - y_c) + f_y \, n_Z / n_Y} = b \cdot \mathrm{const}.$  (8)

If an optimization step violates the bound and results in an invalid configuration of $a$ and $b$, both values are warped back onto the bounding line via vector projection.
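The back-projection is a textbook vector projection onto the line $a = c \cdot b$ in $(a, b)$ space; a minimal sketch, with $c$ denoting the constant of Eq. (8):

```python
import numpy as np

def project_onto_bound(a, b, c):
    """Project an (a, b) iterate onto the line a = c * b via vector
    projection, as used when an optimization step leaves the feasible
    set. The line direction is (c, 1) in (a, b) space.
    """
    direction = np.array([c, 1.0])
    direction /= np.linalg.norm(direction)
    a_proj, b_proj = np.dot(np.array([a, b]), direction) * direction
    return a_proj, b_proj
```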

As our principal motivation for FPHT is efficiency, we skip the estimation of the true image signal as in the PHT method and instead use the left image sample as the reference estimate. For both PHT and FPHT we use a Gaussian noise model, i.e. a squared error for the loss $\rho$.

C. Point Compatibility (PC)

As a first baseline we use the Point Compatibility (PC) approach proposed in [17] and successfully applied in [18]. This geometric obstacle detection method is based on the relative positions of pairs of points in 3D space. Placing a truncated cone on a point $P_1$ as shown in Fig. 2b, any point $P_2$ lying within that cone is labeled as obstacle and said to be compatible with $P_1$. The cone is defined by the maximum slope angle $\varphi$, the minimum relevant obstacle height $H_{min}$ and the maximum connection height threshold $H_{max}$.

All points of a precomputed stereo disparity map are tested in this way by traversing the pixels from bottom left to top right. The truncated cones are projected back onto the image plane and the points within the resulting trapezium are labeled accordingly. The algorithm not only provides a pixel-wise obstacle labeling but at the same time performs a meaningful clustering of compatible obstacle points [18].

Similar to PHT and FPHT, the PC approach does not depend on any global surface or road model due to its relative geometric decision criterion. However, it does depend directly on the quality of the underlying point cloud.
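A minimal sketch of the pairwise compatibility test, assuming a coordinate frame in which the second component measures height above ground (a convention this sketch assumes, since the paper does not spell it out):

```python
import math

def compatible(p1, p2, phi, h_min, h_max):
    """Sketch of the Point Compatibility test [17]: p2 lies within the
    truncated cone based at p1 if the height difference falls in
    [h_min, h_max] and the connecting segment is steeper than the
    maximum slope angle phi (in radians). Points are (X, Y, Z) tuples
    with Y as height above ground.
    """
    dy = p2[1] - p1[1]                              # height difference
    dr = math.hypot(p2[0] - p1[0], p2[2] - p1[2])   # horizontal distance
    if not (h_min <= dy <= h_max):
        return False
    # Slope angle of the connecting segment must exceed the cone angle
    return math.atan2(dy, dr) >= phi
```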

D. Stixels

Furthermore, we use the Stixel approach of [5] as a baseline. Stixels provide a compact and robust description of 3D scenes, especially in man-made environments with predominantly horizontal and vertical structures. The algorithm distinguishes between a global ground surface model and a set of vertical obstacle segments of variable height. The segmentation task is based directly on a precomputed stereo disparity map and is performed column-wise in an optimal way via dynamic programming. The algorithm makes use of an estimated road model, a B-spline model as in [13], and incorporates further features such as ordering and gravitational constraints.

The results of the Stixel computation depend on both the quality of the disparity map and the estimated road model.

E. Mid-level Representation: Cluster-Stixels (CStix)

Inspired by the compactness and flexibility of the Stixel representation, we present a corresponding extension for point-wise obstacle detection approaches such as PHT, FPHT and PC. Our aim is to create an obstacle representation similar to the Stixel algorithm, reducing the amount of output data and at the same time increasing robustness. Additionally, interpolating the sparse point clouds can even increase detection performance.

The proposed approach does not perform column-wise optimization like the actual Stixel algorithm, but consists of a clustering and a splitting step (cf. Alg. 1).

1) Clustering: In the first step, density-based geometric point clustering is performed via a modified DBSCAN algorithm [24]. We approximate the circular point neighborhood regions of the original algorithm by rectangles for efficient data access using bulk-loaded R-trees [25], [26]. The orientation of the rectangular neighborhood regions is aligned to the viewing rays of the camera.

Furthermore, we introduce several distance-adaptive modifications, considering the characteristics of stereo-based point clouds. The sizes of neighborhood regions are adapted to the points' distances, according to the estimated disparity noise. Also, the minimum number of cluster points is scaled with the distance from the camera, where the scaling formula is similar to the coordinate scaling in [12]: $minPts = minPts_0 + k \cdot f_x / Z$.
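A sketch of the distance-adaptive parameter computation: the minPts rule is taken from the text, while the neighborhood scaling with the quadratic stereo depth error is our assumption, since the exact formula is not given in the paper:

```python
def adaptive_dbscan_params(Z, fx, B, min_pts0, k, eps0, sigma_d):
    """Distance-adaptive DBSCAN parameters (sketch). minPts follows
    minPts = minPts0 + k * fx / Z as given in the text. The
    neighborhood size eps is widened with the expected depth error of
    stereo, which grows quadratically with distance
    (Z^2 * sigma_d / (fx * B)); this error model is an assumption.
    """
    min_pts = min_pts0 + k * fx / Z
    eps = eps0 + (Z ** 2) * sigma_d / (fx * B)  # assumed error model
    return max(1, round(min_pts)), eps
```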

The adaptive DBSCAN algorithm allows for the use of meaningful clustering parameters, combining real-world dimensions and disparity uncertainty, and avoids discretization artifacts typical of e.g. scaled grid maps.

Note that for the PC approach the clustering step is omitted, since the detection algorithm itself already provides a set of meaningful clusters.

2) Splitting: After the clustering phase, each cluster is split horizontally and vertically into a set of Stixel-like vertical boxes. The horizontal splitting step strictly enforces a fixed box width to ensure the characteristic Stixel layout. The optional vertical splitting step takes the precomputed disparity map into account and performs recursive splits only as long as the disparity variance of a Stixel box exceeds a certain threshold.

Algorithm 1 Mid-level representation: Cluster-Stixels
Input:
  – list of obstacle points P, e.g. from PHT, FPHT, PC
  – dense or sparse disparity map D
Output:
  – list of obstacle Cluster-Stixels CStix

1: function MIDLEVELREP( P, D )
2:   C ← ADAPTIVEDBSCAN( P )  ▷ compute list of obstacle clusters C with associated points
3:   CStix ← SPLITANDFIT( C, D )  ▷ split clusters C and fit bounding boxes (BB)
4:   return CStix
5: end function

1: function SPLITANDFIT( C, D )
2:   for all C ∈ C do
3:     CStix +← SPLITHORIZONTALLY( C, width )  ▷ split horizontally and fit BB with fixed Stixel width
4:     CStix +← SPLITVERTICALLY( CStix, C, D )  ▷ split vertically until disparity variance in BB is below threshold
5:   end for
6:   return CStix
7: end function
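As a runnable companion to Algorithm 1, the following Python sketch implements the split-and-fit step for one cluster. Data layout and variable names are assumptions, and the recursion uses a simple midpoint split; the paper does not specify the exact splitting heuristic:

```python
import numpy as np

def split_and_fit(cluster_pixels, disparity, width, var_thresh):
    """Sketch of Algorithm 1's SplitAndFit: split a cluster into
    fixed-width columns, then split each column vertically while the
    disparity variance inside the box exceeds var_thresh.
    cluster_pixels is an (N, 2) integer array of (x, y) positions;
    disparity is a 2D array with NaN for invalid measurements.
    Returns boxes as (x, y_top, width, height) tuples.
    """
    stixels = []
    x_lo, x_hi = cluster_pixels[:, 0].min(), cluster_pixels[:, 0].max()
    for x0 in range(int(x_lo), int(x_hi) + 1, width):
        col = cluster_pixels[(cluster_pixels[:, 0] >= x0) &
                             (cluster_pixels[:, 0] < x0 + width)]
        if len(col) == 0:
            continue
        stack = [(int(col[:, 1].min()), int(col[:, 1].max()))]
        while stack:
            y_top, y_bot = stack.pop()
            box = disparity[y_top:y_bot + 1, x0:x0 + width]
            vals = box[np.isfinite(box)]
            if vals.size > 1 and vals.var() > var_thresh and y_bot > y_top:
                mid = (y_top + y_bot) // 2  # recursive vertical split
                stack += [(y_top, mid), (mid + 1, y_bot)]
            else:
                stixels.append((x0, y_top, width, y_bot - y_top + 1))
    return stixels
```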

IV. EVALUATION

A. Lost and Found Dataset

In order to evaluate the performance of small road obstacle detection approaches, we introduce a novel dataset with recordings from 13 different challenging street scenarios, featuring 37 different obstacle types. The selected scenarios contain particular challenges including irregular road profiles, far object distances, different road surface appearance and strong illumination changes. The objects are selected to reproduce a representative set that may actually appear on the road in practice (see Fig. 3). These objects vary in size and material, which are factors that define how hazardous an object may be for a self-driving vehicle in case the obstacle is placed within the driving corridor. Note that we currently treat some very flat objects (i.e. lower than 5 cm) as non-hazardous and thus do not take them into account in the results of Sect. IV-C.


Fig. 3. Collection of objects included in the Lost and Found dataset

TABLE I
Details on the dataset subsets. Numbers in parentheses represent unseen test items not included in the training set.

Subset      Sequences   Frames   Locations   Objects
Train/Val   51          1036     8           28
Test        61          1068     5 (5)       35 (9)

The Lost and Found dataset consists of a total of 112 video stereo sequences with coarse annotations of free-space areas and fine-grained annotations of the obstacles on the road. Annotations are provided for every 10th frame, giving a total of 2104 annotated frames (see Fig. 7). Each object is labeled with a unique ID, allowing for a later refinement into different subcategories (e.g. obstacle sizes).

The dataset is split into a Train/Validation subset and a Test subset. Each of these subsets consists of recordings in completely different surroundings, covering a similar number of video sequences, frames and objects (see Table I). The Test subset contains nine previously unseen objects that are not present at all in the Training/Validation subset. Further, the test scenarios can be considered more difficult than the training scenarios, owing e.g. to more complex road profile geometries.

The stereo camera setup features a baseline of 21 cm and a focal length of 2300 pixels, with spatial and radiometric resolutions of 2048×1024 pixels and 12 bits. While the dataset contains full color images, all methods considered in the present work make use only of the grayscale data.
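To put these figures in perspective (a back-of-the-envelope calculation from the stated parameters, assuming $f_y \approx f_x$): a point at $Z = 20\,\mathrm{m}$ yields a disparity of only $d = f_x B / Z \approx 2300 \cdot 0.21 / 20 \approx 24\,\mathrm{px}$, and an obstacle of height $H = 5\,\mathrm{cm}$ at that range covers merely $f_y H / Z \approx 5.8\,\mathrm{px}$ vertically, which illustrates how little image evidence is available for the detection task.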

To the best of our knowledge, this represents the first publicly available dataset with its main focus on the detection of small hazards and lost cargo on the road. We hope that this dataset supports further research on this critical topic for self-driving vehicles.

B. Metrics

To quantitatively analyze the detection performance of the different approaches, we define pixel- and object-level metrics derived from related computer vision problems, keeping the application focus in mind.

1) Pixel-level Metric: As a first metric we define a Receiver Operating Characteristic (ROC) curve that compares a pixel-wise True Positive Rate (TPR) against the False Positive Rate (FPR). This ROC curve is generated by performing a parameter sweep and computing the convex hull over the results from all evaluated parameter configurations.

$TPR = \frac{TP \cdot Sub^2 \cdot Dwn^2}{GT_{Obstacles}}$  (9)

$FPR = \frac{FP \cdot Sub^2 \cdot Dwn^2}{GT_{FreeSpace}}$  (10)

TP and FP refer to the numbers of true and false pixel-wise predictions that a given method produces, which are evaluated with respect to the annotated image areas. Sub and Dwn are two scaling factors that compensate for the subsampling and downsampling settings of some of the evaluated methods (cf. Sect. IV-C). Finally, $GT_{Obstacles}$ and $GT_{FreeSpace}$ are the total numbers of ground truth pixels labeled as obstacle or free-space, respectively.
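A small sketch of this pixel-level metric, treating pred as the set of predicted pixels marked on the full-resolution grid, each standing in for a Sub·Dwn × Sub·Dwn block of the input (hence the scale factor); mask names are ours:

```python
import numpy as np

def pixel_rates(pred, gt_obstacle, gt_freespace, sub=2, dwn=1):
    """Sketch of Eqs. (9) and (10). pred, gt_obstacle and gt_freespace
    are boolean NumPy masks on the full-resolution annotation grid;
    sub/dwn compensate the detector's sampling stride.
    """
    scale = (sub ** 2) * (dwn ** 2)
    tp = np.logical_and(pred, gt_obstacle).sum()
    fp = np.logical_and(pred, gt_freespace).sum()
    tpr = tp * scale / gt_obstacle.sum()
    fpr = fp * scale / gt_freespace.sum()
    return tpr, fpr
```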

2) Instance-level Metric: The main drawback of the above described ROC curve is its bias toward object instances that cover large areas in the images. In order to overcome this disadvantage, we propose a second metric based on an instance-level variation of the Jaccard Index, known as instance Intersection over Union (iIoU) [27]. This second metric analyzes the instance-level Intersection (iInt) between mid-level predictions (Stixels) and pixel-wise ground truth annotations over the false positive Stixels per frame (FP/frame). A Stixel is defined as a false positive if its overlap with the labeled free-space area is larger than a given threshold (50% for the purpose of the evaluation presented in the next subsection).

$iInt = \frac{iTP}{iTP + iFN}$  (11)

iTP and iFN represent pixel-wise true positives and false negatives per instance.
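For illustration, a sketch of the instance-level intersection. Whether the per-instance ratios are averaged or aggregated over all instances is not specified here, so this sketch averages per instance (an assumption):

```python
def instance_intersection(instances, pred_mask):
    """Sketch of Eq. (11): for each ground-truth instance, count
    predicted pixels inside it (iTP) and missed pixels (iFN), then
    average the per-instance ratios. instances is a list of boolean
    NumPy masks, pred_mask a boolean NumPy mask of the same shape.
    """
    scores = []
    for inst in instances:
        i_tp = (inst & pred_mask).sum()
        i_fn = (inst & ~pred_mask).sum()
        scores.append(i_tp / (i_tp + i_fn))
    return sum(scores) / len(scores) if scores else 0.0
```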

C. Results

1) Experimental Setup: All methods described above are included in our experiments for evaluation. For all point-based methods (PHT, FPHT and PC), by default we employ subsampling with a stride of two for higher computational efficiency. From our experience, this is a reasonable choice that sacrifices no significant detection performance. To further investigate the trade-off between efficiency and detection performance, we optionally scale down the images by an additional factor of two for even faster execution. In the following, results including this downscaling step will be denoted as downsampled.

We perform a parameter sweep over the principal parameters of each method, i.e.:
• PHT/FPHT: patch size h×w, minimum eigenvalue for optimization, likelihood ratio decision threshold γ
• PC: maximum angle ϕ, $H_{min}$, $H_{max}$
• Stixels: vertical cut costs used in dynamic programming


For PHT and FPHT, the exact bounds on the plane normals prove not to be too critical, and we set them to $\varphi_f = 25°$ and $\varphi_o = 45°$. The parameter vectors are initialized using a dense disparity map, precomputed via Semi-Global Matching (SGM) [28]. The same disparity map provides the input to the PC and Stixel approaches.

For the additional computation of the mid-level Cluster-Stixels representation we use a fixed, manually optimized set of parameters. In fact, these exact parameter values have a much lower impact on the final results than the parameters optimized in the sweep described above.

2) Quantitative Results: First, the primary methods (PHT, FPHT, PC and Stixels) are benchmarked using the Training/Validation subset and the described pixel-level ROC curve. For the purpose of this first evaluation, we perform a parameter sweep as described above. The best performing parameter configurations are then determined by computing the convex hulls over the TPR and FPR results (see Fig. 4). Note that the main purpose of this step is method-specific parameter optimization. Direct comparisons between the ROC curves of the various methods have to be approached with care, as e.g. the large-object-size bias mentioned earlier has to be considered.

Once the best performing parameter sets have been determined, a second pixel-level ROC curve is computed on the Test subset, including the primary approaches along with their corresponding Cluster-Stixels extensions (Fig. 5).

It can be observed that all methods except Stixels yield consistent results across the Training and Test subsets. Stixels perform notably worse on the Test subset, as it contains rather challenging road profiles, where a failure of the road estimation module has fatal consequences for the FPR. Therefore, Stixel results are not shown within the axis scales of Fig. 5.

Note that the curves in Fig. 5 display a considerable gain provided by the proposed mid-level Cluster-Stixels representation.

Finally, we compare the Cluster-Stixels extensions of the proposed and baseline approaches using the defined instance-level metric. The results in Fig. 6 clearly show that the PHT/FPHT approaches significantly outperform both baselines, achieving iInt values of approx. 0.4 at an average of 3 false positives per frame. Here the negative impact of downsampling becomes visible, since it mainly influences the smallest object instances at long distances, now being weighted equally by the metric.

Notably, it can be seen that the proposed FPHT method performs on a par with or even better than the original PHT variant. While the algorithmic core of PHT with downsampling takes approx. 500 ms to process a full image on a state-of-the-art GPU, our FPHT version requires only 50 ms.

[Fig. 4. Pixel-level: TPR over FPR (Training subset). Curves represent the convex hulls of the respective parameter sweep results.]

[Fig. 5. Pixel-level: TPR over FPR (Test subset).]

[Fig. 6. Mid-level: iInt over FP/frame (Test subset).]

3) Qualitative Results: Fig. 7 depicts qualitative results of the evaluated methods on three example scenarios from the Test subset. The left column shows a typical example of a small road hazard (bobby car) in a residential area. In this case, due to the flat road profile and the medium object size, all methods are able to successfully detect the object.

In the middle column, an example with objects at large distances on a bumpy surface is shown. At such distances, the signal-to-noise ratio of the disparity measurements drops significantly, leading to a very low quality of the constructed 3D point cloud. Thus, neither the Stixel nor the PC approaches are able to detect the relevant objects in the scene. In contrast, the FPHT methods, which operate on the image data directly, still perform reasonably well at such large distances.

The scene in the rightmost column illustrates a rather challenging case for geometry-based obstacle detection approaches. A noticeable double kink in the longitudinal road profile would require an extremely accurate road model estimation for the Stixel method to be able to detect such small objects. While the PC and FPHT methods are invariant to such conditions, only FPHT succeeds in actually detecting the tire on the left side of the image. The tire simply appears to be not prominent enough for a PC-based detection. Considering the FPHT-CStix results, it can be seen that the detections cover larger portions of the obstacle than the FPHT results, which demonstrates that the clustering step yields a suitably compact representation. The second obstacle in the scene (square timber) is not detected by any of the methods due to its low profile.

Overall, from the observed qualitative results it can be concluded that the FPHT methods show the best performance across various obstacles and scenarios. The PC methods suffer from increased false positive rates, since noisy disparity measurements directly influence the results. This effect could possibly be reduced by applying sophisticated spatial and temporal disparity filtering methods. The qualitative results also confirm the Stixel method's dependency on a correctly estimated road profile.

V. CONCLUSION

In this work, we have presented and evaluated an efficient stereo-based method for detecting small but critical road hazards, such as lost cargo, for self-driving vehicles. The approach extends previous work in this very relevant area, providing a significant gain in computational speed while at the same time outperforming all baseline methods. Additionally, the proposed mid-level representation Cluster-Stixels yields an extra gain in detection performance and robustness as well as a significant reduction in output complexity.

To allow for a detailed evaluation, we introduced the Lost and Found dataset, comprising over 2000 pixel-wise annotated stereo frames in a wide range of locations and road conditions and with generic objects of various types and sizes on the road. Based on the presented performance metrics, the experimental evaluation clearly demonstrates the efficacy of the proposed methods.

Future work may include a fusion of the considered geometric methods with an appearance-based approach using convolutional neural networks, where the Lost and Found dataset can be employed for training.

REFERENCES

[1] American Automobile Association, "25000 crashes a year due to road debris," AAA, Tech. Rep., 2004. URL: https://www.aaafoundation.org/sites/default/files/Crashes.pdf
[2] NHTSA, "Traffic safety facts 2011," NHTSA, Tech. Rep., 2011, p. 72. URL: http://www-nrd.nhtsa.dot.gov/Pubs/811754AR.pdf
[3] H. H. Schwabach, M. Harrer, A. Waltl, H. Bischof, G. Tacke, G. Zoffmann, C. Beleznai, B. Strobl, H. Grabner, and G. Fernandez, "VITUS: Video Image Analysis for Tunnel Safety," in 3rd International Conference on Tunnel Safety and Ventilation, 2006.
[4] P. Pinggera, U. Franke, and R. Mester, "High-Performance Long Range Obstacle Detection Using Stereo Vision," in IROS, 2015.
[5] D. Pfeiffer and U. Franke, "Towards a Global Optimal Multi-Layer Stixel Representation of Dense 3D Data," in BMVC, 2011.
[6] Z. Zhang, R. Weiss, and A. Hanson, "Obstacle Detection Based on Qualitative and Quantitative 3D Reconstruction," TPAMI, vol. 19, no. 1, pp. 15–26, 1997.
[7] M. Lourakis and S. Orphanoudakis, "Visual Detection of Obstacles Assuming a Locally Planar Ground," in ACCV, 1998.
[8] S. Nedevschi, R. Schmidt, R. Danescu, D. Frentiu, T. Marita, T. Graf, F. Oniga, and C. Pocol, "High Accuracy Stereo Vision System for Far Distance Obstacle Detection," in IV, 2004.
[9] H. S. Sawhney, "3D Geometry from Planar Parallax," in CVPR, 1994.
[10] S. Kramm and A. Bensrhair, "Obstacle Detection Using Sparse Stereovision and Clustering Techniques," in IV, 2012.
[11] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real Time Obstacle Detection in Stereovision on Non Flat Road Geometry Through 'V-disparity' Representation," in IV, 2002.
[12] S. Nedevschi, R. Danescu, D. Frentiu, T. Marita, F. Oniga, C. Pocol, T. Graf, and R. Schmidt, "High Accuracy Stereovision Approach for Obstacle Detection on Non-Planar Roads," in INES, 2004.
[13] A. Wedel, H. Badino, C. Rabe, H. Loose, U. Franke, and D. Cremers, "B-Spline Modeling of Road Surfaces with an Application to Free Space Estimation," TITS, 2009.
[14] A. Harakeh, D. Asmar, and E. Shammas, "Ground Segmentation and Occupancy Grid Generation Using Probability Fields," in IROS, 2015.
[15] N. Bernini, M. Bertozzi, L. Castangia, M. Patander, and M. Sabbatelli, "Real-Time Obstacle Detection using Stereo Vision for Autonomous Ground Vehicles: A Survey," in ITSC, 2014.
[16] F. Oniga and S. Nedevschi, "Processing Dense Stereo Data Using Elevation Maps: Road Surface, Traffic Isle and Obstacle Detection," Vehicular Technology, 2010.
[17] R. Manduchi, A. Castano, A. Talukder, and L. Matthies, "Obstacle Detection and Terrain Classification for Autonomous Off-Road Navigation," Autonomous Robots, vol. 18, pp. 81–102, 2005.
[18] A. Broggi, M. Buzzoni, M. Felisa, and P. Zani, "Stereo Obstacle Detection in Challenging Environments: The VIAC Experience," in IROS, 2011.
[19] C. Creusot and A. Munawar, "Real-time Small Obstacle Detection on Highways using Compressive RBM Road Reconstruction," in IV, 2015.
[20] T. Scharwächter and U. Franke, "Low-Level Fusion of Color, Texture and Depth for Robust Road Scene Understanding," in IV, 2015.
[21] R. Hadsell, A. Erkan, P. Sermanet, and M. Scoffier, "Deep Belief Net Learning in a Long-Range Vision System for Autonomous Off-Road Driving," in IROS, 2008.
[22] H. Lu, L. Jiang, and A. Zell, "Long Range Traversable Region Detection Based on Superpixels Clustering for Mobile Robots," in IROS, 2015.
[23] C. Kanzow, N. Yamashita, and M. Fukushima, "Levenberg-Marquardt Methods with Strong Local Convergence Properties for Solving Nonlinear Equations with Convex Constraints," J. of Computational and Applied Mathematics, vol. 172, pp. 375–397, 2004.
[24] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in KDD, 1996.
[25] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," in ACM SIGMOD, 1984, pp. 47–57.
[26] S. Leutenegger, M. Lopez, and J. Edgington, "STR: A Simple and Efficient Algorithm for R-tree Packing," in International Conference on Data Engineering, 1997.
[27] M. Cordts, L. Schneider, U. Franke, and S. Roth, "Object-level Priors for Stixel Generation," in GCPR, 2014.
[28] S. Gehrig, R. Stalder, and N. Schneider, "A Flexible High-Resolution Real-Time Low-Power Stereo Vision Engine," in ICVS, 2015.

[Fig. 7. Qualitative results of the evaluated methods on three scenes. Inputs (left to right): bobby car; car front bumper and cardboard box; tire and square timber. The top two rows show the left input image and the ground truth annotation; the lower rows show pixel-wise and mid-level detections (PC, FPHT (downsampled), FPHT, Stixels, PC-CStix, FPHT-CStix (downsampled), FPHT-CStix) as overlays, color-coded by distance (red: near, green: far).]