Omnidirectional stereo vision using fisheye lenses
Marius Drulea, Istvan Szakats, Andrei Vatavu, Sergiu Nedevschi
Technical University of Cluj-Napoca, Romania, Computer Science Department
{marius.drulea, istvan.szakats, andrei.vatavu, sergiu.nedevschi}@cs.utcluj.ro
Abstract—This work presents an omnidirectional stereo
system with a direct application to autonomous logistics. The
omnidirectional system uses fisheye lenses to inspect the
surroundings of the automated forklift. The lenses are mounted on
the top of the vehicle to allow a 360° scene reconstruction with a
single stereo pair. The system provides the list of obstacles detected
around the vehicle. The reconstruction of the scene is possible via
a division of the fisheye images into several rectified images. A
stereo matching algorithm is applied to each pair of rectified
images. The computed 3D points corresponding to the rectified
pairs are unified into a single cloud. The obstacle detection module
operates on the unified cloud of 3D points.
Keywords — omnidirectional stereovision, fisheye lens, multi-channel rectification, stereo reconstruction
I. INTRODUCTION
The PAN-Robots project [1] proposes a new generation of autonomous guided vehicles (AGVs) for autonomous logistics factories. The vehicles are equipped with advanced on-board sensors to increase flexibility and safety. The on-board sensors provide two functionalities: autonomous navigation and collision avoidance.
In order to detect possible collisions, the sensors detect the obstacles around the vehicle. In the state of the art, this is achieved using laser-scanners. They perform a single horizontal scan of the scene and report all the obstacles that hit the scan plane. The system fuses the information from multiple laser-scanners to provide a complete 360° detection around the AGV. The scanning plane hits the majority of the obstacles, but not all of them. Hanging/protruding objects are exceptions, and the AGVs might collide with such obstacles. Objects with irregular shapes might also create undesired situations; a worker carrying a long ladder is such an example. Humans are more difficult to detect when they are partially occluded or when they are very close to other objects.
The PAN-Robots project proposes a novel sensor: a stereo system consisting of two high-resolution cameras and two fisheye lenses. The stereo sensor is mounted on the top of the AGV and is oriented towards the ground. Figure 1 depicts this setup. In this manner, the stereo sensor is able to monitor the entire area surrounding the AGV. The fisheye lenses play an important role in this sensor. If we used conventional rectilinear lenses, we could only process a restricted area around the AGV, and several conventional stereo sensors would have to be used to monitor the AGV's surroundings. With the proposed stereo sensor, this is no longer necessary: a single sensor suffices for
this task. The fisheye stereo sensor has to detect, track and classify all the objects around the AGV. Moreover, it should detect hanging, protruding objects and obstacles with irregular shapes. The AGV system fuses the list of detected objects by the stereo sensor with the list of objects detected by the laser-scanners.
Up to the current project phase, we have implemented modules for synchronous image acquisition, calibration, rectification, stereo-reconstruction and obstacle detection. These modules are the subject of this paper. The tracking and classification modules will be implemented in the future phases of the project.
The rest of the paper is organized as follows: section II presents the stereo sensor requirements and section III presents an overview of our system. A synchronous image acquisition module for Gigabit Ethernet cameras is described in section IV. Sections V to VIII present the image processing components of the system, from stereo calibration and rectification to the object detection module. Section IX presents the experimental results of the system.
II. SENSOR REQUIREMENTS
A. Range
For the derivation of the range, the following worst-case scenario is considered. The AGV operates at full speed and is approaching an object. At the same time, the object moves towards the AGV at maximum speed. The AGV velocity is 2 m/s. The maximum time for detection, tracking and classification is one second. Once the obstacle is detected, the AGV must have enough space and time to brake and stop to avoid the collision. The AGV is able to brake at 1.0 m/s² (hard braking) and at 0.5 m/s² (soft braking). Table I presents the required range for different obstacles. According to the table, a running person (2 m/s) is safe if the detection range is at least 10 m and hard braking is involved. Therefore, 10 m is the minimum acceptable range of the stereo sensor. For a complete derivation of these values, see [2].
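The ranges in Table I can be reproduced with a short calculation: during the one-second detection interval the AGV and the object close the gap together, and the object keeps moving while the AGV brakes to a stop. A sketch in Python (the full derivation is in [2]):

```python
def required_range(v_agv, v_obj, a_brake, t_detect=1.0):
    """Worst-case detection range for a head-on approach.

    v_agv, v_obj: speeds in m/s; a_brake: AGV deceleration in m/s^2;
    t_detect: time budget for detection, tracking and classification.
    """
    t_brake = v_agv / a_brake                # time for the AGV to stop
    d_detect = (v_agv + v_obj) * t_detect    # gap closed before braking starts
    d_agv = v_agv ** 2 / (2.0 * a_brake)     # AGV braking distance
    d_obj = v_obj * t_brake                  # object travel while the AGV brakes
    return d_detect + d_agv + d_obj

# Reproducing two entries of Table I:
print(required_range(2.0, 2.0, 1.0))  # fast person, hard braking: 10.0
print(required_range(2.0, 5.0, 0.5))  # forklift, soft braking: 31.0
```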
B. Field of view
The field of view depends on the required range and on the mounting position. In order to use a single stereo sensor, it has to be mounted on the top of the AGV, as depicted in Figure 1. From this position, the sensor is able to monitor the entire area around the AGV. The maximum height is between 4 m and 4.5 m; this is restricted by the ceiling height. A pole is used to mount the sensor at the desired height. The minimum range is slightly higher than 10 m because the pole is not at the front of the AGV. Finally, the minimum range from the pole is about 11 m.
This work was supported by the research project PAN-Robots. PAN-Robots is funded by the European Commission under the 7th Framework Programme, Grant Agreement no. 314193. The partners of the consortium thank the European Commission for supporting the work of this project.
This range would cover possible modifications of the pole position. For a correct detection, objects of 1 m height should be entirely visible at 11 m from the sensor. With a sensor height of 4 m, this results in a minimum field of view of 150°.
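The 150° figure can be checked directly (a small calculation assuming the sensor looks straight down from the quoted mounting height):

```python
import math

sensor_height = 4.0   # m, mounting height of the stereo sensor
object_height = 1.0   # m, an object of this height must stay fully visible
min_range = 11.0      # m, minimum range measured from the pole

# Half-angle from the vertical axis to the top of a 1 m object at 11 m.
half_angle = math.atan(min_range / (sensor_height - object_height))
fov = 2.0 * math.degrees(half_angle)   # about 149.5 degrees, i.e. a 150 degree requirement
```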
TABLE I. RANGE REQUIREMENTS

                              AGV braking deceleration a_AGV,brake
  Object velocity v_obj       1.0 m/s² (hard braking)   0.5 m/s² (soft braking)
  0 m/s (stationary object)          4.0 m                     6.0 m
  0.8 m/s (slow person)              6.4 m                    10.0 m
  2 m/s (fast person)               10.0 m                    16.0 m
  5 m/s (forklift)                  19.0 m                    31.0 m
C. Data processing requirements
The system has to detect, track and classify objects at 10 frames/second. The complete list of requirements is given below (see [2]):

Object detection
The objects have to be detected with the following accuracies:
- Position: the estimated location of the object with respect to the local sensor coordinate system; accuracy: 3 % of object distance
- Size: the estimated size of the object in terms of width, length and height; accuracy: ±0.1 m at 1 σ
The rate of obstacle detection should exceed 95 %.

Object tracking
Each detected object needs to be tracked with respect to the following properties:
- Velocity: the estimated velocity of the object (scalar value); accuracy: ±0.5 m/s at 1 σ
- Heading: the estimated direction of motion with respect to the local sensor coordinate system; accuracy: ±15° at 1 σ
- Age: the number of successive scans an object has been tracked

Object classification
The recognized objects need to be classified as one of the following: pedestrian, AGV, manual forklift, electric tricycle, other object. The class should be supplemented by a classification quality indicating the confidence in the classification decision. The rate of correct classifications should exceed 90 %.
III. SYSTEM OVERVIEW
The architecture of the system is given in Figure 2. The
sensor consists of two Gigabit Ethernet Manta G-419 B NIR
cameras and two fisheye lenses. The system uses a special pair
of fisheye lenses [6], tailored to the requirements of the PAN-
Robots project. The system is compatible with any fisheye lens,
but the one described in [6] provides better specifications in the
context of the project. The baseline between the cameras is 20
cm. We use an Arduino Uno microcontroller to synchronously
trigger the cameras at a frequency of 10 Hz. Although the
trigger is simultaneous for the cameras, the computer receives
the frames asynchronously and this might lead to an incorrect
pairing of the frames. Section IV presents a method that
correctly pairs the received frames. Once we receive the left and
right fisheye images, we split them into three pairs of rectified
images. The rectification module uses the intrinsic and extrinsic
parameters of the cameras. The calibration module computes these parameters offline.
Figure 1. The mounting position, range and field of view of the fisheye stereo sensor
Section V provides details about the
calibration model and details the rectification module. The
stereo reconstruction algorithm presented in section VI
computes the 3D locations of the image points. An NVIDIA GeForce GTX 780 Ti carries out the computationally intensive operations of the 3D reconstruction algorithm. The computed
clouds of points and the pairs of images are the inputs of the
obstacle detection, tracking and classification module. In the
current project phase, only the obstacle detection module is
implemented. Sections VII and VIII present a description of this
module. This module exports the list of detected obstacles from
the data processing layer to the application layer. The
application layer uses the obstacles for collision avoidance and
local-path planning [3], [19], [20].
IV. SYNCHRONOUS IMAGE ACQUISITION
Industrial cameras for machine vision are available with many different interfaces: CameraLink, FireWire, Gigabit Ethernet (GigE) and USB 3.0. Long, flexible cables are required within the PAN-Robots project. FireWire/USB (with maximum lengths of only 4.5/5.0 m) and CameraLink (with longer but thicker, more rigid cables) did not suffice, so the GigE interface was chosen.
interface was chosen. Allied Vision Manta 419-B NIR cameras
were used for the stereo system. While GigE cameras can be synchronized using the Precision Time Protocol (PTP), this requires an industrial Ethernet switch in addition to the Gigabit Ethernet boards. Instead of PTP, a common external trigger was supplied to both cameras for starting the exposure.
A trigger signal was generated periodically by an Arduino
board. Using the output pins on the cameras, the time difference between the start of the exposure of the two images was measured with an oscilloscope to be 1 microsecond. The
synchronous trigger of the cameras ensures that the acquisition
starts at the same time for both the cameras. However, the
transmission of the frames from the cameras to the computer
might lead to errors. Some frames (left, right or both) might be dropped due to transmission errors. Usually, the right image arrives after the left image or vice versa; sometimes two left images arrive and then two right images. Without an acquisition board and
without PTP to synchronize the clocks of the two cameras, the
following method was implemented in software to match the
left/right images of the stereo pair. At acquisition start the
clocks of both cameras were reset and so the first frame pair had
to arrive with a timestamp difference of only a few
milliseconds. However, due to clock skew between the cameras, this difference grew over time.
Figure 2. The architecture of the omnidirectional stereo system
To match the stereo pair, the timestamp difference was instead required not to change by more than 1 millisecond between successive matched frames. Another complication was that frames were sometimes missing from one or the other camera, or they might arrive out
of order. When the Ethernet boards were properly configured this did not happen frequently (for example, only when the system was too busy to receive the Ethernet packets), but to make the acquisition robust to such events, the software had to search a few previously received and unmatched frames to find the stereo pairs. In some rare cases the frames would also stop coming indefinitely, so a watchdog thread was used to reset the acquisition after a timeout. For maximum performance, a zero-
copy solution was implemented which passed the buffers in
which the images first arrived directly to the rest of the
application, did not reuse those buffers until the application no
longer needed them and supplied additional frame buffers to
the acquisition driver as necessary to maintain maximum frame
rate.
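The pairing logic described above can be sketched as follows (a simplified illustration, not the project code: the class name, the buffer depth and the 1 ms tolerance on the timestamp offset are our assumptions based on the text):

```python
from collections import deque

TOLERANCE = 0.001  # s, allowed drift of the L/R timestamp offset between pairs

class FramePairer:
    """Match left/right frames whose timestamp offset drifts slowly."""

    def __init__(self):
        self.left = deque(maxlen=8)    # a few recent unmatched left timestamps
        self.right = deque(maxlen=8)   # a few recent unmatched right timestamps
        self.offset = None             # last accepted (t_left - t_right)

    def add(self, side, t):
        """Register a frame timestamp; return a matched (t_left, t_right) or None."""
        (self.left if side == 'L' else self.right).append(t)
        for tl in list(self.left):
            for tr in list(self.right):
                off = tl - tr
                # The camera clocks are reset at acquisition start, so the
                # first left/right combination is trusted as the initial pair.
                if self.offset is None or abs(off - self.offset) <= TOLERANCE:
                    self.offset = off
                    self.left.remove(tl)
                    self.right.remove(tr)
                    return (tl, tr)
        return None
```

Frames that never find a partner simply age out of the bounded deques, which mimics searching only a few previously received, unmatched frames.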
V. CAMERA CALIBRATION AND MULTI-CHANNEL RECTIFICATION
A. Fisheye lens description and calibration
The fisheye lens is an ultra-wide lens and can provide fields of view up to 185°, both horizontally and vertically. The images obtained with such a lens are highly distorted. The most popular are the equiangular fisheye lenses: the angular resolution provided by the lens is constant. They are characterized by the equality
𝑟 = 𝑓 ⋅ 𝛼 (1)
where 𝑓 is the focal length of the lens, 𝛼 is the incoming ray
angle and 𝑟 is the projection of the ray on the imaging plane.
See figure 3. This relationship is a theoretical one. In reality,
there are deviations from this equality due to imperfections in
lens production and assembly.
For the calibration of the lens, we use the toolbox described
in [4]. Kannala [4] derived a projection function that can cope
with any type of ultra-wide lens. This model is quite complex
and it depends on 23 parameters. It is briefly described below.
It begins with a modification of the theoretical raw projection
function:
𝑟(𝛼) = 𝑘1𝛼 + 𝑘2𝛼³ + 𝑘3𝛼⁵ + 𝑘4𝛼⁷ + 𝑘5𝛼⁹ (2)
This polynomial representation of the raw projection function
theoretically fits to all ultra-wide lenses. The lenses also have
distortions and therefore, the theoretical raw projection function
does not suffice to characterize the lens. [4] also uses a general
distortion profile which is given below. The radial distortion is
characterized by the following formula:
Δ𝑟(𝛼, 𝜑) = (𝑙1𝛼 + 𝑙2𝛼³ + 𝑙3𝛼⁵)(𝑖1 cos(𝜑) + 𝑖2 sin(𝜑) + 𝑖3 cos(2𝜑) + 𝑖4 sin(2𝜑)) (3)
Similarly, the tangential distortion is given by:
Δ𝑡(𝛼, 𝜑) = (𝑚1𝛼 + 𝑚2𝛼³ + 𝑚3𝛼⁵)(𝑗1 cos(𝜑) + 𝑗2 sin(𝜑) + 𝑗3 cos(2𝜑) + 𝑗4 sin(2𝜑)) (4)
Including the distortions leads us to the following projection
function:
(𝑥, 𝑦)ᵀ = 𝐹𝑚𝑚(𝑋, 𝑌, 𝑍) = 𝑟(𝛼) ⋅ (cos(𝜑), sin(𝜑))ᵀ + Δ𝑟(𝛼, 𝜑) ⋅ (cos(𝜑), sin(𝜑))ᵀ + Δ𝑡(𝛼, 𝜑) ⋅ (−sin(𝜑), cos(𝜑))ᵀ (5)
In this equality, (𝛼, 𝜑) are the angles of the incoming ray from
the point 𝑃(𝑋, 𝑌, 𝑍) and can be easily computed from the
coordinates (𝑋, 𝑌, 𝑍). The coordinates of 𝑃 are given in mm.
𝑝(𝑥, 𝑦) is the projection of the point 𝑃 onto the imager plane.
Note that the coordinates (𝑥, 𝑦) are still in mm. There remains
to transform these coordinates into pixel coordinates. Let
(𝑢0, 𝑣0) be the principal point and let 𝑡𝑢 and 𝑡𝑣 be the number
of pixels per unit mm in horizontal and vertical direction
respectively. The final projection function is:
(𝑢, 𝑣)ᵀ = 𝐹(𝑋, 𝑌, 𝑍) = diag(𝑡𝑢, 𝑡𝑣) ⋅ (𝑥, 𝑦)ᵀ + (𝑢0, 𝑣0)ᵀ (6)
The projection function has 23 intrinsic parameters (𝑘1−5, 𝑡𝑢, 𝑡𝑣, 𝑢0, 𝑣0, 𝑙1−3, 𝑖1−4, 𝑚1−3, 𝑗1−4). We use the calibration procedure described in [4] and a checkerboard pattern to calculate these parameters. The calibration involves a non-linear optimization. Interested readers can find details about the calibration in [4].
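For concreteness, the projection chain of equations (2)-(6) can be sketched in Python (our own illustrative code; the parameter values used below are placeholders, not a real calibration):

```python
import numpy as np

def project(P, k, l, i, m, j, tu, tv, u0, v0):
    """Project a 3D point (mm) to pixel coordinates with the model of [4].

    k: 5 raw-projection coefficients; (l, i) radial and (m, j) tangential
    distortion coefficients; (tu, tv, u0, v0) the pixel-conversion part.
    """
    X, Y, Z = P
    alpha = np.arctan2(np.hypot(X, Y), Z)   # incoming ray angle
    phi = np.arctan2(Y, X)                  # incoming ray azimuth

    # Eq. (2): odd polynomial raw projection r(alpha)
    r = sum(c * alpha ** p for c, p in zip(k, (1, 3, 5, 7, 9)))
    # Eqs. (3)-(4): radial and tangential distortion profiles
    trig = (np.cos(phi), np.sin(phi), np.cos(2 * phi), np.sin(2 * phi))
    dr = sum(c * alpha ** p for c, p in zip(l, (1, 3, 5))) * \
        sum(c * t for c, t in zip(i, trig))
    dt = sum(c * alpha ** p for c, p in zip(m, (1, 3, 5))) * \
        sum(c * t for c, t in zip(j, trig))

    # Eq. (5): position on the imager plane, still in mm
    x = (r + dr) * np.cos(phi) - dt * np.sin(phi)
    y = (r + dr) * np.sin(phi) + dt * np.cos(phi)
    # Eq. (6): mm to pixel coordinates
    return tu * x + u0, tv * y + v0
```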
Figure 3. The projection model of the fisheye lens. The front,
central and back virtual imagers used for rectification
Beside the intrinsic parameters, the calibration procedure also
provides the 3D coordinates of the checkerboard points,
denoted as 𝐶𝐾𝐿 and 𝐶𝐾𝑅, in the left and right camera
coordinates respectively. This helps in computing the rotation 𝑅 and translation 𝑇 of the right camera w.r.t. the left camera.
The left and right coordinates are related by the following
formula:
𝐶𝐾𝐿 = 𝑅 ⋅ 𝐶𝐾𝑅 + 𝑇 (7)
The rotation and the translation can be easily computed using
the least squares method. In order to reach a canonical system
and to generate rectified pairs of images, we find 𝑅𝐿, 𝑅𝑅 and 𝑡 = [‖𝑇‖, 0, 0]ᵀ. This technique can be found in [5].
𝑅𝐿 ⋅ 𝐶𝐾𝐿 = 𝑅𝑅 ⋅ 𝐶𝐾𝑅 + 𝑡 (8)
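The least-squares estimation of R and T from the corresponding checkerboard points can be sketched with the standard SVD-based (Kabsch) solution; this is one common way to solve eq. (7), not necessarily the exact procedure of [5]:

```python
import numpy as np

def rigid_transform(ck_r, ck_l):
    """Least-squares R, T such that ck_l ~ R @ ck_r + T (eq. 7).

    ck_r, ck_l: 3xN arrays of corresponding checkerboard corners in the
    right and left camera coordinate systems.
    """
    cr = ck_r.mean(axis=1, keepdims=True)      # centroids
    cl = ck_l.mean(axis=1, keepdims=True)
    H = (ck_r - cr) @ (ck_l - cl).T            # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflection
    R = Vt.T @ D @ U.T
    T = cl - R @ cr
    return R, T
```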
Figure 4. The division of the fisheye image into three channels. First
column: the original left fisheye image and its front, central and
back rectified images. Second column: the original right fisheye
image and the disparities for each left-right pair of rectified images
B. Multi-channel rectification
The rectification is a key operation in our system. Using this
operation, the original highly distorted fisheye image is divided
into three perspective images. We apply this division for both
left and right fisheye images. We then apply a stereo
reconstruction algorithm on the left-right pairs of rectified
images. We generate a single point cloud from all the 3D points
generated by stereo reconstruction. The obstacle detection
module operates on this cloud of 3D points.
In order to generate the rectified images, we create front, central and back virtual pinhole imagers, as depicted in Figure 3. The imagers are in fact rectangles in the 3D space relative to the objective optical center. Let's say we want to obtain a rectified image 𝐼𝑓𝑟𝑜𝑛𝑡𝐿 of 𝑁 × 𝑀 pixels from the front imager of the left camera. We denote the corners of the front imager with 𝑉1, 𝑉2, 𝑉3 and 𝑉4. Then we split the front imager into a uniformly sampled grid of size 𝑁 × 𝑀. The point 𝑃𝑖,𝑗(𝑋, 𝑌, 𝑍) at location (𝑖, 𝑗) is a linear combination of the points 𝑉1, 𝑉2, 𝑉3, 𝑉4.
𝐴𝑖 = ((𝑁 − 𝑖) ⋅ 𝑉1 + 𝑖 ⋅ 𝑉2)/𝑁
𝐵𝑖 = ((𝑁 − 𝑖) ⋅ 𝑉3 + 𝑖 ⋅ 𝑉4)/𝑁
𝑃𝑖,𝑗 = ((𝑀 − 𝑗) ⋅ 𝐴𝑖 + 𝑗 ⋅ 𝐵𝑖)/𝑀 (9)
The intensity of the pixel (𝑖, 𝑗) in the rectified image is
recovered by means of the projection function 𝐹. We also take
into account the rotations 𝑅𝐿 and 𝑅𝑅 which transform the
fisheye system into a canonical system. We have the following
relationships:
𝐼𝑓𝑟𝑜𝑛𝑡𝑋(𝑖, 𝑗) = 𝐼𝑓𝑖𝑠ℎ𝑒𝑦𝑒(𝐹(𝑅𝑋ᵀ ⋅ 𝑃𝑖,𝑗)), 𝑋 ∈ {𝐿, 𝑅} (10)
𝐼𝑓𝑖𝑠ℎ𝑒𝑦𝑒 represents the original fisheye image. In a similar manner we calculate the rectified images 𝐼𝑐𝑒𝑛𝑡𝑟𝑎𝑙𝑋 and 𝐼𝑏𝑎𝑐𝑘𝑋, 𝑋 ∈ {𝐿, 𝑅}. This results in three rectified left images and three rectified right images. Examples of such images are depicted in Figure 4.
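Equations (9)-(10) amount to sampling the fisheye image over the virtual imager grid. A minimal sketch (nearest-neighbour lookup and no bounds checking, for brevity; `proj` stands for the calibrated projection function F):

```python
import numpy as np

def rectify(fisheye, V1, V2, V3, V4, R, proj, N, M):
    """Build an N x M rectified image from a virtual imager (eqs. 9-10).

    V1..V4: 3D corners of the imager rectangle; R: the rectifying rotation;
    proj: maps a 3D point to (u, v) pixel coordinates in the fisheye image.
    """
    out = np.zeros((M, N), dtype=fisheye.dtype)
    for i in range(N):
        # Eq. (9): interpolate along the two edges of the imager rectangle
        A = ((N - i) * V1 + i * V2) / N
        B = ((N - i) * V3 + i * V4) / N
        for j in range(M):
            P = ((M - j) * A + j * B) / M        # grid point on the imager
            u, v = proj(R.T @ P)                 # eq. (10): back to the fisheye
            out[j, i] = fisheye[int(round(v)), int(round(u))]
    return out
```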
VI. STEREO RECONSTRUCTION ALGORITHM
We apply the stereo reconstruction algorithm described in [18] on each left-right rectified pair. An example of disparity is
shown in Figure 4. Once the disparity of each channel (front,
central, back) is available, we calculate the 3D positions of the
pixels for all the channels. We take into account the rotation of each virtual imager w.r.t. the physical imager: in our rectification, 𝑅𝑓𝑟𝑜𝑛𝑡 = 𝑟𝑜𝑡(90°), 𝑅𝑐𝑒𝑛𝑡𝑟𝑎𝑙 = 𝐼3 and 𝑅𝑏𝑎𝑐𝑘 = 𝑟𝑜𝑡(−90°). All the computed 3D points reside in the AGV's coordinate system.
The stereo matching algorithm [18] operates as follows. A sparse census transform of the images is computed in a 9×9 window. The left and right pixels are compared using the Hamming distance and the results are aggregated over a 5×5 window. Once the cost volume for all disparities is computed, semi-global matching (SGM) is applied to minimize an energy function which assigns an adaptive penalty for large disparity discontinuities, depending on the image gradient at each pixel. A winner-takes-all (WTA) strategy with confidence filtering is employed for each pixel in the left image to find the best matching pixel in the right image and vice versa. Left-right consistency checking is performed to filter errors at occlusions. Finally, to estimate depth with sub-pixel accuracy, an interpolation function adapted to this particular stereo algorithm is used [23]. To achieve real-time performance, the algorithm was implemented in CUDA and runs on modern NVIDIA GPUs.
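The first two steps of the matching algorithm (sparse census and Hamming cost) can be illustrated with a CPU sketch (wrap-around borders via np.roll, for brevity; the actual implementation of [18] is in CUDA):

```python
import numpy as np

def sparse_census(img):
    """Sparse census transform over a 9x9 window (every 2nd pixel, 24 bits).

    Each pixel receives a bit string recording which of the sampled
    neighbours are darker than it.
    """
    code = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-4, 5, 2):
        for dx in range(-4, 5, 2):
            if dy == 0 and dx == 0:
                continue
            neigh = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            code = (code << np.uint64(1)) | (neigh < img).astype(np.uint64)
    return code

def hamming(a, b):
    """Per-pixel Hamming distance between two census images (the raw cost)."""
    x = np.bitwise_xor(a, b)
    dist = np.zeros(x.shape, dtype=np.uint64)
    while np.any(x):            # popcount of the XOR-ed codes
        dist += x & np.uint64(1)
        x >>= np.uint64(1)
    return dist
```

In the full pipeline this cost is aggregated over 5×5 windows for every candidate disparity, and SGM then smooths the resulting cost volume.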
VII. INTERMEDIATE ENVIRONMENT REPRESENTATION WITH DIGITAL ELEVATION MAPS
In this section we propose a spatial modeling solution by mapping the dense stereo data into a more compact representation. The dense stereo data consists of all the 3D points from all the rectified pairs of images. The proposed method can be decomposed into three main steps. In the first step, each 3D point is mapped into the digital elevation map (DEM) grid by scaling its lateral X and longitudinal Z coordinates. The number of points that fall in the same cell denotes a density value. Next, the raw density values are used to compute a density map by incorporating stereo uncertainties. The second step consists in estimating the ground plane model by using a RANSAC approach and a least-squares refinement. In the last step, we use the estimated ground plane model and point densities to classify the measurements as ground cells or obstacle cells. Next, we detail each of these steps.
A. Mapping 3D points into DEM
The Digital Elevation Map space is represented by a 2D grid consisting of discrete 10 cm × 10 cm cells. Each cell m_xz is described by its position (x, z), density value d_xz, height h_xz and type c_xz (object or ground):
m_xz = (x, z, h_xz, d_xz, c_xz) (11)
where c_xz ∈ {Object, Ground}. For each 3D point P(X, Y, Z), the
corresponding grid cell (x, z) position is computed by scaling down its lateral X and longitudinal Z coordinates. As a result of the mapping step, each grid cell will store the number of associated 3D points Nxz, a minimum height value Ymin and a maximum height value Ymax. The minimum height values Ymin are used as the point hypotheses for the ground-plane model computation, while the maximum height values Ymax describe the elevation of the associated grid cell. However, in reality, due to errors introduced by the sensorial system, especially in the matching step, the reconstructed 3D information may be inaccurate. Subsequently, the inaccurate 3D data may also produce erroneous grid mapping results. Therefore, we need first to define an error model, and then to estimate the DEM density values by taking into account the introduced stereo uncertainties. An example of the obtained density map is shown in Figure 5.
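The mapping step can be sketched as follows (the 10 cm cell size is from the text; the area-of-interest bounds are illustrative assumptions):

```python
import numpy as np

CELL = 0.10  # m, DEM cell size

def build_dem(points, x_range=(-5.0, 5.0), z_range=(-8.0, 11.0)):
    """Map (N, 3) points (X, Y, Z) into DEM cells.

    Returns per-cell point count (the raw density), Ymin (ground-plane
    hypotheses) and Ymax (cell elevation).
    """
    nx = int((x_range[1] - x_range[0]) / CELL)
    nz = int((z_range[1] - z_range[0]) / CELL)
    count = np.zeros((nx, nz), dtype=int)
    ymin = np.full((nx, nz), np.inf)
    ymax = np.full((nx, nz), -np.inf)
    for X, Y, Z in points:
        x = int((X - x_range[0]) / CELL)   # scaled lateral coordinate
        z = int((Z - z_range[0]) / CELL)   # scaled longitudinal coordinate
        if 0 <= x < nx and 0 <= z < nz:    # keep only the area of interest
            count[x, z] += 1
            ymin[x, z] = min(ymin[x, z], Y)
            ymax[x, z] = max(ymax[x, z], Y)
    return count, ymin, ymax
```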
Figure 5. Computing the DEM density values. a) The left camera image. b) The raw density values obtained by counting the number Nxz of associated 3D points for each DEM cell m_xz. c) The error standard deviation values expressed as gray intensities; the error values increase quadratically with the depth along the optical axis. d) The densities obtained by taking into account the probability distributions of all neighbor measurements.
B. Ground Plane Model Estimation
In order to detect the ground plane we use a RANSAC approach combined with a Least-Squares refinement step. Instead of including the whole 3D point cloud as the input to the algorithm, we select a smaller subset, by considering the 3D points with the minimum height values Ymin. These values are computed for each DEM cell in the mapping phase. Each RANSAC iteration is described by the following main steps:
Step 1: Sampling. This step consists in selecting a random subset of three non-collinear points. The three points represent the hypothetical inliers and are used to calculate the parameters of the corresponding plane model.
Step 2: Estimate the consensus set. The distance from all 3D observations to the candidate plane model is computed. The points with a distance less than a given threshold are selected as inliers. These points form the consensus set.
Step 3: Testing. In this step the obtained number of inliers is compared with the number of inliers from the previous iteration. The consensus set with a better score is saved.
Figure 6. a) Left grayscale image. b) The Elevation Map is projected on the
left image. The points are classified as Ground (blue) or Object (red). c) The
3D view representation including the 3D points, the Elevation Map, the detected ground surface represented as a white grid and the detected objects represented
as oriented 3D boxes.
Having the resulting consensus set {p_i(x_i, y_i, z_i) | i = 1..N}, we estimate a refined plane model by applying a least-squares minimization. An example with the estimated ground surface is illustrated in Figure 6.
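The three RANSAC steps, the least-squares refinement, and the point-to-plane classification used in the next subsection can be sketched as follows (the iteration count and thresholds are our assumptions):

```python
import numpy as np

def ransac_plane(pts, iters=200, thresh=0.05, rng=None):
    """Fit the ground plane to (N, 3) points: RANSAC + least-squares refinement."""
    rng = rng or np.random.default_rng(0)
    best = None
    for _ in range(iters):
        # Step 1: sample three points and form a candidate plane
        p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:       # collinear sample, skip
            continue
        n /= np.linalg.norm(n)
        # Step 2: consensus set = points within thresh of the plane
        inliers = np.abs((pts - p0) @ n) < thresh
        # Step 3: keep the largest consensus set
        if best is None or inliers.sum() > best.sum():
            best = inliers
    # Refinement: plane through the inlier centroid, with the normal given
    # by the smallest singular vector of the centred inliers (least squares).
    q = pts[best]
    c = q.mean(axis=0)
    n = np.linalg.svd(q - c)[2][-1]
    return n, c   # plane: n . (p - c) = 0

def classify(pts, n, c, thresh=0.1):
    """Label points Ground/Object by their distance to the plane."""
    return np.where(np.abs((pts - c) @ n) < thresh, 'Ground', 'Object')
```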
C. 3D Point Classification
Having the estimated ground plane, we classify all 3D
points {p_i(x_i, y_i, z_i) | i = 1..N} as Objects or Ground. The points with a distance to the ground plane smaller than a given threshold are classified as Ground; otherwise they are classified as Objects. The same classification method is performed at the DEM cell level. Each DEM cell is classified as belonging either to the ground surface or to an obstacle based on the point-to-plane distance.
VIII. EXTRACTING OBSTACLES FROM DEM INFORMATION
Obstacle Detection by Grouping DEM Cells
This method consists in grouping the DEM cells into individual clusters (blobs). The grouping technique is performed on the 2D grid, so that the connected entities are determined based on a proximity criterion. Each individual blob is described by an oriented bounding box (see Figure 6), which can be used as primary information for other subsequent processing tasks such as collision detection, tracking or path planning.
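The grouping can be sketched as a flood fill over the obstacle cells (a simplified illustration; the exact proximity criterion is an assumption, and fitting the oriented bounding box to each blob is a separate step):

```python
from collections import deque

def group_cells(obstacle, max_gap=1):
    """Group obstacle DEM cells into blobs by proximity.

    obstacle: 2D grid of 0/1 values; cells within max_gap cells of each
    other (Chebyshev distance) join the same blob. Returns a list of
    blobs, each a list of (x, z) cell coordinates.
    """
    cells = {(x, z) for x, row in enumerate(obstacle)
             for z, v in enumerate(row) if v}
    seen, blobs = set(), []
    for start in cells:
        if start in seen:
            continue
        blob, queue = [], deque([start])
        seen.add(start)
        while queue:                        # breadth-first flood fill
            x, z = queue.popleft()
            blob.append((x, z))
            for dx in range(-max_gap, max_gap + 1):
                for dz in range(-max_gap, max_gap + 1):
                    nb = (x + dx, z + dz)
                    if nb in cells and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        blobs.append(blob)
    return blobs
```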
IX. RESULTS AND EVALUATIONS
The preliminary functionality of the system is obstacle detection and localization, and this was considered in this test. A grid of 5×4 global positions was measured and marked on the ground. The positions were measured using the existing AGV localization system, which provides an accuracy of ±1 cm [21], [22]. Pallets and boxes were installed in the selected locations. The size of each obstacle was manually measured. The height of the objects varies from 40 cm to 3 m.
After all the objects were installed, the AGV started to navigate
among the objects. The AGV sends its location to the 3D Vision
ECU at a frequency of 10 Hz.
The software module installed on the 3D Vision ECU detects
the objects in the area of interest and their 3D position. The area
of interest around the AGV is a rectangular region of 19 m x 10
m: 11 m in front of the AGV, 8 in the back and 5 in the left/right
side. Using the current AGV global position, the system
compute the position of the ground truth objects in the AGV’s
local coordinate system. After this, the detected objects are
matched and compared with the ground-truth objects. Figure 7
depicts a sample from the evaluation of the system and Table II shows the average measurement errors. The average detection rate is 76 %. The distance between the correct object position and the estimated position is 22 cm on average. The width, height and length are estimated with errors of 26, 11 and 32 cm, respectively. The cumulated number of ground-truth objects in all the frames is 10684.
TABLE II. SUMMARY OF 3D PERCEPTION SYSTEM EVALUATION

  detection rate            76 %
  mean localization error   22 cm
  mean width error          26 cm
  mean height error         11 cm
  mean length error         32 cm
  ground-truth objects      10684
  missing objects           2551
  number of frames          1418
Figure 7. Sample from the evaluation of the omnidirectional stereo
sensor. Green: detected ground-truth object, cyan: the ground-truth
object as it was detected by the sensor, red: undetected ground-truth
object
X. REFERENCES
[1] http://www.pan-robots.eu/
[2] PAN-Robots – Plug And Navigate Robots for Smart Factories: User Needs and Requirements, 2013.
[3] PAN-Robots – Plug And Navigate Robots for Smart Factories: Specification and Architecture, 2013.
[4] Juho Kannala and Sami S. Brandt. A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, August 2006.
[5] http://www.vision.caltech.edu/bouguetj/calib_doc/
[6] Aki Mäyrä, Mika Aikio, Matti Kumpulainen, “Fisheye optics for omnidirectional perception”, Intelligent Computer Communication and Processing (ICCP), 2014 IEEE International Conference on
[7] Rabe, C.; Franke, U.; Gehrig, S., "Fast detection of moving objects in complex scenarios," Intelligent Vehicles Symposium, 2007 IEEE , vol., no., pp.398,403, 13-15 June 2007
[8] Labayrade, R.; Aubert, D.; Tarel, J. -P, "Real time obstacle detection in stereovision on non flat road geometry through "v-disparity" representation," Intelligent Vehicle Symposium, 2002. IEEE , vol.2, no., pp.646,651 vol.2, 17-21 June 2002
[9] Pfeiffer, D.; Franke, U., "Efficient representation of traffic scenes by means of dynamic stixels," Intelligent Vehicles Symposium (IV), 2010 IEEE , vol., no., pp.217,224, 21-24 June 2010
[10] Thien-Nghia Nguyen; Michaelis, B.; Al-Hamadi, A.; Tornow, M.; Meinecke, M., "Stereo-Camera-Based Urban Environment Perception Using Occupancy Grid and Object Tracking," Intelligent Transportation Systems, IEEE Transactions on , vol.13, no.1, pp.154,165, March 2012
[11] Lategahn, H.; Derendarz, W.; Graf, T.; Kitt, B.; Effertz, J., "Occupancy grid computation from dense stereo and sparse structure and motion points for automotive applications," Intelligent Vehicles Symposium (IV), 2010 IEEE , vol., no., pp.819,824, 21-24 June 2010
[12] F. Oniga, S. Nedevschi, "Processing Dense Stereo Data Using Elevation Maps: Road Surface, Traffic Isle and Obstacle Detection", IEEE Transactions on Vehicular Technologies, 2010.
[13] Alberto Elfes. Sonar-based real-world mapping and navigation. IEEE Journal of Robotics and Automation, RA-3(3):249–265, June 1987.
[14] Gallup, D.; Frahm, J.-M.; Mordohai, P.; Pollefeys, M., "Variable baseline/resolution stereo," Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on , vol., no., pp.1,8, 23-28 June 2008
[15] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint: Mit Press, 1993.
[16] M. Fischler, R. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Commun. ACM 24(6), pp. 381-395,1981.
[17] A. Vatavu, Sergiu Nedevschi, and Florin Oniga, “Real Time Object Delimiters Extraction for Environment Representation in Driving Scenarios”, In Proc. of ICINCO-RA 2009, Milano, Italy, 2009, pp 86-93.
[18] Haller, I.; Nedevschi, S., "GPU optimization of the SGM stereo algorithm," Intelligent Computer Communication and Processing (ICCP), 2010 IEEE International Conference on , vol., no., pp.197,202, 26-28 Aug. 2010
[19] Sabattini, L.; Digani, V.; Secchi, C.; Cotena, G.; Ronzoni, D.; Foppoli, M.; Oleari, F., "Technological roadmap to boost the introduction of AGVs in industrial applications," Intelligent Computer Communication and Processing (ICCP), 2013 IEEE International Conference on , vol., no., pp.203,208, 5-7 Sept. 2013
[20] Digani, V.; Sabattini, L.; Secchi, C.; Fantuzzi, C., "Towards decentralized coordination of multi robot systems in industrial environments: A hierarchical traffic control strategy," Intelligent Computer Communication and Processing (ICCP), 2013 IEEE International Conference on , vol., no., pp.209,215, 5-7 Sept. 2013
[21] Reinke, C.; Beinschob, P., "Strategies for contour-based self-localization in large-scale modern warehouses," Intelligent Computer Communication and Processing (ICCP), 2013 IEEE International Conference on , vol., no., pp.223,227, 5-7 Sept. 2013
[22] Beinschob, P.; Reinke, C., "Strategies for 3D data acquisition and mapping in large-scale modern warehouses," Intelligent Computer Communication and Processing (ICCP), 2013 IEEE International Conference on , vol., no., pp.229,234, 5-7 Sept. 2013
[23] István Haller, Sergiu Nedevschi: Design of Interpolation Functions for Subpixel-Accuracy Stereo-Vision Systems. IEEE Transactions on Image Processing 21(2): 889-898 (2012)