COMPARISON OF OPEN SOURCE STEREO VISION ALGORITHMS
by
CHOUSTOULAKIS EMMANOUIL
Engineer of Applied Informatics and Multimedia
A THESIS
submitted in partial fulfillment of the requirements for the degree
MASTER OF SCIENCE
DEPARTMENT OF INFORMATICS ENGINEERING
SCHOOL OF APPLIED TECHNOLOGY
TECHNOLOGICAL EDUCATIONAL INSTITUTE OF CRETE
2015
Approved by:
Assistant Professor Kosmopoulos Dimitrios
Abstract
Stereo vision is the extraction of 3D information from a pair of images
depicting the same scene viewed from different angles. It occurs in nature in
creatures that possess two eyes. It is also a very active field in Computer Vision,
where the pair of images is digital and is obtained by cameras instead of eyes.
Several methods and algorithms exist to achieve this. This Master's Thesis
presents a theoretical and experimental comparison of a few of them that are
open source and can be found implemented online.
Acknowledgements
I would like to thank Professor Dimitris Kosmopoulos for his assistance in the
completion of this thesis. I would also like to thank my family for their patience and
support during my studies for this MSc degree. Finally, special thanks to the
examining committee and to all fellow stereo vision researchers and engineers whose
work contributed to the completion of this work.
Table of Contents
Abstract ..................................................................................................................................... 2
Acknowledgements ................................................................................................................... 3
Table of Figures ......................................................................................................................... 6
Chapter 1: Introduction and Goals ............................................................................................ 7
Chapter 2: Stereo Vision Basics ................................................................................................. 9
2.1: Overview......................................................................................................................... 9
2.2: History of Stereo Vision ................................................................................................ 10
2.3: Pinhole Camera Model ................................................................................................. 14
2.4: Camera Resectioning/Parameters ................................................................................ 14
2.5: Epipolar geometry ........................................................................................................ 15
2.6: Fundamental matrix ..................................................................................................... 16
2.7: Image rectification ........................................................................................................ 17
2.8: Disparity map ............................................................................................................... 17
2.9: Stereo Matching ........................................................................................................... 18
2.10: Matching Cost............................................................................................................. 20
2.11: Normalized Cross Correlation .................................................................................... 21
2.12: Ground truth .............................................................................................................. 21
2.13: Applications of Stereo Vision ..................................................................................... 21
2.14: Challenges and difficulties .......................................................................................... 24
Chapter 3: Stereo Algorithms Evaluation Process ................................................................... 26
3.1: Previous Work .............................................................................................................. 26
3.2: State-of-the-Art Middlebury Evaluation ...................................................................... 27
3.3: Thesis Evaluation Process ............................................................................................. 28
Chapter 4: Testing and Comparison of Stereo Algorithms ...................................................... 31
4.1: Intro .............................................................................................................................. 31
4.2: Common Algorithm Parameters .................................................................................. 31
4.3: Semi Global (Block) Matching ...................................................................................... 32
4.3.1: Algorithm Overview............................................................................................... 32
4.3.2: Pixelwise Cost Calculation ..................................................................................... 33
4.3.3: Aggregation of Costs ............................................................................................. 34
4.3.4: Disparity Computation .......................................................................................... 35
4.3.5: Implementation Details ......................................................................................... 35
4.3.6: Testing and Results ................................................................................................ 37
4.4: Block matching ............................................................................................................. 37
4.4.1: Algorithm overview ............................................................................................... 37
4.4.2: Algorithm Analysis ................................................................................................. 38
4.4.3: Implementation Details ......................................................................................... 41
4.4.4: Testing and Results ................................................................................................ 42
4.5: Loopy belief propagation ............................................................................................. 42
4.5.1: Overview ................................................................................................................ 42
4.5.2: Markov Random Fields .......................................................................................... 43
4.5.3: MRF Formulation ................................................................................................... 43
4.5.4: DataCost ................................................................................................................ 44
4.5.5: SmoothnessCost .................................................................................................... 45
4.5.6: Loopy Belief Propagation main part ...................................................................... 46
4.5.7: Implementation Details ......................................................................................... 48
4.5.8: Testing and Results ................................................................................................ 49
4.6: Fast stereo matching and disparity estimation ............................................................ 49
4.6.1: Overview ................................................................................................................ 49
4.6.2: Algorithm Analysis ................................................................................................. 50
4.6.3: Implementation Details ......................................................................................... 52
4.6.4: Testing and Results ................................................................................................ 53
4.7: Probability-Based Rendering for View Synthesis ......................................................... 54
4.7.1: Algorithm Overview............................................................................................... 54
4.7.2: SSMP with RWR ..................................................................................................... 55
4.7.3: PBR with SSMP ...................................................................................................... 56
4.7.4: Implementation details ......................................................................................... 58
4.7.5: Testing and Results ................................................................................................ 58
4.8: Results Analysis and Conclusion ................................................................................... 59
Chapter 5: Discussion and future work ................................................................................... 62
References ............................................................................................................................... 65
Table of Figures
Figure 1: Simple stereo vision illustration ................................................................................. 9
Figure 2: Wheatstone's Stereoscope ........................................................................ 10
Figure 3: Brewster's Stereoscope ............................................................................................ 11
Figure 4: A typical ViewMaster device .................................................................................... 12
Figure 5: Nintendo's Virtual Boy .............................................................................................. 13
Figure 6: Pinhole Camera Model ............................................................................................. 14
Figure 7: Epipolar Geometry Illustration .................................................................. 16
Figure 8: A ground truth disparity map .................................................................... 18
Figure 9: Simple window matching illustration ....................................................................... 19
Figure 10: Nintendo's 3DS ....................................................................................................... 22
Figure 11: The NASA STEREO Project ...................................................................................... 23
Figure 12: Cones, Teddy, Tsukuba and Venus left views ............................................ 30
Figure 13: SGBM matching costs aggregation ......................................................................... 34
Figure 14: SGBM Results ......................................................................................................... 37
Figure 15: Block Matching Algorithm for SVM ........................................................................ 38
Figure 16: Block Matching Sample Images .............................................................................. 39
Figure 17: Block Matching Results........................................................................................... 42
Figure 18: Markov Random Field illustration .......................................................................... 43
Figure 19: Various cost functions ............................................................................................ 45
Figure 20: LBP message passing .............................................................................................. 47
Figure 21: Loopy Belief Propagation Results ........................................................................... 49
Figure 22: Fast stereo Matching and Disparity Estimation ..................................................... 50
Figure 23: Fast stereo Matching and Disparity Estimation Results ......................................... 53
Figure 24: SSMP Results .......................................................................................................... 58
Figure 25: Percentage of bad matching pixels (lower value is better) .................................... 60
Figure 26: Current Virtual Reality Devices ............................................................................... 62
Figure 27: Stereoscopic Endoscope ......................................................................................... 63
Chapter 1: Introduction and Goals
Stereoscopic vision (called binocular vision in nature) is the extraction of 3D
information about a scene from a pair of images depicting different views of that
scene. The process in nature is called stereopsis and it occurs in the brain, which
combines the images received from the two eyes.
Computer stereo vision is the extraction of 3D information from a pair of digital
images, usually obtained by two CCD cameras. This is made possible using various
techniques and algorithms. Over the years many algorithms have been proposed to
match a pair of images and produce a disparity map, and their performance is
measured by accuracy and speed.
The primary goal of this work is to illustrate a simple process of comparing methods
used for stereo matching. The secondary goal is to compare experimentally the
relevant algorithms that can be found implemented in various sources online, using
the aforementioned process and some of the state-of-the-art datasets. The steps
followed in this work can be summarized as follows:
1. Extensive study of computer stereo vision resources and literature to gain as
thorough understanding as possible of all related terms and methodology.
2. Study any previous work related to the topic of this master thesis and determine the
state of the art.
3. Define the process for comparison, choosing a simple and comprehensible one.
4. Research online for implemented stereo algorithms and methods. Make any
changes to the code needed for optimal execution, without affecting the core algorithm.
5. Test these methods using all 4 stereo sets of the Middlebury platform and select
the ones giving usable results.
6. Make the comparison; document, study and analyze the results.
7. Write the thesis document, including terminology, process description, algorithm
and result analysis.
All references to existing work are of course acknowledged and documented here.
Also, the results displayed here were produced by executing the relevant code with
the optimal parameters for each method; none were simply found online and used as is.
Chapter 2: Stereo Vision Basics
2.1: Overview
In traditional stereo vision [1], two cameras, displaced horizontally from one another,
are used to obtain two differing views of a scene, in a manner similar to
human binocular vision. By comparing these two images, the relative depth
information can be obtained, in the form of disparities, which are inversely
proportional to the differences in distance to the objects. To compare the images, the
two views must be superimposed in a stereoscopic device, the image from the right
camera being shown to the observer's right eye and from the left one to the left eye.
Figure 1: Simple stereo vision illustration
In real camera systems however, several pre-processing steps are required.
1. The image must first be freed of distortions, such as barrel distortion, to
ensure that the observed image is purely projectional.
2. The image must be projected back to a common plane to allow comparison of
the image pairs, known as image rectification.
3. An information measure which compares the two images is minimized. This
gives the best estimate of the position of features in the two images, and
creates a disparity map.
4. Optionally, the disparity as observed by the common projection is converted
back to the height map by inversion. Utilizing the correct proportionality
constant, the height map can be calibrated to provide exact distances.
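Step 4 above can be sketched as follows; the relation is Z = f·B/d for focal length f (in pixels), baseline B and disparity d, and the calibration values in the example are hypothetical:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (metres).

    Depth is inversely proportional to disparity: Z = f * B / d.
    Zero disparities are masked to avoid division by zero.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical calibration: 700 px focal length, 12 cm baseline.
d = np.array([[70.0, 35.0], [0.0, 7.0]])
print(disparity_to_depth(d, focal_px=700.0, baseline_m=0.12))
# Closer objects have larger disparity: 70 px -> 1.2 m, 7 px -> 12 m.
```

With the correct proportionality constants (f and B from calibration), this is exactly the conversion that turns a disparity map into exact distances.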
2.2: History of Stereo Vision
In this chapter, the most important moments in the history [2] of stereoscopic vision
are described.
Stereopsis was first explained by Charles Wheatstone in 1838. "...the mind perceives
an object of three dimensions by means of the two dissimilar pictures projected by it
on the two retinae..." was his exact definition. He recognized that each eye views the
world from a slightly different horizontal position. As a result, each eye views a
different image.
Also, objects at different distances appear at a different horizontal position for each
eye (horizontal disparity), leading to the concept of depth. Wheatstone created the
illusion of depth from flat pictures that differed only in horizontal disparity. To
display his pictures separately to the two eyes, Wheatstone invented the stereoscope.
Figure 2: Wheatstone's Stereoscope
Although Wheatstone was the first man to explain and showcase stereoscopic vision,
he was not the first to notice it and try to understand it. Leonardo Da Vinci had also
realized that objects at different distances project images to the eyes that differ in their
horizontal positions. Despite his efforts, he concluded that it is impossible for a painter
to portray a realistic depiction of depth in a scene with a single canvas. Da Vinci
chose for his near object a column with a circular cross-section and for his far object
a flat wall. His column projects identical images of itself in the two eyes.
Stereoscopy became popular during Victorian times with the invention of the Prism
Stereoscope by David Brewster. Combined with the advances of photography, tens of
thousands of stereograms were produced.
Figure 3: Brewster's Stereoscope
In 1939 the View Master [3] line was introduced. It is a series of special stereoscopes
that are loaded with proprietary discs (called reels) containing a film of stereoscopic
scenes. Transition between scenes happens with a switch that rotates the reel. The
viewer looks through the two lenses to view the scene. To let light onto the film,
most models feature two translucent white panels in front of the reel. The viewer
needs to point the View Master toward a light source, although some models with
self-illumination were introduced. The View Master is best known as a toy for
children and is still available, although it is less popular than it used to be. Mattel
Corporation currently owns the rights for its production.
Figure 4: A typical ViewMaster device
In the 1960s Bela Julesz invented the Random Dot Stereogram. Unlike previous
stereograms, in which each half image showed recognizable objects, each half image
of the first random-dot stereograms showed a square matrix of about 10,000 small
dots, with each dot having a 50% probability of being black or white. No recognizable
objects could be seen in either half image. The two half images of a random-dot
stereogram were essentially identical, except that one had a square area of dots shifted
horizontally by one or two dot diameters, giving horizontal disparity. The gap left by
the shifting was filled in with new random dots, hiding the shifted square.
Nevertheless, when the two half images were viewed one to each eye, the square area
was almost immediately visible by being closer or farther than the background. Julesz
called it a Cyclopean image, in the notion that each eye was seeing part of an object
which was combined into one in the brain.
In the 1970s Christopher Tyler invented autostereograms: random-dot
stereograms that can be viewed without a stereoscope. A famous example is the
Magic Eye pictures, a series of books featuring images which allow people to view
3D images by focusing on 2D patterns.
In 1989 Antonio Medina Puerta demonstrated with photographs that retinal images
with no parallax disparity but with different shadows are fused stereoscopically,
imparting depth perception to the imaged scene. He named the phenomenon "Shadow
Stereopsis". He showed how effective the phenomenon is by taking two photographs
of the Moon at different times, and therefore with different shadows, making the
Moon appear in 3D stereoscopically, despite the absence of any other stereoscopic
cue.
In 1995 Nintendo Corporation introduced the Virtual Boy [4]. It was a tabletop
console and a first step into virtual reality devices for consumers. The goal was to
"immerse players into their own private universe", according to Nintendo. The goal
was never achieved, though, due to a number of factors. Limited technology used to
keep the cost down, bad design causing eye and neck strain, and a small games
catalog led to the commercial failure of the device, which was discontinued only a
year later. Nevertheless, it was the first step in the use of stereo vision and virtual
reality for entertainment purposes.
Figure 5: Nintendo's Virtual Boy
2.3: Pinhole Camera Model
The pinhole camera model [5] describes the mathematical relationship between the
coordinates of a 3D point and its projection onto the image plane of an ideal pinhole
camera, where the camera aperture is described as a point and no lenses are used to
focus light. The model does not include, for example, geometric distortions or
blurring of unfocused objects caused by lenses and finite sized apertures. It also does
not take into account that most practical cameras have only discrete image
coordinates. This means that the pinhole camera model can only be used as a first
order approximation of the mapping from a 3D scene to a 2D image. Its validity
depends on the quality of the camera and, in general, decreases from the center of the
image to the edges as lens distortion effects increase.
Figure 6: Pinhole Camera Model
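The mapping of the model reduces to similar triangles: a point (X, Y, Z) in camera coordinates projects to (f·X/Z, f·Y/Z) on the image plane. A minimal sketch of this first-order approximation (the focal length and point below are made up):

```python
import numpy as np

def pinhole_project(point_3d, focal):
    """Project a 3D point (camera coordinates) onto the image plane
    of an ideal pinhole camera: x = f*X/Z, y = f*Y/Z."""
    X, Y, Z = point_3d
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    return np.array([focal * X / Z, focal * Y / Z])

# A point 2 m in front of the camera, 0.5 m right, 0.25 m up; f = 0.035 m.
print(pinhole_project((0.5, 0.25, 2.0), focal=0.035))
# x = 0.00875 m, y = 0.004375 m on the image plane; doubling Z halves
# the projected coordinates (perspective foreshortening).
```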
2.4: Camera Resectioning/Parameters
Camera resectioning [6] is the process of estimating the parameters of a pinhole
camera model approximating the camera that produced a given photograph or video.
Usually, the pinhole camera parameters are represented in a 3 × 4 matrix called the
camera matrix. There are typically two types of camera parameters, intrinsic and
extrinsic.
Intrinsic parameters encompass focal length, image sensor format and principal point.
Those parameters are contained in the intrinsic matrix.
Extrinsic parameters denote the coordinate system transformations from 3D world
coordinates to 3D camera coordinates. Equivalently, the extrinsic parameters define
the position of the camera center and the camera's heading in world coordinates.
T is the position of the origin of the world coordinate system expressed in coordinates
of the camera-centered coordinate system (and NOT the position of the camera, as is
often mistaken). C is the position of the camera expressed in world coordinates. R is a
rotation matrix.
2.5: Epipolar geometry
Epipolar geometry [7] is the geometry of stereo vision. When two cameras view a 3D
scene from two distinct positions, there are a number of geometric relations between
the 3D points and their projections onto the 2D images that lead to constraints
between the image points. These relations are derived based on the assumption that
the cameras can be approximated by the pinhole camera model.
The figure below depicts two pinhole cameras looking at point X. In real cameras, the
image plane is actually behind the center of projection, and produces an image that is
rotated 180 degrees. Here, however, the projection problem is simplified by placing
a virtual image plane in front of the center of projection of each camera to produce an
unrotated image. OL and OR represent the centers of projection of the two
cameras. X represents the point of interest in both cameras. Points xL and xR are the
projections of point X onto the image planes.
Each camera captures a 2D image of the 3D world. This conversion from 3D to 2D is
referred to as a perspective projection and is described by the pinhole camera model.
It is common to model this projection operation by rays that emanate from the
camera, passing through its center of projection. Note that each emanating ray
corresponds to a single point in the image.
Figure 7: Epipolar Geometry Illustration
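For calibrated cameras these constraints take a compact algebraic form: with normalized image coordinates, corresponding points satisfy x′ᵀ E x = 0, where E = [t]× R is the essential matrix (the calibrated counterpart of the fundamental matrix of the next section). A numerical check with a purely hypothetical rig:

```python
import numpy as np

def skew(t):
    """Matrix form of the cross product: skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical rig: right camera shifted 10 cm along x, slightly rotated.
theta = 0.05
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.0])
E = skew(t) @ R                    # essential matrix

X = np.array([0.3, -0.2, 2.0])     # a 3D point in left-camera coordinates
x_left = X / X[2]                  # normalized homogeneous projection
X_r = R @ X + t                    # the same point in right-camera coordinates
x_right = X_r / X_r[2]

print(x_right @ E @ x_left)        # ≈ 0: the epipolar constraint holds
```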
2.6: Fundamental matrix
The fundamental matrix [8] F is a 3×3 matrix which relates corresponding points
in stereo images. In epipolar geometry, with homogeneous image coordinates, x and
x′, of corresponding points in a stereo image pair, Fx describes a line (an epipolar
line) on which the corresponding point x′ on the other image must lie. That means
that for all pairs of corresponding points, x′ᵀ F x = 0 holds.
2.7: Image rectification
Image rectification [9] is a transformation process used to project two-or-more
images onto a common image plane. This process has several degrees of freedom and
there are many strategies for transforming images to the common plane.
It is used in computer stereo vision to simplify the problem of finding matching points
between images. It uses triangulation based on epipolar geometry to determine
distance to an object. More specifically, binocular disparity is the process of relating
the depth of an object to its change in position when viewed from a different camera,
given the relative position of each camera is known.
2.8: Disparity map
Disparity [10] refers to the difference in location of an object in corresponding two
(left and right) images as seen by the left and right eye, which is created due to
parallax (the eyes' horizontal separation). The brain uses this disparity to calculate
depth information from the two dimensional images.
In computer vision, the disparity map is an image that depicts how far from the
viewing source the objects of the scene are. The depiction is based on intensities.
Brighter objects are closer to the source (they have a larger apparent distance between
the left and right image) and darker objects are further (they have a smaller apparent
distance between the left and right view).
Figure 8: A ground truth disparity map
In short, the disparity of a pixel is equal to the shift value that leads to minimum sum-
of-squared-differences for that pixel.
Disparity map calculation is essentially the result of stereo matching. This work
revolves around some of its calculation methods.
2.9: Stereo Matching
Stereo matching [11] is used for finding corresponding pixels in a pair of images,
which allows 3D reconstruction by triangulation, using the known intrinsic and
extrinsic orientation of the camera. There are two methods [12] for stereo matching,
local and global. Local methods attempt to match small regions of one image to
another based on intrinsic features of the region. Global methods supplement local
methods by considering physical constraints such as surface continuity or base-of-
support. Local methods can be further classified by whether they correlate a small
area patch among images (called correlation or area based) or match features (called
feature based).
Correlation based (or area based) stereo matching considers a certain area on the left
image (usually) and tries to find an equally sized area on the right image that is the
closest match. That area is called matching or correlation window. Since the images
are rectified, the algorithm searches only horizontally by a predefined offset. They
produce dense disparity maps. Generally a smaller matching window will give more
detail but more noise and a larger window will produce a smoother disparity map but
with less captured detail. Such algorithms are inherently fast and memory efficient, so
they are usually preferred. However, finding the optimal combination of window size
and other algorithm parameters can be challenging and requires a lot of testing.
To find a match for a pixel in the left image, the left window is drawn centered on that
pixel. It is then compared to several windows in the right image, beginning with a
window at the same location (zero disparity) and moving left in increments of one
pixel (increasing disparity by one with each move). Whichever window in the right
image gives the lowest cost is said to match the left window. The difference in x
coordinates between the center of this match and the center of the left window is the
disparity value of the pixel in question in the left image.
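The window search just described can be sketched as a naive SSD matcher (an unoptimized illustration; the window size and disparity range below are arbitrary):

```python
import numpy as np

def block_match(left, right, max_disp=16, half=2):
    """Naive SSD block matching on rectified grayscale images (2D arrays).

    For each left-image pixel, slide a (2*half+1)^2 window leftwards in
    the right image and keep the shift with the lowest SSD cost.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y-half:y+half+1, x-half:x+half+1].astype(float)
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x - half) + 1):
                cand = right[y-half:y+half+1, x-d-half:x-d+half+1].astype(float)
                cost = np.sum((ref - cand) ** 2)  # sum of squared differences
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

On rectified inputs where the right image equals the left image shifted a few pixels leftwards, interior pixels recover exactly that shift; real implementations add the optimizations and parameter tuning discussed above.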
Figure 9: Simple window matching illustration
Feature based stereo matching computes the corresponding pixels by using the
extracted features from the images. Features are usually chosen to be lighting and
viewpoint independent. As a result, they compensate for viewpoint changes and
camera differences. Techniques used to find the image features include edge, corner,
blob and ridge detection. Such features include
• edge elements
• corners
• line and curve segments
• circles and ellipses
• regions defined either as blobs or polygons.
Area based algorithms are simple and efficient in general. Some of these algorithms
are ideal for real-time stereo matching applications. But, as discussed earlier, it can be
challenging to produce robust matches. Feature based methods, on the other hand, can
produce fast and robust matching but usually require expensive feature extraction. Area
methods are more commonly used by researchers and this work will focus on them for
the most part.
2.10: Matching Cost
At the base of any matching algorithm is the matching cost which measures the (dis-)
similarity of a pair of locations, one in each image. Matching costs can be defined at
the pixel level or over a certain area. Common examples are absolute intensity
difference, squared intensity difference, filter-bank responses and gradient based
measures. Binary matching costs are also possible, based on binary features such as
edges.
2.11: Normalized Cross Correlation
The higher the normalized cross correlation [13] of two windows, the better they
match. Normalized cross correlation is calculated by computing the mean and
standard deviation of intensity in each window. Then, the mean intensity is subtracted
from each pixel's intensity. Corresponding values of intensity - mean intensity from
the left and right windows are multiplied together. These multiplied values are
summed over the entire window. Finally, this sum is divided by the number of pixels
in either window and divided by each standard deviation.
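The computation described above can be written out directly. This is a minimal sketch; `ncc` is a hypothetical helper name, and the two windows are assumed to be equally sized and non-constant (a constant window has zero standard deviation, so the measure is undefined for it).

```python
import numpy as np

def ncc(win_l, win_r):
    """Normalized cross correlation of two equally sized windows, following
    the steps in the text: subtract each window's mean, multiply corresponding
    values, sum, then divide by the pixel count and both standard deviations."""
    l = win_l.astype(np.float64)
    r = win_r.astype(np.float64)
    dl = l - l.mean()                  # intensity minus mean intensity (left)
    dr = r - r.mean()                  # intensity minus mean intensity (right)
    n = l.size
    return (dl * dr).sum() / (n * l.std() * r.std())
```

Identical windows give a correlation of 1, and an inverted window (an affine transform with negative slope) gives −1, which is why the measure compensates for brightness and contrast differences between cameras.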
2.12: Ground truth
Ground truth [14] is a term used in various fields to refer to the absolute truth of
something and is used to test the efficiency of an algorithm or a system. Used widely
in machine learning it denotes a set of measurements that are much more accurate
than the system being tested.
In the case of stereo vision systems the question is how well they can estimate 3D
positions. The ground truth disparity map is composed of the positions given by a
laser range finder, which is known to be much more accurate than any camera system.
Practically, it describes the perfect disparity between left and right image and it is
compared to the produced disparity map using various metrics.
2.13: Applications of Stereo Vision
Stereo vision is well known to the general public mainly by 3D movies. But that is
only a small part of a wide range of applications not only in everyday life, but also in
industry and research and even space exploration.
• Robotics, to extract information about the relative position of 3D objects in the
vicinity of autonomous systems, and object recognition, where depth information
allows the system to separate occluding image components. Such robotic
systems are primarily used in industrial applications.
• 3D displays [15] and head mounted displays, to provide stereoscopic imaging
to the human eyes. In such applications, where the goal is depth perception, the
basic requirement is to display offset images that are filtered separately to the
left and right eye.
There are two methods to accomplish that: one where the user wears
glasses to filter the offset images to each eye, and another where no glasses are
required.
In the case where glasses (or filters) are used there are two types, passive
and active filters. Passive filters do not require power and can be either color
filters or polarization filters. Active shutter filters, as their name suggests, have
active shutters to filter the image and require power.
In the glasses-free case the light source splits the images directionally into the
viewer's eyes. Such displays are called autostereoscopic. Maybe the most
famous example of such technology is the Nintendo 3DS game console.
Figure 10: Nintendo's 3DS
Finally there are the head mounted displays, where a separate display is
positioned in front of each eye and the image is projected through lenses to assist
the eye focusing. Such devices are used in a plethora of concepts like military
to provide augmented reality applications, engineering to provide stereoscopic
views of CAD schematics and of course entertainment, like 3D gaming and
movies or tours in virtual environments.
• Calculation of contour maps and geometry extraction for 3D building mapping
mainly from aerial surveys [16]. Those surveys are usually conducted with the
use of unmanned aerial vehicles, or UAVs for short. The large number of
aerial images captured is then converted into geo-referenced 2D high
resolution orthophotos, 3D surface models and point clouds by various
automated systems.
• The NASA STEREO project [17], one of the most important and large scale
projects ever. It stands for Solar Terrestrial Relations Observatory and in
essence it is a solar observation mission. Two nearly identical spacecraft were
launched in 2006 into orbits around the sun in a manner that enables
stereoscopic imaging of the sun.
Figure 11: The NASA STEREO Project
The goal is to study solar phenomena (principally coronal mass ejections:
massive bursts of solar wind, plasma and magnetic fields ejected into space
that can disrupt earth communications and power networks) on the far side of
the sun. This practically enables solid forecasts of solar activity through a 360
degree view of the sun at all times.
• Driverless cars [18] are a hot topic at the time of writing. Driverless cars, as the
name suggests, can drive to their destination without requiring human
intervention. Stereo vision is the way the car can "see" the world in front of it.
Of course, a plethora of sensors is used, including laser, ultrasonic, GPS etc.,
so a 360 degree "map" of the surrounding world can be formed by the car.
There are quite a few similarities with robot navigation in the process. Of
course this technology is not yet reliable enough, especially due to the
risk of traffic accidents in case of errors or inaccuracy.
2.14: Challenges and difficulties
The correspondence problem refers to the problem of ascertaining which parts of one
image correspond to which parts of another image, where differences are due to
movement of the camera, the elapse of time, and/or movement of objects in the
photos.
Other major pitfalls include reflections and transparency. It is usually very hard for a
machine to distinguish whether it is looking at an object or the reflection of that
object. Similarly, it is hard for a computer vision system to recognize the existence of
transparent objects between the view source and the target scene.
The third pitfall is continuous and textureless regions. It is very difficult to determine
which point on the left image corresponds to which point on the right image. Finally
there can be technical difficulties like sensor noise and calibration noise to the
cameras.
Chapter 3: Stereo Algorithms Evaluation Process
In this chapter there is a short analysis of previous work related to the topic of this
thesis. Also, there is an analysis of the state of the art evaluation method, as well as
the process followed for the thesis.
3.1: Previous Work
• A Taxonomy and Evaluation of Dense Two Frame Stereo Correspondence
Algorithms [19] by D. Scharstein and R. Szeliski. It is the state of the art
evaluation and is analyzed in the next paragraph.
• An Experimental Comparison of Stereo Algorithms by R. Szeliski and R.
Zabih [20]. In this work by Szeliski and Zabih there is an effort to compare
experimentally a few stereo vision algorithms. They make use of two stereo
pairs, the well known set from Tsukuba university and another produced by
them (a simple scene with a slanted surface). Their methodology consists of
comparison with ground truth depth maps and the measurement of novel
prediction errors.
• Review of Stereo Matching Algorithms for 3D Vision by L. Nalpantidis, G.
Sirakoulis and A. Gasteratos [21]. In this work there is a theoretical
comparison and summary of various methods. It considers both local and
global methods, computational intelligence techniques and the speed and
accuracy of those. Also, some hardware implementation techniques are
presented.
• Overview of Stereo Matching Research, by R.A.Lane and N.A. Thacker [22].
This is a literature survey of a few area and feature based methods. It includes
a short description of those methods and some conclusions drawn. It is a
relatively old paper and part of a large series of stereo vision journals.
3.2: State-of-the-Art Middlebury Evaluation
The state of the art evaluation method for stereo vision algorithms is offered by the
Middlebury College. The creators are Daniel Scharstein and Richard Szeliski and the
evaluation process is documented in their publication titled "A Taxonomy and
Evaluation of Dense Two-Frame Stereo Correspondence Algorithms" [19].
The goal of the creators of this evaluation process was to compare a large number of
methods within one common framework. For that reason they have focused on
techniques that produce a univalued disparity map.
The evaluation process is very detailed and quite complicated. In essence, there are 2
error measurements, RMS error and percentage of bad matching pixels, in 3 different
image areas.
RMS (root-mean-squared) error (measured in disparity units) between the computed
disparity map dC(x, y) and the ground truth map dT(x, y) is computed by the
following formula:

R = ( (1/N) Σ(x,y) |dC(x, y) − dT(x, y)|² )^(1/2)

Percentage of bad matching pixels is computed by this formula:

B = (1/N) Σ(x,y) ( |dC(x, y) − dT(x, y)| > δd )

where N is the total number of pixels and δd is a disparity error tolerance
(typically 1.0).
Also the images are segmented into three different areas:
• textureless regions T: regions where the squared horizontal intensity gradient
averaged over a square window of a given size is below a given threshold.
Essentially, these are areas of the scene with little to no texture.
• occluded regions O: regions where the left-to-right disparity lands at a location
with a larger (nearer) disparity. This means that an occluded region is visible
on one of the images and not visible on the other.
• depth discontinuity regions D: regions where neighboring disparities differ by
more than a predefined gap, dilated by a window of a given width. These are
practically the areas of the scene where there is a sudden change in the depth
between the objects.
These regions were selected to support the analysis of matching results in typical
problem areas.
The Middlebury College offers an online evaluation tool for computer vision
researchers to upload and test their algorithms and compare them against many others.
There are also a few datasets offered in various resolutions for testing. The online
evaluation tool utilizes 4 certain image pairs and compares the user submitted
disparity maps with the ground truth maps for the four pairs. Note that the online
evaluation tool is at version two at the time of writing this thesis.
3.3: Thesis Evaluation Process
The first step was to collect all the open source algorithms that can be found
implemented on various sources online. They were all tested and the ones producing
unusable results were discarded. The ones that gave meaningful results are analyzed
here.
The comparison of the algorithms is based partly on the state of the art evaluation,
namely the percentage of bad matching pixels (sum of absolute differences of the
disparity map and the ground truth image matrices, BadMatchPercent formula from
the previous paragraph). The focus is to find how good the algorithm has performed
in estimating the disparity map. Mean elapsed time is measured in all cases, but it is
not directly comparable, since two different tools are used and OpenCV is vastly
faster than Matlab because it is written in the C++ language. Also, elapsed time depends
on more factors, like the parameters of each algorithm, image size and of course the
hardware it is running on. The algorithms presented here were tested on an Intel
Celeron G1620 CPU with 4GB of RAM and an AMD HD6450 GPU.
A pixel by pixel subtraction is conducted between the result and the ground truth
image matrices. There is a 30 pixel margin on all sides to eliminate empty image
borders, since some algorithms produce disparity maps with black borders that could
lower the score for no reason. Any result that is larger than the predefined threshold
(which is around 1.0 traditionally) is considered a bad matching pixel. The threshold
is the same for all images so the comparison is fair. When a bad matching pixel is
found it is added to the previous sum. In the end the sum is divided by the total
number of pixels. The final result is a percentage of the bad matching pixels in the
disparity map.
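The author's implementation of this metric was written in Matlab; the following Python sketch mirrors the described steps (a 30-pixel margin on all sides, a threshold of about 1.0, and the fraction of bad pixels) under those assumptions. The function name and default values are illustrative only.

```python
import numpy as np

def bad_match_percent(disp, truth, margin=30, threshold=1.0):
    """Fraction of bad matching pixels between a computed disparity map and
    the ground truth, ignoring a border margin on all sides (the thesis uses
    30 pixels to discard empty image borders)."""
    d = disp[margin:-margin, margin:-margin].astype(np.float64)
    t = truth[margin:-margin, margin:-margin].astype(np.float64)
    bad = np.abs(d - t) > threshold      # pixel-by-pixel comparison
    return bad.sum() / bad.size          # sum of bad pixels over total pixels
```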
The images used are the widely popular stereo datasets from Middlebury College.
They contain several right and left views of the same scene, as well as a ground truth
image for evaluation. As mentioned before, the Middlebury online evaluation
platform uses 4 standard image sets, a total of 8 images (cones, teddy, tsukuba,
venus). Those images along with their ground truth will be used. Note that those four
image sets are used in the second version of the online evaluation tool of Middlebury,
which is still online at the time of writing. Version three should be online soon after
this work is completed and it will use different image sets for the evaluation of
algorithms.
Figure 12: Cones, teddy, tsukuba and venus left view
The method described above is summed up in a mathematical equation, and the code
that implements it was written by the author of this thesis. The platform chosen for the
comparison is Matlab, due to the simplicity of matrix operations.
Chapter 4: Testing and Comparison of Stereo Algorithms
4.1: Intro
Stereo vision is still a popular topic when it comes to research. It is very active and
there is a large number of algorithms being evaluated in the Middlebury platform.
Unfortunately, finding implementations of the various algorithms can be difficult
because communication with their creators is rarely successful and many of them
refuse to help. Furthermore, available implementations most of the time do not
function as expected and produce unusable results.
In this work the algorithms compared can be found implemented on the internet and
their implementation is correct. That means, it gives satisfactory results not only by
visual examination but also in comparison with the ground truth disparity maps. All
the methods described here give a bad matching pixel percentage of less than 50%.
4.2: Common Algorithm Parameters
Each stereo algorithm is unique and features a certain number of inputs and
parameters. But there are some common parameters among the ones analyzed in this
work that apply also to the majority of existent stereo algorithms.
As expected, the input to all the algorithms is the stereo pair, traditionally left-right
views of the scene in that order. Some algorithms accept as input the image matrix
(RGB or grayscale) and others accept plain images reading the matrix afterwards. The
rest of the algorithm inputs are actually the parameters.
The first parameter is the window size, used when window/block matching is
employed to search for similarities. A smaller window size usually means a more
detailed but coarse (noisy) disparity map. A larger window size gives a smoother
disparity map overall, but with less detail captured. Of course this parameter should
be an odd number, since there is always a "center" pixel in the matching window.
The second parameter is the disparity range: the minimum and maximum disparity
values between which the matching algorithm will search for similarities between the
blocks of the image pair. Disparity values outside that range will be
ignored. The minimum disparity value can be a negative number. In the case of
Middlebury test images the minimum value is always 0 and the maximum varies
depending on the image.
4.3: Semi Global (Block) Matching
4.3.1: Algorithm Overview
Semi-Global (Block) Matching [23] successfully combines concepts of global and
local stereo methods for accurate, pixel-wise matching at low runtime. This is
probably the most popular algorithm for stereo matching. It has spawned many other
algorithms and has been widely used by stereo vision researchers. As it is evident
from the results in this work, it gives results with a relatively high number of bad
matching pixels and is surpassed by other algorithms. Despite that, it is fast and very
effective for real time stereo applications since the number of bad matches in its
output is not prohibitive.
The core algorithm considers pairs of images with known intrinsic and extrinsic
orientation. The method has been implemented for rectified and unrectified images. In
the latter case, epipolar lines are effectively computed and followed explicitly while
matching. Of course in this work only rectified images are used (with known epipolar
geometry).
The whole method is based on the idea of pixelwise matching of mutual information
and approximating a global, two-dimensional smoothness constraint by combining
many one-dimensional constraints. In a nutshell, the main algorithm has the
following processing steps: 1) Pixelwise cost calculation 2) Implementation of the
smoothness constraint 3) disparity computation with sub-pixel accuracy and occlusion
detection.
4.3.2: Pixelwise Cost Calculation
In step 1 (pixelwise cost calculation) the matching cost is calculated for a base image
pixel (the left one usually) from its intensity and the suspected correspondence of the
match image. An important aspect is the size and shape of the area that is considered
for matching. The robustness of matching is increased with large areas.
One way to perform pixelwise cost calculation is to use the Birchfield-Tomasi
subpixel metric. The cost is calculated as the minimum absolute difference of
intensities in the range of half a pixel in each direction along the epipolar
line.
Another way to calculate the pixelwise cost is based on mutual information
(abbreviated as MI) which is insensitive to recording and illumination changes. It is
defined as the sum of the entropies of the two images minus their joint entropy
according to the following formula:
MI(I1, I2) = H(I1) + H(I2) − H(I1, I2)
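As an illustration of this formula, mutual information can be estimated from intensity histograms. This is a rough sketch only, not the Taylor-expansion-based approximation used in Hirschmüller's algorithm; the function names and the bin count are arbitrary choices for the example.

```python
import numpy as np

def entropy(hist):
    """Shannon entropy (in bits) of a histogram of counts."""
    p = hist / hist.sum()
    p = p[p > 0]                         # ignore empty bins (0*log 0 = 0)
    return -(p * np.log2(p)).sum()

def mutual_information(i1, i2, bins=32):
    """MI(I1, I2) = H(I1) + H(I2) - H(I1, I2), estimated from intensity
    histograms of two equally sized images."""
    h1, _ = np.histogram(i1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(i2, bins=bins, range=(0, 256))
    h12, _, _ = np.histogram2d(i1.ravel(), i2.ravel(), bins=bins,
                               range=[[0, 256], [0, 256]])
    return entropy(h1) + entropy(h2) - entropy(h12)
```

An image shares maximal information with itself, while two unrelated noise images share almost none, which is the property that makes MI robust to illumination changes between the two views.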
H. Hirschmüller in his work favors the Mutual Information approach, contrary to the
OpenCV implementation that uses Birchfield-Tomasi.
4.3.3: Aggregation of Costs
Pixelwise cost calculation is generally ambiguous since wrong matches can easily
have a lower cost than correct matches due to factors like noise etc. Therefore, an
additional constraint is added that supports smoothness by penalizing changes to
neighboring disparities.
A global, 2D smoothness constraint is approximated by combining several 1D
constraints.
Figure 13: SGBM matching costs aggregation
The matching costs in 1D are aggregated from all eight directions equally as
illustrated on the figure above. The aggregated (or smoothed) cost for a pixel p and
disparity d is calculated by summing the costs of all 1D minimum cost paths that end
in pixel p at disparity d.
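For one direction (say, left to right along a scanline) the aggregation can be sketched with the standard SGM recurrence. This is a minimal illustration: the penalty values P1 (small disparity change) and P2 (larger change) are placeholders, and the full algorithm would compute such a term for all eight directions and sum them.

```python
import numpy as np

def aggregate_scanline(cost, P1=8.0, P2=32.0):
    """Aggregate a pixelwise cost slice of shape (width, ndisp) along one 1D
    direction (left to right) using the SGM recurrence:
      L(p,d) = C(p,d) + min(L(p-1,d), L(p-1,d-1)+P1, L(p-1,d+1)+P1,
                            min_k L(p-1,k)+P2) - min_k L(p-1,k)."""
    w, nd = cost.shape
    L = np.empty((w, nd), dtype=np.float64)
    L[0] = cost[0]                       # no predecessor at the first pixel
    for x in range(1, w):
        prev = L[x - 1]
        mins = prev.min()
        up = np.empty(nd);  up[1:] = prev[:-1] + P1;  up[0] = np.inf
        down = np.empty(nd); down[:-1] = prev[1:] + P1; down[-1] = np.inf
        L[x] = cost[x] + np.minimum.reduce(
            [prev, up, down, np.full(nd, mins + P2)]) - mins
    return L
```

Subtracting min_k L(p−1, k) at every step keeps the accumulated values bounded, which is what makes the fixed-point implementations of SGM memory efficient.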
4.3.4: Disparity Computation
The disparity image D that corresponds to the reference image I is determined as in
local stereo methods by selecting for each pixel p the disparity d that corresponds to
the minimum cost.
For sub pixel estimation, a quadratic curve is fitted through the neighboring costs and
the position of the minimum is calculated.
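The quadratic fit reduces to a closed-form expression over the three costs around the integer minimum. A sketch, with hypothetical function and argument names:

```python
def subpixel_disparity(d, c_prev, c_min, c_next):
    """Refine an integer disparity d by fitting a parabola through the costs
    at d-1, d and d+1 and returning the position of the parabola's minimum."""
    denom = c_prev - 2.0 * c_min + c_next
    if denom <= 0:                # flat or degenerate cost: keep integer value
        return float(d)
    return d + (c_prev - c_next) / (2.0 * denom)
```

The correction term always lies within plus or minus half a pixel of the integer minimum, since the center cost is the smallest of the three.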
4.3.5: Implementation Details
SGBM is implemented in OpenCV and is embedded in the library. It is also included
in Matlab since version 2011b. The Matlab implementation did not produce usable
results despite extensive experimentation. Consequently, only the OpenCV
version will be used.
OpenCV uses a modified version of the original Hirschmuller algorithm. Contrary to
the original algorithm that considers 8 directions, this one considers only 5 (single
pass). Also, this variation matches blocks, not individual pixels, hence the Semi
Global Block Matching name. The parameters of this modified version can be tuned
so that the algorithm behaves like the original one.
Also, the mutual information cost function is not implemented. Instead, the simpler
Birchfield-Tomasi sub-pixel metric is used. Finally, some pre- and post-processing
steps from the Konolige Block Matching implementation are included, for example
pre- and post-filtering. This is evident from the few identical parameters between the
two algorithms.
The OpenCV SGBM implementation features the common parameters and a few
more that are listed here (the OpenCV documentation is insufficient so the
explanation is based on experimentation):
• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].
• uniquenessRatio: Computed disparity d* is accepted only if
SAD(d)>=SAD(d*)*(1+uniquenessRatio/100) for any d!=d+/-1.
• speckleRange, speckleWindowSize: Parameters of the OpenCV function
filterSpeckles which is used to post process the disparity map. It replaces
blobs of similar disparities (the difference of two adjacent values does not
exceed speckleRange) whose size is less or equal to speckleWindowSize (the
number of pixels forming the blob) by the invalid disparity value.
• disp12MaxDiff: A left-right check is performed. Pixels are matched from left
to right image and then from the right back to the left. The disparity value is
accepted only if the distance of the first match and the distance of the second
match have maximum difference of disp12MaxDiff.
• fullDP: If set to true, the algorithm considers eight directions instead of five
(like the original) but with higher memory consumption.
• P1: Penalty for small disparity changes.
• P2: Penalty for higher disparity changes.
It should also be noted that the disparity range consists of two parameters,
minDisparity and numberofDisparities. The first value is the minimum disparity for
the search window. The second shows the maximum difference from the minimum
disparity. It works the same way as with the next algorithm, Block Matching.
4.3.6: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are
the following:
Figure 14: SGBM Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following:

          Cones     Teddy     Tsukuba   Venus     Avg time
SGBM      0.4943    0.4977    0.3923    0.4982    ~0.065 sec
4.4: Block matching
4.4.1: Algorithm overview
This method is based on the block matching algorithm. It is mainly used in video
frames for motion estimation, but its principles can apply successfully to stereo
matching also.
The block matching algorithm [24] involves dividing the current frame of video into
'macro blocks' and comparing each of the macro-blocks with the corresponding block
and its adjacent neighbors in the previous frame of the video. A vector is created that
captures the movement of a macro-block from one location to another in the previous
frame. This movement, calculated for all the macro blocks comprising a frame,
constitutes the motion estimated in the current frame.
The search area for a good macro-block match is determined by the 'search
parameter' p, where p is the number of pixels on all four sides of the corresponding
macro-block in the previous frame. The search parameter is a measure of motion: the
larger the value of p, the larger the motion that can be captured, but the search
becomes computationally expensive. Usually the macro-block is taken to be of size
16 pixels and the search parameter is set to 7 pixels.
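An exhaustive-search version of this scheme can be sketched as follows; the function name is hypothetical and the defaults follow the typical values quoted above (16-pixel blocks, p = 7).

```python
import numpy as np

def best_match(cur, prev, top, left, block=16, p=7):
    """Exhaustive block matching: compare one macro-block of the current
    frame against every block of the previous frame displaced by up to p
    pixels on each side, returning the motion vector with the lowest SAD."""
    b = cur[top:top+block, left:left+block].astype(np.int32)
    h, w = prev.shape
    best_cost, best_v = np.inf, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue                 # candidate block falls off the frame
            cand = prev[y:y+block, x:x+block].astype(np.int32)
            sad = np.abs(b - cand).sum()
            if sad < best_cost:
                best_cost, best_v = sad, (dy, dx)
    return best_v
```

In the stereo case the "previous frame" is the second image of the rectified pair, so the vertical component of the search collapses and only a horizontal displacement (the disparity) remains.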
4.4.2: Algorithm Analysis
The tested implementation was submitted to OpenCV library by Kurt Konolige and is
partly based on his work Small Vision Systems: Hardware and Implementation [12].
The paper revolves around the Small Vision Module or SVM, a compact, inexpensive
real-time device for computing dense stereo range images.
In the case of stereo matching, the adjacent neighbor is the second image of the stereo
pair.
Figure 15: Block Matching Algorithm for SVM
The algorithm that is implemented here has the following features:
• Laplacian of Gaussian transform (LOG for short), L1 norm (absolute
difference) correlation.
• Variable disparity search in pixel unit.
• Postfiltering with an interest operator and left/right check.
• x4 range interpolation.
The LOG transform and L1 norm were chosen because they give good quality results
and can be optimized on standard instruction sets available on DSPs and
microprocessors.
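As an illustration, a LoG kernel and the L1 correlation cost can be built in a few lines of NumPy. The kernel size and sigma here are arbitrary choices for the sketch, not the values used in the SVM implementation.

```python
import numpy as np

def log_kernel(size=9, sigma=1.4):
    """Laplacian of Gaussian kernel, the prefilter applied before the L1
    (absolute difference) correlation."""
    half = size // 2
    y, x = np.mgrid[-half:half+1, -half:half+1]
    r2 = x**2 + y**2
    k = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return k - k.mean()        # zero-mean, so flat (textureless) regions map to 0

def l1_cost(a, b):
    """L1 norm (sum of absolute differences) between two filtered patches."""
    return np.abs(a.astype(np.float64) - b.astype(np.float64)).sum()
```

The zero-mean prefilter removes local brightness offsets between the two cameras, which is what lets the cheap L1 correlation still give good-quality matches on DSP-class hardware.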
The following images are copied directly from the paper and help in the explanation
of the algorithm. The disparity maps are green on the paper but they are converted to
grayscale here for uniformity reasons (this work examines grayscale disparity maps).
Figure 16: Block Matching Sample Images
Image (a) shows the grayscale input image. Image (b) depicts the typical disparity
map produced by the algorithm. Brighter areas indicate higher disparities (closer
objects) while darker areas indicate lower disparities (further objects). There are 64
possible levels of disparity in total. In image (b) the highest level is around 40 while
the lowest is about 5. It is obvious that there is significant error in the upper left and
right corners of the image. That is due to the uniform areas without enough texture to
determine the disparity.
In figure (c) the interest operator is applied as a post filter. Areas with insufficient
texture are rejected and appear black in the produced image. Even after using this
filter, some errors still remain in portions of the image with disparity discontinuities,
in this case the side of the person's head. Those errors are caused by the correlation
window overlapping areas with very different disparities.
One way to eliminate those errors is by applying left/right check. The left/right check
can be implemented efficiently by storing enough information when doing the
original disparity correlation. As the author concludes, the combination of interest
operator and left/right check has proven to be the most effective at eliminating bad
matches. As mentioned by the author, correlation surface checks were not used, since
they do not add to the quality of the range image and can be computationally
expensive.
As mentioned earlier, the algorithm described in the paper was intended to be used
with the Small Vision Module, which is a small programmable device with limited
resources, so it was designed with storage efficiency in mind.
4.4.3: Implementation Details
This algorithm is implemented as part of the OpenCV library. It is also
part of Matlab 2011b onwards. The Matlab implementation, strangely, gave no usable
results even after extensive experimentation with the parameters (similarly to
SGBM), so only the OpenCV version is presented here.
The inputs and parameters include the common ones and some algorithm-specific ones.
The OpenCV documentation does not sufficiently explain the parameters, so the analysis
here is based mainly on experimentation. Most of them are optional and only the ones
used for the testing are analyzed. The disparity range is actually two parameters, one
for the minimum disparity and one for the number of disparities (minDisparity and
numberOfDisparities respectively; the final disparity range is [minDisparity,
minDisparity+numberOfDisparities]). The rest of the parameters are the following:
• preFilterSize: Window size of the prefilter.
• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].
• textureThreshold: Calculates the disparity only at locations where the texture
is larger than or equal to this threshold.
• uniquenessRatio: Computed disparity d* is accepted only if
SAD(d) >= SAD(d*)*(1+uniquenessRatio/100) for any d != d*+/-1.
• speckleRange, speckleWindowSize: Parameters of the OpenCV function
filterSpeckles which is used to post process the disparity map. It replaces
blobs of similar disparities (the difference of two adjacent values does not
exceed speckleRange) whose size is less or equal to speckleWindowSize (the
number of pixels forming the blob) by the invalid disparity value.
• disp12MaxDiff: A left-right check is performed. Pixels are matched from the left
to the right image and then from the right back to the left. The disparity value is
accepted only if the distance of the first match and the distance of the second
match differ by at most disp12MaxDiff.
4.4.4: Testing and Results
Produced disparity maps (cones, teddy, tsukuba, venus respectively):
Figure 17: Block Matching Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following:
Cones Teddy Tsukuba Venus Avg time
BM 0.4450 0.4583 0.3556 0.4707 ~0.028 sec
4.5: Loopy belief propagation
4.5.1: Overview
This method [25] by Ngia Kien Ho focuses on solving the stereo problem using
Markov Random Fields and Loopy Belief Propagation. The method is heavy on
mathematics and quite complicated. The creator offers extensive analysis of the
algorithm as well as an OpenCV implementation on his website.
4.5.2: Markov Random Fields
Markov Random Fields (abbreviated as MRF) are undirected graphical models that
can encode spatial dependencies. They consist of nodes and links like all graphical
models, but can also feature cycles/loops. Given a 3x3 image, the stereo problem can be
modeled using MRF as follows:
Figure 18: Markov Random Field illustration
The blue nodes are observed variables and represent pixel intensity values in this
work. The pink nodes are the hidden variables and represent the unknown disparity
value. The hidden variable values are referred to as labels. The links between the
nodes represent a dependency. For example, the center node depends only on the four
nodes it is connected to. This rather strong assumption, that each node depends only
on the nodes it is connected to, is called the Markov assumption.
4.5.3: MRF Formulation
The stereo problem can be formulated in terms of MRF as the following energy
function:

E(Y, X) = Σi DataCost(yi, xi) + Σi Σj∈N(i) SmoothnessCost(xi, xj)
Where Y is the observed node, X is the hidden node, i is the pixel index and j are the
neighboring nodes of node xi (see above diagram).
This energy function sums up all the costs at each link given an image Y and a
labeling X. The aim is to find a labeling for X that produces the lowest energy. This is
essentially the disparity map. The energy function contains two other functions,
DataCost and SmoothnessCost.
4.5.4: DataCost
The DataCost function returns the cost/penalty of assigning a label value of xi to data
yi. Good matches require a low cost and bad matches a high cost. Usually, sum of
absolute differences or sum of squared differences are ideal to serve as cost metrics.
Practically, the function calculates the SAD (or any other metric chosen) between
blocks (or even single pixels) in the two images of the stereo pair, taking into account
the different tested disparity values. The following pseudo code illustrates all this:
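A single-pixel version of that computation might look like the following sketch (absolute difference per pixel; a real implementation would sum the differences over a block):

```python
import numpy as np

def data_cost(left, right, num_labels):
    """cost[i, j, d] = penalty of assigning disparity label d to pixel (i, j).

    Single-pixel absolute difference; out-of-range positions get the
    worst possible cost so they are never preferred.
    """
    h, w = left.shape
    cost = np.full((h, w, num_labels), 255.0)
    for d in range(num_labels):
        for i in range(h):
            for j in range(d, w):
                # the matching position moves leftwards on the right image
                cost[i, j, d] = abs(float(left[i, j]) - float(right[i, j - d]))
    return cost
```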
Naturally, the direction of the matching window on the right image depends on the
stereo pair.
4.5.5: SmoothnessCost
The SmoothnessCost function enforces smooth labeling across adjacent hidden nodes.
To achieve that, a function that penalizes adjacent labels that are different is needed.
The following table shows some commonly used cost functions.
Figure 19: Various cost functions
The Potts model is a binary penalizing function with a single tunable lambda (λ)
variable. This value controls how much smoothing is applied. The linear and
quadratic models have the extra parameter K which is a truncation value that caps the
max penalty.
As the creator of the method comments, the choice of DataCost and SmoothnessCost
functions is not clear-cut and should be based on experimentation.
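The three models can be written compactly as follows (a sketch; lam and K stand for the λ and K parameters above):

```python
def potts(a, b, lam):
    # binary penalty: 0 if the two labels agree, lambda otherwise
    return 0 if a == b else lam

def truncated_linear(a, b, lam, K):
    # penalty grows linearly with the label difference, capped at lam * K
    return lam * min(abs(a - b), K)

def truncated_quadratic(a, b, lam, K):
    # penalty grows quadratically with the label difference, capped at lam * K
    return lam * min((a - b) ** 2, K)
```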
4.5.6: Loopy Belief Propagation main part
When the DataCost and SmoothnessCost functions have been chosen and the
parameters tuned, the next step is to solve the energy function. Trying all possible
combinations (brute force) is computationally intractable, so finding an exact
solution should not be expected. Instead, finding an approximate solution is a more
viable approach.
The Loopy Belief Propagation (LBP) algorithm was chosen among others (Graph Cut,
ICM etc) to find an approximate solution for the MRF. The original Belief
Propagation algorithm [26] was proposed by Pearl in 1982 for finding exact marginals
on trees. Trees are essentially graphs that contain no loops, but as it turned out the
same algorithm can successfully be applied to general graphs that contain loops. The
word “loopy” in the name originates from there.
LBP is a message passing algorithm. A node passes a message to an adjacent node
only when it has received all incoming messages, excluding the message from the
destination node to itself. The following figure illustrates the process:
Figure 20: LBP message passing
Node x1 wants to send a message to x2. So it waits for messages from all other nodes
(A, B, C, D) before sending it. As explained earlier, it will not send the message from
x2 to x1 back to x2. Node x1 maintains all possible beliefs about node x2. The choice
of using cost/penalty or probabilities is dependent on the choice of the MRF energy
formulation.
This pseudo code can illustrate the process discussed above. The first step is always
the initialization of the messages. As mentioned earlier, each node has to wait for all
incoming messages before sending its message to the target node. This means that at
the start of the algorithm, each node will wait forever and receive nothing, so no
message can be sent from it. To overcome that problem all messages are initialized to
some constant so the algorithm can proceed. The initialization is typically 0 or 1.
The main part of LBP is iterative. By adjusting the respective parameters, the
algorithm can run for a chosen number of iterations or until the change in energy
drops below a threshold. For each iteration, messages are passed around the MRF.
The passing scheme is arbitrary and any sequence is valid (the algorithm creator
chooses right, left, up and down). As it is mentioned, different sequences will produce
different results.
Once the LBP iteration completes, the best label at every pixel can be found by
calculating its belief using the following formula, where msg is the message sent to
node i from k with label l:

belief(xi = l) = DataCost(yi, l) + Σk∈N(i) msgk→i(l)

The label l with the lowest belief is selected.
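In a min-sum formulation, the message a node sends can be sketched like this (assuming the DataCost and SmoothnessCost tables from the previous sections; a hypothetical helper, not the author's code):

```python
import numpy as np

def send_message(data_cost_i, smoothness, incoming):
    """Message from node i to one neighbor, with one entry per label.

    data_cost_i : (L,) data costs at node i
    smoothness  : (L, L) smoothness cost between every label pair
    incoming    : non-empty list of (L,) messages from i's OTHER neighbors
                  (the destination node is excluded, as described above)
    msg[l] = min over l' of DataCost(l') + Smoothness(l, l') + sum of incoming[l']
    """
    total = data_cost_i + np.sum(incoming, axis=0)
    # minimize over the sender's label l' for every receiver label l
    return np.min(smoothness + total[None, :], axis=1)
```

A full LBP pass calls this for every node and direction on each iteration, then sums the data cost and all incoming messages at each node to obtain its beliefs.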
4.5.7: Implementation Details
As mentioned earlier, there is an OpenCV implementation available at the author’s
website. The common parameters and a few more are featured. The disparity range is
called labels and the window size is controlled by the variable wradius, which accepts
even numbers. Afterwards one is added to the selected number so the window size is
odd, as it should be. The rest of the parameters are the following:
• BP_ITERATIONS: An integer that defines how many iterations/loops the
algorithm will run for.
• LAMBDA: This value controls how much smoothing is applied in the
SmoothnessCost function.
• SMOOTHNESS_TRUNC: Truncation value for the truncated linear model
that is used in the implementation.
4.5.8: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are
the following (note that the algorithm ran for 5 loops):
Figure 21: Loopy Belief Propagation Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following (note that the time is measured for 5 loops):
Cones Teddy Tsukuba Venus Avg time
LoopyBP 0.0953 0.4206 0.0410 0.4203 ~155.88 sec
4.6: Fast stereo matching and disparity estimation
4.6.1: Overview
This method is based on the paper "A Hybrid Algorithm for Disparity Calculation
from Sparse Disparity Estimates Based on Stereo Vision" [27].
This excellent work proposes a hybrid method for disparity estimation by combining
the existing methods of block based and region based stereo matching. It utilizes
image segmentation through K-Means clustering, morphological filtering and
connected component analysis, SAD cost function and disparity map reconstruction.
The process is very clearly documented by the authors and will be analyzed here step
by step. The following diagram depicts an overview of the whole algorithm.
Figure 22: Fast stereo Matching and Disparity Estimation
4.6.2: Algorithm Analysis
The first step is color conversion from RGB color to Lab color. The majority of
imaging equipment captures images in RGB format. This format, though, does not
properly approximate human vision. To overcome that difficulty the Lab color space
was developed to better approximate human vision. The lightness component,
abbreviated L, closely matches the human perception of lightness and is widely used
by image processing algorithms. This algorithm only retains the L values of the pixels
for further processing.
Step two is image segmentation. It is performed on the L values of the left image
pixels using a fast implementation of the K-Means algorithm. The image pixels are
represented using a one-dimensional feature, namely a vector containing the L value for
each pixel. Next a histogram of the L values is built and used instead of the actual
pixel values for the subsequent iterations of the K-Means clustering. A histogram has
a smaller fixed number of bins than the actual pixels thus the runtime is significantly
reduced.
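The histogram trick can be sketched as follows: cluster the at most 256 occupied histogram bins instead of every pixel, then map each pixel to the cluster of its bin (an illustrative sketch, not the authors' fast K-Means implementation):

```python
import numpy as np

def kmeans_histogram(L, k, iters=20, seed=0):
    """Cluster 8-bit lightness values via their histogram instead of raw pixels.

    Assumes the image contains at least k distinct intensity values.
    """
    hist = np.bincount(L.ravel(), minlength=256).astype(np.float64)
    bins = np.arange(256, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = rng.choice(bins[hist > 0], size=k, replace=False)
    for _ in range(iters):
        # assign every bin (not every pixel) to its nearest center
        assign = np.argmin(np.abs(bins[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            m = assign == c
            if hist[m].sum() > 0:
                # weighted mean of the member bins, weights = pixel counts
                centers[c] = np.average(bins[m], weights=hist[m])
    # map each pixel to the cluster of its bin
    return assign[L]
```

Because the inner loops run over 256 bins rather than every pixel, the per-iteration cost is independent of the image size.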
Step three is segment boundary detection and refinement. Segment boundary
detection is achieved by comparing the cluster assignment of each pixel with that of
its 8 neighboring pixels. If any of them is found to be different, the pixel is marked as
one (belongs to a segment boundary), or else it is marked as zero. Thus, the boundary
map is generated from the segmented left image. Since the clustering in step two is
based only on the pixels' lightness values there are limitations in the accuracy of the
said clustering. Consequently, many pixels can be falsely identified as belonging to
segment boundaries. To overcome that, the authors of this work apply two
morphological filters to refine the boundary map by removing such noisy pixels.
There are two types of morphological filters, Fill and Remove. Fill isolates interior
zero pixels that are surrounded by ones and sets them to one as well. Remove sets a
pixel to zero if all of its four-connected neighbors are one, leaving only the
boundary pixels on. Furthermore, they use connected components analysis and
remove small artefacts in the boundary map due to segmentation errors. If disparity is
calculated for those artefacts, it will most probably be false. Finally the smallest
connected components that contribute about 4% of the total number of pixels are
removed.
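The boundary-detection rule of step three can be sketched as follows (a pixel is marked one if any of its 8 neighbors carries a different cluster label; np.roll wraps at the image borders, which is acceptable for a sketch):

```python
import numpy as np

def boundary_map(labels):
    """Mark pixels whose cluster label differs from any of its 8 neighbors."""
    h, w = labels.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # skip the pixel itself
            # compare against the neighbor offset by (dy, dx); wraps at borders
            shifted = np.roll(np.roll(labels, dy, axis=0), dx, axis=1)
            out |= (labels != shifted).astype(np.uint8)
    return out
```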
Step four is disparity calculation of the boundary map. The well known SAD (Sum of
Absolute Differences) cost function is used to determine only the disparities of the
boundary pixels, using the L values of the left and right image pixels. A partial
disparity map is built, considering the sparse disparity measurements.
The fifth and final step is disparity map reconstruction from boundaries. The
algorithm scans through each row of the partial disparity map and computes the
remaining disparities based on the ones that have already been calculated. It operates
in two stages:
-Disparity propagation ('fill' stage): In this first stage the disparity map is scanned
row-wise, left to right. Whenever two boundary pixels with identical disparity values
are encountered, the intermediate pixels of that row (aka the pixels between the
boundaries with the same value) are 'filled' with that disparity value. An exception is
made near the left and right end of each row. The left and right ends of each row are
filled with the disparity value of the nearest border pixel until a boundary pixel is
encountered.
-Estimation from known disparities ('Peek' stage): In the second stage the algorithm
searches for the pixels whose disparity is not determined yet and estimates it based on
the disparity value of their neighboring pixels. When such a pixel is found, the known
disparities of its neighbors are stored in an array and the unknown disparity is
computed using statistical analysis.
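The 'fill' stage can be sketched as a row-wise scan (illustrative; the special handling of the row ends described above is omitted for brevity):

```python
import numpy as np

def fill_stage(partial):
    """Propagate disparities between boundary pixels carrying equal values.

    partial : 2-D array, NaN where the disparity is still unknown.
    Whenever two consecutive known pixels in a row carry the same disparity,
    the unknown pixels between them are filled with that value.
    """
    out = partial.copy()
    for row in out:
        known = np.flatnonzero(~np.isnan(row))
        for a, b in zip(known[:-1], known[1:]):
            if row[a] == row[b]:  # matching boundary disparities
                row[a:b] = row[a]
    return out
```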
4.6.3: Implementation Details
The algorithm input parameters, except the common ones, are:
• K representing the number of intensity based clusters for K-Means clustering
in the second step.
• Disp_scale representing the factor by which calculated disparities will be
multiplied. Its value should be such that max_disparity * disp_scale <= 255.
It should be noted that all the parameters except the image pair are optional. A
random value will be used if they are not specified and the results might not be
optimal.
4.6.4: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are
the following:
Figure 23: Fast stereo Matching and Disparity Estimation Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following:
Cones Teddy Tsukuba Venus Avg time
FSM 0.0770 0.1338 0.0320 0.1202 ~10.14 sec
4.7: Probability-Based Rendering for View Synthesis
4.7.1: Algorithm Overview
The main objective of this work [28] is to synthesize a virtual view, given two
reference images, without deterministic correspondences. The first challenge that
occurred was to construct the probability of all probable matching points. The second
was to render an intermediate view using a set of all matching candidate points with
the probability.
To address the aforementioned challenges the authors of the paper presented the
probability-based rendering (PBR) approach that robustly reconstructs an intermediate
view with the steady-state matching probability (SSMP) density function.
SSMP: In this particular work the matching cost, typically referred to as a cost
volume in the correspondence matching literature, is re-defined as the probability of
being matched between points, enabling random walk with restart (RWR) to be
applied to optimize the matching probability. The RWR uses edge weights between
neighboring pixels to enhance the matching probability similar to aggregation
methods for local stereo matching.
PBR: The rendering process is re-formulated as an image fusion, so that all probable
matching points represented by the SSMP can be considered together. This approach
has a couple of significant advantages. First, it suppresses flicker artifacts.
Second, the intermediate view is free from the hole-filling problem, since the SSMP
considers all positions of probable matching points.
4.7.2: SSMP with RWR
First of all, the SSMP is defined. Two images are assumed, left and right. The
probability p measures how likely Il(m1,m2) (a point on the left image) is to be
matched to Ir(m1-d,m2) (a point on the right image with disparity d) or the opposite.
Also, the probability is inversely proportional to the cost, since a smaller matching cost
means a higher matching probability. The above can be summarized in the following
formulas:

p0(m,d) = exp(-e0(m,d)) / Z(m)

Where p0 is an initial calculated matching probability based on an initial matching
cost e0. Z(m) represents a normalization term. The variable m denotes coordinates m1
and m2 and d denotes the disparity.
Next step is SSMP estimation using RWR. The random walk has been widely used to
optimize probabilistic problems, as the authors suggest. A random walker iteratively
transits to its neighboring points according to an edge weight. Also, the random
walker goes back to the initial position with a restarting probability a (0<=a<=1) at
each iteration. A matching probability in the SSMP can be obtained by the RWR in an
iterative fashion as follows:

pt+1(m,d) = (1 - a) Σn∈Nm w(m,n) pt(n,d) + a p0(m,d)
Where Nm denotes the four-neighborhood of a reference pixel m. Note that the above
formula becomes the random walk when the restarting probability is zero. With an
assumption that neighboring pixels tend to have similar matching probability when
the range distance between the reference pixel m and its neighboring pixel n is small,
an edge weight w(m,n) is computed by the following formula:

w(m,n) = exp( -||I(m) - I(n)||2^2 / γ )

where γ represents the bandwidth parameter, typically set to the intensity variance,
and || . ||2 denotes the l2 norm. Then a steady state solution ps(m,d), which is referred
to as the SSMP in this work, can be obtained by iteratively updating pt+1(m,d) until
pt+1(m,d)=pt(m,d).
According to the authors, this work presents significant advantages. First of all, it
does not require specifying a window size for reliable matching, contrary to the
conventional methods, due to the small number of adjacent neighbors. Second, there
is no need to specify the number of iterations, since it gives a non-trivial solution in
the steady state. Third, this method gives the optimal solution for the given energy
functional.
4.7.3: PBR with SSMP
Now the two reference images and the sets of their corresponding SSMPs are given.
The rendering process is cast into the probabilistic image fusion. A baseline between
the left and right cameras is assumed to be normalized to 1. Beta (β) denotes the
location of a virtual camera, where 0<=β<=1. Also, Pl(m,d) and Pr(m,d) encode the
matching probability of a pixel on the left and right image (Iul(m,d) and Iur(m,d)
respectively) as follows:

where Zl(m) and Zr(m) are:

And < . > represents a rounding operator. The virtual view is then synthesized via an
image fusion process. Specifically, a probabilistic average, El(Il(m)) and Er(Ir(m)), for
the two reference images is computed with the corresponding probability Pl(m,d) and
Pr(m,d) and the textures Iul(m,d) and Iur(m,d), along with the disparity hypothesis d,
and then blended as follows:

Left and right disparity maps can be denoted as dwl(m) and dwr(m) respectively. The
sampled points Iul(m,d) and Iur(m,d) are then converted as functions of m, Iul(m) and
Iur(m) respectively. Furthermore, the matching probability functions Pl(m,d) and
Pr(m,d) are simplified as a set of shifted Dirac delta functions as follows:
Then, the PBR on the previous equation becomes:

For a given fixed point m*, the PBR synthesizes the intermediate view Iu(m*) with
the function of reference view Iul(m*,d) and the probability Pl(m*,d) as follows:

Finally, the PBR is able to handle occlusion and dis-occlusion (hole) regions by
assuming that the background texture varies smoothly. The problematic regions have
their textures synthesized in a probabilistic manner.
4.7.4: Implementation details
The implementation of this algorithm runs on Matlab. As discussed earlier, it does not
require one of the common parameters (window size). The most important parameter
the user has to modify is the disparity range. Apart from that there is a large number
of parameters that can be tuned to control various aspects of the algorithm, but none
are necessary to be changed and could be left at their default values.
4.7.5: Testing and Results
Figure 24: SSMP Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:

          Cones     Teddy     Tsukuba   Venus     Avg. time
SSMP      0.0883    0.1087    0.0379    0.4993    ~231.1 sec
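The bad-matching-pixel percentage used throughout these results can be computed as in the following sketch. It illustrates the standard metric rather than the exact evaluation script used here; the error threshold of 1 disparity level and the use of zero as the "unknown" ground-truth marker are assumptions:

```python
import numpy as np

def bad_pixel_percentage(disparity, ground_truth, threshold=1.0):
    """Fraction of pixels whose disparity error exceeds `threshold`,
    evaluated only where ground truth is available (non-zero)."""
    valid = ground_truth > 0                      # ignore unknown pixels
    error = np.abs(disparity - ground_truth)
    bad = (error > threshold) & valid
    return bad.sum() / max(valid.sum(), 1)
```

Because the score is a fraction of the valid pixels only, results remain comparable across image pairs whose ground-truth maps have different amounts of missing data.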
4.8: Results Analysis and Conclusion
Below is a table that summarizes the percentage of bad matching pixels for all the algorithms. The best (lowest) score for each image pair is marked with an asterisk (*) and the worst (highest) with a dagger (†).

                Cones     Teddy     Tsukuba   Venus     Avg. time
OpenCV
  BM            0.4450    0.4583    0.3556    0.4707    ~0.028 sec
  SGBM          0.4943†   0.4977†   0.3923†   0.4982    ~0.065 sec
  LoopyBP       0.0953    0.4206    0.0410    0.4203    ~155.88 sec
Matlab
  SSMP          0.0883    0.1087*   0.0379    0.4993†   ~231.1 sec
  FSM           0.0770*   0.1338    0.0320*   0.1202*   ~10.14 sec
The following diagram depicts a visual representation of the performance of the five algorithms. Since the measured quantity is the percentage of bad matching pixels, a lower value indicates better performance.
Figure 25: Percentage of bad matching pixels (lower value is better)
The most "difficult" image pairs to match seem to be Venus and Teddy: all the algorithms give their highest numbers of bad matching pixels on those two image sets. A possible reason is that those two sets exhibit little variation in the scene and higher uniformity in some regions, which can "confuse" an algorithm by producing a high number of candidate matching points. The other two image sets, Cones and Tsukuba, give better scores with all the algorithms. Those images exhibit greater variation in the scene, so matching points are easier to identify, with Tsukuba giving the best result for all of the algorithms.
Overall, the most efficient algorithm of the ones tested here is the Fast Stereo Matching and Disparity Estimation by G.R.M. Reddy and S. Mukherjee. It gives a small number of bad matching pixels under all circumstances and has a low running time. The SSMP also gives excellent results, but it is highly complicated and exhibits a high running time due to the large number of steps required for its completion. Also, as is evident from the Venus result, it cannot handle uniformity in images effectively in all cases. The Loopy Belief Propagation also handles uniformity in scenes poorly, as the results from Teddy and Venus show. Additionally, if a higher disparity range is selected, the running time becomes quite high even with a smaller number of loops. Finally, the SGBM and BM have many common points in their OpenCV implementation, with BM giving slightly better scores. Both algorithms performed poorly in the matching process itself, but have a very small runtime. They seem ideal for real-time applications, or wherever speed is more important than robust stereo matching.
Chapter 5: Discussion and future work
Stereo vision is employed, as mentioned earlier, in scientific, industrial, military and even consumer fields. Although it is still considered a gimmick by many people, it steadily gains traction and acceptance.
Perhaps the most obvious field that will employ stereo vision in the near future is virtual reality. Virtual reality applications have existed for a few years, but mostly for educational and entertainment purposes with limited use, mainly virtual tours of rather small 3D environments. Nowadays, with increased computational power, virtual reality can also be used in immersive and interactive applications, like video games.
Several consumer virtual reality devices like Google Cardboard [29] and Oculus Rift
[30] have started to make their way to consumers. More such devices are expected
from many manufacturers in the near future.
Figure 26: Current Virtual Reality Devices
Another interesting project, scheduled to be released commercially in October 2015, is a device dubbed a virtual reality toy and intended as the greatest remodeling of the famous View-Master. The toy corporation Mattel is working with Google on the project, which is largely based on the Google Cardboard. The traditional reels are replaced by plastic cards and a smartphone: the user slides the smartphone inside the headset and scans the cards, and a 3D image based on the theme of the cards is shown. The trademark switch on the new View-Master is now used to zoom or to focus on objects in the virtual scene.
Another field that has recently started to employ stereo vision is medicine, and more specifically endoscopy. Traditional endoscopes feature a single camera that provides a two-dimensional image of the patient's examined internal organ. Stereoscopic endoscopes feature two cameras that provide three-dimensional imaging, thus allowing a more thorough visual examination by extracting information about the internal surface of the organs.
Figure 27: Stereoscopic Endoscope
Research is also very active in driverless cars, discussed in the first chapter, and is expanding to other vehicles, mainly autonomous drones and Unmanned Ground Vehicles. There are also several space exploration projects that employ stereo vision, such as an innovative planetary landing algorithm [31] proposed by S. Woicke and E. Mooij, used to extract planet surface information and safely guide the space vessel to touchdown.
Finally, there are the robotic systems, autonomous or not, that use stereo imaging along with many other sensors. Robots are becoming more efficient and intelligent, and their use is set to expand in the near future into almost every sector imaginable.
References
1. http://en.wikipedia.org/wiki/Computer_stereo_vision.
2. https://en.wikipedia.org/wiki/Stereopsis#History_of_investigations_into_stereopsis.
3. https://en.wikipedia.org/wiki/View-Master#History.
4. https://en.wikipedia.org/wiki/Virtual_Boy.
5. http://en.wikipedia.org/wiki/Pinhole_camera_model.
6. http://en.wikipedia.org/wiki/Camera_resectioning.
7. http://en.wikipedia.org/wiki/Epipolar_geometry.
8. http://en.wikipedia.org/wiki/Fundamental_matrix_%28computer_vision%29.
9. http://en.wikipedia.org/wiki/Image_rectification.
10. http://www.jayrambhia.com/blog/disparity-maps/.
11. http://www.cs.stolaf.edu/wiki/index.php/Stereo_Matching.
12. Konolige, K. Small Vision Systems: Hardware and Implementation. Springer. 1998.
13. https://en.wikipedia.org/wiki/Cross-correlation#Normalized_cross-correlation.
14. https://en.wikipedia.org/wiki/Ground_truth.
15. http://techcrunch.com/2010/06/19/a-guide-to-3d-display-technology-its-principles-methods-and-dangers/.
16. http://www.self.gutenberg.org/articles/Aerial_survey.
17. http://www.nasa.gov/mission_pages/stereo/main/index.html.
18. https://en.wikipedia.org/wiki/Autonomous_car.
19. D. Scharstein, R. Szeliski. A Taxonomy and Evaluation of Dense Two Frame Stereo
Correspondence Algorithms. 2001.
20. R. Szeliski, R. Zabih. An Experimental Comparison of Stereo Algorithms. Springer. 2000.
21. L. Nalpantidis, G. Sirakoulis, A. Gasteratos. Review of Stereo Matching Algorithms for 3D Vision. 2007.
22. R.A. Lane, N.A. Thacker. Overview of Stereo Matching Research. 1998.
23. Hirschmuller, H. Semi-global Matching - Motivation, Development and Applications.
2011.
24. http://en.wikipedia.org/wiki/Block-matching_algorithm.
25. Ho, Nghia Kien. http://nghiaho.com/?page_id=1366#LBP. [Online]
26. https://en.wikipedia.org/wiki/Belief_propagation.
27. S. Mukherjee, G.R.M. Reddy. A Hybrid Algorithm for Disparity Calculation From Sparse Disparity Estimates Based on Stereo Vision. IEEE. 2014.
28. B. Ham, D. Min, C. Oh, M.N. Do, K. Sohn. Probability-Based Rendering for View
Synthesis. IEEE. 2014.
29. https://www.google.com/get/cardboard/.
30. https://www.oculus.com.
31. S. Woicke, E. Mooij. A Stereo-Vision Based Hazard-Detection Algorithm for Future
Planetary Landers. 2014.