COMPARISON OF OPEN SOURCE STEREO VISION ALGORITHMS
by
CHOUSTOULAKIS EMMANOUIL
Engineer of Applied Informatics and Multimedia
A THESIS
submitted in partial fulfillment of the requirements for the degree
MASTER OF SCIENCE
DEPARTMENT OF INFORMATICS ENGINEERING
SCHOOL OF APPLIED TECHNOLOGY
TECHNOLOGICAL EDUCATIONAL INSTITUTE OF CRETE
2015
Approved by:
Assistant Professor Kosmopoulos Dimitrios
Abstract
Stereo vision is the extraction of 3D information from a pair of images
depicting the same scene viewed from different angles. It occurs in nature in
creatures that possess two eyes. It is also a very active field in Computer Vision,
where the pair of images is digital and is obtained by cameras instead of eyes.
Several methods and algorithms exist to achieve this. This Master's Thesis
presents a theoretical and experimental comparison of a few of them that are
open source and can be found implemented online.
Acknowledgements
I would like to thank Professor Dimitris Kosmopoulos for his assistance in the
completion of this thesis. I would also like to thank my family for their patience and
support during my studies for this MSc degree. Finally, special thanks to the
examining committee and to all fellow stereo vision researchers and engineers whose
work contributed to the completion of this work.
Table of Contents
Abstract ..................................................................................................................................... 2
Acknowledgements ................................................................................................................... 3
Table of Figures ......................................................................................................................... 6
Chapter 1: Introduction and Goals ............................................................................................ 7
Chapter 2: Stereo Vision Basics ................................................................................................. 9
2.1: Overview......................................................................................................................... 9
2.2: History of Stereo Vision ................................................................................................ 10
2.3: Pinhole Camera Model ................................................................................................. 14
2.4: Camera Resectioning/Parameters ................................................................................ 14
2.5: Epipolar geometry ........................................................................................................ 15
2.6: Fundamental matrix ..................................................................................................... 16
2.7: Image rectification ........................................................................................................ 17
2.8: Disparity map ............................................................................................................... 17
2.9: Stereo Matching ........................................................................................................... 18
2.10: Matching Cost............................................................................................................. 20
2.11: Normalized Cross Correlation .................................................................................... 21
2.12: Ground truth .............................................................................................................. 21
2.13: Applications of Stereo Vision ..................................................................................... 21
2.14: Challenges and difficulties .......................................................................................... 24
Chapter 3: Stereo Algorithms Evaluation Process ................................................................... 26
3.1: Previous Work .............................................................................................................. 26
3.2: State-of-the-Art Middlebury Evaluation ...................................................................... 27
3.3: Thesis Evaluation Process ............................................................................................. 28
Chapter 4: Testing and Comparison of Stereo Algorithms ...................................................... 31
4.1: Intro .............................................................................................................................. 31
4.2: Common Algorithm Parameters .................................................................................. 31
4.3: Semi Global (Block) Matching ...................................................................................... 32
4.3.1: Algorithm Overview............................................................................................... 32
4.3.2: Pixelwise Cost Calculation ..................................................................................... 33
4.3.3: Aggregation of Costs ............................................................................................. 34
4.3.4: Disparity Computation .......................................................................................... 35
4.3.5: Implementation Details ......................................................................................... 35
4.3.6: Testing and Results ................................................................................................ 37
4.4: Block matching ............................................................................................................. 37
4.4.1: Algorithm overview ............................................................................................... 37
4.4.2: Algorithm Analysis ................................................................................................. 38
4.4.3: Implementation Details ......................................................................................... 41
4.4.4: Testing and Results ................................................................................................ 42
4.5: Loopy belief propagation ............................................................................................. 42
4.5.1: Overview ................................................................................................................ 42
4.5.2: Markov Random Fields .......................................................................................... 43
4.5.3: MRF Formulation ................................................................................................... 43
4.5.4: DataCost ................................................................................................................ 44
4.5.5: SmoothnessCost .................................................................................................... 45
4.5.6: Loopy Belief Propagation main part ...................................................................... 46
4.5.7: Implementation Details ......................................................................................... 48
4.5.8: Testing and Results ................................................................................................ 49
4.6: Fast stereo matching and disparity estimation ............................................................ 49
4.6.1: Overview ................................................................................................................ 49
4.6.2: Algorithm Analysis ................................................................................................. 50
4.6.3: Implementation Details ......................................................................................... 52
4.6.4: Testing and Results ................................................................................................ 53
4.7: Probability-Based Rendering for View Synthesis ......................................................... 54
4.7.1: Algorithm Overview............................................................................................... 54
4.7.2: SSMP with RWR ..................................................................................................... 55
4.7.3: PBR with SSMP ...................................................................................................... 56
4.7.4: Implementation details ......................................................................................... 58
4.7.5: Testing and Results ................................................................................................ 58
4.8: Results Analysis and Conclusion ................................................................................... 59
Chapter 5: Discussion and future work ................................................................................... 62
References ............................................................................................................................... 65
Table of Figures
Figure 1: Simple stereo vision illustration ................................................................................. 9
Figure 2: Wheatstone's Stereoscope ........................................................................ 10
Figure 3: Brewster's Stereoscope ............................................................................................ 11
Figure 4: A typical ViewMaster device .................................................................................... 12
Figure 5: Nintendo's Virtual Boy .............................................................................................. 13
Figure 6: Pinhole Camera Model ............................................................................................. 14
Figure 7: Epipolar Geometry Illustration .................................................................. 16
Figure 8: A ground truth disparity map .................................................................... 18
Figure 9: Simple window matching illustration ....................................................................... 19
Figure 10: Nintendo's 3DS ....................................................................................................... 22
Figure 11: The NASA STEREO Project ...................................................................................... 23
Figure 12: Cones, Teddy, Tsukuba and Venus left views ............................................ 30
Figure 13: SGBM matching costs aggregation ......................................................................... 34
Figure 14: SGBM Results ......................................................................................................... 37
Figure 15: Block Matching Algorithm for SVM ........................................................................ 38
Figure 16: Block Matching Sample Images .............................................................................. 39
Figure 17: Block Matching Results........................................................................................... 42
Figure 18: Markov Random Field illustration .......................................................................... 43
Figure 19: Various cost functions ............................................................................................ 45
Figure 20: LBP message passing .............................................................................................. 47
Figure 21: Loopy Belief Propagation Results ........................................................................... 49
Figure 22: Fast stereo Matching and Disparity Estimation ..................................................... 50
Figure 23: Fast stereo Matching and Disparity Estimation Results ......................................... 53
Figure 24: SSMP Results .......................................................................................................... 58
Figure 25: Percentage of bad matching pixels (lower value is better) .................................... 60
Figure 26: Current Virtual Reality Devices ............................................................................... 62
Figure 27: Stereoscopic Endoscope ......................................................................................... 63
Chapter 1: Introduction and Goals
Stereoscopic vision (called binocular vision in nature) is the extraction of 3D
information about a scene from a pair of images depicting different views of that
scene. The process in nature is called stereopsis and it occurs in the brain, which
combines the images received from the two eyes.
Computer stereo vision is the extraction of 3D information from a pair of digital
images, usually obtained by two CCD cameras. This is made possible using various
techniques and algorithms. Over the years many algorithms have been proposed to
match a pair of images and produce a disparity map, and their performance is
measured by accuracy and speed.
The primary goal of this work is to illustrate a simple process of comparing methods
used for stereo matching. The secondary goal is to compare experimentally the
relevant algorithms that can be found implemented in various sources online, using
the aforementioned process and some of the state-of-the-art datasets. The steps
followed in this work can be summarized as follows:
1. Extensive study of computer stereo vision resources and literature to gain as
thorough understanding as possible of all related terms and methodology.
2. Study any previous work related to the topic of this master thesis and determine the
state of the art.
3. Define the process for comparison, choosing a simple and comprehensible one.
4. Research online for implemented stereo algorithms and methods. Make any
changes to the code needed for optimal execution, without affecting the core algorithm.
5. Test these methods using all 4 stereo sets of the Middlebury platform and select
the ones giving usable results.
6. Make the comparison; document, study and analyze the results.
7. Write the thesis document, including terminology, process description, algorithm
and result analysis.
All references to existing work are of course acknowledged and documented here.
Also, the results displayed here were produced by executing the relevant code with
the optimal parameters for each method; none were simply found online and used as is.
Chapter 2: Stereo Vision Basics
2.1: Overview
In traditional stereo vision [1], two cameras, displaced horizontally from one another,
are used to obtain two differing views of a scene, in a manner similar to
human binocular vision. By comparing these two images, the relative depth
information can be obtained, in the form of disparities, which are inversely
proportional to the differences in distance to the objects. To compare the images, the
two views must be superimposed in a stereoscopic device, the image from the right
camera being shown to the observer's right eye and from the left one to the left eye.
Figure 1: Simple stereo vision illustration
In real camera systems however, several pre-processing steps are required.
1. The image must first be freed of distortions, such as barrel distortion, to
ensure that the observed image is purely projectional.
2. The image must be projected back to a common plane to allow comparison of
the image pairs, known as image rectification.
3. An information measure which compares the two images is minimized. This
gives the best estimate of the position of features in the two images, and
creates a disparity map.
4. Optionally, the disparity as observed by the common projection is converted
back to the height map by inversion. Utilizing the correct proportionality
constant, the height map can be calibrated to provide exact distances.
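Step 4 above can be sketched as follows; the relation is Z = f·B/d for focal length f (in pixels), baseline B and disparity d, and the calibration values in the example are hypothetical:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (metres).

    Depth is inversely proportional to disparity: Z = f * B / d.
    Zero disparities are masked to avoid division by zero.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical calibration: 700 px focal length, 12 cm baseline.
d = np.array([[70.0, 35.0], [0.0, 7.0]])
print(disparity_to_depth(d, focal_px=700.0, baseline_m=0.12))
# Closer objects have larger disparity: 70 px -> 1.2 m, 7 px -> 12 m.
```

With the correct proportionality constants (f and B from calibration), this is exactly the conversion that turns a disparity map into exact distances.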
2.2: History of Stereo Vision
In this chapter, the most important moments in the history [2] of stereoscopic vision
are described.
Stereopsis was first explained by Charles Wheatstone in 1838. "...the mind perceives
an object of three dimensions by means of the two dissimilar pictures projected by it
on the two retinae..." was his exact definition. He recognized that each eye views the
world from a slightly different horizontal position. As a result, each eye views a
different image.
Also, objects at different distances appear at a different horizontal position for each
eye (horizontal disparity), leading to the concept of depth. Wheatstone created the
illusion of depth from flat pictures that differed only in horizontal disparity. To
display his pictures separately to the two eyes, Wheatstone invented the stereoscope.
Figure 2: Wheatstone's Stereoscope
Although Wheatstone was the first man to explain and showcase stereoscopic vision,
he was not the first to notice it and try to understand it. Leonardo Da Vinci had also
realized that objects at different distances project images to the eyes that differ in their
horizontal positions. Despite his efforts, he concluded that it is impossible for a painter
to portray a realistic depiction of depth in a scene with a single canvas. Da Vinci
chose for his near object a column with a circular cross-section and for his far object
a flat wall. His column projects identical images of itself in the two eyes.
Stereoscopy became popular during Victorian times with the invention of the Prism
Stereoscope by David Brewster. Combined with the advances of photography, tens of
thousands of stereograms were produced.
Figure 3: Brewster's Stereoscope
In 1939 the View Master [3] line was introduced. It is a series of special stereoscopes
that are loaded with proprietary discs (called reels) containing a film of stereoscopic
scenes. Transition between scenes happens with a switch that rotates the reel. The
viewer looks through the two lenses to view the scene. To let light onto the film,
most models feature two translucent white panels in front of the reel. The viewer
needs to point the View Master toward a light source, although some models with
self-illumination were introduced. The View Master is best known as a toy for
children and is still available, although it is less popular than it used to be. Mattel
Corporation currently owns the rights for its production.
Figure 4: A typical ViewMaster device
In the 1960s Bela Julesz invented the Random Dot Stereogram. Unlike previous
stereograms, in which each half image showed recognizable objects, each half image
of the first random-dot stereograms showed a square matrix of about 10,000 small
dots, with each dot having a 50% probability of being black or white. No recognizable
objects could be seen in either half image. The two half images of a random-dot
stereogram were essentially identical, except that one had a square area of dots shifted
horizontally by one or two dot diameters, giving horizontal disparity. The gap left by
the shifting was filled in with new random dots, hiding the shifted square.
Nevertheless, when the two half images were viewed one to each eye, the square area
was almost immediately visible by being closer or farther than the background. Julesz
called it a Cyclopean image, in the notion that each eye was seeing part of an object
which was combined into one in the brain.
In the 1970s Christopher Tyler invented autostereograms: random-dot
stereograms that can be viewed without a stereoscope. A famous example is the
Magic Eye pictures, a series of books featuring images which allow people to view
3D images by focusing on 2D patterns.
In 1989 Antonio Medina Puerta demonstrated with photographs that retinal images
with no parallax disparity but with different shadows are fused stereoscopically,
imparting depth perception to the imaged scene. He named the phenomenon "Shadow
Stereopsis". He showed how effective the phenomenon is by taking two photographs
of the Moon at different times, and therefore with different shadows, making the
Moon appear in 3D stereoscopically, despite the absence of any other stereoscopic
cue.
In 1995 Nintendo Corporation introduced the Virtual Boy [4]. It was a tabletop
console and a first step into virtual reality devices for consumers. The goal was to
"immerse players into their own private universe", according to Nintendo. The goal
was never achieved, though, due to a number of factors. Limited technology used to
keep the cost down, bad design causing eye and neck strain, and a small games
catalog led to the commercial failure of the device, which was discontinued only a
year later. Nevertheless, it was the first step in the use of stereo vision and virtual
reality for entertainment purposes.
Figure 5: Nintendo's Virtual Boy
2.3: Pinhole Camera Model
The pinhole camera model [5] describes the mathematical relationship between the
coordinates of a 3D point and its projection onto the image plane of an ideal pinhole
camera, where the camera aperture is described as a point and no lenses are used to
focus light. The model does not include, for example, geometric distortions or
blurring of unfocused objects caused by lenses and finite sized apertures. It also does
not take into account that most practical cameras have only discrete image
coordinates. This means that the pinhole camera model can only be used as a first
order approximation of the mapping from a 3D scene to a 2D image. Its validity
depends on the quality of the camera and, in general, decreases from the center of the
image to the edges as lens distortion effects increase.
Figure 6: Pinhole Camera Model
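The mapping of the model reduces to similar triangles: a point (X, Y, Z) in camera coordinates projects to (f·X/Z, f·Y/Z) on the image plane. A minimal sketch of this first-order approximation (the focal length and point below are made up):

```python
import numpy as np

def pinhole_project(point_3d, focal):
    """Project a 3D point (camera coordinates) onto the image plane
    of an ideal pinhole camera: x = f*X/Z, y = f*Y/Z."""
    X, Y, Z = point_3d
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    return np.array([focal * X / Z, focal * Y / Z])

# A point 2 m in front of the camera, 0.5 m right, 0.25 m up; f = 0.035 m.
print(pinhole_project((0.5, 0.25, 2.0), focal=0.035))
# x = 0.00875 m, y = 0.004375 m on the image plane; doubling Z halves
# the projected coordinates (perspective foreshortening).
```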
2.4: Camera Resectioning/Parameters
Camera resectioning [6] is the process of estimating the parameters of a pinhole
camera model approximating the camera that produced a given photograph or video.
Usually, the pinhole camera parameters are represented in a 3 × 4 matrix called the
camera matrix. There are typically two types of camera parameters, intrinsic and
extrinsic.
Intrinsic parameters encompass focal length, image sensor format and principal point.
Those parameters are contained in the intrinsic matrix.
Extrinsic parameters denote the coordinate system transformations from 3D world
coordinates to 3D camera coordinates. Equivalently, the extrinsic parameters define
the position of the camera center and the camera's heading in world coordinates.
T is the position of the origin of the world coordinate system expressed in coordinates
of the camera-centered coordinate system (and NOT the position of the camera, as is
often mistaken). C is the position of the camera expressed in world coordinates. R is a
rotation matrix.
2.5: Epipolar geometry
Epipolar geometry [7] is the geometry of stereo vision. When two cameras view a 3D
scene from two distinct positions, there are a number of geometric relations between
the 3D points and their projections onto the 2D images that lead to constraints
between the image points. These relations are derived based on the assumption that
the cameras can be approximated by the pinhole camera model.
The figure below depicts two pinhole cameras looking at point X. In real cameras, the
image plane is actually behind the center of projection, and produces an image that is
rotated 180 degrees. Here, however, the projection problem is simplified by placing
a virtual image plane in front of the center of projection of each camera to produce an
unrotated image. OL and OR represent the centers of projection of the two
cameras. X represents the point of interest in both cameras. Points xL and xR are the
projections of point X onto the image planes.
Each camera captures a 2D image of the 3D world. This conversion from 3D to 2D is
referred to as a perspective projection and is described by the pinhole camera model.
It is common to model this projection operation by rays that emanate from the
camera, passing through its center of projection. Note that each emanating ray
corresponds to a single point in the image.
Figure 7: Epipolar Geometry Illustration
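For calibrated cameras these constraints take a compact algebraic form: with normalized image coordinates, corresponding points satisfy x′ᵀ E x = 0, where E = [t]× R is the essential matrix (the calibrated counterpart of the fundamental matrix of the next section). A numerical check with a purely hypothetical rig:

```python
import numpy as np

def skew(t):
    """Matrix form of the cross product: skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical rig: right camera shifted 10 cm along x, slightly rotated.
theta = 0.05
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.0])
E = skew(t) @ R                    # essential matrix

X = np.array([0.3, -0.2, 2.0])     # a 3D point in left-camera coordinates
x_left = X / X[2]                  # normalized homogeneous projection
X_r = R @ X + t                    # the same point in right-camera coordinates
x_right = X_r / X_r[2]

print(x_right @ E @ x_left)        # ≈ 0: the epipolar constraint holds
```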
2.6: Fundamental matrix
The fundamental matrix [8] F is a 3×3 matrix which relates corresponding points
in stereo images. In epipolar geometry, with homogeneous image coordinates, x and
x′, of corresponding points in a stereo image pair, Fx describes a line (an epipolar
line) on which the corresponding point x′ on the other image must lie. That means
that for all pairs of corresponding points, x′ᵀ F x = 0 holds.
2.7: Image rectification
Image rectification [9] is a transformation process used to project two-or-more
images onto a common image plane. This process has several degrees of freedom and
there are many strategies for transforming images to the common plane.
It is used in computer stereo vision to simplify the problem of finding matching points
between images. It uses triangulation based on epipolar geometry to determine
distance to an object. More specifically, binocular disparity is the process of relating
the depth of an object to its change in position when viewed from a different camera,
given the relative position of each camera is known.
2.8: Disparity map
Disparity [10] refers to the difference in location of an object in corresponding two
(left and right) images as seen by the left and right eye, which is created due to
parallax (the eyes' horizontal separation). The brain uses this disparity to calculate
depth information from the two dimensional images.
In computer vision, the disparity map is an image that depicts how far from the
viewing source the objects of the scene are. The depiction is based on intensities.
Brighter objects are closer to the source (they have a larger apparent distance between
the left and right image) and darker objects are further (they have a smaller apparent
distance between the left and right view).
Figure 8: A ground truth disparity map
In short, the disparity of a pixel is equal to the shift value that leads to minimum sum-
of-squared-differences for that pixel.
Disparity map calculation is essentially the result of stereo matching. This work
revolves around some of its calculation methods.
2.9: Stereo Matching
Stereo matching [11] is used for finding corresponding pixels in a pair of images,
which allows 3D reconstruction by triangulation, using the known intrinsic and
extrinsic orientation of the camera. There are two methods [12] for stereo matching,
local and global. Local methods attempt to match small regions of one image to
another based on intrinsic features of the region. Global methods supplement local
methods by considering physical constraints such as surface continuity or base-of-
support. Local methods can be further classified by whether they correlate a small
area patch among images (called correlation or area based) or match features (called
feature based).
Correlation based (or area based) stereo matching considers a certain area on the left
image (usually) and tries to find an equally sized area on the right image that is the
closest match. That area is called matching or correlation window. Since the images
are rectified, the algorithm searches only horizontally by a predefined offset. They
produce dense disparity maps. Generally a smaller matching window will give more
detail but more noise and a larger window will produce a smoother disparity map but
with less captured detail. Such algorithms are inherently fast and memory efficient, so
they are usually preferred. However, finding the optimal combination of window size
and other algorithm parameters can be challenging and requires a lot of testing.
To find a match for a pixel in the left image, the left window is drawn centered on that
pixel. It is then compared to several windows in the right image, beginning with a
window at the same location (zero disparity) and moving left in increments of one
pixel (increasing disparity by one with each move). Whichever window in the right
image gives the lowest cost is said to match the left window. The difference in x
coordinates between the center of this match and the center of the left window is the
disparity value of the pixel in question in the left image.
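The window search just described can be sketched as a naive SSD matcher (an unoptimized illustration; the window size and disparity range below are arbitrary):

```python
import numpy as np

def block_match(left, right, max_disp=16, half=2):
    """Naive SSD block matching on rectified grayscale images (2D arrays).

    For each left-image pixel, slide a (2*half+1)^2 window leftwards in
    the right image and keep the shift with the lowest SSD cost.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y-half:y+half+1, x-half:x+half+1].astype(float)
            best_cost, best_d = np.inf, 0
            for d in range(min(max_disp, x - half) + 1):
                cand = right[y-half:y+half+1, x-d-half:x-d+half+1].astype(float)
                cost = np.sum((ref - cand) ** 2)  # sum of squared differences
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

On rectified inputs where the right image equals the left image shifted a few pixels leftwards, interior pixels recover exactly that shift; real implementations add the optimizations and parameter tuning discussed above.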
Figure 9: Simple window matching illustration
Feature based stereo matching computes the corresponding pixels by using the
extracted features from the images. Features are usually chosen to be lighting and
viewpoint independent. As a result, they compensate for viewpoint changes and
camera differences. Techniques used to find the image features include edge, corner,
blob and ridge detection. Such features include
• edge elements
• corners
• line and curve segments
• circles and ellipses
• regions defined either as blobs or polygons.
Area based algorithms are simple and efficient in general. Some of these algorithms
are ideal for real-time stereo matching applications. But, as discussed earlier, it can be
challenging to produce robust matches. Feature based methods, on the other hand, can
produce fast and robust matching but usually require expensive feature extraction. Area
methods are more commonly used by researchers and this work will focus on them for
the most part.
2.10: Matching Cost
At the base of any matching algorithm is the matching cost which measures the (dis-)
similarity of a pair of locations, one in each image. Matching costs can be defined at
the pixel level or over a certain area. Common examples are absolute intensity
difference, squared intensity difference, filter-bank responses and gradient based
measures. Binary matching costs are also possible, based on binary features such as
edges.
2.11: Normalized Cross Correlation
The higher the normalized cross correlation [13] of two windows, the better they
match. Normalized cross correlation is calculated by computing the mean and
standard deviation of intensity in each window. Then, the mean intensity is subtracted
from each pixel's intensity. Corresponding values of intensity - mean intensity from
the left and right windows are multiplied together. These multiplied values are
summed over the entire window. Finally, this sum is divided by the number of pixels
in either window and divided by each standard deviation.
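The computation described above can be written out directly. This is a minimal sketch; `ncc` is a hypothetical helper name, and the two windows are assumed to be equally sized and non-constant (a constant window has zero standard deviation, so the measure is undefined for it).

```python
import numpy as np

def ncc(win_l, win_r):
    """Normalized cross correlation of two equally sized windows, following
    the steps in the text: subtract each window's mean, multiply corresponding
    values, sum, then divide by the pixel count and both standard deviations."""
    l = win_l.astype(np.float64)
    r = win_r.astype(np.float64)
    dl = l - l.mean()                  # intensity minus mean intensity (left)
    dr = r - r.mean()                  # intensity minus mean intensity (right)
    n = l.size
    return (dl * dr).sum() / (n * l.std() * r.std())
```

Identical windows give a correlation of 1, and an inverted window (an affine transform with negative slope) gives −1, which is why the measure compensates for brightness and contrast differences between cameras.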
2.12: Ground truth
Ground truth [14] is a term used in various fields to refer to the absolute truth of
something and is used to test the efficiency of an algorithm or a system. Used widely
in machine learning it denotes a set of measurements that are much more accurate
than the system being tested.
In the case of stereo vision systems the question is how well they can estimate 3D
positions. The ground truth disparity map is composed of the positions given by a
laser range finder, which is known to be much more accurate than any camera system.
Practically, it describes the perfect disparity between left and right image and it is
compared to the produced disparity map using various metrics.
2.13: Applications of Stereo Vision
Stereo vision is well known to the general public mainly by 3D movies. But that is
only a small part of a wide range of applications not only in everyday life, but also in
industry and research and even space exploration.
• Robotics, to extract information about the relative position of 3D objects in the
vicinity of autonomous systems, and object recognition, where depth information
allows the system to separate occluding image components. Such robotic
systems are primarily used in industrial applications.
• 3D displays [15] and head mounted displays, to provide stereoscopic imaging
to the human eyes. In such applications, where the goal is depth perception, the
basic requirement is to display offset images that are filtered separately to the
left and right eye.
There are two methods to accomplish that: one where the user wears
glasses to filter the offset images to each eye, and another where no glasses are
required.
In the case where glasses (or filters) are used there are two types, passive
and active filters. Passive filters do not require power and can be either color
filters or polarization filters. Active shutter filters, as their name suggests, have
active shutters to filter the image and require power.
In the glasses-free case the light source splits the images directionally into the
viewer's eyes. Such displays are called autostereoscopic. Maybe the most
famous example of such technology is the Nintendo 3DS game console.
Figure 10: Nintendo's 3DS
Finally there are the head mounted displays, where a separate display is
positioned in front of each eye and the image is projected through lenses to assist
the eye focusing. Such devices are used in a plethora of concepts like military
to provide augmented reality applications, engineering to provide stereoscopic
views of CAD schematics and of course entertainment, like 3D gaming and
movies or tours in virtual environments.
• Calculation of contour maps and geometry extraction for 3D building mapping
mainly from aerial surveys [16]. Those surveys are usually conducted with the
use of unmanned aerial vehicles, or UAVs for short. The large number of
aerial images captured is then converted into geo-referenced 2D high
resolution orthophotos, 3D surface models and point clouds by various
automated systems.
• The NASA STEREO project [17], one of the most important and large scale
projects ever. It stands for Solar Terrestrial Relations Observatory and in
essence it is a solar observation mission. Two nearly identical spacecraft were
launched in 2006 into orbits around the sun in a manner that enables
stereoscopic imaging of the sun.
Figure 11: The NASA STEREO Project
The goal is to study solar phenomena (principally coronal mass ejections:
massive bursts of solar wind, plasma and magnetic fields ejected into space
that can disrupt earth communications and power networks) on the far side of
the sun. This practically enables solid forecasts of solar activity through a 360
degree view of the sun at all times.
• Driverless cars [18] are a hot topic at the time of writing. Driverless cars, as the
name suggests, can drive to their destination without requiring human
intervention. Stereo vision is the way the car can "see" the world in front of it.
Of course, a plethora of sensors is used, including laser, ultrasonic, GPS etc.,
so a 360 degree "map" of the surrounding world can be formed by the car.
There are quite a few similarities with robot navigation in the process. Of
course this technology is not yet reliable enough, especially due to the
risk of traffic accidents in case of errors or inaccuracy.
2.14: Challenges and difficulties
The correspondence problem refers to the problem of ascertaining which parts of one
image correspond to which parts of another image, where differences are due to
movement of the camera, the elapse of time, and/or movement of objects in the
photos.
Other major pitfalls include reflections and transparency. It is usually very hard for a
machine to distinguish whether it is looking at an object or the reflection of that
object. Similarly, it is hard for a computer vision system to recognize the existence of
transparent objects between the view source and the target scene.
The third pitfall is continuous and textureless regions. It is very difficult to determine
which point on the left image corresponds to which point on the right image. Finally
there can be technical difficulties like sensor noise and calibration noise to the
cameras.
Chapter 3: Stereo Algorithms Evaluation Process
In this chapter there is a short analysis of previous work related to the topic of this
thesis. Also, there is an analysis of the state of the art evaluation method, as well as
the process followed for the thesis.
3.1: Previous Work
• A Taxonomy and Evaluation of Dense Two Frame Stereo Correspondence
Algorithms [19] by D. Scharstein and R. Szeliski. It is the state of the art
evaluation and is analyzed in the next paragraph.
• An Experimental Comparison of Stereo Algorithms by R. Szeliski and R.
Zabih [20]. In this work by Szeliski and Zabih there is an effort to compare
experimentally a few stereo vision algorithms. They make use of two stereo
pairs, the well known set from Tsukuba university and another produced by
them (a simple scene with a slanted surface). Their methodology consists of
comparison with ground truth depth maps and the measurement of novel
prediction errors.
• Review of Stereo Matching Algorithms for 3D Vision by L. Nalpantidis, G.
Sirakoulis and A. Gasteratos [21]. In this work there is a theoretical
comparison and summary of various methods. It considers both local and
global methods, computational intelligence techniques and the speed and
accuracy of those. Also, some hardware implementation techniques are
presented.
• Overview of Stereo Matching Research, by R.A.Lane and N.A. Thacker [22].
This is a literature survey of a few area and feature based methods. It includes
a short description of those methods and some conclusions drawn. It is a
relatively old paper and part of a large series of stereo vision journals.
3.2: State-of-the-Art Middlebury Evaluation
The state of the art evaluation method for stereo vision algorithms is offered by the
Middlebury College. The creators are Daniel Scharstein and Richard Szeliski and the
evaluation process is documented in their publication titled "A Taxonomy and
Evaluation of Dense Two-Frame Stereo Correspondence Algorithms" [19].
The goal of the creators of this evaluation process was to compare a large number of
methods within one common framework. For that reason they have focused on
techniques that produce a univalued disparity map.
The evaluation process is very detailed and quite complicated. In essence, there are 2
error measurements, RMS error and percentage of bad matching pixels, in 3 different
image areas.
RMS (root-mean-squared) error (measured in disparity units) between the computed
disparity map dC(x, y) and the ground truth map dT(x, y) is computed by the
following formula:

R = ( (1/N) Σ(x,y) |dC(x, y) − dT(x, y)|² )^(1/2)

Percentage of bad matching pixels is computed by this formula:

B = (1/N) Σ(x,y) ( |dC(x, y) − dT(x, y)| > δd )

where N is the total number of pixels and δd is a disparity error tolerance
(typically 1.0).
Also the images are segmented into three different areas:
• textureless regions T: regions where the squared horizontal intensity gradient
averaged over a square window of a given size is below a given threshold.
Essentially, these are areas of the scene with little to no texture.
• occluded regions O: regions where the left-to-right disparity lands at a location
with a larger (nearer) disparity. This means that an occluded region is visible
on one of the images and not visible on the other.
• depth discontinuity regions D: regions where neighboring disparities differ by
more than a predefined gap, dilated by a window of a given width. These are
practically the areas of the scene where there is a sudden change in the depth
between the objects.
These regions were selected to support the analysis of matching results in typical
problem areas.
The Middlebury College offers an online evaluation tool for computer vision
researchers to upload and test their algorithms and compare them against many others.
There are also a few datasets offered in various resolutions for testing. The online
evaluation tool utilizes 4 certain image pairs and compares the user submitted
disparity maps with the ground truth maps for the four pairs. Note that the online
evaluation tool is at version two at the time of writing this thesis.
3.3: Thesis Evaluation Process
The first step was to collect all the open source algorithms that can be found
implemented on various sources online. They were all tested and the ones producing
unusable results were discarded. The ones that gave meaningful results are analyzed
here.
The comparison of the algorithms is based partly on the state of the art evaluation,
namely the percentage of bad matching pixels (sum of absolute differences of the
disparity map and the ground truth image matrices, BadMatchPercent formula from
the previous paragraph). The focus is to find how good the algorithm has performed
in estimating the disparity map. Mean elapsed time is measured in all cases, but it is
not directly comparable, since two different tools are used and OpenCV is vastly
faster than Matlab because it is written in the C++ language. Also, elapsed time depends
on more factors, like the parameters of each algorithm, image size and of course the
hardware it is running on. The algorithms presented here were tested on an Intel
Celeron G1620 CPU with 4GB of RAM and an AMD HD6450 GPU.
A pixel by pixel subtraction is conducted between the result and the ground truth
image matrices. There is a 30 pixel margin on all sides to eliminate empty image
borders, since some algorithms produce disparity maps with black borders that could
lower the score for no reason. Any result that is larger than the predefined threshold
(which is around 1.0 traditionally) is considered a bad matching pixel. The threshold
is the same for all images so the comparison is fair. When a bad matching pixel is
found it is added to the previous sum. In the end the sum is divided by the total
number of pixels. The final result is a percentage of the bad matching pixels in the
disparity map.
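The author's implementation of this metric was written in Matlab; the following Python sketch mirrors the described steps (a 30-pixel margin on all sides, a threshold of about 1.0, and the fraction of bad pixels) under those assumptions. The function name and default values are illustrative only.

```python
import numpy as np

def bad_match_percent(disp, truth, margin=30, threshold=1.0):
    """Fraction of bad matching pixels between a computed disparity map and
    the ground truth, ignoring a border margin on all sides (the thesis uses
    30 pixels to discard empty image borders)."""
    d = disp[margin:-margin, margin:-margin].astype(np.float64)
    t = truth[margin:-margin, margin:-margin].astype(np.float64)
    bad = np.abs(d - t) > threshold      # pixel-by-pixel comparison
    return bad.sum() / bad.size          # sum of bad pixels over total pixels
```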
The images used are the widely popular stereo datasets from Middlebury College.
They contain several right and left views of the same scene, as well as a ground truth
image for evaluation. As mentioned before, the Middlebury online evaluation
platform uses 4 standard image sets, a total of 8 images (cones, teddy, tsukuba,
venus). Those images along with their ground truth will be used. Note that those four
image sets are used in the second version of the online evaluation tool of Middlebury,
which is still online at the time of writing. Version three should be online soon after
this work is completed and it will use different image sets for the evaluation of
algorithms.
Figure 12: Cones, teddy, tsukuba and venus left view
The method described above is summed up in a mathematical equation, and the code
that implements it was written by the author of this thesis. The platform chosen for the
comparison is Matlab, due to the simplicity of matrix operations.
Chapter 4: Testing and Comparison of Stereo Algorithms
4.1: Intro
Stereo vision is still a popular topic when it comes to research. It is very active and
there is a large number of algorithms being evaluated in the Middlebury platform.
Unfortunately, finding implementations of the various algorithms can be difficult
because communication with their creators is rarely successful and many of them
refuse to help. Furthermore, available implementations most of the time do not
function as expected and produce unusable results.
In this work the algorithms compared can be found implemented on the internet and
their implementation is correct. That means, it gives satisfactory results not only by
visual examination but also in comparison with the ground truth disparity maps. All
the methods described here give a bad matching pixel percentage of less than 50%.
4.2: Common Algorithm Parameters
Each stereo algorithm is unique and features a certain number of inputs and
parameters. But there are some common parameters among the ones analyzed in this
work that apply also to the majority of existent stereo algorithms.
As expected, the input to all the algorithms is the stereo pair, traditionally left-right
views of the scene in that order. Some algorithms accept as input the image matrix
(RGB or grayscale) and others accept plain images reading the matrix afterwards. The
rest of the algorithm inputs are actually the parameters.
The first parameter is the window size, used when window/block matching is
employed to search for similarities. A smaller window size usually means a more
detailed but coarse (noisy) disparity map. A larger window size gives a smoother
disparity map overall, but with less detail captured. Of course this parameter should
be an odd number, since there is always a "center" pixel in the matching window.
The second parameter is the disparity range: the minimum and maximum disparity
values between which the matching algorithm will search for similarities between the
blocks of the image pair. Disparity values outside that range will be
ignored. The minimum disparity value can be a negative number. In the case of
Middlebury test images the minimum value is always 0 and the maximum varies
depending on the image.
4.3: Semi Global (Block) Matching
4.3.1: Algorithm Overview
Semi-Global (Block) Matching [23] successfully combines concepts of global and
local stereo methods for accurate, pixel-wise matching at low runtime. This is
probably the most popular algorithm for stereo matching. It has spawned many other
algorithms and has been widely used by stereo vision researchers. As it is evident
from the results in this work, it gives results with a relatively high number of bad
matching pixels and is surpassed by other algorithms. Despite that, it is fast and very
effective for real time stereo applications since the number of bad matches in its
output is not prohibitive.
The core algorithm considers pairs of images with known intrinsic and extrinsic
orientation. The method has been implemented for rectified and unrectified images. In
the latter case, epipolar lines are effectively computed and followed explicitly while
matching. Of course in this work only rectified images are used (with known epipolar
geometry).
The whole method is based on the idea of pixelwise matching of mutual information
and approximating a global, two-dimensional smoothness constraint by combining
many one-dimensional constraints. In a nutshell, the main algorithm has the
following processing steps: 1) Pixelwise cost calculation 2) Implementation of the
smoothness constraint 3) disparity computation with sub-pixel accuracy and occlusion
detection.
4.3.2: Pixelwise Cost Calculation
In step 1 (pixelwise cost calculation) the matching cost is calculated for a base image
pixel (the left one usually) from its intensity and the suspected correspondence of the
match image. An important aspect is the size and shape of the area that is considered
for matching. The robustness of matching is increased with large areas.
One way to perform pixelwise cost calculation is to use the Birchfield-Tomasi
subpixel metric. The cost is calculated as the minimum absolute difference of
intensities in the range of half a pixel in each direction along the epipolar
line.
Another way to calculate the pixelwise cost is based on mutual information
(abbreviated as MI) which is insensitive to recording and illumination changes. It is
defined as the sum of the entropies of the two images minus their joint entropy
according to the following formula:
MI(I1, I2) = H(I1) + H(I2) − H(I1, I2)
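As an illustration of this formula, mutual information can be estimated from intensity histograms. This is a rough sketch only, not the Taylor-expansion-based approximation used in Hirschmüller's algorithm; the function names and the bin count are arbitrary choices for the example.

```python
import numpy as np

def entropy(hist):
    """Shannon entropy (in bits) of a histogram of counts."""
    p = hist / hist.sum()
    p = p[p > 0]                         # ignore empty bins (0*log 0 = 0)
    return -(p * np.log2(p)).sum()

def mutual_information(i1, i2, bins=32):
    """MI(I1, I2) = H(I1) + H(I2) - H(I1, I2), estimated from intensity
    histograms of two equally sized images."""
    h1, _ = np.histogram(i1, bins=bins, range=(0, 256))
    h2, _ = np.histogram(i2, bins=bins, range=(0, 256))
    h12, _, _ = np.histogram2d(i1.ravel(), i2.ravel(), bins=bins,
                               range=[[0, 256], [0, 256]])
    return entropy(h1) + entropy(h2) - entropy(h12)
```

An image shares maximal information with itself, while two unrelated noise images share almost none, which is the property that makes MI robust to illumination changes between the two views.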
H. Hirschmüller in his work favors the Mutual Information approach, contrary to the
OpenCV implementation that uses Birchfield-Tomasi.
4.3.3: Aggregation of Costs
Pixelwise cost calculation is generally ambiguous since wrong matches can easily
have a lower cost than correct matches due to factors like noise etc. Therefore, an
additional constraint is added that supports smoothness by penalizing changes to
neighboring disparities.
A global, 2D smoothness constraint is approximated by combining several 1D
constraints.
Figure 13: SGBM matching costs aggregation
The matching costs in 1D are aggregated from all eight directions equally as
illustrated on the figure above. The aggregated (or smoothed) cost for a pixel p and
disparity d is calculated by summing the costs of all 1D minimum cost paths that end
in pixel p at disparity d.
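For one direction (say, left to right along a scanline) the aggregation can be sketched with the standard SGM recurrence. This is a minimal illustration: the penalty values P1 (small disparity change) and P2 (larger change) are placeholders, and the full algorithm would compute such a term for all eight directions and sum them.

```python
import numpy as np

def aggregate_scanline(cost, P1=8.0, P2=32.0):
    """Aggregate a pixelwise cost slice of shape (width, ndisp) along one 1D
    direction (left to right) using the SGM recurrence:
      L(p,d) = C(p,d) + min(L(p-1,d), L(p-1,d-1)+P1, L(p-1,d+1)+P1,
                            min_k L(p-1,k)+P2) - min_k L(p-1,k)."""
    w, nd = cost.shape
    L = np.empty((w, nd), dtype=np.float64)
    L[0] = cost[0]                       # no predecessor at the first pixel
    for x in range(1, w):
        prev = L[x - 1]
        mins = prev.min()
        up = np.empty(nd);  up[1:] = prev[:-1] + P1;  up[0] = np.inf
        down = np.empty(nd); down[:-1] = prev[1:] + P1; down[-1] = np.inf
        L[x] = cost[x] + np.minimum.reduce(
            [prev, up, down, np.full(nd, mins + P2)]) - mins
    return L
```

Subtracting min_k L(p−1, k) at every step keeps the accumulated values bounded, which is what makes the fixed-point implementations of SGM memory efficient.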
4.3.4: Disparity Computation
The disparity image D that corresponds to the reference image I is determined as in
local stereo methods by selecting for each pixel p the disparity d that corresponds to
the minimum cost.
For sub pixel estimation, a quadratic curve is fitted through the neighboring costs and
the position of the minimum is calculated.
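The quadratic fit reduces to a closed-form expression over the three costs around the integer minimum. A sketch, with hypothetical function and argument names:

```python
def subpixel_disparity(d, c_prev, c_min, c_next):
    """Refine an integer disparity d by fitting a parabola through the costs
    at d-1, d and d+1 and returning the position of the parabola's minimum."""
    denom = c_prev - 2.0 * c_min + c_next
    if denom <= 0:                # flat or degenerate cost: keep integer value
        return float(d)
    return d + (c_prev - c_next) / (2.0 * denom)
```

The correction term always lies within plus or minus half a pixel of the integer minimum, since the center cost is the smallest of the three.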
4.3.5: Implementation Details
SGBM is implemented in OpenCV and is embedded in the library. It is also included
in Matlab since version 2011b. The Matlab implementation did not produce usable
results despite extensive experimentation. Consequently, only the OpenCV
version will be used.
OpenCV uses a modified version of the original Hirschmuller algorithm. Contrary to
the original algorithm that considers 8 directions, this one considers only 5 (single
pass). Also, this variation matches blocks, not individual pixels, hence the Semi
Global Block Matching name. The parameters of this modified version can be tuned
so that the algorithm behaves like the original one.
Also, the mutual information cost function is not implemented. Instead, the simpler
Birchfield-Tomasi sub-pixel metric is used. Finally, some pre- and post-processing
steps from the Konolige Block Matching implementation are included, for example
pre- and post-filtering. This is evident from the few identical parameters between the
two algorithms.
The OpenCV SGBM implementation features the common parameters and a few
more that are listed here (the OpenCV documentation is insufficient so the
explanation is based on experimentation):
• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].
• uniquenessRatio: Computed disparity d* is accepted only if
SAD(d)>=SAD(d*)*(1+uniquenessRatio/100) for any d!=d+/-1.
• speckleRange, speckleWindowSize: Parameters of the OpenCV function
filterSpeckles which is used to post process the disparity map. It replaces
blobs of similar disparities (the difference of two adjacent values does not
exceed speckleRange) whose size is less or equal to speckleWindowSize (the
number of pixels forming the blob) by the invalid disparity value.
• disp12MaxDiff: A left-right check is performed. Pixels are matched from left
to right image and then from the right back to the left. The disparity value is
accepted only if the distance of the first match and the distance of the second
match have maximum difference of disp12MaxDiff.
• fullDP: If set to true, the algorithm considers eight directions instead of five
(like the original) but with higher memory consumption.
• P1: Penalty for small disparity changes.
• P2: Penalty for higher disparity changes.
It should also be noted that the disparity range consists of two parameters,
minDisparity and numberofDisparities. The first value is the minimum disparity for
the search window. The second shows the maximum difference from the minimum
disparity. It works the same way as with the next algorithm, Block Matching.
4.3.6: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are
the following:
Figure 14: SGBM Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following:

          Cones     Teddy     Tsukuba   Venus     Avg time
SGBM      0.4943    0.4977    0.3923    0.4982    ~0.065 sec
4.4: Block matching
4.4.1: Algorithm overview
This method is based on the block matching algorithm. It is mainly used in video
frames for motion estimation, but its principles can apply successfully to stereo
matching also.
The block matching algorithm [24] involves dividing the current frame of video into
'macro blocks' and comparing each of the macro-blocks with the corresponding block
and its adjacent neighbors in the previous frame of the video. A vector is created that
captures the movement of a macro-block from one location to another in the previous
frame. This movement, calculated for all the macro blocks comprising a frame,
constitutes the motion estimated in the current frame.
The search area for a good macro-block match is determined by the 'search
parameter' p, where p is the number of pixels on all four sides of the corresponding
macro-block in the previous frame. The search parameter is a measure of motion: the
larger the value of p, the larger the motion that can be captured, but the search
becomes computationally expensive. Usually the macro-block is taken to be of size
16 pixels and the search parameter is set to 7 pixels.
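An exhaustive-search version of this scheme can be sketched as follows; the function name is hypothetical and the defaults follow the typical values quoted above (16-pixel blocks, p = 7).

```python
import numpy as np

def best_match(cur, prev, top, left, block=16, p=7):
    """Exhaustive block matching: compare one macro-block of the current
    frame against every block of the previous frame displaced by up to p
    pixels on each side, returning the motion vector with the lowest SAD."""
    b = cur[top:top+block, left:left+block].astype(np.int32)
    h, w = prev.shape
    best_cost, best_v = np.inf, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > h or x + block > w:
                continue                 # candidate block falls off the frame
            cand = prev[y:y+block, x:x+block].astype(np.int32)
            sad = np.abs(b - cand).sum()
            if sad < best_cost:
                best_cost, best_v = sad, (dy, dx)
    return best_v
```

In the stereo case the "previous frame" is the second image of the rectified pair, so the vertical component of the search collapses and only a horizontal displacement (the disparity) remains.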
4.4.2: Algorithm Analysis
The tested implementation was submitted to OpenCV library by Kurt Konolige and is
partly based on his work Small Vision Systems: Hardware and Implementation [12].
The paper revolves around the Small Vision Module or SVM, a compact, inexpensive
real-time device for computing dense stereo range images.
In the case of stereo matching, the adjacent neighbor is the second image of the stereo
pair.
Figure 15: Block Matching Algorithm for SVM
The algorithm that is implemented here has the following features:
• Laplacian of Gaussian transform (LOG for short), L1 norm (absolute
difference) correlation.
• Variable disparity search in pixel unit.
• Postfiltering with an interest operator and left/right check.
• x4 range interpolation.
The LOG transform and L1 norm were chosen because they give good quality results
and can be optimized on standard instruction sets available on DSPs and
microprocessors.
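As an illustration, a LoG kernel and the L1 correlation cost can be built in a few lines of NumPy. The kernel size and sigma here are arbitrary choices for the sketch, not the values used in the SVM implementation.

```python
import numpy as np

def log_kernel(size=9, sigma=1.4):
    """Laplacian of Gaussian kernel, the prefilter applied before the L1
    (absolute difference) correlation."""
    half = size // 2
    y, x = np.mgrid[-half:half+1, -half:half+1]
    r2 = x**2 + y**2
    k = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return k - k.mean()        # zero-mean, so flat (textureless) regions map to 0

def l1_cost(a, b):
    """L1 norm (sum of absolute differences) between two filtered patches."""
    return np.abs(a.astype(np.float64) - b.astype(np.float64)).sum()
```

The zero-mean prefilter removes local brightness offsets between the two cameras, which is what lets the cheap L1 correlation still give good-quality matches on DSP-class hardware.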
The following images are copied directly from the paper and help in the explanation
of the algorithm. The disparity maps are green on the paper but they are converted to
grayscale here for uniformity reasons (this work examines grayscale disparity maps).
Figure 16: Block Matching Sample Images
Image (a) shows the grayscale input image. Image (b) depicts the typical disparity
map produced by the algorithm. Brighter areas indicate higher disparities (closer
objects) while darker areas indicate lower disparities (further objects). There are 64
possible levels of disparity in total. In image (b) the highest level is around 40 while
the lowest is about 5. It is obvious that there is significant error in the upper left and
right corners of the image. That is due to the uniform areas without enough texture to
determine the disparity.
In figure (c) the interest operator is applied as a post filter. Areas with insufficient
texture are rejected and appear black in the produced image. Even after using this
filter, some errors still remain in portions of the image with disparity discontinuities,
in this case the side of the person's head. Those errors are caused by the correlation
window overlapping areas with very different disparities.
One way to eliminate those errors is by applying left/right check. The left/right check
can be implemented efficiently by storing enough information when doing the
original disparity correlation. As the author concludes, the combination of interest
operator and left/right check has proven to be the most effective at eliminating bad
matches. As mentioned by the author, correlation surface checks were not used, since
they do not add to the quality of the range image and can be computationally
expensive.
As mentioned earlier, the algorithm described in the paper was intended to be used
with the Small Vision Module, which is a small programmable device with limited
resources, so it was designed with storage efficiency in mind.
4.4.3: Implementation Details
This algorithm is implemented as part of the OpenCV library. It is also
part of Matlab 2011b onwards. The Matlab implementation, strangely, gave no usable
results even after extensive experimentation with the parameters (similarly to
SGBM), so only the OpenCV version is presented here.
The inputs and parameters include the common ones and some algorithm-specific ones.
The OpenCV documentation does not sufficiently explain the parameters, so the analysis
here is based mainly on experimentation. Most of them are optional and only the ones
used for the testing are analyzed. The disparity range is actually two parameters, one
for the minimum disparity and one for the number of disparities (minDisparity and
numberOfDisparities respectively; the final disparity range is [minDisparity,
minDisparity+numberOfDisparities]). The rest of the parameters are the following:
• preFilterSize: Window size of the prefilter.
• preFilterCap: Clips the output to [-preFilterCap, preFilterCap].
• textureThreshold: Calculates the disparity only at locations where the texture
is larger than or equal to this threshold.
• uniquenessRatio: Computed disparity d* is accepted only if
SAD(d) >= SAD(d*)*(1+uniquenessRatio/100) for any d != d*+/-1.
• speckleRange, speckleWindowSize: Parameters of the OpenCV function
filterSpeckles which is used to post process the disparity map. It replaces
blobs of similar disparities (the difference of two adjacent values does not
exceed speckleRange) whose size is less or equal to speckleWindowSize (the
number of pixels forming the blob) by the invalid disparity value.
• disp12MaxDiff: A left-right check is performed. Pixels are matched from the left
to the right image and then from the right back to the left. The disparity value is
accepted only if the distance of the first match and the distance of the second
match differ by at most disp12MaxDiff.
4.4.4: Testing and Results
Produced disparity maps (cones, teddy, tsukuba, venus respectively):
Figure 17: Block Matching Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following:
Cones Teddy Tsukuba Venus Avg time
BM 0.4450 0.4583 0.3556 0.4707 ~0.028 sec
4.5: Loopy belief propagation
4.5.1: Overview
This method [25] by Ngia Kien Ho focuses on solving the stereo problem using
Markov Random Fields and Loopy Belief Propagation. The method is heavy on
mathematics and quite complicated. The creator offers extensive analysis of the
algorithm as well as an OpenCV implementation on his website.
4.5.2: Markov Random Fields
Markov Random Fields (abbreviated as MRF) are undirected graphical models that
can encode spatial dependencies. They consist of nodes and links like all graphical
models, but can also feature cycles/loops. Given a 3x3 image, the stereo problem can be
modeled using MRF as follows:
Figure 18: Markov Random Field illustration
The blue nodes are observed variables and represent pixel intensity values in this
work. The pink nodes are the hidden variables and represent the unknown disparity
value. The hidden variable values are referred to as labels. The links between the
nodes represent a dependency. For example, the center node depends only on the four
nodes it is connected to. This rather strong assumption, that each node depends only
on the nodes it is connected to, is called the Markov assumption.
4.5.3: MRF Formulation
The stereo problem can be formulated in terms of MRF as the following energy
function:

E(Y, X) = Σi DataCost(yi, xi) + Σi Σj∈N(i) SmoothnessCost(xi, xj)
Where Y is the observed node, X is the hidden node, i is the pixel index and j are the
neighboring nodes of node xi (see above diagram).
This energy function sums up all the costs at each link given an image Y and a
labeling X. The aim is to find a labeling for X that produces the lowest energy. This is
essentially the disparity map. The energy function contains two other functions,
DataCost and SmoothnessCost.
4.5.4: DataCost
The DataCost function returns the cost/penalty of assigning a label value of xi to data
yi. Good matches require a low cost and bad matches a high cost. Usually, sum of
absolute differences or sum of squared differences are ideal to serve as cost metrics.
Practically, the function calculates the SAD (or any other metric chosen) between
blocks (or even single pixels) in the two images of the stereo pair, taking into account
the different tested disparity values. The following pseudo code illustrates all this:
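A single-pixel version of that computation might look like the following sketch (absolute difference per pixel; a real implementation would sum the differences over a block):

```python
import numpy as np

def data_cost(left, right, num_labels):
    """cost[i, j, d] = penalty of assigning disparity label d to pixel (i, j).

    Single-pixel absolute difference; out-of-range positions get the
    worst possible cost so they are never preferred.
    """
    h, w = left.shape
    cost = np.full((h, w, num_labels), 255.0)
    for d in range(num_labels):
        for i in range(h):
            for j in range(d, w):
                # the matching position moves leftwards on the right image
                cost[i, j, d] = abs(float(left[i, j]) - float(right[i, j - d]))
    return cost
```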
Naturally, the direction of the matching window on the right image depends on the
stereo pair.
4.5.5: SmoothnessCost
The SmoothnessCost function enforces smooth labeling across adjacent hidden nodes.
To achieve that, a function that penalizes adjacent labels that are different is needed.
The following table shows some commonly used cost functions.
Figure 19: Various cost functions
The Potts model is a binary penalizing function with a single tunable lambda (λ)
variable. This value controls how much smoothing is applied. The linear and
quadratic models have the extra parameter K which is a truncation value that caps the
max penalty.
As the creator of the method comments, the choice of DataCost and SmoothnessCost
functions is not clear-cut and should be based on experimentation.
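The three models can be written compactly as follows (a sketch; lam and K stand for the λ and K parameters above):

```python
def potts(a, b, lam):
    # binary penalty: 0 if the two labels agree, lambda otherwise
    return 0 if a == b else lam

def truncated_linear(a, b, lam, K):
    # penalty grows linearly with the label difference, capped at lam * K
    return lam * min(abs(a - b), K)

def truncated_quadratic(a, b, lam, K):
    # penalty grows quadratically with the label difference, capped at lam * K
    return lam * min((a - b) ** 2, K)
```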
4.5.6: Loopy Belief Propagation main part
When the DataCost and SmoothnessCost functions have been chosen and the
parameters tuned, the next step is to solve the energy function. Trying all possible
combinations (brute force) is computationally intractable, so finding an exact
solution should not be expected. Instead, finding an approximate solution is a more
viable approach.
The Loopy Belief Propagation (LBP) algorithm was chosen among others (Graph Cut,
ICM etc) to find an approximate solution for the MRF. The original Belief
Propagation algorithm [26] was proposed by Pearl in 1982 for finding exact marginals
on trees. Trees are essentially graphs that contain no loops, but as it turned out the
same algorithm can successfully be applied to general graphs that contain loops. The
word “loopy” in the name originates from there.
LBP is a message passing algorithm. A node passes a message to an adjacent node
only when it has received all incoming messages, excluding the message from the
destination node to itself. The following figure illustrates the process:
Figure 20: LBP message passing
Node x1 wants to send a message to x2. So it waits for messages from all other nodes
(A, B, C, D) before sending it. As explained earlier, it will not send the message from
x2 to x1 back to x2. Node x1 maintains all possible beliefs about node x2. The choice
of using cost/penalty or probabilities is dependent on the choice of the MRF energy
formulation.
This pseudo code can illustrate the process discussed above. The first step is always
the initialization of the messages. As mentioned earlier, each node has to wait for all
incoming messages before sending its message to the target node. This means that at
the start of the algorithm, each node will wait forever and receive nothing, so no
message can be sent from it. To overcome that problem all messages are initialized to
some constant so the algorithm can proceed. The initialization is typically 0 or 1.
The main part of LBP is iterative. By adjusting the respective parameters, the
algorithm can run for a chosen number of iterations or until the change in energy
drops below a threshold. For each iteration, messages are passed around the MRF.
The passing scheme is arbitrary and any sequence is valid (the algorithm creator
chooses right, left, up and down). As it is mentioned, different sequences will produce
different results.
Once the LBP iteration completes, the best label at every pixel can be found by
calculating its belief using the following formula, where msg is the message sent to
node i from k with label l:

belief(xi = l) = DataCost(yi, l) + Σk∈N(i) msgk→i(l)

The label l with the lowest belief is selected.
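In a min-sum formulation, the message a node sends can be sketched like this (assuming the DataCost and SmoothnessCost tables from the previous sections; a hypothetical helper, not the author's code):

```python
import numpy as np

def send_message(data_cost_i, smoothness, incoming):
    """Message from node i to one neighbor, with one entry per label.

    data_cost_i : (L,) data costs at node i
    smoothness  : (L, L) smoothness cost between every label pair
    incoming    : non-empty list of (L,) messages from i's OTHER neighbors
                  (the destination node is excluded, as described above)
    msg[l] = min over l' of DataCost(l') + Smoothness(l, l') + sum of incoming[l']
    """
    total = data_cost_i + np.sum(incoming, axis=0)
    # minimize over the sender's label l' for every receiver label l
    return np.min(smoothness + total[None, :], axis=1)
```

A full LBP pass calls this for every node and direction on each iteration, then sums the data cost and all incoming messages at each node to obtain its beliefs.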
4.5.7: Implementation Details
As mentioned earlier, there is an OpenCV implementation available at the author’s
website. The common parameters and a few more are featured. The disparity range is
called labels and the window size is controlled by the variable wradius, which accepts
even numbers. Afterwards one is added to the selected number so the window size is
odd, as it should be. The rest of the parameters are the following:
• BP_ITERATIONS: An integer that defines how many iterations/loops the
algorithm will run for.
• LAMBDA: This value controls how much smoothing is applied in the
SmoothnessCost function.
• SMOOTHNESS_TRUNC: Truncation value for the truncated linear model
that is used in the implementation.
4.5.8: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are
the following (note that the algorithm ran for 5 loops):
Figure 21: Loopy Belief Propagation Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following (note that the time is measured for 5 loops):
Cones Teddy Tsukuba Venus Avg time
LoopyBP 0.0953 0.4206 0.0410 0.4203 ~155.88 sec
4.6: Fast stereo matching and disparity estimation
4.6.1: Overview
This method is based on the paper "A Hybrid Algorithm for Disparity Calculation
from Sparse Disparity Estimates Based on Stereo Vision" [27].
This excellent work proposes a hybrid method for disparity estimation by combining
the existing methods of block based and region based stereo matching. It utilizes
image segmentation through K-Means clustering, morphological filtering and
connected component analysis, SAD cost function and disparity map reconstruction.
The process is very clearly documented by the authors and will be analyzed here step
by step. The following diagram depicts an overview of the whole algorithm.
Figure 22: Fast stereo Matching and Disparity Estimation
4.6.2: Algorithm Analysis
The first step is color conversion from RGB color to Lab color. The majority of
imaging equipment captures images in RGB format. This format, though, does not
properly approximate human vision. To overcome that difficulty the Lab color space
was developed to better approximate human vision. The lightness component,
abbreviated L, closely matches the human perception of lightness and is widely used
by image processing algorithms. This algorithm only retains the L values of the pixels
for further processing.
Step two is image segmentation. It is performed on the L values of the left image
pixels using a fast implementation of the K-Means algorithm. The image pixels are
represented using a one-dimensional feature, namely a vector containing the L value for
each pixel. Next a histogram of the L values is built and used instead of the actual
pixel values for the subsequent iterations of the K-Means clustering. A histogram has
a smaller fixed number of bins than the actual pixels thus the runtime is significantly
reduced.
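The histogram trick can be sketched as follows: cluster the at most 256 occupied histogram bins instead of every pixel, then map each pixel to the cluster of its bin (an illustrative sketch, not the authors' fast K-Means implementation):

```python
import numpy as np

def kmeans_histogram(L, k, iters=20, seed=0):
    """Cluster 8-bit lightness values via their histogram instead of raw pixels.

    Assumes the image contains at least k distinct intensity values.
    """
    hist = np.bincount(L.ravel(), minlength=256).astype(np.float64)
    bins = np.arange(256, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = rng.choice(bins[hist > 0], size=k, replace=False)
    for _ in range(iters):
        # assign every bin (not every pixel) to its nearest center
        assign = np.argmin(np.abs(bins[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            m = assign == c
            if hist[m].sum() > 0:
                # weighted mean of the member bins, weights = pixel counts
                centers[c] = np.average(bins[m], weights=hist[m])
    # map each pixel to the cluster of its bin
    return assign[L]
```

Because the inner loops run over 256 bins rather than every pixel, the per-iteration cost is independent of the image size.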
Step three is segment boundary detection and refinement. Segment boundary
detection is achieved by comparing the cluster assignment of each pixel with that of
its 8 neighboring pixels. If any of them is found to be different, the pixel is marked as
one (belongs to a segment boundary), or else it is marked as zero. Thus, the boundary
map is generated from the segmented left image. Since the clustering in step two is
based only on the pixels' lightness values there are limitations in the accuracy of the
said clustering. Consequently, many pixels can be falsely identified as belonging to
segment boundaries. To overcome that, the authors of this work apply two
morphological filters to refine the boundary map by removing such noisy pixels.
There are two types of morphological filters, Fill and Remove. Fill isolates interior
zero pixels that are surrounded by ones and sets them to one as well. Remove sets a
pixel to zero if all of its four-connected neighbors are one, leaving only the
boundary pixels on. Furthermore, they use connected components analysis and
remove small artefacts in the boundary map due to segmentation errors. If disparity is
calculated for those artefacts, it will most probably be false. Finally the smallest
connected components that contribute about 4% of the total number of pixels are
removed.
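The boundary-detection rule of step three can be sketched as follows (a pixel is marked one if any of its 8 neighbors carries a different cluster label; np.roll wraps at the image borders, which is acceptable for a sketch):

```python
import numpy as np

def boundary_map(labels):
    """Mark pixels whose cluster label differs from any of its 8 neighbors."""
    h, w = labels.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # skip the pixel itself
            # compare against the neighbor offset by (dy, dx); wraps at borders
            shifted = np.roll(np.roll(labels, dy, axis=0), dx, axis=1)
            out |= (labels != shifted).astype(np.uint8)
    return out
```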
Step four is disparity calculation of the boundary map. The well known SAD (Sum of
Absolute Differences) cost function is used to determine only the disparities of the
boundary pixels, using the L values of the left and right image pixels. A partial
disparity map is built, considering the sparse disparity measurements.
The fifth and final step is disparity map reconstruction from boundaries. The
algorithm scans through each row of the partial disparity map and computes the
remaining disparities based on the ones that have already been calculated. It operates
in two stages:
-Disparity propagation ('fill' stage): In this first stage the disparity map is scanned
row-wise, left to right. Whenever two boundary pixels with identical disparity values
are encountered, the intermediate pixels of that row (aka the pixels between the
boundaries with the same value) are 'filled' with that disparity value. An exception is
made near the left and right end of each row. The left and right ends of each row are
filled with the disparity value of the nearest border pixel until a boundary pixel is
encountered.
-Estimation from known disparities ('Peek' stage): In the second stage the algorithm
searches for the pixels whose disparity is not determined yet and estimates it based on
the disparity value of their neighboring pixels. When such a pixel is found, the known
disparities of its neighbors are stored in an array and the unknown disparity is
computed using statistical analysis.
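The 'fill' stage can be sketched as a row-wise scan (illustrative; the special handling of the row ends described above is omitted for brevity):

```python
import numpy as np

def fill_stage(partial):
    """Propagate disparities between boundary pixels carrying equal values.

    partial : 2-D array, NaN where the disparity is still unknown.
    Whenever two consecutive known pixels in a row carry the same disparity,
    the unknown pixels between them are filled with that value.
    """
    out = partial.copy()
    for row in out:
        known = np.flatnonzero(~np.isnan(row))
        for a, b in zip(known[:-1], known[1:]):
            if row[a] == row[b]:  # matching boundary disparities
                row[a:b] = row[a]
    return out
```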
4.6.3: Implementation Details
The algorithm input parameters, except the common ones, are:
• K representing the number of intensity based clusters for K-Means clustering
in the second step.
• Disp_scale representing the factor by which calculated disparities will be
multiplied. Its value should be such that max_disparity * disp_scale <= 255.
It should be noted that all the parameters except the image pair are optional. A
random value will be used if they are not specified and the results might not be
optimal.
4.6.4: Testing and Results
The produced disparity maps for Cones, Teddy, Tsukuba and Venus image pairs are
the following:
Figure 23: Fast stereo Matching and Disparity Estimation Results
The percentage of bad matching pixels according to the Sum of Absolute Differences
metric is the following:
Cones Teddy Tsukuba Venus Avg time
FSM 0.0770 0.1338 0.0320 0.1202 ~10.14 sec
4.7: Probability-Based Rendering for View Synthesis
4.7.1: Algorithm Overview
The main objective of this work [28] is to synthesize a virtual view, given two
reference images, without deterministic correspondences. The first challenge that
occurred was to construct the probability of all probable matching points. The second
was to render an intermediate view using a set of all matching candidate points with
the probability.
To address the aforementioned challenges the authors of the paper presented the
probability-based rendering (PBR) approach that robustly reconstructs an intermediate
view with the steady-state matching probability (SSMP) density function.
SSMP: In this particular work the matching cost, typically referred to as a cost
volume in the correspondence matching literature, is re-defined as the probability of
being matched between points, enabling random walk with restart (RWR) to be
applied to optimize the matching probability. The RWR uses edge weights between
neighboring pixels to enhance the matching probability similar to aggregation
methods for local stereo matching.
PBR: The rendering process is re-formulated as an image fusion, so that all probable
matching points represented by the SSMP can be considered together. This approach
has a couple of significant advantages. First, it suppresses flicker artifacts.
Second, the intermediate view is free from the hole-filling problem, since the SSMP
considers all positions of probable matching points.
4.7.2: SSMP with RWR
First of all, the SSMP is defined. Two images are assumed, left and right. The
probability p measures how likely Il(m1,m2) (a point on the left image) is to be
matched to Ir(m1-d,m2) (a point on the right image with disparity d) or the opposite.
Also, the probability is inversely proportional to the cost, since a smaller matching cost
means a higher matching probability. The above can be summarized in the following
formulas:

p0(m,d) = exp(-e0(m,d)) / Z(m)

Where p0 is an initial calculated matching probability based on an initial matching
cost e0. Z(m) represents a normalization term. The variable m denotes coordinates m1
and m2 and d denotes the disparity.
Next step is SSMP estimation using RWR. The random walk has been widely used to
optimize probabilistic problems, as the authors suggest. A random walker iteratively
transits to its neighboring points according to an edge weight. Also, the random
walker goes back to the initial position with a restarting probability a (0<=a<=1) at
each iteration. A matching probability in the SSMP can be obtained by the RWR in an
iterative fashion as follows:

pt+1(m,d) = (1 - a) Σn∈Nm w(m,n) pt(n,d) + a p0(m,d)
Where Nm denotes the four-neighborhood of a reference pixel m. Note that the above
formula becomes the random walk when the restarting probability is zero. With an
assumption that neighboring pixels tend to have similar matching probability when
the range distance between the reference pixel m and its neighboring pixel n is small,
an edge weight w(m,n) is computed by the following formula:

w(m,n) = exp( -||I(m) - I(n)||2^2 / γ )

where γ represents the bandwidth parameter, typically set to the intensity variance,
and || . ||2 denotes the l2 norm. Then a steady state solution ps(m,d), which is referred
to as the SSMP in this work, can be obtained by iteratively updating pt+1(m,d) until
pt+1(m,d)=pt(m,d).
According to the authors, this work presents significant advantages. First of all, it
does not require specifying a window size for reliable matching, contrary to the
conventional methods, due to the small number of adjacent neighbors. Second, there
is no need to specify the number of iterations, since it gives a non-trivial solution in
the steady state. Third, this method gives the optimal solution for the given energy
functional.
4.7.3: PBR with SSMP
Now the two reference images and the sets of their corresponding SSMPs are given.
The rendering process is cast into the probabilistic image fusion. A baseline between
the left and right cameras is assumed to be normalized to 1. Beta (β) denotes the
location of a virtual camera, where 0<=β<=1. Also, Pl(m,d) and Pr(m,d) encode the
matching probability of a pixel on the left and right image (Iul(m,d) and Iur(m,d)
respectively) as follows:

where Zl(m) and Zr(m) are:

And < . > represents a rounding operator. The virtual view is then synthesized via an
image fusion process. Specifically, a probabilistic average, El(Il(m)) and Er(Ir(m)), for
the two reference images is computed with the corresponding probability Pl(m,d) and
Pr(m,d) and the textures Iul(m,d) and Iur(m,d), along with the disparity hypothesis d,
and then blended as follows:

Left and right disparity maps can be denoted as dwl(m) and dwr(m) respectively. The
sampled points Iul(m,d) and Iur(m,d) are then converted as functions of m, Iul(m) and
Iur(m) respectively. Furthermore, the matching probability functions Pl(m,d) and
Pr(m,d) are simplified as a set of shifted Dirac delta functions as follows:
Then, the PBR on the previous equation becomes:

For a given fixed point m*, the PBR synthesizes the intermediate view Iu(m*) with
the function of reference view Iul(m*,d) and the probability Pl(m*,d) as follows:

Finally, the PBR is able to handle occlusion and dis-occlusion (hole) regions by
assuming that the background texture varies smoothly. The problematic regions have
their textures synthesized in a probabilistic manner.
4.7.4: Implementation details
The implementation of this algorithm runs on Matlab. As discussed earlier, it does not
require one of the common parameters (window size). The most important parameter
the user has to modify is the disparity range. Apart from that there is a large number
of parameters that can be tuned to control various aspects of the algorithm, but none
are necessary to be changed and could be left at their default values.
4.7.5: Testing and Results
Figure 24: SSMP Results
The percentage of bad matching pixels according to the Sum of Absolute Differences metric is the following:

          Cones     Teddy     Tsukuba   Venus     Avg. time
SSMP      0.0883    0.1087    0.0379    0.4993    ~231.1 sec
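The bad-matching-pixel percentage used throughout these results can be computed as in the following sketch. It illustrates the standard metric rather than the exact evaluation script used here; the error threshold of 1 disparity level and the use of zero as the "unknown" ground-truth marker are assumptions:

```python
import numpy as np

def bad_pixel_percentage(disparity, ground_truth, threshold=1.0):
    """Fraction of pixels whose disparity error exceeds `threshold`,
    evaluated only where ground truth is available (non-zero)."""
    valid = ground_truth > 0                      # ignore unknown pixels
    error = np.abs(disparity - ground_truth)
    bad = (error > threshold) & valid
    return bad.sum() / max(valid.sum(), 1)
```

Because the score is a fraction of the valid pixels only, results remain comparable across image pairs whose ground-truth maps have different amounts of missing data.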
4.8: Results Analysis and Conclusion
Below is a table that summarizes the percentage of bad matching pixels for all the algorithms. The best (lowest) score for each image pair is marked with an asterisk (*) and the worst (highest) with a dagger (†).

                Cones     Teddy     Tsukuba   Venus     Avg. time
OpenCV
  BM            0.4450    0.4583    0.3556    0.4707    ~0.028 sec
  SGBM          0.4943†   0.4977†   0.3923†   0.4982    ~0.065 sec
  LoopyBP       0.0953    0.4206    0.0410    0.4203    ~155.88 sec
Matlab
  SSMP          0.0883    0.1087*   0.0379    0.4993†   ~231.1 sec
  FSM           0.0770*   0.1338    0.0320*   0.1202*   ~10.14 sec
The following diagram depicts a visual representation of the performance of the five algorithms. Since the measured quantity is the percentage of bad matching pixels, a lower value indicates better performance.
Figure 25: Percentage of bad matching pixels (lower value is better)
The most "difficult" image pairs to match seem to be Venus and Teddy: all the algorithms give their highest numbers of bad matching pixels on those two image sets. A possible reason is that those two sets exhibit little variation in the scene and higher uniformity in some regions, which can "confuse" an algorithm by producing a high number of candidate matching points. The other two image sets, Cones and Tsukuba, give better scores with all the algorithms. Those images exhibit greater variation in the scene, so matching points are easier to identify, with Tsukuba giving the best result for all of the algorithms.
Overall, the most efficient algorithm of the ones tested here is the Fast Stereo Matching and Disparity Estimation by G.R.M. Reddy and S. Mukherjee. It gives a small number of bad matching pixels under all circumstances and has a low running time. The SSMP also gives excellent results, but it is highly complicated and exhibits a high running time due to the large number of steps required for its completion. Also, as is evident from the Venus result, it cannot handle uniformity in images effectively in all cases. The Loopy Belief Propagation also handles uniformity in scenes poorly, as the results from Teddy and Venus show. Additionally, if a higher disparity range is selected, the running time becomes quite high even with a smaller number of loops. Finally, the SGBM and BM have many common points in their OpenCV implementation, with BM giving slightly better scores. Both algorithms performed poorly in the matching process itself, but have a very small runtime. They seem ideal for real-time applications, or wherever speed is more important than robust stereo matching.
Chapter 5: Discussion and future work
Stereo vision is employed, as mentioned earlier, in scientific, industrial, military and even consumer fields. Although it is still considered a gimmick by many people, it steadily gains traction and acceptance.
Perhaps the most obvious field that will employ stereo vision in the near future is virtual reality. Virtual reality applications have existed for a few years, but mostly for educational and entertainment purposes with limited use, mainly virtual tours of rather small 3D environments. Nowadays, with increased computational power, virtual reality can also be used in immersive and interactive applications, like video games.
Several consumer virtual reality devices like Google Cardboard [29] and Oculus Rift
[30] have started to make their way to consumers. More such devices are expected
from many manufacturers in the near future.
Figure 26: Current Virtual Reality Devices
Another interesting project, scheduled to be released commercially in October 2015, is a device dubbed a virtual reality toy and intended as the greatest remodeling of the famous View-Master. The toy corporation Mattel is working with Google on the project, which is largely based on the Google Cardboard. The traditional reels are replaced by plastic cards and a smartphone: the user slides the smartphone inside the headset and scans the cards, and a 3D image based on the theme of the cards is shown. The trademark switch on the new View-Master is now used to zoom or to focus on objects in the virtual scene.
Another field that has recently started to employ stereo vision is medicine, and more specifically endoscopy. Traditional endoscopes feature a single camera that provides a two-dimensional image of the patient's examined internal organ. Stereoscopic endoscopes feature two cameras that provide three-dimensional imaging, thus allowing a more thorough visual examination by extracting information about the internal surface of the organs.
Figure 27: Stereoscopic Endoscope
Research is also very active in driverless cars, discussed in the first chapter, and is expanding to other vehicles, mainly autonomous drones and Unmanned Ground Vehicles. There are also several space exploration projects that employ stereo vision, such as an innovative planetary landing algorithm [31] proposed by S. Woicke and E. Mooij, used to extract planet surface information and safely guide the space vessel to touchdown.
Finally, there are the robotic systems, autonomous or not, that use stereo imaging along with many other sensors. Robots are becoming more efficient and intelligent, and their use is set to expand in the near future into almost every sector imaginable.
References
1. http://en.wikipedia.org/wiki/Computer_stereo_vision.
2. https://en.wikipedia.org/wiki/Stereopsis#History_of_investigations_into_stereopsis.
3. https://en.wikipedia.org/wiki/View-Master#History.
4. https://en.wikipedia.org/wiki/Virtual_Boy.
5. http://en.wikipedia.org/wiki/Pinhole_camera_model.
6. http://en.wikipedia.org/wiki/Camera_resectioning.
7. http://en.wikipedia.org/wiki/Epipolar_geometry.
8. http://en.wikipedia.org/wiki/Fundamental_matrix_%28computer_vision%29.
9. http://en.wikipedia.org/wiki/Image_rectification.
10. http://www.jayrambhia.com/blog/disparity-maps/.
11. http://www.cs.stolaf.edu/wiki/index.php/Stereo_Matching.
12. Konolige, K. Small Vision Systems: Hardware and Implementation. Springer. 1998.
13. https://en.wikipedia.org/wiki/Cross-correlation#Normalized_cross-correlation.
14. https://en.wikipedia.org/wiki/Ground_truth.
15. http://techcrunch.com/2010/06/19/a-guide-to-3d-display-technology-its-principles-methods-and-dangers/.
16. http://www.self.gutenberg.org/articles/Aerial_survey.
17. http://www.nasa.gov/mission_pages/stereo/main/index.html.
18. https://en.wikipedia.org/wiki/Autonomous_car.
19. D. Scharstein, R. Szeliski. A Taxonomy and Evaluation of Dense Two Frame Stereo
Correspondence Algorithms. 2001.
20. R. Szeliski, R. Zabih. An Experimental Comparison of Stereo Algorithms. Springer. 2000.
21. L. Nalpantidis, G. Sirakoulis, A. Gasteratos. Review of Stereo Matching Algorithms for 3D Vision. 2007.
22. R.A. Lane, N.A. Thacker. Overview of Stereo Matching Research. 1998.
23. Hirschmuller, H. Semi-global Matching - Motivation, Development and Applications.
2011.
24. http://en.wikipedia.org/wiki/Block-matching_algorithm.
25. Ho, Nghia Kien. http://nghiaho.com/?page_id=1366#LBP. [Online]
26. https://en.wikipedia.org/wiki/Belief_propagation.
27. S. Mukherjee, G.R.M. Reddy. A Hybrid Algorithm for Disparity Calculation From Sparse Disparity Estimates Based on Stereo Vision. IEEE. 2014.
28. B. Ham, D. Min, C. Oh, M.N. Do, K. Sohn. Probability-Based Rendering for View
Synthesis. IEEE. 2014.
29. https://www.google.com/get/cardboard/.
30. https://www.oculus.com.
31. S. Woicke, E. Mooij. A Stereo-Vision Based Hazard-Detection Algorithm for Future
Planetary Landers. 2014.