
Lappeenranta University of Technology
Faculty of Technology Management
Degree Program in Information Technology

Master’s Thesis

Teemu Tarkiainen

ASSISTED 3D RECONSTRUCTION FROM A SINGLE VIEW

Instructor: Professor Joni Kämäräinen


ABSTRACT

Lappeenranta University of Technology
Faculty of Technology Management
Degree Program in Information Technology

Teemu Tarkiainen

Assisted 3D reconstruction from a single view

Master’s Thesis

2012

88 pages, 46 figures, 7 tables, and 2 appendices.

Keywords: interactive 3D reconstruction, image processing, thin plate spline, computer vision

This thesis is about constructing a 3D representation from a single image. The reconstruction process uses no prior data, nor is it automatic: human interaction is used to assist and guide it. A proof-of-concept computer program was developed with the C programming language in the Linux environment. The OpenGL application programming interface was used for graphics programming. The 3D reconstruction is performed by using image filtering, segmentation, dithering, Delaunay triangulation and thin plate spline interpolation. First, a triangular mesh that adapts well to the image content is generated. The image is then segmented into small homogeneous patches. The user is required to form larger entities by joining these small patches together. These defined areas are treated as computational units that represent surfaces with different orientations in the final 3D representation of the image. The user is also required to place 3D depth points over the image. These control points are used in the thin plate spline interpolation method, where the generated 2D triangular mesh is bent to meet the control points. This process transforms the 2D mesh into a 2.5D manifold.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Lappeenranta University of Technology
Faculty of Technology Management
Degree Program in Information Technology

Teemu Tarkiainen

Assisted 3D reconstruction from a single image

Master's Thesis

2012

88 pages, 46 figures, 7 tables, and 2 appendices.

Keywords: interactive 3D reconstruction, image processing, thin plate spline, computer vision

In this work a three-dimensional model is constructed from a single image. The reconstruction process uses no prior knowledge about the image, and the process is not automatic: a human assists and guides the process interactively. A computer program was developed for the transformation with the C programming language in the Linux environment. The OpenGL application programming interface was used for graphics programming. The 3D transformation uses image filtering, segmentation, dithering, Delaunay triangulation and thin plate spline interpolation. First, a triangle mesh that adapts to the image content is generated. Next, the image is segmented into small homogeneous patches. The user must form larger entities by joining these small patches together. These defined areas serve as computational units that represent the different surfaces in the final three-dimensional representation. The user must also place control points on the image; these define the three-dimensional depth at the corresponding image locations. The given points are used in the thin plate spline interpolation method, in which the triangle mesh is bent to pass through the points. This process transforms the two-dimensional mesh into a 2.5-dimensional unclosed mesh.


PREFACE

This thesis would not have been completed without support from friends and relatives. I wish to express my gratitude for your support. Thank you for being involved in spirit.

First of all, I would like to thank my supervisor, Professor Joni-Kristian Kämäräinen. I also want to express my gratitude to the examiner, Professor Ville Kyrki.

Lappeenranta, March 4th, 2012

Teemu Tarkiainen


CONTENTS

1 INTRODUCTION
1.1 Background
1.2 Objectives and the scope of the thesis
1.3 Structure of the thesis

2 3D RECONSTRUCTION IN COMPUTER VISION
2.1 Introduction
2.2 Simple camera system
2.3 Extrinsic and intrinsic camera parameters
2.4 3D from stereo vision
2.5 3D from a single image

3 PREVIOUS WORK ON SINGLE IMAGE RECONSTRUCTION
3.1 Introduction
3.2 Automatic methods
3.3 Interactive methods
3.4 Summary

4 REPRESENTING 3D SURFACES
4.1 Introduction
4.2 The OpenGL interface
4.3 Polygons in OpenGL
4.4 Surfaces from triangles
4.5 Delaunay triangulation
4.6 Tools for Delaunay triangulation

5 FEATURE POINT EXTRACTION
5.1 Introduction
5.2 Image noise reduction
5.3 Color to grayscale conversion
5.4 Calculation of partial derivatives
5.5 Floyd-Steinberg error diffusion

6 THIN PLATE SPLINES
6.1 Introduction
6.2 Definition
6.3 Solving with LU decomposition
6.4 TPS applied for digital elevation models

7 SUPER-PIXEL IMAGE SEGMENTATION METHODS
7.1 Introduction
7.2 Graph-based algorithms
7.3 Gradient-ascent-based algorithms
7.4 Summary

8 IMPLEMENTATION
8.1 Introduction
8.2 User interface
8.3 Segmentation phase
8.4 Mesh generation phase
8.5 Thin Plate Spline interpolation phase

9 EXAMPLES

10 DISCUSSION
10.1 Future Work

11 CONCLUSIONS

REFERENCES

APPENDICES
Appendix 1: The keyboard controls for the GUI
Appendix 2: Configuration files


ABBREVIATIONS AND SYMBOLS

2D Two-dimensional
2.5D Two-and-a-half-dimensional
3D Three-dimensional
AMD Advanced Micro Devices, Inc.
API Application programming interface
C C programming language
C++ C++ programming language
CIE Commission Internationale de l'Éclairage
DEM Digital Elevation Model
GLU OpenGL Utility Library
GLUT OpenGL Utility Toolkit
GTS GNU Triangulated Surface library
GNU Unix-like operating system that is free software
GTK+ GIMP Toolkit, a GUI toolkit
HVS Human visual system
MRF Markov random field
NCUT Normalized cut
OpenGL Open Graphics Library, a software interface to graphics hardware
pdf Probability density function
PSLG Planar straight line graph
RGB Red, green, blue color system
SDL Simple DirectMedia Layer
SFM Structure from motion
SGI Silicon Graphics Computer Systems
TIN Triangulated Irregular Network
TPS Thin Plate Spline
UI User interface

^A x Vector x given in coordinate frame A
X A matrix
R A rotation matrix
T A translation matrix
G = (V, E) Graph G defined by nodes V and edges E


1 INTRODUCTION

1.1 Background

The topic of three-dimensional (3D) reconstruction from images has been a long-standing issue in the computer vision literature. The main applications in computer vision have been visual inspection and robot guidance. Today there is more and more demand for 3D content for computer graphics, virtual reality and communication, and visual quality has become one of the main points of attention. Computer graphics has made tremendous progress in visualizing 3D models in recent years. Many techniques have reached maturity and are being ported to hardware. The performance of 3D visualization hardware has been increasing fast: what required a very expensive computer a decade ago can now be achieved by a computer which almost everyone can afford. Today it is possible to visualize complex 3D scenes in real time. Even though the tools and hardware available for 3D modelling are getting more and more powerful, synthesizing realistic 3D models is still difficult and time-consuming.

Computing 3D object shapes and scene structures from two-dimensional (2D) images is a basic problem in computer vision and has been studied extensively. One of the main streams of research in the literature is to compute a 3D shape from multiple images. Methods such as structure from motion (SFM) and multi-view stereo seek constraints from multiple images. These images are often taken from well-controlled camera positions, which minimizes the uncertainty in the 3D reconstruction process. Knowledge of the camera calibration is advantageous when computing the 3D models. In these kinds of approaches very little prior knowledge about the object or scene is used. In recent years, research in computer vision has aimed both to reduce the requirements for calibration and to increase the automation of the acquisition. The goal has been to automatically extract realistic 3D models.

Another stream of research is to compute 2.5D depth maps using shape from shading, texture, defocus, etc., or to compute 3D models with user interaction from a single image. In 3D reconstruction from a single still image, the research has recently also been focusing on automatic reconstruction solutions. A single image is a projection of the 3D world to 2D. During this transformation the depth of the 3D world is lost. A single image might represent an infinite number of 3D models, of which only a few are valid. The human visual system (HVS) uses monocular cues like color, focus, haze, etc. to infer the 3D structure of the scene. These local image cues alone are usually not enough to infer the 3D structure. Humans understand 3D structures by "integrating information" from monocular cues and knowledge learned from their prior experiences. This has had an impact on the methods used in automatic reconstruction solutions from a single still image. Researchers have been using supervised learning techniques in 3D reconstruction: some "prior" knowledge is used to train the 3D reconstruction system so that this learned knowledge can assist the reconstruction process. If prior knowledge is not applied in any way, user interaction becomes essential, because a single image lacks too much information.

User interaction is a common step in reconstruction from a single image because less information is kept in one image than in multiple images. Especially depth estimation is considered a difficult task when only a single image is available.

1.2 Objectives and the scope of the thesis

The objective of this work is to provide a solution for 3D reconstruction from a single image. The 3D reconstruction process is not automatic and thus needs user interaction. No prior knowledge of the scene or image is used or required in the process, and no supervised learning techniques are applied. A 3D or 2.5D surface model is obtained from a single image that could have been taken with an off-the-shelf consumer camera. The camera calibration information is considered unknown. The 3D reconstruction process is performed from one single 2D image at a time.

The objective of this thesis is to develop a functional proof-of-concept level program for the assisted 3D reconstruction. The programming platform for this work is Linux. The tools used in this work are the C programming language (C) and the Open Graphics Library (OpenGL). The program should have a graphical user interface (GUI); through the GUI the user assists the 3D reconstruction process. All the "missing" information (e.g. 3D depth) required to perform the reconstruction comes from the user.

The result of the 3D reconstruction from a single image is not meant to be a full 3D surface manifold (closed 3D mesh). That can be the requirement for solutions that use multiple images taken from different viewpoints. In this work the result is more likely to be a 2.5D surface manifold (unclosed 3D mesh), because it is not possible to accurately reconstruct the occluded parts of the scene from a single image without making major assumptions.

All information used in the reconstruction process is obtained from a single 2D image and from the user using the keyboard and mouse. However, the implementation has to work for different images. A single 2D image is reconstructed as a 3D scene by using the program. At first, the 2D surface is converted to a polygonal mesh. Then the user is required to give 3D depth over the image as control points. The flat 2D polygonal mesh surface is bent to meet these control points.

1.3 Structure of the thesis

This thesis concerns the image processing, graphics, theory and software implementation required for a functional proof-of-concept level program for assisted 3D reconstruction from a single image.

This thesis is organized as follows:

Chapter 2 is about 3D reconstruction in general. It is a brief introduction without going into deep detail. The chapter is organized so that it leads from multi-view reconstruction towards reconstruction from a single image. A simple camera model is presented to illustrate how camera parameters are related to the 3D reconstruction.

Chapter 3 is a literature review of previous research on 3D reconstruction from a single still image.

Chapter 4 starts with a brief introduction to the OpenGL application programming interface (API): what OpenGL is and how it works. Polygon rendering plays an essential part in graphics programming, and the elements that OpenGL provides for this purpose are introduced. Polygon triangulation is discussed; triangulation is required when a surface or plane is transformed into a mesh that consists of polygonal mesh elements. The Delaunay triangulation method is presented, and available open source implementations for Delaunay triangulation are surveyed.

Chapter 5 is about feature point extraction from images. The feature points are the input data for the Delaunay triangulation. How the feature points are selected so that the surface mesh adapts well to the image content is explained.

Chapter 6 is about the thin plate spline (TPS) interpolation method. In this work thin plate splines are used for bending the 2D triangle mesh into a 2.5D surface manifold. Previous research provides the means for solving the TPS with LU decomposition. These mathematical equations are introduced in this chapter.

Chapter 7 is about existing super-pixel segmentation methods. To be able to extract shapes and surfaces from the image, some kind of image segmentation method is required. In this chapter we name a few existing image segmentation methods.

Chapter 8 is about the implementation of the application: how the methods and tools introduced in the previous chapters are used and combined into a working application that can transform a single still image into a 3D representation. The user interface and the implemented functions and procedures are introduced.

Chapter 9 considers the experiments and what kind of results one can achieve with this implementation.

Chapter 10 is reserved for discussion: what worked and what went wrong, and how the implementation can be improved.

Chapter 11 is a short summary of what was done and achieved in this thesis.


2 3D RECONSTRUCTION IN COMPUTER VISION

2.1 Introduction

Space reconstruction relates to the techniques of recovering information about the structure of a 3D space based on direct measurements or depth computation from stereo matching. This gives positions and dimensions of the sensed object surfaces, and this information can, for instance, be used for robot navigation. [1]

The main application for 3D reconstruction is the generation of 3D models (see Figure 1) from images. One of the simplest methods to obtain a 3D model of a scene is therefore to use a digital camera and to shoot a few pictures of the scene from different viewpoints. With the camera calibration given for all viewpoints of the image sequence, a sparse surface model based on distinct feature points can be obtained by using a feature tracking algorithm [2]. This however is not sufficient to reconstruct geometrically correct and visually pleasing surface models. More precise models can be accomplished by a dense disparity matching that estimates correspondences from the grey level images directly by exploiting additional geometrical constraints [2].

To experiment with 3D reconstruction oneself, there already exist applications like Photosynth, a free Windows application with a supporting web site, photosynth.net. The application is based on research done at the University of Washington [3]. The idea of the research was to automatically reconstruct a 3D space from a large set of community or personal photos. The application takes as input a large set of photos, reconstructs camera viewpoints, and automatically computes orbits, panoramas, canonical views, and optimal paths between views. After the reconstruction process, the user can interactively browse the scene in 3D. As the user browses the scene, nearby views are continuously selected and transformed, using control-adaptive re-projection techniques. The idea is that the application identifies images taken at different times, with different cameras and from different angles, without carefully controlled viewing angles in lab conditions. The more photos there are of the scene, the more accurate the 3D model becomes. The algorithm that the application uses is given in [3]. The original purpose of the algorithm was to form models for Internet image services from a particular scene in photographs, but it is also suitable for self-recorded images. The algorithms used in the application are complex, and the calculation of the reconstruction can take several minutes.

In order to create visually pleasing 3D models, there are a few ground rules one should follow. It is essential that each object in the scene is visible in at least three different images. When the target scene is a landscape, it is advisable to take a panorama picture from one spot and after that take pictures from different locations inside the scene and around it. By taking images from different distances, one can get more details into the reconstructed model. If the target is an object, the best result is gained when the camera moves around the object. Over the entire 360-degree circle it is a good idea to take at least 24 images; in this case the camera moves 15 degrees at a time.

Figure 1. 3D surface model obtained automatically from an uncalibrated image sequence, shaded (left), textured (right). [2]

In order to gain understanding of how 3D reconstruction works in applications like Photosynth, we deal with some basics of space reconstruction in the next sections. Depending on the available parameters of a 3D acquisition system, different parameters of the space can be determined. Every image acquisition system by its nature performs some kind of transformation of the real 3D space into a 2D spatial space. To describe a 3D acquisition system, it is fundamental to find the parameters of such a transformation. Although the main point of interest in this thesis is 3D reconstruction from a single image, in this chapter we also refer to 3D reconstruction from stereo vision: it is easier to understand the concepts of 3D reconstruction by observing methods that are related to stereo vision. 3D reconstruction from a single image is considered a harder task because of the lack of information compared to stereo vision. The common factor for a single or stereo view is the use of a camera. For most cameras, a model that describes the space transformation they perform is based either on the parallel or the central perspective projection. The linear parallel projection is the simplest approach. So, first a simple pinhole camera system is presented, secondly 3D from stereo vision is discussed and how it relates to the camera system's parameters, and last, a brief introduction to 3D from monocular cues is given.


2.2 Simple camera system

Knowledge about the camera can also be used to restrict the ambiguity of the reconstruction from projective to metric or even beyond. Different parameters of the camera can be known. Both knowledge about the extrinsic parameters (i.e. position and orientation) and about the intrinsic parameters can be used for calibration. Knowing the relative position of the viewpoints is equivalent to knowing the relative position of 3D points. Therefore, the relative position of 5 image points in general position suffices to obtain a metric reconstruction.

The pinhole camera in Figure 2 provides an example of image formation that can be understood with a simple geometric model. A pinhole camera is a box with a small hole in the centre of one side. Because the pinhole lies between the imaging screen and the observed 3D world scene, any ray of light that is emitted or reflected from a surface patch in the scene is constrained to travel through the pinhole before reaching the imaging screen. The imaging screen plane is located at the distance d from the pinhole. Therefore, there is a correspondence between each 2D area on the imaging screen and the area in the 3D world, as observed "through the pinhole" from the imaging screen. [4] [1]

Figure 2. Left: The pinhole camera. Right: A side view of the pinhole camera. [4]

It is possible to calculate from the side view of the pinhole camera (see Figure 2) where the image of the point (x, y, z) is on the imaging plane z = -d. Using the fact that the two triangles in Figure 2 are similar, it can be found that the y coordinate of the image is at yp, where

yp = -(y/z) d .   (1)

Using a view from the top would similarly yield

xp = -(x/z) d .   (2)


These equations constitute the foundation of the pinhole camera model [1]. The point (xp, yp, -d) is called the projection of the point (x, y, z). In the pinhole model the color on the imaging plane at this point will be the color of the point (x, y, z). [4]

Angel et al. [4] note that the ideal pinhole camera has an infinite depth of field: every point within its field of view is in focus. They add that the disadvantages of the pinhole camera are that it only admits a single ray from a point source and that the camera cannot be adjusted to have a different angle of view. According to Angel et al. [4], the transition to more sophisticated cameras and to other imaging systems that have lenses is a small one. By replacing the pinhole with a lens, the two disadvantages of the pinhole camera can be removed. The larger the aperture of the lens, the more light the lens can collect, and picking a lens with the proper focal length is equivalent to choosing d for the pinhole camera [4].
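As an illustration of equations (1) and (2), a minimal C sketch of the pinhole projection follows; the names point3 and project_pinhole are illustrative, not from the thesis program.

#include <stdio.h>

typedef struct { double x, y, z; } point3;

/* Project a 3D point onto the image plane z = -d of a pinhole camera at
   the origin, following equations (1) and (2): xp = -(x/z)d, yp = -(y/z)d. */
static void project_pinhole(point3 p, double d, double *xp, double *yp)
{
    *xp = -(p.x / p.z) * d;
    *yp = -(p.y / p.z) * d;
}

int main(void)
{
    point3 p = { 1.0, 2.0, -4.0 };   /* a point in front of the camera */
    double xp, yp;
    project_pinhole(p, 1.0, &xp, &yp);
    printf("projected to (%g, %g)\n", xp, yp);   /* prints (0.25, 0.5) */
    return 0;
}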

2.3 Extrinsic and intrinsic camera parameters

Figure 3 shows a mathematical model of the simple pinhole camera where the imaging screen is in front of the pinhole. Cyganek et al. [1] use this formulation to simplify the concept of projection to that of magnification. According to them, in order to understand how points in the real world are related mathematically to points on the imaging screen, two coordinate systems are of particular interest: the external coordinate system and the camera coordinate system. The external coordinate system W is independent of the placement and parameters of the camera C. Cyganek et al. [1] show (Figure 3) that the two coordinate systems are related by a translation, expressed by the matrix T, and a rotation, represented by the matrix R. The point Oc, called a central or a focal point, together with the axes Xc, Yc and Zc determines the coordinate system of the camera. An important part of the camera model is the image plane Π. In Figure 3 the image plane contains small picture elements (pixels). The picture elements are indexed by a pair of coordinates expressed by integers. They depict the plane Π with a discrete grid of pixels.

The projection of the point Oc on the plane Π in the direction of Zc determines the principal point of local coordinates (ox, oy). The principal axis is the line between the points Oc and O′c. The distance from the image plane to the principal point is known as the focal length. The values hx and hy determine the physical dimensions of a single pixel. Placement of a given point P from the 3D space depends on the chosen coordinate system: in the camera coordinate system it is a column vector Pc; in the external coordinate system it is a column vector Pw. Point p is an image of point P under the projection with the centre at the point Oc on the plane Π. [1]

Figure 3. A pinhole model of the perspective camera with two coordinate systems: external W and internal C. The point Oc, called a central or a focal point, together with the axes Xc, Yc and Zc determines the coordinate system of the camera. [1]

Coordinates of the points p and P in the camera coordinate system are denoted as

P = [X Y Z]^T ,  p = [x y z]^T .   (3)

The pinhole camera model can be defined by providing two sets of parameters: the extrinsic parameters and the intrinsic parameters [1].

Cyganek et al. [1] state that a change from the camera coordinate system C to the external world coordinate system W can be accomplished by providing a translation T and a rotation R (Figure 3). The translation vector T describes a change in position of the coordinate centres Oc and Ow. The rotation, in turn, changes the corresponding axes of each system; this change is described by an orthogonal matrix R of dimensions 3 × 3. According to Cyganek et al. [1], the extrinsic parameters of the perspective camera are all the necessary geometric parameters that allow a change from the camera coordinate system to the external coordinate system and vice versa. Thus, they conclude that the extrinsic parameters of a camera are just the introduced matrices R and T.
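A minimal C sketch of this coordinate change follows, assuming the common convention Pc = R Pw + T; the vec3 and mat3 types and the function name are illustrative, not from the thesis:

typedef struct { double v[3]; } vec3;
typedef struct { double m[3][3]; } mat3;

/* Change a point from world coordinates Pw to camera coordinates Pc
   with a 3x3 rotation R and a translation T: Pc = R * Pw + T. */
static vec3 world_to_camera(mat3 R, vec3 T, vec3 Pw)
{
    vec3 Pc;
    for (int i = 0; i < 3; ++i) {
        Pc.v[i] = T.v[i];
        for (int j = 0; j < 3; ++j)
            Pc.v[i] += R.m[i][j] * Pw.v[j];
    }
    return Pc;
}

Since R is orthogonal, the inverse change uses its transpose: Pw = R^T (Pc - T).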

Cyganek et al. [1] summarize the intrinsic camera parameters as follows.


1. " The parameters of the projective transformation itself: For the pin-hole camera

model, this is given by the focal length f ."

2. " The parameters that map the camera coordinate system into the image coordinate

system: Assuming that the origin of the image constitutes a point o = (ox, oy) (i.e.

a central point) and that the physical dimensions of pixels on a camera plane in

the two directions are constant and given by hx and hy , a relation between image

coordinates xu and yu and camera coordinates x and y can be stated as follows

(see Figure 3):x = (xu − ox)hx

y = (yu − oy)hy(4)

, where a point (x, y) is related to the camera coordinate systemC, whereas (xu, yu)

and (ox, oy) to the system of a local camera plane. It is customary to assume that

xu ≥ 0 and yu ≥ 0. For instance, the point of origin of the camera plane (xu, yu) =

(0, 0) transforms to the point (−oxhx,−oyhy) of the system C. More often than not

it is assumed also that hx = hy = 1. A value of hy/hx is called an aspect ratio.

Under this assumption a point from our example is simply (−ox,−oy) in the C

coordinates, which can be easily verified analysing Figure (3)."

3. "Geometric distortions that arise due to the physical parameters of the optical ele-

ments of the camera. "
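A minimal C sketch of the mapping in equation (4), under the stated assumptions; the struct and function names are illustrative:

typedef struct { double ox, oy, hx, hy; } intrinsics;

/* Map image (pixel) coordinates (xu, yu) to camera-plane coordinates
   (x, y) according to equation (4). */
static void image_to_camera(intrinsics k, double xu, double yu,
                            double *x, double *y)
{
    *x = (xu - k.ox) * k.hx;
    *y = (yu - k.oy) * k.hy;
}

With hx = hy = 1 this reduces to a pure shift by the principal point, matching the example in item 2 above.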

2.4 3D from stereo vision

Many 3D vision systems are based on stereo pair images. These 3D vision systems mimic the human visual system (HVS) with cameras. When humans observe a scene with both eyes, an image of the scene is formed on the retina of the eyes. By adjusting the angle of each eye, humans can focus on objects at any arbitrary distance, throwing objects at other distances out of focus. This contrast between focus and blur, along with an inner knowledge of how our eyes are aligned and the distance between them, gives us the ability to estimate the distance to an unknown object. The images formed by our two eyes are not identical. This stereo pair of retinal images contains slight displacements between the relative locations of local parts of the image of the scene with respect to each image of the pair, depending upon how close these local scene components are to the point of fixation of the observer's eyes. [1]

It is possible to reverse this process of "seeing" and deduce how far away scene components were from the observer. The magnitude and direction of the parallaxes within the stereo pairs are used for this deduction. This is also called triangulation – the measuring of a distance given two reference points, the distance between them, and the angles to point at the distant subject (see Figure 5).

A basic triangulation gives rise to the so-called 2.5D depth reconstruction. However, in many practical applications of 3D imaging a full 3D surface manifold is required. In order to generate a complete and closed 3D surface manifold for an object, multiple 2.5D range maps have to be integrated. In this case sufficient views of the object must be captured to ensure that a closed 3D mesh can be formed. [1]

Stereo vision is one of the methods that can yield depth information about the scene. It uses stereo image pairs from two cameras to produce disparity maps that can easily be turned into depth maps. Disparity simply means difference, so a disparity map is a measurement of the difference between two things. A stereo disparity map measures the difference between two views, the left and the right. For example, say pixel A describes some feature of the scene in the left view. The next step is to find that feature, as a very similar pixel, in the right image and measure the difference in the x coordinate from the right view to the left. This gives the disparity between the two views, and if this is done for every pixel, it produces a stereo pixel disparity map. In other words, a stereo pixel disparity map describes what x offset a pixel would have to make to move from the left view into the correct position in the right view. [1]
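The per-pixel search just described can be sketched in C as a toy block-matching routine using the sum of absolute differences (SAD); it assumes rectified grayscale images stored row-major, and all names (sad, disparity_map) and parameters (window radius win, search range max_d) are illustrative rather than from the thesis:

#include <stdlib.h>

/* Sum of absolute differences between a block around (x_l, y) in the
   left image and a block around (x_r, y) in the right image. */
static int sad(const unsigned char *l, const unsigned char *r,
               int w, int x_l, int x_r, int y, int win)
{
    int sum = 0;
    for (int dy = -win; dy <= win; ++dy)
        for (int dx = -win; dx <= win; ++dx)
            sum += abs(l[(y + dy) * w + x_l + dx] -
                       r[(y + dy) * w + x_r + dx]);
    return sum;
}

/* For each pixel of the left image, scan leftwards along the same row of
   the right image and record the x offset with the best block match.
   Border pixels and the first max_d columns are left untouched. */
void disparity_map(const unsigned char *left, const unsigned char *right,
                   unsigned char *disp, int w, int h, int max_d, int win)
{
    for (int y = win; y < h - win; ++y)
        for (int x = win + max_d; x < w - win; ++x) {
            int best = 0, best_cost = sad(left, right, w, x, x, y, win);
            for (int d = 1; d <= max_d; ++d) {
                int cost = sad(left, right, w, x, x - d, y, win);
                if (cost < best_cost) { best_cost = cost; best = d; }
            }
            disp[y * w + x] = (unsigned char)best;
        }
}

In the canonical setup of Figure 4, a disparity d found this way relates to depth Z through Z = fb/d, where f is the focal length and b the baseline between the cameras.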

With two similar cameras, it is possible to build a stereo photography rig that has adjustable photographic distances and angles (see Figure 4). Such a system allows capturing moving, living subjects with high quality. This kind of system also allows manually matching the f-stops and shutter speeds of the cameras and using a synchronized remote grip switch to trigger the shutters. By definition, a stereo photogrammetry based 3D vision system will require stereo pair image acquisition hardware [1]. This hardware is usually connected to a computer hosting software that automates acquisition control. Cyganek et al. [1] also mention that multiple stereo pairs of cameras can be employed to allow all-around coverage of an object or a person, e.g., in the context of whole-body scanners.

Cyganek et al. [1] state that the accuracy of 3D reconstruction depends on the availability and accuracy of data about the camera setup. Here they refer to Grimson [6], who presented a detailed analysis of 3D reconstruction with respect to the accuracy of the camera calibration parameters. He showed that the reconstruction process based on available disparities extracted from stereo pair images has a critical and non-linear dependency on the accuracy of the camera calibration parameters.


Figure 4. Stereo setup with cameras in the canonical position [1].

Figure 5. Reconstruction of a three-dimensional point through triangulation (left) [2]. Image points x and x′ back-project to rays. If the epipolar constraint is satisfied, then these two rays lie in a plane and intersect in a point X in 3-space (right) [5].


Cyganek et al. [1] mention that especially important is the precise computation of the camera central points (see Figure 3), as well as the deviation angle of the camera optical axes. Depending on the availability and accuracy of the calibration data associated with the camera set-up used, there are different possible degrees of 3D reconstruction:

1. Full reconstruction of the Euclidean 3D space. If the extrinsic and the intrinsic parameters of the camera set-up are known, full reconstruction of the Euclidean 3D space is possible.

2. If only the intrinsic camera parameters are known, then reconstruction is possible only up to a certain scaling factor. This also complies with intuition: if the external calibration parameters are not known, then the position of the cameras with respect to the external 'world' coordinate system is also not known and can thus have arbitrary values. It is then evident that the reconstructed coordinates of 3D points cannot be unique, since the positions of the cameras are not given.

3. If the extrinsic and intrinsic parameters are unknown, reconstruction is possible only up to a certain projective transformation.

Camera calibration is a well-studied problem both in photogrammetry and computer vision. While there has been recent progress in the use of uncalibrated views for 3D reconstruction [7], in many real-life applications it is not necessary, or not even possible, to obtain the absolute Euclidean coordinates of visible objects in a predefined coordinate system attached to the scene. In applications like Photosynth (discussed earlier) the camera calibration data is unknown: the images could have been taken by different persons and with different cameras.

Hartley et al. [5] describe how the spatial layout of a scene and the cameras can be recovered from two views. Suppose that a set of image correspondences xi ↔ x′i is given. It can be assumed that these correspondences come from a set of 3D points Xi, which are unknown. Similarly, the position, orientation and calibration of the cameras are not known. The reconstruction task is to find the camera matrices P and P′, as well as the 3D points Xi, such that xi = PXi and x′i = P′Xi for all i. Hartley et al. presented the following conceptual outline approach for reconstruction from stereo pairs; many variants on this method are possible.

1. Compute the fundamental matrix from the point correspondences (see Figure 6).

2. Compute the camera matrices from the fundamental matrix.


3. For each point correspondence xi ↔ x′i, compute the point in space that projects to these two image points.

Figure 6. Architecture of the system for computation of the fundamental matrix [1].

With xi and x′i known, the epipolar constraint x′i^T F xi = 0 is a linear equation in the entries of the fundamental matrix F. Hartley et al. [5] state that given at least 8 point correspondences it is possible to solve linearly for the entries of the fundamental matrix up to scale, and with more than 8 equations a least-squares solution is found. This is the general principle of a method for computing the fundamental matrix.
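For illustration (this expansion is standard and not quoted from the thesis), writing the homogeneous points as x = (x, y, 1)^T and x′ = (x′, y′, 1)^T, each correspondence turns x′^T F x = 0 into one linear equation in the nine entries of F:

x′x f11 + x′y f12 + x′ f13 + y′x f21 + y′y f22 + y′ f23 + x f31 + y f32 + f33 = 0 .

Stacking eight or more such equations gives the linear system whose solution, determined up to scale, is the fundamental matrix.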

" Given the camera matrices P and P′, let x and x

′be two points in the two images

that satisfy the epipolar constraint x′TFx = 0. Epipolar constraint may be interpreted

geometrically in terms of the rays in space corresponding to the two image points. x′

lies

on the epipolar line Fx. This means that the two rays back-projected from image points x

and x′

lie in a common epipolar plane, that is, a plane passing through the two camera

centres. Since the two rays lie in a plane, they will intersect in some point. This point X

projects via the two cameras to the points x and x′

in the two images. The only points in

3-space that cannot be determined from their images are points on the baseline between

the two cameras. In this case the back-projected rays are collinear and intersect along

their whole length. Thus, the point X cannot be uniquely determined. " [5]
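The intersection described above can be computed with, for example, the classic midpoint method: find the point halfway between the closest points of the two back-projected rays. A minimal C sketch follows, assuming the rays are given by camera centres c1, c2 and direction vectors d1, d2; the names are illustrative, and this is one common variant, not necessarily the formulation of [5]:

typedef struct { double x, y, z; } vec3;

static double dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static vec3 add(vec3 a, vec3 b) { vec3 r = { a.x+b.x, a.y+b.y, a.z+b.z }; return r; }
static vec3 sub(vec3 a, vec3 b) { vec3 r = { a.x-b.x, a.y-b.y, a.z-b.z }; return r; }
static vec3 scale(vec3 a, double s) { vec3 r = { a.x*s, a.y*s, a.z*s }; return r; }

/* Midpoint of the shortest segment between the rays c1 + s*d1 and
   c2 + t*d2. If the epipolar constraint holds exactly, the rays intersect
   and the midpoint is the intersection; with noise it is a compromise. */
static vec3 triangulate_midpoint(vec3 c1, vec3 d1, vec3 c2, vec3 d2)
{
    vec3 w0 = sub(c1, c2);
    double a = dot(d1, d1), b = dot(d1, d2), c = dot(d2, d2);
    double d = dot(d1, w0), e = dot(d2, w0);
    double denom = a * c - b * b;   /* zero only for parallel rays */
    double s = (b * e - c * d) / denom;
    double t = (a * e - b * d) / denom;
    vec3 p = add(c1, scale(d1, s));
    vec3 q = add(c2, scale(d2, t));
    return scale(add(p, q), 0.5);
}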

Saxena et al. [8] note that if the images are taken from nearby cameras (i.e., if the baseline distance is small), these methods often suffer from large triangulation errors for points far away from the camera. They add that if, conversely, one chooses images taken far apart, then the change of viewpoint often causes the images to become very different, so that finding correspondences becomes extremely difficult, leading either to spurious or missed correspondences. The large baseline also means that there may be little overlap between the images, so that only a few correspondences may exist. These difficulties make purely geometric 3D reconstruction algorithms work unreliably in practice when given only a small set of images. [8]

The 3D surface is usually approximated by a triangular mesh to reduce geometric complexity and to tailor the model to the requirements of computer graphics visualization systems. In 3D reconstruction, a simple approach consists of overlaying a 2D triangular mesh on top of the image and then building a corresponding 3D mesh by placing the vertices of the triangles in 3D space according to the values found in the depth map (see Figure 7). [2]

Figure 7. Surface reconstruction approach: A triangular mesh (left) is overlaid on top of the image (middle). The vertices are back-projected in space according to the value found in the depth map (right). [2]
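A toy C sketch of this lifting step follows; it assumes the depth map has already been resampled to image resolution, and the vertex type and function name are illustrative, not the thesis program:

typedef struct { float x, y, z; } vertex;

/* Lift a 2D mesh into 3D: give each vertex of the overlaid triangular
   mesh the depth stored at its pixel position in the depth map. */
static void lift_mesh(vertex *verts, int n, const float *depth, int w)
{
    for (int i = 0; i < n; ++i) {
        int px = (int)verts[i].x;   /* vertex position in image pixels */
        int py = (int)verts[i].y;
        verts[i].z = depth[py * w + px];
    }
}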

2.5 3D from a single image

The acquisition of a 3D model becomes even harder when only a single image or view is available. Depth estimation from a single monocular image is clearly a difficult task: it is not possible to use triangulation to recover depth maps as with stereo pairs. Depth estimation from a single image requires that the global structure of the image is taken into account and that some prior knowledge about the scene is used. [9]

So why are researchers still trying to recover 3D from single still images when good results are achievable with stereo vision? If the stereo vision approach is applied in, e.g., robot navigation, the robot has to have two cameras, and there can be situations with restrictions that compel the use of only a single camera. Robot navigation may not be the best example, because robots use a video stream and techniques like vision from motion: the robot moves, and from the video feed it gets multiple images of the scene from slightly different angles, which makes the use of, e.g., triangulation possible.

Navigation systems based on binocular vision work quite well at short distances. In obstacle detection systems, objects must be perceived at a distance so that collisions can be avoided. In high-speed driving the standard binocular vision algorithms have difficulties perceiving obstacles due to the fundamental limitations of the "baseline" distance between the two cameras and the noise in the observations [10]. This has been one motivation for research in depth estimation from monocular cues. Better understanding of the information that can be perceived from monocular vision helps to improve binocular systems as well.

Figure 8. Back-projection of a point along the line of sight. [2]

An image like the one in Figure 8 tells us a lot about the observed scene. There is however not enough information to reconstruct the 3D scene (at least not without making a substantial number of assumptions about the structure of the scene). This is due to the nature of the image formation process, which consists of a projection from a three-dimensional scene onto a two-dimensional image. During this process the depth is lost.

However, humans appear to be good at judging depth also from a single image. If we look at a photograph, it is no problem for us to instantly grasp the overall 3D structure of the scene. Humans understand the scene by "integrating information" available from different sources. From a single image, humans use a variety of monocular cues, such as texture variations and gradients, color, haze, defocus, etc. [1]. Humans can infer the 3D structure even when only a single view of parts of a scene is available. For this we use our prior knowledge and experience: we live in a reasonably structured world and use the knowledge we have learned from it. This is why the recent research on 3D reconstruction from single images is based on statistical machine learning techniques. Instead of trying to explicitly extract all the required geometric parameters from a single image, recent approaches use other images as a training set and furnish this information in an implicit way, through recognition. Some of the recent research on 3D reconstruction from a single image is reviewed in Chapter 3.

According to Li et al. [11], the common point in 3D reconstruction from single images is that user interaction is a necessary step, because current computer vision algorithms cannot reliably extract the required edges from general images. A 3D model built from a single image will almost invariably be an incomplete model of the scene, because many portions of the scene will be missing or occluded.


3 PREVIOUS WORK ON SINGLE IMAGE RECONSTRUCTION

3.1 Introduction

This chapter is a literature review of previous work on 3D reconstruction from a single still image. The reviewed methods have been classified into two sub-chapters: automatic methods and interactive methods. The latest research has been focusing more on automatic methods. These methods often require the use of supervised learning techniques and some additional knowledge about the images. At the moment, it seems that only interactive methods can be implemented in such a way that no prior knowledge of the image or scene is required. In interactive methods, the user is required to assist the reconstruction process. It also seems that there is relatively little work on automatic single-view reconstruction, and 3D reconstruction from a single image often focuses on modelling specific object types.

3.2 Automatic methods

Hoiem et al. [12] have researched how to recover surface layout from a single still image. They constructed a surface layout by estimating the orientations of large surfaces in outdoor images. Their goal was to label the image into coarse geometric classes. In their method, they begin by dividing the image into small homogeneous patches called super-pixels; image segmentation into super-pixels is discussed further in Chapter 7. After that, they classify each super-pixel as either being parallel to the ground plane, belonging to a surface that sticks up from the ground, or being part of the sky. The surfaces that stick up from the ground they subdivide into planar surfaces facing left, right, or toward the camera, and non-planar surfaces, either porous or solid.

Hoiem et al. [12] posed the problem of 3D surface estimation in terms of statistical learning: they cast surface layout recovery as a recognition problem. They did not try to explicitly calculate all the required parameters from the image. 300 outdoor pictures were hand-picked from Google image search and used as a training set. Hoiem et al. modelled through recognition geometric classes that depend on the orientation of a physical object in the scene, gradually gathering structural knowledge from pixels to super-pixels and from super-pixels to multiple image segmentations. According to them, an important aspect of their method is a wide use of image cues including position, color, texture and perspective.

While Hoiem et al. [12] focused on outdoor images, Delage et al. [13] focused on indoor Manhattan world scenes. Manhattan worlds mainly consist of orthogonal planes. In [13] they used a Markov random field (MRF) to estimate whether each point in an image represents a surface or an edge, and also the orientation of the surface or edge. With this information, they used an iterative algorithm to infer the 3D reconstruction of the scene. Like Hoiem et al. [12], Delage et al. used supervised learning in their method. In order to train their model parameters, they hand-labeled two images with their ground-truth labels; this set of two images made up their training set. According to Delage et al., their algorithm is the first fully automatic method for 3D reconstruction from a single indoor image.

Saxena et al. [14] focus on inferring detailed 3D structure that is both "quantitatively accurate as well as visually pleasing". They infer the 3D location, the orientation and the relationships of the small planar regions in the image using an MRF. Saxena et al. used a custom-built 3D laser scanner to collect images and their corresponding depth maps. They collected a total of 534 images with an image resolution of 2272 × 1704 and depth maps with a 55 × 305 resolution, and picked 400 of these images to train their model. The use of the laser scanner relates to previous work of Saxena et al. [15], which focuses on learning depth from single monocular cues. In [15] they divide the image into small rectangular patches and estimate a single depth value for each patch. They use two types of features: absolute depth features, used to estimate the absolute depth of a particular patch, and relative depth features, the magnitude of the difference in depth between two patches.

In [14] Saxena et al. learn the relation between the image features and the location/orientation of the planes, and also the relationships between various parts of the image, using supervised learning. At first, they use the same method as Hoiem et al. [12] and divide the image into super-pixels; typically an image is divided into 2000 super-pixels. Saxena et al. [14] capture features from the image. According to Saxena et al., the image features of a super-pixel bear some relation to the depth (and orientation) of the super-pixel. They also assume that, except in the case of occlusion, neighbouring super-pixels are more likely to be connected to each other, and that neighbouring super-pixels are more likely to belong to the same plane if they have similar features and there are no edges between them. Last, a long straight line in an image represents a straight line in the 3D model (for example the edge of a building). Saxena et al. [14] compare the results of their model to the work of Hoiem et al. [12]. They show that their approach creates "qualitatively correct" 3D models for 64.9% of 588 images downloaded from the internet, compared to the 33.1% performance of Hoiem et al. [12].

Chiu et al. [16] automatically reconstruct 3D objects from a single image by using prior 3D shape models of classes that are an extension of the Potemkin model [16]. The 3D Potemkin class model can be viewed as a collection of 3D planar shapes, one for each part, which are arranged in three dimensions. Chiu et al. say that their 3D Potemkin model is a relatively weak 3D model, but powerful enough to support reconstruction of the 3D shapes of objects. The learned 3D Potemkin model can be used to enable existing detection systems to reconstruct the 3D shapes of detected objects. According to Chiu et al., current detection methods are only able to obtain 2D shapes (or partial 3D information) from the detected objects, which is not sufficient for artificial systems that must interact with the external object in 3D space, for example by moving a robot arm to grasp objects. This is why they developed the 3D Potemkin model, which enables existing detection methods to reconstruct the 3D shapes of objects and to be applicable to applications in computer graphics and robotics. The class model is trained on a few part-labeled 2D views of instances of an object class from different, uncalibrated viewpoints; it does not require any 3D training information. In [16] Chiu et al. demonstrate that with their model a robot is able to estimate the pose of an object and to grasp it, even in situations where the part to be grasped is not visible in the input image.

3.3 Interactive methods

Ting et al. [17] create curvilinear, texture-mapped 3D models from a single picture with no prior internal knowledge about the shape. In their approach, the surface is reconstructed from a set of user-specified constraints, such as point positions, normals, contours and regions. The scene is modelled as a piecewise continuous surface represented on a quad-tree-based adaptive grid. The problem of computing the best surface that satisfies these constraints is cast as a constrained optimization problem. The technique is interactive and updates the model in real time when constraints are added. A hierarchical transformation technique (the thin plate functional) is used to bend the surface to meet the constraints. With this technique Ting et al. have obtained very good results with a reasonable amount of user interaction. They have shown that with their approach it is possible to reconstruct a 3D model of the scene in roughly 20 minutes, including the specification of constraints. In their experiments the calculation of the model usually took less than a minute to converge on a 1.5 GHz Pentium 4 processor.


Shum et al. [18] presented two interactive image-based 3D modelling systems. If the concept of "3D from a single image" is stretched to include panoramic mosaics that consist of a set of images, the first system can be mentioned here; the second system of Shum et al. was based on stereo vision. Shum et al. constructed a panoramic mosaic by taking images around the same viewpoint, including the camera matrix associated with each input image. In [18], the user first interactively specifies features such as points, lines, and planes. Then the system recovers the camera pose for each mosaic from known line directions and reference points, and constructs the 3D model using all available geometrical constraints.

Sturm et al. [19] present a method where they interactively reconstruct 3D piecewise planar objects from single images. In their method the reconstruction requires three types of geometrical constraints on the 3D structure of the observed object: co-planarity of points, perpendicularity of directions or planes, and parallelism of directions or planes. Sturm et al. use the perpendicularity constraints to calibrate the image. Together with the parallelism constraints, perpendicularity provides the vanishing geometry of the scene, which forms the skeleton of the 3D reconstruction. Sturm et al. use the co-planarity constraints to complete the reconstruction, via alternating reconstruction of points and planes.

Li et al. [11] also perform interactive 3D reconstruction of piecewise planar objects. Their approach is based on image regularities such as connectivity, parallelism, and orthogonality. In this approach these regularities are provided by the user; for example, the user is required to draw the edges of the objects. Li et al. say that if the user is able to also draw the occluded edges, their algorithm can recover both the visible and invisible shapes of the objects. Li et al. represent the object in an image as a shape vector. They formulate the regularities as geometric constraints and solve an optimization problem to obtain the optimal shape vector. A system of equations is derived in terms of the shape vector and the focal length.

In [20] Han et al. perform Bayesian reconstruction of 3D shapes and scenes from a single image. They represent prior knowledge of 3D shapes and scenes by probabilistic models at two levels, both defined on graphs. The first-level model is a mixture for both man-made block objects and natural objects such as trees. The second-level model is built on the relation graph of all objects in a scene. Han et al. extract the geometry through image segmentation and sketching algorithms into a large graph, which is partitioned into subgraphs, each being an object. Han et al. recover 3D for each subgraph by inferring the 3D shape and recovering occluded surfaces, edges, and vertices in each subgraph. Last, they infer the scene structure between the recovered 3D subgraphs.


3.4 Summary

At the moment, the state-of-the-art solution in the field of automatic 3D reconstruction from a single image is the Make3D software (http://make3d.cs.cornell.edu/). It takes a 2D image and creates a 3D "fly around" model, giving the viewer access to the scene's depth and a range of points of view. This software is the result of the work of Saxena et al. [15] [14]. The authors say that the technology works better than any other so far, but it is not perfect. It is at its best with landscapes and scenery rather than close-ups of individual objects. They hope to improve it by introducing object recognition: if the software can recognize a human form in a photo, it can make more accurate distance judgments based on the size of the person in the photo.

On the other hand, interactive methods for a single image seem to focus more on 3D reconstruction of individual or planar objects rather than landscape scenes. It is difficult to say which method is the very best at the moment. It seems, though, that Ting et al. [17] and Li et al. [11] have both managed to get good results with their methods. Personally, I value the method of Ting et al. slightly more, since it can also reconstruct non-planar objects. With these methods, the major part of the reconstruction processing time is spent on user interaction.


4 REPRESENTING 3D SURFACES

4.1 Introduction

In this chapter the Open Graphics Library (OpenGL) is introduced briefly: what is the Open Graphics Library and what does it do? The introduction is based on the work of Dave Shreiner, a computer graphics specialist at ARM, Inc. According to his O'Reilly community biography at http://www.oreillynet.com, he has been working with OpenGL since its inception at Silicon Graphics Computer Systems (SGI). The biography says that during his 15-year tenure at SGI, he authored the first commercial OpenGL training course, co-authored the OpenGL programming guide [21] and reference manuals, and engineered OpenGL drivers for a multitude of different systems.

According to Shreiner [21], OpenGL is a software interface to graphics hardware. It is a hardware-independent interface that can be implemented on many different hardware platforms. OpenGL does not include commands for performing windowing tasks or for obtaining user input, nor does it provide high-level commands for describing models of 3D objects. With OpenGL, a desired model must be built from a small set of geometric primitives – points, lines, and polygons – that are specified by their vertices. [21]

The final rendered image consists of pixels drawn on the screen, and a pixel is the smallest visible element the display hardware outputs on the screen. The information about the pixels is organized in memory into bit-planes. A bit-plane is an area of memory that holds information for every pixel on the screen. The bit-planes are themselves organized into a frame buffer, which holds all the information that the graphics display needs to control the color and intensity of all the pixels on the screen. [21]

In [21] OpenGL is presented as a state machine: various states or modes are set and then remain in effect until they are changed. Among the state variables that OpenGL maintains are the current color, the current viewing and projection transformations, line and polygon stipple patterns, polygon drawing modes, pixel packing conventions, positions and characteristics of lights, and material properties of the objects being drawn. [21]

Most implementations of OpenGL have a similar order of operations, a series of processing stages called the OpenGL rendering pipeline. This ordering, shown in Figure 9, is not a strict rule about how OpenGL is implemented, but it provides a reliable guide for predicting what OpenGL will do. [21]


Figure 9. Order of Operations [21]

In Figure 9 we can see an object called a display list. All data, whether geometry or pixels, can be saved in a display list for current or later use. When the display list is executed, the retained data is sent from it just as if it were sent by the application in immediate mode.
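As a brief illustration, the following minimal C sketch records a triangle into a display list and replays it later. It assumes a valid rendering context has already been created; the vertex coordinates are arbitrary.

    /* Sketch of display list usage in fixed-function OpenGL.
     * A valid rendering context is assumed to exist. */
    #include <GL/gl.h>

    static GLuint mesh_list;

    void build_list(void)
    {
        mesh_list = glGenLists(1);          /* reserve one list name       */
        glNewList(mesh_list, GL_COMPILE);   /* record commands, don't draw */
        glBegin(GL_TRIANGLES);
        glVertex3f(0.0f, 0.0f, 0.0f);
        glVertex3f(1.0f, 0.0f, 0.0f);
        glVertex3f(0.0f, 1.0f, 0.0f);
        glEnd();
        glEndList();
    }

    void draw(void)
    {
        glCallList(mesh_list);  /* replay the retained data */
    }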

Figure 10 illustrates all the geometric primitives in OpenGL, which are eventually described by vertices. The "per-vertex operations" stage (see Figure 9) converts the vertices into primitives. If texturing is used, texture coordinates (see Figure 11) may be generated and transformed in the "per-vertex operations" stage. Likewise, if lighting is used, the lighting calculations involving the material properties are performed in this stage.

The results of the primitive assembly stage in Figure 9 are complete geometric primitives: the transformed and clipped vertices with related color, depth and texture-coordinate values, together with guidelines for the rasterization step. The elimination of portions of geometry that fall outside a half-space, defined by a plane, is the major part of primitive assembly. [21]

Pixel data takes a different route through the rendering pipeline than geometric data. The pixels from an array in system memory are first unpacked from one of a variety of formats into the proper number of components. Next the data is scaled, biased, and processed by a pixel map. The results are clamped and stored into texture memory or sent to the rasterization stage. [21]

OpenGL applications can apply texture images to geometric objects to make them look more realistic, and almost all OpenGL implementations have special resources for accelerating texture performance. The texture assembly stage helps the OpenGL implementation manage these memory resources efficiently.

Rasterization in Figure 9 is the conversion of both geometric and pixel data into fragments.


Figure 10. Geometric Primitive Types [21].

Figure 11. Linear texture mapping [4]. The patch determined by the corners (smin,tmin) and(smax,tmax) corresponds to the surface patch with corners (umin,vmin) and (umax,vmax).

Page 33: Assisted 3D reconstruction from a single vieAssisted 3D reconstruction from a single view Master’s Thesis 2012 88 pages, 46 figures, 7 tables, and 2 appendices. Keywords: interactive

33

Each fragment square corresponds to a pixel in the frame buffer.

Before values are actually stored in the frame buffer, a series of operations is performed that may alter or even discard fragments. All of these operations can be enabled or disabled. The first operation a fragment might encounter is texturing, where a texel (texture element) is generated from texture memory for each fragment and applied to the fragment. Next, primary and secondary colors are combined and, for example, a fog calculation may be applied.

4.2 The OpenGL interface

OpenGL contains rendering commands but is designed to be independent of any window system or operating system. This is why OpenGL contains no commands for opening windows or for reading events from the keyboard or mouse. The OpenGL functions are in a single library called GL. [4]

To interface with the window system and to get input from external devices into OpenGL programs, additional libraries are required. For each major operating system there is a system-specific library that provides the "glue" between the window system and OpenGL: for the X Window System this library is called GLX, for Windows it is wgl, and for the Macintosh it is agl. Rather than using a different library for each system, there are two readily available libraries, the OpenGL Extension Wrangler (GLEW) and the OpenGL Utility Toolkit (GLUT). GLEW removes operating system dependencies. GLUT provides the minimum functionality that should be expected in any modern windowing system. [4]

Figure 12 illustrates the organization of the libraries for the X Window System environment. For this window system, GLUT uses the GLX and X libraries. As the figure shows, an application program can use only the GLUT functions, which is why it can be recompiled with the GLUT library for other window systems. GLUT also hides the complexities of the differing window system application programming interfaces (APIs). [4]
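To make the division of labour concrete, the following minimal sketch shows the typical skeleton of a GLUT program: only GLUT functions touch the window system, while the drawing itself uses plain GL calls. The window title and size here are arbitrary choices.

    /* Minimal GLUT skeleton: GLUT handles the window system,
     * GL handles the drawing. */
    #include <GL/glut.h>

    void display(void)
    {
        glClear(GL_COLOR_BUFFER_BIT);
        /* ... OpenGL drawing commands go here ... */
        glFlush();
    }

    int main(int argc, char **argv)
    {
        glutInit(&argc, argv);                       /* connect to window system  */
        glutInitDisplayMode(GLUT_SINGLE | GLUT_RGB); /* frame buffer configuration */
        glutInitWindowSize(512, 512);
        glutCreateWindow("example");
        glutDisplayFunc(display);                    /* register draw callback    */
        glutMainLoop();                              /* enter the event loop      */
        return 0;
    }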

4.3 Polygons in OpenGL

Line segments and polylines can model the edges of objects, but closed objects also have interiors (Figure 13). Usually the name polygon is reserved for an object that has a border that can be described by a line loop and that also has a well-defined interior.


Figure 12. Library organization [4].

Polygons play a special role in computer graphics because they can be displayed rapidly and can be used to approximate arbitrary surfaces. [4]

Figure 13. Filled objects. [4]

Figure 14. Methods of displaying a polygon. [4]

A polygon can be rendered in different ways: only its edges can be rendered, or its interior can be filled with a solid color or pattern, with the edges rendered or not, as shown in Figure 14. Although the outer edges of a polygon are defined easily by an ordered list of vertices, if the interior is not well defined the polygon may not be rendered at all, or may be rendered in an undesirable manner. Three properties ensure that a polygon will be displayed correctly: it must be simple, convex, and flat. [4]

In 2D, a polygon is simple as long as no two of its edges cross each other. From the perspective of implementing a practical algorithm to fill the interior of a polygon, simplicity alone is not always enough. Some APIs guarantee a consistent fill only when the polygon is convex. An object is convex if all points on the line segment between any two points inside the object, or on its boundary, are inside the object. Convex objects include triangles, tetrahedra, rectangles, circles, spheres and all parallelepipeds. [4]

In 3D, polygons present a few more difficulties because, unlike for 2D objects, the vertices that specify the polygon need not all lie in the same plane. The basis of OpenGL polygons is that three non-collinear vertices determine both a triangle and the plane in which that triangle lies. By always using triangles, we can be sure that these objects will be rendered correctly; consequently, triangles are the only polygons that OpenGL supports. Triangles can be displayed in three ways: as points corresponding to the vertices, as edges, or with the interiors filled. [4]
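The three display modes can be selected with glPolygonMode, as in this small sketch (the vertex coordinates are arbitrary):

    /* Sketch: the three ways OpenGL can display a triangle. */
    #include <GL/gl.h>

    void draw_triangle(GLenum mode)
    {
        /* mode is GL_POINT (vertices), GL_LINE (edges) or GL_FILL (interior) */
        glPolygonMode(GL_FRONT_AND_BACK, mode);
        glBegin(GL_TRIANGLES);
        glVertex3f(-0.5f, -0.5f, 0.0f);
        glVertex3f( 0.5f, -0.5f, 0.0f);
        glVertex3f( 0.0f,  0.5f, 0.0f);
        glEnd();
    }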

4.4 Surfaces from triangles

When we are interested in objects with interiors, general polygons are problematic. A set of vertices may not all lie in the same plane, or may specify a polygon that is neither simple nor convex. Such problems do not arise with triangles: as long as the three vertices of a triangle are not collinear, its interior is well defined and the triangle is simple, flat and convex. Triangles are easy to render, and for these reasons they are the only fillable geometric entity that OpenGL recognises. In practice, one needs to deal with more general polygons. The usual strategy is to start with a list of vertices and generate a set of triangles consistent with the polygon defined by the list, a process known as triangulation. [4]

Figure 15. (a) Quadrilateral. (b) A triangulation. (c) Another triangulation. [4]

Although every set of vertices can be triangulated, not all triangulations are equivalent. Consider the quadrilateral in Figure 15: it can be triangulated in the two ways shown in Figures 15b and 15c. In Figure 15b we create two long thin triangles rather than two triangles closer to being equilateral, as in Figure 15c. Long thin triangles can create visual artifacts in rendering. A simple triangulation can be started from any three consecutive vertices, which form the first triangle. The middle vertex is then removed from the list of vertices and the process repeated until only three vertices are left, which form the final triangle.


This process is illustrated in Figure 16, but it does not guarantee a good set of triangles, nor can it handle concave polygons; a sketch of the procedure follows Figure 16. [4]

Figure 16. Recursive triangulation of a convex polygon. [4]
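A minimal C sketch of this procedure, under the assumption that the input polygon is convex and stored as an ordered vertex array; consuming the "second" vertex at every step is equivalent to a triangle fan from vertex 0, and the vertex and triangle types are hypothetical helpers.

    /* Sketch of the recursive triangulation of a convex polygon:
     * vertex 0 stays, the "second" vertex is consumed at every step. */
    typedef struct { float x, y; } vertex;
    typedef struct { vertex a, b, c; } triangle;

    /* Writes n - 2 triangles into out; assumes poly is convex, n >= 3. */
    int triangulate_fan(const vertex *poly, int n, triangle *out)
    {
        int i, count = 0;
        for (i = 1; i + 1 < n; i++) {
            out[count].a = poly[0];
            out[count].b = poly[i];
            out[count].c = poly[i + 1];
            count++;
        }
        return count;  /* equals n - 2 */
    }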

The Delaunay triangulation algorithm finds the best triangulation in the sense that, for the circle determined by any triangle, no other vertex lies inside this circle. This triangulation was invented by Boris Delaunay in 1934. Delaunay proved that when the dual graph of the Voronoi diagram is drawn with straight lines, it produces a planar triangulation of the Voronoi sites [22]. The concept of Voronoi diagrams is more than a century old: it was discussed already in 1850 by Dirichlet and in a 1908 paper by Voronoi. Triangulation is a special case of the more general problem of tessellation, which divides a polygon into a polygonal mesh, not all of whose elements need be triangles. [4]

4.5 Delaunay triangulation

The Delaunay triangulation procedure is given in [4]. At first we have a set of points in a plane, and there are many ways to form a triangular mesh that uses all the points as vertices. Even four vertices that specify a convex quadrilateral can form a two-triangle mesh in two ways, depending on which way the diagonal is drawn, as previously shown in Figure 15. For a mesh of n vertices, there will be n − 2 triangles in the mesh. Because we always want a mesh in which no edges cross, the four outer edges in Figure 15a must be in the mesh; note that they form the convex hull of the four points. Hence we only have a choice as to the diagonal. In Figure 15b the diagonal creates two long thin triangles, whereas the diagonal in Figure 15c creates two more robust triangles. We prefer the second case because long thin triangles tend to render badly, showing artifacts from the interpolation of vertex attributes.

In general, the closer a triangle is to an equilateral triangle, the better it is for rendering. In more mathematical terms, the best triangles have the largest minimum interior angle. If two triangle meshes derived from the same set of points are compared, the better mesh is the one with the largest minimum interior angle over all the triangles in the mesh. Although it may appear that determining such a mesh for a large number of triangles is difficult, the problem can be approached in a manner that yields the largest minimum angle. [4]

Figure 17. Circles determined by possible triangulations. [4]

As an example, consider the situation in Figure 17, where some vertices in the plane will be part of a triangle mesh. Focusing on the vertex v, it appears that either the triangle a, v, c or the triangle v, c, b should be part of the mesh. Recall that three points in the plane determine a unique circle that interpolates them. Note that the circle formed by a, v, c does not include another point, whereas the circle formed by v, c, b does. Moreover, the triangle formed by v, c, b has a smaller minimum angle than the triangle formed by a, v, c. Because these two triangles share the edge v, c, only one of them can be used in the mesh. These observations suggest a strategy known as Delaunay triangulation. Given a set of n points in the plane, the Delaunay triangulation has the following properties, any one of which is sufficient to define the triangulation:

1. For any triangle in the Delaunay triangulation, the circle passing through its three vertices has no other vertices in its interior.

2. For any edge in the Delaunay triangulation, there is no circle passing through the endpoints (vertices) of this edge that includes another vertex in its interior.

3. If we consider the set of angles of all triangles in a triangulation, the Delaunay triangulation has the greatest minimum angle.

The third property ensures that the triangulation is a good one for computer graphics [4]. The first two properties follow from how the triangulation is constructed. [4]

The Delaunay triangulation begins by adding three vertices such that all the points in the set of vertices lie inside the triangle formed by these vertices (see Figure 18).


Figure 18. Starting a Delaunay triangulation. [4]

These three added vertices can be removed later on. Next, a vertex v is selected from the data set at random and connected to the three added vertices. This results in three triangles, shown in Figure 19.

Figure 19. Triangulation after adding first data point. [4]

Next, a vertex u is randomly picked from the remaining vertices. Figure 20 shows that this vertex lies inside the triangle a, v, c and that the three triangles it forms do not present a problem. However, the edge between a and v is a diagonal of the quadrilateral a, u, v, b, and the circle that interpolates a, u and v has b in its interior. If this edge is used, a Delaunay triangulation criterion is therefore violated [4].

Figure 20. Adding a vertex requiring flipping. [4]

As a solution, the other diagonal of the quadrilateral can be chosen: the edge a, v is replaced with the edge u, b. This operation is called flipping [4]. The resulting partial mesh is shown in Figure 21. The circle now passes through u, v, b and does not include any other vertices, nor do any of the other circles determined by these triangles. A Delaunay triangulation of a subset of the points has now been formed. [4]

Figure 21. Mesh after flipping. [4]

The procedure continues by adding another randomly chosen vertex from the original set of vertices and flipping edges when necessary. Because each flip is an improvement, the process terminates for each added vertex. Once all the vertices have been added, the three vertices added at the start, and the edges connected to them, can be removed. On average, the triangulation has O(n log n) complexity. [4]

4.6 Tools for Delaunay triangulation

There are open source tools available that can, among other things, produce a Delaunay triangulation for a given point cloud. Because this thesis is restricted to the C programming language, we concentrate on solutions implemented in C, which narrows down the available candidates.

The GNU Triangulated Surface Library (GTS) is an open source free software library intended to provide a set of useful functions for dealing with 3D surfaces meshed with interconnected triangles. GTS is written entirely in C with an object-oriented approach based mostly on the design of GTK+. The initial goal of GTS is to provide a simple and efficient library to scientists dealing with 3D computational surface meshes. One feature of GTS is 2D dynamic Delaunay and constrained Delaunay triangulations.

Qhull (http://www.qhull.org) is a general dimension code for computing convex hulls, Delaunay triangulations, Voronoi vertices, furthest-site Voronoi vertices, and half-space intersections. It appears to be the choice for higher-dimensional convex hull applications. Qhull is written in C and implements the divide-and-conquer Quickhull algorithm [23]. It has been widely used in scientific applications.

Triangle (http://www.cs.cmu.edu/~quake/triangle.html) is a C program by Shewchuk [24] [25] for two-dimensional mesh generation and the construction of Delaunay triangulations, constrained Delaunay triangulations and Voronoi diagrams. As discussed earlier, having no small angles in the triangle mesh elements is essential for generating good-quality meshes. Shewchuk [25] notes that the problem is to find a triangulation that covers a specified domain and contains only triangles whose shapes and sizes satisfy constraints: the angles should not be too small or too large, and the triangles should not be smaller than necessary, nor larger than desired. In Triangle, guaranteed-quality meshes are generated using Ruppert's Delaunay refinement algorithm [26]. Ruppert's algorithm produces meshes with no small angles, using relatively few triangles, and allows the density of triangles to vary quickly over short distances. As input Triangle takes a planar straight line graph (PSLG), which is a collection of vertices and segments; the endpoints of every segment are included in the list of vertices. It is possible to call Triangle from other programs. This requires that Triangle is compiled into an object file (triangle.o); instructions for doing this are provided with the source code. The resulting object file can then be called via the single procedure triangulate().

In this work Triangle is used for the Delaunay triangulation. It is fast, memory efficient, small in size, and robust. It is possible to generate a reduced version of the Triangle object file and include it in the program. Used this way, Triangle does not require pre-installed libraries, as using GTS or Qhull would; a sketch of the call is given below. In Chapter 5 we discuss the steps needed for extracting the collection of vertices from the image. This collection is then the input for the Delaunay triangulation.
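As an illustration, a sketch of such a call is shown below. The field and switch names follow Triangle's documented struct triangulateio interface ("z" selects zero-based indexing, "Q" suppresses console output); the REAL and VOID macros must match how triangle.o was compiled, and error handling and the use of the resulting triangle list are omitted.

    /* Sketch: plain Delaunay triangulation of a point set with
     * Shewchuk's Triangle, linked in as triangle.o. */
    #include <stdlib.h>
    #include <string.h>
    #define REAL double   /* must match the triangle.o build */
    #define VOID void
    #include "triangle.h"

    void delaunay(REAL *xy, int n)
    {
        struct triangulateio in, out;
        char switches[] = "zQ";

        memset(&in,  0, sizeof in);   /* unused fields must be NULL/0 */
        memset(&out, 0, sizeof out);  /* Triangle allocates outputs   */

        in.numberofpoints = n;
        in.pointlist = xy;            /* x1 y1 x2 y2 ... */

        /* Triangle fills out.trianglelist with three vertex
         * indices per triangle. */
        triangulate(switches, &in, &out, NULL);

        /* ... use out.numberoftriangles and out.trianglelist ... */

        free(out.pointlist);
        free(out.trianglelist);
    }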


5 FEATURE POINT EXTRACTION

5.1 Introduction

As discussed earlier in Chapter 4.5, mesh modelling of an image involves partitioning the image domain into a collection of non-overlapping patches, called mesh elements. When Delaunay triangulation is applied, the mesh elements are triangles (see Figure 22). To be able to perform a Delaunay triangulation that adapts to the image content, as in Figure 22, one needs to extract a set of vertices from the original image.

Figure 22. A 128 × 128 section of the original image "Lena" and a triangulation that adapts to the image content. [27]

Yang et al. [27] presented a procedure for generating a mesh structure for image representation that adapts well to the content of an image. Their procedure can be viewed as a method of non-uniform sampling in the image domain, wherein the non-uniform samples (mesh nodes) are placed by an error-diffusion algorithm. Error diffusion is a type of halftoning in which the quantization residual is distributed to neighbouring pixels that have not yet been processed; its main use is to convert a gray-level image into a binary image. The main idea of the method is that dense sampling is used in regions containing high-frequency features and coarse sampling in smooth regions of the image. Yang et al. have demonstrated with numerical results that their method is fast, effective and robust to noise. They also point out that their method makes it very easy to control the number of mesh nodes in the resulting mesh. These properties are ideal for us, because we do not want to jeopardize usability with long processing times, and control over the number of mesh nodes is also a very useful feature.

In this work the mesh modelling is done on the basis of Yang et al. [27]. Their method consists of the following three steps, which are discussed in this chapter. First, a feature map of the image is computed; the feature map describes the spatial distribution of the largest-magnitude second directional derivatives of the image. Derivatives are known to be sensitive to noise in the image data, so when an image is noisy it must be pre-filtered. In this work it is assumed that the input images being processed into 3D are color images in the RGB image format. To ensure that results similar to those in [27] can be expected, the color images are first converted to grayscale. In the second phase of the method, the Floyd-Steinberg error-diffusion algorithm [28] is employed to distribute the mesh nodes so that their local spatial density is proportional to the corresponding value of the computed feature map; Floyd-Steinberg error diffusion is discussed further in Section 5.5. In the final step, a two-dimensional Delaunay triangulation algorithm is used to connect the mesh nodes. Delaunay triangulation was introduced in Chapter 4.5.

5.2 Image noise reduction

Images invariably suffer from random degradations that are collectively referred to as noise. These degradations can have numerous causes: for example, radiation scatter from the surface before the image is sensed, electrical noise as the image is transmitted over a communication channel, or bit errors after the image is digitized [29]. There are many types of noise that can contaminate the 2D signal of an image. Gaussian noise can be found in almost any signal; for example, the familiar white noise on a weak television station is well modelled as Gaussian. The grain noise in photographic films is sometimes modelled as Gaussian and sometimes as Poisson. Salt and pepper noise is named after its visual effect, which manifests as white and black dots in an image, as if salt and pepper had been scattered over a sheet of paper. Quantization noise results from converting a continuous signal into a digital representation of finite precision; it also arises when changing from one digital representation to another with smaller precision (fewer bits). Photon counting noise arises from the physical properties of image acquisition systems that rely on photon counting: for instance, the shutter speed of a camera influences the number of photons that reach the sensor and thereby contributes to the photon counting noise. [29] [1]

The type of filter chosen to remove the noise depends on the type of noise present in the image, and the choice also affects how well the important details remain unaffected by the filtering operation. The simplest order-statistic-based estimator is the sample median. Gaussian-type noise is better filtered using a mean filter, while salt and pepper noise is better filtered using a median filter. The median has some interesting properties: its value is always one of the samples, and it tends to blur images less than the mean. Bovik [29] notes that the median can pass an edge without any blurring at all. The disadvantage of using a mean filter instead of a median filter is that the mean filter removes high spatial frequencies, which blurs the edges within an image; the blurriness can be worse than the noise. [1] [30]

Linear estimators often do not perform well in image filtering because the noise-free image is not well modelled as a constant. If both the noise-free image and the noise are Gaussian, the optimal estimators can be found. However, images often have small details and sharp edges, and thus they are not well modelled as Gaussian. Bovik [31] shows that the filtered image is often more objectionable than the original. [31]

Discrete averaging refers to the process of low-pass filtering discrete signals. In simple words, it is a process of substituting the value of a pixel with a value computed as an average of its surrounding pixels, usually multiplied by some weighting parameters. This kind of low-pass filtering is ubiquitous in all areas of digital signal processing, and also in computer vision. The most common application is removal of an unwanted component of a signal, commonly known as noise. [1]
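A minimal sketch of such an averaging filter with a uniform 3 × 3 window; the image buffers are hypothetical row-major grayscale arrays, and border pixels are simply copied through.

    #include <string.h>

    /* Sketch of discrete averaging: each interior pixel is replaced
     * by the uniform 3x3 mean (weights 1/9) of its neighbourhood. */
    void mean_filter_3x3(const unsigned char *img, unsigned char *out,
                         int w, int h)
    {
        memcpy(out, img, (size_t)w * h);  /* keep border pixels as-is */
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        sum += img[(y + dy) * w + (x + dx)];
                out[y * w + x] = (unsigned char)(sum / 9);
            }
        }
    }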

5.3 Color to grayscale conversion

Gray level is the shade of gray between black and white in a (monochrome) image. In an image with 256 gray levels, white is typically represented by the value 255 and black by zero [30]. In this work the images being converted to 3D are expected to be color images in the RGB color format. In the additive, three-primary RGB color system there are conceptually separate channels for red, green and blue, and each pixel has separate red, green and blue components. The conversion of a color image to grayscale is not unique. As described in [32], green appears as the brightest of the three primaries, red appears less bright, and blue the darkest of the three.

The CIE system was defined in 1931 by the Commission Internationale de l'Éclairage (CIE). It defines a device-independent color space based on the characteristics of human color perception. The CIE set up a hypothetical set of primaries, XYZ, that correspond to the way the eye's retina behaves, based on color-matching experiments with human observers. On the basis of the experimental results, the CIE defined the primaries so that all visible light maps to an additive mixture of X, Y and Z, with Y correlating approximately to the apparent brightness of a color. The CIE system precisely defines all colors; with it as a standard reference, colors can be transformed from the native (device-dependent) color space of one device to the native color space of another device. [33]


Brightness is defined by the CIE as the attribute of a visual sensation according to which an area appears to emit more or less light. Because brightness perception is very complex, the CIE defined a more tractable quantity, luminance, which is radiant power weighted by the spectral sensitivity functions that are characteristic of vision. The luminous efficiency of the Standard Observer is defined numerically, is everywhere positive, and peaks at about 555 nm. When a spectral power distribution is integrated using this curve as a weighting function, the result is CIE luminance, denoted Y. The magnitude of luminance is proportional to physical power; in that sense it is like intensity, but the spectral composition of luminance is related to the brightness sensitivity of human vision. Luminance should be expressed in a unit such as candelas per square meter, but in practice it is often normalized to 1 or 100 units with respect to the luminance of a specified or implied white reference. As an example, Poynton [32] refers to a studio broadcast monitor whose white reference has a luminance of about 100 cd·m⁻², with Y = 1 referring to this value. [32]

A common strategy for converting RGB color to gray level is to match the luminance of the grayscale image to the luminance of the color image. To get the luminance of a color we use the formula recommended by the CIE:

\[ L = 0.2126 \times R + 0.7152 \times G + 0.0722 \times B \qquad (5) \]
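Applied per pixel, equation (5) leads to a conversion routine along these lines (a sketch assuming 8-bit channels):

    /* Sketch: RGB pixel to gray level with the CIE luminance
     * weights of equation (5). */
    unsigned char rgb_to_gray(unsigned char r, unsigned char g,
                              unsigned char b)
    {
        double lum = 0.2126 * r + 0.7152 * g + 0.0722 * b;
        return (unsigned char)(lum + 0.5);  /* round to nearest level */
    }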

5.4 Calculation of partial derivatives

In [27], the calculation of the feature map σ involves calculating the second partial derivatives of the image function f(x). In reality, f(x) is available only in the form of discrete image samples f(i, j), and the partial derivatives must be estimated from f(i, j). The partial derivatives in [27] are computed for each pixel f(i, j) using finite differences as follows:

\[ f''_{xx} \approx f(i+1, j) - 2f(i, j) + f(i-1, j) \qquad (6) \]

\[ f''_{xy} \approx \frac{1}{4}\left[ f(i+1, j+1) - f(i+1, j-1) - f(i-1, j+1) + f(i-1, j-1) \right] \qquad (7) \]

\[ f''_{yy} \approx f(i, j+1) - 2f(i, j) + f(i, j-1) \qquad (8) \]

The estimates in equations (6)–(8) can be viewed as the result of linear filtering of the image f(x) [27]. Figure 23 shows an example of the feature map.

Figure 23. The original image (top). Feature map calculated from gray level image (bottom).
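As an illustration, the following sketch evaluates equations (6)–(8) for one interior pixel and combines them into the magnitude of the largest Hessian eigenvalue, i.e. the largest-magnitude second directional derivative. Whether [27] uses exactly this combination is an assumption here, and pre-filtering and border handling are omitted.

    #include <math.h>

    /* Sketch: per-pixel feature value from the finite differences of
     * equations (6)-(8). f is a hypothetical row-major grayscale image
     * of width w, indexed as f(i, j) = f[j * w + i]; (i, j) must be an
     * interior pixel. */
    double feature_value(const double *f, int w, int i, int j)
    {
        double fxx = f[j * w + i + 1] - 2.0 * f[j * w + i] + f[j * w + i - 1];
        double fyy = f[(j + 1) * w + i] - 2.0 * f[j * w + i] + f[(j - 1) * w + i];
        double fxy = 0.25 * (f[(j + 1) * w + i + 1] - f[(j - 1) * w + i + 1]
                           - f[(j + 1) * w + i - 1] + f[(j - 1) * w + i - 1]);

        /* Eigenvalues of the 2x2 Hessian are t +/- s. */
        double t = 0.5 * (fxx + fyy);
        double s = sqrt(0.25 * (fxx - fyy) * (fxx - fyy) + fxy * fxy);
        return fabs(t) + s;  /* |largest eigenvalue| of the Hessian */
    }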


5.5 Floyd Steinberg error diffusion

Floyd-Steinberg dithering is a method of dithering first published in 1976 by Robert W. Floyd and Louis Steinberg [28]. Dithering is a technique used in computer graphics to create the illusion of more colors when displaying an image that has a low color depth. In a dithered image, the missing colors are reproduced by a certain arrangement of pixels in the available colors; the human eye perceives this as a mixture of the individual colors. Such methods are used, for example, in the printing industry. The human visual system tends to merge small dots together and sees not the dots but an intensity proportional to the ratio of white to black in a small area.

The dithering process begins in the upper left corner of the image. For each pixel, the closest available color in the palette is chosen and the difference between that color and the original color is computed in each RGB channel. Then specific fractions of these differences are dispersed among several adjacent pixels that have not yet been visited (below and to the right of the original pixel). Because of the order of processing, the procedure can be done in a single pass over the image.

The Floyd-Steinberg dithering algorithm is based on error dispersion. The error dispersion technique is very simple to describe: for each point in the image, first find the closest available color and calculate the difference between the value in the image and that color. Then divide up this error and distribute it over the neighbouring pixels that have not yet been visited. When these later pixels are reached, the errors distributed from the earlier ones are added in, the values are clipped to the allowed range if needed, and processing continues as above.

There are many ways to distribute the errors and many ways to scan the image. The two basic ways to scan the image are with a normal left-to-right, top-to-bottom raster, or with an alternating left-to-right then right-to-left raster; the latter method generally produces fewer artifacts.

The error diffusion can be expressed as a filter. The Floyd-Steinberg dithering filter pattern matrix is presented in Table 1.

In this filter, the X represents the pixel currently being scanned, and the numbers (called the weights) represent the proportion of the error distributed to the pixel in that position.


Table 1. The Floyd-Steinberg dithering matrix.

      -       -       -
      -       X      7/16
    3/16    5/16     1/16

Here, the pixel immediately to the right gets 7/16 of the error (the divisor is 16 because the weights sum to 16), the pixel directly below gets 5/16 of the error, and the diagonally adjacent pixels get 3/16 and 1/16. When scanning a line right-to-left, this pattern is reversed. The pattern was chosen carefully so that it produces a checkerboard pattern in areas with an intensity of 1/2 (128 in our image). It is also fairly cheap to compute when the division by 16 is replaced by shifts.

In our implementation we are dithering a grayscale image for output to a black-and-white image, so "find closest color" is just a simple threshold operation between white and black. In color, it involves matching the input color to the closest available hardware color, which can be difficult depending on the hardware palette.
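Put together, the grayscale-to-binary case described above can be sketched as follows, using a left-to-right, top-to-bottom scan; the buffer is a hypothetical w × h array of int so that diffused error may temporarily leave the 0–255 range.

    /* Sketch: Floyd-Steinberg dithering of a grayscale image to
     * black and white, in place. */
    void floyd_steinberg(int *img, int w, int h)
    {
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int old = img[y * w + x];
                int q = (old < 128) ? 0 : 255;   /* closest "color"   */
                int err = old - q;               /* quantization error */
                img[y * w + x] = q;
                if (x + 1 < w)
                    img[y * w + x + 1] += err * 7 / 16;
                if (y + 1 < h) {
                    if (x > 0)
                        img[(y + 1) * w + x - 1] += err * 3 / 16;
                    img[(y + 1) * w + x] += err * 5 / 16;
                    if (x + 1 < w)
                        img[(y + 1) * w + x + 1] += err * 1 / 16;
                }
            }
        }
    }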

Figure 24. The Floyd-Steinberg error diffusion algorithm executed on the feature map (see Figure 23).


6 THIN PLATE SPLINES

6.1 Introduction

The thin plate spline, or TPS for short, is an interpolation method that finds a "minimally bended" smooth surface that passes through all given points. The method was first introduced for geometric design applications by Duchon [34]. The name "thin plate" comes from the fact that a TPS more or less simulates how a thin metal plate would behave if it were forced through the given control points. A TPS through three control points is a plane, through more than three it is generally a curved surface, and with fewer than three it remains undefined. Thin plate splines are a well-known entity in geometric design. They are defined as the minimizer of a variational problem whose differential operators approximate a simple notion of bending energy. Thin plate splines therefore approximate surfaces with minimal bending energy, and they are widely considered the standard "fair" surface model. Such surfaces are desired in many modelling and design applications [35].

In Figure 25a there are two sets of points for which the correspondences are assumed to be known. The TPS warping [36] allows a perfect alignment of the points, and the bending of the grid shows the deformation needed to bring the two sets on top of each other (Figure 25b). Note that when TPS is applied to coordinate transformation, two splines are actually used: one for the displacement in the x direction and one for the displacement in the y direction. The displacement in each direction is treated as a height map for the points, and a spline is fitted as in the case of scattered points in 3D space. Finally the two resulting transformations are combined into a single mapping.

Figure 25. Simple example of coordinate transformation using TPS. [37]


6.2 Definition

The following mathematical definition of TPS is presented in [37] [36]. Let $v_i$ denote the target function values at locations $(x_i, y_i)$ in the plane, with $i = 1, 2, \ldots, p$. Setting $v_i$ equal to the target coordinates $(x'_i, y'_i)$ in turn yields one continuous transformation for each coordinate (see Figure 25). The assumption is that the locations $(x_i, y_i)$ are all different and not collinear. The TPS interpolant $f(x, y)$ minimizes the bending energy

\[ I_f = \iint_{\mathbb{R}^2} \left( f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2 \right) dx\, dy \qquad (9) \]

and has the form

\[ f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{p} w_i\, U(\| (x_i, y_i) - (x, y) \|) \qquad (10) \]

Figure 26. The thin plate spline radial basis function $U(r) = r^2 \log r$. [37]

where $U(r) = r^2 \log r$ (see Figure 26). In order for $f(x, y)$ to have square-integrable second derivatives, it is required that

\[ \sum_{i=1}^{p} w_i = 0 \qquad (11) \]

and

\[ \sum_{i=1}^{p} w_i x_i = \sum_{i=1}^{p} w_i y_i = 0. \qquad (12) \]

Together with the interpolation conditions, $f(x_i, y_i) = v_i$, this yields a linear system for the TPS coefficients:

\[ \begin{bmatrix} K & P \\ P^T & O \end{bmatrix} \begin{bmatrix} \vec{w} \\ \vec{a} \end{bmatrix} = \begin{bmatrix} \vec{v} \\ \vec{o} \end{bmatrix} \qquad (13) \]

where $K_{ij} = U(\| (x_i, y_i) - (x_j, y_j) \|)$, the $i$th row of $P$ is $(1, x_i, y_i)$, $O$ is a $3 \times 3$ matrix of zeros, $\vec{o}$ is a $3 \times 1$ column vector of zeros, $\vec{w}$ and $\vec{v}$ are column vectors formed from the $w_i$ and $v_i$ respectively, and $\vec{a}$ is the column vector with elements $a_1, a_x, a_y$. Denote the $(p+3) \times (p+3)$ matrix of this system by $L$; as discussed in [36], $L$ is nonsingular. If the upper-left $p \times p$ block of $L^{-1}$ is denoted by $L_p^{-1}$, then it can be shown that

\[ I_f = w^T K w = v^T L_p^{-1} K L_p^{-1} v = v^T L_p^{-1} v \qquad (14) \]

When there is noise in the specified values $v_i$, one may wish to relax the exact interpolation requirement by means of regularization. This is accomplished by minimizing

\[ H[f] = \sum_{i=1}^{n} (v_i - f(x_i, y_i))^2 + \lambda I_f. \qquad (15) \]

The regularization parameter $\lambda$, a positive scalar, controls the amount of smoothing; the limiting case of $\lambda = 0$ reduces to exact interpolation.

6.3 Solving with LU -decomposition

The source data points (x, y and z coordinates) of the new depth-map model (the control points) and the relaxation parameter λ are the input data for the thin plate spline routine. The relaxation parameter λ determines how strictly the approximated surface follows the depth-map points; its value lies between 0 and 1. If the relaxation parameter is 0, the surface passes exactly through the depth values (the z coordinates) of the source data. If the relaxation parameter is 1, linear interpolation is performed by minimizing the sum of squared errors between the depth-map point values and the approximated surface.

The first step in solving the thin plate spline problem is to find the weights of the source data. This requires several matrix operations, which are easier to perform if the matrix containing the x and y coordinates of the source data is first decomposed. This matrix can be decomposed using LU or QR decomposition. In this work the LU decomposition method is used.


The following mathematical solution of TPS with LU decomposition is the method of [38]; Elonen's example is based on [37]. Let there be two equally sized sets of 2D points, $A$ being the vertices of the original shape and $B$ those of the target shape. Let $z_i = B_{ix} - A_{ix}$. A TPS is then fitted over the points $(A_{ix}, A_{iy}, z_i)$ to obtain an interpolation function for the translation of the points in the x direction. The same is repeated for y.

Given a set $C$ of $p$ 3D control points,

\[ c_{i1} = x_i,\quad c_{i2} = y_i,\quad c_{i3} = z_i,\quad i \in [1 \cdots p] \quad \equiv \quad C_{p \times 3} = \begin{bmatrix} x_1 & y_1 & z_1 \\ x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots \\ x_p & y_p & z_p \end{bmatrix}, \qquad (16) \]

and a regularization parameter $\lambda$, solve the unknown TPS weights $\vec{w}$ and $\vec{a}$ from the linear equation system

\[ \begin{bmatrix} K & P \\ P^T & O \end{bmatrix} \cdot \begin{bmatrix} \vec{w} \\ \vec{a} \end{bmatrix} = \begin{bmatrix} \vec{v} \\ \vec{o} \end{bmatrix} \quad \equiv \quad L_{(p+3) \times (p+3)} \cdot \vec{x}_{(p+3) \times 1} = \vec{b}_{(p+3)}, \qquad (17) \]

where $K$, $P$ and $O$ are submatrices and $\vec{w}$, $\vec{a}$, $\vec{v}$ and $\vec{o}$ are column vectors, given by:

\[ K_{ij} = U\!\left( \left| \begin{bmatrix} c_{i1} & c_{i2} \end{bmatrix} - \begin{bmatrix} c_{j1} & c_{j2} \end{bmatrix} \right| \right) + I_{ij} \cdot \alpha^2 \cdot \lambda, \quad i, j \in [1 \cdots p] \wedge \lambda \ge 0 \qquad (18) \]

\[ U(r) = \begin{cases} r^2 \cdot \log r, & r > 0 \\ 0, & r = 0 \end{cases} \qquad (19) \]

\[ \alpha = \frac{1}{p^2} \sum_{i=1}^{p} \sum_{j=1}^{p} \left| \begin{bmatrix} c_{i1} & c_{i2} \end{bmatrix} - \begin{bmatrix} c_{j1} & c_{j2} \end{bmatrix} \right| \qquad (20) \]


\[ P_{p \times 3} = \begin{bmatrix} 1 & c_{11} & c_{12} \\ 1 & c_{21} & c_{22} \\ \vdots & \vdots & \vdots \\ 1 & c_{p1} & c_{p2} \end{bmatrix}, \quad O_{3 \times 3} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (21) \]

\[ P^T_{ij} = P_{ji}, \quad i \in [1 \cdots 3] \wedge j \in [1 \cdots p] \qquad (22) \]

\[ \vec{v}_{p \times 1} = \begin{bmatrix} c_{13} \\ c_{23} \\ \vdots \\ c_{p3} \end{bmatrix}, \quad \vec{o}_{3 \times 1} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad \vec{w}_{p \times 1} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_p \end{bmatrix}, \quad \vec{a}_{3 \times 1} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}. \qquad (23) \]

At this point it can be noted that $L$, and thus also its submatrix $K$, is symmetric, so it is possible to calculate the elements of the upper triangle only and copy them to the lower one. $\alpha$ (the mean of the distances between the control points' xy-projections) is a constant that appears only on the diagonal of $K$; it can easily be calculated while filling the upper and lower triangles. $I$ is the standard identity matrix. [38]

Once the values of $\vec{w}$ and $\vec{a}$ are known, $z$ can be interpolated for an arbitrary point $(x, y)$ from

\[ z(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{p} w_i\, U\!\left( \left| \begin{bmatrix} c_{i1} & c_{i2} \end{bmatrix} - \begin{bmatrix} x & y \end{bmatrix} \right| \right). \qquad (24) \]

The bending energy (a scalar) of a TPS is given by

\[ I_f = \vec{w}^T K \vec{w}. \qquad (25) \]
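Once the system (17) has been solved for the weights, evaluating the surface is straightforward. The following C sketch implements equations (19) and (24); the array layout is a hypothetical choice, and solving for wt and a (e.g. by the LU decomposition discussed above) is assumed to happen elsewhere.

    #include <math.h>

    /* Equation (19): the TPS radial basis function. */
    static double U(double r)
    {
        return (r > 0.0) ? r * r * log(r) : 0.0;
    }

    /* Equation (24): interpolate z at (x, y). cx, cy hold the control
     * point xy-coordinates, wt the p solved weights, a the affine
     * coefficients a1, a2, a3. */
    double tps_eval(const double *cx, const double *cy, const double *wt,
                    const double a[3], int p, double x, double y)
    {
        double z = a[0] + a[1] * x + a[2] * y;  /* affine part */
        for (int i = 0; i < p; i++) {
            double dx = cx[i] - x, dy = cy[i] - y;
            z += wt[i] * U(sqrt(dx * dx + dy * dy));
        }
        return z;
    }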

Elonen [38] says that the LU decomposition used in his example is a generic, direct solver that does not scale well as the matrices grow large: it is $O(N^3)$. For large sets of control points there are optimized (and much more complicated) methods for solving the thin plate spline problem; in [37] fast Radial Basis Function (RBF) interpolation is mentioned as one. These methods are based on iterative numerical solvers (such as Gauss-Seidel or the conjugate gradient method) and on the assumption that the effect of the control points is mainly local, i.e. only a few neighbouring control points contribute to the interpolated value at a given point. These approximations scale well, on the order of $O(N \log N)$.

6.4 TPS applied to digital elevation models

Pohjola et al. [39] have used TPS interpolation and QR decomposition to create high-resolution digital elevation models (DEM). A DEM is a digital representation of the elevation of the earth's surface in a particular area. DEMs are an essential component of geographic information systems designed for the analysis and visualization of location-related data, and they are most often represented either in raster or Triangulated Irregular Network (TIN) format. Pohjola et al. tested several methods and found thin plate spline interpolation best suited for the creation of the elevation model: the thin plate spline method gave the smallest error in a test where a certain number of control points was removed from the data, and the resulting model looked the most natural.

In [39] the creation of a digital elevation model of the Olkiluoto area, incorporating a large area of seabed, is described. The modelled area covers 960 square kilometers, and the apparent resolution of the created elevation model was specified to be 2.5 × 2.5 meters. Elevation data such as contour lines and irregular elevation measurements were used as source data in the process. The precision and reliability of the available source data varied greatly.

Donato et al. [37] note that one drawback of the TPS model is that its solution requires the inversion of a large dense matrix of size p × p, where p is the number of points in the data set. Elonen also pointed out that his example [38] of TPS with LU decomposition does not scale well as the matrices grow large. Thus, Pohjola et al. realized from the beginning that the area of interest had to be divided into smaller subareas for computation, due to the size of the area and the resolution of the elevation model. Dividing the area into computational units enabled them to perform computations in parallel and to reduce the computation time.


7 SUPER-PIXEL IMAGE SEGMENTATION METHODS

7.1 Introduction

Image segmentation is the process of dividing an image into parts that have a strong correlation with objects or areas of the real world. In segmentation, an image is divided into separate regions that are homogeneous with respect to a chosen property such as brightness, color or texture. These regions, sets of pixels also known as super-pixels, can be extracted with a segmentation algorithm. More precisely, image segmentation is the process of assigning a label to every pixel in an image. It is typically used to locate objects and boundaries (lines, curves, etc.) in images. In this work segmentation is needed to gain better control over the TPS interpolation process: with segmentation the image can be divided into surfaces that act as computational units in the TPS. Instead of calculating the TPS for the whole image, the problem is divided into smaller computational units, and each segmented, user-defined surface is processed individually.

To obtain super-pixels, one often uses image segmentation algorithms such as mean shift [40], graph-based local variation [41] or normalized cuts [42]. To increase the chance that super-pixels do not cross object boundaries, the segmentation algorithm must be run in an over-segmentation mode, which means that the image is segmented into smaller regions than desired, producing a greater number of segmented regions. The problem is that it is very difficult to define what a good segmentation is. It depends, among other things, on the nature of the image, lighting conditions, noise and texture, all of which can have a large impact on the results. A key problem in segmentation is splitting the image into too few (under-segmentation) or too many (over-segmentation) regions.

7.2 Graph-based algorithms

Graph-based image segmentation techniques generally represent the problem in terms of a graph G = (V, E), where each image pixel is treated as a node v_i ∈ V, and the edges E connect neighbouring pixels. The weight w(i, j) of each edge is a function of the similarity (e.g. the difference between pixel intensities) of nodes i and j. Super-pixels are extracted by effectively minimizing a cost function defined on the graph (see Figure 27).

The Normalized cuts algorithm (NCUT) by Shi et al. [42] recursively partitions a given graph using contour and texture cues.


Figure 27. Pixels as a graph: a fully connected graph with a node for every pixel and a link between every pair of pixels. The cost c_pq of each link measures similarity [43].

Figure 28. Superpixels by the normalized cuts algorithm. [44]

NCUT globally minimizes a cost function defined on the edges at the partition boundaries. NCUT [42] is the basis of the superpixel segmentation scheme of Ren et al. [45] and of Normalized cuts (NCUT05) by Mori et al. [46]. NCUT05 has a complexity of $O(N^{3/2})$ [47], where N is the number of pixels. Levinshtein et al. [47] show that there have been attempts to speed up the algorithm, but it remains computationally expensive for large images. Felzenszwalb et al. [41] argue that these kinds of methods are too slow for many applications. NCUT05 has been used in body model estimation [46].

Figure 29. Superpixels by the local variation graph-based algorithm. [44]

Felzenszwalb et al. [41] present another graph-based segmentation method, Efficient Graph-Based Image Segmentation, also called the local variation graph-based algorithm.


The algorithm is based on the idea of partitioning an image into regions such that for each pair of neighbouring regions the variation between the regions is larger than the variation within them. The criterion for determining the similarity of image regions is based on measures of image variation. The measure of the internal variation of a region is a simple statistic of the intensity differences between neighbouring pixels in the region. The measure of the external variation between regions is the minimum intensity difference between neighbouring pixels along the border of the two regions. The algorithm uses a greedy decision criterion to merge regions based on these measures, yielding a method that runs in time nearly linear in the number of image pixels [48].

Felzenszwalb et al. [41] also state that it is important for segmentation algorithms to take into account non-local properties of the image, because purely local properties fail to capture perceptually important differences in images. They argue that, when applied to image segmentation problems, the nearest neighbours alone are not enough to get a reasonable measure of image variability. Their technique adaptively adjusts the segmentation criterion based on the degree of variability in neighbouring regions of the image. The algorithm has $O(N \log N)$ complexity and is quite fast in practice compared to NCUT05 [47]. However, unlike NCUT05, it does not offer direct control over the number of superpixels it produces or over their compactness.
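The heart of the algorithm is the pairwise region comparison predicate, which can be sketched as below. The quantities passed in (internal variations, region sizes, the minimum connecting edge weight, and the scale parameter k of [41]) are assumed to be maintained by the surrounding segmentation code.

    /* Sketch of the pairwise region comparison of the local variation
     * algorithm [41]: merge two regions when the difference between
     * them is small relative to the internal variation of each.
     * int1/int2 are internal variations, size1/size2 region sizes,
     * dif the minimum edge weight between the regions, k the scale
     * parameter. */
    int should_merge(double dif, double int1, int size1,
                     double int2, int size2, double k)
    {
        double mint1 = int1 + k / size1;   /* threshold tau(C) = k/|C| */
        double mint2 = int2 + k / size2;
        double mint  = (mint1 < mint2) ? mint1 : mint2;
        return dif <= mint;   /* no evidence of a boundary: merge */
    }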

7.3 Gradient-ascent-based algorithms

Starting from an initial rough clustering, gradient-ascent methods refine the clusters from the previous iteration during each iteration, obtaining a better segmentation until convergence.

Figure 30. Superpixels by the TurboPixel algorithm. [44]

TurboPixels [47] generates super-pixels by progressively dilating a given number of seeds in the image plane, using a computationally efficient level-set-based geometric flow (see Figure 31). The geometric flow relies on local image gradients and aims to distribute super-pixels evenly on the image plane. The super-pixels are constrained to have uniform size, compactness, and adherence to object boundaries (see Figure 30).

The key idea of TurboPixels is to reduce super-pixel computation to an efficiently solvable geometric-flow problem. Given a user-specified number K of superpixels, K circular seeds are placed in a lattice formation so that the distances between lattice neighbours are all approximately equal to $\sqrt{N/K}$, where N is the total number of pixels in the image. This distance completely determines the seed lattice, since it can be readily converted into a distance across lattice rows and columns. In [47], the initial seed radius is one pixel.

Figure 31. Steps of the TurboPixels algorithm. In Step 4a, the vectors depict the current velocities at seed boundaries; where edges have been reached, the velocities are small. In Step 4b, the magnitude of the velocities within the narrow band is proportional to brightness. [47]

The technique provides accuracy comparable to Normalized cuts, but with significantly lower runtimes. The complexity of the TurboPixels algorithm is O(N), and Levinshtein et al. [47] note that if both under-segmentation and irregularly shaped super-pixel boundaries can be tolerated, the Felzenszwalb and Huttenlocher algorithm [41] is clearly a better choice, offering a tenfold speed-up as well as improved boundary recall at lower super-pixel densities.

Figure 32. Example of a feature space. (a) A 400x276 color image. (b) The corresponding L*u*v color space with 110,400 data points. [40]

Mean shift [40] is a mode-seeking algorithm that generates super-pixels by recursively moving every data point in the pixel feature space to the kernel-smoothed centroid, effectively performing a gradient ascent [49]. Kernel density estimation is a non-parametric way to estimate the density function of a random variable; this is usually called the Parzen window technique. Mean shift treats the points of the feature space (see Figure 32) as samples of a probability density function (pdf). Dense regions in the feature space correspond to local maxima, or modes. For each data point, a gradient ascent is performed on the locally estimated density until convergence. The stationary points obtained via gradient ascent represent the modes of the density function, and all points associated with the same stationary point belong to the same cluster.
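The core of the procedure can be sketched as follows. This is a simplified illustration rather than the implementation of [40]: a flat kernel of bandwidth h replaces the smooth kernel, the names are invented, and x is assumed to start at one of the data points so that the window is never empty.

#include <math.h>

/* Move one feature-space point x (of dimension dim <= 8 in this sketch)
   to the mean of the data points within distance h, and repeat until the
   shift becomes negligible; x then sits at a mode of the estimated density. */
void mean_shift_point(const float *pts, int n, int dim, float h, float *x)
{
    float shift2;
    do {
        float mean[8] = { 0.0f };
        int count = 0;
        for (int i = 0; i < n; i++) {
            float d2 = 0.0f;
            for (int k = 0; k < dim; k++) {
                float d = pts[i * dim + k] - x[k];
                d2 += d * d;
            }
            if (d2 <= h * h) { /* the point falls inside the window */
                for (int k = 0; k < dim; k++)
                    mean[k] += pts[i * dim + k];
                count++;
            }
        }
        shift2 = 0.0f;
        for (int k = 0; k < dim; k++) {
            mean[k] /= count;
            shift2 += (mean[k] - x[k]) * (mean[k] - x[k]);
            x[k] = mean[k]; /* gradient-ascent step toward the local mode */
        }
    } while (shift2 > 1e-6f);
}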

Mean shift makes no assumptions about the number or the shape of the clusters, which makes it well suited to handling clusters of arbitrary shape and number: the number of modes gives the number of clusters, and since the method is based on density estimation it can handle arbitrarily shaped clusters.

Figure 33. Superpixels by the Quick shift algorithm. [44]

Quick shift [49] is also a mode-seeking algorithm that generates superpixels like Mean shift but is faster in practice. It is non-iterative and does not make assumptions about the number or shape of the superpixels (see Figure 33).

7.4 Summary

A summary of the properties of the superpixel segmentation methods presented in this chapter is given in Table 2. In this work we seek a segmentation algorithm that is fast and easy to implement. Our desire is to implement a graphical user interface that responds quickly to user actions. We also seek a solution where the amount of required user interaction is minimal. From the presented segmentation methods we choose the local variation algorithm [41]. It is fast and requires only two input parameters. Furthermore, it has been used successfully in similar works [12] [14]. Felzenszwalb et al. [41] have released the C++ source code of their algorithm; the code is freely available at http://www.cs.brown.edu/~pff/segment/.

Table 2. Method property comparisons for Normalized Cut (NC) [46], Local Variation (LV) [41], TurboPixels (TP) [47] and Quick shift (QS) [49].

                                       Graph-based          Gradient-ascent
                                       NC        LV         TP        QS
Complexity O(.)                        N^(3/2)   N log N    N         dN^2
Control number of superpixels          yes       no         yes       no
Control shape and size of superpixels  yes       no         yes       no
Number of parameters                   1         2          1         2


8 IMPLEMENTATION

8.1 Introduction

Figure 34. The three-phase 3D reconstruction process.

In this work the 3D reconstruction process is executed in three different phases (Figure 34). The first phase is image segmentation. The local variation algorithm by Felzenszwalb et al. [41] is used to segment the image into super-pixels. The user is required to interactively form larger surfaces by merging the generated super-pixels together using the implemented graphical user interface. This is discussed in more detail in Section 8.3.

The second phase of the reconstruction process is mesh modelling. A mesh that consists of triangular mesh elements is generated. In this process the nodes of the triangle mesh are extracted from the image with the image processing methods that were presented in Chapter 5; the implementation details are presented in Section 8.4. The extracted node points are the input data for the Delaunay triangulation algorithm, which connects the mesh nodes into a triangle mesh as discussed in Section 4.5.

The third and final phase is the thin plate spline interpolation phase. First the user is required to place control-points on top of the image. When a control-point is assigned, the user is also required to define its 3D depth at that particular location in the image. By placing these control-points the user defines a sparse depth-map of the image. The thin plate spline interpolation method is executed for each surface plane formed in phase one. The interpolation phase calculates new 3D depths for the mesh nodes of the triangle mesh generated in the second phase. This bends the triangle mesh through the defined control-points and thus transforms the 2D flat image plane into a 2.5D manifold (an unclosed mesh). This phase is discussed in Section 8.5.

Section 8.2 is about the implementation of the user interface that was developed for the user interaction. The tools and the environment used in the implementation are discussed, and the implemented executable procedures, key controls, and configuration files are introduced.

8.2 User interface

The user interface between an application program and a graphics system is implemented through a set of functions that reside in a graphics library. Such a set of functions is called an application programming interface (API). The application programmer's model of the system is shown in Figure 35. A programmer sees only the API and is thus shielded from the details of both the hardware and the software implementation of the graphics library. The software drivers are responsible for interpreting the output of the API. In this work the API used is OpenGL and the programming language is C. The hardware and libraries used in this implementation are presented in Table 3, Table 4 and Table 5.

Figure 35. Application programmer’s view of a graphics application [4].

Table 3. Hardware information table.

Computer        Laptop ASUS G50V series
Processor       2 x Intel(R) Core(TM)2 Duo CPU T9400 @ 2.53 GHz
Memory          4059 MB
Display         15.4" WSXGA+
Graphics card   NVIDIA GeForce 9700M GT 512 MB


Table 4. General system information table.

OS Release               Linux Mint 9 (Isadora)
Kernel                   2.6.32-36-generic
GNOME                    2.30.2
NVIDIA Driver Version    290.10
NV-CONTROL Version       1.27
X Server Version Number  11.0
X Server Vendor Name     The X.Org Foundation
X Server Vendor Version  1.7.6
GL Version               3.3.0
GLSL Version             3.30 NVIDIA via Cg compiler
GLU Version              1.3

Table 5. Libraries and third party code information table.

GSL           The GNU Scientific Library (GSL) is a numerical library for C and C++
              programmers. The library provides a wide range of mathematical routines
              such as random number generators, special functions and least-squares
              fitting. There are over 1000 functions in total with an extensive test
              suite.

libfreeimage  FreeImage is an open source library project for developers who would
              like to support popular graphics image formats like PNG, BMP, JPEG,
              TIFF and others as needed by today's multimedia applications.

freeglut      An open source alternative to the OpenGL Utility Toolkit (GLUT)
              library. Allows the user to create and manage windows containing
              OpenGL contexts on a wide range of platforms and also read the mouse
              and keyboard.

libConfig     libconfig: C/C++ Configuration File Library.
              http://www.hyperrealm.com/libconfig/libconfig-1.4.8.tar.gz

AntTweakBar   AntTweakBar is a small and easy-to-use C/C++ library that allows
              programmers to quickly add a light and intuitive graphical user
              interface into graphic applications based on OpenGL, DirectX 9,
              DirectX 10 or DirectX 11 to interactively tweak parameters on-screen.
              http://www.antisphere.com/Wiki/tools:anttweakbar:download

Triangle      A two-dimensional quality mesh generator and Delaunay triangulator.
              http://www.cs.cmu.edu/~quake/triangle.html

segment.tgz   C++ implementation of the image segmentation algorithm Efficient
              Graph-Based Image Segmentation. http://www.cs.brown.edu/~pff/segment/


During the development process it became clear that phases one (segmentation) and three (defining control-points) can be time consuming, and it is easy to make bad decisions. Some images require more user interaction than others; for example, images presenting outdoor landscapes require far more user interaction than images presenting interior spaces like hallways. Saving the progress and data into files therefore became one of the requirements for this implementation. This enables the user to close the application and continue later from the same point. Without saving the data into files, the user would have to start from the beginning each time the application is started. The first implementation of the application stored the data in binary files. This was later changed to editable configuration files, since it became clear that it is far more convenient to make modifications to the control-point data directly in a file than through the user interface.

In this application the open source library libConfig is used for processing structured configuration files. When the application is started, it requires that there is an a3dr.cnf configuration file in the same directory as the a3dr executable program. An example of this configuration file is presented in Figure 36. The configuration file sets the filename of the image to be processed. The filenames for the three processing phases are also named in this file; these files are generated when the corresponding procedures are executed in the application. Examples of these files are presented in Appendix 2. Additionally, there are options to change the default parameters of the segmentation algorithm. This is also possible via the user interface when the application is already running. Nevertheless, for documentation purposes it is convenient to put the used parameters into the configuration file; this reminds the user which parameter values were used in the segmentation process if they need to be checked later on.
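As an illustration, the settings above can be read with the libconfig C API roughly as follows; the lookup paths match the file layout of Figure 36 and Appendix 2, and the error handling is reduced to a single message.

#include <stdio.h>
#include <libconfig.h>

int main(void)
{
    config_t cfg;
    const char *image;
    double sigma;
    int k, min_size;

    config_init(&cfg);
    if (!config_read_file(&cfg, "a3dr.cnf")) {
        fprintf(stderr, "%s:%d - %s\n", config_error_file(&cfg),
                config_error_line(&cfg), config_error_text(&cfg));
        config_destroy(&cfg);
        return 1;
    }
    config_lookup_string(&cfg, "image_filename", &image);
    config_lookup_float(&cfg, "segmentation.sigma", &sigma);
    config_lookup_int(&cfg, "segmentation.k", &k);
    config_lookup_int(&cfg, "segmentation.min_size", &min_size);
    printf("image=%s sigma=%.2f k=%d min_size=%d\n", image, sigma, k, min_size);
    config_destroy(&cfg);
    return 0;
}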

Figure 36. Configuration file (a3dr.cnf) for the program.


One requirement for the application user interface is placing the control-points that define the 3D depths in different parts of the image. Placing the points over the image with a left-click is a trivial matter in OpenGL. However, editing the 3D depths at these points is not so trivial. In this work the GLUT library is used, and it provides means to construct a user interface only in its simplest form. A convenient way to edit a control-point's 3D depth would be through an input text box, which is something OpenGL and GLUT do not support at all. GLUT provides means to display pop-up menus that can be used for launching procedures or changing different states of the application, but using a pop-up menu for depth editing does not feel like a suitable solution. A quick Google search turned up a ready, lightweight open source library (AntTweakBar) suitable for the C and C++ programming languages. This is why a combination of GLUT and AntTweakBar is used in the implementation of the user interface (Figures 37 and 38).
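To illustrate the combination, the following sketch shows how a depth value can be exposed as an editable AntTweakBar field on top of GLUT, following the pattern of AntTweakBar's own GLUT example. The variable and bar names are invented here; the actual program wires considerably more state to the bars shown in Figures 37 and 38.

#include <AntTweakBar.h>
#include <GL/freeglut.h>

static float g_depth = 0.0f; /* 3D depth of the selected control-point */

void init_gui(int width, int height)
{
    TwInit(TW_OPENGL, NULL);
    TwWindowSize(width, height);

    /* an editable numeric field: AntTweakBar draws the widget and writes
       the edited value straight into g_depth */
    TwBar *bar = TwNewBar("ControlPoint");
    TwAddVarRW(bar, "depth", TW_TYPE_FLOAT, &g_depth,
               " min=-100 max=100 step=0.5 label='3D depth' ");

    /* forward GLUT events so the bar receives mouse and keyboard input */
    glutMouseFunc((GLUTmousebuttonfun)TwEventMouseButtonGLUT);
    glutMotionFunc((GLUTmousemotionfun)TwEventMouseMotionGLUT);
    glutPassiveMotionFunc((GLUTmousemotionfun)TwEventMouseMotionGLUT);
    glutKeyboardFunc((GLUTkeyboardfun)TwEventKeyboardGLUT);
}

/* in the display callback, TwDraw() is called after the scene is rendered */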

One could also choose to use libraries such as the GIMP Toolkit (GTK+) or the Simple DirectMedia Layer (SDL), a cross-platform multimedia library designed to provide low-level access to audio, keyboard, mouse, joystick and 3D hardware via OpenGL. Because OpenGL alone is a vast field to explore, it is better to keep things simple at this point. Thus, the combination of C, OpenGL, GLUT and AntTweakBar is enough for this implementation. Furthermore, OpenGL and GLUT are very often the combination used in the OpenGL programming literature [21] [4], and the OpenGL implementation of this work relies on this particular literature. More advanced implementations are left for possible future work. One more open source library has to be mentioned here: FreeImage is used for loading images from files into OpenGL's memory buffers for processing. FreeImage is an open source library project for developers who would like to support popular image formats like PNG, BMP, JPEG, TIFF and others as needed by recent multimedia applications. In this work, images mainly in the JPEG format have been used, but it is good to be prepared for other image formats too.


Figure 37. GUI with 5 AntTweakBars in use. A 2D image reconstructed in 3D.

Figure 38. An AntTweakBar for GUI controls: rotation, states, modes (left). An AntTweakBar for showing information (right).


8.3 Segmentation phase

When the application starts, an image is loaded into a memory buffer. The image contains n = width × height pixels. A pixel is a single dot, or "picture element", of an image. A rectangular image may be composed of thousands of pixels, each representing the color of the image at a given location. The value of a pixel typically consists of several channels, such as the red, green and blue components of its color, and sometimes its alpha (transparency).

The final goal of the 3D reconstruction is to define 3D depths for different areas of the image. Since an image has a vast number of pixels, it is not efficient to define depths for each pixel individually. With an image segmentation algorithm the number of units to handle is reduced. Suitable image segmentation methods were introduced earlier in Chapter 7. The image is segmented with the local variation algorithm [41]: pixels that share common characteristics are merged into larger entities, super-pixels. This is something that also Hoiem et al. [12] and Saxena et al. [14] have done in their solutions; they likewise segmented the image into super-pixels in the first phase of their processes, and both used the local variation algorithm by Felzenszwalb et al. [41]. In this implementation the very same algorithm is used for image segmentation. The algorithm is fast and requires only a few input parameters. In [12] and [14], constant parameter values (σ = 0.5, k = 100, minsize = 100) have been used. Using the algorithm with these parameter values results in an over-segmented image with irregularly shaped super-pixels. However, images can be very different, and it can be necessary to segment images with different values. With this in mind, the user has the ability to adjust these parameters through the AntTweakBar or the configuration file. Figure 39 illustrates the AntTweakBar controls for this phase. The parameter σ is a value between 0 and 1 and is used to smooth the image. The parameter k is a constant for the gray-level threshold. The parameter minsize is the minimum component size, which is enforced by a post-processing stage. Felzenszwalb et al. [41] have released the C++ source code of their algorithm on the Internet; the code is freely available at http://www.cs.brown.edu/~pff/segment/. In this work this released C++ code was converted, with minor changes, to C. The link to the web page is also presented in Table 5.

After the local variation algorithm is executed, this implementation takes a slightly different direction compared to the works of Hoiem et al. [12] and Saxena et al. [14]. Their next step was to detect surfaces from the image by classifying super-pixels with supervised learning and pattern recognition. For this purpose, they gathered a knowledge database from several images and executed pattern recognition for the super-pixels against this knowledge database. In this work, the approach is user oriented: the user is required to manually combine the super-pixels into larger entities.

Figure 39. Ability to tweak segmentation parameters.

The following features for classification purposes are implemented in the user interface. A super-pixel area is highlighted with a different color when the user holds the mouse pointer above the super-pixel. When the super-pixel is highlighted, the user can select it by clicking the left mouse button. This adds the identification number of the selected super-pixel into the selection-list. All super-pixels that reside in the selection-list are highlighted in white. By observing the image on screen, the user can decide whether a neighbouring super-pixel of the previously selected one belongs to the same larger entity (e.g. a wall or a door). If it does, the user can add that super-pixel into the selection-list as well. A selected super-pixel can be removed from the selection-list by left-clicking it again when it is highlighted in white. The super-pixels in the selection-list are merged into one single entity, or surface, with a right mouse-click. This kind of merging of super-pixels into surfaces is continued until the whole image is divided into larger entities. The process is illustrated with figures in Table 6. The data of the segmentation phase is stored after each merging action to a file defined in the a3dr.cnf file. When the program is started and a file with previous segmentation data exists, the values from the file are restored. Thus the segmentation phase has to be committed only once for a single image.


Table 6. Original image (top-left). Image segmented into super-pixels (top-right). One super-pixel highlighted in segmentation mode (middle-left). One super-pixel selected (white) and the highlighted super-pixel next to it (green) about to be selected, in segmentation mode (middle-right). A surface (floor) defined by the user by joining super-pixels together (bottom-left). The surface highlighted with a surrounding polygon when not in segmentation mode (bottom-right).


8.4 Mesh generation phase

In this phase a triangular mesh that adapts well to the image content is generated. The processed image is first transformed into a gray-scale image. The image being transformed resides in an OpenGL memory buffer and is in the RGB color format: every pixel in the image has values for the red, green and blue color elements that vary between 0 and 255. The gray-level value is computed for each pixel with equation (5) presented in Section 5.3. The computed gray-levels are then stored in the red, green and blue channels, which transforms the color image into a gray-scale image. In the user interface this procedure can be launched by pressing the F3 key on the keyboard or by a left mouse click on the corresponding button in the AntTweakBar (see Figure 40).

Figure 40. Image processing procedures needed for mesh modelling.
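A sketch of the conversion pass described above is shown below. The exact weights come from equation (5) in Section 5.3; the common luma weights 0.299/0.587/0.114 are assumed here purely for illustration.

/* Convert an RGB pixel buffer (3 bytes per pixel) to grayscale in place
   by writing the computed gray value back to all three channels. */
void to_grayscale(unsigned char *rgb, int n_pixels)
{
    for (int i = 0; i < n_pixels; i++) {
        unsigned char *p = rgb + 3 * i;
        unsigned char gray = (unsigned char)(0.299f * p[0] +
                                             0.587f * p[1] +
                                             0.114f * p[2] + 0.5f);
        p[0] = p[1] = p[2] = gray;
    }
}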

After the image has been converted into grayscale, a feature map is calculated from the image. This procedure can be launched by pressing the F4 key or via the corresponding AntTweakBar button. The equations (6)-(8) used in the image filtering are presented in Section 5.4. The feature map is a binary image containing the grayscale values for black and white. An example image of the feature map is presented in Table 7.

Next, the Floyd-Steinberg error diffusion algorithm is executed on the feature map. This procedure can be launched by pressing the F5 key. The Floyd-Steinberg algorithm scans the feature map from left to right, top to bottom, quantizing the pixel values one by one. Each time, the quantization error is transferred to the neighbouring pixels without affecting the pixels that have already been quantized. This procedure distributes the points so that their local spatial density is proportional to the corresponding value of the computed feature map. An example image is presented in Table 7 (and in Figure 24); it shows the result after the Floyd-Steinberg algorithm has completed. In this work the algorithm is implemented in C. An implementation of this algorithm also exists in the used FreeImage library, but it is not invoked in this implementation.
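The scan can be sketched in C as follows; the buffer layout and function name are illustrative, while the 7/16, 3/16, 5/16 and 1/16 error weights are the classic Floyd-Steinberg ones [28].

/* Quantize a grayscale feature map (values 0..255, row-major) to black
   and white, diffusing the quantization error to the unvisited neighbours. */
void floyd_steinberg(float *img, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float old = img[y * w + x];
            float q = old < 128.0f ? 0.0f : 255.0f; /* quantize to black/white */
            float err = old - q;
            img[y * w + x] = q;
            if (x + 1 < w)
                img[y * w + x + 1] += err * 7.0f / 16.0f; /* right */
            if (y + 1 < h) {
                if (x > 0)
                    img[(y + 1) * w + x - 1] += err * 3.0f / 16.0f; /* below left */
                img[(y + 1) * w + x] += err * 5.0f / 16.0f;         /* below */
                if (x + 1 < w)
                    img[(y + 1) * w + x + 1] += err * 1.0f / 16.0f; /* below right */
            }
        }
}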


Table 7. Original image converted into a grayscale image (top-left). Feature map calculated from the grayscale image (top-right). Floyd-Steinberg error diffusion algorithm executed on the feature map (bottom-left). Delaunay triangulation executed on the point cloud that was extracted by Floyd-Steinberg error diffusion (bottom-right).

Through these steps more dots appear in areas where the image has more details, and fewer dots appear in areas that contain less detail, so more points are generated at the borders of the shapes in the image. These extracted node points are the input data for the Delaunay triangulation algorithm, which connects the mesh nodes into a triangle mesh as discussed earlier in Section 4.5. The Delaunay triangulation can be launched by pressing the F6 key. This procedure also has a corresponding AntTweakBar button.

In this implementation the Delaunay triangulation is computed by using the two-dimensional quality mesh generator and Delaunay triangulator called Triangle. Triangle generates exact Delaunay triangulations, constrained Delaunay triangulations, conforming Delaunay triangulations, Voronoi diagrams, and high-quality triangular meshes. Triangular meshes can be generated with pre-set angle limits and are thus suitable for finite element analysis. Triangle was created at Carnegie Mellon University as part of the Quake project, which is about large-scale earthquake simulation. Triangle's source code distribution file triangle.h contains instructions for how Triangle can be called from other programs, and the example program file "tricall.c" also illustrates how Triangle can be called. In this implementation a callable object file triangle.o is compiled and called from this program. The web link (http://www.cs.cmu.edu/~quake/triangle.html) where Triangle can be obtained is also presented in Table 5.
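In the spirit of the tricall.c example, calling Triangle boils down to filling a struct triangulateio with the node points and invoking triangulate(). The following is a simplified sketch, not the actual calling code of this work; REAL must be defined the same way as when triangle.o was compiled, and the switch string "zQ" requests zero-based indexing and quiet operation.

#define REAL double
#include <stdlib.h>
#include <string.h>
#include "triangle.h"

/* Triangulate n_points nodes given as x0,y0,x1,y1,...; the resulting
   triangles are read from out.trianglelist as triples of node indices. */
void delaunay(double *xy, int n_points)
{
    struct triangulateio in, out;
    memset(&in, 0, sizeof in);
    memset(&out, 0, sizeof out); /* NULL pointers: Triangle allocates the output */

    in.numberofpoints = n_points;
    in.pointlist = xy;

    triangulate("zQ", &in, &out, NULL);

    /* ... build the mesh elements from out.trianglelist
       (out.numberoftriangles triangles, 3 corners each) ... */

    free(out.pointlist);
    free(out.trianglelist);
}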

8.5 Thin Plate Spline interpolation phase

In this final phase the user is required to form a sparse depth-map by defining control-points in the image. Thin plate spline interpolation is a method for fitting a surface through the defined control-points by applying the minimal energy principle. In this work, the method was implemented using the GNU Scientific Library (GSL) and LU-decomposition. The AntTweakBar for this phase is presented in Figure 41.
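For reference, the interpolant defined by n control-points p_i = (x_i, y_i) with depths v_i has the standard thin plate spline form (see [36, 37]):

\[
f(x, y) = a_1 + a_2 x + a_3 y + \sum_{i=1}^{n} w_i \, U\!\left( \lVert p_i - (x, y) \rVert \right),
\qquad U(r) = r^2 \log r^2,
\]

where the weights w_i and the affine coefficients a_1, a_2, a_3 are obtained by solving a linear system assembled from the control-points, and f is then evaluated at every mesh node.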

LU-decomposition is basically a modified form of Gaussian elimination and requires calculations with matrices. GSL is a numerical library for C and C++ programmers; it is free software under the GNU General Public License. The library provides linear algebra operations which operate directly on gsl_vector and gsl_matrix objects. These routines use the standard algorithms from Golub & Van Loan's Matrix Computations [50] with Level-1 and Level-2 BLAS calls for efficiency. The functions used in this work are declared in the header file gsl_linalg.h of the GSL library.
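A minimal sketch of the solve step with these GSL routines; how the (n+3)×(n+3) system matrix L and the right-hand side b are assembled from the control-points is assumed from the TPS formulation above, and the function name is illustrative.

#include <gsl/gsl_linalg.h>

/* Solve the TPS system L x = b. b holds the control-point depths padded
   with three zeros; x returns the weights w_i followed by a_1, a_2, a_3.
   L is factorized in place. */
void solve_tps_system(gsl_matrix *L, const gsl_vector *b, gsl_vector *x)
{
    int signum;
    gsl_permutation *p = gsl_permutation_alloc(L->size1);

    gsl_linalg_LU_decomp(L, p, &signum); /* LU factorization with pivoting */
    gsl_linalg_LU_solve(L, p, b, x);     /* forward and back substitution */

    gsl_permutation_free(p);
}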

Figure 41. Procedures needed to invoke the TPS interpolation phase.


9 EXAMPLES

In this chapter the experimental results that can be achieved with this implementation are presented with screenshots. In Figure 42 (top) the image is segmented into 6 surfaces with the help of the user. Control-points with depths are placed so that there are control-points inside each surface. Some of the control-points are not visible in this figure because their positions are behind the 2D image plane. Figure 42 (bottom) presents the same situation as Figure 42 (top); the only difference is that the textures are off and all control-points can be seen through the triangle mesh.

Figure 42. The image is segmented into 6 surfaces. Control points with depths are placed so that there are control points inside each surface; some of the control points are not visible because their positions are behind the 2D image plane (top). The same situation with the textures off, so that the control points can be seen through the triangle mesh (bottom).


In Figure 43 (top) the very same situation is presented from the side with textures on. From this point of view it can be seen that the defined control-points have different depths. Again, Figure 43 (bottom) presents the same situation as Figure 43 (top), but with the textures off. In Figure 44 the triangle mesh is bent to meet the control-points.

Figure 43. The same situation presented from the side with textures on; from this point of view it can be seen that the control points have different depths (top). The same situation with the textures off (bottom).

Pressing the F8 key executes the TPS interpolation, which produces new positions for the mesh nodes. Figure 44 (bottom) shows the same situation with textures on: the original 2D image now has a 3D form. Figure 45 shows the same situation as Figure 44 with textures on but from a different angle, and Figure 46 shows it from the back side.


Figure 44. The triangle mesh is bent to meet the control points. Pressing the F8 key executes the TPS interpolation, which produces new positions for the mesh nodes (top). The same situation with textures on (bottom).


Figure 45. The same situation as in Figure 44 with textures on but from a different angle.

Figure 46. The same situation as in Figure 44 with textures on but from the back side.


10 DISCUSSION

At first this topic felt compact. However, some experiments and a review of previous research revealed that 3D reconstruction from a single image is not as trivial as one might first assume. A remarkable amount of research has been done in the field of multi-view and vision-from-motion techniques. The topic of 3D reconstruction from images has been a long-standing issue in computer vision, and interest in it has been increasing lately. Robot vision and obstacle detection at high speeds have been among the latest fields of interest. If depth could be extracted automatically and with better results from a single still image, this would certainly contribute more information to existing stereo-vision techniques and improve their performance as well. This is perhaps one of the reasons why single-view 3D reconstruction research has recently been focusing on automatic solutions. This kind of reconstruction is considered more difficult than reconstruction from stereo vision or multiple images: a single still image contains less of the information essential for 3D reconstruction than multiple images do. This is why the automatic approaches often use some "prior" knowledge of or heuristics about the image scene. These automatic 3D reconstruction systems are often trained with this prior knowledge using supervised learning techniques.

Automatic scene understanding from a single image appears to be a very hard problem. It is possible to detect shapes automatically in an image. However, it is not possible to detect the occluded parts of the scene automatically without making major assumptions. There have been experiments that use probabilities, perspective line continuity, information from shading, and so on; this kind of information can be used to guess or predict how the occluded parts of the image are formed. This is why an unclosed 2.5D mesh model is the closest 3D reconstruction that can be accurately formed from a single image without prior knowledge of the scene.

In order to reconstruct a 3D model from a single still image with satisfying results, one is compelled to use a variety of image processing techniques such as feature detection, segmentation, triangulation, mesh generation, depth map generation, applications of mathematics, image filtering, pixel processing, OpenGL programming and so on. Many of these areas alone appear challenging enough to be the topic of a single MSc thesis or even a PhD thesis. The greatest challenge in this work for me has been gathering and finding all the missing pieces of the puzzle and combining them into a seamless solution. The topic, the techniques used, the programming environment, OpenGL and the programming language were all things I had not worked with before. In the limited time, many different topics and techniques had to be studied, and in order to get satisfying results, a combination of these techniques was required. It has been challenging to keep the focus on getting results within the limited time without diving too deep into the details. During this work I have learned a thing or two outside the scope of this thesis as well, since some "wrong turns" could not be avoided in the early stages of the work.

The field of image processing methods is broad. In order to achieve results in a limited time, a few guidelines were set to narrow down the search. User interaction through the user interface set one restriction for this work: a quick response time to user actions is often considered one of the key factors for a good user experience, so algorithms that are fast and easy to implement were sought and selected for this implementation. Another guideline was to work toward a solution where the amount of required user interaction is minimal. The desire to move toward automatic solutions in the future also had some impact on the methods selected for this implementation.

In this thesis the main focus was on 3D reconstruction from a single still image. No prior knowledge of the reconstructed scenes was applied or considered available. The program user is free to select any image and perform 3D reconstruction on it. The lack of information in a single image makes human interaction a necessity. It appears that there is still relatively little work on reconstruction from a single still image, and this research has lately been focusing more on automatic approaches.

In this thesis one approach and solution was implemented for transforming a 2D image into a 3D representation. The image was first segmented into super-pixels, which were then combined into larger surfaces by user selection. After this, a feature map of the image was computed; this feature map reveals the shapes in the image (edges, corners, etc.). A dithering algorithm was then executed on the feature map. This way it was possible to extract the point cloud that represents the mesh nodes of a triangle mesh. Also, at this phase each point of the point cloud was classified as belonging to one of the segmented surface areas defined by the user.

The triangle mesh was generated from the point cloud by using the Delaunay triangulation. The original image was bound to the mesh elements by defining corresponding OpenGL texture coordinates for each mesh node individually; this way the original image was "glued onto" the mesh elements. After this, the only thing left was to define the 3D depth map of the image and bend the generated mesh to meet this depth map. In this solution, the mesh nodes were extracted so that the generated mesh adapts to the image content: the density of mesh nodes is high in areas with many details and lower in areas with fewer details.

In this implementation the depth-map has to be defined manually. The user is required to define control points with 3D depths for each surface. If a defined surface is considered a flat planar surface (like a wall), only three control points are required. This is enough to set the orientation of the surface, and one can get results similar to those of Li et al. [11] and Sturm et al. [19] with piecewise planar objects. On the other hand, if the surface is not flat, the user is required to define more control points on the surface. TPS allows the reconstruction of surfaces that are not planar.

If all the surfaces in the image are planar, the TPS calculation phase remains very fast. The TPS is executed for each defined surface individually, which gives the program user far better control over the transformation process. The process could perhaps be improved with a method such as that of Pohjola et al. [39], who solved the TPS with QR-decomposition when computing digital elevation models of the Olkiluoto landscape.

If the number of control-points is large and the TPS is executed on the whole image instead of on the defined surfaces, the calculation problem quickly grows large and the processing time increases. It is also highly likely that the LU-decomposition fails to find an exact solution. In this situation the regularization parameter of the TPS has to be defined in order to get an approximate solution. This is usually the case if the segmentation into surfaces is not done and the whole mesh is interpolated to meet the depth map in a single TPS calculation.
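For reference, in the standard formulation [36, 37] the regularization parameter λ enters the TPS system as a multiple of the identity added to the kernel block, relaxing exact interpolation into approximation:

\[
\begin{bmatrix} K + \lambda I & P \\ P^{\mathsf{T}} & 0 \end{bmatrix}
\begin{bmatrix} w \\ a \end{bmatrix}
=
\begin{bmatrix} v \\ 0 \end{bmatrix},
\]

where K_{ij} = U(‖p_i − p_j‖), the rows of P are (1, x_i, y_i), v holds the control-point depths, and λ = 0 gives exact interpolation.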

It is hard to reconstruct with TPS images in which sharp edges and smooth curves lie near each other: the TPS interpolation tends to smooth the sharp edges away. Piecewise planar surfaces are easier to define, requiring only three control points each. It is obvious that outdoor images presenting forests, bushes and the like are far more difficult to reconstruct than images of interiors like hallways, tunnels and rooms. This is why better results can be obtained from carefully selected images; a clear, solid background is easier to extract.

10.1 Future Work

It is always nice to give some ideas for the future. Defining the control points with 3D depths is time consuming and requires a lot of work and accuracy from the user. The amount of user interaction could certainly be reduced. It would be interesting to see what kind of results could be achieved if the depth map generation were improved in a semi-automatic direction. Depth reconstruction could be based on perspective detection, color detection, shadows, corner detection, etc. A combination of these methods could then produce a preliminary depth map which could be adjusted by the user. This would reduce the user interaction and would perhaps lead to more accurate 3D models. It would probably also increase the number of control-points and lead to performance problems in the TPS interpolation phase.

The TPS interpolation phase should be optimized so that it can handle denser depth maps. When the number of control points increases, the matrices used in the LU-decomposition grow too large. For large sets of control points there are more complex methods for solving the TPS problem that scale well; Donato et al. [37] mention that Radial Basis Function interpolation is one such method. In this work, the TPS calculation problem was divided into computational units: the TPS was executed for each defined surface individually. At the moment this easily produces image distortions between the surface areas. It would be worth experimenting with some kind of overlapping method between the computational units, which would perhaps smooth the distortions between the surfaces. Pohjola et al. propose one such overlapping method with TPS when creating high-resolution digital elevation models in [39].

The mesh generation phase should also be taken under consideration. The mesh generation could be divided according to the defined surfaces: instead of one triangle mesh that covers the whole image, each surface area could have a separate triangle mesh. This would also cut down the distortions between the surface areas. The use of separate meshes would make it possible to handle each surface more freely; their orientation and position could be adjusted freely with OpenGL, and the user could then assemble the 3D model from these extracted meshes.

Programming in the Linux environment with C and OpenGL was something I had no earlier experience with, and working with graphics was also a completely new field of interest for me. Although the C programming language is the programming language of the Gods, I would prefer using some other, object-oriented programming language (C++ or C#). Programming in C was a much slower and less productive process than I was used to. C# in the Windows environment has been my main tool for some time now. The .NET framework and C# provide a wide variety of tools for a programmer. It is also possible in C#.NET to mix C# and C code inside the same application: performance-critical parts can still be programmed in C while the user interface parts are implemented in C#. This would be interesting to experiment with in the future as well. There also exists a cross-platform, open source .NET development framework (http://www.mono-project.com) which works in the Linux environment.


11 CONCLUSIONS

Reconstruction from a single still image is considered a far more difficult task than reconstruction from stereo images or from multiple images taken from different viewpoints. A single image holds far less information than is essential for 3D reconstruction. In this work, no prior knowledge of the reconstructed scenes was applied or considered available. This lack of knowledge makes human interaction a necessity.

In this thesis a proof-of-concept computer program was developed using the C programming language and OpenGL in the Linux environment. All information used in the reconstruction process was obtained from a single 2D image. The information required for the reconstruction process, such as 3D depth, was provided via the GUI: the user is required to assist and guide the reconstruction process. No supervised or unsupervised learning techniques were used.

The 3D reconstruction from a single image was implemented in the following way. A triangular mesh that adapts well to the image content was generated by Delaunay triangulation. A point cloud for the triangulation was obtained by running the Floyd-Steinberg error diffusion algorithm on a calculated feature map. The image was also segmented into small homogeneous patches. The user of the program was then required to form larger entities by joining these super-pixels together. These defined and extracted surfaces were then treated as computational units.

In the final 3D representation the surfaces can have different orientations and depths in 3D space. The user was also required to define the depths of the surfaces by placing control points over the image. These control points were used to change the orientation and depth of each surface to obtain the 3D representation. The TPS interpolation method was used to bend each surface to meet its control points.

Through these phases it was possible to transform a 2D flat image plane into a 2.5D manifold, which is the closest possible 3D reconstruction from a single image without using prior knowledge or making assumptions about the occluded parts of the image scene. A full 3D reconstruction would require multiple images taken from different angles around the scene.


REFERENCES

[1] Cyganek B.; Siebert J.P. An Introduction to 3D Computer Vision Techniques and Algorithms. Wiley, 1st edition, 2009. ISBN 978-0-470-01704-3.

[2] Pollefeys M. Visual 3D Modeling from Images (tutorial notes). University of North Carolina - Chapel Hill, USA, 2002. http://www.cs.unc.edu/~marc/tutorial.pdf.

[3] Snavely N.; Garg R.; Seitz S.M.; Szeliski R. Finding paths through the world's photos. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2008), 27(3):11–21, 2008.

[4] Angel E.; Shreiner D. Interactive Computer Graphics: A Top-Down Approach with Shader-Based OpenGL. Addison Wesley, 6th edition, 2010. ISBN 978-0-13-254523-5.

[5] Hartley R.I.; Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000. ISBN 0521623049.

[6] Grimson W.E. Why stereo vision is not always about 3D reconstruction. MIT Artificial Intelligence Laboratory, Memo 1435, 1993.

[7] Debevec P.E.; Taylor C.J.; Malik J. Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '96, pages 11–20, New York, NY, USA, 1996. ACM.

[8] Saxena A.; Sun M.; Ng A.Y. 3-D reconstruction from sparse views using monocular vision. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, volume 76, 2007.

[9] Saxena A.; Chung S.H.; Ng A.Y. Learning depth from single monocular images. In NIPS, volume 18, 2005.

[10] Michels J.; Saxena A.; Ng A.Y. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 593–600, New York, NY, USA, 2005. ACM.

[11] Li Z.; Liu J.; Tang X. Shape from regularities for interactive 3D reconstruction of piecewise planar objects from single images. In Proceedings of the 14th Annual ACM International Conference on Multimedia, MULTIMEDIA '06, pages 85–88, New York, NY, USA, 2006. ACM.

[12] Hoiem D.; Efros A.; Hebert M. Recovering surface layout from an image. International Journal of Computer Vision, volume 75, pages 151–172, 2007.

[13] Delage E.; Lee H.; Ng A.Y. Automatic single-image 3D reconstructions of indoor Manhattan world scenes. In International Symposium of Robotics Research, volume 28, 2005.

[14] Saxena A.; Sun M.; Ng A.Y. Learning 3-D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 31, pages 824–840, 2009.

[15] Saxena A.; Chung S.H.; Ng A.Y. 3-D depth reconstruction from a single still image. International Journal of Computer Vision, volume 76, 2007.

[16] Chiu H-P.; Kaelbling L.P.; Lozano-Pérez T. Automatic class-specific 3D reconstruction from a single image. CSAIL Technical Reports, 2009.

[17] Ting Z.; Feng D.D.; Zheng T. 3D reconstruction of single picture. In Proceedings of the Pan-Sydney Area Workshop on Visual Information Processing, pages 83–86, 2004.

[18] Shum H.-Y.; Szeliski R.; Baker S.; Han M.; Anandan P. Interactive 3D modeling from multiple images using scene regularities. In SMILE Workshop, Freiburg, Germany, pages 236–252, 1998.

[19] Sturm P.F.; Maybank S.J. A method for interactive 3D reconstruction of piecewise planar objects from single images. In Proc. BMVC, pages 265–274, 1999.

[20] Han F.; Zhu S.-C. Bayesian reconstruction of 3D shapes and scenes from a single image. In Proceedings of the First IEEE International Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, pages 12–20, Washington, DC, USA, 2003. IEEE Computer Society.

[21] Shreiner D. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Versions 3.0 and 3.1. Addison Wesley, 7th edition, 2010. ISBN 978-0-321-55262-4.

[22] O'Rourke J. Computational Geometry in C. Cambridge University Press, 2nd edition, 1998. ISBN 978-0521649766.

[23] Barber C.B.; Dobkin D.P.; Huhdanpaa H. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw., 22:469–483, December 1996.

[24] Shewchuk J.R. Triangle: engineering a 2D quality mesh generator and Delaunay triangulator. In Ming C. Lin and Dinesh Manocha, editors, Applied Computational Geometry: Towards Geometric Engineering, volume 1148 of Lecture Notes in Computer Science, pages 203–222. Springer-Verlag, May 1996. From the First ACM Workshop on Applied Computational Geometry.

[25] Shewchuk J.R. Delaunay refinement algorithms for triangular mesh generation. Computational Geometry, volume 22, pages 21–74, 2002.

[26] Ruppert J. A Delaunay refinement algorithm for quality 2-dimensional mesh generation. Journal of Algorithms, 18(3):548–585, 1995.

[27] Yang Y.; Wernick M.N.; Brankov J.G. A fast approach for accurate content-adaptive mesh generation. IEEE Transactions on Image Processing, volume 12, 2003.

[28] Floyd R.W.; Steinberg L. An adaptive algorithm for spatial greyscale. Proceedings of the Society for Information Display, 17(2):75–77, 1976.

[29] Bovik A.C. Handbook of Image and Video Processing. Academic Press, 2nd edition, 2005. ISBN 978-0121197926.

[30] Myler H.R.; Weeks A.R. The Pocket Handbook of Image Processing Algorithms in C. Prentice Hall, 1st edition, 1993. ISBN 0-13-642240-3.

[31] Bovik A.C. The Essential Guide to Image Processing. Elsevier, 1st edition, 2009. ISBN 978-0-12-374457-9.

[32] Poynton C. Digital Video and HD: Algorithms and Interfaces. The Morgan Kaufmann Series in Computer Graphics, 1st edition, 2002. ISBN 978-1558607927.

[33] Rost R.J.; Licea-Kane B. OpenGL Shading Language. Addison Wesley, 3rd edition, 2009. ISBN 978-0-321-63763-5.

[34] Duchon J. Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. RAIRO Analyse Numérique, volume 10, pages 5–12, 1976.

[35] Weimer H.; Warren J. Subdivision schemes for thin plate splines. Computer Graphics Forum, volume 17, pages 303–313, 1998.

[36] Bookstein F.L. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 11, pages 567–585, 1989.

[37] Donato G.; Belongie S. Approximation methods for thin plate spline mappings and principal warps. In 7th European Conference on Computer Vision, volume 2352, pages 13–31, 2002.

[38] Elonen J. Thin Plate Spline editor - an example program in C++. http://elonen.iki.fi/code/tpsdemo/, 2005. Based mostly on "Approximation Methods for Thin Plate Spline Mappings and Principal Warps" by Gianluca Donato and Serge Belongie, 2002.

[39] Pohjola J.; Turunen J.; Lipping T. Creating High-Resolution Digital Elevation Model Using Thin Plate Spline Interpolation and Monte Carlo Simulation. Working Report 2009-56, Olkiluoto. Posiva, 2009. 56 p.

[40] Comaniciu D.; Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 24, pages 603–619, 2002.

[41] Felzenszwalb P.; Huttenlocher D. Efficient graph-based image segmentation. International Journal of Computer Vision, volume 59, pages 167–181, 2004.

[42] Shi J.; Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 22, 2000.

[43] Su Y.; Song X. Normalized cuts and image segmentation (presentation). Image Rochester NY, 2003.

[44] Achanta R.; Smith K.; Lucchi A.; Fua P.; Süsstrunk S. SLIC superpixels. EPFL Technical Report no. 149300, 2010.

[45] Ren X.; Malik J. Learning a classification model for segmentation. In Proceedings of the 9th International Conference on Computer Vision, pages 10–17, 2003.

[46] Mori G. Guiding model search using segmentation. In Proceedings of the Tenth IEEE International Conference on Computer Vision, volume 2, pages 1417–1423, 2005.

[47] Levinshtein A.; Stere A.; Kutulakos K.N.; Fleet D.J.; Dickinson S.J.; Siddiqi K. TurboPixels: fast superpixels using geometric flows. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 31, pages 2290–2297, 2009.

[48] Felzenszwalb P.; Huttenlocher D. Image segmentation using local variation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 98–104, 1998.

[49] Vedaldi A.; Soatto S. Quick shift and kernel methods for mode seeking. In European Conference on Computer Vision, volume IV, pages 705–718, 2008.

[50] Golub G.H.; Van Loan C.F. Matrix Computations. JHU Press, 3rd edition, 1996. ISBN 0-8018-5414-8.


Appendix 1. The Keyboard controls for the GUI

Keyboard keys are used in the program's user interface to launch procedures and change the state modes of the program.

Table A1.1. Keyboard controls table.

Key  Action                                                            Phase
F1   reload the original image from the image file into the buffer     0
F2   commit average filtering for the image loaded in the buffer       0
F3   convert the color image in the buffer into gray-scale             2
F4   calculate a feature map from the pixels in the buffer             2
F5   execute the Floyd-Steinberg error diffusion algorithm             2
F6   generate the Delaunay triangulation and draw the resulting mesh   2
F7   execute the local variation segmentation algorithm                1
F8   execute the TPS interpolation                                     3
h    hide/show drawing aids
t    textures on/off (off = render mesh without textures)
p    render control points on/off
f    fullscreen on/off
s    save control points to file (if depth edited)
->   increase or change the field value under selection in AntTweakBar
<-   decrease or change the field value under selection in AntTweakBar


Appendix 2. Configuration files

// a3dr configuration file.

// load image from :
image_filename = "valencia.jpg";

// write segments to file :
image_segments_filename = "valencia.jpg.data.segments.cfg";

// write controlpoints to file :
image_controlpoints_filename = "valencia.jpg.data.controlpoints.cfg";

// write meshnodes to file :
image_meshnodes_filename = "valencia.jpg.data.meshnodes.cfg";

// settings for segmentation :
segmentation =
{
  sigma = 0.5;
  k = 500;
  min_size = 100;
};

// valencia.jpg.data.controlpoints.cfg
// saved control-points in configuration file.
data =
{
  controlpoints = (
    {id = 0; x = -48.00; z = -31.00; depth = 0.00; seg = 1; },
    {id = 1; x = 48.00; z = -31.00; depth = 0.00; seg = 1; },
    {id = 2; x = 42.83; z = -1.73; depth = 20.00; seg = 1; },
    {id = 3; x = -48.00; z = -13.90; depth = 10.00; seg = 2; },
    {id = 4; x = -48.00; z = 33.00; depth = 10.00; seg = 2; },
    {id = 5; x = 35.30; z = 1.16; depth = 35.00; seg = 2; },
    {id = 6; x = 20.21; z = 33.00; depth = 0.00; seg = 4; },
    {id = 7; x = 48.00; z = 33.00; depth = 0.00; seg = 4; },
    {id = 8; x = 48.00; z = 5.50; depth = 20.00; seg = 4; },
    {id = 9; x = 2.03; z = 3.19; depth = 35.00; seg = 3; },
    {id = 10; x = 39.92; z = 2.03; depth = 25.00; seg = 0; },
    {id = 11; x = 7.51; z = 33.00; depth = 10.00; seg = 0; },
    {id = 12; x = -26.28; z = 33.00; depth = 10.00; seg = 0; },
    {id = 13; x = 21.97; z = -16.16; depth = 0.00; seg = 1; },
    {id = 14; x = 6.29; z = -11.45; depth = 0.00; seg = 1; }
  );
};

// valencia.jpg.data.meshnodes.cfg
// saved meshnodes in configuration file.
data =
{
  meshnodes = (
    {id = 0; x = 0.00; y = 0.00; z = 0.00; seg = 1; position = 0; },
    {id = 1; x = 400.00; y = 0.00; z = 0.00; seg = -1; position = 399; },
    {id = 2; x = 400.00; y = 267.00; z = 0.00; seg = -1; position = 106799; },
    {id = 3; x = 0.00; y = 267.00; z = 0.00; seg = -1; position = 106799; },
    {id = 4; x = 194.00; y = 14.00; z = 0.00; seg = 1; position = 5794; },
    {id = 5; x = 191.00; y = 15.00; z = 0.00; seg = 1; position = 6191; },
    {id = 6; x = 207.00; y = 15.00; z = 0.00; seg = 1; position = 6207; },
    {id = 7; x = 241.00; y = 15.00; z = 0.00; seg = 1; position = 6241; },
    {id = 8; x = 198.00; y = 16.00; z = 0.00; seg = 1; position = 6598; },
    {id = 9; x = 205.00; y = 16.00; z = 0.00; seg = 1; position = 6605; },
    {id = 10; x = 220.00; y = 16.00; z = 0.00; seg = 1; position = 6620; },
    ....
    {id = 1900; x = 28.00; y = 264.00; z = 0.00; seg = 2; position = 105628; },
    {id = 1901; x = 237.00; y = 264.00; z = 0.00; seg = 0; position = 105837; },
    {id = 1902; x = 24.00; y = 265.00; z = 0.00; seg = 2; position = 106024; },
    {id = 1903; x = 250.00; y = 265.00; z = 0.00; seg = 4; position = 106250; }
  );
};

// valencia.jpg.data.segments.cfg
// saved segments in configuration file.
data =
{
  segments = (
    {id = 0; x0 = 67.00; y0 = 130.00; x1 = 372.00; y1 = 266.00; pixels = 13843; border = 11;
     payload = (
       {id = 0; x = 349.00; y = 130.00; },
       {id = 1; x = 371.00; y = 131.00; },
       {id = 2; x = 372.00; y = 150.00; },
       {id = 3; x = 372.00; y = 151.00; },
       ....
       {id = 13842; x = 69.00; y = 266.00; }
     );
    },
    ...
    {id = 5; x0 = 372.00; y0 = 128.00; x1 = 399.00; y1 = 154.00; pixels = 452; border = 9;
     payload = (
       {id = 0; x = 384.00; y = 128.00; },
       ...
       {id = 451; x = 399.00; y = 154.00; }
     );
    }
  );
};