arxiv:1406.6595v2 [cs.cv] 20 aug 2016 · preprint submitted to arxiv august 23, 2016...

A Structured-Light Scanning Software for Rapid Geometry

Acquisition

Qing Gu1,∗, Kyriakos Herakleous1,∗∗, Charalambos Poullis1

aImmersive and Creative Technologies Lab, Department of Computer Science and Software Engineering,Concordia University

Abstract

Recently, there has been an increase in the demand of virtual 3D objects representing real-lifeobjects. A plethora of methods and systems have already been proposed for the acquisitionof the geometry of real-life objects ranging from those which employ active sensor technology,passive sensor technology or a combination of various techniques.

In this paper we present the development of a 3D scanning system which is based on theprinciple of structured-light, without having particular requirements for specialized equip-ment. We discuss the intrinsic details and inherent difficulties of structured-light scanningtechniques and present our solutions. Finally, we introduce our open-source scanning soft-ware system ”3DUNDERWORLD-SLS” which implements the proposed techniques bothin CPU and GPU. We have performed extensive testing with a wide range of models andreport the results. Furthermore, we present a comprehensive evaluation of the system anda comparison with a high-end commercial 3D scanner.

Keywords: 3D, Scanning, Reconstruction, Structured-light scanning, SLS system

1. Introduction

In recent years, there is an increasing demand for 3D models and in particular 3D modelsrepresenting/replicating real-world objects. Of the many available techniques ([? ], [? ])([?],[? ]), Structured-light Scanning (SLS)([? ]) systems have emerged as the most cost-effective and accurate method to capture the 3D geometry and appearance of a real object.

SLS systems employ active-sensors such as projectors and laser emitters to project lightof a known structure (pattern). The scanning process involves the projection of a series ofthese known patterns on the object being scanned, while capturing the scene from one ormore cameras. The projected pattern will be distorted according to the geometry of theobject being scanned. The geometry of the object can then be computed by identifying

∗Lead researcher from v4 [CPU + GPU].∗∗Lead researcher up to and including v3.2.Email addresses: [email protected] (Qing Gu), [email protected] (Kyriakos

Herakleous), [email protected] (Charalambos Poullis)

Preprint submitted to arxiv August 23, 2016

arX

iv:1

406.

6595

v2 [

cs.C

V]

20

Aug

201

6

correspondences between the pixels in the images captured by the camera(s). In order tobe able to identify these pixel correspondences, the projected pattern is encoded such thatevery pixel in the projected pattern can be uniquely identified in the images captured bythe cameras. This provides an efficient and very accurate method of mapping correspondingpixels between the images captured by the cameras. Finally, using this mapping betweencorresponding image pixels, it is possible to calculate their accurate 3D positions.

Many variants of SLS systems have already been proposed each one tailored to a partic-ular task. Although the theory behind the SLS systems is well documented and understood,there are still many issues one has to consider when developing or using SLS systems, whichare currently lacking documentation. In this paper, we present all the intrinsic details,limitations and solutions one has to consider when involved with the design, developmentand application of SLS systems. Moreover, we introduce the open-source scanning system”3DUNDERWORLD-SLS” and present the results of extensive testing.

The paper is organized as follows: Section 2 provides a brief overview of state-of-the-art in the area of 3D scanning, and Section 3 gives an overview of the scanning system.The different schemes for the encoding of the patterns are presented in Section 4 and thecalibration of the cameras is described in Section 5. Section 6 describes the data acquisition.The captured images are decoded as explained in Section 7 and the reconstructed pointsare calculated as described in Section 8. Section 9 describes how the points can be finallyconverted to a mesh. In Section 10 we report on the results of our extensive testing and inSection 11 we provide a comprehensive evaluation of the scanning system.

Figure 1: System Overview

2. Related Work

A plethora of work has been done in the area of scanning and in particular structured-light scanning. Below we present a brief overview on the state-of-the-art in the area ofscanning systems:

Microsoft’s Kinect[? ] is currently attracting the attention of researchers in the area.The device employs an infrared camera and pattern projector system to enable 3D vision inorder to recognize people in the scene. Many uses of this device have already been proposedfor fast 3D geometry acquisition, even for full body scanning as reported in [? ] [? ] for 3Davatar building and 3D human animation purposes.

There are also uses of the device combined with other techniques for higher resolutionresults, such as the system proposed in [? ] which integrates traditional stereo matchingwith Kinect’s information to take the advantages of both. In [? ] a multi-camera system fordense 3D Point cloud computation is proposed. The system uses 5 cameras and a Kinectdevice. In this case the Kinect device is not directly used for the 3D acquisition, but ratheras a light source where the IR projector embeds features in the scene which the 4 infrared

2

sensitive cameras are observing while the fifth is used for movement registration. Similarly,in [? ] a camera flash based projector is presented for use in stereo camera systems (withtwo cameras) to improve stereo matching.

Nowadays, due to the improvement of computational systems, 3D geometry of real scenesor objects can find uses for Virtual and Augmented Reality applications even for homeapplications. KinectFusion [? ] is a system that allows 3D reconstructions through amoving Kinect device and in addition it can be used for augmenting real scene based on itsreconstructed geometry.

In [? ] the authors propose a system which enables users to interact with surface particlesin real world with the use of a camera, a projector and an infrared pen. The proposed systememploys SLS 3D scanning technology in order to reconstruct the 3D geometry of the scenein order appropriately augment the surface particles with the projector. In [? ] the use ofdigitization of archaeological findings is discussed for the development of integrated virtualexhibitions.

3. System Overview

Figure 1 shows the system overview. Firstly, given as input the resolution of the projec-tor, a sequence of patterns is encoded. Two of the most effective ways of encoding this typeof information are presented in Section 4. Secondly, the cameras-projector system is geo-metrically calibrated and the intrinsic and extrinsic parameters are calculated as describedin Section 5. Thirdly, the sequence of patterns is projected on the object and images arecaptured by the cameras for each pattern in the sequence, as described in Section 6. Finally,the captured images are decoded in order to derive a pixel-to-pixel mapping between theimages captured by the cameras as explained in Section 7 and is then used to reconstruct apointcloud of the object using triangulation as described in Section 8.

4. Encoding in Temporal Domain

The first step of the process is the encoding of the information into patterns in thetemporal domain. This involves the generation of encoded patterns which when projectedin sequence on the object being scanned they allow the unique identification of each of theprojector’s pixels. In order to achieve this, the encoding of the information is performed inthe temporal domain i.e. the encoding is a function of time and all encoded patterns in thesequence are required in order to uniquely identify each pixel.

There are several schemes for encoding information into sequences of patterns, the mostpopular being the Binary-code and Gray-code encoding which are performed in the temporaldomain. In both cases, the information about the two image axes X, Y is encoded separatelyinto a different pattern. A projector P with resolution Presx , Presy , will result in Ncols =dlog2(Presx)e encoded patterns representing the columns, and in Nrows = dlog2(Presy)e en-coded patterns representing the rows. For example a projector with resolution 1024x768will result in Ncols = 10 and Nrows = 10 i.e. a total of 20 patterns.

3

4.1. Binary Encoding

In the binary encoding, the decimal number Dci corresponding to each column ci wherethe index i ∈ Presx is converted to its binary form Bci . Each individual bit bkci wherek ∈ [1, |Bci |], of the binary form Bci is then marked in the kth pattern in the sequence, withblack color if its value is 0 or white color if its value is 1. Note, that the length of the binaryform Bci equals the number of patterns Ncols. Similarly, this process is repeated for eachrow rj where j ∈ Presy .

An example of the process is shown in Figure 2 where the binary form of a single pixel pat location (209, 660) is encoded into two sequences of 1x1 patterns in temporal domain; onesequence representing the column and the other the row information. The column sequencerepresents the binary form of the column’s decimal number i.e. Bcp = 1000101100 and therow sequence represents the binary form of the row’s decimal number i.e. Brp = 0010100101.As it can be seen, each pattern corresponds to a bit in the binary pattern and is representedwith either the color black or white.

Figure 2: Encoding of a single pixel p at location (209, 660) is encoded into two sequencesof 1x1 patterns in temporal domain; one sequence representing the column and the otherthe row information.

By iteratively performing the process for every pixel p(x,y) in the projector where x ∈resx, y ∈ resy, two sequences of 2D patterns are produced representing all the columns andall the rows respectively. Figure 3 shows an example of the final sequences for the columnsand rows. The patterns are shown in the order they are projected as indicated by the timearrow i.e the first pattern to be projected corresponds to the most significant bit and appearsas the last one in the diagram, followed by the remaining. As time progresses the frequencyof the pattern increases until it reaches the highest frequency which represents the leastsignificant bit in binary pattern.

The result is a sequence of binary encoded patterns where each pattern represents onebit of the binary sequence. This produces images containing vertical stripes in the case ofthe columns, and horizontal stripes in the case of the rows.

4.2. Gray-code Encoding

The Gray-code encoding works in a similar way as the binary encoding previously de-scribed, however it ensures that there is only one bit different between consecutive patterns.

4

Figure 3: Encoding the information into a sequence of patterns in temporal domain.

Figure 4 shows a comparison between Gray and Binary encoding for the columns of an 8x8area. As it is evident, in the case of Gray codes there is a maximum of one bit differencebetween any two consecutive encoding of values. Figure 6b shows a sequence of encodedpatterns using Gray-codes, corresponding to the same area of 8x8 as in Figure 6a.

Decimal Value Gray-code Binary code0 000 0001 001 0012 011 0103 010 0114 110 1005 111 1016 101 1107 100 111

Figure 4: Comparison between Gray codes and binary codes. In the case of Gray codesthere is only one bit difference between any two consecutive encodings.

Gray codes can be calculated by first computing the Binary representation of a numberand then converting it as follows: copy the most significant bit as is, and then for theremaining bits (taking one bit at a time), replace with the result of an XOR operation ofthe current bit with the previous bit of higher significance in the binary form. An exampleis shown in Figure 5a for the computation of the Gray code for the decimal number 93.

Similarly, the conversion of a Gray code to the corresponding Binary code is as follows:copy the most significant bit as is and for the remaining bits (taking one bit at a time),replace with the result of the XOR between the current bit in the Gray code and theprevious bit of higher significance in the Binary code. Figure 5b shows the computation ofthe Binary code from the Gray code representing the decimal number 93.

Although the two schemes presented above result in similar patterns, the Gray encodingis considered as a better alternative to the Binary Codes. When using Gray codes the valueof the least significant bit changes after two consecutive stripes, in contrast to the Binarycodes where the value changes at every pattern. This can be seen in Figure 6a, 6b wherethe pattern using Gray encoding corresponding to the least significant bit, contains stripes

5

(a) (b)

Figure 5: (a) Encoding of decimal number 93. The Binary Code representing the decimalnumber 93 is converted into Gray Code. (b) Conversion of Gray Code to Binary Code.

(a)

(b)

Figure 6: (a) Binary encoding of the columns for an area of 8x8 pixels. (b) Gray encodingcorresponding to the same columns of the area in (a).

with thickness of two pixels, as opposed to the pattern using Binary encoding correspondingto the least significant bit which contains stripes with thickness of a single pixel. Largerstripe thickness is preferable since it can considerably reduce unwanted effects such as colorbleeding in the projection or/and in the captured images.

4.3. Implementation Issues

The generation of the two sequences of pattern is performed using Gray encoding asdescribed above. However, there are two important implementation issues to consider:

• Firstly, the choice of the colors to be used in the patterns. The patterns can have anytwo colors, although traditionally black and white has been used. In any case, eachpixel must have an intensity value which will be used as a threshold to determine thevalue of the pixel i.e. 1 or 0. Therefore, it is imperative to take this into accountwhen choosing the two colors and choose colors which will result in large intensity

6

differences. A poor choice on colors, can otherwise jeopardize the entire process byfailing to distinguish a pixel’s value later on in the process. For this reason and toovercome this limitation, it is preferable to project a pattern followed by a projection ofits color-inverted pattern. Inverted pattern images are images with the same structureas the original but with inverted colors. This provides an effective method for easilydetermining the intensity value of each pixel when it is lit (highest value) and when itis not lit (lowest value). The threshold for the intensity value of a pixel p with a highestvalue ph and a lowest value pl can then be computed as the average τp = (ph − pl)/2.

• Secondly, identifying shadow regions. The cameras observe the object from differentangles than the projector, hence quite often there will be a viewing areas which liesin shadow regions. Thus, it is preferable that pixels falling under a shadow regionare removed at the early stages of the process. This can be achieved by projecting awhite and then black image on the object and capturing the images. By evaluatingthe intensity value of the pixels in the images, one can determine which pixels fallunder a shadow region by identifying cases where the intensity values in the twocaptured images are very similar. A ”shadow mask” can then be created which leavesonly the pixels which do not fall under shadow regions. This can considerably reducecomputational processing time. Section 7.1 explains how to compute the shadow mask.

Thus, the final projection sequence contains the column and row pattern sequences en-coded using Gray-code, their inverted patterns, as well as two images of solid colors, onefor each used color. The patterns are projected in sequence as follows: first the two solidcolor images, then interchangeably the column and its inverted sequence, followed by inter-changeably the row and its inverted sequence.

5. Camera Calibration

When considering a camera, the image is formed when light rays being reflected in thescene, pass through the camera lens, through the aperture and interact with the photo-sensitive light sensor. If there had been no light sensor present, then the light rays wouldconverge at a single point called the center of projection as shown in Figure 7. A projectorcan be thought of as an inverted camera. The light rays start from a point i.e. light bulbinside the projector, then go through mirrors and/or lenses and finally interact with thescene.

In both cases (camera, projector), one can relate a light ray to a particular pixel if thegeometry of the system at the time the image was taken is known. The geometry is definedin terms of a set of intrinsic and extrinsic parameters which can be computed by performinga geometric calibration.

To calibrate the cameras, an object with known geometric characteristics is used; usuallya flat board containing a printed checker pattern. By taking images of the board at differentpositions and orientations as explained in [? ] we can compute the intrinsic and extrinsic

7

Figure 7: Image formation. Light rays being reflected in the scene, pass through the cameralens, then through the aperture and interact with the photo-sensitive light sensor.

parameters which specify the camera matrix C in Equation 1,

C =

α −αcot(θ) u0

0 βsin(θ)

v0

0 0 1

︸︷︷︸

intrinsic

r11 r12 r13 txr21 r22 r23 tyr31 r32 r33 tz

︸︷︷︸

extrinsic

(1)

where α = kfx, β = kfy, (fx, fy) is the focal length on the x and y axis respectively, θ isthe skew angle, u0, v0 is the principal point on the x and y axis respectively and r1−3, tx−zdetermine the camera’s rotation and translation relative to the world.

Given a minimum of four 2D to 3D correspondences specified interactively by the operatorthe camera extrinsic and intrinsic parameters can be accurately estimated. The camera poseestimation is performed using a non-linear Levenberg-Marquardt optimization [? ? ] whichminimizes the error function ECk

for each camera Ck,

ECk=

1

n

n∑i=0

√(I ix − P i

x)2 + (I iy − P i

y)2) (2)

where I i is the ith image point, P i is the projection of the ith 3D world point and n isthe number of 2D to 3D correspondences.

In addition to the camera matrix C as defined in Equation 1, the intrinsic parame-ters include the distortion coefficients; three coefficients for the radial distortion and twocoefficients for the tangential distortion.

Although, theoretically the camera matrix C can be computed from a single image of theflat board, it is recommended that 10-20 images of the flat board at different orientationsand positions are used.

Multi-Camera System Extrinsics: In order to calculate the extrinsic parameters ofeach camera, its intrinsic ones must be already calculated and then using one picture, wherethe origin of the world is specified, the rotation and translation based on the world origincan be estimated. To achieve this, all cameras in the system must capture the scene at

8

the same time, in order to take the same position and orientation of the board,as shownin Figure 8. In this way the same world origin can be used in order to relate the motionbetween the cameras i.e. the rotation and translation, as explained later in the Section 8.

Figure 8: Multiple Camera Calibration. For each orientation and position of the calibrationboard, images are captured from all cameras. This allows the use of a common referencepoint i.e. world origin.


• Once the calibration is performed, there should be no movement to any part of thesystem otherwise a re-calibration will be needed.

• The camera calibration in 3DUNDERWORLD-SLS is implemented using OpenCV’scalibration functions. Another well-known and established calibration toolbox is theCamera Calibration Toolbox for MatLab [? ].

6. Acquisition

The acquisition is a relatively straight forward process. Each image of the sequence isprojected on the object and every camera in the system captures an image. However, thereare several things to consider:

• The object should remain static during the acquisition.

• The cameras’ and the projector’s settings (exposure, brightness etc.) should be ad-justed according to the lighting in scene. Indirect light reaching the scene from othersources should be eliminated if possible in order to achieve better results.

• If the acquisition process is automated then it is recommended that there is a ”forced-delay” after each image capture until the cameras confirm that the last image wascaptured and stored successfully. This will reduce and/or eliminate unwanted artifacts

9

which may appear because the projector delayed the projection of a pattern or thecamera delayed the capturing an image.

An acquisition set consists of the images captured by each camera for each pattern inthe sequence. However, from each acquisition set, only the parts of the scene which arevisible to all the cameras can be reconstructed (Figure 9), thus more scans may be requiredto obtain all sides of the object. This can be achieved by changing the object’s positionand orientation between each acquisition set while ensuring that no changes occur in thecameras and projector.

(a) (b) (c)

Figure 9: Result of a single scan. (a) and (b) show the two camera views and (c) thereconstructed mesh. In same cases more scans are needed in order to obtain the desiredarea.

7. Decoding of Captured Images

Following the acquisition of the data, the next step is the decoding of the capturedimages. This involves the calculation of the shadow masks and the decoding of the patterns.

7.1. Computing the shadow mask

The goal of this process is to determine which pixels in a camera’s image fall undershadow regions. This involves comparing each pixel’s intensity values between the two firstprojections: the black and white images. Pixels whose intensity values in the two capturedimages corresponding to the black image projection and white image projection is large, areconsidered as a valid pixels. If the difference is small then the pixel is marked as ”in-shadow”.Figure 10 shows an example of a shadow mask.

7.2. Decoding the patterns

The cameras capture images of the encoded projected patterns. The next step is todecode each pixel in the captured images into their corresponding decimal number repre-senting the column and the row. This provides a mapping between the pixels in the cameras

10

(a) (b)

Figure 10: Shadow mask example. (a) Image captured by the camera. (b) Shadow mask -all black pixels are removed from subsequent processing.

i.e. pixels in the captured images which correspond to the same projector pixel as shown inFigure 11.

Figure 11: A 3D point is viewed by two different cameras. The decoding aims at derivinga mapping between the pixels of the two cameras, corresponding to the same 3D point.

More specifically, the decoding is performed as follows:

• Determine whether a pixel p is lit or not (1 or 0) in the images capturing the projectedsequence of patterns encoding the columns.

• Calculate its binary form Bp.

• Convert the binary form Bp into the equivalent decimal number x.

Similarly, the process is repeated for the images capturing the projected sequence of patternsencoding the rows and results in a decimal number y. Thus, (x, y) are the image coordinatesof the projector’s pixel corresponding to the pixel being decoded in a camera. By repeatingthe process for all the cameras, we can map the pixels not only to the projector’s pixelsbut rather to the other cameras viewing the object. It should be noted that a camera pixelmay be mapped to more than one pixels of another camera due to differences between thecamera and projector resolutions.

11

8. Reconstruction

The decoded captured images result in a set of a many-to-many mappings between thepixels of the different cameras. Next, by triangulating the rays corresponding to each pair,a 3D point is computed at their intersection. The projection of this point falls onto themapped pixels in the different cameras.

8.1. Pixel-to-Ray

A camera ray is a straight line in 3D Euclidean space that starts from the camera’s centerof projection (Figure 7), intersects the image plane and extends outward to the scene. Theray can be defined with a single point of the ray and the ray’s direction vector.

In order to compute the direction vector two ray’s points are needed. The first point is thecamera’s center of projection qcameracenter = (0, 0, 0). The second point is the point correspondingto the pixel from which the ray passes through. That point can be computed by first findingthe undistorted pixel’s position, using the distortion coefficients computed during cameracalibration, and then ”unprojecting” the pixel into 3D space. This is achieved by multiplyingthe inverse of the camera matrix C with the pixel’s coordinates as given by the Equation 3,

qcamerapixel = C−1 ×

xy1

(3)

Therefore, both qcameracenter and qccamerapixel are in the camera’s local coordinates frame and theymust be converted to world coordinates using the extrinsic parameters computed during thecamera calibration as follows,

qworld = R−1 × qcamera −R−1 × T (4)

where R is the rotation matrix and T is the translation vector of the camera relative to theworld origin shown in Figure 8.

Thus, the ray corresponding to pixel p is defined as Rworldp =< pworldcenter, ~v

worlddir > in world

coordinates, where ~vworlddir = pworldpixel − pworldcenter.

8.2. Ray Triangulation

Given one pixel in image I1 and its corresponding pixel in image I2, two rays are formedas described above, and their intersection is computed. In practice, the triangulation of tworays in 3D space can be considered as an ill-posed problem since quite often the two rays donot intersect ’exactly’ but rather pass by one another in close proximity.

In order to overcome this limitation, the segment perpendicular to the two rays withthe shorted distance is computed, and the middle point of the segment is considered to bethe intersection point. Figure 12 shows an example where the segment ab is the shortestline connecting the two rays A and B. The mid-point p is considered to be the intersectionpoint.

Computation of the Intersection Point: Consider two lines in 3D space A and Bpassing through points p and q, with direction vectors ~u and ~v respectively, and let the two

12

Figure 12: Ray intersection. In cases where two rays do not intersect ’exactly’, the inter-section point is considered to be the midpoint of the shortest segment connecting the tworays.

closest points on the lines be a and b, as defined in Equation 5 and Equation 6, where s andt are scalar values.

a = p+ s ∗ ~u (5)

b = q + t ∗ ~v (6)

The segment connecting a and b is perpendicular to the lines, hence the dot product of theirvectors is equal to 0 as follows,

(a− b).~u = 0 (7)

(a− b).~v = 0 (8)

and with Equations 5 and 6, the Equations 7 and 8 become,

~w.~u+ s ∗ ~u.~u− t ∗ ~v.~u = 0 (9)

~w.~v + s ∗ ~v.~u− t ∗ ~v.~v = 0 (10)

where~w = p− q (11)

From Equation 9 and Equation 10 it follows that,

s =~w.~u ∗ ~v.~v − ~v.~u ∗ ~w.~v~v.~u ∗ ~v.~u− ~v.~v ∗ ~u.~u

(12)

t =~v.~u ∗ ~w.~u− ~u.~u ∗ ~w.~v~v.~u ∗ ~v.~u− ~v.~v ∗ ~u.~u

(13)

The intersection point is the average of the 2 end-points of the segment as follows,

Pi =(p+ s ∗ ~u) + (q + t ∗ ~v)

2(14)

Thus, for all ray-pairs the intersections are computed as above in order to avoid any possibleproblems with non-intersecting rays.

13


There are several limitations with the chosen triangulation method that should be takeninto account. Firstly, although two rays (as defined in our system) should not be parallel,it is a good practice to check for near-parallel conditions as well. In order to to so, it issuggested that a near zero condition is checked on the denominator of the Equations 12 and13.

Another important note is the fact that in practice the mapping between pixels is not one-to-one but rather many-to-many i.e. an area of pixels in one image maps onto another area ofpixels in another image. This is due to the fact that the cameras and projector resolutionsare different. In situations like these it is suggested that the average of the intersectionpoints is taken as the final intersection point. Averaging the resulting intersection pointsnot only reduces the amount of generated geometry by removing duplicates bu also resultsin smoother geometry.

Finally, the linear triangulation described in 8.2 is the simplest triangulation methodand not necessarily the best one. In [? ] the triangulation problem is discussed extensively,and the authors propose an alternative method which outperforms the linear triangulationdiscussed above. We have found through extensive testing that linear triangulation performsvery well; at least in our case.

9. Point Cloud to Mesh

Thus far, we have described how to generate a cloud of 3D points in Euclidean space thatrepresents the scanned scene. In many cases this type of output is not easily usable, sinceit is difficult to manipulate. For this reason, we may first need to convert it to a 3D mesh,where the points (vertices) are interconnected (with edges) to each other producing faces.Many methods already exist with various complexities and resulting output qualities. Inthis paper a quick conversion method will be described which takes advantage of the spatialrelation between the 3D points and the projector’s pixels.

Figure 13: Neighbouring pixels. 1st level neighbours of a pixel p are the previous, next,above and below pixels. 2nd level are the diagonal pixels.

As already mentioned, each point in the cloud corresponds to a projector’s pixel; in facteach point is representing the area in the scene that was lit by the related projectors’ pixel.Based on this fact, areas that are lit by neighbouring pixels are next to each other, andsimilarly the points relating to those pixels are neighbours as well. Neighbours of a pixel pare considered the 8 surrounding pixels, where the ones placed in the same row or column

14

with p are 1st level neighbours, while the diagonals are 2nd level neighbours as it can beseen in Figure 13.

Figure 14: Conversion of point-cloud to mesh. 1st level connections result to a quadrilateralfaces. 1st and 2nd level result to triangular faces

Points of neighbouring pixels are iteratively connected to each other forming faces. Mostcommonly, faces can be either triangular or quadrilateral. In order to form triangular facesboth 1st and 2nd level neighbours are interconnected, in contrast to quadrilateral formingwhere only 1st level neighbour connections are needed as shown in Figure 14. Meshesconsisting of triangular faces have higher complexity than the quadrilateral ones, howeverthey often appear to form smoother surfaces.

The described method can introduce some artifacts in the output mesh in cases wheresome pixels are not visible by more than one camera in which case holes may appear in themesh, or cases where neighbouring pixels are projected on different surfaces in the sceneresulting connections between them. However, this method is generic and quite fast sinceit is performed in projector’s image space and works very well for most of the scene’s area.Possible hole problems can be filled by another scanning and possible connections betweenwrong faces which are seldom found can be easily removed manually. In addition, it canaccommodate extra information such as the color of the mesh and it is even suitable forcases where the scanned object is decorated with structured color motives or other images.

10. Experimental Results

10.1. Apparatus

In this work we focus on a system consisting of two digital cameras with manual (fixed)focus mode (as opposed to autofocus) and a DLP projector. 1 It is important that theprojector’s pixels are clearly captured in the images taken by the cameras, thus it is preferableto use a camera which has higher resolution than the resolution of the projection. Thishowever will not affect the resolution of the final 3D model since the reconstruction takes

13DUNDERWORLD-SLS v1.0 requires one web camera.3DUNDERWORLD-SLS v2.x requires one Canon SLR camera.3DUNDERWORLD-SLS v3.x requires two or more Canon SLR cameras.3DUNDERWORLD-SLS v4.x requires two or more cameras and includes a CUDA GPU implementation aswell as a CPU implementation in case an Nvidia card is not found. In this version, we provide a genericcamera interface implementation and which the programmer can extend to support any kind of camera.

15

place in projector’s space i.e. it will have the resolution of the projector. The processof capturing the images can be done either automatically using a connected computer, ormanually. In the latter case, it is imperative that the setup remains static. Even a slightmovement can result in misalignment and severe reconstruction errors.

In the open-source scanning system 3DUNDERWORLD-SLS v3.x, we employ two CanonSLR cameras EOS-1D Mark IV with a resolution of 4896x3264, and an inFocus IN110portable projector with a resolution of 1024x768. All three devices are connected to aportable computer which runs the software for the scanning process. Each of the encodedimages (total of 42 for a resolution of 1024x768), generated as previously explained, areprojected on the object and at the same time each connected camera captures a photo ofthe scene. At the end of the scanning process these captured images will be processed inorder to compute the 3D geometry of the scene.

Figure 15: Example of two cameras setup, with the cameras left and right of the projector.

10.2. Setup

As this technique is very general, there are several acceptable ways to position the imag-ing devices. The positioning is subjected to the size of the scanned object, the number of thecameras used and the type of their lenses. Generally it is good practice to have the camerasin non-parallel positions in order to get smoother results. Note however that as the anglebetween the cameras increases the areas of the object that can be reconstructed decreasessince the area visible to all cameras reduces. For this reason, we suggest to position theprojector in the middle of the cameras as shown in Figure 15.

Another important aspect is to adjust the optics to the object’s size. For better results,the projector can be set in a way such that most of the area being lit lies on the object; themore ’lit’ pixels project on the object the higher the resolution of the resulting pointcloud.The same rule of thumb applies to the cameras too: it is crucial to ensure that cameras arein close proximity such that the projection is visible down to the pixel level. Additionally,the more camera pixels capture one projector’s pixel the smoother the reconstructed modelwill be.

After the placement of the imaging devices, it is important to set the focus on the object’ssurface in order to have a sharp projection and sharp captured images. Before starting thescanning process the system must be calibrated.

16

Model Real Recon. Size Pts Tri.

Aphrodite88cm×20cm×20cm

602685 1205406

Aztec22.5cm×10.5cm×

6cm540745 1054972

Caryatid24.5cm×5.5cm×5.5cm

659777 1319544

Aphrodite25.5cm×19cm×14.5cm

2028162 4056320

Table 1: Videos of the reconstructed models can be found at: http://www.vimeo.com/

TheICTLab

10.3. Scanning & Reconstruction

The proposed algorithms and implemented techniques were extensively tested and theresults are reported. All results shown below were generated with the developed open-sourcescanning system 3DUNDERWORLD-SLS. Several objects were scanned and informationabout their structural size in real-life, the number of points and the number of triangles ofthe reconstructed models are presented in Table 2 and Table 3.

Figure 19 shows a render of a genuine replica of an amphora (right) next to a photo ofthe original (left). The amphora has dimensions 18cmx10cmx10cm and the resulting objectcontains 1276766 points and 2553536 faces.

11. Evaluation

The evaluation of the proposed technique is performed by measuring the following fourevaluation metrics: linearity, orthogonality, sampling rate, accuracy.

11.1. Linearity metric

A perfectly flat plane Π is scanned and a plane Πfitted = 〈α, β, γ, δ〉 with normalNΠfitted=

〈nx, ny, nz〉 is fitted on the resulting ℵ 3D points. The plane fitting is perfomed usingRANSAC and the average error Eavg and average RMSE is computed as follows,

Eavg =ℵ∑i=0

|δ − (α× ℵxi + β × ℵyi + γ × ℵzi )|/||ℵ|| (15)

RMSE =

√∑ℵi=0{δ − (α× ℵxi + β × ℵyi + γ × ℵzi )}2

||ℵ||(16)

17

http://www.vimeo.com/TheICTLab

http://www.vimeo.com/TheICTLab

(a) (b)

Figure 16: Reconstructed model of a genuine replica of an amphora. (a) Photo of theobject. (b) A render of the reconstructed model.

We scan several different planar surfaces shown in Table 2 and perform plane fittingusing RANSAC. This metric is measured as the distance of all points from the fitted surfacein terms of the average error Eavg and the root-mean-square error RMSE. As it can beseen from the reported measured the error is minuscule when considering the total numberof points. It should be noted that for the box example, the three orthogonal planes weremeasured separately.

11.2. Orthogonality metric

A set of three perpendicular planes were scanned and three planes Π1,Π2,Π3 are fittedrespectively using RANSAC. The orthogonality metric is defined as the magnitude of thethree dimensional vector containing the measured angles between the three planes in termsof the dot product as follows,

Eortho = 〈dot(N1Πfitted

, N2Πfitted

), dot(N1Πfitted

, N3Πfitted

), dot(N2Πfitted

, N3Πfitted

)〉 (17)

We scan the box object containing orthogonal planes shown in Table 2. For each of thethree planes a linear surface is fitted using RANSAC. This metric is measured as the angleformed between the three planes as shown in Table 3. As it is evident from the reportedresults the resulting planes are perpendicular (up to at least the fourth decimal point).

11.3. Accuracy

An object containing a ruler of length Lorig is scanned. The size of ruler is measured inthe reconstructed model as Lscan. The accuracy is defined as the absolute difference betweenthe two measurements,

Eacc = |Lorig − Lscan| (18)

18

Object Left Image Right image Linearity PointsEavg RMSE ℵ

Flat Plane Π1 0.0046794643 0.0083896662 490625

Flat Plane Π2 0.0109797063 0.0186112309 359659

Flat Plane Π3 0.002567169 0.004390225 540160

Box Plane Π1Box 0.0012361568 0.0015234051 59210

Box Plane Π2Box 0.0064742412 0.0090599699 24980

Box Plane Π3Box 0.0264965185 0.0352513109 24464

Table 2: Linearity metric. This metric is measured by fitting planes using RANSAC to thedata and measuring the distance of all points from the fitted surface in terms of the averageerror Eavg and the root-mean-square error RMSE.

⊥ Box Plane Π1Box Box Plane Π2

Box Box Plane Π3Box

Box Plane Π1Box (long) - 89.9997272431◦ 89.9995961527◦

Box Plane Π2Box (top) 89.9997272431◦ - 89.9978673427◦

Box Plane Π3Box (side) 89.9995961527◦ 89.9978673427◦ -

Table 3: Orthogonality metric. This metric is measured by fitting planes using RANSACto the three sides of the box shown in Table 2 and measuring the angle formed betweenthem.

All calibration parameters are calculated in millimeters, hence the measuring units for ac-curacy is also millimeters.

We scan a planar object which contains a imprinted ruler on its surface. We measureseveral distances between points on the ruler on the reconstructed model and compute theaverage accuracy as shown in the Table 4.

19

Eacc = 0.0251Eacc = 0.0827Eacc = 0.0259Eacc = 0.0166Eacc = 0.0716Eacc = 0.04335Eacc = 0.0533Eacc = 0.1223Eacc = 0.003Eacc = 0.0361Eacc = 0.0413Eacc = 0.1124Eacc = 0.0397Eacc = 0.0174Eacc = 0.1091Eacc = 0.0298Eacc = 0.0933Eacc = 0.0163Eacc = 0.0153Eacc = 0.0855

Eavg =20∑i=0

Eiacc

200.0520025

Table 4: Accuracy metric. This metric is measured by taking multiple measurements on areconstructed planar object with an imprinted ruler on its surface.

11.4. Sampling rate

The sampling rate is computed by selecting an area(patch) of known dimensions (width, height)in the reconstructed model and measuring the number of points contained. We scanned aplanar object containing an imprinted ruler on its surface. We manually crop multiple smallpatches from the reconstructed object and measure the corresponding point and face den-sities, an example of this procedure is shown in Figure 17. We report the individual pointdensities per squared centimeter (cm2) and the average point density per squared centimeter(cm2) in Table 5. As it can be also be seen in Figure 18 the mean face density per cm2 is2835.6 faces and the mean point density per cm2 is 1458.3. It should be noted that all facesare triangles for these experiments.

11.5. Comparison to a High-End Commercial 3D Scanner

The proposed algorithms and implemented techniques were evaluated against a high-end commercial laser scanner. Figure 19a shows part of a mesh scanned with a high-endcommercial 3D scanner, namely ZScanner 700CX. The laser scanner is hand held and it wasused exactly as recommended in its manual with the highest possible resolution. During

20

Figure 17: Example of a density measurement on the reconstructed object(rendered). Mul-tiple areas are selected on a marked reconstructed object and the number of points andnumber of faces(triangles) are counted in order to compute the point and face density persquared centimetre.

(a) (b)

Figure 18: The point and face density for difference patch sizes. Mean point densityper cm2: 1458.3. Mean face density per cm2: 2835.6

the the scanning process we scanned the part shown from multiple directions in order tocapture as much as possible of the curved surface. The same part of the object is shown inFigure 19b but this time it was scanned using 3DUNDERWORLD-SLS. It should be noted

21

Width (cm) Height (cm) Point density Face density1 1 1588 29952 2 5968 115943 5 21129 415513 3 12864 252114 4 22354 440201 2 3117 59555 3 20959 411972 1 3060 58383 2 8700 169522 3 8740 170245 5 34759 686485 2 14295 279522 5 14308 28004

Mean point density per cm2: 1458.3Mean face density per cm2: 2835.6

Table 5: Sampling rate. Multiple patches of difference sizes are taken from the reconstructedmodel shown in Table 4 and their corresponding point and face densities are measured.

that this is the results of a single scan.We applied hole-filling on both meshes with the same settings (i.e. of size less than

50 edges) and both meshes are shown with Gouraud shading. The mesh produced by thecommercial solution contained 107888 faces formed by 54870 points, and the mesh producedby our solution contained 55068 faces formed by 28014 points. A possible explanation forthe significant difference in the number of points and faces is the fact that the laser scannerworks in ”sweeps” therefore, as the scanner moves more points are added to the reconstructedresult. When using SLS the same occurs when combining multiple scans. Although thenumber of points and number of faces of the high-end 3D scanner surpass those of the3DUNDERWORLD-SLS it is evident that the 3DUNDERWORLD-SLS outperforms thecommercial solution in the level of detail captured and the amount of information capturedin a single scan.

12. Conclusion

Although the theory behind the SLS systems is well documented and understood, thereare still many issues one has to consider when developing or using SLS systems, which arecurrently lacking documentation. Many variants of SLS systems have already been proposedhowever, each one is tailored to a particular task. In this paper, we have presented all possiblelimitations, difficulties and solutions that one has to consider when involved in the design,development or use of SLS systems.

Furthermore, we have introduced the general-purpose, open-source 3DUNDERWORLD-SLS software and reported on the results of our extensive testing. Moreover, we have evalu-

22

ated the proposed system on four different evaluation metrics and compared it to a high-endcommercial 3D scanner. The produced results are of considerable high-fidelity and thesystem is currently being used as a documentation device in archaeological sites.

23

(a)

(b)

Figure 19: The same part of an object reconstructed with (a) ZScanner 700CX configured to the highest resolution andscanned from multiple directions and (b) the proposed system using a single scan. Both meshes were hole-filled usingthe same settings and are presented with the same Gouraud shading.

24

Acknowledgements

3DUNDERWORLD-SLS v3.2 and prior versions, were supported by EC FP7 Marie CurieIRG-268256 - ”Rapid Scanning and Automatic 3D Reconstruction of Underwater Sites” -3DUNDERWORLD http://www.3dunderworld.org.

3DUNDERWORLD-SLS v4.0, was supported by Concordia University Strategic HireGrant VH0003 - ”Immersive and Interactive Visualizations of Realistic Virtual Environ-ments”.

The software and other documentation is publicaly available and can be downloadedfrom the ICT lab’s Github account.

13. APPENDIX: Implementation Details

Structured-light scanning involves the acquisition and processing of a large size of data.This, coupled with the high computational complexity required to compute the 3D positions,real-time implementation is not yet feasible with these type of techniques. However, ourimplementation offers near real-time reconstructions primarily due to the following:

• A compact yet thread safe binary array to store the masks and decoded patterns.

• Fast bit operations

• Fast GPU implementation

13.1. CPU Implementation Details

The CPU implementation of the SLS algorithm involves the following steps:

for each camera:

generate Mask

for each pixel which passes the mask test in camera:

decode pixel and assign it into buckets

for each bucket:

find corresponding pixels in both cameras

undistort ray and find intersection points

calculate intersection point as reconstructed point

13.1.1. Dynamic bit array

There are two versions of Dynamic bit arrays, one for the CPU implementation and an-other for the GPU implementation. They follow same design theory – Compact arrangementof binary bits.

8 bits of each byte are fully used to represent a mask or a pattern. Within a byte, thebit is arranged from lower to higher as shown below,

High 10011011︸︷︷︸1 byte

Low

25

All of the bits are stored in a array of unsigned chars vector<uchar>. To operate onthe bit array, we first locate the byte where the target bit is located, and use bit operationson the char.

CPU Dynamic Bitset.h. This bit set class encapsulates only one bit set in each object. Sincewe decode patterns once at a time on CPU. Bit sets can be thrown away when they havebeen encoded into an integer.

13.2. GPU Implementation Details

The GPU implementation of the SLS algorithm involves the following steps, includingcertain kernel calls:

for each camera:

compute masks and thresholds

genPatternArray

buildBucket

for each bucket:

getPointCloud2Cam

• genPatternArray – This kernel decodes patterns into binary arrays. Each tread goseover the same pixel in the camera images, and generate a bit set array.

• buildBucket – This kernel inserts pixels into the buckets by their decoded pattern. Eachthread decodes a camera pixel pattern and insert it into the corresponding bucket.

• getPointCloud2Cam – This kernel takes two buckets of cameras to reconstruct image.Each thread fetches pixels in the same bucket, then undistort and reconstruct thepoint.

As most of the steps are similar to the CPU implementation, the following sections willfocus on the improvement on the arrangement of patterns using dynamic bit set on GPUand the parallel insertion of buckets.

13.2.1. GPU Dynamic Bitset

The GPU implementation requires high concurrency and small device-host communica-tion. In this case, we put the patterns for all pixels in a single bit array.

unsigned char* bits;

size_t BITS_PER_BYTE;

size_t numElem;

size_t bitsPerElem;

26

Figure 20: Inserting elements to the same bucket may trigger race condition.

Note that the number of bits are aligned to integer times of BITS PER BYTE. Similar tothe bit sets on CPU, each thread will operate on the pattern of each pixel. Thus, there willbe no more than two threads operated on the same memory, and race conditions are thusavoided.

thread0︷︸︸︷00110010︸︷︷︸

byte

01110010︸︷︷︸byte︸︷︷︸

pattern for one pixel

· · · 00110010︸︷︷︸byte

01110010︸︷︷︸byte

︸︷︷︸patterns for all pixels

For the example shown above, the patterns take 16 bits, i.e. 2 bytes, thread0 will onlyoperate on the pattern for one pixel. Therefore, no threads will operate on the same bytesat the same time.

13.2.2. Bit operations on Graycode Generation

For a binary array encoded in Gray code we use the following snippet to convert it todecimal which allows us to find the location of the projector’s pixels.

for (unsigned b i t=1U<<31; b i t >1; b i t>>=1) {i f ( xDec & b i t ) xDec ˆ= b i t >> 1 ;

}

13.2.3. GPU Generate Buckets

Parallel implementation of buckets is a challenging task. As we insert camera pixels intobuckets indexed by the projector’s pixels with one thread processing one camera pixel, it isimpossible to put the elements one after another since all threads are running parallel andone thread cannot determine if the insertion position is going to be used by another thread.

In the Figure 20, the race condition occurs when thread x and thread y both want toinsert into bucket 0. The solution is to pre-allocate the maximum possible memory for eachbucket and use an atomically increasing index to indicate where to store the next elementin the buckets, as depicted in Figure21.

The atomic operation will only be triggered when two threads are trying to insert to thesame bucket, thus the serialization is low and the impact on performance is not obvious.

27

Figure 21: Each bucket will maintain an atomicly increasing index indicates where to insertnext element within the bucket.

13.3. Benchmark

A comparison between the CPU implementation of v4 and the previous versions showa speed increase of at least two orders of magnitude. The configuration of the machine isgiven in the following table

CPU Intel i7 4771kMemory 16GB

GPU nVidia GTX 770vMemory 2GB

A comparison between the GPU and CPU of v4 on the same configuration with theAlexander dataset is shown in the following table.

GPU 12.940sCPU 29.67s

13.4. Code Structure

The latest version of the 3DUNDERWORLD-SLS (v4) was designed with extensibilityand maintainability in mind. 3DUNDERWORLD-SLS is itself a library that can be easilyintegrated into other applications using CMake building system. As it is shown below,lib folder contains all of the implementation of library and app folder contains some testapplications using the libraries as references.

src

app

App_Calib.cpp

App.cpp

App_CUDA.cu

App_Graycode.cpp

CMakeLists.txt

lib

calibration

28

Calibrator.cpp

Calibrator.hpp

CMakeLists.txt

core

Camera.cpp

Camera.h

CMakeLists.txt

Dynamic_Bitset.cpp

Dynamic_Bitset.h

fileReader.cpp

fileReader.h

log.cpp

log.hpp

Projector.h

Ray.h

Reconstructor.cpp

ReconstructorCPU.cpp

ReconstructorCPU.h

Reconstructor.h

GrayCode

CMakeLists.txt

GrayCode.cpp

GrayCode.hpp

ReconstructorCUDA

CMakeLists.txt

CUDA_Error.cuh

Dynamic_bits.cu

Dynamic_bits.cuh

fileReaderCUDA.cu

fileReaderCUDA.cuh

ReconstructorCUDA.cu

ReconstructorCUDA.cuh

13.4.1. core

As the name indicates, core library contains the base classes of reconstructor, camera, andpatterns. Please note that a CPU reconstructor also locates here since all of the machinesshould be able to run it.

13.4.2. ReconstructorCUDA

ReconstructorCUDA class is an implementation of reconstruction on GPU that inheritedfrom reconstructor in the core. A GPU version of Dynamic bits and fileReader are alsoimplemented to be compatible with device code on GPU.

29

13.4.3. GrayCode

This library is used to generate gray code pattern. Only a small change has bee appliedfrom the original code.

13.4.4. Calibration

Camera Calibration with OpenCV.

References

30

arxiv:1406.6595v2 [cs.cv] 20 aug 2016 · preprint submitted to arxiv august 23, 2016...

Documents