
D6.1: 3D Media Tools – Report – Version A

Project ref. no. FP7-ICT-610691

Project acronym BRIDGET

Start date of project (duration) 2013-11-01 (36 months)

Document due Date: 2015-04-30

Actual date of delivery 2015-04-30

Leader of this document Ingo Feldmann

Reply to

Document status Public deliverable


Deliverable Identification Sheet

Project ref. no. FP7-ICT-610691

Project acronym BRIDGET

Project full title BRIDging the Gap for Enhanced broadcasT

Document name D6.2_final.docx

Security (distribution level) PU

Contractual date of delivery (none)

Actual date of delivery 2015-04-30

Document number

Type Public deliverable

Status & version v2.8

Number of pages 43

WP / Task responsible WP6

Other contributors Nicola Piotto, Daniel Berjón Díez, Rafael Pagés Scasso, Sergio García Lobo, Francisco Morán Burgos, Giovanni Cordara, Milos Markovic, Sascha Ebel, Wolfgang Waizenegger

Author(s) Ingo Feldmann

Project Officer Alberto Rabbachin

Abstract This document summarizes the research achievements of WP6 for the tasks of off-line extraction of 3D audio-visual content (T6.1) as well as the free viewpoint scene reconstruction and rendering (T6.3). T6.2 and T6.4 are not mentioned because the former, which deals with 3D media coding, was foreseen to start at the end of the first year, and the latter, which focuses on standardisation (in the context of WP6), will have its results reported in D9.3 at the end of the project. A preliminary version of this report was submitted in January 2015 in order to support the first year project review. The current document extends this interim version. Changes are listed in the document version overview table.

Keywords 3D tools, media analysis, 3D reconstruction

Sent to peer reviewer 2015-04-20

Peer review completed 2015-04-23


Circulated to partners 2015-04-23

Read by partners 2015-04-29

Mgt. Board approval 2015-04-30

Version Date Reason of change

0.1 2014-10-27 Ingo Feldmann – Initial draft

0.2 2014-12-12 Nicola Piotto – Updates of Sections 2.1, 3.1, 4.1 and 4.4

0.3 2014-12-12 Daniel Berjón Díez, Rafael Pagés Scasso, Sergio García Lobo – Updates of Sections 3.2, 3.3 and 4.2

1.0 2014-12-17 Ingo Feldmann – Integration and first review

1.1 2014-12-19 Nicola Piotto – Additions to Sections 3.1 and 4.1

1.2 2014-12-20 Francisco Morán Burgos – Second review

1.3 2015-01-14 Nicola Piotto – Final review

1.4 2015-01-14 Submission of interim report

2.0 2015-04-14 Ingo Feldmann – Extension to full deliverable, draft

2.1 2015-04-16 Sergio García Lobo, Rafael Pagés Scasso, Francisco Morán Burgos – Updates of Sections 3, 4.2, 5 and 6

2.2 2015-04-16 Nicola Piotto, Updates of Section 4.5, added Section 4.6

2.3 2015-04-16 Ingo Feldmann, Sascha Ebel, Update of Section 3.4

2.4 2015-04-16 Merged version, distributed to partners

2.5 2015-04-17 Incorporated comments

2.6 2015-04-20 ready for QAT version

2.6 r 2015-04-23 Review comments: from page 7 onwards the page footer reads "InterimReport_D6.1i_v1.3.docx"; to be changed to "D6.1: 3D Media Tools – Report – Version A"

2.7 2015-04-28 incorporating all QAT comments

2.8 2015-04-29 Ready for submission


Table of Contents

1 Executive summary
2 Introduction
3 Algorithmic workflow
4 3D Scene Off-line Modelling
  4.1 Point cloud-based 3D model generation
  4.2 Alternative feature detectors for detailed point clouds
    4.2.1 Issues encountered with A-KAZE features
  4.3 Hybrid 3D model approach
    4.3.1 Point cloud filtering, segmenting and surface reconstruction
    4.3.2 Splat creation for remaining isolated points
  4.4 Dense depth-based 3D surface patch structure reconstruction
  4.5 Surface texture refinement based on geometric primitives
5 3D Scene On-line Reconstruction
  5.1 Video to point cloud registration
  5.2 Online Structure-from-Motion Model Extension with Generic Video Streams
  5.3 3D point cloud rendering
  5.4 View dependent rendering of hybrid 3D scene representations
  5.5 Static 3D audio scene rendering
  5.6 Dynamic 3D audio scene rendering – multi view support
6 Conclusion
7 References


Table of Figures

Figure 1: Envisioned algorithmic workflow for 3D reconstruction
Figure 2: Structure from Motion pipeline
Figure 3: 3D point cloud and camera locations for Palazzo Madama (left) and Conte Verde Statue (right) in Turin
Figure 4: Sparse point cloud reconstruction using SIFT (top) and A-KAZE (bottom) feature descriptors
Figure 5: SIFT (left) manages shaded areas much better than A-KAZE (right)
Figure 6: Previous enhancement of the source images using CLAHE dramatically improves A-KAZE results
Figure 7: Point cloud before (left) and after (right) statistical outlier removal
Figure 8: Some segmented planes from the filtered point cloud
Figure 9: 3D meshes obtained after the surface reconstruction stage (left: ~250k vertices and ~500k triangles) and the decimation step (right: ~25k vertices and ~50k triangles)
Figure 10: Resulting 3D mesh after the multi-texturing stage
Figure 11: Traditional 3D point cloud (left) and discretization of the model into cubic voxels (right)
Figure 12: Close-up of the previous model showing the neighbours and tangential vectors of a given point (left) and all the tangential vectors of a portion of the model (right)
Figure 13: Full splat-based 3D model (top row) and close-up (bottom row) using the original point locations (left) and the averaged splat locations (right)
Figure 14: Splat-based 3D model coloured with the original values (left) and a weighted average (right)
Figure 15: Base (left) and refined splat model (right)
Figure 16: General algorithmic workflow for dense surface structure estimation
Figure 17: Grouping of cameras into trifocal sub-systems (here three, marked in green, blue and yellow) based on a larger set of available cameras. Unused cameras are shown in grey
Figure 18: Detailed workflow for dense 3D surface structure estimation
Figure 19: Principle of Patch Sweep based surface estimation
Figure 20: left) original sample image from "Salzufer" image data set; right) estimated depth map
Figure 21: left) Dense depth based 3D point cloud; right) polygonal 3D model after meshing
Figure 22: left) Resulting rendered 3D model after 3D point cloud fusion and hole filling; right) surface support representation for two trifocal depth input data sets (yellow and red)
Figure 23: "Villa la Tesoriera" data set, left) reconstructed polygonal 3D model, right) surface support representation for two trifocal depth input data sets (yellow and red)
Figure 24: Reconstructed 3D model for the "Villa la Tesoriera" data set
Figure 25: Algorithmic workflow for surface texture refinement
Figure 26: left) Sample image of original data set, the registered surface is marked in yellow; right) surface plane after normalization
Figure 27: left) Reconstructed 3D point cloud representation of the building in Figure 26; right) planar surface approximation based on a sparse 3D point cloud
Figure 28: Normalized surface planes of two input images; right) difference between the images
Figure 29: Results for texture refinement based on the area depicted in Figure 26: left) original view; right) enhanced view with less sampling artifacts
Figure 31: Video to point cloud integration pipeline
Figure 31: Base model (a) and base model registered with 5 video streams (b). The samples of different videos are coded in different colors
Figure 32: Online video to point cloud integration pipeline
Figure 33: Close-up of a detail in the base model (left), updated model with Bundler (center), and the method of Section 5.2
Figure 34: Schematic representation of the vertex shader inputs (in black) and outputs (in red)
Figure 35: Schematic representation of the fragment shader output
Figure 36: Splat-based 3D model rendered without (left) and with (right) alpha-blending
Figure 37: Rendered 3D hybrid 3D point cloud model including a refined surface texture, see Figure 29 for an enlarged view of the area marked in red
Figure 38: Efficient patch group based re-meshing procedure
Figure 39: left) Meshing of surface patch segments as a result of 3D point cloud fusion; right) enlargement
Figure 40: Viewpoint dependent 3D model representation rendered from different perspectives
Figure 41: User interface for spatial audio rendering
Figure 42: User interface for spatial audio rendering of a dynamic scene


1 Executive summary

This document summarizes the research achievements of WP6 for the tasks of off-line extraction of 3D audio-visual content (T6.1) as well as the free viewpoint scene reconstruction and rendering (T6.3). T6.2 and T6.4 are not mentioned because the former, which deals with 3D media coding, was foreseen to start at the end of the first year, and the latter, which focuses on standardisation (in the context of WP6), will have its results reported in D9.3 at the end of the project. A preliminary version of this report was submitted in January 2015 in order to support the first year project review. The current document extends this interim version. Changes are listed in the document version overview table.


2 Introduction

The objective of WP6 in the current reporting period was to develop tools for multi-view 3D A/V media generation at the service provider's side, including the possibility of 3D reconstruction updates and refinement at the end user's side. In this context, on the one hand, the off-line extraction (and generation, if needed), at the service provider's Authoring Tool (AT), of 3D models from broadcast and associated Internet 2D/3D video, and of Computer Graphics (CG) data, needed to be addressed. On the other hand, free viewpoint scene representation at the user's player had to be developed, either from a set of pre-defined viewpoints from which the scene was originally recorded, or through free navigation around the scene.

At the broadcaster's side, the professional AT enables the content creator to run a semi-automatic extraction of 3D scene information from archived (and, typically, previously broadcast) 2D/3D video content (T6.1). Furthermore, associated content from the Internet, including images, videos and CG data, can be used, if available, to complete the 3D scene information. 3D audio source localisation methods needed to be supported for reconstruction purposes. Optionally, the generation of additional scene information and augmentation content needed to be applied to refine and complement the extracted 3D models.

Based on this, starting in year 2 of the project, solutions for efficient encoding and decoding of 3D media will be evaluated and adapted in order to achieve efficient transmission from the service provider's AT to the end user's device (T6.2).

At the user's side, WP6 targeted in the given reporting period the projection and rendering of 3D scene data according to the user-selected viewpoint from a set of original viewpoints (T6.3). In this context, 3D audio was to be adapted and rendered according to the chosen viewpoint. To the maximum possible extent, additional views for free scene navigation around the original viewpoints could be provided.

The document is structured as follows. First, a general overview of the algorithmic workflow is given which links the functionality of the developed tools. Afterwards, the task of 3D scene off-line modelling, which refers to T6.1 of WP6, is addressed and several 3D reconstruction approaches are presented. Finally, the task of 3D scene on-line reconstruction, which refers to T6.3 of WP6, is discussed. Please note that T6.2 and T6.4 are not explicitly mentioned in the document because the former, which deals with 3D media coding, was foreseen to start at the end of the first year, and the latter, which focuses on standardisation (in the context of WP6), will have its results reported in D9.3 at the end of the project.

3 Algorithmic workflow

One of the main objectives of WP6 is the generation of 3D models from unordered visual and acoustical cues of a scene. In order to generate an accurate model for 3D scene representation, eventually to be used for discrete or continuous free-viewpoint navigation, a number of state-of-the-art technologies have been investigated in depth, both for the video and the audio domain. Concerning the generation of visual 3D models of a scene, the idea is to rely on Structure from Motion (SfM) technology, which provides sparse and typically noisy point clouds, and to apply specific surface modelling in order to obtain a dense, accurate model that combines different 3D representation types for optimized 3D rendering.

Figure 1 shows the envisioned general algorithmic workflow for visual 3D model generation in work package WP6. Audio integration and rendering are not incorporated in the figure as they are more straightforward. Figure 1 illustrates the vision of WP6 on how to combine the currently independent and standalone sub-modules for 3D reconstruction in a meaningful and efficient way, and thus summarizes the envisioned interaction of the algorithms presented in chapter 4. It shows that all developed tools and methods are aimed to be part of a more general and powerful 3D reconstruction processing chain.

The idea of WP6 at this stage is to combine the results of different algorithmic solutions for different types of 3D scene reconstruction tasks. As shown in the figure, the point cloud-based 3D model generation (see section 4.1) is the starting point of the proposed 3D reconstruction chain, incorporating a subsequent refinement step (see section 4.2). Based on this, a simple geometric-entity-based 3D modelling is carried out. For example, 3D planes can be fitted to the 3D point cloud in order to create simple 3D models (see section 4.5). More complex point-based surface reconstruction methods, as proposed in section 4.3.1.G, can be applied here too. More complex 3D surface structures can be modelled with the depth-based 3D surface patch reconstruction approach presented in section 4.4. Finally, structures with very complex surface properties can be reconstructed using the splat-based 3D surface modelling method (see section 4.3). Please note that the refinement of this concept is still a matter of ongoing research within WP6. A more detailed version of this workflow is planned to be developed at a later stage of WP6.

Figure 1: Envisioned algorithmic workflow for 3D reconstruction. [Diagram blocks: point cloud-based 3D model generation; 3D point cloud refinement; depth-based 3D surface patch reconstruction; geometric entity-based 3D modelling; splat-based 3D surface modelling; 3D data fusion. Inputs: images, video, camera calibration data. Intermediate data: sparse & noisy 3D point cloud, dense & clean 3D point cloud, 3D surface patches, splats. Output: hybrid 3D data representation.]

Audio modelling follows two different strategies. In the first one, sound sources of interest are recorded with "spot" microphones placed close to the source. This approach provides dry audio signals which are ready to be used as input for binaural synthesis in the audio engine (see D7.1i). The second strategy is to use a spherical microphone array with 32 channels in order to capture the full audio scene. The material is then processed to obtain binaural signals by down-mixing the original 32-channel recordings. Both strategies represent the acquisition of sound sources or of an audio scene and are not modelling in a strict sense. Therefore, the main process of audio reconstruction is given in sections 5.5 and 5.6.

4 3D Scene Off-line Modelling

4.1 Point cloud-based 3D model generation

In this Section the generation of sparse 3D models from unordered image sets is detailed. As mentioned above, the main building block for the generation of the scene 3D model is an SfM engine. Some of the most promising state-of-the-art approaches have been compared (e.g., Bundler, VisualSFM), their performance evaluated in terms of computational complexity and model accuracy, and a reference solution has been implemented, allowing robust 3D scene modelling from unordered sets of images. The models generated by our SfM framework provide a scene representation including:


• A (sparse) set of 3D points representing the structure of the scene;
• A set of calibrated cameras locating in 3D space the images used in the reconstruction.

Each point in the sparse cloud is described by its 3D location coordinates and a colour. Each camera instead consists of an image and its location and orientation in 3D space, given as a rotation matrix and a translation vector. The Structure from Motion pipeline (Figure 2) is organized sequentially, comprising feature extraction, exhaustive image cross-matching, estimation of the epipolar constraints, and a final non-linear optimization for parameter refinement. The first step is to detect relevant feature points in the images. Given the potential heterogeneity of viewpoints in the starting image set, the adopted feature representation must exhibit good invariance with respect to image transformations, allowing a robust matching of images of the same scene taken from significantly different locations. Given its characteristics and its consolidated robustness, we adopted SIFT; however, other detectors/descriptors have been considered to achieve denser (but less robust) matches (see Section 4.2.1).

Figure 2: Structure from Motion pipeline. [Diagram blocks: Image Collection → SIFT feature extraction → Exhaustive image cross-matching → Epipolar geometry recovery → Non-linear optimization (bundle adjustment) → Sparse 3D point cloud and camera locations.]

In the next phase, feature matching is carried out for every pair of images in the starting collection. Given two images A and B, a kd-tree is built from the feature descriptors of B, and for each feature in image A the Approximate Nearest Neighbour (ANN) is found. Although SIFT descriptors have proved their robustness in many application domains, wrong associations between keypoints (i.e., wrong matches) will appear, significantly affecting the accuracy of the generated model. Because of this, three independent outlier removal phases are provided:



• a first preliminary filtering relying on the 2-NN ratio test [10] ;
• a second removal based on the fundamental matrix (F-matrix);
• a more sophisticated strategy based on geometric constraints [8] .

While the ratio test is a standard procedure to determine the goodness of a match, the procedure described in [8] represents an innovative solution that brings significant advantages in the SfM framework. The idea is to build a model describing the geometric relationship of corresponding features in the two images (Log-Distance-Ratio, LDR), allowing statistically irrelevant pairs to be rejected. The LDR concept is further enhanced by including in the model the uncertainty of the scale-space feature location, allowing an extremely robust detection of inlier matches. As detailed in [22] , in the presence of scale-invariant features such as SIFT or SURF, the keypoint extraction phase cannot be considered a deterministic process identifying interest points with uniform and negligible localization error. Instead, the impact of the location uncertainty is shown to be significant, in particular when dealing with homography estimation or bundle adjustment. In light of these results, the native computation of the LDR has been updated to include the uncertainty of the keypoint locations: in particular, the L2 norm has been replaced by the Mahalanobis distance, and each keypoint is augmented with a [2x2] covariance matrix encoding the location uncertainty along the x and y axes. As proposed in [22] , this covariance can be obtained by inverting the Hessian matrix associated with each keypoint in the detection phase.

After finding consistent corresponding samples, matching keypoints in multiple images are organized in tracks. The idea is then to estimate the camera parameters and the 3D location of each track that minimize the reprojection error (i.e., bundle adjustment). The minimization problem is formulated as a non-linear minimization and solved with state-of-the-art algorithms (e.g., Levenberg-Marquardt). In order to avoid getting stuck in local minima, instead of trying to estimate all the unknowns at once, an incremental paradigm is adopted, starting from a candidate image pair and adding single images sequentially. To further speed up the optimization, the framework relies on a state-of-the-art multicore implementation of bundle adjustment [23] that allows full exploitation of the hardware capabilities while providing a significant boost in performance. When all the images have been incrementally added to the optimization problem, the 3D point cloud and the camera parameters for each image are provided. Figure 3 shows results for Palazzo Madama and the Conte Verde statue in Turin.

Figure 3: 3D point cloud and camera locations for Palazzo Madama (left) and Conte Verde Statue (right) in Turin
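To make the first two outlier-removal stages concrete, the sketch below shows pairwise SIFT matching with the 2-NN ratio test followed by a RANSAC fundamental-matrix check, using OpenCV's Python bindings. It is only an illustration of the principle under assumed default thresholds and placeholder file names; the kd-tree/ANN matcher and the LDR-based filtering of [8] used in the actual framework are not reproduced here.

```python
# Sketch: pairwise SIFT matching with 2-NN ratio test and F-matrix filtering.
# File names and thresholds are placeholders, not the project's settings.
import cv2
import numpy as np

img_a = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

# 2-NN search followed by the ratio test (a brute-force matcher stands in for
# the kd-tree/ANN search used in the framework).
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des_a, des_b, k=2)
good = [m for m, n in knn if m.distance < 0.8 * n.distance]

# Second outlier-removal stage: epipolar consistency via a RANSAC F-matrix fit.
pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.999)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
print(len(good), "ratio-test matches,", len(inliers), "epipolar inliers")
```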

4.2 Alternative feature detectors for detailed point clouds

The usual workflow for obtaining a point cloud reconstruction from a set of images begins with the detection of feature points in all input images. Then, feature matches across all relevant pairs of images must be determined (i.e., which feature points in several images depict the same location in the original object), so that when all the geometric constraints of the later stages of the reconstruction process are applied, the location of the original point relative to the cameras that saw it can be inferred.

There are numerous algorithms for feature detection in the literature, each adapted to different needs; in the case of 3D reconstruction, it is desirable that the detected features are reasonably invariant to changes of scale, rotation and perspective distortion (none actually succeeds at the latter, hence the need for a dense image coverage of the subject to be reconstructed). It is also necessary that the key point associated with the detected feature is located precisely at the projection of the point that originated the image feature; in other words, feature detectors for this task are necessarily very local, whereas in other higher-level tasks other feature detectors may be more desirable. Arguably the most widely used feature detector for many tasks, including 3D reconstruction, is the Scale-Invariant Feature Transform (SIFT) [10] . Indeed, it is the built-in option for the well-known state-of-the-art SfM frameworks Bundler [11] and VisualSfM [12] , despite being a patent-encumbered algorithm. SIFT features are robust, leading to reliable camera position estimations, but not very numerous; therefore the reconstructions thus obtained frequently present the shortcoming of being too sparse for a subsequent detailed surface reconstruction.

Figure 4: Sparse point cloud reconstruction using SIFT (top) and A-KAZE (bottom) feature descriptors

One possible solution is to use a patch-based point cloud densification algorithm based on photoconsistency checks, such as PMVS or CMVS [13] , but these algorithms are very slow and frequently introduce wrong patches in the scene due to photoconsistency among non-homologous areas (notably the sky). Instead, we have chosen to experiment with Accelerated KAZE (A-KAZE) [14] , a novel feature detector that provides a much denser coverage of the images with feature points. This in turn results in a denser point cloud but, since all the feature points must satisfy the geometric restrictions of epipolar geometry during the SfM process rather than an arguably weaker photoconsistency check, the number of outliers remains low.
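A minimal sketch of swapping the detector for A-KAZE, assuming OpenCV's built-in implementation; since A-KAZE produces binary MLDB descriptors, matching uses the Hamming norm instead of L2.

```python
# Sketch: A-KAZE detection/description with OpenCV. The file name is a
# placeholder; the detector threshold can be lowered for denser output.
import cv2

img = cv2.imread("facade.jpg", cv2.IMREAD_GRAYSCALE)

akaze = cv2.AKAZE_create()
kp, des = akaze.detectAndCompute(img, None)
print(len(kp), "A-KAZE keypoints")

# Binary descriptors call for Hamming distance; use with knnMatch + ratio test
# exactly as in the SIFT sketch above.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
```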

4.2.1 Issues encountered with A-KAZE features

While point clouds generated using A-KAZE are indeed denser than SIFT-derived ones, we have encountered a number of issues that required further work to get them to work properly and in a practical way. In this Section we will enumerate the issues and the solutions we have found or are working on.

A. Processing speed

Whereas there are some GPU-accelerated implementations of SIFT feature extractors [15] , A-KAZE feature extraction is quite slow at the moment. We have successfully combined it with the OpenCV library [16] to leverage its GPU-accelerated feature matching routines, significantly cutting down on processing time. Unfortunately, feature matching is an intrinsically time-consuming problem because, in the absence of any information that narrows it down, all possible pairs of images must be checked, which is O(N²), and within each pair of images, all possible correspondences between their feature points must be tried to find the best matches (again O(N²)). However, in a data set of photographs of an actual object there are obviously many pairs of pictures that do not see the same part of the object and therefore do not need their feature matching computed.

Therefore, our first aim is to determine which pairs of images actually overlap. In order to do that, we subsample all the images to save processing time and run a full point cloud reconstruction process: feature detection (SIFT at the moment, since there is a GPU-based implementation available), exhaustive matching and SfM. Since the images have been subsampled, the resulting point cloud is necessarily very sparse, but we obtain a good estimation of the camera poses, along with the information of which cameras contributed each point of the cloud. Then, we only need to compute feature detection on the full-resolution images and feature matching only among the pairs of cameras that contributed points to the low-resolution point cloud.
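The pair pre-selection described above can be sketched as follows; the visibility map (which cameras contributed each low-resolution point) is assumed to be available from the SfM output, and its layout here is hypothetical.

```python
# Sketch: derive candidate full-resolution image pairs from a low-resolution
# reconstruction. `visibility` maps each sparse 3D point id to the set of
# camera (image) indices that contributed it -- a hypothetical layout.
from itertools import combinations
from collections import Counter

def candidate_pairs(visibility, min_shared_points=20):
    counter = Counter()
    for cams in visibility.values():
        for pair in combinations(sorted(cams), 2):
            counter[pair] += 1
    # Keep only camera pairs that co-observe enough sparse points.
    return [pair for pair, n in counter.items() if n >= min_shared_points]

# Tiny example: three sparse points seen by various cameras.
vis = {0: {1, 2, 3}, 1: {2, 3}, 2: {3, 7}}
print(candidate_pairs(vis, min_shared_points=1))
```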

B. Low sensitivity in shadows

In the data sets we have tried, we have observed that the A-KAZE feature detector seems to perform worse than SIFT when the scene is badly lit, or in the shaded areas of objects that feature both sunlit and shaded areas.

Figure 5: SIFT (left) manages shaded areas much better than A-KAZE (right)

Therefore, we have been evaluating image enhancement algorithms that increase the contrast in all zones of the image in order to alleviate this problem. We have hitherto evaluated two such algorithms: Contrast-Limited Adaptive Histogram Equalization (CLAHE) [17] and Multi-Scale Retinex [18] . Both algorithms perform well and recover detail from the shadows, resulting in increased feature detection in the shaded areas.
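As an illustration, the following sketch applies CLAHE to the lightness channel of an image before feature detection, assuming OpenCV; the clip limit and tile size are typical defaults rather than tuned project settings, and the file name is a placeholder.

```python
# Sketch: contrast enhancement with CLAHE before feature detection.
# CLAHE is applied to the L channel of the Lab colour space so hue is preserved.
import cv2

img = cv2.imread("shaded_facade.jpg")
lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
l, a, b = cv2.split(lab)

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
l_eq = clahe.apply(l)

enhanced = cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
cv2.imwrite("shaded_facade_clahe.jpg", enhanced)
```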


Figure 6: Previous enhancement of the source images using CLAHE dramatically improves A-KAZE results

C. Robustness issues

We are still battling some issues with the reliability of the feature matches provided by A-KAZE under certain conditions. When the perspective distortion is too severe and/or the subject exhibits repetitive patterns, we encounter zones of the object where A-KAZE does not improve upon SIFT or where the camera pose estimation is wrong. We are working on combining information from the low-resolution SIFT-based reconstruction to improve the final reconstruction:
• adaptation of A-KAZE descriptors yielding denser point clouds than SIFT;
• automatic point cloud noise filtering.

4.3 Hybrid 3D model approach

In order to visually enhance the results attained with traditional 3D models, especially those generated through a reconstruction pipeline like the one described above, a novel hybrid 3D approach is proposed within BRIDGET. This concept involves a mixture of both mesh-based and point-based regions within the complete model, depending on their characteristics, in order to globally improve user perception. Those areas of the model made up of planar surfaces, which can therefore be easily meshed, are rendered through a classic polygon-mesh 3D model, whereas regions with complex geometry remain as isolated points and, as described below, are rendered as splats. Advances in this topic during the first half of the project's life include the usage of the reconstructed point cloud to generate a polygon mesh through automatic plane segmentation, and the computation of the splats that best approximate the remaining model surfaces. Both processes are detailed below, whereas research is currently being carried out on the combination of both types of approaches into a single 3D model, which will be addressed during the second half of the project's life.

4.3.1 Point cloud filtering, segmenting and surface reconstruction

The purpose of this process is to obtain a 3D mesh which accurately represents the building to be reconstructed while reducing the amount of information provided by the dense point cloud obtained in the previous stages. The 3D point cloud meshing is based on a Poisson surface reconstruction stage. This algorithm uses the point normals to orient the resulting surface properly, so some pre-processing of the point cloud is needed to obtain good results.


D. Statistical outlier removal

The resulting point cloud is dense and accurate in areas close to the building that is going to be reconstructed. However, there are noisy points around the building which do not resemble any particular shape. This is due to two reasons: firstly, there are matches in areas that are not relevant for the reconstruction; and, secondly, there are wrong matches in areas such as the sky which, again, are not relevant for us.

Figure 7: Point cloud before (left) and after (right) statistical outlier removal

To remove these noisy points, we perform a statistical outlier removal process based on the analysis of the distribution of point-to-neighbour distances in the original point cloud: for each point, we calculate the mean distance from it to all its neighbours. We assume that these distances follow a Gaussian distribution, so points outside the interval defined by the global mean and standard deviation of these distances are removed from the point cloud. Figure 7 shows the difference between the original point cloud and the filtered one.
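A sketch of this filter, assuming NumPy/SciPy; the neighbourhood size and the standard-deviation multiplier are illustrative parameters, not the values used in the pipeline.

```python
# Sketch: statistical outlier removal. Points whose mean distance to their
# k nearest neighbours exceeds (global mean + std_ratio * global std) are dropped.
import numpy as np
from scipy.spatial import cKDTree

def remove_statistical_outliers(points, k=20, std_ratio=2.0):
    tree = cKDTree(points)
    # k + 1 because the nearest neighbour of each point is the point itself.
    dists, _ = tree.query(points, k=k + 1)
    mean_dist = dists[:, 1:].mean(axis=1)
    threshold = mean_dist.mean() + std_ratio * mean_dist.std()
    keep = mean_dist < threshold
    return points[keep], keep

pts = np.random.rand(1000, 3)                 # placeholder point cloud
filtered, mask = remove_statistical_outliers(pts)
print(len(pts) - mask.sum(), "points removed")
```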

E. Planar segmentation

Figure 8: Some segmented planes from the filtered point cloud

As we are reconstructing buildings, we can assume that they will be mainly composed of planes (or pseudo-planes which include more than just a completely flat surface). Because of this, a planar segmentation stage is performed. This approach is based on an iterative RANSAC model which discards outliers in the current iteration but considers them again in the next one. The result of this stage is a set of point clouds that correspond to different planes of the model. The final outliers are not completely discarded but will not be part of the final mesh; instead, they will be represented as splats in our hybrid model. Figure 8 shows some of the resulting planes.
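The iterative RANSAC plane segmentation can be sketched as follows (NumPy only); thresholds, iteration counts and minimum segment sizes are illustrative placeholders, not the values used in the actual pipeline.

```python
# Sketch: iterative RANSAC plane segmentation. The dominant plane is fitted,
# its inliers are extracted as one segment, and the search restarts on the rest.
import numpy as np

def ransac_plane(points, n_iters=500, threshold=0.02, rng=np.random.default_rng(0)):
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                              # degenerate (collinear) sample
        normal /= norm
        dist = np.abs((points - p0) @ normal)
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

def segment_planes(points, max_planes=6, min_inliers=500):
    segments, remaining = [], points
    for _ in range(max_planes):
        inliers = ransac_plane(remaining)
        if inliers.sum() < min_inliers:
            break
        segments.append(remaining[inliers])
        remaining = remaining[~inliers]           # outliers are reconsidered next round
    return segments, remaining                    # `remaining` later becomes splats
```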

F. Normal estimation and correction

Once we have segmented our original point cloud into different planes, we compute the normal of every point of the cloud. For this, we analyse a small neighbourhood of each point and try to fit a plane using least squares; the normal of this plane will consequently be the normal of that point. This is done through a Principal Component Analysis (PCA) of a covariance matrix created from the neighbours of the point. Performing this in an already segmented planar region helps to speed up the process, since the region analysed to compute the normal value can be smaller. It also helps to preserve sharp edges in the final model, which the neighbourhood-based estimation otherwise tends to round slightly. Although the process works fast and correctly, there is still another issue: PCA cannot provide the correct orientation of the normal. This means that even if the direction of the normal is calculated correctly, we cannot be sure that its orientation is correct. To solve this, we search for the camera which "sees" each point from the best position. The dot product of the pointing vector of the camera and the computed normal of the point determines the sign of the latter vector.
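A sketch of the normal estimation and sign correction, assuming NumPy/SciPy and assuming that, for each point, the centre of the camera that best sees it is already known (the per-point `cam_centers` array below is a hypothetical input).

```python
# Sketch: per-point normal from a PCA of the local neighbourhood, with the sign
# disambiguated by the viewing direction of the best camera for that point.
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, cam_centers, k=12):
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    normals = np.empty_like(points)
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)
        eigval, eigvec = np.linalg.eigh(cov)    # eigenvalues in ascending order
        n = eigvec[:, 0]                        # smallest variance = plane normal
        # Flip the normal so it points towards the camera that sees the point.
        if np.dot(n, cam_centers[i] - points[i]) < 0:
            n = -n
        normals[i] = n
    return normals
```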

G. Surface reconstruction

As stated before, once the normals are correctly estimated, we can perform a Poisson surface reconstruction stage which provides a 3D mesh from the segmented point cloud. Sometimes there will be areas of very low point density in the point cloud (due to glossy surfaces, bad illumination conditions, etc.) which would produce a hole in the 3D mesh; however, as this surface reconstruction technique produces closed surfaces when possible, all these holes are correctly closed. On the other hand, areas that are not covered at all, such as roofs or hidden walls, will also be closed with a big round surface. These surfaces should be removed, as they should not be part of the final model. To do so, we analyse the distance from each vertex of the resulting 3D mesh to the point cloud used for the reconstruction. If a vertex of the mesh is very far from the point cloud, it is removed, together with all the triangles which include that vertex. This way, if there was a hole due to a window, it remains closed, but roofs and hidden walls are removed correctly. Figure 9 (left) shows the result of the surface reconstruction approach.
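The vertex-pruning step after the Poisson reconstruction can be sketched as follows; the distance threshold is an illustrative placeholder that depends on the scale of the model.

```python
# Sketch: remove mesh vertices that lie far from the input point cloud
# (e.g. hallucinated roofs or back walls), together with their triangles.
import numpy as np
from scipy.spatial import cKDTree

def prune_far_vertices(vertices, triangles, cloud, max_dist=0.05):
    dist, _ = cKDTree(cloud).query(vertices)
    keep = dist < max_dist                       # vertices supported by the cloud
    remap = np.full(len(vertices), -1, dtype=int)
    remap[keep] = np.arange(keep.sum())
    tri_keep = keep[triangles].all(axis=1)       # drop triangles touching removed vertices
    return vertices[keep], remap[triangles[tri_keep]]
```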


Figure 9: 3D meshes obtained after the surface reconstruction stage (left: ~250k vertices and ~500k triangles) and the decimation step (right: ~25k vertices and ~50k triangles)

H. Mesh decimation

The last step of the 3D meshing process is to reduce the number of triangles in the resulting model. The previous surface reconstruction stage normally produces models as dense as the point cloud we started with, so the models are too big to be stored or transmitted efficiently. We therefore need to reduce this amount of information while trying to preserve as many details of the model as possible. We use a quadric edge collapse decimation system, which analyses the local curvature at every vertex of the mesh and removes the vertices with the lowest values. This makes planar areas very simple while keeping edges and other shapes with enough detail. Figure 9 (right) shows the result of this stage.
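A minimal sketch of the decimation step, assuming Open3D's quadric edge-collapse simplification is used; the file names and the 10x reduction factor are placeholders, and this is not necessarily the tool used in the project.

```python
# Sketch: quadric edge-collapse decimation with Open3D.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("poisson_mesh.ply")
target = max(len(mesh.triangles) // 10, 1)       # e.g. ~500k -> ~50k triangles
decimated = mesh.simplify_quadric_decimation(target_number_of_triangles=target)
decimated.compute_vertex_normals()
o3d.io.write_triangle_mesh("decimated_mesh.ply", decimated)
```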

I. Mesh multi-texturing

Once the shape of the 3D mesh has been reconstructed and simplified, its appearance must be dealt with, and the input images used to obtain the original 3D point cloud may of course be used as well for texturing the final 3D mesh. We use a multi-texturing technique [27] which manages to drastically reduce the typical inaccuracies of this kind of texture atlases (such as seams or blurs) by blending colour contributions from different images. The global result is very realistic, as illustrated by Figure 10.


Figure 10: Resulting 3D mesh after the multi-texturing stage

4.3.2 Splat creation for remaining isolated points

Some objects or parts of the models are too irregular (e.g., they lack planar structures) for the previous mesh-based reconstruction. Consequently, alternative and more flexible modelling/visualization methods, such as those relying on point clouds, can be helpful. These approaches entail a higher degree of flexibility, but also major drawbacks that should be addressed: namely the appearance of large areas without information among scattered points, the adoption of non-natural and sharp-edged shapes for the rendering, and the inaccuracy of the point colours caused by the reconstruction process. A splat-based solution has therefore been explored in order to alleviate these issues and enhance the visual quality of the rendered models. A splat is to be considered an arbitrarily shaped patch which locally approximates the underlying surface of a point cloud, and which is sent to the renderer as a single graphic primitive. Splat-based 3D models, as opposed to traditional point clouds (where every point is rendered as a single pixel or a simple square), reduce the huge amount of empty space among points and approximate the surfaces of the model better, hence lessening the aforementioned weaknesses. The modelling stages of the splat-based solution are described below, whereas the on-line rendering process which avoids the typical billboard effect (i.e., points having the same shape regardless of the viewpoint) is described in Section 5.3.

J. Base splat model

The first approximation of the splat model is obtained by replacing each point with a splat that adapts better to the shape of the original object. In order to infer information about the surface from which a point was sampled, neighbouring points should be considered. A fixed-radius neighbour search has been adopted rather than a k-nearest-neighbour exploration since the former outputs points which are more likely to belong to the same surface and also manages to filter out noisy samples when they are not tightly surrounded by enough points. Nevertheless, neither of these methods ensures that all derived neighbours constitute a single surface, and research is being carried out to overcome this issue and improve the current results during the second half of the project's life. Currently, the search radius must be tuned manually as it depends on the average distance among points, and a trade-off between deriving enough neighbours and sticking to the locality of the point must be achieved; this is also expected to become an automatic parameter in the second half of the project's life. A speed-up of this procedure is attained with the discretization of the 3D space into voxels, which narrows the number of candidate points to those located inside the same voxel as the considered point and in the voxels adjacent to it. Figure 11 shows an example of a traditional point cloud obtained through the previous 3D reconstruction pipeline and the corresponding set of voxels encompassing the points.
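A sketch of the voxel-accelerated fixed-radius neighbour search; the voxel edge length is set equal to the search radius so that only the 27 surrounding voxels need to be inspected. This is an illustration of the idea, not the project's implementation.

```python
# Sketch: fixed-radius neighbour search accelerated by a voxel hash grid.
import numpy as np
from collections import defaultdict
from itertools import product

def build_voxel_grid(points, radius):
    grid = defaultdict(list)
    for i, key in enumerate(map(tuple, np.floor(points / radius).astype(int))):
        grid[key].append(i)
    return grid

def radius_neighbours(points, grid, radius, index):
    cx, cy, cz = np.floor(points[index] / radius).astype(int)
    candidates = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):   # same voxel + 26 neighbours
        candidates.extend(grid.get((cx + dx, cy + dy, cz + dz), []))
    candidates = np.array(candidates)
    d = np.linalg.norm(points[candidates] - points[index], axis=1)
    return candidates[(d <= radius) & (candidates != index)]
```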

Figure 11: Traditional 3D point cloud (left) and discretization of the model into cubic voxels (right)

Once the neighbours of every point have been identified, PCA techniques are applied to estimate the normal and tangential directions of the local surface. The plane yielding the best approximation of the given point set in the least-squares sense is hence found. Both the directions and the lengths of the local tangential vectors are computed based on the variance of the input data. Figure 12 (left) shows a point cloud region centred around a given point whose neighbourhood has been highlighted in green, and whose estimated tangential vectors are shown as small red and green line segments. Figure 12 (right) shows in the same way the tangential vectors for many points of the same model.

Figure 12: Close-up of the previous model showing the neighbours and tangential vectors of a given point (left) and all the tangential vectors of a portion of the model (right)

Besides, a measure of the flatness of the region around each point is computed as the ratio of the sum of the variances along the tangential directions to the sum of the variances along the three directions (the two tangential ones plus the normal one). This measure is useful to determine which splats can be efficiently merged with their neighbours to generate larger splats and therefore fill potential holes in under-sampled regions, as explained below in Section K.

A point thus holds local information about the model surface, instead of remaining a mere sample, and can be modelled as a splat, currently an ellipse whose major and minor axes are the computed tangential vectors. The 3D coordinates assigned to the centre of the splat may be either the corresponding point location or the mean of its neighbour points (that is, the origin of the derived least-squares plane). However, as can be seen in Figure 13, the latter approach generates more holes in flat regions of the model, as scattered splats tend to become closer and therefore leave larger areas without information. In contrast, this is quite beneficial in areas with edges (such as the window and the top of the tower in the close-up of Figure 13), since noise introduced during the 3D reconstruction process is partially reduced and the resulting patches tend to adapt more tightly to the underlying surface. An intermediate solution that takes advantage of both techniques, bearing in mind the splat normals, is therefore proposed, based on the flatness coefficient mentioned previously: splats with a flatness coefficient over a predefined threshold preserve the position of the original point, whereas those whose value is under the threshold are moved to the average point.
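A sketch of deriving an elliptical splat and its flatness coefficient from a point's neighbourhood via PCA, following the description above; the flatness threshold is an illustrative placeholder.

```python
# Sketch: elliptical splat from a local PCA. The two largest principal
# directions give the tangential axes, their standard deviations the axis
# lengths, and the flatness coefficient decides whether the splat keeps the
# original point position or moves to the neighbourhood mean.
import numpy as np

def fit_splat(point, neighbours, flatness_threshold=0.9):
    mean = neighbours.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov((neighbours - mean).T))  # ascending
    normal = eigvec[:, 0]
    major = eigvec[:, 2] * np.sqrt(eigval[2])
    minor = eigvec[:, 1] * np.sqrt(eigval[1])
    flatness = (eigval[1] + eigval[2]) / eigval.sum()   # tangential / total variance
    centre = point if flatness >= flatness_threshold else mean
    return centre, normal, major, minor, flatness
```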

Figure 13: Full splat-based 3D model (top row) and close-up (bottom row) using the original point locations (left) and the averaged splat locations (right)

Besides, the original point colour may be replaced by a weighted average of the colours of its neighbours, in order to smooth the high dispersion of the values retrieved during the 3D reconstruction and provide a more natural colour. Nevertheless, this solution should be further improved (for instance, by texturing every splat) since the adoption of a single colour for all the pixels inside a splat produces a flat-shading-like style that still looks unnatural, and the use of the weighted average fails to preserve the colour details of the model. The difference between both approaches is shown in Figure 14.

Figure 14: Splat-based 3D model coloured with the original values (left) and a weighted average (right)

K. Refined splat model

Since the base splat model has a high number of splats, most of which overlap, it cannot be rendered directly on mobile devices, or at least not smoothly. Therefore, we simplify this model while preserving the sharp details and minimizing the appearance of new holes due to the removal of splats. A two-stage decimation/fusion process is hence proposed, aiming to achieve both a reduction of the algorithmic computational cost and the use of fewer, larger splats (as opposed to many small ones) in planar regions. The first stage consists of a coarse simplification that focuses on removing splats with close centres, in order to quickly eliminate as many nearby ellipses as possible, whereas the second step involves merging several overlapping ellipses into a single one. Both are described below.

The coarse decimation process starts by merging all nearby splats (defined by a percentage of the neighbour-search radius defined at the very beginning of the base splat-model description) whose normal vectors form a relatively small angle. The resulting splat centre and colour are then naturally computed as a weighted average among the affected splats, using their respective areas as the weighting coefficients. Regarding the principal vectors, their directions are also obtained through a weighted average, whereas their length is later scaled to ensure that the new ellipse covers as much area as the original splats. This is achieved by selecting a user-defined fraction of the involved splats (always discarding those farthest from the new centre), sampling their contours uniformly with eight samples, and then projecting the resulting points onto the plane supporting the new splat. Finally, the maximum projections along each axis are found and the new ellipse is forced to go through those points.

The aim of the second decimation stage is to detect overlapping splats and replace them by their minimum enclosing ellipse. In contrast to the previous step, in this one we only combine two splats at a time, to avoid using too many sampled points at once and therefore ending up with extremely large ellipses. Pairs of splats located at a distance smaller than the sum of their major-axis lengths and with a low angle between their normals are merging candidates. Potential overlaps between such merging candidates are detected by taking eight samples from the contour of the first one and projecting them onto the second one. If two or more projections overlap, this is considered significant, and a minimum enclosing ellipsoid for both initial splats is then obtained.
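A sketch of the merge-candidate test of the second stage (centre distance and normal angle); the overlap check via contour sampling is omitted, and the angle threshold is an illustrative placeholder.

```python
# Sketch: two splats are merge candidates if their centres are closer than the
# sum of their semi-major axes and their normals form a small angle.
import numpy as np

def are_merge_candidates(c1, n1, major1, c2, n2, major2, max_angle_deg=15.0):
    close = np.linalg.norm(c1 - c2) < (np.linalg.norm(major1) + np.linalg.norm(major2))
    cos_angle = abs(np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2)))
    aligned = cos_angle > np.cos(np.radians(max_angle_deg))
    return close and aligned
```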


Figure 15: Base (left) and refined splat model (right)

This refinement process generates models which need, on average, only about 10% of the original number of splats, hence drastically reducing the computational requirements for the rendering phase (which, as previously mentioned, is crucial for handheld devices). Regarding the benefits in terms of memory (or storage/transmission) requirements, the size reduction is not as high: refined splat models need approximately one quarter of the number of bytes of the corresponding original point clouds. This is due to the need to store more information per primitive: not only position and colour, but also the ellipse major and minor axes. Figure 15 shows a base model made up of 160k splats, and the corresponding refined model, which has only around 20k splats. On the one hand, the global shape of the model is preserved and some small holes are eliminated thanks to the splat-merging process. On the other hand, some details are missing in the refined model because of the simplification, and splat colours seem more unnatural because, in general, the ellipses are larger and uniformly coloured. However, this issue is expected to be addressed in the second half of the project's life by developing a texturing process for the splats similar to the one typically used for polygonal meshes (see also Section 5.3).

4.4 Dense depth-based 3D surface patch structure reconstruction In this section a dense 3D reconstruction and modelling process is introduced. Unlike the previously discussed methods, it is not based on the reconstructed sparse 3D point cloud. Instead, dense depth maps are estimated and used as the basis for 3D reconstruction and modelling. The general algorithmic idea is illustrated in Figure 16. An initial structure-from-motion step is required to obtain the camera calibration parameters. Here, results from the dedicated BRIDGET tools developed in WP6T1 are used (see section 4.1). Based on these, a dense depth estimation and a depth refinement step are applied. The objective of the subsequent point cloud fusion step is to efficiently handle occlusions as well as to algorithmically cope with large-scale 3D models, such as a building captured from various viewing positions. For this purpose, the estimated depth maps are converted to 3D points and fused accordingly. Further on, this step implicitly determines 3D surface patch segments which can be further exploited for efficient meshing and free viewpoint rendering in subsequent modules, as described in section 5.4.

Figure 16: General algorithmic workflow for dense surface structure estimation

The task of 3D reconstruction within BRIDGET is based on the assumption that a large set of input images and videos is available. Figure 17 illustrates this with the example of capturing a building from multiple camera input sources. The challenge for the 3D reconstruction process is to robustly identify correspondences throughout the whole set of input image data. This becomes even more critical as the related 3D scene cannot be expected to be visible in all cameras simultaneously. So, in addition to the standard problems of dense depth estimation, such as the reconstruction of homogeneous regions or periodic structures, in the current case the issue of hidden regions and occlusions has to be addressed explicitly. Pioneering work in this context was done by Furukawa et al., who proposed an overall processing workflow for the 3D reconstruction of buildings from 2D images [5]. Their main idea is to use an initial structure-from-motion step as the basis for a subsequent multi-view stereo processing. As mentioned earlier, our current work in BRIDGET generally follows this concept (see Figure 16).

Figure 17: Grouping of cameras into trifocal sub-systems (here: three, marked in green, blue and yellow) from a larger set of available cameras. Cameras that are not used are labelled in grey

A fundamental difference and novelty of our work is the way we perform the multi-view stereo processing. Rather than treating all input cameras equally, our idea is to select robust trifocal sub-systems first. The idea of narrow-baseline camera sub-systems has a long tradition in dense multi-view matching. The advantage of such approaches is that additional robustness for the final result can be gained from the fact that multiple pair-wise estimation processes can be performed. Due to the narrow camera distances, a comparison of the results of these processes is possible, which allows the efficient detection of outliers and occlusions [4], [6].

Following this idea, for BRIDGET, a main condition for the definition of camera sub-systems is to select only those cameras which have small distances between them, i.e. narrow baselines. Please note that state-of-the-art approaches, such as those presented in [4], [6], are based on fixed baseline configurations, usually with equal distances between neighbouring cameras. The novelty for BRIDGET is that we introduce this idea to random camera input sets, i.e. we allow arbitrary camera input configurations for the definition of trifocal sub-systems. Our goal is to exploit the robustness of trifocal sub-systems even in the case of arbitrary input data.

Figure 17 illustrates this with three such trifocal sub-systems, marked in different colours. It can further be seen that not all cameras were integrated into this process. By nature, some cameras (marked in grey in the figure) will have large distances to any other camera, so they are not selected for this approach and are removed from further processing. As we assume large input data sets, these removals are expected to have little influence on the final result. Please note that at the current state of work the related trifocal sub-systems are selected interactively. Nevertheless, it is expected that automated approaches, as for example presented in [7], can be applied.
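Such an automated selection could, for instance, follow a simple greedy rule over the estimated camera centres. The sketch below only illustrates the narrow-baseline condition described above; the function name, the greedy strategy and the max_baseline threshold are assumptions, not part of the BRIDGET tools.

```python
import numpy as np
from itertools import combinations

def select_trifocal_subsystems(camera_centres, max_baseline):
    """Greedily group cameras into triples whose pairwise distances are all
    below 'max_baseline'; cameras that fit in no such triple stay unused
    (the grey cameras of Figure 17)."""
    centres = np.asarray(camera_centres, dtype=float)
    unused = set(range(len(centres)))
    triples = []
    for i, j, k in combinations(range(len(centres)), 3):
        if not {i, j, k} <= unused:
            continue
        d = [np.linalg.norm(centres[a] - centres[b])
             for a, b in ((i, j), (j, k), (i, k))]
        if max(d) < max_baseline:          # all three baselines are narrow
            triples.append((i, j, k))
            unused -= {i, j, k}
    return triples, sorted(unused)
```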

Figure 18: Detailed workflow for dense 3D surface structure estimation

In order to test the above-described workflow within BRIDGET, we assembled the algorithmic processing chain shown in detail in Figure 18, which illustrates our proposal for dense 3D surface structure estimation. The processing starts with a dense depth estimation step. For this purpose we applied the state-of-the-art Line-Wise Hybrid Recursive Matching approach (L-HRM) [4]. In the literature this tool was successfully tested for narrow-baseline systems with few occlusions between neighbouring views. A key property is the robustness of its results even in homogeneous regions. Further on, it is computationally lightweight. In BRIDGET this tool was extended to deal with arbitrary camera positions for the trifocal input setups.

In a second step, we apply a higher-precision depth fine-tuning step based on the state-of-the-art Patch Sweep approach [1]. The Patch Sweep technique achieves very high structural detail for the reconstructed 3D structure, even at sub-pixel resolution. It is based on 3D surface patches (surflets) which are optimised in terms of position and orientation (see Figure 19). The optimisation criterion is the 2D correlation of the projected surflets in the image planes of all related cameras. In our case, these are the three cameras of a given trifocal input system.

For the work in WP6T1, one vision for future work is to add robust spatio-temporal objects (STOs) to the estimation loop. This process is illustrated in Figure 18 with dashed lines. The basic idea was introduced in [19].
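To make the Patch Sweep criterion concrete, the following sketch scores one surflet hypothesis by projecting a small planar patch into the cameras of a trifocal system and measuring the pairwise correlation of the sampled intensities. It is a simplified illustration only: the sampling density, the nearest-neighbour image lookup and all function names are assumptions, and the actual tool optimises position and orientation jointly rather than just scoring a single hypothesis.

```python
import numpy as np

def project(P, X):
    """Project Nx3 world points with a 3x4 projection matrix; returns Nx2 pixels."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

def sample_gray(img, pts):
    """Nearest-neighbour greyscale lookup (a real system would use bilinear sampling)."""
    h, w = img.shape[:2]
    px = np.clip(np.round(pts).astype(int), [0, 0], [w - 1, h - 1])
    return img[px[:, 1], px[:, 0]].astype(float)

def ncc(a, b):
    """Normalised cross-correlation of two flattened intensity patches."""
    a = a - a.mean(); b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else -1.0

def surflet_score(centre, normal, size, images, P_mats, n_samples=7):
    """Score one surflet hypothesis (centre, normal) by the mean pairwise NCC of
    its projections into the trifocal cameras; a sweep keeps the best score."""
    normal = np.asarray(normal, float) / np.linalg.norm(normal)
    u = np.cross(normal, [0.0, 0.0, 1.0])           # in-plane basis of the patch
    if np.linalg.norm(u) < 1e-6:
        u = np.cross(normal, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    g = np.linspace(-size / 2, size / 2, n_samples)
    uu, vv = np.meshgrid(g, g)
    pts3d = np.asarray(centre, float) + uu.reshape(-1, 1) * u + vv.reshape(-1, 1) * v
    patches = [sample_gray(img, project(P, pts3d)) for img, P in zip(images, P_mats)]
    scores = [ncc(patches[i], patches[j])
              for i in range(len(patches)) for j in range(i + 1, len(patches))]
    return np.mean(scores)
```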

Following the authors' proposal, such STOs can be reconstructed in a separate process, for example based on robust feature points. In the case of BRIDGET, the STOs could consist of simple geometric entities, such as surface planes. These geometric entities could be used to increase the robustness of the patch-based surface reconstruction, for example in the case of homogeneous textures.

A final step is the fusion of the reconstructed 3D surface data, i.e. the fusion of the input of multiple trifocal input systems. For BRIDGET we tested and extended the visibility-driven patch group generation technique introduced in [6]. Based on a simple geometric rule set, the 3D input data are fused by optimising camera-related parameters in a visibility-driven way. For example, the selection of the supporting trifocal camera input system is derived, among other criteria, from the camera distances and orientations. As a result, surface patch segments are generated which are intended to be transmitted and rendered efficiently (see section 5.4).
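As an illustration of the kind of geometric rule set meant here, the toy function below assigns a fused 3D point to the trifocal sub-system whose cameras are closest and best facing the point. This is a deliberately simplified stand-in for the visibility-driven criteria of [6]; the data layout and the cost formula are assumptions.

```python
import numpy as np

def assign_supporting_system(point, point_normal, systems):
    """Pick, for one fused 3D point, the trifocal sub-system that 'supports' it:
    here simply the system whose mean camera centre is closest and whose mean
    viewing direction faces the front side of the point's surface."""
    best, best_cost = None, np.inf
    for sys_id, cams in systems.items():          # cams: list of (centre, view_dir)
        centres = np.array([c for c, _ in cams])
        dirs = np.array([d for _, d in cams])
        dist = np.linalg.norm(centres.mean(axis=0) - point)
        facing = -np.dot(dirs.mean(axis=0), point_normal)   # > 0 if looking at the front side
        if facing <= 0:
            continue                              # back-facing system: point not visible
        cost = dist / facing                      # prefer close, well-facing systems
        if cost < best_cost:
            best, best_cost = sys_id, cost
    return best
```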

Figure 19: Principle of Patch Sweep based surface estimation

An example of the application of the above-mentioned processing chain is discussed in the following. Figure 20 (left) shows a sample image of the "Salzufer" test data set, which was captured within BRIDGET WP6T1. The right-hand side of the figure shows the estimated dense depth map. The result of the patch-based surface refinement is shown in Figure 21: the textured 3D point cloud can be seen on the left-hand side, the polygonal 3D model on the right-hand side of the figure. Finally, Figure 22 (left) demonstrates the resulting rendered 3D model after 3D point cloud fusion and hole filling, while the right-hand side shows a representation of the supporting trifocal camera input systems. In the current example only two systems were used (6 cameras), which are labelled in red and yellow in the figure. Another example is given in Figure 23 and Figure 24 for the "Villa la Tesoriera" data set. The two figures show a reconstructed 3D model generated from 6 input images. Figure 23 illustrates the reconstructed 3D model in polygonal representation (left) and the surface support representation for two trifocal depth input data sets (right). Figure 24 shows the final reconstructed 3D model rendered from an arbitrary virtual viewing position.

Figure 20: left) original sample image from “Salzufer” image data set; right) estimated depth map

Figure 21: left) dense depth-based 3D point cloud; right) polygonal 3D model after meshing

Figure 22: left) Resulting rendered 3D model after 3D point cloud fusion and hole filling;

right) surface support representation for two trifocal depth input data sets (yellow and red)

Figure 23: “Villa la Tesoriera” data set, left) reconstructed polygonal 3D model, right) surface support representation for two trifocal depth input data sets (yellow and red)

Figure 24: Reconstructed 3D model for the “Villa la Tesoriera” data set

Summarizing, we can conclude that within WP6T1 a novel 3D surface reconstruction workflow was assembled and tested which enables a dense reconstruction of 3D objects based on random input data sets. The optimisation of this algorithmic concept is still a matter of ongoing work in BRIDGET WP6T1. Next steps will be the introduction of more trifocal input data sets as well as the evaluation of video data in addition to static images.

4.5 Surface texture refinement based on geometric primitives One major goal of the 3D Media Tools work package (WP6) is the extraction of 3D models from a set of input image or video data. The algorithms presented in this section deal with the refinement of the surface textures of the reconstructed 3D models. The general idea is to exploit the redundancy of multiple input images or videos in order to create surface textures with resolutions higher than originally available. In this way, the visual quality of the reconstructed 3D models is to be enhanced.

From an algorithmic point of view, our general idea is based on spatial super-resolution, a topic that has been studied intensively [2]. The aim is to overcome the inherent resolution limitation of low-resolution images captured at different times or from different positions. For this purpose, the images are first registered to each other, for example based on motion compensation approaches. Afterwards, individual image areas can be re-sampled at a much higher resolution based on multiple low-resolution reference areas in the related input images.

Figure 25: Algorithmic workflow for surface texture refinement

If a 3D model of a given scene is available, then the related 3D structure information can be used for the above-mentioned image registration process. Pioneering work in this area was presented in [3]. The authors refer to monoscopic camera sequences with translational movement. In this case neighbouring cameras have positions which are very close to each other. Therefore, within certain limits, they allow a linear camera path approximation and thus a linear (homography-based) image registration in order to apply the super-resolution technique.

In BRIDGET we do not necessarily have this advantage. In contrast, the camera input data (either video or image based) can come from arbitrary camera positions in space. However, our general idea is similar. As proposed in [3], we first apply a structure-from-motion approach in order to estimate the camera positions and parameters as well as to acquire a first sparse 3D point cloud of the 3D scene (see Figure 25). Note that here we apply the outcome of the tools developed in WP6T1 (described in section 4.1).

Figure 26: left) Sample image of original data set, the registered surface is marked in yellow;

right) surface plane after normalization

Figure 27: left) Reconstructed 3D point cloud representation of the building in Figure 26;

right) planar surface approximation based on a sparse 3D point cloud

Further on, similar to [3], our idea is to apply a linear registration of the scene content in order to apply the super-resolution approach. The main difference is that we do not normalise (and linearise) the camera path (as done in [3]). Instead, in order to be able to apply linear, homography-based image registrations, we normalise appropriate regions in the image. Figure 26 illustrates this with an example. The left-hand side of the figure shows a building with planar regions. One such region was pre-selected and normalised to a fronto-parallel plane (Figure 26, right) based on a homography transformation. At the current state, this step was performed manually in the related 3D point cloud. Figure 27 illustrates this for the chosen building: the left-hand side of the figure shows the reconstructed 3D point cloud, the right-hand side shows a manually registered plane which corresponds to the surface of the 3D object. Note that at a later stage of the project these initial plane registrations may come from the automatic point cloud plane segmentation method developed in BRIDGET WP6T1 (see section 4.1).

As this initial surface plane fitting introduces slight errors, a subsequent refinement step is applied. Here, the position and orientation of the plane are corrected by an optimisation loop based on texture differences. Figure 28 gives an example of the difference between two registered and normalised input images.

The final result for the enhanced surface texture resolution is depicted in Figure 29. It highlights a region of the sample image shown in Figure 26 (right). The left-hand side of the figure shows an enlarged region of the original image. The right-hand side shows the result of the super-resolution approach for five registered and normalised input images. It can clearly be seen that the sampling artifacts in the linear brick structures of the original image are considerably reduced by the proposed method.
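A minimal sketch of the normalisation and fusion steps is given below, assuming OpenCV is available. The region corners, the output size and the scale factor are placeholders, the fusion is a plain average, and the texture-difference-based pose refinement mentioned above is not included.

```python
import cv2
import numpy as np

def normalise_planar_region(image, region_corners_px, out_size, scale=2):
    """Warp a (manually or automatically) selected planar facade region to a
    fronto-parallel rectangle, directly at 'scale' times the target resolution.
    'region_corners_px' are the four image corners of the region."""
    w, h = out_size
    dst = np.float32([[0, 0], [w * scale, 0],
                      [w * scale, h * scale], [0, h * scale]])
    M = cv2.getPerspectiveTransform(np.float32(region_corners_px), dst)
    return cv2.warpPerspective(image, M, (w * scale, h * scale),
                               flags=cv2.INTER_CUBIC)

def fuse_super_resolved(normalised_views):
    """Very simple multi-view fusion: average the registered, upsampled views."""
    stack = np.stack([v.astype(np.float32) for v in normalised_views])
    return np.clip(stack.mean(axis=0), 0, 255).astype(np.uint8)
```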

Figure 28: Normalized surface planes of two input images; right) difference between the images

Figure 29: Results for texture refinement based on the area depicted in Figure 26:

left) original view; right) enhanced view with less sampling artifacts

Concluding, it can be summarised that the general idea of a surface texture refinement based on geometric primitives was implemented and tested successfully. Next steps in this context will be the application of automated initial plane fitting methods (for example as developed in WP6T1). Further on, the surface refinement step is intended to use surface texture segments of different sizes and shapes. In this way, our vision is to model a given 3D scene structure by multiple simple geometric entities which can be connected to more complex shapes if required. As a result, not only can refined textures for the resulting surface segments be obtained; in addition, parts of the reconstructed 3D scene structure can be modelled with such simple entities. In this way, the proposed algorithm is envisioned to support the hybrid scene model approach described in section 4.3 by providing dedicated simplified 3D models for appropriate parts of a given scene. Please note that a simple example of the rendering of a combined hybrid point cloud and surface texture based representation is given in Figure 37 in section 5.4.

5 3D Scene On-line Reconstruction

5.1 Video to point cloud registration In order to provide a first step towards online operation based on the structure-from-motion engine described in Section 4.1, support for video registration has been developed. Specifically, the idea is to provide a tool capable of integrating live video feeds into an existing sparse point cloud, with the general goal of model refinement, densification and update. According to a first algorithmic solution [9], the idea is to process the video frames separately in order to produce a local 3D sparse model (i.e., point cloud), and then register the local model to the global point cloud using state-of-the-art 3D point cloud alignment strategies. A significant gain in terms of computational speed is obtained when multiple videos have to be incorporated: exploiting the multi-core architecture of recent CPUs, local models from multiple videos can be obtained in a parallel fashion.

The video to point cloud registration engine requires as input a base 3D model (i.e., the point cloud resulting from the engine of Section 4.1) and a number of video streams that need to be registered. Figure 30 shows the pipeline of the proposed solution for a single video stream. As outlined before, each video is processed in parallel so as to obtain a local 3D model of the scene for each sequence. Once a number of models are available, the problem is how to register them in an efficient fashion. Our proposal is to focus on sequential, pairwise model registration, where the base model is updated sequentially according to each of the available local models. To this aim, instead of relying on the Iterative Closest Point (ICP) algorithm and derived solutions, a pre-processing step is introduced to augment each point in the models with an average descriptor. Relying on these descriptors, the 3D-to-3D point matching problem can be cast as a more tractable 2D feature matching problem, which can be solved with state-of-the-art techniques. After corresponding pairs have been identified, the relative scale between the models to merge is recovered by analysing the distance ratios of matching samples. Once the models are in the same scale and coordinate system, the 3D roto-translation that maps the local model to the base model is found by a RANSAC-based algorithm that minimises the 3D distance between matching pairs. The last step in the pipeline removes duplicate samples and applies a final bundle adjustment to further refine the global model structure (i.e., point locations and camera parameters).

Figure 30: Video to point cloud integration pipeline

Figure 31 shows an example of a base model (a) and the final result after registering 5 video streams to it (b).
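The scale recovery and RANSAC alignment step described above can be sketched as follows, assuming the descriptor matching has already produced corresponding 3D point pairs. Function names, the number of sampled pairs and the iteration count are illustrative choices, not the actual implementation.

```python
import numpy as np

def estimate_scale(local_pts, base_pts):
    """Relative scale from the ratio of distances between random matching pairs."""
    idx = np.random.default_rng(0).choice(len(local_pts), size=(200, 2))
    d_local = np.linalg.norm(local_pts[idx[:, 0]] - local_pts[idx[:, 1]], axis=1)
    d_base = np.linalg.norm(base_pts[idx[:, 0]] - base_pts[idx[:, 1]], axis=1)
    valid = d_local > 1e-9
    return np.median(d_base[valid] / d_local[valid])

def rigid_from_triplet(src, dst):
    """Closed-form (Kabsch) rotation and translation mapping three points src -> dst."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def ransac_align(local_pts, base_pts, thresh, iters=500):
    """RANSAC over 3-point samples: find the roto-translation (after scaling)
    that maps most local-model points close to their base-model matches."""
    s = estimate_scale(local_pts, base_pts)
    src = local_pts * s
    rng = np.random.default_rng(1)
    best_R, best_t, best_inl = np.eye(3), np.zeros(3), 0
    for _ in range(iters):
        i = rng.choice(len(src), 3, replace=False)
        R, t = rigid_from_triplet(src[i], base_pts[i])
        err = np.linalg.norm((src @ R.T + t) - base_pts, axis=1)
        n_inl = int((err < thresh).sum())
        if n_inl > best_inl:
            best_R, best_t, best_inl = R, t, n_inl
    return s, best_R, best_t
```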

Figure 31: Base model (a) and base model registered with 5 video streams (b). The samples of different videos are coded in different colors

5.2 Online Structure-from-Motion Model Extension with Generic Video Streams The method presented in Section 5.1 can effectively integrate information from video streams into structure-from-motion point clouds, bringing significant improvements in terms of computation time thanks to the parallel creation of local models. However, the system cannot operate in real time. Therefore, a more efficient pipeline has been designed with the general goal of exploiting the temporal consistency of video in order to speed up the frame integration, and to step towards a system capable of coping with real-time constraints [24].

The paradigm of the online video integration engine is significantly different from the one presented in Section 5.1. In fact, instead of separately processing each video stream and then integrating it into the existing base model, the new proposal builds on a per-frame video processing and integration paradigm. The flowchart of this solution is reported in Figure 32.

The system I/O is the same as in the previous section: the inputs are a point-based 3D model obtained using a state-of-the-art structure-from-motion algorithm and a video stream captured by a monocular camera. As before, the base model points are assumed to be augmented with an average local descriptor obtained from the feature points of the different images that contributed to their triangulation.

In a first stage, the video feed is downsampled in the spatial and temporal domains, with the general goal of finding the best compromise between complexity and amount of added information. Subsequently, each frame is fed into the online frame registration engine. In order to localise the 6-DoF pose of the current camera with respect to the base model, a direct 2D-3D matching scheme is applied in the same spirit as [25]. To this aim, a set of local features is extracted from each new image (i.e., query) and stored as SIFT descriptors; exploiting the average descriptor associated with each point of the base model, a number of correspondences are determined between the current frame feature points and the model samples. A multithreaded kd-tree search is adopted for this task. Once the correspondences are available, the pose of the current camera is found by using the 6-point DLT algorithm and running a RANSAC routine to identify the projection matrix. Since the accuracy of the camera pose significantly impacts the precision of the point triangulation, a local refinement is introduced in the form of a bundle adjustment involving only the points seen by the current camera.

Since the model update happens in a continuous fashion and on a per-frame basis, the variables related to the base model have to be updated accordingly. In particular, at every time instant the base model is

enriched with new 3D points, the associated kd-tree is updated, and a new average descriptor is calculated for the involved 3D points. Once the camera pose is estimated, the features of the query image are matched against those extracted from a subset of N images used for the creation of the base model, which are deemed to be the most relevant for adding new information to the point cloud. In particular, these images are selected according to a proximity criterion considering not only the spatial displacement but also the camera orientation. Given this image set and the corresponding matching features with the current frame, points can be triangulated and added to the model.
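A minimal sketch of the camera registration step (direct 2D-3D matching followed by a pose estimate) is given below, assuming OpenCV and SciPy are available. Note that cv2.solvePnPRansac is used here as a convenient stand-in for the 6-point DLT + RANSAC routine described above, and the kd-tree over the model descriptors would in practice be built once and updated incrementally rather than per frame.

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree

def register_frame(frame_gray, model_pts3d, model_desc, K, ratio=0.8):
    """Localise one video frame against the base model: SIFT features of the
    frame are matched (kd-tree + ratio test) against the average descriptors
    attached to the model points, and the 6-DoF pose is recovered with a
    PnP + RANSAC step."""
    sift = cv2.SIFT_create()
    kps, desc = sift.detectAndCompute(frame_gray, None)
    tree = cKDTree(model_desc)                     # illustrative: rebuild per call
    d, idx = tree.query(desc, k=2)
    good = d[:, 0] < ratio * d[:, 1]               # Lowe's ratio test
    pts2d = np.float32([kps[i].pt for i in np.where(good)[0]])
    pts3d = np.float32(np.asarray(model_pts3d)[idx[good, 0]])
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
    return (rvec, tvec, inliers) if ok else None
```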

Figure 32: Online video to point cloud integration pipeline

After the model is extended/updated, the next frame is fed into the system, and the process is iterated until no other video frames are available. When all frames and corresponding points are registered, a final refinement based on a full global bundle adjustment is applied to clean up the extended point cloud and correct for drift errors in the camera locations and point triangulation.

In order to assess the performance of the video registration methods of Section 5.1 and Section 5.2, a number of experiments have been run and the results compared with different state-of-the-art approaches, including batch-based and online SfM. Bundler [11] and VisualSfM [12] (VSfM) are used as offline baselines for evaluating the accuracy of the reconstruction, and the online SfM framework of [26] (Hoppe) as a reference for assessing the complexity of the model extension. The main dataset consists of 492 images and 4 monocular videos (i.e., v1, v2, v3, v4) of the Rathaus building in Marienplatz, Munich (Germany).

Input           Bundler   VSfM   Hoppe   Method Sect. 5.1   Method Sect. 5.2
(A) Base + v1       536    487     533            495               536
(B) A + v2          599    553     573            559               593
(C) B + v3          673    630     621            632               645
(D) C + v4          726    679     669            681               691

Table 1: Number of successfully registered video frames.

Offline SfM methods are normally able to successfully register a higher number of images by computing a full cross-matching step among all images; this comes at the cost of a very high computational complexity. Local online direct matching approaches, conversely, favour low complexity at the risk of failing to register some frames, as can be seen in Table 1. However, the method proposed in Section 5.2, which continuously refines the point cloud at every frame with a local BA, proves to be particularly robust for registering consecutive video frames. In fact, the number of successfully registered frames basically matches that of the offline methods, outperforming the image-retrieval-based work of [26] and the method of Section 5.1.

Input           Bundler     VSfM    Hoppe   Method Sect. 5.1   Method Sect. 5.2
(A) Base + v1    118862   110111   103333         116099             110985
(B) A + v2       121850   121294   108101         128881             114936
(C) B + v3       132648   131497   115016         139836             125533
(D) C + v4       141086   141412   126452         148772             134249

Table 2: Total number of points in the updated models.

Similarly to frame registration, the point cloud density achieved by offline approaches normally exceeds that of online SfM techniques. Table 2 shows the results obtained with our method. The slightly lower density is due to the usage of just N = 8 images to add new points every time a frame is introduced into the system, instead of the whole image set used by the other cited methods. However, Figure 33 shows that, although sparser, our final point cloud is less noisy, thus enabling more accurate meshing. This is due to the selection of only a limited subset (N = 8) of relevant images, which minimises the probability of matching outliers. Conversely, the online approach in [26] provides sparser clouds due to the higher selectivity of its frame-based global BA step.

Figure 33: Close-up of a detail in the base model (left), the model updated with Bundler (centre), and with the method of Section 5.2 (right).

Finally, we compared our method in terms of overall execution time with selected state-of-the-art baselines. Table 3 shows how our framework significantly reduces the execution time of the offline methods, being 8-18x faster than VSfM and 26-42x faster than Bundler. The proposed method also proves to be approximately 3-4x faster than the method of Section 5.1. It can also be noted that the per-frame registration time is significantly smaller than that of [26] (Hoppe). This is, in particular, due to our modified direct matching and local refinement, which avoid the need to compare against the whole image set. Notably, the time needed to run the full BA after the registration of every frame is not included in the results of [26]; including such a post-processing step, its execution time would increase strongly, up to runtimes similar to those of the offline methods.

Input           Bundler   VSfM   Hoppe(*)   Method Sect. 5.1   Method Sect. 5.2
(A) Base + v1      5345    2803       326              735                212
(B) A + v2         6302    2642       439              875                303
(C) B + v3         7374    3992       413             1098                287
(D) C + v4         9485    4014       364             1150                226

Table 3: Computation time (s). (*) Full BA after every registration is not included.

5.3 3D point cloud rendering Regarding the splat-based 3D models obtained in T6.1, the aim in T6.3 is to render them efficiently and therefore improve visually (and, if possible, computationally) on the results obtained with typical point-based models. As stated in Section 4.3.2, the on-line rendering must avoid the billboard effect that arises with the classic square-based approaches, where the same square point is displayed from every user-selected viewpoint. In contrast, the local information retained by splats is suitable for inferring how their shape is deformed as the viewpoint changes, and makes them capable of being re-oriented on the fly.

The process of displaying every point as an oriented elliptical splat cannot be performed out of the box with any standard graphics pipeline, and therefore a specific implementation based on point primitives (as opposed to triangles or quads, typical of polygon meshes) has been developed with the support of the OpenGL 2.0 rendering library. This involved the design of specific GLSL vertex and fragment shaders, whose details follow.

The vertex shader runs for every vertex or point primitive and receives as input the splat 3D position, colour and ellipse 3D axes. As typical shaders do, it transforms the point position (stored as local coordinates) into clip coordinates, whereas the colour information is passed straight through to the fragment shader. However, it also performs specific splat-related actions: it applies the view and projection

matrices to turn the endpoints of the ellipse axes into clip coordinates, and then projects them onto the viewport to compute their corresponding window coordinates. This procedure is needed to determine both the precise region of the screen where the splat must be rendered and the size of its screen-allocated area, which is essential to guarantee that the splat can be wholly displayed. The concepts involved in these operations are depicted in Figure 34. Consequently, the outputs of the vertex shader include the splat clip-coordinate 3D position, colour, point size and the 2D locations of its axis endpoints.

Figure 34: Schematic representation of the vertex shader inputs (in black) and outputs (in red)

As for the fragment shader, it is run for every fragment, which broadly speaking represents each possible pixel inside the region allocated to a point. Its primary aim is to decide whether a fragment should be discarded or not, and in the latter case, which colour value gets written into the frame buffer. Bearing in mind that all outputs from the vertex shader are inputs for this stage, the 2D axis endpoints and the point size are used to determine the parameters needed to display the splat ellipse: its centre, the lengths of its principal axes and its rotation angle. Using the general equation of the ellipse, every fragment can then be determined to belong either to the inner or to the outer region of the splat, and respectively be rendered with the colour associated with the splat or simply discarded. These concepts are graphically illustrated by Figure 35.
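The inside/outside decision made by the fragment shader can be written in a few lines; the NumPy sketch below mirrors that logic for a single fragment. It is an illustration of the geometry only (names and conventions are assumptions), not the actual GLSL code.

```python
import numpy as np

def inside_splat_ellipse(frag_xy, centre_xy, end_major_xy, end_minor_xy):
    """From the window-space centre and the projected endpoints of the two
    ellipse axes, decide whether a fragment lies inside the splat ellipse
    (keep) or outside (discard)."""
    centre_xy = np.asarray(centre_xy, float)
    a_vec = np.asarray(end_major_xy, float) - centre_xy   # major semi-axis in pixels
    b_vec = np.asarray(end_minor_xy, float) - centre_xy   # minor semi-axis in pixels
    a, b = np.linalg.norm(a_vec), np.linalg.norm(b_vec)
    theta = np.arctan2(a_vec[1], a_vec[0])                # ellipse rotation angle
    # rotate the fragment into the ellipse-aligned frame and apply the
    # canonical ellipse equation (x/a)^2 + (y/b)^2 <= 1
    d = np.asarray(frag_xy, float) - centre_xy
    x = d[0] * np.cos(theta) + d[1] * np.sin(theta)
    y = -d[0] * np.sin(theta) + d[1] * np.cos(theta)
    return (x / a) ** 2 + (y / b) ** 2 <= 1.0
```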

Figure 35: Schematic representation of the fragment shader output

The possibility of applying alpha-blending has also been explored and implemented in the fragment shader for the non-discarded fragments. This means rendering with an opaque colour in the centre of the ellipse and increasing the transparency towards its edge following a normal distribution. Figure 36 provides a comparison of the resulting 3D model using an opaque colour per splat and applying alpha-blending. This technique provides a more natural visual effect since it avoids abrupt transitions between splats, which unavoidably still remain (although to a much lesser extent) after the transition from square to elliptical points. Moreover, the adoption of blending also smooths the high variety of colours across adjacent splats (as the weighted average explained in Section 4.3.2 does), especially for overlapping splats with strong colour differences. Nevertheless, in order to correctly apply alpha-blending to a set of points,

it is mandatory to send them to the rendering pipeline sorted by their distance to the viewpoint, from the furthest to the closest points. Otherwise, the depth test that is performed on every fragment will fail for the more distant ones and their colour will not be considered. This issue prevents a real-time rendering process from being achieved, since every new frame displayed from a different viewpoint implies the need to re-sort all the involved points. The adoption of less computationally demanding approaches and the study of order-independent transparency techniques will be addressed during the second half of the project's life, as will the possibility of applying texture mapping techniques to splat rendering.
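For reference, the back-to-front ordering required for blending amounts to a sort by distance from the current viewpoint, as in the short sketch below (an illustration of the cost being discussed, not the renderer's code).

```python
import numpy as np

def back_to_front_order(splat_centres, viewpoint):
    """Sort splats from the furthest to the closest to the current viewpoint,
    as required for correct alpha-blending; this re-sort has to be repeated
    whenever the viewpoint changes."""
    d = np.linalg.norm(np.asarray(splat_centres, float) -
                       np.asarray(viewpoint, float), axis=1)
    return np.argsort(-d)          # indices in decreasing-distance order
```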

Figure 36: Splat-based 3D model rendered without (left) and with (right) alpha-blending

5.4 View dependent rendering of hybrid 3D scene representations This section refers to the work presented in section 4.4 and section 4.5. One aim is to render combined hybrid point cloud and surface texture based representations. Figure 37 illustrates this with the example of the 3D point cloud of the building (Figure 27), which is combined with the enhanced planar surface texture described in section 4.5. From a rendering point of view, at the current stage a straightforward state-of-the-art approach was applied. Nevertheless, future BRIDGET application scenarios envision the mixing of different 3D data representations and rendering techniques in a hybrid way (see section 4.3 for more details). The 3D model shown in Figure 37 could be used for scenarios with poor overall 3D reconstruction quality or low visualisation demands on the rendered result.

Figure 37: Rendered hybrid 3D point cloud model including a refined surface texture; see Figure 29 for an enlarged view of the area marked in red

Another goal for BRIDGET WP6T3 is the viewpoint-dependent meshing and rendering of the 3D surface patch segments generated in WP6T1. For this purpose, several state-of-the-art meshing algorithms were first evaluated and compared. Further on, an existing algorithm for efficient surface patch segment fusion was applied [6] and extended to meet the requirements of a view-dependent view synthesis. The main purpose of our proposed algorithm is to achieve high efficiency in rendering the 3D surface patch segments resulting from the 3D reconstruction process described in section 4.4. The main algorithmic components were already illustrated in Figure 18 in combination with the related 3D reconstruction workflow. For WP6T3, the three relevant modules are shown again in Figure 38.

Figure 38: Efficient patch group based re-meshing procedure

The algorithmic idea for efficient view-dependent rendering is to use visibility-driven surface patch groups in order to first perform an intra-patch group re-meshing and afterwards connect the surface patches by an inter-patch group meshing. Figure 38 illustrates this algorithmic workflow. The main novelty and advantage of this approach is that the overall meshing is more robust to outliers and computationally much more efficient in comparison to a general re-meshing of the dense 3D point cloud.

From an algorithmic point of view, the patch group fusion is based on the idea of combining surface patches corresponding to the same trifocal sub-system. In this way a re-meshing of the patch groups can

be performed in a simple and straightforward way, based directly on the image pixel indices of the originally reconstructed depth maps. In Figure 38 this step is denoted as "intra patch group meshing". A final processing step combines the re-meshed patch groups. Figure 39 shows the principle of the re-meshing of patch groups. In the enlargement on the right-hand side of the figure it can be seen that only the borders of the patch groups are re-meshed. As mentioned before, this leads to more robustness and computational efficiency. For this purpose, in WP6T3 we evaluated several re-meshing algorithms and compared them to the ball-pivoting approach [20] used in [6]. As a result, in the current workflow we replaced ball-pivoting with the approach proposed by Marton et al. in [21]. The main advantage of this approach is that it can be adapted to the non-equidistant sampling which is required for handling the arbitrary trifocal setups.
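The statement that the intra-patch group meshing can work directly on the pixel indices of the source depth map can be illustrated by the simple grid triangulation below. It is a sketch under obvious assumptions (a validity mask, a depth-jump threshold, row-major vertex indexing) rather than the BRIDGET implementation.

```python
import numpy as np

def mesh_from_depth_indices(valid_mask, depth, max_jump):
    """Triangulate a patch group directly on the pixel grid of its source depth
    map: each 2x2 block of valid pixels yields two triangles, unless the depth
    jump suggests an occlusion edge. Vertex indices are row-major pixel indices."""
    h, w = depth.shape
    tris = []
    for y in range(h - 1):
        for x in range(w - 1):
            q = [(y, x), (y, x + 1), (y + 1, x), (y + 1, x + 1)]
            if not all(valid_mask[p] for p in q):
                continue
            d = [depth[p] for p in q]
            if max(d) - min(d) > max_jump:      # likely a depth discontinuity
                continue
            i = [p[0] * w + p[1] for p in q]    # flat vertex indices
            tris.append((i[0], i[1], i[2]))
            tris.append((i[1], i[3], i[2]))
    return np.array(tris, dtype=np.int64)
```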

Figure 39: left) Meshing of surface patch segments as a result of 3D point cloud fusion;

right) enlargement

A result for the rendering of the 3D reconstructions generated in section 4.4 is shown in Figure 40 for the "Salzufer" data set. The figure illustrates a reconstructed 3D model of a building rendered from different perspectives in 3D space. Please note that for BRIDGET WP6T3 these results are still part of ongoing research. Next steps will be the enhancement of robustness and a more efficient removal of outliers, as well as improved computational efficiency.

Figure 40: Viewpoint dependent 3D model representation rendered from different perspectives

5.5 Static 3D audio scene rendering The goal of the BRIDGET audio engine is to give the user a feeling of being immersed in a spatial audio scene as close as possible to a real-life listening experience, meaning that the user should be able to determine the direction of any incoming sound as well as the acoustics of the environment. This should increase the overall realism of artificially created audio-visual BRIDGET content. In order to recreate an audio scene for the purpose of BRIDGET, an audio engine based on binaural technology has been created. Binaural technology is a powerful tool for 3D sound synthesis that exploits knowledge of human sound perception and signal processing. In general, the way people perceive spatial sound is defined by individual transfer functions called Head-Related Transfer Functions (HRTFs). Most applications can provide sufficiently good results if generic HRTFs are used instead of individualised ones. In this way, the whole process is sped up, since the acquisition of individualised HRTFs is very time consuming and complex, and has to be done in a controlled (laboratory) environment. Especially when targeting a wide population of end users, generic sets of HRTFs are therefore usually used.

The input to the spatial sound engine is a mono audio signal together with the spatial position and orientation of the corresponding sound source. According to the position of the source relative to the listener (calculated from the position and orientation data), the pair of HRTFs that corresponds to the calculated direction is selected from the HRTF database. Minimum-phase filters are created on the basis of the recorded HRTFs. A trade-off between filter length and preserved spatial information has to be found; filters with a length of 256 taps are used. In this way, time and computational power are saved in the convolution process, while the spatial features of the HRTFs are still perceived. Once the correct pair of HRTFs is selected, it is convolved with the corresponding mono audio signal. The result of the convolution is a two-channel (binaural) audio signal that now contains the spatial information of the perceived sound source: the direction of the sound is easily noticeable.

Binaural signals are then reproduced over a set of equalised headphones. Equalisation is done in order to flatten the frequency response of the headphones so that it does not influence the frequency content of the binaural signal. Since the frequency response of headphones is not linear, it can filter the signal to be reproduced and thus influence

the Interaural Time Difference and the Interaural Level Difference between the left and right channels. As a consequence, the virtual sound image created at a certain direction can be shifted in space, ruining the accuracy of the simulated scene. The user interface for spatial sound rendering is shown in Figure 41.
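The core of the static rendering path described above (direction computation, HRTF selection and convolution) can be sketched as follows. The database layout (a dictionary keyed by a rounded azimuth on a 5-degree grid), the coordinate convention and the function names are assumptions made for this example; elevation handling and headphone equalisation are omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, src_pos, listener_pos, listener_yaw_deg, hrir_db):
    """Static binaural rendering sketch: derive the source direction relative to
    the listener, pick the nearest HRIR pair (e.g. 256-tap minimum-phase filters)
    from a generic database, and convolve the mono signal with it."""
    d = np.asarray(src_pos, float) - np.asarray(listener_pos, float)
    azimuth = np.degrees(np.arctan2(d[1], d[0])) - listener_yaw_deg
    azimuth = (azimuth + 180.0) % 360.0 - 180.0        # wrap to [-180, 180)
    key = int(round(azimuth / 5.0) * 5) % 360          # assumed 5-degree HRIR grid
    h_left, h_right = hrir_db[key]                     # pair of impulse responses
    left = fftconvolve(mono, h_left)[: len(mono)]
    right = fftconvolve(mono, h_right)[: len(mono)]
    return np.stack([left, right], axis=1)             # two-channel binaural signal
```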

Figure 41: User interface for spatial audio rendering

5.6 Dynamic 3D audio scene rendering – multi view support In order to fully exploit the power of spatial sound, the BRIDGET audio engine was extended to support the simulation of a dynamic, interactive scene. A dynamic scene allows sound sources to freely change their positions within the scene, limited only by the design of the BRIDGET content. Interactivity allows the user of BRIDGET technology to change the position and orientation of the listener (his or her own position and orientation in the case of "first view" mode). In both cases, the 3D audio scene is rendered in real time according to the changes made to the position and/or orientation of the sound sources and/or the listener. This gives the user the possibility to move around the scene and change the point of view, and thus to perceive the spatial audio scene accordingly. The user interface for the dynamic, interactive 3D audio rendering is shown in Figure 42. Grey dots represent sound sources, which can be arranged according to the BRIDGET content design, while the red dot represents the listener, whose position and orientation are controlled by the user.

Figure 42: User interface for spatial audio rendering of a dynamic scene

Rendering of the acoustics of a scene is done by adding artificial reverberation to the binaural signal. Reverberation is implemented in terms of room impulse responses which are parametrically created on the basis of the scene geometry description.

6 Conclusion This document summarises the research outcome of WP6 during the first half of BRIDGET’s life. Several tools were developed for the task of 3D A/V reconstruction and rendering. All tools have different types of outcomes and application scenarios. A concept for a common overall algorithmic workflow for 3D scene reconstruction was introduced which combines the proposed sub-modules and algorithms in a single powerful processing chain. All tools were tested and evaluated with dedicated data sets.

7 References

[1] W. Waizenegger, I. Feldmann, and O. Schreer, "Real-time Patch Sweeping for High-Quality Depth Estimation in 3D Videoconferencing Applications", SPIE Conference on Real-Time Image and Video Processing, San Francisco, CA, USA, January 2011.

[2] S. Borman and R. L. Stevenson, "Super-resolution from image sequences – a review", Midwest Symposium on Circuits and Systems, 1998, pp. 374–378.

[3] S. Knorr, M. Kunter, and T. Sikora, "Super-Resolution Stereo- and Multi-View Synthesis from Monocular Video Sequences", Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM '07), 2007, pp. 55–64.

[4] C. Riechert, F. Zilly, M. Müller, and P. Kauff, "Real-Time Disparity Estimation Using Line-Wise Hybrid Recursive Matching and Cross-Bilateral Median Up-Sampling", International Conference on Pattern Recognition (ICPR 2012), Tsukuba Science City, Japan, November 2012.

[5] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, "Reconstructing building interiors from images", IEEE 12th International Conference on Computer Vision (ICCV), 2009, pp. 80–87.

[6] S. Ebel, W. Waizenegger, M. Reinhardt, O. Schreer, and I. Feldmann, "Visibility-driven patch group generation", International Conference on 3D Imaging (IC3D), Liège, Belgium, November 2014.

[7] M. Farenzena, A. Fusiello, and R. Gherardi, "Structure-and-motion pipeline on a hierarchical cluster tree", IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2009, pp. 1489–1496.

[8] N. Piotto and G. Cordara, "Statistical Modelling for Enhanced Outlier Detection", IEEE International Conference on Image Processing (ICIP 2014), pp. 4280–4284, October 2014.

[9] E. Vidal, N. Piotto, G. Cordara, and F. Morán Burgos, "Automatic video to point cloud registration", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015), submitted for review.

[10] D. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[11] N. Snavely, S. Seitz, and R. Szeliski, "Photo Tourism: Exploring photo collections in 3D", ACM Transactions on Graphics (Proceedings of SIGGRAPH 2006), 2006.

[12] C. Wu, "Towards linear-time incremental structure from motion", International Conference on 3D Vision (3DV), pp. 127–134, IEEE, 2013.

[13] Y. Furukawa and J. Ponce, "Accurate, dense, and robust multiview stereopsis", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.

[14] P. Alcantarilla, J. Nuevo, and A. Bartoli, "Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1281–1298, 2011.

[15] C. Wu, "SiftGPU: A GPU implementation of scale invariant feature transform (SIFT)", 2007.

[16] G. Bradski, "The OpenCV Library", Dr. Dobb's Journal of Software Tools, 2000.

[17] S. M. Pizer, E. P. Amburn, J. D. Austin, et al., "Adaptive Histogram Equalization and Its Variations", Computer Vision, Graphics, and Image Processing, vol. 39, pp. 355–368, 1987.

[18] Z. Rahman, D. J. Jobson, and G. A. Woodell, "Retinex Processing for Automatic Image Enhancement", Journal of Electronic Imaging, January 2004.

[19] W. Waizenegger, N. Atzpadin, O. Schreer, and I. Feldmann, "Patch-Sweeping with Robust Prior for High Precision Depth Estimation in Real-Time Systems", 18th IEEE International Conference on Image Processing (ICIP 2011), Brussels, Belgium, September 2011.

[20] F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin, "The ball-pivoting algorithm for surface reconstruction", IEEE Transactions on Visualization and Computer Graphics, vol. 5, no. 4, pp. 349–359, October 1999.

[21] Z. C. Marton, R. B. Rusu, and M. Beetz, "On fast surface reconstruction methods for large and noisy point clouds", IEEE International Conference on Robotics and Automation (ICRA 2009), pp. 3218–3223, 2009.

[22] B. Zeisl, P. F. Georgel, F. Schweiger, E. G. Steinbach, and N. Navab, "Estimation of location uncertainty for scale invariant feature points", British Machine Vision Conference (BMVC), pp. 1–12, 2009.

[23] C. Wu, S. Agarwal, B. Curless, and S. M. Seitz, "Multicore bundle adjustment", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3057–3064, 2011.

[24] E. Vidal, N. Piotto, G. Cordara, and F. Morán Burgos, "Online Structure-from-Motion Model Extension with Generic Video Streams", International Conference on 3D Web Technology (Web3D), submitted for review, 2015.

[25] T. Sattler, B. Leibe, and L. Kobbelt, "Fast image-based localization using direct 2D-to-3D matching", IEEE International Conference on Computer Vision (ICCV), pp. 667–674, 2011.

[26] C. Hoppe, M. Klopschitz, M. Rumpler, A. Wendel, S. Kluckner, H. Bischof, and G. Reitmayr, "Online feedback for structure-from-motion image acquisition", British Machine Vision Conference (BMVC), pp. 1–12, 2012.

[27] R. Pagés, D. Berjón, F. Morán, and N. García, "Seamless, Static Multi-Texturing of 3D Meshes", Computer Graphics Forum, vol. 34, no. 1, pp. 228–238, February 2015.
