


A New High Resolution Depth Map Estimation System Using Stereo Vision and Kinect Depth Sensing

Shuai Zhang & Chong Wang & S. C. Chan

Received: 9 April 2013 / Revised: 25 June 2013 / Accepted: 27 June 2013
© Springer Science+Business Media New York 2013

Abstract Depth map estimation is an active and long standing problem in image/video processing and computer vision. Conventional depth estimation algorithms which rely on stereo/multi-view vision or depth sensing devices alone are limited by complicated scenes or imperfections of the depth sensing devices. On the other hand, the depth maps obtained from stereo/multi-view vision and from depth sensing devices are de facto complementary to each other. This motivates us to develop in this paper a new system for high resolution and high quality depth estimation by joint fusion of stereo and Kinect data. We model the observations using a Markov random field (MRF) and formulate the fusion problem as a maximum a posteriori probability (MAP) estimation problem. The reliability and the probability density functions for describing the observations from the two devices are also derived. The MAP problem is solved using a multiscale belief propagation (BP) algorithm. To suppress possible estimation noise, the estimated depth map is further refined by color image guided depth matting and a 2D local polynomial regression (LPR)-based filtering. Experimental results and numerical comparisons show that our system can provide high quality and high resolution depth maps, thanks to the complementary strengths of both stereo vision and Kinect depth sensors.

Keywords Depth estimation system · High resolution · Kinect · Stereo vision

1 Introduction

Depth information is an important ingredient in many advanced computer vision, graphics and video applications such as image-based rendering (IBR) [3, 14], 3D model reconstruction, intelligent human computer interfaces, etc. In general, depth acquisition can be classified into two main categories: passive and active approaches. Passive approaches usually rely on stereo or multiple cameras to estimate the depth information from multiple images, whereas active approaches rely on active illumination to infer the depth information using special devices. Stereo/multi-view matching [7–11, 18–20, 22–25, 30, 31] is the most popular passive approach because of its low cost and effectiveness in specific applications. However, its performance is usually limited by occlusion, complicated scenes and texture-less areas. Therefore, effective and efficient stereo/multi-view depth estimation has been a long standing problem. To deal with texture-less areas, regularization techniques using Markov random fields (MRF) [7] are frequently employed, where the observation is modeled as an MRF and the hidden depth map is estimated by a maximum a posteriori probability (MAP) criterion. This usually leads to the equivalent minimization of an energy function consisting of the likelihood function and some regularization or prior term in the unknown or hidden depth parameters. Graph cuts (GC)-based [19] and belief propagation (BP)-based [20] methods are commonly used to solve the optimization problem because of their good performance. To better utilize the 3D information, techniques for enhancing image structure in the form of occlusion penalization [25], visibility checking [24, 30] and structural information [19, 22, 24, 25, 30] are areas of active research. Another popular direction to handle occlusion is to combine segmentation with GC or BP [19, 22, 24, 25, 30]. There are already commercial products for depth estimation using stereo matching, such as the MobileRanger [Available: http://www.mobilerobots.com/accessories/MobileRanger.aspx] shown in Fig. 1a.

Part of this work was presented in the IEEE Colloquium on Signal Processing and its Applications 2013 [32]. This project is supported in part by a GRF grant from the Hong Kong Research Grant Council and a tier-3 grant from the Hong Kong Innovative Technology Fund (ITF).

S. Zhang · C. Wang · S. C. Chan (*)
Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, Hong Kong
e-mail: [email protected]

J Sign Process Syst, DOI 10.1007/s11265-013-0821-8


Unlike the passive approach, the active approach is based on active illumination of the scene under consideration using various light sources such as visible or invisible laser or infrared (IR) illumination. Subsequent sensing or detection of the reflected light enables the depth information to be determined through time-of-flight (ToF) or phase variation in fringes, depending on the functional principle of the depth sensor. Commonly used depth sensing devices include laser scanners [16, 17], ToF cameras/sensors [15] and the recently launched Microsoft Kinect [1], as shown in Fig. 1b-d. Laser scanners can handle both indoor and outdoor environments but their acquisition speed is rather low. Their main advantage is a long sensing range, which is suitable for outdoor static scenes. ToF cameras, which are usually based on an array of sensors for measuring the ToF of the active illumination, have the advantage of high frame rate and registered depth and intensity data. However, the resolution of most ToF cameras is lower than 320×240 and the captured depth maps are usually quite noisy and sensitive to ambient light [15]. On the other hand, the Kinect overcomes the limited resolution of ToF cameras by means of structured lighting using IR illumination. Because of its low cost and higher resolution, it has attracted much attention in various image and vision applications. However, the Kinect is still limited by its high sensing noise and missing samples due to occlusion, reflection and limited sensing range.

The usefulness of stereo/multi-view methods relies heavily on how phenomena such as occlusion, edges, color correlation and so on are modeled. In certain circumstances, they are able to produce high accuracy depth maps with high resolution and wide distance range. However, in texture-less regions, the performance of stereo matching techniques is somewhat limited. Moreover, reliable depth maps are usually generated offline and different degrees of human intervention are involved depending on the algorithms being used. On the other hand, most depth sensing devices can easily handle texture-less regions, in contrast to stereo matching. However, existing depth sensing devices still suffer from the many limitations mentioned above. One of the main research and practical problems in the computer vision and image processing community currently is the restoration of the noisy depth maps extracted from these depth sensing devices and their reliable integration with the texture information, given their different resolutions and viewpoints and other imperfections. For instance, ToF sensors and the Kinect are usually poorly calibrated, and limited in resolution and accuracy as compared with stereo/multi-view methods. Moreover, their ability to deal with transparent materials and object boundaries is also not so satisfactory.

From the above discussion, it can be seen that the depth maps obtained from stereo vision and depth sensing devices are indeed complementary to each other. This motivates us to develop a new high resolution depth map estimation system and approach, which is able to combine the advantages of stereo/multi-view matching and depth sensing devices in order to obtain depth maps with high resolution and accuracy, and yet using much less computational time.

The proposed system consists of a high-definition (HD) 3D stereo camera and a Kinect depth sensor. To fully utilize the information obtained from these two different devices, we first calibrate the system using a co-planarity based method. Then, we explore the complementary characteristics of the 3D stereo camera and Kinect, and propose a new method for joint fusion of their depth maps. In particular, we develop a fusion framework based on MRF and derive the probability distribution functions that describe the characteristics of these multimodal depth sensing devices. Moreover, we incorporate into the problem a pixel-wise weighting function which reflects the reliabilities of the stereo camera and Kinect depth sensor. By so doing, a more accurate depth map can be obtained. The resultant fusion problem is solved using a multiscale BP algorithm. Due to missing and noisy samples, the computed depth map may still be corrupted by sensor noise. To address this problem, a two-stage depth map enhancing algorithm is proposed to further refine the estimated depth map. It consists of a new color image guided depth matting process to refine the depth map based on the texture and a 2D local polynomial regression (LPR)-based denoising algorithm to ensure the smoothness of the depth map while preserving discontinuities. Simulation results show that the proposed approach is able to obtain satisfactory depth maps which significantly outperform their counterparts obtained by either stereo matching or the Kinect depth sensor alone.

Figure 1 Various depth acquisition devices. a Stereo matching camera set (MobileRanger with nDepth stereo processor embedded), b 3D laser scanner (NextEngine 3D Scanner), c ToF camera (PMDTechnologies CamCube) and d Microsoft Kinect.



The paper is organized as follows: The construction of the proposed depth estimation system and its calibration procedure are summarized in Section 2. Section 3 is devoted to the joint stereo and Kinect fusion algorithm for depth map estimation. Experimental results, evaluation and comparison are presented in Section 4. Finally, conclusions are drawn in Section 5.

2 System Setup and Calibration

We now describe the setup of our high resolution depth map estimation system and summarize the methods for calibrating the devices involved.

2.1 System Setup

The high resolution depth map estimation system constructed is shown in Fig. 2a. It consists of a Microsoft Kinect, a JVC GS-TD1B FHD (Full-HD) 3D Everio camcorder and a Blackmagic-design Intensity Shuttle [Available: http://www.blackmagicdesign.com/]. The Kinect is equipped with an RGB camera and a depth sensor consisting of an IR camera and an IR projector. The main features of the Kinect are summarized as follows: (a) it is able to support a distance range from 0.4 m to 4 m with the official SDK and further from 0.4 m to 8 m with a third party SDK, and (b) it provides a depth map with 640×480 resolution at 30 frames per second (FPS). The JVC GS-TD1B FHD 3D camcorder provides stereo side-by-side FHD videos at 30 FPS, and it is connected to a Blackmagic-design Intensity Shuttle which transfers the stereo data in real-time to a PC via an HDMI cable for further processing. The transformation relationship (relative pose) between the Kinect and the JVC GS-TD1B FHD 3D camcorder is shown in Fig. 2b. K_d and K_c indicate respectively the coordinates of the depth sensor and RGB camera of the Kinect. On the other hand, J_c1 and J_c2 denote the coordinates of the left and right views of the JVC GS-TD1B FHD 3D camcorder. V is the calibration pattern. A 3D point in one camera coordinate system can be transformed to another using a rigid transformation denoted by {R, t}_ij, where R is the rotation and t is the translation between the two different coordinate systems i and j.
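As a small illustration of this notation, the following Python sketch (our own, with purely illustrative calibration values rather than ones from the paper) maps a 3D point from one device frame to another using such a rigid transformation {R, t}_ij:

```python
import numpy as np

def transform_point(p_i, R_ij, t_ij):
    """Map a 3D point p_i from coordinate frame i to frame j: p_j = R_ij @ p_i + t_ij."""
    return R_ij @ np.asarray(p_i, dtype=float) + t_ij

# Example with made-up values: identity rotation, 5 cm shift along x.
R = np.eye(3)
t = np.array([0.05, 0.0, 0.0])
p_kinect = np.array([0.1, 0.2, 1.5])      # a point in the Kinect depth-sensor frame K_d
p_cam = transform_point(p_kinect, R, t)   # the same point in the camcorder frame J_c1
```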

Although the depth maps of the Kinect are less noisy than those of ToF cameras [1], there is still considerable sensor noise and there are missing areas or holes which should be suppressed. Apart from the limited sensing range of the Kinect, these holes and noise mainly come from two different sources: I) occlusions between the IR camera and the IR projector of the depth sensor, and II) material absorption and surface normal direction of objects, as illustrated in Fig. 2c.

Figure 2 a The joint stereo and Kinect system for depth map estimation. b The transformation relationship between the Kinect and the JVC GS-TD1B FHD 3D camera. c Illustration of data missing regions in the Kinect's depth map. The red and yellow rectangles refer to the type I and type II missing data.


Moreover, the low resolution of the depth map (640×480) will restrict its usage in high resolution IBR and other applications. In Section 3, we will propose a framework for data fusion of the Kinect and a stereo camera (i.e. the JVC GS-TD1B FHD 3D camcorder in this work) so as to alleviate the above limitations of the Kinect. This gives rise to a high resolution and more reliable depth map estimation system.

2.2 System Calibration

To combine the two different data sources from the Kinect and the JVC GS-TD1B FHD 3D camcorder, calibration between these devices is required. Moreover, there is a particular need to recalibrate the Kinect device since it delivers depth information in Kinect disparity units (kdu), and its conversion to metric units usually changes from one device to another. Figure 3 shows the color image captured by the left view of the JVC GS-TD1B FHD 3D camcorder, the color image captured by the Kinect RGB camera and its corresponding depth map. Various methods have been proposed for recalibration of Kinects. The method in [Available: http://nicolas.burrus.name/] follows the idea of traditional checkerboard based calibration schemes, which looks for the common feature points between color and depth images, to recalibrate the Kinect. Though this kind of method is simple and feasible, a major drawback is that the feature points chosen by this method may not be reliable because of the missing data at the edges and corners of the objects. In [1] and [4], the IR images of the Kinect are used directly to perform standard calibration between the IR and RGB cameras. The calibration accuracy of this method is higher than that of the one proposed in [Available: http://nicolas.burrus.name/]. However, additional IR illumination is required. In the proposed system, the two methods mentioned above may not yield the optimal system parameters since joint calibration between the Kinect and the JVC GS-TD1B FHD 3D camcorder is needed. Therefore, in order to calibrate the proposed system as a whole, the co-planarity based joint calibration method [5] is employed, which can improve individual calibration as it uses all the available information. In co-planarity based methods [5, 6, 21], the plane of the checkerboard is observed accurately as shown in Fig. 3c, though the pattern of the checkerboard is invisible in the depth map. Therefore, the plane correspondence is used for accurate calibration between the Kinect and the JVC GS-TD1B FHD 3D camcorder.

In the proposed calibration method, we first calibrate the stereo cameras of the JVC GS-TD1B FHD 3D camcorder using a standard checkerboard-based method such as Zhang's method [2]. In order to obtain more accurate calibration results, intrinsic and extrinsic parameters are estimated separately. Then stereo rectification is performed between J_c1 and J_c2 based on the camera parameters estimated by the stereo calibration.

The joint calibration of the JVC GS-TD1B FHD 3D camcorder and the Kinect is carried out using the co-planarity based method [5]. The left camera of the JVC GS-TD1B FHD 3D camcorder (J_c1) is set as the external high resolution RGB camera of the Kinect. The basic idea of this method is to exploit the co-planar property of the calibration board with the help of the JVC GS-TD1B FHD 3D camcorder. More precisely, the calibration procedure can be divided into three steps. First, the feature point based method [Available: http://nicolas.burrus.name/] is employed to obtain an initial guess of the intrinsic and extrinsic parameters of the Kinect depth sensor. Then, based on the co-planar property, the relative poses of the Kinect and the JVC GS-TD1B FHD 3D camcorder are estimated. Furthermore, the system parameters of these two devices can be obtained by solving a non-linear minimization problem. By using the disparity distortion model proposed in [5], the depth map captured by the Kinect is further rectified, which increases the depth estimation accuracy. Next, we will consider how to fuse the information offered by the two devices.

3 Joint Stereo and Kinect for High Resolution Depth Estimation

Previous works on the fusion of stereo vision and depth sensing devices mostly involve ToF sensors and stereo cameras [5, 27]. However, the system setup and calibration between multiple sensors are rather complicated, which may prevent their practical usage. Due to the advantages of the Kinect mentioned earlier, several approaches have been proposed to merge the Kinect depth sensor with images captured from the Kinect built-in RGB camera or external RGB cameras.


Figure 3 a A frame of the checkerboard captured by the JVC GS-TD1B FHD 3D camcorder. b The same frame of the checkerboard captured by the Kinect's RGB camera. c Depth map captured by the Kinect's depth sensor.


In [28], stereo matching between the Kinect's IR image and RGB image was first performed to generate a depth map. This depth map and the inner depth map computed using the Kinect IR projector are then fused to obtain an improved depth map. Because the capturing of the IR and RGB images by the Kinect is not synchronous, the method is mostly suitable for low-activity videos. Furthermore, as the resolutions of these two cameras are rather low, high resolution depth map estimation cannot be achieved by such a fusion mechanism. [29] proposes a stereo matching-based improvement method based on the Kinect. The depth maps obtained from the Kinect are only applied to the stereo matching as disparity searching references. In other words, the depth maps returned from the Kinect are used as references in the disparity search of the stereo matching between the two views. Therefore, this method is more like a Kinect-assisted stereo matching method than a joint fusion of the information offered by the two devices. In order to better utilize the information obtained from the Kinect and a high quality 3D camera, a joint stereo and Kinect reliability fusion algorithm for depth map estimation is proposed in this section. The proposed fusion framework is based on MRF, and the probability distribution functions that describe the characteristics of these multimodal depth sensing devices will be derived. A pixel-wise weighting function which reflects the reliabilities of the stereo camera and Kinect depth sensor will be developed for effective data fusion of the multimodal devices. Finally, a two-step depth map refinement is proposed to further refine the estimated depth map.

As mentioned earlier, the Markov random field (MRF) [7] provides a convenient model for estimating hidden parameters from observations and is widely used in stereo matching. Moreover, the MRF model allows the continuity, coherence and occlusion constraints to be taken into account conveniently [23]. Based on the assumed model, one can derive the posterior probability as a function of the hidden depth variables. By maximizing the posterior probability, one obtains the desired MAP estimator of the depth map. An advantage of the MAP-MRF approach is that it provides a systematic framework to integrate the information from multiple sensors, and the resultant problem can be solved using graph cuts (GC) and belief propagation (BP) methods for approximate inference in the MRF. In what follows, we shall propose a data fusion algorithm based on MRF for incorporating the extra Kinect depth data into the conventional MAP-MRF framework for stereo vision, and solve the resultant problem using an efficient multiscale BP approach.

3.1 Stereo Vision and Kinect Fusion Problem

The stereo vision and Kinect fusion problem for depth estimation can be formulated as the following MAP problem:

$$P(X \mid Y, Z) \propto \underbrace{\prod_{p} f_s(x_p, y_p)\, f_k(x_p, z_p)}_{\text{data terms}} \; \underbrace{\prod_{p' \in N(p)} f_r(x_p, x_{p'})}_{\text{smoothness term}} \qquad (1)$$

where X = {x_p, ∀p} denotes the hidden variables associated with the disparities of all pixels, and Y = {y_p, ∀p} and Z = {z_p, ∀p} are the observed disparities and depths obtained by minimizing a given color-based matching cost and those returned by the Kinect, respectively. N(p) represents the four connected neighbors of pixel p. f_s(x_p, y_p) and f_k(x_p, z_p) are local evidences (data terms) based on the initial pixel-wise matching cost and the measurement from the Kinect depth sensor, respectively. f_r(x_p, x_{p'}) represents the smoothness or regularization term which incurs a discontinuity cost or penalty for assigning different disparities x_p and x_{p'} to two neighboring pixels. By taking the negative logarithm of Eq. (1), it can be seen that the MAP problem is equivalent to the minimization of the following energy function

$$E = \sum_{p} D(x_p, y_p, z_p) + \sum_{p' \in N(p)} V(x_p, x_{p'}), \qquad (2)$$

where D(x_p, y_p, z_p) = −log f_s(x_p, y_p) − log f_k(x_p, z_p) is called the data term and V(x_p, x_{p'}) = −log f_r(x_p, x_{p'}) is called the smoothness term.
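To make Eq. (2) concrete, the sketch below (our own simplified NumPy illustration, not the authors' implementation) evaluates the energy of a candidate disparity map from a precomputed per-pixel data-cost volume and a pairwise penalty function over the four-connected grid:

```python
import numpy as np

def energy(X, data_cost, smoothness):
    """Evaluate E = sum_p D(x_p, y_p, z_p) + sum_{p' in N(p)} V(x_p, x_p')
    for a candidate disparity map X (H x W integer labels).

    data_cost:  (H, W, L) array, data_cost[r, c, d] = D(x_p = d, y_p, z_p)
    smoothness: vectorized function V(d1, d2) returning the pairwise penalty
                (e.g. lambda a, b: np.minimum(np.abs(a - b), mu_r)).
    """
    H, W, _ = data_cost.shape
    rows, cols = np.indices((H, W))
    E = data_cost[rows, cols, X].sum()
    # four-connected neighborhood: count each horizontal/vertical pair once
    E += smoothness(X[:, :-1], X[:, 1:]).sum()
    E += smoothness(X[:-1, :], X[1:, :]).sum()
    return E
```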

In order to produce more accurate fusion results, we propose to use a weighted data term as follows

$$D(x_p, y_p, z_p) = -w_p^s \log f_s(x_p, y_p) - w_p^k \log f_k(x_p, z_p), \qquad (3)$$

where w_p^s and w_p^k are the pixel-wise weighting factors for the stereo and Kinect depth sensor, respectively. They are related to the reliability of each estimated depth pixel resulting from stereo matching and the Kinect depth map, which are denoted by H_p^s and H_p^k, respectively. This reliability fusion concept was first introduced in [12], where it was developed for the fusion of a ToF camera and stereo vision. In this paper we greatly extend the original concept and apply it to the joint fusion of stereo vision and Kinect. In the proposed fusion framework, we compute the weighting factors w_p^s and w_p^k based on the pixel-wise reliability of both stereo and Kinect as follows:

$$w_p^s = \frac{H_p^s}{H_p^s + H_p^k} \quad \text{and} \quad w_p^k = \frac{H_p^k}{H_p^s + H_p^k}. \qquad (4)$$

Here H_p^s is computed in the same fashion as [12]:

$$H_p^s = \begin{cases} 1 - \dfrac{m_p^{1st}}{m_p^{2nd}} & m_p^{2nd} > T \\ 0 & \text{otherwise} \end{cases}, \qquad H_p^s \in [0, 1], \qquad (5)$$

where m_p^{1st} and m_p^{2nd} denote respectively the best and the second best matching costs of each pixel p. Hence, H_p^s quantifies how distinctive or reliable the current estimate is. T is a small positive threshold to avoid division by zero. Instead of using the absolute difference (AD) as the matching cost, m_p^{1st} and m_p^{2nd} are computed by Birchfield and Tomasi's pixel dissimilarity [26], which has the advantage of reducing the sampling problem with little additional computational complexity compared to AD.

On the other hand, the reliability of the Kinect depth sensor H_p^k at each data point is derived from the standard deviations of the random error (ϑ) and depth resolution (ζ) of plane fitting residuals at different plane-to-sensor distances. Figure 4 shows the calculated ϑ and ζ plotted against the distance from the plane to the sensor. It can be seen that the error increases quadratically from 0.5 m distance to the maximum range of the sensor. Therefore, H_p^k can be modeled in terms of ϑ and ζ as:

$$H_p^k = \begin{cases} \dfrac{1}{\vartheta\,\zeta} & \text{if } z_p \neq 0 \\ 0 & \text{otherwise} \end{cases}, \qquad H_p^k \in [0, 1]. \qquad (6)$$
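The reliability-weighted fusion of Eqs. (3)-(6) is straightforward to compute per pixel. The following NumPy sketch is our own illustration under the definitions above, not the authors' code; the functions theta_of_z and zeta_of_z, standing for the fitted error curves of Fig. 4, are assumed inputs, and the small eps guard is our own addition:

```python
import numpy as np

def stereo_reliability(m1st, m2nd, T=1e-3):
    """Eq. (5): H^s = 1 - m1st/m2nd where m2nd > T, else 0."""
    Hs = np.zeros_like(m1st, dtype=float)
    valid = m2nd > T
    Hs[valid] = 1.0 - m1st[valid] / m2nd[valid]
    return np.clip(Hs, 0.0, 1.0)

def kinect_reliability(z, theta_of_z, zeta_of_z):
    """Eq. (6): H^k = 1/(theta * zeta) for valid depths (z != 0), else 0.
    theta_of_z, zeta_of_z: callables returning the random error and depth
    resolution at depth z (the curves plotted in Fig. 4)."""
    Hk = np.zeros_like(z, dtype=float)
    valid = z != 0
    Hk[valid] = 1.0 / (theta_of_z(z[valid]) * zeta_of_z(z[valid]))
    return np.clip(Hk, 0.0, 1.0)

def fusion_weights(Hs, Hk, eps=1e-6):
    """Eq. (4): normalize the two reliabilities into pixel-wise weights.
    eps (our own safeguard) avoids division by zero where both reliabilities vanish."""
    total = Hs + Hk + eps
    return Hs / total, Hk / total
```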

3.2 MRF Formulation and Multiscale BP

From Eq. (1), our MRF model includes three different terms, which are the data term from stereo matching, the data term from the Kinect depth sensor and the smoothness term. For modeling these terms, we shall adopt the truncated linear model so that the resultant problem can be solved using the efficient multiscale BP framework [11]. The detailed formulations of these terms are given below:

1) Data term from stereo matching f_s(x_p, y_p): as defined in Eq. (1), y_p is the observation corresponding to the color-based matching cost m_p using Birchfield and Tomasi's pixel dissimilarity. To measure the discrepancy between the disparity variable x_p and the observation y_p, we employ the following truncated linear model as the data term for stereo matching,

$$f_s(x_p, y_p) = \lambda_s \min\big(|x_p - y_p|, \mu_s\big). \qquad (7)$$

It can be seen from Eq. (7) that the cost increases linearly with the absolute distance between x_p and y_p up to some level, where it is kept constant. μ_s is the corresponding upper bound and λ_s is a scaling factor used to control the relative importance between the various terms.

2) Data term from the Kinect depth sensor f_k(x_p, z_p): it encodes the depth consistency between the stereo and the Kinect depth sensor. After joint calibration and rectification, the angles and distances between the Kinect and the JVC GS-TD1B FHD 3D camcorder are adjusted. Therefore, the outputs are a depth map and stereo images that are row-aligned and undistorted. However, the output depth map from the Kinect depth sensor is the distance map from the sensor to the object surface (plane-to-sensor distance). In order to fuse it with the data from stereo matching, the depth value of the Kinect (i.e. the plane-to-sensor distance) should be converted to a stereo disparity value as follows:

$$z_p = \frac{f\,t}{\tilde{z}_p - (c_{\mathrm{left}} - c_k)}, \qquad (8)$$

where $\tilde{z}_p$ is the disparity calculated by the Kinect depth sensor and z_p is the depth value of pixel p as defined in Eq. (1). f and t are the focal length and baseline between the Kinect and the left camera of the JVC GS-TD1B FHD 3D camcorder. c_left and c_k are the principal points of the left camera of the JVC GS-TD1B FHD 3D camcorder and the Kinect depth sensor, respectively. To measure the discrepancy between the disparity values obtained from the Kinect depth sensor and stereo matching, the cost function f_k(x_p, z_p) of the Kinect data term is defined as a truncated linear function of the difference between these two disparities as follows:

$$f_k(x_p, z_p) = \lambda_k \min\big(|x_p - \tilde{z}_p|, \mu_k\big), \qquad (9)$$

where μ_k is the corresponding upper bound and λ_k is a scaling factor.

3) Smoothness term f_r(x_p, x_{p'}): it is designed to ensure that the depth map is smooth except at discontinuities. Therefore, a similar truncated linear function of the difference between x_p and its neighbors x_{p'} (p' ∈ N(p)) can be adopted. This motivates the use of the following function for the smoothness term between adjacent depth values:

$$f_r(x_p, x_{p'}) = \min\big(|x_p - x_{p'}|, \mu_r\big), \qquad (10)$$

where μ_r is the upper bound. Details of all of the parameter settings are discussed later in Section 4.1 and summarized in Table 1.
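For illustration, the sketch below (our own NumPy rendering under the stated definitions, treating f_s and f_k directly as costs and using the Table 1 values as defaults; it is not the authors' implementation) builds the weighted data-cost volume of Eqs. (3), (7) and (9) over all candidate disparity labels, together with the pairwise penalty of Eq. (10):

```python
import numpy as np

def truncated_linear(diff, lam, mu):
    """lam * min(|diff|, mu) -- the truncated linear model of Eqs. (7) and (9)."""
    return lam * np.minimum(np.abs(diff), mu)

def weighted_data_cost(y, z_disp, ws, wk, n_labels, lam_s=0.25, lam_k=0.25,
                       mu_s=10.0, mu_k=None):
    """Per-pixel data-cost volume D[r, c, d] combining stereo and Kinect observations.

    y:       (H, W) disparities minimizing the colour-based matching cost
    z_disp:  (H, W) Kinect depths converted to disparities via Eq. (8)
             (missing samples get weight wk = 0 through Eq. (6))
    ws, wk:  (H, W) reliability weights from Eq. (4)
    """
    if mu_k is None:
        mu_k = n_labels                       # mu_k = n_d in Table 1
    d = np.arange(n_labels)[None, None, :]    # candidate disparity labels
    cost_s = truncated_linear(d - y[..., None], lam_s, mu_s)
    cost_k = truncated_linear(d - z_disp[..., None], lam_k, mu_k)
    return ws[..., None] * cost_s + wk[..., None] * cost_k

def smoothness_cost(d1, d2, mu_r):
    """Eq. (10): pairwise penalty min(|d1 - d2|, mu_r)."""
    return np.minimum(np.abs(d1 - d2), mu_r)
```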

Figure 4 Standard deviations of plane fitting residuals at different distances of the plane to the sensor: theoretical random error ϑ (red) and depth resolution ζ (blue) [4].


After f_s, f_k and f_r are defined, we employ a multiscale BP algorithm [11] to solve the MAP-MRF problem, since conventional BP is too slow to be practical. The major steps of the multiscale BP algorithm are briefly summarized as follows: 1) initialize the messages at the coarsest level to all zeros, 2) apply BP at the coarsest scale to iteratively refine the messages, and 3) use the refined messages from the coarser level to initialize the messages for the next scale. Figure 5 illustrates an example with two levels of message passing. The l-th level corresponds to a problem where blocks of 2^l×2^l pixels are grouped together, and the resulting blocks are connected in a grid structure. A key property of this construction is that long range interactions can be captured by short paths in the coarse levels, as the paths are connected through blocks instead of pixels. Compared to conventional BP algorithms, this method uses a hierarchical (coarse-to-fine) technique to obtain a good approximation of the optimal solution with a small fixed number of message passing iterations and levels. In addition, by using Eq. (10), the complexity of the inference can be reduced to linear rather than quadratic in the number of possible labels for each pixel. Interested readers are referred to [11] for details.
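The following sketch outlines this coarse-to-fine scheme (a simplified NumPy illustration of the idea in [11], written by us: it uses a brute-force O(L²) message update rather than the linear-time distance-transform trick mentioned above, and takes the ϕ=7 iterations and ψ=5 levels of Table 1 as defaults):

```python
import numpy as np

def multiscale_bp(D, mu_r=2.0, levels=5, iters=7):
    """Coarse-to-fine min-sum BP on a 4-connected grid (sketch of the scheme in [11]).
    D: (H, W, L) data-cost volume. Returns an (H, W) map of label indices."""
    L = D.shape[2]
    labels = np.arange(L)
    V = np.minimum(np.abs(labels[:, None] - labels[None, :]), mu_r)   # Eq. (10)

    # data-cost pyramid: level l sums the costs of 2x2 blocks of level l-1
    pyramid = [D]
    for _ in range(1, levels):
        prev = pyramid[-1]
        H2, W2 = prev.shape[0] // 2, prev.shape[1] // 2
        pyramid.append(prev[:2 * H2, :2 * W2].reshape(H2, 2, W2, 2, L).sum(axis=(1, 3)))

    msgs = None
    for D_l in reversed(pyramid):                      # coarsest level first
        H, W, _ = D_l.shape
        if msgs is None:                               # 1) zero messages at the top level
            msgs = {k: np.zeros((H, W, L)) for k in ('left', 'right', 'up', 'down')}
        else:                                          # 3) initialise from the coarser level
            msgs = {k: _upsample(m, H, W) for k, m in msgs.items()}
        for _ in range(iters):                         # 2) iterative message refinement
            msgs = _bp_sweep(D_l, msgs, V)
    belief = D_l + sum(msgs.values())                  # finest-level beliefs
    return belief.argmin(axis=2)

def _upsample(m, H, W):
    """Replicate coarse-level messages over their 2x2 child blocks (edge-pad odd sizes)."""
    m = np.repeat(np.repeat(m, 2, axis=0), 2, axis=1)
    return np.pad(m, ((0, H - m.shape[0]), (0, W - m.shape[1]), (0, 0)), mode='edge')

def _bp_sweep(D, msgs, V):
    """One synchronous min-sum update. msgs['left'][r, c] is the message pixel (r, c)
    receives from its left neighbour, and similarly for the other directions."""
    def send(h, region):
        # m(d_q) = min_{d_p} (h(d_p) + V(d_p, d_q)), brute force over label pairs
        return (h[region][..., :, None] + V).min(axis=-2)
    new = {k: np.zeros_like(D) for k in msgs}
    h = D + msgs['left'] + msgs['up'] + msgs['down']    # exclude message coming from the right
    new['left'][:, 1:] = send(h, np.s_[:, :-1])         # received by the right-hand neighbours
    h = D + msgs['right'] + msgs['up'] + msgs['down']
    new['right'][:, :-1] = send(h, np.s_[:, 1:])
    h = D + msgs['left'] + msgs['right'] + msgs['up']
    new['up'][1:] = send(h, np.s_[:-1, :])
    h = D + msgs['left'] + msgs['right'] + msgs['down']
    new['down'][:-1] = send(h, np.s_[1:, :])
    return {k: v - v.min(axis=2, keepdims=True) for k, v in new.items()}   # normalise
```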

3.3 Depth Map Refinement

Generally, the colors of a given object in a neighborhood are highly correlated. This is also true for their depth values because they usually arise from the same neighborhood of a physical object. Consequently, the value of a given depth pixel is closely related to the corresponding color pixels in the neighborhood. To this end, a two-step approach is considered below to further refine the estimated depth map.

1) Color image guided depth matting process. The basic idea is to refine the depth images with the help of the corresponding color images. To this end, we shall extend the conventional matting technique for color images to joint color and depth images. More precisely, a color guided depth matting process is proposed below to further refine the quality of depth edges under the Bayesian matting framework [13].

Given the observed color image C and depth image D, the joint color and depth matting problem is to find the proper matting parameters: foreground F = [F_C | F_D], background B = [B_C | B_D] and opacity α, where F_C and F_D (B_C and B_D) are respectively the foreground (background) in the color image and depth map. Under the framework of Bayesian matting for color images, we have

$$\begin{aligned} \arg\max_{F,B,\alpha} P(F, B, \alpha \mid C, D) &= \arg\max_{F,B,\alpha} P(C, D \mid F, B, \alpha)\,P(F)\,P(B)\,P(\alpha)\,/\,\big(P(C)P(D)\big) \\ &= \arg\max_{F,B,\alpha} L(C, D \mid F, B, \alpha) + L(F) + L(B) + L(\alpha) \end{aligned} \qquad (11)$$

where L(·) = log P(·) is the log likelihood function. Note that the log likelihood terms are modeled similarly to the Bayesian matting approach in [13]. The terms P(C) and P(D) are dropped because they are constant with respect to the optimization parameters. If the log likelihood terms are modeled as Gaussian distributed, then we have

$$L(C, D \mid F, B, \alpha) = -\frac{1}{\sigma_{CD}^{2}}\,\big([C|D] - \alpha F - (1-\alpha)B\big)^{2}, \qquad (12)$$

$$L(F) = -\big(F - \bar{F}\big)^{T}\,\Psi_F^{-1}\,\big(F - \bar{F}\big)/2, \qquad (13)$$

$$L(B) = -\big(B - \bar{B}\big)^{T}\,\Psi_B^{-1}\,\big(B - \bar{B}\big)/2, \qquad (14)$$

where $\bar{F}$ and $\bar{B}$ are the weighted means of the foreground and background, and Ψ_F and Ψ_B are the weighted covariance matrices of the foreground and background, respectively.

The maximization problem in Eq. (11) can be divided into two sub-problems, which can be solved iteratively for F, B and α, similar to the maximization problem described in Bayesian matting [13]. The foreground F and background B can be obtained by solving the following linear equation

$$\begin{bmatrix} \Sigma_F^{-1} + I\alpha^{2}/\sigma_{CD}^{2} & I\alpha(1-\alpha)/\sigma_{CD}^{2} \\ I\alpha(1-\alpha)/\sigma_{CD}^{2} & \Sigma_B^{-1} + I(1-\alpha)^{2}/\sigma_{CD}^{2} \end{bmatrix} \begin{bmatrix} F \\ B \end{bmatrix} = \begin{bmatrix} \Sigma_F^{-1}\bar{F} + [C,D]\,\alpha/\sigma_{CD}^{2} \\ \Sigma_B^{-1}\bar{B} + [C,D]\,(1-\alpha)/\sigma_{CD}^{2} \end{bmatrix}. \qquad (15)$$

Since the observed depth map is not matted, only the color image is available in the process of updating α:

Table 1 Parameter settings used in our algorithm.

parameter   ϕ   ψ   w_p^s      w_p^k      λ_s    λ_k    μ_s   μ_k   μ_r
value       7   5   adaptive   adaptive   0.25   0.25   10    n_d   n_d/8

Figure 5 Illustration of the two-level multiscale BP method. a shows the message passing in level l−1. b shows the message passing in level l. Each node in (b) corresponds to a 2×2 block of nodes in (a) [11].


$$\alpha = \frac{(C - B_C)\cdot(F_C - B_C)}{\|F_C - B_C\|^{2}}. \qquad (16)$$

With the matting parameters {F_D, B_D, α}, the edges in the depth map can be matted. All the edges are smoothed, which provides a visual quality similar to the result of anti-aliasing. Figure 6 illustrates the proposed idea of the color image guided depth matting. Figure 6a shows the reference color image. Figure 6b and c are the depth map results before and after matting. Figure 6d and e are the enlargements of Fig. 6b and c, respectively. It can be seen from Fig. 6c and e that the depth map is refined once the matting parameters {F_D, B_D, α} are obtained. The edges in the depth map are now matted to reflect our confidence in the actual depth values.
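As an illustration, the α update of Eq. (16) can be written per pixel as in the short NumPy sketch below (our own; the eps guard and the clipping to [0, 1] are small safeguards we add, not steps stated in the paper):

```python
import numpy as np

def update_alpha(C, F_C, B_C, eps=1e-6):
    """Eq. (16): per-pixel opacity estimated from the colour image only.
    C, F_C, B_C: (H, W, 3) observed colour, foreground and background estimates."""
    num = ((C - B_C) * (F_C - B_C)).sum(axis=2)
    den = ((F_C - B_C) ** 2).sum(axis=2) + eps   # eps avoids division by zero when F_C == B_C
    return np.clip(num / den, 0.0, 1.0)
```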

2) 2D LPR-based smoothing: to further reduce possible image noise arising from low texture, occlusion, etc., the depth maps should be further smoothed. Here, we adopt 2D LPR smoothing with adaptive bandwidth selection [9, 33] for refining the depth map after matting. It is particularly useful in preserving the discontinuities at object boundaries while performing smoothing in flat areas.

More precisely, we treat the depth map as a 2D function D_i(p_1, p_2) of the coordinate p = [p_1, p_2]^T with p_1 = 1, 2, ..., ℵ_1 and p_2 = 1, 2, ..., ℵ_2, where ℵ_1×ℵ_2 is the resolution of the depth image. Following the homoscedastic data model, the depth observation is given by

$$D_i = g(P_i) + \sigma(P_i)\,\varepsilon_i, \qquad (17)$$

where (D_i, P_i) is a set of observations with i = 1, ..., n, and each was taken at location P_i = [P_{i,1}, P_{i,2}]^T. g(P_i) is a smooth function specifying the conditional mean of the observed depth D_i given P_i. ε_i is an independent identically distributed (i.i.d.) additive white Gaussian noise and σ²(P_i) is its conditional variance. The problem is to estimate g(P_i) from the noisy samples D_i. Since g(P_i) is a smooth function, we can approximate it locally as a general degree-l polynomial at a given point p = [p_1, p_2]^T:

$$g(P : p) = \sum_{\kappa=0}^{l}\;\sum_{k_1+k_2=\kappa} \beta_{k_1,k_2} \prod_{j=1}^{2} (P_j - p_j)^{k_j}, \qquad (18)$$

where β = {β_{k_1,k_2} : k_1 + k_2 = κ and κ = 0, ..., l} is the vector of coefficients. The polynomial coefficients at a location p can be determined by minimizing the following weighted LS problem:

$$\min_{\beta} \sum_{i=1}^{n} K_H(P_i - p)\,\big[D_i - g(P_i : p)\big]^{2}, \qquad (19)$$

where K_H(·) is a suitably chosen 2D kernel. When p is evaluated at a series of 2D grid points, we obtain a smoothed depth map from the noisy depth estimates D_i. Equation (19) can be solved using the LS method and the solution is:

$$\hat{\beta}_{LS}(p, h) = \big(\Xi^{T}\Omega\,\Xi\big)^{-1}\,\Xi^{T}\Omega\,\Gamma, \qquad (20)$$

where Ω = diag{K_H(P_1 − p), ..., K_H(P_n − p)} is the weighting matrix, Γ = [D_1, D_2, ..., D_n]^T,

$$\Xi = \begin{bmatrix} 1 & (P_1-p)^{T} & \mathrm{vech}\{(P_1-p)(P_1-p)^{T}\} & \cdots \\ 1 & (P_2-p)^{T} & \mathrm{vech}\{(P_2-p)(P_2-p)^{T}\} & \cdots \\ \vdots & \vdots & \vdots & \vdots \\ 1 & (P_n-p)^{T} & \mathrm{vech}\{(P_n-p)(P_n-p)^{T}\} & \cdots \end{bmatrix},$$

and vech(·) is the half-vectorization operation. The following Gaussian kernel is employed in this work:

$$K_H(u) = \frac{1}{h^{2}\,2\pi\,|\det C^{-1}|}\exp\Big(-\frac{1}{2}u^{T}C\,u\Big), \qquad (21)$$

where the positive definite matrix C and the scalar bandwidth h determine respectively the orientation and scale of the smoothing. Since the Gaussian kernel is not of compact support, it should be truncated to a sufficient size ℵ_K×ℵ_K to reduce the arithmetic complexity. Usually C is determined from the principal component analysis (PCA) of the local data covariance matrix at p.

Figure 6 The results of depth matting: a reference image frame of the color camera; b and c depth maps before and after matting; d and e enlargements of the marked blocks in (b) and (c).


When h is small, noise in the depth map may not be removed effectively. On the contrary, a large-scale kernel better suppresses additive noise at the expense of possibly blurring the depth maps. Here, we adopt the iterative steering kernel regression (ISKR) method in [33], which was shown to have better performance than the conventional symmetric kernel, especially along edges of the depth map. In the ISKR method, the local scaling parameter is obtained as h_i = h_0 γ_i, where h_0 and γ_i are respectively the global smoothing parameter and the local scaling parameter. The scale selection process is fully automatic and can be performed using the data-driven adaptive scale selection method with the refined intersection of confidence intervals (R-ICI) rule [33].
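To make the smoothing step concrete, the sketch below (our own simplified NumPy illustration) performs the weighted least-squares fit of Eqs. (18)-(20) with a Gaussian kernel at a single pixel. It uses a second-order polynomial, a fixed isotropic kernel matrix and a fixed bandwidth and window size, whereas the method described above selects C from the local PCA and the bandwidth by the ISKR/R-ICI rule:

```python
import numpy as np

def lpr_smooth_pixel(D, r, c, h=2.0, C=None, win=7):
    """Second-order local polynomial regression estimate of the depth at pixel (r, c).
    D: (H, W) depth map after matting; h: bandwidth; C: 2x2 kernel matrix; win: window size."""
    if C is None:
        C = np.eye(2)                              # isotropic kernel as a stand-in for local PCA
    half = win // 2
    r0, r1 = max(r - half, 0), min(r + half + 1, D.shape[0])
    c0, c1 = max(c - half, 0), min(c + half + 1, D.shape[1])
    rows, cols = np.mgrid[r0:r1, c0:c1]
    u = np.stack([(rows - r).ravel(), (cols - c).ravel()], axis=1) / h   # scaled offsets P_i - p
    # Gaussian kernel weights, Eq. (21) up to a constant normalisation factor
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', u, C, u))
    dx, dy = u[:, 0], u[:, 1]
    # degree-2 design matrix: [1, dx, dy, vech{(P_i - p)(P_i - p)^T}]
    Xi = np.stack([np.ones_like(dx), dx, dy, dx * dx, dx * dy, dy * dy], axis=1)
    Gamma = D[r0:r1, c0:c1].ravel()
    # weighted least squares, Eq. (20): beta = (Xi^T W Xi)^{-1} Xi^T W Gamma
    W = np.diag(w)
    beta = np.linalg.solve(Xi.T @ W @ Xi, Xi.T @ W @ Gamma)
    return beta[0]                                 # the constant term is the smoothed depth at p
```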

4 Experimental results

We now present and evaluate the experimental results of the proposed system and algorithm. More precisely, the visual quality of the depth estimation from stereo matching algorithms [10, 11], the Kinect depth sensor and our joint stereo and Kinect fusion are compared using indoor complex scenes. Moreover, a quantitative comparison with existing stereo matching algorithms is performed based on the Middlebury stereo evaluation using the Tsukuba and Venus testing datasets. The rest of this section is organized as follows: Section 4.1 describes the implementation details of our depth estimation system and the parameter settings used in our algorithm. Depth map estimation results and visual comparisons are illustrated in Section 4.2. A numerical comparison based on the Middlebury stereo evaluation is performed in Section 4.3.

4.1 Implementation Details

The construction of our depth estimation system is shown in Fig. 2a. The Kinect provides depth maps with a resolution of 640×480 at 30 FPS. The left view of the JVC GS-TD1B FHD 3D camcorder is set as the reference view and the resolution of each view is 720p (HD) at 30 FPS. The system is calibrated based on the method proposed in Section 2. The OpenNI SDK (×64) version 2.1 [Available: http://www.openni.org/openni-sdk/] is used to drive the Kinect in our experiments. 7 message passing iterations (ϕ) per level and 5 levels (ψ) are used in the multiscale BP algorithm. w_p^s and w_p^k are chosen according to Eqs. (4)-(6).

Figure 7 Depth estimation results: (a) reference image; (b) raw depth from Kinect (type I and II invalid regions are highlighted by red and green); (c) H^s map; (d) H^k map; (e) raw matching result from multiscale BP stereo [11] only; (f) matching result from the non-Local Filter (with σ=0.20 in [10]); (g) joint stereo and Kinect fusion result and (h) joint stereo and Kinect fusion result with depth map refinement.


The scaling factors for the stereo matching term f_s(x_p, y_p) and the Kinect term f_k(x_p, z_p), (λ_s, λ_k), are chosen as 0.25. The upper bounds of the truncated linear models (Eqs. (7), (9) and (10)), (μ_s, μ_k, μ_r), are chosen as 10, n_d and n_d/8, respectively, where n_d is the maximum disparity level. The parameter settings are summarized in Table 1.

The proposed depth estimation algorithm is implemented on a PC which is equipped with an Intel i7 920 CPU, 4 GB RAM and GTX295 GPU cards. Furthermore, the algorithm is accelerated on the GPU with the help of CUDA [Available: http://www.nvidia.com/object/cuda_home_new.html] and the OpenCV GPU module [Available: http://opencv.willowgarage.com/wiki/OpenCV_GPU]. The processing time is approximately 0.5 second per image frame.

4.2 Depth map estimation results

We first test our algorithm on a static indoor complex scene. The image from the reference view is shown in Fig. 7a.

Figure 8 Depth estimation results: (a) reference image; (b) raw depth from Kinect (type I and II invalid regions are highlighted by red and green); (c) H^s map; (d) H^k map; (e) raw matching result from multiscale BP stereo [11] only; (f) matching result from the non-Local Filter (with σ=0.20 in [10]); (g) joint stereo and Kinect fusion result and (h) joint stereo and Kinect fusion result with depth map refinement.

Figure 9 Depth estimation results (Tsukuba): (a) reference image; (b) simulated depth from Kinect with patched invalid regions; (c) simulated depth from Kinect with random noise σ and down-sampled by depth resolution error ζ; (d) H^s map; (e) H^k map; (f) raw matching result from multiscale BP stereo [11] only; (g) joint stereo and Kinect fusion result and (h) ground truth.


The scene contains texture-less regions, partially transparent objects and type I/II factors which cause holes in the depth map captured by the Kinect, as shown in Fig. 7b. Since the resolution of the reference image is much higher than that of the Kinect depth sensor, the depth data cannot cover every pixel of the reference image. We can see from Fig. 7c and d that the proposed reliability maps can effectively capture the strengths of the two devices and demonstrate their complementary nature. H^s is computed based on the distinctiveness between the first and second matching costs of each pixel. Highly textured regions will lead to a higher H^s in the fusion, and low texture regions such as the wall and texture-less object surfaces will tend to depend on the Kinect results. H^k is obtained based on Fig. 4, which reflects the reliability of the depth information according to the distance from the object to the Kinect depth sensor.

Figure 7e shows the depth map obtained from the multiscale BP stereo [11] only. Unlike the Kinect result in Fig. 7b, the depth of the green bottle is successfully estimated by this stereo matching method. However, it is erroneous in texture-less regions and there are large ambiguities in assigning depth values to pixels around object boundaries. Figure 7f shows the matching result obtained from a newly proposed method which is based on non-local cost aggregation (non-Local Filter) and non-local disparity refinement [10]. We can see that the depth values of the green bottle are recovered approximately and the depth discontinuities are well preserved. Obviously, the result of [10] outperforms that of [11], but both methods fail to reconstruct thin structures such as the guitar bar and tripod in the scene. In addition, we found that both stereo methods cannot successfully reconstruct the back of the chair. Figure 7g shows our fusion result, which is of high quality and resolution as compared to those of the multiscale BP stereo only, the non-Local Filter approach and the Kinect. The texture-less regions are well handled and the holes of Fig. 7b are also filled with reasonable values. However, there is still some noise in the depth map. Figure 7h shows the resultant depth map after the depth map refinement. Compared to Fig. 7g, the object boundaries in Fig. 7h are better preserved and the noise is efficiently suppressed.

We further test our algorithm on a dynamic indoor complex scene. It can be seen from Fig. 8a that a person is playing guitar in this scene. Figure 8b shows the Kinect's depth map with invalid areas highlighted. Figure 8c and d illustrate the reliability maps of this scene. Depth maps obtained from the multiscale BP only and the non-Local Filter are shown in Fig. 8e and f.

Figure 10 Depth estimation results (Venus): (a) reference image; (b) simulated depth from Kinect with patched invalid regions; (c) simulated depth from Kinect with random noise σ and down-sampled by depth resolution error ζ; (d) H^s map; (e) H^k map; (f) raw matching result from multiscale BP stereo [11] only; (g) joint stereo and Kinect fusion result and (h) ground truth.

Table 2 Comparison of the error rates (%) using a standard threshold of 1 pixel on the Tsukuba dataset. Note, the results of Tables 2 and 3 are used to quantify the accuracy of the proposed algorithm, which makes use of a noisy depth map derived from a simulated Kinect, while the other stereo algorithms use only the stereo data. Therefore, care should be taken in assessing these results.

Algorithm                         Nonocc   All    Disc
Joint Stereo and Kinect fusion    0.22     1.57   2.19
Stereo only [11]                  1.46     3.36   7.75
Kinect only                       9.13     10.5   9.41
CoopRegion [31]                   0.87     1.16   4.61

Table 3 Comparison of the error rates (%) using a standard threshold of 1 pixel on the Venus dataset.

Algorithm                         Nonocc   All    Disc
Joint Stereo and Kinect fusion    0.05     0.15   0.94
Stereo only [11]                  0.29     1.27   3.96
Kinect only                       9.57     10.1   4.61
CoopRegion [31]                   0.11     0.21   1.54


The depth map estimated by our joint stereo and Kinect reliability fusion and its two-step refinement result are shown in Fig. 8g and h, respectively. It is noticed that the Kinect cannot reconstruct the back of the chair in this case since the IR patterns of the Kinect are highly absorbed by the material. However, we can see from Fig. 8g and h that the back of the chair is successfully reconstructed by our algorithm with the help of stereo. In addition, the depth values of the human subject are also successfully estimated by our algorithm.

4.3 Numerical comparison

In this section, we evaluate quantitatively the accuracy of our algorithm against other state-of-the-art stereo algorithms based on the Middlebury stereo evaluation criteria. In order to obtain a ground truth for comparison, we simulated the depth maps obtained from the Kinect depth camera and evaluated all the algorithms on two standard datasets, Tsukuba and Venus. The left views of the Tsukuba and Venus pairs are given in Figs. 9a and 10a. In order to mimic the depth map obtained from the Kinect, the depth maps of Tsukuba and Venus are patched with missing regions. The missing regions are simulated based on the occlusion between the IR camera and IR projector of the Kinect and the JVC GS-TD1B FHD 3D camcorder, as shown in Figs. 9b and 10b. In addition, the theoretical random error σ and depth resolution ζ of the Kinect are used to generate the corresponding sensor noise. The noise- and patch-corrupted ground truth is then resampled to obtain the simulated depth map, as shown in Figs. 9c and 10c. Figures 9d-e and 10d-e show the reliability maps of stereo and Kinect, respectively. Depth maps obtained from the multiscale BP stereo [11] only are shown in Figs. 9f and 10f. The final results on Tsukuba and Venus are shown in Figs. 9g and 10g. For fair comparison, the proposed two-step depth map refinement is not applied here. The ground truths of Tsukuba and Venus are shown in Figs. 9h and 10h. Tables 2 and 3 show the results of the evaluation of the Tsukuba and Venus datasets on the Middlebury stereo pages. A standard threshold of 1 pixel has been used in Tables 2 and 3. The result on each dataset is computed by measuring the percentage of pixels with an incorrect disparity estimate. The measure is computed for three subsets of the image: "nonocc" stands for the subset of non-occluded pixels; "all" denotes the subset of pixels that are either non-occluded or half-occluded; and "disc" represents the subset of pixels near the occluded areas. It can be seen that the proposed joint stereo vision and Kinect algorithm complements the strengths of stereo vision and outperforms either the stereo vision algorithm or the Kinect alone in the simulated results. The performance differences between our algorithm and top algorithms on the Middlebury stereo evaluation page, such as CoopRegion [31], are very small. This shows that the proposed fusion approach is a promising and accurate approach for high resolution depth estimation.

5 Conclusion

In this paper, a new high quality and high resolution depth estimation system using joint stereo vision and Kinect has been presented. The system construction and the methods for calibrating the various devices are discussed. The sensor observations are modeled using an MRF and the fusion problem is formulated as a MAP estimation problem. The reliability and the probability density functions for describing the observations from the two devices are also derived, and the resultant MAP problem is solved using a multiscale BP algorithm. A two-step depth map enhancing method is proposed to further suppress possible estimation noise by a new color image guided depth matting method followed by 2D LPR-based filtering. Experimental results and numerical comparisons show that our system can provide high quality and high resolution depth maps in reasonable time. It also outperforms either conventional stereo vision or the Kinect alone, thanks to the complementary nature of these two sensors.

References

1. Smisek, J., Jancosek, M., Pajdla, T. (2011). "3D with Kinect." In Proc. IEEE Workshop Consum. Depth Cameras Comput. Vision (pp. 1154–1160).

2. Zhang, Z. Y. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330–1334.

3. Shum, H. Y., Chan, S. C., & Kang, S. B. (2007). Image-based rendering. NY: Springer.

4. Khoshelham, K., & Oude Elberink, S. (2012). Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors, 12(2), 1437–1454.

5. Herrera, D., Kannala, C. J., & Heikkila, J. (2012). Joint depth and color camera calibration with distortion correction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10), 2058–2064.

6. Zhang, C., & Zhang, Z. (2011). "Calibration between depth and color sensors for commodity depth cameras." In Proc. IEEE Intl. Workshop Hot Topics in 3D, in conjunction with ICME.

7. Zhang, L., & Seitz, S. (2005). Parameter estimation for MRF stereo. Proc. IEEE Comput. Soc. Conf. CVPR, 2, 288–295.

8. Yoon, K. J., & Kweon, I. S. (2005). "Locally adaptive support-weight approach for visual correspondence search." In Proc. IEEE Intl. Conf. Computer Vision and Pattern Recognition (pp. 924–931).

9. Zhu, Z. Y., Zhang, S., Chan, S. C., & Shum, H. Y. (2012). Object-based rendering and 3D reconstruction using a moveable image-based system. IEEE Trans. Circuits Syst. Video Technol., 22(10), 1405–1419.

10. Yang, Q. X. (2012). "A non-local cost aggregation method for stereo matching." In Proc. IEEE Comput. Soc. Conf. CVPR (pp. 1402–1409).

11. Felzenszwalb, P. F., & Huttenlocher, D. P. (2006). Efficient belief propagation for early vision. Intl. Journal Comput. Vision, 70(1), 41–54.

12. Zhu, J. J., Wang, L., Yang, R. G., Davis, J. E., & Pan, Z. G. (2011). Reliability fusion of time-of-flight depth and stereo geometry for high quality depth maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7), 1400–1414.


13. Chuang, Y., Curless, B., Salesin, D. H., Szeliski, R. (2001). "A Bayesian approach to digital matting." In Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2 (pp. 264–271).

14. Chan, S. C., Shum, H. Y., & Ng, K. T. (2007). Image-based rendering and synthesis: Technological advances and challenges. IEEE Signal Processing Magazine, 24(6), 22–33.

15. Foix, S., Alenyà, G., & Torras, C. (2011). Lock-in time-of-flight (ToF) cameras: A survey. IEEE Sensors Journal, 11(9), 1917–1926.

16. Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., Fulk, D. (2000). The digital Michelangelo project: 3D scanning of large statues. In Proc. Annu. Comput. Graph. (pp. 131–144).

17. Ikeuchi, K., Nakazawa, A., Hasegawa, K., Ohishi, T. (2003). The great Buddha project: Modeling cultural heritage for VR systems through observation. In Proc. IEEE/ACM Intl. Symp. Mixed Augmented Reality (pp. 7–16).

18. Wang, Z., & Zheng, Z. (2008). A region based stereo matching algorithm using cooperative optimization. In Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1, no. 12 (pp. 2720–2727).

19. Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 1222–1239.

20. Sun, J., Li, Y., Kang, S. B., Shum, H. Y. (2005). "Symmetric stereo matching for occlusion handling." In Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2 (pp. 399–406).

21. Herrera, C. D., Kannala, J., Heikkila, J. (2011). "Accurate and practical calibration of a depth and color camera pair." In Proc. Intl. Conf. Computer Analysis of Images and Patterns, vol. II, LNCS 6855 (pp. 437–445).

22. Klaus, A., Sormann, M., Karner, K. (2006). "Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure." In Proc. IEEE Int. Conf. Pattern Recognition, vol. 3 (pp. 15–18).

23. Lazaros, N., Sirakoulis, G. C., Gasteratos, A. (2008). "Review of stereo vision algorithms: from software to hardware." International Journal of Optomechatronics, 435–462.

24. Bleyer, M., Rother, C., Kohli, P. (2010). "Surface stereo with soft segmentation." In Proc. IEEE Comput. Soc. Conf. CVPR (pp. 1570–1577).

25. Taguchi, Y., Wilburn, B., & Zitnick, L. (2008). Stereo reconstruction with mixed pixels using adaptive over-segmentation. Proc. IEEE Comput. Soc. Conf. CVPR, 1(12), 2720–2727.

26. Birchfield, S., & Tomasi, C. (1998). A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 401–406.

27. Lindner, M., Kolb, A., Hartmann, K. (2007). "Data-fusion of PMD-based distance-information and high-resolution RGB-images." In Proc. Intl. Symp. Signals, Circuits, and Systems (pp. 1–4).

28. Chiu, W., Blanke, U., Fritz, M. (2011). "Improving the Kinect by cross-modal stereo." In Proc. British Mach. Vision Conf.

29. Chan, D. Y., & Hsu, C. H. (2013). "Regular stereo matching improvement system based on Kinect-supporting mechanism." Open Journal Applied Sciences, 22–26.

30. Yang, Q., Wang, L., Yang, R., Stewénius, H., & Nistér, D. (2009). Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 492–504.

31. Strecha, C., Fransens, R., Gool, L. V. (2006). "Combined depth and outlier estimation in multi-view stereo." In Proc. IEEE Comput. Soc. Conf. CVPR (pp. 2394–2401).

32. Zhang, S., Wang, C., Chan, S. C. (2013). "A new high resolution depth map estimation system using stereo vision and depth sensing device." In Proc. IEEE Colloq. Signal Process. Applications (pp. 49–53).

33. Zhang, Z. G., Chan, S. C., Ho, K. L., & Ho, K. C. (2008). On bandwidth selection in local polynomial regression analysis and its application to multi-resolution analysis of non-uniform data. J. Signal Process. Syst. Signal Image and Video Technol., 52(3), 263–280.

Shuai Zhang received his B.Sc. (CE) degree from Yanshan University in 2009. He received his M.Sc. degree from The University of Hong Kong in 2010 and is now pursuing his Ph.D. degree in the Department of Electrical and Electronic Engineering, The University of Hong Kong. His research interests focus on multi-modality data fusion, human body tracking, and statistical video processing.

Chong Wang received the B.Eng. degree from Zhejiang University of Technology in 2007, and the M.Eng. degree from the University of Science and Technology of China in 2010. He is currently pursuing the Ph.D. degree at the Department of Electrical and Electronic Engineering, The University of Hong Kong. His main research interests are in image and video restoration, image based rendering and parallel computing.

S. C. Chan received the B.Sc. (Eng.) and Ph.D. degrees from The University of Hong Kong, Pokfulam, Hong Kong, in 1986 and 1992, respectively. Since 1994, he has been with the Department of Electrical and Electronic Engineering, The University of Hong Kong, where he is currently a Professor. His research interests include fast transform algorithms, filter design and realization, multirate and biomedical signal processing, communications and array signal processing, high-speed A/D converter architecture, bioinformatics, smart grid and image-based rendering. Dr. Chan is currently a member of the Digital Signal Processing Technical Committee of the IEEE Circuits and Systems Society. He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II, the Journal of Signal Processing Systems (Springer) and Digital Signal Processing (Elsevier). He was the Chair of the IEEE Hong Kong Chapter of Signal Processing in 2000–2002, an organizing committee member of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing and the 2010 International Conference on Image Processing, and an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I from 2008 to 2009.
