
Video 2-D–to–3-D conversion based on hybrid depth cueing

Chao-Chung Cheng (SID Student Member), Chung-Te Li, and Liang-Gee Chen

Abstract — With the maturation of three-dimensional (3-D) technologies, display systems can provide higher visual quality to enrich the viewer experience. However, the depth information required for 3-D displays is not available in conventional 2-D recorded contents. Therefore, the conversion of existing 2-D video to 3-D video becomes an important issue for emerging 3-D applications. This paper presents a system which automatically converts 2-D videos to 3-D format. The proposed system combines three major depth cues: the depth from motion, the scene depth from geometrical perspective, and the fine-granularity depth from the relative position. The proposed system uses a block-based method incorporating a joint bilateral filter to efficiently generate visually comfortable depth maps and to diminish the blocky artifacts. By means of the generated depth map, 2-D videos can be readily converted into 3-D format. Moreover, for conventional 2-D displays, a 2-D image/video depth-perception enhancement application is also presented. With the depth-aware adjustment of color saturation, contrast, and edge, the stereo effect of the 2-D content can be enhanced. A user study on subjective quality shows that the proposed method has promising results on depth quality and visual comfort.

Keywords — Depth-map generation, 2-D/3-D conversion, image enhancement, depth cue, 3-D TV.

DOI # 10.1889/JSID18.9.704

1 Introduction

As 3-D display technologies mature, enhanced reality and the stereoscopic effect are being pursued. Emerging 3-D displays provide a better visual experience than conventional 2-D displays. 3-D technology enriches the content of many applications, such as broadcasting, movies, gaming, photography, camcorders, education, etc. There are several approaches to generating 3-D content, such as active depth sensors,4 triangular stereo vision, 3-D graphics rendering, etc. The active methods and multi-camera systems require specific devices. Computer graphics requires 3-D model construction for rendering. All of them are only suitable for generating new content. Because the depth information is not recorded in conventional 2-D content, the time-consuming manual generation31 of the depth information is a barrier to mass-market promotion. Therefore, a 2-D–to–3-D conversion system, which automatically estimates the depth information from a single-view video, becomes an important technique for 3-D displays. Figure 1 shows a typical 2-D–to–3-D video-conversion flow. The depth-generation module estimates depth information from the monocular video. Then, depth image-based rendering technology can generate multi-view videos from the monocular video and the estimated depth maps.

2 Study on previous works

Before trying to retrieve the limited depth information from 2-D data, we first study human depth perception. Human depth perception is the visual ability to perceive the world in three dimensions. When observing 3-D space, the human brain integrates various heuristic depth cues to generate depth perception. The binocular depth cue of the two eyes and the monocular depth cues of a single eye20 are the major depth perceptions of the human visual system.

The two eyes converge and accommodate in a coordinated manner to produce depth perception. The disparity of binocular depth is one of the strongest depth perceptions of the human visual system. The monocular cues, such as focus/defocus, motion parallax, relative height/size, texture gradient, etc., can also generate depth perception based on human experience and single-eye accommodation. Therefore, various depth cues can be used to acquire the depth information from monocular videos.

The existing methods of depth-map generation can be classified into two types: spatial methods using a single frame, and temporal methods using multiple frames. The single-frame methods include depth assignment using image classification, machine learning, depth from focus/defocus, depth from geometric perspective, depth from texture gradient, depth from relative height, etc. Harman1 proposed a machine-learning algorithm to generate the depth of key frames. Hoiem12 used machine-learning algorithms to generate depth maps. The depth of the scene and objects is assigned according to the trained classes. Battiato17 proposed to generate depth maps using multiple cues from a single image. The depth is assigned according to the characteristics of different regions. However, these methods rely strongly on cases that are in the database and well trained.

The authors are with the National Taiwan University, Graduate Institute of Electronic Engineering, R332, EE-II, No. 1, Sec. 4, Roosevelt Rd., Taipei 106, Taiwan; telephone +886-2-912-537851, e-mail: [email protected]. © Copyright 2010 Society for Information Display 1071-0922/10/1809-0704$1.00.


The Computed Image Depth (CID) method14 divides a single image into several sub-blocks and uses contrast and blurriness information to generate depth information for each block. Park et al.5 used the blur property of low depth of field in optical physics to determine the distance from the focal plane. These methods can only detect a low-resolution depth layer. Tsai et al.7 used the vanishing point of a line to find the depth gradient of the scene. Jung21 used edge information and prior depth knowledge to assign the depth. Those single-frame methods use different concepts and are suitable for different cases. However, they are not reliable when the chosen cue is weak in the image.

The multi-frame-based methods generally use the motion vector to retrieve the depth information.2,9–11,18,19 Depth from motion parallax can be seen as the temporal counterpart of depth from triangular stereo vision. The correspondence shift of objects in multiple frames can be retrieved from the motion vector. In the depth-from-motion-parallax approaches, the depth-induced motion vector (MV) can be converted to a disparity vector (DV). Another approach, called the modified time-difference (MTD) method,15 produces stereo pairs without the generation of a depth map. The frame pairs from different times are used as left-eye and right-eye images to generate the stereoscopic effect. However, this method fails when the objects have vertical motion. A hybrid integration of CID and MTD called DCP16 was the first work to consider both the monocular cue and the binocular cue. Although the depth from motion can generate an effective depth map, the method fails to generate a depth map when the camera does not have motion.

In this paper, we focus on depth-map generation and on an application that uses depth information to enhance the depth perception of display systems. A video 2-D–to–3-D conversion system which exploits both spatial and temporal methods to generate the depth map of a video is proposed. With the fusion of multiple cues, depth can be retrieved from the information of single and multiple frames more completely. The organization of this paper is as follows. In Sec. 3, the video 2-D–to–3-D conversion system is presented. Then, experimental results and an in-depth discussion of the visual quality of the proposed algorithm are given in Sec. 4. Finally, concluding remarks are made in Sec. 5.

3 Proposed video 2-D–to–3-D conversion system

The proposed video 2-D–to–3-D conversion system is shown in Fig. 2. The depth-generation core consists of three major components: the depth from motion-parallax (DMP) module using temporal information, the depth from geometrical perspective (DGP) module retrieving fine-grained scene depth, and the depth from relative position (DRP) module exploiting fine-grained object depth. The three depth maps are fused together to generate the final depth map. The DMP and DGP modules are both implemented by block-based algorithms and the DRP module is implemented by a line-based algorithm to ease the hardware design and meet the real-time demand. Then, a joint bilateral filter8 is applied to remove the block and line artifacts of the depth map. The filtered depth map can be used to generate multi-view/stereoscopic video using DIBR for 3-D displays and 2-D video depth-perception enhancement for 2-D displays. We will detail the individual modules in the following subsections.

3.1 Depth from motion-parallax (DMP)

The depth from motion-parallax (DMP) module is one of the important cues of the system. The basic concept is the use of consecutive frames as binocular images to retrieve disparities. The video content can be classified into two types: static scenes and individual moving objects (IMOs).

FIGURE 1 — The typical 2-D–to–3-D video conversion flow. The depth-generation module retrieves the depth map from the monocular 2-D video. The multi-view images can be generated from the monocular video and depth map using the depth-image-based rendering technique.


For these two types of content, the relationships of the motion-vector-to-disparity-vector mapping are different, and thus different treatments must be applied. We will discuss these two cases in this section. For convenience, we define the horizontal axis of the target display as the x axis, and the vertical axis as the y axis. We define the normal of the image plane as the z axis. Because the global motion will induce an offset of the motion vector, the camera motion of the video frames should be removed by using the global motion compensation (GMC) algorithm6 before converting the motion vector to a disparity vector.

3.1.1 Camera-motion compensation

The camera motion can be basically classified into the following types: translational movement on the x axis or y axis, rotation, and zoom-in and zoom-out. The translational movements on the x and y axes are depth-dependent. The shift of the translational movement can be used to derive the binocular disparity. As shown in Fig. 3, the disparity can be derived from the following equations:

x_1 = fX/Z, (1)

x_2 = f(X − B)/Z = x_1 − fB/Z, (2)

Disparity D = x_1 − x_2 = fB/Z, (3)

where x_1 and x_2 are the projected coordinates of the object point P_1 on the two frames, f is the focal length, Z is the distance of the object P_1, and B is the baseline width.
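As a numerical illustration (the values here are assumed for this example only, not taken from the paper), a focal length f = 1000 pixels, a baseline B = 6 cm, and an object at Z = 3 m give a disparity D = fB/Z = 1000 × 0.06/3 = 20 pixels, while halving the distance to Z = 1.5 m doubles the disparity to 40 pixels, reflecting the inverse proportionality between Z and D shown in Fig. 3.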

Therefore, depth information can be retrieved from the disparity estimation. Rotation around the x axis or y axis can be viewed as a full-frame shift operation in the y and x directions. Rotation around the z axis is depth-independent. It can be compensated by mapping (x, y) to (x cos θ + y sin θ, −x sin θ + y cos θ) when the rotation angle is θ. Zoom-in and zoom-out enlarge all the objects in the scene with a depth-independent weight.

The global motion model can be described by the following equations:

x′ = (a_0 + a_2 x + a_3 y)/(a_6 x + a_7 y + 1), (4)

y′ = (a_1 + a_4 x + a_5 y)/(a_6 x + a_7 y + 1), (5)

where (a_0, ..., a_7) are the global motion parameters. This model is suitable when the scene can be approximated by a planar surface. The affine (six parameters: a_6 = a_7 = 0), translation-zoom-rotation (four parameters: a_2 = a_5; a_3 = −a_4; a_6 = a_7 = 0), translation-zoom (three parameters: a_2 = a_5; a_3 = a_4 = a_6 = a_7 = 0), and translation (two parameters: a_2 = a_5 = 1; a_3 = a_4 = a_6 = a_7 = 0) models are specialized cases of the eight-parameter model.
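For illustration, a minimal sketch (not the authors' implementation) of how the warp of Eqs. (4) and (5) can be applied to pixel coordinates is given below; the function names and the way the four-parameter model is assembled from a translation, zoom, and rotation angle are assumptions of this example:

```python
import numpy as np

def warp_coords(x, y, a):
    """Map coordinates (x, y) with the eight-parameter perspective model
    of Eqs. (4)-(5); a = (a0, a1, a2, a3, a4, a5, a6, a7)."""
    a0, a1, a2, a3, a4, a5, a6, a7 = a
    denom = a6 * x + a7 * y + 1.0
    x_w = (a0 + a2 * x + a3 * y) / denom
    y_w = (a1 + a4 * x + a5 * y) / denom
    return x_w, y_w

def tzr_params(tx, ty, zoom, theta):
    """Four-parameter translation-zoom-rotation model as a special case:
    a2 = a5, a3 = -a4, a6 = a7 = 0."""
    c, s = zoom * np.cos(theta), zoom * np.sin(theta)
    return (tx, ty, c, -s, s, c, 0.0, 0.0)
```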

According to the estimated camera parameters, frames are warped to form a parallel-view configuration with the current frame, as shown in Fig. 4.


FIGURE 3 — Disparity estimation from parallel-view frames. The depth value Z is inversely proportional to the disparity value x_1 − x_2.

FIGURE 2 — Proposed video 2-D–to–3-D conversion system. The depth-generation core has three major modules. The depth from motion-parallax module retrieves depth from temporal information. The depth from geometrical perspective module retrieves depth from the scene structure. And the depth from relative position module refines the local relative depth value. Then, the three depth cues are fused to generate the final depth map. The post-filtering removes the blocky artifacts of the block-based algorithm. Finally, the generated depth map can be used for 2-D video depth-perception enhancement and the visualization of 3-D displays.


This model warps the reference frames to the current frame. After warping, the motion parallax can be computed without the interference of camera motion.

From our experiments, we find that the four-parameter global motion model is most suitable for motion-parallax reconstruction. The four-parameter global motion model compensates the translation, zoom, and rotation of the reference view. The information regarding the depth-induced motion vector can be preserved. Higher-order GMC models, such as the affine (six-parameter) and perspective (eight-parameter) models, remove the depth gradient of the frames and are therefore not suitable for the depth-from-motion application.

3.1.2 Block-based motion estimation with prior smoothness

After the effects of depth-independent global motion are removed, the candidate frame can be used to reconstruct the motion parallax. However, when a candidate frame has a larger translational movement, the corresponding disparity map has better precision. As for a frame with a smaller translational movement, the disparity map has fewer occlusion regions. Therefore, the selection of the candidate frame among the neighboring frames involves a careful trade-off between precision and the frequency of occlusion.

Therefore, we convert the local motion vectors to disparity vectors and obtain the depth information. To generate a spatially smooth result, the motion vector of each block is estimated with the smoothness consideration of neighboring blocks. The method is shown in Eqs. (6)–(10):

MV_P = (MV_up + MV_right + MV_left + MV_down)/4, (6)

D_D(MV_i) = SAD[Block_cur, Block_ref(MV_i)], (7)

D_S(MV_i, MV_P) = ||MV_i − MV_P||, (8)

MV_candidate = argmin_{MV_i} {D_D(MV_i) + λ D_S(MV_i, MV_P)}, (9)

D_M = α · min(|MV_candidate|, mv_threshold), (10)

where MV_P is the motion vector predicted from the motion vectors of the neighboring blocks and MV_candidate is the selected candidate motion vector. The data cost, D_D, is the sum of absolute differences (SAD) between the current block and the reference block in a selected reference frame. The smoothness cost, D_S, is the Euclidean distance between the current MV_i and MV_P. The smoothness cost can propagate the motion values into textureless regions and provide smooth results. The depth value is mapped in proportion to the magnitude of the motion vector. A maximum threshold mv_threshold is applied to prevent the exaggeration of outliers in the motion vector. Finally, the weighting factor α controls the strength of the stereo effect.
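A minimal sketch of Eqs. (6)–(10) is given below; it assumes grayscale frames stored as NumPy arrays, predicts MV_P from the already-computed up and left neighbors in a raster scan (instead of the full four-neighbor average of Eq. (6)), and uses illustrative parameter values, so it should be read as an outline rather than the authors' implementation:

```python
import numpy as np

def block_depth_from_motion(cur, ref, block=8, search=8,
                            lam=2.0, alpha=8.0, mv_threshold=16.0):
    """Block-wise depth-from-motion-parallax map D_M following Eqs. (6)-(10),
    with a causal (up/left) motion-vector predictor."""
    H, W = cur.shape
    bh, bw = H // block, W // block
    mv = np.zeros((bh, bw, 2))
    depth = np.zeros((bh, bw))
    for by in range(bh):
        for bx in range(bw):
            y0, x0 = by * block, bx * block
            cur_blk = cur[y0:y0 + block, x0:x0 + block].astype(np.float64)
            preds = []                                  # MV_P, causal variant of Eq. (6)
            if by > 0:
                preds.append(mv[by - 1, bx])
            if bx > 0:
                preds.append(mv[by, bx - 1])
            mv_p = np.mean(preds, axis=0) if preds else np.zeros(2)
            best_cost, best_mv = np.inf, np.zeros(2)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y0 + dy, x0 + dx
                    if yy < 0 or xx < 0 or yy + block > H or xx + block > W:
                        continue
                    ref_blk = ref[yy:yy + block, xx:xx + block].astype(np.float64)
                    d_d = np.abs(cur_blk - ref_blk).sum()        # Eq. (7): SAD
                    d_s = np.hypot(dy - mv_p[0], dx - mv_p[1])   # Eq. (8)
                    cost = d_d + lam * d_s                       # Eq. (9)
                    if cost < best_cost:
                        best_cost = cost
                        best_mv = np.array([dy, dx], dtype=np.float64)
            mv[by, bx] = best_mv
            # Eq. (10): depth proportional to the clipped MV magnitude
            depth[by, bx] = alpha * min(np.hypot(best_mv[0], best_mv[1]), mv_threshold)
    return depth, mv
```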

In the motion-vector-to-disparity-vector conversion, the depth of the static scene can be retrieved. For a rigid body, the depth-induced motion information can be converted to the disparity vector. As for the motion of IMOs, the estimated motion vectors basically comprise two parts: one is self-motion and the other is depth-induced motion. The self-motion of an IMO will result in an additional offset in the disparity vector. Fortunately, we can also convert the self-motion to the disparity vector in most cases when the self-motion is small. The visual effect of this approximation is that moving objects pop up and catch more attention.

Considering the computational complexity, the depth from motion parallax is implemented by using a block-based algorithm. One advantage of the block-based depth from motion parallax over segmentation-based algorithms is its computational efficiency. However, the results would suffer from the blocky effect. To solve this problem, we use an edge-aware bilateral filter to remove the blocky effect and preserve object boundaries. The details will be shown in the post-filtering section.

3.2 Depth from geometrical perspective (DGP)

In the spatial domain, depth information can also be retrieved from monocular depth cues. The geometrical perspective can also provide primary information of the scene depth.13

For simplicity, the major scene types of 2-D video can be classified into three iconic representations as shown in Fig. 5.


FIGURE 4 — The consecutive frames with different viewing angles and positions are compensated to a parallel-view configuration with the current frame by means of the estimated camera parameters. Therefore, the motion parallax can be computed by disparity estimation without the interference of camera motion.


By analyzing the line structure in the scene, horizontal and vanishing lines are detected to make the scene-mode decision. The DGP is only applied to the key frame to reduce the computational complexity. When the DGP fails to find the scene mode, the default mode, which assigns a simple depth gradient from top to bottom, is applied. From our study, we find that the top-to-bottom mode is the mode that is suitable for most cases in sequences of natural images. Although the default mode has a less protrusive effect, it also has a lower side effect when the scene mode is not detected.

The flow of the DGP module is shown in Fig. 6. First, Hough line detection is applied to find the dominant geometrical lines in the image, that is, the lines with the strongest response, as shown in Figs. 7(a) and 7(c). An example of the vanishing point is visualized in Fig. 7(d). Firstly, the Hough line transform detects all possible lines. After the vanishing point is detected, the depth gradient is assigned according to the direction of each segment plane.

In practical conditions, Hough line detection is not always stable due to foreground moving objects, camera noise, or high-contrast edges in the background. In those cases, the dominant lines in consecutive detected frames would not be identical. The unstable lines would cause flicker in the scene depth maps and result in visual discomfort. A novel line-stabilization technique is applied to preserve the temporal coherence of the dominant geometrical lines. Whenever a dominant geometrical line is detected in the current frame, the weight of the region passed by the dominant line is increased in the next frame. Edges in the region through which the previous line passes will have a higher probability of being selected. In this way, dominant lines become more temporally stable. Finally, the depth assignment is applied to generate a depth gradient according to the scene mode.
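The temporal re-weighting can be sketched as follows (our own illustrative formulation of the idea, not the authors' exact scheme): candidate lines detected in the current frame have their Hough scores boosted when they lie close to the dominant line kept from the previous frame.

```python
import numpy as np

def stabilize_dominant_line(candidates, prev_line, boost=1.5,
                            rho_tol=10.0, theta_tol=np.deg2rad(5.0)):
    """Select the dominant line among Hough candidates, favoring lines close
    to the previous frame's dominant line to keep the scene depth temporally stable.

    candidates: list of (rho, theta, score) tuples from a Hough transform.
    prev_line:  (rho, theta) of the previous dominant line, or None.
    """
    best, best_score = None, -np.inf
    for rho, theta, score in candidates:
        weighted = score
        if prev_line is not None:
            close = (abs(rho - prev_line[0]) < rho_tol and
                     abs(theta - prev_line[1]) < theta_tol)
            if close:
                weighted *= boost   # region passed by the previous dominant line
        if weighted > best_score:
            best, best_score = (rho, theta), weighted
    return best
```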

3.3 Depth from relative position (DRP)

The depth from relative position uses perceptual knowledge to generate the fine-granularity depth map. Luminance and color are strong and self-sufficient pictorial depth cues in visual scenes and images. Some experiments in color stereopsis have also shown that long-wavelength stimuli, such as red or yellow, appear closer than short-wavelength stimuli such as blue or green.29,30 In general, the red color does not always mean "near"; however, it is inevitably seen as "nearer" in the visual field than stimuli with other colors. Moreover, the edges of the image also have a high potential to be the edges of the depth map. This suggests that color can be a potentially strong candidate as a depth cue with a strong relative weight.28 Based on this concept, DRP uses the depth perception of luminance and color to refine the depth map:

D_R = [1 + α(Y − 128)/128] × [1 + β(Cr − 128)/128] × [1 − γ(Cb − 128)/128]. (11)

D_R is a linearly increasing function of the Y and Cr values, and a linearly decreasing function of the Cb value. Although not all conditions satisfy the psychological hypothesis, a depth cue that has a high correlation to human experience also enhances depth perception with a low side effect. Interestingly, some great painters, notably Paul Cezanne, apply "warm" pigments (red, orange, and yellow) to bring the surface forward towards the viewer, and "cool" ones (blue, violet, and cyan) to indicate surfaces that curve away from the picture plane. By using the DRP refinement, the depth perception in both the edge and color domains can be enhanced. The values of α and γ are about 0.1–0.3, and β is about 0.3–0.5, as an empirical rule.
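For illustration only, Eq. (11) can be evaluated directly on the YCbCr channels; the parameter values used here (α = γ = 0.2, β = 0.4) are merely picked from inside the ranges quoted above:

```python
import numpy as np

def depth_from_relative_position(Y, Cb, Cr, alpha=0.2, beta=0.4, gamma=0.2):
    """Fine-granularity depth cue D_R of Eq. (11): increases with Y and Cr
    (bright/warm appears nearer) and decreases with Cb (cool appears farther).
    Y, Cb, and Cr are float arrays in the 0-255 range."""
    d_r = (1.0 + alpha * (Y - 128.0) / 128.0) \
        * (1.0 + beta * (Cr - 128.0) / 128.0) \
        * (1.0 - gamma * (Cb - 128.0) / 128.0)
    return np.clip(d_r, 0.0, None)
```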

3.4 Depth fusion and post-filtering

After depth generation from multiple cues, the depth maps of DMP, DGP, and DRP are fused using the following equation:

D_fused = (W_M D_M + W_P D_P) W_R D_R, (12)

where W_M, W_P, and W_R are the weights of the DMP, DGP, and DRP cues, respectively, and D_M, D_P, and D_R are the depth maps of the DMP, DGP, and DRP cues. The flow of depth fusion is shown in Fig. 8. Depending on the camera-motion analysis, the three weighting factors are adaptively adjusted. The recommended weights of depth fusion are related to the content characteristics. If the sequence does not have a stable scene mode, the depth can only be reliably generated from motion. Figure 5(a) will be chosen as the initial mode, and the depth of DGP will be fused with a smaller weighting than that of DMP.


FIGURE 5 — Multiple scene-depth types. The geometrical perspective can also provide primary information of the scene depth. The major scene types of 2-D video are classified into three iconic representations.

FIGURE 6 — Flow of depth from geometrical perspective. By analyzing the line structure in the scene, the horizontal line and vanishing point are detected to make the scene-mode decision. Then the depth gradient of each region is assigned depending on the detected scene type.


FIGURE 7 — Processed images in the DGP module. (a) The detected horizontal line (red) with the strongest response in the Stefan sequence. (c) An example of the detected vanishing point for the hall-monitor sequence. (b) and (d) show the corresponding depth maps which are assigned by the detected scene-depth types.

FIGURE 8 — Flow of depth fusion. W_M, W_P, and W_R are the weights of the DMP, DGP, and DRP cues, respectively. D_M, D_P, and D_R are the depth maps of the DMP, DGP, and DRP cues. The depths of the three cues are fused by the equation (W_M D_M + W_P D_P) W_R D_R. The depth maps of DMP and DGP are added together and are then refined by DRP.


In that case, W_M, W_P, and W_R are in the proportion 4:1:4 in our implementation. On the other hand, if the video has a stable scene mode, W_P can be fused with a larger weighting than W_M; W_M, W_P, and W_R are then in the proportion 2:3:4 in our implementation.
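A minimal sketch of this fusion step is shown below; it assumes the three cue maps are already on a common scale and that the fused map is rescaled to 8 bits afterwards, which is our choice rather than a detail stated in the paper:

```python
import numpy as np

def fuse_depth(d_m, d_p, d_r, scene_mode_stable):
    """Fuse the DMP, DGP, and DRP cues as in Eq. (12):
    D_fused = (W_M * D_M + W_P * D_P) * W_R * D_R."""
    if scene_mode_stable:
        w_m, w_p, w_r = 2.0, 3.0, 4.0   # stable scene mode detected
    else:
        w_m, w_p, w_r = 4.0, 1.0, 4.0   # no reliable scene mode
    fused = (w_m * d_m + w_p * d_p) * (w_r * d_r)
    # Rescale to the 8-bit depth-map range used by the later stages
    fused = 255.0 * (fused - fused.min()) / (np.ptp(fused) + 1e-9)
    return fused
```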

Moreover, the artifacts due to the block-based motion-estimation algorithm generate visually uncomfortable results. Before 3-D visualization, this side effect should be eliminated. Bilateral filtering properly smoothes the image while preserving object boundaries.8 Therefore, we apply the joint bilateral filter to smooth the depth map:

D_filtered(x_i) = [1/N(x_i)] Σ_{x_j ∈ Ω(x_i)} exp(−||x_j − x_i||²/2σ_s²) exp(−||u(x_j) − u(x_i)||²/2σ_r²) D_fused(x_j), (13)

N(x_i) = Σ_{x_j ∈ Ω(x_i)} exp(−||x_j − x_i||²/2σ_s²) exp(−||u(x_j) − u(x_i)||²/2σ_r²), (14)

where u(x_i) is the intensity value of pixel x_i, Ω(x_i) is the set of neighboring pixels of x_i, N(x_i) is the normalization factor of the filter coefficients, and D_filtered is the filtered depth map. The window size depends on the characteristics of the objects. In our implementation, the kernel of the bilateral filter is larger than the block size in the DMP module. One example of a filtered image is shown in Fig. 9. The blocky artifacts in the fused depth map are effectively removed while the sharp depth discontinuities along the object boundary are enhanced. The depth map can thus provide high depth quality at the object boundary.
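A brute-force sketch of Eqs. (13) and (14) is given below, using the color frame as the guidance image; the window radius and the two sigmas are illustrative values rather than the authors' settings, and the wrap-around border handling of np.roll is a simplification:

```python
import numpy as np

def joint_bilateral_filter(depth, guide, radius=8, sigma_s=4.0, sigma_r=12.0):
    """Smooth the fused depth map with weights taken jointly from the spatial
    distance and from intensity differences in the guidance (color) image,
    following Eqs. (13)-(14). depth and guide are float 2-D arrays."""
    acc = np.zeros_like(depth, dtype=np.float64)
    norm = np.zeros_like(depth, dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            spatial = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2))
            g = np.roll(guide, (dy, dx), axis=(0, 1))   # wraps at borders (simplification)
            d = np.roll(depth, (dy, dx), axis=(0, 1))
            w = spatial * np.exp(-(g - guide) ** 2 / (2.0 * sigma_r ** 2))
            acc += w * d
            norm += w
    return acc / norm
```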

3.5 3-D visualization

For modern 3-D displays, the depth map is used for rendering multiple viewing angles according to the following equation:

x_i = x_c ± (t_x/2)(f/Z), (15)

where x_i is the horizontal coordinate of the interpolated view and x_c is the horizontal coordinate of the intermediate view. Z is the depth value of the current pixel, f is the camera focal length, and t_x is the eye distance. The edge-dependent interpolation method3 is applied to preserve the edge information of the interpolated area. The experimental results and subjective quality analysis are given in Sec. 4.
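An illustrative forward-warping sketch of Eq. (15) is given below; the rounding and the untouched disocclusion holes are simplifications of this sketch rather than the edge-dependent interpolation of Ref. 3, and occlusion ordering is ignored for brevity:

```python
import numpy as np

def render_side_view(image, depth_z, f, t_x, sign=1):
    """Forward-warp one eye view from the intermediate view with Eq. (15):
    x_i = x_c +/- (t_x / 2) * (f / Z).
    image: (H, W, 3) array; depth_z: per-pixel depth Z."""
    H, W = depth_z.shape
    out = np.zeros_like(image)
    filled = np.zeros((H, W), dtype=bool)
    shift = sign * (t_x / 2.0) * (f / depth_z)   # horizontal shift in pixels
    for y in range(H):
        for x in range(W):
            xi = int(round(x + shift[y, x]))
            if 0 <= xi < W:
                out[y, xi] = image[y, x]
                filled[y, xi] = True
    # Pixels with filled == False are disocclusion holes; Ref. 3 fills them
    # with edge-dependent interpolation.
    return out, filled
```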

For conventional 2-D displays, we also propose an application that uses the depth information to enhance the depth perception of 2-D video. The depth-aware video enhancement adjusts the video in relation to the depth information. Three cues (color saturation, contrast, and edge) are adjusted according to the relative depth range to enhance the depth perception.

The flow is shown in Fig. 10. From the characteristics of aerial perspective, objects that are a great distance away have lower luminance contrast and lower color saturation due to atmospheric scattering.25 The detailed discussion of saturation enhancement is presented here. First of all, the definition of saturation is introduced:

Saturation = Colorfulness/Lightness. (16)

Lightness can be decomposed into two parts: the lightness of the object and the lightness of the environment. For simplicity, the environmental light is assumed to be white in our model. According to the atmospheric scattering model,25 the lightness of the observed object projected on the image plane decays exponentially with distance:

Light_projected(x) = Light(x) e^{−c_λ Z(x)}, (17)

where λ is the dominant wavelength and c_λ is the corresponding scattering coefficient. Therefore, the saturation of x in the projected image can be modeled as

Saturation_projected(x) = Colorfulness(x) e^{−c_λ Z(x)} / [Light(x) e^{−c_λ Z(x)} + Light_Environment]. (18)

A stronger light received from the object results in a higher saturation value. Thus, the saturation can be modeled as an increasing function of the depth-map value. The saturation channel of the HSV color space can be increased to represent an object that is closer to the viewer.
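A minimal sketch of the saturation adjustment, assuming the frame has already been converted to HSV with channels in [0, 1] and the depth map normalized to [0, 1] (1 = near); the gain value is an illustrative choice:

```python
import numpy as np

def enhance_saturation(hsv, depth01, max_gain=0.3):
    """Raise the HSV saturation channel for near pixels and leave far pixels
    almost untouched, strengthening the aerial-perspective cue on 2-D displays."""
    out = hsv.copy()
    gain = 1.0 + max_gain * depth01              # near (depth ~ 1) -> strongest boost
    out[..., 1] = np.clip(out[..., 1] * gain, 0.0, 1.0)
    return out
```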

The other characteristic is that far objects have lower contrast than near objects.32 Contrast is also an effective depth cue for producing depth perception.23


FIGURE 9 — The original image (left), the pre-filtered depth image (middle), and the post-filtered depth image (right). By using the joint bilateral filter, the blocky artifacts of the depth map are effectively removed while the sharp depth discontinuities along the object boundary are preserved.

FIGURE 10 — Depth-perception enhancement of 2-D image/video. For a conventional 2-D display, three cues (contrast, edge, and saturation) are adjusted to enhance the depth perception according to the relative depth range.


Edge blurriness24 shows the same phenomenon as contrast. In the 2-D enhancement flow, edge and contrast enhancement are applied individually because all the depth cues are suggested to be considered independently.26 As for the relationship of contrast and edge blurriness to the depth, we can derive the following equations from the basic theory of optics in Fig. 11,

1/F = 1/v_0 + 1/u_0, (19)

tan θ = r/v = d/(v_0 − v), (20)

1/F = 1/v + 1/Z, (21)

Z(x) = F r v_0 / (r v_0 − rF − dF) = F v_0 / (v_0 − F − 2fd), (22)

assuming that Z(x) > Fv_0/(v_0 − F), which means that all the objects are behind the focus plane. Thus,

d = [v_0 Z(x) − F v_0 − F Z(x)] / [2f Z(x)] = (v_0 − F)/(2f) − [F v_0/(2f)] Depth(x), (23)

where F is the focal length, v_0 is the distance from the lens to the image plane, u_0 is the in-focus object distance, r is the aperture radius, f is the f-number (r = F/2f), v is the image distance of an object at depth Z, d is the blur radius, and Depth(x) = 1/Z(x) is the depth-map value.

From the above discussion, as Depth(x) increases, d decreases. This means that if an object has a higher value (near) on the depth map, then the object will have higher contrast and sharpness.
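In the same spirit, contrast and edge sharpness can be raised for near pixels; the local-mean contrast stretch and unsharp mask below are our illustrative choices, not the paper's exact operators:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def enhance_contrast_and_edge(lum, depth01, c_gain=0.3, e_gain=0.5, size=7):
    """Depth-aware contrast stretch and unsharp masking on a luminance channel
    in [0, 1]; depth01 is the depth map normalized to [0, 1] (1 = near)."""
    local_mean = uniform_filter(lum, size=size)
    detail = lum - local_mean
    out = local_mean + (1.0 + c_gain * depth01) * detail   # contrast around local mean
    out = out + e_gain * depth01 * detail                  # extra sharpening of edges
    return np.clip(out, 0.0, 1.0)
```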

A 2-D enhancement example is shown in Fig. 12. From the figure, we can see that the near grassland in the fussball sequence has higher contrast and saturation. Compared with the original image, the enhanced images show an implicit relation to the object distance. Because the enhanced relative saturation, edge, and contrast provide stronger pictorial depth cues for visual perception, viewers can also perceive an enhanced 3-D effect on conventional 2-D displays.

4 Experimental results

Figure 13 shows the results of hybrid depth fusion for the eight test sequences. In the test sequences, the scene modes of barden, fussball, jojo, kirchweih, stefan, hall_monitor, and flamingo are successfully detected. The parameters W_M, W_P, and W_R are in the proportion 2:3:4. The scene mode of the hall_monitor sequence is the vanishing-point mode and the others are horizontal-line modes. The scene mode of the Akko&Kayo sequence is not detected. Therefore, the default mode with a monotonic depth gradient is selected. The parameters W_M, W_P, and W_R are in the proportion 4:1:4. From the images we find that the moving objects are well detected to enhance the protrusion effect. However, some errors may occur when there are variations in illumination and shadow. Fortunately, by using the bilateral filter, the erroneous depth in these regions is smoothed. From our experiments, we found the filtered depth map to have little perceptible side effect. For further analysis and discussion, the in-depth computational complexity and visual-quality analysis of the video sequences will be discussed in the following subsections.

4.1 Analysis of computational complexity

The computational complexity of the proposed algorithm is analyzed in this section. The prototype of our algorithm can achieve 41.3 fps at SDTV 720 × 576 when operating on an Intel® Core™2 Duo CPU E6850 @ 3.0 GHz with an Nvidia 9800GT with 112-core CUDA technology. The block size is 4 × 4.


FIGURE 11 — The basic optics model and parameters. The blurriness d increases as Z(x) increases. Therefore, with the adjustment of contrast and blurriness, the depth perception of objects in different depth ranges can be represented.

FIGURE 12 — Original image (left), depth-perception-enhanced image (middle), and depth map (right). Compared with the original image, the grassland region of the enhanced image has higher contrast and saturation. The enhanced images show an implicit relation to the object distance. The viewers can also perceive the enhanced 3-D effect on a conventional 2-D display.


With a larger block size, the proposed algorithm has a shorter computational time. However, a larger block size also results in lower depth-map quality. The computational complexity of the proposed algorithm is O(n^4), where n is the width of the frame. If the resolution of the frame increases, the number of blocks and the search range both increase, which means that the computational time becomes much longer. This is because the DMP motion algorithm has high computational complexity. The proposed algorithm uses a physical cue (DMP), a pictorial cue (DGP), and a perception cue (DRP) to generate the depth map. The three maps can be computed in parallel and fused together to generate depth from different cues. Moreover, the runtime model of the proposed algorithm is only a prototype. Further runtime improvement can be made with a fast motion-estimation algorithm or parallel computing.

4.2 Analysis of subjective visual quality

To evaluate the visual quality of the proposed algorithms, we compare three types of video data: videos which are captured by a stereoscopic camera, and stereoscopic videos whose depth maps are generated from the left view of the stereo camera by our algorithm and by the commercial software DDD's TriDef. The stimuli consist of six video sequences. Four 720 × 576 sequences, jojo, barden, fussball, and kirchweih, captured by two DCR-PC-8 camcorders, and two sequences from MPEG multi-view video coding, flamingo and Akko&Kayo, are used to perform the subjective-view evaluation. Both depth quality and visual comfort are assessed using a single-stimulus presentation method that is a slightly modified version of that described in ITU-R BT.500-10.27 The performance of the generated stereoscopic video is evaluated subjectively by comparing the stereo-captured video and a stereoscopic video that is generated from the left view of the original stereoscopic video. The synthesized stereo-view images are displayed on a 120-Hz 3-D display with active shutter glasses.

The subjective evaluation was performed by 15 people with normal or corrected-to-normal visual acuity and stereo acuity. The participants watched the stereoscopic videos in random order and were asked to rate each video according to two factors: depth quality and visual comfort. The overall depth quality is assessed using the five-segment scale shown in Fig. 14(a), and that for visual comfort is shown in Fig. 14(b). The values of the two factors acquired by the experiments for the six evaluation sequences are shown in Figs. 15(a) and 15(b).

FIGURE 13 — Images and depth maps of the test sequences. From left to right, top to bottom are the Akko&Kayo, barden, fussball, hall-monitor, jojo, kirchweih, stefan, and flamingo sequences. In these images, the shapes of the objects are well represented to enhance the depth perception of the moving objects. However, some errors may occur when there are variations in illumination and object shadow. Fortunately, by using the joint bilateral filter, the side effects in these regions are greatly reduced.


The red–cyan images of the test sequences are shown in Fig. 16.

Generally speaking, the stereoscopic videos captured by the stereoscopic camera usually get the highest scores, and the proposed algorithm is better than the DDD algorithm in terms of depth quality. From our evaluation, we found that the proposed method has better quality on sports videos, especially in the outdoor-scene mode. For these sequences, the smooth background depth maps are extracted by DGP and DRP, and the foreground moving objects are popped up by DMP, resulting in a protrusive 3-D effect. In the sequences with regular motions, such as the dog in the jojo sequence and the football in the fussball sequence, the object depth is smooth and has a spectacular pop-up effect. Regular motion means that the objects have simple movement in the same direction.

However, when the objects have complex self-motions, the DMP results in non-continuous depth on objects and makes viewers feel uncomfortable. For example, in flamingo, the hand and body of the dancer have different motion magnitudes. The same phenomenon can also be seen on the feet of the horse in kirchweih. Therefore, for visual comfort, our method has worse scores than the DDD method in sequences with nonregular self-motion.

FIGURE 14 — Rating scales used for assessing (a) depth quality and (b) visual comfort. The overall quality is assessed using a five-segment scale for depth quality and another for visual comfort.

FIGURE 15 — Subjective evaluation results for (a) depth quality and (b) visual comfort. The subjective evaluation was performed by 15 people with normal or corrected-to-normal visual acuity and stereo acuity. The participants watched the stereoscopic videos in a random order and were asked to rate each video according to two factors: depth quality and visual comfort. A higher value means the video has higher depth quality and better visual comfort.

FIGURE 16 — Red–cyan images of (a) stefan, (b) hall-monitor, (c) Akko&Kayo, (d) barden, (e) flamingo, (f) fussball, (g) jojo, and (h) kirchweih. The smooth background depth maps are extracted by DGP and DRP. The foreground moving objects are popped up by DMP to generate a protrusive 3-D effect. In the sequences with regular motions, such as jojo and fussball, the object depth is smooth and has the pop-up effect. As for flamingo and kirchweih, the hands of the dancers and the feet of the horses have a different motion magnitude. The irregular motion is filtered by the joint bilateral filter to diminish the side effect. Therefore, the stereoscopic images still have high-quality results.


In addition, we also notice that the original stereoscopic videos do not always obtain the highest score in terms of visual comfort. This is because the camera configurations, including the camera baseline, focal length, and convergence, may not be identical to those of the human visual system.22 The fixed camera configurations of the original stereoscopic videos degrade their visual-comfort scores.

4.3 Analysis of objective visual quality

The objective evaluation is discussed in this section. It is well known that there are various objective quality metrics commonly used for 2-D images, such as the mean squared error (MSE) and peak signal-to-noise ratio (PSNR), but these methods are not suitable for 2-D–to–3-D converted video contents. In the 2-D–to–3-D converted contents, the generated depth maps are pseudo-depth maps rather than real depth maps. There are no ground-truth images for comparison. However, we can still evaluate the depth quality objectively from the edge or textureless regions of the depth maps. This is reasonable as viewers always pay more attention to the object boundary and depth-discontinuity regions.

We use an objective evaluation method which is a modification of the CSED method.33 Because the edges of the color image have a high potential to be the edges of the depth map, the color-image and depth-image correlation (CDC) metrics are used to examine the objective quality of the depth map. The defined CDC metrics have four items:

Boundary quality: S_1 = Σ[E_C(x, y) ∩ E_D(x, y)] / Σ E_D(x, y), (24)

FIGURE 17 — The objective visual-quality assessment of the (a) fussball, (b) barden, (c) kirchweih, and (d) jojo sequences. S1 describes the ratio of the edges of the depth map that are also edges of the color image. With a higher S1 correlation, the depth map has a higher depth quality on the boundary. S2 represents the ratio of the edges of the color image that are also edges of the depth map. With higher S2 values, the depth map has more detail on the edges of the color image. Because the edges of the color image are not necessarily edges of the depth map, the S2 value is just a reference metric for the complexity of the depth map. S3 describes the smoothness of the depth map in the textureless regions of the color images. Finally, the S4 metric, which calculates the edge variance of the depth map, is used to show the degree of protrusion in the depth map.


Detail: S_2 = Σ[E_C(x, y) ∩ E_D(x, y)] / Σ E_C(x, y), (25)

Smoothness: S_3 = Σ[E_C′(x, y) ∩ E_D′(x, y)] / Σ E_C′(x, y), (26)

Protrusion: S_4 = Variance[E_D(x, y)], (27)

where E_C(x, y) and E_D(x, y) are the edge maps of the color image and depth image, and E_C′(x, y) and E_D′(x, y) are the complement sets of E_C and E_D, respectively.

S1 describes the ratio of the edges of the depth map that are also edges of the color image. With higher S1 values, the depth map has higher depth quality on the boundary. S2 represents the ratio of the edges of the color image that are also edges of the depth map. With higher S2 values, the depth map has more detail on the edges of the color image. Because the edges of the color image are not necessarily edges of the depth map, the S2 value is just a reference metric for the complexity of the depth map. S3 describes the smoothness of the depth map in the textureless regions of the color images. S1 and S2 can provide an objective score on the quality of the edges. With higher S1 and S3 values, viewers feel more comfortable when watching the video. Finally, the S4 metric, which calculates the edge variance of the depth map, is used to show the degree of protrusion in the depth map.
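For illustration, the four CDC terms can be computed from edge maps as follows; the choice of edge detector, the threshold, and the use of the depth-image edge magnitude for the protrusion term are assumptions of this sketch, since the paper does not specify them:

```python
import numpy as np

def cdc_metrics(edge_color, edge_depth_mag, edge_threshold=0.1):
    """Compute S1-S4 of Eqs. (24)-(27). edge_color is a boolean edge map E_C of
    the color image; edge_depth_mag is the edge-magnitude map of the depth image,
    thresholded here to obtain E_D."""
    e_c = edge_color.astype(bool)
    e_d = edge_depth_mag > edge_threshold
    s1 = np.logical_and(e_c, e_d).sum() / max(e_d.sum(), 1)        # Eq. (24) boundary quality
    s2 = np.logical_and(e_c, e_d).sum() / max(e_c.sum(), 1)        # Eq. (25) detail
    s3 = np.logical_and(~e_c, ~e_d).sum() / max((~e_c).sum(), 1)   # Eq. (26) smoothness
    s4 = float(np.var(edge_depth_mag))                             # Eq. (27) protrusion
    return s1, s2, s3, s4
```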

The evaluation results are shown in Fig. 17. Four sequences are examined. The DMP has the highest protrusion value, but has lower detail because DMP only generates the depth of moving objects. The DGP has a large smoothness value in the textureless regions. When the scene-structure line is not heavily overlapped by foreground objects, such as in fussball and jojo, the boundary quality S1 of DGP has very high scores. However, the disadvantage is that the DGP has lower detail and a lower protrusion value. The DRP has the highest S1 and S2 values. This is because DRP computes the depth map as a function of the color image, so the edges of the depth map have the highest correlation to the color edges. But the DRP has a small value on the protrusion metric. The fused depth map is a compromise between the protrusion effect and the detail of the depth map. This is the advantage of using multiple cues to generate the depth map.

Finally, it is important and must be emphasized that the CDC metrics can provide an objective evaluation of the edge and textureless regions of the depth map rather than of depth perception. There is still no well-established standard on the metric of depth perception. Therefore, the quality of the depth map still basically relies on the subjective-view evaluation.

5 Conclusions

We have proposed a novel 2-D–to–3-D conversion system using three depth cues. The depth from motion parallax enhances the depth when the video has camera motion, the depth from geometrical perspective enhances the stereo effect of the scene structure, and the depth from relative position generates a fine-granularity depth by using the depth perception of color. The proposed system uses block-based algorithms incorporating a bilateral filter to generate a comfortable depth map in real time. Finally, the proposed system is quality scalable. Depending on the application, different block sizes can be selected or combined with multi-scale subsampling. A larger block size results in lower depth detail but faster computation speed. The users can trade off between quality and complexity. The subjective evaluation also shows that the proposed algorithm provides promising results. We believe that our system is suitable for 3-D displays and broadcasting systems.

Acknowledgments

The authors thank Prof. Su-Ling Yeh and Mr. Yong-Hao Yang for providing the depth-perception knowledge that helped in the development of the algorithm. They are members of the Perception & Attention Lab, Department of Psychology, National Taiwan University.

References
1 P. Harman et al., "Rapid 2-D to 3-D conversion," Proc. SPIE 4660, 78–86 (2002).
2 Y. Matsumoto et al., "Conversion system of monocular image sequence to stereo using motion parallax," Proc. SPIE Stereo. Disp. and VR Systems 3012, 108–115 (1997).
3 W.-Y. Chen et al., "Efficient depth image based rendering with edge dependent depth filter and interpolation," IEEE Intl. Conf. on Multimedia & Expo, 1314–1317 (2005).
4 S. B. Gokturk et al., "A time-of-flight depth sensor, system description, issues and solutions," IEEE Workshop on Real-Time 3-D Sensors and Their Use (2004).
5 J. Park and C. Kim, "Extracting focused object from low depth-of-field image sequences," Proc. SPIE 6077 (2006).
6 F. Dufaux and J. Konrad, "Efficient, robust, and fast global motion estimation for video coding," IEEE Trans. Image Processing 44, 108–116 (1998).
7 Y.-M. Tsai et al., "Block-based vanishing line and vanishing point detection for 3-D scene reconstruction," Proc. Intl. Symp. Intelligent Signal Proc. and Comm. Syst. (2006).
8 C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," IEEE Intl. Conf. Computer Vision, 839–846 (1998).
9 C.-C. Cheng et al., "A block-based 2-D-to-3-D conversion system with bilateral filter," IEEE Intl. Conf. Consumer Electronics (2009).
10 I. A. Ideses et al., "3-D from compressed 2-D video," Proc. SPIE 6490 (2007).
11 D. Kim et al., "A stereoscopic video generation method using stereoscopic display characterization and motion analysis," IEEE Trans. Broadcasting 54, No. 2, 188–197 (2008).
12 D. Hoiem et al., "Automatic photo pop-up," Proc. ACM SIGGRAPH (2005).
13 V. Nedovic et al., "Depth information by stage classification," IEEE Intl. Conf. Computer Vision (2007).
14 H. Murata et al., "A real-time 2-D to 3-D image conversion technique using computed image depth," SID Symposium Digest 29, 919–922 (1998).
15 H. Murata et al., "Conversion of two-dimensional images to three dimensions," SID Symposium Digest 26, 859–862 (1995).
16 T. Iinuma et al., "Natural stereo depth creation methodology for a real-time 2-D-to-3-D image conversion," SID Symposium Digest 30 (2000).
17 S. Battiato et al., "Depth map generation by image classification," SPIE Three-Dimensional Image Capture and Applications VI 5302, 95–104 (2004).



18 J.-Y. Chang et al., "Relative depth layer extraction for monoscopic video by use of multidimensional filter," IEEE Intl. Conf. Multimedia & Expo, 221–224 (2006).
19 Y.-L. Chang et al., "Depth map generation for 2-D-to-3-D conversion by short-term motion assisted color segmentation," IEEE Intl. Conf. Multimedia & Expo, 1958–1961 (2007).
20 W. J. Tam and L. Zhang, "3-D-TV content generation: 2-D-to-3-D conversion," IEEE Intl. Conf. Multimedia & Expo, 1869–1872 (2006).
21 Y. J. Jung et al., "A novel 2-D-to-3-D conversion technique based on relative height depth cue," SPIE Stereoscopic Displays and Applications XX 7237 (2009).
22 A. Woods, "Image distortions in stereoscopic video systems," SPIE Stereoscopic Displays and Applications, 36–48 (1993).
23 S. Ichihara et al., "Contrast and depth perception: Effects of texture contrast and area contrast," Perception 36, 686–695 (2007).
24 G. Mather, "The use of image blur as a depth cue," Perception 26, 1147–1158 (1997).
25 A. J. Preetham et al., "A practical analytic model for daylight," Proc. ACM SIGGRAPH (1999).
26 R. Patterson, "Review Paper: Human factors of stereo displays: An update," J. Soc. Info. Display 17, No. 12, 987–996 (2009).
27 ITU-R Recommendation BT.500-10, "Methodology for the subjective assessment of the quality of television pictures" (2000).
28 C. R. Guibal and B. Dresp, "Interaction of color and geometric cues in depth perception: When does red mean near?" Psychological Res. 69, 30–40 (2004).
29 D. Brewster, "Notice of a chromatic stereoscope," Philos. Mag. 4, No. 3, 31 (1851).
30 M. Dengler and W. Nitschke, "Color stereopsis: A model for depth reversals based on border contrast," Perception & Psychophysics 53, 150–156 (1993).
31 C. Wu et al., "A novel method for semi-automatic 2-D to 3-D video conversion," 3-DTV-CON (2008).
32 R. P. O'Shea et al., "Contrast as a depth cue," Vision Res. 34, No. 12, 1595–1604 (1994).
33 H. Shao et al., "Objective quality assessment of depth image based rendering in 3-DTV system," Proc. IEEE 3-DTV Conference (2009).

Chao-Chung Cheng received his B.S. and M.S. degrees in electronics engineering from National Chiao-Tung University, Hsinchu, Taiwan, R.O.C., in 2003 and 2005, respectively. He is currently a Ph.D. student at the Graduate Institute of Electronics Engineering, National Taiwan University. His research interests include digital signal processing, video system design, 3-D signal processing, stereo vision, and 2-D–to–3-D conversion.

Chung-Te Li received his B.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, R.O.C., in 2006. He is currently a Ph.D. student at the Graduate Institute of Electronics Engineering, National Taiwan University. His research interests include digital signal processing and image/video processing algorithms.

Liang-Gee Chen received his B.S., M.S., and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, R.O.C., in 1979, 1981, and 1986, respectively. In 1988, he joined the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan. From 1993 to 1994, he was a Visiting Consultant in the DSP Research Department, AT&T Bell Labs, Murray Hill, New Jersey. In 1997, he was a Visiting Scholar at the Department of Electrical Engineering, University of Washington, Seattle. Currently, he is a Professor with National Taiwan University. His current research interests are DSP architecture design, video processor design, and video-coding systems. Liang-Gee Chen has served as an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology since 1996, as Associate Editor of the IEEE Transactions on VLSI Systems since 1999, and as Associate Editor of the IEEE Transactions on Circuits and Systems II since 2000. He has been an Associate Editor of the Journal of Circuits, Systems and Signal Processing since 1999, and a Guest Editor for the Journal of Video Signal Processing Systems. He is also an Associate Editor of the Proceedings of the IEEE. He is the Past-Chair of the Taipei Chapter of the IEEE Circuits and Systems (CAS) Society and is a member of the IEEE CAS Technical Committee of VLSI Systems and Applications, the Technical Committee of Visual Signal Processing and Communications, and the IEEE Signal Processing Technical Committee of Design and Implementation of SP Systems. He is the Chair-Elect of the IEEE CAS Technical Committee on Multimedia Systems and Applications. From 2001 to 2002, he served as a Distinguished Lecturer of the IEEE CAS Society.
