
[IEEE International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007) - Sivakasi, Tamil Nadu, India (2007.12.13-2007.12.15)]

Natural Image and Video Matting

R. Abhilash
Visualization and Perception Lab
Dept. of CSE, IIT Madras

[email protected]

Abstract

The process of extracting a foreground object from an image based on limited user input is an important task in image and video editing. This paper addresses the problem of efficient extraction of a foreground object in a complex environment whose background cannot be trivially subtracted. Natural image matting is usually composed of two parts: foreground and background color estimation, and alpha estimation. Current approaches either restrict the estimation to a small part of the image, i.e. estimate foreground and background colors based on nearby pixels where they are known, or perform iterative nonlinear estimation by alternating foreground and background color estimation with alpha estimation. This paper presents an object extraction technique which uses global color and local smoothness information to estimate accurate alpha values. We then extend the object extraction method to video, where, from a single video shot with a moving foreground object and a stationary background, motion statistics, color and contrast cues are combined to extract the foreground object efficiently with no interaction from the user.

1 Introduction

Efficient and high-quality compositing is an important task in the special effects industry. Typically, movie scenes are composited from two different layers (foreground and background). To use the foreground content of one sequence as the foreground layer in a composited video, the foreground elements must be separated from the background in the source video. This process, known as matting, separates a foreground element from the background by estimating a color F and an opacity α for each foreground pixel.

Formally, an image matting method takes as input an image I, which is assumed to be a composite of a foreground image F and a background image B. The color of each pixel is assumed to be a linear combination of the corresponding foreground and background colors:

I = αF + (1 − α)B (1)

where α is the pixel's foreground opacity. In natural image matting all quantities on the right-hand side of the compositing equation (1) are unknown. Thus for a 3-channel image, at each pixel there are 3 equations and 7 unknowns. Obviously this is a severely under-constrained problem, and user interaction is required to extract a good matte.
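As a concrete illustration of this counting argument, the compositing equation can be written out in a few lines of numpy (our own sketch, not the paper's code; the function name and array shapes are invented for the example):

```python
import numpy as np

# Compositing equation (1): I = alpha*F + (1 - alpha)*B, applied per pixel.
# F, B are HxWx3 float images in [0, 1]; alpha is HxW in [0, 1].
def composite(F, B, alpha):
    return alpha[..., None] * F + (1.0 - alpha[..., None]) * B

# Tiny example: a 2x2 image.
F = np.ones((2, 2, 3)) * 0.9          # bright foreground
B = np.zeros((2, 2, 3))               # black background
alpha = np.array([[1.0, 0.5],
                  [0.5, 0.0]])        # mixed pixels on the boundary
I = composite(F, B, alpha)

# Per pixel: 3 observed values (the R, G, B channels of I) constrain
# 7 unknowns (F_r, F_g, F_b, B_r, B_g, B_b, alpha) -> under-constrained.
```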

Most recent methods require user input in the form of a well-drawn trimap [4, 13], and others require user effort in the form of a few scribbles [5, 7, 6, 9, 12]. The performance of most of

International Conference on Computational Intelligence and Multimedia Applications 2007. 0-7695-3050-8/07 $25.00 © 2007 IEEE. DOI 10.1109/ICCIMA.2007.11


these methods deteriorates as user interaction is reduced. This paper first presents an object extraction technique for natural images, and then extends it to extracting a moving object from a video shot. Our technique requires only minimal user interaction, in the form of a bounding box around the foreground object to be extracted. We have conducted usability studies to compare our method with state-of-the-art interactive matting methods.

2 Related Work

The problem of alpha matting has been researched for almost half a century. Blue-screen matting was mathematically formalized by Smith and Blinn [11], where the problem is simplified by photographing the foreground object against a constant-colored background. However, this method is limited to simple backgrounds of a solid color, and errors often occur when foreground objects contain colors similar to the background.

Recent approaches attempt to extract foreground mattes directly from natural images. The most successful systems include Knockout 2 [2], Intelligent Scissors [8], the approach proposed by Ruzon and Tomasi [10], Bayesian matting [13], Poisson matting [4], GrabCut [9], Lazy Snapping [7], Belief Propagation [12] and the Closed-Form solution [6]. All these systems start by having the user segment the image into three regions: definitely foreground, definitely background, and unknown. The problem is thus reduced to estimating F, B and α in the unknown region.

The Knockout system extrapolates known foreground and background colors into the unknown region and estimates alphas accordingly. Ruzon and Tomasi were the first to take a probabilistic view of the problem: they analyze foreground and background color distributions and use them for alpha estimation. This approach was further improved by the Bayesian matting system, which formulates the problem in a well-defined Bayesian framework and solves it using the maximum a posteriori (MAP) technique. The Poisson matting approach estimates the matte from an image gradient by solving Poisson equations using boundary information from the trimap.

The Lazy Snapping and GrabCut systems translate simple user-specified scribbles or a bounding box into a min-cut problem. Solving the min-cut problem yields a hard segmentation, rather than a fractional alpha matte (Figure 1(b)). The hard segmentation could be transformed into a trimap by erosion, but this could still miss some fine or fuzzy features (Figure 1(c)). Although Rother et al. [9] do perform border matting by fitting a parametric alpha profile in a narrow strip around the hard boundary, wide fuzzy regions cannot be handled in this manner.
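The erosion step mentioned above can be sketched as follows. This is our own minimal, numpy-only stand-in for a morphological erosion (a real system would use a morphology library); it shows how a band of "unknown" pixels appears around the hard boundary, and why features thinner than the erosion radius vanish:

```python
import numpy as np

def erode(mask, r=1):
    """Binary erosion with a (2r+1)x(2r+1) square structuring element,
    done by shifting and AND-ing (a minimal stand-in for a morphology op)."""
    out = mask.copy()
    padded = np.pad(mask, r, mode="edge")
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out &= padded[r + dy : r + dy + mask.shape[0],
                          r + dx : r + dx + mask.shape[1]]
    return out

# Hard segmentation: True = foreground.
seg = np.zeros((7, 7), dtype=bool)
seg[2:5, 2:5] = True

fg = erode(seg)            # definite foreground
bg = erode(~seg)           # definite background
trimap = np.full(seg.shape, "unknown", dtype=object)
trimap[fg] = "fg"
trimap[bg] = "bg"
# Pixels near the hard boundary end up "unknown"; fine hair-like features
# thinner than the erosion radius are erased entirely -- the failure mode
# discussed above.
```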

Wang and Cohen have proposed a scribble-based method for interactive matting [12]. Starting from a few scribbles indicating a small number of background and foreground pixels, they use belief propagation to iteratively estimate the unknowns at every pixel in the image. While this approach has produced some impressive results, it has the disadvantage of employing an expensive iterative non-linear optimization process, which might converge to different local minima.

Our approach is closely related to the Closed-Form solution method [6]. This method propagates scribbled constraints to the entire image by minimizing a quadratic cost function. However, it does not make use of global color models for F and B, and thus fails to produce an accurate matte and requires more user interaction in some cases.

3 Foreground Extraction from an Image

In our system the user provides constraints on the matte by drawing a bounding box around the foreground object. We use global color information in the form of GMMs along with the local smoothness information to extract the matte of the foreground object.

Figure 1. (a) An image with sparse constraints: a bounding box around the foreground object to be extracted. Foreground extraction algorithms, such as [7, 9], produce a hard segmentation (b). An automatically generated trimap from a hard segmentation may miss fine features (c). An accurate hand-drawn trimap (d) is required in this case to produce a reasonable matte (e).

Our approach proceeds iteratively, where in each iteration the pixels in the unknown region are classified into either foreground or background with certain likelihoods. As the classification converges, we use the likelihoods of each pixel and minimize the quadratic cost function proposed by Levin et al. [6] to estimate the alpha matte.

We treat the region outside the bounding box as the initial background region, and the region inside the bounding box as the unknown region. An initial segmentation is made in which all unknown pixels are tentatively placed in the foreground class and all known background pixels are placed in the background class. Given this initial information, each GMM, one for the background and one for the foreground class, is taken to be a full-covariance Gaussian mixture with K components (typically K = 5). Each pixel in the foreground class is assigned to the most likely Gaussian component in the foreground GMM. Similarly, each pixel in the background class is assigned to the most likely background Gaussian component. A graph is built and GraphCut is performed, as described by Boykov and Jolly [1], to find a new tentative foreground and background classification of pixels. This process is repeated with the new foreground and background pixels until the classification converges.
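The initialization and reassignment loop can be sketched roughly as below. This is a hypothetical simplification, not the paper's implementation: a nearest-mean clustering stands in for fitting full-covariance GMMs and picking the most likely component, and an independent per-pixel decision stands in for the GraphCut step, so it only conveys the overall flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a bright foreground blob, darker background everywhere else.
img = rng.normal(0.2, 0.05, size=(20, 20, 3))
img[5:15, 5:15] = rng.normal(0.8, 0.05, size=(10, 10, 3))

# Bounding box -> initial labels: outside = background, inside = tentative fg
# (here the box exactly fits the blob; in practice it is loose and the inside
# is only tentatively foreground).
fg_mask = np.zeros((20, 20), dtype=bool)
fg_mask[5:15, 5:15] = True

K = 5  # components per class, as in the paper
def fit_components(pixels, K):
    """Crude K-component model: nearest-mean clustering as a stand-in for
    fitting a full-covariance GMM and choosing the most likely component."""
    means = pixels[rng.choice(len(pixels), K, replace=False)]
    for _ in range(10):
        labels = np.argmin(((pixels[:, None] - means[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                means[k] = pixels[labels == k].mean(axis=0)
    return means

fg_means = fit_components(img[fg_mask], K)
bg_means = fit_components(img[~fg_mask], K)

# One reassignment step: each pixel goes to whichever class explains it better
# (in the paper this decision is made jointly by GraphCut, not per pixel).
d_fg = ((img[..., None, :] - fg_means) ** 2).sum(-1).min(-1)
d_bg = ((img[..., None, :] - bg_means) ** 2).sum(-1).min(-1)
new_fg = d_fg < d_bg
```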

In GraphCut there are two types of links: N-links connect pixels in an 8-neighborhood, and T-links connect each pixel to the foreground and background terminal nodes. The N-link weights describe the penalty for placing a segmentation boundary between neighboring pixels. The appropriate N-link weight between pixels m and n is [9]:

N(m, n) = (50 / dist(m, n)) exp(−β ‖z_m − z_n‖²)   (2)

where z_m is the color of pixel m, and β is chosen [1] as β = (2⟨(z_m − z_n)²⟩)^{-1}, with ⟨·⟩ denoting expectation over an image sample.

There are two T-links for each pixel: the Background T-link connects the pixel to the Background node, while the Foreground T-link connects it to the Foreground node. The weights of these links depend on the state of the classification. If the user has indicated that a particular pixel is definitely foreground or definitely background, we reflect this by weighting the links such that the pixel is forced into the appropriate group. For unknown pixels we use the probabilities obtained from the GMMs to set the weights. The T-link weights for a pixel m are shown in Table 1.
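Under these definitions, the N-link weights of equation (2) can be computed directly. The sketch below is our own illustration (horizontal neighbors only, rather than the full 8-neighborhood); it shows that weights drop sharply across color edges, which is what steers the min cut along object boundaries:

```python
import numpy as np

def n_link_weights(z, beta=None):
    """N-link weights of eq. (2) for horizontal neighbours only,
    as a minimal illustration; the paper uses the full 8-neighbourhood.
    z: HxWx3 float image."""
    diff2 = ((z[:, 1:] - z[:, :-1]) ** 2).sum(-1)   # ||z_m - z_n||^2
    if beta is None:
        # beta = (2 <(z_m - z_n)^2>)^-1, the usual GrabCut choice
        beta = 1.0 / (2.0 * diff2.mean() + 1e-12)
    dist = 1.0  # horizontal neighbours are 1 pixel apart
    return (50.0 / dist) * np.exp(-beta * diff2)

z = np.zeros((4, 4, 3))
z[:, 2:] = 1.0                      # color edge between columns 1 and 2
W = n_link_weights(z)
# Weights across the color edge are much smaller than within flat regions,
# so the min cut prefers to pass along the edge.
```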

where PFore and PBack are the likelihoods that the pixel belongs to the foreground and background GMMs respectively.

Table 1. T-link weights of a pixel used for the minimum cut.

Pixel Type            | Background T-link | Foreground T-link
m ∈ Trimap Foreground | 0                 | X
m ∈ Trimap Background | X                 | 0
m ∈ Trimap Unknown    | PFore(m)          | PBack(m)

These likelihoods are computed for pixel m as follows:

P(m) = −log Σ_{i=1}^{k} π_i (1/√(det Σ_i)) exp(−(1/2) [z_m − μ_i]^T Σ_i^{-1} [z_m − μ_i])   (3)
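Equation (3) translates directly into code. The following is our own sketch: the component parameters are invented for the example, and the constant (2π)^{3/2} of the full Gaussian density is dropped, as in equation (3) itself (only relative values matter for the T-link weights):

```python
import numpy as np

def neg_log_likelihood(z, pis, mus, sigmas):
    """Eq. (3): P(m) = -log sum_i pi_i (1/sqrt(det Sigma_i))
                           exp(-0.5 (z - mu_i)^T Sigma_i^{-1} (z - mu_i))."""
    total = 0.0
    for pi, mu, sigma in zip(pis, mus, sigmas):
        d = z - mu
        m = d @ np.linalg.inv(sigma) @ d        # Mahalanobis distance
        total += pi / np.sqrt(np.linalg.det(sigma)) * np.exp(-0.5 * m)
    return -np.log(total)

# Two-component toy model: one dark, one bright component.
pis = [0.5, 0.5]
mus = [np.zeros(3), np.ones(3)]
sigmas = [0.01 * np.eye(3), 0.01 * np.eye(3)]

dark = neg_log_likelihood(np.zeros(3), pis, mus, sigmas)
gray = neg_log_likelihood(0.5 * np.ones(3), pis, mus, sigmas)
# A pixel far from both components gets a larger penalty P(m).
```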

where i indexes a GMM component of either the background or the foreground model, i ∈ {1, 2, ..., k}; π_i is the weighting mixture coefficient, μ_i the mean, and Σ_i the covariance of the Gaussian component. X is a large constant value, calculated as follows to ensure that it is the largest weight in the graph:

X = max_m Σ_{n:(m,n)∈E} N(m, n)   (4)

where E is the set of all edges joining neighboring pixels.

The alpha values for unknown pixels are set based on the likelihoods that they belong to the foreground or background (PFore, PBack):

α_m = 1         if (PBack(m) − PFore(m)) > τ
    = 0         if (PFore(m) − PBack(m)) > τ
    = unknown   if |PFore(m) − PBack(m)| < τ     (5)
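The thresholding of equation (5) is straightforward. In the sketch below (our own illustration; the likelihood values are invented), NaN marks the "unknown" pixels whose alpha is left to the subsequent closed-form estimation:

```python
import numpy as np

def init_alpha(p_fore, p_back, tau):
    """Eq. (5): set alpha from the negative log-likelihoods of eq. (3).
    Returns 1.0 / 0.0 / NaN (NaN marks 'unknown' pixels)."""
    alpha = np.full(p_fore.shape, np.nan)
    alpha[(p_back - p_fore) > tau] = 1.0   # clearly foreground
    alpha[(p_fore - p_back) > tau] = 0.0   # clearly background
    return alpha

p_fore = np.array([1.0, 9.0, 5.2])
p_back = np.array([9.0, 1.0, 5.0])
alpha = init_alpha(p_fore, p_back, tau=2.0)
# The third pixel is ambiguous (|difference| < tau) and stays unknown.
```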

where τ is a positive threshold value.

A window of 3×3 pixels is placed around each pixel whose alpha value is unknown, and the accurate alpha is estimated by minimizing the quadratic cost function [6]:

J(α) = α^T L α   (6)

Here L is an N×N matrix, whose (i, j)-th element is:

L(i, j) = Σ_{k | (i,j) ∈ w_k} ( δ_ij − (1/|w_k|) (1 + (z_i − μ_k)^T (Σ_k + (ε/|w_k|) I_3)^{-1} (z_j − μ_k)) )   (7)

where δ_ij is the Kronecker delta, z_i is the color value of pixel i, μ_k is the 3×1 mean vector and Σ_k the 3×3 covariance matrix of the colors in window w_k, ε is a small positive constant, and I_3 is the 3×3 identity matrix.

The alpha matte can be extracted by solving for

α = arg min [ α^T L α + λ (α − b_s)^T D_s (α − b_s) ]   (8)

where λ is some large number, D_s is a diagonal matrix whose diagonal elements equal 1 for pixels whose α is known and 0 for all other pixels, and b_s is a vector containing the known α values for those pixels.
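Setting the gradient of (8) to zero gives the linear system (L + λD_s)α = λD_s b_s. The toy solve below is our own illustration: a 1-D graph Laplacian stands in for the matting Laplacian of equation (7) (it shares the relevant structure: symmetric, positive semi-definite, zero row sums), and the constrained end pixels pin α to 1 and 0:

```python
import numpy as np

# A toy 5-pixel "image" on a line; L is a path-graph Laplacian standing in
# for the matting Laplacian of eq. (7).
n = 5
L = 2.0 * np.eye(n)
L[0, 0] = L[-1, -1] = 1.0
for i in range(n - 1):
    L[i, i + 1] = L[i + 1, i] = -1.0

known = np.array([1, 0, 0, 0, 1], dtype=float)  # D_s: end pixels constrained
bs = np.array([1.0, 0, 0, 0, 0.0])              # alpha = 1 at left, 0 at right
lam = 1e6                                       # "some large number"

Ds = np.diag(known)
# (L + lam*Ds) alpha = lam * Ds @ bs, from setting the gradient of (8) to zero.
alpha = np.linalg.solve(L + lam * Ds, lam * (Ds @ bs))
# alpha decreases smoothly from ~1.0 to ~0.0 across the unknown pixels.
```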


Figure 2. (a) A video frame. (b) Moving foreground layer obtained by GraphCut-based motion segmentation [3]. (c) Matte extracted using the proposed matting technique. (d) Foreground object composited into a new background.

4 Extension to video

Our algorithm can also be applied to extracting a moving object from a video. A straightforward way to extend interactive foreground extraction techniques to video is to operate on each frame individually. This fails in two ways: painting foreground and background strokes on every frame would be tedious for the user, and slight differences in the extraction from frame to frame would lead to a lack of temporal coherence. Our method instead uses motion information along with the proposed matting technique to extract the moving object without any user interaction.

Initially we perform motion segmentation based on GraphCuts [3] to detect the moving foreground layer in the video. This layer helps in labeling the pixels as known background or unknown, as shown in Figure 2(b). When this information is given to the matting algorithm proposed in Section 3, it results in a good matte. Figure 2(c) shows the matte extracted from a video sequence, and Figure 2(d) shows the composition of the foreground object onto a new background.
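As a rough, hypothetical stand-in for the GraphCut-based motion segmentation of [3] (not the method actually used in the paper), a median background model over a shot with a stationary background already yields the known-background/unknown labeling that the matting stage consumes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic shot: static noisy background, a bright object moving right.
T, H, W = 10, 16, 16
frames = rng.normal(0.2, 0.02, size=(T, H, W))
for t in range(T):
    frames[t, 6:10, t:t + 4] = 0.9       # moving foreground patch

# Stationary-background estimate: per-pixel median over the shot.
bg_model = np.median(frames, axis=0)

def motion_labels(frame, bg_model, thresh=0.2):
    """'unknown' (True) where the frame deviates from the background model,
    known background (False) elsewhere -- the labeling fed to the matting
    stage in place of a user-drawn bounding box."""
    return np.abs(frame - bg_model) > thresh

labels = motion_labels(frames[5], bg_model)
```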

5 Results and Comparisons

The proposed approach has been tested on a variety of images. Figure 3 shows the mattes extracted for a challenging image used in [12] and compares our result to several other recent algorithms. One can observe that our result is comparable in visual quality to those of the Wang-Cohen and Closed-Form matting methods [12, 6]. Figure 4 shows an example (from [6]) where Wang-Cohen's method fails to extract a good matte from sparse scribbles due to color ambiguity between the foreground and background. The Closed-Form method [6] also fails to give a perfect matte from the same set of scribbles. Our method produces a cleaner matte with less interaction from the user.

Figure 5 shows a comparison with the Closed-Form method [6]. Since Closed-Form matting [6] does not make use of global color models, it fails to produce an accurate matte: even though the black scribble covers all colors of the background, the generated matte includes parts of the background. Since our method makes use of the global color distribution, it can handle such situations.

We have also evaluated our video matting algorithm on several videos of 50-200 frames in length. The source videos were taken with a camera at 320×240 resolution. The overall processing time for a video clip of 30 frames was less than half an hour; about 40% was spent on motion segmentation and 60% on alpha estimation. Figure 2 shows the extraction of a foreground object from a video sequence, and Figure 2(d) shows the composition of the extracted object into a new background.

Figure 3. A comparison of alpha mattes extracted by different algorithms: (a) input images for the different methods; (b) the extracted alpha mattes.

6 Summary and Conclusion

Image and video matting tasks are by definition ill-posed and generally require user interaction. The performance (in terms of visual quality) of most existing algorithms deteriorates rapidly as the amount of user interaction decreases. We have proposed and demonstrated an approach that unifies global color features with a local smoothness constraint to estimate alpha values with a small amount of user interaction. Unlike previous approaches, our method does not require a well-specified trimap or scribbles. Our experiments on real images and videos show that our algorithm clearly outperforms (visually) other algorithms.

Figure 4. An example from [6] with color ambiguity between foreground and background. (a) Scribbles and matte by [12]. (b) Results of [12] using a trimap. (c) Result of [6] using similar scribbles. (d) Matte extracted by the proposed method using a simple bounding box.


Figure 5. A comparison of alpha mattes extracted by [6]. [6] fails in (a, b, d) due to the lack of a global color model; (c, e) are our results.

In the future, we would like to extend our matting algorithm to extracting shadows from an image with little interaction from the user. We also plan to extend our video matting work to simultaneously extract several moving objects from a video sequence.

References

[1] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, pp. 105-112, 2001.

[2] Corel Corporation. Knockout user guide, 2002.

[3] N. Howe and A. Deschamps. Better foreground segmentation through graph cuts. Technical report, http://arxiv.org/abs/cs.CV/0401017, 2004.

[4] J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum. Poisson matting. ACM SIGGRAPH, 23(3):315-321, 2004.

[5] L. Grady, T. Schiwietz, S. Aharon, and R. Westermann. Random walks for interactive alpha-matting. In Proceedings of Visualization, Imaging and Image Processing (VIIP), pp. 423-429, 2005.

[6] A. Levin, D. Lischinski, and Y. Weiss. A closed form solution to natural image matting. In CVPR, pp. 61-68, 2006.

[7] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum. Lazy snapping. In ACM SIGGRAPH, pp. 303-308, 2004.

[8] E. Mortensen and W. Barrett. Intelligent scissors for image composition. In ACM SIGGRAPH, pp. 191-198, 1995.

[9] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23(3):309-314, 2004.

[10] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In CVPR, pp. 18-25, 2000.

[11] A. Smith and J. Blinn. Blue screen matting. In ACM SIGGRAPH, pp. 259-268, 1996.

[12] J. Wang and M. F. Cohen. An iterative optimization approach for unified image segmentation and matting. In ICCV, pp. 936-943, 2005.

[13] Y.-Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In CVPR, pp. 264-271, 2001.
