

High Resolution Matting via Interactive Trimap Segmentation

Technical report corresponding to the CVPR’08 paper, TR-188-2-2008-04

Christoph Rhemann¹*, Carsten Rother², Alex Rav-Acha³, Toby Sharp²

¹Institute for Software Technology and Interactive Systems, Vienna University of Technology, Austria
²Microsoft Research Cambridge, UK
³The Weizmann Institute of Science, Rehovot, Israel

Abstract

We present a new approach to the matting problem which splits the task into two steps: interactive trimap extraction followed by trimap-based alpha matting. By doing so we gain considerably in terms of speed and quality and are able to deal with high resolution images. This paper has three contributions: (i) a new trimap segmentation method using parametric max-flow; (ii) an alpha matting technique for high resolution images with a new gradient preserving prior on alpha; (iii) a database of 27 ground truth alpha mattes of still objects, which is considerably larger than previous databases and also of higher quality. The database is used to train our system and to validate that both our trimap extraction and our matting method improve on state-of-the-art techniques.

1. Introduction

Natural image matting addresses the problem of extracting an object from its background by recovering the opacity and foreground color of each pixel. Formally, the observed color C is a combination of foreground (F) and background (B) colors:

C = αF + (1 − α)B (1)

interpolated by the opacity value α. (This simplified model will be reconsidered later.) Matting is a highly under-constrained problem and hence user interaction is essential.
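As an illustration (not part of the paper's implementation; array names are ours), the per-pixel compositing model of eq. 1 can be sketched in NumPy:

```python
import numpy as np

def composite(alpha, F, B):
    # Compositing equation (1): C = alpha * F + (1 - alpha) * B
    return alpha * F + (1.0 - alpha) * B

F = np.array([1.0, 0.0, 0.0])  # pure red foreground
B = np.array([0.0, 0.0, 1.0])  # pure blue background
C = composite(0.5, F, B)       # half-transparent pixel -> [0.5, 0.0, 0.5]
```

Recovering α, F and B from C alone is exactly the under-constrained inverse of this one-line forward model.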

In this introduction we will first consider the user aspects of the matting problem and then compare the different matting approaches themselves. Previous work in this domain can be broadly classified into three types of user interfaces.

The first class of interface is based on trimaps [3, 16, 21, 19, 4]. First the user paints a trimap by hand as accurately as possible, i.e. each pixel is assigned to one of three classes: foreground (F), background (B) or unknown (U) (e.g. fig. 1(middle)). In a perfectly tight trimap the α values in U are

∗This work was supported in part by Microsoft Research Cambridge through its PhD Scholarship Programme and a travel sponsorship.

above 0 and below 1, and F and B regions have only α values which are exactly 0 and 1 respectively. The information from the known regions (F, B) is used to predict for each unknown pixel the values for F, B and α. It has been shown [21], and we will confirm it, that if the trimap is perfect (or nearly perfect), the resulting matte is of very high quality. The recent soft scissors approach [19] is probably the most sophisticated trimap “paint tool”. It builds on the intelligent scissors approach [13], which gives a hard segmentation, i.e. only F, B labels. In soft scissors the user traces the boundary with a “fat brush”. The brush size is adapted according to the underlying data and intermediate results of the matte are shown, enhancing the user experience. The main drawback of such a brush tool is that objects with a long boundary or complicated boundary topology are very tedious to trace, e.g. a tree with many foreground holes or the example in fig. 1(left).

Mainly due to this drawback, the trend in hard segmentation has been to move from boundary selection tools like intelligent scissors [13] to scribble-based region selection tools [2, 15]. This second class of interfaces is more user-friendly since only a few pixels have to be assigned to F or B, which are ideally far away from the boundary. Impressive results were achieved for hard segmentation [2, 15, 1] and also to some extent for matting [9, 20, 11, 5]. Note, a simple approach to obtain a soft matte from a hard segmentation is to run existing trimap-based matting techniques in a band of constant width around the hard segmentation, as done in e.g. [15, 1]. In contrast to this, our work computes an adaptive band which respects the underlying data.

We now review scribble-based matting methods. In [9, 11] a pure local “propagation” approach is taken and no global color information (i.e. outside a small window) is used. If the assumption holds that all colors within a small window around any unknown pixel lie on a line in color space, then this approach obtains the ground truth matte. We observed in our experiments that this approach obtains good results for relatively tight trimaps (or scribbles), but it per-


Figure 1. Our interactive matting framework. Given an input image (left) the user interactively creates a trimap (middle). From the trimap an α matte (right) is computed, first in low and then in high resolution, together with the true fore- and background colors. The user interactions consist of three types of brushes (F-foreground (red), B-background (blue), and U-unknown (green)) and a slider for the trimap size (interactive due to the recently developed parametric maxflow technique [8]). We believe that this approach has several benefits over existing ones: speed, quality, and user friendliness (see text for details).

Figure 2. Comparison of scribble-based matting approaches (see text for discussion). Scribbles are marked in either red (fgd) and blue (bkg) or white (fgd) and black (bkg). Our result was achieved with a single bounding box selection, inspired by [15], and one additional background brush stroke. Note, our approach can also handle more challenging α mattes, e.g. fig. 1. All results we show were either taken from the original papers or created with the original implementation of the respective authors.

forms quite poorly on sparse scribble inputs. An example is shown in fig. 2(a) and fig. 7 in [9], where even relatively tight scribbles cause problems. A plausible explanation is that global color models do help considerably to overcome local ambiguities. In contrast, [20, 5] use both global color models and local propagation. In [20] an iterative approach is suggested where a global color model is used only for pixels with low confidence. In [5] the global color model of the brush strokes is used to reason about all unknown pixels (very similarly to [21]). However, even [20, 5] do not perform well for the example in fig. 2(b) and fig. 5b in [20] using scribbles. In order to achieve a good result for this example, previous results have required either many scribbles [9] (fig. 2(c)) or a very tight trimap [20] (fig. 5e in [20]).

Recently an intermediate solution has been suggested by Juan and Keriven [7]: the user interactively creates a trimap using a scribble-based interface (see fig. 1). In our work we use the same approach since we believe it to be an intuitive interface and it has an advantage in speed and quality compared to the approaches above. While the user creates the trimap, a preview of the matte is computed (fig. 1 right) and shown to the user. But even before a preview is available, the user can interactively adjust the trimap and remove obvious mistakes in order to simplify the matting task. In this process all images are scaled down to typically 0.3 Mpix. In a final step a high resolution matte, e.g. 6 Mpix, is computed, which in our case is a slow process. It is important to note that, in contrast to the multi-resolution approach for hard segmentation [12], sub-pixel structure is captured in the unknown trimap region. Additionally, the user has a slider for controlling the trimap size, which is interactive due to parametric max-flow [8]. The main benefit in speed comes from the fact that in a typical image most pixels belong solely to either fore- or background. For these pixels expensive matting algorithms which recover the full range of fractional α should not be invoked. For example, for a typical small resolution image (0.3 Mpix), [20] and [9] have reported a runtime of about 20 sec, [5] of about

Figure 3. Unary terms for trimap segmentation. (a) Input image with user scribbles (red-foreground, blue-background). (b) Unary energy for the sub-blur kernel structure term and (c) color term. Dark indicates low energy and white high energy. (d) Pixels in a small band around the F′, B′ transition of GrabCut [15] (green line) are classified into physically sharp boundary pixels, indicated in bright red (the image was darkened for better visibility). The class prior is not visualized since it is constant over the whole image.

200 sec, and we achieve with a tight trimap a runtime of 3.5 sec using [21]. Moreover, not only speed but also the quality of the matte is improved, as shown in fig. 2(d). Our advantage over scribble-based systems and [7] is that we exploit the result of a hard segmentation system such as [15] to build better global color models, and to detect and model physically sharp boundaries. Furthermore, we use our large ground truth dataset to build a classifier for potential trimap pixels, and to train the parameters of our energy (in particular we learned a predictor for the trimap size λ).

For the second task of trimap-based alpha matting we concentrate on two challenges, which we believe have not yet been solved: (i) working with high resolution images and (ii) finding a good prior for α. The novel ideas are an edge-preserving sparsity prior for α and the use of the camera's point spread function to model most fractional α values in high resolution images.

Finally, using our new ground truth database we are able to show that we outperform both existing trimap creation approaches and trimap-based matting methods.

The paper is organized as follows. Section 2 explains our trimap extraction method, section 3 the trimap-based matting approach, and finally section 4 introduces our database and describes experiments.

2. Interactive Trimap Segmentation

In the following we extend the approach of Juan and Keriven [7]. We denote the color of pixel i as c_i, its label as x_i, and its α value as α_i. Let I be the set of all pixels. Formally, the three subsets F, B and U (see fig. 1) are defined as B = {i | α_i < ε}, F = {i | α_i > 1 − ε} and U = I \ (F ∪ B), where we choose ε = 5/255. (For simplicity, sets and labels have the same name, e.g. F.) We also introduce two extra subsets F′, B′, where B′ = {i | α_i ≤ 0.5} and F′ = {i | α_i > 0.5}. The transition from F′ to B′ is the 0.5 level-line of a hard segmentation. Obviously, F ⊂ F′, B ⊂ B′ and F′ ∪ B′ = F ∪ B ∪ U = I.
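The set definitions above translate directly into boolean masks over an alpha matte (an illustrative NumPy sketch, not the paper's code):

```python
import numpy as np

def trimap_sets(alpha, eps=5.0 / 255.0):
    # Set definitions from sec. 2: B = {a < eps}, F = {a > 1 - eps}, U = rest;
    # hard sets B' = {a <= 0.5}, F' = {a > 0.5}.
    B = alpha < eps
    F = alpha > 1.0 - eps
    U = ~(B | F)
    Bp = alpha <= 0.5
    Fp = alpha > 0.5
    return B, F, U, Bp, Fp

alpha = np.array([0.0, 0.01, 0.4, 0.6, 0.99, 1.0])
B, F, U, Bp, Fp = trimap_sets(alpha)
# F is a subset of F', B of B', and F', B' partition all pixels:
assert (B <= Bp).all() and (F <= Fp).all() and (Fp ^ Bp).all()
```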

In [7], an energy for the three labels F, B, U was defined and optimized globally. Ideally we would like to define an energy for all 5 labels: F, F′, B, B′, U. It has the main advantage that the transition F′, B′ is modelled to coincide with an image edge (as in [2, 15]), whereas other transitions, e.g. B, U, are not data dependent (e.g. Potts model in [7]). However, instead of optimizing an energy for all 5 labels, we employ a 2-step process that allows a more expressive model and higher speed. First, we obtain a hard binary segmentation into the sets F′ and B′ using GrabCut [15]. The energy and parameter settings are as defined in [15] and the interested reader is referred to the respective paper for details. Following the hard binary segmentation we compute a trimap segmentation with labels F, B and U (sec. 2.1). The energy function considers several image cues, and four different types of priors are used to regularize the result (visualized in fig. 3). We show that the trimap segmentation can be formulated as a binary classification problem and minimized with graph cut. Finally, in sec. 2.2, we show how to learn the parameters for trimap segmentation from a training data set. Please note that in this section we assume the image to be of a small size, typically 0.3 Mpix.

2.1. Trimap Segmentation - Model

In this step all pixels will be assigned to one of the three labels F, B or U. We assume that each pixel has already been classified into F′ or B′ using GrabCut [15]. Since F ⊂ F′ and B ⊂ B′, a binary classification into the two labels U and Ū is sufficient, where Ū = F ∪ B. (Each pixel in Ū is uniquely specified to be in either F or B given F′, B′.) Note that α has to be just fractional (0 < α < 1) at the boundary of the hard segmentation (F′ to B′ transition) and not necessarily exactly 0.5. We define the binary energy E for the trimap extraction as:

E(x, θ) = Σ_{(i,j)∈N} [ θ_b V^b_ij(x_i, x_j) + V^s_ij(x_i, x_j) ] + Σ_i [ U^c_i(x_i) + U^p_i(x_i) + θ_b′ U^b_i(x_i) + θ_s (U^s_i(x_i))^{θ_s′} ]   (2)

where N is the set of neighboring pixels (8-neighborhood), and θ comprises all model parameters. The energy can be locally optimized using graph cut [15] or parametric maxflow [8], depending on the choice of the free parameters during test time (see below). The individual terms are defined as follows.

Color (c) The color unary term for pixels x_i ∈ Ū is modeled as U^c_i(x_i) = −log P(c_i | θ_GF) if x_i = F′, and U^c_i(x_i) = −log P(c_i | θ_GB) if x_i = B′. Here θ_GF and θ_GB are the Gaussian Mixture Models (GMMs) of fore- and background respectively (see sec. 2.2 for computational details). In the unknown region the color distribution of U^c_i(x_i) with x_i ∈ U is represented by a third GMM θ_GU, built by blending all combinations of fore- and background mixtures of the respective GMMs as in [7] (see example in fig. 3(c)). Motivated by [22], and in contrast to [7], the distribution of the blending coefficients is modeled as a beta distribution whose two free parameters were derived as (0.25, 0.25) from our training set of ground truth alpha mattes (see sec. 4).

Class prior (p) The probability of a pixel belonging to the class U or Ū is image dependent. For instance, an image where the foreground object has been tightly cropped has a different proportion of U versus Ū pixels than the original image. We model this ratio by an unary term as:

U^p_i(x_i) = λ [x_i ≠ U] ,   (3)

i.e. a larger λ gives a larger U region. We show that predicting λ during test time improves the performance considerably. The learning of the predictor is discussed below (sec. 2.2). Furthermore, the parameter λ is also exposed to the user as a slider. Due to the nestedness property [8] of the solutions for all λ's, the λ slider corresponds to the size of the trimap. Note, the solution of our energy for all λ's can be efficiently optimized using parametric maxflow [8].

Sub-blur kernel structure (s) There are many different reasons for a pixel to have fractional α values (i.e. belong to U): motion blur, optical blur, sub-pixel structure, transparency, or discretization artifacts. Here we concentrate on optical blur. The goal is to detect thin structures which have a width that is smaller than the size of the camera's point spread function (PSF). These structures give rise to fractional α values but they may not be close to a physically sharp boundary (e.g. the hair in fig. 3(b)).

Consider fig. 4, which shows a 1-D example of a thin (top, left) and thick (bottom, left) hard boundary (sparse α^s), which is either 0 or 1. It is convolved with the camera's PSF, here a box filter, which gives α. In the bottom case two smooth boundaries appear, i.e. some α values remain 1, which ideally should be detected by the hard segmentation (and then handled by the sharp boundary term). We want to build a detector for the top case where the thin structure is smaller than the size of the blur kernel. We have experimented with many different first and second order derivative filters and found the following to work best. Roughly speaking, the magnitude of the first derivative of a size-s filter should be low and at the same time the magnitude of the derivatives of the two s/2-sized filters, shifted by s/2, should be high, where s is 2× the size of the blur kernel. Note that by choosing different values for s we can cope with sub-pixel structure, discretization, and interference of close-by thin structures. We confirmed experimentally that s = 5 gives the lowest error rate (area under ROC curve) over the training set.

Formally, we use max(0, |a − b| + |b − c| − |a − c|), where a, b, c are the left, center and right pixel values on a line segment of length 5. We made this (symmetric) detector orientation independent by taking the maximum response over four discrete angles (0°, 45°, 90°, 135°). All three color channels were weighted equally. A further improvement was achieved by setting those filter responses to 0 where not all pixels on the line segment were assigned to the same mixture in the GMM θ_GU. The underlying assumption is that in a small window the true fore- and background colors are similar. We define U^s as the filter response. Fig. 3(b) depicts a result for the image in fig. 3(a). It even works well for complicated interference patterns, e.g. many thin overlapping hairs. As desired, it also has a lower response at sharp boundaries and inside the true F and B areas.
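The detector formula can be sketched for a single-channel image as follows (our illustrative implementation; the GMM-mixture gating and per-channel weighting described above are omitted):

```python
import numpy as np

def sub_blur_response(img, s=5):
    """Thin-structure detector of sec. 2.1: for each pixel b and each of four
    orientations, take the endpoints a, c of a length-s line segment centered
    on b and compute max(0, |a-b| + |b-c| - |a-c|); keep the maximum over
    orientations."""
    h = s // 2
    resp = np.zeros(img.shape, dtype=float)
    for dy, dx in [(0, h), (h, 0), (h, h), (h, -h)]:  # 0, 90, 45, 135 degrees
        a = np.roll(img, (dy, dx), axis=(0, 1))
        c = np.roll(img, (-dy, -dx), axis=(0, 1))
        r = np.abs(a - img) + np.abs(img - c) - np.abs(a - c)
        resp = np.maximum(resp, np.maximum(r, 0.0))
    return resp
```

On a one-pixel-wide unit-contrast line the response is maximal (value 2), while on an ideal step edge |a − b| + |b − c| = |a − c| and the response vanishes, matching the behavior described above.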

Figure 4. A 1-D example of a thin (top, left) and thick (bottom, left) hard boundary (sparse α^s), which are convolved with the camera's PSF, which gives α (see text for details).

Sharp boundary (b) The F′, B′ transition determined by GrabCut [15] often coincides with a clean, physically sharp boundary. This means that in the vicinity of the detected boundary (defined by the PSF) there is no other boundary transition, hence no sub-blur kernel structure. An example is the body of the object in fig. 3(d). At such boundaries the width of U is equal to the width of the camera's PSF and thus is only a few pixels wide. To determine the physically sharp parts of the F′, B′ transition we have designed a simple classifier to detect whether the hard segmentation F′, B′ corresponds to a sharp boundary. We first run a matting algorithm [9] in a small band (twice the size of the blur kernel, centered on the F′, B′ transition), which is very efficient. Then a pixel i in this band belongs to a sharp boundary if the following conditions hold for a small window W_i (twice larger than the blur kernel) centered on i: a) the average α inside W_i is 0.5; b) half of the pixels in W_i are above 0.5; c) at least half of the pixels in W_i are close to 0 or 1 (rejecting smooth boundaries). Note that these conditions tolerate a shift of the boundary by half the size of the blur kernel. The classification error on our training set is 19.3%. Stronger conditions, which are computationally more expensive, could be considered in the future.
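Conditions a)-c) amount to three statistics of the α values in the window. A minimal sketch (the tolerance values are our assumptions; the paper does not state them):

```python
import numpy as np

def is_sharp_boundary(window, avg_tol=0.1, frac_tol=0.1, extreme_eps=0.1):
    """Conditions a)-c) of the sharp-boundary classifier for a window of
    alpha values around a band pixel; tolerances are illustrative."""
    w = np.ravel(window)
    a = abs(w.mean() - 0.5) <= avg_tol                  # a) average alpha ~ 0.5
    b = abs((w > 0.5).mean() - 0.5) <= frac_tol         # b) ~half of pixels above 0.5
    c = ((w < extreme_eps) | (w > 1.0 - extreme_eps)).mean() >= 0.5  # c) mostly near 0 or 1
    return bool(a and b and c)
```

A hard step window (half zeros, half ones) satisfies all three conditions, while a smooth linear ramp passes a) and b) but fails the sparsity condition c).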

Figure 5. Sharp boundary terms (see text for details).

The result of this classifier is used to model the boundary terms U^b and V^b. Consider fig. 5, which illustrates an in-focus region of a physically sharp boundary in a low resolution α matte. The red line is the result of the hard segmentation (F′, B′ transition). We force all pixels adjacent to the F′, B′ boundary to be in U using hard constraints for U^b (green pixels in fig. 5). Also some pixels, which are neighbors of green pixels, are forced to be in Ū (using hard constraints for U^b). These are those pixels which are close to a physically sharp boundary and are shown in red in fig. 5. Intuitively, these pixels form a barrier, which prevents U from leaking out at sharp boundaries. The pairwise term V^b is shown as a blue line in fig. 5. It forms an 8-connected path which follows tightly the green pixels. Intuitively, it enforces smoothness of this barrier. This means that it can close small gaps where our sharp boundary detector misclassified the F′, B′ transition.

Smoothness (s) Following [15] the smoothness term is defined as

V^s_ij(x_i, x_j) = δ(x_i = x_j) / dist(i, j) · ( θ_r exp(−β ‖c_i − c_j‖²) ) ,   (4)

where δ is the Kronecker delta, β is as in [15], and θ_r is defined below. As we show in sec. 4, it nicely preserves thin structures, e.g. hair, inside the unknown region.

We also enforce the U region to be 4-connected, which is true for 98.6% of the pixels in the ground truth database. Since enforcing connectivity is NP-hard [18], we do it by a simple post-processing step. This means that all disconnected islands of U are detected and removed.
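This post-processing step can be sketched with a connected-component labeling (our illustrative implementation; which component counts as "the" U region is our assumption, the paper only states that disconnected islands are removed):

```python
import numpy as np
from scipy import ndimage

def remove_islands(U_mask):
    """Post-process U to be 4-connected by keeping only its largest
    4-connected component and dropping all other islands."""
    four_conn = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-connectivity
    labels, n = ndimage.label(U_mask, structure=four_conn)
    if n <= 1:
        return U_mask.copy()
    sizes = ndimage.sum(U_mask, labels, index=np.arange(1, n + 1))
    return labels == (np.argmax(sizes) + 1)
```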

2.2. Trimap Segmentation - Training

For training we have used the following heuristic error (loss) function, which counts false negatives twice compared to false positives: error = (100/n) Σ_i 2[x^true_i = U ∧ x_i ≠ U] + [x^true_i ≠ U ∧ x_i = U], where x^true is the labeling of the ground truth trimap and n the number of pixels.

This is motivated by the fact that a missed unknown (U) pixel in the trimap cannot be recovered during alpha matting. We see in sec. 4 that it is indeed correlated to the error for α matting. Based on our training dataset of 20 images (see sec. 4) we have hand-tuned all the parameters in θ, except those discussed below, to {θ_b, θ_b′, θ_s, θ_s′, θ_r} = {2, 40, 1, 2, 0.1}.
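The loss function above reads directly as a few lines of NumPy (an illustrative sketch; inputs are boolean "is U" maps):

```python
import numpy as np

def trimap_loss(x_true_is_U, x_pred_is_U):
    # error = (100/n) * sum_i ( 2*[true U missed] + [non-U predicted as U] )
    n = x_true_is_U.size
    fn = (x_true_is_U & ~x_pred_is_U).sum()   # missed U pixels, counted twice
    fp = (~x_true_is_U & x_pred_is_U).sum()   # spurious U pixels
    return 100.0 * (2 * fn + fp) / n
```

For example, with one missed U pixel and one spurious U pixel among four pixels, the loss is 100/4 · (2 + 1) = 75.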

We have observed that the initialization of the color models θ_GF, θ_GB is rather crucial, and was not discussed in [7]. The reason is that the energy typically has many local minima with a high error, since e.g. a true F color can be modeled as the blend of a true B color with another true F. Surprisingly, even making the GMMs spatially dependent did not help to overcome these local minima. After initialization, the color models can be updated iteratively as in [15]. However, it is not guaranteed that the energy decreases, due to the mixture term U^c(U), and therefore we build the color models from a guessed trimap (see below), and do not iterate.

Figure 6. Visualizing the correlation between λ and the size of the U region (see text for details). [Scatter plot; x-axis: size of unknown region U (%), 0-25; y-axis: lambda, 0-5.]

As discussed above, λ (eq. 3) corresponds to the size of the region U. Fig. 6 shows, in blue dots, the optimal λ w.r.t. the size of region U for our training set. We see a strong correlation. To exploit this, we have built a predictor for the size of U (see below). The red dots in fig. 6 show the predicted values of our training data, and the red line is a quadratic function (3 parameters) fitted to them. We see that the red (predicted) and blue (true) points are close-by. During test time the size of U is predicted, given the test image, and the quadratic function provides the corresponding λ. Note, in practice λ is predicted after the first two F and B brush strokes and not changed afterwards, i.e. we do not alter our energy during the optimization. The dashed line in fig. 6 shows the optimal average λ = 2.3, which is independent of the size of U. It performs less well, as we see in sec. 4.
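Fitting the 3-parameter quadratic from predicted U sizes to λ can be sketched with a standard polynomial fit (the training pairs below are synthetic, NOT the paper's data, and serve only to show the procedure):

```python
import numpy as np

# Illustrative (size of U in %, optimal lambda) training pairs.
u_size = np.array([2.0, 5.0, 8.0, 12.0, 18.0, 25.0])
lam_opt = np.array([0.8, 1.2, 1.8, 2.4, 3.3, 4.5])

coeffs = np.polyfit(u_size, lam_opt, deg=2)  # 3-parameter quadratic fit

def predict_lambda(u_pred):
    # Map a predicted size of U (% of image pixels) to the class-prior weight.
    return float(np.polyval(coeffs, u_pred))
```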

Finally, we use the following heuristic to guess the initial trimap. We use the data terms U^c, U^s that are available at this point, and find the globally optimal trimap by simply thresholding the unary energy. Note, for U^c we initialize θ_GF and θ_GB with θ_GF′ and θ_GB′ as in [7]. On our training image set we have obtained an average prediction error for U of 1.5% relative to the image size.

3. High Resolution Alpha Matting

Given a trimap, we now describe our approach to matting. We base it on the method of Wang & Cohen [21], which was shown to produce state-of-the-art results from trimaps. They first obtain a pixel-wise estimation of α from high confidence color samples collected from the fore- and background regions of the trimap. This estimation is translated into two data terms W_F and W_B which are combined with a pairwise smoothness term and solved using random walk (see [21] for a more detailed description). In our implementation, we use the matting Laplacian L of [9] as a smoothness term for the matting, as it has a better theoretical justification, giving the following objective function:

J(α) = α^T L α + θ_α (α − 1)^T W_F (α − 1) + θ_α α^T W_B α   (5)

where α is treated as a column vector, 1 denotes the all-ones vector, W_F, W_B are written as diagonal matrices, and θ_α is the relative weighting of the data versus the smoothness terms (we use θ_α = 10⁻³). This objective function is minimized by solving a set of sparse linear equations, subject to the input constraints.
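Setting the gradient of eq. 5 to zero gives the sparse linear system (L + θ_α(W_F + W_B)) α = θ_α W_F 1. A minimal sketch with SciPy, using a toy 1-D chain Laplacian in place of the matting Laplacian of [9] (our illustration, not the paper's solver):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_matte(L, wF, wB, theta=1e-3):
    # dJ/dalpha = 0 for eq. (5): (L + theta*(W_F + W_B)) alpha = theta * W_F * 1
    A = L + theta * sp.diags(wF + wB)
    return spla.spsolve(A.tocsc(), theta * wF)

# Toy 1-D chain Laplacian with the end pixels clamped to F and B.
n = 5
L = sp.diags([-np.ones(n - 1), np.r_[1.0, 2.0 * np.ones(n - 2), 1.0], -np.ones(n - 1)],
             offsets=[-1, 0, 1])
wF = np.array([1e3, 0.0, 0.0, 0.0, 0.0])  # strong foreground data term at pixel 0
wB = np.array([0.0, 0.0, 0.0, 0.0, 1e3])  # strong background data term at pixel 4
alpha = solve_matte(L, wF, wB)            # decays linearly from 5/6 to 1/6
```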

Fig. 7(d) shows the result of this process for the input in (a), which is a crop of a 7.6 Mpix image of hair. The result looks too blurry compared to the ground truth in (c)¹. Note, the result of applying [9] directly, i.e. omitting the data terms W_F, W_B, gives a clearly inferior result. It was shown in [11] that an additional sparsity prior, i.e. pushing α towards 0 or 1, can solve some ambiguities in the estimation of α. However, [11] employs a simple pixel independent prior, and also the prior turns eq. 5 into a complicated non-linear system. Thresholding the initial α (fig. 7(f)) demonstrates the problem (loss of hair structure) of using a pixel independent prior on the blurry alpha. The result of [11] in fig. 7(g) is quite poor. Additionally, in [11] matting was performed on low resolution images, where α is even less sparse than in high resolution (compare fig. 7(b) and (c)).

In this work we suggest a novel sparsity prior. It is based on a model of the imaging process, where the observed high resolution α is the result of blurring the true sparse alpha α^s with the camera's point spread function (PSF). Hence, the observed α has the form α = K ⊗ α^s, as proposed in [6] for the case of motion blur, where ⊗ denotes convolution and the blur kernel K models the PSF. (Note that this approximates the real image compositing eq. 1.) Fig. 7(h) shows α^s derived from the ground truth α in (c) and our kernel K (see details below). It is nearly a binary mask apart from sub-pixel structure, discretization artifacts, object motion, and semi-transparency (such as window glass). Note, even a hair, as a physical entity, is opaque if the resolution is high enough and after deblurring. Here we assume that the object

¹The supplementary material shows the result of applying the original implementation of [21], which gives a more blurry result.

boundary is in-focus, i.e. we model the PSF of the in-focus area. Note, even in-focus pixels are slightly blurred due to imperfect camera optics. Note, Jia [6] computes K and α^s for motion blurred images given α by applying the method of [9]. In our work the “loop” is closed by improving α using α^s as a prior. We show that this works well for still images, and our framework is general enough to deal with motion blurred images, which we leave for future work.

Starting with the high resolution α we apply the following steps: (a) Initialize the PSF, (b) Deblur α to obtain the sparse α^s using the PSF, (c) Estimate a binary sparse alpha α^sb from α^s while preserving edges, (d) Iterate (a-c) a few times, (e) Convolve the binary α^sb with the PSF and use it to re-estimate α.² Intermediate results of the process are in fig. 7(i-l). Each step is now described in detail:

Estimating the PSF. We model the PSF as a symmetric kernel K of size 9 × 9 with non-negative elements which sum up to 1. Motivated by [6] we use the following heuristic to initialize K. For all pixels in U we compute those which may belong to a physically sharp boundary (sec. 2.1). If the acceptance rate is above 10%, we obtain K by minimizing the linear system ‖δ(α > 0.5) ⊗ K − α‖². Otherwise, a reliable initialization of K for the in-focus area is a Gaussian [14] (we use σ = 1.5). After the first iteration, the linear system is solved again using all in-focus pixels (see below), where δ(α > 0.5) is replaced by α^sb.

Alpha Deblurring and Binarization. To get the sparse alpha α^s from α and K we use the image-deblurring implementation of [10] (see fig. 7(i)). From α^s we construct the binary sparse alpha α^sb as follows. We observed that applying a per-pixel sparsity such as thresholding α^s removes many features, such as hair, from the matte. A much better binarization can be obtained by preserving the edges of α^s

(fig. 7(j)). To achieve this, we use the following MRF, andsolve it with graph cut (since E is submodular):

E(α^sb) = Σ_i U_i(α^sb_i) + θ_α1 Σ_{{i,j}∈N} V_ij(α^sb_i, α^sb_j),   (6)

where α^sb is a binary labeling and N denotes an 8-connected neighborhood. The terms U_i and V_ij are given by

U_i(α^sb_i) = |α^sb_i − α^s_i| + θ_α2 |α^sb_i|;   (7)
V_ij(α^sb_i, α^sb_j) = δ(α^sb_i ≠ α^sb_j) + θ_α3 (α^sb_i − α^sb_j)(α^s_j − α^s_i),

with the constants (θ_α1, θ_α2, θ_α3) = (5, 0.2, 0.002). The data term encourages the labeling to be similar to α^s (with a small preference towards 0). The pairwise term consists of both a regular smoothness and an edge-preserving term. Note that this edge-preserving smoothness is very different from the standard smoothness. In fact, with the edge-preserving term only, the global optimum typically contains thin (1-pixel wide) structures.
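As a concrete reading of eqs. 6-7, the following sketch evaluates the binarization energy for a given labeling. The graph-cut minimization itself is omitted, and the array names and slicing scheme are our own illustrative choices:

```python
import numpy as np

def binarization_energy(a_sb, a_s, theta1=5.0, theta2=0.2, theta3=0.002):
    """Evaluate E(a_sb) of eq. 6: per-pixel data terms U_i (eq. 7) plus
    pairwise terms V_ij summed over an 8-connected neighborhood."""
    # Data terms U_i: similarity to a_s plus a small preference towards 0.
    E = (np.abs(a_sb - a_s) + theta2 * np.abs(a_sb)).sum()
    h, w = a_sb.shape
    # The four offsets below enumerate every 8-connected pair exactly once.
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        ys, yt = slice(0, h - dy), slice(dy, h)
        xs = slice(max(0, -dx), w - max(0, dx))
        xt = slice(max(0, dx), w + min(0, dx))
        ai, aj = a_sb[ys, xs], a_sb[yt, xt]
        si, sj = a_s[ys, xs], a_s[yt, xt]
        # V_ij: Potts term plus the edge-preserving gradient-correlation term.
        V = (ai != aj).astype(float) + theta3 * (ai - aj) * (sj - si)
        E += theta1 * V.sum()
    return E
```

A uniform zero labeling on a zero α^s has zero energy, while labeling everything foreground is penalized only through the θ_α2 preference term.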

² In the formulation of [11], α can be replaced by α^s and K; however, we found this to be experimentally inferior to our approach.
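The least-squares PSF initialization described above can be sketched as follows. This is a naive dense formulation for a small image; a real implementation would restrict the system to in-focus pixels and enforce symmetry of K:

```python
import numpy as np

def estimate_psf(alpha, ksize=9):
    """Initialize a PSF kernel K by minimizing ||delta(alpha > 0.5) (x) K - alpha||^2
    in the least-squares sense (sketch; function name is ours)."""
    binmap = (alpha > 0.5).astype(np.float64)
    pad = ksize // 2
    padded = np.pad(binmap, pad, mode='edge')
    h, w = alpha.shape
    # Each column of A holds the binarized alpha shifted by one kernel offset,
    # so A @ k is the convolution of the binary map with kernel k.
    cols = []
    for dy in range(ksize):
        for dx in range(ksize):
            cols.append(padded[dy:dy + h, dx:dx + w].ravel())
    A = np.stack(cols, axis=1)                  # shape (h*w, ksize*ksize)
    k, *_ = np.linalg.lstsq(A, alpha.ravel(), rcond=None)
    k = np.clip(k, 0.0, None)                   # non-negative elements
    k /= max(k.sum(), 1e-12)                    # elements sum up to 1
    return k.reshape(ksize, ksize)
```

For an already binary α the recovered kernel is essentially a delta; for a blurred α it approximates the blur.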

Figure 7. The matting process (see detailed description in text). The key idea is that our final α matte (l) is derived from the initial matte of [21] (d) by imposing a new edge-preserving sparsity prior (k), which is better than a simple pixel-independent prior (f).

Re-estimating α using the Sparsity Prior. By convolving the binary α^sb with K, we construct a new sparsity prior s_α on the values of α (fig. 7(k)). This prior is added simply by replacing the data terms in eq. 5 with the new terms:

W′_B(i) = W_B(i) + θ_α4 s_α;   W′_F(i) = W_F(i) + θ_α4 (1 − s_α),

where θ_α4 = 5 is the relative weight of the new prior. The final alpha matte, shown in fig. 7(l), is less blurry than the initial matte in (d).

Handling Multiple PSFs. Due to depth variations there may be more than a single PSF along the object boundary. Ideally, this problem is handled by recovering depth at each pixel. In our work, however, we estimate a single PSF and assume it can describe all in-focus pixels well. The sparsity prior for other pixels is set to 0. First, we compute for each pixel the gradient of α^s, normalized by the range of values in a window (11 × 11) around each pixel. Then we compute the in-focus mask by thresholding this score (we used 0.4). Note that for out-of-focus regions our method is equivalent to regular matting methods and can therefore cope with multiple PSFs.
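A sketch of both constructions above, folding s_α into the data terms and computing the in-focus mask, might look like this. The FFT-based circular convolution and the naive window scan are our simplifications; W_B, W_F and θ_α4 follow the notation of eq. 5:

```python
import numpy as np

def add_sparsity_prior(W_B, W_F, alpha_sb, K, theta4=5.0):
    """W'_B = W_B + theta4 * s_alpha and W'_F = W_F + theta4 * (1 - s_alpha),
    where s_alpha is the binary alpha_sb convolved with the PSF K."""
    h, w = alpha_sb.shape
    Kpad = np.zeros((h, w))
    Kpad[:K.shape[0], :K.shape[1]] = K
    # Center the kernel so the FFT product implements a centered convolution.
    Kpad = np.roll(Kpad, (-(K.shape[0] // 2), -(K.shape[1] // 2)), axis=(0, 1))
    s_alpha = np.real(np.fft.ifft2(np.fft.fft2(alpha_sb) * np.fft.fft2(Kpad)))
    s_alpha = np.clip(s_alpha, 0.0, 1.0)
    return W_B + theta4 * s_alpha, W_F + theta4 * (1.0 - s_alpha)

def in_focus_mask(alpha_s, win=11, thresh=0.4):
    """Flag pixels whose gradient of alpha_s, normalized by the local value
    range in a win x win window, exceeds thresh; elsewhere the sparsity
    prior is disabled (set to 0)."""
    gy, gx = np.gradient(alpha_s)
    grad = np.hypot(gy, gx)
    h, w = alpha_s.shape
    r = win // 2
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            patch = alpha_s[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            rng = patch.max() - patch.min()
            mask[y, x] = rng > 1e-6 and grad[y, x] / rng > thresh
    return mask
```

A constant α^s yields an empty mask, while a sharp step edge is flagged as in-focus.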

3.1. Multi-Resolution Estimation of the Matte

To obtain high quality alpha mattes of 6 Mpix images with reasonable time and memory requirements, we use a multi-resolution framework with three levels: 0.3 Mpix, 1.5 Mpix and 6 Mpix. The matte at lower resolutions is used as a weak regularization for higher resolutions. At the highest resolution, α was solved by processing the image in overlapping windows. Using the low resolution matte as regularization has two advantages: (a) it encourages a smooth transition between windows (for that reason, this prior got a higher weight along window boundaries); (b) it pushes the solution towards the global optimum, which is essential for handling non-informative windows. Note that we computed the data terms of [21] at full resolution using the entire image, since runtime and memory were reasonable.
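The windowed processing above can be sketched as follows. The window size, overlap, and the exact ramp profile of the boundary weighting are assumptions (the paper only states that the low-resolution prior receives a higher weight along window boundaries):

```python
import numpy as np

def tile_windows(h, w, win=64, overlap=16):
    """Top-left corners of overlapping windows covering an h x w image
    (window size and overlap are illustrative choices)."""
    step = win - overlap
    ys = list(range(0, max(h - win, 0) + 1, step))
    xs = list(range(0, max(w - win, 0) + 1, step))
    if ys[-1] != max(h - win, 0):
        ys.append(max(h - win, 0))   # ensure the bottom edge is covered
    if xs[-1] != max(w - win, 0):
        xs.append(max(w - win, 0))   # ensure the right edge is covered
    return [(y, x) for y in ys for x in xs]

def prior_weight(win=64, overlap=16):
    """Per-pixel weight of the low-resolution regularization inside one
    window: ramped up near the window boundary to encourage smooth
    transitions between neighboring windows (hypothetical ramp)."""
    ramp = np.minimum(np.arange(win), np.arange(win)[::-1])
    d = np.minimum.outer(ramp, ramp).astype(float)  # distance to nearest edge
    return 1.0 + np.clip((overlap - d) / overlap, 0.0, 1.0)
```

Each window would then be solved independently, with the low-resolution matte entering the energy weighted by `prior_weight`.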

4. Experimental results

In order to quantitatively compare different interactive matting techniques, two problems have to be addressed: (i) a large, high-quality ground truth database is necessary; (ii) a metric for evaluating interactive matting systems has to be derived. In this paper we propose such a new database. For the evaluation task, we follow the simple style of recent papers, e.g. [21], where a few user inputs were simulated. We believe that it is important to improve on this procedure; however, we leave this as future work.

We first introduce our database and then compare our approach to state-of-the-art methods.

Ground Truth Database. Recently two small databases were introduced [21, 11]. The database we use is considerably larger and, we believe, of higher quality. The data in [11] is of intermediate quality, probably due to noise³. In [21] most examples are natural (outdoor) images, which is a very realistic scenario; however, the ground truth α was created using a variety of matting methods along with extensive user assistance. We believe that such a dataset is biased, especially if used, as in our work, for training. We hope that in the future our high quality database can also be used to learn better priors for alpha, e.g. higher-order clique MRFs, which is in line with the current trend of training low-level vision systems, e.g. stereo.

Our dataset has 27 α mattes obtained in a professional studio environment using triangulation matting [17] from the RAW sensor data (10.1 Mpix; Canon 1D Mark III). The objects have a variety of hard and soft boundaries and different boundary lengths, e.g. a tree with many holes (see

³ In the matte of fig. 8 in [11], 42% of true foreground pixels have an α value below 0.98; in our case this occurred for only 1% of pixels.

Figure 8. Training and test images. Our training images (red box) and test images (blue box). The images are shown at thumbnail size (original size is approx. 6 Mpix).

fig. 8 for an overview). The final mattes, of average size 6 Mpix, consist of the cropped out objects. From these ground truth α mattes we derived ground truth trimaps by thresholding the α matte.
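The trimap derivation above can be sketched in a few lines. The thresholds are assumptions, since the paper only states that the trimaps were obtained "by thresholding the α matte":

```python
import numpy as np

def trimap_from_alpha(alpha, lo=0.02, hi=0.98):
    """Derive a ground-truth trimap by thresholding alpha.
    Returns 0 = background, 128 = unknown, 255 = foreground
    (lo/hi thresholds and the label encoding are our choices)."""
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    trimap[alpha <= lo] = 0     # definite background
    trimap[alpha >= hi] = 255   # definite foreground
    return trimap
```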

To create training and test images we photographed each object in front of a screen showing natural images as the background. To increase the difficulty of the training and test cases, we replaced some of the backgrounds with more challenging ones, e.g. sharper backgrounds, using the ground truth alpha and ground truth foreground color. The different background images show a varying degree of difficulty: color ambiguity of fore- and background, and backgrounds with different degrees of focus. The set was then split into 10 training and 17 test images. To obtain 20 training images, 2 different backgrounds were used for each foreground object. An overview of the training and test images is depicted in fig. 8.

Finally, we created for each image a set of potential user inputs in the form of scribbles and trimaps. For each image we have casually drawn large scribbles that cover all major colors in the image but are not close to the object boundary.

Comparison of Trimap Extraction Methods. We had to down-scale our 6 Mpix images to a size that most competitors can handle, which was 0.3 Mpix (e.g. 700 × 560) - the limit of the publicly available system of [9]⁴. We compare

⁴ For [11] we had to scale down the images even further, to 0.15 Mpix. For comparison, the obtained result was then up-scaled to 0.3 Mpix.

our approach to six other methods [7, 21, 9, 11, 5, 4]⁵. Note that with respect to [20] we only have results for two images, figs. 9 and 2 (see also fig. 5 in [20]). However, [5], to which we compare, has shown that it slightly outperforms [20]. The reason for including matting systems [21, 9, 11, 5, 4] in this comparison is to show that we gain not only in speed but also in quality in terms of the final α matte. The trimap error rate was defined in sec. 2.2 (measured as a percentage wrt image size), and the error for an α matte is defined below. The trimap error for systems which directly produce a matte was computed by transforming the matte into a trimap. In order to compute a matte from our trimaps, and those of [7], we use our matting approach with low resolution input and without the sparsity prior (essentially [21]). For this experiment the input was the set of user-defined scribbles.

Qualitative results are in figs. 1, 2, and 9, and quantitative results are in table 1. We see that matting systems [21, 9, 11, 5] are obviously considerably slower. Note that our approach and [7] need an additional 3.5 sec on average for the small resolution alpha matting (not reported in table 1). Also, our method with optimal λ takes on average 0.8 sec longer to compute all solutions for the range λ ∈ (0, 5), but this obviously improves the usability of our method. Considering error rates, we see a correlation between the trimap and α matte errors, which motivates our heuristically defined error functions. We see that our system clearly outperforms all other approaches both in terms of trimap error and α matte errors. Also, predicting λ in our system works better than using a fixed λ. As expected, choosing the optimal λ for each image gives the best performance. Finally, ground truth trimaps (last row) give by far the lowest α matte errors, which shows that the problem of good trimap generation is vital for successful α matting. Note that the low rate of [11] might be explained by the fact that it was not designed for a scribble-based interface, but for a matting-component picking interface. It is not even guaranteed that the scribbles will be assigned to the correct α.

⁵ [21, 9, 11, 5, 4] were the authors' implementations; [7] is our own.

Comparison of Trimap-based Matting Methods. We used the following error function for the α matte, which penalizes an over-estimation of α more heavily:

(100/n) Σ_i [1.5 δ(α_i ≥ α^true_i) + 0.5 δ(α_i < α^true_i)] |α_i − α^true_i|,

where α^true is the true α and n the number of pixels in U. It has been shown in [21, 11] that [21, 9] are the best

performing methods for this task. Figs. 10 and 7 show qualitative results of [21, 9, 11] on crops of high resolution images. Since we were not able to get all competitors working on our high resolution images, we adapted our method to simulate: (i) [21], by removing our sparsity prior, and (ii) [9], by setting W_F, W_B to 0 in eq. 5. Quantitative results for high resolution images are shown in table 2 for different trimaps: ground truth, our trimaps (optimal λ)⁶, and small (large) trimaps (dilation of the ground truth trimap by 22 (44) pixels). We see that we outperform other methods, more significantly for larger trimaps. The improvements might look small, but it is very important to note that our results are overall considerably less blurry than the others (e.g. fig. 10). This visual improvement is, however, not captured well by our error metric, which motivates new research in this field. Table 2 also shows that our scribble-based trimaps are better than large, hand-drawn trimaps with a brush of radius 88. Note that the α matte errors in tables 1 and 2 can deviate due to differently sized input images. Finally, in our un-optimized implementation a full 7 Mpix α matte computation takes between 10-25 minutes, depending on the size of the unknown region.
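The error function above translates directly into code. The unknown-region mask U is passed explicitly, and the variable names are ours:

```python
import numpy as np

def alpha_matte_error(alpha, alpha_true, unknown):
    """Weighted L1 error over the unknown region U: over-estimation
    (alpha >= alpha_true) is weighted 1.5, under-estimation 0.5,
    scaled by 100/n with n the number of pixels in U."""
    a, t = alpha[unknown], alpha_true[unknown]
    weight = np.where(a >= t, 1.5, 0.5)
    return 100.0 * np.mean(weight * np.abs(a - t))
```

For example, over-estimating a constant matte of 0.5 by 0.1 everywhere gives an error of 15, while under-estimating it by the same amount gives only 5.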

5. Conclusions

We have presented a new approach for alpha matting of high resolution images via interactive trimap extraction. Using our new database we could train our system and also confirm that we improve on state-of-the-art methods for both trimap extraction and alpha matting. Our main contribution for matting was an α prior which imposes sparsity on

⁶ For a fair comparison of our trimaps to dilated trimaps, we additionally brushed over false negative pixels, i.e. pixels marked as unknown in the ground truth trimap but not in our trimap, with a brush of radius 44. On average we had to brush only 0.5% of pixels in the image. Note that a small dilated trimap contains 15% of all image pixels.

Method                    av. error       worst 25%       time
Grady et al. '05 [4]      (24.3; 19.8)    (33.6; 28.6)    5
Levin et al. '07 [11]     (17.9; 9.5)     (28.3; 17.8)    20
Guan et al. '06 [5]       (13.4; 9.0)     (22.7; 16.5)    300
Levin et al. '06 [9]      (11.4; 6.9)     (19.0; 13.3)    18
Wang et al. '07 [21]      (11.0; 8.4)     (22.5; 19.0)    50
Juan et al. '05 [7]       (7.6; 4.6)      (13.8; 12.0)    1.5
Our (fixed λ = 2.3)       (2.5; 1.2)      (4.9; 2.3)      1
Our (predict λ)           (2.3; 1.0)      (4.5; 1.9)      1
Our (optimal λ)           (2.2; 0.7)      (4.5; 1.5)      1.8
True trimap               (0.0; 0.4)      (0.0; 0.8)      -

Table 1. Comparison of trimap methods. In brackets are our trimap error and our α matte error (definitions in text). All numbers are averaged over all (worst 25%) test images. Times are in seconds and were measured on the same machine (2.2 GHz).

Method                       Large   Small   Our    True
Our impl. Levin '06 [9]      1.5     1.2     1.3    0.71
Our impl. Wang '07 [21]      1.7     1.0     1.0    0.68
Ours                         1.3     0.8     0.9    0.67

Table 2. Comparison of trimap-based matting methods. The average error over all test images for different trimaps (see text).

α while preserving gradients. In the future we hope to obtain even better results by modeling a spatially varying blur kernel of the camera's PSF, in order to compensate for depth variations and spatially varying motion blur.

References

[1] X. Bai and G. Sapiro. A geodesic framework for fast interactive image and video segmentation and matting. In ICCV, 2007.

[2] Y. Boykov and M. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, 2001.

[3] Y. Chuang, B. Curless, D. Salesin, and R. Szeliski. A bayesian approach to digital matting. In CVPR, 2001.

[4] L. Grady, T. Schiwietz, S. Aharon, and R. Westermann. Random walks for interactive alpha-matting. In VIIP, 2005.

[5] Y. Guan, W. Chen, X. Liang, Z. Ding, and Q. Peng. Easy matting: A stroke based approach for continuous image matting. In Eurographics, 2006.

[6] J. Jia. Single image motion deblurring using transparency. In CVPR, 2007.

[7] O. Juan and R. Keriven. Trimap segmentation for fast and user-friendly alpha matting. In VLSM, 2005.

[8] V. Kolmogorov, Y. Boykov, and C. Rother. Applications of parametric maxflow in computer vision. In ICCV, 2007.

[9] A. Levin, D. Lischinski, and Y. Weiss. A closed form solution to natural image matting. In CVPR, 2006.

[10] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. In SIGGRAPH, 2007.

[11] A. Levin, A. Rav-Acha, and D. Lischinski. Spectral matting. In CVPR, 2007.

Figure 9. Comparison of trimap methods. This figure shows trimap segmentation results for some challenging examples (the bottommost example was taken from [21]). For each example the first and second rows depict the trimaps and final composites, respectively. All results were generated from the scribbles shown in the leftmost images in the top rows, except for column 5, where we adjusted λ and used extra scribbles (leftmost images in the bottom rows) to demonstrate the capability of our method to easily generate almost perfect results. In brackets are our trimap and α matting errors. Our results outperform the competitors; we show only the top 2 for each example, the others were worse, both in terms of error rates and visually.

Figure 10. Comparison of α matting methods. (b) Top: Crop of a 7.7 Mpix image (a) of a region with a woollen scarf. The input trimap (small dilation of 22 pixels) is superimposed: black (inside) and white (outside). Bottom: Crop of a 7.6 Mpix image (a) of a region showing hair. The input trimap (large dilation of 44 pixels) is superimposed: black (inside) and white (outside). (b-f) Results of various methods with respective α matte errors.

[12] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A multilevel banded graph cuts method for fast image segmentation. In ICCV, 2005.

[13] E. Mortensen and W. Barrett. Intelligent scissors for image composition. In SIGGRAPH, 1995.

[14] A. Pentland. A new sense for depth of field. PAMI, 1987.

[15] C. Rother, V. Kolmogorov, and A. Blake. Grabcut - interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.

[16] M. Ruzon and C. Tomasi. Alpha estimation in natural images. In CVPR, 2000.

[17] A. Smith and J. Blinn. Blue screen matting. In SIGGRAPH, 1996.

[18] S. Vicente, V. Kolmogorov, and C. Rother. Graph cut based image segmentation with connectivity priors. In CVPR, 2008.

[19] J. Wang, M. Agrawala, and M. F. Cohen. Soft scissors: An interactive tool for realtime high quality matting. In SIGGRAPH, 2007.

[20] J. Wang and M. F. Cohen. An iterative optimization approach for unified image segmentation and matting. In ICCV, 2005.

[21] J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In CVPR, 2007.

[22] Y. Wexler, A. Fitzgibbon, and A. Zisserman. Bayesian estimation of layers from multiple images. In ECCV, 2002.