High-Quality Real-Time Video Inpainting with PixMix

Jan Herling, Member, IEEE, and Wolfgang Broll, Member, IEEE

Abstract—While image inpainting has recently become widely available in image manipulation tools, existing approaches to video inpainting typically do not achieve interactive frame rates yet, as they are highly computationally expensive. Further, they either apply severe restrictions on the movement of the camera or do not provide a high-quality coherent video stream. In this paper we present our approach to high-quality, real-time capable image and video inpainting. Our PixMix approach even allows for the manipulation of live video streams, providing the basis for real Diminished Reality (DR) applications. We will show how our approach generates coherent video streams dealing with quite heterogeneous background environments and non-trivial camera movements, even applying constraints in real-time.

Index Terms—Video inpainting, diminished reality, real-time, image inpainting, image completion, object removal


1 INTRODUCTION

Image inpainting (aka context-aware fill, image synthesis, image completion, etc.) has recently become widely available as part of image manipulation tools. While image inpainting has been researched for quite some time (see, e.g., [1], [2], [3]), it has only recently achieved sufficient quality at an acceptable speed, allowing for integration into standard software. Nevertheless, quality is still an issue, and achieving sophisticated results for non-trivial image backgrounds still requires a significant amount of time [4], [5]. Even for rather low-resolution images (such as VGA), most approaches will not allow for proper inpainting at interactive frame rates.

Due to the numerous applications in video post-production, video inpainting has drawn quite a lot of attention. Sample applications include repairing video frames from vintage movies or removing undesired objects in the background, such as the airplane in an antique movie or the wrist watch of an extra in a medieval drama. Accordingly, those approaches (e.g., [6], [7], [8], [9], [10], [11], [12], [13]) did not aim for interactive frame rates.

In contrast to those approaches, Diminished Reality (DR) requires processing in real-time. Some approaches were introduced allowing for marker hiding in AR applications by semi-dynamic textures combined with alpha blending [14], [15]. Other approaches realize Diminished Reality by combining images of several cameras observing a certain area from different viewpoints [16], [17]. This information is then used to remove blocking objects by blending the individual views rather than by synthesizing the area by inpainting. So far, only our previous approach [18] has aimed for real-time video inpainting, although the background had to be rather uniform and camera movements were restricted.

In this paper we introduce our approach to sophisticated real-time video inpainting. Our approach is based on high-quality image inpainting and allows for the realization of Diminished Reality applications. Our current approach was very much inspired by the randomized approaches of Barnes et al. [19] as well as our own previous approach [18]. However, in contrast to those purely patch-based approaches, our new approach PixMix is based on a combined pixel-based approach. This allows for even faster inpainting while improving the overall image quality significantly. Combined with a new tracking approach and frame-to-frame coherence, this provides the basis for real-time video manipulation. By additionally applying a homography-based approach, we can now achieve high coherence for translational and rotational camera movements, not yet demonstrated in real-time elsewhere (see Fig. 1 and Fig. 13).

This paper is structured as follows: in Section 2 we review the recent related work on image and video inpainting with a focus on quality and speed. In Section 3 we introduce our approach to image inpainting, providing highly sophisticated results in real-time. In Section 4 we show how this approach can be used to realize real-time Diminished Reality. We present our novel real-time object selection and tracking mechanism in detail. We further introduce our approach to achieve frame-to-frame coherence applying a homography. In Section 5 we discuss limitations and quality issues of our approach before finally concluding and looking into future work.

2 RELATED WORK

Wexler et al. [13] successfully demonstrated how to remove objects from video sequences. Their approach applied 3D image patches, extending over space and time, using the entire video sequence for patch look-up.

• J. Herling is with the fayteq GmbH, Erfurt, Germany. E-mail: [email protected].

• W. Broll is with the Virtual Worlds and Digital Games group at the Ilmenau University of Technology, Ilmenau, Thuringia, Germany. E-mail: [email protected].

Manuscript received 13 Feb. 2013; revised 23 Dec. 2013; accepted 28 Dec. 2013; date of publication 15 Jan. 2014; date of current version 25 Apr. 2014. Recommended for acceptance by M. Gandy, K. Kiyokawa, and G. Reitmayr. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TVCG.2014.2298016


While this allows for quite sophisticated results in video manipulation, it cannot be used to manipulate live video streams, where only the current frame and to some extent previous frames are available. Further, while allowing for a user-created synthesis mask, their approach was limited to a static camera and was not able to provide interactive or even real-time capable results. Simakov et al. [4] extended the coherence-based approach of Wexler et al. by a completeness term. Thus, they achieved a bi-directional dissimilarity function also allowing for image reshuffling. However, the approach was too slow for interactive image manipulations. In PatchMatch, Barnes et al. [19], resp. in Generalized PatchMatch [20], the cost function introduced by Simakov et al. was used within a randomized patch searching approach. Instead of seeking the optimal image information, randomized iterations are used to find patches close to the optimum. This significantly sped up the overall process, even allowing for interactive manipulations of images of a reasonable size. Recently, Darabi et al. [21] used orientation, scale, reflectance and illumination invariant image patches for image inpainting, as originally introduced by Drori et al. [2]. With our original approach [18], [22] we already demonstrated a real-time capable approach for object removal from video streams, allowing the manipulation of live videos. We used a randomized approach similar to that of Barnes et al. [19] while applying the similarity measure of Wexler et al. [13]. We further introduced means for frame-to-frame coherence, providing the basis for interactive video manipulation. The video synthesis mask was defined by a simple but fast snake contour approach, providing sufficient computational time for the image inpainting. However, in the original approach we achieved the real-time performance mainly by data and sample reductions, e.g., using grayscale images and applying an under-sampling inside the target region. The synthesis quality therefore did not reach the quality of the static image approach of, e.g., Barnes et al. Pritch et al. [23] proposed a synthesis approach based on a global optimization problem using single pixels and their direct neighborhood rather than patches. They apply a graph labeling approach seeking an optimal solution over several pyramid layers. However, this approach is several magnitudes too slow for real-time DR. Bugeau et al. [5] define an energy function to be minimized, composed of three terms: self-similarity, diffuse and propagation, and coherence. They propose to compute a correspondence map mainly by the application of pixel intensities inside a patch, similar to the approach of Demanet et al. [24]. They achieve sophisticated qualitative results, but do not qualify for real-time use. Takeda and Sakamoto [25] apply a homography for removing near occluders from landscape video streams. However, their approach is restricted to rotational camera movements and can be used for non-real-time post processing only.

3 IMAGE INPAINTING

In this section we will present our approach to real-time image manipulation.

3.1 Mapping Function

Image inpainting can be defined as a global minimization problem of finding the transformation function $f: T \to S$ producing minimal overall synthesis costs for an arbitrary image $I$ according to a given cost function. The image $I$ is subdivided into the two distinct sets $T$ and $S$ with $I = T \cup S$, $T \cap S = \emptyset$ and $S \neq \emptyset$. All pixels from $T$(arget) are to be replaced by pixels defined in $S$(ource). Thus, $f$ defines a mapping between target and source pixels inside an image to be manipulated. Once $f$ has been determined, the final image can be created by replacing all target pixels with source information as defined in the determined mapping. Generally, the result of a manipulated image may be considered acceptable if the replaced (synthesized) image content blends in seamlessly with the surrounding image information while it remains free of disturbing artifacts and implausible blurring effects.

Fig. 1. DR result for an object fixed in a wall. A coherent video stream is provided for a hand-held camera.


Further, the new image information should visually fit to the remaining image parts, while image content not existing in the source $S$ must not be used to synthesize the target $T$ (equivalent to the coherence measure of Simakov et al. [4]). Thus, the transformation function $f$ is based on the following two constraints:

• Neighboring pixels defined in $T$ should be mapped to equivalent neighboring pixels in $S$. This first constraint ensures the structural and spatial preservation of image information (see Fig. 2a).

• The neighborhood appearance of pixels in $T$ should be similar to the neighborhood appearance of their mapped equivalents in $S$. Thus, a visually coherent result and seamless transitions at the border of the synthesized area are ensured (see Fig. 2b).

The global minimization problem to solve is to find a transformation function $f$ producing the minimal overall cost for an image $I$ and a target region $T \subset I$, with $S = I \setminus T$:

$$\min_f \sum_{p \in T} \text{cost}_\alpha(p), \qquad (1)$$

where $p = (p_x, p_y)^\top$ is a 2D position and the cost function $\text{cost}_\alpha: T \to \mathbb{Q}$ is defined for all elements (pixels) inside the target region.

3.2 Cost Function

As described above, our approach subdivides the overall costs into a part based on the spatial impact and a part based on the appearance impact. This can be represented by the following linear combination:

$$\text{cost}_\alpha(p) = \alpha \cdot \text{cost}_{\text{spatial}}(p) + (1 - \alpha) \cdot \text{cost}_{\text{appearance}}(p), \qquad (2)$$

where the control parameter $\alpha \in [0, 1]$ allows balancing between both types of costs. Minimization of the spatial cost impact forces a mapping of neighboring target pixels to neighboring mapping pixels. This is represented for an arbitrary neighborhood $N_s$ by $\text{cost}_{\text{spatial}}: T \to \mathbb{Q}$:

$$\text{cost}_{\text{spatial}}(p) = \sum_{\vec{v} \in N_s} d_s[f(p) + \vec{v},\, f(p + \vec{v})] \cdot w_s(\vec{v}), \qquad (3)$$

with the pixel set/region $N_s$ holding the spatial relative positions of neighboring pixels, any suitable spatial distance function $d_s(\cdot)$, and an individual weight parameter $w_s \in \mathbb{Q}$, while $\sum_{\vec{v} \in N_s} w_s(\vec{v}) = 1$ and $0 \notin N_s$ must hold. Ideally, any neighbor $\vec{v} \in N_s$ of $p$ is mapped to the corresponding neighbor $\vec{v} \in N_s$ of $f(p)$. The spatial cost sums up the spatial distances $d_s(\cdot)$ from this ideal situation for any $\vec{v} \in N_s$ and $p \in T$ (see Figs. 2a and 3). Thus, our approach fundamentally differs from previous pixel- and patch-based approaches such as [1], [3], [19], [23], [24] or [5], which apply appearance similarity costs. In contrast to those approaches, our spatial cost function allows for significantly faster convergence while reducing image blurring and geometrical artifacts. This novel cost constraint can be seen as an elastic spring optimization automatically minimizing neighboring mapping offsets. While $N_s$ allows for any kind of neighborhood, a common symmetric neighborhood is defined by

$$N_s(d_s) = \{\vec{v} \mid \forall \vec{v} \in (\mathbb{Z} \times \mathbb{Z}): 0 < |\vec{v}| \le d_s\}, \qquad (4)$$

where $d_s \in \mathbb{R}$ specifies the radius of the neighborhood.

The impact of the appearance cost measure is represented by $\text{cost}_{\text{appearance}}: T \to \mathbb{Q}$:

$$\text{cost}_{\text{appearance}}(p) = \sum_{\vec{v} \in N_a} d_a[i(p + \vec{v}),\, i(f(p) + \vec{v})] \cdot \omega_a(p + \vec{v}), \qquad (5)$$

with $N_a$ being an individual neighborhood equivalent to $N_s$, holding the relative positions of neighboring pixels. $i(p)$ holds the image pixel value of image $I$, and $d_a$ specifies a pixel intensity distance measure. The application of an appearance cost can be found in several related works like the approach of Efros and Leung [26], Demanet et al. [24] or Barnes et al. [19]. However, our additional weight function $\omega_a$ allows for weighting appearance distances individually according to external constraints like, e.g., the synthesis border between source and target pixels. Thus, appearance costs close to the synthesis border may have a higher impact on the overall cost to avoid undesired border effects such as edges or visual discontinuities. Tests with different neighborhood sets revealed that $N_a$ provides the best results regarding the tradeoff between accuracy and performance when represented by a small image patch centered at $p$. Further, circular neighborhood sets required far more processing time while providing only a negligible visual improvement. A patch size of $5 \times 5$ pixels proved to provide sufficient details regarding the visual content while still allowing for fast computation.

Fig. 2. The two cost constraints of the transformation function $\text{cost}_\alpha(\cdot)$.

Fig. 3. Spatial cost calculated from neighboring mappings, depicted for a four-neighborhood. The mapping $f(p_x + 1, p_y)$ is ideal and therefore the local spatial cost $d_s(\cdot)$ is zero. The mapping $f(p_x, p_y + 1)$ is quite good and therefore $d_s(\cdot)$ is almost zero. However, the mappings for $f(p_x, p_y - 1)$ and $f(p_x - 1, p_y)$ are far away from the ideal positions, resulting in a high $d_s(\cdot)$.

In our approach we use the sum of squared differences (SSD) for the appearance distance $d_a$, as it provided a suitable tradeoff regarding performance and quality when compared to other measures such as the sum of absolute differences (SAD) and the zero-mean SSD. Tests also showed that the spatial distance function is well approximated by the squared distance clamped to $t_s$:

$$d_s(p_0, p_1) = \min\left(|p_0 - p_1|^2,\, t_s\right). \qquad (6)$$

The upper border $t_s$ crops the spatial cost to a maximal influence, as the cost for mappings with an already large distance of, e.g., 200 and 2,000 pixels should be the same. The cropping is comparable to the Tukey robust M-estimator [27], where the error remains constant if it exceeds a specified threshold. We apply a symmetric neighborhood $N_s(d_s)$ with $d_s = 1$ defining a four-neighborhood or $d_s = \sqrt{2}$ defining an eight-neighborhood, while the importance weighting for those small neighborhoods is set to a uniform weight $w_s(\vec{v}) = \frac{1}{|N_s|}$. Tests further showed that the $l_2$ norm provides better results but obviously requires significantly more computation time.
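To make the cost definitions above concrete, the following sketch evaluates $\text{cost}_\alpha(p)$ for a single target pixel under a given mapping. This is an illustrative reconstruction rather than the authors' implementation: the four-neighborhood, the clamped squared spatial distance, the $5 \times 5$ SSD patch and the uniform weights follow Eqs. (2), (3), (5) and (6), while all identifiers (and the assumption that `f` is stored as a dense array of source coordinates) are our own. Boundary handling is omitted for brevity.

```python
import numpy as np

# Four-neighborhood N_s (relative offsets), uniform weights w_s = 1/|N_s|
N_S = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def spatial_cost(f, p, t_s=25.0):
    """Eq. (3): clamped squared offsets between f(p)+v and f(p+v)."""
    cost = 0.0
    for v in N_S:
        ideal = f[p] + np.array(v)              # where the neighbor should map to
        actual = f[p[0] + v[0], p[1] + v[1]]    # where it actually maps to
        d = np.sum((ideal - actual) ** 2)       # squared distance, clamped (Eq. 6)
        cost += min(d, t_s) / len(N_S)          # uniform weight w_s
    return cost

def appearance_cost(image, f, p, half=2):
    """Eq. (5): SSD between the 5x5 patches around p and around f(p)."""
    cost = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            a = image[p[0] + dy, p[1] + dx].astype(np.float64)
            b = image[f[p][0] + dy, f[p][1] + dx].astype(np.float64)
            cost += np.sum((a - b) ** 2)        # SSD, uniform weight omega_a
    return cost

def total_cost(image, f, p, alpha=0.5):
    """Eq. (2): linear combination of spatial and appearance costs."""
    return alpha * spatial_cost(f, p) + (1.0 - alpha) * appearance_cost(image, f, p)
```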

3.3 Iterative Refinement and Propagation

Finding the optimal transformation function $f$ is realized by starting with a rather rough guess of $f$, followed by a series of iterative refinement steps. At each iteration, the mapping for each target pixel is sought to be improved. Individual source positions are randomly tested according to the local cost function and accepted whenever the local cost can be reduced. The improved matching is then propagated to neighboring positions in $T$. This approach is similar to that proposed by Barnes et al. [19] and the approach we applied in our previous work [18]. However, as these two approaches apply the dissimilarity measure of Wexler et al. [13] or Simakov et al. [4], respectively, each refinement needs a target information update of an entire image patch, requiring the application of the individual contribution from each patch followed by a normalization. Our current approach directly updates only a single pixel and thus avoids expensive normalizations.

We apply a multiresolution inpainting approach. Iterative refinement is applied on an image pyramid, starting with a reduced resolution layer and increasing the image size until the original resolution has been reached. Depending on the mask size and frame dimension, typically between three and eight layers are used. The coarsest pyramid layer is the first layer in which no mask pixel has a larger distance than three pixels to the inpainting border. The algorithm starts with an initial mapping guess $\hat{f}_{n-1}$ in the coarsest pyramid layer $L_{n-1}$ and stops once an improved mapping $f_{n-1}$ with minimal overall cost has been determined. This mapping is then forwarded to the next pyramid layer $L_{n-2}$ and is used as the new initialization $\hat{f}_{n-2}$. Again, after a series of iterations within the current layer, the optimized transformation $f_{n-2}$ is forwarded as the initialization of the next layer, until the final layer $L_0$ (providing the highest resolution) has been reached and processed (see Fig. 4).

The applied image pyramid allows covering visual structures of individual frequencies, speeds up the mapping convergence, and significantly reduces the chance that the algorithm gets trapped in local minima.

Barnes et al. [19] applied an information propagation improving the overall process significantly. However, the propagation idea in the context of image inpainting had originally been proposed by Ashikhmin [28] and Demanet et al. [24]. Our work applies a comparable propagation to benefit from the significant speedup opportunity. However, instead of propagating the position of entire image patches, our approach forwards single pixel mapping positions only.
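The following sketch illustrates one forward pass of this refinement, combining the random candidate search with the single-pixel propagation from already refined neighbors. It is a simplified, single-threaded reconstruction under our own naming; `total_cost` refers to the cost evaluation sketched in Section 3.2, and boundary checks, the validity test that candidates lie in $S$, and the alternating forward/backward subset scheme described next are omitted.

```python
import random
import numpy as np

def refine_pass(image, f, target_pixels, source_box, trials=4):
    """One forward pass: random candidate search plus neighbor propagation."""
    (min_y, max_y), (min_x, max_x) = source_box
    for p in target_pixels:                       # scan-line order enables propagation
        best = total_cost(image, f, p)
        # 1) Random search: a few random source positions per iteration.
        candidates = [(random.randint(min_y, max_y), random.randint(min_x, max_x))
                      for _ in range(trials)]
        # 2) Propagation: adopt the shifted mappings of already refined neighbors.
        for v in [(-1, 0), (0, -1)]:              # neighbors visited earlier in the pass
            neighbor = (p[0] + v[0], p[1] + v[1])
            candidates.append(tuple(f[neighbor] - np.array(v)))
        for cand in candidates:
            old = f[p].copy()
            f[p] = cand
            cost = total_cost(image, f, p)
            if cost < best:
                best = cost                       # keep the improvement
            else:
                f[p] = old                        # revert to the previous mapping
    return f
```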

As in our previous approach [18], the iterative refinement developed in this work benefits from multi-core CPUs. Iterative cost refinement is applied concurrently on disjoint subsets $T_0, T_1, \ldots, T_{n-1}$ of $T = T_0 \cup T_1 \cup \cdots \cup T_{n-1}$. Thus, each subset $T_i$ can be processed by an individual thread in parallel. In our previous work, static frame subsets were applied, restricting propagation of mapping information to within individual subsets. This isolated refinement may produce undesired synthesis blocks in the final image, as the mapping exchange between subsets is restricted to the subset borders. Further, damped propagation may reduce synthesis performance. Our current approach applies random subsets, changing between forward and backward propagation, while the subsets' size stays constant. Neighboring subsets propagate their mapping in opposite directions starting from a common start row. These start rows are optimized explicitly before any refinement iteration is processed, so that neighboring subsets have access to the same mapping information. Successive refinements toggle between forward and backward propagation, have changing target subsets, and start from common (already refined) rows. The random subsets have a significant impact on the final image quality and convergence performance, as information propagation is applied to the entire synthesis mask rather than limited to the sub-blocks. In Fig. 5, a comparison between our previous and current approach is presented.

Fig. 4. Scheme of the pyramid refinement: The original frame is downsampled (left) and iteratively refined and upsampled again (diagonal from bottom left to top right).

3.4 Constraints

More advanced constraints may be used to guide the inpainting algorithm, providing improved visual results. Depending on the structure of the background or the desired and undesired visual elements, the final inpainting results may be optimized according to the expectations of the users.

However, in a real-time video inpainting approach such as a Diminished Reality application, explicit user-defined constraints cannot easily be applied. The constraint definition itself requires a certain amount of time that might violate the real-time execution. Additionally, e.g., in a Diminished Reality application, users might not be willing to spend any significant amount of time on defining constraints. Thus, user-defined constraints as typically used for image inpainting may not be directly applied to video inpainting approaches.

However, automated or semi-automated constraints may also be applied to video sequences. Simple structural constraints, e.g., lines or regular objects, can be detected automatically in a real-time approach and therefore do not need to be specified by the user. Further, when inpainting is applied to video streams not requiring real-time performance, or for slightly time-shifted live broadcasts, even individual user-defined constraints may be used.

Our pixel mapping approach allows seamless integration of constraints into the synthesis pipeline by simply extending the cost measure from (2) by an additional constraint cost:

$$\text{cost}_{\vec{\alpha}}(p, f) = \alpha_s \cdot \text{cost}_{\text{spatial}}(p, f) + \alpha_a \cdot \text{cost}_{\text{appear}}(p, f) + \alpha_c \cdot \text{cost}_{\text{constr}}(p, f), \quad p \in T, \qquad (7)$$

with the new cost extension $\text{cost}_{\text{constr}}: T \times (T \to S) \to \mathbb{Q}$ and the affine combination $\vec{\alpha} = (\alpha_s, \alpha_a, \alpha_c)$ with $\alpha_s, \alpha_a, \alpha_c \in [0, 1]$, while $\alpha_s + \alpha_a + \alpha_c = 1$ must hold.

The cost constraint extension may be composed of several individual sub-constraints:

$$\text{cost}_{\text{constr}}(p, f) = \text{constr}_0(p, f) + \text{constr}_1(p, f) + \cdots, \quad p \in T. \qquad (8)$$

In the following, two constraints are introduced which allow a simple improvement of the visual inpainting results.

3.4.1 Area Constraints

The most obvious form of inpainting constraint guides the algorithm to explicitly use or avoid image regions from the remaining image content. The algorithm is forced to use image content explicitly selected by the user, discarding content the user does not prefer. An inverse importance map $m: S \to \mathbb{Q}$ over all elements (pixels) in $S$ has to be defined to individually rate the visual importance of image content.

A simple inverse importance map rates all undesired visual elements with an infinitely high value while weighting the desired content with zero:

$$m(q) = \begin{cases} 0, & q \in S_{\text{desired}}, \\ \infty, & \text{else}, \end{cases} \quad \forall q \in S, \qquad (9)$$

where $S_{\text{desired}} \subseteq S$ defines the set of visual elements in $S$ appropriate for inpainting.

A map with a more detailed granularity clearly allows for more precise algorithm guidance.

The final area constraint $\text{constr}_A: T \times (T \to S) \to \mathbb{Q}$ is then directly given by the inverse importance map $m$:

$$\text{constr}_A(p, f) = m(f(p)). \qquad (10)$$

In Fig. 6, a comparison between an inpainting with and without an area constraint is provided. The inverse importance map provides an infinite cost for conspicuous elements. The inpainting result using the area constraint does not contain repetitions of these elements. For video inpainting, area constraints will be applied to the first frame only. However, subsequent frames will automatically avoid that content due to the frame-to-frame coherence based on an additional appearance cost term (see Section 4.5).
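As a minimal illustration of Eqs. (9) and (10), an inverse importance map can be stored as an array over the source pixels, so that mapping positions landing on undesired content incur infinite cost. The function and variable names below are our own:

```python
import numpy as np

def build_inverse_importance_map(shape, undesired_mask):
    """Eq. (9): zero cost for desired source pixels, infinite cost otherwise."""
    m = np.zeros(shape, dtype=np.float64)
    m[undesired_mask] = np.inf          # user-marked regions must not be copied
    return m

def area_constraint_cost(m, f, p):
    """Eq. (10): the constraint cost is the map value at the mapping position."""
    q = f[p]                            # source position the target pixel maps to
    return m[q[0], q[1]]
```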

3.4.2 Structural Constraints

Structural constraints may be used to explicitly preserve straight lines or strong borders during the inpainting process. Any number of individual structural constraints can be considered concurrently.

Fig. 5. Scheme of the multithreading inpainting realization comparing our previous approach with the recent algorithm. The small gaps between the individual threads are added to improve visibility.


A set of arbitrary structural constraints $C_{cs} = \{cs_0, cs_1, cs_2, \ldots\}$ may be defined, characterizing the individual structural features of the current synthesis image $I$. Each structural constraint $cs_i$ is composed of a 3-tuple $cs_i = (d_{cs}(\cdot), cs_{\text{impact}}, cs_{\text{radius}})$, combining an individual spatial distance measure with two control parameters. This will be explained in more detail in the following.

The cost for all structural constraints $\text{constr}_S: T \times (T \to S) \to \mathbb{Q}$ for a given point $p \in T$ is specified by:

$$\text{constr}_S(p, f) = \frac{1}{\sum_{cs \in C_{cs}} g_{cs}(p, cs)} \cdot \sum_{cs \in C_{cs}} c_{\text{struct}}(p, cs, f) \cdot g_{cs}(p, cs), \qquad (11)$$

with filter function $g_{cs}(p, cs)$ and structural constraint cost function $c_{\text{struct}}$. The individual cost for a constraint $cs$ in combination with a position $p \in T$ is determined by $c_{\text{struct}}: T \times C_{cs} \times (T \to S) \to \mathbb{Q}$:

$$c_{\text{struct}}(p, cs, f) = |d_{cs}(p, cs) - d_{cs}(f(p), cs)|^2 \cdot \omega_{cs}(p, cs), \qquad (12)$$

with spatial distance function $d_{cs}(p, cs): I \times C_{cs} \to \mathbb{Q}$ between constraint $cs$ and a given point $p$ (see Fig. 7). The weighting function $\omega_{cs}(p, cs): T \times C_{cs} \to \mathbb{Q}$ weights the constraint cost according to the distance between point $p$ and constraint $cs$:

$$\omega_{cs}(p, cs) = cs_{\text{impact}} \cdot e^{-8 \left( \frac{d_{cs}(p, cs)}{cs_{\text{radius}}} \right)^2}. \qquad (13)$$

The constant $cs_{\text{impact}} \in \mathbb{Q}$ specifies the overall impact of $cs$ on the entire synthesis cost. This scalar factor allows distinct lines to be weighted more strongly than weak lines. $cs_{\text{radius}} \in \mathbb{Q}$ shrinks the area of influence to a specific image sub-region. The smaller the distance between a target point $p \in T$ and the constraint $cs \in C_{cs}$, the higher the weighting $\omega_{cs}(\cdot)$ and, directly implied, the higher the resulting constraint cost. The normalization factor $-8$ is applied to provide an optimal scaling of the Gaussian with respect to the influence radius $cs_{\text{radius}}$.

The distance measure $d_{cs}(\cdot)$ for infinite straight lines is defined by:

$$d_{cs}(p, cs) = (n_x, n_y, d) \cdot (p_x, p_y, 1)^\top, \quad p \in I, \qquad (14)$$

with line normal $(n_x, n_y)$, for which $|(n_x, n_y)| = 1$ must hold, while the constant $d$ defines the line's distance to the origin.

Similar to infinite lines, finite line constraints are represented by a slightly modified distance measure. For a finite line with endpoints $P_0$ and $P_1$, the distance is determined for an arbitrary point $p \in I$ by

$$d_{cs}(p, cs) = \begin{cases} (n_x, n_y, d) \cdot (p_x, p_y, 1)^\top, & 0 \le \dfrac{(P_1 - P_0)^\top \cdot (p - P_0)}{|P_1 - P_0|^2} \le 1, \\ k, & \text{else}, \end{cases} \qquad (15)$$

with the penalty constant $k \gg (n_x, n_y, d) \cdot (p_x, p_y, 1)^\top$ for points not projecting onto the finite line. The higher the penalty $k$, the less mapping positions outside the finite constraint are applied. Thus, if $k$ is set to $\infty$, mapping positions projecting outside the finite line are rejected. More complex structures such as splines and curves may be realized in the same manner as the distance function for straight lines.

Finally, the constraint filter function $g_{cs}(p, cs)$ needs to be defined:

$$g_{cs}(p, cs) = \begin{cases} 1, & \omega_{cs}(p, cs) \ge \omega_{cs}(p, cs') \;\; \forall cs' \in C_{cs} \setminus \{cs\}, \\ 0, & \text{else}. \end{cases} \qquad (16)$$

The filter function ensures that only the most relevant constraint for each pixel is considered, in order to avoid race conditions in areas with several constraints. $g_{cs}(p, cs)$ selects the unique structural constraint $cs$ with the highest influence value $\omega_{cs}(p, cs)$ for a point $p \in T$, as long as no two (or more) constraints exist with identical and maximal weight for the point $p$. Equation (11) has a normalization term to keep the overall constraint cost stable, even if more than one constraint has to be considered due to identical weights.

Fig. 6. Inpainting result for the Ruin image with dimension $2{,}736 \times 3{,}648$ and more than 3,000,000 inpainting pixels. Top row: our inpainting result without using an area constraint; bottom row: inpainting result using the area constraint. Copyright for the original image: bbroianigo (photographer), kindly authorized by the author; pixelio.de (creative commons image database), viewed 3 October 2012.

Fig. 7. Determination of constraint costs for infinite lines. The constraint cost corresponds to the distance between the actual mapping position $f(p)$ and the projected ideal position for a given point $p$.
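To illustrate Eqs. (12), (13) and (14) for a single infinite line constraint, the sketch below computes the signed point-to-line distance, the Gaussian influence weighting, and the resulting constraint cost for a target point $p$ and its mapping $f(p)$. Points are given as $(x, y)$ tuples; all names and the example parameters are our own:

```python
import math

def line_distance(p, normal, d):
    """Eq. (14): signed distance of p = (x, y) to a line with unit normal (n_x, n_y)."""
    return normal[0] * p[0] + normal[1] * p[1] + d

def line_weight(p, normal, d, impact, radius):
    """Eq. (13): Gaussian influence weighting around the line."""
    dist = line_distance(p, normal, d)
    return impact * math.exp(-8.0 * (dist / radius) ** 2)

def line_constraint_cost(p, fp, normal, d, impact, radius):
    """Eq. (12): penalize mappings whose line distance differs from that of p."""
    delta = line_distance(p, normal, d) - line_distance(fp, normal, d)
    return (delta ** 2) * line_weight(p, normal, d, impact, radius)
```

For a distinct edge one would typically choose a large `impact` and a `radius` of only a few pixels, so that the constraint dominates near the line while leaving the remaining synthesis unaffected.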

Barnes et al. explicitly change the mapping positions of pixels belonging to user-defined constraints after each optimization iteration by forcing them to lie on a straight line. The straight line is determined by a RANSAC [29] voting of all pixels belonging to the user-defined constraint.

Instead, our approach is more flexible. More complex constraints such as ellipsoids or splines can be easily supported due to the application of a universal distance function.

4 REAL-TIME VIDEO INPAINTING

The significant performance improvement compared to previous approaches allows us to apply the image manipulation techniques to video streams in real-time. However, in addition to the aspects to be considered in image manipulation, manipulations of video streams additionally require frame-to-frame coherence. The simplest video stream manipulation would be a static image synthesis in the first frame, followed by an interpolation of this artificial image content in subsequent video frames. However, pure content interpolation cannot sufficiently handle dynamic effects in a video stream, like changing light conditions and moving camera positions. Therefore, our approach creates a synthesis result based on the visual information of the recent frame to handle dynamic elements, while using an already synthesized key frame as a reference model.

4.1 Object Selection

Interactive object selection for static image inpainting is a rather trivial problem, as the synthesis mask may be defined by the user applying common selection techniques such as polygon or lasso tools. The closer and more accurately the object to be removed is selected (including, e.g., areas shadowed by the object), the better the final synthesis result. Object selection and image segmentation for static frames has been in the focus of research for several decades. The most sophisticated approaches include combinations of graph, mean-shift, scribble and painting based techniques, like the work of Ning et al. [30], Rother et al. [31], Levin et al. [32], Arbeláez et al. [33] or the progressive paint selection of Liu et al. [34]. However, although their segmentation results for arbitrary objects may achieve a high accuracy, these approaches do not provide real-time performance, often require several iterative adjustments by the user, and never include image areas shadowed by the objects to be removed. Existing selection and segmentation approaches for video streams are even slower, like the approaches of Tong et al. [35] and Bai et al. [36].

Thus, in our previous work [18] we applied a simple active contour algorithm allowing to detect and track objects providing a noticeable boundary with an almost homogenous background. This approach allowed users to roughly select previously unknown objects while tracking them during the successive video frames. Although our previous approach benefited from the usability and the high selection and tracking performance, the boundary and homogenous background constraints limited the number of application areas significantly.

Therefore, we combined the rather rough selection technique of our previous approach with a more complex segmentation approach, allowing for non-homogenous backgrounds and more strongly blurred object contours. As recent state-of-the-art segmentation approaches are not real-time capable, we developed a segmentation and tracking algorithm based on the following three assumptions:

• Objects to be removed are entirely visible in the video frame and do not intersect with the frame boundaries in successive frames.

• The image area to be removed is directly enclosed by image content visibly different from the undesired area.

• The enclosing image content itself may be textured and may change along the object boundary.

The three assumptions are fulfilled by an approach detecting image content visually not matching the characteristics of the rough selection.

In detail, the selection approach is based on several fingerprints $U = \{U_{p_1}, U_{p_2}, \ldots, U_{p_n}\}$ storing the appearance of the frame at $n$ equally distributed positions $p_1, p_2, \ldots, p_n$ spread on the roughly defined object contour. Fingerprints $U_{p_i}$ are determined by a function $f(p_i): I \to \mathbb{Q}^m$ defining the most important visual characteristics of each point. Each fingerprint $U_i$ is composed of $m$ individual components defining disjoint fingerprint characteristics, and thus $U_i = \{u_1, u_2, \ldots, u_m\}$.

The $n$ fingerprints are compared to the entire image content inside the rough user-defined contour and tested for similarity. A pixel $p$ inside the rough contour is considered an undesired pixel if:

$$\sum_{k=1}^{|U|} d_f(f(p), U_k) \ge |U| \cdot \gamma, \qquad (17)$$

where $\gamma$ defines the amount of necessary fingerprint dissimilarities to consider the pixel $p$ as undesired, and $d_f$ measures the dissimilarity between two fingerprints $U_k$ and $U_j$ by:

$$d_f(U_k, U_j) = \begin{cases} 0, & \tilde{d}_f(u_{k_i}, u_{j_i}) \le \nu_i, \; u_{k_i} \in U_k, \; u_{j_i} \in U_j, \; \forall i \in [1, m], \\ 1, & \text{else}, \end{cases} \qquad (18)$$

while $\tilde{d}_f$ measures the component-wise distance between the corresponding components of two individual fingerprints. The distances are rated according to reference thresholds $\nu_i$.

The cost measure $\tilde{d}_f$ is defined as the one-dimensional Euclidean distance:

$$\tilde{d}_f(u_{k_i}, u_{j_i}) = |u_{k_i} - u_{j_i}|, \qquad (19)$$

while the corresponding thresholds $\nu_i$ are adapted to the deviation of $U$. However, instead of using the entire deviation of all fingerprints $U$, we separate $U$ into disjoint fingerprint clusters $C = \{C_1, C_2, \ldots, C_b\}$ with $C_1 \cup C_2 \cup \cdots \cup C_b = U$ and $b \le n$, and take the maximal deviation from all $b$ clusters as reference thresholds. Thus, each $\nu_i$ is calculated by

$$\nu_i = \max_{j \in [1, b]} \sqrt{E\big[C_j^2\big] - \big(E[C_j]\big)^2}. \qquad (20)$$

Without clustering, an application of the deviation would be useless for background images represented by more than one visual characteristic. Fig. 8a depicts a selection situation with individual background areas, illustrating the necessity of data clustering.

Finally, the cluster calculation has to be investigated. A wide variety of clustering algorithms exists, providing individual clustering accuracies and run-time complexities, such as hierarchical, k-means, or distribution-based clustering approaches. However, concerning the real-time requirement, we decided to apply a very simple but, for our needs, adequate clustering algorithm:

1. First, one fingerprint is selected randomly and defined as the center of the first cluster.

2. Afterwards, the remaining fingerprints are assigned to the new cluster if their distance is close enough to the cluster's center.

3. If fingerprints are still left, one of them is again selected randomly, defining the center of a new cluster. The algorithm then continues with step 2.

This algorithm is repeated several times, and the clustering result with the lowest maximal deviation within all clusters is finally accepted (see the sketch below). Obviously, the clustering method is very efficient, and we found that the result is accurate enough for our needs. Another benefit is that the number of clusters does not have to be defined in advance, as, e.g., for k-means algorithms. Thus, the more variation within the fingerprints, the higher the number of final clusters.
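A minimal sketch of this clustering step, assuming the fingerprints are given as $m$-dimensional NumPy vectors; the membership threshold `max_dist` and the number of repetitions `rounds` are our own hypothetical parameters:

```python
import random
import numpy as np

def cluster_fingerprints(fingerprints, max_dist, rounds=5):
    """Random-center clustering; keeps the result with the lowest maximal deviation."""
    best_clusters, best_deviation = None, float("inf")
    for _ in range(rounds):
        remaining = list(fingerprints)
        clusters = []
        while remaining:
            # Steps 1/3: pick a random remaining fingerprint as a new cluster center.
            center = remaining.pop(random.randrange(len(remaining)))
            cluster, rest = [center], []
            for u in remaining:                   # step 2: assign close fingerprints
                if np.linalg.norm(u - center) <= max_dist:
                    cluster.append(u)
                else:
                    rest.append(u)
            remaining = rest
            clusters.append(np.array(cluster))
        # Maximal per-component deviation over all clusters, cf. Eq. (20).
        deviation = max(c.std(axis=0).max() for c in clusters)
        if deviation < best_deviation:
            best_clusters, best_deviation = clusters, deviation
    return best_clusters
```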

We defined $\gamma = 0.95$ to detect undesired pixels if at least 95 percent of the fingerprints reject the corresponding pixel. As data base we use the color image with three channels. Several individual fingerprint functions $f(\cdot)$ have been evaluated, and we found that a simple Gaussian filter with a large kernel provides sufficient results. However, for performance reasons we apply a modified mean filter with one large and one small kernel size in combination with an integral image. This optimized mean filter needs only eight integral image lookups and performs much faster than a Gaussian filter while providing sufficiently good results.
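A box mean over a summed-area table needs four lookups per kernel, so evaluating one large and one small kernel per position yields the eight lookups mentioned above. A sketch under our own naming, for a single channel and ignoring image borders (the paper's exact combination of the two kernel responses is not detailed here):

```python
import numpy as np

def integral_image(channel):
    """Summed-area table with a zero border row and column."""
    sat = np.pad(channel.astype(np.float64), ((1, 0), (1, 0)))
    return sat.cumsum(axis=0).cumsum(axis=1)

def box_mean(sat, y, x, half):
    """Mean over a (2*half+1)^2 box around (y, x) using four table lookups."""
    y0, y1 = y - half, y + half + 1
    x0, x1 = x - half, x + half + 1
    total = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
    return total / float((2 * half + 1) ** 2)
```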

Further, the entire fingerprint segmentation is combined with a multiresolution approach reducing the computational effort. The approach starts at the coarsest image resolution and forwards the result to the next finer layer. On this finer layer, only the border pixels are investigated according to their fingerprint dissimilarity. Finally, a dilation filter is applied to remove possible tiny gaps. All pixels considered as undesired build the binary synthesis mask. A corresponding synthesis contour directly enclosing this mask is determined.

According to the available processing time, the performance and accuracy of the approach can be tailored easily by adding or removing contour fingerprints or visual characteristics $f_i$ (e.g., color and texture channels). Obviously, the more information is used to determine the dissimilarities, the better the selection result.

We found that the application of fingerprints with three elements, one for each image color channel, is a good compromise between segmentation accuracy and application speed. However, if accuracy is more important than computational time, or if the computational hardware is powerful enough, we use a fourth fingerprint element by default. The fourth fingerprint element represents a simple textureness property determined by averaging the Scharr response [37] for the grayscale image information.

4.2 Object Tracking

Once the object's contour is found, the determined object contour has to be tracked in successive video frames. Therefore, we apply a two-phase contour tracking approach. In the first phase, a homography-based contour tracking is used, while in the second phase the new contour is refined and adjusted with regard to the undesired object area.

Typically, the contour points found will be rather planar. Thus, the relation of a set of tracked contour points between two consecutive frames may be described by a homography [27]. For cases with non-planar backgrounds, the motion between two frames can be expected to be very small. Thus, even in these situations, determining the homography may provide a good approximation of the contour movement. For homography determination, only the strongest contour points are tracked between two video frames by using a pyramid-based motion detection. A Harris corner vote distinguishes between contour points with good and bad motion tracking properties. Finally, within several RANSAC iterations, a reliable homography is determined [27]. However, determination of the homography may fail if only an insufficient number of contour points can be tracked reliably. This may happen if the point motion detection is inaccurate due to almost homogenous image content.
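The first tracking phase could be sketched with standard OpenCV primitives as follows. This is not the authors' implementation: the pyramidal Lucas-Kanade flow stands in for the pyramid-based motion detection, the Harris-based selection of strong contour points is omitted, and all parameter values are placeholders.

```python
import cv2
import numpy as np

def track_contour_homography(prev_gray, cur_gray, contour_points):
    """Phase 1: track contour points and estimate a RANSAC homography."""
    pts = contour_points.astype(np.float32).reshape(-1, 1, 2)
    # Pyramidal Lucas-Kanade flow between the previous and current frame.
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None,
                                                 winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    if good.sum() < 4:                  # a homography needs at least four points
        return None                     # tracking failed for this frame
    # Robust estimation discards badly tracked points as outliers.
    H, _mask = cv2.findHomography(pts[good], nxt[good], cv2.RANSAC, 3.0)
    return H
```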

The second phase of contour tracking is necessary to adjust the new contour positions (determined by the homography) more precisely to the real object contour. If no adjustment were applied, a solely homography-based tracking would accumulate small drift errors over time. Therefore, the fingerprint dissimilarity is used again. This time, the reference fingerprints are automatically defined outside the new contour (equally distributed) in the current frame. However, in order to ensure the required performance, the contour adjustment is performed for a randomly selected subset of the new contour points only. For each random point, a virtual line perpendicular to the new contour, starting outside and ending inside the new contour, is defined. For each pixel covered by this virtual line, the fingerprint dissimilarity is determined. To avoid the influence of fingerprints at the opposite side of the contour, only the new fingerprints in the direct neighborhood are investigated. The first pixel on the virtual line identified as undesired and followed by at least two successive undesired pixels defines a contour landmark for the virtual line. Finally, all contour points extracted in the first tracking phase are adjusted according to the contour landmarks in their neighborhood (see Fig. 8b). The adjusted positions define the final and accurate contour.

Fig. 8. Object selection and tracking scheme. (a) Object segmentation by the application of fingerprints distributed along the rough user selection. The fingerprints are clustered into disjoint sets. (b) Adjustment of the contour received by the homography; red squares: randomly selected contour points with perpendicular virtual lines; white squares: resulting landmarks; blue dashed line: resulting accurate and final contour.

We wish to emphasize that the homography is calculated in the first tracking phase, while the final synthesis mask is determined after the second phase. Therefore, the homography is based on strong correspondences between contour points, while the mask is based on the actual object border. This separation is important, as the determined homography will later be used to ensure visual coherence of subsequent frames for non-linear camera movements. In contrast to the active snake approach of our previous work [18], the improved object tracking provides unique correspondences for contour points between successive video frames.

4.3 Video Inpainting Pipeline

First, the user selects the image parts to be replaced. According to this selection, a precise contour is calculated and a static image synthesis is invoked. The synthesized image is provided to the viewer and stored as the key frame $I_K$. Then the homography and the contour are determined as described before. The key frame, the current binary synthesis mask $M_n$, and the calculated homography define a reference model $I_R$ for each new video frame $I_n$: current frame pixels $p \in I_n$ are copied directly for all desired pixels (with $M_n(p) = 1$), while undesired pixels (with $M_n(p) = 0$) are replaced by interpolated information from the key frame $I_K$. This interpolation is defined by a concatenation of homographies $(\cdots \cdot H_{n-2} \cdot H_{n-1} \cdot H_n) = H_K$ determined since the key frame was last changed:

$$I_R(p) = \begin{cases} I_n(p), & M_n(p) = 1, \\ I_K(H_K \cdot p), & M_n(p) = 0, \end{cases} \quad \forall p \in I. \qquad (21)$$
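Eq. (21) amounts to warping the key frame by the accumulated homography and compositing it with the current frame under the mask. A hedged OpenCV sketch with our own naming; `cv2.WARP_INVERSE_MAP` makes the destination pixel $p$ sample the source at $H_K \cdot p$, directly matching the equation:

```python
import cv2
import numpy as np

def reference_model(cur_frame, key_frame, mask_n, H_K):
    """Eq. (21): desired pixels from the current frame, undesired ones from I_K."""
    h, w = cur_frame.shape[:2]
    # With WARP_INVERSE_MAP, warped_key(p) = key_frame(H_K * p).
    warped_key = cv2.warpPerspective(key_frame, H_K.astype(np.float64), (w, h),
                                     flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    ref = cur_frame.copy()
    ref[mask_n == 0] = warped_key[mask_n == 0]   # replace undesired pixels
    return ref

def accumulate_homography(H_K, H_n):
    """Concatenate the per-frame homography since the key frame was set."""
    return H_K @ H_n
```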

Our approach allows for compensation of ambient lighting changes. Inpainting contour points $B = \{b_1, b_2, b_3, \ldots\}$ of the current frame are transformed to the corresponding points $B' = \{b'_1, b'_2, b'_3, \ldots\}$ in the key frame $I_K$ by application of the gathered homography $H_K$. In Fig. 9, the relationship between the corresponding contour points $b_i$ and $b'_i$ is depicted. For all pairs of contour points, the color differences $\Delta b_i = b_i - b'_i$ between the current frame and the key frame are stored individually for each image channel. A sparse virtual grid $G$ is defined covering the entire inpainting mask of the current frame. The nodes of the grid $G = \{g_1, g_2, g_3, \ldots\}$ receive approximated color corrections by interpolation of the gathered contour differences $\Delta b_i$. Thus, for a grid node $g_i \in G$ the approximated color correction $m(g_i)$ is determined by

$$m(g_i) = \frac{1}{u(g_i)} \sum_{k=1}^{|B|} (b_k - b'_k) \cdot e^{-\sqrt{|g_i - b_k|}}, \qquad (22)$$

while $u(g_i)$ represents a simple normalization factor:

$$u(g_i) = \sum_{k=1}^{|B|} e^{-\sqrt{|g_i - b_k|}}. \qquad (23)$$

The virtual grid $G$ is then used to correct the reference frame $I_R$ according to the illumination changes between the current and key frame. Each image pixel $p$ that has been interpolated from the key frame is corrected by a bi-linear interpolation of the four nearest grid nodes. An application of the Poisson equation [38] may provide more accurate results than the proposed approximation. However, we found that the interpolation allows for compensating the most important lighting changes while processing in less than 1 ms, which is several magnitudes faster than, e.g., a Poisson-related approach.
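A sketch of Eqs. (22) and (23) for a single grid node, with our own identifiers: `positions` holds the contour point coordinates $b_k$, while `colors_cur` and `colors_key` hold the corresponding colors in the current and key frame.

```python
import numpy as np

def grid_correction(g, positions, colors_cur, colors_key):
    """Eqs. (22)-(23): distance-weighted interpolation of contour color differences."""
    diffs = colors_cur.astype(np.float64) - colors_key.astype(np.float64)  # delta b_k
    dists = np.linalg.norm(positions - g, axis=1)        # |g_i - b_k|
    weights = np.exp(-np.sqrt(dists))                    # e^(-sqrt(|g_i - b_k|))
    return (weights[:, None] * diffs).sum(axis=0) / weights.sum()
```

Each reference-frame pixel interpolated from the key frame is then shifted by a bi-linear interpolation of the corrections of its four nearest grid nodes.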

The reference frame is used only for coherence guiding, as the video inpainting uses image information of the current video frame to create the inpainting result. In contrast to previously introduced approaches, the reference frame is not simply used as the final video inpainting content but provides a visual reference model that has to be reconstructed or approximated by the image inpainting algorithm. Visual noise of the remaining video content will be adopted by the video inpainting approach while a coherent video stream is synthesized. Zooming camera movements can be supported, as more detailed visual content may be introduced by the guiding capabilities of the reference model.

The compensation of lighting changes creates a reference model matching the illumination conditions of each current video frame. In Fig. 10, a comparison between reference models with and without illumination correction is provided. Obviously, the reference model with lighting compensation will result in more convincing image quality than a reference model that does not compensate for changing lighting conditions.

Fig. 9. Lighting compensation between the key frame $I_K$ and the reference frame $I_R$. The sparse grid receives the approximated correction values $\Delta b_i$ by an interpolation. The interpolation is depicted for two exemplary grid points $g_1$ and $g_2$. Afterwards, the individual grid nodes are used to modify the reference frame $I_R$ by a bi-linear interpolation of the grid values.

Afterwards, the synthesis module replaces all undesired mask pixels (with $M_n(p) = 0$) of the current frame with pixels of the remaining image data while creating a visual result almost identical to the given reference model $I_R$. Therefore, as described for the static inpainting approach, a current mapping $f$ is determined having minimal overall costs for the current frame. The mapping found for the previous frame is adjusted as the initial guess to improve the quality and to speed up the convergence (see Section 4.4). In contrast to the static image synthesis, the appearance costs are extended and additionally measured between the synthesis result and the reference model (see Section 4.5). This extension ensures that a sufficient coherence between successive video frames will be achieved. Then, the synthesized image content is blended with the original frame at the synthesis border to avoid mask border effects like semi-sharp transitions. Tests revealed that a blend border of 1-3 pixels is typically sufficient. Depending on the age of the key frame, it might be replaced by the final synthesis result. However, the number of key frame replacements has to be balanced carefully to avoid visual drifts during the synthesis. Afterwards, the synthesis pipeline restarts with the new video frame.

4.4 Mapping Forwarding

In the case of a given mapping function from the previous synthesis, the number of iterations to find a transformation for the new frame can be reduced significantly. A mapping initialization for each mask pixel is determined from the corresponding pixel in the previous image by application of the associated mapping. Therefore, the already determined homography is applied as depicted in Fig. 11. An arbitrary mask pixel $p$ in frame $n$ is translated to the corresponding pixel $p'$ in frame $n-1$ by

$$p' = Hp, \qquad (24)$$

with known homography $H$. Further, the previous mapping in frame $n-1$ at point $p'$ is given by

$$m' = f_{n-1}(p'). \qquad (25)$$

Thus, a sufficiently precise prediction of this mapping for the current frame $n$ can be calculated by applying the inverse homography:

$$\hat{m} = H^{-1}m'. \qquad (26)$$

Thus, the initial mapping prediction of an arbitrary mask point in frame $n$ is given by:

$$\hat{f}_n(p) = H^{-1}(f_{n-1}(Hp)). \qquad (27)$$

In general, an initial guess of $\hat{f}_n$ is defined by

$$\hat{f}_n = H^{-1} f_{n-1} H. \qquad (28)$$

Please note that the application of the inverse homography commonly will not result in a precise mapping for the current frame. This is due to the determination of the homography from the object contour, providing exact estimations for totally planar contours only, while non-planar contours will introduce a certain error in the calculation of the homography. However, as the change in content between two successive video frames is low, this inaccuracy is negligible, as it will be adjusted in the subsequent cost minimization iterations. These subsequent cost minimization iterations also handle cases where the predicted initial guess $\hat{f}_n$ lies outside the camera frame. In this case our algorithm starts with a random choice and iteratively tries to reduce the matching cost as described in Section 3.3.
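The prediction of Eq. (27) can be sketched per mask pixel using homogeneous coordinates; `f_prev` is assumed to be a lookup returning the previous frame's mapping position for an (integer) pixel, and all names are our own:

```python
import numpy as np

def apply_h(H, p):
    """Apply a 3x3 homography to a 2D point p = (x, y)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def forward_mapping(H, f_prev, p):
    """Eq. (27): predict the mapping of mask pixel p in the new frame n."""
    p_prev = apply_h(H, p)                          # Eq. (24): position in frame n-1
    m_prev = f_prev(np.round(p_prev).astype(int))   # Eq. (25): previous mapping
    return apply_h(np.linalg.inv(H), m_prev)        # Eq. (26): back into frame n
```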

4.5 Coherence with Extended Appearance Cost

When compared to the static image synthesis, a coherent video inpainting must additionally follow the appearance of the reference model $I_R$. Thus, the appearance cost from (5) is extended to measure not only the cost between the synthesized data and the remaining image but also the cost between the synthesized data and the reference model. While the standard appearance cost ensures coherence inside the frame itself, the additional appearance cost ensures visual coherence between successive synthesized frames. Therefore, the appearance cost is extended by the additional cost term $\text{cost}'_{\text{appearance}}: T \to \mathbb{Q}$:

$$\text{cost}'_{\text{appearance}}(p) = \sum_{\vec{v} \in N_a} d'_a[i(p + \vec{v}),\, r(p + \vec{v})] \cdot \omega'_a(p + \vec{v}), \qquad (29)$$

Fig. 11. Mapping forwarding by homography.

Fig. 10. Comparison between a default (b) and a corrected reference model (c) regarding illumination changes of the original video frames (a).


where $r(p)$ holds the image pixel value of the reference model, with its own distance function $d'_a$ and weight function $\omega'_a$. The distance measure as well as the weight function may be identical to $d_a$ and $\omega_a$ from (5), respectively. However, if quality is more important than performance, a zero-mean SSD instead of a simple SSD might produce better results, because changing light conditions between the current frame and the reference model will be better supported.

4.6 Real-Time Inpainting with Constraints

As described in Section 3.4, the visual inpainting quality can be improved by the application of constraints. Fig. 13 gives a comparison between the standard video inpainting and the more advanced video inpainting using structural constraints. The line constraint is determined by application of a Hough line detector [27] that detects all lines occluded by the undesired image content (the inpainting mask). A line that is detected in the initial camera frame is tracked from frame to frame using the homography, followed by an iterative refinement step adjusting the line to best fit the visual edge in the camera frame. Obviously, the constrained video inpainting provides a reconstructed edge with sub-pixel accuracy, while the remaining undesired image content is synthesized with visual content still matching the surrounding environment.
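The line detection could, for instance, be realized with a probabilistic Hough transform on an edge image, keeping only lines near the inpainting mask. The following sketch uses standard OpenCV calls; the thresholds and the rough endpoint-proximity heuristic are our own placeholders, not the authors' parameters:

```python
import cv2
import numpy as np

def detect_constraint_lines(gray, mask):
    """Find candidate line constraints close to the inpainting mask."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=40, maxLineGap=5)
    constraints = []
    if lines is not None:
        dilated = cv2.dilate(mask, np.ones((9, 9), np.uint8))  # near-mask region
        for x0, y0, x1, y1 in lines[:, 0]:
            # Keep lines with at least one endpoint near the undesired region.
            if dilated[y0, x0] or dilated[y1, x1]:
                constraints.append(((x0, y0), (x1, y1)))
    return constraints
```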

5 DISCUSSION

In this section we will discuss the limitations of our approach and look into quality issues.

5.1 Limitations

Although our real-time capable video inpainting system is able to provide a high-quality result as well as a coherent video stream, it has some limitations which have to be discussed. We found that the fingerprint-based object selection approach is a significant improvement compared to our previous approach. However, the object selection may still fail due to appearance complexity. The more individual characteristics the object and the background have, and the smaller the difference between the object and the background characteristics, the more difficult a reliable selection becomes.

Fig. 12 shows two examples which our real-time selection approach is not able to handle in a sufficient way. In the first example, significant visual differences between the fingerprints of the rough user-defined contour and the visual characteristics of the stone are missing. Therefore, an area not intended by the user is selected. In contrast, the selection in the second example fails at a small part of the object only. The area of the car's hatchback is not selected, as the rough contour fingerprints scattered over the gray sky are almost identical to the visible characteristics of the hatchback due to the reflection of the sky.

The presented video inpainting system is able to provide a coherent high-quality result for a hand-held camera with dynamic translations and rotations. Due to the determination of the homography from the object's contour, the object must be static and should have an almost planar background. However, if the camera is rather static or its movements are limited to very small amounts, a coherent video stream can be provided even for non-planar backgrounds. While our approach supports real-time Diminished Reality applications with a camera moving around a rather static object to be removed, it does not yet cover scenarios where the object moves around or where both camera and object are dynamic. Further, our approach currently does not sufficiently cover situations where a previously unknown background becomes visible later on. These situations may occur if an object with a significant volumetric expansion (perpendicular to the surface) has to be removed in combination with dynamic camera movements. A previously synthesized image area is then replaced by the real image content. In order to realize a coherent view, additional fading mechanisms will be required. However, as a real-time video inpainting system does not have any information about future video frames (in contrast to removing objects from a pre-recorded video), a trivial solution for this problem does not exist (see Fig. 14).

Fig. 13. Comparison between our video inpainting approach with and without line constraint. (a) Original video frame, (b) our approach without constraints, (c) our approach with constraints, (d)-(e) subsequent video frames with constraints, (f) magnified comparison of (b) and (c).

Fig. 12. Real-time selection results of the fingerprint segmentation approach. (a) Original images with rough user-defined selections, (b) the stone cannot be uniquely separated from the environment, (c) the algorithm fails to identify parts of the hatchback.


5.2 Visual Quality

In this section we provide a detailed comparison between well-known benchmark images of recent state-of-the-art approaches, our previous approach [18], and our current image inpainting algorithm.

Figs. 15 and 16 show that our proposed approach is able to recover simple image structures with visual results comparable to related approaches while processing several orders of magnitude faster.

Fig. 16 shows that our previous approach produced undesired blurring effects for heterogeneous images. Although the patch similarity approach of Wexler et al. [13] is known to tend toward minor blurring artifacts, the blurring of our previous approach might be slightly larger due to the real-time performance constraints. In contrast, the pixel mapping approach introduced in this paper cannot result in blurring artifacts, as the final inpainting result is not determined by superimposing image patches.

The image inpainting result of our approach for an image with more than 2.6 megapixels is provided in Fig. 17.

5.3 Performance Issues

We measured the performance of our approach on a laptop with an Intel Core i7-3840QM CPU at 2.8 GHz running Windows 7. The implementation is realized in C++. We measured the entire performance for the video inpainting (including tracking, line detection, reference frame creation, and video synthesis), applying a video resolution of 640 × 480 pixels. Table 1 provides the detailed performance values, showing the dependency between performance and the amount of pixels to be removed in each frame. The table shows a comparison between the standard video inpainting and the video inpainting applying a line constraint. The line constraint is determined by application of a Hough line detector.

The measurements show that the video inpainting approach applying a line constraint needs approx. 30 percent more time compared to the standard video inpainting approach. The performance loss is due to the additional line detection and the additional computational time for the constraint cost as defined in (8). Therefore, our system reaches between 24 and 41 fps, providing the visual quality results depicted in Fig. 13. For a more detailed performance and complexity consideration please refer to our previous work [22].

6 CONCLUSION

In this paper we presented PixMix, our pixel-based approach to real-time image and video inpainting. We further showed how this approach enables the realization of Diminished Reality. Additionally, a real-time capable object selection and tracking algorithm has been introduced. Our inpainting approach allows for balancing between the spatial and the appearance term of the cost function in order to provide optimal inpainting results. The overall results showed fewer artifacts than other approaches, allowing for high-quality image inpainting in real-time. This provided the basis for our self-contained

Fig. 15. Result comparison for the Wood image. (a) Original image with inpainting mask, (b) result by Drori et al. [2], (c) result by Shen et al. [39], (d) our previous result [18], (e) our result. The reference images are taken from [39], kindly authorized by Shen and Drori, © 2010 IEEE.

Fig. 16. Result comparison for the Bungee image with four individual approaches. (a) Original image with inpainting mask, (b) result by Criminisi et al. [1], (c) result by Shen et al. [39], (d) result by Kwok et al. [40], (e) our previous result [18], (f) our new result. The reference images have been taken from [1], [39] and [40], kindly authorized by the authors, © 2010 IEEE.

Fig. 14. Video inpainting of a volumetric object. (a) Original video frame, (b) inpainting result after a few seconds, (c) video frame after the camera has moved around the volumetric object.

Fig. 17. Inpainting result example with leaves in the background. (a) Original image with inpainting mask, (b) inpainting result. Image source: bbroianigo / pixelio.de (creative commons image database).


high-quality video inpainting approach. We achieved this by extending the overall cost function by a frame-to-frame coherence term and by applying a homography as a first guess for the mapping in the next frame, providing a significantly better initialization. Our video inpainting approach is based on the previous and the current frame only, allowing for a high-quality manipulation of live video streams. In our future work, we are planning to extend the homography-based approach to arbitrary 3D objects. Further, we intend to publish the results of a user study on the perceived quality and plausibility of our video inpainting approach. While the actual user study has already been conducted, it requires further analysis of the obtained data. Finally, we will investigate smart temporal mapping approaches, providing more sophisticated results for moving objects, while still allowing for real-time video manipulations.

REFERENCES

[1] A. Criminisi, P. Perez, and K. Toyama, “Region Filling and Object Removal by Exemplar-Based Image Inpainting,” IEEE Trans. Image Processing, vol. 13, no. 9, pp. 1200-1212, Sept. 2004.

[2] I. Drori, D. Cohen-Or, and H. Yeshurun, “Fragment-Based Image Completion,” Proc. ACM SIGGRAPH, pp. 303-312, 2003.

[3] J. Sun, L. Yuan, J. Jia, and H.-Y. Shum, “Image Completion with Structure Propagation,” Proc. ACM SIGGRAPH, pp. 861-868, 2005.

[4] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, “Summarizing Visual Data Using Bidirectional Similarity,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR ’08), 2008.

[5] A. Bugeau, M. Bertalmío, V. Caselles, and G. Sapiro, “A Comprehensive Framework for Image Inpainting,” IEEE Trans. Image Processing, vol. 19, no. 10, pp. 2634-2645, Oct. 2010.

[6] J. Jia, T.-P. Wu, Y.-W. Tai, and C.-K. Tang, “Video Repairing: Inference of Foreground and Background under Severe Occlusion,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR ’04), vol. 1, pp. 364-371, June 2004.

[7] J. Jia, Y.-W. Tai, T.-P. Wu, and C.-K. Tang, “Video Repairing under Variable Illumination Using Cyclic Motions,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 832-839, May 2006.

[8] K.A. Patwardhan, G. Sapiro, and M. Bertalmio, “Video Inpainting under Constrained Camera Motion,” IEEE Trans. Image Processing, vol. 16, no. 2, pp. 545-553, Feb. 2007.

[9] Y. Shen, F. Lu, X. Cao, and H. Foroosh, “Video Completion for Perspective Camera under Constrained Motion,” Proc. 18th Int’l Conf. Pattern Recognition (ICPR ’06), vol. 3, pp. 63-66, June 2006.

[10] T. Shiratori, Y. Matsushita, X. Tang, and S.B. Kang, “Video Completion by Motion Field Transfer,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR ’06), vol. 1, pp. 411-418, June 2006.

[11] M.V. Venkatesh, S.-C.S. Cheung, and J. Zhao, “Efficient Object-Based Video Inpainting,” Pattern Recognition Letters, vol. 30, no. 2, pp. 168-179, Jan. 2009.

[12] Y. Zhang, J. Xiao, and M. Shah, “Motion Layer Based Object Removal in Videos,” Proc. Seventh IEEE Workshops Application of Computer Vision (WACV/MOTIONS ’05), vol. 1, pp. 516-521, Jan. 2005.

[13] Y. Wexler, E. Shechtman, and M. Irani, “Space-Time Completion of Video,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 463-476, Mar. 2007.

[14] O. Korkalo, M. Aittala, and S. Siltanen, “Light-Weight Marker Hiding for Augmented Reality,” Proc. IEEE Ninth Int’l Symp. Mixed and Augmented Reality (ISMAR ’10), pp. 247-248, Oct. 2010.

[15] N. Kawai, M. Yamasaki, T. Sato, and N. Yokoya, “AR Marker Hiding Based on Image Inpainting and Reflection of Illumination Changes,” Proc. IEEE Int’l Symp. Mixed and Augmented Reality (ISMAR ’12), pp. 293-294, Nov. 2012.

[16] S. Zokai, J. Esteve, Y. Genc, and N. Navab, “Multiview Paraperspective Projection Model for Diminished Reality,” Proc. Second IEEE/ACM Int’l Symp. Mixed and Augmented Reality (ISMAR ’03), pp. 217-226, Oct. 2003.

[17] A. Enomoto and H. Saito, “Diminished Reality Using Multiple Handheld Cameras,” Proc. Workshop Multi-Dimensional and Multi-View Image Processing (ACCV ’07), 2007.

[18] J. Herling and W. Broll, “Advanced Self-Contained Object Removal for Realizing Real-Time Diminished Reality in Unconstrained Environments,” Proc. IEEE Ninth Int’l Symp. Mixed and Augmented Reality (ISMAR ’10), pp. 207-212, Oct. 2010.

[19] C. Barnes, E. Shechtman, A. Finkelstein, and D.B. Goldman, “PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing,” Proc. ACM SIGGRAPH, pp. 1-11, 2009.

[20] C. Barnes, E. Shechtman, D.B. Goldman, and A. Finkelstein, “The Generalized PatchMatch Correspondence Algorithm,” Proc. 11th European Conf. Computer Vision (ECCV ’10), pp. 29-43, Springer-Verlag, 2010.

[21] S. Darabi, E. Shechtman, C. Barnes, D.B. Goldman, and P. Sen, “Image Melding: Combining Inconsistent Images Using Patch-Based Synthesis,” ACM Trans. Graphics, vol. 31, no. 4, article 82, 2012.

[22] J. Herling and W. Broll, “PixMix: A Real-Time Approach to High-Quality Diminished Reality,” Proc. 11th IEEE Int’l Symp. Mixed and Augmented Reality (ISMAR ’12), pp. 141-150, Nov. 2012.

[23] Y. Pritch, E. Kav-Venaki, and S. Peleg, “Shift-Map Image Editing,” Proc. IEEE 12th Int’l Conf. Computer Vision, pp. 151-158, 2009.

[24] L. Demanet, B. Song, and T. Chan, “Image Inpainting by Correspondence Maps: A Deterministic Approach,” Applied and Computational Math., vol. 1100, nos. 3-40, pp. 217-250, 2003.

[25] K. Takeda and R. Sakamoto, “Diminished Reality for Landscape Video Sequences with Homographies,” Proc. 14th Int’l Conf. Knowledge-Based and Intelligent Information and Engineering Systems (KES ’10), pp. 501-508, 2010.

[26] A.A. Efros and T.K. Leung, “Texture Synthesis by Non-Parametric Sampling,” Proc. Seventh IEEE Int’l Conf. Computer Vision (ICCV ’99), vol. 2, pp. 1033-1038, 1999.

[27] R. Szeliski, Computer Vision: Algorithms and Applications. Springer, 2010.

[28] M. Ashikhmin, “Synthesizing Natural Textures,” Proc. Symp. Interactive 3D Graphics, pp. 217-226, 2001.

[29] M.A. Fischler and R.C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Comm. ACM, vol. 24, pp. 381-395, 1981.

[30] J. Ning, L. Zhang, D. Zhang, and C. Wu, “Interactive Image Segmentation by Maximal Similarity Based Region Merging,” Pattern Recognition, vol. 43, no. 2, pp. 445-456, Feb. 2010.

[31] C. Rother, V. Kolmogorov, and A. Blake, “‘GrabCut’: Interactive Foreground Extraction Using Iterated Graph Cuts,” Proc. ACM SIGGRAPH, pp. 309-314, 2004.

[32] A. Levin, D. Lischinski, and Y. Weiss, “A Closed-Form Solution to Natural Image Matting,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228-242, Feb. 2008.

[33] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour Detection and Hierarchical Image Segmentation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898-916, May 2011.

[34] M. Liu, S. Chen, J. Liu, and X. Tang, “Video Completion via Motion Guided Spatial-Temporal Global Optimization,” Proc. 17th ACM Int’l Conf. Multimedia, pp. 537-540, 2009.

[35] R. Tong, Y. Zhang, and M. Ding, “Video Brush: A Novel Interface for Efficient Video Cutout,” Computer Graphics Forum, vol. 30, no. 7, pp. 2049-2057, 2011.

[36] X. Bai, J. Wang, D. Simons, and G. Sapiro, “Video SnapCut: Robust Video Object Cutout Using Localized Classifiers,” Proc. ACM SIGGRAPH, pp. 1-11, 2009.

TABLE 1
Performance Overview of the Video Inpainting for Two Individual Video Sequences

From left to right: used video sequence; averaged ratio between mask pixels and the entire frame pixels; averaged standard video inpainting; averaged video inpainting with line constraint; the resulting averaged pixel fill rate for standard video inpainting.


[37] H. Scharr, “Optimal Operators in Digital Image Processing,” PhD dissertation, Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Univ., 2000.

[38] P. Pérez, M. Gangnet, and A. Blake, “Poisson Image Editing,” Proc. ACM SIGGRAPH, pp. 313-318, 2003.

[39] J. Shen, X. Jin, C. Zhou, and C.C.L. Wang, “Technical Section: Gradient Based Image Completion by Solving the Poisson Equation,” Computers & Graphics, vol. 31, pp. 119-126, Jan. 2007.

[40] T.-H. Kwok, H. Sheung, and C.C.L. Wang, “Fast Query for Exemplar-Based Image Completion,” IEEE Trans. Image Processing, vol. 19, pp. 3106-3115, Dec. 2010.

Jan Herling received the master’s degree (Dipl.-Inf.) in computer science from the Rheinisch-Westfaelische Technische Hochschule Aachen University in 2008 and the PhD degree in advanced real-time video manipulation from the Ilmenau University of Technology in spring 2013, where he was a research assistant in the Virtual Worlds and Digital Games group until summer 2013. From 2008, he was a researcher at the Fraunhofer Institute for Applied Information Technology in Sankt Augustin. He is CTO and cofounder of fayteq, a company concerned with advanced video manipulation technologies. He participated in the EU projects IPCity, CoSpaces, and EXPLOAR. He was a major contributor to the Morgan AR/VR framework and is the chief developer of the platform-independent Ocean AR framework. He published his research at several international conferences including ACM CHI, IEEE 3DUI, IEEE ISMAR, and ACM VRST. He is currently concerned with real-time image synthesis, computer vision based tracking approaches, and mixed reality applications. He is a member of the IEEE. He is a student member of the ACM and the IEEE Computer Society.

Wolfgang Broll received the master’s degree (Dipl.-Inf.) in computer science from the Darmstadt University of Technology (TU Darmstadt) in 1993 and the PhD degree in computer science from the Tübingen University in 1998. He was a lecturer at the RWTH Aachen University from 2000 to 2009. From 1994 to 2012, he was heading the VR and AR activities at Fraunhofer FIT in Sankt Augustin. He has been doing research in the area of augmented reality, shared virtual environments, multiuser VR, and 3D interfaces since 1993. He is currently a full professor at the Ilmenau University of Technology (TU Ilmenau), where he heads the Virtual Worlds and Digital Games group. He is also a managing partner of fayteq, a company concerned with advanced video manipulation technologies. He was also project manager and general coordinator of several national and international research projects. His current research interests include real-time video manipulation including mediated reality, natural user interfaces, and their application to games. He served on several international program committees and is the author of more than 85 peer-reviewed papers, having presented research at several conferences including IEEE VR, IEEE ISMAR, and ACM SIGGRAPH. He is a member of the IEEE, IEEE Computer Society, ACM SIGGRAPH, and the steering committee of the VR/AR chapter of Germany’s Computer Society (GI).

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
