4838 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 11, NOVEMBER 2014

Video Tonal Stabilization via Color States Smoothing

Yinting Wang, Dacheng Tao, Senior Member, IEEE, Xiang Li, Mingli Song, Senior Member, IEEE, Jiajun Bu, Member, IEEE, and Ping Tan, Member, IEEE

Abstract— We address the problem of removing video color tone jitter that is common in amateur videos recorded with hand-held devices. To achieve this, we introduce color state to represent the exposure and white balance state of a frame. The color state of each frame can be computed by accumulating the color transformations of neighboring frame pairs. Then, the tonal changes of the video can be represented by a time-varying trajectory in color state space. To remove the tone jitter, we smooth the original color state trajectory by solving an L1 optimization problem with PCA dimensionality reduction. In addition, we propose a novel selective strategy to remove small tone jitter while retaining extreme exposure and white balance changes to avoid serious artifacts. Quantitative evaluation and visual comparison with previous work demonstrate the effectiveness of our tonal stabilization method. This system can also be used as a preprocessing tool for other video editing methods.

Index Terms— Tonal stabilization, color state, L1 optimization, selective strategy.

I. INTRODUCTION

A video captured with a hand-held device, such as a cell-phone or a portable camcorder, often suffers from undesirable exposure and white balance changes between successive frames. This is caused mainly by the continuous automatic exposure and white balance control of the device in response to illumination and content changes of the scene. We use "tone jitter" to describe these undesirable exposure and white balance changes. The first row of Fig. 1 shows an example of a video with tone jitter; it can be seen that some surfaces (e.g., leaves, chairs and glass windows) in frames extracted from the video have different exposures and white balances.

Manuscript received September 24, 2013; revised March 23, 2014 and July 23, 2014; accepted September 5, 2014. Date of publication September 17, 2014; date of current version September 30, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61170142, in part by the Program of International Science and Technology Cooperation under Grant 2013DFG12840, in part by the National High Technology Research and Development Program of China under Grant 2013AA040601, and in part by the Australian Research Council under Project Grant FT-130101457 and Project DP-120103730. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Joseph P. Havlicek. (Corresponding author: Mingli Song.)

Y. Wang, X. Li, M. Song, and J. Bu are with the College of Computer Science, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW 2007, Australia (e-mail: [email protected]).

P. Tan is with the School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2358880

It is of great importance to create a tonally stabilized video by removing tone jitter for online sharing or further processing. In this paper, therefore, we address the video tonal stabilization problem of removing undesirable tone jitter from a video.

Farbman and Lischinski [1] have proposed a method to stabilize the tone of a video. One or more frames of the input video are first designated as anchors, and then an adjustment map is computed for each frame to make all frames appear to be filmed with the same exposure and white balance settings as the corresponding anchor. The adjustment map is propagated from one frame to its neighbor, based on the assumption that a large number of the pixel grid points from the two neighboring frames will sample the same scene surfaces. However, this assumption is often violated, especially when the camera undergoes sudden motion or the scene has complex textures. In this case, a very small corresponding pixel set will be produced, and erratic color changes will occur in some regions of the final output. The performance of this method also depends on the anchor selection, so it is tedious for users to carefully examine the entire video and select several frames as anchors following strict rules. If we simply set one anchor at the middle frame, or two anchors at the first and last frames, the resulting video might suffer from over-exposure artifacts or contrast loss, especially for scenes with high dynamic range.

Exposure and white balance changes in an image sequence had been studied in panoramic image construction before the work of Farbman and Lischinski on tonal stabilization. To compensate for these changes, earlier approaches compute a linear model that matches the averages of each channel over the overlapping area in the RGB [2] or YCbCr [3] color space, while Zhang et al. [4] constructed a mapping function between the color histograms in the overlapping area. However, these models are not sufficiently accurate to represent tonal changes between frames and may produce unwanted compensation results. Other methods have been proposed to perform color correction using non-linear models, such as a polynomial mapping function [5] or linear correction for chrominance combined with gamma correction for luminance [6]. However, these models suffer from large accumulation errors and high computational complexity when adapted to video tonal stabilization.

Fig. 1. Five still frames extracted from a video with unstable tone and the results of tonal stabilization. Top: the original frames. Middle: the result of removing all tone jitter. Bottom: the result using our tonal stabilization strategy.

If the camera response function is known, the video tone can be stabilized by applying the camera response function inversely to each frame. Several attempts have been made to model the camera response function by utilizing a gamma curve [7], a polynomial [8], a semi-parametric [9] or a PCA model [10]. However, most of these methods require perfect pixel-level image alignment, which is unrealistic in practice for amateur videos. The work proposed by Kim et al. [11], [12] jointly tracks features and estimates the radiometric response function of the camera as well as exposure differences between the frames. Grundmann et al. [13] employed the KLT feature tracker to find the pixel correspondences for alignment. After alignment, they locally computed the response curves for key frames and then interpolated these curves to generate the pixel-to-irradiance mapping. These two methods adjust all frames to have the same exposure and white balance according to the estimated response curves, without taking into account any changes in the illumination and content of the scene; this leads to artifacts of over-exposure, contrast loss or erratic color in the results.

Color transfer is a topic highly related to this paper. It is possible to stabilize a video by using a good color transformation model to make all frames have a tone similar to a selected reference frame. Typical global color transformation models are based on a Gaussian distribution [14], [15] or histogram matching [16]. An and Pellacini [17] proposed a joint model that utilizes an affine model for chrominance and a mapping curve for luminance. Chakrabarti et al. [18] extended the six-parameter color model introduced in [19] to three nonlinear models, independent exponentiation, independent polynomial and general polynomial, and proved that the general polynomial model has the smallest RMS errors. Local model-based methods [20]–[23] either segment the images and then compute a local model for each corresponding segment pair, or estimate the local mapping between a small region of the source image and the target image and then propagate the adjustment to the whole image. While these global and local models are powerful for color transfer between a pair of images, stabilizing the tone of a video by frame-to-frame color transfer is still impractical because of error accumulation. Furthermore, they cannot handle the large exposure and white balance changes contained in some videos.

Commercial video editing tools, such as Adobe Premiere or After Effects, can be used to remove tone jitter. However, too many user interactions are required to manually select the key frames and edit their exposure and white balance.

In summary, there are two major difficulties in stabilizing the tone of a video:

• How to represent the tone jitter? A robust model is required to describe the tonal change between frames. Because the video contains camera motion and the exposure and white balance settings are not constant, it is very challenging to model the exposure and white balance changes accurately.

• How can the tone jitter be removed selectively? A good strategy should be proposed for tonal stabilization. It should be able to remove tonal jitter caused by imperfect automatic exposure and white balance control, while preserving necessary tonal changes due to illumination and content change of the scene. Videos captured in complex light conditions may have a wide exposure and color range, and neighboring frame pairs from such videos may exhibit very sharp color or exposure changes. Removing these sharp changes will produce artifacts of over-exposure, contrast loss or erratic colors (refer to the second row of Fig. 1). A perfect tonal stabilization strategy will eliminate small tone jitter while preserving sharp exposure and white balance changes, as in the result shown in the last row of Fig. 1.

To overcome these two difficulties, a novel video tonal stabilization framework is proposed in this paper. We introduce a new concept of color state, which is a parametric representation of the exposure and white balance of a frame. The tone jitter can then be represented by the change of the color states between two successive frames. To remove the tone jitter, a smoothing technique is applied to the original color states to obtain the optimal color states. We then adjust each frame to its new color state and generate the final output video. In this way, our method stabilizes the tone of the input video and increases its visual quality. Additionally, the proposed method can also serve as a pre-processing step for other video processing and computer vision applications, such as video segmentation [24], object tracking [25], etc.

Fig. 2. Flowchart of our tonal stabilization method. (a) The input frames. (b) The aligned frames. (c) The correspondence masks. (d) The original color states St. (e) Mt. (f) The new color states Pt. (g) The update matrices Bt. (h) The output frames.

Inspired by the camera shake removal work of Grundmann et al. [26], in which the camera pose of each frame in an input video is first recovered and then smoothed to produce a stabilized result, our method further extends the framework and applies it to this new video tonal stabilization problem. Specifically, the contributions of our work are as follows:

• We use color state, a parametric representation, to describe the exposure and white balance of an image. With this representation, the tone of a frame is described as a point in a high dimensional space, and the video tonal stabilization problem can then be modeled as a smoothing process of the original color states;

• For the first time, we propose a selective strategy to remove undesirable tone jitter while preserving exposure and white balance changes due to sharp illumination and scene content changes. This strategy helps to avoid the artifacts of over-exposure, contrast loss or erratic color when processing videos with high dynamic ranges of tone;

• To achieve tonal stabilization, we combine PCA dimensionality reduction with linear programming for color state smoothing. This not only significantly improves the stabilization results but also greatly reduces the computational cost.

II. OVERVIEW

In this paper, we use color state to represent the exposure and white balance of each frame in the input video. With this representation, the tonal changes between successive frames form a time-varying trajectory in the space of color states. Undesirable tone jitter in the video can then be removed by adaptively smoothing the trajectory.

Fig. 2 shows the flowchart of our method. We first conduct a spatial alignment and find the corresponding pixel pairs between successive frames. This helps us estimate the original color states, denoted as S. The path of S is then smoothed by an L1 optimization with PCA dimensionality reduction to obtain the stabilized color states P. An update matrix Bt is then estimated, and by applying it, each frame t is transferred from the original state St to the new color state Pt to generate the final output video.

We propose a selective strategy to implement video tonal stabilization. Because some videos have sharp exposure and white balance changes, transferring all of the frames to have the same tone will result in serious artifacts. Our goal is to keep the color states constant in the sections of the video with little tone jitter and give the color states a smooth transition between the sections with sharp tone changes. Thus, we adopt the idea in [26] to smooth the path of color states so that it contains three types of motion corresponding to different situations of exposure and white balance changes:

• Static: A static path means the final color states stay unchanged, i.e., D^1_t(P) = 0, where D^n_t(·) is the n-th derivative at t.

• Constant speed: A constant rate of change allows the tone of the video to change uniformly from one color state to another, i.e., D^2_t(P) = 0.

• Constant acceleration: The segments of static and constant-rate motion are both stable; constant acceleration in color state space is needed to connect two discrete stable segments. The transition from one stable segment with a constant acceleration to another segment makes the video tone change smoothly, i.e., D^3_t(P) = 0.

To obtain an optimal path composed of distinct constant, linear and parabolic segments, instead of a superposition of them, we use L1 optimization to minimize the derivatives of the color states. Our main reason for choosing L1 rather than L2 optimization is that the solution induced by the L1 cost function is sparse, i.e., it will attempt to satisfy many of the above motions along the path exactly. The computed path therefore has derivatives that are exactly zero for most segments, which is very suitable for our selective strategy. On the other hand, the L2 optimization will satisfy the above motions on average (in the least-squares sense), which results in small but non-zero gradients. Qualitatively, the L2-optimized color state path always has some small non-zero motion (most likely in the direction of the original color state motion), while the L1-optimized path is composed only of segments resembling static, constant speed and constant acceleration motion.

The rest of this paper is organized as follows. In Section III, we introduce a clear definition of the color state and show how to estimate it. A color state smoothing method is presented in Section IV to stabilize the path of color states. We show our experimental results in Section V and conclude the paper in Section VI.

III. DEFINITION OF COLOR STATE

A. Frame Color State

In this paper, we use the term "color state" to represent the exposure and white balance of an image. Let St denote the color state of frame t of the video. The change to color state St from St−1 is considered to be the exposure and white balance change between these two frames. We use the following affine transformation to model the color state change between two successive frames,

$$A = \begin{bmatrix} a_{00} & a_{01} & a_{02} & b_0 \\ a_{10} & a_{11} & a_{12} & b_1 \\ a_{20} & a_{21} & a_{22} & b_2 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \qquad (1)$$

An affine transformation includes a series of linear transformations, such as translation, scaling, rotation or similarity transformation. These transformations can model the exposure and white balance changes well. An affine model has been successfully applied to user-controllable color transfer [17] and color correction [27]. In practice, although most cameras contain non-linear processing components, we find in our experiments that an affine model can approximate a non-linear transformation well and produce results with negligible errors.

Given a pair of images I and J of different tones, a color transformation function A^j_i can be applied to transfer the pixels in I to have the same exposure and white balance as their corresponding pixels in J. Let x and x' denote a pair of corresponding pixels in I and J, respectively, and I_x = [I^R_x, I^G_x, I^B_x]^T and J_{x'} = [J^R_{x'}, J^G_{x'}, J^B_{x'}]^T represent the colors of these two pixels. Then, J_{x'} = A^j_i(I_x):

$$\begin{bmatrix} J^R_{x'} \\ J^G_{x'} \\ J^B_{x'} \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \\ a_{20} & a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} I^R_x \\ I^G_x \\ I^B_x \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} \qquad (2)$$

Note that the color transfer process in all our experiments is performed in the log domain due to the gamma correction of input videos.
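To make the mapping in Equation (2) concrete, the sketch below applies such an affine color transform to every pixel of an image in the log domain. It is only an illustration of the model under our own assumptions (a float image in [0, 1] and a small clipping epsilon); the function name is ours and the snippet is not the authors' implementation.

```python
import numpy as np

def apply_color_transform(img, A, eps=1e-4):
    """Apply the affine color transform of Equation (2) to an RGB image.

    img : H x W x 3 float array with values in [0, 1] (assumed range)
    A   : 4 x 4 affine matrix; only the top three rows are used
    """
    log_rgb = np.log(np.clip(img, eps, 1.0))        # work in the log domain
    flat = log_rgb.reshape(-1, 3)                   # N x 3 pixel colors I_x
    mapped = flat @ A[:3, :3].T + A[:3, 3]          # J_x' = M I_x + b
    return np.exp(mapped).reshape(img.shape).clip(0.0, 1.0)
```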

Let A^t_{t-1} denote the color transformation between frames t-1 and t. Then, the color state of frame t can be represented by S_t = A^t_{t-1} S_{t-1}. The color transformation between S_0 and S_t can be computed by accumulating all of the transformation matrices from frame 0 to frame t, i.e.,

$$S_t = A^t_{t-1} \cdots A^2_1 A^1_0 S_0. \qquad (3)$$

Fig. 3. Color transfer results by our affine model. (a) The source image I. (b) The target image J. (c) The aligned image. (d) The correspondence mask. The white pixels are the corresponding pixels used to estimate A, and the black ones are the expelled outliers. (e) The result without the constraint term. (f) The result with the identity constraint and using a uniform weight ωc = 2 × 10^3.

Thus St is a 4 × 4 affine matrix and has 12 degrees of freedom. We can further set the color state of the first frame S0 to be the identity matrix. Therefore, to compute the color state of each video frame, we only need to estimate the color transformation matrices for all neighboring frame pairs. In the next subsection, we present how to estimate the color transformation model.
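As a minimal sketch of Equation (3), the pairwise transforms can be accumulated into per-frame color states with S0 fixed to the identity; the function name below is illustrative.

```python
import numpy as np

def accumulate_color_states(pairwise_transforms):
    """pairwise_transforms[t] holds the 4 x 4 matrix mapping frame t to frame t+1.
    Returns the color states S_0, ..., S_T with S_0 = I (Equation (3))."""
    states = [np.eye(4)]
    for A in pairwise_transforms:
        states.append(A @ states[-1])   # S_t = A_{t-1}^t S_{t-1}
    return states
```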

B. Color Transformation Model Estimation

For a common scene point projected in two different frames I and J at pixel locations x and x', respectively, the colors of these two pixels I_x and J_{x'} should be the same. Therefore, if the corresponding pixels can be found, the matrix A^j_i describing the color transformation between frames I and J can be estimated by minimizing

$$\sum_x \left\| J_{x'} - A^j_i(I_x) \right\|^2. \qquad (4)$$

To estimate the color transformation matrix, we first need to find the pixel correspondences. However, we cannot use the sparse matching of local feature points directly to estimate the transformation, because local feature points are usually located at corners, where the surface color is not well defined. The positive aspect is that local feature descriptors (such as SIFT [28] and LBP [29]) are robust to tone differences. Thus, we can use the sparse matching of local feature points to align frames. To achieve that, we track the local feature points using pyramidal Lucas-Kanade [30] and then compute a homography between two successive frames to align them. Fig. 3 shows a challenging alignment case in which (c) is the aligned result from (a) to (b). After alignment, two pixels having the same coordinates in the two images can be treated as a candidate corresponding pixel pair.

To estimate the color transformation more accurately, we further process the candidate correspondence set to remove outliers. Firstly, the video frames may contain noise, which will affect the model computation. To avoid that, we employ a bilateral filter [31] to smooth the video frames. Secondly, the colors of pixels on edges are not well defined and cannot be used to estimate the color transformation. Therefore, we conduct edge detection and exclude pixels around edges. Thirdly, because under- and over-exposure truncation may affect the model estimation, we discard all under- and over-exposed pixels from the candidate correspondence set. Fourthly, we adopt the RANSAC algorithm to further expel outliers during the model estimation, avoiding the effect caused mainly by noise and dynamic objects (such as moving persons, cars or trees in the scene). Fig. 3(d) shows the correspondence mask. Note that in our implementation, frames are downsampled to a small size (shorter side of 180 pixels), so that the computational cost is reduced while an accurate color transformation can still be estimated.
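The following OpenCV-based sketch illustrates the alignment and correspondence-filtering stage. It is a rough approximation under our own assumptions: the feature-count, exposure and edge thresholds are invented for illustration, and the RANSAC-based outlier rejection during the model fit itself is not shown.

```python
import cv2
import numpy as np

def aligned_correspondences(prev, curr, short_side=180):
    """Align frame `prev` onto `curr` with tracked features and a homography,
    then return a boolean mask of candidate corresponding pixels.
    Both frames are uint8 BGR images."""
    scale = short_side / min(prev.shape[:2])
    prev_s = cv2.resize(prev, None, fx=scale, fy=scale)
    curr_s = cv2.resize(curr, None, fx=scale, fy=scale)

    g0 = cv2.cvtColor(prev_s, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(curr_s, cv2.COLOR_BGR2GRAY)
    pts0 = cv2.goodFeaturesToTrack(g0, maxCorners=500, qualityLevel=0.01, minDistance=7)
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(g0, g1, pts0, None)  # pyramidal LK

    good0 = pts0[status.ravel() == 1]
    good1 = pts1[status.ravel() == 1]
    H, _ = cv2.findHomography(good0, good1, cv2.RANSAC, 3.0)
    warped = cv2.warpPerspective(prev_s, H, (curr_s.shape[1], curr_s.shape[0]))

    # Denoise, then drop edge pixels and under-/over-exposed pixels.
    warped = cv2.bilateralFilter(warped, 5, 50, 50)
    curr_f = cv2.bilateralFilter(curr_s, 5, 50, 50)
    edges = cv2.Canny(cv2.cvtColor(curr_f, cv2.COLOR_BGR2GRAY), 50, 150)
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8)) > 0
    exposed_ok = ((warped > 10) & (warped < 245)).all(axis=2) & \
                 ((curr_f > 10) & (curr_f < 245)).all(axis=2)
    mask = exposed_ok & ~edges
    return warped, curr_f, mask
```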

We notice that if we estimate the color transformation by directly minimizing Equation (4), the affine model tends to over-fit the data and accumulate errors at each stage, especially for scenes that contain large regions of similar color. To avoid the over-fitting problem, we can add a regularization term to Equation (4). Because the color tone of a natural video is unlikely to change much between two successive frames, the color transformation of two successive frames should be very close to an identity matrix. Based on this observation, the estimation of the color transformation can be re-formulated as minimizing

$$\frac{\omega_c}{|X|} \sum_x \left\| J_{x'} - A^j_i(I_x) \right\|^2 + \left\| A^j_i - I_{4\times4} \right\|^2. \qquad (5)$$

Here, I4×4 is a 4 × 4 identity matrix. ωc/|X| is the weight used to combine the two terms, where |X| denotes the number of corresponding pixel pairs; ωc was set to 2 × 10^3 in our experiments.
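A sketch of the regularized fit in Equation (5), written as a per-channel ridge-style least squares over matched log-RGB colors. The variable names are ours, the fit assumes the filtering described above has already produced the matched set X, and the closed-form solve below is only one reasonable way to minimize the objective (the paper additionally uses RANSAC during estimation).

```python
import numpy as np

def estimate_color_transform(src_log, dst_log, omega_c=2e3):
    """Estimate the affine color transform of Equation (5) from matched
    log-RGB colors. src_log, dst_log: N x 3 arrays of corresponding colors.
    Returns a 4 x 4 matrix A with last row [0, 0, 0, 1]."""
    n = len(src_log)
    w = np.sqrt(omega_c / n)                        # data-term weight sqrt(omega_c / |X|)
    X = np.hstack([src_log, np.ones((n, 1))])       # N x 4 design matrix [I_x, 1]
    A = np.eye(4)
    for k in range(3):                              # solve each output channel separately
        target_k = np.eye(3)[k]                     # regularizer pulls this row toward identity
        X_aug = np.vstack([w * X, np.eye(4)])       # data rows + ||A - I|| rows
        y_aug = np.concatenate([w * dst_log[:, k], np.append(target_k, 0.0)])
        A[k, :] = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
    return A
```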

The identity constraint helps to choose the solution closest to the identity matrix when several solutions give similarly small errors for the first term of Equation (5), which reduces the over-fitting problem and improves the estimation accuracy. Taking the scene in Fig. 3 as an example, the model estimated without the regularizer over-fits to the creamy-white pixels (table, bookshelf and wall) and causes large errors in the highlighted regions. To provide a numerical comparison, we placed a color check chart into this scene. The accuracy of color transfer is measured by the angular error [32] in RGB color space, which is the angle between the mean color c of the pixels inside a color quad in the transfer result and their mean c̄ in the target image, e_ANG = arccos(c^T c̄ / (‖c‖ ‖c̄‖)). The average angular errors over the color quads in (e) and (f) are 5.943° and 1.897°, respectively. The Red quad (Row 3, Col 3) in (e) has the highest error, 22.03°. For comparison, the highest error of the color check chart in (f) is 4.476°, from Light Blue (Row 3, Col 6). Note that the color check chart is not used to aid the model estimation.
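For reference, the angular error used above can be computed as follows (a small helper of our own naming; the clip only guards against rounding).

```python
import numpy as np

def angular_error_deg(c, c_ref):
    """Angle in degrees between a mean RGB color and its reference (e_ANG)."""
    cos = np.dot(c, c_ref) / (np.linalg.norm(c) * np.linalg.norm(c_ref))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```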

IV. COLOR STATES SMOOTHING

To remove tone jitter, we could transfer all video frames to have a constant color state. However, as discussed in Section I, large tone changes need to be retained in videos that contain large illumination or scene content changes; otherwise, artifacts may arise after removing tone jitter, such as over-exposure, contrast loss or erratic color. In this paper, we propose a selective strategy for video tonal stabilization. Under this selective strategy, we generate a sequence of frames with smooth color states while constraining the new color states to stay close to the original ones. Here, 'smooth' means that the color states remain static or are in uniform or uniformly accelerated motion.

We utilize an L1 optimization to find the smooth color states by minimizing the energy function

$$E = \omega_s E(P, S) + E(P). \qquad (6)$$

E(P, S) reflects our selective strategy; it is a soft constraint that ensures the new color states P do not deviate from the original states S,

$$E(P, S) = |P - S|_1. \qquad (7)$$

E(P) smoothes the frame color states. As mentioned in Section II, the first, second and third derivatives of Pt should be minimized to make the path of Pt consist of segments of static, constant speed or constant acceleration motion:

$$E(P) = \omega_1 |D^1(P)|_1 + \omega_2 |D^2(P)|_1 + \omega_3 |D^3(P)|_1, \qquad (8)$$

where D^n(P) represents the n-th derivative of the new color states P. Minimizing |D^1(P)|_1 encourages the color states to be static. Likewise, |D^2(P)|_1 constrains the color states to uniform motion, and |D^3(P)|_1 relates to the acceleration of the color state motion. The weights ω1, ω2 and ω3 balance these three derivatives. In our experiments, ω1, ω2 and ω3 were set to 50, 1 and 100, respectively. The weight ωs combining the two terms is the key parameter in our method. It determines whether the new color states follow an ideally smooth path or remain very close to the original states. We conducted many experiments to analyze ωs and found that [0.01, 0.3] is its useful tuning range. When ωs = 0.01, the new color states remain constant. In contrast, if ωs = 0.3, the new paths of color states retain part of the initial motion. A detailed discussion of parameter setting is presented in Section V.
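The sketch below evaluates the energy of Equations (6)-(8) for a candidate path using forward differences. It is only a reading aid: the (T, 12) layout of the free color state elements and the default weights are our assumptions, and the actual minimization is described in the remainder of this section.

```python
import numpy as np

def tonal_energy(P, S, w_s=0.1, w1=50.0, w2=1.0, w3=100.0):
    """Evaluate the L1 energy of Equations (6)-(8) for a candidate path.
    P, S: arrays of shape (T, 12) holding the free elements of the new and
    original color states, one row per frame (an illustrative layout)."""
    d1 = np.diff(P, n=1, axis=0)                 # forward differences D^1(P)
    d2 = np.diff(P, n=2, axis=0)                 # D^2(P)
    d3 = np.diff(P, n=3, axis=0)                 # D^3(P)
    smooth = w1 * np.abs(d1).sum() + w2 * np.abs(d2).sum() + w3 * np.abs(d3).sum()
    return w_s * np.abs(P - S).sum() + smooth
```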

Here we discuss the smoothness of the color states across frames. We could optimize Pt similarly to the L1 camera shake removal method [26], using forward differencing to derive |D^n(P)|_1 and minimizing the residuals. However, this approach has limitations. In its optimization objective, the 12 elements of the color state are smoothed independently, with the result that the 12 elements do not change synchronously. Fig. 4 shows the curves of the new states generated by an L1 optimization-based method as in [26]. To relieve this problem in removing camera shake, [26] uses an inclusion constraint requiring the four corners of the crop rectangle to reside inside the frame rectangle. However, there are no such corners or boundaries for color states. Thus, optimizing the new color states by the L1 optimization directly will leave some new color states outside the clusters of the original color states, and the corresponding output frames will have erratic colors (shown in the middle row of Fig. 5).

Fig. 4. The curves of the 12 color state elements before and after the L1 optimization without PCA. Green curves: original color state elements. Red curves: new color state elements. From top to bottom, left to right, each curve corresponds to an element of the color state. The vertical axis plots the value of the corresponding element, and the horizontal axis plots the frame index.

Fig. 5. The comparison of stabilization results generated by the L1 optimization methods without and with PCA. Top: the input video. Middle: the result generated by the L1 optimization without PCA, which contains erratic color in some output frames. Bottom: the result using the L1 optimization with PCA.

Algorithm 1 LP for Color States Optimization

Fig. 6. An example of the paths of original color states (Red curve) and the linear subspace (Green curve) generated by PCA. Note that the plotted dots are not the real color states but simulated values. We choose a point in the scene and use its color curve over time to simulate the color state path.

We therefore seek to improve the smoothing method so that all 12 elements of the color state change in the same way. The path of the original color states is a curve in a 12D space; if we can constrain all of the new color states to lie along a straight line, the above problem will be solved. We employ Principal Component Analysis (PCA) [33], [34] to find a linear subspace near the original color state path. Using PCA, the color states can be represented by St = S̄ + Σ_i β^{c_i}_t S^{c_i}, where S̄ denotes the mean color state over t, and S^{c_i} and β^{c_i}_t denote the eigenvector and per-frame coefficient of the i-th component, respectively. S̄ and S^{c_i} are 4 × 4 matrices, and β^{c_i}_t is a scalar. The mean color state S̄ and the eigenvector of the first component S^{c_1} are used to build this linear subspace, and the new color states are encoded as

$$P_t = \bar{S} + M_t S^{c_1}. \qquad (9)$$

Here, Mt denotes the new coefficient, which is a scalar. Because S^{c_1} is the first principal component, corresponding to the largest eigenvalue, the line of the new color states will not deviate much from the original color state path, as shown in Fig. 6. In this way, we limit the degrees of freedom of the solution to a first-order approximation. Then, in the smoothing method we only need to minimize the magnitude of the velocity and acceleration of the color state path in the L1 optimization, without considering their direction changes. Our method is different from dimensionality-reduction-based smoothing methods that directly smooth the coefficients of the first or several major components with a low-pass filter; we instead find a smoothly changing function of t (a smooth curve), and then all 12 elements of the new color states will have the same motion as the curve of Mt. After PCA, our L1 optimization objective is re-derived and minimized based on Equation (9).
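A sketch of how the one-dimensional subspace of Equation (9) could be built: each 4 × 4 color state is flattened to its 12 free parameters, the mean and first principal direction are taken from an SVD of the centered data, and each frame's coefficient is its projection onto that direction. The function and variable names are ours.

```python
import numpy as np

def pca_subspace(states):
    """states: list of 4x4 color state matrices S_t.
    Returns (S_mean, S_c1, beta), where S_mean and S_c1 are 4x4 matrices and
    beta[t] is the coefficient of frame t along the first component."""
    X = np.stack([S[:3, :].ravel() for S in states])     # T x 12, free parameters only
    mean = X.mean(axis=0)
    # First right singular vector of the centered data = first principal direction.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    c1 = vt[0]
    beta = (X - mean) @ c1                                # projection of each frame

    S_mean = np.eye(4)
    S_mean[:3, :] = mean.reshape(3, 4)
    S_c1 = np.zeros((4, 4))
    S_c1[:3, :] = c1.reshape(3, 4)
    return S_mean, S_c1, beta
```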

Minimizing E(P, S): The new formulation of Pt is substituted into Equation (7),

$$E(P, S) = \sum_t \left| \bar{S} + M_t S^{c_1} - S_t \right|_1. \qquad (10)$$

Minimizing E(P): Forward differencing is used to compute the derivatives of P. Then

$$|D^1(P)| = \sum_t |P_{t+1} - P_t| = \sum_t \left|(\bar{S} + M_{t+1} S^{c_1}) - (\bar{S} + M_t S^{c_1})\right| \le \sum_t |S^{c_1}|\, |M_{t+1} - M_t|.$$

Because |S^{c_1}| is known, we only need to minimize |M_{t+1} - M_t|, i.e., |D^1(M)|. Similarly, we can prove that |D^2(P)| and |D^3(P)| are equivalent to |D^2(M)| and |D^3(M)|, respectively.

Fig. 7. The curves of the 12 color state elements before and after our L1 optimization with PCA. Green curves: original color state elements. Red curves: new color state elements.

Then, our goal is to find a curve Mt such that (1) it changes smoothly and (2) after mapping along this curve, the new color states are close to the original states. Algorithm 1 summarizes the entire process of our L1 optimization. To minimize Equation (6), we introduce non-negative slack variables to bound each dimension of the color state derivatives and solve a linear programming problem as described in [26]. Using Mt to represent the color states, the number of unknown variables to be estimated for each frame becomes 1, and the number of introduced slack variables also declines substantially. This significantly reduces the time and space costs of the linear programming. In our implementation, we employed COIN CLP¹ to minimize the energy function and generate a sequence of stable states.
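A sketch of a linear program in the spirit of Algorithm 1, written with scipy.optimize.linprog. Since Algorithm 1 itself is not reproduced here, the formulation below is our reading of the text: the data term is taken as |Mt − β^{c_1}_t| (matching the coefficient curves of Fig. 8), and each absolute value is bounded by a non-negative slack variable.

```python
import numpy as np
from scipy.optimize import linprog

def smooth_coefficients(beta, w_s=0.1, w1=50.0, w2=1.0, w3=100.0):
    """L1-smooth the per-frame PCA coefficients beta (length T) with an LP that
    bounds |M - beta| and the first three forward differences of M by slacks."""
    T = len(beta)
    I = np.eye(T)
    D1 = np.diff(I, n=1, axis=0)      # (T-1) x T first-difference operator
    D2 = np.diff(I, n=2, axis=0)      # (T-2) x T second differences
    D3 = np.diff(I, n=3, axis=0)      # (T-3) x T third differences

    # Variable layout: [M (T), e (T), s1 (T-1), s2 (T-2), s3 (T-3)].
    sizes = [T, T, T - 1, T - 2, T - 3]
    n_var = sum(sizes)
    offs = np.cumsum([0] + sizes)
    cost = np.concatenate([np.zeros(T), w_s * np.ones(T), w1 * np.ones(T - 1),
                           w2 * np.ones(T - 2), w3 * np.ones(T - 3)])

    def bound_abs(op, rhs, col0):
        """Rows encoding -slack <= op @ M - rhs <= slack as '<=' constraints."""
        rows = op.shape[0]
        upper = np.zeros((rows, n_var)); upper[:, :T] = op
        lower = np.zeros((rows, n_var)); lower[:, :T] = -op
        upper[:, col0:col0 + rows] -= np.eye(rows)
        lower[:, col0:col0 + rows] -= np.eye(rows)
        return np.vstack([upper, lower]), np.concatenate([rhs, -rhs])

    blocks = [bound_abs(I, np.asarray(beta, float), offs[1]),
              bound_abs(D1, np.zeros(T - 1), offs[2]),
              bound_abs(D2, np.zeros(T - 2), offs[3]),
              bound_abs(D3, np.zeros(T - 3), offs[4])]
    A_ub = np.vstack([b[0] for b in blocks])
    b_ub = np.concatenate([b[1] for b in blocks])

    bounds = [(None, None)] * T + [(0, None)] * (n_var - T)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:T]                   # the smoothed coefficients M_t
```

Any LP solver would do in place of the one used here; the resulting Mt is then plugged into Equation (9) to obtain the new color states.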

The curves of our optimal color states and the original color states are shown in Fig. 7. In contrast to the result from the L1 optimization without PCA, all 12 elements of the color state obey the same motion. The last row of Fig. 5 shows examples of the output frames generated by our L1 optimization with PCA; the problem of unusual color has been avoided.

After obtaining the new color states, an update matrix Bt is calculated to transfer each frame from the original color state to the new color state. From the definition of color state, Pt = Bt St. The update matrix Bt can then be computed as

$$B_t = P_t S_t^{-1}. \qquad (11)$$

It is applied to the original frame t pixel by pixel to generate the new frame.
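A per-frame sketch of Equation (11): the update matrix Bt moves a frame from its original state to its smoothed one, applied pixel by pixel in the log domain (again under our assumption of a float image in [0, 1]).

```python
import numpy as np

def retone_frame(frame, S_t, P_t, eps=1e-4):
    """Move frame t from its original color state S_t to the new state P_t via
    the update matrix B_t of Equation (11), applied per pixel in the log domain.
    frame: H x W x 3 float image in [0, 1]; S_t, P_t: 4 x 4 color state matrices."""
    B_t = P_t @ np.linalg.inv(S_t)                       # B_t = P_t S_t^{-1}
    log_rgb = np.log(np.clip(frame, eps, 1.0)).reshape(-1, 3)
    out = log_rgb @ B_t[:3, :3].T + B_t[:3, 3]           # affine map of each pixel
    return np.exp(out).reshape(frame.shape).clip(0.0, 1.0)
```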

V. EXPERIMENTS AND EVALUATION

A. Parameter Setting

The weight that balances the constraint and smoothness terms is the most important parameter in our system. As this weight is varied, the system generates different results. If ωs is small (such as 0.01), Mt tends to be a straight line, which maps all of the frames to the same color state. This keeps the exposure and white balance unchanged throughout. For videos whose exposure and color ranges are too wide, a straight-line mapping will cause some frames to have artifacts, such as over-exposure, contrast loss or erratic color. If ωs is large (such as 1.0), Pt will be very close to St, and most color state changes will be retained. The weight ωs balances these two aspects, and it is difficult to find a general value suitable for all types of videos. We leave this parameter to be tuned by users. We suggest that users tune ωs among three levels, 0.01, 0.1 and 0.3, which were widely used in our experiments. Various weights were tried to stabilize the video in Fig. 1; the curves of Mt are shown in Fig. 8. A comparison of the output videos with different parameters is shown in our supplementary material.²

¹COIN CLP is an open-source linear programming solver that can be freely downloaded from http://www.coin-or.org/download/source/Clp/.

Fig. 8. The curves of Mt with different ωs for the video in Fig. 1. Red curves: Mt. Green curves: β^{c_1}_t. ω1 = 50, ω2 = 1 and ω3 = 100. (a) ωs = 0.01. (b) ωs = 0.1. (c) ωs = 0.3. (d) ωs = 1.0.

Fig. 9. The curves of Mt with different ω1, ω2 and ω3 for the video in Fig. 1. Red curves: Mt. Green curves: β^{c_1}_t. In this experiment, ωs = 0.3. (a) ω1 = 100, ω2 = ω3 = 1. (b) ω2 = 100, ω1 = ω3 = 1. (c) ω3 = 100, ω1 = ω2 = 1. (d) ω1 = 50, ω2 = 1, ω3 = 100.

Three other parameters affecting the color state trajectory are ω1, ω2 and ω3. We explored different weights for stabilizing the video in Fig. 1 and plotted the curves of Mt in Fig. 9. If we set one of the three parameters to a large value and depress the other two, the new color state path tends to be (a) constant but discontinuous, (b) linear with sudden changes, or (c) a smooth parabola, but it is rarely static. A more agreeable viewing experience is produced by setting ω1 and ω3 larger than ω2, because we hope that the optimal path can sustain longer static segments and be absolutely smooth. For this paper, we set ω1 = 50, ω2 = 1 and ω3 = 100; the corresponding curve of Mt is shown in Fig. 9(d).

²The supplementary material can be found on our project page, http://eagle.zju.edu.cn/~wangyinting/TonalStabilization/.

In practical situations, users may prefer the exposure and white balance of some frames and hope to keep the tone of these frames unchanged. Our system can provide this function by using a non-uniform weight ωs,t instead of ωs. We ask the users to point out one or several frames as 'preferred frames' and set a higher ωs,t for these frames. The weights ωs,t for the other frames are chosen by our selective strategy. Then, the new optimal color states for the 'preferred frames' will be very close to the original ones. Fig. 10 gives an example in which the second row is the result generated by setting the first frame as the 'preferred frame', and the third row is the result with the last frame fixed. In this example, we set ωs,t equal to 100 for the 'preferred frame' and 0.01 for the others.

Fig. 10. The stabilization results generated by keeping the exposure and white balance of the 'preferred frame' unchanged. Top: the original video. Middle: the stabilization result with the first frame fixed. Bottom: the stabilization result with the last frame fixed. ωs,t is set to 100 for the preferred frame and 0.01 for the others.

In a similar way, the weight ωc in Equation (5) can be changed to a spatially varying one, ωc,x, and then the estimated affine model will be more accurate for the pixels with larger weights. We can extract the salient regions [35], [36] of each frame, or ask the users to mark their 'preferred regions' and track these regions in each frame. In this way, our system generates more satisfying results for the regions to which users pay more attention. A numerical comparison experiment is described in the next subsection.

B. Quantitative Evaluation

To illustrate the effectiveness of our tonal stabilization method, we employed a low-reflectance grey card for quantitative evaluation, as in [1]. We placed an 18% reflectance grey card into a simple indoor scene and recorded a video with obvious tone jitter using an iPhone 4S; five example frames from the video are shown in the first row of Fig. 11. This video was then processed by our tonal stabilization method.

Note that the grey card was not used to aid or improve the stabilization. We compared our results with Farbman and Lischinski's pixel-wise color alignment method (PWCA) [1]. For PWCA, we set the first frame as the anchor. To obtain a comparable processing result, we set ωs,t to 100 for the first frame and 0.01 for the other frames, so that the exposure and white balance of the first frame was fixed and propagated to the others. Both the uniform weight ωc and the non-uniform weight ωc,x were tried in this experiment. For the uniform weight, we set ωc = 2 × 10^3. For the non-uniform weight, we set ωc,x to 10^4 for all of the corresponding pixel pairs inside the grey card and 2 × 10^3 for the others. We measured the angular error [32] in RGB color space between the mean color of the pixels inside the grey card of each frame and that of the first frame. The plot in Fig. 11(a) shows the angular errors from the first frame for the original video (Red curve), the video stabilized by PWCA (Blue curve), and the results generated by our method with the uniform weight (Green curve) and the non-uniform weight (Dark Green curve). The second column of Table I lists the average errors over all frames for the original video and the results generated by PWCA and our method. Both PWCA and our method performed well, and our method with the non-uniform weight came out slightly ahead.

To assess the benefits of tonal stabilization for white balance estimation, we conducted an experiment similar to that presented in [1]; we applied the Grey-Edge family of algorithms [37] to a video and its stabilization results and compared the performance of white balance estimation. The two white balance estimation methods chosen assume that some property of the scene, such as the average reflectance (Grey-World) or the average derivative (Grey-Edge), is achromatic. We computed the angular error [32] of the mean color inside the grey card of all frames with respect to the ground truth; the plots of the estimation error are shown in Figs. 11(b) and (c). The third and fourth columns of Table I list the average angular errors of each frame after white balance estimation by Grey-World and Grey-Edge, respectively.

The grey card requires that the camera motion during video shooting not be large and that the camera not be very far from the scene; thus, the video used for evaluation is relatively simple. On the other hand, PWCA performs extremely well for large homogeneous regions (such as the grey card). These two factors are the reasons why our method only slightly outperformed PWCA in the quantitative evaluations.

C. Comparison

Fig. 11. Quantitative evaluation of tonal stabilization and consequent white balance correction. The first row shows several still frames from the input video. The second row compares the numerical errors of the original video and the results generated by PWCA and our method. (a) The angular errors in RGB color space of the average color within the grey card of each frame with respect to the first frame. (b) and (c) The angular errors from the ground truth after white balance estimation by Grey-World and Grey-Edge. Red curves: original video. Blue curves: the result of PWCA. Green curves: the result of our method with uniform ωc = 2 × 10^3. Dark Green curves: the result of our method with non-uniform ωc,x, 10^4 for the pixels within the grey card and 2 × 10^3 for the other pixels.

TABLE I
Mean angular errors of each frame. The second column lists the errors with respect to the first frame for the original video and its stabilization results. The third and fourth columns list the errors from the ground truth after white balance estimation by Grey-World and Grey-Edge, respectively.

Fig. 12. Comparison of the stabilization results generated by PWCA and our method. Top: the original video. Middle: the result of PWCA, which still contains tone jitter. Bottom: the result of our method.

From the discussion above, we find that both PWCA and our method can generate good results for simple videos. However, PWCA sometimes does not work well for videos that include scenes with complex textures or that have sudden camera shake. For these cases, it produces a very small robust correspondence set, and the final output is not absolutely stable. Our method aligns the successive frames and detects the corresponding pixels in a more robust way. Fig. 12 compares the result of PWCA with the anchor set to the central frame against our result with ωs = 0.01. We can see that our result is more stable. Another advantage of our method is the selective stabilization strategy. It allows us to adaptively smooth the path of color states according to the original color states. Users need only tune the parameter ωs to generate results with different degrees of stability and choose the best of them. Even for a video with a large dynamic range, our method is still very convenient to use. Benefitting from the selective strategy, the stabilized result removes the small tone jitter and retains the sharp tone changes. In contrast, to generate a comparable result, PWCA requires that the user choose the anchors carefully. If one or two anchors are set automatically, the result may include artifacts. The second row of Fig. 13 shows the result of PWCA with two anchors set to the first and last frames. It is clear that some frames of the output video are over-exposed.

Fig. 13. Another comparison of PWCA and our method. The right two frames of the result generated by PWCA are over-exposed.

D. Challenging Cases

A video containing noise or dynamic objects is very challenging for affine model estimation because of the outliers introduced by the noise and dynamic objects. Our robust correspondence selection helps to handle these challenging cases by discarding most of the outliers. Figs. 14 and 15 show two examples of these challenging cases and their stabilization results from our method with a uniform ωc (2 × 10^3) and a small ωs (0.01). Because noise and dynamic objects do not affect PWCA, we choose its stabilization results as a baseline; they are shown in the second row of each figure. These two examples demonstrate that our method can perform well for videos containing noise and dynamic objects, and our results are even a little better than those generated by PWCA in terms of exposure and color consistency (refer to the third and fourth columns of Fig. 14 and the second column of Fig. 15).

Fig. 14. A video containing a dynamic object and its stabilization results. Top: the original frames from the video. Middle: the result of PWCA. Some tone jitter can be found in the 3rd and 4th columns. Bottom: the result of our method.

Fig. 15. A noisy video and its stabilization results with PWCA and our method. The exposure and color of the 2nd column is a little different from the other columns in the PWCA result.

E. Computational Overhead

The running time of our system depends on the length of the input video. When we compute the transformation matrix for two neighboring frames, the images are resized to a fixed small size, so the frame size does not significantly affect the running time. For the 540-frame video (1920 × 1080 in size) shown in Fig. 1, it took about 511 seconds to complete the entire stabilization process. Computing the color transformation matrix for each pair of successive frames took approximately 0.88 seconds, and the system spent 7.28 seconds on optimizing the color states. Our system is implemented in C++ without code optimization, running on a Dell Vostro 420 PC with an Intel Core 2 Quad 2.40 GHz CPU and 8 GB of memory.

The running time of our method can be shortened by parallelization. Approximately 90% of the running time is spent on computing the color transformation matrices between neighboring frames, which is easily parallelizable because the frame registration and affine model estimation for each pair of neighboring frames can be carried out independently. On the other hand, when processing a long video, we can first cut the video into several short sub-videos and set the frame shared by two sub-videos as a 'preferred frame'; these sub-videos can then be stabilized in parallel. In future work we plan to accelerate our method using GPUs.

Fig. 16. Failure case. These four frames are extracted from our stabilization result. The stone appears in two different sections of the video, and its colors in our result are not coherent.

F. Discussion

Our method depends on the alignment of consecutive frames, so feature tracking failures will affect our stabilization results. There are several situations that may cause feature tracking failures. a) The neighboring frames are very homogeneous (e.g., wall, sky) and have too few matching feature points. In this situation, we assume these two source frames are aligned. Because the frames are homogeneous, this will not result in large errors from misalignment; our correspondence selection algorithm helps to discard the outliers from noise, edges, etc. Therefore, our method also performs well in this situation. b) Very sharp brightness changes between consecutive frames may cause feature tracking to fail. We can adopt an iteration-based method to solve this problem. The two neighboring frames are denoted as I and J. We first assume the two images are aligned, i.e., H^j_i = I_{3×3}. We estimate the color transfer model A^j_i, which is applied to frame I to make the exposure and white balance of I and J closer. Then, a new homography matrix H^j_i is estimated to align the modified frame I and J. We repeat the estimation of A^j_i and H^j_i; usually two or three iterations are sufficient to generate a good result. Most of the neighboring frames in natural videos will not have brightness changes sharp enough to affect feature tracking. We tested approximately 150 videos and never encountered this problem. c) Most local features are located on non-rigid moving objects. This is the most challenging case, and the vast majority of feature-tracking-based vision algorithms, such as [38], cannot handle it. Because our method needs to track a feature for only two adjacent frames, if the dynamic objects do not move too quickly, serious artifacts will not result. Otherwise, our method will fail.
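For case (b), the alternation can be sketched as below. The three helpers passed in (homography estimation, color transform estimation, and transform application) stand for the components described in Section III; their names and signatures are our own, so this is a reading of the procedure rather than the authors' code.

```python
import numpy as np

def align_with_sharp_tone_change(I, J, estimate_H, estimate_A, apply_A, n_iters=3):
    """Alternating alignment for neighboring frames with a sharp brightness
    change. `estimate_H(I, J)` returns a 3x3 homography, `estimate_A(I, J, H)`
    a 4x4 color transform from matched pixels under alignment H, and
    `apply_A(I, A)` applies the transform (illustrative callables)."""
    H = np.eye(3)                          # start by assuming the frames are aligned
    A = np.eye(4)
    for _ in range(n_iters):               # two or three iterations usually suffice
        A = estimate_A(I, J, H)            # color transform under current alignment
        H = estimate_H(apply_A(I, A), J)   # re-align after equalizing the tone
    return A, H
```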

In addition to feature tracking, our method has another limitation. If the camera moves from one scene to a new scene and then returns to the former, the color of the same surface in different sections of a video stabilized by our method may not be coherent. An example is shown in Fig. 16, in which a large stone appears, is passed by, and then reappears. We can see from the figure that the color of the stone changes slightly in our output frames. This is caused by the error arising during color state computation. When we estimate the color transformation model for two neighboring images, if a surface with a particular color exists in only one frame, then the computed model may be unsuitable for the region of this surface. Because the color state is the accumulation of color transformation matrices, the error of the color transformation between two frames will be propagated to all later images. Another possible reason for this artifact is that our trajectory smoothing method cannot ensure that two similar original color states remain similar after stabilization. We leave these unsolved problems for future work.

VI. CONCLUSION

In this paper, we utilize a 4 × 4 matrix to model the exposure and white balance state of a video frame, which we refer to as the color state. PCA dimensionality reduction is then applied to find a linear subspace of color states, and an L1 optimization-based method is proposed to generate smooth color states in the linear subspace and achieve the goal of video tonal stabilization. Our experimental results and quantitative evaluation show the effectiveness of our stabilization method. Compared to previous work, our method performs better at finding pixel correspondences between two neighboring frames. In addition, we use a new stabilization strategy that retains some tone changes due to sharp illumination and scene content changes, so our method can handle videos with an extreme dynamic range of exposure and color. Our system can remove tone jitter effectively and thus increase the visual quality of amateur videos. It can also be used as a pre-processing tool for other video editing methods.

REFERENCES

[1] Z. Farbman and D. Lischinski, “Tonal stabilization of video,” ACM Trans. Graph., vol. 30, no. 4, pp. 1–89, 2011.

[2] G. Y. Tian, D. Gledhill, D. Taylor, and D. Clarke, “Colour correction for panoramic imaging,” in Proc. 6th Int. Conf. Inf. Visualisat., 2002, pp. 483–488.

[3] S. J. Ha, H. I. Koo, S. H. Lee, N. I. Cho, and S. K. Kim, “Panorama mosaic optimization for mobile camera systems,” IEEE Trans. Consum. Electron., vol. 53, no. 4, pp. 1217–1225, Nov. 2007.

[4] Z. Maojun, X. Jingni, L. Yunhao, and W. Defeng, “Color histogram correction for panoramic images,” in Proc. 7th Int. Conf. Virtual Syst. Multimedia, Oct. 2001, pp. 328–331.

[5] B. Pham and G. Pringle, “Color correction for an image sequence,” IEEE Comput. Graph. Appl., vol. 15, no. 3, pp. 38–42, May 1995.

[6] Y. Xiong and K. Pulli, “Color matching of image sequences with combined gamma and linear corrections,” in Proc. Int. Conf. ACM Multimedia, 2010, pp. 261–270.

[7] S. Mann and R. W. Picard, “On being ‘undigital’ with digital cameras: Extending dynamic range by combining differently exposed pictures,” in Proc. IS&T, 1995, pp. 442–448.

[8] T. Mitsunaga and S. K. Nayar, “Radiometric self calibration,” in Proc. Comput. Vis. Pattern Recognit., vol. 1, Jun. 1999, pp. 374–380.

[9] F. M. Candocia and D. A. Mandarino, “A semiparametric model for accurate camera response function modeling and exposure estimation from comparametric data,” IEEE Trans. Image Process., vol. 14, no. 8, pp. 1138–1150, Aug. 2005.

[10] M. D. Grossberg and S. K. Nayar, “Modeling the space of camera response functions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 10, pp. 1272–1282, Oct. 2004.

[11] S. J. Kim, J.-M. Frahm, and M. Pollefeys, “Joint feature tracking and radiometric calibration from auto-exposure video,” in Proc. IEEE 11th Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8.

[12] S. J. Kim and M. Pollefeys, “Robust radiometric calibration and vignetting correction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 562–576, Apr. 2008.

[13] M. Grundmann, C. McClanahan, S. B. Kang, and I. Essa, “Post-processing approach for radiometric self-calibration of video,” in Proc. IEEE Int. Conf. Comput. Photography, Apr. 2013, pp. 1–9.

[14] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley, “Color transfer between images,” IEEE Comput. Graph. Appl., vol. 21, no. 5, pp. 34–41, Sep./Oct. 2001.

[15] F. Pitié, A. C. Kokaram, and R. Dahyot, “Automated colour grading using colour distribution transfer,” Comput. Vis. Image Und., vol. 107, nos. 1–2, pp. 123–137, Jul. 2007.

[16] T. Oskam, A. Hornung, R. W. Sumner, and M. Gross, “Fast and stable color balancing for images and augmented reality,” in Proc. 2nd Int. Conf. 3D Imag., Modeling, Process., Visualizat. Transmiss., Oct. 2012, pp. 49–56.

[17] X. An and F. Pellacini, “User-controllable color transfer,” Comput. Graph. Forum, vol. 29, no. 2, pp. 263–271, May 2010.

[18] A. Chakrabarti, D. Scharstein, and T. Zickler, “An empirical camera model for internet color vision,” in Proc. Brit. Mach. Vis. Conf., vol. 1, 2009, pp. 1–4.

[19] G. Finlayson and R. Xu, “Illuminant and gamma comprehensive normalisation in log RGB space,” Pattern Recognit. Lett., vol. 24, no. 11, pp. 1679–1690, Jul. 2003.

[20] S. Kagarlitsky, Y. Moses, and Y. Hel-Or, “Piecewise-consistent color mappings of images acquired under various conditions,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 2311–2318.

[21] Y.-W. Tai, J. Jia, and C.-K. Tang, “Local color transfer via probabilistic segmentation by expectation-maximization,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1, Jun. 2005, pp. 747–754.

[22] D. Lischinski, Z. Farbman, M. Uyttendaele, and R. Szeliski, “Interactive local adjustment of tonal values,” ACM Trans. Graph., vol. 25, no. 3, pp. 646–653, Jul. 2006.

[23] Y. Li, E. Adelson, and A. Agarwala, “ScribbleBoost: Adding classification to edge-aware interpolation of local image and video adjustments,” Comput. Graph. Forum, vol. 27, no. 4, pp. 1255–1264, 2008.

[24] Q. Zhu, Z. Song, Y. Xie, and L. Wang, “A novel recursive Bayesian learning-based method for the efficient and accurate segmentation of video with dynamic background,” IEEE Trans. Image Process., vol. 21, no. 9, pp. 3865–3876, Sep. 2012.

[25] S. Das, A. Kale, and N. Vaswani, “Particle filter with a mode tracker for visual tracking across illumination changes,” IEEE Trans. Image Process., vol. 21, no. 4, pp. 2340–2346, Apr. 2012.

[26] M. Grundmann, V. Kwatra, and I. Essa, “Auto-directed video stabilization with robust L1 optimal camera paths,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 225–232.

[27] H. Siddiqui and C. A. Bouman, “Hierarchical color correction for camera cell phone images,” IEEE Trans. Image Process., vol. 17, no. 11, pp. 2138–2155, Nov. 2008.

[28] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[29] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002.

[30] J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1994, pp. 593–600.

[31] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proc. 6th Int. Conf. Comput. Vis., Jan. 1998, pp. 839–846.

[32] S. D. Hordley, “Scene illuminant estimation: Past, present, and future,” Color Res. Appl., vol. 31, no. 4, pp. 303–314, Aug. 2006.

[33] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer-Verlag, 1986, p. 487.

[34] B.-K. Bao, G. Liu, C. Xu, and S. Yan, “Inductive robust principal component analysis,” IEEE Trans. Image Process., vol. 21, no. 8, pp. 3794–3800, Aug. 2012.

[35] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 409–416.

[36] C. Jung and C. Kim, “A unified spectral-domain approach for saliency detection and its application to automatic object segmentation,” IEEE Trans. Image Process., vol. 21, no. 3, pp. 1272–1283, Mar. 2012.

[37] J. van de Weijer, T. Gevers, and A. Gijsenij, “Edge-based color constancy,” IEEE Trans. Image Process., vol. 16, no. 9, pp. 2207–2214, Sep. 2007.

[38] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala, “Subspace video stabilization,” ACM Trans. Graph., vol. 30, no. 1, pp. 1–4, 2011.

Yinting Wang received the B.E. degree in software engineering from Zhejiang University, Hangzhou, China, in 2008, where he is currently pursuing the Ph.D. degree in computer science. His research interests include computer vision and image/video enhancement.



Dacheng Tao (M’07–SM’12) is currently a Professor of Computer Science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, NSW, Australia. He mainly applies statistics and mathematics for data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He has authored over 100 scientific articles at top venues, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE Conference on Neural Information Processing Systems, the International Conference on Machine Learning, the International Conference on Artificial Intelligence and Statistics, the IEEE International Conference on Data Mining series (ICDM), the IEEE Conference on Computer Vision and Pattern Recognition, the International Conference on Computer Vision, the European Conference on Computer Vision, the ACM Transactions on Knowledge Discovery from Data, the ACM Multimedia Conference, and the ACM Conference on Knowledge Discovery and Data Mining, with the Best Theory/Algorithm Paper Runner Up Award in IEEE ICDM’07 and the Best Student Paper Award in IEEE ICDM’13.

Xiang Li received the B.E. degree in computer science from Zhejiang University, Hangzhou, China, in 2013. He is currently pursuing the M.S. degree in information technology with Carnegie Mellon University, Pittsburgh, PA, USA. His research interests include machine learning and computer vision.

Mingli Song (M’06–SM’13) is currently an Associate Professor with the Microsoft Visual Perception Laboratory, Zhejiang University, Hangzhou, China. He received the Ph.D. degree in computer science from Zhejiang University in 2006. He was a recipient of the Microsoft Research Fellowship in 2004. His research interests include face modeling and facial expression analysis.

Jiajun Bu is currently a Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. His research interests include computer vision, computer graphics, and embedded technology.

Ping Tan is currently an Assistant Professor with the School of Computing Science, Simon Fraser University, Burnaby, BC, Canada. He was an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. He received the Ph.D. degree in computer science and engineering from the Hong Kong University of Science and Technology, Hong Kong, in 2007, and the bachelor’s and master’s degrees from Shanghai Jiao Tong University, Shanghai, China, in 2000 and 2003, respectively. He has served as an Editorial Board Member of the International Journal of Computer Vision and the Machine Vision and Applications. He has served on the Program Committees of SIGGRAPH and SIGGRAPH Asia. He was a recipient of the inaugural MIT TR35@Singapore Award in 2012, and the Image and Vision Computing Outstanding Young Researcher Award and the Honorable Mention Award in 2012.