8/9/2019 Hybrid Shift Map for Video Retargeting

    Hybrid Shift Map for Video Retargeting

Yiqun Hu, Deepu Rajan
School of Computer Engineering

    Nanyang Technological University, Singapore 639798

    {yqhu,asdrajan}@ntu.edu.sg

    Abstract

    We propose a new method for video retargeting, which

can generate spatially and temporally consistent video. The new

    measure called spatial-temporal naturality preserves the

    motion in the source video without any motion analysis in

    contrast to other methods that need motion estimation. This

    advantage prevents the retargeted video from degenerating

    due to the propagation of the errors in motion analysis. It

    allows the proposed method to be applied on challenging

    videos with complex camera and object motion. To improve

    the efficiency of the retargeting process, we retarget video

    using a 3D shift map in low resolution and refine it using

    an incremental 2D shift map in higher resolution. This new

    hierarchical framework, denoted as hybrid shift map, can

    produce satisfactory retargeting results while significantly

    improving the computational efficiency.

    1. Introduction

    Media retargeting aims to increase or decrease the size

    of media according to the inherent content and not blindly

    as in scaling and cropping. The important content is pre-

    served during the retargeting process. For images, this may

    involve simply resizing, while for videos, the resizing could

    be in the spatial and/or the temporal domain. With the de-

    velopment of diverse terminal devices e.g. mobile phones,

    large-screen displays etc., this technique is useful for adapt-

    ing multimedia content onto devices with different screen

    resolutions.

    Image Retargeting

    Various retargeting algorithms have been proposed to

    adapt images to different resolutions and aspect ratios. As

    opposed to homogeneous resizing methods [21, 14] that

    crop the most important regions to be included, recent meth-

    ods focus on nonlinear image retargeting according to im-

    age content. For example, seam carving [1] and its vari-

    ants [7, 6] remove horizontal/vertical seams from the image.

    Seams are monotonically connected curves with minimum

    perceptual energy. Adaptive warping methods redistribute

    the image pixels in a single direction [5] or over several

    directions [19] according to their importance. Visually im-

    portant regions are preserved while homogeneous regions

    are merged. Some image editing techniques solve the retar-

    geting problem by redistributing pixels under completeness

and coherence constraints [15, 2]. Other extensions have also
been proposed: [8] introduced an image retargeting framework

    based on Fourier analysis to improve efficiency. Retarget-

    ing results are improved by preserving the image structures

    in [17]. In [13], multiple operators are integrated to obtain

    optimally retargeted images.

    Video Retargeting

    Some video retargeting methods have been proposed by

    adapting image retargeting methods for video content. The

    local temporal consistency between adjacent pixels in a

    spatial-temporal video cube is enforced. For example, in

[20], local temporal consistency was enforced by introduc-
ing a penalty for changes in position of temporally adjacent

    pixels in a least-squares optimization formulation. Seam

    carving operator was improved to retarget video both spa-

    tially [12] and temporally [3] by searching monotonic and

    connected manifolds using graph cut. However, such lo-

    cal consistency in temporal domain is invalid when there

    exists large object/camera motion. Some methods enforce

    global temporal consistency by estimating motion informa-

    tion. For example, cropping-based methods [9, 16, 4] were

    extended to find temporally smooth cropping windows for

    videos. Motion segmentation is used to model the back-

    ground or to extract moving objects. A scale-and-stretch

    operator was extended to retarget video in [18]. Consecu-

    tive video frames are aligned to estimate inter-frame camera

    motion which can then be used to constrain retargeting. Al-

    though these methods can handle complex object or camera

    motion, they require motion estimation which in itself is a

    challenging task.

    1.1. Motivation

    This work is motivated by two major limitations of cur-

    rent video retargeting techniques. First, the effectiveness of


the current methods depends on the performance of motion

    estimation, which is performed prior to the actual retarget-

    ing. The errors in motion estimation will degrade the fi-

    nal result, especially for scenes with complex motion and

    large background clutter. Second, for videos, the compu-

    tational complexity of retargeting methods based on graph

cut is very large. In the multi-resolution framework, the 3D
graph cut is inefficient at a higher resolution although the

    solution at lower resolution can be efficiently solved.

    We propose a new framework for video retargeting with-

    out relying on any motion analysis, which results in an ef-

    ficient algorithm in terms of both computational time and

    memory usage. Within this framework, a new measure is

    introduced to estimate temporal consistency (naturality) for

    video retargeting. This measure does not require motion

    analysis and easily integrates both spatial and temporal do-

    mains into a unified framework. Using this measure in an

    energy function, we propose a multi-resolution framework

for video retargeting. A 3D shift map is applied to find the
initial retargeting solution of a video volume in the lowest

    resolution. Incremental 2D shift map is applied to refine ev-

    ery seam in the individual frame with temporal consistency

    with respect to the retargeting result of the previous frame.

    Compared to the traditional multi-resolution solution, our

    method solves the 3D retargeting problem by solving a se-

    ries of 2D retargeting problems, which is much more com-

    putationally efficient, especially at high resolution.

    The rest of this paper is organized as follows. A new

measure, spatial-temporal naturality, is introduced in Sec. 2,

    which is used to calculate the energy for graph cut. The

    proposed retargeting framework including 3D shift map, in-

    cremental 2D shift map as well as the new multi-resolution

    scheme is described in Sec 3. We evaluate the proposed

    method and analyze different properties on real video se-

quences in Sec. 4. Finally, we present conclusions in Sec. 5.

    2. Spatial-Temporal Naturality

    Most existing retargeting methods resort to minimizing

    a distortion measure in order to retarget a source video. For

    example, seam carving techniques [1, 12] try to minimize

    the distortion due to a new pair of pixels becoming adja-

    cent. Similarly, warping-based methods [20, 18] minimize

    the distortion resulting from the warping operation. How

    to model various forms of distortion is still an open ques-

    tion. In this paper, we assume that every pixel in the retar-

    geted video is sampled from some pixel in the source video.

    When performing video retargeting, the retargeted video is

visually pleasing if as few artifacts as possible are generated

    in both the spatial and the temporal domain. We introduce

    a measure to quantify the strength of artifacts introduced in

    the retargeted video and call it the spatial-temporal natural-

    ity, which is computed on every pair of neighboring pixels.

    As the name implies, both spatial and temporal neighbor-

    ing pixels are considered to ensure the naturality in both the

    domains. Without explicitly modeling the distortion, a re-

    targeted video which looks as natural as the source video

    can be generated by maximizing this measure over all pairs

    of neighboring pixels. This measure is used to define the

    energy function for graph cut. In the rest of this section,

we first generalize the naturality measure in the spatial
domain, which was used in [11], and then extend it to the temporal do-

    main.

    2.1. Spatial Naturality

    In spatial domain, naturality requires that two spatial

    neighboring pixels in the retargeted video are similar to

    some spatial neighboring pair in the source video. Fig-

    ure 1 (a) illustrates this constraint for two adjacent pixels

    marked as black (x) and black (+) in a frame of the re-

    targeted video. Under our assumption, they are sampled

    from two pixels that are not necessarily adjacent in the cor-

responding source frame. Given the mappings of two pixels
from the target to the source video, we can measure their

    naturality by considering each pixel in turn and computing

    the difference between the neighbor of its mapped pixel in

    the source and the mapped pixel of its neighbor. For exam-

    ple, consider the black (x) pixel in the target whose neigh-

    bor is black (+), which corresponds to the black (+) in the

    source. Also consider the black (x) pixel in the target which

    maps to the black (x) pixel in the source having neighbor

    as red (+). If the black (+) and red (+) pixels in the source

    are similar, then the neighboring pixels black (x) and black

    (+) in the target are considered natural with respect to the

    source. Otherwise, they will introduce artifacts that do not

    appear in the source frame. Similarly, if the red(x) and black

    (x) pixels in the source are similar, the black (+) and black

    (x) in the target are also considered natural. This formula-

tion is the same as the pairwise smoothness of [11], in which
the target pixel R(u, v) is derived from the source pixel
S(u + t_x, v + t_y) through the shift map M(u, v) = (t_x, t_y).

We extend the shift map along the temporal domain and
denote it by M_t(p), indicating the value of the shift map
at frame t and location p = (x, y) in the target domain.
Note that the mapping is from the target to the source. We
maximize the spatial naturality of the retargeted video by
minimizing

\[
\sum_{(p,t)\in R} \sum_{i=1}^{4} D\big( S(p + M_t(p) + e_i,\, t),\; S(p + e_i + M_t(p + e_i),\, t) \big) \tag{1}
\]

    where R denotes the collection of all pixels in the retargeted

video, e_i are the four unit vectors representing the four
spatial neighbors, D(·, ·) is the distance function measuring
the similarity between two pixels, and S refers to source pixels.

    The distance function operates on the source between (i) a

    pixel that is a spatial neighbor of a mapped pixel and (ii) the

    mapped pixel of the spatial neighbor of a target pixel.
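To make equation (1) concrete, the spatial term can be evaluated directly for one frame and its shift map. The sketch below is our own illustration under assumed conventions (NumPy arrays, 0-based coordinates, shifts stored as (ty, tx), and the grayscale absolute difference of Section 4 as D); it is not the authors' implementation.

```python
import numpy as np

def spatial_naturality_cost(S, M, D=lambda a, b: abs(float(a) - float(b))):
    """Spatial term of equation (1) for a single frame (illustrative sketch).

    S : 2D grayscale source frame, shape (H_s, W_s).
    M : integer shift map of the target frame, shape (H_r, W_r, 2);
        M[y, x] = (ty, tx) means target (y, x) samples source (y+ty, x+tx).
    D : pixel distance function.
    """
    Hr, Wr = M.shape[:2]
    Hs, Ws = S.shape
    neighbors = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # the four unit vectors e_i
    cost = 0.0
    for y in range(Hr):
        for x in range(Wr):
            for dy, dx in neighbors:
                ny, nx = y + dy, x + dx
                if not (0 <= ny < Hr and 0 <= nx < Wr):
                    continue
                # S(p + M_t(p) + e_i): neighbor of the mapped pixel
                ay, ax = y + M[y, x, 0] + dy, x + M[y, x, 1] + dx
                # S(p + e_i + M_t(p + e_i)): mapped pixel of the neighbor
                by, bx = ny + M[ny, nx, 0], nx + M[ny, nx, 1]
                if 0 <= ay < Hs and 0 <= ax < Ws and 0 <= by < Hs and 0 <= bx < Ws:
                    cost += D(S[ay, ax], S[by, bx])
    return cost
```

For an identity shift map the cost is zero, since every neighbor pair in the target maps back to the same neighbor pair in the source.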

  • 8/9/2019 Hybrid Shift Map for Video Retargeting

    3/8

Figure 1. Illustration of spatial-temporal naturality. (a) Spatial
naturality within the same frame; (b) temporal naturality between
neighboring frames.

    2.2. Temporal Naturality

    When considering the temporal information, naturality

    requires that two temporal neighbors in the retargeted video

    are similar to some temporal neighbors in the source video.

    Figure 1 (b) is an illustration of this constraint on two tem-

    porally adjacent pixels in the retargeted video. The pixel

black (+) in the (t−1)th frame of the retargeted video
is mapped to black (+) in the source video. The temporal

    neighbor of this pixel is the red (x) in the tth frame of the

    source video. If this pixel is similar to black (x), which is

    the mapping of the temporal neighbor of black (+) in the tth

    frame of the retargeted video, then the black (+) and black

    (x) in the retargeted video are considered temporally natu-

    ral. Similar analysis can be applied on the pixel black (x) in

tth frame of the target. If the black (+) in the (t−1)th frame
of the source is similar to the red (+) in the same frame, two

    temporal neighbors (marked as black (+) and (x)) in the re-

    targeted video are temporally natural as in the source video.

We maximize the temporal naturality of the retargeted
video by minimizing

\[
\sum_{(p,t)\in R} \sum_{\Delta t \in \{-1,+1\}} D\big( S(p + M_t(p),\, t + \Delta t),\; S(p + M_{t+\Delta t}(p),\, t + \Delta t) \big) \tag{2}
\]

where the definitions of R, S, M, and D are the same as in

    equation (1). The distance function operates on the source

    between (i) a pixel that is a temporal neighbor of a mapped

    pixel and (ii) the mapped pixel of the temporal neighbor of

    a target pixel. Any suitable distance measure can be used.
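Equation (2) admits the same direct reading as the spatial term. The sketch below mirrors the spatial version, again under assumed array conventions rather than the authors' code.

```python
import numpy as np

def temporal_naturality_cost(S, M, D=lambda a, b: abs(float(a) - float(b))):
    """Temporal term of equation (2) (illustrative sketch).

    S : grayscale source video, shape (T, H_s, W_s).
    M : shift maps of the target video, shape (T, H_r, W_r, 2);
        M[t, y, x] = (ty, tx).
    """
    T, Hr, Wr = M.shape[:3]
    _, Hs, Ws = S.shape
    cost = 0.0
    for t in range(T):
        for y in range(Hr):
            for x in range(Wr):
                for dt in (-1, 1):
                    if not 0 <= t + dt < T:
                        continue
                    # S(p + M_t(p), t + dt): temporal neighbor of the mapped pixel
                    ay, ax = y + M[t, y, x, 0], x + M[t, y, x, 1]
                    # S(p + M_{t+dt}(p), t + dt): mapped pixel of the temporal neighbor
                    by, bx = y + M[t + dt, y, x, 0], x + M[t + dt, y, x, 1]
                    if 0 <= ay < Hs and 0 <= ax < Ws and 0 <= by < Hs and 0 <= bx < Ws:
                        cost += D(S[t + dt, ay, ax], S[t + dt, by, bx])
    return cost
```

When consecutive frames carry identical shifts, the two sampled positions coincide and the temporal cost vanishes, which matches the intuition that an unchanging shift map introduces no temporal artifacts.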

    3. Hybrid Shift Map

    The retargeted video that preserves the spatial-temporal

    naturality of the source video is modeled as graph(s) where

    the nodes represent the pixels of the retargeted video. Retar-

    geting is achieved by finding the optimal mapping between

    the source and the retargeted video. Specifically, we encode

    the spatial-temporal naturality as well as other constraints

    into the following form, which can be minimized by graph

    cut algorithm, similar to [11]:

\[
E(M) = \sum_{(p,t)\in R} E_d(M_t(p)) + \sum_{((p_i,t_i),(p_j,t_j))\in N} E_s\big(M_{t_i}(p_i),\, M_{t_j}(p_j)\big) \tag{3}
\]

where E_d is the data term encoding the unary energy and E_s
is the smoothness term encoding the pairwise energy. In this

    section, we first develop a 3D shift map to retarget video

    with spatial-temporal naturality. To improve the computa-

    tional efficiency, an incremental 2D shift map is then in-

    troduced to retarget video. Compared with 3D shift map,

    this method can only achieve a local optimum. However,

    its computational complexity is much lower while still pre-

    serving spatial-temporal naturality. Finally, a novel solution

    for video retargeting is provided by combining these two

methods in a multi-resolution hierarchy, which is called the
Hybrid Shift Map.

    3.1. 3D Shift Map

    We model the retargeted video as a 3D grid graph where

every node is connected to its 4 spatial and 2 temporal
neighbors. There are two types of constraints related to

    video retargeting: pixel preservation during resizing and

    spatial-temporal naturality for artifact reduction. To find

    the optimal 3D shift map using graph cut, we encode the

  • 8/9/2019 Hybrid Shift Map for Video Retargeting

    4/8

    pixel preservation in the data term and the spatial-temporal

    naturality in the smoothness term of equation (3).

    Energy Function

    In video retargeting, it is required that some pixels need to

    be preserved. For example, in changing the width of the

video, the leftmost and rightmost columns of every frame
should be preserved in the target video. We use the data term
to encode such pixel preservation by assigning

\[
E_d(M_t(p)) =
\begin{cases}
\infty & \text{if } (x = 0) \wedge (t_x \neq 0)\\
\infty & \text{if } (x = W_R) \wedge (t_x \neq W_S - W_R)\\
0 & \text{otherwise}
\end{cases} \tag{4}
\]

where W_S and W_R are the widths of the video frames in the
source and the retargeted videos, respectively.
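In code, the hard constraint of equation (4) can be expressed per pixel. The sketch below assumes 0-based column indexing (so the rightmost target column is x = W_R − 1), which differs superficially from the paper's notation; it is an illustration, not the authors' implementation.

```python
import math

def data_term(x, tx, W_S, W_R):
    """Pixel-preservation data term of equation (4), sketched for width
    reduction with 0-based columns (an indexing assumption): the leftmost
    target column must come from the leftmost source column, and the
    rightmost target column from the rightmost source column."""
    if x == 0 and tx != 0:
        return math.inf            # leftmost column must not shift
    if x == W_R - 1 and tx != W_S - W_R:
        return math.inf            # rightmost column pinned to the source edge
    return 0.0
```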

The spatial and temporal naturalities described in sec-
tion 2 can be unified into a single pairwise measure. Con-
sider a pair of neighboring pixels (p_i, t_i) and (p_j, t_j), which
can be either spatial neighbors (t_i = t_j) or temporal neigh-
bors (p_i = p_j). For (p_j, t_j), the mapping of its neighbor
(p_i, t_i) onto the source video is given by

\[
S_i = S\big(p_j + \Delta p_{ji} + M_{t_j+\Delta t_{ji}}(p_j + \Delta p_{ji}),\; t_j + \Delta t_{ji}\big),
\]

where \Delta p_{ji} = (x_i - x_j, y_i - y_j) and \Delta t_{ji} = t_i - t_j.
The neighbor of the mapping of (p_j, t_j) that corresponds to
(p_i, t_i) is given by

\[
S_i' = S\big(p_j + M_{t_j}(p_j) + \Delta p_{ji},\; t_j + \Delta t_{ji}\big).
\]

Similarly,

\[
S_j = S\big(p_i + \Delta p_{ij} + M_{t_i+\Delta t_{ij}}(p_i + \Delta p_{ij}),\; t_i + \Delta t_{ij}\big),
\]
\[
S_j' = S\big(p_i + M_{t_i}(p_i) + \Delta p_{ij},\; t_i + \Delta t_{ij}\big),
\]

where \Delta p_{ij} = (x_j - x_i, y_j - y_i) and \Delta t_{ij} = t_j - t_i.
Therefore, the smoothness term in equation (3) can measure
spatial-temporal naturality as

\[
E_s\big(M_{t_i}(p_i),\, M_{t_j}(p_j)\big) = \min\big(D(S_i, S_i'),\, D(S_j, S_j')\big).
\]

If either D(S_i, S_i') = 0 or D(S_j, S_j') = 0, this pair of
neighboring pixels in the retargeted video is perfectly natu-
ral with respect to the source video.
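The unified smoothness term can be written down directly from the definitions of S_i, S_i', S_j, and S_j'. The following sketch assumes NumPy arrays with shifts stored as (ty, tx) and omits boundary checks for brevity; it is illustrative only, not the authors' code.

```python
import numpy as np

def pairwise_naturality(S, M, pi, ti, pj, tj,
                        D=lambda a, b: abs(float(a) - float(b))):
    """Unified smoothness term Es of Section 3.1 (illustrative sketch).

    S : video, shape (T, H, W); M : shift maps, shape (T, H_r, W_r, 2).
    (pi, ti), (pj, tj) : neighboring target pixels, spatial (ti == tj)
    or temporal (pi == pj); pi, pj are (y, x) tuples.
    """
    (yi, xi), (yj, xj) = pi, pj
    dpy, dpx, dt = yi - yj, xi - xj, ti - tj           # Δp_ji, Δt_ji
    # S_i: mapping of the neighbor (p_i, t_i) onto the source
    Si = S[tj + dt, yj + dpy + M[ti, yi, xi, 0], xj + dpx + M[ti, yi, xi, 1]]
    # S_i': neighbor of the mapping of (p_j, t_j)
    Si_p = S[tj + dt, yj + M[tj, yj, xj, 0] + dpy, xj + M[tj, yj, xj, 1] + dpx]
    # symmetric pair, seen from (p_i, t_i)
    Sj = S[ti - dt, yi - dpy + M[tj, yj, xj, 0], xi - dpx + M[tj, yj, xj, 1]]
    Sj_p = S[ti - dt, yi + M[ti, yi, xi, 0] - dpy, xi + M[ti, yi, xi, 1] - dpx]
    return min(D(Si, Si_p), D(Sj, Sj_p))
```

Under the identity shift map the two compared samples coincide on both sides, so the term is zero, matching the "perfectly natural" case above.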

    If we wish to eliminate one row/column from the video,

    the binary 3D graph cut guarantees a globally optimal so-

    lution. However, the computational complexity of this

    method is very high due to the large size of the 3D graph re-

    sulting in high computational time and memory usage. This

    precludes the application of this method on large video vol-

    umes and hence, a more efficient solution is required.

    3.2. Incremental 2D Shift Map

    Instead of considering the video as a 3D volume, it can

    be viewed as a collection of frames, each of which is rep-

    resented as a 2D grid graph. However, if we simply apply

    the 2D shift map [11] to retarget individual frames indepen-

dently, the retargeted video will not be temporally smooth

    and will contain jitters. The temporal information in the

    source video must be utilized in order to ensure temporal

    consistency. We propose an incremental solution to retarget

    video using temporal information so that the result is tempo-

rally consistent (smooth). In this scheme, the first frame of
the sequence is retargeted using the 2D shift map [11], and
the tth frame is processed based on the retargeted (t−1)th

    frame. The temporal consistency is improved by maximiz-

    ing the temporal naturality between the current retargeted

    frame and the retargeted result of previous frame.

Given the shift map of the (t−1)th frame, the tth frame
is retargeted by finding a minimum cut in an augmented 2D

    grid graph. Each node of this graph is not only associated

    with the coordinate shift in the current frame, but also as-

    sociated with the corresponding shift in the previous frame.

The shift map of the previous frame is utilized to constrain the
retargeting of the current frame. Specifically, we extend the
energy function to consider the temporal naturality.

    Energy Function

    In the data term of the energy function, in addition to the

pixel preservation term of equation (4), we encode the
temporal naturality with respect to the shift map of the
previous frame. The data term containing the measure of
temporal naturality is

\[
E_d(M_t(p)) = \min\big(D_{t-1}(p),\, D_t(p)\big), \tag{5}
\]

where

\[
D_{t-1}(p) = D\big(S(p + M_t(p),\, t-1),\; S(p + M_{t-1}(p),\, t-1)\big),
\]
\[
D_t(p) = D\big(S(p + M_t(p),\, t),\; S(p + M_{t-1}(p),\, t)\big). \tag{6}
\]

    Minimization of the data terms that comprise equations (5)

    and (4) results in the optimal shift map for the tth frame,

which is temporally smooth (natural) with respect to the re-
targeted (t−1)th frame.
The spatial naturality between pixel p and its neighbors
in the tth frame is encoded as

\[
E_s\big(M_t(p),\, M_t(p+e_i)\big) = D\big(S(p + M_t(p) + e_i,\, t),\; S(p + e_i + M_t(p+e_i),\, t)\big), \tag{7}
\]

where e_i denotes the four unit vectors representing the
four spatial neighbors, similar to [11].
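Equations (5) and (6) compare the current and previous shift maps at the same target location, in both frame t−1 and frame t, and keep the smaller discrepancy. A sketch under the same assumed array conventions as before (not the authors' code):

```python
import numpy as np

def temporal_data_term(S, M_curr, M_prev, y, x, t,
                       D=lambda a, b: abs(float(a) - float(b))):
    """Data term of equations (5)-(6) at target pixel (y, x) of frame t,
    given the shift map M_prev of the retargeted frame t-1 (illustrative
    sketch; boundary checks omitted for brevity).

    S : grayscale video, shape (T, H, W).
    M_curr, M_prev : shift maps of frames t and t-1, shape (H_r, W_r, 2).
    """
    ay, ax = y + M_curr[y, x, 0], x + M_curr[y, x, 1]   # p + M_t(p)
    by, bx = y + M_prev[y, x, 0], x + M_prev[y, x, 1]   # p + M_{t-1}(p)
    D_prev = D(S[t - 1, ay, ax], S[t - 1, by, bx])      # compared in frame t-1
    D_curr = D(S[t, ay, ax], S[t, by, bx])              # compared in frame t
    return min(D_prev, D_curr)
```

If the current frame reuses the previous frame's shift at a pixel, both comparisons degenerate to identical samples and the term is zero, which is why minimizing it pulls consecutive shift maps toward temporal agreement.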

While the 3D shift map achieves a globally optimal so-

    lution, the 2D shift map can only obtain the local optimum

    which is dependent on the initial guess. If the initial solution

    in the first frame is far away from the global optimum, the

retargeting results of other frames will also be far from

    the global optimum. However, the incremental 2D approach

    is much more efficient in terms of computational time and


Figure 2. Comparison of the bands of hybrid shift map and video
seam carving [12] on the same grid graph. The shaded area in (a)
is the band for hybrid shift map and in (b) is the band for seam
carving.

    memory usage, as shown in the experimental results. Since

    the original 3D problem is simplified into a series of 2D

problems, the number of nodes in each graph is much smaller

    than that in the 3D graph.

    3.3. Hybrid Shift Map as a Hierarchical Solution

    An efficient solution for video retargeting is provided

    by combining 3D shift map and 2D shift map in a multi-

    resolution framework. We build a Gaussian pyramid of the

    source video. In the lowest resolution, an initial retarget-

    ing result for every frame is estimated using 3D shift map.

    The global optimum property of 3D shift map guarantees

    a good initial solution, which is used to constrain the fi-

nal retargeting result at higher resolutions to remain close
to this initial guess. At the higher resolutions, we itera-

    tively apply incremental 2D shift map to obtain retargeted

video: the initial solution in the lower resolution is first inter-
polated to the higher resolution. Starting from the first frame,

    the shift map is optimized to refine the interpolated solution

    on banded 2D graph incrementally, where the band covers

    the initial solution as shown in Figure 2. For comparison

    between hybrid shift map and video seam carving [12], we

    use a simple multi-resolution banding method on standard

    grid graph. Compared to more advanced multi-resolution

    banding methods e.g. [10], the banded graph in Figure 2 is

    still a grid graph although the band is not minimum.

    This multi-resolution framework, denoted as Hybrid

    Shift Map, improves the computational complexity for re-

    targeting in two aspects. First, the 3D shift map is more

    efficient than 3D seam carving because the proposed graph

is much simpler than the seam graph with forward energy. Ev-
ery node in the graph used in the 3D shift map has only 6 edges
while a node in the seam graph has 14 edges. Second,
the 3D graph cut at higher resolution is divided into a se-

    ries of 2D graph cuts. Its computational time increases only

    linearly with the length of video sequence. Using the same

    banding method, the incremental 2D shift map has a nar-

    rower band in every frame since it bands the individual seam

    in the frame. On the other hand, video seam carving needs

    to band the whole 2D manifold interpolated from the lowest

    resolution. When the manifold involves a large variance in

    the temporal domain, the rectangular band in the 3D graph

    is larger than that in the hybrid shift map. Figure 2 shows

    this difference between hybrid shift map and video seam

    carving using the same banding method. Note that more

advanced banded multi-resolution graph cut technique [10]
can also be applied to the hybrid shift map to further reduce

    complexity.
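The hierarchy in this section can be summarized as pseudocode. All helper names (build_pyramid, solve_3d_shift_map, upsample_map, refine_2d_incremental) are hypothetical placeholders for the steps described above, not the authors' API.

```python
# Pseudocode sketch of the hybrid shift map pipeline (Section 3.3).
# Each hypothetical helper corresponds to a step in the text.
def hybrid_shift_map(video, levels=3):
    pyramid = build_pyramid(video, levels)        # Gaussian pyramid
    # Global 3D graph cut at the coarsest level gives the initial solution.
    M = solve_3d_shift_map(pyramid[levels - 1])
    for level in range(levels - 2, -1, -1):
        M = upsample_map(M)                       # interpolate to the finer level
        # Refine frame by frame on a banded 2D graph around the
        # interpolated solution, each frame constrained by the
        # retargeted previous frame (incremental 2D shift map).
        M = refine_2d_incremental(pyramid[level], M)
    return M                                      # one removed seam per call
```

One such call removes a single seam; repeating it reaches the target width, which is why the measured times in Section 4 are reported per pixel of width reduction.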

    4. Experimental Evaluation

    In this section, we evaluate different properties of the

    proposed method on real video sequences and compare

    them with other retargeting methods. Some test video se-

    quences are the same as used in [12] and some are web

videos downloaded from YouTube, which contain large cam-

    era/object motion in a complex scene. For all experiments,

we use the following settings: a 3-layer Gaussian pyramid is
built, the 3D shift map is estimated at the lowest resolution,
and the individual 2D shift maps are refined incrementally at
the original resolution. Both parameters are set to 1. The
distance function is D((p_1, t_1), (p_2, t_2)) = |S(p_1, t_1) − S(p_2, t_2)|,
which is simply the grayscale difference of two pixels.

    4.1. Temporal Consistency

    Compared with 2D shift map [11], our method retargets

    video by considering temporal consistency between source

    and target videos. Without motion analysis, our method can

    remove the flickering/waving artifacts and generate tempo-rally smooth retargeted video. We compare our method with

the naive solution of applying [11] independently on individual

    frames. Figure 3 (a) shows 4 consecutive frames of a bas-

    ketball sequence. The corresponding retargeted frames of

    the proposed hybrid shift map and the naive solution are

    shown in Figure 3 (b) and (c), respectively. The red curves

    in every frame are the next optimal seams to be removed.

    From the figure, we can see that the seams obtained by

    hybrid shift map are temporally smooth while those ob-

    tained by naive solution are not. Consequently, hybrid shift

    map generates retargeted video which is temporally consis-

tent with the source video. The naive solution using [11] can-

    not achieve this temporal consistency. Our method enforces

not the temporal smoothness of [12], but temporal natural-

    ity, which is adaptive to the video content. When the con-

    tent is homogeneous, even the non-smooth seams can pre-

    serve temporal naturality and generate temporally consis-

    tent video. Compared to other retargeting algorithms e.g.

    [18] which can preserve temporal consistency, our method

    does not rely on any motion analysis and is robust enough

    to be applied on challenging videos where motion analysis

    may be erroneous.


    Figure 3. (a) Four consecutive frames of basketball sequence. Retargeted by (b) hybrid shift map and (c) the naive solution of applying 2D

    shift map [11] on every frame independently.

Figure 4. Comparison of hybrid shift map and incremental 2D shift
map. (a) Two frames (1st and 98th) of a cycling sequence;
corresponding retargeted frames using (b) hybrid shift map and (c)
incremental 2D shift map.

    4.2. Global vs Local vs Hybrid

    Our retargeting framework combines two components:

    3D shift map at the lowest resolution and incremental 2D

    shift map at the original resolution. We analyze the retar-

    geting results as well as the computational complexities of

    our framework and individual components. Figure 4 shows

    the retargeted frames obtained from hybrid shift map and in-

    cremental 2D shift map for two frames of a video sequence.

    We can see that the retargeted frames output from incremen-

    tal 2D shift map (Figure 4 (c)) introduce large distortions

    within the red box. This is because without the initializa-

    tion of 3D shift map, the incremental method does not con-

    sider all frames and can only achieve local optimum. The

Table 1. Computational time for reducing the width by 1 pixel for a
320×240 video of 110 frames.

         3D shift map   2D shift map   Hybrid shift map
Time     178s           20s            (9+8)s

    hybrid shift map instead uses 3D shift map in the lowest

    resolution to constrain every 2D shift map to be close to the

    global optimum. Hence, the retargeted frames from the hy-

    brid shift map (shown in Figure 4 (b)) are more natural than

    incremental 2D shift map alone.

    In terms of computational complexity, we compare the

computational time of reducing the width by 1 pixel for the
whole video using different components. For a 320×240
video sequence of 110 frames, the computational times of
the different components are summarized in Table 1. The 3D

    shift map and incremental 2D shift map are applied on the

    original resolution. For hybrid shift map, 3D shift map

    is applied on the third layer of Gaussian pyramid and in-

    cremental 2D shift map is applied on the original resolu-

    tion. We can see the proposed hybrid shift map significantly

    improves the efficiency while preserving the effectiveness.

The 3D shift map component (9s) is faster since it is ap-

    plied at the lowest resolution and the incremental 2D shift

    map component (8s) in the original resolution improves due

    to the smaller graph for every frame.

    4.3. Hybrid Shift Map vs Video Seam Carving

    In this section, we compare our proposed method with

    [12], which improves seam carving for video retargeting.

    Both methods model the video as graph(s) and use graph

    cut techniques to iteratively retarget video content. When

reducing the width by 1 pixel, each binary shift map actually
corresponds to a removed seam. However, the proposed

    method differs from [12] in several aspects. First, graph

    constructions are different. [12] models the source video as

    a graph while our method represents the retargeted video as


Figure 5. Example of a disconnected seam (marked in red).

Figure 6. Comparison of hybrid shift map and seam carving [12].
(a) 3 consecutive frames of a video; (b) retargeted frames of hybrid
shift map; (c) retargeted frames of seam carving. The sequences
can be seen in the supplementary material.

Table 2. Computational complexity of hybrid shift map and seam
carving.

          Hybrid shift map   Seam carving
Time      (7+8)s             (22+14+54)s
Memory    572 MB             3550 MB

    a graph. By modeling the retargeted video, forward energy

    in [12] can be easily derived in our method. Since only the

binary-labeling problem is guaranteed to achieve the global op-
timum, we recursively reduce the width by 1 pixel by solving
a binary-labeling problem. Second, the properties of the

    seam which is removed from every frame are different. In

    our method, the removed seam is only required to be mono-

    tonic and need not be connected. A vertical/horizontal seam

is monotonic if there is only 1 pixel belonging to it in every
row/column, respectively. As shown in Figure 5, the con-

    nectivity of a seam is automatically controlled by the con-

    tent: it can be disconnected in the homogeneous area and

    will be connected in the boundary area. Hence, our method

    is more flexible than [12]. Figure 6 compares hybrid shift

    map and seam carving on a video sequence. We can see

    hybrid shift map produces a retargeted video with less dis-

    tortion than seam carving.
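The monotonicity property above is easy to state in code: exactly one removed pixel per row, with no connectivity requirement between adjacent rows. A small illustrative helper (our own, not from the paper):

```python
def is_monotonic_vertical_seam(seam_mask):
    """Check the monotonicity property of Section 4.3: a vertical seam is
    monotonic if it removes exactly one pixel per row. Connectivity
    between adjacent rows is deliberately NOT required, so a seam may
    jump columns inside homogeneous regions. seam_mask is a row-major
    2D 0/1 array marking removed pixels (a hypothetical helper)."""
    return all(sum(row) == 1 for row in seam_mask)
```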

    Table 2 summarizes both the time and memory usage

when removing 1 pixel in width from a 480×272 video of

    Figure 7. Enlarging using hybrid shift map. The sequences can be

    seen in the supplementary material.

86 frames. For the computational time of hybrid shift map,
the two numbers are the time spent on initialization and re-

    finement, respectively. For video seam carving, the three

numbers indicate the time spent at the 3 different resolutions
of the Gaussian pyramid. As analyzed in Section 3.3, hybrid

    shift map reduces the time spent on both initialization and

    refinement. We can also see that hybrid shift map signifi-cantly reduces the memory usage compared to seam carv-

    ing method from Table 2. On the higher resolution, hybrid

    shift map only needs to maintain a 2D graph corresponding

to a narrow band of a single frame in memory while the seam

    carving method has to maintain the whole 3D graph. Note

    that the same multi-resolution banding method is applied on

    both methods for fair comparison. The complexity of video

    seam carving is higher than that reported in [12] because

    of the banding method. When applying the more advanced

    banded method [10], the hybrid shift map can also further

    reduce the complexity and is still more efficient than seam

    carving.

The hybrid shift map can also change the height of a
video in a similar way. As illustrated in Figure 7, it can also

    be used for increasing the width of a video sequence. Figure

8 shows more retargeting results on 5 video sequences. We
can see that the retargeted videos from the proposed hybrid

    shift map are visually more natural than those from other

methods. Compared to simple scaling and the warping-based
method, the seam carving method generates retargeted videos

    with less distortion. The proposed hybrid shift map outputs

    even better results. For example, the head and the leg of

    the player in two basketball sequences have less distortion

    than seam carving results. For the fourth sequence, the left

    person is less distorted than that in other methods.

    Figure 8. More retargeting results on 5 video sequences. The first column shows a sample frame in the source video. The second to fifth columns are the corresponding retargeted frames using the proposed hybrid shift map, seam carving [12], a recent warping-based retargeting method [8], and simple down-scaling, respectively. The sequences can be seen in the supplementary material.

    5. Conclusion

    In this paper, we introduce a new method, denoted hybrid shift map, for video retargeting. Without applying any motion analysis, this method retargets video by maximizing the spatial-temporal naturality between the source and target videos. A novel multi-resolution framework is proposed to break the computational bottleneck of video retargeting: a 3D shift map computes the initial solution at the lowest resolution, and an incremental 2D shift map refines that solution at the original resolution. Compared with related retargeting methods, the proposed hybrid shift map significantly improves efficiency in terms of both computational time and memory usage while still retargeting video with spatial-temporal naturality.

    Acknowledgement

    This research was supported by the Media Development

    Authority (MDA) under grant NRF2008IDM-IDM004-032.

    References

    [1] S. Avidan and A. Shamir. Seam Carving for Content-Aware Image

    Resizing. SIGGRAPH, 26(3), July 2007.

    [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. SIGGRAPH, 28(3), August 2009.

    [3] B. Chen and P. Sen. Video Carving. In Eurographics, April 2008.

    [4] T. Deselaers, P. Dreuw, and H. Ney. Pan, Zoom, Scan - Time-

    coherent, Trained Automatic Video Cropping. In CVPR, June 2008.

    [5] R. Gal, O. Sorkine, and D. Cohen-Or. Feature-aware Texturing. In

    Eurographics Symposium on Rendering, June 2006.

    [6] J.-W. Han, K.-S. Choi, T.-S. Wang, S.-H. Cheon, and S.-J. Ko.

    Wavelet Based Seam Carving For Content-Aware Image Resizing.

    In ICIP, November 2009.

    [7] H. Huang, T. Fu, P. L. Rosin, and C. Qi. Real-Time Content-Aware

    Image Resizing. Science in China Series F: Information Science,

    52(2), February 2009.

    [8] J.-S. Kim, J.-H. Kim, and C.-S. Kim. Adaptive Image and Video Retargeting Based on Fourier Analysis. In CVPR, June 2009.

    [9] F. Liu and M. Gleicher. Video Retargeting: Automating Pan-and-

    Scan. In ACM Multimedia, October 2006.

    [10] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A Multilevel Banded Graph Cuts Method for Fast Image Segmentation. In ICCV, October 2005.

    [11] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-Map Image Editing. In ICCV, September 2009.

    [12] M. Rubinstein, A. Shamir, and S. Avidan. Improved Seam Carving

    for Video Retargeting. SIGGRAPH, 27(3), December 2008.

    [13] M. Rubinstein, A. Shamir, and S. Avidan. Multi-operator Media Re-

    targeting. SIGGRAPH, 28(3), August 2009.

    [14] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Co-

    hen. Gaze-based Interaction for Semi-automatic Photo Cropping.

    In SIGCHI, April 2006.

    [15] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing

    Visual Data Using Bidirectional Similarity. In CVPR, June 2008.

    [16] C. Tao, J. Jia, and H. Sun. Active Window Oriented Dynamic Video

    Retargeting. In ICCV Workshop on Dynamic Vision, October 2007.

    [17] S.-F. Wang and S.-H. Lai. Fast Structure-Preserving Image Retargeting. In ICASSP, April 2009.

    [18] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel. Motion-Aware Temporal Coherence for Video Resizing. SIGGRAPH ASIA, 28(5), December 2009.

    [19] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee. Optimized Scale-

    and-Stretch for Image Resizing. SIGGRAPH ASIA, 27(5), December

    2008.

    [20] L. Wolf, M. Guttmann, and D. Cohen-Or. Non-homogeneous

    Content-driven Video-Retargeting. In ICCV, October 2007.

    [21] X. Xie, H. Liu, W.-Y. Ma, and H.-J. Zhang. Browsing Large Pictures under Limited Display Sizes. IEEE Transactions on Multimedia, 8(4), August 2006.