8/9/2019 Hybrid Shift Map for Video Retargeting

    Hybrid Shift Map for Video Retargeting

Yiqun Hu, Deepu Rajan
School of Computer Engineering

    Nanyang Technological University, Singapore 639798

    {yqhu,asdrajan}@ntu.edu.sg

    Abstract

    We propose a new method for video retargeting, which

can generate spatially and temporally consistent video. The new

    measure called spatial-temporal naturality preserves the

    motion in the source video without any motion analysis in

    contrast to other methods that need motion estimation. This

    advantage prevents the retargeted video from degenerating

    due to the propagation of the errors in motion analysis. It

    allows the proposed method to be applied on challenging

    videos with complex camera and object motion. To improve

    the efficiency of the retargeting process, we retarget video

    using a 3D shift map in low resolution and refine it using

    an incremental 2D shift map in higher resolution. This new

    hierarchical framework, denoted as hybrid shift map, can

    produce satisfactory retargeting results while significantly

    improving the computational efficiency.

    1. Introduction

    Media retargeting aims to increase or decrease the size

    of media according to the inherent content and not blindly

    as in scaling and cropping. The important content is pre-

    served during the retargeting process. For images, this may

    involve simply resizing, while for videos, the resizing could

    be in the spatial and/or the temporal domain. With the de-

    velopment of diverse terminal devices e.g. mobile phones,

    large-screen displays etc., this technique is useful for adapt-

    ing multimedia content onto devices with different screen

    resolutions.

    Image Retargeting

    Various retargeting algorithms have been proposed to

    adapt images to different resolutions and aspect ratios. As

    opposed to homogeneous resizing methods [21, 14] that

    crop the most important regions to be included, recent meth-

    ods focus on nonlinear image retargeting according to im-

    age content. For example, seam carving [1] and its vari-

    ants [7, 6] remove horizontal/vertical seams from the image.

    Seams are monotonically connected curves with minimum

    perceptual energy. Adaptive warping methods redistribute

    the image pixels in a single direction [5] or over several

    directions [19] according to their importance. Visually im-

    portant regions are preserved while homogeneous regions

    are merged. Some image editing techniques solve the retar-

    geting problem by redistributing pixels under completeness

and coherence constraints [15, 2]. Other extensions have also
been proposed: [8] introduced an image retargeting framework

    based on Fourier analysis to improve efficiency. Retarget-

    ing results are improved by preserving the image structures

    in [17]. In [13], multiple operators are integrated to obtain

    optimally retargeted images.

    Video Retargeting

    Some video retargeting methods have been proposed by

    adapting image retargeting methods for video content. The

    local temporal consistency between adjacent pixels in a

    spatial-temporal video cube is enforced. For example, in

[20], local temporal consistency was enforced by introduc-
ing a penalty for changes in position of temporally adjacent

    pixels in a least-squares optimization formulation. Seam

    carving operator was improved to retarget video both spa-

    tially [12] and temporally [3] by searching monotonic and

    connected manifolds using graph cut. However, such lo-

    cal consistency in temporal domain is invalid when there

    exists large object/camera motion. Some methods enforce

    global temporal consistency by estimating motion informa-

    tion. For example, cropping-based methods [9, 16, 4] were

    extended to find temporally smooth cropping windows for

    videos. Motion segmentation is used to model the back-

    ground or to extract moving objects. A scale-and-stretch

    operator was extended to retarget video in [18]. Consecu-

    tive video frames are aligned to estimate inter-frame camera

    motion which can then be used to constrain retargeting. Al-

    though these methods can handle complex object or camera

    motion, they require motion estimation which in itself is a

    challenging task.

    1.1. Motivation

    This work is motivated by two major limitations of cur-

    rent video retargeting techniques. First, the effectiveness of


the current methods depends on the performance of motion

    estimation, which is performed prior to the actual retarget-

    ing. The errors in motion estimation will degrade the fi-

    nal result, especially for scenes with complex motion and

    large background clutter. Second, for videos, the compu-

    tational complexity of retargeting methods based on graph

cut is very large. In the multi-resolution framework, the 3D
graph cut is inefficient at a higher resolution although the

    solution at lower resolution can be efficiently solved.

    We propose a new framework for video retargeting with-

    out relying on any motion analysis, which results in an ef-

    ficient algorithm in terms of both computational time and

    memory usage. Within this framework, a new measure is

    introduced to estimate temporal consistency (naturality) for

    video retargeting. This measure does not require motion

    analysis and easily integrates both spatial and temporal do-

    mains into a unified framework. Using this measure in an

    energy function, we propose a multi-resolution framework

for video retargeting. A 3D shift map is applied to find the
initial retargeting solution of a video volume in the lowest

    resolution. Incremental 2D shift map is applied to refine ev-

    ery seam in the individual frame with temporal consistency

    with respect to the retargeting result of the previous frame.

    Compared to the traditional multi-resolution solution, our

    method solves the 3D retargeting problem by solving a se-

    ries of 2D retargeting problems, which is much more com-

    putationally efficient, especially at high resolution.

    The rest of this paper is organized as follows. A new

measure, spatial-temporal naturality, is introduced in Sec. 2,

    which is used to calculate the energy for graph cut. The

    proposed retargeting framework including 3D shift map, in-

    cremental 2D shift map as well as the new multi-resolution

    scheme is described in Sec 3. We evaluate the proposed

    method and analyze different properties on real video se-

quences in Sec. 4. Finally, we present conclusions in Sec. 5.

    2. Spatial-Temporal Naturality

    Most existing retargeting methods resort to minimizing

    a distortion measure in order to retarget a source video. For

    example, seam carving techniques [1, 12] try to minimize

    the distortion due to a new pair of pixels becoming adja-

    cent. Similarly, warping-based methods [20, 18] minimize

    the distortion resulting from the warping operation. How

    to model various forms of distortion is still an open ques-

    tion. In this paper, we assume that every pixel in the retar-

    geted video is sampled from some pixel in the source video.

    When performing video retargeting, the retargeted video is

visually pleasing if as few artifacts as possible are generated

    in both the spatial and the temporal domain. We introduce

    a measure to quantify the strength of artifacts introduced in

    the retargeted video and call it the spatial-temporal natural-

    ity, which is computed on every pair of neighboring pixels.

    As the name implies, both spatial and temporal neighbor-

    ing pixels are considered to ensure the naturality in both the

    domains. Without explicitly modeling the distortion, a re-

    targeted video which looks as natural as the source video

    can be generated by maximizing this measure over all pairs

    of neighboring pixels. This measure is used to define the

    energy function for graph cut. In the rest of this section,

we first generalize the naturality measure in the spatial
domain, which was used in [11], and then extend it to the temporal do-

    main.

    2.1. Spatial Naturality

    In spatial domain, naturality requires that two spatial

    neighboring pixels in the retargeted video are similar to

    some spatial neighboring pair in the source video. Fig-

    ure 1 (a) illustrates this constraint for two adjacent pixels

    marked as black (x) and black (+) in a frame of the re-

    targeted video. Under our assumption, they are sampled

    from two pixels that are not necessarily adjacent in the cor-

responding source frame. Given the mappings of two pixels
from the target to the source video, we can measure their

    naturality by considering each pixel in turn and computing

    the difference between the neighbor of its mapped pixel in

    the source and the mapped pixel of its neighbor. For exam-

    ple, consider the black (x) pixel in the target whose neigh-

    bor is black (+), which corresponds to the black (+) in the

    source. Also consider the black (x) pixel in the target which

    maps to the black (x) pixel in the source having neighbor

    as red (+). If the black (+) and red (+) pixels in the source

    are similar, then the neighboring pixels black (x) and black

    (+) in the target are considered natural with respect to the

    source. Otherwise, they will introduce artifacts that do not

    appear in the source frame. Similarly, if the red(x) and black

    (x) pixels in the source are similar, the black (+) and black

    (x) in the target are also considered natural. This formula-

tion is the same as the pairwise smoothness of [11], in which
the target pixel R(u, v) is derived from the source pixel
S(u + t_x, v + t_y) through the shift map M(u, v) = (t_x, t_y).

We extend the shift map along the temporal domain and
denote it by M_t(p), indicating the value of the shift map
at frame t and location p = (x, y) in the target domain.
Note that the mapping is from the target to the source. We
maximize the spatial naturality of the retargeted video by
minimizing

\[
\sum_{(p,t)\in R} \sum_{i=1}^{4} D\big( S(p + M_t(p) + e_i,\, t),\; S(p + e_i + M_t(p + e_i),\, t) \big) \tag{1}
\]

    where R denotes the collection of all pixels in the retargeted

video, e_i are the four unit vectors representing the four
spatial neighbors, D(·, ·) is the distance function measuring
the similarity between two pixels, and S refers to source pixels.

    The distance function operates on the source between (i) a

    pixel that is a spatial neighbor of a mapped pixel and (ii) the

    mapped pixel of the spatial neighbor of a target pixel.
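To make equation (1) concrete, the spatial term can be evaluated directly for one frame and its shift map. The sketch below is our own illustration under assumed conventions (NumPy arrays, 0-based coordinates, shifts stored as (ty, tx), and the grayscale absolute difference of Section 4 as D); it is not the authors' implementation.

```python
import numpy as np

def spatial_naturality_cost(S, M, D=lambda a, b: abs(float(a) - float(b))):
    """Spatial term of equation (1) for a single frame (illustrative sketch).

    S : 2D grayscale source frame, shape (H_s, W_s).
    M : integer shift map of the target frame, shape (H_r, W_r, 2);
        M[y, x] = (ty, tx) means target (y, x) samples source (y+ty, x+tx).
    D : pixel distance function.
    """
    Hr, Wr = M.shape[:2]
    Hs, Ws = S.shape
    neighbors = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # the four unit vectors e_i
    cost = 0.0
    for y in range(Hr):
        for x in range(Wr):
            for dy, dx in neighbors:
                ny, nx = y + dy, x + dx
                if not (0 <= ny < Hr and 0 <= nx < Wr):
                    continue
                # S(p + M_t(p) + e_i): neighbor of the mapped pixel
                ay, ax = y + M[y, x, 0] + dy, x + M[y, x, 1] + dx
                # S(p + e_i + M_t(p + e_i)): mapped pixel of the neighbor
                by, bx = ny + M[ny, nx, 0], nx + M[ny, nx, 1]
                if 0 <= ay < Hs and 0 <= ax < Ws and 0 <= by < Hs and 0 <= bx < Ws:
                    cost += D(S[ay, ax], S[by, bx])
    return cost
```

For an identity shift map the cost is zero, since every neighbor pair in the target maps back to the same neighbor pair in the source.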

  • 8/9/2019 Hybrid Shift Map for Video Retargeting

    3/8

Figure 1. Illustration of spatial-temporal naturality. (a) Spatial
naturality within the same frame; (b) temporal naturality between
neighboring frames.

    2.2. Temporal Naturality

    When considering the temporal information, naturality

    requires that two temporal neighbors in the retargeted video

    are similar to some temporal neighbors in the source video.

    Figure 1 (b) is an illustration of this constraint on two tem-

    porally adjacent pixels in the retargeted video. The pixel

black (+) in the (t−1)th frame of the retargeted video
is mapped to black (+) in the source video. The temporal

    neighbor of this pixel is the red (x) in the tth frame of the

    source video. If this pixel is similar to black (x), which is

    the mapping of the temporal neighbor of black (+) in the tth

    frame of the retargeted video, then the black (+) and black

    (x) in the retargeted video are considered temporally natu-

    ral. Similar analysis can be applied on the pixel black (x) in

tth frame of the target. If the black (+) in the (t−1)th frame
of the source is similar to the red (+) in the same frame, two

    temporal neighbors (marked as black (+) and (x)) in the re-

    targeted video are temporally natural as in the source video.

We maximize the temporal naturality of the retargeted
video by minimizing

\[
\sum_{(p,t)\in R} \sum_{\Delta t \in \{-1,+1\}} D\big( S(p + M_t(p),\, t + \Delta t),\; S(p + M_{t+\Delta t}(p),\, t + \Delta t) \big) \tag{2}
\]

where the definitions of R, S, M, and D are the same as in

    equation (1). The distance function operates on the source

    between (i) a pixel that is a temporal neighbor of a mapped

    pixel and (ii) the mapped pixel of the temporal neighbor of

    a target pixel. Any suitable distance measure can be used.
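Equation (2) admits the same direct reading as the spatial term. The sketch below mirrors the spatial version, again under assumed array conventions rather than the authors' code.

```python
import numpy as np

def temporal_naturality_cost(S, M, D=lambda a, b: abs(float(a) - float(b))):
    """Temporal term of equation (2) (illustrative sketch).

    S : grayscale source video, shape (T, H_s, W_s).
    M : shift maps of the target video, shape (T, H_r, W_r, 2);
        M[t, y, x] = (ty, tx).
    """
    T, Hr, Wr = M.shape[:3]
    _, Hs, Ws = S.shape
    cost = 0.0
    for t in range(T):
        for y in range(Hr):
            for x in range(Wr):
                for dt in (-1, 1):
                    if not 0 <= t + dt < T:
                        continue
                    # S(p + M_t(p), t + dt): temporal neighbor of the mapped pixel
                    ay, ax = y + M[t, y, x, 0], x + M[t, y, x, 1]
                    # S(p + M_{t+dt}(p), t + dt): mapped pixel of the temporal neighbor
                    by, bx = y + M[t + dt, y, x, 0], x + M[t + dt, y, x, 1]
                    if 0 <= ay < Hs and 0 <= ax < Ws and 0 <= by < Hs and 0 <= bx < Ws:
                        cost += D(S[t + dt, ay, ax], S[t + dt, by, bx])
    return cost
```

When consecutive frames carry identical shifts, the two sampled positions coincide and the temporal cost vanishes, which matches the intuition that an unchanging shift map introduces no temporal artifacts.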

    3. Hybrid Shift Map

    The retargeted video that preserves the spatial-temporal

    naturality of the source video is modeled as graph(s) where

    the nodes represent the pixels of the retargeted video. Retar-

    geting is achieved by finding the optimal mapping between

    the source and the retargeted video. Specifically, we encode

    the spatial-temporal naturality as well as other constraints

    into the following form, which can be minimized by graph

    cut algorithm, similar to [11]:

\[
E(M) = \sum_{(p,t)\in R} E_d(M_t(p)) + \sum_{((p_i,t_i),(p_j,t_j))\in N} E_s\big(M_{t_i}(p_i),\, M_{t_j}(p_j)\big) \tag{3}
\]

where E_d is the data term encoding the unary energy and E_s
is the smoothness term encoding the pairwise energy. In this

    section, we first develop a 3D shift map to retarget video

    with spatial-temporal naturality. To improve the computa-

    tional efficiency, an incremental 2D shift map is then in-

    troduced to retarget video. Compared with 3D shift map,

    this method can only achieve a local optimum. However,

    its computational complexity is much lower while still pre-

    serving spatial-temporal naturality. Finally, a novel solution

    for video retargeting is provided by combining these two

methods in a multi-resolution hierarchy, which is called the
Hybrid Shift Map.

    3.1. 3D Shift Map

    We model the retargeted video as a 3D grid graph where

every node is connected to its 4 spatial and 2 temporal
neighbors. There are two types of constraints related to

    video retargeting: pixel preservation during resizing and

    spatial-temporal naturality for artifact reduction. To find

    the optimal 3D shift map using graph cut, we encode the

  • 8/9/2019 Hybrid Shift Map for Video Retargeting

    4/8

    pixel preservation in the data term and the spatial-temporal

    naturality in the smoothness term of equation (3).

    Energy Function

    In video retargeting, it is required that some pixels need to

    be preserved. For example, in changing the width of the

video, the leftmost and rightmost columns of every frame
should be preserved in the target video. We use the data term
to encode such pixel preservation by assigning

\[
E_d(M_t(p)) =
\begin{cases}
\infty & \text{if } (x = 0) \wedge (t_x \neq 0)\\
\infty & \text{if } (x = W_R) \wedge (t_x \neq W_S - W_R)\\
0 & \text{otherwise}
\end{cases} \tag{4}
\]

where W_S and W_R are the widths of the video frames in the
source and the retargeted videos, respectively.
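In code, the hard constraint of equation (4) can be expressed per pixel. The sketch below assumes 0-based column indexing (so the rightmost target column is x = W_R − 1), which differs superficially from the paper's notation; it is an illustration, not the authors' implementation.

```python
import math

def data_term(x, tx, W_S, W_R):
    """Pixel-preservation data term of equation (4), sketched for width
    reduction with 0-based columns (an indexing assumption): the leftmost
    target column must come from the leftmost source column, and the
    rightmost target column from the rightmost source column."""
    if x == 0 and tx != 0:
        return math.inf            # leftmost column must not shift
    if x == W_R - 1 and tx != W_S - W_R:
        return math.inf            # rightmost column pinned to the source edge
    return 0.0
```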

The spatial and temporal naturalities described in sec-
tion 2 can be unified into a single pairwise measure. Con-
sider a pair of neighboring pixels (p_i, t_i) and (p_j, t_j), which
can be either spatial neighbors (t_i = t_j) or temporal neigh-
bors (p_i = p_j). For (p_j, t_j), the mapping of its neighbor
(p_i, t_i) onto the source video is given by

\[
S_i = S\big(p_j + \Delta p_{ji} + M_{t_j+\Delta t_{ji}}(p_j + \Delta p_{ji}),\; t_j + \Delta t_{ji}\big),
\]

where \Delta p_{ji} = (x_i - x_j, y_i - y_j) and \Delta t_{ji} = t_i - t_j.
The neighbor of the mapping of (p_j, t_j) that corresponds to
(p_i, t_i) is given by

\[
S_i' = S\big(p_j + M_{t_j}(p_j) + \Delta p_{ji},\; t_j + \Delta t_{ji}\big).
\]

Similarly,

\[
S_j = S\big(p_i + \Delta p_{ij} + M_{t_i+\Delta t_{ij}}(p_i + \Delta p_{ij}),\; t_i + \Delta t_{ij}\big),
\]
\[
S_j' = S\big(p_i + M_{t_i}(p_i) + \Delta p_{ij},\; t_i + \Delta t_{ij}\big),
\]

where \Delta p_{ij} = (x_j - x_i, y_j - y_i) and \Delta t_{ij} = t_j - t_i.
Therefore, the smoothness term in equation (3) can measure
spatial-temporal naturality as

\[
E_s\big(M_{t_i}(p_i),\, M_{t_j}(p_j)\big) = \min\big(D(S_i, S_i'),\, D(S_j, S_j')\big).
\]

If either D(S_i, S_i') = 0 or D(S_j, S_j') = 0, this pair of
neighboring pixels in the retargeted video is perfectly natu-
ral with respect to the source video.
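The unified smoothness term can be written down directly from the definitions of S_i, S_i', S_j, and S_j'. The following sketch assumes NumPy arrays with shifts stored as (ty, tx) and omits boundary checks for brevity; it is illustrative only, not the authors' code.

```python
import numpy as np

def pairwise_naturality(S, M, pi, ti, pj, tj,
                        D=lambda a, b: abs(float(a) - float(b))):
    """Unified smoothness term Es of Section 3.1 (illustrative sketch).

    S : video, shape (T, H, W); M : shift maps, shape (T, H_r, W_r, 2).
    (pi, ti), (pj, tj) : neighboring target pixels, spatial (ti == tj)
    or temporal (pi == pj); pi, pj are (y, x) tuples.
    """
    (yi, xi), (yj, xj) = pi, pj
    dpy, dpx, dt = yi - yj, xi - xj, ti - tj           # Δp_ji, Δt_ji
    # S_i: mapping of the neighbor (p_i, t_i) onto the source
    Si = S[tj + dt, yj + dpy + M[ti, yi, xi, 0], xj + dpx + M[ti, yi, xi, 1]]
    # S_i': neighbor of the mapping of (p_j, t_j)
    Si_p = S[tj + dt, yj + M[tj, yj, xj, 0] + dpy, xj + M[tj, yj, xj, 1] + dpx]
    # symmetric pair, seen from (p_i, t_i)
    Sj = S[ti - dt, yi - dpy + M[tj, yj, xj, 0], xi - dpx + M[tj, yj, xj, 1]]
    Sj_p = S[ti - dt, yi + M[ti, yi, xi, 0] - dpy, xi + M[ti, yi, xi, 1] - dpx]
    return min(D(Si, Si_p), D(Sj, Sj_p))
```

Under the identity shift map the two compared samples coincide on both sides, so the term is zero, matching the "perfectly natural" case above.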

    If we wish to eliminate one row/column from the video,

    the binary 3D graph cut guarantees a globally optimal so-

    lution. However, the computational complexity of this

    method is very high due to the large size of the 3D graph re-

    sulting in high computational time and memory usage. This

    precludes the application of this method on large video vol-

    umes and hence, a more efficient solution is required.

    3.2. Incremental 2D Shift Map

    Instead of considering the video as a 3D volume, it can

    be viewed as a collection of frames, each of which is rep-

    resented as a 2D grid graph. However, if we simply apply

    the 2D shift map [11] to retarget individual frames indepen-

dently, the retargeted video will not be temporally smooth

    and will contain jitters. The temporal information in the

    source video must be utilized in order to ensure temporal

    consistency. We propose an incremental solution to retarget

    video using temporal information so that the result is tempo-

rally consistent (smooth). In this scheme, the first frame of
the sequence is retargeted using the 2D shift map [11], and
the tth frame is processed based on the retargeted (t−1)th

    frame. The temporal consistency is improved by maximiz-

    ing the temporal naturality between the current retargeted

    frame and the retargeted result of previous frame.

Given the shift map of the (t−1)th frame, the tth frame
is retargeted by finding a minimum cut in an augmented 2D

    grid graph. Each node of this graph is not only associated

    with the coordinate shift in the current frame, but also as-

    sociated with the corresponding shift in the previous frame.

The shift map of the previous frame is utilized to constrain the
retargeting of the current frame. Specifically, we extend the
energy function to consider the temporal naturality.

    Energy Function

    In the data term of the energy function, in addition to the

pixel preservation term of equation (4), we encode the
temporal naturality with respect to the shift map of the
previous frame. The data term containing the measure of
temporal naturality is

\[
E_d(M_t(p)) = \min\big(D_{t-1}(p),\, D_t(p)\big), \tag{5}
\]

where

\[
D_{t-1}(p) = D\big(S(p + M_t(p),\, t-1),\; S(p + M_{t-1}(p),\, t-1)\big),
\]
\[
D_t(p) = D\big(S(p + M_t(p),\, t),\; S(p + M_{t-1}(p),\, t)\big). \tag{6}
\]

    Minimization of the data terms that comprise equations (5)

    and (4) results in the optimal shift map for the tth frame,

which is temporally smooth (natural) with respect to the re-
targeted (t−1)th frame.
The spatial naturality between pixel p and its neighbors
in the tth frame is encoded as

\[
E_s\big(M_t(p),\, M_t(p+e_i)\big) = D\big(S(p + M_t(p) + e_i,\, t),\; S(p + e_i + M_t(p+e_i),\, t)\big), \tag{7}
\]

where e_i denotes the four unit vectors representing the
four spatial neighbors, similar to [11].
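Equations (5) and (6) compare the current and previous shift maps at the same target location, in both frame t−1 and frame t, and keep the smaller discrepancy. A sketch under the same assumed array conventions as before (not the authors' code):

```python
import numpy as np

def temporal_data_term(S, M_curr, M_prev, y, x, t,
                       D=lambda a, b: abs(float(a) - float(b))):
    """Data term of equations (5)-(6) at target pixel (y, x) of frame t,
    given the shift map M_prev of the retargeted frame t-1 (illustrative
    sketch; boundary checks omitted for brevity).

    S : grayscale video, shape (T, H, W).
    M_curr, M_prev : shift maps of frames t and t-1, shape (H_r, W_r, 2).
    """
    ay, ax = y + M_curr[y, x, 0], x + M_curr[y, x, 1]   # p + M_t(p)
    by, bx = y + M_prev[y, x, 0], x + M_prev[y, x, 1]   # p + M_{t-1}(p)
    D_prev = D(S[t - 1, ay, ax], S[t - 1, by, bx])      # compared in frame t-1
    D_curr = D(S[t, ay, ax], S[t, by, bx])              # compared in frame t
    return min(D_prev, D_curr)
```

If the current frame reuses the previous frame's shift at a pixel, both comparisons degenerate to identical samples and the term is zero, which is why minimizing it pulls consecutive shift maps toward temporal agreement.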

While the 3D shift map achieves a globally optimal so-

    lution, the 2D shift map can only obtain the local optimum

    which is dependent on the initial guess. If the initial solution

    in the first frame is far away from the global optimum, the

retargeting results of other frames will also be far from

    the global optimum. However, the incremental 2D approach

    is much more efficient in terms of computational time and


Figure 2. Comparison of the bands of hybrid shift map and video
seam carving [12] on the same grid graph. The shaded area in (a)
is the band for hybrid shift map and in (b) is the band for seam
carving.

    memory usage, as shown in the experimental results. Since

    the original 3D problem is simplified into a series of 2D

problems, the number of nodes in each graph is much smaller

    than that in the 3D graph.

    3.3. Hybrid Shift Map as a Hierarchical Solution

    An efficient solution for video retargeting is provided

    by combining 3D shift map and 2D shift map in a multi-

    resolution framework. We build a Gaussian pyramid of the

    source video. In the lowest resolution, an initial retarget-

    ing result for every frame is estimated using 3D shift map.

    The global optimum property of 3D shift map guarantees

    a good initial solution, which is used to constrain the fi-

nal retargeting result at higher resolutions to remain close
to this initial guess. At the higher resolutions, we itera-

    tively apply incremental 2D shift map to obtain retargeted

video: the initial solution in the lower resolution is first inter-
polated to the higher resolution. Starting from the first frame,

    the shift map is optimized to refine the interpolated solution

    on banded 2D graph incrementally, where the band covers

    the initial solution as shown in Figure 2. For comparison

    between hybrid shift map and video seam carving [12], we

    use a simple multi-resolution banding method on standard

    grid graph. Compared to more advanced multi-resolution

    banding methods e.g. [10], the banded graph in Figure 2 is

    still a grid graph although the band is not minimum.

    This multi-resolution framework, denoted as Hybrid

    Shift Map, improves the computational complexity for re-

    targeting in two aspects. First, the 3D shift map is more

    efficient than 3D seam carving because the proposed graph

is much simpler than the seam graph with forward energy. Ev-
ery node in the graph used in the 3D shift map has only 6 edges
while a node in the seam graph has 14 edges. Second,
the 3D graph cut at higher resolution is divided into a se-

    ries of 2D graph cuts. Its computational time increases only

    linearly with the length of video sequence. Using the same

    banding method, the incremental 2D shift map has a nar-

    rower band in every frame since it bands the individual seam

    in the frame. On the other hand, video seam carving needs

    to band the whole 2D manifold interpolated from the lowest

    resolution. When the manifold involves a large variance in

    the temporal domain, the rectangular band in the 3D graph

    is larger than that in the hybrid shift map. Figure 2 shows

    this difference between hybrid shift map and video seam

    carving using the same banding method. Note that more

advanced banded multi-resolution graph cut technique [10]
can also be applied to the hybrid shift map to further reduce

    complexity.
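The hierarchy in this section can be summarized as pseudocode. All helper names (build_pyramid, solve_3d_shift_map, upsample_map, refine_2d_incremental) are hypothetical placeholders for the steps described above, not the authors' API.

```python
# Pseudocode sketch of the hybrid shift map pipeline (Section 3.3).
# Each hypothetical helper corresponds to a step in the text.
def hybrid_shift_map(video, levels=3):
    pyramid = build_pyramid(video, levels)        # Gaussian pyramid
    # Global 3D graph cut at the coarsest level gives the initial solution.
    M = solve_3d_shift_map(pyramid[levels - 1])
    for level in range(levels - 2, -1, -1):
        M = upsample_map(M)                       # interpolate to the finer level
        # Refine frame by frame on a banded 2D graph around the
        # interpolated solution, each frame constrained by the
        # retargeted previous frame (incremental 2D shift map).
        M = refine_2d_incremental(pyramid[level], M)
    return M                                      # one removed seam per call
```

One such call removes a single seam; repeating it reaches the target width, which is why the measured times in Section 4 are reported per pixel of width reduction.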

    4. Experimental Evaluation

    In this section, we evaluate different properties of the

    proposed method on real video sequences and compare

    them with other retargeting methods. Some test video se-

    quences are the same as used in [12] and some are web

videos downloaded from YouTube, which contain large cam-

    era/object motion in a complex scene. For all experiments,

we use the following settings: a 3-layer Gaussian pyramid is
built, the 3D shift map is estimated at the lowest resolution,
and the individual 2D shift maps are refined incrementally at
the original resolution. Both parameters are set to 1. The
distance function is D((p_1, t_1), (p_2, t_2)) = |S(p_1, t_1) − S(p_2, t_2)|,
which is simply the grayscale difference of two pixels.

    4.1. Temporal Consistency

    Compared with 2D shift map [11], our method retargets

    video by considering temporal consistency between source

    and target videos. Without motion analysis, our method can

    remove the flickering/waving artifacts and generate tempo-rally smooth retargeted video. We compare our method with

the naive solution of applying [11] independently on individual

    frames. Figure 3 (a) shows 4 consecutive frames of a bas-

    ketball sequence. The corresponding retargeted frames of

    the proposed hybrid shift map and the naive solution are

    shown in Figure 3 (b) and (c), respectively. The red curves

    in every frame are the next optimal seams to be removed.

    From the figure, we can see that the seams obtained by

    hybrid shift map are temporally smooth while those ob-

    tained by naive solution are not. Consequently, hybrid shift

    map generates retargeted video which is temporally consis-

tent with the source video. The naive solution using [11] can-

    not achieve this temporal consistency. Our method enforces

not the temporal smoothness of [12], but temporal natural-

    ity, which is adaptive to the video content. When the con-

    tent is homogeneous, even the non-smooth seams can pre-

    serve temporal naturality and generate temporally consis-

    tent video. Compared to other retargeting algorithms e.g.

    [18] which can preserve temporal consistency, our method

    does not rely on any motion analysis and is robust enough

    to be applied on challenging videos where motion analysis

    may be erroneous.


    Figure 3. (a) Four consecutive frames of basketball sequence. Retargeted by (b) hybrid shift map and (c) the naive solution of applying 2D

    shift map [11] on every frame independently.

Figure 4. Comparison of hybrid shift map and incremental 2D shift
map. (a) Two frames (1st and 98th) of a cycling sequence;
corresponding retargeted frames using (b) hybrid shift map and (c)
incremental 2D shift map.

    4.2. Global vs Local vs Hybrid

    Our retargeting framework combines two components:

    3D shift map at the lowest resolution and incremental 2D

    shift map at the original resolution. We analyze the retar-

    geting results as well as the computational complexities of

    our framework and individual components. Figure 4 shows

    the retargeted frames obtained from hybrid shift map and in-

    cremental 2D shift map for two frames of a video sequence.

    We can see that the retargeted frames output from incremen-

    tal 2D shift map (Figure 4 (c)) introduce large distortions

    within the red box. This is because without the initializa-

    tion of 3D shift map, the incremental method does not con-

    sider all frames and can only achieve local optimum. The

Table 1. Computational time for reducing the width by 1 pixel for a
320×240 video of 110 frames.

         3D shift map   2D shift map   Hybrid shift map
Time     178s           20s            (9+8)s

    hybrid shift map instead uses 3D shift map in the lowest

    resolution to constrain every 2D shift map to be close to the

    global optimum. Hence, the retargeted frames from the hy-

    brid shift map (shown in Figure 4 (b)) are more natural than

    incremental 2D shift map alone.

    In terms of computational complexity, we compare the

computational time of reducing the width by 1 pixel for the
whole video using different components. For a 320×240
video sequence of 110 frames, the computational times of
the different components are summarized in Table 1. The 3D

    shift map and incremental 2D shift map are applied on the

    original resolution. For hybrid shift map, 3D shift map

    is applied on the third layer of Gaussian pyramid and in-

    cremental 2D shift map is applied on the original resolu-

    tion. We can see the proposed hybrid shift map significantly

    improves the efficiency while preserving the effectiveness.

The 3D shift map component (9s) is faster since it is ap-

    plied at the lowest resolution and the incremental 2D shift

    map component (8s) in the original resolution improves due

    to the smaller graph for every frame.

    4.3. Hybrid Shift Map vs Video Seam Carving

    In this section, we compare our proposed method with

    [12], which improves seam carving for video retargeting.

    Both methods model the video as graph(s) and use graph

    cut techniques to iteratively retarget video content. When

reducing the width by 1 pixel, each binary shift map actually
corresponds to a removed seam. However, the proposed

    method differs from [12] in several aspects. First, graph

    constructions are different. [12] models the source video as

    a graph while our method represents the retargeted video as


Figure 5. Example of a disconnected seam (marked in red).

Figure 6. Comparison of hybrid shift map and seam carving [12].
(a) 3 consecutive frames of a video; (b) retargeted frames of hybrid
shift map; (c) retargeted frames of seam carving. The sequences
can be seen in the supplementary material.

Table 2. Computational complexity of hybrid shift map and seam
carving.

          Hybrid shift map   Seam carving
Time      (7+8)s             (22+14+54)s
Memory    572 MB             3550 MB

    a graph. By modeling the retargeted video, forward energy

    in [12] can be easily derived in our method. Since only the

binary-labeling problem is guaranteed to achieve the global op-
timum, we recursively reduce the width by 1 pixel by solving
a binary-labeling problem. Second, the properties of the

    seam which is removed from every frame are different. In

    our method, the removed seam is only required to be mono-

    tonic and need not be connected. A vertical/horizontal seam

is monotonic if there is only 1 pixel belonging to it in every
row/column, respectively. As shown in Figure 5, the con-

    nectivity of a seam is automatically controlled by the con-

    tent: it can be disconnected in the homogeneous area and

    will be connected in the boundary area. Hence, our method

    is more flexible than [12]. Figure 6 compares hybrid shift

    map and seam carving on a video sequence. We can see

    hybrid shift map produces a retargeted video with less dis-

    tortion than seam carving.
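The monotonicity property above is easy to state in code: exactly one removed pixel per row, with no connectivity requirement between adjacent rows. A small illustrative helper (our own, not from the paper):

```python
def is_monotonic_vertical_seam(seam_mask):
    """Check the monotonicity property of Section 4.3: a vertical seam is
    monotonic if it removes exactly one pixel per row. Connectivity
    between adjacent rows is deliberately NOT required, so a seam may
    jump columns inside homogeneous regions. seam_mask is a row-major
    2D 0/1 array marking removed pixels (a hypothetical helper)."""
    return all(sum(row) == 1 for row in seam_mask)
```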

    Table 2 summarizes both the time and memory usage

when removing 1 pixel in width from a 480×272 video of

    Figure 7. Enlarging using hybrid shift map. The sequences can be

    seen in the supplementary material.

86 frames. For the computational time of hybrid shift map,
the two numbers are the time spent on initialization and re-

    finement, respectively. For video seam carving, the three

numbers indicate the time spent at the 3 different resolutions
of the Gaussian pyramid. As analyzed in Section 3.3, hybrid

    shift map reduces the time spent on both initialization and

    refinement. We can also see that hybrid shift map signifi-cantly reduces the memory usage compared to seam carv-

    ing method from Table 2. On the higher resolution, hybrid

    shift map only needs to maintain a 2D graph corresponding

to a narrow band of a single frame in memory while the seam

    carving method has to maintain the whole 3D graph. Note

    that the same multi-resolution banding method is applied on

    both methods for fair comparison. The complexity of video

    seam carving is higher than that reported in [12] because

    of the banding method. When applying the more advanced

    banded method [10], the hybrid shift map can also further

    reduce the complexity and is still more efficient than seam

    carving.

The hybrid shift map can also change the height of a
video in a similar way. As illustrated in Figure 7, it can also

    be used for increasing the width of a video sequence. Figure

8 shows more retargeting results on 5 video sequences. We
can see that the retargeted videos from the proposed hybrid

    shift map are visually more natural than those from other

methods. Compared to simple scaling and the warping-based
method, the seam carving method generates retargeted videos

    with less distortion. The proposed hybrid shift map outputs

    even better results. For example, the head and the leg of

    the player in two basketball sequences have less distortion

    than seam carving results. For the fourth sequence, the left

    person is less distorted than that in other methods.

    Figure 8. More retargeting results on 5 video sequences. The first column shows a sample frame in the source video. The second to fifth columns are the corresponding retargeted frames using the proposed hybrid shift map, seam carving [12], a recent warping-based retargeting method [8], and simple down-scaling, respectively. The sequences can be seen in the supplementary material.

    5. Conclusion

    In this paper, we introduce a new method, denoted hybrid shift map, for video retargeting. Without applying any motion analysis, this method retargets video by maximizing the spatial-temporal naturality between the source and target videos. A novel multi-resolution framework is proposed to break the computational bottleneck of video retargeting: a 3D shift map computes the initial solution at the lowest resolution, and an incremental 2D shift map refines that solution at the original resolution. Compared with related retargeting methods, the proposed hybrid shift map significantly improves efficiency in terms of both computational time and memory usage while still retargeting video with spatial-temporal naturality.

    Acknowledgement

    This research was supported by the Media Development

    Authority (MDA) under grant NRF2008IDM-IDM004-032.

    References

    [1] S. Avidan and A. Shamir. Seam Carving for Content-Aware Image

    Resizing. SIGGRAPH, 26(3), July 2007.

    [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A Randomized Correspondence Algorithm for Structural Image Editing. SIGGRAPH, 28(3), August 2009.

    [3] B. Chen and P. Sen. Video Carving. In Eurographics, April 2008.

    [4] T. Deselaers, P. Dreuw, and H. Ney. Pan, Zoom, Scan - Time-

    coherent, Trained Automatic Video Cropping. In CVPR, June 2008.

    [5] R. Gal, O. Sorkine, and D. Cohen-Or. Feature-aware Texturing. In

    Eurographics Symposium on Rendering, June 2006.

    [6] J.-W. Han, K.-S. Choi, T.-S. Wang, S.-H. Cheon, and S.-J. Ko.

    Wavelet Based Seam Carving For Content-Aware Image Resizing.

    In ICIP, November 2009.

    [7] H. Huang, T. Fu, P. L. Rosin, and C. Qi. Real-Time Content-Aware

    Image Resizing. Science in China Series F: Information Science,

    52(2), February 2009.

    [8] J.-S. Kim, J.-H. Kim, and C.-S. Kim. Adaptive Image and Video Retargeting Based on Fourier Analysis. In CVPR, June 2009.

    [9] F. Liu and M. Gleicher. Video Retargeting: Automating Pan-and-

    Scan. In ACM Multimedia, October 2006.

    [10] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A Multilevel Banded Graph Cuts Method for Fast Image Segmentation. In ICCV, October 2005.

    [11] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-Map Image Editing. In ICCV, September 2009.

    [12] M. Rubinstein, A. Shamir, and S. Avidan. Improved Seam Carving

    for Video Retargeting. SIGGRAPH, 27(3), December 2008.

    [13] M. Rubinstein, A. Shamir, and S. Avidan. Multi-operator Media Re-

    targeting. SIGGRAPH, 28(3), August 2009.

    [14] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Co-

    hen. Gaze-based Interaction for Semi-automatic Photo Cropping.

    In SIGCHI, April 2006.

    [15] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing

    Visual Data Using Bidirectional Similarity. In CVPR, June 2008.

    [16] C. Tao, J. Jia, and H. Sun. Active Window Oriented Dynamic Video

    Retargeting. In ICCV Workshop on Dynamic Vision, October 2007.

    [17] S.-F. Wang and S.-H. Lai. Fast Structure-Preserving Image Retargeting. In ICASSP, April 2009.

    [18] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel. Motion-Aware Temporal Coherence for Video Resizing. SIGGRAPH ASIA, 28(5), December 2009.

    [19] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee. Optimized Scale-

    and-Stretch for Image Resizing. SIGGRAPH ASIA, 27(5), December

    2008.

    [20] L. Wolf, M. Guttmann, and D. Cohen-Or. Non-homogeneous

    Content-driven Video-Retargeting. In ICCV, October 2007.

    [21] X. Xie, H. Liu, W.-Y. Ma, and H.-J. Zhang. Browsing Large Pictures under Limited Display Sizes. IEEE Transactions on Multimedia, 8(4), August 2006.