8/9/2019 Hybrid Shift Map for Video Retargeting
Hybrid Shift Map for Video Retargeting
Yiqun Hu, Deepu Rajan
School of Computer Engineering
Nanyang Technological University, Singapore 639798
{yqhu,asdrajan}@ntu.edu.sg
Abstract
We propose a new method for video retargeting, which
can generate spatial-temporally consistent video. The new
measure called spatial-temporal naturality preserves the
motion in the source video without any motion analysis in
contrast to other methods that need motion estimation. This
advantage prevents the retargeted video from degenerating
due to the propagation of the errors in motion analysis. It
allows the proposed method to be applied on challenging
videos with complex camera and object motion. To improve
the efficiency of the retargeting process, we retarget video
using a 3D shift map in low resolution and refine it using
an incremental 2D shift map in higher resolution. This new
hierarchical framework, denoted as hybrid shift map, can
produce satisfactory retargeting results while significantly
improving the computational efficiency.
1. Introduction
Media retargeting aims to increase or decrease the size
of media according to the inherent content and not blindly
as in scaling and cropping. The important content is pre-
served during the retargeting process. For images, this may
involve simply resizing, while for videos, the resizing could
be in the spatial and/or the temporal domain. With the de-
velopment of diverse terminal devices e.g. mobile phones,
large-screen displays etc., this technique is useful for adapt-
ing multimedia content onto devices with different screen
resolutions.
Image Retargeting
Various retargeting algorithms have been proposed to
adapt images to different resolutions and aspect ratios. As
opposed to homogeneous resizing methods [21, 14] that
crop the most important regions to be included, recent meth-
ods focus on nonlinear image retargeting according to im-
age content. For example, seam carving [1] and its vari-
ants [7, 6] remove horizontal/vertical seams from the image.
Seams are monotonically connected curves with minimum
perceptual energy. Adaptive warping methods redistribute
the image pixels in a single direction [5] or over several
directions [19] according to their importance. Visually im-
portant regions are preserved while homogeneous regions
are merged. Some image editing techniques solve the retar-
geting problem by redistributing pixels under completeness
and coherence constraints [15, 2]. Other extensions are also proposed: [8] introduced an image retargeting framework
based on Fourier analysis to improve efficiency. Retarget-
ing results are improved by preserving the image structures
in [17]. In [13], multiple operators are integrated to obtain
optimally retargeted images.
Video Retargeting
Some video retargeting methods have been proposed by
adapting image retargeting methods for video content. The
local temporal consistency between adjacent pixels in a
spatial-temporal video cube is enforced. For example, in
[20], local temporal consistency was enforced by introducing a penalty for changes in position of temporally adjacent
pixels in a least-squares optimization formulation. Seam
carving operator was improved to retarget video both spa-
tially [12] and temporally [3] by searching monotonic and
connected manifolds using graph cut. However, such lo-
cal consistency in temporal domain is invalid when there
exists large object/camera motion. Some methods enforce
global temporal consistency by estimating motion informa-
tion. For example, cropping-based methods [9, 16, 4] were
extended to find temporally smooth cropping windows for
videos. Motion segmentation is used to model the back-
ground or to extract moving objects. A scale-and-stretch
operator was extended to retarget video in [18]. Consecu-
tive video frames are aligned to estimate inter-frame camera
motion which can then be used to constrain retargeting. Al-
though these methods can handle complex object or camera
motion, they require motion estimation which in itself is a
challenging task.
1.1. Motivation
This work is motivated by two major limitations of cur-
rent video retargeting techniques. First, the effectiveness of
the current methods depends on the performance of motion
estimation, which is performed prior to the actual retarget-
ing. The errors in motion estimation will degrade the fi-
nal result, especially for scenes with complex motion and
large background clutter. Second, for videos, the compu-
tational complexity of retargeting methods based on graph
cut is very large. In the multi-resolution framework, the 3D graph cut is inefficient at a higher resolution although the
solution at lower resolution can be efficiently solved.
We propose a new framework for video retargeting with-
out relying on any motion analysis, which results in an ef-
ficient algorithm in terms of both computational time and
memory usage. Within this framework, a new measure is
introduced to estimate temporal consistency (naturality) for
video retargeting. This measure does not require motion
analysis and easily integrates both spatial and temporal do-
mains into a unified framework. Using this measure in an
energy function, we propose a multi-resolution framework
for video retargeting. A 3D shift map is applied to find the initial retargeting solution of a video volume at the lowest
resolution. Incremental 2D shift map is applied to refine ev-
ery seam in the individual frame with temporal consistency
with respect to the retargeting result of the previous frame.
Compared to the traditional multi-resolution solution, our
method solves the 3D retargeting problem by solving a se-
ries of 2D retargeting problems, which is much more com-
putationally efficient, especially at high resolution.
The rest of this paper is organized as follows. A new
measure, spatial-temporal naturality, is introduced in Sec 2,
which is used to calculate the energy for graph cut. The
proposed retargeting framework including 3D shift map, in-
cremental 2D shift map as well as the new multi-resolution
scheme is described in Sec 3. We evaluate the proposed
method and analyze different properties on real video se-
quences in Sec 4. Finally, we present conclusions in Sec 5.
2. Spatial-Temporal Naturality
Most existing retargeting methods resort to minimizing
a distortion measure in order to retarget a source video. For
example, seam carving techniques [1, 12] try to minimize
the distortion due to a new pair of pixels becoming adja-
cent. Similarly, warping-based methods [20, 18] minimize
the distortion resulting from the warping operation. How
to model various forms of distortion is still an open ques-
tion. In this paper, we assume that every pixel in the retar-
geted video is sampled from some pixel in the source video.
When performing video retargeting, the retargeted video is
visually pleasing if as few artifacts as possible are generated
in both the spatial and the temporal domain. We introduce
a measure to quantify the strength of artifacts introduced in
the retargeted video and call it the spatial-temporal natural-
ity, which is computed on every pair of neighboring pixels.
As the name implies, both spatial and temporal neighbor-
ing pixels are considered to ensure the naturality in both the
domains. Without explicitly modeling the distortion, a re-
targeted video which looks as natural as the source video
can be generated by maximizing this measure over all pairs
of neighboring pixels. This measure is used to define the
energy function for graph cut. In the rest of this section,
we first generalize the naturality measure in the spatial domain, which was used in [11], and then extend it to the temporal domain.
2.1. Spatial Naturality
In spatial domain, naturality requires that two spatial
neighboring pixels in the retargeted video are similar to
some spatial neighboring pair in the source video. Fig-
ure 1 (a) illustrates this constraint for two adjacent pixels
marked as black (x) and black (+) in a frame of the re-
targeted video. Under our assumption, they are sampled
from two pixels that are not necessarily adjacent in the cor-
responding source frame. Given the mappings of two pixelsfrom the target to the source video, we can measure their
naturality by considering each pixel in turn and computing
the difference between the neighbor of its mapped pixel in
the source and the mapped pixel of its neighbor. For exam-
ple, consider the black (x) pixel in the target whose neigh-
bor is black (+), which corresponds to the black (+) in the
source. Also consider the black (x) pixel in the target which
maps to the black (x) pixel in the source having neighbor
as red (+). If the black (+) and red (+) pixels in the source
are similar, then the neighboring pixels black (x) and black
(+) in the target are considered natural with respect to the
source. Otherwise, they will introduce artifacts that do not
appear in the source frame. Similarly, if the red(x) and black
(x) pixels in the source are similar, the black (+) and black
(x) in the target are also considered natural. This formulation is the same as the pairwise smoothness of [11], in which the target pixel R(u, v) is derived from the source pixel S(u + t_x, v + t_y) through the shift map M(u, v) = (t_x, t_y).
We extend the shift map along the temporal domain and
denote it by M_t(p), indicating the value of the shift map at frame t and location p = (x, y) in the target domain. Note that the mapping is from the target to the source. We
maximize the spatial naturality of the retargeted video by
minimizing
Σ_{(p,t)∈R} Σ_{i=1}^{4} D(S(p + M_t(p) + e_i, t), S(p + e_i + M_t(p + e_i), t))    (1)
where R denotes the collection of all pixels in the retargeted
video, e_i are the four unit vectors representing the four spatial neighbors, D(·,·) is the distance function measuring the similarity between two pixels, and S refers to source pixels.
The distance function operates on the source between (i) a
pixel that is a spatial neighbor of a mapped pixel and (ii) the
mapped pixel of the spatial neighbor of a target pixel.
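To make this concrete, the spatial term of equation (1) can be computed for a single frame as in the following sketch. This is a minimal NumPy illustration under our own naming conventions (the function name, array layout, and boundary handling are assumptions of this sketch); the distance D is the grayscale absolute difference later used in Sec 4:

```python
import numpy as np

def spatial_naturality_energy(source, shift_map):
    """Sum of D(S(p + M(p) + e_i), S(p + e_i + M(p + e_i))) over all
    pixels p of one retargeted frame and its 4 spatial neighbors e_i,
    with D the grayscale absolute difference.  `shift_map[y, x]`
    holds the (row, column) shift of target pixel (y, x)."""
    h, w = shift_map.shape[:2]                       # retargeted-frame size
    neighbors = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # e_i as (dy, dx)
    energy = 0.0
    for y in range(h):
        for x in range(w):
            ty, tx = shift_map[y, x]
            for dy, dx in neighbors:
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w):
                    continue                         # neighbor outside target
                nty, ntx = shift_map[ny, nx]
                # (i) spatial neighbor of the mapped pixel in the source ...
                a = (y + ty + dy, x + tx + dx)
                # (ii) ... vs. the mapped pixel of the target neighbor
                b = (ny + nty, nx + ntx)
                if all(0 <= a[i] < source.shape[i] for i in (0, 1)) and \
                   all(0 <= b[i] < source.shape[i] for i in (0, 1)):
                    energy += abs(float(source[a]) - float(source[b]))
    return energy
```

For an identity shift map the two arguments of D always coincide, so the energy is zero; a shift discontinuity is penalized exactly where it juxtaposes a pixel pair that never occurs in the source.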
Figure 1. Illustration of spatial-temporal naturality. (a) Spatial naturality within the same frame; (b) temporal naturality between neighboring frames.
2.2. Temporal Naturality
When considering the temporal information, naturality
requires that two temporal neighbors in the retargeted video
are similar to some temporal neighbors in the source video.
Figure 1 (b) is an illustration of this constraint on two tem-
porally adjacent pixels in the retargeted video. The pixel
black (+) in the (t−1)th frame of the retargeted video is mapped to black (+) in the source video. The temporal
neighbor of this pixel is the red (x) in the tth frame of the
source video. If this pixel is similar to black (x), which is
the mapping of the temporal neighbor of black (+) in the tth
frame of the retargeted video, then the black (+) and black
(x) in the retargeted video are considered temporally natu-
ral. Similar analysis can be applied on the pixel black (x) in
tth frame of the target. If the black (+) in the (t−1)th frame of the source is similar to the red (+) in the same frame, two
temporal neighbors (marked as black (+) and (x)) in the re-
targeted video are temporally natural as in the source video.
We maximize the temporal naturality of the retargeted video by minimizing

Σ_{(p,t)∈R} Σ_{Δt∈{−1,+1}} D(S(p + M_t(p), t + Δt), S(p + M_{t+Δt}(p), t + Δt))    (2)
where the definitions of R, S, M and D are the same as in
equation (1). The distance function operates on the source
between (i) a pixel that is a temporal neighbor of a mapped
pixel and (ii) the mapped pixel of the temporal neighbor of
a target pixel. Any suitable distance measure can be used.
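As with the spatial term, equation (2) can be sketched directly. Again, the naming and array layout are our own illustrative assumptions, with the grayscale absolute difference as D:

```python
import numpy as np

def temporal_naturality_energy(source, shift_maps):
    """Equation (2) for a grayscale video `source` of shape
    (frames, h_src, w_src): for every target pixel (p, t) and
    dt in {-1, +1}, compare the temporal neighbor of the mapped pixel,
    S(p + M_t(p), t + dt), with the mapped pixel of the temporal
    neighbor, S(p + M_{t+dt}(p), t + dt).  `shift_maps` has shape
    (frames, h, w, 2) with (row, column) shifts."""
    n_frames, h, w = shift_maps.shape[:3]
    energy = 0.0
    for t in range(n_frames):
        for dt in (-1, 1):
            if not 0 <= t + dt < n_frames:
                continue
            for y in range(h):
                for x in range(w):
                    # both samples are taken in the neighboring frame t + dt
                    ay = y + shift_maps[t, y, x, 0]
                    ax = x + shift_maps[t, y, x, 1]
                    by = y + shift_maps[t + dt, y, x, 0]
                    bx = x + shift_maps[t + dt, y, x, 1]
                    if 0 <= ay < source.shape[1] and 0 <= ax < source.shape[2] and \
                       0 <= by < source.shape[1] and 0 <= bx < source.shape[2]:
                        energy += abs(float(source[t + dt, ay, ax]) -
                                      float(source[t + dt, by, bx]))
    return energy
```

Note that both samples lie in frame t + Δt: the term compares what the mapped pixel sees at the neighboring time instant against what its temporal neighbor actually maps to, without estimating any motion.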
3. Hybrid Shift Map
The retargeted video that preserves the spatial-temporal
naturality of the source video is modeled as graph(s) where
the nodes represent the pixels of the retargeted video. Retar-
geting is achieved by finding the optimal mapping between
the source and the retargeted video. Specifically, we encode
the spatial-temporal naturality as well as other constraints
into the following form, which can be minimized by graph
cut algorithm, similar to [11]:
E(M) = Σ_{(p,t)∈R} E_d(M_t(p)) + Σ_{((p_i,t_i),(p_j,t_j))∈N} E_s(M_{t_i}(p_i), M_{t_j}(p_j))    (3)

where E_d is the data term encoding the unary energy and E_s is the smoothness term encoding the pairwise energy. In this
section, we first develop a 3D shift map to retarget video
with spatial-temporal naturality. To improve the computa-
tional efficiency, an incremental 2D shift map is then in-
troduced to retarget video. Compared with 3D shift map,
this method can only achieve a local optimum. However,
its computational complexity is much lower while still pre-
serving spatial-temporal naturality. Finally, a novel solution
for video retargeting is provided by combining these two
methods in a multi-resolution hierarchy, which is called the Hybrid Shift Map.
3.1. 3D Shift Map
We model the retargeted video as a 3D grid graph where
every node is connected to its 4 spatial and 2 temporal neighbors. There are two types of constraints related to
video retargeting: pixel preservation during resizing and
spatial-temporal naturality for artifact reduction. To find
the optimal 3D shift map using graph cut, we encode the
pixel preservation in the data term and the spatial-temporal
naturality in the smoothness term of equation (3).
Energy Function
In video retargeting, some pixels must be preserved. For example, when changing the width of the video, the leftmost and rightmost columns of every frame should be preserved in the target video. We use the data term to encode such pixel preservation by assigning
E_d(M_t(p)) =  ∞  if (x = 0) ∧ (t_x ≠ 0)
               ∞  if (x = W_R) ∧ (t_x ≠ W_S − W_R)
               0  otherwise                          (4)
where WS and WR are the widths of the video frames in the
source and the retargeted videos, respectively.
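A sketch of this unary term, under the zero-based indexing convention we adopt here for illustration (so the rightmost target column is W_R − 1), could read:

```python
def data_term(x, tx, w_source, w_target):
    """Equation (4) for width retargeting: the leftmost target column
    must keep shift 0 and the rightmost must map to the rightmost
    source column; every other assignment carries no unary cost.
    Zero-based column indices are an assumption of this sketch."""
    inf = float("inf")
    if x == 0 and tx != 0:
        return inf                      # left border must be preserved
    if x == w_target - 1 and tx != w_source - w_target:
        return inf                      # right border must be preserved
    return 0.0
```

In the graph this simply pins the two border columns of every frame; the interior shifts are left entirely to the smoothness term.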
The spatial and temporal naturalities described in section 2 can be unified into a single pairwise measure. Consider a pair of neighboring pixels (p_i, t_i) and (p_j, t_j), which can be either spatial neighbors (t_i = t_j) or temporal neighbors (p_i = p_j). For (p_j, t_j), the mapping of its neighbor (p_i, t_i) onto the source video is given by

S_i = S(p_j + Δp_ji + M_{t_j+Δt_ji}(p_j + Δp_ji), t_j + Δt_ji),

where Δp_ji = (x_i − x_j, y_i − y_j) and Δt_ji = t_i − t_j. The neighbor of the mapping of (p_j, t_j) that corresponds to (p_i, t_i) is given by

S'_i = S(p_j + M_{t_j}(p_j) + Δp_ji, t_j + Δt_ji).

Similarly,

S_j = S(p_i + Δp_ij + M_{t_i+Δt_ij}(p_i + Δp_ij), t_i + Δt_ij),
S'_j = S(p_i + M_{t_i}(p_i) + Δp_ij, t_i + Δt_ij),

where Δp_ij = (x_j − x_i, y_j − y_i) and Δt_ij = t_j − t_i. Therefore, the smoothness term in equation (3) can measure spatial-temporal naturality as

E_s(M_{t_i}(p_i), M_{t_j}(p_j)) = min(D(S_i, S'_i), D(S_j, S'_j)).

If either D(S_i, S'_i) = 0 or D(S_j, S'_j) = 0, this pair of neighboring pixels in the retargeted video is perfectly natural with respect to the source video.
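The unified pairwise term can be sketched as follows; the helper names and the out-of-bounds handling (ignoring samples that fall outside the source) are illustrative assumptions, and the shift map M is passed as a function from (t, p) to a shift:

```python
import numpy as np

def smoothness_term(S, M, pi, ti, pj, tj):
    """E_s = min(D(S_i, S_i'), D(S_j, S_j')) for a pair of (spatial or
    temporal) neighbors in the target.  S is the grayscale source video
    of shape (frames, h, w); M(t, p) returns the (row, column) shift."""
    def sample(p, t):
        y, x = p
        if 0 <= t < S.shape[0] and 0 <= y < S.shape[1] and 0 <= x < S.shape[2]:
            return float(S[t, y, x])
        return None                      # sample falls outside the source

    dp_ji = (pi[0] - pj[0], pi[1] - pj[1]); dt_ji = ti - tj
    dp_ij = (-dp_ji[0], -dp_ji[1]);         dt_ij = -dt_ji
    m_i = M(ti, pi); m_j = M(tj, pj)     # shifts of the two endpoints

    # S_i: mapping of p_i (= p_j + dp_ji at time t_j + dt_ji)
    s_i  = sample((pi[0] + m_i[0], pi[1] + m_i[1]), ti)
    # S_i': neighbor of the mapped p_j
    s_ip = sample((pj[0] + m_j[0] + dp_ji[0], pj[1] + m_j[1] + dp_ji[1]),
                  tj + dt_ji)
    # S_j and S_j', symmetrically
    s_j  = sample((pj[0] + m_j[0], pj[1] + m_j[1]), tj)
    s_jp = sample((pi[0] + m_i[0] + dp_ij[0], pi[1] + m_i[1] + dp_ij[1]),
                  ti + dt_ij)

    costs = []
    if s_i is not None and s_ip is not None:
        costs.append(abs(s_i - s_ip))
    if s_j is not None and s_jp is not None:
        costs.append(abs(s_j - s_jp))
    return min(costs) if costs else 0.0
```

The same routine covers both neighbor types: for spatial neighbors Δt is zero, for temporal neighbors Δp is zero.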
If we wish to eliminate one row/column from the video,
the binary 3D graph cut guarantees a globally optimal so-
lution. However, the computational complexity of this
method is very high due to the large size of the 3D graph re-
sulting in high computational time and memory usage. This
precludes the application of this method on large video vol-
umes and hence, a more efficient solution is required.
3.2. Incremental 2D Shift Map
Instead of considering the video as a 3D volume, it can
be viewed as a collection of frames, each of which is rep-
resented as a 2D grid graph. However, if we simply apply
the 2D shift map [11] to retarget individual frames indepen-
dently, the retargeted video will not be temporally smooth
and will contain jitters. The temporal information in the
source video must be utilized in order to ensure temporal
consistency. We propose an incremental solution to retarget
video using temporal information so that the result is tempo-
rally consistent (smooth). In this scheme, the first frame of the sequence is retargeted using the 2D shift map [11], and the tth frame is processed based on the retargeted (t−1)th
frame. The temporal consistency is improved by maximiz-
ing the temporal naturality between the current retargeted
frame and the retargeted result of previous frame.
Given the shift map of the (t−1)th frame, the tth frame is retargeted by finding a minimum cut in an augmented 2D
grid graph. Each node of this graph is not only associated
with the coordinate shift in the current frame, but also as-
sociated with the corresponding shift in the previous frame.
The shift map of previous frame is utilized to constrain the
retargeting of the current frame. Specifically, we extend the energy function to consider the temporal naturality.
Energy Function
In the data term of the energy function, in addition to the
pixel preservation term of equation (4), we encode the
temporal naturality with respect to the shift map of the previous frame. The data term containing the measure of temporal
naturality is
E_d(M_t(p)) = min(D_{t−1}(p), D_t(p)),    (5)

where

D_{t−1}(p) = D(S(p + M_t(p), t − 1), S(p + M_{t−1}(p), t − 1)),
D_t(p) = D(S(p + M_t(p), t), S(p + M_{t−1}(p), t)).    (6)
Minimization of the data terms that comprise equations (5)
and (4) results in the optimal shift map for the tth frame,
which is temporally smooth (natural) with respect to the re-
targeted (t−1)th frame. The spatial naturality between pixel p and its neighbors
in the tth frame is encoded as:
E_s(M_t(p), M_t(p + e_i)) = D(S(p + M_t(p) + e_i, t), S(p + e_i + M_t(p + e_i), t))    (7)

where e_i denotes the four unit vectors of the four spatial neighbors, similar to [11].
While the 3D shift map achieves a globally optimal solution, the 2D shift map can only obtain a local optimum
which is dependent on the initial guess. If the initial solution
in the first frame is far away from the global optimum, the
retargeting results of other frames will be also away from
the global optimum. However, the incremental 2D approach
is much more efficient in terms of computational time and
Figure 2. Comparison of the bands of hybrid shift map and video seam carving [12] on the same grid graph. The shaded area in (a) is the band for the hybrid shift map and in (b) is the band for seam carving.
memory usage, as shown in the experimental results. Since
the original 3D problem is simplified into a series of 2D
problems, the number of nodes in each graph is much less
than that in the 3D graph.
3.3. Hybrid Shift Map as a Hierarchical Solution
An efficient solution for video retargeting is provided
by combining 3D shift map and 2D shift map in a multi-
resolution framework. We build a Gaussian pyramid of the
source video. In the lowest resolution, an initial retarget-
ing result for every frame is estimated using 3D shift map.
The global optimum property of 3D shift map guarantees
a good initial solution, which is used to constrain the fi-
nal retargeting result in higher resolution not too far away
from this initial guess. In the higher resolutions, we itera-
tively apply incremental 2D shift map to obtain retargeted
video: the initial solution at the lower resolution is first interpolated to the higher resolution. Starting from the first frame,
the shift map is optimized to refine the interpolated solution
on banded 2D graph incrementally, where the band covers
the initial solution as shown in Figure 2. For comparison
between hybrid shift map and video seam carving [12], we
use a simple multi-resolution banding method on standard
grid graph. Compared to more advanced multi-resolution
banding methods e.g. [10], the banded graph in Figure 2 is
still a grid graph although the band is not minimum.
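The banded refinement can be sketched with a hypothetical helper that, given the seam columns found at the coarser level, restricts each row of the 2D graph to a small band around the doubled coarse position (the helper name and the fixed band radius are our own assumptions):

```python
def band_around_seam(coarse_seam_cols, fine_width, radius=2):
    """For each row of the finer pyramid level, keep only the columns
    within `radius` of the doubled coarse seam position (coordinates
    double when moving up one level; rows are assumed to have been
    upsampled by the caller).  Returns per-row (lo, hi) column bounds,
    clipped to the frame."""
    bands = []
    for c in coarse_seam_cols:
        lo = max(0, 2 * c - radius)
        hi = min(fine_width - 1, 2 * c + radius)
        bands.append((lo, hi))
    return bands
```

Only graph nodes inside these per-row bands are instantiated at the finer level, which is why the per-frame 2D graphs stay small.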
This multi-resolution framework, denoted as Hybrid
Shift Map, improves the computational complexity for re-
targeting in two aspects. First, the 3D shift map is more
efficient than 3D seam carving because the proposed graph
is much simpler than the seam graph with forward energy. Every node in the graph used in the 3D shift map has only 6 edges, while a node in the seam graph has 14 edges. Second, the 3D graph cut at the higher resolution is divided into a series of 2D graph cuts. Its computational time increases only
linearly with the length of video sequence. Using the same
banding method, the incremental 2D shift map has a nar-
rower band in every frame since it bands the individual seam
in the frame. On the other hand, video seam carving needs
to band the whole 2D manifold interpolated from the lowest
resolution. When the manifold involves a large variance in
the temporal domain, the rectangular band in the 3D graph
is larger than that in the hybrid shift map. Figure 2 shows
this difference between hybrid shift map and video seam
carving using the same banding method. Note that more
advanced banded multi-resolution graph cut technique [10] can also be applied to the hybrid shift map to further reduce
complexity.
4. Experimental Evaluation
In this section, we evaluate different properties of the
proposed method on real video sequences and compare
them with other retargeting methods. Some test video se-
quences are the same as used in [12] and some are web
videos downloaded from Youtube, which contain large cam-
era/object motion in a complex scene. For all experiments,
we use the following setting: a 3-layer Gaussian pyramid is built, the 3D shift map is estimated at the lowest resolution, and the individual 2D shift maps are refined incrementally at the original resolution. Both parameters are set to 1. The distance function is D((p_1, t_1), (p_2, t_2)) = |S(p_1, t_1) − S(p_2, t_2)|, which is simply the grayscale difference of two pixels.
4.1. Temporal Consistency
Compared with 2D shift map [11], our method retargets
video by considering temporal consistency between source
and target videos. Without motion analysis, our method can
remove the flickering/waving artifacts and generate temporally smooth retargeted video. We compare our method with the naive solution of applying [11] independently on individual
frames. Figure 3 (a) shows 4 consecutive frames of a bas-
ketball sequence. The corresponding retargeted frames of
the proposed hybrid shift map and the naive solution are
shown in Figure 3 (b) and (c), respectively. The red curves
in every frame are the next optimal seams to be removed.
From the figure, we can see that the seams obtained by
hybrid shift map are temporally smooth while those obtained by the naive solution are not. Consequently, hybrid shift map generates a retargeted video which is temporally consistent with the source video. The naive solution using [11] cannot achieve this temporal consistency. Our method enforces not temporal smoothness as in [12], but temporal naturality, which is adaptive to the video content. When the con-
tent is homogeneous, even the non-smooth seams can pre-
serve temporal naturality and generate temporally consis-
tent video. Compared to other retargeting algorithms e.g.
[18] which can preserve temporal consistency, our method
does not rely on any motion analysis and is robust enough
to be applied on challenging videos where motion analysis
may be erroneous.
Figure 3. (a) Four consecutive frames of basketball sequence. Retargeted by (b) hybrid shift map and (c) the naive solution of applying 2D
shift map [11] on every frame independently.
Figure 4. Comparison of hybrid shift map and incremental 2D shift map. (a) Two frames (the 1st and the 98th) of a cycling sequence; corresponding retargeted frames using (b) hybrid shift map and (c) incremental 2D shift map.
4.2. Global vs Local vs Hybrid
Our retargeting framework combines two components:
3D shift map at the lowest resolution and incremental 2D
shift map at the original resolution. We analyze the retar-
geting results as well as the computational complexities of
our framework and individual components. Figure 4 shows
the retargeted frames obtained from hybrid shift map and in-
cremental 2D shift map for two frames of a video sequence.
We can see that the retargeted frames output from incremen-
tal 2D shift map (Figure 4 (c)) introduce large distortions
within the red box. This is because without the initializa-
tion of 3D shift map, the incremental method does not con-
sider all frames and can only achieve a local optimum. The
Table 1. Computational time for reducing 1 pixel in width for a 320×240 video of 110 frames.

            3D shift map   2D shift map   Hybrid shift map
Time        178s           20s            (9+8)s
hybrid shift map instead uses 3D shift map in the lowest
resolution to constrain every 2D shift map to be close to the
global optimum. Hence, the retargeted frames from the hy-
brid shift map (shown in Figure 4 (b)) are more natural than
incremental 2D shift map alone.
In terms of computational complexity, we compare the
computational time of reducing the width by 1 pixel for the whole video using different components. For a 320×240 video sequence of 110 frames, the computational times of the different components are summarized in Table 1. The 3D
shift map and incremental 2D shift map are applied on the
original resolution. For hybrid shift map, 3D shift map
is applied on the third layer of Gaussian pyramid and in-
cremental 2D shift map is applied on the original resolu-
tion. We can see the proposed hybrid shift map significantly
improves the efficiency while preserving the effectiveness.
The 3D shift map component (9s) is faster because it is applied at the lowest resolution, and the incremental 2D shift map component (8s) at the original resolution is faster due to the smaller graph for every frame.
4.3. Hybrid Shift Map vs Video Seam Carving
In this section, we compare our proposed method with
[12], which improves seam carving for video retargeting.
Both methods model the video as graph(s) and use graph
cut techniques to iteratively retarget video content. When
reducing 1 pixel in width, each binary shift map actually corresponds to a removed seam. However, the proposed
method differs from [12] in several aspects. First, graph
constructions are different. [12] models the source video as
a graph while our method represents the retargeted video as
Figure 5. Example of disconnected seam (marked as red).
Figure 6. Comparison of hybrid shift map and seam carving [12].
(a) 3 consecutive frames of a video; (b) retargeted frames of hybrid
shift map; (c) retargeted frames of seam carving. The sequences
can be seen in the supplementary material.
Table 2. Computational complexity of hybrid shift map and seam
carving.
            Hybrid shift map   Seam carving
Time        (7+8)s             (22+14+54)s
Memory      572 MB             3550 MB
a graph. By modeling the retargeted video, forward energy
in [12] can be easily derived in our method. Since only the
binary-labeling problem is guaranteed to achieve the global optimum, we recursively reduce the width by 1 pixel by solving a binary-labeling problem. Second, the properties of the
seam which is removed from every frame are different. In
our method, the removed seam is only required to be mono-
tonic and need not be connected. A vertical/horizontal seam
is monotonic if there is only 1 pixel belonging to it in every row/column, respectively. As shown in Figure 5, the con-
nectivity of a seam is automatically controlled by the con-
tent: it can be disconnected in the homogeneous area and
will be connected in the boundary area. Hence, our method
is more flexible than [12]. Figure 6 compares hybrid shift
map and seam carving on a video sequence. We can see
hybrid shift map produces a retargeted video with less dis-
tortion than seam carving.
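The correspondence between a binary shift map and a removed seam can be sketched as follows for a 1-pixel width reduction; we assume label 0 marks pixels that keep their column and label 1 marks pixels shifted left by one, so the removed source column in each row is the first position labelled 1 (or the rightmost source column if the row is all zeros):

```python
import numpy as np

def seam_from_labels(labels):
    """Recover the removed seam from a binary shift map for a 1-pixel
    width reduction.  `labels` is (h, w_target) with 0 = keep position
    and 1 = sample one column to the right.  Exactly one source column
    is dropped per row, so the seam is monotonic by construction, but
    consecutive rows need not pick adjacent columns."""
    seam = []
    for row in labels:
        ones = np.flatnonzero(row)
        # first 1 marks the cut; all-zero row drops the last source column
        seam.append(int(ones[0]) if len(ones) else len(row))
    return seam
```

Because exactly one source column is dropped per row the seam is monotonic, yet nothing forces consecutive rows to pick adjacent columns, which is exactly the content-adaptive connectivity discussed above.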
Table 2 summarizes both the time and memory usage
when removing 1 pixel in width from a 480×272 video of
Figure 7. Enlarging using hybrid shift map. The sequences can be
seen in the supplementary material.
86 frames. For the computational time of hybrid shift map, the two numbers are the time spent on initialization and re-
finement, respectively. For video seam carving, the three
numbers indicate the time spent at the 3 different resolutions of the Gaussian pyramid. As analyzed in Section 3.3, hybrid
shift map reduces the time spent on both initialization and
refinement. From Table 2, we can also see that hybrid shift map significantly reduces the memory usage compared to the seam carving method. At the higher resolution, hybrid
shift map only needs to maintain a 2D graph corresponding
to a narrow band of a single frame in memory while the seam
carving method has to maintain the whole 3D graph. Note
that the same multi-resolution banding method is applied on
both methods for fair comparison. The complexity of video
seam carving is higher than that reported in [12] because
of the banding method. When applying the more advanced
banded method [10], the hybrid shift map can also further
reduce the complexity and is still more efficient than seam
carving.
The hybrid shift map can also change the height of a video in a similar way. As illustrated in Figure 7, it can also be used for increasing the width of a video sequence. Figure 8 shows more retargeting results on 5 video sequences. We can see that the retargeted videos from the proposed hybrid
shift map are visually more natural than those from other
methods. Compared to simple scaling and warping based
method, seam carving method generates retargeted videos
with less distortion. The proposed hybrid shift map outputs
even better results. For example, the head and the leg of
the player in two basketball sequences have less distortion
than seam carving results. For the fourth sequence, the left
person is less distorted than that in other methods.
5. Conclusion
In this paper, we introduce a new method denoted as hy-
brid shift map for video retargeting. Without applying any
motion analysis, this method can retarget video by maxi-
mizing the spatial-temporal naturality between source and
target video. A novel multi-resolution framework is pro-
posed to break the computational bottleneck of video re-
targeting. Specifically, 3D shift map is designed to get the
Figure 8. More retargeting results on 5 video sequences. The first column shows a sample frame in the source video. The second column
to fifth column are the corresponding retargeted frames using proposed hybrid shift map, seam carving [12], a recent warping-based
retargeting method [8] and simple down-scaling, respectively. The sequences can be seen in the supplementary material.
initial solution in the lowest resolution and incremental 2D
shift map is designed to refine the initial solution in the orig-
inal resolution. Compared with related retargeting methods,
the proposed hybrid shift map significantly improves the
efficiency in terms of both computational time and memory usage while still retargeting video with spatial-temporal
naturality.
Acknowledgement
This research was supported by the Media Development
Authority (MDA) under grant NRF2008IDM-IDM004-032.
References
[1] S. Avidan and A. Shamir. Seam Carving for Content-Aware Image
Resizing. SIGGRAPH, 26(3), July 2007.
[2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patch-
Match: A Randomized Correspondence Algorithm for Structural Im-age Editing. SIGGRAPH, 28(3), August 2009.
[3] B. Chen and P. Sen. Video Carving. In Eurographics, April 2008.
[4] T. Deselaers, P. Dreuw, and H. Ney. Pan, Zoom, Scan - Time-
coherent, Trained Automatic Video Cropping. In CVPR, June 2008.
[5] R. Gal, O. Sorkine, and D. Cohen-Or. Feature-aware Texturing. In
Eurographics Symposium on Rendering, June 2006.
[6] J.-W. Han, K.-S. Choi, T.-S. Wang, S.-H. Cheon, and S.-J. Ko.
Wavelet Based Seam Carving For Content-Aware Image Resizing.
In ICIP, November 2009.
[7] H. Huang, T. Fu, P. L. Rosin, and C. Qi. Real-Time Content-Aware
Image Resizing. Science in China Series F: Information Science,
52(2), February 2009.
[8] J.-S. Kim, J.-H. Kim, and C.-S. Kim. Adaptive Image and Video
Retargeting Based on Fourier Analysis. In CVPR, June 2009.
[9] F. Liu and M. Gleicher. Video Retargeting: Automating Pan-and-
Scan. In ACM Multimedia, October 2006.
[10] H. Lombaert, Y. Sun, L. Grady, and C. Xu. A Multilevel Banded
Graph Cuts Method for Fast Image Segmentation. In ICCV, October 2005.
[11] Y. Pritch, E. Kav-Venaki, and S. Peleg. Shift-Map Image Editing. In
ICCV, September 2009.
[12] M. Rubinstein, A. Shamir, and S. Avidan. Improved Seam Carving
for Video Retargeting. SIGGRAPH, 27(3), December 2008.
[13] M. Rubinstein, A. Shamir, and S. Avidan. Multi-operator Media Re-
targeting. SIGGRAPH, 28(3), August 2009.
[14] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Co-
hen. Gaze-based Interaction for Semi-automatic Photo Cropping.
In SIGCHI, April 2006.
[15] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing
Visual Data Using Bidirectional Similarity. In CVPR, June 2008.
[16] C. Tao, J. Jia, and H. Sun. Active Window Oriented Dynamic Video
Retargeting. In ICCV Workshop on Dynamic Vision, October 2007.
[17] S.-F. Wang and S.-H. Lai. Fast Structure-Preserving Image Retarget-
ing. In ICASSP, April 2009.
[18] Y.-S. Wang, H. Fu, O. Sorkine, T.-Y. Lee, and H.-P. Seidel. Motion-
Aware Temporal Coherence for Video Resizing. SIGGRAPH ASIA,
28(5), December 2009.
[19] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee. Optimized Scale-
and-Stretch for Image Resizing. SIGGRAPH ASIA, 27(5), December
2008.
[20] L. Wolf, M. Guttmann, and D. Cohen-Or. Non-homogeneous
Content-driven Video-Retargeting. In ICCV, October 2007.
[21] X. Xie, H. Liu, W.-Y. Ma, and H.-J. Zhang. Browsing Large Pictures
under Limited Display Sizes. IEEE Transactions on Multimedia, 8(4),
August 2006.