
Lifting-based Invertible Motion Adaptive Transform (LIMAT)

Framework for Highly Scalable Video Compression

Andrew Secker and David Taubman∗

Abstract

We propose a new framework for highly scalable video compression, using a Lifting-based

Invertible Motion Adaptive Transform (LIMAT). We use motion-compensated lifting steps to

implement the temporal wavelet transform, which preserves invertibility, regardless of the motion

model. By contrast, the invertibility requirement has restricted previous approaches to either

block-based or global motion compensation. We show that the proposed framework effectively

applies the temporal wavelet transform along a set of motion trajectories. An implementation

demonstrates high coding gain from a finely embedded, scalable compressed bit-stream. Results

also demonstrate the effectiveness of temporal wavelet kernels other than the simple Haar, and

the benefits of complex motion modelling, using a deformable triangular mesh. These advances

are either incompatible or difficult to achieve with previously proposed strategies for scalable

video compression. Video sequences reconstructed at reduced frame-rates, from subsets of the

compressed bit-stream, demonstrate the visually pleasing properties expected from low-pass

filtering along the motion trajectories. The paper also describes a compact representation for

the motion parameters, having motion overhead comparable to that of motion-compensated

predictive coders.

Our experimental results compare favourably with others reported in the literature; however, the principal objective of this paper is to motivate a new framework for highly scalable video

compression.

∗The authors are with the School Of Electrical Engineering & Telecommunications, The University Of New South

Wales, Sydney 2052, Australia. Contact: A. Secker*; email: [email protected], ph: +61 (2) 9385 4803, fax:

+61 (2) 9385 5993. D.Taubman; email: [email protected]. ph: +61 (2) 9385 5223. EDICS code: 1-VIDC


1 Introduction

The objective of highly scalable video coding is to produce a dense family of embedded bit-streams,

each an efficient compressed representation of the video, at successively higher bit-rates. Scalable

representations are important for efficient utilization of limited channel capacity, and have applica-

tions in many areas including simulcast, videoconferencing and remote video browsing. In addition

to bit-rate scalability, other important forms of scalability for video compression include spatial

resolution and temporal (frame-rate) scalability.

Highly scalable compression imposes an important restriction on the encoder. Specifically, it

must operate without prior knowledge of the rate at which the compressed video will be decoded.

For this reason, the predictive feedback paradigm inherent in traditional motion-compensated video

compression algorithms is fundamentally incompatible with highly scalable compression. Instead

the preferred method is that of feed-forward compression, in which a spatio-temporal transform

precedes embedded quantization and coding.

Karlsson and Vetterli first proposed the use of a separable three-dimensional (3D) discrete

wavelet transform (DWT) for video compression [1]. Separable 3D transforms are also employed

in [2], which extends the well-known SPIHT [3] coding algorithm into the temporal dimension.

However, without motion compensation, temporal filtering produces visually disturbing ghosting

artefacts in the low-pass temporal subband. This is clearly undesirable where temporal scalability

is of interest. The challenge therefore lies in finding a way to effectively exploit motion within the

spatio-temporal transform.

Taubman and Zakhor proposed an approach in which frames are invertibly warped so as to align

spatial features prior to the application of a separable 3D-DWT [4]. We refer to this and related

approaches as “frame-warping” schemes. While a variety of warping operators may be considered,

invertible warpings are unable to represent the localized expansion and contraction effects exhibited

within most video sequences. This is because any warping involving local expansion and contraction


essentially corresponds to a non-uniform resampling of the video frame, thus violating the Nyquist

sampling criterion in regions of expansion. In some proposed schemes, such as that of Tham et

al. [5], the invertibility requirement is deliberately violated so that high quality reconstruction is

impossible.

Another class of approaches can be described as block displacement methods, originally pro-

posed by Ohm [6]. In these approaches, video frames are divided into blocks, where each block

undergoes rigid motion, usually translation. The 3D-DWT is essentially applied in a separable

fashion along the displaced blocks, but the effects of expansion and contraction in the motion field

are observed by the appearance of “disconnected” pixels between the blocks. For the transform

to remain invertible, these disconnected pixels must be treated differently, which seriously affects

coding efficiency. In addition, perfect reconstruction is only possible with integer block displace-

ments, although extensions to half-pixel accuracy have been demonstrated [6], [7]. These methods

invariably involve block-based motion models, which cannot capture expansive or contractive mo-

tion. In [6] the concept of applying the separable DWT along displaced blocks is also generalized

to encompass arbitrary motion trajectories. However, this results in loss of perfect reconstruction,

unless the motion trajectories are confined to integer displacements, in which case the occurrence

of disconnected pixels increases significantly.

Temporal transforms based on either frame-warping or block displacement methods generally

employ only the Haar wavelet kernel. Extensions to longer temporal filters have been reported

[6],[4], with no significant improvement in performance.

In this paper, we propose a Lifting-based Invertible Motion Adaptive Transform (LIMAT),

which overcomes the limitations of the existing methods mentioned above. We elaborate on pre-

liminary work previously published in [8] and [9]. Our proposed framework employs a lifting

realization of the temporal DWT, in which each lifting step is compensated for the estimated scene

motion. The LIMAT framework enables the construction of motion adaptive temporal transforms


based on any wavelet kernel and motion model, while retaining the perfect reconstruction prop-

erty. Video coding based on 3D lifting structures has also been proposed in [10], as a means for

addressing some of the weaknesses of block displacement methods.

Section 2 describes the proposed framework, beginning with examples based on the Haar and

5/3 wavelet kernels. By contrast to previous methods, our results indicate that the 5/3 transform

offers superior coding efficiency compared to the Haar transform. In this section we also show that

if the video frames and motion mappings are both spatially continuous, the LIMAT framework

is equivalent to applying the DWT along the motion trajectories. In the discrete spatial domain,

this behaviour is retained for the most important spatial frequencies, and the remaining higher

frequencies are dealt with in a way that preserves the invertibility of the transform.

In practice, the performance of a motion adaptive transform is inevitably dependent on the

properties of the selected motion model. In Section 3 we provide the example of a deformable mesh

motion model, comparing its performance with that of a simple block motion model. Only in the

context of continuous motion fields, such as those yielded by the deformable mesh, can the proposed

transform be truly understood as filtering along the motion trajectories. Our experimental results

also indicate coding efficiency improvements when using the deformable mesh instead of block-based

motion.

In Section 4, we propose an efficient representation for the motion information associated with

the LIMAT framework. This representation allows us to reduce the number of explicitly coded

motion mappings to approximately one per original video frame. As a result, the motion overhead

is comparable to that of motion-compensated predictive coders.

Our experiments are described in detail in Section 5 where we provide compression results for

several standard test sequences. In the experiments, the proposed temporal transform is followed

by spatial wavelet decomposition and embedded block coding, using an implementation of the

JPEG2000 image compression standard. Our compression results compare favourably with others


reported in the literature.

2 Motion Adaptive Temporal DWT Based on Lifting

The key to efficient scalable video coding is to effectively exploit motion within the spatio-temporal

transform. Specifically, we identify three primary objectives for a motion adaptive temporal trans-

form, suitable for highly scalable video compression. Firstly, the low-pass temporal subband frames

should represent a high quality reduced frame-rate video. In particular, the transform should not

introduce ghosting artefacts into the low-pass frames, so that the visual quality is comparable to

that obtained by temporally subsampling the original video sequence prior to compression. In fact,

if the low-pass temporal subband frames are obtained by filtering along the true scene motion tra-

jectories, the reduced frame-rate video sequence obtained by discarding high temporal frequency

subbands from the compressed representation, may have an even higher quality than that obtained

by subsampling the original video sequence. This is because low-pass filtering along the motion

trajectories tends to reduce the effects of camera noise, spatial aliasing and scene illuminant varia-

tions.

Secondly, the transform should exhibit high coding gain. This means the high-pass temporal

subband frames should contain as little energy as possible, after adjusting for the energy gains

associated with the subband synthesis system. It is also important to minimize the introduction of

spurious spatial details in both the high-pass and low-pass frames, since these reduce the effective-

ness of subsequent spatial transformation and coding techniques. If the subbands are obtained by

applying suitable wavelet filters along the true scene motion trajectories, there should be minimal

introduction of spurious spatial details, and the high-pass subbands should have particularly low

energy. Moreover, so long as the quality of the low-pass temporal subband frames is preserved,

iterative application of the temporal decomposition along the low-pass channel should yield a multi-

resolution hierarchy with similar properties. For the experimental results reported in this paper,


we invariably consider three “stages” of such iterative decomposition.

As suggested, the above objectives are both closely related to the use of temporal filtering

and subsampling along the motion trajectories associated with a realistic motion model. However,

realistic motion models inevitably accommodate expansion and contraction, which makes it difficult

to achieve our third objective, that the transform should be invertible. Of course, lossy compression

generally prevents the original video sequence from being recovered exactly, but lack of invertibility

in the transform limits the range of compressed bit-rates over which the complete compression

system can be used efficiently.

In the LIMAT framework, the key to accomplishing invertibility is the use of lifting [11]. Any

two-channel FIR subband transform can be described as a finite sequence of lifting steps [12, 13].

In the proposed scheme, the original lifting steps are modified to exploit motion, which in no way

compromises the invertibility of the transform. In fact, by introducing integer rounding into our

modified lifting steps, in the manner suggested by Calderbank et al. [14], it is possible to achieve

efficient lossless compression of the original video sequence. Most significantly, however, the lifting-

based transform may be understood as applying the temporal wavelet transform along the motion

trajectories established by the selected motion model.

2.1 Example with the Haar Transform

It is instructive to begin with an example based upon the Haar wavelet transform. Up to a scale

factor, this transform may be realized in the temporal domain, through a sequence of two lifting

steps, as

$$h_k[n] = x_{2k+1}[n] - x_{2k}[n]$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{2}\, h_k[n]$$

where $x_k[n] \equiv x_k[n_1, n_2]$ denotes the samples of frame $k$ from the original video sequence and $h_k[n] \equiv h_k[n_1, n_2]$ and $l_k[n] \equiv l_k[n_1, n_2]$ denote the high-pass and low-pass subband frames. This


decomposition of the Haar transform into two steps is also known as the S-transform [15].

The reader can verify that $l_k[n]$ and $h_k[n]$ correspond to the scaled sum and the difference of

each original pair of frames. An example is shown in Figure 1. Since motion is ignored, ghosting

artefacts are clearly visible in the low-pass temporal subband, and the high-pass subband frame

has substantial energy.
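To make the two steps concrete, here is a minimal sketch (ours, not the authors' implementation) of this Haar lifting analysis and its inversion on a pair of frames; the frame dimensions are illustrative:

```python
import numpy as np

def haar_lifting_analysis(x0, x1):
    """Two lifting steps of the (unscaled) Haar transform."""
    h = x1 - x0          # step 1: frame difference (high-pass)
    l = x0 + 0.5 * h     # step 2: equals (x0 + x1)/2 (low-pass)
    return l, h

def haar_lifting_synthesis(l, h):
    """Invert by applying the same steps in reverse order, with opposite signs."""
    x0 = l - 0.5 * h
    x1 = h + x0
    return x0, x1

# Perfect-reconstruction check on random "frames" of an illustrative size.
x0, x1 = np.random.rand(2, 240, 352)
l, h = haar_lifting_analysis(x0, x1)
r0, r1 = haar_lifting_synthesis(l, h)
assert np.allclose(r0, x0) and np.allclose(r1, x1)
```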

Now let $W_{k_1\to k_2}$ denote a motion-compensated mapping of frame $k_1$ onto the coordinate system of frame $k_2$, so that $W_{k_1\to k_2}(x_{k_1})[n] \approx x_{k_2}[n]$, for all $n$. No particular motion model is assumed here. The lifting steps are modified as follows:

$$h_k[n] = x_{2k+1}[n] - W_{2k\to 2k+1}(x_{2k})[n]$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{2}\, W_{2k+1\to 2k}(h_k)[n]$$

Observe that $W_{2k\to 2k+1}$ and $W_{2k+1\to 2k}$ represent forward and backward motion mappings, respectively. The high-pass subband frames correspond to motion-compensated residuals. These will be close to zero in regions where the motion is accurately modelled. As we shall see in Section 2.3, so long as the motion is well modelled by $W_{2k\to 2k+1}$ and $W_{2k+1\to 2k}$, the low-pass frames $l_k[n]$ are effectively the result of applying a low-pass temporal filter along the motion trajectories. For the

Haar wavelet, this low-pass analysis filter has transfer function

$$H_0(z) = \tfrac{1}{2}(1 + z)$$

The visual effects of motion compensation can be seen by comparison of Figures 1 and 2. In the

example of Figure 2, we assume that the motion is captured perfectly. As a result, the high-pass

frame has no energy, and the low-pass frame is an excellent representation of frame 0, free from the

ghosting artefacts observed in Figure 1. If the signal amplitude on the surface of the moving object

fluctuates over time, possibly due to noise, the low-pass frame represents a temporal average of the

object’s surface intensity, while the high-pass frame represents the temporal difference.

The modified Haar lifting steps evidently achieve the first two objectives identified above, in


Figure 1: Lifting representation for the Haar temporal transform.

Figure 2: Same as Figure 1, but with motion-compensated lifting steps. This avoids ghosting in the low-pass frames and reduces the energy in the high-pass frames.


spatial regions where the motion is well modelled. Moreover, invertibility is an inherent property

of the lifting structure, regardless of how we choose to compensate for motion. To invert the

transform, one must simply apply the same lifting steps in reverse order, reversing the sign of the

updates. The reverse transform in this example is given by

$$x_{2k}[n] = l_k[n] - \tfrac{1}{2}\, W_{2k+1\to 2k}(h_k)[n] \tag{1}$$
$$x_{2k+1}[n] = h_k[n] + W_{2k\to 2k+1}(x_{2k})[n]$$
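The robustness of this construction is easy to check numerically. In the sketch below (an illustration under our own assumptions, not the authors' software), `warp` is an arbitrary dense resampling operator; because synthesis reapplies exactly the same warps with opposite signs, reconstruction is exact even though the warping itself is lossy and non-invertible:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(frame, flow):
    """A stand-in warping operator W: resample `frame` along a dense
    motion field `flow` of shape (2, H, W). It need not be invertible."""
    grid = np.indices(frame.shape).astype(float)
    return map_coordinates(frame, grid + flow, order=1, mode='nearest')

def mc_haar_analysis(x0, x1, fwd, bwd):
    h = x1 - warp(x0, fwd)        # motion-compensated prediction step
    l = x0 + 0.5 * warp(h, bwd)   # motion-compensated update step
    return l, h

def mc_haar_synthesis(l, h, fwd, bwd):
    # Same lifting steps, reverse order, opposite signs (equation (1)).
    x0 = l - 0.5 * warp(h, bwd)
    x1 = h + warp(x0, fwd)
    return x0, x1

x0, x1 = np.random.rand(2, 64, 64)
fwd, bwd = 4 * np.random.randn(2, 2, 64, 64)   # arbitrary, even nonsensical, motion
l, h = mc_haar_analysis(x0, x1, fwd, bwd)
r0, r1 = mc_haar_synthesis(l, h, fwd, bwd)
assert np.allclose(r0, x0) and np.allclose(r1, x1)  # perfect reconstruction
```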

2.2 A More Interesting Wavelet Transform

The framework described above is readily extended to any two-channel FIR subband transform, by

motion-compensating the relevant lifting steps. We demonstrate this in the important case of the

biorthogonal 5/3 wavelet transform [16], whose lifting incarnation plays an important role in the

highly scalable JPEG2000 image compression standard [13]. As before, $x_{2k}[n]$ and $x_{2k+1}[n]$ denote

the even and odd indexed frames from the original sequence. Without motion, the 5/3 transform

may be implemented by alternately updating each of these two frame sub-sequences, based on

filtered versions of the other sub-sequence. The lifting steps are

$$h_k[n] = x_{2k+1}[n] - \tfrac{1}{2}\big(x_{2k}[n] + x_{2k+2}[n]\big)$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{4}\big(h_{k-1}[n] + h_k[n]\big)$$

As before, we introduce arbitrary motion warping operators within each lifting step, which yields

$$h_k[n] = x_{2k+1}[n] - \tfrac{1}{2}\big(W_{2k\to 2k+1}(x_{2k})[n] + W_{2k+2\to 2k+1}(x_{2k+2})[n]\big) \tag{2}$$
$$l_k[n] = x_{2k}[n] + \tfrac{1}{4}\big(W_{2k-1\to 2k}(h_{k-1})[n] + W_{2k+1\to 2k}(h_k)[n]\big)$$
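To pin down the indexing, the following sketch implements one analysis stage of these motion-compensated 5/3 lifting steps over a list of frames. The callable `warp(x, a, b)`, approximating $W_{a\to b}$, and the boundary policy (index clamping at the ends of the sequence) are our illustrative assumptions:

```python
def mc_53_analysis(frames, warp):
    """One stage of motion-compensated 5/3 lifting (a sketch).
    `warp(x, a, b)` approximates W_{a->b}(x); indices are clamped
    at the sequence boundaries."""
    n = len(frames)
    c = lambda k: min(max(k, 0), n - 1)          # frame-index clamp
    # Prediction step: h_k = x_{2k+1} - (W(x_{2k}) + W(x_{2k+2})) / 2
    h = [frames[2*k + 1]
         - 0.5 * (warp(frames[c(2*k)], c(2*k), 2*k + 1)
                  + warp(frames[c(2*k + 2)], c(2*k + 2), 2*k + 1))
         for k in range(n // 2)]
    ch = lambda k: min(max(k, 0), len(h) - 1)    # subband-index clamp
    # Update step: l_k = x_{2k} + (W(h_{k-1}) + W(h_k)) / 4
    l = [frames[2*k]
         + 0.25 * (warp(h[ch(k - 1)], c(2*k - 1), 2*k)
                   + warp(h[ch(k)], c(2*k + 1), 2*k))
         for k in range((n + 1) // 2)]
    return l, h

# With the identity warp, this reduces to the uncompensated 5/3 transform:
#   l, h = mc_53_analysis(frames, lambda x, a, b: x)
```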

In Figure 3 we see the effect of these modified lifting steps. The high-pass frames are now

essentially the residual from a bidirectional motion-compensated prediction of the odd-indexed

original frames. When the motion is adequately captured, these high-pass frames have little energy


and the low-pass frames correspond to low-pass filtering of the original video sequence along its

motion trajectories. In this example, the surface intensity of the moving object varies randomly

over time. These variations are effectively low-pass filtered, improving the visual quality of the

low-pass temporal subband. The low-pass analysis filter in this case has transfer function

$$H_0(z) = -\tfrac{1}{8}z^{2} + \tfrac{1}{4}z + \tfrac{3}{4} + \tfrac{1}{4}z^{-1} - \tfrac{1}{8}z^{-2}$$

whose frequency response is much closer to that of an ideal low-pass filter than was the Haar filter.
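The equivalence between the lifting form and direct filtering by $H_0(z)$ is easy to verify numerically; the following sketch compares the two on a 1-D signal, using periodic extension so that the match is exact:

```python
import numpy as np

x = np.random.rand(64)                    # 1-D stand-in for a pixel's time series
# Lifting form of the 5/3 analysis (no motion, periodic extension).
e, o = x[0::2], x[1::2]
h = o - 0.5 * (e + np.roll(e, -1))        # h_k from x_{2k}, x_{2k+1}, x_{2k+2}
l = e + 0.25 * (np.roll(h, 1) + h)        # l_k from h_{k-1}, h_k

# Direct form: correlate with h0 = [-1/8, 1/4, 3/4, 1/4, -1/8], keep even samples.
h0 = [-1/8, 1/4, 3/4, 1/4, -1/8]
filtered = sum(c * np.roll(x, -m) for c, m in zip(h0, [-2, -1, 0, 1, 2]))
assert np.allclose(l, filtered[0::2])     # identical low-pass subbands
```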

As before, failure to capture the motion reduces the coding gain, and introduces multiple ghosting

artefacts into the low-pass subband frames, as suggested by the figure.

In our experiments, the 5/3 wavelet consistently outperforms the Haar, which contrasts with

observations reported in the context of the frame-warping and block displacement methods [17, 4].

Table 1 presents indicative compression results, comparing the behaviour of the Haar and 5/3

wavelet kernels. These results are obtained using block-based motion warping operators, over three

stages of temporal transform. The reconstruction bit-rate is 1 Mbps, but similar results are obtained

at other bit-rates. A full description of the experimental conditions associated with these and other

results presented in this paper may be found in Section 5.

According to Table 1, the 5/3 transform provides an improvement in PSNR of between 1.25

dB and 2.35 dB, relative to the Haar transform. This can be attributed to the use of both forward

and backward motion compensation, which reduces the energy of the high-pass frames, as well

as improved low-pass filtering along motion trajectories. Note that these results exclude the cost

of coding motion information. The amount of information and preferred methods for coding the

motion are different in each case. For the moment, however, we defer consideration of these effects

so as to focus on the merits of the transforms themselves. We later show that the observations

presented here apply even when the cost of motion is taken into account, subject to the use of

appropriate coding techniques.


Table 1: Reconstructed PSNR using 5/3 and Haar wavelets at 1 Mbps

Sequence   Haar    5/3     Gain
Mobile     26.55   27.80   +1.25
Flower     26.84   28.83   +1.99
Table      31.40   33.75   +2.35
Football   26.05   28.26   +2.21

Figure 3: Motion adaptive 5/3 temporal transform. The low-pass frame is the result of low-pass filtering along the motion trajectories.


2.3 Generalization and Interpretation of the Lifting Transform

We have already mentioned that motion-compensating the lifting steps of a temporal subband

transform effectively results in the relevant subband filters being applied along the motion trajecto-

ries described by the motion model. In this section, we provide justification for this statement. To

do so, we first consider the application of the motion-compensated lifting transform to a sequence

of spatially continuous video frames, denoted $x_k(s) \equiv x_k(s_1, s_2)$, where $s \in \mathbb{R}^2$ represents the continuous spatial location and $k \in \mathbb{Z}$ is the frame index.

It is known that any two-channel FIR subband transform may be factored into a finite sequence of $\Lambda$ lifting steps [12, 13]. Each successive lifting step converts its input sequence, denoted $x^{(\lambda-1)}_k$, into an output sequence, $x^{(\lambda)}_k$, where $\lambda = 1, 2, \ldots, \Lambda$, and $x^{(0)}_k \triangleq x_k$ is the input sequence supplied to the subband transform. For odd $\lambda$, the odd indexed sub-sequence $x^{(\lambda-1)}_{2k+1}$ is updated using a filtered version of the even indexed sub-sequence $x^{(\lambda-1)}_{2k}$, according to

$$x^{(\lambda)}_{2k+1} = x^{(\lambda-1)}_{2k+1} + \sum_i u^{(\lambda)}_i \cdot x^{(\lambda-1)}_{2(k-i)}\,; \qquad x^{(\lambda)}_{2k} = x^{(\lambda-1)}_{2k}$$

Here $u^{(\lambda)}_i$ denotes the impulse response of the $\lambda$th lifting step filter. For even $\lambda$, the even sub-sequence is updated using a filtered version of the odd sub-sequence, according to

$$x^{(\lambda)}_{2k} = x^{(\lambda-1)}_{2k} + \sum_i u^{(\lambda)}_i \cdot x^{(\lambda-1)}_{2(k-i)+1}\,, \qquad x^{(\lambda)}_{2k+1} = x^{(\lambda-1)}_{2k+1}$$

The even and odd sub-sequences output from the final lifting step are the low-pass and high-pass subband sequences, respectively. That is,

$$l_k = x^{(\Lambda)}_{2k} \quad \text{and} \quad h_k = x^{(\Lambda)}_{2k+1}$$
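In code, this factorization is a short loop that alternately updates one sub-sequence from a filtered version of the other. The dictionary encoding of each $u^{(\lambda)}$ and the periodic boundary handling below are our illustrative choices:

```python
import numpy as np

def lifting_analysis(x, steps):
    """Generic two-channel lifting analysis (a sketch).
    `x` is a 1-D float signal of even length; `steps` is a list of
    lifting-step filters u^{(lam)}, each a dict {tap index i: u_i}.
    Odd-numbered steps update the odd sub-sequence from the even one;
    even-numbered steps do the reverse. Extension is periodic."""
    even, odd = x[0::2].copy(), x[1::2].copy()
    for lam, u in enumerate(steps, start=1):
        src, dst = (even, odd) if lam % 2 == 1 else (odd, even)
        # np.roll(src, i)[k] = src[k - i], matching sum_i u_i * src_{k-i}.
        dst += sum(c * np.roll(src, i) for i, c in u.items())
    return even, odd   # l_k and h_k after the final lifting step

# Example: the 5/3 transform as two lifting steps.
steps_53 = [{0: -0.5, -1: -0.5},   # predict: taps on x_{2k} and x_{2k+2}
            {0: 0.25, 1: 0.25}]    # update:  taps on h_k and h_{k-1}
l, h = lifting_analysis(np.random.rand(32), steps_53)
```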

The succession of lifting steps is illustrated in Figure 4 and its inverse is illustrated in Figure 5.

Figure 4: Network of $\Lambda$ lifting steps, used to realize a two-channel subband transform, with subband sequences $l_k$ and $h_k$.

Figure 5: Synthesis network corresponding to the analysis system in Figure 4.

Using this notation, our motion-compensated temporal transform may be expressed through the lifting steps

$$x^{(\lambda)}_{2k+1}(s) = x^{(\lambda-1)}_{2k+1}(s) + \sum_i u^{(\lambda)}_i \cdot W_{2(k-i)\to 2k+1}\big(x^{(\lambda-1)}_{2(k-i)}\big)(s)\,, \quad \lambda \text{ odd}$$
$$x^{(\lambda)}_{2k}(s) = x^{(\lambda-1)}_{2k}(s) + \sum_i u^{(\lambda)}_i \cdot W_{2(k-i)+1\to 2k}\big(x^{(\lambda-1)}_{2(k-i)+1}\big)(s)\,, \quad \lambda \text{ even}$$

Suppose now that our motion model is invertible, meaning that there is a one-to-one correspondence between locations $s$ in frame 0 and locations $s_k = V_k(s)$ in frame $k$. Equivalently, we are assuming that our motion model assigns unique trajectories, represented by the sequence $\{s_k\}$, to each location $s$ in frame 0, such that the trajectories do not intersect. In this assumption, we are clearly ignoring the finite spatial support of the frames, as well as the possibility of occlusion.

Since the motion model is invertible, we must have

$$W_{k_1\to k_2}(x_{k_1})(s_{k_2}) = x_{k_1}\Big(V_{k_1}\big(V_{k_2}^{-1}(s_{k_2})\big)\Big)$$

That is, to find the location $s_{k_1}$ in frame $k_1$ which corresponds to a location $s_{k_2}$ in frame $k_2$, we can map $s_{k_2}$ back to the origin of its motion trajectory through $V_{k_2}^{-1}$, and then map it forward along the same trajectory into frame $k_1$, using $V_{k_1}$. We can now rewrite the motion-compensated lifting steps in terms of the sequence of locations $s_k = V_k(s)$, corresponding to a single motion trajectory anchored at location $s$ in frame 0. We get

$$x^{(\lambda)}_{2k+1}(s_{2k+1}) = x^{(\lambda-1)}_{2k+1}(s_{2k+1}) + \sum_i u^{(\lambda)}_i \cdot W_{2(k-i)\to 2k+1}\big(x^{(\lambda-1)}_{2(k-i)}\big)(s_{2k+1})$$
$$= x^{(\lambda-1)}_{2k+1}(s_{2k+1}) + \sum_i u^{(\lambda)}_i \cdot x^{(\lambda-1)}_{2(k-i)}\big(s_{2(k-i)}\big)\,, \quad \lambda \text{ odd}$$
$$x^{(\lambda)}_{2k}(s_{2k}) = x^{(\lambda-1)}_{2k}(s_{2k}) + \sum_i u^{(\lambda)}_i \cdot W_{2(k-i)+1\to 2k}\big(x^{(\lambda-1)}_{2(k-i)+1}\big)(s_{2k})$$
$$= x^{(\lambda-1)}_{2k}(s_{2k}) + \sum_i u^{(\lambda)}_i \cdot x^{(\lambda-1)}_{2(k-i)+1}\big(s_{2(k-i)+1}\big)\,, \quad \lambda \text{ even}$$

Finally, let $\tilde{x}_k(s)$ denote the warped frame obtained by mapping $x_k(s)$ from the coordinate system associated with frame $k$ onto the coordinate system associated with frame 0. That is,

$$\tilde{x}_k(s) = W_{k\to 0}(x_k)(s) = x_k(s_k)$$

The lifting steps may be expressed in terms of the sequence of warped frames as

$$\tilde{x}^{(\lambda)}_{2k+1}(s) = \tilde{x}^{(\lambda-1)}_{2k+1}(s) + \sum_i u^{(\lambda)}_i \cdot \tilde{x}^{(\lambda-1)}_{2(k-i)}(s)\,, \quad \lambda \text{ odd}$$
$$\tilde{x}^{(\lambda)}_{2k}(s) = \tilde{x}^{(\lambda-1)}_{2k}(s) + \sum_i u^{(\lambda)}_i \cdot \tilde{x}^{(\lambda-1)}_{2(k-i)+1}(s)\,, \quad \lambda \text{ even}$$

This means that we are effectively applying the original temporal subband transform directly to the sequence of warped frames, $\tilde{x}_k(s)$. Equivalently, the original subband transform is being applied along the motion trajectories, $s_k = V_k(s)$. The low-pass temporal subband sequence, $l_k(s) = x^{(\Lambda)}_{2k}(s)$, is equivalent to the low-pass subband, $\tilde{x}^{(\Lambda)}_{2k}$, of the warped sequence, warped back onto the coordinate system of frame $2k$. Similarly, the high-pass temporal subband sequence, $h_k(s) = x^{(\Lambda)}_{2k+1}(s)$, is obtained by warping $\tilde{x}^{(\Lambda)}_{2k+1}$ back onto the coordinate system of frame $2k+1$.

We turn our attention now to the case of discrete frames, $x_k[n]$, each of which we model as a unit sampled version, $x_k[n] = x_k(s)|_{s=n}$, of a continuous frame, $x_k(s)$, obtained through Nyquist bandlimited (sinc) interpolation of $x_k[n]$. In the special case of purely translational motion, application of the warping operator $W_{k_1\to k_2}$ to $x_k(s)$ yields another Nyquist bandlimited frame, whose unit sample sequence may be obtained directly from $x_k[n]$ by ideal sinc-interpolative filtering.

Under these idealized conditions, the discrete and continuous realizations of our proposed lifting

framework are truly equivalent and the discrete subband frames are warped versions of those which

would be obtained if we had directly applied the temporal subband transform, after first warping

the original video frames onto a consistent reference coordinate system. This suggests a strong

connection between the LIMAT framework and the frame-warping methods proposed in [4] and [5],

in which the original frames are explicitly warped prior to the application of the temporal subband

transform.

In the above, we considered only translational motion and ideal sinc interpolation for the mo-

tion warping operators. The key innovation of the proposed LIMAT framework lies in its ability

to handle non-translational motion and non-ideal interpolation. In the LIMAT framework, the

temporal subband transform is not applied directly to the warped frames, x̃k [n]. Instead, the


terms in each lifting step are compensated for motion. This ensures that the transform remains

invertible, even if the individual frame warping operations are not invertible. Suppose the motion

field describing the relationship between frames $k_1$ and $k_2$ is contractive, meaning that features are smaller and closer together in $x_{k_2}(s)$ than they are in $x_{k_1}(s)$. Then warping $x_{k_1}(s)$ onto the coordinate system of frame $k_2$ will not generally leave a Nyquist bandlimited image, even if $x_{k_1}(s)$ was Nyquist bandlimited. As a result, high spatial frequencies must generally be lost when $W_{k_1\to k_2}$ is applied in the discrete image domain.

Since motion fields typically contain both expansive and contractive regions, the relationship

between continuous and discrete warped frames cannot generally be preserved. Nevertheless, with

carefully designed motion compensating interpolation filters, it is possible to accurately preserve

the discrete/continuous relationship at lower spatial frequencies. The range of spatial frequencies

over which the relationship can be considered preserved, depends on the nature of the motion

field, as well as the complexity of the interpolation filters. In the neighbourhood of any subband

frame of interest, we may model the video sequence as the sum of two component sequences: a

“base” sequence, containing the lower spatial frequencies which are correctly preserved by the

motion warping operators; and a “residual” sequence, containing the higher spatial frequencies.

Since the transform and warping operators are all linear, the subband frame in question may also

be regarded as the sum of two components, one of which is effectively obtained by applying the

temporal subband filters along the motion trajectories of the “base” video sequence. While we

cannot provide a similar interpretation for the second component, produced by processing the

“residual” sequence, this component is usually much less significant than the base component,

especially in regions where the motion field is nearly translational or the video frames are relatively

smooth.


3 Motion Modelling and Preliminary Observations

To realize a subband decomposition along the underlying motion trajectories, the motion in the

sequence must be accurately modelled. The LIMAT framework provides a flexible solution to this

challenging problem because advanced motion models may be employed without sacrificing the

invertibility of the transform. In this section we demonstrate this by comparing a deformable mesh

motion model with a typical block-based motion model. Our main objective here is to observe the

effect of the motion model on the energy compaction properties of the temporal transform. For this

reason we ignore the cost of coding the motion until Section 5, where it is shown to have relatively

little effect on the experimental results.

3.1 Block Motion Model

Block-based motion models are predominant in traditional motion-compensated video coders. A

typical block motion model involves partitioning the current frame into a regular grid of blocks,

where each block undergoes translation to a new location in the reference frame. The block motion

mapping is represented by the field of block displacement vectors.

The block motion field corresponds to a piecewise constant approximation to the underlying

motion field. In general, this is only an accurate representation of very smooth motion fields, or

those consisting of only simple translational motion. In particular, block motion models poorly

represent expansions and contractions in the true motion field, which commonly arise due to de-

formations of scene objects, camera translation and zooming. Furthermore, the introduction of

artificial discontinuities into the motion field can produce disturbing visual artefacts.

Our experiments use a hierarchical block-matching algorithm. In comparison to full-search

approaches, hierarchical motion estimation is more robust to image noise and illumination fluctu-

ations. Hierarchical motion estimation also tends to produce more uniform motion fields, which

facilitate efficient coding of the motion overhead.
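A two-level sketch of this strategy is given below, with integer-precision vectors only (the sub-pixel refinement described in Section 5 is omitted) and illustrative block sizes and search ranges:

```python
import numpy as np

def block_match(cur, ref, bs, rng, init=None):
    """Full-search block matching with an MSE metric (a sketch). `init`
    optionally supplies per-block starting vectors for hierarchical refinement."""
    H, W = cur.shape
    mv = np.zeros((H // bs, W // bs, 2), dtype=int)
    for by in range(H // bs):
        for bx in range(W // bs):
            blk = cur[by*bs:(by+1)*bs, bx*bs:(bx+1)*bs]
            oy, ox = init[by, bx] if init is not None else (0, 0)
            best = np.inf
            for dy in range(oy - rng, oy + rng + 1):
                for dx in range(ox - rng, ox + rng + 1):
                    y, x = by*bs + dy, bx*bs + dx
                    if 0 <= y <= H - bs and 0 <= x <= W - bs:
                        err = np.mean((blk - ref[y:y+bs, x:x+bs]) ** 2)
                        if err < best:
                            best, mv[by, bx] = err, (dy, dx)
    return mv

def hierarchical_block_match(cur, ref):
    """Coarse full search at half resolution, then a small local search at
    full resolution around each doubled coarse vector."""
    coarse = block_match(cur[::2, ::2], ref[::2, ::2], bs=8, rng=4)
    return block_match(cur, ref, bs=16, rng=1, init=2 * coarse)
```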


Table 2 shows reconstructed video PSNR at a bit-rate of 1 Mbps. Evidently, incorporating

motion compensation into the transform can significantly improve the PSNR. Consider, for example,

the Mobile and Calendar sequence, in which increases of 4.63 dB and 5.43 dB are observed for the

Haar and 5/3 wavelets, respectively. Similar gains of 4.84 dB and 6.65 dB are observed with

the Flower Garden sequence, but there is little improvement for the Table Tennis and Football

sequences. In these sequences, motion-compensating the Haar lifting steps actually reduces the

PSNR.

Although motion compensation generally increases energy compaction, it can also expand the

quantization error energy during synthesis. Most video sequences contain local expansions and

contractions, and therefore do not exhibit a one-to-one correspondence between pixel locations in

consecutive frames. This is essentially the source of the “disconnected” pixel problem observed in

block displacement approaches. It manifests itself in this scheme when displaced blocks overlap in

the reference frame, causing some pixels to be mapped to multiple locations in the current frame.

During temporal synthesis, the quantization error in these pixels will also be mapped to multiple

locations in the reconstructed frames, causing an overall increase in frame distortion.

Our current quantization and coding strategies treat the 3D transform as though it were fully

separable. As a result, PSNR performance is only improved if the increase in energy compaction

outweighs quantization error energy expansion during synthesis. The motion adaptive Haar trans-

form does not achieve this with the Football and Table Tennis sequences, because the motion is

difficult to exploit. In particular, the Football sequence contains rapid translations, deformations

and complex camera motion. The Table Tennis sequence consists of segments in which the camera

is either still or zooming, neither of which benefit substantially from the use of a block motion

model.

In the case of the 5/3 transform, inadequacies in the motion model are partially alleviated by

the bidirectional motion compensation associated with the first lifting step. The corresponding increase in energy compaction leads to improved compression performance for all test sequences. Nevertheless, further improvements should be possible if the expansion in quantization error energy were correctly compensated during quantization of the spatial subband samples.

Table 2: Reconstructed PSNR using block motion model, at 1 Mbps

                    3-stage Haar DWT          3-stage 5/3 DWT
Sequence   No DWT   None    Block   Gain      None    Block   Gain
Mobile     19.55    21.92   26.55   +4.63     22.37   27.80   +5.43
Flower     21.10    22.00   26.84   +4.84     22.18   28.83   +6.65
Table      28.50    31.77   31.40   −0.37     32.00   33.75   +1.75
Football   26.25    27.03   26.05   −0.98     27.00   28.26   +1.26

Figure 6: Demonstration of the visual effect of block-based motion compensation on the low-pass frames, using 3 stages of 5/3 temporal DWT. The ghosting artefacts observed in (a) are avoided in (b) by motion compensation.

In all cases, the use of motion-compensated lifting steps significantly improves the visual quality

of the low-pass temporal subband. As an example, we compare the visual effect of motion compen-

sation on the low-pass frames from the Flower Garden sequence. Figure 6a shows a low-pass frame

produced using three stages of the original 5/3 temporal transform. In Figure 6b, the substantial

ghosting artefacts are avoided through the use of motion compensation.


3.2 Deformable Mesh Motion Model

Unlike block-based models, deformable meshes can track complex motion, including local expansion

and contraction, while maintaining a continuous motion field. A regular deformable mesh is created

by partitioning the current frame into a regular grid of patches, usually either triangles [18] or

quadrilaterals [19],[20]. The mesh node-points move to form a warped mesh on the reference frame,

and the mapping is represented by the set of node displacement vectors. The motion vector at any

given location within a patch is approximated by linearly interpolating the motion vectors at the

patch vertices. This corresponds to an affine transformation for triangular meshes, and a bilinear transformation for quadrilateral meshes.
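For a triangular patch, this interpolation, together with the determinant of the induced affine map (used later in this section to weight the estimated distortion), can be sketched as follows; the helper functions are hypothetical, not the authors' code:

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of point p in triangle (a, b, c)."""
    T = np.column_stack([b - a, c - a])     # 2x2 matrix of edge vectors
    w1, w2 = np.linalg.solve(T, p - a)
    return np.array([1.0 - w1 - w2, w1, w2])

def mesh_motion_vector(p, nodes, node_mvs):
    """Affine (barycentric) interpolation of the three node motion vectors.
    `nodes` is (3, 2) triangle vertices; `node_mvs` is (3, 2) node vectors."""
    return barycentric(p, *nodes) @ node_mvs

def affine_determinant(nodes, node_mvs):
    """Determinant of the affine map taking the source triangle to its warped
    position: > 1 indicates local expansion, < 1 local contraction."""
    src = np.column_stack([nodes[1] - nodes[0], nodes[2] - nodes[0]])
    dst_nodes = nodes + node_mvs
    dst = np.column_stack([dst_nodes[1] - dst_nodes[0],
                           dst_nodes[2] - dst_nodes[0]])
    return np.linalg.det(dst @ np.linalg.inv(src))

# The distortion weight described below would then be max(d, 1/d),
# with d = abs(affine_determinant(nodes, node_mvs)).
```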

Deformable meshes yield motion fields that are piecewise smooth and continuous at patch

boundaries. These motion fields can provide a much better representation of the underlying mo-

tion field than discontinuous block motion fields. Determination of a globally optimum set of

node vectors is not usually possible within reasonable computational constraints. Local searches,

or gradient-based approximations are commonly used instead, which typically find only a local

minimum of an appropriate objective function, such as the energy of the displaced frame differ-

ence. Nevertheless, deformable meshes have been found to offer superior motion compensation in

comparison to block-based models.

We incorporate a triangular deformable mesh model into the LIMAT framework using a hexago-

nal refinement estimation algorithm similar to that proposed in [21]. As mentioned, local expansions

and contractions in the motion field can cause an increase in reconstruction error energy during

synthesis. With a deformable mesh, the expansion in quantization error energy is directly related

to expansion in the mesh itself. The expansion of a patch in a triangular mesh is given by the

determinant of the corresponding affine transform. To discourage excessive error expansion, we

weight the estimated distortion within each patch by the determinant of its affine transform or

the reciprocal of the determinant, whichever is larger. This provides a significant increase in compression performance. Further improvement could be obtained by directly adapting the spatial quantization and coding of the temporal subbands, according to the affine determinants.

Table 3: Reconstructed PSNR using deformable mesh motion model, at 1 Mbps

                    3-stage Haar DWT                   3-stage 5/3 DWT
Sequence   None    Mesh    Gain    cf. Block   None    Mesh    Gain    cf. Block
Mobile     21.92   27.07   +5.15   +0.52       22.37   28.09   +5.72   +0.29
Flower     22.00   27.90   +5.90   +1.06       22.18   29.40   +7.22   +0.57
Table      31.77   32.79   +1.02   +1.39       32.00   34.07   +2.07   +0.32
Football   27.03   28.24   +1.21   +2.19       27.00   29.03   +2.03   +0.77

Table 3 shows the reconstructed video PSNR using the mesh motion model, with 3 stages

of temporal transform, and a bit-rate of 1 Mbps. The mesh model consistently outperforms the

block motion model, so that motion compensation of the lifting steps is now seen to improve the

compression performance for every sequence.

The deformable mesh serves as a good example of how superior motion modelling can lead

to improved video compression within the LIMAT framework. Of course, a variety of more so-

phisticated motion modelling techniques exist in the literature. For example, hierarchical meshes

can achieve superior motion compensation by allocating a denser distribution of motion vectors to

regions with more complex motion [22].

3.3 The Importance Of Motion Trajectories

In Section 2.3, we showed that motion-compensating the lifting steps is equivalent to applying

the temporal DWT along an underlying set of motion trajectories, so long as the motion model is

invertible. Referring to equations (1) and (2), we see that for each forward motion warping operator,

$W_{k_2\to k_1}$, the reverse warping operator, $W_{k_1\to k_2}$, is also required. In the results presented so far, only

one set of motion parameters is explicitly estimated for each such pair of warping operators. Where

the forward motion parameters are estimated, a set of reverse motion parameters is determined

using an inversion procedure, and vice-versa. Details of the inversion procedure are discussed in

Section 4.2. However, it is worth noting here that discontinuous motion models, such as block-based models, cannot be strictly inverted, so approximations must be employed. No significant

approximations are required to invert the mesh motion models.

It is tempting instead to independently optimize the parameters of each individual motion

mapping, with respect to a displaced frame difference measure. In this case, $W_{k_1\to k_2}$ and $W_{k_2\to k_1}$ are obtained through independent forward and backward motion estimation. The resulting motion mappings will not generally be inverses of one another. Even in the absence of modelling or estimation errors, discrepancies between $W_{k_1\to k_2}$ and $W^{-1}_{k_2\to k_1}$ (if it exists) can be expected in

regions of occlusion and uncovered background. In such regions the relationship between successive

frames cannot truly be described in terms of a set of motion trajectories. Nevertheless, if we

choose to use independently optimized motion mappings, abandoning motion trajectories, we find

in practice that compression performance suffers significantly.

To quantify this effect, we compare the reconstructed PSNR obtained using directly inverted

motion mappings, with that obtained by estimating every motion mapping independently. The

results are taken using 3 stages of temporal transform, at a bit-rate of 1 Mbps. According to

Table 4, the reconstructed PSNR is uniformly higher with inverted motion fields, by as much as

2.51 dB in one case. The presented results do not include the cost of motion information.¹ Note

that the improvement is also generally larger for the deformable mesh model, as compared to

the block model, which lacks a true inverse. These results reinforce our earlier analysis, which

provides a meaningful interpretation to the LIMAT framework only in the context of an invertible

motion model. Even if the motion trajectories assigned by the model do not perfectly describe the

underlying scene, at least we know that the wavelet filters are being applied along those trajectories.

When $W_{k_1\to k_2}$ and $W_{k_2\to k_1}$ are not inverses of one another, there is no clear way to understand the

behaviour of the transform, but our experimental observations suggest that enforcing the existence

¹ Including this cost would further penalize the performance achieved with independently estimated motion fields.

This is because there is no need to transmit motion parameters which are determined by inversion of another motion

field, but independently estimated motion fields must each be separately encoded and transmitted.


Table 4: Gain in reconstruction PSNR from using inverted motion fields instead of estimated motion fields, at 1 Mbps

Sequence   Haar, Block   Haar, Mesh   5/3, Block   5/3, Mesh
Mobile     +0.29         +0.21        +0.13        +0.09
Flower     +0.21         +0.46        +0.07        +0.22
Table      +0.51         +1.75        +0.15        +0.54
Football   +0.74         +2.51        +0.31        +0.96

of motion trajectories is important for compression performance.

4 Motion Representation

In this section, we propose a representation for the motion information that allows us to avoid

coding one motion field for every motion warping operation. Instead, both encoder and decoder

derive all of the necessary motion fields by applying appropriate transformations to a much smaller

set of distinct motion mappings.

Recall that our complete temporal transform is constructed by iterative application of a single

stage. Each stage produces a low-pass and a high-pass temporal subband, using motion compen-

sated lifting steps, as described previously. Each filter tap of each lifting step in each stage involves

its own motion warping operator. The Haar transform involves two warping operators per pair of

input frames, at each stage. With multiple stages this total approaches two warping operators per

original video frame. This is doubled in the case of the 5/3 wavelet kernel. If we were to explicitly

code a motion field for every warping operator, the resulting motion overhead would reduce the

performance of the coder at low bit-rates. Instead, the motion representation proposed here allows

us to code only one motion mapping per original frame, plus occasional refinement information,

regardless of the selected wavelet kernel. The resulting motion overhead is comparable to that of

motion-compensated predictive coders.
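To see why the total approaches two warping operators per original frame for the Haar kernel: with $N$ original frames, stage $j$ of the transform processes $N/2^{j}$ frame pairs, each requiring two warping operators, so over $J$ stages the total is

$$\sum_{j=1}^{J} 2\cdot\frac{N}{2^{j}} = 2N\left(1 - 2^{-J}\right) \longrightarrow 2N \quad \text{as } J \to \infty,$$

that is, two operators per original frame; the 5/3 kernel, with four warping operators per pair, doubles this count.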

We reduce the number of distinct motion mappings in two ways. The first arises directly from

observations in the previous section. Specifically, if we find each forward-backward pair of motion


mappings by estimating one and applying an inversion procedure to get the other, we actually

obtain higher coding efficiency than if each were estimated independently. Clearly then, one set of

motion parameters can represent both motion mappings.

Secondly, we exploit the relationship between the motion mappings at each decomposition stage.

Note that mappings at each stage of the transform essentially represent the same scene motion

trajectories, at different temporal resolutions. This allows us to represent a motion mapping at

any particular stage as being a composition of mappings from other stages. Section 4.2 discusses

inversion and composition of motion mappings for the block-based and deformable mesh motion

models.

4.1 Examples with the Haar and 5/3 Transforms

Figure 7 shows the motion representation associated with two stages of the proposed temporal

transform, using the 5/3 wavelet kernel. The mappings required to perform the lifting steps are

shown as arrows. At the $j$th transform stage, the $i$th forward and backward mappings are denoted $F^j_i$ and $B^j_i$, respectively. In this paper we define a forward mapping as one which serves to warp

an earlier frame onto a later frame.

In the proposed scheme, motion mappings for each stage of the decomposition are estimated

using the corresponding original video frames, rather than the frames which are actually input

to that stage. This is more conducive to accurate estimation since the input frames may contain

ghosting artefacts caused by inadequate motion modelling in previous stages. This strategy also

allows us to determine motion mappings for any stage of the transform before knowing the input

frames to that stage, which turns out to be crucial to the representation developed here.

The entire set of motion fields in Figure 7 is represented by only $F^2_1$ and $B^1_2$, which are shown in black. Only these mappings need to be coded and delivered to the decoder. Inverting $F^2_1$ produces the backward mapping $B^2_1$. The forward mapping $F^1_1$ is recovered by compositing the second-stage forward mapping $F^2_1$ with the first-stage backward mapping $B^1_2$. The remaining mappings, $B^1_1$ and $F^1_2$, are recovered by inverting $F^1_1$ and $B^1_2$, respectively.

Figure 7: Motion representation for two stages of 5/3 temporal transform. Grey mappings are inferred from the coded mappings, shown here in black.

The representation for the Haar wavelet is a simplification of the above. In this case, mappings $F^1_2$ and $B^1_2$ are not required, so we simply code $F^1_1$ and $F^2_1$, and recover the corresponding backward motion fields by inversion.

This motion representation can be iterated to any number of transform stages, and the total

number of coded mappings is never more than one per original frame. Furthermore, because the

mapping between any two frames may be recovered through a series of compositions and inversions

of the coded mapping sequences, $F^2_i$ and $B^1_{2i}$, it follows that this motion representation is sufficient

for use with any wavelet kernel.

Finally, note that the motion representation scales with the temporal resolution at which the

video sequence is to be reconstructed. This is because the motion information required at any stage

in the transform involves no mappings from higher resolution stages.


4.2 Motion Field Inversion and Composition Strategies

We begin by presenting methods for inverting block-based and triangular mesh motion fields. A

true inverse mapping only exists for motion mappings that are continuous and one-to-one, so block-

based motion mappings are not strictly invertible. For block motion fields, we adopt an ad-hoc

approach that involves simply reversing each motion vector. This is a good approximation when

the motion is smooth and relatively small.

To invert the triangular mesh motion field, the affine transformation associated with each trian-

gular patch must be inverted. Performing the motion warping operation with an inverted mesh is

somewhat more complicated than with the original mesh, since the triangular patches do not form a

regular mesh in the warped coordinate system. This makes it more difficult to locate the triangular

patch associated with each pixel to be mapped, although this is not the dominant computational

cost for the process.

We allow the mesh to detach from the frame boundaries, employing a zero-order hold boundary

extension policy. This is important for accurate motion modelling near frame boundaries, partic-

ularly in the presence of global motion. However, the inverse only exists over the extent of the

deformed mesh, which may exclude some regions near the frame boundary. To approximate the in-

verse in these regions, we extrapolate the mesh using a point-symmetric extension technique which

preserves affine motion flows.

The inverse would be undefined in certain regions if the mesh were allowed to fold over itself,

since the one-to-one nature of the mapping would then be lost. However, this cannot happen in

practice, because our motion estimation algorithm already prevents excessive expansion or con-

traction in the motion field. As noted in Section 3.2, a weighted minimization objective function

is employed so as to avoid excessive expansion of quantization errors in the reconstructed video

sequence. Of course, this may prevent the mesh model from accurately tracking true scene mo-

tion in some regions of the video sequence. However, as already noted, the weighted minimization


objective results in motion fields which yield superior compression performance.

We now turn our attention to compositing motion mappings. One possible way to implement

a composited mapping is by applying each individual mapping in turn. However, this approach

suffers from the accumulation of spatial aliasing and other distortions that typically accompany

each warping step. To avoid these problems, we construct a single forward mapping, by projecting

the motion vectors through the various component mappings.

In the case of the block motion model, we determine each pixel’s trajectory through the set of

component mappings. The composite motion vector for a particular block is then approximated by

the average trajectory for all pixels within the block.
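Both block-field operations, the ad-hoc inversion described above and composition by projecting per-pixel trajectories, can be sketched as follows; the dense NumPy index bookkeeping is our illustrative choice:

```python
import numpy as np

def invert_block_field(mv):
    """Ad-hoc inversion: reverse each vector. A good approximation when the
    motion is smooth and relatively small."""
    return -mv

def compose_block_fields(fields, bs=16):
    """Compose a sequence of block motion fields (each of shape (by, bx, 2),
    in pixel units) by projecting every pixel through each component mapping,
    then averaging the net displacement over each block (a sketch)."""
    by, bx = fields[0].shape[:2]
    H, W = by * bs, bx * bs
    pos = np.indices((H, W)).astype(float)    # each pixel's current position
    total = np.zeros_like(pos)                # accumulated displacement
    for mv in fields:
        # Find each pixel's current block, then follow that block's vector.
        iy = np.clip((pos[0] // bs).astype(int), 0, by - 1)
        ix = np.clip((pos[1] // bs).astype(int), 0, bx - 1)
        step = np.stack([mv[iy, ix, 0], mv[iy, ix, 1]])
        pos += step
        total += step
    # Composite vector per block = average trajectory of its pixels.
    return total.reshape(2, by, bs, bx, bs).mean(axis=(2, 4)).transpose(1, 2, 0)
```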

We composite triangular mesh mappings by warping the mesh node points through each component mapping. The sequence of component mappings is therefore approximated by a new

triangular mesh, which serves to interpolate the projected trajectories of its nodes.

To avoid possible loss in performance due to the approximations made by our composition

strategy, our coder has the option of refining any motion mapping that is inferred by the compositing

process. Using the directly estimated field as a reference, if the composited field contains significant

error, a refinement field is also coded. Refinement fields are irrelevant for the Haar transform, which

involves no composited mappings. The 5/3 transform may require at most one refinement field for every

pair of mappings, at each stage in the transform. In any event, the refinement fields consist mostly

of zeros, and so are less costly to encode.

The decision of when to code refinement fields raises the classic problem of rate allocation

between motion parameters and subband coefficients. For the purpose of these experiments, we

circumvent this issue by coding all refinement fields. At very low bit-rates the motion cost becomes

more significant and superior results may be achieved by selectively coding the refinement fields.

It is worth noting that the proposed motion representation strategy is well aligned with efficient

methods for motion estimation. Wherever possible, an initial guess for the motion parameters


is found by inverting and compositing previously estimated motion fields, as appropriate. This

significantly reduces the computational load associated with motion estimation. In total, only one

full motion estimation operation is required per original frame.

To actually code the motion information, we apply the JPEG-LS algorithm [23] directly to the

arrays of horizontal and vertical motion vector components.

5 Experimental Results

We provide results for the block-based motion model and the triangular deformable mesh model,

using three levels of temporal transform. We use the first 96 frames of the standard test sequences,

Mobile and Calendar, Table Tennis, Flower Garden and Football. The original full colour sequences

have a frame-rate of 30 fps and a spatial resolution of 352 × 240. Chrominance components are

subsampled by 2 in each dimension.

The block-based model is implemented using a hierarchical search algorithm. A coarse motion

field is first estimated by full-search block matching at half the spatial resolution. The search range

is ±8 pixels per frame of temporal displacement, except for the Football sequence,

where a search range of ±16 is used. The coarse motion field is successively refined up to 1/4

pixel accuracy, using cubic spline interpolation of the original frames. The block size is 16 × 16

pixels, giving 330 motion vectors per field. The estimation algorithm uses the mean-squared error

distortion metric.

Our triangular mesh is created with a node spacing of 16 pixels, by dividing 16 × 16 blocks

along their diagonals. Motion vectors are estimated for each node in the mesh, resulting in 368

vectors per motion mapping. A coarse motion field is estimated by full-search block matching at

half the spatial resolution, with overlapping blocks centred at each node-point. We use the same

search ranges and motion vector precision as with the block model. As discussed in Section 3.2,

refinement of the motion field is performed using a hexagonal refinement algorithm, where the


Table 5: PSNR performance of LIMAT at 1 Mbps

Sequence   No DWT   Haar    Haar, Block   Haar, Mesh   5/3     5/3, Block   5/3, Mesh
Mobile     19.55    +2.37   +6.86         +7.32        +2.82   +7.98        +8.16
Flower     21.10    +0.90   +5.40         +6.51        +1.08   +7.19        +7.59
Table      28.50    +3.27   +2.60         +3.98        +3.50   +4.77        +4.97
Football   26.25    +0.78   −0.59         +1.54        +0.75   +1.37        +2.08

estimated distortion is weighted by the determinant of the relevant affine transformation.

The temporal subband frames are subjected to spatial wavelet decomposition and embedded

block coding of the quantized wavelet coefficients, using an implementation of the JPEG2000 image

compression standard. Although results are given only for luminance components, the chrominance

components are also coded and are assigned equal importance (energy weight) during rate alloca-

tion. A constant number of bits are allocated to each group of 8 original frames.

Tables 5 and 6 give the reconstructed luminance PSNR at bit-rates of 1 Mbps and 500 kbps,

respectively. These results include the cost of coding the motion information. To emphasize

the scalability of the compression system, the test bit-rates were obtained by simply discarding

unwanted bits from an original bit-stream compressed to a much higher bit-rate.

Evidently, including the cost of coding the motion information does not affect the conclusions

drawn so far. For example, the 5/3 wavelet still consistently outperforms the Haar, although

the improvement is slightly diminished at the lower bit-rate, where the added cost of coding the

refinement fields comes into effect. Selective coding of the refinement fields is likely to be beneficial

here. The mesh motion model is still largely superior to the block model, although the difference

is less significant at the lower bit-rate. This is because the block motion fields tend to be more

spatially coherent and hence code more efficiently. We expect that improved regularization of the

mesh, especially around frame boundaries, should close the gap between block and mesh motion

coding efficiencies. However, these matters are not central to the conclusions of the present paper.

Our compression results compare favourably with others reported in the literature. Table 7

compares LIMAT with the Motion-Compensated 3D Subband Coder (MC-3DSBC) proposed in [7],


Table 6: PSNR performance of LIMAT at 500 kbps

Sequence   No DWT   Haar    Haar, Block   Haar, Mesh   5/3     5/3, Block   5/3, Mesh
Mobile     17.97    +2.13   +5.22         +5.47        +2.44   +5.89        +5.85
Flower     18.93    +0.98   +4.56         +5.18        +1.15   +5.71        +5.69
Table      26.02    +2.98   +2.35         +3.42        +3.37   +4.18        +4.20
Football   23.87    +0.71   −0.71         +1.34        +0.67   +0.93        +1.45

Table 7: PSNR performance of LIMAT, compared with MC-3DSBC, at 1.2 Mbps

Sequence   MC-3DSBC   LIMAT
Mobile     27.01      +1.77
Flower     27.55      +2.17
Table      33.78      +0.57
Football   28.02      +1.11

at 1.2 Mbps. MC-3DSBC uses the block-displacement approach, employing the Haar wavelet and

a hierarchical block motion model. Evidently, the LIMAT framework, employing the 5/3 wavelet

transform and the deformable mesh motion model, outperforms MC-3DSBC in each sequence.

The proposed coder also significantly outperforms that proposed in [4], which employs a frame-

warping approach in the temporal transform. For example, using 120 frames of the Flower Garden

sequence, a luminance PSNR of 23.3 dB at 1 Mbps was reported in [4]. For the same test scenario,

the proposed 5/3 transform with deformable mesh motion modelling achieves 28.6 dB PSNR.

To appreciate how the LIMAT framework adapts to scene motion, consider the frame-by-frame

PSNR results shown in Figure 8, for the Table Tennis sequence. The reconstructed bit-rate is

1 Mbps, and we use the 5/3 wavelet with deformable mesh motion modelling. The camera is

stationary in the first and last parts of the sequence; the only motion is that of the rapidly moving

table-tennis ball, so motion-compensating the transform provides little added benefit. In the middle

of the sequence the camera is zooming. This motion is captured by the mesh model, providing 2-3

dB gain in PSNR. Improved results would be possible by detecting the scene change at the 68th

frame, and modifying the wavelet transform accordingly. Note that in regions of significant motion,

the temporal transform by itself does nothing to improve the coding efficiency, compared to coding

the original frames directly.


[Figure 8 plot: frame-by-frame PSNR (dB), roughly 26 to 40 dB, versus frame number (0 to 100), with curves labelled '5/3, Mesh', '5/3, No motion', and 'No DWT'.]

Figure 8: Reconstructed PSNR at 1 Mbps for the Table Tennis sequence.

To investigate the scalability of the proposed transform, the first three curves plotted in Figure

9 reveal rate-distortion performance for full, half and quarter frame-rate reconstruction of the

Flower Garden sequence. Observe that the rate-distortion behaviour is the same for

each temporal resolution. These curves are generated from a single highly scalable compressed

bit-stream, discarding unwanted bits and temporal resolutions, as appropriate. As expected, for

a given bit-rate, we could choose to decode the video at a reduced frame-rate, and achieve higher

PSNR for each decoded frame. In this way one could exchange smooth rendition of the scene

motion for higher individual frame quality.

An important consideration when comparing reduced frame-rate video sequences is the selected

reference video sequence. Simply subsampling the original video does not necessarily represent the

most appropriate reduced temporal resolution reference, particularly in the presence of noise. For

the results presented in Figure 9, we use the unquantized reduced temporal resolution sequences

as our references. These are the result of motion-compensated temporal low-pass filtering, which

can be considered a valid method for downsampling video.


As an alternative, we could take every second or fourth original video frame as our respective

references for the half and quarter resolution video sequences. In this case, somewhat higher PSNR

can be obtained by using a different temporal transform, in which only the first (prediction) lifting

step shown in equation (2) is used. The equivalent low- and high-pass analysis filters have one and

three taps, respectively. Accordingly, we may refer to this as the 1/3 transform. Unlike the Haar

and 5/3 wavelets, the low-pass subband frames are now obtained without any filtering at all, so

they do not exhibit ghosting artefacts if the scene motion is poorly modelled. In fact, the reduced

resolution video sequences obtained by discarding temporal synthesis stages from the 1/3 transform

correspond directly to subsampled frames from the original sequence.
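
The following minimal Python sketch (our illustration, not the authors' code; `warp` is a toy circular shift standing in for any motion-compensated mapping, and all names are our own) makes the predict-only structure explicit:

```python
import numpy as np

def warp(frame, shift):
    # Toy stand-in for any motion-compensated mapping (block, mesh, ...);
    # a circular shift is used purely so the example runs.
    return np.roll(frame, shift, axis=-1)

def analysis_1_3(frames, motion):
    """Predict-only lifting: each odd frame becomes a prediction residual,
    while the even frames pass through entirely unfiltered."""
    even, odd = frames[0::2], frames[1::2]
    even_next = even[1:] + even[-1:]  # symmetric extension at the boundary
    high = [o - 0.5 * (warp(e0, m) + warp(e1, -m))
            for o, e0, e1, m in zip(odd, even, even_next, motion)]
    low = list(even)  # no update step, hence no filtering and no ghosting
    return low, high

frames = [np.random.rand(64) for _ in range(8)]
low, high = analysis_1_3(frames, motion=[1, 1, 1, 1])
# Half frame-rate reconstruction is exactly the subsampled original video.
assert all(np.array_equal(l, f) for l, f in zip(low, frames[0::2]))
```

The three-tap high-pass filter arises from the odd frame plus its two warped even neighbours, while the low-pass filter reduces to a single tap (the identity), matching the filter lengths noted above.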

Using the 1/3 transform we find that compression performance at full frame-rate suffers by 1

to 2 dB, relative to that obtained using the full 5/3 transform. However, for the 1/3 transform

it is meaningful to plot the half resolution rate-distortion performance using the original frames

as our reference. The behaviour is illustrated by the dotted curve in Figure 9. Interestingly, the

figure reveals that higher individual frame PSNR values can sometimes be achieved by decoding the

5/3-coded sequence at full resolution, than by decoding the 1/3-coded sequence at half resolution.

6 Conclusions

We have proposed a new framework for the construction of invertible motion adaptive temporal

transforms. The LIMAT framework is based on the lifting representation of the temporal DWT,

with motion-compensated lifting steps. Invertibility is an inherent property of the lifting structure,

irrespective of the manner in which we model or compensate for motion.
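
As a minimal demonstration (again our own sketch, with a deliberately crude stand-in warp; none of these names come from the paper), the motion-compensated Haar lifting pair below reconstructs its input exactly, whatever the warp operator does:

```python
import numpy as np

def warp(frame, shift):
    # Arbitrary stand-in for the motion-compensated mapping; perfect
    # reconstruction below does not depend on this being accurate,
    # smooth, or even invertible.
    return np.roll(frame, shift, axis=-1)

def haar_analysis(even, odd, m):
    high = odd - warp(even, m)         # motion-compensated predict step
    low = even + 0.5 * warp(high, -m)  # motion-compensated update step
    return low, high

def haar_synthesis(low, high, m):
    even = low - 0.5 * warp(high, -m)  # subtract exactly what was added
    odd = high + warp(even, m)         # then undo the prediction
    return even, odd

e, o = np.random.rand(32), np.random.rand(32)
low, high = haar_analysis(e, o, m=5)
e2, o2 = haar_synthesis(low, high, m=5)
assert np.allclose(e, e2) and np.allclose(o, o2)  # invertible by construction
```

Synthesis simply subtracts whatever analysis added, step by step in reverse order, so no property of the warp is ever needed for inversion.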

Incorporation of sophisticated motion models allows the transform to adapt to complex motion.

We demonstrate this with a deformable mesh motion model. Deformable meshes can improve mo-

tion compensation by tracking expansions and contractions, while maintaining a continuous motion

field. More importantly, only continuous motion mappings allow us to understand the proposed


[Figure 9 plot: PSNR (dB), roughly 15 to 50 dB, versus bit-rate (50 to 3200 kbps), with curves labelled '5/3 Full Res.', '5/3 Half Res.', '5/3 Quarter Res.', and '1/3 Half Res.'.]

Figure 9: Rate-distortion curves for the Flower Garden sequence.

transform as truly applying the temporal DWT along a set of motion trajectories. Experimental

results reveal that this is particularly desirable for compression performance.

The LIMAT framework accommodates any temporal wavelet kernel. We observe consistently

superior performance with the 5/3 wavelet as compared to the Haar transform. This differs from

evidence reported in the context of block-based and frame-warping approaches.

We also provide a compact representation for the motion parameters, which reduces the number

of distinct motion mappings to the point where the motion overhead is comparable to that of

motion-compensated predictive coders.

References

[1] G. Karlsson and M. Vetterli, "Three-dimensional subband coding of video," Proc. Int. Conf. Acoust. Speech and Sig. Proc., vol. 2, pp. 1100–1103, Apr 1988.

[2] B.J. Kim and W.A. Pearlman, "An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT)," Proc. IEEE Data Compression Conf. (Snowbird), pp. 251–260, Mar 1997.

[3] A. Said and W. Pearlman, "A new, fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circ. Syst. for Video Tech., vol. 6, pp. 243–250, June 1996.

[4] D.S. Taubman and A. Zakhor, "Multi-rate 3-D subband coding of video," IEEE Trans. Image Proc., vol. 3, no. 5, pp. 572–588, September 1994.

[5] J. Tham, S. Ranganath, and A. Kassim, "Highly scalable wavelet-based video codec for very low bit-rate environment," IEEE Journal on Selected Areas in Comm., vol. 16, pp. 12–27, Jan 1998.

[6] J. Ohm, "Three dimensional subband coding with motion compensation," IEEE Trans. Image Proc., vol. 3, pp. 559–571, Sep 1994.

[7] S. Choi and J. Woods, "Motion compensated 3-D subband coding of video," IEEE Trans. Image Proc., vol. 8, pp. 155–167, Feb 1999.

[8] A. Secker and D. Taubman, "Motion-compensated highly scalable video compression using an adaptive 3D wavelet transform based on lifting," Proc. IEEE Int. Conf. Image Proc., pp. 1029–1032, Oct 2001.

[9] A. Secker and D. Taubman, "Highly scalable video compression using a lifting-based 3D wavelet transform with deformable mesh motion compensation," Proc. IEEE Int. Conf. Image Proc., pp. 749–752, Sep 2002.

[10] B. Pesquet-Popescu and V. Bottreau, "Three-dimensional lifting schemes for motion compensated video compression," Proc. Int. Conf. Acoust. Speech and Sig. Proc., pp. 1793–1796, May 2001.

[11] W. Sweldens, "The lifting scheme: A custom-design construction of biorthogonal wavelets," Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186–200, April 1996.

[12] I. Daubechies and W. Sweldens, "Factoring wavelet and subband transforms into lifting steps," Technical report, Bell Laboratories, Lucent Technologies, 1996.

[13] D.S. Taubman and M.W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice, Kluwer Academic Publishers, Boston, 2002.

[14] R. Calderbank, I. Daubechies, W. Sweldens, and B. Yeo, "Wavelet transforms that map integers to integers," Applied and Computational Harmonic Analysis, vol. 5, no. 3, pp. 332–369, July 1998.

[15] V.K. Heer and H.-E. Reinfelder, "A comparison of reversible methods for data compression," Proc. SPIE Conference 'Medical Imaging IV', vol. 1233, pp. 354–365, 1990.

[16] D. Le Gall and A. Tabatabai, "Sub-band coding of digital images using symmetric short kernel filters and arithmetic coding techniques," Proc. Int. Conf. Acoust. Speech and Sig. Proc., vol. 2, pp. 761–764, April 1988.

[17] J. Ohm, "Advanced packet-video coding based on layered VQ and SBC techniques," IEEE Trans. Circ. Syst. for Video Tech., vol. 3, no. 3, pp. 208–221, June 1993.

[18] H. Brusewitz, "Motion compensation with triangles," Proc. 3rd Int. Conf. 64-kbit Coding of Moving Video, Sep 1990.

[19] G.J. Sullivan and R.L. Baker, "Motion compensation for video compression using control grid interpolation," Proc. Int. Conf. Acoust. Speech and Sig. Proc., pp. 2713–2716, May 1991.

[20] V. Seferidis and M. Ghanbari, "General approach to block-matching motion estimation," Optical Engineering, vol. 32, no. 7, pp. 1464–1474, Jul 1993.

[21] Y. Nakaya and H. Harashima, "Motion compensation based on spatial transformations," IEEE Trans. Circ. Syst. for Video Tech., vol. 4, pp. 339–367, Jun 1994.

[22] C.L. Huang and C.Y. Hsu, "A new motion compensation method for image sequence coding using hierarchical grid interpolation," IEEE Trans. Circ. Syst. for Video Tech., vol. 4, no. 1, pp. 42–52, Feb 1994.

[23] M.J. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: A low complexity, context-based, lossless image compression algorithm," Proc. IEEE Data Compression Conf. (Snowbird), pp. 140–149, April 1996.
