
1238 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 9, SEPTEMBER 2007

State-of-the-Art and Trends in Scalable Video Compression With Wavelet-Based Approaches

Nicola Adami, Member, IEEE, Alberto Signoroni, Member, IEEE, and Riccardo Leonardi, Member, IEEE

(Invited Paper)

Abstract—Scalable Video Coding (SVC) differs from traditional single-point approaches mainly because it allows several working points, corresponding to different qualities, picture sizes and frame rates, to be encoded in a unique bit stream. This work describes the current state-of-the-art in SVC, focusing on wavelet-based motion-compensated approaches (WSVC). It reviews the individual components that have been designed to address the problem over the years and how such components are typically combined to achieve meaningful WSVC architectures. Coding schemes which mainly differ in the space-time order in which the wavelet transforms operate are compared here, discussing strengths and weaknesses of the resulting implementations. An evaluation of the achievable coding performance is provided, considering the reference architectures studied and developed by ISO/MPEG in its exploration on WSVC. The paper also attempts to draw a list of major differences between wavelet-based solutions and the SVC standard jointly targeted by ITU and ISO/MPEG. A major emphasis is devoted to a promising WSVC solution, named STP-tool, which presents architectural similarities with the SVC standard. The paper ends by drawing some evolution trends for WSVC systems and giving insights on video coding applications which could benefit from a wavelet-based approach.

Index Terms—Entropy coding, motion compensated temporal filtering (MCTF), MPEG, Scalable Video Coding (SVC), spatio–temporal multiresolution representations, video coding architectures, video quality assessment, wavelets.

I. INTRODUCTION

TRADITIONAL single operating point video coding systems can be surpassed by using Scalable Video Coding

(SVC) architectures where, from a single bit stream, a number of decodable streams can be extracted, corresponding to various operating points in terms of spatial resolution, temporal frame rate or reconstruction accuracy. Different scalability features can coexist in a single coded video bit stream with coding performance approaching state-of-the-art single-point coding techniques. This has increasingly been a trend in recent years [1], and it has become a reality thanks to the development of

Manuscript received October 9, 2006; revised June 22, 2007. This paper was recommended by Guest Editor T. Wiegand.

The authors are with the Department of Electronics for Automation, Faculty of Engineering, University of Brescia, I-25123 Brescia, Italy (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2007.906828

SVC systems derived either from hybrid schemes [2] (used in all MPEG-x or H.26x video coding standards) or from spatio–temporal wavelet technologies [3]. Much of this development, and of the related debates regarding SVC requirements [4], applications and solutions, has been carried out by people participating in the ISO/MPEG standardization.

This paper offers an overview of existing SVC architectures which are based on multiresolution spatio–temporal representations of video sequences. In particular, wavelet-based SVC (WSVC) tools and systems will be classified and analyzed with the following objectives: 1) to give a quite comprehensive tutorial reference to those interested in the field; 2) to analyze strong and weak points of the main WSVC tools and architectures; 3) to synthesize and account for the exploration activities made by the MPEG video group on WSVC and to describe the issued reference platform; 4) to compare such a platform with the ongoing SVC standard (jointly targeted by ITU and ISO/MPEG) in qualitative and quantitative terms; and 5) to discuss promising evolution paths and target applications for WSVC. For this purpose, the presentation has been structured as follows. In Section II, WSVC fundamentals as well as basic and advanced tools which enable temporal, spatial and quality scalability are presented and discussed. Section III shows how such components are typically combined to achieve meaningful WSVC architectures, which typically differ in the space-time order in which the wavelet transform operates, discussing strengths and weaknesses of the resulting implementations. In the same section, an emphasis is placed on a promising architecture which presents some similarities to the SVC standard. Subsequently, Section IV explains the wavelet video reference architecture(s) studied by ISO/MPEG in its exploration on wavelet video compression. The paper attempts as well (Section V) to draw a list of major differences between such architecture(s) and tools with respect to the SVC reference model, providing a performance comparison in terms of coding efficiency and giving a critical analysis of the presented results. In Section VI, further investigations on WSVC critical aspects are presented. A discussion on perspectives and applications that appear better suited to a wavelet-based approach is carried out in Section VII. Finally, conclusions are drawn. Note that, for consistency with the other papers of this special issue, the acronym SVC will be used without distinction to indicate generic SVC systems and the SVC standard. The appropriate meaning will emerge from the context.



Fig. 1. SVC rationale. A unique bit stream is produced by the coder. Decodable substreams can be obtained with simple extractions.

II. BASIC PRINCIPLES OF WAVELET-BASED APPROACHES TO SVC

A. WSVC Fundamentals

Fig. 1 shows a typical SVC system which refers to the coding of a video signal at an original CIF resolution and a frame rate of 30 fps. In the example, the highest operating point and decoding quality level correspond to a bit rate of 2 Mbps associated with the data at the original spatio–temporal resolution. For a scaled decoding, in terms of spatial, temporal and/or quality resolution, the decoder only works on a portion of the originally coded bit stream, according to the specification of a desired working point. Such a stream portion is extracted from the originally coded stream by a functional block called extractor. As shown in Fig. 1, the extractor is arranged between the coder and the decoder. In any practical application, it can be implemented as an independent block within any node of the distribution chain, or it can be included in the decoder. The extractor receives the information related to the desired working point (in the example of Fig. 1, a lower spatial resolution QCIF, a lower frame rate of 15 fps, and a lower bit rate, i.e., quality, of 150 kbps) and extracts a decodable bit stream matching, or almost matching, the working point specification. One of the main differences between an SVC system and a transcoding solution is the low complexity of the extractor, which does not require coding/decoding operations and typically consists of simple parsing operations on the coded bit stream. For practical purposes, let us consider the following example, where a 4CIF video is coded allowing the extraction of three different spatial scalability layers and four embedded temporal resolutions. In this case, a possible structure of the bit stream generated by an SVC system is shown in Fig. 2. As indicated, the usual partition into independently coded groups of pictures (GOP) can be adopted for separating the various video frames. The GOP header contains the fundamental characteristics of the video and the pointers to the beginning of each independent GOP bit stream. In this example, each GOP stream contains three bit stream segments, one for each spatial level. Note that in an ideal scalable bit stream, except for the lowest working point, all information is incremental. This means, for example, that in order to properly decode the higher resolution (e.g., 4CIF), one also needs to use the portions of the bit stream which represent the lower spatial resolutions (e.g., QCIF and CIF).

Given this bit stream organization, for each spatial resolution level a temporal segmentation of the bit stream can be further performed. Accordingly, the 7.5 fps portion contains data that allow the reconstruction of the lowest temporal resolution of the considered spatial resolution reference video sequence. Similarly, the 15, 30, and 60 fps parts refer to information used to refine the frame rate up to the allowed maximum. Since most successful video compression schemes require the use of motion compensation (MC), each temporal resolution level bit stream portion should include the relative motion information (usually consisting of coded motion vector fields), as shown in Fig. 2. Further quality layers can be obtained, for example, by arranging the signal representation, for every temporal segment, in a real or virtual bit plane progressive fashion (see Section II-B.3). With such a bit stream, every allowed combination in terms of spatial, temporal and quality scalability can generally be decided and achieved according to the decoding capabilities. Fig. 3 provides a graphical representation of a target bit rate extraction of a CIF 15 fps coded sequence from the original 4CIF 60 fps bit stream. Note that the corresponding parts of the QCIF and CIF bit streams have been extracted to reconstruct the sequence at the desired working point.
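To make the extraction mechanism concrete, the following sketch models the bit stream of Fig. 2 as nested Python structures and implements a toy extractor. The container layout and all names (GOPStream, SpatialSegment, rate_fraction, ...) are illustrative assumptions rather than any actual VidWav or SVC syntax; the point is that extraction reduces to parsing and truncation, with no decoding.

```python
# Toy model of the scalable bit stream of Fig. 2: each GOP carries one
# segment per spatial layer; each segment is split into temporal portions
# (motion + embedded texture bytes). All field names are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class TemporalPortion:
    fps: float     # frame rate this portion refines up to (7.5, 15, 30, 60)
    motion: bytes  # coded motion vector fields for this temporal level
    texture: bytes # bit-plane progressive texture data (truncatable)

@dataclass
class SpatialSegment:
    resolution: str                  # e.g., "QCIF", "CIF", "4CIF"
    portions: List[TemporalPortion]  # ordered from lowest frame rate up

@dataclass
class GOPStream:
    segments: List[SpatialSegment]   # ordered from coarsest resolution up

def extract(gop: GOPStream, res: str, fps: float, rate_fraction: float) -> GOPStream:
    """Keep what the working point needs: all spatial layers up to `res`
    (the information is incremental), temporal portions up to `fps`, and
    a prefix of each embedded texture segment for quality scaling."""
    kept_segments = []
    for seg in gop.segments:
        kept = [TemporalPortion(p.fps, p.motion,
                                p.texture[:int(len(p.texture) * rate_fraction)])
                for p in seg.portions if p.fps <= fps]
        kept_segments.append(SpatialSegment(seg.resolution, kept))
        if seg.resolution == res:  # stop once the target resolution is reached
            break
    return GOPStream(kept_segments)
```

Because the texture data are assumed to be embedded (Section II-B.3), cutting each texture segment to a byte budget still leaves a decodable stream.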

A suitable property for concrete applications (e.g., for content distribution on heterogeneous networks) is the possibility to adapt the bit stream according to a multiple (chained) extraction path, from the highest towards lower operating points; this has been called bit stream multiple adaptation [5]. Multiple adaptations should of course be allowed along every advisable path and, in general, without any a priori encoding settings. Multiple adaptation extraction differs from a single extraction only in one aspect: for multiple adaptations, the bit stream extraction information must be conserved (i.e., processed and reinserted, discarding what becomes useless) from the input bit stream. Clearly, there is no need to reinsert (and spend bits for) the extraction information if no further extraction is expected.

B. Tools Enabling Scalability

In the past two decades, there has been a proliferation of activities focused on the theory of, and solutions for, efficient (minimally redundant) multiresolution representations of signals. In general, a multiresolution signal decomposition (transform) inherently enables a corresponding low-to-high resolution scalability through the representation of the signal in the transformed domain. The tools that enable scalable multidimensional (space-time-quality) signal representation and coding are multidisciplinary. This fact is clearly depicted in Fig. 4, which gives a quick overview of these tools and of how they can be combined in order to achieve the desired scalability features: spatial, temporal, and quality (S, T, and Q). In the following, each scalability dimension, S, T, and Q, will be considered, and the most popular and effective solutions will be briefly described. In this context, Fig. 4 can be helpful in keeping the big picture when analyzing a specific tool. Due to lack of space to describe each tool in detail, supporting references will be provided for the interested reader.

1) Spatial Scalability: Tools producing a multiresolution representation of multidimensional signals can be of different types. They can be linear or nonlinear, separable or not, redundant or critically sampled.


Fig. 2. Example of a scalable bit stream structure.

Fig. 3. Example of a substream extraction: CIF at 15 fps is extracted from a 4CIF sequence at 60 fps.

Fig. 4. WSVC theoretical context.

Linear decompositions have been studied for a long time. The simplest mechanism to perform a two-resolution representation of a signal is based on what we here call inter-scale prediction (ISP): an original (full resolution) signal x can be seen as a coarse resolution signal a, obtained by a decimation of x (e.g., by filtering with a lowpass filter and downsampling by a factor of 2), added to a detail signal d̃ calculated as the difference between the original signal and an interpolated version of a (throughout this document, a tilde over a signal means that the detail is not critically sampled). The Laplacian pyramid introduced by Burt and Adelson [6] is an iterated version of such an ISP mechanism, resulting in a coarsest resolution signal and a set of details at decreasing levels of spatial resolution. Fig. 5 shows a Laplacian decomposition of a 1-D signal in n stages (from full resolution 0 to the coarsest one n), where the interpolation is implemented by upsampling by a factor of 2 and filtering. The discrete wavelet transform (DWT) is a linear operator which projects the original signal onto a set of multiresolution subspaces [7], allowing a critically sampled (orthogonal or biorthogonal) representation of the signal in the transformed domain and guaranteeing perfect reconstruction at synthesis. Similarly to the Laplacian pyramid, the DWT decomposition of x generates a coarse signal and a series of detail signals at various levels of spatial resolution. Using the terminology of multirate filter banks [8], the transformed-domain signals are also called subbands.


Fig. 5. n-level Laplacian pyramid analysis filter bank.
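A minimal numpy sketch of one ISP stage as just described, with a two-tap average standing in for the generic lowpass decimation filter and sample repetition standing in for the interpolator; iterating the analysis on successive coarse signals yields the Laplacian pyramid of Fig. 5.

```python
import numpy as np

def isp_analysis(x):
    """One ISP stage on a 1-D signal of even length: coarse signal a by
    lowpass filtering and downsampling, oversampled detail d as the
    difference between x and the interpolated coarse signal."""
    a = 0.5 * (x[0::2] + x[1::2])   # crude lowpass + downsample by 2
    d = x - np.repeat(a, 2)         # full-rate detail (not critically sampled)
    return a, d

def isp_synthesis(a, d):
    """Reconstruction is exact by construction, whatever filters are used."""
    return np.repeat(a, 2) + d

x = np.random.randn(16)
a, d = isp_analysis(x)
assert np.allclose(isp_synthesis(a, d), x)
```

The synthesis is exact regardless of the chosen decimation and interpolation filters, which is the structural advantage of the pyramid; the price is the oversampled detail, which has the full signal rate.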

Fig. 6. n-level (bi)orthogonal DWT analysis/synthesis implemented with a dyadic tree filter bank.

In fact, the DWT can be implemented by means of a two-channel filter bank, iterated on a dyadic tree path, as shown in Fig. 6 [9], [10]. Symbols h and g represent the lowpass and highpass analysis filters respectively, whereas h̃ and g̃ are the synthesis ones for an orthogonal or biorthogonal wavelet decomposition. Symmetric filter responses are preferred for visual data coding applications, due to their linear phase characteristics, and can be obtained only with biorthogonal decompositions (except for length-2 filters, i.e., the Haar basis). A popular filter pair for image coding purposes is the 9/7 one [10, p. 279], which gives an analysis filter with 9 taps and a synthesis filter with 7 taps.

For multidimensional signals, e.g., images, separate filtering on rows and columns is usually adopted to implement the so-called separable pyramidal and DWT decompositions. In the case of a separable 2-D DWT on images, this generates, at each level of decomposition, one coarse subband and three detail subbands, which we continue to indicate altogether as a single level of detail.
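With PyWavelets, for instance, one obtains exactly the subband structure just described: a separable 2-D DWT producing, at each level, one coarse subband and three directional details ('bior4.4' selects a 9/7 biorthogonal pair).

```python
import numpy as np
import pywt  # PyWavelets

img = np.random.rand(64, 64)

# Two-level separable 2-D DWT with 9/7 biorthogonal filters.
coeffs = pywt.wavedec2(img, 'bior4.4', level=2)
cA2 = coeffs[0]                       # coarse subband at 1/4 resolution
print('coarse:', cA2.shape)
for name, sb in zip(('horizontal', 'vertical', 'diagonal'), coeffs[1]):
    print('level-2 detail,', name, ':', sb.shape)

# The representation is critically sampled and perfectly reconstructible.
rec = pywt.waverec2(coeffs, 'bior4.4')
assert np.allclose(rec, img)
```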

Sweldens [11] introduced the lifting scheme as an alternative spatial domain processing to construct multiresolution signal representations. The lifting structure is depicted in Fig. 7 and reveals a quite intuitive spatial domain processing, capable of generating a critically sampled representation of the original signal x. According to this scheme, the signal is split into two polyphase components, the even and the odd samples x_e[n] = x[2n] and x_o[n] = x[2n+1]. As the two components (each one at half the original resolution) are correlated, a prediction can be performed between the two: odd samples can be predicted with an operator P, which uses a neighborhood of even samples, producing the residue

h[n] = x_o[n] - P(x_e)[n].    (1)

The transform x -> (x_e, h) is not satisfactory from a multiresolution representation point of view, as the subsampled signal x_e could contain a lot of aliased components. As such, it should be updated to a filtered version l. The update step

l[n] = x_e[n] + U(h)[n]    (2)

Fig. 7. Dyadic decomposition by lifting steps: split, prediction P, and update U.

can be constrained to do this job, and it is structurally invertible, as are the preceding stages. Perfect reconstruction is simply guaranteed by

x_e[n] = l[n] - U(h)[n]    (3)

x_o[n] = h[n] + P(x_e)[n].    (4)

Daubechies and Sweldens [12] proved that every DWT can be factorized into a chain of lifting steps.
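As an example of such a factorization, the 5/3 (LeGall) wavelet needs exactly one predict and one update step; the sketch below follows (1)-(4) directly, with border handling simplified to edge clamping instead of the usual symmetric extension.

```python
import numpy as np

def lifting_53_analysis(x):
    """One level of the 5/3 DWT via lifting, for a 1-D signal of even length."""
    xe, xo = x[0::2].astype(float), x[1::2].astype(float)
    xe_next = np.append(xe[1:], xe[-1])   # xe[n+1], clamped at the border
    h = xo - 0.5 * (xe + xe_next)         # predict step, cf. (1)
    h_prev = np.insert(h[:-1], 0, h[0])   # h[n-1], clamped at the border
    l = xe + 0.25 * (h_prev + h)          # update step, cf. (2)
    return l, h

def lifting_53_synthesis(l, h):
    """Invert the lifting steps in reverse order, cf. (3) and (4)."""
    h_prev = np.insert(h[:-1], 0, h[0])
    xe = l - 0.25 * (h_prev + h)
    xe_next = np.append(xe[1:], xe[-1])
    xo = h + 0.5 * (xe + xe_next)
    x = np.empty(2 * len(l)); x[0::2], x[1::2] = xe, xo
    return x

x = np.random.randn(32)
assert np.allclose(lifting_53_synthesis(*lifting_53_analysis(x)), x)
```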

The lifting structure is not limited to being an alternative and computationally convenient way to implement a DWT. Actually, it also allows nonlinear multiresolution decompositions. For example, the lifting steps implementing a DWT can be slightly modified by means of rounding operations in order to perform a so-called integer-to-integer (nonlinear) wavelet transform [13]. Such a transform is useful as a component of wavelet-based lossless coding systems. The kind and the degree of the introduced nonlinearity can depend on the structural or morphological characteristics of the signal that one may want to capture [14], keeping in mind that nonlinear systems can have unexpected behaviors in terms of error propagation (which may turn out to be a burden in the presence of quantization of the transformed signal). Moreover, the lifting structure plays a fundamental role in MC temporal filtering (MCTF), as we will see in the next subsection.

2) Temporal Scalability: A key tool which enables temporal scalability while exploiting temporal correlation is MCTF. An introduction to the subject can be found in [15]. After the first work of Ohm, proposing a motion compensated version of the Haar transform [16], other studies [17], [18] began to show that video coding systems exploiting this tool could be competitive with the classical and most successful hybrid ones (based on a combination of block-based spatial DCT transform coding and block-based temporal motion estimation and compensation). Other works (e.g., [19] and [20]) focus on the study of advanced MC lifting structures, due to their versatility and strength in increasing coding performance. Basically, an MCTF implementation by lifting steps can be represented by the already seen (1)-(4), where the index n now has a temporal meaning and where the prediction and update operators P and U can be guided by motion information, generated and coded after a motion estimation (ME) stage, becoming MCP and MCU operators respectively. Fig. 8 shows a conceptual block diagram of this general idea. ME/MC are implemented according to a certain motion model and, in the case of spatio–temporal multiresolution SVC and WSVC systems, they usually generate a set of motion descriptions consisting of motion vector fields, i.e., estimations of the trajectories of groups (usually rectangular blocks) of pixels between the temporal frames at the spatial and temporal decomposition levels involved.


Fig. 8. Prediction and update lifting steps with explicit motion information conveyed to the P and U operators.

Haar filter based MCTF solutions [16], [17], [21] and longer filter ones have been proposed [22]–[24]; for example, MCTF implementations of the 5/3 transform exist [20]. In general, longer filter solutions usually achieve better performance in terms of prediction efficiency and visual quality of the lower temporal resolution frames, but they make use of a bidirectional prediction, and thus their motion information is more bit-demanding than in Haar-based solutions. The two solutions recall the unidirectional and bidirectional block prediction modes used in hybrid schemes. Similarly to advanced hybrid schemes, an adaptive selection between the two modes is also possible in the case of MCTF, leading to block-based adaptive MCTF solutions [25], [26]. Alternatively, deformable motion models have also been explored, with promising results [27].
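A minimal sketch of one motion compensated Haar lifting stage on a frame pair; the per-pixel integer motion field and the warp helper are illustrative stand-ins for the block-based motion model, and the reversed field required by the update step is approximated by negating the vectors.

```python
import numpy as np

def warp(frame, mv):
    """Sample `frame` at positions displaced by integer motion vectors
    mv[..., (dy, dx)]; a toy stand-in for block-based MC."""
    H, W = frame.shape
    ys, xs = np.indices((H, W))
    return frame[np.clip(ys + mv[..., 0], 0, H - 1),
                 np.clip(xs + mv[..., 1], 0, W - 1)]

def mctf_haar(f0, f1, mv):
    h = f1 - warp(f0, mv)          # MCP step: predict f1 from aligned f0
    l = f0 + 0.5 * warp(h, -mv)    # MCU step, using the negated field
    return l, h

def imctf_haar(l, h, mv):
    f0 = l - 0.5 * warp(h, -mv)    # invert the update
    f1 = h + warp(f0, mv)          # invert the prediction
    return f0, f1

f0, f1 = np.random.rand(16, 16), np.random.rand(16, 16)
mv = np.random.randint(-2, 3, size=(16, 16, 2))
l, h = mctf_haar(f0, f1, mv)
r0, r1 = imctf_haar(l, h, mv)
assert np.allclose(r0, f0) and np.allclose(r1, f1)
```

Since both steps are lifting steps, reconstruction stays exact even where the negated field is a poor inverse of the true motion (occlusions, fractional-pel positions); an inaccurate update only degrades the lowpass frames and the coding efficiency, never the invertibility.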

By exploiting the local adaptability of the MCP and MCU operators, and thanks to the transmitted (usually losslessly coded) motion information, the MCTF can also be aware of and handle a series of issues affecting MC: 1) occlusion and uncovered area problems can be handled by recognizing unconnected and multiply connected pixels and adapting the MCP and MCU structure accordingly [22], [17], [21]; 2) in the case of block-based ME/MC, blocking effects can be reduced in various ways by considering adjacent blocks (spatio–temporal lifting), realizing for example overlapped block-based MC (OBMC) [21], [26]; and 3) when MVs are provided with fractional-pel precision, the lifting structure can be modified to implement the necessary pixel interpolations [22], [28], [20] while preserving the temporal invertibility of the transform. It must also be noted that the MCU makes use of a reversed motion field with respect to the one used by the MCP. This may constitute a problem, especially in those areas where the inversion cannot be performed exactly, e.g., on occluded/uncovered areas, or where it does not fall on integer pixel positions because fractional-pel accuracy is involved. In order not to duplicate the motion information and the associated bit budget, mechanisms to derive or predict the MCU fields from the MCP ones have been proposed [27], [29], [30], as well as methods that identify inversion problems and adopt interpolation-based solutions (e.g., the Barbell lifting scheme [31]). With lifting structures, nondyadic temporal decompositions are also possible [32], leading to temporal scalability factors different from a power of two.

As already noted, the update step can be omitted without impairing perfect reconstruction, generating a so-called unconstrained MCTF (UMCTF) [33]. This guarantees the direct use of the original frames at reduced frame rates, reduces the complexity of the transform stage, and has the benefit of eliminating ghosting artifacts.

Ghosting artifacts can appear in some regions of the lowpass temporal frames as a consequence of unavoidable local MC failures (associated with scene changes, unhandled object deformations, uncovered/occluded regions). In such erroneously motion-compensated regions, the lowpass temporal filtering updates the corresponding temporal subbands with "ghosts" (which "should not exist" in the current frame) coming from past or future frames. Depending on the magnitude of the ghosts, this can locally impair the visual appearance. On the other hand, for correctly motion-compensated regions, the update step has a beneficial temporal smoothing effect that can improve visual perception at reduced frame rates and also coding performance. Reduction of ghosting artifacts is therefore an important task, and it has been obtained by the implementation of MCTF adaptive update steps [34], [35] or by clipping the update signal [36].

Either MCTF or UMCTF generates a temporal subband hierarchy starting from the higher temporal resolutions towards the lower ones. Alternatively, the hierarchical B-frames decomposition (HBFD) [2] can be built, enabling a similar temporal detail structure but starting from the lower resolutions (predictions between more distant frames). This allows closed-loop forms of temporal prediction [37]. Efficient MCTF implementations have also been studied to reduce system complexity [38], memory usage [39], [23], and coding delay [40].

3) Quality Scalability: Among the best image compression schemes, wavelet-based ones currently provide high rate-distortion (R-D) performance while preserving a limited computational complexity. They usually do not interfere with spatial scalability requirements and allow a high degree of quality scalability which, in many cases, consists in the possibility of optimally truncating the coded bit stream at arbitrary points (bit stream embedding). Most techniques that guarantee an optimal bit stream embedding, and consequently a bitwise quality scalability, are inspired by the zerotree idea, first introduced in Shapiro's embedded zerotree wavelet (EZW) technique [41] and then reformulated by Said and Pearlman in their SPIHT algorithm [42]. A higher performance zerotree-inspired technique is the embedded zero-block coding (EZBC) algorithm [43], where quad-tree partitioning and context modeling of wavelet coefficients are well combined. The zerotree idea can also be reformulated in a dual way, allowing significance maps of the coefficients to be built directly. To this end, a morphological processing based on connectivity analysis has been used, justified also by the statistical evidence of energy clustering in the wavelet subbands [44]. The EMDC technique [45], [46] exploits principles of morphological coding and guarantees, for 2-D and 3-D data [47], coding performance comparable or superior to state-of-the-art codecs, progressive decoding, and fractional bit plane embedding for a highly optimized bit stream.

Another popular technique, which does not rely on the zerotree hypothesis, is the embedded block coding with optimized truncation (EBCOT) algorithm [48], adopted in the JPEG2000 standard [49], which combines layered block coding, fractional bit planes [50], block-based R-D optimization, and context-based arithmetic coding to obtain good (adjustable) scalability properties and high coding efficiency.
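The embedding property shared by all these coders can be isolated in a few lines: transmit the sign information and then the magnitude bit planes from the most significant one down, so that any truncation of the plane sequence yields the best quantized approximation available at that budget. Zerotrees, context modeling and R-D optimized truncation are all omitted in this sketch.

```python
import numpy as np

def encode_bitplanes(coeffs, nplanes):
    """Split integer coefficients into signs plus magnitude bit planes,
    ordered from the most significant plane to the least significant."""
    mags = np.abs(coeffs).astype(int)
    planes = [(mags >> p) & 1 for p in range(nplanes - 1, -1, -1)]
    return np.sign(coeffs).astype(int), planes

def decode_bitplanes(signs, planes, nplanes):
    """Rebuild from however many planes were received (stream truncation);
    the missing least significant planes simply act as zeros."""
    mags = np.zeros_like(signs)
    for plane in planes:
        mags = (mags << 1) | plane
    mags <<= nplanes - len(planes)
    return signs * mags

c = np.array([37, -21, 5, -2, 14, 0])
signs, planes = encode_bitplanes(c, nplanes=6)
for k in range(1, 7):   # progressively finer quality as planes arrive
    print(k, 'planes ->', decode_bitplanes(signs, planes[:k], 6))
```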


Fig. 9. Basic WSVC schemes in a signal perspective. (a) t + 2D. (b) 2D + t.

Proposed techniques for coding the coefficients of the spatio–temporal subbands generated by WSVC systems are usually extensions of image coding techniques. Obviously, such 3-D extensions shall not interfere with temporal scalability requirements either, and they can be achieved by considering: 1) single temporal frame coding with multiple-frame statistical context updating; 2) extended spatio–temporal nonsignificance (or significance) prediction structures; and 3) spatio–temporal statistical contexts, or a combination of them. Temporal extensions of SPIHT [18], EZBC [51], [52], EBCOT [36], [24], and EMDC [53] have been proposed and used in WSVC video coding systems.

III. CODING ARCHITECTURES FOR WSVC SYSTEMS

A. WSVC Notation

In a multidimensional, multilevel spatio–temporal decomposition there is a proliferation of signals and/or subbands. To support our discussion, we have adopted a notation that guarantees easy referencing of the generated signals in a general decomposition structure. We consider a generic spatio–temporal signal x, with spatial dimensions (s1, s2) and temporal dimension t. A signal x which undergoes an N-level multiresolution spatial transform is indicated with S_N(x). For nonredundant transforms, such as the DWT, the spatially transformed signal actually consists of the subband set {x^{sN}, d^{sN}, ..., d^{s1}}, where the superscript sN indicates the signal representation at the coarsest spatial resolution (lowpass) while, for each spatial level l, d^{sl} indicates a critically sampled level of detail. In the case of separable transforms (such as the DWT), each d^{sl} consists of three directional subbands. For redundant transforms (such as the Laplacian pyramid), the transformed signal consists of the signal set {x^{sN}, d̃^{sN}, ..., d̃^{s1}}, where d̃^{sl} is an oversampled detail at level l. With a similar notation, a spatio–temporal signal which undergoes a K-level multiresolution temporal transform is indicated as T_K(x). As before, superscripts tK and tk will be used

with reference to the temporal dimension, to indicate the coarsest temporal resolution and the critically and not critically sampled temporal details, respectively. In the figures of the present section, we will graphically represent temporally transformed data subbands by associating a gray level to them: white corresponds to the original (untransformed) frames, while grey tones get darker as the detail level goes from the finest to the coarsest. Moreover, plain and hatted symbols (x, x̂) indicate original and quantized (reduced coefficient representation quality) signals, respectively. The symbol "*" will be used as a wild card to refer to the whole set of spatio–temporal subbands of a given spatial or temporal transform.

A decoded version of the original signal, extracted and reconstructed at a given temporal resolution tk, spatial resolution sl and reduced quality (rate) q, will be indicated as x̂^{(sl, tk, q)}.

Concerning ME/MC, given a certain motion model, a description is produced consisting of a set of motion vector fields mv^{sl,tk}, where sl and tk refer to the various spatial and temporal resolution levels, respectively. As the fields at the different resolutions are obviously not independent, mechanisms of up- or down-conversion, with or without refinement or approximation of the motion vector field, are usually implemented. A down-conversion without approximation then indicates that the motion vectors for a coarser resolution have been derived by a suitable reduction of the block size and of the vector module; a down-conversion with approximation additionally re-quantizes the vector values. Conversely, an up-conversion with refinement indicates that the motion vectors at a finer resolution layer have been obtained by refining the block partition and/or the motion vector values of a coarser layer.

B. Basic WSVC Architectures

In Figs. 9 and 10, the main multiresolution decomposition structures and corresponding WSVC architectures are shown. Each one will be presented in the following paragraphs.

1) t + 2D: Perhaps the most intuitive way to build a WSVC system is to perform an MCTF in the original spatial domain, followed by a spatial transform on each temporal subband. This


Fig. 10. Pyramidal WSVC schemes with the pyramidal decomposition put (a) before and (b) after the temporal MCTF.

is usually denoted as a t + 2D scheme; it guarantees critically sampled subbands and is represented in Fig. 9(a). Earlier wavelet based coding systems were based on this scheme [16], [17]. Many other wavelet based SVC systems rely on the t + 2D spatio–temporal decomposition, among which we can cite [18]–[21], [31], [33], [51], [52], [54]–[58]. Despite its conceptual simplicity, the t + 2D solution presents some relevant drawbacks, especially for spatial scalability performance. When only temporal and quality scalability (full spatial resolution decoding) are required, the process is reversed until reaching the desired frame rate (by partial or complete MCTF inversion) and signal-to-noise ratio (SNR) quality, where both the direct and the inverse temporal transforms make use of the same set of full resolution motion vectors. Instead, if a lower spatial resolution version is needed, the inversion process is incoherent with

respect to the forward decomposition. For clarity, let us consider only spatial scalability, so that a spatially scaled signal is obtained by inverting the temporal transform in the lower resolution spatial subband domain. The problem is due to the fact that the inverse MCTF is performed using motion vectors that are typically deduced from the estimated full resolution ones by down-conversion. 1) This cannot guarantee perfect reconstruction, because the order of the spatial and motion compensated temporal transforms cannot be inverted without most likely changing the result (due to the shift-variant nature of the spatial transform) [15]. 2) These motion vectors have not been estimated on data at the target resolution, and therefore they cannot be considered optimal for use when the spatial subsampling has been made with non-alias-free wavelet filters. Moreover, the bit occupancy of full resolution motion vectors penalizes the performance of spatially scaled decoding, especially at lower bit rates.


The above issues can only be mitigated in such a critically sampled t + 2D scheme, but not structurally resolved. To reduce the MV bit load at full resolution, scalable MV coding can be used [59], [60]; to reduce it at intermediate and lower resolutions, approximated MV down-conversions

can be adopted (e.g., keeping half-pel accuracy at the lower spatial resolution instead of the quarter-pel precision generated by the exact down-conversion). These two solutions have the disadvantage of introducing some imprecision in the inverse temporal transform, and their possible adoption should be balanced in an R-D sense. The problem of the nonoptimality of the down-converted vectors, due to the spatial aliasing introduced by the decimation, can be reduced by using (possibly only for the needed resolutions) more selective wavelet filters [61], or by a locally adaptive spectral shaping acting on the quantization parameters inside each spatial subband [62]. However, such approaches can determine coding performance losses at full resolution, because either the wavelet filters or the coefficient quantization laws are moved away from ideal conditions.
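The trade-off between exact and approximated MV down-conversion can be sketched as follows, assuming vectors are stored as integers in half-pel units at full resolution (names and conventions are illustrative):

```python
import numpy as np

def mv_downconvert_exact(mv_halfpel):
    """Exact down-conversion to the next dyadic resolution: block sizes and
    displacements halve, so keeping the same integers but reading them as
    quarter-pel units realizes the halving without any approximation."""
    return mv_halfpel.copy()   # same integers, now in quarter-pel units

def mv_downconvert_approx(mv_halfpel):
    """Approximated down-conversion: re-quantize to half-pel at the lower
    resolution (e.g., 5 quarter-pel -> 2 half-pel), bounding the sub-pel
    interpolation depth at the cost of a less precise inverse MCTF."""
    return np.round(mv_halfpel / 2.0).astype(int)
```

The exact path accumulates ever finer sub-pel precision at each lower resolution; the approximate path keeps the precision (and the MV rate) bounded but perturbs the inverse temporal transform, which is why its adoption should be balanced in an R-D sense.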

2) 2D + t: In order to solve the spatial scalability performance issues of t + 2D schemes, a natural approach could be to consider a 2D + t scheme, where the spatial transform is applied before the temporal one, as shown in Fig. 9(b). The resulting scheme is often called in-band MCTF (IBMCTF) [63], because ME is made and applied independently on each spatial level in the spatial subband domain, leading to a structurally scalable motion representation and coding. With IBMCTF, spatial and temporal scalability are more decoupled than in the t + 2D case. Signal reconstruction after any combination of spatio–temporal and quality scalability is obtained with a (partial) inverse temporal transform which makes use of

(a subset of) the same MVs used for the direct decomposition.

Despite its conceptual simplicity, the 2D + t approach suffers as well from the shift-variant (periodically shift-invariant) nature of the DWT decomposition. In fact, ME in the critically sampled detail subbands is not reliable, because the coefficient patterns of moving parts usually change from one frame to the consecutive or preceding ones. Solutions to this issue have been proposed [64], [65]; they foresee a frame expansion formulation, at the expense of an increased complexity, in order to perform ME and compensation in an overcomplete (shift-invariant) spatial domain, while transmitting only the decimated samples. Even so,

2D + t schemes have demonstrated lower coding efficiency compared to t + 2D ones, especially at higher resolutions [66]. This is probably due to the reduced efficiency of MCTF when applied in the highpass spatial subband domain. In addition, the spatial inversion of detail subbands which contain block-based MCTF coding artifacts visually smoothes blocking effects, but probably causes an overall visual quality degradation.

Another possible explanation of the reduced coding efficiency resides in the fact that the parent-child spatial relationships, typically exploited during embedded wavelet coding or context-based arithmetic coding, are partially broken by the independent temporal filtering of the spatial levels.

3) Adaptive Architectures: Being characterized by different advantages and disadvantages, t + 2D and 2D + t architectures seem inadequate to give a final answer to the composite issues arising in SVC applications. One possible approach comes from schemes which try to combine the positive aspects of each of them and propose adaptive spatio–temporal decompositions, optimized with respect to suitable criteria. In [67] and [68], the authors suggest that a content-adaptive choice between t + 2D and 2D + t decompositions can improve coding performance. In particular, they show how the various kinds of disadvantages of fixed transform structures can be more or less preponderant depending on the video content, especially on the accuracy of the derived motion model. Therefore, local estimates of the motion accuracy allow the spatio–temporal decomposition structure to be adapted, preserving scalability while obtaining an overall maximization of coding performance.

Recently, another adaptive solution, named aceSVC [69], [70], has been proposed, where the decomposition adaptation is driven by scalability requirements. This solution tries to achieve maximum bit stream flexibility and adaptability to a selected spatio–temporal and quality working point configuration. This is obtained by arranging spatial and temporal decompositions following the principles of a generalized spatio–temporal scalability (GSTS) [71]. In the aceSVC encoded bit stream, thanks to the adapted decomposition structure, every coding layer is efficiently used (without waste of bits) for the construction of the higher ones, according to the predefined spatial, temporal and quality working point configurations.

Another perspective to compensate for the t + 2D versus 2D + t drawbacks can come from the pyramidal approaches, as discussed in the remainder of this section.

4) Multiscale Pyramids: From the above discussion it is clear that the spatial and temporal wavelet filtering cannot be decoupled because of the MC taking place. As a consequence, it is not possible to encode different spatial resolution levels at once (with only one MCTF), and thus both lower and higher resolution frames or subband sequences should be MCT-filtered. In this perspective, a possibility for obtaining good performance in terms of bit rate and scalability is to use coding schemes which are usually referred to as 2D + t + 2D or multiscale pyramids [72]. In such schemes, spatially scaled versions of the video signal are obtained starting from the higher resolution, while the coding systems encompass ISP mechanisms in order to exploit the redundancy of the multiscale representation. These multiresolution decompositions derive naturally from the first hierarchical representation technique introduced for images, namely the Laplacian pyramid [6] (see Section II-B). So, even if from an intuitive point of view 2D + t + 2D schemes seem to be well motivated, they have the typical disadvantage of overcomplete transforms, namely that of leading to a full-size residual image, so that coding efficiency is harder to achieve. As in the case of the previously discussed t + 2D and 2D + t schemes, two different multiscale pyramid based schemes are conceivable, depending on where the pyramidal residue generation is implemented, i.e., before or after the MCTF decomposition. These two schemes are shown in Fig. 10. In both schemes the lowest spatial resolution temporal subbands are the same, while the higher resolution residuals are generated differently.


Fig. 11. STP-tool scheme in a generated signals/subbands perspective.

The scheme of Fig. 10(a) suffers from the problems of 2D + t schemes, because ME is operated on spatial residuals; to our knowledge, it did not lead to efficient SVC implementations.

In the scheme depicted in Fig. 10(b), we first observe that the higher resolution residuals

cannot be called Laplacian details, since the spatial transform is not exactly a Laplacian one: the temporal transform stages, which are placed between the spatial down-sampling and the prediction-with-interpolation stages, operate independently (on separate originals). This fact can cause prediction inefficiencies, which may be due to motion model discrepancies among the different resolutions, or simply to the nonlinear and shift-variant nature of the MC even in the presence of similar motion models. It is worth noting that the effect of these inefficiencies affects the whole residue, and thus worsens the problem of the increased number of coefficients to code. Strategies to reduce these inefficiencies are workable, and we will return shortly to these issues in the following subsection. Note that the scheme shown in Fig. 10(b) can be used to represent the fundamental coding principles underlying the SVC standard [2].

It is also important to mention that some recent works demonstrate that, using frame theory [7], it is possible to improve the coding performance associated with the Laplacian pyramid representation [73], [74] and also to reduce its redundancy [75]. In the future, these works may play a role in improving the spatial scalability performance of the coding schemes described in this paragraph.

C. STP-Tool Scheme

We present here a WSVC architecture that has been developed in [76] and also adopted as a possible configuration of the MPEG VidWav reference software [36] (see Section IV-B). This scheme is based on a multiscale pyramid and differs from the one of Fig. 10(b) in the ISP mechanism. The solution, depicted in Fig. 11, is called STP-tool, as it can be interpreted as a prediction tool (ISP) for spatio–temporal subbands.

After the representation of the video at different scales, the scaled versions undergo a normal MCTF. The resulting signals are then passed to a further spatial transform stage (typically one level of a wavelet transform) before the ISP. This way, it is possible to implement a prediction between two signals which are likely to bear similar patterns in the spatio–temporal domain, without the need to perform any interpolation.

In particular, instead of the full resolution residuals of Fig. 10(b), the temporal subbands are kept as such at the coarsest resolution, while for each higher resolution the spatial lowpass parts of the temporal subbands are replaced by critically sampled residues, produced by the coarse resolution adder shown in Fig. 11. The adder subtracts from the spatial lowpass part of each higher resolution temporal subband: a) the corresponding original coarse resolution temporal subband, in a spatial open loop architecture; or b) its decoded (quantized) version, in a spatial closed loop architecture.
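The following sketch renders the STP-tool ISP of Fig. 11 schematically: trivial motion-free Haar stages stand in for the MCTF and for the one-level spatial DWT, and all names are illustrative. The decoded low-resolution temporal subbands predict the spatial lowpass of the corresponding high-resolution temporal subbands with no interpolation, leaving critically sampled residues plus spatial details.

```python
import numpy as np

def mctf(frames):
    """Toy temporal Haar stage (motion-free stand-in for the MCTF)."""
    return 0.5 * (frames[0::2] + frames[1::2]), frames[1::2] - frames[0::2]

def dwt2(x):
    """Toy one-level separable spatial Haar DWT on the last two axes."""
    a = 0.5 * (x[..., 0::2, :] + x[..., 1::2, :])
    d = x[..., 0::2, :] - x[..., 1::2, :]
    LL = 0.5 * (a[..., 0::2] + a[..., 1::2]); LH = a[..., 0::2] - a[..., 1::2]
    HL = 0.5 * (d[..., 0::2] + d[..., 1::2]); HH = d[..., 0::2] - d[..., 1::2]
    return LL, (LH, HL, HH)

def stp_tool(frames_full, frames_coarse, decode):
    """Closed-loop ISP: each full-resolution temporal subband undergoes one
    spatial DWT level, and its LL part is predicted from the decoded
    coarse-resolution temporal subband of the same type."""
    out = []
    for sb_full, sb_coarse in zip(mctf(frames_full), mctf(frames_coarse)):
        LL, details = dwt2(sb_full)
        out.append((LL - decode(sb_coarse), details))  # residue + details
    return out

T, H, W = 4, 8, 8
full = np.random.rand(T, H, W)
coarse, _ = dwt2(full)  # coarse video generated with the same lowpass filters
res = stp_tool(full, coarse, decode=lambda x: np.round(x * 256) / 256)
print([np.abs(r).max() for r, _ in res])  # residues reduce to quantization error
```

With a real MCTF the spatial and temporal stages no longer commute exactly (MC is nonlinear and shift-variant), so the residues additionally collect the inter-scale motion model discrepancies discussed below.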

In an open-loop scheme and without the temporal transform, the residual would be zero (if the same pre- and post-spatial transforms were used). In this case, the scheme would reduce to a critically sampled one. In the presence of the temporal transform, things are more complicated, due to the data-dependent structure of the motion model and to the nonlinear and time-variant behavior of the MCTF. In general, it is not possible to move the spatial transform stages forward and backward with respect to the MCTF without changing the result. If the spatial transforms used to produce the spatial subbands are the same, the main factor producing differences between the coarse resolution subbands and the spatial lowpass of the higher resolution ones can be identified in the incoherences between the motion trajectories at the different spatial resolutions. On one hand, an independent ME at the various spatial resolutions could lead to intra-layer optimized MC. On the other hand, a motion model that captures the physical motion in the scene is certainly effective for the MC task and should reasonably be self-similar across scales. Therefore, a desirable objective is to have a hierarchical ME where the high resolution motion fields are obtained by refinement of the corresponding lower resolution ones (or vice versa). Indeed, this guarantees a certain degree of inter-scale motion field coherence, which makes the STP-tool prediction more effective.


Fig. 12. VidWav general framework.

A hierarchical motion model is thus a desired feature here; in addition, it helps to minimize the motion vector coding rate.

The STP-tool architecture presents another relevant and peculiar aspect. Although the number of subband coefficients to code is not reduced with respect to a pyramidal approach, the different (prediction) error sources are separated into different critically sampled signals. This does not happen in pyramidal schemes, where only full resolution residuals are produced [see Fig. 10(b)].

To better understand these facts, let us analyze separately each residual or detail component. In an open loop architecture:
• the residues on the spatial lowpass subbands account for the above mentioned inter-scale motion model discrepancies;
• the temporal highpass, spatial highpass subbands account for the intra-scale failure of the motion model to completely absorb and represent the real motion (occluded and uncovered areas, and motion features not handled by the motion model); finally
• the temporal lowpass, spatial highpass subbands contain the detail information needed to increase the resolution of the lowpass temporal frames.
Thus, maximizing the inter-scale motion coherence implies

minimizing the energy (and the coding cost) of the lowpass residues. Note that even if a perfect motion down-scaling is adopted, localized residual components would remain, due to the shift-variant properties of the MC, especially where uncovered areas and occlusions must be handled. In other words, the coefficient increase determined by the presence of the residues, with respect to the critically sampled case, is the price to pay for allowing a suitable ME and MC across the different scales.

In a closed-loop ISP architecture we also have the great advantage that the quantization errors related to a given resolution are confined to the signals of that resolution, which contribute in a more or less predominant way to the residues to code. This is of great advantage because of the reduced number of coefficients over which the closed-loop error is spread. Besides, the residue signals are less structured, and thus more difficult to code, than the temporal subbands and the spatial details. Therefore, the STP-tool scheme allows a useful separation of the above signals, with the advantage that suitable coding strategies can be designed for each source.

As for the t + 2D and 2D + t schemes, the STP-tool solution may suffer from the presence of spatial aliasing at the reduced spatial resolutions. If not controlled, this may also create some inter-scale discrepancies between the motion models. Again, a partial solution can be found in the use of more selective decimation wavelet filters, at least for the decomposition levels corresponding to decodable spatial resolutions.

The STP-tool solution has been tested on different platforms with satisfactory performance, as described in the following sections.

IV. WSVC REFERENCE PLATFORM IN MPEG

In the past years, the ISO/MPEG standardization group initiated an exploration activity to develop SVC tools and systems, mainly oriented towards spatio–temporal scalability. One of the key issues of the conducted work was not to sacrifice coding performance with respect to the state-of-the-art reference video coding standard H.264/AVC.

In 2004, at the Palma meeting, the ISO/MPEG group set up a formal evaluation of SVC technologies. Several solutions [1], [66], [77], [3], based either on wavelet spatio–temporal decompositions or on H.264/AVC technologies, had been proposed and compared. The MPEG visual testing indicated that the performance of an H.264/AVC pyramid [25] appeared the most competitive, and it was selected as the starting point for the new standard. At the next meeting, in Hong Kong, MPEG, jointly with the IEC/ITU-T video group, accordingly started a joint standardization initiative. A scalable reference model and software platform named JSVM, derived from H.264/AVC technologies, was adopted [2]. During the above mentioned competitive phases, among the other wavelet based solutions, the platform submitted by Microsoft Research Asia (MSRA) was selected as the reference to continue the experiments on wavelet technologies. The MPEG WSVC reference model and software (RM/RS) [78] will be indicated hereinafter with the acronym VidWav (Video Wavelet). The VidWav RM/RS has evolved over time, seeing various components integrated, provided they gave a sufficient performance improvement in appropriate core experiments.

In the paragraphs of the present section we mainly recall the characteristics of the VidWav RM. Document [78] should be


taken as a reference for further details. In the next section, we will discuss the main differences with respect to the SVC standard. We will also provide some comparison results between the two approaches.

A. VidWav: General Framework and Main Modules

The general framework supported by the VidWav RM is shown in Fig. 12. With pre- and post-spatial decompositions, different WSVC configurations can be implemented: t + 2D, 2D + t,

and STP-tool. The main modules of the manageable architectures are briefly considered in the following.

ME and Coding: A macroblock-based motion model was adopted, with H.264/AVC-like partition patterns. To support spatial scalability, the macroblock size is scaled according to the decimation ratio. The interpolation accuracy is also scaled. Therefore, a full-pel estimation on 64 × 64 macroblocks at 4CIF resolution can be used at quarter-pel precision on a 16 × 16 macroblock basis for QCIF reconstructions. Hence, motion is estimated at full resolution and kept coherent across scales by simply scaling the motion field partition and the motion vector values. This poses the limitations already discussed in Section III-B for the

t + 2D architecture. For each block there is the possibility to select a forward,

backward or bidirectional motion model. This requires mode selection information. The mode selection also determines the block aspect ratio and informs about the possibility of deriving the current MV values from already estimated ones. Motion vector modes and values are estimated using rate-constrained Lagrangian optimizations, and coded with variable length and predictive coding, similarly to what happens in H.264/AVC. For MC filtering, a connected/disconnected flag is included in the motion information on units of 4 × 4 pixels. For 4:2:0 YUV sequences, the motion vectors for the chroma components are half those of the luminance.

Temporal Transform: The temporal transform modules implement a framewise motion compensated (or motion aligned) wavelet transform on a lifting structure. The block-based temporal model is adopted for temporal filter alignment. Motion vectors are directly used for the motion aligned prediction (MAP) lifting step, while the motion field should be inverted for the motion aligned update (MAU). To this end, Barbell update lifting [31] is adopted, which directly uses the original motion vectors and bilinear interpolation to find the values of the update signals from noninteger pixel positions. In order to further reduce blocking artifacts, overlapped block MC (OBMC) can be used during the MAP. This is accomplished by multiplying the reference signals by a pyramidal window with support larger than 4 × 4 before performing the temporal filtering. The update signal is also clipped according to a suitable threshold, in order to limit the visibility of possible ghosting artifacts. Moreover, the implemented temporal transform is locally and adaptively a Haar or a 5/3 transform, depending on whether the motion block modes are unidirectional (forward or backward) or bidirectional.
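The update clipping mentioned above amounts to bounding the contribution of the MAU step; a sketch with an arbitrary illustrative threshold (not the reference-model rule):

```python
import numpy as np

def clipped_update(frame_even, update_signal, thr=8.0):
    """Bound what the MAU step may add to the temporal lowpass frame, so
    that local MC failures cannot leave strong ghosts; `thr` is an
    illustrative constant, not the VidWav setting."""
    return frame_even + np.clip(update_signal, -thr, thr)
```

Invertibility is preserved because the decoder recomputes the same clipped update signal from the received highpass subband before subtracting it.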

Spatial Transform: A simple syntax allows pre- and post-spatial transforms to be configured on each frame or previously transformed subband. This allows the implementation of both t + 2D and 2D + t

schemes. Different spatial decompositions are possible

Fig. 13. VidWav framework in STP-tool configuration. The lowest spatial resolution layer is equivalent to a t + 2D scheme. Higher spatial layers implement the STP-tool ISP.

for different temporal subbands, allowing adaptation to the signal properties of each temporal subband.

Entropy Coding: After the spatio–temporal transform modules, the coefficients are coded with a 3-D (spatio–temporal) extension of the EBCOT algorithm, called 3-D EBCOT. Each spatio–temporal subband is divided into 3-D blocks which are coded independently. For each block, fractional bit plane coding and spatio–temporal context based arithmetic coding are used.

B. VidWav STP-Tool Configuration

The VidWav reference software can be configured to follow the STP-tool architecture of Fig. 11. A corresponding functional block diagram of the resulting system is shown in Fig. 13. The coding process starts from the lowest resolution, which acts as a base layer for the ISP. In particular, a closed loop ISP is implemented, as coded temporal subbands are saved and used for the STP-tool prediction of the higher resolution. The STP-tool prediction can be applied to all spatio–temporal subbands or only to a subset of them. Decoded temporal subbands at a suitable bit rate and temporal resolution are saved and used for ISP. At present, no particular optimization exists in selecting the prediction point. Obviously, it should not be too far from the maximum point for a certain temporal resolution at the lower spatial resolution, in order to avoid bit stream spoilage (lower resolution bit stream portions being useless for higher resolution prediction) and prediction inefficiency (too large a prediction residual). The spatial transforms on the levels interested by the STP-tool ISP can be implemented either by using the three lifting steps (3LS) [61] or the 9/7 filters. In order to better exploit the inter-scale redundancy, these filters should be of the same type as those used for the generation of the reference videos for each of the considered spatial resolutions. In this implementation, some important limitations remain. These are mainly related to software integration problems due to the data structures used by the original MSRA software. Therefore, the coherent and refinable bottom-up ME and coding are not present in the current VidWav software. When STP-tool is used, three motion fields are estimated independently. This introduces some undesired redundancy in the bit stream, which has a major impact on the lowest quality

ADAMI et al.: STATE-OF-THE-ART AND TRENDS IN SCALABLE VIDEO COMPRESSION WITH WAVELET-BASED APPROACHES 1249

point at higher spatial resolutions. In addition, it does not pro-vide the motion field coherence across scales which was seen asa desirable feature for good STP-tool prediction.
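The closed-loop inter-scale prediction at the core of this configuration can be sketched as follows; the 9/7-type bior4.4 wavelet of PyWavelets, the mock decoded base layer and the function names are illustrative assumptions.

import numpy as np
import pywt

def stp_tool_predict(hires_subband, decoded_lores, wavelet="bior4.4"):
    """Predict the LL spatial band of a higher-resolution temporal subband
    from the decoded lower-resolution one; the residual and the detail
    bands are what would then be entropy coded."""
    LL, details = pywt.dwt2(hires_subband, wavelet)
    residual = LL - decoded_lores
    return residual, details

hires = np.random.rand(64, 64)                       # a temporal subband frame
LL_ref = pywt.dwt2(hires, "bior4.4")[0]
decoded_lores = LL_ref + 0.01 * np.random.randn(*LL_ref.shape)  # mock base layer
residual, details = stp_tool_predict(hires, decoded_lores)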

C. VidWav Additional Modules

Other modules have been incorporated into the baseline VidWav reference platform after careful review of the performance improvements they provide under appropriate testing conditions. Such tools can be turned on or off through configuration parameters. A more detailed description of these tools can be found in [78].

Base Layer: The minimum spatial, temporal and quality resolution working point defines a Base Layer on which additional Enhancement Layers are built to provide scalability up to a desired maximum spatial, temporal and quality resolution. This tool enables the use, in subband SVC, of a base layer which has been previously compressed with a different technique. In particular, an H.264/AVC stream can be used as base layer by implementing a hierarchical B-frames decomposition (HBFD) structure. This allows each B-picture of the base layer to temporally match the corresponding highpass subbands in the enhancement layers. It therefore enables the frames to share a similar MCTF decomposition structure, and this correlation can be efficiently exploited. Motion information derived from the base layer codec can also be used as additional candidate predictors for estimating the motion information of the blocks in the corresponding highpass temporal subbands of the enhancement layers. Clearly, the base layer tool can significantly improve the coding performance, at least for the lowest working point, which is now encoded with a nonscalable method. A second but not less relevant benefit is the backward compatibility with preexisting nonscalable H.264/AVC decoders.
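The temporal matching between base-layer B-pictures and enhancement-layer highpass subbands rests on the dyadic structure sketched below; the helper name and the power-of-two GOP assumption are ours.

def hbfd_levels(gop_size):
    """Assign each frame of a GOP to a hierarchical-B temporal level, so
    that B-pictures at level l match the highpass subbands produced by the
    l-th MCTF stage (gop_size is assumed to be a power of two)."""
    levels = {0: 0}                      # key picture anchoring the hierarchy
    step, level = gop_size, 1
    while step > 1:
        for k in range(step // 2, gop_size, step):
            levels[k] = level            # B-pictures midway between anchors
        step //= 2
        level += 1
    return [levels[k] for k in range(gop_size)]

print(hbfd_levels(8))   # -> [0, 3, 2, 3, 1, 3, 2, 3]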

In-Band MCTF: The 2D + t scheme is activated when a nonidentity pre-spatial transform in Fig. 12 is set. In such a configuration, the MCTF is applied to each spatial subband generated by the pre-spatial transform. In Section III-B.2 it has already been evidenced that the in-band scheme suffers a loss in coding performance at full resolution when the MCTF is performed independently on each subband. Conversely, if high resolution signals are involved in the prediction for low resolution, drifting occurs when decoding low resolution video, because in that case the high resolution signals are not available at the decoder. The VidWav RM adopts a technique [79] based on the use of additional macroblock inter coding modes in the MCTF of the spatial lowpass band. With respect to the pure 2D + t scheme, this method provides a better tradeoff between the lower resolution drifting error and the high resolution coding efficiency, and also improves the global coding performance.

Wavelet Ringing Reduction: In WSVC, ringing caused by the loss of details in the highpass spatial subbands is a major concern for the resulting visual quality [80]. Ringing is generated during the inverse DWT because the quantization noise of lossily coded wavelet coefficients is spread by the reconstruction filters of the IDWT. The oscillating pattern of these filter impulse responses at different reconstruction scales becomes objectionable, especially in flat luma and chroma regions. The ringing magnitude is related to the quantization step size. In order to reduce these ringing patterns, a 2-D bilateral filter applied to each component of the output video is used.
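A plain (unoptimized) version of such a deringing filter is sketched below; the parameter values are illustrative, and in practice the range parameter sigma_r would be tied to the quantization step size.

import numpy as np

def bilateral(img, radius=2, sigma_s=1.5, sigma_r=8.0):
    """Plain 2-D bilateral filter (applied per component), of the kind
    usable as a post-processing deringing stage."""
    H, W = img.shape
    pad = np.pad(img, radius, mode="edge")
    out = np.zeros((H, W), dtype=np.float64)
    # Precompute the spatial (domain) Gaussian weights
    ax = np.arange(-radius, radius + 1)
    gs = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma_s ** 2))
    for y in range(H):
        for x in range(W):
            patch = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            # Range weights penalize large intensity differences (edges kept)
            gr = np.exp(-((patch - img[y, x]) ** 2) / (2 * sigma_r ** 2))
            w = gs * gr
            out[y, x] = (w * patch).sum() / w.sum()
    return out

smoothed = bilateral(np.random.rand(32, 32) * 255)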

Intra-Mode Prediction: The intra-prediction mode exploits the spatial correlation within the frame in order to predict intra areas before entropy coding. Similarly to the H.264/AVC approach, a macroblock can be predicted from the values of its neighbors, i.e., by using spatial predictors.

V. WSVC AND SVC STANDARDIZATION REFERENCE PLATFORMS: A COMPARISON

A. Architecture and Tools: Similarities and Differences

In this section, some differences and similarities between the JSVM and VidWav RM (see Fig. 13) frameworks will be highlighted and discussed.

1) Single-Layer Coding Tools: We start by analyzing the differences between the VidWav RM [36] and the JSVM [2] when they are used to encode a single operating point, i.e., with no scalability features being considered. Broadly speaking, this can be seen as a comparison between a t + 2D WSVC (which will remain fully scalable at least in temporal and quality resolution) and something similar to H.264/AVC (possibly supporting temporal scalability thanks to HBFD).

VidWav uses a block-based motion model. Block mode types are similar to those of JSVM, with the exception of the Intra mode, which is not supported by VidWav. This is due to the fact that the Intra mode rewrites blocks from the original signal inside the prediction signal (the highpass temporal subbands), producing a mixed (original+residue) source which is detrimental for the subsequent whole-frame spatial transforms. This is also one of the main differences in spatio–temporal redundancy reduction between the two approaches. JSVM (as well as H.264/AVC and most of the best performing hybrid video coding methods) operates in a local manner: single frames are divided into macroblocks which are treated separately in all the coding phases (spatial transform included). Instead, VidWav (as well as other WSVC approaches) operates with a global approach, since the spatio–temporal transform is directly applied to a group of frames. The fact that a block-based motion model has been adopted in the VidWav framework is only due to efficiency considerations and to the lack of effective implementations of alternative solutions. However, this cannot be considered an optimal approach. In fact, the presence of blockiness in the highpass temporal subbands due to block-based predictions is inconsistent with, and causes inefficiencies in, the subsequent spatial transform. The energy spread around block boundaries generates artificial high frequency components and disturbs both zerotree and block-based entropy coding mechanisms acting on the spatio–temporal subbands. On the other hand, blockiness is not a main issue when block-based transforms, such as the DCT, are instead applied, as in the case of JSVM. OBMC techniques can reduce the blockiness in the highpass temporal subbands but do not remove it.

Contrary to JSVM, the single-layer VidWav framework only supports open-loop encoding/decoding. Thanks to its closed-loop encoding, JSVM can exploit an in-loop deblocking filter, which performs a job similar to the post-processing deblocking (JSVM)/deringing (VidWav) filters, with the additional benefit of improving the quality of the reconstructed signals used in closed-loop predictions.


Closed-loop configurations within WSVC systems would be possible by adopting an HBFD-like decomposition, with the aim of incorporating in-loop deringing strategies as well.

2) Scalable Coding Tools: We now consider spatial scalability activated in both solutions, looking in particular at JSVM and at VidWav in STP-tool configuration. The latter can be considered the WSVC implementation closest to the JSVM one. The main difference between the two systems in terms of spatial scalability mechanism is again related to the block-based versus frame-based dichotomy. In JSVM, a decision on each single macroblock determines whether the macroblock is encoded by using the information of its homologue at a lower spatial resolution or not. In the latter case, one of the compatible modalities available for single-layer encoding will be used. The decision among all possible modes is globally regulated by R-D optimization. On the contrary, ISP inside the STP-tool architecture is forced to be made at a frame level (with possible decisions on which frames to involve).
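As an illustration of this macroblock-level decision, the sketch below picks the mode minimizing the Lagrangian cost J = D + lambda*R; the mode names and the (distortion, rate) pairs are hypothetical.

def rd_mode_decision(candidates, lam):
    """Return the mode with minimum Lagrangian cost J = D + lam * R.
    candidates: dict mapping mode name -> (distortion, rate)."""
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

modes = {"intra": (120.0, 300), "inter_16x16": (90.0, 220), "base_layer_pred": (95.0, 160)}
print(rd_mode_decision(modes, lam=0.4))   # -> base_layer_pred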

Similarly to JSVM, STP-tool can use both closed- and open-loop interlayer encoding, while this is not true for t + 2D or 2D + t configurations, which can only operate in open-loop mode.

B. Objective and Visual Result Comparisons

The comparison between DCT-based SVC and WSVC systems is far from being an obvious task. Objective and subjective comparisons are critical even when the most “similar” architectures are considered (e.g., JSVM and STP-tool), especially when comparisons are performed using decoded video sequences at reduced spatial resolutions. In this case, since in general different codecs use different down-sampling filters, the coding–decoding processes will have different reconstruction reference sequences, i.e., different down-sampled versions of the same original. JSVM adopts the highly selective half-band MPEG down-sampling filter, while WSVC systems are more or less forced to use wavelet filter banks, which in general offer a less selective half-band response and smoother transition bands. In particular, in t + 2D architectures wavelet down-sampling is a requisite, while in STP-tool ones (see Section III-C) a freer choice could be adopted, at the expense of a less efficient ISP. Visually, the reference sequences generated by wavelet filters are in general more detailed, but sometimes, depending on the data or on the visualization conditions, some spatial aliasing effects can be seen. Therefore, the visual benefit of using wavelet filters is controversial, while schemes adopting more selective filters can take advantage, in terms of coding efficiency, of the lower data entropy. Unfortunately, the MPEG down-sampling filter is not a good lowpass wavelet filter, because the corresponding highpass filter has an undesired behavior in the stop-band [81]. Attempts to design more selective wavelet filters have been made [61], [81], with expected benefits in terms of visual and coding performance for WSVC schemes.
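The difference in selectivity can be checked numerically. The sketch below compares the magnitude responses of the 9/7 analysis lowpass filter and of a 13-tap half-band filter with the coefficients commonly reported for the MPEG down-sampling filter; the latter taps should be treated as an assumption of this sketch.

import numpy as np

# CDF 9/7 analysis lowpass taps (rounded) vs. a 13-tap half-band filter
# with coefficients commonly reported for the MPEG down-sampling filter
# (treat these taps as an assumption of this sketch).
h_97 = np.array([0.026749, -0.016864, -0.078223, 0.266864, 0.602949,
                 0.266864, -0.078223, -0.016864, 0.026749])
h_mpeg = np.array([2, 0, -4, -3, 5, 19, 26, 19, 5, -3, -4, 0, 2]) / 64.0

def magnitude_db(h, n=512):
    """Magnitude response in dB, normalized to unit DC gain."""
    H = np.abs(np.fft.rfft(h, n))
    return 20 * np.log10(np.maximum(H / H[0], 1e-6))

k = 3 * 512 // 8   # rfft bin at 0.75*pi, well inside the stop-band
print(magnitude_db(h_97)[k], magnitude_db(h_mpeg)[k])  # the MPEG filter attenuates more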

In any case, depending on the particular filter used for spatial down-sampling, reduced spatial resolution decoded sequences will differ even at full quality, impairing fair objective comparisons. For the above reasons, peak SNR (PSNR) curves at intermediate spatio–temporal resolutions have been mainly used to evaluate the coding performance within a single family of architectures, while for a direct comparison between two different systems more time-demanding and costly visual comparisons have been preferred. In the following, we summarize the latest objective and subjective performance comparisons issued by the work on SVC systems as conducted in April 2006 by ISO/MPEG. Finally, we will also return to the possibility of increasing the fairness of objective comparisons by proposing a common reference solution.

Fig. 14. Combined scalability example results on four test sequences: City, Soccer, Crew, and Harbour, at different spatio–temporal resolutions and bit rates, using encapsulated bit stream decoding as described in [82].

1) Objective Comparison Results: Experiments have been conducted under the testing conditions described in document [82]. Three classes of experiments have been completed with the intention of evaluating the coding performance of different systems and tool settings. The first class of experiments, named combined scalability, aimed to test the combined use of spatial, temporal and quality scalabilities according to shared scalability requirements [4]. The comparison has been performed using JSVM 4.0 and the VidWav RS 2.0 in STP-tool configuration. Fig. 14 shows an example of the R-D curves, where only luma PSNR values are shown. The example considers different sequences, quality points and spatio–temporal resolutions, i.e., 4CIF-60 Hz, CIF-7.5 Hz, QCIF-15 Hz (original resolution is 4CIF-60 Hz in all cases). The complete set of results can be found in [3, Annex 1].

As mentioned in Section IV-B, the VidWav STP-tool configuration implements an inefficient ME and coding, which partially justifies the performance differences visible in the above mentioned PSNR plots. The other two groups of experiments, spatial scalability and SNR scalability, aimed to test the use and the R-D performance of single scalability tools when other forms of scalability are not requested. VidWav in STP-tool and in 2D + t configuration modes was tested. For the interested reader, complete results are reported in [3, Annex 2 and 3].

Where PSNR comparisons did not suffer from the problem of different original reference signals, mixed performances were observed, with a predominance of higher values for sequences obtained using JSVM.

2) Subjective Comparison Results: Visual tests conducted by ISO/MPEG involved 12 expert viewers and have been performed according to the guidelines specified in [83]. On the basis of these comparisons between VidWav RS 2.0 and JSVM 4.0, results appear on average superior for JSVM 4.0, with marginal gains in SNR-only settings and larger gains in combined scalability settings. A full description of such tests can be found in [3, Annex 4]. It should be said that, from a qualitative point of view, a skilled viewer could infer from which system a decoded image has been generated. In fact, for reasons ascribable to spatial filtering aspects, VidWav decoded frames are in general more detailed with respect to JSVM ones, but the bits not spent on background details can be used by JSVM to better patch motion model failures and thus to reduce the impact of the more objectionable motion artifacts for the used test material.

Fig. 15. Common reference: (a) generation and (b) representation.

3) Objective Measures With a Common Reference: A reasonable way to approach a fair objective comparison between two systems that adopt different reference video sequences s_A and s_B, generated from the same original sequence s_O, is to create a common weighted sequence s_C = βs_A + (1 − β)s_B. In particular, by selecting β = 1/2 it can be easily verified that ||s_C − s_A|| = ||s_C − s_B|| = ||s_A − s_B||/2. This means that s_A and s_B are both equally disadvantaged by the creation of the mean reference s_C. Fig. 15(a) shows the generation of the above three signals using two different lowpass filters, F_A and F_B, which are assumed here to be a lowpass wavelet filter and the MPEG down-sampling filter, respectively. As already discussed, the output s_B will have a more confined low-frequency content with respect to s_A. Fig. 15(b) is a projection of the decoding process which shows vertically the video frame frequency content and horizontally other kinds of differences (accounting for filter response discrepancies and spatial aliasing). Such differences can be considered small, since s_A and s_B are generated starting from the same reference s_O while targeting the same down-sampling objective. Thus, s_C can be considered halfway along the above projection, being a linear average. Now, as transform-based coding systems tend to reconstruct first the lower spectral components of the signal, it is plausible that, at a certain bit rate (represented in Fig. 15(b) by the dashed bold line), the VidWav reconstructed signal lies nearer to the JSVM reference than to its own reference. The converse is not possible, and this gives s_C the capability to compensate for this disparity. Therefore, the signal s_C can reasonably be used as a common reference for PSNR comparisons of decoded sequences generated by different SVC systems. These conjectures find a clear confirmation in experimental measures, as can be seen, e.g., in Fig. 16.
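A minimal numerical sketch of this common-reference measurement follows; the arrays standing in for the two references and the decoded sequences are synthetic placeholders.

import numpy as np

def psnr(x, y, peak=255.0):
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

s_a = np.random.rand(16, 64, 64) * 255            # wavelet-filtered reference (mock)
s_b = s_a + 2.0 * np.random.randn(16, 64, 64)     # MPEG-filtered reference (mock)
s_c = 0.5 * (s_a + s_b)                           # common (mean) reference
dec_a = s_a + 4.0 * np.random.randn(16, 64, 64)   # mock WSVC reconstruction
dec_b = s_b + 4.0 * np.random.randn(16, 64, 64)   # mock JSVM reconstruction
# Both decoders are now measured against the same, equally distant target:
print(psnr(dec_a, s_c), psnr(dec_b, s_c))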

VI. FURTHER INVESTIGATIONS ON WSVC SCHEMES

Fig. 16. PSNR comparison at QCIF resolution with the usage of own and common references.

In order to better identify the aspects or tools that could significantly improve WSVC performance, and to drive future investigations and improvements more precisely, a series of basic experiments has been conducted comparing JSVM and its most similar WSVC codec, namely STP-tool.

In a first experiment, a set of generic video sequences and still images was selected with the intent of comparing the two codecs only in terms of intra coding performance. In other words, we tested the ability to independently compress a single picture, which could be either a still image or a frame randomly selected from a video sequence. Each test image was coded using both codecs in a single-layer configuration, repeating the procedure for a range of different rate values. These values were selected so as to obtain a reconstruction quality ranging from 30 to 40 dB. On the basis of the quality of the decoded images, the test set was divided into two groups. The first group included all the images for which JSVM gave the best average coding performance, while the second included those for which the PSNR was in favor of the STP-tool codec. Not surprisingly, the first group contained only frames of video sequences, such as those used for the MPEG comparisons and sequences acquired with low resolution cameras. Conversely, the second group was characterized by the presence of all the tested still images and the high-definition (HD) video sequences acquired with devices featuring the latest digital technologies. Thus, the best performing DCT-based technologies appear to outperform wavelet-based ones for relatively smooth signals, and vice versa. This indicates that eligible applications for WSVC are those that produce or use high-definition/high-resolution content. The fact that wavelet-based image (intra) coding can offer comparable if not better performance when applied to HD content is compatible with the results reported by other independent studies, e.g., [84], [85]. In [86], the STP-tool has been used to test the performance of WSVC approaches for the in-home distribution of HD video sequences. As can be noticed from Fig. 17, the SNR performance of the proposed codec is comparable with that offered by state-of-the-art single point encoding techniques, e.g., H.264/AVC.

Fig. 17. Quality (SNR) scalability results for the HD video sequence CrowdRun.

Another set of tests has been conducted with the intent of analyzing the efficiency of the multiresolution spatial transform in compressing highpass MC temporal subbands (or hierarchical B-frame residues). In this case, coding efficiency was evaluated by comparing the performance only on the residual image, by using the same predictor for both encoders and by changing the filters of the spatial wavelet transform. The main consideration arising from the obtained results is that commonly used filters, such as the 9/7, 5/3 and 2/2 (Haar), are not as effective on highpass temporal subbands (or B-residues) as they are on lowpass temporal subbands (or frames). This is mainly because of the nature of the signal being considered, which contains the residue information produced by a local adaptation procedure, while the wavelet filter operates as a global transformation of the considered temporal subband. This means, for example, that if we want to encode a frame containing reconstruction errors for only one macroblock, after the spatial transformation these errors will be spread across all the spatial subbands, and the corresponding error support will also be extended because of the filter lengths. A more detailed analysis of the energy of the spatial subbands revealed that, contrary to what happens for lowpass temporal frames, most of the energy is located in the highest spatial resolution subbands. This is in accordance with the multiresolution interpretation of the used spatial transform, but because of their size these subbands cannot be efficiently encoded. Additionally, this prevents inter-band prediction mechanisms from working correctly.
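This energy behavior is easy to reproduce: the sketch below, using the 9/7-type bior4.4 wavelet of PyWavelets as an assumption, compares the per-level energy distribution of a smooth frame-like signal with that of a localized residual-like one.

import numpy as np
import pywt

def level_energies(frame, wavelet="bior4.4", levels=3):
    """Fraction of energy per DWT resolution level, coarsest LL first."""
    coeffs = pywt.wavedec2(frame, wavelet, level=levels)
    e = [float(np.sum(coeffs[0] ** 2))]
    e += [float(sum(np.sum(d ** 2) for d in detail)) for detail in coeffs[1:]]
    total = sum(e)
    return [round(x / total, 3) for x in e]

smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # frame-like
residual = np.zeros((64, 64))
residual[24:40, 24:40] = np.random.randn(16, 16)                 # one "bad" block
print(level_energies(smooth))    # energy concentrated at the coarse levels
print(level_energies(residual))  # energy pushed toward the finest subbands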

VII. PERSPECTIVES

Although considerable efforts have been made to improve WSVC solutions, there are reasons to say that we could be at the beginning of a prolific research field with many possible applications. Present coding performance is not too far from that of JSVM, which implements the technology defined in the forthcoming SVC standard. Hybrid video coding (especially that based on block-based spatial DCT and block-based temporal ME and MC) has experienced a strong increase in performance as a consequence of the introduction of many optimized tools and configuration parameters. While some of these solutions have not yet been tested on a common reference architecture such as VidWav, others were necessarily discarded as inadequate for the whole-frame nature of the spatial wavelet transform. For example, advanced block-based predictions with multiple block-mode selection, and advanced block-based ME, MC and coding, can be fully adopted with block-based transforms, while they are not natural with full-frame transforms. An example for which an attempt has been made is represented by the “intra-prediction” mode. This tool works remarkably well for JSVM but, if simply imported into a VidWav architecture, it does not give the expected results. Wavelet-based video coding systems would require alternative prediction modes and motion models. Existing solutions still appear preliminary and, in order to maximize their benefit, they need to be effectively integrated in the complete system. When compared to JSVM, existing wavelet codecs lack several performance optimization tools. In particular, R-D optimized solutions can hardly incorporate those tools which are not fully compatible with the whole-frame approach used in WSVC systems. Therefore, both 1) activities directed at finding new tools and solutions and 2) activities directed at shaping and optimizing the use of existing tools seem to be important for the improvement and development of effective WSVC systems.

A. New Application Potentials

There are a number of functionalities and applications which can be effectively targeted by WSVC technology and which seem to offer comparable if not more natural solutions with respect to their hybrid DCT+ME/MC SVC counterparts. A brief list is reported here, while a more complete overview can be found in [3].

• HD material storage and distribution. Thanks to the built-in scalability brought by wavelets, it is possible to encode a given content with a nonpredefined scalability range and at a quality up to near lossless. At the same time, a very low definition decoding is possible, allowing, for example, a quick preview of the content.

• The possibility of using nondyadic wavelet decompositions [87], which enables the support of multiple HD formats, is an interesting feature not only for the aforementioned application but also for video surveillance and for mobile video applications.

• WSVC with temporal filtering can also be adapted so that some of the allowable operating points can be decoded by J2K and MJ2K (MJ2K is only “intra” coded video using J2K; if J2K compatibility is preserved, MJ2K compatibility will also remain).

• Enabling efficient similarity search in large video databases. Different methods based on wavelets exist. Instead of searching full resolution videos, one can search low quality videos (spatially, temporally, and SNR reduced) to accelerate the search operation. Starting from salient points on low quality videos, similarities can be refined in space and time.

• Multiple Description Coding, which would lead to better error resilience. By using the lifting scheme, it is easy to replicate video data so as to transmit two similar bit streams. Using intelligent splitting, one can decode the two bit streams independently or jointly (spatially, temporally, and/or SNR reduced).

• Space-variant resolution adaptive decoding. When encoding the video material, it is possible to decode only a certain region at high spatial resolution while keeping the surrounding areas at a lower resolution.


VIII. CONCLUSION

This paper has provided a picture of the various tools that have been designed in recent years for scalable video compression. It has focused on multiresolution representations making use of the wavelet transform and its extensions to handle motion in image sequences. A presentation of the benefits of wavelet-based approaches has also been offered from various application perspectives. The paper has also shown how it is possible to organize the various multiresolution tools into different architecture families, depending on the (space/time) order in which the signal transformation is operated. All such architectures make it possible to generate embedded bit streams that can be parsed to handle combined quality, spatial and temporal scalability requirements. This has been extensively simulated during the exploration activity on wavelet video coding carried out by ISO/MPEG, demonstrating that credible performance results could be achieved. Nevertheless, when compared to the current SVC standardization effort, which provides a pyramidal extension of MPEG-4/AVC, it appears that, though close, the performance of wavelet-based video codecs remains inferior. This seems partly due to:

• the level of fine (perceptual and/or R-D theoretic) optimization that can be achieved for hybrid (DCT+ME/MC) codecs, because of the deep understanding of how all their components interact;

• the spectral characteristics of standard resolution video material, on which standard multiresolution (wavelet-based) representations already exhibit lower performance even in intra mode (some preliminary experiments conducted on high resolution material show increasing performance for WT approaches);

• the inability of the experimented multiresolution (WT) approaches to effectively represent the information locality in image sequences (block-based transforms can instead be more easily adapted to handle the nonstationarity of visual content, in particular by managing uncovered/covered areas through a local and effective representation of intra-coded blocks).

This last argument probably deserves some further consideration. For image compression, WT-based approaches show quite competitive performance due to the energy compaction ability of the WT in handling piecewise polynomials, which are known to describe many natural images well. In video sequences, the adequacy of such a model falls apart unless a precise alignment of moving object trajectories can be achieved. This might remain only a challenge since, as for any segmentation problem, it is difficult to achieve in a robust fashion, due to the complex information modeling which is often necessary. As an alternative, new transform families [88] could also be studied so as to model the specificity of spatio–temporal visual information.

This paper is an attempt to summarize the current understanding in wavelet video coding as carried out collectively in the framework of ISO/MPEG standardization. While it has not yet been possible to establish wavelet video coding as an alternative video coding standard for SVC, its performance appears on the rise when compared to previous attempts to establish credibly competitive video coding solutions with respect to hybrid motion-compensated transform-based approaches. We hope that the picture offered will help many future researchers to undertake work in the field and shape the future of wavelet video compression.

ACKNOWLEDGMENT

The authors would like to acknowledge the efforts of colleagues and friends who have helped to establish the state of the art in the field as described in this work. Particular thanks go to all those institutions (academia, funding organizations, research centers, and industries) which continue to support work in this field, and in particular to the people who have participated in the ISO/MPEG wavelet video coding exploration activity over several years, e.g., C. Guillemot, W.-J. Han, J. Ohm, S. Pateux, B. Pesquet-Popescu, J. Reichel, M. van der Schaar, D. Taubman, J. Woods, J.-Z. Xu, just to name a few.

REFERENCES

[1] “Registered Responses to the Call for Proposals on Scalable Video Coding,” Munich, Germany, ISO/MPEG Video, Tech. Rep. M10569/S01-S24, Mar. 2004.
[2] “Joint Scalable Video Model (JSVM) 6.0,” Montreux, Switzerland, ISO/MPEG Video, Tech. Rep. N8015, Apr. 2006.
[3] R. Leonardi, T. Oelbaum, and J.-R. Ohm, “Status report on wavelet video coding exploration,” Montreux, Switzerland, Tech. Rep. N8043, Apr. 2006.
[4] “Applications and Requirements for Scalable Video Coding,” ISO/IEC JTC1/SC29/WG11, 71st MPEG Meeting, Hong Kong, China, ISO/MPEG Requirements, Tech. Rep. N6880, Jan. 2005.
[5] “Description of Core Experiments in MPEG-21 Scalable Video Coding,” ISO/IEC JTC1/SC29/WG11, 69th MPEG Meeting, Redmond, WA, USA, ISO/MPEG Video, Tech. Rep. N6521, Jul. 2004.
[6] P. Burt and E. Adelson, “The Laplacian pyramid as a compact image code,” IEEE Trans. Commun., vol. COM-31, no. 4, pp. 532–540, Apr. 1983.
[7] S. Mallat, A Wavelet Tour of Signal Processing. San Diego, CA: Academic, 1998.
[8] P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs, NJ: Prentice-Hall, 1992.
[9] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[10] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.
[11] W. Sweldens, “The lifting scheme: A construction of second generation wavelets,” SIAM J. Math. Anal., vol. 29, no. 2, pp. 511–546, 1997.
[12] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps,” J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247–269, 1998.
[13] A. Calderbank, I. Daubechies, W. Sweldens, and B.-L. Yeo, “Wavelet transforms that map integers to integers,” Appl. Comput. Harmon. Anal., vol. 5, no. 3, pp. 332–369, 1998.
[14] H. Heijmans and J. Goutsias, “Multiresolution signal decomposition schemes. Part 2: Morphological wavelets,” CWI, Amsterdam, The Netherlands, Tech. Rep. PNA-R9905, Jun. 1999.
[15] J.-R. Ohm, Multimedia Communication Technology. New York: Springer-Verlag, 2003.
[16] J. Ohm, “Three-dimensional subband coding with motion compensation,” IEEE Trans. Image Process., vol. 3, no. 9, pp. 559–571, Sep. 1994.
[17] S.-J. Choi and J. Woods, “Motion-compensated 3-D subband coding of video,” IEEE Trans. Image Process., vol. 8, no. 2, pp. 155–167, Feb. 1999.
[18] B.-J. Kim, Z. Xiong, and W. Pearlman, “Low bit-rate scalable video coding with 3-D set partitioning in hierarchical trees (3D SPIHT),” IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 8, pp. 1374–1387, Dec. 2000.
[19] A. Secker and D. Taubman, “Lifting-based invertible motion adaptive transform (LIMAT) framework for highly scalable video compression,” IEEE Trans. Image Process., vol. 12, no. 12, pp. 1530–1542, Dec. 2003.
[20] G. Pau, C. Tillier, B. Pesquet-Popescu, and H. Heijmans, “Motion compensation and scalability in lifting-based video coding,” Signal Process.: Image Commun., vol. 19, no. 7, pp. 577–600, Aug. 2004.
[21] B. Pesquet-Popescu and V. Bottreau, “Three-dimensional lifting schemes for motion compensated video compression,” in Proc. ICASSP, Salt Lake City, UT, May 2001, vol. 3, pp. 1793–1796.
[22] J. Ohm, “Motion-compensated wavelet lifting filters with flexible adaptation,” in Proc. Tyrrhenian Int. Workshop Digital Commun. (IWDC), Capri, Italy, Sep. 2002, pp. 113–120.
[23] Y. Zhan, M. Picard, B. Pesquet-Popescu, and H. Heijmans, “Long temporal filters in lifting schemes for scalable video coding,” ISO/IEC JTC1/SC29/WG11, 61st MPEG Meeting, Klagenfurt, Austria, Tech. Rep. M8680, Jul. 2002.
[24] J. Xu, Z. Xiong, S. Li, and Y.-Q. Zhang, “Three-dimensional embedded subband coding with optimal truncation (3D ESCOT),” Appl. Comput. Harmon. Anal., vol. 10, pp. 290–315, 2001.
[25] H. Schwarz, D. Marpe, and T. Wiegand, “Scalable extension of H.264/AVC,” ISO/IEC JTC1/SC29/WG11, 68th MPEG Meeting, Munich, Germany, Tech. Rep. M10569/S03, Mar. 2004.
[26] R. Xiong, F. Wu, S. Li, Z. Xiong, and Y.-Q. Zhang, “Exploiting temporal correlation with block-size adaptive motion alignment for 3-D wavelet coding,” in Proc. Visual Commun. Image Process., San Jose, CA, Jan. 2004, vol. 5308, pp. 144–155.
[27] A. Secker and D. Taubman, “Highly scalable video compression using a lifting-based 3-D wavelet transform with deformable mesh motion compensation,” in Proc. Int. Conf. Image Process. (ICIP), Sep. 2002, vol. 3, pp. 749–752.
[28] L. Luo, F. Wu, S. Li, Z. Xiong, and Z. Zhuang, “Advanced motion threading for 3-D wavelet video coding,” Signal Process.: Image Commun., vol. 19, no. 7, pp. 601–616, 2004.
[29] D. Turaga, M. van der Schaar, and B. Pesquet-Popescu, “Temporal prediction and differential coding of motion vectors in the MCTF framework,” in Proc. Int. Conf. Image Process. (ICIP), Barcelona, Spain, Sep. 2003, vol. 2, pp. 57–60.
[30] V. Valéntin, M. Cagnazzo, M. Antonini, and M. Barlaud, “Scalable context-based motion vector coding for video compression,” in Proc. Picture Coding Symp. (PCS), Apr. 2003, pp. 63–68.
[31] J. Xu, R. Xiong, B. Feng, G. Sullivan, M.-C. Lee, F. Wu, and S. Li, “3D subband video coding using barbell lifting,” ISO/IEC JTC1/SC29/WG11, 68th MPEG Meeting, Munich, Germany, Tech. Rep. M10569/S05, Mar. 2004.
[32] C. Tillier, B. Pesquet-Popescu, and M. van der Schaar, “Highly scalable video coding by bidirectional predict-update 3-band schemes,” in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Montreal, Canada, May 2004, vol. 3, pp. 125–128.
[33] M. van der Schaar and D. Turaga, “Unconstrained motion compensated temporal filtering (UMCTF) framework for wavelet video coding,” in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Hong Kong, Apr. 2003, vol. 3, pp. 81–84.
[34] N. Mehrseresht and D. Taubman, “Adaptively weighted update steps in motion compensated lifting based scalable video compression,” in Proc. Int. Conf. Image Process. (ICIP), Barcelona, Spain, Sep. 2003, vol. 3, pp. 771–774.
[35] D. Maestroni, M. Tagliasacchi, and S. Tubaro, “In-band adaptive update step based on local content activity,” in Proc. Visual Commun. Image Process., Beijing, China, Jul. 2005, vol. 5960, pp. 169–176.
[36] “Wavelet Codec Reference Document and Software Manual,” ISO/IEC JTC1/SC29/WG11, 73rd MPEG Meeting, Poznan, Poland, ISO/MPEG Video, Tech. Rep. N7334, Jul. 2005.
[37] H. Schwarz, D. Marpe, and T. Wiegand, “Analysis of hierarchical B-pictures and MCTF,” presented at the Int. Conf. Multimedia Expo (ICME), Toronto, ON, Canada, Jul. 2006.
[38] D. Turaga, M. van der Schaar, and B. Pesquet-Popescu, “Complexity scalable motion compensated wavelet video encoding,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 8, pp. 982–993, Aug. 2005.
[39] C. Parisot, M. Antonini, and M. Barlaud, “Motion-compensated scan based wavelet transform for video coding,” in Proc. Tyrrhenian Int. Workshop Digital Commun. (IWDC), Capri, Italy, Sep. 2002, pp. 121–128.
[40] G. Pau, B. Pesquet-Popescu, M. van der Schaar, and J. Viéron, “Delay-performance tradeoffs in motion-compensated scalable subband video compression,” presented at the ACIVS, Brussels, Belgium, Sep. 2004.
[41] J. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.
[42] A. Said and W. A. Pearlman, “A new fast and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996.
[43] S.-T. Hsiang and J. W. Woods, “Embedded image coding using zeroblocks of subband/wavelet coefficients and context modeling,” in Proc. MPEG-4 Workshop Exhibition ISCAS, Geneva, Switzerland, May 2000, vol. 3, pp. 662–665.
[44] S. D. Servetto, K. Ramchandran, and M. T. Orchard, “Image coding based on a morphological representation of wavelet data,” IEEE Trans. Image Process., vol. 8, no. 9, pp. 1161–1174, Sep. 1999.
[45] F. Lazzaroni, R. Leonardi, and A. Signoroni, “High-performance embedded morphological wavelet coding,” IEEE Signal Process. Lett., vol. 10, no. 10, pp. 293–295, Oct. 2003.
[46] F. Lazzaroni, R. Leonardi, and A. Signoroni, “High-performance embedded morphological wavelet coding,” in Proc. Tyrrhenian Int. Workshop Digital Commun. (IWDC), Capri, Italy, Sep. 2002, pp. 319–326.
[47] F. Lazzaroni, A. Signoroni, and R. Leonardi, “Embedded morphological dilation coding for 2-D and 3-D images,” in Proc. Visual Commun. Image Process., San José, CA, Jan. 2002, vol. 4671, pp. 923–934.
[48] D. Taubman, “High performance scalable image compression with EBCOT,” IEEE Trans. Image Process., vol. 9, no. 7, pp. 1158–1170, Jul. 2000.
[49] D. Taubman, E. Ordentlich, M. Weinberger, and G. Seroussi, “Embedded block coding in JPEG 2000,” Signal Process.: Image Commun., vol. 17, pp. 49–72, Jan. 2002.
[50] J. Li and S. Lei, “Rate-distortion optimized embedding,” in Proc. Picture Coding Symp., Sep. 1997, pp. 201–206.
[51] S.-T. Hsiang and J. Woods, “Embedded video coding using invertible motion compensated 3-D subband/wavelet filter bank,” Signal Process.: Image Commun., vol. 16, no. 8, pp. 705–724, May 2001.
[52] P. Chen and J. W. Woods, “Bidirectional MC-EZBC with lifting implementation,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 10, pp. 1183–1194, Oct. 2004.
[53] N. Adami, M. Brescianini, M. Dalai, R. Leonardi, and A. Signoroni, “A fully scalable video coder with inter-scale wavelet prediction and morphological coding,” in Proc. Visual Commun. Image Process., Beijing, China, Jul. 2005, vol. 5960, pp. 535–546.
[54] V. Bottreau, M. Benetiere, B. Felts, and B. Pesquet-Popescu, “A fully scalable 3-D subband video codec,” in Proc. Int. Conf. Image Process. (ICIP), Thessaloniki, Greece, Oct. 2001, vol. 2, pp. 1017–1020.
[55] K. Ho and D. Lun, “Efficient wavelet based temporally scalable video coding,” in Proc. Int. Conf. Image Process. (ICIP), New York, USA, Aug. 2002, vol. 1, pp. 881–884.
[56] N. Cammas, “Codage vidéo scalable par maillages et ondelettes t + 2D,” Ph.D. dissertation, Univ. Rennes, Rennes, France, Nov. 2004 [Online]. Available: http://www.inria.fr/rrrt/tu-1122.html
[57] G. Pau, “Ondelettes et décompositions spatio–temporelles avancées; application au codage vidéo scalable,” Ph.D. dissertation, Ecole Nat. Supér. Télécommun., Paris, France, 2007.
[58] M. Flierl and B. Girod, “Video coding with motion-compensated lifted wavelet transforms,” Signal Process.: Image Commun., vol. 19, no. 7, pp. 561–575, Aug. 2004.
[59] J. Barbarien, A. Munteanu, F. Verdicchio, Y. Andreopoulos, J. Cornelis, and P. Schelkens, “Motion and texture rate allocation for prediction-based scalable motion-vector coding,” Signal Process.: Image Commun., vol. 20, no. 4, pp. 315–342, Apr. 2005.
[60] M. Mrak, N. Sprljan, G. Abhayaratne, and E. Izquierdo, “Scalable generation and coding of motion vectors for highly scalable video coding,” presented at the Picture Coding Symp. (PCS), San Francisco, CA, 2004.
[61] V. Bottreau, C. Guillemot, R. Ansari, and E. Francois, “SVC CE5: Spatial transform using three lifting steps filters,” ISO/IEC JTC1/SC29/WG11, 70th MPEG Meeting, Palma de Mallorca, Spain, Tech. Rep. M11328, Oct. 2004.
[62] Y. Wu and J. Woods, “Aliasing reduction for scalable subband/wavelet video coding,” ISO/IEC JTC1/SC29/WG11, 73rd MPEG Meeting, Poznan, Poland, Tech. Rep. M12376, Jul. 2005.
[63] Y. Andreopoulos, A. Munteanu, J. Barbarien, M. van der Schaar, J. Cornelis, and P. Schelkens, “In-band motion compensated temporal filtering,” Signal Process.: Image Commun., vol. 19, no. 7, pp. 653–673, Aug. 2004.
[64] H.-W. Park and H.-S. Kim, “Motion estimation using low-band-shift method for wavelet-based moving-picture coding,” IEEE Trans. Image Process., vol. 9, pp. 577–587, 2000.
[65] Y. Andreopoulos, A. Munteanu, G. V. D. Auwera, J. Cornelis, and P. Schelkens, “Complete-to-overcomplete discrete wavelet transforms: Theory and applications,” IEEE Trans. Signal Process., vol. 53, no. 4, pp. 1398–1412, 2005.
[66] “Subjective test results for the CfP on Scalable Video Coding technology,” ISO/IEC JTC1/SC29/WG11, 68th MPEG Meeting, Munich, Germany, Tech. Rep. N6383, Mar. 2004.
[67] N. Mehrseresht and D. Taubman, “An efficient content-adaptive MC 3D-DWT with enhanced spatial and temporal scalability,” in Proc. Int. Conf. Image Process. (ICIP), Oct. 2004, vol. 2, pp. 1329–1332.
[68] D. Taubman, N. Mehrseresht, and R. Leung, “SVC technical contribution: Overview of recent technology developments at UNSW,” ISO/IEC JTC1/SC29/WG11, 69th MPEG Meeting, Redmond, WA, USA, Tech. Rep. M10868, Jul. 2004.
[69] N. Sprljan, M. Mrak, T. Zgaljic, and E. Izquierdo, “Software proposal for Wavelet Video Coding Exploration Group,” ISO/IEC JTC1/SC29/WG11, 75th MPEG Meeting, Bangkok, Thailand, Tech. Rep. M12941, Jan. 2006.
[70] M. Mrak, N. Sprljan, T. Zgaljic, N. Ramzan, S. Wan, and E. Izquierdo, “Performance evidence of software proposal for Wavelet Video Coding Exploration Group,” ISO/IEC JTC1/SC29/WG11, 76th MPEG Meeting, Montreux, Switzerland, Tech. Rep. M13146, Apr. 2006.
[71] C. Ong, S. Shen, M. Lee, and Y. Honda, “Wavelet video coding—Generalized spatial temporal scalability (GSTS),” ISO/IEC JTC1/SC29/WG11, 72nd MPEG Meeting, Busan, Korea, Tech. Rep. M11952, Apr. 2005.
[72] “Scalable Video Model v2.0,” ISO/IEC JTC1/SC29/WG11, 69th MPEG Meeting, Redmond, WA, USA, ISO/MPEG Video, Tech. Rep. N6520, Jul. 2004.
[73] M. N. Do and M. Vetterli, “Framing pyramids,” IEEE Trans. Signal Process., vol. 51, no. 9, pp. 2329–2342, Sep. 2003.
[74] M. Flierl and P. Vandergheynst, “An improved pyramid for spatially scalable video coding,” in Proc. Int. Conf. Image Process. (ICIP), Genova, Italy, Sep. 2005, vol. 2, pp. 878–881.
[75] G. Rath and C. Guillemot, “Representing Laplacian pyramids with varying amount of redundancy,” presented at the Eur. Signal Process. Conf. (EUSIPCO), Florence, Italy, Sep. 2006.
[76] N. Adami, M. Brescianini, R. Leonardi, and A. Signoroni, “SVC CE1: STool—A native spatially scalable approach to SVC,” ISO/IEC JTC1/SC29/WG11, 70th MPEG Meeting, Palma de Mallorca, Spain, Tech. Rep. M11368, Oct. 2004.
[77] “Report of the subjective quality evaluation for SVC CE1,” ISO/IEC JTC1/SC29/WG11, 70th MPEG Meeting, Palma de Mallorca, Spain, ISO/MPEG Video, Tech. Rep. N6736, Oct. 2004.
[78] R. Leonardi and S. Brangoulo, “Wavelet codec reference document and software manual v2.0,” ISO/IEC JTC1/SC29/WG11, 74th MPEG Meeting, Nice, France, Tech. Rep. N7573, Oct. 2005.
[79] D. Zhang, S. Jiao, J. Xu, F. Wu, W. Zhang, and H. Xiong, “Mode-based temporal filtering for in-band wavelet video coding with spatial scalability,” in Proc. Visual Commun. Image Process., Beijing, China, Jul. 2005, vol. 5960, pp. 355–363.
[80] M. Beermann and M. Wien, “Wavelet video coding, EE4: Joint reduction of ringing and blocking,” ISO/IEC JTC1/SC29/WG11, 74th MPEG Meeting, Nice, France, Tech. Rep. M12640, Oct. 2005.
[81] M. Li and T. Nguyen, “Optimal wavelet filter design in scalable video coding,” in Proc. Int. Conf. Image Process. (ICIP), Genova, Italy, Sep. 2005, pp. 473–476.
[82] “Description of testing in wavelet video coding,” ISO/IEC JTC1/SC29/WG11, 75th MPEG Meeting, Bangkok, Thailand, ISO/MPEG Video, Tech. Rep. N7823, Jan. 2006.
[83] C. Fenimore, V. Baroncini, T. Oelbaum, and T. K. Tan, “Subjective testing methodology in MPEG video verification,” in Proc. Appl. Digital Image Process. XXVII, Aug. 2004, vol. 5558, pp. 503–511.
[84] M. Ouaret, F. Dufaux, and T. Ebrahimi, “On comparing JPEG2000 and intraframe AVC,” presented at the Appl. Digit. Image Process. XXIX, Genova, Italy, Aug. 2006, 63120U.
[85] D. Marpe, S. Gordon, and T. Wiegand, “H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application areas,” in Proc. Int. Conf. Image Process. (ICIP), Genova, Italy, Sep. 2005, vol. 1, pp. 593–596.
[86] L. Lima, F. Manerba, N. Adami, A. Signoroni, and R. Leonardi, “Wavelet-based encoding for HD applications,” presented at the Int. Conf. Multimedia Expo (ICME), Beijing, China, Jul. 2006.
[87] G. Pau and B. Pesquet-Popescu, “Image coding with rational spatial scalability,” presented at the Eur. Signal Process. Conf. (EUSIPCO), Florence, Italy, Sep. 2006.
[88] B. Wang, Y. Wang, I. Selesnick, and A. Vetro, “An investigation of 3-D dual-tree wavelet transform for video coding,” in Proc. Int. Conf. Image Process. (ICIP), Singapore, Oct. 2004, pp. 1317–1320.

Nicola Adami (M’01) received the Laurea degree in electronic engineering and the Ph.D. degree in information engineering from the University of Brescia, Brescia, Italy, in 1998 and 2002, respectively.

From 2002 to the end of 2003, he was a Fellow at the University of Brescia, where he has been an Assistant Professor in the Engineering Faculty since 2003. His research interests are signal processing, multimedia content analysis and representation, and image and video coding. He has been involved in several European granted projects and has been participating in the ISO/MPEG and JPEG standardization works.

Alberto Signoroni (M’03) received the Laurea degree in electronic engineering and the Ph.D. degree in information engineering from the University of Brescia, Brescia, Italy, in 1997 and 2001, respectively.

Since March 2002, he has been an Assistant Professor in Signal Processing and Telecommunications at the Department of Electronics for the Automation, University of Brescia. His research interests are in image processing, image and video coding systems, medical image coding, and medical image analysis.

Riccardo Leonardi (M’88) received the Diploma and Ph.D. degrees in electrical engineering from the Swiss Federal Institute of Technology, Lausanne, Switzerland, in 1984 and 1987, respectively.

He spent one year (1987–1988) as a Postdoctoral Fellow with the Information Research Laboratory, University of California, Santa Barbara. From 1988 to 1991, he was a Member of Technical Staff at AT&T Bell Laboratories, performing research activities on image communication systems. In 1991, he returned briefly to the Swiss Federal Institute of Technology, Lausanne, to coordinate the research activities of the Signal Processing Laboratory. Since February 1992, he has been with the University of Brescia, leading research and teaching in the field of telecommunications. His main research interests cover the field of digital signal processing applications, with specific expertise in visual communications and content-based analysis of audio–visual information. He has published more than 100 papers on these topics. Since 1997, he has also acted as an evaluator and auditor for the European Union IST and COST programmes.