transmitting video surveillance sequences based on jpeg 2000 conditional replenishment

1

Transmitting video surveillance sequences

based on JPEG 2000 conditional replenishmentFrancois-Olivier Devaux, Jerome Meessen, Christophe Parisot, Jean-Francois Delaigle,

Benoit Macq and Christophe De Vleeschouwer

Abstract

In many video surveillance applications, images are storedlocally and are likely to be accessed

remotely and possibly interactively upon user request. In such context, the JPEG 2000 still image

compression format is attractive because it provides high coding efficiency, while supporting a highly

flexible access to each individual image, in terms of spatiallocation, quality level, as well as resolution.

However, when consecutive images constituting a video sequence have to be accessed, the fact that

JPEG 2000 does not exploit the temporal redundancy inherentto the image sequence dramatically

penalizes the transmission efficiency. This paper proposesa solution to mitigate this drawback when

conveying a video surveillance sequence directly through JPEG 2000 codestream segments. The method is

based on conditional replenishment, but is original in two main aspects. First, the proposed replenishment

method exploits the specificities of the JPEG 2000 codestream structure to balance the size (in terms of

code-blocks) and the accuracy (in terms of bit-planes) of the replenishment in a rate-distortion optimal

way. Second, it takes into account the still background nature of video surveillance content by maintaining

two reference images at the receiver. One reference is the last reconstructed frame, as proposed in [2].

The other is a dynamically-computed estimate of the scene background, which helps to recover the

background after a moving object has left the scene. Simulation results demonstrate the efficiency and

flexibility of the approach in terms of transmission resources allocation. As an additional contribution,

we demonstrate that the embedded nature of the JPEG 2000 codestream easily supports prioritization of

content that is known to be semantically relevant. An interesting aspect of JPEG 2000-based prioritization

This work has been funded by the EU commission on the scope of the FP6 IST-2003-507204 project WCAM [1] “Wireless

Cameras and Audio-Visual Seamless Networking”.

F.O. Devaux and C. De Vleeschouwer are funded by the Belgian NSF.

F.O. Devaux, C. De Vleeschouwer and B. Macq are with the Communications and Remote Sensing Laboratory (TELE),

Universite catholique de Louvain (UCL), Belgium. E-mail:{devaux,devlees,macq}@tele.ucl.ac.be.

J. Meessen,C. Parisot and J.F. Delaigle are with Multitel A.S.B.L, Belgium. E-mail: {meessen,parisot,delaigle}@multitel.be

September 28, 2006 DRAFT

2

is that it can be regulated a posteriori, after the codestream generation, based on user needs or rights. These

results encourage the development of integrated and entirely JPEG 2000-based storage and transmission

video surveillance systems, without the need to transcode the content to an MPEG-like format before its

transmission.

Index Terms

Replenishment, JPEG 2000, Region of Interest, Segmentation, Intra Coding, Semantic Based Coding,

Adaptive Delivery

I. I NTRODUCTION

Nowadays, an increasing number of video surveillance systems use digital video coding standards and

IP networks to compress and transmit a huge amount of video data from cameras and storage servers

to a wide variety of terminals, from control rooms to wireless PDAs. While Motion JPEG and MPEG-2

codecs have been largely deployed, MPEG-4 AVC and JPEG 2000 codecs are now emerging in video

surveillance devices and systems.

Motion JPEG 2000 (MJ2), the video file format encapsulating JPEG 2000 frames, presents several

important and attractive features for video surveillance systems [3] [4]. Compared to MPEG-based

systems, it provides efficient Regions of Interest (RoI) coding, as well as fine-grained temporal, spatial,

resolution and quality scalability [5] [6]. The coded bitstream can easily be parsed and adapted in real-time

following each of these scalabilities without the need of expensive transcoding operations. This enables

the server to optimize the transmitted video quality according to the client decoding capabilities and the

varying network resources with a minimum impact on its processing requirements. Furthermore, MJ2

supports direct access to each individual frame of the sequence and provides state of the art compression

efficiency for countries where inter-frame coding techniques are not recognized by courts as admissible

evidences [7].

Some recent papers have studied compression and transmission systems exploiting the JPEG 2000 RoI

coding and multi-layer features [8] [9] [10]. These approaches promote the delivery of higher quality for

mobile objects, considered as RoI, than for the other regions when the bandwidth is limited. Separate

transmission of the RoI and non-RoI regions have also been proposed in the SPRITE coding framework

of the MPEG-4 object-based coding strategy [11] [12] [13].

In this paper, we focus on JPEG 2000 video surveillance systemswith fixed cameras. Rather than

transmitting each frame independently to the clients as it is generally done in the literature for JPEG 2000


3

based systems, we adopt a conditional replenishment schemeto exploit the temporal correlation of the

video sequence. As a first contribution, we propose a rate-distortion optimal strategy to select the most

profitable packets to transmit. As a second contribution, we provide the client with two references, the

previous reconstructed frame and an estimation of the current scene background calculated at the server

side, which significantly improves the transmission system rate-distortion performances. To the best of

our knowledge, this paper is the first to consider multiple references in a replenishment framework.

However, multiple references have successfully been used in the AVC context [14] [15]. As a third

important contribution, the server exploits the scalability of JPEG 2000 to allocate the transmission

resources according to some a priori knowledge it has about the semantic relevance of the content.

Semantically important areas of the video sequences are alsodenoted Regions of Interest (RoI) in the

following. Semantic video analysis has already been used to improve video transmissions [16], but in

most cases the semantic knowledge is used prior to encoding.In contrast, we propose to exploit the

semantic information after the encoding step to perform JPEG 2000 packet prioritization. Such approach

makes it possible to transmit several versions of a single compressed sequence, each being adapted to

distinct user interests.

To summarize, our study considers how to implement a multi-reference replenishment scheme in a

JPEG 2000 environment, and demonstrates the relevance of the approach in scenarios capturing the video

sequence with still cameras, as often encountered in a videosurveillance context. The goal of our work

is not to compete with other existing video coding systems like AVC, but to propose a rate-distortion

optimized transmission system adapted to a JPEG 2000 video surveillance environment. Our simulations

encourage the deployment of such video surveillance systems taking advantage of the JPEG 2000 features

throughout the acquisition, analysis and transmission chain.

This paper is structured as follows. In Section II, we present an overview of the proposed replenishment

system. Section III describes the segmentation technique, used both to define the background reference

picture and to assign level of importance to scene areas. In Section IV, we remind the JPEG 2000

concepts useful for this work, and propose three replenishment methods. The first is conventional, the

second relies on a background estimation and the third exploits semantic information to prioritize content

replenishment. Section V presents the simulation results. Conclusions are provided in Section VI.

II. SYSTEM OVERVIEW

As explained in the previous section, the purpose of our paper is to explore how JPEG 2000 can support

the efficient transmission of video sequences. As a still image compression standard, JPEG 2000 encodes


https://www.researchgate.net/publication/3308836_Semantic_video_analysis_for_adaptive_content_delivery_and_automatic_description?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==

4

the video frames independently, and does not exploit the potential temporal correlation existing between

consecutive frames. The approach makes the access to each individual image direct and flexible, but

penalizes the costs associated to the transmission of an entire video sequence. To mitigate this drawback,

we propose to adopt a rate-distortion formalism so as to restrict the transmission of each image to the

data units that bring a sufficient benefit per unit of transmission cost.

Our approach follows the conditional replenishment principle [2] in that only the parts of the current

image that significantly differ from a reference maintained at the receiver are transmitted. However, our

work extends the original replenishment scheme in two majoraspects. First, it exploits the specificities

of the JPEG 2000 standard in that, for a given bit budget, it balances the size (in terms of code-blocks)

and the accuracy (in terms of bit-planes) of the replenishment in a rate-distortion optimal way. Second,

it proposes to maintain two reference images at the receiverinstead of one. One reference is the last

reconstructed frame, as proposed in [2]. The second reference is an estimate of the scene background and

appears to bring significant benefits in surveillance scenarios. As an additional contribution, our study

demonstrates the capability to prioritize the refresh of semantically relevant parts of the scene.

RoI

Ref.

Ref.previous Delay

BackgdRef.

Ref.previous Delay

JPEG2000 packetsand

Replenishment decisions

estimationBackgd

replenishment

Decoding andconditional

RD−optimaldecisions andreplenishment

Reconstructed video

Server Client

Video

Backgd

content

Fig. 1. Overview of the proposed JPEG 2000 video transmission architecture. Conditional replenishment is based on two

reference images, and replenishment decisions are taken in an RD optimal way at the JPEG 2000 precinct level. Optionally

(dashed arrow), Regions of Interest that are an inherent by-product of the background estimation module can be used to prioritize

the refresh of areas affected by relevant changes of the scene.

Figure 1 depicts the proposed transmission architecture. For each frame, the system only transmits the


5

JPEG 2000 data units that are not properly approximated at the decoder, neither based on the background

estimate, nor based on the previous reconstructed frame. Asa consequence, the main concern of the sender

is related to the selection of (i) the parts of the JPEG 2000 image that have to be refreshed, and (ii) the

level of quality associated to the corresponding refreshments. Given a targeted transmission bit budget, we

explain in Section IV how these decisions are taken in a rate-distortion optimal way, and in agreement

with the JPEG 2000 syntax. The second issue addressed by the sender is related to the background

estimation. In the proposed system, an average background is computed based on Gaussian mixtures that

collect the statistics of past image samples in specific pixellocations, as described in Section III. At

regular time intervals, or when the current background estimate sufficiently differs from the reference

background available at the client, the current backgroundis transmitted to the receiver, and the reference

background is updated. The simulation results presented in Section V demonstrate that in practice the

transmission overhead caused by the background updates arenegligible compared to the cost associated to

refreshed data. Besides, the outcome of the background estimation process allows to partition the current

image into RoI and non-RoI regions, respectively defined to correspond to moving and static objects of

the scene. In Section IV, we make the assumption that RoI areasare semantically more important, and

demonstrate the ability of our transmission system to take such a priori semantic knowledge into account

when allocating transmission resources. In final, RoI replenishment prioritization is shown to improve

the perceived quality of noisy video content (see Section V-C).

III. V IDEO CONTENT ANALYSIS

The algorithm described in this section automatically computes the scene background based on the

past frames, extracts the RoI and provides this informationto the replenishment module.

A. Background estimation

The goal of the background estimation process is to create background references frames for the

replenishment module. The estimated background frames update the reference background either at a

fixed low frame-rate or only when major background changes aredetected.

The estimation is performed on a sliding window and is based ona real-time statistical segmentation

algorithm using a mixture of Gaussians modeling for the background luminance of each pixel [17] [18]

[19]. This approach automatically supports backgrounds having multiple states like blinking lights, grass

and trees moving in the wind, acquisition noise, etc. Furthermore, the background model is updated in

an unsupervised manner when the scene conditions are changing.


6

Fig. 2. Statistical background modeling of a pixel using three Gaussians.Multiple Gaussians aggregate the pixel luminance

values observed in a sliding window.

Figure 2 shows the mixture of Gaussians for one pixel at a giventime. It aggregates all luminance

values observed for that specific pixel in the previous framesbelonging to the sliding window. The current

pixel luminance is compared to the current mixture. We consider it belongs to one of the Gaussians if

the distance between the current pixel luminance and the Gaussian mean is lower than a given threshold

proportional to the considered Gaussian standard deviation (typically 1.6 times the standard deviation).

If the pixel belongs to one of the most probable Gaussians, the pixel is classified as background and

the relevant Gaussian parameters (i.e. mean, variance, frequency) are updated. Otherwise, the pixel is

classified as foreground and the parameters of the associatedGaussian are updated according to this

additional luminance value. At the beginning of the process, a new Gaussian is initialised each time a pixel

is classified as foreground until the pre-defined maximum number of Gaussians is reached. The maximum

number of Gaussians is a parameter that should theoretically be adapted to the number of different states

a pixel of the background can have according to the differentnoises (acquisition, vibrations, etc.). In

practice three Gaussians per mixture perform well in most indoor and outdoor conditions while four

Gaussians may give better results in some situations.

At any time, an estimate of the background can be constructed. It just requires getting the mean of

the most probable Gaussian for each pixel. Such estimated background frames are less noisy than the

original frames. This feature is exploited in the proposed system, as explained in Section V-C.

At the very beginning of the sequence, the background estimate is unstable since the number of times

each Gaussian occurred is very small. In order to avoid prohibitive transmissions associated to numerous

background updates during this period, the first frame is considered as being the best background estimate


7

until the Gaussian mixtures can be considered as stable. In our simulations, the background stability is

obtained within less than two seconds of video. During this initialization period, a huge part of the scene

can sometimes be considered as foreground if many mobile objects are present at the beginning of the

sequence or if the sequence is very noisy. While this could beconsidered as an inherent problem from

the strict semantical point of view, it does not have much impact on the delivered video quality within

the proposed replenishment method since our approach is based on two reference images.

B. RoI definition

In a video surveillance context, Regions of Interest are generally defined to be mobile objects. In some

applications, one may be interested only in mobile objects matching pre-defined decision characteristics

(e.g. size, position, texture, etc.) or behaviors (e.g. people entering restricted areas).

In our simulations, as in [10], we consider that all pixels classified as foreground by the above

segmentation algorithm belong to the RoI. In Section IV-C, weexplain how to prioritize the replenishment

of JPEG 2000 packets that correspond to the RoI.

One characteristic of the segmentation algorithm is that the background Gaussians widths are au-

tomatically adapted to the sequence noise, i.e. the Gaussians have a higher standard deviation in noisy

sequences than sequences with a lower noise. This feature prevents the pixels of a noisy background to be

considered as semantically important, and guarantees thatthe RoI replenishment prioritization allocates

transmission ressources to the objects moving in the scene,and not to the non-relevant variations of

background caused by the noise (see Section V-C).

IV. JPEG 2000 CONDITIONAL REPLENISHEMENT

As depicted in Figure 1, the proposed conditional replenishment system relies on two references to

approximate the current image. These two references respectively correspond to the previous image

reconstructed at the receiver, and to the background estimated at the sender, as described in Section II.

In this section, we are interested in the replenishment decision process, i.e. in the method which chooses

the parts of the image to refresh and the way to refresh them. The section is organized as follows.

First, we review the specificities of the JPEG 2000 standard that are relevant to the design of our

replenishment decision engine. Then, we explain how rate-distortion optimal replenishment decisions are

taken in agreement with the JPEG 2000 structure. Finally, we define three replenishment schemes that

differ by their ability to exploit the background estimate as a replenishment reference and to support the

prioritized transmission of RoI data units.


https://www.researchgate.net/publication/4186230_Scene_analysis_for_reducing_motion_JPEG_2000_video_surveillance_delivery_bandwidth_and_complexity?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==

8

A. JPEG 2000 image representation and code stream abstraction

The JPEG 2000 standard describes images in terms of their discrete wavelet coefficients. Hence, a

replenishment scheme dedicated to JPEG 2000 contents decidesto refresh or approximate the current

image wavelet transform, based on the knowledge of the wavelet coefficients describing the reference

background and previous images. An important question raised by conditional replenishment is related to

the granularity of access to the current JPEG 2000 image coefficients. Specifically, one needs to understand

to which extent it is possible to define the resolution, the subband, the position and the reconstruction

accuracy of the coefficients that are refreshed. That issue is directly related to the JPEG 2000 format,

which can be summarized as follows.

According to the JPEG 2000 standard, the subbands issued from the wavelet transform are partitioned

into code-blocksthat are coded independently [3] [5] [20]. Each code-block iscoded into an embedded

bitstream, i.e. into a stream that provides a representation that is (close-to-)optimal in the rate-distortion

sense when truncated to any desired length. To achieve rate-distortion (RD) optimal scalability at the

image level, the embedded bitstream of each code-block is partitioned into a sequence of increments

based on a set of truncating points that correspond to the various rate-distortion trade-offs [21] defined

by a set of Lagrange multipliers. A Lagrange multiplierλ translates a cost in bytes in terms of distortion.

It defines the relative importance of rate and distortion. Given λ, the RD optimal truncation of a code-

block bitstream is obtained by truncating the embedded bitstream so as to minimize the Lagrangian cost

functionL(λ) = D(R)+λR, whereD(R) denotes the distortion resulting from the truncation toR bytes.

Different Lagrange multipliers define different rate-distortion trade-offs, which in turn result in different

truncation points. For each code-block, a decreasing sequence of Lagrange multipliers{λq}q>0 identifies

an ordered set of truncation points that partition the code-block bitstream into a sequence of incremental

contributions [21]. Incremental contributions from the set of image code-blocks are then collected into

so-called quality layers,Qq. The targeted rate-distortion trade-offs during the truncation are the same

for all the code-blocks. Consequently, for any quality layer index l, the contributions provided by layers

Q1 throughQl constitute a rate-distortion optimal representation of the entire image. It thus provides

distortion scalability at the image level. Resolution scalability and spatial random access to the image

result from the fact that each code-block is associated to a specific subband and to a limited spatial

region.

Although they are coded independently, code-blocks are notidentified explicitly within a JPEG 2000

codestream. Instead, the code-blocks associated to a givenresolution are grouped intoprecincts, based on


https://www.researchgate.net/publication/220500984_High_performance_scalable_image_compression_with_EBCOT?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==


9

their spatial location [3], [22]. Hence, a precinct corresponds to the parts of the JPEG 2000 codestream

that are specific to a given resolution and spatial location. As a consequence of the quality layering defined

above, a precinct can also be viewed as a hierarchy ofpackets, each packet collecting the parts of the

codestream that correspond to a given quality among all code-blocks matching the precinct resolution

and position. Hence, packets are the basic access unit in theJPEG 2000 codestream.

B. RD optimal replenishment

Given a targeted transmission budget and a reference image available at the receiver, we now explain

how to select the JPEG 2000 packets of the current image codestream so as to maximize the reconstructed

image quality. As the JPEG 2000 codestream consists in a set of precincts organized in a hierarchy of

layers (see Section IV-A), the problem consists in selectingthe indices of the precincts to refresh and their

quality of refreshment, so as to maximize the reconstructedquality (or minimize the distortion) under

the bit budget constraint, knowing that non-refreshed precincts are approximated based on the wavelet

coefficients of the reference image. The use of multiple reference images is described in Section IV-C.

To simplify notations, and without loss of generality, the precincts, originally defined by their(r, p)

indexes, are now labeled by a single indexi. To solve the problem efficiently, we assume an additive

distortion metric, for which the contribution provided by multiple precincts to the entire image distortion

is equal to the sum of the distortion computed for each individual precinct. We definedq(i) and d0(i)

to denote the distortion computed when theith precinct is approximated based on itsq first packets,

i.e. its q first layers, and based on the reference image, respectively.We also denotesq(i) to be the

size in bytes of theq first packets of theith precinct andT the bit budget. Based on the additivity

assumption and because a packet is only useful upon reception of all its ancestors, the problem can

be formulated as a Knapsack problem with precedence constraints [23]. Let q(i) denote the number of

quality layers transmitted for theith precinct. Then, the RD optimal refreshment decisions are defined by

the set{q(i)}i≤N that maximizes∑

i<N (d0(i) − dq(i)(i)), subject to∑

i<N sq(i)(i) ≤ T . Formally, this

Knapsack problem can be solved based on dynamic programming[23], [24]. However, two specificities

of our problem simplify it, and make an iterative greedy solution RD optimal.

First, the lower RD convex-hull of a precinct originates in the RD point defined by the reference

image (R = 0) and goes through all the refreshment solutions that involve a sufficient number of quality

layers. This is because, in absence of a reference frame, the benefit per transmission cost of a precinct

packet decreases as the layer index increases [21]. Hence, the succession of RD points corresponding

to an increasing number of layers sustains the lower RD convex-hull in absence of reference. In the


https://www.researchgate.net/publication/220694474_Knapsack_Problems?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==

https://www.researchgate.net/publication/220694474_Knapsack_Problems?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==

https://www.researchgate.net/publication/224744903_Rate-distortion_optimized_interactive_browsing_of_JPEG2000_images?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==


10

replenishment case, the lower RD convex-hull is affected bythe existence of a reference frame, and the

refreshment of a precinct only becomes worthwhile in the convex-hull sense beyond a quality level for

which the benefit (compared to the quality achieved based on the reference frame) per unit of rate becomes

larger than the relative gain offered by subsequent layers of the precinct. Hence, for theith precinct,

the set of convex-hull RD optimal solutions contains the reference precinct (R=0) and the refreshment

solutions involving more thanqr(i) quality layers, withqr(i) being the smallest valueq such that

d0(i) − dq(i)

sq(i)≥

d0(i) − dq+1(i)

sq+1(i)(1)

Second, the bit budget constraint can be somewhat relaxed, without impairing the overall performance

of the communication system. This is because all video communication applications rely on buffers to

absorb momentary rate fluctuations. As a consequence, the fewbits that are saved (or overspent) compared

to the bit budget allocated to a frame just slightly increments (or decrements) the budget allocated to the

next frame.

As a consequence of the above observations, overall RD optimality can be achieved at the image level

by selecting the packets to transmit so as to refresh the image precincts in decreasing order of benefit

per unit of rate, up to exhaustion of the transmission budget. This approach is equivalent in principle to

the one defined in [22], but is adapted to account for the availability of a reference image. Formally, the

iterative process can be defined as follows.

Let qt(i, m) denote the number of layers already transmitted for theith precinct at stepm, and let

q+t (i, m) denote the next convex-hull optimal refreshment level for the ith precinct at stepm. Based on

the above discussion,q+t (i, m) = qr(i) when qt(i, m) = 0, andq+

t (i, m) = qt(i, m) + 1 in other cases.

Based on these definitions, at the initial step, we haveqt(i, 1) = 0 ∀i. Then, at each stepm, the greedy

process decides to improve the quality of the precincti∗m that provides the largest decrement in distortion

per unit of transmission, i.e.

im∗ = argmax1≤i≤N

(

dqt(i,m)(i) − dq+

t(i,m)(i)

)

(

sq+

t(i,m)(i) − sqt(i,m)(i)

) (2)

To prepare the next iteration,qt(i, m + 1) is set toqt(i, m) ∀i 6= i∗m, and toq+t (i∗m, m) when i = i∗m.

The process goes on iterating onm as long as the bit budget is not exhausted.

The solution is RD optimal in the sense that, for the achieved bit-budget, it is not possible to attain

a lower reconstructed image distortion based on different refreshment decisions. This is because, by

construction, it is not possible to find a non-transmitted packet that provides a larger gain per unit of rate

than the gain provided by a transmitted packet.


https://www.researchgate.net/publication/224744903_Rate-distortion_optimized_interactive_browsing_of_JPEG2000_images?el=1_x_8&enrichId=rgreq-edc6c9e7-8baf-4af5-ba3c-39905fcf08b9&enrichSource=Y292ZXJQYWdlOzIyODU5Mzc2MjtBUzoxMDIyMzc3NDU0NTEwMTBAMTQwMTM4Njc4NjQ0Mw==

11

In practice, in our work, the distortion metric is computed based on the Square Error (SE) of wavelet

coefficients, and approximates the reconstructed image square error [21]. Formally, letBi denote the set

of code-blocks associated to precincti, and letcb[k] and cb[k] respectively denote the two-dimensional

sequences of original and approximated subband samples in code-block b ∈ Bi. The distortiond(i)

associated to the approximation of theith precinct is then defined by

d(i) =∑

b∈Bi

w2sb

∑

k∈b

(cb[k] − cb[k])2 (3)

wherewsb denotes the L2-norm of the wavelet basis functions for the subbandsb to which code-blockb

belongs [21]. As an alternative to the conventional SE metric, in the rest of the paper, we also consider a

distortion defined based on semantically meaningful weighting of the SE, so as to take into account the

a priori knowledge one may get about the semantic significanceof approximation errors. We assume that

the information about the semantic relevance of approximation errors is provided at the precinct level,

and define the semantically weighted distortion to bed′(i) = w(i)d(i), wherew(i) denotes the semantic

weight assigned to theith precinct (see Section IV-C). Semantically meaningful weighted distortion

metrics have already been considered in the past, e.g. in [16]. However, most earlier contributions exploit

these metrics either before or during the encoding step. In contrast, our work supports the posterior

definition of semantics weights, at transmission time, giventhe pre-encoded stream.

In the next section, we introduce three different replenishment mechanisms. They all follow the above

greedy algorithm, but differ in the reference they use for replenishment, or in the weights they assign to

precincts when computing their contribution to the reconstructed image distortion.

C. Replenishment methods definition

We now introduce the three replenishment methods that are considered in the simulation results

presented in Section V. They are all based on the greedy approach described above in Section IV-B

above, but differ in the way they define the reference image or compute the distortion. They are denoted

and defined as follows:

• The CR – Conditional Replenishment – method follows the conventional replenishment mechanism

originally introduced in [2] and adapted to the wavelet domain. The reference image is the previously

reconstructed image, and the distortion is defined to approximate the MSE, i.e. the semantic weights

w(i) = 1 for all precincts.

• TheCRB – Conditional Replenishment with Background – method is novel and proposes to consider

both the previous image and the estimated background as possible references for each precinct. In



12

practice, for a given precinct, the image that best approximates the precinct is selected as the reference

for that specific precinct. As for the CR method, the distortion still estimates the MSE based on

wavelet coefficients square errors. Our simulations demonstrate that CRB significantly outperforms

CR in the surveillance scenario of interest in our study.

• The CROI – Conditional Replenishment with RoI – follows the mechanism introduced by CRB,

but forbids refreshments in non-RoI areas of the scene. It corresponds to an aggressive semantic

weighting of the approximation error, for which the a prioriknowledge about scene perception is

inferred from the RoI/non-RoI partition defined in Section III. Semantic weightsw(i) are set to one

(zero) for precincts that belong to the RoI (non-RoI) areas.In other words, approximation errors are

only considered to be semantically relevant in the RoI area.The strategy is aggressive but defines

a limit case that allows to get a clear idea about the potential benefit to draw from a semantic

weighting of distortion. Compared to the previous method, CROI is less robust to segmentation

errors that can lead to the integration of semantically relevant objects in the non-RoI regions. Note

that in practice, the RoI/non-RoI partition is defined at the pixel level in Section III. Hence, we

consider that a precinct belongs to the RoI if at least5% of its supporting pixels are labelled as

RoI pixels. The supporting pixels of a precinct are obtained by dyadic upsampling of the precinct

subband support.

Intermediate strategies between the CROI and CRB methods can be defined by selecting semantic

weightsw(i) between0 and1. This choice may for example depend on the sequence noise (as explained

in the Section V-C) or on the reliability of the segmentation step. Besides, we notice that the RoI

segmentation does not depend on the allocation strategy done afterward. Thus, this framework can trivially

be extended to transmission systems with several clients, each having its own network and decoding

resources, as well as semantic interests.

V. RESULTS

In this section, we present experimental results and discuss them. First, we compare the performances

of the three replenishment methods described in the previous section with MJ2 and MPEG-4 AVC. Then,

a deeper analysis of the quality achieved in the RoI and non-RoI regions is performed. Finally, we analyze

how CROI can improve the transmission of noisy sequences.

The transmission methods have been tested exhaustively, butwe present the results onSpeedway,

a CIF video-surveillance sequence captured with a fixed camera at 25 fps. The original sequence, its

estimated background and the segmentation masks are available on the WCAM project website [1].


13

Regarding the JPEG 2000 compression parameters, the sequencehas been encoded with four quality

layers (corresponding to compression ratios of 2.7, 13.5, 37 and 76) and with three code-blocks per

precinct (one in each subband). In order to have a spatial coherence between the precincts at different

resolutions, we have chosen decreasing precinct sizes of 32x32, 16x16, 8x8, and 4x4 for the three

remaining lowest resolutions. Regarding the rate control,the bit-rate has been uniformly distributed on

all frames in the four intra methods. With AVC, we have adapted the quantization parameters to reach

the expected bit-rates.

In these simulations, the background is sent only once at thebeginning of the transmission because it

remains sufficiently constant during the whole sequence. The transmission overhead is negligible, as the

compressed estimated background ofSpeedwayhas a size of 55 Kbytes.

A. Overall Evaluation

288 500 750 1.000 1250 1500 1750 2000

21

23

25

27

29

31

33

35

37

39

41

43

45

Bit rate (kbps)

PS

NR

(dB

)

CRCROICRBMJ2AVC (IP=2)AVC (IP=5)AVC (IP=10)

Fig. 3. Rate distortion curves of the proposed algorithms compared with MJ2 and AVC. Frame rates and encoding parameters

are defined in the text.

Figure 3 compares the PSNR at different bit-rates of the CR, CROI, CRB, MJ2 and MPEG-4 AVC

(with three different Intra Periods, IP) methods.

We observe that the CROI method offers a good compression efficiency at low bit rates, thanks to the

estimated background available at the decoder. At higher bit rates however, only the RoI are updated

and the non-RoI quality is not increased. Hence, the averagequality saturates around 36 dB. MJ2 is the


14

less efficient compression scheme except at very high bit rates where it outperforms the CROI method,

because the entire picture is refined. The CR method improves the MJ2 compression by 2 dBs at low bit

rates, because only the most relevant blocks are refreshed.CRB takes the best out of both CR and CROI

methods. Like CROI, at low bit rates, the estimated background allows to concentrate the refreshment in

the most changing areas mostly located in the RoI; like CR, athigh bit rates, the possibility to refresh

any region of the image increases the global quality.

At very low bit rates, the CRB and CROI methods results are close to MPEG-4 AVC. At 300 kbps,

their PSNR is 1.5 dB below IP-10, 1.5 dB above IP-5 and 7 dB above IP-2. The performances of CRB

are comparable to AVC IP-2 at 1300 kbps. As mentioned in the introduction, the goal of this paper is

not to propose a new compression scheme competing with existing ones like AVC, but rather to increase

the performances of flexible video surveillance transmission systems based on JPEG 2000.

Temporal evolution of the quality

Figure 4 shows the temporal evolution of the quality for the CR, CROI and CRB methods. We observe

that the quality offered by these methods is quite constant during the transmission. At low bit rates, the CR

quality slightly increases until frame 70. This is due to the fact that, at this bit rate, the background blocks

are slowly refreshed compared to the other methods. Both CRBand CROI approaches introduce a peak

of bit-consumption at the beginning of the session due to thetransmission of the estimated background.

Snapshots

Snapshots of theSpeedwaysequence compressed with the CR, CROI, CRB, MJ2 and AVC methods at

235 and 775 kbps are respectively shown in Figures 5 and 6. As wecan observe, the CR improves slightly

the MJ2 method, increasing mostly the precision on the vehicles. A major drawback of the CR method

is visible in Figure 5: artifacts appear on the border of the previously refreshed precincts, mostly on the

path of the car. This is due to the fact that at 235 kbps, the bit budget does not allow the refreshment of

these precincts.

At this low bit rate, the quality of the CROI and CRB methods are very similar. The artifacts of the

CR method explained above do not appear because the background is used as reference in these difficult

regions. However, the cars seem slightly transparent. This transparency is due to the fact that not all the

precincts in the car regions have been refreshed. This is visible for example with the white line of the

speedway border belonging to the background that is still vaguely visible through the car on the right.

At 775 kbps (Figure 6), this transparency does not appear anymore because the bit budget was sufficient


15

0 10 20 30 40 50 60 70 8015

20

25

30

35

40

Frame Number

PS

NR

(dB

)

0 10 20 30 40 50 60 70 8015

20

25

30

35

40

Frame numberP

SN

R (

dB)

CRBCROICR

235 kbps 1600 kbps

Fig. 4. Temporal evolution of the image quality for the CR, CROI and CRB methods (Speedwaysequence transmitted at 235

and 1600 kbps, 25 fps and in CIF format).

to refresh the vehicle areas.

B. RoI and non-RoI quality

The quality of RoI and non-RoI regions defined with the segmentation method described in Section III-

B are shown for theSpeedwaysequence in Figure 7.

For the MJ2 method, the non-RoI quality is always higher thanthe RoI because most of these

background regions, like the road and the sky, are very efficiently compressed. Indeed, since these regions

are quite predictable, the JPEG 2000 entropy coder easily reduces the number of bits used to code them

compared to regions with a lower predictability. The RoI contains the cars that are characterized by an

important amount of details, which are less efficiently compressed. Hence, the RD optimal bit allocation

strategy proposed by the EBCOT algorithm [21] assigns in thiscase more bit-planes to a given quality

layer for the non-RoI regions than for the RoI. This is illustrated on the top left of the figure.

Compared to MJ2, the CR method offers a higher quality for theRoI, which correspond to the zones

that are more often refreshed.

As the CROI method only relies on the background reference toreconstruct non-RoI areas, the non-RoI

quality is constant throughout the bit rates. The RoI qualityincreases until a given threshold where all the

code-blocks from the original JPEG 2000 sequence are sent. After this threshold (at 1700 kbps), neither

the non-RoI nor the RoI quality is increased, as no additional data are transmitted.


16

MJ2 CR

CROI CRB

AVC (IP=5) Original

Fig. 5. MJ2, CR, CROI, CRB and AVC methods for the 25th frame of theSpeedwaysequence transmitted at 235 kbps, 25

fps and in CIF format.


17

MJ2 CR

CROI CRB

AVC (IP=5) Original

Fig. 6. MJ2, CR, CROI, CRB and AVC methods for the 25th frame of theSpeedwaysequence transmitted at 775 kbps, 25

fps and in CIF format.


18

MJ2 method

Bit rate (kbps)

PS

NR

(dB

)

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

25

28

31

34

37

40

43

46

49

52

55

MJ2 (Non−RoI)MJ2 (RoI)

CR Method

Bit rate (kbps)

PS

NR

(dB

)

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

25

28

31

34

37

40

43

46

49

52

55

CR (Non−RoI)CR (ROI)

CROI Method

Bit rate (kbps)

PS

NR

(dB

)

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

25

28

31

34

37

40

43

46

49

52

55

CROI (Non−RoI)CROI (RoI)

CRB method

Bit rate (kbps)

PS

NR

(dB

)

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

25

28

31

34

37

40

43

46

49

52

55

CRB (Non−RoI)CRB (RoI)

Fig. 7. RoI and non-RoI quality as a function of the total transmission rate for the CR, CROI, CRB and MJ2 methods (Speedway

sequence).

The CRB method behaves like CR at high bit rates, but offers a higher non-RoI quality at low bit

rates.

C. Noisy sequence

In this section, we consider a noisy version of theSpeedwaysequence to demonstrate the flexibility of

the replenishment methods based on RD optimal JPEG 2000 packetscheduling. Specifically, we show that

these methods naturally support the exploitation of a priori knowledge about the relevance of image parts.

Adaptive transmission mechanisms that follow the user needs can be implemented, based on single pre-

encoded JPEG 2000 streams. Besides, we also demonstrate the capabilities of the RoI/non-RoI selection


19

algorithm to extract relevant moving areas in presence of noise. The noise causes luminance changes

in the background regions, but these changes are not relevant with respect to the surveillance purpose

of the application. Hence, these background regions shouldnot be considered as part of the region of

interest and, indeed, they are indeed considered as non-RoIregions based on the algorithm presented in

Section III-B.

In practice, we have added white Gaussian noise with a standard deviation of 10 to theSpeedway

sequence, as illustrated on Figure 8. The noise simulates the effect of adverse surveillance conditions:

noisy camera acquisition, bad weather, presence of traffic lights or moving objects (trees, ...).

Fig. 8. Speedwaysequence corrupted with additive white Gaussian noise characterized bya standard deviation of 10.

Figure 9 shows the performance of the three methods using the noisy sequence as the reference for

PSNR computations. As expected, the CRB method performs best.

However, the noise present in the sequences does not add any relevant information. The segmentation

method proposed in Section III detects this noise, and only considers the vehicles as being part of the

regions of interest. Moreover, the background estimation process filters the sequence temporally and

provides a denoised version of the background. Thus, we expect the CROI method to offer a denoised,

and perceptually more pleasant version of the sequence at the client side. This is confirmed visually, and

illustrated in Figure 10 where the CROI and CRB methods are compared for the transmission of the

original and noisy sequences, taking this time the originalsequence as the reference to compute PSNR

values. The left part of the figure focuses on theRoI. In normal conditions, all transmitted bits of the

CROI method are dedicated to the RoI, which explains the higher performances of this method compared

with CRB. However, in noisy conditions, the RoI quality of both CROI and CRB are similar. The right

part of the figure represents thenon-RoI quality. In normal conditions, the CROI method maintains a


20

400 600 800 1000 1200 1400 1600 1800 200023

24

25

26

27

28

29

Bit rate (kbps)

PS

NR

(dB

)

CRBCROICR

Fig. 9. CR, CROI and CRB quality when transmitting the noisy version ofSpeedway. The PSNR is calculated using the noisy

sequence as reference.

400 600 800 1000 1200 1400 1600 1800 200020

25

30

35

40

45

50

55

Bit rate (kbps)

PS

NR

(dB

)

CROICRBCROI noiseCRB noise

400 600 800 1000 1200 1400 1600 1800 200031

32

33

34

35

36

37

38

39

Bit rate (kbps)

PS

NR

(dB

)

CROICRBCROI noiseCRB noise

RoI Non-RoI

Fig. 10. RoI and non-RoI quality for the CROI and CRB methods in normaland noisy conditions (Speedwaysequence). In

both conditions, the PSNR is calculated using the original (non noisy) sequence as reference.

constant non-RoI quality, while CRB progressively refreshes these regions as the available rate increases,

providing a higher overall non-RoI quality. In noisy conditions, since the non-RoI regions are slightly

modified by the noise at each frame, it constantly differs fromnon-RoI regions of the references available

at the decoder. Thus, the CRB method constantly refreshes thenon-RoI regions mainly to render noise


21

effects. It leads to a loss of efficiency for this method, loss that increases with the bit rate. On the contrary,

since the CROI method never refreshes the non-RoI regions, its quality remains higher and constant.

Although the CROI method is less efficient than CRB in noiseless conditions, we can conclude that

the a priori knowledge of the scene is efficiently used by the CROI mechanism, and offers a significant

advantage in noisy environments. CROI is also expected to provide significant benefit in cases where

the a priori semantic knowledge is either based on user interaction or sophisticated scene interpretation

mechanisms.

VI. CONCLUSION

In this work, we have investigated the use of conditional replenishment mechanisms to transmit

JPEG 2000 video surveillance content. We have explained how totake the refreshment decisions in

a RD optimal way. We have also demonstrated the benefit of usingmultiple reference images for non-

refreshed areas. In particular, we have proposed to computean estimate of the background of the scene

captured by a still camera, and have shown that such estimatesignificantly improves rate-distortion

performances in video surveillance scenarios. In addition, we have highlighted the flexibility offered by a

JPEG 2000 transmission of video content by prioritizing the refresh of scene areas that are a priori known

to be semantically significant. Interestingly, as a consequence of the JPEG 2000 intrinsic scalability,

the prioritization allows to dynamically allocate transmission resources to the video content, but is

independent of the JPEG 2000 codestream creation. Hence, it allows to allocate the rate to the content

according to the user needsa posteriori, once the images have been compressed and stored. For the same

reason, our system can be extended to a transmission to several clients, each client being characterized

by its own resources. Eventually, simulations have revealedthat the proposed system achieves close

to AVC performance at low rates, and significantly outperforms both naive independent transmission

of consecutive frames, and conventional replenishment mechanisms. At 500 kbps, the distortion of the

proposed method is at 1.5dB / 3dB below AVC (with an Intra Period of 5/10) and 11 dB above MJ2. These

results encourage the deployment of integrated solutions able to store and transmit video surveillance

content in JPEG 2000 format.

REFERENCES

[1] FP6 IST-2003-507204 WCAM, Wireless Cameras and Audio-Visual Seamless Networking, http://www.ist-wcam.org, 2004.

[2] S. McCanne, M. Vetterli and V. Jacobson. Low-complexity video coding for receiver-driven layered multicast.IEEE

Journal of Selected Areas in Communications, 15(6):982–1001, 1997.

[3] ISO/IEC 15444-1. JPEG2000 image coding system, 2000.


[4] Motion JPEG 2000 Final Committee Draft, 1.0, ISO/IEC JTC 1/SC 29/WG1N2117, March 2001.

[5] M. Rabbani and R. Joshi. An overview of the JPEG 2000 image compression standard.Signal Processing: Image

processing, 17:3–48, 2002.

[6] D. Santa-Cruz and T. Ebrahimi. An analytical study of JPEG 2000 functionalities. InProc. of IEEE International

Conference on Image Processing (ICIP), Vancouver, September 2000.

[7] Avid Technology. Forensic video decision, May 2001.

[8] V. Sanchez, A. Basu and M. Mandal. Prioritized Region Of InterestCoding in JPEG 2000.IEEE trans. on CSVT,

14(9):1149–1155, Sept. 2004.

[9] J. Meessen, C. Parisot, C. Le Barz, D. Nicholson and J.-F. Delaigle. WCAM: Smart Encoding for Wireless Surveillance.

In SPIE Image and Video Communications and Processing (IVCP 05), San Jose, USA, January 2005.

[10] J. Meessen, C. Parisot, X. Desurmont and J.F. Delaigle. SceneAnalysis for Reducing Motion JPEG 2000 video Surveillance

Delivery Bandwidth and Complexity. InIEEE International Conference on Image Processing (ICIP 05), volume 1, pages

577–580, Genova, Italy, September 2005.

[11] F. Pereira and T. Ebrahimi.The MPEG-4 Book. Prentice Hall, 2002.

[12] R. Koenen. MPEG-4 overview ISO/IEC JTC1/SC29/WG11 N4668, available at

http://www.chiariglione.org/mpeg/standards/mpeg-4/mpeg-4.htm, March2002.

[13] T. Sikora. Trends and perspectives in image and video coding. In Proceedings of the IEEE, volume 93(1), pages 6–17,

January 2005.

[14] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Joint Final Commitee Draft (JFCD) of Joint Video

Specification (ITU-T Rec. H.264 – ISO/IEC 14496-10 AVC). Doc. JVT-D157, July 2002.

[15] T. Wiegand, G.J. Sullivan, G. Bjntegaard, A. Luthra. Overview of the H.264/AVC video coding standard.IEEE trans. on

CSVT, 13(7):560–576, July 2003.

[16] A. Cavallaro, O. Steiger and T. Ebrahimi. Semantic video analysis for adaptive content delivery and automatic description.

IEEE trans. on CSVT, 15(10):1200–1209, October 2005.

[17] C. Stauffer and W.E.L. Grimson. Adaptive background mixturemodels for real-time tracking. InIEEE Conference on

Computer Vision and Pattern Recognition, volume 2, pages 246–252, June 1999.

[18] K. Kim, T. Horprasert, D. Harwood and L. Davis. Codebook-based background subtraction and performance evaluation

methodology. 2003.

[19] X. Desurmont, C. Chaudy, A. Bastide, C. Parisot, J.F. Delaigle and B. Macq. Image analysis architectures and techniques

for intelligent systems. InIEE proc. on Vision, Image and Signal Processing, Special issue on Intelligent Distributed

Surveillance Systems, 2005.

[20] D. Taubman D. and M. Marcellin.JPEG 2000: Image compression fundamentals, standards and practice. Kluwer Academic

Publishers, 2001.

[21] D. Taubman. High performance scalable image compression with EBCOT. IEEE Trans. on Image Processing, 9(7):1158–

1170, July 2000.

[22] D. Taubman and R. Rosenbaum. Rate-distortion optimized interactive browsing of JPEG 2000 images. InIEEE International

Conference on Image Processing (ICIP), September 2003.

[23] H. Kellerer, U. Pferschy, and D. Pisinger.Knapsack problems. Springer Verlag, 2004. ISBN 3-540-40286-1.

[24] L. Wolsey. Integer Programming. Wiley, 1998.

22

transmitting video surveillance sequences based on jpeg 2000 conditional replenishment

Documents