
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 18, NO. 5, MAY 2009

A Video Coding Scheme Based on Joint Spatiotemporal and Adaptive Prediction

Wenfei Jiang, Longin Jan Latecki, Senior Member, IEEE, Wenyu Liu, Member, IEEE, Hui Liang, and Ken Gorman

Abstract—We propose a video coding scheme that departs from traditional Motion Estimation/DCT frameworks and instead uses a Karhunen–Loeve Transform (KLT)/joint spatiotemporal prediction framework. In particular, a novel approach that performs joint spatial and temporal prediction simultaneously is introduced. It bypasses the complex H.26x interframe techniques and is less computationally intensive. Owing to the effective joint prediction and the image-dependent color space transformation (KLT), the proposed approach is demonstrated experimentally to consistently lead to improved video quality, and in many cases to better compression rates and improved computational speed.

Index Terms—APT, AVC, compression, H.264, joint predictive coding, video.

I. INTRODUCTION

Most video coding schemes build upon the motion estimation (ME) and discrete cosine transform (DCT) frameworks popularized by ITU-T's H.26x family of coding standards. Although the current H.264/AVC video coding standard achieves significantly better compression than its predecessor H.263, it still relies on traditional ME/DCT methods. This compression improvement is not without cost. By sacrificing computational time for high-complexity motion estimation and mode selection, H.264 reduces the output bit-rate by 50% while increasing the encoding time significantly. Granted, the complex DCT of H.263 was abandoned for the simpler integer transform in H.264; however, the share of encoding time spent on motion estimation increased from 34% to 70% on average [1]. This shows how costly it is for H.264 to achieve its compression performance by pursuing accurate motion estimation. The compression efficiency is not likely to be improved significantly by further modifying the ME module, and, consequently, the development potential of the ME/DCT coding mechanism may be limited.

Manuscript received June 24, 2008; revised January 28, 2009. Current version published April 10, 2009. This work was supported in part by the National High Technology Research and Development Program of China (Grant 2007AA01Z223) and in part by the National Natural Science Foundation of China (Grants 60572063 and 60873127). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Pier Luigi Dragotti.

W. Jiang, W. Liu, and H. Liang are with the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China (e-mail: fl[email protected]).

L. J. Latecki and K. Gorman are with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19094 USA (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIP.2009.2016140

This paper proposes a video coding scheme that reduces the computational complexity of the H.264 interprediction and mode selection calculations by substituting for them a method based on a Karhunen–Loeve Transform (KLT)/interpolative prediction framework called Joint Predictive Coding (JPC). It follows a new coding procedure that starts with an image-dependent color space transformation, followed by a pixel-by-pixel interpolative prediction, and ends with entropy coding in the spatial domain. The proposed framework introduces a new kind of prediction that takes both temporal and spatial correlation into consideration simultaneously. Since, as mentioned above, efforts to pursue more accurate ME seem to have reached their limit in increasing the compression rate, the proposed prediction method does not follow conventional approaches. It performs intraprediction using Adaptive Prediction (Robinson [2]), not in each frame separately, but jointly in the current and the reference frame. The pixels in both frames are linked by MPEG-like motion estimation. The calculated error in the reference frame is then used to estimate the intraprediction error in the current frame, and from these values the joint predicted value is determined.

JPC has at least three advantages in comparison to the traditional scheme.

1. It utilizes both temporal and spatial correlation for the prediction of each pixel, which leads to more accurate prediction, smaller errors, and better compression.

2. The encoding is done on a pixel-by-pixel basis instead of the block-by-block basis of traditional coding, avoiding visually objectionable blocking effects. Moreover, it tends to maintain good quality on videos with complicated textures and irregular contours, which are a difficult problem for block-based codecs.

3. Interpolative coding allows video frames to be reconstructed with increasing pixel accuracy or spatial resolution. This feature allows the reconstruction of video frames at different resolutions and pixel accuracies, as needed or desired, for different target devices.

Fig. 1 shows the main parts of a JPC coder: the K-L color space transformation, the motion estimation module, the mode selector, the joint predictor, the adaptive quantizer, and the entropy coder. The basic differences between JPC and the traditional H.26x family of coding schemes are the prediction and transform modules, shown in the darker boxes. In JPC encoding, the pixels of video frames are coded hierarchically. Each pixel is predicted by several available pixels that are most temporally or spatially correlated with it. The prediction errors are quantized and entropy coded. This scheme leads to good compression performance at a relatively small computational cost.


Fig. 1. Block diagram of JPC encoder.

When using a single-frame reference and MPEG-2-level ME techniques [3], such as a single block size and half-pel precision ME, JPC significantly outperforms MPEG-2 at the cost of 30% more complexity. It is expected that JPC will do even better if more advanced techniques, such as variable block sizes, multiple frame references, and finer sub-pixel (i.e., quarter-pel) precision motion estimation, are used.

Section II briefly introduces interpolative coding and the Adaptive Prediction Trees that form the basis of JPC. Section III describes and analyzes the JPC scheme in detail. Section IV presents experimental results and a performance comparison. Section V concludes.

II. INTERPOLATIVE CODING AND ADAPTIVE PREDICTION TREES

Interpolative coding for image compression is a kind of predictive coding dating back over 20 years; it was first introduced by Hunt [4]. Several years later, Burt and Adelson built upon the coding scheme [5], and it was further developed by several groups [6]–[11]. In such methods, the pixels of an image are arranged in a series of groups called bands, of increasing sample density, and the encoder goes through all the bands from coarse to fine. The main idea is to use grids of points, oriented both squarely and diagonally, to predict their midpoints recursively. These groupings of square and diagonal points are used alternately for neighboring bands.

Fig. 2 shows the band structure of interpolative coding. It represents a 5 × 5 image divided into bands denoted by the letters A to E. Pixels in each band are predicted by those in higher bands. Pixels labeled A form a square whose center is a pixel labeled B; thus B, and all equivalent B's throughout the image, are predicted by A's; similarly, C's are predicted by A's and B's, and so on. Different bands have different quantization intervals for the prediction errors. Pixels in primary bands (especially Band A), being more important, have smaller quantization intervals, while pixels in later bands, which have lower entropy, are quantized coarsely.
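To make the band layout concrete, the following sketch computes a band index for every pixel of a small image. The letter-to-index mapping (A–E as 0–4) and the top-band spacing are our illustrative assumptions; the code follows the alternating square/diagonal subdivision described above, not code from [2].

```python
import numpy as np

def band_map(h, w, top_spacing=4):
    # Assign each pixel of an h x w image to an interpolative-coding band.
    # Band 0 is the coarse square grid; each pass adds first the diagonal
    # centers, then the edge midpoints, halving the sampling step.
    bands = np.full((h, w), -1, dtype=int)
    band, s = 0, top_spacing
    bands[::s, ::s] = band                       # Band A: coarsest square grid
    while s > 1:
        band += 1
        bands[s // 2::s, s // 2::s] = band       # diagonal band: square centers
        band += 1
        half = s // 2
        on_half_grid = np.zeros((h, w), dtype=bool)
        on_half_grid[::half, ::half] = True
        bands[(bands == -1) & on_half_grid] = band  # square band: edge midpoints
        s = half
    return bands

print(band_map(5, 5))   # reproduces a 5 x 5 layout like Fig. 2 with A..E as 0..4
```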

In Fig. 3, each dashed square represents a predictive grid; it helps to explain the prediction algorithm. Pixel X is the center of the square grid formed by A, B, C, and D. Pixels E and F are earlier processed neighbors in the same band as X, while pixels A, B, C, and D belong to a higher band. To predict X, bilinear prediction is usually ineffective. A

Fig. 2. Band structure.

Fig. 3. Predictive grid.

nonlinear approach that selects a subset of the predictor set S = {A, B, C, D, E, F} as predictors was proposed by Robinson in his image compression scheme called Adaptive Prediction Trees (APT) [2]. The value of a pixel can be less than, greater than, or equal to its closest neighbor, and comparing the predictors pairwise in this way yields a large number of distinct combinations. However, several of the combinations are impossible (e.g., mutually inconsistent orderings). After eliminating the impossible combinations, experimental results on training images were used to determine the best predictor for each remaining combination of predictor values, which is defined as the structure. These predictors were then combined, duplicates were eliminated, and a small set of "rules" remained.

For the sake of the joint spatiotemporal prediction description in Section III, we split the APT prediction process into two parts. First, get the structure feature in terms of the order of the amplitudes of the predictors. This is represented by a function Get_Structure(), whose return value, structure, indicates which


of the six possible pixels are used for prediction. Second, predict X from the predictors based on the determined structure. This step is represented by a function Adaptive_Pred(), which returns the estimate \hat{X} of X. The prediction method in APT can thus be expressed as

    structure = Get_Structure(S)
    \hat{X} = Adaptive_Pred(S, structure)    (1)

The Get_Structure() and Adaptive_Pred() functions together not only determine which of the six neighboring pixels are used in the prediction, but also specify how the prediction is calculated based upon the determined structure.
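The two-stage interface can be sketched as follows. The real APT rule set is empirical and is given in [2]; the bodies below are simplified stand-ins (ranking only the four grid predictors and averaging the two middle-ranked ones), chosen so that a structure computed in one frame can be reused in another, which is the property JPC exploits in Section III.

```python
def get_structure(S):
    # S = (A, B, C, D, E, F). Stand-in for Get_Structure(): the structure
    # feature is the amplitude ordering of the four grid predictors.
    grid = S[:4]
    return tuple(sorted(range(4), key=lambda i: grid[i]))

def adaptive_pred(S, structure):
    # Stand-in for Adaptive_Pred(): estimate the grid center from the two
    # middle-ranked grid predictors (a median-like, edge-preserving choice).
    # The real APT rules may also consult the same-band neighbors E and F.
    grid = S[:4]
    return (grid[structure[1]] + grid[structure[2]]) / 2.0

S = (10, 12, 30, 11, 9, 13)
structure = get_structure(S)
print(structure, adaptive_pred(S, structure))   # (0, 3, 1, 2) 11.5
```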

III. JOINT PREDICTIVE CODING SCHEME

This section introduces joint predictive coding in detail. Subsection A describes the joint spatiotemporal prediction technically. Subsections B and C explain the design process of the mode configuration and of the filter for better subjective quality. The main functional stages of JPC encoding are presented in Subsection D. Our JPC method is compared to binary tree residue coding [12] and temporal prediction trees [13] in Subsections E and F, respectively. A theoretical analysis of joint prediction versus traditional motion compensation is given in the Appendix. Analysis and experiments show that JPC outperforms the other methods.

A. Joint Spatiotemporal Prediction

Classical video coders choose between intra- and interprediction to remove the correlation in video frames and, thus, achieve greater compression. Unfortunately, there are cases in which neither the temporal nor the spatial correlation holds the dominant position. We propose a joint spatiotemporal prediction that considers both of them and leads to better accuracy in most circumstances.

As shown in Fig. 3, the pixel X = X(x, y, t) has six possible neighboring predictors, forming its predictor set S = {A, B, C, D, E, F}. Its motion vector (mv_x, mv_y) can be obtained by motion estimation. The matching pixel (possibly generated by interpolation) in the reference frame, X_r = X(x + mv_x, y + mv_y, t - 1), similarly has its predictor set S_r = {A_r, B_r, C_r, D_r, E_r, F_r}. Joint prediction takes both S and S_r into account to estimate X. To get the joint predicted value \hat{X}_{jp}, JPC executes the following steps (see Fig. 4).

First, for a pixel X in the current frame, implement adaptive prediction with its predictor set S and acquire the structure feature and the prediction error. The functions below are defined in Section II; \hat{X}_{intra} is the intrapredicted value, calculated as in (2) in the same way as in APT, and intra_error is the prediction error:

    structure = Get_Structure(S)
    \hat{X}_{intra} = Adaptive_Pred(S, structure)
    intra_error = X - \hat{X}_{intra}    (2)

Fig. 4. Illustration of joint spatiotemporal prediction.

Then, use the same structure to estimate the referring pixel X_r with its predictor set S_r, as in (3). For example, if the estimate of X is (B + C)/2, the estimate of X_r shall be (B_r + C_r)/2; and if the estimate of X is E, the estimate of X_r shall be E_r:

    \hat{X}_r = Adaptive_Pred(S_r, structure)    (3)

Finally, the joint predicted value is computed as the ME prediction of X from the previous frame, plus the APT prediction of X in the current frame, minus the APT prediction of X in the previous frame. The estimated value \hat{X}_{jp} and the prediction error jp_error can be expressed as follows:

    \hat{X}_{jp} = X_r + \hat{X}_{intra} - \hat{X}_r
    jp_error = X - \hat{X}_{jp} = (X - X_r) - (\hat{X}_{intra} - \hat{X}_r) = mc_error - pmc_error    (4)

In (4), mc_error = X - X_r and pmc_error = \hat{X}_{intra} - \hat{X}_r can be regarded as the prediction errors of motion estimation in the actual domain and in the predicted domain, respectively.

All predictive coding strives to make the prediction error as small as possible. Since APT does well on still images and motion compensation does well on videos with regular motion, JPC adopts a joint predictive approach that combines the advantages of both. The joint prediction error jp_error takes the place of the traditional motion compensated residual mc_error and is expected to be smaller.
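Reusing the stand-in functions sketched in Section II, the per-pixel encoder-side computation of (2)–(4) reads:

```python
def joint_predict(S, S_r, X_r):
    # Joint spatiotemporal prediction of one pixel, following (2)-(4).
    structure = get_structure(S)             # (2): structure from current frame
    x_intra = adaptive_pred(S, structure)    # (2): intra estimate of X
    xr_hat = adaptive_pred(S_r, structure)   # (3): same structure applied to S_r
    return X_r + x_intra - xr_hat            # (4): X_jp

def joint_prediction_error(X, S, S_r, X_r):
    # (4): jp_error = X - X_jp = mc_error - pmc_error
    return X - joint_predict(S, S_r, X_r)
```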

To decode a joint predicted pixel, JPC executes the following steps.

1) Predict X with S, determine the structure, and get the estimated value \hat{X}_{intra}.

2) Predict X_r with S_r using the same structure as in 1) and get the estimated value \hat{X}_r.

3) Compute the joint predicted value \hat{X}_{jp} = X_r + \hat{X}_{intra} - \hat{X}_r.

4) The decoded value is X' = \hat{X}_{jp} + jp_error', where jp_error' is the dequantized prediction error.

B. Mode Configuration

Different prediction methods fit different cases. Although joint prediction is more precise than motion estimation in many cases, for some sequences with very regular or fast object motion, direct compensation or intraprediction does even better. Multiple-mode prediction is therefore necessary for good compression. JPC considers four candidate modes, aimed at different features of blocks of video frames.

Skip mode: the values of pixels in such blocks are copied directly from the pixels at the same positions in the reference frame.

Motion Compensation (MC) mode: pixels in such blocks are predicted as X_r in the reference frame. This is the same as the inter mode in traditional coding.

Joint Prediction mode: pixels in such blocks are predicted as \hat{X}_{jp}, as introduced in Section III-A. Experiments show this mode covers most of the content of video sequences.

Intra mode: pixels in such blocks are predicted as \hat{X}_{intra}, which is acquired by APT prediction.

To investigate the necessity of the candidate modes besides the Joint Prediction mode, different kinds of video sequences were tested at various quality parameters. Each mode was disabled in turn, and the resulting increase in the sum of absolute differences (SAD) of the prediction was calculated. Table I shows the average SAD increase caused by disabling each mode. All three modes have a significant effect on the prediction accuracy and, thus, all of them are included in the system.

TABLE I. EFFECT OF DISABLING EACH MODE

With the help of mode selection, the prediction error of JPC is 24.6% smaller than the traditional inter/intra prediction residual.

C. “Despeckling” Filter Design

The Peak Signal-to-Noise Ratio (PSNR) of the reconstructed video relative to the original is widely used to objectively measure video quality. However, because of the characteristics of the human visual system, PSNR sometimes does not correlate with perceived quality. Section IV-C will present comparisons between JPC and H.264 at equivalent PSNRs, where JPC shows additional detail without the blocking artifacts associated with H.264. However, like all interpolative prediction based coding, JPC has its own problem: speckling artifacts, typified by the example shown in Fig. 5(a). When compressing at low bit-rates, the noise becomes obvious and has to be removed. Therefore, a "despeckling" filter is proposed.

Because of the similarity between speckling artifacts and salt-and-pepper noise, order-statistic filtering [14], a basic solution in both video and image coding, is chosen. Each pixel is processed as follows:

Fig. 5. Effect of the "despeckling" filter on the tenth frame of FOREMAN. (a) Speckling artifact caused by original JPC. (b) Result of JPC compression with the "despeckling" filter.

Sort the values in the 3 × 3 neighborhood. If the value of the central pixel is higher/lower than the k-th highest/lowest value, it is changed into that k-th value. Note that k is an empirical parameter, set to 3 or 4 adaptively in APT [15].

Studies on a number of video cases show that the lack of correlation between the content of successive frames, manifested as noisy pixels, is the key problem for visual quality. To improve continuity across successive frames, JPC processes the 3 × 3 neighborhood of the target pixel along with the 3 × 3 neighborhood of its reference pixel. Thus, 18 pixels are sorted, followed by the regular operations.
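A sketch of the joint order-statistic filter follows. For brevity it assumes the reference neighborhood is co-located (zero motion); in JPC proper the 3 × 3 window around the motion-compensated reference pixel would be used. k = 3 is one of the empirical settings mentioned above.

```python
import numpy as np

def despeckle(frame, ref, k=3):
    # Sort the 18 values from the pixel's 3x3 neighborhood and its reference
    # pixel's 3x3 neighborhood; clamp the center to the k-th order statistic.
    h, w = frame.shape
    out = frame.copy()
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = np.concatenate([
                frame[y - 1:y + 2, x - 1:x + 2].ravel(),
                ref[y - 1:y + 2, x - 1:x + 2].ravel(),
            ])
            window.sort()
            lo, hi = window[k - 1], window[-k]   # k-th lowest / k-th highest
            out[y, x] = min(max(frame[y, x], lo), hi)
    return out
```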

There are three possible filtering methods: a prefilter processes the image before the actual coding; a postfilter processes it during reconstruction; and a loop filter not only processes the reconstructed image but also substitutes the filtered frame for the original as a reference. The three methods were tested experimentally, and loop filtering proved superior to the others: it not only improves the subjective quality but also significantly improves the objective quality, by 0.5–1 dB in many cases. Because of these experimental results, we opted for the loop-filter design. As shown in Fig. 5(b), the "despeckling" filter, although still preliminary, removes most of the noise.


D. Procedure of Joint Predictive Coding

1) Preparation: First, the JPC codec computes the quantization intervals based upon a user-selectable quality parameter, derived as in the APT system [2]. Then, it reads YUV data from the video sequence in 4:4:4 format (KLT requires the same sample interval on all components). Each frame is extended by copying the border pixels into the extended region for the subsequent motion estimation and prediction.

2) Color Space Transformation: For video frames whose quality parameter is below an empirical threshold, Principal Component Analysis (PCA), i.e., the KLT, is used to achieve even greater compression. Each pixel in the current frame contributes a 3-D sample to the covariance matrix, whose eigenvalues and eigenvectors are then calculated. Based on these, the matrix is diagonalized to determine the basis of the PCA color space. The eigenvector with the highest eigenvalue is the principal component of the data set.

Both the current and the reference frame are transformed into the new color space, in which the first component of the current frame has the highest energy. Thus, motion estimation and structure determination are performed on the first component. PCA color space transformations have been used in other image coding schemes [16]; however, JPC performs motion estimation and joint prediction in the image-dependent PCA color space instead of the traditional YUV color space, which achieves a higher level of compression.
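A minimal sketch of the image-dependent transform: the covariance of the current frame's color samples is diagonalized and both frames are projected onto its eigenvectors, principal component first. Function and variable names are ours.

```python
import numpy as np

def klt_color_transform(cur, ref):
    # cur, ref: (H, W, 3) float arrays in YUV (4:4:4).
    samples = cur.reshape(-1, 3)
    mean = samples.mean(axis=0)
    cov = np.cov((samples - mean).T)          # 3x3 covariance of color samples
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = eigvecs[:, ::-1]                  # highest-energy component first
    project = lambda f: ((f.reshape(-1, 3) - mean) @ basis).reshape(f.shape)
    return project(cur), project(ref), mean, basis   # mean/basis for the inverse
```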

3) Motion Estimation: The JPC encoder does a half-pel precision motion vector search for every 16 × 16 block based on the principal component of the current frame. After this stage, each pixel has a motion vector (mv_x, mv_y).
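A full-search half-pel block matcher can be sketched as below, operating on a 2× upsampled reference (the upsampling filter is left open here; [17] describes the 6-tap filter JPC uses for interpolation). The search range and interface are our illustrative assumptions.

```python
import numpy as np

def half_pel_search(cur, ref2x, bx, by, bs=16, search=8):
    # Minimize SAD over half-pel displacements for one bs x bs block whose
    # top-left corner is (bx, by). ref2x is the reference upsampled by 2.
    block = cur[by:by + bs, bx:bx + bs].astype(np.int64)
    best = (None, 0.0, 0.0)
    for dy2 in range(-2 * search, 2 * search + 1):      # half-pel units
        for dx2 in range(-2 * search, 2 * search + 1):
            y0, x0 = 2 * by + dy2, 2 * bx + dx2
            if (y0 < 0 or x0 < 0 or y0 + 2 * bs > ref2x.shape[0]
                    or x0 + 2 * bs > ref2x.shape[1]):
                continue                                 # outside the frame
            cand = ref2x[y0:y0 + 2 * bs:2, x0:x0 + 2 * bs:2].astype(np.int64)
            sad = int(np.abs(block - cand).sum())
            if best[0] is None or sad < best[0]:
                best = (sad, dx2 / 2.0, dy2 / 2.0)
    return best                                          # (SAD, mv_x, mv_y)
```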

4) Mode Selection: JPC has four modes for blocks of different characteristics in video sequences: Skip, Motion Compensation, Joint Prediction, and Intra.

First, the encoder determines whether to choose the Skip mode based on the SAD of motion estimation. Then, it pre-predicts the current frame in the Motion Compensation, Joint Prediction, and Intra modes. The mean square errors (MSE) of the three modes in each 16 × 16 block, MSE_mc, MSE_jp, and MSE_intra, are computed, and the mode with the smallest MSE is selected (considering the tradeoff between MSE and motion-vector cost, MSE_intra is multiplied by an empirical coefficient before the comparison). Mode information and motion vectors are written into the bit-stream at the end of this stage.
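The per-block decision can be sketched as follows; w_intra stands in for the empirical coefficient (its value, and the assumption that it is the intra-mode MSE that is weighted, are ours):

```python
import numpy as np

def select_mode(block, pred_mc, pred_jp, pred_intra, w_intra=0.9):
    # Compare the MSEs of the Motion Compensation, Joint Prediction, and
    # Intra predictions for one 16 x 16 block; w_intra credits the Intra
    # mode for the motion-vector bits it saves (assumed value).
    b = block.astype(float)
    mse = lambda p: float(np.mean((b - p.astype(float)) ** 2))
    costs = {"MC": mse(pred_mc),
             "JP": mse(pred_jp),
             "Intra": w_intra * mse(pred_intra)}
    return min(costs, key=costs.get)
```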

5) Predictions and Reconstruction: JPC predicts the three components one by one and goes through all of the pixels in the current frame band by band, as all interpolative coders do. The estimated value \hat{X} is determined by the pixel's position and mode. To initialize the predictors, pixels on the top band (Band A) are estimated from the referring pixels X_r or from the nearest previously coded neighbors. Assuming the sample interval of the top band is d, the initialization is

    \hat{X}(x, y, t) = X(x + mv_x, y + mv_y, t - 1)   (inter-coded blocks)
    \hat{X}(x, y, t) = X'(x - d, y, t)                (intra-coded blocks)    (5)

For the rest of the bands, each pixel is predicted according to its mode, determined in Step 4):

    \hat{X} = X_r                                   (Skip and MC modes)
    \hat{X} = X_r + \hat{X}_{intra} - \hat{X}_r     (Joint Prediction mode)
    \hat{X} = \hat{X}_{intra}                       (Intra mode)    (6)

The prediction error jp_error = X - \hat{X} is quantized with the quantization interval computed in Step 1) and reconstructed as jp_error'. The reconstructed value of X is X' = \hat{X} + jp_error'. At the decoder, the output for the pixel X will equal this X'. X and X' differ because of the quantization of jp_error; therefore, X' is used for the downward prediction instead of X.
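The quantize/reconstruct step that keeps encoder and decoder in sync can be sketched as a uniform quantizer (the actual intervals are band- and quality-dependent, as computed in Step 1)):

```python
def quantize(err, q):
    # Uniform quantization of a prediction error with interval q.
    return int(round(err / q))        # level written to the bit-stream

def dequantize(level, q):
    return level * q                  # jp_error', shared by encoder and decoder

# Both sides form X' = X_hat + dequantize(quantize(X - X_hat, q), q) and use
# X' (not X) to predict later bands, so quantization loss never desynchronizes them.
```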

To save encoding time, the prediction of the second and third components does not perform mode selection or structure determination. Instead, it follows the mode and structure of the principal component.

After this stage, three matrices (one per component) of quantized jp_errors are formed and written into the stream. The current frame is then reconstructed, transformed back to the YUV color space, "despeckling" filtered (optionally), interpolated by a 6-tap Finite Impulse Response filter [17], and then saved as a reference frame.

6) Reorder and Entropy Coding: The prediction errors are reordered before entropy coding. They are rearranged in band order so that there are more consecutive zeroes, especially in coarsely quantized bands, which improves the efficiency of run-length coding. Finally, three entropy coders perform ordinary Huffman-runlength coding [14] on the three components independently.
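The reorder-then-run-length idea can be sketched as follows (the Huffman stage of [14] is omitted; names are ours):

```python
def reorder_by_band(errors, bands):
    # Group quantized errors band by band so coarsely quantized (later) bands
    # contribute long zero runs.
    order = sorted(range(len(errors)), key=lambda i: bands[i])
    return [errors[i] for i in order]

def run_length(values):
    # (value, run) pairs; a stand-in for the Huffman-runlength coder.
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

print(run_length(reorder_by_band([0, 5, 0, 0, 0, 7], [1, 0, 1, 1, 1, 0])))
# -> [[5, 1], [7, 1], [0, 4]]
```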

7) Go to the Next Frame.

E. Comparison With Binary Tree Residue Coding

A prior codec that applies interpolative prediction to video coding is Binary Tree Residue Coding (BTRC), developed by Robinson, Druet, and Gosset [12]. It simply regards the motion compensated residue as an image and codes it with Binary Tree Predictive Coding (BTPC) [18], a predecessor of APT. However, as the residue is no longer a natural image, its intra-frame correlation is weaker; consequently, APT or BTPC does not perform as well on this kind of image as it does on natural images. In the proposed approach, the prediction is performed only on original pixels. An explanation follows detailing why joint prediction leads to better compression than BTRC.

Let us define a function Pred(S) that returns the prediction error of the APT prediction process described in Section II. Then, for joint prediction, the error is

    jp_error = Pred(S) - Pred(S_r)    (7)

Alternatively, for comparison, the error returned by BTPC applied to the motion compensated residue image is

    btrc_error = Pred(S - S_1)    (8)

where S - S_1 denotes the predictor set taken from the residue image, i.e., the differences between each predictor pixel and its own motion-compensated counterpart in the previous frame.

Although extremely efficient on static images, BTPC encounters difficulty on video because of the nature of how the pixels are predicted: the locations of the pixels used for


prediction in one frame may change significantly in successive frames, because the pixels may have different motion vectors.

Fig. 6. BTRC difficulty with integrating motion vector predictions.

Consider the btrc_error defined above as comprising two parts, that is, the Pred() of the current frame minus the prediction of the reference frame. Note that the prediction function Pred_1 of the reference frame is not the same as the Pred of the current frame:

    btrc_error = Pred(S) - Pred_1(S_1)    (9)

BTRC reconstructs an image (frame) by recursively using the prediction errors of previously encoded pixels. For example, in Fig. 6, the pixel X is to be reconstructed from the four neighboring pixels that form a square, i.e., A, B, C, and D; but in a video, a very likely scenario is that these pixels have completely different motion vectors. Let A_1, B_1, C_1, and D_1 be the corresponding pixels in the previous frame. Consequently, the function Pred_1 uses four unrelated pixels A_1, B_1, C_1, and D_1 (they do not form a square) to predict X_1. In other words, the facts that the differences A - A_1, B - B_1, C - C_1, and D - D_1 are used to predict X - X_1, and that the pixels A_1, B_1, C_1, and D_1 may be unrelated in the previous frame, render the BTRC reconstruction algorithm inefficient. In contrast, the proposed JPC uses only the motion vector of the pixel X.

We have experimentally demonstrated that joint prediction leads to an average bit-rate reduction of 20.5% compared to this approach.

Fig. 7 shows the rate-distortion curves of JPC and BTRC on the sequences News and Football, which verify the advantage of the joint predictive method over BTRC. These results are not unexpected: BTRC and its successor APT were not designed with video in mind, and their difficulties with video are summarized above. By utilizing the adaptive structure-based empirical prediction of APT and combining it with the joint prediction method, we demonstrate a marked improvement over BTRC/APT alone.

Fig. 7. Comparison of JPC and BTRC.

F. Comparison With Temporal Prediction Trees

Day and Robinson developed another approach to video compression called Temporal Prediction Trees (TPT) [13]. The main idea is to determine a temporal or spatial prediction mode for each pixel based on a threshold (calculated from several parameters such as the desired quality and the motion vectors) and to perform adaptive temporal/spatial coding. Temporal prediction, which simply predicts the current pixel as its corresponding pixel in the reference frame, is the same as the Motion Compensation mode in JPC. Spatial prediction uses the APT spatial predictor for the current pixel. Since this approach still separates the spatial and temporal prediction, it is not as precise as the proposed joint spatiotemporal prediction. As can be seen in Fig. 8, TPT is not able to match the performance of JPC. (Luma-weighted PSNR weights the luminance component four times as highly as each of the chrominance components.)

Another reason for the performance difference is the tree structure. APT and TPT use a so-called "hex-tree" structure to arrange pixels, which efficiently represents large areas of an image with zero prediction error.


Fig. 8. Comparison of JPC and TPT on the 300-frame sequence Mother and Child (352 × 288).

They introduce a "terminator" flag to represent the sub-trees whose nodes are all zero values [15], and use a flag for every nonterminated node indicating whether its prediction error is zero and whether it is a terminator. This gives good results for still image compression. Video sequences, however, have much more redundancy between nearby frames; consequently, after motion compensation, there are many more groupings of zeroes among the prediction errors. A higher number of consecutive zeroes lends itself better to alternative methods of compression, including traditional run-length coding, and in this case the tree-flag mechanism leads to inefficiency. As a result, JPC keeps the basics of APT prediction (adaptive structures, prediction, etc.) but abandons the trees; run-length coding is used instead.

IV. EXPERIMENTAL RESULTS AND COMPARISON

A. Rate-Distortion Performance

JPC is compared with MPEG-2 and H.264 (High 4:4:4 profile) on six 100-frame CIF (352 × 288, 30 frames/s) video sequences: Football, News, Foreman, Stefan, Bus, and Tempete. The MPEG-2 V1.2 [19] and JM 12.4 [20] reference software are used for MPEG-2 and H.264, respectively. A coding structure of IPPPPIPPP is chosen. We take PSNR as a function of bit-rate as the rate-distortion measure and investigate a large range of performance, up to 50 dB PSNR, which may help indicate the possible applications of our new codec. Overall, H.264 proves superior to the other methods; however, JPC shows promising results. Figs. 9–11 present the results of comparative tests on the sequences News, Foreman, and Football. All coders compress the low-activity sequence News well. For the sequence Foreman (the part with rapid camera motion) and Football (with fast object motion), the coding efficiency is poorer, but the relative performance remains the same. It can also be seen that JPC is able to catch up with H.264 when coding videos at extremely high quality. In general, JPC outperforms MPEG-2 in all cases but cannot yet match H.264.

Fig. 9. Comparison of JPC, MPEG-2 and H.264 on the first 100 frames of the sequence News.

Fig. 10. Comparison of JPC, MPEG-2 and H.264 on the 130th to 230th frames of the sequence Foreman.

Nevertheless, several areas of improvement are already being researched and experimented with, including variable block sizes, multiple frame references, and more precise (i.e., quarter-pel) motion estimation and mode selection. These improvements are expected to result in better compression.

B. Speed

Tables II and III present an overview of the main modules employed by JPC that differ from the corresponding modules of H.264 and MPEG-2. Modules common to all three schemes (e.g., entropy coding) are not listed.

The JPC source code has not been optimized; it is therefore at a disadvantage in speed comparisons with other coding schemes. We nevertheless present the timing information for reference. The MPEG-2 V1.2 [19] and JM 12.4 [20] reference software are used for the MPEG-2 and H.264 codecs, respectively. The tests were done on a PC with a Pentium 4 2.6-GHz CPU and 512 MB of memory. All three coders perform a full search for motion estimation, and the filters are disabled.


Fig. 11. Comparison of JPC, MPEG-2 and H.264 on the first 100 frames of the sequence Football.

TABLE II. MODULES FOR ENCODING COMPLEXITY COMPARISON

TABLE III. MODULES FOR DECODING COMPLEXITY COMPARISON


As can be seen from the timing data in Tables IV and V, the good performance of the nonoptimized JPC codec demonstrates that traditional coding schemes can be further refined.

We tested six video sequences over a full range of quality parameters and recorded the average time cost for each sequence. Table IV shows the encoding time cost for I frames and P frames respectively, while Table V shows the decoding time cost of the three codecs for the modules listed above. For encoding, JPC is 57.6% faster than H.264 but 39.9% slower than MPEG-2; for decoding, it is 7.9% faster than H.264 and 18.0% faster than MPEG-2.

TABLE IV. AVERAGE TIME COST OF ENCODING ONE FRAME (ms)

TABLE V. AVERAGE TIME COST OF DECODING ONE FRAME (ms)

It is clear that the computational complexity of JPC is acceptable. The methods presented in this paper show promise for practical applications, and future enhancements will improve their practicality further. The current implementation has some inefficiency in the entropy coding process: the JPC entropy coder goes through all the prediction errors twice, once to collect statistics and once to code the values. For better efficiency, the entropy code table for the prediction errors should be designed in advance; the encoder would then go through the values only once, as is the case in MPEG and H.26x implementations.

C. Subjective Quality

Although the compression efficiency of JPC is currently worse than that of H.264, JPC yields much better subjective quality in some specific domains. After conducting a large number of comparative tests on various types of video, we conclude that JPC and H.264 each have their strong points in subjective quality. However, JPC shows a significant advantage over H.264 in coding videos with text, as well as videos with complicated textures and irregular contours.

Fig. 12 presents one reconstructed frame of a video of a slide presentation. Fig. 12(a) is reconstructed by H.264, with a luma-weighted PSNR of 31.1 dB; Fig. 12(b) is reconstructed by JPC, with a luma-weighted PSNR of 31.3 dB. JPC preserves the text well when compressing slides, while H.264 fails to maintain good quality. Fig. 13 shows that H.264 produces obvious blocking artifacts on videos with complicated textures and irregular contours, while JPC does much better in subjective quality.

The main problem of JPC is that it generates temporal noise (speckling artifacts) under high compression, which makes it


Fig. 12. Subjective quality comparison of H.264 and JPC on a video of slides.

Fig. 13. Subjective quality comparison of H.264 and JPC on a landscape video.

underperform H.264 in some cases. We note that H.264 uses sophisticated filters, while the current JPC filter has only recently been implemented and has not been fully optimized. Nevertheless, the filter shows promising results in removing the noise, which indicates that the speckling artifact problem is not unsolvable. Because of its principal advantage over block-based coding, JPC is very likely to be able to obtain better subjective quality in many domains.

V. CONCLUSION

The good performance of the proposed joint predictive coding shows that joint spatiotemporal prediction is a serious competitor to current block based video coding techniques. With the prediction error reduced by 24.6% on average compared to the traditional inter/intra prediction approach, joint spatiotemporal prediction is clearly superior to conventional inter/intra prediction. JPC not only introduces a new prediction method but also proposes a new KLT/Joint Prediction framework to replace the traditional ME/DCT mechanism, opening up a new and promising direction in video coding. With the help of the joint prediction technique, JPC performs much better than MPEG-2 using equivalent ME techniques. It is at least a promising alternative for related applications such as DVD and digital satellite broadcasting. After further research, specifically into better

filtering techniques, JPC has strong potential to become a very competitive video coding scheme.

APPENDIX

THEORETICAL ANALYSIS OF JOINT PREDICTION

The conventional approach to prediction is to use motion estimation to find the best matching block in a reference frame and take the corresponding pixel as the estimate of the current one. However, this is not sufficient in cases where the temporal correlation is limited. Joint spatiotemporal prediction utilizes both temporal and spatial correlation and usually yields a better estimate. To show that joint prediction is a better approach than motion compensation, we compare the variances of the joint prediction error and the motion compensated residue.

Assume that \sigma_{jp}^2 is the variance of the joint prediction error and \sigma_{mc}^2 is the variance of the motion compensated residue; E(\cdot) denotes the expected value. The motion compensated residue is known to follow a zero-mean Laplacian distribution [21], i.e., E(mc_error) = 0. Assume A_i is one of the pixels used for the prediction of X and w_i is its weight. Then the APT estimate of X is computed as

    \hat{X}_{intra} = \sum_i w_i A_i


According to the rules of APT prediction, the weights satisfy w_i > 0 and

    \sum_i w_i = 1

which can be obtained by mathematical induction over the recursive prediction rules. Similarly, \hat{X}_r = \sum_i w_i A_{r,i}, so that

    pmc_error = \hat{X}_{intra} - \hat{X}_r = \sum_i w_i (A_i - A_{r,i}) = \sum_i w_i mc_error(A_i)

Since the differences A_i - A_{r,i} are themselves motion compensated residues, they follow the same Laplacian distribution as mc_error. By the Schwarz inequality, E[mc_error(A_i) mc_error(A_j)] \le \sigma_{mc}^2, so

    \sigma_{pmc}^2 = \sum_i \sum_j w_i w_j E[mc_error(A_i) mc_error(A_j)] \le \Big(\sum_i w_i\Big)^2 \sigma_{mc}^2 = \sigma_{mc}^2

Define \rho as the correlation coefficient of mc_error and pmc_error,

    \rho = E[mc_error \cdot pmc_error] / (\sigma_{mc} \sigma_{pmc})

With jp_error = mc_error - pmc_error from (4) and the above substitution, we obtain

    \sigma_{jp}^2 = \sigma_{mc}^2 + \sigma_{pmc}^2 - 2\rho \sigma_{mc} \sigma_{pmc}

The above derivation proves the following theorem.

Theorem: As long as the correlation coefficient of mc_error and pmc_error satisfies \rho \ge 1/2, it holds that \sigma_{jp}^2 \le \sigma_{mc}^2, i.e., the variance of the joint prediction error is smaller than the variance of the motion compensated residue. Indeed, since \sigma_{pmc} \le \sigma_{mc},

    \sigma_{jp}^2 \le \sigma_{mc}^2 + \sigma_{pmc}^2 - \sigma_{mc} \sigma_{pmc} = \sigma_{mc}^2 + \sigma_{pmc}(\sigma_{pmc} - \sigma_{mc}) \le \sigma_{mc}^2

Since pmc_error = \sum_i w_i mc_error(A_i) and the intraprediction of APT is accurate, it is expected that mc_error and pmc_error are highly correlated, which means that the assumption \rho \ge 1/2 should be satisfied in most cases.

We investigated eight 100-frame video sequences, ranging from static to dynamic, for the correlation of mc_error and pmc_error. The values of \rho were computed empirically from each pair of successive frames, and the resulting distribution is shown in Fig. 14. The value of \rho is seldom smaller than 0.5, in either the low-activity sequence News or the very dynamic sequence Football. As a result, \sigma_{jp}^2 \le \sigma_{mc}^2 is true in most cases; therefore, joint prediction achieves better compression than motion compensation for the majority of videos.


Fig. 14. Distribution of the correlation coefficient \rho over 100 frames of the sequences News and Football.
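The theorem is easy to check numerically. The sketch below draws correlated zero-mean errors of equal variance (Gaussian rather than Laplacian, for simplicity; the variance identity does not depend on the shape of the distribution) and compares the two variances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.8                     # assumed correlation, rho > 1/2

cov = [[1.0, rho], [rho, 1.0]]            # equal variances, correlation rho
mc_error, pmc_error = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
jp_error = mc_error - pmc_error

# Var(jp) ~ 2(1 - rho) = 0.4 < 1 ~ Var(mc), as the theorem predicts.
print(np.var(jp_error), np.var(mc_error))
```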

REFERENCES

[1] W. Pu, Y. Lu, and F. Wu, "Joint power-distortion optimization on devices with MPEG-4 AVC/H.264 codec," in Proc. IEEE Int. Conf. Communications, Jun. 2006, vol. 1, pp. 441–446.

[2] J. A. Robinson, "Adaptive prediction trees for image compression," IEEE Trans. Image Process., vol. 15, no. 8, pp. 2131–2145, Aug. 2006.

[3] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: A review and a new contribution," Proc. IEEE, vol. 83, no. 6, pp. 858–876, Jun. 1995.

[4] B. R. Hunt, "Optical computing for image bandwidth compression: Analysis and simulation," Appl. Opt., vol. 15, pp. 2944–2951, Sep. 1978.

[5] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun., vol. COM-31, no. 4, Apr. 1983.

[6] J. W. Woods and S. D. O'Neil, "Subband coding of images," IEEE Trans. Acoust., Speech, Signal Process., vol. 34, no. 5, pp. 1278–1288, Oct. 1986.

[7] P. Roos, M. A. Viergever, M. C. A. van Dijke, and J. H. Peters, "Reversible intraframe coding of medical images," IEEE Trans. Med. Imag., vol. 7, no. 12, pp. 328–336, Dec. 1988.

[8] L. Arnold, "Interpolative coding of images with temporally increasing resolution," Signal Process., vol. 17, pp. 151–160, 1989.

[9] P. G. Howard and J. S. Vitter, "New methods for lossless image compression using arithmetic coding," Inf. Process. Manag., vol. 28, no. 6, pp. 765–779, 1992.

[10] P. G. Howard and J. S. Vitter, "Fast progressive lossless image compression," presented at the Data Compression Conf., Snowbird, UT, Mar. 1994.

[11] E. A. Gifford, B. R. Hunt, and M. W. Marcellin, "Image coding using adaptive recursive interpolative DPCM," IEEE Trans. Image Process., vol. 4, no. 8, pp. 1061–1069, Aug. 1995.

[12] J. A. Robinson, A. Druet, and N. Gosset, "Video compression with binary tree recursive motion estimation and binary tree residue coding," IEEE Trans. Image Process., vol. 9, no. 7, Jul. 2000.

[13] M. G. Day and J. A. Robinson, "Residue-free video coding with pixel-wise adaptive spatio-temporal prediction," IET Image Process., vol. 2, no. 3, pp. 131–138, Jun. 2008.

[14] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 2005, Sections 3.6.2 and 8.4.

[15] APT Online Reference Code [Online]. Available: http://www.intuac.com/userport/john/apt/index.html

[16] C. L. Yang, L. M. Po, D. H. Cheung, and K. W. Cheung, "A novel ordered-SPIHT for embedded color image coding," in Proc. IEEE Int. Conf. Neural Networks and Signal Processing, Nanjing, China, Dec. 14–17, 2003, pp. 1087–1090.

[17] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.

[18] J. A. Robinson, "Efficient general-purpose image compression with binary tree predictive coding," IEEE Trans. Image Process., vol. 6, no. 4, pp. 601–608, Apr. 1997.

[19] MPEG-2 Reference Software, version 1.2, MPEG Software Simulation Group [Online]. Available: http://www.mpeg.org/MSSG

[20] JM Software, JM 12.4, H.264/AVC Software Coordination [Online]. Available: http://iphome.hhi.de/suehring/tml/

[21] I.-M. Pao and M.-T. Sun, "Modeling DCT coefficients for fast video encoding," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 4, pp. 608–616, Jun. 1999.

Wenfei Jiang received the B.S. degree in telecommunication engineering in 2007 and the M.S. degree in communication and information systems in 2009 from the Huazhong University of Science and Technology, Wuhan, China.

His research interests include image processing and video coding.

Longin Jan Latecki (M'02–SM'07) received the Ph.D. degree in computer science from Hamburg University, Hamburg, Germany, in 1992.

He has published over 160 research papers and books. He is an Associate Professor of computer science at Temple University, Philadelphia, PA. His main research areas are shape representation and similarity, robot perception, and digital geometry and topology.

Dr. Latecki is the winner of the Pattern Recognition Society Award with A. Rosenfeld for the best article published in Pattern Recognition in 1998. He received the main annual award of the German Society for Pattern Recognition (DAGM), the 2000 Olympus Prize. He is an Editorial Board Member of Pattern Recognition.


Wenyu Liu (M'08) received the B.S. degree in computer science from Tsinghua University, Beijing, China, in 1986, and the M.S. and Ph.D. degrees in electronics and information engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1991 and 2001, respectively.

He is now a Professor and Associate Dean of the Department of Electronics and Information Engineering, HUST. His current research areas include multimedia information processing and computer vision.

Hui Liang received the B.S. degree in electronics and information engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2008. He is currently pursuing the M.S. degree in the Department of Electronics and Information Engineering, HUST.

His research interests include image compression and wireless communication.

Ken Gorman received the B.S.E.E. degree from Villanova University and the M.S. degree in computer science from Temple University, Philadelphia, PA, where he is currently pursuing the Ph.D. degree.

He is a Systems Staff Engineer for Honeywell International. His research interests include imaging, video streaming, and compression.