
Multimedia Tools and Applications, 13, 165–176, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Video Coding for Mobile Handheld Conferencing

JOLON FAICHNEY [email protected]
GONZALEZ [email protected]
School of Information Technology, Griffith University, Australia

Abstract. Since the time of Dick Tracy cartoons, the idea of wearable videoconferencing devices has been a desirable but unachievable goal. Current technology is now on the verge of making this dream a reality. Existing and future low bit-rate video coding standards such as H.263 and MPEG-4 may require specialized hardware for real-time handheld video conferencing [3]. This paper evaluates the performance of software-based videoconferencing on widely available mobile handheld devices using a range of both traditional and original video coding schemes. Our results demonstrate that wearable videoconferencing can no longer be relegated to the realm of science fiction, as practical first generation devices are feasible today.

Keywords: video compression, mobile video systems, handheld computing

1. Introduction

Over the last few years the popularity of cabled video communication has been steadily increasing, although it has mainly been limited to the corporate market because of expensive equipment. Lower cost equipment and higher bandwidth public networks are bringing video conferencing to the consumer. Simultaneously, the popularity of mobile telephony has also been rapidly increasing and wireless personal communications are now ubiquitous. The Mobile Data Initiative [9] reports that the number of subscribers to mobile services increased steadily from 1.28 million in 1995 to 3.26 million in 1997. The current generation of personal communication systems are lightweight, low-power, and low-bandwidth handheld devices. They are no longer restricted to speech: users are able to connect to data services via handheld or even palm computers.

Research into wireless handheld conferencing is trending towards using specialized hardware (or DSP coprocessors) to decode MPEG-4 video streams and to produce data rates greater than 9600 bps [3]. This paper investigates the development of wireless personal video conferencing using widely available handheld computing and wireless communications technologies. These include Windows CE and Newton handheld computers and GSM mobile phone communications. Previous work by Streit and Hanzo [14] investigated VQ, DCT, and quadtree coding techniques using motion compensation for low bit-rate video. In this paper we explore run length coding, quadtree compression, DCT coding, and wavelet compression. We also present the results of each coder running on handheld personal computers.

2. Video coding

Video coding for handheld computers presents many challenges as these devices have limited processing power and typically have no floating point unit (FPU). In this study we investigated six different video coding techniques that exhibit low computational complexity and would be suitable for handheld video conferencing. To reduce computational complexity, these techniques use conditional replenishment of the temporal prediction data rather than motion compensation [10]. In addition to limited CPU power, video coders designed for widely available wireless environments must be able to produce low bit-rate streams. The coders investigated were designed to produce a bit-rate suitable for existing 9600 bps GSM networks. Some existing coding schemes such as run length coding and quadtree coding were investigated because they exhibit very low computational complexity. At the other extreme, wavelet coders were investigated because they are still relatively new and provide a wide variety of implementation options, such as choice of wavelet basis function and coding technique. A DCT coder was also implemented for comparison with existing coding standards such as MPEG and H.263. The design of each of these coding schemes is discussed in the following subsections.
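Conditional replenishment can be sketched as follows: a block is re-encoded and transmitted only when it differs sufficiently from the co-located block in the previous frame. This is an illustrative sketch only; the block size, threshold, and function names are ours, not the paper's implementation.

```python
def block_mad(a, b, y, x, size):
    """Mean absolute difference between co-located size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(a[y + dy][x + dx] - b[y + dy][x + dx])
    return total / (size * size)

def changed_blocks(prev, curr, size=8, threshold=10):
    """Conditional replenishment: list the blocks whose mean absolute
    difference from the previous frame exceeds the threshold; only these
    blocks need to be re-encoded and transmitted."""
    h, w = len(curr), len(curr[0])
    return [(y, x)
            for y in range(0, h, size)
            for x in range(0, w, size)
            if block_mad(prev, curr, y, x, size) > threshold]
```

Blocks below the threshold are simply repeated from the previous frame, which is far cheaper than searching for motion vectors.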

2.1. Run-length coding

Our primary concern when selecting video coders was coding time. The run-length coder represents one of the lowest complexity coders. Run-length coding is commonly used in combination with other forms of compression, such as DCT-based coding, and is rarely used by itself. However, we wanted to test the performance of a low complexity run-length coder on low resolution video conferencing style images. Run-length coding represents a run of pixels with the same value as the length of the run and the pixel value. Natural images contain a high amount of inter-pixel variation, making run-length coding unsuitable without first quantizing the image. Our system uses run-length coding for both I- and P-frames. The P-frames in our implementation represent the difference between frames rather than motion vectors. If the background is stationary, many of the difference frame pixels will be zero, improving the performance of run-length coding. To further reduce the sensitivity to inter-pixel variations, the input image is quantized to 5 bits per pixel, providing small bitstreams with very little reduction in quality. To reduce the system's sensitivity to small temporal variations, a threshold is applied to the difference coefficients. If the magnitude of a difference value is not larger than the threshold then it is replaced with a difference value of zero. Run-length coefficients are encoded using Elias γ variable-length codes. Variable-length codes provide minimal overhead when the runs or difference values are small. The difference threshold is varied dynamically to achieve a target bit-rate of 9600 bps.
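The two ingredients of this coder can be sketched as follows. The exact bitstream layout (how runs, values, and signs are interleaved) is not specified above, so this sketch shows only the standard Elias γ code for positive integers and the run extraction itself:

```python
def elias_gamma(n):
    """Elias gamma code for a positive integer n: floor(log2 n) zero bits
    followed by the binary representation of n, so small values get short codes."""
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def run_lengths(pixels):
    """Collapse a scanline into (run_length, value) pairs."""
    runs = []
    run, prev = 1, pixels[0]
    for p in pixels[1:]:
        if p == prev:
            run += 1
        else:
            runs.append((run, prev))
            run, prev = 1, p
    runs.append((run, prev))
    return runs
```

A zero-heavy difference scanline such as [0, 0, 0, 0, 7, 7] collapses to [(4, 0), (2, 7)], and the small run lengths then receive short γ codewords.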

2.2. Quadtree coding

Run-length coding exhibits two major problems. Firstly, it only reduces one-dimensional spatial redundancy and secondly, it is sensitive to variations between pixels. Quadtree coding can overcome these problems by representing the image as two-dimensional blocks of homogeneous color and performing the decomposition based on the variation within an entire block rather than between individual pixels. Quadtree coding also has the advantage that the decoding phase is no more complex than run-length coding.

The quadtree coding algorithm recursively partitions an image into quadrants [12]. Bit-rate control is obtained by adjusting the partitioning threshold based on the mean-squared error of each quadrant. This effectively determines what level of image detail is discarded. Quadtree coding generally uses a fixed threshold throughout all levels of the decomposition. This approach suffers from the problem that an image with small localized variations (such as a difference image) will not be broken up into smaller blocks because the overall variation in the image is small. The result is a bit-rate that is very low for most frames and very high for a few frames when the variance in the image has exceeded the threshold. This problem is minimized by inverse scaling the variance threshold with the block size. For example, at the highest level the partitioning threshold is small so that small details can be detected, while at lower levels the partitioning threshold is high, reducing the likelihood of blocks being decomposed and hence reducing the bits used. The partitioning threshold is adjusted dynamically to achieve a constant bit-rate of 9600 bps.
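The inverse-scaled threshold can be sketched as follows. The linear scaling rule and the names are illustrative assumptions, not the paper's exact parameters; the point is that the root block splits on small variance while small blocks require proportionally more variance to split further.

```python
def variance(img, y, x, size):
    """Variance of the size x size block with top-left corner (y, x)."""
    n = size * size
    vals = [img[y + dy][x + dx] for dy in range(size) for dx in range(size)]
    mean = sum(vals) / n
    return sum((v - mean) ** 2 for v in vals) / n

def quadtree(img, y, x, size, base_threshold, top_size):
    """Recursive quadtree partitioning with a threshold that grows as blocks
    shrink (inverse scaling with block size): a small localized change still
    splits the top-level block, but recursion stops before every tiny
    variation is decomposed.  Leaves are (y, x, size, mean) tuples."""
    threshold = base_threshold * (top_size / size)  # illustrative scaling rule
    if size == 1 or variance(img, y, x, size) <= threshold:
        n = size * size
        mean = sum(img[y + dy][x + dx] for dy in range(size) for dx in range(size)) / n
        return (y, x, size, round(mean))  # leaf: homogeneous block
    half = size // 2
    return [quadtree(img, y + dy, x + dx, half, base_threshold, top_size)
            for dy in (0, half) for dx in (0, half)]
```

On a difference image that is zero except for one small region, only the quadrants containing that region are decomposed; the rest are emitted as single flat leaves.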

2.3. DCT-based coding

Quadtree coding treats blocks as areas of homogeneous color, ignoring redundancies in spatial frequency. DCT (discrete cosine transform)-based coding schemes transform an image into the frequency domain before coding. Our implementation is similar to other popular DCT coding techniques such as MPEG, H.263, and JPEG. The image is decomposed into blocks of 8 × 8 pixels which are transformed using a fast integer DCT. The coefficients of the DCT represent the two-dimensional frequency content of the block. The DC coefficient represents the lowest frequency content whilst the AC coefficients represent the higher frequencies. Frequency-based compression is achieved by reducing the fidelity of the high-frequency components. A quantization table is used that generally quantizes the high frequency coefficients the most and the low frequency coefficients the least. The values of the quantization table are determined by the quality level. The quality level is adjusted dynamically to achieve the target bit-rate of 9600 bps. The result is a block of coefficients which are largely zero at the high frequency end. Blocks can be efficiently coded by run-length coding the zigzag scanned coefficients. Most coding schemes use lookup tables of Huffman or arithmetic codes to store the coefficients and their run-lengths. Our implementation uses Elias γ variable-length codes, avoiding the use of long lookup tables.
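The zigzag scan order referred to above can be generated as follows (this is the standard JPEG-style anti-diagonal ordering; the generator itself is our own sketch):

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of the classic zigzag scan,
    which walks anti-diagonals so low-frequency coefficients come first."""
    order = []
    for s in range(2 * n - 1):
        diag = [(y, s - y) for y in range(n) if 0 <= s - y < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order
```

Reading the quantized coefficients in this order groups the mostly-zero high-frequency coefficients into one long trailing run, which is exactly what the run-length stage exploits.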

Existing DCT coding schemes are designed for compressing I-frames rather than difference frames. Difference frames may consist of large areas with very small values. The quantization stage may produce blocks that consist completely of zeros. However, existing schemes, which aren't designed for difference images, do not take advantage of the fact that many blocks may consist of zero coefficients, especially when coding for very low bit-rates. For example, to store a zero-filled block at least one bit is required to store the sign of the DC coefficient, at least another is required to store the value of the DC coefficient, and finally one more bit is required for the end of block marker. A QCIF difference image which contains no change between frames will therefore require 3 bits for each of its 396 blocks, or a total of 1188 bits per frame. At a frame rate of 10 fps this idle overhead alone amounts to 11,880 bps, so the target bit rate of 9600 bps would be exceeded without any change in video content.

To reduce the overhead of zero-filled blocks, inter-block run-length coding is performed. One bit is used to indicate a zero-filled block whilst at least one more is used to indicate the number of zero-filled blocks. If there is only one zero block then only 2 bits need to be stored, which is less than the 3 bits needed in conventional schemes.
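The savings can be checked with the frame arithmetic from the text. The Elias γ length formula is standard; the assumption that the zero-block run count is γ-coded is ours:

```python
def elias_gamma_len(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    return 2 * n.bit_length() - 1

# Conventional per-block cost for an all-zero QCIF difference frame:
# sign bit + DC value bit + end-of-block bit = 3 bits per 8x8 block.
blocks = (176 // 8) * (144 // 8)           # 22 * 18 = 396 blocks
conventional_bits = 3 * blocks             # 1188 bits per frame
conventional_bps = conventional_bits * 10  # 11880 bps at 10 fps, over the 9600 budget

# Inter-block run-length coding: one flag bit plus a gamma-coded run count
# covers the whole frame when nothing has changed.
rle_bits = 1 + elias_gamma_len(blocks)     # 18 bits per frame
```

A fully static frame thus drops from 1188 bits to 18 bits, turning the idle overhead from 11,880 bps into a negligible 180 bps.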

2.4. Floating point Daubechies 9/7-tap filter wavelet

Wavelet compression has gained popularity in recent years for image coding because of its ability to represent multiple resolutions and the emergence of embedded coding schemes. Embedded coding techniques have the advantage that they can be stopped at any point to produce an exact bit-rate. Embedded streams are well suited for video conferencing applications because an exact frame rate can be determined based on the bit-rate. The coefficients from a wavelet transform can be efficiently coded using a technique pioneered by Shapiro [13] called embedded zerotree wavelet (EZW) coding. Said and Pearlman [11] refined the EZW algorithm so that vector quantization was no longer required, producing a technique called set partitioning in hierarchical trees (SPIHT). Wavelets have mainly been used in image coding but more recent research has looked at using them for very low bit-rate video [4].

A popular wavelet transform which provides good image compression is the Daubechies 9/7-tap filter [1]. The Daubechies 9/7-tap filter has better performance than simpler wavelet functions because its basis functions provide more compact support. That is, more energy is compacted into the lower bands, so that the remaining components contain very low energy levels. The major problem with this wavelet is that it is more complex and hence computationally intensive. Existing implementations use floating point arithmetic, which is not supported by most handheld computers. Our implementation performs a reflection at the edges of the image so that the taps may extend past the boundaries of the image. The wavelet transform is performed to 5 levels and the coefficients from the 9/7-tap filter are coded using the SPIHT coder.

2.5. Fixed point Daubechies 9/7-tap filter wavelet

Initial experiments showed that the floating point Daubechies wavelet coder provided good compression ratios whilst maintaining spatial detail. However, it was very slow because handheld computers implement floating point operations in software libraries. To speed up decode times the coder was also implemented using 16 bit fixed point arithmetic.

2.6. Simple wavelet coding using SPIHT

Even the fixed point implementation of the Daubechies filter exhibited a large amount of complexity due to the large number of taps in the 9/7-tap filter. Wavelet transforms were investigated that use fewer taps. The S+P transform was initially investigated because it can be implemented using shifts and additions. The S+P transform consists of the S transform and a prediction stage which reduces redundancies between the high frequency subbands. The S transform consists of the following analysis filter:

L[n] = ⌊(X[2n] + X[2n+1])/2⌋ (1)

H[n] = X[2n] − X[2n+1] (2)

and synthesis filter:

X[2n] = L[n] + ⌈H[n]/2⌉ (3)

X[2n+1] = L[n] − ⌊H[n]/2⌋ (4)
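In integer arithmetic Eqs. (1)–(4) reconstruct each pixel pair exactly; a minimal sketch, where Python's floor division `//` matches the floor brackets:

```python
def s_analysis(a, b):
    """S transform analysis, Eqs. (1)-(2): L is the truncated mean of the
    pixel pair (a, b) and H is their difference."""
    return (a + b) // 2, a - b

def s_synthesis(L, H):
    """S transform synthesis, Eqs. (3)-(4); -(-H // 2) computes ceil(H/2)."""
    return L + -(-H // 2), L - H // 2
```

Despite the truncation in L, the pair (L, H) loses no information: the bit dropped from the mean is recoverable from the parity of H, which is why the round trip is lossless.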

The prediction is based on three neighboring low frequency coefficients and a high frequency coefficient. The problem with this prediction technique is that a small error in one high frequency coefficient can propagate to all other coefficients in the same row or column. For still image compression these errors are small, but for video the errors accumulate over successive frames. Removing the prediction stage alleviates these errors, but results in too much energy in the high frequency subbands. So the S transform was modified so that more energy is placed in the low frequency subband. The simple wavelet consists of the following analysis filter:

L[n] = X[2n] + X[2n+1] (5)

H[n] = X[2n] − X[2n+1] (6)

and synthesis filter:

X[2n] = ⌊(L[n] + H[n])/2⌋ (7)

X[2n+1] = ⌊(L[n] − H[n])/2⌋ (8)

The simple wavelet transform is performed to 5 levels and coded using SPIHT coding.
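The modified filter of Eqs. (5)–(8) also reconstructs exactly, at the cost of one extra bit of dynamic range in L (it stores the sum of the pair rather than the truncated mean); a sketch under the same conventions:

```python
def simple_analysis(a, b):
    """Simple wavelet analysis, Eqs. (5)-(6): L is the sum of the pair,
    placing more energy in the low band than the S transform's mean."""
    return a + b, a - b

def simple_synthesis(L, H):
    """Simple wavelet synthesis, Eqs. (7)-(8): since L + H = 2a and
    L - H = 2b exactly, the floor divisions recover a and b."""
    return (L + H) // 2, (L - H) // 2
```

Because L + H and L − H are always even, no information is discarded by the floors, and the round trip is lossless for any integer inputs.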

3. System architecture

A framework based on one presented by Furht [8] was designed to support both video and audio data whilst maintaining platform portability. Figure 1 shows the framework used.

The components within the grey box are platform independent and deal with source and channel coding of video and audio. The components outside the grey box deal with platform dependent functions such as capturing and playing back audio and visual data. The portable architecture has been implemented on a variety of both handheld and desktop computing architectures including Newton, Windows CE, Macintosh, and Windows NT. Our existing implementation includes audio encoding and decoding and video decoding on handheld computer platforms. Currently, video transmission is not possible from handheld computers due to the lack of video capture hardware. Our current implementation uses a desktop computer to capture and transmit video data to multiple handheld computers.

Figure 1. Video conferencing framework.

Wireless communication channels exhibit characteristics which can introduce errors into the data stream [7]. A video conferencing system must be able to reliably detect and recover from these errors. Error detection is usually achieved by introducing redundancy into the data stream; however, at 9600 bps there is very little room for redundancy. The packet format provides a synchronization code at the beginning of each packet that allows the decoder to detect errors and recover from them by skipping to the next synchronization block. If a corrupt frame is detected the decoder can send a request to the encoder for an I-frame, rather than waiting for the next I-frame to appear. Because motion compensation is not used there is very little need for periodic transmission of I-frames. Demand transmission of I-frames on channel errors, combined with interframe difference coding, reduces the transmission of unnecessary I-frames and hence the overall bit-rate. Each video packet includes a cyclic 4-bit code indicating the current frame number. This code is used to provide flow control when protocols such as TCP/IP are not available and can also be used for error checking. Further details of the protocol format can be found in [6].
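As an illustration of how a 4-bit cyclic frame number can support flow control, modular arithmetic recovers the gap between consecutive packets as long as fewer than 16 frames are lost in a row. This helper is a hypothetical sketch, not the protocol format from [6]:

```python
def frames_skipped(prev, curr):
    """Number of frames missed between two 4-bit cyclic frame numbers;
    0 means curr immediately follows prev.  Valid while fewer than 16
    consecutive frames are lost, since the counter wraps modulo 16."""
    return (curr - prev - 1) % 16
```

For example, receiving frame number 1 directly after frame number 14 implies frames 15 and 0 were dropped, which the decoder can treat as an error condition.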

The design of a handheld video conferencing system is also largely impacted by the battery life of handheld computers. Hardware components are chosen that consume small amounts of power and can be put into low power modes when not in use [5]. Software, in addition to hardware, must also be designed to reduce the power consumption of the device. The operating system and applications must sleep hardware subsystems that are not in use [15]. Also, applications must return control regularly to the operating system so that this may be achieved [2]. The impact of such methods was tested on the Newton MessagePad 2100. Normal use of the PDA results in a battery life of 24 hours. However, when the device is continuously processing and using other hardware subsystems such as serial communications, the battery life is reduced to 4 hours (figure 2).

Figure 2. Battery life of Newton MessagePad under different CPU loads.

A number of audio coders were investigated for integration with our system. An ADPCM coder was chosen because of its good mix of low computational complexity and low bit-rates. The G.723 coder was selected, which produces a bit-rate of 24 kbps at an 8 kHz sampling rate by coding 3 bits per sample.

For our evaluations a number of handheld devices were used, including Newton MessagePad and Windows CE platforms. These platforms use a variety of CPU architectures including StrongARM, MIPS, and SH3. These devices were equipped with Proxim RangeLAN2 7400 PC cards for wireless networking.

4. Experiments and results

Handheld video conferencing systems require the selection of a video coder that is low in computational complexity and performs well at low bit-rates. The six previously mentioned video coders were implemented and evaluated based on decoding time, bit-rate stability, and quality. Decoding times were recorded using the Casio E-100 Windows CE Palm-sized PC.

The video coders were tested with four ITU-T test video sequences: Claire, Susie, Miss America, and Football. The first three video sequences are head and shoulders shots which are representative of video found in a video conferencing system. The final video sequence, Football, allows us to see how the video coders perform when there are large amounts of motion.

The input images were converted to 8-bit greyscale with dimensions of 176 by 144 pixels (QCIF). The video coders were all configured to produce a video stream at a constant data-rate of 9600 bps, which is the GSM bandwidth. A frame rate of 10 fps was selected, which is considered acceptable for video conferencing. Quality parameters for each of the non-embedded video coders were dynamically adjusted to achieve the target bit-rate of 960 bits per frame.

Initially, the video coders were evaluated based on their computational complexity. Figure 3 shows the decoding times for each of the video coders. To be able to decode at 10 fps, enough time must be available for the handheld computer to decode frames, encode frames, encode audio, decode audio, and handle IO processing. Therefore, the decoding time for each frame must be much less than 100 ms. The wavelet coders were not able to decode frames in less than 100 ms on the Casio E-100, which currently represents one of the fastest palm-sized computers. However, the fixed point Daubechies wavelet implementation provided a dramatic improvement in decoding time over the floating point version, whilst the simple wavelet provided even shorter decoding times. The remaining RLE, Quadtree, and DCT coding schemes all exhibited similar coding complexity.

Figure 3. Mean decoding times.

The coders were also evaluated based on subjective quality (figure 4), as it has been shown that objective measures such as PSNR do not perform well at low bit-rates [4]. The DCT and Daubechies coders were able to produce acceptable quality frames. Little difference in quality was found between the floating point and fixed point implementations of the Daubechies coder.

Figure 4. Cumulative error due to coding induced distortion. (a) RLE, (b) Quadtree, (c) DCT, (d) Simple Wavelet, (e) floating point Daubechies 9/7 tap filter, (f) fixed point Daubechies 9/7 tap filter.

Figure 5. Mean objective quality.

In addition to subjective quality measures, objective quality measures were also recorded. The objective quality measure selected was the widely used PSNR (peak signal-to-noise ratio). The mean PSNR for each coder across all test sequences was calculated and evaluated (figure 5). The objective quality results reflect the subjective quality results.

Finally, the bit-rate stability of each of the coders was investigated. A data stream with a large amount of fluctuation between frame sizes will result in a fluctuating frame rate and may require the user to wait whilst a large frame is being transmitted. A stable bit-rate is achieved by directly or indirectly adjusting the quality of the encoded frames. This may not always be desirable. For instance, a large amount of motion may require more bits, and hence the quality is reduced to achieve a stable bit-rate. However, a large amount of motion may indicate an important event, and hence spatial detail should be provided rather than temporal.

To examine the bit-rate stability, the standard deviation of frame sizes in bits was calculated and averaged across all of the video sequences (figure 6). The embedded coders provide a completely stable bit-rate because they are stopped when the desired size is reached. The run length and Quadtree coding schemes produce a large amount of variation in frame sizes, even up to twice the desired frame size, whilst the DCT coder produces the most stable bit-rate of the non-embedded coders.

Figure 6. Bit-rate stability.

5. Observations and discussion

The most important measure was computational complexity. The coders had to be able to decode a frame in less than 100 ms so that a frame rate of 10 fps could be achieved. Of the coders that decoded a frame in less than 100 ms, the DCT coder ranked the best in terms of subjective and objective quality.

The DCT coder has the advantage of fast implementations. The combined wavelet transform and SPIHT coders are yet to receive the same level of optimization as DCT coding schemes. In the objective quality measures the Daubechies wavelet coders performed marginally better than the DCT coder whilst providing a perfectly stable bit-rate through embedded coding, along with other advantages such as multiresolution representations. Further work could be done in optimizing these schemes.

Run-length coding and Quadtree coding provided very fast decoding times but very poor quality frames. This is due to the fact that they both use 0-th order representations, representing intervals and blocks as flat colors. Better quality may be achievable by representing the frequencies of each block with variable block-sized DCT coefficients.

The DCT coder running on the Casio E-100 can achieve a frame rate of 17 fps, indicating that approximately 50% of the time is spent on other tasks such as displaying video. Handheld computers often have limited video bus bandwidths, imposing an upper limit on decoding frame rates.

Recently, the DCT coder has been modified to support color and has been tested on the Hewlett Packard 620LX, 420LX, and Casio E-100 color Windows CE devices. Figure 7 shows the decoding times for each device. YUV images are processed using a 4:1:1 format, which imposes an overhead of 60% on CPU load. The graph also provides an indication of the rapid improvement in handheld computer technology in one year, exceeding Moore's Law.

An ADPCM audio coder was incorporated into the cross platform framework. Figure 8 shows decode and encode times for the Newton implementation (162 MHz StrongARM).

Figure 7. Color decoding on various Windows CE devices.

Figure 8. Audio processing times on the Newton using the G.723 codec.

Each decode and encode time represents 1 second of audio, or 11,025 samples. Times were recorded over a period of 40 seconds. Both encoding and decoding show little variance in coding times. The average decoding time was 115 ms whereas the average encoding time was 129 ms. The time occupied coding audio is therefore about 244 ms every second, or about 24.4% of the available time. In addition, DCT video decoding consumes approximately 30% of the available time, with video display consuming another 30%. This leaves only 15% to perform all other functions of the video conferencing system. Further research is required to reduce the complexity of a software-only, wireless, handheld video conferencing system.

6. Conclusion

We have developed a mobile videoconferencing system using current generation handheld computers. Our system makes use of G.723 audio compression and non-standard video coding algorithms. In developing our system we experimented with a total of six different video coders to investigate their quality and decoding performance at GSM bit-rates. Our experiments show that useable mobile handheld video conferencing is achievable using current technology. We have developed DCT and wavelet-based coders that produce acceptable quality pictures for video conferencing. We are currently testing other mobile handheld platforms and working on improving the performance of our low complexity video coders. We currently have a prototype wireless videophone running over our wireless LAN. The next generation of personal computing devices should enable the development of Dick Tracy style video watches.

References

1. M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Transactions on Image Processing, Vol. 1, pp. 205–220, 1992.

2. Apple Computer, "Newton programmer's reference," 1996.

3. M. Budagavi, W.R. Heinzelman, J. Webb, and R. Talluri, "Wireless MPEG-4 video communication on DSP chips," IEEE Signal Processing Magazine, Vol. 17, No. 1, pp. 36–53, 2000.

4. K. Cinkler, "Very low bit-rate wavelet video coding," IEEE Journal on Selected Areas in Communications, Vol. 16, pp. 4–11, 1998.

5. M. Culbert, "Low power hardware for a high performance PDA," in Proceedings of the IEEE Computer Conference, pp. 144–147, 1994.

6. J. Faichney, "Mobile, handheld video conferencing," Honours dissertation, School of Information Technology, Griffith University, Australia, 1998.

7. M. Flack, Cellular Communications for Data Transmission, Blackwell: Manchester, England, 1990.

8. B. Furht, S.W. Smoliar, and H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer: Norwell, Massachusetts, ch. 5, 1995.

9. Mobile Data Initiative, "GSM world markets," Technical report, 1998. http://www.gsmdata.com/worlmark.htm

10. T. Ishiguro and K. Iinuma, "Television bandwidth compression transmission by motion-compensated interframe coding," IEEE Communications Magazine, Vol. 20, pp. 24–30, 1982.

11. A. Said and W.A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 6, pp. 243–250, 1996.

12. H. Samet, "Hierarchical representations of collections of small rectangles," ACM Computing Surveys, Vol. 20, pp. 297–309, 1988.

13. J.M. Shapiro, "An embedded wavelet hierarchical image coder," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, pp. 657–660, 1992.

14. J. Streit and L. Hanzo, "Fixed-rate video codecs for mobile radio systems," European Transactions on Telecommunications, Vol. 9, No. 6, pp. 551–572, 1997.

15. R. Welland, G. Seitz, L. Wang, L. Dyer, T. Harrington, and D. Culbert, "The Newton operating system," in Proceedings of the IEEE Computer Conference, pp. 148–155, 1994.