
A Dynamic Search Range Algorithm for H.264/AVC Full-Search Motion Estimation

Yuan-Teng Chang and Wen-Hao Chung

Information & Communications Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan

[email protected]

Abstract— Motion estimation plays an important role in inter-frame prediction for video coding standards such as H.264/AVC, MPEG-2, MPEG-4, and VC-1. Its huge computational complexity, however, makes it difficult to achieve real-time coding for HDTV1080p. In this paper, we propose a dynamic search range algorithm which eliminates about 80% of the search points of the full search algorithm for H.264/AVC. In addition, we design the corresponding VLSI architecture for integer motion estimation. The proposed integer motion estimation can achieve real-time coding for 30fps HDTV1080p operating at 166 MHz.

Keywords— H.264/AVC, Integer Motion Estimation, Dynamic Search Range

I. INTRODUCTION

H.264/AVC is a video coding standard developed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group [1]. It has been widely adopted in various applications, such as Blu-ray Disc, IPTV, and HDTV broadcasting. H.264/AVC provides better image quality and compression ratio by adopting many new features, such as variable-block-size motion estimation, 1/4-pel fractional motion estimation, multiple reference frames, a de-blocking filter, and context-adaptive binary arithmetic coding.

To reduce temporal redundancy, motion estimation is used to find the best-matching macroblock (MB) among the reference frames for each MB of an inter frame. In H.264/AVC, there are four MB modes: 16x16, 16x8, 8x16, and 8x8. Each partition of MB mode 8x8 can be further coded in one of four sub-macroblock modes: 8x8, 8x4, 4x8, and 4x4. In total, 8160x(3+4x4) modes must be estimated per frame for HD1080p video sequences. This results in a huge computation time and makes it difficult to achieve real-time coding with a RISC or DSP implementation. Therefore, to achieve HD video coding, many hardware accelerators for integer motion estimation (IME) have been proposed [2-3].
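The mode count quoted above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: it assumes the HD1080p frame is coded as 1920x1088 luma samples (so that it divides evenly into 16x16 MBs), and the function names are hypothetical.

```python
# Illustrative arithmetic for the per-frame mode count (not the authors' code).
# Assumption: HD1080p is coded as 1920x1088 so that it splits into whole MBs.
MB_SIZE = 16
WIDTH, CODED_HEIGHT = 1920, 1088


def macroblocks_per_frame(width=WIDTH, height=CODED_HEIGHT, mb=MB_SIZE):
    """Number of 16x16 macroblocks in one frame."""
    return (width // mb) * (height // mb)


def modes_per_macroblock():
    """3 large modes plus 4 sub-modes for each of the four 8x8 partitions."""
    large_modes = 3        # 16x16, 16x8, 8x16
    partitions_8x8 = 4     # an MB in mode 8x8 has four 8x8 partitions
    sub_modes = 4          # 8x8, 8x4, 4x8, 4x4
    return large_modes + partitions_8x8 * sub_modes


mbs = macroblocks_per_frame()
total_modes = mbs * modes_per_macroblock()
print(mbs, total_modes)  # 8160 155040
```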

Chen adopts the full search (FS) algorithm and designs parallel SAD trees to estimate several search points simultaneously [2]. In addition, variable-block-size motion estimation (VBSME) is employed, which makes it possible to process all the MB modes in parallel. Chen’s design, however, only supports HD720p real-time video coding. Afterward, Liu proposes an IME architecture that achieves HD1080p real-time coding [3]. Several algorithm-level optimizations are provided: elimination of inter modes 4x4, 4x8, and 8x4, low-pass-filter-based 4:1 down-sampling, and coarse-to-fine search.

The FS algorithm is well suited to hardware realization because it provides regular search patterns and makes reference pixel data reuse easy. The IME checks all points in an assigned search range. The search range is usually set to 64 or more when an HD video is coded, and thus the IME needs to check at least (2*64+1)^2 = 16641 points. To avoid redundant search in FS, Minocha proposes a dynamic search range (DSR) algorithm [4]. It exploits the temporal correlation of motion vectors in successive frames to predict the search range. Afterward, Saponara takes both the temporal and spatial correlation of motion vectors into account to predict the search range [5]. However, neither DSR algorithm reduces the search points effectively.

In this paper, we propose a new DSR algorithm to reduce search points and embed it in our IME. It determines the search range according to the max. motion vector and average SAD of the previous frame, as well as the motion vectors and SAD of neighbouring blocks. The DSR algorithm brings the further advantage of reducing the internal and external memory bandwidth needed for fetching reference pixels. The resultant IME can achieve real-time coding for HD1080p video sequences operating at 166 MHz with a slight PSNR loss and bit-rate increase.

The rest of this paper is organized as follows. Section II presents the proposed dynamic search range algorithm. Section III presents the proposed integer motion estimation architecture. Implementation results and comparisons are shown in Section IV. Finally, the paper is concluded in Section V.

II. DYNAMIC SEARCH RANGE ALGORITHM

A. The Proposed Algorithm

The proposed DSR algorithm is described below:

Step 1. Calculate the maximum motion vector (max. MV_{k-1}) and the average sum of absolute differences per MB (avg. SAD_{k-1}) in the previous frame k-1.

    D  B  C
    A  E

Fig. 1 The neighbouring blocks of the current block E

978-1-4244-7456-1/10/$26.00 ©2010 IEEE

Step 2. Calculate the search range of each block in the current frame k by applying the following rule. Fig. 1 shows the neighbouring blocks of a current block E. It should be noted that block C is replaced with block D if block C is not available.

    if (block B and block C are not available)
        search range = max. MV_{k-1}
    else if (max. SAD > 1.75 * avg. SAD_{k-1})
        search range = max. search range
    else
        search range = min(2 * max. MV, max. search range)

where max. SAD = max(SAD_A, SAD_B, SAD_C) and max. MV = max(MVX_A, MVX_B, MVX_C, MVY_A, MVY_B, MVY_C).

The concept of this algorithm is that it exploits both the temporal and spatial correlation of motion vectors to predict the search range of a block. The initial search range of the current frame is set to the max. motion vector of the previous frame, since motion usually changes gradually, especially in low-motion scenes. A sudden large movement of certain objects will, however, cause a misprediction of the search range if only the temporal correlation is considered. Therefore, by taking the spatial correlation into account, the max. motion vector of the neighbouring blocks is also a factor in determining the search range. To avoid being trapped in a local minimum, we use the average SAD per MB in the previous frame as the threshold for the current frame. Once the max. SAD among the neighbouring blocks is greater than 1.75 times this threshold, the search range returns to the maximum.
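The rule of Step 2 can be sketched in software as follows. This is a minimal illustration rather than the authors' implementation: the `Block` type and function name are hypothetical, the motion vector components are compared by absolute value (the text does not state this explicitly), and a missing block A is simply skipped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    """Hypothetical record of a coded neighbouring block."""
    mvx: int   # horizontal motion vector component
    mvy: int   # vertical motion vector component
    sad: int   # SAD of the best match


def dynamic_search_range(a: Optional[Block], b: Optional[Block],
                         c: Optional[Block], d: Optional[Block],
                         max_mv_prev: int, avg_sad_prev: float,
                         max_sr: int = 32, sad_factor: float = 1.75) -> int:
    """Predict the search range of block E from neighbours A, B, C (D as fallback)."""
    if c is None:
        c = d  # block C is replaced with block D if C is not available
    if b is None and c is None:
        # No upper neighbours: fall back to the temporal prediction.
        return max_mv_prev
    # Assumption: unavailable neighbours (e.g. A at the left edge) are ignored.
    neighbours = [n for n in (a, b, c) if n is not None]
    max_sad = max(n.sad for n in neighbours)
    if max_sad > sad_factor * avg_sad_prev:
        return max_sr  # poor match nearby: return to the maximum range
    # Assumption: magnitudes of the MV components are compared.
    max_mv = max(max(abs(n.mvx), abs(n.mvy)) for n in neighbours)
    return min(2 * max_mv, max_sr)


# Example: small neighbouring motion and low SAD give a small search range.
print(dynamic_search_range(Block(1, -3, 500), Block(2, 1, 600), Block(0, 0, 400),
                           None, max_mv_prev=7, avg_sad_prev=1000))  # 6
```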

Fig. 2 (a) Three pipelined stages of motion estimation (b) Example of prefetching reference pixels (c) Timing diagram of the pipelined motion estimation

B. The Consideration of Hardware Realization

Although this algorithm is simple and well suited to hardware realization, some slight modifications are required. First, variable-block-size motion estimation (VBSME) [2] is generally employed in IME to achieve higher throughput because it can estimate all MB modes in parallel. For both VBSME and our DSR algorithm to coexist in the IME, we must decide the search range at the MB level instead of the block level. That is, all block modes in an MB use the same search range.

Second, motion estimation is divided into three separate pipelined stages to improve the throughput, as shown in Fig. 2 (a). Consequently, because of the data dependency, the correct motion vector of the left MB cannot be acquired at the IME stage until its FME stage is completed. To overcome this dilemma, the motion vector of the best mode of the left MB at the IME stage may be used instead. However, to reduce the influence of external memory latency, the design prefetches the reference pixels and determines the search range ahead of the IME stage, as shown in Fig. 2 (a). For instance, if the search range of MB0 is set to 32, the operations PREF0~2 fetch the reference pixels in the red dashed rectangles 0~2 before IME0 is performed, as shown in Fig. 2 (b)(c). Because there is a latency of three MBs between DSR and IME, it is impossible to acquire the SAD and motion vectors of the left MB. Therefore, we replace MB A with MB D in order to realize the DSR algorithm in hardware.

TABLE I THE PERFORMANCE OF OUR PROPOSED ALGORITHM COMPARED WITH THE FULL-SEARCH MOTION ESTIMATION

| QP | CIF Video Sequence | △PSNR (dB) | △Bit-rate (%) | Search Points / MB | Search Points Reduction (%) |
|----|--------------------|------------|---------------|--------------------|-----------------------------|
| 0  | akiyo     | -0.01 | 0.00  | 504  | 88.07 |
| 0  | container | 0.01  | 0.01  | 266  | 93.70 |
| 0  | foreman   | +0.02 | 0.01  | 883  | 79.10 |
| 0  | mobile    | -0.02 | -0.04 | 481  | 88.62 |
| 0  | stefan    | 0.00  | -0.15 | 1344 | 68.19 |
| 12 | akiyo     | 0     | 0     | 238  | 94.44 |
| 12 | container | -0.01 | 0.05  | 279  | 93.40 |
| 12 | foreman   | 0     | 0.05  | 906  | 78.56 |
| 12 | mobile    | 0.02  | -0.07 | 474  | 88.78 |
| 12 | stefan    | 0.01  | 0.21  | 1414 | 66.53 |
| 24 | akiyo     | 0     | 0     | 213  | 94.96 |
| 24 | container | -0.01 | 0.09  | 127  | 97.00 |
| 24 | foreman   | -0.01 | 0.29  | 750  | 82.25 |
| 24 | mobile    | 0     | 0.06  | 410  | 90.30 |
| 24 | stefan    | -0.01 | 0.13  | 2029 | 52.00 |
| 36 | akiyo     | 0     | -0.02 | 505  | 88.05 |
| 36 | container | 0     | 0.02  | 339  | 92.00 |
| 36 | foreman   | -0.05 | 1.36  | 652  | 84.57 |
| 36 | mobile    | 0.01  | -0.06 | 206  | 95.12 |
| 36 | stefan    | -0.02 | 0.59  | 1193 | 71.76 |
| 48 | akiyo     | 0     | 0.05  | 717  | 83.03 |
| 48 | container | -0.01 | 0.06  | 582  | 86.22 |
| 48 | foreman   | -0.17 | 2.00  | 933  | 77.92 |
| 48 | mobile    | -0.04 | -0.26 | 91   | 97.85 |
| 48 | stefan    | -0.08 | 0.83  | 1048 | 75.20 |

C. Performance Analysis

Table I shows the performance of the proposed DSR compared with the full search (FS) algorithm in JM15.1 [6]. The simulation uses five 30fps CIF sequences with IPPPPPPP (intra frame period 8), RDO off, maximum search range 32, no intra MB in inter frames, 1 reference frame, and CAVLC entropy coding under five different QPs: 0, 12, 24, 36, and 48. Among them, akiyo and container are low-motion sequences, whereas foreman, mobile, and stefan are high-motion sequences. The search points per MB are calculated by the equation (2*search range+1)^2. Thus, there are (2*32+1)^2 = 4225 search points per MB in total for the FS algorithm with search range 32.
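As a quick check of how the reduction column relates to the search-point formula, the following sketch recomputes the reduction percentage from an average number of search points per MB. The function names are illustrative, and the two sample values are taken from the QP=0 rows of Table I.

```python
def search_points(search_range: int) -> int:
    """Number of integer positions in a square window of the given search range."""
    return (2 * search_range + 1) ** 2


def reduction_percent(avg_points_per_mb: float, max_search_range: int) -> float:
    """Search-point reduction relative to full search, in percent."""
    full = search_points(max_search_range)
    return round(100.0 * (1.0 - avg_points_per_mb / full), 2)


print(search_points(32))           # 4225 points for full search, CIF
print(search_points(64))           # 16641 points for full search, HD1080p
print(reduction_percent(504, 32))  # akiyo, QP=0: 88.07
print(reduction_percent(1344, 32)) # stefan, QP=0: 68.19
```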

The proposed DSR algorithm is considerably more efficient than FS: it greatly reduces the search points with only a small PSNR decrease and bit-rate increase. In addition, as the number of search points decreases, DSR lowers the demand for external memory bandwidth and the power consumed by redundant search.

III. INTEGER MOTION ESTIMATION ARCHITECTURE

Fig. 3 depicts the proposed IME architecture, which is mainly composed of a dynamic search range generator, a reference pixel prefetch unit, two SAD trees, and three SRAM modules. The dynamic search range generator is responsible for generating a predicted search range for each MB according to the proposed algorithm. To avoid using extra memory to store SAD values, a status register stores, after the IME of each MB is completed, a one-bit flag indicating whether the min. SAD is greater than 1.75*avg. SAD_{k-1}. Since there are 120 MBs per row in HD1080p video sequences, the status register requires 120 bits.
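The storage idea behind the status register can be modelled in a few lines: one flag per MB column replaces a stored SAD value. The class below is a behavioural sketch only; its name and interface are hypothetical, and it does not model the pipeline timing of the actual hardware.

```python
class SadStatusRegister:
    """Behavioural model: one bit per MB column, set when min. SAD exceeds the threshold."""

    def __init__(self, mbs_per_row: int = 120):
        self.mbs_per_row = mbs_per_row
        self.bits = 0  # a Python int used as a 120-bit vector

    def update(self, mb_x: int, min_sad: int, avg_sad_prev: float,
               factor: float = 1.75) -> None:
        """Record the comparison result for the MB in column mb_x."""
        if not 0 <= mb_x < self.mbs_per_row:
            raise IndexError("MB column out of range")
        mask = 1 << mb_x
        if min_sad > factor * avg_sad_prev:
            self.bits |= mask
        else:
            self.bits &= ~mask

    def exceeded(self, mb_x: int) -> bool:
        """Return the stored flag of column mb_x (from the most recent update)."""
        return bool((self.bits >> mb_x) & 1)


reg = SadStatusRegister()
reg.update(5, min_sad=2000, avg_sad_prev=1000)  # 2000 > 1750, flag is set
print(reg.exceeded(5), reg.exceeded(4))         # True False
```

Storing only the comparison result keeps the cost at one bit per MB column instead of a full SAD word.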

Fig. 3 Architecture of integer motion estimation (IME)

The DSR generator calculates the search range according to the motion vectors and SAD status of neighbouring MBs, fetched from the MV SRAM and the status register respectively. Then, the prefetch unit loads the reference pixels from external memory based on the search range.

Simultaneously, the control unit commands the SAD trees to scan the allowed search area and calculate the rate-distortion cost by the Lagrangian equation:

    rate-distortion cost = λ*rate(mvd) + SAD

where λ is the Lagrange multiplier and rate(mvd) is the number of coded bits of the motion vector difference. While scanning the search area, the SAD trees calculate the rate-distortion cost of each point for all MB modes in parallel and keep the min. cost and its corresponding position, namely the motion vector. Finally, the mode decision determines the best mode, motion vectors, and SAD.
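The cost computation can be illustrated with a short software sketch. It assumes that rate(mvd) is the bit length of the signed Exp-Golomb code words used for motion vector differences with CAVLC; the paper does not spell out its rate model, and the function names and sample values of λ below are illustrative only.

```python
def se_golomb_bits(value: int) -> int:
    """Bit length of the signed Exp-Golomb code word for an integer value."""
    code_num = 2 * abs(value) - (1 if value > 0 else 0)
    # Length of an unsigned Exp-Golomb word is 2*floor(log2(code_num + 1)) + 1.
    return 2 * (code_num + 1).bit_length() - 1


def mv_rate(mv, pred_mv) -> int:
    """Assumed rate(mvd): bits for the x and y motion vector differences."""
    return se_golomb_bits(mv[0] - pred_mv[0]) + se_golomb_bits(mv[1] - pred_mv[1])


def rd_cost(sad: int, mv, pred_mv, lam: int) -> int:
    """rate-distortion cost = lambda * rate(mvd) + SAD."""
    return lam * mv_rate(mv, pred_mv) + sad


def best_candidate(candidates, pred_mv, lam):
    """Keep the (mv, sad) pair with the minimum cost, as the SAD trees do per mode."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[0], pred_mv, lam))


candidates = [((0, 0), 100), ((2, -1), 95)]
print(best_candidate(candidates, pred_mv=(0, 0), lam=4))  # ((0, 0), 100)
```

In this example the second candidate has the lower SAD, but its longer motion vector difference makes the first candidate cheaper overall.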

Fig. 4 Different search point sub-sampling based on search location (panels (a)-(d); sub-sampling ratios 1:1, 4:1, and 16:1 for SR=16, SR=32, and SR=64 respectively)

For HD1080p coding, it is appropriate to set the search range to at least 64. The IME must then check 16641 points in the search area in the worst case; that is, it consumes 16641 clock cycles if it employs one SAD tree and checks every point. To meet the real-time requirement, we adopt multi-resolution motion estimation, as shown in Fig. 4, and use two SAD trees. The IME performs a fine search that checks every point within search range 16, as shown in Fig. 4 (b). Outside search range 16, it performs a coarse search, which adopts 4:1 (Fig. 4 (c)) and 16:1 (Fig. 4 (d)) search point sub-sampling in the areas of search range 32 and 64 respectively. Therefore, it requires at most 1325 clock cycles in the case of max. search range 64. To reduce hardware cost, the SAD tree adopts 2:1 pixel decimation and truncates the two least significant bits of the luma pixels, as shown in Fig. 5.
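One way to arrive at the 1325-cycle figure is sketched below. It assumes that each sub-sampling ratio applies only to the ring between two consecutive search ranges and that each SAD tree evaluates one search point per clock cycle; both are readings of the text rather than stated facts, and the names are illustrative.

```python
import math

# (search range, sub-sampling ratio) from fine to coarse, as described for Fig. 4
LEVELS = [(16, 1), (32, 4), (64, 16)]


def multires_search_points(max_sr: int) -> int:
    """Search points checked when the window is limited to max_sr."""
    total = 0
    inner_area = 0
    for sr, ratio in LEVELS:
        if sr > max_sr:
            break
        area = (2 * sr + 1) ** 2
        # Only the ring outside the previous level is sub-sampled at this ratio.
        total += (area - inner_area) // ratio
        inner_area = area
    return total


def cycles_per_mb(max_sr: int, sad_trees: int = 2) -> int:
    """Assumption: one search point per SAD tree per clock cycle."""
    return math.ceil(multires_search_points(max_sr) / sad_trees)


print(multires_search_points(64))  # 1089 + 784 + 776 = 2649
print(cycles_per_mb(64))           # 1325
```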

Fig. 5 2:1 pixel decimation and 2-bit pixel truncation
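The effect of these two simplifications on the SAD computation can be sketched as follows. The exact decimation pattern of Fig. 5 is not recoverable from the text, so a checkerboard pattern is assumed here purely for illustration; the function names are hypothetical.

```python
def sad_full(cur, ref):
    """Reference SAD over all pixels at full precision."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


def sad_decimated_truncated(cur, ref, drop_bits=2):
    """SAD over half of the pixels, with the low-order bits dropped.

    Assumption: the 2:1 decimation keeps a checkerboard of pixel positions.
    """
    total = 0
    for y, (cur_row, ref_row) in enumerate(zip(cur, ref)):
        for x in range(y % 2, len(cur_row), 2):
            total += abs((cur_row[x] >> drop_bits) - (ref_row[x] >> drop_bits))
    return total


cur = [[8] * 4 for _ in range(4)]
ref = [[0] * 4 for _ in range(4)]
print(sad_full(cur, ref))                 # 16 pixels * 8 = 128
print(sad_decimated_truncated(cur, ref))  # 8 pixels * (8 >> 2) = 16
```

Halving the pixel count halves the number of absolute-difference units, and dropping two bits narrows every adder in the tree, which is where the hardware saving comes from.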


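TABLE II COMPARISON OF PSNR, BIT-RATE, AND AVERAGE SEARCH POINTS PER MACROBLOCK (QP=27)

| Sequence | Method | PSNR (dB) | Bit-rate (kb) | Search Points / MB | Search Points Reduction (%) |
|----------|--------|-----------|---------------|--------------------|-----------------------------|
| container (CIF)    | FS   | 37.22 | 602.6  | 4225  | -     |
| container (CIF)    | [4]  | 37.23 | 603.2  | 2043  | 51.64 |
| container (CIF)    | [5]  | 37.23 | 603.3  | 1747  | 58.65 |
| container (CIF)    | ours | 37.22 | 603.5  | 134   | 96.83 |
| mobile (CIF)       | FS   | 36.07 | 3325.7 | 4225  | -     |
| mobile (CIF)       | [4]  | 36.07 | 3326.4 | 1849  | 56.24 |
| mobile (CIF)       | [5]  | 36.07 | 3326.7 | 2098  | 50.34 |
| mobile (CIF)       | ours | 36.07 | 3325.6 | 384   | 90.91 |
| pedestrian (1080p) | FS   | 40.05 | 8969.2 | 16641 | -     |
| pedestrian (1080p) | [4]  | 40.04 | 9101.1 | 16641 | 0     |
| pedestrian (1080p) | [5]  | 40.04 | 9101.1 | 16641 | 0     |
| pedestrian (1080p) | ours | 40.04 | 9112.0 | 3249  | 80.48 |
| station2 (1080p)   | FS   | 39.26 | 5642.3 | 16641 | -     |
| station2 (1080p)   | [4]  | 39.25 | 5671.4 | 16589 | 0.31  |
| station2 (1080p)   | [5]  | 39.25 | 5671.9 | 16538 | 0.62  |
| station2 (1080p)   | ours | 39.25 | 5718.9 | 729   | 95.62 |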

IV. EXPERIMENTAL RESULTS

Table II compares our integer motion estimation with three other methods in terms of PSNR, bit-rate, and search points per MB. The simulation uses four 30fps video sequences, namely the two CIF sequences container and mobile and the two HD1080p sequences pedestrian and station2, with IPPPPPPP (intra frame period 8), RDO off, QP 27, no intra MB in inter frames, 1 reference frame, and CAVLC entropy coding. The max. search range is set to 32 for CIF sequences and 64 for HD1080p sequences. Our algorithm combines the proposed dynamic search range algorithm, 2:1 pixel decimation, and 2-bit pixel truncation. The motion estimation methods in [4] and [5] also adopt 2:1 pixel decimation and 2-bit pixel truncation. The thresholds TH1 and TH2 described in [5] are set to 1024 and 2048 respectively.

Compared with the full search algorithm, our algorithm achieves roughly an 80% to 97% reduction of search points per MB in Table II with only a small PSNR loss and bit-rate increase. Minocha’s method [4] is not effective at reducing search points because it ignores the spatial correlation of motion vectors. Although the method in [5] takes both temporal and spatial correlation into account, its search range is set to the maximum once it finds a high-motion object in a frame. In addition, it is difficult to determine appropriate values for the thresholds TH1 and TH2.

TABLE III COMPARISONS WITH OTHER INTEGER MOTION ESTIMATION ARCHITECTURES

| Design          | [2]        | [3]         | Proposed    |
|-----------------|------------|-------------|-------------|
| technology      | 180 nm     | 180 nm      | 180 nm      |
| cycles / MB     | 1000       | 817         | 677         |
| clock freq.     | 108 MHz    | 200 MHz     | 166 MHz     |
| max. resolution | 720p 30fps | 1080p 30fps | 1080p 30fps |
| gate count      | 305 k      | 486 k       | 117 k       |
| internal SRAM   | 13.71 KB   | 40 KB       | 20 KB       |

The proposed IME has been implemented in Verilog HDL and synthesized with TSMC 180 nm CMOS technology operating at 166 MHz. Table III shows the synthesis results compared with the two related works. The design in [2] processes an MB within 1000 cycles but only supports HD720p video coding. To achieve HD1080p coding, the design in [3] uses thirty-two sets of SAD trees to improve the throughput, at a high hardware cost. By applying the DSR algorithm and multi-resolution motion estimation, our IME employs just two sets of SAD trees and processes an MB within 677 cycles. Therefore, our IME not only has a lower hardware cost but also achieves real-time coding for HDTV1080p operating at 166 MHz.
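The real-time claim can be checked against a simple cycle budget: the clock frequency divided by the number of MBs that must be processed per second. The sketch below assumes the frame height is rounded up to a whole number of MB rows (1088 lines for 1080p); the function name is illustrative.

```python
import math


def cycle_budget_per_mb(clock_hz: float, width: int, height: int,
                        fps: int, mb: int = 16) -> float:
    """Clock cycles available per macroblock for real-time coding."""
    # Assumption: partial MB rows/columns are rounded up to whole macroblocks.
    mbs_per_frame = math.ceil(width / mb) * math.ceil(height / mb)
    return clock_hz / (mbs_per_frame * fps)


budget_1080p = cycle_budget_per_mb(166e6, 1920, 1080, 30)
print(int(budget_1080p))     # 678 cycles available per MB at 166 MHz
print(677 <= budget_1080p)   # True: 677 cycles per MB fits the budget

budget_720p = cycle_budget_per_mb(108e6, 1280, 720, 30)
print(budget_720p)           # 1000.0, matching the 1000 cycles / MB of [2]
```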

V. CONCLUSION

In this paper, we present a new dynamic search range algorithm which takes both the temporal and spatial correlation of motion vectors into consideration. We also discuss the design issues that arise when implementing it in hardware. Furthermore, our IME adopts several optimization methods, including multi-resolution motion estimation, 2:1 pixel decimation, and 2-bit pixel truncation, to improve the throughput and save hardware cost. The resultant IME can process an MB within 677 cycles, which is sufficient for real-time coding of HDTV1080p when operating at 166 MHz.

REFERENCES

[1] ITU-T, "Advanced Video Coding for Generic Audiovisual Services," ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG4-AVC), March 2005.

[2] T.C. Chen, S.Y. Chien, Y.W. Huang, C.H. Tsai, C.Y. Chen, T.W. Chen, and L.G. Chen, "Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder," IEEE Trans. on CSVT, vol. 16, no. 6, pp. 673-688, June 2006.

[3] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Ishiwata, M. Nakagawa, and S. Goto, "HDTV1080p H.264/AVC encoder chip design and performance analysis," IEEE Journal of Solid-State Circuits, vol. 44, no. 2, pp. 594-608, Feb. 2009.

[4] J. Minocha and N.R. Shanbhag, "A low power data-adaptive motion estimation algorithm," in Proc. of MMSP, 1999.

[5] S. Saponara and L. Fanucci, "Data-adaptive motion estimation algorithm and VLSI architecture design for low-power video systems," Proc. IEE Computers and Digital Techniques, vol. 151, no. 1, pp. 51-59, 2004.

[6] Joint Video Team Reference Software JM 15.1.
