[ieee 2012 ieee 55th international midwest symposium on circuits and systems (mwscas) - boise, id,...

A Fast Fractional Motion Estimation Algorithm and Architecture for H.264/AVC Multiview Video Coding

Yuan-Teng Chang Information & Communications Research Laboratories

Industrial Technology Research Institute Hsinchu, Taiwan

Wen-Hao Chung Information & Communications Research Laboratories

Industrial Technology Research Institute Hsinchu, Taiwan

Abstract— This paper presents a fast fractional motion estimation (FME) algorithm and the associated VLSI architecture for H.264/AVC multiview video coding. The proposed FME automatically turns off mode P8x8 by exploiting the results of integer motion estimation and the similarity between views. In addition, the fraction-pel refinement of the integer motion or disparity vector of any partition may be skipped according to the difference between their rate-distortion costs. This algorithm reduces the FME computation time by about 50% with negligible PSNR degradation and bitrate increase. The resultant FME can process a macroblock within 612 clock cycles, enough to achieve real-time coding of stereoscopic HD1080p video sequences at an operating frequency of 300 MHz.

I. INTRODUCTION

Multiview Video Coding (MVC) has been developed as an extension of H.264/AVC by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group [1]. The Multiview High Profile supports an arbitrary number of views, and the Stereo High Profile is designed specifically for two-view stereoscopic video. MVC can provide a better compression ratio than simulcast coding because it uses not only motion estimation to reduce temporal redundancy but also disparity estimation to reduce inter-view redundancy [2].

The hardware realization of an H.264 MVC coder is usually divided into four pipelined stages, as shown in Fig. 1. The IME stage performs integer motion and disparity estimation by applying the full search algorithm to find suitable integer motion and disparity vectors (IMV and IDV, respectively). The FME stage further refines the integer motion and disparity vectors of each partition for all the macroblock (MB) modes to half-pel and quarter-pel accuracy. The 4x4, 8x8, and 16x16 intra predictions are performed in the third stage. In the final stage, one of the two entropy coders is chosen to code the side information as well as the residual data. Unlike single-view coding, both the IME and FME require extra computation time to perform the disparity estimation. Therefore, the IME and FME become the bottleneck for MVC.

The IME can be accelerated by choosing an adequate search range to reduce the number of search points. Shen et al. exploit the motion homogeneity analyzed from the previously coded views to change the search range [3]. Ding et al. perform motion estimation within a small area around the initial guess motion vectors [4].

Figure 1. Pipelined H.264 MVC coder

Several fast inter mode decision algorithms can be used to speed up the FME. A view-adaptive mode size decision is presented in [3]: the nine corresponding MBs in the neighboring view are used to determine the mode characteristics of the current MB, and only mode 16x16 is evaluated for MBs judged as simple mode. Zhu et al. adopt the median rate-distortion cost of the Intra16 and Intra4 modes in neighboring views to perform textural segmentation [5]. According to the result of the textural segmentation, an early skip mode decision and an inter8x8 reduction are introduced.

In this paper, a fast FME algorithm and the associated architecture are presented to fulfill the real-time requirement for coding stereoscopic HD1080p video sequences. This algorithm bypasses the FME of mode P8x8 for those MBs with low texture complexity and small motion displacement. When a base-view frame is coded, the average rate-distortion cost of mode P8x8 and the average edge gradient of mode Intra16x16 are calculated and serve as the basis for deciding the MB complexity. To further reduce computation time, this algorithm refines only one of the IMV and IDV at fraction-pel positions when mode P8x8 is turned on. The resultant FME can save up to 50% of the computation time with negligible PSNR degradation.

The rest of this paper is organized as follows. Section II describes the FME algorithm for H.264 MVC. Section III presents the proposed fast FME algorithm. Section IV presents the FME architecture. Experimental results and conclusions are given in Sections V and VI, respectively.

II. FRACTIONAL MOTION ESTIMATION FOR H.264 MVC

A. FME Algorithm in Reference Software

The FME algorithm in the reference software JMVC is described in Fig. 2 [6]. It estimates all the MB modes, from one 16x16 partition (mode 1) to four 8x8 partitions (mode P8x8), as shown in Fig. 3(a). Each 8x8 sub-macroblock can be further divided into one 8x8 partition (mode 4), two 4x8 partitions (mode 5), two 8x4 partitions (mode 6), or four 4x4 partitions (mode 7). For the IMV and IDV of each partition, the eight half-pel and eight quarter-pel positions around the best integer position are checked, as depicted in Fig. 3(b). The position with the minimum rate-distortion cost (rd_cost) is chosen as the best fractional motion or disparity vector (FMV or FDV), where rd_cost is calculated by equation (1). Then whichever of the FMV and FDV has the smaller rate-distortion cost is adopted as the best motion vector. After all the modes are estimated, the mode with the minimum mode_rd_cost is chosen as the final MB mode.

$$ \mathit{rd\_cost} = \lambda \times \mathrm{rate}(\mathrm{MVD}) + \mathrm{SATD} \tag{1} $$

λ: Lagrange parameter
rate(MVD): bitrate of the motion vector difference
SATD: sum of absolute transformed differences
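As an illustration of equation (1), the following minimal sketch computes the SATD of one 4x4 block with an unnormalized 4x4 Hadamard transform and combines it with the motion vector rate. The function names, the plain list-of-lists block format, and the absence of any scaling of the SATD are assumptions made for this example; reference encoders may scale or round the SATD differently.

```python
# Hypothetical helper names; blocks are 4x4 lists of integer pixel values.
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def hadamard4(block):
    """Return H4 * block * H4^T for a 4x4 block (no normalization)."""
    tmp = [[sum(H4[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    return [[sum(tmp[i][k] * H4[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(cur, ref):
    """Sum of absolute transformed differences between two 4x4 blocks."""
    residual = [[cur[i][j] - ref[i][j] for j in range(4)] for i in range(4)]
    coeffs = hadamard4(residual)
    return sum(abs(c) for row in coeffs for c in row)


def rd_cost(lam, mvd_bits, satd):
    """Equation (1): lambda * rate(MVD) + SATD."""
    return lam * mvd_bits + satd


if __name__ == "__main__":
    cur = [[1] * 4 for _ in range(4)]
    ref = [[0] * 4 for _ in range(4)]
    # A constant residual puts all its energy in the DC coefficient.
    print(rd_cost(lam=2, mvd_bits=3, satd=satd4x4(cur, ref)))
```

A constant residual is a convenient sanity check because only the DC coefficient of the transform is non-zero.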

for each mode
    for each partition
        find the best FMV around the best IMV
        find the best FDV around the best IDV
        part_rd_cost = min(FMV_rd_cost, FDV_rd_cost)
        mode_rd_cost = mode_rd_cost + part_rd_cost
choose the mode with minimum mode_rd_cost as best mode

Figure 2. The FME algorithm in JMVC
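The loop in Fig. 2 can be sketched in a few lines of Python. This is only an illustration of the control flow: the `refine` callback, the dictionary layout of `modes`, and the partition labels are hypothetical stand-ins for the actual fractional search, which is not reproduced here.

```python
def reference_fme(modes, refine):
    """Pick the mode with the smallest accumulated rate-distortion cost.

    modes:  {mode_name: [partition, ...]} (hypothetical layout)
    refine: callable (partition, "IMV" or "IDV") -> best fractional rd_cost
            found around that integer vector (stand-in for the real search)
    """
    best_mode, best_cost = None, float("inf")
    for mode, partitions in modes.items():
        mode_rd_cost = 0
        for part in partitions:
            fmv_rd_cost = refine(part, "IMV")  # refine the motion vector
            fdv_rd_cost = refine(part, "IDV")  # refine the disparity vector
            mode_rd_cost += min(fmv_rd_cost, fdv_rd_cost)
        if mode_rd_cost < best_cost:
            best_mode, best_cost = mode, mode_rd_cost
    return best_mode, best_cost


if __name__ == "__main__":
    # Toy costs, invented purely to exercise the loop.
    toy = {("p0", "IMV"): 100, ("p0", "IDV"): 90,
           ("p1", "IMV"): 40, ("p1", "IDV"): 50,
           ("p2", "IMV"): 45, ("p2", "IDV"): 30}
    modes = {"16x16": ["p0"], "16x8": ["p1", "p2"]}
    print(reference_fme(modes, lambda p, kind: toy[(p, kind)]))
```

Note that every partition is refined twice in this reference flow, once around the IMV and once around the IDV, which is the cost the fast algorithm later tries to avoid.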

Figure 3. (a) Macroblock and sub-macroblock modes, with partition sizes from 16x16 to 4x4. (b) Search pattern for finding the best FMV or FDV (legend: integer pixel, 1/2 pixel, 1/4 pixel)

Figure 4. Timing diagram of FME for base view and non-base view

B. Timing Analysis of Hardware Realization

When a hardware architecture similar to that in [7] is adopted together with the mode filtering algorithm in [8], the timing diagram of the FME is as illustrated in Fig. 4. Unlike [7], we double the throughput of the interpolation unit and the number of 4x4 Hadamard transform units to improve the FME throughput. The required clock cycles for each mode are therefore as follows:

mode 1: 118
mode 2: 145
mode 3: 116
mode P8x8: 163
skip mode: 56
MC (motion compensation): 73

Since only the IMV needs to be refined for a base-view frame, the FME takes 671 clock cycles. For a non-base-view frame, however, the FME takes 1213 cycles because both the IMV and the IDV refinements are performed. Thus the FME of the non-base-view frames becomes the bottleneck of the MVC encoder. The fast FME algorithm is therefore proposed to reduce the FME computation time for the non-base-view frames.
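The two totals can be reproduced from the per-mode cycle counts listed above. The sketch below assumes that the skip mode and motion compensation run once per MB while the four inter modes are repeated once per refined vector; this accounting is an inference from the numbers, not something the text states explicitly.

```python
# Per-mode cycle counts from the list above.
MODE_CYCLES = {"mode 1": 118, "mode 2": 145, "mode 3": 116, "mode P8x8": 163}
SKIP_CYCLES = 56
MC_CYCLES = 73


def fme_cycles(refined_vectors):
    """Cycles per MB when every mode is refined `refined_vectors` times.

    Assumption: skip mode and MC are counted once, regardless of how many
    vectors (IMV only, or IMV and IDV) are refined.
    """
    return refined_vectors * sum(MODE_CYCLES.values()) + SKIP_CYCLES + MC_CYCLES


if __name__ == "__main__":
    print("base view:    ", fme_cycles(1))  # IMV only
    print("non-base view:", fme_cycles(2))  # IMV and IDV
```

Under this accounting the base-view and non-base-view cases evaluate to 671 and 1213 cycles, matching the figures quoted in the text.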

III. FAST FME ALGORITHM FOR H.264 MVC

A. Proposed Algorithm

Fig. 5 presents the proposed fast FME algorithm. Analysis of the JMVC coding results shows that the probability of adopting mode P8x8 is highly correlated with the MB texture complexity, the degree of motion displacement, and the QP. For an MB with low complexity or small motion displacement, the probability of choosing mode P8x8 is quite low. Moreover, as the QP value increases, the probability of adopting mode P8x8 decreases. This algorithm employs the rate-distortion cost and the edge gradient as the criteria for judging the MB complexity: complex MBs yield larger rate-distortion costs and larger edge gradients.

Since the image content is similar between different views captured at the same time instant, the average rate-distortion cost of mode P8x8 (α) and the average edge gradient of mode Intra16x16 (β), calculated by equations (2) and (3) in the corresponding base-view frame, can be used as the thresholds for judging the complexity of the current MB. If the IME rate-distortion cost of mode P8x8 and the edge gradient of the current MB are greater than α and β respectively, the MB is classified as high complexity. In addition, the IME mode decision result is reused to assist the judgment of MB complexity and degree of motion displacement. Consequently, the FME of mode P8x8 is turned on only for those MBs judged to have high complexity or large motion displacement.

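Since the rate-distortion cost acts as the criterion for mode decision and the Lagrange parameter λ is derived from the QP value, this algorithm already takes the QP impact into account implicitly.

$$ \alpha = \frac{1}{N_{\mathrm{P8x8}}} \sum_{i=1}^{N_{\mathrm{P8x8}}} \mathit{P8x8\_rd\_cost}_i \tag{2} $$

N_P8x8: total number of MBs of mode P8x8

$$ \beta = \frac{1}{N_{\mathrm{I16}}} \sum_{i=1}^{N_{\mathrm{I16}}} \mathit{edge\_gradient}_i \tag{3} $$

N_I16: total number of MBs of mode Intra16x16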

Figure 6. Calculation of the edge gradient of an MB (f(x,y): pixel value at the position (x,y); k = 0, 1, 2, 3, 4)


if ((IME_best_mode == mode_P8x8) || (P8x8_rd_cost > α && edge_gradient > β))
    for each partition Pi in mode 16x16, 16x8, 8x16 and P8x8
        perform FME for Pi around either IMV or IDV with smaller rd_cost
else
    for each partition Pi in mode 16x16, 16x8, and 8x16
        if (|rd_cost(Pi_IMV) - rd_cost(Pi_IDV)| < γ)
            perform FME for Pi around both IMV and IDV
        else
            perform FME for Pi around either IMV or IDV with smaller rd_cost

Figure 5. Proposed fast FME algorithm
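The decision logic of Fig. 5 can be sketched as three small Python functions. The function names, the string labels for modes and vectors, and the tie-breaking rule (the IMV is preferred when both costs are equal) are assumptions made for this illustration; the thresholds are taken as plain averages over the corresponding base-view frame, following the description of α and β in the text.

```python
from statistics import mean


def base_view_thresholds(p8x8_rd_costs, intra16_edge_gradients):
    """alpha and beta: averages collected while coding the base-view frame."""
    return mean(p8x8_rd_costs), mean(intra16_edge_gradients)


def p8x8_enabled(ime_best_mode, p8x8_rd_cost, edge_gradient, alpha, beta):
    """First condition of Fig. 5: keep mode P8x8 only for complex MBs."""
    return ime_best_mode == "P8x8" or (p8x8_rd_cost > alpha and edge_gradient > beta)


def vectors_to_refine(imv_cost, idv_cost, gamma, p8x8_on):
    """Return which integer vectors of one partition get fractional refinement."""
    cheaper = "IMV" if imv_cost <= idv_cost else "IDV"  # tie-break is an assumption
    if p8x8_on:
        return [cheaper]            # P8x8 on: refine only the cheaper vector
    if abs(imv_cost - idv_cost) < gamma:
        return ["IMV", "IDV"]       # costs too close to call: refine both
    return [cheaper]


if __name__ == "__main__":
    alpha, beta = base_view_thresholds([100, 200, 300], [10, 30])
    on = p8x8_enabled("16x16", p8x8_rd_cost=250, edge_gradient=25, alpha=alpha, beta=beta)
    print(on, vectors_to_refine(100, 110, gamma=64, p8x8_on=on))
```

Keeping the per-partition choice in its own function mirrors the hardware split between the bypass decision and the refinement datapath, although that mapping is only suggestive.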

The procedure for calculating the MB edge gradient is described in Fig. 6. The simple method presented in [9] is adopted to calculate the edge gradient of a 4x4 block by adding together the average pixel gradients in four directions: 0°, 45°, 90°, and 135°. The edge gradient of a 16x16 MB is then the sum of the edge gradients of its sixteen 4x4 blocks.

To reduce the computation time, if mode P8x8 is turned on, only the IMV or the IDV, whichever has the smaller rate-distortion cost, is refined. Otherwise, if the difference between the IMV and IDV rate-distortion costs is smaller than γ, both the IMV and the IDV are refined, since the integer-pel costs are then too close to indicate which vector will be better after refinement; if the difference is γ or more, only the vector with the smaller cost is refined, as specified in Fig. 5. The tradeoff between better coding quality and lower computation time is controlled by adjusting the threshold γ. By experiment, it is appropriate to set γ to 4 for a 4x4 partition, 16 for an 8x8 partition, and 64 for a 16x16 partition.

This algorithm reduces the FME computation time for the non-base-view frames by bypassing mode P8x8 or by skipping the IMV or IDV refinement of a partition. The worst case of this algorithm takes 887 clock cycles, which occurs when mode P8x8 is turned off and both the IMV and the IDV of all the partitions are refined.
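The 887-cycle worst case follows from the per-mode cycle counts of Section II.B. The sketch below is a rough accounting under two assumptions that the text does not spell out: skip mode and motion compensation are counted once per MB, and all partitions of an MB take the same branch (refine both vectors or only one).

```python
# Per-mode cycle counts from Section II.B.
MODE_CYCLES = {"mode 1": 118, "mode 2": 145, "mode 3": 116, "mode P8x8": 163}
OVERHEAD = 56 + 73           # skip mode + motion compensation (assumed once per MB)
REFERENCE_NON_BASE = 1213    # non-base-view cycles without the fast algorithm


def fast_fme_cycles(p8x8_on, refine_both=False):
    """Cycles per non-base-view MB for the branches of the fast algorithm."""
    if p8x8_on:
        # All four modes, one refined vector per partition.
        return sum(MODE_CYCLES.values()) + OVERHEAD
    passes = 2 if refine_both else 1
    three_modes = sum(c for m, c in MODE_CYCLES.items() if m != "mode P8x8")
    return passes * three_modes + OVERHEAD


if __name__ == "__main__":
    for label, cycles in [("P8x8 on", fast_fme_cycles(True)),
                          ("P8x8 off, both vectors", fast_fme_cycles(False, True)),
                          ("P8x8 off, one vector", fast_fme_cycles(False, False))]:
        saving = 1 - cycles / REFERENCE_NON_BASE
        print(f"{label}: {cycles} cycles, {saving:.1%} fewer than the reference")
```

Real MBs mix the two refinement branches per partition, so measured averages fall between the one-vector and both-vector figures for the P8x8-off case.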

B. Analysis of Prediction Accuracy

Table I shows the prediction hit ratio relative to the reference software for five 640x480 multiview video sequences under different QPs. A prediction hit is defined as the case where the best mode and the motion vector of each partition of an MB estimated by our algorithm are the same as those of the reference software. The results show that our algorithm achieves a prediction accuracy between about 70% and 99.9%, increasing with the QP.

TABLE I. HIT RATIO OF MODES AND MOTION VECTORS COMPARED WITH THE REFERENCE SOFTWARE UNDER DIFFERENT QPS

8 views, 640x480, 120 frames, IPPP, IDR period 15, max. search range = 32, RDO off

QP   rena    ballroom  exit    flamenco2  race1
10   87.2%   72.7%     70.3%   83.8%      76.8%
16   93.8%   86.8%     83.6%   92.0%      87.2%
22   95.6%   93.7%     96.0%   93.2%      90.2%
28   97.7%   96.4%     99.1%   95.2%      94.8%
34   98.9%   98.0%     99.7%   97.4%      98.4%
40   99.4%   99.0%     99.9%   98.6%      98.5%

IV. HARDWARE ARCHITECTURE

Fig. 7 depicts the proposed hardware architecture. The interpolation unit interpolates the half-pel and quarter-pel reference pixels for an 8x4 block. The direction-based FME algorithm presented in our prior work [10] is adopted to choose four half-pel positions and four quarter-pel positions based on the predicted motion vector. As a result, only four 4x4 Hadamard transform units are needed per 4x4 block instead of nine. To double the throughput, the basic processing unit is an 8x4 block rather than a 4x4 block, which is achieved by employing four additional 4x4 Hadamard transform units. Each 4x4 Hadamard transform unit receives the corresponding residual data and calculates the SATD value. Each Acc unit accumulates the rate-distortion cost for a half-pel or a quarter-pel position. A tree comparator then selects the position with the minimum rate-distortion cost as the best motion vector. The bypass mode decision unit applies the proposed fast FME algorithm to decide whether to bypass mode P8x8 or to skip the IMV or IDV refinement. After all modes are estimated, the mode decision unit chooses the best mode and the best MV.

Figure 7. FME hardware architecture


TABLE II. PSNR, BIT RATE, AND REDUCED CYCLE COUNT COMPARISON WITH REFERENCE SOFTWARE FOR DIFFERENT SEQUENCES AND QPS

ΔPSNR in dB; Δbit: change in bit rate; Δcycles: change in clock cycles

QP  | rena                    | ballroom                | exit                    | flamenco2               | race1
    | ΔPSNR   Δbit    Δcycles | ΔPSNR   Δbit    Δcycles | ΔPSNR   Δbit    Δcycles | ΔPSNR   Δbit    Δcycles | ΔPSNR   Δbit    Δcycles
10  | -0.06   0.53%   -54.9%  | -0.04   0.15%   -52.8%  | -0.05   0.14%   -49.5%  | -0.05   0.63%   -51.3%  | -0.04   0.35%   -52.8%
16  | -0.08   0.68%   -56.5%  | -0.04   0.32%   -55.4%  | -0.05   0.32%   -54.0%  | -0.06   0.94%   -54.5%  | -0.04   0.58%   -54.9%
22  | -0.08   0.59%   -56.8%  | -0.05   0.44%   -56.4%  | -0.03   0.29%   -56.8%  | -0.10   1.05%   -55.3%  | -0.06   0.79%   -55.8%
28  | -0.05   0.46%   -57.4%  | -0.05   0.66%   -57.4%  | -0.02   0.71%   -57.9%  | -0.10   0.88%   -56.5%  | -0.05   1.18%   -57.2%
34  | -0.03   0.44%   -57.3%  | -0.06   0.85%   -58.2%  | -0.02   0.75%   -57.6%  | -0.10   0.72%   -57.9%  | -0.06   1.16%   -58.0%
40  | -0.01   0.56%   -57.5%  | -0.04   0.73%   -58.6%  | -0.04   0.92%   -58.6%  | -0.08   0.46%   -58.5%  | -0.4    1.15%   -58.3%

V. EXPERIMENTAL RESULTS

Table II compares the proposed FME algorithm with the reference software in terms of PSNR, bit rate, and clock cycles for five 640x480 multiview video sequences under different QPs, with a frame rate of 30 fps per view, an IPPP frame structure, IDR period 15, RDO off, and CABAC entropy coding. Eight views are used for each sequence. Each P frame in the base view has only one reference frame, for inter-frame prediction. Each P frame in a non-base view has two reference frames, one for inter-frame prediction and one for inter-view prediction. The simulation results in Table II show that the proposed FME reduces the clock cycles of a hardware realization by 49.5% to 58.6% with negligible PSNR degradation and bit rate increase. This is because the proposed fast FME algorithm can accurately bypass mode P8x8 and avoid redundant refinement of the IMV or IDV by referring to the IME results and the neighboring view information.

TABLE III. COMPARISON BETWEEN THE PROPOSED ARCHITECTURE AND THE RELATED WORKS

                    [11]      [12]     [10]     Proposed
ΔPSNR (dB)          0         -0.11    -0.13    -0.05
Δbit rate (%)       0         4.59%    1.20%    0.78%
Cycles / MB         1260      2004     1213     616
Gate count (NAND)   212,017   59,574   39,425   73,215

The FME has been implemented in Verilog HDL and synthesized with Design Compiler in TSMC 90 nm CMOS technology. The clock period and clock uncertainty are set to 3.0 ns and 0.3 ns, respectively. Our FME can process an MB within 616 clock cycles on average. That is, it can achieve real-time coding of stereoscopic 1080p video sequences at 30 fps per view when operating at a frequency of 300 MHz. Table III compares the PSNR, bit rate, cycles per MB, and gate count with three related prior works. The design in [11] directly implements the reference software algorithm and uses a three-engine architecture to increase the parallelism. Our FME achieves about twice the throughput of [10] and [11] and about three times that of [12]. Compared with [10], our FME requires four additional 4x4 Hadamard transform units because the basic processing unit is an 8x4 block instead of a 4x4 block.
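As a rough check of the real-time claim, the per-MB cycle budget can be derived from the clock frequency and the MB rate. The sketch below assumes two views at 30 fps each and 16x16 MBs; whether the 1080 picture rows are padded to 1088 (68 MB rows) is not stated in the paper, so both variants are shown, and the function name and parameters are hypothetical.

```python
import math


def cycles_per_mb_budget(clock_hz, width, height, fps, views, pad_to_mb=True):
    """Clock cycles available per macroblock for real-time coding.

    pad_to_mb=True rounds each dimension up to a multiple of 16 (an
    assumption); pad_to_mb=False simply divides the pixel count by 256.
    """
    if pad_to_mb:
        mbs_per_frame = math.ceil(width / 16) * math.ceil(height / 16)
    else:
        mbs_per_frame = width * height / 256
    return clock_hz / (mbs_per_frame * fps * views)


if __name__ == "__main__":
    for pad in (True, False):
        budget = cycles_per_mb_budget(300e6, 1920, 1080, fps=30, views=2, pad_to_mb=pad)
        print(f"pad_to_mb={pad}: {budget:.1f} cycles per MB")
```

The two variants give budgets of roughly 612.7 and 617.3 cycles per MB, which lie close to the 612-cycle and 616-cycle figures quoted in the paper; the 616-cycle average fits within the unpadded budget.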

VI. CONCLUSION

In this paper, we present a fast fractional motion estimation algorithm and the associated VLSI architecture for H.264/AVC MVC applications. Mode P8x8 is automatically turned off based on the MB texture complexity and the degree of motion displacement, both of which can be observed from the IME results and the neighboring view information. In addition, a simple rule for skipping the IMV or IDV refinement is presented. The resultant FME can achieve real-time coding of stereoscopic HD1080p video sequences when operating at a frequency of 300 MHz.

REFERENCES

[1] ITU-T, "Advanced Video Coding for Generic Audiovisual Services," ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), March 2010.

[2] P. Merkle et al., "Efficient Prediction Structures for Multiview Video Coding," IEEE Trans. on CSVT, vol. 17, no. 11, pp. 1461-1473, Nov. 2007.

[3] Liquan Shen et al., "View-Adaptive Motion Estimation and Disparity Estimation for Low Complexity Multiview Video Coding," IEEE Trans. on CSVT, vol. 20, no. 6, pp. 925-930, June 2010.

[4] Li-Fu Ding et al., "Content-Aware Prediction Algorithm With Inter-View Mode Decision for Multiview Video Coding," IEEE Trans. on Multimedia, vol. 10, no. 8, pp. 1553-1564, Dec. 2008.

[5] Wei Zhu et al., "Fast inter mode decision based on textural segmentation and correlations for multiview video coding," IEEE Trans. on Consumer Electronics, vol. 56, no. 3, pp. 1696-1704, Aug. 2010.

[6] Joint Video Team Reference Software JMVC 8.3.1.

[7] Tung-Chien Chen et al., "Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC," in Proc. of ICASSP, 2004.

[8] Chia-Chun Lin et al., "A Fast Algorithm and Its Architecture for Motion Estimation in MPEG-4 AVC/H.264 Video Coding," in Proc. of APCCAS, 2006.

[9] An-Chao Tsai et al., "A Simple and Robust Direction Detection Algorithm for Fast H.264 Intra Prediction," in Proc. of ICME, 2007.

[10] Yuan-Teng Chang et al., "A fast and low-cost fractional motion estimation for H.264/AVC HD1080p coding," in Proc. of APCCAS, 2010.

[11] C.L. Wu et al., "A high performance three-engine architecture for H.264/AVC fractional motion estimation," in Proc. of ICME, 2008.

[12] T.Y. Kuo et al., "SIFME: a single iteration fractional-pel motion estimation algorithm and architecture for HDTV sized H.264 video coding," in Proc. of ICASSP, 2007.
