
    7B-5s

    A Fast VLSI Architecture for Full-Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264

    Minho Kim, Ingu Hwang, Soo-Ik Chae
    School of Electrical Engineering, Seoul National University, Seoul 151-742, Korea
    Tel: +82-2-880-5457  Fax: +82-2-888-1691
    E-mail: {mhkim,ingu,chae}@sdgroup.snu.ac.kr

    Abstract- We describe a fast VLSI architecture for full-search motion estimation for the blocks with 7 different sizes in MPEG-4 AVC/H.264. The proposed variable block size motion estimation (VBSME) architecture consists of a 16x16 PE array, an adder tree and comparators to find all 41 motion vectors and their minimum SADs for the blocks of 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4. It employs a 2-D datapath, and its control of the search area data is simple and regular. The proposed VBSME can achieve 100% PE utilization by employing a preload register and a search data buffer inside each PE, and it allows real-time processing of 4CIF (704x576) video at 15 fps at 100 MHz for a search range of [-32, +31].

    I. INTRODUCTION

    Nowadays, many international video compression standards such as ITU-T H.261, H.263, MPEG-1, -2, and -4 adopt motion estimation to reduce temporal redundancy between a current frame and its reference frames. The emerging MPEG-4 AVC/ITU-T H.264 standard supports motion estimation for blocks of 7 different sizes (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4) to improve coding performance, but its computational complexity becomes substantially higher. Thus, a fast architecture that can support motion estimation for all 7 block sizes is essential for high-end real-time applications.

    Many full-search motion estimation architectures have been proposed so far, but only a few of them support motion estimation for variable block sizes [1-4]. The work in [1] and [2] describes 1-D array architectures, and that in [3] is about a 16x16 PE array for the MPEG-4 standard, which supports ME only for blocks of 16x16, 8x8 and 4x4. The work in [4] describes a 2-D array architecture with 1-D partial data reuse and 1-D data broadcasting. Note that we limit our discussion to the full-search ME algorithm in this paper.

    A 1-D array architecture is simple, easy to control, and occupies less area than a 2-D array architecture, but it searches only one row or column of the block at a time. It is therefore slower than a 2-D array architecture, which computes the sum of absolute differences (SAD) of a search point at every cycle. Consequently, a 2-D array architecture is more suitable for high-end real-time video processing.

    In the full-search 2-D array architecture, a processing element (PE) array computes the SADs between pixels of the current frame and pixels of one of its reference frames at every cycle, so the pixel data should be changed every cycle. According to the data flows, the 2-D array architectures can be divided into 3 classes: (1) the current frame pixel data move while the reference frame pixel data are fixed, (2) the current frame pixel data are fixed while the reference frame pixel data change, and (3) both the current and the reference frame pixel data change.

    In the VBSME architectures, the SADs of the 4x4 blocks are first computed on a 16x16 macroblock, and the SADs of the larger blocks are calculated by summing up the SADs of the 4x4 blocks. For a 2-D array VBSME architecture, therefore, class (1) is not suitable because the positions of the partitions change if the current frame pixel data move every cycle, which makes it difficult to sum up the SADs of the smaller blocks into those of the larger blocks. Class (2) is also not suitable because it is hard to satisfy both the high PE utilization and the simple datapath requirements at the same time when the search area data move both horizontally and vertically. In this paper we propose a fast architecture of class (3) because it can achieve both high PE utilization and a simple datapath.
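As a point of reference for what the PE array evaluates, full-search block matching can be sketched in software. The following is only an illustrative model, not the hardware described in this paper; the function names `sad` and `full_search`, the frame layout (lists of rows) and the tie-breaking rule are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block whose
    top-left pixel is (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, lo, hi):
    """Test every displacement with dx and dy in [lo, hi] and return
    (best_sad, (dx, dy)). Ties keep the first candidate found (assumption)."""
    best = None
    for dy in range(lo, hi + 1):
        for dx in range(lo, hi + 1):
            s = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or s < best[0]:
                best = (s, (dx, dy))
    return best


# Tiny demo: ref holds distinct values and cur is a shifted copy of ref,
# so the true displacement of the block at (6, 6) is (dx, dy) = (-2, 1).
ref = [[x + 16 * y for x in range(16)] for y in range(16)]
cur = [[ref[min(y + 1, 15)][max(x - 2, 0)] for x in range(16)] for y in range(16)]
best_sad, best_mv = full_search(cur, ref, bx=6, by=6, n=4, lo=-2, hi=1)
print(best_sad, best_mv)
```

The two nested displacement loops are what a 2-D array removes from the time axis for the pixel sums: each search point costs one cycle instead of one cycle per row or column.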

    II. PROPOSED ARCHITECTURE

    Fig. 1 shows the block diagram of the proposed architecture, which consists of 4 basic blocks. The processing element array computes sixteen 4x4 SADs of a 16x16 macroblock. The adder comparator block sums up the 4x4 SADs to form the SADs for the 7 different block sizes and finds the minimum distortions and the corresponding motion vectors.
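The role of the adder comparator block can be illustrated with a short sketch that merges sixteen 4x4 SADs into the 41 SADs of one search point and keeps the running minimum for each block. This is a software illustration under stated assumptions, not the adder tree of the paper: the names `merge_sads` and `update_best`, the "width x height" reading of the block size labels, and the ordering of the partitions within each list are all hypothetical.

```python
def merge_sads(s):
    """s[j][i] is the 4x4 SAD of the sub-block in sub-block row j, column i
    of a 16x16 macroblock. Returns {size label: list of SADs}; labels are
    read as width x height (assumption)."""
    out = {}
    out["4x4"] = [s[j][i] for j in range(4) for i in range(4)]
    out["8x4"] = [s[j][i] + s[j][i + 1] for j in range(4) for i in (0, 2)]
    out["4x8"] = [s[j][i] + s[j + 1][i] for j in (0, 2) for i in range(4)]
    out["8x8"] = [
        s[j][i] + s[j][i + 1] + s[j + 1][i] + s[j + 1][i + 1]
        for j in (0, 2)
        for i in (0, 2)
    ]
    tl, tr, bl, br = out["8x8"]           # top-left, top-right, bottom-left, bottom-right
    out["16x8"] = [tl + tr, bl + br]      # upper and lower halves
    out["8x16"] = [tl + bl, tr + br]      # left and right halves
    out["16x16"] = [tl + tr + bl + br]
    return out


def update_best(best, sads, mv):
    """Comparator step: for each of the 41 blocks keep the smallest SAD seen
    so far together with the motion vector that produced it."""
    for size, values in sads.items():
        for k, v in enumerate(values):
            if (size, k) not in best or v < best[(size, k)][0]:
                best[(size, k)] = (v, mv)
    return best


grid = [[4 * j + i for i in range(4)] for j in range(4)]   # made-up 4x4 SADs
sads = merge_sads(grid)
print(sum(len(v) for v in sads.values()))   # number of SADs per search point
```

Counting the partitions gives 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41, which matches the 41 motion vectors mentioned in the abstract.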

    The search area SRAM contains the reference frame pixel data within a given search range to reduce I/O memory

    Fig. 1. Block diagram of the VBSME architecture

    0-7803-8736-8/05/$20.00 ©2005 IEEE. ASP-DAC 2005


    bandwidth. The ME control block generates the addresses of the search area SRAMs and the control signals to the other blocks.

    The dataflows of the search area data and the current block data are shown in Fig. 2(a) and 2(b), respectively. In order to operate all the 16x16 PEs every cycle, 16 pixels from the search area must be supplied to the PEs at every cycle. We decided to shift the search area data only in the horizontal direction (right to left) and to supply the 16 search area pixels from the search area SRAM to the rightmost column of the 16x16 PEs. In the vertical direction, the current block data move down and wrap around.

    For a search range of [-16, +15], the data sequence in the PE array is presented in TABLE I, where C(x, y) is the pixel data in the current block and R(x, y) is the pixel data in the search area. During clocks 0 to 31, the y coordinates of the search area pixels are fixed and only the x coordinates change. When the computations for one row are finished, the current block data shift down by one row position with wrap-around, and the initial search area data are loaded from the search data buffer (SDB) into the reference block register (RBR). The PE array then starts to search the second row of the search area from clock 32 without a stall.

    A PE consists of an absolute difference computing unit, a current block register (CBR), an RBR, a preload register (PR), an SDB and two multiplexers, as shown in Fig. 3(a). At every clock, the absolute difference between the CBR and the RBR is computed. The SDB stores the initial search area data. If the current macroblock is region 4 in Fig. 3(b), the shaded macroblocks 0, 3 and 6 are the initial search area. Each SDB stores one pixel from each shaded macroblock.
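The data sequence described for the search range [-16, +15] can be modelled with a few lines of arithmetic. The sketch below is an assumption-laden illustration rather than the control logic of the paper: the function name `pe_operands` is hypothetical, PE row 0 is taken to be the top row, and the wrap-around is written as a modulo-16 rotation. It appears to agree with entries of TABLE I such as C(0,1)-R(-16,16) at the start of the last search row.

```python
SEARCH_LO = -16        # search range [-16, +15], as in the text
POINTS_PER_ROW = 32    # horizontal search positions per search row
N = 16                 # 16x16 PE array and 16x16 macroblock


def pe_operands(clock, pe_row, pe_col):
    """Return ((cx, cy), (rx, ry)): the current block pixel C(cx, cy) and the
    search area pixel R(rx, ry) that meet in PE (pe_row, pe_col) at `clock`."""
    search_row = clock // POINTS_PER_ROW       # search rows already completed
    dx = SEARCH_LO + clock % POINTS_PER_ROW    # horizontal displacement
    dy = SEARCH_LO + search_row                # vertical displacement
    # The current block has moved down one row (with wrap-around) per search row.
    cy = (pe_row - search_row) % N
    cx = pe_col
    return (cx, cy), (cx + dx, cy + dy)


# First clock of the last search row (31 * 32 = 992), top-left PE.
print(pe_operands(992, 0, 0))
```

At any clock all 256 PEs work on the same displacement (dx, dy), so their absolute differences add up to the SAD of exactly one search point; the vertical rotation only changes which PE row holds which row of the current block.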

    Fig. 2. (a) Dataflow of the search area data (b) Dataflow of the current block data

    Clock   Data in the PE array
    ...
    992     C(0,1)-R(-16,16)   C(1,1)-R(-15,16)   ...   C(15,1)-R(-1,16)
            C(0,2)-R(-16,17)   C(1,2)-R(-15,17)   ...   C(15,2)-R(-1,17)
            ...
            C(0,0)-R(-16,15)   C(1,0)-R(-15,15)   ...   C(15,0)-R(-1,15)
    993     C(0,1)-R(-15,16)   C(1,1)-R(-14,16)   ...   C(15,1)-R(0,16)
            C(0,2)-R(-15,17)   C(1,2)-R(-14,17)   ...   C(15,2)-R(0,17)
            ...
            C(0,0)-R(-15,15)   C(1,0)-R(-14,15)   ...   C(15,0)-R(0,15)
    ...
    1023    C(0,1)-R(15,16)    C(1,1)-R(16,16)    ...   C(15,1)-R(30,16)
            C(0,2)-R(15,17)    C(1,2)-R(16,17)    ...   C(15,2)-R(30,17)
            ...
            C(0,0)-R(15,15)    C(1,0)-R(16,15)    ...   C(15,0)-R(30,15)

    The search area SRAM stores the data in the remaining regions 1, 2, 4, 5, 7 and 8. When starting the computation for a new row, the multiplexer in the PE selects the initial data from the SDB so that we can compute the SADs without a stall. Otherwise, the data from Win are selected. The regions 1, 4 and 7, which will be the initial search area for the next macroblock, are stored in the SDB as soon as the computation for the regions 0, 3 and 6 is finished.

    In addition, a preload register is put inside each PE to achieve macroblock pipelining [5]. When the current block is region 4, the data of the next current block, which is region 5, are transferred to the preload register in each PE. They are loaded into the CBR of all PEs immediately before the computation for the next macroblock starts.

    As the current block data move down in the PE array, the positions of the 4x4 sub-blocks also move in the proposed architecture. To compute the 4x4 SADs correctly, we add a selector inside each 4x16 PE array to match the results of the 4x1 SADs with the adder inputs. Four 4x16 PE arrays are connected to form the 16x16 PE array.

    While computing the second row of the search range, all the rows except the first row in the PE array are provided with the same search area data as when the first row of the search range was computed. The first row is supplied with new search area data that lie 16 rows below the data previously supplied. Likewise, whenever the row position in the search area changes, we just change one address among the 16 on-chip SRAM addresses, so that the new address points to the row 16 rows below. Therefore, each on-chip SRAM only has to store every 16th row of the search area data. That is, the k-th (k = 0, 1, 2, ..., 15) SRAM contains-

    Fig. 3. (a) The PE structure (b) The regions in which the search area data are stored (when the search range is [-16, +15])

    TABLE I. DATA SEQUENCE OF THE CURRENT BLOCK DATA AND SEARCH AREA DATA (WHEN SEARCH RANGE IS [-16, +15])


    Fig. 4. The architecture of the 4x16 PE array (labels: CB Data; 4x4 SAD0, 4x4 SAD1, 4x4 SAD2, 4x4 SAD3)

    Algorithm: Full Search
    # of PE: 16x16 (2-D array)
    Search Range

    Fig. 5. Data tr...


    TABLE III. COMPARISON OF VBSME ARCHITECTURES

    Architecture       [1]                     [3]                     [4]                     ours
    Process            0.13 um                 0.5 um                  0.35 um                 0.18 um
    Max. freq.         100 MHz                 100 MHz                 66.67 MHz               100 MHz
    Block size         7 kinds of block size   16x16, ... block size   7 kinds of block size   7 kinds of block size
    Gate count         1DBk, 106k, 154k
    On-chip memory     96k bits, 24k bits, 60k bits
    # of PE            16x16, 16x16, 16x16
    Search range       64x64, 48x32
    Throughput (blocks/sec)

    IV. CONCLUSION

    This paper presents a new fast VLSI architecture for VBSME in MPEG-4 AVC/H.264. It is based on a 2-D array architecture and computes the SADs for blocks of all 7 different sizes. The search area data move only horizontally and the current block data move only vertically, which simplifies the dataflow. It achieves 100% PE utilization and macroblock pipelining by employing 16 on-chip SRAMs and search data buffers inside each PE. For a search range of [-32, +31], our implementation allows variable block size motion estimation of 4CIF (704x576) video at 15 fps at 100 MHz.
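The 4CIF figure can be checked with simple arithmetic, assuming that one search point is evaluated per cycle (100% PE utilization) and that no extra cycles are spent between search rows or macroblocks. The helper name `required_clock_hz` is hypothetical, and any start-up or overhead cycles not described in this paper are ignored.

```python
def required_clock_hz(width, height, fps, search_lo, search_hi, mb=16):
    """Clock rate needed when every cycle evaluates one search point of one
    16x16 macroblock (idealized: no stall or start-up cycles)."""
    mbs_per_frame = (width // mb) * (height // mb)
    points_per_axis = search_hi - search_lo + 1
    cycles_per_mb = points_per_axis ** 2
    return mbs_per_frame * fps * cycles_per_mb


f = required_clock_hz(704, 576, 15, -32, 31)
print(f / 1e6, "MHz")
```

With 44 x 36 = 1584 macroblocks per frame, 15 frames per second and 64 x 64 = 4096 search points per macroblock, the requirement is 97,320,960 cycles per second, which is just under the stated 100 MHz.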

    REFERENCES

    [1] Swee Yeow Yap, John V. McCanny, "A VLSI architecture for advanced video coding motion estimation," Proc. IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP'03), June 24-26, 2003.

    [2] Cao Wei, Mao Zhi Gang, "A novel SAD computing hardware architecture for variable-size block motion estimation and its implementation with FPGA," Proc. 5th International Conference on ASIC, Oct. 21-24, 2003.

    [3] P. M. Kuhn, A. Weisgerber, R. Poppenwimmer, and W. Stechele, "A flexible VLSI architecture for variable block size segment matching with luminance correction," IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP'97), Zurich, 1997.

    [4] Yu-Wen Huang, Tu-Chih Wang, Bing-Yu Hsieh, and Liang-Gee Chen, "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," Proc. IEEE International Symposium on Circuits and Systems (ISCAS 2003), Bangkok, Thailand, May 2003.

    [5] Jen-Chieh Tuan, Tian-Sheuan Chang, "On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 1, Jan. 2002.