
Evaluation of Parallelization Strategies for the Emerging HEVC Standard

Mauricio Alvarez Mesa (1), Chi Ching Chi (2), Thomas Schierl (3) and Ben Juurlink (2)

(1) Universitat Politècnica de Catalunya, Barcelona, Spain
(2) Technische Universität Berlin, Berlin, Germany

(3) Fraunhofer HHI - Heinrich Hertz Institute, Berlin, Germany

1 Introduction

• Parallel architectures: multicore, manycore, GPUs, etc.

• What is HEVC? High Efficiency Video Coding

– New standardization initiative by ISO/MPEG and ITU-T/VCEG

– Undertaken by the Joint Collaborative Team on Video Coding (JCT-VC)

• Target

– Compression performance: 2X compared to H.264/AVC

– Resolution: up to 4k × 2k and up to 60 fps (or more)

– Color depth: 8-bit and 10-bit (up to 14-bit)

• Application use cases

– Intra, random access, low delay

– High efficiency, low complexity

• Timeline:

– Started in January 2010

– First version expected to be completed in 2012-2013

– Evolving reference software: HEVC Test Model (HM)

2 Overview of HEVC

HEVC is based on the same structure as prior hybrid video codecs like H.264/AVC, but with enhancements in each stage. It has a prediction stage composed of motion compensation (with variable block size and fractional-pel motion vectors) and spatial intra-prediction; integer transformation and scalar quantization are applied to prediction residuals; and quantized coefficients are entropy encoded using either arithmetic coding or variable length coding. Also, as in H.264, an in-loop deblocking filter is applied to the reconstructed signal.


Figure 1: General diagram of HEVC decoder

[Figure 2 contents: a CU with split flag 1 is divided into four sub-CUs; the recursion runs from 64×64 at depth 0, through 32×32 at depth 1, down to 4×4 at depth 4.]

Figure 2: Coding structure: quad-tree segmentation

HEVC has two main features that differentiate it from H.264 and its predecessors. First, a new coding structure that replaces the macroblock structure of H.264; and second, the inclusion of two new filters that are applied after the deblocking filter: Adaptive Loop Filter (ALF) and Sample Adaptive Offset (SAO). Figure 1 shows a general diagram of the main stages of the HEVC decoder.

2.1 Coding Structure

The new block structure is based on coding units (CUs) that contain one or several prediction units (PUs) and transform units (TUs) [8]. Each frame is divided into a collection of Large Coding Units (LCUs), with a maximum size of 64×64 samples in the current status of HEVC. Each LCU can be recursively split into smaller CUs using a generic quad-tree segmentation structure. The PU is the basic unit for prediction, and each PU can contain several partitions of variable size. Finally, the TU is the basic unit of transform, which can also have its own partitions. Figure 2 illustrates the concept of quad-tree segmentation, and Figure 3 shows an example of partitions in the different types of units.
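The recursive quad-tree splitting described above can be sketched as follows. This is an illustrative sketch, not HM code: the split-flag source is a hypothetical iterator standing in for the parsed bitstream, and the minimum CU size of 8×8 is an assumption for the example.

```python
# Sketch (not HM code): recursive quad-tree traversal of an LCU driven by
# split flags, as in Figure 2. A split flag of 1 divides a CU into four
# half-size sub-CUs; the flag source here is a simple iterator.

def traverse_cu(x, y, size, depth, split_flags, leaves):
    """Collect the leaf CUs (x, y, size) of one LCU."""
    # A CU can only be split while it is larger than the minimum size (8x8 here).
    if size > 8 and next(split_flags) == 1:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                traverse_cu(x + dx, y + dy, half, depth + 1, split_flags, leaves)
    else:
        leaves.append((x, y, size))

# Example: split the 64x64 LCU once, then split only its first 32x32 quadrant.
flags = iter([1, 1, 0, 0, 0, 0, 0, 0, 0])
leaves = []
traverse_cu(0, 0, 64, 0, flags, leaves)
print(leaves)  # four 16x16 leaves followed by three 32x32 leaves
```

The leaf list always tiles the LCU exactly, whatever the split pattern, which is what makes the structure a generic segmentation rather than a fixed macroblock grid.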



Figure 3: Coding, Transform and Prediction Units

2.2 New filters

2.2.1 Adaptive Loop Filter: ALF

ALF is a filter designed to minimize the distortion of the decoded frame compared to the original one using a Wiener filter. The filter can be activated at the CU level and its coefficients are encoded at the slice level. The filter is applied after the deblocking filter, or after SAO if that filter is enabled.

2.2.2 Sample Adaptive Offset Filter: SAO

The sample adaptive offset (SAO) filter is applied between the deblocking filter and the ALF [6]. In the SAO filter the entire picture is considered as a hierarchical quadtree. For each subquadrant in the quadtree the SAO filter can be activated by transmitting offset values for the pixels in the quadrant. These offsets can either correspond to the intensity band of pixel values (band offset) or to the difference compared to neighboring pixels (edge offset). In HM-3.0 only the luma samples are considered. The same pixel offsets are used for all the LCUs in the quadrant.
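The band-offset idea can be illustrated with a small sketch. This is not the HM-3.0 algorithm; it assumes 8-bit samples and a split of the intensity range into 32 bands of width 8 (a common SAO band layout), with a signed offset added to every pixel falling in a signaled band.

```python
# Sketch of the SAO band-offset idea (illustrative, not the HM-3.0 code):
# the 8-bit intensity range is split into 32 bands of width 8, and a signed
# offset transmitted for a band is added to every pixel falling in that band.

def sao_band_offset(pixels, band_offsets):
    """Apply per-band offsets; band_offsets maps band index (0-31) -> offset."""
    out = []
    for p in pixels:
        band = p >> 3                      # 256 / 32 = 8 intensity values per band
        p = p + band_offsets.get(band, 0)  # bands without an offset are untouched
        out.append(min(255, max(0, p)))    # clip to the valid sample range
    return out

print(sao_band_offset([10, 100, 200], {1: 3, 25: -5}))  # [13, 100, 195]
```

Because each pixel is classified and corrected independently, a pass like this parallelizes trivially over LCUs, which is what the decoupled approach in Section 4 exploits.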

3 Parallelization Opportunities

The techniques used to parallelize previous video codecs can also be used for parallelization of the HEVC decoder. In addition, some new features have been included in the current draft of the new standard to allow parallel execution at different levels of granularity. In this section we review some of these strategies and present in more detail a technique called "entropy slices", which is the focus of our parallel implementation.

3.1 Parallelism among decode stages

Function-level parallelism consists of processing different stages of the video decoder in parallel, for example using a frame-level pipelining approach. A 4-stage pipeline can be implemented with stages such as Parsing, Entropy Decoding, LCU Reconstruction and Filtering [4]. Although this mechanism is applicable to HEVC decoding, its main disadvantage is the high latency and memory bandwidth required by the multiple frame buffers.


3.2 Parallelism Within Decode Stages

This is the finest level of data-level parallelism and consists of finding independent data partitions within a decode stage or kernel. Due to its fine granularity this approach is well suited for hardware architectures [10].

3.2.1 Parallel Entropy Decoding

Some techniques currently under consideration for the HEVC standard include Probability Interval Partitioning (PIPE) and Syntax Element Partitioning (SEP).

PIPE uses an entropy coding algorithm similar to CABAC in H.264/AVC. The main difference is in the binary arithmetic coder: instead of coding the bins with a single arithmetic coding engine, a set of encoders is used, each one associated with a partition of the probability interval. In the original design 12 different probability intervals are used, allowing 12 different bin encoders to operate in parallel [11].

SEP consists of grouping bins in a slice by the type of syntax element rather than by macroblock (or LCU) as in H.264/AVC. Bin groups can be processed in parallel but they need to maintain some data dependencies. A maximum throughput of 2.7X has been reported using 5 different partitions [?].

Another proposal for parallel entropy encoding/decoding is related to the parallelization of the context processing stage of the CABAC algorithm. In this case, some of the internal loops for processing context state (significance map, coefficient sign and coefficient level) are rearranged to expose fine-grain data-level parallelism [15, 3].

3.2.2 Parallel Intra Prediction

Intra prediction uses reconstructed data from neighboring blocks to create the prediction of the current block. This creates strong data dependencies that inhibit parallel processing at the block level. A proposal for partially removing these dependencies is known as "Parallel Prediction Unit for Parallel Intra Coding". In this approach the blocks inside an LCU are grouped into two sets using a checkerboard pattern. The first set of blocks can be predicted and reconstructed in parallel without referencing the second set. After this, the second set can be processed in parallel without referring to other blocks in the same set [20].
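The checkerboard grouping can be sketched in a few lines. This is a simplified illustration of the pattern only (the actual proposal defines the sets over the blocks of an LCU with additional rules): blocks whose row and column indices have even parity-sum form the first set, the rest form the second.

```python
# Sketch of the checkerboard partitioning described above (simplified):
# the first set can be reconstructed in parallel without referencing the
# second; the second set then only references first-set neighbours.

def checkerboard_sets(rows, cols):
    first = [(r, c) for r in range(rows) for c in range(cols) if (r + c) % 2 == 0]
    second = [(r, c) for r in range(rows) for c in range(cols) if (r + c) % 2 == 1]
    return first, second

first, second = checkerboard_sets(2, 4)
print(first)   # [(0, 0), (0, 2), (1, 1), (1, 3)]
print(second)  # [(0, 1), (0, 3), (1, 0), (1, 2)]
```

Each set is internally dependency-free under this scheme, so the block-level work inside an LCU collapses into just two parallel phases.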

3.3 Data-level parallelism

In data-level parallelism the same program (instruction or task) is applied to different portions of the data set. In a video codec data-level parallelism can be applied at different data granularities such as frame level, macroblock (or LCU) level, block level and sample level. A detailed analysis of different data-level parallelization strategies for H.264/AVC can be found in the literature [12, 14].

3.3.1 LCU-level parallelism

LCU- (or macroblock-) level parallelism can be exploited inside or between frames if the data dependencies of the different kernels are satisfied. In HEVC the dependencies vary from stage to stage.

For kernels that reference neighboring data at the LCU level, like intra-prediction, processing LCUs in a diagonal wavefront makes it possible to exploit parallelism between them [5].
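The wavefront idea can be made concrete with a small sketch. It assumes (as is typical for intra-like kernels) that an LCU at (row, col) depends on its left, above and above-right neighbors; under that assumption all LCUs sharing the diagonal index 2*row + col are mutually independent.

```python
# Sketch of wavefront scheduling: with left, above and above-right
# dependencies, LCUs on the same diagonal 2*row + col can run in parallel.

from collections import defaultdict

def wavefront_batches(rows, cols):
    """Group LCU coordinates into batches that can be processed in parallel."""
    batches = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            batches[2 * r + c].append((r, c))
    return [batches[d] for d in sorted(batches)]

for step, batch in enumerate(wavefront_batches(3, 5)):
    print(step, batch)
```

The batch sizes ramp up, plateau, and ramp down, which is exactly the parallelism pattern discussed later in Section 5 as the main scalability limiter.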


For motion compensation there are no intra-frame dependencies (if motion vector prediction is performed at entropy decoding), but there are inter-frame dependencies due to accesses to reference areas. By detecting (statically or dynamically) inter-frame dependencies it is possible to increase the number of independent LCUs compared to wavefront processing. This has been reported for H.264/AVC encoding and decoding [12, 21].

In H.264/AVC the deblocking filter uses filtered samples as input for filtering MBs, creating wavefront-style dependencies. In HEVC a new approach called "parallel deblocking filter" [9] is introduced. In this scheme the deblocking filter is divided into two separate frame-wide stages: horizontal and vertical filtering. Horizontal filtering takes the reconstructed frame as input and produces a filtered frame. After that, vertical filtering is applied, taking the horizontally filtered frame as input and producing the final filtered frame. With this approach all the LCU dependencies of the deblocking filter are removed, allowing parallel processing of all the LCUs in a frame.
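The two-pass structure can be sketched as follows. Only the pass structure is illustrated here; the trivial averaging filter stands in for the real deblocking decisions, and the block size and frame are made-up examples.

```python
# Sketch of the two-pass structure only (the real deblocking filter is far
# more involved): pass 1 smooths across vertical block edges, pass 2 then
# smooths across horizontal edges. Within each pass every edge is
# independent, so each pass parallelizes trivially over the whole frame.

def filter_pass(frame, block, horizontal):
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    if horizontal:  # filter across vertical edges (columns at multiples of block)
        for r in range(h):
            for c in range(block, w, block):
                avg = (frame[r][c - 1] + frame[r][c]) // 2
                out[r][c - 1] = out[r][c] = avg
    else:           # filter across horizontal edges
        for c in range(w):
            for r in range(block, h, block):
                avg = (frame[r - 1][c] + frame[r][c]) // 2
                out[r - 1][c] = out[r][c] = avg
    return out

frame = [[0, 0, 8, 8], [0, 0, 8, 8], [16, 16, 24, 24], [16, 16, 24, 24]]
once = filter_pass(frame, 2, horizontal=True)
twice = filter_pass(once, 2, horizontal=False)
print(twice)
```

The key property is that the second pass reads only the completed output of the first, so no LCU-to-LCU ordering constraint survives inside either pass.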

3.4 Slice-level parallelism

As in previous video codecs, in HEVC each frame can be partitioned into one or more slices. Traditionally slices have been included in order to add robustness to the encoded bitstream in the presence of network transmission errors. To accomplish this, slices in a frame are completely independent from each other: no content of a slice is used to predict elements of other slices in the same frame, and the search area of a dependent frame cannot cross the slice boundary [19]. Although not originally designed for it, slices can be used for exploiting parallelism because they don't have data dependencies.

Parallel processing with slices has several advantages, such as coarse-grain parallel processing, data locality, low delay and low memory bandwidth.

The main disadvantage of slices is the reduction in coding efficiency, which is due to three main reasons. First, a reduction in the efficiency of entropy coding, due to the reduced training of probability contexts and the inability to cross the slice boundary for context selection in the CABAC entropy coder. Second, an efficiency reduction of the prediction stage, due to the inability to cross slice boundaries for LCU prediction. And third, the increase in bitstream size due to the slice headers and start code prefixes used to signal the presence of slices in the bitstream [17]. Figure 4 shows the contribution of each of these factors to the total loss of coding efficiency.

Other disadvantages of traditional slices are load balancing and scalability. Load imbalance appears because slices are created with the same number of MBs, and as a result some slices may be decoded faster than others depending on the input content. Scalability is limited at the decoder because the number of slices per frame is determined by the encoder. If there is no control over what the encoder does, it is possible to receive sequences with one (or few) slice(s) per frame, with a corresponding reduction in the parallelization opportunities.

3.5 Entropy Slices

As a way to overcome the limitations of traditional slices, a new approach for creating slices called "entropy slices" has been included in the current proposal of HEVC [13].


Figure 4: Coding loss due to slices in H.264/AVC (BigShips, QP=27) [17]

                                  Slices        Entropy Slices   Interleaved Entropy Slices
Context model initialization      slice         slice            interleaved
Context model selection           intra-slice   intra-slice      inter-slice
LCU reconstruction neighborhood   intra-slice   inter-slice      inter-slice
Slice header overhead             high          low              low

Table 1: Comparison of Slices (S), Entropy Slices (ES) and Interleaved Entropy Slices (IES)

The first difference with traditional slices is that entropy slices are proposed for parallelism, not for error resilience. The main differences between slices and entropy slices are presented in Table 1.

As with regular slices, in entropy slices the CABAC context models are initialized at the beginning of each slice. Also, in both regular and entropy slices the entropy coding of an LCU cannot cross slice boundaries. A difference appears in the reconstruction phase, where entropy slices allow access to LCUs in other slices for prediction. Finally, the headers of entropy slices only contain start codes and entropy coding signaling, and thus have a reduced size compared to traditional slices.

Entropy slices make it possible to perform entropy decoding in parallel without data dependencies. In the original design it is assumed that entropy decoding is decoupled from LCU reconstruction with a frame buffer. After parallel entropy decoding, LCU reconstruction can be performed in parallel using a wavefront parallel pattern.

A similar approach to entropy slices called Interleaved Entropy Slices (IES) has been proposed (but not included in HEVC) [16]. In IES, slices are interleaved across LCU lines. Context model states are maintained for a longer period compared to regular and entropy slices, and context model selection can cross slice boundaries, resulting in minimal coding efficiency impact.


4 Parallel HEVC Decoder with Entropy Slices

The parallelization opportunities discussed in the previous section allow for roughly two approaches: a decoupled and a combined approach. In the decoupled approach parallelism is exploited in each stage of the HEVC pipeline. Entropy decoding can be performed in parallel using entropy slices. LCU reconstruction can be executed in parallel using wavefront parallelism. The parallel deblocking filter (presented in Section 3.3.1) allows deblocking the edges of the entire picture in two parallel stages, one for the vertical edges followed by one for the horizontal edges. The SAO filter can be performed in one pass in an LCU-independent fashion. Finally, the ALF can also be performed in an LCU-independent fashion.

The parallelism in the decoupled approach can be exploited using a task pool approach in which the work units of each stage are distributed dynamically among the available cores. In most of the decoder stages this can be implemented efficiently using a single atomic counter for both synchronization and work distribution. The drawback of this approach is that it requires large buffers to store the data between the stages. The entropy decoded data is particularly large at . . . per picture in HM-3.0. Additionally, cache locality is reduced because several passes over the picture buffer are required in the reconstruction and filtering stages. Especially with higher resolution sequences, for which the picture cannot be contained in the on-chip caches, performance and scalability are reduced because additional off-chip memory traffic is generated.
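The single-counter work distribution can be sketched as follows. In C/C++ this would be an atomic fetch-and-add; the sketch below relies on CPython's `itertools.count`, whose `next()` is effectively atomic under the GIL, as a stand-in. The stage and unit granularity are illustrative.

```python
# Sketch of distributing independent work units (e.g. the LCUs of an
# LCU-parallel stage) with a single shared counter, as described above.
# Workers repeatedly fetch the next unit index until none remain.

import itertools
import threading

def run_stage(num_units, num_workers, process):
    counter = itertools.count()  # next-unit dispenser; next() is atomic in CPython

    def worker():
        while True:
            unit = next(counter)
            if unit >= num_units:
                return               # no work left for this worker
            process(unit)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

done = []  # list.append is atomic under the CPython GIL
run_stage(20, 4, done.append)
print(sorted(done))  # prints 0..19: every unit processed exactly once
```

A single counter gives both load balancing (fast workers simply fetch more units) and synchronization (a unit index is handed out exactly once), which is why it suffices for most of the decoder stages.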

In the combined approach as many stages as possible are combined in a single pass to increase cache locality and to reduce off-chip memory bandwidth requirements. Recent work on parallel H.264 decoders [] has also opted for this approach. In H.264, however, the entropy decoding stage cannot be combined with the MB reconstruction and filtering in parallel approaches without resorting to regular slices, which impact both objective and subjective quality significantly []. The more efficient HEVC entropy slices allow for parallel entropy decoding, but still maintain the intra dependencies of the LCU reconstruction and deblocking filter stages. Combining the entropy decoding and reconstruction stages therefore requires the ability to perform LCU wavefront execution. This can only be achieved by enforcing a one-entropy-slice-per-row encoding approach.

When using one entropy slice per row, it is not only possible to combine the entropy decode and reconstruct stages, but also the deblocking and SAO filter stages, when using the Ring-Line approach []. In the Ring-Line strategy an arbitrary number of line decoders is used to decode the picture in a line-interleaved manner. The line decoders can maintain the wavefront dependencies efficiently using a ring synchronization approach, in which the line decoders only need to synchronize with their neighbors. In each line decoder the entropy decode, LCU reconstruct and deblocking filter can be performed on the same LCU. The SAO filter, however, operates on the deblocked output image and therefore cannot process the same LCU, as its lower and right edges are not deblocked yet. Instead the SAO filter is performed on the upper-left LCU, for which all the deblocked image data is available. The decoding order of the stages and the corresponding modified pixels for one LCU are illustrated in Figure 5. Figure 6 shows the wavefront progression when using four line decoder threads.
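The neighbor-only ring synchronization can be sketched with per-row progress counters. This is a hedged sketch of the idea, not the actual implementation: each line-decoder thread handles rows r, r+N, r+2N, ..., and before decoding LCU (r, c) it waits until the row above has completed two LCUs more, which covers the above and above-right wavefront dependencies.

```python
# Sketch of the Ring-Line synchronization: line decoders only wait on the
# progress counter of the row directly above (their ring neighbour).

import threading

def ring_line_decode(rows, cols, num_threads, decode_lcu):
    progress = [0] * rows            # LCUs completed per picture row
    cond = threading.Condition()

    def line_decoder(tid):
        # thread tid handles rows tid, tid + num_threads, ...
        for r in range(tid, rows, num_threads):
            for c in range(cols):
                with cond:
                    # wait until the row above is two LCUs ahead, covering
                    # the above and above-right dependencies
                    while r > 0 and progress[r - 1] < min(c + 2, cols):
                        cond.wait()
                decode_lcu(r, c)
                with cond:
                    progress[r] += 1
                    cond.notify_all()

    threads = [threading.Thread(target=line_decoder, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

order = []  # list.append is atomic under the CPython GIL
ring_line_decode(4, 6, 2, lambda r, c: order.append((r, c)))
print(len(order))  # 24: every LCU decoded exactly once
```

Because each row only waits on the row above, and rows are assigned round-robin, the synchronization forms a ring: thread t only ever waits on thread (t-1) mod N, which keeps the scheme cheap regardless of picture size.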

Figure 5: The decoder stages are applied on different adjacent LCUs to maintain the kernel dependencies. Each square represents a 2×2 pixel block. (Stages: reconstruction, deblocking of vertical edges, deblocking of horizontal edges, SAO filtering.)

Figure 6: Wavefront progression of the combined stages. The colors show the decoding progress of each stage before starting the decode of the hatched blocks. The SAO filter is delayed until all the pixels in the LCU are deblocked.

The SAO filter is designed to be LCU independent by ignoring some pixel classification methods that would require pixels from neighboring LCUs in the edge offset mode. This allows the SAO filter to be applied LCU-parallel at each level of the hierarchical quadtree. For our implementation it is necessary that the SAO filter is applied in a single pass over all quadtree levels for each LCU. Since the SAO filter is LCU independent, a depth-first method is equivalent to the breadth-first method. A mapping table is used to efficiently link each LCU to the corresponding quadrant of each level.

Unfortunately, the ALF stage cannot be combined with the other stages in HM-3.0. To perform the ALF on a CU, its absolute CU index is required to index the ALF on/off flag array. The absolute CU index, however, is only known after all previous CUs have been entropy decoded. This condition is not met when processing the ALF in the line decoders using the Ring-Line strategy. The ALF, therefore, is performed after reconstructing the entire picture in a separate pass. Because the ALF is a relatively compute-intensive stage, the impact of an additional picture buffer pass is not significant. In our implementation, to reduce cache line conflicts and synchronization overhead, eight consecutive LCUs are grouped into a work unit and processed by a single core.

5 Experimental Results

In this section we present the experimental results for the proposed parallelization methodology. We present coding efficiency results to analyse the impact of entropy slices on compression and, after that, we analyse the performance of our implementation on two different parallel machines.


Option                            Value
Max. CU Size                      64×64
Max. Partition Depth              4
Period of I-frames                32
Number of B-frames (GOPSize)      8
Number of reference frames        4
Motion Estimation Algorithm       EPZS [18]
Search range                      64
Entropy Coding                    CABAC
Internal Bit Depth                10
Adaptive Loop Filter (ALF)        enabled
Sample Adaptive Offset (SAO)      enabled
Quantization Parameter (QP)       22, 27, 32, and 37

Table 2: Coding Options

Class  Sequence              Resolution  Frame count  Frame rate  Bit depth
S      CrowdRun              3840x2160   500          50p         8
S      InToTree              3840x2160   500          50p         8
S      ParkJoy               3840x2160   500          50p         8
A      NebutaFestival        2560x1600   300          60p         10
A      PeopleOnStreet        2560x1600   150          30p         8
A      SteamLocomotiveTrain  2560x1600   300          60p         10
A      Traffic               2560x1600   150          30p         8
B      BasketballDrive       1920x1080   500          50p         8
B      BQTerrace             1920x1080   600          60p         8
B      Cactus                1920x1080   500          50p         8
B      Kimono1               1920x1080   240          24p         8
B      ParkScene             1920x1080   240          24p         8

Table 3: Input sequences

5.1 Experimental Environment

5.1.1 HEVC software and Input Videos

We have implemented a parallel HEVC decoder on top of the HM-3.0 reference decoder. From all the available test conditions, described in Section 1, we selected Random Access High Efficiency (RA-HE), which we consider the most demanding application scenario of the current HEVC proposal. RA-HE includes the most computationally demanding tools of HEVC, such as CABAC and ALF, and it makes extensive use of B-frames. It should be noted, however, that the same parallel code can be used without modifications with the other test conditions.

Encoding options are based on the common conditions described in JCTVC-E700 [2]. Table 2 shows the main parameters used for the encodings.

We encoded all the videos from the HEVC test sequences using the HM-3.0 reference encoder. Due to space reasons, and because we are mainly interested in high-definition applications, we only present results for class A (2560×1600 pixels) and class B (1920×1080 pixels) sequences. We also included 4K videos (3840×2160) from the SVT High Definition Multi Format Test Set [7]. We will refer to videos with this resolution as "class S". Test sequence information is presented in Table 3.


System                Fujitsu RX600 S5        Dell Precision T5500
Processor             Intel Xeon X7550        Intel Xeon X5680
ISA                   x86-64                  x86-64
Microarchitecture     Nehalem-EX              Westmere
Num. sockets          4                       2
Num. cores/socket     8                       6
Num. threads/socket   16                      12
Technology            45 nm                   32 nm
Clock frequency       2.0 GHz                 3.33 GHz
Power                 130 W                   130 W
Level 1 D-cache       32 KB / core            32 KB / core
Level 2 D-cache       256 KB / core           256 KB / core
Level 3 cache         18 MB / socket          12 MB / socket
Memory                DDR3-1066               DDR3-1333
Interconnection       QuickPath               QuickPath
Boost                 1.46.1                  1.42.1
Compiler              GCC-4.4.5 -O3           GCC-4.5.2 -O3
Operating system      Linux kernel 2.6.32-5   Linux kernel 2.6.38-8

Table 4: Experimentation platform

5.1.2 Platform

Multiple threads were created using the Boost.Thread C++ library, which enables multithreading on shared memory systems for C++ programs.

For our parallel decoding experiments we used two parallel systems. One is a cache-coherent Non-Uniform Memory Access (cc-NUMA) machine. It is based on the Intel Xeon X7550 processor, which has 8 cores per chip; the whole system contains 4 sockets, for a total of 32 cores, connected with the Intel QuickPath Interconnect (QPI). The second one is a dual-socket machine based on the Intel Xeon X5680 processor, which has 6 cores per chip, for a total of 12 cores. The main parameters of these architectures are listed in Table 4.

The machines were configured with the TurboBoost feature disabled to avoid dynamic changes in frequency. Although Simultaneous MultiThreading (SMT) is enabled by default, we do not schedule more than one thread per core. Also, we used thread pinning to manually assign threads to cores, avoiding thread migrations, which have even more negative effects on cc-NUMA architectures.

5.2 Coding Efficiency

In this section we quantify the effect of entropy slices on coding efficiency and compareit with a baseline system and a system with regular slices.

First, we used a configuration with one regular slice per frame, which represents the baseline with the highest quality and the minimum bitrate. Our parallelization approach is based on entropy slices, for which we encoded the videos with one entropy slice per row. As a comparison we also encoded all videos with one regular slice per row. A comparison of the two slice approaches with the baseline is summarized in Table 5 using the Bjontegaard metric [1]. Regular slices result in an average bitrate increase of 6.8%, 14% and 9.5% for the Y, U and V components respectively. Entropy slices result in an average bitrate increase of 5.2%, 5.9% and 5.5% for the three components.
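For readers unfamiliar with the Bjontegaard metric, the BD-rate computation can be sketched as follows. This is a hedged sketch (reference implementations differ in fitting and integration details): fit log-rate as a function of PSNR for both codecs from four rate/distortion points, integrate the log-rate difference over the shared PSNR interval, and convert the mean gap back to a percentage. The toy rate/distortion points are made up for the example.

```python
# Sketch of the Bjontegaard-delta bitrate (BD-rate) used in Table 5:
# mean vertical gap between two log10(rate)-vs-PSNR curves, as a percentage.

import math

def lagrange(xs, ys, x):
    """Evaluate the polynomial interpolating (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def bd_rate(anchor, test, steps=1000):
    """anchor/test: lists of (bitrate_kbps, psnr_db), one point per QP."""
    r1, d1 = zip(*anchor)
    r2, d2 = zip(*test)
    lr1 = [math.log10(r) for r in r1]
    lr2 = [math.log10(r) for r in r2]
    lo, hi = max(min(d1), min(d2)), min(max(d1), max(d2))
    # trapezoidal integration of the log-rate difference over [lo, hi]
    acc = 0.0
    for k in range(steps + 1):
        d = lo + (hi - lo) * k / steps
        w = 0.5 if k in (0, steps) else 1.0
        acc += w * (lagrange(d2, lr2, d) - lagrange(d1, lr1, d))
    avg = acc / steps              # mean log10 bitrate difference
    return (10 ** avg - 1) * 100   # positive => test needs more bitrate

# Toy curves: the second codec spends ~5% more bitrate at every quality point.
anchor = [(1000, 32.0), (2000, 35.0), (4000, 38.0), (8000, 41.0)]
test = [(r * 1.05, d) for r, d in anchor]
print(round(bd_rate(anchor, test), 2))  # ~5.0
```

A positive BD-rate, as in Table 5, therefore means the slice configuration needs that much more bitrate than the one-slice-per-frame baseline for the same objective quality.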


         1 regular slice per row            1 entropy slice per row
         Y BD-rate  U BD-rate  V BD-rate    Y BD-rate  U BD-rate  V BD-rate
Class S  5.037      13.689     7.093        3.808      4.585      4.865
Class A  6.261      16.854     12.429       5.472      6.381      5.724
Class B  9.216      11.518     8.964        6.3        6.802      5.929

Table 5: Coding efficiency of regular slices and entropy slices compared to HM-3.0 with one slice per frame

5.3 Performance

We ran the parallel HEVC decoder on the two parallel machines. We used all the videos described in Table 3 and decoded each one five times for each number of threads.

5.3.1 Speedup and Frames-per-Second

The sub-figures on the left side of Figures 7 and 8 show the performance in terms of average speedup and frame rate for the three input video classes and the two machines under study. Speedup is computed against the original sequential code (thread 0) and is presented along with the parallel code using one core (thread 1).

The main difference between the two machines is that the T5500 has a higher clock rate than the RX600. This allows the former to process an average of 50 class B frames per second when using 10 processors. The speedup curves are very similar: the parallelization efficiency is relatively high for small core counts (more than 80% for 4 cores) and the speedup saturates at some number of processors, for example 12 processors for class B, at which a maximum speedup of 5 is reached with an efficiency of 50%. For the most demanding sequences (class S), even with a high number of processors and using the high-frequency configuration, it was not possible to reach real-time operation (a maximum of 15 fps is reached with 11 cores). In general the low absolute performance is due to the low performance of the original single-threaded reference code. Additional optimizations (like SIMD vectorization) can be applied to increase the performance of the single-threaded code.

Table 6 shows the corresponding number of entropy slices, the maximum number of processors used and the obtained speedup for the different code sections. The ALF section exhibits an almost linear speedup (with an efficiency close to 90%). The other sections (Entropy Decoding, LCU Reconstruction, Deblocking Filter and SAO) have a lower efficiency (around 53%) due to data dependencies and load imbalance. The total speedup and performance are limited by the ED + LCU + DF + SAO stages.

5.3.2 Profiling of Execution Time

We performed a profiling analysis in order to identify the relative contribution of different parts of the application to the final performance. The sub-figures on the right side of Figures 7 and 8 show the average execution time for each video class, divided into sequential and parallel portions. The parallel part has been further divided into two sections, one with the ED + LCU + DF + SAO stages and the other with the ALF kernel.


[Figure 7 plots: (a) Class S speedup, (b) Class S execution time, (c) Class A speedup, (d) Class A execution time, (e) Class B speedup, (f) Class B execution time. Left-hand plots show speedup and frames per second versus number of threads; right-hand plots show average execution time per frame in seconds, split into Sequential, ALF and ED+REC+DF+SAO parts.]

Figure 7: Speedup and average execution time for the rx600s51t machine. Speedup is measured against the original sequential code (referred to as thread 0 in the figure). Thread 1 is the parallel code with one thread. Average execution time is divided into sequential and parallel parts.

Due to its massively parallel nature, the ALF filter's execution time decreases almost linearly with the number of threads. The ED + LCU + DF + SAO stages also scale, but reach a saturation point; and the sequential stage increases its fraction of the total execution time according to Amdahl's law.
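The Amdahl's-law behavior mentioned above can be made concrete with a small numeric sketch. The sequential fraction s = 0.05 below is an illustrative assumption, not a measured value from our profiling.

```python
# Amdahl's law sketch: with sequential fraction s, the speedup on p cores is
# 1 / (s + (1 - s) / p), and the sequential part's share of the (shrinking)
# total execution time grows with p.

def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

def sequential_share(s, p):
    return s / (s + (1.0 - s) / p)

for p in (1, 4, 12, 24):
    print(p, round(amdahl_speedup(0.05, p), 2), round(sequential_share(0.05, p), 2))
```

Even a 5% sequential fraction caps the speedup near 11 on 24 cores, while its share of the total time grows past 50%, which mirrors the trend visible in the execution-time breakdowns.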

Table 7 shows the contribution of the different stages to the total execution time.


[Figure 8 plots: (a) Class S speedup, (b) Class S execution time, (c) Class A speedup, (d) Class A execution time, (e) Class B speedup, (f) Class B execution time, with the same axes as Figure 7.]

Figure 8: Speedup and average execution time for the x5680 machine. Speedup is measured against the original sequential code (referred to as thread 0 in the figure). Thread 1 is the parallel code with one thread. Average execution time is divided into sequential and parallel parts.

Using the number of processors that yields the maximum speedup, the contribution of the ALF is below 15%. The sequential part of the application becomes important, with a contribution between 15% and 22%. But the main limitation in scalability is the ED + LCU + DF + SAO stage.


                        rx600s51t machine      x5680 machine
Class                   S      A      B        S      A      B
Num. entropy slices     34     25     17       34     25     17
Max. processors         24     16     14       12     12     12
ED+LCU+DF+SAO speedup   11.5   8.6    6.03     7.94   7.24   5.35
ALF speedup             21.5   14.5   12.05    11.15  10.62  9.98
Total speedup           10.3   7.6    5.67     7.35   6.62   5.20
FPS                     11.7   18.5   31.9     15.38  29.54  53.15

Table 6: Maximum speedup

                  rx600s51t machine          x5680 machine
Class             S       A       B          S       A       B
Max. processors   24      16      14         12      12      12
LCU+DF+SAO        64.11%  62.70%  71.24%     64.95%  63.26%  70.69%
ALF               13.81%  16.52%  13.81%     19.08%  18.62%  14.66%
Sequential part   22.07%  20.78%  14.95%     15.97%  18.12%  14.65%

Table 7: Contribution of different stages to the total execution time with the maximum number of processors

[Figure 9 plot omitted: parallel LCUs (y-axis, 0 to 35) versus time slot (x-axis, 0 to 140), with one curve per class S, A and B]

Figure 9: Maximum parallelism in the wavefront kernels

dominates the execution time of the parallel decoder. It takes more than 60% of the execution time with the maximum number of processors.

This limitation is due to the wavefront dependencies in some kernels. This type of dependency generates a variable number of parallel tasks with a ramp pattern. Figure 9 shows the number of independent tasks for the kernels with wavefront dependencies. Assuming constant task time and no synchronisation overhead, the maximum theoretical speedups are 16.19, 11.36 and 8.22 for classes S, A and B, respectively. The efficiency of our parallelization compared to this maximum is between 71% and 76%.
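These bounds can be reproduced with a simple analytical sketch (ours, not the paper's code). Assuming that each LCU at grid position (x, y) can start once its left and top-right neighbours are done, it belongs to wavefront time slot x + 2y; with constant task time and unlimited cores, the speedup bound is the total LCU count divided by the number of time slots:

```python
def wavefront_widths(cols, rows):
    """Independent LCUs per wavefront time slot, assuming LCU (x, y)
    depends on its left and top-right neighbours (slot = x + 2*y)."""
    widths = [0] * (cols + 2 * (rows - 1))
    for y in range(rows):
        for x in range(cols):
            widths[x + 2 * y] += 1
    return widths

def max_theoretical_speedup(cols, rows):
    """Bound with constant task time, no synchronisation overhead and
    unlimited cores: total LCUs divided by the number of time slots."""
    return cols * rows / len(wavefront_widths(cols, rows))
```

With assumed 64×64 LCU grids, class S (3840×2160) gives 60×34 → 2040/126 ≈ 16.19, class A (2560×1600) gives 40×25 → 1000/88 ≈ 11.36, and class B (1920×1080) gives 30×17 → 510/62 ≈ 8.23, in line with the figures quoted above.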


6 Limitations and Solutions

The proposed parallelization strategy has several advantageous properties from a parallel implementation perspective. First, it achieves good scaling efficiency at lower core counts. Second, the number of line decoders used in the implementation is independent of the bitstream. The actual number of line decoders is flexible and can be chosen to match the processing capabilities of the computing hardware and the performance requirements. Third, no additional frame buffers are required, which keeps the memory size requirements similar to those of the single-threaded approach. Fourth, as reflected by the performance of the parallel implementation running on one core, the parallelization overhead is low because the stages are combined as much as possible. Finally, all single-threaded optimization opportunities to increase cache locality and reduce off-chip memory traffic remain exploitable.

A limitation is the scaling efficiency at higher core counts. As shown in the execution time breakdown in Section ??, this is caused by the sequential part and the wavefront parallel part. The sequential part does not decrease when increasing the number of cores and takes as much as 22% of the total execution time. This, however, can be solved by pipelining the sequential part, which consists mainly of scanning the bitstreams for start codes and decoding the SAO and ALF parameters. Parsing the bitstream one frame ahead in a separate thread will hide the time spent in this sequential part.
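The pipelining idea can be sketched as a small producer/consumer pipeline (a minimal illustration of ours, not the paper's implementation; `parse_headers` and `decode_frame` are hypothetical stand-ins for the real stages):

```python
import threading
from queue import Queue

def parse_headers(frame_bytes):
    # Hypothetical stand-in for the sequential work: scanning the
    # bitstream for start codes and decoding the SAO/ALF parameters.
    return {"payload": frame_bytes}

def decode_frame(parsed):
    # Hypothetical stand-in for the parallel wavefront decoding stage.
    return len(parsed["payload"])

def pipelined_decode(bitstream_frames):
    """Run the sequential parsing one frame ahead in its own thread,
    hiding its time behind the previous frame's parallel decode."""
    q = Queue(maxsize=2)  # parse at most roughly one frame ahead

    def producer():
        for frame in bitstream_frames:
            q.put(parse_headers(frame))
        q.put(None)  # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    decoded = []
    while (parsed := q.get()) is not None:
        decoded.append(decode_frame(parsed))
    return decoded
```

The bounded queue keeps the parser from running arbitrarily far ahead, so the extra memory cost stays at one frame's worth of parsed state.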

The wavefront parallel part does not scale linearly due to its inherent parallelism ramp-up and ramp-down. This could be mitigated by overlapping the decoding of consecutive frames []. Currently, this cannot be performed because the ALF cannot be combined with the other stages but is nevertheless inside the prediction loop. The ALF could have been combined with the other stages if the ALF CU on/off flags were not signaled in the slice header, but as syntax elements of the CUs in the raw byte sequence payload. By doing so, the absolute CU index would no longer be necessary. It is expected that this contribution will be made in future developments of the HEVC standardization, as theoretically there will be no impact on the coding efficiency since the ALF CU flags are only moved to a different location.
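The benefit of overlapping frames can be illustrated with a rough model (an assumption of ours, not a measurement from the paper): suppose each frame's wavefront may launch a fixed number of time slots after the previous frame's, e.g. once the reference rows it needs are reconstructed. Overlap then shrinks the makespan and recovers part of the ramp losses:

```python
def overlapped_speedup(cols, rows, frames, start_offset):
    """Speedup upper bound (constant task time, unlimited cores) when
    each frame's wavefront starts `start_offset` time slots after the
    previous one. start_offset == cols + 2*(rows - 1) means the frames
    are decoded strictly back to back (no overlap)."""
    slots_per_frame = cols + 2 * (rows - 1)
    makespan = (frames - 1) * start_offset + slots_per_frame
    return frames * cols * rows / makespan
```

For an assumed 30×17 LCU grid (class B with 64×64 LCUs), decoding 10 frames back to back (offset 62) stays at the single-frame bound of about 8.23, while starting each frame 2×rows = 34 slots after its predecessor raises the bound to roughly 13.9.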

From a coding efficiency perspective, entropy slices have a reduced impact on the objective quality and no impact on the subjective quality compared to using a single slice per frame. Using one entropy slice per row and wavefront execution, however, allows for further improvement of the objective quality. For example, context selection over entropy slice boundaries (D243) [] and propagation of context tables (E196) [] can be applied straightforwardly without complicating the proposed parallelization strategy. Furthermore, the training losses can be reduced by using multiple context initialization tables for each entropy slice or picture. Compared to the regular one-slice-per-picture approach, the only losses would originate from the additionally coded start codes or bitstream offsets and the bitstream padding.

7 Conclusions

In this paper we have proposed and evaluated a parallelization strategy for the emerging HEVC video codec. The proposed strategy requires that each LCU row constitutes an entropy slice. The LCU rows are processed in a wavefront parallel fashion by several line decoder threads using a ring synchronization. The presented implementation achieves real-time performance for 1920×1080 (53.1 fps) and 2560×1600 (29.5 fps) resolutions on a 12-core Xeon machine.

The proposed parallelization strategy has several desirable properties. First, it achieves good scaling efficiency at moderate core counts. Second, the number of line decoders can be chosen to match the processing capabilities of the computing hardware and the performance requirements. Third, using more cores increases the throughput and at the same time reduces the frame latency, making the implementation suitable for both low-delay and high-throughput use scenarios.

A limitation is the scaling efficiency at higher core counts. This is caused by the sequential part and the ramp-up and ramp-down efficiency losses of the wavefront parallel part. In future work this can be solved by pipelining the sequential part and overlapping the execution of consecutive frames.

References

[1] Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T Video Coding Experts Group (VCEG), 2001.

[2] Frank Bossen. Common test conditions and software reference configurations. Technical Report JCTVC-E700, Jan. 2011.

[3] Madhukar Budagavi and Mehmet Umut Demircin. Parallel Context Processing techniques for high coding efficiency entropy coding in HEVC. Technical Report JCTVC-B088, July 2010.

[4] Chi Ching Chi and Ben Juurlink. A QHD-capable parallel H.264 decoder. In Proc. of the Int. Conf. on Supercomputing, pages 317–326, 2011.

[5] E. B. Van der Tol, E. G. T. Jaspers, and R. H. Gelderblom. Mapping of H.264 decoding on a multiprocessor architecture. In Proceedings of SPIE, 2003.

[6] Chih-Ming Fu, Ching-Yeh Chen, Chia-Yang Tsai, Yu-Wen Huang, and Shawmin Lei. CE13: Sample Adaptive Offset with LCU-Independent Decoding. Technical Report JCTVC-E409, March 2011.

[7] Lars Haglund. The SVT High Definition Multi Format Test Set. Technical report, Sveriges Television, Feb. 2006.

[8] Woo-Jin Han, Junghye Min, Il-Koo Kim, Elena Alshina, Alexander Alshin, Tammy Lee, Jianle Chen, Vadim Seregin, Sunil Lee, Yoon Mi Hong, Min-Su Cheon, Nikolay Shlyakhov, Ken McCann, Thomas Davies, and Jeong-Hoon Park. Improved Video Compression Efficiency Through Flexible Unit Representation and Corresponding Extension of Coding Tools. IEEE Transactions on Circuits and Systems for Video Technology, 20(12):1709–1720, Dec. 2010.

[9] Masaru Ikeda, Junichi Tanaka, and Teruhiko Suzuki. Parallel deblocking filter. Technical Report JCTVC-E181, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC, March 2011.

[10] Young-Long Steve Lin, Chao-Yang Kao, Hung-Chih Kuo, and Jian-Wen Chen. VLSI Design for Video Coding. Springer, 2010.


[11] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Entropy Coding in Video Compression using Probability Interval Partitioning. In Picture Coding Symposium (PCS 2010), pages 66–69, Dec. 2010.

[12] Cor Meenderinck, Arnaldo Azevedo, Mauricio Alvarez, Ben Juurlink, and Alex Ramírez. Parallel Scalability of Video Decoders. Journal of Signal Processing Systems, 57:173–194, November 2009.

[13] Kiran Misra, Jie Zhao, and Andrew Segall. Entropy slices for parallel entropy coding. Technical Report JCTVC-B111, July 2010.

[14] Florian H. Seitner, Ralf M. Schreier, Michael Bleyer, and Margrit Gelautz. Evaluation of data-parallel splitting approaches for H.264 decoding. In Proceedings of the 6th International Conference on Advances in Mobile Computing and Multimedia, pages 40–49, 2008.

[15] J. Sole, R. Joshi, I. S. Chong, M. Coban, and M. Karczewicz. Parallel context processing for the significance map in high coding efficiency. Technical Report JCTVC-D262, Jan. 2011.

[16] Vivienne Sze, Madhukar Budagavi, and Anantha P. Chandrakasan. Massively parallel CABAC. Technical Report VCEG-AL21, Video Coding Experts Group (VCEG), July 2009.

[17] Vivienne Sze and Anantha P. Chandrakasan. A high throughput CABAC algorithm using syntax element partitioning. In Proceedings of the 16th IEEE International Conference on Image Processing, pages 773–776, 2009.

[18] Alexis M. Tourapis. Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation. In Proceedings of SPIE Visual Communications and Image Processing 2002, pages 1069–1079, Jan. 2002.

[19] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, July 2003.

[20] Jie Zhao and Andrew Segall. Parallel prediction unit for parallel intra coding. Technical Report JCTVC-B112, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T and ISO/IEC, July 2010.

[21] Zhuo Zhao and Ping Liang. Data partition for wavefront parallelization of H.264 video encoder. In IEEE International Symposium on Circuits and Systems, 2006.
