Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo (1), Cor Meenderinck (1), Ben Juurlink (1), Andrei Terechko (2), Jan Hoogerbrugge (2), Mauricio Alvarez (3), Alex Ramirez (3,4)

1 - Delft University of Technology, Netherlands
2 - NXP, Netherlands
3 - Barcelona Supercomputing Center, Spain
4 - Universitat Politecnica de Catalunya, Spain

HIPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009

Outline

• Introduction
• 3D-Wave
• 3D-Wave Implementation
• Experimental Results
• Conclusions

Introduction

• Industry shift to multicores
• Increasing demand for higher media quality/resolution
• Efficient and scalable exploitation of multicore architectures for video coding
• H.264 is widely used and computationally demanding
• Decoding is part of encoding and more challenging

Parallel H.264 Decoding: The H.264 Decoder

The H.264 decoding process (http://www.powercam.cc/slide/1580)

[Figure: H.264 decoder block diagram. Blocks: Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking. Group labels: Parser, Reconstructor, Data-Parallel Processing.]

H.264 Parallelization

Frame-level
• Motion compensation introduces inter-frame dependencies
• Frame-level parallelism is very limited

Slice-level
• Slice-level parallelism is uncertain and increases bitrate

[Figure: a frame divided into Slice 1, Slice 2, and Slice 3]

H.264 Parallelization: Macroblock-level

[Figure: the current MB and its neighbouring MBs, labelled Intra and DF (intra prediction and deblocking filter dependencies), followed by a frame grid in which each MB is marked with its time slot; the time-slot axis runs from 1 to 241 in steps of 15]

2D-Wave:
• Exploits MB-level parallelism
• Full HD: up to 60 MBs in parallel
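The 60-MB figure for 2D-Wave can be reproduced with a small model. The sketch below is not taken from the slides: it assumes that every MB takes exactly one time slot, that an MB waits for its left, up-left, up, and up-right neighbours, and that Full HD is 1920x1088 pixels (120x68 MBs of 16x16 pixels). The function names are hypothetical.

```python
def wavefront_slots(width_mbs, height_mbs):
    """Earliest time slot of each MB under the assumed 2D-Wave dependencies."""
    slot = [[0] * width_mbs for _ in range(height_mbs)]
    for y in range(height_mbs):
        for x in range(width_mbs):
            # Neighbours that must finish first: left, up-left, up, up-right.
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [slot[dy][dx] + 1 for dx, dy in deps
                     if 0 <= dx < width_mbs and 0 <= dy < height_mbs]
            # An MB without neighbours (the top-left one) starts in slot 0.
            slot[y][x] = max(ready, default=0)
    return slot


def max_parallel_mbs(width_mbs, height_mbs):
    """Largest number of MBs that share one time slot."""
    counts = {}
    for row in wavefront_slots(width_mbs, height_mbs):
        for s in row:
            counts[s] = counts.get(s, 0) + 1
    return max(counts.values())


# Assumption: Full HD is 120x68 MBs.
print(max_parallel_mbs(120, 68))  # prints 60
```

Under this model MB (x, y) lands in slot x + 2y, so a frame that is 120 MBs wide never has more than 60 MBs available at once, regardless of how many cores are present.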

H.264 Parallelization: overview of current strategies

• Frame-level: very limited parallelism
• Slice-level: uncertain parallelism, increases bitrate
• MB-level: reasonable parallelism

None of these is sufficient to leverage a many-core!

3D-Wave

[Figure: motion compensation between frame 0 (I), frame 1 (P), and frame 2 (P)]

3D-Wave: maximum parallelism

• For Full HD: maximum available parallelism ranges from 5000 to 9000 MBs!
• Note: this requires > 200 frames in flight.

[Figure: curves plotted against Time Stamp for the sequences Blue sky, Riverbed, Pedestrian, and Rush hour]

3D-Wave Implementation

3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors
• TM3270 was designed for SD video processing
• VLIW-based media processor with SIMD support
• In-house simulator capable of simulating up to 64 cores
• 2D-Wave was already implemented

Tail submit (proposed by Hoogerbrugge and Terechko) [13]
• Checks the right and down-left MBs
• Executes one of them directly if it is ready and sends the other to the Task Queue (TQ)

[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
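The tail-submit idea can be illustrated with a single-worker simulation. This is a minimal sketch, not the NXP implementation: it assumes each MB keeps a counter of unmet dependencies on its left and up-right neighbours (the other intra-frame dependencies follow transitively), that the frame is at least 2 MBs wide, and the name `decode_frame_tail_submit` is hypothetical.

```python
from collections import deque


def decode_frame_tail_submit(w, h):
    """Decode a w x h MB frame; return the decode order and the number of TQ submissions."""
    # Unmet dependencies per MB: left neighbour and up-right neighbour (assumed model).
    pending = {(x, y): (x > 0) + (y > 0 and x + 1 < w)
               for y in range(h) for x in range(w)}
    tq = deque([(0, 0)])          # the first MB is submitted through the task queue
    tq_submissions = 1
    order = []
    while tq:
        mb = tq.popleft()
        while mb is not None:
            order.append(mb)      # stands in for the actual MB decoding
            x, y = mb
            ready = []
            # Tail submit: check the right and the down-left MB.
            for nb in ((x + 1, y), (x - 1, y + 1)):
                if nb in pending:
                    pending[nb] -= 1
                    if pending[nb] == 0:
                        ready.append(nb)
            # Execute one ready MB directly, send any other one to the TQ.
            mb = ready[0] if ready else None
            for other in ready[1:]:
                tq.append(other)
                tq_submissions += 1
    return order, tq_submissions


order, submissions = decode_frame_tail_submit(4, 3)
print(len(order), submissions)
```

Because a worker continues with a ready neighbour without going through the queue, the number of TQ submissions stays below the number of MBs, which is the overhead that tail submit is meant to reduce.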

3D-Wave Implementation: Reference Frame Buffer Structure

[Figure: Reference Frame Buffer Structure. Labels: Reference Frame Buffer with Frame 0 to Frame 5, one Decoder, one Sync info block.]

[Figure: Parallel Reference Frame Buffer Structure. Labels: first one Decoder, then three Decoders; five Sync info blocks.]

3D-Wave Implementation: Inter-frame dependencies

• mb_decode checks inter-frame dependencies. On failure, it inserts the MB in the Kick-Off List of the Ref MB.
  [Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB holds F1;MB(1,3), followed by NULL]
• Decoding process continues normally.
  [Figure: Frame 0 and Frame 1; the Kick-Off List is unchanged]
• mb_decode checks the Kick-Off List and submits the subscribed tasks.
  [Figure: Frame 0 and Frame 1; the entry F1;MB(1,3) is taken from the Kick-Off List of the Ref MB]
• And the decoding process carries on.
  [Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB is empty (NULL)]
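The Kick-Off List mechanism described on these slides can be sketched as follows. This is a single-threaded illustration under stated assumptions: only `mb_decode` and the Kick-Off List come from the slides, while the `MB` class, its fields, and the `run` loop are hypothetical, and a real multicore version would need synchronization around the list.

```python
from collections import deque


class MB:
    """A macroblock task (hypothetical structure for illustration)."""
    def __init__(self, frame, x, y, ref=None):
        self.frame, self.x, self.y = frame, x, y
        self.ref = ref          # MB in the reference frame that this MB needs, if any
        self.decoded = False
        self.kick_off = []      # MBs subscribed to this MB (its Kick-Off List)


def mb_decode(mb, task_queue, log):
    """Decode mb if its reference is available, otherwise subscribe it to the Ref MB."""
    if mb.ref is not None and not mb.ref.decoded:
        mb.ref.kick_off.append(mb)      # dependency check failed: wait for the Ref MB
        return False
    mb.decoded = True
    log.append((mb.frame, mb.x, mb.y))  # stands in for the actual decoding work
    task_queue.extend(mb.kick_off)      # resubmit every subscribed task
    mb.kick_off.clear()
    return True


def run(task_queue):
    log = []
    while task_queue:
        mb_decode(task_queue.popleft(), task_queue, log)
    return log


ref_mb = MB(frame=0, x=2, y=3)
waiting = MB(frame=1, x=1, y=3, ref=ref_mb)   # corresponds to an entry like F1;MB(1,3)
print(run(deque([waiting, ref_mb])))          # prints [(0, 2, 3), (1, 1, 3)]
```

The waiting MB is attempted first, fails its dependency check, and is decoded only after the Ref MB resubmits it.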

3D-Wave Implementation: Frame Scheduling

• 3D-Wave can have many frames in flight
• A practical implementation requires few frames in flight
• A policy was developed to limit the number of frames in flight

Implementation
• uses the Kick-Off List
• subscribes the first MB of the next frame to a specific MB in the current frame
• the position of that MB defines the number of frames in flight
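How the position of the trigger MB bounds the number of frames in flight can be seen in an idealized model. The sketch below is an assumption-laden illustration, not the policy's actual implementation: it assumes that every frame decodes one MB per time step, that frame k+1 starts as soon as the MB at `trigger_index` (0-based, in decode order) of frame k is done, and that a frame is in flight from its first to its last MB. All names are hypothetical.

```python
def max_frames_in_flight(mbs_per_frame, trigger_index, num_frames):
    """Peak number of overlapping frames under the idealized model."""
    period = trigger_index + 1            # steps between the starts of two frames
    events = []
    for k in range(num_frames):
        start = k * period
        events.append((start, 1))                   # frame enters flight
        events.append((start + mbs_per_frame, -1))  # frame leaves flight
    # At equal times the -1 event sorts first, so a finished frame
    # leaves before a new one is counted.
    events.sort()
    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak


# Assumption: a Full HD frame has 120 * 68 = 8160 MBs.
print(max_frames_in_flight(8160, 1359, 25))   # trigger after one sixth of the frame
print(max_frames_in_flight(8160, 8159, 25))   # trigger on the last MB
```

In this model, subscribing the next frame to an MB one sixth of the way through the current frame gives at most 6 frames in flight, and subscribing it to the last MB gives 1.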

3D-Wave Implementation: Frame Priority

• Frame latency is an important factor in video decoding
• 3D-Wave interleaves the processing of all frames in flight
• Frame Priority is necessary to limit frame latency in 3D-Wave

Implementation
• splits the Task Queue (TQ) into high- and low-priority task queues
• sends the tasks of the frame next-in-line to the high-priority task queue
• checks the high-priority TQ first and executes from the low-priority TQ only if it has no tasks
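A two-level task queue along these lines could look like the sketch below. It is a single-threaded illustration only: the class and method names are hypothetical, and the promotion of already queued tasks when a frame finishes is an assumption that the slides do not describe.

```python
from collections import deque


class TwoLevelTaskQueue:
    """High- and low-priority task queues keyed by frame number (illustrative)."""
    def __init__(self):
        self.high = deque()
        self.low = deque()
        self.next_in_line = 0    # the frame whose tasks get high priority

    def submit(self, frame, task):
        # Tasks of the frame next-in-line go to the high-priority queue.
        queue = self.high if frame == self.next_in_line else self.low
        queue.append((frame, task))

    def frame_done(self):
        # Assumption: when a frame completes, the following frame becomes
        # next-in-line and its queued low-priority tasks are promoted.
        self.next_in_line += 1
        remaining = deque()
        for item in self.low:
            (self.high if item[0] == self.next_in_line else remaining).append(item)
        self.low = remaining

    def pop(self):
        # Check the high-priority queue first, fall back to the low-priority one.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


tq = TwoLevelTaskQueue()
tq.submit(1, "MB(0,0)")
tq.submit(0, "MB(5,2)")
print(tq.pop())   # prints (0, 'MB(5,2)')
```

Even though the frame-1 task was submitted first, the frame-0 task is served first, which is how the scheme keeps the oldest frame from being delayed by younger frames.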

Experimental Results

• The experiments use the NXP H.264 decoder, which is highly optimized
  • Machine-dependent optimizations (e.g. SIMD operations)
  • Machine-independent optimizations (e.g. code restructuring)
• The experiments use all 4 videos from HD-VideoBench [10].

[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

Experimental Results: Methodology

• Entropy decoding results of the entire sequence are buffered
• Sequence contains only I and P frames with one slice
• All frames are scheduled to execute at once
• Reference Frame Buffer keeps all the frames of the sequence
• Presented results are for 25 frames (1 second) of Rush_Hour Full High Definition (FHD)
• On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively.

[Figure: Speedups for Rush Hour Full HD. Horizontal axis: 1, 2, 4, 8, 16, 32, 64; vertical axis: speedup, 0 to 60. Series: 2D-Wave, 3D-Wave.]

Experimental Results: Scalability

• Efficiency of more than 80% for 64 cores
• Start-up and ramp-down times of the short sequence limit efficiency
• With 64 cores, decoding is 16x faster than real-time for FHD
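The 16x figure is consistent with the other numbers in the deck, as the short calculation below shows. It rests on two assumptions that the slides do not state explicitly: that the single-core baseline is the 8 FHD frames per second quoted for 2D-Wave, and that real time means 25 frames per second (25 frames correspond to 1 second of Rush_Hour).

```python
cores = 64
efficiency = 0.80            # lower bound quoted for 64 cores
single_core_fhd_fps = 8      # assumed baseline: 2D-Wave on a single core
real_time_fps = 25           # assumption: 25 frames correspond to 1 second

speedup = efficiency * cores             # efficiency = speedup / cores
fps = speedup * single_core_fhd_fps      # throughput on 64 cores
ratio = fps / real_time_fps              # how many times faster than real time

print(f"speedup >= {speedup:.1f}, throughput >= {fps:.1f} fps, {ratio:.1f}x real time")
```

An efficiency of 80% on 64 cores corresponds to a speedup of about 51, roughly 410 frames per second, which is about 16 times the real-time rate.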

Experimental Results: Frame Scheduling

[Figure: FHD Rush_Hour decoding on 16 cores. Different colors represent different frames.]

• Frame Scheduling limits the number of frames in flight
• Performance loss is < 5% for at most 6 frames in flight

Experimental Results: Frame Scheduling and Priority

• Frame Priority reduces frame latency to the same level as 2D-Wave (10 ms)
• The latency of the 1st frame: 58.5 ms
  • Frame Scheduling: 15.1 ms
  • Frame Scheduling and Priority: 9.2 ms
• Does not reduce performance significantly (< 1%)

[Figure: FHD Rush_Hour decoding on 16 cores]

Experimental Results: Bandwidth Requirements

[Figure: Data Traffic for 3D-Wave and 2D-Wave. Horizontal axis: 1, 2, 4, 8, 16, 32, 64; vertical axis: 0 to 1800. Series: 2D-Wave, 3D-Wave, 3D-Wave Scheduled, 3D-Wave Priority and Scheduled.]

• Bandwidth required for 64 cores is approximately 21 GB/s
• 3D-Wave is 20% more bandwidth efficient than 2D-Wave
• Scheduling and Priority reduce locality and increase bandwidth

Conclusions

• 3D-Wave scales with high efficiency to a large number of cores
• 3D-Wave allows efficient use of many-core architectures for video processing
• Frame priority reduces latency to its minimum

References

[3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008.

[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.

M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009.

A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.
