Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo 1, Cor Meenderinck 1, Ben Juurlink 1, Andrei Terechko 2, Jan Hoogerbrugge 2, Mauricio Alvarez 3, Alex Ramirez 3,4

1 - Delft University of Technology, Netherlands
2 - NXP, Netherlands
3 - Barcelona Supercomputing Center, Spain
4 - Universitat Politecnica de Catalunya, Spain

HIPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009


Page 1: Parallel H.264 Decoding on an Embedded Multicore Processor

Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo 1, Cor Meenderinck 1, Ben Juurlink 1, Andrei Terechko 2, Jan Hoogerbrugge 2, Mauricio Alvarez 3, Alex Ramirez 3,4

1 - Delft University of Technology, Netherlands
2 - NXP, Netherlands
3 - Barcelona Supercomputing Center, Spain
4 - Universitat Politecnica de Catalunya, Spain

HIPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009

Page 2: Parallel H.264 Decoding on an Embedded Multicore Processor

Outline

• Introduction
• 3D-Wave
• 3D-Wave Implementation
• Experimental Results
• Conclusions

Page 3: Parallel H.264 Decoding on an Embedded Multicore Processor

Introduction

• Industry shift to multicores
• Increasing demand for higher media quality/resolution
• Efficient and scalable exploitation of multicore architectures for video coding
• H.264 is widely used and computationally demanding
• Decoding is part of encoding and more challenging

Page 4: Parallel H.264 Decoding on an Embedded Multicore Processor

Parallel H.264 Decoding: The H.264 Decoder

The H.264 decoding process (http://www.powercam.cc/slide/1580)

[Figure: block diagram of the H.264 decoder. Blocks: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking. Annotations: Parser, Reconstructor, Data-Parallel Processing.]

Page 5: Parallel H.264 Decoding on an Embedded Multicore Processor

H.264 Parallelization

Frame-level
• Motion Compensation introduces inter-frame dependencies
• Frame-level parallelism is very limited

Slice-level
• Slice-level parallelism is uncertain and increases bitrate

[Figure: a frame divided into Slice 1, Slice 2, and Slice 3; a frame sequence labeled I0, P3, B1, B2, P9, B4, B5, P6.]

Page 6: Parallel H.264 Decoding on an Embedded Multicore Processor

H.264 Parallelization: MacroBlock-level

[Figure: the current MB and its neighboring MBs, each labeled with the dependency it imposes (Intra, DF).]

2D-Wave:
• Exploits MB-level parallelism

Page 7: Parallel H.264 Decoding on an Embedded Multicore Processor

H.264 Parallelization: MacroBlock-level

[Figure: the current MB and its neighboring MBs, each labeled with the dependency it imposes (Intra, DF).]

[Plot: number of parallel MBs (y-axis, 0 to 70) per time slot (x-axis, 1 to 241).]

2D-Wave:
• Exploits MB-level parallelism
• Full HD: up to 60 MBs in parallel
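The 60-MB peak can be reproduced with a small idealized model. The sketch below rests on assumptions that are not stated on the slide: every MB takes exactly one time slot, an MB may start once its left and upper-right neighbors are finished, and a Full HD frame is 120x68 MBs (1920x1088 pixels). Under those assumptions MB (x, y) becomes decodable in time slot x + 2y. The function name `wavefront_profile` is hypothetical.

```python
from collections import Counter

def wavefront_profile(mb_cols, mb_rows):
    """Number of MBs decodable in each time slot of an idealized 2D-Wave.

    Assumption: unit decode time per MB, so MB (x, y) runs in slot x + 2*y.
    """
    slots = Counter(x + 2 * y for y in range(mb_rows) for x in range(mb_cols))
    total_slots = mb_cols + 2 * (mb_rows - 1)
    return [slots[t] for t in range(total_slots)]

profile = wavefront_profile(120, 68)   # assumed Full HD frame size in MBs
print(len(profile), max(profile))      # prints "254 60": time slots, peak parallel MBs
```

The profile ramps up from a single MB, plateaus at the peak, and ramps down again, which is why the average parallelism of the 2D-Wave is well below 60.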

Page 8: Parallel H.264 Decoding on an Embedded Multicore Processor

H.264 Parallelization: overview of current strategies

• Frame-level: very limited parallelism
• Slice-level: uncertain parallelism, increases bitrate
• MB-level: reasonable parallelism

None of these is sufficient to leverage a many-core!

Page 9: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave

[Figure: motion compensation dependencies between frame 0 (I), frame 1 (P), and frame 2 (P).]

Page 10: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave: maximum parallelism

• For Full HD: maximum available parallelism ranges from 5000 to 9000 MBs!
• Note: this requires >200 frames in flight.

[Plot: MBs in parallel (y-axis, 0 to 10000) versus time stamp (x-axis) for the sequences Blue sky, Riverbed, Pedestrian, and Rush hour.]

Page 11: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation

3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedias
• TM3270 was designed for SD video processing
• VLIW-based media processor with SIMD support
• In-house simulator capable of simulating up to 64 cores
• 2D-Wave was already implemented

Tail submit (proposed by Hoogerbrugge, Terechko) [13]
• Checks the right and down-left MBs
• Executes one of them if ready, sends the other to the Task Queue (TQ)

[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
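The tail-submit idea can be sketched as follows. This is a minimal single-worker illustration, not the NXP implementation: the function name, the dependency counters, and the restriction to the left and upper-right dependencies (which imply the other intra-frame dependencies) are assumptions made for the example.

```python
from collections import deque

def decode_frame_tail_submit(width, height):
    """Return the MB decode order of one worker that uses tail submit.

    Assumption: MB (x, y) depends on its left and upper-right neighbors.
    """
    assert width >= 2, "sketch assumes at least two MB columns"
    # pending[y][x] = number of unfinished dependencies of MB (x, y)
    pending = [[int(x > 0) + int(y > 0 and x < width - 1)
                for x in range(width)] for y in range(height)]
    task_queue = deque([(0, 0)])
    order = []
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            x, y = mb
            order.append(mb)                 # stand-in for the real MB decode
            ready = []
            # Check the right and down-left MBs, the only ones this MB can release.
            for nx, ny in ((x + 1, y), (x - 1, y + 1)):
                if 0 <= nx < width and ny < height:
                    pending[ny][nx] -= 1
                    if pending[ny][nx] == 0:
                        ready.append((nx, ny))
            # Tail submit: continue directly with one ready MB, queue the other.
            mb = ready.pop(0) if ready else None
            task_queue.extend(ready)
    return order

order = decode_frame_tail_submit(6, 4)
print(order[:5])
```

Continuing directly with a ready MB avoids one round trip through the task queue per MB, which is the point of the technique.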

Page 12: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Reference Frame Buffer Structure

[Figure: Reference Frame Buffer Structure. Elements: Reference Frame Buffer with Frame 0 to Frame 4, one Decoder, Frame 5, a single Sync info.]

Page 13: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Reference Frame Buffer Structure

[Figure: Parallel Reference Frame Buffer Structure. Elements: Frame 0 to Frame 4, each with its own Sync info, and one Decoder.]

Page 14: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Reference Frame Buffer Structure

[Figure: Parallel Reference Frame Buffer Structure. Elements: Frame 0 to Frame 4, each with its own Sync info, and three Decoders.]

Page 15: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Inter-frame dependencies

• mb_decode checks inter-frame dependencies
• On failure, it inserts the MB in the Kick-Off List of the Ref MB

[Figure: Frame 0 and Frame 1. Kick-Off List of the Ref MB: F1;MB(1,3), NULL.]

Page 16: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Inter-frame dependencies

• Decoding process continues normally

[Figure: Frame 0 and Frame 1. Kick-Off List of the Ref MB: F1;MB(1,3), NULL.]

Page 17: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Inter-frame dependencies

• mb_decode checks Kick-Off List and submits subscribed tasks

[Figure: Frame 0 and Frame 1. F1;MB(1,3) is released from the Kick-Off List of the Ref MB, leaving NULL.]

Page 18: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Inter-frame dependencies

• And the decoding process carries on

[Figure: Frame 0 and Frame 1. Kick-Off List of the Ref MB: NULL.]
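The four steps above can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not the decoder's actual code: the class name, the `(frame, x, y)` tuples, and the restriction to one reference MB per task are hypothetical, and only the name `mb_decode` comes from the slides.

```python
from collections import defaultdict, deque

class KickOffDecoder:
    def __init__(self):
        self.decoded = set()                 # finished MBs as (frame, x, y)
        self.kick_off = defaultdict(list)    # Ref MB -> tasks subscribed to it
        self.task_queue = deque()

    def mb_decode(self, mb, ref_mb=None):
        """Decode mb, or subscribe it to ref_mb's Kick-Off List if that MB is not ready."""
        if ref_mb is not None and ref_mb not in self.decoded:
            self.kick_off[ref_mb].append((mb, ref_mb))
            return False
        self.decoded.add(mb)                 # stand-in for the real decoding work
        # Submit every task that subscribed to this MB.
        self.task_queue.extend(self.kick_off.pop(mb, []))
        return True

dec = KickOffDecoder()
waiting, ref = (1, 1, 3), (0, 2, 3)          # F1;MB(1,3) and a hypothetical Ref MB in frame 0
dec.mb_decode(waiting, ref)                  # Ref MB not decoded yet: subscribe and return
dec.mb_decode(ref)                           # Ref MB done: F1;MB(1,3) is submitted
dec.mb_decode(*dec.task_queue.popleft())     # the subscribed task now decodes
```

On a real multicore the check and the subscription would have to happen atomically with respect to the Ref MB finishing, for example under the per-frame sync info; the sketch leaves that out.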

Page 19: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Frame Scheduling

• 3D-Wave can have many frames in flight
• Practical implementation requires few frames in flight
• A policy was developed to limit the number of frames in flight
• Implementation
  • uses the Kick-Off List
  • subscribes the first MB of the next frame to a specific MB in the current frame
  • position of the MB defines number of frames in flight
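How the position of that MB bounds the number of frames in flight can be illustrated with a toy model. Everything in it is an assumption made for the example rather than something taken from the slides: each MB takes one time slot, cores are unlimited, each frame follows the 2D-Wave order (MB (x, y) in slot x + 2y), and motion-compensation dependencies never stall a frame. The function name is hypothetical.

```python
import math

def max_frames_in_flight(mb_cols, mb_rows, kick_x, kick_y):
    """Upper bound on concurrently active frames in the toy model.

    The next frame starts one slot after MB (kick_x, kick_y) of the
    current frame has been decoded.
    """
    frame_slots = mb_cols + 2 * (mb_rows - 1)   # slots needed to decode one frame
    start_offset = kick_x + 2 * kick_y + 1      # slots between consecutive frame starts
    return math.ceil(frame_slots / start_offset)

# Assumed Full HD frame of 120x68 MBs, three candidate kick-off positions.
for kick in ((119, 67), (119, 33), (0, 34)):
    print(kick, max_frames_in_flight(120, 68, *kick))
```

Subscribing to the last MB of the frame gives one frame in flight in this model, and moving the subscription point earlier in the frame raises the bound.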

Page 20: Parallel H.264 Decoding on an Embedded Multicore Processor

3D-Wave Implementation: Frame Priority

• Frame latency is an important factor in video decoding
• 3D-Wave interleaves the processing of all frames in flight
• Frame Priority is necessary to limit frame latency in 3D-Wave
• Implementation
  • splits the Task Queue (TQ) into high and low priority task queues
  • sends the tasks of the frame next-in-line to the high priority task queue
  • executes tasks from the high priority TQ if it has any, and from the low priority TQ otherwise
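A minimal sketch of the two-queue scheme follows. The class and method names are hypothetical, and the promotion step in `frame_finished` (moving already queued tasks of the new next-in-line frame to the high priority queue) is an assumption that the slides do not describe.

```python
from collections import deque

class TwoLevelTaskQueue:
    def __init__(self, next_in_line=0):
        self.high = deque()               # tasks of the frame next-in-line
        self.low = deque()                # tasks of all other frames in flight
        self.next_in_line = next_in_line

    def submit(self, frame, mb):
        queue = self.high if frame == self.next_in_line else self.low
        queue.append((frame, mb))

    def fetch(self):
        """Take a high priority task if there is one, a low priority task otherwise."""
        if self.high:
            return self.high.popleft()
        return self.low.popleft() if self.low else None

    def frame_finished(self):
        # Assumption: promote queued tasks that belong to the new next-in-line frame.
        self.next_in_line += 1
        promoted = [t for t in self.low if t[0] == self.next_in_line]
        self.low = deque(t for t in self.low if t[0] != self.next_in_line)
        self.high.extend(promoted)

tq = TwoLevelTaskQueue()
tq.submit(1, (0, 0))      # frame 1 is not next-in-line: low priority
tq.submit(0, (5, 5))      # frame 0 is next-in-line: high priority
print(tq.fetch())         # prints "(0, (5, 5))"
```

Workers only ever call `fetch`, so the frame closest to completion gets cores first while the remaining frames keep the other cores busy.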

Page 23: Parallel H.264 Decoding on an Embedded Multicore Processor

Experimental Results

• Uses the highly optimized NXP H.264 decoder
  • Machine-dependent optimizations (e.g. SIMD operations)
  • Machine-independent optimizations (e.g. code restructuring)
• The experiments use all 4 videos from the HD-VideoBench [10].

[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

Page 24: Parallel H.264 Decoding on an Embedded Multicore Processor

Experimental Results: Methodology

• Entropy Decoding results of the entire sequence are buffered
• Sequence contains only I and P frames with one slice
• All frames are scheduled to execute at once
• Reference Frame Buffer keeps all the frames of the sequence
• Presented results are for 25 frames (1 second) of Rush_Hour Full High Definition (FHD)
• On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively.

Page 25: Parallel H.264 Decoding on an Embedded Multicore Processor

Experimental Results: Scalability

[Plot: Speedups for Rush Hour Full HD. Speedup (y-axis, 0 to 60) versus number of cores (x-axis, 1 to 64) for 2D-Wave and 3D-Wave.]

• Efficiency of more than 80% for 64 cores
• Start-up and ramp-down times of short sequence limit efficiency
• 64 cores is 16x faster than real-time for FHD
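The 16x figure is consistent with the other numbers in the talk. The check below is a back-of-the-envelope sketch under two assumptions: the single-core FHD rate of 8 frames per second quoted for 2D-Wave also applies as the 3D-Wave baseline, and "more than 80%" is taken as exactly 0.80. Real time is 25 frames per second, as in the methodology. The function name is hypothetical.

```python
def realtime_factor(single_core_fps, cores, efficiency, real_time_fps):
    """How many times faster than real time the multicore decodes."""
    speedup = efficiency * cores              # efficiency = speedup / cores
    return single_core_fps * speedup / real_time_fps

# 0.80 * 64 = 51.2x speedup, 8 * 51.2 = 409.6 frames/s, 409.6 / 25 = 16.4x real time
print(round(realtime_factor(8, 64, 0.80, 25), 1))   # prints "16.4"
```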

Page 26: Parallel H.264 Decoding on an Embedded Multicore Processor

Experimental Results: Frame Scheduling

[Figure: FHD Rush_Hour decoding on 16 cores. Different colors represent different frames.]

• Frame Scheduling limits the number of frames in flight
• Performance loss is < 5% for at most 6 frames in flight

Page 27: Parallel H.264 Decoding on an Embedded Multicore Processor

Experimental Results: Frame Scheduling and Priority

[Figure: FHD Rush_Hour decoding on 16 cores.]

• Frame Priority reduces frame latency to the same as 2D-Wave (10 ms)
• The latency of the 1st frame: 58.5 ms
  • Frame Scheduling (15.1 ms)
  • Frame Scheduling and Priority (9.2 ms)
• Does not reduce performance significantly (< 1%)

Page 28: Parallel H.264 Decoding on an Embedded Multicore Processor

Experimental Results: Bandwidth Requirements

[Plot: Data Traffic for 3D-Wave and 2D-Wave. L2-L1 data traffic in MBytes (y-axis, 0 to 1800) versus number of cores (x-axis, 1 to 64) for 2D-Wave, 3D-Wave, 3D-Wave Scheduled, and 3D-Wave Priority and Scheduled.]

• Bandwidth required for 64 cores is approximately 21 GB/s
• 3D-Wave is 20% more bandwidth efficient than 2D-Wave
• Scheduling and Priority reduce locality and increase bandwidth

Page 29: Parallel H.264 Decoding on an Embedded Multicore Processor

Conclusions

• 3D-Wave scales with high efficiency to a large number of cores
• 3D-Wave allows efficient use of many-core architectures for video processing
• Frame priority reduces latency to its minimum

Page 30: Parallel H.264 Decoding on an Embedded Multicore Processor

References

[3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008.

[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.

M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009.

A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.