parallel h.264 decoding on an embedded multicore processor
Post on 05-Feb-2016
Parallel H.264 Decoding on an Embedded
Multicore Processor
Arnaldo Azevedo (1), Cor Meenderinck (1), Ben Juurlink (1), Andrei Terechko (2), Jan Hoogerbrugge (2), Mauricio Alvarez (3), Alex Ramirez (3,4)
(1) Delft University of Technology, Netherlands
(2) NXP, Netherlands
(3) Barcelona Supercomputing Center, Spain
(4) Universitat Politecnica de Catalunya, Spain
HiPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009
Outline
• Introduction
• 3D-Wave
• 3D-Wave Implementation
• Experimental Results
• Conclusions
Introduction
• Industry shift to multicores
• Increasing demand for higher media quality/resolution
• Efficient and scalable exploitation of multicore architectures for video coding
• H.264 is widely used and computationally demanding
• Decoding is part of encoding and more challenging
Parallel H.264 Decoding: The H.264 Decoder
[Figure: The H.264 decoding process (http://www.powercam.cc/slide/1580). Blocks shown: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking. Stage labels: Parser, Reconstructor, Data-Parallel Processing.]
H.264 Parallelization
• Frame-level
  • Motion Compensation introduces inter-frame dependencies
  • Frame-level parallelism is very limited
• Slice-level
  • Slice-level parallelism is uncertain and increases bitrate
[Figure: a frame divided into Slice 1, Slice 2, and Slice 3, and a frame sequence with frames labeled I0, P3, B1, B2, P9, B4, B5, P6]
H.264 Parallelization: Macroblock-level
[Figure: the current MB and the neighboring MBs it depends on for intra prediction (Intra) and the deblocking filter (DF)]
2D-Wave:
• Exploits MB-level parallelism
• Full HD: up to 60 MBs in parallel
[Figure: number of parallel MBs (y-axis, 0 to 70) per time slot (x-axis, starting at 1) for the 2D-Wave]
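The figure of 60 parallel MBs can be reproduced with a small calculation. The sketch below is not taken from the slides: it assumes that each MB must wait for its left and up-right neighbors, so MB (x, y) can run no earlier than time slot x + 2y, and that a Full HD frame is coded as 120 x 68 macroblocks (1920x1088 pixels). The function name `wave_2d_profile` is hypothetical.

```python
from collections import Counter

def wave_2d_profile(mb_width, mb_height):
    """Return {time_slot: number of MBs that can be decoded in that slot}."""
    # Assumption: MB (x, y) needs its left and up-right neighbours, so the
    # earliest slot in which it can run is x + 2*y (slots start at 0).
    return Counter(x + 2 * y
                   for y in range(mb_height)
                   for x in range(mb_width))

# Full HD coded as 1920x1088 pixels -> 120 x 68 macroblocks of 16x16 pixels.
profile = wave_2d_profile(120, 68)
max_parallel = max(profile.values())   # peak number of independent MBs
num_slots = len(profile)               # time slots needed for one frame
print(max_parallel, num_slots)         # prints 60 254
```

Under these assumptions the parallelism ramps up from 1 MB, peaks at 60, and ramps down again, which is why the average parallelism over a frame is well below the peak.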
H.264 Parallelization: overview of current strategies
• Frame-level: very limited parallelism
• Slice-level: uncertain parallelism, increases bitrate
• MB-level: reasonable parallelism
None of these is sufficient to leverage a many-core!
3D-Wave
[Figure: motion compensation dependencies between frame 0 (I), frame 1 (P), and frame 2 (P)]
3D-Wave: maximum parallelism
• For Full HD: maximum available parallelism ranges from 5000 to 9000 MBs!
• Note: this requires more than 200 frames in flight.
[Figure: MBs in parallel (y-axis, 0 to 10000) versus time stamp (x-axis) for the Blue sky, Riverbed, Pedestrian, and Rush hour sequences]
3D-Wave Implementation
• 3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors
  • TM3270 was designed for SD video processing
  • VLIW-based media processor with SIMD support
  • In-house simulator capable of simulating up to 64 cores
  • 2D-Wave was already implemented
• Tail submit (proposed by Hoogerbrugge and Terechko) [13]
  • Checks the right and down-left MBs
  • Executes one of them if it is ready and sends the other to the Task Queue (TQ)
[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
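The tail-submit idea can be illustrated with a single-worker simulation. This is a minimal sketch and not the NXP implementation: it assumes each MB depends only on its left and up-right neighbors, tracks unmet dependencies with a counter per MB, and uses hypothetical names (`decode_frame_tail_submit`, `decode_mb`).

```python
from collections import deque

def decode_frame_tail_submit(mb_width, mb_height, decode_mb):
    """Decode one frame in 2D-Wave order using tail submit (one worker)."""
    def unmet_deps(x, y):
        count = 0
        if x > 0:
            count += 1                      # left neighbour
        if y > 0 and x + 1 < mb_width:
            count += 1                      # up-right neighbour
        return count

    remaining = {(x, y): unmet_deps(x, y)
                 for y in range(mb_height) for x in range(mb_width)}
    # Start with every MB that has no unmet dependency.
    task_queue = deque(mb for mb, n in remaining.items() if n == 0)

    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            decode_mb(*mb)
            x, y = mb
            ready = []
            # Tail submit: look at the right and down-left MBs.
            for successor in ((x + 1, y), (x - 1, y + 1)):
                if successor in remaining:
                    remaining[successor] -= 1
                    if remaining[successor] == 0:
                        ready.append(successor)
            # Continue directly with one ready MB, queue the other (if any).
            mb = ready[0] if ready else None
            task_queue.extend(ready[1:])

order = []
decode_frame_tail_submit(4, 3, lambda x, y: order.append((x, y)))
```

Continuing directly with a ready successor avoids a round trip through the task queue for that MB; only the second ready MB, if there is one, is handed to the queue for another worker.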
3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: Reference Frame Buffer Structure. Frames 0 to 4 in the Reference Frame Buffer with a single Sync info block, and one Decoder working on Frame 5.]
[Figure: Parallel Reference Frame Buffer Structure. Frames 0 to 4, each with its own Sync info block, shown first with one Decoder and then with three Decoders.]
3D-Wave Implementation: Inter-frame dependencies
• mb_decode checks inter-frame dependencies. On failure, it inserts the MB in the Kick-Off List of the reference MB (Ref MB).
  [Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB holds F1;MB(1,3) followed by NULL]
• The decoding process continues normally.
• mb_decode checks the Kick-Off List and submits the subscribed tasks.
• And the decoding process carries on.
  [Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB now holds only NULL]
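The Kick-Off List mechanism described above can be sketched as follows. This is an illustrative model, not the decoder's actual code: the class name, the `(frame, x, y)` tuple layout, and the single reference MB per task are assumptions made for brevity; only the name `mb_decode` comes from the slides.

```python
from collections import defaultdict, deque

class KickOffSketch:
    """Toy model of inter-frame dependency handling with Kick-Off Lists."""

    def __init__(self):
        self.decoded = set()                # MBs decoded so far
        self.kick_off = defaultdict(list)   # ref MB -> tasks subscribed to it
        self.task_queue = deque()           # tasks ready to be retried
        self.order = []                     # decode order, for inspection

    def mb_decode(self, mb, ref_mb=None):
        # Check the inter-frame dependency; on failure, subscribe and stop.
        if ref_mb is not None and ref_mb not in self.decoded:
            self.kick_off[ref_mb].append((mb, ref_mb))
            return False
        # Dependency satisfied: "decode" the MB.
        self.decoded.add(mb)
        self.order.append(mb)
        # Check this MB's Kick-Off List and submit the subscribed tasks.
        self.task_queue.extend(self.kick_off.pop(mb, []))
        return True

dec = KickOffSketch()
ref = (0, 2, 4)          # hypothetical reference MB in frame 0
waiting = (1, 1, 3)      # F1;MB(1,3) from the slide
first_try = dec.mb_decode(waiting, ref)   # ref not decoded yet -> subscribed
dec.mb_decode(ref)                        # decoding ref kicks off the task
while dec.task_queue:
    mb, needed = dec.task_queue.popleft()
    dec.mb_decode(mb, needed)
```

The point of the design is that a blocked task does not occupy a core or poll: it is parked on the MB it waits for and resubmitted exactly when that MB completes.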
3D-Wave Implementation: Frame Scheduling
• 3D-Wave can have many frames in flight
• A practical implementation requires few frames in flight
• A policy was developed to limit the number of frames in flight
• Implementation
  • uses the Kick-Off List
  • subscribes the first MB of the next frame to a specific MB in the current frame
  • the position of that MB defines the number of frames in flight
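To see why the position of the trigger MB bounds the number of frames in flight, consider an idealized model. The slides do not give this formula; the sketch assumes unlimited cores, 2D-Wave timing inside each frame (MB (x, y) runs in slot x + 2y), and no additional stalls from motion-vector dependencies. The function name `frames_in_flight` is hypothetical.

```python
import math

def frames_in_flight(mb_width, mb_height, trigger_x, trigger_y):
    """Upper bound on concurrent frames in an idealized 2D-Wave timing model."""
    # Slots one frame occupies from its first to its last MB.
    frame_slots = mb_width + 2 * (mb_height - 1)
    # The next frame starts one slot after the trigger MB has been decoded.
    start_offset = trigger_x + 2 * trigger_y + 1
    return math.ceil(frame_slots / start_offset)

# Full HD, 120 x 68 MBs: subscribing to the last MB serializes the frames.
print(frames_in_flight(120, 68, 119, 67))   # prints 1
# Subscribing to an MB about halfway down the frame allows more overlap.
print(frames_in_flight(120, 68, 0, 33))
```

In this model, moving the trigger MB earlier in the frame shortens the start offset and therefore raises the number of frames that overlap; a real decoder with a finite number of cores would behave differently in detail.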
3D-Wave Implementation: Frame Priority
• Frame latency is an important factor in video decoding
• 3D-Wave interleaves the processing of all frames in flight
• Frame Priority is necessary to limit frame latency in 3D-Wave
• Implementation
  • splits the Task Queue (TQ) into high-priority and low-priority task queues
  • sends the tasks of the frame next-in-line to the high-priority task queue
  • executes tasks from the high-priority TQ when it has any, and from the low-priority TQ otherwise
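The two-queue policy can be sketched in a few lines. This is a minimal illustration under assumptions: the class and attribute names are hypothetical, tasks are `(frame, mb)` tuples, and how the next-in-line frame number advances is left to the caller.

```python
from collections import deque

class PriorityTaskQueues:
    """Task queue split into a high-priority and a low-priority queue."""

    def __init__(self, next_in_line=0):
        self.high = deque()
        self.low = deque()
        self.next_in_line = next_in_line    # frame to be completed next

    def submit(self, frame, mb):
        # Tasks of the next-in-line frame go to the high-priority queue.
        target = self.high if frame == self.next_in_line else self.low
        target.append((frame, mb))

    def next_task(self):
        # Serve the high-priority queue first; fall back to low priority.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None

tq = PriorityTaskQueues(next_in_line=0)
for frame, mb in [(1, "a"), (0, "b"), (2, "c"), (0, "d")]:
    tq.submit(frame, mb)
served = [tq.next_task() for _ in range(4)]
```

Here the two tasks of frame 0 are served before the tasks of frames 1 and 2, even though a frame 1 task was submitted first, which is what keeps the latency of the next-in-line frame low.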
Experimental Results
• Use the highly optimized NXP H.264 decoder
  • Machine-dependent optimizations (e.g. SIMD operations)
  • Machine-independent optimizations (e.g. code restructuring)
• The experiments use all 4 videos from the HD-VideoBench [10]
[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.
Experimental Results: Methodology
• Entropy Decoding results of the entire sequence are buffered
• Sequence contains only I and P frames with one slice
• All frames are scheduled to execute at once
• Reference Frame Buffer keeps all the frames of the sequence
• Presented results are for 25 frames (1 second) of Rush_Hour Full High Definition (FHD)
• On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively
Experimental Results: Scalability
[Figure: Speedups for Rush Hour Full HD. Speedup (y-axis, 0 to 60) versus number of cores (1, 2, 4, 8, 16, 32, 64) for 2D-Wave and 3D-Wave.]
• Efficiency of more than 80% for 64 cores
• Start-up and ramp-down times of the short sequence limit efficiency
• With 64 cores, decoding is 16x faster than real time for FHD
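The 16x figure is consistent with the other numbers in the slides, as a quick calculation shows. The sketch assumes that the single-core FHD rate of 8 frames per second from the methodology slide also applies as the 3D-Wave baseline, takes 80% as a lower bound on efficiency, and treats real time as 25 frames per second (25 frames equal 1 second of Rush_Hour).

```python
single_core_fps = 8      # FHD frames per second on one core (methodology slide)
cores = 64
efficiency = 0.80        # lower bound: "more than 80%"
real_time_fps = 25       # 25 frames correspond to 1 second of video

speedup = efficiency * cores                 # parallel speedup on 64 cores
fps_64 = single_core_fps * speedup           # decoding rate on 64 cores
times_real_time = fps_64 / real_time_fps     # margin over real time

print(round(speedup, 1), round(fps_64, 1), round(times_real_time, 1))
# prints 51.2 409.6 16.4
```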
Experimental Results: Frame Scheduling
[Figure: FHD Rush_Hour decoding on 16 cores. Different colors represent different frames.]
• Frame Scheduling limits the number of frames in flight
• Performance loss is less than 5% when at most 6 frames are in flight
Experimental Results: Frame Scheduling and Priority
[Figure: FHD Rush_Hour decoding on 16 cores]
• Frame Priority reduces frame latency to the same as 2D-Wave (10 ms)
• The latency of the 1st frame: 58.5 ms
  • Frame Scheduling (15.1 ms)
  • Frame Scheduling and Priority (9.2 ms)
• Does not reduce performance significantly (< 1%)
Experimental Results: Bandwidth Requirements
[Figure: Data Traffic for 3D-Wave and 2D-Wave. L2-L1 data traffic in MBytes (y-axis, 0 to 1800) versus number of cores (1, 2, 4, 8, 16, 32, 64) for 2D-Wave, 3D-Wave, 3D-Wave Scheduled, and 3D-Wave Priority and Scheduled.]
• Bandwidth required for 64 cores is approximately 21 GB/s
• 3D-Wave is 20% more bandwidth efficient than 2D-Wave
• Scheduling and Priority reduce locality and increase bandwidth
Conclusions
• 3D-Wave scales with high efficiency to a large number of cores
• 3D-Wave allows efficient use of many-core architectures for video processing
• Frame priority reduces latency to its minimum
References
[3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008.
[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.
[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009.
A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.