A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture
Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula
Arizona State University
CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009
Outline
• Introduction and Motivation
• Opportunities for Parallelization in H.264
• Implementation
• Performance Optimizations
• Experimental Results
• Conclusion
Motivation
• Multicore Architectures
  - Scalability: more cores = more performance
• H.264
  - Standard for video applications including High Definition (HD)
  - Computationally expensive
• Cell Broadband Engine (CBE)
  - Common and inexpensive thanks to PS3
  - Low-power, high-performance design gives a glimpse of future embedded architectures
IBM Cell Broadband Engine Architecture
• 3.2 GHz
• 9 cores, 10 threads
• >200 Gflops (single precision)
• >20 Gflops (double precision)
• Up to 25 GB/s memory bandwidth
• Up to 75 GB/s I/O bandwidth
• >300 GB/s interconnect bus
http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm
SPE: Synergistic Processor Element
SPU: Synergistic Processor Unit
SXU: SPU Core
LS: Local Storage
SMF: Synergistic Memory Flow Control
EIB: Element Interconnect Bus
PPE: PowerPC Processor Element
PPU: PowerPC Processor Unit
PXU: Power Processor Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
L1: Memory Cache Internal to the CPU
L2: Memory Cache External to the CPU
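The ">200 Gflops (single precision)" figure listed above can be sanity-checked with rough arithmetic. Assuming each of the eight SPEs completes one 4-wide single-precision fused multiply-add per cycle (a commonly quoted property of the SPU, not something stated on the slide), the SPEs alone give:

```latex
\[
8~\text{SPEs} \times 4~\text{SIMD lanes} \times 2~\text{flops per fused multiply-add} \times 3.2~\text{GHz} = 204.8~\text{Gflops}
\]
```

This is a peak number; it says nothing about what a decoder with data dependencies and local-store traffic can sustain.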
H.264 Advanced Video Coding
• H.264 is a video compression standard
  - Version 1 completed May 2003
  - ITU-T Video Coding Experts Group (H.264)
  - ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)
• Macroblock (MB) based CODEC closely related to MPEG-2
• Growing demand for HD and wireless video
  - 50% bit rate reduction over previous standard
  - Computational complexity approximately 2.4 x MPEG-2
H.264: Decoder
Reference Code: FFmpeg (H.264 Decoder)
• Open source video and audio converter
• Handles a multitude of formats
• Codecs other than H.264 decoder removed
• About 250K lines of code after paring to H.264 only
• About 200 functions ported to SPU in our implementation
http://www.ffmpeg.org/
H.264 Frame Level Relationships
• I Frame: Independently encoded
  - Intra prediction
• P Frame: Predicted from a preceding frame
  - Intra and inter prediction
• B Frame: Predicted from both preceding and following frames
  - Intra and inter prediction
H.264 Opportunities for Parallelism: GOP and Frame Level
• I, P, B frames
• Picture sequence: IBBPBBP
• Independent Groups of Pictures (GOPs)
• Independent frames within a GOP
H.264 Opportunities for Parallelism: Slice and MB Level
• Slices: Independently encoded groups of MBs within a frame
• Intra Dependencies:
Data Partitioning Scheme
Our Scheme: One row of MBs issued to each SPU
Possible Intra MB dependencies:
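The row-per-SPU scheme has to respect the intra dependencies between neighbouring macroblocks. The following is a minimal sketch in C, not the authors' code: it assumes the usual H.264 neighbour set (left, up-left, up, up-right), and the names `progress`, `mb_ready` and `wavefront_steps` are hypothetical. It checks whether a macroblock may be decoded yet and counts how many lock-step rounds a frame would take if every row had its own worker.

```c
#include <assert.h>
#include <stdbool.h>

/* progress[r] = number of macroblocks already decoded in row r.
 * This bookkeeping is an assumption; the slides do not say how the
 * decoder tracks row progress. */

/* Returns true if macroblock (row, col) may be decoded now, assuming it can
 * depend on its left, up-left, up and up-right neighbours. */
bool mb_ready(const int *progress, int width_mbs, int row, int col)
{
    assert(width_mbs > 0 && row >= 0 && col >= 0 && col < width_mbs);

    /* Left neighbour: one worker decodes the MBs of a row in order. */
    if (progress[row] != col)
        return false;

    /* The top row has no upper neighbours. */
    if (row == 0)
        return true;

    /* The up-right neighbour sits at column col + 1, clamped at the row end. */
    int needed = (col + 2 < width_mbs) ? col + 2 : width_mbs;
    return progress[row - 1] >= needed;
}

/* Counts lock-step rounds needed to decode a frame when every row has its
 * own worker and each worker decodes at most one MB per round. */
int wavefront_steps(int width_mbs, int height_mbs)
{
    enum { MAX_ROWS = 128 };
    int progress[MAX_ROWS] = {0};
    int snapshot[MAX_ROWS];
    int steps = 0;

    assert(height_mbs > 0 && height_mbs <= MAX_ROWS);

    while (progress[height_mbs - 1] < width_mbs) {
        /* Decide readiness from the state at the start of the round. */
        for (int r = 0; r < height_mbs; r++)
            snapshot[r] = progress[r];
        for (int r = 0; r < height_mbs; r++)
            if (snapshot[r] < width_mbs &&
                mb_ready(snapshot, width_mbs, r, snapshot[r]))
                progress[r]++;
        steps++;
    }
    return steps;
}
```

Under these assumptions each row trails the row above by two macroblocks, so the rows form a diagonal wavefront. With only six SPUs available the real schedule is more constrained than this idealised count.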
Functional Partitioning
CBE Architecture:
FFmpeg main MB decoding loop
Intra
Inter
Scalable Implementation
FFmpeg Data Structure Modification
• Single threaded code: monolithic data structure
• Entire structure needed to decode single MB but majority is static from one MB to the next
• SPU only requires applicable subset for one row of MBs
• Only MB specific data replicated in SPU LS
Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
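The split described above can be illustrated with a small C sketch. Everything here is an assumption made for illustration: the struct names, the fields and their sizes are hypothetical and do not mirror FFmpeg's actual decoder context. The point is only that the per-SPU footprint scales with W (one row) rather than with the whole frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fields and sizes, chosen only to illustrate the split. */

/* Data that is static from one MB to the next: fetched once, not per MB. */
typedef struct {
    int width_mbs;                  /* W: frame width in macroblocks */
    int height_mbs;
    int qp_base;
    uint8_t scaling_matrix[6][16];
} FrameStatic;

/* Data that is specific to one macroblock. */
typedef struct {
    int16_t motion_vectors[16][2];
    int16_t coefficients[16 * 24];  /* 16 luma + 8 chroma 4x4 blocks */
    uint8_t intra_pred_modes[16];
    uint8_t mb_type;
} MbData;

/* Footprint if the whole frame's MB data lived in one structure. */
size_t monolithic_bytes(int width_mbs, int height_mbs)
{
    assert(width_mbs > 0 && height_mbs > 0);
    return sizeof(FrameStatic) +
           (size_t)width_mbs * (size_t)height_mbs * sizeof(MbData);
}

/* Footprint of what one SPU needs for a single row of W macroblocks. */
size_t row_slice_bytes(int width_mbs)
{
    assert(width_mbs > 0);
    return sizeof(FrameStatic) + (size_t)width_mbs * sizeof(MbData);
}
```

For a 1080p frame (120 x 68 macroblocks), `row_slice_bytes(120)` with these made-up fields stays well under the 256 KB local store of an SPE, whereas `monolithic_bytes(120, 68)` does not.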
SPU LS(Local Store) Limitations
Code Overlay
Code segment contains one or more functions
Memory region assigned one or more segments
At run time, region contains exactly one segment
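The three rules above can be modelled in a few lines of C. This is a toy model under stated assumptions, not the Cell SDK's overlay mechanism: `OverlayState`, `overlay_init` and `overlay_touch` are hypothetical names, and the assignment that stands in for the copy into local store is a placeholder.

```c
#include <assert.h>

enum { MAX_REGIONS = 8, NO_SEGMENT = -1 };

typedef struct {
    int resident[MAX_REGIONS];  /* segment currently loaded in each region */
    int loads;                  /* number of segment loads performed */
} OverlayState;

/* Starts with every region empty. */
void overlay_init(OverlayState *st)
{
    for (int i = 0; i < MAX_REGIONS; i++)
        st->resident[i] = NO_SEGMENT;
    st->loads = 0;
}

/* Makes `segment` resident in `region` before a function in it is called.
 * Returns 1 if the segment had to be loaded (evicting whatever was there),
 * 0 if it was already resident. */
int overlay_touch(OverlayState *st, int region, int segment)
{
    assert(region >= 0 && region < MAX_REGIONS && segment >= 0);
    if (st->resident[region] == segment)
        return 0;
    st->resident[region] = segment;  /* stands in for the copy into LS */
    st->loads++;
    return 1;
}
```

Calling `overlay_touch` once per function call in a recorded trace gives the number of segment loads a candidate region assignment would cause, which is the cost an overlay scheme tries to keep low.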
Designing an Overlay Scheme
Start with one flat region
1. Identify key functions and assign to new regions
  - Profiling indicates f21() is most important with 50 calls
  - However, f11() is present 80 times in the trace
  - f11() is a key function
2. Create new regions based on profiling data until memory is exhausted
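One plausible reading of the f21()/f11() example is that a function is "present" in the trace every time control enters it, whether by a call or by a return from one of its callees, so a caller can outrank a more frequently called leaf. The C sketch below counts both quantities under that assumption. The event encoding and the names `TraceStats` and `analyze_trace` are hypothetical, and the slides do not confirm that this is how the authors' trace was built.

```c
#include <assert.h>

enum { MAX_FUNCS = 32, MAX_DEPTH = 64, RET = -1 };

typedef struct {
    int calls[MAX_FUNCS];    /* times each function was called */
    int present[MAX_FUNCS];  /* times control entered it, by call or return */
} TraceStats;

/* events[i] is a function id (a call) or RET (return to the caller). */
void analyze_trace(const int *events, int n, TraceStats *out)
{
    int stack[MAX_DEPTH];
    int depth = 0;

    for (int f = 0; f < MAX_FUNCS; f++)
        out->calls[f] = out->present[f] = 0;

    for (int i = 0; i < n; i++) {
        if (events[i] == RET) {
            assert(depth > 0);
            depth--;
            /* Control re-enters the caller, so its code must be resident. */
            if (depth > 0)
                out->present[stack[depth - 1]]++;
        } else {
            int f = events[i];
            assert(f >= 0 && f < MAX_FUNCS && depth < MAX_DEPTH);
            out->calls[f]++;
            out->present[f]++;
            stack[depth++] = f;
        }
    }
}
```

A function that is called once but invokes two callees shows up three times in `present`, which is the kind of gap between call count and trace presence the slide points at.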
Designing an Overlay Scheme
Overlay Performance
Additional Performance Optimizations
Experimental Results
• Source videos taken from Microsoft’s WMV HD demonstration page [13]
• The source videos were transcoded into H.264 1920x1080 (1080p) format
  - 5 different bitrates: 2.5, 4, 8, 12, 16 Mbps
  - CAVLC and CABAC
  - Using the x264 H.264 encoder integrated into ffmpeg
• The videos were encoded using the x264 presets: baseline, normal, and hq
• Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell processor (limited by Sony to six of the CBE’s eight SPUs), running Fedora 9 Linux
[13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx
Figure 14: Breakdown of decoder performance by component using a single SPU.
• Motion vector decoding and deblocking are the most expensive components
• The white band at the bottom is the PPU (entropy decoder) contribution
Decoder Performance
[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.
Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.
• Our implementation achieved a “best case” average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC encoded video streams on six SPUs
• And a “worst case” entropy-decoder-limited average frame rate of 15.43 fps on 16 Mbps hq CABAC encoded video streams
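To put the two frame rates in terms of time available per frame, the average decode time is simply the reciprocal of the frame rate. The arithmetic below uses only the numbers quoted above:

```latex
\[
t_{\text{best}} = \frac{1000~\text{ms}}{34.94} \approx 28.6~\text{ms per frame},
\qquad
t_{\text{worst}} = \frac{1000~\text{ms}}{15.43} \approx 64.8~\text{ms per frame}
\]
```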
Conclusion
Demonstrated scalable H.264 decoder for multicore processor
23% frame rate advantage over prior work [4] on similar videos and using same number of cores
Careful engineering required to efficiently manage data structures and scratchpad memory