1
Modular Refinement of H.264
Kermin Fleming
2
What is H.264?
• Mobile devices
• Low bit-rate
• Video decoder
  – Follow-on to MPEG-2 and H.26x
• Operates on pixel blocks
  – Smaller blocks: 4x4, 8x4, 4x8
• In-loop deblocking filter
• Baseline profile Bluespec implementation
  – Works on FPGA!
3
H.264 Overview
4
H.264 Modules
• NAL unwrap
  – Unwraps network packets
  – Byte stream separated by special tags
• Entropy Decoder
  – Decodes various slices, parameters
  – Primarily Golomb encoded
  – Residual data uses CAVLC
• Inverse Transform
  – Reconstructs whole blocks
  – Quantized frequency coefficients
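The NAL unwrap and entropy decoder bullets can be illustrated with a short Python sketch. It assumes that the "special tags" are the three-byte start code 0x000001 of the H.264 byte-stream format and that "Golomb encoded" refers to the unsigned Exp-Golomb code. The names `split_nal_units` and `decode_ue` are made up for illustration, emulation-prevention bytes are ignored, and bits are passed as a text string for readability.

```python
START_CODE = b"\x00\x00\x01"

def split_nal_units(stream: bytes) -> list:
    """Split a byte stream into NAL units at each 0x000001 start code."""
    units = []
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = stream.find(START_CODE, start)
        end = len(stream) if nxt == -1 else nxt
        # Dropping trailing zero bytes also handles the 4-byte 0x00000001 form.
        unit = stream[start:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
        pos = nxt
    return units

def decode_ue(bits: str, pos: int = 0):
    """Decode one unsigned Exp-Golomb value from a string of '0'/'1'.

    Returns (value, position of the next unread bit).
    """
    zeros = 0
    while bits[pos + zeros] == "0":      # count the leading zeros
        zeros += 1
    start = pos + zeros + 1              # skip the terminating '1'
    suffix = bits[start:start + zeros]   # then read `zeros` more bits
    value = (1 << zeros) - 1 + (int(suffix, 2) if zeros else 0)
    return value, start + zeros
```

For example, the bit string `00100` decodes to 3: two leading zeros, the marker bit, then the two-bit suffix `00`.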
5
H.264 Modules
• Intra-prediction
  – Prediction based on previously decoded blocks
  – Corrected by residual
• Inter-prediction
  – Correlation between frames
  – Motion vectors
• Deblocking filter
  – Removes prediction artifacts
• Frame Buffer
  – Maintains cache of previous frames
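As a concrete instance of "prediction corrected by residual", here is a minimal Python sketch of DC intra-prediction for one 4x4 block. It assumes both the above and left neighbours are available, and the function names are hypothetical. The prediction is the rounded mean of the eight neighbouring pixels, and the decoded residual is then added and clipped to the 8-bit range.

```python
def clip8(x):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, x))

def predict_dc_4x4(above, left):
    """DC prediction: every pixel is the rounded mean of the 8 neighbours."""
    dc = (sum(above) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]

def reconstruct(pred, residual):
    """Add the residual to the prediction, pixel by pixel."""
    return [[clip8(p + r) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(pred, residual)]

# Neighbours already decoded earlier in the frame (illustrative values).
pred = predict_dc_4x4(above=[100, 100, 100, 100], left=[104, 104, 104, 104])
block = reconstruct(pred, [[1, -2, 0, 300]] * 4)
```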
6
Modular Refinement
• Latency-insensitive design
  – Data centric
  – Swap functionally equivalent modules
  – Design exploration easy
• Bluespec generates control
  – Design timing change?
  – No problem.
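The idea behind swapping functionally equivalent modules can be sketched in Python (not Bluespec). The `DelayStage` class and `simulate` function below are hypothetical: a stage reads from an input FIFO whenever data is present and writes each result a fixed number of cycles later. Because the surrounding logic only reacts to data availability, changing the latency changes the cycle count but not the output stream.

```python
from collections import deque

class DelayStage:
    """Applies fn to each token and releases it `latency` cycles later."""
    def __init__(self, fn, latency):
        self.fn = fn
        self.latency = latency
        self.in_flight = deque()  # entries are (ready_cycle, value)

    def step(self, cycle, in_q, out_q):
        # Release a finished token, if there is one.
        if self.in_flight and self.in_flight[0][0] <= cycle:
            out_q.append(self.in_flight.popleft()[1])
        # Accept a new token whenever input data is available.
        if in_q:
            self.in_flight.append((cycle + self.latency, self.fn(in_q.popleft())))

def simulate(stage, tokens):
    """Run until all tokens have left the stage; return (outputs, cycles)."""
    in_q, out_q = deque(tokens), deque()
    cycle = 0
    while in_q or stage.in_flight:
        stage.step(cycle, in_q, out_q)
        cycle += 1
    return list(out_q), cycle

fast_out, fast_cycles = simulate(DelayStage(lambda x: 2 * x, latency=1), [1, 2, 3])
slow_out, slow_cycles = simulate(DelayStage(lambda x: 2 * x, latency=4), [1, 2, 3])
```

Both runs produce the same outputs. Only the number of cycles differs, which is what makes design exploration cheap.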
7
Deblocking Filter Details
• Block prediction leaves artifacts
• Apply a smoothing filter across macroblock boundaries
• Highly configurable
[Figure: Macroblock Filter Order]
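To make the smoothing step concrete, here is a Python sketch modelled on the normal-strength luma edge filter of H.264. It is not the presenter's Bluespec code. In the standard, the thresholds `alpha` and `beta` and the clipping bound `tc` come from tables indexed by the quantization parameter; here they are plain arguments with illustrative values, which is one face of "highly configurable".

```python
def clip(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))

def filter_edge(p1, p0, q0, q1, alpha, beta, tc):
    """Smooth the two pixels next to an edge: ... p1 p0 | q0 q1 ...

    Returns the new (p0, q0). The pixels are left untouched when the step
    across the edge is large enough to be real image content.
    """
    is_artifact = (abs(p0 - q0) < alpha
                   and abs(p1 - p0) < beta
                   and abs(q1 - q0) < beta)
    if not is_artifact:
        return p0, q0
    # Correction term, limited to +/- tc so the filter cannot over-smooth.
    delta = clip(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip(0, 255, p0 + delta), clip(0, 255, q0 - delta)

smoothed = filter_edge(60, 60, 80, 80, alpha=40, beta=10, tc=4)
untouched = filter_edge(60, 60, 80, 80, alpha=10, beta=10, tc=4)
```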
8
Original Implementation
• Store the whole macroblock
• Iteratively filter the macroblock
• Store and stream left macroblock
• Simple to reason about: very like software
• BAD!!!!
  – Highly sequential
  – Large storage requirements
  – Wiring:
[Figure: original deblocking datapath. Labels: Left Macroblock (64x32); Current Macroblock (64x32); Above Macroblock (16x32); four Filter units; Above Block Data to External Storage; Above Block Data from External Storage; Prediction Input; Deblocked Output]
9
Pipelining
• Sequential execution was a problem
• Unclear how to pipeline design
  – Data stored in row major
  – Can be rotated to column major
• 16-stage pipeline
  – Horizontal Filter
  – Row-to-Column
  – Vertical Filter
  – Column-to-Row
[Figure: pipelined deblocking datapath. Labels: Left Macroblock Memory (16x32); Current Macroblock Memory (8x32); Horizontal Filter; Rotation (Row to Column Major); Vertical Filter; Inverse Rotation (Column to Row Major); Current Macroblock Memory (16x32); Above Block Data from External Storage; Above Block Data to External Storage; Deblocked Output; Prediction Input]
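The role of the two rotation stages can be shown with a small Python sketch on a single 4x4 block. The smoothing function is a hypothetical stand-in, not the real deblocking filter. A filter that only works along rows can also process columns if the block is rotated to column-major order first and rotated back afterwards.

```python
def transpose(block):
    """Rotate a block between row-major and column-major order."""
    return [list(col) for col in zip(*block)]

def smooth_rows(block):
    """Stand-in filter: average the two samples either side of the
    vertical edge in the middle of each row."""
    out = []
    for a, b, c, d in block:
        m = (b + c + 1) >> 1
        out.append([a, m, m, d])
    return out

def smooth_columns_direct(block):
    """Reference version: the same smoothing applied down each column."""
    out = [row[:] for row in block]
    for x in range(4):
        m = (block[1][x] + block[2][x] + 1) >> 1
        out[1][x] = out[2][x] = m
    return out

def smooth_columns_via_rotation(block):
    """Rotate, reuse the row filter, rotate back."""
    return transpose(smooth_rows(transpose(block)))

block = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
```

Calling `smooth_columns_via_rotation(block)` gives the same result as `smooth_columns_direct(block)`, so the same filter logic serves both stages.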
10
Pipelining
• Parallelism improved
  – Two filtrations per cycle
• Memory reduced
  – 5/8 of macroblock stored
  – Accesses simplified
• Fewer filters
  – Only need one…
• Design now far more complex
  – 2x code size
11
Pipeline Issues
• Throughput improved, but not perfect
• Structural hazards
  – Loads and stores to the Above memory
  – Third and fourth macroblocks conflict
    • Both need to be rotated at the same time
  – Outputting left blocks
• Pipeline drain
  – Control data shared
  – Pipeline control state
12
Relaxed Memory Ordering
• Original sequential ordering too conservative
• Above data is not immediately used
  – Allowing stores to bypass loads
  – Separate load and store request queues
• Stalls eliminated
  – Design complexity stays the same
  – Artificial dependency removed
[Figure: Single Ported Memory. Labels: Store Requests; Store Responses; Load Requests]
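A toy Python model shows why separate queues remove stalls. Everything here is an assumption for illustration: the single port serves one request per cycle, load responses wait in a one-entry buffer until a consumer takes them, and stores never touch an address that a pending load is about to read (the "above data is not immediately used" condition). With one in-order queue, a blocked load holds up the stores behind it. With split queues, the stores go ahead.

```python
from collections import deque

def cycles_to_issue(requests, split_queues, consumer_ready_at, resp_capacity=1):
    """Return the number of cycles needed to issue every request."""
    single = deque(requests)
    loads = deque(r for r in requests if r == "load")
    stores = deque(r for r in requests if r == "store")
    responses = 0  # load responses waiting for the consumer
    cycle = 0

    def pending():
        return bool(loads or stores) if split_queues else bool(single)

    while pending():
        # The consumer drains one response per cycle once it is ready.
        if cycle >= consumer_ready_at and responses:
            responses -= 1
        load_ok = responses < resp_capacity
        if split_queues:
            # One port: serve a load if possible, otherwise let a store bypass.
            if loads and load_ok:
                loads.popleft()
                responses += 1
            elif stores:
                stores.popleft()
        else:
            # Strict ordering: only the head of the queue may issue.
            head = single[0]
            if head == "store" or load_ok:
                single.popleft()
                responses += 1 if head == "load" else 0
        cycle += 1
    return cycle

reqs = ["load", "load", "store", "store"]
in_order = cycles_to_issue(reqs, split_queues=False, consumer_ready_at=3)
relaxed = cycles_to_issue(reqs, split_queues=True, consumer_ready_at=3)
```

In this scenario the in-order queue needs 6 cycles and the split queues need 4, because the two stores fill the cycles in which the second load is blocked.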
13
Side Buffering
• Frequent conflicts between 4x4 blocks
• Store one of them in a side buffer
• When the resource is available, release the stored data
  – Sometimes ordering matters, sometimes not
  – Memory acts as a reorder buffer
• Encode priority in rule
• Deadlock can be a problem…
[Figure: side buffer datapath. Labels: Filter Q Data; Filter P Data; To Output; Row to Column Rotation; To Current Store; Processing Left Block]
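The side-buffer idea can be sketched in Python as a tiny arbiter. The function `arbitrate` and its priority choice are hypothetical: when two blocks want the shared unit in the same cycle, one proceeds and the other is parked, and parked data is released ahead of new arrivals. The buffer is unbounded here, so the sketch sidesteps the deadlock risk noted above; a fixed-size buffer would also need back-pressure on the producers.

```python
from collections import deque

def arbitrate(arrivals):
    """arrivals[c] lists the blocks that want the shared unit in cycle c.

    Returns the order in which blocks enter the unit (one per cycle).
    """
    side = deque()  # side buffer for blocks that lost arbitration
    order = []
    cycle = 0
    while cycle < len(arrivals) or side:
        incoming = list(arrivals[cycle]) if cycle < len(arrivals) else []
        if side:
            # Priority: buffered data goes before newly arrived data.
            order.append(side.popleft())
            side.extend(incoming)
        elif incoming:
            order.append(incoming[0])
            side.extend(incoming[1:])
        cycle += 1
    return order

order = arbitrate([["A", "B"], ["C"]])
```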
14
Other Refinement
• Pipelined Interpredict rules
  – Chroma interpolation
• Improved Interpolator filter implementation
• Improved memory subsystem
  – Previously too general
  – Needless crossbar
[Figure: Interpolation Sampling]
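For readers unfamiliar with chroma interpolation, the following Python sketch shows the bilinear blend that H.264 uses for fractional chroma positions. The function name and argument layout are illustrative only, and this says nothing about how the pipelined rules are organised. The motion vector's fractional part is given in eighths of a sample, and the four surrounding samples are weighted accordingly.

```python
def chroma_sample(a, b, c, d, x_frac, y_frac):
    """Interpolate one chroma sample from its four integer-position neighbours.

    a = top-left, b = top-right, c = bottom-left, d = bottom-right.
    x_frac and y_frac are the fractional offsets in eighths (0..7).
    """
    return ((8 - x_frac) * (8 - y_frac) * a
            + x_frac * (8 - y_frac) * b
            + (8 - x_frac) * y_frac * c
            + x_frac * y_frac * d
            + 32) >> 6  # weights sum to 64; +32 rounds to nearest

halfway = chroma_sample(0, 64, 0, 64, x_frac=4, y_frac=0)
```

With zero offsets the result is simply the top-left sample, and a flat region stays flat for any offset.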
15
Results
16
Results
• Nearly 60 fps at 1080p
• Power, area, and throughput improvements
• Fast deblocking filter implementation
  – Faster than any known implementation
  – Does it really matter?
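To put the 60 fps figure in perspective, a quick back-of-the-envelope calculation in Python gives the macroblock rate it implies. The frame height is rounded up to a whole number of 16-pixel macroblock rows. The 100 MHz clock in the example is purely a hypothetical number, since the slides do not state a clock frequency.

```python
def macroblocks_per_frame(width, height, mb=16):
    """Number of 16x16 macroblocks, rounding each dimension up."""
    cols = -(-width // mb)   # ceiling division
    rows = -(-height // mb)
    return cols * rows

def cycles_per_macroblock(clock_hz, fps, width, height):
    """Cycle budget per macroblock at a given clock and frame rate."""
    return clock_hz / (fps * macroblocks_per_frame(width, height))

mbs = macroblocks_per_frame(1920, 1080)    # 120 columns x 68 rows
rate = 60 * mbs                            # macroblocks per second at 60 fps
budget = cycles_per_macroblock(100_000_000, 60, 1920, 1080)  # assumed clock
```

A 1080p frame holds 8160 macroblocks, so 60 fps means 489,600 macroblocks per second; at the assumed 100 MHz that leaves roughly 204 cycles for each one.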
17
Questions?