

Modular Refinement of H.264

Kermin Fleming


What is H.264?

• Mobile Devices
• Low bit-rate
• Video Decoder
  – Follow-on to MPEG-2 and H.26x
• Operates on pixel blocks
  – Smaller blocks: 4x4, 8x4, 4x8
• In-loop deblocking filter
• Base profile Bluespec implementation
  – Works on FPGA!


H.264 Overview


H.264 Modules
• NAL unwrap
  – Unwraps network packets
  – Byte stream separated by special tags
• Entropy Decoder
  – Decodes various slices, parameters
  – Primarily Exp-Golomb encoded
  – Residual data uses CAVLC
• Inverse Transform
  – Reconstructs whole blocks
  – Quantized frequency coefficients
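The entropy decoder bullet mentions Golomb coding. As a rough illustration, here is a minimal Python sketch of decoding unsigned Exp-Golomb codes, the variable-length code H.264 uses for many header syntax elements. The bit string input and the function names `read_ue` and `read_all_ue` are assumptions made for this example; they are not taken from the Bluespec design.

```python
def read_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb code from a string of '0'/'1' characters.

    Returns (value, next_position). Hypothetical helper for illustration.
    """
    # Count leading zeros up to the first '1' marker bit.
    leading_zeros = 0
    while pos + leading_zeros < len(bits) and bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    start = pos + leading_zeros + 1   # first bit after the '1' marker
    end = start + leading_zeros       # the suffix has as many bits as the zero run
    if end > len(bits):
        raise ValueError("truncated Exp-Golomb code")
    suffix = bits[start:end]
    # value = 2^leading_zeros - 1 + suffix (suffix read as a binary number)
    value = (1 << leading_zeros) - 1 + (int(suffix, 2) if suffix else 0)
    return value, end


def read_all_ue(bits):
    """Decode a back-to-back sequence of unsigned Exp-Golomb codes."""
    values, pos = [], 0
    while pos < len(bits):
        value, pos = read_ue(bits, pos)
        values.append(value)
    return values
```

For example, `read_all_ue("1" + "010" + "011" + "00100")` decodes four consecutive code words. A hardware decoder would work on a shifting bit buffer rather than a string, but the code-word structure is the same.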


H.264 Modules
• Intra-prediction
  – Prediction based on previously decoded blocks
  – Corrected by residual
• Inter-prediction
  – Correlation between frames
  – Motion vectors
• Deblocking filter
  – Removes prediction artifacts
• Frame Buffer
  – Maintains cache of previous frames
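To make "prediction corrected by residual" concrete, the following Python sketch builds a 4x4 DC intra prediction from the neighbouring samples above and to the left, then adds the residual and clips to the 8-bit range. Only the DC mode with both neighbours available is shown, and the function names are hypothetical; the real decoder supports several prediction modes and availability cases.

```python
def dc_predict_4x4(above, left):
    """DC prediction: every sample is the rounded mean of the 8 neighbours."""
    dc = (sum(above) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def reconstruct(pred, residual):
    """Add the decoded residual to the prediction and clip to 0..255."""
    return [
        [max(0, min(255, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(pred, residual)
    ]
```

The same `reconstruct` step applies to inter-prediction; only the source of `pred` changes (a motion-compensated block from a previous frame instead of neighbouring samples).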


Modular Refinement

• Latency-insensitive design
  – Data centric
  – Swap functionally equivalent modules
  – Design exploration easy
• Bluespec generates control
  – Design timing change?
  – No problem.
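The idea of swapping functionally equivalent modules can be sketched in software. The Python model below is an assumption-laden stand-in for Bluespec's guarded rules: modules talk only through bounded FIFOs, and a module acts only when its input has data and its output has room. `Fifo`, `FastDoubler`, `SlowDoubler`, and `run` are hypothetical names invented for this example.

```python
from collections import deque


class Fifo:
    """Bounded FIFO; the can_* methods play the role of rule guards."""

    def __init__(self, depth=2):
        self.items, self.depth = deque(), depth

    def can_enq(self):
        return len(self.items) < self.depth

    def can_deq(self):
        return len(self.items) > 0

    def enq(self, x):
        self.items.append(x)

    def deq(self):
        return self.items.popleft()


class FastDoubler:
    """Produces one result per cycle."""

    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def step(self):
        if self.inp.can_deq() and self.out.can_enq():
            self.out.enq(2 * self.inp.deq())


class SlowDoubler:
    """Same function, but each item occupies the module for several cycles."""

    def __init__(self, inp, out, latency=3):
        self.inp, self.out, self.latency = inp, out, latency
        self.value, self.remaining = None, 0

    def step(self):
        if self.value is None:
            if self.inp.can_deq():
                self.value, self.remaining = self.inp.deq(), self.latency
            return
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0 and self.out.can_enq():
            self.out.enq(2 * self.value)
            self.value = None


def run(module_cls, data):
    """Feed data through one module; return (results, cycles taken)."""
    inp, out = Fifo(), Fifo()
    module = module_cls(inp, out)
    pending, results, cycles = deque(data), [], 0
    while len(results) < len(data):
        if pending and inp.can_enq():
            inp.enq(pending.popleft())
        module.step()
        if out.can_deq():
            results.append(out.deq())
        cycles += 1
    return results, cycles
```

Running `run(FastDoubler, data)` and `run(SlowDoubler, data)` yields the same results in a different number of cycles. The surrounding harness needs no change when the module's timing changes, which is the property the slide attributes to latency-insensitive design.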


Deblocking Filter Details
• Block prediction leaves artifacts
• Apply a smoothing filter across macroblock boundaries
• Highly configurable

[Figure: Macroblock Filter Order]


Original Implementation
• Store the whole macroblock
• Iteratively filter the macroblock
• Store and stream left macroblock
• Simple to reason about, much like software
• BAD!!!!
  – Highly sequential
  – Large storage requirements
  – Wiring:

[Figure: original datapath. Left Macroblock (64x32), Current Macroblock (64x32), and Above Macroblock (16x32) memories connected to four Filter units; Prediction Input, Deblocked Output, and Above Block Data to/from External Storage.]


Pipelining
• Sequential execution was a problem
• Unclear how to pipeline design
  – Data stored in row major
  – Can be rotated to column major
• 16-stage pipeline
  – Horizontal Filter
  – Row-to-Column
  – Vertical Filter
  – Column-to-Row

[Figure: pipelined datapath. Prediction Input and Left Macroblock Memory (16x32) feed Current Macroblock Memory (8x32), then Horizontal Filter, Rotation (Row to Column Major), Vertical Filter, Inverse Rotation (Column to Row Major), and Current Macroblock Memory (16x32), producing Deblocked Output; Above Block Data goes to and from External Storage.]
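The four pipeline stages can be sketched in Python to show why the rotation helps: once the block is transposed, the same row-wise filter handles the other edge direction. The filter here is a deliberately simplified smoothing step and not the H.264 deblocking filter, and all function names are hypothetical.

```python
def filter_row_edge(row, edge):
    """Pull the two samples on either side of `edge` toward each other."""
    out = list(row)
    p0, q0 = row[edge - 1], row[edge]
    delta = (q0 - p0) // 4          # toy smoothing strength, an assumption
    out[edge - 1] = p0 + delta
    out[edge] = q0 - delta
    return out


def horizontal_filter(block, edge):
    """Apply the 1-D filter along every row of a row-major block."""
    return [filter_row_edge(row, edge) for row in block]


def transpose(block):
    """Row-major to column-major rotation (and back again)."""
    return [list(col) for col in zip(*block)]


def deblock(block, edge):
    """Horizontal filter, rotate, same filter again, rotate back."""
    stage1 = horizontal_filter(block, edge)
    stage2 = transpose(stage1)
    stage3 = horizontal_filter(stage2, edge)   # acts as the vertical filter
    return transpose(stage3)


def vertical_filter_direct(block, edge):
    """Reference: filter down the columns without rotating."""
    out = [list(row) for row in block]
    for c in range(len(block[0])):
        p0, q0 = block[edge - 1][c], block[edge][c]
        delta = (q0 - p0) // 4
        out[edge - 1][c] = p0 + delta
        out[edge][c] = q0 - delta
    return out
```

`deblock(block, edge)` gives the same result as `vertical_filter_direct(horizontal_filter(block, edge), edge)`, so the rotation stages trade extra data movement for a single, always row-oriented filter datapath.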


Pipelining
• Parallelism Improved
  – Two filter operations per cycle
• Memory Reduced
  – 5/8 of macroblock stored
  – Accesses simplified
• Fewer Filters
  – Only need one…
• Design now far more complex
  – 2x code size

[Figure: pipelined deblocking filter datapath, repeated from the previous slide.]


Pipeline Issues
• Throughput improved, but not perfect
• Structural Hazards
  – Loads and Stores to the Above memory
  – Third and Fourth Macroblocks conflict
    • Both need to be rotated at the same time
  – Outputting Left Blocks
• Pipeline drain
  – Control data shared
  – Pipeline control state


Relaxed Memory Ordering
• Original Sequential Ordering too conservative
• Above data is not immediately used
  – Allowing stores to bypass loads
  – Separate load and store request queues
• Stalls eliminated
  – Design complexity stays the same
  – Artificial dependency removed

[Figure: Single Ported Memory with Store Requests, Store Responses, and Load Requests.]
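The effect of separate load and store request queues can be illustrated with a small cycle-level model. This Python sketch rests on several assumptions that are not in the slides: a load may issue only while its response buffer has room, a consumer drains one response every few cycles, and loads and stores never touch the same address (mirroring the point that Above data is not used immediately). `run_memory` and its parameters are hypothetical.

```python
from collections import deque


def run_memory(requests, initial, split_queues, resp_depth=1, consume_every=3):
    """Simulate a single-ported memory; return (loaded, memory, last_store_cycle)."""
    memory = dict(initial)
    load_addrs = {r[1] for r in requests if r[0] == "load"}
    store_addrs = {r[1] for r in requests if r[0] == "store"}
    # Assumption of this sketch: no true dependency between loads and stores.
    assert not (load_addrs & store_addrs)

    if split_queues:
        queues = [deque(r for r in requests if r[0] == "load"),
                  deque(r for r in requests if r[0] == "store")]
    else:
        queues = [deque(requests)]          # one strictly ordered queue

    responses, loaded = deque(), []
    cycle, last_store_cycle = 0, 0
    while any(queues) or responses:
        cycle += 1
        # A slow consumer frees one response slot every few cycles.
        if cycle % consume_every == 0 and responses:
            loaded.append(responses.popleft())
        # Single port: at most one request issues per cycle.
        for q in queues:
            if not q:
                continue
            op = q[0]
            if op[0] == "load":
                if len(responses) >= resp_depth:
                    continue                # head load is blocked
                responses.append(memory[op[1]])
            else:
                memory[op[1]] = op[2]
                last_store_cycle = cycle
            q.popleft()
            break
    return loaded, memory, last_store_cycle
```

With an interleaved stream of loads and stores, both configurations return the same loaded values and final memory contents, but with `split_queues=True` the stores finish earlier because they no longer wait behind a blocked load. That ordering constraint is the artificial dependency the slide refers to.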


Side Buffering
• Frequent conflicts between 4x4 blocks
• Store one of them in a side buffer
• When the resource is available, release the stored data
  – Sometimes ordering matters, sometimes not
  – Memory acts as a reorder buffer
• Encode priority in rule
• Deadlock can be a problem…

[Figure: Filter Q Data and Filter P Data paths leading To Output, Row to Column Rotation, and To Current Store; Processing Left Block.]
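A rough software analogue of side buffering: two streams compete for a resource that accepts one block per cycle, the loser of a conflict is parked in a side buffer, and buffered data is released first once the resource is free. The fixed priority (buffered data, then P, then Q), the unbounded buffer, and the name `schedule` are assumptions made for this Python sketch; the real design expresses priority through Bluespec rules and has to bound the buffer, which is where deadlock risk comes from.

```python
from collections import deque


def schedule(p_stream, q_stream):
    """Return the order in which blocks reach the shared resource.

    p_stream and q_stream list, per cycle, a block label or None.
    """
    side = deque()      # side buffer for blocks that lost arbitration
    order = []
    for p, q in zip(p_stream, q_stream):
        arrivals = [b for b in (p, q) if b is not None]
        if side:
            # Buffered data has priority; new arrivals wait behind it.
            order.append(side.popleft())
            side.extend(arrivals)
        elif arrivals:
            # P wins a direct conflict; Q is parked in the side buffer.
            order.append(arrivals[0])
            side.extend(arrivals[1:])
    order.extend(side)  # drain whatever is still buffered
    return order
```

For instance, `schedule(["P0", None, "P1", None], ["Q0", None, None, None])` lets `P0` through first, releases `Q0` on the next idle cycle, and then passes `P1` without buffering.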


Other Refinement
• Pipelined Interpredict rules
  – Chroma interpolation
• Improved Interpolator filter implementation
• Improved memory subsystem
  – Previously too general
  – Needless crossbar

[Figure: Interpolation Sampling]


Results


• Nearly 60 fps at 1080p
• Power, area, and throughput improvements
• Fast Deblocking filter implementation
  – Faster than any known implementation
  – Does it really matter?
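To put the 60 fps at 1080p figure in perspective, the arithmetic below converts it into a macroblock rate and a per-macroblock cycle budget. The slides do not state a clock frequency, so the 100 MHz value in the usage note is purely a hypothetical input, and the helper names are invented for this sketch.

```python
MB_SIZE = 16  # a macroblock covers 16x16 luma samples


def macroblocks_per_frame(width, height):
    """Frame dimensions are rounded up to whole macroblocks."""
    mb_wide = -(-width // MB_SIZE)    # ceiling division
    mb_high = -(-height // MB_SIZE)
    return mb_wide * mb_high


def macroblocks_per_second(width, height, fps):
    return macroblocks_per_frame(width, height) * fps


def cycle_budget(clock_hz, width, height, fps):
    """Whole clock cycles available per macroblock at the given frame rate."""
    return clock_hz // macroblocks_per_second(width, height, fps)
```

A 1920x1080 frame is 120 x 68 = 8160 macroblocks, so 60 fps means 489,600 macroblocks per second. At an assumed 100 MHz clock, `cycle_budget(100_000_000, 1920, 1080, 60)` leaves roughly 200 cycles per macroblock for every stage of the decoder, which is why a highly sequential deblocking filter becomes a bottleneck.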


Questions?