Post on 11-Jan-2016
Platform-based Design for MPEG-4 Video Encoder
Presenter: Yu-Han Chen
DSP/IC Design Lab. 2
Video Coding Standards
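[Figure: video coding standards (H.261, H.263, MPEG-1, MPEG-2, MPEG-4) plotted by data rate (10K to 10M bps) against resolution/quality (QCIF, CIF, SDTV, HDTV). The standards are labeled with the years 1990, 1992, 1994, 1995, and 1999 and with their target applications: telecommunication, storage, broadcasting, and multimedia.]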
Introduction
• Multimedia applications are emerging: video-phone, camcorder, surveillance, and video streaming
• MPEG-4 provides a total solution for these applications
  - High compression ratio for limited bandwidth
  - Error robustness in error-prone environments
  - Content interactivity for more functionality besides 'seeing'
Proposed MPEG-4 Encoder
• MPEG-4 video encoding
• Platform-based system architecture
• Motion encoding module
• Texture encoding module
MPEG-4 Simple Profile Video Encoder
[Figure: block diagram of the MPEG-4 Simple Profile video encoder. Blocks: Video Source, ME, MC, Frame Memory, adder, a Block Engine (DCT, Q, DC/AC Prediction, IQ, IDCT), Scan, VLC, and Bitstream output.]
Complexity Analysis of Optimized Software Model
• SPL3 foreman sequence at 30 fps
• ME – full search with half-stop algorithm
• DCT/IDCT – row-column decomposition

Tool     Computing            Controlling          Memory access
         (MIPS)       %       (MIPS)       %       (MBytes/s)     %
ME       6,142.64    75.91    2,766.08    77.33    30,668.20     80
IDCT       539.31     6.66      109.49     3.06     2,016.22      5.26
DCT        442.16     5.46       58.52     1.64     1,621.95      4.23
MC         386.12     4.77      271.45     7.59     1,987.58      5.18
Q          205.55     2.54      129.33     3.62       629.79      1.64
ACDC       112.08     1.39       64.36     1.8        387.65      1.01
SCAN        91.96     1.14       60.33     1.69       385.09      1
IQ          93.8      1.16       56.66     1.58       338.12      0.88
VLC         77.6      0.96       60.65     1.7        301.77      0.79
TOTAL    8,092      100       3,577      100       38,336       100
Characteristics of Video Coding Tools
Program control
  Example: coding mode selection, predictor
  Parallelism: mostly sequential
  Features: high complexity
  Data type: 16-bit
  Frequency (CIF 30 fps): very low (10K Hz)
  Preferred implementation: SW

Stream processing
  Example: VLC/VLD, CAE/CAD, parsing, RLD
  Parallelism: mostly sequential
  Features: high complexity, non-word-aligned processing
  Data type: < 16-bit
  Frequency (CIF 30 fps): medium (1~10 MHz)
  Preferred implementation: HW or SW

Block processing
  Example: DCT/IDCT, MC, ME, filters
  Parallelism: high
  Features: low complexity, high data rate, regular
  Data type: 8, 16-bit
  Frequency (CIF 30 fps): high (10M~10G Hz)
  Preferred implementation: HW

Source: [Micro]
Implementation Demands
• Computational power is up to 12 GIPS
  - ME is the most important key component
  - DCT/IDCT is the second
  - Dedicated hardware accelerators are employed
• Implementation for various features of algorithms
  - Software for irregular and sequential ones
  - Hardware for high-processing-rate ones
• HW/SW co-design is the most promising solution to achieve a cost-effective system
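The 12 GIPS figure can be related to the complexity table with a little arithmetic. The sketch below assumes (the slides do not say so explicitly) that the figure is the sum of the computing and controlling totals, rounded up; the variable names are illustrative only.

```python
# Totals copied from the complexity table of the optimized software model.
computing_mips = 8092      # TOTAL, computing column
controlling_mips = 3577    # TOTAL, controlling column
me_computing_mips = 6142.64

# ASSUMPTION: overall demand = computing + controlling instructions.
total_gips = (computing_mips + controlling_mips) / 1000
me_computing_share = me_computing_mips / computing_mips

print(f"total instruction rate: {total_gips:.2f} GIPS")
print(f"ME share of computing:  {me_computing_share:.1%}")
```

Under that assumption the software model needs a little under 12 GIPS, with ME alone taking roughly three quarters of the computing load, which is why ME is the first candidate for a dedicated accelerator.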
Platform for MPEG-4 Video Coding
[Figure: platform block diagram. HyRISC with firmware and SRAM; ME, MC, Block Engine, DMA, Sequencer, and Bitstream Unit, each attached through a wrapper; Coeff. Generator and Coeff. Buffer; MEMIF to external memory; RISC bus (16 bits) and data bus (32 bits); virtual tools.]
• The chip is the region inside the dotted line
• The platform-based system includes HyRISC, RBUS and DBUS, DMA, and MEMIF
• Hardware accelerators include ME, MC, BE (DCT/IDCT, Q, IQ, ACDCP), the Bitstream Unit, and shared memory (CG, CB)
Motion Encoding Module
[Figure: motion encoding module block diagram with three stages. Pattern generation: spiral pattern, ROM-based diamond pattern, MUX, control, range checker. FIFO: carries (id, u, v) candidates, with feed/full and fetch/empty handshakes. Distortion calculation: AG, SWMEM, MBRAM, adder tree, accumulator, comparator, elimination (stop). Inputs: start/finish, data_in, id, (pmvx, pmvy), mode. Output: (mvx, mvy, SAD). Loading path: Ref. Sum RAM / MB RAM / Ref. RAM.]
Summary of ME
• Low-cost, high-performance hybrid motion estimation is proposed
• Dynamic modes for various applications
  - Real-time and low-power applications: PDS (Predictive Diamond Search) mode
  - High-compression-quality applications: FFS (Fast Full Search) mode
    Spiral full search with PDE (Partial Distortion Elimination)
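The FFS mode can be illustrated in software. The following is a minimal sketch of a centre-outwards full search with partial distortion elimination, not the hardware datapath from the slides: the function names, the ring-ordered approximation of the spiral scan, the row-wise early exit, and the tiny test frame are all assumptions made for illustration.

```python
def spiral_offsets(search_range):
    """Candidate (dx, dy) offsets ordered from the centre outwards."""
    offsets = [(dx, dy)
               for dy in range(-search_range, search_range + 1)
               for dx in range(-search_range, search_range + 1)]
    # Ordering by ring (Chebyshev distance) approximates a spiral scan.
    return sorted(offsets, key=lambda o: (max(abs(o[0]), abs(o[1])), o[1], o[0]))


def sad_with_pde(cur, ref, x, y, dx, dy, size, best_sad):
    """Row-wise SAD that gives up once the partial sum reaches best_sad."""
    total = 0
    for row in range(size):
        cur_row = cur[y + row]
        ref_row = ref[y + dy + row]
        for col in range(size):
            total += abs(cur_row[x + col] - ref_row[x + dx + col])
        if total >= best_sad:      # partial distortion elimination
            return None
    return total


def full_search(cur, ref, x, y, size=4, search_range=2):
    """Return the best motion vector and its SAD for the block at (x, y)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = None, float("inf")
    for dx, dy in spiral_offsets(search_range):
        # Skip candidates that fall outside the reference frame.
        if not (0 <= x + dx and x + dx + size <= width
                and 0 <= y + dy and y + dy + size <= height):
            continue
        sad = sad_with_pde(cur, ref, x, y, dx, dy, size, best_sad)
        if sad is not None:
            best_mv, best_sad = (dx, dy), sad
    return best_mv, best_sad


# Hypothetical 12x12 frames: cur is ref shifted by (dx, dy) = (2, 1).
ref = [[(7 * r + 13 * c) % 31 for c in range(12)] for r in range(12)]
cur = [[ref[min(r + 1, 11)][min(c + 2, 11)] for c in range(12)] for r in range(12)]
mv, sad = full_search(cur, ref, x=4, y=4)
print(mv, sad)
```

Scanning from the centre tends to find a small SAD early, so later candidates are rejected after only a few rows; the search result is the same as an exhaustive full search.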
Texture Encoding Module
• Interleaved DCT/IDCT schedule
  - DCT and IDCT are interleaved for the same block
• Sub-structure sharing technique
  - Applied to the AC/DC prediction datapath and Q/IQ by extracting the common formula term
Interleaved DCT/IDCT Processing
[Figure: timing diagram over time slots 1 to 13 for the six blocks of a macroblock (Y1, Y2, Y3, Y4, Cb, Cr). Each block passes through two 1-D DCT passes, Q, IQ, and two 1-D IDCT passes, with the DCT and IDCT passes of successive blocks interleaved. Datapath: a 1-D DCT/IDCT unit, transpose memory, a 1:2 DMUX, and a 2:1 MUX (signals X, Y, Z).]
Sub-structure Sharing of Q/IQ and ACDC Prediction
• Scalar operation: (QAC x QPA) / QPX
• Share the partial result (QAC x QP = M) in the IQ module
• Share the datapath of Q for M / QPX

[Figure: the interleaved DCT/Q/IQ/IDCT timing diagram over time slots 1 to 13 for blocks Y1, Y2, Y3, Y4, Cb, Cr, with an additional DIV operation scheduled for each block (DIV(Y1) to DIV(Cr)).]
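The scaling (QAC x QPA) / QPX splits into a multiply and a divide, which is what lets the two halves borrow existing hardware. The sketch below only mirrors that split in software; the function names are hypothetical stand-ins for the IQ multiplier and the Q divider, and the rounding rule (nearest integer, halves away from zero) is an assumption rather than something stated on the slide.

```python
def rounded_div(numerator, divisor):
    """Integer division rounded to nearest, halves away from zero (assumed)."""
    sign = -1 if numerator < 0 else 1
    return sign * ((2 * abs(numerator) + divisor) // (2 * divisor))


def multiply_stage(qac, qp):
    """Stand-in for the multiplier in the IQ module: M = QAC x QP."""
    return qac * qp


def divide_stage(m, qp):
    """Stand-in for the divider in the Q datapath: M / QP."""
    return rounded_div(m, qp)


def scale_ac_predictor(qac, qp_a, qp_x):
    """Rescale a neighbouring block's quantized AC value to the current QP."""
    m = multiply_stage(qac, qp_a)   # partial result shared with IQ
    return divide_stage(m, qp_x)    # reuses the Q divider


print(scale_ac_predictor(10, 8, 4))
```

When the neighbouring block and the current block use the same quantizer parameter, the value passes through unchanged, so the extra DIV step matters only when QP varies between blocks.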
Chip Features and Layout
Chip                  MPEG-4 Video Encoder
Specification         Simple profile @ Level 3
Encoding complexity   352 x 288 at 30 fps
Technology            TSMC 0.35 um 1P4M
Die size              5.1 x 5.1 mm2
Logic gate count      71,459 gates
On-chip memory        39,080 bits
Off-chip memory       2,027,527 bits
Transistor count      828,692 transistors
Package               208 CQFP
Input PAD             67
Output PAD            83
Power PAD             48
Working frequency     40 MHz
Voltage               3.3 V
Power consumption     339.51 mW
Hardware/Software Co-Design Flow
Subjective View
FFS (QP = 16, PSNR_Y = 32.4012, Bits = 9537)
PDS (QP = 16, PSNR_Y = 32.0256, Bits = 9465)
Worst case of PSNR drop (0.3962 dB) at the 69th frame
R-D Curve for Stefan (High Activity)
[Figure: rate-distortion curves for the Stefan sequence comparing PDS and FFS. X-axis: bit rate (Kbps), 0 to 3000. Y-axis: PSNR Y (dB), 24 to 36.]
Conclusion
• A cost-effective MPEG-4 video encoder is proposed
• Hardware accelerators
  - A novel hybrid motion estimation architecture
  - A cost-effective texture block engine architecture
• Platform-based system backbone
  - A compromise between flexibility and high performance
• HW/SW co-design flow and tools

Thank you
DCT/IDCT Coefficient Matrix
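N = 8

\[
\begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \end{bmatrix}
= \sqrt{\frac{2}{N}}
\begin{bmatrix}
\cos\frac{\pi}{4} \\ \cos\frac{\pi}{16} \\ \cos\frac{\pi}{8} \\ \cos\frac{3\pi}{16} \\
\cos\frac{5\pi}{16} \\ \cos\frac{3\pi}{8} \\ \cos\frac{7\pi}{16}
\end{bmatrix}
\]

\[
A =
\begin{bmatrix}
a & a & a & a & a & a & a & a \\
b & d & e & g & -g & -e & -d & -b \\
c & f & -f & -c & -c & -f & f & c \\
d & -g & -b & -e & e & b & g & -d \\
a & -a & -a & a & a & -a & -a & a \\
e & -b & g & d & -d & -g & b & -e \\
f & -c & c & -f & -f & c & -c & f \\
g & -e & d & -b & b & -d & e & -g
\end{bmatrix}
\]

Even-numbered rows (0, 2, 4, 6) are even symmetric; odd-numbered rows (1, 3, 5, 7) are odd symmetric.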
1-D DCT and IDCT
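1-D DCT (Y = AX), with preprocessing (input additions and subtractions) and data reordering:

\[
\begin{bmatrix} Y(0) \\ Y(2) \\ Y(4) \\ Y(6) \end{bmatrix}
=
\begin{bmatrix} a & a & a & a \\ c & f & -f & -c \\ a & -a & -a & a \\ f & -c & c & -f \end{bmatrix}
\begin{bmatrix} X(0)+X(7) \\ X(1)+X(6) \\ X(2)+X(5) \\ X(3)+X(4) \end{bmatrix}
\]

\[
\begin{bmatrix} Y(1) \\ Y(3) \\ Y(5) \\ Y(7) \end{bmatrix}
=
\begin{bmatrix} b & d & e & g \\ d & -g & -b & -e \\ e & -b & g & d \\ g & -e & d & -b \end{bmatrix}
\begin{bmatrix} X(0)-X(7) \\ X(1)-X(6) \\ X(2)-X(5) \\ X(3)-X(4) \end{bmatrix}
\]

1-D IDCT (Y = A^T X), with data reordering and postprocessing (output additions and subtractions):

\[
\begin{bmatrix} Y(0) \\ Y(1) \\ Y(2) \\ Y(3) \end{bmatrix}
=
\begin{bmatrix} a & c & a & f \\ a & f & -a & -c \\ a & -f & -a & c \\ a & -c & a & -f \end{bmatrix}
\begin{bmatrix} X(0) \\ X(2) \\ X(4) \\ X(6) \end{bmatrix}
+
\begin{bmatrix} b & d & e & g \\ d & -g & -b & -e \\ e & -b & g & d \\ g & -e & d & -b \end{bmatrix}
\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix}
\]

\[
\begin{bmatrix} Y(7) \\ Y(6) \\ Y(5) \\ Y(4) \end{bmatrix}
=
\begin{bmatrix} a & c & a & f \\ a & f & -a & -c \\ a & -f & -a & c \\ a & -c & a & -f \end{bmatrix}
\begin{bmatrix} X(0) \\ X(2) \\ X(4) \\ X(6) \end{bmatrix}
-
\begin{bmatrix} b & d & e & g \\ d & -g & -b & -e \\ e & -b & g & d \\ g & -e & d & -b \end{bmatrix}
\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix}
\]

8 MAC operations down to 4!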
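The even/odd split can be checked numerically. The sketch below is an illustrative software model, not the chip's datapath: it builds the coefficients a to g from the standard 8-point DCT definition, applies the butterfly form (4 multiply-accumulates per output), and compares it with the direct form (8 per output). The function names and the sample input are assumptions.

```python
import math

N = 8
# a..g = sqrt(2/N) * cos(k*pi/16) for k = 4, 1, 2, 3, 5, 6, 7.
a, b, c, d, e, f, g = (math.sqrt(2 / N) * math.cos(k * math.pi / 16)
                       for k in (4, 1, 2, 3, 5, 6, 7))

EVEN = [[a, a, a, a], [c, f, -f, -c], [a, -a, -a, a], [f, -c, c, -f]]
ODD = [[b, d, e, g], [d, -g, -b, -e], [e, -b, g, d], [g, -e, d, -b]]


def dct_direct(x):
    """Reference 1-D DCT: 8 multiply-accumulates per output."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                               for n in range(N)))
    return out


def dct_even_odd(x):
    """1-D DCT via butterflies: 4 multiply-accumulates per output."""
    sums = [x[n] + x[7 - n] for n in range(4)]     # preprocessing
    diffs = [x[n] - x[7 - n] for n in range(4)]
    y = [0.0] * N
    for i in range(4):
        y[2 * i] = sum(m * s for m, s in zip(EVEN[i], sums))
        y[2 * i + 1] = sum(m * t for m, t in zip(ODD[i], diffs))
    return y


def idct_even_odd(coeffs):
    """1-D IDCT: two 4x4 products, then add/subtract (postprocessing)."""
    even = [sum(EVEN[k][n] * coeffs[2 * k] for k in range(4)) for n in range(4)]
    odd = [sum(ODD[k][n] * coeffs[2 * k + 1] for k in range(4)) for n in range(4)]
    y = [0.0] * N
    for n in range(4):
        y[n] = even[n] + odd[n]
        y[7 - n] = even[n] - odd[n]
    return y


samples = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary test input
coeffs = dct_even_odd(samples)
restored = idct_even_odd(coeffs)
```

Both directions reuse the same two 4x4 coefficient sets, which is consistent with one multiplier array serving the DCT and the IDCT.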
DCT/IDCT Architecture
[Figure: DCT/IDCT architecture. Input select (INSEL) and DRU feed the ACF and BDEG matrix-vector multipliers; MUXA, MUXB, MUXC, MUXD, a LIFO, an adder, and a subtractor; transpose memory and IDRU; signals X, Y, Z.]
• DRU: Data Reordering Unit
• Two parallel MACs
• Preprocessing for the DCT, postprocessing for the IDCT
• The two 1-D operations are multiplexed
Multiplication of Constant Coefficients
• Only 7 constant coefficients are used
• Signed-digit (SD) representation
  - Minimum number of nonzero terms (1, -1)
• Shift and add
  - Avoids a dedicated multiplier

Coeff.  Value    12-bit signed  SD representation                            SD value  No. of nonzero terms
a       0.35355  0.35351        2^-2 + 2^-4 + 2^-5 + 2^-7 + 2^-9             0.35352   5
b       0.49039  0.49023        2^-1 - 2^-7 - 2^-9 + 2^-13                   0.49036   4
c       0.46194  0.46191        2^-1 - 2^-5 - 2^-7 + 2^-10                   0.46191   4
d       0.41573  0.41552        2^-2 + 2^-3 + 2^-5 + 2^-7 + 2^-9 - 2^-12     0.41577   6
e       0.27779  0.27734        2^-2 + 2^-5 - 2^-8 + 2^-11                   0.27783   4
f       0.19134  0.19091        2^-3 + 2^-4 + 2^-8 - 2^-14                   0.19135   4
g       0.09755  0.09716        2^-4 + 2^-5 + 2^-8 - 2^-13                   0.09753   4
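Multiplying by a signed-digit constant needs only shifts, additions, and subtractions. The sketch below illustrates this for coefficients a, b, and c. The term lists are an assumption: the powers of two are those listed for each coefficient, with signs chosen so that each sum matches the SD value given for it. `FRAC_BITS` and the function names are hypothetical.

```python
# (sign, shift) pairs: coefficient ~ sum(sign * 2**-shift).
# ASSUMPTION: signs chosen so each sum matches the stated SD value.
SD_TERMS = {
    "a": [(+1, 2), (+1, 4), (+1, 5), (+1, 7), (+1, 9)],
    "b": [(+1, 1), (-1, 7), (-1, 9), (+1, 13)],
    "c": [(+1, 1), (-1, 5), (-1, 7), (+1, 10)],
}
FRAC_BITS = 14  # hypothetical precision; must be >= the largest shift used


def sd_value(terms):
    """Real value represented by a list of signed power-of-two terms."""
    return sum(sign * 2.0 ** -shift for sign, shift in terms)


def multiply_by_constant(x, terms):
    """Multiply integer x by the constant using only shifts and adds.

    The result is the product scaled by 2**FRAC_BITS.
    """
    acc = 0
    for sign, shift in terms:
        acc += sign * (x << (FRAC_BITS - shift))
    return acc


product = multiply_by_constant(100, SD_TERMS["c"])
print(product / 2 ** FRAC_BITS)   # close to 100 * 0.46194
```

Each nonzero term costs one adder or subtractor, so fewer nonzero digits means less hardware per coefficient.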
Gate Count Distribution
[Figure: pie chart of gate count by module.]

Module      Gates    Share
ME          15,565   23%
DCT/IDCT    14,785   21%
VLC          6,505    9%
WRAPPER      6,215    9%
HYRISC       5,785    8%
Q            4,736    7%
MC           3,459    5%
IQ           3,278    5%
COGEN        2,619    4%
DMA          2,382    3%
LPSEQ        2,045    3%
DCACP        1,885    3%
BUS            300    0%
Memory Distribution

On-chip RAM
Function                       Characteristic   Depth x Width   Num.   Bits
Current MB for ME              Asynchronous     32x32           2      2,048
Block buffers for MC/COGEN                      16x32           6      3,072
Search window for ME                            288x8           8      18,432
ACDC Prediction                Two port         76x12           1      912
                               Two port         386x12          1      4,632
Scan buffer                    Two port         64x12           1      768
Transpose mem. for DCT/IDCT                     64x16           1      1,024
RISC data RAM                  Two port         512x16          1      8,192
Total                                                           21     39,080

Off-chip RAM
Function                       Characteristic   Depth x Width   Num.   Bits
RISC instruction ROM           ROM              1024x22         1      22,528
External RAM                   SRAM with ACK    152,416x32      1      4,877,312
Total                                                           2      4,899,840
Power Estimation
Power Consumption Estimation
                              Original features     Case 1          Case 2           Case 3
Technology (μm)               0.35                  0.18            0.18             0.18
Spec.                         CIF at 30 fps         CIF at 30 fps   QCIF at 15 fps   QCIF at 15 fps
Encoding complexity (MBs/s)   11880                 11880           11880            2970
Working frequency (MHz)       40                    40              5                5
Voltage (V)                   3.3                   1.5             1.5              1.5
Gated clock                   No                    No              No               Yes
Power estimation (mW)         339.51 (Powermill)    154.32          19.29            6.55

Case 1 – 0.18 μm
Case 2 – 0.18 μm, 1/8 computational power
Case 3 – 0.18 μm, 1/8 computational power, gated clock
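The slides do not state how the Case 1 and Case 2 figures were derived from the measured 339.51 mW. As an observation only, a first-order model that scales power linearly with the supply-voltage ratio and the clock-frequency ratio reproduces both numbers; the function below is a sketch of that assumed model. Case 3 (gated clock) is not modeled.

```python
def scale_power(power_mw, v_from, v_to, f_from_mhz, f_to_mhz):
    """Assumed first-order model: linear in voltage ratio and clock ratio."""
    return power_mw * (v_to / v_from) * (f_to_mhz / f_from_mhz)


original = 339.51                                  # 0.35 um, 3.3 V, 40 MHz
case1 = scale_power(original, 3.3, 1.5, 40, 40)    # 0.18 um, 1.5 V, 40 MHz
case2 = scale_power(case1, 1.5, 1.5, 40, 5)        # clock reduced to 5 MHz

print(round(case1, 2), round(case2, 2))
```

The factor of 8 between Case 1 and Case 2 matches the "1/8 computational power" note, since the clock drops from 40 MHz to 5 MHz.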