A 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM, MPEG-4
Accelerator and 3D Rendering Engine for Mobile Applications
Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae,
In-Cheol Park and Hoi-Jun Yoo
Dept. of Electrical Engineering, Korea Advanced Institute of
Science and Technology (KAIST), Korea
Outline
• Introduction
• System Architecture Overview
• Low Power Block Design
  – 32-bit RISC
  – MPEG-4 Accelerator
  – 3D Rendering Engine
  – Embedded DRAM Frame Buffer
• Features of Test Chip
• Conclusions
Requirements for Future Mobile Information Terminals
• Multimedia Signal Processing
  – mp3, 2D Image Processing, etc.
  – 3D Graphics
• Low Power Features
  – Battery-driven Products
• Low Cost Solutions
  – Major Factor for Consumer Electronics
Target Specifications
• 3D Image Rendering
  – > 2 Mpolygons/sec
  – 256 x 256 Resolution with 24b True Color
  – 16b Z-Buffering
  – Alpha Blending
  – Double Buffering
• MPEG-4 Video Decoding
  – Simple Profile
  – QCIF (176 x 144)
  – 15 frames/sec
• System Power
  – < 200mW
• Others
  – mp3, etc.
[Figure: Low Power Multimedia Processor and LCD Display]
The Proposed Solution
• High Performance CPU
  – Large Area
  – High Power Consumption
• Optimized Performance CPU + Dedicated H/W + Embedded DRAM
  – Distribution of Computational Load
  – Small Area
  – Programmability
  – Circuit Level Low Power Techniques
• H/W & S/W Mixed Solution, Optimized at Architecture / Circuit Level
Architecture Overview
[Figure: block diagram. Labeled blocks: ARM9 with MAC; two B.W. Equalizers, each with Dual-Port SRAM (2KB); MC Accelerator with Frame Buffer; YCrCb to RGB SAM; 3D Rendering Engine with Frame Buffer + Z-Buffer SAM; DLL Clk Gen.; Ext. I/O; Color Out (24b). Labeled bus widths: 32b, 128b, 512b, 2048b. Labeled clocks: 80MHz, 20MHz]
• Fast / Narrow Data Transaction (32b @ 80MHz)
• Slow / Wide Data Transaction (512b @ 20MHz)
• Data Buffering
• On-Chip Wide Bus between Logic / eDRAM
Multimedia Enhancement in RISC
[Figure: 5-Stage Pipeline (F, D, EX, MEM, WB) drawn for three overlapped instructions. Execution Units: REG File, "Su" (label truncated), MUL, ALU. Multiplier drawn as nine 4:2 Add blocks]
• Tree Structure with 4:2 Adders
• 1-Cycle 32b x 32b Multiplication
• 2-Cycle 32b x 32b Multiplication and Accumulation
• 23% Cycle Reduction Compared with Conventional ARM Architecture
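The 4:2 adder tree named on this slide can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the operands are unsigned, no Booth recoding is modeled, each 4:2 adder is built from two chained 3:2 carry-save stages, and the accumulator is 64 bits wide. None of these details are given on the slide, and the names `carry_save_add`, `compress_4_2`, and `mac32` are hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumed accumulator width; results are kept modulo 2**64


def carry_save_add(a, b, c):
    """3:2 compression: a + b + c == s + carry (mod 2**64), with no carry propagation."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK64, carry & MASK64


def compress_4_2(a, b, c, d):
    """4:2 adder modeled as two chained 3:2 stages: four rows in, two rows out."""
    s, carry = carry_save_add(a, b, c)
    return carry_save_add(s, carry, d)


def mac32(x, y, acc=0):
    """Return (x * y + acc) mod 2**64 for unsigned 32b x and y, using a 4:2 tree."""
    # One shifted copy of x per set bit of y (plain partial products, no Booth recoding).
    rows = [(x << i) & MASK64 for i in range(32) if (y >> i) & 1]
    rows.append(acc & MASK64)  # the accumulator joins the tree as one more row
    while len(rows) > 2:
        padded = rows + [0] * (-len(rows) % 4)  # pad to a multiple of four rows
        rows = [r for i in range(0, len(padded), 4)
                for r in compress_4_2(*padded[i:i + 4])]
    return sum(rows) & MASK64  # final carry-propagate addition
```

Each pass through the loop halves the number of rows, so only the last addition needs a full carry chain. That property is what makes a tree of this kind attractive for a short-latency multiplier.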
Bandwidth Equalizer
[Figure: DP-SRAM (2KB) with Flow Cont. between a 32b port and a 512b port]
• From RISC: 32b @ 80MHz
• To Dedicated H/W: 512 (32)b @ 20MHz
• Single Ended for Tight Bit Pitch
• Acts as a Row Cache
[Figure: cell and sense-amplifier circuit. Signal labels: BL, BL2, CELL, BLSA, SAE, SE, CS, DB, DDO, WBE, STR]
• WBE: Wide Bus Enable
• STR: Cache Store
• DDO: Direct Data Out
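The rate matching between the 32b @ 80MHz port and the 512b @ 20MHz port can be sketched as a simple buffer model. This is an illustrative behavioral sketch, not the circuit on the slide: the class name `BandwidthEqualizerModel`, the FIFO ordering, and the packing of the first 32b word into the lowest bits of the wide word are all assumptions.

```python
from collections import deque

NARROW_BITS = 32                             # RISC-side port width (from the slide)
WIDE_BITS = 512                              # dedicated-H/W-side port width (from the slide)
WORDS_PER_LINE = WIDE_BITS // NARROW_BITS    # 16 narrow words per wide word


class BandwidthEqualizerModel:
    """Hypothetical behavioral model of a 2KB buffer between the two ports."""

    def __init__(self, capacity_bytes=2048):
        self.max_words = capacity_bytes * 8 // NARROW_BITS
        self.words = deque()

    def write32(self, word):
        """Narrow-side write; returns False when the writer has to wait (flow control)."""
        if len(self.words) >= self.max_words:
            return False
        self.words.append(word & 0xFFFFFFFF)
        return True

    def read512(self):
        """Wide-side read; returns None until 16 narrow words are buffered."""
        if len(self.words) < WORDS_PER_LINE:
            return None
        line = 0
        for i in range(WORDS_PER_LINE):
            # Assumed packing: the oldest word lands in the lowest 32 bits.
            line |= self.words.popleft() << (NARROW_BITS * i)
        return line


def narrow_writes_per_wide_cycle(fast_hz=80_000_000, slow_hz=20_000_000):
    """Number of narrow-port cycles that fit in one wide-port cycle."""
    return fast_hz // slow_hz
```

With the clock figures from the slide, four 32b writes fit in one 20MHz cycle, so filling one 512b word takes 16 writes, or four slow cycles. The narrow port carries 32b x 80MHz = 2.56Gb/s, while the wide port can move up to 512b x 20MHz = 10.24Gb/s.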
Motion Compensation (MC) Accelerator
[Figure: datapath. Labeled blocks: Pixel Buffer, MUX Logic, Pixel ALU #0 to #7, Data Alignment, Half-Pel ALU #0 to #7, MUX Logic, FB Buffer, Adaptive Fetch Control, Frame Buffer #0 (512b x 128row x 9bank) with FB Cont, Frame Buffer #1 (512b x 128row x 9bank) with FB Cont. Labeled bus widths: 128b (16 Pixels), 512b. Labeled clock: 20MHz]
• Parallel Operation @ 20MHz
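The work of a half-pel ALU can be illustrated with the half-sample interpolation rule commonly described for MPEG-4 video decoding. The sketch below is an assumption about what such an ALU computes, not a description of this chip's logic. The function names, the list-of-lists reference frame, and the `rounding` parameter (standing in for the bitstream's rounding control) are illustrative.

```python
def half_pel(ref, x2, y2, rounding=0):
    """Sample ref at position (x2, y2) given in half-pel units.

    ref is a 2D list of 8-bit samples indexed as ref[row][column].
    """
    x, y = x2 >> 1, y2 >> 1        # integer-pel position
    hx, hy = x2 & 1, y2 & 1        # half-pel flags
    a = ref[y][x]
    if not hx and not hy:
        return a                   # full-pel position: no interpolation
    if hx and not hy:
        return (a + ref[y][x + 1] + 1 - rounding) >> 1      # horizontal average
    if hy and not hx:
        return (a + ref[y + 1][x] + 1 - rounding) >> 1      # vertical average
    # Diagonal position: average of the four surrounding samples.
    return (a + ref[y][x + 1] + ref[y + 1][x] + ref[y + 1][x + 1] + 2 - rounding) >> 2


def predict_row(ref, x2, y2, n=8, rounding=0):
    """Eight predictions side by side, mirroring the eight ALUs (#0 to #7) in the figure."""
    return [half_pel(ref, x2 + 2 * i, y2, rounding) for i in range(n)]
```

Each output depends only on a small neighborhood of reference samples, so the eight results in a row are independent and can be computed in parallel.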
Frame Buffer for MCA
• 9-Bank with 128b I/O
• Sub-wordline with Partial Activation, Partial I/O Scheme
[Figure: Bank #0, #1, #2, ... #8 sharing a 128b I/O. Labeled blocks: GWL Driver, SWL Drivers, S/A, DB S/A (x32, four segments), Partial Activation Control, Partial I/O Control, PA Cont, RX. Labeled lines: SWDL/GWL, GWL]
Spatial Locality
[Figure: blocks to be reconstructed at MB Addr = N and MB Addr = N+1. Region labels: Previously Used, Newly Needed, Commonly Used, Needed, Re-usable Block]
[Figure: Distribution of Motion Vectors (MVx, MVy) for Class A/B; both axes marked at -16, -8, -4, 4, 8, 16]
• 70~90% are Confined in 8x8 Boundary
• Large Spatial Locality
Distributed Nine-Tiled Block Mapping: Low Power Technique (1)
[Figure: Frame Image blocks distributed over Bank #0, Bank #1, ... Bank #8; 9-Banks (1-Macro), BK#0 to BK#8; bank pattern drawn as 0000 1111 2222 / 3333 4444 5555 / 6666 7777 8888]
[Figure: 1-Bank case, labeled Row Conflicts and A Block in A Row]
• Increasing Re-usability
• Minimizing Cell Core Activation in DRAM
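One way to picture a nine-tiled mapping is a 3 x 3 repeating pattern of banks over the block grid. The sketch below is an assumption made for illustration only: the slide does not give the mapping formula, the 8 x 8 block size is borrowed from the spatial-locality slide, and the names `bank_of` and `banks_touched` are hypothetical.

```python
def bank_of(bx, by):
    """Assumed mapping: block (bx, by) goes to one of nine banks in a 3 x 3 tile pattern."""
    return (by % 3) * 3 + (bx % 3)


def banks_touched(x, y, w, h, block=8):
    """Banks hit by a w x h pixel region whose top-left corner is (x, y).

    block is the assumed block edge length in pixels.
    """
    bxs = range(x // block, (x + w - 1) // block + 1)
    bys = range(y // block, (y + h - 1) // block + 1)
    return [bank_of(bx, by) for by in bys for bx in bxs]
```

Under these assumptions, a 16 x 16 reference region at any pixel offset covers at most 3 x 3 blocks, and each of those blocks falls in a different bank. No bank then has to open two rows for the same region, which is the row-conflict case shown for the 1-bank structure.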
Partial Activation Scheme: Low Power Technique (2)
[Figure: two cases, Normal Operation and SAM Transfer. Each shows Bank #0 to Bank #3 with GWL drv and sub-blocks 0 1 2 3, alongside Screen regions #0 to #3. Labels: Unnecessary Data, Necessary Data, Partial Activation]
• DNTBM + Partial ACT: Up to 31% Power Reduction Compared with 1-Bank Structure
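The idea of activating only part of a row can be sketched as selecting which sub-wordline segments a pixel span needs. Everything in this sketch is an assumption layered on the slide's figures: a 512b row split into four 128b segments (matching sub-blocks 0 to 3 and the 128b I/O), 8b per pixel, and the hypothetical helper names.

```python
ROW_BITS = 512        # row width, from the frame buffer slide
SEGMENTS = 4          # assumed: one segment per sub-block 0..3
BITS_PER_PIXEL = 8    # assumed: 128b carries 16 pixels
PIXELS_PER_SEGMENT = ROW_BITS // SEGMENTS // BITS_PER_PIXEL   # 16


def segments_to_activate(first_pixel, n_pixels):
    """Indices of the sub-wordline segments that hold pixels first_pixel .. first_pixel + n_pixels - 1."""
    last_pixel = first_pixel + n_pixels - 1
    if n_pixels <= 0 or first_pixel < 0 or last_pixel >= SEGMENTS * PIXELS_PER_SEGMENT:
        raise ValueError("span does not fit in one row")
    return list(range(first_pixel // PIXELS_PER_SEGMENT,
                      last_pixel // PIXELS_PER_SEGMENT + 1))


def activated_fraction(first_pixel, n_pixels):
    """Share of the row that is activated, compared with opening the full row."""
    return len(segments_to_activate(first_pixel, n_pixels)) / SEGMENTS
```

In this model an 8-pixel fetch that sits inside one segment activates a quarter of the row, while the same fetch straddling a segment boundary activates half of it.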
Adaptive Fetch Control Scheme: Low Power Technique (3)
[Figure: Block-by-Block Reconstruction in steps 1 to 4; FB Buffer holding Valid Data and Garbage Data; Muxing Logic feeding PE #0 to PE #7; Adaptive Fetch Control]
• No Switching in Datapath
3D Rendering Engine
[Figure: Bandwidth Equalizer, Polygon Buffer, 1-Edge Processor (Left, Right; R G B X Y Z), 8-Pixel Processors PP0, PP1, PP2, ... PP7, Frame-Buffer Interface. Labeled bus: 1280b, 20MHz]
[Figure: pixel processor datapath. R/G/B Unit: Shading, Blending. Z-Unit: Depth Comparison. Inputs: Old Pixel (RGBZ), New Pixel (RGBZ). Output: Update]
• Calculating 8 Pixels/Cycle
• Parallel Datapath for RGB and Z
• Virtually Spanning 2D Array (ViSTA) Architecture
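The depth comparison, blending, and update steps named in the pixel processor figure can be sketched as follows. This is a minimal illustration under stated assumptions: a smaller Z value means nearer to the viewer, alpha is an 8-bit value, the incoming pixel is already shaded, and the function names are hypothetical. The slide does not specify any of these details.

```python
def blend_channel(new, old, alpha):
    """Alpha-blend one 8-bit channel; alpha runs from 0 (keep old) to 255 (take new)."""
    return (new * alpha + old * (255 - alpha) + 127) // 255


def update_pixel(old, new, alpha=255):
    """old and new are (r, g, b, z) tuples with 8b color channels and a 16b depth.

    Assumed depth test: the new pixel wins only when its z is strictly smaller.
    """
    if new[3] >= old[3]:
        return old                                   # depth test fails: no update
    r, g, b = (blend_channel(n, o, alpha) for n, o in zip(new[:3], old[:3]))
    return (r, g, b, new[3])                         # write blended color and new depth


def update_span(old_pixels, new_pixels, alpha=255):
    """Process a group of pixels side by side, as eight pixel processors would."""
    return [update_pixel(o, n, alpha) for o, n in zip(old_pixels, new_pixels)]
```

The color path and the depth path touch different fields of the pixel, which is consistent with the slide's point about a parallel datapath for RGB and Z.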
ViSTA Architecture
• Previous Work (ISSCC2000 TP14.7)
  – 8 EP's
  – 64 PP's
  – Parallel EP
• This Work (ViSTA)
  – 1 EP
  – 8 PP's
  – 8-Stage Pipelined EP
  – 1/8 Scaling
  – Virtually Spans 2D Array
  – Dynamic Bus Reconfiguration
[Figure: previous work drawn as Control plus eight rows of EP, PP, and M blocks; this work drawn as one EP, PP blocks, Interface, and M blocks]
Frame Buffer for 3DRE
[Figure: 512kb DRAM blocks forming the Depth Buffer, Frame Buffer #0, and Frame Buffer #1, with SAM, FBI, and SCLK. Labeled widths: 640b and 640b (Write, Read; From Pixel Processors); 256b and 256b (Depth Buffer); 768b; 384b and 384b; 24b True Color]
• Interchangeable Double Color-Buffers
• Concurrent Data Transfer
• 1280b x 20MHz = 3.2GB/sec
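The bandwidth figure and the buffer sizes implied by the target specifications can be checked with a few lines of arithmetic. This is a worked example, not data from the slides beyond the quoted widths, clock, resolution, and bit depths. Reading 640b as 16 pixels of 24b color plus 16b depth is an interpretation, and the helper name is hypothetical.

```python
WIDTH = HEIGHT = 256           # resolution from the target specifications
COLOR_BITS, Z_BITS = 24, 16    # 24b true color, 16b Z-buffering


def bus_bandwidth_bytes_per_sec(width_bits, clock_hz):
    """Peak bandwidth of a bus that moves width_bits on every clock cycle."""
    return width_bits * clock_hz // 8


# Peak frame-buffer bandwidth quoted on the slide: 1280b at 20MHz.
peak = bus_bandwidth_bytes_per_sec(1280, 20_000_000)     # 3,200,000,000 B/s = 3.2GB/sec

# Storage implied by the target specifications.
color_buffer_bits = WIDTH * HEIGHT * COLOR_BITS          # 1,572,864 bits (1.5Mb)
z_buffer_bits = WIDTH * HEIGHT * Z_BITS                  # 1,048,576 bits (1Mb)
total_bits = 2 * color_buffer_bits + z_buffer_bits       # double color buffers plus Z

# Interpretation: one 640b access holds 16 pixels of color and depth.
pixels_per_640b = 640 // (COLOR_BITS + Z_BITS)           # 16
color_part = pixels_per_640b * COLOR_BITS                # 384, matching the 384b label
depth_part = pixels_per_640b * Z_BITS                    # 256, matching the 256b label
```

Under this reading, 1280b is two 640b paths (one write, one read), and each 640b path splits into the 384b color and 256b depth widths shown in the figure.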
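Single Bitline Writing Scheme: Low Power Technique (4)
[Figure: waveforms for WL, BL (Real), and /BL (Ref.) with levels Vcc, Vcc/2, and GND; control signals BIS_0 and BIS_1]
• Single Bitline Writing: No Transitions in /BL
• 20% Power Reduction in Data Sensing
[Figure: power comparison for an 8Kb Cell Array with 2K Column. Categories: Periphery & Control, Data Sensing. Values shown: 30.02 mW, 19.3 mW, 30.47 mW, 15.0 mW]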
System Power Consumption
[Figure: bar chart of Power (mW), axis marked from 25 to 175. Bars: Conventional Design - I (Ext. FB), Conventional Design - II (Embedded FB), and Proposed System (This Work), broken down into Logic, Data I/O, and eDRAM Macro. Annotated values: 1000, 400~700, and 160mW. Annotation: By Embedding DRAM]
Die Photograph
[Figure: die photo. Labeled blocks: 3D Rendering Engine, DLL, 32bit RISC, MCA, Bandwidth Equalizer, MC Frame Buffer #1, MC Frame Buffer #2, 3DRE Frame Buffer, 3DRE Z-Buffer, Internal DRAM, SAM, YCrCb to RGB SAM]
• 0.18um EML Technology with 3-poly, 6-metal
• 240pin QFP
• 84mm2 (14 x 7 Including I/O Cells)
Chip Features (Physical)
[Figure: per-block annotations. Labeled blocks: MC Accelerator, Frame Buffer, 3D Rendering Engine, Frame Buffer, B.W. Equalizer, ARM9. Annotations shown:]
• 80MHz, 1.5V, 12mW, 1.7mm2
• 20MHz, 1.5V, 4.6mW, 2.3mm2
• 20MHz, 2.5V, 11.7mW, 5.25mm2
• 20MHz, 1.5V, 36mW, 5mm2
• 80MHz, 2.5V, 84mW, 16.4mm2
• 80/20MHz, 1.5V, 4mW, 1.6mm2
• < 40mW
• < 140mW
• I/O Cells: 3.3V
Conclusions
• Low Power Multimedia Processor for Mobile Applications
  – Optimized H/W & S/W Mixed System
  – Multimedia Signal Processing
    • Not only 2D Image, But also 3D Graphics
  – Low Power Techniques
    • Distributed Nine-Tiled Block Mapping
    • Partial Activation, Partial I/O Scheme
    • Adaptive Fetch Control Scheme
    • Single Bitline Writing Scheme
• 160mW, 84mm2