UNIVERSITY OF MICHIGAN
SODA: A Low-power Architecture For Software Radio
Authors: Yuan Lin, Hyunseok Lee, Mark Woh, Yoav Harel, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Laboratory
University of Michigan at Ann Arbor
Chaitali Chakrabarti
Department of Electrical Engineering
Arizona State University
Krisztián Flautner
ARM, Ltd.
Presenters: Wei Miao, Jingcheng Wang
Overview
- Introduction on SDR
- Behavior model and design tradeoff
- Architecture analysis
- Performance analysis
- Summary
INTRODUCTION AND ANALYSIS
Wei Miao
Basic introduction on SODA
- Signal-processing On-Demand Architecture
- Supports software radio
- 4 cores, each containing asymmetric pipelines
- Meets the requirements of 2 Mbps WCDMA and 24 Mbps 802.11a
Introduction on SDR
- Software Defined Radio (SDR)
- Decode different signals on a single processor
Why SDR?
- Easy to implement & update
- Multi-mode operation
- Prototyping and bug fixes
- Shorter time to develop
[Figure: wireless protocols shown around SDR: UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, 802.11n]
(Picture from Lin, ISCA’06 slides)
Challenges of SDR
- Need to achieve high throughput
- Power limitation
Wireless protocols behavior
- Feed-forward, multiple kernels
- Low but heterogeneous requirements for inter-kernel communication
- Real-time deadlines
- Heavy data parallelism
- 8-16 bit data width
- Scalar and vector operations
Design Tradeoff
- Concurrent execution vs. single-context execution
- Static multi-core scheduling vs. multi-threading
- Vector vs. SIMD vs. VLIW
SODA ARCHITECTURE AND RESULTS
Jingcheng Wang
SODA System Architecture
- 4 PEs: static kernel mapping and scheduling, SIMD + scalar units
- 1 ARM GPP controller: scalar algorithms and protocol controls
[Figure: SODA system architecture. Blocks: ARM, four PEs (each with a local memory and an execution unit), and global memory. Per-PE labels: SIMD RF, SIMD MEM, scalar RF, scalar MEM, WtoS & StoW, DMA, scalar ALU, SIMD ALU.]
(From Lin, ISCA’06 slides)
SODA PE Architecture
- 2-issue LIW (400 MHz): SIMD + (scalar or AGU)
- DMA: mem-to-mem transfer, access to global memory
[Figure: SODA PE block diagram. Labels: SIMD pipeline (32-way SIMD, per-lane RF, 16-bit ALU, 16-bit multiplier), SIMD memory (8 KB), 512-bit, 32x16-bit SSN, scalar pipeline (scalar RF, 16-bit ALU), scalar memory (4 KB), AGU pipeline (AGU RF, 12-bit AGU ALU), instruction memory (4 KB), vector-to-scalar stages 1 and 2, scalar-to-vector, DMA, 16-bit bus.]
(From Lin, ISCA’06 slides)
SODA PE Scalar Pipeline
Scalar:
- One 16-bit datapath
- No mult unit
Scalar memory:
- 16-bit port
- 1 read/write port
- 4 KBytes
Scalar-to-Vector and Vector-to-Scalar units
[Figure: SODA PE block diagram]
(From Lin, ISCA’06 slides)
SODA PE SIMD Pipeline
Lane detail:
- RF: 16-bit, 16 entries, 2 read / 1 write port
- EX: 16-bit multiplier, 40-bit ACC, 16-bit ALU
- WB: 16-bit
[Figure: SODA PE block diagram]
(From Lin, ISCA’06 slides)
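The 16-bit multiplier feeding a 40-bit accumulator can be illustrated with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, and wrap-around on overflow is assumed here because the slides do not say whether the accumulator wraps or saturates.

```python
# Hypothetical model of one SIMD lane's multiply-accumulate path.
# Widths come from the slide; the overflow behaviour (wrap) is an assumption.
OPERAND_BITS = 16
ACC_BITS = 40


def wrap_signed(value, bits):
    """Wrap an integer into the two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # If the sign bit is set, reinterpret the value as negative.
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, a, b):
    """acc + a*b with 16-bit operands and a 40-bit accumulator."""
    a = wrap_signed(a, OPERAND_BITS)
    b = wrap_signed(b, OPERAND_BITS)
    return wrap_signed(acc + a * b, ACC_BITS)


def dot(xs, ys):
    """Dot product computed the way a single lane would accumulate it."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc
```

A 16x16 product needs up to 32 bits, so a 40-bit accumulator leaves 8 extra bits of headroom. In this model, 256 products of the largest positive operands (32767 x 32767) still accumulate without wrapping.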
SODA PE SIMD Pipeline
SIMD:
- 32 wide
- Predicated execution
- Predicated negation
Memory:
- 512-bit port
- 1 read port
- 1 write port
- 8 KBytes
[Figure: SODA PE block diagram]
(From Lin, ISCA’06 slides)
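To make the two predicated operations concrete, here is a minimal sketch of how a 32-wide lane mask could gate an add and a negation. The function names are hypothetical, and the despreading use at the end is one plausible application (multiplying samples by a +1/-1 code), not something the slides state.

```python
# Hypothetical lane-wise model of predicated SIMD operations (32 lanes).
LANES = 32


def _check(*vectors):
    """Every operand must have exactly one element per lane."""
    for vec in vectors:
        if len(vec) != LANES:
            raise ValueError(f"expected {LANES} lanes, got {len(vec)}")


def predicated_add(dst, a, b, pred):
    """Lanes with pred set receive a+b; the other lanes keep dst."""
    _check(dst, a, b, pred)
    return [x + y if p else d for d, x, y, p in zip(dst, a, b, pred)]


def predicated_negate(vec, pred):
    """Lanes with pred set are negated; the other lanes pass through."""
    _check(vec, pred)
    return [-v if p else v for v, p in zip(vec, pred)]


def despread(samples, code_is_minus_one):
    """Assumed use: multiply by a +1/-1 code without a multiplier, then sum."""
    return sum(predicated_negate(samples, code_is_minus_one))
```

With this formulation a multiplication by +1 or -1 becomes a masked negation, which avoids the multiplier entirely.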
SODA PE SIMD Shuffle Network
SIMD Shuffle Network (SSN):
- Shuffle Exchange (SE)
- Inverse Shuffle Exchange (ISE)
- Exchange Only (EX)
- Iterative Feedback
[Figure: SODA PE block diagram]
(From Lin, ISCA’06 slides)
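The permutations named on this slide can be sketched in a few lines. This is a textbook model of shuffle-exchange networks, not the SODA hardware: the function names, the per-pair swap-control bits, and the `run_passes` helper that stands in for iterative feedback are all assumptions made for illustration.

```python
# Textbook shuffle-exchange permutations on a vector of lanes.
# All names here are hypothetical; control encoding is an assumption.

def shuffle(v):
    """Perfect shuffle: interleave the lower and upper halves."""
    half = len(v) // 2
    out = []
    for lo, hi in zip(v[:half], v[half:]):
        out += [lo, hi]
    return out


def inverse_shuffle(v):
    """Inverse perfect shuffle: even positions first, then odd positions."""
    return v[0::2] + v[1::2]


def exchange(v, swap):
    """Swap neighbouring pairs (2k, 2k+1) wherever swap[k] is set."""
    out = list(v)
    for pair, do_swap in enumerate(swap):
        if do_swap:
            i = 2 * pair
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def run_passes(v, passes):
    """Iterative feedback: feed the output of one pass into the next.

    Each pass is (network, swap) with network in {"SE", "ISE", "EX"}.
    """
    for network, swap in passes:
        if network == "SE":
            v = exchange(shuffle(v), swap)
        elif network == "ISE":
            v = exchange(inverse_shuffle(v), swap)
        elif network == "EX":
            v = exchange(v, swap)
        else:
            raise ValueError(f"unknown network: {network}")
    return v
```

For 32 lanes, five shuffle passes with no swaps return the original order, since each perfect shuffle rotates the 5-bit lane index by one bit. More complex permutations are built by choosing swap bits per pass.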
W-CDMA Mapping On SODA
[Figure: W-CDMA physical layer processing. Transmitter blocks: channel encoder, interleaver, spreader, scrambler, LPF-Tx, D/A. Receiver blocks: A/D, LPF-Rx, searcher, descrambler, despreader, combiner, deinterleaver, channel decoder (turbo/Viterbi). Also labeled: frontend, upper layers.]
[Figure: mapping of the WCDMA transmitter and receiver onto the ARM, four PEs, and global memory. Kernel labels: 2 LPF-Rx, 4 LPF-Rx, scrambler, spreader, turbo encoder, interleaver, de-scrambler, despreader, combiner, searcher, turbo decoder, de-interleaver, power control, PN code TX/RX, misc. control. Buffer labels: 10 Bytes, 1 KBytes, 2 KBytes, 20 KBytes, and a 12.5 KByte FIFO queue.]
(From Lin, ISCA’06 slides)
SIMD Design and Tradeoffs
- 40 GOPS required
- In a 4-PE system, 10 GOPS in each PE
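The per-PE budget can be checked against the SIMD width and clock quoted on the PE architecture slide. The sketch below assumes one operation per SIMD lane per cycle and ignores the scalar and AGU pipelines. That is a simplification for illustration, not a figure from the paper.

```python
# Back-of-the-envelope throughput check. Constants are from the slides;
# "one op per lane per cycle" is an assumption of this sketch.
REQUIRED_GOPS = 40.0
NUM_PES = 4
SIMD_LANES = 32
CLOCK_HZ = 400e6


def per_pe_requirement_gops():
    """Required throughput split evenly over the PEs."""
    return REQUIRED_GOPS / NUM_PES


def peak_simd_gops():
    """Peak SIMD throughput of one PE under the stated assumption."""
    return SIMD_LANES * CLOCK_HZ / 1e9


def required_simd_utilization():
    """Fraction of peak SIMD throughput needed to meet the budget."""
    return per_pe_requirement_gops() / peak_simd_gops()


if __name__ == "__main__":
    print(f"per-PE requirement: {per_pe_requirement_gops():.1f} GOPS")
    print(f"peak SIMD per PE:   {peak_simd_gops():.1f} GOPS")
    print(f"needed utilization: {required_simd_utilization():.1%}")
```

Under these assumptions a 32-lane PE at 400 MHz peaks at 12.8 GOPS, so the 10 GOPS target corresponds to roughly 78% of peak.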
Low-power Design
- Clustered register files with 2 read ports and 1 write port
- Fewer memory read/write ports
- Smaller instruction fetch logic
Experiment Methodology
- Area and power estimates calculated using an RTL Verilog model
- Synthesized using Synopsys Physical Compiler and a TSMC 180nm library
- Memories generated by the Artisan SRAM generator
- 90nm and 65nm processes estimated using a quadratic scaling factor
- Dynamic power estimated from behavioral simulation on their system simulator
- Leakage power estimated at 30% of the total power
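The two estimation rules above reduce to simple arithmetic. The sketch below shows one reading of them: a quantity measured at 180nm is scaled by the square of the feature-size ratio, and total power is derived from dynamic power by assuming leakage is 30% of the total. The function names are hypothetical, and the slides do not say exactly which quantities the authors scaled this way.

```python
# One reading of the estimation rules on this slide (names are hypothetical).

def quadratic_scale(value_at_base, target_nm, base_nm=180.0):
    """Scale a value by the square of the feature-size ratio."""
    return value_at_base * (target_nm / base_nm) ** 2


def total_power_from_dynamic(dynamic_power, leakage_fraction=0.30):
    """If leakage is a fixed fraction of the total, total = dynamic / (1 - f)."""
    if not 0.0 <= leakage_fraction < 1.0:
        raise ValueError("leakage_fraction must be in [0, 1)")
    return dynamic_power / (1.0 - leakage_fraction)


if __name__ == "__main__":
    # Scaling factors relative to 180nm; no measured data is used here.
    for node in (90.0, 65.0):
        print(f"{node:.0f}nm factor: {quadratic_scale(1.0, node):.4f}")
```

The scaling factor is 0.25 at 90nm and about 0.13 at 65nm, which is why the projected numbers are much smaller than the synthesized 180nm results.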
Performance results
Power and Area Results
- Typical cellular phone power for the physical layer: ~200 mW
Discussion Points
1. The authors only synthesized the core in TSMC 180nm and estimated the area and power for 90nm and 65nm. Is it fair to claim that the architecture meets the requirement?
2. The authors claim to reduce the CDMA search algorithm from 26.5 Gops on a general-purpose processor to 200 Mops on SODA, mainly because of SIMD execution. Is SIMD the only and main speedup factor? Is the novelty of the paper enough?
3. Utilization of the 4 PEs is 60%, 50%, 100% and 94% respectively. Can it do better?
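A quick calculation helps frame the searcher and utilization questions. The sketch assumes both searcher figures are operation rates for the same workload, and that a 32-wide SIMD instruction replaces at most 32 scalar operations. Neither assumption is confirmed by the slides.

```python
# Arithmetic behind the discussion points. Inputs are the figures quoted
# above; how to interpret them is an assumption of this sketch.
GP_OPS_PER_S = 26.5e9      # searcher on a general-purpose processor
SODA_OPS_PER_S = 200e6     # searcher on SODA
SIMD_WIDTH = 32
PE_UTILIZATION = [0.60, 0.50, 1.00, 0.94]


def op_reduction():
    """Overall reduction in operation rate."""
    return GP_OPS_PER_S / SODA_OPS_PER_S


def reduction_beyond_simd():
    """Reduction left over after crediting SIMD with a full 32x."""
    return op_reduction() / SIMD_WIDTH


def mean_utilization():
    """Average utilization across the four PEs."""
    return sum(PE_UTILIZATION) / len(PE_UTILIZATION)


if __name__ == "__main__":
    print(f"overall reduction:     {op_reduction():.1f}x")
    print(f"beyond 32-wide SIMD:   {reduction_beyond_simd():.2f}x")
    print(f"mean PE utilization:   {mean_utilization():.0%}")
```

Under this reading the overall reduction is 132.5x, so SIMD width alone would leave a further factor of about 4.1x to be explained by other features. The mean utilization of the four PEs is 76%.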
Reference
- http://cccp.eecs.umich.edu/slides/lin-isca06.ppt
- http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1635943&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F10899%2F34298%2F01635943.pdf%3Farnumber%3D1635943