Platform-based Design
Prof. Jun-Dong Cho, Sungkyunkwan University
Outline
Why Platform-based Design?
The need for S/W configurable platforms
Design Space of Reconfigurable Architectures
Reconfigurable Radio and Multimedia Systems
Network-centric Design: Clock and Power
Reliable Design
SoC and Customizable Platform Based-Design
[Figure: SoC platform built around a controller CPU with RAM, ROM, and Flash, combined with a DSP, ASIC 1, ASIC 2, coarse-grain reconfigurable hardware, and fine-grain reconfigurable hardware; open slots are marked "?"]
Semiconductor Revolutions
[Figure: Makimoto's Wave, "Mainstream silicon application is switching every 10 years". Timeline from 1957 to 2007 alternating between standard and custom phases. Labels: TTL; LSI, MSI; µproc., memory; ASICs, accelerators; FPGAs; coarse-grain reconfigurable. Annotations: 1st design crisis, 2nd design crisis; hardware, software; instruction streams, data streams; structured VLSI design.]
Definition of Platforms?
•An architecture that is designed for an application domain
Platform classification
Application platform
Multimedia platforms: Nexperia, TI OMAP
3G wireless platform: Infineon M-gold
Bluetooth platform: Parthus
Wireless platform: ARM PrimeXsys
Processor-centric platform: Improv Systems, ARC, Tensilica, Triscend
Communication-centric platform: Sonics, Palmchip
SoC Platform Adaptation
The Platform-Based Design Concept Cadence
ApplicationSpace
HW-SW Kernel
MEM
FPGACPU Processor(s), RTOS(es)
and SW architecture
* IP can be hardware (digital or analogue) or software. IP can be hard, soft or‘firm’ (HW), source orobject (SW)
Scaleablebus, test, power, IO,clock, timing architectures
+ Reference Design
Programmable
SW IP
Hardware IP
Pre-Qualified/VerifiedFoundation-IP*
Foundry-SpecificHW Qualification
Reconfigurable Hardware Region(FPGA, LPGA, …)
SW architecturecharacterisation
Platform Architecture
How fast will my user interface software run?
How much can I fit onto my microcontroller?
Which bus? PI? AMBA? Dedicated bus for DSP?
Which RTOS do I use? Which scheduling policy do I have to choose?
Do I need a dedicated DSP?
Which microcontroller? ARM? HC11? ARC?
Can I buy a QCELP decoding core?
Do I need dedicated HW, or can I run this on the microcontroller?
Example of a commercial SoC
More CPUs?More SRAM/Flash?Add FPGA?
A Legacy SoC ApproachCoreConnect (PPC), AMBA (ARM)…
Networks-on-Silicon, Philips
MP-SOC Cluster
Definition of MP-SOC?
Usually heterogeneous multiprocessor:
CPUs, DSPs, etc.Hardwired accelerators.Mixed-signal front end.
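Problems with existing MP-SoCs
▷ Under power constraints, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power consumption grows linearly.
▷ On-chip interconnect has been designed independently, separate from the cores and caches.
▷ The interconnect is not only wire-dependent but also logic-dependent.
▷ Processors are constrained by wire and memory latencies.
▷ Peak performance is reached only for specific application domains.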
4G: Multiple standards
Communications.Networking.Multimedia.Security.
Digital RF supporting multi-band/multimode operation
The triangle, Chicken and Egg?
architectures
applications
methodologies
•Hardware and software architectures determine capabilities.
•Applications guide design decisions.
•Methodologies allow repeatable, predictable design.
DSP
Why Multi-Threaded Cores?
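[Figure: multiprocessor SoC with multithreaded DSPs (H/W-MT), RISC and general-purpose processors with instruction and data caches, hardware processing elements, and SRAM, connected through a NoC with In and Out ports]
Increasing gap: memory & processor speeds (2x / 2 years)
Increasing gap: interconnect & gate delays (multi-clock)
More parallel processing (lower-power, higher-perf./mm2)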
MPSoC “Bus” Alternatives
• Fixed Bus [Bergamaschi, DAC, 2000]
– Point to point communication
– Signals between cores transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
– Programmable interconnects
– Employ static network
• Arbitrated Bus [IDT Inc., 2000]
– Time-shared multiple core connectivity
– Use arbitrator
• Hierarchical Bus [AMBA, ARM Inc]
– Combine multiple buses using bus
– Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
– Resources communicate with data packets
– Use switch fabric
Future mobile platform?
Mobile supercomputing
Speech recognition.
Cryptography.
Augmented reality.
Typical applications (email, etc.).
Requires 16x 2 GHz Pentium 4.
Peak power must not exceed 75 mW
Source: Mudge et al.
Future mobile application platform?
Culture and Education? Personal Entertainment
Platform?
Road Map to MP-SoC
Trends
Mask NRE: over $1M; design NRE: $10M to $75M
ASICs replaced by programmable ASSPs, FPGAs
Number of embedded processors
DVD/STB/HDTV, mobile phones: 5 to 8
Image proc, networking, basestation: 8 to 100+
eS/W complexity
Set-top box, audio: >1 million lines of code
eS/W becoming an essential part of SoCs
?’s Law??’s Law?
Tape Out
VerifyCompose the system
VerifySimulate
VerifySoC Composer
Verify (timing, area)Synthesis + P&R
VerifySimulate (performance)
Should the SoC designer work hard?
Requirements
Why is verification important in mobile SoCs? Why has our verification become weak?
More SoC topics …
Platform optimization
Power management
BW allocation
Resource sharing
Task distribution
Efficient communications
Low Power
Verification
• Training talent (system architects)
Available Mobile and VLIW Processors
The ARM Family
The ARM7 Generation
The StrongARM
The ARM Thumb Option
The ARM Piccolo Option
The ARM9 and ARM10
The Motorola M-Core
The LSI TinyRisc
The Hitachi SuperH Family
VLIW Processors
The Motorola-Lucent Star*Core
The Philips TriMedia
The HP/Intel IA-64
Nexperia™ DVP Hardware architecture (source: Th. Claasen, Philips, DAC 2000)
1~8 2~6
Exploitable Parallelism
[Figure: exploitable task parallelism versus minimum parallel grain size (instructions). Labels: instruction-level parallelism; MultiFlex thread-level parallelism; GP O/S thread-level parallelism; grain sizes 1, 100's, and 10 000's of instructions; exploitable task parallelism 1~100]
NEC MP211: Homogeneous MP core
Asymmetric MP with very coarse grain multitasking
3 ARM9's utilized as predefined function units
No complex overhead: e.g. no cache coherency, dynamic scheduling/load balancing
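Advantages of MP-SoC
Easy hardware implementation: using processor cores that are already in wide use shortens hardware development time and lowers cost.
Lower power consumption: the distributed tasks are handled by multiple processors running at a reduced clock frequency. A lower clock frequency allows a lower supply voltage, which reduces power consumption.
Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.
Boosting real-time performance: each application can run on a different processor, which reduces interference between multiple applications.
Higher system safety: system software and untrusted applications can be separated by running them on different processors.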
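The power argument above can be made concrete with the standard CMOS dynamic-power relation P = C·V²·f. The sketch below is illustrative only: the capacitance, voltages, and frequencies are made-up values, and it assumes that the workload splits perfectly across two cores and that the supply voltage can be lowered together with the clock.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Dynamic CMOS power in watts: P = C * V^2 * f."""
    return c_eff * vdd ** 2 * freq_hz

# Hypothetical numbers, chosen only for illustration.
C_EFF = 1e-9  # effective switched capacitance per core (farads)

# One core running at full speed.
single = dynamic_power(C_EFF, vdd=1.2, freq_hz=400e6)

# Two cores, each at half the clock; assume the slower clock permits 0.9 V.
dual = 2 * dynamic_power(C_EFF, vdd=0.9, freq_hz=200e6)

print(f"single core: {single:.3f} W")
print(f"dual core:   {dual:.3f} W")
print(f"ratio:       {dual / single:.4f}")
```

With these assumed values the two slower cores deliver the same aggregate clock rate at a little over half the dynamic power; leakage and communication overhead are ignored.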
AMP task allocation image
Bus and Memory Architecture
MP211 block diagram
Power consumption of H.264+AAC
Holistic design of multi-core architectures
Naïve methodology is inefficient
Demonstrated inefficiency for cores and proposed alternatives
Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
Conjoined-core Chip Multiprocessing [MICRO04]
What about interconnects?
How much can interconnects impact processor architecture?
Need to be co-designed with caches and cores?
Heterogeneous MP Core
▷ A single-ISA heterogeneous multicore architecture outperforms approaches that rely on voltage scaling, clock gating, or speculation control.
▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large cores. The best choice also depends on the system context, such as running on battery versus mains power. It is therefore efficient to use cores of differing complexity.
▷ A multi-ISA multicore architecture consists of processors with different ISAs and is designed to exploit vector/data-level parallelism and instruction-level parallelism at the same time. In a single-ISA heterogeneous CMP, however, all cores execute the same ISA, so each application can be mapped to any core. The number, size, and type of the cores, and the caches, must be decided. For an 8-core processor, the interconnect consumes as much power as one core. Applying dynamic voltage scaling and gating unused cores improves the energy-delay product by 75%.
▷ Taking a dual processor as an example, heterogeneous processors that exploit low and high thread levels improve performance by 63% over homogeneous ones.
With 5-8 threads, the average improvement is 29%. By Amdahl's law, the speedup of parallel applications is limited by their serial portion.
▷ The serial portion is executed quickly on a large core, while the parallel portion runs on small, low-power cores, maximizing performance per unit of power. [Annavaram, et al]
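Amdahl's law, referred to above, can be written as speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction of the work and n the number of cores. A minimal sketch follows; the parallel fractions and core counts are arbitrary illustrative values, not figures from the slides.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

for p in (0.5, 0.9, 0.99):
    row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.2f}x" for n in (2, 8, 64))
    print(f"parallel fraction {p}: {row}")
```

Because the serial term does not shrink with n, a program that is 90% parallel can never exceed a 10x speedup, which is the motivation for running the serial part on one large, fast core.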
Problems with heterogeneous MP-SoCs
Processors are bound by wire and memory latencies
Peak performance on only a small class of applications.
How well they map to a given design
Diversification of workloads
Increased hardware complexity
Poor resource utilization
Alpha cores scaled to 0.10 um. EV8 is 80 times bigger but provides only two to three times more single-threaded performance
Heterogeneous MP Core
If two or more cores share L2, the way a lot of present CMPs do, a crossbar provides a high bandwidth connection.
A multi-ISA multicore architecture consists of processors with different ISAs and is designed to exploit vector/data-level parallelism and instruction-level parallelism at the same time.
Characteristics of heterogeneous platforms
For an 8-core processor, the interconnect consumes as much power as one core. Applying dynamic voltage scaling and gating unused cores improves the energy-delay product by 75%.
Taking a dual processor as an example, heterogeneous processors that exploit low and high thread levels improve performance by 63% over homogeneous ones. With 5-8 threads, the average improvement is 29%. By Amdahl's law, the speedup of parallel applications is limited by their serial portion.
10 Performance of heuristics for equal-area heterogeneous architectures with multithreaded cores.
Exploring the potential from heterogeneity
CT 3400 Multi-core DSP
Eight 32-bit DSP cores
Six 32-bit general-purpose processor cores
128-pin programmable I/O subsystem
Supports C-programmable H.264 and MPEG4 code
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
H.264 encoder , decoder and audio codecs and the system control
H.264 codec onto the cradle CT3400 MDSP
CT 3400 Multi-core DSP
CT3400 DSP Engine
http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
DSP Engine
Each DSP engine contains
A Single Instruction Multiple Data
Arithmetic Logic Unit (SIMD ALU)
A Packed Integer Multiplier
Accumulator (PIMAC)
A Floating Point Unit (FPU)
Bi-directional FIFO data buffers
DMA channels
A 128 x 32 register and
A 512 x 20 program memory
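CT3600 product family
The CT3616 offers the industry's best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as the nearest competing product
The industry's first single-chip real-time D1 H.264 Main Profile video encoder based on programmable DSPs
0.13-micron technology, 16 DSPs, and 8 general-purpose processors quadruple overall performance
$40 to $90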
CT3600 Multiprocessor DSP Family Members
http://www.cradle.com/downloads/CT3600-PB.pdf
CT 3616 Multi-core DSP
http://www.cradle.com/downloads/CT3600-PB.pdf
Problems with homogeneous MP-SoCs
The hardware must be configurable for efficient execution across broad class of application.
Each core consists of an array of homogenous processing execution nodes, a banked Instruction Cache, Data Cache, register file and block control logic.
Some of the resources (called polymorphous resources) in the TRIPS architecture can be configured to operate differently depending on the mode (instruction, thread or data parallelism).
HiBRID-SoC Architecture
HiBRID-SoC multi-core system-on-chip architecture
Integrates a powerful on-chip communication structure
A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
Dedicated chips for the MPEG-4 Simple Profile; very general processing demand
Three programmable cores, each adapted towards a specific class of algorithms
Combination of the cores and their software development environment
An extension of a programmable core with dedicated modules (e.g., TriMedia)
HIBRID-SoC multi core
Developed at the University of Hannover
Multi-Core SoC Architecture
Instruction level: VLIW (very long instruction word)
Data level: SIMD (single instruction multiple data)
Task level: (simultaneous multithreading)
HiPAR-DSP: 16-datapath SIMD processor core controlled by VLIW, particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
Stream Processor (SP): 32-bit RISC architecture that is more optimized towards control-dominated tasks (bitstream processing or global system control)
Macroblock Processor (MP): efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit); controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps
HiBRID-SoC multi-core architecture
64-bit AMBA AHB system bus
Connects all cores to SDRAM memory via a 64-bit SDRAM interface
Two versatile 32-bit host interfaces for access (e.g., to a host PC via PCI and to serial flash memory)
Figure 1. HiBRID-SoC multi-core architecture
HiPAR-DSP
Highly parallel DSP core with a VLIW-controlled SIMD architecture
Memory concept provides an easy data exchange between the data paths, which is required for many filter and image processing algorithms
DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMACs
Figure 2. HiPAR-DSP architecture
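The quoted throughput is consistent with the 16 SIMD data paths mentioned for the HiPAR-DSP, if one assumes (the slides do not state this) that each data path completes one MAC per clock cycle. A quick arithmetic check under that assumption:

```python
DATAPATHS = 16                 # from the slides
CLOCK_HZ = 145e6               # from the slides
MACS_PER_PATH_PER_CYCLE = 1    # assumption, not stated in the source

gmacs = DATAPATHS * CLOCK_HZ * MACS_PER_PATH_PER_CYCLE / 1e9
print(f"{gmacs:.2f} GMAC/s")   # prints "2.32 GMAC/s"
```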
Stream Processor
Stream Processor
The SP has been optimized for high-level programmability and efficient processing of control-driven applications
Harvard architecture with a 32-bit data path consisting of 5 pipeline stages and controlled by 32-bit RISC instructions.
Supports conditional execution and forwarding interlocks, and provides full interrupt capability
Converts the 64-bit AMBA bus width to the 32-bit internal
Macroblock processor
Macroblock processor
Heterogeneous data path structure consisting of a scalar and a vector data path
The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
The vector data path is equipped with a 64-entry register file of 64-bit width
Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms.
MUL/MAC or ALU incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
Figure 3. Macroblock processor data paths.
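Subword parallelism as described for the vector data path can be modelled in software by treating one 64-bit word as two, four, or eight independent lanes. The sketch below is a generic illustration, not the macroblock processor's actual instruction set; the function names and the wrap-around behaviour on overflow are assumptions.

```python
def pack(values, width):
    """Pack unsigned lanes (lane 0 in the low bits) into one 64-bit word."""
    assert len(values) * width == 64
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << width) - 1)) << (i * width)
    return word

def unpack(word, width):
    """Split a 64-bit word into 64 // width unsigned lanes."""
    lane_mask = (1 << width) - 1
    return [(word >> (i * width)) & lane_mask for i in range(64 // width)]

def simd_add(a, b, width):
    """Lane-wise wrapping add; carries never cross a lane boundary."""
    lanes = [x + y for x, y in zip(unpack(a, width), unpack(b, width))]
    return pack(lanes, width)

a = pack([250, 1, 2, 3, 4, 5, 6, 7], 8)   # eight 8-bit lanes
b = pack([10] * 8, 8)
print(unpack(simd_add(a, b, 8), 8))        # [4, 11, 12, 13, 14, 15, 16, 17]
```

The same `simd_add` works for `width=16` and `width=32`, mirroring the four 16-bit and two 32-bit modes.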
HiBRID-SoC Implementations
Figure 3. Chip layout of the HiBRID-SoC.
Table 1. MPEG-4 ASP decoder (full TV resolution) performance on MP and SP,
720*576@25Hz,1.5-3 Mbits:
HiBRID-SoC is fabricated in a 0.18 um,
6LM standard-cell technology,
14 million tr’s 3.5W
occupies 82 mm2, and operates at 145 MHz
Analyzing On-chip Communication in MPSoC Environment
Proceedings of Design, Automation and Test '04, Mirko Loghi et al.
•Analysis and trade-off exploration of on-chip communication architectures.
•Comparison and analysis of two practical configurations: AMBA-AHB (ARM) and STBus (STMicroelectronics).
•Models the hardware and software of an MPSoC at a high level of accuracy and with sufficient simulation speed.
•Provides realistic performance figures by stimulating the communication system with functional traffic.
Multiprocessor simulation platform
• Homogeneous MPSoC platform.
• Configurable number of 32-bit ARM processors.
• Processor cores : GPL-licensed ARM Instruction Set Simulator (ISS) SWARM in C++
• Private memories for each processor.
• A shared memory
• A hardware interrupt module.
• 32-bit interconnection
• All components are wrapped in SystemC
Hardware architecture:
Multiprocessor simulation platform
Benchmarks running with RTEMS-OS:
• Running on top of RTEMS
• Synchronization: use OS queues to exchange matrices between processors.
• Benchmark 1: Independent matrix multiplication.
• Benchmark 2: Pipeline of matrix multiplication
Benchmark 1: Independent matrix multiplication:
• Performs an independent matrix multiplication on each processor
• Does not require interprocessor communication.
• Operands are stored in the private memory of each processor.
Multiprocessor simulation platform
Benchmark 2: Pipeline matrix multiplication:
• The platform receives a continuous flow of input and output
• The operation of every core follows this pattern:
Copies the input matrix from shared memory to private space
Multiplies the input matrix by a matrix already stored in private space
Copies the resulting matrix back to shared space.
• Interrupt and semaphore slaves are queried to keep all processes synchronized.
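The per-core pattern of the pipeline benchmark (copy in, multiply by a resident matrix, copy out) can be sketched in a few lines. This is a sequential functional model only: the real benchmark runs the stages concurrently on separate cores and synchronizes through interrupts and semaphores, which are not modelled here. The function names are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def run_pipeline(stage_matrices, inputs):
    """Each stage stands in for one core holding a fixed private matrix."""
    outputs = []
    for shared in inputs:                           # continuous flow of inputs
        for private_matrix in stage_matrices:
            local = [row[:] for row in shared]      # copy shared -> private
            result = matmul(local, private_matrix)  # multiply by resident matrix
            shared = [row[:] for row in result]     # copy result -> shared
        outputs.append(shared)
    return outputs

double = [[2, 0], [0, 2]]
print(run_pipeline([double, double], [[[1, 2], [3, 4]]]))
# [[[4, 8], [12, 16]]]
```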
Multiprocessor simulation platform
Code development and analysis tools:
• Development tool: GNU cross-compiler
• Allows flexible profiling through simulator functions.
• Output of simulator:
• Statistics about processor and interconnect performance.
• VCD waveforms of all bus signals
• Traces of memory accesses performed by every core.
Features of communication architecture
AMBA-AHB Architecture:
• Traditional shared bus with pipelining.
• Distinct data and address/control buses
• Transfers have a data phase and a control phase.
• Supports bursts as streams of single transactions.
• “Split/retry transfer” and “early burst termination” are used to handle high-latency slaves.
STBus Architecture:
• Protocol type 3: simple load/store operations, pipelining and split transactions, out-of-order support.
• Flexible topology: from shared bus to full crossbar
• Overlapping transfers: requesting a new burst while previous ones are still completing, without idle cycles.
• Fast arbitration with two cycles; minimum latency is three cycles.
Experimental Result
Comparison of interconnect performance
Five interconnects:
• AMBA-AHB
• Shared-bus STBus
• Full crossbar STBus
• Partial crossbar STBus: ST-32
• Partial crossbar STBus: ST-54
Experimental Result
Performance comparison
Experimental Result
Comparison of interconnect performance
Four benchmarks:
• Matrix multiplication independent: ASM-IND
• Matrix multiplication pipeline without OS: ASM-PIP
• Matrix multiplication with OS: OS-IND
• Matrix multiplication pipeline with OS: OS-PIP
Experimental Result
Comparison of performance interconnection
MPSoC Clock and Power
Olivier Franza, Intel
Increased uncertainty with process scaling
Process, voltage, temperature variations, noise, coupling
Affects design margin: over design, power & performance loss
Increased power constraints
Increasing leakage, power (density, delivery) limitations
More transistors mean:
Larger clock distribution networks
Higher capacitance (more load and parasitics)
With each new technology:
Gate delay decreases ~25%
Wire delay increases ~100%
Cross-chip communication increases
Clock needs multiple cycles to cover die
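Compounding the two per-generation figures shows why the clock eventually needs several cycles to cross the die. The sketch below simply applies the slide's approximate rates repeatedly; treating them as constant across generations and normalizing both delays to 1 at generation 0 are assumptions made for illustration.

```python
GATE_FACTOR = 0.75   # gate delay decreases ~25% per generation (from the slide)
WIRE_FACTOR = 2.0    # wire delay increases ~100% per generation (from the slide)

def relative_delays(generations):
    """Return (gate, wire) delay relative to generation 0."""
    return GATE_FACTOR ** generations, WIRE_FACTOR ** generations

for n in range(4):
    gate, wire = relative_delays(n)
    print(f"gen {n}: gate {gate:.3f}, wire {wire:.1f}, wire/gate {wire / gate:.2f}")
```

Under these assumptions the wire-to-gate delay ratio grows by a factor of about 2.67 per generation.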
Interconnect Delays & Density
Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology
Multiple Clocks due to Interconnect limitation
At reduced performance, larger resource size
Noise in Mixed Signal Systems
Multiple clock domains
Low skew and jitter ALWAYS a must
Clock modeling requires more accuracy
Within-die variations, inductance, crosstalk, electromigration, self-heat, …
Floor plan modularity
Think adding/removing cores seamlessly!
Hierarchical clock partitioning
Reduce global clock and possibly relax its requirements
Generate “locally”-used clock “locally”
Implement clock domain deskewing techniques
Bound clock problem into simple, reliable, efficient domains
DEC/Compaq Alpha
More complex core to improve performance, more complex clocks (?)
Source: DEC/Compaq: Gronoski & al., JSSC 1998; Xanthopoulos & al., ISSCC 2001; Barroso & al., ISCA 2000
Clock and Power Convergence
Intel® Itanium® Montecito
Each core split into 3 clock domains on variable power supply
Each domain controlled by Digital Frequency Divider (DFD) generating low-skew variable-frequency clocks; fed by central PLL and aligned through phase detectors
Regional Voltage Detector (RVD): supply voltage monitor
Second level clock buffer (SLCB): digitally controlled delay buffer for active deskewing
Regional Active Deskew (RAD): phase comparators monitoring and adjusting delay difference between SLCBs
Clock Vernier Device (CVD): digitally controlled delay buffer
Clock generation and distribution are essential enablers of microprocessor performance
On-Chip Interconnects: Circuits and Signaling, Wayne Burleson
• Using Vdd programmability
• High Vdd to devices on critical path
• Low Vdd to devices on non-critical paths
• Vdd off for inactive paths
A: Baseline fabric
B: Fabric with Vdd-configurable interconnect
This work builds on a similar idea for FPGAs described in: Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004
Why Reconfigurable System?
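Includes a GPP and reconfigurable h/w
Goals: power reduction and flexibility
1. Provide Quality of Service that adapts to a dynamic environment
2. A flexible architecture that follows algorithm evolution
3. Fewer platforms to develop and maintain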
Reconfigurable Hardware
[Figure: tasks 1 to N, each drawn as a graph of operations (A to E, H to J, W to Z), scheduled onto reconfigurable hardware]
Energy Efficiency of Reconfigurability
system architecture
communication protocol
O/S and applications
Partitioning of functions between wireless device and services on the network
The mobiles must be flexible enough to accommodate a variety of multimedia services and communication capabilities and adapt to various operating conditions in an (energy) efficient way
The need for S/W configurable platforms
– Doing More by Doing Less: the ability to handle a variety of standards is required (AM, FM, GSM, UMTS, digital broadcasting standards, analog and digital television and other data links)
– A fully software reconfigurable multi-channel broadband sampling receiver for standards in the 100 MHz band
Granularity of reconfiguration
Sébastien PILLEMENT - ENSSAT/LASTI
System-level reconfiguration: Lx, C62 (cluster decomposition)
Functional-level reconfiguration: Pleiades, RaPiD, DART (2001)
Operator-level reconfiguration: Chameleon, PipeRench, MorphoSys (2000)
Gate-level reconfiguration: Napa, GARP, FPGA
The grain size of operations in Reconfigurable System Architectures
Fine grained operations: multiply and addition
Medium grained operations: reconfigurable modules
Coarse grained operations: CPU, host
Design Space of Reconfigurable Architectures
RECONFIGURABLE ARCHITECTURES(R-SOC)
FINE GRAIN(FPGA)
MULTI GRANULARITY(Heterogeneous)
COARSE GRAIN(Systolic)
Processor +Coprocessor
Tile-BasedArchitecture
Coarse Grain Coprocessor
Fine GrainCoprocessor
IslandTopology
Hierarchical Topology
LinearTopology
HierarchicalTopology
MeshTopology
• Chameleon• REMARC• Morphosys
• Pleiades• Garp• FIPSOC• Triscend E5• Triscend A7• Xilinx Virtex-II Pro• Altera Excalibur• Atmel FPSIC
• Xilinx Virtex• Xilinx Spartan• Atmel AT40K• Lattice ispXPGA
• Altera Stratix• Altera Apex• Altera Cyclone
• Systolic Ring• RaPiD• PipeRench
• DART• FPFA
• RAW• CHESS• MATRIX• KressArray• Systolix Pulsedsp
• aSoC• E-FPFA
Lilian Bossuet, LESTER Lab, Université de Bretagne Sud, Lorient, France
Digital Signal Processing With FPGAs
Paul Ekas, Jean-Charles Bouzigues
Multiplier Options In FPGAs
Option                Resource                       Area Usage
1 Logic Multipliers   Logic Elements (Traditional)   500 LEs per 18x18 Multiplier
2 Hard Multipliers    DSP Blocks                     4 18x18 Multipliers per DSP Block
3 Soft Multipliers    RAM                            1 to 2 Embedded Memory Blocks
Logic Elements
Smallest Unit of Logic
Grouped into Logic Array Blocks (LABs) of Ten LEs
Features
Four-Input Look-Up Table (LUT)
Configurable Register
Dynamic Add/Subtract Control
Carry-Select Chain Logic
[Figure: Logic Array Block containing logic elements LE1 to LE10, each with 4 inputs, plus control signals and local interconnect]
DSP Block: Optimized Hard MAC
18 Bit x 18 Bit
4 Multiplies
2 Multiplies with Accumulate
1 Sum of 2 Multipliers (Complex Multiply)
1 Sum of 4 Multiplies
9 Bit x 9 Bit
8 Multiplies
2 Multiplies with Accumulate
2 Sum of 2 Multipliers (Complex Multipliers)
2 Sum of 4 Multiplies
36 Bit x 36 Bit
1 Multiply
[Figure: DSP block data path with input register unit, multipliers, optional pipelining, add/subtract stages, output MUX, and output register unit; the block has a 144-bit wide interface and 36- to 38-bit internal paths]
Soft Multipliers: Lookup Based Multiplication
Use Embedded RAM Blocks as Look-Up Tables (LUTs) for Generating Partial Products
Coefficient or Sum of Coefficients Values Stored in RAM Blocks
MSB Partial Product Shifted & Added to LSB Partial Product
ADDRESS MULT_RESULT
00000 0
00001 C
00010 2*C
00011 3*C
… ….
11111 31*C
C = Coefficient[12:0]
[Figure: multiplier table held in a 32*18 M512 block, with a 5-bit address input and an 18-bit data output]
Example: multiplication of a 5-bit input with a 13-bit coefficient. All possible 18-bit results are stored in a 32*18 look-up table.
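The lookup scheme above can be sketched in software: a 32-entry table stores every multiple of the coefficient, and a wider input is handled by looking up each 5-bit slice and adding the shifted partial products. This is a behavioural model with hypothetical function names, restricted to unsigned values; it is not Altera's implementation.

```python
def build_table(coefficient, address_bits=5):
    """Precompute k * C for every address k (the contents of the RAM block)."""
    return [k * coefficient for k in range(1 << address_bits)]

def lut_multiply(x, table, address_bits=5, input_bits=10):
    """Multiply unsigned x by the table's coefficient using lookups, shifts, adds."""
    assert 0 <= x < (1 << input_bits)
    mask = (1 << address_bits) - 1
    result = 0
    for shift in range(0, input_bits, address_bits):
        partial = table[(x >> shift) & mask]   # partial product for this slice
        result += partial << shift             # MSB slice is shifted, then added
    return result

table = build_table(5000)           # 5000 fits in a 13-bit coefficient
print(lut_multiply(777, table))     # 3885000
```

The largest entry for a 13-bit coefficient is 31 * 8191 = 253921, which fits in 18 bits and so matches the 32*18 table size on the slide.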
Altera FPGA Memory Architectures Today’s applications need more high performance memory One size does not fit all Wide choice of modes and widths
M512 Blocks M4K Blocks M-RAM External Memory Devices DDR SDRAM & SRAM SDR SDRAM QDR & QDRII SRAM ZBT SRAM DDR FCRAM
True Dual Port RAM Embedded Shift Register
Mode 512K bits 300 Mhz Operates Up to 300Mhz Mixed Clock Mode
True Dual Port RAM Embedded Shift
Register Mode Operates Up to
312Mhz Mixed Clock Mode
Rate Changing Embedded Shift
Register Mode Operates Up to
312Mhz Mixed Clock Mode
More Bits For Larger Memory Buffering
More Data Ports for Greater Memory Bandwidth
Soft Multiplier: Sum of Multiplications
[Figure: two 32*18 M512 blocks addressed by bits taken from serial shift registers, with their 18-bit outputs summed]
ADDRESS MULT_RESULT
0000 0
0001 C0
0010 C1
0011 C0+C1
… ….
1111 C0+C1+C2+C3
[Figure labels: 16-Bit Serial Shift Registers; Sum of Multiplications Table; Input; Output (Sample 16-Bit, Coefficient 16-Bit)]
Example: FIR Filter
Memory: 2 M512
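The sum-of-multiplications table corresponds to the technique usually called distributed arithmetic: one bit from each sample forms a table address, the table returns the matching sum of coefficients, and the results are accumulated with a shift per bit position. The sketch below is a behavioural model with hypothetical names and made-up coefficients, restricted to unsigned samples (signed data would need extra handling of the sign bit).

```python
def build_sum_table(coeffs):
    """Entry `addr` holds the sum of coeffs[i] for every set bit i of addr."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]

def da_dot(samples, table, sample_bits=16):
    """Bit-serial sum of products sum(samples[i] * coeffs[i])."""
    acc = 0
    for bit in range(sample_bits):
        # One bit from each sample's shift register forms the table address.
        addr = sum(((s >> bit) & 1) << i for i, s in enumerate(samples))
        acc += table[addr] << bit
    return acc

coeffs = [3, 5, 7, 11]               # C0..C3, illustrative values
table = build_sum_table(coeffs)      # 16 entries, as in the 4-bit address table
print(table[0b0011])                 # 8, i.e. C0 + C1
print(da_dot([10, 20, 30, 40], table))   # 780
```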
Example Direct Sequence Spread Spectrum (DSSS) Modem
DSSS Modem
Five Independent Data Channels Spread to 3.84 Mcps
Three-Stage FIR Interpolation-by-32
Root-Raised Cosine Pulse Shaping with 22% Excess Bandwidth
112 dB SFDR 15.36 MHz Quadrature Carriers
122.88 MSPS Transmitter Output with 5 MHz Bandwidth & Over 78-dB Out-of-Band Rejection
Automatic Gain Control (AGC) Compensating for Channel Attenuation of up to 30 dB
Costas Loop Carrier Recovery
4x Oversampling Code Synchronization
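The rates quoted above are mutually consistent: the modulator block diagram lists three interpolation stages of x2, x4, and x4, which multiply out to 32, and 32 times the chip rate gives the transmitter output rate. A short check, with variable names chosen for illustration:

```python
CHIP_RATE = 3.84e6            # chips per second (from the slide)
STAGE_FACTORS = (2, 4, 4)     # FIR1, FIR2, FIR3 interpolation factors (from the diagram)

total_interpolation = 1
for factor in STAGE_FACTORS:
    total_interpolation *= factor

output_rate = CHIP_RATE * total_interpolation
print(total_interpolation, output_rate / 1e6)   # 32 122.88
```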
[Figure: DSSS modulator, channel model, and DSSS demodulator in series; data channels DCH0 to DCH4 at the input and at the output]
DSSS Modulator
[Figure: DSSS modulator block diagram]
Inputs: DCH0 to DCH4, PCH, SCH
Channelization codes: Cch,16,0; Cch,16,1; Cch,16,2; Cch,16,8; Cch,16,9; Cch,16,10
Length-256 Gold code spreader
FIR1 LPF: 2-channel 87-tap FIR filter, interpolation x2
FIR2 LPF: 2-channel 47-tap FIR filter, interpolation x4
FIR3 RRC: 25-tap FIR filter, interpolation x4, excess BW 22%
NCO: frequency resolution 0.03 Hz, SFDR 112 dB; carrier phase increment; outputs Sin(wn), Cos(wn)
Other labels: K, gi, gq, Re[], Im[]
DSSS Demodulator
[Figure: DSSS demodulator block diagram]
NCO: frequency resolution 0.03 Hz, SFDR 112 dB; free-running phase increment
Two Altera RRC 31-tap FIR filters, excess BW 22%, fixed rate
AGC
Carrier recovery loop
4x oversampling buffer
8 Gold code correlator
Peak detector
I-Q derotate
Pilot monitor, pilot output
Hadamard despreader
Data channels output 1…5
Signals: pn_lock, max_index
DSSS Modem Resources
Resource Usage Summary
Design Entity   Logic Elements   M512 RAM   M4K RAM   MegaRAM   DSP Block Elements
Modulator       9943             1          8         0         12
Demodulator     12196            60         8         1         60
Power Usage Estimates
Power                                  mW
Total Standby Internal Power           75
Total Logic Element Internal Power     283
Total Clocktree Internal Power         175
Total DSP Internal Power               23
Other Internal Power                   92
Total Power                            505
FIR Filter Example* – 16X Cost/Performance Improvement
Device          Solution               FIR Performance (MHz)   Device Cost****   Cost per FIR MHz
TI C6713-200    64 cycles** @ 200MHz   3.125                   $24.59            $7.87
TI C6416-600    32 cycles** @ 600MHz   18.75                   $160              $8.53
Altera 1C3-8    8 cycles*** @ 230MHz   28.75                   $14               $0.49
Altera 1C12-8   1 cycle*** @ 170MHz    170                     $84               $0.49
* FIR 128 Tap, 16 bit data, 14 bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03
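The two derived columns of the table follow directly from the others: FIR performance is the clock rate divided by the cycles needed per output sample, and cost per FIR MHz is the device cost divided by that performance. The sketch below recomputes them from the table's own numbers; the function name is hypothetical.

```python
# (device, cycles per output sample, clock in MHz, device cost in USD), from the table
DEVICES = [
    ("TI C6713-200", 64, 200, 24.59),
    ("TI C6416-600", 32, 600, 160.00),
    ("Altera 1C3-8", 8, 230, 14.00),
    ("Altera 1C12-8", 1, 170, 84.00),
]

def fir_metrics(cycles, clock_mhz, cost):
    """Return (FIR output rate in MHz, cost per FIR MHz in USD)."""
    fir_mhz = clock_mhz / cycles
    return fir_mhz, cost / fir_mhz

for name, cycles, clock, cost in DEVICES:
    fir_mhz, cost_per_mhz = fir_metrics(cycles, clock, cost)
    print(f"{name:14s} {fir_mhz:8.3f} MHz  ${cost_per_mhz:.2f}/MHz")
```

The headline 16X figure is the ratio between the TI C6713 and the Altera parts: $7.87 / $0.49 is roughly 16.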
Reconfigurable video processor for SDRAM access optimization (Henriss, Ernst et al.)
Reconfigurable video platform
· SDRAM memory centered design
· FPGA based scheduler merges different streams and random accesses; exploitation of SDRAM bank structure
· supports 2 HDTV streams at 1.48 Gbit/s each plus DSP and filter unit access
· reaches 700 MByte/s in practical application for 4 Byte SDRAM memory word
· extremely cost efficient design
· used in professional video product line
Fine-Grained RSOCs: Triscend A7 CSOC
A7 Family
32-bit ARM 7 with 8kB Cache
3200 logic cells max. (40K gates)
Up to 3800 FF’s
Up to 300 Prog. I/O pins
www.triscend.com
Coarse-Grained RSOCs
Chameleon Structure (2000)
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters, www.chameleonsystems.com
32-bit ARC control processor
Up to 84 32-bit Datapath Units
DPU = a 32-bit ALU + a 32-bit barrel shifter
Up to 24 of 16x24-bit multipliers
Up to 48 of 128x32-bit local memory modules
Up to 160 Prog. I/O pins
Targeted at 3rd gen. wireless basestation, wireless local loop, SW radio, etc.
Design a battery powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.
- Do just enough and not too much for a given task (QoS)
Field Programmable Function Array
The FPFA concept has a number of advantages
The FPFA has a highly regular organisation
We use general purpose processor cores
Its scalability stands in contrast to the dedicated chips designed nowadays
The FPFA can do media processing tasks such as compression/decompression efficiently
Field Programmable Function Array
[Figure: FPFA processor tile with five ALUs, ten memories (M), a crossbar, and registers]
Processor tiles
Consists of five identical blocks, which share a control unit and a communication unit
An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers
A crossbar switch provides flexible routing between the ALUs, registers and memories
This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the finite impulse response filter
Dedicated Hardware Architecture
DSP System Architecture Options
[Figure: performance (MMACs/sec) for three options: stand-alone processor (one DSP), processor + co-processor, and processor array (4 x 4 DSPs)]
Optional Coprocessor Mappings
[Figure: processor, memory, and FPGA]
Processor External to FPGA
•TI c6x (EMIF)
•Mot PPC (MPX)
•Mot Starcore (MPX, AHB)
•Intel 2850 (PCI Express)
•ARM (AHB)
•…..
Processor On FPGA
•Nios
•ARM (AHB)
Mapping of DSP Algorithms on the FPFA
[Figure: recursion of a radix-2 FFT with 8 inputs; a DFT with N=8 is split by FFT stages into DFTs with N=2]
Fast Fourier Transform
FFT recursively divides a DFT into smaller DFTs
[Figure: the radix-2 FFT butterfly with inputs a and b, twiddle factor W, and sum and difference outputs]
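The recursion and the butterfly can be written down directly. The following is a textbook radix-2 decimation-in-time FFT in plain Python, given as a reference model only; it says nothing about how the operations are scheduled onto FPFA tiles.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])     # DFT of the even-indexed samples
    odd = fft(x[1::2])      # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W
        a, b = even[k], odd[k]
        out[k] = a + w * b                      # butterfly, sum output
        out[k + n // 2] = a - w * b             # butterfly, difference output
    return out

print([round(abs(v), 6) for v in fft([1, 0, 0, 0, 0, 0, 0, 0])])
# an impulse has a flat spectrum: eight values of 1.0
```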
Mapping of DSP Algorithms on the FPFA
[Figure: five-tap finite-impulse response filter; taps 1 to 5 with coefficients h4 to h0, crossbar, level 2, output O]
MorphoSys (1999)
Reconfigurable cell
RC Array
•Array of reconfigurable cells
•64 cells in a 2-D matrix
•SIMD model
•Same row (column) shares configuration
•Each RC operates on different data
TinyRISC (Cont’d)
Implementation & Performance
•0.35 micron technology
•4 metal layers
•Operation at 100MHz
•170 mm2
Motion Estimation
Block size : 16x16 pixel, Image size : 352x288 pixel
Lx from STMicroelectronics
DART, Raphael David, IRISA/ENSSAT
With STMicroelectronics, UBO univ.
Reconfigurable multigrain = DPR + FPGA
Dynamic reconfiguration
Low power consumption
Hierarchical distribution of resources
SCMD (Single Configuration Multiple Data)
DART Cluster
11 GOPS/cluster
1.6 GMACS/cluster
0.64 W @ 11 GOPS
16 MIPS/mW @ 11 GOPS
0.18u CMOS
Cluster architecture
[Figure: DART cluster with configuration memory, FPGA, DMA controller, control unit, six DPRs (DPR1 to DPR6), data memory, and a segmented network]
DPR architecture
[Figure: DPR with registers reg1 and reg2, MUL1, ALU1, MUL2, ALU2, a multibus network, four data memories (Datamem1 to Datamem4) with AG1 to AG4, loop management, and a global bus]