
Platform-based Design

Prof. Jun-Dong Cho (조준동), Sungkyunkwan University

Outline

Why Platform-based Design?

The need for an S/W configurable platform

Design Space of Reconfigurable Architectures

Reconfigurable Radio and Multimedia Systems

Network-centric Design: Clock and Power

Reliable Design

SoC and Customizable Platform Based-Design

[Figure: SoC block diagram built around a controller CPU with RAM, ROM, and Flash. Surrounding blocks: DSP, ASIC 1, ASIC 2, reconfigurable hardware (coarse grain), reconfigurable hardware (fine grain), and an open slot marked "?".]

Semiconductor Revolutions

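[Figure: Makimoto's Wave, with ten-year marks at 1957, 1967, 1977, 1987, 1997, and 2007. The wave swings between "standard" and "custom". Labels along the wave: TTL; LSI, MSI; µproc., memory; ASICs, accelerators; FPGAs; coarse-grain reconfigurable. Other labels: structured VLSI design, 1st design crisis, 2nd design crisis, hardware, software, instruction streams, data streams.]

“Mainstream silicon application is switching every 10 years” (Makimoto’s Wave)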

Definition of Platforms?

•An architecture that is designed for an application domain

Platform Classification

Application platform:
• Multimedia platform: Nexperia, TI OMAP
• 3G wireless platform: Infineon M-gold
• Bluetooth platform: Parthus
• Wireless platform: ARM PrimeXsys

Processor-centric platform: Improv System, ARC, Tensilica, Triscend

Communication-centric platform: Sonics, Palmchip

SoC Platform Adaptation

The Platform-Based Design Concept (Cadence)

[Figure: platform-based design concept. An application space sits on top of a HW-SW kernel plus reference design (MEM, FPGA, CPU). Blocks: processor(s), RTOS(es) and SW architecture; programmable SW IP; hardware IP; pre-qualified/verified foundation IP*; scaleable bus, test, power, IO, clock, and timing architectures; foundry-specific HW qualification; reconfigurable hardware region (FPGA, LPGA, …); SW architecture characterisation.]

* IP can be hardware (digital or analogue) or software. IP can be hard, soft, or ‘firm’ (HW), source or object (SW).

Platform Architecture

How fast will my user interface software run? How much can I fit onto my microcontroller?

Which bus? PI? AMBA? Dedicated bus for DSP?

Which RTOS do I use? Which scheduling policy do I have to choose?

Do I need a dedicated DSP? Which microcontroller? ARM? HC11? ARC?

Can I buy a QCELP decoding core?

Do I need dedicated HW, or can I run this on the microcontroller?

Example of a commercial SoC

More CPUs? More SRAM/Flash? Add FPGA?

A Legacy SoC Approach: CoreConnect (PPC), AMBA (ARM), …

Networks-on-Silicon, Philips

MP-SOC Cluster

Definition of MP-SOC?

Usually heterogeneous multiprocessor:

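• CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.

Problems of existing MP-SoCs

▷ Under power constraints, a monolithic processor consumes too much power.

▷ Using several identical (homogeneous) processors gives low resource utilization, so power consumption grows linearly.

▷ The on-chip interconnect has been designed independently, separately from the cores and caches.

▷ The interconnect depends on logic as well as on wires.

▷ Processors are bound by wire and memory latencies.

▷ Peak performance is reached only for specific application classes.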

4G: Multiple standards

Communications. Networking. Multimedia. Security.

Digital RF supporting multi-band/multimode operation

The triangle, Chicken and Egg?

architectures

applications

methodologies

• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.

Why Multi-Threaded Cores?

• Increasing gap between memory and processor speeds (2x / 2 years)
• Increasing gap between interconnect and gate delays (multi-clock)
• More parallel processing (lower power, higher perf./mm2)

[Figure: processing elements (DSP, H/W multithreaded DSP, RISC, GPP with I$ and D$, H/W processing element) connected through a NoC, with In and Out ports and SRAM.]

MPSoC “Bus” Alternatives

• Fixed Bus [Bergamaschi, DAC, 2000]
  – Point-to-point communication
  – Signals between cores are transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym, …]
  – Programmable interconnects
  – Employs a static network
• Arbitrated Bus [IDT Inc., 2000]
  – Time-shared multiple-core connectivity
  – Uses an arbiter
• Hierarchical Bus [AMBA, ARM Inc]
  – Combines multiple buses using bus bridges
  – Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
  – Resources communicate with data packets
  – Uses a switch fabric

Future mobile platform?

Mobile supercomputing Speech recognition. Cryptography. Augmented reality. Typical applications (email, etc.).

Requires 16x 2 GHz Pentium 4. Peak power must not exceed 75 mW

Mudge et al:

Future mobile application platform?

Culture and education? Personal entertainment platform?

Road Map to MP-SoC: Trends

• Mask NRE: over $1M; design NRE: $10M to $75M
• ASICs replaced by programmable ASSPs and FPGAs
• Number of embedded processors
  – DVD/STB/HDTV, mobile phones: 5 to 8
  – Image processing, networking, base station: 8 to 100+
• eS/W complexity
  – Set-top box, audio: >1 million lines of code
  – eS/W is becoming an essential part of SoCs

[Figure: SoC design flow from requirements to tape out, with a "Verify" step attached to every stage: compose the system; simulate; SoC Composer; synthesis + P&R (timing, area); simulate (performance); tape out.]

Should the SoC designer work hard?

Requirements

Why is verification important in mobile SoCs? Why has our verification become weak?

More SoC topics …

• Platform optimization
• Power management
• BW allocation
• Resource sharing
• Task distribution
• Efficient communications
• Low power
• Verification
• Training talent (system architects)

Available Mobile and VLIW Processors

• The ARM Family
  – The ARM7 Generation
  – The StrongARM
  – The ARM Thumb Option
  – The ARM Piccolo Option
  – The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
  – The Motorola-Lucent Star*Core
  – The Philips TriMedia
  – The HP/Intel IA-64

Nexperia™ DVP hardware architecture (source: Th. Claasen, Philips, DAC 2000)

Exploitable Parallelism

[Figure: exploitable task parallelism plotted against minimum parallel grain size in instructions (1, 100’s, 10 000’s). Regions: instruction-level parallelism, MultiFlex thread-level parallelism, GP O/S thread-level parallelism. Range labels: 1~8, 2~6, 1~100.]

NEC MP211: Homogeneous MP core

• Asymmetric MP with very coarse-grain multitasking
• 3 ARM9s used as predefined function units
• No complex overhead, e.g. no cache coherency or dynamic scheduling/load balancing

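Advantages of MP-SoC

Easy hardware implementation: using processor cores that are already in wide use shortens hardware development time and lowers cost.

Lower power consumption: the distributed tasks are handled by multiple processors running at a lower clock frequency. A lower clock frequency permits a lower supply voltage, which reduces power consumption.

Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.

Boosting real-time performance: each application can run on a different processor, which reduces interference between applications.

Higher system safety: system software and untrusted applications can be separated by running them on different processors.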
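
The power argument on this slide follows from the standard CMOS dynamic-power relation P = a · C · Vdd² · f. The sketch below is only an illustration: the capacitance, voltages, and frequencies are hypothetical values (not taken from the slides), and it assumes the workload splits perfectly across two cores.

```python
def dynamic_power(c_eff, vdd, freq_hz, activity=1.0):
    """Dynamic CMOS power in watts: P = activity * C * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq_hz


# Hypothetical operating points, chosen only to illustrate the trend.
C_EFF = 1e-9  # effective switched capacitance per core, in farads (assumed)

# One core at full speed and nominal voltage.
single_core = dynamic_power(C_EFF, vdd=1.2, freq_hz=400e6)

# Two cores, each at half the clock, which allows a lower supply voltage.
dual_core = 2 * dynamic_power(C_EFF, vdd=0.9, freq_hz=200e6)

print(f"single core: {single_core:.3f} W")
print(f"two cores  : {dual_core:.3f} W")
print(f"ratio      : {dual_core / single_core:.4f}")
```

With these assumed numbers the two-core configuration delivers the same aggregate clock cycles per second at a little over half the dynamic power. Leakage and interconnect power are ignored here.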

AMP task allocation image

Bus and Memory Architecture

MP211 block diagram

Power consumption of H.264+AAC

Holistic design of multi-core architectures

• Naïve methodology is inefficient
• Demonstrated inefficiency for cores and proposed alternatives
  – Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
  – Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
  – Conjoined-core Chip Multiprocessing [MICRO04]
• What about interconnects?
  – How much can interconnects impact processor architecture?
  – Need to be co-designed with caches and cores?

Heterogenous MP Core

▷ A single-ISA heterogeneous multicore architecture outperforms approaches that rely on voltage scaling, clock gating, or speculation control.

▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous (or asymmetric) CMP has many advantages. Many applications want small cores as well as large ones. The requirements also depend on the system context, for example whether the device runs on battery or on mains power. Using cores of different complexity is therefore efficient.

▷ A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. In a single-ISA heterogeneous CMP, however, every core executes the same ISA, so an application can be mapped to any core. The number, size, and type of the cores, and the caches, must be decided. In an 8-core processor, the interconnect consumes as much power as one core. Applying dynamic voltage scaling and gating the unused cores improves the energy-delay product by 75%.

▷ Taking a dual processor as an example, heterogeneous processors that serve both low and high thread-level parallelism improve performance by 63% over homogeneous ones.

With 5 to 8 threads, the average improvement is 29%. By Amdahl's law, the speedup of parallel applications is limited by their serial portion.

▷ The serial portion is executed quickly on a large core, and the parallel portion runs on small, low-power cores, which maximizes performance per watt. [Annavaram, et al]
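
The Amdahl's-law argument can be made concrete with a small calculation. The sketch below is not from the slides: the function names, the 90% parallel fraction, and the assumption that a big core runs serial code twice as fast as a small core are all illustrative choices. Speedups are relative to a single small core.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Classic Amdahl's law: speedup over one core for n identical cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)


def asymmetric_speedup(parallel_fraction, big_core_speed, n_small):
    """Serial part on one big core, parallel part on n_small small cores.

    big_core_speed is how many times faster the big core runs serial
    code than a small core (an assumed parameter).
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial / big_core_speed + parallel_fraction / n_small)


if __name__ == "__main__":
    p = 0.9  # assumed parallel fraction
    print(f"8 small cores          : {amdahl_speedup(p, 8):.2f}x")
    print(f"1 big + 8 small cores  : {asymmetric_speedup(p, 2.0, 8):.2f}x")
    print(f"limit, unlimited cores : {1.0 / (1.0 - p):.1f}x")
```

Even with unlimited small cores the speedup stays below 1/(1 - p), which is why accelerating the serial portion on a larger core pays off.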

Problems of Heterogeneous MP-SoC

• Processors are bound by wire and memory latencies
• Peak performance on only a small class of applications
• How well applications map to a given design
• Diversification of workloads
• Increased hardware complexity
• Poor resource utilization

Alpha cores scaled to 0.10 um. EV8 is 80 times bigger but provides only two to three times more single-threaded performance

Heterogenous MP Core

If two or more cores share L2, the way a lot of present CMPs do, a crossbar provides a high bandwidth connection.

A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time.

Characteristics of heterogeneous platforms

In an 8-core processor, the interconnect consumes as much power as one core. Applying dynamic voltage scaling and gating the unused cores improves the energy-delay product by 75%.

Taking a dual processor as an example, heterogeneous processors that serve both low and high thread-level parallelism improve performance by 63% over homogeneous ones. With 5 to 8 threads, the average improvement is 29%. By Amdahl’s law, the speedup of parallel applications is limited by their serial portion.

10 Performance of heuristics for equal-area heterogeneous architectures with multithreaded cores.

Exploring the potential from heterogeneity

CT 3400 Multi-core DSP

• Eight 32-bit DSP cores
• Six 32-bit general-purpose processor cores
• 128-pin programmable I/O subsystem
• C-programmable; supports H.264 and MPEG4 code

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

H.264 encoder , decoder and audio codecs and the system control

H.264 codec onto the cradle CT3400 MDSP


CT3400 DSP Engine

http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

DSP Engine

Each DSP engine contains:
• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register file
• A 512 x 20 program memory

http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf

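CT3600 Product Family

• At $5.50 per channel (MPEG4 SP L3), the CT3616 offers the best price/performance encoding solution in the industry, more than twice as good as the nearest competitor
• The industry's first single-chip, real-time D1 H.264 Main Profile video encoder based on programmable DSPs
• 0.13 micron technology, 16 DSPs, and 8 general-purpose processors quadruple overall performance
• $40 to $90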

CT3600 Multiprocessor DSP Family Members

http://www.cradle.com/downloads/CT3600-PB.pdf



Problems of Homogeneous MP-SoC

The hardware must be configurable for efficient execution across a broad class of applications.

Each core consists of an array of homogeneous execution nodes, a banked instruction cache, a data cache, a register file, and block control logic.

Some of the resources (called polymorphous resources) in the TRIPS architecture can be configured to operate differently depending on the mode (instruction, thread, or data parallelism).

HiBRID-SoC Architecture

HiBRID-SoC multi-core system-on-chip architecture
• Integrates a powerful on-chip communication structure
• A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
• Dedicated chips for the MPEG-4 Simple Profile; very general processing demand
• Three programmable cores, each adapted towards a specific class of algorithms
• Combination of the cores and their software development environment
• An extension of a programmable core with dedicated modules (e.g., TriMedia)

HiBRID-SoC multi-core
• Developed at the University of Hannover

Multi-Core SoC Architecture

• Instruction level: VLIW (very long instruction word)
• Data level: SIMD (single instruction, multiple data)
• Task level: simultaneous multithreading

HiPAR-DSP
• 16-datapath SIMD processor core controlled by VLIW
• Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)

Stream Processor (SP)
• 32-bit RISC architecture optimized towards control-dominated tasks
• Bitstream processing or global system control

Macroblock Processor (MP)
• Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
• Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing steps

HiBRID-SoC multi-core architecture

• 64-bit AMBA AHB system bus
• Connects all cores to SDRAM memory via a 64-bit SDRAM interface
• Two versatile 32-bit host interfaces for access (e.g., to a host PC via PCI and to serial flash memory)

Figure 1. HiBRID-SoC multi-core architecture

HiPAR-DSP

• Highly parallel DSP core with a VLIW-controlled SIMD architecture
• The memory concept provides easy data exchange between the data paths, which is required for many filter and image processing algorithms
• The DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
• At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMAC/s

Figure 2. HiPAR-DSP architecture
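
The 2.3 GMAC/s figure is consistent with the 16 SIMD data paths mentioned earlier in the deck and the 145 MHz clock, if each data path completes one multiply-accumulate per cycle. That per-cycle rate is an assumption made here for illustration; the slides do not state it.

```python
def peak_macs_per_second(datapaths, clock_hz, macs_per_cycle=1):
    """Peak MAC throughput of a SIMD array.

    macs_per_cycle is the MACs each data path finishes per clock
    (assumed to be 1 here; not stated on the slides).
    """
    return datapaths * macs_per_cycle * clock_hz


peak = peak_macs_per_second(datapaths=16, clock_hz=145e6)
print(f"{peak / 1e9:.2f} GMAC/s")  # prints "2.32 GMAC/s"
```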

Stream Processor

• The SP has been optimized for high-level programmability and efficient processing of control-driven applications
• Harvard architecture with a 32-bit data path consisting of 5 pipeline stages and controlled by 32-bit RISC instructions
• Supports conditional execution and forwarding interlocks, and provides full interrupt capability

Convert the 64-Bit AMBA bus width to the 32-Bit internal

Macroblock processor

• Heterogeneous data path structure consisting of a scalar and a vector data path
• The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
• The vector data path is equipped with a 64-entry register file of 64-bit width
• Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms
• MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand

Figure 3. Macroblock processor data paths.

HiBRID-SoC Implementations

Figure 3. Chip layout of the HiBRID-SoC.

Table 1. MPEG-4 ASP decoder (full TV resolution) performance on MP and SP, 720*576@25Hz, 1.5-3 Mbits:

HiBRID-SoC is fabricated in a 0.18 µm, 6LM standard-cell technology. It has 14 million transistors, dissipates 3.5 W, occupies 82 mm2, and operates at 145 MHz.

Analyzing On-chip Communication in MPSoC Environment

Mirko Loghi et al., Proceedings of Design, Automation and Test in Europe (DATE) ’04

• Analysis and trade-off exploration of on-chip communication architectures.
• Comparison and analysis of two practical configurations: AMBA AHB (ARM) and STBus (STMicroelectronics).
• Models the hardware and software of an MPSoC at a high level of accuracy and with sufficient simulation speed.
• Provides realistic performance figures by stimulating the communication system with functional traffic.

Multiprocessor simulation platform

Hardware architecture:

• Homogeneous MPSoC platform.

• Configurable number of 32-bit ARM processors.

• Processor cores: SWARM, a GPL-licensed ARM Instruction Set Simulator (ISS) written in C++.

• Private memories for each processor.

• A shared memory.

• A hardware interrupt module.

• 32-bit interconnection.

• All components are wrapped in SystemC.

Multiprocessor simulation platform

Benchmarks running on the RTEMS OS:
• Run on top of RTEMS
• Synchronization: OS queues are used to exchange matrices between processors
• Benchmark 1: independent matrix multiplication
• Benchmark 2: pipeline of matrix multiplications

Benchmark 1: independent matrix multiplication:
• Each processor performs an independent matrix multiplication
• No interprocessor communication is required
• Operands are stored in the private memory of each processor

Multiprocessor simulation platform

Benchmark 2: pipelined matrix multiplication:
• The platform receives a continuous flow of input and output
• Every core follows this pattern:
  – Copy the input matrix from shared memory to private space
  – Multiply the input matrix by a matrix already resident in private space
  – Copy the resulting matrix back to shared space
• Interrupt and semaphore slaves are queried to keep all processes synchronized
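
The per-core pattern of the pipelined benchmark (copy in, multiply by a resident matrix, copy out) can be sketched in a few lines. This is a sequential model written for illustration only: it is not the RTEMS benchmark code, it models no interrupts or semaphores, and the names `matmul` and `run_pipeline` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def run_pipeline(input_matrix, resident_matrices):
    """Pass one input matrix through every pipeline stage in order.

    Each stage stands for one core: it copies the matrix from shared
    memory into private space, multiplies it by its own resident
    matrix, and writes the result back to shared memory.
    """
    shared = input_matrix
    for resident in resident_matrices:
        private_copy = [row[:] for row in shared]  # shared -> private
        result = matmul(private_copy, resident)    # compute in private space
        shared = result                            # private -> shared
    return shared


if __name__ == "__main__":
    swap = [[0, 1], [1, 0]]
    double = [[2, 0], [0, 2]]
    print(run_pipeline([[1, 2], [3, 4]], [swap, double]))
```

In the real platform the stages run concurrently on different cores, so the shared-memory copies are where bus contention shows up.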

Multiprocessor simulation platform

Code development and analysis tools:

• Development tool: GNU cross-compiler

• Flexible profiling through functions of the simulator

• Simulator output:
  – Statistics about processor and interconnect performance
  – VCD waveforms of all bus signals
  – Traces of the memory accesses performed by every core

Features of communication architecture

AMBA-AHB architecture:
• Traditional shared bus with pipelining
• Distinct data and address/control buses
• Transfers with a data phase and a control phase
• Supports bursts as streams of single transactions
• “Split/retry transfer” and “early burst termination” are used to handle high-latency slaves

STBus architecture:
• Protocol type 3: simple load/store operations, pipelining and split transactions, out-of-order support
• Flexible topology: from shared bus to full crossbar
• Overlapping transfers: a new burst can be requested while previous ones are still completing, without idle cycles
• Fast arbitration in two cycles; minimum latency is three cycles

Experimental Result

Comparison of interconnect performance

Five interconnects:
• AMBA-AHB
• Shared-bus STBus
• Full-crossbar STBus
• Partial-crossbar STBus: ST-32
• Partial-crossbar STBus: ST-54

Experimental Result

Performance comparison

Experimental Result

Comparison of interconnect performance

Four benchmarks:
• Independent matrix multiplication: ASM-IND
• Pipelined matrix multiplication without OS: ASM-PIP
• Independent matrix multiplication with OS: OS-IND
• Pipelined matrix multiplication with OS: OS-PIP

Experimental Result

Comparison of performance interconnection

MPSoC Clock and Power (Olivier Franza, Intel)

• Increased uncertainty with process scaling
  – Process, voltage, temperature variations, noise, coupling
  – Affects design margin: over-design, power & performance loss
• Increased power constraints
  – Increasing leakage, power (density, delivery) limitations
• More transistors mean:
  – Larger clock distribution networks
  – Higher capacitance (more load and parasitics)
• With each new technology:
  – Gate delay decreases ~25%
  – Wire delay increases ~100%
  – Cross-chip communication increases
  – Clock needs multiple cycles to cover die

Interconnect Delays & Density

Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology

Multiple Clocks due to Interconnect limitation

At reduced performance, larger resource size

Noise in Mixed Signal Systems

• Multiple clock domains
• Low skew and jitter ALWAYS a must
• Clock modeling requires more accuracy
  – Within-die variations, inductance, crosstalk, electromigration, self-heat, …
• Floor plan modularity
  – Think adding/removing cores seamlessly!
• Hierarchical clock partitioning
  – Reduce global clock and possibly relax its requirements
  – Generate “locally”-used clock “locally”
  – Implement clock domain deskewing techniques
• Bound clock problem into simple, reliable, efficient domains

DEC/Compaq Alpha

More complex core to improve performance, more complex clocks (?)

Source: DEC/Compaq: Gronoski et al., JSSC 1998; Xanthopoulos et al., ISSCC 2001; Barroso et al., ISCA 2000

Clock and Power Convergence: Intel® Itanium® Montecito

• Each core split into 3 clock domains on variable power supply
• Each domain controlled by a Digital Frequency Divider (DFD) generating low-skew variable-frequency clocks; fed by central PLL and aligned through phase detectors
• Regional Voltage Detector (RVD): supply voltage monitor
• Second Level Clock Buffer (SLCB): digitally controlled delay buffer for active deskewing
• Regional Active Deskew (RAD): phase comparators monitoring and adjusting delay difference between SLCBs
• Clock Vernier Device (CVD): digitally controlled delay buffer

Clock generation and distribution are essential enablers of microprocessor performance

On-Chip Interconnects: Circuits and Signaling, Wayne Burleson

• Using Vdd programmability
• High Vdd to devices on critical paths
• Low Vdd to devices on non-critical paths
• Vdd off for inactive paths

A: Baseline fabric
B: Fabric with Vdd-configurable interconnect

This work builds on a similar idea for FPGAs described in: Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004

Why Reconfigurable System?

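Includes a GPP and reconfigurable H/W

Goal: power reduction and flexibility

1. Provide quality of service that follows a dynamic environment

2. A flexible architecture that follows the evolution of algorithms

3. Fewer platforms to develop and maintain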

Reconfigurable Hardware

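[Figure: tasks 1 to N mapped onto the same reconfigurable hardware. Task 1 is built from operations A, B, C, D, E; Task N from operations H, I, J and W, X, Y, Z. The fabric is reconfigured to hold the operations of the active task.]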

Energy Efficiency of Reconfigurability

• System architecture
• Communication protocol
• O/S and applications
• Partitioning of functions between the wireless device and services on the network

The mobiles must be flexible enough to accommodate a variety of multimedia services and communication capabilities and adapt to various operating conditions in an (energy) efficient way.

The need for an S/W configurable platform

– Doing More by Doing Less: the ability to handle diverse standards is required (AM, FM, GSM, UMTS, digital broadcasting standards, analog and digital television, and other data links).

– A fully software-reconfigurable multi-channel broadband sampling receiver for standards in the 100 MHz band

Granularity of reconfiguration (Sébastien PILLEMENT, ENSSAT/LASTI)

System-level reconfiguration: Lx, C62 (decomposition into clusters)

Functional-level reconfiguration: Pleiades, RaPiD, DART (2001)

Operator-level reconfiguration: Chameleon, PipeRench, MorphoSys (2000)

Gate-level reconfiguration: Napa, GARP, FPGA

The grain size of operations in Reconfigurable System Architectures

Fine-grained operations: multiply and addition

Medium-grained operations: reconfigurable modules

Coarse-grained operations: CPU, host

Design Space of Reconfigurable Architectures

Reconfigurable architectures (R-SoC)
• Fine grain (FPGA)
  – Island topology: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA
  – Hierarchical topology: Altera Stratix, Altera Apex, Altera Cyclone
• Multi-granularity (heterogeneous)
  – Processor + coprocessor, coarse-grain coprocessor: Chameleon, REMARC, MorphoSys
  – Processor + coprocessor, fine-grain coprocessor: Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSLIC
  – Tile-based architecture: aSoC, E-FPFA
• Coarse grain (systolic)
  – Linear topology: Systolic Ring, RaPiD, PipeRench
  – Hierarchical topology: DART, FPFA
  – Mesh topology: RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp

Source: Lilian Bossuet, LESTER Lab, Université de Bretagne Sud, Lorient, France

Digital Signal Processing With FPGAs

Paul Ekas, Jean-Charles Bouzigues

Multiplier Options in FPGAs

Option                Resource                       Area Usage
1 Logic Multipliers   Logic Elements (Traditional)   500 LEs per 18x18 Multiplier
2 Hard Multipliers    DSP Blocks                     4 18x18 Multipliers per DSP Block
3 Soft Multipliers    RAM                            1 to 2 Embedded Memory Blocks

Logic Elements

• Smallest unit of logic
• Grouped into Logic Array Blocks (LABs) of ten LEs
• Features:
  – Four-input look-up table (LUT)
  – Configurable register
  – Dynamic add/subtract control
  – Carry-select chain logic

[Figure: a Logic Array Block containing logic elements LE1 to LE10, each with four inputs, fed by the local interconnect and control signals.]

DSP Block: Optimized Hard MAC

18 Bit x 18 Bit:
• 4 multiplies
• 2 multiplies with accumulate
• 1 sum of 2 multiplies (complex multiply)
• 1 sum of 4 multiplies

9 Bit x 9 Bit:
• 8 multiplies
• 2 multiplies with accumulate
• 2 sums of 2 multiplies (complex multiplies)
• 2 sums of 4 multiplies

36 Bit x 36 Bit:
• 1 multiply

[Figure: DSP block data path with input register unit, multipliers, optional pipelining, add/subtract stages, output MUX, and output register unit; bus widths marked 144, 36, 37, and 38 bits.]

Soft Multipliers: Lookup-Based Multiplication

• Use embedded RAM blocks as look-up tables (LUTs) for generating partial products
• Coefficient or sum-of-coefficients values are stored in RAM blocks
• The MSB partial product is shifted and added to the LSB partial product

Multiplier table (M512, 32*18), C = Coefficient[12:0], 5-bit address in, 18-bit data out:

ADDRESS   MULT_RESULT
00000     0
00001     C
00010     2*C
00011     3*C
…         …
11111     31*C

Example: multiplication of a 5-bit input with a 13-bit coefficient. All possible 18-bit results are stored in a 32*18 look-up table.
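
A software model of the lookup-based multiplier makes the table contents and the "shift and add" step explicit. This is a minimal sketch, not Altera's implementation: the function names are hypothetical, and the 10-bit case is an assumed example of combining two 5-bit lookups.

```python
def build_multiplier_table(coefficient, address_bits=5, coeff_bits=13):
    """Precompute address * coefficient for every address (the RAM contents)."""
    if not 0 <= coefficient < (1 << coeff_bits):
        raise ValueError("coefficient does not fit in coeff_bits")
    return [addr * coefficient for addr in range(1 << address_bits)]


def lut_multiply(x, table):
    """Multiply a 5-bit input by the coefficient with a single table read."""
    return table[x]


def lut_multiply_wide(x, table, address_bits=5):
    """Multiply a 10-bit input using two reads of the same table.

    The MSB partial product is shifted left by address_bits and added
    to the LSB partial product.
    """
    mask = (1 << address_bits) - 1
    lsb_partial = table[x & mask]
    msb_partial = table[(x >> address_bits) & mask]
    return (msb_partial << address_bits) + lsb_partial


if __name__ == "__main__":
    C = 0x1ABC  # arbitrary 13-bit coefficient chosen for the demo
    table = build_multiplier_table(C)
    print(len(table), "entries, widest needs", max(table).bit_length(), "bits")
    print(lut_multiply(19, table) == 19 * C)
    print(lut_multiply_wide(777, table) == 777 * C)
```

The largest entry is 31 times a 13-bit value, which always fits in 18 bits, matching the 32*18 table size on the slide.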

Altera FPGA Memory Architectures

• Today’s applications need more high-performance memory
• One size does not fit all
• Wide choice of modes and widths

Embedded memory: M512 blocks, M4K blocks, M-RAM

External memory devices: DDR SDRAM & SRAM, SDR SDRAM, QDR & QDRII SRAM, ZBT SRAM, DDR FCRAM

Block features as listed on the slide:
• True dual-port RAM; embedded shift register mode; 512K bits; operates up to 300 MHz; mixed clock mode
• True dual-port RAM; embedded shift register mode; operates up to 312 MHz; mixed clock mode
• Rate changing; embedded shift register mode; operates up to 312 MHz; mixed clock mode

More bits for larger memory buffering; more data ports for greater memory bandwidth

Soft Multiplier: Sum of Multiplications

Sum-of-multiplications table (4-bit address):

ADDRESS   MULT_RESULT
0000      0
0001      C0
0010      C1
0011      C0+C1
…         …
1111      C0+C1+C2+C3

Example: FIR filter (16-bit samples, 16-bit coefficients). Memory: 2 M512.

[Figure: the input feeds 16-bit serial shift registers; 4 address bits index two M512 32*18 tables, and their 18-bit outputs are combined by adders to form the output.]
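
The sum-of-multiplications table is the core of bit-serial (distributed) arithmetic: one bit from each sample forms the address, and the table returns the sum of the selected coefficients. The sketch below is an assumed software model for unsigned samples only; the function names are hypothetical, and signed-sample handling is left out.

```python
def build_sum_table(coeffs):
    """Table entry at `addr` is the sum of coeffs[i] for every set bit i."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]


def bit_serial_dot(samples, table, sample_bits=16):
    """Compute sum(samples[i] * coeffs[i]) one bit position at a time.

    For each bit position, the bits of all samples form a table
    address; the looked-up sum is weighted by 2**bit and accumulated.
    Samples must be unsigned and fit in sample_bits.
    """
    acc = 0
    for bit in range(sample_bits):
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << i
        acc += table[addr] << bit
    return acc


if __name__ == "__main__":
    coeffs = [3, 5, 7, 11]            # C0..C3, arbitrary demo values
    samples = [1000, 65535, 0, 12345]  # four unsigned 16-bit samples
    table = build_sum_table(coeffs)
    direct = sum(s * c for s, c in zip(samples, coeffs))
    print(bit_serial_dot(samples, table), direct)
```

No multiplier is used anywhere: the work is a table read, a shift, and an add per bit, which is why this maps well onto small RAM blocks.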

Example Direct Sequence Spread Spectrum (DSSS) Modem

DSSS Modem

• Five independent data channels spread to 3.84 Mcps
• Three-stage FIR interpolation-by-32
• Root-raised cosine pulse shaping with 22% excess bandwidth
• 112 dB SFDR 15.36 MHz quadrature carriers
• 122.88 MSPS transmitter output with 5 MHz bandwidth and over 78 dB out-of-band rejection
• Automatic gain control (AGC) compensating for channel attenuation of up to 30 dB
• Costas loop carrier recovery
• 4x oversampling code synchronization
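
The rates in this list are mutually consistent, which a short calculation confirms. The stage factors (x2, x4, x4) are the ones labelled on the modulator diagram. The 32-bit NCO phase accumulator is an assumption made here, chosen because it reproduces the quoted 0.03 Hz resolution; the slides do not give the accumulator width.

```python
CHIP_RATE_HZ = 3.84e6             # spread chip rate from the slide
INTERPOLATION_STAGES = (2, 4, 4)  # FIR1, FIR2, FIR3 interpolation factors
CARRIER_HZ = 15.36e6              # quadrature carrier frequency


def output_sample_rate(chip_rate_hz, stages):
    """Sample rate after passing through every interpolation stage."""
    rate = chip_rate_hz
    for factor in stages:
        rate *= factor
    return rate


def nco_resolution_hz(sample_rate_hz, accumulator_bits):
    """Frequency step of an NCO with the given phase accumulator width."""
    return sample_rate_hz / (1 << accumulator_bits)


sample_rate = output_sample_rate(CHIP_RATE_HZ, INTERPOLATION_STAGES)
print(f"output rate        : {sample_rate / 1e6:.2f} MSPS")
print(f"samples per carrier: {sample_rate / CARRIER_HZ:.0f}")
# accumulator_bits=32 is an assumption, not a figure from the slides.
print(f"NCO resolution     : {nco_resolution_hz(sample_rate, 32):.4f} Hz")
```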

[Figure: data channels DCH0 to DCH4 enter the DSSS modulator, pass through a channel model, and are recovered as DCH0 to DCH4 by the DSSS demodulator.]

DSSS Modulator

[Figure: DSSS modulator block diagram. Blocks and labels:]
• Inputs: data channels DCH0 to DCH4, PCH, SCH
• Channelization codes: Cch,16,0; Cch,16,1; Cch,16,2; Cch,16,8; Cch,16,9; Cch,16,10
• Gains gi and gq; Re[] and Im[] branches
• Length-256 Gold code spreader
• FIR1 (LPF): 2-channel 87-tap FIR filter, interpolation x2
• FIR2 (LPF): 2-channel 47-tap FIR filter, interpolation x4
• FIR3 (RRC): 25-tap FIR filter, interpolation x4, excess BW 22%
• NCO: frequency resolution 0.03 Hz, SFDR 112 dB; driven by the carrier phase increment; outputs Sin(wn) and Cos(wn)
• Gain K

DSSS Demodulator

[Figure: DSSS demodulator block diagram. Blocks and labels:]
• NCO: frequency resolution 0.03 Hz, SFDR 112 dB; free-running phase increment
• Two Altera RRC FIR filters: 31 taps, excess BW 22%, fixed rate
• AGC
• Carrier recovery loop
• 8 Gold code correlators
• 4x oversampling buffer
• Peak detector (signals pn_lock, max_index)
• I-Q derotate
• Pilot monitor, pilot output
• Hadamard despreader
• Data channels output 1…5

DSSS Modem Resources

Resource Usage Summary

Design Entity   Logic Elements   M512 RAM   M4K RAM   MegaRAM   DSP Block Elements
Modulator       9943             1          8         0         12
Demodulator     12196            60         8         1         60

Power Usage Estimates

Item                                 Power (mW)
Total Standby Internal Power         75
Total Logic Element Internal Power   283
Total Clocktree Internal Power       175
Total DSP Internal Power             23
Other Internal Power                 92
Total Power                          505

FIR Filter Example*: 16X Cost/Performance Improvement

Device          Solution                 FIR Performance (MHz)   Device Cost****   Cost per FIR MHz
TI C6713-200    64 cycles** @ 200 MHz    3.125                   $24.59            $7.87
TI C6416-600    32 cycles** @ 600 MHz    18.75                   $160              $8.53
Altera 1C3-8    8 cycles*** @ 230 MHz    28.75                   $14               $0.49
Altera 1C12-8   1 cycle*** @ 170 MHz     170                     $84               $0.49

* FIR 128 tap, 16-bit data, 14-bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow, 6/25/03
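
The derived columns of the table can be reproduced from the cycle counts, clock rates, and prices. The sketch below simply redoes that arithmetic; the helper names are hypothetical, and the input figures are the ones quoted on the slide.

```python
# (device, cycles per FIR output, clock in MHz, unit price in USD)
DEVICES = [
    ("TI C6713-200", 64, 200.0, 24.59),
    ("TI C6416-600", 32, 600.0, 160.00),
    ("Altera 1C3-8", 8, 230.0, 14.00),
    ("Altera 1C12-8", 1, 170.0, 84.00),
]


def fir_throughput_mhz(cycles_per_output, clock_mhz):
    """FIR output rate in MHz: one output every cycles_per_output clocks."""
    return clock_mhz / cycles_per_output


def cost_per_fir_mhz(price_usd, cycles_per_output, clock_mhz):
    """Device price divided by the FIR output rate."""
    return price_usd / fir_throughput_mhz(cycles_per_output, clock_mhz)


for name, cycles, clock, price in DEVICES:
    rate = fir_throughput_mhz(cycles, clock)
    cost = cost_per_fir_mhz(price, cycles, clock)
    print(f"{name:14s} {rate:8.3f} MHz  ${cost:.2f} per FIR MHz")

best_dsp = cost_per_fir_mhz(24.59, 64, 200.0)
best_fpga = cost_per_fir_mhz(14.00, 8, 230.0)
print(f"improvement: {best_dsp / best_fpga:.1f}x")
```

The ratio between the cheaper DSP and the FPGA entries comes out at roughly 16, which is the "16X" in the slide title.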

Reconfigurable video processor for SDRAM access optimization (Henriss, Ernst et al.)

Reconfigurable video platform

· SDRAM memory-centered design
· FPGA-based scheduler merges different streams and random accesses; exploitation of SDRAM bank structure
· Supports 2 HDTV streams at 1.48 Gbit/s each plus DSP and filter unit access
· Reaches 700 MByte/s in practical application for a 4-byte SDRAM memory word
· Extremely cost-efficient design
· Used in a professional video product line

Fine-Grained RSOCs: Triscend A7 CSOC

A7 Family
• 32-bit ARM7 with 8 kB cache
• 3200 logic cells max. (40K gates)
• Up to 3800 FFs
• Up to 300 prog. I/O pins
• www.triscend.com

Coarse-Grained RSOCs: Chameleon Structure (2000)

Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters, www.chameleonsystems.com

• 32-bit ARC control processor
• Up to 84 32-bit datapath units (DPU = a 32-bit ALU + a 32-bit barrel shifter)
• Up to 24 16x24-bit multipliers
• Up to 48 128x32-bit local memory modules
• Up to 160 prog. I/O pins
• Targeted at 3rd-generation wireless base stations, wireless local loop, SW radio, etc.

Design a battery powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.

- Do just enough and not too much for a given task (QoS)

Field Programmable Function Array

The FPFA concept has a number of advantages:
• The FPFA has a highly regular organisation
• It uses general-purpose processor cores
• Its scalability stands in contrast to the dedicated chips designed nowadays
• The FPFA can do media processing tasks such as compression/decompression efficiently

Field Programmable Function Array

[Figure: FPFA processor tile with five ALUs, ten memories (M), registers, and a crossbar.]

Processor tiles
• Each consists of five identical blocks, which share a control unit and a communication unit
• An individual block contains an ALU, two memories, and four register banks of four 20-bit wide registers
• A crossbar switch provides flexible routing between the ALUs, registers, and memories
• This structure is convenient for the Fast Fourier Transform (6 inputs, 4 outputs) and the finite impulse response filter

Dedicated Hardware Architecture

DSP System Architecture Options

[Figure: performance (MMACs/sec) of three options: stand-alone processor (one DSP), processor + co-processor (one DSP with a co-processor), and processor array (a 4 x 4 grid of DSPs).]

Optional Coprocessor Mappings

[Figure: processor, memory, and FPGA arrangements for the two mappings below.]

Processor external to FPGA:
• TI c6x (EMIF)
• Mot PPC (MPX)
• Mot Starcore (MPX, AHB)
• Intel 2850 (PCI Express)
• ARM (AHB)
• …

Processor on FPGA:
• Nios
• ARM (AHB)

Mapping of DSP Algorithms on the FPFA

[Figure: recursion of a radix-2 FFT with 8 inputs. An N=8 DFT is split by successive FFT stages into smaller DFTs, ending in four N=2 DFTs.]

Fast Fourier Transform: the FFT recursively divides a DFT into smaller DFTs.

[Figure: the radix-2 FFT butterfly, with inputs a and b, twiddle factor W, one sum (+) output and one difference (-) output.]
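
The recursion and the butterfly can be written down directly. The following is a textbook decimation-in-time radix-2 FFT in plain Python, given as an illustrative sketch; it is not the FPFA mapping itself, and a direct DFT is included only to check the result.

```python
import cmath


def butterfly(a, b, w):
    """Radix-2 butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, -1, -2, -3]
    fast, slow = fft(data), dft(data)
    print(max(abs(f - s) for f, s in zip(fast, slow)) < 1e-9)
```

For N=8 the recursion bottoms out after three levels, and each level performs four butterflies, which is the structure drawn in the figure.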

Mapping of DSP Algorithms on the FPFA

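[Figure: a five-tap FIR filter mapped onto the FPFA. Taps 1 to 5 use coefficients h4, h3, h2, h1, h0; the products are routed through the crossbar to a second level of adders that produces the output O.]

Five-tap finite-impulse response filter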
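
For reference, the computation a five-tap FIR filter performs is y[n] = h0·x[n] + h1·x[n-1] + … + h4·x[n-4]. The sketch below is a plain software model with made-up coefficients; it says nothing about how the taps are scheduled on the FPFA ALUs.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: each output is the dot product of the taps
    with the most recent len(taps) input samples (zeros before start)."""
    history = [0] * len(taps)
    output = []
    for sample in samples:
        history = [sample] + history[:-1]  # shift in the newest sample
        output.append(sum(h * x for h, x in zip(taps, history)))
    return output


if __name__ == "__main__":
    taps = [1, 2, 3, 4, 5]  # h0..h4, arbitrary demo coefficients
    print(fir_filter([1, 0, 0, 0, 0], taps))  # impulse response equals the taps
    print(fir_filter([1, 1, 1, 1, 1], taps))  # step response: running tap sums
```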

MorphoSys (1999)

Reconfigurable cell

RC Array

• Array of reconfigurable cells

• 64 cells in a 2-D matrix

• SIMD model
• Cells in the same row (column) share configuration
• Each RC operates on different data

TinyRISC (Cont’d)

Implementation & Performance

• 0.35 micron technology
• 4 metal layers
• Operation at 100 MHz
• 170 mm2

Motion Estimation

Block size : 16x16 pixel, Image size : 352x288 pixel

Lx by STMicroelectronics

DART, Raphael David, IRISA/ENSSAT

With STMicroelectronics, UBO univ.

• Multigrain reconfigurable = DPR + FPGA
• Dynamic reconfiguration
• Low power consumption
• Hierarchical distribution of resources
• SCMD (Single Configuration Multiple Data)

DART cluster
• 11 GOPS/cluster
• 1.6 GMACS/cluster
• 0.64 W @ 11 GOPS
• 16 MIPS/mW @ 11 GOPS
• 0.18 µm CMOS

[Figure: cluster architecture with control, configuration memory, FPGA, DMA controller, data memory, and DPR1 to DPR6 linked by a segmented network.]

[Figure: DPR architecture with reg1, reg2, MUL1, ALU1, MUL2, ALU2 on a multibus network; data memories 1 to 4 with address generators AG1 to AG4; loop management; global bus.]
