ENG3050 Embedded Reconfigurable Computing Systems

Introduction to Reconfigurable Computing

ENG3050 ERCS 2

Topics

• The VLSI Design Process
  – Application Specific Integrated Circuits
  – Issues related to power consumption
  – Issues related to scaling
• Traditional Von Neumann Architecture
  – Limitations
  – Enhancements to VNA
• Reconfigurable Computing (fill the gap!)
  – Research Issues
• Summary


References

I. “Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation”, by S. Hauck & A. DeHon

II. “Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications”, by C. Bobda

III. “Computer Organization and Design”, by Patterson and Hennessy

IV. “Digital Integrated Circuits: A Design Perspective”, by Jan Rabaey

Computing Devices Everywhere!

[Figure: computing devices in everyday use: PDA, body, entertainment, household, communication, home networking, car, medicine, PC, supercomputer, game console]

The Transistor Revolution

• First transistor: Bell Labs, 1948
• Bipolar logic: 1960s
• Intel 4004 processor
  – Designed in 1971
  – About 2300 transistors
  – Speed: 1 MHz operation

The VLSI Design Process

• Design flow: Specification → Functional design → Logic design → Circuit design → Physical design → Test/Fabrication
• Physical design steps: Partitioning, Floorplanning & Placement, Routing
• The process is very costly and time consuming

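Different Views

• Design views: Logic, Circuit (Transistor), Layout, Physical
• Abstraction levels: Device, Circuit, Gate, Module, System

[Figure: cross-section of a MOS device with n+ source (S) and drain (D) regions and gate (G)]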

Productivity Gap in VLSI Design

A growing gap between design complexity and design productivity

MOSFET: Metal Oxide Semiconductor Field Effect Transistor

• A voltage-controlled device
• Handles less current than a BJT (slower)
• Dissipates less power
• Achieves higher density on an IC
• Has full voltage swing (0 → 5 V)

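Transistor as a Switch

[Figure: an MOS transistor modeled as a switch between source (S) and drain (D) with on-resistance Ron; the switch closes when |VGS| ≥ |VT|]

nMOS Transistor

[Figure: an nMOS transistor, with drain current Ids plotted against gate-source voltage Vgs]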

MOS Transistors - Types and Symbols

[Figure: circuit symbols with drain (D), gate (G), and source (S) terminals for NMOS enhancement, NMOS depletion, PMOS enhancement, and NMOS with bulk contact (B)]

Circuits can be built using either NMOS transistors or PMOS transistors.

Implementing Logic using: nMOS vs. pMOS Devices


Complementary MOS (CMOS)

NMOS transistors pass a “strong” 0 but a “weak” 1.

PMOS transistors pass a “strong” 1 but a “weak” 0.

Combining both leads to circuits that can pass strong 0s and strong 1s.

[Figure: transmission gate between X and Y, controlled by C and its complement]

Static Complementary MOS (CMOS)

[Figure: a pull-up network (PUN, PMOS only) between VDD and the output F(In1, In2, …, InN), and a pull-down network (PDN, NMOS only) between the output and VSS; both networks are driven by the inputs In1, In2, …, InN]

PUN and PDN are dual logic networks.

At every point in time (except during the switching transients) each gate output is connected to either VDD or VSS via a low-resistance path.

CMOS Inverter

Truth table:

A   Y
0   1
1   0

[Figure: CMOS inverter with a PMOS pull-up network between VDD and Y and an NMOS pull-down network between Y and GND]

• A = 1: the NMOS pull-down transistor is ON and the PMOS pull-up transistor is OFF, so Y = 0 (connected to GND).
• A = 0: the PMOS pull-up transistor is ON and the NMOS pull-down transistor is OFF, so Y = 1 (connected to VDD).

Example Gate: NAND

Example Gate: NOR

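Complex CMOS Gate

OUT = NOT(D + A • (B + C))

A static CMOS gate is inverting, so the output is the complement of the function implemented by the pull-down network.

[Figure: pull-down network with NMOS transistor D in parallel with A in series with (B parallel C); the PMOS pull-up network is its dual]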
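The duality of the pull-up and pull-down networks can be checked with a small switch-level model. The following is a minimal sketch, assuming the gate is the usual inverting static CMOS form, OUT = NOT(D + A·(B + C)); the function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def pdn_conducts(a, b, c, d):
    # NMOS pull-down network: D in parallel with (A in series with (B parallel C)).
    # An NMOS transistor conducts when its input is 1.
    return d or (a and (b or c))


def pun_conducts(a, b, c, d):
    # PMOS pull-up network, the dual of the PDN: series and parallel are swapped,
    # and a PMOS transistor conducts when its input is 0.
    return (not d) and ((not a) or ((not b) and (not c)))


def gate_output(a, b, c, d):
    up = pun_conducts(a, b, c, d)
    down = pdn_conducts(a, b, c, d)
    if up == down:
        # Would mean a floating output or a VDD-to-GND short in steady state.
        raise ValueError("PUN and PDN are not complementary")
    return 1 if up else 0


def truth_table():
    # Map each input tuple (A, B, C, D) to the gate output.
    return {
        bits: gate_output(*(bool(x) for x in bits))
        for bits in product((0, 1), repeat=4)
    }
```

For every input combination exactly one network conducts, which is the property that gives static CMOS its low-resistance path to either VDD or VSS.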

Sources of Power Consumption

Power has three components:
• Static power: consumed when the input isn’t switching
• Dynamic capacitive power: due to charging and discharging of the load capacitance
• Dynamic short-circuit power: direct current from VDD to GND while both transistors are on

Dynamic Power

• Dynamic power is required to charge and discharge load capacitances when transistors switch.
• One cycle involves a rising and a falling output.
  – On a rising output, charge Q = C·VDD is required.
  – On a falling output, the charge is dumped to GND.

[Figure: CMOS gate driving load capacitance C at switching frequency fsw and drawing supply current iDD(t) from VDD; the current has two components: 1. short-circuit current, 2. charge/discharge current]

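Dynamic Power

$$
\begin{aligned}
P_{\text{dynamic}} &= \frac{1}{T}\int_0^T i_{DD}(t)\,V_{DD}\,dt \\
&= \frac{V_{DD}}{T}\int_0^T i_{DD}(t)\,dt \\
&= \frac{V_{DD}}{T}\left[T f_{sw} C V_{DD}\right] \\
&= C V_{DD}^2 f_{sw}
\end{aligned}
$$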

Short circuit power <10% of dynamic power

Lowering Dynamic Power

Pdyn = CL · VDD² · f

• Capacitance: function of fan-out, wire length, and transistor sizes
• Supply voltage: has been dropping with successive generations
• Clock frequency: increasing…
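Because the supply voltage enters the dynamic power expression squared, lowering it has the largest effect. A minimal sketch of that arithmetic follows; the capacitance, voltages, and frequency are hypothetical values picked only to illustrate the formula Pdyn = CL · VDD² · f.

```python
def dynamic_power(c_load_farads, vdd_volts, f_hz):
    """Dynamic power in watts: P = C_L * VDD^2 * f."""
    return c_load_farads * vdd_volts ** 2 * f_hz


# Hypothetical operating points: 1 nF of total switched capacitance at 100 MHz.
baseline = dynamic_power(1e-9, 5.0, 100e6)   # 5 V supply
scaled = dynamic_power(1e-9, 3.3, 100e6)     # same circuit at 3.3 V

# The ratio depends only on the voltage ratio squared.
ratio = scaled / baseline
```

With these assumed numbers the 5 V design dissipates 2.5 W, and dropping the supply to 3.3 V cuts dynamic power to about 44% of that without touching capacitance or frequency.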

Static Power Consumption

• Static current: in CMOS there is no static current as long as Vin < VTN or Vin > VDD + VTP.
• Leakage current: determined by the “off” transistor; influenced by transistor width, supply voltage, and transistor threshold voltages.

Leakage mechanisms:
• Junction leakage
• Gate oxide leakage
• Subthreshold leakage

[Figure: CMOS inverter with VI < VTN, showing the leakage currents Ileak,n and Ileak,p between VDD and the output Vo(low)]

Implementation Choices (target technology)

Digital Circuit Implementation Approaches
• Custom
• Semicustom
  – Cell-based
    » Standard Cells
    » Macro Cells
  – Array-based
    » Pre-diffused (Gate Arrays)
    » Pre-wired (FPGAs, PLDs)

Design Style I: Full Custom

The designer specifies the layout of each individual transistor and connection.

[Figure: full-custom layout of a cell with In, Out, Vdd, and Gnd]

Design Styles

• Full Custom
  – Utilized for large production volume chips such as microprocessors.
  – No restriction on the placement of functional blocks and their interconnections.
  – Highly optimized, but labour intensive.

Design Style II: Gate Array

• Array of prefabricated gates/transistors
• Sea of gates vs. channel based

[Figure: gate-array cell with PMOS and NMOS rows between Vdd and Gnd, using oxide isolation or gate isolation; a NAND gate with inputs A and B and output Out built using gate isolation; unused transistors can in principle be used by an adjacent cell]

Design Style

• Gate Arrays
  – Pre-fabricated array of gates (could be NAND).
  – The design is mapped onto the gates, and the interconnections are routed.

Design Style III: Standard Cells

[Figure: standard-cell layout with rows of cells, routing channels, and IO cells]

Standard Cells

[Figure: inverter cell layouts (In, Out, VDD, GND) with silicided diffusion and with minimal diffusion routing, using metal layers M1 and M2]

[Figure: 2-input NAND gate cell with inputs A and B, output Out, VDD, and GND]

Design Styles

• Standard Cell
  – Utilized for smaller production ASICs that are generated by synthesis tools.
  – Layout arranged in rows of cells that perform computation.
  – Routing done in “channels” between the rows.

Field Programmable Gate Array (FPGA)

• Field programmable gate arrays
  – Pre-fabricated array of programmable logic and interconnections.
  – No fabrication step required.

Design Style Comparisons

STYLE             Full Custom   Standard Cell   Gate Array   FPGA
Cell size         Variable      Fixed height    Fixed        Fixed
Cell type         Variable      Variable        Fixed        Prog.
Cell placement    Variable      In row          Fixed        Fixed
Interconnections  Variable      Variable        Variable     Prog.
Design cost       High          Medium          Medium       Low

ASIC (Application Specific Integrated Circuit)

An ASIC is an IC customized for a particular use, rather than intended for general-purpose use, e.g., phones.

[Figure: ASIC with a mixture of full custom, RAM, and standard cells: dual-port RAM, single-port RAM, FIFO, full-custom block, standard-cell block]

Technology Scaling

Technology scaling has a threefold objective:
• Increase the transistor density
• Reduce the gate delay
• Stabilize dynamic power consumption

The main challenges are static power consumption, power delivery, and power density, which determine the performance of the chip.

Finding solutions to these challenges makes technology scaling an important issue for academia and industry.

VLSI Trends: Moore’s Law

In 1965, Gordon Moore predicted that transistors would continue to shrink, allowing:
• Doubled transistor density every 18-24 months
• Doubled performance every 18-24 months

History has proven Moore right. But is the end in sight?
• Physical limitations
• Economic limitations

[Photo: Gordon Moore, Intel Co-Founder and Chairman Emeritus. Image source: Intel Corporation, www.intel.com]

Amazingly visionary: the million transistor/chip barrier was crossed in the 1980s.
• 2300 transistors, 1 MHz clock (Intel 4004), 1971
• 16 million transistors (UltraSPARC III)
• 42 million transistors, 2 GHz clock (Intel Xeon), 2001
• 55 million transistors, 3 GHz, 130 nm technology, 250 mm² die (Intel Pentium 4), 2004
• 140 million transistors (HP PA-8500)
• 1 billion transistors (SoC)

Technology Scaling: Moore’s Law

Moore’s Law Continues Fueling Reprogrammable FPGA Advances

[Figure: FPGA process technology roadmap, 1999 to 2017, through the 180 nm, 150 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, and 8 nm nodes, grouped into mature FPGA product technology, developing FPGA product technology, and future process technology]

• Planned continuation of the 2-year technology node cycle
• “Traditional Scaling” is starting to be affected by the fundamental material limits of the planar CMOS process
• “Equivalent Scaling”, or the assimilation of new materials, structures, and functional integration, will drive continued scaling

Moore’s Law in Microprocessors

[Figure: transistor count (millions, log scale from 0.001 to 1000) of lead microprocessors vs. year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6; 2X growth in 1.96 years]

Transistors on lead microprocessors double every 2 years.
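The doubling period can be estimated from any two data points. The sketch below is only an illustration: it uses two figures quoted elsewhere in these slides (2300 transistors for the Intel 4004 in 1971 and 42 million for the Intel Xeon in 2001), and the function names are hypothetical.

```python
import math


def doubling_time_years(count_a, year_a, count_b, year_b):
    """Years per doubling implied by two (transistor count, year) data points."""
    doublings = math.log2(count_b / count_a)
    return (year_b - year_a) / doublings


def projected_transistors(start_count, start_year, year, doubling_years=2.0):
    """Project a transistor count assuming a fixed doubling period."""
    return start_count * 2 ** ((year - start_year) / doubling_years)


# Intel 4004 (1971) to Intel Xeon (2001), as quoted in the slides.
estimated_period = doubling_time_years(2300, 1971, 42e6, 2001)
```

The estimate comes out slightly above two years, in line with the "double every 2 years" trend. The exact value depends on which two processors are picked.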

Frequency

[Figure: clock frequency (MHz, log scale from 0.1 to 10000) of lead microprocessors vs. year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6; frequency doubles every 2 years]

Lead microprocessors’ frequency doubles every 2 years.

Power Dissipation!

[Figure: power (watts, log scale from 0.1 to 100) of lead microprocessors vs. year, 1971 to 2000, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6]

Lead microprocessors’ power continues to increase.

Leakage or “Static” Power

Leakage power becomes a substantial portion of total power.

Trends project that leakage power will account for up to 50% of total power.

Not a practical limit of power dissipation.

Power Dissipation Projection

Considering only active power, power dissipation increases from 100W in 1999 to 2000W in 2010.

Dotted line indicates the power dissipation including the leakage power.

Interconnect Scaling

• Higher densities are only possible if the interconnects also scale.
• Reduced width → increased resistance
• Denser interconnects → higher capacitance
• To account for increased parasitics and integration complexity, more interconnection layers are added:
  – thinner and tighter layers → local interconnections
  – thicker and sparser layers → global interconnections and power
• Higher resistance and capacitance lead to higher RC delay.

Technology Trends: Interconnect Delay

[Figure: delay (ns, log scale from 0.1 to 10) vs. minimum feature size (2.0 µ, 1.5 µ, 1.0 µ, 0.8 µ, 0.5 µ, 0.35 µ), comparing typical gate delay with interconnect delay]

Interconnect delay has become the performance limiter as technology shrinks into the submicron regime.

Technology Trend and Challenges

• Interconnect determines the overall performance
• In addition: noise, power => design closure
• Furthermore: manufacturability => manufacturing closure

Source: ITRS’03

Application-Specific Integrated Circuits

ASIC advantages
• Best performance (speed)
• Smallest chip (area)
• Best power consumption

ASIC limitations?
• Inflexible (fixed!)
• Expensive to design
• High fixed costs require large production runs
• Requires skillful designers

How to cope with complexity & cost?

Intellectual Property (IP)-based Design

System on Chip (SoC)

• Increasing complexity: MPSoC, NoC
• Assembly of “prefabricated components”
• Maximize VC (IP) reuse: over 90%
• New economics: fast and correct design > optimum design
• Design and verification at the system level
  – Interface between VCs
  – SW becomes more important

[Figure: SoC block diagram with µP, memory, video unit, graphics, coms, DSP, custom logic, and software]

The Von Neumann Computer

Principle

In 1945, the mathematician Von Neumann (VN) demonstrated in a study of computation that a computer could have a simple structure, capable of executing any kind of program, given a properly programmed control unit, without the need for hardware modification.

ENIAC: the first electronic computer (1946)

Structure
• A memory for storing program and data. The memory consists of words of the same length.
• A control unit (control path) featuring a program counter for controlling program execution.
• An arithmetic and logic unit (ALU), also called the data path, for program execution.

[Figure: processor (central processing unit) containing the data path, the control path, registers, the instruction register, the PC, and the address register, connected through data and address lines to a memory holding data and instructions]


Coding
A program is coded as a set of instructions to be sequentially executed.

Program execution
• Instruction Fetch (IF): the next instruction to be executed is fetched from the memory
• Decode (D): the instruction is decoded (which operation?)
• Read operand (R): operands are read from the memory
• Execute (EX): the operation is executed on the ALU
• Write result (W): results are written back to the memory

Instructions execute in the cycle (IF, D, R, EX, W).

What is the problem with this computing paradigm?

Execution Cycle

• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction
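The execution cycle can be sketched as a loop over a single memory that holds both instructions and data. This is a minimal illustration, assuming a hypothetical accumulator machine with four made-up instructions (LOAD, ADD, STORE, HALT); it is not the instruction set of any real processor.

```python
def run(memory):
    """Execute a program stored in `memory`, which holds instructions and data."""
    pc = 0    # program counter
    acc = 0   # accumulator register
    while True:
        instruction = memory[pc]       # instruction fetch
        opcode, address = instruction  # instruction decode
        pc += 1                        # determine successor instruction
        if opcode == "LOAD":
            acc = memory[address]      # operand fetch
        elif opcode == "ADD":
            acc = acc + memory[address]  # operand fetch + execute
        elif opcode == "STORE":
            memory[address] = acc      # result store
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")


# Addresses 0-3 hold the program, addresses 4-6 hold data.
program = [
    ("LOAD", 4),   # acc = memory[4]
    ("ADD", 5),    # acc = acc + memory[5]
    ("STORE", 6),  # memory[6] = acc
    ("HALT", 0),
    2,             # address 4
    3,             # address 5
    0,             # address 6: result goes here
]
result = run(program)
```

Every instruction and every operand travels over the same path to the same memory, one step at a time, which is the sequential behavior the slides go on to question.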

Bottlenecks in VN Architecture

Processor-Memory Performance Gap

[Figure: performance (log scale from 1 to 10000) vs. year]
• µProc: 55%/year (2X/1.5 yr), “Moore’s Law”
• DRAM: 7%/year (2X/10 yrs)
• Processor-memory performance gap: grows 50%/year


Advantages:
• Simplicity.
• Flexibility: any well-coded program can be executed.

Drawbacks:
• Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing).
• Resource efficiency: only one part of the hardware resources is required for the execution of an instruction. The rest remains idle.
• Memory access: memories are about 5 times slower than the processor.

How to compensate for deficiencies?

Improving Performance of VN (GPPs)

1. Technology Scaling
   – Improve performance (increase clock frequency!)
2. Improving Instruction Set of Processor
3. Application Specific Processors (DSP)
4. Use of Hierarchical Memory System
   – Cache can enhance speed
5. Multiplicity of Functional Units (H/W)
   – Adders/Multipliers/Dividers (CDC-6600)
6. Pipelining within CPU (H/W)
   – A four-stage pipeline (IF/ID/OF/EX)
7. Overlap CPU & I/O Operations (H/W)
   – DMA (Direct Memory Access) can be used to enhance performance
8. Time Sharing (S/W)
   – Multi-tasking assigns fixed or variable time slices to multiple programs
9. Parallelism & Multithreading (S/W) (H/W)
   – Compilers/Multi-core systems

Technology Scaling

• Technology scaling tends to increase transistor density and enhance performance.
• Scaling is essential to ensure sustained growth in the IC industry to meet future needs.
• However, the main challenges faced are:
  – Power consumption
  – Manufacturing issues, i.e., yield
  – New CAD tools that can support newer technologies

Instruction Set of Processor

CPU execution time (CPU time): the time the CPU spends working on a task.
Does not include time waiting for I/O or running other programs.

CPU execution time for a program = # CPU clock cycles for a program × clock cycle time

or

CPU execution time for a program = # CPU clock cycles for a program / clock rate

Performance can be improved by reducing either the length of the clock cycle or the number of clock cycles required for a program.

Example: Improving Performance

Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a computer B that will run this program in 6 seconds. The designer has determined that a substantial increase in the clock rate is possible, but this increase will affect the rest of the CPU design, causing computer B to require 1.2 times as many clock cycles as computer A for this program. What clock rate should we tell the designer to target?

CPU time_A = CPU clock cycles_A / (Clock rate)_A

10 seconds = CPU clock cycles_A / (4 × 10^9 cycles/second)

CPU clock cycles_A = 10 seconds × 4 × 10^9 cycles/second = 40 × 10^9 cycles

CPU time_B = 1.2 × CPU clock cycles_A / (Clock rate)_B

6 seconds = 1.2 × 40 × 10^9 cycles / (Clock rate)_B

(Clock rate)_B = 1.2 × 40 × 10^9 cycles / 6 seconds = 8 × 10^9 cycles/second = 8 GHz
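The same arithmetic can be replayed in a few lines. This is a sketch that simply re-derives the numbers of the worked example; the variable names are chosen for illustration.

```python
# Computer A: 10 seconds at 4 GHz.
cpu_time_a = 10.0          # seconds
clock_rate_a = 4e9         # cycles per second
cycles_a = cpu_time_a * clock_rate_a   # cycles = time x clock rate

# Computer B needs 1.2 times as many cycles and must finish in 6 seconds.
cycles_b = 1.2 * cycles_a
target_time_b = 6.0        # seconds
clock_rate_b = cycles_b / target_time_b  # clock rate = cycles / time

clock_rate_b_ghz = clock_rate_b / 1e9
```

Computer B needs twice the clock rate of A to gain a 10/6 ≈ 1.67x speedup, because 20% of the gain is spent on the extra cycles.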

Clock Cycles per Instruction

• Not all instructions take the same amount of time to execute.
• One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction.

# CPU clock cycles for a program = # Instructions for a program × Average clock cycles per instruction

• Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute.
• A way to compare two different implementations of the same ISA.

CPI for this instruction class:

Instruction class   A   B   C
CPI                 1   2   3

Determinants of CPU Performance

CPU time = Instruction_count x CPI x clock_cycle

                        Instruction_count   CPI   clock_cycle
Algorithm                       X            X
Programming language            X            X
Compiler                        X            X
ISA                             X            X         X
Processor organization                       X         X
Technology                                             X
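To see how instruction count and CPI interact in CPU time = Instruction_count x CPI x clock_cycle, the sketch below uses the class CPIs from the slides (A = 1, B = 2, C = 3). The two instruction mixes and the 0.25 ns clock cycle are hypothetical values chosen only for illustration.

```python
# CPI per instruction class, as given in the slides.
CPI = {"A": 1, "B": 2, "C": 3}


def total_cycles(mix):
    """Clock cycles for a mix given as {instruction class: instruction count}."""
    return sum(count * CPI[cls] for cls, count in mix.items())


def average_cpi(mix):
    """Average clock cycles per instruction for the mix."""
    return total_cycles(mix) / sum(mix.values())


def cpu_time(mix, clock_cycle_seconds):
    """CPU time = instruction count x CPI x clock cycle time."""
    instruction_count = sum(mix.values())
    return instruction_count * average_cpi(mix) * clock_cycle_seconds


# Two hypothetical code sequences for the same task.
sequence_1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
sequence_2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

time_1 = cpu_time(sequence_1, 0.25e-9)
time_2 = cpu_time(sequence_2, 0.25e-9)
```

Sequence 2 executes more instructions but needs fewer cycles (9 versus 10), so it finishes first. Instruction count alone does not determine performance.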

Application Specific Processors

#1: CPU designed for efficient DSP processing
– Multiple MAC units, 2 accumulators, additional adder, barrel shifter

#2: Multiple busses for efficient data and program flow
– Four busses and large on-chip memory that result in sustained performance near peak

#3: Highly tuned instruction set for powerful DSP computing
– Sophisticated instructions that execute in fewer cycles, with less code and low power demands

Memory Hierarchy: The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
• Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)

Take advantage of locality. How? Use a memory hierarchy.

Levels of the Memory Hierarchy (Speed vs. Cost)

Level         Capacity     Access time             Cost (cents/bit)
Registers     100s bytes   <10s ns
Cache         K bytes      10-100 ns               1 to 0.1
Main memory   M bytes      200-500 ns              0.0001 to 0.00001
Disk          G bytes      10 ms (10,000,000 ns)   10^-5 to 10^-6
Tape          infinite     sec-min                 10^-8

Transfer between levels   Staging unit      Managed by         Transfer size
Registers <-> Cache       Instr. operands   program/compiler   1-8 bytes
Cache <-> Memory          Blocks            cache controller   8-128 bytes
Memory <-> Disk           Pages             OS                 512-4K bytes
Disk <-> Tape             Files             user/operator      Mbytes

Upper levels are faster; lower levels are larger.

[Figure: PC system block diagram. Components: CPU (registers), co-processor, cache controller, cache memory (static RAM), DRAM (dynamic RAM), PCI controller, EISA/PCI bridge controller, hard drive controller, video adaptor, SCSI adaptor, PC cards 1 to 3. Buses: local CPU/memory bus, Peripheral Component Interconnect bus, EISA PC bus, SCSI bus]

Harvard Architecture

[Figure: CPU (with PC) connected to a data memory and a program memory, each through its own address and data lines]

The Harvard architecture differs from the Von Neumann architecture in that the program memory and the data memory are not shared (they sit on different busses).

This allows instructions to be fetched and executed concurrently with data accesses. The Harvard architecture allows for cleaner pipelining of instructions since there is no contention between fetching data and fetching instructions.

Exploiting Parallelism

• Bit-level parallelism: 1970 to ~2014
  – 4-bit, 8-bit, 16-bit, 32-bit, 64-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
  – Pipelining (enhance throughput)
  – Superscalar (execute multiple instructions)
  – Limits to benefits of ILP?
• Process-level or thread-level parallelism: mainstream for general-purpose computing?
  – Servers (multi-processing systems)
  – High-end desktop dual processor (multi-core)

Pipelining

• Exploits parallelism at the instruction level.
• Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
• Today pipelining is key to making processors fast.
• Pipelining is not only used in general-purpose processors but can also be used in hardware accelerators (Reconfigurable Computing Systems).

ENG3050 ERCS 87

Speed Up

If the stages are perfectly balanced, then the time between instructions on the pipelined processor, assuming ideal conditions, is equal to:

$$
\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of pipe stages}}
$$

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages, i.e., a five-stage pipeline is nearly five times faster.

Pipelining improves performance by increasing instruction throughput. The execution time of an individual instruction remains the same (i.e., latency is the same).
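Why the speedup only approaches the number of stages "with a large number of instructions" can be seen by counting cycles. The sketch below assumes perfectly balanced stages, no hazards, and an unpipelined machine that spends one cycle per stage on each instruction; the function names are illustrative.

```python
def unpipelined_cycles(n_instructions, n_stages):
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    # The first instruction needs n_stages cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )


single = speedup(1, 5)            # one instruction: no overlap possible
short_run = speedup(10, 5)        # fill time still matters
long_run = speedup(1_000_000, 5)  # approaches the stage count
```

A single instruction gains nothing, ten instructions gain about 3.6x, and a long run approaches the ideal factor of five without ever reaching it.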


Example: Conventional Data Path Timing

The figure shows the maximum delay values for each of the components of a typical data path.

1. 4 ns (3 ns + 1 ns) to read two operands from the register file.
2. 4 ns to perform an operation.
3. 4 ns to write the result back.

Total: 12 ns to perform a single micro-operation.

The rate of execution is then set at 1/12 ns = 83.3 MHz.

Can we make it faster?

Example: Pipelined Data Path Timing

We can break the delay of 12 ns by inserting registers between the different components of the system.

• A register is inserted between the register file and the function unit (OF).
• Another register can be inserted between the function unit and MUX D (EX + WB).

3-stage pipeline: OF / EX / WB

The maximum delay is now 5 ns, allowing a maximum clock frequency of 200 MHz.
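The two clock rates follow directly from the critical-path delays of 12 ns and 5 ns. The sketch below replays that arithmetic; the throughput gain and the per-operation latency it derives are not stated in the slides and follow only from those two delays and the 3-stage pipeline.

```python
def max_clock_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a given critical-path delay in ns."""
    return 1000.0 / critical_path_ns


conventional_mhz = max_clock_mhz(12.0)  # unpipelined data path
pipelined_mhz = max_clock_mhz(5.0)      # longest pipeline stage

# Derived quantities (assumptions beyond the slides).
throughput_gain = pipelined_mhz / conventional_mhz
pipelined_latency_ns = 3 * 5.0          # one operation crosses 3 stages
```

Throughput improves by 2.4x rather than 3x because the stages are not perfectly balanced against the added register delay, and a single micro-operation now takes 15 ns instead of 12 ns.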


It’s Not That Easy for Computers

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
  – Structural hazards: the hardware cannot support this combination of instructions (two dogs fighting for the same bone).
    » In a computer system: instead of having two memories we have a single memory, and we need to write a value and also read a new instruction.
  – Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
    » Example:
      • Add $s0, $t0, $t1
      • Sub $t2, $s0, $t3
  – Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
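The data hazard in the Add/Sub example is a read-after-write dependence on $s0. A minimal sketch of how such a dependence could be detected follows. It assumes the three-register format "op dest, src1, src2" and, as a simplification, compares only adjacent instructions; the helper names are hypothetical.

```python
def parse(instruction):
    """Split 'Op dest, src1, src2' into (op, dest, [src1, src2])."""
    op, operands = instruction.split(None, 1)
    registers = [reg.strip() for reg in operands.split(",")]
    return op.lower(), registers[0], registers[1:]


def raw_hazards(program):
    """Return (producer index, consumer index, register) for adjacent RAW hazards."""
    hazards = []
    for i in range(len(program) - 1):
        _, dest, _ = parse(program[i])
        _, _, sources = parse(program[i + 1])
        if dest in sources:
            # The next instruction reads a register this one has not written yet.
            hazards.append((i, i + 1, dest))
    return hazards


example = ["Add $s0, $t0, $t1", "Sub $t2, $s0, $t3"]
found = raw_hazards(example)
```

Real pipelines resolve such hazards by stalling or by forwarding the ALU result. A fuller check would also look more than one instruction ahead, depending on pipeline depth.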

SUPERSCALAR ARCHITECTURE

Fetching and dispatching two instructions per cycle

Parallel Processing (Process Level Parallelism)

Using more than one processor to solve a problem
• The idea is that n processors operating simultaneously can achieve the result n times faster.

Motives
• Diminishing returns from ILP
  - Limited ILP in programs
  - ILP increasingly expensive to exploit
• Fault tolerance
• Availability of multiple (independent) threads/processes

Flynn’s Taxonomy of Multiprocessing

• Single-instruction single-data stream (SISD) machines
• Single-instruction multiple-data stream (SIMD) machines
• Multiple-instruction single-data stream (MISD) machines
• Multiple-instruction multiple-data stream (MIMD) machines

                     Single instruction (SI)           Multiple instruction (MI)
Single data (SD)     SISD: single-threaded process     MISD: pipeline architecture
Multiple data (MD)   SIMD: vector processing           MIMD: multi-threaded programming

Single Instruction Stream Single Data Stream (SISD)

In a single-processor computer, a single stream of instructions is generated by the program. The instructions operate on a single stream of data items (traditional Von Neumann architecture).

[Figure: control unit sending an instruction stream to the processor, which exchanges a data stream with memory]

Algorithms for SISD computers do not contain any parallelism.

Single Instruction Stream Multiple Data Stream (SIMD)

• A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist.
• The instructions from the program are broadcast to more than one processor.
• Each processor executes the same instruction in synchronism, but using different data.

Example: vector computers

SIMD Architecture

[Figure: one control unit broadcasting an instruction stream to processors P1, P2, …, PN, which exchange data streams with a shared memory or interconnection network]

The processors operate synchronously, and a global clock is used to ensure lockstep operation.

SIMD Application: Example

Add two matrices: C = A + B

Say we have two matrices A and B of order 2 and we have 4 processors, i.e., we wish to calculate:

C11 = A11 + B11    C12 = A12 + B12
C21 = A21 + B21    C22 = A22 + B22

The same instruction (add the two numbers) is sent to each processor, but each processor receives different data.
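The matrix addition can be sketched as one instruction applied to four "lanes" of data. The code below is only a sequential simulation of the SIMD idea in plain Python, with each list element standing in for one processor's data; on real SIMD hardware the four additions would happen in the same clock step.

```python
def simd_add(a, b):
    """Add two equally sized matrices by broadcasting one 'add' to every lane."""
    rows, cols = len(a), len(a[0])
    # Distribute the data: one (A_ij, B_ij) pair per processing element.
    lanes = [(a[i][j], b[i][j]) for i in range(rows) for j in range(cols)]
    # Broadcast the same instruction (add) to every lane.
    results = [x + y for (x, y) in lanes]
    # Gather the per-lane results back into matrix form.
    return [results[r * cols:(r + 1) * cols] for r in range(rows)]


A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
C = simd_add(A, B)
```

The instruction stream contains a single add regardless of matrix size; only the number of lanes changes.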

Multiple Instruction Stream Single Data Stream (MISD)

• A computer with multiple processors, each sharing a common memory.
• There are multiple streams of instructions and one stream of data.

[Figure: two processors, each with its own control unit, sharing one memory]

Example?

MISD Example

Check whether a number Z is prime:
• Each processor is assigned a set of test divisors in its instruction stream.
• Each processor takes Z as input and tries to divide it by its divisors.

MISD is awkward to implement, and such machines are only experimental.

No commercial MISD machine exists.
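The prime-testing idea can be sketched by handing every "processor" the same Z but a different set of divisors. This is a sequential simulation for illustration only; the number of processors, the round-robin split of divisors, and the choice to test divisors only up to the square root of Z are assumptions not stated in the slides.

```python
import math


def assign_divisors(z, n_processors):
    """Split the candidate divisors 2..isqrt(z) round-robin among the processors."""
    candidates = list(range(2, math.isqrt(z) + 1))
    return [candidates[p::n_processors] for p in range(n_processors)]


def processor_finds_divisor(z, divisors):
    """One processor's instruction stream: try its own divisors on the shared Z."""
    return any(z % d == 0 for d in divisors)


def is_prime(z, n_processors=4):
    if z < 2:
        return False
    # Every processor receives the same data item Z but different instructions.
    return not any(
        processor_finds_divisor(z, divisors)
        for divisors in assign_divisors(z, n_processors)
    )
```

For example, `is_prime(97)` is true, while `is_prime(91)` is false because one processor's divisor set contains 7.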

Multiple Instruction Stream Multiple Data Stream (MIMD)

• General-purpose multiprocessor system.
• Each processor has a separate program, and one instruction stream is generated from each program for each processor.
• Each instruction stream operates upon different data.

The most general and most useful of our classifications.

MIMD Architecture

[Figure: processors P1, P2, …, PN, each with its own control unit C1, C2, …, CN, connected to a shared memory or interconnection network]

Each processor operates under the control of an instruction stream issued by its own control unit.

Processors operate asynchronously in general.

Parallel vs. Distributed

• Parallel: multiple CPUs within a shared-memory machine
• Distributed: multiple machines, each with its own memory, connected over a network

[Figure: processors, each consuming instructions and data items (D); the parallel case uses a shared memory, the distributed case uses a network connection for data transfer]

MIMD: Parallel Programming Models

• Distributed Memory
  – Explicit communication
    » Send messages
    » Send (tid, tag, message)
    » Receive (tid, tag, message)
  – Synchronization
    » Block on messages (implicit sync)
• Shared Memory
  – Implicit communication
    » Using shared address space
    » Loads and stores
  – Synchronization
    » Atomic memory operators
    » Semaphores

Scalability?

Multi-Processing Issues

• How to assign tasks to processors?

• What if we have more tasks than processors?

• What if processors need to share partial results?

• How do we aggregate partial results?

• How do we know all processors have finished?

• What if processors die?

Speedup factor

$$
S(n) = \frac{\text{Execution time on a single processor}}{\text{Execution time on a multiprocessor with } n \text{ processors}}
$$

S(n) gives the increase in speed obtained by using a multiprocessor.

The speedup factor can also be cast in terms of computational steps:

$$
S(n) = \frac{\text{Number of steps using one processor}}{\text{Number of parallel steps using } n \text{ processors}}
$$

The maximum speedup is n with n processors (linear speedup). This theoretical limit is not always achieved. WHY?

Amdahl’s Law

• Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using `n’ processors.

• The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.

http://en.wikipedia.org/wiki/Parallel_computing

http://en.wikipedia.org/wiki/Speedup


Limitations?

Amdahl’s Law

T = 1 / ((1 – a) + a / s)

T = Overall speedup

a = Fraction of the original program that could be enhanced by executing in parallel or transferring it to hardware.

s = Expected speedup obtained (from hardware) for that fraction of the program. For ideal parallel execution, this is the number of processors used.
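The formula can be evaluated directly. The sketch below uses the symbols T, a and s from the definitions above; the function name and the sample value a = 0.5 are assumptions chosen for illustration.

```python
import math


def amdahl_speedup(a, s):
    """Overall speedup T = 1 / ((1 - a) + a / s).

    a: fraction of the original program that can be enhanced (0..1)
    s: speedup of that fraction; math.inf models unlimited processors
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    # a / math.inf evaluates to 0.0, leaving only the sequential part.
    return 1.0 / ((1.0 - a) + a / s)


# Half of the program is enhanced; s grows without bound.
for s in (2, 8, 64, math.inf):
    print(f"s = {s}: T = {amdahl_speedup(0.5, s):.3f}")
```

As s grows, T approaches 1 / (1 – a), so the sequential fraction sets the ceiling regardless of how many processors are added.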


Amdahl’s Law: Example

• Assume you profiled an application and noticed that 12% of the application can be parallelized and 88% of the operations are not parallelizable.

• Question: What is the maximum speedup of the parallelized version?

• Answer: Amdahl’s law states that the maximum speedup of the parallelized version is:

1/(1 – 0.12) = 1.136 times as fast as the non-parallelized implementation.

We are assuming that `s’ is infinite (that is, the number of processors you are using is infinite).


Amdahl’s Law: Example

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

Speedup_overall = 1 / ((1 – 0.1) + 0.1/2) = 1 / 0.95 = 1.053

Law of diminishing returns: Focus on the common case!


Spatial vs. Temporal Computing

Spatial (ASIC or FPGA): Ax² + Bx + C

Temporal (Processor): (Ax + B)x + C

Temporal vs. spatial based computing

Temporal-based execution (software)

Spatial-based execution (reconfigurable computing)

The ability to extract parallelism (or concurrency) from algorithm descriptions is the key to acceleration using reconfigurable computing.
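The difference between the two styles can be approximated in software by counting steps over a dataflow graph. The sketch below rests on assumptions that are not in the slides: the graph layout, the node names, and a cost model in which a temporal machine executes one operation per step on a shared ALU, while a spatial implementation gives every operation its own operator and is limited only by the longest dependency chain.

```python
# Dataflow graph for A*x^2 + B*x + C: node -> (operation, input names).
GRAPH = {
    "xx":  ("mul", ("x", "x")),
    "bx":  ("mul", ("B", "x")),
    "axx": ("mul", ("A", "xx")),
    "bxc": ("add", ("bx", "C")),
    "y":   ("add", ("axx", "bxc")),
}

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}


def evaluate(inputs):
    """Return (result, temporal_steps, spatial_steps) for GRAPH."""
    values = dict(inputs)
    level = {name: 0 for name in inputs}    # inputs are available at step 0
    for node, (op, args) in GRAPH.items():  # insertion order is topological
        values[node] = OPS[op](*(values[a] for a in args))
        level[node] = 1 + max(level[a] for a in args)
    temporal_steps = len(GRAPH)             # shared ALU: one operation per step
    spatial_steps = max(level.values())     # dedicated operators: graph depth
    return values["y"], temporal_steps, spatial_steps


print(evaluate({"A": 2, "B": 3, "C": 4, "x": 5}))
```

For this graph the temporal count is 5 steps and the spatial depth is 3, because the two first-level multiplications, and then the second-level multiply and add, have no dependency on each other and can run side by side.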


Methods for executing algorithms

Hardware (Application Specific Integrated Circuits)
Advantages:
• very high performance and efficient
Disadvantages:
• not flexible (can’t be altered after fabrication)
• expensive

Software-programmed processors
Advantages:
• software is very flexible to change
Disadvantages:
• performance can suffer if clock is not fast
• fixed instruction set by hardware

Reconfigurable computing
Advantages:
• fills the gap between hardware and software
• much higher performance than software
• higher level of flexibility than hardware


Reconfigurable Computing

The ideal device should combine:
• the flexibility of the Von Neumann computer
• the efficiency of ASICs

The ideal device should be able to:
• Optimally implement an application at a given time
• Re-adapt to allow the optimal implementation of a new application

We call such a device a reconfigurable device.

Definition: Reconfigurable computing can be defined as the study of computations involving reconfigurable devices. This includes:
1. Architecture
2. Algorithms
3. Applications


How to fill the gap?

Figure: flexibility versus performance for general-purpose processors (GPs), DSPs, reconfigurable computing systems (RCS), and ASICs.


Reconfigurable Hardware (FPGAs)

KEY ADVANTAGE: Performance of Hardware, Flexibility of Software

Figure: FPGA fabric showing CLBs, Block RAM, and an IP core (multiplier).

Why is reconfigurable computing more relevant these days?

• Demand for high-performance computation is soaring:
– large-scale optimization problems, physics and earth simulation, bioinformatics, signal processing (e.g. HDTV), etc.

• Why are software-programmed processors no longer attractive?
– The speed of temporal (sequential) instruction execution is no longer improving
– General-purpose multi-core processors require coarse-grain thread-level parallelism

• Why are reconfigurable fabrics currently attractive?
– Increased integration densities allow large FPGAs that can implement substantial functions
– They provide the spatial computational resources required to implement massively parallel computations directly in hardware


Reconfigurable Devices

Reconfigurable Devices (RD) are used in many different ways:
1. Rapid Prototyping
2. Non-frequently reconfigured systems
3. Frequently reconfigured systems
4. High Performance Computing (Acceleration of Complex Algorithms)


1. Rapid prototyping `STATIC`

Testing hardware in real conditions before fabrication

Software simulation
• Relatively inexpensive
• Slow
• Accuracy?

Hardware emulation
• Hardware testing under real operation conditions
• Fast
• Accurate
• Allows several iterations

APTIX System Explorer

ITALTEL FLEXBENCH


2. Non-Frequent Reconfiguration `STATIC`


3. Frequently Reconfigured `STATIC`

Computing systems that are able to adapt their behaviour and structure to changing operating and environmental conditions, time-varying optimization objectives, and physical constraints like changing protocols, new standards, or dynamically changing operation conditions of technical systems


Static & Dynamic Reconfiguration


Conclusions

Benefits of RCS

• Non-permanent customization and application development after fabrication (“Late Binding”)

• Achieves the high performance that real-time applications require

• Fast time-to-market (evolving requirements and standards, new ideas)

Disadvantages

• Efficiency penalty (area, performance, power) as compared to ASIC

• Not as flexible as General Purpose Processors


Resource Hazards

A resource hazard occurs when two or more instructions that are already in the pipeline need the same resource.

The result is that the instructions must be executed in serial rather than in parallel for a portion of the pipeline.

A resource hazard is sometimes referred to as a structural hazard.
