Lecture 13: Mid-term 1 Review
October 22, 2013

ECE 636 Reconfigurable Computing


SRAM-based FPGA

• SRAM bits can be programmed many times

• Each programming bit takes up five transistors

• Larger device area reduces speed versus EPROM and antifuse.

[Figure: SRAM programming bit with Read or Write, Data, and Q signals, feeding a 2-input LUT with inputs I1 and I2, programming bits P1, P2, P3, P4, and output Out]
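A minimal sketch of how the four programming bits of a 2-input LUT select the output. The function name `lut2` and the bit ordering (the first stored bit is selected by I1 = 0, I2 = 0) are assumptions for illustration; the slide does not specify the ordering.

```python
def lut2(p_bits, i1, i2):
    """Return the output of a 2-input LUT.

    p_bits holds the four SRAM programming bits (P1..P4 on the slide).
    The two inputs act as a multiplexer select; the index order is assumed.
    """
    index = (i1 << 1) | i2
    return p_bits[index]


# Programming the LUT as an AND gate: only I1=1, I2=1 selects a stored 1.
AND_BITS = [0, 0, 0, 1]
# Reprogramming the same structure as XOR only changes the stored bits.
XOR_BITS = [0, 1, 1, 0]
```

Because the function is defined entirely by the stored bits, the same LUT can be reprogrammed many times, which is the property the first bullet relies on.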


Field Programmable Gate Array


Connection Box Flexibility

• Fc -> How many tracks does an input pin connect to?

• If logic cluster is small, Fc is large (Fc = W)

• If logic cluster is large, Fc can be less.

- Approximately 0.2W for Xilinx XC4000EX, Virtex

[Figure: logic cluster with an I/O pin and an output connecting to tracks T0, T1, T2; Fc = 3]


Switchbox Flexibility

• Switch box provides optimized interconnection area.

• Flexibility found to be not as important as FC

• Six transistors needed for Fs = 3

[Figure: switch box with tracks 0 and 1 on each side]


Switchbox Issues


Bidirectional vs Directional


Directional Wiring: Outputs can use switch block muxes

Dir Architecture

• Single-driver wiring

• New connectivity constraint


Fine-grained Approach

• For 4-input LUTs 16 bits of information available

• Can be chained together through programmable network.

• Decoder and multiplexer an issue.

• Flexibility is a key aspect.

[Figure: LUT1 and LUT2 drawn as 16X1 memories with Addr, A, and D ports]


Hill Climbing Algorithms

• To avoid getting trapped in local minima, consider “hill-climbing” approach

• Need to accept worse solutions or make “bad” moves to reach the global minimum.

• Acceptance is probabilistic. Only accept cost-increasing moves some of the time.

[Figure: cost versus solution space]
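A minimal sketch of a probabilistic acceptance rule. The slide only says that cost-increasing moves are accepted some of the time; the exponential (Metropolis) rule and the `temperature` parameter used here are the usual simulated-annealing choice and are an assumption, not something the slide states.

```python
import math
import random


def accept_move(delta_cost, temperature, rng=random):
    """Decide whether to accept a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return True  # improving (or neutral) moves are always accepted
    if temperature <= 0:
        return False  # with no temperature left, behave greedily
    # Assumed rule: worse moves are accepted with probability exp(-delta/T).
    return rng.random() < math.exp(-delta_cost / temperature)
```

At a high temperature most "bad" moves are accepted, which lets the search climb out of a local minimum; as the temperature falls the rule approaches plain greedy descent.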


Maze Routing

• Evaluate shortest feasible paths based on a cost function

• As in row-based devices, the global route allocates channel bandwidth, not specific solutions.

• Formulate cost function as needed to address desired goal.

[Figure: maze-routing grid with cells labeled L, C, and S]
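A minimal sketch of cost-driven maze routing on a grid, written as a lowest-cost-first wavefront expansion. The grid representation (a per-cell entry cost, with `None` for a blocked cell) and the function name `maze_route` are assumptions for illustration.

```python
import heapq


def maze_route(grid_cost, source, sink):
    """Return (total_cost, path) for the cheapest route, or None if blocked.

    grid_cost[r][c] is the cost of entering a cell; None marks an obstacle.
    """
    rows, cols = len(grid_cost), len(grid_cost[0])
    best = {source: 0}
    prev = {}
    frontier = [(0, source)]
    while frontier:
        cost, cell = heapq.heappop(frontier)
        if cell == sink:
            # Walk the back-pointers from sink to source.
            path = [cell]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return cost, path[::-1]
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if grid_cost[nr][nc] is None:
                continue
            new_cost = cost + grid_cost[nr][nc]
            if new_cost < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_cost
                prev[(nr, nc)] = cell
                heapq.heappush(frontier, (new_cost, (nr, nc)))
    return None
```

Changing the values in `grid_cost` (for example, raising the cost of congested cells) is one way to formulate the cost function for a different goal without touching the search itself.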


Routing Tradeoffs

• Bias router to find first, best route.

• Vary number of node expansions using:

$\mathrm{pcost}_i = (1 - a) \times \mathrm{pcost}_{i-1} + \mathrm{ncost}_i + a \times \mathrm{dist}_i$
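A minimal sketch of the path-cost expression above. Reading `a` as a value between 0 and 1 that trades accumulated path cost against the remaining distance to the target is an assumption about the slide's intent; the function and argument names are illustrative.

```python
def path_cost(prev_pcost, ncost, dist, a):
    """Cost of extending a partial path by one node.

    prev_pcost: cost of the path so far (pcost at step i-1)
    ncost:      cost of the node being added
    dist:       estimated distance from that node to the target
    a:          weighting factor, assumed to lie in [0, 1]
    """
    return (1 - a) * prev_pcost + ncost + a * dist


# a = 0 ignores distance, so expansion is driven by accumulated cost only.
breadth_like = path_cost(10, 2, 5, 0)
# a = 1 ignores the accumulated cost and steers toward the target.
directed = path_cost(10, 2, 5, 1)
```

Under this reading, a larger `a` should reduce the number of node expansions at the risk of missing a cheaper route.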


Architectural Limitation

• Routing architecture necessitates domain selection.

• Bigger effect for multi-fanout nets


Pathfinder

• Use a non-decreasing history value to represent congestion.

• Similarities to multi-commodity flow

• Can be implemented efficiently but does require substantial run time

• Only update after an iteration.

$c_i = (1 + h_n \cdot h_{fac}) \cdot (1 + p_n \cdot p_{fac}) + b_{n,n-1}$
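A minimal sketch of the node cost and the per-iteration history update. Interpreting `h_n` as accumulated history congestion, `p_n` as present congestion, and `b` as a cost for the step from the previous node is an assumption; the slide gives the formula without defining its terms. The overuse-based update rule is likewise illustrative.

```python
def node_cost(h_n, p_n, b, h_fac, p_fac):
    """Evaluate (1 + h_n * h_fac) * (1 + p_n * p_fac) + b for one node."""
    return (1 + h_n * h_fac) * (1 + p_n * p_fac) + b


def update_history(history, occupancy, capacity):
    """Return new history values after one routing iteration.

    Each node's history grows by its overuse (occupancy above capacity),
    so the value never decreases from one iteration to the next.
    """
    return {
        node: history.get(node, 0) + max(0, occupancy.get(node, 0) - cap)
        for node, cap in capacity.items()
    }
```

Calling `update_history` once per iteration, rather than after every net, matches the "only update after an iteration" bullet.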


Bipartitioning

• Perhaps biggest problem in multi-FPGA design is partitioning

• Partitioner must deal with logic and pin constraints.

• Could simultaneously attempt partitioning across all devices. Even “simple” algorithms are $O(n^3)$

• Better to recursively bipartition circuit.


KLFM Partitioning

• Identify nodes to swap to reduce overall cut size

• Lock moved nodes

• Algorithm continues until no un-locked node can be moved without violating size constraints

[Figure: nodes divided between Bin 1 and Bin 2]


KLFM Partitioning

• Key issue is implementing node costs in lists that can be easily accessed and updated.

• Many extensions to consider to speed up overall optimization

• Reasonably easy to implement in software
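A simplified sketch of one KLFM-style pass. It makes several assumptions that go beyond the slides: nets are two-pin edges, nodes are moved one at a time rather than swapped in pairs, gains are recomputed from scratch on every step instead of being kept in the easily updated lists the slide calls for, and the best partition seen during the pass is returned.

```python
def cut_size(edges, side):
    """Count edges whose endpoints sit in different bins."""
    return sum(1 for u, v in edges if side[u] != side[v])


def klfm_pass(edges, side, max_bin_size):
    """Run one pass; side maps node -> 0 or 1. Returns (best_side, best_cut)."""
    side = dict(side)
    neighbors = {n: [] for n in side}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    sizes = [sum(1 for s in side.values() if s == b) for b in (0, 1)]
    locked = set()
    best_side, best_cut = dict(side), cut_size(edges, side)
    while True:
        candidates = []
        for n in side:
            # Skip locked nodes and moves that would overfill the other bin.
            if n in locked or sizes[1 - side[n]] + 1 > max_bin_size:
                continue
            # Gain = cut edges removed minus uncut edges that become cut.
            gain = sum(1 if side[m] != side[n] else -1 for m in neighbors[n])
            candidates.append((gain, n))
        if not candidates:
            break  # no unlocked node can move without violating size limits
        _, node = max(candidates, key=lambda c: c[0])
        sizes[side[node]] -= 1
        side[node] = 1 - side[node]
        sizes[side[node]] += 1
        locked.add(node)  # a moved node stays put for the rest of the pass
        cut = cut_size(edges, side)
        if cut < best_cut:
            best_side, best_cut = dict(side), cut
    return best_side, best_cut
```

A production implementation would replace the gain recomputation with bucket lists indexed by gain, which is the data-structure issue the first bullet points at.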


Partition Preprocessing: Clustering

• Identify bin size

• Choose a seed block (node)

• Identify node with highest connectivity to join cluster

• Terminate when cluster size met.

• In practical terms cluster size of 4 works best
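The steps above can be sketched as a greedy cluster-growing loop. The edge-list representation (repeated pairs stand for multiple connections) and the function name `grow_cluster` are assumptions for illustration.

```python
def grow_cluster(edges, seed, cluster_size):
    """Grow one cluster from seed by repeatedly adding the most-connected node."""
    cluster = [seed]
    while len(cluster) < cluster_size:
        # Count connections from each outside node into the current cluster.
        connectivity = {}
        for u, v in edges:
            if u in cluster and v not in cluster:
                connectivity[v] = connectivity.get(v, 0) + 1
            elif v in cluster and u not in cluster:
                connectivity[u] = connectivity.get(u, 0) + 1
        if not connectivity:
            break  # nothing left that touches the cluster
        cluster.append(max(connectivity, key=connectivity.get))
    return cluster
```

With `cluster_size=4`, this follows the cluster size the slide reports as working best in practice.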


Clustering

• Technology mapping before partitioning is typically ineffective since frequently area is secondary to interconnect

• Frequently bipartitioning continues after unclustering as well.

Cluster → KLFM → uncluster → KLFM

• This allows for additional fine-grain moves.


Logic Replication

• Attempt to reduce cutset by replicating logic.

• Every input of the original cell must also feed the replicated cell.

• Replication can either be integrated into the partitioning process or used as a post-process technique.


Logic Emulation

• Emulation takes a sizable amount of resources

• Compilation time can be large due to FPGA compiles

• Emulation is one application; it also has direct ties to other FPGA computing applications.


Are Meshes Realistic?

• The number of wires leaving a partition grows with Rent’s Rule

$P = K G^{B}$

• Perimeter grows as $G^{0.5}$, but unfortunately most circuits grow as $G^{B}$ where $B > 0.5$

• Effectively devices highly pin limited

• What does this mean for meshes?
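A small numeric sketch of why a Rent exponent above 0.5 makes devices pin limited. The values K = 2.5 and B = 0.6 are illustrative assumptions, not figures from the lecture.

```python
def rent_pins(gates, k, b):
    """Wires leaving a partition of `gates` gates under Rent's Rule, P = K * G**B."""
    return k * gates ** b


def perimeter_scale(gates):
    """Perimeter of a square region holding `gates` gates, up to a constant."""
    return gates ** 0.5


def pin_pressure(gates, k=2.5, b=0.6):
    """Ratio of required wires to available perimeter; k and b are assumed values."""
    return rent_pins(gates, k, b) / perimeter_scale(gates)
```

The ratio equals K * G**(B - 0.5), so for any B above 0.5 it keeps rising as the partition grows: the demand for pins outpaces the perimeter available to supply them.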


Multi-FPGA Software

• Missing high-level synthesis

• Global placement and routing similar to intra-device CAD


Virtual Wires

• Overcome pin limitations by multiplexing pins and signals

• Schedule when communication will take place.
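A deliberately simple sketch of pin multiplexing: each logical signal is given a time slot and a physical pin. The round-robin assignment is an assumption for illustration; a real scheduler would also have to respect the order in which signals become valid, which this sketch ignores.

```python
import math


def schedule_virtual_wires(signals, num_pins):
    """Assign each logical signal a (time_slot, pin) pair, round-robin."""
    schedule = {sig: divmod(i, num_pins) for i, sig in enumerate(signals)}
    slots = math.ceil(len(signals) / num_pins) if signals else 0
    return schedule, slots
```

For example, five signals sharing two pins need three time slots, so the physical pin count no longer bounds the number of logical connections, only the time needed to transfer them.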


Virtual Wires Software Flow

• Global router enhanced to include scheduling and embedding.

• Multiplexing logic synthesized from FPGA logic.


Why Compiling C is Hard

° General Language

° Not Designed For Describing Hardware

° Features that Make Analysis Hard

• Pointers

• Subroutines

• Linear code

° C has no direct concept of time

° C (and most procedural languages) are inherently sequential

• Most people think sequentially.

• Opportunities primarily lie in parallel data


Variables

° Handel-C has one basic type - integer

° May be signed or unsigned

° Can be any width, not limited to 8, 16, 32 etc.

Variables are mapped to hardware registers.

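void main(void)
{
    unsigned 6 a;   /* 6-bit unsigned variable, mapped to a register */
    a = 45;         /* 45 = binary 101101 = 0x2d */
}

a = 1 0 1 1 0 1 = 0x2d

[Figure: the six register bits of a, with the LSB and MSB ends labeled]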
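A minimal Python sketch of what an arbitrary-width unsigned register can hold. Masking to `width` bits models a register with exactly that many flip-flops; how Handel-C itself treats out-of-range values is not covered by the slide, so the wraparound shown for 64 is an assumption of this model only.

```python
def to_unsigned(value, width):
    """Keep only the low `width` bits, as a register of that width would."""
    return value & ((1 << width) - 1)


def bit_string(value, width):
    """Binary text of the stored value, most significant bit first."""
    return format(to_unsigned(value, width), "0{}b".format(width))
```

`bit_string(45, 6)` gives `101101`, the pattern shown on the slide, and a 6-bit register can represent 0 through 63.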


DeepC Compiler

• Consider loop-based computation to be memory limited

• Computation partitioned across small memories to form tiles

• Inter-tile communication is scheduled

• RTL synthesis performed on resulting computation and communication hardware


DeepC Compiler

• Parallelizes compilation across multiple tiles

• Orchestrates communication between tiles

• Some dynamic (data dependent) routing possible.


Control FSM

• Result for each tile is a datapath, state machine, and memory block


Striped Architecture

• Same basic approach, pipelined communication, incremental modification

• Functions as a linear pipeline

• Each stripe is homogeneous to simplify computation

• Condition codes allow for some control flexibility

[Figure: FPGA fabric, control unit, configuration cache, configuration control and next-address logic, address, condition codes, and microprocessor interface]


Piperench Internals

• Only multi-bit functional units used

• Very limited resources for interconnect to neighboring programming elements

• Place and route greatly simplified

Convolutional Encoder

• Accepts information bits as a continuous stream

• Operates on the current b-bit input, where b ranges from 1 to 6, and some number of immediately preceding b-bit inputs to produce V output bits, V > b

[Figure: encoder built from two flip-flops (FF) and two adders; b = 1, V = 2]

Definitions

• Constraint Length: number of successive b-bit groups of information bits for each encoding operation. Denoted by K

• Code Rate (or Rate): b/V

• Typical values: K = 7; Rate = 1/2, 1/3
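A minimal sketch of a rate 1/2 encoder with constraint length 3 (b = 1, V = 2, two flip-flops). The tap positions are an assumption: they are chosen because they reproduce the encoding example in the trellis slide (input 0 1 0 gives output 00 11 10), not because the slides state them.

```python
def conv_encode(bits):
    """Encode a bit stream; returns one (out1, out2) pair per input bit."""
    s1 = s2 = 0  # flip-flop contents: s1 is the previous input, s2 the one before
    encoded = []
    for bit in bits:
        out1 = bit ^ s1 ^ s2  # assumed taps: current, previous, two back
        out2 = bit ^ s2       # assumed taps: current, two back
        encoded.append((out1, out2))
        s1, s2 = bit, s1      # shift the register
    return encoded
```

Each input bit is combined with the two preceding bits, so K = 3 groups take part in every encoding operation and the encoder has four possible memory states.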

The Viterbi Algorithm

• Finds a bit-sequence in the set of all possible transmitted bit-sequences that most closely resembles the received data.

• Maximum likelihood algorithm

• Each bit received by decoder associated with a measure of correctness.

• Practical for short constraint length convolutional codes

State diagram

• State: encoder memory

• Branch: k/ij, where i and j represent the output bits associated with input bit k

[Figure: state diagram with states 00, 10, 11, 01 and branches 0/00, 1/11, 1/01, 1/10, 0/01, 0/11, 1/00, 0/10]

Trellis Diagram

[Figure: trellis over states 00, 01, 10, 11 from T=0 to T=3, with branch outputs and branch error counts]

ENC IN  : 0 1 0
ENC OUT : 00 11 10
RECEIVED: 00 11 11

Accumulated metric at T=3 (candidates : survivor), listed in state order 00, 01, 10, 11:

2+2, 3+0 : 3
0+1, 3+1 : 1
2+0, 3+1 : 2
0+1, 3+1 : 1

K = 3, Rate 1/2

Total number of states = $2^{K-1}$
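A minimal sketch of the accumulated-metric computation for a K = 3, rate 1/2 code, using Hamming distance as the branch metric. The encoder taps in `branch_output` are an assumption chosen to match the encoded example on this slide; states are written as (most recent bit, older bit).

```python
def branch_output(state, bit):
    """Output pair for taking input `bit` from `state` (assumed taps)."""
    s1, s2 = state
    return (bit ^ s1 ^ s2, bit ^ s2)


def viterbi_metrics(received, start=(0, 0)):
    """Return (metric, survivor) dicts keyed by state after the last symbol."""
    inf = float("inf")
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    metric = {s: (0 if s == start else inf) for s in states}
    survivor = {s: [] for s in states}
    for r in received:
        new_metric = {s: inf for s in states}
        new_survivor = {s: [] for s in states}
        for s in states:
            if metric[s] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):
                out = branch_output(s, bit)
                nxt = (bit, s[0])
                # Add the Hamming distance between branch output and received pair.
                cost = metric[s] + (out[0] ^ r[0]) + (out[1] ^ r[1])
                if cost < new_metric[nxt]:
                    new_metric[nxt] = cost
                    new_survivor[nxt] = survivor[s] + [bit]
        metric, survivor = new_metric, new_survivor
    return metric, survivor


metric, survivor = viterbi_metrics([(0, 0), (1, 1), (1, 1)])
```

For the received sequence 00 11 11 the surviving metrics are 3, 1, 2, 1 for states 00, 01, 10, 11, and the survivor ending in state 01 decodes to 0 1 0, the original encoder input.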

Adaptive Viterbi Algorithm

Motivation

• Extremely large memory and logic for Viterbi Algorithm

• Fewer number of paths retained

• Reduced memory and computation

Definitions

• Path: bit sequence

• Path metric or cost: accumulated error metric of a path

• Survivor: path which is retained for the subsequent time step

Adaptive Viterbi Algorithm

Criterion for path survival

1. A threshold T is introduced such that a path is retained if and only if current path metric is less than dm+T, where dm is the minimum cost among all survivors of the previous time step.

2. The total number of survivors per time step is limited to a critical number called Nmax selected by user.

Only best Nmax paths have to be retained at any time.

Trellis Diagram for AVA

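Architecture (contd.)

[Figure: flowchart. Two adders form sum1 and sum2 from branch metrics b1 and b2; each result is tested with di < dm + T; surviving paths are counted; if Count < Nmax, memory is updated; otherwise T = T-2 and the test is repeated.]

• Elimination of sorting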
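A minimal sketch of the survival test and the threshold-lowering loop from the flowchart (T = T-2 when too many paths pass). The function name and the list-of-costs interface are assumptions; the sketch also treats exactly Nmax survivors as acceptable, whereas the flowchart writes the check as Count < Nmax.

```python
def select_survivors(costs, d_m, threshold, n_max):
    """Return (indices of surviving paths, final threshold).

    costs:     path metrics d_i for the current time step
    d_m:       minimum cost among survivors of the previous time step
    threshold: starting value of T
    n_max:     maximum number of survivors to keep
    """
    while True:
        survivors = [i for i, d in enumerate(costs) if d < d_m + threshold]
        if len(survivors) <= n_max:
            return survivors, threshold
        threshold -= 2  # too many paths: tighten the threshold and retest
```

Because the threshold is tightened until few enough paths pass, the best paths are kept without sorting the full list of costs. Note that several paths with equal cost can all be dropped together when the threshold tightens, so fewer than Nmax paths may survive.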

Virtual Router

• Independent routing policies for each virtual router

• Key challenges: Isolation, Performance, Flexibility, Scalability

[Figure: physical router hosting virtual router A and virtual router B, each with routing control and a forwarding table, placed between a DEMUX and a MUX]


Virtualization using FPGAs

A novel network virtualization substrate which

• Uses FPGA to implement high performance virtual routers

• Introduces scalability through virtual routers in host software

• Exploits reconfiguration to customize hardware virtual routers

[Figure: FPGA containing Virtual Router 1, Virtual Router 2, Virtual Router 3, and Virtual Router 4]

Partial Reconfiguration

• Use partial reconfiguration to independently configure virtual routers

Full FPGA Reconfiguration

• Two virtual routers (A, B) initially in FPGA

• During reconfiguration router A migrated to software, the other eliminated

• After reconfiguration two virtual routers (A, B’) again in FPGA

• Reduced throughput

Partial FPGA Reconfiguration

• A remains in hardware and operates at full speed

• 20X speedup in reconfiguration down time due to partial reconfiguration

• Sustained throughput
