what is fpga?

FPGA

by Andriy Smolskyy

FPGA

Programmable Logic Evolution: TTL PLA CPLD FPGA

ASIC Development aspects Using FPGA for high speed data

processing OpenCL

Programmable Logic is Found Everywhere!

TTL Logic Design

Digital Design with TTL Logic

From TTL to Programmable Logic

General features of logic implementations Sum of products (AND-OR gates, combinatorial

logic) Stored results (registered outputs) Wired together

What if Logic functions were fixed (like TTL), but

combined into a single device? Wiring (routing) connections could be controlled

(programmed) somehow?

Programmable Array Logic (PAL) Simplest implementation of programmable logic Logic gates and registers are fixed Programmable sum of products array and output

control

Programmable Logic Advantages Fewer devices required Lower cost Power savings Simpler to test and debug Design security (prevent reverse

engineering) Design flexibility Automated tools simplify and consolidate

design flow In-system reprogrammability! (in

some cases)

From PAL to Programmable Logic Device (PLD) Arrange multiple PAL arrays in a single device

From PLD to Complex PLD (CPLD) Combine multiple PLDs in single device

with programmable interconnect and I/O

General CPLD Advantages

Ample amounts of logic and advanced configurable I/Os

Programmable routing Instant on Low cost Non-volatile configuration Reprogrammable

From CPLD to FPGAs

Higher density CPLDs don’t scale well because of requires additional global routing

Rearrange LABs themselves into an array

Field Programmable Gate Array (FPGA) LABs arranged in an array Row and column programmable interconnect Interconnect may span all or part of the array

CPLD LABs vs. FPGA LABs

FPGA LABs made up of logic elements (LEs) instead of product terms and macrocells

Easier to create complex functions through LE cascading

Lookup Tables (LUTs)

Replaces product term array Combinational functions created with

programmed “tables” (cascaded multiplexers) LUT inputs are mux select lines

Adaptive Logic Modules (ALM) Based on LE, but includes dedicated resources & adaptive LUT (ALUT) Improves performance and resource utilization

FPGA Routing

All device resources can feed into or be fed by any routing in device

Differing fixed lengths to adjust for timing Scales linearly as density increases Local interconnect

Connects between Les or ALMs within a LAB Can include direct connections between

adjacent LABs Row and column interconnect

Fixed length routing segments Span a number of LABs or entire device

Other Typical FPGA Features

Embedded multipliers Useful for DSP High-performance multiply/add/accumulate

operations Memory blocks High-speed transceivers Replace some LABs with dedicated

functional hardware blocks PLLs SDRAM controllers Hard Processor System

System on Chip (SoC) + FPGA

FPGA Programming

FPGA programming information must be stored somewhere to program device at power on

Use external EEPROM, CPLD or CPU to program

Two programming methods Active: FPGA controls programming sequence

automatically at power on Passive: Intelligent host (typically CPU) controls

programming Also programmable through JTAG

connection

FPGA Advantages

High density to create many complex logic functions

High performance Low cost Integration of many functions Many available I/O standards and features Fast programming

From FPGA to ASIC

A true ASIC: no configuration at power-on required

Create and test design with FPGA device Migrate design to pin–compatible,

functionally equivalent ASIC device

FPGA design development

Verilog Hardware Description Language VDHL - Very high speed integrated circuits

Hardware Description Language

FPGA design development

Schematic design development

Dedicated FPGA design block Core IP

SDRAM Controller Ethernet PHY, Custom Transceiver PHY PCIe PHY SDi, Display Port

Megafunctions PLL I/O Custom logic blocks

FPGA for high speed data processing CPU data processing optimization Pipelining, parallelism OpenCL

A simple CPU

B

AA ALU

Op

Val

Instruction

Fetch

Registers

Aaddr

Baddr

Caddr

PC Load StoreLdAddr StAddr

CWriteEnable

C

Op

LdData

StData

Op

CData

Load immediate value into register

B

AA ALU

Op

Val

Instruction

Fetch

Registers

Aaddr

Baddr

Caddr


CWriteEnable

C

Op

LdData

StData

Op

CData

Load memory value into register

B

AA ALU

Op

Val

Instruction

Fetch

Registers

Aaddr

Baddr

Caddr


CWriteEnable

C

Op

LdData

StData

Op

CData

Store register value into memory

B

AA ALU

Op

Val

Instruction

Fetch

Registers

Aaddr

Baddr

Caddr


CWriteEnable

C

Op

LdData

StData

Op

CData

Add two registers, store result in register

B

AA ALU

Op

Val

Instruction

Fetch

Registers

Aaddr

Baddr

Caddr


CWriteEnable

C

Op

LdData

StData

Op

CData

A simple program

Mem[100] += 42 * Mem[101] CPU instructions:

R0 Load Mem[100]R1 Load Mem[101]R2 Load #42R2 Mul R1, R2R0 Add R2, R0Store R0 Mem[100]

CPU activity, step by step

A

A

A

A

A

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100] A

Time

Unroll the CPU hardware…

A

A

A

A

A

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100] A

Space

… and specialize by position

A

A

A

A

A

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100] A

1. Instructions are fixed. Remove “Fetch”

… and specialize

A

A

A

A

A

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100] A


2. Remove unused ALU ops

and specialize

A

A

A

A

A

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100] A


2. Remove unused ALU ops3. Remove unused Load /

Store

… and specialize

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100]



Store4. Wire up registers

properly! And propagate state.

… and specialize

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100]





5. Remove dead data.

… and specialize

R0 Load Mem[100]

R1 Load Mem[101]

R2 Load #42

R2 Mul R1, R2

R0 Add R2, R0

Store R0 Mem[100]





5. Remove dead data.6. Reschedule!

So what ?

Load Load

Store

42

FPGA datapath = Your algorithm, in silicon

Build exactly what you need:

OperationsData widths

Memory size, configuration

Efficiency:Throughput / Latency /

Power

OpenCL Programming Model

Accelerator

Local Mem

Global M

em

Local Mem

Local Mem

Local MemAcceleratorAcceleratorAcceleratorProcessor

Accelerator

Local Mem

Global M

em

Local Mem

Local Mem


Host Accelerator

Local Mem

Global M

em

Local Mem

Local Mem


__kernel voidsum(__global float *a, __global float *b, __global float *y){ int gid = get_global_id(0); y[gid] = a[gid] + b[gid];}

main() { read_data( … ); maninpulate( … ); clEnqueueWriteBuffer( … ); clEnqueueNDRange(…,sum,…); clEnqueueReadBuffer( … ); display_result( … );}

Host + Accelerator Programming Model

Sequential Host program on microprocessor

Function offload onto a highly parallel accelerator device

OpenCL FPGA is NOT just ‘C’-to-HW

IPIP

EMIF

IP

Processor

Rest of the System ?HLSvoid F(...) {

#pragma ... for(int i ...) { #pragma ... for(int j ...) { #pragma ... } }}

RTL

OpenCL

kernel void F(...) { for(int i ...) { for(int j ...) { } }}

?

Complete Platform

C-to-HW tools

Standard OpenCL

UsersHardware Designers

TargetFPGA Only

FPGA Expertise

Yes

Timing Closure

Manual

UsersSoftware Programmers

TargetComplete Platforms

FPGA Expertise

No

Timing Closure

Automatic

Traditional OpenCL Data Parallelism OpenCL kernels expresses parallelism

explicitly

__kernel voidsum(__global const float *a,__global const float *b,__global float *answer){int xid = get_global_id(0);answer[xid] = a[xid] + b[xid];}

for (int i=0; i < n; i++){ answer[i] = a[i] + b[i];}Host

CodeKernel Code

setup_memory_buffers();transfer_data_to_fpga();size_t global_size = {N, 1, 1};

clEnqueueNDRangeKernel( sum_kernel, .., &global_size, ..);

read_data_from_fpga();

Loop Pipelining

To achieve acceleration, we can pipeline each iteration of the loop Analyze any dependencies between iterations Schedule these operations Launch the next iteration as soon as possible

float array[M];

for (int i=0; i < n*numSets; i++){ for (int j=0; j < M-1; j++) array[j] = array[j+1]; array[M-1] = a[i];

for (int j=0; j < M; j++) answer[i] += array[j] * coefs[j];}

At this point, we can launch the next iteration

Loop Pipelining Example

No Loop Pipelining

i0

i1

i2

i0

i1i2

i3i4

Looks almost like parallel thread execution!

With Loop Pipelining

Digital Filter

z-1 z-1 z-1 z-1 z-1 z-1 z-1

X X X X X X X X

C0 C1 C2 C3 C4 C5 C6 C7

x(n)

+

y(n)

FPGA

Q&A

FPGA

Thank You!

what is fpga?

Engineering

fpga device

fpga labs

single device

programmable logic evolution

program device

device resources

logic elements

entire device