What is an FPGA?
TRANSCRIPT
FPGA
  Programmable Logic Evolution: TTL PLA CPLD FPGA
  ASIC Development aspects
  Using FPGA for high speed data processing
  OpenCL
From TTL to Programmable Logic
  General features of logic implementations
    Sum of products (AND-OR gates, combinatorial logic)
    Stored results (registered outputs)
    Wired together
  What if
    Logic functions were fixed (like TTL), but combined into a single device?
    Wiring (routing) connections could be controlled (programmed) somehow?
Programmable Array Logic (PAL)
  Simplest implementation of programmable logic
  Logic gates and registers are fixed
  Programmable sum of products array and output control
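A minimal C sketch of the programmable sum-of-products idea: each output is a fixed OR gate fed by AND terms whose input connections are programmed. The names `product_term`, `pal_output`, `pal_eval` and `xor_fuses` are illustrative assumptions, not part of any vendor tool, and registered outputs are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAL_TERMS 4

/* One AND term: which inputs are connected, and with which polarity. */
typedef struct {
    uint8_t use_true;  /* bit i set: input i is connected as-is    */
    uint8_t use_comp;  /* bit i set: input i is connected inverted */
} product_term;

/* Hypothetical "fuse map": programmable AND terms feeding one fixed OR gate. */
typedef struct {
    product_term terms[PAL_TERMS];
    int          used_terms;
} pal_output;

/* Evaluate one output: OR together every programmed AND term. */
bool pal_eval(const pal_output *p, uint8_t inputs)
{
    assert(p->used_terms >= 0 && p->used_terms <= PAL_TERMS);
    for (int t = 0; t < p->used_terms; t++) {
        uint8_t need_one  = p->terms[t].use_true;
        uint8_t need_zero = p->terms[t].use_comp;
        bool and_result = ((inputs & need_one) == need_one) &&
                          ((inputs & need_zero) == 0);
        if (and_result)
            return true;
    }
    return false;
}

/* Example programming: a XOR b = (a AND NOT b) OR (NOT a AND b),
   with a on input bit 0 and b on input bit 1. */
const pal_output xor_fuses = {
    .terms = { { 0x1, 0x2 }, { 0x2, 0x1 } },
    .used_terms = 2,
};
```

Changing only the contents of `xor_fuses` changes the logic function, while `pal_eval` (the fixed gates) stays the same.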
Programmable Logic Advantages
  Fewer devices required
  Lower cost
  Power savings
  Simpler to test and debug
  Design security (prevent reverse engineering)
  Design flexibility
  Automated tools simplify and consolidate design flow
  In-system reprogrammability! (in some cases)
From PLD to Complex PLD (CPLD)
  Combine multiple PLDs in a single device with programmable interconnect and I/O
General CPLD Advantages
  Ample amounts of logic and advanced configurable I/Os
  Programmable routing
  Instant on
  Low cost
  Non-volatile configuration
  Reprogrammable
From CPLD to FPGAs
  Higher-density CPLDs don’t scale well because they require additional global routing
  Rearrange the logic array blocks (LABs) themselves into an array
Field Programmable Gate Array (FPGA)
  LABs arranged in an array
  Row and column programmable interconnect
  Interconnect may span all or part of the array
CPLD LABs vs. FPGA LABs
  FPGA LABs are made up of logic elements (LEs) instead of product terms and macrocells
  Easier to create complex functions through LE cascading
Lookup Tables (LUTs)
  Replaces product term array
  Combinational functions created with programmed “tables” (cascaded multiplexers)
  LUT inputs are mux select lines
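As a rough illustration of the lookup-table idea, the C sketch below stores one result bit per input combination and uses the inputs as the index that selects a bit. The 4-input width and the names `lut4`, `lut4_program`, `lut4_eval` and `majority3` are assumptions chosen for the example, not a description of a specific device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 4-input LUT holds 2^4 = 16 result bits, one per input combination. */
typedef uint16_t lut4;

/* "Program" the LUT by tabulating any 4-input Boolean function. */
lut4 lut4_program(int (*fn)(int a, int b, int c, int d))
{
    assert(fn != NULL);
    lut4 table = 0;
    for (unsigned idx = 0; idx < 16; idx++) {
        int a = idx & 1, b = (idx >> 1) & 1;
        int c = (idx >> 2) & 1, d = (idx >> 3) & 1;
        if (fn(a, b, c, d))
            table |= (lut4)(1u << idx);
    }
    return table;
}

/* Evaluate: the inputs act as mux select lines that pick one stored bit. */
int lut4_eval(lut4 table, int a, int b, int c, int d)
{
    unsigned idx = (unsigned)(a & 1) | (unsigned)(b & 1) << 1 |
                   (unsigned)(c & 1) << 2 | (unsigned)(d & 1) << 3;
    return (table >> idx) & 1;
}

/* Example function to tabulate (hypothetical): majority of a, b, c; d unused. */
int majority3(int a, int b, int c, int d)
{
    (void)d;
    return (a + b + c) >= 2;
}
```

Any other 4-input function uses the same `lut4_eval` hardware model; only the 16 stored bits differ.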
Adaptive Logic Modules (ALM)
  Based on LE, but includes dedicated resources & adaptive LUT (ALUT)
  Improves performance and resource utilization
FPGA Routing
  All device resources can feed into or be fed by any routing in device
  Differing fixed lengths to adjust for timing
  Scales linearly as density increases
  Local interconnect
    Connects between LEs or ALMs within a LAB
    Can include direct connections between adjacent LABs
  Row and column interconnect
    Fixed length routing segments
    Span a number of LABs or entire device
Other Typical FPGA Features
  Embedded multipliers
    Useful for DSP
    High-performance multiply/add/accumulate operations
  Memory blocks
  High-speed transceivers
  Replace some LABs with dedicated functional hardware blocks
    PLLs
    SDRAM controllers
    Hard Processor System
FPGA Programming
  FPGA programming information must be stored somewhere to program device at power on
  Use external EEPROM, CPLD or CPU to program
  Two programming methods
    Active: FPGA controls programming sequence automatically at power on
    Passive: Intelligent host (typically CPU) controls programming
  Also programmable through JTAG connection
FPGA Advantages
  High density to create many complex logic functions
  High performance
  Low cost
  Integration of many functions
  Many available I/O standards and features
  Fast programming
From FPGA to ASIC
  A true ASIC: no configuration at power-on required
  Create and test design with FPGA device
  Migrate design to pin-compatible, functionally equivalent ASIC device
FPGA design development
  Verilog Hardware Description Language
  VHDL - Very High Speed Integrated Circuits Hardware Description Language
Dedicated FPGA design block
  Core IP
    SDRAM Controller
    Ethernet PHY, Custom Transceiver PHY
    PCIe PHY
    SDI, DisplayPort
  Megafunctions
    PLL
    I/O
  Custom logic blocks
A simple CPU
[Figure: block diagram of a simple CPU. Blocks: Instruction Fetch (with PC), Registers, ALU, Load, Store. Signals: Op, Val, Aaddr, Baddr, Caddr, CWriteEnable, A, B, C, CData, LdAddr, LdData, StAddr, StData.]
Load immediate value into register
Load memory value into register
Store register value into memory
Add two registers, store result in register
A simple program
  Mem[100] += 42 * Mem[101]
  CPU instructions:
    R0 Load Mem[100]
    R1 Load Mem[101]
    R2 Load #42
    R2 Mul R1, R2
    R0 Add R2, R0
    Store R0 Mem[100]
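To make the instruction sequence concrete, here is a small C sketch of a CPU that fetches and executes one instruction at a time. The instruction encoding, the register and memory sizes, and the names `cpu_run`, `simple_program` and `run_simple_program` are assumptions made for illustration. A Mul operation is included because the program uses it, although the CPU slides only show loads, stores and add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of the CPU's operations. */
typedef enum { OP_LOAD_IMM, OP_LOAD_MEM, OP_STORE, OP_ADD, OP_MUL } opcode;

typedef struct {
    opcode op;
    int c;     /* destination register (Caddr); source register for OP_STORE */
    int a, b;  /* source registers (Aaddr, Baddr) for ALU operations         */
    int val;   /* immediate value, or memory address for load/store          */
} instruction;

#define NUM_REGS  4
#define MEM_WORDS 128

/* Fetch and execute one instruction per loop iteration. */
void cpu_run(const instruction *program, size_t length, int mem[MEM_WORDS])
{
    int regs[NUM_REGS] = {0};
    for (size_t pc = 0; pc < length; pc++) {
        const instruction *in = &program[pc];   /* instruction fetch */
        assert(in->c >= 0 && in->c < NUM_REGS);
        assert(in->a >= 0 && in->a < NUM_REGS && in->b >= 0 && in->b < NUM_REGS);
        switch (in->op) {
        case OP_LOAD_IMM:
            regs[in->c] = in->val;
            break;
        case OP_LOAD_MEM:
            assert(in->val >= 0 && in->val < MEM_WORDS);
            regs[in->c] = mem[in->val];
            break;
        case OP_STORE:
            assert(in->val >= 0 && in->val < MEM_WORDS);
            mem[in->val] = regs[in->c];
            break;
        case OP_ADD:
            regs[in->c] = regs[in->a] + regs[in->b];
            break;
        case OP_MUL:
            regs[in->c] = regs[in->a] * regs[in->b];
            break;
        }
    }
}

/* Mem[100] += 42 * Mem[101], as the six instructions listed above. */
const instruction simple_program[] = {
    { OP_LOAD_MEM, 0, 0, 0, 100 },  /* R0 Load Mem[100]  */
    { OP_LOAD_MEM, 1, 0, 0, 101 },  /* R1 Load Mem[101]  */
    { OP_LOAD_IMM, 2, 0, 0, 42  },  /* R2 Load #42       */
    { OP_MUL,      2, 1, 2, 0   },  /* R2 Mul R1, R2     */
    { OP_ADD,      0, 2, 0, 0   },  /* R0 Add R2, R0     */
    { OP_STORE,    0, 0, 0, 100 },  /* Store R0 Mem[100] */
};

/* Run the program with x in Mem[100] and y in Mem[101]; return new Mem[100]. */
int run_simple_program(int x, int y)
{
    int mem[MEM_WORDS] = {0};
    mem[100] = x;
    mem[101] = y;
    cpu_run(simple_program, sizeof simple_program / sizeof simple_program[0], mem);
    return mem[100];
}
```

Every instruction passes through the same fetch, register and ALU logic, one after another in time.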
CPU activity, step by step
[Figure: one copy of the CPU diagram per instruction, laid out along a Time axis]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
Unroll the CPU hardware…
[Figure: one copy of the CPU diagram per instruction, laid out along a Space axis]
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
… and specialize by position
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
1. Instructions are fixed. Remove “Fetch”
… and specialize
R0 Load Mem[100]
R1 Load Mem[101]
R2 Load #42
R2 Mul R1, R2
R0 Add R2, R0
Store R0 Mem[100]
1. Instructions are fixed. Remove “Fetch”
2. Remove unused ALU ops
3. Remove unused Load / Store
4. Wire up registers properly! And propagate state.
5. Remove dead data.
6. Reschedule!
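One way to picture what remains after the six steps is the C sketch below: no instruction fetch, no register file and no generic ALU, only the operations the program needs, wired directly to each other. The function names and the 128-word memory are illustrative assumptions. C statements still run in order, so the comments mark which operations have no dependency on each other and could run side by side in hardware.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_WORDS 128

/* Datapath specialized for Mem[100] += 42 * Mem[101]. */
void specialized_datapath(int mem[MEM_WORDS])
{
    assert(mem != NULL);          /* caller must supply a memory image      */
    int load0 = mem[100];         /* Load unit dedicated to Mem[100]        */
    int load1 = mem[101];         /* second Load unit, independent of load0 */
    int product = load1 * 42;     /* multiplier with the constant wired in  */
    mem[100] = load0 + product;   /* adder feeding a dedicated Store unit   */
}

/* Helper: run the datapath with x in Mem[100] and y in Mem[101]. */
int run_specialized(int x, int y)
{
    int mem[MEM_WORDS] = {0};
    mem[100] = x;
    mem[101] = y;
    specialized_datapath(mem);
    return mem[100];
}
```

The register names R0 to R2 have disappeared: values travel on wires between the units, which is the effect of steps 4 and 5.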
So what?
[Figure: the resulting datapath, showing two Load units, the constant 42 and one Store unit]
FPGA datapath = Your algorithm, in silicon
  Build exactly what you need:
    Operations
    Data widths
    Memory size, configuration
  Efficiency: Throughput / Latency / Power
OpenCL Programming Model
[Figure: a Host Processor attached to four Accelerators; each Accelerator has its own Local Mem and they share one Global Mem]
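Kernel (runs on the accelerator):
  __kernel void
  sum(__global float *a, __global float *b, __global float *y)
  {
    int gid = get_global_id(0);   /* index of this work-item */
    y[gid] = a[gid] + b[gid];
  }
Host (runs on the processor):
  int main() {
    read_data( … );
    manipulate( … );
    clEnqueueWriteBuffer( … );                /* copy inputs to the accelerator */
    clEnqueueNDRangeKernel( …, sum, … );      /* launch the kernel */
    clEnqueueReadBuffer( … );                 /* copy results back */
    display_result( … );
  }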
Host + Accelerator Programming Model
  Sequential Host program on microprocessor
  Function offload onto a highly parallel accelerator device
OpenCL FPGA is NOT just ‘C’-to-HW
HLS: C with pragmas is compiled to RTL for one IP block; the rest of the system (Processor, EMIF, other IP) is left open (?)
  void F(...) {
    #pragma ...
    for(int i ...) {
      #pragma ...
      for(int j ...) {
        #pragma ...
      }
    }
  }
OpenCL: the kernel is compiled into a Complete Platform
  kernel void F(...) {
    for(int i ...) {
      for(int j ...) {
      }
    }
  }
                  C-to-HW tools         Standard OpenCL
Users             Hardware Designers    Software Programmers
Target            FPGA Only             Complete Platforms
FPGA Expertise    Yes                   No
Timing Closure    Manual                Automatic
Traditional OpenCL Data Parallelism
  OpenCL kernels express parallelism explicitly
Sequential loop:
  for (int i = 0; i < n; i++) {
    answer[i] = a[i] + b[i];
  }
Kernel Code:
  __kernel void
  sum(__global const float *a,
      __global const float *b,
      __global float *answer)
  {
    int xid = get_global_id(0);
    answer[xid] = a[xid] + b[xid];
  }
Host Code:
  setup_memory_buffers();
  transfer_data_to_fpga();
  size_t global_size[3] = {N, 1, 1};
  clEnqueueNDRangeKernel( .., sum_kernel, .., global_size, .. );
  read_data_from_fpga();
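The relation between the sequential loop and the kernel can be sketched in plain C by treating the kernel body as a function of the work-item index and calling it once per index. This is only an emulation under stated assumptions: `sum_work_item`, `launch_sum` and `launch_matches_loop` are hypothetical names, and no OpenCL runtime is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body for one work-item; gid stands in for get_global_id(0). */
void sum_work_item(size_t gid, const float *a, const float *b, float *answer)
{
    answer[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for an NDRange launch: one call per work-item.
   Work-items do not depend on each other, so a real device is free to
   run them in any order or in parallel. */
void launch_sum(size_t global_size, const float *a, const float *b, float *answer)
{
    assert(a != NULL && b != NULL && answer != NULL);
    for (size_t gid = 0; gid < global_size; gid++)
        sum_work_item(gid, a, b, answer);
}

/* Returns 1 when the emulated launch matches the sequential loop. */
int launch_matches_loop(void)
{
    enum { N = 8 };
    float a[N], b[N], from_kernel[N], from_loop[N];
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 0.5f * (float)i;
    }

    launch_sum(N, a, b, from_kernel);
    for (int i = 0; i < N; i++)
        from_loop[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        if (from_kernel[i] != from_loop[i])
            return 0;
    return 1;
}
```

The loop counter of the sequential version becomes the work-item index, and the loop itself moves into the launch.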
Loop Pipelining
  To achieve acceleration, we can pipeline each iteration of the loop
    Analyze any dependencies between iterations
    Schedule these operations
    Launch the next iteration as soon as possible
  float array[M] = {0};   /* shift register; zeroed so the first outputs are defined */
  for (int i = 0; i < n*numSets; i++) {
    /* shift the window by one sample and append a[i] */
    for (int j = 0; j < M-1; j++)
      array[j] = array[j+1];
    array[M-1] = a[i];
    /* At this point, we can launch the next iteration */
    for (int j = 0; j < M; j++)
      answer[i] += array[j] * coefs[j];   /* answer[] assumed zero-initialized */
  }
Loop Pipelining Example
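[Figure: timelines of loop iterations without and with loop pipelining]
  No Loop Pipelining: iterations i0, i1, i2 run one after another
  With Loop Pipelining: iterations i0, i1, i2, i3, i4 overlap in time
  Looks almost like parallel thread execution!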
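A back-of-the-envelope model of the two timelines, written as a C sketch. It assumes every iteration takes the same number of cycles (`latency`) and that a pipelined loop can start a new iteration every `ii` cycles; both parameters and the function names are assumptions of this simplified model, not figures from the slides.

```c
#include <assert.h>

/* Cycles to run n iterations back to back, each taking `latency` cycles. */
unsigned long cycles_unpipelined(unsigned long n, unsigned long latency)
{
    return n * latency;
}

/* Cycles when a new iteration starts every `ii` cycles while earlier
   iterations are still in flight: the first iteration pays the full
   latency, each later one adds only `ii` cycles. */
unsigned long cycles_pipelined(unsigned long n, unsigned long latency,
                               unsigned long ii)
{
    assert(ii >= 1 && ii <= latency);
    if (n == 0)
        return 0;
    return latency + (n - 1) * ii;
}
```

For example, five iterations of a 3-cycle loop body take 15 cycles without pipelining and 7 cycles when a new iteration can start every cycle. With `ii` equal to `latency` the two models give the same count, which corresponds to no overlap at all.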