TRANSCRIPT
Specialized Computing
ECE 5775 (Fall ’17): High-Level Digital Design Automation
Announcements
▸ All students enrolled in CMS & Piazza
▸ Vivado HLS tutorial on Tuesday 8/29
– Install an SSH client (MobaXterm or PuTTY)
– % ssh -X <netid>@ecelinux-01.ece.cornell.edu
– Bring your laptop!
Agenda
▸ Motivation for hardware specialization
– Case study on convolution
▸ FPGA introduction
– Basic building blocks
– Basic homogeneous FPGA architecture
– Modern heterogeneous FPGA architecture
Tension between Efficiency and Flexibility
[Figure: “The Dilemma: Flexibility vs. Efficiency”: energy efficiency (MOPS/mW) plotted against flexibility for different implementation styles. © 2012 Altera Corporation]
Source: “High-performance Energy-Efficient Reconfigurable Accelerator Circuits for the Sub-45nm Era,” July 2011, by Ram K. Krishnamurthy, Circuits Research Labs, Intel Corp.
Programmable Processing
A Simple Single-Cycle Microprocessor
[Figure: single-cycle datapath with PC, register file (RF), ALU, and RAM; control signals include DR, SA, SB, IMM, MB, FS, MD, LD, and MW, and the ALU produces the V, C, Z, N condition flags]
Evaluating a Simple Expression on a CPU
Step-by-step CPU activities:
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: four snapshots of the PC/RF/ALU/RAM datapath, one per instruction]
Source: Adapted from Desh Singh’s talk at the HCP’14 workshop
“Unrolling” the CPU Hardware
1. Replicate the CPU hardware
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: four replicated PC/RF/ALU/RAM datapaths (CPU1-CPU4) laid out in space, each dedicated to one of the four instructions]
Eliminating Unused Logic
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: the replicated datapaths stripped down in space; each stage keeps only the RF/ALU/RAM resources its instruction needs]
A Special-Purpose Architecture
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up registers and propagate values
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: the resulting dataflow circuit: address adders off R0, two loads (LW) into R1 and R2, one adder producing R3, and a store (SW) of the result]
Resulting circuit can be realized with either ASIC or FPGA
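The behavior of this dataflow circuit can be sketched in C (the array M, the base address r0, and the function name are stand-ins for illustration):

```c
/* Behavioral model of the specialized circuit:
   two loads feed one dedicated adder, and the sum is stored back.
   No fetch, decode, or unused ALU operations remain. */
void specialized_datapath(int M[], int r0) {
    int r1 = M[r0];        /* LW: first load            */
    int r2 = M[r0 + 1];    /* LW: second load           */
    int r3 = r1 + r2;      /* +:  single dedicated add  */
    M[r0 + 2] = r3;        /* SW: store the result      */
}
```

In hardware, the three steps become wires and one adder rather than a sequence of instructions.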
Understanding Energy Inefficiency of General-Purpose Processors (GPPs)
[Figure: typical superscalar OoO pipeline: L1-I$, branch predictor, fetch/decode/rename (RAT, free list), reservation stations and scheduler, integer and FP register files, register read/write, ALU, FPU, D-cache with LSQ + TLB, ROB, and commit]
Parameter | Value
Fetch/issue/retire width | 4
# Integer ALUs | 3
# FP ALUs | 2
# ROB entries | 96
# Reservation station entries | 64
L1 I-cache | 32 KB, 8-way set associative
L1 D-cache | 32 KB, 8-way set associative
L2 cache | 6 MB, 8-way set associative
[source: Jason Cong, ISLPED’14 keynote]
Energy Breakdown of Pipeline Components
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 14%
Mul/div: 4%
FPU: 8%
Memory: 10%
Misc: 23%
[Figure: pie chart of the breakdown next to the OoO pipeline diagram, with the control structures highlighted]
Removing “Non-Computing” Portions
[Figure: the same energy-breakdown pie chart, with the memory and compute (Int ALU, Mul/div, FPU) slices highlighted]
▸ “Computing” portion: 10% (memory) + 26% (compute) = 36%
Energy Comparison of Processor ALUs and Dedicated Units
▸ Why are processor units so expensive?
– ALU can perform multiple operations
• Add/sub/bitwise XOR/OR/AND
– 64-bit ALU
– Dynamic/domino logic used to run at high frequency
• Higher power dissipation
Operation | Processor ALU | 45 nm TSMC standard-cell library
32-bit add | 0.122 nJ @ 2 GHz | 0.002 nJ @ 1 GHz
32-bit multiply | 0.120 nJ @ 2 GHz | 0.007 nJ @ 1 GHz
Single-precision FP operation | 0.150 nJ @ 2 GHz | 0.008 nJ @ 500 MHz
Energy Breakdown with Standard-Cell ASICs
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 0.2%
Mul/div: 0.2%
FPU: 0.4%
ALU/FPU savings: 25.0%
Memory: 10%
Misc: 23%
▸ “Computing” portion: 10% (memory) + ~1% (compute) = 11%
Only 10X gain is attainable?
Additional Energy Savings from Specialization
▸ Specialized memory architecture
– Exploit regular memory access patterns to minimize energy per memory read/write
▸ Specialized communication architecture
– Exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
▸ Customized data types
– Exploit data range information to reduce bitwidth/precision and simplify arithmetic operations
These techniques combined can lead to another 10-100X energy efficiency improvement over GPPs
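As a worked example of data-type customization (assumptions for illustration: 8-bit unsigned pixels and a 3x3 Sobel kernel; the helper names are hypothetical): the sum of the kernel's absolute coefficients bounds the output range, which in turn fixes the adder width a specialized datapath needs.

```c
#include <stdlib.h>

/* Worst-case output magnitude of a 3x3 convolution:
   max_pixel * sum of |coefficient| over the kernel. */
int conv_max_magnitude(const int k[3][3], int max_pixel) {
    int sum_abs = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            sum_abs += abs(k[i][j]);
    return sum_abs * max_pixel;
}

/* Smallest signed bitwidth that can represent values in [-mag, mag]. */
int signed_bits_needed(int mag) {
    int bits = 1;                      /* start with the sign bit */
    while ((1 << (bits - 1)) <= mag)   /* grow until 2^(bits-1) > mag */
        bits++;
    return bits;
}
```

For the Sobel Gy kernel and 255-valued pixels the bound is 8 * 255 = 2040, so 12-bit signed arithmetic suffices where a processor would spend a full-width ALU operation.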
Case Study: Memory Specialization for Convolution
▸ The main computation of image/video processing is performed over overlapping stencils, termed convolution
[Figure: a 3x3 convolution kernel (-1 -2 -1 / 0 0 0 / 1 2 1) sliding over an input image frame to produce an output image frame]
Example Application: Edge Detection
▸ Identifies discontinuities in an image where brightness (or image intensity) changes sharply
– Very useful for feature extraction in computer vision
Sobel operator G = (Gx, Gy)
Figures: Pilho Kim, GaTech
CPU Implementation of Convolution
[Figure: CPU connected to main memory through a cache]
for (n = 1; n < height-1; n++)
  for (m = 1; m < width-1; m++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[n][m] += img[n+i-1][m+j-1] * f[i][j];
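A self-contained version of the loop nest can be exercised as follows (the 8x8 dimensions are hypothetical; the window is centered at (n, m) so all accesses stay in bounds, and border pixels are left at zero):

```c
#include <string.h>

#define H 8
#define W 8

/* 3x3 convolution over the interior pixels of an H x W image.
   out[n][m] accumulates the window centered at (n, m). */
void conv3x3(const int img[H][W], const int f[3][3], int out[H][W]) {
    memset(out, 0, sizeof(int) * H * W);   /* zero-initialize the output */
    for (int n = 1; n < H - 1; n++)
        for (int m = 1; m < W - 1; m++)
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    out[n][m] += img[n + i - 1][m + j - 1] * f[i][j];
}
```

Note the 2D access pattern: each output revisits nine inputs, eight of which were already read for neighboring outputs, which is exactly what the cache (and later the line buffer) exploits.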
General-Purpose Cache for Convolution
▸ Minimizes main memory accesses to improve performance
▸ A general-purpose cache is expensive in cost and incurs nontrivial energy overhead
[Figure: input picture, W pixels wide, with the cached region shaded]
Specializing the Cache for Convolution (1)
▸ Remove rows that are not in the neighborhood of the convolution window
[Figure: only the rows overlapping the convolution window are retained (image W pixels wide)]
Specializing the Cache for Convolution (2)
▸ Rearrange the rows as a 1D array of pixels
▸ Each time we move the window to the right, push the new pixel into the “cache”
– Remove the edge pixels that are not needed for computation
[Figure: the retained rows flattened into a 1D array of width W; the old pixel exits one end as the new pixel enters the other]
A Specialized “Cache”: Line Buffer
▸ Line buffer: a fixed-width “cache” with (K-1)*W+K pixels in flight
– Fixed addressing: low area/power and high performance
▸ In customized FPGA implementations, line buffers can be efficiently implemented with on-chip BRAMs
[Figure: a 2W+3 pixel line buffer (K=3); the new pixel shifts in as the old pixel shifts out]
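A minimal software sketch of the line-buffer discipline for K = 3 (the width W and the helper names are assumptions for illustration): each new pixel shifts in while the oldest falls out, so the nine window pixels always sit at fixed offsets r*W + c, with no address computation or tag lookup.

```c
#include <string.h>

#define W 16                  /* image width, assumed for illustration */
#define LB_LEN (2 * W + 3)    /* (K-1)*W + K pixels in flight for K = 3 */

/* Shift a new pixel into the line buffer; the oldest pixel falls out.
   Index 0 holds the newest pixel, index LB_LEN-1 the oldest. */
void linebuf_push(int lb[LB_LEN], int new_pixel) {
    memmove(&lb[1], &lb[0], (LB_LEN - 1) * sizeof(int));
    lb[0] = new_pixel;
}

/* Fixed addressing: row r, column c of the 3x3 window is always at
   offset r*W + c (pixels one image row apart are W entries apart). */
int window_at(const int lb[LB_LEN], int r, int c) {
    return lb[r * W + c];
}
```

In hardware the memmove becomes a shift register (or BRAM pointer update), which is why the structure is so much cheaper than a general-purpose cache.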
What is an FPGA?
▸ FPGA: Field-Programmable Gate Array
– An integrated circuit designed to be configured by a customer or a designer after manufacturing (Wikipedia)
▸ Components in an FPGA chip
– Programmable logic blocks
– Programmable interconnects
– Programmable I/Os
Three Important Pieces
▸ SRAM-based implementation is popular
– Non-standard technology means older technology generation
▸ Pass transistor (controlled by an SRAM bit)
▸ Multiplexer (controlled by SRAM bits)
▸ Lookup table (LUT, formed by SRAM bits)
Multiplexer as a Universal Gate
▸ Any function of k variables can be implemented with a 2^k:1 multiplexer
[Figure: the full-adder truth table (A, B, Cin -> Cout, Sum) mapped onto an 8:1 MUX: the select inputs S2 S1 S0 are driven by A, B, Cin, and the eight data inputs are programmed with the corresponding output column]
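The claim can be checked in software (a sketch, not a circuit description): program the eight data inputs of an 8:1 mux with a truth-table output column, drive the selects with A, B, Cin, and compare against the arithmetic definition of a full adder.

```c
/* An 8:1 mux: select one of eight configuration bits by (s2 s1 s0). */
int mux8(const int d[8], int s2, int s1, int s0) {
    return d[(s2 << 2) | (s1 << 1) | s0];
}

/* Data inputs programmed with the Cout and Sum columns of the
   full-adder truth table, indexed by (A B Cin) = 000..111. */
static const int COUT_TABLE[8] = {0, 0, 0, 1, 0, 1, 1, 1};
static const int SUM_TABLE[8]  = {0, 1, 1, 0, 1, 0, 0, 1};
```

Reprogramming the eight data bits yields any other 3-input function; this is precisely how a LUT works.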
How Many Functions?
▸ How many distinct 3-input 1-output Boolean functions exist?
▸ What about K inputs?
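The counting argument behind these questions: a k-input function is fully determined by its truth table, which has 2^k rows, and each row independently outputs 0 or 1, giving 2^(2^k) distinct functions (256 for k = 3). A quick check:

```c
#include <stdint.h>

/* Number of distinct k-input, 1-output Boolean functions: each of the
   2^k truth-table rows can be 0 or 1, so the count is 2^(2^k). */
uint64_t num_boolean_functions(unsigned k) {
    uint64_t rows = 1ULL << k;    /* 2^k truth-table rows */
    return 1ULL << rows;          /* valid while 2^k < 64, i.e. k <= 5 */
}
```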
Look-Up Table (LUT)
▸ A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
– 2^k SRAM bits
– Delay is independent of the logic function
[Figure: a 3-input LUT: eight SRAM bits (each 0/1) feed a multiplexer selected by inputs x2 x1 x0 to produce output Y]
How Many LUTs?
▸ How many 3-input LUTs are needed to implement the following full adder?
▸ How about using 4-input LUTs?
A B Cin | Cout S
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 0 1
0 1 1 | 1 0
1 0 0 | 0 1
1 0 1 | 1 0
1 1 0 | 1 0
1 1 1 | 1 1
A Logic Element
▸ A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed
▸ The LUT and FF combined form a logic element
[Figure: LUT output feeding an FF, with a mux to bypass the FF]
A Logic Block
▸ A logic block clusters multiple logic elements
▸ Example: In Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices
– Two independent carry chains per CLB for implementing adders
– Each slice contains four LUTs
[Figure: a CLB with two slices and a crossbar switch, with CIN/COUT carry-chain ports on each slice]
Traditional Homogeneous FPGA Architecture
[Figure: island-style array of logic blocks and switch blocks connected by routing tracks]
Modern Heterogeneous Field-Programmable System-on-Chip
▸ Island-style configurable mesh routing
▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors
– Specialization leads to higher performance and lower power
[Figure credit: embeddedrelated.com]
Dedicated DSP Blocks
▸ Built-in components for fast arithmetic operations optimized for DSP applications
– Essentially a multiply-accumulate core with many other features
– Fixed logic and connections; functionality may be configured using control signals at run time
– Much faster than LUT-based implementation (ASIC vs. LUT)
Example: Xilinx DSP48E Slice
▸ 25x18 signed multiplier
▸ 48-bit add/subtract/accumulate
▸ 48-bit logic operations
▸ SIMD operations (12/24 bit)
▸ Pipeline registers for high speed
[source: Xilinx Inc.]
Finite Impulse Response (FIR) Filter Mapped to DSP Slices
y[n] = Σ_{i=0..N} c_i · x[n−i]
[Figure: a chain of DSP slices, each multiplying one coefficient (C0-C3) by a delayed copy of x(n) and adding the result into the cascade that produces y(n); the first slice's cascade input is 0]
[source: Xilinx Inc.]
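Behaviorally, each DSP slice contributes one multiply-accumulate term of the sum above; a direct-form software model (function name and coefficients are hypothetical) is:

```c
/* Direct-form FIR: y[n] = sum over i of c[i] * x[n-i].
   Samples before the start of the stream are treated as zero,
   mirroring the zero fed into the first DSP slice's cascade input. */
int fir(const int c[], int ntaps, const int x[], int n) {
    int acc = 0;                   /* plays the role of the DSP accumulator */
    for (int i = 0; i < ntaps; i++)
        if (n - i >= 0)
            acc += c[i] * x[n - i];
    return acc;
}
```

On the FPGA, the loop is unrolled in space: one DSP slice per tap, with the accumulation carried along the dedicated cascade path instead of a software loop.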
Dedicated Block RAMs (BRAMs)
▸ Example: Xilinx 18K/36K block RAMs
– 32K x 1 to 512 x 72 in one 36K block
– Simple dual-port and true dual-port configurations
– Built-in FIFO logic
– 64-bit error correction coding per 36K block
[Figure: an 18K/36K block RAM with two independent ports A and B, each with data in/out (DI/DO), parity (DIP/DOP), address (ADDR), write enable (WE), enable (EN), and clock (CLK)]
[source: Xilinx Inc.]
Embedded FPGA System-on-Chip
▸ Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz~1 GHz
▸ Up to 350K logic cells, 2 MB block RAM, 900 DSP48s
[Figure: Xilinx Zynq All Programmable System-on-Chip. Source: Xilinx Inc.]
▸ Massive amount of fine-grained parallelism
▸ Silicon configurable to fit the application
▸ Performance/watt advantage over CPUs & GPUs
FPGA as an Accelerator for Cloud Computing
▸ AWS Cloud F1 FPGA instance: Xilinx UltraScale+ VU9P
– ~2 Million Logic Blocks
– ~5000 DSP Blocks
– ~300 Mb Block RAM
[Figure source: David Pellerin, AWS]
Microsoft Deploying FPGAs in Datacenter
▸ FPGAs deployed in Microsoft datacenters to accelerate various web, database, and AI services
– e.g., Project BrainWave claimed ~40 teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs
[source: Microsoft, BrainWave, HotChips’2017]
Summary: FPGA as a Programmable Accelerator
▸ Massive amount of fine-grained parallelism
– Highly parallel and/or deeply pipelined to achieve maximum parallelism
– Distributed data/control dispatch
▸ Silicon configurable to fit the algorithm
– Compute the exact algorithm at the desired level of numerical accuracy
• Bit-level sizing and sub-cycle chaining
– Customized memory hierarchy
▸ Performance/watt advantage
– Low power consumption compared to CPUs and GPGPUs
• Low clock speed
• Specialized architecture blocks
Next Class
▸ Vivado HLS tutorial (led by “TAs”)
Acknowledgements
▸ These slides contain/adapt materials developed by
– Prof. Jason Cong (UCLA)