TRANSCRIPT
Specialized Computing
ECE 5775 (Fall ’17): High-Level Digital Design Automation
Announcements
▸ All students enrolled in CMS & Piazza
▸ Vivado HLS tutorial on Tuesday 8/29
– Install an SSH client (MobaXterm or PuTTY)
– % ssh -X <netid>@ecelinux-01.ece.cornell.edu
– Bring your laptop!
Agenda
▸ Motivation for hardware specialization
– Case study on convolution
▸ FPGA introduction
– Basic building blocks
– Basic homogeneous FPGA architecture
– Modern heterogeneous FPGA architecture
Tension between Efficiency and Flexibility
[Figure: “The Dilemma: Flexibility vs. Efficiency”: energy efficiency (MOPS/mW) plotted against flexibility for different implementation styles. © 2012 Altera Corporation]
Source: “High-performance Energy-Efficient Reconfigurable Accelerator Circuits for the Sub-45nm Era,” July 2011, by Ram K. Krishnamurthy, Circuits Research Labs, Intel Corp.
Programmable Processing
A Simple Single-Cycle Microprocessor
[Figure: single-cycle datapath with PC, register file (RF), ALU, and RAM; control signals include DR, SA, SB, IMM, MB, FS, MD, LD, and MW, and the ALU produces the V, C, Z, N condition flags]
Evaluating a Simple Expression on a CPU
Step-by-step CPU activities:
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: four snapshots of the PC/RF/ALU/RAM datapath, one per instruction]
Source: Adapted from Desh Singh’s talk at the HCP’14 workshop
“Unrolling” the CPU Hardware
1. Replicate the CPU hardware
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: four replicated PC/RF/ALU/RAM datapaths (CPU1-CPU4) laid out in space, each dedicated to one of the four instructions]
Eliminating Unused Logic
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: the replicated datapaths stripped down in space; each stage keeps only the RF/ALU/RAM resources its instruction needs]
A Special-Purpose Architecture
1. Replicate the CPU hardware
2. Instruction fixed -> remove FETCH logic
3. Remove unused ALU operations
4. Remove unused LOAD/STORE logic
5. Wire up registers and propagate values
R1 <= M[R0]
R2 <= M[R0+1]
R3 <= R1 + R2
M[R0+2] <= R3
[Figure: the resulting dataflow circuit: address adders off R0, two loads (LW) into R1 and R2, one adder producing R3, and a store (SW) of the result]
Resulting circuit can be realized with either ASIC or FPGA
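The behavior of this dataflow circuit can be sketched in C (the array M, the base address r0, and the function name are stand-ins for illustration):

```c
/* Behavioral model of the specialized circuit:
   two loads feed one dedicated adder, and the sum is stored back.
   No fetch, decode, or unused ALU operations remain. */
void specialized_datapath(int M[], int r0) {
    int r1 = M[r0];        /* LW: first load            */
    int r2 = M[r0 + 1];    /* LW: second load           */
    int r3 = r1 + r2;      /* +:  single dedicated add  */
    M[r0 + 2] = r3;        /* SW: store the result      */
}
```

In hardware, the three steps become wires and one adder rather than a sequence of instructions.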
Understanding Energy Inefficiency of General-Purpose Processors (GPPs)
[Figure: typical superscalar OoO pipeline: L1-I$, branch predictor, fetch/decode/rename (RAT, free list), reservation stations and scheduler, integer and FP register files, register read/write, ALU, FPU, D-cache with LSQ + TLB, ROB, and commit]
Parameter | Value
Fetch/issue/retire width | 4
# Integer ALUs | 3
# FP ALUs | 2
# ROB entries | 96
# Reservation station entries | 64
L1 I-cache | 32 KB, 8-way set associative
L1 D-cache | 32 KB, 8-way set associative
L2 cache | 6 MB, 8-way set associative
[source: Jason Cong, ISLPED’14 keynote]
Energy Breakdown of Pipeline Components
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 14%
Mul/div: 4%
FPU: 8%
Memory: 10%
Misc: 23%
[Figure: pie chart of the breakdown next to the OoO pipeline diagram, with the control structures highlighted]
Removing “Non-Computing” Portions
[Figure: the same energy-breakdown pie chart, with the memory and compute (Int ALU, Mul/div, FPU) slices highlighted]
▸ “Computing” portion: 10% (memory) + 26% (compute) = 36%
Energy Comparison of Processor ALUs and Dedicated Units
▸ Why are processor units so expensive?
– ALU can perform multiple operations
• Add/sub/bitwise XOR/OR/AND
– 64-bit ALU
– Dynamic/domino logic used to run at high frequency
• Higher power dissipation
Operation | Processor ALU | 45 nm TSMC standard-cell library
32-bit add | 0.122 nJ @ 2 GHz | 0.002 nJ @ 1 GHz
32-bit multiply | 0.120 nJ @ 2 GHz | 0.007 nJ @ 1 GHz
Single-precision FP operation | 0.150 nJ @ 2 GHz | 0.008 nJ @ 500 MHz
Energy Breakdown with Standard-Cell ASICs
Fetch unit: 9%
Decode: 6%
Rename: 12%
Register files: 3%
Scheduler: 11%
Int ALU: 0.2%
Mul/div: 0.2%
FPU: 0.4%
ALU/FPU savings: 25.0%
Memory: 10%
Misc: 23%
▸ “Computing” portion: 10% (memory) + ~1% (compute) = 11%
Only 10X gain is attainable?
Additional Energy Savings from Specialization
▸ Specialized memory architecture
– Exploit regular memory access patterns to minimize energy per memory read/write
▸ Specialized communication architecture
– Exploit data movement patterns to optimize the structure/topology of the on-chip interconnection network
▸ Customized data types
– Exploit data range information to reduce bitwidth/precision and simplify arithmetic operations
These techniques combined can lead to another 10-100X energy efficiency improvement over GPPs
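As a worked example of data-type customization (assumptions for illustration: 8-bit unsigned pixels and a 3x3 Sobel kernel; the helper names are hypothetical): the sum of the kernel's absolute coefficients bounds the output range, which in turn fixes the adder width a specialized datapath needs.

```c
#include <stdlib.h>

/* Worst-case output magnitude of a 3x3 convolution:
   max_pixel * sum of |coefficient| over the kernel. */
int conv_max_magnitude(const int k[3][3], int max_pixel) {
    int sum_abs = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            sum_abs += abs(k[i][j]);
    return sum_abs * max_pixel;
}

/* Smallest signed bitwidth that can represent values in [-mag, mag]. */
int signed_bits_needed(int mag) {
    int bits = 1;                      /* start with the sign bit */
    while ((1 << (bits - 1)) <= mag)   /* grow until 2^(bits-1) > mag */
        bits++;
    return bits;
}
```

For the Sobel Gy kernel and 255-valued pixels the bound is 8 * 255 = 2040, so 12-bit signed arithmetic suffices where a processor would spend a full-width ALU operation.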
Case Study: Memory Specialization for Convolution
▸ The main computation of image/video processing is performed over overlapping stencils, termed convolution
[Figure: a 3x3 convolution kernel (-1 -2 -1 / 0 0 0 / 1 2 1) sliding over an input image frame to produce an output image frame]
Example Application: Edge Detection
▸ Identifies discontinuities in an image where brightness (or image intensity) changes sharply
– Very useful for feature extraction in computer vision
Sobel operator G = (Gx, Gy)
Figures: Pilho Kim, GaTech
CPU Implementation of Convolution
[Figure: CPU connected to main memory through a cache]
for (n = 1; n < height-1; n++)
  for (m = 1; m < width-1; m++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[n][m] += img[n+i-1][m+j-1] * f[i][j];
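A self-contained version of the loop nest can be exercised as follows (the 8x8 dimensions are hypothetical; the window is centered at (n, m) so all accesses stay in bounds, and border pixels are left at zero):

```c
#include <string.h>

#define H 8
#define W 8

/* 3x3 convolution over the interior pixels of an H x W image.
   out[n][m] accumulates the window centered at (n, m). */
void conv3x3(const int img[H][W], const int f[3][3], int out[H][W]) {
    memset(out, 0, sizeof(int) * H * W);   /* zero-initialize the output */
    for (int n = 1; n < H - 1; n++)
        for (int m = 1; m < W - 1; m++)
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    out[n][m] += img[n + i - 1][m + j - 1] * f[i][j];
}
```

Note the 2D access pattern: each output revisits nine inputs, eight of which were already read for neighboring outputs, which is exactly what the cache (and later the line buffer) exploits.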
General-Purpose Cache for Convolution
▸ Minimizes main memory accesses to improve performance
▸ A general-purpose cache is expensive in cost and incurs nontrivial energy overhead
[Figure: input picture, W pixels wide, with the cached region shaded]
Specializing the Cache for Convolution (1)
▸ Remove rows that are not in the neighborhood of the convolution window
[Figure: only the rows overlapping the convolution window are retained (image W pixels wide)]
Specializing the Cache for Convolution (2)
▸ Rearrange the rows as a 1D array of pixels
▸ Each time we move the window to the right, push the new pixel into the “cache”
– Remove the edge pixels that are not needed for computation
[Figure: the retained rows flattened into a 1D array of width W; the old pixel exits one end as the new pixel enters the other]
A Specialized “Cache”: Line Buffer
▸ Line buffer: a fixed-width “cache” with (K-1)*W+K pixels in flight
– Fixed addressing: low area/power and high performance
▸ In customized FPGA implementations, line buffers can be efficiently implemented with on-chip BRAMs
[Figure: a 2W+3 pixel line buffer (K=3); the new pixel shifts in as the old pixel shifts out]
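A minimal software sketch of the line-buffer discipline for K = 3 (the width W and the helper names are assumptions for illustration): each new pixel shifts in while the oldest falls out, so the nine window pixels always sit at fixed offsets r*W + c, with no address computation or tag lookup.

```c
#include <string.h>

#define W 16                  /* image width, assumed for illustration */
#define LB_LEN (2 * W + 3)    /* (K-1)*W + K pixels in flight for K = 3 */

/* Shift a new pixel into the line buffer; the oldest pixel falls out.
   Index 0 holds the newest pixel, index LB_LEN-1 the oldest. */
void linebuf_push(int lb[LB_LEN], int new_pixel) {
    memmove(&lb[1], &lb[0], (LB_LEN - 1) * sizeof(int));
    lb[0] = new_pixel;
}

/* Fixed addressing: row r, column c of the 3x3 window is always at
   offset r*W + c (pixels one image row apart are W entries apart). */
int window_at(const int lb[LB_LEN], int r, int c) {
    return lb[r * W + c];
}
```

In hardware the memmove becomes a shift register (or BRAM pointer update), which is why the structure is so much cheaper than a general-purpose cache.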
What is an FPGA?
▸ FPGA: Field-Programmable Gate Array
– An integrated circuit designed to be configured by a customer or a designer after manufacturing (Wikipedia)
▸ Components in an FPGA chip
– Programmable logic blocks
– Programmable interconnects
– Programmable I/Os
Three Important Pieces
▸ SRAM-based implementation is popular
– Non-standard technology means older technology generation
▸ Pass transistor (controlled by an SRAM bit)
▸ Multiplexer (controlled by SRAM bits)
▸ Lookup table (LUT, formed by SRAM bits)
Multiplexer as a Universal Gate
▸ Any function of k variables can be implemented with a 2^k:1 multiplexer
[Figure: the full-adder truth table (A, B, Cin -> Cout, Sum) mapped onto an 8:1 MUX: the select inputs S2 S1 S0 are driven by A, B, Cin, and the eight data inputs are programmed with the corresponding output column]
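The claim can be checked in software (a sketch, not a circuit description): program the eight data inputs of an 8:1 mux with a truth-table output column, drive the selects with A, B, Cin, and compare against the arithmetic definition of a full adder.

```c
/* An 8:1 mux: select one of eight configuration bits by (s2 s1 s0). */
int mux8(const int d[8], int s2, int s1, int s0) {
    return d[(s2 << 2) | (s1 << 1) | s0];
}

/* Data inputs programmed with the Cout and Sum columns of the
   full-adder truth table, indexed by (A B Cin) = 000..111. */
static const int COUT_TABLE[8] = {0, 0, 0, 1, 0, 1, 1, 1};
static const int SUM_TABLE[8]  = {0, 1, 1, 0, 1, 0, 0, 1};
```

Reprogramming the eight data bits yields any other 3-input function; this is precisely how a LUT works.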
How Many Functions?
▸ How many distinct 3-input 1-output Boolean functions exist?
▸ What about K inputs?
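The counting argument behind these questions: a k-input function is fully determined by its truth table, which has 2^k rows, and each row independently outputs 0 or 1, giving 2^(2^k) distinct functions (256 for k = 3). A quick check:

```c
#include <stdint.h>

/* Number of distinct k-input, 1-output Boolean functions: each of the
   2^k truth-table rows can be 0 or 1, so the count is 2^(2^k). */
uint64_t num_boolean_functions(unsigned k) {
    uint64_t rows = 1ULL << k;    /* 2^k truth-table rows */
    return 1ULL << rows;          /* valid while 2^k < 64, i.e. k <= 5 */
}
```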
Look-Up Table (LUT)
▸ A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
– 2^k SRAM bits
– Delay is independent of the logic function
[Figure: a 3-input LUT: eight SRAM bits (each 0/1) feed a multiplexer selected by inputs x2 x1 x0 to produce output Y]
How Many LUTs?
▸ How many 3-input LUTs are needed to implement the following full adder?
▸ How about using 4-input LUTs?
A B Cin | Cout S
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 0 1
0 1 1 | 1 0
1 0 0 | 0 1
1 0 1 | 1 0
1 1 0 | 1 0
1 1 1 | 1 1
A Logic Element
▸ A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed
▸ The LUT and FF combined form a logic element
[Figure: LUT output feeding an FF, with a mux to bypass the FF]
A Logic Block
▸ A logic block clusters multiple logic elements
▸ Example: In Xilinx 7-series FPGAs, each configurable logic block (CLB) has two slices
– Two independent carry chains per CLB for implementing adders
– Each slice contains four LUTs
[Figure: a CLB with two slices and a crossbar switch, with CIN/COUT carry-chain ports on each slice]
Traditional Homogeneous FPGA Architecture
[Figure: island-style array of logic blocks and switch blocks connected by routing tracks]
Modern Heterogeneous Field-Programmable System-on-Chip
▸ Island-style configurable mesh routing
▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors
– Specialization leads to higher performance and lower power
[Figure credit: embeddedrelated.com]
Dedicated DSP Blocks
▸ Built-in components for fast arithmetic operations optimized for DSP applications
– Essentially a multiply-accumulate core with many other features
– Fixed logic and connections; functionality may be configured using control signals at run time
– Much faster than LUT-based implementation (ASIC vs. LUT)
Example: Xilinx DSP48E Slice
▸ 25x18 signed multiplier
▸ 48-bit add/subtract/accumulate
▸ 48-bit logic operations
▸ SIMD operations (12/24 bit)
▸ Pipeline registers for high speed
[source: Xilinx Inc.]
Finite Impulse Response (FIR) Filter Mapped to DSP Slices
y[n] = Σ_{i=0..N} c_i · x[n−i]
[Figure: a chain of DSP slices, each multiplying one coefficient (C0-C3) by a delayed copy of x(n) and adding the result into the cascade that produces y(n); the first slice's cascade input is 0]
[source: Xilinx Inc.]
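Behaviorally, each DSP slice contributes one multiply-accumulate term of the sum above; a direct-form software model (function name and coefficients are hypothetical) is:

```c
/* Direct-form FIR: y[n] = sum over i of c[i] * x[n-i].
   Samples before the start of the stream are treated as zero,
   mirroring the zero fed into the first DSP slice's cascade input. */
int fir(const int c[], int ntaps, const int x[], int n) {
    int acc = 0;                   /* plays the role of the DSP accumulator */
    for (int i = 0; i < ntaps; i++)
        if (n - i >= 0)
            acc += c[i] * x[n - i];
    return acc;
}
```

On the FPGA, the loop is unrolled in space: one DSP slice per tap, with the accumulation carried along the dedicated cascade path instead of a software loop.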
Dedicated Block RAMs (BRAMs)
▸ Example: Xilinx 18K/36K block RAMs
– 32K x 1 to 512 x 72 in one 36K block
– Simple dual-port and true dual-port configurations
– Built-in FIFO logic
– 64-bit error correction coding per 36K block
[Figure: an 18K/36K block RAM with two independent ports A and B, each with data in/out (DI/DO), parity (DIP/DOP), address (ADDR), write enable (WE), enable (EN), and clock (CLK)]
[source: Xilinx Inc.]
Embedded FPGA System-on-Chip
▸ Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz~1 GHz
▸ Up to 350K logic cells, 2 MB block RAM, 900 DSP48s
[Figure: Xilinx Zynq All Programmable System-on-Chip. Source: Xilinx Inc.]
▸ Massive amount of fine-grained parallelism
▸ Silicon configurable to fit the application
▸ Performance/watt advantage over CPUs & GPUs
FPGA as an Accelerator for Cloud Computing
▸ AWS Cloud F1 FPGA instance: Xilinx UltraScale+ VU9P
– ~2 Million Logic Blocks
– ~5000 DSP Blocks
– ~300 Mb Block RAM
[Figure source: David Pellerin, AWS]
Microsoft Deploying FPGAs in Datacenter
▸ FPGAs deployed in Microsoft datacenters to accelerate various web, database, and AI services
– e.g., Project BrainWave claimed ~40 teraflops on large recurrent neural networks using Intel Stratix 10 FPGAs
[source: Microsoft, BrainWave, HotChips’2017]
Summary: FPGA as a Programmable Accelerator
▸ Massive amount of fine-grained parallelism
– Highly parallel and/or deeply pipelined to achieve maximum parallelism
– Distributed data/control dispatch
▸ Silicon configurable to fit the algorithm
– Compute the exact algorithm at the desired level of numerical accuracy
• Bit-level sizing and sub-cycle chaining
– Customized memory hierarchy
▸ Performance/watt advantage
– Low power consumption compared to CPUs and GPGPUs
• Low clock speed
• Specialized architecture blocks
Next Class
▸ Vivado HLS tutorial (led by “TAs”)
Acknowledgements
▸ These slides contain/adapt materials developed by
– Prof. Jason Cong (UCLA)