cs152: computer systems architecture dark silicon ...swjun/courses/2019w-cs152... · possible ^...

CS152: Computer Systems ArchitectureDark Silicon, Application-Specific Acceleration

Sang-Woo Jun

Winter 2019

Not All Transistors Can Be Active!

Utilization wall:“With each successive process generation, the percentage of a chip

that can switch at full frequency drops exponentially due to power constraints.” -- Venkatesh, ASPLOS ‘10

The following slides adapted from Michael Taylor’s 2012 talk“Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse” – Marked ‘**’

Tradeoffs Between Cores And Frequency**4 cores @ 1.8 GHz

4 cores @ 2x1.8 GHz (12 cores dark)

2x4 cores @ 1.8 GHz (8 cores dark, 8 dim)

4x4 cores @ .9 GHz(16 dim)

Next generation

… …

The Four Horsemen**

What do we do with this dark silicon?

“Four top contenders, each of which seemed like an unlikely candidate from the beginning, carrying unwelcome burdens in design, manufacturing and programming. None is ideal, but each has its benefit and the optimal solution probably incorporates all four of them…”

The Shrinking Horseman (#1)**

“Area is expensive. Chip designers will just build smaller chips instead of having dark silicon in their designs!”

First, dark silicon doesn’t mean useless silicon, it just means it’s under-clocked or not used all of the time.

There’s lots of dark silicon in current chips: o On-chip GPU on AMD Fusion or Intel Sandybridge for GCC

• L3 cache is very dark for applications with small working sets

• SSE units for integer apps

• …

The Shrinking Horseman (#1)**

Competition and Marginso If there is an advantage to be had from using dark silicon, you have to use it too, to

keep up with the Jones.

Diminished Returns (e.g., $10 silicon selling for $200 today)o Savings Exponentially Diminishing: $5, $2.5, $1.25, 63co Overheads: packaging, test, marketing, etc.o Chip structures like I/O Pad Area do not scale

Exponential increase in Power Density -> Exponential Rise in Temperature

But, some chips will shrinko Nasty low margin, high competition chips; or a monopoly (Sony Cell)

The Dim Horseman (#2)**

Spatial dimming: Have enough cores to exceed power budget, but underclock them

Gen 1 & 2 Multicores (higher core count, lower freqs)

Near Threshold Voltage (NTV) Operationo Delay Loss/Lower clock speed > Energy Gain

o But, make it up with lots of dim cores

The Dim Horseman (#2)**

Temporal Dimming : Have enough cores to exceed power budget, but use them only in burstso Dim cores, but overclock if cold – e.g., Intel TurboBoost

o E.g., ARM A15 Core in mobile phones• A15 power usage way above sustainable for phone.

• 10 second bursts at most (big.LITTLE)

The Specialized Horseman (#3)**

“We will use all of that dark silicon area to build specialized cores, each of them tuned for the task at hand (10-100x more energy efficient), and only turn on the ones we need…”

Insights:o Power is now more expensive than area

o Specialized logic can improve energy efficiency by 10-1000x

The Specialized Horseman (#3)**

C-cores Approach:o Fill dark silicon with Conservation Cores, or c-cores,

which are automatically-generated, specialized energy-saving coprocessors that save energy on common apps

Execution jumps among c-cores (hot code) and a host CPU (cold code)o Power-gate HW that is not currently in use

o Coherent Memory & Patching Support for C-cores

Typical Energy Savings**

The Specialized Horseman (#3) -- Pssst

Another active thrust in this area is reconfigurable hardware acceleration using Field-Programmable Gate Arrays (FPGA)o A single FPGA fabric can be configured at runtime to act like any C-core

o Not as efficient as a prefabricated C-core, but can cover any function at runtime

o More on this later!

The Deus Ex Machina Horseman (#4)**

Deus Ex Machina: “A plot device whereby a seemingly unsolvable problem is suddenly and abruptly solved with the unexpected intervention of some new event, character, ability or object.”

“MOSFETs are the fundamental problem”

“FinFets, Trigate, High-K, nanotubes, 3D, for one-time improvements, but none are sustainable solutions across process generations.”

The Deus Ex Machina Horseman (#4)**

Possible “Beyond CMOS” Device Directionso Nano-electrical Mechanical Relays?

o Tunnel Field Effect Transistors (TFETS)?

o Spin-Transfer Torque MRAM (STT-MRAM)?

o Graphene?

o Human brain?

o DNA Computing?

CS152: Computer Systems ArchitectureField-Programmable Gate Arrays

Sang-Woo Jun

Winter 2019

What Are FPGAs

Field-Programmable Gate Array

Can be configured to act like any circuit – More later!

Can do many things, but we focus on computation acceleration

FPGAs Come In Many Forms

PCIe-Attached

CPU Integrated In-Network

In-Storage

How Is It Different From CPU/GPUs

GPU – The other major accelerator

CPU/GPU hardware is fixedo “General purpose”

o we write programs (sequence of instructions) for them

FPGA hardware is not fixedo “Special purpose”

o Hardware can be whatever we want

o Will our hardware require/support software? Maybe!

Optimized hardware is very efficiento GPU-level performance**

o 10x power efficiency (300 W vs 30 W)

Analogy

“The Z-Berry”“Experimental Investigations on Radiation Characteristics of IC Chips”benryves.com “Z80 Computer”

CPU/GPU comes with fixed circuits FPGA gives you a big bag of components

To build whatever

Shadi Soundation: Homebrew 4 bit CPU

Could be a CPU/GPU!

Fine-Grained Parallelism of Special-Purpose Circuits

Example -- Calculating gravitational force: 𝐺×𝑚1×𝑚2

(𝑥1−𝑥2)2+(𝑦1−𝑦2)

8 instructions on a CPU → 8 cycles**

Much fewer cycles on a special purpose circuit

A = G × m1

B = A × m2

C = x1 - x2

D = C2

E = y1 - y2

F = E2

G = D + F

Ret = B / G

A = G × m1 × m2 B = (x1 - x2)2 C = (y1 - y2)2

D = B + C

Ret = B / G

4 cycles with basic operations

3 cycles with compound operations

Ret = (G × m1 × m2) / ((x1 - x2)2 + (y1 - y2)2)

1 cycle with even further compound operations

May slow down clock

Coarse-Grained Parallelism ofSpecial-Purpose Circuits

Typical unit of parallelism for general-purpose units are threads ~= cores

Special-purpose processing units can also be replicated for parallelismo Large, complex processing units: Few can fit in chip

o Small, simple processing units: Many can fit in chip

Independent operations can explicitly be parallelized across dedicated hardware moduleso Hundreds/thousands of operations are regularly done in parallel

Only generates hardware useful for the applicationo Instruction? Decoding? Cache? Coherence?

How Is It Different From ASICs

ASIC (Application-Specific Integrated Circuit)o Special chip purpose-built for an application

o E.g., ASIC bitcoin miner, Intel neural network accelerator

o Function cannot be changed once expensively built

+ FPGAs can be field-programmedo Function can be changed completely whenever

o FPGA fabric emulates custom circuits

- Emulated circuits are not as efficient as bare-metalo ~10x performance (larger circuits, faster clock)

o ~10x power efficiency

Basic FPGA Architecture“Configurable logic block (CLB)”

Programmable interconnect

I/O block 6-InputLook-Up

Programmable

Input 1 Input 2 Output

Ex) 2-LUT for “AND”

Stores state forsequential circuit

construction

Basic FPGA Architecture – DSP Blocks

CLBs act as gates – Many needed to implement high-level logic

Arithmetic operation provided as efficient ALU blockso “Digital Signal Processing (DSP) blocks”

o Each block provides an adder + multiplier

“DSP block”

× +/-

Basic FPGA Architecture – Block RAM

CLB can act as flip-flopso (~1 bit/block) – tiny!

Some on-chip SRAM provided as blockso ~18/36 Kbit/block, MBs per chip

o Massively parallel access to data → multi-TB/s bandwidth

“Block RAM”

Basic FPGA Architecture – Hard Cores

Some functions are provided as efficient, non-configurable “hard cores”o Multi-core ARM cores (“Zynq” series)

o Multi-Gigabit Transceivers

o PCIe/Ethernet PHY

o Memory controllers

ARM PCIe

Ethernet

Memory

Example Accelerator Card Architecture

“FPGA Mezzanine Card” Expansiono Network Ports, Memory, Storage, PCIe, …

General-Purpose I/O Pins Multi-Gigabit Transceivers

Example Accelerator Card (VCU108)

Programming/Using an FPGA Accelerator

Bitfile is programmed to FPGA over “JTAG” interfaceo Typically used over USB cable

o Supports FPGA programming, limited debugging access, etc

PCIe-attached FPGA accelerator card is typically used similarly to GPUso Program FPGA, execute software

o Software copies data to FPGA board, notify FPGA-> FPGA logic performs computations-> Software copies data back from FPGA

FPGA flexibility gives immense freedom of usage patternso Streaming, coherent memory, …

cs152: computer systems architecture dark silicon ...swjun/courses/2019w-cs152... · possible ^...

Documents

cs152: computer systems architecture simd...

cs152 computer architecture and engineering cs252 graduate

cs152 spring 2002 lec2.1 cs152 computer architecture and...

cs152 quiz #5 april 27th, 2016 professor george...

cs152: computer systems architecture moore’s law

computer architecture and engineering cs152 quiz #1...

cs152 – computer architecture and engineering lecture 19

cs 152 l06 single cycle 1 (1)patterson fall 2003 © ucb...

cs 152: computer systems architecture storage...

cs152: computer systems architecture circuits recap

cs152 / kubiatowicz lec4.1 9/8/99©ucb fall 1999 september...

math100section108–2019w solutionstopracticeproblems2v2

cs152: computer systems architecture operating system...

cs152: computer systems architecture risc-v...

january 28, 2004 john kubiatowicz (kubitron) lecture slides:...

cs152 lec9.1 cs152 computer architecture and engineering...

computer architecture and engineering cs152 quiz #4 april...

cs152 computer architecture and engineering memory

cs152: computer systems architecture storytime x86 and...

cs152: computer systems architecture risc-v introduction