firesim: fpga-accelerated cycle-exact scale-out system ... · firesim fpga-accelerated cycle-exact...

42
FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović https://fires.im @firesimproject [email protected]

Upload: others

Post on 24-Apr-2020

23 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

FireSimFPGA-Accelerated Cycle-Exact

Scale-Out System Simulation in the Public Cloud

Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle

Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović

https://fires.im@firesimproject

[email protected]

Page 2: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

The new datacenter hardware environment

Deeper memory/storage

hierarchies

e.g. 3DXPoint, HBM

The end of Moore’s Law

Custom Silicon in the Cloud

Faster networks

e.g. Silicon Photonics

New datacenter architectures

e.g. disaggregation[1]

2

Page 3: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Disaggregated Datacenters

3Diagram from Gao et al., OSDI’16

Page 4: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

…and custom HW is changing faster than ever

FPGAs: Agile HW Design for ASICs:

[2]

4

Page 5: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

What does our simulator need to do?• Model hardware at scale:

• CPUs down to microarchitecture• Fast networks, switches• Novel accelerators

• Run real software: • Real OS, networking stack (Linux)• Real frameworks/applications (not microbenchmarks)

• Be productive/usable:• Run on a commodity platform• Want to encourage collaboration between systems, architecture: real HW/SW

co-design

5

Page 6: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

(2) Build a software simulator

Comparing existing HW “simulation” systems

(1) Build the hardware

(3) Build a hardware-accelerated simulator

6

Page 7: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

A HW-accelerated DC simulator: DIABLO

• DIABLO, ASPLOS’15 [4]:• Simulated 3072 servers, 96 ToRs at ~2.7 MHz• Booted Linux, ran apps like Memcached• Part of RAMP collaboration [8]

• Need to hand-write abstract RTL models• Harder than writing “tapeout-ready” RTL• Need to validate against real HW

• Tied to an expensive custom host-platform• $100k+ host platform, custom built DIABLO Prototype

7

Page 8: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Comparing existing HW “simulation” systems

8

• Taping-out excels at:• Modeling reality: “single source of truth”• Scalability

• Hardware-accelerated simulators excel at:• Simulation rate• Ability to run real workloads (as fn. of sim rate)

• Software-based simulators excel at:• Ease-of-use• Ease-of-rebuild (time-to-first-cycle)• Commodity host platform• Cost• Introspection

Page 9: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

FPGAs in the Cloud

Useful trends throughout the architect’s stack

High-Productivity Hardware Design

Language w/IR

Open, Silicon-Proven SoC Implementations

Open ISA

9

Page 10: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

FireSim at a high-level

Server Simulations• Inherent parallelism – lots of gates• We have tapeout-proven RTL:

automatically FAME-1 transform• Put RTL-derived sims on the FPGAs

Network simulation• Little parallelism in switch models

(e.g. a thread per port)• Need to coordinate all of our

distributed server simulations• So use CPUs + host network

10

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

Server Simulation(s)

FPGAs (x8)

CPU

HostPCIe

f1.16xlarge

Host

Eth

erne

t (EC

2 Ne

twor

k)

Server Simulations

Switc

h M

odel

Page 11: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Now, let’s build a datacenter-scale FireSim simulation!

11

Page 12: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

12

13

Mod

eled

Sys

tem

Reso

urce

Uti

l

Step

N: T

itle

(pla

ceho

lder

slid

e)

Root

Sw

itchA

ggre

gati

on P

od

Agg

rega

tion

Pod

Agg

rega

tion

Pod

Agg

rega

tion

Sw

itch

Rack

Rack

Rack

Rack

Rack

Rack

Rack

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Hos

t In

stan

ce C

PU: T

oRSw

itch

Mod

el

DRA

M

DRA

M

DRA

M

DRA

M

Serv

er

Blad

eSi

mul

aIon

Serv

er

Blad

eSi

mul

atio

n

Serv

er

Blad

eSi

m.

Serv

er

Blad

eSi

mul

atio

n

Rock

et

Core

Rock

et

Core

Rock

et

Core

Rock

et

Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Oth

er P

erip

hera

lsDRAM Model

NIC

Sim

En

dpoi

ntO

ther

Per

iph.

Si

m E

ndpo

ints

FPGA Fabric

PCIe

to H

ost

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Modeled System- 4x RISC-V Rocket Cores @ 3.2 GHz- 16K I/D L1$- 256K Shared L2$- 200 Gb/s Eth. NIC

Resource Util.- < ¼ of an FPGA

Sim Rate- N/A

Step 1: Server SoC in RTL

Page 13: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

13

13

Mod

eled

Sys

tem

Reso

urce

Uti

l

Step

N: T

itle

(pla

ceho

lder

slid

e)

Root

Sw

itchA

ggre

gati

on P

od

Agg

rega

tion

Pod

Agg

rega

tion

Pod

Agg

rega

tion

Sw

itch

Rack

Rack

Rack

Rack

Rack

Rack

Rack

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Hos

t In

stan

ce C

PU: T

oRSw

itch

Mod

el

DRA

M

DRA

M

DRA

M

DRA

M

Serv

er

Blad

eSi

mul

aIon

Serv

er

Blad

eSi

mul

atio

n

Serv

er

Blad

eSi

m.

Serv

er

Blad

eSi

mul

atio

n

Rock

et

Core

Rock

et

Core

Rock

et

Core

Rock

et

Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Oth

er P

erip

hera

lsDRAM Model

NIC

Sim

En

dpoi

ntO

ther

Per

iph.

Si

m E

ndpo

ints

FPGA Fabric

PCIe

to H

ost

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Modeled System- 4x RISC-V Rocket Cores @ 3.2 GHz- 16K I/D L1$- 256K Shared L2$- 200 Gb/s Eth. NIC

Resource Util.- < ¼ of an FPGA

Sim Rate- N/A

Step 1: Server SoC in RTL

Page 14: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

14

13

Mod

eled

Sys

tem

Reso

urce

Uti

l

Step

N: T

itle

(pla

ceho

lder

slid

e)

Root

Sw

itchA

ggre

gati

on P

od

Agg

rega

tion

Pod

Agg

rega

tion

Pod

Agg

rega

tion

Sw

itch

Rack

Rack

Rack

Rack

Rack

Rack

Rack

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Hos

t In

stan

ce C

PU: T

oRSw

itch

Mod

el

DRA

M

DRA

M

DRA

M

DRA

M

Serv

er

Blad

eSi

mul

aIon

Serv

er

Blad

eSi

mul

atio

n

Serv

er

Blad

eSi

m.

Serv

er

Blad

eSi

mul

atio

n

Rock

et

Core

Rock

et

Core

Rock

et

Core

Rock

et

Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Oth

er P

erip

hera

ls

DRAM Model

NIC

Sim

En

dpoi

ntO

ther

Per

iph.

Si

m E

ndpo

ints

FPGA Fabric

PCIe

to H

ost

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Modeled System- 4x RISC-V Rocket Cores @ 3.2 GHz- 16K I/D L1$- 256K Shared L2$- 200 Gb/s Eth. NIC- 16 GB DDR3Resource Util.- < ¼ of an FPGA- ¼ Mem ChansSim Rate- ~150 MHz- ~40 MHz (netw)

Step 2: FPGA Simulation of one server blade

Page 15: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

15

13

Mod

eled

Sys

tem

Reso

urce

Uti

l

Step

N: T

itle

(pla

ceho

lder

slid

e)

Root

Sw

itchA

ggre

gati

on P

od

Agg

rega

tion

Pod

Agg

rega

tion

Pod

Agg

rega

tion

Sw

itch

Rack

Rack

Rack

Rack

Rack

Rack

Rack

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Hos

t In

stan

ce C

PU: T

oRSw

itch

Mod

el

DRA

M

DRA

M

DRA

M

DRA

M

Serv

er

Blad

eSi

mul

aIon

Serv

er

Blad

eSi

mul

atio

n

Serv

er

Blad

eSi

m.

Serv

er

Blad

eSi

mul

atio

n

Rock

et

Core

Rock

et

Core

Rock

et

Core

Rock

et

Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Oth

er P

erip

hera

ls

DRAM Model

NIC

Sim

En

dpoi

ntO

ther

Per

iph.

Si

m E

ndpo

ints

FPGA Fabric

PCIe

to H

ost

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

FPG

A

(4 S

ims)

Modeled System- 4x RISC-V Rocket Cores @ 3.2 GHz- 16K I/D L1$- 256K Shared L2$- 200 Gb/s Eth. NIC- 16 GB DDR3Resource Util.- < ¼ of an FPGA- ¼ Mem ChansSim Rate- ~150 MHz- ~40 MHz (netw)

Step 2: FPGA Simulation of one server blade

Page 16: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

16

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System- 4 Server Blades- 16 Cores- 64 GB DDR3Resource Util.- < 1 FPGA- 4/4 Mem ChansSim Rate- ~14.3 MHz (netw)

Step 3: FPGA Simulation of 4 server blades

Cost: $0.49 per hour (spot)

$1.65 per hour (on-demand)

Page 17: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

17

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System- 4 Server Blades- 16 Cores- 64 GB DDR3Resource Util.- < 1 FPGA- 4/4 Mem ChansSim Rate- ~14.3 MHz (netw)

Step 3: FPGA Simulation of 4 server blades

Page 18: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

18

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR

Switch

- 200 Gb/s, 2us

links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz

(netw)

Step 4: Simulating a 32 node rack

Cost: $2.60 per hour (spot)

$13.20 per hour (on-demand)

Page 19: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Cycle-accurate Network Modeling

• For global cycle-accuracy, send a token on each link for each cycle, in each direction

• Each direction of a link has link latency in cycles tokens in-flight• e.g. 6400 tokens in flight on link for 2us link

latency @ 3.2 GHz

• Each token is desired bandwidth / clock frequency bits wide• e.g. 200 Gbps / 3.2 GHz ≈ 64 bit wide token

sent per cycle• Target transport agnostic (we provide

Ethernet switch models)• Host transport agnostic (shared mem,

sockets, PCIe)• Can “downgrade” to a zero-perf-impact

functional network model (150+ MHz)

19

Switch Port

NIC Top-Level I/O on FPGA

ß à

Link Model

64b 64b

6400 tokens

Page 20: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

20

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR

Switch

- 200 Gb/s, 2us

links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz

(netw)

Step 4: Simulating a 32 node rack

Cost: $2.60 per hour (spot)

$13.20 per hour (on-demand)

Page 21: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

21

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR

Switch

- 200 Gb/s, 2us

links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz

(netw)

Step 4: Simulating a 32 node rack

Page 22: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

22

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System- 256 Server Blades- 1024 Cores- 4 TB DDR3- 8 ToRs, 1 Aggr- 200 Gb/s, 2us linksResource Util.- 64 FPGAs =- 8x f1.16xlarge- 1x m4.16xlargeSim Rate- ~9 MHz (netw)

Step 5: Simulating a 256 node “aggregation pod”

Page 23: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

23

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System- 256 Server Blades- 1024 Cores- 4 TB DDR3- 8 ToRs, 1 Aggr- 200 Gb/s, 2us linksResource Util.- 64 FPGAs =- 8x f1.16xlarge- 1x m4.16xlargeSim Rate- ~9 MHz (netw)

Step 5: Simulating a 256 node “aggregation pod”

Page 24: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

24

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Modeled System- 1024 Servers- 4096 Cores- 16 TB DDR3- 32 ToRs, 4 Aggr, 1 Root- 200 Gb/s, 2us linksResource Util.- 256 FPGAs =- 32x f1.16xlarge- 5x m4.16xlargeSim Rate- ~6.6 MHz (netw)

Step 6: Simulating a 1024 node datacenter

Page 25: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Experimenting on a 1024 Node Datacenter

25

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

50th%-ile (us)

Cross-ToR 79.3Cross-aggregation 87.1

512 Memcached

Servers

512 Mutilate Clients

Page 26: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Experimenting on a 1024 Node Datacenter

26

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

50th%-ile (us)

Cross-ToR 79.3Cross-aggregation 87.1

512 Memcached

Servers

512 Mutilate Clients

Page 27: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Reproducing tail latency effects from deployed clusters

• Leverich and Kozyrakis show effects of thread-imbalance in memcached in EuroSys ’14 [3]

27

TileLink2 On-Chip Interconnect

Rocket Core

L1I L1D

Accel

Rocket Core

L1I L1D

Accel

Rocket Core

L1I L1D

Accel

Rocket Core

L1I L1D

Accel

memcachedthreads

1 2 3 4

No thread imbalance

Page 28: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Reproducing tail latency effects from deployed clusters

• Leverich and Kozyrakis show effects of thread-imbalance in memcached in EuroSys ’14 [3]

28

TileLink2 On-Chip Interconnect

Rocket Core

L1I L1D

Accel

Rocket Core

L1I L1D

Accel

Rocket Core

L1I L1D

Accel

Rocket Core

L1I L1D

Accel

memcachedthreads

1 2 3 4

?

5From [3], under thread imbalance, we expect:

1) Median latency to be unchanged

2) Tail latency to increase drastically

Thread imbalance

Page 29: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Reproducing tail latency effects from deployed clusters

• Let’s run a similar experiment on an 8 node cluster in FireSim:

29

50th

Page 30: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Reproducing tail latency effects from deployed clusters

• Let’s run a similar experiment on an 8 node cluster in FireSim:

30

50th95th

Page 31: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Open-source: Not just datacenter simulation• An “easy” button for fast, FPGA-

accelerated full-system simulation• One-click: Parallel FPGA builds, Simulation

run/result collection, building target software• Scales to a variety of use cases:

• Networked (performance depends on scale)• Non-networked (150+ MHz), limited by your budget

• firesim command line program• Like docker or vagrant, but for FPGA sims• User doesn’t need to care about distributed magic

happening behind the scenes

31FireSim Developer Environment

Page 32: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Open-source: Not just datacenter simulation• Scripts can call firesim to fully

automate distributed FPGA sim• Reproducibility: included scripts to

reproduce ISCA 2018 results• e.g. scripts to automatically run

SPECInt2017 reference inputs in ≈1 day• Many others

• 91+ pages of documentation: https://docs.fires.im• AWS provides grants for

researchers: https://aws.amazon.com/grants/

32

$ cd fsim/deploy/workloads$ ./run-all.sh

Page 33: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Wrapping Up

• We can prototype thousand-nodedatacenters built on arbitrary RTL

+ Mix software models when desired

• Simulation is automatically built and deployed• Automatically deploy real

workloads and collect results• Open-source, runs on Amazon EC2

F1, no capex

33

SoC RTL

Other RTL

Network Topology

Automatically deployed, high-performance,

distributed simulation

SW Models

Full Work-load

13

Modeled SystemResource Util

Step N: Title (placeholder slide)

Root Switch

Aggregation Pod

Aggregation PodAggregation Pod

Aggregation Switch

Rack

Rack Rack RackRack

Rack RackFPGA

(4 Sims)

FPGA (4 Sims)

Host Instance CPU: ToR Switch Model

DRAM DRAM

DRAM DRAM

Server Blade

SimulaIon

Server Blade

Simulation

Server BladeSim.

Server Blade

Simulation

Rocket Core

Rocket Core

Rocket Core

Rocket Core

L1I

L1D

L1I

L1D

L1I

L1D

L1I

L1D

L2

NIC

Other Peripherals

DRA

M M

odel

NIC Sim Endpoint

Other Periph. Sim Endpoints

FPGA Fabric

PCIe to Host

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

FPGA (4 Sims)

Page 34: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Now open-sourced!https://fires.imhttps://github.com/firesim

@firesimproject

[email protected] information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849, DARPA Award Number HR0011-12-2-0016, RISE Lab sponsor Amazon Web Services, and ADEPT/ASPIRE Lab industrial sponsors and affiliates Intel, HP, Huawei, NVIDIA, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or of the industrial sponsors.

Page 35: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. OSDI'16 [2] Y. Lee et al., "An Agile Approach to Building RISC-V Microprocessors," in IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.[3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15[5] Tan, Z., Waterman, A., Cook, H., Bird, S., Asanović, K., & Patterson, D. A case for FAME: FPGA architecture model execution. ISCA ’10[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach and Krste Asanovic. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA ‘16[8] http://ramp.eecs.berkeley.edu/

35

Page 36: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Backup Slides/Common Questions

36

Page 37: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

New! Beta support for BOOM, an OoO Core

• A superscalar out-of-order RV64G core• Highly parameterizable

• Beta support for BOOM is now available on the rc-bump-may branch• On FPGA, currently boots Linux to userspace, then

hangs• A target-RTL issue, can be reproduced in VCS without

any FireSim simulation shims• Working with BOOM devs to solve this

37

Page 38: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

FPGA Utilization

38

• “Supernode” config – 4 quad-core nodes• Available on FPGA: 1,182,000 LUTs• Top-level consumed (w/shell):

803,462 LUTs = ~68%• firesim_top consumed: 569921

LUTs = ~48%

Page 39: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Comparing cost of Cloud FPGAs: SPEC17 Run

• f1.2xlarge spot market: 49c per hour per FPGA• SPEC17 intrate reference inputs – 10 workloads – 177.6 fpga-sim

machine-hours• Longest individual benchmark: omnetpp @ 27.3 hours

• $87.024 for a SPEC17 Intrate run w/reference inputs, in 27.3 hours• Roughly 1 day• ~ 1/100 – 1/500 the cost of the FPGA to run on the cloud

• Purchase one FPGA:• Pay 10k - 50k per FPGA (+ops/cooling)• Each run takes 177.6 fpga-sim hours – roughly 1 week

39

Page 40: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Or – compare purchasing one FPGA

• Let’s say $10k for one of the FPGAs (conservative)

• 49c per hour on F1

• Break even if you run an F1 instance continuously for 2.33 years

• What if a grad student works 16 hours a day, 7 days a week? 3.5 years

• What about 16 hours a day, 5 days a week? 4.9 years

• Ignoring all other benefits of cloud FPGAs…

40

Page 41: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

Now open-sourced!https://fires.imhttps://github.com/firesim

@firesimproject

[email protected] information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849, DARPA Award Number HR0011-12-2-0016, RISE Lab sponsor Amazon Web Services, and ADEPT/ASPIRE Lab industrial sponsors and affiliates Intel, HP, Huawei, NVIDIA, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or of the industrial sponsors.

Page 42: FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System ... · FireSim FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud Sagar Karandikar, Howard Mao, Donggyu

References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. OSDI'16 [2] Y. Lee et al., "An Agile Approach to Building RISC-V Microprocessors," in IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.[3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15[5] Tan, Z., Waterman, A., Cook, H., Bird, S., Asanović, K., & Patterson, D. A case for FAME: FPGA architecture model execution. ISCA ’10[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach and Krste Asanovic. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA ‘16[8] http://ramp.eecs.berkeley.edu/

42