TRANSCRIPT
FireSim: FPGA-Accelerated Cycle-Exact
Scale-Out System Simulation in the Public Cloud
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle
Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović
https://fires.im · @firesimproject
The new datacenter hardware environment
• The end of Moore's Law
• Custom silicon in the cloud
• Deeper memory/storage hierarchies, e.g. 3D XPoint, HBM
• Faster networks, e.g. silicon photonics
• New datacenter architectures, e.g. disaggregation [1]
2
Disaggregated Datacenters
[Diagram from Gao et al., OSDI '16]
3
…and custom HW is changing faster than ever
FPGAs: Agile HW Design for ASICs [2]
4
What does our simulator need to do?
• Model hardware at scale:
  • CPUs down to microarchitecture
  • Fast networks, switches
  • Novel accelerators
• Run real software:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
• Be productive/usable:
  • Run on a commodity platform
  • Want to encourage collaboration between systems and architecture communities: real HW/SW co-design
5
Comparing existing HW "simulation" systems
(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator
6
A HW-accelerated DC simulator: DIABLO
• DIABLO, ASPLOS '15 [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like Memcached
  • Part of RAMP collaboration [8]
• Need to hand-write abstract RTL models
  • Harder than writing "tapeout-ready" RTL
  • Need to validate against real HW
• Tied to an expensive custom host platform
  • $100k+ host platform, custom built
[Photo: DIABLO prototype]
7
Comparing existing HW “simulation” systems
8
• Taping out excels at:
  • Modeling reality: "single source of truth"
  • Scalability
• Hardware-accelerated simulators excel at:
  • Simulation rate
  • Ability to run real workloads (as a function of sim rate)
• Software-based simulators excel at:
  • Ease-of-use
  • Ease-of-rebuild (time-to-first-cycle)
  • Commodity host platform
  • Cost
  • Introspection
Useful trends throughout the architect's stack
• FPGAs in the cloud
• High-productivity hardware design language w/ IR
• Open, silicon-proven SoC implementations
• Open ISA
9
FireSim at a high level
• Server simulations
  • Inherent parallelism – lots of gates
  • We have tapeout-proven RTL: automatically FAME-1 transform it
  • Put RTL-derived sims on the FPGAs
• Network simulation
  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + host network
10
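The FAME-1 transform [5] mentioned above decouples target cycles from host cycles: the transformed model advances one target cycle only when a token is available on every input channel and there is space on every output channel. A minimal illustrative sketch in Python follows — this models the idea only, with a hypothetical `Fame1Model` class; FireSim's actual transform operates on the RTL itself.

```python
from collections import deque

class Fame1Model:
    """Host-decoupled model: one target cycle per successful host step."""

    def __init__(self, step_fn, out_capacity=4):
        self.step_fn = step_fn            # simulates one target cycle
        self.inputs = deque()             # input token channel
        self.outputs = deque()            # output token channel
        self.out_capacity = out_capacity
        self.target_cycle = 0

    def host_tick(self):
        """One host cycle: advance target time only if tokens allow it."""
        if not self.inputs or len(self.outputs) >= self.out_capacity:
            return False                  # stall: target time does not advance
        self.outputs.append(self.step_fn(self.inputs.popleft()))
        self.target_cycle += 1
        return True
```

Because target time advances only on successful steps, a model that stalls on the host (e.g. waiting for network tokens) stays cycle-exact relative to every other distributed simulation.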
[Diagram: one f1.16xlarge host — 8 FPGAs, each running server simulations, attached over host PCIe to the host CPU, which runs the switch model; hosts are linked by host Ethernet (EC2 network)]
Now, let’s build a datacenter-scale FireSim simulation!
11
12
13
Step 1: Server SoC in RTL
[Diagram: target datacenter topology — root switch, aggregation switches and pods, racks. Per rack: a host instance CPU runs the ToR switch model, plus 8 FPGAs (4 sims each) with host DRAM. Each server blade simulation on the FPGA fabric: 4 Rocket cores with L1I/L1D caches, a shared L2, NIC and other peripherals, a DRAM model, NIC/peripheral sim endpoints, and PCIe to host]
Modeled System:
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.:
- < ¼ of an FPGA
Sim Rate:
- N/A
Step 2: FPGA Simulation of one server blade
Modeled System:
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.:
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate:
- ~150 MHz
- ~40 MHz (netw)
15
Step 3: FPGA Simulation of 4 server blades
Modeled System:
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.:
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate:
- ~14.3 MHz (netw)
Cost: $0.49 per hour (spot), $1.65 per hour (on-demand)
17
Step 4: Simulating a 32 node rack
Modeled System:
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.:
- 8 FPGAs = 1x f1.16xlarge
Sim Rate:
- ~10.7 MHz (netw)
Cost: $2.60 per hour (spot), $13.20 per hour (on-demand)
Cycle-accurate Network Modeling
• For global cycle-accuracy, send a token on each link for each cycle, in each direction
• Each direction of a link has (link latency in cycles) tokens in flight
  • e.g. 6400 tokens in flight on a link for 2us link latency @ 3.2 GHz
• Each token is (desired bandwidth / clock frequency) bits wide
  • e.g. 200 Gbps / 3.2 GHz ≈ 64-bit token sent per cycle
• Target transport agnostic (we provide Ethernet switch models)
• Host transport agnostic (shared mem, sockets, PCIe)
• Can "downgrade" to a zero-perf-impact functional network model (150+ MHz)
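The two token parameters above follow directly from the link specification; a quick sketch of the arithmetic, using the figures quoted in the bullets (the function names are illustrative, not FireSim API):

```python
def token_width_bits(link_bw_gbps, target_clock_ghz):
    """Payload bits per token: one token per target cycle must carry
    (bandwidth / clock frequency) bits."""
    return link_bw_gbps / target_clock_ghz

def tokens_in_flight(link_latency_us, target_clock_ghz):
    """Tokens buffered per link direction: the link latency in target cycles."""
    return int(link_latency_us * 1_000 * target_clock_ghz)  # us -> target cycles

# 200 Gb/s link, 3.2 GHz target clock, 2 us link latency
print(token_width_bits(200, 3.2))  # 62.5 bits -> rounded up to a 64-bit token
print(tokens_in_flight(2, 3.2))    # 6400 tokens per direction
```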
19
[Diagram: link model connecting the NIC top-level I/O on the FPGA to a switch port — 64-bit tokens in each direction, 6400 tokens in flight]
20
Step 5: Simulating a 256 node "aggregation pod"
Modeled System:
- 256 Server Blades
- 1024 Cores
- 4 TB DDR3
- 8 ToRs, 1 Aggr
- 200 Gb/s, 2us links
Resource Util.:
- 64 FPGAs = 8x f1.16xlarge + 1x m4.16xlarge
Sim Rate:
- ~9 MHz (netw)
23
Step 6: Simulating a 1024 node datacenter
Modeled System:
- 1024 Servers
- 4096 Cores
- 16 TB DDR3
- 32 ToRs, 4 Aggr, 1 Root
- 200 Gb/s, 2us links
Resource Util.:
- 256 FPGAs = 32x f1.16xlarge + 5x m4.16xlarge
Sim Rate:
- ~6.6 MHz (netw)
Experimenting on a 1024 Node Datacenter
25
[Setup: 512 Memcached servers and 512 Mutilate clients distributed across the simulated 1024-node datacenter]
50th %-ile latency (us):
- Cross-ToR: 79.3
- Cross-aggregation: 87.1
Reproducing tail latency effects from deployed clusters
• Leverich and Kozyrakis show the effects of thread imbalance in memcached in EuroSys '14 [3]
27
[Diagram: 4 Rocket cores (each with L1I/L1D and an accelerator slot) on a TileLink2 on-chip interconnect; memcached threads 1–4 pinned one per core: no thread imbalance]
Reproducing tail latency effects from deployed clusters
28
[Diagram: same SoC, but a 5th memcached thread must share a core: thread imbalance]
From [3], under thread imbalance, we expect:
1) Median latency to be unchanged
2) Tail latency to increase drastically
Reproducing tail latency effects from deployed clusters
• Let's run a similar experiment on an 8 node cluster in FireSim:
[Charts: 50th and 95th percentile request latencies]
30
Open-source: Not just datacenter simulation
• An "easy" button for fast, FPGA-accelerated full-system simulation
• One-click: parallel FPGA builds, simulation run/result collection, building target software
• Scales to a variety of use cases:
  • Networked (performance depends on scale)
  • Non-networked (150+ MHz), limited only by your budget
• firesim command-line program
  • Like docker or vagrant, but for FPGA sims
  • User doesn't need to care about the distributed magic happening behind the scenes
31
[Screenshot: FireSim Developer Environment]
Open-source: Not just datacenter simulation
• Scripts can call firesim to fully automate distributed FPGA sim
• Reproducibility: included scripts to reproduce ISCA 2018 results
  • e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day
  • Many others
• 91+ pages of documentation: https://docs.fires.im
• AWS provides grants for researchers: https://aws.amazon.com/grants/
32
$ cd fsim/deploy/workloads
$ ./run-all.sh
Wrapping Up
• We can prototype thousand-node datacenters built on arbitrary RTL
  + Mix software models when desired
• Simulation is automatically built and deployed
• Automatically deploy real workloads and collect results
• Open-source, runs on Amazon EC2 F1, no capex
33
[Diagram: SoC RTL + Other RTL + Network Topology + SW Models + Full Workload → automatically deployed, high-performance, distributed simulation]
Now open-sourced!
https://fires.im
https://github.com/firesim
@firesimproject
[email protected]
The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849, DARPA Award Number HR0011-12-2-0016, RISE Lab sponsor Amazon Web Services, and ADEPT/ASPIRE Lab industrial sponsors and affiliates Intel, HP, Huawei, NVIDIA, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or of the industrial sponsors.
References
[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. OSDI '16.
[2] Y. Lee et al. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. EuroSys '14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanović, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15.
[5] Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. A Case for FAME: FPGA Architecture Model Execution. ISCA '10.
[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.
[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. ISCA '16.
[8] http://ramp.eecs.berkeley.edu/
35
Backup Slides/Common Questions
36
New! Beta support for BOOM, an OoO core
• A superscalar out-of-order RV64G core
• Highly parameterizable
• Beta support for BOOM is now available on the rc-bump-may branch
  • On FPGA, currently boots Linux to userspace, then hangs
  • A target-RTL issue; can be reproduced in VCS without any FireSim simulation shims
  • Working with BOOM devs to solve this
37
FPGA Utilization
38
• "Supernode" config – 4 quad-core nodes
• Available on FPGA: 1,182,000 LUTs
• Top-level consumed (w/shell): 803,462 LUTs ≈ 68%
• firesim_top consumed: 569,921 LUTs ≈ 48%
Comparing cost of cloud FPGAs: SPEC17 run
• f1.2xlarge spot market: 49¢ per hour per FPGA
• SPEC17 intrate reference inputs – 10 workloads – 177.6 FPGA-sim machine-hours
  • Longest individual benchmark: omnetpp @ 27.3 hours
• $87.02 for a SPEC17 intrate run w/reference inputs, done in 27.3 hours
  • Roughly 1 day
  • Running on the cloud costs ~1/100 – 1/500 the purchase price of the FPGA
• Purchase one FPGA instead:
  • Pay $10k – $50k per FPGA (+ops/cooling)
  • Each run takes 177.6 FPGA-sim hours – roughly 1 week
39
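The cost figure above is straightforward arithmetic; a tiny sketch using the rates and hours quoted on this slide (the function name is illustrative):

```python
def cloud_run_cost(fpga_sim_hours, hourly_rate_usd):
    """Total spot-market cost of a simulation campaign on rented FPGAs."""
    return fpga_sim_hours * hourly_rate_usd

# SPEC17 intrate, reference inputs: 177.6 FPGA-sim machine-hours at $0.49/hr
cost = cloud_run_cost(177.6, 0.49)
print(round(cost, 2))  # ≈ 87.02 dollars for the full run
```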
Or – compare purchasing one FPGA
• Let’s say $10k for one of the FPGAs (conservative)
• 49c per hour on F1
• Break even if you run an F1 instance continuously for 2.33 years
• What if a grad student works 16 hours a day, 7 days a week? 3.5 years
• What about 16 hours a day, 5 days a week? 4.9 years
• Ignoring all other benefits of cloud FPGAs…
40
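The break-even figures above can be reproduced with a few lines; a sketch assuming the slide's $10k purchase price and 49¢/hr spot rate (usage hours per year are the assumption behind each scenario):

```python
def breakeven_years(purchase_usd, hourly_rate_usd, usage_hours_per_year):
    """Years of cloud-FPGA rental before rental cost exceeds purchase cost."""
    rental_hours = purchase_usd / hourly_rate_usd  # hours the purchase buys
    return rental_hours / usage_hours_per_year

print(round(breakeven_years(10_000, 0.49, 24 * 365), 2))   # continuous: ~2.33
print(round(breakeven_years(10_000, 0.49, 16 * 365), 1))   # 16 h/day, 7 d/wk: ~3.5
print(round(breakeven_years(10_000, 0.49, 16 * 5 * 52), 1))  # 16 h/day, 5 d/wk: ~4.9
```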