scalable fpga-accelerated cycle-accurate hardware
TRANSCRIPT
Scalable FPGA-accelerated Cycle-Accurate Hardware
Simulation in the Cloud
Introduction
RISC-V Summit 2018 Tutorial
Speakers: Sagar Karandikar, David Biancolin, Alon Amid
https://fires.im
@firesimproject
What is this tutorial about?
• How to get started with FireSim • We’re going to keep things very practical today, treat this as a
tutorial • For more of the research background, see our papers/previous
talks
• We’re going to cram in a lot of info, but feel free to interrupt us with questions • We’ll put up the full slide deck online anyway
2
Today’s Agenda
• Intro, what is FireSim capable of? • (Sagar, 10 mins)
• Building/Deploying FireSim Simulations • (Sagar, 15 mins)
• Using FireSim to simulate your own custom HW • (David, 15 mins)
• Testing and Debugging your design in FireSim • (Alon, 15 mins)
• Community/Contributing/Conclusion • (David, 10 mins)
3
What can FireSim be used for?
• Evaluating a custom hardware design • For example, building a custom accelerator and running real workloads on it • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the
Hwacha project (see RV-Summit talk tomorrow)
• Debugging a Chisel design at FPGA-speeds • e.g. FireSim Debugging Docs and DESSERT from FPL 2018
• Rapidly Customizing/Evaluating RISC-V cores, • For example, Rocket Chip and BOOM, provided as default designs in FireSim. • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on
10 cloud-FPGAs in ~1 day • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018
• Simulating thousand-node datacenter-scale systems with full control over the hardware • e.g. FireSim ISCA 2018 Paper • Implicitly includes the previous use case
4
What can FireSim be used for?
• Evaluating a custom hardware design • For example, building a custom accelerator and running real workloads on it • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the
Hwacha project (see RV-Summit talk tomorrow)
• Debugging a Chisel design at FPGA-speeds • e.g. FireSim Debugging Docs and DESSERT from FPL 2018
• Rapidly Customizing/Evaluating RISC-V cores, • For example, Rocket Chip and BOOM, provided as default designs in FireSim. • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on
10 cloud-FPGAs in ~1 day • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018
• Simulating thousand-node datacenter-scale systems with full control over the hardware • e.g. FireSim ISCA 2018 Paper • Implicitly includes the previous use case
5
How does FireSim simulate a datacenter?
• Model hardware at scale, cycle-accurately: • CPUs down to microarchitecture (automatically transformed from Chisel RTL)
• Fast networks, switches (SW models)
• Novel accelerators (also transformed from Chisel)
• Run real software: • Real OS, networking stack (Linux)
• Real frameworks/applications (not microbenchmarks)
• Be productive/usable: • Run on a commodity platform (Amazon EC2 F1)
• Want to encourage collaboration between systems, architecture: real HW/SW co-design research
6
Why do we want an FPGA-based simulator?
7
RTL on FPGA 100 MHz
DRAM
100ns latency
RTL taped-out
1 GHz
DRAM
100ns latency
FPGA Emulation/Prototyping SoC sees 10 cycle DRAM latency
Taped-out Design SoC sees 100 cycle DRAM latency
How do we fix this? FAME-1 Xform in FIRRTL w/MIDAS
8
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
FPGA Fabric
“FAME-1” Transformed RTL Design
Comparing existing HW “simulation” systems
9
• Taping-out excels at: • Modeling reality: “single source of truth” • Scalability
• Hardware-accelerated simulators excel at: • Simulation rate • Ability to run real workloads (as fn. of sim rate)
• Software-based simulators excel at: • Ease-of-use • Ease-of-rebuild (time-to-first-cycle) • Commodity host platform • Cost • Introspection
Prior work on HW-accelerated simulators: DIABLO
• DIABLO, ASPLOS’15 [4]: • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
• Booted Linux, ran apps like Memcached
• An example of the many projects from RAMP
• Need to hand-write abstract RTL models • Harder than writing “tapeout-ready” RTL
• Need to validate against real HW
• Tied to an expensive custom host-platform • $100k+ host platform, custom built
DIABLO Prototype
10
FPGAs in the Cloud
Useful trends throughout the architect’s stack
High-Productivity Hardware Design
Language w/IR
Open, Silicon-Proven SoC Implementations
Open ISA
11
FireSim at a high-level
Server Simulations • Or your own custom hardware
• Good fit for the FPGA
• We have tapeout-proven RTL: automatically FAME-1 transform
Network simulation • Or your own custom SW models
• Little parallelism in switch models (e.g. a thread per port)
• Need to coordinate all of our distributed server simulations
• So use CPUs + host network
12
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
Server Simulation(s)
CPU
Host PCIe
f1.16xlarge
Ho
st E
ther
net
(EC
2 N
etw
ork
)
Server Simulations
Swit
ch M
od
el
Now, let’s build a datacenter-scale FireSim simulation!
13
14
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.
- < ¼ of an FPGA
Sim Rate
- N/A
Step 1: Server SoC in RTL
15
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
Resource Util.
- < ¼ of an FPGA
Sim Rate
- N/A
Step 1: Server SoC in RTL
16
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate
- ~150 MHz
- ~40 MHz (netw)
Step 2: FPGA Simulation of one server blade
17
Modeled System
- 4x RISC-V Rocket Cores @ 3.2 GHz
- 16K I/D L1$
- 256K Shared L2$
- 200 Gb/s Eth. NIC
- 16 GB DDR3
Resource Util.
- < ¼ of an FPGA
- ¼ Mem Chans
Sim Rate
- ~150 MHz
- ~40 MHz (netw)
Step 2: FPGA Simulation of one server blade
18
Modeled System
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate
- ~14.3 MHz (netw)
Step 3: FPGA Simulation of 4 server blades
Cost: $0.49 per hour (spot) $1.65 per hour (on-demand)
19
Modeled System
- 4 Server Blades
- 16 Cores
- 64 GB DDR3
Resource Util.
- < 1 FPGA
- 4/4 Mem Chans
Sim Rate
- ~14.3 MHz (netw)
Step 3: FPGA Simulation of 4 server blades
20
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs =
- 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Step 4: Simulating a 32 node rack
Cost: $2.60 per hour (spot) $13.20 per hour (on-demand)
21
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs =
- 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Step 4: Simulating a 32 node rack
Cost: $2.60 per hour (spot) $13.20 per hour (on-demand)
22
Modeled System
- 32 Server Blades
- 128 Cores
- 512 GB DDR3
- 32 Port ToR Switch
- 200 Gb/s, 2us links
Resource Util.
- 8 FPGAs =
- 1x f1.16xlarge
Sim Rate
- ~10.7 MHz (netw)
Step 4: Simulating a 32 node rack
23
Modeled System
- 256 Server Blades
- 1024 Cores
- 4 TB DDR3
- 8 ToRs, 1 Aggr
- 200 Gb/s, 2us links
Resource Util.
- 64 FPGAs =
- 8x f1.16xlarge
- 1x m4.16xlarge
Sim Rate
- ~9 MHz (netw)
Step 5: Simulating a 256 node “aggregation pod”
24
Modeled System
- 256 Server Blades
- 1024 Cores
- 4 TB DDR3
- 8 ToRs, 1 Aggr
- 200 Gb/s, 2us links
Resource Util.
- 64 FPGAs =
- 8x f1.16xlarge
- 1x m4.16xlarge
Sim Rate
- ~9 MHz (netw)
Step 5: Simulating a 256 node “aggregation pod”
25
Modeled System
- 1024 Servers
- 4096 Cores
- 16 TB DDR3
- 32 ToRs, 4 Aggr, 1 Root
- 200 Gb/s, 2us links
Resource Util.
- 256 FPGAs =
- 32x f1.16xlarge
- 5x m4.16xlarge
Sim Rate
- ~6.6 MHz (netw)
Step 6: Simulating a 1024 node datacenter
26
Modeled System
- 1024 Servers
- 4096 Cores
- 16 TB DDR3
- 32 ToRs, 4 Aggr, 1 Root
- 200 Gb/s, 2us links
Resource Util.
- 256 FPGAs =
- 32x f1.16xlarge
- 5x m4.16xlarge
Sim Rate
- ~6.6 MHz (netw)
Step 6: Simulating a 1024 node datacenter
Harnesses millions of dollars of FPGAs to simulate 1024 nodes cycle-exactly
with a cycle-accurate network simulation and global synchronization
at a cost-to-user of only 100s of dollars/hour
Open-source: Not just datacenter simulation
• An “easy” button for fast, FPGA-accelerated full-system simulation • Replace network endpoints with your own Chisel
designs
• One-click: Parallel FPGA builds, Simulation run/result collection, building target software
• Scales to a variety of use cases: • Networked (performance depends on scale)
• Non-networked (150+ MHz), limited by your budget
• firesim command line program • Like docker or vagrant, but for FPGA sims
• User doesn’t need to care about distributed magic happening behind the scenes
27 FireSim Developer Environment
Open-source: Not just datacenter simulation
• Scripts can call firesim to fully automate distributed FPGA sim • Reproducibility: included scripts to
reproduce ISCA 2018 results
• e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day
• Many others
• 100+ pages of documentation: https://docs.fires.im
• AWS provides grants for researchers: https://aws.amazon.com/grants/
28
$ cd fsim/deploy/workloads
$ ./run-all.sh
Latest Updates
• Growing ecosystem!
• BOOM (Out-of-order RISC-V core) now available in FireSim
• Projects publicly releasing FireSim images at RISC-V Summit: • Hwacha vector accelerator • Keystone Secure Enclave
• Berkeley IceNet (photonic DC network) modeling
• New debugging features • Auto-ILA: Annotate Chisel and get an
ILA automatically wired-up in FireSim • Tracer widgets: Collect live instruction
traces from FireSim sims • Integrating DESSERT [8]
• Assertion Synthesis now on master
• First Academic User papers: • ISCA ‘18: Maas et. al. “A Hardware
Accelerator for Tracing Garbage Collection” (Berkeley)
• MICRO ‘18: Zhang et. al. “Composable Building Blocks to Open up Processor Design” (MIT)
29
Wrapping Up
• We can prototype scalable-systems built on arbitrary RTL at unprecedented scale
+ Mix software models when desired
• Simulation is automatically built and deployed
• Automatically deploy real workloads and collect results
• Open-source, runs on Amazon EC2 F1, no capex
30
SoC RTL
Other RTL
Network Topology
Automatically deployed, high-performance,
distributed simulation
SW Models
Full Work-load
Administrivia
• Everything we’re going to show you today is documented in excruciating detail at http://docs.fires.im/
• Use these slides as a high-level view of what’s possible
• We’ll also put the slides/videos online
31
Today’s Agenda
• Intro, what is FireSim capable of? • (Sagar, 10 mins)
• Building/Deploying FireSim Simulations • (Sagar, 15 mins)
• Using FireSim to simulate your own custom HW • (David, 15 mins)
• Testing and Debugging your design in FireSim • (Alon, 15 mins)
• Community/Contributing/Conclusion • (David, 10 mins)
32
☑ ️
Learn More: Web: https://fires.im GitHub: https://github.com/firesim @firesimproject
ISCA’18 Paper: sagark.org/assets/pubs/firesim-isca2018.pdf
The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
Questions?
References
[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. OSDI'16 [2] Y. Lee et al., "An Agile Approach to Building RISC-V Microprocessors," in IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016. [3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys '14 [4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanovic, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15 [5] Tan, Z., Waterman, A., Cook, H., Bird, S., Asanović, K., & Patterson, D. A case for FAME: FPGA architecture model execution. ISCA ’10 [6] Evaluation of RISC-V RTL Designs with FPGA Simulation. Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach and Krste Asanovic. CARRV '17. [7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA ‘16 [8] Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović, “DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles”, FPL 2018
34
Building/Deploying Simulations
FireSim Tutorial
RISC-V Summit 2018 Tutorial
Speaker: Sagar Karandikar
What will we cover?
•Building FireSim FPGA images for a set of targets
•Choosing targets at runtime
•Configuring target software
•Managing EC2 infrastructure for simulations
•Running a simulation
•Customizing workloads: SPEC example
36
Background Terminology
37
“AGFI”: FPGA Bitstream for F1
FPGAs
First-time User Setup
•Won’t discuss this today, but we’ve documented the process of starting with AWS/FireSim from scratch here: http://docs.fires.im/en/latest/Initial-Setup/index.html • For the following walkthrough, we assume that you’ve
setup a Manager instance and cloned firesim into ~/firesim
38
Let’s simulate the following system:
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
39
☑ ️
Using the firesim manager command line
From inside ~/firesim, source sourceme-f1-manager.sh
This properly sets up your environment and puts firesim on your $PATH
This means that we can now call firesim from anywhere on the instance. It will always run from the directory:
your_firesim_clone/deploy/
in this case:
~/firesim/deploy/
40
Configuring the Manager. 4 files in firesim/deploy/
41
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
Setting up the manager to build FPGA images
42
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
Running builds
• Once we’ve configured what we want to build, let’s build it
$ firesim buildafi
• This completely automates the process. For each design, in-parallel: • Launch a build instance (c4.4xlarge) • Run FireSim generator (Chisel/FIRRTL/MIDAS) • Ship infrastructure to build instances, run Vivado FPGA
builds in parallel • Collect results back onto manager instance
• ~/firesim/deploy/results-workload/TIMESTAMP-config/ contains final results + QoR information + dcps to open in Vivado for further manual tuning
• Email you the entry to put into config_hwdb.ini • Terminate the build instance
43
Status - We have FPGA images for the simulator of our design
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
44
☑ ️
☑ ️☑ ️☑ ️
☑ ️
Now, let’s build our target-software
• We need: • bbl + vmlinux image: Berkeley BootLoader + Linux kernel image as payload • Root Filesystem disk image
• FireSim provides two levels of workload automation: • firesim-software repo for automatically building base-images for various
distros (e.g. buildroot or Fedora) that are compatible with Rocket / BOOM cd firesim/sw/firesim-software
./sw-manager.py -c br-disk.json build # simple buildroot distro
This lets us use the linux-uniform.json workload, which we will see later
• firesim manager / workload generation tool can bake custom workloads into rootfses • We’ll get to this later
45
Status
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
46
☑ ️
☑ ️☑ ️☑ ️☑ ️
☑ ️
Setting up the manager’s runtime configuration
47
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
Setting up the manager’s runtime configuration
48
config_build_recipes.ini config_build.ini config_hwdb.ini config_runtime.ini
Launching Simulation Instances
• Once we’ve configured what we want to simulate and what infrastructure we require, let’s launch our simulation (F1) instances
49
$ firesim launchrunfarm
FireSim Manager. Docs: http://docs.fires.im
Running: launchrunfarm
Waiting for instance boots: f1.16xlarges
Waiting for instance boots: m4.16xlarges
Waiting for instance boots: f1.2xlarges
i-0d6c29ac507139163 booted!
The full log of this run is: /home/centos/firesim-new/deploy/logs/2018-05-19--00-19-43-launchrunfarm-B4Q2ROAK.log
Status
Target Design:
• One Rocket Chip-based node (RTL) • 4 Rocket Cores
• 16K L1 I$, D$
• Block Device, UART, Serial Adapter
• No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro
Host Resources:
• One manager instance (c4.4xlarge)
• One F1 instance (f1.2xlarge, single FPGA)
50
☑ ️
☑ ️☑ ️☑ ️☑ ️
☑ ️☑ ️
Distributing Simulation Infrastructure to the Run Farm • Next, we need to setup our run farm instance with simulation infrastructure
$ firesim infrasetup FireSim Manager. Docs: http://docs.fires.im
Running: infrasetup
Building FPGA software driver for FireSimNoNIC-FireSimRocketChipQuadCoreConfig-FireSimDDR3FRFCFSLLC4MBConfig
[172.30.2.254] Executing task 'instance_liveness'
[172.30.2.254] Checking if host instance is up...
[172.30.2.254] Executing task 'infrasetup_node_wrapper'
[172.30.2.254] Copying FPGA simulation infrastructure for slot: 0.
[172.30.2.254] Installing AWS FPGA SDK on remote nodes. Upstream hash: 2fdf23ffad944cb94f98d09eed0f34c220c522fe
[172.30.2.254] Unloading EDMA Driver Kernel Module.
[172.30.2.254] Copying AWS FPGA EDMA driver to remote node.
[172.30.2.254] Clearing FPGA Slot 0.
[172.30.2.254] Checking for Cleared FPGA Slot 0.
[172.30.2.254] Flashing FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.
[172.30.2.254] Checking for Flashed FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.
[172.30.2.254] Loading EDMA Driver Kernel Module.
[172.30.2.254] Starting Vivado hw_server.
[172.30.2.254] Starting Vivado virtual JTAG.
The full log of this run is:
/home/centos/firesim/deploy/logs/2018-11-14--16-42-35-infrasetup-CHLESMNB83XFPZUR.log
51
Let’s run a simulation!
$ firesim runworkload
FireSim Manager. Docs: http://docs.fires.im
Running: runworkload
Creating the directory: /home/centos/firesim/deploy/results-workload/2018-11-14--16-47-17-linux-uniform/
[172.30.2.254] Executing task 'instance_liveness'
[172.30.2.254] Checking if host instance is up...
[172.30.2.254] Executing task 'boot_switch_wrapper'
[172.30.2.254] Executing task 'boot_simulation_wrapper'
[172.30.2.254] Starting FPGA simulation for slot: 0.
52
Monitoring
53
Manually Interacting with simulations
54
$ ssh 172.30.2.254
$ screen –r fsim0
Shutting down simulations/Capturing results
• Frequently, we want to capture results from simulations automatically
• FireSim’s workload system supports this! • More on how to specify what to capture in a few mins
• But let’s continue with our example – it’s configured just to capture each simulated system’s console output, once the simulation is powered off. If we do so, the manager will produce:
...
FireSim Simulation Exited Successfully. See results in:
/home/centos/firesim/deploy/results-workload/2018-11-14--16-52-49-linux-
uniform/
The full log of this run is:
/home/centos/firesim/deploy/logs/2018-11-14--16-52-49-runworkload-
HSMYYMD4JBA6BHT5.log
55
Finally, get rid of our run farm EC2 instances
• Easy!
$ firesim terminaterunfarm
FireSim Manager. Docs: http://docs.fires.im
Running: terminaterunfarm
IMPORTANT!: This will terminate the following instances:
f1.16xlarges
[]
m4.16xlarges
[]
f1.2xlarges
['i-033959e7513fcf928']
Type yes, then press enter, to continue. Otherwise, the operation will be cancelled.
yes
Instances terminated. Please confirm in your AWS Management Console.
The full log of this run is:
/home/centos/firesim/deploy/logs/2018-11-14--17-00-01-terminaterunfarm-RGWY68L5ICAYQTA3.log
56
Custom Workloads: linux-uniform workload
• Previously, we relied on linux-uniform.json as our workload
• These jsons live in firesim/deploy/workloads/
57
Let’s look at a more interesting example: SPEC
58
• JSONs are used for two things: • Building rootfs/binary
combos automatically, with benchmark infrastructure built-in
• Deploying simulations with the manager (like we saw previously) and knowing which results to collect at the end
Building/Deploying SPEC on Rocket/BOOM
• Building: We don’t have enough time to go into detail here. Look at the spec17-% target in firesim/deploy/workloads/Makefile
• At a high-level, you can just run make spec17-intrate, and you’ll get 10 rootfs/linux image combos that will automatically run the spec benchmarks in parallel
• Deploying: Set your workload to spec17-intrate.json in config_runtime.ini, set the # f1.2xlarges to 10, select the hardware config you want to benchmark, then firesim launchrunfarm/infrasetup/runworkload as usual
• At the end, all your performance results live in one directory on the manager: firesim/deploy/results-workload/TIMESTAMP-spec17-intrate-HASH/
• Instances get automatically terminated one-by-one as benchmarks complete – essentially zero cost to running in parallel on EC2, since you pay by the machine-second anyway
59
Summary
• Don’t fret if you didn’t catch everything, everything we showed you today is documented in excruciating detail at http://docs.fires.im
• We learned how to: • Build FireSim FPGA images for a set of targets
• http://docs.fires.im/en/latest/Building-a-FireSim-AFI.html
• Setup/Launch a simulation, including choosing targets, configuring target software, and managing EC2 infrastructure • http://docs.fires.im/en/latest/Running-Simulations-Tutorial/Running-a-Single-Node-
Simulation.html
• Customize workloads: SPEC example • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/Defining-Custom-
Workloads.html • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/SPEC-2017.html
60
Modifying the Target Design
FireSim Tutorial RISC-V Summit 2018
Speaker: David Biancolin
Modifying the Target Design
In this section:
• Explain how the target is modeled in FireSim
• Discuss how to add and modify each class of model
Code references apply to FireSim v.1.4.0
Relevant FireSim docs section: “Targets”
https://docs.fires.im/en/latest/Advanced-Usage/Generating-Different-Targets.html
62
Important Submodules & Aliases
Submodules:
MIDAS: sim/midas
FIRRTL: sim/firrtl
Firechip: target-design/firechip
Rocket-Chip: target-design/firechip/rocket-chip
Chisel: target-design/firechip/rocket-chip/chisel
Aliases for this presentation:
FSCALA -> sim/src/main/scala
FCPP -> sim/src/main/cc/
MSCALA -> {in MIDAS}/src/main/scala/midas
MCPP -> {in MIDAS}/src/main/cc/
63
• Represent the target as a synchronous-dataflow graph
• FireSim stitches together a larger graph to model a datacenter
Target as a Dataflow Graph
64
SW Model To be hosted on a CPU
Transformed RTL Model Derived from SoC RTL
To be hosted on an FPGA
Abstract RTL Model Cannot be taped out
To be hosted on an FPGA
Expressing the Target Graph in FireSim
• Target graph is implicit, currently not encoded in one place.
• There is only one transformed-RTL model • The CHIRRTL you passed to midas.Compiler
• In Simulation Mapping • 2+ entry queue channels are instantiated on Decoupled IO
• 1-cycle pipe channels are used everywhere else.
• SW and Custom RTL models are bound to IO bundles using midas.core.Endpoint
• Matches on a IO type, generates a model or endpoint
65
Parameterizing the RTL-Transformed Model
• FireSim provides four base Rocket-Chip “cakes” that form the base RTL-transformed model • DESIGN make variable
Two types of parameters:
1. Target: (see TargetConfigs.scala) Configure the generated RTL model • # of Cores, L1 $ Sizes, # of memory channels • TARGET_CONFIG make variable
2. Sim/Platform: (see SimConfigs.scala) Configure MIDAS, SW/Custom RTL model generation
• Configs for endpoints, Type of DRAM model, enable debug features etc… • PLATFORM_CONFIG make variable
66
Swapping out the RTL-Transformed Model
Two common approaches, based on the type of Target:
1. Project-template-based Rocket-chip targets (FireChip) • Point at the FireChip submodule at your fork
NB: MIDAS depends on Rocket-Chip; Rocket-Chip provides Chisel submodule
2. Everything else (ex. Your custom CGRA etc..) • Add scala sources for your target (modify sim/build.sbt)
• Define an App (Generator) that calls midas.Compiler(…)
• Add a target-specific makefrag (ex. sim/src/main/makefrag/firesim/Makefrag)
See the MIDAS examples for an simple demonstration of this.
67
On Abstract RTL & SW Models.
FireSim & MIDAS provide the following models.
Abstract-RTL :
1. DefaultIOModel ($MSCALA/widgets/PeekPokeIO.scala) • Bound to I/O unbound by custom endpoints
2. DRAM & LLC model ($MSCALA/models/dram/MidasMemModel.scala) • Bound to AXI4 slave interfaces
SW-models:
1. NIC
2. SerialIO
3. Block-device
4. UART
These are all FireSim provided: See $FSCALA/endpoints and $FCPP/endpoints
68
Compile-time vs Runtime Configuration
Software models and custom RTL models are configured twice:
1. Compile/Generation time
2. Runtime
On Compile-time configuration:
• RTL-generation for Custom-RTL models (FPGA resynthesis required) • How: using different a PLATFORM_CONFIG; hacking on the Chisel sources
• CPP compilation for SW models (no FPGA resynthesis) • How: changing model & driver CPP sources.
69
Runtime Configuration
• Pass plus args to the simulator • SW models: does what you expect
• Custom-RTL models: driver writes to model’s configuration registers
• Ex. +blkdev_write_latency,
• How: Modify deploy/config_runtime.ini • Change ini-exposed variables
• Specify a customruntimeconfig in deploy/config_hwdb.ini
70
Runtime Configuration – config_runtime.ini
71
HWDB entry: None -> Uses a default runtime.conf emitted at generation time
Runtime.conf -> plusArgs you want to pass to the simulator
Runtime Configuration – config_hwdb.ini
72
Adding a Custom Model – Chisel-side
1. Expose a Record/Bundle in your Target RTL’s IO with a specific type
2. Extend midas.Endpoint to: 1. Match on the correct Chisel type
• Implement Endpoint.matchType(Data)=> Boolean
2. Generate your SW model’s endpoint, or custom RTL-model Implement • Implement Endpoint.widget(Parameters)=> EndpointWidget
3. Add your endpoint to the EndpointMap Field in your PLATFORM_CONFIG
See:
• $FSCALA/endpoints/UARTWidget.scala
• $FSCALA/endpoints/BlockDevWidget.scala
• $FSCALA/firesim/SimConfigs.scala
73
Adding a Custom Model – Driver-side
1. Define an endpoint driver class • Use simif::{read, write, push, pull} to interact with FPGA-hosted endpoint
2. Register the endpoint driver in firesim_top.cc
See:
• $FCPP/endpoints/uart.{cc, h}
• $FCPP/endpoints/blockdev.{cc, h}
• $FCPP/firesim/firesim_top.cc
74
DRAM Timing Models – Generation Time Conf
• Parameterized using the PLATFORM_CONFIG make variable • See $FSCALA/firesim/SimConfigs.scala
• Abstract timing models: • Latency-Bandwidth Pipe • Bank Conflict
• DDR3 Timing models: • First-Come First-Served (FCFS) • First-Ready, First-Come First-Served (FR-FCFS)
• All timing models can be composed with a LLC model.
• To be presented at FPGA2019
75
DRAM Timing Models – Runtime Conf
• DDR timing models expose >30 runtime arguments • Many illegal settings, need to validate against generated model instance
• Ex. Setting that would overflow a timing register.
• Many invalid settings, need to validate against DDR3 part database • Ex. A non-standard JEDEC DDR3 timing, timings are density dependent
• A default runtime-configuration is emitted at generation time • See generated-src/f1/${TARGET_TUPLE}/runtime.conf
• To generate a custom runtime-config: • In sim/: make conf
76
DRAM Timing Models – custom runtime.conf
77
Debugging, Testing, and Profiling
FireSim Tutorial RISC-V Summit 2018 Speaker: Alon Amid
Agenda
• FireSim Debugging Using Software Simulation
• FireSim Debugging Using Integrated Logic Analyzers
• Advanced FPGA-Based Debugging and Profiling Features
• The FireSim Vision for Debugging and Profiling
79
Debugging Using Software Simulation
80
Modifying internal simulated target hardware, no new external endpoints
Target-Level SW Simulation
What Am I
doing?
Simulator-Level SW Simulation
Adding/Modifying new interfaces and endpoints,
modifying simulation models
Midas-Level SW Simulation
FPGA-Level SW Simulation
My FireSim Simulation Is Not Working
Debugging Using Software Simulation
81
Target-Level Simulation
• Software Simulation
• Target Design Untransformed
• No Host-FPGA interfaces
MIDAS-Level Simulation
• Software Simulation
• Target Design Transformed by MIDAS
• Host-FPGA interfaces/shell emulated using abstract models
FPGA-Level Simulation
• Software Simulation
• Target Design Transformed by MIDAS
• Host-FPGA interfaces/shell simulated by the FPGA tools
82
Remember this slide from
Sagar’s presentation?
Debugging Using Software Simulation
83
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
“FAME-1” Transformed RTL Design
Target-Level SW Simulation
FPGA Fabric
Debugging Using Software Simulation
84
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
“FAME-1” Transformed RTL Design
MIDAS-Level SW Simulation
FPGA Fabric
Abstract Model
Target-Level SW Simulation
Debugging Using Software Simulation
85
RTL Design
Physical DRAM
100ns
latency
<- Resp Queue
Req Queue ->
DRAM Model
100
cycle latency
Mem Channel
“FAME-1” Transformed RTL Design
MIDAS-Level SW Simulation
FPGA Fabric
Abstract Model
Target-Level SW Simulation
FPGA-Level SW Simulation
Debugging Using Software Simulation
Debugging Using Software Simulation
86
Level Waves VCS Verilator XSIM
Target Off ~5 kHz ~5 kHz N/A
Target On ~1 kHz ~5 kHz N/A
MIDAS Off ~4 kHz ~2 kHz N/A
MIDAS On ~3 kHz ~1 kHz N/A
FPGA On ~2 Hz N/A ~0.5 Hz
• Target-Level Simulation In firesim/target-design/firechip/vsim
$ make DESIGN=<YourDesign> CONFIG=<YourConfig> debug
$ ./simv-<YourDesign>-<YourConfig>-debug +max-cycles=50000 +vcdplusfile=<WaveformFileName>.vpd ../tests/<InputTest>.riscv
• MIDAS-level Simulation In firesim/sim
$ make <verilator|vcs>-debug
$ make EMUL=<verilator|vcs> DESIGN=FireSimNoNIC run-asm-test-debug
• FPGA-Level Simluation In firesim/sim
$ make xsim
$ make xsim-dut <VCS=1> &
$ make run-xsim SIM_BINARY=<PATH/TO/DRIVER/BINARY/FOR/TARGET/TO/RUN>
**FPGA-level simulation currently does not support DMA_PCIS (which is used for the NIC interface)
87
Debugging Using Software Simulation
88
“Everything looks OK in SW simulation, but there is still a bug somewhere”
“My bug only appears after hours of running Linux on my simulated HW”
When SW Simulation Debugging is Not Enough…
Debugging Using Integrated Logic Analyzers
Integrated Logic Analyzers (ILAs)
• Common debugging feature provided by FPGA vendors
• Continuous recording of a sampling window of up to 1024 cycles. • Stores recorded samples in BRAM.
• Realtime trigger-based sampled output of probed signals • Multiple probes ports can be combined to a single trigger
• Trigger can be in any location within the sampling window
• On the AWS F1-Instances, ILA interfaced through a debug-bridge and server
89
From: aws-fpga cl_hello_world example
Debugging Using Integrated Logic Analyzers
AutoILA – Automation of ILA integration with FireSim
• Annotate requested signals and bundles in the Chisel source code
• Automatic configuration and generation of the ILA IP in the FPGA toolchain
• Automatic expansion and wiring of annotated signals to the top level of a design using a FIRRTL transform.
• Remote waveform and trigger setup from the manager instance
90
Debugging using Integrated Logic Analyzers
91
Pros:
• No emulated parts – what you see is what’s running on the FPGA
• FPGA simulation speed - O(MHz) compared to O(KHz) in software simulation
• Real-time trigger-based
Cons:
• Requires a full build to modify visible signals/triggers (takes several hours)
• Limited sampling window size
• Consumes FPGA resources
Advanced FPGA-Based Debugging Features
• FPGA-based simulation enables high simulation speed, which enables advanced debugging and profiling tools.
• Reach “deep” in simulation time, and obtain large levels of coverage and data
• Examples: • TracerV
• Synthesizable Assertions
92
Simulated Time
SW Simulation
FPGA-based Simulation
TracerV
93
• Out-of-band full instruction execution trace
• MIDAS widget connected to target trace ports
• By default, large amount of info wired out of Rocket/BOOM, per-hart, per-cycle: • Instruction Address • Instruction • Privilege Level • Exception/Interrupt Status, Cause
• TracerV can rapidly generate several TB of data.
TracerV
• Out-of-Band software-level debugging and profiling: Profiling does not perturb execution
• Useful for kernel and hypervisor level cycle-sensitive profiling
• Examples: • Co-Optimization of NIC and Network Driver • Keystone Secure Enclave Project • High-performance hardware-specific code
(supercomputing?)
• Requires large-scale analytics for insightful profiling and optimization.
94
TracerV
95
Pros:
• Out-of-Band (no impact on workload execution)
• SW-centric method
• Large amounts of data
Cons:
• Slower simulation performance (40 MHz)
• No HW visibility
• Large amounts of data
Synthesizable Assertions
• Assertions – rapid error checking embedded in HW source code. • Commonly used in SW Simulation
• Halts the simulation upon a triggered assertion. Represented as a “stop” statement in FIRRTL
• By defaults, emitted as non-synthesizable SV functions ($fatal)
96
From: Trillion-Cycle Bug Finding Using FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, Krste Asanović. ADEPT Winter Retreat 2018
From: BROOM: An open-source Out-of-Order processor with resilient low-voltage operation in 28nm CMOS, Christopher Celio, Pi-Feng Chiu, Krste Asanovic, David Patterson and Borivoje Nikolic. HotChip 30, 2018
Synthesizable Assertions
• Synthesizable Assertions on FPGA • Transform FIRRTL stop statements into synthesizable logic
• Insert combinational logic and signals for the stop condition arguments
• Insert encodings for each assertion (for matching error statements in SW)
• Wire the assertion logic output to the Top-Level
• Generate timing tokens for cycle-exact assertions
• Assertion checker records the cycle and halts simulation when assertion is triggered
97
Synthesizable Assertions
• Previously a research feature presented in DESSERT [1]
• Helped Identify BOOM bugs trillions of cycles into execution
• Integrated into FireSim in the latest release.
98
[1] Kim, D., Celio, C., Karandikar, S., Biancolin, D., Bachrach, J. and Asanovic, K., DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles. The International Conference on Field-Programmable Logic and Applications (FPL), 2018
Synthesizable Assertions
99
Pros:
• FPGA simulation speed
• Real-time trigger-based
• Consumes small amount of FPGA resources (compared to ILA)
• Key signals have pre-written assertions in re-usable components/libraries
Cons:
• Low visibility: No waveform/state
• Assertions are best added while writing source RTL rather than during “investigative” debugging
The FireSim Vision: Speed and Visibility
• High-performance simulation
• Full application workloads
• Tunable visibility & resolution
• Unique data-based insights
100
Speed and Visibility – How do we get there?
• Integration of additional DESSERT features • Hardware state snapshot extraction from FPGA to software simulation using
arbitrary software triggers.
• Easy-to-use information transfer between execution traces and software hooks.
• Data-Processing pipeline for insights from large-scale traces • O(GB)-O(TB) out-of-band logs require big-data analysis methods for insights
(Golden model comparison is one potential method).
• Potential unique insights: can collect globally-cycle-accurate out-of-band instruction traces from a networked datacenter simulation.
101
Summary
• Debugging Using Software Simulation (docs) • Target-Level
• MIDAS-Level
• FPGA-Level
• Debugging Using Integrated Logic Analyzers (docs)
• Advanced Debugging and Profiling Features • TracerV (docs)
• Assertion Synthesis (docs)
• FireSim Debugging and Profiling Future Vision
102
Project Management, Contributing & Conclusion
FireSim Tutorial RISC-V Summit 2018
Speaker: David Biancolin
Development Model – Branches
• FireSim and its key submodules (FireChip, MIDAS, aws-fpga) have two write-controlled branches
• master • Stable
• Contains the most recent release of FireSim
• Can replicate ISCA2018 paper results
• dev • Main development branch
• Base branch for feature PRs and non-critical bug fixes
• Every merge commit has globally shared, regenerated AGFIs
104
Development Model – Merges & Releases
• To merge a feature branch into dev: • Pass all MIDAS-level tests (in sim/: `make test`)
• Regenerate all AGFIs if RTL changes were made
• Associated submodule PRs should be merged; submodule pointers should point at merge commit on dev
• Changes to dev will be summarized in a “Dev-tracking PR” • PR description will become the Changelog entry
• Dev will be periodically merged into master and tagged as a release • Changelog will be updated
• Submodule dev branches will also be merged into master
105
How to Get Help
• FireSim and MIDAS adds a bunch of additional complexity on top of Rocket-Chip, Chisel, FIRRTL, Scala… • But there’s a team of us here to help!
• For general questions: • Post on the google groups: https://groups.google.com/forum/#!forum/firesim
• Check out the docs: https://docs.fires.im/en/latest/
• (Nascent) FAQ section: https://docs.fires.im/en/latest/Advanced-Usage/FAQs.html
106
Reporting Issues
• If you think you’ve encountered a bug (in FireSim or MIDAS): • Open an issue on FireSim’s github page
• Try your best to open an issue in the most relevant repo • Please don’t open duplicates across multiple repos!
• If you are unsure, default to FireSim and we’ll be happy to redirect you
107
Contributing
• We welcome PRs!
• Issues are labeled with “Good First Issue”
• Ask on the google groups for suggestions
108
Regarding Future Work
• There are many research projects underway at UCB-BAR using and improving FireSim
• Many will be merged into mainline FireSim
• There will be API breakages -> We apologize in advance
• Watch the changelogs, dev tracking PRs
Our goal: make FireSim the standard open way to do fast full-system simulation of Chisel-designed systems, especially ones based on Rocket-Chip.
Focus on:
• Generality (eg. support multi-clock target designs)
• Robustness (eg. More extensive testing in RTL-simulation; CI; regressions)
• Debuggability (eg. DESSERT features) 109
Two more talks from Berkeley Architecture Research
110
Both talks open-sourcing their work and releasing FireSim images!
Conclusion
Key links: • Website: https://fires.im • Google Groups: https://groups.google.com/forum/#!forum/firesim • Docs: https://docs.fires.im/en/latest/ • Twitter: @firesimproject Acknowledgements: The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
111