Scalable FPGA-accelerated Cycle-Accurate Hardware Simulation in the Cloud

Introduction

RISC-V Summit 2018 Tutorial

Speakers: Sagar Karandikar, David Biancolin, Alon Amid

https://fires.im

@firesimproject

What is this tutorial about?

• How to get started with FireSim
  • We’re going to keep things very practical today; treat this as a tutorial
  • For more of the research background, see our papers/previous talks
• We’re going to cram in a lot of info, but feel free to interrupt us with questions
  • We’ll put up the full slide deck online anyway

Today’s Agenda

• Intro, what is FireSim capable of? (Sagar, 10 mins)
• Building/Deploying FireSim Simulations (Sagar, 15 mins)
• Using FireSim to simulate your own custom HW (David, 15 mins)
• Testing and Debugging your design in FireSim (Alon, 15 mins)
• Community/Contributing/Conclusion (David, 10 mins)

What can FireSim be used for?

• Evaluating a custom hardware design
  • For example, building a custom accelerator and running real workloads on it
  • e.g. “A Hardware Accelerator for Tracing Garbage Collection” from ISCA 2018 and the Hwacha project (see RV-Summit talk tomorrow)
• Debugging a Chisel design at FPGA speeds
  • e.g. FireSim Debugging Docs and DESSERT from FPL 2018
• Rapidly customizing/evaluating RISC-V cores
  • For example, Rocket Chip and BOOM, provided as default designs in FireSim
  • FireSim can run all of SPECInt2017 with reference inputs on Rocket Chip in parallel on 10 cloud FPGAs in ~1 day
  • e.g. “Composable Building Blocks to Open up Processor Design” from MICRO 2018
• Simulating thousand-node datacenter-scale systems with full control over the hardware
  • e.g. FireSim ISCA 2018 Paper
  • Implicitly includes the previous use case


How does FireSim simulate a datacenter?

• Model hardware at scale, cycle-accurately:
  • CPUs down to microarchitecture (automatically transformed from Chisel RTL)
  • Fast networks, switches (SW models)
  • Novel accelerators (also transformed from Chisel)
• Run real software:
  • Real OS, networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
• Be productive/usable:
  • Run on a commodity platform (Amazon EC2 F1)
• Want to encourage collaboration between systems, architecture: real HW/SW co-design research

Why do we want an FPGA-based simulator?

[Figure: the same RTL design on an FPGA and taped out]

• FPGA emulation/prototyping: RTL on FPGA at 100 MHz, DRAM with 100 ns latency. The SoC sees a 10-cycle DRAM latency.
• Taped-out design: RTL at 1 GHz, DRAM with 100 ns latency. The SoC sees a 100-cycle DRAM latency.

How do we fix this? FAME-1 transform in FIRRTL with MIDAS
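The mismatch on this slide is plain unit conversion. A minimal sketch of the arithmetic, using only the clock rates and the 100 ns latency quoted above (the function name is made up for illustration):

```python
def dram_latency_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Number of clock cycles the SoC waits for a fixed wall-clock DRAM latency."""
    cycle_time_ns = 1000.0 / clock_mhz  # one cycle at clock_mhz, in nanoseconds
    return latency_ns / cycle_time_ns


fpga_cycles = dram_latency_cycles(100, 100)       # RTL on the FPGA at 100 MHz
tapeout_cycles = dram_latency_cycles(100, 1000)   # taped-out RTL at 1 GHz
print(fpga_cycles, tapeout_cycles)
```

A naive FPGA prototype therefore makes DRAM look 10x faster, relative to the core, than it would be on the taped-out chip.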

[Figure: “FAME-1” transformed RTL design on the FPGA fabric. Labels: RTL Design; Req Queue ->; <- Resp Queue; DRAM Model (100 cycle latency); Mem Channel; Physical DRAM (100 ns latency)]
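To illustrate the idea behind the figure, here is a toy sketch (not MIDAS code; every name in it is made up) of a target clock that only advances when the timing model allows it. The DRAM model counts target cycles, and the target stalls if the physical DRAM has not answered by the time the modeled response is due:

```python
TARGET_LATENCY = 100  # target cycles the modeled DRAM should appear to take


def simulate_read(host_latency_cycles: int) -> tuple:
    """Model one DRAM read.

    host_latency_cycles: host (FPGA) cycles the physical DRAM needs.
    Returns (latency seen by the target in target cycles, host cycles spent).
    """
    host_cycle = 0
    target_cycle = 0
    deadline = target_cycle + TARGET_LATENCY  # target cycle when the response is due
    while True:
        host_cycle += 1
        data_ready = host_cycle >= host_latency_cycles
        # Stall: the next target cycle needs the response, but the host DRAM
        # has not delivered it yet, so target time stands still this host cycle.
        if target_cycle + 1 == deadline and not data_ready:
            continue
        target_cycle += 1
        if target_cycle == deadline:
            return target_cycle, host_cycle


for host_latency in (10, 100, 250):
    print(host_latency, simulate_read(host_latency))
```

Whatever the host latency, the target observes the same 100-cycle latency; only the number of host cycles spent changes.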

Comparing existing HW “simulation” systems

• Taping-out excels at:
  • Modeling reality: “single source of truth”
  • Scalability
• Hardware-accelerated simulators excel at:
  • Simulation rate
  • Ability to run real workloads (as fn. of sim rate)
• Software-based simulators excel at:
  • Ease-of-use
  • Ease-of-rebuild (time-to-first-cycle)
  • Commodity host platform
  • Cost
  • Introspection

Prior work on HW-accelerated simulators: DIABLO

• DIABLO, ASPLOS’15 [4]:
  • Simulated 3072 servers, 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like Memcached
  • An example of the many projects from RAMP
• Need to hand-write abstract RTL models
  • Harder than writing “tapeout-ready” RTL
  • Need to validate against real HW
• Tied to an expensive custom host-platform
  • $100k+ host platform, custom built

[Figure: DIABLO Prototype]

Useful trends throughout the architect’s stack

• FPGAs in the Cloud
• High-Productivity Hardware Design Language w/IR
• Open, Silicon-Proven SoC Implementations
• Open ISA

FireSim at a high-level

Server Simulations
• Or your own custom hardware
• Good fit for the FPGA
• We have tapeout-proven RTL: automatically FAME-1 transform

Network simulation
• Or your own custom SW models
• Little parallelism in switch models (e.g. a thread per port)
• Need to coordinate all of our distributed server simulations
• So use CPUs + host network

[Figure: an f1.16xlarge host. Labels: Server Simulation(s); CPU; Host PCIe; Host Ethernet (EC2 Network); Switch Model]

Now, let’s build a datacenter-scale FireSim simulation!


Modeled System

- 4x RISC-V Rocket Cores @ 3.2 GHz

- 16K I/D L1$

- 256K Shared L2$

- 200 Gb/s Eth. NIC

Resource Util.

- < ¼ of an FPGA

Sim Rate

- N/A

Step 1: Server SoC in RTL


Modeled System

- 4x RISC-V Rocket Cores @ 3.2 GHz

- 16K I/D L1$

- 256K Shared L2$

- 200 Gb/s Eth. NIC

- 16 GB DDR3

Resource Util.

- < ¼ of an FPGA

- ¼ Mem Chans

Sim Rate

- ~150 MHz

- ~40 MHz (netw)

Step 2: FPGA Simulation of one server blade


Modeled System

- 4 Server Blades

- 16 Cores

- 64 GB DDR3

Resource Util.

- < 1 FPGA

- 4/4 Mem Chans

Sim Rate

- ~14.3 MHz (netw)

Step 3: FPGA Simulation of 4 server blades

Cost: $0.49 per hour (spot), $1.65 per hour (on-demand)


Modeled System

- 32 Server Blades

- 128 Cores

- 512 GB DDR3

- 32 Port ToR Switch

- 200 Gb/s, 2us links

Resource Util.

- 8 FPGAs =

- 1x f1.16xlarge

Sim Rate

- ~10.7 MHz (netw)

Step 4: Simulating a 32 node rack



Modeled System

- 256 Server Blades

- 1024 Cores

- 4 TB DDR3

- 8 ToRs, 1 Aggr


Resource Util.

- 64 FPGAs =

- 8x f1.16xlarge

- 1x m4.16xlarge

Sim Rate

- ~9 MHz (netw)

Step 5: Simulating a 256 node “aggregation pod”


Modeled System

- 1024 Servers

- 4096 Cores

- 16 TB DDR3

- 32 ToRs, 4 Aggr, 1 Root


Resource Util.

- 256 FPGAs =

- 32x f1.16xlarge

- 5x m4.16xlarge

Sim Rate

- ~6.6 MHz (netw)

Step 6: Simulating a 1024 node datacenter
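The build-up from one blade to 1024 nodes follows from three packing ratios quoted on the earlier steps. A small sketch that recomputes the "Modeled System" and "Resource Util." numbers from those ratios (the function and constant names are illustrative; the m4.16xlarge switch hosts are not derived here):

```python
import math

# Ratios taken from the build-up slides above.
CORES_PER_BLADE = 4          # Step 1: 4x Rocket cores per server SoC
DRAM_GB_PER_BLADE = 16       # Step 2: 16 GB DDR3 per blade
BLADES_PER_FPGA = 4          # Step 3: 4 server blades per FPGA
FPGAS_PER_F1_16XLARGE = 8    # Step 4: 8 FPGAs = 1x f1.16xlarge


def host_resources(num_blades: int) -> dict:
    """Modeled cores/DRAM and the FPGA hosts needed for num_blades server blades."""
    fpgas = math.ceil(num_blades / BLADES_PER_FPGA)
    return {
        "cores": num_blades * CORES_PER_BLADE,
        "dram_tb": num_blades * DRAM_GB_PER_BLADE / 1024,
        "fpgas": fpgas,
        "f1_16xlarge": math.ceil(fpgas / FPGAS_PER_F1_16XLARGE),
    }


for blades in (32, 256, 1024):
    print(blades, host_resources(blades))
```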


Harnesses millions of dollars of FPGAs to simulate 1024 nodes cycle-exactly, with a cycle-accurate network simulation and global synchronization, at a cost-to-user of only 100s of dollars/hour
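A rough check of the "100s of dollars/hour" figure, as a sketch under two stated assumptions: the per-hour prices quoted on the Step 3 slide are taken as the price of one FPGA, and cost is assumed to scale linearly with FPGA count. The m4.16xlarge switch hosts are ignored, and real EC2 prices vary over time and by region.

```python
# Assumption: Step 3's quoted prices are treated as the hourly price of one FPGA.
SPOT_PER_FPGA_HOUR = 0.49
ON_DEMAND_PER_FPGA_HOUR = 1.65


def hourly_cost(num_fpgas: int, price_per_fpga_hour: float) -> float:
    """Hourly cost of the FPGA hosts only, assuming linear per-FPGA pricing."""
    return round(num_fpgas * price_per_fpga_hour, 2)


spot = hourly_cost(256, SPOT_PER_FPGA_HOUR)            # 1024-node simulation
on_demand = hourly_cost(256, ON_DEMAND_PER_FPGA_HOUR)
print(f"spot: ${spot}/hour, on-demand: ${on_demand}/hour")
```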

Open-source: Not just datacenter simulation

• An “easy” button for fast, FPGA-accelerated full-system simulation
  • Replace network endpoints with your own Chisel designs
  • One-click: parallel FPGA builds, simulation run/result collection, building target software
• Scales to a variety of use cases:
  • Networked (performance depends on scale)
  • Non-networked (150+ MHz), limited by your budget
• firesim command line program
  • Like docker or vagrant, but for FPGA sims
  • User doesn’t need to care about distributed magic happening behind the scenes

[Figure: FireSim Developer Environment]

Open-source: Not just datacenter simulation

• Scripts can call firesim to fully automate distributed FPGA sim
  • Reproducibility: included scripts to reproduce ISCA 2018 results
  • e.g. scripts to automatically run SPECInt2017 reference inputs in ≈1 day
  • Many others
• 100+ pages of documentation: https://docs.fires.im
• AWS provides grants for researchers: https://aws.amazon.com/grants/

$ cd fsim/deploy/workloads

$ ./run-all.sh

Latest Updates

• Growing ecosystem!
  • BOOM (out-of-order RISC-V core) now available in FireSim
  • Projects publicly releasing FireSim images at RISC-V Summit:
    • Hwacha vector accelerator
    • Keystone Secure Enclave
  • Berkeley IceNet (photonic DC network) modeling
• New debugging features
  • Auto-ILA: annotate Chisel and get an ILA automatically wired up in FireSim
  • Tracer widgets: collect live instruction traces from FireSim sims
  • Integrating DESSERT [8]
  • Assertion Synthesis now on master
• First academic user papers:
  • ISCA ’18: Maas et al. “A Hardware Accelerator for Tracing Garbage Collection” (Berkeley)
  • MICRO ’18: Zhang et al. “Composable Building Blocks to Open up Processor Design” (MIT)

Wrapping Up

• We can prototype scalable systems built on arbitrary RTL at unprecedented scale
  + Mix software models when desired
• Simulation is automatically built and deployed
• Automatically deploy real workloads and collect results
• Open-source, runs on Amazon EC2 F1, no capex

[Figure: SoC RTL, Other RTL, Network Topology, SW Models, and Full Workload feed an automatically deployed, high-performance, distributed simulation]

Administrivia

• Everything we’re going to show you today is documented in excruciating detail at http://docs.fires.im/

• Use these slides as a high-level view of what’s possible

• We’ll also put the slides/videos online


Learn More:
• Web: https://fires.im
• GitHub: https://github.com/firesim
• Twitter: @firesimproject
• ISCA’18 Paper: https://sagark.org/assets/pubs/firesim-isca2018.pdf

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Questions?




References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network requirements for resource disaggregation. OSDI ’16.
[2] Y. Lee et al. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. EuroSys ’14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanović, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS ’15.
[5] Z. Tan, A. Waterman, H. Cook, S. Bird, K. Asanović, and D. Patterson. A case for FAME: FPGA architecture model execution. ISCA ’10.
[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV ’17.
[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: fast and accurate sample-based energy simulation for arbitrary RTL. ISCA ’16.
[8] Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović. DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles. FPL 2018.

Building/Deploying Simulations

FireSim Tutorial

RISC-V Summit 2018 Tutorial

Speaker: Sagar Karandikar

What will we cover?

• Building FireSim FPGA images for a set of targets
• Choosing targets at runtime
• Configuring target software
• Managing EC2 infrastructure for simulations
• Running a simulation
• Customizing workloads: SPEC example

Background Terminology

“AGFI”: FPGA bitstream for F1 FPGAs

First-time User Setup

• Won’t discuss this today, but we’ve documented the process of starting with AWS/FireSim from scratch here: http://docs.fires.im/en/latest/Initial-Setup/index.html
• For the following walkthrough, we assume that you’ve set up a Manager instance and cloned firesim into ~/firesim



Let’s simulate the following system:

Target Design:

• One Rocket Chip-based node (RTL)
  • 4 Rocket Cores
  • 16K L1 I$, D$
  • Block Device, UART, Serial Adapter
  • No NIC
• 4 MB LLC (RTL Model)
• DDR3 Memory System
• No network
• Boot vanilla buildroot-Linux distro

Host Resources:

• One manager instance (c4.4xlarge)

• One F1 instance (f1.2xlarge, single FPGA)


Using the firesim manager command line

From inside ~/firesim, source sourceme-f1-manager.sh

This properly sets up your environment and puts firesim on your $PATH

This means that we can now call firesim from anywhere on the instance. It will always run from the directory:

your_firesim_clone/deploy/

in this case:

~/firesim/deploy/


Configuring the Manager: 4 files in firesim/deploy/

• config_build_recipes.ini
• config_build.ini
• config_hwdb.ini
• config_runtime.ini

Setting up the manager to build FPGA images


Running builds

• Once we’ve configured what we want to build, let’s build it

$ firesim buildafi

• This completely automates the process. For each design, in parallel:
  • Launch a build instance (c4.4xlarge)
  • Run FireSim generator (Chisel/FIRRTL/MIDAS)
  • Ship infrastructure to build instances, run Vivado FPGA builds in parallel
  • Collect results back onto manager instance
    • ~/firesim/deploy/results-workload/TIMESTAMP-config/ contains final results + QoR information + dcps to open in Vivado for further manual tuning
  • Email you the entry to put into config_hwdb.ini
  • Terminate the build instance

Status: We have FPGA images for the simulator of our design

Now, let’s build our target-software

• We need:
  • bbl + vmlinux image: Berkeley BootLoader + Linux kernel image as payload
  • Root filesystem disk image
• FireSim provides two levels of workload automation:
  • firesim-software repo for automatically building base images for various distros (e.g. buildroot or Fedora) that are compatible with Rocket / BOOM

    $ cd firesim/sw/firesim-software
    $ ./sw-manager.py -c br-disk.json build # simple buildroot distro

    This lets us use the linux-uniform.json workload, which we will see later
  • firesim manager / workload generation tool can bake custom workloads into rootfses
    • We’ll get to this later


Setting up the manager’s runtime configuration
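The slide's screenshot of config_runtime.ini did not survive extraction. As a hedged stand-in, the sketch below parses a small INI fragment with Python's standard configparser. The section and key names (runfarm, f1_2xlarges, targetconfig, defaulthwconfig, workload, workloadname) and the hardware config value are assumptions modeled on the walkthrough's single-FPGA, no-network example; check them against the config_runtime.ini in your own firesim/deploy/ before relying on them.

```python
import configparser

# Assumed layout; every section/key name and value here is illustrative, not authoritative.
EXAMPLE_RUNTIME_INI = """
[runfarm]
f1_16xlarges = 0
m4_16xlarges = 0
f1_2xlarges = 1

[targetconfig]
topology = no_net_config
defaulthwconfig = example-quadcore-no-nic-hwconfig

[workload]
workloadname = linux-uniform.json
"""


def summarize(ini_text: str) -> dict:
    """Extract the run farm instance counts and the workload name."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    farm = {key: cfg.getint("runfarm", key) for key in cfg["runfarm"]}
    return {"instances": farm, "workload": cfg.get("workload", "workloadname")}


summary = summarize(EXAMPLE_RUNTIME_INI)
print(summary)
```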

Launching Simulation Instances

• Once we’ve configured what we want to simulate and what infrastructure we require, let’s launch our simulation (F1) instances


$ firesim launchrunfarm

FireSim Manager. Docs: http://docs.fires.im

Running: launchrunfarm

Waiting for instance boots: f1.16xlarges

Waiting for instance boots: m4.16xlarges

Waiting for instance boots: f1.2xlarges

i-0d6c29ac507139163 booted!

The full log of this run is: /home/centos/firesim-new/deploy/logs/2018-05-19--00-19-43-launchrunfarm-B4Q2ROAK.log


Distributing Simulation Infrastructure to the Run Farm

• Next, we need to set up our run farm instance with simulation infrastructure

$ firesim infrasetup

FireSim Manager. Docs: http://docs.fires.im

Running: infrasetup

Building FPGA software driver for FireSimNoNIC-FireSimRocketChipQuadCoreConfig-FireSimDDR3FRFCFSLLC4MBConfig

[172.30.2.254] Executing task 'instance_liveness'

[172.30.2.254] Checking if host instance is up...

[172.30.2.254] Executing task 'infrasetup_node_wrapper'

[172.30.2.254] Copying FPGA simulation infrastructure for slot: 0.

[172.30.2.254] Installing AWS FPGA SDK on remote nodes. Upstream hash: 2fdf23ffad944cb94f98d09eed0f34c220c522fe

[172.30.2.254] Unloading EDMA Driver Kernel Module.

[172.30.2.254] Copying AWS FPGA EDMA driver to remote node.

[172.30.2.254] Clearing FPGA Slot 0.

[172.30.2.254] Checking for Cleared FPGA Slot 0.

[172.30.2.254] Flashing FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.

[172.30.2.254] Checking for Flashed FPGA Slot: 0 with agfi: agfi-06b9b705ab9af1238.

[172.30.2.254] Loading EDMA Driver Kernel Module.

[172.30.2.254] Starting Vivado hw_server.

[172.30.2.254] Starting Vivado virtual JTAG.

The full log of this run is:

/home/centos/firesim/deploy/logs/2018-11-14--16-42-35-infrasetup-CHLESMNB83XFPZUR.log


Let’s run a simulation!

$ firesim runworkload


Running: runworkload

Creating the directory: /home/centos/firesim/deploy/results-workload/2018-11-14--16-47-17-linux-uniform/

[172.30.2.254] Executing task 'instance_liveness'

[172.30.2.254] Checking if host instance is up...

[172.30.2.254] Executing task 'boot_switch_wrapper'

[172.30.2.254] Executing task 'boot_simulation_wrapper'

[172.30.2.254] Starting FPGA simulation for slot: 0.

Monitoring

Manually Interacting with simulations

$ ssh 172.30.2.254

$ screen -r fsim0

Shutting down simulations/Capturing results

• Frequently, we want to capture results from simulations automatically

• FireSim’s workload system supports this!
  • More on how to specify what to capture in a few mins
• But let’s continue with our example. It’s configured just to capture each simulated system’s console output once the simulation is powered off. When that happens, the manager will produce:

...

FireSim Simulation Exited Successfully. See results in:

/home/centos/firesim/deploy/results-workload/2018-11-14--16-52-49-linux-uniform/

/home/centos/firesim/deploy/logs/2018-11-14--16-52-49-runworkload-HSMYYMD4JBA6BHT5.log

Finally, get rid of our run farm EC2 instances

• Easy!

$ firesim terminaterunfarm


Running: terminaterunfarm

IMPORTANT!: This will terminate the following instances:

f1.16xlarges

[]

m4.16xlarges

[]

f1.2xlarges

['i-033959e7513fcf928']

Type yes, then press enter, to continue. Otherwise, the operation will be cancelled.

yes

Instances terminated. Please confirm in your AWS Management Console.


/home/centos/firesim/deploy/logs/2018-11-14--17-00-01-terminaterunfarm-RGWY68L5ICAYQTA3.log
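Since scripts can call firesim directly, the four manager commands from this walkthrough can be chained. The following is a minimal sketch, not a script shipped with FireSim: it defaults to a dry run that only returns the commands, and it assumes the "yes" confirmation for terminaterunfarm can be supplied on standard input.

```python
import subprocess

# Manager commands in the order used in the walkthrough above.
SETUP_AND_RUN = ["launchrunfarm", "infrasetup", "runworkload"]
TEARDOWN = "terminaterunfarm"


def run_step(step: str, dry_run: bool = True) -> list:
    """Run one `firesim <step>` command, or just return it in dry-run mode."""
    cmd = ["firesim", step]
    if dry_run:
        return cmd
    # Assumption: the confirmation prompt reads "yes" from stdin.
    answer = "yes\n" if step == TEARDOWN else None
    subprocess.run(cmd, input=answer, text=True, check=True)
    return cmd


def run_workflow(dry_run: bool = True) -> list:
    """Launch, set up, and run; always try to terminate the run farm afterwards."""
    executed = []
    try:
        for step in SETUP_AND_RUN:
            executed.append(run_step(step, dry_run))
    finally:
        # Tear down even if an earlier step failed, so instances do not keep billing.
        executed.append(run_step(TEARDOWN, dry_run))
    return executed


print(run_workflow(dry_run=True))
```

Call run_workflow(dry_run=False) only from a manager instance where firesim is on your $PATH.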


Custom Workloads: linux-uniform workload

• Previously, we relied on linux-uniform.json as our workload

• These jsons live in firesim/deploy/workloads/

Let’s look at a more interesting example: SPEC

• JSONs are used for two things:
  • Building rootfs/binary combos automatically, with benchmark infrastructure built-in
  • Deploying simulations with the manager (like we saw previously) and knowing which results to collect at the end

Building/Deploying SPEC on Rocket/BOOM

• Building: We don’t have enough time to go into detail here. Look at the spec17-% target in firesim/deploy/workloads/Makefile

• At a high-level, you can just run make spec17-intrate, and you’ll get 10 rootfs/linux image combos that will automatically run the spec benchmarks in parallel

• Deploying: Set your workload to spec17-intrate.json in config_runtime.ini, set the # f1.2xlarges to 10, select the hardware config you want to benchmark, then firesim launchrunfarm/infrasetup/runworkload as usual

• At the end, all your performance results live in one directory on the manager: firesim/deploy/results-workload/TIMESTAMP-spec17-intrate-HASH/

• Instances get automatically terminated one-by-one as benchmarks complete – essentially zero cost to running in parallel on EC2, since you pay by the machine-second anyway
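The "essentially zero cost" claim follows from per-second billing: the total billed instance-seconds are the same whether the benchmarks run back to back on one instance or side by side on ten. A small sketch with made-up runtimes (the numbers are hypothetical, not SPEC results, and instance boot/setup overhead is ignored):

```python
# Hypothetical per-benchmark runtimes in seconds, for illustration only.
RUNTIMES_S = [3600, 7200, 5400, 1800, 9000, 4200, 6600, 3000, 8100, 2700]


def serial(runtimes: list) -> tuple:
    """One instance runs every benchmark in turn: (billed seconds, wall-clock seconds)."""
    total = sum(runtimes)
    return total, total


def parallel(runtimes: list) -> tuple:
    """One instance per benchmark, each terminated as soon as its benchmark completes."""
    return sum(runtimes), max(runtimes)


print("serial  :", serial(RUNTIMES_S))
print("parallel:", parallel(RUNTIMES_S))
```

The billed seconds match; only the wall-clock time drops, to that of the longest benchmark.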


Summary

• Don’t fret if you didn’t catch everything; everything we showed you today is documented in excruciating detail at http://docs.fires.im
• We learned how to:
  • Build FireSim FPGA images for a set of targets
    • http://docs.fires.im/en/latest/Building-a-FireSim-AFI.html
  • Set up/launch a simulation, including choosing targets, configuring target software, and managing EC2 infrastructure
    • http://docs.fires.im/en/latest/Running-Simulations-Tutorial/Running-a-Single-Node-Simulation.html
  • Customize workloads: SPEC example
    • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/Defining-Custom-Workloads.html
    • http://docs.fires.im/en/latest/Advanced-Usage/Workloads/SPEC-2017.html






Modifying the Target Design

FireSim Tutorial RISC-V Summit 2018

Speaker: David Biancolin

Modifying the Target Design

In this section:

• Explain how the target is modeled in FireSim

• Discuss how to add and modify each class of model

Code references apply to FireSim v.1.4.0

Relevant FireSim docs section: “Targets”

https://docs.fires.im/en/latest/Advanced-Usage/Generating-Different-Targets.html









Important Submodules & Aliases

Submodules:

MIDAS: sim/midas

FIRRTL: sim/firrtl

Firechip: target-design/firechip

Rocket-Chip: target-design/firechip/rocket-chip

Chisel: target-design/firechip/rocket-chip/chisel

Aliases for this presentation:

FSCALA -> sim/src/main/scala

FCPP -> sim/src/main/cc/

MSCALA -> {in MIDAS}/src/main/scala/midas

MCPP -> {in MIDAS}/src/main/cc/

Target as a Dataflow Graph

• Represent the target as a synchronous-dataflow graph
• FireSim stitches together a larger graph to model a datacenter

[Figure legend]
• SW Model: to be hosted on a CPU
• Transformed RTL Model: derived from SoC RTL, to be hosted on an FPGA
• Abstract RTL Model: cannot be taped out, to be hosted on an FPGA

Expressing the Target Graph in FireSim

• Target graph is implicit, currently not encoded in one place.

• There is only one transformed-RTL model
  • The CHIRRTL you passed to midas.Compiler
• In Simulation Mapping
  • 2+ entry queue channels are instantiated on Decoupled IO
  • 1-cycle pipe channels are used everywhere else.
• SW and Custom RTL models are bound to IO bundles using midas.core.Endpoint
  • Matches on an IO type, generates a model or endpoint

Parameterizing the RTL-Transformed Model

• FireSim provides four base Rocket-Chip “cakes” that form the base RTL-transformed model
  • DESIGN make variable

Two types of parameters:

1. Target (see TargetConfigs.scala): configure the generated RTL model
  • # of cores, L1 $ sizes, # of memory channels
  • TARGET_CONFIG make variable
2. Sim/Platform (see SimConfigs.scala): configure MIDAS, SW/custom RTL model generation
  • Configs for endpoints, type of DRAM model, enable debug features, etc.
  • PLATFORM_CONFIG make variable

Swapping out the RTL-Transformed Model

Two common approaches, based on the type of Target:

1. Project-template-based Rocket-Chip targets (FireChip)
  • Point the FireChip submodule at your fork
  • NB: MIDAS depends on Rocket-Chip; Rocket-Chip provides the Chisel submodule
2. Everything else (e.g. your custom CGRA)
  • Add Scala sources for your target (modify sim/build.sbt)
  • Define an App (Generator) that calls midas.Compiler(…)
  • Add a target-specific makefrag (e.g. sim/src/main/makefrag/firesim/Makefrag)

See the MIDAS examples for a simple demonstration of this.

On Abstract RTL & SW Models.

FireSim & MIDAS provide the following models.

Abstract-RTL:

1. DefaultIOModel ($MSCALA/widgets/PeekPokeIO.scala)
  • Bound to I/O unbound by custom endpoints
2. DRAM & LLC model ($MSCALA/models/dram/MidasMemModel.scala)
  • Bound to AXI4 slave interfaces

SW-models:

1. NIC
2. SerialIO
3. Block-device
4. UART

These are all FireSim-provided: see $FSCALA/endpoints and $FCPP/endpoints

Compile-time vs Runtime Configuration

Software models and custom RTL models are configured twice:

1. Compile/Generation time

2. Runtime

On Compile-time configuration:

• RTL generation for custom RTL models (FPGA resynthesis required)
  • How: using a different PLATFORM_CONFIG; hacking on the Chisel sources
• CPP compilation for SW models (no FPGA resynthesis)
  • How: changing model & driver CPP sources.

Runtime Configuration

• Pass plus args to the simulator
  • SW models: does what you expect
  • Custom-RTL models: driver writes to model’s configuration registers
  • Ex. +blkdev_write_latency,
• How: Modify deploy/config_runtime.ini
  • Change ini-exposed variables
  • Specify a customruntimeconfig in deploy/config_hwdb.ini
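Plus args are command-line arguments of the form +name=value. As a minimal sketch of turning a set of runtime settings into such arguments (the setting names below are hypothetical placeholders, not real FireSim plus args, and the driver binary name is made up):

```python
def to_plusargs(settings: dict) -> list:
    """Render settings as +name=value arguments, in insertion order."""
    return [f"+{name}={value}" for name, value in settings.items()]


# Hypothetical setting names, for illustration only.
example_settings = {"example_read_latency": 4096, "example_write_latency": 8192}
command_line = ["./example-driver"] + to_plusargs(example_settings)
print(" ".join(command_line))
```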

Runtime Configuration – config_runtime.ini

HWDB entry: None -> Uses a default runtime.conf emitted at generation time

Runtime.conf -> plusArgs you want to pass to the simulator

Runtime Configuration – config_hwdb.ini

Adding a Custom Model – Chisel-side

1. Expose a Record/Bundle in your Target RTL’s IO with a specific type

2. Extend midas.Endpoint to:
  1. Match on the correct Chisel type
    • Implement Endpoint.matchType(Data) => Boolean
  2. Generate your SW model’s endpoint, or custom RTL model
    • Implement Endpoint.widget(Parameters) => EndpointWidget

3. Add your endpoint to the EndpointMap Field in your PLATFORM_CONFIG

See:

• $FSCALA/endpoints/UARTWidget.scala

• $FSCALA/endpoints/BlockDevWidget.scala

• $FSCALA/firesim/SimConfigs.scala


Adding a Custom Model – Driver-side

1. Define an endpoint driver class
  • Use simif::{read, write, push, pull} to interact with the FPGA-hosted endpoint

2. Register the endpoint driver in firesim_top.cc

See:

• $FCPP/endpoints/uart.{cc, h}

• $FCPP/endpoints/blockdev.{cc, h}

• $FCPP/firesim/firesim_top.cc


DRAM Timing Models – Generation Time Conf

• Parameterized using the PLATFORM_CONFIG make variable
  • See $FSCALA/firesim/SimConfigs.scala
• Abstract timing models:
  • Latency-Bandwidth Pipe
  • Bank Conflict
• DDR3 timing models:
  • First-Come First-Served (FCFS)
  • First-Ready, First-Come First-Served (FR-FCFS)
• All timing models can be composed with an LLC model.
• To be presented at FPGA2019

DRAM Timing Models – Runtime Conf

• DDR timing models expose >30 runtime arguments
  • Many illegal settings, need to validate against generated model instance
    • Ex. a setting that would overflow a timing register
  • Many invalid settings, need to validate against DDR3 part database
    • Ex. a non-standard JEDEC DDR3 timing; timings are density dependent
• A default runtime configuration is emitted at generation time
  • See generated-src/f1/${TARGET_TUPLE}/runtime.conf
• To generate a custom runtime config:
  • In sim/: make conf
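The first kind of validation, a setting that would overflow a timing register, amounts to a range check against the register width of the generated model. A minimal sketch, with hypothetical parameter names and register widths (real widths depend on the generated model instance):

```python
def overflow_errors(settings: dict, register_bits: dict) -> list:
    """Return a message for every setting that does not fit its timing register."""
    errors = []
    for name, value in settings.items():
        max_value = (1 << register_bits[name]) - 1  # largest unsigned value that fits
        if not 0 <= value <= max_value:
            errors.append(f"{name}={value} does not fit in {register_bits[name]} bits (max {max_value})")
    return errors


# Hypothetical timing parameters (in cycles) and register widths, for illustration only.
widths = {"tCAS": 5, "tRCD": 5, "tRFC": 8}
requested = {"tCAS": 14, "tRCD": 40, "tRFC": 256}
for message in overflow_errors(requested, widths):
    print(message)
```

The second kind of validation (against a DDR3 part database) needs part data and is not sketched here.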

DRAM Timing Models – custom runtime.conf

Debugging, Testing, and Profiling

FireSim Tutorial RISC-V Summit 2018

Speaker: Alon Amid

Agenda

• FireSim Debugging Using Software Simulation

• FireSim Debugging Using Integrated Logic Analyzers

• Advanced FPGA-Based Debugging and Profiling Features

• The FireSim Vision for Debugging and Profiling

79

Debugging Using Software Simulation

[Flowchart: My FireSim simulation is not working. What am I doing?]

• Modifying internal simulated target hardware, no new external endpoints: Target-Level SW Simulation
• Adding/modifying new interfaces and endpoints, modifying simulation models: Simulator-Level SW Simulation (MIDAS-Level SW Simulation, FPGA-Level SW Simulation)

Target-Level Simulation
• Software Simulation
• Target Design Untransformed
• No Host-FPGA interfaces

MIDAS-Level Simulation
• Software Simulation
• Target Design Transformed by MIDAS
• Host-FPGA interfaces/shell emulated using abstract models

FPGA-Level Simulation
• Software Simulation
• Target Design Transformed by MIDAS
• Host-FPGA interfaces/shell simulated by the FPGA tools

Remember this slide from Sagar’s presentation?

[Figure: the FAME-1 transformed design shown earlier (RTL Design, Req/Resp Queues, DRAM Model with 100 cycle latency, Mem Channel, FPGA Fabric, Physical DRAM with 100 ns latency), built up over three slides and annotated with the scope of MIDAS-Level SW Simulation (marked “Abstract Model”) and FPGA-Level SW Simulation]

Level    Waves   VCS      Verilator   XSIM
Target   Off     ~5 kHz   ~5 kHz      N/A
Target   On      ~1 kHz   ~5 kHz      N/A
MIDAS    Off     ~4 kHz   ~2 kHz      N/A
MIDAS    On      ~3 kHz   ~1 kHz      N/A
FPGA     On      ~2 Hz    N/A         ~0.5 Hz
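To put these rates in perspective, the sketch below converts a cycle count into wall-clock time for a few rows of the table, plus the ~150 MHz single-node FPGA rate quoted in the build-up slides. The one-billion-cycle workload size is an arbitrary example.

```python
def wall_clock_seconds(target_cycles: int, sim_rate_hz: float) -> float:
    """Host time needed to simulate target_cycles at the given simulation rate."""
    return target_cycles / sim_rate_hz


RATES_HZ = {
    "Target-level, VCS, waves off": 5e3,
    "MIDAS-level, VCS, waves off": 4e3,
    "FPGA-level, VCS, waves on": 2,
    "FireSim on the FPGA, single node": 150e6,
}

CYCLES = 10**9  # arbitrary example workload length in target cycles
for name, rate in RATES_HZ.items():
    seconds = wall_clock_seconds(CYCLES, rate)
    print(f"{name}: {seconds:,.1f} s ({seconds / 3600:,.2f} h)")
```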

• Target-Level Simulation, in firesim/target-design/firechip/vsim

$ make DESIGN=<YourDesign> CONFIG=<YourConfig> debug

$ ./simv-<YourDesign>-<YourConfig>-debug +max-cycles=50000 +vcdplusfile=<WaveformFileName>.vpd ../tests/<InputTest>.riscv

• MIDAS-Level Simulation, in firesim/sim

$ make <verilator|vcs>-debug

$ make EMUL=<verilator|vcs> DESIGN=FireSimNoNIC run-asm-test-debug

• FPGA-Level Simulation, in firesim/sim

$ make xsim

$ make xsim-dut <VCS=1> &

$ make run-xsim SIM_BINARY=<PATH/TO/DRIVER/BINARY/FOR/TARGET/TO/RUN>

**FPGA-level simulation currently does not support DMA_PCIS (which is used for the NIC interface)


When SW Simulation Debugging is Not Enough…

“Everything looks OK in SW simulation, but there is still a bug somewhere”

“My bug only appears after hours of running Linux on my simulated HW”

Debugging Using Integrated Logic Analyzers

Integrated Logic Analyzers (ILAs)

• Common debugging feature provided by FPGA vendors

• Continuous recording of a sampling window of up to 1024 cycles.
  • Stores recorded samples in BRAM.
• Real-time trigger-based sampled output of probed signals
  • Multiple probe ports can be combined into a single trigger
  • Trigger can be in any location within the sampling window
• On the AWS F1 instances, the ILA is interfaced through a debug bridge and server
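The behavior described above (continuous recording into a fixed-size window, with the trigger placed anywhere inside it) can be mimicked in software with a ring buffer. This is a toy model for intuition, not how the vendor ILA IP is implemented; the function and parameter names are made up.

```python
from collections import deque


def capture(samples, trigger, depth=1024, trigger_pos=512):
    """Return a window of up to `depth` samples around the first triggering sample.

    trigger_pos is the index the triggering sample should have in the window,
    provided at least trigger_pos samples were recorded before the trigger fired.
    """
    window = deque(maxlen=depth)    # oldest samples fall out, like a BRAM ring buffer
    remaining = None                # post-trigger samples still to record
    for sample in samples:
        window.append(sample)
        if remaining is None:
            if trigger(sample):
                remaining = depth - 1 - trigger_pos
        else:
            remaining -= 1
        if remaining == 0:
            break
    return list(window)


# Probe a counter and trigger when it reaches 3000.
window = capture(range(5000), lambda value: value == 3000)
print(len(window), window[0], window[512], window[-1])
```

Moving trigger_pos toward 0 keeps more post-trigger history; moving it toward depth - 1 keeps more pre-trigger history.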


From: aws-fpga cl_hello_world example

Debugging Using Integrated Logic Analyzers

AutoILA – Automation of ILA integration with FireSim

• Annotate requested signals and bundles in the Chisel source code

• Automatic configuration and generation of the ILA IP in the FPGA toolchain

• Automatic expansion and wiring of annotated signals to the top level of a design using a FIRRTL transform.

• Remote waveform and trigger setup from the manager instance

90

Debugging using Integrated Logic Analyzers

91

Pros:

• No emulated parts – what you see is what’s running on the FPGA

• FPGA simulation speed: O(MHz), compared to O(kHz) in software simulation

• Real-time trigger-based

Cons:

• Requires a full build to modify visible signals/triggers (takes several hours)

• Limited sampling window size

• Consumes FPGA resources

Advanced FPGA-Based Debugging Features

• FPGA-based simulation enables high simulation speed, which enables advanced debugging and profiling tools.

• Reach “deep” into simulated time, and collect far more coverage and data than software simulation allows

• Examples: • TracerV

• Synthesizable Assertions

[Figure: simulated time covered by SW simulation vs. FPGA-based simulation]

TracerV


• Out-of-band full instruction execution trace

• MIDAS widget connected to target trace ports

• By default, a large amount of information is wired out of Rocket/BOOM, per hart, per cycle: • Instruction Address • Instruction • Privilege Level • Exception/Interrupt Status and Cause

• TracerV can rapidly generate several TB of data.
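A back-of-the-envelope calculation shows how quickly the trace grows. The numbers below are assumptions for illustration: a hypothetical 16-byte record per hart per cycle (the actual TracerV record size is not given here), a single hart, and the 40 MHz simulation rate quoted for TracerV later in this deck.

```python
def trace_bytes_per_second(bytes_per_record, harts, sim_rate_hz):
    """Trace bandwidth when every hart emits one record per simulated cycle."""
    return bytes_per_record * harts * sim_rate_hz


# Assumed values: 16-byte record (hypothetical), 1 hart, 40 MHz simulation rate.
rate = trace_bytes_per_second(16, 1, 40e6)
seconds_per_tb = 1e12 / rate
print(f"{rate / 1e9:.2f} GB/s, 1 TB after {seconds_per_tb / 60:.0f} min")
```

Under these assumptions a single hart produces a terabyte of trace in under half an hour, and the volume scales linearly with hart count.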

TracerV

• Out-of-Band software-level debugging and profiling: Profiling does not perturb execution

• Useful for kernel and hypervisor level cycle-sensitive profiling

• Examples: • Co-Optimization of NIC and Network Driver • Keystone Secure Enclave Project • High-performance hardware-specific code (supercomputing?)

• Requires large-scale analytics for insightful profiling and optimization.
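As a toy example of the kind of analysis such traces call for, the sketch below counts cycles per privilege level and finds the hottest instruction addresses. The record layout (cycle, pc, privilege) is a hypothetical simplification, not the actual TracerV output format, and a real multi-terabyte trace would need a streaming or distributed version of the same idea.

```python
from collections import Counter


def profile(trace, top_n=3):
    """Count trace records per privilege level and return the `top_n` hottest PCs."""
    by_priv = Counter()
    by_pc = Counter()
    for _cycle, pc, priv in trace:
        by_priv[priv] += 1
        by_pc[pc] += 1
    return by_priv, by_pc.most_common(top_n)


# Tiny hand-written trace in the assumed (cycle, pc, privilege) layout.
trace = [
    (0, 0x80000000, "M"),
    (1, 0x80000004, "M"),
    (2, 0x80000000, "M"),
    (3, 0x00001000, "U"),
    (4, 0x80000000, "M"),
]
by_priv, hottest = profile(trace, top_n=1)
print(dict(by_priv))
print([(hex(pc), count) for pc, count in hottest])
```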


TracerV


Pros:

• Out-of-Band (no impact on workload execution)

• SW-centric method

• Large amounts of data

Cons:

• Slower simulation performance (40 MHz)

• No HW visibility

• Large amounts of data

Synthesizable Assertions

• Assertions – rapid error checking embedded in HW source code. • Commonly used in SW Simulation

• Halts the simulation upon a triggered assertion. Represented as a “stop” statement in FIRRTL

• By default, emitted as non-synthesizable SV system tasks ($fatal)


From: Trillion-Cycle Bug Finding Using FPGA-Accelerated Simulation Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, Krste Asanović. ADEPT Winter Retreat 2018

From: BROOM: An open-source Out-of-Order processor with resilient low-voltage operation in 28nm CMOS, Christopher Celio, Pi-Feng Chiu, Krste Asanović, David Patterson and Borivoje Nikolić. Hot Chips 30, 2018


• Synthesizable Assertions on FPGA • Transform FIRRTL stop statements into synthesizable logic

• Insert combinational logic and signals for the stop condition arguments

• Insert encodings for each assertion (for matching error statements in SW)

• Wire the assertion logic output to the Top-Level

• Generate timing tokens for cycle-exact assertions

• Assertion checker records the cycle and halts simulation when assertion is triggered
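The flow above can be mimicked in software to show what the assertion checker reports. This is a behavioural sketch, not the FIRRTL transform itself: the function name, the example assertions, and the use of a list index as the assertion encoding are all illustrative assumptions.

```python
def run_until_assertion(assertions, states):
    """Step through per-cycle states and halt on the first triggered assertion.

    `assertions` is a list of (message, stop_condition) pairs; the list index
    plays the role of the per-assertion encoding used to look up the message.
    Returns (cycle, assertion_id, message), or None if nothing fires.
    """
    for cycle, state in enumerate(states):
        for assert_id, (message, stop_condition) in enumerate(assertions):
            if stop_condition(state):
                return cycle, assert_id, message
    return None


# Hypothetical assertions over a hypothetical per-cycle state.
assertions = [
    ("queue overflow", lambda s: s["count"] > 4),
    ("credit underflow", lambda s: s["credits"] < 0),
]
states = [{"count": c, "credits": k} for c, k in [(1, 2), (3, 1), (4, 0), (5, 0)]]
print(run_until_assertion(assertions, states))  # (3, 0, 'queue overflow')
```

Only the cycle number and the assertion ID need to cross from hardware to software; the human-readable message is recovered on the host side.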



• Previously a research feature presented in DESSERT [1]

• Helped identify BOOM bugs trillions of cycles into execution

• Integrated into FireSim in the latest release.


[1] Kim, D., Celio, C., Karandikar, S., Biancolin, D., Bachrach, J. and Asanovic, K., DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles. The International Conference on Field-Programmable Logic and Applications (FPL), 2018



Pros:

• FPGA simulation speed

• Real-time trigger-based

• Consumes small amount of FPGA resources (compared to ILA)

• Key signals have pre-written assertions in re-usable components/libraries

Cons:

• Low visibility: No waveform/state

• Assertions are best added while writing source RTL rather than during “investigative” debugging

The FireSim Vision: Speed and Visibility

• High-performance simulation

• Full application workloads

• Tunable visibility & resolution

• Unique data-based insights


Speed and Visibility – How do we get there?

• Integration of additional DESSERT features • Hardware state snapshot extraction from FPGA to software simulation using arbitrary software triggers.

• Easy-to-use information transfer between execution traces and software hooks.

• Data-Processing pipeline for insights from large-scale traces • O(GB)-O(TB) out-of-band logs require big-data analysis methods for insights (Golden model comparison is one potential method).

• Potential unique insights: can collect globally-cycle-accurate out-of-band instruction traces from a networked datacenter simulation.
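Golden model comparison, mentioned above as one potential analysis method, boils down to finding the first point where the observed trace departs from a reference trace. The sketch below assumes both traces are sequences of comparable records, here hypothetical (pc, instruction) pairs; it is not an existing FireSim tool.

```python
from itertools import zip_longest


def first_divergence(golden, observed):
    """Return (index, golden_record, observed_record) at the first mismatch, else None.

    If one trace is shorter, the missing record is reported as None.
    """
    for index, (g, o) in enumerate(zip_longest(golden, observed)):
        if g != o:
            return index, g, o
    return None


# Hypothetical (pc, instruction) records.
golden = [(0x80000000, 0x00000013), (0x80000004, 0x00100093), (0x80000008, 0x00208133)]
observed = [(0x80000000, 0x00000013), (0x80000004, 0x00100093), (0x8000000C, 0x00208133)]
print(first_divergence(golden, observed))
```

Because both iterables are consumed lazily, the same function works on generators that stream records from disk rather than holding whole traces in memory.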


Summary

• Debugging Using Software Simulation (docs: https://docs.fires.im/en/latest/Advanced-Usage/Debugging/RTL-Simulation.html) • Target-Level

• MIDAS-Level

• FPGA-Level

• Debugging Using Integrated Logic Analyzers (docs: https://docs.fires.im/en/latest/Advanced-Usage/Debugging/Debugging-Hardware-Using-ILA.html)

• Advanced Debugging and Profiling Features • TracerV (docs: https://docs.fires.im/en/latest/Advanced-Usage/Debugging/TracerV.html)

• Assertion Synthesis (docs: https://docs.fires.im/en/dev/Advanced-Usage/Debugging/DESSERT.html)

• FireSim Debugging and Profiling Future Vision


Project Management, Contributing & Conclusion

FireSim Tutorial RISC-V Summit 2018

Speaker: David Biancolin

Development Model – Branches

• FireSim and its key submodules (FireChip, MIDAS, aws-fpga) have two write-controlled branches

• master • Stable

• Contains the most recent release of FireSim

• Can replicate ISCA2018 paper results

• dev • Main development branch

• Base branch for feature PRs and non-critical bug fixes

• Every merge commit has globally shared, regenerated AGFIs


Development Model – Merges & Releases

• To merge a feature branch into dev: • Pass all MIDAS-level tests (in sim/: `make test`)

• Regenerate all AGFIs if RTL changes were made

• Associated submodule PRs should be merged; submodule pointers should point at merge commit on dev

• Changes to dev will be summarized in a “Dev-tracking PR” • PR description will become the Changelog entry

• Dev will be periodically merged into master and tagged as a release • Changelog will be updated

• Submodule dev branches will also be merged into master


How to Get Help

• FireSim and MIDAS add a lot of complexity on top of Rocket-Chip, Chisel, FIRRTL, Scala… • But there’s a team of us here to help!

• For general questions: • Post on the Google Group: https://groups.google.com/forum/#!forum/firesim

• Check out the docs: https://docs.fires.im/en/latest/

• (Nascent) FAQ section: https://docs.fires.im/en/latest/Advanced-Usage/FAQs.html




Reporting Issues

• If you think you’ve encountered a bug (in FireSim or MIDAS): • Open an issue on FireSim’s GitHub page

• Try your best to open an issue in the most relevant repo • Please don’t open duplicates across multiple repos!

• If you are unsure, default to FireSim and we’ll be happy to redirect you


Contributing

• We welcome PRs!

• Beginner-friendly issues are labeled “Good First Issue”

• Ask on the Google Group for suggestions


Regarding Future Work

• There are many research projects underway at UCB-BAR using and improving FireSim

• Many will be merged into mainline FireSim

• There will be API breakages; we apologize in advance

• Watch the changelogs, dev tracking PRs

Our goal: make FireSim the standard open way to do fast full-system simulation of Chisel-designed systems, especially ones based on Rocket-Chip.

Focus on:

• Generality (e.g. support multi-clock target designs)

• Robustness (e.g. more extensive testing in RTL simulation; CI; regressions)

• Debuggability (e.g. DESSERT features)

Two more talks from Berkeley Architecture Research


Both talks are open-sourcing their work and releasing FireSim images!

Conclusion

Key links:

• Website: https://fires.im

• Google Groups: https://groups.google.com/forum/#!forum/firesim

• Docs: https://docs.fires.im/en/latest/

• Twitter: @firesimproject

Acknowledgements: The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, RISE Lab sponsor Amazon Web Services, and ADEPT Lab affiliates Google, Huawei, Siemens, SK Hynix, and Seagate. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
