RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab, EECS UC Berkeley March 2010


RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors

Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic

Parallel Computing Lab, EECS, UC Berkeley
March 2010


Outline
  Overview
  RAMP Gold HW Architecture and Implementation
  RAMP Gold Software Infrastructure
  Usage Case and Live Demo
  Future work


Overview
Purpose of RAMP Gold
  An FPGA-based simulator of shared-memory multicore targets for the Par Lab
  Use cases: architecture, OS and applications
Highlights of RAMP Gold
  Works on the $750 Xilinx XUP V5 board
  Written in SystemVerilog; no special CAD tools required, works with standard FPGA CAD flows (Synplify/ISE/ModelSim)
  Two orders of magnitude faster than Simics+GEMS
  Runtime-configurable parameters without resynthesis
  Full RTL verification environment and software infrastructure
  BSD and GNU licenses


Simulation Jargon
Target vs. Host
  Target: the system/architecture being simulated, e.g. a SPARC v8 CMP
  Host: the platform on which the simulator runs, e.g. FPGAs
Functional model and timing model
  Functional: computes the instruction result
  Timing: determines how long the instruction takes to compute


RAMP Gold Overall Setup
[Figure: a single Xilinx Virtex 5/6 FPGA holding the functional models (functional state), the timing models (timing state) and the target memory, connected over Ethernet to a frontend app server (Linux PC).]
Both functional and timing models run on the FPGA
App server: controls the simulation and services syscalls/I/O


Target Machine Template
64-core SPARC v8 shared-memory machine
Configurable two-level cache + multichannel DRAM
[Figure: CPUs, each with its own L1 I$ and L1 D$, together with an interconnect, four L2 banks and four DRAM channels.]


RAMP Gold Performance vs Simics
PARSEC parallel benchmarks running on a research OS
>250x faster than a full-system simulator for a 64-core multiprocessor target

[Figure: speedup (geometric mean) vs. number of cores; values as read off the chart]
Number of cores                              4    8   16   32   64
Functional only                              2    3    5   10   34
Functional+cache/memory (g-cache)            6   10   21   44  106
Functional+cache/memory+coherency (GEMS)     7   15   36   69  263


Outline
  Overview
  RAMP Gold HW Architecture and Implementation
  RAMP Gold Software Infrastructure
  Usage Case and Live Demo
  Future work


RAMP Gold Model Key Concepts
Decoupled functional/timing model, both in hardware
  Enables many FPGA-fabric-friendly optimizations
  Increases modeling efficiency and module reuse
Host multithreading of both functional and timing models
  Hides emulation latencies and improves resource utilization
  Time-multiplexing effects are patched by the timing model
[Figure: functional model (pipeline, architectural state) coupled to the timing model (pipeline, timing state).]


Host multithreading
Example: simulating four independent CPUs
[Figure: the target model (CPU0 to CPU3) mapped onto one functional CPU model on the FPGA: a single pipeline (I$, IR, decode, ALU, D$) with four copies of the PC and of the GPRs, plus a thread-select stage.]

Host cycle:                                 0  1  2  3  4  5  6  7  8  9  10
Instruction in functional model pipeline:   i0 i0 i0 i0 i1 i1 i1 i1 i2 i2 i2
Target cycle:                               0           1           2
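The timeline on this slide can be reproduced with a few lines of Python. This is only a sketch of the idea, under assumptions that are not in the slides: thread selection is strict round-robin, every target instruction takes exactly one target cycle, and no thread ever stalls. The names `HostThread` and `run_host_pipeline` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostThread:
    """Per-target-CPU state kept by the single host pipeline (hypothetical)."""
    next_inst: int = 0  # index of the next target instruction to issue


def run_host_pipeline(num_targets, host_cycles):
    """Interleave `num_targets` target CPUs on one host pipeline.

    Assumes strict round-robin thread select and no stalls, so every target
    CPU advances by one instruction each time all threads have had a turn.
    Returns a list of (host_cycle, thread_id, instruction, target_cycle).
    """
    threads = [HostThread() for _ in range(num_targets)]
    trace = []
    for host_cycle in range(host_cycles):
        tid = host_cycle % num_targets            # round-robin thread select
        target_cycle = host_cycle // num_targets  # one target cycle per full rotation
        thread = threads[tid]
        trace.append((host_cycle, tid, f"i{thread.next_inst}", target_cycle))
        thread.next_inst += 1
    return trace


trace = run_host_pipeline(num_targets=4, host_cycles=11)
for host_cycle, tid, inst, target_cycle in trace:
    print(f"host {host_cycle:2d}  CPU{tid}  {inst}  target cycle {target_cycle}")
```

With four target CPUs, host cycles 0 to 3 carry instruction i0 of CPU0 to CPU3, host cycles 4 to 7 carry i1, and so on, which is the pattern shown in the timeline.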


Functional Model
Full SPARC v8 support (FP, MMU, I/Os)
Passes the SPARC v8 certification test
Runs Linux and a research OS
[Figure: functional model pipeline. Blocks shown: fetch (architecture state x64, host I$, host ITLB), decode (microcode ROM), target register file access (architecture register file x64), integer ALU and pipelined FPU, MMU (host DTLB), exception/write back (IO devices x64), a 16 KB unified host D$ with a 64-entry MSHR, and a DDR2 memory controller (225 MHz / 2 GB SODIMM). Arrows lead to and from the timing model.]


Timing Model
Simple CPU timing model but detailed memory timing model (i.e. every instruction takes 1 cycle except LD/ST)
Cache models: only tags are stored in BRAMs
  Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc.
  Models the 3 Cs (compulsory, capacity and conflict misses) but not the fourth C, coherence misses (coherence support coming soon)
DRAM model: bandwidth-delay pipe with optional QoS
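A minimal Python sketch of the two ideas above: a cache timing model only needs the tags to decide hit or miss, and the CPU timing model charges one cycle per instruction except for loads and stores. The LRU replacement policy, the latency values and the names `TagOnlyCache` and `instruction_cycles` are assumptions for illustration; the slides do not say which policy or latencies RAMP Gold uses.

```python
class TagOnlyCache:
    """Set-associative cache timing model that keeps tags only, no data.

    LRU replacement is an assumption made for this sketch.
    """

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, paddr):
        """Return True on a hit, False on a miss; update the tag state."""
        line = paddr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        hit = tag in tags
        if hit:
            tags.remove(tag)      # re-inserted below as most recently used
        elif len(tags) == self.ways:
            tags.pop(0)           # evict the least recently used tag
        tags.append(tag)
        return hit


def instruction_cycles(is_load_store, paddr, l1d, hit_latency=1, miss_latency=20):
    """Target cycles charged for one instruction (latencies are hypothetical)."""
    if not is_load_store:
        return 1                  # every non-memory instruction takes one cycle
    return hit_latency if l1d.access(paddr) else miss_latency


l1d = TagOnlyCache(size_bytes=1024, line_bytes=32, ways=2)
cycles = [instruction_cycles(True, addr, l1d) for addr in (0, 4, 512, 1024, 0)]
print(cycles)
```

Because size, associativity and line size are plain constructor arguments here, changing them needs no structural change, which mirrors the runtime-configurable parameters mentioned on the slide.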

[Figure: timing model datapath. CPU timing model: thread scheduler, target cycle count, scoreboard, L1 I$ and L1 D$ tag arrays with comparators, timing model config register. Banked L2 timing model: banks 0 to 3, each with L2 tags, a comparator and an MSHR, plus a timing model config register. DRAM channel timing model: four DRAM timing/QoS blocks plus a timing model config register. Labeled inputs: from functional model, I/O bus.]
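The "bandwidth-delay pipe" DRAM model named on this slide can be sketched as follows. The idea is that a channel accepts at most one request every few cycles (bandwidth) and each accepted request completes a fixed number of cycles later (delay). The class name, the parameter names and the numbers are hypothetical, and the optional QoS stage is left out.

```python
class BandwidthDelayPipe:
    """DRAM channel timing model: a rate limit followed by a fixed latency."""

    def __init__(self, cycles_per_request, latency):
        self.cycles_per_request = cycles_per_request  # inverse bandwidth
        self.latency = latency                        # fixed access delay
        self.next_free = 0  # first target cycle at which the channel is free

    def issue(self, now):
        """Issue a request at target cycle `now`; return its completion cycle."""
        start = max(now, self.next_free)       # queue behind earlier requests
        self.next_free = start + self.cycles_per_request
        return start + self.latency


channel = BandwidthDelayPipe(cycles_per_request=4, latency=100)
done = [channel.issue(now) for now in (0, 0, 50)]
print(done)
```

Two requests issued in the same cycle finish four cycles apart because of the bandwidth limit, while a request that arrives at an idle channel only pays the fixed latency.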


Debugging and Simulation Configuration
[Figure: frontend app server connected over Gigabit Ethernet to the frontend link (RX/TX) on the FPGA, which connects to timing model control, the functional model (microcode injector), the timing model (performance counters) and host DRAM; labeled 32-bit @ 90 MHz.]
Frontend app server
  Reliable Gigabit Ethernet connection to the FPGA
  Periodically polls the simulator to serve I/O requests
  Transparent to the target (no side effect on simulated timing)
64-bit hardware performance counters to collect runtime stats
  657 counters in the timing model + 10 host counters
  Can be read by either target apps or the app server
  Ring interconnect for counters (easy to add and remove)


Host Performance
Timing synchronization is the largest overhead
Tiny host $/TLBs are not on the performance-critical path
Host DRAM bandwidth is not a problem (<15% utilization)
[Figure: stacked bars showing percentage of host execution time (0 to 100) for blackscholes, bodytrack, fluidanimate, streamcluster, swaptions and x264. Categories: retired instructions, misc., FPU, microcode, host TLB miss, host D$ miss, host I$ miss, timing sync., idle by target stalls.]


Implementation
Single FPGA: 64 cores @ 90 MHz, 2 GB DDR2 SODIMM
~2 hours CAD turnaround time on a mid-range workstation
BRAM-bound, but has logic resources to fit more pipelines


Outline
  Overview
  RAMP Gold HW Architecture and Implementation
  RAMP Gold Software Infrastructure
  Usage Case and Live Demo
  Future work


Software Tools
SPARC cross compiler with binutils/gcc/glibc
  Supports most POSIX programs
  Static and dynamic linking support
  Built from GNU GCC (4.3.2)
Full software and HW debugging suite
  Low-cost XUP boards sometimes do not work out of the box
  FPGA CAD tools are very bad


Target Software
Proxy Kernel: single-protection-domain application host
  Runs programs statically linked against glibc
  Forwards I/O system calls to the x86/Linux host PC
  Presents a simple “hard-threads” API for multithreaded programs
  Very easy to modify
ROS: UCB’s manycore research OS
  Provides multiprogramming support
  Sufficiently POSIX-compliant to run many programs
  Much easier to modify than Linux
  Runs on more than 64 cores


Infrastructure
[Figure: software infrastructure flow. App source files (.S or .C) and a customized linker script (.lds) go through the GNU SPARC v8 compiler/linker to produce ELF binaries. A frontend test server talks over frontend links to three backends:
  Verilog simulation backend: RTL source files / netlist (.sv, .v), Xilinx Unisim library and SystemVerilog DPI interface in ModelSim SE / Questasim; host dynamic simulation libraries (.so) with libbfd, a disassembler and a C implementation of the frontend links; outputs simulation logs, with an optional checker and a C-Gold simulation module
  Functional simulation backend: C-Gold functional simulator; outputs the reference result
  HW backend: FPGA target (Xilinx XUP-V5, BEE3); outputs HW state dumps; a Linux machine handles I/Os and syscalls]


Case studies
  Parallel application studies for software programmers
  Parallel OS for system researchers
  Adding hardware performance counters for advanced debugging
  Micro-architecture studies: adding features and modifying existing timing models
  Adding new instructions: changing the functional model


Appserver 101
Appserver command-line options:
  Usage: sparc_app [-f<conf>] [-p<nprocs>] [-s] <htif> <kernel> [binary] [args]
Platform memory test:
  App server memory test:
    sparc_app -p64 hw memtest none
  Proxykernel memory test (stress test):
    sparc_app -p64 hw path/kernel.ramp path/memtest
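For scripted experiments it can help to assemble the appserver command line programmatically. The sketch below is a hypothetical helper, not part of the RAMP Gold release. It only uses the `-f` and `-p` options and the positional arguments from the usage string above, and it leaves out `-s`, whose exact form the slides do not make clear.

```python
def sparc_app_argv(htif, kernel, binary=None, args=(), nprocs=None, conf=None):
    """Build an argv list following the usage string shown on this slide.

    Hypothetical helper: it only formats arguments and does not run anything.
    """
    argv = ["sparc_app"]
    if conf is not None:
        argv.append(f"-f{conf}")      # option value is attached, as in -f<conf>
    if nprocs is not None:
        argv.append(f"-p{nprocs}")    # as in -p<nprocs>
    argv += [htif, kernel]
    if binary is not None:
        argv.append(binary)
        argv += list(args)            # arguments passed on to the target binary
    return argv


# The app server memory test from this slide.
print(" ".join(sparc_app_argv("hw", "memtest", "none", nprocs=64)))
```

The resulting list can be handed to `subprocess.run` on a machine where `sparc_app` is installed.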


For application programmers
Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change
Use ‘hard-threads’ to write a parallel ‘hello world’ program running on the proxykernel
Compile the program using the cross toolchain
  sparc-ros-gcc -o hello hello.cpp -lhart
Measure performance using performance counters
  sparc_app -s1 -p64 hw kernel.ramp hello
Change the target machine configuration on the fly and rerun the experiment
  edit file ‘appserver.conf’


For OS Developers
Usage model similar to that for application programmers
Proxykernel is a good starting point for learning the bootstrapping process
ROS is a fully functional kernel
Demo: boot the ROS kernel using the appserver
  sparc_app -p64 -fappserver_ros.conf hw your_kernel none


Adding Hardware Performance Counters
Two types of counter interface
  Global counter: <EN>
  Local (per-core) counter: <TID, EN>
Modify the Verilog file to add more counters on the ring.

perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global))
  gen_tm_counter(.gclk, .rst,
                 .bus_out(io_out), .bus_in(io_in), .bus_sel(),   // IO bus interface
                 .global_inc(global_counter_inc),
                 .local_inc(local_counter_inc),
                 .local_tid(local_counter_tid));

Modify the app server to support more counters:
  Add your counter definition in ‘TestAppServer/perfcnt.h’
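The two counter interfaces can be pictured with a small behavioural model in Python. This is a sketch under assumptions: a global counter increments whenever its enable is set, a local counter increments the entry selected by the thread ID when its enable is set, and counters wrap at 64 bits. The argument names echo the `global_inc`, `local_inc` and `local_tid` ports above, but the class itself is hypothetical and not the RTL.

```python
MASK64 = (1 << 64) - 1  # counters are 64 bits wide


class PerfCounters:
    """Behavioural model of global <EN> and local <TID, EN> counters."""

    def __init__(self, n_global, n_local, n_threads):
        self.global_counts = [0] * n_global
        # One row per local counter, one entry per target thread (core).
        self.local_counts = [[0] * n_threads for _ in range(n_local)]

    def tick(self, global_inc, local_inc, local_tid):
        """Apply one cycle of enables.

        global_inc: one bool per global counter
        local_inc:  one bool per local counter
        local_tid:  one thread ID per local counter
        """
        for i, enable in enumerate(global_inc):
            if enable:
                self.global_counts[i] = (self.global_counts[i] + 1) & MASK64
        for i, (enable, tid) in enumerate(zip(local_inc, local_tid)):
            if enable:
                row = self.local_counts[i]
                row[tid] = (row[tid] + 1) & MASK64


counters = PerfCounters(n_global=2, n_local=1, n_threads=64)
counters.tick(global_inc=[True, False], local_inc=[True], local_tid=[5])
counters.tick(global_inc=[True, True], local_inc=[False], local_tid=[5])
print(counters.global_counts, counters.local_counts[0][5])
```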


Adding Features to Timing Models
Timing models are much simpler than functional models
  ~1000 LoC vs 35,000 LoC
Example 1: Changing the cache replacement policy
Example 2: Adding memory QoS
  Lee et al., “Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks”, ISCA’08
  ~100 lines of code added in the timing model
  A new DRAM model
  Several memory-mapped registers added on the functional I/O bus for configuration purposes


Adding New Instructions
Adding instructions to a feed-through pipeline is straightforward
  FPU instructions were added as “new” instructions within a week
  Including: new register file, decode, exception/commit and microcode
Example: Adding new atomic instructions through microcode
  4 global scratchpad registers (not visible to the programmer) in the main integer register file for temporary storage
  Two write ports, so that scratchpad registers can be updated along with architectural register changes


Steps of Adding Instructions
Add proper decoding logic in function “decode_dsp_add_logic” of “regacc_dma.sv”
Update the writeback/exception stage in file “exception_dma.sv” to trap to microcode.
  Edit function “decode_microcode_mode” to trap to microcode
  Edit function “rd_gen” to write address to scratch register 0, and load data to scratch register 1
Edit microcode ROM ‘Microcode.sv’

//----------SWAP*-------
9: begin
  uco.uend = '0; uco.cwp_rs1 = '0; uco.cwp_rd = '0;
  uco.inst = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
end
10: begin
  uco.uend = '1; uco.cwp_rs1 = '0; uco.cwp_rd = '0;
  uco.inst = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
end
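One way to read the two microcode entries above is sketched here in Python rather than SystemVerilog. It rests on assumptions the slide does not confirm: the trapping instruction's first pass leaves the memory address in scratch register 0 and the word loaded from memory in scratch register 1, and the zeroed `rd` field of each micro-op is filled with the `rd` of the trapping instruction. The function name and the dictionary-based register file and memory are made up for illustration.

```python
def swap_via_microcode(regs, mem, rd, addr):
    """Model a SWAP-style atomic as one load pass plus two micro-ops.

    regs: dict mapping register number -> value (architectural registers)
    mem:  dict mapping address -> word
    """
    scratch = [0, 0, 0, 0]  # the 4 global scratchpad registers

    # First pass through the pipeline (assumed): save the address and the
    # loaded word into scratch registers 0 and 1.
    scratch[0] = addr
    scratch[1] = mem[addr]

    # Micro-op 9 (ST, rs1 = scratch 0, immediate 0): store r[rd] to memory.
    mem[scratch[0] + 0] = regs[rd]

    # Micro-op 10 (IADD, rs1 = scratch 1, immediate 0): move the old memory
    # word into r[rd]; uend = 1 marks the end of the microcode sequence.
    regs[rd] = scratch[1] + 0


regs = {1: 0xAA}
mem = {0x100: 0xBB}
swap_via_microcode(regs, mem, rd=1, addr=0x100)
print(hex(regs[1]), hex(mem[0x100]))
```

Under this reading the scratch registers are what let the second micro-op still see the old memory word after the store has overwritten it.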


Future work
  Cache coherence models (soon)
  Realistic interconnect model (soon)
  Better CPU core model (next major version)
  Support for other ISAs (next major version)


Further References
Research papers
  Usage case: A Case for FAME: FPGA Architecture Model Execution, ISCA’10
  RAMP Gold design: RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors, DAC’10
Beta release
  http://sites.google.com/site/rampgold


Backup Slides


Functional/Timing Model Interface

// FM -> TM
typedef struct {
  bit        valid;    // timing token between FM and TM
  bit [5:0]  tid;      // thread ID
  bit        run;      // cpu states
  bit        replay;   // this instruction needs to replay by FM
  bit        retired;  // retiring an instruction
  bit [31:0] inst;     // the instruction that was retired
  bit [31:0] paddr;    // load/store physical address
  bit [31:0] npc;      // PC of next fetched insn
} tm_cpu_ctrl_token_type;

// TM -> FM
typedef struct {
  bit        valid;    // timing token between FM and TM
  bit [5:0]  tid;      // thread ID
  bit        run;      // run bit
} tm2cpu_token_type;
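To get a feel for the size of these tokens, the field lists can be mirrored in Python and packed into a single integer. The field names and widths come from the structs above. The bit layout (first field in the most significant bits) is an assumption made for this sketch, since the structs are not declared `packed`, and the helper names are hypothetical.

```python
# (name, width in bits), in declaration order, copied from the structs above.
FM2TM_FIELDS = [("valid", 1), ("tid", 6), ("run", 1), ("replay", 1),
                ("retired", 1), ("inst", 32), ("paddr", 32), ("npc", 32)]
TM2FM_FIELDS = [("valid", 1), ("tid", 6), ("run", 1)]


def token_width(fields):
    """Total number of bits in a token."""
    return sum(width for _, width in fields)


def pack_token(fields, values):
    """Pack field values into one integer, first field in the top bits."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_token(fields, word):
    """Inverse of pack_token: split an integer back into named fields."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


token = {"valid": 1, "tid": 63, "run": 1, "replay": 0, "retired": 1,
         "inst": 0x01000000, "paddr": 0x00001000, "npc": 0x00000104}
packed = pack_token(FM2TM_FIELDS, token)
print(token_width(FM2TM_FIELDS), token_width(TM2FM_FIELDS))
print(unpack_token(FM2TM_FIELDS, packed) == token)
```

The 6-bit `tid` field is what allows 64 target threads to share one token stream between the functional and timing models.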