ProtoFlex: FPGA-Accelerated Instrumentation

Computer Architecture Lab at

Michael K. Papamichael, Eric S. Chung, James C. Hoe, Babak Falsafi, Ken Mai

[email protected], {echung, jhoe, babak, kenmai}@ece.cmu.edu

Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

19-Aug-2008


Page 1: ProtoFlex: FPGA-Accelerated Instrumentation • Software Monitoring/Analysis – e.g. debugging, performance tuning, instruction set profiling • Rapid Exploration of new Architectures

Computer Architecture Lab at

PROTOFLEX:

FPGA-Accelerated Instrumentation

Michael K. Papamichael, Eric S. Chung, James C. Hoe, Babak Falsafi, Ken Mai

[email protected], {echung, jhoe, babak, kenmai}@ece.cmu.edu


Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

19-Aug-2008

Page 2

The Simulation Bottleneck

• Performance Simulation via Simulation Sampling

– Performance measurements obtained by sampling small segments of execution

[Figure: execution timeline of a sampled simulation. Functional Warming (functional simulation): long and NOT parallelizable. Detailed Warm-up (cycle-accurate simulation) and Measurement (cycle-accurate simulation): short and parallelizable. Also labeled: Checkpoints (i.e. system state snapshots).]

• Speed of cycle-accurate simulator inconsequential

• Functional Warming is the Real Bottleneck
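The bottleneck claim can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name, the simplification that everything outside the measured windows is functionally warmed, and every number in the example are hypothetical, not measurements from this work. It splits one sampled run into functional-warming time, which is serial, and cycle-accurate time, which can be spread over parallel workers.

```python
def sampled_run_seconds(total_instr, measured_instr, functional_ips,
                        detailed_ips, parallel_workers=1):
    """Estimate (warming_seconds, detailed_seconds) for one sampled run.

    Every argument is a hypothetical knob used for illustration.
    """
    # Functional warming covers the instructions outside the measured
    # windows and must run in order, so it cannot be split across workers.
    warming = (total_instr - measured_instr) / functional_ips
    # Detailed warm-up and measurement windows start from checkpoints,
    # so they can be handed to independent workers.
    detailed = measured_instr / detailed_ips / parallel_workers
    return warming, detailed


# Hypothetical workload: 10^12 instructions, 10^8 of them simulated in detail.
warming, detailed = sampled_run_seconds(
    total_instr=10**12, measured_instr=10**8,
    functional_ips=10**7, detailed_ips=10**5, parallel_workers=10)
print(f"warming: {warming:.0f} s, detailed: {detailed:.0f} s")
```

With inputs of this shape the serial warming term dominates, so speeding up the cycle-accurate simulator changes the total very little while speeding up functional warming changes it almost proportionally.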

Page 3

Faster Simulation w/ FPGAs

• Functional Warming requires

– Full-system functional simulator (e.g. Simics)

– Instrumentation (e.g. functional cache model)

[Figure: SW-based vs. HW-based functional warming. Labels: SW-based, Simics (16-cpu); HW-based, BlueSPARC (16-cpu); Functional CMP Cache Model; Functional Branch Predictor Model; "?"; Instrumented HW Simulator; Fast Functional Warming.]

Page 4

HW vs. SW Simulation Performance

• Functional Warming requires

– Full-system functional simulator (e.g. Simics)

– Instrumentation (e.g. functional cache model)

[Figure: bar chart, axis from 0 to 70. Series: BlueSPARC, Simics-fast, BlueSPARC w/ instrumentation, Simics w/ instrumentation. Diagram labels: SW-based (Simics), HW-based (BlueSPARC), Functional CMP Cache Model, Functional Branch Predictor Model.]

WITH Instrumentation Speedup: 37x

Page 5

Outline

• BlueSPARC Simulator (1-slide review)

• FPGA-Accelerated Instrumentation

– CMP Cache Simulator

– Branch Predictor Simulator

• Design Experiences & Future Work

[Diagram labels: BlueSPARC, Functional CMP Cache Model, Functional Branch Predictor Model]


Page 7

BlueSPARC Simulator

• Full-system HW-based Functional Simulator

– Models 16-cpu UltraSPARC III server

– Can boot OS, run commercial apps

• Virtualization Techniques

– Hybrid Full-System Simulation

– Multiprocessor Host Interleaving

[Figure: two numbered diagrams (1 and 2) illustrating the virtualization techniques. Labels: CPU, P, Memory, Devices, Common-case behaviors, Uncommon behaviors, 4-way P.]
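The two virtualization techniques can be pictured with a small software sketch. It is not the BlueSPARC implementation: the class, the function names, and the "common"/"uncommon" instruction tags are hypothetical, and the split of work between a fast host engine and a software fallback is an assumption based only on the "Common-case behaviors" and "Uncommon behaviors" labels in the figure. The loop time-multiplexes several target CPUs onto one host engine in round-robin order and hands uncommon behaviors to a slow path.

```python
from dataclasses import dataclass


@dataclass
class TargetCpu:
    """One simulated (target) CPU context; the field names are hypothetical."""
    cpu_id: int
    program: list          # sequence of "common" / "uncommon" instruction kinds
    pc: int = 0
    retired: int = 0

    def done(self):
        return self.pc >= len(self.program)


def software_fallback(cpu, kind):
    """Stand-in for the slow path that handles an uncommon behavior."""
    cpu.retired += 1


def run_interleaved(cpus):
    """Time-multiplex all target CPUs onto one host engine, round-robin."""
    fast_path = slow_path = 0
    while any(not cpu.done() for cpu in cpus):
        for cpu in cpus:                 # one instruction per CPU per turn
            if cpu.done():
                continue
            kind = cpu.program[cpu.pc]
            cpu.pc += 1
            if kind == "common":
                cpu.retired += 1         # handled directly by the host engine
                fast_path += 1
            else:
                software_fallback(cpu, kind)
                slow_path += 1
    return fast_path, slow_path


cpus = [TargetCpu(i, ["common", "common", "uncommon"]) for i in range(16)]
print(run_interleaved(cpus))
```

Because only one target CPU executes at a time, per-CPU state can live in shared host memories rather than in 16 replicated pipelines, which is the resource argument behind interleaving.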

Page 8

Outline

• BlueSPARC Simulator (1-slide review)

• FPGA-Accelerated Instrumentation

– CMP Cache Simulator

– Branch Predictor Simulator

• Design Experiences & Future Work

[Diagram labels: BlueSPARC, Functional CMP Cache Model, Functional Branch Predictor Model]

Page 9

CMP Cache Model

• Piranha-like CMP Cache Hierarchy

– Private L1 I&D Caches

– Single Shared L2 Cache (Victim Cache)

– L1 coherence maintained through directory in L2

Target Cache Model

– Multiple concurrent memory refs

– Directory for coherence

Virtualized Cache Model

– Memory refs serialized

– Parallel L1 accesses for coherence

[Figure: target and virtualized cache models side by side. Labels: P, L1, Shared L2, Shared L2 Directory.]
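A minimal sketch of the virtualized idea follows, under stated assumptions: memory references arrive one at a time, and on a write every other L1 is probed for the line instead of consulting a directory. The class name, the write-invalidate policy, and the unbounded per-CPU sets are all hypothetical simplifications; in hardware the probe loop would be a set of parallel tag lookups.

```python
class SerializedL1Model:
    """Tag-only L1s for several CPUs, fed one memory reference at a time."""

    def __init__(self, num_cpus):
        # One set of cached line addresses per CPU (capacity is not modeled).
        self.l1 = [set() for _ in range(num_cpus)]

    def read(self, cpu, line_addr):
        """Return True on an L1 hit; the line is cached afterwards."""
        hit = line_addr in self.l1[cpu]
        self.l1[cpu].add(line_addr)
        return hit

    def write(self, cpu, line_addr):
        """Return (hit, invalidated_cpus) under an assumed invalidate policy."""
        hit = line_addr in self.l1[cpu]
        # Probe every other L1 for the line; hardware would do this in parallel.
        sharers = [other for other, lines in enumerate(self.l1)
                   if other != cpu and line_addr in lines]
        for other in sharers:
            self.l1[other].discard(line_addr)
        self.l1[cpu].add(line_addr)
        return hit, sharers
```

Serializing the references is what makes the probe safe without a directory: no second reference can observe the L1s while one is still being processed.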

Page 10

FPGA-Accelerated CMP Cache Simulator

[Figure: simulator architecture. Labels: Architecture; Memory Refs; L1 I&D Caches (Instruction Caches, Data Caches; 2-way L1 caches); L2 Cache (8-way L2 cache, 8 ways, 8-way pseudo-LRU); Cache Contents; Statistics.]
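The slide names 8-way pseudo-LRU for the L2 but does not say which variant is used. The sketch below shows the common tree-based scheme as one plausible reading; the class name and bit convention are assumptions. Seven bits form a binary tree over the eight ways, and each bit points toward the half that was used less recently.

```python
class TreePLRU8:
    """Tree pseudo-LRU state for one 8-way set (7 bits)."""

    def __init__(self):
        # Node i has children 2i+1 and 2i+2.
        # Bit 0 means "next victim is on the left", 1 means "on the right".
        self.bits = [0] * 7

    def touch(self, way):
        """Record an access: flip the bits on the path away from `way`."""
        node = 0
        for level in (2, 1, 0):
            went_right = (way >> level) & 1
            self.bits[node] = 0 if went_right else 1
            node = 2 * node + 1 + went_right

    def victim(self):
        """Follow the bits from the root to pick the way to replace."""
        node = 0
        way = 0
        for _ in range(3):
            go_right = self.bits[node]
            way = (way << 1) | go_right
            node = 2 * node + 1 + go_right
        return way
```

Compared with true LRU, this needs 7 bits per set instead of an ordering over 8 ways, which matters when on-chip memory is the scarce resource.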

Page 11

Implementation Details

• Runs @ 100MHz on BEE2 board

• 2500 lines of fully parameterized Verilog

– Parameters: # CPUs, L1/L2 dimensions, # ways, etc.

• Purely Functional Model

– No timing info

– Only tags + status bits stored and updated

• FPGA Resource Usage (Virtex-II Pro 70)

         64KB L1s - 4MB L2    128KB L1s - 16MB L2
LUTs     7483 (11%)           7277 (11%)
BRAMs    134 (40%)            292 (89%)

• Limitations

– FPGA resource usage dominated by on-chip memory
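To make "only tags + status bits" concrete, here is a small software analogue of a purely functional cache: it keeps tags and hit/miss counters and never stores data or timing. It is a sketch, not the Verilog design. The class name, the 64-byte line size, and the exact-LRU replacement are assumptions, and the status bits are reduced to a tag simply being present in its set.

```python
class TagOnlyCache:
    """Set-associative cache model that tracks tags only (no data, no timing)."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes          # assumed line size
        self.num_sets = size_bytes // (ways * line_bytes)
        # Each set is a list of tags ordered from least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up one address; return True on a hit and update statistics."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)                  # move to most recently used
            tags.append(tag)
            self.hits += 1
            return True
        self.misses += 1
        if len(tags) == self.ways:
            tags.pop(0)                       # evict least recently used
        tags.append(tag)
        return False


l1 = TagOnlyCache(size_bytes=64 * 1024, ways=2)   # 64KB, 2-way, as on the slides
for addr in (0x0, 0x40, 0x0, 0x8000):
    l1.access(addr)
print(l1.hits, l1.misses)
```

Dropping the data array is why the storage cost scales with the number of lines rather than with cache capacity in bytes.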

Page 12

Outline

• BlueSPARC Simulator (1-slide review)

• FPGA-Accelerated Instrumentation

– CMP Cache Simulator

– Branch Predictor Simulator

• Design Experiences & Future Work

[Diagram labels: BlueSPARC, Functional CMP Cache Model, Functional Branch Predictor Model]

Page 13

Branch Predictor Model

• Typical 2-level Branch Predictor

– Meta predictor selects Bimodal or Gshare predictor

– 8-way Branch Target Buffer

• 16 BTBs (one per cpu) too large for BEE2 FPGA

Target BP Model

– One BTB per CPU

Virtualized BP Model

– Single Shared BTB for all CPUs

[Figure: target model with Meta, Gshare, Bimodal, and a BTB for each CPU; virtualized model with Meta, Gshare, and Bimodal for each CPU plus a Single Shared BTB.]

Page 14

Multiple BTBs vs. Single BTB

• OK to use single BTB? Generally no, but OK for

– Functional warming of homogeneous workloads

[Figure: bar chart "Separate BTBs vs. Single BTB (16K-entry, 8-way)". Y-axis: Overall Prediction Accuracy (%), 0 to 100. X-axis workloads: db2, oracle, apache, dss, em3d, ocean. Series: Separate BTBs, Single BTB.]

Single BTB achieves same accuracy as multiple BTBs

Page 15

Implementation Details

• Runs @ 100MHz on BEE2 board

• 700 lines of fully parameterized Bluespec

– Parameters: # CPUs, Predictor Sizes, BTB Size/Associativity

• Realistic Prototype Configuration

– 16 CPUs

– 8K-entry Meta, 32K-entry Bimodal, 8K-entry Gshare

– Single shared 16K-entry 8-way BTB

• FPGA Resource Usage (Virtex-II Pro 70)

– LUTs: 3938 (5%)

– BRAMs: 193 (58%)

• Limitations

– Single shared BTB may not perform accurately for all workloads

Page 16

Outline

• BlueSPARC Simulator (1-slide review)

• FPGA-Accelerated Instrumentation

– CMP Cache Simulator

– Branch Predictor Simulator

• Design Experiences & Future Work

[Diagram labels: BlueSPARC, Functional CMP Cache Model, Functional Branch Predictor Model]

Page 17

Design Experiences

• Identify opportunities for simpler designs

– Virtualization reduces resource requirements/complexity

– Less-constrained functional simulation environment

– Think about specific requirements of application

• Efficient mapping to FPGA resources is crucial

– Reorganizing the cache modules allowed for 2x larger designs

• Existence of SW reference design is important

– Reduces design time

– Simplifies verification

• Bluespec reduces design complexity


Page 18

Future Work

Other Instrumentation Applications

• Software Monitoring/Analysis

– e.g. debugging, performance tuning, instruction set profiling

• Rapid Exploration of new Architectures

– Simple functional models for first-order perf. results

– Detailed cycle-accurate models for high-fidelity simulation

• SW Developer/Educational Tool

– Real-time viewing of system state and statistics (check out our demo)

Future Directions

• Scale number of CPUs

• Augment simulation models with timing extensions

Page 19

Demo

• Web-based Real-time Viewing of Statistics

Page 20

Thanks! Any questions?

[email protected]

www.ece.cmu.edu/~protoflex

Acknowledgements

We would like to thank our colleagues in the RAMP and TRUSS projects.