Computer Architecture Lab at
PROTOFLEX:
FPGA-Accelerated Instrumentation
Michael K. Papamichael, Eric S. Chung, James C. Hoe, Babak Falsafi, Ken Mai
[email protected], {echung, jhoe, babak, kenmai}@ece.cmu.edu
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.
19-Aug-2008
The Simulation Bottleneck
• Performance Simulation via Simulation Sampling
– perf. measurements by sampling small segments of execution
[Figure: execution timeline. Functional Warming (functional simulation) is Long & NOT Parallelizable and produces Checkpoints (i.e. system state snapshots); Detailed Warm-up (cycle-accurate simulation) and Measurement (cycle-accurate simulation) are Short & Parallelizable.]
• Speed of cycle-accurate simulator inconsequential
• Functional Warming is the Real Bottleneck
[Figure labels: Functional CMP Cache Model, Functional Branch Predictor Model.]
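The two bullets above can be checked with a back-of-the-envelope calculation. The sketch below is in Python and assumes a simple model: functional warming runs serially, while detailed warm-up and measurement segments are spread evenly over several hosts. The function name and every number are hypothetical placeholders, not measurements from this work.

```python
def sampling_time_hours(functional_hours, detailed_hours_total, hosts):
    """Wall-clock time of one sampled simulation run under the assumed model.

    functional_hours: serial functional warming (cannot be split across hosts)
    detailed_hours_total: sum of all detailed warm-up + measurement segments
    hosts: number of machines running detailed segments in parallel
    """
    return functional_hours + detailed_hours_total / hosts


# Hypothetical inputs, chosen only for illustration.
FUNCTIONAL_HOURS = 100.0
DETAILED_HOURS = 50.0
HOSTS = 50

baseline = sampling_time_hours(FUNCTIONAL_HOURS, DETAILED_HOURS, HOSTS)
# Make the cycle-accurate simulator 10x faster: almost no change.
faster_detailed = sampling_time_hours(FUNCTIONAL_HOURS, DETAILED_HOURS / 10, HOSTS)
# Make functional warming 10x faster: the total drops sharply.
faster_warming = sampling_time_hours(FUNCTIONAL_HOURS / 10, DETAILED_HOURS, HOSTS)
```

With these placeholder inputs, a 10x faster cycle-accurate simulator only moves the total from 101 to about 100.1 hours, while a 10x faster functional warming phase brings it down to 11 hours.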
Faster Simulation w/ FPGAs
• Functional Warming requires
– Full-system functional simulator (e.g. Simics)
– Instrumentation (e.g. functional cache model)
[Figure: SW-based vs. HW-based functional warming. SW: Simics (16-cpu) with a Functional CMP Cache Model and a Functional Branch Predictor Model. HW: BlueSPARC (16-cpu) with the same two models (marked "?"), i.e. an Instrumented HW Simulator for Fast Functional Warming.]
HW vs. SW Simulation Performance
• Functional Warming requires
– Full-system functional simulator (e.g. Simics)
– Instrumentation (e.g. functional cache model)
[Figure: bar chart (axis 0 to 70) comparing BlueSPARC, Simics-fast, BlueSPARC w/ instrumentation, and Simics w/ instrumentation.]
WITH Instrumentation Speedup: 37x
Outline
• BlueSPARC Simulator (1-slide review)
• FPGA-Accelerated Instrumentation
– CMP Cache Simulator
– Branch Predictor Simulator
• Design Experiences & Future Work
BlueSPARC Simulator
• Full-system HW-based Functional Simulator
– Models 16-cpu UltraSPARC III server
– Can boot OS, run commercial apps
• Virtualization Techniques
– Hybrid Full-System Simulation
– Multiprocessor Host Interleaving
[Figure: the two virtualization techniques, numbered 1 and 2. Labels: CPU, Memory, Devices; Common-case behaviors vs. Uncommon behaviors; target processors (P) grouped onto 4-way P hosts that share Memory.]
CMP Cache Model
• Piranha-like CMP Cache Hierarchy
– Private L1 I&D Caches
– Single Shared L2 Cache (Victim Cache)
– L1 coherence maintained through directory in L2
Target Cache Model
– Multiple concurrent memory refs
– Directory for coherence
Virtualized Cache Model
– Memory refs serialized
– Parallel L1 accesses for coherence
[Figure: Target Cache Model (processors P, each with a private L1, above a Shared L2 with its directory) vs. Virtualized Cache Model (processors P issuing memory refs to a row of L1s).]
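The virtualized cache model can be illustrated with a small software sketch. The Python below is not the Verilog design from this work; it only shows the idea of serializing memory references and probing every other L1 on a write instead of consulting a directory. The class and function names, the direct-mapped organization, the 64-byte line size, and the invalidate-on-write policy are all simplifying assumptions, and the shared L2 is left out.

```python
class TagOnlyL1:
    """Direct-mapped, tag-only L1 (a simplification; no data is stored)."""

    def __init__(self, num_sets, line_bytes=64):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets
        self.hits = 0
        self.misses = 0

    def _split(self, addr):
        """Return (set index, tag) for a byte address."""
        line = addr // self.line_bytes
        return line % self.num_sets, line // self.num_sets

    def holds(self, addr):
        idx, tag = self._split(addr)
        return self.tags[idx] == tag

    def fill(self, addr):
        idx, tag = self._split(addr)
        self.tags[idx] = tag

    def invalidate(self, addr):
        idx, tag = self._split(addr)
        if self.tags[idx] == tag:
            self.tags[idx] = None


def simulate(refs, num_cpus, num_sets=256):
    """Process (cpu, addr, is_write) references strictly one at a time."""
    l1s = [TagOnlyL1(num_sets) for _ in range(num_cpus)]
    for cpu, addr, is_write in refs:
        own = l1s[cpu]
        if own.holds(addr):
            own.hits += 1
        else:
            own.misses += 1
            own.fill(addr)
        if is_write:
            # Hardware could probe all other L1s in parallel;
            # this loop stands in for that parallel access.
            for other_id, other in enumerate(l1s):
                if other_id != cpu:
                    other.invalidate(addr)
    return l1s


# Two CPUs share one line; CPU 0's write invalidates CPU 1's copy.
trace = [(0, 0x1000, False), (1, 0x1000, False),
         (0, 0x1000, True), (1, 0x1000, False)]
caches = simulate(trace, num_cpus=2)
```

Because references arrive one at a time, no two updates ever race, which is what lets the coherence check be a plain probe of the other L1s.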
Architecture
[Figure: FPGA-Accelerated CMP Cache Simulator. Memory refs feed the L1 I&D Caches (Instruction Caches and Data Caches, 2-way L1 caches) and the L2 Cache (8-way L2 cache, 8-way pseudo-LRU); each block holds Cache Contents and collects Statistics.]
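The figure names an 8-way pseudo-LRU policy for the L2 but does not say which variant is used. A common choice is tree-based pseudo-LRU, sketched below in Python under that assumption; the class name and bit layout are hypothetical and are not taken from the actual design.

```python
class TreePLRU8:
    """Tree pseudo-LRU state for one 8-way set (7 bits, assumed layout).

    Node i has children 2*i + 1 (left) and 2*i + 2 (right).
    A bit value of 1 means "the next victim is in the right subtree".
    """

    def __init__(self):
        self.bits = [0] * 7

    def touch(self, way):
        """Record an access to `way` (0..7): point each node away from it."""
        node = 0
        for level in (2, 1, 0):
            went_right = (way >> level) & 1
            self.bits[node] = 1 - went_right
            node = 2 * node + 1 + went_right

    def victim(self):
        """Follow the bits from the root to pick the way to replace."""
        node = 0
        way = 0
        for _ in range(3):
            go_right = self.bits[node]
            way = (way << 1) | go_right
            node = 2 * node + 1 + go_right
        return way


plru = TreePLRU8()
first_victim = plru.victim()     # untouched set: all bits point left
plru.touch(0)
after_touch0 = plru.victim()     # the tree now points away from way 0
```

The scheme needs only 7 bits per set instead of a full ordering of 8 ways, at the cost of sometimes picking a way that is not the true least recently used one.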
Implementation Details
• Runs @ 100MHz on BEE2 board
• 2500 lines of fully parameterized Verilog
– Parameters: # CPUs, L1/L2 dimensions, # ways, etc.
• Purely Functional Model
– No timing info
– Only tags + status bits stored and updated
• FPGA Resource Usage (Virtex-II Pro 70)

          64KB L1s - 4MB L2    128KB L1s - 16MB L2
  LUTs    7483 (11%)           7277 (11%)
  BRAMs   134 (40%)            292 (89%)

• Limitations
– FPGA resource usage dominated by on-chip memory
Branch Predictor Model
• Typical 2-level Branch Predictor
– Meta predictor selects Bimodal or Gshare predictor
– 8-way Branch Target Buffer
• 16 BTBs (one per cpu) too large for BEE2 FPGA
Target BP Model
– One BTB per CPU
Virtualized BP Model
– Single Shared BTB for all CPUs
[Figure: Target BP Model (each CPU has its own Meta, Gshare, Bimodal, and BTB) vs. Virtualized BP Model (each CPU has its own Meta, Gshare, and Bimodal; all CPUs use a Single Shared BTB).]
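To make the predictor structure concrete, here is a minimal Python sketch of a meta predictor choosing between a bimodal and a gshare predictor. It is not the Bluespec implementation from this work. The table sizes follow the prototype configuration listed under Implementation Details, but the 2-bit saturating counters, the indexing by word-aligned PC, the initial counter values, and all names are assumptions. The BTB is omitted.

```python
def _bump(table, idx, up):
    """Update one 2-bit saturating counter (range 0..3) in place."""
    table[idx] = min(3, table[idx] + 1) if up else max(0, table[idx] - 1)


class TournamentPredictor:
    def __init__(self, meta_entries=8192, bimodal_entries=32768,
                 gshare_entries=8192):
        self.meta = [1] * meta_entries        # >= 2 means "trust gshare"
        self.bimodal = [1] * bimodal_entries  # >= 2 means "predict taken"
        self.gshare = [1] * gshare_entries    # >= 2 means "predict taken"
        self.history = 0                      # global branch history register

    def _indices(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        return (word % len(self.meta),
                word % len(self.bimodal),
                (word ^ self.history) % len(self.gshare))

    def predict(self, pc):
        m, b, g = self._indices(pc)
        use_gshare = self.meta[m] >= 2
        counter = self.gshare[g] if use_gshare else self.bimodal[b]
        return counter >= 2

    def update(self, pc, taken):
        m, b, g = self._indices(pc)
        bimodal_ok = (self.bimodal[b] >= 2) == taken
        gshare_ok = (self.gshare[g] >= 2) == taken
        # Train the meta predictor only when the two components disagree.
        if bimodal_ok != gshare_ok:
            _bump(self.meta, m, gshare_ok)
        _bump(self.bimodal, b, taken)
        _bump(self.gshare, g, taken)
        self.history = ((self.history << 1) | int(taken)) % len(self.gshare)


bp = TournamentPredictor()
cold_prediction = bp.predict(0x4000)   # counters start weakly not-taken
for _ in range(4):
    bp.update(0x4000, True)            # train on an always-taken branch
warm_prediction = bp.predict(0x4000)
```

In a 16-CPU model, one such object per CPU would correspond to the per-CPU Meta, Gshare, and Bimodal tables in the figure.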
Multiple BTBs vs. Single BTB
• OK to use single BTB? Generally no, but OK for
– Functional warming of homogeneous workloads
[Figure: Separate BTBs vs. Single BTB (16K-entry, 8-way). Bar chart of Overall Prediction Accuracy (%), axis 0 to 100, for db2, oracle, apache, dss, em3d, and ocean; series: Separate BTBs, Single BTB.]
Single BTB achieves same accuracy as multiple BTBs
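One way to picture why a single BTB can work for homogeneous workloads: when every CPU runs the same code, a branch trained by one CPU sits at the same PC for the others. The Python sketch below assumes a set-associative BTB with true LRU replacement and no per-CPU or per-process tagging; the class name and those policy choices are hypothetical, and only the 16K-entry, 8-way default comes from the slides.

```python
from collections import OrderedDict


class SharedBTB:
    """Set-associative BTB shared by all CPUs (LRU per set is an assumption)."""

    def __init__(self, entries=16384, ways=8):
        self.ways = ways
        self.num_sets = entries // ways
        # One ordered map per set: oldest entry first, newest last.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _locate(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        return self.sets[word % self.num_sets], word // self.num_sets

    def lookup(self, pc):
        """Return the stored target for `pc`, or None on a miss."""
        entries, tag = self._locate(pc)
        if tag in entries:
            entries.move_to_end(tag)  # mark as most recently used
            return entries[tag]
        return None

    def insert(self, pc, target):
        entries, tag = self._locate(pc)
        entries[tag] = target
        entries.move_to_end(tag)
        if len(entries) > self.ways:
            entries.popitem(last=False)  # evict the least recently used


btb = SharedBTB()
btb.insert(0x1000, 0x2000)        # trained while simulating one CPU
hit_target = btb.lookup(0x1000)   # another CPU at the same PC hits
miss_target = btb.lookup(0x1004)
```

With heterogeneous workloads, different programs would compete for the same sets and could alias on the same PC, which matches the "generally no" caveat on the slide.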
Implementation Details
• Runs @ 100MHz on BEE2 board
• 700 lines of fully parameterized Bluespec
– Parameters: # CPUs, Predictor Sizes, BTB Size/Associativity
• Realistic Prototype Configuration
– 16 CPUs
– 8K-entry Meta, 32K-entry Bimodal, 8K-entry Gshare
– Single shared 16K-entry 8-way BTB
• FPGA Resource Usage (Virtex-II Pro 70)
– LUTs: 3938 (5%)
– BRAMs: 193 (58%)
• Limitations
– Single shared BTB may not perform accurately for all workloads
Design Experiences
• Identify opportunities for simpler designs
– Virtualization reduces resource requirements/complexity
– Less-constrained functional simulation environment
– Think about specific requirements of application
• Efficient mapping to FPGA resources is crucial
– Reorganizing the cache modules allowed for 2x larger designs
• Existence of SW reference design is important
– Reduces design time
– Simplifies verification
• Bluespec reduces design complexity
Future Work
Other Instrumentation Applications
• Software Monitoring/Analysis
– e.g. debugging, performance tuning, instruction set profiling
• Rapid Exploration of new Architectures
– Simple functional models for first-order perf. results
– Detailed cycle-accurate models for high-fidelity simulation
• SW Developer/Educational Tool
– Real-time viewing of system state and statistics (Check out our DEMO)
Future Directions
• Scale number of CPUs
• Augment simulation models with timing extensions
Demo
• Web-based Real-time Viewing of Statistics
Thanks! Any questions?
[email protected]
http://www.ece.cmu.edu/~protoflex
Acknowledgements
We would like to thank our colleagues in the RAMP and TRUSS projects.