the open source protoflex simulator
TRANSCRIPT
![Page 1: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/1.jpg)
Computer Architecture Lab at
RAMP Retreat, June 2009
The Open SourceProtoFlex Simulator
Eric S. Chung, Michael K. Papamichael,James C. Hoe, Babak Falsafi, Ken Mai
![Page 2: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/2.jpg)
The ProtoFlex Simulator
• History
– Project started (circa 2007) to build scalable, full-system multiprocessor simulators using FPGAs
• Key Features
– Functional simulator for N-way UltraSPARC III server (~50-90 MIPS)
– Using hybrid simulation, runs real server apps + Solaris OS
– Employs multithreading to virtualize # CPUs per FPGA core
2Hybrid Simulation Virtualization
![Page 3: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/3.jpg)
Open Sourcing ProtoFlex
• Why open source?
– Demonstration of FPGAs as viable architecture research vehicle
– Facilitate adoption of hybrid simulation & host multithreading
– Encourage building on top of our work
• What are we releasing?
– Bluespec source HDL, Verilog and pre-generated netlistsfor SPARCV9 CPU model + interfaces
– XUPV5 Reference Design for EDK 10.1
– Virtutech Simics plug-ins for hybrid simulation
– Top-level SW controller, user command-line interface
– Documentation through online wiki 3
![Page 4: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/4.jpg)
Outline
• Motivation
• The ProtoFlex Simulator
– High level components
– UltraSPARC core model
• Using ProtoFlex
• XUPV5 Reference Design
• Distribution Details4
![Page 5: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/5.jpg)
The ProtoFlex Simulator
• User perceives familiar SW-like UltraSPARC III simulator
– Software user interface similar to Simics
– Applications load directly from Simics checkpoints
– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring
5
![Page 6: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/6.jpg)
The ProtoFlex Simulator
• User perceives familiar SW-like UltraSPARC III simulator
– Software user interface similar to Simics
– Applications load directly from Simics checkpoints
– Standard simulation features: state viewing, scripting, single-stepping, checkpointing, terminal, profiling/monitoring
6
FPGA Linux PC
PFMON
SIMICS
(I/O)
FPGA
Core
Main Memory
PowerPC
(or uBlaze)
Ethernet User
Interfac
e
![Page 7: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/7.jpg)
Our UltraSPARC III Core Model
• ISA Specifications
– 64-bit SPARCV9 ISA + US III extensions
– 8 register windows, 4 global register files
– 512-entry D-TLB, 128 I-TLB
• Implementation
– 14-stage, multi-threaded pipeline, switch context on each cycle
– On Virtex-5, XST~148MHz,Placed & routed @ 100MHz
– Parameterized non-blocking caches
– FP + rare MMU instructions areSW-emulated by nearby uBlaze
– 100% mirrors Virtutech Simics model 7
I-TLB Stage 1
I-TLB Stage 2
64-bit ALU Stage 1
Context Scheduler
I-Fetch Address Generate
I-Fetch Tag Check
US III Decoder
64-bit ALU Stage 2
D-TLB Stage 1
D-TLB Stage 2
D-TLB Stage 3
D-Cache Address Generate
D-Cache Tag Check
Multi-Cycle Instruction Unit
NonblockingI-cache(BRAM)
Writeback
Arbiter to DDR Memory
Integer RF(BRAM)
NonblockingD-cache(BRAM)
![Page 8: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/8.jpg)
Core Design Statistics
• Runs 100MHz on V5
– Synthesizes up to 148MHz using standard tools (ISE XST)
• Logic usage
– 23.5 KLUTs (11.3% LX330T)
• BRAM usage
– 120 BRAMs for 16-context configuration (37% LX330T)
• Future optimizations
– Paging structures to SRAM or DRAM can reduce BRAM by significant amount
– Will release in future updates 8
![Page 9: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/9.jpg)
Outline
• Motivation
• The ProtoFlex Simulator
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution
9
![Page 10: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/10.jpg)
Using ProtoFlex
• Add passive monitors
– Counters, histograms
– Roll your own
10
I-TLB Stage 1
I-TLB Stage 2
64-bit ALU Stage 1
Context Scheduler
I-Fetch Address Generate
I-Fetch Tag Check
US III Decoder
64-bit ALU Stage 2
D-TLB Stage 1
D-TLB Stage 2
D-TLB Stage 3
D-Cache Address Generate
D-Cache Tag Check
Multi-Cycle Instruction Unit
NonblockingI-cache(BRAM)
Writeback
Arbiter to DDR
Memory
Integer RF(BRAM)
NonblockingD-cache(BRAM)
• Trace-based simulation
– Collect dynamic traces
– Feed traces to functional-first timing model
• Sampled Program Monitoring
– Use micro-blaze (or PPC) to monitor core/memory state
– Unintrusive profiling w/o changes to target SW
CountersCountersCounters
Histogram
TrackerHistogram
TrackerHistogram
Trackers
Timing
Model
FPGA
Hard/Soft Core
(PowerPC or
MicroBlaze)
![Page 11: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/11.jpg)
Applications of ProtoFlex
• Examples
– Functional-first CMP cache coherency model for first-order timing models and functional warming *TRETS’09+
– Real-time stack trace profiling
– CMP interconnect model (in progress)
– Realistic CPU traffic generators (in progress)
• … running real 16-CPU server workloads
– Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K
11
Piranha CMP Cache
(First-Order Timing
Model)
Statistics + Warmed
Coherency &
Tag States
![Page 12: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/12.jpg)
What does the RTL look like?
• We use Bluespec System Verilog (high-level, synthesizable HDL)
– 4-8 weeks learning curve for normal HDL users
– Once learned, easier to read/modify than conventional RTL
– Requires BSV compiler (free for academics)
– Paper in MEMOCODE’09 describes BSV coding/validation of core
• Sample code:
12
rule split_ALU_pipeline (True);…p1 = piperegs[DECODE];piperegs[ALU1] <= doALUStage1(p1, alu_ifc);p2 = piperegs[ALU1];piperegs[ALU2] <= doALUStage2(p2, alu_ifc);…
endrule
rule merged_ALU_pipeline (True);…p1 = piperegs[DECODE];p_tmp1 = doALUStage1(p1, alu_ifc);p_tmp2 = doALUStage2(p_tmp1, alu_ifc);piperegs[ALU] <= p_tmp2;…
endrule
2-stage ALU 1-stage ALU
![Page 13: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/13.jpg)
Other Simulator Features
• Changing Core Parameters
– Number of CPU contexts
– Cache sizes
– Merge/split pipeline stages
– Enable/disable modules for profiling & debugging
– Clock frequency (tested @ 10 MHz – 100 MHz)
– Set optimal LUTRAM size (16 = V2P, 64 = V5)
– Choose LUTRAMs or BRAMs for any CPU state
• System Parameters
– UDP or TCP/IP (for PFMON-to-FPGA communication)
– XUPv5, BEE2
13
![Page 14: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/14.jpg)
Outline
• Motivation
• The ProtoFlex Simulator
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution
14
![Page 15: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/15.jpg)
Platform Release: XUPV5
• Why XUPv5?
– Inexpensive (~$750), easily accessible
– Standard tool flows (EDK, ISE)
– Reference design portable to other platforms – just drop in our ‘pcores’
• Supporting other platforms
– Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)
– Plan to release with future updates
15
![Page 16: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/16.jpg)
Required Equipment
16
FPGA BoardLinux PC
Ethernet
BlueSPARCPFMON+ Simics
![Page 17: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/17.jpg)
Required Equipment
17
FPGA BoardLinux PC
Ethernet
BlueSPARCPFMON+ Simics
![Page 18: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/18.jpg)
18
XUPv5 Overview
• Virtex-5 LX110T
• DDR2 Memory
– up to 2GB
• 1ΜΒ SRAM
• 1Gbps Ethernet
• 3Gbps SATA
• Serial Port
![Page 19: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/19.jpg)
SRAMController
Multi-Port Memory Controller
Reference Design Block Diagram
19
PLB
BlueSPARC MicroBlaze
Ethernet BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
![Page 20: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/20.jpg)
SRAMController
Multi-Port Memory Controller
BlueSPARC
20
PLB
MicroBlaze
Ethernet BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
• EDK IP core
– connects to PLB & NPI
• Runs @ 100MHz
• 4 CPU contexts
• 64KB I&D L1 caches
BlueSPARC
![Page 21: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/21.jpg)
SRAMController
Multi-Port Memory Controller
Reference Design Block Diagram
21
PLB
BlueSPARC MicroBlaze
Ethernet BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
![Page 22: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/22.jpg)
SRAMController
Multi-Port Memory Controller
BlueSPARC
22
PLB
BlueSPARC MicroBlaze
Ethernet Serial Port
XUPv5
LX110T
DRAM SRAM
• 81% utilization
– Core 51% (76 out of 148)
– Rest 30% (45 out of 148)
BRAMBRAMBRAM
![Page 23: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/23.jpg)
SRAMController
Multi-Port Memory Controller
Reference Design Block Diagram
23
PLB
BlueSPARC MicroBlaze
Ethernet BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
![Page 24: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/24.jpg)
SRAMController
Multi-Port Memory Controller
Ethernet
24
PLB
BlueSPARC MicroBlaze
BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
• 4 MB/sec bandwidth
• 350 usec RTT latency
• Socket Abstraction
– using LWIP RAW interface
Ethernet
![Page 25: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/25.jpg)
SRAMController
Multi-Port Memory Controller
Reference Design Block Diagram
25
PLB
BlueSPARC MicroBlaze
Ethernet BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
![Page 26: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/26.jpg)
SRAMController
DDR2 Memory Controller
26
PLB
BlueSPARC MicroBlaze
Ethernet BRAM Serial PortBRAMBRAM
XUPv5
LX110T
DRAM SRAM
Multi-Port Memory Controller
• 1.5GB/s peak BW
• 115ns latency
• Multiple ports/interfaces
![Page 27: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/27.jpg)
Required Equipment
27
FPGA BoardLinux PC
Ethernet
BlueSPARCPFMON+ Simics
![Page 28: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/28.jpg)
Required Equipment
28
FPGA BoardLinux PC
Ethernet
BlueSPARCPFMON+ Simics
![Page 29: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/29.jpg)
29
Linux PC
• Software requirements
– SuSE Linux 10.1
– CAD tools + licenses (Bluespec compiler, Xilinx ISE/EDK)
– Simics 3.0.22
– Hybrid simulation plug-in modules
– ProtoFlex MONitor tool (PFMON)
![Page 30: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/30.jpg)
30
Linux PC
• Runs PFMON (ProtoFlex MONitor)
– Orchestrates communication between Simics & BlueSPARC
– Provides CLI interface to simulator (like Simics Console)
• Runs Simics
– Handles I/O, FPGA Core and memory initialization
BlueSPARCPFMON
Linux PC
Simics
![Page 31: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/31.jpg)
31
From RTL to Running System
• Bluespec Verilog
– Bluespec compiler
– ~30 minutes
• Verilog Bitstream
– Xilinx EDK
– ~ 3 hours
• BitstreamWorking System
– Stream mem. image over ethernet
– ~ 5 minutes (for 512MB image)
Bluespec code
Verilog code
Bitstream
1
2
3
1
2
WorkingSystem
3
![Page 32: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/32.jpg)
XUPv5 Caveats
• Limitations
– Due to BRAM limits, only 4 CPU contexts (compared to 16 on BEE2’s V2P70)
– Slow PC-to-FPGA latency via LwIP/Ethernet (1 key/sec in terminal)
– Limited DDR2 RAM capacity (up to 2GB)
• However…
– Our XUPv5 design still has much room for improvement
– Useful for familiarizing with ProtoFlex tools
– Can still run multithreaded workloads + perform monitoring
– Easily portable to more powerful platforms
32
![Page 33: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/33.jpg)
Outline
• Motivation
• The ProtoFlex Simulator
• Using ProtoFlex
• XUPv5 Reference Design
• Distribution
33
![Page 34: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/34.jpg)
Distribution Details
• Release dates
– All source code will be available in 1 week (June)
– Send email to [email protected] for more info
• Licensing
– GNU GPL
• Support
– Documentation: www.ece.cmu.edu/~protoflex
– Support mailing list/forum will be available soon
34
![Page 35: The Open Source ProtoFlex Simulator](https://reader033.vdocuments.mx/reader033/viewer/2022061614/62a071926313a22c1d701135/html5/thumbnails/35.jpg)
Thank you!
• Acknowledgments
– Funding for this work provided by: NSF CCF-0811702, NSF CNS-0509356, C2S2 Marco Center, and Sun Microsystems
– Paul Hartke & Xilinx for FPGA donations & email support
• Topics & References
– Hybrid Simulation and FPGA Core MultithreadingA Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08]
– Functional-First, First-Order Timing Model for a CMP cacheProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09]
– FPGA Core Design & Validation Using Flight Data RecorderImplementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09]
35