http://csg.csail.mit.edu transforming an implementation into a cycle-accurate simulator using bdn...
Post on 21-Dec-2015
220 views
TRANSCRIPT
http://csg.csail.mit.edu
Transforming an implementation into a cycle-accurate simulator using BDNMurali Vijayaraghavan and ArvindComputer Science and Artificial Intelligence LaboratoryM.I.T.
RAMP Workshop, Austin, TXJune 25, 2009
June 25, 2009 http://csg.csail.mit.edu
IBM/MIT CollaborationSept 2007 –
Motivation: Create an ecosystem to foster and promote the use of Power architecture in system researchInitial Goal: Create a flexible and synthesizable multithreaded, multicore PowerPC model that facilitates rapid architectural exploration
parameterized for the number of threads; the number and functionality of pipeline stages
Current Goals: Cycle-accurate modeling Open source distribution on widely available FPGAs by
summer 2010
June 25, 2009 http://csg.csail.mit.edu
The Team
Architecture and Bluespec Coding K. Ekanadham, Jessica Tseng MIT: Asif Khan, Murali Vijayaraghavan
Linux OS Bring-up Team Hubertus Franke, Jimi Xenidis
FPGA Prototyping Team Richard Kaufman, Kai Schleupen
Managers Nancy Greco, Pratap Pattnaik MIT: Arvind
June 25, 2009 http://csg.csail.mit.edu
ResultsA 64-bit embedded PowerPC was created from scratch in Bluespec System Verilog (BSV)
Implemented on an IBM internal FPGA Platform that uses Xilinx Virtex-5 LX330 chip
Linux was booted on it by Nov 2008
Jessica has ported this design onto Xilinx XUPV5
Takes up 92% of the area Running at 20Mhz but probably can be jacked up to
40MHz
June 25, 2009 http://csg.csail.mit.edu
Issues in “Prototype” RTL to FPGA mapping
Some structures consume a disproportionate amount of FPGA resources
multiported register file CAM multiply, divide
Prototype RTL implementations on FPGAs need to compensate for external memory timing
Lack of tools for mapping on multiple FPGAs
Can be implemented in multiple cycles to save resources
Cycle-accurate modeling
http://csg.csail.mit.edu
Bounded Data Flow Networks (BDNs) as a theoretical frame for cycle-accurate modeling of synchronous sequentional machines
Murali Vijayaraghavan & Arvind [MEMOCODE 2009]
June 25, 2009 http://csg.csail.mit.edu
Implementing RTL on FPGAs
3-read2-writeReg File
Simulateon
BRAMsin multiple
cycles
In general, functional correctness requires cycle accuracy
Target RTLASIC
On FPGA
June 25, 2009 http://csg.csail.mit.edu
BDN as a refinement of an SSM
There is a bijective mapping between the inputs (outputs) of S and Rfor all n > 0,
I(k) matches for S and R (1 k n) O(j) matches for S and R (1 j n)
SI1
In
O1
Om
RI1
In
O1
Om
CycleAccuracy
Refers to the kth enqueue in each input FIFO for a BDN
June 25, 2009 http://csg.csail.mit.edu
Patient SSMs: SSMs with a “start” signal to update registers
Combi--national
logic
Combi--national
logic
enable
June 25, 2009 http://csg.csail.mit.edu
S1S2
(big)S3
(big)
S1S2
(big)S3
(big)
S1S2
(big)S3
(big)
R1R2’
(small)R3’
(small)
SSM to BDNand refinements
SSM
Patient SSMs
cut
to BDNs
BDN
refine-ments
BDN
June 25, 2009 http://csg.csail.mit.edu
SSM to BDN
The translations has to be done such that the generated BDN is latency-insensitive, i.e., the input-output behavior of the BDN does not change if we change the latency of one of its component BDNs or the size of the FIFOs connecting the components
June 25, 2009 http://csg.csail.mit.edu
Implementing an SSM as a BDN
fa
bc
rule O when (a.emptyb.emptyc.full d.full) c.enq(f(a.first, b.first)); d.enq(b.first); a.deq ; b.deq
d
fa
bc
d
This description can be easily translated into logic that serves as a wrapper for the original logic
The SSM and BDN have the same input-output behavior
June 25, 2009 http://csg.csail.mit.edu
Deadlocks
fa
bc
rule O when (a.emptyb.emptyc.full d.full) c.enq(f(a.first, b.first)); d.enq(b.first); a.deq ; b.deq
d
fab
c
d
Extraneous dependencies --d unnecessarily depends upon a and c
June 25, 2009 http://csg.csail.mit.edu
Another behavior for the same BDN
rule O1 when (a.emptyb.emptyc.full cDone) c.enq(f(a.first, b.first)); cDone <= Truerule O2 when (b.emptyd.full dDone) d.enq(b.first); dDone <= Truerule In when (cDone dDone) a.deq ; b.deq; cDone <= False; dDone <= False;
fa
bc
d
fa
bc
d
No extraneous dependencies – No deadlock
dDone
cDone
June 25, 2009 http://csg.csail.mit.edu
Latency-Insensitive BDNsNo extraneous dependency property: if output Oi is not enqueued n times, assuming it is not full and all the inputs are enqueued n-1 times, then it must be that one of the inputs in Depends-on(Oi) is not enqueued n times
Self Cleaning property: If all outputs are enqueued n times then all inputs must be dequeued n times
BDNs with these properties and do not deadlock
June 25, 2009 http://csg.csail.mit.edu
Writing an LI-BDN wrapper for an SSM
Given the SSM:oj(t) = fj(ij1(t), ... ,ijIj(t), s(t))
// ij1, ij2, ... ijIj are in Depends-on(oj)
s(t+1) = g(i1(t), i2(t), ... , s(t))
LI-BDN:rule Oj when (donej)
donej <= True
oj.enq( fj(ij1.first, ... ,ijIj.first, s) )
rule Finish when (done1 done2 ...)
done1 <= False; done2 <= False; ...
s <= g(i1.first, i2.first, ... , s)
i1.deq ; i2.deq ; ...
June 25, 2009 http://csg.csail.mit.edu
The Wrapper Circuit
donei
Depends-on(Oj)
Oj
Alldones
All inputdeqs
deq
not-full
1 0
enq
not-
empt
y
Ii
value
first Patient SSM
enable
June 25, 2009 http://csg.csail.mit.edu
PPC In-order Pipeline
The designer specifies the FSM for each stageThe FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of FIFOs or the number of stages
stall bypass
I$/ITlb1
I$/ITlb2
Mem
PC Fetch BrPred Crack Decode RegRdAddrCalc
BrResMem
1
Mem2ALU
Excep
RegWr
D$/DTlb1
D$/DTlb2
Mem
epochs
June 25, 2009 http://csg.csail.mit.edu
The steps in Cycle-accurate implementation on FPGAs
The specs are turned into Bluespec code to give a target SSM
Once the size of FIFOs is fixed the whole design has a precise timing specification
If the FPGA implementation requires refining some stages then cuts are made in the design to isolate the stages (SSMs) to be refined
Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of model FIFOs of the SSM
This converts the nth time cycle of the SSM into the nth enqueue into input FIFOs and nth dequeue from output FIFOs
Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced
This also ensures deadlock-free operation
Can
be m
ech
an
ized
June 25, 2009 http://csg.csail.mit.edu
Preliminary resultsCycle-accurate refinements onto Xilinx XUPV5 (Asif & Murali)
Slice Logic Utilization: Number of Slice Registers: 15448 out of 69120 22% Number of Slice LUTs: 16702 out of 69120 24%
Specific Feature Utilization: Number of Block RAM/FIFO: 1 out of 148 0% (only 1 BRAM
for the register file) Number of DSP48Es: 12 out of 64 18% (these are used for
the divider) Minimum period: 7.988ns (Maximum Frequency:
125.188MHz) Partially verified by running a 50 instruction program
Compared to Jessica has port onto Xilinx XUPV5Takes up 92% of the area; 20Mhz 40Mhz
No numbers yet for actual
work done
June 25, 2009 http://csg.csail.mit.edu
Conclusion
Cycle-accurate modeling of processors on FPGAs is feasible and offers a 3-orders of magnitude improvement in performance over software simulatorsBDNs offer a way to refine RTL without losing cycle-accuracyBluespec is makes quick RTL generation feasible
The generation of BDNs can be automatedWe plan to release our Bluespec designs under open source licensing to strengthen PowrPC ecosystem.
June 25, 2009 http://csg.csail.mit.edu
Related work
HAsim: Joel Emer, Michael Pellauer, et al at Intel/MIT Cycle accurate modeling using the A-ports abstraction
UTFast: Derek Chiou and students at UT Austin speculative functional model, corrected by timing model when
necessaryProtoflex: James Hoe, Eric Chung et al at CMU
RAMP Gold: Krste Asanovic et al at Berkeley
Luca Carloni et al for Latency-Insensitive refinements
http://csg.csail.mit.edu
Thanks!
June 25, 2009 http://csg.csail.mit.edu
BDN Input/Output notation
Ii(n) represents the nth values enqueued in input buffer Ii
I(n) represents the nth values enqueued in all input buffers
Oj(n) represents the nth values dequeued from output buffer Oj
O(n) represents the nth values dequeued from all output buffers
RI1
In
O1
Om
I o
June 25, 2009 http://csg.csail.mit.edu
Examples of primitive BDNs:
Register
rule RO when (b.full bDone) b.enq(r); bDone <= True
rule RI when (a.empty bDone) r <= a.first; a.deq; bDone <= False
a br
bDone
Behavior
bDone = False
r = r0
Initial Values
A register whose reads and writes must
match
June 25, 2009 http://csg.csail.mit.edu
Examples of primitive BDNs:
Mux
rule MuxO when c.full p.empty if(p.first a.empty) then c.enq(a.first); a.deq; bCnt<=bCnt+1 else if(!(p.first) b.empty) then c.enq(b.first); b.deq; aCnt<=aCnt+1 rule MuxI1 when aCnt >0 a.empty a.deq; aCnt<=aCnt-1 rule MuxI2 when bCnt >0 b.empty b.deq; bCnt<=bCnt-1
bCnt
aCnt
p
a
bc
Behavior
aCnt = 0bCnt = 0
Initial values
A mux that accepts an input value on each input port but passes only the appropriate value to the
output
June 25, 2009 http://csg.csail.mit.edu
Composition of BDNsIf R1 and R2 are BDNs then so is the parallel composition of R1 and R2 (R = R1 R2)
R1
R2
R1
R2
R
* No direct combinational path
R1Ii Oj R1R
Ii = Oj
R1 is a BDN then so is the ( Ii ,Oj) iterative composition of R1 (R = (i,j) R1) provided Ii Depends-on(Oj)*
June 25, 2009 http://csg.csail.mit.edu
Deadlock-free BDN
RI1
In
O1
Om
I o
Assuming an infinite sink, a BDN is deadlock-free if for all n > 0, if n values are enqueued into I then eventually n values will be dequeued from both O and I
we need a stronger property for deadlock-freeness to be preserved under composition