Transforming an implementation into a cycle-accurate simulator using BDN

Murali Vijayaraghavan and Arvind
Computer Science and Artificial Intelligence Laboratory, M.I.T.

RAMP Workshop, Austin, TX, June 25, 2009

http://csg.csail.mit.edu


IBM/MIT Collaboration (since Sept 2007)

Motivation: Create an ecosystem to foster and promote the use of the Power architecture in system research

Initial Goal: Create a flexible and synthesizable multithreaded, multicore PowerPC model that facilitates rapid architectural exploration
  parameterized for the number of threads and for the number and functionality of pipeline stages

Current Goals:
  Cycle-accurate modeling
  Open source distribution on widely available FPGAs by summer 2010


The Team

Architecture and Bluespec Coding: K. Ekanadham, Jessica Tseng; MIT: Asif Khan, Murali Vijayaraghavan

Linux OS Bring-up Team: Hubertus Franke, Jimi Xenidis

FPGA Prototyping Team: Richard Kaufman, Kai Schleupen

Managers: Nancy Greco, Pratap Pattnaik; MIT: Arvind


Results

A 64-bit embedded PowerPC was created from scratch in Bluespec SystemVerilog (BSV)

Implemented on an IBM internal FPGA platform that uses a Xilinx Virtex-5 LX330 chip

Linux was booted on it by Nov 2008

Jessica has ported this design onto the Xilinx XUPV5
  Takes up 92% of the area
  Runs at 20 MHz, but can probably be raised to 40 MHz


Issues in “Prototype” RTL to FPGA mapping

Some structures consume a disproportionate amount of FPGA resources: multiported register file, CAM, multiply, divide
  Can be implemented in multiple cycles to save resources

Prototype RTL implementations on FPGAs need to compensate for external memory timing

Lack of tools for mapping on multiple FPGAs

Cycle-accurate modeling


Bounded Data Flow Networks (BDNs) as a theoretical framework for cycle-accurate modeling of synchronous sequential machines (SSMs)

Murali Vijayaraghavan & Arvind [MEMOCODE 2009]


Implementing RTL on FPGAs

[Figure: a 3-read, 2-write register file in the target RTL (ASIC) is simulated on BRAMs in multiple cycles on the FPGA]

In general, functional correctness requires cycle accuracy
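The multi-cycle idea can be sketched in Python. This is only an illustration under stated assumptions, not the authors' design: the class name `MultiCycleRegFile`, the single-port memory (one access per host cycle) and the reads-before-writes ordering are all hypothetical choices. One model cycle of the 3-read, 2-write register file costs several host (FPGA) cycles, but the values observed per model cycle are those of an ideal register file.

```python
class MultiCycleRegFile:
    """3-read, 2-write register file emulated on a single-port memory.

    Assumption for this sketch: the memory allows one access per host
    (FPGA) cycle, so one model cycle costs #reads + #writes host cycles.
    """

    def __init__(self, size):
        self.mem = [0] * size
        self.host_cycles = 0    # cycles spent on the FPGA
        self.model_cycles = 0   # cycles of the simulated target

    def model_cycle(self, read_addrs, writes):
        """Perform one target cycle: up to 3 reads and up to 2 writes."""
        assert len(read_addrs) <= 3 and len(writes) <= 2
        results = []
        for addr in read_addrs:        # reads first, so they see the old state
            results.append(self.mem[addr])
            self.host_cycles += 1
        for addr, value in writes:     # then the writes, one per host cycle
            self.mem[addr] = value
            self.host_cycles += 1
        self.model_cycles += 1
        return results


rf = MultiCycleRegFile(32)
first = rf.model_cycle([1, 2, 3], [(1, 10), (2, 20)])
second = rf.model_cycle([1, 2, 3], [])
```

Here two model cycles take eight host cycles. The target's timing is counted in `model_cycles`, which is what a cycle-accurate simulator has to preserve.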


BDN as a refinement of an SSM

There is a bijective mapping between the inputs (outputs) of S and R

Cycle accuracy: for all n > 0, if I(k) matches for S and R (1 ≤ k ≤ n), then O(j) matches for S and R (1 ≤ j ≤ n)
  For a BDN, I(k) refers to the kth enqueue in each input FIFO

[Figure: SSM S and BDN R, each with inputs I1 ... In and outputs O1 ... Om]


Patient SSMs: SSMs with a “start” signal to update registers

[Figure: the combinational logic and registers of an SSM, and the same circuit with an "enable" signal added]
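A patient SSM can be modeled in a few lines of Python. This is a hedged sketch: the class `PatientSSM`, its method names and the running-sum example machine are illustrative assumptions. The point it shows is that the combinational output is always available, while the registers change only on a clock edge where `enable` is asserted.

```python
class PatientSSM:
    """A synchronous machine whose registers update only when `enable` is set."""

    def __init__(self, initial_state, output_fn, next_state_fn):
        self.state = initial_state
        self.output_fn = output_fn            # combinational output logic
        self.next_state_fn = next_state_fn    # combinational next-state logic

    def clock(self, inputs, enable):
        """One host clock edge; returns the combinational output."""
        out = self.output_fn(inputs, self.state)
        if enable:                            # the "start" signal gates the update
            self.state = self.next_state_fn(inputs, self.state)
        return out


# Hypothetical example machine: a running sum of its input.
acc = PatientSSM(0, lambda i, s: s + i, lambda i, s: s + i)
outs = [
    acc.clock(5, enable=False),   # waiting: state stays 0
    acc.clock(5, enable=False),   # still waiting
    acc.clock(5, enable=True),    # model cycle 1 completes, state becomes 5
    acc.clock(1, enable=True),    # model cycle 2 completes, state becomes 6
]
```

The machine can be held for any number of host cycles without changing what it computes per enabled cycle.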


SSM to BDN and refinements

[Figure: an SSM composed of S1, S2 (big) and S3 (big) is cut into patient SSMs, the patient SSMs are turned into BDNs, and the BDNs are refined into R1, R2' (small) and R3' (small)]


SSM to BDN

The translation has to be done such that the generated BDN is latency-insensitive, i.e., the input-output behavior of the BDN does not change if we change the latency of one of its component BDNs or the size of the FIFOs connecting the components


Implementing an SSM as a BDN

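[Figure: an SSM computing c = f(a, b) and d = b, and the corresponding BDN with input FIFOs a, b and output FIFOs c, d]

rule O when (!a.empty && !b.empty && !c.full && !d.full)
  c.enq(f(a.first, b.first)); d.enq(b.first); a.deq; b.deq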

This description can be easily translated into logic that serves as a wrapper for the original logic

The SSM and BDN have the same input-output behavior
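A minimal Python sketch of this single-rule wrapper follows. The `Fifo` class and the function `rule_O` are illustrative stand-ins for the FIFO interface and the rule; the FIFO capacity of 2 and the choice f(x, y) = x + y are assumptions made only for the example.

```python
from collections import deque


class Fifo:
    """Bounded FIFO exposing empty, full, first, enq and deq."""

    def __init__(self, capacity):
        self.items = deque()
        self.capacity = capacity

    @property
    def empty(self):
        return len(self.items) == 0

    @property
    def full(self):
        return len(self.items) >= self.capacity

    @property
    def first(self):
        return self.items[0]

    def enq(self, value):
        assert not self.full
        self.items.append(value)

    def deq(self):
        return self.items.popleft()


def rule_O(a, b, c, d, f):
    """Fire rule O once if its guard holds; return True if it fired."""
    if a.empty or b.empty or c.full or d.full:
        return False
    c.enq(f(a.first, b.first))
    d.enq(b.first)
    a.deq()
    b.deq()
    return True


a, b, c, d = Fifo(2), Fifo(2), Fifo(2), Fifo(2)
for x, y in [(1, 10), (2, 20)]:    # two model cycles worth of inputs
    a.enq(x)
    b.enq(y)
fired = [rule_O(a, b, c, d, lambda x, y: x + y) for _ in range(3)]
```

Each firing of the rule corresponds to one cycle of the SSM: the nth values enqueued on a and b produce the nth values on c and d.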


Deadlocks

rule O when (!a.empty && !b.empty && !c.full && !d.full)
  c.enq(f(a.first, b.first)); d.enq(b.first); a.deq; b.deq

[Figure: two drawings of the BDN for f with FIFOs a, b, c and d]

Extraneous dependencies: d unnecessarily depends upon a and c


Another behavior for the same BDN

rule O1 when (!a.empty && !b.empty && !c.full && !cDone)
  c.enq(f(a.first, b.first)); cDone <= True

rule O2 when (!b.empty && !d.full && !dDone)
  d.enq(b.first); dDone <= True

rule In when (cDone && dDone)
  a.deq; b.deq; cDone <= False; dDone <= False

[Figure: the BDN for f with FIFOs a, b, c and d, extended with the flags cDone and dDone]

No extraneous dependencies, no deadlock
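The effect of splitting the rule can be checked with a small Python model. This is a sketch under assumptions: the class name `SplitRuleBdn`, the FIFO capacity of 2, the fixed order in which rules are tried and f(x, y) = x + y are illustrative. It shows that d is produced as soon as b arrives, even while a is still empty.

```python
from collections import deque


class SplitRuleBdn:
    """BDN for c = f(a, b), d = b using the rules O1, O2 and In."""

    def __init__(self, f, capacity=2):
        self.f = f
        self.capacity = capacity
        self.a, self.b, self.c, self.d = deque(), deque(), deque(), deque()
        self.c_done = False
        self.d_done = False

    def step(self):
        """Try O1, O2 and In once each, in that order; return the rules fired."""
        fired = []
        if self.a and self.b and len(self.c) < self.capacity and not self.c_done:
            self.c.append(self.f(self.a[0], self.b[0]))
            self.c_done = True
            fired.append("O1")
        if self.b and len(self.d) < self.capacity and not self.d_done:
            self.d.append(self.b[0])
            self.d_done = True
            fired.append("O2")
        if self.c_done and self.d_done:
            self.a.popleft()
            self.b.popleft()
            self.c_done = False
            self.d_done = False
            fired.append("In")
        return fired


bdn = SplitRuleBdn(lambda x, y: x + y)
bdn.b.append(10)        # only b has arrived; a is still empty
first = bdn.step()      # O2 can fire without a
second = bdn.step()     # nothing more can happen until a arrives
bdn.a.append(1)
third = bdn.step()      # O1 fires, then In retires the model cycle
```

With the single rule O, nothing would have been enqueued on d until a arrived, which is the extraneous dependency that can cause deadlock when d feeds back into a.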


Latency-Insensitive BDNs

No extraneous dependency property: if output Oi is not enqueued n times, assuming it is not full and all the inputs are enqueued n-1 times, then it must be that one of the inputs in Depends-on(Oi) is not enqueued n times

Self-cleaning property: if all outputs are enqueued n times, then all inputs must be dequeued n times

BDNs with these properties do not deadlock


Writing an LI-BDN wrapper for an SSM

Given the SSM:
  oj(t) = fj(ij1(t), ..., ijIj(t), s(t))    // ij1, ij2, ..., ijIj are in Depends-on(oj)
  s(t+1) = g(i1(t), i2(t), ..., s(t))

LI-BDN:
  rule Oj when (!donej)
    donej <= True
    oj.enq(fj(ij1.first, ..., ijIj.first, s))

  rule Finish when (done1 && done2 && ...)
    done1 <= False; done2 <= False; ...
    s <= g(i1.first, i2.first, ..., s)
    i1.deq; i2.deq; ...
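The wrapper scheme can be exercised with a generic Python model. Everything named here is an assumption made for illustration: the class `LiBdnWrapper`, the helper `run`, the example SSM (outputs `sum` and `echo`, state `s`) and the output FIFO capacity of 2. The emptiness and fullness checks that Bluespec would supply implicitly for `first` and `enq` are written out explicitly. The sketch feeds the inputs with skewed arrival times and compares the result against a direct cycle-by-cycle evaluation of the SSM.

```python
from collections import deque


class LiBdnWrapper:
    """LI-BDN wrapper around an SSM given by output functions fj and next-state g."""

    def __init__(self, input_names, outputs, next_state, state, capacity=2):
        self.inq = {name: deque() for name in input_names}
        self.outq = {name: deque() for name in outputs}
        self.outputs = outputs          # name -> (depends_on, fj)
        self.next_state = next_state    # g(first_values_dict, state)
        self.state = state
        self.capacity = capacity
        self.done = {name: False for name in outputs}

    def step(self):
        """Try every rule once; return True if any rule fired."""
        fired = False
        for name, (deps, fj) in self.outputs.items():             # rule Oj
            ready = all(self.inq[d] for d in deps)
            room = len(self.outq[name]) < self.capacity
            if not self.done[name] and ready and room:
                args = [self.inq[d][0] for d in deps]
                self.outq[name].append(fj(*args, self.state))
                self.done[name] = True
                fired = True
        if all(self.done.values()) and all(self.inq.values()):    # rule Finish
            firsts = {n: q[0] for n, q in self.inq.items()}
            self.state = self.next_state(firsts, self.state)
            for q in self.inq.values():
                q.popleft()
            self.done = dict.fromkeys(self.done, False)
            fired = True
        return fired


def run(bdn, batches):
    """Enqueue each batch of (input, value) pairs, then step until nothing fires."""
    collected = {name: [] for name in bdn.outq}
    for batch in batches:
        for name, value in batch:
            bdn.inq[name].append(value)
        while bdn.step():
            for name, q in bdn.outq.items():    # the sink drains every output
                collected[name].extend(q)
                q.clear()
    return collected


def ssm_trace(xs, ys, s=0):
    """Reference: the example SSM evaluated one clock cycle at a time."""
    sums, echoes = [], []
    for x, y in zip(xs, ys):
        sums.append(x + s)
        echoes.append(y)
        s = s + y
    return {"sum": sums, "echo": echoes}


outputs = {"sum": (("x",), lambda x, s: x + s), "echo": (("y",), lambda y, s: y)}
bdn = LiBdnWrapper(["x", "y"], outputs, lambda i, s: s + i["y"], state=0)
skewed = [[("y", 10), ("y", 20), ("y", 30)], [("x", 1)], [("x", 2), ("x", 3)]]
result = run(bdn, skewed)
```

The nth value dequeued from each output matches the SSM's output in cycle n, even though all of y arrives before any of x.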


The Wrapper Circuit

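[Figure: wrapper circuit around a patient SSM. Labels: input FIFO Ii (first, not-empty, deq), output FIFO Oj (value, enq, not-full), Depends-on(Oj), the donei register with a 1/0 mux, "all dones", "all input deqs", and the enable of the patient SSM]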


PPC In-order Pipeline

The designer specifies the FSM for each stage

The FIFOs are latency-insensitive, that is, the correctness of the specification does not depend upon the depth of FIFOs or the number of stages

[Figure: pipeline diagram. Stages: PC, Fetch, BrPred, Crack, Decode, RegRd, AddrCalc, BrRes, Mem1, Mem2, ALU, Excep, RegWr. Instruction side: I$/ITlb1, I$/ITlb2, Mem. Data side: D$/DTlb1, D$/DTlb2, Mem. Other labels: stall, bypass, epochs]


The steps in Cycle-accurate implementation on FPGAs

The specs are turned into Bluespec code to give a target SSM

Once the size of FIFOs is fixed the whole design has a precise timing specification

If the FPGA implementation requires refining some stages then cuts are made in the design to isolate the stages (SSMs) to be refined

Each SSM is turned into a BDN by introducing FIFOs for each input and output wire, including the wires going in and out of model FIFOs of the SSM

This converts the nth time cycle of the SSM into the nth enqueue into input FIFOs and nth dequeue from output FIFOs

Atomic rules for the operation of each BDN are defined so that no extraneous dependencies are introduced

This also ensures deadlock-free operation

Can be mechanized


Preliminary results

Cycle-accurate refinements onto Xilinx XUPV5 (Asif & Murali)

Slice Logic Utilization:
  Number of Slice Registers: 15448 out of 69120 (22%)
  Number of Slice LUTs: 16702 out of 69120 (24%)

Specific Feature Utilization:
  Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1 BRAM for the register file)
  Number of DSP48Es: 12 out of 64 (18%) (these are used for the divider)

Minimum period: 7.988 ns (Maximum Frequency: 125.188 MHz)

Partially verified by running a 50-instruction program

For comparison, Jessica's port onto the Xilinx XUPV5 takes up 92% of the area and runs at 20 MHz (possibly up to 40 MHz)

No numbers yet for actual work done
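The reported percentages and the maximum frequency can be re-derived from the raw counts. The short Python check below is only a sanity calculation. The helper names are hypothetical, and the use of truncation rather than rounding is an inference from the figures (12 out of 64 is 18.75%, reported as 18%).

```python
def percent(used, available):
    """Utilization truncated to a whole percent."""
    return used * 100 // available


def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


utilization = {
    "slice registers": percent(15448, 69120),
    "slice LUTs": percent(16702, 69120),
    "block RAM/FIFO": percent(1, 148),
    "DSP48Es": percent(12, 64),
}
fmax = round(max_frequency_mhz(7.988), 3)
```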


Conclusion

Cycle-accurate modeling of processors on FPGAs is feasible and offers a three-orders-of-magnitude improvement in performance over software simulators

BDNs offer a way to refine RTL without losing cycle accuracy

Bluespec makes quick RTL generation feasible

The generation of BDNs can be automated

We plan to release our Bluespec designs under open source licensing to strengthen the PowerPC ecosystem.


Related work

HAsim: Joel Emer, Michael Pellauer, et al. at Intel/MIT
  Cycle-accurate modeling using the A-Ports abstraction

UTFast: Derek Chiou and students at UT Austin
  Speculative functional model, corrected by timing model when necessary

Protoflex: James Hoe, Eric Chung, et al. at CMU

RAMP Gold: Krste Asanovic et al. at Berkeley

Luca Carloni et al. for latency-insensitive refinements


Thanks!


BDN Input/Output notation

Ii(n) represents the nth value enqueued in input buffer Ii

I(n) represents the nth values enqueued in all input buffers

Oj(n) represents the nth value dequeued from output buffer Oj

O(n) represents the nth values dequeued from all output buffers

[Figure: BDN R with input buffers I1 ... In (collectively I) and output buffers O1 ... Om (collectively O)]


Examples of primitive BDNs: Register

A register whose reads and writes must match

[Figure: register r with input FIFO a, output FIFO b and flag bDone]

Behavior:
  rule RO when (!b.full && !bDone)
    b.enq(r); bDone <= True

  rule RI when (!a.empty && bDone)
    r <= a.first; a.deq; bDone <= False

Initial values:
  bDone = False
  r = r0
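The register BDN can be simulated directly in Python. This is a minimal sketch: the class name `RegisterBdn`, the output FIFO capacity of 2 and the initial value 7 are assumptions for the example. It shows that reads and writes alternate, so the kth value read is the value written in model cycle k-1 (or r0 for the first read).

```python
from collections import deque


class RegisterBdn:
    """Primitive register BDN with input FIFO a, output FIFO b and flag bDone."""

    def __init__(self, r0, capacity=2):
        self.a = deque()        # values to be written, one per model cycle
        self.b = deque()        # values read, one per model cycle
        self.capacity = capacity
        self.r = r0
        self.b_done = False

    def step(self):
        """Fire one enabled rule and return its name, or None if none can fire."""
        if len(self.b) < self.capacity and not self.b_done:    # rule RO
            self.b.append(self.r)
            self.b_done = True
            return "RO"
        if self.a and self.b_done:                              # rule RI
            self.r = self.a.popleft()
            self.b_done = False
            return "RI"
        return None


reg = RegisterBdn(r0=7)
reg.a.extend([1, 2])
fired = [reg.step() for _ in range(5)]
```

The last step returns None because the output FIFO is full: the register stalls until the consumer dequeues, without losing or duplicating a value.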


Examples of primitive BDNs: Mux

A mux that accepts an input value on each input port but passes only the appropriate value to the output

[Figure: mux with predicate FIFO p, input FIFOs a and b, output FIFO c, and counters aCnt and bCnt]

Behavior:
  rule MuxO when (!c.full && !p.empty)
    if (p.first && !a.empty) then
      c.enq(a.first); a.deq; p.deq; bCnt <= bCnt + 1
    else if (!(p.first) && !b.empty) then
      c.enq(b.first); b.deq; p.deq; aCnt <= aCnt + 1

  rule MuxI1 when (aCnt > 0 && !a.empty)
    a.deq; aCnt <= aCnt - 1

  rule MuxI2 when (bCnt > 0 && !b.empty)
    b.deq; bCnt <= bCnt - 1

Initial values:
  aCnt = 0
  bCnt = 0
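A Python sketch of the mux BDN is given below, with several labeled assumptions: the class name `MuxBdn` and the capacity of 2 are illustrative, and the discard rules are given priority over MuxO. The sketch also dequeues p whenever an output is produced, and it requires the discard counter of the selected input to be zero before its head is used, so that a stale value is never forwarded. Those two guards are assumptions made here so that the model advances one predicate per model cycle.

```python
from collections import deque


class MuxBdn:
    """Primitive mux BDN: p selects between a and b, the result goes to c."""

    def __init__(self, capacity=2):
        self.p, self.a, self.b, self.c = deque(), deque(), deque(), deque()
        self.capacity = capacity
        self.a_cnt = 0    # unselected a values still to be discarded
        self.b_cnt = 0    # unselected b values still to be discarded

    def step(self):
        """Fire one enabled rule and return its name, or None if none can fire."""
        if self.a_cnt > 0 and self.a:                    # rule MuxI1
            self.a.popleft()
            self.a_cnt -= 1
            return "MuxI1"
        if self.b_cnt > 0 and self.b:                    # rule MuxI2
            self.b.popleft()
            self.b_cnt -= 1
            return "MuxI2"
        if len(self.c) < self.capacity and self.p:       # rule MuxO
            if self.p[0] and self.a and self.a_cnt == 0:
                self.c.append(self.a.popleft())
                self.p.popleft()                         # assumed: consume p
                self.b_cnt += 1                          # b's value is owed a discard
                return "MuxO"
            if not self.p[0] and self.b and self.b_cnt == 0:
                self.c.append(self.b.popleft())
                self.p.popleft()                         # assumed: consume p
                self.a_cnt += 1                          # a's value is owed a discard
                return "MuxO"
        return None


mux = MuxBdn()
mux.p.extend([True, False])
mux.a.append(1)                              # b has not arrived yet
trace = [mux.step(), mux.step()]             # output for cycle 1 needs only p and a
mux.b.extend([10, 20])
mux.a.append(2)
trace += [mux.step() for _ in range(4)]
```

The first output is produced before b arrives, and the late b value for that cycle is discarded when it shows up. This is how the mux avoids depending on the input it does not select.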


Composition of BDNs

If R1 and R2 are BDNs, then so is their parallel composition R

[Figure: R1 and R2 side by side, enclosed in R]

If R1 is a BDN, then so is the (Ii, Oj) iterative composition R of R1, provided Ii ∉ Depends-on(Oj)*

[Figure: R1 enclosed in R, with output Oj connected back to input Ii (Ii = Oj)]

* No direct combinational path


Deadlock-free BDN

[Figure: BDN R with input buffers I1 ... In (collectively I) and output buffers O1 ... Om (collectively O)]

Assuming an infinite sink, a BDN is deadlock-free if for all n > 0, if n values are enqueued into I then eventually n values will be dequeued from both O and I

We need a stronger property for deadlock-freeness to be preserved under composition