Closely-Coupled Timing-Directed Partitioning in HAsim. Michael Pellauer† (pellauer@csail.mit.edu), Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡. †MIT CS and AI Lab, Computation Structures Group; ‡Intel Corporation, VSSAD Group. To Appear In: ISPASS 2008

Closely-Coupled Timing-Directed Partitioning in HAsim

Michael Pellauer†

pellauer@csail.mit.edu

Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡

†MIT CS and AI Lab

Computation Structures Group

‡Intel Corporation

VSSAD Group

To Appear In: ISPASS 2008

Motivation

We want to simulate target platforms quickly

We also want to construct simulators quickly

Partitioned simulators are a known technique from traditional performance models:

Functional Partition:
• ISA
• Off-chip communication

Timing Partition:
• Micro-architecture
• Resource contention
• Dependencies

The two partitions interact.

• Simplifies timing model
• Amortizes functional model design effort over many models
• Functional Partition can be extremely FPGA-optimized

Different Partitioning Schemes

As categorized by Mauer, Hill and Wood:

Source: [MAUER 2002], ACM SIGMETRICS

We believe that a timing-directed solution will ultimately lead to the best performance

Both partitions on the FPGA

Functional Partition in Software Asim

Get Instruction (at a given Address)

Get Dependencies

Get Instruction Results

Read Memory*

Speculatively Write Memory* (locally visible)

Commit or Abort instruction

Write Memory* (globally visible)

* Optional depending on instruction type
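The difference between a locally visible speculative write and a globally visible write can be illustrated with a small store-buffer sketch. This is not the Asim or HAsim implementation: the class name, the method names, and the assumption that tokens are integers handed out in program order are all hypothetical.

```python
class ToyFunctionalMemory:
    """Hypothetical model of speculative vs. globally visible stores."""

    def __init__(self):
        self.memory = {}        # globally visible state: addr -> value
        self.pending = {}       # locally visible stores: token -> (addr, value)
        self.committed = set()  # tokens whose instruction has committed

    def speculative_write(self, token, addr, value):
        # Visible only to this instruction and younger ones.
        self.pending[token] = (addr, value)

    def read(self, token, addr):
        # Assumption: a larger token means a younger instruction.
        older = [t for t, (a, _) in self.pending.items()
                 if a == addr and t <= token]
        if older:
            return self.pending[max(older)][1]
        return self.memory.get(addr, 0)

    def commit(self, token):
        self.committed.add(token)

    def abort(self, token):
        # Discard the speculative store; global memory is untouched.
        self.pending.pop(token, None)

    def global_write(self, token):
        if token not in self.committed:
            raise ValueError("only committed stores may become globally visible")
        addr, value = self.pending.pop(token)
        self.memory[addr] = value
        self.committed.discard(token)


mem = ToyFunctionalMemory()
mem.speculative_write(token=2, addr=0x10, value=7)
print(mem.read(token=3, addr=0x10))  # a younger load sees the pending store
print(mem.memory)                    # global memory is still empty
```

Keeping speculative stores in a separate structure means an abort only has to drop an entry, which is one plausible reason to split the write into two operations.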

Execution in Phases

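Phase sequences for different instructions (the letters appear to abbreviate the functional partition operations: F = fetch, D = decode, X = execute, R = read memory, W = write memory, C = commit, A = abort):

F D X R C
F D X W C W
F D X C
F D X R A
F D X X C W

The Emer Assertion:

All data dependencies can be represented via these phases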
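One way to read the phase strings is as an ordering rule on the functional partition's operations. The checker below is a sketch under assumptions that the slides do not state explicitly: phases before commit or abort never go backwards, there is exactly one commit or abort, and only a committed instruction may follow it with a globally visible write. The function name and the rank table are hypothetical.

```python
# Hypothetical rank of each pre-commit phase. Before C, a W is taken to be
# the speculative (locally visible) write.
PRE_COMMIT_RANK = {"F": 0, "D": 1, "X": 2, "R": 3, "W": 3}


def is_valid_phase_sequence(seq: str) -> bool:
    phases = seq.split()
    # Every instruction is fetched, decoded and executed first.
    if phases[:3] != ["F", "D", "X"]:
        return False
    # Exactly one resolution point: commit (C) or abort (A).
    ends = [i for i, p in enumerate(phases) if p in ("C", "A")]
    if len(ends) != 1:
        return False
    end = ends[0]
    before, after = phases[:end], phases[end + 1:]
    # Before resolution, phases may repeat but never go backwards.
    if any(p not in PRE_COMMIT_RANK for p in before):
        return False
    ranks = [PRE_COMMIT_RANK[p] for p in before]
    if ranks != sorted(ranks):
        return False
    # An aborted instruction makes nothing globally visible.
    if phases[end] == "A":
        return after == []
    # After commit, only globally visible writes remain.
    return all(p == "W" for p in after)


for s in ["F D X R C", "F D X W C W", "F D X C", "F D X R A", "F D X X C W"]:
    print(s, is_valid_phase_sequence(s))
```

All five sequences shown on the slide pass this check; a sequence such as `F X D C` or `F D X A W` does not.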

Detailed Example: 3 Different Timing Models

Executing the same instruction sequence:

Functional Partition in Hardware?

Requirements:
• Support these operations in hardware
• Allow for out-of-order execution, speculation, rollback

Challenges:
• Minimize operation execution times
• Pipeline wherever possible
• Tradeoff between BRAM/multiport RAMs
• Race conditions due to extreme parallelism

Functional Partition As Pipeline

Conveys concept well, but poor performance

[Figure: the Functional Partition drawn as a pipeline with stages Token Gen, Fet, Dec, Exe, Mem, LCom and GCom, connected to the Timing Model, with Register State (RegFile) and Memory State.]

Implementation: Large Scoreboards in BRAM

Series of tables in BRAM

Store information about each in-flight instruction

Tables are indexed by “token”

Tokens are also used by the timing partition to refer to each instruction

New operation “getToken” to allocate a space in the tables
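The token-indexed tables can be pictured with a short software sketch. This is an illustration only, not the HAsim hardware design: the class name, the choice of three tables, and the `release` method are hypothetical, and Python lists stand in for BRAM.

```python
from collections import deque


class TokenTables:
    """Hypothetical token-indexed tables, one entry per in-flight instruction."""

    def __init__(self, num_tokens=4):
        self.free = deque(range(num_tokens))      # unallocated token indices
        self.instruction = [None] * num_tokens    # one table per kind of info
        self.dependencies = [None] * num_tokens
        self.result = [None] * num_tokens

    def get_token(self):
        # Allocate a slot in every table and hand its index to the caller.
        if not self.free:
            raise RuntimeError("all tokens are in flight")
        token = self.free.popleft()
        self.instruction[token] = None
        self.dependencies[token] = None
        self.result[token] = None
        return token

    def release(self, token):
        # Assumed to happen once the instruction has committed or aborted.
        self.free.append(token)


tables = TokenTables()
t = tables.get_token()
tables.instruction[t] = "add r1, r2, r3"
print(t, tables.instruction[t])
```

Because both partitions name an instruction by the same small index, the timing partition never needs to hold the instruction's data itself.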

Implementing the Operations

See paper for details (also extra slides)

Assessment: Three Timing Models

Unpipelined Target

MIPS R10K-like out-of-order superscalar

5-Stage Pipeline

Assessment: Target Performance

Targets have idealized memory hierarchy

[Figure: bar chart "Target Processor CPI". Y-axis: Model Cycles per Instruction (CPI), 0 to 3.5. X-axis: median, multiply, qsort, towers, vvadd, average. Series: Unpipelined, 5-stage, Out-of-Order.]

Assessment: Simulator Performance

Some correspondence between target and functional partition is very helpful

[Figure: bar chart "Simulation Rate". Y-axis: FPGA-Cycles per Model Cycle (FMR), 0 to 45. X-axis: median, multiply, qsort, towers, vvadd, average. Series: Unpipelined, 5-Stage, Out-of-Order.]

Assessment: Reuse and Physical Stats

Where is functionality implemented:

[Table: rows Functional Partition, Unpipelined, 5-Stage, Out-of-Order; columns IMem, Program Counter, Branch Predictor, Scoreboard/ROB, RegFile, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback. Surviving entries: five N/A cells for Unpipelined, one N/A cell for 5-Stage.]

FPGA usage (Virtex IIPro 70, using ISE 8.1i):

                        Unpipelined   5-stage       Out of Order
FPGA Slices             6599 (20%)    9220 (28%)    22,873 (69%)
Block RAMs              18 (5%)       25 (7%)       25 (7%)
Clock Speed             98.8 MHz      96.9 MHz      95.0 MHz
Average FMR             41.1          7.49          15.6
Simulation Rate         2.4 MHz       14 MHz        6 MHz
Average Simulator IPS   2.4 MIPS      5.1 MIPS      4.7 MIPS
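Since FMR is defined as FPGA cycles per model cycle, the simulation rate follows from dividing the FPGA clock by the FMR. The sketch below applies that relation to the clock speeds and average FMRs reported above; the function and variable names are hypothetical. For the unpipelined and out-of-order models the result matches the reported rates after rounding, while for the 5-stage model it gives roughly 12.9 MHz against the reported 14 MHz. One possible explanation, stated here only as an assumption, is that the reported rate averages per-benchmark rates instead of being derived from the average FMR.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fmr: float) -> float:
    """Model cycles simulated per second, in MHz: FPGA clock / FMR."""
    return fpga_clock_mhz / fmr


# (clock speed in MHz, average FMR), copied from the FPGA usage figures.
designs = {
    "Unpipelined": (98.8, 41.1),
    "5-stage": (96.9, 7.49),
    "Out of Order": (95.0, 15.6),
}

rates = {name: simulation_rate_mhz(clk, fmr)
         for name, (clk, fmr) in designs.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} MHz")
```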

Future Work: Simulating Multicores

Scheme 1: Duplicate both partitions

[Figure: Timing Models A, B, C and D, each with its own Func Reg + Datapath, and one Functional Memory State. Label: "Interaction occurs here".]

Scheme 2: Cluster Timing Partitions

[Figure: Timing Models A, B, C and D with a single Functional Reg State + Datapath and one Functional Memory State. Labels: "Interaction still occurs here"; "Use a context ID to reference all state lookups".]
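The context ID idea in Scheme 2 can be sketched as one shared register state in which every lookup carries the ID of the requesting timing model. The class and method names are hypothetical, and the sketch ignores everything except register reads and writes.

```python
class SharedRegisterState:
    """Hypothetical functional register state shared by several timing models."""

    def __init__(self, num_contexts: int, num_regs: int = 32):
        # One register file per context, stored in a single structure.
        self.regs = [[0] * num_regs for _ in range(num_contexts)]

    def read(self, context_id: int, reg: int) -> int:
        return self.regs[context_id][reg]

    def write(self, context_id: int, reg: int, value: int) -> None:
        self.regs[context_id][reg] = value


state = SharedRegisterState(num_contexts=4)  # timing models A, B, C, D
state.write(context_id=0, reg=1, value=5)    # model A writes r1
print(state.read(context_id=0, reg=1))       # model A sees its own value
print(state.read(context_id=1, reg=1))       # model B's r1 is unaffected
```

The datapath logic is shared while the state stays separate per core, which is the point of indexing every lookup by context.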

Future Work: Simulating Multicores

Scheme 3: Perform multiplexing of timing models themselves

Leverage HAsim A-Ports in Timing Model

Out of scope of today’s talk

[Figure: Timing Models A, B, C and D, one Functional Reg State + Datapath, and one Functional Memory State. Labels: "Interaction still occurs here"; "Use a context ID to reference all state lookups".]

Future Work: Unifying with the UT-FAST model

UT-FAST is Functional-First

This can be unified into Timing-Directed: just do “execute-at-fetch”

[Figure: a functional emulator running in software exchanges "execution stream" and "resteer" messages with the FPGA; boxes labeled Emulator, Func Partition and Timing Partition, with four entries marked Ø.]

Summary

Described a scheme for closely-coupled timing-directed partitioning

Both partitions are suitable for on-FPGA implementation

Demonstrated such a scheme’s benefits: very good reuse, very good area/clock speed

Good FPGA-to-Model Cycle Ratio. Caveat: this assumes some correspondence between timing model and functional partitions (recall the unpipelined target)

We plan to extend this using contexts for hardware multiplexing [Chung 07]

Future: rare complex operations (such as syscalls) could be done in software using virtual channels

Questions?

pellauer@csail.mit.edu

Extra Slides

pellauer@csail.mit.edu

Functional Partition Fetch

Functional Partition Decode

Functional Partition Execute

Functional Partition Back End

Timing Model: Unpipelined

5-Stage Pipeline Timing Model

Out-Of-Order Superscalar Timing Model