
UC Regents Fall 2008 © UCB

CS 194-6 L3: Single-Cycle CPU

2008-9-22

John Lazzaro

(www.cs.berkeley.edu/~lazzaro)

CS 194-6 Digital Systems Project Laboratory

Lecture 3 – Single-Cycle CPU

www-inst.eecs.berkeley.edu/~cs194-6/

TA: Greg Gibeling



Topics for today’s lecture

Single-Cycle CPU Design

Instruction Set Architectures (ISAs)

Very Long Instruction Words (VLIW): Doing more work in a single cycle.



Instruction Set Architecture



New successful instruction sets are rare

[Figure: the instruction set sits between software and hardware.]

Implementors must live with the original sins of an ISA in order to support the installed base of software.


Instruction Sets: A Thin Interface

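[Figure: layers of abstraction, with the Instruction Set Architecture as the thin interface between software and hardware.]

Software: Application (iTunes), Operating System (Mac OS X), Compiler, Assembler

Instruction Set Architecture

Hardware: Processor, Memory, I/O system; Datapath & Control; Digital Design; Circuit Design; Transistors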

Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10

In Hexadecimal: 012A4020

Bitfield:   opcode   rs      rt      rd      shamt   funct
Fieldsize:  6 bits   5 bits  5 bits  5 bits  5 bits  6 bits
Binary:     000000   01001   01010   01000   00000   100000

“R-Format”
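The field packing above can be checked mechanically. A minimal Python sketch, assuming the R-format layout shown on this slide; the helper names encode_r and decode_r are illustrative and not part of any MIPS tool:

```python
def encode_r(opcode, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields into one 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def decode_r(word):
    """Split a 32-bit word back into its R-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# ADD $8 $9 $10: rd = 8, rs = 9, rt = 10, opcode = 0, funct = 0b100000
add_word = encode_r(0, 9, 10, 8, 0, 0b100000)
print(f"{add_word:08X}")  # 012A4020
```

Note that the assembly syntax lists the destination first (rd, rs, rt), while the encoding stores rs and rt ahead of rd.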


Hardware implements semantics ...

Instruction Fetch: Fetch next inst from memory: 012A4020

Instruction Decode: Decode fields (opcode rs rt rd shamt funct) to get: ADD $8 $9 $10

Operand Fetch: “Retrieve” register values: $9 $10

Execute: Add $9 to $10

Result Store: Place this sum in $8

Next Instruction: Prepare to fetch instruction that follows the ADD in the program.


ADD syntax & semantics, as seen in the MIPS ISA document.


Memory Instructions: LW $1,32($2)

Instruction Fetch: Fetch the load inst from memory

Instruction Decode: Decode fields (opcode rs rt offset, “I-Format”) to get: LW $1, 32($2)

Operand Fetch: “Retrieve” register value: $2

Execute: Compute memory address: 32 + $2

Result Store: Load memory address contents into: $1

Next Instruction: Prepare to fetch instr that follows the LW in the program. Depending on load semantics, new $1 is visible to that instr, or not until the following instr (“delayed loads”).
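The decode and address steps for this load can be sketched in Python. This is an illustration under stated assumptions: the I-format layout is opcode (6 bits), rs (5), rt (5), offset (16), LW uses opcode 100011, and the base value placed in $2 is made up for the example.

```python
def decode_i(word):
    """Split a 32-bit word into its I-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,
    }


LW_OPCODE = 0b100011

# LW $1, 32($2): rs = 2 (base register), rt = 1 (destination), offset = 32
lw_word = (LW_OPCODE << 26) | (2 << 21) | (1 << 16) | 32
fields = decode_i(lw_word)

regs = [0] * 32
regs[2] = 0x1000  # hypothetical base address held in $2

# The offset is positive here, so the extension method does not affect the result.
address = regs[fields["rs"]] + fields["imm"]
print(f"{lw_word:08X}", hex(address))  # 8C410020 0x1020
```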


LW syntax & semantics, as seen in the MIPS ISA document.


Branch Instructions: BEQ $1,$2,25

Instruction Fetch: Fetch branch inst from memory

Instruction Decode: Decode fields (opcode rs rt offset, “I-Format”) to get: BEQ $1, $2, 25

Operand Fetch: “Retrieve” register values: $1, $2

Execute: Compute if we take branch: $1 == $2 ?

Result Store: (nothing to store)

Next Instruction: ALWAYS prepare to fetch instr that follows the BEQ in the program (“delayed branch”). IF we take branch, the instr we fetch AFTER that instruction is PC + 4 + 100.

PC == “Program Counter”
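The delayed-branch fetch order described above can be sketched as follows. This is a Python illustration only: the function name fetch_order and the example PC value are assumptions, and the offset of 25 words (100 bytes) is taken from the slide.

```python
def fetch_order(branch_pc, offset_words, taken):
    """Return the addresses fetched for a BEQ, its delay slot, and what follows."""
    delay_slot_pc = branch_pc + 4            # ALWAYS fetched (delayed branch)
    if taken:
        following = branch_pc + 4 + 4 * offset_words
    else:
        following = delay_slot_pc + 4
    return [branch_pc, delay_slot_pc, following]


# BEQ $1,$2,25 at a made-up address 0x1000
print([hex(a) for a in fetch_order(0x1000, 25, taken=True)])
print([hex(a) for a in fetch_order(0x1000, 25, taken=False)])
```

With the branch taken, the third fetch is at 0x1000 + 4 + 100 = 0x1068; not taken, it is at 0x1008.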


BEQ syntax & semantics, as seen in the MIPS ISA document.


define: The Architect’s Contract

To the program, it appears that instructions execute in the correct order defined by the ISA.

What the machine actually does is up to the hardware designers, as long as the contract is kept.

As each instruction completes, the machine state (regs, mem) appears to the program to obey the ISA.


Single Cycle CPU Design



Single cycle data paths: Assumptions

Processor uses synchronous logic design (a “clock”).

[Figure: timing example (parallel-to-serial converter). The clock period T must satisfy T ≥ time(clk→Q) + time(mux) + time(setup).]

f          T
1 MHz      1 μs
10 MHz     100 ns
100 MHz    10 ns
1 GHz      1 ns
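The table is just T = 1 / f. A short Python check, with a helper name (clock_period_ns) chosen for this example:

```python
def clock_period_ns(freq_hz):
    """Clock period in nanoseconds for a clock frequency in hertz (T = 1 / f)."""
    return 1e9 / freq_hz


for freq_hz in (1e6, 10e6, 100e6, 1e9):
    print(f"{freq_hz / 1e6:g} MHz -> {clock_period_ns(freq_hz):g} ns")
```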

All state elements act like positive edge-triggered flip flops.

[Figure: D flip-flop with inputs D and clk and output Q.]

Reset ?


Review: Edge-Triggered D Flip Flops

[Figure: D flip-flop symbol (D, Q, CLK) and a timing diagram of D and Q.]

Value of D is sampled on positive clock edge.

Q outputs sampled value for rest of cycle.
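The sampling rule above can be modeled behaviorally. A minimal Python sketch, not a hardware description; the class name and the tick method are assumptions made for this example:

```python
class DFlipFlop:
    """Positive edge-triggered D flip-flop: Q changes only on a 0 -> 1 clock transition."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def tick(self, clk, d):
        """Present the current clk and d values; return Q."""
        if self._prev_clk == 0 and clk == 1:
            self.q = d            # sample D on the positive edge
        self._prev_clk = clk      # otherwise Q holds its value
        return self.q


ff = DFlipFlop()
inputs = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]   # (clk, d) pairs over time
trace = [ff.tick(clk, d) for clk, d in inputs]
print(trace)  # [0, 1, 1, 1, 0]
```

Q follows D only at the two rising edges; the change of D while the clock is high is ignored until the next edge.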


Review: Edge-Triggering in Verilog

module ff(D, Q, CLK);

input D, CLK;
output Q;

always @ (CLK) Q <= D;

endmodule

Module code has two bugs. Where?


Review: Edge-Triggered D Flip Flops

module ff(D, Q, CLK);

input D, CLK;
output Q;
reg Q;                            // Q is assigned inside an always block, so it must be a reg

always @ (posedge CLK) Q <= D;    // sample D only on the positive clock edge

endmodule


define: Single-cycle datapath

All instructions execute in a single cycle of the clock (positive edge to positive edge).

Advantage: a great way to learn CPUs.

Drawbacks: unrealistic hardware assumptions, slow clock period.


Recall: MIPS R-format instructions

Instruction Fetch

Instruction Decode

Operand Fetch

Execute: Add $9 to $10

Result Store

Next Instruction


Goal #1: An R-format single-cycle CPU

opcode rs rt rd shamt funct

Sample program:
ADD $8 $9 $10
SUB $4 $8 $3
AND $9 $8 $4
...

How registers get their initial values is not of concern to us right now.

No loads or stores: machine has no use for data memory, only instruction memory.

No branches or jumps: machine only runs straight line code.


Separate Read-Only Instruction Memory

[Figure: InstrMem block with a 32-bit Addr input and a 32-bit Data output.]

Reads are combinational: Put a stable address on input, a short time later data appears on output.

Not concerned about how programs are loaded into this memory.

Related to separate instruction and data caches in “real” designs.


Task #1: Straight-line Instruction Fetch

[Figure: InstrMem block with a 32-bit Addr input and a 32-bit Data output.]

Fetching straight-line MIPS instructions requires a machine that generates this timing diagram (“Requirements”):

CLK cycle:  1          2              3
Addr:       PC         PC + 4         PC + 8
Data:       IMem[PC]   IMem[PC + 4]   IMem[PC + 8]

Why +4 and not +1?

Why do we increment every clock cycle?

PC == Program Counter, points to next instruction.
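The timing diagram above can be reproduced with a small Python model. It assumes byte-addressed memory holding one 32-bit instruction per 4 bytes; the dictionary imem, its contents, and the function name are made up for the example.

```python
def straight_line_fetch(imem, start_pc, cycles):
    """Return the (Addr, Data) pair seen in each clock cycle."""
    pc = start_pc
    trace = []
    for _ in range(cycles):
        trace.append((pc, imem[pc]))      # combinational read of InstrMem
        pc = (pc + 4) & 0xFFFFFFFF        # value the PC captures at the next posedge
    return trace


# Hypothetical instruction memory: byte address -> 32-bit instruction word
imem = {0x0: 0x012A4020, 0x4: 0x01032022, 0x8: 0x01044824}
trace = straight_line_fetch(imem, 0x0, 3)
print([hex(addr) for addr, _ in trace])
```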


New Component: Register (for PC)

In later examples, we will add an “enable” input: clock edge updates state only if enable is high.

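[Figure: 32-bit register PC with input Din, output Dout, and Clk.]

Built out of an array of flip-flops.

[Figure: flip-flops sharing clk; bit i connects Din_i to Dout_i (Din0 to Dout0, Din1 to Dout1, Din2 to Dout2, ...).]

Logic design?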


New Component: A 32-bit adder (ALU)

Adder: Combinational: Put A and B values on inputs, a short time later A + B appears on output.

[Figure: 32-bit adder with inputs A and B and output A + B.]

ALU: Combinational part that is able to execute many functions of A and B (add, sub, and, or, ...). The “op” value selects the function; op needs about log2(#ops) bits.

[Figure: 32-bit ALU with inputs A, B, and op, and output A op B.]

Sometimes, extra outputs for use by control logic (for example, “Equal?”).
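A behavioral Python sketch of such an ALU, for illustration only. The operation names used as the “op” value, the four-entry operation set, and the Equal output definition (A == B) are assumptions of this example, not the encoding used in the labs.

```python
import math

MASK32 = 0xFFFFFFFF

# Hypothetical operation set; real designs choose their own op encoding.
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def alu(op, a, b):
    """Return (A op B truncated to 32 bits, Equal flag for control logic)."""
    result = ALU_OPS[op](a, b) & MASK32
    equal = (a & MASK32) == (b & MASK32)
    return result, equal


op_bits = math.ceil(math.log2(len(ALU_OPS)))   # width of the op input
print(alu("add", 9, 10), alu("sub", 0, 1), op_bits)
```

Subtracting 1 from 0 wraps to 0xFFFFFFFF because the result is truncated to 32 bits.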


Design: Straight-line Instruction Fetch

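[Figure: PC register (D, Q, Clk). Q drives the InstrMem Addr input and one adder input; the other adder input is the constant 0x4 (+4 in hexadecimal); the adder output feeds back to D.]

State machine design in the service of an ISA.

CLK cycle:  1    2        3
Addr:       PC   PC + 4   PC + 8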


Instruction Fetch

Instruction Decode

Operand Fetch

Execute: Add $9 to $10

Result Store

Next Instruction

Done! To continue, we need registers ...


MIPS Register file: From the top down

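[Figure: registers R0 ... R31 built from flip-flops with enables (D, En, Q) sharing clk. R0 is the constant 0.]

Read: two 32-bit MUXes, with 5-bit selects sel(rs1) and sel(rs2), drive outputs rd1 and rd2 (“two read ports”).

Write: a DEMUX with 5-bit select sel(ws) routes WE to the En input of one register; the 32-bit wd input goes to the D input of every register.

Why is R0 special?

How do we add a second write port?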
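A behavioral Python sketch of the register file drawn above, assuming (as the slide states) two combinational read ports, one clocked write port gated by WE, and R0 fixed at 0. The class and method names are illustrative.

```python
class RegFile:
    """32 x 32-bit register file: two read ports, one write port, R0 == 0."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        """Combinational read: the two MUXes select rd1 and rd2."""
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        """On the positive edge, write wd to register ws if WE is high."""
        if we and ws != 0:                 # R0 stays the constant 0
            self.regs[ws] = wd & 0xFFFFFFFF


rf = RegFile()
rf.clock_edge(True, 8, 19)     # write 19 to $8
rf.clock_edge(True, 0, 5)      # attempted write to R0 has no effect
rf.clock_edge(False, 9, 7)     # WE low: $9 is not updated
print(rf.read(8, 0), rf.read(9, 0))  # (19, 0) (0, 0)
```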


Register File Schematic Symbol

[Figure: RegFile symbol. Inputs: rs1 (5 bits), rs2 (5 bits), ws (5 bits), wd (32 bits), WE. Outputs: rd1 (32 bits), rd2 (32 bits).]

Why do we need WE?

If we had a MIPS register file w/o WE, how could we work around it?


Instruction Fetch

Instruction Decode

Operand Fetch

Execute: Add $9 to $10

Result Store

Next Instruction

What do we do with these?


Computing engine of the R-format CPU

[Figure: RegFile outputs rd1 and rd2 feed the ALU inputs; the ALU output feeds wd. Decode “Logic” generates rs1, rs2, ws, and op from the instruction fields.]

Decode fields to get: ADD $8 $9 $10

What do we do with WE?


Putting it all together ...

[Figure: the instruction fetch unit (PC, +0x4 adder, InstrMem) feeds the decode logic for rs1, rs2, ws, and op, which drives the RegFile and ALU.]

Is it safe to use same clock for PC and RegFile?


Recall: Our ideal-world D Flip-Flop

[Figure: D flip-flop symbol (D, Q, CLK) and a timing diagram of D and Q.]

Also assume: clocks arrive at all flip flops simultaneously.


Reminder: How data flows after posedge

[Figure: the R-format CPU (PC, +0x4 adder, InstrMem, decode Logic, RegFile, ALU), showing how data flows through the combinational logic after the positive clock edge.]


Next posedge: Update state and repeat

[Figure: the state elements, RegFile and PC, which update on the next positive clock edge.]

In this ideal world, as long as the clock is slow enough, the machine gets the right answer.

In Timing lecture, we look at the assumptions behind ideality.
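The whole R-format machine can be sketched as a behavioral Python loop, one iteration per clock cycle. This is an illustration under stated assumptions: funct codes 0x20, 0x22, and 0x24 for ADD, SUB, and AND, instruction memory modeled as a dictionary, and made-up initial register values.

```python
MASK32 = 0xFFFFFFFF

# funct codes for the three R-format operations in the sample program
FUNCT = {
    0x20: lambda a, b: a + b,   # ADD
    0x22: lambda a, b: a - b,   # SUB
    0x24: lambda a, b: a & b,   # AND
}


def encode_r(rd, rs, rt, funct):
    """Assemble 'OP rd rs rt' into an R-format word (opcode and shamt are 0)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | funct


def run(imem, regs, cycles):
    pc = 0
    for _ in range(cycles):
        word = imem[pc]                            # instruction fetch
        rs = (word >> 21) & 0x1F                   # decode
        rt = (word >> 16) & 0x1F
        rd = (word >> 11) & 0x1F
        op = FUNCT[word & 0x3F]
        result = op(regs[rs], regs[rt]) & MASK32   # operand fetch and execute
        # next posedge: RegFile and PC update together
        if rd != 0:
            regs[rd] = result
        pc += 4
    return regs


imem = {
    0: encode_r(8, 9, 10, 0x20),   # ADD $8 $9 $10
    4: encode_r(4, 8, 3, 0x22),    # SUB $4 $8 $3
    8: encode_r(9, 8, 4, 0x24),    # AND $9 $8 $4
}
regs = [0] * 32
regs[3], regs[9], regs[10] = 3, 9, 10   # hypothetical initial values
run(imem, regs, 3)
print(regs[8], regs[4], regs[9])  # 19 16 16
```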


Next Step ...

Design stand-alone machines for other major classes of instructions: immediates, branches, load/store.

Learn how to efficiently “merge” single-function machines to make one general-purpose machine.


Goal #2: add I-format ALU instructions

Syntax: ORI $8 $9 64    Semantics: $8 = $9 | 64

16-bit immediate extended to 32 bits.

In this example, $9 is rs and $8 is rt.

Zero-extend: 0x8000 ⇨ 0x00008000

Sign-extend: 0x8000 ⇨ 0xFFFF8000

Some MIPS instructions zero-extend immediate field, other instructions sign-extend.

[Inset slide: CS 152 L06 Single Cycle 1 (6), UC Regents Fall 2004 © UCB]

Step 1a: The MIPS-lite Subset for today

ADD and SUB: addU rd, rs, rt; subU rd, rs, rt
OR Immediate: ori rt, rs, imm16
LOAD and STORE Word: lw rt, rs, imm16; sw rt, rs, imm16
BRANCH: beq rs, rt, imm16

R-format: op [31:26], rs [25:21], rt [20:16], rd [15:11], shamt [10:6], funct [5:0] (6, 5, 5, 5, 5, 6 bits)
I-format: op [31:26], rs [25:21], rt [20:16], immediate [15:0] (6, 5, 5, 16 bits)
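The two extension methods above can be written as short Python functions. The function names are illustrative; the ORI line treats the immediate as zero-extended, and the value placed in $9 is made up.

```python
def zero_extend16(imm):
    """Extend a 16-bit field to 32 bits by filling the upper half with zeros."""
    return imm & 0xFFFF


def sign_extend16(imm):
    """Extend a 16-bit field to 32 bits by copying bit 15 into the upper half."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if (imm & 0x8000) else imm


print(f"{zero_extend16(0x8000):08X}")  # 00008000
print(f"{sign_extend16(0x8000):08X}")  # FFFF8000

# ORI $8 $9 64 with a hypothetical $9 = 0x0F
reg9 = 0x0F
reg8 = reg9 | zero_extend16(64)
print(hex(reg8))  # 0x4f
```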


Computing engine of the I-format CPU

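[Figure: RegFile and ALU as in the R-format engine, but the second ALU input comes from the 16-bit immediate field passed through an extender (Ext). Instruction fields: op (6 bits), rs (5 bits), rt (5 bits), immediate (16 bits).]

Decode fields to get: ORI $8 $9 64

In a Verilog implementation, what should we do with rs2?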


Merging data paths ...

[Figure: the I-format engine (RegFile, Ext, ALU) and the R-format engine (RegFile, ALU), one above the other.]

Add muxes: Where? How many? (ignore ALU control)

[Figure: N-bit two-input MUX symbol.]


The merged data path ...

[Figure: merged data path (RegFile, Ext, ALU) with added muxes. Control signals: RegDest, ALUsrc, ExtOp, ALUctr.]

If you watched it being designed, it’s understandable ...


Memory Instructions



Loads, Stores, and Data Memory ...

[Figure: Data Memory block. Inputs: Addr (32 bits), Din (32 bits), WE. Output: Dout (32 bits).]

Syntax: LW $1, 32($2)    Action: $1 = M[$2 + 32]

Syntax: SW $3, 12($4)    Action: M[$4 + 12] = $3

Zero-extend or sign-extend immediate field?

Writes are clocked: If WE is high, the memory location at Addr captures Din on positive edge of clock.

Reads are combinational: Put a stable address on Addr, a short time later Dout is ready.

Note: Not a realistic main memory (DRAM) model ...
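The read and write behavior described above can be modeled in Python. This is a sketch with assumptions: memory is a dictionary keyed by byte address, unwritten locations read as 0, the offsets in the example are positive, and the register values are made up.

```python
MASK32 = 0xFFFFFFFF


class DataMemory:
    """Idealized data memory: combinational reads, clocked writes gated by WE."""

    def __init__(self):
        self.words = {}

    def read(self, addr):
        return self.words.get(addr, 0)      # assumption: unwritten memory reads as 0

    def clock_edge(self, we, addr, din):
        if we:
            self.words[addr] = din & MASK32


def sw(mem, regs, rt, offset, rs):
    """SW rt, offset(rs): M[rs + offset] = rt"""
    mem.clock_edge(True, (regs[rs] + offset) & MASK32, regs[rt])


def lw(mem, regs, rt, offset, rs):
    """LW rt, offset(rs): rt = M[rs + offset]"""
    regs[rt] = mem.read((regs[rs] + offset) & MASK32)


mem = DataMemory()
regs = [0] * 32
regs[2], regs[3], regs[4] = 0xEC, 0xABCD, 0x100   # hypothetical values
sw(mem, regs, 3, 12, 4)     # SW $3, 12($4): M[0x10C] = 0xABCD
lw(mem, regs, 1, 32, 2)     # LW $1, 32($2): $1 = M[0xEC + 0x20] = M[0x10C]
print(hex(regs[1]))  # 0xabcd
```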


Adding data memory to the data path

[Figure: merged data path with Data Memory (Addr, Din, Dout, WE) added. Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg.]

Syntax: LW $1, 32($2)    Action: $1 = M[$2 + 32]

Syntax: SW $3, 12($4)    Action: M[$4 + 12] = $3

Load delay slot CPU, or not ?


Branch Instructions



Conditional Branches in MIPS ...

Syntax: BEQ $1, $2, 12

Action: If ($1 != $2), PC = PC + 4

Action: If ($1 == $2), PC = PC + 4 + 48

Immediate field codes # words, not # bytes. Why is this encoding a good idea?

Zero-extend or sign-extend immediate field? Why is this extension method a good idea?
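The two actions above can be combined into one next-PC function. A Python sketch, assuming the 16-bit immediate is sign-extended and counts words (so it is multiplied by 4); the function name and the example PC are made up.

```python
def beq_next_pc(pc, rs_value, rt_value, imm16):
    """Next PC for BEQ rs, rt, imm16, following the actions on this slide."""
    imm16 &= 0xFFFF
    # Sign-extend: treat the field as a signed word count (allows backward branches).
    offset_words = imm16 - 0x10000 if (imm16 & 0x8000) else imm16
    if rs_value == rt_value:
        return (pc + 4 + 4 * offset_words) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


print(hex(beq_next_pc(0x1000, 5, 5, 12)))       # taken: 0x1000 + 4 + 48
print(hex(beq_next_pc(0x1000, 5, 6, 12)))       # not taken: 0x1000 + 4
print(hex(beq_next_pc(0x1000, 5, 5, 0xFFFF)))   # offset -1 word: branch to itself
```

Coding the offset in words reaches four times as far as a byte offset would with the same 16 bits, and sign extension lets loops branch backward.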


Adding branch testing to the data path

[Figure: data path with Data Memory; the ALU gains an Equal output. Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg.]

Syntax: BEQ $1, $2, 12

Action: If ($1 != $2), PC = PC + 4

Action: If ($1 == $2), PC = PC + 4 + 48

Equal (wire into control)


Recall: Straight-line Instruction Fetch

[Figure: InstrMem block with a 32-bit Addr input and a 32-bit Data output.]

Fetching straight-line MIPS instructions requires a machine that generates this timing diagram:

CLK cycle:  1    2        3
Addr:       PC   PC + 4   PC + 8

PC == Program Counter, points to next instruction.


Recall: Straight-line Instruction Fetch

CLK cycle:  1    2        3
Addr:       PC   PC + 4   PC + 8

[Figure: PC register, adder with constant 0x4, and InstrMem, as in the straight-line fetch design.]

How do we add this behavior ?


Design: Instruction Fetch with Branch

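[Figure: instruction fetch with branch. In addition to the PC register, the 0x4 adder, and InstrMem, a second adder takes the extended (Extend) immediate field, and a mux controlled by PCSrc selects the next PC.]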


Single-Cycle Control



What is single cycle control?

[Figure: the complete data path (InstrMem, RegFile, Ext, ALU, Data Memory) with a control block.]

Control: Combinational Logic (Only Gates, No Flip Flops). Just specify logic functions!

Control inputs: instruction from InstrMem Data; Equal from the ALU.

Control outputs: RegDest, RegWr, ExtOp, ALUsrc, MemWr, MemToReg, PCSrc.

Fields passed to the data path: rs, rt, rd, imm.
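Since control is a pure logic function, it can be written as a lookup table. The Python sketch below is one plausible assignment for the MIPS-lite subset, not the course's official table. The signal polarities are assumptions of this example: RegDest = 1 selects rd, ExtOp = 1 means sign-extend, ALUsrc = 1 selects the extended immediate, MemToReg = 1 selects Dout, PCSrc = 1 selects the branch target. None marks a don't-care.

```python
# Opcodes for the MIPS-lite subset
R_TYPE, BEQ, ORI, LW, SW = 0x00, 0x04, 0x0D, 0x23, 0x2B
X = None  # don't care

NAMES = ("RegDest", "RegWr", "ExtOp", "ALUsrc", "MemWr", "MemToReg")
CONTROL = {
    #        RegDest RegWr ExtOp ALUsrc MemWr MemToReg
    R_TYPE: (1,      1,    X,    0,     0,    0),
    ORI:    (0,      1,    0,    1,     0,    0),
    LW:     (0,      1,    1,    1,     0,    1),
    SW:     (X,      0,    1,    1,     1,    X),
    BEQ:    (X,      0,    X,    0,     0,    X),
}


def control(opcode, equal):
    """Combinational control: outputs depend only on the opcode and Equal."""
    signals = dict(zip(NAMES, CONTROL[opcode]))
    signals["PCSrc"] = int(opcode == BEQ and bool(equal))
    return signals


print(control(LW, equal=False))
print(control(BEQ, equal=True))
```

Symbolic opcode names and a named-signal table are one way to keep the specification readable; an opcode outside the table raises KeyError here, which mirrors the "undefined" behavior for unsupported instructions.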


Two goals when specifying control logic

Bug-free: One “0” that should be a “1” in the control logic function breaks contract with the programmer.

Efficient: Logic function specification should map to hardware with good performance properties: fast, small, low power, etc.

Should be easy for humans to read and understand: sensible signal names, symbolic constants ...



In practice: Use behavioral Verilog

Advice: Carefully written Verilog will yield identical semantics in ModelSim and Synplicity. If you write your code in this way, many “works in ModelSim but not on Xilinx” issues disappear.

Always check log files, and inspect the output that tools produce!

Look for tell-tale Synplicity “warnings and errors” messages: “latch generated”, “combinational loop detected”, etc.

Automate with scripts if possible.


F06 152 Labs: A small subset of MIPS ...

What if some other instruction appears in the instruction stream?

For labs: undefined. Real world: exceptions.



Why not in labs? Doubles complexity!

[Inset slide] Precise interrupts are difficult to implement at high speed: want to start execution of later instructions before exception checks finished on earlier instructions.

[Inset slide: exception handling in a 5-stage pipeline (figure).]

Hold exception flags in pipeline until commit point (M stage).

Exceptions in earlier pipe stages override later exceptions.

Inject external interrupts at commit point (override others).

If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage.

Components in blue handle exceptions ...

Will cover this (pipelined CPU) example later in the term ...


VLIW: Very Long Instruction Words

[Excerpt: IEEE TRANSACTIONS ON COMPUTERS, VOL. 37, NO. 8, AUGUST 1988, p. 978]

[...] probably around 30-50 percent when compared to a tightly encoded machine like the VAX or Motorola 68000. The variable length main memory instruction encoding has an associated overhead of a few bits per operation, which coupled with main memory alignment constraints adds roughly an additional 5-10 percent.

Operations that cannot be initiated in a single instruction cycle are broken down into constituent suboperations. These constituents are usually substituted in-line, although certain operations such as the block register save and restore associated with procedure call are implemented via special subroutines. The overall code expansion due to this, as compared to a machine like the VAX that has an extensive library of microcoded “subroutines,” is difficult to quantify, but is probably in the neighborhood of 10-20 percent.

The compiler performs an enormous number of optimizations, most of which reduce the number of operations in the program, but some of which increase the number of operations with the goal of increasing parallel execution. The three most notorious code expanders are interblock trace selection (which can produce compensation code), loop unrolling, and in-line procedure substitution. All three of these are currently automatic and have been tuned to avoid undue code growth. These optimizations can increase the size of some small fragments of code by a large factor, but their overall effect seems to be to increase code size by a factor of around 30-60 percent, although the user can increase or decrease these factors arbitrarily through the use of compiler switches.

Many large (100K-300K lines) Fortran programs have been built on the TRACE. After unrolling and trace selection, the code size is approximately three times larger than VAX object code (compiled with the VAX/VMS Fortran compiler).

The concern about code size led us to implement a shared-libraries facility very early in our UNIX development. This has substantially reduced the size of the UNIX utilities images. The UNIX utilities consume approximately 20 Mbytes of disk space on a VAX, and approximately 60 Mbytes on our VLIW using shared libraries.

UNIX has been running on the TRACE and supporting its own development for some time. The principal advantage of Multiflow’s parallel processing technology is that it is transparent to its client. Thus, most of the challenging problems in developing an operating system and programming environment for the TRACE come not from its VLIW nature but from our intention to make the system into a first rate environment for high-performance engineering and scientific computation.

X. SUMMARY AND FUTURE WORK

This paper has introduced the Multiflow TRACE very long instruction word architecture.

Before this machine was built, some designers and researchers predicted that the negative side-effects of the VLIW/compacting compiler approach (object code size, compensation code, context swap time, and procedure call/return overhead) would likely swamp the machine’s performance gains [26]. These predictions were wrong: some challenges remain, but the substantial performance improvements that were promised are now being routinely realized.

It is too early to be able to separate out all the different contributions to performance in the TRACE. Our future work will concentrate on quantifying the speedups due to trace scheduling versus those achieved by more universal compiler optimizations. We will also be examining the efficacy of memory-bank disambiguation, speed/size tradeoffs of the fixed and variable instruction encoding schemes, and instruction cache usage statistics.

Compared to a standard scalar machine, we get significantly higher performance at only slightly higher cost; the extra functional units are cheap compared to the overhead of building the computer in the first place (memory, control, I/O, power, and packaging). With the vector approach, the parallel hardware “turns on” only occasionally, and the speed of some vector code is all that is improved (and VLIW’s get that speedup anyway). When using a multiprocessor to speed the solution of a single problem, you pay the full overhead of instruction execution and run-time synchronization per functional unit, without getting the fine-grained speedups a VLIW can offer.

While it is difficult to compare mid-end and high-end CPU implementations, our real-world experience on 25 million lines of compiled Fortran indicates that a VLIW can beat a comparable vector supercomputer by a factor of three. A VLIW machine should be the architecture of choice for future supercomputer implementations.


Josh Fisher: idea grew out of his Ph.D (1979) in compilers.

Led to a startup (MultiFlow) whose computers worked, but which went out of business ... the ideas remain influential.


Basic Idea: Super-sized Instructions

Example: All instructions are 64-bit. Each instruction consists of two 32-bit MIPS instructions that execute in parallel.

Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10

[Figure: A 64-bit VLIW instruction.]


VLIW Assembly Syntax ...

Instr: ADD $8 $9 $10
       ADD $7 $8 $9

“Instr:” denotes start of an instruction word. Listed operators all execute in parallel.

Instr: SUB $2 $3 $0
       OR $1 $5 $4

Execute in parallel.

Label: AND $5 $2 $3
       OR $1 $5 $4

Branch label name instead of default “instr”.

[...]


Assume: $7 = 7, $8 = 8, $9 = 9, $10 = 10 (decimal)

32-bit MIPS:

ADD $8 $9 $10 ; Result: $8 = 19
ADD $7 $8 $9 ; Result: $7 = 28

VLIW:

Instr: ADD $8 $9 $10 ; result $8 = 19
       ADD $7 $8 $9 ; result $7 = 17 (not 28)

32-bit & 64-bit semantics different? Yes!
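The difference comes from when operands are read. A Python sketch of both semantics, assuming every operator in a VLIW word reads its operands before any operator writes its result; the function names are illustrative.

```python
def run_sequential(regs, adds):
    """32-bit MIPS: each ADD sees the results of the ADDs before it."""
    regs = dict(regs)
    for rd, rs, rt in adds:
        regs[rd] = regs[rs] + regs[rt]
    return regs


def run_vliw_word(regs, adds):
    """VLIW: all operators in one instruction word read the old register values."""
    results = [(rd, regs[rs] + regs[rt]) for rd, rs, rt in adds]   # read phase
    regs = dict(regs)
    for rd, value in results:                                      # write phase
        regs[rd] = value
    return regs


initial = {7: 7, 8: 8, 9: 9, 10: 10}
adds = [(8, 9, 10), (7, 8, 9)]      # ADD $8 $9 $10 ; ADD $7 $8 $9

print(run_sequential(initial, adds)[7])   # 28
print(run_vliw_word(initial, adds)[7])    # 17
```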


Design: A 64-bit VLIW R-format CPU

No loads or stores: machine has no use for data memory, only instruction memory.

No branches or jumps: machine only runs straight line code.







VLIW: Straight-line Instruction Fetch

[Figure: PC register, adder with constant 0x8 (+8 in hexadecimal -- 64 bit instructions), and InstrMem with a 64-bit Data output.]

CLK cycle:  1    2        3
Addr:       PC   PC + 8   PC + 16

Simple changes to support 64-bit instructions ...


Computing engine of VLIW R-format CPU

[Figure: a 64-bit instruction holding two R-format words (opcode rs rt rd shamt funct), two ALUs, and one RegFile with four read ports (rs1 to rs4 select rd1 to rd4) and two write ports (ws1, wd1, WE1 and ws2, wd2, WE2).]


What have we gained with 64-bit VLIW?





If:
Clock speed remains the same.
All 32-bit operators do useful work.

Performance doubles!

N x 32-bit VLIW yields factor of N speedup!

Multiflow: N = 7, 14, or 28 (3 CPUs in product family)


What does N = 14 assembly look like?

[Excerpt: THE MULTIFLOW TRACE SCHEDULING COMPILER, p. 59]

Table 2. Hardware performance of the Trace 300 family.

                                   7/300    14/300   28/300
MOPs                               53       107      215
MFLOPS                             30       60       120
Main memory megabytes/s            123      246      492
Linpack 1000 x 1000                23       42       70
Linpack 100 x 100                  11       17       22
SPECmark                           NA       23       25
Sustainable operations in flight   10-13    20-26    40-52

instr cl0 ialu0e st.64   sb1.r0,r2,17#144
      cl0 ialu1e cgt.s32 li1bb.r4,r34,6#31
      cl0 falu0e add.f64 lsb.r4,r8,r0
      cl0 falu1e add.f64 lsb.r6,r40,r32
      cl0 ialu0l dld.64  fb1.r4,r2,17#208
      cl1 ialu0e dld.64  fb1.r34,r1,17#216
      cl1 ialu1e cgt.s32 li1bb.r3,r32,zero
      cl1 falu0e add.f64 lsb.r4,r8,r6
      cl1 falu1e add.f64 lsb.r6,r40,r38
      cl1 ialu0l st.64   sb1.r2,r1,17#152
      cl1 ialu1l add.u32 lib.r32,r36,6#32
      cl1 br true and r3 L2373
      cl0 br false or r4 L24?3;
instr cl0 ialu0e dld.64  fb0.r0,r2,17#224
      cl0 ialu1e cgt.s32 li1bb.r3,r34,6#30
      cl0 falu0e mpy.f64 lfb.r10,r2,r10
      cl0 falu1e mpy.f64 lfb.r42,r34,r42
      cl0 ialu0l st.64   sb0.r4,r2,17#160
      cl1 ialu0e dld.64  fb0.r32,r1,17#232
      cl1 ialu1e cgt.s32 li1bb.r4,r35,6#29
      cl1 falu0e mpy.f64 lfb.r10,r0,r10
      cl1 falu1e mpy.f64 lfb.r42,r32,r42
      cl1 ialu0l st.64   sb0.r6,r1,17#168
      cl1 ialu1l bor.32  ib0.r32,zero,r32
      cl1 br false or r4 L25?3
      cl0 br true and r3 L26?3;

Figure 8. TRACE 14/300 code fragment.

Figure 8 shows two instructions of 14/300 code, extracted from the inner loop of the 100 x 100 Linpack benchmark. Each operation is listed on a separate line. The first two fields identify the cluster and the functional unit to perform the operation; the remainder of the line describes the operation. Note the destination address is qualified with a register-bank name (e.g., sb1.r0); the ALUs could target any register bank in the machine (with some restrictions). There is extra latency in reaching a remote bank.

Two instructions from a scientific benchmark (Linpack) for a MultiFlow CPU with 14 operations per instruction.


What have we gained with 64-bit VLIW?





If:
Clock speed remains the same.
All 32-bit operators do useful work.

Performance doubles!

N x 32-bit VLIW yields factor of N speedup!

Multiflow: N = 7, 14, or 28 (3 CPUs in product family)

A very big “if” !


As N scales, HW and SW needs conflict

Instruction Set Architecture: Where the conflict plays out.

[Figure: layers of abstraction. Software: Application (iTunes), Operating System (Mac OS X), Compiler, Assembler. Hardware: Processor, Memory, I/O system; Datapath & Control; Transistors.]

Hardware need: Clock does not slow down.

Software need: All operators do useful work.


Example problem: Register file ports ...

[Figure: two ALUs sharing one RegFile with four read ports (rs1 to rs4, rd1 to rd4) and two write ports (ws1, wd1, WE1 and ws2, wd2, WE2).]

N ALUs require 2*N read ports and N write ports. Why is this a problem?


Recall: Register File Design

[Figure: the register file design from earlier. Registers R0 (the constant 0) through R31, built from flip-flops with enables and a shared clk; two read MUXes selected by sel(rs1) and sel(rs2) drive rd1 and rd2; a write DEMUX selected by sel(ws) steers WE; wd is the 32-bit write data.]

More read ports increase fanout, which slows down reads.

More write ports add data muxes and a demux OR tree.


Split register files: A solution?

[Figure: two ALUs, each with its own RegFile (two read ports and one write port per RegFile).]

Too often, the data an ALU needs to do “useful work” will not be in its own regfile.

Software need: All operators do useful work.


Architect’s job: Find a good compromise

Instruction Set Architecture: Where the conflict plays out.

[Figure: layers of abstraction. Software: Application, Operating System, Compiler, Assembler. Hardware: Processor, Memory, I/O system; Datapath & Control; Transistors.]

[Excerpt: THE MULTIFLOW TRACE SCHEDULING COMPILER, p. 57]

[Figure 6. The Multiflow TRACE 7/300 (Mb = megabytes). Labels: 256-bit Instruction Word; Interleaved Memory, Total of 512 Mb, Total of 64 Banks.]

In the 300 series, instructions are issued every 130 ns; there are two 65-ns beats per instruction. Integer operations can issue in the early and late beats of an instruction; floating point operations issue only in the early beat. Most integer ALU operations complete in a single beat. The load pipeline is seven beats. The floating point pipelines are four beats. Branches issue in the early beat and the branch target is reached on the following instruction, effectively a two-beat pipeline. An instruction can issue multiple branch operations (four on the 28/300); the particular branch taken is determined by the precedence encoded in the long instruction word.

• There are four functional units per cluster: two integer units and two floating units. In addition, each cluster can contribute a branch target. Since the integer units issue in both the early and the late beat, a cluster has the resources to issue seven operations for each instruction.

• There are nine register files per cluster (36 register files in the 28/300) (see Table 1). Data going to memory must first be moved to a store file. Branch banks are used to control conditional branches and the select operation.

Example solution: Split register files, with a dedicated bus and special instructions for moves between regfiles.

May hurt software more than it helps hardware :-(


Branch policy: All instr operators execute



BNE $8 $9 Label
ADD $7 $8 $9

ADD executes if branch is taken or not taken.

Problem: Large N machines find it hard to fill all operators with useful work.

Solution: New “predication” operator.

Syntax: SELECT $7 $8 $9 $10

Semantics: If $8 == 0, $7 = $10, else $7 = $9

Permits simple branches to be converted to inline code.
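A Python sketch of the SELECT semantics above and of the branch-to-inline conversion it permits. The function names and the small if/else being converted are made up for this example.

```python
def select(cond, value_if_nonzero, value_if_zero):
    """SELECT rd cond a b: rd = b if cond == 0, else rd = a."""
    return value_if_zero if cond == 0 else value_if_nonzero


def with_branch(r8, r9, r10):
    """Branching version: only one assignment executes."""
    if r8 == 0:
        r7 = r10
    else:
        r7 = r9
    return r7


def with_select(r8, r9, r10):
    """Inline version: no branch; SELECT $7 $8 $9 $10 picks the result."""
    return select(r8, r9, r10)


for r8 in (0, 5):
    print(r8, with_branch(r8, 9, 10), with_select(r8, 9, 10))
```

Both versions produce the same value in $7, but the inline version has no control flow, so the operator can be packed alongside others in one instruction word.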


Branch nesting in a single instruction ...



BEQ $8 $9 LabelOne
BEQ $11 $12 LabelTwo

Conundrum: How to define the semantics of multiple branches in one instruction?

Solution: Nested branch semantics

If $8 == $9, branch to LabelOne

Else if $11 == $12, branch to LabelTwo

MultiFlow: N-way Branch priority set in an opcode field.
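The nested semantics amount to a priority list: the first branch whose condition holds wins. A Python sketch, with the branch list order standing in for the priority that the slide says is set in an opcode field; names and register values are made up.

```python
def resolve_branches(fall_through, branches):
    """branches: (condition, target) pairs in priority order; first true one wins."""
    for taken, target in branches:
        if taken:
            return target
    return fall_through


def next_instruction(regs):
    return resolve_branches(
        "next",
        [
            (regs[8] == regs[9], "LabelOne"),     # BEQ $8 $9 LabelOne
            (regs[11] == regs[12], "LabelTwo"),   # BEQ $11 $12 LabelTwo
        ],
    )


print(next_instruction({8: 1, 9: 1, 11: 2, 12: 2}))   # LabelOne (both true, first wins)
print(next_instruction({8: 1, 9: 0, 11: 2, 12: 2}))   # LabelTwo
print(next_instruction({8: 1, 9: 0, 11: 2, 12: 3}))   # next
```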


Will return to VLIW later in semester ...

[Figure: Figure 6 and accompanying text from THE MULTIFLOW TRACE SCHEDULING COMPILER, p. 57, repeated.]


Next Monday:

First Design Review

This Friday: Look-ahead for Design Review, 125 Cory, 10 AM
