UC Regents Fall 2008 © UCB, CS 194-6 L3: Single-Cycle CPU
2008-9-22, John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 194-6 Digital Systems Project Laboratory
Lecture 3 – Single-Cycle CPU
www-inst.eecs.berkeley.edu/~cs194-6/
TA: Greg Gibeling
Topics for today’s lecture
Single-Cycle CPU Design
Instruction Set Architectures (ISAs)
Very Long Instruction Words (VLIW): Doing more work in a single cycle.
Instruction Set Architecture
New successful instruction sets are rare
instruction set
software
hardware
Implementors suffer with original sins of ISAs, to support the installed base of software.
Instruction Sets: A Thin Interface
Software: Application (iTunes), Operating System (Mac OS X), Compiler, Assembler
Instruction Set Architecture
Hardware: Processor, Memory, I/O system
Datapath & Control
Digital Design
Circuit Design
Transistors
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
In hexadecimal: 012A4020
Bitfield:    opcode   rs       rt       rd       shamt    funct
Field size:  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
Binary:      000000   01001    01010    01000    00000    100000
“R-Format”
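The field layout above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the helper names `encode_r` and `decode_r` are made up for illustration, and only the field widths and the ADD example come from the slide.

```python
# Field widths of the MIPS R-format, most significant field first.
R_FIELDS = [("opcode", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6)]

def encode_r(**values):
    """Pack the six R-format fields into one 32-bit instruction word."""
    word = 0
    for name, width in R_FIELDS:
        value = values.get(name, 0)
        assert 0 <= value < (1 << width), f"{name} does not fit in {width} bits"
        word = (word << width) | value
    return word

def decode_r(word):
    """Split a 32-bit instruction word back into its R-format fields."""
    fields = {}
    shift = 32
    for name, width in R_FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields

# ADD $8 $9 $10: rd = 8, rs = 9, rt = 10, funct = 0b100000 (ADD)
add_word = encode_r(opcode=0, rs=9, rt=10, rd=8, shamt=0, funct=0b100000)
print(f"{add_word:08X}")   # 012A4020
print(decode_r(add_word))
```

Note that the assembly syntax lists the destination first (ADD rd rs rt), while the encoding stores rs and rt before rd.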
Hardware implements semantics ...
Instruction Fetch: Fetch next inst from memory: 012A4020
Instruction Decode: Decode fields (opcode, rs, rt, rd, shamt, funct) to get: ADD $8 $9 $10
Operand Fetch: “Retrieve” register values: $9, $10
Execute: Add $9 to $10
Result Store: Place this sum in $8
Next Instruction: Prepare to fetch the instruction that follows the ADD in the program.
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
ADD syntax & semantics, as seen in the MIPS ISA document.
Memory Instructions: LW $1,32($2)
Instruction Fetch: Fetch the load inst from memory
Instruction Decode: Decode fields (opcode, rs, rt, offset; “I-Format”) to get: LW $1, 32($2)
Operand Fetch: “Retrieve” register value: $2
Execute: Compute memory address: 32 + $2
Result Store: Load memory address contents into: $1
Next Instruction: Prepare to fetch the instr that follows the LW in the program. Depending on load semantics, the new $1 is visible to that instr, or not until the following instr (“delayed loads”).
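The LW steps can be sketched in Python as follows. This is an illustration only: the register contents and memory contents are hypothetical, memory is modeled as a dictionary keyed by byte address, and the sketch uses the MIPS rule that LW sign-extends its 16-bit offset.

```python
def decode_i(word):
    """Split a 32-bit word into I-format fields: opcode(6) rs(5) rt(5) offset(16)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "offset": word & 0xFFFF,
    }

def sign_extend16(value):
    """Interpret a 16-bit field as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value

def load_word(word, regs, mem):
    """Execute one LW: regs[rt] = mem[regs[rs] + sign_extend(offset)]."""
    f = decode_i(word)
    address = (regs[f["rs"]] + sign_extend16(f["offset"])) & 0xFFFFFFFF
    regs[f["rt"]] = mem[address]
    return address

# LW $1, 32($2): opcode 0x23 (LW), rs = 2, rt = 1, offset = 32
lw_word = (0x23 << 26) | (2 << 21) | (1 << 16) | 32
regs = [0] * 32
regs[2] = 0x1000                      # hypothetical base address
mem = {0x1020: 0xCAFE}                # hypothetical memory contents
address = load_word(lw_word, regs, mem)
print(hex(address), hex(regs[1]))     # 0x1020 0xcafe
```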
LW syntax & semantics, as seen in the MIPS ISA document.
Branch Instructions: BEQ $1,$2,25
Instruction Fetch: Fetch branch inst from memory
Instruction Decode: Decode fields (opcode, rs, rt, offset; “I-Format”) to get: BEQ $1, $2, 25
Operand Fetch: “Retrieve” register values: $1, $2
Execute: Compute if we take branch: $1 == $2 ?
Result Store: (no result to store)
Next Instruction: ALWAYS prepare to fetch the instr that follows the BEQ in the program (“delayed branch”). IF we take the branch, the instr we fetch AFTER that instruction is at PC + 4 + 100.
PC == “Program Counter”
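The delayed-branch rule can be written out as a small Python sketch. The function name and the example PC value are hypothetical; the arithmetic (offset 25 words = 100 bytes, added to PC + 4) follows the slide.

```python
def beq_next_pcs(pc, rs_value, rt_value, offset_words):
    """Return the addresses of the two instructions fetched after a BEQ at `pc`.

    With a delayed branch, the instruction at pc + 4 (the delay slot) always
    executes; only the fetch after that depends on the comparison.
    """
    delay_slot = pc + 4
    if rs_value == rt_value:
        after_delay_slot = pc + 4 + 4 * offset_words   # taken: PC + 4 + 100 for offset 25
    else:
        after_delay_slot = pc + 8                      # not taken: keep going straight
    return delay_slot, after_delay_slot

taken = beq_next_pcs(0x400000, 5, 5, 25)
not_taken = beq_next_pcs(0x400000, 5, 6, 25)
print([hex(a) for a in taken])       # ['0x400004', '0x400068']
print([hex(a) for a in not_taken])   # ['0x400004', '0x400008']
```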
BEQ syntax & semantics, as seen in the MIPS ISA document.
define: The Architect’s Contract
To the program, it appears that instructions execute in the correct order defined by the ISA.
What the machine actually does is up to the hardware designers, as long as the contract is kept.
As each instruction completes, the machine state (regs, mem) appears to the program to obey the ISA.
Single Cycle CPU Design
Single cycle data paths: Assumptions
Processor uses synchronous logic design (a “clock”).
[Figure: parallel to serial converter, with its clock period constraint]
T ≥ time(clk→Q) + time(mux) + time(setup)
\(T \geq \tau_{clk \to Q} + \tau_{mux} + \tau_{setup}\)
f          T
1 MHz      1 μs
10 MHz     100 ns
100 MHz    10 ns
1 GHz      1 ns
All state elements act like positive edge-triggered flip flops.
D Q
clk
Reset ?
Review: Edge-Triggered D Flip Flops
D Q
CLK
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
D
Q
Review: Edge-Triggering in Verilog
D Q
CLK
module ff(D, Q, CLK);
input D, CLK;
output Q;
always @ (CLK) Q <= D;
endmodule
Module code has two bugs. Where?
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
Review: Edge-Triggered D Flip Flops
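module ff(D, Q, CLK);
input D, CLK;
output Q;
reg Q;                          // Q is assigned inside an always block, so it must be a reg
always @ (posedge CLK) Q <= D;  // sample D only on the rising edge of CLK
endmodule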
D Q
CLK
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
define: Single-cycle datapath
All instructions execute in a single cycle of the clock (positive edge to positive edge).
Advantage: a great way to learn CPUs.
Drawbacks: unrealistic hardware assumptions, slow clock period.
Recall: MIPS R-format instructions
Instruction Fetch: Fetch next inst from memory: 012A4020
Instruction Decode: Decode fields (opcode, rs, rt, rd, shamt, funct) to get: ADD $8 $9 $10
Operand Fetch: “Retrieve” register values: $9, $10
Execute: Add $9 to $10
Result Store: Place this sum in $8
Next Instruction: Prepare to fetch the instruction that follows the ADD in the program.
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Goal #1: An R-format single-cycle CPU
opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Sample program:
ADD $8 $9 $10
SUB $4 $8 $3
AND $9 $8 $4
...
How registers get their initial values is not of concern to us right now.
No loads or stores: machine has no use for data memory, only instruction memory.
No branches or jumps: machine only runs straight line code.
Separate Read-Only Instruction Memory
InstrMem: Addr (32 bits) in, Data (32 bits) out.
Reads are combinational: Put a stable address on input, a short time later data appears on output.
Not concerned about how programs are loaded into this memory.
Related to separate instruction and data caches in “real” designs.
Task #1: Straight-line Instruction Fetch
InstrMem: Addr (32 bits) in, Data (32 bits) out.
Fetching straight-line MIPS instructions requires a machine that generates this timing diagram (“Requirements”):
CLK   (one column per clock cycle)
Addr: PC          PC + 4          PC + 8
Data: IMem[PC]    IMem[PC + 4]    IMem[PC + 8]
Why +4 and not +1? Why do we increment every clock cycle?
PC == Program Counter, points to next instruction.
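The required timing diagram can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the instruction memory is a hypothetical dictionary keyed by byte address, holding the encodings of ADD $8 $9 $10, SUB $4 $8 $3 and AND $9 $8 $4. The step is +4 because MIPS memory is byte-addressed and each instruction is 4 bytes.

```python
def fetch_trace(imem, start_pc, cycles):
    """List (Addr, Data) pairs seen on the instruction memory, one per clock cycle."""
    trace = []
    pc = start_pc
    for _ in range(cycles):
        trace.append((pc, imem[pc]))   # combinational read: Data follows Addr
        pc = pc + 4                    # each instruction is 4 bytes, memory is byte-addressed
    return trace

# Hypothetical instruction memory holding three words at byte addresses 0, 4, 8.
imem = {0: 0x012A4020, 4: 0x01032022, 8: 0x01044824}
for addr, data in fetch_trace(imem, 0, 3):
    print(f"Addr={addr:2d}  Data={data:08X}")
```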
New Component: Register (for PC)
In later examples, we will add an “enable” input: clock edge updates state only if enable is high.
PC register: Din (32 bits) in, Dout (32 bits) out, clocked by Clk.
Built out of an array of flip-flops: one D flip-flop per bit (Din0 to Dout0, Din1 to Dout1, Din2 to Dout2, ...), all sharing clk.
Logic design?
New Component: A 32-bit adder (ALU)
Adder: 32-bit inputs A and B, 32-bit output A + B.
Combinational: Put A and B values on inputs, a short time later A + B appears on output.
ALU: 32-bit inputs A and B, 32-bit output A op B, and an “op” input that is log2(#ops) bits wide.
ALU: Combinational part that is able to execute many functions of A and B (add, sub, and, or, ...). The “op” value selects the function.
Sometimes, extra outputs for use by control logic ... (for example, “Equal?”)
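A behavioral model of such an ALU might look like the sketch below. The op names are a hypothetical encoding (the slides do not fix one), and values are treated as 32-bit unsigned integers.

```python
MASK32 = 0xFFFFFFFF

# Hypothetical op encoding; the slides do not fix one.
ALU_OPS = {
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
    "and": lambda a, b: a & b,
    "or":  lambda a, b: a | b,
}

def alu(op, a, b):
    """Combinational ALU: returns (A op B, Equal) for 32-bit unsigned inputs."""
    result = ALU_OPS[op](a, b)
    equal = (a == b)          # extra output for use by control logic
    return result, equal

print(alu("add", 9, 10))      # (19, False)
print(alu("sub", 3, 5))       # (4294967294, False), i.e. 0xFFFFFFFE
print(alu("or", 9, 64))       # (73, False)
```

With four functions, the “op” input of this sketch would need log2(4) = 2 bits in hardware.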
Design: Straight-line Instruction Fetch
PC register: D (32 bits) in, Q (32 bits) out, clocked by Clk.
Q drives InstrMem Addr (32 bits); InstrMem Data (32 bits) is the fetched instruction.
Adder: Q + 0x4 (+4 in hexadecimal) feeds back to D.
State machine design in the service of an ISA
CLK   (one column per clock cycle)
Addr: PC          PC + 4          PC + 8
Data: IMem[PC]    IMem[PC + 4]    IMem[PC + 8]
Goal #1: An R-format single-cycle CPU
Instruction Fetch: Fetch next inst from memory: 012A4020
Instruction Decode: Decode fields (opcode, rs, rt, rd, shamt, funct) to get: ADD $8 $9 $10
Operand Fetch: “Retrieve” register values: $9, $10
Execute: Add $9 to $10
Result Store: Place this sum in $8
Next Instruction: Prepare to fetch the instruction that follows the ADD in the program.
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Done! To continue, we need registers ...
MIPS Register file: From the top down
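Registers: R0 - The constant 0; R1, R2, ..., R31 are 32-bit registers (D, En, Q), all clocked by clk.
Why is R0 special?
Read port 1: a 32-bit wide MUX over the registers, sel(rs1) is 5 bits, output rd1.
Read port 2: a 32-bit wide MUX over the registers, sel(rs2) is 5 bits, output rd2.
“two read ports”
Write port: a DEMUX steers WE to the En input of one register, sel(ws) is 5 bits; wd (32 bits) drives the D inputs.
How do we add a second write port?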
Register File Schematic Symbol
RegFile ports: inputs rs1, rs2, ws (5 bits each), wd (32 bits), WE; outputs rd1, rd2 (32 bits each).
Why do we need WE?
If we had a MIPS register file w/o WE, how could we work around it?
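One way to think about the WE question is with a small behavioral model. The Python class below is a sketch, not the course's implementation: the class and method names are invented, reads are modeled as combinational, and writes happen only when `posedge` is called with WE high.

```python
class RegFile:
    """32 x 32-bit register file: two combinational read ports, one clocked write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        """Combinational reads: return (rd1, rd2)."""
        return self.regs[rs1], self.regs[rs2]

    def posedge(self, we, ws, wd):
        """Clock edge: write wd into register ws only if WE is high; R0 stays 0."""
        if we and ws != 0:
            self.regs[ws] = wd & 0xFFFFFFFF

rf = RegFile()
rf.posedge(we=1, ws=9, wd=9)
rf.posedge(we=0, ws=10, wd=123)   # WE low: no state change
rf.posedge(we=1, ws=0, wd=55)     # writes to R0 are ignored
print(rf.read(9, 10), rf.read(0, 0))   # (9, 0) (0, 0)
```

The last write hints at one possible workaround for a register file without WE: an instruction that should not change state could target R0, since R0 always reads as 0.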
Goal #1: An R-format single-cycle CPU
Instruction Fetch: Fetch next inst from memory: 012A4020
Instruction Decode: Decode fields (opcode, rs, rt, rd, shamt, funct) to get: ADD $8 $9 $10
Operand Fetch: “Retrieve” register values: $9, $10
Execute: Add $9 to $10
Result Store: Place this sum in $8
Next Instruction: Prepare to fetch the instruction that follows the ADD in the program.
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
What do we do with these?
Computing engine of the R-format CPU
RegFile ports: inputs rs1, rs2, ws (5 bits each), wd (32 bits), WE; outputs rd1, rd2 (32 bits each).
ALU: two 32-bit inputs, one 32-bit output, op.
Instruction fields: opcode rs rt rd shamt funct
Decode fields to get: ADD $8 $9 $10
Logic
What do we do with WE?
Putting it all together ...
Fetch: PC register (D, Q; 32 bits) drives InstrMem Addr (32 bits); an adder computes Q + 0x4 for D.
InstrMem Data (32 bits): to rs1, rs2, ws, op decode logic ...
Compute: RegFile (rs1, rs2, ws, wd, WE, rd1, rd2), ALU (op), Logic.
Is it safe to use same clock for PC and RegFile?
Recall: Our ideal-world D Flip-Flop
D Q
CLK
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
D
Q
Also assume: clocks arrive at all flip flops simultaneously.
Reminder: How data flows after posedge
[Figure: the R-format CPU (PC, adder with 0x4, InstrMem, Logic, RegFile, ALU), annotated with how data flows after the positive clock edge.]
Next posedge: Update state and repeat
[Figure: the state elements, PC and RegFile.]
In this ideal world, as long as the clock is slow enough, the machine gets the right answer.
In the Timing lecture, we look at the assumptions behind ideality.
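The “settle, then update on the next posedge” behavior can be mimicked in software. The Python sketch below runs the sample program from the slides on an R-format-only model; the initial register values and the dictionary-based instruction memory are hypothetical, and only ADD, SUB and AND are modeled.

```python
MASK32 = 0xFFFFFFFF
FUNCT = {0x20: lambda a, b: a + b,    # ADD
         0x22: lambda a, b: a - b,    # SUB
         0x24: lambda a, b: a & b}    # AND

def run(imem, regs, cycles):
    """Simulate `cycles` clock periods of an R-format-only single-cycle CPU."""
    pc = 0
    for _ in range(cycles):
        # Between clock edges: combinational logic settles.
        word = imem[pc]                               # instruction fetch
        rs, rt, rd = (word >> 21) & 31, (word >> 16) & 31, (word >> 11) & 31
        wd = FUNCT[word & 0x3F](regs[rs], regs[rt]) & MASK32
        next_pc = pc + 4
        # Positive edge: all state elements update together.
        if rd != 0:                                   # R0 is the constant 0
            regs[rd] = wd
        pc = next_pc
    return pc

# Sample program from the slides: ADD $8 $9 $10; SUB $4 $8 $3; AND $9 $8 $4
imem = {0: 0x012A4020, 4: 0x01032022, 8: 0x01044824}
regs = [0] * 32
regs[3], regs[9], regs[10] = 3, 9, 10                 # hypothetical initial values
final_pc = run(imem, regs, 3)
print(final_pc, regs[8], regs[4], regs[9])            # 12 19 16 16
```

Computing `wd` and `next_pc` before touching any state is what corresponds to a clock that is “slow enough”: every value has settled before the edge arrives.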
Next Step ...
Design stand-alone machines for other major classes of instructions: immediates, branches, load/store.
Learn how to efficiently “merge” single-function machines to make one general-purpose machine.
Goal #2: add I-format ALU instructions
Syntax: ORI $8 $9 64    Semantics: $8 = $9 | 64
16-bit immediate extended to 32 bits.
In this example, $9 is rs and $8 is rt.
Zero-extend: 0x8000 ⇨ 0x00008000
Sign-extend: 0x8000 ⇨ 0xFFFF8000
Some MIPS instructions zero-extend immediate field, other instructions sign-extend.
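The two extension rules are easy to compare side by side. The Python sketch below reproduces the 0x8000 examples from the slide; the function names are made up for illustration, and the last line uses the MIPS rule that ORI zero-extends its immediate.

```python
def zero_extend16(imm16):
    """Zero-extend: upper 16 bits of the result are 0."""
    return imm16 & 0xFFFF

def sign_extend16(imm16):
    """Sign-extend: upper 16 bits copy bit 15 of the immediate."""
    imm16 &= 0xFFFF
    return imm16 | 0xFFFF0000 if imm16 & 0x8000 else imm16

print(f"{zero_extend16(0x8000):08X}")   # 00008000
print(f"{sign_extend16(0x8000):08X}")   # FFFF8000
print(9 | zero_extend16(64))            # 73: ORI $8 $9 64 with $9 = 9
```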
Step 1a: The MIPS-lite Subset for today (figure from CS 152 L06, Single Cycle 1, UC Regents Fall 2004 © UCB)
ADD and SUB: addU rd, rs, rt; subU rd, rs, rt
    bits:   31-26    25-21    20-16    15-11    10-6     5-0
    field:  op       rs       rt       rd       shamt    funct
    size:   6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
OR Immediate: ori rt, rs, imm16
LOAD and STORE Word: lw rt, rs, imm16; sw rt, rs, imm16
BRANCH: beq rs, rt, imm16
    bits:   31-26    25-21    20-16    15-0
    field:  op       rs       rt       immediate
    size:   6 bits   5 bits   5 bits   16 bits
Computing engine of the I-format CPU
RegFile ports: inputs rs1, rs2, ws (5 bits each), wd (32 bits), WE; outputs rd1, rd2 (32 bits each).
ALU: two 32-bit inputs, one 32-bit output, op.
Ext: extends the 16-bit immediate field to 32 bits.
Instruction fields: op rs rt immediate
Decode fields to get: ORI $8 $9 64
Logic
In a Verilog implementation, what should we do with rs2?
Merging data paths ...
R-format: RegFile, ALU, Logic; instruction fields opcode rs rt rd shamt funct.
I-format: RegFile, ALU, Logic, Ext; instruction fields op rs rt immediate.
Add muxes (32 bits wide here, N bits wide in general).
Where ?
How many ? (ignore ALU control)
The merged data path ...
RegFile ports: inputs rs1, rs2, ws (5 bits each), wd (32 bits), WE; outputs rd1, rd2 (32 bits each).
ALU: two 32-bit inputs, one 32-bit output, op.
Ext: extends the 16-bit immediate field to 32 bits.
Instruction fields: opcode rs rt rd shamt funct (R-format); op rs rt immediate (I-format)
Control signals: RegDest, ALUsrc, ExtOp, ALUctr
If you watched it being designed, it’s understandable ...
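The role of the muxes can be illustrated with a behavioral Python sketch of the merged data path. The signal polarities are assumptions of this sketch (RegDest = 1 selects rd, ALUsrc = 1 selects the extended immediate, ExtOp = 1 sign-extends), and ALUctr is modeled as a string rather than an encoded value.

```python
def extend(imm16, ext_op):
    """ExtOp = 1 sign-extends the 16-bit immediate, ExtOp = 0 zero-extends it."""
    if ext_op and imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16

def datapath_step(word, regs, reg_dest, alu_src, ext_op, alu_ctr):
    """One cycle of the merged R/I-format datapath; returns (ws, wd)."""
    rs, rt, rd = (word >> 21) & 31, (word >> 16) & 31, (word >> 11) & 31
    imm16 = word & 0xFFFF
    ws = rd if reg_dest else rt                          # RegDest mux: destination field
    a = regs[rs]
    b = extend(imm16, ext_op) if alu_src else regs[rt]   # ALUsrc mux: immediate or register
    wd = {"add": a + b, "or": a | b}[alu_ctr] & 0xFFFFFFFF
    if ws != 0:
        regs[ws] = wd
    return ws, wd

regs = [0] * 32
regs[9], regs[10] = 9, 10
# ADD $8 $9 $10 (R-format): destination from rd, second operand from rt.
print(datapath_step(0x012A4020, regs, reg_dest=1, alu_src=0, ext_op=0, alu_ctr="add"))   # (8, 19)
# ORI $8 $9 64 (I-format, opcode 0x0D): destination from rt, zero-extended immediate.
ori_word = (0x0D << 26) | (9 << 21) | (8 << 16) | 64
print(datapath_step(ori_word, regs, reg_dest=0, alu_src=1, ext_op=0, alu_ctr="or"))      # (8, 73)
```

The same hardware executes both formats; only the control signal values differ.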
Memory Instructions
Loads, Stores, and Data Memory ...
Data Memory ports: inputs Addr (32 bits), Din (32 bits), WE; output Dout (32 bits).
Syntax: LW $1, 32($2)    Action: $1 = M[$2 + 32]
Syntax: SW $3, 12($4)    Action: M[$4 + 12] = $3
Zero-extend or sign-extend immediate field?
Writes are clocked: If WE is high, the memory location at Addr captures Din on the positive edge of the clock.
Reads are combinational: Put a stable address on Addr, a short time later Dout is ready.
Note: Not a realistic main memory (DRAM) model ...
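The idealized memory behavior (clocked writes, combinational reads) can be sketched in Python. The class name, the dictionary storage and the register contents are hypothetical; the SW and LW lines mirror the two examples on the slide.

```python
class DataMemory:
    """Idealized data memory: combinational reads, writes on the positive clock edge."""

    def __init__(self):
        self.words = {}                      # byte address -> 32-bit word, default 0

    def dout(self, addr):
        """Combinational read port."""
        return self.words.get(addr, 0)

    def posedge(self, we, addr, din):
        """Clock edge: the location at Addr captures Din only if WE is high."""
        if we:
            self.words[addr] = din & 0xFFFFFFFF

regs = [0] * 32
regs[3], regs[4], regs[2] = 0xBEEF, 0x100, 0xEC      # hypothetical register contents
dmem = DataMemory()
dmem.posedge(we=1, addr=regs[4] + 12, din=regs[3])   # SW $3, 12($4): M[$4 + 12] = $3
regs[1] = dmem.dout(regs[2] + 32)                    # LW $1, 32($2): $1 = M[$2 + 32]
print(hex(regs[1]))                                  # 0xbeef
```

Both addresses work out to 0x10C in this example, so the load returns the value the store wrote.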
Adding data memory to the data path
RegFile ports: inputs rs1, rs2, ws (5 bits each), wd (32 bits), WE; outputs rd1, rd2 (32 bits each).
Ext; ALU (op); Data Memory: Addr, Din (32 bits), WE, Dout (32 bits).
Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg
Syntax: LW $1, 32($2)    Action: $1 = M[$2 + 32]
Syntax: SW $3, 12($4)    Action: M[$4 + 12] = $3
Load delay slot CPU, or not ?
Branch Instructions
Conditional Branches in MIPS ...
Syntax: BEQ $1, $2, 12
Action: If ($1 != $2), PC = PC + 4
Action: If ($1 == $2), PC = PC + 4 + 48
Immediate field codes # words, not # bytes. Why is this encoding a good idea?
Zero-extend or sign-extend immediate field? Why is this extension method a good idea?
Adding branch testing to the data path
RegFile ports: inputs rs1, rs2, ws (5 bits each), wd (32 bits), WE; outputs rd1, rd2 (32 bits each).
Ext; ALU (op); Data Memory: Addr, Din (32 bits), WE, Dout (32 bits).
Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg
Equal (wire into control)
Syntax: BEQ $1, $2, 12
Action: If ($1 != $2), PC = PC + 4
Action: If ($1 == $2), PC = PC + 4 + 48
Recall: Straight-line Instruction Fetch
InstrMem: Addr (32 bits) in, Data (32 bits) out.
Fetching straight-line MIPS instructions requires a machine that generates this timing diagram:
CLK   (one column per clock cycle)
Addr: PC          PC + 4          PC + 8
Data: IMem[PC]    IMem[PC + 4]    IMem[PC + 8]
PC == Program Counter, points to next instruction.
Recall: Straight-line Instruction Fetch
CLK   (one column per clock cycle)
Addr: PC          PC + 4          PC + 8
Data: IMem[PC]    IMem[PC + 4]    IMem[PC + 8]
PC register (D, Q; 32 bits, clocked by Clk); Q drives InstrMem Addr (32 bits), and an adder computes Q + 0x4 for D.
Syntax: BEQ $1, $2, 12
Action: If ($1 != $2), PC = PC + 4
Action: If ($1 == $2), PC = PC + 4 + 48
How do we add this behavior ?
Design: Instruction Fetch with Branch
PC register (D, Q; 32 bits, clocked by Clk); Q drives InstrMem Addr (32 bits), and InstrMem Data is 32 bits.
First adder: Q + 0x4.
Extend: the 16-bit immediate field of the instruction, extended to 32 bits.
Second adder: forms the branch target from PC + 4 and the extended immediate.
PCSrc: selects which of the two values is loaded into PC.
Syntax: BEQ $1, $2, 12
Action: If ($1 != $2), PC = PC + 4
Action: If ($1 == $2), PC = PC + 4 + 48
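The next-PC selection can be sketched in Python as below. The function names and the polarity of PCSrc (1 selects the branch target) are assumptions of this sketch; the immediate is sign-extended and counted in words, so backward branches work and the 16-bit field reaches four times further than a byte offset would.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python int (two's complement)."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc(pc, imm16, pc_src):
    """PCSrc mux: 0 selects PC + 4, 1 selects PC + 4 + 4 * sign_extend(imm16)."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    branch_target = (pc_plus_4 + 4 * sign_extend16(imm16)) & 0xFFFFFFFF
    return branch_target if pc_src else pc_plus_4

def pc_src_for_beq(is_beq, equal):
    """Control: take the branch only for a BEQ whose operands compare equal."""
    return int(is_beq and equal)

pc = 0x100
print(hex(next_pc(pc, 12, pc_src_for_beq(True, False))))      # 0x104: not taken
print(hex(next_pc(pc, 12, pc_src_for_beq(True, True))))       # 0x134: PC + 4 + 48
print(hex(next_pc(pc, 0xFFFF, pc_src_for_beq(True, True))))   # 0x100: offset of -1 word
```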
Single-Cycle Control
What is single cycle control?
[Figure: the complete datapath (InstrMem, RegFile, Ext, ALU, Data Memory) and the control logic that drives it.]
Inputs to control: the instruction from InstrMem Data, and Equal.
Outputs of control: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg, PCSrc.
Combinational Logic (Only Gates, No Flip Flops). Just specify logic functions!
rs, rt, rd, imm
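Because control is purely combinational, it can be specified as a table. The Python sketch below is one hypothetical specification for the MIPS-lite subset: the signal polarities are assumptions, don't-care entries are arbitrarily set to 0, and ALUctr is left out for brevity.

```python
# Hypothetical control table; signal polarities are assumptions of this sketch.
CONTROL = {
    "rtype": dict(RegDest=1, RegWr=1, ExtOp=0, ALUsrc=0, MemWr=0, MemToReg=0),
    "ori":   dict(RegDest=0, RegWr=1, ExtOp=0, ALUsrc=1, MemWr=0, MemToReg=0),
    "lw":    dict(RegDest=0, RegWr=1, ExtOp=1, ALUsrc=1, MemWr=0, MemToReg=1),
    "sw":    dict(RegDest=0, RegWr=0, ExtOp=1, ALUsrc=1, MemWr=1, MemToReg=0),
    "beq":   dict(RegDest=0, RegWr=0, ExtOp=0, ALUsrc=0, MemWr=0, MemToReg=0),
}
# MIPS opcodes for the subset.
OPCODES = {0x00: "rtype", 0x0D: "ori", 0x23: "lw", 0x2B: "sw", 0x04: "beq"}

def control(word, equal):
    """Pure function of the instruction word and Equal: no state, only logic."""
    kind = OPCODES[(word >> 26) & 0x3F]
    signals = dict(CONTROL[kind])
    signals["PCSrc"] = int(kind == "beq" and equal)
    return signals

print(control(0x012A4020, equal=False))            # ADD $8 $9 $10
print(control(0x10220019, equal=True)["PCSrc"])    # BEQ $1, $2, 25 with $1 == $2 gives 1
```

An opcode outside the table raises `KeyError` here, which loosely mirrors the “undefined” behavior for unsupported instructions in the labs.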
Two goals when specifying control logic
Bug-free: One “0” that should be a “1” in the control logic function breaks contract with the programmer.
Efficient: Logic function specification should map to hardware with good performance properties: fast, small, low power, etc.
Should be easy for humans to read and understand: sensible signal names, symbolic constants ...
In practice: Use behavioral Verilog
Advice: Carefully written Verilog will yield identical semantics in ModelSim and Synplicity. If you write your code in this way, many “works in ModelSim but not on Xilinx” issues disappear.
Always check log files, and inspect the output that tools produce!
Look for tell-tale Synplicity “warnings and errors” messages: “latch generated”, “combinational loop detected”, etc.
Automate with scripts if possible.
F06 152 Labs: A small subset of MIPS ...
What if some other instruction appears in the instruction stream?
For labs: undefined. Real world: exceptions.
Why not in labs? Doubles complexity!
[Figure: two slides from another lecture on interrupts and exceptions in a pipelined CPU; only the text below is recoverable.]
Out-of-order completion. Consider interrupts.
Precise interrupts are difficult to implement at high speed: want to start execution of later instructions before exception checks finished on earlier instructions.
• Hold exception flags in pipeline until commit point (M stage)
• Exceptions in earlier pipe stages override later exceptions
• Inject external interrupts at commit point (override others)
• If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
Figure labels: PC, Inst. Mem, Decode, Data Mem; ExcD, PCD, ExcE, PCE, ExcM, PCM; Cause, EPC; Kill F Stage, Kill D Stage, Kill E Stage, Kill Writeback; Illegal Opcode, Overflow, Data Addr Except, PC Address Exceptions; Asynchronous Interrupts; Select Handler PC.
Components in blue handle exceptions ...Will cover this (pipelined CPU) example later in the term ...
VLIW: Very Long Instruction Words
978 IEEE TRANSACTIONS ON COMPUTERS, VOL. 37, NO. 8, AUGUST 1988
[...] probably around 30-50 percent when compared to a tightly encoded machine like the VAX or Motorola 68000. The variable length main memory instruction encoding has an associated overhead of a few bits per operation, which coupled with main memory alignment constraints adds roughly an additional 5-10 percent.
Operations that cannot be initiated in a single instruction cycle are broken down into constituent suboperations. These constituents are usually substituted in-line, although certain operations such as the block register save and restore associated with procedure call are implemented via special subroutines. The overall code expansion due to this, as compared to a machine like the VAX that has an extensive library of microcoded “subroutines,” is difficult to quantify, but is probably in the neighborhood of 10-20 percent.
The compiler performs an enormous number of optimizations, most of which reduce the number of operations in the program, but some of which increase the number of operations with the goal of increasing parallel execution. The three most notorious code expanders are interblock trace selection (which can produce compensation code), loop unrolling, and in-line procedure substitution. All three of these are currently automatic and have been tuned to avoid undue code growth. These optimizations can increase the size of some small fragments of code by a large factor, but their overall effect seems to be to increase code size by a factor of around 30-60 percent, although the user can increase or decrease these factors arbitrarily through the use of compiler switches.
Many large (100K-300K lines) Fortran programs have been built on the TRACE. After unrolling and trace selection, the code size is approximately three times larger than VAX object code (compiled with the VAX/VMS Fortran compiler).
The concern about code size led us to implement a shared-libraries facility very early in our UNIX development. This has substantially reduced the size of the UNIX utilities images. The UNIX utilities consume approximately 20 Mbytes of disk space on a VAX, and approximately 60 Mbytes on our VLIW using shared libraries.
UNIX has been running on the TRACE and supporting its own development for some time. The principal advantage of Multiflow’s parallel processing technology is that it is transparent to its client. Thus, most of the challenging problems in developing an operating system and programming environment for the TRACE come not from its VLIW nature but from our intention to make the system into a first rate environment for high-performance engineering and scientific computation.
X. SUMMARY AND FUTURE WORK
This paper has introduced the Multiflow TRACE very long instruction word architecture.
Before this machine was built, some designers and researchers predicted that the negative side-effects of the VLIW/compacting compiler approach (object code size, compensation code, context swap time, and procedure call/return overhead) would likely swamp the machine’s performance gains [26]. These predictions were wrong: some challenges remain, but the substantial performance improvements that were promised are now being routinely realized.
It is too early to be able to separate out all the different contributions to performance in the TRACE. Our future work will concentrate on quantifying the speedups due to trace scheduling versus those achieved by more universal compiler optimizations. We will also be examining the efficacy of memory-bank disambiguation, speed/size tradeoffs of the fixed and variable instruction encoding schemes, and instruction cache usage statistics.
Compared to a standard scalar machine, we get significantly higher performance at only slightly higher cost; the extra functional units are cheap compared to the overhead of building the computer in the first place (memory, control, I/O, power, and packaging). With the vector approach, the parallel hardware “turns on” only occasionally, and the speed of some vector code is all that is improved (and VLIW’s get that speedup anyway). When using a multiprocessor to speed the solution of a single problem, you pay the full overhead of instruction execution and run-time synchronization per functional unit, without getting the fine-grained speedups a VLIW can offer.
While it is difficult to compare mid-end and high-end CPU implementations, our real-world experience on 25 million lines of compiled Fortran indicates that a VLIW can beat a comparable vector supercomputer by a factor of three. A VLIW machine should be the architecture of choice for future supercomputer implementations.
REFERENCES
M. Katevenis, Reduced Instruction Set Computer Architectures for VLSI. Cambridge, MA: MIT Press, 1985.
G. S. Tjaden and M. J. Flynn, “Detection and parallel execution of independent instructions,” IEEE Trans. Comput., vol. C-19, pp. 889-895, Oct. 1970.
C. C. Foster and E. M. Riseman, “Percolation of code to enhance parallel dispatching and execution,” IEEE Trans. Comput., vol. C-21, pp. 1411-1415, 1972.
J. A. Fisher, “Very long instruction word architectures and the ELI-512,” in Proc. 10th Symp. Comput. Architecture, IEEE, June 1983, pp. 140-150.
J. R. Ellis, Bulldog: A Compiler for VLIW Architectures. Cambridge, MA: MIT Press, 1986.
J. A. Fisher, “The optimization of horizontal microcode within and beyond basic blocks: An application of processor scheduling with resources,” Tech. Rep. COO-3077.161, Courant Math. and Comput. Lab., New York Univ., Oct. 1979.
J. L. Hennessy, N. Jouppi, F. Baskett, and J. Gill, “MIPS: A VLSI processor architecture,” in Proc. CMU Conf. VLSI Syst. Computat., Oct. 1981, pp. 337-346.
G. Radin, “The 801 minicomputer,” in Proc. SIGARCH/SIGPLAN Symp. Architectural Support Programming Languages Oper. Syst., ACM, Mar. 1982, pp. 39-47.
J. E. Thornton, Design of a Computer: The Control Data 6600. Glenview, IL: Scott, Foresman, 1970.
R. M. Tomasulo, “An efficient algorithm for exploiting multiple arithmetic units,” in Computer Structures: Principles and Examples. New York: McGraw-Hill, 1982, pp. 293-305.
R. D. Acosta, J. Kjelstrup, and H. C. Torng, “An instruction issuing approach to enhancing performance in multiple functional unit processors,” IEEE Trans. Comput., vol. C-35, pp. 815-828, 1986.
J. J. Dongarra, “Performance of various computers using standard linear equations software in a Fortran environment,” Comput. Architecture News, vol. 13, no. 1, pp. 3-11, Mar. 1985.
Swanson Analysis Systems, Inc., “Ansys large scale benchmark timing results,” Tech. Rep., Houston, PA, Apr. 30, 1987.
F. H. McMahon, “The Livermore Fortran kernels: A computer test of the numerical performance range,” Tech. Rep., Lawrence Livermore Nat. Lab., Dec. 1986.
Josh Fisher: idea grew out of his Ph.D (1979) in compilers.
Led to a startup (MultiFlow) whose computers worked, but which went out of business ... the ideas remain influential.
Basic Idea: Super-sized Instructions
Example: All instructions are 64-bit. Each instruction consists of two 32-bit MIPS instructions, that execute in parallel.
A 64-bit VLIW instruction:
opcode rs rt rd shamt funct | opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9     Semantics: $7 = $8 + $9
VLIW Assembly Syntax ...
Instr: ADD $8 $9 $10    ADD $7 $8 $9
“Instr:” denotes the start of an instruction word. Listed operators all execute in parallel.
Instr: SUB $2 $3 $0    OR $1 $5 $4    (execute in parallel)
Label: AND $5 $2 $3    OR $1 $5 $4
[...]
A branch label name can be used instead of the default “Instr”.
32-bit & 64-bit semantics different? Yes!
Assume: $7 = 7, $8 = 8, $9 = 9, $10 = 10 (decimal)
32-bit MIPS:
ADD $8 $9 $10 ; result: $8 = 19
ADD $7 $8 $9 ; result: $7 = 28
VLIW:
Instr: ADD $8 $9 $10 ; result: $8 = 19
       ADD $7 $8 $9 ; result: $7 = 17 (not 28)
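The difference comes from when registers are read. The Python sketch below models the two cases with made-up function names: the sequential version lets each ADD see earlier results, while the VLIW version reads all operands before writing any result.

```python
def run_sequential(regs, ops):
    """32-bit MIPS: each ADD sees the results of the instructions before it."""
    regs = dict(regs)
    for rd, rs, rt in ops:
        regs[rd] = regs[rs] + regs[rt]
    return regs

def run_vliw_word(regs, ops):
    """One VLIW word: every ADD reads the old register values, then all results are written."""
    results = [(rd, regs[rs] + regs[rt]) for rd, rs, rt in ops]   # all reads happen first
    regs = dict(regs)
    for rd, value in results:                                     # then all writes
        regs[rd] = value
    return regs

initial = {7: 7, 8: 8, 9: 9, 10: 10}
ops = [(8, 9, 10), (7, 8, 9)]          # ADD $8 $9 $10 ; ADD $7 $8 $9
seq = run_sequential(initial, ops)
vliw = run_vliw_word(initial, ops)
print(seq[8], seq[7])     # 19 28
print(vliw[8], vliw[7])   # 19 17
```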
Design: A 64-bit VLIW R-format CPU
No loads or stores: machine has no use for data memory, only instruction memory.
No branches or jumps: machine only runs straight line code.
opcode rs rt rd shamt funct | opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9     Semantics: $7 = $8 + $9
VLIW: Straight-line Instruction Fetch
PC register (D, Q; 32 bits, clocked by Clk); Q drives InstrMem Addr, and InstrMem Data is 64 bits wide.
Adder: Q + 0x8 feeds back to D (+8 in hexadecimal -- 64 bit instructions).
CLK   (one column per clock cycle)
Addr: PC          PC + 8          PC + 16
Data: IMem[PC]    IMem[PC + 8]    IMem[PC + 16]
Simple changes to support 64-bit instructions ...
Computing engine of VLIW R-format CPU
Instruction fields: opcode rs rt rd shamt funct | opcode rs rt rd shamt funct
Two ALUs, each with two 32-bit inputs, one 32-bit output, and its own op.
RegFile with four read ports and two write ports: inputs rs1, rs2, rs3, rs4, ws1, ws2 (5 bits each), wd1, wd2 (32 bits each), WE1, WE2; outputs rd1, rd2, rd3, rd4 (32 bits each).
What have we gained with 64-bit VLIW?
opcode rs rt rd shamt funct | opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9     Semantics: $7 = $8 + $9
If: Clock speed remains the same. All 32-bit operators do useful work.
Performance doubles!
N x 32-bit VLIW yields factor of N speedup!
Multiflow: N = 7, 14, or 28 (3 CPUs in product family)
What does N = 14 assembly look like?
[Excerpt: THE MULTIFLOW TRACE SCHEDULING COMPILER, p. 59]
Table 2. Hardware performance of the Trace 300 family.
                                   7/300    14/300   28/300
MOPs                               53       107      215
MFLOPS                             30       60       120
Main memory megabytes/s            123      246      492
Linpack 1000 x 1000                23       42       70
Linpack 100 x 100                  11       17       22
SPECmark                           NA       23       25
Sustainable operations in flight   10-13    20-26    40-52
instr cl0 ialu0e st.64 sb1.r0,r2,17#144
      cl0 ialu1e cgt.s32 li1bb.r4,r34,6#31
      cl0 falu0e add.f64 lsb.r4,r8,r0
      cl0 falu1e add.f64 lsb.r6,r40,r32
      cl0 ialu0l dld.64 fb1.r4,r2,17#208
      cl1 ialu0e dld.64 fb1.r34,r1,17#216
      cl1 ialu1e cgt.s32 li1bb.r3,r32,zero
      cl1 falu0e add.f64 lsb.r4,r8,r6
      cl1 falu1e add.f64 lsb.r6,r40,r38
      cl1 ialu0l st.64 sb1.r2,r1,17#152
      cl1 ialu1l add.u32 lib.r32,r36,6#32
      cl1 br true and r3 L23?3
      cl0 br false or r4 L24?3;
instr cl0 ialu0e dld.64 fb0.r0,r2,17#224
      cl0 ialu1e cgt.s32 li1bb.r3,r34,6#30
      cl0 falu0e mpy.f64 lfb.r10,r2,r10
      cl0 falu1e mpy.f64 lfb.r42,r34,r42
      cl0 ialu0l st.64 sb0.r4,r2,17#160
      cl1 ialu0e dld.64 fb0.r32,r1,17#232
      cl1 ialu1e cgt.s32 li1bb.r4,r35,6#29
      cl1 falu0e mpy.f64 lfb.r10,r0,r10
      cl1 falu1e mpy.f64 lfb.r42,r32,r42
      cl1 ialu0l st.64 sb0.r6,r1,17#168
      cl1 ialu1l bor.32 ib0.r32,zero,r32
      cl1 br false or r4 L25?3
      cl0 br true and r3 L26?3;
Figure 8. TRACE 14/300 code fragment.
Figure 8 shows two instructions of 14/300 code, extracted from the inner loop of the 100 x 100 Linpack benchmark. Each operation is listed on a separate line. The first two fields identify the cluster and the functional unit to perform the operation; the remainder of the line describes the operation. Note the destination address is qualified with a register-bank name (e.g., sb1.r0); the ALUs could target any register bank in the machine (with some restrictions). There is extra latency in reaching a remote bank.
Two instructions from a scientific benchmark (Linpack) for a MultiFlow CPU with 14 operations per instruction.
63
What have we gained with 64-bit VLIW?
opcode rs rt rd shamt funct
opcode rs rt rd shamt funct
Syntax: ADD $8 $9 $10    Semantics: $8 = $9 + $10
Syntax: ADD $7 $8 $9     Semantics: $7 = $8 + $9
If: Clock speed remains the same. All 32-bit operators do useful work.
Performance doubles!
N x 32-bit VLIW yields factor of N speedup!
Multiflow: N = 7, 14, or 28 (3 CPUs in product family)
A very big “if” !
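One way to see how big the “if” is: treat the two conditions as multipliers on the ideal factor of N. The model below is a rough back-of-the-envelope sketch, not a formula from the lecture. The parameter names `utilization` (fraction of operator slots doing useful work) and `clock_ratio` (VLIW clock rate divided by the single-operator clock rate) are assumptions introduced here, as are the example numbers.

```python
def vliw_speedup(n, utilization=1.0, clock_ratio=1.0):
    """Estimated speedup over a one-operator CPU under a simple multiplicative model."""
    return n * utilization * clock_ratio


ideal = vliw_speedup(2)  # both "if" conditions hold
degraded = vliw_speedup(14, utilization=0.5, clock_ratio=0.75)  # made-up numbers
print(ideal, degraded)
```

With both conditions met, N = 2 gives the doubling claimed on the slide. If half the slots are empty and the clock is 25% slower, a 14-wide machine delivers well under half of its ideal factor of 14.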
64
As N scales, HW and SW needs conflict
[Figure: layered view of a computer system]
Software: Application (iTunes), Operating System (Mac OS X), Compiler, Assembler
Instruction Set Architecture: Where the conflict plays out.
Hardware: Processor, Memory, I/O system; Datapath & Control; Digital Design; Circuit Design; Transistors
Hardware need: Clock does not slow down.
Software need: All operators do useful work.
65
Example problem: Register file ports ...
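[Figure: one RegFile shared by two 32-bit ALUs, each with its own op control. Read ports: 5-bit selects rs1, rs2, rs3, rs4 with 32-bit outputs rd1, rd2, rd3, rd4. Write ports: 5-bit selects ws1, ws2 with 32-bit data inputs wd1, wd2 and write enables WE1, WE2.]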
N ALUs require 2*N read ports and N write ports. Why is this a problem?
66
Recall: Register File Design
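[Figure: register file built from 32-bit registers R0 (the constant 0) through R31, all clocked by clk. Two 32-to-1 read muxes, selected by the 5-bit sel(rs1) and sel(rs2), drive the 32-bit outputs rd1 and rd2. A DEMUX driven by the 5-bit sel(ws) and WE generates the per-register En signals; the 32-bit wd feeds the D input of every register.]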
More read ports increase fanout, which slows down reads.
More write ports add data muxes and a demux OR tree.
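The growth in register file hardware can be tallied with a short sketch. It assumes the mux-and-demux structure recalled on this slide and the 2*N read ports and N write ports stated for N ALUs. The function name and the dictionary keys are made up for illustration, and the counts are structural, not timing or area estimates.

```python
def regfile_cost(n_alus, n_regs=32):
    """Count the structures a single shared register file needs for n_alus ALUs."""
    read_ports = 2 * n_alus   # two source operands per ALU
    write_ports = n_alus      # one result per ALU
    return {
        "read_ports": read_ports,
        "write_ports": write_ports,
        "read_muxes": read_ports,               # one n_regs-to-1 mux per read port
        "fanout_per_register": read_ports,      # each Q output drives every read mux
        "data_mux_inputs_per_register": write_ports,  # choose among the wd buses
        "enable_or_inputs_per_register": write_ports, # OR of one demux output per write port
        "write_demuxes": write_ports,           # one 1-to-n_regs demux per write port
    }


for n in (1, 2, 4):
    cost = regfile_cost(n)
    print(n, cost["read_ports"], cost["write_ports"], cost["fanout_per_register"])
```

Every count grows linearly with N, but each added read port also loads every register output, so read delay grows along with area.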
67
Split register files: A solution?
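[Figure: two 32-bit ALUs, each with its own op control and its own RegFile. Each RegFile has two read ports (5-bit selects rs1, rs2 with 32-bit outputs rd1, rd2) and one write port (5-bit select ws, 32-bit data input wd, write enable WE).]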
Too often, the data an ALU needs to do “useful work” will not be in its own regfile.
Software need: All operators do useful work.
68
Architect’s job: Find a good compromise
[Figure: layered view of a computer system]
Software: Application, Operating System, Compiler, Assembler
Instruction Set Architecture: Where the conflict plays out.
Hardware: Processor, Memory, I/O system; Datapath & Control; Digital Design; Circuit Design; Transistors
[Excerpt: THE MULTIFLOW TRACE SCHEDULING COMPILER, p. 57]
[Figure 6. The Multiflow TRACE 7/300 (Mb = megabytes). Labels: 256-bit Instruction Word; Interleaved Memory, Total of 512 Mb, Total of 64 Banks.]
In the 300 series, instructions are issued every 130 ns; there are two 65-ns beats per instruction. Integer operations can issue in the early and late beats of an instruction; floating point operations issue only in the early beat. Most integer ALU operations complete in a single beat. The load pipeline is seven beats. The floating point pipelines are four beats. Branches issue in the early beat and the branch target is reached on the following instruction, effectively a two-beat pipeline. An instruction can issue multiple branch operations (four on the 28/300); the particular branch taken is determined by the precedence encoded in the long instruction word.
• There are four functional units per cluster: two integer units and two floating units. In addition, each cluster can contribute a branch target. Since the integer units issue in both the early and the late beat, a cluster has the resources to issue seven operations for each instruction.
• There are nine register files per cluster (36 register files in the 28/300) (see Table 1). Data going to memory must first be moved to a store file. Branch banks are used to control conditional branches and the select operation.
Example solution: Split register files, with a dedicated bus and special instructions for moves between regfiles.
May hurt software more than it helps hardware :-(
69
Branch policy: All instr operators execute
opcode rs rt rd shamt funct
opcode rs rt rd shamt funct
BNE $8 $9 Label    ADD $7 $8 $9
ADD executes whether the branch is taken or not taken.
Problem: Large N machines find it hard to fill all operators with useful work.
Solution: New “predication” operator.
Syntax: SELECT $7 $8 $9 $10
Semantics: If $8 == 0, $7 = $10, else $7 = $9
Permits simple branches to be converted to inline code.
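A small sketch of what “converted to inline code” means. The `select` function follows the SELECT semantics on the slide. The `branchy` and `if_converted` functions are a made-up example, not taken from the lecture.

```python
def select(cond, if_nonzero, if_zero):
    """SELECT rd cond a b: rd = b if cond == 0, else rd = a."""
    return if_zero if cond == 0 else if_nonzero


def branchy(c, a, b):
    """Original code: a branch picks which arithmetic operation runs."""
    if c != 0:
        return a + b
    return a - b


def if_converted(c, a, b):
    """Branch-free version: compute both arms, then SELECT one result."""
    t1 = a + b   # the two arms are independent, so they could
    t2 = a - b   # share one VLIW instruction word
    return select(c, t1, t2)


print(if_converted(1, 9, 8), if_converted(0, 9, 8))  # 17 1
```

The cost is that both arms always execute, which is acceptable when the machine has operator slots that would otherwise go unused.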
70
Branch nesting in a single instruction ...
opcode rs rt rd shamt funct
opcode rs rt rd shamt funct
BEQ $8 $9 LabelOne    BEQ $11 $12 LabelTwo
Conundrum: How to define the semantics of multiple branches in one instruction?
Solution: Nested branch semantics
If $8 == $9, branch to LabelOne
Else if $11 == $12, branch to LabelTwo
MultiFlow: N-way branch priority set in an opcode field.
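The nested semantics amount to a priority scan over the branch operators in the word. The sketch below is illustrative only: the label addresses are hypothetical, the fall-through is PC + 8 as in the straight-line fetch design, and branch delay effects are ignored.

```python
def next_pc(pc, regs, branches, word_bytes=8):
    """Pick the next PC for one instruction word.

    branches is a list of (rs, rt, target) BEQ operators in priority order;
    the first one whose registers compare equal wins.
    """
    for rs, rt, target in branches:
        if regs[rs] == regs[rt]:
            return target
    return pc + word_bytes  # no branch taken: fall through to the next word


LABEL_ONE, LABEL_TWO = 0x100, 0x200          # hypothetical label addresses
word = [(8, 9, LABEL_ONE), (11, 12, LABEL_TWO)]

both = next_pc(0x40, {8: 1, 9: 1, 11: 2, 12: 2}, word)
second_only = next_pc(0x40, {8: 1, 9: 0, 11: 2, 12: 2}, word)
neither = next_pc(0x40, {8: 1, 9: 0, 11: 2, 12: 3}, word)
print(hex(both), hex(second_only), hex(neither))  # 0x100 0x200 0x48
```

When both conditions hold, the first branch wins, which is the “If ... Else if ...” ordering on the slide.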
71
Will return to VLIW later in semester ...
[Figure 6 and the accompanying text from THE MULTIFLOW TRACE SCHEDULING COMPILER, p. 57, are shown again on this slide.]
72
Next Monday:
First Design Review
This Friday: Look-ahead for Design Review, 125 Cory, 10 AM
73