
CSCI 620 1

Order of Class Lectures

• Chapter 2 starts with

Instruction-Level Parallelism: Concepts and Challenges

Instruction Level Parallelism (ILP)

• Definition: Potential to overlap the execution of instructions

– Pipelining is one example

– Limitations of ILP are from data and control hazards

• Approaches to overcoming limitations

– dynamic approaches with hardware

– static approaches that use software

• So, we will cover ISA (Appendix B) and Pipelining (Appendix A) first, then Chapter 2


Instruction Set Principles and Examples

(Appendix B)


Review: Instruction Set Design Parameters

• Operand storage in the CPU: Where are operands kept other than in memory? Registers

• Number of explicit operands named per instruction: How many operands are named explicitly in a typical instruction? 0 to 3

• Operand location: Can any ALU operand be located in memory or must some or all of the operands be internal storage in the CPU? If an operand is located in memory, how is the memory location specified? Most popular addressing: Displacement, Immediate, Register Indirect

• Operations: What operations are provided in the instruction set?

Most often used are: Arithmetic & Logic, Data transfer, Control

• Type and size of operands: What is the type and size of each operand and how is it specified? 8, 16, 32, 64 bits


Current Design Guidelines

• Use general-purpose registers with a load-store architecture

• Support these addressing modes: displacement, immediate, and register indirect

• Use a minimalist instruction set

• Support simple, most-commonly used instructions

• Support standard data sizes and types: 8-, 16-, and 32-bit integers and 64-bit IEEE 754 floating-point numbers

• Use fixed instruction encoding if interested in performance and variable instruction encoding if interested in code size

• Provide at least 16 general-purpose registers plus separate floating-point registers; 32 registers of each highly desirable


The Big Picture: The Performance Perspective

• Performance of a machine is determined by:

– Instruction count

– Clock cycle time

– Clock cycles per instruction

• Processor design (datapath and control) will determine:

– Clock cycle time

– Clock cycles per instruction
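The three factors above combine in the standard CPU performance equation: CPU time = instruction count × clock cycles per instruction × clock cycle time. The sketch below only illustrates that product; the function name and the sample numbers are assumptions, not values from the lecture.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s


# Illustrative numbers only: 1e9 instructions on a 1 ns clock.
baseline = cpu_time(1_000_000_000, 1.5, 1e-9)   # CPI of 1.5
improved = cpu_time(1_000_000_000, 1.2, 1e-9)   # same program, CPI lowered to 1.2
print(f"baseline {baseline:.2f} s, improved {improved:.2f} s, "
      f"speedup {baseline / improved:.2f}x")
```

Because the processor design affects only the last two factors, a datapath change shows up here as a change in `cpi` or `clock_cycle_time_s`, while the instruction count is set by the ISA and the compiler.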


A "Typical" RISC ISA

• 32-bit fixed format instruction (3 formats)

• 32 32-bit GPRs (R0 contains zero; double-precision values take a register pair)

• 3-address, reg-reg arithmetic instruction

• Single address mode for load/store: base + displacement

– no indirection

• Simple branch conditions

• Delayed branch

see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

http://en.wikipedia.org/wiki/MIPS_architecture


Basic MIPS RISC Instruction Set

• All operations on data apply to data in registers

• The only operations that affect memory are load and store operations, which move data from memory to a register or from a register to memory. Therefore, it is called a “load & store” machine

• Instruction formats are few in number, with all instructions typically being one size, so decoding is simpler and faster

• 32 registers

• 3 classes of instructions: ALU, Load and Store, Branches and jumps


MIPS Instruction Format Overview


I-type Instructions

Examples:

lw R1, 30(R2)        Load word          Regs[R1] ← Mem[30+Regs[R2]]

opcode = load word, displacement; rs = R2, rt = R1, immediate = 30

swc1 F0, 40(R3)      Store FP single    Mem[40+Regs[R3]] ← Regs[F0] (bits 0..31, a 32-bit transfer)

opcode = store FP single, displacement; rs = R3, rt = F0, immediate = 40

beq R4, R3, name     Branch on equal    if (Regs[R4] == Regs[R3]) PC ← PC+4+name

opcode = branch on equal, immediate; rs = R4, rt = R3, immediate = name


R-type Instructions

Examples:

add R1, R2, R3       Add                  Regs[R1] ← Regs[R2] + Regs[R3]

opcode = R-type register mode; rd = R1, rs = R2, rt = R3, shamt = 0, funct = ADD

slt R1, R2, R3       Set less than        if (Regs[R2] < Regs[R3]) then Regs[R1] ← 1 else Regs[R1] ← 0

opcode = R-type register mode; rd = R1, rs = R2, rt = R3, shamt = 0, funct = SLT

sll R1, R2, 10       Shift left logical   Regs[R1] ← Regs[R2] << 10

opcode = R-type register mode; rd = R1, rs = 0, rt = R2, shamt = 10, funct = SLL


J-type Instructions

Examples:

j name               Jump                 PC ← name (jump address)

opcode = jump; offset = name

jal name             Jump and link        PC ← name, Regs[R31] ← PC+4

opcode = jump and link; offset = name

There are two more instruction formats for floating point; both are 32-bit fixed formats
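The field assignments in the I-type and R-type examples can be made concrete by packing them into 32-bit words. This is a minimal sketch, assuming the standard MIPS32 field widths (R-type: 6/5/5/5/5/6 bits, I-type: 6/5/5/16 bits) and the MIPS32 encodings for `add`, `sll` and `lw`; the helper names `encode_r` and `encode_i` are illustrative.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """Pack an R-type word: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(opcode, rs, rt, imm):
    """Pack an I-type word: opcode(6) rs(5) rt(5) immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


FUNCT_ADD = 0x20   # MIPS32 funct code for add
FUNCT_SLL = 0x00   # MIPS32 funct code for sll
OPCODE_LW = 0x23   # MIPS32 opcode for lw

add_word = encode_r(rs=2, rt=3, rd=1, shamt=0, funct=FUNCT_ADD)    # add R1, R2, R3
sll_word = encode_r(rs=0, rt=2, rd=1, shamt=10, funct=FUNCT_SLL)   # sll R1, R2, 10
lw_word = encode_i(OPCODE_LW, rs=2, rt=1, imm=30)                  # lw R1, 30(R2)
print(f"{add_word:08x} {sll_word:08x} {lw_word:08x}")
```

Keeping rs and rt in the same bit positions in both formats is what lets the register file be read in parallel with decoding.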


Most Popular MIPS Instructions

Integer benchmarks


Most Popular MIPS Instructions

Floating-point benchmarks


Implementation of MIPS RISC Instruction Set

• Instruction fetch cycle (IF)

– Send PC to memory

– Fetch current instruction from memory

– Update PC (PC ← PC + 4)

• Instruction decode/register fetch cycle (ID)

– Decode instruction

– Read registers corresponding to register source specifiers from register file (in parallel with decoding)

– Look for branch conditions, act accordingly


Implementation of MIPS RISC Instruction Set--continued

• Execution/effective address cycle (EX)

– ALU operates on operands prepared in the prior cycle and performs one of three functions, depending on the instruction type:

– Memory reference: ALU adds base register and offset to form effective address

– Register-register ALU instruction: ALU does operation specified by ALU opcode on values read from register file

– Register-immediate ALU instruction: ALU does operation specified by ALU opcode on first value read from register file and the sign-extended immediate


Implementation of MIPS RISC Instruction Set--continued

• Memory Access (MEM)

– Performs read using effective address if instruction is a load

– Performs write of data from second register read from register file using effective address if instruction is a store

• Write-back Cycle (WB)

– Write to register file for either register-register ALU instruction or load instruction
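The five cycles can be traced with a small interpreter that performs the work of each stage in order for one instruction at a time (no overlap yet). This is a sketch under simplifying assumptions: instructions are hypothetical tuples `(op, r, base, last)` rather than encoded words, only `add`, `addi`, `lw` and `sw` are modelled, and branches and the hardwired-zero R0 are left out.

```python
def run_instruction(instr, regs, mem, pc):
    """Carry one instruction through IF, ID, EX, MEM and WB; return the next PC.

    Tuple forms (an assumption of this sketch):
      ("add", rd, rs, rt)   ("addi", rt, rs, imm)
      ("lw",  rt, base, imm) ("sw",  rt, base, imm)
    """
    op, r, base, last = instr
    # IF: the fetched instruction is passed in as `instr`; PC <- PC + 4.
    next_pc = pc + 4
    # ID: decode and read the source registers from the register file.
    a = regs[base]
    if op == "add":
        b = regs[last]        # second source register
    elif op == "sw":
        b = regs[r]           # second register read: the value to store
    else:
        b = None
    # EX: register-register result, or register + immediate (also the
    # effective address for loads and stores).
    alu_out = a + b if op == "add" else a + last
    # MEM: loads read memory, stores write the second register value.
    loaded = None
    if op == "lw":
        loaded = mem[alu_out]
    elif op == "sw":
        mem[alu_out] = b
    # WB: ALU instructions and loads write the register file.
    if op in ("add", "addi"):
        regs[r] = alu_out
    elif op == "lw":
        regs[r] = loaded
    return next_pc


regs = [0] * 32
regs[2] = 100
mem = {130: 7}
pc = 0
program = [("lw", 1, 2, 30), ("addi", 3, 1, 5), ("add", 4, 1, 3), ("sw", 4, 2, 40)]
for instr in program:
    pc = run_instruction(instr, regs, mem, pc)
print(regs[1], regs[3], regs[4], mem[140], pc)
```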


Pipelining: Basic and Intermediate Concepts

(Appendix A)


Datapath vs Control

• Datapath: Storage, FU, interconnect sufficient to perform the desired functions

– Inputs are control points

– Outputs are signals

• Controller: State machine to orchestrate operation on the data path

– Based on desired function and signals

[Figure: Controller and Datapath blocks; the controller drives the datapath's control points and the datapath returns signals to the controller]


Implementation of Single cycle machine

From this text

lw $t0, 32($3)

http://www.elsevierdirect.com/product.jsp?isbn=9780123706065


Single cycle machine with Control logic


Division of execution into 5 stages

What factors to consider in the division?


These are pipeline registers. Why do we need them?


Approaching an ISA

• Instruction Set Architecture

– Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing

• Meaning of each instruction is described by RTL (Register Transfer Language) on architected registers and memory

• Given technology constraints, assemble an adequate datapath

– Architected storage mapped to actual storage

– Function units to do all the required operations

– Possible additional storage (e.g. MAR, MBR, …)

– Interconnect to move information among regs and FUs

• Map each instruction to sequence of RTLs

• Collate sequences into symbolic controller state transition diagram (STD)

• Lower symbolic STD to control points

• Implement controller


Visualizing Pipelining

[Figure: pipeline diagram. Vertical axis: instruction order; horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Each of four instructions passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction]
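The staggered layout of such a diagram can be generated directly: instruction i enters IF in cycle i + 1 and each stage takes one cycle, so n instructions finish in (number of stages + n − 1) cycles when nothing stalls. A minimal sketch, with the function name `pipeline_rows` chosen for illustration and the IF/ID/EX/MEM/WB stage names standing in for the figure's Ifetch/Reg/ALU/DMem/Reg:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(n_instructions):
    """Return one row per instruction; row[c] is the stage active in cycle c + 1."""
    total_cycles = len(STAGES) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage          # instruction i is one cycle behind i - 1
        rows.append(row)
    return rows


rows = pipeline_rows(4)
for i, row in enumerate(rows):
    print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading one column of the output top to bottom shows which stage each instruction occupies in that cycle, which is how the hazards on the following slides are spotted.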

Pipelining and Hazards


Hazards

Hazards are situations that hamper execution flow

• Structural Hazards:

– Resource conflict: hardware cannot support all possible combinations of instructions simultaneously, e.g. fetching an instruction and fetching data simultaneously from one memory

• Data Hazards:

– Source operands are not available: instruction depends on results of previous instructions still in the pipeline

• Control Hazards:

– Changes in program counter: jumps, calls, interrupts, etc.

– When a branch happens, what happens to the instructions already in the pipeline?


One Memory Port / Structural Hazards (Figure A.4, page A-14)

[Figure: pipeline diagram, instruction order versus time in clock cycles, for Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through Ifetch, Reg, ALU, DMem, Reg. A memory conflict is marked in the diagram]


One Memory Port / Structural Hazards (similar to Figure A.5, page A-15)

[Figure: pipeline diagram, instruction order versus time in clock cycles, for Load, Instr 1, Instr 2, Stall, Instr 3, each instruction passing through Ifetch, Reg, ALU, DMem, Reg. The stall appears as a row of bubbles]

How do you “bubble” the pipe?


Structural Hazard: Single Memory—another view

Clock cycle number

Instruction   1     2     3     4     5     6     7     8     9     10
Load          IF    ID    EX    MEM   WB
Instr. 1            IF    ID    EX    MEM   WB
Instr. 2                  IF    ID    EX    MEM   WB
Instr. 3                        Stall IF    ID    EX    MEM   WB
Instr. 4                                    IF    ID    EX    MEM   WB
Instr. 5                                          IF    ID    EX    MEM
Instr. 6                                                IF    ID    EX

In which cycles do structural hazards occur? Whenever an IF (instruction fetch) and a MEM (memory access, read or write) fall in the same cycle, they are candidates for a structural hazard.

In this example, we assume that only the first instruction (Load) needs to access memory in its MEM cycle.

Structural hazards can be solved by duplicating the hardware, e.g. dual-port memory, multiple ALUs
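The stall in the table can be reproduced by scheduling instruction fetches around the cycles in which the single memory port is busy with a data access. This is a sketch under the slide's assumptions (five stages, MEM three cycles after IF, one shared port, no other stalls); the function name `schedule_fetches` is illustrative.

```python
def schedule_fetches(uses_memory):
    """Return the IF cycle of each instruction when memory has one port.

    uses_memory[i] is True when instruction i accesses memory in its MEM
    stage, which falls three cycles after its IF (IF ID EX MEM WB).
    """
    busy = set()            # cycles in which the port serves a data access
    fetch_cycles = []
    cycle = 1
    for needs_mem in uses_memory:
        while cycle in busy:        # port taken by an earlier MEM: stall fetch
            cycle += 1
        fetch_cycles.append(cycle)
        if needs_mem:
            busy.add(cycle + 3)
        cycle += 1
    return fetch_cycles


# Load followed by six instructions that do not touch data memory.
print(schedule_fetches([True, False, False, False, False, False, False]))
```

With separate instruction and data memories (the duplicated hardware mentioned above), `busy` would never block a fetch and every instruction would start one cycle after its predecessor.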


Data Hazard on R1 (Figure A.6, page A-17)

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

[Figure: pipeline diagram, instruction order versus time in clock cycles, with stages IF, ID/RF, EX, MEM, WB; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]


Classification of Data Hazards

Consider instructions i and j, where i occurs before j. The possible data hazards are:

• RAW (read after write) — j tries to read a source before i writes it, so j gets the old value. The most common type

• WAR (write after read) — j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline—in MIPS pipeline this hazard cannot happen)

• WAW (write after write) — j tries to write an operand before it is written by i (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled—in MIPS pipeline this hazard cannot happen)


Three Generic Data Hazards

• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it

I: add r1,r2,r3

J: sub r4,r1,r3

• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.


• Write After Read (WAR): InstrJ writes operand before InstrI reads it

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and

– Reads are always in stage 2, and

– Writes are always in stage 5

I: sub r4,r1,r3

J: add r1,r2,r3

K: mul r6,r1,r7




• Write After Write (WAW): InstrJ writes operand before InstrI writes it.

• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.

• Can’t happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and

– Writes are always in stage 5

• Will see WAR and WAW in more complicated pipes

I: sub r1,r4,r3

J: add r1,r2,r3

K: mul r6,r1,r7
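The three I/J examples on these slides can be classified mechanically by comparing destination and source registers. This is a minimal sketch; the `(dest, sources)` tuple representation and the function name `classify_hazards` are assumptions made for illustration, and the function reports the register dependences that could become hazards, not whether a given pipeline actually stalls on them.

```python
def classify_hazards(i, j):
    """Classify dependences between instruction i and a later instruction j.

    Each instruction is a (dest, sources) tuple of register names.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = []
    if i_dest is not None and i_dest in j_srcs:
        hazards.append("RAW")   # j reads what i writes
    if j_dest is not None and j_dest in i_srcs:
        hazards.append("WAR")   # j writes what i reads
    if i_dest is not None and i_dest == j_dest:
        hazards.append("WAW")   # both write the same register
    return hazards


# I: add r1,r2,r3   J: sub r4,r1,r3
print(classify_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3"))))
# I: sub r4,r1,r3   J: add r1,r2,r3
print(classify_hazards(("r4", ("r1", "r3")), ("r1", ("r2", "r3"))))
# I: sub r1,r4,r3   J: add r1,r2,r3
print(classify_hazards(("r1", ("r4", "r3")), ("r1", ("r2", "r3"))))
```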


Software Solution to data hazards

Compiler recognizes data hazard and adds nops to eliminate it. Simple, but …

sub R2, R1, R3 ; register R2 written by sub

nop ; no operation

nop

nop

and R12, R2, R5 ; now, result from sub available

or R13, R6, R2

add R14, R2, R2

sw R15, 100(R2)

Any problem with this solution? Yes: the nops do no useful work, so cycles are wasted
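The compiler pass described here can be sketched as a single scan that pads with nops until every reader is far enough behind the instruction that produces its operand. This is an illustration only: the `(text, dest, sources)` tuples and the name `insert_nops` are hypothetical, and the default of three nops is taken from the slide's example (a pipeline whose register file can be written and read in the same cycle would need fewer).

```python
NOP = ("nop", None, ())


def insert_nops(program, nops_needed=3):
    """Insert nops so each consumer trails its producer by enough instructions.

    program is a list of (text, dest, sources) tuples in program order.
    """
    out = []
    last_write = {}                     # register -> index in `out` of its producer
    for text, dest, srcs in program:
        for src in srcs:
            if src in last_write:
                # Instructions already sitting between producer and consumer.
                distance = len(out) - last_write[src] - 1
                out.extend([NOP] * max(0, nops_needed - distance))
        if dest is not None:
            last_write[dest] = len(out)
        out.append((text, dest, srcs))
    return [text for text, _, _ in out]


program = [
    ("sub R2, R1, R3", "R2", ("R1", "R3")),
    ("and R12, R2, R5", "R12", ("R2", "R5")),
    ("or R13, R6, R2", "R13", ("R6", "R2")),
    ("add R14, R2, R2", "R14", ("R2", "R2")),
    ("sw R15, 100(R2)", None, ("R15", "R2")),
]
for line in insert_nops(program):
    print(line)
```

A smarter compiler would try to move independent instructions into those positions instead of nops, which is the same idea the delayed-branch slides apply to branches.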


Data Hazard Control: Stalls by hardware

• Hazard occurs when instruction reads (in ID stage) register that will be written by an earlier instruction (in WB stage)

• Idea: Detect hazard and stall instructions in pipeline until hazard is resolved

• Detect hazard by comparing read fields in IF/ID pipeline register with write fields in later pipeline registers (ID/EX, EX/MEM, MEM/WB)

• To add bubble in pipeline

– Preserve PC register and IF/ID pipeline register

– Change EX, MEM, and WB control fields of ID/EX pipeline register to do nothing
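The detection step amounts to a comparison between the source registers in IF/ID and the destination registers held in the later pipeline registers. A minimal sketch, assuming no forwarding and using register numbers with `None` for "no destination"; the function name and argument layout are illustrative, not taken from the lecture.

```python
def must_stall(if_id_sources, id_ex_dest, ex_mem_dest, mem_wb_dest):
    """True when the instruction in ID reads a register that an earlier,
    still-in-flight instruction has yet to write (no forwarding assumed)."""
    # Register 0 is hardwired to zero, so a write to it never creates a hazard.
    pending = {d for d in (id_ex_dest, ex_mem_dest, mem_wb_dest)
               if d is not None and d != 0}
    return any(src in pending for src in if_id_sources)


# sub R2, R1, R3 is in EX while and R12, R2, R5 is in ID: stall.
print(must_stall((2, 5), id_ex_dest=2, ex_mem_dest=None, mem_wb_dest=None))
# An instruction reading R6 and R7 has no conflict with the pending write to R2.
print(must_stall((6, 7), id_ex_dest=2, ex_mem_dest=None, mem_wb_dest=None))
```

When the function returns True, the hardware applies the bubble recipe above: hold the PC and IF/ID, and zero the control fields entering ID/EX.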


Data Hazard Reduction: Forwarding

• Needed result is available before it is written into register file in WB stage

• Idea: Use temporary results instead of waiting for registers to be written

• Cannot remove the stall when a load writes a register that the immediately following instruction reads

• Almost all pipelined machines today use some form of forwarding
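Forwarding can be pictured as a selection made for each ALU operand: take the value from the newest pipeline register that holds it, otherwise from the register file. This is a sketch of that selection only, with hypothetical names (`forward_source` and the string labels); real forwarding logic also checks that the earlier instruction actually writes a register.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where an ALU operand for the instruction in EX comes from.

    ex_mem_dest and mem_wb_dest are the destination registers (or None) of
    the instructions one and two places ahead in the pipeline.
    """
    if src_reg == 0:
        return "REGFILE"        # R0 always reads as zero, nothing to forward
    if src_reg == ex_mem_dest:
        return "EX/MEM"         # newest value: ALU result from the last cycle
    if src_reg == mem_wb_dest:
        return "MEM/WB"         # value about to be written back
    return "REGFILE"


# add r1,r2,r3 followed by and r6,r1,r7 then or r8,r1,r9:
print(forward_source(1, ex_mem_dest=1, mem_wb_dest=None))   # for "and"
print(forward_source(1, ex_mem_dest=6, mem_wb_dest=1))      # for "or"
```

Checking EX/MEM before MEM/WB matters when both hold the same destination register, because the EX/MEM copy is the more recent write.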


Data Hazard on r1

sub r4, r1, r3

add r1, r2, r3

and r6, r1, r7

or r8, r1, r9

xor r10, r1, r11

[Figure: pipeline diagram, instruction order versus time in clock cycles (CC 1 to CC 6), stages IM, Reg, DM, Reg. Annotations read “r1 is changed here”]

Are both hazards?


Forwarding to Avoid Data Hazard

[Figure: pipeline diagram for the sequence below, instruction order versus time in clock cycles (CC 1 to CC 6), stages IM, Reg, DM, Reg]

sub r4, r1, r3

add r1, r2, r3

and r6, r1, r7

or r8, r1, r9

xor r10, r1, r11

This can be done by writing on the rising edge of the clock and reading on the falling edge


Data Hazard Even with Forwarding

[Figure: pipeline diagram for the sequence below, instruction order versus time in clock cycles (CC 1 to CC 5), stages IM, Reg, DM, Reg]

sub r4, r1, r5

lw r1, 0(r2)

and r6, r1, r7

or r8, r1, r9

This can’t be done because it means forwarding the result in “negative time”. So, we have to stall the pipeline.

See next page


Data Hazard Even with Forwarding

[Figure: pipeline diagram for the sequence below, with bubbles inserted after the load]

sub r4, r1, r5

lw r1, 0(r2)

and r6, r1, r7

or r8, r1, r9
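The one case forwarding cannot cover, a load followed immediately by a user of the loaded register, is usually caught by a check in the ID stage. A minimal sketch of that check, with illustrative names and register numbers; it assumes the loaded value becomes available at the end of MEM, as in the five-stage pipeline above.

```python
def load_use_stall(id_ex_is_load, id_ex_dest, if_id_sources):
    """True when the instruction in ID must wait one cycle for a load in EX."""
    return bool(id_ex_is_load and id_ex_dest != 0 and id_ex_dest in if_id_sources)


# lw r1, 0(r2) is in EX while and r6, r1, r7 is in ID: one bubble is needed.
print(load_use_stall(True, 1, (1, 7)))
# An ALU instruction writing r1 ahead of the same reader can be forwarded.
print(load_use_stall(False, 1, (1, 7)))
```

After the single bubble, the loaded value can be forwarded from MEM/WB to the ALU input, so no further stall is needed for the dependent instruction.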


Control Hazard on Branches: Three Stage Stall

40 beqz R1, 36

44 and R12, R2, R5

48 or R13, R6, R2

52 add R14, R2, R2

80 ld R4, R7, 100

[Figure: pipeline diagram, program execution order (in instructions) versus time in clock cycles (through CC 9), stages IM, Reg, DM, Reg]


Control Hazard on Branches: Three Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8,r1,r9

36: xor r10,r1,r11

[Figure: pipeline diagram; each instruction passes through Ifetch, Reg, ALU, DMem, Reg]

What do you do with the 3 instructions in between?

How do you do it?

Where is the “commit”?


Pipelined MIPS Datapath with branch logic in 2nd stage (Figure A.24, page A-38)

[Figure: datapath with five stages (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) separated by pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled components: Memory, Address, Adder (with constant 4), Reg File, Sign Extend, Zero?, ALU, Data Memory, MUX. Labeled signals: Next PC, Next SEQ PC, RS1, RS2, Imm, RD, WB Data]


Branch Characteristics

• Integer Benchmarks: 14 – 16% of instructions are conditional branches

• Floating Point: 3 – 12%

• On average:

– 67% of conditional branches are “taken”

– 60% of forward branches are taken

– 85% of backward branches are taken


Solutions to Control Hazards

1. Stall pipeline until branch is decided

– Simple, no other things to do, no problems with exceptions

2. Assume (predict) Not Taken

– Execute successor instructions in sequence

– “Squash” instructions in pipeline if branch actually taken

– Requires back-out logic in case Taken

– PC+4 already calculated, so use it to get next instruction

– As we have seen above, most branches are taken, so wrong predictions may slow the pipeline down

3. Assume Taken

– No advantage for a simple pipe like MIPS, as the branch address is not known until the MEM stage

Solutions 2 & 3 are simple static predictions with a heavy penalty for a wrong prediction, hence “dynamic branch prediction” (next slide)

4. Delayed Branch, a software solution

– Can help a little


A Dynamic Branch Prediction

• Uses some kind of “branch history table”

• Tries to identify the trend in branch behavior (taken or not taken) of the current code

The states in a 2-bit prediction scheme
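One common way to realize a 2-bit scheme is a saturating counter per branch: states 0 and 1 predict not taken, states 2 and 3 predict taken, and a prediction must be wrong twice in a row before it flips. The sketch below assumes that variant (the exact transitions of the slide's state diagram are not recoverable from the text), and the class and function names are illustrative.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop-closing branch: taken 9 times, then not taken, run twice.
loop_branch = ([True] * 9 + [False]) * 2
print(count_mispredictions(loop_branch, TwoBitPredictor()))
```

After warm-up, the loop exit costs one misprediction per run of the loop; a 1-bit scheme would also mispredict the first iteration of the next run.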


Delayed Branch by compiler

(a) is the best choice if possible

Strategies (b) & (c) are used when (a) is not possible

Seung Bae Im, 2/25/2008


Delayed Branch

• Where to get instructions to fill branch delay slot?

– Before branch instruction

– From the target address: only valuable when branch taken

– From fall through: only valuable when branch not taken

• Compilers’ effectiveness for single branch delay slot (as in MIPS):

– Fills about 60% of branch delay slots

– About 80% of instructions executed in branch delay slots useful in computation

– About 50% (60% x 80%) of slots usefully filled

• Delayed Branch downside: As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed

– Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches

– Growth in available transistors has made dynamic approaches relatively cheaper


Delayed Branch

• The following is quoted from http://www.cs.umd.edu/class/fall2001/cmsc411/projects/branches/delay.html

• So, if this strategy offers improvements irregardless of whether we take or do not take the branch, what is the problem? The problem is trying to find an instruction that can both be safely executed whether the branch is taken or not, and will still improve performance. This is the compiler's job, and so using a branch delay slot makes compilers more complex to program. Also, Hennesy and Patterson mention that using this option does cause one shortcoming, if the hardware is changed so that a delay-branch slot is no longer used, all the old programs will no longer work. C programs would have to be recompiled, and assembly language programs and routines would have to be re-written. So, this method does put a lot more work into the hands of the system programmers.




Speedup of pipeline with branches

\[
\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stalls}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
\]
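A worked instance of the formula, assuming an ideal CPI of 1 and stalls caused by branches only. The inputs are illustrative: a depth of 5, a branch frequency of 0.15 (within the 14 to 16% quoted for integer benchmarks), and penalties of 3 cycles (the three-stage stall) and 1 cycle (branch resolved in the second stage).

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Pipeline speedup = depth / (1 + branch frequency x branch penalty)."""
    return depth / (1 + branch_frequency * branch_penalty)


three_cycle = pipeline_speedup(5, 0.15, 3)   # branch resolved late
one_cycle = pipeline_speedup(5, 0.15, 1)     # branch logic moved to 2nd stage
print(f"penalty 3: {three_cycle:.2f}x, penalty 1: {one_cycle:.2f}x")
```

With these inputs, cutting the penalty from three cycles to one recovers most of the gap to the ideal speedup of 5.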


Pipeline Hazards

Structural Hazard

– Caused by: simultaneous need for the same hardware components by different pipeline stages, e.g. ALU and memory are needed by several stages

– Solved by: duplicated hardware, e.g. multiple ALUs, separate instruction memory and data memory

Data Hazard

– Caused by: dependencies of data between instructions

– Solved by: stalling (= flushing); forwarding (= bypassing)

Control Hazard

– Caused by: branch instructions

– Solved by: stalling (= flushing); branch prediction, static (taken/not taken) or dynamic; delayed branch (software)
