logo computer architecture dr. esam al_qaralleh princess sumaya university for technology
TRANSCRIPT
LOGO
Computer Architecture Dr. Esam Al_Qaralleh
Princess Sumaya University for Technology
2
Data path
3
The processor : Data Path and Control
PCRegister
Bank
Data Memory
Address
Instructions Address
Data
Instruction Memory
A L U
Data
Register #
Register #
Register #
Two types of functional units:elements that operate on data values (combinational) elements that contain state (state elements)
4
Single Cycle Implementation
State element
1
State element
2
Combinational Logic
Clock Cycle
Typical execution:read contents of some state elements,send values through some combinational logicwrite results to one or more state elements
Using a clock signal for synchronization Edge triggered methodology
5
A portion of the datapath used for fetching instructions
6
The datapath for R-type instructions
7
The datapath for load and store insructions
8
The datapath for branch instructions
9
Complete Data Path
10
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs) Information comes from the 32 bits of the instruction Example: lW $1, 100($2)
Value of control signals is dependent upon: what instruction is being executed which step is being performed
11
Data Path with Control
12
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:memory (2ns), ALU and adders (2ns), register file access (1ns)
13
Multicycle approach
Single Cycle Problems: what if we had a more complicated instruction like floating point? wasteful of area
One Solution: use a “smaller” cycle time have different instructions take different numbers of cycles a “multicycle” datapath:
14
Multicycle Approach
Break up the instructions into steps, each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major
functional unit
At the end of a cycle store values for use in later cycles (easiest
thing to do) introduce additional “internal” registers
15
Multicycle Approach Implementation
16
Five Execution Steps
Instruction FetchInstruction Decode and Register FetchExecution, Memory Address Computation, or
Branch CompletionMemory Access or R-type instruction
completionWrite-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
17
Five Execution Steps
Step nameStep name Action for R-type Action for R-type instructionsinstructions
Action for Memory-Action for Memory-reference Instructionsreference Instructions
Action for Action for branchesbranches
Action for Action for jumpsjumps
Instruction fetch IR = MEM[PC]
PC = PC + 4
Instruction decode/ register fetch
A = Reg[IR[25-21]]
B = Reg[IR[20-16]]
ALUOut = PC + (sign extend (IR[15-0])<<2)
Execution, address computation, branch/jump
completion
ALUOut = A op B ALUOut = A+sign extend(IR[15-0])
IF(A==B) Then
PC=ALUOut
PC=PC[31-28]||(IR[25-
0]<<2)
Memory access or R-type completion
Reg[IR[15-11]] = ALUOut
Load:MDR =Mem[ALUOut]
or
Store:Mem[ALUOut] = B
Memory read completion Load: Reg[IR[20-16]] = MDR
18
pipelining
19
Pipelining is Natural!
Pipelining provides a method for executing multiple instructions at the same time.
Laundry ExampleAnn, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
A B C D
20
Sequential Laundry
Sequential laundry takes 6 hours for 4 loadsIf they learned pipelining, how long would
laundry take?
A
B
C
D
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
21
Pipelined Laundry: Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20
22
Pipelining Lessons
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously using different resources
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for Dependences
A
B
C
D
6 PM 7 8 9
Task
Order
Time
30 40 40 40 40 20
23
Basic DataPath
•What do we need to add to actually split the datapath into stages?
24
Pipeline DataPath
25
The Five Stages of the Load Instruction
Ifetch: Instruction FetchFetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory addressMem: Read the data from the Data Memory
Wr: Write the data back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Ifetch Reg/Dec Exec Mem WrLoad
26
Pipelined Execution
On a processor multiple instructions are in various stages at the same time.
Assume each instruction takes five cycles
IFetch Dcd Exec Mem WB
IFetch Dcd Exec Mem WB
IFetch Dcd Exec Mem WB
IFetch Dcd Exec Mem WB
IFetch Dcd Exec Mem WB
IFetch Dcd Exec Mem WBProgram Flow
Time
27
Single Cycle, Multiple Cycle, vs. Pipeline
28
Graphically Representing Pipelines
Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Are two instructions trying to use the same resource
at the same time?
29
Why Pipeline? Because the resources are there!
30
Pipelining MIPS Execution
31
Observations
Use separate instruction and data memories To reduce memory conflict The memory system must deliver 5 times the
bandwidth, if the pipelined processor has a clock cycle that is equal to the unpipelined processor.
Register file is used in the two stages One for reading in ID and one for writing in WB 2 reads 1 write in every clock cycle
• Maintaining PC IF stage: preparing the PC for the next instruction ID stage: to compute the branch target address Problem of a branch does not change the PC until the
ID stage
32
Why Pipeline?
Suppose 100 instructions are executed The single cycle machine has a cycle time of 45 ns The multicycle and pipeline machines have cycle times of 10
ns The multicycle machine has a CPI of 4.6
Single Cycle Machine 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Ideal pipelined vs. single cycle speedup 4500 ns / 1040 ns = 4.33
What has not yet been considered?
33
Compare Performance
Compare: Single-cycle, multicycle and pipelined control using SPECint2000 Single-cycle: memory access = 200ps, ALU = 100ps, register file read
and write = 50ps 200+50+100+200+50=600ps
Multicycle: 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU CPI = 4.12, The clock cycle = 200ps (longest functional unit)
Pipelined 1 clock cycle when there is no load-use dependence 2 when there is, average 1.5 per load Stores and ALU take 1 clock cycle Branches - 1 when predicted correctly and 2 when not, average 1.25 Jump – 2 1.5x25%+1x10%+1x52%+1.25x11%+2x2% = 1.17
Average instruction time: single-cycle = 600ps, multicycle = 4.12x200=824, pipelined 1.17x200 = 234ps
Memory access 200ps is the bottleneck. How to improve?
34
Can pipelining get us into trouble?
Yes: Pipeline Hazards structural hazards: attempt to use the same resource two
different ways at the same time• E.g., two instructions try to read the same memory at the same
time data hazards: attempt to use item before it is ready
• instruction depends on result of prior instruction still in the pipeline
add r1, r2, r3sub r4, r2, r1
control hazards: attempt to make a decision before condition is evaulated
• branch instructionsbeq r1, loopadd r1, r2, r3
Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards
35
Mem
Single Memory is a Structural Hazard
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4A
LUMem Reg Mem Reg
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
AL
UReg Mem Reg
AL
UMem Reg Mem Reg
Detection is easy in this case! (right half highlight means read, left half write)
36
Structural Hazards limit performance
Example: if 1.3 memory accesses per instruction and only one memory access per cycle then
average CPI = 1.3otherwise resource is more than 100% utilized
Solution 1: Use separate instruction and data memories
Solution 2: Allow memory to read and write more than one word per cycle
Solution 3: Stall
37
Stall: wait until decision is clear Its possible to move up decision to 2nd stage by
adding hardware to check registers as being read
Impact: 2 clock cycles per branch instruction => slow
Control Hazard Solutions
Instr.
Order
Time (clock cycles)
Add
Beq
Load
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
AL
UReg Mem RegMem
38
Predict: guess one direction then back up if wrong
Predict not taken
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of time)
More dynamic scheme: history of 1 branch ( 90%)
Control Hazard Solutions
Instr.
Order
Time (clock cycles)
Add
Beq
Load
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
MemA
LUReg Mem Reg
39
Redefine branch behavior (takes place after next instruction) “delayed branch”
Impact: 1 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time)
Launch more instructions per clock cycle=>less useful
Control Hazard Solutions
Instr.
Order
Time (clock cycles)
Add
Beq
Misc
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
MemA
LUReg Mem Reg
Load Mem
AL
UReg Mem Reg
40
Data Hazard
41
Data Hazard---Forwarding
Use temporary results, don’t wait for them to be writtenregister file forwarding to handle read/write to same registerALU forwarding
42
Can’t always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes
to the same register.
Thus, we need a hazard detection unit to “stall” the load instruction
43
Stalling
We can stall the pipeline by keeping an instruction in the same stage
44
Designing Instruction Set for Pipelining
MIPS instructions are the same length. This makes much easer to fetch instructions in the
first stage and to decode them in the second stage.In IA-32 instructions vary from 1-17 bytes,
pipelining is considerably more challenging. All recent IA-32 architectures translate instructions
into micro_operations, which are pipelinedMIPS has only few instruction formats, with the
source registers fields in the same place This symmetry – the second stage can read the
register file at the same time that the hardware is decoding the type of instruction
Without this symmetry we would need to split stage 2
45
Memory operands only appears in loads or stores in MIPS We can use the execute stage to calculate the
memory and then access memory in the following stage
If we operate on operand in memory, stages 3 and 4 would expand to an address stage, memory stage and then execute stage
Operands must be aligned in memory We can have instructions requiring two data memory
access, the transfer can be done in a single pipeline stage
Designing Instruction Set for Pipelining
46
Summary: Pipelining
What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores
What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction
We’ll talk about modern processors and what really makes it hard:
exception handling trying to improve performance with out-of-order execution, etc.
47
Summary
Pipelining is a fundamental conceptmultiple steps using distinct resources
Utilize capabilities of the datapath by pipelined instruction processing
start next instruction while working on the current one
limited by length of longest stage (plus fill/flush)
detect and resolve hazards