Superscalar Processors

Kasi L.K. Anbumony

Department of Electrical and Computer Engineering

Auburn University, Auburn, AL 36849

ELEC6200, 10/24/05

Outline

• Pipelining: Motivation

• Pipeline Hazards

• Advanced Pipelining

– Instruction Level Parallelism (ILP)

– Multiple Issue (MIPS Superscalar)

– Static Multiple Issue (SW centric)

– Dynamic Multiple Issue (HW centric)

• Superscalar Processor

• Conclusion

Pipelining: Motivation

• Multiple instructions are overlapped in execution to exploit instruction-level parallelism (ILP)

• One of the techniques used to make processors fast

• Some terms: stages, task order, throughput

• In a pipeline, the stages operate concurrently (in parallel)

• Possible as long as we have separate resources for each stage

Sequential Laundry: Non-pipelined

• Sequential laundry takes 6 hours for 4 loads

• If they learned pipelining, how long would laundry take?

[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D (task order) run one after another; each load takes three steps of 30, 40, and 20 minutes.]

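Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads

[Figure: timeline starting at 6 PM. Loads A, B, C, and D (task order) overlap; the intervals along the time axis are 30, 40, 40, 40, 40, and 20 minutes, so the last load finishes at 9:30 PM.]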
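The 6-hour and 3.5-hour figures can be checked with a short calculation. The sketch below is a minimal model, assuming the three stage times from the laundry timeline (30, 40, and 20 minutes) and assuming that a finished load may wait until the next machine is free. The function names are illustrative and do not come from the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage takes the next load as soon as load and machine are both free."""
    # stage_free[s] = time at which stage s finishes its most recent load
    stage_free = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[s])
            ready = start + minutes
            stage_free[s] = ready
    return stage_free[-1]


STAGES = [30, 40, 20]  # minutes per stage, read off the laundry timeline
LOADS = 4              # loads A, B, C, D

seq = sequential_minutes(STAGES, LOADS)
pipe = pipelined_minutes(STAGES, LOADS)
print(f"sequential: {seq / 60:.1f} h, pipelined: {pipe / 60:.1f} h")
```

The pipelined total is the first stage (30), four passes through the slowest stage (4 x 40), and the last stage (20), which is why the 40-minute step dominates.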

Pipelining: Lessons

• Pipelining improves the throughput of the entire workload; it does not reduce the time to complete a single load

• Pipeline rate is limited by the slowest pipeline stage

• Multiple tasks operate simultaneously

• Potential speedup = number of pipe stages

• Unbalanced lengths of pipe stages reduce speedup

• Time to “fill” the pipeline and time to “drain” it reduce speedup

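[Figure: the pipelined laundry timeline again, from 6 PM to 9 PM, with loads A to D in task order and intervals of 30, 40, 40, 40, 40, and 20 minutes.]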

Comparison: Example

Consider a non-pipelined machine with 5 execution steps of lengths 200 ps, 100 ps, 200 ps, 200 ps, and 100 ps. Due to clock skew and setup, pipelining adds 5 ps of overhead to each instruction stage. Ignoring latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?

Sequential vs. Pipelined Execution

[Figure: Sequential execution: three instructions run back to back, each taking 800 ps (steps of 200, 100, 200, 200, and 100 ps). Pipelined execution: the same three instructions overlap, each one starting a stage behind the previous one.]
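The question posed in the comparison example can be answered from the step lengths alone. This is a minimal sketch, assuming (as is usual for this kind of exercise) that the unpipelined machine spends the sum of all step lengths on each instruction, and that the pipelined clock period is set by the slowest stage plus the 5 ps overhead.

```python
STEP_PS = [200, 100, 200, 200, 100]  # step lengths from the example
OVERHEAD_PS = 5                      # clock skew and setup overhead per stage

# Unpipelined: one instruction completes every sum-of-steps picoseconds.
unpipelined_ps = sum(STEP_PS)

# Pipelined: one instruction completes every clock cycle, and the cycle
# must be long enough for the slowest stage plus the overhead.
pipelined_ps = max(STEP_PS) + OVERHEAD_PS

speedup = unpipelined_ps / pipelined_ps
print(f"{unpipelined_ps} ps / {pipelined_ps} ps = {speedup:.2f}x")
```

The result is about 3.9 rather than the ideal 5, because the stages are unbalanced and every stage pays the overhead.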

Speed Up Equation for Pipelining

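$$
\text{Speedup from pipelining}
= \frac{\text{Avg. instr. time unpipelined}}{\text{Avg. instr. time pipelined}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
$$

Ideal CPI of the pipelined machine:

$$
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}
$$

$$
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
$$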

Speed Up Equation for Pipelining

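$$
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}
$$

$$
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
$$

With an ideal CPI of 1:

$$
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}
$$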
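The last form of the speedup equation is easy to evaluate numerically. The sketch below assumes an ideal CPI of 1; the function name and the sample inputs (a 5-stage pipeline, 0.25 stall cycles per instruction, equal clock cycles) are hypothetical values chosen only to illustrate the formula.

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined cycle / pipelined cycle).

    Assumes an ideal CPI of 1.
    """
    return depth / (1.0 + stall_cpi) * (cycle_unpipelined / cycle_pipelined)


# Hypothetical inputs: 5-stage pipeline, equal clock cycles.
no_stalls = pipeline_speedup(depth=5, stall_cpi=0.0)
some_stalls = pipeline_speedup(depth=5, stall_cpi=0.25)
print(no_stalls, some_stalls)
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction enlarges the denominator and pulls the speedup below that bound.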

It’s Not That Easy for Computers: Limitation

• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle

– Structural hazards: the hardware cannot support the combination of instructions that must execute in the same clock cycle (a combined washer and dryer)

– Data hazards: an instruction depends on the result of a prior instruction that is still in the pipeline (one sock missing)

– Control hazards: caused by pipelining branches and other instructions that change the flow of control

• Common solution: stall the pipeline until the hazard is resolved, so that a “bubble” passes through the pipeline

Instruction Level Parallelism

• One way to increase ILP is a longer (deeper) pipeline

• Laundry analogy: divide the washer into three machines that perform the wash, rinse, and spin steps of a traditional machine

• To get the full speedup, we need to rebalance the remaining steps so that they are all the same length

• The amount of parallelism exploited is higher, since more operations are being overlapped

Advanced Pipelining: Techniques

• Motivation: to further exploit instruction-level parallelism (ILP)

• Multiple issue: replicate the internal components of the computer so that it can launch multiple instructions in every pipeline stage

• Dynamic pipeline scheduling (also called dynamic pipelining or dynamic multiple issue): the hardware schedules instructions to avoid pipeline hazards

Multiple Issue: Superscalar

• Launch multiple instructions in parallel

• A superscalar laundry would replace our household washer and dryer with, say, three washers and three dryers, plus three assistants to fold and put away three times as much laundry in the same amount of time

• Downside: extra work is needed to keep all the machines busy and to transfer the loads to the next pipeline stage

• Superscalar is defined as executing more than one instruction per clock cycle

Performance Metrics: CPI & IPC

• With multiple issue, the instruction execution rate can exceed the clock rate, so CPI can be less than 1

• Example: a 6 GHz, 4-way multiple-issue microprocessor can execute at a peak rate of 24 billion instructions per second, for a best-case CPI of 0.25

• Equivalently, instructions per clock cycle (IPC) = 4 for the above case

• Assuming a 5-stage pipeline, such a processor would have 20 instructions in execution at any given time
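The numbers in this example all follow from the clock rate, the issue width, and the pipeline depth. A minimal sketch of the arithmetic (the variable names are illustrative):

```python
clock_hz = 6e9        # 6 GHz clock, from the example
issue_width = 4       # 4-way multiple issue
pipeline_stages = 5   # assumed 5-stage pipeline

peak_ips = clock_hz * issue_width          # peak instructions per second
best_ipc = issue_width                     # instructions per clock cycle
best_cpi = 1 / issue_width                 # clock cycles per instruction
in_flight = issue_width * pipeline_stages  # instructions in execution at once

print(f"peak rate: {peak_ips / 1e9:.0f} billion instructions per second")
print(f"best-case CPI: {best_cpi}, IPC: {best_ipc}, in flight: {in_flight}")
```

These are peak values; any hazard that leaves an issue slot empty lowers the achieved IPC.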

Multiple issue processor: Decision Strategy

• Static Multiple Issue

– Decisions are made at compile time, before execution

– Software based

– Compiler scheduling

– VLIW (Very Long Instruction Word)

• Dynamic Multiple Issue

– Decisions are made at run (execution) time by the processor

– Dynamic scheduling

– Hardware based

Static Multiple Issue Processor

• Issue packet: the set of instructions that issue together in one clock cycle; it can be viewed as one large instruction with multiple operations (VLIW)

• Relies on the compiler to take responsibility for handling data and control hazards

• The compiler’s responsibilities may include static branch prediction and code scheduling

Getting CPI < 1: Static 2-Issue Pipeline

• Superscalar MIPS: 2 instructions per clock cycle, 1 ALU instruction and 1 load instruction

– Fetch 64 bits per clock cycle; ALU instruction on the left, load on the right

– Can only issue the 2nd instruction if the 1st instruction issues

Pipe stages by clock cycle:

Type                1    2    3    4    5    6    7
ALU instruction     IF   ID   EX   MEM  WB
Load instruction    IF   ID   EX   MEM  WB
ALU instruction          IF   ID   EX   MEM  WB
Load instruction         IF   ID   EX   MEM  WB
ALU instruction               IF   ID   EX   MEM  WB
Load instruction              IF   ID   EX   MEM  WB

Static Multiple Issue: Datapath

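[Figure: static two-issue datapath with an instruction memory (IM), a register file, and two ALUs, one for the ALU/branch instruction and one for the lw/sw instruction.]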

Example: Multiple Issue code scheduling

• Original loop:

    Loop: lw   $t0, 0($s1)
          addu $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop

• After reordering the instructions based on dependencies, 5 instructions issue in 4 clock cycles, so CPI = 0.8 (IPC = 1.25)

• The sw offset becomes 4($s1) because addi has already decremented $s1 by the time the store issues

          ALU/branch              lw/sw               Clock cycle
    Loop:                         lw $t0, 0($s1)      1
          addi $s1, $s1, -4                           2
          addu $t0, $t0, $s2                          3
          bne $s1, $zero, Loop    sw $t0, 4($s1)      4
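The CPI and IPC quoted for this schedule can be derived by counting filled issue slots. The sketch below is only an illustration: it represents each clock cycle as a pair (ALU/branch slot, lw/sw slot) using bare mnemonics, and the data layout is an assumption, not something taken from the slides.

```python
# One tuple per clock cycle: (ALU/branch slot, lw/sw slot); None = empty slot.
schedule = [
    (None,   "lw"),
    ("addi", None),
    ("addu", None),
    ("bne",  "sw"),
]

cycles = len(schedule)
instructions = sum(slot is not None for packet in schedule for slot in packet)

cpi = cycles / instructions   # clock cycles per instruction
ipc = instructions / cycles   # instructions per clock cycle
print(f"{instructions} instructions in {cycles} cycles: CPI = {cpi}, IPC = {ipc}")
```

Three of the eight issue slots are empty, which is why the IPC is 1.25 rather than the peak of 2.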

Loop Unrolling: 4 Iterations

• Multiple copies of the loop body are made, which exposes more ILP by overlapping instructions from different iterations

• Each copy uses its own register ($t0 to $t3), so each store writes back the value loaded for its own element

• 14 instructions issue in 8 clock cycles: CPI = 8/14 = 0.57

          ALU/branch              lw/sw               Clock cycle
    Loop: addi $s1, $s1, -16      lw $t0, 0($s1)      1
                                  lw $t1, 12($s1)     2
          addu $t0, $t0, $s2      lw $t2, 8($s1)      3
          addu $t1, $t1, $s2      lw $t3, 4($s1)      4
          addu $t2, $t2, $s2      sw $t0, 16($s1)     5
          addu $t3, $t3, $s2      sw $t1, 12($s1)     6
                                  sw $t2, 8($s1)      7
          bne $s1, $zero, Loop    sw $t3, 4($s1)      8
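One way to convince yourself that an unrolled schedule is correct is to run it against the original loop on a small array. The sketch below is a hand translation into Python, not a MIPS simulator. It assumes that each of the four copies keeps its value in its own register ($t0 to $t3) and that each store writes back the register loaded for its element, and that the lw issued together with addi in cycle 1 still sees the old value of $s1. The memory layout and the test values are hypothetical.

```python
def original_loop(mem, s1, s2):
    """Reference loop: one array element per iteration."""
    while True:
        t0 = mem[s1]        # lw   $t0, 0($s1)
        t0 = t0 + s2        # addu $t0, $t0, $s2
        mem[s1] = t0        # sw   $t0, 0($s1)
        s1 = s1 - 4         # addi $s1, $s1, -4
        if s1 == 0:         # bne  $s1, $zero, Loop
            return mem


def unrolled_loop(mem, s1, s2):
    """Four elements per iteration, in the order of the 8-cycle schedule."""
    while True:
        t0 = mem[s1]        # cycle 1: lw $t0, 0($s1), reads the old $s1
        s1 = s1 - 16        # cycle 1: addi $s1, $s1, -16
        t1 = mem[s1 + 12]   # cycle 2: lw $t1, 12($s1)
        t0 = t0 + s2        # cycle 3: addu $t0, $t0, $s2
        t2 = mem[s1 + 8]    # cycle 3: lw $t2, 8($s1)
        t1 = t1 + s2        # cycle 4: addu $t1, $t1, $s2
        t3 = mem[s1 + 4]    # cycle 4: lw $t3, 4($s1)
        t2 = t2 + s2        # cycle 5: addu $t2, $t2, $s2
        mem[s1 + 16] = t0   # cycle 5: sw $t0, 16($s1)
        t3 = t3 + s2        # cycle 6: addu $t3, $t3, $s2
        mem[s1 + 12] = t1   # cycle 6: sw $t1, 12($s1)
        mem[s1 + 8] = t2    # cycle 7: sw $t2, 8($s1)
        mem[s1 + 4] = t3    # cycle 8: sw $t3, 4($s1)
        if s1 == 0:         # cycle 8: bne $s1, $zero, Loop
            return mem


def make_memory(words):
    """Words stored at byte addresses 4, 8, ..., 4 * words."""
    return {4 * i: i for i in range(1, words + 1)}


WORDS = 8   # must be a multiple of 4 for the unrolled version
S2 = 10     # value added to every element

expected = original_loop(make_memory(WORDS), 4 * WORDS, S2)
actual = unrolled_loop(make_memory(WORDS), 4 * WORDS, S2)
print(actual == expected)

cpi = 8 / 14   # 8 clock cycles for 14 instructions
print(round(cpi, 2))
```

The unrolled version only works when the element count is a multiple of four; a real compiler would add cleanup code for the leftover iterations.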

Dynamic Multiple-Issue Processors

• Instructions are issued in order, and the processor decides whether zero, one, or more instructions can issue in a given clock cycle

• Again, achieving good performance requires the compiler to schedule instructions so that dependent instructions are moved apart, which improves the instruction issue rate

Dynamic Scheduling: Definition

• Dynamic pipeline scheduling goes past stalls to find later instructions to execute while waiting for the stall to be resolved

• Chooses which instruction to execute next by reordering the instructions to avoid stalls (dynamic issue decisions)

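• Example: sub does not depend on lw, so it can execute while addu waits for the loaded value of $t0

    lw   $t0, 20($s2)
    addu $t1, $t0, $s2
    sub  $s4, $s4, $t3
    slti $t5, $s4, 20
    bne  $s1, $zero, Loop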
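The instructions that a dynamic scheduler may run ahead are the ones with no outstanding dependence on a stalled instruction. The sketch below finds read-after-write (RAW) dependences in the example sequence. It is a simplification that assumes registers are the only source of dependences: it ignores memory dependences, WAR/WAW hazards, and branch effects, and the tuple encoding of each instruction is an illustrative choice.

```python
# (text, destination register or None, source registers)
program = [
    ("lw   $t0, 20($s2)",     "$t0", ["$s2"]),
    ("addu $t1, $t0, $s2",    "$t1", ["$t0", "$s2"]),
    ("sub  $s4, $s4, $t3",    "$s4", ["$s4", "$t3"]),
    ("slti $t5, $s4, 20",     "$t5", ["$s4"]),
    ("bne  $s1, $zero, Loop", None,  ["$s1", "$zero"]),
]


def raw_dependences(program):
    """Map each instruction index to the indices of the instructions it reads from."""
    last_writer = {}  # register -> index of the latest instruction writing it
    deps = {}
    for i, (_, dest, sources) in enumerate(program):
        deps[i] = {last_writer[r] for r in sources if r in last_writer}
        if dest is not None:
            last_writer[dest] = i
    return deps


deps = raw_dependences(program)
for i, (text, _, _) in enumerate(program):
    waits_on = sorted(deps[i])
    print(f"{i}: {text:24s} waits on {waits_on if waits_on else 'nothing'}")
```

Here addu waits on lw and slti waits on sub, while sub waits on nothing earlier in the sequence, so it is free to execute during the load stall.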

HW Schemes: Why?

• Why in HW at run time?

– Works when real dependences cannot be known at compile time

– The compiler can be simpler

– Code compiled for one machine runs well on another

Dynamic Pipeline Scheduling: Model

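[Figure: dynamic pipeline scheduling model. The instruction fetch and decode unit issues in order to a set of reservation stations, one per functional unit (integer, FP, lw/sw, and others). The functional units execute out of order, and their results pass to the reorder buffer and commit unit, which commits in order.]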

HW Units: Working

• The instruction fetch/decode unit fetches instructions, decodes them, and sends each instruction to the corresponding functional unit of the execute stage

• There are 5-10 functional units, each with buffers called reservation stations that hold the operands and the operation

• As soon as the buffer contains all the operands and the functional unit is ready, the operation executes and the result is calculated

• The commit unit decides when it is safe to put the result into the register file or, for a store, into memory

Dynamic scheduling: in-order completion

• To make programs behave as if they were running on a non-pipelined computer, the instruction fetch and decode unit is required to issue instructions in order, and the commit unit is required to write results to registers and memory in program order (in-order completion)

• Hence, if an exception occurs, the computer can point to the last instruction executed, and the only registers updated will be those written by instructions before the one that caused the exception
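In-order completion can be pictured as a queue in program order from which only the oldest instruction may leave. The sketch below is a toy model, assuming each instruction has a known cycle in which it finishes executing and that any number of finished instructions may commit in the same cycle; the function name and the finish times are hypothetical.

```python
from collections import deque


def commit_order(finish_cycle):
    """Simulate in-order commit.

    finish_cycle[i] is the cycle (1 or later) in which instruction i,
    numbered in program order, finishes executing.
    Returns a list of (commit cycle, instruction index) pairs.
    """
    reorder_buffer = deque(range(len(finish_cycle)))  # oldest instruction first
    commits = []
    cycle = 0
    while reorder_buffer:
        cycle += 1
        # Only the oldest instruction may commit, and only once it has finished.
        while reorder_buffer and finish_cycle[reorder_buffer[0]] <= cycle:
            commits.append((cycle, reorder_buffer.popleft()))
    return commits


# Hypothetical finish times: instruction 0 (say, a load) finishes late,
# while instructions 1 and 2 finish earlier, out of program order.
finish = [4, 2, 3, 5]
commits = commit_order(finish)
print(commits)
```

Instructions 1 and 2 finish first but their results are held until instruction 0 commits, so the register file and memory are always updated in program order.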

Dynamic scheduling: Speculation

• Speculative execution: dynamic scheduling can be combined with branch prediction, so after a mispredicted branch the commit unit must be able to discard all the results in the execution unit that follow the branch

• Dynamic scheduling can also be combined with superscalar execution, so each unit may be committing 4 to 6 instructions per cycle

Superscalar Processor

Conclusion: Several Steps ILP Exploitation
