Linear Pipeline Collision Vector Analysis


  • 7/29/2019 Linear Pipeline Collision Vector Analysis

    1/23

    Pipeline Hazards Architecture of Parallel Computers 1

    Collision Analysis

    Assume we could implement an on-chip cache and get the cache access time down to 1 clock, but implement it as a unified cache.

    Our new pipeline is:

    Our new reservation table is:

    Clock       1  2  3  4  5  6  7
    Memory Op   X        X     X
    Inst Dec.      X
    Addr Gen          X
    Execute                X
    Update PC                    X

    And the serial execution time is 7 x 5 ns = 35 ns.

    How often can we initiate an instruction with this configuration?

    Instruction Decode -- 5 ns

    Address Generate -- 5 ns

    Operand Fetch -- 5 ns

    Execute -- 5 ns

    Operand Store -- 5 ns

    Update Program Counter -- 5 ns

    Instruction Fetch -- 5 ns


    1997, 1999 E.F. Gehringer, G.Q. Kenney CSC 506, Summer 1999 2

    The Collision Vector

    As the pipeline becomes more complicated, we can use a collision vector to analyze the pipeline and control initiation of execution. The collision vector is a method of analyzing how often we can initiate a new operation into the pipeline and maintain synchronous flow without collisions.

    We construct the collision vector by overlaying two copies of the reservation table, successively shifting one copy one clock to the right, and recording whether or not a collision occurs at each step. If a collision occurs, record a 1 bit; if a collision does not occur, record a 0 bit.

    For example, our reservation table would result in the following collision vector:

    Collision vector = 011010
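The overlay-and-shift construction can be sketched in a few lines of Python. This is an illustration, not part of the original notes; the stage-to-clock assignments below are assumptions consistent with the unified-cache pipeline (the Memory Op unit busy on clocks 1, 4, and 6) and with the vector 011010.

```python
# Reservation table: stage name -> set of clocks on which the stage is busy.
# (Assumed placement: Memory Op handles Inst Fetch, Operand Fetch, Op Store.)
table = {
    "Memory Op": {1, 4, 6},
    "Inst Dec.": {2},
    "Addr Gen":  {3},
    "Execute":   {5},
    "Update PC": {7},
}

def collision_vector(table, length):
    """Bit i (left to right, i = 1..length-1) is 1 if initiating a new
    operation i clocks after the previous one collides in some stage."""
    bits = []
    for shift in range(1, length):
        collides = any(
            (clock + shift) in clocks
            for clocks in table.values()
            for clock in clocks
        )
        bits.append("1" if collides else "0")
    return "".join(bits)

print(collision_vector(table, 7))  # → 011010, matching the text
```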

    Using the collision vector, we construct a reduced state diagram to tell us when we can initiate new operations.

    The Reduced State Diagram

    The reduced state diagram is a way to determine when we can initiate a new operation into the pipeline and avoid collisions when some operations are already in process in the pipeline.


    Steps to create the reduced state diagram:

    Shift the collision vector left one position, filling in a 0 at the right end.

    If the left-most bit shifted out is a 1, you cannot initiate a new operation into the pipeline.

    If the left-most bit shifted out is a 0, you can initiate a new operation into the pipeline. Create a new state with a collision vector that is the shifted collision vector ORed with the original pipeline collision vector.

    Draw an arc to the new collision vector and label it with the number of shifts from the previous vector.

    Following is the resulting reduced state diagram:

    Note: Some texts reverse this notation, building the collision vector from right to left and shifting the vector right to determine when to initiate a new operation.

    [State diagram: initial state 011010. Latency 1 leads to state 111110, and latency 4 leads to state 111010, which loops back to itself on latency 4. Both states return to 011010 on latency 6.]
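The shift-and-OR procedure above can also be sketched as a small Python program. This is an illustrative reconstruction, not from the original notes; it follows the notes' convention of numbering bits left to right and shifting left with 0 fill.

```python
def successors(state, initial, n):
    """All (latency, next-state) arcs out of `state` for latencies 1..n."""
    arcs = []
    for latency in range(1, n + 1):
        if (state >> (n - latency)) & 1:     # bit `latency` is 1: collision
            continue
        shifted = (state << latency) & ((1 << n) - 1)  # shift left, keep n bits
        arcs.append((latency, shifted | initial))      # OR in the initial CV
    return arcs

def reduced_state_diagram(cv):
    """Map each reachable state (as a bit string) to its outgoing arcs."""
    n = len(cv)
    initial = int(cv, 2)
    table, work = {}, [initial]
    while work:
        s = work.pop()
        if s in table:
            continue
        table[s] = successors(s, initial, n)
        work.extend(t for _, t in table[s])
    return {format(s, f"0{n}b"): [(l, format(t, f"0{n}b")) for l, t in arcs]
            for s, arcs in table.items()}

for state, arcs in reduced_state_diagram("011010").items():
    print(state, "->", arcs)
```

Running this on 011010 reproduces the three states of the diagram above (011010, 111110, 111010) and their arcs.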


    The reduced state diagram tells us that we can initiate a new operation into the pipeline one cycle after we initiated one in an empty pipe. However, this brings us to a state from which we cannot safely initiate another operation until 6 more clock periods have passed.

    Since we can initiate a second instruction on the next clock period but must wait six clock periods before we can initiate another instruction, we can initiate only two instructions every seven clock periods. We get only 2/7 of 100% efficiency (ideal speedup of 7), so our speedup is only 2 for the seven-stage pipeline.

    An alternative would be to wait 4 cycles after the initial initiation, and then initiate a new operation every 4 cycles. But this would give us a speedup of only 7(0.25) = 1.75.


    Improving the speedup

    One way to improve this situation is to insert delays at appropriate points in the pipeline. Stone goes to great lengths to analyze where to insert the delays. As an example, if we add a delay in the pipeline after the Execute stage, we get:

    Clock       1  2  3  4  5  6  7  8
    Memory Op   X        X        X
    Inst Dec.      X
    Addr Gen          X
    Execute                X
    Delay                     X
    Update PC                       X

    And our new collision vector is:

    Collision vector = 0010010


    The new reduced state diagram follows.

    Note that all states have an arc back to the beginning state with 7 clocks, in addition to those noted.

    We can now look for movements from state to state that would improve our pipeline speedup. If we took the greedy cycle, we could initiate 3 operations out of every 9 cycles for a speedup of (3/9) × 7 = 2.33. However, if we did not take the first possible initiation and waited 2 cycles, we would get into the 2, 5, 2 cycle and also initiate an operation 3 out of every 9 cycles. There appears to be one other 3-out-of-9 cycle, but none better.
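As a rough check of the greedy-cycle claim, the greedy policy (always take the first permissible initiation) can be simulated directly from the collision vector. This is an illustrative sketch, not part of the original notes, and the greedy cycle is not always the optimal one.

```python
def greedy_cycle(cv):
    """Follow the greedy policy (always the smallest permissible latency)
    from the initial state and return the latency cycle it settles into."""
    n = len(cv)
    initial = int(cv, 2)
    state, seen, lats = initial, {}, []
    while state not in seen:
        seen[state] = len(lats)
        # First 0 bit (left to right) = smallest collision-free latency.
        lat = next(i for i in range(1, n + 1)
                   if (state >> (n - i)) & 1 == 0)
        state = ((state << lat) & ((1 << n) - 1)) | initial
        lats.append(lat)
    return lats[seen[state]:]

print(greedy_cycle("0010010"))  # → [1, 1, 7]: 3 initiations every 9 clocks
print(greedy_cycle("011010"))   # → [1, 6]: 2 initiations every 7 clocks
```

The first result matches the 2.33 speedup computed above for the delayed pipeline; the second matches the speedup of 2 found earlier for the undelayed one.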

    [State diagram: initial state 0010010. From 0010010: latency 1 → 0110110, latency 2 → 1011010, latency 4 → 0110010, latency 5 → 1010010. From 0110110: 1 → 1111110, 4 → 1110010. From 1011010: 2 → 1111010, 5 → 1010010. From 1110010: 4 → 0110010, 5 → 1010010. From 1111010: 5 → 1010010. From 1010010: 2 → 1011010, 4 → 0110010, 5 → 1010010 (self-loop). From 0110010: 1 → 1110110, 4 → 0110010 (self-loop), 5 → 1010010. From 1110110: 4 → 1110010. All states return to 0010010 on latency 7.]


    Other Pipeline Hazards

    Pipeline collisions occur when there is contention for shared hardware that is needed by more than one stage of a pipeline. Potential collisions prevent us from initiating (and thus completing) a new operation every clock period, and so slow down the effective execution rate of a processor.

    Other hazards that can prevent us from completing an instruction every clock period are:

    Conditional Branches

    Data dependencies

    Conditional Branches (Jumps)

    A conditional branch changes the location from which we are fetching instructions. A conditional branch instruction must execute before we know which location to fetch subsequent instructions from.

    Example Instruction Stream

    ------- ; Instruction

    Cmp A, B ; Compare A to B

    BE NewLoc ; Branch on condition code = 0 to NewLoc

    ------- ; Instruction

    ------- ; Instruction

    ------- ; Next Sequential Instruction (NSI)

    ------- ; Instruction

    NewLoc ------- ; Instruction

    ------- ; Instruction


    Reservation Table Analysis

    Assume we have the following reservation table:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  X    X
    Inst Dec.             X
    Addr Gen                   X
    Data Fetch                      X    X
    Execute                                 X
    Op Store                                      X

    We can show successive instruction execution through the pipeline by indicating the instruction in each cell. Here, I will use:

    CC to indicate the instruction that sets the condition code.

    BR to indicate the branch condition instruction.

    NSI to indicate the next sequential instruction after the branch.

    2SI to indicate the 2nd sequential instruction after the branch, etc.

    BT to indicate the branch target instruction.

    Following would be the instruction sequence for a branch not taken:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
    Inst Dec.             CC        BR        NSI       2SI       3SI
    Addr Gen                   CC        BR        NSI       2SI       3SI
    Data Fetch                      CC   CC   BR   BR   NSI  NSI  2SI  2SI
    Execute                                 CC        BR        NSI
    Op Store                                      CC        BR        NSI


    Following would be the instruction sequence for a branch taken:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BR   BR   NSI  NSI  2SI  2SI  3SI  3SI  BT   BT
    Inst Dec.             CC        BR        NSI       2SI       wait
    Addr Gen                   CC        BR        NSI       wait      wait
    Data Fetch                      CC   CC   BR   BR   NSI  NSI  wait wait
    Execute                                 CC        BR        wait
    Op Store                                      CC        BR        wait

    We have taken a penalty of 6 clock cycles because we assumed that we were going to be executing sequential instructions. We started these instructions into the pipeline, only to find that we had to abort executing them because the conditional branch was taken.

    The assumption here is that we know the outcome of the branch instruction at the end of its execute cycle, and so we can stop further execution of the sequential instructions following the branch. The new program counter gets sent to the Instruction Fetch unit during the Operand Store cycle of the branch instruction, so it can begin to fetch the branch target instruction and succeeding instructions on the next cycle.

    Reducing Branch Penalties

    We can use several methods to reduce the effects of branching:

    Delayed Branch Instruction

    Multiple Condition Codes (discussed with data dependencies)

    Branch Prediction with and without Branch History

    Speculative Execution


    Delayed Branch Instruction

    We can push some of the problem back on the programmer (or compiler) by designing a new branch instruction that telegraphs an intent to branch:

    Branch Condition after executing the Next Sequential Instruction.

    The instruction sequence for a branch not taken, using this new branch instruction (BA), is identical to that for a normal branch:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
    Inst Dec.             CC        BA        NSI       2SI       3SI
    Addr Gen                   CC        BA        NSI       2SI       3SI
    Data Fetch                      CC   CC   BA   BA   NSI  NSI  2SI  2SI
    Execute                                 CC        BA        NSI
    Op Store                                      CC        BA        NSI

    However, the instruction sequence for a branch taken, using the new branch-after-next-instruction, would save us two clocks:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BA   BA   NSI  NSI  2SI  2SI  3SI  3SI  BT   BT
    Inst Dec.             CC        BA        NSI       2SI       wait
    Addr Gen                   CC        BA        NSI       wait      wait
    Data Fetch                      CC   CC   BA   BA   NSI  NSI  wait wait
    Execute                                 CC        BA        NSI
    Op Store                                      CC        BA        NSI

    Our penalty is now only 4 clock cycles instead of 6, because we followed through and completed execution of NSI (per the definition of the delayed branch instruction). We had to abort only 2SI and beyond as a result of the conditional branch taken.


    Branch Prediction

    We can make a better guess about whether or not a branch will be taken, rather than just always assuming it will not be taken.

    Assume that a special end-of-loop branch instruction is usually taken.

    Assume that a branch to a location earlier in the code will usually be taken.

    Keep a history table of how this particular branch instruction behaved in the recent past.

    Some processors define special instructions to be used to terminate a loop.

    For example, BXLE (branch on index low or equal) combines decrementing an index register with a branch on condition. The processor can safely assume that whenever it fetches a BXLE instruction, the branch will normally be taken. This can be determined back at the Instruction Decode step. Note that the unconditional branch is a special case of this, in that it will always be taken.

    A conditional branch to an earlier address can be determined at the Address Generate stage.

    However, note that we are making an educated guess. Even when we guess correctly, we take some penalty. The instruction sequence for a branch taken, when we predict that it will be taken:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  CC   CC   BR   BR   NSI  NSI  BT   BT   NT   NT   2T   2T
    Inst Dec.             CC        BR        wait      BT        NT
    Addr Gen                   CC        BR        wait      BT        NT
    Data Fetch                      CC   CC   BR   BR   wait wait BT   BT
    Execute                                 CC        BR        wait
    Op Store                                      CC        BR        wait


    Branch History

    Rather than depending on special instructions and branch target locations, we can keep a history of how each particular branch instruction behaved in the recent past, and assume that it will continue to behave that way in the future. Some implementations are:

    The branch-history table (Stone page 196):

    The instruction fetch unit searches a branch-history table (BHT), similar to a TLB, on every instruction fetch. If we have a hit, use the corresponding address in the BHT for the next fetch instead of the real NSI.

    At the execute stage of the branch, update the BHT with the actual target (NSI or BT).

    Of course, we need to keep track of which way we predicted, and abort instructions on mispredictions.
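A toy sketch of the branch-history-table idea may help. The dictionary structure and the addresses below are hypothetical illustrations, not Stone's exact design (a real BHT is a small associative hardware structure, not a software map):

```python
bht = {}  # branch address -> address the branch last transferred control to

def predict(fetch_addr, default_next):
    """At instruction fetch: on a BHT hit, fetch from the recorded target
    instead of the real next sequential instruction."""
    return bht.get(fetch_addr, default_next)

def resolve(branch_addr, actual_target):
    """At the execute stage of the branch: record where it actually went."""
    bht[branch_addr] = actual_target

# E.g. a loop-closing branch at (hypothetical) address 0x100 targeting 0x40:
resolve(0x100, 0x40)
print(hex(predict(0x100, 0x104)))  # hit: predict the recorded target, 0x40
print(hex(predict(0x200, 0x204)))  # miss: fall through to NSI, 0x204
```

Mispredictions are handled exactly as the text says: the prediction is compared against the resolved target at execute time, and the wrongly fetched instructions are aborted.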

    Decode-history table (similar to Stone page 196):

    The instruction decode unit searches a decode-history table (DHT) when it encounters a conditional branch instruction. If we have a hit, redirect the instruction fetch unit to abort NSI and give it the BT address from the branch instruction.

    At the execute stage of the branch, add (or keep) a DHT entry for this branch when the branch is taken. Delete the DHT entry for this branch (if it exists) when the branch is not taken.

    Note that we always abort the prefetch of NSI on predicted taken branches.


    Extra bits in the Instruction Cache

    For a processor with a fixed-length instruction set and a Harvard cache, we can organize the instruction cache so that we add an extra bit or two to each instruction (in the cache) and use them to keep a history on branch instructions. This works the same as the decode-history table, without the time and logic for the lookup.

    When a cache line is loaded from main memory, all branch indicator bits (BIB) for the line are set to 00.

    When a branch is taken, increment the corresponding BIB.

    When a branch is not taken, decrement the corresponding BIB.

    When the instruction is fetched:

    Use NSI if the BIB is 00 or 01.

    Use BT if the BIB is 10 or 11.
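The four-state scheme above is a 2-bit saturating counter. A minimal sketch (illustrative only; function names are mine, not from the notes):

```python
def update(bib, taken):
    """2-bit saturating counter: count up on taken, down on not taken."""
    return min(bib + 1, 3) if taken else max(bib - 1, 0)

def use_branch_target(bib):
    """Fetch from BT when the counter is 10 or 11; otherwise use NSI."""
    return bib >= 2

bib = 0b00  # cache line just loaded from memory: strongly not taken
for taken in [True, True, False, True]:
    bib = update(bib, taken)
print(format(bib, "02b"), use_branch_target(bib))  # → 10 True (weakly taken)
```

The saturation at 00 and 11 is what makes the predictor tolerate a single anomalous outcome (e.g. the final fall-through of a loop) without immediately flipping its prediction.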

    [Figure: an instruction cache in which each instruction carries a 2-bit branch indicator; the branch entries shown hold values 01, 11, 10, and 00, while other instructions hold 00. Legend: 00 = strongly not taken, 01 = weakly not taken, 10 = weakly taken, 11 = strongly taken.]


    Speculative Execution

    The brute-force approach: provide enough logic in the processor to:

    Replicate the first several stages of the pipeline.

    Always follow both paths of execution (branch taken and branch not taken).

    When the outcome of the branch is known, discard the intermediate results of the wrong path(s) and continue execution with the correct path.

    For deep pipelines, the processor must be prepared to follow several paths in order to keep things moving along.

    Stone (page 197) says that these mechanisms have not been widely used in practice (as of 1986). In fact, they have since become very popular as a way to speed up execution of modern processors.

    Note: some literature defines speculative execution to mean performing any processing steps before you know the outcome of a conditional branch. That is, if there is any chance that you may need to discard the intermediate results of an instruction, it is defined as speculative execution. We will not use this definition.


    Data Dependencies

    An instruction may be stalled in the pipeline because it needs data that has not yet been produced by a prior instruction that is still in the pipeline.

    The data dependencies among instructions can take the following forms:

    READ/READ: one instruction reads a data item and a following instruction reads the same data item.

    READ/WRITE: one instruction reads a data item and a following instruction writes that same data item.

    WRITE/READ: one instruction writes a data item and a following instruction reads that same data item.

    WRITE/WRITE: one instruction writes a data item and a following instruction writes that same data item.

    The READ/READ combination is not a problem with pipelines because the data item does not change. However, the other three combinations can all produce invalid results unless we detect and interlock on them. We deal first with the WRITE/READ combination, and defer the others to a later discussion on superpipelined machines.
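The four combinations can be sketched as a small classifier over the register sets each instruction reads and writes. This is an illustration only; the register names below come from the WRITE/READ example that follows.

```python
def dependencies(first, second):
    """first/second: (reads, writes) register sets, in program order.
    Returns every dependence class the pair exhibits."""
    r1, w1 = first
    r2, w2 = second
    deps = []
    if r1 & r2: deps.append("READ/READ")    # harmless in a pipeline
    if r1 & w2: deps.append("READ/WRITE")   # anti-dependence
    if w1 & r2: deps.append("WRITE/READ")   # true dependence
    if w1 & w2: deps.append("WRITE/WRITE")  # output dependence
    return deps

# R2 <- R3 + R4 followed by R5 <- R2 + R4: a WRITE/READ hazard on R2
s2 = ({"R3", "R4"}, {"R2"})
u2 = ({"R2", "R4"}, {"R5"})
print(dependencies(s2, u2))  # → ['READ/READ', 'WRITE/READ']
```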

    WRITE/READ

    Consider the following sequence of instructions:

    -------         ; Instruction
    -------         ; Instruction
    -------         ; Instruction
    R2 ← R3 + R4    ; Store Register 2
    R5 ← R2 + R4    ; Use Register 2


    The reservation table, where S2 is the instruction that stores a new value into register 2, and U2 is the instruction that uses the new value in register 2:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI
    Inst Dec.             S2        U2        NSI       2SI
    Addr Gen                   S2        U2        NSI       2SI
    Data Fetch                      S2   S2   wait wait U2   U2   NSI  NSI
    Execute                                 S2                  U2
    Op Store                                      S2                  U2

    The data fetch unit must detect that the value in register 2 that it needs is pending update by a prior instruction that has not yet completed. It must wait until the new value has been stored into register 2 by the Operand Store unit. The penalty is the 2 cycles that we had to stall the pipeline.

    Internal Forwarding and Register Renaming

    A way to reduce the penalty due to data dependencies is to forward the results of a computation directly to the data fetch unit or to the execute unit, rather than waiting for the data to be stored into the proper register.

    If we forward the results of the addition in instruction S2 to the data fetch unit, we reduce the data interlock penalty to one cycle.

    If we forward the results directly to the execute unit, we can eliminate the penalty altogether.

    The data is really available when we need it; it is just not in the right place. We rename the input register for the next operation from register R2 to the register where the computation results will appear. Note that the Operand Store unit still needs to put the results into register 2 as well.
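A back-of-the-envelope sketch of the three cases may help. The clock numbers are taken from the reservation tables in this discussion (U2 would naturally begin its Data Fetch in clock 7); the dictionary keys are just descriptive labels.

```python
# Clock at which the new R2 value becomes usable by U2's data fetch:
r2_ready = {
    "no forwarding":         9,  # only after S2's Operand Store in clock 8
    "forward to data fetch": 8,  # as soon as S2's Execute result (clock 7) exists
    "forward to execute":    7,  # Execute receives it directly; U2 never waits
}

u2_fetch_start = 7  # U2's natural Data Fetch clock

for scheme, ready in r2_ready.items():
    stall = max(0, ready - u2_fetch_start)
    print(f"{scheme}: {stall} stall cycle(s)")
```

This reproduces the penalties shown in the tables: 2 stall cycles with no forwarding, 1 with forwarding to the data fetch unit, and 0 with forwarding directly to the execute unit.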


    The new reservation table if we forward the results to the data fetch unit:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI  4SI
    Inst Dec.             S2        U2        NSI       2SI       3SI
    Addr Gen                   S2        U2        NSI       2SI
    Data Fetch                      S2   S2   wait U2   U2   NSI  NSI  2SI
    Execute                                 S2             U2        NSI
    Op Store                                      S2             U2

    The reservation table if we forward the results directly to the Execute unit:

    Clock       1    2    3    4    5    6    7    8    9    10   11   12
    Inst Fetch  S2   S2   U2   U2   NSI  NSI  2SI  2SI  3SI  3SI  4SI  4SI
    Inst Dec.             S2        U2        NSI       2SI       3SI
    Addr Gen                   S2        U2        NSI       2SI       3SI
    Data Fetch                      S2   S2   U2   U2   NSI  NSI  2SI  2SI
    Execute                                 S2        U2        NSI
    Op Store                                      S2        U2        NSI

    The Condition Code Dependency

    Another type of data dependency is that between an instruction that generates a condition code setting and the branch instruction that uses the condition code.

    Internal forwarding can again be used to reduce or eliminate delays.

    Another variant of the branch after NSI, called multiple condition codes, puts the problem back on the programmer (or compiler). Multiple condition codes make it easier for the programmer to have intervening instructions between the instruction that generates the CC and the branch instruction that uses it.


    Superscalar Architectures

    Up to now, we have been discussing computer architectures with a single pipeline for processing instructions. The objective was to complete one instruction per clock period by breaking the instructions into (approximately) equal pieces of work and pipelining them through the processor in serial fashion. However, all of the hazards prevent us from ever achieving a processing rate of 1 instruction per clock.

    Given the circuit density we have today, we can replicate many of the pipeline units and process instructions in parallel, so long as we ensure that we produce results that are indistinguishable from those obtained if we executed the code in a strictly sequential fashion.

    This brings us back to data dependencies. We must now consider the READ/WRITE and WRITE/WRITE sequences, because one instruction may get ahead of another through the parallel pipelines.

    [Figure: a superscalar organization. An I-cache supplies instructions to four parallel pipelines, each with Decode, Op Fetch, Execute, and Store Results stages; three of the execute units are fixed point and one is floating point.]


    Consider the following sequence of instructions:

        -------         ; Instruction
        -------         ; Instruction
        -------         ; Instruction
        R3 ← R2 + R4    ; Use Register 2
        R2 ← R5 + R4    ; Store Register 2

    We must interlock on register 2 to ensure that the new value (R5 + R4) does not get stored into it before we obtain the old value to add to R4.

    And the following sequence of instructions:

        -------         ; Instruction
        -------         ; Instruction
        -------         ; Instruction
        Cmp A, B        ; Compare A to B
        BE NewLoc       ; Possible branch
        R2 ← R5 + R4    ; Store Register 2
        R2 ← R3 + R4    ; Store Register 2

    We must ensure that the second value of R2 gets stored if the branch is not taken.


    Extra Internal Registers

    When we have multiple pipelines and speculative execution in the processor, it is beneficial to have several extra sets of registers to keep intermediate results.

    Several paths are being followed due to speculative execution.

    Parallel execution is proceeding along each serial path.

    Many intermediate results are being forwarded to other instructions.

    Many tentative final results must be held until the final outcome is known.

    Retiring Instructions

    When the final outcome of a series of branches and data dependencies is known, the winning instruction is retired.

    Its tentative results are marked final.

    Any data in a renamed register is stored into the real named register.

    All other tentative instructions and results (the losers) are discarded, and any resources held are made available for processing new instructions.

    Only the retired instructions count toward the processing rate (the MIPS) of the processor.

    The objective of the computer architect is to retire more than one instruction per clock period.


    CISC versus RISC (Stone page 210)

    CISC: Complex Instruction Set Computer

    RISC: Reduced Instruction Set Computer

    CISC Architectures

    Traditional processor architectures (e.g. IBM S/360, Intel 8086) use variable-length instructions and provide variations on basic instructions with several addressing modes.

    8086 example:

    Instructions can vary in length from 1 to 12 bytes long.

    There are 14 variations of the integer ADD instruction.

    There are 14 variations of the integer ADD with Carry instruction.

    There are 14 variations of the integer SUB (subtract) instruction.

    There are about 100 different instructions.

    There are four different prefixes that can modify instructions.

    This gives a lot of flexibility to the programmer and compiler writer, but causes many problems for the computer architect.


    RISC architectures

    RISC architectures attempted to make life easy for the computer architect by drastically simplifying the instruction set.

    John Cocke (IBM) reasoned that only compilers generate machine code, and so making life easier for the assembly language programmer should not be an objective.

    Example:

    Make all instructions four bytes long and aligned on a word boundary.

    Make lots of general-purpose registers so that most intermediate data can be held in the fast processor storage.

    Make all arithmetic instructions register-to-register addressing only.

    All instructions execute in a single clock.

    Add instructions to help the CPU architect make a fast processor.

    Over time, the CISC architectures have adopted RISC techniques and the RISC architectures have added CISC instructions.

    Today, the only real difference between the two is that CISC processors still have variable-length instructions and RISC processors have fixed-length instructions.


    Superpipelined Architecture (Stone page 218)

    In the discussion on superscalar architectures, Stone describes a superpipelined architecture as one where the internal clock for issuing instructions is N times faster than the main clock.

    Virtually all processors today are superpipelined: the internal clock runs faster than the external bus clock.

    VLIW: Very Long Instruction Word Architecture (Stone page 219)

    VLIW is typically called microcode, and the machine architectures are not general-purpose. They may be used in graphics processors, hard disk controllers, or other dedicated function units.

    The advantage of a VLIW architecture is that the fields of the instruction directly control the hardware latches and gates, and thus can directly perform multiple functions in parallel. Normally, engineers program the microcontrollers, and the programs are relatively short.

    VLIW microcontrollers were formerly used to implement the complex instructions of CISC-architecture machines.