Superpipelining

CS-421 Parallel Processing BE (CIS) Batch 2004-05 Handout_9

Beyond Simple Pipelining

Although instruction pipelining achieves ILP (Instruction Level Parallelism) i.e., parallelism

among instructions by overlapping different phases of instructions resulting in CPIideal = 1, this

unit CPI is unachievable in practice due to hazards in practical programs.

There are many architectural techniques in vogue that can be employed to push the performance

beyond unit CPI (i.e. CPI < 1 supporting execution of multiple instructions per cycle ) and extract

more ILP from programs. A brief description of these techniques follows:

1. Superpipelining

Superpipelining is the breaking of the stages of a given pipeline into smaller stages (thus making the pipeline deeper) in an attempt to shorten the clock period and thereby enhance instruction throughput by keeping more instructions in flight at a time.

In a simple scalar pipeline the clock period is dictated by the slowest, most time-consuming stage in the system. It is often the case that the slower and more complex operations occurring in a stage can be broken down further into simpler tasks. For example, the instruction-fetch stage and the data-memory access stage are generally the most time-consuming in any pipeline and can be broken down into smaller steps. The execute stage may likewise be broken down into two or more smaller steps depending upon the type of operation performed. If each of these smaller steps is performed in a single clock cycle, with the more time-consuming operations taking two or more clock cycles, then the effective clock cycle time is reduced. The net effect is to allow more instructions to get an earlier start in the pipeline. This is the essence of superpipelined operation. The performance improvement resulting from superpipelining is shown below.

The MIPS R4000 processor is an example of a machine that employs this technique. The

MIPS R4000 pipeline contains 8 stages.
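The clock-period argument can be made concrete with a small sketch (Python; the stage latencies below are hypothetical round numbers, not R4000 figures): the clock period is set by the slowest stage, so splitting that stage shortens the period until some other stage becomes the bottleneck.

```python
# Sketch: effect of splitting the slowest pipeline stages.
# All latencies are hypothetical, chosen only to illustrate the idea.

def clock_period_ps(stage_latencies_ps):
    """The clock period is dictated by the most time-consuming stage."""
    return max(stage_latencies_ps)

# A 5-stage pipeline (IF, ID, EX, MEM, WB); memory access (300 ps) is the bottleneck.
base = [200, 150, 250, 300, 150]

# Superpipelined: split IF into two 100 ps sub-stages and MEM into two 150 ps
# sub-stages; EX (250 ps) now sets the clock.
deep = [100, 100, 150, 250, 150, 150, 150]

period_base = clock_period_ps(base)   # 300 ps
period_deep = clock_period_ps(deep)   # 250 ps
speedup = period_base / period_deep   # steady-state throughput gain: 1.2x
print(period_base, period_deep, speedup)
```

With these (assumed) numbers the deeper pipeline completes one instruction every 250 ps instead of every 300 ps, a 1.2x steady-state throughput gain, at the cost of more stages in flight.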


The downside of superpipelining, however, is more dependencies among in-flight instructions, necessitating increased complexity in data forwarding, hazard detection units, and branch predictors.

2. Multiple-Issue Architectures

The basic idea is to fetch multiple instructions per cycle from memory and, after checking inter-dependencies, issue those instructions to independent functional units so that they can be executed simultaneously, generating increased ILP. These architectures are also known as wide-issue architectures.

There are two methods of implementing a multiple-issue processor:

• Static multiple-issue

• Dynamic multiple-issue

a. Static Multiple Issue

Multiple instructions issued in a given clock cycle are said to form an instruction packet. The decision of packaging instructions into issue slots is made by the compiler. Only independent instructions can be placed in the predefined instruction slots of a packet. For example, the instruction packet of a static quad-issue machine can have the following form:

Instruction Packet:

FP Instruction (Slot 1) | Integer Instruction (Slot 2) | Integer Instruction (Slot 3) | Load/Store Instruction (Slot 4)

The instruction packet can be thought of as a very long instruction comprising multiple base-machine instructions. This was the reason behind the original name for this approach: Very Long Instruction Word (VLIW). Intel has its own name for the technique, EPIC (Explicitly Parallel Instruction Computing), used in the Itanium series.

Example: Static Dual Issue (i.e. 2-way) MIPS

Let each issue packet contain an ALU or branch instruction (appearing first) and a load or store instruction. This design is akin to that of some embedded MIPS processors.

64 bits
R-Type or Branch Instruction | Load/Store Instruction

The following figure shows such a pipelined processor in operation:

Instruction Type   Pipeline Stages
ALU or Branch      IF  ID  EX  M   WB
Load or Store      IF  ID  EX  M   WB
ALU or Branch          IF  ID  EX  M   WB
Load or Store          IF  ID  EX  M   WB
ALU or Branch              IF  ID  EX  M   WB
Load or Store              IF  ID  EX  M   WB
ALU or Branch                  IF  ID  EX  M   WB
Load or Store                  IF  ID  EX  M   WB


For the simultaneous issue of ALU and data-transfer instructions, the following additional hardware is required to avoid structural hazards:

§ Additional ports in the register file:

o 2 extra read ports

o 1 extra write port

§ An additional ALU

§ An additional read port in the instruction memory

A static two-issue MIPS datapath

It contains no hazard detection unit, so load-use hazards are not interlocked in hardware. However, a static multiple-issue processor may adopt one of the following approaches to handle control and data hazards:

§ The compiler takes full responsibility, with no support in hardware.

§ The compiler is responsible for removing intra-packet dependencies, while the hardware supports removal of inter-packet hazards. We adopt this approach.

To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compilers are required.

If it is not possible to find operations that can be carried out at the same time for all functional units, the instruction slots for the unneeded units are filled with NOPs. Since most instruction words contain some NOPs, VLIW programs tend to be very long. The VLIW architecture also requires the compiler to be very knowledgeable of the implementation details of the target computer, and may require a program to be recompiled if moved to a different implementation of the same architecture.
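As a rough illustration of the slot packing described above (Python; the opcode sets and the greedy in-order pairing policy are simplifications, not a real VLIW compiler), a packer for the dual-issue format might pad empty slots with NOPs:

```python
# Hypothetical sketch of static dual-issue slot packing: each packet holds an
# ALU/branch instruction and a load/store instruction, with NOPs filling empty
# slots (the source of VLIW code bloat). Independence checking between the two
# slots is deliberately omitted here; a real compiler must verify it.

ALU_BRANCH = {"add", "addi", "sub", "bne"}
LOAD_STORE = {"lw", "sw"}

def pack(instructions):
    """Greedy in-order packing into (alu_branch, load_store) packets."""
    packets = []
    i = 0
    while i < len(instructions):
        op = instructions[i].split()[0]
        if op in ALU_BRANCH:
            alu = instructions[i]; i += 1
            # Pair with an immediately following load/store, if any.
            if i < len(instructions) and instructions[i].split()[0] in LOAD_STORE:
                mem = instructions[i]; i += 1
            else:
                mem = "nop"
        else:
            alu = "nop"
            mem = instructions[i]; i += 1
        packets.append((alu, mem))
    return packets

prog = ["lw $t0, 0($s1)", "add $t0, $t0, $s2", "sw $t0, 0($s1)",
        "addi $s1, $s1, -4", "bne $s1, $0, Loop"]
for packet in pack(prog):
    print(packet)
```

Note that this naive packer happily pairs dependent instructions (the add and the sw both touch $t0); the scheduling constraints discussed next are what a real compiler enforces on top of packing.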

Code Scheduling Example

Consider the scheduling of the following loop on the static two-issue MIPS pipeline.

Loop: lw $t0, 0($s1)

add $t0, $t0, $s2

sw $t0, 0($s1)

addi $s1, $s1, -4

bne $s1, $0, Loop

We must schedule the instructions to avoid pipeline stalls:

§ Instructions in one packet must be independent.

§ Load-use instructions must be separated from their loads by one cycle.

§ Assume branches are perfectly predicted by the hardware.

§ Assume forwarding hardware as necessary.

Optimal Schedule

ALU/Branch Memory Reference Issue Packet (CC) Loop: NOP

addi $s1, $s1, -4

add $t0, $t0, $s2

bne $s1, $0,loop

lw $t0, 0($s1)

NOP

NOP

sw $t0, 4($s1)

1

2

3

4

Ignoring pipeline startup, 5 instructions are executed in 4 clock cycles. Hence, we achieve a CPI of 4/5 = 0.8 (versus the best case of 0.5), or equivalently an IPC of 1.25 (versus the best case of 2.0). NOPs do not count towards performance!
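The arithmetic above can be checked directly (a trivial sketch; only the cycle and instruction counts come from the schedule):

```python
# CPI/IPC for the scheduled loop body: 5 useful instructions complete in
# 4 cycles. NOPs occupy issue slots but are not counted as executed work.
instructions = 5
cycles = 4
cpi = cycles / instructions   # 0.8 (best case for a dual-issue machine: 0.5)
ipc = instructions / cycles   # 1.25 (best case: 2.0)
print(cpi, ipc)
```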

Loop Unrolling

Loop unrolling is a technique to extract more performance from loops that access arrays: multiple copies of the loop body are made, and instructions from different iterations are scheduled together.

We apply loop unrolling by a factor of 4, eliminating the redundant loop-overhead instructions. Note that the compiler must rename registers so as to avoid name (false) dependencies, and must adjust the offsets in the load and store instructions. A name dependence is said to exist between two instructions when they use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
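The transformation can be sketched at a high level (Python for clarity; the distinct temporaries per iteration play the role of the renamed registers, and the array length is assumed to be a multiple of 4, as in the MIPS code):

```python
# Both functions add a scalar to every element of a list. The unrolled
# version amortizes the loop overhead (index update, bounds test) over
# four elements, mirroring the factor-4 unrolling of the MIPS loop.

def add_scalar(a, s):
    i = 0
    while i < len(a):        # overhead paid once per element
        a[i] += s
        i += 1
    return a

def add_scalar_unrolled(a, s):
    # Assumes len(a) is a multiple of 4.
    i = 0
    while i < len(a):        # overhead paid once per 4 elements
        a[i] += s
        a[i + 1] += s        # independent updates, like renamed $t1..$t3
        a[i + 2] += s
        a[i + 3] += s
        i += 4
    return a

print(add_scalar([1, 2, 3, 4], 10))
print(add_scalar_unrolled([1, 2, 3, 4], 10))
```

In the MIPS version the overhead being eliminated is three copies each of the addi and bne instructions; the four independent add/lw/sw groups are what the scheduler can then interleave.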


Loop: lw $t0, 0($s1)

lw $t1, -4($s1)

lw $t2, -8($s1)

lw $t3, -12($s1)

add $t0, $t0, $s2

add $t1, $t1, $s2

add $t2, $t2, $s2

add $t3, $t3, $s2

sw $t0, 0($s1)

sw $t1, -4($s1)

sw $t2, -8($s1)

sw $t3, -12($s1)

addi $s1, $s1, -16

bne $s1, $0, Loop

Now we schedule the resulting unrolled code. Due to the absence of a hazard detection unit, we must schedule so as to avoid load-use hazards.

Optimal Schedule

      ALU/Branch             Memory Reference     Issue Packet (CC)
Loop: addi $s1, $s1, -16     lw  $t0, 0($s1)      1
      NOP                    lw  $t1, 12($s1)     2
      add  $t0, $t0, $s2     lw  $t2, 8($s1)      3
      add  $t1, $t1, $s2     lw  $t3, 4($s1)      4
      add  $t2, $t2, $s2     sw  $t0, 16($s1)     5
      add  $t3, $t3, $s2     sw  $t1, 12($s1)     6
      NOP                    sw  $t2, 8($s1)      7
      bne  $s1, $0, Loop     sw  $t3, 4($s1)      8

Hence, by loop unrolling, we are able to execute 14 instructions in 8 clock cycles, corresponding to a CPI of 8/14 ≈ 0.57 (versus the best case of 0.5), or an IPC of 1.75 (versus the best case of 2.0).
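A per-element comparison makes the gain clearer than raw IPC, since the unrolled IPC counts four iterations' worth of work (a small arithmetic sketch using only the cycle counts from the two schedules):

```python
# The rolled schedule processes 1 array element every 4 cycles; the
# unrolled schedule processes 4 elements every 8 cycles.
rolled_cycles_per_elem = 4 / 1      # 4.0 cycles per element
unrolled_cycles_per_elem = 8 / 4    # 2.0 cycles per element
speedup = rolled_cycles_per_elem / unrolled_cycles_per_elem
print(speedup)                      # the loop runs twice as fast
```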

VLIW Advantages & Disadvantages

Advantage:

§ Simpler hardware, and therefore potentially less power hungry. For this reason, VLIW designs have gained popularity in the embedded domain; many digital signal processors use VLIW architectures.

Disadvantages:

§ Compiler complexity

§ Object (binary) code incompatibility across implementations

§ Code bloat

o NOPs are a waste of program memory space

o Loop unrolling uses more program memory space


b. Dynamic Multiple Issue Processors

Dynamic multiple-issue architectures are also known as superscalars. Unlike the compiler in VLIW machines, the processor hardware decides whether zero, one, or more instructions can be issued in a given clock cycle.

Superscalars allowing only in-order execution of instructions are called static superscalars. However, there are dynamic superscalars that allow out-of-order execution (also called dynamic pipeline scheduling or dynamic execution).

Dynamic Execution

When executing in order, we fetch instructions and execute them in the order in which the compiler produced the object code. But what if a long-running instruction (e.g., a floating-point divide that takes 40 cycles) is followed by instructions that do not depend on the value it produces? If we could somehow allow those instructions to "go around" the divide and execute in other functional units while the divide unit is busy, we would get better performance.

Instructions are fetched and decoded in program order. These instructions are then sent to reservation stations (buffers within functional units that hold the operation and its operands until the corresponding functional unit becomes ready to execute), along with whatever operands are already available. As soon as all operands for an instruction become available and its inter-instruction dependencies are discharged, the instruction is issued to the respective functional unit for execution. When an instruction completes, its result is sent to a commit unit. Committing an instruction involves writing back any values to memory or the register file. The commit unit holds result values in a reorder buffer until they can be committed in order (i.e., in program order). This step is also called retirement or graduation (of instructions).

In summary, dynamic execution is about IN-ORDER ISSUE, OUT-OF-ORDER EXECUTION, and IN-ORDER COMMIT. Dynamically scheduled pipelines are used in both the PowerPC 604 and the Pentium Pro. Support from the compiler is even more crucial for the performance of superscalars, because a superscalar processor can only look at a small window of the program. A good compiler schedules code in a way that facilitates the scheduling decisions made by the processor.
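A minimal simulation illustrates the in-order-issue, out-of-order-completion, in-order-commit discipline (Python; the one-instruction-per-cycle issue rate and the 40-cycle divide latency are simplifying assumptions, not a model of any real machine):

```python
# Sketch of a reorder buffer's commit discipline. Instructions issue
# in order, one per cycle; each finishes after its own latency; commit
# is strictly in program order.

def simulate(program):
    """program: list of (name, latency_cycles) in program order.
    Returns (completion_order, commit_cycle_per_instruction)."""
    n = len(program)
    finish = []
    for i, (_, latency) in enumerate(program):
        issue_cycle = i                 # in-order issue, one per cycle
        finish.append(issue_cycle + latency)
    # Execution may complete out of program order...
    completion_order = sorted(range(n), key=lambda i: (finish[i], i))
    # ...but the reorder buffer commits instruction i only once i and
    # all earlier instructions have finished.
    commit_cycle, done = [], 0
    for i in range(n):
        done = max(done, finish[i])
        commit_cycle.append(done)
    return completion_order, commit_cycle

# A long divide followed by two independent short instructions.
program = [("div", 40), ("add", 1), ("sub", 1)]
order, commits = simulate(program)
print(order)    # the add and sub "go around" the busy divide unit
print(commits)  # yet all three retire together, in program order
```

The add and sub complete on cycles 2 and 3 while the divide is still executing, but none of the three can commit before cycle 40, when the divide finishes: results become architecturally visible in program order.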


Out-of-Order Execution & New Data Hazards

We have seen the RAW (Read-After-Write) hazard in normal (in-order) operation of the pipeline. This hazard is the result of a flow dependence between two instructions. Out-of-order execution, however, gives rise to two more data hazards, which cannot occur in normal in-order execution.

§ WAR (Write-After-Read) Hazard

This is caused by an anti-dependence between two instructions. An instruction J is said to be anti-dependent on a preceding instruction I if the destination of J and a source of I are the same.

E.g.  add $1, $2, $3
      sub $2, $4, $5

§ WAW (Write-After-Write) Hazard

This is caused by an output dependence between two instructions. An instruction J is said to be output-dependent on a preceding instruction I if the destinations of J and I are the same.

E.g.  add $1, $2, $3
      sub $1, $4, $5

WAR and WAW are name (false) dependencies, as they can be avoided simply by renaming.
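Renaming can be sketched as follows (Python; the unbounded physical register file and the (dest, src1, src2) tuple encoding are simplifications of what real rename hardware does):

```python
# Register renaming: each architectural destination is mapped to a fresh
# physical register, so WAR and WAW (name) dependencies disappear and
# only true RAW dependencies remain.

def rename(instructions, num_arch_regs=32):
    """instructions: list of (dest, src1, src2) architectural register
    numbers in program order. Returns the instructions rewritten onto
    an unbounded physical register file."""
    mapping = {r: r for r in range(num_arch_regs)}   # arch -> physical
    next_phys = num_arch_regs
    renamed = []
    for dest, s1, s2 in instructions:
        ps1, ps2 = mapping[s1], mapping[s2]   # read the current mappings
        mapping[dest] = next_phys             # fresh register for the dest
        renamed.append((next_phys, ps1, ps2))
        next_phys += 1
    return renamed

# The WAR example from the text: sub writes $2, which the earlier add reads.
prog = [(1, 2, 3),   # add $1, $2, $3
        (2, 4, 5)]   # sub $2, $4, $5
print(rename(prog))  # sub's destination becomes a new physical register,
                     # so it may execute before (or alongside) the add
```

After renaming, the sub writes a different physical register than the one the add reads, so the two instructions can execute in either order; a genuine RAW dependence, by contrast, survives renaming because the consumer reads the producer's newly allocated register.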

In-Order Fetch → In-Order Issue → Out-of-Order Execute → In-Order Commit