CSC 370 (Blum) 1 Instruction-Level Parallelism

Upload: nathen-brede

Post on 14-Dec-2015


Instruction-Level Parallelism

• Instruction-level Parallelism (ILP) is when a processor has more than one execution unit and thus can execute more than one instruction simultaneously.
  – It should be distinguished from parallelism on a higher level, which might be accomplished by having more than one processor.
  – It should be distinguished from pipelining, which has various instructions in various stages but only one in the execution stage.

Pipeline Hazards

• Recall the hazards and potential hazards of pipelining.
  – Having multiple instructions in the pipeline means that the first instruction is not complete before the second instruction begins, which could be a problem if the instructions share data/registers.
  – Another term used is dependency.

Dependency Categories

• RAR: Read After Read
  – 1st instruction reads, 2nd instruction reads

• RAW: Read After Write
  – 1st instruction writes, 2nd instruction reads

• WAR: Write After Read
  – 1st instruction reads, 2nd instruction writes

• WAW: Write After Write
  – 1st instruction writes, 2nd instruction writes
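
The four categories can be sketched as a small C helper. This is only an illustration: the bitmask representation (one bit per register) and the names `Access`, `classify`, and `DEP_*` are assumptions, not part of the course material.

```c
#include <stdint.h>

/* Hypothetical representation: bit n of a mask stands for register rn. */
typedef struct {
    uint32_t reads;   /* registers the instruction reads  */
    uint32_t writes;  /* registers the instruction writes */
} Access;

enum { DEP_RAR = 1, DEP_RAW = 2, DEP_WAR = 4, DEP_WAW = 8 };

/* Returns a bit set of the dependency categories that hold between an
   earlier instruction (first) and a later one (second). */
static int classify(Access first, Access second)
{
    int deps = 0;
    if (first.reads  & second.reads)  deps |= DEP_RAR;
    if (first.writes & second.reads)  deps |= DEP_RAW;
    if (first.reads  & second.writes) deps |= DEP_WAR;
    if (first.writes & second.writes) deps |= DEP_WAW;
    return deps;
}
```

For example, an instruction that writes r1 followed by one that reads r1 comes back as `DEP_RAW`; swapping the two gives `DEP_WAR`.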

Bigger Problems

• WAR and WAW are not really problems in a single, in-order pipeline.

• However, in an out-of-order pipeline or in multiple pipelines,
  – The write may get ahead of the read in WAR, turning it into a RAW.
  – The second write may get ahead of the first write in WAW, leaving the wrong value in the register for subsequent processing.

Example from Carter’s Book

• LD r1, (r2)
  Load Reg. 1 with the contents of the memory location pointed to by Reg. 2.

• ADD r5, r6, r7
  Add the values in Reg. 6 and Reg. 7; put the answer in Reg. 5.

• SUB r4, r1, r4
  Subtract the value in Reg. 4 from the value in Reg. 1; put the answer in Reg. 4.

• MUL r8, r9, r10
  Multiply the values in Reg. 9 and Reg. 10; put the answer in Reg. 8.

• ST (r11), r4
  Store the value in Reg. 4 in the memory location pointed to by Reg. 11.

Example from Carter’s Book

Execution Unit 1
• LD r1, (r2)
• SUB r4, r1, r4
• ST (r11), r4

Execution Unit 2
• ADD r5, r6, r7
• MUL r8, r9, r10

This program fragment can be split into the two parallel pieces shown above because the pieces share no registers.
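
The independence of the two pieces can be checked mechanically. The sketch below assumes a bitmask with one bit per register; the macro `R` and the names `UNIT1_REGS`, `UNIT2_REGS`, and `independent` are made up for illustration. The check is deliberately conservative: it treats any shared register as a conflict, even one that both pieces only read, and it ignores memory.

```c
#include <stdint.h>

#define R(n) (1u << (n))   /* bit n stands for register rn */

/* Registers touched (read or written) by each piece of the fragment. */
static const uint32_t UNIT1_REGS = R(1) | R(2) | R(4) | R(11);   /* LD, SUB, ST */
static const uint32_t UNIT2_REGS =
    R(5) | R(6) | R(7) | R(8) | R(9) | R(10);                    /* ADD, MUL    */

/* Two pieces can run side by side when they share no registers. */
static int independent(uint32_t regs_a, uint32_t regs_b)
{
    return (regs_a & regs_b) == 0;
}
```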

Another Example from Carter’s Book

1. ADD r1, r2, r3

2. LD r4, (r5)

3. SUB r7, r1, r9

4. MUL r5, r4, r4

5. SUB r1, r12, r10

6. ST (r13), r14

7. OR r15, r14, r12

Type of access to registers in the sequential program fragment

Registers R1 and R4 have RAWs and Registers R1 and R5 have WARs

Hazards (RAW)

• Instruction 3 must follow Instruction 1 because they have a RAW dependency in Register 1.

• Instruction 4 must follow Instruction 2 because they have a RAW dependency in Register 4.

Potential Hazards (WAR)

• Instruction 5 (writes to R1) can at best be simultaneous with Instruction 3 (reads from R1), because the read stage of an instruction precedes the write stage.

• Instruction 4 (writes to R5) can at best be simultaneous with Instruction 2 (reads from R5), but we already have the stronger condition that it must follow it.

Division of Labor

• After identifying the various conditions on the ordering of instructions, the instructions can be divided up among the execution units in any way that respects the conditions.

• Instructions that must follow each other will be sent to the same execution unit.

• This ensures their order and also allows for bypassing.

With Two Execution Units

Execution Unit 1
• 1. ADD r1, r2, r3
• 3. SUB r7, r1, r9
• 5. SUB r1, r12, r10
• 7. OR r15, r14, r12

Execution Unit 2
• 2. LD r4, (r5)
• 4. MUL r5, r4, r4
• 6. ST (r13), r14

7 cycles sequentially; 4 cycles with two execution units.

With Four Execution Units

Cycle 1: ADD r1, r2, r3 | LD r4, (r5) | ST (r13), r14 | OR r15, r14, r12

Cycle 2: SUB r7, r1, r9 | MUL r5, r4, r4 | SUB r1, r12, r10

7 cycles sequentially; 2 cycles with four execution units.

Because of the RAW dependency, we cannot do better than 2 cycles here – no matter how many execution units there are.

Another Distinction

• In the two execution unit result, one has not changed the order of the instructions – apart from executing Instructions 1 and 2 simultaneously.

• In the four execution unit result, one has changed the order of the instructions – Instructions 6 and 7 occur in the first time cycle before Instructions 3, 4 and 5 which are in the second.

• Therefore the benefit we gained from the latter assumes that the processor allows for out-of-order processing.
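
The cycle counts for the two-unit and four-unit cases can be reproduced with a small greedy model. This is a sketch under stated assumptions, not Carter's method: every instruction takes one cycle, memory dependencies between the load and the store are ignored, RAW and WAW push the later instruction into a later cycle, and WAR only forbids the write from running before the read. The names `Instr`, `FRAGMENT`, `reads`, and `count_cycles` are hypothetical.

```c
#include <assert.h>

enum { MAX_INSTRS = 64 };

/* Hypothetical encoding: register numbers, with -1 meaning "no register". */
typedef struct {
    int dst;      /* register written, or -1 */
    int src[2];   /* registers read, or -1   */
} Instr;

/* The seven-instruction fragment from Carter's book. */
static const Instr FRAGMENT[7] = {
    {  1, {  2,  3 } },   /* 1. ADD r1, r2, r3    */
    {  4, {  5, -1 } },   /* 2. LD  r4, (r5)      */
    {  7, {  1,  9 } },   /* 3. SUB r7, r1, r9    */
    {  5, {  4,  4 } },   /* 4. MUL r5, r4, r4    */
    {  1, { 12, 10 } },   /* 5. SUB r1, r12, r10  */
    { -1, { 13, 14 } },   /* 6. ST  (r13), r14    */
    { 15, { 14, 12 } },   /* 7. OR  r15, r14, r12 */
};

/* True if the instruction reads the given register (never true for -1). */
static int reads(const Instr *in, int reg)
{
    return reg >= 0 && (in->src[0] == reg || in->src[1] == reg);
}

/* Places instructions greedily, in program order, on `width` execution
   units and returns the number of cycles used. With in_order nonzero, no
   instruction may run in an earlier cycle than one that precedes it. */
static int count_cycles(const Instr *prog, int n, int width, int in_order)
{
    int cycle[MAX_INSTRS];             /* cycle chosen for each instruction */
    int used[MAX_INSTRS + 2] = {0};    /* units already busy in each cycle  */
    int last = 0;
    assert(n >= 0 && n <= MAX_INSTRS && width >= 1);
    for (int j = 0; j < n; j++) {
        int c = 1;
        for (int i = 0; i < j; i++) {
            int need = 0;
            if (reads(&prog[j], prog[i].dst) ||
                (prog[i].dst >= 0 && prog[i].dst == prog[j].dst))
                need = cycle[i] + 1;   /* RAW or WAW: strictly later        */
            else if (in_order || reads(&prog[i], prog[j].dst))
                need = cycle[i];       /* WAR or program order: not earlier */
            if (need > c) c = need;
        }
        while (used[c] >= width) c++;  /* wait for a free execution unit    */
        cycle[j] = c;
        used[c]++;
        if (c > last) last = c;
    }
    return last;
}
```

Under these assumptions, `count_cycles(FRAGMENT, 7, 2, 1)` gives 4 and `count_cycles(FRAGMENT, 7, 4, 0)` gives 2, matching the two slides. Forcing in-order issue on four units, `count_cycles(FRAGMENT, 7, 4, 1)`, gives 3, which is the gap that out-of-order processing closes. Raising the width to 64 still gives 2, because the RAW dependencies set the floor.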

Superscalar

• A processor is said to be superscalar if it has multiple execution units and if the placement of the instructions into the parallel execution units is handled by the processor’s hardware.
  – In other scenarios the hardware may have parallel execution units but the hardware does not determine the splitting up of the instructions among the execution units. The parallelization of instructions will occur at a higher level. It is done by the compiler.

Don’t have to recompile

• A superscalar processor can give ILP (Instruction-Level Parallelism) to code that was compiled for a processor without ILP, and it can do so without the code being recompiled.
  – Provided the new processor (with ILP) is backward compatible with the old processor (without ILP).

But consider recompiling

• The hardware can only consider so many instructions at once – its window of instructions.

• The compiler can take a much broader view of the code and arrange instructions in a way that allows the superscalar processor to take greater advantage of ILP.

Loop Unrolling

• One example of what a compiler might do to exploit ILP is loop unrolling.

• Branching is the bane of pipelining and parallelism.

• Loops have at least one branch with each iteration.

• Loop unrolling is doing two or more iterations' worth of work in one iteration. It reduces the number of branch considerations and promotes parallelism.

Loop Unrolling Example

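/* Original loop: 100 iterations, one branch per element */
for(i=0; i<100; i++){
    a[i] = b[i] + c[i];
}

/* Unrolled by two: 50 iterations, one branch per two elements */
for(i=0; i<100; i+=2){
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}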

The unrolled version has half as many branches and so is easier to pipeline.

The unrolled version will use more independent registers within each iteration and so takes greater advantage of ILP.
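
The example above works cleanly because 100 is a multiple of 2. When the trip count is not known to be a multiple of the unroll factor, a cleanup loop is needed for the leftover elements. A minimal sketch, assuming an unroll factor of 4 (an arbitrary choice) and a hypothetical function name `add_unrolled`:

```c
#include <stddef.h>

/* Adds b and c element-wise into a, unrolled by four. */
static void add_unrolled(int *a, const int *b, const int *c, size_t n)
{
    size_t i = 0;
    /* Main loop: four independent additions per branch. */
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
    /* Cleanup loop: the 0 to 3 elements left when n is not a multiple of 4. */
    for (; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

The four statements in the main loop touch different array elements, so none of them has to wait for another.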

Don’t try this at home

• Loop unrolling requires knowledge of the processor’s capabilities (the number of execution units, the number of stages in the pipeline, etc.). If the programmer does not have this knowledge, the unrolling and other code optimization techniques should be left to the compiler.

Superscalar Versus Vector

• A vector is essentially a one-dimensional array.

• A program that is optimized for the efficient handling of such arrays is said to be vectorized.

• In a superscalar processor, the execution units can be doing different operations on different data, whereas with vectorization the execution units would be doing the same operation on different data.

Vectorization

• Vectorization could even be beneficial if there is only one execution unit because the same operation would be performed over and over again (on different data) so it would not have to be decoded over and over again.

• Vectorization is more restrictive but easier to implement than making the processor superscalar. But since it is exactly the kind of processing that arises so often, it is worth investing effort in doing it well.

SIMD

• Recall that one of the features of MMX (MultiMedia eXtensions or Matrix Math eXtension) was SIMD (Single Instruction Multiple Data), in which one instruction operates on many pieces of data simultaneously (i.e. vectorization).
  – In mathematics, matrices operate on vectors.

• SIMD instructions are important to the optimization of audio-visual processing, since such processing involves a lot of data that can be operated on in parallel.
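
To make "one instruction, many pieces of data" concrete, the sketch below emulates a packed byte addition in portable C: eight 8-bit values sit side by side in one 64-bit word and are all added by a handful of word-wide operations. This is a software imitation for illustration only, not MMX itself (MMX provides such a packed add directly in hardware); the function names are hypothetical.

```c
#include <stdint.h>

/* Adds eight unsigned bytes packed in a and b lane by lane, each lane
   wrapping modulo 256, without letting carries cross lane boundaries. */
static uint64_t packed_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t HIGH = 0x8080808080808080ULL;  /* top bit of every lane      */
    uint64_t low = (a & ~HIGH) + (b & ~HIGH);     /* add low 7 bits of each lane */
    return low ^ ((a ^ b) & HIGH);                /* add top bits, drop carry-out */
}

/* Scalar reference: the same eight additions done one at a time. */
static uint64_t scalar_add_u8(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        result |= (uint64_t)(uint8_t)(x + y) << (8 * lane);
    }
    return result;
}
```

Both functions return the same word for any inputs; the packed version simply does the eight additions at once, which is the kind of saving that matters for pixel and audio-sample data.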

Try this at home

• While loop unrolling is probably best left to the compiler, there are some things the high-level programmer can consider to try to ensure that his or her code can be vectorized to the fullest extent.

• Recall that vectorization is concerned with the processing of arrays.

Whenever Possible

1. Use for loops instead of while loops

2. Make the number of iterations a power of 2

3. Avoid ifs

4. Avoid subroutine calls

5. In nested loops, make the loop with the larger number of iterations the inner loop
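
As an illustration of tips 1 and 3, the sketch below writes the same clamp-to-zero operation twice: once with an if inside the loop and once without. The function names are hypothetical, and whether the second form is actually faster depends on the compiler and the target; many compilers perform this kind of conversion on their own.

```c
#include <stddef.h>

/* Branchy version: one data-dependent branch per element. */
static void clamp_branchy(int *out, const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0)
            out[i] = 0;
        else
            out[i] = in[i];
    }
}

/* Branch-free version: the comparison yields 0 or 1, which is used as a
   multiplier, so every iteration does the same straight-line work. */
static void clamp_branchless(int *out, const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * (in[i] > 0);
    }
}
```

Both are plain for loops whose iteration count is known before the loop starts, which is what makes the body a candidate for being applied to many elements at once.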

Who bears the burden?

• In superscalar processors, it is the hardware that provides the ILP. The compiler can help exploit the hardware’s capabilities. But the superscalar processor can yield ILP (on the fly) even for code compiled for a sequential processor.

• In Very Long Instruction Word (VLIW) Processors, the burden for discovering ILP is on the compiler.

VLIW Processors

• When the program is compiled, operations which can be done in parallel are sandwiched together in one long instruction, hence the name “very long instruction word” processor.

• The processor has to parse this long instruction, but it does not have to make decisions about what can be done in parallel since that has been done by the compiler.
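
The division of labor can be sketched with a toy instruction format. Everything about the layout below is invented for illustration (four 16-bit slots per 64-bit word, 4-bit fields, the names `Op`, `pack_bundle`, and `unpack_slot`); real VLIW encodings differ.

```c
#include <stdint.h>

/* Toy 16-bit operation: a 4-bit opcode and three 4-bit register fields. */
typedef struct {
    unsigned opcode, dst, src1, src2;   /* each in the range 0..15 */
} Op;

enum { OP_NOP = 0, OP_ADD = 1, OP_SUB = 2, OP_MUL = 3 };
enum { SLOTS = 4 };                     /* operations per long instruction word */

/* Compiler side: pack four operations, already known to be independent,
   into one 64-bit long instruction word. */
static uint64_t pack_bundle(const Op ops[SLOTS])
{
    uint64_t word = 0;
    for (int s = 0; s < SLOTS; s++) {
        uint64_t bits = ((uint64_t)(ops[s].opcode & 0xF) << 12)
                      | ((uint64_t)(ops[s].dst    & 0xF) << 8)
                      | ((uint64_t)(ops[s].src1   & 0xF) << 4)
                      |  (uint64_t)(ops[s].src2   & 0xF);
        word |= bits << (16 * s);
    }
    return word;
}

/* Processor side: parsing is just slicing out slot s; no dependency checks. */
static Op unpack_slot(uint64_t word, int s)
{
    uint64_t bits = (word >> (16 * s)) & 0xFFFF;
    Op op;
    op.opcode = (unsigned)((bits >> 12) & 0xF);
    op.dst    = (unsigned)((bits >> 8) & 0xF);
    op.src1   = (unsigned)((bits >> 4) & 0xF);
    op.src2   = (unsigned)(bits & 0xF);
    return op;
}
```

When the compiler cannot find four independent operations, it would fill the spare slots with `OP_NOP`.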

VLIW Pros and Cons

• The good thing about VLIW processors is that they depend on the compiler (pre-processor).

• The bad thing about VLIW processors is that they depend on the compiler (pre-processor).

• ???

VLIW Pro

• Placing the burden for parallelizing the code on software allows the hardware to be simpler.

• The instruction issue logic circuitry that would determine parallelization in the superscalar processor now does little more than parsing.

• This allows the hardware
  – To be cheaper
  – To use less power
  – And possibly to be faster.

VLIW Pro

• The simplification of hardware puts it along the same lines as the RISC philosophy.

• The reduced hardware leads to a reduction in power consumption.
  – E.g. computers based on the Crusoe family of processors from Transmeta can go almost all day without having to recharge the battery.

VLIW Pro

• The compiler can take a more global view when looking for parallelization.
  – The superscalar processor has a window, a limited number of instructions it sees, and it looks for ILP within that window.

• This is not a real advantage of VLIW over superscalar since code on a superscalar processor must also be compiled and that compiler can also look for ILP on a more global scale.

VLIW Con

• The dependence on the compiler for ILP can lead to backward compatibility issues.

• Within a family of superscalar processors, one can change the micro-architecture (hardware implementation) without changing the architecture. Compiled code is architecture specific but not micro-architecture specific.

VLIW Con (Cont.)

• The new superscalar micro-architecture can take advantage (to some extent) of any new ILP capability without recompiling the code.

• In a VLIW processor, more of the hardware details must be exposed to the software. And thus changes in the hardware require changes in the software – recompiling.

• The old VLIW-compiled code may not work on a new VLIW processor.

Hyper-Threading Technology

Thread-Level Parallelism

• “Hyper-Threading Technology provides thread-level-parallelism (TLP) on each processor resulting in increased utilization of processor execution resources.”

• “Hyper-Threading Technology makes a single physical processor appear as two logical processors ….”

EPIC

• The new Itanium processors have a feature known as EPIC.

• “EPIC (Explicitly Parallel Instruction Computing) is a 64-bit microprocessor instruction set, jointly defined and designed by Hewlett Packard and Intel, that provides up to 128 general and floating point unit registers and uses speculative loading, predication, and explicit parallelism to accomplish its computing tasks.”

Need a compiler to take advantage

• One feature of Itanium is its use of a "smart compiler" to optimize how instructions are sent to the processor. This approach allows Itanium and future IA-64 microprocessors to process more instructions per clock cycle (IPCs).
  – IPCs can be used along with clock speed in terms of megahertz (MHz) to indicate a microprocessor's overall performance.
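
One way to read this is that multiplying IPC by the clock rate gives a rough instruction throughput. The numbers below are made up purely for illustration and do not describe any particular processor.

```latex
\[
\text{instructions per second} \approx \text{IPC} \times f_{\text{clock}},
\qquad \text{e.g. } 2 \times 800\,\text{MHz} = 1.6 \times 10^{9}\ \text{instructions per second}
\]
```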

References

• Computer Architecture, Nicholas Carter

• http://www.whatis.com

• http://www.webopedia.com

• PC Hardware in a Nutshell, Thompson and Thompson

• http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm