Page 1:

Computer Architecture
R. Poss
ca02 - 10 September 2015

Page 2:

Performance

Page 3:

Processor performance equations

Performance = Results / Second = (Results / Instruction) x (Instructions / Cycle) x (Cycles / Second)

Execution time = Seconds / Result = (Instructions / Result) x (Cycles / Instruction) x (Seconds / Cycle)

Performance and execution time are related - how?

Hint: think throughput vs latency, 1 result vs many

(In these products, Instructions / Cycle is the IPC, Cycles / Instruction the CPI, and Cycles / Second the clock frequency in Hz.)
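
A worked instance (the numbers are made up for illustration): take a 2 GHz clock (0.5 ns per cycle), an IPC of 1.5 (CPI = 2/3) and 10 instructions per result. Then:

    Performance    = 1/10 x 1.5 x 2x10^9   = 3x10^8 results / second
    Execution time = 10 x 2/3 x 0.5 ns     = 3.33 ns / result

Note that 1 / 3.33 ns = 3x10^8 / s: for one result produced in isolation, performance is simply the reciprocal of execution time. When many results are in flight at once (pipelining, parallelism), throughput can exceed the reciprocal of the per-result latency.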

Page 4:

Processor performance

Latency: expressed as CPI = cycles per instruction; divide by frequency to obtain the absolute latency

Throughput: expressed as IPC = instructions per cycle; multiply by frequency to obtain the absolute throughput

Pipelining objective: increase IPC, also decrease CPI. As we will see, ↘CPI and ↗IPC are conflicting requirements

Page 5:

Who controls what?

Results vs instructions: software; task of the programmer, compiler, instruction set

Instructions vs cycles: micro-architecture; task of the processor designer, instruction set, partly the compiler

Cycles vs seconds: technology; task of the circuit designer, manufacturer

Page 6:

Performance comparison

“X has a speedup of n relative to reference Y”

“X is n times faster than Y” n = perf(X) / perf(Y) = exectime(Y) / exectime(X)

NB: performance encompasses software + hardware

Can’t compare the performance of software alone without specifying which hardware is used for the measurement

(The performance ratio holds in the general case; the execution-time ratio holds when comparing runs of one program.)
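
For instance (illustrative numbers): if a given program takes 12 s on reference machine Y and 4 s on machine X, then n = exectime(Y) / exectime(X) = 12 / 4 = 3, i.e. X is 3 times faster than Y for that program.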

Page 7:

Components and vocabulary

Page 8:

Complexity vs Critical path

Complexity: number of operations/circuits needed to produce an output from an input

Critical path: longest chain of operations from input to output

Complexity and critical path are different

think parallelism again

But: critical path establishes a lower bound on (circuit) complexity

In component design, the critical path sets a lower bound on the cycle time
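
As an illustration (not from the slides): summing 8 values takes 7 additions whether they are chained sequentially or arranged as a balanced tree, so the complexity is 7 adders in both cases; but the critical path is 7 adder delays for the chain and only 3 (log2 8) for the tree. The critical path (3) is at most the complexity (7), which is the lower-bound relation above, and it is the tree's critical path that would bound the cycle time.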

Page 9:

Von Neumann vs Harvard

Data memory vs code memory: where are the instructions coming from?

Harvard architecture: separate memory for code and data

Von Neumann architecture: same memory for both

Most architectures nowadays use a hybrid system:

Separate Instruction and Data cache (I-Cache / D-Cache)

Data in the I-cache is not automatically updated when writing through the D-cache, giving the illusion of partially separate memories
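
This split becomes visible to software that writes code at run time (e.g. a JIT compiler). A minimal C sketch, assuming a POSIX-like system and the GCC/Clang builtin __builtin___clear_cache (the mmap flags and the generated bytes are placeholders, not a complete JIT); on some architectures (x86) the flush is effectively a no-op because the hardware keeps the I- and D-caches coherent, on others (ARM, MIPS) it is mandatory:

    #include <string.h>
    #include <sys/mman.h>

    typedef int (*gen_fn)(void);

    static int run_generated(const unsigned char *code, size_t len)
    {
        /* Obtain a writable + executable region (some OSes enforce W^X and
           require an mprotect() permission switch instead; omitted here). */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return -1;

        memcpy(buf, code, len);      /* the writes go through the D-cache */

        /* Make the instruction-fetch path (I-cache) see the new bytes: */
        __builtin___clear_cache((char *)buf, (char *)buf + len);

        int r = ((gen_fn)buf)();     /* only now is it safe to execute */
        munmap(buf, len);
        return r;
    }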

Page 10:

Buffer vs memory

Buffer: one or more memory cells arranged in a FIFO

Producer-consumer interface between two or more hardware components

Memory: one or more memory cells arranged in an array

With address selection circuit to determine which cell is read from/written to in one access operation
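
A behavioural sketch of the two interfaces in C (an illustrative model, not a hardware description): the buffer exposes only push/pop at its two ends, while the memory presents an address on every access.

    #include <stdbool.h>
    #include <stdint.h>

    #define FIFO_DEPTH 8
    #define MEM_WORDS  1024

    /* Buffer: cells arranged as a FIFO; the producer pushes at the tail,
       the consumer pops at the head; no address is ever presented. */
    typedef struct {
        uint32_t cell[FIFO_DEPTH];
        unsigned head, tail, count;
    } fifo_t;

    static bool fifo_push(fifo_t *f, uint32_t v) {
        if (f->count == FIFO_DEPTH) return false;   /* full: producer stalls */
        f->cell[f->tail] = v;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
        return true;
    }

    static bool fifo_pop(fifo_t *f, uint32_t *v) {
        if (f->count == 0) return false;            /* empty: consumer stalls */
        *v = f->cell[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return true;
    }

    /* Memory: cells arranged as an array; the address selects exactly
       one cell per access operation. */
    typedef struct { uint32_t cell[MEM_WORDS]; } mem_t;

    static uint32_t mem_read(const mem_t *m, unsigned addr) {
        return m->cell[addr % MEM_WORDS];
    }

    static void mem_write(mem_t *m, unsigned addr, uint32_t value) {
        m->cell[addr % MEM_WORDS] = value;
    }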

Page 11:

Register vs main memory

Register file:

on-chip, close to the processor; SRAM: faster, larger area per bit, small capacity

indexed using fixed offsets in instruction codes

Main memory or scratchpads in MPSoCs

off-chip or farther from the processor; DRAM: slower, smaller area per bit, large capacity

indexed using variable offsets in registers or other memory cells

Both are forms of “memory” from the architect’s perspective: the word “memory” designates any component with an address / value interface.

NB: from the software/compiler/OS perspective, “memory” does not include the registers

Page 12:

Processors: RISC pipelines

Page 13:

Instruction set architecture

ISA = instruction set + operational semantics

Instruction set = all possible instruction encodings

Described with instruction formats and decode logic

Defines how operands and operations are derived from the instruction codes

Operational semantics = “what the instructions do”

Described with pseudo-code, also called Register Transfer Language (RTL)

Defines how results are produced from operands
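
For instance (a sketch, ignoring overflow traps and branch delay slots), the operational semantics of two MIPS instructions could be written in RTL roughly as:

    add rd, rs, rt:        R[rd] ← R[rs] + R[rt];                      PC ← PC + 4
    lw  rt, offset(rs):    R[rt] ← Mem[R[rs] + sign_extend(offset)];   PC ← PC + 4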

Page 14:

Example instruction set: MIPS

The first bits determine the op and the encoding of the rest

More regularity = simpler decode logic

Op    Format    Opx    Insn
0     R-R       0x20   add
0     R-R       0x21   addu
0     R-R       0x22   sub
0     R-R       0x23   subu
8     R-I       -      addi
9     R-I       -      addiu
0x23  R-I       -      lw
0x2B  R-I       -      sw
4     B (R-I)   -      beq
3     J         -      jal

Example encodings
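
A sketch in C of how the fixed field positions make decoding cheap (the field layout is the standard MIPS one; the printing is only for illustration):

    #include <stdint.h>
    #include <stdio.h>

    static void decode(uint32_t insn)
    {
        uint32_t op = insn >> 26;              /* bits 31..26 */
        uint32_t rs = (insn >> 21) & 0x1F;     /* bits 25..21 */
        uint32_t rt = (insn >> 16) & 0x1F;     /* bits 20..16 */

        if (op == 0) {                         /* R-format (register-register) */
            uint32_t rd    = (insn >> 11) & 0x1F;
            uint32_t funct = insn & 0x3F;      /* 0x20 add, 0x21 addu, 0x22 sub, ... */
            printf("R: funct=0x%02x rs=%u rt=%u rd=%u\n", funct, rs, rt, rd);
        } else if (op == 2 || op == 3) {       /* J-format (j, jal) */
            uint32_t target = insn & 0x03FFFFFF;
            printf("J: op=%u target=0x%07x\n", op, target);
        } else {                               /* I-format (register-immediate) */
            int32_t imm = (int16_t)(insn & 0xFFFF);   /* sign-extended */
            printf("I: op=0x%02x rs=%u rt=%u imm=%d\n", op, rs, rt, imm);
        }
    }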

Page 15:

© Chris Jesshope 2008-2011, Raphael Poss 2011

Pipelines

Each instruction / operation can be decomposed into sub-tasks: Op = A ; B ; C ; D

Considering an instruction stream [Op1; Op2; ...], at each cycle n we can run in parallel: A_{n+3} ∣∣ B_{n+2} ∣∣ C_{n+1} ∣∣ D_n

(Figure: the four stages A, B, C, D chained by pipeline latches; at the start of cycle n, stage A receives the input for operation n+3, B the input for n+2, C the input for n+1, while D completes operation n.)
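
Spelled out cycle by cycle for four consecutive operations (reconstructed from the description above):

    cycle:   1    2    3    4    5    6    7
    Op1:     A    B    C    D
    Op2:          A    B    C    D
    Op3:               A    B    C    D
    Op4:                    A    B    C    D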

Page 16:

Origins of pipelining

Idea: “assembly line”

Different phases of two instructions can occur at the same time

e.g. read the operands of one instruction while the next one is being fetched

More steps in RTL = potentially more stages in pipeline

Page 17:

MIPS Pipeline

Page 18:

© Chris Jesshope 2008-2011, Raphael Poss 2011

Pipeline trade-offs

Observations:

more complexity per sub-task requires more time per cycle

conversely, as the sub-tasks become simpler the cycle time can be reduced

so to increase the clock rate instructions must be broken down into smaller sub-tasks

…but operations have a fixed complexity

smaller sub-tasks mean deeper pipelines = more stages ⇒ more instructions need to be executed to fill the pipeline
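
The fill cost can be quantified with a rough, idealised model (ignoring hazards and assuming equal stage delays): n independent instructions take about k + (n - 1) cycles in a k-stage pipeline, against n x k cycles unpipelined, so the speedup is n x k / (k + n - 1). For k = 10 and n = 10 this is 100 / 19 ≈ 5.3, far from the ideal factor of 10; the ideal is only approached for long uninterrupted instruction streams.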

Page 19:

Control hazards

Page 20:

Control hazards

Branches – in particular conditional branches – cause pipeline hazards

the outcome of a conditional branch is not known until the end of the EX stage, but is required at IF to load another instruction and keep the pipeline full

A simple solution: assume by default that the branch falls through – i.e. is not taken – then continue speculatively until the target of the branch is known

Branch not taken (fall-through assumed):

    cycle:          1    2    3    4    5    6
    beq:            IF   ID   Ex   WB
    next insn:           IF   --   ID   Ex   WB

    Continue to fetch but stall ID until the branch outcome is known – one cycle lost

Branch is taken:

    cycle:          1    2    3    4    5    6    7
    beq:            IF   ID   Ex   WB
    wrong target:        IF   --   (discarded)
    branch target:                 IF   ID   Ex   WB

    Need to refetch at the new target – two cycles lost
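
These penalties translate into effective CPI as follows (the branch statistics are made up for the sake of the example): if 20% of instructions are conditional branches and 60% of those are taken, the average penalty is 0.2 x (0.6 x 2 + 0.4 x 1) = 0.32 cycles per instruction, so a base CPI of 1 becomes an effective CPI of 1.32.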

Page 21:

How to overcome

Different solutions:

Stall (wait) from decode stage when a branch is encountered

Flush: erase instructions in pipeline after the branch when/if the branch is taken

Expose the branch delay to the programmer / compiler: branch delay slots (MIPS, SPARC, PA-RISC)

Predict whether the branch is taken or not and its target: branch prediction

Eliminate branches altogether via predication (most GPUs)

Execute instructions from other threads: hardware multithreading (eg Niagara, cf next lecture)
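
As an illustration of the branch-prediction option above, a sketch in C of the classic 2-bit saturating-counter predictor (the table size and PC indexing are arbitrary choices for this example, not a description of any particular processor):

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_ENTRIES 1024            /* branch history table size (assumed) */

    /* One 2-bit counter per entry: 0,1 = predict not taken; 2,3 = predict taken. */
    static uint8_t bht[BHT_ENTRIES];

    static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_ENTRIES; }

    /* Used at IF: guess the outcome before EX resolves the branch. */
    static bool predict(uint32_t pc)
    {
        return bht[bht_index(pc)] >= 2;
    }

    /* Used once the branch resolves: nudge the counter towards the real outcome. */
    static void train(uint32_t pc, bool taken)
    {
        uint8_t *ctr = &bht[bht_index(pc)];
        if (taken) { if (*ctr < 3) (*ctr)++; }   /* saturate at 3 (strongly taken) */
        else       { if (*ctr > 0) (*ctr)--; }   /* saturate at 0 (strongly not taken) */
    }

The two-bit hysteresis means a single atypical outcome (e.g. a loop exit) does not immediately flip the prediction for the next execution of the branch.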

Page 22:

Branch delay slots

Specify in the ISA that a branch takes effect N instructions later, then let the compiler / programmer fill the empty slot

    L1: lw  a x[i]
        add a a a
        sw  a x[i]
        sub i i 4
        bne i 0 L1
        nop
    L2:

    1 cycle wasted at each iteration

    L1: lw  a x[i]
        add a a a
        sw  a x[i]
        bne i 4 L1
        sub i i 4
    L2:

    no bubble, but one extra sub at the last iteration

Page 23:

Grohoski’s estimate

Page 24:

Summary

Control hazards are situations where the pipeline is not fully utilized due to branch instructions

Branch prediction, delay slots and predication are architectural solutions to overcome control hazards

Only branch prediction is fully invisible to software

Without these solutions, software must avoid branches via inlining and loop unrolling, with a trade-off: more code means more pressure on the I-cache
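
A small C illustration of the loop-unrolling point (the loop body is a placeholder; a real compiler would also have to handle trip counts that are not a multiple of the unroll factor):

    /* rolled: one conditional branch per element */
    void scale(float *x, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] *= 2.0f;
    }

    /* unrolled by 4: one conditional branch per 4 elements, at the cost
       of more code, hence more I-cache pressure (assumes n % 4 == 0) */
    void scale_unrolled(float *x, int n)
    {
        for (int i = 0; i < n; i += 4) {
            x[i]     *= 2.0f;
            x[i + 1] *= 2.0f;
            x[i + 2] *= 2.0f;
            x[i + 3] *= 2.0f;
        }
    }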

Page 25:

Data hazards

Page 26:

Data hazard

Occurs when the output of one operation is the input of a subsequent operation

The hazard occurs because of the latency in the pipeline

the result (output from one instruction) is not written back to the register file until the last stage of the pipe

the operand (input of a subsequent instruction) is required at register read – some cycles prior to writeback

the longer the RR-to-WB delay, the more cycles are needed between the producer instruction and the consumer's register read

Example:
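
The example figure is not reproduced in this transcript; a representative instance of the hazard (pseudo-assembly) would be:

    add r1, r2, r3    ; r1 is written back only at WB
    sub r4, r1, r5    ; r1 is needed at register read, a few cycles earlier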

Page 27:

How to overcome

Do nothing, i.e. expose the hazard to the programmer, e.g. MIPS I, DSPs, VLIW

Stall the read stage

Bypass buses:

The operand is taken from the pipeline register and input directly to the ALU on the subsequent cycle
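
A sketch in C of the bypass selection for one ALU source operand, in the style of a cycle-level simulator (the structure and field names are invented for this illustration, not taken from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {            /* slice of a pipeline latch (EX/MEM or MEM/WB) */
        bool     writes_reg;    /* does the in-flight instruction produce a register result? */
        unsigned dest;          /* its destination register number */
        uint32_t value;         /* the result it has computed so far */
    } fwd_src_t;

    /* Pick the ALU input for source register 'src': the youngest in-flight
       producer wins; otherwise fall back to the register-file read. */
    static uint32_t bypass(unsigned src, uint32_t regfile_value,
                           const fwd_src_t *ex_mem, const fwd_src_t *mem_wb)
    {
        if (src != 0) {                                    /* MIPS r0 is hardwired to 0 */
            if (ex_mem->writes_reg && ex_mem->dest == src)
                return ex_mem->value;                      /* forward from EX/MEM */
            if (mem_wb->writes_reg && mem_wb->dest == src)
                return mem_wb->value;                      /* forward from MEM/WB */
        }
        return regfile_value;                              /* no hazard: use the RF value */
    }

Note that a load whose value only becomes available after the memory stage still cannot be bypassed into the immediately following instruction; that case needs a one-cycle stall (or an exposed load delay slot).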

Page 28:

Structural hazards and scalar ILP

Page 29:

Scalar pipelines

In the simple pipeline, register-to-register operations have a wasted cycle

a memory access is not required, but the memory stage still takes a cycle before the operation completes

Decoupling memory access from operation execution avoids this, e.g. use an ALU plus a separate memory unit - this is scalar ILP

… note: either we need two write ports to the register file or arbitration on a single port

Page 30:

Structural hazard - registers

A structural hazard occurs when a resource in the pipeline is required by more than one instruction in the same cycle

a resource may be an execution unit or a register port

Example: only one write port

    lw  a addr
    add b c d

Page 31:

How to overcome

Structural hazards result from contention ⇒ they can be removed by adding more hardware resources

register write hazard: add more write ports

execution unit: add more execution units

Example: CDC 6600 (1963) 10 units, 4 write ports, only FP div not pipelined

Note:

more resources = more cost (area, power)

Page 32:

Superscalar processors

Introduction / overview

Page 33:

Pipelining - summary

Depth of pipeline - Superpipelining

further dividing pipeline stages increases frequency

but introduces more scope for hazards

and higher frequency means more power dissipated

Branches to different functional units - Scalar pipelining - avoids waiting for long operations to complete

instructions fetched and decoded in sequence

multiple operations executed in parallel

Concurrent issue of instructions - Superscalar ILP

multiple instructions fetched and decoded concurrently

new ordering issues and new data hazards

Page 34:

Scalar vs. superscalar

Scalar: in-order issue

Superscalar: concurrent issue, possibly out of order

Most “complex” general-purpose processors are superscalar

Page 35:

Instruction-level parallelism

The number of instructions issued per cycle is called issue parallelism / issue width

The IPC, the number of instructions executed per cycle, is limited by:

the issue width

the number of true dependencies

the number of branches in relation to other instructions

the latency of operations in conjunction with dependencies

Current microprocessors: a maximum ILP (issue width) of 4-8 and 12 functional units, yet a typical IPC of only 2-3

Page 36:

Challenges of superscalar execution

parallel fetch, decoding and issue

100s of instructions in-flight simultaneously

out-of-order execution and sequential consistency

Exceptions and false dependencies

finding parallelism and scheduling its execution

application specific engines, e.g. SIMD & prefetching

Page 37:

Fundamental limitations of superscalar designs

Power dissipation is one of the major issues in obtaining performance

particularly true with constraints on embedded systems

Concurrency can be used to reduce power and get performance

reduce frequency which in turn allows a voltage reduction

does not apply to pipelined concurrency where more concurrency increases power due to shared structures
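
The frequency-voltage argument above can be made quantitative with the usual dynamic-power approximation P ≈ α·C·V²·f (the specific numbers below are illustrative, not from the slides): halving f typically allows V to drop by roughly 30%, so power per core falls to about 0.5 x 0.7² ≈ 0.25 of the original. Two such cores then deliver roughly the original aggregate throughput at about half the power, provided the workload actually parallelises across them.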

Page 38:

OoO not scalable – what else?

It is clear that out-of-order issue is not scalable

the register file and issue logic scale as ILP^3 and ILP^2 respectively

A number of approaches have been followed to increase the utilisation of on-chip parallelism; these include:

VLIW processors

Speculative VLIW processors - Intel’s IA64 - EPIC

multi-core and multi-threaded processors

these are collectively called explicit concurrency

Page 39:

Explicit concurrency as the (a?) way forward

VLIW is the simplest form of explicit concurrency: it reduces hardware complexity, hence power requirements

drawbacks are code compatibility and intolerance to cache misses

Multiple instruction streams / multi-programming:

Hardware multithreading (HMT) provides more scalability on single cores

Multi-cores are even more scalable, best results combined with HMT

However, the software stack is not ready yet: there are still too many ad-hoc APIs/frameworks, and multi-programming still needs new programming models

The big question is: can we design general-purpose multi-cores?
