IS360 - High Performance Computing

Basavaraj Talawar, CSE, NITK

Course Syllabus

● Definition, RISC ISA, RISC Pipeline, Performance Quantification

● Instruction Level Parallelism

– Pipeline Hazards, Combating hazards, Scheduling, Branch Prediction, Superscalar Processors, Out-of-Order execution

● Cache Memory

● VLIW, Vector Processors

● Interconnection Networks

● Topics of current research

Course Structure

● Textbook

– Hennessy and Patterson, Computer Architecture, A Quantitative Approach, MK, 5ed.

● References

– Shen & Lipasti, Modern Processor Design

● About Course

– Quizzes – Week 5, Week 9, Week 13 – 30%

– Assignments – 20%

– Mini Project – 20%

– Final Exam – 30%

Course Objective

● Identify the trade-offs involved in designing a multiprocessor

● Why?

– Improve the execution time of programs to be executed on them.

Definition

● Computer Architecture

– Specific requirements of the target machine

– ISA design

– Cache and memory hierarchy

– I/O, storage, disk

– Multi-processors, networked systems

– Max performance, within constraints: cost, power, availability

Computer Architecture

Computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using manufacturing technologies.

Abstraction layers, from top to bottom:

Application
Algorithm
Programming Language
Operating System/Virtual Machines
Instruction Set Architecture
Microarchitecture
Register-Transfer Level
Gates
Circuits
Devices
Physics

David Wentzlaff, ELE 475 – Computer Architecture, Princeton University

[Figure: Moore's Law. Source: Wikipedia]

[Figure: Single processor performance over time, annotated with "RISC" and "Move to multi-processor". Source: Hennessy & Patterson, CA-QA, 5ed. MK, 2013]

Intel Sandy Bridge – Successor to i7

http://images.bit-tech.net/content_images/2011/01/intel-sandy-bridge-review/sandy-bridge-die-map.jpg

● Instruction Level Parallelism: Superscalar, Very Long Instruction Word (VLIW)

● Long Pipelines (Pipeline Parallelism)

● Advanced Memory and Caches

● Data Level Parallelism: Vector, GPU

● Thread Level Parallelism: Multithreading, Multiprocessor, Multicore, Manycore

Architecture vs. Microarchitecture

● Architecture

– Instruction Set Architecture

– Programmer visible state (Memory & Register)

– Operations (Instructions and how they work)

– Execution Semantics (interrupts)

– Input/Output

– Data Types/Sizes

● Microarchitecture/Organization:

– Tradeoffs on how to implement ISA for some metric (Speed, Energy, Cost)

– Examples: Pipeline depth, number of pipelines, cache size, silicon area, peak power, execution ordering, bus widths, ALU widths

Same Architecture, Different Microarchitectures

David Wentzlaff, ELE 475 – Computer Architecture, Princeton University

● AMD Athlon II X4

– X86 Instruction Set, Quad Core, Out-of-order, 2.9GHz, 125W

– Decode 3 Instructions/Cycle/Core

– 64KB L1 I Cache, 64KB L1 D Cache, 512KB L2 Cache

● Intel Atom

– X86 Instruction Set, Single Core, In-order, 1.6GHz, 2W

– Decode 2 Instructions/Cycle/Core

– 32KB L1 I Cache, 24KB L1 D Cache, 512KB L2 Cache

Trends in Technology

● Integrated circuit technology

– Transistor density: 35%/year

– Die size: 10-20%/year

– Integration overall: 40-55%/year

● DRAM capacity: 25-40%/year (slowing)

● Flash capacity: 50-60%/year

– 15-20X cheaper/bit than DRAM

● Magnetic disk technology: 40%/year

– 15-25X cheaper/bit than Flash

– 300-500X cheaper/bit than DRAM

Dynamic Energy & Power

● Dynamic Energy

– Capacitive load × Voltage²

● Dynamic power

– ½ × Capacitive load × Voltage² × Frequency switched
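The two relations above can be checked numerically. The following is a minimal sketch: the function names and the 15% voltage and frequency reduction are illustrative assumptions, not values from the slides. It shows that scaling voltage and frequency together reduces dynamic power by roughly the cube of the scaling factor.

```python
def dynamic_energy(capacitive_load, voltage):
    """Dynamic energy as stated on the slide: capacitive load x voltage^2."""
    return capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power: 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


# Hypothetical scenario: voltage and frequency are both reduced by 15%.
# Units are normalized to 1.0, so only the ratio is meaningful.
baseline_power = dynamic_power(1.0, 1.0, 1.0)
scaled_power = dynamic_power(1.0, 0.85, 0.85)
power_ratio = scaled_power / baseline_power  # equals 0.85 ** 3

print(f"Power after scaling: {power_ratio:.3f} of the original")
```

Because voltage enters squared, lowering the supply voltage is the most effective single lever here, which is why Dynamic Voltage-Frequency Scaling appears among the power reduction techniques.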

Reducing Power

● Techniques for reducing power:

– Do nothing well

– Dynamic Voltage-Frequency Scaling

– Low power state for DRAM, disks

– Overclocking, turning off cores

Static Power

● Static power consumption

– Current_static × Voltage

– Scales with number of transistors

– To reduce: Power Gating

Pipelining and Performance Recap

Operations and Operands

[Figure: a processor containing control, an ALU (inputs i1 and i2, output o), and a register file, connected to memory]

Machine Models

[Figure: four machine models, each built around an ALU: Stack (operands at the top of stack, TOS), Accumulator, Register-Memory, Register-Register]

C = A + B

Stack | Push A; Push B; Add; Pop C
Accumulator | Load A; Add B; Store C
Register-Memory | Load R1, A; Add R3, R1, B; Store R3, C
Register-Register | Load R1, A; Load R2, B; Add R3, R1, R2; Store R3, C
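To make the difference between the machine models concrete, here is a minimal sketch of two toy interpreters, one for the stack sequence and one for the register-register sequence of C = A + B. The tuple-based instruction representation and the function names are assumptions made for illustration. They do not model any real ISA.

```python
def run_stack(program, memory):
    """Interpret Push/Add/Pop on a stack machine; operands are implicit."""
    stack = []
    for op, *args in program:
        if op == "Push":
            stack.append(memory[args[0]])
        elif op == "Add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "Pop":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack op: {op}")
    return memory


def run_register_register(program, memory):
    """Interpret Load/Add/Store on a load-store machine; operands are explicit."""
    regs = {}
    for op, *args in program:
        if op == "Load":
            regs[args[0]] = memory[args[1]]
        elif op == "Add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "Store":
            memory[args[1]] = regs[args[0]]
        else:
            raise ValueError(f"unknown register op: {op}")
    return memory


stack_program = [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")]
register_program = [
    ("Load", "R1", "A"),
    ("Load", "R2", "B"),
    ("Add", "R3", "R1", "R2"),
    ("Store", "R3", "C"),
]

print(run_stack(stack_program, {"A": 2, "B": 3})["C"])
print(run_register_register(register_program, {"A": 2, "B": 3})["C"])
```

Both programs have four instructions, but only the register-register version names every operand explicitly, which is what lets a compiler reorder and reuse values held in registers.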

Addressing Modes

● Where do operands come from?

Mode | Instruction | Meaning
Register | Add R4, R3, R2 | Regs[R4] <- Regs[R3] + Regs[R2]
Immediate | Add R4, R3, #5 | Regs[R4] <- Regs[R3] + 5
Displacement | Add R4, R3, 100(R1) | Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
Register Indirect | Add R4, R3, (R1) | Regs[R4] <- Regs[R3] + Mem[Regs[R1]]
Absolute | Add R4, R3, (0x475) | Regs[R4] <- Regs[R3] + Mem[0x475]
Memory Indirect | Add R4, R3, @(R1) | Regs[R4] <- Regs[R3] + Mem[Mem[Regs[R1]]]
PC relative | Add R4, R3, 100(PC) | Regs[R4] <- Regs[R3] + Mem[100 + PC]
Scaled | Add R4, R3, 100(R1)[R5] | Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]
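The addressing modes listed above differ only in how the second source operand is located. The sketch below spells that out in one function. The function name, the dictionary-based memory, and all register and memory contents are hypothetical values chosen for illustration.

```python
def fetch_operand(mode, regs, mem, pc=0, reg=None, index=None, const=0, scale=4):
    """Return the second source operand for the given addressing mode."""
    if mode == "register":
        return regs[reg]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return mem[const + regs[reg]]
    if mode == "register_indirect":
        return mem[regs[reg]]
    if mode == "absolute":
        return mem[const]
    if mode == "memory_indirect":
        return mem[mem[regs[reg]]]
    if mode == "pc_relative":
        return mem[const + pc]
    if mode == "scaled":
        return mem[const + regs[reg] + regs[index] * scale]
    raise ValueError(f"unknown addressing mode: {mode}")


# Hypothetical machine state (addresses and values are made up).
regs = {"R1": 8, "R2": 7, "R5": 2}
mem = {8: 116, 108: 11, 116: 13, 140: 19, 0x475: 17}

print(fetch_operand("displacement", regs, mem, reg="R1", const=100))
print(fetch_operand("memory_indirect", regs, mem, reg="R1"))
print(fetch_operand("scaled", regs, mem, reg="R1", index="R5", const=100))
```

Note how memory indirect needs two memory reads while displacement needs one. That cost difference is one reason load-store ISAs keep only the simplest modes.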

ISA Encoding

● Fixed Width

– E.g., RISC architectures: MIPS, PowerPC, SPARC, ARM

● Variable Length

– E.g., CISC architectures: IBM 360, x86, Motorola 68K, VAX, …

● Mostly Fixed or Compressed

– E.g., MIPS16, THUMB (only two formats: 2 and 4 bytes)

● Very Long Instruction Words

– Multiple instructions in a fixed width bundle

– E.g., Multiflow, HP/ST Lx, TI C6000

Example – MIPS64 ISA

● RISC, load-store architecture, simple addressing

● 32-bit instructions, fixed format

● 32 64-bit GPRs, R0-R31, 32 64-bit FPRs, F0-F31

– R0 is hardwired to 0.

– FPRs can also hold 32-bit floats (with the other ½ unused).

– “SIMD” extensions operate on more floats in 1 FPR

● A few special registers

– Floating-point status register

● Load/store 8-, 16-, 32-, 64-bit integers

– Sign-extended (or zero-extended by the unsigned loads LBU, LHU, LWU) to fill the 64-bit GPR

– Also 32-bit floats and 64-bit doubles

MIPS64 Addressing Modes

● Register (Arithmetic, Logical ops only)

● Immediate (Arithmetic, Logical) & Displacement (load/stores only)

– 16-bit immediate/offset field

– Register indirect: use 0 as displacement offset

– Direct (absolute): use R0 as displacement base

● Byte-addressed memory, 64-bit address

● Software-settable big-endian/little-endian flag

● Alignment required

MIPS Instructions

Data Transfer Instructions

Type | Opcode/Mnemonic | Examples
Load | LB, LBU, LH, LHU, LW, LWU, LD, L.S, L.D | LD R1, 30(R2); L.S F0, 50(R3)
Store | SB, SH, SW, SD, S.S, S.D | SH R3, 502(R2); SB R2, R1(R3)
Move | MOV.S, MOV.D | MOV.S F2, F3

Arithmetic/Logical Instructions

Type | Opcode/Mnemonic | Examples
Add, Subtract, Multiply, Divide, … | DADD, DADDI, DSUB, DMUL, DDIV, AND, OR, XOR, LUI, DSLL, SLT | DADDU R1, R2, R3; LUI R1, #43; SLT R1, R2, R3

Control Instructions

Type | Opcode/Mnemonic | Examples
Branch, Jump, Control transfer | BEQZ, BNEZ, BEQ, BNE, J, JR, JAL, JALR, TRAP, ERET | J label; BEQ R1, R2, label; MOVZ R1, R2, R3

Floating Point

Type | Opcode/Mnemonic
FP Arithmetic | ADD.D, SUB.D, MUL.D, MADD.S

MIPS Instruction Formats

● R-type

op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

● I-type

op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

● J-type

op (6 bits) | offset added to PC (26 bits)
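The fixed field widths above make encoding and decoding a matter of shifts and masks. The following is a minimal sketch with hypothetical helper names `encode` and `decode`. The field values in the example are an assumption chosen for illustration: op = 0, rs = 17, rt = 18, rd = 8, shamt = 0, funct = 0x20, which matches the familiar MIPS32 encoding of `add $t0, $s1, $s2`.

```python
# Field layouts, most significant field first: (name, width in bits).
R_FIELDS = [("op", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6)]
I_FIELDS = [("op", 6), ("rs", 5), ("rt", 5), ("immediate", 16)]
J_FIELDS = [("op", 6), ("offset", 26)]


def encode(fields, values):
    """Pack named field values into a single 32-bit instruction word."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(fields, word):
    """Unpack a 32-bit instruction word into named field values."""
    values = {}
    shift = sum(width for _, width in fields)
    for name, width in fields:
        shift -= width
        values[name] = (word >> shift) & ((1 << width) - 1)
    return values


example = {"op": 0, "rs": 17, "rt": 18, "rd": 8, "shamt": 0, "funct": 0x20}
word = encode(R_FIELDS, example)
print(f"{word:#010x}")
print(decode(R_FIELDS, word))
```

Because op, rs, and rt sit at the same bit positions in the R-type and I-type formats, the register file can be read before the opcode is fully decoded. This is the "fixed field decoding" exploited in the ID stage.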

Implementation of RISC ISA - 1

● Instruction Fetch (IF)

IR <- Mem[PC]

NPC <- PC + 4

[Figure: the PC addresses the instruction memory, whose output is latched in IR; an adder computes PC + 4 into NPC]

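Implementation of RISC ISA - 2

● Instruction Decode/Register Fetch (ID)

A <- Regs[rs]

B <- Regs[rt]

Imm <- sign-extended immediate field of IR

[Figure: the rs, rt, and rd fields of IR index the register file, which is read into A and B; the 16-bit immediate is sign-extended to 32 bits into Imm]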

Implementation of RISC ISA - 3

● Execution/Effective Address (EX)

Memory Reference:

ALUOutput <- A + Imm

Register-Register and Register-Immediate Instructions:

ALUOutput <- A func B

ALUOutput <- A func Imm

[Figure: a MUX selects B or Imm as the second ALU input; the ALU result is latched in ALUOutput]

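Implementation of RISC ISA – 3 (cont)

● Execution/Effective Address (EX)

Memory Reference:

ALUOutput <- A + Imm

Register-Register and Register-Immediate Instructions:

ALUOutput <- A func B

ALUOutput <- A func Imm

Branch Instruction:

ALUOutput <- NPC + (Imm << 2)

Cond <- (A == 0)

[Figure: MUXes select the ALU inputs from NPC, A, B, and Imm; the ALU result is latched in ALUOutput, and a Zero? test on A sets Cond]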

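Implementation of RISC ISA - 4

● Memory Access/Branch Completion (MEM)

Memory Reference:

LMD <- Mem[ALUOutput] (load)

Mem[ALUOutput] <- B (store)

Branch:

if (Cond) PC <- ALUOutput

[Figure: ALUOutput addresses the data memory, which is read into LMD or written from B; a MUX controlled by Cond selects NPC or ALUOutput as the next PC]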

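Implementation of RISC ISA - 5

● Write back (WB)

Register-Register ALU Instruction:

Regs[rd] <- ALUOutput

Register-Immediate ALU Instruction:

Regs[rt] <- ALUOutput

Load Instruction:

Regs[rt] <- LMD

[Figure: a MUX selects ALUOutput or LMD as the value written to the register file]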

Implementation of RISC ISA - Stages

● Instruction Fetch (IF)

● Instruction Decode/Register Fetch (ID)

– Fixed field decoding

● Execution/Effective address (EX)

● Memory Access (MEM)

● Write back (WB)
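The five stages can be traced in software by applying the register transfers of each stage in order. The following is a minimal sketch of an unpipelined interpreter, not a model of the actual hardware. Instructions are represented as Python dictionaries rather than 32-bit words, and the `kind` labels, the `step` and `run` names, and the small program are all assumptions made for illustration.

```python
import operator


def step(state, imem, dmem):
    """Push one instruction through IF, ID, EX, MEM, WB (unpipelined)."""
    regs = state["regs"]

    # IF: IR <- Mem[PC]; NPC <- PC + 4
    ir = imem[state["pc"]]
    npc = state["pc"] + 4

    # ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- immediate field of IR
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)

    # EX: effective address, ALU operation, or branch target and condition
    kind, cond = ir["kind"], False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu_rr":
        alu_output = ir["func"](a, b)
    elif kind == "alu_ri":
        alu_output = ir["func"](a, imm)
    elif kind == "branch":  # branch if Regs[rs] == 0
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

    # MEM: memory access and branch completion
    lmd = None
    if kind == "load":
        lmd = dmem[alu_output]
    elif kind == "store":
        dmem[alu_output] = b
    state["pc"] = alu_output if cond else npc

    # WB: write the result to the register file
    if kind == "alu_rr":
        regs[ir["rd"]] = alu_output
    elif kind == "alu_ri":
        regs[ir["rt"]] = alu_output
    elif kind == "load":
        regs[ir["rt"]] = lmd
    regs[0] = 0  # R0 stays hardwired to 0


def run(imem, dmem):
    """Execute from PC = 0 until the PC leaves the instruction memory."""
    state = {"pc": 0, "regs": [0] * 32}
    while state["pc"] in imem:
        step(state, imem, dmem)
    return state


# Hypothetical program: C = A + B with A at address 0, B at 8, C at 16.
imem = {
    0: {"kind": "load", "rs": 0, "rt": 1, "imm": 0},
    4: {"kind": "load", "rs": 0, "rt": 2, "imm": 8},
    8: {"kind": "alu_rr", "func": operator.add, "rs": 1, "rt": 2, "rd": 3},
    12: {"kind": "store", "rs": 0, "rt": 3, "imm": 16},
    16: {"kind": "branch", "rs": 0, "imm": 1},
}
dmem = {0: 5, 8: 7}
final = run(imem, dmem)
print(dmem[16], final["pc"])
```

Each instruction occupies all five stages before the next one starts, so this sketch corresponds to the unpipelined case. A pipelined implementation would overlap the stages of successive instructions.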

MIPS Datapath

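[Figure: MIPS datapath divided into five stages. Instruction Fetch: PC, +4 adder, instruction memory (IM), NPC, IR. Instruction Decode/Register Fetch: register file (rs, rt, rd), sign extend (16 to 32 bits), A, B, Imm. Execute/Address Calculation: MUXes, ALU, Zero? test, Cond, ALUOutput. Memory Access: data memory (DM), LMD, MUX. Write Back: MUX.]

IF ID EX MEM WB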

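MIPS Pipeline

[Figure: pipelined MIPS datapath with stages IF, ID, EX, MEM, WB. Source: Hennessy & Patterson, CA-QA, Appendix C, 5ed. MK, 2013]

MIPS Pipeline

Time (clock cycles)

Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
i1 | IF | ID | EX | MEM | WB | | | |
i2 | | IF | ID | EX | MEM | WB | | |
i3 | | | IF | ID | EX | MEM | WB | |
i4 | | | | IF | ID | EX | MEM | WB |
... | | | | | | | | |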

Example: When will i1000 complete? What is the average number of clock cycles per instruction? If the processor were not pipelined, when would i1000 complete, and what would the average number of clock cycles per instruction be? Which is faster?
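One way to work through this example is to count cycles directly. The sketch below assumes an ideal five-stage pipeline with no stalls, and an unpipelined processor that spends five cycles on every instruction at the same clock period. Both assumptions and the function names are mine, since the slide does not state them.

```python
def pipelined_completion_cycle(n, depth=5):
    """Cycle in which instruction n finishes: the first takes `depth`
    cycles, and each later one completes one cycle after its predecessor."""
    return depth + (n - 1)


def unpipelined_completion_cycle(n, cycles_per_instruction=5):
    """Cycle in which instruction n finishes when instructions do not overlap."""
    return n * cycles_per_instruction


n = 1000
pipelined_cycles = pipelined_completion_cycle(n)
unpipelined_cycles = unpipelined_completion_cycle(n)

print(f"Pipelined:   i{n} completes in cycle {pipelined_cycles}, "
      f"CPI = {pipelined_cycles / n:.3f}")
print(f"Unpipelined: i{n} completes in cycle {unpipelined_cycles}, "
      f"CPI = {unpipelined_cycles / n:.3f}")
print(f"Speedup = {unpipelined_cycles / pipelined_cycles:.2f}")
```

Under these assumptions the pipelined CPI approaches 1 as the instruction count grows, so the speedup approaches the pipeline depth of 5.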

Measuring Performance

● Metrics: Response Time, Throughput

● Speedup of X relative to Y

$$n = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$$

● Execution time

– Wall clock time: includes all system overheads

– CPU time: only computation time

Benchmarks

● Kernels (e.g. Matrix Multiply), Toy programs (e.g. Sorting), Synthetic benchmarks (e.g. Dhrystone)

● Server Benchmarks – SPECWeb, SPECFS, SPECjbb, SPECvirt_sc

● Embedded Systems Benchmarks – EEMBC, Dhrystone

● Database Server Benchmarks – TPC

● Desktop Benchmarks – SPECint, SPECfp, SPECpower.

– CINT2006: perlbench, bzip2, gcc, sjeng, libquantum, h264ref, etc.

– CFP2006: bwaves, gamess, zeusmp, leslie3d, povray, calculix, lbm, wrf, sphinx3

www.spec.org

Measuring Performance

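$$\text{SPECRatio}_A = \frac{\text{ExecutionTime}_{reference}}{\text{ExecutionTime}_A}$$

$$1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\dfrac{\text{ExecutionTime}_{reference}}{\text{ExecutionTime}_A}}{\dfrac{\text{ExecutionTime}_{reference}}{\text{ExecutionTime}_B}} = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \frac{\text{Performance}_A}{\text{Performance}_B}$$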
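The derivation shows that the reference machine cancels out when two SPECRatios are compared. The following is a small numerical check. The execution times are hypothetical values chosen so that the ratio comes out to 1.25, and the function name is an assumption.

```python
def spec_ratio(reference_time, measured_time):
    """SPECRatio: reference execution time divided by measured execution time."""
    return reference_time / measured_time


# Hypothetical execution times in seconds.
time_a, time_b = 400.0, 500.0

for reference_time in (1000.0, 2000.0):
    ratio_a = spec_ratio(reference_time, time_a)
    ratio_b = spec_ratio(reference_time, time_b)
    print(f"reference={reference_time:.0f}s: "
          f"SPECRatio_A/SPECRatio_B = {ratio_a / ratio_b:.2f}")

relative_performance = time_b / time_a  # Performance_A / Performance_B
```

Both choices of reference time give the same quotient, so machine A is 1.25 times faster than machine B regardless of which reference machine is used.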

Measuring Performance

● Processor counters

– Instructions executed, Clock cycles completed

● Profile based, static modeling

– H/w counters, Code instrumentation, Interpreting the program at instruction level

● Trace-driven simulation

– Memory references and instruction addresses

● Execution-driven simulation

– Pipeline activity, data and instruction references

Amdahl's Law

● What is the overall speedup by enhancing the performance of a single block?

● Speedup_enhanced (always > 1)

● Fraction_enhanced (always ≤ 1)

$$\text{Speedup}_{enhanced} = \frac{\text{ExecutionTime}_{original}}{\text{ExecutionTime}_{enhancement}}$$

[Figure: program execution timelines, original and enhanced; the FP arithmetic portions are shorter in the enhanced version]

Amdahl's Law

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

$$\text{ExecutionTime}_{new} = \text{ExecutionTime}_{old} \times \left( (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right)$$

In the example, FP arithmetic was used for 50% of the program. The new FP arithmetic block was 3 times faster than the previous version. What is the new performance number?

Objective: Make the program 10 times faster. Say, 25% of the program is waiting in I/O and cannot be enhanced. How much should the speedup of the enhanced computer be?
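Both exercises follow from the execution time equation, since the overall speedup is ExecutionTime_old / ExecutionTime_new. The sketch below uses hypothetical function names. It assumes that in the second exercise the remaining 75% of the program is the part that can be enhanced.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def required_speedup(target_overall, fraction_enhanced):
    """Speedup_enhanced needed to reach a target overall speedup,
    or None if the target exceeds the limit 1 / (1 - F)."""
    remaining = 1.0 / target_overall - (1.0 - fraction_enhanced)
    if remaining <= 0:
        return None
    return fraction_enhanced / remaining


# Exercise 1: FP arithmetic is 50% of the program and becomes 3 times faster.
fp_speedup = overall_speedup(0.5, 3.0)
print(f"Overall speedup: {fp_speedup:.2f}")

# Exercise 2: target 10x overall, with 25% (I/O wait) that cannot be enhanced.
needed = required_speedup(10.0, 0.75)
limit = 1.0 / (1.0 - 0.75)
print(f"Required speedup: {needed}, upper bound on overall speedup: {limit:.1f}")
```

The first exercise gives an overall speedup of 1.5. The second has no solution: even an infinitely fast enhancement leaves the 25% I/O portion untouched, so the overall speedup can never exceed 4.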

Processor Performance Equation

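$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time}$$

$$\text{CPU clock cycles} = \sum_{i=1}^{n} IC_i \times CPI_i$$

$$CPI = \frac{\text{CPU clock cycles for a program}}{\text{Instruction Count}}$$

$$CPI = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction Count}} \times CPI_i$$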
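These equations can be applied directly to an instruction mix. The following is a minimal sketch: the function names, the instruction counts, the per-class CPI values, and the 1 ns clock cycle are hypothetical numbers chosen for illustration.

```python
def cpu_clock_cycles(mix):
    """Sum of IC_i x CPI_i over all instruction classes."""
    return sum(count * cpi for count, cpi in mix)


def average_cpi(mix):
    """CPU clock cycles divided by the total instruction count."""
    instruction_count = sum(count for count, _ in mix)
    return cpu_clock_cycles(mix) / instruction_count


def cpu_time(mix, clock_cycle_seconds):
    """Instruction count x CPI x clock cycle time, in seconds."""
    return cpu_clock_cycles(mix) * clock_cycle_seconds


# Hypothetical mix: (instruction count IC_i, cycles per instruction CPI_i).
mix = [(500, 1), (300, 2), (200, 3)]

print(f"CPU clock cycles: {cpu_clock_cycles(mix)}")
print(f"Average CPI: {average_cpi(mix):.2f}")
print(f"CPU time at 1 ns per cycle: {cpu_time(mix, 1e-9):.2e} s")
```

The two CPI formulas agree: dividing the total cycle count by the instruction count gives the same result as weighting each class CPI by its share of the instruction count.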

Pipeline Performance

An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?

Average Instruction Execution time = Clock cycle * Average CPI
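The exercise can be worked through with this relation. The sketch below assumes that the pipelined processor ideally completes one instruction per cycle (CPI of 1), with the cycle lengthened by the 0.2 ns overhead. That assumption is not stated on the slide, and the variable names are illustrative.

```python
# Instruction mix from the exercise: (relative frequency, cycles).
mix = [
    (0.40, 4),  # ALU operations
    (0.20, 4),  # branches
    (0.40, 5),  # memory operations
]

unpipelined_clock_ns = 1.0
pipeline_overhead_ns = 0.2

# Unpipelined: average instruction time = clock cycle x average CPI.
average_cpi = sum(frequency * cycles for frequency, cycles in mix)
unpipelined_time_ns = unpipelined_clock_ns * average_cpi

# Pipelined (assumed ideal CPI of 1): one instruction per lengthened cycle.
pipelined_time_ns = unpipelined_clock_ns + pipeline_overhead_ns

speedup = unpipelined_time_ns / pipelined_time_ns

print(f"Average CPI (unpipelined): {average_cpi:.1f}")
print(f"Average instruction time: {unpipelined_time_ns:.1f} ns unpipelined, "
      f"{pipelined_time_ns:.1f} ns pipelined")
print(f"Speedup from pipelining: {speedup:.2f}")
```

Under these assumptions the average unpipelined instruction takes 4.4 ns against 1.2 ns when pipelined, a speedup of about 3.7.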

Pipeline Performance

$$\text{Speedup}_{pipelining} = \frac{\text{Avg. instruction time unpipelined}}{\text{Avg. instruction time pipelined}}$$

$$\text{Speedup}_{pipelining} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth}$$
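The second formula shows how stalls erode the ideal speedup. The following is a minimal sketch with a hypothetical function name and hypothetical stall rates. It takes the formula at face value, which implicitly treats the ideal pipelined CPI as 1.

```python
def pipelining_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


# Hypothetical 5-stage pipeline under different average stall rates.
for stalls in (0.0, 0.25, 1.0):
    print(f"stalls/instruction = {stalls:.2f}: "
          f"speedup = {pipelining_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth. An average of one stall cycle per instruction halves it.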
