is360 - high performance computing - basavaraj talawar · in the example, fp arithmetic was used...

IS360 - High Performance Computing Basavaraj Talawar CSE, NITK


Post on 20-Mar-2020


IS360 - High Performance Computing

Basavaraj Talawar, CSE, NITK

Course Syllabus
● Definition, RISC ISA, RISC Pipeline, Performance Quantification
● Instruction Level Parallelism
– Pipeline Hazards, Combating hazards, Scheduling, Branch Prediction, Superscalar Processors, Out-of-Order execution
● Cache Memory
● VLIW, Vector Processors
● Interconnection Networks
● Topics of current research

Course Structure
● Textbook
– Hennessy and Patterson, Computer Architecture, A Quantitative Approach, MK, 5ed.
● References
– Shen & Lipasti, Modern Processor Design
● About Course
– Quizzes – Week 5, Week 9, Week 13 – 30%
– Assignments – 20%
– Mini Project – 20%
– Final Exam – 30%

Course Objective
● Identify the trade-offs involved in designing a multiprocessor
● Why?
– Improve the execution time of programs to be executed on them.

Definition
● Computer Architecture

– Specific requirements of the target machine

– ISA design

– Cache and memory hierarchy

– I/O, storage, disk

– Multi-processors, networked systems

– Max performance, within constraints: cost, power, availability

Computer Architecture

Computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using manufacturing technologies.

Abstraction layers, in the order shown on the slide:
● Application
● Algorithm
● Programming Language
● Operating System/Virtual Machines
● Instruction Set Architecture
● Microarchitecture
● Register-Transfer Level
● Gates
● Circuits
● Devices
● Physics

Source: David Wentzlaff, ELE 475 – Computer Architecture, Princeton University

[Figure: Moore's Law. Source: Wikipedia]

Single Processor Performance

[Figure: single processor performance, annotated with "RISC" and "Move to multi-processor". Source: Hennessy & Patterson, CA-QA, 5ed. MK, 2013]


Intel Sandy Bridge – Successor to i7

http://images.bit-tech.net/content_images/2011/01/intel-sandy-bridge-review/sandy-bridge-die-map.jpg

● Instruction Level Parallelism: Superscalar, Very Long Instruction Word (VLIW)

● Long Pipelines (Pipeline Parallelism)

● Advanced Memory and Caches

● Data Level Parallelism: Vector, GPU

● Thread Level Parallelism: Multithreading, Multiprocessor, Multicore, Manycore

Architecture vs. Microarchitecture
● Architecture
– Instruction Set Architecture
– Programmer visible state (Memory & Register)
– Operations (Instructions and how they work)
– Execution Semantics (interrupts)
– Input/Output
– Data Types/Sizes
● Microarchitecture/Organization:
– Tradeoffs on how to implement ISA for some metric (Speed, Energy, Cost)
– Examples: Pipeline depth, number of pipelines, cache size, silicon area, peak power, execution ordering, bus widths, ALU widths

Same Architecture, Different Microarchitectures
● AMD Athlon II X4
– X86 Instruction Set, Quad Core, Out-of-order, 2.9GHz, 125W
– Decode 3 Instructions/Cycle/Core
– 64KB L1 I Cache, 64KB L1 D Cache, 512KB L2 Cache
● Intel Atom
– X86 Instruction Set, Single Core, In-order, 1.6GHz, 2W
– Decode 2 Instructions/Cycle/Core
– 32KB L1 I Cache, 24KB L1 D Cache, 512KB L2 Cache

Source: David Wentzlaff, ELE 475 – Computer Architecture, Princeton University

Trends in Technology
● Integrated circuit technology
– Transistor density: 35%/year
– Die size: 10-20%/year
– Integration overall: 40-55%/year
● DRAM capacity: 25-40%/year (slowing)
● Flash capacity: 50-60%/year
– 15-20X cheaper/bit than DRAM
● Magnetic disk technology: 40%/year
– 15-25X cheaper/bit than Flash
– 300-500X cheaper/bit than DRAM

Dynamic Energy & Power
● Dynamic Energy
– Capacitive load × Voltage²
● Dynamic power
– ½ × Capacitive load × Voltage² × Frequency switched
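A minimal sketch of how the two relations above respond to voltage and frequency scaling. The function names and the 15% scaling scenario are illustrative assumptions, not values from the slides; the capacitive load is normalised to 1.

```python
def dynamic_energy(cap_load, voltage):
    """Dynamic energy: capacitive load x voltage^2."""
    return cap_load * voltage ** 2


def dynamic_power(cap_load, voltage, freq_switched):
    """Dynamic power: 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * cap_load * voltage ** 2 * freq_switched


# Hypothetical scenario: voltage and frequency are both reduced by 15%.
cap, v, f = 1.0, 1.0, 1.0
energy_ratio = dynamic_energy(cap, 0.85 * v) / dynamic_energy(cap, v)
power_ratio = dynamic_power(cap, 0.85 * v, 0.85 * f) / dynamic_power(cap, v, f)

print(f"energy ratio: {energy_ratio:.2f}")
print(f"power ratio:  {power_ratio:.2f}")
```

Because voltage enters squared, lowering it reduces energy more than proportionally, and power falls further still once frequency drops with it.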

Reducing Power
● Techniques for reducing power:
– Do nothing well
– Dynamic Voltage-Frequency Scaling
– Low power state for DRAM, disks
– Overclocking, turning off cores

Static Power
● Static power consumption
– Current_static × Voltage
– Scales with number of transistors
– To reduce: Power Gating


Pipelining and Performance Recap

Operations and Operands

[Figure: a PROCESSOR containing Control, an ALU (inputs i1, i2; output o) and a Register File, connected to Memory]

Machine Models

[Figure: four machine models, each built around an ALU: STACK (operands at the top of stack, TOS), ACCUMULATOR, REGISTER-MEMORY, REGISTER-REGISTER]

C = A + B

STACK
  Push A
  Push B
  Add
  Pop C

ACCUMULATOR
  Load A
  Add B
  Store C

REGISTER-MEMORY
  Load R1, A
  Add R3, R1, B
  Store R3, C

REGISTER-REGISTER
  Load R1, A
  Load R2, B
  Add R3, R1, R2
  Store R3, C
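The difference between the models is where operands live. The sketch below interprets the stack and register-register sequences for C = A + B; the tuple-based program encoding and the function names are hypothetical, chosen only for illustration.

```python
def run_stack(program, memory):
    """Stack machine: operands are implicit, taken from the top of stack."""
    stack = []
    for op, *args in program:
        if op == "Push":
            stack.append(memory[args[0]])
        elif op == "Add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "Pop":
            memory[args[0]] = stack.pop()
    return memory


def run_register_register(program, memory):
    """Load-store machine: only Load and Store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "Load":
            regs[args[0]] = memory[args[1]]
        elif op == "Add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "Store":
            memory[args[1]] = regs[args[0]]
    return memory


stack_prog = [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")]
reg_prog = [("Load", "R1", "A"), ("Load", "R2", "B"),
            ("Add", "R3", "R1", "R2"), ("Store", "R3", "C")]

print(run_stack(stack_prog, {"A": 2, "B": 3})["C"])
print(run_register_register(reg_prog, {"A": 2, "B": 3})["C"])
```

Both programs compute the same result; the stack version names no operands in Add, while the register-register version names three.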

Addressing Modes
● Where do operands come from?

Mode               Instruction               Meaning
Register           Add R4, R3, R2            Regs[R4] <- Regs[R3] + Regs[R2]
Immediate          Add R4, R3, #5            Regs[R4] <- Regs[R3] + 5
Displacement       Add R4, R3, 100(R1)       Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
Register Indirect  Add R4, R3, (R1)          Regs[R4] <- Regs[R3] + Mem[Regs[R1]]
Absolute           Add R4, R3, (0x475)       Regs[R4] <- Regs[R3] + Mem[0x475]
Memory Indirect    Add R4, R3, @(R1)         Regs[R4] <- Regs[R3] + Mem[Mem[Regs[R1]]]
PC relative        Add R4, R3, 100(PC)       Regs[R4] <- Regs[R3] + Mem[100 + PC]
Scaled             Add R4, R3, 100(R1)[R5]   Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]
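The addressing modes differ only in how the last operand is located. A minimal sketch, assuming a dict-based register file and sparse memory; the function name `fetch_operand`, its parameters, and the machine state are all hypothetical.

```python
def fetch_operand(mode, regs, mem, pc, *, reg=None, index=None, const=0, scale=4):
    """Return the value of the last source operand for the given mode."""
    if mode == "register":
        return regs[reg]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return mem[const + regs[reg]]
    if mode == "register_indirect":
        return mem[regs[reg]]
    if mode == "absolute":
        return mem[const]
    if mode == "memory_indirect":
        return mem[mem[regs[reg]]]
    if mode == "pc_relative":
        return mem[const + pc]
    if mode == "scaled":
        return mem[const + regs[reg] + regs[index] * scale]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical machine state; memory maps address -> value.
regs = {"R1": 8, "R2": 7, "R3": 10, "R5": 2}
mem = {8: 116, 108: 40, 116: 50, 120: 70, 0x475: 60}
pc = 20

# Add R4, R3, 100(R1): Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]
regs["R4"] = regs["R3"] + fetch_operand("displacement", regs, mem, pc,
                                        reg="R1", const=100)
print(regs["R4"])
```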

ISA Encoding
● Fixed Width
– Eg.: RISC Architectures: MIPS, PowerPC, SPARC, ARM
● Variable Length
– Eg.: CISC Architectures: IBM 360, x86, Motorola 68K, VAX, …
● Mostly Fixed or Compressed
– Eg.: MIPS16, THUMB (only two formats: 2 and 4 bytes)
● Very Long Instruction Words
– Multiple instructions in a fixed width bundle
– Eg.: Multiflow, HP/ST Lx, TI C6000

Example – MIPS64 ISA
● RISC, load-store architecture, simple addressing
● 32-bit instructions, fixed format
● 32 64-bit GPRs, R0-R31, 32 64-bit FPRs, F0-F31
– R0 is hardwired to 0.
– FPRs can hold 32-bit floats also (with other ½ unused).
– “SIMD” extensions operate on more floats in 1 FPR
● A few special registers
– Floating-point status register
● Load/store 8-, 16-, 32-, 64-bit integers
– Loads narrower than 64 bits are sign-extended to fill the 64-bit GPR (zero-extended for the unsigned variants LBU, LHU, LWU)
– Also 32-bit floats and 64-bit doubles

MIPS64 Addressing Modes
● Register (Arithmetic, Logical ops only)
● Immediate (Arithmetic, Logical) & Displacement (load/stores only)
– 16-bit immediate/offset field
– Register indirect: use 0 as displacement offset
– Direct (absolute): use R0 as displacement base
● Byte-addressed memory, 64-bit address
● Software-settable big-endian/little-endian flag
● Alignment required

MIPS Instructions

Data Transfer Instructions
– Load: LB, LBU, LH, LHU, LW, LWU, LD, L.S, L.D
  Examples: LD R1, 30(R2); L.S F0, 50(R3)
– Store: SB, SH, SW, SD, S.S, S.D
  Examples: SH R3, 502(R2); SB R2, R1(R3)
– Move: MOV.S, MOV.D
  Example: MOV.S F2, F3

Arithmetic/Logical Instructions
– Add, Subtract, Multiply, Divide, …: DADD, DADDI, DSUB, DMUL, DDIV, AND, OR, XOR, LUI, DSLL, SLT
  Examples: DADDU R1, R2, R3; LUI R1, #43; SLT R1, R2, R3

Control Instructions
– Branch, Jump, Control transfer: BEQZ, BNEZ, BEQ, BNE, J, JR, JAL, JALR, TRAP, ERET
  Examples: J label; BEQ R1, R2, label; MOVZ R1, R2, R3

Floating Point
– FP Arithmetic: ADD.D, SUB.D, MUL.D, MADD.S

MIPS Instruction Formats
● R-type
  | op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits) |
● I-type
  | op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits) |
● J-type
  | op (6 bits) | offset added to PC (26 bits) |
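Since every format is 32 bits wide with the fields at fixed positions, packing and unpacking reduce to shifts and masks. A minimal sketch of the R-type and I-type layouts above; the function names and the field values in the usage lines are illustrative assumptions.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack R-type fields: 6 + 5 + 5 + 5 + 5 + 6 = 32 bits."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, immediate):
    """Pack I-type fields: 6 + 5 + 5 + 16 = 32 bits."""
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


def decode_r(word):
    """Unpack a 32-bit word into (op, rs, rt, rd, shamt, funct)."""
    return ((word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F,
            (word >> 11) & 0x1F, (word >> 6) & 0x1F, word & 0x3F)


# Illustrative field values (not taken from the slides).
fields = (0, 17, 18, 8, 0, 32)
word = encode_r(*fields)
print(f"{word:#010x}")
print(decode_r(word))
```

Fixed field positions are what make the "fixed field decoding" in the ID stage possible: rs and rt can be read before the opcode is fully decoded.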

Implementation of RISC ISA - 1
● Instruction Fetch (IF)
– IR ← Mem[PC]
– NPC ← PC + 4

[Figure: PC feeding the Instruction Memory and an adder (+4); outputs IR and NPC]

Implementation of RISC ISA - 2
● Instruction Decode/Register Fetch (ID)
– A ← Regs[rs]
– B ← Regs[rt]
– Imm ← Sign-extended immediate field of IR

[Figure: IR fields rs, rt, rd index the Registers to produce A and B; Sign Extend widens the immediate from 16 to 32 bits to produce Imm]

Implementation of RISC ISA - 3
● Execution/Effective Address (EX)
– Memory Reference: ALUOutput ← A + Imm
– Register-Register Instructions: ALUOutput ← A func B
– Register-Immediate Instructions: ALUOutput ← A func Imm

[Figure: ALU with inputs A and a MUX selecting B or Imm; output ALUOutput]

Implementation of RISC ISA – 3 (cont)
● Execution/Effective Address (EX)
– Branch Instruction: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0)

[Figure: the EX datapath extended with a MUX selecting NPC or A as the first ALU input, and a Zero? test on A producing Cond]

Implementation of RISC ISA - 4
● Memory Access/Branch Completion (MEM)
– Memory Reference: LMD ← Mem[ALUOutput] (load) or Mem[ALUOutput] ← B (store)
– Branch: if (Cond) PC ← ALUOutput

[Figure: Data Memory addressed by ALUOutput, written with B, read into LMD; a MUX controlled by Cond selects NPC or ALUOutput for PC]

Implementation of RISC ISA - 5
● Write back (WB)
– Register-Register Instructions: Regs[rd] ← ALUOutput
– Register-Immediate Instructions: Regs[rt] ← ALUOutput
– Load Instruction: Regs[rt] ← LMD

[Figure: a MUX selecting ALUOutput or LMD to write into the Registers]

Implementation of RISC ISA - Stages
● Instruction Fetch (IF)
● Instruction Decode/Register Fetch (ID)
– Fixed field decoding
● Execution/Effective address (EX)
● Memory Access (MEM)
● Write back (WB)
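The five stages can be traced in software by following the register transfers (IR, NPC, A, B, Imm, ALUOutput, Cond, LMD) one instruction at a time. The sketch below is a simplified, unpipelined model under stated assumptions: instructions are hypothetical Python dicts rather than encoded words, the ALU function is passed in as a callable, and PC falls through to NPC when no branch is taken.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute(instr, state):
    """Run one instruction through IF, ID, EX, MEM and WB in order."""
    regs, dmem = state["regs"], state["dmem"]
    kind = instr["kind"]

    # IF: the fetch itself is modelled by receiving `instr`; NPC <- PC + 4.
    npc = state["pc"] + 4

    # ID: A <- Regs[rs], B <- Regs[rt], Imm <- sign-extended immediate.
    a = regs[instr.get("rs", 0)]
    b = regs[instr.get("rt", 0)]
    imm = sign_extend16(instr.get("imm", 0))

    # EX: ALU operation, effective address, or branch target and condition.
    cond = False
    if kind == "reg_reg":
        alu_output = instr["func"](a, b)
    elif kind == "reg_imm":
        alu_output = instr["func"](a, imm)
    elif kind in ("load", "store"):
        alu_output = a + imm
    else:  # branch, taken when A == 0
        alu_output = npc + (imm << 2)
        cond = (a == 0)

    # MEM: data memory access and branch completion.
    lmd = None
    state["pc"] = alu_output if cond else npc
    if kind == "load":
        lmd = dmem[alu_output]
    elif kind == "store":
        dmem[alu_output] = b

    # WB: write the result back to the register file.
    if kind == "reg_reg":
        regs[instr["rd"]] = alu_output
    elif kind == "reg_imm":
        regs[instr["rt"]] = alu_output
    elif kind == "load":
        regs[instr["rt"]] = lmd
    regs[0] = 0  # R0 is hardwired to 0
    return state


state = {"pc": 0, "regs": [0] * 32, "dmem": {104: 7}}
state["regs"][2] = 100

def add(x, y):
    return x + y

execute({"kind": "load", "rs": 2, "rt": 1, "imm": 4}, state)                 # LD R1, 4(R2)
execute({"kind": "reg_imm", "rs": 1, "rt": 3, "imm": 5, "func": add}, state)  # DADDI R3, R1, #5
print(state["regs"][1], state["regs"][3], state["pc"])
```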

MIPS Datapath

[Figure: MIPS datapath across the five stages: Instruction Fetch (IF), Instruction Decode/Register Fetch (ID), Execute/Address Calculation (EX), Memory Access (MEM), Write Back (WB). Components shown: PC, adder (+4), instruction memory (IM), NPC, IR, Regs (rs, rt, rd), Sign Extend (16 → 32), A, B, Imm, ALU, ALUOutput, Zero?/Cond, data memory (DM), LMD, and multiplexers]

MIPS Pipeline

[Figure: MIPS pipeline with stages IF, ID, EX, MEM, WB. Source: Hennessy & Patterson, CA-QA, Appendix C, 5ed. MK, 2013]

MIPS Pipeline

Time (clock cycles)
      1    2    3    4    5    6    7    8    9
i1    IF   ID   EX   MEM  WB
i2         IF   ID   EX   MEM  WB
i3              IF   ID   EX   MEM  WB
i4                   IF   ID   EX   MEM  WB
...

Example:
● When will i1000 complete? What is the average number of clock cycles spent per instruction?
● If the processor were not pipelined, when would i1000 complete? What is the average number of clock cycles spent per instruction?
● Which is faster?
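One way to check answers to the example is to count cycles directly. This sketch assumes an ideal 5-stage pipeline with no stalls, an unpipelined machine that spends 5 cycles on every instruction, and the same clock cycle time for both; the function name is hypothetical.

```python
def completion_cycle(n, depth=5, pipelined=True):
    """Clock cycle in which instruction n (1-based) finishes, assuming no stalls."""
    if pipelined:
        # The first instruction needs `depth` cycles; each later one finishes a cycle after.
        return depth + (n - 1)
    # Unpipelined: each instruction occupies the machine for `depth` cycles.
    return depth * n


n = 1000
for pipelined in (True, False):
    cycles = completion_cycle(n, pipelined=pipelined)
    print(f"pipelined={pipelined}: i{n} completes in cycle {cycles}, "
          f"average CPI = {cycles / n:.3f}")
```

Under these assumptions the pipelined average approaches 1 cycle per instruction as n grows, against 5 for the unpipelined machine.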

Measuring Performance
● Metrics: Response Time, Throughput
● Speedup of X relative to Y

$$ n = \frac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} $$

● Execution time
– Wall clock time: includes all system overheads
– CPU time: only computation time

Benchmarks
● Kernels (e.g. Matrix Multiply), Toy programs (e.g. Sorting), Synthetic benchmarks (e.g. Dhrystone)
● Server Benchmarks – SPECWeb, SPECFS, SPECjbb, SPECvirt_Sc
● Embedded Systems Benchmarks – EEMBC, Dhrystone
● Database Server Benchmarks – TPC
● Desktop Benchmarks – SPECInt, SPECfp, SPECpower
– CINT2006: perlbench, bzip2, gcc, sjeng, libquantum, h264ref, etc.
– CFP2006: bwaves, gamess, zeusmp, leslie3d, povray, calculix, lbm, wrf, sphinx3

www.spec.org

Measuring Performance

$$ \text{SPECRatio}_A = \frac{\text{ExecutionTime}_{\text{reference}}}{\text{ExecutionTime}_A} $$

$$ 1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{ExecutionTime}_{\text{reference}} / \text{ExecutionTime}_A}{\text{ExecutionTime}_{\text{reference}} / \text{ExecutionTime}_B} = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \frac{\text{Performance}_A}{\text{Performance}_B} $$
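The reference time cancels when two SPECRatios are divided, which is why the ratio of ratios equals the ratio of execution times. A small numeric check, using hypothetical execution times chosen to reproduce the 1.25 on the slide:

```python
def spec_ratio(reference_time, measured_time):
    """SPECRatio: reference execution time divided by measured execution time."""
    return reference_time / measured_time


# Hypothetical execution times in seconds.
reference, time_a, time_b = 1000.0, 400.0, 500.0

ratio_a = spec_ratio(reference, time_a)
ratio_b = spec_ratio(reference, time_b)

print(ratio_a / ratio_b)   # ratio of SPECRatios
print(time_b / time_a)     # ratio of execution times, reference cancelled
```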

Measuring Performance
● Processor counters
– Instructions executed, Clock cycles completed
● Profile based, static modeling
– H/w counters, Code instrumentation, Interpreting the program at instruction level
● Trace-driven simulation
– Memory references and instruction addresses
● Execution-driven simulation
– Pipeline activity, data and instruction references

Amdahl's Law
● What is the overall speedup by enhancing the performance of a single block?
● Speedup_enhanced (always >1)
● Fraction_enhanced (always <1)

$$ \text{Speedup}_{\text{enhanced}} = \frac{\text{ExecutionTime}_{\text{original}}}{\text{ExecutionTime}_{\text{enhancement}}} $$

[Figure: Program Execution (Original) and Program Execution (Enhanced), each with its FP Arithmetic portion marked]

Amdahl's Law

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

$$ \text{ExecutionTime}_{\text{new}} = \text{ExecutionTime}_{\text{old}} \times \left( (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right) $$

In the example, FP arithmetic was used for 50% of the program. The new FP arithmetic block was 3 times faster than the previous version. What is the new performance number?

Objective: Make the program 10 times faster. Say, 25% of the program is waiting in I/O and cannot be enhanced. How much should the speedup of the enhanced computer be?
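Both questions follow from rearranging the execution-time equation. The sketch below is one way to work them; the function names are hypothetical, and the second question is read as Fraction_enhanced = 0.75 with an overall target of 10.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = ExecutionTime_old / ExecutionTime_new."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def required_speedup(fraction_enhanced, target):
    """Speedup_enhanced needed to reach `target` overall, or None if unreachable."""
    # Solve target = 1 / ((1 - f) + f / s) for s.
    remaining = 1.0 / target - (1.0 - fraction_enhanced)
    if remaining <= 0:
        return None  # the unenhanced fraction alone already exceeds the time budget
    return fraction_enhanced / remaining


# FP arithmetic is 50% of the program and becomes 3 times faster.
print(f"{overall_speedup(0.5, 3):.2f}")

# 25% of the program cannot be enhanced; the target is 10 times faster.
print(required_speedup(0.75, 10))
print(f"upper bound: {1.0 / (1.0 - 0.75):.1f}")
```

With 25% of the time untouched, the overall speedup can never exceed 4, so no finite Speedup_enhanced reaches the 10 times target and the function returns None.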

Processor Performance Equation

$$ \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Seconds}}{\text{Program}} = \text{CPU time} $$

$$ \text{CPU clock cycles} = \sum_{i=1}^{n} IC_i \times CPI_i $$

$$ CPI = \frac{\text{CPU clock cycles for a program}}{\text{Instruction Count}} $$

$$ CPI = \sum_{i=1}^{n} \frac{IC_i}{\text{Instruction Count}} \times CPI_i $$
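The equations above can be checked numerically. A minimal sketch, where each instruction class is a hypothetical (IC_i, CPI_i) pair and the clock cycle time is an assumed 1 ns:

```python
def cpu_clock_cycles(classes):
    """Sum of IC_i x CPI_i over all instruction classes."""
    return sum(ic * cpi for ic, cpi in classes)


def average_cpi(classes):
    """Weighted CPI: sum of (IC_i / Instruction Count) x CPI_i."""
    instruction_count = sum(ic for ic, _ in classes)
    return sum((ic / instruction_count) * cpi for ic, cpi in classes)


def cpu_time(classes, seconds_per_cycle):
    """CPU time = Instruction Count x CPI x clock cycle time."""
    return cpu_clock_cycles(classes) * seconds_per_cycle


# Hypothetical instruction mix: (instruction count, cycles per instruction).
classes = [(400, 4), (200, 4), (400, 5)]

print(cpu_clock_cycles(classes))
print(f"{average_cpi(classes):.2f}")
print(f"{cpu_time(classes, 1e-9):.2e} s")
```

Both CPI forms agree: total cycles divided by the instruction count gives the same value as the frequency-weighted sum.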

Pipeline Performance

An unpipelined processor has a 1 ns clock cycle. ALU operations and branches take 4 cycles, and memory operations take 5 cycles. The relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining adds 0.2 ns of overhead to the clock. What is the speedup?

Average Instruction Execution time = Clock cycle × Average CPI
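A sketch of one way to work the problem. It assumes the pipelined processor achieves an ideal CPI of 1 (no stalls), so its average instruction time is just the lengthened clock cycle; the variable names are hypothetical.

```python
clock_unpipelined_ns = 1.0
pipelining_overhead_ns = 0.2

# (relative frequency, cycles) per operation class, from the problem statement.
mix = {"ALU": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}

# Unpipelined: average instruction time = clock cycle x average CPI.
average_cpi = sum(freq * cycles for freq, cycles in mix.values())
time_unpipelined_ns = clock_unpipelined_ns * average_cpi

# Pipelined (assumed ideal CPI of 1): one instruction completes per slower clock cycle.
time_pipelined_ns = (clock_unpipelined_ns + pipelining_overhead_ns) * 1.0

speedup = time_unpipelined_ns / time_pipelined_ns
print(f"average CPI = {average_cpi:.1f}")
print(f"speedup = {speedup:.2f}")
```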

Pipeline Performance

$$ \text{Speedup}_{\text{pipelining}} = \frac{\text{Avg. instruction time unpipelined}}{\text{Avg. instruction time pipelined}} $$

$$ \text{Speedup}_{\text{pipelining}} = \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \text{Pipeline depth} $$
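The second form shows how stalls erode the ideal speedup, which equals the pipeline depth. A short sketch with hypothetical stall rates:

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


# Hypothetical 5-stage pipeline at increasing stall rates.
for stalls in (0.0, 0.25, 1.0):
    print(f"stalls/instr = {stalls:.2f}: speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the depth; at one stall cycle per instruction it is halved.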