
Computer Architecture (“MAMAS”, 234267)
Spring 2014

Lecturer: Yoav Etsion (Reception: Mon 15:00, Fishbach 306-8)
TAs: Nadav Amit, Gil Einziger, Franck Sala

Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, Adi Yoaz and Dan Tsafrir

Computer System Structure

Computer System Components: Archaic

[Diagram: the CPU and its cache sit on the CPU–memory bus together with main memory; a bus adapter bridges to the I/O bus, whose I/O controllers drive the disk, network + WLAN, printer, scanner, keyboard, mouse, …]

Source: 046267 Computer Architecture 1, U. Weiser

Computer System Components: Yesterday

[Diagram: the CPU and its cache connect to the North Bridge, which attaches main memory; the South Bridge hangs off the North Bridge and hosts the I/O controllers for the disk, network + WLAN, printer, scanner, keyboard, mouse, …]

Source: 046267 Computer Architecture 1, U. Weiser

Computer System Components: Now

[Diagram: the CPU now integrates the memory controller, cache, and graphics (MC + cache + G) and attaches main memory directly; the South Bridge handles I/O: disk/SSD, network + WLAN, printer, scanner, keyboard, mouse, …]

Source: 046267 Computer Architecture 1, U. Weiser

Classical Motherboard Diagram

[Diagram: the CPU (with its cache) sits on the CPU bus facing the North Bridge, which contains the memory controller and on-board graphics and drives DDR2/DDR3 memory channels 1 and 2 over the memory bus, plus an external graphics card; the South Bridge / IO controller (with IOMMU) connects to it and hosts PCI, PCI Express 2.0, PCI Express ×1, the LAN adapter, the sound card and speakers, the USB controller (keyboard, mouse), the SATA controller (hard disk, DVD drive), and the legacy parallel port, serial port, and floppy drive.]

More to the “north” = closer to the CPU = faster

Course Focus

Start from the CPU (= processor)
• Instruction set, performance
• Pipeline, hazards
• Branch prediction
• Out-of-order execution

Move on to the memory hierarchy
• Caching
• Main memory
• Virtual memory

Move on to PC architecture
• System & chipset, DRAM, I/O, disk, peripherals

End with some advanced topics

The Processor

Architecture vs. Microarchitecture

Architecture = the processor features as seen by its user = interface
• Instruction set, number of registers, addressing modes, …

Microarchitecture = the manner in which the processor implements the architecture = implementation details
• Cache size and structure, number of execution units, …

Note: different processors with different u-archs can support the same arch
• Example: ARM V8, ARM V9

We will address both

Why Should We Care?

Abstractions enhance productivity, so: if we know the arch (= interface), why should we care about the u-arch (= internals)?

The same goes for the arch itself: it is just details for a programmer of a high-level language.

Recent Processor Trends

Source: http://www.scidacreview.org/0904/html/multicore.html

Well-Known Moore’s Law

Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm

The Story in a Nutshell

[Chart: transistor count (thousands), clock speed (MHz), power (W), and instructions/cycle (ILP) over time]

Took the Industry by Surprise

Dire Implications: Performance

Dire Implications: Sales

Dire Implications: Sales

Dire Implications: Programmers

Supercomputing: “Top 500 list”

Dire Implications: Supercomputing

Processor Performance

Metrics: IC, CPI, IPC

CPUs work according to a clock signal
• Clock cycle: measured in nanoseconds (10^-9 of a second)
• Clock frequency = 1 / clock cycle: measured in GHz (10^9 cycles/sec)

Instruction Count (IC)
• Total number of instructions executed in the program

Cycles Per Instruction (CPI)
• Average number of cycles per instruction (in a given program)
• CPI = (#cycles required to execute the program) / IC

IPC (= 1/CPI): instructions per cycle
• Can be > 1; see the “Story in a Nutshell” slide
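To make the metrics concrete, here is a minimal C sketch that derives CPI and IPC from a measured cycle count and dynamic instruction count; the numbers are made up for illustration.

```c
/* Minimal sketch: deriving CPI and IPC from a cycle count and an
 * instruction count. The numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    double cycles = 2.0e9;        /* assumed: total cycles spent on the program */
    double instructions = 1.6e9;  /* assumed: dynamic instruction count (IC) */

    double cpi = cycles / instructions;  /* Cycles Per Instruction */
    double ipc = 1.0 / cpi;              /* Instructions Per Cycle */

    printf("CPI = %.2f, IPC = %.2f\n", cpi, ipc);  /* CPI = 1.25, IPC = 0.80 */
    return 0;
}
```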

Minimizing Execution Time

CPU time: the time required to execute a program
• CPU Time = IC × CPI × clock cycle

Our goal: minimize CPU time, i.e., any of the above components
• Minimize clock cycle: increase GHz (processor design)
• Minimize CPI: u-arch (e.g., more execution units)
• Minimize IC: arch (e.g., SSE instructions)

SSE = Streaming SIMD Extensions (Intel)
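Continuing the sketch above, CPU time follows directly from IC, CPI, and the clock cycle; the values below are assumptions for illustration only.

```c
/* Minimal sketch: CPU Time = IC x CPI x clock cycle (all values hypothetical). */
#include <stdio.h>

int main(void) {
    double ic  = 1.6e9;          /* instruction count */
    double cpi = 1.25;           /* average cycles per instruction */
    double freq_hz = 3.0e9;      /* 3 GHz clock => cycle time of 1/3 ns */
    double cycle_time = 1.0 / freq_hz;

    double cpu_time = ic * cpi * cycle_time;   /* seconds */
    printf("CPU time = %.3f s\n", cpu_time);   /* 2e9 cycles / 3 GHz ~= 0.667 s */
    return 0;
}
```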

Alternative Way to Calculate CPI

IC_i = number of times a type-i instruction is executed in the program

IC = number of instructions executed in the program = Σ_i IC_i

F_i = relative frequency of type-i instructions = IC_i / IC

CPI_i = number of cycles to execute a type-i instruction (e.g., CPI_add = 1, CPI_mul = 3)

Number of cycles required to execute the program (summing over the n instruction types):
#cycles = Σ_i (CPI_i × IC_i)

Therefore:
CPI = #cycles / IC = Σ_i (CPI_i × IC_i) / IC = Σ_i (CPI_i × F_i)
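A small sketch of the weighted-average calculation; the instruction classes, frequencies, and per-class CPIs below are invented for illustration.

```c
/* Sketch: CPI = sum over i of F_i * CPI_i, for a hypothetical instruction mix. */
#include <stdio.h>

int main(void) {
    double freq[]  = { 0.45, 0.22, 0.12, 0.18, 0.03 };  /* F_i: ALU, load, store, branch, mul */
    double cpi_i[] = { 1.0,  2.0,  1.0,  1.5,  3.0  };  /* CPI_i for each class */
    int n = 5;

    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += freq[i] * cpi_i[i];   /* weighted average over the mix */

    printf("average CPI = %.2f\n", cpi);   /* 1.37 for these numbers */
    return 0;
}
```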

Performance Evaluation: How?

Performance depends on the application and on its input

Mathematical analysis

Benchmarks

Use benchmarks & measure how long they take
• Use real applications (=> no absolute answers)
• Preferably standardized benchmarks (+ inputs), e.g.:
• SPEC INT: integer apps (compression, C compiler, Perl, text processing, …)
• SPEC FP: floating-point apps (mostly scientific)
• TPC benchmarks: measure transaction throughput (DB)
• SPEC JBB: models a wholesale company (Java server, DB)

Sometimes you see FLOPS (“peak” or “sustained”)
• Supercomputers (Top 500 list), measured against LINPACK

Evaluating Performance

Use a performance simulator to evaluate the performance of a new feature / algorithm
• Models the u-arch in great detail
• Run hundreds of representative applications

Produce the performance s-curve
• Sort the applications according to their IPC increase
• The baseline (0%) is the processor without the new feature

[Charts: a “bad” s-curve with both negative and positive outliers, and a “good” s-curve with only small negative outliers and positive outliers; the y-axis is the per-application IPC change, roughly -4% to +6%]
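The s-curve itself is just the per-application IPC deltas sorted in ascending order; here is a minimal sketch (with invented deltas) of how it would be produced.

```c
/* Sketch: build the performance s-curve by sorting per-application IPC
 * changes (in %) relative to the baseline. The deltas are invented. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    double delta[] = { 1.2, -0.3, 4.0, 0.0, -2.5, 0.8, 2.1, 0.1 };  /* % IPC change per app */
    size_t n = sizeof delta / sizeof delta[0];

    qsort(delta, n, sizeof delta[0], cmp_double);   /* sorted deltas trace the s-curve */
    for (size_t i = 0; i < n; i++)
        printf("app %zu: %+.1f%%\n", i, delta[i]);
    return 0;
}
```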

Amdahl’s Law

Suppose we accelerate part of the computation, such that
• P = portion of the computation we make faster
• S = speedup experienced by the portion we improved

For example:
• If an improvement can speed up 40% of the computation => P = 0.4
• If the improvement makes that portion run twice as fast => S = 2

Then the overall speedup is:
Speedup = 1 / ((1 − P) + P/S)

Amdahl’s Law – Example

FP operations are improved to run 2× faster => S = 2, but…
the improvement only affects 10% of the program => P = 0.1

Speedup = 1 / ((1 − P) + P/S) = 1 / ((1 − 0.1) + 0.1/2) = 1 / 0.95 ≈ 1.053

Conclusion: better to make the common case fast…
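A one-function sketch of Amdahl’s Law, evaluated with the FP example above (P = 0.1, S = 2):

```c
/* Sketch of Amdahl's Law: speedup = 1 / ((1 - P) + P/S). */
#include <stdio.h>

static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* FP example from the slide: 10% of the program runs 2x faster */
    printf("speedup = %.3f\n", amdahl(0.1, 2.0));   /* 1 / 0.95 = 1.053 */
    return 0;
}
```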

Amdahl’s Law – Parallelism

When parallelizing a program:
• P = proportion of the program that can be made parallel
• 1 − P = the inherently serial part
• N = number of processing elements (say, cores)

Speedup = 1 / ((1 − P) + P/N)

The serial component imposes a hard limit:
lim (N → ∞) 1 / ((1 − P) + P/N) = 1 / (1 − P)
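The same formula with N in place of S shows how the serial fraction caps the parallel speedup; a short sketch assuming a 90% parallel program:

```c
/* Sketch: speedup(N) = 1 / ((1 - P) + P/N); saturates at 1/(1 - P) as N grows.
 * P = 0.9 is an assumed parallel fraction. */
#include <stdio.h>

int main(void) {
    double p = 0.9;
    int cores[] = { 1, 2, 4, 8, 16, 64, 1024 };
    int n = sizeof cores / sizeof cores[0];

    for (int i = 0; i < n; i++) {
        double s = 1.0 / ((1.0 - p) + p / cores[i]);
        printf("N = %4d -> speedup = %6.2f\n", cores[i], s);
    }
    printf("limit as N -> infinity: %.1f\n", 1.0 / (1.0 - p));   /* 10.0 */
    return 0;
}
```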

Instruction Set Design

The ISA is what the user & compiler see
The HW implements the ISA

[Diagram: software sits above the instruction set; hardware sits below it]

Considerations in ISA Design

Instruction size
• Long instructions take more time to fetch from memory
• Longer instructions require a larger memory
• Important for small (embedded) devices, e.g., cell phones

Number of instructions (IC)
• Reducing IC reduces runtime (at a given CPI & frequency)

Virtues of instruction simplicity
• Simpler HW allows for higher frequency & lower power
• Optimizations apply better to simpler code
• Cheaper HW

Basing Design Decisions on Workload

Example: size of immediate arguments, in bits (histogram)
• Only about 1% of data values need more than 16 bits
• Having 16 bits is likely good enough

[Histogram: fraction of immediates (0%–30%) vs. number of immediate data bits (0–15), shown as integer average and FP average]
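One way such a histogram is gathered is by asking, for each immediate seen, how many bits it needs; a small sketch of that measurement (the sample values are hypothetical):

```c
/* Sketch: bits needed to hold a value as a sign-extended two's-complement
 * immediate -- the measurement behind the histogram above. Samples are made up. */
#include <stdio.h>

static int imm_bits(long v) {
    int bits = 1;   /* at least the sign bit */
    while (v < -(1L << (bits - 1)) || v >= (1L << (bits - 1)))
        bits++;     /* widen until v fits in the signed range of 'bits' bits */
    return bits;
}

int main(void) {
    long samples[] = { 0, 1, -1, 4, 255, -256, 40000, -70000 };
    int n = sizeof samples / sizeof samples[0];

    for (int i = 0; i < n; i++)
        printf("%7ld needs %2d bits\n", samples[i], imm_bits(samples[i]));
    return 0;
}
```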

CISC Processors

CISC = Complex Instruction Set Computer (example: x86)

The idea: a high-level machine language
• When people still programmed in assembly, CISC was supposedly easier

Characteristics
• Many instruction types, with many addressing modes
• Some of the instructions are complex:
• Execute complex tasks
• Require many cycles
• ALU operations directly on memory (e.g., arr[j] = arr[i] + n)
• Registers are barely used (and, accordingly, only a few registers exist)
• Variable-length instructions:
• Common instructions get short codes => saves code length

But it Turns Out…

Rank  Instruction               % of total executed
1     load                      22%
2     conditional branch        20%
3     compare                   16%
4     store                     12%
5     add                        8%
6     and                        6%
7     sub                        5%
8     move register-register     4%
9     call                       1%
10    return                     1%
      Total                     96%

Simple instructions dominate instruction frequency

CISC Drawbacks

Complex instructions and complex addressing modes
• Complicate the processor
• Slow down the simple, common instructions
• Contradict “make the common case fast”
• Compilers don’t use the complex instructions / indexing methods anyway

Variable-length instructions are a real pain in the neck
• Difficult to decode multiple instructions in parallel:
• As long as an instruction is not decoded, its length is unknown, so it is unknown where it ends and where the next instruction starts
• An instruction may be longer than a cache line, or even longer than a page (in theory)

RISC Processors

RISC = Reduced Instruction Set Computer

The idea: simple instructions enable fast hardware

Characteristics
• A small instruction set, with only a few instruction formats
• Simple instructions:
• Execute simple tasks
• Most require a single cycle (with a pipeline)
• A few indexing methods
• Load/store machine: ALU operations on registers only
• Memory is accessed using load and store instructions only
• Many orthogonal registers
• Three-address machine: Add dst, src1, src2
• Fixed-length instructions

Examples: MIPS™, SPARC™, Alpha™, POWER™

RISC Processors (Cont.)

Simple arch => simple u-arch
• Room for larger on-die caches
• Smaller => faster
• Easier to design & validate (=> cheaper to manufacture)
• Shorter time-to-market
• More general-purpose registers (=> fewer memory references)

The compiler can be smarter
• Better pipeline usage
• Better register allocation

Existing RISC processors are not “pure” RISC
• Various complex operations were added along the way

Compilers and ISA

Ease of compilation
• Orthogonality:
• No special registers, few special cases
• All operand modes available with any data type or instruction type
• Regularity:
• No overloading of the meanings of instruction fields
• Streamlined:
• Resource needs are easily determined

Register assignment is critical too
• Easier if there are lots of registers

Still, CISC Is Dominant

x86 (CISC) dominates the processor market
• Not necessarily because it is CISC…

Legacy
• A vast amount of existing software
• Intel, AMD, and Microsoft benefit
• But they invest a lot of money to compensate for the disadvantages

The CISC arch is internally emulated by RISC
• Starting with the Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
• Inside, the core is a RISC machine

Software Specific Extensions

Extend the arch to accelerate the execution of specific apps

Example: SSE™ – Streaming SIMD Extensions
• 128-bit packed (vector) / scalar single-precision FP (4 × 32)
• Introduced on the Pentium® III in ’99
• 8 new 128-bit registers (XMM0–XMM7)
• Accelerates graphics, video, scientific calculations, …

Packed (128 bits): all four 32-bit elements are added in parallel: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
Scalar (128 bits): only the lowest 32-bit element is added (x0 + y0); the upper elements pass through unchanged
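A minimal sketch of the packed vs. scalar additions above, using the SSE intrinsics from <xmmintrin.h> (compile with SSE enabled, e.g. gcc -msse); note that the _mm_add_ss intrinsic carries the upper three elements over from its first operand.

```c
/* Sketch: packed (_mm_add_ps) vs. scalar (_mm_add_ss) single-precision adds. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    __m128 x = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);    /* lanes: x3=3, x2=2, x1=1, x0=0 */
    __m128 y = _mm_set_ps(30.0f, 20.0f, 10.0f, 5.0f); /* lanes: y3=30, y2=20, y1=10, y0=5 */

    __m128 packed = _mm_add_ps(x, y);  /* all four lanes: xi + yi */
    __m128 scalar = _mm_add_ss(x, y);  /* lowest lane only: x0 + y0; upper lanes from x */

    float p[4], s[4];
    _mm_storeu_ps(p, packed);
    _mm_storeu_ps(s, scalar);
    printf("packed: %g %g %g %g\n", p[0], p[1], p[2], p[3]);  /* 5 11 22 33 */
    printf("scalar: %g %g %g %g\n", s[0], s[1], s[2], s[3]);  /* 5 1 2 3 */
    return 0;
}
```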

BACKUP

Compatibility

Backward compatibility (HW responsibility)
• When buying new hardware, it can run existing software
• Example: an i5 can run SW written for the Core 2 Duo, Pentium 4, Pentium M, Pentium III, Pentium II, Pentium, 486, 386, 286

BTW:

Forward compatibility (SW responsibility)
• For example: MS Word 2003 can open an MS Word 2010 doc
• Commonly supported for one or two generations behind

Architecture-independent SW
• Run SW on top of a VM that does JIT (just-in-time compilation): JVM for Java, CLR for .NET
• Interpreted languages: Perl, Python