ece552 / cps550 advanced computer architecture...

63
ECE552 / CPS550 Advanced Computer Architecture I Lecture 1 Metrics and Early Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece552fall16.html

Upload: dinhkhuong

Post on 22-Apr-2018

217 views

Category:

Documents


1 download

TRANSCRIPT

ECE552 / CPS550Advanced Computer Architecture I

Lecture 1Metrics and Early Machines

Benjamin LeeElectrical and Computer Engineering

Duke University

www.duke.edu/~bcl15www.duke.edu/~bcl15/class/class_ece552fall16.html

Mark IHarvard University, 1944

EDSACUniversity of Cambridge, 1949

ECE 552 / CPS 550 2

Computing Devices (Then)

iPadApple/ARM, 2010

Blue Gene/PIBM, 2007

ECE 552 / CPS 550 3

Computing Devices (Now)

ECE 552 / CPS 550 4

Computer Architecture Application

Physics

Gap too large to bridge in one step

Computer architecture is the design of abstraction layers, which allow efficient implementations of computational applications on available technologies

ECE 552 / CPS 550 5

Abstraction Layers

Algorithm

Gates/Register-Transfer Level (RTL)

Application

Instruction Set Architecture (ISA)

Operating System/Virtual Machines

Microarchitecture

Devices

Programming Language

Circuits

Physics

Domain of early

computer architecture

(‘50s-’80s)

Domain of recent

computer architecture(since ‘90s)

ECE 552 / CPS 550 6

An Integrated ApproachArchitect Systems

- Coordinate system across hardware-software interface~ Technology, hardware, run-time software, compilers, apps

- Responsible for end-to-end functionality

Design and Analyze- Search design space of computer systems- Evaluate designs with quantitative metrics

~ Performance, power, cost

Navigate Computing Landscape- Technologies are emerging- Applications are demanding- Systems are scaling

In-order Datapath(understand, ECE 250)

(built, ECE 350)

Chip Multiprocessors(understand, experiment ECE552)

ECE 552 / CPS 550 7

ECE 552 Executive Summary

ECE 552 / CPS 550 8

ECE 552 Administrivia

Instructor Prof. Benjamin [email protected] Hours: TuTh 4-5pm, Hudson 210

Teaching Ramin Bashizade, [email protected] Office Hours: WF 3:30 – 4:30pm, LSRC D301

Tamara Lehman, [email protected] Hours: TuTh 11:30 – 12:30pm, TBD

Lectures Tu/Th 10:05 – 11:20AM, Teer 203

Text Computer Architecture: A Quantitative Approach, 5th Edition (2012). Do not use earlier editions

Web http://www.duke.edu/~BCL15/class/class_ece552fall16.html

ECE 552 / CPS 550 9

ECE 552 PrerequisitesParticipation

- Electrical and Computer Engineering, Computer Science- PhD, MS, Undergraduates

Prerequisites- Introduction to computer architecture (CPS 104, ECE 152, or equiv.)- Programming (homework/projects in C, C++)

Background Knowledge- Instruction sets, computer arithmetic, assembly programmingD.A. Patterson and J.L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, 5th Edition.

ECE 552 / CPS 550 10

ECE 552 Lectures

1. Design Metrics a) Performanceb) Powerc) Early machines

2. Simple Pipelining a) Multi-cycle machinesb) Branch predictionc) In-order superscalar d) Optimizations

3. Complex Pipelining a) Score-boarding, Tomasulo algorithmb) Out-of-order superscalar

Midterm ExamFall Break

4. Explicitly Parallel Architecturesa) VLIWb) Vector machinesc) Multi-threading

5. Memory Systemsa) Cachesb) DRAMc) Virtual memory

6. Multiprocessorsa) Memory modelsb) Coherence protocols

ECE 552 / CPS 550 11

ECE 552 Readings

1. Technologya) Moore’s Law b) Technology scaling

2. Historya) Classic machinesb) The 801 minicomputer

3. Pipelining a) Power as a design constraintb) Optimizing pipeline depth

4. Microarchitecturea) Branch predictionb) Complexity and superscalar design

5. Parallelism Ia) Data flow processorsb) Simultaneous multi-threading

6. Memorya) Victim cacheb) Phase change memory

7. Parallelism IIa) Consistencyb) Coherence

ECE 552 / CPS 550 12

ECE 552 Components30% Homework and Readings

- Homework done in teams of 3- 5 classes dedicated to paper discussions

20% Midterm exam- 75 minutes (in class), closed book

20% Final exam- 3 hours, closed-book

30% Term project/paper- Project done in teams of 3

Academic PolicyUniversity policy as codified by Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.

ECE 552 / CPS 550 13

ECE 552 Academic Policy

University policy as codified by the Duke Undergraduate Honor Code will be strictly enforced. Zero tolerance for cheating and/or plagiarism.

If a student is suspect of academic dishonesty (e.g., cheating on an exam, copying a lab report, collaborating inappropriately on an assignment), faculty are required to report the matter to the Office of Student Conduct.

A student found responsible for academic dishonesty faces formal disciplinary action, which may include suspension. A student suspended twice for academic dishonesty automatically faces a minimum 5-year separation from Duke University.

ECE 552 / CPS 550 14

ECE 552 Term ProjectScope

- Semester-long research project- Teams of 3 - Students propose project ideas (Oct 14)

Final Paper- 6-12 page research paper- Evaluate research idea quantitatively- Survey and cite related work

ECE 552 / CPS 550 15

ECE 552 Upcoming Deadlines1 September – Reading #1 Due

Readings are available on Sakai.Submit reading responses on Sakai.

1. Moore. “Cramming more components onto integrated circuits”2. Horowitz et al. “Scaling, power, and the future of CMOS”

15 September – Homework #1 DueHomework will be available on Sakai.Submit homework on Sakai in teams of two.

Performance

ECE 552 / CPS 550 17

Latency versus ThroughputDefinitions

- Latency: time to finish given task (a.k.a. execution time)- Throughput: number of tasks in given time (a.k.a. bandwidth)- Throughput exploits parallelism. Latency cannot

Example: Move people from Duke to UNC, 10 miles- Car: capacity = 5, speed = 60 miles/hour- Latency = (10 miles @ 60 miles/hour )= 10 minutes- Throughput = (3 trips @ 60 miles per hour) = 15 people/hour

- Bus: capacity = 60, speed = 20 miles/hour- Latency = (10 miles @ 20 miles/hour) = 30 minutes- Throughput = (1 trip @ 20 miles per hour) = 60 people/hour

ECE 552 / CPS 550 18

Aggregating PerformanceAddition

- Latency is additive. Throughput is not. - Example: Consider applications A1 and A2 on processor P- Latency(A1,A2) = Latency(A1) + Latency(A2)- Throughput (A1,A2) = 1/[1/Throughput(A1) + 1/Throughput(A2)]

Averages- Arithmetic Mean: (1/N) * ∑P=1..N Latency(P)- For measures that are proportional to time (e.g., latency)

- Harmonic Mean: N / ∑P=1..N 1/Throughput(P)- For measures that are inversely proportional to time (e.g., throughput)

- Geometric Mean: (∏P=1..N Speedup(P))^(1/N)- For ratios (e.g., speed-ups)

ECE 552 / CPS 550 19

Processor Performance

1

10

100

1000

10000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006

Per

form

ance

(vs.

VA

X-1

1/78

0)

25%/year

52%/year

??%/year

SPECint Benchmarks. Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.

ECE 552 / CPS 550 20

Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)

Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture

Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)

Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture

ECE 552 / CPS 550 21

Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)

Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture

Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)

Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture

ECE 552 / CPS 550 22

Moore’s Law- Moore. “Cramming more components onto integrated circuits.”

Electronics, Vol 38, No. 8, 1965. - As integration increases, packaging cost decreases- How does Moore’s Law impact performance?

ECE 552 / CPS 550 23

Field-Effect Transistors- MOS: metal-oxide semiconductor- FET: field-effect transistor

- Charge carriers flow between source-drain, controlled by gate voltage- Abstract MOSFET as electrical switch

GateSource

Drain

Bulk

Width

Length

Channel

Source

Drain

Gate

ECE 552 / CPS 550 24

Complementary MOS (CMOS)- Map voltages to logical values (Vdd=1, Gnd=0)- Implement complementary Boolean logic

- nFET: conduct charge when Vg = Vdd, used in pull-down network- pFET: conduct charge when Vg = Gnd, used in pull-up network- Examples: Inverter, NAND

Vdd

Gnd

A !A

nFET

pFET BA

A

B

!(AB)

ECE 552 / CPS 550 25

Transistor Dimensions- Process defined by feature size (F), layout design (l = F/2)- Example: F=2l =45nm process technology

- Transistor dimensions determine technology performance- Transistor drive strength (i.e., speed) increases as channel length shrinks

GateSource

Drain

Bulk

Width

Length

Minimum Length=2l

Width=4lSource Drain

Gate

ECE 552 / CPS 550 26

Dennard Scaling- Dennard et al. “Design of ion-implanted MOSFETs with very small physical

dimensions,” Journal Solid State Circuits, 1974.

- Scale not only dimensions but also doping concentration and voltage- Transistors become faster (1.4x)- Applied to Moore’s Law: k=1.4, 1/k = 0.7 every 18-24 months

GateSource

Drain

Bulk

Width

Length

ECE 552 / CPS 550 27

Dennard Scaling Limits- Horowitz et al. “Scaling, power, and the future of CMOS.” IEDM, 2005. - Classical Dennard scaling ended at 130nm in 2000-2001.

- Oxide Thickness: How to manage increasing leakage? Use high-K dielectrics- Channel Length: How to manage increasing leakage? Stop scaling L- Doping Concentration: How to handle imprecise doping? Manage variability- Voltage: How to manage increasing leakage? Stop scaling V- Current: How to increase current with shrinking channels? Stress silicon

- Example: Intel 22nm process technology with FinFET

Image: Courtesy Intel Corp.

ECE 552 / CPS 550 28

Processor Performance

1

10

100

1000

10000

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006

Per

form

ance

(vs.

VA

X-1

1/78

0)

25%/year

52%/year

??%/year

SPECint Benchmarks. Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, 2006.

ECE 552 / CPS 550 29

Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)

Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture

Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)

Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture

ECE 552 / CPS 550 30

Cycles per Instruction (CPI)Average Instruction Latency

- Different instructions require different number of cycles - Examine instruction frequency - CPI is slightly easier to calculate than IPC (time versus rate)

Example- Instruction frequency: 1/3 INT, 1/3 FP, 1/3 MEM operations- Instruction cycles: 1cy INT, 3cy FP, 2cy MEM- CPI = (1/3 x 1) + (1/3 x 3) + (1/3 x 2)

Caveat- CPI provides high-level, quick estimates of performance- Does not account for details (e.g., instruction dependences)

ECE 552 / CPS 550 31

CPI and DesignBaseline Processor + Application

- Integer ALU: 50%, 1 cycle- Load: 20%, 5 cycle- Store: 10%, 1 cycle- Branch: 20%, 2 cycle

Possible Enhancements- Option 1: Branch prediction for 1-cycle branch- Option 2: Bigger data cache for 3-cycle load- Which enhancement is preferred?

Cycles Per Instruction- Base = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 2) = 2 cycles- Option 1 = (0.5 x 1) + (0.2 x 5) + (0.1 x 1) + (0.2 x 1) = 1.8 cycles- Option 1 = (0.5 x 1) + (0.2 x 3) + (0.1 x 1) + (0.2 x 2) = 1.6 cycles

ECE 552 / CPS 550 32

Measuring CPIPhysical Measurements

- Measure wall clock time as application runs- Multiply time by clock frequency to get cycles- Profile application with hardware counters (e.g., Intel VTune)

Simulated Measurements- Cycle-level, microarchitectural simulation (e.g., SimpleScalar)- Run applications on simulated hardware- Track instructions as they progress through the design

ECE 552 / CPS 550 33

BenchmarkingMeasuring Performance

- Target Workload: accurate but not portable- Representative Benchmark: portable but not accurate- Microbenchmark: small, fast code sequences but incomplete

Representative Benchmarks- SPEC (Standard Performance Evaluation Corporation, www.spec.org)- Collects, standardizes, distributes benchmark programs

- Scientific and commercial computing- SPLASH-2, NAS, SPEC OpenMP, SPECjbb

- Online transaction processing (OLTP) with heavy I/O, memory- TPC-C, TPC-H, TPC-W

- Datacenter workloads- Search (e.g., Nutch/Lucene), analytics (e.g,. Hadoop, Spark)

ECE 552 / CPS 550 34

Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)

Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture

Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)

Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture

ECE 552 / CPS 550 35

Single Accumulator

- Carry-over from calculators, typically less than 2-dozen instructions- Single operand (AC)

LOAD x AC ¬ M[x]STORE x M[x] ¬ (AC)

ADD x AC ¬ (AC) + M[x]SUB x

SHIFT LEFT AC ¬ 2 ´ (AC)SHIFT RIGHT

JUMP x PC ¬ xJGE x if (AC) ³ 0 then PC ¬ x

LOAD ADR x AC ß Extract address field (M[x])STORE ADR x

ECE 552 / CPS 550 36

Using Accumulator

LOOP LOAD N # AC ß M[N]JGE DONE # if(AC>0), PCß DONEADD ONE # AC ß AC + 1STORE N # M[N] ß AC

F1 LOAD A # AC ß M[A]F2 ADD B # AC ß (AC) + M[B]F3 STORE C # M[C] ß (AC)

JUMP LOOP

DONE HLT

Ci ¬ Ai + Bi, 1 £ i £ n

Notice M[N] is a counter, not an index. How to modify the addresses A, B and C ?

ECE 552 / CPS 550 37

Self-Modifying Code

LOOP LOAD N # AC ß M[N]JGE DONE # if (AC >= 0), PC ß DONEADD ONE # AC ß AC + M[ONE]STORE N # M[N] ß AC

F1 LOAD A # AC ß M[A]F2 ADD B # AC ß AC + M[B]F3 STORE C # M[C] ß (AC)

LOAD ADR F1 # AC ß address field (M[F1])ADD ONE # AC ß AC + M[ONE]STORE ADR F1 # changes address of ALOAD ADR F2ADD ONESTORE ADR F2 # changes address of BLOAD ADR F3ADD ONESTORE ADR F3 # changes address of CJUMP LOOP

DONE HLT

Ci ¬ Ai + Bi, 1 £ i £ n

Each iteration requires:total book-keeping

Inst fetch 17 14Stores 5 4

ECE 552 / CPS 550 38

Index RegistersSpecialized registers to simplify address calculations

- T. Kilburn, Manchester University, 1950s- Instead of single AC register, use AC and IX registers

Modify Existing Instructions- Load x, IX AC ß M[x + (IX)]- Add x, IX AC ß (AC) + M[x + (IX)]

Add New Instructions- Jzi x, IX if (IX)=0, then PCß x, else (IX)ß(IX)+1- Loadi x, IX IX ß M[x] (truncated to fit IX)

Index registers have accumulator-like characteristics

ECE 552 / CPS 550 39

Using Index Registers

LOADi -n, IX # load n into IXLOOP JZi DONE, IX # if(IX=0), DONE

LOAD LASTA, IX # ACß M[LASTA + (IX)]ADD LASTB, IX # note: LASTA is address STORE LASTC, IX # of last element in AJUMP LOOP

DONE HALT

Ci ¬ Ai + Bi, 1 £ i £ n

- Longer instructions (1-2 bits), index registers with ALU circuitry- Does not require self-modifying code, modify IX instead- Improved program efficiency (operations per iteration)

total book-keepingInst fetch 5 2Stores 1 0

Option 1: Increment index register by kAC ß (IX) new instructionAC ß (AC) + kIX ß (AC) new instruction

Also, the AC must be saved and restored

Option 2: Manipulate index register directlyINCi k, IX IX ß (IX) + kSTOREi x, IX M[x] ß (IX) (extended to fit a word)

IX begins to resemble AC- Several index registers, accumulators- Motivates general-purpose registers (e.g., MIPS ISA R0-R31)

ECE 552 / CPS 550 40

Modifying Index Registers

ECE 552 / CPS 550 41

Evolution of Addressing Modes1. Single accumulator, absolute address

Load x AC ß M[x]

2. Single accumulator, index registersLoad x, IX AC ß M[x + (IX)]

3. Single accumulator, indirection Load (x) AC ß M[M[x]]

4. Multiple accumulators, index registers, indirectionLoad Ri, IX, (x) Ri ß M[M[x] + (IX)]

5. Indirection through registersLoad Ri, (Rj) Ri ß M[M[(Rj)]]

6. The WorksLoad Ri, Rj, (Rk) Ri ß M[Rj + (Rk)]; Rj = index; Rk = base address

ECE 552 / CPS 550 42

Evolution of Instruction FormatsZero-address Formats

- Instructions have zero operands- Operands on a stack

add M[sp] ß M[sp] + M[sp-1]load M[sp] ß M[M[sp]]

- Stack can be registers or memory- Top of stack usually cached in registers

One-address Formats- Instructions have one operand- Accumulator is always other implicit operand

CBASP

Register

ECE 552 / CPS 550 43

Evolution of Instruction FormatsTwo-address Formats

- Destination is same as one of the operand sourcesRi ß (Ri) + (Rj) # (Reg x Reg) to RegRi ß (Ri) + M[x] # (Reg x Mem) to Reg

- x can be specified directly or via register- x address calculation could include indexing, indirection, etc.

Three-address Formats- One destination and up to two operand sources

Ri ß (Rj) + (Rk) # (Reg x Reg) to RegRi ß (Rj) + M[x] # (Reg x Reg) to Reg

ECE 552 / CPS 550 44

Data FormatsData Sizes

- Bytes, Half-words, words, double words

Byte Addressing- Location of most-, least- significant bits

Word Alignment- Suppose memory is organized into 32-bit words (e.g., 4 bytes). - Word aligned addresses begin only at 0, 4, 8, … bytes

Big Endian

Little Endian

MSB

MSB LSB

LSB

0 1 2 3 4 5 6 7

ECE 552 / CPS 550 45

Software DevelopmentsNumerical Libraries (up to 1955)

- floating-point operations- transcendental functions- matrix multiplication, equation solvers, etc.

High-level Languages(1955-1960)- Fortran, 1956- assemblers, loaders, linkers, compilers

Operating Systems (1955-1960)- accounting programs to track usage and charges

ECE 552 / CPS 550 46

Performance FactorsLatency = (Seconds / Cycle) x (Cycles / Instruction) x (Instructions / Program)

Seconds / Cycle- Technology and architecture- Transistor scaling- Processor microarchitecture

Cycles / Instruction (CPI)- Architecture and systems- Processor microarchitecture- System balance (processor, memory, network, storage)

Instructions / Program- Algorithm and applications- Compiler transformations, optimizations- Instruction set architecture

ECE 552 / CPS 550 47

Pitfall: Incomplete MetricsIgnoring Instructions per Program

- Neglect dynamic instruction count- Misleading if working in algorithms, compilers, or ISA

Using Instructions per Second- MIPS = (Instructions / Cycle) x (Cycles / Second) x 1E-6- FLOPS: considers only floating-point instructions- Example: CPI = 2, clock frequency = 500MHz, 250 MIPS- Example: compiler removes instructions, latency falls, MIPS increases

Using Clock Frequency- Cannot equate clock frequency with performance- Proc A: CPI = 2, f = 500MHz- Proc B: CPI = 1, f = 300MHz- Given the same ISA and compiler, B is faster

ECE 552 / CPS 550 48

Pitfall: Diminishing Returns- Amdahl. “Validity of the single-processor approach…” AFIPS, 1967.

Amdhal’s Law (Make Common Case Fast)Consider improving fraction F of system with a speedup S.T(new) = T(base) x (1-F) + T(base) x F / S

= T(base) x [(1-F) + F/S]

Speedup = 1 / [(1-F) + F/S] = T(base)/T(new)Max Speedup = 1 / (1 – F)

Example- Suppose FP computation is 1/4 of an application’s execution time- Maximum benefit from optimizing FP unit is 1.3x (=1/0.75)

- Multiprocessor systems were original application of this law- Accounts for diminishing marginal returns

Power

ECE 552 / CPS 550 50

Power FactorsDefinitions

- Energy (Joules) = a x C x V2

- Power (Watts) = a x C x V2 x f

Power Factors and Trends- activity (a): function of application resource usage- capacitance (C): function of design; scales with area- voltage (V): constrained by leakage, which increases as V falls- frequency (f): varies with pipelining and transistor speeds- Models in cycle-accurate simulators (e.g., Princeton Wattch)

Dynamic Voltage and Frequency Scaling (DVFS)- P-states: move between operational modes with different V, f- Intel TurboBoost: increase V, f for short durations without violating thermal design point (TDP)

ECE 552 / CPS 550 51

Power and TemperatureTemperature• Power density (Watts / sq-mm) is

proxy for thermal effects

• Estimate thermal conductivity, resistance to identify processor hot spots (e.g., HotSpot simulator)

Power Budgets• Power à Package Cost• 130W servers, 65W desktops, 10-30W

laptops, 1-2W hand-held

ECE 552 / CPS 550 52

Power and Multiprocessors

Multiprocessors• Chip multiprocessors (CMPs)

integrate multiple cores on die

Efficiency• Reduce power with simpler cores• Recover lost performance with many

core parallelism

ECE 552 / CPS 550 53

Power and MultiprocessorsLower voltage, frequency• Voltage, frequency scale together V ∝ f• Power proportional to V2 and f Power ∝ V2 f• Performance proportional to f Perf ∝ f

Example• Baseline: 1-core at V, f• Multiprocessor: 4-cores at 0.85V, 0.85f; program is 75% parallel• Core Power @ lower V, f 0.61x =0.853

• Core Performance @ lower V, f 0.85x

• Multicore Power @ lower V, f 2.44x = 0.61x 4• Multicore Performance @ 4 cores 2.28x = 1/[0.25 + (0.75 / 4)]• Multicore Performance @ lower f 1.94x = 2.28 x 0.85

• Multiprocessor: 1.5% power per 1% performance [+144% power, +94% perf]• Boosting V, f: 3% power per 1% performance [+(1.013-1) power, + (1.01-1) perf]

Cost

ECE 552 / CPS 550 55

CostNon-recurring Engineering (NRE)

- Dominated by engineer-years ($200K per engineer-year)- Mask costs (>$1M per spin)

Chip Cost- Depends on wafer and chip size, process maturity

Packaging Cost- Depends on number of pins (e.g., signal + power/ground)- Depends on thermal design point (e.g., heat sink)

Total Cost of Ownership- Capital costs (e.g., server procurement cost)- Operating costs (e.g., electricity)

ECE 552 / CPS 550 56

YieldWafers

- Integrated circuits built with multi-step chemical process on wafers- Cost per wafer depends on wafer size, number of steps

Chip (a.k.a. Die)- If chips are large, fewer chips per wafer- Larger chips have lower yield- Uniform defect density- Chip cost is proportional to area2-3

Process Variability- Yield is non-binary- Binning for speed grades- Binning for core count- Post-fabrication tuning with spares

Compatibility

ECE 552 / CPS 550 58

CompatibilityEarly 1960s IBM had 4 incompatible computers

- IBM 701, 650, 702, 1401- Different instruction set architecture- Different I/O system, secondary storage (magnetic taps, drums, disks)- Different assemblers, compilers, libraries- Different markets (e.g., business, scientific, real-time)

The need for compatibility motivated IBM 360.

ECE 552 / CPS 550 59

IBM 360: Design PrinciplesAmdahl, Blaauw and Brooks, “Architecture of the IBM System/360” 1964

1. Support growth and successor machines

2. Connect I/O devices with general method

3. Emphasize total performance- Evaluate programmability, answers per month not bits per second

4. Eliminate manual intervention- Machine must be capable of supervising itself

5. Reduce down time- Build hardware fault checking and fault location support

6. Facilitate assembly - Redundant I/O devices, memories for fault tolerance

7. Support flexibility- Some problems required floating-point words > 36bits

ECE 552 / CPS 550 60

IBM 360: General Purpose RegistersProcessor State

• 16, 32-bit general-purpose registers à use as index and base registers• 4, 64-bit floating-point registers• Program status word (PSW) with program counter (PC)• Condition codes, control flags

Data Formats• 8-bit bytes: the IBM 360 is why bytes are 8-bits long today!• 16-bit half-words• 32-bit words• 64-bit double-words

ECE 552 / CPS 550 61

IBM 360: Initial ImplementationModel 30 Model 70

Storage 8K - 64 KB 256K - 512 KBDatapath 8-bit 64-bitCircuit Delay 30 nsec/level 5 nsec/levelLocal Store Main Store Transistor RegistersControl Store 1 microsecond read Conventional circuits

Abstraction• IBM 360 ISA hid technologies across models

Milestone• The first true ISA designed as portable hardware-software interface• With minor modifications, ISA still survives today

Technology (seconds / cycle)5.2 GHz in IBM 45nm CMOS technology1.4 billion transistors in 512 sq-mm

Microarchitecture (cycle / instruction)64-bit virtual addressingOut-of-order, 3-way superscalar pipelineRedundant datapaths

L1 i-cache (64KB); L1 d-cache (128KB) d-cacheL2 cache (1.5MB), private, per-coreL3 cache (24MB), eDRAM

Power and ParallelismQuad-core designScales to 96 cores in one machine

ECE 552 / CPS 550 62

IBM z11: 47 Years Later

IBM HotChips 2010

ECE 552 / CPS 550 63

AcknowledgementsThese slides contain material developed and copyright by - Arvind (MIT)- Krste Asanovic (MIT/UCB)- Joel Emer (Intel/MIT)- James Hoe (CMU)- John Kubiatowicz (UCB)- Alvin Lebeck (Duke)- David Patterson (UCB)- Daniel Sorin (Duke)