
Page 1: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

CS 201 Computer Systems Programming
Chapter 3 "Architecture Overview"

Herbert G. Mayer, PSU CS
Status 10/9/2012

Page 2: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Syllabus

Computing History
Evolution of Microprocessor µP Performance
Processor Performance Growth
Key Architecture Messages
Code Sequences for Different Architectures
Dependencies, AKA Dependences
Score Board
References

Page 3: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Computing History: Before 1940

1643 Pascal's Arithmetic Machine
About 1660 Leibniz's Four Function Calculator
1710-1750 Punched Cards by Bouchon, Falcon, Jacquard
1810 Babbage's Difference Engine, unfinished; the first programmer ever was poet Lord Byron's daughter, after whom the language Ada was named: Lady Ada Lovelace
1835 Babbage's Analytical Engine, also unfinished
1920 Hollerith's Tabulating Machine to help with the census in the USA

Page 4: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Computing History: Decade of the 1940s

1939-1942 John Atanasoff built a programmable, electronic computer at Iowa State University
1936-1945 Konrad Zuse's Z3 and Z4, early electro-mechanical computers based on relays; a colleague advised the use of "vacuum tubes"
1946 John von Neumann's computer design of the stored program
1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff's ideas, at the University of Pennsylvania: Electronic Numerical Integrator and Computer, a 30-ton monster
1980s John Atanasoff officially received acknowledgment and the patent

Page 5: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Computing History: Decade of the 1950s

Univac uniprocessor based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert
Commercial systems sold by Remington Rand
Mark III computer

Decade of the 1960s

IBM's 360 family co-developed with GE, Siemens, et al.
Transistor replaces vacuum tube
Burroughs stack machines compete with GPR architectures
All still von Neumann architectures
1969 ARPANET
Cache and VMM developed, first at Manchester University

Page 6: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Computing History: Decade of the 1970s

Birth of the microprocessor at Intel, see Gordon Moore
High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series
Architecture advances: caches, VMM ubiquitous, since real memories were expensive
Intel 4004, Intel 8080, single-chip microprocessors
Programmable controllers
Mini-computers: PDP 11, HP 3000 16-bit computer
Height of Digital Equipment Corp. (DEC)
Birth of personal computers, which DEC misses

Page 7: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Computing History: Decade of the 1980s

Decrease of mini-computer use
32-bit computing even on minis
Architecture advances: superscalar, faster caches, larger caches
Multitude of supercomputer manufacturers
Compiler complexity: trace-scheduling, VLIW
Workstations common: Apollo, HP, DEC (Ken Olsen trying to catch up), Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.

Page 8: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Computing History: Decade of the 1990s

Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution
Powerful desktops
End of the mini-computer and of many supercomputer manufacturers
Microprocessors as powerful as early supercomputers
Consolidation of many computer companies into a few large ones
End of the Soviet Union marked the end of several supercomputer companies

Page 9: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Evolution of µP Performance (by: James C. Hoe @ CMU)

                             1970s        1980s       1990s         2000+
Transistor Count             10k-100k     100k-1M     1M-100M       1B
Clock Frequency              0.2-2 MHz    2-20 MHz    0.02-1 GHz    10 GHz
Instructions / cycle (ipc)   < 0.1        0.1-0.9     0.9-2.0       > 10 (?)
MIPS, FLOPS                  < 0.2        0.2-20      20-2,000      100,000

Page 10: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Processor Performance Growth

Moore's Law -- from Webopedia 8/27/2004:

"The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.

In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for at least another two decades."

Others coin a more general law, stating that "the circuit density increases predictably over time."

Page 11: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Processor Performance Growth

So far in 2012, Moore's Law has been holding true since ~1968.

Some Intel fellows believe that an end to Moore's Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semiconductor material.

This phenomenal growth is unknown in any other industry. For example, if a doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:

Cars would travel at 2,400,000 mph, and get 600,000 MPG
Air travel from LA to NYC would be at Mach 36,000, or take 0.5 seconds
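As a side note on the arithmetic (an illustrative calculation, not from the slides): doubling every 18 months corresponds to a growth factor of 2^(t / 1.5) after t years, i.e. about 2^(10/1.5) ≈ 100x per decade, and roughly 2^24 ≈ 16.8 million over the 36 years from 1965 to 2001.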

Page 12: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 1: Memory is Slow

The inner core of the processor, the CPU or the µP, is getting faster at a steady rate

Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories

It is not uncommon on an MP server that the processor has to wait >100 cycles before a memory access completes. On a multi-processor the bus protocol is more complex due to snooping, backing-off, arbitration, etc., thus the number of cycles to complete an access can grow that high

Page 13: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 1: Memory is Slow

Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option, due to the price differential between cache and regular RAM

Another way of seeing this: using solely reasonably-priced cache memories (say at <= 10 times the cost of regular memory) is not feasible: the resulting address space would be too small

Almost all intellectual efforts in computer architecture focus on reducing the performance impact of fast processors accessing slow memories

All else seems easy compared to this fundamental problem!

Page 14: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 1: Memory is Slow

[Figure: CPU vs. DRAM performance, 1980-2002, log scale 1 to 1000. µProc performance grows ~60%/yr ("Moore's Law"), DRAM ~7%/yr; the processor-memory performance gap grows ~50% per year. Source: David Patterson, UC Berkeley]

Page 15: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 2: Events Tend to Cluster

A strange thing happens during program execution: seemingly unrelated events tend to cluster

Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time this phenomenon is observed. Here one memory access seems independent of another, but they both happen to fall onto the same page (or working set of pages)

We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches and to increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial versus temporal locality
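A minimal C++ sketch of the two kinds of locality (illustrative only, not from the slides; m is assumed to hold N*N doubles in row-major order): the row-major loop touches consecutive addresses (spatial locality) and reuses sum every iteration (temporal locality); the column-major loop strides across the array and typically misses the cache far more often.

#include <vector>
#include <cstddef>

constexpr std::size_t N = 1024;

// Row-major traversal: consecutive elements share cache lines (spatial locality).
double sum_row_major(const std::vector<double>& m) {
    double sum = 0.0;                       // 'sum' is reused every iteration (temporal locality)
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            sum += m[i * N + j];            // stride 1 through memory
    return sum;
}

// Column-major traversal of the same row-major data: the stride is N doubles,
// so nearly every access lands on a different cache line.
double sum_column_major(const std::vector<double>& m) {
    double sum = 0.0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            sum += m[i * N + j];            // stride N through memory
    return sum;
}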

Page 16: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 2: Events Tend to Cluster

Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries

An incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search

The programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
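A small illustrative C++ sketch (an assumed example, not from the slides) of how a poor hash function clusters keys: hashing identifiers by their first character alone sends many keys to the same bucket, so a lookup in that bucket degenerates toward a sequential scan.

#include <iostream>
#include <list>
#include <string>
#include <vector>

constexpr std::size_t BUCKETS = 128;

// Deliberately poor hash: depends only on the first character, so "temp1",
// "temp2", "total", "tos", "tcc" all cluster into the bucket for 't'.
std::size_t bad_hash(const std::string& key) {
    return key.empty() ? 0 : static_cast<unsigned char>(key[0]) % BUCKETS;
}

int main() {
    std::vector<std::list<std::string>> table(BUCKETS);
    for (const char* id : {"temp1", "temp2", "total", "tos", "tcc"})
        table[bad_hash(id)].push_back(id);            // all five land in one bucket

    // A lookup now scans the whole chain: effectively a sequential search.
    const std::string wanted = "tcc";
    for (const std::string& id : table[bad_hash(wanted)])
        if (id == wanted) { std::cout << "found " << wanted << "\n"; break; }
}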

Page 17: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 2: Events Tend to Cluster

Clustering happens in all diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by having a copy of frequently used data in a faster memory unit, it happens that a small cache suffices

This is due to Data Locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future

Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon

Page 18: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) increases performance and thus generally "is good"

Other performance parameters, such as memory access speed, peripheral access, etc., do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable

This comes at the cost of higher current and thus more heat generated in the identical physical space, the geometry (the real estate) of the silicon processor or chipset

But the silicon part acts like a conductor that conducts better as it gets warmer (negative temperature coefficient resistor, or NTC). Since the power supply is a constant-current source, a lower resistance causes a lower voltage, shown as VDroop in the figure below

Page 19: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 3: Heat is Bad

[Figure: VDroop effect referenced on the previous slide; image not preserved in this transcript]

Page 20: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 3: Heat is Bad

This in turn means the voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part

Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep states of the silicon part (processor as well as chipset), and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time

It is good that, to date, silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Otherwise CPUs would become larger, more expensive, and above all: hotter.

Page 21: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 4: Resource Replication

Architects cannot increase clock speed beyond physical limitations

One cannot decrease the die size beyond evolving technology

Yet speed improvements are desired, and achieved

This conflict can partly be overcome with replicated resources! But be careful!

Page 22: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 4: Resource Replication

The key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used before it has been computed

Compiler optimization technology calls this use-def dependence (short for use-definition dependence), AKA true dependence, AKA data dependence

The goal is to search for program portions that are independent of one another. This can be at multiple levels of focus:

Page 23: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 4: Resource Replication

At the very low level of registers, at the machine level – done by HW

At the low level of individual machine instructions – done by HW

At the medium level of subexpressions in a program – done by the compiler

At the higher level of distinct statements in a high-level program – done by an optimizing compiler or by the programmer

Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results – done by the user

Page 24: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Message 4: Resource Replication

Whenever program portions are independent of one another, they can be computed at the same time: in parallel

Architects provide resources for this parallelism

Compilers need to uncover opportunities for parallelism

If two actions are independent of one another, they can be computed simultaneously

Provided that HW resources exist, that the absence of dependence has been proven, and that the independent execution paths are scheduled on these replicated HW resources! Generally this is a complex undertaking!

Page 25: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 1 for Different Architectures

Example 1: Object Code Sequence Without Optimization

Strict left-to-right translation, no smarts in mapping

Consider non-commutative subtraction and division operators

No common subexpression elimination (CSE), and no register reuse

Conventional operator precedence

For Single Accumulator SAA, Three-Address GPR, and Stack Architectures

Sample source: d = ( a + 3 ) * b - ( a + 3 ) / c

Page 26: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 1 for Different Architectures

No   Single-Accumulator   Three-Address GPR (dest, op1 op op2)   Stack Machine
 1   ld a                 add  r1, a, #3                         push a
 2   add #3               mult r2, r1, b                         pushlit #3
 3   mult b               add  r3, a, #3                         add
 4   st temp1             div  r4, r3, c                         push b
 5   ld a                 sub  d, r2, r4                         mult
 6   add #3                                                      push a
 7   div c                                                       pushlit #3
 8   st temp2                                                    add
 9   ld temp1                                                    push c
10   sub temp2                                                   div
11   st d                                                        sub
12                                                               pop d

Page 27: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 1 for Different Architectures

Three-address code looks shortest, w.r.t. number of instructions

Maybe an optical illusion; must also consider the number of bits for instructions

Must consider the number of I-fetches and operand fetches

Must consider the total number of stores

Numerous memory accesses on SAA due to temporary values held in memory

Most memory accesses on SA, since everything requires a memory access

The Three-Address architecture is immune to the commutativity constraint, since operands may be placed in registers in either order

Important architectural feature? Only if SW cannot handle this; the compiler can

No need for reverse-operation opcodes for the Three-Address architecture

Decide in the Three-Address architecture how to encode operand types

Numerous stack instructions, i.e. many bits for opcodes, since each operand fetch is a separate instruction

Page 28: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 2 for Different Architectures

This time we eliminate the common subexpression

The compiler handles left-to-right order for non-commutative operators on SAA

Better code for: d = ( a+3 ) * b - ( a+3 ) / c
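As a language-level illustration of the same optimization (an assumed C++ sketch, not from the slides), common subexpression elimination computes a + 3 once and reuses it, exactly what the optimized sequences on the next slide do with temp1 / r1 / dup:

// Before CSE: (a + 3) is evaluated twice.
double d_naive(double a, double b, double c) {
    return (a + 3) * b - (a + 3) / c;
}

// After CSE: the common subexpression is computed once into a temporary.
double d_cse(double a, double b, double c) {
    const double t = a + 3;          // single evaluation of the common subexpression
    return t * b - t / c;
}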

Page 29: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 2 for Different Architectures

No   Single-Accumulator   Three-Address GPR (dest, op1 op op2)   Stack Machine
 1   ld a                 add  r1, a, #3                         push a
 2   add #3               mult r2, r1, b                         pushlit #3
 3   st temp1             div  r1, r1, c                         add
 4   div c                sub  d, r2, r1                         dup
 5   st temp2                                                    push b
 6   ld temp1                                                    mult
 7   mult b                                                      xch
 8   sub temp2                                                   push c
 9   st d                                                        div
10                                                               sub
11                                                               pop d

Page 30: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 2 for Different Architectures

The Single Accumulator Architecture (SAA), even optimized, still needs temporary storage; it uses temp1 for the common subexpression; it has no other register!!

SAA could use a negate instruction or a reverse subtract

Register use is optimized for the Three-Address architecture; but dup and xch are newly added instructions

The common subexpression is optimized on the Stack Machine by duplicating, exchanging, etc.

Instruction count reduced 20% for Three-Address, 18% for SAA, only 8% for the Stack Machine
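Where those percentages come from (derived from the instruction counts in the two code tables above): the Three-Address sequence shrinks from 5 to 4 instructions (20%), the SAA sequence from 11 to 9 (roughly 18%), and the Stack Machine sequence from 12 to 11 (roughly 8%).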

Page 31: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 3 for Different Architectures

Analyze similar source expressions but with reversed operator precedence

One operator sequence associates right-to-left, due to precedence

The compiler uses commutativity

The other associates left-to-right, due to explicit parentheses

Use a simple-minded code model: no cache, no optimization

Will there be advantages/disadvantages due to architecture?

Expression 1 is: e = a + b * c ^ d

Page 32: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 3 for Different Architectures

Expression 1 is: e = a + b * c ^ d

No   Single-Accumulator   Three-Address GPR (dest, op1 op op2)   Stack Machine (implied operands)
 1   ld c                 expo r1, c, d                          push a
 2   expo d               mult r1, b, r1                         push b
 3   mult b               add  e, a, r1                          push c
 4   add a                                                       push d
 5   st e                                                        expo
 6                                                               mult
 7                                                               add
 8                                                               pop e

Expression 2 is: f = ( ( g + h ) * i ) ^ j; here the operators associate left-to-right due to the parentheses

Page 33: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code 3 for Different Architectures

Expression 2 is: f = ( ( g + h ) * i ) ^ j

No   Single-Accumulator   Three-Address GPR (dest, op1 op op2)   Stack Machine (implied operands)
 1   ld g                 add  r1, g, h                          push g
 2   add h                mult r1, i, r1                         push h
 3   mult i               expo f, r1, j                          add
 4   expo j                                                      push i
 5   st f                                                        mult
 6                                                               push j
 7                                                               expo
 8                                                               pop f

Observations, Interaction of Precedence and Architecture:
Software eliminates constraints imposed by precedence, by looking ahead
Execution times are identical for the two different expressions on the same architecture, unless blurred by a secondary effect; see the cache example below
Conclusion: all architectures handle arithmetic and logic operations well

Page 34: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

A Stack Machine with no registers is inherently slow: memory accesses!!!

Implement a few top-of-stack elements via HW shadow registers, a cache

Measure equivalent code sequences with/without consideration for the cache

The top-of-stack register tos points to the last valid word on the physical stack

Two shadow registers may hold 0, 1, or 2 true top words

The top-of-stack cache counter tcc specifies the number of shadow registers in use

Thus tos plus tcc jointly specify the true top of stack
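A minimal C++ sketch of this model (an illustrative assumption of how tos and tcc interact, not code from the course; it uses a simple spill-on-overflow policy rather than the "keep tcc 50% full" policy mentioned later): the two shadow registers hold the topmost 0, 1, or 2 words, a push spills to memory only when both shadow registers are full, and a pop refills from memory only when they are empty.

#include <array>
#include <vector>
#include <cstdint>
#include <stdexcept>

// Stack with two top-of-stack shadow registers.
// 'stack' is the slow memory part; its last element is the word tos points at.
// 'shadow[0..tcc-1]' hold the true topmost words, newest at shadow[tcc-1].
struct StackCache {
    std::vector<int32_t> stack;       // words in memory
    std::array<int32_t, 2> shadow{};
    int tcc = 0;                      // 0, 1, or 2 shadow registers in use

    void push(int32_t v) {
        if (tcc == 2) {                     // cache full: spill oldest word to memory
            stack.push_back(shadow[0]);     // this is the extra memory access (overflow case)
            shadow[0] = shadow[1];
            tcc = 1;
        }
        shadow[tcc++] = v;                  // fast path: shadow register only
    }

    int32_t pop() {
        if (tcc == 0) {                     // cache empty: refill from memory (underflow case)
            if (stack.empty()) throw std::runtime_error("stack underflow");
            shadow[0] = stack.back();
            stack.pop_back();
            tcc = 1;
        }
        return shadow[--tcc];
    }

    void add() { int32_t b = pop(), a = pop(); push(a + b); }   // sample ALU op on the true top
};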

Page 35: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

[Figure: the physical stack in memory, with tos pointing to the last valid word; above it sit 2 tos shadow registers, of which tcc = 0, 1, or 2 are in use and the rest are free]

Page 36: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

Timings for push, pushlit, add, and pop operations depend on tcc

Operations in shadow registers are fastest, typically 1 cycle, including the register access and the operation itself

Generally, a further memory access adds 2 cycles

For stack changes use some defined policy, e.g. keep tcc 50% full

The table below refines timings for the stack with shadow registers

Note: push x into a cache with free space requires 2 cycles: the cache adjustment is done at the same time as the memory fetch

Page 37: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

operation    cycles   tcc before   tcc after   tos change   comment
add          1        tcc = 2      tcc = 1     no change
add          1+2      tcc = 1      tcc = 1     tos--        underflow?
add          1+2+2    tcc = 0      tcc = 1     tos -= 2     underflow?
push x       2        tcc = 0,1    tcc++       no change    tcc update in parallel
push x       2+2      tcc = 2      tcc = 2     tos++        overflow?
pushlit #3   1        tcc = 0,1    tcc++       no change
pushlit #3   1+2      tcc = 2      tcc = 2     tos++        overflow?
pop y        2        tcc = 1,2    tcc--       no change
pop y        2+2      tcc = 0      tcc = 0     tos--        underflow?

Page 38: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

Code emission for: a + b * c ^ ( d + e * f ^ g )

Let + and * be commutative, by language rule

The architecture here has 2 shadow registers, and the compiler exploits this

Assume an initially empty 2-word cache

Page 39: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

 #   1: Left-to-Right   cycles   2: Exploit Cache        cycles
 1   push a             2        push f                  2
 2   push b             2        push g                  2
 3   push c             4        expo                    1
 4   push d             4        push e                  2
 5   push e             4        mult                    1
 6   push f             4        push d                  2
 7   push g             4        add                     1
 8   expo               1        push c                  2
 9   mult               3        r_expo = swap + expo    1
10   add                3        push b                  2
11   expo               3        mult                    1
12   mult               3        push a                  2
13   add                3        add                     1

Page 40: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Code For Stack Architecture

Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance

Code emission with shadow register consideration costs 20 cycles

The true penalty for memory access is worse in practice

A tremendous speed-up is always possible when fixing a system with severe flaws

The return on investment for 2 registers is twice the original performance

Such a strong speedup is an indicator that the starting architecture was poor

A Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance

Note that indexing, looping, indirection, and call/return are not addressed here

Page 41: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Register Dependencies

Inter-instruction dependencies, also known as dependences, arise between registers being defined and used

One instruction computes a result into a register (or memory), and another instruction needs that result from the register (or that memory location)

Or, one instruction uses a datum; only after this use may that same datum be recomputed

Page 42: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Register Dependencies

True Dependence, AKA Data Dependence: Read after Write, RAW
  r3 ← r1 op r2
  r5 ← r3 op r4

Anti-Dependence, not a true dependence, parallelize under the right conditions: Write after Read, WAR
  r3 ← r1 op r2
  r1 ← r5 op r4

Output Dependence: Write after Write, WAW, with a use in between
  r3 ← r1 op r2
  r5 ← r3 op r4
  r3 ← r6 op r7
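The same three hazards at the source level (an illustrative C++ sketch, not from the slides); a compiler or out-of-order core must respect the RAW edge but can remove the WAR and WAW edges by renaming:

void dependences(int r1, int r2, int r4, int r5, int r6, int r7) {
    int r3 = r1 + r2;   // defines r3
    int t1 = r3 * r4;   // RAW: reads the r3 just written (true dependence)
    r1     = r5 - r4;   // WAR: overwrites r1 after the first statement read it (anti-dependence)
    r3     = r6 + r7;   // WAW: redefines r3 after an intervening use (output dependence)
    (void)t1; (void)r3; // suppress unused-result warnings; keeps the example self-contained
}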

Page 43: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Register Dependencies

Control Dependence:

  if ( condition1 ) {
      r3 = r1 op r2;
  } else {                 // see the jump here?
      r5 = r3 op r4;
  } // end if
  write( r3 );

Page 44: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Register Renaming

Only a true dependence is a real dependence, AKA data dependence

The others are artifacts of insufficient resources, generally insufficient register resources

That means: if only more registers were available, then replacing the conflicting registers with new ones would make the conflicts disappear

Anti- and output-dependences are such false dependences

Page 45: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Register Renaming

Original Dependences:            Renamed Situation, Dependences Gone:

L1:  r1 ← r2 op r3               r10 ← r2 op r30    -- r30 has r3 copy
L2:  r4 ← r1 op r5               r4  ← r10 op r5
L3:  r1 ← r3 op r6               r1  ← r30 op r6
L4:  r3 ← r1 op r7               r3  ← r1 op r7

The dependences before:          after:

L1, L2 true-dep with r1          L1, L2 true-dep with r10
L1, L3 output-dep with r1        L3, L4 true-dep with r1
L1, L4 anti-dep with r3
L3, L4 true-dep with r1
L2, L3 anti-dep with r1
L3, L4 anti-dep with r3

Page 46: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Register Renaming

With additional or renamed regs, the new code runs in half the time!

First: compute into r10 instead of r1, at no cost

Also: compute into r30, with no added copy operations, just more registers a priori

Then the regs that are live afterwards are: r1, r3, r4

While r10 and r30 are don't-cares

Page 47: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Score Board

The score-board is an array of programmable bits sb[]

It manages HW resources, specifically registers

A single-bit array, any one bit associated with one specific register

Association by index, i.e. by name: sb[i] belongs to reg ri

Only if sb[i] = 0 does register ri have valid data

If sb[i] = 0 then register ri is NOT in the process of being written

If bit i is set, i.e. if sb[i] = 1, then that register ri has stale data

Initially all sb[*] are stale, i.e. set to 1

Page 48: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Score Board

Execution constraints for: rd ← rs op rt

If sb[s] or sb[t] is set → RAW dependence, hence stall the computation; wait until both sb[s] and sb[t] are 0

If sb[d] is set → WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes determine to use another register instead of rd

Else dispatch the instruction immediately
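A compact C++ sketch of these issue rules (an illustrative model assuming a simple 32-register machine, not code from the course): an instruction rd ← rs op rt may issue only when its sources are valid and its destination write is not pending, and issuing marks the destination as in flight until write-back clears the bit.

#include <bitset>

// One scoreboard bit per register: 1 = stale / being written, 0 = valid.
struct ScoreBoard {
    std::bitset<32> sb;

    ScoreBoard() { sb.set(); }               // initially all registers are stale (all bits 1)

    // Check the execution constraints for  rd <- rs op rt.
    bool can_issue(unsigned d, unsigned s, unsigned t) const {
        if (sb[s] || sb[t]) return false;    // RAW hazard: a source is still being written
        if (sb[d])          return false;    // WAW hazard: a write to the destination is pending
        return true;                         // no hazard: dispatch immediately
    }

    void issue(unsigned d)      { sb[d] = 1; }   // destination now in flight (stale)
    void write_back(unsigned d) { sb[d] = 0; }   // result arrived: register valid again
};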

Page 49: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

Score Board

To allow out-of-order (ooo) execution, upon computing the value of rd:

Update rd, and clear sb[d]

For uses (references), HW may use any register i whose sb[i] is 0

For definitions (assignments), HW may set any register j whose sb[j] is 0

Independent of the original order in which the source program was written, i.e. possibly ooo

Page 50: CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”

References

1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
2. Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations
3. http://en.wikipedia.org/wiki/Moore's_law
4. C. A. R. Hoare's comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf
5. Gibbons, P. B., and Steven Muchnick [1986]. "Efficient Instruction Scheduling for a Pipelined Architecture", ACM SIGPLAN Notices, Proceedings of the '86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16
6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/
7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm
8. Words of wisdom: http://www.cs.yale.edu/quotes.html
9. John von Neumann's computer design: A. H. Taub (ed.), "Collected Works of John von Neumann", vol. 5, pp. 34-79, The MacMillan Co., New York 1963