CS 201 Computer Systems Programming, Chapter 3 “Architecture Overview”
Herbert G. Mayer, PSU CS, Status 10/9/2012
1
CS 201 Computer Systems Programming
Chapter 3 “Architecture Overview”
Herbert G. Mayer, PSU CS
Status 10/9/2012
2
Syllabus
Computing History
Evolution of Microprocessor µP Performance
Processor Performance Growth
Key Architecture Messages
Code Sequences for Different Architectures
Dependencies, AKA Dependences
Score Board
References
3
Computing History
Before 1940
1643 Pascal’s Arithmetic Machine
About 1660 Leibniz Four Function Calculator
1710-1750 Punched Cards by Bouchon, Falcon, Jacquard
1810 Babbage Difference Engine, unfinished; the first programmer ever in the world was poet Lord Byron’s daughter, after whom the language Ada was named: Lady Ada Lovelace
1835 Babbage Analytical Engine, also unfinished
1920 Hollerith Tabulating Machine to help with census in the USA
4
Computing History
Decade of the 1940s
1939-1942 John Atanasoff built programmable, electronic computer at Iowa State University
1936-1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; colleague advised use of “vacuum tubes”
1946 John von Neumann’s computer design of stored program
1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff’s ideas, built at University of Pennsylvania: Electronic Numerical Integrator and Computer, 30 ton monster
1980s John Atanasoff officially got acknowledgment and patent
5
Computing History
Decade of the 1950s
Univac Uniprocessor based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert
Commercial systems sold by Remington Rand
Mark III computer
Decade of the 1960s
IBM’s 360 family co-developed with GE, Siemens, et al.
Transistor replaces vacuum tube
Burroughs stack machines, compete with GPR architectures
All still von Neumann architectures
1969 ARPANET
Cache and VMM developed, first at Manchester University
6
Computing History
Decade of the 1970s
Birth of Microprocessor at Intel, see Gordon Moore
High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series
Architecture advances: Caches, VMM ubiquitous, since real memories were expensive
Intel 4004, Intel 8080, single-chip microprocessors
Programmable controllers
Mini-computers, PDP 11, HP 3000 16-bit computer
Height of Digital Equipment Corp. (DEC)
Birth of personal computers, which DEC misses
7
Computing History
Decade of the 1980s
Decrease of mini-computer use
32-bit computing even on minis
Architecture advances: superscalar, faster caches, larger caches
Multitude of Supercomputer manufacturers
Compiler complexity: trace-scheduling, VLIW
Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.
8
Computing History
Decade of the 1990s
• Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution
• Powerful desktops
• End of mini-computer and of many super-computer manufacturers
• Microprocessors as powerful as early supercomputers
• Consolidation of many computer companies into a few large ones
• End of Soviet Union marked the end of several supercomputer companies
9
Evolution of µP Performance (by: James C. Hoe @ CMU)
                         1970s       1980s      1990s        2000+
Transistor Count         10k-100k    100k-1M    1M-100M      1B
Clock Frequency          0.2-2 MHz   2-20 MHz   0.02-1 GHz   10 GHz
Instructions/cycle: ipc  < 0.1       0.1-0.9    0.9-2.0      > 10 (?)
MIPS, FLOPS              < 0.2       0.2-20     20-2,000     100,000
10
Processor Performance Growth
Moore’s Law, from Webopedia 8/27/2004:
“The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.
In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for at least another two decades.”
Others coin a more general law, stating that “the circuit density increases predictably over time.”
11
Processor Performance Growth
As of 2012, Moore’s Law has held true since ~1968.
Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semi-conductor material.
This phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:
Cars would travel at 2,400,000 mph, and get 600,000 mpg
Air travel from LA to NYC would be at Mach 36,000, or take 0.5 seconds
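The compounding behind "doubling every 18 months" can be sketched in a few lines. This is a minimal illustration only: the function name `growth_factor` and the sample year spans are assumptions chosen for the example, and the sketch does not attempt to reproduce the car or air-travel figures above.

```python
def growth_factor(years, doubling_period_years=1.5):
    """Factor by which a quantity grows if it doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Hypothetical time spans, chosen only to show how fast doubling compounds.
    for years in (1.5, 3, 15, 30):
        print(f"{years:>5} years -> factor {growth_factor(years):,.0f}")
```

Thirty years at this pace already gives 20 doublings, i.e. a factor of about one million.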
12
Message 1: Memory is Slow
The inner core of the processor, the CPU or the µP, is getting faster at a steady rate
Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories
On an MP (multi-processor) server it is not uncommon that a processor has to wait >100 cycles before a memory access completes. On a multi-processor the bus protocol is more complex due to snooping, backing-off, arbitration, etc., which is why the number of cycles to complete an access can grow so high
13
Message 1: Memory is Slow
Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option, due to the price differential between cache and regular RAM
Another way of seeing this: Using solely reasonably-priced cache memories (say at <= 10 times the cost of regular memory) is not feasible: resulting address space would be too small
Almost all intellectual efforts in computer architecture focus on reducing the performance impact of fast processors accessing slow memories
All else seems easy compared to this fundamental problem!
14
Message 1: Memory is Slow
[Figure: Processor-Memory Performance Gap. Performance (log scale, 1 to 1000) over Time (1980 to 2002). CPU (µProc) improves 60%/yr., following “Moore’s Law”; DRAM improves 7%/yr.; the Processor-Memory Performance Gap grows 50% / year. Source: David Patterson, UC Berkeley]
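The 50% per year figure follows from the two growth rates: if processor performance grows 60%/yr. and DRAM performance 7%/yr., their ratio grows by 1.60 / 1.07 each year. A small sketch of that arithmetic, with function names that are assumptions made for this example:

```python
def gap_growth_per_year(cpu_rate=0.60, dram_rate=0.07):
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Relative CPU/DRAM performance ratio after the given number of years."""
    return ((1.0 + cpu_rate) / (1.0 + dram_rate)) ** years


if __name__ == "__main__":
    print(f"gap grows about {gap_growth_per_year():.1%} per year")
    print(f"relative gap after 20 years: about {gap_after(20):,.0f}x")
```

The yearly ratio is roughly 1.495, which is where the rounded "grows 50% / year" comes from.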
15
Message 2: Events Tend to Cluster
A strange thing happens during program execution: Seemingly unrelated events tend to cluster
Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time this phenomenon is observed. Here one memory access seems independent of another, but they both happen to fall onto the same page (or working set of pages)
We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches and increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial versus temporal locality
16
Message 2: Events Tend to Cluster
Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries
An incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key, happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search
The programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
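Clustering in a hash table can be illustrated with a deliberately weak hash function. The sketch below is not from the lecture: the two hash functions, the identifier set `KEYS`, and the table size are assumptions picked to make the effect visible.

```python
from collections import Counter


def length_hash(key, table_size):
    """Deliberately weak hash: uses only the key length, so keys cluster."""
    return len(key) % table_size


def poly_hash(key, table_size):
    """Simple polynomial hash that mixes in every character of the key."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % table_size
    return h


def longest_chain(keys, hash_fn, table_size):
    """Number of keys that land in the most crowded table entry."""
    counts = Counter(hash_fn(k, table_size) for k in keys)
    return max(counts.values())


# Hypothetical identifiers var0 .. var99 and a 64-entry table.
KEYS = [f"var{i}" for i in range(100)]
TABLE_SIZE = 64

if __name__ == "__main__":
    print("length_hash longest chain:", longest_chain(KEYS, length_hash, TABLE_SIZE))
    print("poly_hash   longest chain:", longest_chain(KEYS, poly_hash, TABLE_SIZE))
```

With the length-based hash, all 90 five-character identifiers collide in one entry, so a lookup there degenerates into a sequential search; the polynomial hash spreads the same keys over far shorter chains.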
17
Message 2: Events Tend to Cluster
Clustering happens in diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by having a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices
This is due to Data Locality (spatial and temporal). Data that have been accessed recently will again be accessed in the near future, or at least data that live close by will be accessed in the near future
Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon
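Why a small cache suffices when accesses cluster can be sketched with a toy direct-mapped cache model. Everything here is an assumption made for illustration (8 lines of 4 words, word addresses, the three access patterns); it does not model any particular processor.

```python
def hit_rate(addresses, num_lines=8, words_per_line=4):
    """Hit rate of a toy direct-mapped cache over a sequence of word addresses."""
    tags = [None] * num_lines          # tag currently stored in each cache line
    hits = 0
    total = 0
    for addr in addresses:
        line_addr = addr // words_per_line
        index = line_addr % num_lines  # which cache line this address maps to
        tag = line_addr // num_lines   # identifies the memory line held there
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: fetch the line, evict the old one
        total += 1
    return hits / total


sequential = range(1024)                              # spatial locality
loop = [a for _ in range(10) for a in range(16)]      # temporal locality
conflicting = range(0, 1024, 32)                      # no locality, same line

if __name__ == "__main__":
    for name, pattern in (("sequential", sequential), ("loop", loop),
                          ("conflicting", conflicting)):
        print(f"{name:12s} hit rate = {hit_rate(pattern):.3f}")
```

In this model the sequential scan hits on 3 of every 4 accesses, the small loop hits almost always after its first pass, and the strided pattern that maps every access to the same line never hits.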
18
Message 3: Heat is Bad
Clocking a processor fast (e.g. > 3-5 GHz) increases performance and thus generally “is good”
Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable
This comes at the cost of higher current and thus more heat generated in the identical physical space, the geometry (the real-estate) of the silicon processor or chipset
But the silicon part conducts better as it gets warmer (it acts like a negative temperature coefficient resistor, or NTC). Since the power-supply is a constant-current source, a lower resistance causes lower voltage, shown as VDroop in the figure below
19
Message 3: Heat is Bad
20
Message 3: Heat is Bad
This in turn means that voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part
Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep-states of the silicon part (processor as well as chip-set), and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time
It is good that to date silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.
21
Message 4: Resource Replication
Architects cannot increase clock speed beyond physical limitations
One cannot decrease the die size beyond evolving technology
Yet speed improvements are desired, and achieved
This conflict can partly be overcome with replicated resources! But be careful!
22
Message 4: Resource Replication
Key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used before it has been computed
Compiler optimization technology calls this use-def dependence (short for use-definition dependence), AKA true dependence, AKA data dependence
Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus:
23
Message 4: Resource Replication
At the very low level of registers, at the machine level: done by HW
At the low level of individual machine instructions: done by HW
At the medium level of subexpressions in a program: done by compiler
At the higher level of distinct statements in a high-level program: done by optimizing compiler or by programmer
Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results: done by the user
24
Message 4: Resource Replication
Whenever program portions are independent of one another, they can be computed at the same time: in parallel
Architects provide resources for this parallelism
Compilers need to uncover opportunities for parallelism
If two actions are independent of one another, they can be computed simultaneously
Provided that HW resources exist, that the absence of dependence has been proven, and that the independent execution paths are scheduled on these replicated HW resources! Generally this is a complex undertaking!
25
Code 1 for Different Architectures
Example 1: Object Code Sequence Without Optimization
Strict left-to-right translation, no smarts in mapping
Consider non-commutative subtraction and division operators
No common subexpression elimination (CSE), and no register reuse
Conventional operator precedence
For Single Accumulator (SAA), Three-Address GPR, Stack Architectures
Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c
26
Code 1 for Different Architectures
No  Single-Accumulator  Three-Address GPR      Stack Machine
                        dest ← op1 op op2
1   ld a                add r1, a, #3          push a
2   add #3              mult r2, r1, b         pushlit #3
3   mult b              add r3, a, #3          add
4   st temp1            div r4, r3, c          push b
5   ld a                sub d, r2, r4          mult
6   add #3                                     push a
7   div c                                      pushlit #3
8   st temp2                                   add
9   ld temp1                                   push c
10  sub temp2                                  div
11  st d                                       sub
12                                             pop d
27
Code 1 for Different Architectures
Three-address code looks shortest, w.r.t. number of instructions
Maybe an optical illusion; must also consider number of bits for instructions
Must consider number of I-fetches, operand fetches
Must consider total number of stores
Numerous memory accesses on SAA due to temporary values held in memory
Most memory accesses on the Stack Architecture (SA), since everything requires a memory access
Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order
Important architectural feature? Only if SW cannot handle this; compiler can
No need for reverse-operation opcodes for Three-Address architecture
Decide in Three-Address architecture how to encode operand types
Numerous stack instructions, i.e. many bits for opcodes, since each operand fetch is a separate instruction
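To check that the 12-instruction stack sequence really computes ( a + 3 ) * b - ( a + 3 ) / c, it can be run on a tiny interpreter. This is a sketch under assumptions: the interpreter, the operand order for non-commutative operators (the deeper stack entry is the left operand), and the sample values for a, b, c are choices made for this illustration, not part of the slides.

```python
def run_stack_program(program, memory):
    """Execute stack-machine instructions; memory maps variable names to values."""
    stack = []
    binary_ops = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mult": lambda x, y: x * y,
        "div": lambda x, y: x / y,
    }
    for instr in program:
        op, *args = instr.split()
        if op == "push":                       # push a variable from memory
            stack.append(memory[args[0]])
        elif op == "pushlit":                  # push a literal such as #3
            stack.append(int(args[0].lstrip("#")))
        elif op == "pop":                      # store top of stack to memory
            memory[args[0]] = stack.pop()
        else:                                  # binary operator on the top two entries
            right = stack.pop()
            left = stack.pop()
            stack.append(binary_ops[op](left, right))
    return memory


PROGRAM = [
    "push a", "pushlit #3", "add", "push b", "mult",
    "push a", "pushlit #3", "add", "push c", "div",
    "sub", "pop d",
]

# Hypothetical input values.
memory = run_stack_program(PROGRAM, {"a": 5, "b": 2, "c": 4})

if __name__ == "__main__":
    print("d =", memory["d"])
```

For a = 5, b = 2, c = 4 the expression is 8 * 2 - 8 / 4, so the interpreter stores 14 into d.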
28
Code 2 for Different Architectures
This time we eliminate the common subexpression
Compiler handles left-to-right order for non-commutative operators on SAA
Better code for: d = ( a+3 ) * b - ( a+3 ) / c
29
Code 2 for Different Architectures
No  Single-Accumulator  Three-Address GPR      Stack Machine
                        dest ← op1 op op2
1   ld a                add r1, a, #3          push a
2   add #3              mult r2, r1, b         pushlit #3
3   st temp1            div r1, r1, c          add
4   div c               sub d, r2, r1          dup
5   st temp2                                   push b
6   ld temp1                                   mult
7   mult b                                     xch
8   sub temp2                                  push c
9   st d                                       div
10                                             sub
11                                             pop d
30
Code 2 for Different Architectures
Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register!!
SAA could use negate instruction or reverse subtract
Register use optimized for Three-Address architecture
Common subexpression optimized on Stack Machine by duplicating, exchanging, etc.; but dup and xch are newly added instructions
Instruction count reduced by 20% for Three-Address (5 to 4), 18% for SAA (11 to 9), only 8% for Stack Machine (12 to 11)
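The three percentages can be recomputed from the instruction counts in the Code 1 and Code 2 tables (5 to 4, 11 to 9, 12 to 11). The helper name and the dictionary layout below are assumptions made for this short sketch.

```python
# (instructions without CSE, instructions with CSE), counted from the two tables
COUNTS = {
    "Three-Address": (5, 4),
    "Single-Accumulator": (11, 9),
    "Stack Machine": (12, 11),
}


def reduction_percent(before, after):
    """Relative reduction in instruction count, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for arch, (before, after) in COUNTS.items():
        print(f"{arch:20s} {before:2d} -> {after:2d}: "
              f"{reduction_percent(before, after):.0f}% fewer instructions")
```

Note that this counts instructions only; as stated for Code 1, instruction bits, fetches, and stores would also have to be compared.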
31
Code 3 for Different Architectures
Analyze similar source expressions but with reversed operator precedence
One operator sequence associates right-to-left, due to precedence
Compiler uses commutativity
The other left-to-right, due to explicit parentheses
Use simple-minded code model: no cache, no optimization
Will there be advantages/disadvantages due to architecture?
Expression 1 is: e ← a + b * c ^ d
32
Code 3 for Different Architectures
Expression 1 is: e ← a + b * c ^ d
No  Single-Accumulator  Three-Address GPR      Stack Machine
                        dest ← op1 op op2      Implied Operands
1   ld c                expo r1, c, d          push a
2   expo d              mult r1, b, r1         push b
3   mult b              add e, a, r1           push c
4   add a                                      push d
5   st e                                       expo
6                                              mult
7                                              add
8                                              pop e
Expression 2 is: f ← ( ( g + h ) * i ) ^ j
Here the operators associate left-to-right due to parentheses
33
Code 3 for Different Architectures
Expression 2 is: f ← ( ( g + h ) * i ) ^ j
No  Single-Accumulator  Three-Address GPR      Stack Machine
                        dest ← op1 op op2      Implied Operands
1   ld g                add r1, g, h           push g
2   add h               mult r1, i, r1         push h
3   mult i              expo f, r1, j          add
4   expo j                                     push i
5   st f                                       mult
6                                              push j
7                                              expo
8                                              pop f
Observations, Interaction of Precedence and Architecture
Software eliminates constraints imposed by precedence by looking ahead
Execution times identical for the 2 different expressions on the same architecture, unless blurred by secondary effect; see cache example below
Conclusion: all architectures handle arithmetic and logic operations well
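The stack-machine columns for both expressions are what a plain post-order walk of the expression tree produces, once precedence and parentheses have been resolved into that tree. A minimal sketch, assuming the tree is already given as nested tuples (the tuple encoding, `OPCODES`, and the function names are hypothetical and only serve this example):

```python
OPCODES = {"+": "add", "*": "mult", "^": "expo"}


def emit(expr, code):
    """Post-order walk: operands first, then the operator."""
    if isinstance(expr, str):          # leaf: a variable name
        code.append(f"push {expr}")
    else:                              # inner node: (operator, left, right)
        op, left, right = expr
        emit(left, code)
        emit(right, code)
        code.append(OPCODES[op])
    return code


def compile_assignment(target, expr):
    """Stack code that evaluates expr and stores the result into target."""
    return emit(expr, []) + [f"pop {target}"]


# e <- a + b * c ^ d: precedence makes the tree lean to the right
EXPR1 = ("+", "a", ("*", "b", ("^", "c", "d")))
# f <- ( ( g + h ) * i ) ^ j: parentheses make the tree lean to the left
EXPR2 = ("^", ("*", ("+", "g", "h"), "i"), "j")

if __name__ == "__main__":
    print(compile_assignment("e", EXPR1))
    print(compile_assignment("f", EXPR2))
```

Both trees yield 8 instructions, which matches the observation that precedence alone does not favor one expression over the other on the same architecture.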
34
Code For Stack Architecture
Stack Machine with no register inherently slow: Memory Accesses!!!
Implement the few top-of-stack elements via HW shadow registers → Cache
Measure equivalent code sequences with/without consideration for cache
Top-of-stack register tos points to last valid word on physical stack
Two shadow registers may hold 0, 1, or 2 true top words
Top-of-stack cache counter tcc specifies number of shadow registers in use
Thus tos plus tcc jointly specify true top of stack
35
Code For Stack Architecture
[Figure: physical stack in memory, with tos pointing to its last valid word and free space beyond it; 2 tos shadow registers; counter tcc with values 0, 1, 2]
36
Code For Stack Architecture
Timings for push, pushlit, add, pop operations depend on tcc
Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself
Generally, further memory access adds 2 cycles
For stack changes use some defined policy, e.g. keep tcc 50% full
Table below refines timings for stack with shadow registers
Note: push x into cache with free space requires 2 cycles: cache adjustment is done at the same time as memory fetch
37
Code For Stack Architecture
operation   cycles  tcc before  tcc after  tos change  comment
add         1       tcc = 2     tcc = 1    no change
add         1+2     tcc = 1     tcc = 1    tos--       underflow?
add         1+2+2   tcc = 0     tcc = 1    tos -= 2    underflow?
push x      2       tcc = 0,1   tcc++      no change   tcc update in parallel
push x      2+2     tcc = 2     tcc = 2    tos++       overflow?
pushlit #3  1       tcc = 0,1   tcc++      no change
pushlit #3  1+2     tcc = 2     tcc = 2    tos++       overflow?
pop y       2       tcc = 1,2   tcc--      no change
pop y       2+2     tcc = 0     tcc = 0    tos--       underflow?
38
Code For Stack Architecture
Code emission for: a + b * c ^ ( d + e * f ^ g )
Let + and * be commutative, by language rule
Architecture here has 2 shadow registers, compiler exploits this
Assume initially empty 2-word cache
39
Code For Stack Architecture
#   1: Left-to-Right  cycles 1  2: Exploit Cache        cycles 2
1   push a            2         push f                  2
2   push b            2         push g                  2
3   push c            4         expo                    1
4   push d            4         push e                  2
5   push e            4         mult                    1
6   push f            4         push d                  2
7   push g            4         add                     1
8   expo              1         push c                  2
9   mult              3         r_expo = swap + expo    1
10  add               3         push b                  2
11  expo              3         mult                    1
12  mult              3         push a                  2
13  add               3         add                     1
40
Code For Stack Architecture
Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance
Code emission with shadow register consideration costs 20 cycles
True penalty for memory access is worse in practice
Tremendous speed-up is always possible when fixing a system with severe flaws
Return on investment for 2 registers is twice the original performance
Such strong speedup is an indicator that the starting architecture was poor
Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance
Note that indexing, looping, indirection, call/return are not addressed here
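The 40-cycle and 20-cycle totals can be reproduced by summing the per-instruction cycle counts listed for the two sequences on the preceding slide. The variable names below are illustrative only.

```python
# Per-instruction cycle counts, copied from the two "cycles" columns.
LEFT_TO_RIGHT_CYCLES = [2, 2, 4, 4, 4, 4, 4, 1, 3, 3, 3, 3, 3]
EXPLOIT_CACHE_CYCLES = [2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]

blind_total = sum(LEFT_TO_RIGHT_CYCLES)   # blind, left-to-right emission
tuned_total = sum(EXPLOIT_CACHE_CYCLES)   # emission that exploits the shadow registers
speedup = blind_total / tuned_total
```

Here `blind_total` is 40, `tuned_total` is 20, and `speedup` is 2.0, matching the factor of two stated above.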
41
Register Dependencies
Inter-instruction dependencies, also known as dependences, arise between registers being defined and used
One instruction computes a result into a register (or memory), another instruction needs that result from the register (or that memory location)
Or, one instruction uses a datum, and only after this use may that same datum be recomputed
42
Register Dependencies
True Dependence, AKA Data Dependence:
    r3 ← r1 op r2
    r5 ← r3 op r4
    Read after Write, RAW
Anti-Dependence, not a true dependence; parallelize under right condition
    r3 ← r1 op r2
    r1 ← r5 op r4
    Write after Read, WAR
Output Dependence
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
    Write after Write, WAW, use in between
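The three register dependences can be recognized mechanically from the destination and source registers of two instructions in program order. The following is a minimal sketch; the `Instr` record and the `classify` function are hypothetical names introduced for illustration.

```python
from collections import namedtuple

# One instruction "dest <- src1 op src2"; the operation itself is irrelevant here.
Instr = namedtuple("Instr", ["dest", "srcs"])

def classify(first, second):
    """Return the kinds of dependence of `second` on the earlier `first`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")   # true dependence: second reads what first wrote
    if second.dest in first.srcs:
        kinds.add("WAR")   # anti-dependence: second overwrites what first read
    if first.dest == second.dest:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds
```

For example, `classify(Instr("r3", ("r1", "r2")), Instr("r5", ("r3", "r4")))` returns `{"RAW"}`, the true dependence shown above.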
43
Register Dependencies
Control Dependence:
if ( condition1 ) {
    r3 = r1 op r2;
}else{ // see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );
44
Register Renaming
Only a true dependence is a real dependence, AKA Data-Dependence
Others are artifacts of insufficient resources, generally register resources
That means: if only more registers were available, then replacing the conflicting regs with these additional resources could make the conflict disappear
Anti- and Output-Dependences are such false dependencies
45
Register Renaming
     Original Dependences:       Renamed Situation, Dependences Gone:
L1:  r1 ← r2 op r3               r10 ← r2 op r30   -- r30 has r3 copy
L2:  r4 ← r1 op r5               r4 ← r10 op r5
L3:  r1 ← r3 op r6               r1 ← r30 op r6
L4:  r3 ← r1 op r7               r3 ← r1 op r7

The dependences before:          after:
L1, L2 true-Dep with r1          L1, L2 true-Dep with r10
L1, L3 output-Dep with r1        L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
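The dependence lists for the four-instruction sequence can be derived by scanning every ordered pair of instructions. The sketch below is one possible formulation, with the hypothetical helper `find_dependences`. It assumes that a dependence on a register is reported only if no instruction between the pair redefines that register, which is why L1 and L4 have no true dependence on r1 in the original code.

```python
def find_dependences(program):
    """program: list of (label, dest, srcs) in program order.
    Returns a set of (earlier, later, kind, register) tuples."""
    deps = set()
    for j, (lj, dj, sj) in enumerate(program):
        for i in range(j):
            li, di, si = program[i]
            # Registers written strictly between instruction i and instruction j.
            between = {program[k][1] for k in range(i + 1, j)}
            if di in sj and di not in between:
                deps.add((li, lj, "true", di))
            if di == dj and di not in between:
                deps.add((li, lj, "output", di))
            if dj in si and dj not in between:
                deps.add((li, lj, "anti", dj))
    return deps

ORIGINAL = [("L1", "r1", ("r2", "r3")), ("L2", "r4", ("r1", "r5")),
            ("L3", "r1", ("r3", "r6")), ("L4", "r3", ("r1", "r7"))]
RENAMED = [("L1", "r10", ("r2", "r30")), ("L2", "r4", ("r10", "r5")),
           ("L3", "r1", ("r30", "r6")), ("L4", "r3", ("r1", "r7"))]
```

Applied to `ORIGINAL` this reports the six dependences listed above; applied to `RENAMED` only the two true dependences remain.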
46
Register Renaming
With additional or renamed regs, the new code runs in half the time!
First: Compute into r10 instead of r1, no cost
Also: Compute into r30, no added copy operations, just more registers a priori
Then regs are live afterwards: r1, r3, r4
While r10 and r30 are don't cares
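One way to see the "half the time" claim is to compare the longest dependence chain before and after renaming. The sketch below assumes unit latency per instruction, enough functional units to run independent instructions in parallel, and that every dependence (true, anti, or output) forces ordering. The function name `schedule_length` and the edge lists are illustrative; the edges are the dependence pairs listed for L1 to L4, with instructions numbered from 0.

```python
def schedule_length(n, edges):
    """Length of the longest dependence chain among n instructions.
    edges: (i, j) pairs with i < j, meaning instruction j must wait for i."""
    level = [1] * n
    for j in range(n):
        for i, k in edges:
            if k == j:
                level[j] = max(level[j], level[i] + 1)
    return max(level)

# Original code: L1->L2, L1->L3, L1->L4, L2->L3, L3->L4
BEFORE = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
# Renamed code: only L1->L2 and L3->L4 remain
AFTER = [(0, 1), (2, 3)]
```

Under these assumptions `schedule_length(4, BEFORE)` is 4 and `schedule_length(4, AFTER)` is 2: the renamed code forms two independent chains of two instructions each.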
47
Score Board
Score-board is an array of programmable bits sb[]
Manages HW resources, specifically registers
Single-bit array, any one bit associated with one specific register
Association by index, i.e. by name: sb[i] belongs to reg ri
Only if sb[i] = 0, does register i have valid data
If sb[i] = 0 then register ri is NOT in process of being written
If bit i is set, i.e. if sb[i] = 1, then that register ri has stale data
Initially all sb[*] are stale, i.e. set to 1
48
Score Board
Execution constraints:
rd ← rs op rt
if sb[s] or sb[t] is set → RAW dependence, hence stall the computation; wait until both sb[s] and sb[t] are 0
if sb[d] is set → WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes determine to use another register instead of rd
else dispatch instruction immediately
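The execution constraints amount to a three-way decision per instruction. A minimal sketch, assuming the score board is a plain list of 0/1 bits indexed by register number; the function name `dispatch_decision` and its return strings are hypothetical.

```python
def dispatch_decision(sb, d, s, t):
    """Decide what to do with 'rd <- rs op rt' given score board bits sb."""
    if sb[s] or sb[t]:
        return "stall: RAW"   # a source register is still being written
    if sb[d]:
        return "stall: WAW"   # the destination register is still being written
    return "dispatch"
```

For example, with `sb = [0, 0, 1, 0]`, an instruction r3 ← r0 op r2 stalls on the RAW condition, while r3 ← r0 op r1 dispatches immediately.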
49
Score Board
To allow out of order (ooo) execution, upon computing the value of rd:
Update rd, and clear sb[d]
For uses (references), HW may use any register i, whose sb[i] is 0
For definitions (assignments), HW may set any register j, whose sb[j] is 0
Independent of original order, in which source program was written, i.e. possibly ooo
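The sketch below walks through how a score board can let a later, independent instruction overtake a stalled one. It is an illustration under stated assumptions, not the hardware's actual mechanism: the class `ScoreBoard` and its methods are hypothetical, issuing an instruction is assumed to set sb[d] until the result is written, and a hypothetical `load` step first marks the registers in use as valid, since initially all bits are set.

```python
class ScoreBoard:
    def __init__(self, num_regs):
        self.sb = [1] * num_regs    # initially all stale, i.e. set to 1

    def load(self, i):
        """Hypothetical: register i receives valid data."""
        self.sb[i] = 0

    def try_issue(self, d, s, t):
        """Issue 'rd <- rs op rt' if no score board bit blocks it."""
        if self.sb[s] or self.sb[t] or self.sb[d]:
            return False
        self.sb[d] = 1              # assumption: rd is busy until completion
        return True

    def complete(self, d):
        """Result computed: update rd and clear sb[d]."""
        self.sb[d] = 0

board = ScoreBoard(8)
for reg in range(1, 6):
    board.load(reg)

issued_1 = board.try_issue(3, 1, 2)   # I1: r3 <- r1 op r2
issued_2 = board.try_issue(4, 3, 1)   # I2: r4 <- r3 op r1, needs r3 from I1
issued_3 = board.try_issue(5, 1, 2)   # I3: r5 <- r1 op r2, independent
board.complete(3)                     # I1 finishes, sb[3] is cleared
retried_2 = board.try_issue(4, 3, 1)  # I2 can now go
```

In this run I1 issues, I2 stalls on r3, I3 issues ahead of I2 (out of order), and I2 issues once I1 has completed.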
50
References
1. The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html
2. Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations
3. http://en.wikipedia.org/wiki/Moore's_law
4. C. A. R. Hoare’s comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf
5. Gibbons, P. B, and Steven Muchnick [1986]. “Efficient Instruction Scheduling for a Pipelined Architecture”, ACM Sigplan Notices, Proceeding of ’86 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp 11-16
6. Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/
7. Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm
8. Words of wisdom: http://www.cs.yale.edu/quotes.html
9. John von Neumann’s computer design: A.H. Taub (ed.), “Collected Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., New York 1963