machine layers design issues - lafayette collegepfaffmaj/courses/f16/cs203/slides/w13d01.pdf ·...
TRANSCRIPT
Machine Architecture
Machine Layers
Instruction Set Architecture (ISA)
Microarchitecture
Digital Logic / Components
Operating System/High Level Languages
Design Issues
•Digital Level -- is restricted by what is physically possible, chiefly component size (transistor count and transistor density).
•Microarchitecture Level -- is concerned with component layout, signal propagation, and execution speed enhancements.
•Instruction Set Architecture (ISA) -- is concerned with providing the most effective utilization of the microarchitecture and a cohesive interface to the software layer.
ISA Design
•The ISA is the interface that most designers will work with. That is, chip designers are only a small fraction of the people who work with computers.
•This leads to design questions like:
  •Will hardware feature X ever be used?
  •Can hardware feature Y be added for a specific type of programming technique?
•The ISA is how the world sees a given chip.
Backwards Compatibility
•Should the world be considered during the chip design process?
•Example -- The Pentium II supports three modes of operation:
  •real mode (emulating an 8088 processor)
  •virtual 8086 mode
  •protected mode (acting as a modern processor)
Design Trade-offs
•What makes a good ISA?
The ISA should define a set of instructions that can be implemented efficiently in current and future technologies, resulting in cost-efficient designs over several generations.
The ISA should provide a clean target for compiled code.
Instruction Set Architecture (ISA)
Microarchitecture
Digital Logic / Components
Operating System/High Level Languages
Because of this layered architecture, certain aspects of the architecture are hidden or inaccessible.
Examples are:
• Is the microarchitecture microprogrammed?
• Is there pipelining?
• Is the architecture superscalar?
• Special registers for manipulating hardware.
This does not mean, however, that a programmer cannot take advantage of knowing how things work “under the hood”.
ISA Transparency
•The ISA can also vary from one manufacturer to the next to allow different levels of transparency about how the chip is designed.
•ARM allows various manufacturers to effectively produce their chips.
•Intel is more proprietary to protect their designs.
•In a proprietary situation, a chip company can support its own product more effectively than others can. It also allows a company to license out its “technology” (the AMD Athlon is an example).
Different Machine Designs
• CISC: Complex Instruction Set Computer
  • Many complex instructions and addressing modes.
  • Some instructions may take many steps to execute.
  • Not always easy to find the best instruction for a task.
• RISC: Reduced Instruction Set Computer
  • Few, simple instructions and addressing modes.
  • Usually one word per instruction.
  • May take several instructions to accomplish what CISC can do in one.
  • Complex address calculations may take several instructions.
  • Usually a load-store design; general registers are primarily used.
Some RISC Characteristics
• Prefetching of instructions
• Pipelining: beginning execution of an instruction before the previous instruction(s) have completed.
• Superscalar operation -- Issuing more than one instruction simultaneously.
• Register Windows -- the ability to switch to a different set of CPU registers with a single command. Alleviates procedure call/return overhead.
Some RISC Characteristics
• Simple instructions can be done in a few clocks.
• Simplicity may even allow a shorter clock period.
• A pipeline design can allow an instruction to complete in every clock period.
• Fixed-length instructions simplify fetch and decode.
• The rules may allow starting the next instruction without the necessary results of the previous one:
  • Unconditionally executing the instructions after a branch.
  • Starting the next instruction before a register load is complete.
Design Features -- Kernel vs. User Mode
•The ISA will often define different modes of use for different types of operation.
•Kernel mode -- Primarily used by the OS and allows full access to the processor.
•User mode -- Used by other programs, and restricts programs to a subset of functionality.
•The chip typically has ways of detecting when the program is trying to do something it should not, throwing control to the OS.
Design Features -- Registers
• The ISA defines what registers are available and how they can be accessed.
• Microarchitecture registers come in three flavors:
  • Not accessible through the ISA
  • Special purpose, accessible through the ISA
  • General-purpose, accessible through the ISA
Pentium II Example
[Figure: Pentium II register set. EAX, EBX, ECX, and EDX are 32 bits wide; the low 16 bits of each are AX, BX, CX, and DX, which split into the 8-bit halves AH/AL, BH/BL, CH/CL, and DH/DL. Also shown: ESI, EDI, EBP, ESP, the segment registers CS, SS, DS, ES, FS, GS, and EIP and EFLAGS.]
EAX - EDX -- General Purpose
  EAX -- main arithmetic register
  EBX -- good for holding pointers
  ECX -- has a role in looping
  EDX -- used for multiplication/division
ESI, EDI -- used for hardware string manipulation
EBP -- points to stack frame
ESP -- stack pointer
CS - GS -- segment registers
EIP -- Extended Instruction Pointer
EFLAGS -- Program Status Word
UltraSparc II Example
R0          Hardware wired to 0
R1 - R7     Global Variables
R8 - R13    Called Procedure Parameters
R14         Stack Pointer
R15         Scratch Register
R16 - R23   Current Proc. Local Variables
R24 - R29   Incoming Parameters
R30         Current stack frame base
R31         Current Proc. return address
Sparc processors use the concept of register windows.
UltraSparc II Example
[Figure: UltraSparc II register windows. Each window names the registers as follows (register, alternative name, function):
R0 (G0): 0; R1 - R7 (G1 - G7): Global 1 - Global 7
R8 - R13 (O0 - O5): Outgoing parameter 0 - 5; R14 (SP): Stack pointer; R15 (O7): Temporary
R16 - R23 (L0 - L7): Local 0 - Local 7
R24 - R29 (I0 - I5): Incoming parameter 0 - 5; R30 (FP): Frame pointer; R31 (I7): Return address
Two windows are shown, CWP = 7 and CWP = 6. CWP is decremented on a call, and the two windows overlap: the registers in the overlap are part of the previous window.]
Instruction Formats
[Figure: four instruction formats. (a) OPCODE; (b) OPCODE, ADDRESS; (c) OPCODE, ADDRESS1, ADDRESS2; (d) OPCODE, ADDR1, ADDR2, ADDR3.]
There are four basic types:
• Zero-address instructions
• One-address instructions
• Two-address instructions
• Three-address instructions
Instruction Formats
How would each of the four types represent the equation?
X = (A + B) * (C + D)
Three Address Instr.
X = (A + B) * (C + D)
ADD R1, A, B     R1 ← M[A] + M[B]
ADD R2, C, D     R2 ← M[C] + M[D]
MUL X, R1, R2    M[X] ← R1 * R2
The notation M[?] is used to indicate some location in main memory and R[?] is used to indicate some register.
Two Address Instr.
X = (A + B) * (C + D)
MOV R1, A     R1 ← M[A]
ADD R1, B     R1 ← R1 + M[B]
MOV R2, C     R2 ← M[C]
ADD R2, D     R2 ← R2 + M[D]
MUL R1, R2    R1 ← R1 * R2
MOV X, R1     M[X] ← R1
One Address Instr.
X = (A + B) * (C + D)
LOAD A     AC ← M[A]
ADD B      AC ← AC + M[B]
STORE T    M[T] ← AC
LOAD C     AC ← M[C]
ADD D      AC ← AC + M[D]
MUL T      AC ← AC * M[T]
STORE X    M[X] ← AC
To implement this there is an implied accumulator, AC, and a temporary memory location, T, that holds A + B.
Zero Address Instr.
X = (A + B) * (C + D)
PUSH A    TOS ← A
PUSH B    TOS ← B
ADD       TOS ← (A + B)
PUSH C    TOS ← C
PUSH D    TOS ← D
ADD       TOS ← (C + D)
MUL       TOS ← (A + B)*(C + D)
POP X     M[X] ← TOS
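The zero-address sequence can be traced with a small stack-machine sketch. This is only an illustration: the `run` helper, the tuple encoding of instructions, and the sample memory values are assumptions made here, not part of the slides.

```python
# Minimal zero-address (stack) machine sketch.
# The instruction encoding and memory contents are illustrative assumptions.
def run(program, memory):
    """Execute PUSH/POP/ADD/MUL instructions against a dict used as memory."""
    stack = []
    for op, *args in program:
        if op == "PUSH":            # TOS <- M[addr]
            stack.append(memory[args[0]])
        elif op == "POP":           # M[addr] <- TOS
            memory[args[0]] = stack.pop()
        elif op == "ADD":           # pop two operands, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":           # pop two operands, push their product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op}")
    return memory

# X = (A + B) * (C + D), written exactly as the zero-address sequence above.
program = [
    ("PUSH", "A"), ("PUSH", "B"), ("ADD",),
    ("PUSH", "C"), ("PUSH", "D"), ("ADD",),
    ("MUL",), ("POP", "X"),
]
memory = run(program, {"A": 1, "B": 2, "C": 3, "D": 4})
print(memory["X"])  # (1 + 2) * (3 + 4) = 21
```

Note that ADD and MUL carry no address fields at all; their operands are implied by the stack, which is why this format needs the most instructions (eight) for the same equation.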
ISA and Memory
•The microarchitecture restricts the ISA.
•But so does the memory word size.
•To some extent this can be controlled:
  •by allowing things to be byte addressable
  •or by not enforcing alignment in the memory.
•How we handle this will impact machine speed.
ISA and Memory
Instructions can be limited to a specific size or be variable.
[Figure: three layouts of instructions against 1-word boundaries, showing instructions of different sizes relative to the word.]
The actual size of the instruction can also have a great impact on the architecture.
ISA and Memory
•Large instructions can consume too much space.
•Smaller instructions transmit more quickly
•Small instruction disadvantages:
  •can be harder to decode;
  •smaller addressable instruction space.
Fixed Width Instructions
From the previous slides it might be assumed that all memory addresses and instructions need to be the same size.
(Maybe in a RISC)
But it is possible to have an expanding opcode where the instruction is partitioned to meet the requirements of the instruction and memory addressing.
[Figure: a 16-bit instruction, bits numbered 15 to 0, divided into Opcode (bits 15-12), Address 1 (bits 11-8), Address 2 (bits 7-4), and Address 3 (bits 3-0).]
Expandable OpCode (fixed width)
Example: 15 3-address instructions, 14 2-address instructions, 31 1-address instructions, and 16 0-address instructions, all in 16 bits.

Bit number:  15-12  11-8  7-4   3-0

15 3-address instructions (4-bit opcode):
             0000   xxxx  yyyy  zzzz
             0001   xxxx  yyyy  zzzz
             0010   xxxx  yyyy  zzzz
             …
             1100   xxxx  yyyy  zzzz
             1101   xxxx  yyyy  zzzz
             1110   xxxx  yyyy  zzzz

14 2-address instructions (8-bit opcode):
             1111   0000  yyyy  zzzz
             1111   0001  yyyy  zzzz
             1111   0010  yyyy  zzzz
             …
             1111   1011  yyyy  zzzz
             1111   1100  yyyy  zzzz
             1111   1101  yyyy  zzzz

31 1-address instructions (12-bit opcode):
             1111   1110  0000  zzzz
             1111   1110  0001  zzzz
             …
             1111   1110  1110  zzzz
             1111   1110  1111  zzzz
             1111   1111  0000  zzzz
             1111   1111  0001  zzzz
             …
             1111   1111  1101  zzzz
             1111   1111  1110  zzzz

16 0-address instructions (16-bit opcode):
             1111   1111  1111  0000
             1111   1111  1111  0001
             1111   1111  1111  0010
             …
             1111   1111  1111  1101
             1111   1111  1111  1110
             1111   1111  1111  1111
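The expanding-opcode scheme can be checked with a short decoder sketch. The `classify` function is a hypothetical helper written for this example; it assumes the 16-bit layout of the example (15 three-address, 14 two-address, 31 one-address, and 16 zero-address instructions) and simply counts how many distinct opcodes fall into each class.

```python
from collections import defaultdict

def classify(word):
    """Return (number of address fields, opcode value) for a 16-bit word.

    Assumes the expanding-opcode layout of the example: the opcode grows by
    4 bits each time all currently available escape patterns are used up.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    if word >> 12 != 0xF:        # 0000..1110: 4-bit opcode, 3 addresses
        return 3, word >> 12
    if word >> 8 <= 0xFD:        # 1111 0000..1111 1101: 8-bit opcode, 2 addresses
        return 2, word >> 8
    if word >> 4 != 0xFFF:       # 1111 1110 0000..1111 1111 1110: 12-bit opcode
        return 1, word >> 4
    return 0, word               # 1111 1111 1111 xxxx: 16-bit opcode, no address

# Enumerate every 16-bit pattern and count the distinct opcodes per class.
opcodes = defaultdict(set)
for w in range(0x10000):
    n_addr, opcode = classify(w)
    opcodes[n_addr].add(opcode)

counts = {n: len(opcodes[n]) for n in (3, 2, 1, 0)}
print(counts)  # {3: 15, 2: 14, 1: 31, 0: 16}
```

The counts follow from which patterns are reserved as escapes: 1111 in the first nibble, 1110 and 1111 in the second, and 1111 in the third.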
UltraSparc II (Fixed Width)
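Format   Fields (width in bits, most significant first)                          Use
1a       2 | DEST 5 | OPCODE 6 | SRC1 5 | 0 (1 bit) | FP-OP 8 | SRC2 5             3 Register
1b       2 | DEST 5 | OPCODE 6 | SRC1 5 | 1 (1 bit) | IMMEDIATE CONSTANT 13        Immediate
2        2 | DEST 5 | OP 3 | IMMEDIATE CONSTANT 22                                 SETHI
3        2 | A 1 | COND 4 | OP 3 | PC-RELATIVE DISPLACEMENT 22                     BRANCH
4        2 | PC-RELATIVE DISPLACEMENT 30                                           CALL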
This format enforces a 32-bit instruction size, and the first two bits select the instruction type: 1a, 1b, 2, 3, or 4.
Instr. 1a & 1b -- are the typical instructions, allowing for a destination register, 1 of 64 different opcodes, and sources.
SETHI -- stands for “set high”: it sets the high bits of a 32-bit constant
BRANCH -- is for non-predictive branches (A relates to the delay slot and COND specifies which condition to test for)
CALL -- procedure call
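Pentium II Opcode (Variable Width)
[Figure: Pentium II instruction format. Fields and sizes in bytes: PREFIX (0 - 5), OPCODE (1 - 2), MODE (0 - 1), SIB (0 - 1), DISPLACEMENT (0 - 4), IMMEDIATE (0 - 4).
OPCODE byte: INSTRUCTION (6 bits), which operand is source? (1 bit), byte/word (1 bit).
MODE byte: MOD (2 bits), REG (3 bits), R/M (3 bits).
SIB byte: SCALE (2 bits), INDEX (3 bits), BASE (3 bits).]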
The Pentium II instruction format is highly convoluted, consisting of 6 variable-length fields of which only the opcode is required.
This instruction set allows for:
• two-register operations,
• one-register / one-memory operations,
• but not two-memory operations.
-- The Prefix field allows for special handling of the opcode.
-- OPCODEs are one byte unless the first byte is the escape code 0x0F, indicating that the opcode is two bytes.
-- Opcodes are fully encoded; only the two low-order bits carry separate meaning (byte/word, and whether a memory operand, if one is used, is the source or the destination).
MODE -- indicates information about the operands
SIB -- additional mode information
Displacement -- location in memory relative to a specific location
Immediate operand -- a constant encoded in the instruction itself
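The MODE and SIB bytes share the same 2/3/3-bit split, so a single helper can unpack either one. This is a minimal sketch: `split_2_3_3` is a hypothetical name, and it assumes the fields are laid out most significant first (MOD/REG/R/M and SCALE/INDEX/BASE), as in the figure.

```python
def split_2_3_3(byte):
    """Split one byte into a 2-bit field and two 3-bit fields, high bits first.

    For a MODE byte the result is (MOD, REG, R/M);
    for a SIB byte it is (SCALE, INDEX, BASE).
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

# 0xC3 = 11 000 011
mod, reg, rm = split_2_3_3(0xC3)
print(mod, reg, rm)          # 3 0 3

# 0x88 = 10 001 000
scale, index, base = split_2_3_3(0x88)
print(scale, index, base)    # 2 1 0
```

What each field value means (which register, which addressing mode) is defined by the processor manual and is not modelled here.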
JVM Instruction (Variable Width)
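[Figure: JVM instruction formats; each column is 8 bits wide.]
Format 1: OPCODE
Format 2: OPCODE, BYTE (BYTE = index, constant or type)
Format 3: OPCODE, SHORT (SHORT = index, constant or offset)
Format 4: OPCODE, INDEX, CONST
Format 5: OPCODE, INDEX, DIMENSIONS
Format 6: OPCODE, INDEX, #PARAMETERS, 0
Format 7: OPCODE, INDEX, CONST
Format 8: OPCODE, 32-BIT BRANCH OFFSET
Format 9: OPCODE, VARIABLE LENGTH…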
The JVM instruction code is variable, but simple.
The opcode is 8-bits, allowing for 256 different operations.
The opcode also specifies the format to be used.
Performance Measurements
• How do we determine which microarchitectures and ISAs are better?
• Measuring differences in machines is easy for computers that are slightly different but hard for computers that are very different.
• Performance depends both on how the machine is constructed and on how it is used.
Performance Measures
• MIPS: Millions of Instructions Per Second
  • The same job may take more instructions on one machine than on another.
• MFLOPS: Millions of Floating Point Ops Per Second
  • Other instructions are counted as overhead for the floating point.
• Whetstone: Synthetic benchmark
  • A program made up to test specific performance features.
• Dhrystone: Synthetic competitor for Whetstone
  • Made to “correct” Whetstone’s emphasis on floating point.
• SPEC: Selection of “real” programs
  • Taken from the C/UNIX world.
Quantitative Performance
• Consider 2 auto routes: the old one, which allowed an average speed of 34 mph, and the new one, which permitted 46 mph. What is the speedup of the new one over the old one?
• Conventionally the speedup is calculated as follows:
\[
\text{Speedup} = \frac{\text{SpeedOnNewRoute}}{\text{SpeedOnOldRoute}} = \frac{S_{new}}{S_{old}} = \frac{46}{34} = 1.35
\]
• This is a speedup of 1.35; the new route is 0.35, or 35%, faster.
Quantitative Performance (2)
• Alternately, the % speedup can be calculated directly:
\[
\%\text{Speedup} = \frac{S_{new} - S_{old}}{S_{old}} \times 100 = \frac{46 - 34}{34} \times 100 = 35\%
\]
Quantitative Performance (3)
• Many measurements are in terms of the time, T, it takes to accomplish some task.
• Time, T, is the reciprocal of Speed, S = 1/T.
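• If the improvement is measured by recording travel time rather than travel speed, the equation changes as follows, with the assumed times T_old = 96 and T_new = 71:
\[
\text{Speedup} = \frac{S_{new}}{S_{old}} = \frac{1/T_{new}}{1/T_{old}} = \frac{T_{old}}{T_{new}} = \frac{96}{71} = 1.35, \text{ or } 35\%
\]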
Quantitative Performance (4)
• Once again, the % speedup can be calculated directly:
\[
\%\text{Speedup} = \frac{T_{old} - T_{new}}{T_{new}} \times 100 = \frac{96 - 71}{71} \times 100 = \frac{25}{71} \times 100 = 35\%
\]
A Classic Example
• A certain computer system takes 125 ms to render a certain graphic image, and this time is reduced to 100 ms when a graphics processor card is added to the system. What is the speedup?
\[
\text{Speedup} = \frac{T_{old}}{T_{new}} = \frac{125}{100} = 1.25, \text{ or a } 25\% \text{ speedup}
\]
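The speedup calculations from the last few slides can be reproduced in a few lines. This is a minimal sketch; the function names are chosen here for illustration and do not come from the slides.

```python
def speedup_from_speeds(s_old, s_new):
    """Speedup when performance is measured as a rate (higher is better)."""
    return s_new / s_old

def speedup_from_times(t_old, t_new):
    """Speedup when performance is measured as a time (lower is better)."""
    return t_old / t_new

def percent_speedup(ratio):
    """Convert a speedup ratio such as 1.35 into a percentage such as 35."""
    return (ratio - 1) * 100

# Auto routes: 34 mph old, 46 mph new.
print(round(speedup_from_speeds(34, 46), 2))                 # 1.35
# Same routes measured as travel times: 96 old, 71 new.
print(round(speedup_from_times(96, 71), 2))                  # 1.35
# Graphics card example: 125 ms old, 100 ms new.
print(round(percent_speedup(speedup_from_times(125, 100))))  # 25
```

The only difference between the two forms is which quantity goes in the numerator: the new value for speeds, the old value for times.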
Getting Finer-Grained
• Can we specify our processing time in a more refined way?
• The execution time can be calculated from the:
• IC -- instruction count or the number of instructions executed,
• CPI -- clocks per instruction or the average number of clock cycles per instruction, and
• τ -- clock cycle or the length of time for one full cycle of the clock.
Execution Time = T = IC × CPI × τ
Example Clock Cycle: The Mic-1 (1)
A complete block diagram of an example microarchitecture, the Mic-1.
Processing steps per clock cycle:
1. Set control lines to data path.
2. Propagate data path signals.
3. Store stable data signals.
4. Stabilize next MPC.
5. Determine next micro-operation.
Data Path Timing
Timing diagram of one data path cycle.
Clocks per Instruction (CPI)
Clock   RTN
T0      MA ← PC; C ← PC+4;
T1      MD ← M[MA]; PC ← C;
T2      IR ← MD;
T3      A ← R[rb];
T4      C ← A + R[rc];
T5      R[ra] ← C;
This instruction takes six clock steps (T0 to T5), so its CPI is 6.
Getting Finer-Grained (2)
• The master clock in a certain computer system is increased in frequency from 700 MHz to 1.2 GHz.
• What is the speedup due to this improvement if no other factors, such as memory access time, interfere with the improvement?
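Getting Finer-Grained (2)
• By the problem definition, neither IC nor CPI changed, and the clock period, τ, is the reciprocal of the clock frequency, so the speedup reduces to the ratio of the clock frequencies (in MHz below):
\[
\text{Speedup} = \frac{(IC \times CPI \times \tau)_{old}}{(IC \times CPI \times \tau)_{new}} = \frac{1/700}{1/1200} = \frac{1200}{700} = 1.71, \text{ or a } 71\% \text{ speedup}
\]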
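The same result falls out of computing T = IC × CPI × τ directly. In this sketch the instruction count and CPI are arbitrary placeholder values (assumptions for illustration); because they are identical for both clocks, they cancel in the ratio.

```python
def execution_time(ic, cpi, clock_hz):
    """T = IC x CPI x tau, where tau = 1 / clock frequency (in seconds)."""
    return ic * cpi / clock_hz

# Placeholder workload: these two values are assumptions and cancel out.
ic, cpi = 1_000_000, 6

t_old = execution_time(ic, cpi, 700e6)   # 700 MHz clock
t_new = execution_time(ic, cpi, 1.2e9)   # 1.2 GHz clock

speedup = t_old / t_new
print(round(speedup, 2))                 # 1.71
```

Changing `ic` or `cpi` changes both times but leaves the printed speedup at 1.71, which is the point of the "no other factors interfere" assumption.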
Advancing the Simple Computer
Step   RTN
T0     MA ← PC; C ← PC+4;
T1     MD ← M[MA]; PC ← C;
T2     IR ← MD;
T3     A ← R[rb];
T4     C ← A + R[rc];
T5     R[ra] ← C;
Add second Bus
Step   Concrete RTN                   Control Sequence
T0.    MA ← PC;                       PCout, C=B, MAin, Read
T1.    PC ← PC + 4: MD ← M[MA];       PCout, Inc4, PCin, Wait
T2.    IR ← MD;                       MDout, C=B, IRin
T3.    A ← R[rb];                     Grb, Rout, C=B, Ain
T4.    R[ra] ← A + R[rc];             Grc, Rout, ADD, Sra, Rin, End
1-bus vs. 2-bus
T = IC × CPI × τ
\[
\%\text{Speedup} = \frac{T_{1\text{-bus}} - T_{2\text{-bus}}}{T_{2\text{-bus}}} \times 100
\]
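• Assume for now that IC and τ do not change.
• Naively assume that just the CPI changes, from 6 on the 1-bus design to 5 on the 2-bus design.
\[
\begin{aligned}
\%\text{Speedup} &= \frac{T_{1\text{-bus}} - T_{2\text{-bus}}}{T_{2\text{-bus}}} \times 100 \\
&= \frac{(IC \times CPI \times \tau)_{1\text{-bus}} - (IC \times CPI \times \tau)_{2\text{-bus}}}{(IC \times CPI \times \tau)_{2\text{-bus}}} \times 100 \\
&= \frac{(IC \times 6 \times \tau)_{1\text{-bus}} - (IC \times 5 \times \tau)_{2\text{-bus}}}{(IC \times 5 \times \tau)_{2\text{-bus}}} \times 100 \\
&= \frac{6 - 5}{5} \times 100 = 20\%
\end{aligned}
\]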
3-bus SRC
• A-bus is ALU operand 1, B-bus is ALU operand 2, and C-bus is ALU output.
• Note, MA input is connected to the B-bus.
Step   Concrete RTN                              Control Sequence
T0.    MA ← PC: PC ← PC + 4: MD ← M[MA];         PCout, MAin, Inc4, PCin, Read, Wait
T1.    IR ← MD;                                  MDout, C=B, IRin
T2.    R[ra] ← R[rb] + R[rc];                    GArc, RAout, GBrb, RBout, ADD, Sra, Rin, End
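1-bus vs. 3-bus
• Again assume that IC and τ do not change; the CPI drops from 6 on the 1-bus design to 3 on the 3-bus design.
\[
\begin{aligned}
\%\text{Speedup} &= \frac{T_{1\text{-bus}} - T_{3\text{-bus}}}{T_{3\text{-bus}}} \times 100 \\
&= \frac{(IC \times CPI \times \tau)_{1\text{-bus}} - (IC \times CPI \times \tau)_{3\text{-bus}}}{(IC \times CPI \times \tau)_{3\text{-bus}}} \times 100 \\
&= \frac{(IC \times 6 \times \tau)_{1\text{-bus}} - (IC \times 3 \times \tau)_{3\text{-bus}}}{(IC \times 3 \times \tau)_{3\text{-bus}}} \times 100 \\
&= \frac{6 - 3}{3} \times 100 = 100\%
\end{aligned}
\]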
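The bus comparisons can be checked numerically from the step counts of the RTN tables (6 steps on one bus, 5 on two, 3 on three). This is a sketch under the slides' naive assumption that IC and τ are unchanged; the function name and the default values for `ic` and `tau` are placeholders chosen here.

```python
def percent_speedup_by_cpi(cpi_old, cpi_new, ic=1, tau=1.0):
    """%Speedup = (T_old - T_new) / T_new x 100 with T = IC x CPI x tau.

    ic and tau are held equal for both designs, so they cancel; the default
    values are arbitrary placeholders.
    """
    t_old = ic * cpi_old * tau
    t_new = ic * cpi_new * tau
    return (t_old - t_new) / t_new * 100

print(round(percent_speedup_by_cpi(6, 5), 1))  # 1-bus vs. 2-bus: 20.0
print(round(percent_speedup_by_cpi(6, 3), 1))  # 1-bus vs. 3-bus: 100.0
```

In a real design, adding buses may also change τ, which is why the slides call this assumption naive.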