machine layers design issues - lafayette collegepfaffmaj/courses/f16/cs203/slides/w13d01.pdf ·...

Machine Architecture

Machine Layers

Instruction Set Architecture (ISA)

Microarchitecture

Digital Logic / Components

Operating System/High Level Languages

Design Issues

•Digital Level -- is restricted by what is physically possible, such as component size (transistor count and transistor density).

•Microarchitecture Level -- is concerned with component layout, signal propagation, and execution speed enhancements.

•Instruction Set Architecture (ISA) -- is concerned with providing the most effective utilization of the Microarchitecture and a cohesive interface to the software layer.

ISA Design

•The ISA is the interface that most designers will work with. That is, chip designers are only a small fraction of all the people who work with computers.

•This leads to design questions like:
  - Will hardware feature X ever be used?
  - Can hardware feature Y be added for a specific type of programming technique?

•The ISA is how the world sees a given chip.

Backwards Compatibility

•Should the world be considered during the chip design process?

•Example -- The Pentium II supports three modes of operation:
  - real mode (emulating an 8088 processor)
  - virtual 8086 mode
  - protected mode (acting as a modern processor)

Design Trade-offs

•What makes a good ISA?

The ISA should define a set of instructions that can be implemented efficiently in current and future technologies, resulting in cost-efficient designs over several generations.

The ISA should provide a clean target for compiled code.

Because of this layered architecture, certain aspects of the architecture are hidden or inaccessible.

Examples are:
•Is the microarchitecture microprogrammed?
•Is there pipelining?
•Is the architecture superscalar?
•Special registers for manipulating hardware.

This does not mean, however, that a programmer cannot take advantage of knowing how things work “under the hood”.

ISA Transparency

•An ISA can also vary from one manufacturer to the next, allowing different levels of transparency about how the chip is designed.

•ARM licenses its designs, allowing various manufacturers to produce ARM-based chips.

•Intel is more proprietary, protecting its designs.

•In a proprietary situation, a chip company can support its own product more effectively than others can. It also allows a company to license out its “technology” (the AMD Athlon is an example).

Different Machine Designs

• CISC: Complex Instruction Set Computer
  - Many complex instructions and addressing modes.
  - Some instructions may take many steps to execute.
  - Not always easy to find the best instruction for a task.

• RISC: Reduced Instruction Set Computer
  - Few, simple instructions and addressing modes.
  - Usually one word per instruction.
  - May take several instructions to accomplish what CISC can do in one.
  - Complex address calculations may take several instructions.
  - Usually a load-store design; general registers are primarily used.

Some RISC Characteristics

• Prefetching of instructions

• Pipelining: beginning execution of an instruction before the previous instruction(s) have completed.

• Superscalar operation -- Issuing more than one instruction simultaneously.

• Register Windows -- the ability to switch to a different set of CPU registers with a single command. Alleviates procedure call/return overhead.

Some RISC Characteristics

• Simple instructions can be done in a few clocks

• Simplicity may even allow a shorter clock period

• A pipeline design can allow an instruction to complete in every clock period

• Fixed length instructions simplify fetch & decode

• The rules may allow starting the next instruction without the necessary results of the previous one:
  - Unconditionally executing the instructions after a branch.
  - Starting the next instruction before a register load is complete.

Design Features -- Kernel vs. User Mode

•The ISA will often define different modes of use for different types of operation.

•Kernel mode -- Primarily used by the OS and allows full access to the processor.

•User mode -- Used by other programs, and restricts programs to a subset of functionality.

•The chip typically has ways of detecting when the program is trying to do something it should not, throwing control to the OS.

Design Features -- Registers

• The ISA defines what registers are available and how they can be accessed.

• Microarchitecture Registers come in three flavors:

  - Not accessible through the ISA
  - Special purpose, accessible through the ISA
  - General-purpose, accessible through the ISA

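Pentium II Example

[Figure: Pentium II register set. EAX, EBX, ECX, and EDX each contain a 16-bit register (AX, BX, CX, DX) that splits into two 8-bit halves (AH/AL, BH/BL, CH/CL, DH/DL). Also shown: ESI, EDI, EBP, ESP, the segment registers CS, SS, DS, ES, FS, and GS, EIP, and EFLAGS.]

EAX - EDX -- General Purpose
  EAX -- main arithmetic register
  EBX -- good for holding pointers
  ECX -- has a role in looping
  EDX -- used for multiplication/division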

ESI, EDI -- used for hardware string manip.

EBP -- points to stack frameESP -- stack pointer

CS - GS -- segment registers

EIP -- Extended Instruction Pointer

EFLAGS -- Program Status Word
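The way AX, AH, and AL alias parts of EAX can be illustrated with bit masks. This is a minimal sketch; the function name `sub_registers` is hypothetical and only models the layout described above, not any processor behavior beyond it.

```python
def sub_registers(eax: int) -> dict[str, int]:
    """Return the views of a 32-bit EAX value: AX (low 16 bits), AH and AL (its two bytes)."""
    if not 0 <= eax <= 0xFFFFFFFF:
        raise ValueError("EAX must fit in 32 bits")
    ax = eax & 0xFFFF            # low 16 bits of EAX
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,  # high byte of AX
        "AL": ax & 0xFF,         # low byte of AX
    }


if __name__ == "__main__":
    for name, value in sub_registers(0x12345678).items():
        print(f"{name} = {value:#x}")
```

The same masks apply to EBX, ECX, and EDX.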

UltraSparc II Example

R0          Hardwired to 0
R1 - R7     Global variables
R8 - R13    Called procedure parameters
R14         Stack pointer
R15         Scratch register
R16 - R23   Current procedure local variables
R24 - R29   Incoming parameters
R30         Current stack frame base
R31         Current procedure return address

Sparc processors use the concept of register windows.

UltraSparc II Example

[Figure: UltraSparc II register windows. Each window shows R0-R7 under the alternative names G0-G7 (R0 is 0, the rest are globals), R8-R15 as O0-O5, SP, and O7 (outgoing parameters, stack pointer, temporary), R16-R23 as L0-L7 (locals), and R24-R31 as I0-I5, FP, and I7 (incoming parameters, frame pointer, return address). Two windows are drawn, CWP = 7 and CWP = 6. CWP is decremented on a call, and the overlapping registers are part of the previous window.]
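The overlap between windows can be sketched as a mapping from a logical register number and the current window pointer (CWP) to a physical register. This is an illustrative model only: the window count `N_WINDOWS`, the layout of the physical file, and the function name `physical_register` are assumptions, not details taken from the slides.

```python
N_WINDOWS = 8           # assumed number of windows
UNIQUE_PER_WINDOW = 16  # 8 outgoing + 8 local registers are new in each window


def physical_register(cwp: int, r: int) -> tuple[str, int]:
    """Map logical register R0..R31 in window `cwp` to a (bank, index) pair."""
    if not 0 <= r <= 31:
        raise ValueError("register number must be 0..31")
    if not 0 <= cwp < N_WINDOWS:
        raise ValueError("CWP out of range")
    if r < 8:
        return ("global", r)  # R0-R7 are shared by every window
    # R8-R15 (out), R16-R23 (local), R24-R31 (in) sit at offsets 0..23 from the
    # window base, so the 'in' registers land on the next window's 'out' slots.
    index = (cwp * UNIQUE_PER_WINDOW + (r - 8)) % (N_WINDOWS * UNIQUE_PER_WINDOW)
    return ("windowed", index)


if __name__ == "__main__":
    # A call decrements CWP from 7 to 6: the caller's R8 is the callee's R24.
    print(physical_register(7, 8), physical_register(6, 24))
```

Under this model the caller's stack pointer (R14) is also the callee's frame pointer (R30), which is the overlap the figure describes.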

Instruction Formats

[Figure: four instruction formats.]
(a) OPCODE
(b) OPCODE ADDRESS
(c) OPCODE ADDRESS1 ADDRESS2
(d) OPCODE ADDR1 ADDR2 ADDR3

There are four basic types:
• Zero-address instructions
• One-address instructions
• Two-address instructions
• Three-address instructions

Instruction Formats

How would each of the four types represent the equation?

X = (A + B) * (C + D)

Three Address Instr.

X = (A + B) * (C + D)

ADD R1, A, B     R1 ← M[A] + M[B]
ADD R2, C, D     R2 ← M[C] + M[D]
MUL X, R1, R2    M[X] ← R1 * R2

The notation M[?] is used to indicate some location in main memory and R[?] is used to indicate some register.

Two Address Instr.

X = (A + B) * (C + D)

MOV R1, A     R1 ← M[A]
ADD R1, B     R1 ← R1 + M[B]
MOV R2, C     R2 ← M[C]
ADD R2, D     R2 ← R2 + M[D]
MUL R1, R2    R1 ← R1 * R2
MOV X, R1     M[X] ← R1

One Address Instr.

X = (A + B) * (C + D)

LOAD A     AC ← M[A]
ADD B      AC ← AC + M[B]
STORE T    M[T] ← AC
LOAD C     AC ← M[C]
ADD D      AC ← AC + M[D]
MUL T      AC ← AC * M[T]
STORE X    M[X] ← AC

To implement this there is an implied accumulator AC.

Zero Address Instr.

X = (A + B) * (C + D)

PUSH A    TOS ← A
PUSH B    TOS ← B
ADD       TOS ← (A + B)
PUSH C    TOS ← C
PUSH D    TOS ← D
ADD       TOS ← (C + D)
MUL       TOS ← (A + B) * (C + D)
POP X     M[X] ← TOS
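The zero-address program can be traced with a small stack-machine interpreter. This is a sketch under stated assumptions: `run_stack_program` is a hypothetical name, memory is modeled as a dictionary, and PUSH is taken to push the value stored at the named location.

```python
def run_stack_program(program: list[tuple[str, ...]], memory: dict[str, int]) -> dict[str, int]:
    """Execute zero-address instructions; the operand stack is implicit."""
    stack: list[int] = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(memory[args[0]])  # TOS <- value at the named location
        elif op == "POP":
            memory[args[0]] = stack.pop()  # M[X] <- TOS
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


# X = (A + B) * (C + D), as in the listing above
PROGRAM = [
    ("PUSH", "A"), ("PUSH", "B"), ("ADD",),
    ("PUSH", "C"), ("PUSH", "D"), ("ADD",),
    ("MUL",), ("POP", "X"),
]

if __name__ == "__main__":
    print(run_stack_program(PROGRAM, {"A": 1, "B": 2, "C": 3, "D": 4})["X"])
```

Only PUSH and POP name a memory location; ADD and MUL take both operands from the stack, which is why they need no address field.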

ISA and Memory

•The Microarchitecture restricts the ISA.

•But so does the memory word size.
  - To some extent this can be controlled by allowing things to be byte addressable
  - or by not enforcing alignment in the memory.

•How we handle this will impact machine speed.

ISA and Memory

Instructions can be limited to a specific size or be variable.

[Figure: possible relationships between instruction length and the width of one memory word.]

The actual size of the instruction can also have a great impact on the architecture.

ISA and Memory

•Large instructions can consume too much space.

•Smaller instructions transmit more quickly.

•Small instruction disadvantages:
  - can be harder to decode;
  - smaller addressable instruction space.

The actual size of the instruction can have a great impact on the architecture.

Fixed Width Instructions

From the previous discussion it might be assumed that all memory addresses and instructions need to be the same size.

(Maybe in a RISC)

But it is possible to have an expanding opcode where the instruction is partitioned to meet the requirements of the instruction and memory addressing.

[Figure: a 16-bit instruction, bits 15 to 0, divided into four 4-bit fields: Opcode (bits 15-12), Address 1 (bits 11-8), Address 2 (bits 7-4), and Address 3 (bits 3-0).]

Expandable OpCode (fixed width)

Example:
• 15 3-address Inst.
• 14 2-address Inst.
• 31 1-address Inst.
• 16 0-address Inst.

16-bit instruction, bit numbers 15-12, 11-8, 7-4, 3-0:

15 3-address instructions (4-bit opcode):
  0000 xxxx yyyy zzzz
  0001 xxxx yyyy zzzz
  0010 xxxx yyyy zzzz
  …
  1100 xxxx yyyy zzzz
  1101 xxxx yyyy zzzz
  1110 xxxx yyyy zzzz

14 2-address instructions (8-bit opcode):
  1111 0000 yyyy zzzz
  1111 0001 yyyy zzzz
  1111 0010 yyyy zzzz
  …
  1111 1011 yyyy zzzz
  1111 1100 yyyy zzzz
  1111 1101 yyyy zzzz

31 1-address instructions (12-bit opcode):
  1111 1110 0000 zzzz
  1111 1110 0001 zzzz
  …
  1111 1110 1110 zzzz
  1111 1110 1111 zzzz
  1111 1111 0000 zzzz
  1111 1111 0001 zzzz
  …
  1111 1111 1101 zzzz
  1111 1111 1110 zzzz

16 0-address instructions (16-bit opcode):
  1111 1111 1111 0000
  1111 1111 1111 0001
  1111 1111 1111 0010
  …
  1111 1111 1111 1101
  1111 1111 1111 1110
  1111 1111 1111 1111
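The expanding-opcode scheme above can be decoded by testing the 4-bit fields from left to right. The sketch below is one way to do it; the function name `decode` and the returned tuple layout are illustrative choices, not part of the original slides.

```python
def decode(word: int) -> tuple[str, int, tuple[int, ...]]:
    """Decode a 16-bit word into (kind, opcode, address fields)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction must fit in 16 bits")
    f3 = (word >> 12) & 0xF  # bits 15-12
    f2 = (word >> 8) & 0xF   # bits 11-8
    f1 = (word >> 4) & 0xF   # bits 7-4
    f0 = word & 0xF          # bits 3-0
    if f3 != 0xF:                        # 0000..1110: 4-bit opcode
        return ("3-address", f3, (f2, f1, f0))
    if f2 < 0xE:                         # 1111 0000..1111 1101: 8-bit opcode
        return ("2-address", word >> 8, (f1, f0))
    if not (f2 == 0xF and f1 == 0xF):    # 1111 1110 0000..1111 1111 1110: 12-bit opcode
        return ("1-address", word >> 4, (f0,))
    return ("0-address", word, ())       # 1111 1111 1111 xxxx: 16-bit opcode


if __name__ == "__main__":
    for w in (0x1234, 0xF0AB, 0xFE07, 0xFFF3):
        print(f"{w:#06x} -> {decode(w)}")
```

Counting the distinct opcodes of each kind over all 65,536 words reproduces the 15 / 14 / 31 / 16 split listed in the example.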

UltraSparc II (Fixed Width)

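Format  Fields (bit widths in parentheses, high bits first)
1a      (2) | DEST (5) | OPCODE (6) | SRC1 (5) | 0 (1) | FP-OP (8) | SRC2 (5)     3 Register
1b      (2) | DEST (5) | OPCODE (6) | SRC1 (5) | 1 (1) | IMMEDIATE CONSTANT (13)  Immediate
2       (2) | DEST (5) | OP (3) | IMMEDIATE CONSTANT (22)                          SETHI
3       (2) | A (1) | COND (4) | OP (3) | PC-RELATIVE DISPLACEMENT (22)            BRANCH
4       (2) | PC-RELATIVE DISPLACEMENT (30)                                        CALL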

This format enforces a 32-bit instruction size and the first two bits select the instruction type: 1a, 1b, 2, 3 & 4

Formats 1a & 1b -- are the typical instructions, allowing for a destination register, 1 of 64 different opcodes, and source operands.

SETHI -- stands for “set high” bits of a 32-bit constant
BRANCH -- is for non-predictive branches (A relates to the delay slot and COND specifies which condition to test for)
CALL -- procedure call
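Because the format is fixed width, the fields of a format 1a or 1b instruction can be pulled out with shifts and masks. This is a minimal sketch assuming the field widths 2, 5, 6, 5, 1, 8, 5 read from the figure; the function name `decode_format1` and the dictionary keys are hypothetical, and the immediate is returned as a raw 13-bit value without any sign handling.

```python
def decode_format1(word: int) -> dict[str, int]:
    """Split a 32-bit format 1a/1b instruction word into its fields."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction must fit in 32 bits")
    fields = {
        "type": (word >> 30) & 0x3,     # first two bits select the format
        "dest": (word >> 25) & 0x1F,    # 5-bit destination register
        "opcode": (word >> 19) & 0x3F,  # 6 bits: 1 of 64 opcodes
        "src1": (word >> 14) & 0x1F,    # 5-bit first source register
        "i": (word >> 13) & 0x1,        # 0 = format 1a, 1 = format 1b
    }
    if fields["i"]:
        fields["immediate"] = word & 0x1FFF  # 13-bit immediate constant
    else:
        fields["fp_op"] = (word >> 5) & 0xFF  # 8-bit FP-OP field
        fields["src2"] = word & 0x1F          # 5-bit second source register
    return fields


if __name__ == "__main__":
    # Made-up word: type=2, dest=3, opcode=0, src1=1, i=1, immediate=42
    word = (2 << 30) | (3 << 25) | (0 << 19) | (1 << 14) | (1 << 13) | 42
    print(decode_format1(word))
```

Every field sits at the same bit position in every instruction of this format, which is what makes fixed-width decoding simple.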

Pentium II Opcode (Variable Width)

Field          Bytes
PREFIX         0 - 5
OPCODE         1 - 2
MODE           0 - 1
SIB            0 - 1
DISPLACEMENT   0 - 4
IMMEDIATE      0 - 4

OPCODE byte (bits):  INSTRUCTION (6) | Which operand is source? (1) | Byte/word (1)
MODE byte (bits):    MOD (2) | REG (3) | R/M (3)
SIB byte (bits):     SCALE (2) | INDEX (3) | BASE (3)

The Pentium II instruction format is highly convoluted, consisting of 6 variable-length fields, of which only the opcode is required.

This instruction set allows for:
• two-register operations,
• one-register / one-memory operations,
• but not two-memory operations.



-- The Prefix field allows for special handling of the opcode.
-- Opcodes are one byte unless the first byte is the escape code 0x0F, indicating that the opcode is two bytes.
-- Opcodes are fully encoded, and only the two low-order bits indicate anything (if a memory operand is being used and how).



MODE -- indicates information about the operands
SIB -- additional mode information
Displacement -- location in memory relative to a specific location
Immediate operand -- a constant contained in the instruction
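The MODE and SIB bytes share the same 2-3-3 bit layout, so one helper can split either of them. This is a minimal sketch; the function names are hypothetical, and it extracts the raw fields only, without interpreting what each value selects.

```python
def split_2_3_3(byte: int) -> tuple[int, int, int]:
    """Split one byte into a 2-bit, a 3-bit, and a 3-bit field, high bits first."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def decode_mode(byte: int) -> dict[str, int]:
    """MODE byte: MOD (2 bits), REG (3 bits), R/M (3 bits)."""
    mod, reg, rm = split_2_3_3(byte)
    return {"mod": mod, "reg": reg, "rm": rm}


def decode_sib(byte: int) -> dict[str, int]:
    """SIB byte: SCALE (2 bits), INDEX (3 bits), BASE (3 bits)."""
    scale, index, base = split_2_3_3(byte)
    return {"scale": scale, "index": index, "base": base}


if __name__ == "__main__":
    print(decode_mode(0xD8))  # bits 11 011 000
    print(decode_sib(0x88))   # bits 10 001 000
```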

JVM Instruction (Variable Width)

Format  Layout (the figure is drawn in 8-bit columns)
1       OPCODE
2       OPCODE | BYTE                        BYTE = index, constant or type
3       OPCODE | SHORT                       SHORT = index, constant or offset
4       OPCODE | INDEX | CONST
5       OPCODE | INDEX | DIMENSIONS
6       OPCODE | INDEX | #PARAMETERS | 0
7       OPCODE | INDEX | CONST
8       OPCODE | 32-BIT BRANCH OFFSET
9       OPCODE | VARIABLE LENGTH …

The JVM instruction code is variable, but simple.

The opcode is 8-bits, allowing for 256 different operations.

The opcode also specifies the format to be used.
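Since the opcode fixes the format, a decoder can step through a byte stream by looking up each opcode's length. The sketch below covers only a handful of JVM opcodes; the table `LENGTHS` and the function `split_instructions` are illustrative names, and any opcode not in the table is rejected rather than guessed.

```python
# Total length in bytes (opcode included) for a few JVM opcodes.
LENGTHS = {
    0x03: ("iconst_0", 1),
    0x10: ("bipush", 2),   # opcode + BYTE
    0x11: ("sipush", 3),   # opcode + SHORT
    0x60: ("iadd", 1),
    0x84: ("iinc", 3),     # opcode + INDEX + CONST
    0xA7: ("goto", 3),     # opcode + 16-bit offset
    0xB1: ("return", 1),
    0xC8: ("goto_w", 5),   # opcode + 32-bit branch offset
}


def split_instructions(code: bytes) -> list[tuple[str, bytes]]:
    """Split a byte stream into (mnemonic, operand bytes) pairs."""
    result = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode not in LENGTHS:
            raise ValueError(f"opcode {opcode:#04x} is not in this partial table")
        name, length = LENGTHS[opcode]
        if pc + length > len(code):
            raise ValueError(f"truncated {name} instruction")
        result.append((name, code[pc + 1:pc + length]))
        pc += length  # the opcode alone determines where the next one starts
    return result


if __name__ == "__main__":
    print(split_instructions(bytes([0x10, 5, 0x11, 0x01, 0x00, 0x60, 0xB1])))
```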

Performance Measurements

• How do we determine which microarchitectures and ISAs are better?

• Measuring differences in machines is easy for computers that are slightly different but hard for computers that are very different.

• Performance depends both on how the machine is constructed and on how it is used.

Performance Measures

• MIPS: Millions of Instructions Per Second
  - The same job may take more instructions on one machine than on another.

• MFLOPS: Millions of Floating Point Ops Per Second
  - Other instructions are counted as overhead for the floating point.

• Whetstone: Synthetic benchmark
  - A program made up to test specific performance features.

• Dhrystone: Synthetic competitor for Whetstone
  - Made to “correct” Whetstone’s emphasis on floating point.

• SPEC: Selection of “real” programs
  - Taken from the C/UNIX world.

Quantitative Performance

• Consider 2 auto routes: the old one, which allowed an average speed of 34 mph, and the new one, which permitted 46 mph. What is the speedup of the new one over the old one?

• Conventionally the speedup is calculated as follows:

$$\text{Speedup} = \frac{\text{SpeedOnNewRoute}}{\text{SpeedOnOldRoute}} = \frac{S_{new}}{S_{old}} = \frac{46}{34} = 1.35$$

• This is a speedup of 1.35; that is, the new route is faster by 0.35, or 35%.

Quantitative Performance (2)

• Alternately, the % speedup can be calculated directly:

$$\%\text{Speedup} = \frac{S_{new} - S_{old}}{S_{old}} \times 100 = \frac{46 - 34}{34} \times 100 = 35\%$$

Quantitative Performance (3)

• Many measurements are in terms of the time, T, it takes to accomplish some task.

• Time, T, is the reciprocal of Speed, S = 1/T.

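• If the improvement is measured by recording travel time rather than travel speed, the equation changes as follows, with assumed times of T_old = 96 and T_new = 71.

$$\text{Speedup} = \frac{S_{new}}{S_{old}} = \frac{1/T_{new}}{1/T_{old}} = \frac{T_{old}}{T_{new}} = \frac{96}{71} = 1.35, \text{ or } 35\%$$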

Quantitative Performance (4)

• Once again, the % speedup can be calculated directly.

$$\%\text{Speedup} = \frac{T_{old} - T_{new}}{T_{new}} \times 100 = \frac{96 - 71}{71} \times 100 = \frac{25}{71} \times 100 = 35\%$$
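The speed-based and time-based forms of the speedup calculation can be checked numerically. This is a small sketch; the function names are hypothetical, and the inputs are the route speeds and travel times used above.

```python
def speedup_from_speeds(s_old: float, s_new: float) -> float:
    """Speedup as a ratio of speeds: S_new / S_old."""
    return s_new / s_old


def speedup_from_times(t_old: float, t_new: float) -> float:
    """Speedup as a ratio of times: T_old / T_new (time is the reciprocal of speed)."""
    return t_old / t_new


def percent_speedup(ratio: float) -> float:
    """Convert a speedup ratio such as 1.35 into a percentage such as 35."""
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    by_speed = speedup_from_speeds(34, 46)
    by_time = speedup_from_times(96, 71)
    print(f"speeds: {by_speed:.2f} ({percent_speedup(by_speed):.0f}%)")
    print(f"times:  {by_time:.2f} ({percent_speedup(by_time):.0f}%)")
```

Both forms give the same ratio to two decimal places because the assumed times were chosen to match the speeds.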

A Classic Example

• A certain computer system takes 125 ms to render a certain graphic image, and this time is reduced to 100 ms when a graphics processor card is added to the system. What is the speedup?

$$\text{Speedup} = \frac{T_{old}}{T_{new}} = \frac{125}{100} = 1.25, \text{ or } 25\% \text{ speedup}$$

Getting Finer-Grained

• Can we specify our processing time in a more refined way?

Getting Finer-Grained

• The execution time can be calculated from the:

• IC -- instruction count or the number of instructions executed,

• CPI -- clocks per instruction or the average number of clock cycles per instruction, and

• τ -- clock cycle or the length of time for one full cycle of the clock.

Execution Time = T = IC × CPI × τ
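As a quick illustration of the formula, the sketch below multiplies the three factors. The function name `execution_time` and the sample values (2,000,000 instructions, a CPI of 4, a 1 ns clock period) are made up for the example.

```python
def execution_time(ic: int, cpi: float, tau: float) -> float:
    """T = IC x CPI x tau, with tau in seconds per clock cycle."""
    if ic < 0 or cpi <= 0 or tau <= 0:
        raise ValueError("IC must be non-negative; CPI and tau must be positive")
    return ic * cpi * tau


if __name__ == "__main__":
    # Hypothetical workload: 2,000,000 instructions, 4 clocks each, 1 ns clock period.
    t = execution_time(ic=2_000_000, cpi=4, tau=1e-9)
    print(f"T = {t * 1e3:.1f} ms")
```

Halving any one factor while the other two stay fixed halves T, which is why the later comparisons vary one factor at a time.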

Example Clock Cycle: The Mic-1 (1)

A complete block diagram of an example microarchitecture, the Mic-1.

Processing steps per clock cycle:
1. Set control lines to data path.
2. Propagate data path signals.
3. Store stable data signals.
4. Stabilize next MPC.
5. Determine next micro-operation.

Data Path Timing

Timing diagram of one data path cycle.

Processing steps:
1. Set control lines to data path.
2. Propagate data path signals.
3. Store stable data signals.
4. Stabilize next MPC.
5. Determine next micro-operation.

Clocks per Instruction (CPI)

Clock   RTN
T0      MA ← PC; C ← PC + 4;
T1      MD ← M[MA]; PC ← C;
T2      IR ← MD;
T3      A ← R[rb];
T4      C ← A + R[rc];
T5      R[ra] ← C;

Getting Finer-Grained (2)

• The master clock in a certain computer system is increased in frequency from 700 MHz to 1.2 GHz.

• What is the speedup due to this improvement if no other factors, such as memory access time, interfere with the improvement?

Getting Finer-Grained (2)

• By the problem definition, neither IC nor CPI changed, and the clock period, τ, is the reciprocal of the clock frequency, so:

$$\text{Speedup} = \frac{(IC \times CPI \times \tau)_{old}}{(IC \times CPI \times \tau)_{new}} = \frac{1/700}{1/1200} = \frac{1200}{700} = 1.71, \text{ or } 71\% \text{ speedup}$$

Advancing the Simple Computer

Step   RTN
T0     MA ← PC; C ← PC + 4;
T1     MD ← M[MA]; PC ← C;
T2     IR ← MD;
T3     A ← R[rb];
T4     C ← A + R[rc];
T5     R[ra] ← C;

Add second Bus

Step   Concrete RTN                  Control Sequence
T0.    MA ← PC;                      PCout, C=B, MAin, Read
T1.    PC ← PC + 4: MD ← M[MA];      PCout, Inc4, PCin, Wait
T2.    IR ← MD;                      MDout, C=B, IRin
T3.    A ← R[rb];                    Grb, Rout, C=B, Ain
T4.    R[ra] ← A + R[rc];            Grc, Rout, ADD, Sra, Rin, End

1-bus vs. 2-bus

T = IC × CPI × τ

$$\%\text{Speedup} = \frac{T_{1\text{-bus}} - T_{2\text{-bus}}}{T_{2\text{-bus}}} \times 100$$

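1-bus vs. 2-bus

• Assume for now that IC and τ do not change.

• Naively assume that just the CPI changes.

$$\begin{aligned}
\%\text{Speedup} &= \frac{T_{1\text{-bus}} - T_{2\text{-bus}}}{T_{2\text{-bus}}} \times 100 \\
&= \frac{(IC \times CPI \times \tau)_{1\text{-bus}} - (IC \times CPI \times \tau)_{2\text{-bus}}}{(IC \times CPI \times \tau)_{2\text{-bus}}} \times 100 \\
&= \frac{(IC \times 6 \times \tau)_{1\text{-bus}} - (IC \times 5 \times \tau)_{2\text{-bus}}}{(IC \times 5 \times \tau)_{2\text{-bus}}} \times 100 \\
&= \frac{6 - 5}{5} \times 100 = 20\%
\end{aligned}$$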

3-bus SRC

• A-bus is ALU operand 1, B-bus is ALU operand 2, and C-bus is ALU output.

• Note, MA input connected to the B-bus

3-bus SRC

Step   Concrete RTN                             Control Sequence
T0.    MA ← PC: PC ← PC + 4: MD ← M[MA];        PCout, MAin, Inc4, PCin, Read, Wait
T1.    IR ← MD;                                 MDout, C=B, IRin
T2.    R[ra] ← R[rb] + R[rc];                   GArc, RAout, GBrb, RBout, ADD, Sra, Rin, End

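1-bus vs. 3-bus

• Again assume that IC and τ do not change.

$$\begin{aligned}
\%\text{Speedup} &= \frac{T_{1\text{-bus}} - T_{3\text{-bus}}}{T_{3\text{-bus}}} \times 100 \\
&= \frac{(IC \times CPI \times \tau)_{1\text{-bus}} - (IC \times CPI \times \tau)_{3\text{-bus}}}{(IC \times CPI \times \tau)_{3\text{-bus}}} \times 100 \\
&= \frac{(IC \times 6 \times \tau)_{1\text{-bus}} - (IC \times 3 \times \tau)_{3\text{-bus}}}{(IC \times 3 \times \tau)_{3\text{-bus}}} \times 100 \\
&= \frac{6 - 3}{3} \times 100 = 100\%
\end{aligned}$$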
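When IC and τ are held fixed, they cancel and the % speedup depends only on the CPI values. The sketch below takes each design's CPI to be its number of RTN steps for the example instruction (6, 5, and 3), which is the same naive assumption made above; the names `CPI` and `percent_speedup_from_cpi` are illustrative.

```python
# Clock steps for the example instruction on each design (naive CPI).
CPI = {"1-bus": 6, "2-bus": 5, "3-bus": 3}


def percent_speedup_from_cpi(cpi_old: float, cpi_new: float) -> float:
    """% speedup = (T_old - T_new) / T_new x 100, with IC and tau cancelled."""
    if cpi_old <= 0 or cpi_new <= 0:
        raise ValueError("CPI values must be positive")
    return (cpi_old - cpi_new) * 100.0 / cpi_new


if __name__ == "__main__":
    for design in ("2-bus", "3-bus"):
        gain = percent_speedup_from_cpi(CPI["1-bus"], CPI[design])
        print(f"1-bus vs. {design}: {gain:.0f}%")
```

In practice adding buses may also change τ, so these figures are best read as upper bounds under the stated assumption.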
