
EECE476

Lectures 4, 5 – ALUs, Add, Multiply, and Floating-Point

Chapter 3: Computer Arithmetic

The University of British Columbia EECE 476 © 2005 Guy Lemieux

Announcements

• Assignment 1
  – First part posted on web.
    • Do it as practice for tomorrow’s tutorial!
  – Second part coming soon.
    • Do it as practice for the quiz next week!
• Quiz Dates
  – Quiz 1: Thurs, Sept 22nd, based on Assign 1
  – Quiz 2, etc.: TBD

Reading

• Chapter 3
  – 3.2 signed numbers
  – 3.3 addition and subtraction
  – 3.4 multiplication
  – 3.5 division
  – 3.6 floating-point (read lightly)

Computer Arithmetic

• Objective 1
  – Discover the “logic complexity” of the different types of arithmetic done by a CPU
  – The complexity will have an impact on performance later!
• Objective 2
  – Learn how to build an ALU for your project

The Conclusions

• Add is easy
  – Fast adding is not too bad either…
  – Subtraction: addition’s tricky pal
• Multiply is hard…
  – But you can add many times
• Divide is really hard…
  – Divide and be conquered!
• Anything floating-point is impossible!
  – Well, not quite, but you will get the idea…

Computer Architecture?

• Recall
  – Computer Architecture = ISA + Machine Organization
  – Machine Organization = implementation details!
• Begin to consider coupling
  – ISA <-> Machine Organization
• Heart of computer: arithmetic calculations
  – Done by ALU: Arithmetic Logic Unit
• Some parts not done by ALU
  – Decision-making, iteration, memory/state …
  – All of these are important as well

MIPS Arithmetic Instructions

• Let’s design an ALU that MIPS can use!
• Many different operations
  – Arithmetic
    • Add, AddU, Sub, SubU
    • AddI, AddIU
    • Mult, MultU, Div, DivU
  – Logical
    • And, Or, Xor, Nor
    • AndI, OrI, XorI
  – Logical/Arithmetic
    • SLT, SLTU
    • SLTI, SLTIU
  – Shifting (Left/Right & Logical/Arithmetic & Const/Variable)
    • SLL, SRL, SRA, SLLV, SRLV, SRAV

MIPS ALU Design

• First: simplify!
  – Throw out “hard” operations
    • Mult, Div
  – Extract & group basic operations
    • Add, Sub
    • And, Or, Nor, Xor
    • SLT
    • Shifting (is this hard?)

MIPS ALU Design

• Second: simplify!
  – Identify common optimizations
    • Sub = variation of Add
    • Nor = variation of Or (why Nor?)
  – Some other CPUs have even more operations
    • Bit set, Bit test, Bit clear, etc.

ALU Design 1

• Easy way…
  – Try to be more creative!

[Figure: functional blocks (+/–, *) with the instruction/operation selecting the output F]

ALU Design 2

• Start with single bit operations
  – All operations share same 2 inputs
  – Small optimizations may be possible
    • E.g., Or and Nor
    • E.g., Add and And (see problem set)
    • Generally, these aren’t too helpful

[Figure: one-bit block with inputs A and B, Operation and Nor controls, and output F]

ALU Design 3

• Build up to larger multi-bit operations
  – Bigger & better optimizations are possible
    • E.g., Add and Sub
    • E.g., SLT

Add/AND/OR for ALU: Bit-based

[Figure, panel "One Bit": a one-bit ALU with inputs a, b and CarryIn; an Operation-controlled mux (inputs 0, 1, 2) selects Result; the adder also produces CarryOut.]

[Figure, panel "Multiple Bits": 32 one-bit ALUs (ALU0 … ALU31) sharing the Operation control; ALUi takes ai and bi and produces Resulti, and each CarryOut feeds the next CarryIn.]
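The bit-slice structure can be sketched in software. This is a minimal sketch, not the course's reference design: the function names `alu_bit` and `alu32` are made up for illustration, and the mux numbering (0 = AND, 1 = OR, 2 = ADD) is an assumption, since the figure's input labels did not survive.

```python
def alu_bit(a, b, carry_in, operation):
    """One-bit ALU slice. ASSUMED mux numbering: 0 = AND, 1 = OR, 2 = ADD."""
    and_out = a & b
    or_out = a | b
    sum_out = a ^ b ^ carry_in                              # full-adder sum
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)   # full-adder carry
    result = (and_out, or_out, sum_out)[operation]          # the Operation mux
    return result, carry_out


def alu32(a, b, operation, carry_in=0):
    """Chain 32 slices: CarryOut of slice i feeds CarryIn of slice i+1."""
    result = 0
    carry = carry_in
    for i in range(32):
        bit, carry = alu_bit((a >> i) & 1, (b >> i) & 1, carry, operation)
        result |= bit << i
    return result, carry
```

The loop makes the O(n) ripple visible: bit 31 cannot settle until the carry has passed through all lower slices.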

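Subtract for ALU

• $A - B = A + (\overline{B} + 1)$, where $\overline{B}$ is the bitwise inverse of B

[Figure: the one-bit ALU (inputs a, b, CarryIn; Operation mux with inputs 0, 1, 2; outputs Result, CarryOut) with an added Binvert-controlled mux (inputs 0, 1) that selects b or its inverse.]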
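A minimal sketch of subtraction by invert-and-add on 32-bit values. The name `sub32` is hypothetical, and feeding the +1 through the least-significant CarryIn is an assumption about how the figure's Binvert and CarryIn signals are used together.

```python
MASK = 0xFFFFFFFF  # keep values to 32 bits


def sub32(a, b):
    """Compute A - B as A + (~B) + 1, wrapping to 32 bits."""
    b_inverted = ~b & MASK        # Binvert = 1 selects the inverted b
    total = a + b_inverted + 1    # the +1 is ASSUMED to enter via CarryIn of bit 0
    return total & MASK
```

For example, `sub32(11, 13)` wraps to `0xFFFFFFFE`, which is the two's complement encoding of -2.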

Fast Adders

• We will assume fast adders are available
  – Slow O(n) vs. fast O(log n)
  – E.g., carry-lookahead, carry-skip, carry-select
• You should know “fast adders” already, but…
  – Not a part of this course
  – Fast adders are NOT on assignments, tests or exam
  – Do NOT use a fast adder in your project
    • FPGAs have their own “fast carry chain” to do adds quickly, and you will merely confuse the tool

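Set Less Than

• One-bit ALU blocks

[Figure, panel "All other bits": one-bit ALU with inputs a, b, Binvert, CarryIn and Less; the Operation mux (inputs 0 to 3) selects Result; the block also produces CarryOut.]

[Figure, panel "Most-significant bit": the same block with an added Set output and overflow detection producing Overflow.]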

Set Less Than

• Stitching the one-bit ALU blocks together
• Notice ‘Set’ output is sign bit of A-B
  – Used as ‘Less’ input on LSB
  – Other ‘Less’ inputs forced to 0

[Figure: 32-bit ALU built from ALU0 … ALU31 with shared Binvert and Operation controls and a rippled carry chain; Set from ALU31 feeds Less of ALU0, the other Less inputs are tied to 0; ALU31 also produces Overflow.]
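The wiring described above can be sketched as follows. This is a simplified illustration with a hypothetical function name `slt`; it takes the raw sign bit of A-B exactly as the bullets state and ignores the Overflow output.

```python
MASK = 0xFFFFFFFF


def slt(a, b):
    """Set-less-than on 32-bit two's complement bit patterns.

    Returns the 32-bit result word: bit 0 is the Set output (sign of A - B),
    all other bits are forced to 0.
    """
    diff = (a + (~b & MASK) + 1) & MASK   # A - B via invert-and-add
    set_bit = (diff >> 31) & 1            # Set = sign bit of the difference
    return set_bit                        # Less inputs of bits 1..31 are 0
```

A caveat of this sketch: when A-B overflows (for example the most negative value minus 1), the raw sign bit gives the wrong answer, so `slt(0x80000000, 1)` returns 0 here.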

Testing Result for Zero

[Figure: the 32-bit ALU from the previous slide (ALU0 … ALU31, Binvert, Operation, Set, Overflow) with an added Zero output derived from Result0 … Result31.]
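The gate that produces Zero is not recoverable from the figure. A common choice, assumed here, is a NOR across all result bits; `zero_flag` is a hypothetical name.

```python
def zero_flag(result, width=32):
    """Return 1 when every bit of result is 0 (ASSUMED: NOR of all Result bits)."""
    any_set = 0
    for i in range(width):
        any_set |= (result >> i) & 1   # wide OR over Result0..Result31
    return any_set ^ 1                 # invert: OR becomes NOR
```

Combined with subtraction, a Zero output of 1 means the two operands were equal.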

BIG PICTURE

• This course is about computer architecture
• Why care about ALU design details?
  – Our goal is performance
  – Some ALU designs may be faster or slower
• You must understand the impact they have on
  – Clock frequency (cycle time)
  – Instruction set design
  – More advanced things (e.g., impact of multiple ALUs)
  – Ultimately, performance!

Multiply

Shift, Add, Shift, Add, Shift, Add…

Multiplication - Decimal

• More complex than addition
  – Multiple additions (and shifts)
  – More gates/area, slower
• Gradeschool algorithm:

    Multiplicand M      13
    Multiplier   Q    x 11
                        13   <- 13 x 1
                       13    <- 13 x 10
    Product      P     143

Multiplication - Binary

• Same algorithm, different digits

    Multiplicand M        1101   (13)
    Multiplier   Q      x 1011   (11)
                          1101   <- 1 Q0  Partial Product PP0
                         1101    <- 1 Q1  Partial Product PP1
                        0000     <- 0 Q2  Partial Product PP2
                       1101      <- 1 Q3  Partial Product PP3
                      10001111   (143)    Product P

• M bits x N bits => M+N bit product
• Binary makes it easy:
  – Bit Qi is zero => PPi is 0
  – Bit Qi is one => PPi is M (shifted i times left)
  – Product is sum of PPs
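The three rules above translate directly into code. This is a minimal sketch for unsigned operands; the function name `multiply_partial_products` is made up for illustration.

```python
def multiply_partial_products(m, q, n_bits):
    """Unsigned multiply: PPi is 0 or M shifted left i places; P is the sum."""
    partial_products = []
    for i in range(n_bits):
        q_i = (q >> i) & 1                          # bit Qi of the multiplier
        partial_products.append((m << i) if q_i else 0)
    return partial_products, sum(partial_products)
```

With M = 1101 and Q = 1011 this yields the partial products 13, 26, 0 and 104, which sum to 143.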

Multiplication – Hardware V0a

• Array multiplier
• Stage i accumulates PPi (0 or M shifted i) depending on Qi
• Answer P comes out at bottom
• Slow! Big!

[Figure: 4-bit array multiplier. Four rows take M3..M0, gated by Q0, Q1, Q2 and Q3 in turn; the first row's incoming sum is 0 0 0 0; the product bits P7..P0 come out at the bottom.]

Multiplication – Hardware V0b

• At each stage, shift M left (x 2)
• Next bit of Q determines whether to add in shifted multiplicand
• Accumulate 2n-bit partial product at each stage
• Each stage identical: need only 1 stage in hardware (use multiple cycles)

[Figure: the 4-bit array multiplier again (rows of M3..M0 gated by Q0..Q3, outputs P7..P0), with zeros filling the unused inputs.]

Multiplication – Hardware V1

• M: 64b shift register, Q: 32b shift register, P: 64b register
• Initially, ensure high bits of M are zero (M63..M32 = 0)
• Multiplier = datapath + control

[Figure: datapath. M: Multiplicand (64 bits, Shift Left) and P: Product (64 bits, Write) feed a 64-bit ALU (Add); Q: Multiplier (32 bits, Shift Right) supplies Q0 to Control.]

Multiplication – Hardware V1

Notes
• 1 clock cycle per bit (32 total)
• 0’s are left-shifted into M
  – Lower bits of P never change once formed
• Half of bits in M are always zero
  – 64 bit ALU is wasted

Observations lead to refinement:
• Right-shift P instead of left-shifting M

Multiplication Algorithm

• Russian Peasant Algorithm

    PP = M
    P = 0
    while( Q != 0 ) {
        if( Q is odd ) {      // i.e., if bit 0 of Q is '1'
            P = P + PP        // accumulate partial product (PP) in P
        }
        PP = PP * 2           // shift PP left 1 position
        Q = Q / 2             // shift Q right 1 position
    }

• Compare this to the hardware just presented!
  – Each loop iteration takes one clock cycle
  – How many cycles are required?

Multiplication – Hardware V2

• M: 32b register, Q: 32b shift register, P: 64b shift register
• Initially, P=0. Only high bits of P (63..32) affected by a write.

[Figure: datapath. M: Multiplicand (32 bits) and P: Product (64 bits, Shift Right, Write) feed a 32-bit ALU (Add); Q: Multiplier (32 bits, Shift Right) supplies Q0 to Control.]

• What’s really going on?

[Figure: the 4-bit array multiplier again: rows of M3..M0 gated by Q0..Q3, initial inputs 0 0 0 0, outputs P7..P0.]

Notes
• 1 clock cycle per bit (32 total)
• Lower 32 bits of P are initially unused
  – Holds zero, but unused
  – Each cycle, 1 fewer unused bit
• 0’s are right-shifted into Q
  – Initially: 32 bits used in Q
  – Each step: 1 fewer bit needed in Q
  – At end: Q is destroyed

Observations lead to refinement:
• Use lower 32 bits of P to hold Q

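• M: 32b register, P: 64b shift register (lower half represents Q)
• Initially, P=Q (Q in the low half, zeros in the high half). Only high bits of P (63..32) are changed on write.

[Figure: datapath. M: Multiplicand (32 bits) and P: Product (64 bits, Shift Right, Write) feed a 32-bit ALU (Add); the low half of P holds Q: Multiplier, and its bit Q0 goes to Control.]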
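The shared-register scheme can be simulated in a few lines. This is a sketch under stated assumptions: the name `multiply_shared_register` is hypothetical, operands are unsigned, and the adder's carry-out is assumed to be shifted into the top of P (Python's unbounded integers model that extra bit).

```python
def multiply_shared_register(m, q, n_bits=32):
    """Unsigned n x n multiply using one 2n-bit register P whose low half holds Q."""
    mask = (1 << n_bits) - 1
    p = q & mask                          # initially P = Q, high half is zero
    for _ in range(n_bits):               # one iteration per clock cycle
        if p & 1:                         # Q0 is the current low bit of P
            p += (m & mask) << n_bits     # add M into the high half only
        p >>= 1                           # shift P right; the carry bit moves in
    return p                              # high half = HI, low half = LO
```

For a 4-bit run with M = 13 and Q = 11, P steps through 109, 158, 79 and finally 143.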

Notes
• P has two halves: high, low

MIPS multiply instruction MultU
• 32 regular MIPS registers
• 2 special MIPS registers: HI, LO
  – Why special? Need to right-shift contents
• HI, LO store results of MultU

Multiplication – Signed Numbers

• Gradeschool algorithm assumes unsigned numbers

    Multiplicand M        1101   (13)
    Multiplier   Q      x 1011   (11)
                          1101   <- 1 Q0  Partial Product 0
                         1101    <- 1 Q1  Partial Product 1
                        0000     <- 0 Q2  Partial Product 2
                       1101      <- 1 Q3  Partial Product 3
                      10001111   (143)

• Signed numbers?
  – Example above reads (-3) * (-5) = (-113), clearly wrong!
  – Requires some adjustments

Multiplication – Signed Numbers

Two Cases For Signed Multiply: P = M*Q
• Case A: M signed, Q unsigned or Q >= 0
  – Add using sign-extension of PP

    Multiplicand M          1101   (–3)
    Multiplier   Q        x 1011   (11)
                        11111101   <- 1 Q0  Partial Product 0
                        1111101    <- 1 Q1  Partial Product 1
                        000000     <- 0 Q2  Partial Product 2
                        11101      <- 1 Q3  Partial Product 3
                        11011111   (–33)
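Case A can be checked with a short sketch. The helper names `sign_extend` and `multiply_signed_m` are made up for illustration; the product is kept to 2n bits as in the table.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit of a from_bits-wide value into the upper bits."""
    if (value >> (from_bits - 1)) & 1:
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def multiply_signed_m(m, q, n_bits=4):
    """Case A: M is signed (two's complement), Q is treated as unsigned."""
    width = 2 * n_bits
    mask = (1 << width) - 1
    m_ext = sign_extend(m & ((1 << n_bits) - 1), n_bits, width)
    product = 0
    for i in range(n_bits):
        if (q >> i) & 1:
            product = (product + (m_ext << i)) & mask   # sign-extended PPi
    return product
```

`multiply_signed_m(0b1101, 0b1011)` gives `0b11011111`, the 8-bit encoding of -33.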

Multiplication – Signed Numbers

Two Cases For Signed Multiply: P = M*Q
• Case B: M signed, Q signed and Q < 0
  – One method:
    • Note that $P = M \times Q = (-M) \times (-Q) = (\overline{M}+1) \times (\overline{Q}+1)$
    • Now $(\overline{Q}+1)$ is positive, follow Case A
    • How to do this in hardware?
      – Use sign bits to modify M and Q, two extra adds for +1’s
  – Alternate method: Booth encoding
    • Look it up!
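A sketch of the first method (negate both operands when Q is negative, then proceed as in Case A). The names `negate` and `multiply_signed` are hypothetical; sign-extending M to 2n bits before negating is a choice made here so that the most negative M is handled, not something the slide specifies.

```python
def negate(value, bits):
    """Two's complement negation: invert all bits, then add 1."""
    return (~value + 1) & ((1 << bits) - 1)


def multiply_signed(m, q, n_bits=4):
    """Signed n x n multiply returning a 2n-bit two's complement product."""
    width = 2 * n_bits
    mask = (1 << width) - 1
    m &= (1 << n_bits) - 1
    q &= (1 << n_bits) - 1
    # Sign-extend M to the full product width (Case A needs this anyway).
    m_ext = m | (mask & ~((1 << n_bits) - 1)) if (m >> (n_bits - 1)) & 1 else m
    if (q >> (n_bits - 1)) & 1:          # Case B: Q < 0, negate both operands
        m_ext = negate(m_ext, width)
        q = negate(q, n_bits)            # now non-negative when read as unsigned
    product = 0
    for i in range(n_bits):              # Case A: shift-and-add with unsigned Q
        if (q >> i) & 1:
            product = (product + (m_ext << i)) & mask
    return product
```

With M = 1101 (-3) and Q = 1011 (-5) the result is 15, fixing the (-113) reading from the unsigned algorithm.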

Divide?

Forget it!

Basically, do the long division thing over multiple clock cycles:

1) Subtract divisor
2) If >= 0, put “1” in answer, do next bit
3) If < 0, put “0” in answer, add divisor back, do next bit
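The three steps correspond to what is usually called restoring division. A minimal sketch for unsigned operands follows; the name `restoring_divide` and the bit-at-a-time "bring down" of the dividend are illustrative assumptions, since the slide gives no datapath.

```python
def restoring_divide(dividend, divisor, n_bits=32):
    """Unsigned long division, one quotient bit per iteration (clock cycle)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = 0
    quotient = 0
    for i in range(n_bits - 1, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor              # 1) subtract divisor
        if remainder >= 0:
            quotient |= 1 << i            # 2) put "1" in answer
        else:
            remainder += divisor          # 3) put "0" in answer, add divisor back
    return quotient, remainder
```

`restoring_divide(143, 11)` returns `(13, 0)`, undoing the multiplication example from earlier slides.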

Floating-Point

Integers and Beyond

• Integers perfectly accurate, no error
  – 32-bit integer: -2,147,483,648 to 2,147,483,647
  – Integers “overflow” or wrap on +1 from 2,147,483,647 to -2,147,483,648
• What about numbers with non-integral parts?
  – Large range in values, possibly large number of significant digits…
  – Rationals
    • 0.5 => can represent as ½
    • 1/3 => 0.33333333333333333333333…
    • 63/127 => 0.4960629921259842519685039370…
  – Irrationals
    • sqrt(2) = 1.41421356237309504880168872420…
    • Transcendentals: pi = 3.14159265358979…, e = 2.71828182…
  – Scientific
    • $N_A = 6.022 \times 10^{23}$: Avogadro’s number (atoms in one mole)
    • $G = 6.67259 \times 10^{-11}$: gravitational constant ($F = -GMm/r^2$)
    • $c = 2.99792458 \times 10^{8}$: speed of light (m/s)

Floating Point Numbers

• How to represent non-integral numbers in binary?
  – Many possible ways
    • e.g., store (numerator, denominator) => doesn’t work for irrationals
  – All ways have limitations
    • Cannot represent all real numbers: infinite number of them, finite number of bits!
  – Need a standard on how to interpret the bits
    • e.g. two’s complement for signed integers
  – Benefits of a standard:
    • Software portability: same answer on any machine
    • Data portability: binary data can be sent directly, no conversions
    • Numerical environment: defines level of mathematical precision, allows research into error analysis, avoids future problems

Floating Point Numbers: IEEE754

• A floating-point standard IEEE754
  – Standard published in 1985
    • Started in 1977
    • Primarily work of William Kahan (UofT student)
  – Based largely on development of Intel 8087
    • A floating-point processor designed to work with the 8086
  – Intel’s chip was a model to follow
    • 8087 first commercial product to implement IEEE 754
    • Other companies implemented IEEE 754, looked at Intel’s chip

Binary Representation of Fractional Numbers

• Example

    $101.011_2 = 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 0 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3}$

    $= 4 + 0 + 1 + 0 + 1/4 + 1/8$

    $= 5.375$
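The positional expansion above can be checked mechanically. This is a small sketch; `binary_to_decimal` is a made-up helper name and it assumes a well-formed string of 0s and 1s with at most one point.

```python
def binary_to_decimal(text):
    """Evaluate a binary string such as '101.011' digit by digit."""
    integer_part, _, fraction_part = text.partition(".")
    value = 0.0
    for position, digit in enumerate(reversed(integer_part)):
        value += int(digit) * 2 ** position        # weights 1, 2, 4, ...
    for position, digit in enumerate(fraction_part, start=1):
        value += int(digit) * 2 ** -position       # weights 1/2, 1/4, 1/8, ...
    return value
```

`binary_to_decimal("101.011")` evaluates to 5.375, matching the slide.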

Binary Representation of Floating-Point Numbers

• Recall scientific notation: $6.022 \times 10^{23}$
  – 6.022 is the normalized significant part (the significand)
  – 10 is the base or radix
  – 23 is the exponent
• Can do the same in binary:

    $1.011_2 \times 2^{3} = 1.011_2 \times 8 = 1011_2 = 11$ (base 10)

• Negative numbers?
  – Need to remember the sign of the significant part
• Generally: $(-1)^{S} \times M \times b^{e}$, where:
  – S is sign (0 or 1)
  – M is significand
  – b is base/radix
  – e is exponent

IEEE 754 Binary Single-Precision Floating-Point Representation

$(-1)^{S} \times M \times b^{e}$

$1.011_2 \times 2^{3}$: S = 0, M = 1.011, b = 2, e = 3

Encoding into bits:
  – Assume b=2 (binary!), no need to store/remember
  – S one bit: 0
  – M 24 bits: 1011 0000 0000 0000 0000 0000
    • If normalized, first (leftmost) digit of M is always a ‘1’, never a ‘0’
    • Don’t store the leading ‘1’, instead define M=1.F and store F
  – F 23 bits: 011 0000 0000 0000 0000 0000
  – Convert e=3 into binary (e may be negative!):
    • Use biased notation, called Excess-N
    • Excess 127 used here: Define E = e+127 = 130
  – E 8 bits, E = 1000 0010
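The encoding can be confirmed on a real machine by pulling the three fields out of a single-precision value. This is a small sketch using Python's standard `struct` module; `float_fields` is a hypothetical helper name.

```python
import struct


def float_fields(x):
    """Split the single-precision encoding of x into (S, E, F)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # 32-bit pattern
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, excess 127
    fraction = bits & 0x7FFFFF         # 23 bits, leading 1 not stored
    return sign, exponent, fraction
```

For 11.0 (which is 1.011 x 2^3 in binary) this returns S = 0, E = 130 and F = 011 followed by twenty zeros, as derived on the slide.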

IEEE 754 Binary Floating-Point Representation

Representation of floating point numbers in IEEE 754 standard:

Single precision, 32 bits total

    | sign S | E      | F       |
    | 1 bit  | 8 bits | 23 bits |

  – Exponent E: excess 127 binary integer
  – Significand: normalized binary significand w/ hidden integer bit: M = 1.F

Double precision, 64 bits total

    | sign S | E       | F       |
    | 1 bit  | 11 bits | 52 bits |

  – Exponent E: excess 1023 binary integer

IEEE 754 Precision

• Single precision
  – 9 decimal digits are enough to identify any value uniquely (only about 6 to 7 are accurate)
• Double precision
  – 17 decimal digits are enough to identify any value uniquely (only about 15 to 16 are accurate)
• Storing floating-point numbers to disk? Two options:
  – A: write binary value (32 bits or 64 bits)
    • IEEE 754 standard allows us to interchange these values!
  – B: write value as decimal digits, e.g. in ASCII
    • Need to write 9 (or 17) decimal digits
    • Need to write sign, exponent as well
    • Reading back in: convert to binary, get same binary value as before
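Option B can be tried out directly. The sketch below is illustrative only: the helper names are made up, Python floats are doubles, and single precision is emulated by rounding through `struct`.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def round_trips_double(x, digits=17):
    """Write x with the given number of significant digits, then read it back."""
    text = format(x, f".{digits}g")
    return float(text) == x


def round_trips_single(x, digits=9):
    """Same experiment for the single-precision value nearest to x."""
    value = to_single(x)
    text = format(value, f".{digits}g")
    return to_single(float(text)) == value
```

Writing fewer digits can lose information: `0.1 + 0.2` printed with 15 significant digits reads back as a different double.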

IEEE 754 Accuracy

• Not all real values can be represented
  – Inf. # of values between ½ and ¼
  – Inf. # of values between ½ and 3/8
  – Inf. # of values between ½ and 7/16, etc.
• All floating-point numbers are approximations
  – Calculations with approximations introduce errors
  – Reduce size of errors by proper rounding
  – 754: keep extra bits of precision during calculations for rounding
  – Cannot solve all problems: algorithm numerical stability a must!
  – You get same problems on every machine using IEEE754

IEEE 754 Range

• See float.h on a Unix system
• Single precision:
  – Minimum (smallest positive normalized value): 1.175494351E-38 (FLT_MIN)
  – Maximum: 3.402823466E+38 (FLT_MAX)
• Double precision:
  – Minimum (smallest positive normalized value): 2.2250738585072014E-308 (DBL_MIN)
  – Maximum: 1.7976931348623157E+308 (DBL_MAX)

IEEE 754 Range

• What happens on overflow?
  – Depends, 754 standard defines some special cases
  – Normally, you get value called Infinity
• Tiny numbers? (smaller than smallest normal)
  – Goes to zero? Called underflow
    • This is rather drastic
  – 754 standard defines denormalized numbers
    • Underflow occurs gradually…
    • Underflow/denormals make the hardware hard to design
      – Not all chips support it
    • Often use software interrupts to handle denormals

IEEE 754 Special Cases

• Infinity, -Infinity
  – Caused by overflow
  – Caused by 1/0 (note: no error produced)
• +0, -0
  – This is claimed to be useful!
• NaN
  – “Not a Number”
  – 0/0
  – Infinity/Infinity, 0*Infinity, Infinity–Infinity, etc.
  – Sqrt(-number)
  – Infectious: NaN + number = NaN, NaN x number = NaN, etc.
• Comparisons (<, >, =, etc.) with Infinity? NaN?
  – Cases all defined by the standard

IEEE 754 Hardware

1. Compare exponents
2. Shift smaller number right
3. Add
4. Normalize
5. Round

[Figure: datapath. Two operands, each split into Sign, Exponent and Fraction, feed a Big ALU; the result is again Sign, Exponent, Fraction.]
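The five steps read as floating-point addition, and can be sketched on unpacked operands. This is a heavily simplified illustration, not IEEE 754 compliant hardware behaviour: it handles positive normalized numbers only, keeps just 3 guard bits, rounds halves upward, and ignores special values. All names (`fp_add`, `to_value`, `SIG_BITS`, `GUARD`) are made up for the example.

```python
SIG_BITS = 24   # significand width including the hidden leading 1
GUARD = 3       # ASSUMED number of extra low bits kept for rounding


def to_value(exp, sig):
    """Interpret (exp, sig) as sig * 2^(exp - 23), i.e. 1.F * 2^exp."""
    return sig * 2.0 ** (exp - (SIG_BITS - 1))


def fp_add(exp_a, sig_a, exp_b, sig_b):
    """Add two positive normalized numbers given as (exponent, 24-bit significand)."""
    # 1. Compare exponents: make operand a the one with the larger exponent.
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    # 2. Shift the smaller number right to align the binary points.
    big = sig_a << GUARD
    small = (sig_b << GUARD) >> (exp_a - exp_b)
    # 3. Add the aligned significands.
    total = big + small
    exp = exp_a
    # 4. Normalize: a carry out of the top bit means shift right, bump exponent.
    if total >> (SIG_BITS + GUARD):
        total >>= 1
        exp += 1
    # 5. Round using the guard bits (halves round up in this sketch).
    total = (total + (1 << (GUARD - 1))) >> GUARD
    if total >> SIG_BITS:                # rounding itself overflowed
        total >>= 1
        exp += 1
    return exp, total
```

For example, adding 11.0 (`exp=3, sig=0xB00000`) and 0.5 (`exp=-1, sig=0x800000`) yields a pair that `to_value` maps to 11.5.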


Shifters

• Left as an exercise...

BIG PICTURE

• This course is about computer architecture
• Why care about ALU design details?
  – Our goal is performance
  – Some ALU designs may be faster or slower
• You must understand the impact they have on
  – Clock frequency (cycle time)
  – Instruction set design
  – More advanced things (e.g., impact of multiple ALUs)
  – Ultimately, performance!
