EECE476
Lectures 4, 5 – ALUs, Add, Multiply, and Floating-Point
Chapter 3: Computer Arithmetic
The University of British Columbia EECE 476 © 2005 Guy Lemieux

Announcements
• Assignment 1
  – First part posted on web.
    • Do it as practice for tomorrow’s tutorial!
  – Second part coming soon.
    • Do it as practice for the quiz next week!
• Quiz Dates
  – Quiz 1: Thurs, Sept 22nd, based on Assignment 1
  – Quiz 2, etc.: TBD

Reading
• Chapter 3
  – 3.2 signed numbers
  – 3.3 addition and subtraction
  – 3.4 multiplication
  – 3.5 division
  – 3.6 floating-point (read lightly)

Computer Arithmetic
• Objective 1
  – Discover the “logic complexity” of the different types of arithmetic done by a CPU
  – The complexity will have an impact on performance later!
• Objective 2
  – Learn how to build an ALU for your project

The Conclusions
• Add is easy
  – Fast adding is not too bad either…
  – Subtraction: addition’s tricky pal
• Multiply is hard…
  – But you can add many times
• Divide is really hard…
  – Divide and be conquered!
• Anything floating-point is impossible!
  – Well, not quite, but you will get the idea…

Computer Architecture?
• Recall
  – Computer Architecture = ISA + Machine Organization
  – Machine Organization = implementation details!
• Begin to consider coupling
  – ISA ↔ Machine Organization
• Heart of computer: arithmetic calculations
  – Done by ALU: Arithmetic Logic Unit
• Some parts not done by ALU
  – Decision-making, iteration, memory/state…
  – All of these are important as well

MIPS Arithmetic Instructions
• Let’s design an ALU that MIPS can use!
• Many different operations
  – Arithmetic
    • Add, AddU, Sub, SubU
    • AddI, AddIU
    • Mult, MultU, Div, DivU
  – Logical
    • And, Or, Xor, Nor
    • AndI, OrI, XorI
  – Logical/Arithmetic
    • SLT, SLTU
    • SLTI, SLTIU
  – Shifting (Left/Right & Logical/Arithmetic & Const/Variable)
    • SLL, SRL, SRA, SLLV, SRLV, SRAV

MIPS ALU Design
• First: simplify!
  – Throw out “hard” operations
    • Mult, Div
  – Extract & group basic operations
    • Add, Sub
    • And, Or, Nor, Xor
    • SLT
    • Shifting (is this hard?)

MIPS ALU Design
• Second: simplify!
  – Identify common optimizations
    • Sub = variation of Add
    • Nor = variation of Or (why Nor?)
  – Some other CPUs have even more operations
    • Bit set, Bit test, Bit clear, etc.

ALU Design 1
• Easy way…
  – Try to be more creative!
[Figure: block F with labels Instruction/operation, +/–, *]

ALU Design 2
• Start with single bit operations
  – All operations share same 2 inputs
  – Small optimizations may be possible
    • E.g., Or and Nor
    • E.g., Add and And (see problem set)
    • Generally, these aren’t too helpful
[Figure: one-bit block with inputs A, B and Operation, output F, with Nor shown]

ALU Design 3
• Build up to larger multi-bit operations
  – Bigger & better optimizations are possible
    • E.g., Add and Sub
    • E.g., SLT

Add/AND/OR for ALU: Bit-based
[Figure, "One Bit": a one-bit ALU with inputs a, b and CarryIn, an Operation mux with inputs 0, 1, 2, and outputs Result and CarryOut.]
[Figure, "Multiple Bits": 32 one-bit ALUs (ALU0 to ALU31) with inputs a0..a31 and b0..b31, a shared Operation, outputs Result0..Result31, and each CarryOut feeding the next CarryIn.]

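The bit-based construction can be sketched in software. This is a minimal model, assuming the Operation encoding 0 = AND, 1 = OR, 2 = ADD suggested by the mux inputs in the figure; the function names `alu_bit` and `alu32` are illustrative and not from the slides.

```python
def alu_bit(a, b, carry_in, op):
    """One-bit ALU slice. Assumed encoding: op 0 = AND, 1 = OR, 2 = ADD."""
    total = a ^ b ^ carry_in                                # full-adder sum bit
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)   # full-adder carry
    result = (a & b, a | b, total)[op]                      # the Operation mux
    return result, carry_out


def alu32(a, b, op, width=32):
    """Chain `width` one-bit slices; the carry ripples from bit 0 upward."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = alu_bit((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    return result, carry
```

The loop makes the O(n) ripple delay visible: slice i cannot finish until slice i-1 has produced its carry.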
Subtract for ALU
• A – B = A + (~B + 1), where ~B is B with every bit inverted
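[Figure: the one-bit ALU extended with a Binvert mux (inputs 0, 1) that selects b or its inverse; inputs a, b, CarryIn; Operation mux with inputs 0, 1, 2; outputs Result and CarryOut.]
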
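Subtraction by two's complement, A – B = A + (~B + 1), can be checked with a short sketch. It assumes 32-bit operands held as unsigned bit patterns; `sub32` and `MASK` are illustrative names. Inverting b models Binvert = 1, and the added 1 models CarryIn = 1 on the least-significant slice.

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits


def sub32(a, b):
    """Compute a - b using only inversion and addition."""
    inverted_b = ~b & MASK                 # Binvert = 1
    return (a + inverted_b + 1) & MASK     # CarryIn = 1 supplies the +1
```

For example, `sub32(5, 7)` gives `0xFFFFFFFE`, the 32-bit two's complement pattern for -2.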
Fast Adders
• We will assume fast adders are available
  – Slow O(n) vs. Fast O(log n)
  – E.g., carry-lookahead, carry-skip, carry-select
• You should know “fast adders” already, but…
  – Not a part of this course
  – Fast adders are NOT on assignments, tests or exam
  – Do NOT use a fast adder in your project
    • FPGAs have their own “fast carry chain” to do adds quickly, and you will merely confuse the tool

Set Less Than
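• One-bit ALU blocks
[Figure, "All other bits": one-bit ALU with inputs a, b, Less and CarryIn, a Binvert mux (inputs 0, 1), an Operation mux with inputs 0 to 3, and outputs Result and CarryOut.]
[Figure, "Most-significant bit": the same block with an added Set output and overflow detection producing Overflow.]
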
Set Less Than
• Stitching the one-bit ALU blocks together
• Notice ‘Set’ output is sign bit of A-B
  – Used as ‘Less’ input on LSB
  – Other ‘Less’ inputs forced to 0
[Figure: 32-bit ALU built from ALU0..ALU31 with inputs a0..a31 and b0..b31, shared Binvert and Operation, rippling CarryIn/CarryOut, and outputs Result0..Result31. Set from ALU31 feeds Less of ALU0; the other Less inputs are 0. ALU31 also outputs Overflow.]

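Set-less-than can be sketched as a subtraction followed by a look at the sign bit. The slide takes Set directly from the sign bit of A-B; the sketch below additionally XORs in the overflow flag, which is an assumption beyond the slide so that the answer stays correct when A-B overflows. Operands are 32-bit patterns and `slt` is an illustrative name.

```python
MASK = 0xFFFFFFFF


def slt(a, b):
    """Signed a < b for 32-bit patterns: returns 1 or 0."""
    diff = (a + (~b & MASK) + 1) & MASK     # A - B via Binvert and CarryIn = 1
    sign = (diff >> 31) & 1                 # the 'Set' output of the MSB slice
    # A - B overflows when A and B have different signs
    # and the result's sign differs from A's sign.
    overflow = (((a ^ b) & (a ^ diff)) >> 31) & 1
    return sign ^ overflow
```

Without overflow the result is just the sign bit, matching the slide; with overflow the sign bit is inverted, as in comparing -2^31 with 1.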
Testing Result for Zero
[Figure: the same 32-bit ALU (ALU0..ALU31, Binvert, Operation, Set, Overflow) with an added Zero output derived from Result0..Result31.]

BIG PICTURE
• This course is about computer architecture
• Why care about ALU design details?
  – Our goal is performance
  – Some ALU designs may be faster or slower
• You must understand the impact they have on
  – Clock frequency (cycle time)
  – Instruction set design
  – More advanced things (e.g., impact of multiple ALUs)
  – Ultimately, performance!

Multiply
Shift, Add, Shift, Add, Shift, Add…

Multiplication - Decimal
• More complex than addition
  – Multiple additions (and shifts)
  – More gates/area, slower
• Gradeschool algorithm:
  Multiplicand M       13
  Multiplier Q       x 11
                       13   <- 13 x 1
                      13    <- 13 x 10
  Product P           143

Multiplication - Binary
• Same algorithm, different digits
  Multiplicand M        1101   (13)
  Multiplier Q        x 1011   (11)
                        1101   <- 1 Q0  Partial Product PP0
                       1101    <- 1 Q1  Partial Product PP1
                      0000     <- 0 Q2  Partial Product PP2
                     1101      <- 1 Q3  Partial Product PP3
  Product P         10001111   (143)
• M bits x N bits => M+N bit product
• Binary makes it easy:
  – Bit Qi is zero => PPi is 0
  – Bit Qi is one => PPi is M (shifted i times left)
  – Product is sum of PPs

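The partial-product rule (PPi is 0 or M shifted left i times, and the product is the sum of the PPs) can be written out directly. A minimal sketch; `partial_products` is an illustrative name and the 4-bit default matches the slide's example.

```python
def partial_products(m, q, n_bits=4):
    """PPi is M shifted left i places when bit Qi is 1, otherwise 0."""
    return [(m << i) if (q >> i) & 1 else 0 for i in range(n_bits)]


pps = partial_products(0b1101, 0b1011)   # 13 x 11
product = sum(pps)                       # 13 + 26 + 0 + 104
```

Here `product` is 143, which needs 8 bits (M+N) to hold.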
Multiplication – Hardware V0a
• Array multiplier
• Stage i accumulates PPi (0 or M shifted i) depending on Qi
• Answer P comes out at bottom
• Slow! Big!
[Figure: 4x4 array multiplier with four rows of M3..M0 gated by Q0..Q3, inputs 0 0 0 0 at the top, and product bits P7..P0 at the bottom.]

Multiplication – Hardware V0b
• At each stage, shift M left (x 2)
• Next bit of Q determines whether to add in shifted multiplicand
• Accumulate 2n-bit partial product at each stage
• Each stage is identical: need only 1 stage in hardware (use multiple cycles)
[Figure: the 4x4 array redrawn with four rows of M3..M0 gated by Q0..Q3, zero inputs, and product bits P7..P0.]

Multiplication – Hardware V1
• M: 64b shift register, Q: 32b shift register, P: 64b register
• Initially, ensure high bits of M are zero (M63..M32 = 0)
• Multiplier = datapath + control
[Figure: datapath with 64-bit Multiplicand register M (Shift Left), 64-bit ALU (Add), 64-bit Product register P (Write), 32-bit Multiplier register Q (Shift Right), and Control driven by Q0.]

Multiplication – Hardware V1
Notes
• 1 clock cycle per bit (32 total)
• 0’s are left-shifted into M
  – Lower bits of P never change once formed
• Half of bits in M are always zero
  – 64-bit ALU is wasted
Observations lead to refinement:
• Right-shift P instead of left-shifting M

Multiplication Algorithm
• Russian Peasant Algorithm
    PP = M
    P = 0
    while ( Q != 0 ) {
        if ( Q is odd )      // i.e., if bit 0 of Q is ‘1’
            P = P + PP       // accumulate partial product (PP) in P
        PP = PP * 2          // shift PP left 1 position
        Q = Q / 2            // shift Q right 1 position
    }
• Compare this to the hardware just presented!
  – Each loop iteration takes one clock cycle
  – How many cycles are required?

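A runnable version of the Russian Peasant loop makes the cycle question concrete. This sketch assumes unsigned inputs and counts one cycle per loop iteration; `russian_peasant` is an illustrative name.

```python
def russian_peasant(m, q):
    """Multiply unsigned m by q with shifts and adds; also count iterations."""
    pp, p, cycles = m, 0, 0
    while q != 0:
        if q & 1:        # bit 0 of Q is 1
            p += pp      # accumulate the partial product
        pp <<= 1         # shift PP left one position
        q >>= 1          # shift Q right one position
        cycles += 1
    return p, cycles
```

The loop stops once Q is zero, so it takes as many iterations as Q has significant bits (4 for 13 x 11), whereas the hardware version always runs all 32 cycles.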
Multiplication – Hardware V2
• M: 32b register, Q: 32b shift register, P: 64b shift register
• Initially, P=0. Only high bits of P (63..32) affected by a write.
[Figure: datapath with 32-bit Multiplicand register M, 32-bit ALU (Add), 64-bit Product register P (Shift Right, Write), 32-bit Multiplier register Q (Shift Right), and Control driven by Q0.]

Multiplication – Hardware V2
• What’s really going on?
[Figure: four rows of M3..M0 gated by Q0..Q3, inputs 0 0 0 0, and product bits P7..P0, redrawn for V2.]

Multiplication – Hardware V2
Notes
• 1 clock cycle per bit (32 total)
• Lower 32 bits of P are initially unused
  – Holds zero, but unused
  – Each cycle, 1 fewer unused bit
• 0’s are right-shifted into Q
  – Initially: 32 bits used in Q
  – Each step: 1 fewer bit needed in Q
  – At end: Q is destroyed
Observations lead to refinement:
• Use lower 32 bits of P to hold Q

Multiplication – Hardware V3
• M: 32b register, P: 64b shift register (lower half represents Q)
• Initially, P=Q. Only high bits of P (63..32) are changed on write.
[Figure: datapath with 32-bit Multiplicand register M, 32-bit ALU (Add), 64-bit Product register P (Shift Right, Write) whose lower half holds the Multiplier Q, and Control driven by Q0.]

Multiplication – Hardware V3
Notes
• P has two halves: high, low
MIPS multiply instruction MultU
• 32 regular MIPS registers
• 2 special MIPS registers: HI, LO
  – Why special? Need to right-shift contents
• HI, LO store results of MultU

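The V3 datapath can be simulated in a few lines. This is a sketch under stated assumptions: P is modelled as a Python integer whose low half starts as Q, the add into the high half keeps its carry-out (Python integers do not overflow), and the two halves are returned as HI and LO. The name `multu_v3` and the parameter `n` are illustrative.

```python
def multu_v3(m, q, n=32):
    """Unsigned n-bit multiply with a single 2n-bit product register."""
    mask = (1 << n) - 1
    p = q & mask                       # initially P = {0, Q}
    for _ in range(n):                 # one clock cycle per multiplier bit
        if p & 1:                      # Q0 is the low bit of P
            p += (m & mask) << n       # write: add M into the high half
        p >>= 1                        # shift P right (carry shifts in at the top)
    return p >> n, p & mask            # (HI, LO)
```

With `n=4`, `multu_v3(13, 11, n=4)` returns `(8, 15)`, that is 8 x 16 + 15 = 143.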
Multiplication – Signed Numbers
• Gradeschool algorithm assumes unsigned numbers
  Multiplicand M        1101   (13)
  Multiplier Q        x 1011   (11)
                        1101   <- 1 Q0  Partial Product 0
                       1101    <- 1 Q1  Partial Product 1
                      0000     <- 0 Q2  Partial Product 2
                     1101      <- 1 Q3  Partial Product 3
                    10001111   (143)
• Signed numbers?
  – Read as two’s complement, the example above says (-3) * (-5) = (-113), clearly wrong!
  – Requires some adjustments

Multiplication – Signed Numbers
Two Cases For Signed Multiply: P = M*Q
• Case A: M signed, Q unsigned or Q >= 0
  – Add using sign-extension of PP
  Multiplicand M        1101   (–3)
  Multiplier Q        x 1011   (11)
                    11111101   <- 1 Q0  Partial Product 0
                    1111101    <- 1 Q1  Partial Product 1
                    000000     <- 0 Q2  Partial Product 2
                    11101      <- 1 Q3  Partial Product 3
                    11011111   (–33)

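Case A can be sketched by sign-extending M to the product width before forming the partial products. The widths follow the slide's 4-bit example; `sign_extend` and `mul_signed_m` are illustrative names, and Q is treated as unsigned.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit of a from_bits-wide pattern into the upper bits."""
    if (value >> (from_bits - 1)) & 1:
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def mul_signed_m(m, q, n=4):
    """Signed n-bit M times unsigned n-bit Q, giving a 2n-bit pattern."""
    width = 2 * n
    mask = (1 << width) - 1
    m_ext = sign_extend(m, n, width)       # e.g. 1101 -> 11111101
    product = 0
    for i in range(n):
        if (q >> i) & 1:
            product = (product + (m_ext << i)) & mask   # add sign-extended PPi
    return product
```

For M = 1101 (-3) and Q = 1011 (11) this yields 11011111, the 8-bit pattern for -33.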
Multiplication – Signed Numbers
Two Cases For Signed Multiply: P = M*Q
• Case B: M signed, Q signed and Q < 0
  – One method:
    • Note that P = M*Q = (-M)*(-Q) = (~M+1)*(~Q+1), where ~ inverts every bit
    • Now (~Q+1) is positive, follow Case A
    • How to do this in hardware?
      – Use sign bits to modify M and Q, two extra adds for +1’s
  – Alternate method: Booth encoding
    • Look it up!
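Case B can be sketched by negating both operands when Q is negative and then proceeding as in Case A. One detail here is an assumption beyond the slide: M is sign-extended to the product width before it is negated, so that the most negative M (for example 1000 = -8) negates correctly. `mul_signed` is an illustrative name.

```python
def mul_signed(m, q, n=4):
    """Signed n-bit M times signed n-bit Q, giving a 2n-bit pattern."""
    width = 2 * n
    mask_n = (1 << n) - 1
    mask_w = (1 << width) - 1
    m_ext = m & mask_n
    if (m_ext >> (n - 1)) & 1:
        m_ext |= mask_w & ~mask_n            # sign-extend M to 2n bits
    if (q >> (n - 1)) & 1:                   # Q < 0: use (-M) * (-Q)
        m_ext = (~m_ext + 1) & mask_w        # -M = ~M + 1
        q = (~q + 1) & mask_n                # -Q = ~Q + 1, now non-negative
    product = 0
    for i in range(n):                       # Case A from here on
        if (q >> i) & 1:
            product = (product + (m_ext << i)) & mask_w
    return product
```

With M = 1101 (-3) and Q = 1011 (-5) the result is 00001111 (15), the value the unsigned algorithm got wrong.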
Divide?
Forget it!
Basically, do the long division thing over multiple clock cycles:
1) Subtract divisor
2) If >= 0, put “1” in answer, do next bit
3) If < 0, put “0” in answer, add divisor back, do next bit
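The three long-division steps can be sketched as a restoring divider that handles one dividend bit per cycle. It assumes unsigned operands; `restoring_divide` is an illustrative name.

```python
def restoring_divide(dividend, divisor, n=32):
    """Unsigned restoring division: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder, quotient = 0, 0
    for i in range(n - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down next bit
        remainder -= divisor                   # 1) subtract divisor
        if remainder >= 0:
            quotient = (quotient << 1) | 1     # 2) put 1 in the answer
        else:
            remainder += divisor               # 3) put 0, add divisor back
            quotient <<= 1
    return quotient, remainder
```

Each iteration corresponds to one clock cycle, so a 32-bit divide takes 32 cycles plus the restore work.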
Floating-Point
Integers and Beyond
• Integers perfectly accurate, no error
  – 32-bit integer: -2,147,483,648 to 2,147,483,647
  – Integers “overflow” or wrap on +1 from 2,147,483,647 to -2,147,483,648
• What about numbers with non-integral parts?
  – Large range in values, possibly large number of significant digits…
  – Rationals
    • 0.5 => can represent as ½
    • 1/3 => 0.33333333333333333333333…
    • 63/127 => 0.4960629921259842519685039370…
  – Irrationals
    • sqrt(2) = 1.41421356237309504880168872420…
    • Transcendentals: pi = 3.14159265358…, e = 2.71828182…
  – Scientific
    • NA = 6.022 x 10^23, Avogadro’s number (atoms in one mole)
    • G = 6.67259 x 10^-11, gravitational constant (F = -GMm/r^2)
    • c = 2.99792458 x 10^8, speed of light (m/s)

Floating Point Numbers
• How to represent non-integral numbers in binary?
  – Many possible ways
    • E.g., store (numerator, denominator) => doesn’t work for irrationals
  – All ways have limitations
    • Cannot represent all real numbers: infinite number of them, finite number of bits!
  – Need a standard on how to interpret the bits
    • E.g., two’s complement for signed integers
  – Benefits of a standard:
    • Software portability: same answer on any machine
    • Data portability: binary data can be sent directly, no conversions
    • Numerical environment: defines level of mathematical precision, allows research into error analysis, avoids future problems

Floating Point Numbers: IEEE754
• A floating-point standard: IEEE 754
  – Standard published in 1985
    • Started in 1977
    • Primarily work of William Kahan (UofT student)
  – Based largely on development of Intel 8087
    • A floating-point processor designed to work with the 8086
  – Intel’s chip was a model to follow
    • 8087 first commercial product to implement IEEE 754
    • Other companies implemented IEEE 754, looked at Intel’s chip

Binary Representation of Fractional Numbers
• Example
  101.011 = 1*2^2 + 0*2^1 + 1*2^0 + 0*2^-1 + 1*2^-2 + 1*2^-3
          = 4 + 0 + 1 + 0 + 1/4 + 1/8
          = 5.375

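The positional expansion can be checked with a small helper that weights each fraction bit by a negative power of two. `binary_to_float` is an illustrative name, and the input is assumed to be a string of 0s and 1s with an optional binary point.

```python
def binary_to_float(text):
    """Evaluate a binary string such as '101.011' as a real number."""
    whole, _, frac = text.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    for i, bit in enumerate(frac, start=1):
        value += int(bit) * 2.0 ** -i      # bit i after the point weighs 2^-i
    return value
```

`binary_to_float("101.011")` evaluates to 5.375, matching the worked example.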
Binary Representation of Floating-Point Numbers
• Recall scientific notation:
  6.022 x 10^23
  – 6.022 is the normalized significant part
  – 10 is the base or radix
  – 23 is the exponent
• Can do the same in binary:
  1.011 x 2^3 = 1.011 x 8 = 1011 = 11 (base 10)
• Negative numbers?
  – Need to remember the sign of the significant part
• Generally:
  (–1)^S x M x b^e
  Where:
  – S is sign (0 or 1)
  – M is significand
  – b is base/radix
  – e is exponent

IEEE 754 Binary Single-Precision Floating-Point Representation
(–1)^S x M x b^e
1.011 x 2^3 => S = 0, M = 1.011, b = 2, e = 3
Encoding into bits:
  – Assume b = 2 (binary!), no need to store/remember
  – S, one bit: 0
  – M, 24 bits: 1011 0000 0000 0000 0000 0000
    • If normalized, first (leftmost) digit of M is always a ‘1’, never a ‘0’
    • Don’t store the leading ‘1’; instead define M = 1.F and store F
  – F, 23 bits: 011 0000 0000 0000 0000 0000
  – Convert e = 3 into binary (e may be negative!):
    • Use biased notation, called Excess-N
    • Excess-127 used here: define E = e + 127 = 130
  – E, 8 bits: E = 1000 0010

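The encoding steps can be reproduced and cross-checked against the machine's own single-precision format. A minimal sketch for normalized values only; `encode_single` and `decode_single` are illustrative names, and `struct` is used only to reinterpret the 32 bits as a float.

```python
import struct


def encode_single(sign, fraction_bits, exponent):
    """Pack S, E = e + 127 and F (the bits after the hidden 1) into 32 bits."""
    biased = exponent + 127                        # Excess-127
    frac = int(fraction_bits.ljust(23, "0"), 2)    # pad F on the right to 23 bits
    return (sign << 31) | (biased << 23) | frac


def decode_single(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


bits = encode_single(0, "011", 3)                  # 1.011 x 2^3
```

`bits` comes out as 0x41300000, and `decode_single(bits)` gives 11.0.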
IEEE 754 Binary Floating-Point Representation
Representation of floating point numbers in IEEE 754 standard:
Single precision, 32 bits total
  Field:  S (sign)   E (exponent)   F (significand)
  Bits:   1          8              23
  – Exponent: excess-127 binary integer
  – Significand: normalized binary significand with hidden integer bit: M = 1.F
Double precision, 64 bits total
  Field:  S (sign)   E (exponent)   F (significand)
  Bits:   1          11             52
  – Exponent: excess-1023 binary integer

IEEE 754 Precision
• Single precision
  – 9 significant decimal digits are enough to reproduce any value exactly (about 7 digits are accurate)
• Double precision
  – 17 significant decimal digits are enough to reproduce any value exactly (about 15 to 16 digits are accurate)
• Storing floating-point numbers to disk? Two options:
  – A: write binary value (32 bits or 64 bits)
    • IEEE 754 standard allows us to interchange these values!
  – B: write value as decimal digits, e.g., in ASCII
    • Need to write 9 (or 17) decimal digits
    • Need to write sign, exponent as well
    • Reading back in: convert to binary, get same binary value as before

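Option B can be tried out by printing a value with 9 (single) or 17 (double) significant digits and reading it back. A small sketch: `to_single` is an illustrative helper that rounds a Python float (a double) to single precision via `struct`.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


x = to_single(0.1)                        # a single-precision value
x_back = to_single(float(f"{x:.9g}"))     # write 9 digits, read back

y = 0.1 + 0.2                             # a double-precision value
y_back = float(f"{y:.17g}")               # write 17 digits, read back
y_short = float(f"{y:.15g}")              # 15 digits is not always enough
```

Both round trips return the original bits, while the 15-digit text for `y` reads back as a different double.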
IEEE 754 Accuracy
• Not all real values can be represented
  – Inf. # of values between ½ and ¼
  – Inf. # of values between ½ and 3/8
  – Inf. # of values between ½ and 7/16, etc.
• All floating-point numbers are approximations
  – Calculations with approximations introduce errors
  – Reduce size of errors by proper rounding
  – 754: keep extra bits of precision during calculations for rounding
  – Cannot solve all problems: algorithm numerical stability a must!
  – You get same problems on every machine using IEEE 754

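Accumulated rounding error is easy to observe: 0.1 has no exact binary representation, so adding it ten times does not give exactly 1.0. The sketch compares a naive sum with `math.fsum`, a standard-library routine that tracks the lost low-order bits; the variable names are illustrative.

```python
import math

values = [0.1] * 10
naive_total = sum(values)            # rounds after every addition
careful_total = math.fsum(values)    # compensates for intermediate rounding
```

`naive_total` is slightly below 1.0 while `careful_total` is exactly 1.0, a small instance of why the choice of algorithm matters.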
IEEE 754 Range
• See the C header <float.h> on a Unix system
• Single precision:
  – Minimum (smallest positive normalized): 1.175494351E-38 (FLT_MIN)
  – Maximum: 3.402823466E+38 (FLT_MAX)
• Double precision:
  – Minimum (smallest positive normalized): 2.2250738585072014E-308 (DBL_MIN)
  – Maximum: 1.7976931348623157E+308 (DBL_MAX)

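The range limits follow from the bit layout and can be reconstructed without the header. A sketch assuming the standard single-precision patterns 0x00800000 (smallest normal) and 0x7F7FFFFF (largest finite); `bits_to_single` is an illustrative helper, and Python's `sys.float_info` reports the double-precision limits.

```python
import struct
import sys


def bits_to_single(bits):
    """Reinterpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


flt_min = bits_to_single(0x00800000)   # E = 1, F = 0
flt_max = bits_to_single(0x7F7FFFFF)   # E = 254, F = all ones
dbl_min = sys.float_info.min
dbl_max = sys.float_info.max
```

`flt_min` equals 2^-126 and `flt_max` equals (2 - 2^-23) x 2^127, which print as the FLT_MIN and FLT_MAX values listed on the slide.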
IEEE 754 Range
• What happens on overflow?
  – Depends; 754 standard defines some special cases
  – Normally, you get value called Infinity
• Tiny numbers? (smaller than smallest normal)
  – Goes to zero? Called underflow
    • This is rather drastic
  – 754 standard defines denormalized numbers
    • Underflow occurs gradually…
    • Underflow/denormals are hard to design hardware for
      – Not all chips support it
    • Often use software interrupts to handle denormals

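Gradual underflow can be observed with double-precision values on a machine that supports denormals, which is assumed here. Halving the smallest normal number gives a denormal rather than zero, and only below the smallest denormal (2^-1074) does the result become zero. The variable names are illustrative.

```python
import sys

smallest_normal = sys.float_info.min        # 2^-1022
half_normal = smallest_normal / 2           # denormal: still non-zero
smallest_denormal = 2.0 ** -1074            # last value before zero
below_denormal = smallest_denormal / 2      # underflows to zero
```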
IEEE 754 Special Cases
• Infinity, -Infinity
  – Caused by overflow
  – Caused by 1/0 (note: no error produced)
• +0, -0
  – This is claimed to be useful!
• NaN
  – “Not a Number”
  – 0/0
  – Infinity/Infinity, 0*Infinity, Infinity–Infinity, etc.
  – Sqrt(-number)
  – Infectious: NaN + number = NaN, NaN x number = NaN, etc.
• Comparisons (<, >, =, etc.) with Infinity? NaN?
  – Cases all defined by the standard

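Most of the special cases can be observed from Python, whose floats are IEEE 754 doubles. One caveat: Python raises an exception for `1.0 / 0.0` and for `math.sqrt(-1.0)` instead of returning Infinity or NaN, so those two cases are left out of this sketch. The variable names are illustrative.

```python
import math

inf = float("inf")
nan = float("nan")

overflow_result = 1e308 * 10                       # overflow gives Infinity
nan_results = [inf - inf, inf / inf, 0.0 * inf]    # each produces NaN
nan_is_infectious = math.isnan(nan + 1.0) and math.isnan(nan * 2.0)
nan_equals_itself = (nan == nan)                   # False: NaN compares unequal
zeros_compare_equal = (0.0 == -0.0)                # True, yet the sign is kept
sign_of_negative_zero = math.copysign(1.0, -0.0)   # -1.0
```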
IEEE 754 Hardware
1. Compare exponents
2. Shift smaller number right
3. Add
4. Normalize
5. Round
[Figure: floating-point adder datapath with two Sign/Exponent/Fraction inputs, a Big ALU, and a Sign/Exponent/Fraction result.]

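The five steps can be sketched for the simplest case: two positive, normalized values. This is a toy model, not the hardware on the slide. A value is held as (e, m), where m is a 24-bit integer significand with its top bit set and the value is m x 2^(e-23). Three extra low-order bits (guard, round, sticky) and round-to-nearest-even are assumptions about how the rounding is done; `fp_add` is an illustrative name.

```python
def fp_add(ea, ma, eb, mb, p=24):
    """Add two positive normalized values given as (exponent, p-bit significand)."""
    # 1. Compare exponents: make (ea, ma) the operand with the larger exponent.
    if ea < eb:
        ea, ma, eb, mb = eb, mb, ea, ma
    shift = ea - eb
    ma <<= 3                                   # room for guard, round, sticky bits
    mb <<= 3
    # 2. Shift the smaller number right; OR any lost bits into the sticky bit.
    if shift:
        lost = mb & ((1 << shift) - 1)
        mb = (mb >> shift) | (1 if lost else 0)
    # 3. Add the aligned significands.
    total = ma + mb
    e = ea
    # 4. Normalize: a carry out of the top bit means shift right once.
    if total >> (p + 3):
        total = (total >> 1) | (total & 1)
        e += 1
    # 5. Round to nearest, ties to even, using the three extra bits.
    low, m = total & 7, total >> 3
    if low > 4 or (low == 4 and (m & 1)):
        m += 1
        if m >> p:                             # rounding carried out: renormalize
            m >>= 1
            e += 1
    return e, m
```

For example, 1.5 is (0, 0xC00000) and 2.25 is (1, 0x900000); their sum comes back as (1, 0xF00000), which is 1.111 x 2^1 = 3.75.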
Shifters
• Left as an exercise...

BIG PICTURE
• This course is about computer architecture
• Why care about ALU design details?
  – Our goal is performance
  – Some ALU designs may be faster or slower
• You must understand the impact they have on
  – Clock frequency (cycle time)
  – Instruction set design
  – More advanced things (e.g., impact of multiple ALUs)
  – Ultimately, performance!