
CSE 141 – Computer Architecture
Summer Session 1 2004

Lecture 3
ALU Part 2

Single Cycle CPU Part 1

Pramod V. Argade

Pramod Argade UCSD CSE 141, Fall 2003

Announcements

Reading Assignment
– Chapter 5: The Processor: Datapath and Control, Sec. 5.3 - 5.4

Homework 3: Due Mon., July 12 in class
– 4.14c, 4.27, 4.28, 4.31
– Multiply (-6 * 7) using Booth's algorithm, with 4-bit 2's complement representation for the operands
– 5.5, 5.6, 5.8, 5.9, 5.10

Quiz 3
– When: Mon., July 12, first 10 minutes of the class
– Topic: ALU, Chapter 4
– Need: Paper, pen, calculator


CSE141 Course Schedule

Lecture # | Date | Time | Room | Topic | Quiz topic | Homework Due
1 | Mon. 6/28 | 6 - 8:50 PM | Center 109 | Introduction, Ch. 1; ISA, Ch. 3 | - | -
2 | Wed. 6/30 | 6 - 8:50 PM | Center 109 | Performance, Ch. 2; Arithmetic, Ch. 4 | ISA, Ch. 3 | #1
- | Mon. 7/5 | No Class | | July 4th Holiday | - | -
3 | Wed. 7/7 | 6 - 8:50 PM | Center 109 | Arithmetic, Ch. 4 Cont.; Single-cycle CPU, Ch. 5 | Performance, Ch. 2 | #2
4 | Mon. 7/12 | 6 - 8:50 PM | Center 109 | Single-cycle CPU, Ch. 5 Cont.; Multi-cycle CPU, Ch. 5 | Arithmetic, Ch. 4 | #3
5 | Tue. 7/13 | 7:30 - 8:50 PM | Center 109 | Multi-cycle CPU, Ch. 5 Cont. (July 5th make up class) | - | -
6 | Wed. 7/14 | 6 - 8:50 PM | Center 109 | Single and Multicycle CPU Examples and Review for Midterm | Single-cycle CPU, Ch. 5 | -
7 | Mon. 7/19 | 6 - 8:50 PM | Center 109 | Mid-term Exam; Exceptions | - | #4
8 | Tue. 7/20 | 7:30 - 8:50 PM | Center 109 | Pipelining, Ch. 6 (July 5th make up class) | - | -
9 | Wed. 7/21 | 6 - 8:50 PM | Center 109 | Hazards, Ch. 6 | - | -
10 | Mon. 7/26 | 6 - 8:50 PM | Center 109 | Memory Hierarchy & Caches, Ch. 7 | Hazards, Ch. 6 | #5
11 | Wed. 7/28 | 6 - 8:50 PM | Center 109 | Virtual Memory, Ch. 7; Course Review | Cache, Ch. 7 | #6
12 | Sat. 7/31 | 7 - 10 PM | Center 109 | Final Exam | - | -


SLT: Set-on-less-than Logic

SLT $1, $2, $3
– if ($2 < $3) $1 = 1; else $1 = 0;

To test A < B, do a subtraction (A - B)
– (A < B) if (A - B) < 0, i.e. negative

Use sign bit
– Route the sign bit to bit 0 of result
– Set bits 1 - 31 to zero

There is a complication due to overflow
– Work out solution in Homework problem 4.23


Set if Less Than

[Figure: (a) a 1-bit ALU with inputs a, b, CarryIn, Binvert, Operation, and Less, and outputs Result and CarryOut; (b) the 1-bit ALU for the most significant bit, which adds a Set output and overflow detection with an Overflow output. Alongside, the 32-bit ALU built from ALU0 to ALU31: each CarryOut feeds the next CarryIn, the Set output of ALU31 is routed to the Less input of ALU0, and the Less inputs of ALU1 to ALU31 are 0.]

SLT $m, $n, $p
if ($n < $p) {
    $m = 1;
} else {
    $m = 0;
}

$n < $p  <=>  ($n - $p) < 0
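The sign-bit approach and its overflow complication can be sketched in Python. This is a minimal sketch, assuming 32-bit two's complement bit patterns; the function name `slt` and the XOR-with-overflow correction are illustrative choices, not taken from the slides or from the homework solution.

```python
MASK32 = 0xFFFFFFFF

def slt(a, b):
    """Set-on-less-than for two 32-bit two's complement bit patterns."""
    diff = (a - b) & MASK32          # the ALU forms a - b as a + ~b + 1
    sign_a = (a >> 31) & 1
    sign_b = (b >> 31) & 1
    sign_d = (diff >> 31) & 1
    # Subtraction overflows when the operands have different signs and
    # the sign of the difference differs from the sign of a.
    overflow = (sign_a != sign_b) and (sign_d != sign_a)
    # Without overflow the sign bit of the difference is the answer;
    # with overflow the sign bit is wrong, so it is inverted.
    return sign_d ^ int(overflow)

print(slt(2, 3))             # 2 < 3
print(slt(0x80000000, 1))    # most negative value compared with 1
```

The second call is the interesting case: the difference overflows, so the raw sign bit alone would give the wrong answer.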


Complete 32-bit ALU from last lecture

[Figure: 32-bit ALU built from 1-bit ALUs ALU0 to ALU31, with inputs a0 to a31, b0 to b31, Bnegate, and Operation, and outputs Result0 to Result31, Zero, Set, and Overflow. Each CarryOut feeds the next CarryIn; the Less inputs of ALU1 to ALU31 are 0.]

Functionality provided
• Arithmetic Operations: ADD, SUB
• Logical Operations: AND, OR
• Compare: SLT
• Support for branch: BEQ, BNE
• Exception detection: Overflow

What is missing?
• Signed multiply
• Unsigned multiply
• Signed division
• Unsigned division


Grade school Multiplication algorithm

• In general (ignoring sign bits):
  – m bits x n bits = (m+n) bit product
• Binary makes it easy:
  – 0 => place 0 (0 x multiplicand)
  – 1 => place multiplicand (1 x multiplicand)
• Paper and pencil example of binary multiplication (8 * 10 = 80, 0x8 * 0xa = 0x50):

         1000    (multiplicand)
     x   1010    (multiplier)
         0000
        1000x
       0000xx
      1000xxx
      1010000    (Result)


Observations about Multiplication

More complicated than addition
Simple algorithm:
– Accomplished via shift and add
More time delay and more gates (=> silicon area)
Let's look at 3 versions based on the grade school algorithm


Multiplication: First Version

[Figure: multiplier hardware with a 64-bit Multiplicand register (shift left), a 64-bit ALU, a 64-bit Product register (write), a 32-bit Multiplier register (shift right), and a control test block.]

Algorithm (flowchart):
Start
1. Test Multiplier0
   – Multiplier0 = 1: 1a. Add multiplicand to product and place the result in Product register
   – Multiplier0 = 0: go to step 2
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition?
   – No (< 32 repetitions): go back to step 1
   – Yes (32 repetitions): Done

Initialization:
• Load 32-bit multiplicand and zero extend to 64 bits
• Load 64-bit product register with zero

Need a state machine to control operation

32 iterations are required
• Each iteration takes 3 clocks
• Total 96 + 3 = 99 clocks

Observations:
• 32 bits in multiplicand are always zero
• 64-bit ALU is unnecessary
• Left-shifted multiplicand does not affect lower bits of the product
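The first-version flowchart can be traced in software. The following is a minimal sketch, assuming unsigned operands; the function name `multiply_v1` and the `bits` parameter are illustrative and not part of the slides.

```python
def multiply_v1(multiplicand, multiplier, bits=32):
    """Shift-and-add multiply of two unsigned integers (first version)."""
    narrow_mask = (1 << bits) - 1
    wide_mask = (1 << (2 * bits)) - 1
    mcand = multiplicand & narrow_mask   # zero-extended into a 2*bits register
    mplier = multiplier & narrow_mask
    product = 0                          # 2*bits product register starts at zero
    for _ in range(bits):
        if mplier & 1:                               # 1.  test Multiplier0
            product = (product + mcand) & wide_mask  # 1a. add multiplicand
        mcand = (mcand << 1) & wide_mask             # 2.  shift multiplicand left
        mplier >>= 1                                 # 3.  shift multiplier right
    return product

print(multiply_v1(0b1000, 0b1010, bits=4))   # the paper-and-pencil example, 8 * 10
```

With `bits=4` the loop reproduces the four partial products of the paper-and-pencil example.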


Multiplication: Second Version

[Figure: multiplier hardware with a 32-bit Multiplicand register, a 32-bit ALU, a 64-bit Product register (shift right, write), a 32-bit Multiplier register (shift right), and a control test block.]

Algorithm (flowchart):
Start
1. Test Multiplier0
   – Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
   – Multiplier0 = 0: go to step 2
2. Shift the Product register right 1 bit
3. Shift the Multiplier register right 1 bit
32nd repetition?
   – No: go back to step 1
   – Yes (32 repetitions): Done

Initialization:
• Load 32-bit multiplicand to 32-bit register
• Load 64-bit product register with zero

Need a state machine to control operation

Observations:
• 32 iterations are required
• Each iteration takes 3 clocks
• Total 96 + 3 = 99 clocks
• 32-bit ALU is used


Multiplication: Third Version

[Figure: multiplier hardware with a 32-bit Multiplicand register, a 32-bit ALU, a 64-bit Product register (shift right, write), and a control test block. There is no separate Multiplier register.]

Algorithm (flowchart):
Start
1. Test Product0
   – Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register
   – Product0 = 0: go to step 2
2. Shift the Product register right 1 bit
32nd repetition?
   – No: go back to step 1
   – Yes (32 repetitions): Done

Initialization:
• Load 32-bit multiplicand to 32-bit register
• Load upper 32 bits of product register with zero
• Load lower 32 bits of product register with multiplier

Need a state machine to control operation

Observations:
• 32 iterations are required
• Each iteration takes 2 clocks
• Total 64 + 3 = 67 clocks
• 32-bit ALU is used
• 64-bit Product register holds both Product and Multiplier
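The third version, where the multiplier shares the Product register, can be sketched the same way. This is an illustrative sketch for unsigned operands; the name `multiply_v3` is hypothetical, and the carry out of the 32-bit add is modeled by letting the Python integer grow by one bit before the right shift.

```python
def multiply_v3(multiplicand, multiplier, bits=32):
    """Unsigned multiply with the multiplier held in the low half of Product."""
    mask = (1 << bits) - 1
    mcand = multiplicand & mask
    product = multiplier & mask          # upper half = 0, lower half = multiplier
    for _ in range(bits):
        if product & 1:                  # 1. test Product0
            # 1a. add the multiplicand to the left (upper) half; the sum may
            #     carry out into one extra bit, which the shift brings back in
            upper = (product >> bits) + mcand
            product = (upper << bits) | (product & mask)
        product >>= 1                    # 2. shift Product right 1 bit
    return product

print(multiply_v3(8, 10, bits=4))
```

Each right shift discards a multiplier bit that has already been tested, which is why the register can be shared.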


Multiplying Signed Numbers

Convert all operands to positive
Determine sign of the product
– Sign of the product = sign(op1) ^ sign(op2)
Multiply positive operands (only 31 bits)
If the sign of the result is negative, negate the result
Adds extra logic and delay to multiply

Is there a better way?
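The sign-and-magnitude approach described on this slide can be sketched as follows. This is a minimal illustration using Python integers; the function name `multiply_signed` is hypothetical, and register widths are not modeled.

```python
def multiply_signed(op1, op2):
    """Multiply signed integers by multiplying magnitudes and fixing the sign."""
    negative = (op1 < 0) ^ (op2 < 0)     # sign(op1) ^ sign(op2)
    a, b = abs(op1), abs(op2)            # convert operands to positive
    product = 0
    while b:                             # unsigned shift-and-add on magnitudes
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
    return -product if negative else product   # negate if the sign is negative

print(multiply_signed(-6, 7))
```

The conversions before the loop and the conditional negation after it are the extra logic and delay the slide refers to.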


Booth’s Algorithm

An elegant approach to multiplying signed numbers
With the ability to add, subtract and shift
– There are multiple ways to do multiply

Consider signed operands A and B:

$$A = (A_{31} \times -2^{31}) + (A_{30} \times 2^{30}) + (A_{29} \times 2^{29}) + \dots + (A_1 \times 2^{1}) + (A_0 \times 2^{0})$$

$$= (-A_{31} \times 2^{31}) + (2A_{30} - A_{30})\,2^{30} + (2A_{29} - A_{29})\,2^{29} + \dots + (2A_0 - A_0)\,2^{0}$$

$$= (A_{30} - A_{31})\,2^{31} + (A_{29} - A_{30})\,2^{30} + \dots + (A_0 - A_1)\,2^{1} + (A_{-1} - A_0)\,2^{0}, \quad A_{-1} = 0$$

$$A \times B = (A_{30} - A_{31})\,2^{31} B + (A_{29} - A_{30})\,2^{30} B + \dots + (A_0 - A_1)\,2^{1} B + (A_{-1} - A_0)\,2^{0} B$$

Recipe: Evaluate $(A_{i-1} - A_i)$
– 0: Do nothing
– 1: Add B
– -1: Subtract B


Booth's algorithm: Signed multiplication

Current Bit | Bit to the Right | Explanation | Example | Op
1 | 0 | Begins run of 1s | 0001111000 | sub
1 | 1 | Middle of run of 1s | 0001111000 | none
0 | 1 | End of run of 1s | 0001111000 | add
0 | 0 | Middle of run of 0s | 0001111000 | none

Originally for speed (when shift was faster than add)
• Replace a string of 1s in the multiplier with an initial subtract when we first see a one, and then later an add for the bit after the last one
• Potential speedup from recognizing that the middle of a string of 0s or 1s requires no operation

[Figure: the bit string 0 1 1 1 1 0 with its run of 1s labeled "beginning of run", "middle of run", and "end of run".]

$$A \times B = (A_{30} - A_{31})\,2^{31} B + (A_{29} - A_{30})\,2^{30} B + \dots + (A_0 - A_1)\,2^{1} B + (A_{-1} - A_0)\,2^{0} B$$


Booth’s Algorithm

• Example: Use Booth’s Algorithm for the following multiplication
  2 * (-6) = 0010 * 1010 = -12 = 1111 0100

• Recipe for A*B:
  – Append $A_{-1} = 0$ to the right of A
  – Evaluate $(A_{i-1} - A_i)$
    0: Do nothing
    1: Add B
    -1: Subtract B
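The recipe can be sketched in Python as a direct evaluation of the sum of $(A_{i-1} - A_i)\,2^i B$ terms. This is an illustrative sketch, not the register-level hardware; the name `booth_multiply` and the `bits` parameter are assumptions, and B is kept as an ordinary signed Python integer.

```python
def booth_multiply(a, b, bits=4):
    """Return a * b for a signed `bits`-wide multiplier a, using Booth recoding."""
    a_bits = a & ((1 << bits) - 1)   # two's complement pattern of A
    product = 0
    prev = 0                         # A(-1) = 0
    for i in range(bits):
        cur = (a_bits >> i) & 1      # A(i)
        diff = prev - cur            # (A(i-1) - A(i)) is -1, 0, or +1
        if diff == 1:
            product += b << i        # end of a run of 1s: add B * 2^i
        elif diff == -1:
            product -= b << i        # start of a run of 1s: subtract B * 2^i
        prev = cur
    return product

result = booth_multiply(2, -6)
print(result, format(result & 0xFF, "08b"))   # value and its 8-bit pattern
```

For A = 0010 the only non-zero terms are a subtract at bit 1 and an add at bit 2, which matches the table of bit pairs on the previous slide.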


Division

                 1001       Quotient
  Divisor 1000 | 1001010    Dividend
                –1000
                    10
                    101
                    1010
                   –1000
                      10    Remainder (or Modulo result)

See how big a number can be subtracted, creating a quotient bit on each step
Binary => 1 * divisor or 0 * divisor

Dividend = Quotient x Divisor + Remainder
=> sizeof( Dividend ) = sizeof( Quotient ) + sizeof( Divisor )

3 versions of divide, successive refinement


Division 1.0

• Initialization:
  – 32-bit quotient register = 0, 64-bit remainder = dividend
  – 64-bit Divisor = (32-bit divisor


Division 1.0

Algorithm (flowchart):
Start
1. Subtract the Divisor register from the Remainder register, and place the result in the Remainder register.
Test Remainder
   – Remainder >= 0: 2a. Shift the Quotient register to the left, setting the new rightmost bit to 1.
   – Remainder < 0: 2b. Restore the original value by adding the Divisor register to the Remainder register, and place the sum in the Remainder register. Also shift the Quotient register to the left, setting the new least significant bit to 0.
3. Shift the Divisor register right 1 bit.
33rd repetition?
   – No (< 33 repetitions): go back to step 1
   – Yes (33 repetitions): Done
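The restoring-division flowchart can be traced in software. This is a minimal sketch for unsigned operands. It assumes the Remainder register starts with the dividend and the divisor starts in the upper half of the 64-bit Divisor register, which is the usual textbook setup rather than something fully legible in the slide; the name `divide_v1` is hypothetical.

```python
def divide_v1(dividend, divisor, bits=32):
    """Restoring division of unsigned integers; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = dividend             # assumed: Remainder register holds the dividend
    div = divisor << bits            # assumed: divisor starts in the upper half
    quotient = 0
    for _ in range(bits + 1):        # 33 repetitions for 32-bit operands
        remainder -= div             # 1.  subtract Divisor from Remainder
        if remainder >= 0:
            quotient = (quotient << 1) | 1   # 2a. shift in a 1
        else:
            remainder += div                 # 2b. restore, then shift in a 0
            quotient = quotient << 1
        div >>= 1                    # 3.  shift Divisor right 1 bit
    return quotient & ((1 << bits) - 1), remainder

print(divide_v1(0b1001010, 0b1000))   # the paper example: 74 / 8
```

The first repetition always produces a 0 quotient bit, which is why 33 repetitions still yield a 32-bit quotient.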


Divide Algorithm

Optimizations similar to those for the multiply algorithm can be done
– 32-bit Divisor register
– 32-bit ALU
– Quotient bits are left shifted into the remainder register

In case the result of subtraction is negative, the remainder register has to be restored
– Takes one extra clock cycle

Non-restoring divide algorithm removes this step

Divide overflow case
– 0x80000000 / -1


Floating Point: Introduction

We need a way to represent real numbers
– Numbers with fractions, e.g., 3.14159265… (recognize me?)
– Very small numbers, e.g., 0.0000000000000000000000013621
– Very large numbers, e.g., 9,349,398,989,787,762,244,859,087,678

Binary Fractions:

$$1011_2 = 1 \times 2^{3} + 0 \times 2^{2} + 1 \times 2^{1} + 1 \times 2^{0}$$

so...

$$101.011_2 = 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 0 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3}$$

e.g.,

$$0.75 = 0.5 + 0.25 = 1/2 + 1/4 = 0.11_2$$
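The positional expansion above can be checked with a short Python sketch. The helper name `binary_to_decimal` is hypothetical; it simply sums each bit times its power of two.

```python
def binary_to_decimal(text):
    """Evaluate a binary numeral such as '101.011' as bit * 2**position."""
    whole, _, frac = text.partition(".")
    value = 0.0
    for i, bit in enumerate(reversed(whole)):   # positions 0, 1, 2, ... leftwards
        value += int(bit) * 2 ** i
    for i, bit in enumerate(frac, start=1):     # positions -1, -2, ... rightwards
        value += int(bit) * 2 ** -i
    return value

for numeral in ("1011", "101.011", ".11"):
    print(numeral, "=", binary_to_decimal(numeral))
```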


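Recall Scientific Notation

$$6.02 \times 10^{23}$$

[Figure: the number above with its parts labeled: mantissa (6.02), decimal point, radix or base (10), and exponent (23).]

IEEE Single Precision F.P.: $\pm 1.M \times 2^{e - 127}$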


IEEE 754 Single-precision Floating-Point

$$N = (-1)^{S} \times (1.M) \times 2^{E - 127}$$

Field | S | E | M
Width (bits) | 1 | 8 | 23
Total 32 bits

– S: sign
– E: exponent, excess-127 binary integer
– M: mantissa, normalized binary significand with hidden integer bit: 1.M

• Example: Convert -325.75 to IEEE Single Precision Floating Point Representation
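The example can be worked by hand and then checked in Python. By hand: 325.75 = 101000101.11 in binary = 1.0100010111 x 2^8, so S = 1, E = 8 + 127 = 135, and M is 0100010111 followed by zeros. The sketch below uses the standard `struct` module to extract the same fields; the helper name `single_fields` is hypothetical.

```python
import struct

def single_fields(value):
    """Return the (S, E, M) fields of the IEEE 754 single-precision encoding."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31                # 1 bit
    exponent = (word >> 23) & 0xFF   # 8 bits, excess 127
    mantissa = word & 0x7FFFFF       # 23 bits, hidden leading 1 not stored
    return sign, exponent, mantissa

s, e, m = single_fields(-325.75)
print(s, format(e, "08b"), format(m, "023b"))
# prints: 1 10000111 01000101110000000000000
```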


IEEE 754 Double-precision Floating-Point

$$N = (-1)^{S} \times (1.M) \times 2^{E - 1023}$$

Field | S | E | M (first word) | M (second word)
Width (bits) | 1 | 11 | 20 | 32
Total 64 bits

– S: sign
– E: exponent, excess-1023 binary integer
– M: mantissa, normalized binary significand with hidden integer bit: 1.M

• Example: Convert -325.75 to IEEE Double Precision Floating Point Representation
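For double precision the same value keeps the significand 1.0100010111 but uses E = 8 + 1023 = 1031. The sketch below, again using the standard `struct` module, also shows the 20 + 32 bit split of the mantissa across the two 32-bit words; the helper name `double_fields` is hypothetical.

```python
import struct

def double_fields(value):
    """Return (S, E, M_high20, M_low32) of the IEEE 754 double-precision encoding."""
    (word,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = word >> 63                       # 1 bit
    exponent = (word >> 52) & 0x7FF         # 11 bits, excess 1023
    mantissa = word & ((1 << 52) - 1)       # 52 bits in total
    return sign, exponent, mantissa >> 32, mantissa & 0xFFFFFFFF

s, e, m_high, m_low = double_fields(-325.75)
print(s, format(e, "011b"), format(m_high, "020b"), format(m_low, "032b"))
```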


IEEE 754 Single Precision FP

If E=255 and F is nonzero, then V=NaN ("Not a number")
If E=255 and F is zero and S is 1, then V=-Infinity
If E=255 and F is zero and S is 0, then V=Infinity
If 0


Floating Point Addition

Algorithm (flowchart):
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
Overflow or underflow?
   – Yes: Exception
   – No: go to step 4
4. Round the significand to the appropriate number of bits
Still normalized?
   – No: go back to step 3
   – Yes: Done


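Floating Point Addition

[Figure: floating point adder hardware. The Sign, Exponent, and Significand fields of both operands feed a small ALU that computes the exponent difference (compare exponents), a right shifter for the smaller number, a big ALU that adds the significands, a left/right shifter with exponent increment or decrement (normalize), and rounding hardware (round). Multiplexers and a control block steer the data; the output is Sign, Exponent, Significand.]

Example: 0.5 + (-0.4375)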
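The four steps can be traced on this example with a toy model. This is a sketch under stated assumptions: numbers are (signed significand, exponent) pairs with 3 fraction bits, so 0.5 is 1.000 x 2^-1 and -0.4375 is -1.110 x 2^-2; no guard or round bits are kept, so bits shifted out are simply truncated. The names `fp_add` and `fp_value` are hypothetical.

```python
def fp_add(x, y, frac_bits=3):
    """Add two (signed significand, exponent) pairs.

    A pair (s, e) stands for s * 2**(e - frac_bits); a normalized
    significand satisfies 2**frac_bits <= |s| < 2**(frac_bits + 1).
    """
    (sx, ex), (sy, ey) = x, y
    # 1. Compare exponents; shift the smaller number right to match the larger.
    if ex < ey:
        (sx, ex), (sy, ey) = (sy, ey), (sx, ex)
    sy = (abs(sy) >> (ex - ey)) * (1 if sy >= 0 else -1)
    # 2. Add the significands.
    total, exp = sx + sy, ex
    if total == 0:
        return 0, 0
    # 3. Normalize: shift right and increment, or shift left and decrement.
    one = 1 << frac_bits
    while abs(total) >= 2 * one:
        total = (abs(total) >> 1) * (1 if total > 0 else -1)
        exp += 1
    while abs(total) < one:
        total *= 2
        exp -= 1
    # 4. Round: this toy model keeps no extra bits, so there is nothing to round.
    return total, exp

def fp_value(pair, frac_bits=3):
    """Decode a (significand, exponent) pair to a Python float."""
    s, e = pair
    return s * 2.0 ** (e - frac_bits)

half = (0b1000, -1)        #  1.000 x 2^-1 =  0.5
other = (-0b1110, -2)      # -1.110 x 2^-2 = -0.4375
result = fp_add(half, other)
print(result, fp_value(result))
```

Here the aligned sum is 0.001 x 2^-1, which normalizes to 1.000 x 2^-4 = 0.0625.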


IEEE 754 Floating Point

Increasing the size of the significand enhances accuracy
Increasing the size of the exponent increases the range of numbers that can be represented
Overflow or underflow can happen
Can do integer compare for greater-than, sign

Single Precision
– Range of about $2 \times 10^{-38}$ to $2 \times 10^{38}$

Double Precision
– Range of about $2 \times 10^{-308}$ to $2 \times 10^{308}$

Infinite variety of real numbers exist between, say, 0 and 1
– Not more than $2^{53}$ can be represented exactly in double precision


Floating Point Complexities

Operations are somewhat more complicated

In addition to overflow we can have “underflow”

Accuracy can be a big problem
– IEEE 754 keeps two extra bits, guard and round
– four rounding modes
– positive divided by zero yields “infinity”
– zero divided by zero yields “not a number”

Implementing the standard can be tricky
Not using the standard can be even worse
– See text for description of 80x86 and Pentium bug!


Summary

• Multiplication and division take much longer than addition, requiring multiple addition steps.

• Floating Point extends the range of numbers that can be represented, at the expense of precision (accuracy).

• FP operations are very similar to integer, but with pre- and post-processing.






CSE 141 – Computer Architecture
Fall 2003

Lecture 3
The Processor: Datapath and Control

Pramod V. Argade


Datapath and Control Design

The Five Classic Components of a Computer

[Figure: block diagram of a computer. The Processor contains Control and Datapath; the other components are Memory, Input, and Output.]


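Single Cycle Implementation Datapath and Control

[Figure: instruction execution steps arranged in a loop: Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction. Below, a timeline shows I. Fetch, Decode, Op. Fetch, Execute, Store, and Next PC all fitting within one Clock Cycle.]

Clock Cycle: Complete Execution of a Single Instruction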


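Abstract / Simplified View: Datapath

[Figure: simplified datapath. The PC supplies the Address to the Instruction memory; the fetched Instruction supplies three Register # inputs to the Registers; register Data feeds the ALU; the ALU output is the Address for the Data memory, whose Data is written back to the Registers.]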


Two Types of Logic Components

Combinational
– Elements that operate on data values
– Produces same output if given same inputs

State Elements
– contain internal storage
– state elements can be read at any time
– clock is used to determine when a state element should be written

[Figure: a Combinational Logic block with inputs A and B and output C = f(A,B); a State Element block with inputs A, B, and clk and output C = f(A,B,state).]


Clock

Clock is a free running signal
– Fixed cycle time (period)
– Frequency = 1/(cycle time)
– Duty Cycle: (% high)/(% low), e.g. 50/50 Duty Cycle below
– Jitter: Uncertainty in rising or falling edge

[Figure: clock waveform with the Clock Cycle (Period), Rising Edge, and Falling Edge labeled.]


Edge-triggered Clocking

Values stored in the machine are updated on a clock edge
– The clock edge can be either rising or falling

By default a state element is written every clock edge
– An explicit write control signal is required otherwise.

Edge triggered methodology allows, in the same clock cycle, to:
– read the contents of a register
– send the value through some combinational logic, and
– write the contents of the same or another register

Possible to have the same state element as input and output

[Figure: within one clock cycle, State element 1 feeds Combinational logic, which feeds State element 2. A second diagram shows State element 1 feeding Combinational logic whose output is written back to State element 1.]


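Storage Elements

D Latch
• Two inputs:
  – the data value to be stored (D)
  – the clock signal (C) indicating when to read & store D
• Two outputs:
  – the value of the internal state (Q) and its complement (_Q)

[Figure: D latch schematic with inputs D and C and outputs Q and _Q, with a timing diagram of D, C, and Q.]

Falling edge triggered D flip-flop
• Output changes only on the clock edge

[Figure: D flip-flop built from two D latches in series sharing the clock C, with outputs Q and _Q, and a timing diagram of D, C, and Q.]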


CPU: Clocking

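[Figure: timing diagram of Clk showing Setup and Hold windows around each clock edge with Don't Care regions in between; two storage elements are driven by the same CLK.]

All storage elements are clocked by the same clock edge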


Register: A Storage Element

– Similar to the D Flip Flop except
  • N-bit input and output
  • Write Enable input
– Write Enable:
  • 0: Data Out will not change
  • 1: Data Out will become Data In (on the clock edge)

[Figure: register block with N-bit Data In, N-bit Data Out, Write Enable, and Clk.]


Register File

Register File consists of (32) registers:
– Two 32-bit output busses: busA and busB
– One 32-bit input bus: busW

Register is selected by:
– RA selects the register to put on busA
– RB selects the register to put on busB
– RW selects the register to be written via busW when Write Enable is 1

Clock input (CLK)

[Figure: block of 32 32-bit Registers with 5-bit select inputs RW, RA, and RB, 32-bit input busW, 32-bit outputs busA and busB, Write Enable, and Clk.]
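The register file interface can be modeled behaviorally. This is a minimal sketch, assuming reads are combinational and writes take effect only when a clock edge arrives with Write Enable set; the class and method names are hypothetical.

```python
class RegisterFile:
    """Behavioral model of a 32 x 32-bit register file."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra, rb):
        """Combinational read: returns (busA, busB) for selects RA and RB."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw, bus_w, write_enable):
        """On a clock edge, write busW into register RW if Write Enable is 1."""
        if write_enable:
            self.regs[rw] = bus_w & 0xFFFFFFFF   # keep values 32 bits wide

rf = RegisterFile()
rf.clock_edge(rw=5, bus_w=123, write_enable=1)
rf.clock_edge(rw=5, bus_w=999, write_enable=0)   # ignored: Write Enable is 0
print(rf.read(5, 0))
```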


Memory

Memory
– One input bus: Data In
– One output bus: Data Out

Memory word is selected by:
– Address selects the word to put on Data Out
– Write Enable = 1: address selects the memory word to be written via the Data In bus

Clock input (CLK)
– The CLK input is a factor ONLY during write operation
– During read operation, behaves as a combinational logic block: Address valid => Data Out valid after “access time.”

[Figure: memory block with 32-bit Data In, 32-bit Data Out, Address, Write Enable, and Clk.]


Basic 4 x 2 Static RAM

[Figure: a 4 x 2 static RAM built from eight D latches (each with D, C, Enable, and Q). A 2-to-4 decoder driven by Address selects one of rows 0 to 3; the other inputs are Write enable, Din[1], and Din[0], and the outputs are Dout[1] and Dout[0].]


A Simple Implementation of MIPS CPU

Simplified to contain only:
– Memory-reference instructions: lw, sw
– Arithmetic-logical instructions: add, sub, and, or, slt
– Control flow instructions: beq, j

Execution Time = Instructions * CPI * Cycle Time

Processor design (datapath and control) will determine:
– Clock cycle time
– Clock cycles per instruction

We will design a single cycle processor:
– Advantage: One clock cycle per instruction
– Disadvantage: long cycle time


Arithmetic Instructions (R-Type)

ADD, SUB, AND, OR, SLT

Example: add rd, rs, rt

e.g. add $t3, $s0, $s5
REG[$t3] = REG[$s0] + REG[$s5]

Field | op | rs | rt | rd | shamt | funct
Bits | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0
Width | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
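The field layout can be exercised by packing the example instruction into a 32-bit word. This is an illustrative sketch: the register numbers ($t3 = 11, $s0 = 16, $s5 = 21) and the `add` function code 0x20 come from the standard MIPS conventions rather than from this slide, and the helper name `encode_r_type` is hypothetical.

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack R-type fields into a 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# Assumed MIPS register numbers and function code (not given on the slide).
T3, S0, S5 = 11, 16, 21
ADD_FUNCT = 0x20

# add $t3, $s0, $s5  ->  rd = $t3, rs = $s0, rt = $s5
word = encode_r_type(rs=S0, rt=S5, rd=T3, shamt=0, funct=ADD_FUNCT)
print(format(word, "#010x"))   # prints 0x02155820
```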


Load/Store Instructions (I-Type)

LW, SW

Examples:
lw rt, rs, imm16
sw rt, rs, imm16

e.g. lw $s3, -4($s2)
REG[$s3] = D-MEM[ REG[$s2] - 4 ]

Field | op | rs | rt | immediate
Bits | 31-26 | 25-21 | 20-16 | 15-0
Width | 6 bits | 5 bits | 5 bits | 16 bits


Branch (I-Type)

Beq

Example: beq rs, rt, imm16

e.g.
0x4c    beq $s1, $t3, -12
if ( REG[$s1] == REG[$t3] ) {
    new_PC = old_PC + 4 - 12    # new_PC = 0x44
} else {
    new_PC = old_PC + 4         # new_PC = 0x50
}

Field | op | rs | rt | displacement
Bits | 31-26 | 25-21 | 20-16 | 15-0
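The branch example can be checked numerically. This sketch assumes the 16-bit field holds a word offset, so the byte displacement of -12 is stored as -3 and shifted left by 2 when used; the function names are hypothetical.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc_beq(pc, rs_value, rt_value, imm16):
    """Next PC for beq: taken target if the registers are equal, else PC + 4."""
    fall_through = (pc + 4) & 0xFFFFFFFF
    if rs_value == rt_value:
        return (fall_through + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return fall_through

imm = -3 & 0xFFFF                               # -12 bytes = -3 words
print(hex(next_pc_beq(0x4C, 7, 7, imm)))        # branch taken
print(hex(next_pc_beq(0x4C, 7, 8, imm)))        # branch not taken
```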



Jump (J-Type)

J

Example: j Label

e.g.
0x8000 0000    j 0x111 1111
new_PC = 0x8444 4444

Field | op | target address
Bits | 31-26 | 25-0
Width | 6 bits | 26 bits
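The jump example is consistent with the usual MIPS rule of keeping the upper 4 bits of PC + 4 and appending the 26-bit target shifted left by 2. A minimal sketch under that assumption (the function name is hypothetical):

```python
def next_pc_jump(pc, target26):
    """Next PC for j: upper 4 bits of PC + 4, then the 26-bit target, then 00."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((target26 & 0x03FFFFFF) << 2)

print(format(next_pc_jump(0x80000000, 0x1111111), "#010x"))   # prints 0x84444444
```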


Components Required to implement the ISA

Next PC generation
– Add 4 or extended 16-bit immediate to PC

Memory
– Instruction read
– Data read/write

Registers (32 x 32-bit)
– Read register rs
– Read register rt
– Write register rt or rd

Sign extend immediate operand

ALU to operate on the operands


CPU: Instruction Fetch

• RTL version of the instruction fetch step:
  – Fetch the Instruction: mem[PC]
  – Update the program counter:
    • Sequential Code: PC


CPU: Register-Register Operations (Add, Subtract etc.)

R[rd]


CPU: Load Operations

R[rt]


CPU: Store Operations

Mem[ R[rs] + SignExt[imm16]


CPU: Datapath for Branching

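beq rs, rt, imm16
Datapath generates condition (equal)

Field | op | rs | rt | immediate
Bits | 31-26 | 25-21 | 20-16 | 15-0

[Figure: branch datapath. The PC (clocked by Clk) supplies the Instruction Address; one Adder computes PC + 4, a second Adder adds the output of PC Ext, which sign extends imm16 to 32 bits and left shifts it by 2; a Mux controlled by nPC_sel chooses the next PC. The register file (32 32-bit Registers with Rw, Ra, Rb, busW, RegWr, Clk) reads Rs and Rt onto busA and busB, and an "Equal?" comparator produces Cond.]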


CPU: Binary arithmetic for PC

• In theory, the PC is a 32-bit byte address into the instruction memory:
  – Sequential operation: PC = PC + 4
  – Branch operation: PC = PC + 4 + SignExt[Imm16] * 4

• The magic number “4” always comes up because:
  – The 32-bit PC is a byte address
  – And all our instructions are 4 bytes (32 bits) long

• In other words:
  – The 2 LSBs of the 32-bit PC are always zeros
  – There is no reason to have hardware to keep the 2 LSBs

• In practice, we can simplify the hardware by using a 30-bit PC:
  – Sequential operation: PC = PC + 1
  – Branch operation: PC = PC + 1 + SignExt[Imm16]
  – In either case: Instruction Memory Address = PC concat “00”
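The equivalence of the 32-bit and 30-bit formulations can be checked with a short sketch. The function names are hypothetical, and `taken` stands for the branch condition.

```python
def sign_ext16(imm16):
    """Sign-extend a 16-bit field to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def step_pc32(pc, imm16, taken):
    """32-bit byte-address PC: PC + 4, plus SignExt[Imm16] * 4 on a taken branch."""
    offset = sign_ext16(imm16) * 4 if taken else 0
    return (pc + 4 + offset) & 0xFFFFFFFF

def step_pc30(pc30, imm16, taken):
    """30-bit word-address PC: PC + 1, plus SignExt[Imm16] on a taken branch."""
    offset = sign_ext16(imm16) if taken else 0
    return (pc30 + 1 + offset) & 0x3FFFFFFF

pc = 0x4C
imm = -3 & 0xFFFF
for taken in (False, True):
    byte_address = step_pc30(pc >> 2, imm, taken) << 2   # PC concat "00"
    print(taken, hex(step_pc32(pc, imm, taken)), hex(byte_address))
```

Both columns agree because dropping the two always-zero low bits divides every term of the 32-bit update by 4.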


Single Cycle Implementation

Putting it all together

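[Figure: complete single-cycle datapath. The PC feeds the Instruction memory (Read address, Instruction [31–0]) and an Add unit that computes PC + 4; Instruction [15–0] is sign extended from 16 to 32 bits, shifted left 2, and added to PC + 4 by a second adder (Add, ALU result); a Mux controlled by PCSrc selects the next PC. The Registers block (Read register 1, Read register 2, Write register, Write data, Read data 1, Read data 2) feeds the ALU (ALU result, Zero) under ALU control; the Data memory (Address, Write data, Read data) and further Muxes complete the path. Control signals: RegDst, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, PCSrc.]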





