computer science 3724 - memorial university of...

Computer Science 3724

Fall Semester, 2014


What will we do in this course?

• We will look at the design of an instruction set for a simple

processor.

The processor is based on a “real” processor, the MIPS R2000.

• We will see how logic relates to switching (and transistors) and

how logic forms a calculus for designing digital circuits.

• We will construct the basic logic blocks required to build a simple

computer.

• We will look at the internal structure of that simple processor,

having a 32 bit instruction length and a 32 bit data word.

• We will design the processor, and add enhancements to improve

the speed of execution of its instructions.

• Then we will design a memory system for the processor, and see

how we can match its speed to the processor.


Why bother with all this?

• Both software and hardware affect performance. Understanding

how they interact is essential.

• Understanding how computers work helps us be better programmers.

• We may have to provide advice on which computer to purchase

for some application.

• Computing performance has improved exponentially for 40 years.

– Why is the growth rate so fast?

– How long can this continue?

– How does this growth affect the programs I design?

– How does it affect the value of hardware and software?

• How does increased computation speed affect computer peripherals (e.g., input/output devices)?


About questions???

“Who questions much, shall learn much, and retain much.” — Francis Bacon

“Asking a question is embarrassing for a moment, but not asking

is embarrassing for a lifetime.” — Haruki Murakami, Kafka on

the Shore, 2006, p. 255.

If there is something you don’t understand, or need clarified, ask.

If you think of a question after class, come to my office and ask.

If you don’t understand the answer, ask again!


A possible user's view of a computer system

[Figure: a user's view of a computer system, with blocks labelled USER DESKTOP, BROWSER, OFFICE ENVIRONMENT, EDITOR, SPREADSHEET, MAIL, DATABASE, COMPILERS, LIBRARIES, OPERATING SYSTEM, COMPUTER, MEMORY, and INPUT AND OUTPUT DEVICES.]


A typical “desktop system”:

In this course we will be concerned mainly with the processor. (The

part typically not on the desktop.)


Inside the processor box:

Where is the processor?


A look at the “motherboard”:


The basic functional blocks of a simple computer

[Figure: block diagram with three blocks: CPU, MEMORY, and INPUT/OUTPUT.]

We sometimes refer to the five classic components of a computer (input, output, memory, datapath, and control).

We often consider the CPU, or processor, as two components — a

datapath and a control unit.

The datapath performs arithmetic and logical operations on data

stored temporarily in internal registers.

The control unit determines exactly what operations are performed.

It also controls access to memory and I/O devices.


What are some characteristics of those components?

Characteristics of input:

• wide range of speed — keyboard, touch screen, network, video

• different modes — touch, video, voice, . . .

• different sampling rates — temperature, speed, . . .

Characteristics of output:

• again a wide range of speed — text, speech, video, . . .

• range of technologies — almost any controllable device

Characteristics of the processor:

• Does relatively simple operations at high speed

• Does exactly as it is instructed

• Very efficient for repetitive operations

• Technology developing at a consistent (rapid) rate, with transistor counts roughly doubling every two years

Characteristics of memory:

• Processors require fast memory, to match processor speeds.

• Very fast memory is relatively expensive; slow memory is relatively cheap.


Here some of the inputs and outputs are obvious, but note the lack

of wire connections.

Where is the processor?


Inputs? Outputs? Processor?


A typical instruction for a computer

w := a + b + c

What does this mean to the computer?

What are a, b, c and w?

How is the expression evaluated?

Is it the same as w := a + c + b?

How many computer instructions will this expression require?

How long will it take to execute?

Does the execution time depend on the values of a, b, and c?

Is the result exact? Why or why not?

Does the speed or accuracy depend on the particular processor?

Could using more than one processor speed up the calculation? How

about if the calculation was more complex?
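One of the questions above, whether w := a + b + c gives the same result as w := a + c + b, can be explored with a small sketch. It assumes that a, b, and c are 64 bit floating point values (Python's float); the particular values are chosen only for illustration.

```python
# Floating point addition is not associative: the order of evaluation
# can change the result. The values below are illustrative assumptions.
a, b, c = 1e16, -1e16, 1.0

w1 = (a + b) + c   # a + b is exactly 0.0, so w1 is 1.0
w2 = (a + c) + b   # a + c rounds back to 1e16, so w2 is 0.0

print(w1, w2)
```

With integer operands of limited size the two orders agree, so the answer depends on what a, b, and c are.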


Historical performance increase:

The following shows the increase in number of transistors for Intel

processors, memory, and other devices (from cmg.org):

These processors span the range of 4 to 64 bit processors.

Note the exponential growth in number of transistors, roughly doubling every two years.

This growth was first observed by Gordon Moore, the co-founder of

Intel, and is called Moore’s law.
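To get a feeling for what "doubling every two years" implies, the following sketch computes the growth factor over a number of years. The function name and the default doubling time are assumptions made for this illustration.

```python
def growth_factor(years, doubling_time_years=2.0):
    """Factor by which a quantity grows if it doubles every doubling_time_years."""
    return 2.0 ** (years / doubling_time_years)

print(growth_factor(10))   # 2**5  = 32.0
print(growth_factor(40))   # 2**20 = 1048576.0
```

Forty years of doubling every two years is a factor of about one million.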


Projections for the future:

The following graphs use data from the International Technology

Roadmap for Semiconductors (ITRS) 2004 update documentation.

We can see that the predictions were actually pessimistic!

ITRS produces a new roadmap every two years, and the latest is the

2013 roadmap.

(See http://www.itrs.net)

Transistor size:

[Plot: channel width (nm), on a scale from 10 to 100, versus year, 2002 to 2018.]


Memory size (GB/chip):

[Plot: memory size (GB) on a single chip versus year, 2002 to 2018.]

Note the log scale on the y-axis.

This plot shows a stepwise exponential growth with time.

Why does memory have this behavior?

What happens between the beginning and end of each step?


Clock Frequency:

[Plot: on-chip clock frequency (GHz), log scale from 1 to 100, versus year, 2002 to 2018.]


Number of transistors on chip - processors:

[Plot: millions of transistors on chip, log scale from 100 to 100000, versus year, 2002 to 2018, with separate curves for low cost and high performance processors.]


Can this continue?

What will stop this kind of growth?

Presently, a transistor in a high performance processor has a “size”

of about 25 nm.

A silicon atom has a “size” of about 0.54 nm (actually, the lattice constant of a silicon crystal, the spacing at which the crystal structure repeats).

When transistors change state, or “switch”, they use energy.

For a fixed power supply voltage, the heat energy produced depends

on the number of transistors.

Presently, processors require high speed fans to keep them cool enough.

Cooling a processor is a serious problem, even now.

What limits the size of a switching device?

What is the minimum energy required to remember a bit of information?

Is there a limit to the speed at which a computation can be performed?


Other technologies following a type of Moore’s Law:

The following data was taken from the Technium website

(http://www.kk.org/thetechnium)

Doubling time of various technologies, in months:

Technology                    Measure             Time (months)
Optical network               dollars/bit          9
Wireless network              bits/second         10
Data communication            bits/dollar         12
Digital cameras               pixels/dollar       12
Magnetic storage              GB/in²              12
RAM (dynamic)                 bits/dollar         18
Processor power consumption   watts/cm²           18
DNA sequencing                dollars/base pair   18
Disk storage                  GB/dollar           20

Why does this happen for some technologies and not others?

What limits the growth in these cases?


Instruction set architectures

What is the minimum instruction set required for a processor?

Consider a flowchart for a program.

[Flowchart: a box containing "i := i − 1", followed by a decision "i < 0?" with a yes branch and a no branch.]

Only two symbols are really necessary; data operations (boxes) and

control operations (arrows, or links).

Does this mean that we really only need two instructions?

Can input and output be handled, as well?


Actually, it is possible to combine both types of operation in one

instruction, and this is all that is required to have a fully functioning

computer.

Can you figure out what this instruction could be?

A machine with only one instruction would have interesting properties.

It is an interesting exercise to determine what they are.

Although a single instruction processor is interesting, it is not very efficient, since many instructions are required to do even simple operations.

The course home page has a link to a simulator for a single instruction

processor.

A more useful exercise is to determine a small but efficient instruction

set for a particular processor.


What must an instruction contain?

• An encoding for the operation to be performed (op code)

• The addresses of the operands, and a destination address for the

result

The instruction encoding (op code) depends on the number of different instructions to be encoded.

An instruction may require 0, 1, 2, or more operands.

An example of a type of instruction which requires no operand is an

operation on a stack. Here, the operation (e.g., addition) uses the

top value and the next value on the stack, and the result replaces the

top of the stack.

Typical stack operations are push (place a value on the stack) and pop (remove a value from the stack).
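A zero operand (stack) instruction set can be sketched as follows. This is only an illustration: the tuple encoding of the program and the function name run are assumptions, and the add operation here consumes both of its operands and leaves the result on top of the stack.

```python
def run(program):
    """Execute a list of (operation, operand) pairs on a stack machine."""
    stack = []
    for op, arg in program:
        if op == "push":          # place a value on the stack
            stack.append(arg)
        elif op == "pop":         # remove the top value
            stack.pop()
        elif op == "add":         # no operand in the instruction itself
            top = stack.pop()
            nxt = stack.pop()
            stack.append(nxt + top)
        else:
            raise ValueError("unknown operation: " + op)
    return stack

print(run([("push", 2), ("push", 3), ("add", None)]))   # [5]
```

Note that only push carries an operand; add finds both of its operands implicitly.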

Some operations are inherently unary operations; e.g., negation.

More complex operations (e.g., addition) can add an operand to the

value in a fixed register (often called an accumulator) and store the

result in this accumulator.

It would have the form

Acc ← Mem[addr] op Acc

where op is an arbitrary binary operator.


Operations using two addresses can have a number of forms. For

example:

Mem[addr1] ← Mem[addr1] op Mem[addr2]

or

Acc ← Mem[addr1] op Mem[addr2]

Operations using three addresses can implement a full binary operation (like c = a + b) directly:

Mem[addr3] ← Mem[addr1] op Mem[addr2]

Encoding several memory addresses in an instruction requires a large instruction size. Most processors have at least 32 address bits (4 GB memory), so an instruction using three memory operands would require more than 3 × 32 = 96 bits.

Some processors have variable length instructions (e.g., Intel processors, used in the PC); others have fixed length instructions (e.g., the MIPS style processors, used in many game processors).

Generally, the decoding of fixed length instructions is simpler than

the decoding of variable length instructions.

It is also common for certain instructions to encode data within the

instruction. Typically, the data would be restricted to a constant

with a small range of values. (Incrementing by a small number is a

common operation, and encoding the data directly in an instruction

is efficient.)

This is usually called immediate data.


How complex should the instructions be?

It is possible to have instructions that are quite complex.

For example, a well-known processor from the past had a single instruction which could evaluate a polynomial of arbitrary order.

There are (at least) two schools of thought on the design of instruction

sets.

One is that the instruction set should attempt to be as close as

possible to modern computer languages.

Such ISAs are called Complex Instruction Set Computer (CISC) architectures.

The idea is that compilers for such architectures are simpler.

These architectures typically have variable length instructions, with

many addressing modes. Each instruction may take several (or many)

machine cycles (clock periods).

The PC (Intel, AMD) architectures are of this type.

Another is that the instructions should be as simple and fast as

possible. These instruction set architectures usually have fixed size

instructions, and each instruction completes in a single clock cycle.

Such ISAs are called Reduced Instruction Set Computer (RISC) architectures.

The MIPS architecture which we will be discussing later is of this

type.


Register files

Many processors have sets of registers, often called register files.

Instructions can address individual registers in the register file, using

far fewer bits than a full memory address.

For example, the MIPS processor has 32 registers of 32 bits each; each register therefore requires only a 5 bit address, and a three operand instruction operating on registers only would require 3 × 5 = 15 bits for the operand addresses.
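The arithmetic behind the 15 bit and 96 bit figures can be checked with a short sketch. The helper name address_bits is an assumption made for this example.

```python
def address_bits(n_locations):
    """Smallest number of bits that can distinguish n_locations items."""
    return (n_locations - 1).bit_length()

reg_bits = address_bits(32)         # 32 registers
mem_bits = address_bits(2 ** 32)    # 4 GB of byte addresses

print(reg_bits, 3 * reg_bits)       # 5 15
print(mem_bits, 3 * mem_bits)       # 32 96
```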

The PC has 8 general registers of 32 bits each, and a number of special purpose registers.

Some processors (those in the PC, for example) allow instructions

which mix memory and register operations. Other processors permit

arithmetic and logic operations only on registers. Of course, both

types have instructions to copy values between the register file and

memory.


Addressing modes

Many processors have several ways of constructing the memory address for a data operand. The register file may be used to provide part of the address. This is particularly useful for list or tabular data structures.

The simplest addressing mode is where the address is part of the instruction itself. It may be used for accessing data, or for determining the target address for a branch or jump.

Another form of addressing is relative addressing. Here the instruction contains a displacement from the current address. This is most commonly used with branch instructions.

For such a branch, the target address is calculated as

address = PC + displacement

where PC is the program counter, which contains the address of the

current instruction.

Addressing modes which involve a register often add the value in a

register to a displacement value from the instruction. These would

be calculated as

address = Ri + displacement

where Ri is the register designated for the address by the instruction.

This type of addressing is called indexed or based addressing.

This type of addressing is useful for manipulating list data structures;

the list can be traversed simply by incrementing or decrementing the

register value.


This idea can be extended to the use of two registers. This would be

useful for addressing data in a 2-dimensional structure like a table.

The address would be calculated as

address = Ri + Rj + displacement

and is usually called based indexed addressing.

Here, each register can be manipulated independently, allowing for

row and column operations.
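The three address calculations described so far can be written out directly. In this sketch the register file is modelled as a Python dictionary, and the register numbers and contents are hypothetical.

```python
def relative(pc, displacement):
    """Relative addressing: address = PC + displacement."""
    return pc + displacement

def indexed(regs, i, displacement):
    """Indexed (based) addressing: address = Ri + displacement."""
    return regs[i] + displacement

def based_indexed(regs, i, j, displacement):
    """Based indexed addressing: address = Ri + Rj + displacement."""
    return regs[i] + regs[j] + displacement

regs = {1: 0x1000, 2: 0x20}                 # hypothetical register contents
print(hex(relative(0x400, 16)))             # 0x410
print(hex(indexed(regs, 1, 8)))             # 0x1008
print(hex(based_indexed(regs, 1, 2, 4)))    # 0x1024
```

For a table, Ri could hold the start of a row and Rj the offset of a column within the row.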

More complex addressing modes are also possible.

The target of an address can be a data value (corresponding to a

variable in a program — the address is the variable name).

It can also be an instruction, such as the target of a jump or branch

instruction.

It can also be another address (this corresponds to a pointer in languages like C, or a reference in other languages).

This capability is called indirect addressing, and may be supported

by the instruction set architecture of the processor.

Indirect addressing is used in the construction of more complex data

structures like linked lists, trees, etc.
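As an illustration of indirect addressing, the sketch below walks a linked list stored in a small memory image. The memory contents, the two word node layout, and the use of address 0 to mark the end of the list are all assumptions made for this example.

```python
# Hypothetical memory image: address -> contents.
# A list node occupies two words: a value, then the address of the next node.
memory = {
    100: 7, 104: 200,   # node at 100: value 7, next node at 200
    200: 9, 204: 300,   # node at 200: value 9, next node at 300
    300: 4, 304: 0,     # node at 300: value 4, end of list
}

def traverse(memory, head):
    """Collect the values of a linked list starting at address head."""
    values = []
    addr = head
    while addr != 0:
        values.append(memory[addr])   # the target is a data value
        addr = memory[addr + 4]       # the target is another address
    return values

print(traverse(memory, 100))   # [7, 9, 4]
```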

Many processors support several different addressing modes. In fact,

the PC supports all the addressing modes mentioned, and several

others.


Relating instruction sets to logic

It is also useful to consider what the internal structure of a computer

would be, independent of any particular instruction set.

For example, the requirement that instructions and data be fetched from memory (and that memory is independent from the processor) requires that the processor be able to generate and maintain a memory address, and that it be able to provide data to or receive data from memory.

This implies that there are two entities which can hold information

(an address and a data word) stable long enough for a memory read

or write.

Typically, in logic, we would implement these with registers.

The address is held in the memory address register (MAR).

The data is held in the memory data register (MDR).

Circuitry is also required to generate and maintain instruction addresses. Most often, the next instruction to be executed is the next instruction in memory. This circuitry is usually called the program counter (PC).

The instructions themselves contain addresses for data, and there

must be a control unit to decode instructions and manage flow of

control (e.g., branches).

Data, and computational results, are stored in internal registers.

There must be circuitry to perform the required arithmetic and/or

logical operations (the datapath).


Combining all these observations, we require a structure similar to

the following:

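[Figure: block diagram of a processor containing an address generator (PCU) with the PC, the MAR, the MDR, general registers and/or an accumulator, the ALU, and the instruction decode and control unit.]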


The internal structure of a modern style processor —

the MIPS R5000:

More such photomicrographs are available at url

http://micro.magnet.fsu.edu/chipshots


The MIPS instruction set architecture

The MIPS has a 32 bit architecture, with 32 bit instructions, a 32

bit data word, and 32 bit addresses.

It has 32 addressable internal registers requiring a 5 bit register address. Register 0 always has the constant value 0.

Addresses are for individual bytes (8 bits) but instructions must have addresses which are a multiple of 4. This is usually stated as “instructions must be word aligned in memory.”

There are three basic instruction types with the following formats:

R-type (register)

  field:   op      rs      rt      rd      shamt   funct
  bits:    31-26   25-21   20-16   15-11   10-6    5-0
  width:   6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

I-type (immediate)

  field:   op      rs      rt      immediate
  bits:    31-26   25-21   20-16   15-0
  width:   6 bits  5 bits  5 bits  16 bits

J-type (jump)

  field:   op      target
  bits:    31-26   25-0
  width:   6 bits  26 bits

All op codes are 6 bits.

All register addresses are 5 bits.
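The field boundaries above translate directly into shift and mask operations. The following sketch extracts every field from a 32 bit instruction word; the function name fields and the example word are assumptions made for this illustration. Which fields are meaningful depends on the instruction type.

```python
def fields(word):
    """Split a 32 bit instruction word into the fields of all three formats."""
    return {
        "op":        (word >> 26) & 0x3F,    # bits 31-26
        "rs":        (word >> 21) & 0x1F,    # bits 25-21
        "rt":        (word >> 16) & 0x1F,    # bits 20-16
        "rd":        (word >> 11) & 0x1F,    # bits 15-11 (R-type)
        "shamt":     (word >> 6) & 0x1F,     # bits 10-6  (R-type)
        "funct":     word & 0x3F,            # bits 5-0   (R-type)
        "immediate": word & 0xFFFF,          # bits 15-0  (I-type)
        "target":    word & 0x3FFFFFF,       # bits 25-0  (J-type)
    }

f = fields(0x00430820)
print(f["op"], f["rs"], f["rt"], f["rd"], f["shamt"], f["funct"])   # 0 2 3 1 0 32
```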



R−type (register)

The R-type instructions are 3 operand arithmetic and logic instructions, where the operands are contained in the registers indicated by rs, rt, and rd.

For all R-type instructions, the op field is 000000.

The funct field selects the particular type of operation for R-type

operations.

The shamt field determines the number of bits to be shifted (0 to

31).

These instructions perform the following:

R[rd] ← R[rs] op R[rt]

Following are examples of R-type instructions:

Instruction        Example               Meaning
add                add  $s1, $s2, $s3    $s1 = $s2 + $s3
add unsigned       addu $s1, $s2, $s3    $s1 = $s2 + $s3
subtract           sub  $s1, $s2, $s3    $s1 = $s2 - $s3
subtract unsigned  subu $s1, $s2, $s3    $s1 = $s2 - $s3
and                and  $s1, $s2, $s3    $s1 = $s2 & $s3
or                 or   $s1, $s2, $s3    $s1 = $s2 | $s3



I-type (immediate)

  field:   op      rs      rt      immediate
  bits:    31-26   25-21   20-16   15-0

The 16 bit immediate field contains a data constant for an arithmetic

or logical operation, or an address offset for a branch instruction.

This type of branch is called a relative branch.

Following are examples of I-type instructions of the form:

R[rt] ← R[rs] op imm


Instruction    Example                Meaning
add            addi  $s1, $s2, imm    $s1 = $s2 + imm
add unsigned   addiu $s1, $s2, imm    $s1 = $s2 + imm
subtract       subi  $s1, $s2, imm    $s1 = $s2 - imm
and            andi  $s1, $s2, imm    $s1 = $s2 & imm

Another I-type instruction is the branch instruction.

Examples of this are:


Instruction          Example              Meaning
branch on equal      beq $s1, $s2, imm    if $s1 == $s2 go to PC + 4 + (4 × imm)
branch on not equal  bne $s1, $s2, imm    if $s1 != $s2 go to PC + 4 + (4 × imm)

Why is the imm field multiplied by 4 here?
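The branch target calculation can be sketched as follows. One detail not stated on the slide is assumed here: the 16 bit immediate is treated as a signed (two's complement) value, so that branches can go backward as well as forward. The function names are hypothetical.

```python
def sign_extend_16(imm):
    """Interpret a 16 bit field as a signed (two's complement) integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm):
    """Target address of a taken beq or bne located at address pc."""
    return pc + 4 + 4 * sign_extend_16(imm)

print(branch_target(0, 3))         # 0 + 4 + 12 = 16
print(branch_target(20, 0xFFFA))   # imm is -6: 20 + 4 - 24 = 0
```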


J−type (jump)

  field:   op      target
  bits:    31-26   25-0

The J-type instructions are all jump instructions.

The two we will discuss are the following:


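Instruction    Example       Meaning
jump           j target      go to address PC[31:28] : (4 × target)
jump and link  jal target    $31 = PC + 4; go to address PC[31:28] : (4 × target)

Here PC[31:28] : (4 × target) means the upper 4 bits of the PC followed by the 28 bit value 4 × target.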

Why is the PC incremented by 4?

Why is the target field multiplied by 4?

Recall that the MIPS processor addresses data at the byte level, but

instructions are addressed at the word level.

Moreover, all instructions must be aligned on a word boundary (an

integer multiple of 4 bytes).

Therefore, the next instruction is 4 byte addresses from the current

instruction.

Since jumps must have an instruction as target, shifting the target

address by 2 bits (which is the same as multiplying by 4) allows the

instruction to specify larger jumps.

Note that the jump instruction cannot span (jump across) all of

memory.
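A sketch of the jump address calculation, following the slide in taking the upper 4 bits from the PC of the jump instruction itself. The function name and the example addresses are assumptions.

```python
def jump_target(pc, target):
    """Address reached by a J-type instruction located at address pc."""
    upper = pc & 0xF0000000               # upper 4 bits of the PC
    lower = (target & 0x3FFFFFF) << 2     # 26 bit field times 4: 28 bits
    return upper | lower

print(hex(jump_target(0x00400010, 0x0100000)))   # 0x400000
print(hex(jump_target(0x9000000C, 1)))           # 0x90000004
```

Since only 28 bits come from the instruction, a jump stays within the 256 MB region selected by the upper 4 bits of the PC.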


There are a few more interesting instructions, for comparison, and

memory access:

R-type instructions:


Instruction    Example              Meaning
set less than  slt $s1, $s2, $s3    if ($s2 < $s3), $s1 = 1; else $s1 = 0
jump register  jr $ra               go to $ra

set less than also has an unsigned form.

jump register is typically used to return from a subprogram.

I-type instructions:


Instruction              Example               Meaning
set less than immediate  slti $s1, $s2, imm    if ($s2 < imm), $s1 = 1; else $s1 = 0
load word                lw $s1, imm($s2)      $s1 = Memory[$s2 + imm]
store word               sw $s1, imm($s2)      Memory[$s2 + imm] = $s1

load word and store word are the only instructions that access

memory directly.

Because data must be explicitly loaded before it is operated on, and

explicitly stored afterwards, the MIPS is said to be a load/store

architecture.

This is often considered to be an essential feature of a reduced instruction set architecture (RISC).


The MIPS assembly language

The previous diagrams showed examples of code in a general form

which is commonly used as a simple kind of language for a processor

— a language in which each line in the code corresponds to a single

instruction in the language understood by the machine.

For example,

add $1, $2, $3

means add together the contents of registers $2 and $3 and store the result in register $1.

We call this type of language an assembly language.

The language of the machine itself, called the machine language,

consists only of 0’s and 1’s — a binary code.

The machine language instruction corresponding to the previous instruction (with the different fields identified) is:

  field:   op      rs      rt      rd      shamt   funct
  bits:    31-26   25-21   20-16   15-11   10-6    5-0
  value:   000000  00010   00011   00001   00000   100000

There are usually programs, called assemblers, to translate the more

human readable assembly code to machine language.
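The translation of add $1, $2, $3 into its machine language word can be reproduced with a few shifts. This is only a sketch of what an assembler does for one R-type instruction; the function name encode_r is an assumption.

```python
def encode_r(rs, rt, rd, shamt, funct, op=0):
    """Pack the six R-type fields into one 32 bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $1, $2, $3: rd = 1, rs = 2, rt = 3, funct = 100000
word = encode_r(rs=2, rt=3, rd=1, shamt=0, funct=0b100000)
print(format(word, "032b"))   # 00000000010000110000100000100000
print(hex(word))              # 0x430820
```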


Compilers and assemblers

A compiler translates a “high level” language like C or Java into the

“machine language” for a particular environment (operating system

and target machine type.)

It is generally possible to compile a high-level language program to run on almost any commercial computer system.

A single high level language statement corresponds to several, and

often many, machine instructions.

Some modern language compilers (e.g., Java) produce output that does not correspond to any “real” computer, but rather to a “virtual” or model computer. This output (called bytecode, or p-code) can then be executed by a software model of the virtual machine (interpreted) or further translated into the machine language of the underlying processor.


An assembler translates an “assembly language” into the “machine

language” for a particular target machine.

Assembly languages for different target machines are different.

Assembly language instructions normally translate one-for-one to machine instructions. (Some particular combinations of a few instructions may correspond to only one assembly instruction.)

Assembly code has a simple format. It normally includes labels,

instructions, and directives.

labels correspond directly to addresses (much like variable names in

high-level languages), but are also used to label instructions —

for example, a jump target.

Labels are character strings followed by “:”

For example, in the code following, loop: is a label.

instructions define the particular operations to be executed

directives provide information for the assembler itself.

Directives are preceded by a “.”

For example, the directive .align 2 forces the next item to

align itself on a word boundary.

Typically, there are at least two separate sections, indicated by directives, dividing the program into instructions and data.


A simple assembly language program

The following shows a short assembly code segment, for an infinite

loop:

        .text
        .align 2
        addi $1, $0, 0     # set register 1 to 0
loop:   sw   $0, 128($1)   # store 0 at 128 + the location
                           # pointed to by register 1
        addi $1, $1, 4     # increment register 1 by 4
        j    loop          # go back to loop

Here, loop is a label, and .text and .align are directives. The text following each # is a comment.

This corresponds to the following machine language program (assuming it starts at memory location 0):

location instruction

0 001000 00000 00001 00000 00000 000000

4 101011 00001 00000 00000 00010 000000

8 001000 00001 00001 00000 00000 000100

12 000010 00000 00000 00000 00000 000001
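The four words above can be reproduced from the instruction formats. The sketch below assumes the op codes shown in the listing (001000, 101011, and 000010) and uses hypothetical helper names.

```python
def encode_i(op, rs, rt, imm):
    """Pack an I-type instruction: op, rs, rt, and a 16 bit immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_j(op, target):
    """Pack a J-type instruction: op and a 26 bit word address."""
    return (op << 26) | (target & 0x3FFFFFF)

program = [
    encode_i(0b001000, 0, 1, 0),     # addi $1, $0, 0
    encode_i(0b101011, 1, 0, 128),   # sw   $0, 128($1)
    encode_i(0b001000, 1, 1, 4),     # addi $1, $1, 4
    encode_j(0b000010, 4 // 4),      # jump to loop, at byte address 4
]
for location, word in zip(range(0, 16, 4), program):
    print(location, format(word, "032b"))
```

Note that the jump stores the word address 1 rather than the byte address 4, and that the base register of sw goes in the rs field.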


How does an assembler work?

It is a fairly simple process to write a program to translate these

instructions into machine code; it is a simple one-for-one translation.

The main problem is with labels — forward references, in particular.

Most simple assemblers make two “passes” over the assembler code;

in the first pass all the labels and their corresponding addresses are

placed in a symbol table. In the second pass, the instructions are

generated, using the addresses from the symbol table.
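The two passes can be sketched as below. This is a simplified illustration, not a real assembler: it only resolves labels to addresses and does not generate binary instructions. It assumes one 4 byte word per instruction, labels ending in ":", comments starting with "#", and directives starting with "."; the function name is hypothetical.

```python
def resolve_labels(lines, start=0):
    """Two passes over assembly source; returns (symbol_table, resolved)."""
    # Pass 1: record the address of every label in the symbol table.
    symbols, instructions, address = {}, [], start
    for line in lines:
        line = line.split("#")[0].strip()        # drop comments
        if ":" in line:
            label, line = line.split(":", 1)
            symbols[label.strip()] = address
            line = line.strip()
        if line and not line.startswith("."):    # skip directives
            instructions.append((address, line))
            address += 4                         # one word per instruction
    # Pass 2: replace label operands by their addresses.
    resolved = []
    for address, text in instructions:
        words = text.replace(",", " ").split()
        words = [str(symbols.get(w, w)) for w in words]
        resolved.append((address, " ".join(words)))
    return symbols, resolved

source = [
    "      j    done        # forward reference",
    "loop: addi $1, $1, 4",
    "      j    loop",
    "done: sw   $0, 0($1)",
]
symbols, resolved = resolve_labels(source)
print(symbols)       # {'loop': 4, 'done': 12}
print(resolved[0])   # (0, 'j 12')
```

The first instruction refers to done before it is defined; a single pass could not fill in its address.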

The output of the assembler is an object file. This object file still

may contain unresolved references (say, to library functions) which

are resolved by the linker.

We will look in more detail at how functions work in assembly language later, but it is usual to provide functions for common operations in a library.

For example, there is a function printf which accepts a format string and one or more values to print as arguments. (This is actually a standard C function.)


In UNIX systems, object files have six components:

• An object file header describing the sizes of the other sections

• The text segment containing the actual machine code

• The data segment containing the data in the source file

• Relocation information identifying data and instructions that rely on absolute addresses, which must be changed if the program is moved from one part of memory to another.

• The symbol table associating labels with addresses, and holding

places for unresolved references.

• Debugging information, containing concise information about how the program was compiled, so a debugger can associate memory addresses with lines in the source file.

The following diagram shows the steps involved in assembling and

running a program:


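[Figure: flow from the Programmer to an Assembly language program, through the Assembler (together with Libraries and Other functions) to a Machine language program, then through the Loader into Memory, where the Processor executes it using Input and Output.]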


MIPS memory usage

MIPS systems typically divide memory into three parts, called segments.

These segments are the text segment which contains the program’s instructions, the data segment, which contains the program’s data, and the stack segment which contains the return addresses for function calls, and also contains register values which are to be saved and restored. It may also contain local variables.

  Address         Contents
  7fffffff hex    Stack segment (the stack)
                  ...
                  Dynamic data   (Data segment)
  10000000 hex    Static data    (Data segment)
  400000 hex      Text segment
                  Reserved

The data segment is divided into 2 parts, the lower part for static

data (with size known at compile time) and the upper part, which

can grow, upward, for dynamic data structures.

The stack segment varies in size during the execution of a program,

as functions are called and returned from.

It starts at the top of memory and grows down.


More about assemblers

Sometimes, an assembler will accept a statement that does not correspond exactly to a machine instruction. For example, it may correspond to a small set of machine instructions. These are called pseudoinstructions.

This is done when a particular set of statements are frequently used,

and have a simple translation to a set of machine instructions.

The original MIPS assembly language had a number of these.

For example, the pseudoinstruction load double

ld $4, 0($1)

would generate the following two instructions:

lw $4, 0($1)

lw $5, 4($1)

The pseudoinstruction load address

la $4, label

generates the instructions

lui $4, imm_u

ori $4, $4, imm_l

which load the upper and lower 16 bits of the address, respectively.
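The split of an address into its upper and lower halves, and the way the two instructions rebuild it, can be sketched as follows. The address used for the label is a hypothetical value, and the function names are assumptions.

```python
def split_address(address):
    """Upper and lower 16 bit halves of a 32 bit address."""
    return (address >> 16) & 0xFFFF, address & 0xFFFF

def lui_ori(imm_u, imm_l):
    """Value left in the register by lui followed by ori."""
    reg = (imm_u << 16) & 0xFFFFFFFF   # lui: upper half set, lower half zero
    return reg | imm_l                 # ori: fill in the lower half

imm_u, imm_l = split_address(0x10010024)   # hypothetical address of label
print(hex(imm_u), hex(imm_l))              # 0x1001 0x24
print(hex(lui_ori(imm_u, imm_l)))          # 0x10010024
```

Two instructions are needed because a 32 bit address cannot fit in the 16 bit immediate field of a single I-type instruction.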

The pseudoinstruction

mov $5, $1

moves the contents of register $1 to register $5.

What single MIPS instruction corresponds to this pseudoinstruction?


Macros

Assemblers also provide a set of instructions similar to functions, which can accept a formal argument. These are called macros.

A macro is expanded as text, so code is generated each time the macro

is used, and the formal argument is replaced as text in the macro.

Consequently, there is no function call — the macro is expanded

directly in the code.

Following is a macro which uses the function printf to print an

integer:

.data

int_str: .asciiz "%d"

.text

.macro print_int($arg)

la $a0, int_str # load format string address

# into first argument register

mov $a1, $arg # load macro’s parameter

# (arg) into second argument

# register

jal printf

.end_macro

This macro would be “called” with an actual argument, like

print_int($7)

and would have the effect of inserting the above code, with register

$7 replacing the string $arg.
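Macro expansion as text substitution can be sketched in a few lines. The list MACRO_BODY holds the body of the print_int macro with comments removed; the function name expand is an assumption, and a real assembler does considerably more checking.

```python
MACRO_BODY = [
    "la  $a0, int_str",
    "mov $a1, $arg",
    "jal printf",
]

def expand(body, formal, actual):
    """Expand a macro by substituting the actual argument as text."""
    return [line.replace(formal, actual) for line in body]

for line in expand(MACRO_BODY, "$arg", "$7"):
    print(line)
```

Each use of the macro inserts another copy of these three instructions; no call or return takes place for the macro itself.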


Translating programs to assembly language

Given the program statement

y = a + b − c + d

what is an equivalent assembly code?

Assuming that a, b, c, d are in registers $5 to $8, respectively, and

that y is in $9, then we could have:

add $9, $5, $6 # y = a + b

sub $10, $8, $7 # tmp = d - c

add $9, $9, $10 # y = y + tmp

Note that we have introduced a temporary register, $10 (tmp) here.

This is not really necessary.

To place the values of a, b, c, and d in the registers, from memory, assuming register $20 contains the address for variable a, and variables b, c, d, and y are the next consecutive words in memory, we could write

lw $5, 0($20) # load a in reg $5

lw $6, 4($20) # load b in reg $6

lw $7, 8($20) # load c in reg $7

lw $8, 12($20) # load d in reg $8

To store the value of y in memory, we could write

sw $9, 16($20) # store reg $9 in y
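The whole sequence (loads, arithmetic, and store) can be traced with a small simulation. Registers and memory are modelled as Python dictionaries; the base address 1000 and the values of a, b, c, and d are hypothetical.

```python
memory = {1000: 5, 1004: 7, 1008: 3, 1012: 10, 1016: 0}   # a, b, c, d, y
reg = {20: 1000}                # $20 holds the address of a

# lw $5..$8: load a, b, c, d from consecutive words
for n, offset in zip((5, 6, 7, 8), (0, 4, 8, 12)):
    reg[n] = memory[reg[20] + offset]

reg[9] = reg[5] + reg[6]        # add $9, $5, $6     y = a + b
reg[10] = reg[8] - reg[7]       # sub $10, $8, $7    tmp = d - c
reg[9] = reg[9] + reg[10]       # add $9, $9, $10    y = y + tmp

memory[reg[20] + 16] = reg[9]   # sw $9, 16($20)
print(memory[1016])             # 5 + 7 - 3 + 10 = 19
```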


Simple data structures

It is common to use some kind of data structure in a high-level programming language.

How would the following be translated into MIPS assembly language?

A[i] = A[i] + B;

Assuming there is a label Astart at the beginning of the data array A[], and that register $19 has the value 4 × i and that the value of B is in register $18:

lw $8, Astart($19) # load A[i] in reg $8

add $8, $18, $8 # add B to A[i]

sw $8, Astart($19) # store reg $8 in variable A[i]


Program structures — loops

Extending the previous example to a simple loop; how would the

following be translated to MIPS assembly language?

for (i = 0; i < 10; i++) {

A[i] = A[i] + B;

}

Here, we need to set up a counter, say, in register $6, and compare

it to 10.

addi $6, $0, 0 # initialize counter i to 0

addi $19,$0, 0 # initialize array address

addi $5, $0, 10 # set test value for loop

loop: lw $8, Astart($19) # load A[i] in reg $8

add $8, $18, $8 # add B to A[i] (B is in $18)

sw $8, Astart($19) # store reg $8 in variable A[i]

addi $6, $6, 1 # increment counter

addi $19,$19, 4 # increment array address

bne $5, $6, loop # jump back until counter

# equals 10

Note that this is not the most efficient code; the array index itself

could be used to terminate the loop, using one less register, and one

less instruction in the loop.


Conditional expressions

Consider the following C code:

if (i==j)

x = x + h;

else

x = x - h;

Assume i, j, x and h are already in registers $4, $5, $6, and $7,

respectively.

In MIPS assembly language, this could be written as:

bne $4, $5, else # jump to the "else" clause

add $6, $6, $7 # execute the "then" clause

j endif # jump past the "else" clause

else: sub $6, $6, $7 # execute the "else" clause

endif: . . .

A similar, but extended, structure could be written for case structures.


Subprograms

We have already seen the instruction to jump to a subprogram, jal

which places the value of PC + 4 (the address of the next instruction

in memory) into register $31.

We have also seen how the subprogram returns back to the main

program using the instruction

jr $31

There are still some questions about subprograms, however.

First, what happens when a subprogram calls another subprogram?

There must be some way to save the “old” return address before

overwriting the value in register $31.

Next, how are arguments passed to the subprogram?

To answer the first question, a stack data structure is used to save

the return address in register $31 before a subprogram is called.

The operation of placing a value on the stack is called pushing a

value onto the stack.

Returning a value from the stack to the register is called popping a

value from the stack.

By convention, register $29 is used as a stack pointer.

It is initially set to a high value (7fffffff hex), and decremented

every time a value is pushed, and incremented whenever a value is

popped.
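The push and pop operations can be modelled in C as a minimal sketch. The array stack_mem, its size, and the use of a word index instead of a byte address are assumptions made for illustration; there is no overflow or underflow checking:

```c
#include <stdint.h>

#define STACK_WORDS 64            /* assumed size of the simulated stack */

static uint32_t stack_mem[STACK_WORDS];
static uint32_t sp = STACK_WORDS; /* one past the highest word: empty stack */

/* push: decrement the stack pointer, then store the value there */
void push(uint32_t value)
{
    sp -= 1;
    stack_mem[sp] = value;
}

/* pop: read the value at the stack pointer, then increment the pointer */
uint32_t pop(void)
{
    uint32_t value = stack_mem[sp];
    sp += 1;
    return value;
}
```

Values come back in the reverse of the order in which they were pushed, which is what nested subprogram calls need for their return addresses.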

51

The following diagram shows the state of the stack after three nested

subprogram calls:

[Figure: the main program calls subprogram 1, which calls subprogram 2, which calls subprogram 3. The stack holds, in the order pushed, the return address to main, the return address to subprogram 1, and the return address to subprogram 2; sp points to the last of these.]

Note that the stack pointer always points to the last element placed
on the stack. It is decremented before pushing, and incremented
after popping.

The return address is not the only thing which must be saved during

the execution of a subprogram. Arguments may also be passed to a

subprogram on the stack.

If a subprogram can call itself (recursion) then its entire state must

be saved. This includes the contents of registers used by the subpro-

gram, and values of local variables, etc. These are also saved on the

stack. The whole block of memory used by the stack in handling a

procedure call is referred to as a procedure call frame.

52

The procedure call frame is usually completely contained in the stack,

and is often called simply a stack frame. In order to facilitate access-

ing data in the stack frame, there is usually a frame pointer which

points to the start of a frame. The stack pointer points to the end

of the frame.

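[Figure: layout of a stack frame, with regions labelled Argument 6, Argument 5, Saved registers, Local variables, and Argument build. $fp marks the start of the frame, $sp marks its end, and the distance between them is the frame size.]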

In the MIPS convention, register $30 is the frame pointer.

In order to properly preserve the contents of registers in a procedure

call, both the caller and callee must agree on who is responsible for

saving each register. The following convention was used with most

MIPS compilers:

53

MIPS register names and conventions about their use

Register Name   Number    Usage
zero            0         Constant 0
at              1         Reserved for assembler
v0              2         Expression evaluation and
v1              3         results of a function
a0              4         Argument 1
a1              5         Argument 2
a2              6         Argument 3
a3              7         Argument 4
t0 - t7         8 - 15    Temporary (not preserved across call)
s0 - s7         16 - 23   Saved temporary (preserved across call)
t8 - t9         24 - 25   Temporary (not preserved across call)
k0              26        Reserved for OS kernel
k1              27        Reserved for OS kernel
gp              28        Pointer to global area
sp              29        Stack pointer
fp              30        Frame pointer
ra              31        Return address (used by function call)

54

What happens when a procedure is called

Before calling a procedure, the caller must:

1. Pass the arguments to the callee procedure;

The first 4 arguments are passed in registers $a0 - $a3 ($4 -

$7). The remaining arguments are placed on the stack.

2. Save any caller-saved registers that the caller expects to use after

the call. This includes the argument registers and the tempo-

rary registers $t0 - $t9. (The callee may use these registers,

altering the contents.)

3. Execute a jal to the called procedure (callee). This saves the

return address in $ra.

At this point, the callee must set up its stack frame:

1. Allocate memory on the stack by subtracting the frame size from

the $sp.

2. Save any registers the caller expects to have left unchanged.

These include $ra, $fp, and the registers $s0 - $s7.

3. Set the value of the frame pointer $fp by adding the stack frame size
to $sp and subtracting 4.

The procedure can then execute its function.

Note that the argument list on the stack belongs to the stack frame

of the caller.

55

Returning from a procedure

When the callee returns to the caller, the following steps are required:

1. If the procedure is a function returning a value, the value is

placed in register $v0 and, if two words are required, $v1 (reg-

isters $2 and $3).

2. All callee-saved registers are restored by popping the values from

the stack, in the reverse order from which they were pushed.

3. The stack frame is popped by adding the frame size to $sp.

4. The callee returns control to the caller by executing jr $ra

Note that some of the operations may not be required for every

procedure call, and modern compilers would only generate the steps

required for each particular procedure.

For example, the lowest level subprograms to be called (“leaf nodes”)

would not have to save $ra.

If a programming language does not allow a subprogram to call itself

(recursion) then implementing a stack frame may not be required,

but a stack is still required for nested procedure calls.

56

Who does what operation (caller or callee) is to some extent arbitrary,

and different systems may use quite different conventions.

For example, in some systems the subprogram arguments are part of

the callee stack frame, unlike the MIPS in which they belong to the

frame of the caller.

The designation of certain registers as caller saved, and others as callee

saved is also arbitrary, and to some extent depends on how many

registers are available.

Having registers which a procedure can use without the overhead of

saving and restoring tends to lower the overhead of a procedure call.

Processors with few general registers (e.g., the INTEL processors)

would likely construct a stack frame quite differently.

It is imperative, however, that all the programs which will be linked

together strictly follow the same conventions.

Typically, procedures from several languages (e.g., assembly, C, Java)

can be intermixed at run time, so the compilers and linkers must

follow the same conventions if they are to interact correctly.

57

An example of a recursive function (factorial)

The following is a simple factorial function written in C:

C program for factorial (recursive)

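#include <stdio.h>

int fact (int n);    /* declared before its use in main */

int main (void)
{
    printf ("the factorial of 10 is %d\n", fact(10));
    return 0;
}

/* returns n! for n >= 1, and 1 for any n < 1 */
int fact (int n)
{
    if (n < 1)
        return 1;
    else
        return (n * fact (n-1));
}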

Following is the same code, in MIPS assembly language. First, the

main program is shown, followed by the factorial function itself.

Note that the MIPS specifies a minimum size of 32 bytes for a stack

frame.

58

# Mips assembly code showing recursive function calls:

.text # Text section

.align 2 # Align following on word boundary.

.globl main # Global symbol main is the entry

.ent main # point of the program.

main:

subiu $sp,$sp,32 # Allocate stack space for return

# address and local variables

# (32 bytes). (Stack "grows" downward.)

sw $ra, 20($sp) # Save return address

sw $fp, 16($sp) # Save old frame pointer

addiu $fp, $sp, 28 # Set up frame pointer

li $a0, 10 # put argument (10) in $a0

jal fact # jump to factorial function

# the factorial function returns a value in register $v0

# la $a0, $LC # Put format string pointer in $a0

# move $a1, $v0 # put result in $a1

# jal printf # print the result

59

# Instead of using printf, we can use a syscall

move $s0, $v0 # put result in $s0

# Print label for output.

li $v0, 4 # Syscall code for print string

# goes in register $v0

la $a0, $LC # Put format string pointer in $a0

syscall # print string

# Print integer result

li $v0, 1 # Syscall code for print integer

move $a0, $s0 # Put integer to be printed in $a0

syscall # print integer

move $v0, $0 # Clear register v0.

# end of print output

# restore saved registers

lw $ra, 20($sp) # restore return address

lw $fp, 16($sp) # Restore old frame pointer

addiu $sp, $sp, 32 # Pop stack frame

jr $ra # return to caller (shell)

.rdata

$LC:

.asciiz "The factorial of 10 is " # null-terminated for the print string syscall

60

# factorial function

.text # Text section

fact:

subiu $sp,$sp,32 # Allocate stack frame (32 bytes)

sw $ra, 20($sp) # Save return address

sw $fp, 16($sp) # Save old frame pointer

addiu $fp, $sp, 28 # Set up frame pointer

sw $a0, 0($fp) # Save argument (n)

# here we do the required calculation

# first check for terminal condition

bgtz $a0, $L2 # Branch if n > 0

li $v0, 1 # Return 1

j $L1 # Jump to code to return

# do recursion

$L2:

subiu $a0, $a0, 1 # subtract 1 from n

jal fact # jump to factorial function

# returning fact(n-1) in $v0

lw $v1, 0($fp) # Load n (saved earlier) into $v1

mul $v0, $v0, $v1 # compute (fact(n-1) * n)

# and return result in $v0

61

# restore saved registers and return

$L1: # result is in $2

lw $ra, 20($sp) # restore return address

lw $fp, 16($sp) # Restore old frame pointer

addiu $sp, $sp,32 # pop stack

jr $ra # return to calling program
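To make the role of the stack frames explicit, the following C sketch replaces the recursion with a simulated stack of 32-byte (8-word) frames. It is an illustration only: the names fact_frames and stack_mem, the stack capacity, and the use of word indices in place of byte addresses are assumptions, and only the saved argument n is modelled (not $ra or $fp):

```c
#include <stdint.h>

#define FRAME_WORDS 8    /* one 32-byte frame, as allocated by fact */
#define MAX_FRAMES  16   /* assumed capacity of the simulated stack */

static int32_t stack_mem[FRAME_WORDS * MAX_FRAMES];

/* Computes n! for 0 <= n <= 12 (the largest result that fits in 32 bits);
   any n < 1 yields 1, matching the C version of fact. */
int32_t fact_frames(int32_t n)
{
    int top = FRAME_WORDS * MAX_FRAMES;  /* word index of the empty stack */
    int sp = top;                        /* the stack grows downward */
    int32_t result = 1;                  /* value returned by the base case */

    /* Call phase: one frame per simulated entry into fact. */
    for (;;) {
        sp -= FRAME_WORDS;               /* allocate a 32-byte frame */
        stack_mem[sp + 7] = n;           /* sw $a0, 0($fp), with $fp = $sp + 28 */
        if (n < 1) {
            sp += FRAME_WORDS;           /* base case pops its own frame */
            break;
        }
        n -= 1;                          /* argument for the next call */
    }

    /* Return phase: each caller reloads its saved n and multiplies. */
    while (sp < top) {
        result *= stack_mem[sp + 7];     /* lw $v1, 0($fp); mul $v0, $v0, $v1 */
        sp += FRAME_WORDS;               /* pop the frame */
    }
    return result;
}
```

Each call of fact(n) in the assembly version corresponds to one pass through the first loop, and each return to one pass through the second.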

62

When is assembly language used?

Modern compilers optimize code so well that assembly language is

rarely used to increase performance. Consequently, in large computer

systems, assembly language is rarely used.

Today its main application is in small systems (typically single chip

microcontrollers) where some special function is being implemented,

or there is a need to meet some particular timing constraint.

Typically, such systems have limited memory for programs and data,

and are dedicated to performing a small number of very specific func-

tions.

These kinds of constraints are often typical of I/O functions, and it is

for this type of application that assembly language is still occasionally

useful.

Generally, a programmer will solve a problem using a higher level

language like C first (this makes the resulting code more portable).

Only if the timing or size constraints are not met will the programmer

resort to recoding part or all of the function in assembler.

63

Switching Functions - logic:

Many things can be described by two distinct states; for example,

a light can be “on” or “off;” a switch can be “open” or “closed;” a

statement can be “true” or “false.”

Devices which have exactly two distinct states, say “on” and “off,”

are often particularly simple to construct; in particular, electronic

devices with two distinct states are much simpler to construct than

devices with, say 10 states. A typical electronic device with 2 states

is a switch, which can be “on” (switch closed) or “off” (switch open).

A very effective switch can be made with a single transistor. Tran-

sistor switches can be very small; current commercial integrated cir-

cuit technology routinely manufactures devices containing many mil-

lions of these switches in a single integrated circuit, with each switch

capable of switching, or changing state, in a time of less than 0.1

nanosecond (abbreviated ns, 1 ns is the time required for light to

travel approximately 30 cm, or 1 foot.)

Since such binary (i.e., 2-state) devices are so simple, it is useful to

examine the kinds of operations which can be performed involving

only 2 states. An “algebra” of entities having exactly two states

(“true” and “false,” or “1” and “0”) was developed by the mathematician
George Boole, and later called Boolean algebra. This algebra
was applied to electronic switching circuits by Shannon, as a “switching
algebra.”

64

Exam logic — not what Boole intended?

65

We can define a switching algebra as an algebraic system consisting

of the set {0,1}, two binary operations called OR (+) and AND

(·) and one unary operation (denoted by an overbar, ¯) called

NOT, or complementation, or inversion.

These operations are defined as follows:

OR (+)        AND (·)       NOT (¯)
0 + 0 = 0     0 · 0 = 0     $\overline{0} = 1$
0 + 1 = 1     0 · 1 = 0     $\overline{1} = 0$
1 + 0 = 1     1 · 0 = 0
1 + 1 = 1     1 · 1 = 1

These relations are often expressed in “truth tables” as follows:

OR

A B A + B

0 0 0

0 1 1

1 0 1

1 1 1

AND

A B A · B

0 0 0

0 1 0

1 0 0

1 1 1

NOT

A   $\overline{A}$

0 1

1 0

Following are the “circuit symbols” for these functions:

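[Figure: circuit symbols for the AND gate (inputs A, B; output A · B), the OR gate (inputs A, B; output A + B), and the NOT gate (input A; output $\overline{A}$).]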

Note that the symbol for NOT is actually the ◦.

66

This switching algebra has a number of properties which can be

readily shown:

Idempotency A + A = A

A · A = A

Commutativity A +B = B + A

A · B = B · A

Associativity (A +B) + C = A + (B + C)

(A · B) · C = A · (B · C)

Distributivity A · (B + C) = A ·B + A · C

A + (B · C) = (A + B) · (A + C)

Absorption A + (A · B) = A

A · (A +B) = A

Consensus $A \cdot B + \overline{A} \cdot C + B \cdot C = A \cdot B + \overline{A} \cdot C$

$(A + B) \cdot (\overline{A} + C) \cdot (B + C) = (A + B) \cdot (\overline{A} + C)$
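Because each variable takes only the values 0 and 1, these properties can be checked by trying all eight combinations of A, B and C. The following C sketch does so; the function name properties_hold is an assumption, and the consensus identities are taken in their usual form, with A complemented in the second term:

```c
#include <stdbool.h>

/* Returns true if every listed identity holds for all 8 assignments of
   A, B, C.  Values are 0 or 1; | is OR, & is AND, and x ^ 1 is NOT x. */
bool properties_hold(void)
{
    for (int v = 0; v < 8; v++) {
        int a = (v >> 2) & 1, b = (v >> 1) & 1, c = v & 1;
        int na = a ^ 1;                       /* complement of A */

        bool ok =
            (a | a) == a && (a & a) == a                      /* idempotency */
            && (a | b) == (b | a) && (a & b) == (b & a)       /* commutativity */
            && ((a | b) | c) == (a | (b | c))                 /* associativity */
            && ((a & b) & c) == (a & (b & c))
            && (a & (b | c)) == ((a & b) | (a & c))           /* distributivity */
            && (a | (b & c)) == ((a | b) & (a | c))
            && (a | (a & b)) == a && (a & (a | b)) == a       /* absorption */
            && ((a & b) | (na & c) | (b & c)) == ((a & b) | (na & c))   /* consensus */
            && ((a | b) & (na | c) & (b | c)) == ((a | b) & (na | c));
        if (!ok)
            return false;
    }
    return true;
}
```

This kind of exhaustive check is exactly a comparison of truth tables, which is a valid proof for any identity in a finite number of switching variables.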

67

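Another useful property is De Morgan's theorem:

$\overline{A + B} = \overline{A} \cdot \overline{B}$

$\overline{A \cdot B} = \overline{A} + \overline{B}$

De Morgan's theorem can be generalized to

$\overline{F(A_1, A_2, \ldots, A_n, 0, 1, \cdot, +)} = F(\overline{A_1}, \overline{A_2}, \ldots, \overline{A_n}, 1, 0, +, \cdot)$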

That is, the complement of any expression can be obtained by re-

placing each variable and element with its complement, and at the

same time interchanging the OR and AND operators.
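As a small illustration, the C sketch below checks both two-variable forms of the theorem and one instance of the generalized rule. The example expression $F = A \cdot (B + \overline{C})$ and the function names are assumptions chosen for illustration; the rule predicts $\overline{F} = \overline{A} + (\overline{B} \cdot C)$:

```c
#include <stdbool.h>

/* Example expression F = A . (B + C') ... */
static int f(int a, int b, int c)
{
    return a & (b | (c ^ 1));
}

/* ... and the complement predicted by the generalized rule: every
   variable complemented, AND and OR interchanged: A' + (B' . C). */
static int f_complement(int a, int b, int c)
{
    return (a ^ 1) | ((b ^ 1) & c);
}

/* Returns true if the theorem holds for every combination of inputs. */
bool de_morgan_holds(void)
{
    for (int v = 0; v < 8; v++) {
        int a = (v >> 2) & 1, b = (v >> 1) & 1, c = v & 1;
        if (((a | b) ^ 1) != ((a ^ 1) & (b ^ 1))) return false;
        if (((a & b) ^ 1) != ((a ^ 1) | (b ^ 1))) return false;
        if ((f(a, b, c) ^ 1) != f_complement(a, b, c)) return false;
    }
    return true;
}
```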

Duality

Note that the preceding properties occur in pairs, and that

one member of each pair can be obtained from the other simply by replacing

the AND operator by the OR operator, and vice versa. This property

is called duality, and the operators AND and OR are said to be dual.

This property comes about because if the 1’s and 0’s are interchanged

in the definition of the AND function, the function becomes the OR

function; similarly, the OR function becomes the AND function. This

property is general, and if one theorem is shown to be true, then its

dual will also be true.

68

The circuit symbol notation can be extended to other logic gates; for

example, the following represents the functions NAND (not AND)

and NOR (not OR):

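[Figure: circuit symbols for the NOR gate (inputs A, B; output $\overline{A + B}$) and the NAND gate (inputs A, B; output $\overline{A \cdot B}$).]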

Note that the NOT function is represented by the ◦.

N-input OR and N-input AND gates are represented by the symbols:

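[Figure: circuit symbols for the n-input OR gate (inputs $A_1, A_2, \ldots, A_n$; output $A_1 + A_2 + \ldots + A_n$) and the n-input AND gate (inputs $A_1, A_2, \ldots, A_n$; output $A_1 \cdot A_2 \cdots A_n$).]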

There is another commonly used circuit symbol, the exclusive-OR

function, denoted by the symbol ⊕. It is defined as follows:

A B A⊕ B

0 0 0

0 1 1

1 0 1

1 1 0

[Figure: circuit symbol for the exclusive-OR gate, with inputs A and B and output A ⊕ B.]

69

Switch implementation of switching functions:

The functions NOT, AND and OR can be implemented with simple

switches. In fact, in digital electronic circuits, transistors are used as

simple switches in circuits similar to those which follow.

Note that the power supplied to the circuit is shown, (a battery), as

is a device to detect the output (a lamp). They are not part of the

logic, but are required to make the switching logic useful.

In the AND function, the two switches are in series with each other;

in the OR function, the two switches are connected in parallel. For

the NOT function, the switch is connected in parallel with the output

(the lamp). NAND and NOR gates can be constructed similarly.

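[Figure: switch circuits, each with a battery and a lamp: (a) NOT gate, with switch A in parallel with the lamp; (b) AND gate, with switches A and B in series; (c) OR gate, with switches A and B in parallel.]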

These circuits can be combined to form more complex switching

functions. (If you have not seen it before, try to construct a simple

switching circuit for the XOR function).

Note that the inputs for these simple switches are mechanical; e.g.

the press of a finger. For electronic switches such as transistors, the

inputs can be the outputs of other logic functions, so very complex

logic circuits can be designed which operate “automatically”.

70

Canonical forms of switching functions:

It is possible to construct a truth table for any switching function

(i.e., any function of switching variables.)

The truth table provides a complete, unique description of the switch-

ing function, but it is cumbersome.

We can derive from the truth table a unique expression which

defines the function exactly; in fact, the expression is exactly equivalent

to the truth table.

One such expression is called the minterm form of the expression,

or, alternately, the sum of products (SOP) form.

e.g., for the function Y = A⊕B, the truth table is:

A B Y = A⊕ B

0 0 0

0 1 1

1 0 1

1 1 0

This is equivalent to

$Y = \overline{A} \cdot B + A \cdot \overline{B}$

This is the minterm form of the function. It is obtained by ORing

together all the minterms.

Minterms are the AND terms corresponding to each 1 in the function

column.

71

A minterm is obtained for each row which has a 1 in the function
column, by ANDing together all the variables or their complements.
If a variable has value 1 in that row, the variable is taken; if not,
its complement is taken.

The minterms are then ORed together to give the function specified

in the truth table.

Note:

1. Each minterm contains all the variables or their complements,

exactly once.

2. Each minterm is unique, except for permutation of the variables.

Therefore, the minterm form of the function is unique.

3. Any expression which contains only variables in the minterm

(sum of products) form, where each product term contains all

the variables, or their complement, exactly once, is a minterm

expression. This means that, no matter how a function is de-

rived, if it contains only minterms then it must be a minterm

form of the function.

72

A dual form of the preceding, called a maxterm form, or product

of sums (POS) form can also be written. The maxterm form of

a function can be obtained from the truth table by applying the

principle of duality to the way described previously for deriving the

minterm form of a function.

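Equivalently, we can write down the minterm expression for the complement
of the function, $\overline{Y}$, and apply De Morgan's theorem; e.g.,

A  B  Y = A ⊕ B  $\overline{Y} = \overline{A \oplus B}$
0  0  0          1
0  1  1          0
1  0  1          0
1  1  0          1

In minterm form,

$\overline{Y} = \overline{A} \cdot \overline{B} + A \cdot B$

Complementing both sides,

$Y = \overline{\overline{Y}} = \overline{\overline{A} \cdot \overline{B} + A \cdot B}$

Applying De Morgan's theorem,

$Y = \overline{(\overline{A} \cdot \overline{B})} \cdot \overline{(A \cdot B)}$

$= (A + B) \cdot (\overline{A} + \overline{B})$

This is the maxterm form of the switching function Y = A ⊕ B.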

73

The maxterm form can more easily be obtained from the truth table

by ORing together all the variables or their complements which give

a zero for the function; if the variable has value 0 then it is ORed

directly, if it has a value 1, it is complemented.

Each term is called a maxterm, or a sum term. The function is equal

to the AND of all the maxterms.

Example: Find the minterm and maxterm expressions correspond-

ing to the following truth table:

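Row  A B C  Y  Minterms                                                 Maxterms
0    0 0 0  1  $\overline{A} \cdot \overline{B} \cdot \overline{C}$
1    0 0 1  0                                                           $A + B + \overline{C}$
2    0 1 0  1  $\overline{A} \cdot B \cdot \overline{C}$
3    0 1 1  1  $\overline{A} \cdot B \cdot C$
4    1 0 0  0                                                           $\overline{A} + B + C$
5    1 0 1  0                                                           $\overline{A} + B + \overline{C}$
6    1 1 0  1  $A \cdot B \cdot \overline{C}$
7    1 1 1  1  $A \cdot B \cdot C$

Minterm form:

$Y = \overline{A} \cdot \overline{B} \cdot \overline{C} + \overline{A} \cdot B \cdot \overline{C} + \overline{A} \cdot B \cdot C + A \cdot B \cdot \overline{C} + A \cdot B \cdot C$

Maxterm form:

$Y = (A + B + \overline{C}) \cdot (\overline{A} + B + C) \cdot (\overline{A} + B + \overline{C})$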

74

Sometimes the minterm and maxterm expressions are written in a
kind of “shorthand,” where the values (0 or 1) of the set of variables
are used to form a binary number, the decimal equivalent of which
designates the appropriate minterm or maxterm; e.g., the minterm
form of the previous function is written as:

Y =∑(0, 2, 3, 6, 7)

The maxterm form is written as:

Y =∏(1, 4, 5)

Note that the order in which the variables are written down in the

truth table is important, in this case. The numbers which appear

in the minterm form do not appear in the maxterm form, and vice

versa.
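The shorthand lists can be read straight off the function column of the truth table. A minimal C sketch, in which the function name canonical_indices and its array-based interface are assumptions made for illustration:

```c
/* truth[r] holds the value of Y (0 or 1) for row r of the truth table,
   with rows numbered by the binary value of the input variables.
   Row numbers with Y = 1 go into minterms[], those with Y = 0 into
   maxterms[]; the counts are returned through n_min and n_max.
   Both output arrays must have room for `rows` entries. */
void canonical_indices(const int truth[], int rows,
                       int minterms[], int *n_min,
                       int maxterms[], int *n_max)
{
    *n_min = 0;
    *n_max = 0;
    for (int r = 0; r < rows; r++) {
        if (truth[r])
            minterms[(*n_min)++] = r;   /* contributes to the sum list */
        else
            maxterms[(*n_max)++] = r;   /* contributes to the product list */
    }
}
```

For the function column 1, 0, 1, 1, 0, 0, 1, 1 of the preceding example, this yields the minterm list 0, 2, 3, 6, 7 and the maxterm list 1, 4, 5.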

The minterm or maxterm form of the function is not usually the sim-

plest or most concise; e.g., the preceding function could be simplified

to the following:

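$Y = \overline{A} \cdot \overline{B} \cdot \overline{C} + \overline{A} \cdot B \cdot \overline{C} + \overline{A} \cdot B \cdot C + A \cdot B \cdot \overline{C} + A \cdot B \cdot C$

$= \overline{A} \cdot \overline{C} + B$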

Systematic ways exist to reduce the complexity of minterm or max-

term forms of switching functions but we will not discuss them here.

The problem is computationally complex (NP-hard).
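A simplification such as this one can at least be checked mechanically against the truth table. The C sketch below compares the simplified form $\overline{A} \cdot \overline{C} + B$ with the function column 1, 0, 1, 1, 0, 0, 1, 1 of the example; the function names are assumptions:

```c
#include <stdbool.h>

/* Simplified form of the example function: Y = A'.C' + B */
int y_simplified(int a, int b, int c)
{
    return ((a ^ 1) & (c ^ 1)) | b;
}

/* Compares the simplified form with the function column of the truth
   table, row by row (row number = binary value of A B C). */
bool simplification_matches(void)
{
    static const int y_table[8] = {1, 0, 1, 1, 0, 0, 1, 1};
    for (int row = 0; row < 8; row++) {
        int a = (row >> 2) & 1, b = (row >> 1) & 1, c = row & 1;
        if (y_simplified(a, b, c) != y_table[row])
            return false;
    }
    return true;
}
```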

75

Practical examples

(1) Design of a half adder

A binary half adder is a switching circuit which will add together two

binary digits (called binary bits), producing two output bits, a sum

bit, S, and a carry bit, C. It has the following truth table:

A B S C

0 0 0 0

0 1 1 0

1 0 1 0

1 1 0 1

It is immediately obvious from the truth table that the two functions

S and C can be implemented as S = A ⊕ B, and C = A · B as

shown in the following, which also shows a logic symbol for a half

adder. (Logic symbols for devices more complex than the basic logic

gates are usually just boxes with appropriately labeled inputs and

outputs).

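[Figure: (a) half adder built from an exclusive-OR gate producing S and an AND gate producing C, both fed by inputs A and B; (b) logic symbol for the half adder, a box with inputs A, B and outputs C, S.]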
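The half adder equations translate directly into C bit operations. In this sketch the function name half_adder and the use of pointer parameters for the two outputs are assumptions:

```c
/* Half adder: a and b are single bits (0 or 1).
   *s receives the sum bit, *c the carry bit. */
void half_adder(int a, int b, int *s, int *c)
{
    *s = a ^ b;   /* S = A xor B */
    *c = a & b;   /* C = A and B */
}
```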

76

(2) Design of a full adder

A binary full adder is a switching circuit which will add together two

binary digits (called binary bits), and a third bit called a carry bit

which may have come from a previous full adder. It produces both

a sum bit and a carry bit as output. The full adder therefore has 3

inputs A, B and C where C is the carry bit, and 2 outputs; the sum,

S, and the carry C+. It is possible to connect N such full adders

together to add two N bit numbers.

It should be immediately obvious that the sum bit for a full adder

can be obtained using two half adders, one which adds digits A and

B together producing, say, Z as the sum and the other adding Z and

C together to form the sum of A, B and C. Clearly, a carry output

should be produced when either of the half adders produces a carry, so

the carry output for the full adder can be obtained by ORing together

the carry outputs of the two half adders. Such an implementation of a

full adder is shown in the following:

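[Figure: full adder built from two half adders. The first adds A and B, producing the sum Z; the second adds Z and C, producing S. An OR gate combines the two carry outputs to produce C+.]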
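The construction from two half adders and an OR gate can be sketched in C as follows; the function names and the pointer-based outputs are assumptions made for illustration:

```c
/* Half adder: sum and carry of two single bits. */
void half_adder(int a, int b, int *s, int *c)
{
    *s = a ^ b;
    *c = a & b;
}

/* Full adder built from two half adders and an OR gate.
   a, b and c_in are single bits; *s is the sum, *c_out the carry C+. */
void full_adder(int a, int b, int c_in, int *s, int *c_out)
{
    int z, carry1, carry2;
    half_adder(a, b, &z, &carry1);      /* Z = A xor B */
    half_adder(z, c_in, s, &carry2);    /* S = Z xor C */
    *c_out = carry1 | carry2;           /* OR of the two carry outputs */
}
```

The two half adder carries can never both be 1, so ORing them loses nothing.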

77

The preceding implementation relied on our knowledge of the half

adder.

We will now consider the design of a full adder starting from its

description as a truth table:

A B C S C+

0 0 0 0 0

0 0 1 1 0

0 1 0 1 0

0 1 1 0 1

1 0 0 1 0

1 0 1 0 1

1 1 0 0 1

1 1 1 1 1

We can write the outputs in minterm form as:

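$S = \overline{A} \cdot \overline{B} \cdot C + \overline{A} \cdot B \cdot \overline{C} + A \cdot \overline{B} \cdot \overline{C} + A \cdot B \cdot C$

$C_+ = \overline{A} \cdot B \cdot C + A \cdot \overline{B} \cdot C + A \cdot B \cdot \overline{C} + A \cdot B \cdot C$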

These functions can be implemented directly as shown in the next

slide.

78

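[Figure: two-level AND-OR implementation of the full adder. Four 3-input AND gates, one per minterm of S, feed a 4-input OR gate producing S; four more, one per minterm of C+, feed a second 4-input OR gate producing C+.]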

Note that this implementation of the full adder requires more logic

gates than the implementation shown earlier, but the circuit is imple-

mented using only three “levels” of logic; a NOT level (not shown),

an AND level and an OR level. The previous implementation would

require more logic levels if it were implemented using only AND-OR-NOT

(AON) logic. The previous implementation would consequently be

the “slower” of the two, due to the inherent delay in each logic

gate.

The function S cannot be simplified any further using only AON

logic, but C+ can be rewritten as:

C+ = B · C + A · C + A ·B
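That the shorter expression for C+ describes the same function as the four-minterm form can be confirmed by comparing the two over all eight input combinations. A C sketch, in which the function name carry_forms_agree is an assumption:

```c
#include <stdbool.h>

/* Returns true if the minterm form and the simplified form of C+ agree
   for every combination of A, B and C, and both equal the arithmetic
   carry (1 when at least two of the three inputs are 1). */
bool carry_forms_agree(void)
{
    for (int v = 0; v < 8; v++) {
        int a = (v >> 2) & 1, b = (v >> 1) & 1, c = v & 1;
        int na = a ^ 1, nb = b ^ 1, nc = c ^ 1;

        int minterm_form = (na & b & c) | (a & nb & c) | (a & b & nc) | (a & b & c);
        int simplified   = (b & c) | (a & c) | (a & b);

        if (minterm_form != simplified)
            return false;
        if (simplified != ((a + b + c) >= 2))
            return false;
    }
    return true;
}
```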

79

Both the half adder and the full adder are useful functional blocks.

In particular, the full adder is often used as the basic building block

to construct larger adders which add many bits simultaneously. The

following figure shows the implementation of a four bit adder, which

adds together two four bit numbers (A3A2A1A0, and B3B2B1B0)

and produces a 5 bit result (S4S3S2S1S0), using four full adders.

In general, n full adders are required to implement an adder for

two n-bit words. The general expression for the sum and carry bits

generated in the ith addition stage is

$S_i = (A_i \oplus B_i) \oplus C_i$

$C_{i+1} = A_i \cdot B_i + ((A_i + B_i) \cdot C_i)$

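[Figure: four bit ripple carry adder built from four full adders. Stage i takes inputs Ai and Bi and produces Si; the carry input of stage 0 is 0, the carry output C+ of each stage feeds the carry input C of the next, and the carry output of stage 3 is S4.]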

80

Note that, for this type of adder, before the result of the add opera-

tion is correct, the carry result must be allowed to propagate through

all of the full adders. Because of this, this implementation of a wide

word adder is called a ripple carry adder.
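A minimal C sketch of the four bit ripple carry adder, assuming the stage equations $S_i = (A_i \oplus B_i) \oplus C_i$ and $C_{i+1} = A_i \cdot B_i + (A_i + B_i) \cdot C_i$ given earlier. The function name ripple_add4 is only illustrative. The loop visits the stages in order, which mirrors the way the carry must ripple from one full adder to the next:

```c
/* Adds two 4-bit numbers a and b (0..15) one bit at a time.
   Returns the 5-bit result S4 S3 S2 S1 S0. */
unsigned ripple_add4(unsigned a, unsigned b)
{
    unsigned carry = 0;   /* carry input of stage 0 is 0 */
    unsigned sum = 0;

    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1u;
        unsigned bi = (b >> i) & 1u;
        unsigned si = (ai ^ bi) ^ carry;             /* sum bit of stage i */
        carry = (ai & bi) | ((ai | bi) & carry);     /* carry into stage i+1 */
        sum |= si << i;
    }
    return sum | (carry << 4);                       /* final carry is S4 */
}
```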

It is possible, of course, to design an adder which has no such ripple

at all, simply by creating a truth table for each bit of the n-bit

adder and implementing each bit from, say, its minterm form. This
becomes quite tedious, however, for large n (specification of an 8-bit
adder would require 9 truth tables, each with 2^16 = 65,536 lines,
since the two 8-bit operands supply 16 input bits).

It is also possible to devise logic functions which generate only
the n carry bits for an add operation on two n bit numbers. These
functions are called carry look-ahead functions, and are commonly
used in the construction of fast wide word adders. Such carry look-ahead
adders are commonly used to implement the add operations on
the fastest computers. As we will see, it is possible to connect the
carry look-ahead units in a tree-like fashion to generate the carries
in much less time than is required in the ripple carry adder.

81

Carry look-ahead adders

Logic expressions for this “carry look-ahead” function can be derived

from the logic functions for the full adder. Recall that, for two n-bit
words $A = (A_{n-1}, A_{n-2}, \ldots, A_1, A_0)$ and $B = (B_{n-1}, B_{n-2}, \ldots, B_1, B_0)$,
we saw earlier that the bit $S_i$ for the ith digit of the sum S was

$S_i = (A_i \oplus B_i) \oplus C_i$

and the carry $C_{i+1}$ was

$C_{i+1} = A_i \cdot B_i + (A_i + B_i) \cdot C_i$

The expression for the carry, $C_{i+1}$, can be rewritten as

$C_{i+1} = G_{i+1} + P_{i+1} \cdot C_i$

where $G_{i+1} = A_i \cdot B_i$ is the carry generation term

and $P_{i+1} = A_i + B_i$ is the carry propagation term.

Note that $G_{i+1}$ and $P_{i+1}$ depend only on $A_i$ and $B_i$.

From this recurrence relation, we see that, if we have an initial carry
in $C_0 = G_0$, then

$C_1 = G_1 + P_1 \cdot G_0$

$C_2 = G_2 + P_2 \cdot (G_1 + P_1 \cdot G_0)$

$= G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0$

$C_3 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$

...
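The expanded carry expressions can be sketched in C for a four bit adder. The sketch keeps the indexing used above, $G_{i+1} = A_i \cdot B_i$, $P_{i+1} = A_i + B_i$ and $C_0 = G_0$; the function name cla_add4 and the array layout are assumptions. Each carry is formed as a sum of products of G and P terms only, never from another carry. The loops are sequential only because this is software; in hardware every product term is evaluated at the same time:

```c
/* Adds two 4-bit numbers a and b with initial carry c0 (0 or 1), using
   the expanded look-ahead form
       C_k = G_k + P_k.G_(k-1) + ... + P_k.P_(k-1)...P_1.G_0
   Returns the 5-bit result; bit 4 is the carry out C_4. */
unsigned cla_add4(unsigned a, unsigned b, unsigned c0)
{
    unsigned g[5], p[5], c[5];

    g[0] = c0 & 1u;   /* C_0 = G_0 is the initial carry */
    p[0] = 0;         /* P_0 is never used */
    for (int i = 0; i < 4; i++) {
        g[i + 1] = ((a >> i) & (b >> i)) & 1u;   /* G_(i+1) = A_i . B_i */
        p[i + 1] = ((a >> i) | (b >> i)) & 1u;   /* P_(i+1) = A_i + B_i */
    }

    for (int k = 0; k <= 4; k++) {
        c[k] = 0;
        for (int j = 0; j <= k; j++) {
            unsigned term = g[j];                /* G_j ... */
            for (int m = j + 1; m <= k; m++)
                term &= p[m];                    /* ... ANDed with P_(j+1)..P_k */
            c[k] |= term;                        /* OR the product terms */
        }
    }

    unsigned sum = 0;
    for (int i = 0; i < 4; i++) {
        unsigned si = (((a >> i) ^ (b >> i)) & 1u) ^ c[i];   /* S_i */
        sum |= si << i;
    }
    return sum | (c[4] << 4);
}
```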

82

Note that the product terms for the ith carry bit correspond to AND

gates with inputs numbering from 2 to i + 1; consequently for large i the

AND gates will require a large number of inputs. (They cannot be

connected in series because this would reintroduce a ripple effect.)

Fortunately, carry look-ahead units can be cascaded in a kind of tree

fashion, as shown below.

The fact that this cascading is possible is apparent from the original
relation, $C_{i+1} = G_{i+1} + P_{i+1} \cdot C_i$. All that is necessary at any
level, say, l, is to have the term $C_{l-1}$ available. This can come from a
previous level of carry look-ahead units, and replaces the initial carry
input, $C_0$. (Note that the adders used with a carry look-ahead unit
should have outputs for generate, G, and propagate, P, rather than
for the carry, $C_{out}$.)

The following figure shows an implementation of part (half) of a 16

bit adder, using 4-bit carry look-ahead functions:

[Figure: the low-order 8 bits of the 16 bit adder. Eight one-bit adders (inputs Ai, Bi and carry C; outputs Si, P and G), for bits 0 to 7, are arranged in two groups of four. Each group feeds a 4-bit carry look-ahead unit (inputs Cin and P0 G0 to P3 G3; outputs C1, C2, C3, Pout and Gout), and the Pout and Gout outputs of the groups feed a second-level carry look-ahead unit of the same type.]

83

Combinational Logic — Using MSI circuits:

When designing logic circuits, the “discrete logic gates”; i.e., individual
AND, OR, NOT, etc. gates, are often neither the simplest nor the
most effective devices we could use. There are available many standard
MSI (medium scale integrated) functions which can do many of

the things commonly required in logic circuits.

These devices, or similar devices, are often used as components of

“programmable logic devices.”

The digital multiplexer

One MSI function which has been available for a long time is the

digital selector, or multiplexer. It is the digital equivalent of the

rotary switch or selector switch (e.g., the channel selector on a TV

set). Its function is to accept a binary number as a “selector input,”

and present the logic level connected to that input line as output

from the data selector.

A circuit diagram for a possible 4-line to 1-line data selector/multiplexer

(abbreviated as MUX for multiplexer) is shown in the following slide.

Here, the output Y is equal to the input I0, I1, I2, I3 depending on

whether the select lines S1 and S0 have values 00, 01, 10, 11 for S1

and S0 respectively. That is, the output Y is selected to be equal to

the input of the line given by the binary value of the select lines (or

address) S1S0.

84

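[Figure: gate-level circuit of a 4-line to 1-line MUX. Inverters produce the complements of the select lines S1 and S0; four AND gates each combine one data input (I0 to I3) with the matching combination of select signals, and an OR gate combines the four AND outputs to give Y.]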

The logic equation for this 4-line to 1-line MUX is:

$Y = I_0 \cdot \overline{S_1} \cdot \overline{S_0} + I_1 \cdot \overline{S_1} \cdot S_0 + I_2 \cdot S_1 \cdot \overline{S_0} + I_3 \cdot S_1 \cdot S_0$

This device can be used simply as a data selector/multiplexer, or it

can be used to perform logic functions. Its simplest application is to

implement a truth table directly; e.g., with a 4 line to 1 line MUX, it

is possible to implement any 2-variable function directly, simply by

connecting I0, I1, I2, I3 to logic 1 or logic 0, as dictated by a truth

table. In this way, a MUX can be used as a simple look-up table

for switching functions. This facility makes the MUX a very general

purpose logic device.

Connecting the inputs to a 4-bit memory makes the device a pro-

grammable logic device.
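The look-up table behaviour can be sketched in C. The function mux4 follows the logic equation of the 4-line to 1-line MUX term by term; the name xnor_via_mux and its choice of data inputs (1, 0, 0, 1) are assumptions picked to illustrate wiring a truth table onto the data lines:

```c
/* 4-line to 1-line MUX: all arguments are single bits (0 or 1).
   Y = I0.S1'.S0' + I1.S1'.S0 + I2.S1.S0' + I3.S1.S0 */
int mux4(int i0, int i1, int i2, int i3, int s1, int s0)
{
    int ns1 = s1 ^ 1, ns0 = s0 ^ 1;
    return (i0 & ns1 & ns0) | (i1 & ns1 & s0)
         | (i2 & s1 & ns0)  | (i3 & s1 & s0);
}

/* A 2-variable function implemented as a look-up table: the data inputs
   are tied to the function column of the truth table (here 1, 0, 0, 1)
   and the variables drive the select lines. */
int xnor_via_mux(int a, int b)
{
    return mux4(1, 0, 0, 1, a, b);
}
```

Changing the four constants changes the function without changing the circuit, which is what makes the device programmable when the constants come from a memory.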

85

Example: Use a 4 line to 1 line MUX to implement the function

shown in the following truth table ($Y = \overline{A} \cdot \overline{B} + A \cdot B$)

A B Y

0 0 1 = I0

0 1 0 = I1

1 0 0 = I2

1 1 1 = I3

[Figure: 4-line to 1-line MUX with data inputs I3 = 1, I2 = 0, I1 = 0, I0 = 1, select inputs S1 = A and S0 = B, and output Y.]

Simply connecting I0 = 1, I1 = 0, I2 = 0, I3 = 1, and the inputs A

and B to the S1 and S0 selector inputs of the 4-line to 1-line MUX

implements this truth table, as shown above.

The 4-line to 1-line MUX can also be used to implement any function
of three logical variables. To see this, we need note only that
the only possible functions of one variable C are C, $\overline{C}$, and the
constants 0 and 1 (i.e., C, $\overline{C}$, $C + \overline{C} = 1$, and 0). We need only
connect the appropriate value, C, $\overline{C}$, 0 or 1, to I0, I1, I2, I3 to

obtain a function of 3 variables. The MUX still behaves as a table

lookup device; it is now simply looking up values of another variable.

86

Example: Implement the function

$Y(A, B, C) = \overline{A} \cdot \overline{B} \cdot C + \overline{A} \cdot B \cdot \overline{C} + A \cdot \overline{B} \cdot \overline{C} + A \cdot B \cdot C$

using a 4-line to 1-line MUX.

Here, again, we use the A and B variables as data select inputs. We
can use the above equation to construct the table shown below.

The residues are what is “left over” in each minterm when the “address”
variables are taken away. To implement this circuit, we connect
I0 and I3 to C, and I1 and I2 to $\overline{C}$, as shown:

Input   “Address”                                Other variables (residues)
I0      $\overline{A} \cdot \overline{B}$        C
I1      $\overline{A} \cdot B$                   $\overline{C}$
I2      $A \cdot \overline{B}$                   $\overline{C}$
I3      $A \cdot B$                              C

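[Figure: 4-line to 1-line MUX with data inputs I0 to I3 connected to C or its complement as listed in the table above, select inputs S1 = A and S0 = B, and output Y.]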

In general a 4 input MUX can give any function of 3 variables, an 8

input MUX can give any function of 4 variables, and a 16 input

MUX, any function of 5 variables.
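As a further sketch of the residue method (a different function from the worked example, chosen here for illustration), the carry output of the full adder, $C_+ = B \cdot C + A \cdot C + A \cdot B$, can be built from one 4-line to 1-line MUX with A and B as the select inputs. The residues are 0 for A B = 00, C for 01 and 10, and 1 for 11. The function names are assumptions:

```c
/* 4-line to 1-line MUX on single-bit values. */
int mux4(int i0, int i1, int i2, int i3, int s1, int s0)
{
    int ns1 = s1 ^ 1, ns0 = s0 ^ 1;
    return (i0 & ns1 & ns0) | (i1 & ns1 & s0)
         | (i2 & s1 & ns0)  | (i3 & s1 & s0);
}

/* Full adder carry as a 3-variable function on a 4-input MUX:
   A and B select the data input; the data inputs carry the residues
   I0 = 0, I1 = C, I2 = C, I3 = 1. */
int carry_via_mux(int a, int b, int c)
{
    return mux4(0, c, c, 1, a, b);
}
```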

87

Example: Use an 8 input MUX to implement the following equa-

tion:

Y = A · B · C ·D + A ·B · C ·D + A · B · C ·D + A ·B · C ·D +

A · B · C ·D + A ·B · C ·D + A · B · C ·D + A ·B · C ·D

Again, we will use A, B, C as data select inputs, or address inputs,

connected to S2, S1 and S0, respectively.

Input Address Residues

I0 A · B · C D

I1 A · B · C D

I2 A · B · C D +D = 1

I3 A · B · C

I4 A · B · C D

I5 A · B · C D

I6 A · B · C D +D = 1

I7 A · B · C

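[Figure: 8-line to 1-line MUX with select inputs S2, S1, S0 driven by A, B, C, data inputs I0 to I7 connected as listed in the table above (logic 0 where no residue is listed), and output Y.]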

Values of the address set A, B, C with no residues corresponding to

the address in the above table must have logic value 0 connected to

the corresponding data input. The select variables A, B, C must be

connected to S2, S1,and S0, respectively.

88

MUX “trees”

In practice, about 16 line to 1 line MUX’s are the largest which can

be reasonably constructed as a single circuit.

It is possible to use a “tree” of smaller MUX’s to make arbitrarily

large MUX’s. The following shows an implementation of a 16 line to

1 line MUX using five 4 line to 1 line MUX’s.

[Figure: 16 line to 1 line MUX built as a tree. Four 4 line to 1 line MUX's take the data inputs I0 to I15, four each, and share the select lines S1 and S0; their outputs feed inputs I0 to I3 of a fifth 4 line to 1 line MUX, whose select inputs are driven by S3 and S2.]

89

Decoders (demultiplexers):

Another commonly used MSI device is the decoder. Decoders, in

general, transform a set of inputs into a different set of outputs,

which are coded in a particular manner; e.g., certain decoders are

designed to decode binary or BCD coded numbers and produce the

correct output to display a digit on a 7 segment (calculator type)

display.

Normally, however, the term “decoder” implies a device which
performs, in a sense, the inverse operation of a multiplexer. A
decoder accepts an n digit binary number as its n “select” inputs and
produces an output (usually a logic 0) at one of its 2^n possible
outputs. Decoders are usually referred to as n line to 2^n line
decoders; e.g., a 3 line to 8 line decoder. This type of decoder is
really a kind of binary to unary decoder.

Most decoders have inverted outputs, so the selected output is set to
logic 0, while all the other outputs remain at logic 1. As well, most
decoders have an “enable” input Ē, which “enables” the operation of
the decoder: when the Ē input is set to 0, the device behaves as a
decoder and selects the output determined by the select inputs; when
the Ē input is set to 1, the outputs of the decoder are all set to 1.
(The bar over the E indicates that it is an “active low” input; that
is, a logic 0 enables the function.) The enable input allows decoders
to be connected together in a treelike fashion, much as we saw for
MUX’s.

90

A typical 3 line to 8 line decoder with an enable input behaves ac-

cording to the following truth table, and has the circuit symbol as

shown.

Ē  S2 S1 S0 | O0 O1 O2 O3 O4 O5 O6 O7
1  x  x  x  | 1  1  1  1  1  1  1  1
0  0  0  0  | 0  1  1  1  1  1  1  1
0  0  0  1  | 1  0  1  1  1  1  1  1
0  0  1  0  | 1  1  0  1  1  1  1  1
0  0  1  1  | 1  1  1  0  1  1  1  1
0  1  0  0  | 1  1  1  1  0  1  1  1
0  1  0  1  | 1  1  1  1  1  0  1  1
0  1  1  0  | 1  1  1  1  1  1  0  1
0  1  1  1  | 1  1  1  1  1  1  1  0

[Figure: circuit symbol for the 3 line to 8 line decoder, with enable
input Ē, select inputs S2, S1, S0, and inverted outputs O0 to O7.]

Note that, when the decoder is enabled (Ē = 0), an output of 0 is
produced corresponding to each minterm of S2, S1, S0. These minterms
can be combined using other logic gates to form any required logic
function of the input variables. In fact, the minterms can be used to
produce several functions at the same time.

Using De Morgan's theorem, we can see that when the outputs are
inverted, as is normally the case, the minterm form of the function
can be obtained by NANDing the required terms together.
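The NAND step can be checked with a small behavioural model. This is a Python sketch under stated assumptions: the names `decoder_3to8`, `nand`, and `function_from_minterms` are invented for illustration, and the decoder follows the truth table given above (active-low enable and active-low outputs).

```python
def decoder_3to8(enable_n, s2, s1, s0):
    """3 line to 8 line decoder with active-low enable and active-low outputs."""
    if enable_n == 1:                 # disabled: every output stays at 1
        return [1] * 8
    selected = 4 * s2 + 2 * s1 + s0
    return [0 if i == selected else 1 for i in range(8)]


def nand(*inputs):
    """NAND gate: 0 only when every input is 1."""
    return 0 if all(inputs) else 1


def function_from_minterms(minterms, s2, s1, s0):
    """Sum of the given minterms, formed by NANDing inverted decoder outputs."""
    outputs = decoder_3to8(0, s2, s1, s0)
    return nand(*(outputs[m] for m in minterms))
```

Because each selected output is 0, the NAND gate sees a 0 exactly when one of its minterms is selected, and so outputs 1 in exactly those cases.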

91

Example: An implementation of the functions defined by the following
truth table using a decoder and NAND gates is shown below:

A B C | Y1 Y2
0 0 0 | 0  1
0 0 1 | 1  1
0 1 0 | 1  0
0 1 1 | 0  0
1 0 0 | 1  0
1 0 1 | 0  1
1 1 0 | 0  1
1 1 1 | 0  0

[Figure: 3 line to 8 line decoder with A, B, C connected to S2, S1,
S0; two NAND gates combine the decoder outputs to produce Y1 and Y2.]

Note that additional functions of the same variables would only re-

quire another NAND gate for each function.
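The example can be verified exhaustively. In this Python sketch the minterm lists are read directly from the truth table above; the helper names `decode`, `nand`, and `y1_y2` are assumptions made for illustration.

```python
def decode(a, b, c):
    """Active-low 3 line to 8 line decoder (always enabled); A is the high-order bit."""
    selected = 4 * a + 2 * b + c
    return [0 if i == selected else 1 for i in range(8)]


def nand(*inputs):
    """NAND gate: 0 only when every input is 1."""
    return 0 if all(inputs) else 1


Y1_MINTERMS = (1, 2, 4)       # truth-table rows where Y1 = 1
Y2_MINTERMS = (0, 1, 5, 6)    # truth-table rows where Y2 = 1


def y1_y2(a, b, c):
    """One decoder shared by both functions, one NAND gate per function."""
    o = decode(a, b, c)
    y1 = nand(*(o[m] for m in Y1_MINTERMS))
    y2 = nand(*(o[m] for m in Y2_MINTERMS))
    return y1, y2
```

A third function of A, B, C would need only one more minterm list and one more `nand` call, which mirrors the remark about adding one gate per function.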

92

Read only memory (ROM):

Often, devices requiring 8 or more input variables are implemented

using a ROM. A simple type of ROM can be constructed from a

decoder, a MUX, and a number of wires. We will look at a small (16

bit) ROM constructed in this way. Normally, memory is arranged in

a square array, as shown in the following slide. This general organi-

zation is used for other types of memory, as well.

To use the ROM to implement a logic function, the address lines

are used as the variable inputs, and the contents of the memory

are the function values.

Usually the memory has a word length of more than 1 bit; typically

4 or 8 bits, so several functions can be implemented simultaneously.

In the following figure, it is assumed that the decoder produces a

logic 1 as output when the input code selects that output; otherwise

it produces logic 0, and that a logic 1 output “outvotes” a logic 0

output, in the sense that if both are present on the same wire, the

logic 1 will dominate. (“Real” circuits usually have the opposite

behavior). A bit is “programmed” when a link is present.

93

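[Figure: 16 bit ROM arranged as a 4 by 4 array. A decoder, with
select inputs S1 and S0 driven by A and B (address lines A3 and A2),
drives the rows O0 to O3; a MUX, with select inputs S1 and S0 driven
by C and D (address lines A1 and A0), picks one of the columns I0 to
I3 as the output Y. Links at the row and column intersections mark
the programmed bits.]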

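In this example, the function is

$$Y = \overline{A}\cdot\overline{B}\cdot C\cdot\overline{D} + \overline{A}\cdot B\cdot C\cdot\overline{D} + A\cdot\overline{B}\cdot\overline{C}\cdot D + A\cdot\overline{B}\cdot C\cdot\overline{D}$$

corresponding to memory locations 0010, 0110, 1001, and 1010.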
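Viewed as software, the ROM is simply a lookup table indexed by the address. The following Python sketch is a minimal model, assuming the address bits are taken in the order A, B, C, D with A as the high-order bit, that A and B select the row, and that C and D select the column; the names `ROM_BITS` and `rom_read` are illustrative.

```python
# 16 one-bit locations; a 1 marks a programmed link.
ROM_BITS = [0] * 16
for location in (0b0010, 0b0110, 0b1001, 0b1010):
    ROM_BITS[location] = 1


def rom_read(a, b, c, d):
    """Read the bit stored at address ABCD."""
    row = 2 * a + b           # the decoder selects one of four rows
    column = 2 * c + d        # the MUX selects one of four columns
    return ROM_BITS[4 * row + column]
```

Changing the function means changing only the contents of `ROM_BITS`; the addressing logic stays the same.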

A type of programmable read-only memory uses small “fuse links” to

connect the horizontal and vertical wires at each intersection. This

device is “programmed” by passing sufficient current through a link

to “blow” the fuse. The link could also be a transistor which could

be turned on or off, allowing a read-write type of memory to be

implemented. We will see logic devices which could be used for this

purpose shortly.

94

Programmable logic arrays (PLA’s)

The ROM implementation of a function may become quite expensive

for functions with a large number of variables, because all potential

minterms of the function are implemented, whether or not they are

needed. A programmable logic array (PLA) requires that only the

minterms required for a function be implemented, and allows the

implementation of several functions simultaneously. Moreover, the

functions can be implemented directly from their minterm forms (al-

though it is often possible to eliminate some of the minterms, further

decreasing the cost of the PLA).

The PLA can be considered as a direct SOP (or POS) implementation of
a set of switching functions, with a set of AND functions followed by
a set of OR functions. A PLA is often said to have an “AND” plane
followed by an “OR” plane.

In practice, either NAND or NOR gates are used, with the resulting
PLA said to be a NAND/NAND or a NOR/NOR device. The next slide shows
a full adder implemented using a NAND/NAND PLA. Note that, since the
full adder does not require the minterm
$\overline{A}\cdot\overline{B}\cdot\overline{C}$, this minterm is not
included in the “AND” plane of the PLA. Note also that the PLA can
implement a function in SOP form directly, without reducing the
function to minterm form. This often leads to opportunities for
minimizing the area of a PLA. Also, a PLA can implement additional
functions of the same set of variables simply by adding another logic
gate to the “OR” plane.

95

The PLA is an efficient device for the implementation of several

functions of the same set of variables.

[Figure: full adder implemented as a NAND/NAND PLA with inputs A, B,
C. The AND plane has seven product-term gates, one for each minterm
of A, B, C that the adder uses; the OR plane has two gates, producing
the sum S and the carry out C+.]
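The two-plane structure can be sketched as two tables. This Python model is illustrative only: it uses plain AND and OR rather than NAND/NAND (the two are logically equivalent by De Morgan's theorem), and the names `AND_PLANE`, `OR_PLANE`, and `pla` are assumptions.

```python
from itertools import product

# AND plane: one product term for each minterm of (A, B, C) that is needed.
# The all-zero minterm is used by neither output, so it is left out.
AND_PLANE = [term for term in product((0, 1), repeat=3) if term != (0, 0, 0)]

# OR plane: which product terms feed each output of the full adder.
OR_PLANE = {
    "S":  [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)],
    "C+": [(0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)],
}


def pla(a, b, c):
    """Evaluate every output of the PLA for one input combination."""
    active = {term: int((a, b, c) == term) for term in AND_PLANE}
    return {name: int(any(active[t] for t in terms))
            for name, terms in OR_PLANE.items()}
```

Adding another function of A, B, C would add one entry to `OR_PLANE` and, at most, the missing product term to `AND_PLANE`.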

96

Sequential Logic

Sequential logic differs from combinational logic in that the output

of the logic device is dependent not only on the present inputs to the

device, but also on past inputs; i.e., the output of a sequential logic

device depends on its present internal state and the present inputs.

This implies that a sequential logic device has some kind of memory

of at least part of its “history” (i.e., its previous inputs).

A simple memory device can be constructed from combinational de-

vices with which we are already familiar. By a memory device, we

mean a device which can remember if a signal of logic level 0 or 1 has

been connected to one of its inputs, and can make this fact available

at an output. A very simple, but still useful, memory device can be

constructed from a simple OR gate, as shown:

[Figure: an OR gate with input A and output Q; Q is fed back to the
second input of the gate.]

In this memory device, if A and Q are initially at logic 0, then Q

remains at logic 0. However if the single input A ever becomes a

logic 1, then the output Q will be logic 1 ever after, regardless of

any further changes in the input at A. In this simple memory, the

output is a function of the state of the memory element only; after

the memory is “written” then it cannot be changed back. However,

it can be “read.” Such a device could be used as a simple read only

memory, which could be “programmed” only once.
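The behaviour of this one-gate memory is easy to trace in software. The Python sketch below is illustrative; the function name `or_memory` and the sample input sequence are assumptions.

```python
def or_memory(inputs, q=0):
    """Trace the output Q of an OR gate whose output is fed back to one input."""
    history = []
    for a in inputs:
        q = a | q             # Q = A OR (previous Q)
        history.append(q)
    return history


# Once A has been 1, Q stays at 1 whatever A does afterwards.
print(or_memory([0, 0, 1, 0, 0]))
```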

97

Often a state table or timing diagram is used to describe the be-

haviour of a sequential device. Following is both a state table and a

timing diagram for this simple memory shown previously. The state

table shows the state which the device enters after an input (the

“next state”), for all possible states and inputs. For this device, the

output is the value stored in the memory.

State table

Present state   Input   Next state   Output
Q_n             A       Q_{n+1}
0               0       0            0
0               1       1            1
1               0       1            1
1               1       1            1

[Figure: timing diagram of the input A and the output Q against
time.]

Note that the output of the memory is used as one of the inputs; this

is called feedback and is characteristic of programmable memory de-

vices. (Without feedback, a “permanent” electronic memory device

would not be possible.) The use of feedback can introduce problems

which are not found in strictly combinational circuits. In particular,

it is possible to inadvertently construct devices for which the output

is not determined by the inputs, and for which it is not possible to

predict the output. A simple example is an inverter with its input

connected to its output. Such a device is logically inconsistent; in a

physical implementation the device would probably either oscillate

from 1 to 0 to 1 · · · or remain at an intermediate value between logic

0 and logic 1, producing an invalid and erroneous output.

98

The R-S latch

More complicated, stable, memory elements could be constructed

using simple logic gates. In particular, simple, alterable memory

cells can be readily constructed. One basic (but not often used in

this form) memory device is the following, called an RS (reset-set)

latch, or flip flop. It is the most basic of all the class of circuits

which are called latches or flip flops. A logic diagram for this device

is shown in the following, together with its circuit symbol and state

table..

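S  R | Q_{n+1}  Q̄_{n+1}
0  0 | Q_n      Q̄_n
0  1 | 0        1
1  0 | 1        0
1  1 | 0        0

[Figure: logic diagram of the RS latch, two cross-coupled NOR gates
with inputs R and S and outputs Q and Q̄, together with its circuit
symbol.]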

We can analyze this circuit to determine all possible outputs Q and
Q̄ for all inputs to R and S; e.g., suppose we raise S to logic 1,
with Q at logic 0 and R at logic 0. Then Q̄ must be 0 (the output of
a NOR gate is 0 if any input is 1), and Q must be 1. If S is returned
to 0, then Q̄ remains 0 and Q remains 1; i.e., the RS latch
“remembers” that S was set to 1 while R was 0. If R is raised to
logic 1 while S is at logic 0, then Q is set to logic 0, and Q̄ is
set to logic 1; i.e., the latch is reset. If both R and S are raised
to logic 1, then both Q and Q̄ will be at logic 0. This output is
inconsistent with the identification of Q and Q̄ as complementary
outputs, and therefore should be avoided.
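The analysis above can be repeated mechanically. The following Python sketch assumes the usual cross-coupled NOR arrangement (R and Q̄ feed the gate producing Q; S and Q feed the gate producing Q̄) and re-evaluates both gates until the outputs stop changing; the names `nor` and `rs_latch` are illustrative.

```python
def nor(a, b):
    """NOR gate: 1 only when both inputs are 0."""
    return 0 if (a or b) else 1


def rs_latch(s, r, q, q_bar, max_steps=10):
    """Return the settled (Q, Q_bar) of a NOR RS latch for inputs s and r.

    Both gates are re-evaluated together until the outputs stop changing
    or max_steps evaluations have been made.
    """
    for _ in range(max_steps):
        new_q, new_q_bar = nor(r, q_bar), nor(s, q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar


state = (0, 1)                        # start with the latch reset
for s, r in [(1, 0), (0, 0), (0, 1), (0, 0), (1, 1)]:
    state = rs_latch(s, r, *state)
    print(f"S={s} R={r} -> Q={state[0]} Q_bar={state[1]}")
```

The sequence of inputs sets the latch, holds it, resets it, holds it again, and finally applies the disallowed input S = R = 1.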

99

Race conditions

Clearly, setting both R and S to 1 should be avoided to prevent

logical inconsistency.

However, a far more serious problem occurs if R and S change from
logic 1 to logic 0 simultaneously. This situation is called a race
condition. If both R and S are at logic 1, then Q and Q̄ are at logic
0. When R and S are both set to 0, then both Q and Q̄ should switch
to logic 1. However, when they switch to logic 1, they should cause a
switch back to logic 0 again, because of the logic 1 input to each
NOR gate. If both NOR gates were identical, this would occur over and
over again, indefinitely: an oscillation of the outputs Q and Q̄ from
state 0 to 1 and back, with a period depending on the time delay for
the NOR gate. In practice, one gate is a little faster than the
other, and the final outcome depends on the relative speeds of the
two gates. Since these speeds are not known in advance, the final
outcome cannot be predicted.
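Both cases can be imitated in a few lines. This Python sketch is a simplified model under stated assumptions: "identical gates" is modelled by updating both outputs in the same step, and "one gate faster" by updating that gate first; the names `step_together` and `settle` are illustrative.

```python
def nor(a, b):
    """NOR gate: 1 only when both inputs are 0."""
    return int(not (a or b))


def step_together(q, q_bar, s=0, r=0):
    """Both NOR gates switch at exactly the same moment."""
    return nor(r, q_bar), nor(s, q)


# Identical gates: starting from Q = Q_bar = 0 with R = S = 0, the
# outputs flip between (0, 0) and (1, 1) and never settle.
state = (0, 0)
trace = [state]
for _ in range(4):
    state = step_together(*state)
    trace.append(state)
print(trace)


def settle(faster_gate):
    """R = S = 0, starting from Q = Q_bar = 0, with one gate switching first."""
    q, q_bar = 0, 0
    if faster_gate == "Q":
        q = nor(0, q_bar)         # the Q gate wins the race
        q_bar = nor(0, q)
    else:
        q_bar = nor(0, q)         # the Q_bar gate wins the race
        q = nor(0, q_bar)
    return q, q_bar
```

Which of the two stable results appears depends only on which gate is modelled as faster, which is the unpredictability described above.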

100

Clocked latches

There are three control signals often associated with flip flops: the
clock or enable signal, the preset signal, and the clear signal. The
clock signal is ANDed with the inputs R and S, so that these signals
can reach the flip-flop only when the clock pulse is 1; at all other
times the inputs to the flip-flop are 0, and it retains its previous
value.

The clock input is used for several purposes; it is used to “capture”

data which is available for only a short time; and it is used to syn-

chronize several flip-flops so they can all operate simultaneously, or

synchronously. The following figure shows a circuit diagram for a

clocked RS flip-flop, together with its circuit symbol.

Note the special symbol, similar to an arrowhead, which denotes the

clock input.

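[Figure: clocked RS flip-flop. R and S are each ANDed with the clock
before reaching the NOR latch with outputs Q and Q̄. Beside it is the
circuit symbol, with inputs S and R, the clock input marked by >, and
outputs Q and Q̄.]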

101

Asynchronous preset and clear

The preset and clear signals are used to set the state of the flip-

flop regardless of the state of the clock input. Because they are not

synchronized by the clock pulse, they are said to be asynchronous.

The following figure shows how a simple clocked RS flip flop with

preset and clear inputs could be constructed from simple logic gates.

[Figure: clocked RS flip flop with asynchronous Preset and Clear
inputs, built from simple logic gates, together with its circuit
symbol.]

The preset and clear act as an unclocked RS flip-flop, and conse-

quently a logic 1 should not be applied to both at the same time.

A dual form of the RS flip flop, the R̄S̄ flip flop, can be
implemented with NAND gates, as follows:

[Figure: latch built from two cross-coupled NAND gates, with inputs
R̄ and S̄ and outputs Q and Q̄.]

Clock inputs, together with preset and clear inputs could similarly be

provided for this device. Since the inputs to this device are inverted,

the preset and clear inputs would also be inverted.

102

The D Latch and the D flip-flop

It is possible to create a latch which has no race condition, simply
by providing only one input to an RS latch, and generating an
inverted signal to present to the other terminal of the latch. In
this case, the S and R inputs are always inverted with respect to
each other, and no race condition can occur. A circuit for a D latch
follows:

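[Figure: D latch. D drives one input of a clocked RS latch directly
and the other input through an inverter; the outputs are Q and Q̄.
Beside it is the circuit symbol, with inputs D and clock (>) and
outputs Q and Q̄.]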

The D latch is used to capture, or “latch” the logic level which is

present on the Data line when the clock input is high. If the data

on the D line changes state while the clock pulse is high, then the

output, Q, follows the input, D. This effect can be seen in the timing

diagram in the next slide.

The D flip-flop, while a slightly more complicated circuit, performs

a function very similar to the D latch. In the case of the D flip-flop,

however, the rising edge of the clock pulse is used to “capture” the

input to the flip flop. This device is very useful when it is necessary

to “capture” a logic level on a line which is very rapidly varying.

103

The following figure shows a timing diagram for a D-type flip-flop.
This type of device is said to be “edge triggered”; both rising edge
triggered (i.e., a 0–1 transition) and falling edge triggered (i.e.,
a 1–0 transition) devices are available.

[Figure: timing diagrams of CLOCK, D, and Q against time for (a) the
D latch and (b) the D flip flop.]

Both the D latch and D flip-flop have the following truth table:

Preset  Clear  Clock    D | Q    Q̄
0       1      x        x | 1    0
1       0      x        x | 0    1
0       0      x        x | 1    1
1       1      ↑ or 1   0 | 0    1
1       1      ↑ or 1   1 | 1    0
1       1      0        x | Q0   Q̄0

The symbol ↑ means a leading edge, or 0–1 transition, at the clock
input to the flip flop. For a D latch, it would be the level 1.
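The difference between level-sensitive and edge-triggered capture shows up clearly when both devices are driven by the same waveforms. The Python sketch below ignores preset and clear; the function names and the sample waveforms are assumptions chosen for illustration.

```python
def d_latch(clock, data, q=0):
    """Level sensitive: Q follows D for as long as the clock is 1."""
    out = []
    for c, d in zip(clock, data):
        if c == 1:
            q = d
        out.append(q)
    return out


def d_flip_flop(clock, data, q=0):
    """Rising-edge triggered: Q takes the value of D only on a 0-1 clock transition."""
    out = []
    previous = 0              # assume the clock was low before the first sample
    for c, d in zip(clock, data):
        if previous == 0 and c == 1:
            q = d
        previous = c
        out.append(q)
    return out


clock = [0, 1, 1, 1, 0, 0, 1, 1]
data  = [1, 1, 0, 1, 0, 1, 0, 0]
print("latch    ", d_latch(clock, data))
print("flip-flop", d_flip_flop(clock, data))
```

While the clock is high the latch output tracks every change in D, whereas the flip-flop holds the value sampled at the rising edge.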

104

The JK flip-flop

The JK flip flop is the most versatile flip-flop, and the most
commonly used flip flop when discrete devices are used to implement
arbitrary state machines. Like the RS flip-flop, it has two data
inputs, J and K, and a clock input. It has no undefined states or
race condition, however. It always behaves as if it were edge
triggered, normally on the falling edge.

The JK flip-flop has the following characteristics:

1. If one input (J or K) is at logic 0, and the other is at logic 1,
   then the output is set or reset (by J and K respectively), just
   like the RS flip-flop, but on the (falling) clock edge.

2. If both inputs are 0, then it remains in the same state as it was
   before the clock pulse occurred; again like the RS flip flop.

3. If both inputs are high, however, the flip-flop changes state
   whenever the (falling) edge of a clock pulse occurs; i.e., the
   clock pulse toggles the flip-flop.
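The three characteristics amount to a next-state function of J, K, and the present state. The following Python sketch states them directly; the function name `jk_next` is an assumption, and the function gives the state after one active clock edge.

```python
def jk_next(j, k, q):
    """State of a JK flip-flop after one active clock edge."""
    if j == 0 and k == 0:
        return q              # hold the present state
    if j == 1 and k == 0:
        return 1              # set
    if j == 0 and k == 1:
        return 0              # reset
    return 1 - q              # J = K = 1: toggle


for j in (0, 1):
    for k in (0, 1):
        print(f"J={j} K={k}: Q=0 -> {jk_next(j, k, 0)}, Q=1 -> {jk_next(j, k, 1)}")
```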

105

There are two basic types of JK flip-flops. The first type is
basically an RS flip-flop with its outputs Q̄ and Q ANDed together
with J and K respectively. This type of JK flip-flop has no special
name.

Note that the connection between the outputs and the inputs to the

AND gates determines the input conditions to R and S when J = K

= 1. This connection is what causes the toggling, and eliminates the

invalid condition which occurs in the RS flip flop. A simplified form

of this flip-flop is shown in (a) below.

The second type of JK flip-flop is called a master-slave flip flop. This

consists of two RS flip flops arranged so that when the clock pulse

enables the first, or master, latch, it disables the second, or slave,

latch. When the clock changes state again (i.e., on its falling edge)

the output of the master latch is transferred to the slave latch. Again,

toggling is accomplished by the connection of the output with the

input AND gates. An example of this type of flip-flop is shown in

(b). The circuit symbol for a JK flip flop is shown in (c).

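[Figure: (a) simplified JK flip-flop: a clocked RS flip-flop whose S
and R inputs come from AND gates that combine J and K with the
fed-back outputs; (b) master-slave JK flip-flop built from two RS
latches, the master and the slave; (c) circuit symbol for a JK flip
flop, with inputs J, K, and clock (>) and outputs Q and Q̄.]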

106

The T flip flop

This type of flip-flop is a simplified version of the JK flip-flop.
It is not usually found as an IC chip by itself, but is used in many
kinds of circuits, especially counters and dividers. Its only
function is that it toggles itself with every clock pulse (on either
the leading edge or the trailing edge). It can be constructed from
the RS flip-flop as shown below.

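[Figure: (a) T flip flop constructed from a clocked RS flip-flop,
with the outputs fed back to the S and R inputs; (b) timing diagram
of the clock input T and the output Q.]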

This flip flop is normally set, or “loaded” with the preset and clear

inputs. It can be used to obtain an output pulse train with a fre-

quency of half that of the clock pulse train, as seen from the timing

diagram. In this example, the T flip flop is triggered on the falling

edge of the clock pulse.

Several T flip-flops are often connected together to form a “divide by

N” counter, where N is usually a power of 2.
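The frequency-halving behaviour can be seen by sampling the clock and the output side by side. This Python sketch assumes a falling-edge triggered T flip flop, as in the example above; the function name `t_flip_flop` and the sample clock are illustrative.

```python
def t_flip_flop(clock, q=0):
    """Toggle Q on every falling (1 to 0) edge of the sampled clock."""
    out = []
    previous = clock[0]
    for level in clock:
        if previous == 1 and level == 0:
            q ^= 1
        previous = level
        out.append(q)
    return out


clock = [0, 1, 0, 1, 0, 1, 0, 1, 0]
print("clock", clock)
print("Q    ", t_flip_flop(clock))
```

The output repeats every four samples while the clock repeats every two, so the output frequency is half the clock frequency.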

107

Data registers:

The simplest type of register is a data register, which is used for

the temporary storage of a “word” of data. In its simplest form,

it consists of a set of N D flip flops, all sharing a common clock.

All of the digits in the N bit data word are connected to the data

register by an N line “data bus”. Following is a four bit data register,

implemented with four D flip flops.

[Figure: four bit data register. Four D flip flops share a common
Clock; the inputs I0 to I3 drive the D inputs, and the outputs O0 to
O3 are taken from the Q outputs.]

The data register is said to be a synchronous device, because all the

flip flops change state at the same time (they share a common clock

input).

108

Shift registers

Another common form of register used in computers and in many

other types of logic circuits is a shift register. It is simply a set of

flip flops (usually D latches or RS flip-flops) connected together so

that the output of one becomes the input of the next, and so on in

series. It is called a shift register because the data is shifted through

the register by one bit position on each clock pulse. Following is a

four bit shift register, implemented with D flip flops.

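[Figure: four bit shift register. Four D flip flops share a common
Clock; the serial input “in” drives the first D input, each Q output
drives the next D input, and the last Q output is “out”.]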

On the leading edge of the first clock pulse, the signal “in” on the
D input is latched in the first flip flop. On the leading edge of the
next clock pulse, the contents of the first flip-flop are stored in
the second flip-flop, and the signal which is present at the data
input is stored in the first flip-flop, etc. Because the data is
entered one bit at a time, this is called a serial-in shift register.

Since there is only one output, and data leaves the shift register
one bit at a time, it is also a serial-out shift register. (Shift
registers are named by their method of input and output; either
serial or parallel.)
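The shifting action can be traced one clock edge at a time. In this Python sketch the function name `shift_register`, the list representation, and the sample input are assumptions; every stage is initially 0.

```python
def shift_register(serial_in, n_bits=4):
    """Return the flip-flop contents after each clock edge.

    stages[0] is the first flip-flop (fed by the serial input);
    stages[-1] is the last one, whose Q is the serial output.
    """
    stages = [0] * n_bits
    history = []
    for bit in serial_in:
        stages = [bit] + stages[:-1]    # every stage copies its left neighbour
        history.append(list(stages))
    return history


for edge, contents in enumerate(shift_register([1, 0, 1, 1]), start=1):
    print(f"after edge {edge}: {contents}")
```

Reading all the stages at once gives parallel output; reading only the last stage gives the serial output, delayed by the length of the register.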

109

Parallel input can be provided through the use of the preset and

clear inputs to the flip-flop. The parallel loading of the flip-flop can

be synchronous (i.e., occurs with the clock pulse) or asynchronous

(independent of the clock pulse) depending on the design of the shift

register. Parallel output can be obtained from the outputs of each

flip-flop as shown.

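[Figure: four bit shift register with serial input “in” and parallel
outputs O0 to O3 taken from the Q output of each D flip flop; all the
flip flops share a common Clock.]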

Communication between a computer and a peripheral device is often

done serially, while computation in the computer itself is usually

performed with parallel logic circuitry. A shift register can be used

to convert information from serial form to parallel form, and vice

versa. Many different kinds of shift registers can be constructed,

depending upon the particular function required.

110

Counters — weighted coding of binary numbers

A simple binary counter can be made using T flip flops. The flip-

flops are attached to each other in a way so that the output of one

acts as the clock for the next, and so on. In this case, the position

of the flip-flop in the chain determines its weight; i.e., for a binary

counter, the “power of two” it corresponds to. A 3-bit (modulo 8)

binary counter could be configured with T flip-flops as shown:

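[Figure: 3-bit ripple counter built from three T flip flops. The
clock drives the first flip flop, and the output of each flip flop
clocks the next; the outputs are O0, O1, and O2.]

Following is a timing diagram for this circuit:

[Figure: timing diagram of CLOCK, O0, O1, and O2 for clock pulses 0
to 11.]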

Note that in this counter, each flip-flop changes state on the
falling edge of the pulse from the previous flip-flop. Therefore
there will be a slight time delay, due to the propagation delay of
the flip-flops, between the time one flip-flop changes state and the
time the next one changes state; i.e., the change of state ripples
through the counter, and these counters are therefore called ripple
counters.
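The ripple can be imitated by letting each toggle decide whether the next stage is clocked. This Python sketch assumes falling-edge triggered T flip flops and ignores the actual propagation delays; the function name `ripple_counter` is illustrative.

```python
def ripple_counter(clock_pulses, n_bits=3):
    """Return the count shown after each clock pulse.

    bits[0] is O0 (weight 1), bits[1] is O1 (weight 2), and so on.
    """
    bits = [0] * n_bits
    counts = []
    for _ in range(clock_pulses):
        for i in range(n_bits):
            bits[i] ^= 1          # this stage sees a falling edge and toggles
            if bits[i] == 1:      # its output rose, so the next stage is not clocked
                break
        counts.append(sum(b << i for i, b in enumerate(bits)))
    return counts


print(ripple_counter(9))
```

A stage passes the change on only when its own output falls from 1 to 0, which is exactly a carry into the next binary digit.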

111

It is possible to design counters which will count up, count down,

and which can be preset to any desired number. Counters can also

be made which count in BCD, base 12 or any other number base.

A count down counter can be made by connecting the Q̄ output, rather
than Q, of each flip-flop to the clock input of the next in the
previous counter.

Using the preset and clear inputs, and by gating the output of each T
flip flop with another logic level, using AND gates (say logic 0 for
counting down, logic 1 for counting up), a presettable up-down binary
counter can be constructed.

The following figure shows an up-down counter, without preset or

clear:

[Figure: four bit up-down counter built from four JK flip-flops, with
outputs O0 to O3 and inputs Clock, count enable, and count up/down
(1 = up, 0 = down).]

112

Synchronous counters

The counters shown previously have been “asynchronous counters”;

so called because the flip flops do not all change state at the same

time, but change as a result of a previous output. The output of

one flip flop is the input to the next; the state changes consequently

“ripple through” the flip flops, requiring a time proportional to the

length of the counter. It is possible to design synchronous counters,

using JK flip flops, where all flip flops change state at the same time;

i.e., the clock pulse is presented to each JK flip flop at the same

time. This can be easily done by noting that, for a binary counter,

any given digit changes its value (from 1 to 0 or from 0 to 1) whenever

all the previous digits have a value of 1. Following is an example of

a 4-bit binary synchronous counter.

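[Figure: 4-bit synchronous binary counter built from four JK flip
flops sharing a common Clock, with outputs O0 to O3; AND gates
combine the lower-order outputs to drive the J and K inputs of the
higher-order flip flops.]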
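The toggle rule stated above (a digit changes whenever all the previous digits are 1) can be written as one step function. In this Python sketch the name `sync_counter_step` and the list representation are assumptions; each flip-flop is treated as a JK flip-flop with J = K equal to its toggle signal.

```python
def sync_counter_step(bits):
    """Advance a synchronous binary counter by one clock pulse.

    bits[0] is O0, the least significant digit.
    """
    toggles = []
    all_lower_ones = 1                  # O0 has no lower digits, so it always toggles
    for b in bits:
        toggles.append(all_lower_ones)  # J = K = AND of all lower-order outputs
        all_lower_ones &= b
    # Every flip-flop is clocked at the same instant; XOR applies the toggle.
    return [b ^ t for b, t in zip(bits, toggles)]


bits = [0, 0, 0, 0]
seen = []
for _ in range(16):
    bits = sync_counter_step(bits)
    seen.append(sum(b << i for i, b in enumerate(bits)))
print(seen)
```

All the toggle signals are computed from the present state before any flip-flop changes, which is what distinguishes this counter from the ripple counter.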

113

State machines

A “state machine” is a device in which the output depends in some

systematic way on variables other than the immediate inputs to the

device. These “other variables” are called the state variables for the

machine, and depend on the history of the machine. For example, in

a counter, the state variables are the values stored in the flip flops.

For a binary machine with n state variables, there may be as many as
2^n possible states, with each state corresponding to a unique
assignment of values to the state variables.

The behavior of a state machine can be completely described by a

“state table,” or equivalently, a “state diagram.” The next slide

shows a state table which describes the operation of a modulo 8

counter; the counter has 8 states, denoted S0 to S7, a single input,

the clock input, and 3 output digits, O2, O1 and O0. In this state

table, the entries where the clock input is 0 have been expressed on

a single line; in a full state table, this line would actually correspond

to 8 lines.

The essence of a state table can be captured in a state diagram. A

state diagram is a graph with labelled nodes and arcs; the nodes are

the states (denoted by circles, labelled with the state), and the arcs

are the possible transitions between states. The arcs are labelled

with the input which causes the transition, and the output which

results from the input. The next slide also shows a state diagram for

a modulo 8 counter.

114

Input   Present   Next        Outputs
        state     state       O2 O1 O0
0       Sx        no change   no change
1       S0        S1          0  0  1
1       S1        S2          0  1  0
1       S2        S3          0  1  1
1       S3        S4          1  0  0
1       S4        S5          1  0  1
1       S5        S6          1  1  0
1       S6        S7          1  1  1
1       S7        S0          0  0  0

[Figure: state diagram for the modulo 8 counter. The states S0 to S7
form a ring, and each arc is labelled input/output. With input 0 each
state loops back to itself, leaving the output unchanged (0/000 at S0
through 0/111 at S7); with input 1 each state advances to the next
(1/001 from S0 through 1/111 from S6, and 1/000 from S7 back to S0).]

115

Designing a state machine

Typically, when we design a state machine, we first identify the re-

quired states (i.e., identify what information must be remembered),

and then consider how to go from state to state, and, finally, what

output to produce (i.e., identify state transitions and outputs). The

following examples show how a state machine can be obtained from

a written description of the device.

Example — the serial adder

The serial adder accepts as input two serial strings of digits of arbi-

trary length, starting with the low order bits, and produces the sum

of the two bit streams as its output. (The input bit streams could

come from, say, two shift registers clocked simultaneously.) This

device can be easily described as a state machine.

We first decide what must be “remembered” — in this case, it is

easy; all that must be remembered is whether or not there is a carry

to be added into the next highest order bits. Therefore, the device

will have two states, carry = 0 (C0), and carry = 1 (C1), as shown

below. We next identify the transitions between the states, and the

necessary outputs, also shown in the state diagram.

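[Figure: state diagram for the serial adder, with arcs labelled
input/output. State C0 loops to itself on 00/0, 01/1, and 10/1, and
goes to C1 on 11/0; state C1 loops to itself on 01/0, 10/0, and 11/1,
and goes back to C0 on 00/1.]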

116

The corresponding state table, containing exactly the same informa-

tion as the state diagram is as follows:

Present state   Inputs   Next state   Output
C0              0 0      C0           0
C0              0 1      C0           1
C0              1 0      C0           1
C0              1 1      C1           0
C1              0 0      C0           1
C1              0 1      C1           0
C1              1 0      C1           0
C1              1 1      C1           1
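The state table can be run directly as a lookup table. The following Python sketch copies the eight rows above into a dictionary; the names `STATE_TABLE` and `serial_add`, and the choice of 6 + 3 as the example, are assumptions. Bits are supplied low order first, as the description requires.

```python
# (present state, input a, input b) -> (next state, output)
STATE_TABLE = {
    ("C0", 0, 0): ("C0", 0), ("C0", 0, 1): ("C0", 1),
    ("C0", 1, 0): ("C0", 1), ("C0", 1, 1): ("C1", 0),
    ("C1", 0, 0): ("C0", 1), ("C1", 0, 1): ("C1", 0),
    ("C1", 1, 0): ("C1", 0), ("C1", 1, 1): ("C1", 1),
}


def serial_add(a_bits, b_bits):
    """Add two bit streams (low-order bit first); return sum bits and final state."""
    state = "C0"                       # no carry before the first pair of bits
    total = []
    for a, b in zip(a_bits, b_bits):
        state, s = STATE_TABLE[(state, a, b)]
        total.append(s)
    return total, state


# 6 + 3 with four-bit streams, low-order bit first.
print(serial_add([0, 1, 1, 0], [1, 1, 0, 0]))
```

The final state reports whether a carry is still pending after the last pair of bits.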

117

Example — a sequence detector

A state machine is required which outputs a logic 1 whenever the
input sequence 0101 is detected, and which outputs a 0 otherwise. The
input is supplied serially, one bit at a time. The following is an
example input sequence and output sequence:

input   0 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0
output  0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0

This state machine can be designed in a straightforward way. Assume
that the machine is initially in some state, say, state A. If a 0 is
input, then this may be the start of the required sequence, so the
machine should output a 0 and go to the next state, state B. If a 1
is input, then this is certainly not the start of the required
sequence, so the machine should output a 0 and stay in state A. When
the machine is in state A, therefore, it has detected no digits of
the required input sequence. When the machine is in state B, it has
detected exactly one digit (the first 0) of the required input
sequence.

[Figure: partial state diagram. State A loops to itself on 1/0 and
goes to state B on 0/0.]

118

If the machine is in state B and a 0 is input, then two consecutive

0’s must have been input; this input is clearly not the second digit of

the required sequence, but it may be the first digit of the sequence.

Therefore, the machine should stay in state B and output a 0. If a 1

is input while the machine is in state B, then the first two digits of

the required sequence have been detected, and the machine should

go to the next state, state C, and output a 0. When the machine is

in state C, it has detected exactly two digits (0 1) from the required

sequence.

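[Figure: partial state diagram with states A, B, and C. State A loops
to itself on 1/0 and goes to B on 0/0; state B loops to itself on 0/0
and goes to C on 1/0.]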

If the machine is in state C and a 0 is input, then three digits of

the required sequence have been input, so the machine should go

to its next state, state D, and output a 0. If a 1 is input when the

machine is in state C, then this input is clearly no part of the required

sequence, so the machine should start over in state A and output a

0.

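[Figure: partial state diagram with states A, B, C, and D. State A
loops to itself on 1/0 and goes to B on 0/0; state B loops to itself
on 0/0 and goes to C on 1/0; state C goes to D on 0/0 and back to A
on 1/0.]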

119

If the machine is in state D and a 0 is input, then this is not the

required input (the input has been 0 1 0 0), but this may be the

first digit of another sequence, so the machine should go to state B

and output a 0. If a 1 is input while the machine is in state D, then

the required sequence (0 1 0 1) has been detected, so a 1 should be

output. Moreover, the last two digits input may be the first two

digits of another sequence, so the machine should go to state C. This

completes the state diagram, as shown in the following figure:

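[Figure: complete state diagram with states A, B, C, and D. State A
loops to itself on 1/0 and goes to B on 0/0; state B loops to itself
on 0/0 and goes to C on 1/0; state C goes to D on 0/0 and back to A
on 1/0; state D goes to C on 1/1 and to B on 0/0.]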

The following is a state table corresponding to the state diagram:

Present   Input   Next    Output
state             state
A         0       B       0
A         1       A       0
B         0       B       0
B         1       C       0
C         0       D       0
C         1       A       0
D         0       B       0
D         1       C       1
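The state table can be checked against the example input and output sequences given earlier. This Python sketch copies the eight rows into a dictionary; the names `TABLE` and `detect` are assumptions.

```python
# (present state, input) -> (next state, output)
TABLE = {
    ("A", 0): ("B", 0), ("A", 1): ("A", 0),
    ("B", 0): ("B", 0), ("B", 1): ("C", 0),
    ("C", 0): ("D", 0), ("C", 1): ("A", 0),
    ("D", 0): ("B", 0), ("D", 1): ("C", 1),
}


def detect(bits, state="A"):
    """Run the 0101 detector over a list of input bits; return the output bits."""
    outputs = []
    for bit in bits:
        state, out = TABLE[(state, bit)]
        outputs.append(out)
    return outputs


example_input = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
print(detect(example_input))
```

Because state D returns to state C on a 1, overlapping occurrences such as 010101 are detected twice, as in the example sequence.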

120

Algorithmic State Machines:

An interesting way of specifying a state machine, equivalent to the

use of a state table or a state diagram, is by the use of a “flowchart”;

actually, a particular type of flowchart called an ASM (algorithmic

state machine) diagram. In an ASM diagram, or flowchart, the “al-

gorithm” which is implemented by the state machine is presented

in a clear fashion. The following figure shows a flowchart for the

controller for the traffic light at an intersection where there is both

East/West and North/South traffic.

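[Figure: an intersection with North/South and East/West traffic, and
the flowchart for its controller, a loop of four action blocks:
NS green, EW red (50 seconds); NS yellow, EW red (10 seconds);
NS red, EW green (50 seconds); NS red, EW yellow (10 seconds).]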

121

We call the rectangular blocks “action blocks,” because they specify

some action. Note that, although not specifically shown, “time” is

implicitly an input in this flowchart; moreover, each of the individual

blocks does not necessarily correspond to an individual “state” of

the system. Since the blocks specify different time periods, they

imply some method to measure time, for example, by counting clock

pulses. A more explicit flowchart is shown in the following slide,

which assumes that a clock pulse occurs every 10 seconds.

The state diagram shown in the figure is equivalent to the flowchart;

note, however, that the flowchart looks simpler, in that the clock

input is implicit, and only transitions out of the state are explicitly

shown.

The square blocks correspond to the states of the system. Transitions

are specified by the arrows in the flowchart. The output is specified

in the blocks, and, in this example, the clock input is explicit.

Recall that, in the first flowchart, each block did not necessarily

correspond to an individual state of the traffic light controller.

Either flow chart could be considered an ASM diagram, but we will

prefer the style in which each block corresponds to a unique state.

122


[Figure: the flowchart expanded so that each block lasts one 10
second clock period (the first block reads: NS red, EW green, 10
seconds), and the equivalent state diagram with twelve states A to L.
Arcs are labelled CLOCK/COLOR, where COLOR is a pair of letters from
G, Y, R giving the two lights (GR, YR, RG, or RY): with clock 0 a
state loops back to itself, and with clock 1 it advances to the next
state.]

123

Consider another traffic light example, this time with external inputs,

shown in the next slide. In this example, the East/West traffic is

less frequent than North/South traffic, and if there is no East/West

traffic, the North/South light should remain green. Eastbound or

westbound traffic is sensed by traffic sensors, labeled ET and WT

respectively, as shown in the diagram.

Again, this flowchart has implicit time, or clock, inputs, and intro-

duces decision blocks (the diamond shaped blocks) for the traffic

sensor inputs. The decision blocks cause a transition from one block

to one of several others, depending on whether or not some condi-

tion is met. As before, this flowchart does not correspond directly

to a state diagram, but can readily be expanded to one which does,

as shown in the slide following, where each rectangular block corre-

sponds to a 10 second interval.

This structure, consisting of arrows, action blocks, and decision blocks

is sufficiently general to specify any algorithm. These flowcharts are

often called ASM diagrams, when they refer to control devices in par-

ticular. They are (if carefully drawn) equivalent to state diagrams,

where the rectangular blocks correspond to the states of the state

machine (or, as in some of the examples, they correspond to blocks

of states which can readily be expanded). The decision boxes and

arrows correspond to transitions between the states, and the clock

input is implicit in the ordering of the blocks.
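The role of a decision block can be sketched in Python as a next-state function that branches on the sensor inputs. This is only one plausible reading of the expanded state diagram: the seven state letters A to G and the "ET WT/color" labels come from the figure, but which letter carries which label, and the function name `step`, are assumptions.

```python
# Hypothetical assignment of the seven 10-second states (an assumption).
# The output is the color pair (NS, EW). Only state C contains a decision:
# the NS light stays green until sensor ET or WT reports East/West traffic.
UNCONDITIONAL = {
    "A": ("B", "GR"),
    "B": ("C", "GR"),
    "D": ("E", "RG"),
    "E": ("F", "RG"),
    "F": ("G", "RY"),
    "G": ("A", "GR"),
}


def step(state, et, wt):
    """Return (next_state, colors) for one 10-second clock period."""
    if state == "C":
        if et or wt:
            return "D", "YR"   # inputs 01, 10, or 11: NS goes to yellow
        return "C", "GR"       # input 00: no East/West traffic, stay green
    return UNCONDITIONAL[state]  # sensor inputs are "don't care" (X)
```

The decision block becomes the `if` inside state C; every other block ignores the sensors, which is what the X in the transition labels expresses.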

124

[Figure: the intersection with traffic sensors ET and WT on the East/West road, and a flowchart containing two decision blocks, labeled WT and ET, each with "yes" and "no" branches.]

125


[Figure: the expanded flowchart, in which each rectangular block is one 10 second interval, with the WT and ET decision blocks, beside the corresponding state diagram with seven states, A to G. Transitions are labeled "ET WT/color": X/GR, 00/GR, 01/YR, 10/YR, 11/YR, X/RG, and X/RY, where X means either input value.]

126

Implementation of state machines:

There are a number of ways in which state machines described by a

state diagram or state table or ASM diagram can be implemented.

The actual “details” of the implementation depend on a number of

things such as the type of flip-flop used to hold the state information

(e.g. D ff’s or JK ff’s), the way in which the next-state logic is to

be implemented (e.g., using simple logic gates, MUX’s, PLA’s, etc.),

and the way in which the state information is stored in the flip flops

(this is usually referred to as the “state coding”).

Although these details determine the actual physical implementation

of the device, the method used to arrive at this implementation is

quite general, and can be summarized as follows:

1. Construct the state table (or state diagram, or a complete ASM

diagram) for the device, and ensure that it correctly describes

the required device.

2. Assign binary values to the states, to be encoded using flip flops.

3. Design the logic necessary to produce the appropriate values for

the flip flops to enter the next state, using the present state and

input values as inputs to this logic. Also, design the logic to

produce the appropriate outputs.

127

For the second step, two commonly used codings are:

1. a binary weighted coding, in which each state is specified by a
binary number. For N states this coding requires $\lceil \log_2 N \rceil$
flip flops, where $\lceil \log_2 N \rceil$ means “the integer equal to or the
next integer greater than $\log_2 N$”.

2. a unary coding, often called a “one hot” coding, in which each

state is assigned to a flip flop. A value 1 stored in the flip flop

means that the device is in the state corresponding to that par-

ticular flip flop. Since a device can be in only one state at any

time, there will always be exactly one flip flop with a value 1

stored in it. Although this coding requires more flip flops to

store the state information (N, rather than $\lceil \log_2 N \rceil$), the next-state
logic is usually much simpler to design, because only two

flip flops need to have their values changed; the one correspond-

ing to the present state, and the one corresponding to the next

state.
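The flip flop counts for the two codings can be compared with a short Python sketch. The function names are illustrative only, and the bit order chosen for the one hot codes is an assumption.

```python
def binary_flip_flops(n_states):
    """Flip flops for a binary weighted coding: ceil(log2(N)), at least 1."""
    return max(1, (n_states - 1).bit_length())


def one_hot_flip_flops(n_states):
    """A one hot coding uses one flip flop per state."""
    return n_states


def one_hot_codes(states):
    """Assign each state a code with exactly one 1 (leftmost bit = first state)."""
    n = len(states)
    return {s: format(1 << (n - 1 - i), f"0{n}b") for i, s in enumerate(states)}
```

For example, a machine with 12 states needs 4 flip flops with a binary weighted coding and 12 with a one hot coding.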

128

Step (3) is not difficult, since the logic required is only simple com-

binational logic, but it can be quite tedious, especially for a binary

weighted coding since several (possibly all) flip flops may have to

change their values to produce the appropriate next-state. (For ex-

ample, in a 3 bit counter, the change from state 011 to state 100

requires that all flip flops change state at the same time). The type

of flip flop used also affects the design effort. D flip flops have only

one input, which is equal to the value to be stored in the flip flop. JK

flip flops have two inputs to be controlled, each by a separate block

of logic. This means that the design effort is easier for D flip flops

(because the next-state logic must only produce one input for each

flip flop), but the number of logic gates may be fewer for a JK flip

flop (because there are more ways to change the state). Since we are

more interested in reducing the design effort, we will use D flip flops

for our designs.

The next slide shows an ASM diagram for a device with four states,

A,B,C,D, one input, Y, and one output, Z, together with the corre-

sponding state diagram. The state table is also shown. The outputs

from the device are specified on the arrows in the ASM diagram, and

in the state diagram.

129

❩❩❩❩✚✚✚✚❩

❩❩❩✚✚✚✚

✲

✻✛

✛

✻✲

D

❄

C

❄

B

❄

❩❩❩❩✚✚✚✚❩

❩❩❩✚✚✚✚

❄

❄

❝✛

✻✲ ❝❄

A

❄

Z = 0

Z = 0

Y = 0Z = 0

Y = 1Z = 0

Y = 1Z = 0

Y = 0Z = 1

✫✪✬✩

D

X/0

✫✪✬✩

C

X/0

❄

✫✪✬✩

B

❄

1/0

✫✪✬✩

A

✤✜✠

❄

✛

0/1✲

1/0

0/0

130

State Table

Present State   Input   Next State   Output
      A           0         A           0
      A           1         B           0
      B           0         C           0
      B           1         C           0
      C           0         D           0
      C           1         D           0
      D           0         A           1
      D           1         B           0

We will first design a state machine corresponding to this state table

using a binary weighted coding for the states. (Later we will design

the same state machine using the unary, or “one hot” state coding.)

The design requires log2(4) = 2 flip flops; we will use D flip flops as

memory elements. We choose the following coding for the states:

State FF1 FF0

A 0 0

B 0 1

C 1 0

D 1 1

(This choice of state coding is arbitrary; the problem of finding a

state coding which requires a minimum number of logic gates for its

implementation is NP-hard).

131

We next reconstruct the state table, including the values for each flip

flop, as follows:

Present State         Input   Next    D FF inputs required     Output
                              State   to produce next state
       QFF1   QFF0      Y               DFF1    DFF0              Z
 A      0      0        0      A          0       0               0
        0      0        1      B          0       1               0
 B      0      1        0      C          1       0               0
        0      1        1      C          1       0               0
 C      1      0        0      D          1       1               0
        1      0        1      D          1       1               0
 D      1      1        0      A          0       0               1
        1      1        1      B          0       1               0

This table is, effectively, three truth tables, one for each of the D

inputs to FF1 and FF0, and one for the output, Z. Note that there

are three inputs to each truth table; namely, the outputs of FF1

and FF0, and Y. The required logic could be implemented in several

ways; using simple logic gates, using three 4 line to 1 line MUX’s (or

three 8 line to 1 line MUX’s), or using one 3 line to 8 line decoder,

and several NAND gates. The implementation shown in the following

slide uses the 3 line to 8 line decoder, and three NAND gates.

(A PLA implementation is particularly attractive for state machines,

because all the logic functions to be implemented are functions of the

same set of input variables.)
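The step from the coded state table to next-state logic can be checked mechanically. The sketch below, in Python, rebuilds the three truth tables from the state table and the chosen coding, and lists the minterms (decoder outputs) for which each function is 1. Treating QFF1 as the most significant decoder select bit is an assumption, and the simplified equations in `next_state_logic` were derived for this example; they do not appear in the slides.

```python
STATE_TABLE = {  # (present state, Y) -> (next state, Z)
    ("A", 0): ("A", 0), ("A", 1): ("B", 0),
    ("B", 0): ("C", 0), ("B", 1): ("C", 0),
    ("C", 0): ("D", 0), ("C", 1): ("D", 0),
    ("D", 0): ("A", 1), ("D", 1): ("B", 0),
}
CODE = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}  # (FF1, FF0)


def minterms():
    """Decoder output numbers (QFF1 QFF0 Y as a 3-bit number) where each function is 1."""
    d1, d0, z = [], [], []
    for (state, y), (nxt, out) in STATE_TABLE.items():
        q1, q0 = CODE[state]
        m = (q1 << 2) | (q0 << 1) | y
        n1, n0 = CODE[nxt]
        if n1:
            d1.append(m)
        if n0:
            d0.append(m)
        if out:
            z.append(m)
    return sorted(d1), sorted(d0), sorted(z)


def next_state_logic(q1, q0, y):
    """Simplified equations (derived for this sketch, not taken from the slides)."""
    d1 = q1 ^ q0
    d0 = (q1 & (1 - q0)) | (q1 & y) | ((1 - q0) & y)
    z = q1 & q0 & (1 - y)
    return d1, d0, z
```

With a decoder implementation, each of the three functions is simply the OR of the decoder outputs listed by `minterms()`; a gate-level implementation would use the simplified equations instead.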

132

[Figure: implementation using a 3 line to 8 line decoder (select inputs S0, S1, S2; outputs O0 to O7), NAND gates, and two D flip flops, FF1 and FF0, with input Y, output Z, and a common clock.]

The preceding technique would be quite tedious for the one hot state

coding, since there would be four flip flops, and the state table would

consequently require 32 lines. For the one hot state coding, we can

consider each flip flop individually, and design the logic required to

set it to 1 or 0, without concern for the other flip flops (except, of

course, that one or more of them may potentially provide an input

to the logic.)

In this case, we can break up the state table into a separate truth

table for the inputs required to produce each state; that is, we group

together in separate tables the lines corresponding to each separate

“next state.” For the previous example, we have the following four

tables:

133

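For State A

Present State   Input, Y   Next State   Output
      A            0           A           0
      D            0           A           1

For State B

Present State   Input, Y   Next State   Output
      A            1           B           0
      D            1           B           0

For State C

Present State   Input, Y   Next State   Output
      B            0           C           0
      B            1           C           0

For State D

Present State   Input, Y   Next State   Output
      C            0           D           0
      C            1           D           0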

The “Next State” column is, of course, not required in the tables.

Each table can be used to design the logic required to set the corre-

sponding flip flop. The following are directly from these tables:

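For FFA, $D_A = A \cdot \overline{Y} + D \cdot \overline{Y} = (A + D) \cdot \overline{Y}$

For FFB, $D_B = A \cdot Y + D \cdot Y = (A + D) \cdot Y$

For FFC, $D_C = B \cdot \overline{Y} + B \cdot Y = B$

For FFD, $D_D = C \cdot \overline{Y} + C \cdot Y = C$

The output, Z, would be evaluated as $Z = D \cdot \overline{Y}$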
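The one hot design equations can be checked against the state table with a short simulation. In this Python sketch the equations are read as DA = (A + D)·(NOT Y), DB = (A + D)·Y, DC = B, DD = C, and Z = D·(NOT Y); the function names are illustrative.

```python
STATE_TABLE = {  # (present state, Y) -> (next state, Z)
    ("A", 0): ("A", 0), ("A", 1): ("B", 0),
    ("B", 0): ("C", 0), ("B", 1): ("C", 0),
    ("C", 0): ("D", 0), ("C", 1): ("D", 0),
    ("D", 0): ("A", 1), ("D", 1): ("B", 0),
}
ONE_HOT = {"A": (1, 0, 0, 0), "B": (0, 1, 0, 0),
           "C": (0, 0, 1, 0), "D": (0, 0, 0, 1)}


def one_hot_step(a, b, c, d, y):
    """One clock period: returns the four D inputs and the output Z."""
    not_y = 1 - y
    da = (a | d) & not_y
    db = (a | d) & y
    dc = b
    dd = c
    z = d & not_y
    return da, db, dc, dd, z


def matches_state_table():
    """True if the equations reproduce every row of the state table."""
    for (state, y), (nxt, out) in STATE_TABLE.items():
        if one_hot_step(*ONE_HOT[state], y) != (*ONE_HOT[nxt], out):
            return False
    return True
```

Only the eight reachable one hot combinations are checked; the other rows of the full 32 line table never occur, which is why they can be ignored in the design.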

134

With a little practice, these design equations can be obtained directly

from a state diagram or ASM diagram. A corresponding circuit

diagram would be as shown below.

Note that, for the one hot coding, although more “next state” cir-

cuits must be designed, they are normally much simpler than for the

binary coded state assignment. In fact, for a small device, the imple-

mentation effort may be much less for a “one hot” implementation

than for an implementation using binary coded states.

[Figure: circuit for the one hot implementation: four D flip flops with outputs A, B, C, and D, input Y, output Z, and a common clock.]

Repeating the design equations from the previous page:

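$D_A = A \cdot \overline{Y} + D \cdot \overline{Y} = (A + D) \cdot \overline{Y}$

$D_B = A \cdot Y + D \cdot Y = (A + D) \cdot Y$

$D_C = B \cdot \overline{Y} + B \cdot Y = B$

$D_D = C \cdot \overline{Y} + C \cdot Y = C$

$Z = D \cdot \overline{Y}$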

135

Of course, sometimes a simple solution to a design problem is appar-

ent, without requiring much design effort. For example, if we wanted

to design a device to detect the sequence 0101, say, and output a 1

when this sequence was detected, we could construct a state table

for the device, and complete the design as in the previous examples.

(Recall that we constructed a state diagram and state table for this

device earlier). There is another, simpler solution, however, using a
4 bit serial in parallel out shift register (i.e., 4 D FF’s), four
comparators, and an AND gate, as shown below. (The comparator is
the complement of the XOR function, that is, the XNOR function,
which is sometimes called XAND.) This simple design can be used to
detect any four bit sequence, by simply changing the inputs to the
comparators.
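The shift register detector can be sketched in Python as follows. The function name `detect` and the use of `None` for flip flops that have not yet been loaded are modeling assumptions; real flip flops would start with some definite value.

```python
def detect(bits, pattern=(0, 1, 0, 1)):
    """Return Z after each input bit, for a shift register with comparators."""
    register = [None] * len(pattern)   # oldest bit first; None = not yet loaded
    outputs = []
    for x in bits:
        register = register[1:] + [x]  # shift in the new input bit
        # One comparator (XNOR) per stage: 1 when the stored bit equals the pattern bit.
        matches = [int(q == p) for q, p in zip(register, pattern)]
        outputs.append(int(all(matches)))  # AND gate over the comparator outputs
    return outputs
```

Changing `pattern` corresponds to changing the constant inputs of the comparators, so the same structure detects any four bit sequence.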

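[Figure: four D flip flops connected as a shift register with input X and a common clock; each flip flop output feeds a comparator whose other input is one bit of the pattern 0 1 0 1, and the comparator outputs are ANDed to produce Z.]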

136

Structured implementation of state machines

State machines are typically implemented in three ways; using indi-

vidual logic gates, typically called a “random logic” implementation,

using a PLA, and using memory as a “look-up” for the combinational

logic. The PLA and memory (typically read-only memory) imple-

mentations are quite effective, because a state machine has a fixed

(usually relatively small) number of inputs and outputs, and both

those approaches can be readily automated.

An implementation based on memory is often called a microcoded

implementation. (This term is often reserved for the memory-based

implementation of the control unit of a computer.) It is common

to have a small “microcode engine” including simple functions like a

counter and registers, and for the microcode itself to be “sequenced”

by the counter. (In a sense, the counter is used to fetch microcode

words from memory, and these microcode words control the external

state changes and outputs.)
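A memory-based implementation can be sketched by storing the combinational logic as a table. The Python example below uses the four-state machine designed earlier, with the binary coding A = 00, B = 01, C = 10, D = 11; the address and data word layouts, and the names `ROM` and `run`, are assumptions made for illustration.

```python
# Address = (QFF1, QFF0, Y) as a 3-bit number; data = (DFF1, DFF0, Z).
# The contents are copied row by row from the coded state table.
ROM = [
    0b000,  # A, Y=0 -> A, Z=0
    0b010,  # A, Y=1 -> B, Z=0
    0b100,  # B, Y=0 -> C, Z=0
    0b100,  # B, Y=1 -> C, Z=0
    0b110,  # C, Y=0 -> D, Z=0
    0b110,  # C, Y=1 -> D, Z=0
    0b001,  # D, Y=0 -> A, Z=1
    0b010,  # D, Y=1 -> B, Z=0
]


def run(inputs, state=0b00):
    """Clock the machine once per input bit and return the list of Z outputs."""
    outputs = []
    for y in inputs:
        word = ROM[(state << 1) | y]   # look up next state and output
        state = word >> 1              # the flip flops load the next state
        outputs.append(word & 1)
    return outputs
```

No logic design is needed: changing the machine only means changing the stored words, which is why this approach is easy to automate.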

137

State machine models

Mealy state machines

Up to this point, we have implicitly considered only one model for

a state machine; a model in which the outputs are a function of

both the present state and the input. This model, shown pictorially

below, is called the Mealy model for sequential devices. It is a general

model for state machines, and assumes that there are two types of

inputs; clock inputs and data inputs. The clock inputs cause the

state transitions and “gate” the outputs, (so the outputs are really

“pulse” outputs; i.e., they are valid only when the clock is asserted).

The data inputs determine the values of next-states and outputs.

Essentially, the clock inputs control the timing of the state transitions

and outputs, while the data inputs determine their values.

[Figure: Mealy model. A combinational logic block takes the primary inputs and the state values, and produces the primary outputs and the next-state values; a clocked memory block holds the state values.]

So far, the state diagrams we have drawn correspond to this model;

we have labeled the transitions with the inputs which cause the tran-

sition, and the output corresponding to the transition.

138

Moore state machines

Another model for state machines is the Moore model, in which the

outputs are associated with the states of the device. In the Moore

machine, the outputs are stable for the full time the device is in a

given state. (The outputs are said to be “level” outputs, and are

valid even when the clock inputs are not asserted.) Again, there

are two types of inputs, clock inputs and data inputs. In this case,

however, the clock inputs only directly enable the state transitions.

In this model, the transitions are functions of the present states and

inputs, but the outputs are functions of the states only. Below is a

pictorial representation of the Moore model of a state machine. (The

Moore model describes state machines like the traffic light controllers

we have seen as ASM diagrams in a very natural manner.)

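[Figure: Moore model. A next-state combinational logic block takes the primary inputs and the state values and produces the next-state values for the clocked memory; a separate output combinational logic block produces the primary outputs from the state values only.]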

139

The following figure shows a state diagram for the Mealy machine

derived earlier which produces a 1 as output whenever the sequence

0101 is input. This machine has four states, and the outputs are

associated with the inputs to the state machine.

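[Figure: Mealy state diagram for the 0101 detector, with states A, B, C, and D. Each transition is labeled input/output; every transition has output 0 except the transition labeled 1/1 from D to C.]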

The next figure shows a state diagram for a Moore machine which

performs the same function.

[Figure: Moore state diagram for the same detector, with states A/0, B/0, C/0, D/0, and E/1. Each state is labeled state/output, and each transition is labeled with the input only.]

Note that a state diagram for a Moore machine is labeled differently

from a Mealy machine state diagram; the transitions are labeled

only with the inputs which cause the transition, while the states are

labeled with the corresponding outputs. (The output is a function

of the state only, and does not depend directly on the input.)
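The two 0101 detectors can be compared by simulation. The transition tables in this Python sketch are a reading of the two state diagrams (an assumption, since only the labels are certain), and the function names are illustrative. The Mealy output is taken on each transition; the Moore output is taken from the state entered after each input.

```python
MEALY = {  # (state, input) -> (next state, output)
    ("A", 0): ("B", 0), ("A", 1): ("A", 0),
    ("B", 0): ("B", 0), ("B", 1): ("C", 0),
    ("C", 0): ("D", 0), ("C", 1): ("A", 0),
    ("D", 0): ("B", 0), ("D", 1): ("C", 1),
}
MOORE_NEXT = {  # (state, input) -> next state
    ("A", 0): "B", ("A", 1): "A",
    ("B", 0): "B", ("B", 1): "C",
    ("C", 0): "D", ("C", 1): "A",
    ("D", 0): "B", ("D", 1): "E",
    ("E", 0): "D", ("E", 1): "A",
}
MOORE_OUT = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1}


def run_mealy(bits, state="A"):
    outputs = []
    for x in bits:
        state, out = MEALY[(state, x)]
        outputs.append(out)
    return outputs


def run_moore(bits, state="A"):
    outputs = []
    for x in bits:
        state = MOORE_NEXT[(state, x)]
        outputs.append(MOORE_OUT[state])   # output depends on the state only
    return outputs
```

Both functions return the same list for any input sequence, for example `[0, 0, 0, 1, 0, 1]` for the input 0, 1, 0, 1, 0, 1, in which the pattern occurs twice with overlap.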

140

Comparing the Mealy and Moore state diagrams, it is clear that they

are very similar for states A, B, C and D. State E is a “new” state,

because the output of 1 must be associated with some state. In fact,

state E is equivalent to state C in the Mealy diagram — Mealy state

C has been split into two Moore states, one (C) with output 0, and

one (E) with output 1. They are equivalent in the sense that the
transitions out of Mealy state C and out of Moore state E (and Moore
state C, too) lead to the same next-states for the same inputs.

Note that state C was the only Mealy state in which the incoming

arcs (or arrows) correspond to different outputs. Moreover, the state

which was “split” retained all transitions to the corresponding next-

states in the Mealy machine. (In this example, all of the other states

are associated with only one output.)

In general, from this observation, it is possible to convert any Mealy

type machine into an equivalent Moore type machine, and vice-versa.

First, we must define what we mean for two state machines to be

equivalent. Two state machines are said to be equivalent if they

produce exactly the same output for all inputs. Consequently, to

derive an equivalent Moore machine from a Mealy machine, it must

be possible to guarantee that the two machines produce the same

output after any arbitrary input string has been input. This can

be done by splitting all the Mealy states corresponding to different

outputs, and ensuring that these states are connected to next-states

which correspond to equivalent states in the original Mealy machine.

141

As a slightly more complex example, the Mealy machine specified by

the following state table, where x is the single external input, and

y is the output, and having a state diagram as shown below can be

converted into a Moore machine as follows:

Present State      Next State       Output, y, for
                   x=0    x=1       x=0    x=1
      A             C      B         0      0
      B             A      D         1      0
      C             B      A         1      1
      D             D      C         1      0

[Figure: state diagram of the Mealy machine in the table above, with states A, B, C, and D and transitions labeled x/y.]

Each state with different output values associated with transitions

into the state is split into states corresponding to each different out-

put; e.g., state B has a transition from state A with an output of

0, and from state C with an output of 1. Therefore, State B is split

into two states, B0, with an output of 0, and B1, with an output of

1. Every transition to B with output 0 goes to B0; every transition

to B with an output 1 goes to B1. The next-states of B0 and B1 are

exactly the same as for B. State D is split into two states D0 and

D1, similarly.
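The splitting procedure can be sketched in Python. The table `MEALY` is the state table of this example; the function `mealy_to_moore` and its dictionary layout are illustrative choices. A state with no incoming transitions has no defined Moore output, so the sketch arbitrarily gives it output 0.

```python
MEALY = {  # state -> {x: (next state, output y)}
    "A": {0: ("C", 0), 1: ("B", 0)},
    "B": {0: ("A", 1), 1: ("D", 0)},
    "C": {0: ("B", 1), 1: ("A", 1)},
    "D": {0: ("D", 1), 1: ("C", 0)},
}


def mealy_to_moore(mealy):
    """Split each state by the outputs on its incoming transitions."""
    incoming = {s: set() for s in mealy}
    for arcs in mealy.values():
        for nxt, out in arcs.values():
            incoming[nxt].add(out)

    def name(state, out):
        # Only states entered with more than one output value are split.
        return f"{state}{out}" if len(incoming[state]) > 1 else state

    moore = {}
    for state, arcs in mealy.items():
        outs = sorted(incoming[state]) or [0]  # arbitrary output if never entered
        for out in outs:
            moore[name(state, out)] = {
                "output": out,
                # Every copy of a split state keeps the next-states of the original.
                "next": {x: name(nxt, o) for x, (nxt, o) in arcs.items()},
            }
    return moore
```

For this example the result has six states, A, B0, B1, C, D0, and D1, since only B and D are entered with both output values.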

142

The state table becomes the following, corresponding to the state

diagram shown below.

Present State      Next State       State
                   x=0    x=1       output
      A             C      B0         1
      B0            A      D0         0
      B1            A      D0         1
      C             B1     A          0
      D0            D1     C          0
      D1            D1     C          1

Here we have added a column called state output, which is the output

the device has while in a given state. The output no longer depends

on the input, x.

[Figure: state diagram of the Moore machine in the table above, with states A/1, B0/0, B1/1, C/0, D0/0, and D1/1; transitions are labeled with the input x only.]

143

We can see that the Moore machine accepts the same input sequences

as the Mealy machine we started with, and produces the same output

sequence. In addition, it produces the output 1 when started in

state A, without having any input sequence; i.e., a Moore machine
accepts a zero length sequence, called a null sequence, and produces
an output (while in its initial state). If we wish, we can add a new

state A0 as the initial state, which produces a different output, say,

0, indicating that the machine is in its initial state. (There will be

no transitions back into this initial state.)

Note that, in general, any Mealy machine with N internal states

and P outputs can be converted to a Moore machine with at most

P ×N +1 states. (The Mealy machine and its corresponding Moore

machine will be equivalent, in the sense that both will give exactly

the same output for all possible input sequences.)

144

Present State   Output      Next State
                            x=0    x=1
     A0           0          C      B0
     A1           1          C      B0
     B0           0          A1     D0
     B1           1          A1     D0
     C            0          B1     A1
     D0           0          D1     C
     D1           1          D1     C

[Figure: state diagram of the Moore machine in the table above, with the added initial state A0/0 and states A1/1, B0/0, B1/1, C/0, D0/0, and D1/1.]

145

“Computer arithmetic”

What kinds of numbers are typically represented in a computer sys-

tem?

Positive integers (for addressing, pointers.)

Signed integers (for integer arithmetic.)

“Real numbers” for arithmetic using real numbers.

Perhaps the most important characteristic of the representation of

numbers in a computer system is that numbers are represented using

a fixed number of binary digits.

This introduces the problems of overflow or underflow.

For example, the sum of the following four bit (unsigned) numbers

would be

1101 + 0101 = 10010

Using only four bits, the result would be 0010, and, depending on

the particular processor and instruction, the overflow may or may

not be detected.
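The four bit example can be reproduced with a short Python sketch that keeps only a fixed number of bits and reports whether a carry was lost. The function name `add_unsigned` is illustrative.

```python
def add_unsigned(a, b, bits=4):
    """Add two unsigned values in a fixed width; return (result, overflow)."""
    mask = (1 << bits) - 1
    total = a + b
    # The stored result keeps only the low `bits` bits; overflow means a carry
    # out of the high order bit was discarded.
    return total & mask, total > mask
```

For the example above, `add_unsigned(0b1101, 0b0101)` gives the four bit result 0010 with the overflow flag set.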

When an overflow or underflow is detected, the processor usually

causes an exception.

In this case, the program jumps to a predetermined location in mem-

ory which handles the exception, and the return address is also saved

so the program can continue after the exception is handled.

We will discuss exceptions later.

146

How are signed integers represented?

Following are four possibilities for representing negative numbers:

one’s complement representation A negative number is the

complement of a positive number; e.g.

00000001 represents 1

11111110 represents -1

Note that there are two representations of zero.

This representation is not commonly used.

two’s complement representation This is the one’s complement
representation plus 1; e.g.

00000001 represents 1

11111111 represents -1



This is the usual representation for signed integers.

To extend the size of a 2’s complement integer, it is sign ex-

tended; e.g., to make the 8-bit representation of -7 a 16 bit rep-

resentation, the high order bit in the 8-bit representation is used

to fill in the higher order bits.

11111001 -7 (8-bit 2’s complement)

1111111111111001 -7 (16-bit 2’s complement)
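Sign extension can be sketched in Python on bit patterns held as non-negative integers. The names `sign_extend` and `to_signed` are illustrative.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the high order bit of a `from_bits` pattern into the new upper bits."""
    sign = (value >> (from_bits - 1)) & 1
    if sign:
        upper = ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
        value |= upper
    return value


def to_signed(value, bits):
    """Interpret a `bits`-wide pattern as a 2's complement integer."""
    return value - (1 << bits) if value >> (bits - 1) else value
```

Extending the 8-bit pattern 11111001 to 16 bits gives 1111111111111001, and both patterns are interpreted as -7.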

147

sign-magnitude representation An integer has a single bit, say

the high order bit, which represents the sign. For an eight bit

number, the representation would be

sddddddd



This is the usual representation for the mantissa (significand) of

a real (floating point) number.

Note that for an integer, there are again two representations for

zero.

biased representation (sometimes called excess n representation.)

A bias is added to the representation to get the number being

represented. The following shows a representation with a bias of

127 (or excess 127):

10000000 represents 1 (1 + 01111111)

01111110 represents -1 (-1 + 01111111)

This is the usual representation for the exponent of a real (float-

ing point) number.
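The four representations can be compared side by side with a small Python sketch that encodes an integer as an 8-bit pattern in each form. The function names are illustrative, and no range checking is done.

```python
def ones_complement(n, bits=8):
    mask = (1 << bits) - 1
    return n if n >= 0 else ~(-n) & mask       # complement of the positive number


def twos_complement(n, bits=8):
    return n & ((1 << bits) - 1)               # one's complement plus 1


def sign_magnitude(n, bits=8):
    sign = 1 << (bits - 1) if n < 0 else 0     # high order bit holds the sign
    return sign | abs(n)


def biased(n, bias=127, bits=8):
    return (n + bias) & ((1 << bits) - 1)      # stored value = number + bias


def show(pattern, bits=8):
    """Format a pattern as a binary string of the given width."""
    return format(pattern, f"0{bits}b")
```

For -1 the four patterns are 11111110, 11111111, 10000001, and 01111110 respectively.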

148

Given that, for integer arithmetic, we will use a 2’s complement

representation, and that we want to combine arithmetic and logic

operations in one unit (the ALU), how would an ALU for the MIPS

be implemented?

So far, we have the following instructions to implement: add, addu,

addi, sub, subu, addiu, and, andi, or, ori, slt, sltu,

slti.

Consider the following single-bit ALU:

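[Figure: single-bit ALU. Inputs a, b, Less, and Carryin; control inputs Binvert and Operation; outputs Result and Carryout. Binvert selects b or its complement through a MUX; the adder (+) output and the Less input feed inputs 2 and 3 of the output MUX selected by Operation.]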

Note that there are three control bits; the single bit Binvert, and

the two bit input to the MUX, labeled Operation.

The ALU performs the operations and, or, add, and subtract.

149

This ALU implements all the integer arithmetic and logical instruc-

tions seen so far. Note that the control inputs must be set according

to the particular operation required of the ALU.

A controller will also be required, to determine what particular op-

eration is required of the ALU for each individual instruction.

The ALU will be a component of the datapath of the processor we

will design; the high order bit should detect overflow, as below:

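[Figure: the single-bit ALU for the high order bit. It has the same inputs (a, b, Less, Carryin, Binvert, Operation) and Result output, plus an overflow detection block producing Overflow, and a Set output.]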

A set of 32 of these components can be used to implement a full 32

bit ALU, as shown in the next slide. It also produces an output,

zero, which is set to 1 whenever the 32 bit output is 0.

150

The 32 bit ALU

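[Figure: 32 single-bit ALUs, ALU0 to ALU31, with inputs a0/b0 to a31/b31 and outputs Result0 to Result31. The CarryOut of each ALU feeds the Carryin of the next; Binvert and Operation go to every ALU; the Less input is 0 for ALU1 to ALU31; ALU31 also produces Set and Overflow; the zero output is derived from the Result bits.]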

ALU control lines Function

0 00 and

0 01 or

0 10 add

1 10 subtract

1 11 set on less than
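The behavior of the 32 bit ALU can be sketched in Python by rippling through the bit slices. Two wiring details are assumptions consistent with there being only three control lines: the Carryin of ALU0 is tied to Binvert (supplying the +1 for subtraction), and the Set output of the high order bit feeds the Less input of ALU0. The function name and return layout are illustrative.

```python
def alu32(a, b, binvert, operation, bits=32):
    """Return (result, zero, overflow); operation is 0=and, 1=or, 2=add, 3=slt."""
    carry = binvert                     # assumed: Carryin of ALU0 tied to Binvert
    carry_into_msb = 0
    slices = []
    for i in range(bits):
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ binvert   # Binvert selects b or its complement
        s = ai ^ bi ^ carry             # full adder sum
        carry_out = (ai & bi) | (ai & carry) | (bi & carry)
        slices.append((ai & bi, ai | bi, s))
        carry_into_msb, carry = carry, carry_out
    set_bit = slices[-1][2]             # adder output of the high order bit
    overflow = carry_into_msb ^ carry   # carry into and out of the MSB differ
    result = 0
    for i, (and_out, or_out, s) in enumerate(slices):
        less = set_bit if i == 0 else 0 # Less is 0 for every bit except bit 0
        result |= (and_out, or_out, s, less)[operation] << i
    return result, int(result == 0), overflow
```

For example, `alu32(5, 3, 1, 2)` subtracts and returns `(2, 0, 0)`, while `alu32(3, 5, 1, 3)` returns a result of 1 because 3 is less than 5. As in the simple hardware, the set on less than result in this sketch is not corrected when the subtraction overflows.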

151

The ALU depicted on the previous slide uses ripple-carry adders.

We have seen how to build a carry look-ahead adder which would

permit faster arithmetic operations.

Using the carry look-ahead units we considered earlier, the changes

required are not difficult since they merely compute the Carryin

inputs to the ALU. In order to handle 2’s complement arithmetic,

however, inverted inputs would be required.

[Figure: the 32 bit ALU with a carry look-ahead unit. The unit takes the inputs a0 to a31 and b0 to b31 (the b inputs passing through MUXes) and supplies the carries c1, c2, ..., c31 to the Carryin inputs of the individual ALUs; the other connections (Operation, Binvert, Less, Set, Overflow) are as before.]

152

Integer Multiplication

Integer multiplication is really repeated addition.

The basic algorithm is a kind of “shift and add” of partial products,

obtained by multiplying individual digits in the multiplier by the

multiplicand, as follows:

        1010    multiplicand
      × 0110    multiplier
      ------
        0000    0 × 1010
       1010     1 × 1010, shifted left 1 bit
      1010      1 × 1010, shifted left 2 bits
     0000       0 × 1010, shifted left 3 bits
     -------
     0111100    product, the sum of the partial products

Note that the product of 2 n-bit numbers requires 2n bits.

This multiplication algorithm can be implemented in a number of

ways.

Recalling that single bit binary multiplication is the same as the AND

function, then simply ANDing the multiplicand with the digits of the

multiplier, shifting, and adding them together is all that is required.

This type of multiplier can be implemented using a single 32 bit

adder, and 32 shift operations.
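The shift and add procedure, as done by hand above, can be sketched in Python. The function name `multiply` and the `bits` parameter are illustrative; operands are unsigned.

```python
def multiply(multiplicand, multiplier, bits=32):
    """Unsigned shift and add: one partial product per multiplier bit."""
    product = 0
    for i in range(bits):
        bit = (multiplier >> i) & 1
        partial = multiplicand if bit else 0   # AND of multiplicand with one bit
        product += partial << i                # shift left i bits, then add
    return product
```

With four bit operands, `multiply(0b1010, 0b0110, bits=4)` reproduces the worked example and gives 0111100, that is, 60.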

One such implementation is shown in the following slide.

153

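Hardware: a Multiplicand register (32 bits), a 32 bit ALU, a Product register (64 bits, with shift right and write controls), a Multiplier register (with a shift right control), and a control test block.

Algorithm:

Start.

1. Test multiplier[i]. If it is 1, add the multiplicand to the left half of the product and place the result in the left half of the product. If it is 0, skip the addition.

2. Shift the product register right 1 bit.

3. Shift the multiplier register right 1 bit.

4. If there have not yet been 32 repetitions, repeat from step 1; otherwise, done.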

154

Note that this implementation requires 32 shift and add operations,

and requires that the multiplier, multiplicand, and product all be

stored in separate registers.

Noting that the product register “consumes” one additional bit on

each iteration, and the multiplier register effectively removes one bit

in the same iteration, we can reduce the hardware further by storing

the multiplier in the right hand half of the product register.

It will be shifted out at the same rate product digits are shifted in.
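Sharing one register between the product and the multiplier can be sketched in Python as follows. The register is modeled as a single integer; the function name is illustrative, and the carry out of the adder is assumed to be kept and shifted into the left half, which Python's unbounded integers do automatically.

```python
def multiply_shared(multiplicand, multiplier, bits=32):
    """Unsigned multiply with the multiplier held in the right half of the product."""
    mask = (1 << bits) - 1
    product = multiplier & mask              # right half starts as the multiplier
    for _ in range(bits):
        if product & 1:                      # test the current multiplier bit
            left = (product >> bits) + (multiplicand & mask)
            product = (left << bits) | (product & mask)
        product >>= 1                        # product and multiplier shift together
    return product
```

Each right shift discards one used multiplier bit and makes room for one more product bit, so after `bits` iterations the register holds only the product.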

155

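Hardware: a Multiplicand register, a 32 bit ALU, a Product register (64 bits, with shift right and write controls), and a control test block.

Algorithm:

Start.

1. Test multiplier[i]. If it is 1, add the multiplicand to the left half of the product and place the result in the left half of the product. If it is 0, skip the addition.

2. Shift the product register right 1 bit.

3. If there have not yet been 32 repetitions, repeat from step 1; otherwise, done.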

156

Another possibility is to use an array of adders and AND gates to

directly implement each partial product and sum all the partial prod-

ucts.

This would require $n^2$ adders for an n-bit multiplier.

Following is a single multiply-add unit, consisting of an adder and

an AND gate:

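[Figure: a single multiply-add cell, consisting of an AND gate and an adder (inputs A, B, C; outputs S, C). The cell has inputs xj, yi, Pi, and Ci, passes xj through, and produces Pi+1 and Ci+1.]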

157

A 4 bit parallel multiplier:

[Figure: a 4 × 4 array of sixteen multiply-add cells, with inputs x0 to x3 and y0 to y3, initial inputs of 0, and product outputs P0 to P7.]

158

Signed multiplication

The previous algorithms work fine for positive numbers, but how

could negative numbers be handled?

One way is to convert the negative numbers to positive (2’s com-

plementation), perform the multiplication, and adjust the sign by

performing 2’s complementation again if necessary.

It turns out that there is a more elegant solution.

Recoded multiplication

If the adders which are used to construct the multiplier can also sub-

tract, another possibility for speeding up the multiplication process

is to “recode” the multiplication operation. We can consider the

following as a “recoding” of the multiply operation:

a_i   a_{i-1}   operation    comment
 0       0      -            no action
 0       1      1 × M        add
 1       0      2 × M        shift and add
 1       1      4 × M - M    shift 2 and subtract

This recoding applied to the previous algorithms would allow the

multiply operations to complete in 16 iterations, rather than 32, using

the same hardware.

Essentially, this algorithm does a 2-bit multiply in each step.
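The recoding table can be sketched in Python by consuming the multiplier two bits at a time. The function name is illustrative, M is the multiplicand, operands are unsigned, and `bits` is assumed to be even.

```python
def multiply_recoded(m, a, bits=32):
    """Unsigned multiply of m by a, examining two multiplier bits per step."""
    product = 0
    for i in range(0, bits, 2):
        pair = (a >> i) & 0b11
        if pair == 0b00:
            term = 0                 # no action
        elif pair == 0b01:
            term = m                 # 1 x M: add
        elif pair == 0b10:
            term = m << 1            # 2 x M: shift and add
        else:
            term = (m << 2) - m      # 4 x M - M: shift 2 and subtract
        product += term << i
    return product
```

A 32 bit multiplier is consumed in 16 iterations, each needing at most one addition or subtraction of a shifted multiplicand.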

159

Booth’s algorithm

If the multiplier contains a string of 1’s, then this can be rewritten

as a string of 0’s, provided we can do a subtraction. For example, if

we were to perform the following multiplication:

0 0 0 1 1 0 1

x 0 0 1 1 1 1 0

We could rewrite the multiplier as 0100000 - 0000010 to give the

following equivalent operation:

0 0 0 1 1 0 1

x 0 1 0 0 0 -1 0

Note that this “recoding” of the multiplier allows many more shifts

over 0’s than the previous multiplier. It is possible to use this ob-

servation to recode groups of 2 bits in a multiplier to use only shifts

and add or subtract operations to implement the multiplication.

The main idea behind Booth’s algorithm is to identify strings of 1’s

and replace them with 0’s, and apply the above observation.

This can be accomplished by examining pairs of bits:

Left bit   Right bit   Explanation                 Example
   1          0        Beginning of a run of 1’s   000111[10]
   1          1        Middle of a run of 1’s      0001[11]10
   0          1        End of a run of 1’s         00[01]1110
   0          0        Middle of a run of 0’s      0[00]11110

(The brackets mark the pair of bits being examined.)

160

Booth’s algorithm simply examines pairs of bits in the multiplier

and does the following, using the hardware for the multiplier shown

previously:

1. Depending on the values of the current bit and previous bit, do

the following:

00: Middle of a string of 0’s, so do nothing

01: End of a string of 1’s, so add multiplicand to the left half

of the product register

10: Beginning of a string of 1’s, so subtract multiplicand from

the left half of the product register

11: Middle of a string of 1’s, so do nothing

2. Shift the product register to the right by 1 bit.

3. Repeat from step 1 until all the multiplier bits have been con-

sumed.
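The steps above can be sketched in Python for 2's complement operands. The register is modeled as one integer of 2 × `bits` bits, the shift in step 2 is taken to be an arithmetic shift (it preserves the sign), and the "previous bit" starts at 0. These details and the function name are assumptions. The sketch does not handle a multiplicand equal to the most negative value, because its negation does not fit in `bits` bits.

```python
def booth_multiply(multiplicand, multiplier, bits=32):
    """Multiply two `bits`-wide 2's complement integers using Booth's algorithm."""
    mask = (1 << bits) - 1
    top = 2 * bits - 1                        # index of the register's sign bit
    product = multiplier & mask               # right half holds the multiplier
    previous = 0                              # bit to the right of bit 0
    for _ in range(bits):
        current = product & 1
        left = product >> bits
        if (current, previous) == (0, 1):     # end of a string of 1's: add
            left = (left + multiplicand) & mask
        elif (current, previous) == (1, 0):   # beginning of a string of 1's: subtract
            left = (left - multiplicand) & mask
        product = (left << bits) | (product & mask)
        previous = current
        sign = product >> top                 # arithmetic shift right by 1 bit
        product = (product >> 1) | (sign << top)
    if product >> top:                        # interpret the register as signed
        product -= 1 << (2 * bits)
    return product
```

With 8 bit operands, `booth_multiply(13, 30, bits=8)` gives 390, matching the example 0001101 × 0011110 above.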

The original purpose was to speed up multiplication, since shifting

was much faster than adding.

One of the major advantages of Booth’s algorithm is that it also

works for 2’s complement numbers.

In modern processors, the multiply operation is usually implemented

directly in the hardware with a multiply unit as part of the datapath.

161

Division

Division is similar to multiplication, in that it is based on repeated

subtraction. The main difference is that the quotient of two integers

is not necessarily an integer.

The basic algorithm is a “shift and subtract” procedure, similar to

multiplication.

      1010     Quotient
   1101101     Dividend
  -1010        subtract Divisor
     11        difference
     111       shift in next bit, compare to Divisor (set 0 in Quotient)
     1110      shift in next bit, compare to Divisor
    -1010      subtract Divisor (set 1 in Quotient)
      100      difference
      1001     shift in next bit, compare to Divisor (set 0 in Quotient)
               Remainder (last bit was shifted in)

This algorithm is slightly more complex than multiplication, because

of the comparison with the Divisor at each step. Only if the partial
dividend is greater than or equal to the divisor is the subtraction actually done.

In practice, the operations compare and subtract are essentially

the same, so it is common just to subtract and check to see if the

result is positive, otherwise the previous value is restored, a 0 set as

the quotient bit, and the shift performed.
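The shift and subtract procedure can be sketched in Python using a trial subtraction at each step. The function name `divide` and the `bits` parameter (the width of the dividend) are illustrative; operands are unsigned and the divisor is assumed to be nonzero.

```python
def divide(dividend, divisor, bits=32):
    """Unsigned long division; returns (quotient, remainder)."""
    quotient, remainder = 0, 0
    for i in range(bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # shift in next bit
        difference = remainder - divisor                      # trial subtraction
        if difference >= 0:
            remainder = difference                            # keep the difference
            quotient = (quotient << 1) | 1                    # set 1 in the quotient
        else:
            quotient = quotient << 1                          # keep old value, set 0
    return quotient, remainder
```

For the worked example, `divide(0b1101101, 0b1010, bits=7)` gives quotient 1010 and remainder 1001.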

Note that the divisor and remainder can be the same size.

162

In the same way that the multiplier and product could share the

same 64 bit register, the quotient and remainder bits can share a

single 64 bit register for division.

[Figure: division hardware. A Divisor register, a 32 bit ALU, a 64 bit Remainder register with shift right, shift left, and write controls, and a control test block.]

Note the similarity between the hardware for multiplication and di-

vision — the same hardware can be used for both functions.

The difference is only in the control algorithm for each.

Following is a control algorithm for the hardware for division:

163

Start

1. Shift the remainder register left 1 bit.

2. Subtract the divisor register from the left half of the remainder register and place the result in the left half of the remainder register.

3. Test the remainder:

– If remainder >= 0: shift the remainder register to the left, setting the new rightmost bit to 1.

– If remainder < 0: restore the original value by adding the divisor register to the left half of the remainder register and place the sum in the left half of the remainder register. Shift the remainder register to the left, setting the new rightmost bit to 0.

4. 32 repetitions? If no, repeat from step 2. If yes, done: shift the left half of the remainder 1 bit right.
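The control algorithm for division can be tried out in software. The following is a minimal Python sketch, not the hardware itself: the function name `restoring_divide` and the parameter `n` (the word size) are illustrative assumptions, and an unbounded Python integer stands in for the remainder register.

```python
def restoring_divide(dividend, divisor, n=32):
    """Shift-and-subtract (restoring) division of unsigned n-bit values.

    `rem` models the 2n-bit remainder register: the dividend starts in the
    right half, and quotient bits are shifted in from the right.
    """
    if divisor == 0:
        raise ValueError("division by zero")
    if not (0 <= dividend < (1 << n) and 0 < divisor < (1 << n)):
        raise ValueError("operands must be unsigned n-bit values")

    mask = (1 << n) - 1
    rem = dividend << 1                 # initial shift left by 1 bit
    for _ in range(n):                  # n repetitions
        left = (rem >> n) - divisor     # subtract divisor from the left half
        if left >= 0:
            rem = (left << n) | (rem & mask)
            rem = (rem << 1) | 1        # shift left, new rightmost bit = 1
        else:
            # "Restore" by keeping the old left half; new rightmost bit = 0.
            rem <<= 1
    remainder = (rem >> n) >> 1         # undo the extra shift of the left half
    quotient = rem & mask
    return quotient, remainder


print(restoring_divide(0b1101101, 0b1010))  # the worked example: 109 / 10
```

For the worked example above (dividend 1101101, divisor 1010) the sketch returns quotient 1010 and remainder 1001.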

164

“Real” arithmetic — floating point numbers

Often we want to represent numbers over a wider range than can be

represented with 32 bit integers. Most computers support floating

point numbers. These numbers are similar to the representation in

scientific notation. They consist of two parts, a mantissa (significand)

and an exponent.

The representation is of the form

$(-1)^s \times M \times 2^E$

where s denotes the sign, M is the mantissa, and E is the exponent.

The exponent is adjusted such that the mantissa is of the form

1.xxxxxx...

There is a standard for representing binary floating point numbers,

(IEEE 754 floating point standard) universally supported by manu-

facturers of computer systems.

The single precision (32 bit) form of a floating point number is:

bit 31      bits 30−23     bits 22−0
s           exponent       mantissa
(1 bit)     (8 bits)       (23 bits)

The exponent is in excess 127 representation, and the mantissa is

normalized so the leading digit is 1. (Since it is always 1, it is not

stored explicitly, permitting an additional digit to be represented in

the mantissa.)
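The three fields can be inspected directly. The following is a small Python sketch using the standard `struct` module; the function names `decode_single` and `value_of_normalized` are illustrative assumptions, and the second function handles normalized numbers only.

```python
import struct


def decode_single(x):
    """Return the (sign, exponent field, fraction field) of x stored as a 32 bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, excess 127
    fraction = bits & 0x7FFFFF        # 23 bits, the leading 1 is not stored
    return sign, exponent, fraction


def value_of_normalized(sign, exponent, fraction):
    """Rebuild the value of a normalized number from its fields."""
    mantissa = 1 + fraction / 2**23   # put back the implied leading 1
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)


print(decode_single(-6.5))  # -6.5 = -1.101 (binary) x 2^2
```

Here −6.5 decodes to sign 1, exponent field 129 (that is, 2 + 127), and a fraction field whose leading bits are 101.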

165

There is a double precision (64 bit minimum) form for floating point

numbers. Here the exponent is 11 bits, in excess 1023 form, and the

mantissa is 52 bits.

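Single precision numbers have magnitudes in the range of approximately $2.0 \times 10^{-38}$ to $2.0 \times 10^{38}$.

Double precision numbers have magnitudes in the range of approximately $2.0 \times 10^{-308}$ to $2.0 \times 10^{308}$.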

Note that floating point numbers are not distributed uniformly.

The following shows the case for a 2-bit mantissa:

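[Figure: number line showing the representable values for a 2-bit mantissa, marked at powers of 2 from $2^{-1}$ to $2^{3}$; the spacing between adjacent values grows with each power of 2.]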

This shows that the separation between floating point numbers de-

pends on the value of the exponent.

This means that “computer arithmetic” with floating point numbers

does not behave like “real” arithmetic.

For example, the following property does not hold:

(A +B) + C = A + (B + C)

Also, subtracting two numbers which are nearly equal causes “loss of

precision.”

For example, (using decimal arithmetic)

$1.11010110 \times 10^{2} - 1.11010000 \times 10^{2} = 0.00000110 \times 10^{2} = 1.10000000 \times 10^{-4}$

which now really has only 3 significant digits.
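The failure of associativity is easy to observe. The following Python sketch uses double precision values chosen for illustration (they are assumptions, not taken from the slides): C is far smaller than the spacing between floating point numbers near B, so adding it to B first loses it entirely.

```python
a = 1e20
b = -1e20
c = 1.0

left = (a + b) + c    # a + b is exactly 0, so the result is 1.0
right = a + (b + c)   # b + c rounds back to b, so the result is 0.0

print(left, right, left == right)
```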

166

The IEEE 754 floating point standard attempts to minimize problems

with floating point arithmetic, by several means, including:

• Providing several user-selected rounding modes, including

1. Round to nearest (the default mode)

2. Round towards +∞

3. Round towards −∞

4. Round towards 0

These rounding modes allow the user to determine the effect of

rounding errors.

• Providing representations for +∞, −∞, and “not a number”

(NAN).

• Insisting that all calculations produce the same result as if the

floating point calculations were exact, and then reduced to the

number of bits used in the mantissa.

In order to do this, three additional bits are required when performing arithmetic operations. Two bits, called the guard and round bits, are required to ensure that normal rounding is accurate. A third bit, called the sticky bit, is set if any of the discarded bits in the exact calculation would have been 1. This is required for rounding towards ∞.

167

• Providing well-defined exceptions, and a provision for trapping

and handling those exceptions.

The five possible exceptions are:

1. Invalid operation; e.g., 0/0, 0×∞

2. Overflow

3. Underflow

4. Divide by 0 (this produces ±∞ if the exception is not trapped).

5. Inexact — the rounded result is not the actual result

The standard also provides for denormalized numbers: numbers where the leading digits in the mantissa are 0 (and the implied digit is not present before the binary point). This allows a graceful underflow.

Special combinations of the stored (biased) exponent field (E) and the fractional part of the mantissa (f) represent denormalized numbers, 0, ∞, and NAN. Here Emax is the maximum value of the exponent field (255 for single precision numbers):

• if E = Emax and f = 0 then the number represents ±∞, depending on the sign bit.

• if E = Emax and f ≠ 0 then the number represents NAN.

• if E = 0 and f = 0 then the number represents 0.

• if E = 0 and f ≠ 0 then the number is denormalized, and represents $(-1)^s \times 2^{-126} \times 0.f$ for single precision (in general the exponent used is the minimum normalized exponent, −126 for single precision and −1022 for double precision).
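The special cases can be sketched as a small classifier over the 32 bit pattern of a single precision number. This is an illustrative Python sketch; the function names `classify_single` and `denormalized_value` are assumptions made for the example.

```python
def classify_single(bits):
    """Classify a 32 bit single precision pattern by its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 255:                      # maximum exponent field
        return "infinity" if fraction == 0 else "NAN"
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    return "normalized"


def denormalized_value(bits):
    """Value of a denormalized pattern: (-1)^s x 2^-126 x 0.f"""
    sign = bits >> 31
    fraction = bits & 0x7FFFFF
    return (-1) ** sign * 2.0 ** -126 * (fraction / 2**23)


print(classify_single(0x7F800000), classify_single(0x00000001))
```

The smallest positive denormalized single precision number, bit pattern 0x00000001, has the value 2^−149.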

168

Another feature of the floating point standard was the provision for

extended formats for both single and double precision floating point

numbers.

The format parameters are summarized in the following table:

Parameter            Single   Single Extended   Double   Double Extended
Mantissa (bits)      23       ≥ 32              52       ≥ 64
Exponent (bits)      8        ≥ 11              11       ≥ 15
Total width (bits)   32       ≥ 43              64       ≥ 79
Exponent bias        127      unspecified       1023     unspecified
Emax                 127      ≥ 1023            1023     ≥ 16383
Emin                 −126     ≤ −1022           −1022    ≤ −16382

Recall that the leading 1 of the mantissa for normalized numbers is

not included in the table, so the actual precision of the mantissa is

one more bit than indicated.

The extended precision formats are used in the floating point processors designed by INTEL, whose floating point units use an 80 bit representation for floating point numbers.

The IEEE floating point standard was revised in 2008 (IEEE 754-2008); the revision combined this standard with another standard covering decimal floating point arithmetic.

169

Implementation of floating point arithmetic

Floating point addition:

Start

1. Compare exponents, and shift the mantissa of the number with the smaller exponent to the right until its exponent equals the larger exponent.

2. Add the mantissas.

3. Normalize the sum, shifting either right or left, incrementing or decrementing the exponent with each shift.

4. Overflow or underflow? If yes, raise an Exception. If no, continue.

5. Round the mantissa to the appropriate number of bits.

6. Still normalized? If no, return to step 3. If yes, Done.

170

Floating point multiplication:

Start

1. Add the exponents. Subtract the bias from the sum to get the new biased exponent.

2. Multiply the mantissas.

3. Normalize the product, shifting right and incrementing the exponent.

4. Overflow or underflow? If yes, raise an Exception. If no, continue.

5. Round the mantissa to the appropriate number of bits.

6. Still normalized? If no, return to step 3. If yes, continue.

7. Set the sign bit appropriately.

Done

171

Hardware implementations of the basic floating point operations (ad-

dition, multiplication, and division) are provided in virtually all mod-

ern microprocessors.

Some processors have independent units for multiplication and addi-

tion, so both operations can execute in parallel.

The MIPS had a separate floating point unit which was used in

combination with the processor chip. Later versions integrated the

floating point unit with the processor.

A similar evolution happened earlier with the INTEL 80x8x archi-

tecture — the floating point unit was a separate co-processor, and

operated in parallel with the main processor.

Internally, INTEL’s floating point processor (which was the first float-

ing point unit to comply with the then new floating point standard)

used 80 bit arithmetic, in a stack architecture.

172

How can we determine performance?

Let us look at an example from the transportation industry:

Aircraft          Passenger   Fuel        Range     Cruising   Throughput   Cost
                  Capacity    Capacity              Speed
Boeing 747-400    421         216,847     10,734    920        387,320      0.048
Boeing 767-300    270         91,380      10,548    853        230,310      0.032
Airbus 340-300    284         139,681     12,493    869        246,796      0.039
Airbus 319-100    120         23,859      4,442     837        100,440      0.045
BAE-146-200       77          11,750      2,406     708        54,516       0.063
Concorde          132         119,501     6,230     2180       287,760      0.145
Dash-8            50          3,202       1,389     531        26,550       0.046
My car            5           60          700       100        500          0.017

Where fuel capacity is in litres, range is in Km., and speed is in Km/h,

throughput is the (number of passengers) × (cruising speed),

and cost is the (fuel) per (passenger-Km.), determined as (fuel capacity)/(passengers × range).
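The two derived columns can be recomputed from the raw data. The following Python sketch does so for three rows of the table; the function and variable names are illustrative assumptions.

```python
# (passengers, fuel capacity in litres, range in km, cruising speed in km/h)
aircraft = {
    "Boeing 747-400": (421, 216847, 10734, 920),
    "Concorde": (132, 119501, 6230, 2180),
    "My car": (5, 60, 700, 100),
}


def throughput(passengers, speed):
    """Passenger-km carried per hour."""
    return passengers * speed


def cost(fuel, passengers, distance):
    """Litres of fuel per passenger-km."""
    return fuel / (passengers * distance)


for name, (passengers, fuel, distance, speed) in aircraft.items():
    print(name, throughput(passengers, speed), round(cost(fuel, passengers, distance), 3))
```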

173

Which of these has the best “performance?”

This depends on how you define the term “performance.”

For raw speed, (getting from one place to another quickly) the Con-

corde is over twice as fast as its closest competitor.

If we are interested in the rate at which people are carried (we call

this throughput) then the Boeing 747-400 clearly has the best

performance.

Often we are interested in relating performance and cost. In this

example, if we consider cost as the amount of fuel used per passenger-

Km., then the most economical plane is the Boeing 767-300. Clearly,

though, the car is easily the most economical overall.

Note that we could also define cost in many different ways.

We can define similar measures of performance and cost for comput-

ers.

In a computer system, we are interested in the number of computa-

tions done per unit time, as well as the cost of the computation.

Typically, we are interested in several aspects of the cost; e.g. the

initial purchase price, the operating cost, or the cost of training users

of the system.

174

In a computer system, we may be interested in the amount of time

it takes a program to complete, (speed or response time), or the

rate at which a number of processes complete (throughput), or in

the cost of the system, relative to its performance.

Since a computer program is merely a set of instructions for the

particular computer, one might think that comparing the average

instruction speed for two computers would be a good measure of

performance.

This turns out not to be so, for a number of reasons; for example:

• Different computers have different instruction sets; some have

very powerful instructions and others very simple, so the number

of instructions required for a program might be very different on

two different computers.

• The instructions themselves may be implemented differently, and have different execution times. This may even be true for two machines which have the same instruction set (e.g., the Pentium and the Pentium IV, or the AMD Athlon).

• Different compilers may produce very different machine code

from the same source code.

175

Typically, a processor has a basic “clock speed” and instructions require some multiple of the clock period (a whole number of clock cycles) to execute.

In order to determine the time required to execute a particular program ($T_P$) we might think that we could take each instruction ($I$) to be executed, multiply it by the number of clock cycles for the instruction ($\mathrm{CPI}_I$), and sum the result.

$T_P = \sum (I \times \mathrm{CPI}_I) \times (\text{time for one clock cycle})$

This does not work for several reasons:

• Many processors have instructions with variable execution times

• Most processors today execute several instructions simultane-

ously

It is possible, however, to approximate the run time of a program if

we can determine an average number of cycles per instruction for the

particular processor (and the program to be run).

In this case, the execution time can be approximated by

$T_P = \sum (I \times \text{average CPI}) \times (\text{time for one clock cycle})$

176

This can be rewritten as

$T_P = (N \times \text{average CPI}) \times (\text{time for one clock cycle})$

where N is the number of instructions executed by the program.
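As a quick numerical illustration of this approximation, the following Python sketch computes an average CPI from an instruction mix and then the execution time. The instruction mix, cycle counts, instruction count, and clock period are all made-up values chosen for the example, not measurements.

```python
def average_cpi(mix):
    """Weighted average cycles per instruction.

    `mix` maps an instruction class to (fraction of instructions, cycles).
    """
    return sum(fraction * cycles for fraction, cycles in mix.values())


def execution_time(n_instructions, cpi, cycle_time):
    """T_P = (N x average CPI) x (time for one clock cycle)."""
    return n_instructions * cpi * cycle_time


# Hypothetical mix: 50% ALU operations, 30% loads/stores, 20% branches.
mix = {"alu": (0.5, 1), "memory": (0.3, 2), "branch": (0.2, 3)}
cpi = average_cpi(mix)                      # 0.5*1 + 0.3*2 + 0.2*3 = 1.7
print(execution_time(1_000_000, cpi, 1e-9))  # one million instructions, 1 ns clock
```

Changing the mix changes the average CPI, which is why the same processor can show different performance on different programs.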

Note the following:

• The number of instructions executed by the program depends on

the compiler used to generate the machine language code, and

on the particular instruction set of the processor.

• The average CPI depends on the particular instruction mix of

the program.

• The clock cycle time depends on the detailed implementation of

the processor, including the speed of the underlying technology,

the complexity of the individual instructions, and the degree of

parallelism in the processor.

Improvements in compiler technology typically produce about a 5%

speedup per year.

Improvements in hardware technology typically produce about a 50% speedup per year.

177

All of the previous discussion makes several assumptions:

• The process under consideration is the only process running on

the machine.

In a “real” computing environment, many processes may be run-

ning simultaneously. (In a Linux system, run the program top

to see what processes are presently using resources).

• The processor speed determines the rate at which instructions

are executed.

In reality, memory access can be much slower than the processor

speed, especially for large programs where the entire data and

instructions cannot fit in main memory. (We will discuss memory

performance later in the course.)

• In high performance systems, several processors may work simul-

taneously on a single process.

At present, most processes run on a single processor, but it is

possible to break up a computation into several “threads” which

can be executed on different, interconnected, processors. (We

will discuss this later in the course.)

178

Why not use “typical” programs to measure perfor-

mance?

This seems reasonable, since our idea of performance for a computer

system is related to the time required to run the programs in which

we are interested.

Performance = 1/execution time

If we can find a “typical set of programs” which fairly reflect the type

of code we run, then comparing the time to run these on different

machines may be a good measure of performance.

Generally, though, we do not know exactly what programs will be

run on a system throughout its lifetime.

Also, the typical load (set of programs to be run) usually changes

over the useful life of a computer system.

Usually, our goal is more modest — to determine the “best” processor

for a particular set of programs, at a given price, at a given time.

179

Consider the following example:

Program Time on Time on

Machine A Machine B

P1 10 s 20 s

P2 50 s 25 s

Here, if we consider P1, then Machine A is twice as fast as Machine

B. If we consider program P2, Machine B is twice as fast as Machine

A.

It may be reasonable to use a weighted average of the programs,

where the weight is the relative number of times each program is

usually run. For example, if P1 is run 3 times as frequently as P2,

then the relative time required for Machines A and B is:

$\dfrac{(3 \times 10) + 50}{(3 \times 20) + 25} = \dfrac{80}{85}$

So, Machine A requires 80/85 the time of Machine B.

Alternatively, Machine A has 85/80 times the performance of Machine B.

Note that, for different weightings of the two programs, the con-

clusion as to which machine has the higher performance could be

different.
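The effect of the weighting can be checked with a few lines of Python. The function name `weighted_time` is an illustrative assumption; the run times are those in the table above.

```python
def weighted_time(times, weights):
    """Total time for a workload: sum of (run time x relative frequency)."""
    return sum(t * w for t, w in zip(times, weights))


machine_a = [10, 50]   # seconds for P1, P2
machine_b = [20, 25]

# P1 run 3 times as often as P2: A needs 80 s, B needs 85 s, so A is faster.
print(weighted_time(machine_a, [3, 1]), weighted_time(machine_b, [3, 1]))

# Equal weighting: A needs 60 s, B needs 45 s, so B is faster.
print(weighted_time(machine_a, [1, 1]), weighted_time(machine_b, [1, 1]))
```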

180

Performance benchmarks

In order to compare different processors, or different implementations

of a single processor, people use various measures of performance, or

benchmarks. Many benchmarks exist, often providing contradictory

information about various processors.

Several “standard” benchmark suites are available, and many of these

also specify how the benchmark programs are to be compiled and run.

One of the most famous benchmark suites (and also one of the most

useful) is the SPEC benchmark suite. Information about it can be

found at URL

http://www.spec.org/

The SPEC benchmark uses the weighted running times of a set of

programs. The programs have changed with time; the present SPEC

CPU (SPEC CPU2006) was preceded by SPEC CPU 2000, SPEC95,

SPEC92, and SPEC89.

There are now sets of SPEC benchmarks for different aspects of sys-

tems performance, including integer and floating point performance,

and graphics processor performance.

181

SPEC 2006 Benchmarks

Benchmark         Language   Category

Integer

400.perlbench     C          Programming Language
401.bzip2         C          Compression
403.gcc           C          C Programming Language Compiler
429.mcf           C          Combinatorial Optimization
445.gobmk         C          AI, Game Playing: Go
456.hmmer         C          Bioinf., Gene Sequence Search
458.sjeng         C          AI, Game Playing: Chess
462.libquantum    C          Physics/Quantum Computing
464.h264ref       C          Video Compression
471.omnetpp       C++        Discrete Event Simulation
473.astar         C++        Path-finding Algorithms
483.xalancbmk     C++        XML Processing

182

Benchmark         Language     Category

Float

410.bwaves        Fortran      Fluid Dynamics
416.gamess        Fortran      Quantum Chemistry
433.milc          C            Physics / Quantum Chromodynamics
434.zeusmp        Fortran      Physics / Computational Fluid Dynamics
435.gromacs       C, Fortran   Biochemistry / Molecular Dynamics
436.cactusADM     C, Fortran   Physics / General Relativity
437.leslie3d      Fortran      Fluid Dynamics
447.dealII        C++          Finite Element Analysis
450.soplex        C++          Linear Programming, Optimization
453.povray        C++          Image Ray-tracing
454.calculix      C, Fortran   Structural Mechanics
459.GemsFDTD      Fortran      Computational Electromagnetics
465.tonto         Fortran      Quantum Chemistry
470.lbm           C            Fluid Dynamics
481.wrf           C, Fortran   Weather
482.sphinx3       C            Speech Recognition

183

Determining the effect of performance “improvements”:

Consider the case where some aspect of the performance of a proces-

sor is improved, without making other improvements.

For example, consider a numerically intensive problem in which 25%

of the time is spent doing floating point arithmetic.

Suppose the floating point unit is improved to perform five times

faster. How much faster does the program run now?

Clearly, only the part of the program that has improved performance

will run faster, and we can easily calculate by how much:

0.75 + 0.25/5 = 0.8, so the program will require 80% of the original time.

This observation can be expressed as

$\text{execution time after improvement} = \text{execution time of unimproved part} + \dfrac{\text{execution time of improved part}}{\text{amount of improvement}}$

This relationship is called Amdahl’s law.

Note that the overall improvement is relatively small (a 20% reduction in run time, or a speedup of 1/0.8 = 1.25) even though the performance increase for part of the code was dramatic.

Amdahl’s law has interesting consequences for parallel machines —

ultimately, it is the serial, or unparallelizable, component of the code

that determines its running time.
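Amdahl's law is simple to evaluate numerically. The following Python sketch reproduces the floating point example and shows the limit imposed by the unimproved part; the function names are illustrative assumptions.

```python
def time_after_improvement(unimproved, improved, factor):
    """Execution time after the improved part is sped up by `factor`."""
    return unimproved + improved / factor


def speedup(fraction_improved, factor):
    """Overall speedup when a fraction of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)


print(time_after_improvement(0.75, 0.25, 5))   # 80% of the original time
print(speedup(0.25, 5))                        # overall speedup of 1.25
print(speedup(0.25, 1e9))                      # approaches 1/0.75, about 1.33
```

Even an arbitrarily fast floating point unit cannot give a speedup beyond 1/0.75 here, because 75% of the time is spent elsewhere.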

184

Brief summary of performance measures

The only meaningful measure of performance is execution time for

your “job mix”

The time to execute a program depends on:

Clock speed (MHz)

Code size

Cycles per instruction (CPI)

Composite or other measures of performance — what problems arise

from their use?

MIPS (Millions of Instructions Per Second, or

Meaningless Indicator of Processor Speed)

MFLOPS (Millions of Floating Point Operations Per Second)

SPEC

185

Where are we now?

We have built up a “toolbox” of components (logic gates, adders, ALU’s, MUX’s, registers, etc.), and skills (combinational logic design, state machine design), and want to use those to implement a small MIPS-like processor.

186

Design and implementation of the processor

We now have all the raw material to design a processor with the

instruction set we examined earlier.

We will actually design several implementations of the processor,

each with different performance characteristics.

The first implementation will be a “single cycle processor” in which

each instruction will take exactly one clock period. (In other words,

the CPI will be 1.)

In the next implementation, each instruction will require several cy-

cles for execution, and different instructions may require a different

number of cycles. In this case, the clock period may be shorter than

the single cycle machine, but the CPI will be larger.

We will begin by reviewing the instruction set and designing a data

path.

Earlier, when discussing the instruction set, we identified a rough

structure for a computer system:

[Figure: block diagram of a computer system, with the CPU, MEMORY, and INPUT/OUTPUT connected to one another.]

187

Presently, we are interested in the CPU only, which we concluded

would have a structure similar to the following:

[Figure: structure of the CPU, showing the PCU (address generator and PC), the general registers and/or accumulator, the ALU, the instruction decode and control unit, and the MAR and MDR.]

The memory address register (MAR) and memory data register (MDR) are the interface to memory.

The ALU and register file are the core of the data path.

The program control unit (PCU) fetches instructions and data, and

handles branches and jumps.

The instruction decode unit (IDU) is the control unit for the proces-

sor.

188

The “building blocks”

We have already designed many of the major components for the

processor, or have at least identified how they could be implemented.

For example, we have already designed an ALU, a data register, and

a register file.

A controller is merely a state machine, and we can implement one

using, say, a PLA, after identifying the required states and transi-

tions.

Following are some of the combinational logic components we will

use:

[Figure: three combinational components. Adder: 32 bit inputs A and B, outputs Sum and Carry. ALU: 32 bit inputs A and B, control input OP, outputs Result and Zero. Multiplexor (MUX): 32 bit inputs A and B, select input S, output Y.]

Note that the diagram highlights the control signals (OP and S).

189

Following are some of the register components we will use:

[Figure: three register components. Register: 32 bit Data in and Data out, with Write enable and Clock inputs. Counter (PC): 32 bits, with Write enable and Clock inputs. Register file: 5 bit inputs Read register 1, Read register 2, and Write register; 32 bit input Write data; 32 bit outputs Read data 1 and Read data 2; Write enable and Clock inputs.]

Note that the registers have a write enable input as well as a clock

input. This input must be asserted in order for the register to be

written.

We have already seen how to construct a register file from simple D

registers.

190

Timing considerations

In a single-cycle implementation of the processor, a single instruction

(e.g., add) may require that a register be read from and written into

in the same clock period. In order to accomplish this, the register

file (and other register elements) must be edge triggered.

This can be done by using edge triggered elements directly, or by

using a master-slave arrangement similar to one we saw earlier:

[Figure: master-slave arrangement of two clocked D registers, labeled master and slave, each with a D input and a Q output.]

Another observation about a single cycle processor — the memory

for instructions must be different from the memory for data, because

both must be addressed in the same cycle. Therefore, there must be

two memories; one for instructions, and one for data.

[Figure: the two memories. Data memory: 32 bit Address and Write data inputs, 32 bit Read data output, control inputs MemWr and MemRd. Instruction memory: 32 bit Read address input, 32 bit Instruction[31-0] output.]

191

The MIPS instruction set:

Following is the MIPS instruction format:

R-type (register)
  op       rs       rt       rd       shamt    funct
  31-26    25-21    20-16    15-11    10-6     5-0
  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

I-type (immediate)
  op       rs       rt       immediate
  31-26    25-21    20-16    15-0
  6 bits   5 bits   5 bits   16 bits

J-type (jump)
  op       target
  31-26    25-0
  6 bits   26 bits

We will develop an implementation of a very basic processor having

the instructions:

R-type instructions add, sub, and, or, slt

I-type instructions addi, lw, sw, beq

J-type instructions j

Later, we will add additional instructions.
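The fields of a 32 bit instruction word can be extracted with shifts and masks. The following Python sketch is illustrative only; the function name `decode` and the dictionary keys are assumptions, and every field is extracted regardless of the instruction type (which fields are meaningful depends on the op code).

```python
def decode(instr):
    """Split a 32 bit MIPS instruction word into its possible fields."""
    return {
        "op": (instr >> 26) & 0x3F,      # bits 31-26
        "rs": (instr >> 21) & 0x1F,      # bits 25-21
        "rt": (instr >> 16) & 0x1F,      # bits 20-16
        "rd": (instr >> 11) & 0x1F,      # bits 15-11 (R-type)
        "shamt": (instr >> 6) & 0x1F,    # bits 10-6  (R-type)
        "funct": instr & 0x3F,           # bits 5-0   (R-type)
        "imm16": instr & 0xFFFF,         # bits 15-0  (I-type)
        "target": instr & 0x3FFFFFF,     # bits 25-0  (J-type)
    }


# 0x012A4020 encodes add with rd = 8, rs = 9, rt = 10.
print(decode(0x012A4020))
```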

192

Steps in designing a processor

• Express the instruction architecture in a Register Transfer Lan-

guage (RTL)

• From the RTL description of each instruction, determine

– the required datapath components

– the datapath interconnections

• Determine the control signals required to enable the datapath

elements in the appropriate sequence for each instruction

• Design the control logic required to generate the appropriate

control signals at the correct time

193

A Register Transfer Language description of some op-

erations:

The ADD instruction

add rd, rs, rt

• mem[PC]                  Fetch the instruction from memory

• R[rd] ← R[rs] + R[rt]    Set register rd to the value of the sum of the contents of registers rs and rt

• PC ← PC + 4              Calculate the address of the next instruction

All other R-type instructions will be similar.

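The addi instruction

addi rt, rs, imm16

• mem[PC]                            Fetch the instruction from memory

• R[rt] ← R[rs] + SignExt(imm16)     Set register rt to the value of the sum of the contents of register rs and the sign-extended immediate data word imm16

• PC ← PC + 4                        Calculate the address of the next instruction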

All immediate arithmetic and logical instructions will be similar.

194

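The load instruction

lw rt, rs, imm16

• mem[PC]                            Fetch the instruction from memory

• Addr ← R[rs] + SignExt(imm16)      Set memory address to the value of the sum of the contents of register rs and the sign-extended immediate data word imm16

• R[rt] ← Mem[Addr]                  Load the data at address Addr into register rt

• PC ← PC + 4                        Calculate the address of the next instruction

The store instruction

sw rt, rs, imm16

• mem[PC]                            Fetch the instruction from memory

• Addr ← R[rs] + SignExt(imm16)      Set memory address to the value of the sum of the contents of register rs and the sign-extended immediate data word imm16

• Mem[Addr] ← R[rt]                  Store the data from register rt into memory at address Addr

• PC ← PC + 4                        Calculate the address of the next instruction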

195

The branch instruction

beq rs, rt, imm16

• mem[PC]                                   Fetch the instruction from memory

• Cond ← R[rs] - R[rt]                      Evaluate the branch condition

• if (Cond eq 0)
      PC ← PC + 4 + (SignExt(imm16) × 4)    Calculate the address of the next instruction

• else PC ← PC + 4

The jump instruction

j target                                    target is a memory address

• mem[PC]                                   Fetch the instruction from memory

• PC ← PC + 4                               Increment PC by 4

• PC<31:2> ← PC<31:28> concat I<25:0>       Replace the low order 28 bits of the PC with the low order 26 bits from the instruction, left shifted by 2
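The next-PC calculations for the branch and jump instructions can be sketched in Python as follows. The function names are illustrative assumptions; addresses are treated as 32 bit unsigned values.

```python
def sign_extend16(imm16):
    """Interpret a 16 bit field as a signed (2's complement) value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """PC + 4 + (SignExt(imm16) x 4), used when the branch is taken."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, target26):
    """High order 4 bits of PC + 4, concatenated with the 26 bit target shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


print(hex(branch_target(0x1000, 3)))        # 3 instructions past PC + 4
print(hex(branch_target(0x1000, 0xFFFF)))   # offset -1: back to the branch itself
print(hex(jump_target(0x10000000, 0x100)))
```

Because the offset is counted in instructions (words), an immediate of −1 branches back to the branch instruction itself.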

196

The Instruction Fetch Unit

Note that all instructions require that the PC be incremented.

We will design a datapath which performs this function — the In-

struction Fetch Unit.

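Its operation is described by the following:

• mem[PC]          Fetch the instruction from memory

• PC ← PC + 4      Increment the PC

[Figure: the Instruction Fetch Unit. The PC supplies the Read address of the Instruction memory, which outputs Instruction[31−0]; an adder (Add) adds 4 to the PC to form the next PC value.]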

Note that this does not yet handle branches or jumps.

Since it is the same for all instructions, when describing individual

instructions this component will normally be omitted.

197

Datapath for R-type instructions

• R[rd] ← R[rs] op R[rt] Example: add rd, rs, rt

Recall that this instruction type has the following format:

R-type (register)
  op       rs       rt       rd       shamt    funct
  31-26    25-21    20-16    15-11    10-6     5-0
  6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

The datapath contains the 32 bit register file and an ALU capable of performing all the required arithmetic and logic functions.

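[Figure: datapath for R-type instructions. Inst[25−21] (rs) and Inst[20−16] (rt) select Read register 1 and Read register 2, and Inst[15−11] (rd) selects the Write register of the register file. Read data 1 (BusA) and Read data 2 (BusB) are the 32 bit ALU inputs, and the 32 bit ALU Result is returned to Write data. Control signals: RegWr and ALUCtr; the register file is clocked by clk.]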

Note that the register is read from and written to at the “same

time.” This implies that the register’s memory elements must be

edge triggered, or are read and written on different clock phases, to

allow the arithmetic operation to complete before the data is written

in the register.

198

This datapath contains everything required to implement the re-

quired instructions add, sub, and, or, slt. All that is required

is that the appropriate values be provided for the ALUCtr input for

the required operation.

The register operands in the instruction field determine the registers which are read from and written to, and the funct field of the instruction determines which particular ALU operation is executed.

Recalling the control inputs for the ALU seen earlier, the values for

the control input are:


Control input   Function
000             and
001             or
010             add
110             subtract
111             set on less than
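The behaviour selected by these control values can be modelled in a few lines of Python. This is a behavioural sketch only, not the gate-level design; the function names are assumptions, and set on less than is taken to be a signed comparison.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32 bit value as a 2's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (Result, Zero) for 32 bit inputs a and b and a 3 bit control input."""
    if control == 0b000:
        result = a & b
    elif control == 0b001:
        result = a | b
    elif control == 0b010:
        result = a + b
    elif control == 0b110:
        result = a - b
    elif control == 0b111:
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError("unused control value")
    result &= MASK32                 # keep 32 bits, discarding any carry
    return result, result == 0


print(alu(0b110, 5, 5))   # subtract: Result 0, Zero asserted (as beq requires)
```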

A control unit for the processor will be designed later.

It will set all the required control signals for each instruction, de-

pending both on the particular instruction being executed (the op

code) and, for r-type instructions, the funct field.

199

Datapath for Immediate arithmetic and logical instruc-

tions

• R[rt] ← R[rs] op imm16 Example: addi rt, rs, imm16

Recall that this instruction type has the following format:


I-type (immediate)
  op       rs       rt       immediate
  31-26    25-21    20-16    15-0
  6 bits   5 bits   5 bits   16 bits

The main difference between this and an r-type instruction is that

here one operand is taken from the instruction, and sign extended (for

signed data) or zero extended (for logical and unsigned operations.)

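[Figure: datapath for immediate instructions. Inst[25−21] (rs) selects Read register 1. A MUX controlled by RegDst selects either Inst[20−16] (rt) or Inst[15−11] as the Write register. Inst[15−0] (imm16) is extended from 16 to 32 bits, and a MUX controlled by ALUSrc selects either BusB or the extended immediate as the second ALU input; BusA is the first. Control signals: RegDst, ALUSrc, ALUCtr, RegWr; clock: Clk.]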

Note the use of MUX’s (with control inputs) to add functionality.

200

Datapath for the Load instruction

lw rt, rs, imm16


• Addr ← R[rs] + SignExt(imm16)    Calculate the memory address

• R[rt] ← Mem[Addr] load the data into register rt

This is also an immediate type instruction:


I-type (immediate)
  op       rs       rt       immediate
  31-26    25-21    20-16    15-0
  6 bits   5 bits   5 bits   16 bits


[Figure: datapath for the load instruction. The immediate datapath is extended with the Data memory: the ALU output is the memory Address, and a MUX controlled by MemtoReg selects either the ALU output or the memory Read data as the register file Write data. BusB is connected to the memory Data In. Control signals: RegDst, RegWr, AluSrc, AluCtr, MemRd, MemtoReg; clock: Clk.]

201

Datapath for the Store instruction

sw rt, rs, imm16


• Addr ← R[rs] + SignExt(imm16)    Calculate the memory address

• Mem[Addr] ← R[rt] Store the data from register rt to

memory

This is also an immediate type instruction:


I-type (immediate)
  op       rs       rt       immediate
  31-26    25-21    20-16    15-0
  6 bits   5 bits   5 bits   16 bits


[Figure: datapath for the store instruction. The same datapath as for the load instruction, with the Data memory write control added: the ALU output is the memory Address and BusB supplies the memory Data In. Control signals: RegDst, RegWr, AluSrc, AluCtr, MemWr, MemRd, MemtoReg; clock: Clk.]

202

Datapath for the Branch instruction

beq rt, rs, imm16

• Cond ← R[rs] - R[rt] Calculate the branch condition

• if (Cond eq 0)
      PC ← PC + 4 + (SignExt(imm16) × 4)    Calculate the address of the next instruction

• else PC ← PC + 4


This is also an immediate type instruction.

In the load and store instructions, the ALU was used to calculate

the address for data memory.

It is possible to do this for the branch instructions as well, but it

would require first performing the comparison using the ALU, and

then using the ALU to calculate the address.

This would require two clock periods, in order to sequence the oper-

ations correctly.

A faster implementation would be to provide another adder to im-

plement the address calculation. This is what we will do, for the

present example.

203

[Figure: datapath for the branch instruction. The instruction fetch unit (PC, Instruction memory, and the adder computing PC + 4) is combined with the register file and ALU. A second adder (Add) adds the sign-extended Inst[15−0], shifted left 2, to PC + 4, and a MUX controlled by PCSrc selects the next PC value. Signals shown: RegWr, RegDst, ALUSrc, ALUCtr, Zero, Branch, PCSrc.]

204

Datapath for the Jump instruction

j target

• PC<31:2> ← PC<31:28> concat target<25:0>    Calculate the jump address by concatenating the high order 4 bits of the PC with the target address

Here, the address calculation is just obtained from the high order 4

bits of the PC and the 26 bits (shifted left by 2 bits to make 28) of

the target address.

The additions to the datapath are straightforward.

J-type (jump)
  op       target address
  31-26    25-0
  6 bits   26 bits

205

[Figure: datapath for the jump instruction. The branch datapath is extended with a second Shift left 2 unit for the jump target and an additional MUX, controlled by Jump, that selects the jump address as the next PC value. Signals shown: RegWr, RegDst, ALUSrc, ALUCtr, Zero, Branch, Jump, PCSrc.]

206

Putting it together

The datapath was shown in segments, some of which built on each

other.

Required control signals were identified, and all that remains is to:

1. Combine the datapath elements

2. Design the appropriate control signals

Combining the datapath elements is rather straightforward, since

we have mainly built up the datapath by adding functionality to

accommodate the different instruction types.

When two paths are required, we have implemented both and used

multiplexors to choose the appropriate results.

The required control signals are mainly the inputs for those MUX’s

and the signals required by the ALU.

The next slide shows the combined data path, and the required con-

trol signals.

The actual control logic is yet to be designed.

207

[Figure: the combined datapath with its control signals. The Control unit takes Inst[31−26] and produces RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. The ALU control unit takes ALUOp and the funct field Inst[5−0]. Inst[25−21] (rs), Inst[20−16] (rt), and Inst[15−11] (rd) address the register file; Inst[15−0] is sign extended from 16 to 32 bits; Inst[25−0] is shifted left 2 (26 to 28 bits) and combined with PC+4[31−28] to form Jump address[31−0]. MUX's select the Write register, the second ALU input, the register Write data, and the next PC value (PCSrc, Jump). The ALU Zero output is used for branches.]

208

Designing the control logic

The control logic depends on the details of the devices in the control

path, and on the individual bits in the op code for the instructions.

The arithmetic and logic operations for the r-type instructions also

depend on the funct field of the instruction.

The datapath elements we have used are:

• a 32 bit ALU with an output indicating if the result is zero

• adders

• MUX’s (2 line to 1-line)

• a 32 register × 32 bits/register register file

• individual 32 bit registers

• a sign extender

• instruction memory

• data memory

209

The ALU — a single bit

[Figure: a single-bit ALU with data inputs a, b, and Less, the input Carryin, the control inputs Binvert and Operation, and the outputs Result and Carryout]

Note that there are three control bits; the single bit Binvert, and

the two bit input to the MUX, labeled Operation.

The ALU performs the operations and, or, add, and subtract.

210

The 32 bit ALU

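[Figure: the 32 bit ALU, built from 32 single-bit ALUs (ALU0 to ALU31) with inputs a0 to a31 and b0 to b31 and outputs Result0 to Result31. Binvert and Operation go to every stage. The Set output of ALU31 drives the Less input of ALU0; the Less inputs of the other stages are 0. The ALU also produces the outputs Zero and Overflow.]

The three control bits (Binvert, then the two Operation bits) select the ALU function: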


000 and

001 or

010 add

110 subtract

111 set on less than
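A behavioural sketch of this ALU in Python may help fix the meaning of the control codes. It models results only, not the gate-level structure; the names `alu` and `to_signed` are illustrative, and the signed comparison used for set on less than is an assumption about the intended behaviour.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control: int, a: int, b: int):
    """Return (result, zero) for the 3-bit ALU control codes in the table."""
    a &= MASK32
    b &= MASK32
    if control == 0b000:      # and
        result = a & b
    elif control == 0b001:    # or
        result = a | b
    elif control == 0b010:    # add
        result = (a + b) & MASK32
    elif control == 0b110:    # subtract (Binvert = 1, Carryin = 1)
        result = (a - b) & MASK32
    elif control == 0b111:    # set on less than (signed comparison assumed)
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError("unused control code")
    return result, int(result == 0)
```

For example, `alu(0b110, 5, 5)` gives a zero result with the Zero output set, which is how beq compares two registers.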

211

We will design the control logic to implement the following instruc-

tions (others can be added similarly):

Name Op-code

Op5 Op4 Op3 Op2 Op1 Op0

R-format 0 0 0 0 0 0

lw 1 0 0 0 1 1

sw 1 0 1 0 1 1

beq 0 0 0 1 0 0

j 0 0 0 0 1 0

Note that we have omitted the immediate arithmetic and logic func-

tions.

The funct field will also have to be decoded to produce the required

control signals for the ALU.

A separate decoder will be used for the main control signals and the

ALU control. This approach is sometimes called local decoding. Its

main advantage is in reducing the size of the main controller.

212

The control signals

The signals required to control the datapath are the following:

• Jump — set to 1 for a jump instruction

• Branch — set to 1 for a branch instruction

• MemtoReg — set to 1 for a load instruction

• ALUSrc — set to 0 for r-type instructions, and 1 for instructions

using immediate data in the ALU (beq requires this set to 0)

• RegDst — set to 1 for r-type instructions, and 0 for immediate

instructions

• MemRead — set to 1 for a load instruction

• MemWrite — set to 1 for a store instruction

• RegWrite — set to 1 for any instruction writing to a register

• ALUOp (k bits) — encodes ALU operations except for r-type

operations, which are encoded by the funct field

For the instructions we are implementing, ALUOp can be encoded

using 2 bits as follows:

ALUOp[1] ALUOp[0] Instruction

0 0 memory operations (load, store)

0 1 beq

1 0 r-type operations

213

The following tables show the required values for the control signals

as a function of the instruction op codes:

Instruction  Op-code  RegDst  ALUSrc  MemtoReg  RegWrite
r-type       000000   1       0       0         1
lw           100011   0       1       1         1
sw           101011   x       1       x         0
beq          000100   x       0       x         0
j            000010   x       x       x         0

Instruction  Op-code  MemRead  MemWrite  Branch  ALUOp[1:0]  Jump
r-type       000000   0        0         0       10          0
lw           100011   1        0         0       00          0
sw           101011   0        1         0       00          0
beq          000100   0        0         1       01          0
j            000010   0        0         0       xx          1

This is all that is required to implement the control signals; each

control signal can be expressed as a function of the op-code bits.

For example,

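$$\text{RegDst} = \overline{Op5} \cdot \overline{Op4} \cdot \overline{Op3} \cdot \overline{Op2} \cdot \overline{Op1} \cdot \overline{Op0}$$

$$\text{ALUSrc} = Op5 \cdot \overline{Op4} \cdot \overline{Op2} \cdot Op1 \cdot Op0$$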

All that remains is to design the control for the ALU.
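The main control tables can be checked mechanically. The following Python sketch stores the tables as a dictionary and compares the two example logic functions against them. The names are illustrative, `None` stands for a don't-care (x), and the complemented literals are inferred from the op-codes in the table.

```python
X = None  # don't-care entry

# One row per op-code, copied from the control tables.
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10, Jump=0),  # r-type
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00, Jump=0),  # lw
    0b101011: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00, Jump=0),  # sw
    0b000100: dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01, Jump=0),  # beq
    0b000010: dict(RegDst=X, ALUSrc=X, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=X, Jump=1),     # j
}


def bit(op: int, i: int) -> int:
    return (op >> i) & 1


def reg_dst(op: int) -> int:
    # All six op-code bits complemented: true only for op-code 000000.
    return int(all(bit(op, i) == 0 for i in range(6)))


def alu_src(op: int) -> int:
    # Op5 . not Op4 . not Op2 . Op1 . Op0 (Op3 is a don't-care: lw and sw).
    return bit(op, 5) & (1 - bit(op, 4)) & (1 - bit(op, 2)) & bit(op, 1) & bit(op, 0)


def agrees(fn, signal: str) -> bool:
    """True if fn matches the table wherever the table is not a don't-care."""
    return all(row[signal] is X or fn(op) == row[signal]
               for op, row in CONTROL.items())
```

Calling `agrees(reg_dst, "RegDst")` and `agrees(alu_src, "ALUSrc")` confirms that each function matches every specified table entry.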

214

The ALU control

The inputs to the ALU control are the ALUOp control signals, and

the 6 bit funct field.

The funct field determines the ALU operations for the r-type op-

erations, and ALUOp signals determine the ALU operations for the

other types of instructions.

Previously, we saw that if ALUOp[1] was 1, it indicated an r-type

operation. ALUOp[0] was set to 0 for memory operations (requiring

the ALU to perform an add operation to calculate the address for

data) and to 1 for the beq operation, requiring a subtraction to

compare the two operands.

The ALU itself requires three inputs.

The following table shows the required inputs and outputs for the

instructions using the ALU:

Instruction  ALUOp  funct   ALU operation     ALU control input
lw           00     xxxxxx  add               010
sw           00     xxxxxx  add               010
beq          01     xxxxxx  subtract          110
add          10     100000  add               010
sub          10     100010  subtract          110
and          10     100100  AND               000
or           10     100101  OR                001
slt          10     101010  set on less than  111
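The ALU control truth table translates directly into a small lookup. This is a sketch of the table only, not of a gate-level decoder; the names `alu_control` and `FUNCT_TO_ALU` are illustrative.

```python
# funct field -> 3-bit ALU control input, for the r-type rows of the table
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Return the 3-bit ALU control input for a 2-bit ALUOp and 6-bit funct."""
    if alu_op == 0b00:          # lw, sw: add to form the address (funct ignored)
        return 0b010
    if alu_op == 0b01:          # beq: subtract to compare (funct ignored)
        return 0b110
    if alu_op == 0b10:          # r-type: decode the funct field
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("ALUOp value not used by this instruction subset")
```

Only the r-type case looks at the funct field, which is why the main controller does not need to see it.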

215

Extending the instruction set

What is necessary to add another instruction to the instruction set?

First, the appropriate elements must be added to the datapath.

Second, any control elements must be added, and appropriate con-

trol signals identified.

Third, the control logic must be extended to enable the appropriate

elements in the datapath.

Let us consider adding the instruction or immediate (ori)

It has the form

ori $s1, $s2, imm

Its function is to perform the logical OR of the contents of register

$s2 with the zero extended immediate data field imm, storing the

result in register $s1.

$s1 ← $s2 | ZeroExtend[imm]

It has op-code 0 0 1 1 0 1 and is an immediate type instruction.

216

First — add elements to the data path

Examining the data path, the ALU can perform the OR operation,

but the extender unit only supports sign extension. It can be re-

placed by a unit, sign or zero extend, which can perform both

functions.

Second — add control elements

This new unit requires a new control signal to select the zero extend

function (0) or the sign extend function (1).

We will label the new signal ExtOp.
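The behaviour of the combined extender can be sketched as follows. The function names `extend` and `ori` are illustrative; the encoding of ExtOp (0 for zero extend, 1 for sign extend) follows the text.

```python
def extend(imm16: int, ext_op: int) -> int:
    """Extend a 16-bit immediate to 32 bits: ExtOp = 0 zero extends, 1 sign extends."""
    imm16 &= 0xFFFF
    if ext_op == 1 and imm16 & 0x8000:
        return imm16 | 0xFFFF0000   # copy the sign bit into the upper 16 bits
    return imm16                    # upper 16 bits are 0


def ori(rs_value: int, imm16: int) -> int:
    """$s1 <- $s2 | ZeroExtend[imm], as a 32-bit value."""
    return (rs_value | extend(imm16, 0)) & 0xFFFFFFFF
```

With sign extension, an immediate such as 0x8000 would set the upper 16 bits of the result, which is not what ori specifies; this is why the new control signal is needed.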

Also, the 2-bit control signal ALUOp only encodes the operations add

and subtract. Adding a third bit would allow the encoding of the

operations AND and OR.

It can be encoded as follows:

ALUOp[2] ALUOp[1] ALUOp[0] Instruction

0 0 0 memory operations (load, store)

0 0 1 beq (subtract, in the ALU)

0 1 0 ori

1 x x r-type operations

The following diagram shows the changes required to the datapath:

217

[Figure: the combined datapath modified for the ori instruction. The sign extender is replaced by a sign or zero extend unit, controlled by the new control signal ExtOp.]

218

Third - the control logic

The truth table for the ALU control unit extends to:

Instruction  ALUOp  funct   ALU operation     ALU control input
lw           000    xxxxxx  add               010
sw           000    xxxxxx  add               010
beq          001    xxxxxx  subtract          110
ori          010    xxxxxx  OR                001
add          100    100000  add               010
sub          100    100010  subtract          110
and          100    100100  AND               000
or           100    100101  OR                001
slt          100    101010  set on less than  111

For the ori instruction, the following settings are required for the

remaining control signals:

Jump 0

Branch 0

MemRead 0

MemWrite 0

MemtoReg 0

ALUSrc 1 ALU operand is from the extender

RegDst 0 rt is the destination register

RegWrite 1 result will be written in reg[rt]

ExtOp 0 zero extend

219

The modified tables for the control signals are:

Inst.   Op-code  RegDst  ALUSrc  MemtoReg  RegWrite
r-type  000000   1       0       0         1
lw      100011   0       1       1         1
sw      101011   x       1       x         0
beq     000100   x       0       x         0
j       000010   x       x       x         0
ori     001101   0       1       0         1

Inst.   Op-code  MemRead  MemWrite  Branch  Jump  ALUOp[2:0]  ExtOp
r-type  000000   0        0         0       0     100         x
lw      100011   1        0         0       0     000         1
sw      101011   0        1         0       0     000         1
beq     000100   0        0         1       0     001         1
j       000010   0        0         0       1     xxx         x
ori     001101   0        0         0       0     010         0

Some of the control logic may have to be modified. For example,

the logic generating the signal ALUSrc would have to ensure that the

value 1 was set for the ori instruction:

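$$\text{ALUSrc} = Op5 \cdot \overline{Op4} \cdot \overline{Op2} \cdot Op1 \cdot Op0 + \overline{Op5} \cdot \overline{Op4} \cdot Op3 \cdot Op2 \cdot \overline{Op1} \cdot Op0$$

The new control signal, ExtOp, can be evaluated as:

$$\text{ExtOp} = Op5 \cdot \overline{Op4} \cdot \overline{Op2} \cdot Op1 \cdot Op0 + \overline{Op5} \cdot \overline{Op4} \cdot \overline{Op3} \cdot Op2 \cdot \overline{Op1} \cdot \overline{Op0}$$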

220

Other control logic implementations

Because there are only a few instructions to be implemented, the

control logic for this processor implementation is probably best im-

plemented using simple logic functions as shown previously.

It is quite common to implement simple controllers as a PLA.

Following is a PLA implementation for the processor we have de-

signed so far:

[Figure: PLA implementation of the main controller. The inputs are the op-code bits op5 to op0. There is one AND term for each instruction: R-type (000000), lw (100011), sw (101011), beq (000100), j (000010), and ori (001101). The outputs are RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, Jump, ExtOp, ALUOp[0], ALUOp[1], and ALUOp[2].]

221

The controller for the ALU could be implemented similarly, although

it is also probably best implemented using simple logic functions, as

well.

Note that, in the preceding controller, there was an AND term corre-

sponding to each instruction. For a small number of instructions, this

is effective. However, if the number of instructions is large (i.e., there

are op codes for most of the 6 bit instruction combinations) then the

controller could also be implemented as a read-only memory (ROM).

In this case, the op codes would be used as address inputs to the

ROM, and the outputs would be the values stored at those addresses.

There are 12 output bits, and the total size of the memory would be $2^6 = 64$ words of 12 bits. The encoding would be quite straightforward; merely the contents of the logic table for each control bit.

This would not be an efficient implementation for the ALU control, however. The funct field has 6 bits, and the ALUOp control input has 3 bits, for a total of 9 bits, requiring $2^9 = 512$ memory words of 3 bits.
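The two ROM sizes are simple to work out; the short Python sketch below does the arithmetic. The helper name `rom_size` is illustrative.

```python
def rom_size(address_bits: int, word_bits: int):
    """Return (number of words, total bits) for a ROM with the given geometry."""
    words = 2 ** address_bits
    return words, words * word_bits


main_words, main_bits = rom_size(6, 12)  # main control: 6 op-code bits in, 12 control bits out
alu_words, alu_bits = rom_size(9, 3)     # ALU control: 6 funct bits + 3 ALUOp bits in, 3 bits out

print(main_words, main_bits)
print(alu_words, alu_bits)
```

The ALU control ROM would hold twice as many bits as the main control ROM, even though most of its 512 words correspond to input combinations that never occur.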

Another option for the ALU control bits is to use the funct field

to generate the required three control signals, and have the main

controller also generate these control signals directly. They could

then be selected by a MUX, which would select the control signals

evaluated from the funct field only if the instruction is r-type.

The select input to the MUX could be the logical OR of the bits of the op-code field, which evaluates to 0 only for r-type instructions.

222

The time required for single cycle instructions

[Figure: the datapath elements used, in order along the time axis, by each type of instruction]

Arithmetic and logical instructions: PC → Inst. Memory → Reg. Read → mux → ALU → mux → Reg. Write

Branch: PC → Inst. Memory → Reg. Read → mux → ALU → mux → mux

(The sign extension and add occur in parallel with the other operations, register read and ALU comparison.)

Store: PC → Inst. Memory → Reg. Read → mux → ALU → Data Memory

Load: PC → Inst. Memory → Reg. Read → mux → ALU → Data Memory → mux → Reg. Write (the "critical path")

Jump: PC → Inst. Memory → mux

The clock period must be at least as long as the time for the critical

path.
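To make the idea concrete, the following Python sketch adds up element delays along each instruction's path and takes the longest as the minimum clock period. The delay values (in picoseconds) are purely hypothetical and were chosen only for illustration; the paths follow the timing diagram.

```python
# Hypothetical element delays in picoseconds (assumed values, not from the slides).
DELAY = {"PC": 50, "Inst. Memory": 200, "Reg. Read": 100, "mux": 25,
         "ALU": 200, "Data Memory": 200, "Reg. Write": 100}

# Elements used in sequence by each instruction type.
PATHS = {
    "Arithmetic/logical": ["PC", "Inst. Memory", "Reg. Read", "mux", "ALU", "mux", "Reg. Write"],
    "Branch": ["PC", "Inst. Memory", "Reg. Read", "mux", "ALU", "mux", "mux"],
    "Store": ["PC", "Inst. Memory", "Reg. Read", "mux", "ALU", "Data Memory"],
    "Load": ["PC", "Inst. Memory", "Reg. Read", "mux", "ALU", "Data Memory", "mux", "Reg. Write"],
    "Jump": ["PC", "Inst. Memory", "mux"],
}

times = {name: sum(DELAY[e] for e in path) for name, path in PATHS.items()}
critical = max(times, key=times.get)   # instruction with the longest path
min_clock_period = times[critical]

for name, t in times.items():
    print(f"{name}: {t} ps")
print("critical path:", critical, min_clock_period, "ps")
```

With these assumed delays the load instruction sets the clock period, and the jump instruction needs less than a third of it.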

223

[Figure: the combined datapath and control, showing the paths used by R-type operations]

224

[Figure: the combined datapath and control, showing the paths used by the Branch instruction, beq]

225

[Figure: the combined datapath and control, showing the paths used by the Load instruction]

226

[Figure: the combined datapath and control, showing the paths used by the Store instruction]

227

[Figure: the combined datapath and control, showing the paths used by the Jump instruction]

228

Why is the single cycle implementation not used?

In order to have a single cycle implementation, each instruction in

the instruction set must have all the operands and control signals

available to implement the full instruction.

This means, for example, that an instruction could not require two

operands from memory. It also means that data and instructions

cannot share the same memory.

A multi-cycle implementation could have instructions and data stored

in the same memory; instructions could be fetched in one cycle, and

data fetched in another.

As well, every instruction will use exactly the same amount of time for its execution. Instructions like the jump instruction, which involves only a few datapath elements, require the same time as, say, the load instruction, which involves almost all the elements in the datapath.

With more than one clock cycle, instructions using few datapath

elements could complete in fewer clock cycles than instructions using

many elements.

Also, there may be opportunities to reuse some datapath elements if

instructions used more than one clock cycle. For example, the ALU

could also be used to calculate branch addresses.

229

Considerations in a multi-cycle implementation

There may be many considerations in the design of a multi-cycle processor.

For example, the first version of the IBM PC used a processor with an

8-bit path to memory, although the internal data paths were 16 bits.

This meant that, for a full data word to be fetched from memory, two (8-bit) memory accesses were required. (At the time, external connections (pins on the integrated circuit "chip") were expensive, so a smaller path to memory made the processor cheaper to manufacture.)

Other operations could be also performed in several cycles to re-

duce hardware costs; for example, a 32-bit add function could be

implemented using an 8-bit adder, but requiring four clock cycles to

complete the add operation.
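The 32-bit add built from an 8-bit adder can be sketched in Python as a loop of four steps, one per clock cycle, with the carry passed from each byte to the next. The function names are illustrative.

```python
def add8(a: int, b: int, carry_in: int):
    """An 8-bit adder: returns (8-bit sum, carry out)."""
    total = (a & 0xFF) + (b & 0xFF) + carry_in
    return total & 0xFF, total >> 8


def add32_bytewise(a: int, b: int):
    """Add two 32-bit values using the 8-bit adder four times (four cycles)."""
    result, carry = 0, 0
    for cycle in range(4):                # low order byte first
        shift = 8 * cycle
        byte_sum, carry = add8(a >> shift, b >> shift, carry)
        result |= byte_sum << shift
    return result, carry                  # carry is the carry out of bit 31
```

The hardware saving comes at the cost of time: the same adder is reused, so the result is only complete after the fourth cycle.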

In general, a multi-cycle implementation attempts to find a compro-

mise between the number of cycles required for a particular function

and the hardware complexity for its implementation, at a given cost.

It is a trade-off between resources and time.

The problem of designing a multi-cycle processor is therefore an op-

timization problem:

For a given cost (i.e., amount of hardware, or logic), what is the fastest processor which can be implemented with the specified instruction set?

230

This is really a multi-dimensional problem, and at any given time,

different manufacturers of similar hardware have had very different

implementations of a processor, with different performance charac-

teristics.

(Consider INTEL and AMD today; they implement much the same

instruction sets, but with different price and performance character-

istics. Years ago, IBM and AMDAHL processors implemented the

same instruction sets very differently, as well.)

We will consider the problem in two steps; first, decide on the hard-

ware resources to be available, then decide the minimum clock period,

and what operations should be done in each cycle.

For the hardware resources in our implementation we will have:

• a single memory, 32 bits wide, for instructions and data

• a single ALU similar to that designed earlier

• a full 32 bit internal datapath

• as few other arithmetic elements as possible (we will attempt to

eliminate the adders required for addressing)

231

How are instructions broken down into cycles?

This is also a complex problem. A reasonable approach might be to:

• find the single indivisible operation which requires the longest

time

• attempt to do as many of the shorter operations as possible in

single cycles of the same length

In its simplest form, this is a “greedy algorithm.”

It is made more complex by the fact that the operations may have

to be performed in some particular order.

These are called precedence relations, and discovering them is im-

portant whenever looking for opportunities for parallelism.

For example, an instruction must be fetched from memory before the

arithmetic or logic function it specifies can be executed.

In many processors, fetching a value (instructions or data) from mem-

ory is the operation which takes the longest time.

In others, it is possible to divide even this operation into sub-operations;

e.g., generate a memory address in one cycle and read or write the

value in the next cycle.

For our purposes, we will consider the fetching of an operand from

memory as the single indivisible operation which will define our basic

cycle time.

232

Looking back at the instruction timing for the single cycle processor,

we see that the load instruction requires two memory accesses, and

therefore will require at least two cycles.

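[Figure: timing of three instruction types from the single cycle processor]

Arithmetic and logical instructions: PC → Inst. Memory → Reg. Read → mux → ALU → mux → Reg. Write

Load: PC → Inst. Memory → Reg. Read → mux → ALU → Data Memory → mux → Reg. Write (the "critical path")

Jump: PC → Inst. Memory → mux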

Considering the option of using the ALU to increment the PC, note

also that if the PC is read at the beginning of a cycle and loaded at

the end of the cycle, then it can be incremented in parallel with the

memory access. Also, if the diagram really represents the time for the

various operations, the register and MUX operations together require

approximately the same time as a memory operation, requiring five

cycles in total.

[Figure: the load instruction (the "critical path") divided into five cycles]

Cycle 1: PC, Inst. Memory
Cycle 2: Reg. Read, mux
Cycle 3: ALU
Cycle 4: Data Memory
Cycle 5: mux, Reg. Write

233

A multi-cycle implementation

We will consider the design of a multi-cycle implementation of the

processor developed so far. The processor will have:

• a single memory for instructions and data

• a single ALU for both addressing and data operations

• instructions requiring different numbers of cycles

There are now resource limitations — only one access to memory,

one access to the register file, and one ALU operation can occur in

each clock cycle.

It is clear that both the instruction and data would be required

during the execution of an instruction. Additional registers, the

instruction register (IR) and the memory data register (MDR)

will be required to hold the instruction and data words from memory

between cycles.

Registers may also be required to hold the register operands from

BusA and BusB (registers A and B, respectively).

(Recall that the branch instructions require an arithmetic compari-

son before an address calculation.)

We will look at each type of instruction individually to determine if

it can actually be done with the time and resources available.

234

The R-type instructions

• R[rd] ← R[rs] op R[rt] Example: add rd, rs, rt

[Figure: the R-type instruction divided into four cycles]

Cycle 1: PC, Inst. Memory
Cycle 2: Reg. Read, mux
Cycle 3: ALU
Cycle 4: mux, Reg. Write

Clearly, the instruction can be completed in four cycles, from the

timing. We need only determine if the required resources are avail-

able.

• In the first cycle, the instruction is fetched from memory, and

the ALU is used to increment the PC. The instruction must be

saved in the instruction register (IR) so it can be used in the

following cycles. (This may extend the cycle time).

• In the second cycle, the registers are read, and the values from

the registers to be used by the ALU must be saved, in registers

A and B, again new registers.

• In the third cycle, the r-type operation is completed in the ALU,

and the result saved in another new register, ALUOut.

• In the fourth cycle, the value in register ALUOut is written into

the register file.

Four registers had to be added to preserve values from one cycle

to the next, but there were no resource conflicts — the ALU was

required only in the first and third cycle.

235

We can capture these steps in an RTL description:

Cycle 1 IR ← mem[PC] Save instruction in IR

PC ← PC + 4 increment PC

Cycle 2 A ← R[rs] save register values for next cycle

B ← R[rt]

Cycle 3 ALUOut ← A op B calculate result and store in ALUOut

Cycle 4 R[rd] ← ALUOut store result in register file

This is really an expansion of the original RTL description of the

R-type instructions, where the internal registers are also used. The

original description was:

mem[PC] Fetch the instruction from memory

R[rd] ← R[rs] op R[rt] Set register rd to the value of the

operation applied to the contents of

registers rs and rt

PC ← PC + 4 calculate the address of the next in-

struction

When using a “silicon compiler” to design a processor, designers often

refine the RTL description in a similar way in order to achieve a more

efficient implementation for the datapath or control.
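The four-cycle RTL can be followed step by step in a small Python sketch. This is a simplification under stated assumptions: the processor state is a plain dictionary, the instruction word is kept as an opaque string, and the register numbers are passed in directly rather than decoded from IR.

```python
MASK32 = 0xFFFFFFFF


def run_r_type(state: dict, rs: int, rt: int, rd: int, op) -> int:
    """Execute one R-type instruction in four cycles; returns the cycle count."""
    # Cycle 1: IR <- mem[PC]; PC <- PC + 4
    state["IR"] = state["mem"][state["PC"]]
    state["PC"] = (state["PC"] + 4) & MASK32
    # Cycle 2: A <- R[rs]; B <- R[rt]
    state["A"] = state["R"][rs]
    state["B"] = state["R"][rt]
    # Cycle 3: ALUOut <- A op B
    state["ALUOut"] = op(state["A"], state["B"]) & MASK32
    # Cycle 4: R[rd] <- ALUOut
    state["R"][rd] = state["ALUOut"]
    return 4


# Hypothetical initial state: register 1 holds 5, register 2 holds 7.
state = {"PC": 0, "mem": {0: "add $3, $1, $2"}, "R": [0, 5, 7, 0] + [0] * 28}
cycles = run_r_type(state, rs=1, rt=2, rd=3, op=lambda a, b: a + b)
print(cycles, state["R"][3], state["PC"])
```

Each assignment to `IR`, `A`, `B`, or `ALUOut` corresponds to one of the new registers that carry a value from one cycle to the next.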

236

The Branch instruction — beq

• Cond ← R[rs] - R[rt] Calculate the branch condition

• if (Cond eq 0)
     PC ← PC + 4 + (SignExt(imm16) << 2)
  else
     PC ← PC + 4
  Calculate the address of the next instruction


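[Figure: the branch instruction divided into three cycles]

Cycle 1: PC, Inst. Memory
Cycle 2: Reg. Read, mux (Sign ext. and add in parallel)
Cycle 3: ALU, mux, mux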

In this case, three arithmetic operations are required, (incrementing

the PC, comparing the register values, and adding the immediate

field to the PC.)

Clearly, the comparison could not be done until the values have been

read from the register, so this must be done in cycle 3.

The address calculation could be done in cycle 2, however, since it

uses only data from the instruction (the immediate field) and the

new value of the PC, and the ALU is not being used in this cycle.

The result would have to be stored in a register, to be used in the

next cycle. We could use the register ALUOut for this, since the

R-type operations only require it at the end of cycle 3.

Recall that the ALU produced an output Zero which could be used

to implement the comparison. It is available during the third cycle,

and could be used to enable the replacement of the PC with the value

stored in ALUOut in the previous cycle.

237

The original RTL for the beq was:


• Cond ← R[rs] - R[rt]    Evaluate the branch condition

• if (Cond eq 0)
     PC ← PC + 4 + (SignExt(imm16) << 2)
  else
     PC ← PC + 4
  Calculate the address of the next instruction


Rewriting the RTL code for the beq instruction, including the operations on the internal registers, we have:

Cycle 1  IR ← mem[PC]                            save instruction in IR
         PC ← PC + 4                             increment PC
Cycle 2  A ← R[rs]                               save register values for next cycle
         B ← R[rt]                               (for comparison)
         ALUOut ← PC + (signextend(imm16) << 2)  calculate address for branch and place in ALUOut
Cycle 3  Compare A and B                         replace PC with ALUOut if Zero is set,
         if Zero is set then PC ← ALUOut         otherwise do not change PC

Note that this instruction now requires three cycles.

Also, the first cycle is identical to that of the R-type instructions.

The second cycle does the same as the R-type, and also does the

address calculation. Note that, at this point, the instruction may

not require the result of the address calculation, but it is calculated

anyway.
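The three cycles of beq can be traced with a short Python sketch. The function names are illustrative, and the register file is modelled as a dictionary of hypothetical values.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16: int) -> int:
    """Return the 16-bit immediate as a signed Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def run_beq(pc: int, regs: dict, rs: int, rt: int, imm16: int):
    """Return (new PC, cycles used) for a beq executed in three cycles."""
    # Cycle 1: fetch (not modelled) and PC <- PC + 4
    pc = (pc + 4) & MASK32
    # Cycle 2: A <- R[rs]; B <- R[rt]; ALUOut <- PC + (signextend(imm16) << 2)
    a, b = regs[rs], regs[rt]
    alu_out = (pc + (sign_extend16(imm16) << 2)) & MASK32
    # Cycle 3: the ALU subtracts; if Zero is set, PC <- ALUOut
    zero = ((a - b) & MASK32) == 0
    if zero:
        pc = alu_out
    return pc, 3
```

The branch target is computed in cycle 2 whether or not the branch is taken; when the registers differ, `alu_out` is simply discarded.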

238

The Load instruction

• Addr ← R[rs] + SignExt(imm16) Calculate the memory address

• R[rt] ← Mem[Addr] load data into register rt

[Figure: the load instruction divided into five cycles]

Cycle 1: PC, Inst. Memory
Cycle 2: Reg. Read, mux
Cycle 3: ALU
Cycle 4: Data Memory
Cycle 5: mux, Reg. Write

Clearly, the first cycle is the same as in the previous examples.

For the second cycle, register R[rs] contains part of an address, and

register R[rt] contains a value to be saved in memory (for store)

or to be replaced from memory (for load). They must therefore be

saved in registers (A and B) for future use, like the previous instruc-

tions.

In the third cycle, the address is calculated from the contents of A and

the imm16 field of the instruction and stored in a register (ALUOut)

for use in the next cycle.

This address (now in ALUOut) is used to access the appropriate mem-

ory location in the fourth cycle, and the contents of memory are

placed in a register MDR, the memory data register.

In the fifth cycle, the contents of the MDR are stored in the register

file in register R[rt].

239

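The original RTL for load was:

• Addr ← R[rs] + SignExt(imm16)    Set memory address to the value of the sum of the contents of register rs and the immediate data word imm16

• R[rt] ← Mem[Addr]                load the data at address Addr into register rt

• PC ← PC + 4                      calculate the address of the next instruction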

The RTL for this implementation is:

Cycle 1  IR ← mem[PC]                     save instruction in IR
         PC ← PC + 4                      increment PC
Cycle 2  A ← R[rs]                        save address register for next cycle
         B ← R[rt]
Cycle 3  ALUOut ← A + signextend(imm16)   calculate address for data and place in ALUOut
Cycle 4  MDR ← Mem[ALUOut]                store contents of memory at address ALUOut in MDR
Cycle 5  R[rt] ← MDR                      store value originally from memory in R[rt]

Recall that this instruction was the longest instruction in the single

cycle implementation.

240

The Store instruction

• Addr ← R[rs] + SignExt(imm16) Calculate the memory address

• Mem[Addr] ← R[rt] store the contents of register rt

in memory

[Figure: the store instruction divided into four cycles]

Cycle 1: PC, Inst. Memory
Cycle 2: Reg. Read, mux
Cycle 3: ALU
Cycle 4: Data Memory

The store instruction is much like the load instruction, except that

the value in register R[rt] is written into memory, rather than read

from it.

The main difference is that, in the fourth cycle, the address calculated

from R[rs] and imm16 (and saved in ALUOut) is used to store the

value from register R[rt] in memory.

A fifth cycle is not required.

241

The original RTL for store was:

• Addr ← R[rs] + SignExt(imm16)    Set memory address to the value of the sum of the contents of register rs and the immediate data word imm16

• Mem[Addr] ← R[rt]                store the contents of register rt in memory at address Addr

• PC ← PC + 4                      calculate the address of the next instruction

The RTL for this implementation is:

Cycle 1  IR ← mem[PC]                     save instruction in IR
         PC ← PC + 4                      increment PC
Cycle 2  A ← R[rs]                        save address register for next cycle
         B ← R[rt]                        save value to be written
Cycle 3  ALUOut ← A + signextend(imm16)   calculate address for data and place in ALUOut
Cycle 4  Mem[ALUOut] ← B                  store contents of register rt in memory at address ALUOut
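Load and store differ only after the address has been calculated, which a side-by-side Python sketch makes easy to see. The function names are illustrative, memory and registers are modelled as dictionaries, and the instruction fetch of cycle 1 is not modelled.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16: int) -> int:
    """Return the 16-bit immediate as a signed Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def run_load(regs: dict, memory: dict, rs: int, rt: int, imm16: int) -> int:
    """lw in five cycles; returns the cycle count."""
    a = regs[rs]                                   # Cycle 2: A <- R[rs]
    alu_out = (a + sign_extend16(imm16)) & MASK32  # Cycle 3: ALUOut <- A + signextend(imm16)
    mdr = memory[alu_out]                          # Cycle 4: MDR <- Mem[ALUOut]
    regs[rt] = mdr                                 # Cycle 5: R[rt] <- MDR
    return 5


def run_store(regs: dict, memory: dict, rs: int, rt: int, imm16: int) -> int:
    """sw in four cycles; returns the cycle count."""
    a, b = regs[rs], regs[rt]                      # Cycle 2: A <- R[rs]; B <- R[rt]
    alu_out = (a + sign_extend16(imm16)) & MASK32  # Cycle 3: ALUOut <- A + signextend(imm16)
    memory[alu_out] = b                            # Cycle 4: Mem[ALUOut] <- B
    return 4
```

The store finishes one cycle earlier because nothing has to be written back to the register file.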

242

The Jump instruction

• PC<31:2> ← PC<31:28> concat target<25:0>

Calculate the jump address by concatenating the high order 4 bits of the PC with the target address.

[Figure: the jump instruction divided into two cycles. Cycle 1: PC, Inst. Memory. Cycle 2: mux.]

The first cycle, which fetches the instruction from memory and places

it in IR, and increments PC by 4, is the same as other instructions.

The next operation is to concatenate the low order 26 bits of the

instruction with the high order 4 bits of the PC.

In the PC, the low order 2 bits are 0, so they are not actually loaded

or stored.

The shift of the bits from the instruction can be accomplished without

any additional hardware, merely by connecting bit IR[25] to bit

PC[27], etc.

Note that adding 4 to the PC may cause the four high order bits to

change.

Could this cause problems ?

243

The original RTL for jump was:

• PC ← PC + 4                                  increment PC by 4

• PC<31:2> ← PC<31:28> concat I<25:0> << 2     replace low order 28 bits with the low order 26 bits from the instruction left shifted by 2

The RTL for this implementation is:

Cycle 1  IR ← mem[PC]                                 save instruction in IR
         PC ← PC + 4                                  increment PC
Cycle 2  (no operation)
Cycle 3  PC<31:2> ← PC<31:28> concat IR<25:0> << 2    replace low order 28 bits with the low order 26 bits from the instruction left shifted by 2

Note that nothing is done for this instruction in cycle 2.

There is no clear reason for this, except that cycle 2 is substantially

the same for all other instructions, and following this gives a clearer

distinction between the fetch–decode–execute cycles.

244

Changes to the datapath for a multi-cycle implementa-

tion

We have found that several additional registers are required in the

multi-cycle datapath in order to save information from one cycle to

the next.

These were the registers IR, MDR, A, B, and ALUOut.

The overall hardware complexity may be reduced, however, since the

adders required for addressing have been replaced by the ALU.

Recall that the primary reason for choosing five cycles was the as-

sumption that the time to obtain a value from memory was the single

slowest operation in the datapath. Also, we assumed that the register

file operations take a smaller, but comparable, amount of time.

If either of these conditions were not true, then quite a different

schedule of operations might have been chosen.

245

What is done during each cycle?

For this implementation, we have determined that instructions will be

divided into five cycles. (Other divisions are possible, of course, but

the original MIPS also used five cycles for the longest instruction.)

These cycles are as follows:

1. Instruction fetch (IF)

The instruction is fetched and the next address calculated.

IR ← Memory[PC]

PC ← PC + 4

2. Instruction decode (ID)

The instruction is decoded, and the register values to be read

(the contents of registers rs and rt) are stored in registers A

and B respectively.

A ← reg[IR[25:21]]

B ← reg[IR[20:16]]

At this time, the target of a branch instruction can also be cal-

culated, because both the PC and the instruction are available.

It will have to be stored in a register (ALUOut) until it is used.

ALUOut ← PC + sign-extend(IR[15:0]) <<2

where <<2 means a left-shift of 2.

246

3. Execution (EX)

In this cycle, either

• the ALU operation is completed (for r-type and arithmetic

immediate instructions),

ALUOut ← A op B , or

ALUOut ← A op sign-extend(IR[15:0])

• or the memory address of a data word is calculated (for load

or store),

ALUOut ← A + sign-extend(IR[15:0])

• or the branch instruction is completed if the conditional

expression evaluates to TRUE,

if A = B PC ← ALUOut

Note that if the target address were not calculated in the

previous clock cycle, it would have to be calculated in the

next one; the ALU is used for the comparison in this cycle.

• or the jump instruction is completed

PC ← PC[31:28] || IR[25:0] <<2

where the operator || denotes concatenation.

247

4. Memory (MEM)

Only the load and store instructions require this cycle. In this

cycle, data is read from memory,

MDR ← Memory[ALUOut]

or data is written to memory,

Memory[ALUOut] ← B

5. Writeback (WB)

In this cycle, a value is written to a register in the register file.

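Either an r-type operation writes its result to the register named by the rd field,

reg[IR[15:11]] ← ALUOut

or an arithmetic immediate operation writes its result to the register named by the rt field,

reg[IR[20:16]] ← ALUOut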

or the value read from memory in the previous cycle (for a load

instruction) is written into the register file,

reg[IR[20:16]] ← MDR

Note that not all instructions require every cycle. In particular,

branch and jump instructions require only the first 3 cycles (IF,

ID, and EX).

The R-type instructions require 4 cycles (IF, ID, EX, and WB).

Store also requires 4 cycles (IF, ID, EX, and MEM).

Load requires all 5 cycles (IF, ID, EX, MEM, and WB).
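The address calculations and cycle counts described above can be sketched in Python. This is only an illustration under stated assumptions: the names sign_extend16, branch_target, jump_target, and CYCLES are chosen here and do not appear in the original design, and pc is taken to be the already incremented value (PC + 4), since the increment happens in the IF cycle.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python int (may be negative)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc, imm16):
    """ID cycle: ALUOut <- PC + (sign-extend(IR[15:0]) << 2)."""
    return (pc + (sign_extend16(imm16) << 2)) & MASK32

def jump_target(pc, addr26):
    """EX cycle of a jump: PC <- PC[31:28] || (IR[25:0] << 2)."""
    return (pc & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)

# Cycles used by each instruction class, as listed above.
CYCLES = {
    "branch": ["IF", "ID", "EX"],
    "jump":   ["IF", "ID", "EX"],
    "r-type": ["IF", "ID", "EX", "WB"],
    "store":  ["IF", "ID", "EX", "MEM"],
    "load":   ["IF", "ID", "EX", "MEM", "WB"],
}
```

For example, a branch with imm16 = 0xFFFF (that is, -1) fetched from address 0x1000 has pc = 0x1004 in the ID cycle, so its target is 0x1004 - 4 = 0x1000: the branch instruction itself.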


The datapath for the multi-cycle processor

Fortunately, after our design of the single cycle processor, we have

a good idea of the datapath elements required to implement each

individual instruction. We can also seek opportunities to reuse func-

tional blocks in different cycles, potentially reducing the number of

hardware blocks (and hence the complexity and cost) of the datap-

ath.

The datapath for the multi-cycle processor is similar to that of the

single cycle processor, with

• the addition of the registers noted (IR, MDR, A, B, and ALUOut)

• the elimination of the adders for address calculation

• the extension of a MUX, because there are now three separate calculations for the next address (jump, branch, and the normal incrementing of the PC)

• additional control signals controlling the writing of the registers.

The following diagrams show the datapath for the multi-cycle imple-

mentation of the processor.

The additions to the datapath for each cycle are shown in red.

The required control signals are shown in green in the final figure.

[Figures: the datapath for the multi-cycle implementation, built up over several slides. The first figures show the PC, the memory, the instruction register (Inst[31-26], Inst[25-21], Inst[20-16], Inst[15-0]), the register file with registers A and B, sign extend, shift left 2, the ALU with its operand MUXes, and ALUOut. Later figures add the jump address path (PC[31-28] combined with Inst[25-0] shifted left 2), the memory data register, and the MUXes controlled by RegDst and MemtoReg. The final figure adds the control unit and the ALU control block, with the control signals PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcA, ALUSrcB, RegWrite, and RegDst.]

The control signals

The following control signals are identified in the datapath:

Signal        Value             Action

RegDst        0 (deasserted)    the register written is the rt field
              1 (asserted)      the register written is the rd field
RegWrite      0                 the register file will not be written into
              1                 the register addressed by the instruction will be written into
ALUSrcA       0                 the first ALU operand is the PC
              1                 the first ALU operand is register A
MemRead       0                 no memory read occurs
              1                 the contents of memory at the specified address are placed on the data bus
MemWrite      0                 no memory write occurs
              1                 the contents of register B are written to memory at the specified address
MemtoReg      0                 the value written to the register file comes from ALUOut
              1                 the value written to the register file comes from the MDR
IorD          0                 the memory address comes from the PC (an instruction)
              1                 the memory address comes from ALUOut (a data read)
IRWrite       0                 the IR is not written into
              1                 the IR is written into (an instruction is read)
PCWrite       0                 none (see below)
              1                 the PC is written into; the value comes from the MUX controlled by the signal PCSource
PCWriteCond   0                 if both it and PCWrite are not asserted, the PC is not written
              1                 the PC is written if the ALU output Zero is active

Following are the 2-bit control signals:

Signal    Value  Action taken

ALUOp     00     ALU performs ADD operation
          01     ALU performs SUBTRACT operation
          10     ALU performs operation specified by funct field
ALUSrcB   00     the second ALU operand is from register B
          01     the second ALU operand is 4
          10     the second ALU operand is the sign extended low order 16 bits of the IR (imm16)
          11     the second ALU operand is the sign extended low order 16 bits of the IR shifted left by 2 bits
PCSource  00     the PC is updated with the value PC + 4
          01     the PC is updated with the value in register ALUOut (the branch target address, for a branch instruction)
          10     the PC is updated with the jump target address
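As a rough illustration of how these signals steer the datapath, the following Python sketch models the two ALU operand MUXes and the write enable of the PC. The function names and argument order are assumptions made for this sketch; only the signal encodings come from the tables above.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python int (may be negative)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def alu_inputs(alu_src_a, alu_src_b, pc, a, b, imm16):
    """Return the two ALU operands chosen by ALUSrcA and ALUSrcB."""
    first = a if alu_src_a == 1 else pc
    second = {
        0b00: b,                           # register B
        0b01: 4,                           # the constant 4
        0b10: sign_extend16(imm16),        # sign extended imm16
        0b11: sign_extend16(imm16) << 2,   # sign extended imm16, shifted left 2
    }[alu_src_b]
    return first, second

def pc_write_enable(pc_write, pc_write_cond, zero):
    """The PC is written when PCWrite is asserted, or when PCWriteCond
    is asserted and the ALU Zero output is active."""
    return bool(pc_write or (pc_write_cond and zero))
```

In the IF cycle, for instance, ALUSrcA = 0 and ALUSrcB = 01 select the operands PC and 4, and PCWrite = 1 enables the write of the sum back into the PC.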

The control unit must now be designed.

Since the instructions will now require several states, the control will

be a state machine, with the instruction op codes as inputs and the

control signals as outputs.


Review of instruction cycles and actions

Cycle  Instruction type  Action

IF     all               IR ← Memory[PC]
                         PC ← PC + 4
ID     all               A ← Reg[rs]
                         B ← Reg[rt]
                         ALUOut ← PC + (sign-extend(imm16) << 2)
EX     R-type            ALUOut ← A op B
       Load/Store        ALUOut ← A + sign-extend(imm16)
       Branch            if (A == B) then PC ← ALUOut
       Jump              PC ← PC[31:28] || (IR[25:0] << 2)
MEM    Load              MDR ← Memory[ALUOut]
       Store             Memory[ALUOut] ← B
WB     R-type            Reg[rd] ← ALUOut
       Load              Reg[rt] ← MDR

Note that the first two steps are required for all instructions, and all

instructions require at least the first 3 cycles.

The MEM step is required only by the load and store instructions.
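Because instructions now take different numbers of cycles, the average number of cycles per instruction depends on the instruction mix. The short Python sketch below shows the arithmetic; the mix used is hypothetical and is chosen only to illustrate the calculation, and the names CYCLES and average_cpi are introduced here.

```python
# Cycles per instruction class, from the review table above.
CYCLES = {"load": 5, "store": 4, "r-type": 4, "branch": 3, "jump": 3}

def average_cpi(mix):
    """Weighted average cycles per instruction; mix maps each
    instruction class to its fraction of executed instructions."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(CYCLES[kind] * frac for kind, frac in mix.items())

# Hypothetical instruction mix, chosen only to illustrate the arithmetic.
example_mix = {"load": 0.25, "store": 0.10, "r-type": 0.45,
               "branch": 0.15, "jump": 0.05}
print(round(average_cpi(example_mix), 2))
```

With this mix the sum is 5(0.25) + 4(0.10) + 4(0.45) + 3(0.15) + 3(0.05) = 4.05 cycles per instruction.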

The ALU control unit is still a combinational logic block, as before.


Design of the control unit

The control unit is a state machine, implementing the state sequenc-

ing for every instruction.

Following is a partial state machine, detailing the IF and ID stages,

which are the same for all instructions:

State 0 (IF), entered from Start:
    MemRead = 1, ALUSrcA = 0, IorD = 0, IRWrite = 1,
    ALUSrcB = 01, ALUOp = 00, PCWrite = 1, PCSource = 00
    next state: 1

State 1 (ID):
    ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
    next state: selected by the op code (OP = 'LW', 'SW', 'R-type', 'BEQ', or 'J')

The partial state machines which implement each of the instructions

follow.


The memory reference instructions (Load and Store)


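State 2 (from state 1 when OP = 'LW' or OP = 'SW'):
    ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
    next state: 3 if OP = 'LW', 5 if OP = 'SW'

State 3: MemRead = 1, IorD = 1; next state: 4

State 4: RegWrite = 1, MemtoReg = 1, RegDst = 0; next state: 0 (instruction completed)

State 5: MemWrite = 1, IorD = 1; next state: 0 (instruction completed)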


R-type instructions



State 6 (from state 1 when OP = 'R-type'):
    ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10; next state: 7

State 7: RegDst = 1, RegWrite = 1, MemtoReg = 0; next state: 0


Branch and Jump instructions


State 8 (from state 1 when OP = 'BEQ'):
    ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond = 1, PCSource = 01; next state: 0

State 9 (from state 1 when OP = 'J'):
    PCWrite = 1, PCSource = 10; next state: 0

These can be combined into a single state diagram, and a state ma-

chine derived from this.


The combined control unit


[Figure: the combined state diagram, with states 0 to 9. From Start the machine enters state 0, then state 1; state 1 branches on the op code to state 2 (OP = 'LW' or OP = 'SW'), state 6 (OP = 'R-type'), state 8 (OP = 'BEQ'), or state 9 (OP = 'J'); state 2 branches to state 3 (OP = 'LW') or state 5 (OP = 'SW').]

Implementing the control unit

All that remains in implementing the control unit is to design the

control logic itself.

Inputs are the instruction op codes, as before, and the outputs are

the control signals.

The following steps are typically followed in the implementation of

any sequential device:

• Construct the state diagram or equivalent (done).

• Assign numeric (binary) values to the states.

• Choose a memory element for state memory. (Normally, these

would be D flip flops or JK flip flops.)

• Design the combinational logic blocks to implement the next-

state functions.

• Design the combinational logic blocks to implement the outputs.

The actual implementation can be done in a number of ways; as

discrete logic, a PLA, read-only memory, etc.

Typically, the control unit would be automatically generated from a

description in some high level design language.


The control unit we have described is a Moore state machine, where

the outputs are a function only of the state.

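[Figure: block diagram of a Moore state machine. The primary inputs and the state outputs of the state memory feed the next-state logic, which drives the state inputs of the state memory; the primary outputs are derived from the state outputs only.]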

Following is a state table corresponding to the previous state dia-

gram. Note that the outputs are missing, but they depend only on

the state values, not the inputs.

Present state   Input (op code)   Next state

0  0000         X                 1  0001
1  0001         lw   100011       2  0010
1  0001         sw   101011       2  0010
1  0001         R    000000       6  0110
1  0001         BEQ  000100       8  1000
1  0001         J    000010       9  1001
2  0010         lw   100011       3  0011
2  0010         sw   101011       5  0101
3  0011         X                 4  0100
4  0100         X                 0  0000
5  0101         X                 0  0000
6  0110         X                 7  0111
7  0111         X                 0  0000
8  1000         X                 0  0000
9  1001         X                 0  0000

Note that the outputs are not shown in this table. The notation X

in the input column means that this state change does not depend

on the particular instruction, only on the previous state.
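The state table can be written down directly as a next-state function. The Python sketch below is one way to do so; the names DISPATCH, FIXED, next_state, and state_sequence are introduced here for illustration, while the state numbers and op codes are those of the table.

```python
# Op codes (6 bits) for the instructions handled by the controller.
LW, SW, RTYPE, BEQ, J = 0b100011, 0b101011, 0b000000, 0b000100, 0b000010

# States whose successor depends on the op code.
DISPATCH = {
    1: {LW: 2, SW: 2, RTYPE: 6, BEQ: 8, J: 9},
    2: {LW: 3, SW: 5},
}
# States whose successor is fixed (input "X" in the table).
FIXED = {0: 1, 3: 4, 4: 0, 5: 0, 6: 7, 7: 0, 8: 0, 9: 0}

def next_state(state, opcode):
    """Next-state function of the controller."""
    if state in FIXED:
        return FIXED[state]
    return DISPATCH[state][opcode]

def state_sequence(opcode):
    """States visited by one instruction, starting at state 0."""
    seq, state = [0], next_state(0, opcode)
    while state != 0:
        seq.append(state)
        state = next_state(state, opcode)
    return seq
```

For a load, state_sequence(LW) visits states 0, 1, 2, 3, 4, which matches the five cycles IF, ID, EX, MEM, and WB.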

The following figure shows an implementation of the next-state logic

for the state machine shown previously.

[Figure: PLA implementation of the next-state logic. The inputs to the AND plane are the op code bits OP5 to OP0 and the present state bits S3 to S0; the product terms correspond to the operations (lw, sw, R, beq, j) and to the states. The OR plane generates the next-state bits S3 to S0, which are held in the state memory (D flip-flops).]

Other controller implementations

An alternative to a PLA implementation is an implementation using

a read-only memory (or even a read-write memory.)

In this case, the inputs to the memory would be the OP codes (6

bits) and the state codes (4 bits in our case, but larger in a processor

with a richer instruction set.)

The next-state values and outputs would be stored in the memory.

For this example, there would be 10 (6 + 4) address bits, so the size

of the memory would be 2^10 = 1024 words of 16 bits (10 single bit

control signals and 3 2-bit control signals.)

This is a large memory for the simple control function we have im-

plemented with a PLA.

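[Figure: ROM implementation of the controller. The address inputs are the op code bits (op5 to op0) and the state; the data outputs are the control outputs and the state inputs to the state memory.]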

A hybrid approach could be to use the PLA to generate the next-state

values, and a memory for the outputs associated with each state.

In this case, the memory size is 2^4 = 16 words of 16 bits.
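The two memory sizes can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name rom_bits and the constants are introduced here.

```python
def rom_bits(address_bits, word_bits):
    """Total storage, in bits, of a ROM with the given geometry."""
    return (2 ** address_bits) * word_bits

OP_BITS, STATE_BITS = 6, 4
WORD_BITS = 10 * 1 + 3 * 2      # 10 one-bit signals + 3 two-bit signals = 16

full_rom = rom_bits(OP_BITS + STATE_BITS, WORD_BITS)   # 1024 words x 16 bits
output_rom = rom_bits(STATE_BITS, WORD_BITS)           # 16 words x 16 bits
print(full_rom, output_rom)
```

Addressing the output memory by the state alone reduces the storage from 16384 bits to 256 bits, a factor of 64 (the 2^6 combinations of the op code bits that no longer form part of the address).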


Microprogrammed control

An alternative to designing a classical state machine is to design

a microprogrammed control unit.

A microprogrammed control unit is a simple processor that generates

the control signals for a more complex processor.

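[Figure: a microprogrammed control unit. A microprogram counter addresses the microcode ROM, which supplies the control outputs to the datapath. The microprogram sequencer consists of an adder, which adds 1 to the microprogram counter, and address select logic, which also takes the op code as an input.]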

It has storage for the microcode (values of the control outputs and

microinstructions) and a microprogram sequencer which decides the

next microprogram operation. (This is essentially a next-state gen-

erator.)


Next-state generation

Note in the previous state diagram that, in many cases (states 0, 3,

6 in the example), the next state is the numerically next state in the

state machine; the state value is merely incremented.

For many other states (states 4,5,7,8,9 in the example), the next state

is the first state (instruction fetch).

For the other states, (states 1 and 2 in the example) there is a small

subset of next-states reachable from those states. Typically, a dis-

patch table (stored in ROM) is associated with each such state.

For the state machine described earlier, there would be two dispatch

tables; one for state 1 and the other for state 2.

They would contain the next-state information as follows:

Dispatch ROM 1

OP      Name    Value     State
000000  R-type  Rformat1  0110
000010  j       JUMP1     1001
000100  beq     BEQ1      1000
100011  lw      Mem1      0010
101011  sw      Mem1      0010

Dispatch ROM 2

OP      Name    Value     State
100011  lw      LW2       0011
101011  sw      SW2       0101

The microprogram sequencer can be expanded to the following, where

the Address Select Logic block has been expanded to include the

four possible sources for the next instruction described earlier:

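[Figure: the expanded microprogram sequencer. A four-input MUX selects the next value of the microprogram counter: input 0 is the constant 0 (instruction fetch), input 1 is the output of dispatch ROM 1, input 2 is the output of dispatch ROM 2, and input 3 is the output of the adder (current address + 1). Both dispatch ROMs are addressed by the op code. The microcode ROM supplies the control outputs to the datapath and the select input of the MUX.]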

Each microcode instruction will have to include a control word (input

to the MUX above) to control the microprogram sequencer.
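The behaviour of this sequencer can be sketched in Python as follows. The sketch assumes the MUX input numbering read from the figure (0 = fetch, 1 = dispatch ROM 1, 2 = dispatch ROM 2, 3 = incremented address); the function name next_micro_address is introduced here, and the ROM contents are those of the two dispatch tables.

```python
# Dispatch ROM contents: op code -> microcode address (state number).
DISPATCH_ROM_1 = {0b000000: 6, 0b000010: 9, 0b000100: 8,
                  0b100011: 2, 0b101011: 2}
DISPATCH_ROM_2 = {0b100011: 3, 0b101011: 5}

def next_micro_address(current, select, opcode):
    """Model of the 4-input MUX feeding the microprogram counter."""
    if select == 0:                 # back to instruction fetch
        return 0
    if select == 1:                 # dispatch on the op code, table 1
        return DISPATCH_ROM_1[opcode]
    if select == 2:                 # dispatch on the op code, table 2
        return DISPATCH_ROM_2[opcode]
    if select == 3:                 # next sequential microinstruction
        return current + 1
    raise ValueError("select must be 0..3")
```

For a store (op code 101011), the addresses visited are 0, 1 (sequential), 2 (ROM 1), 5 (ROM 2), and then 0 again.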


Designing the microcode

The basic function of a microprogram is to supply the control signals

required to implement each instruction in the appropriate order.

A microprogram is made up of microinstructions (microcode).

The way the microcode is organized, or formatted, depends on a num-

ber of things. Two extremes of microcode are horizontal microcode

and vertical microcode.

Horizontal microcode usually requires more storage. It provides all

the control signals for a single cycle directly.

Vertical microcode is more compact. Typically, operations are en-

coded so that the operations can be specified in fewer bits. This

supports less parallelism, and the second level of decoding may ex-

tend the cycle time.

In either case, it is usual to group together the required outputs and

control information into fields. These fields are merely collections

of outputs that perform related functions. For example, it might be

useful to group together all the signals that control memory, or the

ALU.

Often, the values in different fields are given labels, much as in as-

sembly language programs.

We can identify the following fields for a microprogrammed imple-

mentation of the simple MIPS:

Field name        Function

ALU control       Specify the ALU operation for this clock cycle
SRC1              Specify the source for the first ALU operand
SRC2              Specify the source for the second ALU operand
Register control  Specify read or write for the register file, and the source for the write
Memory            Specify read or write and the source for the memory. For a read, specify the destination register
PCWrite control   Specify writing of the PC
Sequencing        Determines how to choose the next microinstruction

Note that the first six fields correspond to sets of control signals for

the datapath, the last (Sequencing) determines the source for the

address of the next micro-code instruction (next-state).

Typically, those fields would have symbolic values, which would later

be translated to actual control signal values, somewhat like the trans-

lation of an assembly language program to machine code.


The following tables show the values for each field:

Field name        Value         Function

ALU control       Add           Add, using the ALU
                  Subt          Subtract, using the ALU
                  Funct         Use the funct field to determine ALU control
SRC1              PC            Use PC as first ALU input
                  A             Use register A as first ALU input
SRC2              B             Use register B as second ALU input
                  4             Use the constant 4 as second ALU input
                  Extend        Use the sign extended imm16 field as the second ALU input
                  Extshft       Use the 2-bit left shifted sign extended imm16 field as the second ALU input
Register control  Read          Read two registers using rs and rt fields, placing results in A and B
                  Write ALU     Write the contents of ALUOut into the register file in register rd
                  Write MDR     Write the contents of MDR into the register file in register rt
Memory            Read PC       Read memory at the address in PC and write result in IR
                  Read ALU      Read memory at the address in ALUOut and write result in MDR
                  Write ALU     Write memory at the address in ALUOut using the contents of B as data
PCWrite control   ALU           Write the output of the ALU into the PC
                  ALUOut-cond   Write the contents of ALUOut into the PC if the Zero output of the ALU is active
                  Jump-address  Write the jump address from the instruction into the PC
Sequencing        Seq           The next microinstruction is the next in sequence
                  Fetch         The next microinstruction is instruction fetch (state 0)
                  Dispatch i    The next microinstruction is obtained from dispatch ROM i (1 or 2)

Every line of microcode will have a value for each of these fields.

Eventually, as in the translation of assembly language instructions,

these (symbolic) values will be translated into the actual values of

the control signals.


Creating a microprogram

Let us look at writing the microcode for a few operations:

The first thing done is the fetching and decoding of an instruction

(states 0 and 1 in the state diagram):

Label  ALU control  SRC1  SRC2     Register control  Memory   PCWrite control  Sequencing

Fetch  Add          PC    4                          Read PC  ALU              Seq
       Add          PC    Extshft  Read                                        Dispatch 1

The first line describes the (now familiar) operations of fetching an

instruction, storing it in the IR, adding 4 to the PC, and writing the

value back to the PC.

The second line describes the calculation of the branch address, and

the storing of register values in registers A and B.

The Sequencing field determines where the next microcode in-

struction comes from.

For the first microinstruction, it is the next in sequence.

For the second, it depends on the op code (Dispatch ROM 1).


The memory access instructions lw and sw:



Label  ALU control  SRC1  SRC2    Register control  Memory     PCWrite control  Sequencing

Mem1   Add          A     Extend                                                Dispatch 2
LW2                                                 Read ALU                    Seq
                                  Write MDR                                     Fetch
SW2                                                 Write ALU                   Fetch

Note that the value in the Dispatch 2 table will cause a jump to either LW2 or SW2.

R-type instructions:



Label     ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Sequencing

Rformat1  Func code    A     B                                                Seq
                                   Write ALU                                  Fetch

Branch and jump instructions (beq and j):



Label  ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Sequencing

BEQ1   Subt         A     B                               ALUOut-cond      Fetch
JUMP1                                                     Jump address     Fetch


What remains is to translate these microinstructions into actual val-

ues to be stored in the microcode ROM.

In this case, it is fairly straightforward to identify the values in each

field with appropriate values for the control signals:

Field name        Value         Signals active                      Comment

ALU control       Add           ALUOp = 00                          Cause the ALU to add
                  Subt          ALUOp = 01                          Cause the ALU to subtract
                  Func code     ALUOp = 10                          Use the funct field to determine ALU operation
SRC1              PC            ALUSrcA = 0                         Use the PC as the ALU's first input
                  A             ALUSrcA = 1                         Use register A as the ALU's first input
SRC2              B             ALUSrcB = 00                        Use register B as the second ALU input
                  4             ALUSrcB = 01                        Use 4 as the second ALU input
                  Extend        ALUSrcB = 10                        Use the sign extended imm16 field as the second ALU input
                  Extshft       ALUSrcB = 11                        The shifted sign extended imm16 field is the second ALU input
Register control  Read                                              Place contents of registers referenced by rs, rt in registers A, B
                  Write ALU     RegWrite, RegDst = 1, MemtoReg = 0  Write the contents of ALUOut to register rd
                  Write MDR     RegWrite, RegDst = 0, MemtoReg = 1  Write the contents of MDR to register rt
Memory            Read PC       MemRead, IorD = 0, IRWrite          Place the value in memory at the address referenced by PC into IR and MDR
                  Read ALU      MemRead, IorD = 1                   Place the value in memory at address ALUOut into MDR
                  Write ALU     MemWrite, IorD = 1                  Write memory using ALUOut as address, B contents as data
PCWrite control   ALU           PCSource = 00, PCWrite              Write ALU output to PC
                  ALUOut-cond   PCSource = 01, PCWriteCond          If the Zero output of the ALU is active, write ALUOut to PC
                  Jump address  PCSource = 10, PCWrite              Write jump address from instruction to PC
Sequencing        Fetch         AddrCtl = 00                        Go to the first microinstruction
                  Dispatch 1    AddrCtl = 01                        Microcode address from ROM 1
                  Dispatch 2    AddrCtl = 10                        Microcode address from ROM 2
                  Seq           AddrCtl = 11                        Next microinstruction is sequential


It is now a matter of straightforward substitution to arrive at the

microcode to be stored in the ROM:


State  ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Sequencing

0      00           0     01    000               1001    0010             11
1      00           0     11    000               0000    0000             01
2      00           1     10    000               0000    0000             10
3      00           0     00    000               1010    0000             11
4      00           0     00    101               0000    0000             00
5      00           0     00    000               0110    0000             00
6      10           1     00    000               0000    0000             11
7      00           0     00    110               0000    0000             00
8      01           1     00    000               0000    0101             00
9      00           0     00    000               0000    1010             00

The 18 control signals here are, in order:

ALU control ALUOp[2]

SRC1 ALUSrcA

SRC2 ALUSrcB[2]

Register Control RegWrite, RegDst, MemtoReg

Memory MemRead, MemWrite, IorD, IRWrite

PCWrite control PCSource[2], PCWrite, PCWriteCond

Sequencing AddrCtl[2]
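The substitution step can be mimicked by a very small "microassembler". The Python sketch below is an illustration only: the dictionary FIELDS, the keyword names (alu, src1, and so on), and the function assemble are introduced here, and a field left unspecified is encoded as all zeros, which is the convention the ROM contents above appear to follow.

```python
FIELDS = {
    "alu":  {"Add": "00", "Subt": "01", "Func code": "10"},
    "src1": {"PC": "0", "A": "1"},
    "src2": {"B": "00", "4": "01", "Extend": "10", "Extshft": "11"},
    # bit order: RegWrite, RegDst, MemtoReg
    "reg":  {"Read": "000", "Write ALU": "110", "Write MDR": "101"},
    # bit order: MemRead, MemWrite, IorD, IRWrite
    "mem":  {"Read PC": "1001", "Read ALU": "1010", "Write ALU": "0110"},
    # bit order: PCSource[2], PCWrite, PCWriteCond
    "pc":   {"ALU": "0010", "ALUOut-cond": "0101", "Jump address": "1010"},
    "seq":  {"Fetch": "00", "Dispatch 1": "01", "Dispatch 2": "10", "Seq": "11"},
}
ORDER = ["alu", "src1", "src2", "reg", "mem", "pc", "seq"]

def assemble(**symbols):
    """Translate one symbolic microinstruction to its 18-bit word.
    A field left unspecified is encoded as all zeros."""
    bits = []
    for name in ORDER:
        width = len(next(iter(FIELDS[name].values())))
        value = symbols.get(name)
        bits.append(FIELDS[name][value] if value else "0" * width)
    return " ".join(bits)

# The Fetch microinstruction (state 0).
fetch = assemble(alu="Add", src1="PC", src2="4",
                 mem="Read PC", pc="ALU", seq="Seq")
print(fetch)
```

The word produced for Fetch can be compared, field by field, with the entry for state 0 in the ROM listing.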


Advantages/disadvantages of microprogram control:

For large instruction sets:

• The control is easier to design — similar to programming

• The control is more flexible — easier to adapt or modify

• Changes to the instruction set can be made late in the design

cycle

• Very powerful instruction sets can be implemented in different

datapaths

Generality:

• Different instruction sets can be implemented on the same ma-

chine

• Instruction sets can potentially be adapted to the particular ap-

plication

• Many different datapath organizations can be used with the same

instruction set (cost/performance tradeoffs)

Microcode control can be slower than direct logic implementation of

the control, and may require more circuitry (transistors).

It also may encourage “instruction set bloat” — adding instructions

because they can easily be provided.


Adding additional instructions

Clearly, adding an additional instruction can be accomplished by

adding to the control unit, provided that the instruction can actually

be implemented in the datapath.

For example, adding the ori instruction would require adding a third

bit to the control signal ALUOp in order to be able to encode logic

operations, and adding the capability to zero extend the imm16 field

(with control signal ExtOp, as before). These additions are the same

as those required for the single cycle implementation, and their

truth tables can be referred to for the appropriate values for these

control signals.
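The difference between the two kinds of extension can be seen in a short Python sketch. The function names extend and ori are introduced here for illustration; the encoding of ExtOp (1 for sign extension, 0 for zero extension) is the one used for this control signal in these notes.

```python
def extend(imm16, ext_op):
    """Extend a 16-bit immediate to 32 bits.
    ext_op = 1 selects sign extension, ext_op = 0 zero extension."""
    imm16 &= 0xFFFF
    if ext_op == 1 and imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16

def ori(rs_value, imm16):
    """ori rt, rs, imm16: OR with the zero extended immediate."""
    return (rs_value | extend(imm16, 0)) & 0xFFFFFFFF
```

With imm16 = 0x8000, sign extension gives 0xFFFF8000, while zero extension gives 0x00008000; using sign extension for ori would therefore set the upper 16 bits of the result incorrectly.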

Note that the new control signals may have to be added to the ex-

isting states, as well.

The additional control signals would also have to be generated by

the controller.

A microprogrammed control unit is usually easier to modify than a

conventional controller. It may be slower, though, because of the

(local) memory access time for the microinstructions.

The following diagram shows the modified datapath and controls for

the processor with the ori instruction.

[Figure: the multi-cycle datapath and control modified for the ori instruction. The control unit has the additional output ExtOp, which controls the extension of the imm16 field (Inst[15-0]), and the ALU provides an OR operation.]

The following shows the additions to the state diagram required to

implement the ORI instruction:


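State 14 (from state 1 when OP = 'ORI'):
    ALUSrcA = 1, ALUSrcB = 10, ALUOp = 010, ExtOp = 0; next state: 15

State 15: RegWrite = 1, RegDst = 0, MemtoReg = 0; next state: 0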

Also, in states 0, 1, and 2, the control signal ALUOp would have to

change from 00 to 000. In state 6, it would change from 10 to 100,

and in state 8, from 01 to 001.

The control signal ExtOp would have to be set to a value of 1 in

states 1 and 2.


Modifying the microcode to add the ori instruction

The two additional control signals would have to be added. The

third bit in the ALUOp control would naturally be added to the ALU

control field, as would a label for the OR function. The control signal

ExtOp would also have to be added to one of the fields, say, SRC2.

Field name   Value      Signals active            Comment

ALU control  Add        ALUOp = 000               Cause the ALU to add
             Subt       ALUOp = 001               Cause the ALU to subtract
             Or         ALUOp = 010               Cause the ALU to perform OR
             Func code  ALUOp = 100               Use the funct field to determine ALU operation
SRC2         B          ALUSrcB = 00              Use register B as the second ALU input
             4          ALUSrcB = 01              Use 4 as the second ALU input
             Extend     ExtOp = 1, ALUSrcB = 10   Sign extension of imm16; use the sign extended imm16 field as the second ALU input
             Extshft    ExtOp = 1, ALUSrcB = 11   Sign extension of imm16; the shifted sign extended imm16 field is the second ALU input
             UExtend    ExtOp = 0, ALUSrcB = 10   Unsigned extension of imm16; use the imm16 field as the second ALU input

Note that two labels have been added; OR, to specify an OR opera-

tion in the ALU, and UExtend to specify unsigned extension.

Sign extension was also explicitly specified, where required.

Less obviously, another label, say, Write ALUi has to be added to the

Register control field, because the value to be written comes from

the register ALUOut and is to be written to the register indexed by

rt, which requires a new combination of control signals.

Field name        Value       Signals active                          Comment

Register control  Read                                                Read 2 registers using the rs and rt fields and save the results in registers A and B
                  Write ALU   RegWrite = 1, RegDst = 1, MemtoReg = 0  Write to the register file using the rd field as destination and ALUOut as source
                  Write MDR   RegWrite = 1, RegDst = 0, MemtoReg = 1  Write to the register file using the rt field as destination and MDR as the source
                  Write ALUi  RegWrite = 1, RegDst = 0, MemtoReg = 0  Write to the register file using the rt field as destination and ALUOut as source

Since two states were added, two microcode instructions would also

be required. The microcode is similar to that for the R-type instruc-

tions.

The ori instruction:



Label  ALU control  SRC1  SRC2     Register control  Memory  PCWrite control  Sequencing

ORi    OR           A     UExtend                                             Seq
                                   Write ALUi                                 Fetch

Note that the additional two signals would now automatically be

provided for all instructions, since they are specified in the microcode

fields.

One other change is required. The op code for the instruction ori

has to be added to Dispatch ROM 1.

Dispatch ROM 1

OP Name Value

000000 R-type Rformat1

000010 j JUMP1

000100 beq BEQ1

001101 ori ORi

100011 lw Mem1

101011 sw Mem1


Exceptions and interrupts

A feature of virtually all processors is the capability to respond to

error conditions, and to be “interrupted” by some external condition.

These interruptions to the normal flow of events in the processor are

called exceptions or interrupts.

We will call something with a cause external to the processor an in-

terrupt, and an exception when the cause is internal to the processor

(say, an illegal instruction).

Note that this is by no means a standard nomenclature; the terms

are often used interchangeably.

Normally, interrupts and exceptions are handled by a combination

of hardware (the processor) and software (the operating system.)

Three things are required when an exception occurs:

1. The cause of the exception must be recorded.

2. The exception must be “handled” in some way. Normally, the

processor jumps to some location in memory where there is code

for an “exception handler.” The PC is set to this address by the

processor hardware.

3. The processor must have some way to return to the code that

was originally running, after handling the exception.


Adding exception handling

We will implement the hardware and control functions to handle two

types of exceptions; undefined instruction and arithmetic overflow.

Recall that the ALU had an overflow detection output, which can be

used as an input to the controller.

1. We will use a register labeled Cause to store a number (0 or 1)

to identify the type of exception, (0 for undefined instruction, 1

for arithmetic overflow).

It requires a control signal CauseWrite to be generated by the

controller. The controller also must set the value written to

the register, depending on whether or not the exception was an

arithmetic overflow.

The control signal IntCause is used to set this value.

2. The PC will be set to memory address C0000000 where the

operating system is expected to provide an event handler.

This is accomplished by adding another input (input 3) to the

MUX which updates the PC address. The MUX is controlled by

the 2-bit signal PCSource.

3. The address of the instruction which caused the exception is

stored in the register EPC, a 32 bit register.

Writing to this register is controlled by the new signal EPCWrite.


Storing the address of the instruction can be done several ways; for

example, it could be stored at the beginning of each instruction.

This would require a change to the datapath, and a way to disable

the storing of the address after each exception.

It is possible to store the address with only a small change to the

datapath (merely adding the EPC register to accept the output of the

ALU).

Recall that the next address (PC + 4) is calculated in the ALU, and

is written to the PC in the first cycle of every instruction. The ALU

can be used to subtract the value 4 from the PC after an exception

is detected, but before it is written into the EPC, so it contains the

actual address of the present instruction.

(Actually, there would be no real problem with saving the value

PC + 4 in the EPC; the interrupt handler could be responsible for

the subtraction.)
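The three actions taken on an exception can be summarized in a small Python sketch. The function name take_exception and the constant names are introduced here; the handler address C0000000 and the Cause values (0 for undefined instruction, 1 for arithmetic overflow) are those given above, and pc is assumed to hold the already incremented value PC + 4.

```python
HANDLER_ADDRESS = 0xC0000000
UNDEFINED_INSTRUCTION, ARITHMETIC_OVERFLOW = 0, 1

def take_exception(pc, overflow):
    """Return (new_pc, epc, cause) when an exception is raised.
    pc is the already incremented PC (PC + 4)."""
    epc = (pc - 4) & 0xFFFFFFFF      # the ALU subtracts 4 from the PC
    cause = ARITHMETIC_OVERFLOW if overflow else UNDEFINED_INSTRUCTION
    return HANDLER_ADDRESS, epc, cause
```

For an instruction at address 0x00400004 that overflows, the PC holds 0x00400008 when the exception is detected, so the EPC receives 0x00400004, Cause receives 1, and execution continues at C0000000.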

So, in order to handle these two exceptions, we have added two

registers — EPC and Cause, and three control signals — EPCWrite,

IntCause, and CauseWrite.

The changes to the processor datapath and control signals required

for the implementation of the exceptions detailed above are shown

in the following diagram.

[Figure: the datapath and control with exception handling added. The new elements are the Cause register (its input selected from the constants 0 and 1 by IntCause, and written under control of CauseWrite), the EPC register (loaded from the ALU output under control of EPCWrite), the constant C0000000 as input 3 of the MUX controlled by PCSource, and the overflow output of the ALU.]

Adding exception handling to the control unit

The exceptions overflow and undefined can be implemented by

the addition of only one state each:

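State 10 (entered when OP = 'other'):
    IntCause = 0, CauseWrite = 1,
    ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01,
    EPCWrite = 1, PCWrite = 1, PCSource = 11

State 11 (entered on overflow):
    IntCause = 1, CauseWrite = 1,
    ALUSrcA = 0, ALUSrcB = 01, ALUOp = 01,
    EPCWrite = 1, PCWrite = 1, PCSource = 11

Both states return to state 0.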

The input overflow is an output from the ALU. It is a combinational

logic output, produced while the ALU is performing the

selected operation.

294

The control unit, with exception handling

[Figure: state diagram of the multicycle control unit (states 0 to 9) with the two exception states added. State 10 (IntCause = 0) is entered when OP = 'other', and state 11 (IntCause = 1) is entered on overflow.]

The ALU operation which could result in an overflow is done in the EX cycle, and the overflow signal is only available then, unless it is saved in a register.

295

Adding interrupts and exceptions with microcode

It is not difficult to add exception handling with microcode.

Note that three additional control signals were added. The simplest

thing to do is to add another microcode field, called, say, Exception.

It determines the values of the three control signals, EPCWrite,

IntCause, and CauseWrite.

Field      Value      Signals       Comment
Exception  Overflow   EPCWrite=1    Save the output of the ALU (PC - 4) in EPC
                      IntCause=1    Select the cause input
                      CauseWrite=1  Write the selected value in the Cause register
           Undefined  EPCWrite=1    Save the output of the ALU (PC - 4) in EPC
                      IntCause=0    Select the cause input
                      CauseWrite=1  Write the selected value in the Cause register

The operations required are that the PC is decremented by 4, saved

in the EPC, the appropriate value set in the Cause register, and a

jump effected to the exception handler at address C0000000.

296

Another required modification to the microcode is to add a selec-

tion to the PC write control field to accommodate the jump to the

exception handler:

Field             Value         Signals        Comment
PC write control  ALU           PCSource=00    Select the ALU output as the source for the PC
                                PCWrite=1      Write into the PC
                  ALUOut-cond   PCSource=01    Select the ALUOut register as the source for the PC
                                PCWritecond=1  Write into the PC if the zero output of the ALU is set
                  jump-address  PCSource=10    Select the jump address field as the source for the PC
                  Exception     PCSource=11    Select address C0000000 as the value to be written to the PC

The other changes required are:

• Add the exception state address to Dispatch Rom 1

• Add Dispatch Rom 3 (with the Overflow output as address) to

the microprogram sequencer. This requires adding another bit

to the Sequencing field, as well.

• Change the Sequencing field of the microcode for R-type

instructions at label Rformat1 from Seq to Dispatch 3.

297


Microcode for the exceptions:


Label      ALU control  SRC1  SRC2  Register control  Memory  PCWrite control  Exception  Sequencing
Overflow   Subt         PC    4                               Exception        Overflow   Fetch
Undefined  Subt         PC    4                               Exception        Undefined  Fetch

Of course, the Exception field must be added to all the other mi-

crocode lines, but the values will all be the default (0) values.

298

More about interrupts

The ability to handle interrupts and exceptions is an important fea-

ture for processors.

We have added the control logic to detect the two types of excep-

tions described earlier, but note that the Cause and the EPC register

cannot be read.

Instructions would have to be provided to allow these registers to be

read and manipulated.

Processors usually have policies relating to exceptions. The MIPS

processor had the policy that an instruction which causes an exception

has no effect (e.g., nothing is written into a register.)

For some exceptions, if this policy is used, the operation may have

to complete before the exception can be detected, and the result of

the operation must then be “rolled back.”

This makes the implementation of exceptions difficult — sometimes

the state prior to an operation must be saved so it can be restored.

This constraint alone sometimes results in instructions requiring more

cycles for their implementation.

299

Exceptions and interrupts in other processors

A common type of interrupt is a vectored interrupt. Here, differ-

ent interrupts or exceptions jump to different addresses in memory.

The operating system places an appropriate interrupt handler for the

particular interrupt at each of these locations.

A vectored interrupt both identifies the type of interrupt, and pro-

vides the handler at the same time. (Since different interrupts or

exceptions have different vectors.)
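As a rough software analogy, a vector table can be pictured as a lookup from an interrupt number to a handler. The sketch below is hypothetical throughout: the interrupt numbers, the handler names, and the fallback message are invented for illustration and do not describe any particular processor.

```python
def make_vector_table():
    """Build a hypothetical table mapping interrupt numbers to handlers."""
    def timer_handler():
        return "timer serviced"

    def keyboard_handler():
        return "keyboard serviced"

    return {0: timer_handler, 1: keyboard_handler}


def dispatch(vector_table, vector):
    """Jump to the handler selected by the vector, if one is installed."""
    handler = vector_table.get(vector)
    if handler is None:
        return "unhandled interrupt %d" % vector
    return handler()


table = make_vector_table()
print(dispatch(table, 1))
```

The vector itself identifies the source of the interrupt, so no separate Cause register has to be examined before the right handler runs.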

In the INTEL processors, it is the responsibility of the interrupting

device to provide the interrupt vector. (This is usually done by

one of the peripheral controller chips, under control of the operating

system.)

A major problem with the PC architecture is that only a small number

of interrupts (typically 16) can be handled by the controller chip.

This has led to many problems with hardware devices “sharing

interrupts”, which defeats the advantages of vectored interrupts.

We will look at interrupts again, later, when we discuss input and

output devices.

300

Some questions about exceptions and interrupts

The following questions often have different answers for different pro-

cessors:

• How does a processor return control of the program flow from

the exception or interrupt handler to the interrupted program?

Some processors have explicit instructions for this (e.g., the MIPS

processors), others treat interrupts and exceptions as being sim-

ilar to subprogram calls (INTEL processors do this.)

• What happens when an exception or interrupt is itself inter-

rupted?

Some processors save the return addresses in a stack data struc-

ture, and successive levels of interrupts just increase the stack

depth. Typically, this is the way subprogram return addresses

are also stored.

Some processors automatically turn off the interrupt capability

at the beginning of an interrupt, and it must be explicitly turned

back on by the interrupt or exception handler to accept another

interrupt.

Some processors have both features — instructions can turn the

interrupt capability on and off, and can allow interrupts to be

interrupted themselves. (This turns out to be important for

implementing certain operating system functions.)

301

Comments on our implementation of exceptions

Note that our implementation has only one register for the address

of the interrupting instruction, and no way to read that address and

modify it to resume the program where the exception occurred.

What changes would be required to the instruction set to accomplish

this?

The simplest solution would probably be to allow only one interrupt

at a time, by disabling the interrupt capability, and to provide:

1. An instruction to store the EPC in the register file.

2. An instruction to store the Cause register in the register file.

3. An instruction to turn on interrupt capability after the next

instruction completed execution. (This assumes that the next

instruction restores the PC to the address of the instruction fol-

lowing the one that caused the exception.)

Note that these would require changes to the datapath and control.
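A small model may help to see how the three proposed instructions would be used by a handler. It is a sketch under stated assumptions: the class and method names are hypothetical, the EPC is assumed to hold the address of the instruction that caused the exception, and registers 26 and 27 are chosen arbitrarily as scratch registers.

```python
class TinyCPU:
    """Toy model of the exception state; not a full processor."""

    def __init__(self):
        self.pc = 0
        self.epc = 0
        self.cause = 0
        self.interrupts_enabled = True
        self.regs = [0] * 32

    def raise_exception(self, cause):
        # Hardware side: save state, disable interrupts, jump to the handler.
        self.epc = self.pc
        self.cause = cause
        self.interrupts_enabled = False
        self.pc = 0xC0000000

    def move_from_epc(self, rd):
        # Hypothetical instruction 1: copy EPC into the register file.
        self.regs[rd] = self.epc

    def move_from_cause(self, rd):
        # Hypothetical instruction 2: copy Cause into the register file.
        self.regs[rd] = self.cause

    def enable_and_jump(self, rs):
        # Hypothetical instruction 3 combined with the jump that follows it.
        self.pc = self.regs[rs]
        self.interrupts_enabled = True


cpu = TinyCPU()
cpu.pc = 0x00400024            # address of the instruction causing the exception
cpu.raise_exception(cause=1)
cpu.move_from_cause(27)        # handler inspects the cause
cpu.move_from_epc(26)          # handler fetches the saved address
cpu.regs[26] += 4              # resume at the following instruction
cpu.enable_and_jump(26)
print(hex(cpu.pc), cpu.interrupts_enabled)
```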

This example was just to give the flavor of the problems involved

with handling exceptions in the processor. More complex instruction

sets and architectures exacerbate the problems.

302

Comments on handling interrupts

Although exception handling is complex, it is often simpler than the

handling of external interrupts.

Exceptions occur as a result of occurrences internal to the processor.

Consequently, they are usually both predictable, and occur and are

detected at known times in the execution of a particular instruction.

Interrupts are external events, and are not at all synchronized with

the execution of instructions in the processor.

Since interrupts may be notification of an urgent event, they usually

require fast servicing.

Decisions therefore have to be taken about exactly when in the exe-

cution of an instruction an interrupt will be detected and handled.

Some of the considerations are:

• If the instruction is not allowed to complete, information must

be retained in order to either continue or restart the interrupted

instruction. How will this be done?

• If the interrupted instruction is allowed to complete, how will the

processor return to the next instruction in the current program?

• Can the interrupt handler be interrupted?

• Can interrupts be prioritized so that a high priority interrupt

can interrupt a lower priority interrupt?

303

How can we “speed up” the processor?

One idea is to try to make the most frequently used instructions as

fast as possible.

Instruction distributions for some common program types

                    Type of program
Instruction type    LaTeX    C compiler    Fortran (numerical)
calls               0.012    0.006         0.010
branches            0.115    0.229         0.068
loads/stores        0.331    0.231         0.456
flops               0.001    0.000         0.163
data (R-type)       0.414    0.293         0.284
nops                0.127    0.241         0.059

304

Instruction counts for a 60 page LaTeX document (the GWM man-

ual)

count percent type

1387566431 (1.004) cycles (55.5s @ 25.0MHz)

1382108615 (1.000) instructions

206864803 (0.150) basic blocks

19570428 (0.014) calls

342200862 (0.248) loads

181925435 (0.132) stores

524126297 (0.379) loads+stores

524252660 (0.379) data bus use

50344780 (0.036) partial word references

150139046 (0.109) branches

316292645 (0.229) nops

0 (0.000) load interlock cycles

5292110 (0.004) multiply/divide interlock cycles

124148 (0.000) flops (0.00224 mflops/s @ 25.0MHz)
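The derived figures in this listing can be checked from the raw counts. The short calculation below uses only numbers from the table; the variable names are chosen for the example.

```python
CLOCK_HZ = 25.0e6            # 25.0 MHz clock, from the listing
cycles = 1387566431
instructions = 1382108615
flops = 124148

cpi = cycles / instructions              # cycles per instruction
seconds = cycles / CLOCK_HZ              # run time at 25 MHz
mflops_per_s = flops / seconds / 1e6     # floating point rate

print(round(cpi, 3), round(seconds, 1), round(mflops_per_s, 5))
```

The ratio of cycles to instructions is the value shown in parentheses on the first line of the listing, and the other two results match the run time and flop rate quoted there.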

305

FORTRAN number crunching – hard-sphere molecular dynamics cal-

culation

count percent type

873855362 (1.050) cycles (35s @ 25.0MHz)

832495305 (1.000) instructions

81119362 (0.097) basic blocks

8071192 (0.010) calls

289695712 (0.348) loads

112932164 (0.136) stores

402627876 (0.484) loads+stores

426704925 (0.513) data bus use

258649 (0.000) partial word references

56782868 (0.068) branches

20814556 (0.025) nops

0 (0.000) load interlock cycles

751865 (0.001) multiply/divide interlock cycles

124343032 (0.149) flops (3.56 mflop/s @ 25.0MHz)

40496083 (0.049) floating point data interlock cycles

8015 (0.000) floating point add interlock cycles

46335 (0.000) floating point multiply interlock cycles

0 (0.000) floating point divide interlock cycles

57759 (0.000) other floating point interlock cycles

24071112 (0.029) 1 cycle interlocks

24052679 (0.029) overlapped floating point cycles

306

Other ideas for “speedup”

There are a number of ways of “speeding up” a multicycle processor

— generally by doing certain operations in parallel.

For example, in the INTEL 80x86 processors, the fetching of in-

structions from memory is decoupled from the instruction execution.

There is a logically separate bus interface unit which attempts to

fill an instruction queue during the times when the execution unit is

not receiving operands from memory. (The 80x86 is not a load/store

processor.)

Would this be a useful idea for our multicycle implementation of the

MIPS?

Another possibility is performing operations in the datapath in par-

allel. For example, it is not unusual for a processor to have different

adders for integer and floating point operations, and those operations

can be performed simultaneously.

(The MIPS R2000/R3000 performs floating point operations in par-

allel with integer operations.)

307

A Gantt chart showing a simple, multicycle implementation

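[Figure: Gantt chart with rows IF, RD, ALU, MEM, and WB against time (clock cycles, 0 to 14); each instruction finishes all of its cycles before the next instruction is fetched.]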

The thick lines indicate memory accesses.

Not all instructions would require all the cycles shown.

308

A simple overlap implementation — here the instruction fetch pro-

ceeds during the WB clock phase, in which results are written into

internal registers.

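[Figure: Gantt chart with rows IF, RD, ALU, MEM, and WB against time (clock cycles, 0 to 13); the IF cycle of each instruction overlaps the WB cycle of the previous instruction.]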

Could this implementation be done with the multicycle datapath

shown earlier?

Are the resources used in the IF cycle also used in the WB cycle?

What parts of the instruction register are required?

When in the cycle are they used?

Which instructions do not have a WB cycle, and how could they be

handled?

309

An implementation which makes full use of a single memory bus —

data reads and writes do not interfere with instruction fetch.

[Figure: Gantt chart with rows IF, RD, ALU, MEM, and WB against time (clock cycles, 0 to 13).]

Note that the single memory is a bottleneck.

In reality, not every instruction accesses data from memory; in sample

codes earlier, only 1/4 to 1/2 of the instructions were loads or stores.

The Gantt chart for this situation would be more complex.

Would the single cycle datapath be sufficient in this case?

What instructions would cause problems with this implementation?

310

Pipelining

Pipelining is a technique which allows several instructions to over-

lap in time; different parts of several consecutive instructions are

executed simultaneously.

The basic structure of a pipelined system is not very different from

the multicycle implementation previously discussed.

In the pipelined implementation, however, resources from one cycle

cannot be reused by another cycle. Also, the results from each stage

in the pipeline must be saved in a pipeline register for use in the next

pipeline stage.
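The benefit of overlapping can be estimated with a simple cycle count. The sketch below assumes an idealized case that the text does not state explicitly: every instruction uses all five cycles, and the pipeline never stalls. The function names are chosen for the example.

```python
def multicycle_cycles(n_instructions, cycles_per_instruction=5):
    """Cycles when each instruction finishes before the next one starts."""
    return n_instructions * cycles_per_instruction


def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles for an ideal pipeline: fill it once, then finish one per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


for n in (1, 3, 1000):
    print(n, multicycle_cycles(n), pipelined_cycles(n))
```

For a long instruction stream the ideal pipeline approaches one instruction per clock cycle, a speedup close to the number of stages.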

311

A pipelined implementation

[Figure: Gantt chart with rows IF, RD, ALU, MEM, and WB against time (clock cycles, 0 to 13); a new instruction enters IF on every clock cycle.]

Note that two memory accesses may be required in each machine

cycle (an instruction fetch, and a memory read or write.)

How could this problem be reduced or eliminated?

312

What is required to pipeline the datapath?

Recall that when the multi-cycle implementation was designed, in-

formation which had to be retained from cycle to cycle was stored in

a register until it was needed.

In a pipelined implementation, the results from each pipeline stage

must be saved if they will be required in the next stage.

In a multi-cycle implementation, resources could be “shared”

by different cycles.

In a pipelined implementation, every pipeline stage must have all the

resources it requires on every clock cycle.

A pipelined implementation will therefore require more hardware

than either a single cycle or a multicycle implementation.

A reasonable starting point for a pipelined implementation would be

to add pipeline registers to the single cycle implementation.

We could have each pipeline stage do the operations in each cycle of

the multi-cycle implementation.

The next figure shows a first attempt at the datapath with pipeline

registers added.

313

[Figure: first attempt at the pipelined datapath. The single cycle datapath is divided into the stages IF, ID, EX, MEM, and WB, with a pipeline register between adjacent stages.]

314

It is useful to note the changes that have been made to the datapath.

The most obvious change is, of course, the addition of the pipeline

registers.

The addition of these registers introduces some questions.

How large should the pipeline registers be?

Will they be the same size in each stage?

The next change is to the location of the MUX that updates the PC.

This must be associated with the IF stage. In this stage, the PC

should also be incremented.

The third change is to preserve the address of the register to be

written in the register file. This is done by passing the address along

the pipeline registers until it is required in the WB stage.

The output of the MUX which provides the write address now goes to the

pipeline register.

315

Pipeline control

Since five instructions are now executing simultaneously, the con-

troller for the pipelined implementation is, in general, more complex.

It is not as complex as it appears on first glance, however.

For a processor like the MIPS, it is possible to decode the instruction

in the early pipeline stages, and to pass the control signals along the

pipeline in the same way as the data elements are passed through

the pipeline.

(This is what will be done in our implementation.)

A variant of this would be to pass the instruction field (or parts of

it) and to decode the instruction as needed for each stage.

For our processor example, since the datapath elements are the same

as for the single cycle processor, then the control signals required

must be similar, and can be implemented in a similar way.

All the signals can be generated early (in the ID stage) and passed

along the pipeline until they are required.
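The idea of generating the signals once and carrying them along can be sketched as follows. This is a simplified model, not the actual control unit: only two instructions are included, the control values are the usual single cycle settings (an assumption here), and the grouping into EX, MEM, and WB sets and the function names are chosen for the example.

```python
# Control values for two instructions, grouped by the stage that uses them.
CONTROL = {
    "R-type": {
        "EX": {"RegDst": 1, "ALUSrc": 0, "ALUOp": 0b10},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUSrc": 1, "ALUOp": 0b00},
        "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
}


def decode(op):
    """ID stage: produce every control signal the instruction will need."""
    return CONTROL[op]


def advance(pipe, decoded):
    """One clock edge: each register keeps only the groups still to be used."""
    ex_mem, id_ex = pipe["EX/MEM"], pipe["ID/EX"]
    pipe["MEM/WB"] = None if ex_mem is None else {"WB": ex_mem["WB"]}
    pipe["EX/MEM"] = None if id_ex is None else {"MEM": id_ex["MEM"],
                                                "WB": id_ex["WB"]}
    pipe["ID/EX"] = decoded
    return pipe


pipe = {"ID/EX": None, "EX/MEM": None, "MEM/WB": None}
advance(pipe, decode("lw"))
advance(pipe, decode("R-type"))
advance(pipe, None)
print(pipe["MEM/WB"], pipe["EX/MEM"])
```

After three clock edges the WB signals of the lw have reached the MEM/WB register, while the R-type instruction behind it still carries both its MEM and WB groups.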

316

[Figure: the pipelined datapath with control. The control unit decodes Inst[31-26] and produces RegDst, ALUSrc, ALUop, Branch, MemRead, MemWrite, MemtoReg, and RegWrite; these are carried along the pipeline registers in EX, MEM, and WB groups. The ALU control decodes Inst[5-0], and PCSrc selects the next PC.]

317

Executing an instruction

In the following figures, we will follow the execution of an instruction

through the pipeline.

The instructions we have implemented in the datapath are those of

the simplest version of the single cycle processor, namely:

• the R-type instructions

• load

• store

• beq

We will follow the load instruction, as an example.

318

[Figure: the pipelined datapath with the pipeline registers labelled IF/ID, ID/EX, EX/MEM, and MEM/WB; one of a sequence of figures following the load instruction through the stages IF, ID, EX, MEM, and WB.]

319


[Figure: the pipelined datapath; one of a sequence of figures following the load instruction through the stages IF, ID, EX, MEM, and WB.]


320


[Figure: the pipelined datapath; one of a sequence of figures following the load instruction through the stages IF, ID, EX, MEM, and WB.]


321


[Figure: the pipelined datapath; one of a sequence of figures following the load instruction through the stages IF, ID, EX, MEM, and WB.]


322


[Figure: the pipelined datapath; one of a sequence of figures following the load instruction through the stages IF, ID, EX, MEM, and WB.]


323

Representing a pipeline pictorially

These diagrams are rather complex, so we often represent a pipeline

as simpler figures representing the structure as follows:

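[Figure: three overlapping instructions (labelled SW, ADD, and LW), each drawn as the sequence IM, REG, ALU, DM, REG and offset from the previous one by one clock cycle.]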

Often an even simpler representation is sufficient:

IF  ID  ALU MEM WB
    IF  ID  ALU MEM WB
        IF  ID  ALU MEM WB

The following figure shows a pipeline with several instructions in

progress:

324

[Figure: six instructions in progress (labelled ADD, SW, LW, SUB, BEQ, and AND), each drawn as the sequence IM, REG, ALU, DM, REG and offset from the previous one by one clock cycle.]

325

Pipeline “hazards”

There are three types of “hazards” in pipelined implementations —

structural hazards, control hazards, and data hazards.

Structural hazards

Structural hazards occur when there are insufficient hardware re-

sources to support the particular combination of instructions presently

being executed.

The present implementation has a potential structural hazard if there

is a single memory for data and instructions.

Other structural hazards cannot happen in a simple linear pipeline,

but for more complex pipelines they may occur.

Control hazards

These hazards happen when the flow of control changes as a result

of some computation in the pipeline.

One question here is what happens to the rest of the instructions in

the pipeline?

Consider the beq instruction.

The branch address calculation and the comparison are performed in

the EX cycle, and the branch address returned to the PC in the next

cycle.

326

What happens to the instructions in the pipeline following a success-

ful branch?

There are several possibilities.

One is to stall the instructions following a branch until the branch

result is determined. (Some texts refer to a stall as a “bubble.”)

This can be done by the hardware (stopping, or stalling the pipeline

for several cycles when a branch instruction is detected.)

[Figure: the instructions beq, add, and lw, each passing through IF, ID, ALU, MEM, and WB, with three stall cycles inserted after the branch.]

It can also be done by the compiler, by placing several nop instruc-

tions following a branch. (It is not called a pipeline stall then.)

[Figure: a beq followed by three nop instructions, together with add and lw instructions; each of the six instructions passes through IF, ID, ALU, MEM, and WB.]

327

Another possibility is to execute the instructions in the pipeline. It

is left to the compiler to ensure that those instructions are either

nops or useful instructions which should be executed regardless of

the branch test result.

This is, in fact, what was done in the MIPS. It had one “branch delay

slot” which the compiler could fill with a useful instruction about 50%

of the time.

beq                            IF  ID  ALU MEM WB
branch delay slot                  IF  ID  ALU MEM WB
instruction at branch target           IF  ID  ALU MEM WB

We saw earlier that branches are quite common, and inserting many

stalls or nops is inefficient.

For long pipelines, however, it is difficult to find useful instructions to

fill several branch delay slots, so this idea is not used in most modern

processors.
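The cost of stalling on every branch can be estimated from the branch frequencies in the instruction distribution table shown earlier. The calculation below is a sketch under two assumptions not stated in the text: an ideal pipeline otherwise completes one instruction per cycle, and every branch costs three stall cycles. The function name is chosen for the example.

```python
# Branch fractions from the instruction distribution table.
branch_fraction = {"LaTeX": 0.115, "C compiler": 0.229, "Fortran": 0.068}


def cpi_with_branch_stalls(fraction, stall_cycles, base_cpi=1.0):
    """Average cycles per instruction when each branch adds stall cycles."""
    return base_cpi + fraction * stall_cycles


for program, fraction in branch_fraction.items():
    print(program, round(cpi_with_branch_stalls(fraction, 3), 3))
```

Under these assumptions the C compiler workload, with the most branches, is slowed the most. Reducing the penalty to one cycle per branch, as a single delay slot does when it can be filled, removes most of that cost.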

328

Branch prediction

If branches could be predicted, there would be no need for stalls.

Most modern processors do some form of branch prediction.

Perhaps the simplest is to predict that no branch will be taken.

In this case, the pipeline is flushed if the branch prediction is wrong,

and none of the results of the instructions in the pipeline are written

to the register file.

How effective is this prediction method?

What branches are most common?

Consider the most common control structure in most programs —

the loop.

In this structure, the most common result of a branch is that it is

taken; consequently the next instruction in memory is a poor predic-

tion. In fact, in a loop, the branch is not taken exactly once — at

the end of the loop.

A better choice may be to record the last branch decision, (or the

last few decisions) and make a decision based on the branch history.

Branches are problematic in that they are frequent, and cause ineffi-

ciencies by requiring pipeline flushes. In deep pipelines, this can be

computationally expensive.
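The difference between the two prediction policies can be seen by counting mispredictions for the branch that closes a loop. This is a minimal sketch: the loop is assumed to branch back on every iteration except the last, the history is a single bit initialized to "not taken", and the function names are chosen for the example.

```python
def loop_outcomes(iterations, repeats=1):
    """Outcomes of a loop-closing branch: taken except on the final iteration."""
    return ([True] * (iterations - 1) + [False]) * repeats


def mispredictions_not_taken(outcomes):
    """Always predict 'not taken'; every taken branch is a misprediction."""
    return sum(1 for taken in outcomes if taken)


def mispredictions_last_outcome(outcomes, initial=False):
    """Predict that the branch repeats its most recent outcome."""
    prediction = initial
    wrong = 0
    for taken in outcomes:
        if taken != prediction:
            wrong += 1
        prediction = taken
    return wrong


once = loop_outcomes(10)
print(mispredictions_not_taken(once), mispredictions_last_outcome(once))
```

With one bit of history, a loop that is run repeatedly is mispredicted twice per run (once on exit and once on re-entry), which is one motivation for recording more than the last decision.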

329

Data hazards

Another common pipeline hazard is a data hazard. Consider the

following instructions:

add $r2, $r1, $r3

add $r5, $r2, $r3

Note that $r2 is written in the first instruction, and read in the

second.

In our pipelined implementation, however, $r2 is not written until

four cycles after the second instruction begins, and therefore three

bubbles or nops would have to be inserted before the correct value

would be read.

add $r2, $r1, $r3    IF  ID  ALU MEM WB
add $r5, $r2, $r3        IF  ID  ALU MEM WB      (data hazard on $r2)

The following would produce a correct result:

IF  ID  ALU MEM WB
    nop nop nop IF  ID  ALU MEM WB
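A compiler pass that pads a program with nops can be sketched as below. It follows the assumption used in the text that three instructions must separate a register write from a dependent read when there is no forwarding. The tuple format for instructions and the function name are invented for the example, and register 0 is not treated specially.

```python
def insert_nops(program, gap=3):
    """Pad a program so dependent instructions are at least gap slots apart.

    program is a list of (text, dest, sources) tuples; the result is the
    list of instruction texts with "nop" entries inserted.
    """
    out = []  # (text, dest) pairs already emitted
    for text, dest, sources in program:
        needed = 0
        # Look at the last `gap` emitted instructions, most recent first.
        for back, (_, prev_dest) in enumerate(reversed(out[-gap:]), start=1):
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, gap - back + 1)
        out.extend([("nop", None)] * needed)
        out.append((text, dest))
    return [text for text, _ in out]


program = [
    ("add $r2, $r1, $r3", "$r2", ("$r1", "$r3")),
    ("add $r5, $r2, $r3", "$r5", ("$r2", "$r3")),
]
for line in insert_nops(program):
    print(line)
```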

The following figure shows a series of pipeline hazards.

330

[Figure: pipeline diagram (stages IM, REG, ALU, DM, REG) for the instructions add $2, $1, $3; sub $5, $2, $3; and $7, $6, $2; beq $0, $2, -25; and sw $7, 100($2), showing the data hazards between them.]

331

Handling data hazards

There are a number of ways to reduce data hazards.

The compiler could attempt to reorder instructions so that instruc-

tions reading registers recently written are not too close together,

and insert nops where it is not possible to do so.

For deep pipelines, this is difficult.

Hardware could be constructed to detect hazards, and insert stalls

in the pipeline where necessary.

This also slows down the pipeline (it is equivalent to adding nops.)

An astute observer could note that the result of the ALU operation

is stored in the pipeline register at the end of the ALU stage, two

cycles before it is written into the register file.

If instructions could take the value from the pipeline register, it could

reduce or eliminate many of the data hazards.

This idea is called forwarding.

The following figure shows how forwarding would help in the pipeline

example shown earlier.

332

[Figure: the same pipeline diagram for add $2, $1, $3; sub $5, $2, $3; and $7, $6, $2; beq $0, $2, -25; and sw $7, 100($2), with arrows showing results forwarded from the pipeline registers to later instructions.]

Note how forwarding eliminates the data hazards in these cases.

333

Implementing forwarding

Note that from the previous examples there are now two potential

additional sources of operands for the ALU during the EX cycle:

the EX/MEM pipeline register and the MEM/WB pipeline register.

What additional hardware would be required to provide the data

from the pipeline stages?

The data to be forwarded could be required by either of the inputs

to the ALU, so two MUX’s would be required — one for each ALU

input.

The MUX’s would have three sources of data; the original data from

the registers (in pipeline stage ID/EX) or the two pipeline stages to

be forwarded.

Looking only at the datapath for R-type operations, the additional

hardware would be as follows:

334

[Figure: the EX stage with forwarding. A MUX on each ALU input, controlled by ForwardA and ForwardB, selects among the register values in ID/EX, the ALU result in EX/MEM, and the value in MEM/WB.]

There would also have to be a “forwarding unit” which provides

control signals for these MUX’s.

335

Forwarding control

Under what conditions does a data hazard (for R-type operations)

occur?

It is when a register to be read in the EX cycle is the same register

as one targeted to be written, and is held in either the EX/MEM

pipeline register or the MEM/WB pipeline register.

These conditions can be expressed as:

1. EX/MEM.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt

2. MEM/WB.RegisterRd = ID/EX.RegisterRs or ID/EX.RegisterRt

Some instructions do not write registers, so the forwarding unit

should check to see if the register actually will be written. (If it

is to be written, the control signal RegWrite, also in the pipeline,

will be set.)

Also, an instruction may try to write some value in register 0. More

importantly, it may try to write a non-zero value there, which should

not be forwarded — register 0 is always zero.

Therefore, register 0 should never be forwarded.

336

The register control signals ForwardA and ForwardB have values

defined as:

MUX control  Source   Explanation
00           ID/EX    Operand comes from the register file (no forwarding)
01           MEM/WB   Operand forwarded from a memory operation or an earlier ALU operation
10           EX/MEM   Operand forwarded from the previous ALU operation

The conditions for a hazard with a value in the EX/MEM stage are:

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 10

if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 10

337

For hazards with the MEM/WB stage, an additional constraint is

required in order to make sure the most recent value is used:

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
then ForwardA = 01

if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
then ForwardB = 01

The datapath with the forwarding control is shown in the next figure.

338

[Figure: the EX stage with the forwarding unit added. The forwarding unit takes rs and rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the ForwardA and ForwardB MUX controls.]

For a datapath with forwarding, the hazards which are fixed by for-

warding are not considered hazards any more.
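The forwarding conditions given earlier can be collected into one function. This is a sketch rather than the unit itself: the dictionary layout and function names are assumptions, and the MEM/WB case is written as "no EX/MEM hazard exists", which also checks EX/MEM.RegWrite and is therefore slightly stricter than the condition as written in the text.

```python
def forward_select(source_reg, ex_mem, mem_wb):
    """Return the 2-bit MUX control for one ALU input.

    ex_mem and mem_wb are dicts with the keys "RegWrite" (bool) and
    "RegisterRd" (register number) held in those pipeline registers.
    """
    if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
            and ex_mem["RegisterRd"] == source_reg):
        return 0b10   # most recent value: forward from EX/MEM
    if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
            and mem_wb["RegisterRd"] == source_reg):
        return 0b01   # older value: forward from MEM/WB
    return 0b00       # no hazard: use the value read from the register file


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction in the EX stage."""
    return (forward_select(id_ex_rs, ex_mem, mem_wb),
            forward_select(id_ex_rt, ex_mem, mem_wb))


ex_mem = {"RegWrite": True, "RegisterRd": 2}
mem_wb = {"RegWrite": True, "RegisterRd": 3}
print(forwarding_unit(2, 3, ex_mem, mem_wb))
```

Checking EX/MEM first is what guarantees that, when both pipeline registers hold a value for the same register, the more recent one is used.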

339

Forwarding for other instructions

What considerations would have to be made if other instructions

were to make use of forwarding?

The immediate instructions

The major difference is that the B input to the ALU comes from the

instruction and sign extension unit, so the present MUX controlled

by the ALUSrc signal could still be used as input to the ALU.

The major change is that one input to this MUX is the output of the

MUX controlled by ForwardB.

The load and store instructions

These will work fine, for loads and stores following R-type instruc-

tions.

There is a problem, however, for a store following a load.

lw $2, 100($3)    IM  REG ALU DM  REG
sw $2, 400($3)        IM  REG ALU DM  REG

Note that this situation can also be resolved by forwarding.

It would require another forwarding controller in the MEM stage.

340

There is a situation which cannot be handled by forwarding, however.

Consider a load followed by an R-type operation:

lw  $2, 100($3)    IM  REG ALU DM  REG
add $4, $3, $2         IM  REG ALU DM  REG

Here, the data from the load is not ready when the R-type instruction

requires it; we have a hazard.

What can be done here?

[Figure: the instructions lw $2, 100($3) and add $4, $3, $2 with a one-cycle STALL inserted, so that the add reaches the ALU stage one cycle later.]

With a “stall”, forwarding is now possible.

It is possible to accomplish this with a nop, generated by a compiler.

Another option is to build a “hazard detection unit” in the control

hardware to detect this situation.

341

The condition under which the “hazard detection circuit” is required

to insert a pipeline stall is when an operation requiring the ALU

follows a load instruction, and one of the operands comes from the

register to be written.

The condition for this is simply:

if (ID/EX.MemRead
    and ((ID/EX.RegisterRt = IF/ID.RegisterRs)
         or (ID/EX.RegisterRt = IF/ID.RegisterRt)))
then STALL
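The stall condition translates directly into a Boolean function. The function and parameter names below are chosen for the example; the logic is the condition just given, with the two register comparisons grouped together before being combined with MemRead.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load now in EX."""
    return bool(id_ex_mem_read
                and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $2, 100($3) is in EX (it writes rt = 2); add $4, $3, $2 is in ID.
print(must_stall(True, id_ex_rt=2, if_id_rs=3, if_id_rt=2))
```

The grouping matters: without it, a match on the second comparison would request a stall even when the instruction in EX is not a load.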

342

Forwarding with branches

For the beq instruction, if the comparison is done in the ALU, the

forwarding already implemented is sufficient.

add $2, $3, $4    IM  REG ALU DM  REG
beq $2, $3, 25        IM  REG ALU DM  REG

In the MIPS processor, however, the branch instructions were

implemented to require only two cycles. The instruction following the

branch was always executed. (The compiler attempted to place a

useful instruction in this “branch delay slot”, but if it could not, a

nop was placed there.)

The original MIPS did not have forwarding, but it is useful to consider

the kinds of hazards which could arise with this instruction.

Consider the sequence

add $2, $3, $4    IF    ID    ALU   MEM   WB

beq $2, $5, 25          IF    ID    ALU   MEM   WB

Here, if the conditional test is done in the ID stage, there is a hazard

which cannot be resolved by forwarding.

343

In order to correctly implement this instruction in a processor with

forwarding, both forwarding and hazard detection must be employed.

The forwarding must be similar to that for the ALU instructions, and the hazard detection similar to that for the load/ALU type instructions.
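To see why both mechanisms are needed, the number of stall cycles can be estimated with a short calculation. The sketch below rests on assumptions that are not stated in these notes: a five-stage pipeline, an ALU result that can first be forwarded in the cycle after the ALU stage, a loaded value that can first be forwarded in the cycle after the MEM stage, and a branch that needs its operands during its ID stage. The function name is hypothetical.

```python
def branch_stall_cycles(producer_is_load: bool, distance: int) -> int:
    """Stall cycles needed before a branch that compares in the ID stage.

    distance: how many instructions after the producer the branch appears
              (1 means the branch immediately follows the producer).
    Cycles are counted from the producer's IF stage as cycle 0.
    """
    # First cycle in which the produced value can be forwarded (assumed timing):
    # ALU result after the ALU stage (cycle 2), load data after MEM (cycle 3).
    ready_cycle = 4 if producer_is_load else 3
    # Cycle in which the branch would reach ID with no stalls.
    branch_id_cycle = distance + 1
    return max(0, ready_cycle - branch_id_cycle)


print(branch_stall_cycles(producer_is_load=False, distance=1))  # add then beq
print(branch_stall_cycles(producer_is_load=True, distance=1))   # lw then beq
print(branch_stall_cycles(producer_is_load=False, distance=2))  # one instruction between
```

Under these assumptions a branch directly after an ALU instruction needs one stall, a branch directly after a load needs two, and an ALU result produced two instructions earlier can be forwarded without stalling.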

Presently, most processors do not use a “branch delay slot” for branch

instructions, but use branch prediction.

Typically, there is a small amount of memory contained in the processor which records information about the last few branch decisions for each branch.

In fact, individual branches are not identified directly in this memory;

the low order address bits of the branch instruction are used as an

identifier for the branch.

This means that sometimes several branches will be indistinguishable

in the branch prediction unit. (The frequency of this occurrence

depends on the size of the memory used for branch prediction.)
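The indexing scheme described above can be illustrated with a toy table. This is a sketch, not the design of any particular processor: the table size, the history length, the majority-vote prediction rule, and the choice to drop the two lowest address bits (instructions are word aligned) are all assumptions made for the example.

```python
from collections import deque


class BranchHistoryTable:
    """Toy branch predictor indexed by low-order bits of the branch address."""

    def __init__(self, index_bits=4, history_len=2):
        self.mask = (1 << index_bits) - 1
        # One short history of recent outcomes per table entry.
        self.table = [deque(maxlen=history_len) for _ in range(1 << index_bits)]

    def index(self, pc):
        # Drop the two always-zero low bits, then keep index_bits bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        history = self.table[self.index(pc)]
        # Predict taken if most recorded outcomes were taken (empty: not taken).
        return sum(history) * 2 > len(history)

    def update(self, pc, taken):
        self.table[self.index(pc)].append(bool(taken))


bht = BranchHistoryTable()
bht.update(0x00400010, True)
bht.update(0x00400010, True)
print(bht.predict(0x00400010))
# A different branch whose low-order bits match shares the same entry.
print(bht.index(0x00400010) == bht.index(0x00400050))
print(bht.predict(0x00400050))
```

All three lines print True: the second branch has never executed, yet it inherits the prediction of the first because the two are indistinguishable in the table. A larger index_bits value makes such collisions less frequent.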

We will discuss branch prediction in more depth later.

344

Exceptions and interrupts

Exceptions are a kind of control hazard.

Consider the overflow exception discussed previously for the multi-cycle implementation.

In the pipelined implementation, the exception will not be identified

until the ALU performs the arithmetic operation, in stage 3.

The operations in the pipeline following the instruction causing the

exception must be flushed. As discussed earlier, this can be done by

setting the control signals (now in pipeline registers) to 0.

The instruction in the IF stage can be turned into a nop.

The control signals ID.Flush and EX.Flush control the MUXes which zero the control lines.

The PC must be loaded with a memory value at which the exception

handler resides (some fixed memory location).

This can be done by adding another input to the PC MUX.

The address of the instruction causing the exception must then be

saved in the EPC register. (Actually, the value PC + 4 is saved).

Note that the instruction causing the exception cannot be allowed

to complete, or it may overwrite the register value which caused the

overflow. Consider the following instruction:

add $1, $1, $2

The value in register 1 would be overwritten if the instruction finished.
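The sequence of actions described above can be summarized in a short Python sketch. It is a behavioural illustration only: the Pipeline class and its field names are hypothetical, and the handler address is the value 40000040 shown on the datapath figure, assumed here to be hexadecimal.

```python
from dataclasses import dataclass, field

# Fixed handler location; taken from the datapath figure and assumed to be hex.
EXCEPTION_HANDLER_ADDRESS = 0x40000040


@dataclass
class Pipeline:
    """Hypothetical snapshot of the state touched when an exception is taken."""
    pc: int = 0
    epc: int = 0
    if_instruction: str = "nop"
    id_control: dict = field(default_factory=dict)   # control lines leaving ID
    ex_control: dict = field(default_factory=dict)   # control lines leaving EX


def take_overflow_exception(p: Pipeline, faulting_pc: int) -> None:
    """Flush the pipeline and redirect the PC when the ALU reports overflow."""
    p.epc = faulting_pc + 4                  # PC + 4 is what is actually saved
    p.if_instruction = "nop"                 # IF.Flush: fetched instruction becomes a nop
    p.id_control = {name: 0 for name in p.id_control}   # ID.Flush
    # EX.Flush: the faulting instruction itself must not write its register.
    p.ex_control = {name: 0 for name in p.ex_control}
    p.pc = EXCEPTION_HANDLER_ADDRESS         # extra input on the PC MUX


p = Pipeline(pc=0x1010, if_instruction="sub $5, $6, $7",
             id_control={"RegWrite": 1, "MemRead": 0},
             ex_control={"RegWrite": 1})
take_overflow_exception(p, faulting_pc=0x1008)
print(hex(p.epc), hex(p.pc), p.if_instruction, p.ex_control)
```

After the call, the EPC holds the faulting address plus 4, the PC points at the handler, and every control line for the flushed instructions is 0, so add $1, $1, $2 would never write register 1.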

345

The datapath, with exception handling for overflow:

[Figure: the pipelined datapath, including the forwarding unit and the hazard detection unit, extended for exceptions. Labels shown include the IF.Flush, ID.Flush and EX.Flush signals, the MUXes that select 0 for the control lines, the Cause and Except PC registers, and the value 40000040 as an additional input to the PC MUX.]

346

Interrupts can be handled in a way similar to that for exceptions.

Here, though, the instruction presently being completed may be allowed to finish, and the pipeline flushed.

(Another possibility is to simply allow all instructions presently in

the pipeline to complete, but this will increase the interrupt latency.)

The value of the PC + 4 is stored in the EPC, and this will be the

return address from the interrupt, as discussed earlier.

Note that the effect of an interrupt on every instruction will have to

be carefully considered — what happens if an interrupt occurs near

a branch instruction?

347

Superscalar and superpipelined processors

Most modern processors have longer pipelines (superpipelined) and

two or more pipelines (superscalar) with instructions sent to each

pipeline simultaneously.

In a superpipelined processor, the clock speed of the pipeline can be

increased, while the computation done in each stage is decreased.

In this case, there is more opportunity for data hazards, and control

hazards.

In the Pentium IV processor, pipelines are 20 stages long.

In a superscalar machine, there may be hazards among the separate

pipelines, and forwarding can become quite complex.

Typically, there are different pipelines for different instruction types,

so two arbitrary instructions cannot be issued at the same time.

Optimizing compilers try to generate instructions that can be issued

simultaneously, in order to keep such pipelines full.

In the Pentium IV processor, there are six independent pipelines,

most of which handle different instruction types.

In each cycle, an instruction can be issued for each pipeline, if there

is an instruction of the appropriate type available.
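A simplified model of this issue rule, as a Python sketch. The function name, the (name, type) tuple format and the list of pipeline types are hypothetical, and the model deliberately ignores data dependences between instructions, which a real issue unit must also check.

```python
def issue_cycle(waiting, pipelines):
    """Issue at most one waiting instruction to each pipeline in one cycle.

    waiting:   list of (name, type) tuples in program order (hypothetical format).
    pipelines: list of pipeline types, one entry per pipeline.
    Returns (issued, still_waiting).
    """
    issued = []
    remaining = list(waiting)
    for pipe_type in pipelines:
        for instr in remaining:
            if instr[1] == pipe_type:       # oldest instruction of the right type
                issued.append(instr)
                remaining.remove(instr)
                break                       # this pipeline is now busy
    return issued, remaining


pipelines = ["integer", "integer", "load/store", "floating point"]
waiting = [("add", "integer"), ("sub", "integer"),
           ("and", "integer"), ("lw", "load/store")]
issued, waiting = issue_cycle(waiting, pipelines)
print([name for name, _ in issued])
print([name for name, _ in waiting])
```

Here add, sub and lw issue together, the third integer instruction waits for the next cycle, and the floating point pipeline stays idle. This is why a compiler tries to interleave instruction types.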

348

Dynamic pipeline scheduling

Many processors today use dynamic pipeline scheduling to find

instructions which can be executed while waiting for pipeline stalls

to be resolved.

The basic model is a set of independent state machines performing

instruction execution; one unit fetching and decoding instructions

(possibly several at a time), several functional units performing the

operations (these may be simple pipelines), and a commit unit which

writes results in registers and memory in program execution order.

Generally, the commit unit also “kills off” results obtained from

branch prediction misses and other speculative computation.
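The role of the commit unit can be sketched in a few lines of Python. The buffer layout (a list of dictionaries with 'name', 'done' and 'squashed' keys) is a hypothetical stand-in for the hardware queue, chosen only to show in-order commit of results that were produced out of order.

```python
def commit_ready(reorder_buffer):
    """Commit finished entries from the head of the buffer, in program order.

    Each entry is a dict with keys 'name', 'done' and 'squashed' (a hypothetical
    layout). Entries stay queued behind any older entry that is not yet done.
    """
    committed = []
    while reorder_buffer and reorder_buffer[0]["done"]:
        entry = reorder_buffer.pop(0)
        if not entry["squashed"]:      # results from mispredicted paths are dropped
            committed.append(entry["name"])
    return committed


rob = [
    {"name": "add", "done": True, "squashed": False},
    {"name": "lw", "done": False, "squashed": False},   # still executing
    {"name": "sub", "done": True, "squashed": False},   # finished early, must wait
]
print(commit_ready(rob))
rob[0]["done"] = True                                   # the load completes
print(commit_ready(rob))
```

The first call commits only add, because sub may not pass the unfinished lw. Once the load completes, the second call commits lw and then sub.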

In the Pentium IV processor, up to six instructions can be issued in

each clock cycle, while four instructions can be retired in each cycle.

(This clearly shows that the designers anticipated that there would

be many instructions issued — on average 1/3 of the instructions —

that would be aborted.)

349

[Figure: dynamic pipeline scheduling. Labels shown: an instruction fetch and decode unit (in-order issue); reservation stations in front of the functional units (integer, integer, ..., floating point, load/store; out-of-order execute); and a commit unit (in-order commit).]

Dynamic pipeline scheduling is used in the three most popular processor families in machines today: the Pentium II, III, and IV machines, the AMD Athlon, and the Power PC.

350

A generic view of the Pentium P-X and the Power PC pipeline

[Figure: components shown are the PC, branch prediction, instruction cache, instruction queue, decode/dispatch unit, register file, six reservation stations, the functional units (branch, integer, integer, complex integer, floating point, load/store with separate store and load paths), the data cache, the reorder buffer, and the commit unit.]

351

Speculative execution

One of the more important ways in which modern processors keep

their pipelines full is by executing instructions “out of order” and

hoping that the dynamic data required will be available, or that the

execution thread will continue.

There are two cases where speculative computation is common. The first is the “store before load” case: normally, if a data element is stored, the element being loaded does not depend on the element being stored, so the load can go ahead without waiting for the store.

The second case is at a branch — both threads following the branch

may be executed before the branch decision is taken, but only the

thread for the successful path would be committed.

Note that the type of speculation in each case is different — in the

first, the decision may be incorrect; in the second, one thread will

be incorrect.
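The first kind of speculation can be illustrated with a toy model in Python. Everything here is an assumption made for the example: memory is a dictionary, the function name is hypothetical, and recovery is modelled simply as re-reading the location once the store has completed.

```python
def speculative_load(memory, load_addr, pending_store):
    """Run a load ahead of an earlier store, then check the guess.

    pending_store: (address, value) of the older store, completed afterwards.
    Returns (value, replayed), where replayed is True if the guess was wrong.
    """
    speculative_value = memory.get(load_addr, 0)   # the load runs ahead of the store
    store_addr, store_value = pending_store
    memory[store_addr] = store_value               # the store now completes
    if store_addr == load_addr:
        # The load did depend on the store: discard the early value and redo it.
        return memory[load_addr], True
    return speculative_value, False


print(speculative_load({100: 7, 200: 9}, 200, (100, 1)))
print(speculative_load({100: 7, 200: 9}, 100, (100, 1)))
```

In the first call the addresses differ, so the early value 9 is kept. In the second the store and load use the same address, so the speculation was incorrect and the load is replayed to obtain 1.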

352
