coa module 2 notes

08.503 Computer Organization and Architecture Module 2

Department of ECE, VKCET

Design of Data path and Control (based on MIPS instruction set)

Basic MIPS Implementation:

Consider the subset of the core MIPS instruction set:

The key principles used to create the data path and design the control for other instructions are similar. The implementation ideas are common to general-purpose microprocessors, processors in high-performance servers, embedded processors, etc.

For every instruction, the first two steps of execution are identical:

1. Send the PC to the memory that contains the code and fetch the instruction from that memory.

2. Read one or two registers, using the fields of the instruction to select the registers.

After these steps, the actions needed to complete the instruction depend on the instruction class. Even so, the actions for the memory-reference, arithmetic-logical and branch classes are largely the same.

All instruction classes except jump use the ALU.

A high-level view of a MIPS implementation, focusing on the various functional units and their interconnection, is shown below:

It shows that the value written to the PC can come from one of two adders, and that the data written into the register file can come from either the ALU or the data memory.

One line out of several is selected by a multiplexer, also called a data selector. The selection is made using control lines.



The data path of the basic MIPS implementation, with the required multiplexers, is shown below:

Three multiplexers and a few control lines are required.

A control unit takes the instruction as input and determines how to set the control lines for the functional units and two of the multiplexers.

The third multiplexer determines whether PC+4 or the branch destination address is written into the PC. It is set based on the Zero output of the ALU, which performs the comparison for the beq instruction.

This design approach is easy to understand but not practical, because it is slower than an implementation that allows different instruction classes to take different numbers of clock cycles.

There are two design concepts: the single cycle datapath and the multicycle datapath.

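In the single cycle concept, separate instruction and data memories are required, because every instruction executes in one clock cycle, so an element that is needed more than once by an instruction (here, memory for the instruction fetch and for the data access) must be duplicated.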

Logic design conventions:

The functional units of the MIPS implementation consist of two types of logic elements:

1. Combinational elements: They operate on data values, and their outputs depend only on the current inputs. They have no storage. The adders, ALU and multiplexers are examples.

2. State elements: They contain state, i.e. some internal storage, and preserve the values stored in them. The instruction memory, data memory and registers are examples.

A state element has at least two inputs and one output. The required inputs are the data value to be written and the clock, which determines when the data value is written. The simplest state element is a D flip-flop. The clock controls only writes; the output of a state element can be read at any time.

A state element is also called a sequential element, because its output depends on both its inputs and its internal state.

Clocking Methodology:

It defines when signals can be read and when they can be written.

A simple methodology is edge triggering: any value stored in a sequential logic element is updated only on a clock edge.

Consider two state elements surrounding a block of combinational logic that operates in a single clock cycle:

All signals must propagate from state element 1, through the combinational logic, to state element 2 within one clock cycle. Edge triggering makes this possible: state element 1 is read after one clock edge, and the result is written into state element 2 on the next clock edge.

Edge triggering may be positive-edge (+ve) or negative-edge (-ve).
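The edge-triggered behaviour described above can be sketched in Python. This is only an illustrative model of a positive-edge-triggered D flip-flop; the class name DFlipFlop and the method tick are hypothetical names chosen for this sketch, not part of the notes.

```python
class DFlipFlop:
    """Positive-edge-triggered state element (illustrative model only)."""

    def __init__(self, initial=0):
        self.q = initial   # stored value; it can be read at any time
        self._clk = 0      # previous clock level, used to detect an edge

    def tick(self, clk, d):
        # The stored value changes only on a 0 -> 1 clock transition.
        if self._clk == 0 and clk == 1:
            self.q = d
        self._clk = clk
        return self.q


ff = DFlipFlop()
ff.tick(0, 7)   # clock low: the input is ignored
ff.tick(1, 7)   # rising edge: 7 is captured
ff.tick(1, 9)   # clock still high: 9 is ignored
print(ff.q)     # 7
```

Between two rising edges the output stays stable, which gives the combinational logic a full clock cycle to settle before its result is captured by the next state element.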

Building data path:

Start with the major components needed to fetch instructions: two state elements (the instruction memory and the PC) and an adder, as shown below:

These elements are combined into the portion of the data path that fetches instructions and increments the PC to point to the next instruction.

Consider an R-type instruction. It uses the processor's 32 general-purpose 32-bit registers, which are stored in a structure called the register file.

R-type instructions have three register operands: two are read and one is written.

For a read: an input to the register file specifies the register number to be read, and an output carries the value read from that register. Since two registers are read, two such inputs and two outputs are required.

To write data: one input specifies the register number to be written and another supplies the data to be written into the register.

The register number inputs are 5 bits wide, to specify one of 32 registers (32 = 2^5).

In total we need four inputs (three for register numbers and one for data) and two data outputs.
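The register file interface described above (two read ports, one write port, 5-bit register numbers) can be sketched as a small Python class. The class and parameter names are hypothetical. The sketch also assumes two things the notes do not state: the write is gated by a RegWrite control signal, and register 0 always reads as zero, as in standard MIPS.

```python
class RegisterFile:
    """32 registers of 32 bits with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # Two 5-bit register numbers in, two 32-bit values out.
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def write(self, write_reg, write_data, reg_write):
        # Assumption: the write happens only when RegWrite is asserted,
        # and register 0 is hard-wired to zero as in standard MIPS.
        if reg_write and (write_reg & 0x1F) != 0:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 5, reg_write=1)    # written
rf.write(10, 7, reg_write=0)   # ignored, RegWrite is 0
print(rf.read(9, 10))          # (5, 0)
```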



The elements required for R-type instructions are:

The ALU takes two 32-bit inputs and 4 control lines, and produces a 32-bit result and a 1-bit Zero signal that indicates a zero result.

Consider the memory-reference instructions:

lw $s1, offset_value($s2) and sw $s1, offset_value($s2)

Both require a sign-extension unit to extend the 16-bit offset_value to 32 bits, an ALU operation, and the data memory. The two additional elements are shown below:

Consider the branch instruction beq. It has three operands: two registers that are compared for equality and a 16-bit offset used to calculate the target address.

To implement this instruction, the branch target address is computed by adding the sign-extended 32-bit offset field to PC + 4. Before the addition, the offset field is shifted left by 2 bits to turn the word offset into a byte offset.

If the condition is true the branch is taken; otherwise it is not taken and execution continues at PC + 4.

The structure of the data path that handles the branch instruction is:

To compute the branch target address and test the condition, the branch datapath includes a sign-extension unit, a shift-left-by-2 unit, an adder, and the ALU, which compares the two register file operands.

The ALU provides an output signal that indicates whether the result is 0. If the two operands are equal, the Zero output is 1; otherwise it is 0.
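The branch computation described above can be sketched in Python as follows. The function names are hypothetical, and the addresses in the example are made up for illustration.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """Target = (PC + 4) + (sign-extended offset shifted left by 2)."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def beq_next_pc(pc, rs_value, rt_value, offset16):
    # The ALU subtracts the operands; Zero is 1 when they are equal.
    zero = int(((rs_value - rt_value) & 0xFFFFFFFF) == 0)
    return branch_target(pc, offset16) if zero else (pc + 4) & 0xFFFFFFFF


print(hex(beq_next_pc(0x1000, 5, 5, 3)))   # 0x1010, branch taken
print(hex(beq_next_pc(0x1000, 5, 6, 3)))   # 0x1004, branch not taken
```

An offset of 0xFFFF is -1 after sign extension, so the target is PC + 4 - 4, i.e. the branch instruction itself.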

Jump instruction operates by replacing the lower 28 bits of the PC with the lower 26 bits of the

instruction shifted left by 2 bits. This unit is not shown here.

Creating single data path:

Combine the datapath components of the individual instruction classes into a single datapath and add the control to complete the implementation.

The simplest approach is to execute every instruction in one clock cycle. Then no element can be used more than once per instruction, so any element needed more than once must be duplicated.

The memory-reference and arithmetic-logical instructions use the datapath in similar ways, with two differences:

1. Memory instructions use the ALU for address calculation, with the second input being the sign-extended 16-bit offset field from the instruction. Arithmetic-logical instructions use the ALU with both inputs coming from registers.

2. For memory instructions the ALU result is always the address for the data memory, whereas for arithmetic-logical instructions it is always the value written to a register.



The simple data path of the MIPS architecture for the three instruction classes is shown below:

For arithmetic-logical instructions both ALU inputs come from registers, while memory-reference instructions use the ALU for address calculation. So the second ALU input is selected by a MUX from either a register or the sign-extended 16-bit offset field of the instruction. The control signal of this MUX is ALUSrc.

The value stored in the destination register (write data) comes either from the ALU result (for an R-type instruction) or from the data memory (for a load instruction). It is selected by another MUX, whose control signal is MemtoReg.

An additional MUX selects whether the address of the next sequential instruction, PC+4, or the branch target address is written to the PC. Its control line is PCSrc.
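The three multiplexers described above can be sketched with one small helper. The helper name mux2 and all data values are made up for illustration; only the control signal names ALUSrc, MemtoReg and PCSrc come from the notes.

```python
def mux2(select, in0, in1):
    """Two-way multiplexer: output in0 when select is 0, in1 when it is 1."""
    return in1 if select else in0


# Example values on the datapath wires (made up for illustration).
read_data2, sign_ext_offset = 40, 8
alu_result, mem_data = 48, 99
pc_plus_4, branch_target = 0x1004, 0x1024

# Settings for a load word instruction.
alu_input_b = mux2(1, read_data2, sign_ext_offset)   # ALUSrc = 1: use the offset
write_data = mux2(1, alu_result, mem_data)           # MemtoReg = 1: use memory data
next_pc = mux2(0, pc_plus_4, branch_target)          # PCSrc = 0: use PC + 4

print(alu_input_b, write_data, hex(next_pc))         # 8 99 0x1004
```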



Simple Implementation:

To add a simple control function to the datapath, consider the instructions lw, sw, beq, add, sub, and, or and slt.

ALU control:

The ALU has 4 control lines, so 16 ALU functions are possible. Only 6 of them are used here; they are shown in the following table:

For the three instruction classes the ALU needs to perform the first five functions. For memory-reference instructions the ALU computes the memory address by addition. For arithmetic-logical instructions the ALU performs one of the five functions, depending on the value of the 6-bit funct field in the low-order bits of the instruction. For branch equal the ALU must perform subtraction.
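A minimal Python sketch of such an ALU is given below. The 4-bit encodings (0000 AND, 0001 OR, 0010 add, 0110 subtract, 0111 set on less than, 1100 NOR) are an assumption taken from the standard textbook treatment of this datapath, since the table itself did not survive in these notes. The function names are hypothetical.

```python
MASK = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement number."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for a 4-bit ALU control value (assumed encodings)."""
    if control == 0b0000:
        result = a & b                               # AND
    elif control == 0b0001:
        result = a | b                               # OR
    elif control == 0b0010:
        result = a + b                               # add
    elif control == 0b0110:
        result = a - b                               # subtract
    elif control == 0b0111:
        result = int(to_signed(a) < to_signed(b))    # set on less than
    elif control == 0b1100:
        result = ~(a | b)                            # NOR
    else:
        raise ValueError("unused ALU control code")
    result &= MASK
    return result, int(result == 0)


print(alu(0b0010, 2, 3))   # (5, 0)
print(alu(0b0110, 7, 7))   # (0, 1): Zero is set, as beq relies on
```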

The 4-bit control input of the ALU can be generated by a small control unit that has two inputs: a 2-bit control field called ALUOp (ALU operation) and the 6-bit funct field of the instruction.

The following table shows how the 4-bit ALU control lines are related to the 2-bit ALUOp and the 6-bit funct field of the instruction:

The table shows multilevel decoding. There are 8 input bits that generate 4 output bits. Because only a few of the input combinations are of interest, the inputs that do not affect the output can be replaced by don't-care (X) terms when the logic is optimized. The truth table for the ALU control inputs is then as shown below:
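The two-level decoding described above can be sketched in Python. The ALUOp values (00 for lw/sw, 01 for beq, 10 for R-type) match the control truth table later in these notes; the funct codes and the 4-bit ALU control outputs are assumptions based on the standard MIPS encoding, because the tables are missing here. The function name is hypothetical.

```python
def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to 4 ALU control bits."""
    if alu_op == 0b00:          # lw / sw: address calculation
        return 0b0010           # add
    if alu_op == 0b01:          # beq: compare by subtracting
        return 0b0110           # subtract
    if alu_op == 0b10:          # R-type: decode the funct field
        # The two upper funct bits are don't-cares for this subset,
        # so only the low four bits are examined.
        by_funct = {
            0b0000: 0b0010,     # add (funct 100000)
            0b0010: 0b0110,     # sub (funct 100010)
            0b0100: 0b0000,     # and (funct 100100)
            0b0101: 0b0001,     # or  (funct 100101)
            0b1010: 0b0111,     # slt (funct 101010)
        }
        return by_funct[funct & 0xF]
    raise ValueError("ALUOp = 11 is not used")


print(format(alu_control(0b10, 0b101010), "04b"))   # 0111
```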



Designing of main control unit:

Consider the instruction formats of R-type, memory-reference and branch instructions shown below:

The major observations about these instructions are:

1. The opcode field op is always in bits 31:26 (6 bits), usually referred to as Op[5:0]. This is common to all three instruction classes.

2. The two registers to be read are always specified by the rs and rt fields, at positions 25:21 and 20:16 respectively. This holds for R-type, branch equal and store instructions.

3. The base register for load and store instructions is always in bits 25:21 (rs).

4. The 16-bit offset for branch equal, load and store instructions is always in bits 15:0.

5. The destination register is in one of two places. For a load it is in bits 20:16 (rt), while for an R-type instruction it is in bits 15:11 (rd). A MUX is therefore needed to select which field of the instruction supplies the number of the register to be written.
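The field positions listed above can be extracted with shifts and masks, as in this Python sketch. The function names are hypothetical. The example encodings assume the usual MIPS register numbering ($t1 = 9, $t2 = 10, $t3 = 11), which is not stated in the notes.

```python
def decode_fields(instr):
    """Split a 32-bit MIPS instruction word into the fields listed above."""
    return {
        "op": (instr >> 26) & 0x3F,      # bits 31:26
        "rs": (instr >> 21) & 0x1F,      # bits 25:21
        "rt": (instr >> 16) & 0x1F,      # bits 20:16
        "rd": (instr >> 11) & 0x1F,      # bits 15:11
        "funct": instr & 0x3F,           # bits 5:0
        "offset": instr & 0xFFFF,        # bits 15:0
    }


def write_register(instr, reg_dst):
    """The destination-register MUX: rt when RegDst is 0, rd when it is 1."""
    fields = decode_fields(instr)
    return fields["rd"] if reg_dst else fields["rt"]


ADD_T1_T2_T3 = 0x014B4820   # add $t1, $t2, $t3
LW_T1_8_T2 = 0x8D490008     # lw $t1, 8($t2)

print(write_register(ADD_T1_T2_T3, 1))   # 9, taken from rd
print(write_register(LW_T1_8_T2, 0))     # 9, taken from rt
```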

The control unit implementation along with datapath is shown below:



This implementation has seven 1-bit control lines and a 2-bit ALUOp control signal.

The functions of the seven control lines are:

1. RegDst: If 0, the destination register number comes from the rt field (bits 20:16); if 1, from the rd field (bits 15:11).

2. RegWrite: If 1, the register selected by the write register input is written with the value on the write data input; if 0, nothing happens.

3. ALUSrc: If 0, the second ALU operand comes from the second register file output (Read data 2); if 1, from the sign-extended lower 16 bits of the instruction.

4. PCSrc: If 0, the PC is replaced by the output of the adder that computes PC+4; if 1, by the output of the adder that computes the branch target.

5. MemRead: If 1, the data memory contents at the address input are put on the read data output; if 0, nothing happens.

6. MemWrite: If 1, the data memory contents at the address input are replaced by the value on the write data input; if 0, nothing happens.

7. MemtoReg: If 0, the value fed to the register write data input comes from the ALU; if 1, from the data memory.

The simple data path design with control unit is shown below:

The control unit generates nine control signals (counting the two ALUOp bits) from the instruction opcode. PCSrc is not generated directly by the control unit: it should be 1 only if the instruction is branch equal and the Zero output of the ALU is asserted. So the Branch signal from the control unit is ANDed with the Zero signal from the ALU to produce PCSrc.



The setting of the control lines, as determined by the opcode field of the instruction, is shown below:

The operation of the datapath for an R-type instruction such as add $t1, $t2, $t3 is shown in the following figure.

Everything occurs in one clock cycle, but the flow of the instruction can be viewed as four steps:

1. The instruction is fetched and the PC is incremented.

2. Two registers, $t2 and $t3, are read from the register file, and the main control unit computes the setting of the control lines during this step.

3. The ALU operates on the data read from the register file, using the function code (bits 5:0 of the instruction) to select the ALU function.

4. The result from the ALU is written into the register file, using bits 15:11 of the instruction to select the destination register $t1.



The execution of the load instruction lw $t1, offset($t2) is illustrated below:

The five steps for the load instruction are:

1. The instruction is fetched from the instruction memory and the PC is incremented.

2. The value of register $t2 is read from the register file.

3. The ALU computes the sum of the value read from the register file and the sign-extended lower 16 bits of the instruction (the offset).

4. The sum from the ALU is used as the address for the data memory.

5. The data from the memory unit is written into the register file; the destination register ($t1) is given by bits 20:16 of the instruction.



The execution of the branch instruction beq $t1, $t2, offset is illustrated below:

The four steps in the execution of the branch instruction are:

1. The instruction is fetched from the instruction memory and the PC is incremented.

2. Two registers, $t1 and $t2, are read from the register file.

3. The ALU subtracts the values read from the register file. The value of PC + 4 is added to the sign-extended lower 16 bits of the instruction (the offset) shifted left by two; the result is the branch target address.

4. The Zero result from the ALU is used to decide which adder result to store into the PC.



Finalizing the control:

The input signals of the control unit and the corresponding outputs are shown in the following truth table (inputs are the opcode bits Op5 to Op0; X is a don't care):

Instruction  Op5 Op4 Op3 Op2 Op1 Op0 | RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
R-type (0)    0   0   0   0   0   0  |   1      0       0        1        0       0       0      1      0
lw (35)       1   0   0   0   1   1  |   0      1       1        1        1       0       0      0      0
sw (43)       1   0   1   0   1   1  |   X      1       X        0        0       1       0      0      0
beq (4)       0   0   0   1   0   0  |   X      0       X        0        0       0       1      0      1
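The truth table above can be written as a lookup keyed by opcode, as in this Python sketch. The names CONTROL, main_control and pc_src are hypothetical, and don't-care entries are represented by None. The last function shows how PCSrc is formed by ANDing Branch with the Zero output of the ALU.

```python
# One row per opcode; None stands for a don't-care (X) output.
CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),         # R-type
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),         # lw
    43: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),         # sw
    4:  dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),         # beq
}


def main_control(opcode):
    """Return the control line settings for a 6-bit opcode."""
    return CONTROL[opcode]


def pc_src(opcode, zero):
    """PCSrc = Branch AND Zero."""
    return main_control(opcode)["Branch"] & zero


print(main_control(35)["MemtoReg"])   # 1
print(pc_src(4, 1), pc_src(4, 0))     # 1 0
```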



Implementing jump instruction:

The jump instruction is similar to a branch, but it computes the target address differently and is not conditional.

As with a branch, the two low-order bits of the jump address are always 00 (binary), because the word address is multiplied by 4.

The next 26 bits of this 32-bit address come from the 26-bit immediate field of the instruction, as shown below:

The upper 4 bits of the address come from the upper 4 bits of PC + 4.

The jump can therefore be implemented by loading the PC with the concatenation of:

1) The upper 4 bits of the current PC+4 (bits 31:28 of the address of the sequentially following instruction).

2) The 26-bit immediate field of the jump instruction.

3) The bits 00 (binary).
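The concatenation described above can be sketched in Python. The function name jump_target is hypothetical, and the addresses in the example are made up.

```python
def jump_target(pc, instr):
    """Build the jump address from PC + 4 and the 26-bit immediate field."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper4 = pc_plus_4 & 0xF0000000      # bits 31:28 of PC + 4
    imm26 = instr & 0x03FFFFFF           # bits 25:0 of the instruction
    return upper4 | (imm26 << 2)         # the two low-order bits become 00


# A jump (opcode 2) whose immediate field is 0x10, fetched from 0x40000000.
print(hex(jump_target(0x40000000, 0x08000010)))   # 0x40000040
```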

The addition of the control for jump instruction and multiplexer for selecting jump address,

PC+4 or branch target is shown below:



Advantages and disadvantages of single cycle implementation:

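The only advantage of the single cycle implementation is its simplicity.

The disadvantage is its inefficiency and low speed. CPI is 1, but the clock cycle must have the same length for every instruction, so it is determined by the longest path in the processor (the load instruction), even though most instructions could finish sooner.

Because of this inefficiency, the single cycle implementation is not used nowadays.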

Performance of single cycle implementation:

Assume that the operation times of the major functional units in the single cycle implementation are:

a) Memory units: 200ps

b) ALU and adders: 100ps

c) Register file: 50ps

Assume that the MUXes, control unit, PC, sign-extension unit and wires have no delay.

Consider two implementations: in one, every instruction executes in 1 clock cycle of fixed length; in the other, every instruction executes in 1 clock cycle whose length varies according to the needs of the instruction.

To compare the performance, assume an instruction mix of 25% loads, 10% stores, 45% ALU instructions, 15% branches and 5% jumps.

We know that CPU execution time = IC x CPI x Clock cycle time. With CPI = 1, CPU execution time = IC x Clock cycle time.

Since IC and CPI are the same in both cases, we only have to find the clock cycle time for each case.

The critical path for each instruction class is shown below:

Using these critical paths, the required length for each instruction class is: R-type 400ps, load word 600ps, store word 550ps, branch 350ps and jump 200ps.

The clock cycle of a machine with a single fixed clock is determined by the longest instruction, which is load word at 600ps.

A machine with a variable clock has a clock cycle that varies between 200ps and 600ps. Its average clock cycle is

CPU clock cycle = 400 x 0.45 + 600 x 0.25 + 550 x 0.1 + 350 x 0.15 + 200 x 0.05 = 447.5ps

The CPU performance ratio is then

CPU performance (variable clock) / CPU performance (single clock) = CPU execution time (single clock) / CPU execution time (variable clock) = (IC x 600) / (IC x 447.5) = 1.34

This shows that the variable clock implementation is 1.34 times faster than the single clock implementation.
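The arithmetic above can be reproduced with a short Python sketch. The unit delays and the instruction mix come from the notes; the list of units on each critical path is an assumption chosen to reproduce the lengths used in the average (instruction fetch, register read, ALU, data access, register write), and all names are hypothetical.

```python
UNIT_PS = {"mem": 200, "alu": 100, "reg": 50}   # delays from the notes

# Assumed functional units on the critical path of each class.
PATHS = {
    "alu":    ["mem", "reg", "alu", "reg"],           # fetch, read, ALU, write
    "load":   ["mem", "reg", "alu", "mem", "reg"],    # plus data access
    "store":  ["mem", "reg", "alu", "mem"],
    "branch": ["mem", "reg", "alu"],
    "jump":   ["mem"],
}

# Instruction mix in percent, from the notes.
MIX = {"alu": 45, "load": 25, "store": 10, "branch": 15, "jump": 5}

length = {cls: sum(UNIT_PS[u] for u in units) for cls, units in PATHS.items()}
fixed_clock = max(length.values())                           # single fixed clock
variable_clock = sum(MIX[c] * length[c] for c in MIX) / 100  # average cycle
speedup = fixed_clock / variable_clock

print(length)
print(fixed_clock, variable_clock, round(speedup, 2))   # 600 447.5 1.34
```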

A variable clock machine is very difficult to implement, and the variable clock adds overhead during execution.

A single cycle implementation with a fixed clock length is acceptable only for a small instruction set.

In the single cycle implementation each functional unit can be used only once per clock cycle, so some units must be duplicated, which raises the cost. The design is therefore inefficient in both performance and hardware cost.

Multi cycle implementation:

The drawbacks of the single cycle implementation can be overcome by this method.

It allows a functional unit to be shared instead of duplicated, as long as the unit is used on different clock cycles. Sharing reduces the amount of hardware required.

The major advantages of this method are that instructions can take different numbers of clock cycles and that functional units can be shared within the execution of a single instruction.

The high-level view of the multicycle datapath is shown below:

The main differences from the single cycle implementation are:

1. A single memory unit is used for both instructions and data.

2. There is a single ALU, rather than an ALU and two adders.

3. One or more registers are added after every major functional unit to hold the output of that unit until the value is used in a subsequent clock cycle.

At the end of a clock cycle, all data that is used by subsequent instructions in a later clock cycle must be stored in a programmer-visible state element: the register file, the PC or the memory.

The data used by the same instruction in a later cycle must be stored in one of the additional registers.

In this design a clock cycle can accommodate at most one of the following operations: a memory access, a register file access or an ALU operation. So the data produced by each of these functional units must be saved in a temporary register for use in a later cycle.

The temporary registers and their purposes are:

1. Instruction Register (IR) and Memory Data Register (MDR): hold the output of the memory for an instruction read and a data read, respectively.

2. A and B registers: hold the register operand values read from the register file.

3. ALUOut register: holds the output of the ALU.

All of these registers except the IR hold data only between a pair of adjacent clock cycles and so do not need a write control signal. The IR must hold the instruction until the end of its execution and therefore requires a write control signal.

To share functional units for different purposes, we need to add MUXes and expand the existing ones. Since one memory is used for both instructions and data, a MUX is required to select the memory address from two sources: the PC (for instruction access) and ALUOut (for data access).

The ALU and two adders of the single cycle implementation are replaced by a single ALU, so additional multiplexers are required at the two ALU inputs. A MUX on the first ALU input chooses between the A register and the PC. The MUX on the second input becomes a 4-way MUX that chooses among the B register, the constant 4 (to increment the PC), the sign-extended offset, and the sign-extended offset shifted left by 2 (the last is used for branch address computation).

The details of the datapath with the additional MUXes are shown below:

Since the datapath takes multiple clock cycles per instruction, it requires a different set of control signals. The programmer-visible state units (PC, memory and registers) and the IR all need write control signals.

The memory also needs a read control signal.

The ALU needs control signals similar to those of the single cycle implementation.

Each multiplexer also needs control lines.

The datapath with control lines is shown below:

With jump and branch instructions, there are three possible sources for the value to be written into the PC:

1. The output of the ALU, which is PC+4 during instruction fetch.

2. The register ALUOut, which holds the branch target address after it is computed.

3. The lower 26 bits of the IR shifted left by 2 and concatenated with the upper 4 bits of PC+4, which is the source when the instruction is a jump.

The PC is written both unconditionally and conditionally. During a normal increment and for jumps, the PC is written unconditionally. If the instruction is a conditional branch, the PC is replaced by ALUOut only if the two designated registers are equal. So two separate PC write control signals are required: PCWrite, which causes an unconditional write of the PC, and PCWriteCond, which causes a write of the PC if the branch condition is true.
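The PC update logic described above can be sketched in Python. The PCSource encodings (00 for the ALU output, 01 for ALUOut, 10 for the jump address) are the ones used in the execution steps later in these notes; the function names and the example addresses are hypothetical.

```python
def pc_write_enable(pc_write, pc_write_cond, zero):
    """The PC is written if PCWrite is set, or if PCWriteCond and Zero are both set."""
    return pc_write | (pc_write_cond & zero)


def pc_input(pc_source, alu_result, alu_out, jump_address):
    """Three-way MUX that selects the value offered to the PC."""
    return {0b00: alu_result, 0b01: alu_out, 0b10: jump_address}[pc_source]


def update_pc(pc, pc_write, pc_write_cond, zero, pc_source,
              alu_result, alu_out, jump_address):
    if pc_write_enable(pc_write, pc_write_cond, zero):
        return pc_input(pc_source, alu_result, alu_out, jump_address)
    return pc   # no write: the PC keeps its value


# Branch completion with equal registers: ALUOut (the target) is written.
print(hex(update_pc(0x1004, 0, 1, 1, 0b01, 0, 0x1010, 0)))   # 0x1010
# Branch completion with unequal registers: the PC is left unchanged.
print(hex(update_pc(0x1004, 0, 1, 0, 0b01, 1, 0x1010, 0)))   # 0x1004
```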



The multicycle datapath and control unit including additional control signals and multiplexer for

implementing PC updating is shown below:

The functions of 1-bit control lines are:



The functions of 2-bit control lines are:

Fetch, Decode, Execute and Memory Access Cycles:

Breaking the execution of an instruction into multiple clock cycles should improve the performance of the system.

Instruction execution is broken into a series of steps, each taking one clock cycle. Each step is restricted to contain at most one ALU operation, or one register file access, or one memory access. With this restriction, the clock cycle can be as short as the longest of these operations.

A MIPS instruction takes three to five steps to execute in the multicycle implementation. They are:

1. Instruction fetch step:

Fetch the instruction from memory and compute the address of the next instruction.

Operation: Send the PC to the memory as the address, perform a memory read, and write the instruction into the IR, where it will be stored. Also increment the PC by 4.

To implement this step, assert MemRead and IRWrite (set to 1), set IorD to 0 to select the PC as the source of the address, set ALUSrcA to 0 to send the PC to the ALU, set ALUSrcB to 01 to send 4 to the ALU, and set ALUOp to 00 to make the ALU add. Also, to store the incremented instruction address back into the PC, set PCSource to 00 and assert PCWrite.

The increment of the PC and the instruction memory access occur in parallel, and the new value of the PC is not visible until the next clock cycle.

2. Instruction decode and register fetch step:

The instruction is decoded and the register operands are fetched in this step.

The branch target address is also computed with the ALU in this step, even though it is not yet known whether the instruction is a branch. The potential branch target is saved in ALUOut.

If the instruction has two register inputs, they are always in the rs and rt fields, and if the instruction is a branch, the offset is always in the low-order 16 bits. This is shown below:

Operation: Access the register file to read rs and rt and store the results in the A and B registers. Since A and B are overwritten on every cycle, the register file can be read on every cycle with the values stored into A and B. The same step also computes the branch target address and stores the result in ALUOut, where it will be used in the next cycle if the instruction is a branch.

The required control signals for this step are: set ALUSrcA to 0 to send the PC to the ALU, set ALUSrcB to 11 to send the sign-extended and shifted offset to the ALU, and set ALUOp to 00 to make the ALU add.

The register file access and the computation of the branch target address occur in parallel.

After this clock cycle, the action to take depends on the instruction.

3. Execution, memory address computation, or branch completion:

This is the first cycle in which the datapath operation is determined by the instruction class.

For memory reference:

Operation: The ALU adds the operands to form the memory address. Set ALUSrcA to 1 to send A to the first ALU input and set ALUSrcB to 10 to send the sign-extended offset to the second ALU input. ALUOp is set to 00 to make the ALU add.

For R-type instruction:

Operation: The ALU performs the operation specified by the funct field on the two values read from the register file in the previous cycle. Set ALUSrcA to 1 to send A to the first ALU input and set ALUSrcB to 00 to send B to the second ALU input. ALUOp is set to 10, so that the ALU control unit uses the funct field to generate the signals for the ALU operation.

For branch:

Operation: The ALU is used to compare the two registers read in the previous step. The Zero output of the ALU determines whether or not to branch. The required control signals are: set ALUSrcA to 1 and ALUSrcB to 00 to select the A and B registers as ALU inputs, and set ALUOp to 01 for equality testing (subtract). PCWriteCond must be asserted to update the PC if the Zero output of the ALU is asserted. PCSource is set to 01 so that the value written into the PC comes from ALUOut, which holds the target address. For a conditional branch that is taken, the PC is written twice: once with PC + 4 from the output of the ALU (the write asserted in the instruction fetch step) and once from ALUOut during the branch completion step. The last value written to the PC is used to fetch the next instruction.

For jump:

Operation: The PC is replaced by the jump address. PCSource is set to 10 to direct the jump address to the PC, and PCWrite is asserted to write the jump address into the PC.

4. Memory access or R-type instruction completion step:

During this step, a memory-reference instruction accesses memory and an R-type instruction writes its result.

When a value is retrieved from memory, it is stored in the MDR and used in the next clock cycle.

For memory reference:

Operation: For a load, a data word is retrieved from memory and written into the MDR. For a store, the data word is written into memory. In both cases the address used is the one computed during the previous step and stored in ALUOut. For a store, the source operand is in B. The signals are: MemRead (for a load) or MemWrite (for a store) is asserted. IorD is set to 1 to force the memory address to come from ALUOut.

For R-type instruction:

Operation: Place the contents of ALUOut into the result register. RegDst is set to 1 so that the rd field (bits 15:11) selects the register file entry to write. RegWrite is asserted and MemtoReg is set to 0 to write the ALUOut data into the register file.

5. Memory read completion step:

During this step, a load instruction completes by writing the data from memory back to the register file.

For load:

Operation: Write the load data, stored in the MDR during the previous cycle, into the register file. Set MemtoReg to 1 to write the result from memory, assert RegWrite, and set RegDst to 0 to choose the rt field (bits 20:16) as the register number.
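The number of clock cycles per instruction class follows from the steps above: branches and jumps complete in step 3, stores and R-type instructions in step 4, and loads in step 5. As a rough illustration, the Python sketch below computes the average CPI of the multicycle design. Reusing the instruction mix from the single cycle performance comparison is an assumption made only for this example, and the names are hypothetical.

```python
# Clock cycles per class, read off the five steps described above.
CYCLES = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}

# Assumed instruction mix in percent (borrowed from the single cycle example).
MIX = {"load": 25, "store": 10, "alu": 45, "branch": 15, "jump": 5}

cpi = sum(MIX[c] * CYCLES[c] for c in MIX) / 100
print(cpi)   # 4.05
```

The CPI is well above 1, but each cycle only has to cover one memory access, one register file access or one ALU operation, so the clock cycle is much shorter than in the single cycle design.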



Design of the Control Unit:

To design the control unit for the single cycle implementation, a truth table that specifies the setting of the control signals based on the instruction class is used.

For the multicycle datapath the control unit is more complex, because an instruction is executed as a series of steps.

Two different techniques are used to design the control unit of a multicycle implementation: one is based on finite state machines (hardwired control) and the other uses microprogramming. Either representation of the control can then be implemented using gates, ROMs, or PLAs.

The high-level view of the finite state machine control for the five steps of the multicycle implementation is shown below:

The first two states of the machine, in graphical representation, are shown below:

State 0 is instruction fetch; after it the FSM switches to state 1 for instruction decode/register fetch.

After state 1, the FSM switches to one of four states, depending on the instruction class.

For memory-reference instructions:



For R-type instructions:



For branch instructions:

For jump instructions:

All these states can be implemented by a control unit as shown below:

This FSM can be implemented with a temporary register that holds the current state and a block of combinational logic that determines both the datapath signals to be asserted and the next state.

The combinational control logic for this FSM can be implemented with either a ROM or a PLA.
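The structure described above, a state register plus a combinational next-state function, can be sketched in Python. The state names and the instruction class labels are hypothetical; only the sequence of steps is taken from the notes, and the datapath control outputs are left out to keep the sketch short.

```python
def next_state(state, op):
    """Combinational next-state logic (state names are illustrative)."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        return {"lw": "mem_addr", "sw": "mem_addr", "rtype": "execute",
                "beq": "branch_complete", "j": "jump_complete"}[op]
    if state == "mem_addr":
        return "mem_read" if op == "lw" else "mem_write"
    if state == "mem_read":
        return "write_back"
    if state == "execute":
        return "rtype_complete"
    # mem_write, write_back, rtype_complete, branch_complete and
    # jump_complete all return to instruction fetch.
    return "fetch"


def states_for(op):
    """Step the state register until the FSM returns to fetch."""
    state, trace = "fetch", ["fetch"]
    while True:
        state = next_state(state, op)
        if state == "fetch":
            return trace
        trace.append(state)


print(states_for("lw"))    # five states, one per clock cycle
print(states_for("beq"))   # ['fetch', 'decode', 'branch_complete']
```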

Microprogramming Control Design:

Microprogramming is a technique for designing complex control units.

It uses simple hardware that can be programmed to implement a more complex instruction set.

Enhancing Performance with Pipelining:

Pipelining is an implementation technique in which multiple instructions are overlapped in execution.

Consider a laundry system with non-pipelined and pipelined approaches.

In the non-pipelined approach, each load is washed, dried, folded and put away before the next load is started.

In the pipelined approach, as soon as the washer is finished with the first load and that load is placed in the dryer, the washer is loaded with the second load. When the first load is dry, it is passed to the folder, the wet load is moved to the dryer and the next load is put into the washer. Next the first load is put away, the second load is folded, the third load goes to the dryer and the fourth load goes into the washer.

These two approaches are shown below:

Pipelining is faster, and it applies to the execution of MIPS instructions because they classically take five steps:



Comparison between single cycle and pipelining implementation:

Let us consider the total time required by the different units to execute each instruction, as shown below:

Execution of instructions in the single cycle, non-pipelined implementation is shown below:

Execution of instructions in the pipelined implementation is shown below:

The time required to execute the first four instructions is 800 x 4 = 3200ps in the non-pipelined implementation, but only around 1500ps in the pipelined implementation.

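The speedup from pipelining depends on the number of pipeline stages. If the stages are perfectly balanced, then under ideal conditions

Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipeline stages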



Pipeline hazards: situations in which the next instruction cannot execute in the following clock cycle.

There are three types of pipeline hazards:

1. Structural hazards: occur when the hardware cannot support the combination of instructions that are set to execute in the same clock cycle.

2. Data hazards: occur when the pipeline must be stalled because one step must wait for another to complete, typically because an instruction depends on the result of an earlier instruction still in the pipeline.

3. Control hazards: arise from the need to make a decision based on the result of one instruction while others are executing.

Branch prediction, forwarding and stalls are used to deal with these hazards.

Pipelining increases the number of simultaneously executing instructions and the rate at which instructions are started and completed. It does not reduce the time taken to complete an individual instruction, which is called latency. Thus pipelining improves instruction throughput rather than individual instruction execution time or latency.
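The difference between latency and throughput can be illustrated with a simple timing model in Python. The 800 ps instruction time comes from the comparison above; the five stages with a 200 ps pipeline clock cycle are an assumption (the slowest stage sets the clock), and the function names are hypothetical. The model charges every stage a full cycle, so it gives 1600 ps for four instructions, close to the roughly 1500 ps quoted above.

```python
def nonpipelined_time(n, instr_time_ps=800):
    """Total time when each instruction finishes before the next one starts."""
    return n * instr_time_ps


def pipelined_time(n, stages=5, cycle_ps=200):
    """The first instruction needs `stages` cycles; each later instruction
    completes one cycle after the one before it (no hazards assumed)."""
    return (stages + n - 1) * cycle_ps


for n in (1, 4, 1_000_000):
    t_single, t_pipe = nonpipelined_time(n), pipelined_time(n)
    print(n, t_single, t_pipe, round(t_single / t_pipe, 2))
```

For a single instruction the pipelined time (1000 ps) is actually longer than the non-pipelined time (800 ps), so latency does not improve. For a long run of instructions the ratio approaches 800 / 200 = 4 rather than 5, because the stages in this model are not perfectly balanced.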

Pipelined Datapath:

Consider the single cycle datapath and divide it into five stages, one for each step of execution, forming a five-stage pipeline. This means that up to five instructions will be in execution during any single clock cycle.

The names of the stages are: 1) IF (Instruction Fetch) 2) ID (Instruction Decode/Register Fetch) 3) EX (Execute/Address Calculation) 4) MEM (Memory Access) 5) WB (Write Back)

Instructions and data generally move from left to right through the five stages as an instruction completes. There are two exceptions: 1) The WB stage places the result back into the register file, which is in the ID stage in the middle of the datapath. 2) The next value of the PC is chosen between PC + 4 and the branch address from the MEM stage.

The first exception can cause data hazards and the second can cause control hazards.

The execution of some instructions and their datapaths on a common time line is shown below:

Here each of the three instructions is shown with its own datapath, but the units are shared among the instructions. For example, the instruction memory (IM) is used in only one of the five stages of an instruction, so it can be used by other instructions during the other four stages.

Consider the pipelined datapath with the pipeline registers highlighted.



The pipeline registers are: IF/ID registers between IF and ID stages, ID/EX registers between

ID and EX stages, EX/MEM registers between EX and MEM stages and MEM/WB registers

between MEM and WB stages.

There is no pipeline register at the end of the WB stage. Every instruction updates some state in the register file, memory or PC, so a separate pipeline register would only duplicate state that has already been updated.

Pipelined datapath of load instruction

The active portions of the datapath as a load instruction goes through the first stage of pipelined execution are highlighted below:

Instruction fetch: The instruction is read from memory using the address in the PC and then placed in the IF/ID register. The IF/ID register is similar to the IR. The PC address is incremented by 4 and written back to the PC for the next instruction. This incremented address is also saved in the IF/ID register in case an instruction such as beq needs it later.

The active portions of the datapath as a load instruction goes through the second stage of pipelined execution are highlighted below:



Instruction decode and register file read: The IF/ID pipeline register supplies the 16-bit offset and the register numbers used to read the two registers. The sign-extended 32-bit offset, the two register values and the incremented PC value are stored in the ID/EX pipeline register.

The active portions of the datapath as a load instruction goes through the third stage of pipelined execution are highlighted below:


Execute and address calculation: The load instruction reads the contents of register 1 and the sign-extended 32-bit offset from the ID/EX pipeline register and adds them using the ALU. The result is placed in the EX/MEM pipeline register.

The active portions of the datapath as a load instruction goes through the fourth stage of pipelined execution are highlighted below:




Memory access: The load instruction reads the data memory using the address from the EX/MEM pipeline register and loads the data into the MEM/WB pipeline register.

The active portions of the datapath as a load instruction goes through the fifth stage of pipelined execution are highlighted below:

Write back: This is the final step, which reads the data from the MEM/WB register and writes it into the register file in the middle of the datapath.

The above steps show that any information needed in a later stage must be passed to it via a pipeline register.
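The five steps above can be traced in a few lines of Python. This is a minimal sketch, not the hardware: the pipeline registers are modelled as dictionaries, the instruction is assumed to be already split into fields, and all field names (rs, rt, offset, write_reg and so on) are illustrative. The destination register number is carried along with the data, as the corrected load datapath later in these notes requires.

```python
def run_load(instr, regs, data_mem, pc):
    """Trace one lw through IF, ID, EX, MEM and WB.

    instr is a hypothetical pre-decoded form: {"rs": base, "rt": dest, "offset": n}.
    Returns the contents of the four pipeline registers in order.
    """
    # IF: fetch the instruction and save PC + 4 alongside it.
    if_id = {"instr": instr, "pc_plus_4": pc + 4}

    # ID: read the base register; the offset is already a signed Python int,
    # which stands in for the sign-extended 32-bit value.
    fields = if_id["instr"]
    id_ex = {
        "read_data_1": regs[fields["rs"]],
        "offset": fields["offset"],
        "write_reg": fields["rt"],          # destination travels with the instruction
        "pc_plus_4": if_id["pc_plus_4"],
    }

    # EX: the ALU adds base and offset to form the memory address.
    ex_mem = {
        "alu_result": id_ex["read_data_1"] + id_ex["offset"],
        "write_reg": id_ex["write_reg"],
    }

    # MEM: read data memory at the computed address.
    mem_wb = {
        "read_data": data_mem[ex_mem["alu_result"]],
        "write_reg": ex_mem["write_reg"],
    }

    # WB: write the loaded value into the register file.
    regs[mem_wb["write_reg"]] = mem_wb["read_data"]
    return [if_id, id_ex, ex_mem, mem_wb]


registers = {1: 100, 2: 0}
memory = {108: 42}
trace = run_load({"rs": 1, "rt": 2, "offset": 8}, registers, memory, pc=0)
print(registers[2])
```

Each stage reads only from the pipeline register written by the stage before it, which is the point the notes make.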

Pipelined datapath of store instruction

The first two stages are the same as for the load instruction. The other stages are shown below:



Here also information is passed to the next stage via pipeline registers.



Corrected datapath of load instruction:

In the final stage of a load instruction, the write register number is required. By then the IF/ID register holds a later instruction, so the write register number it supplies belongs to that instruction and not to the load. The write register number of the load must therefore be preserved until the last stage by carrying it along in the pipeline registers.

The corrected datapath, which passes the write register number first to ID/EX, then to EX/MEM and finally to MEM/WB, is shown below:



Graphical representation of pipelines:

To understand pipelines better, consider two kinds of diagram: the multiple-clock-cycle pipeline diagram and the single-clock-cycle pipeline diagram.

The multiple-clock-cycle diagram of a five-instruction sequence is shown below:

In this diagram time advances from left to right and instructions advance from top to bottom.

Another version of the multiple-clock-cycle pipeline diagram is shown below:



The single clock cycle diagram is shown below:

It is a vertical slice through the multiple-clock-cycle diagram, showing the state of the whole pipeline during one clock cycle.
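Both kinds of diagram can be generated from one rule: in an ideal pipeline, instruction i (counting from 0) is in stage number c - 1 - i during clock cycle c. The sketch below assumes no stalls, and the five instruction labels are placeholders because the original sequence is not reproduced in these notes.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in clock cycle `cycle`
    (1-based), or None if the instruction is not in the pipeline then."""
    k = cycle - 1 - instr_index
    return STAGES[k] if 0 <= k < len(STAGES) else None


def multi_cycle_diagram(instrs):
    """One row per instruction, one column per clock cycle (time runs left to right)."""
    n_cycles = len(instrs) + len(STAGES) - 1
    return [(name, [stage_at(i, c) or "" for c in range(1, n_cycles + 1)])
            for i, name in enumerate(instrs)]


def single_cycle_slice(instrs, cycle):
    """Vertical slice: which instruction is in which stage during one cycle."""
    return {stage_at(i, cycle): name
            for i, name in enumerate(instrs) if stage_at(i, cycle)}


# Placeholder labels; any five-instruction sequence would do.
program = ["instr1", "instr2", "instr3", "instr4", "instr5"]
for name, cells in multi_cycle_diagram(program):
    print(name.ljust(8), " ".join(cell.ljust(3) for cell in cells))
print(single_cycle_slice(program, 5))
```

In clock cycle 5 all five stages are occupied, which is why that cycle is the usual choice for the single-clock-cycle diagram.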

Pipelined Control:

The control lines introduced in the pipelined datapath are shown below:

All control lines are the same as in the single-cycle implementation without pipelining.

There are no separate write control signals for the pipeline registers, because they are written during every clock cycle.

The control lines are grouped into five sets according to the pipeline stages.



1. IF Control signals:- The signals to read IM and write PC are always asserted, so there is nothing special to control in this pipeline stage.

2. ID Control signals:- As in the previous stage, the same thing happens in every clock cycle, so there are no optional control signals to set.

3. EX Control signals:- The signals are: RegDst, which selects the result register; ALUOp, which selects the ALU operation; and ALUSrc, which selects the second ALU input from either Read data 2 or the sign-extended offset.

4. MEM Control signals:- The signals are: Branch for a branch instruction (beq), MemRead for a load instruction and MemWrite for a store instruction. PCSrc is asserted to select the branch target address when Branch is set and the ALU Zero output is 1.

5. WB Control signals:- The signals are: MemtoReg, which selects either the ALU result or the memory read data as the value sent to the register file, and RegWrite, which writes that value to the register.

Pipelining does not change the functions of the control lines; they are only grouped by pipeline stage.
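The grouping can be written down as a table. The sketch below uses the usual textbook settings for the four instruction classes; the table itself is not in these notes, so treat the values and the helper names as assumptions. "X" marks a don't-care. The helper shows how each group is dropped once its stage has used it, leaving only the WB signals in MEM/WB.

```python
# Control settings grouped by the stage that uses them ("X" = don't care).
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": "10", "ALUSrc": 0},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp": "00", "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": "X", "ALUOp": "00", "ALUSrc": 1},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": "X"}},
    "beq":      {"EX": {"RegDst": "X", "ALUOp": "01", "ALUSrc": 0},
                 "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": "X"}},
}


def advance(control):
    """Control bits held in ID/EX, EX/MEM and MEM/WB as an instruction moves on.
    Each stage consumes its own group and passes the rest forward."""
    id_ex = {stage: control[stage] for stage in ("EX", "MEM", "WB")}
    ex_mem = {stage: id_ex[stage] for stage in ("MEM", "WB")}
    mem_wb = {"WB": ex_mem["WB"]}
    return id_ex, ex_mem, mem_wb


def pc_src(branch, zero):
    """PCSrc selects the branch target only for a branch whose comparison succeeded."""
    return branch & zero


id_ex, ex_mem, mem_wb = advance(CONTROL["lw"])
print(mem_wb)
```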

The full datapath with pipeline registers and control lines is shown below:



Data Hazards and Forwarding:

Consider the following instruction sequence, which contains data dependences.

The last four instructions are all dependent on the result in register $2 of the first instruction.

The execution of these instructions in the pipeline is shown below:

This shows that the value of register $2 changes from 10 to -20 in the middle of clock cycle CC5, when the first instruction writes its result. So the add and sw instructions get the correct value -20, but the and and or instructions get the wrong value 10.
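The same conclusion follows from counting cycles. The sketch below assumes a five-stage pipeline without forwarding, in which instruction i (counting from 0) reads its registers in cycle i + 2 and writes back in cycle i + 5, and a register file that is written in the first half of a cycle and read in the second half. The instruction sequence is the usual textbook one that matches the description; it is not reproduced in these notes, so it is an assumption here.

```python
def reads_stale_value(producer_index, consumer_index):
    """True if the consumer reads the register file before the producer's write.

    Assumes no forwarding, ID in cycle index + 2, WB in cycle index + 5, and a
    register file written in the first half of a cycle and read in the second,
    so a read in the same cycle as the write already sees the new value.
    """
    read_cycle = consumer_index + 2
    write_cycle = producer_index + 5
    return read_cycle < write_cycle


# Assumed sequence: every later instruction reads $2, written by the first one.
program = [
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or  $13, $6, $2",
    "add $14, $2, $2",
    "sw  $15, 100($2)",
]

stale = [program[j] for j in range(1, len(program)) if reads_stale_value(0, j)]
print(stale)
```

Only the and and or instructions are listed, matching the two instructions that read the old value 10.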

Looking carefully at the execution of the first instruction, its result is available at the end of the EX stage, i.e. at the end of CC3. The and and or instructions need the data at the beginning of their EX stages, i.e. in CC4 and CC5 respectively.

Data forwarding is the method used to resolve this hazard: the data is simply forwarded, as soon as it is available, to any unit that needs it, without waiting for it to be read from the register file.

For simplicity, consider only forwarding to an operation in the EX stage, which is either an ALU operation or an effective address calculation.



The following figure shows the data being forwarded through the pipeline registers to the EX stage of later instructions:

Here the required data is available in time for the later instructions from the pipeline registers EX/MEM and MEM/WB.

If the ALU can take its inputs from any pipeline register rather than only from ID/EX, then the data can be forwarded. This requires multiplexers at the inputs of the ALU, with suitable control lines. With this arrangement the pipeline runs at full speed in the presence of these data dependences.

The close-up of the ALU and pipeline register before and after adding forwarding is shown

below:



Forwarding control is placed in the EX stage, because the ALU multiplexers are in this stage. The operand register numbers are passed from the ID stage via the ID/EX register so that the forwarding unit can determine whether to forward values.
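The decision made by the forwarding unit for one ALU operand can be sketched as follows. The 2-bit select values (00 for the register file value from ID/EX, 10 for EX/MEM, 01 for MEM/WB) follow the common textbook encoding and are an assumption here, as are the dictionary field names.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Mux select for one ALU operand whose source register number is src_reg.

    ex_mem and mem_wb are dicts with "RegWrite" and "Rd" (destination register).
    Encoding assumed: "00" = ID/EX (register file), "10" = EX/MEM, "01" = MEM/WB.
    """
    # EX hazard: the instruction just ahead is about to write this register.
    # Register $0 is never forwarded because it always reads as zero.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == src_reg:
        return "10"
    # MEM hazard: the instruction two ahead writes this register. Checking the
    # EX hazard first ensures the most recent result wins when both match.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == src_reg:
        return "01"
    # No hazard: use the value read from the register file.
    return "00"


# An instruction writing $2 is in MEM while the next one needs $2 in EX.
print(forward_select(2, {"RegWrite": 1, "Rd": 2}, {"RegWrite": 0, "Rd": 0}))
```

The same function is evaluated once for each of the two ALU operands, giving the two multiplexer controls.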

The control values of the multiplexers and their operation are:



Data hazards and stalls

Consider the following illustration:

Data forwarding cannot solve the data hazard introduced by a load instruction, as shown. Here the data is still being read from memory in CC4, while the ALU is already performing the operation for the following instruction.

This problem is solved by stalling the pipeline whenever a load is followed by an instruction that reads its result.

In addition to the forwarding unit, a hazard detection unit is required. It operates during the ID stage so that it can insert a stall between the load and its use.

If the instruction in ID stage is stalled, then the instruction in the IF stage must also be stalled.

This is accomplished by preventing the PC and the IF/ID pipeline register from changing.

Stalling can be done by inserting a nop, an instruction that has no effect on execution.

The following figure shows how a nop is inserted to stall the pipeline.



Here the pipeline execution slot of the and instruction is turned into a nop, and all instructions beginning with the and are delayed by one cycle.

The hazard forces the and and or instructions to repeat in CC4 what they did in CC3: the and reads its registers and is decoded again, and the or is fetched again from instruction memory.

The pipeline connections for both the hazard detection unit and the forwarding unit are shown below:

The forwarding unit controls the ALU multiplexers so that the value from a general purpose register is replaced with the value from the proper pipeline register.

The hazard detection unit controls the writing of the PC and IF/ID registers, and the multiplexer that chooses between the real control values and all 0s.

The hazard detection unit stalls the pipeline and de-asserts the control fields if a load-use hazard is detected.
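The check performed by the hazard detection unit can be sketched as below. The condition is the standard load-use test: the instruction in EX is a load (MemRead set) and its destination register Rt matches either source register of the instruction in ID. The output names PCWrite, IFIDWrite and ZeroControl are illustrative labels for the three actions the notes describe.

```python
def hazard_detection(id_ex, if_id):
    """Detect a load-use hazard between the instruction in EX and the one in ID.

    id_ex: {"MemRead": 0/1, "Rt": load destination register}
    if_id: {"Rs": ..., "Rt": ...} source registers of the instruction in ID.
    """
    stall = bool(id_ex["MemRead"]) and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])
    return {
        "PCWrite": not stall,      # hold the PC so the same instruction is fetched again
        "IFIDWrite": not stall,    # hold IF/ID so the same instruction is decoded again
        "ZeroControl": stall,      # send all-0 control values down the pipe (a nop)
    }


# Assumed example: a load into $2 followed immediately by an instruction reading $2.
print(hazard_detection({"MemRead": 1, "Rt": 2}, {"Rs": 2, "Rt": 5}))
```

After the one-cycle stall the loaded value is in MEM/WB, so the forwarding unit can supply it to the ALU.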



Branch hazards or Control hazard

Consider the following illustrations:

With pipelining, an instruction is fetched in every clock cycle. For a branch instruction, however, the decision whether to branch is not made until the MEM pipeline stage.

This delay in determining the proper instruction to fetch is called a branch hazard or control hazard.

Control hazards are relatively simple to understand and occur less frequently than data hazards.

Branch stalling

One solution is to stall until the branch is complete, but this is too slow.

An improvement on this method is to assume that the branch will not be taken and to continue execution down the sequential instruction stream.

If the branch is taken, the instructions that have been fetched and decoded must be discarded.

If branches are not taken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards.

To discard instructions (also called flushing instructions), the control values are changed to 0s, as is done to stall for a load-use data hazard. The difference is that here the three instructions in the IF, ID and EX stages must be flushed when the branch reaches the MEM stage.
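The claim that assuming not-taken halves the cost can be checked with simple arithmetic. In this sketch the three lost cycles per flushed branch come from the three instructions in IF, ID and EX described above; the 50% taken rate is the assumption stated in the notes, and the function name is illustrative.

```python
def average_branch_penalty(taken_fraction, flushed_per_taken_branch=3):
    """Average cycles lost per branch when the pipeline assumes branches are not taken.

    Only taken branches pay: each one flushes the instructions fetched behind it.
    """
    return taken_fraction * flushed_per_taken_branch


always_stall = 3                                  # stalling pays 3 cycles on every branch
assume_not_taken = average_branch_penalty(0.5)    # half the branches are taken
print(always_stall, assume_not_taken)

# Resolving the branch earlier means fewer instructions are flushed per taken branch.
print(average_branch_penalty(0.5, flushed_per_taken_branch=1))
```

The second call shows why moving the branch decision to an earlier stage reduces the penalty further.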

Branch prediction

Another way to reduce the cost of control hazards is to reduce the delay of branches.

So far the branch decision is made in the MEM stage; if it is made in an earlier stage, fewer instructions need to be flushed.



In the MIPS architecture, branch instructions need only simple tests and do not require a full ALU operation.

A branch instruction requires two actions: computing the branch target address and evaluating the branch decision.

The address calculation is the easy part. It can be moved from the EX stage to the ID stage, because the incremented PC and the immediate offset field are available in the IF/ID pipeline register. The calculated address is used only when the branch is taken.

For the branch decision, the two register values are compared by XORing their corresponding bits and then ORing all the results; the registers are equal when the OR output is 0.
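The XOR-then-OR comparison can be written out bit by bit. This is a software sketch of the logic, assuming 32-bit registers; the function name is illustrative.

```python
def registers_equal(a, b, width=32):
    """Equality test built from XOR and OR only, as a hardware comparator would be."""
    mask = (1 << width) - 1
    # XOR gives a 1 in every bit position where the two values differ.
    diff = (a ^ b) & mask
    # OR all the XOR outputs together: the result is 1 if any bit differs.
    any_difference = 0
    for bit in range(width):
        any_difference |= (diff >> bit) & 1
    # The registers are equal when no bit differs.
    return any_difference == 0


print(registers_equal(10, 10), registers_equal(10, -20))
```

Because no carry chain is involved, this test is much faster than a subtraction in the ALU, which is what makes it practical in the ID stage.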

Moving the branch test to the ID stage requires additional forwarding and hazard detection hardware. Two factors complicate the implementation:

1. In ID, the instruction must be decoded, a decision made on whether a bypass to the equality unit is needed, and the equality comparison completed, so that the PC can be set to the branch target address if the branch is taken. Operands are forwarded in the same way as for data hazards, but because the equality test unit is in ID, new forwarding logic is required.

2. The values for the branch comparison may be produced later in time, which causes a data hazard. This will need stalling.

Executing the branch in the ID stage improves speed by reducing the penalty of a taken branch to only one instruction.

Consider the following code:



The implementation is:



Super Scalar Processor:

These are dynamic multiple-issue processors, in which instructions are fetched in order, but the processor decides whether zero, one or more instructions can issue in a given clock cycle.

This improves the instruction execution rate.

Dynamic issue decisions are commonly extended with dynamic pipeline scheduling, which chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls.

Consider the following code:

In this case the sub instruction is ready to execute, but it has to wait for the first two instructions to complete. Dynamic pipeline scheduling avoids this type of hazard either partially or fully.

Dynamic pipeline scheduling

Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering them to avoid stalls.

A processor with this facility has three major units: an instruction fetch and issue unit, multiple functional units and a commit unit.

A typical model is shown below:

The first unit fetches instructions, decodes them and sends each instruction to the corresponding functional unit for execution.

Each functional unit has buffers, called reservation stations, that hold the operands and the operation.

As soon as a buffer contains all its operands and the functional unit is ready to execute, the result is calculated.

The result is sent to any reservation stations waiting for it as well as to the commit unit.

The commit unit also has a buffer, called the reorder buffer. It is used to supply operands in a way similar to forwarding.
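The effect of out-of-order execution with in-order commit can be illustrated with a very simplified model. This is a sketch under strong assumptions, not a description of a real processor: every instruction is already waiting in a reservation station at cycle 0, there are unlimited functional units, an instruction starts one cycle after its last operand is produced, and the commit unit retires at most one instruction per cycle in program order. The four-instruction sequence and the latencies (3 cycles for a load, 1 for ALU operations) are assumed for illustration, since the code referred to in these notes is not reproduced.

```python
def dynamic_schedule(program, latency):
    """Return (finish, commit) cycle numbers per instruction name.

    program: list of (name, kind, dest_register, source_registers) in program order.
    latency: cycles needed by each kind of instruction (hypothetical values).
    """
    ready_at = {}            # register -> cycle in which its pending value is produced
    finish, commit = {}, {}
    last_commit = 0
    for name, kind, dest, sources in program:
        # An instruction waits only for its own operands, not for earlier instructions.
        operands_ready = max((ready_at.get(r, 0) for r in sources), default=0)
        start = operands_ready + 1
        done = start + latency[kind] - 1
        ready_at[dest] = done
        finish[name] = done
        # The commit unit retires results strictly in program order.
        last_commit = max(last_commit + 1, done + 1)
        commit[name] = last_commit
    return finish, commit


program = [
    ("lw",   "load", "$t0", ["$s2"]),
    ("addu", "alu",  "$t1", ["$t0", "$t2"]),   # must wait for the load
    ("sub",  "alu",  "$s4", ["$s4", "$t3"]),   # independent of the load
    ("slti", "alu",  "$t5", ["$s4"]),          # depends only on sub
]
finish, commit = dynamic_schedule(program, {"load": 3, "alu": 1})
print(finish)
print(commit)
```

In this model sub and slti finish before the load does, yet all four results are committed in the original program order.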
