coa module 2 notes

Upload: assini-hussain

Post on 02-Mar-2016


DESCRIPTION

University of Kerala 08.503. Computer Organization and Architecture Notes

TRANSCRIPT

  • 08.503 Computer Organization and Architecture Module 2

    Department of ECE, VKCET Page 1

    Design of Data path and Control (based on MIPS instruction set)

    Basic MIPS Implementation:

    Consider the subset of the core MIPS instruction set:

    The key principles used to create the data path and design the control for other instructions are similar. The implementation ideas are common to general-purpose microprocessors, processors in high-performance servers, embedded processors, etc.

    For every instruction, the first two steps of execution are identical:

    1. Send the PC to the memory that contains the code and fetch the instruction from that memory.

    2. Read one or two registers, using fields of the instruction to select the registers.

    After these steps, the actions needed to complete the instruction depend on the instruction class. Even so, the actions for the memory-reference, arithmetic-logical and branch classes are largely the same: every instruction class except jump uses the ALU after reading the registers.

    A high-level view of a MIPS implementation, focusing on the various functional units and their interconnection, is shown below:

    It shows that the value written to the PC can come from one of two adders, and that the data written into the register file can come from either the ALU or the data memory. Where several lines feed one input, a multiplexer (also called a data selector) selects one of them according to its control lines.


    The data path of the basic MIPS implementation, with the required multiplexers, is shown below:

    Three multiplexers and a few control lines are required. A control unit takes the instruction as input and determines how to set the control lines for the functional units and two of the multiplexers. The third multiplexer determines whether PC+4 or the branch destination address is written into the PC. It is set based on the Zero output of the ALU, which performs the comparison for the beq instruction.

    This design approach is easy to understand but not practical, because it is slower than an implementation that allows different instruction classes to take different numbers of clock cycles.

    There are two design concepts: the single cycle datapath and the multicycle datapath.

    In the single cycle concept, separate instruction and data memories are required, because no datapath resource can be used more than once per instruction, and an instruction may need both an instruction fetch and a data access in the same clock cycle.

    Logic design conventions:

    The functional units of the MIPS implementation consist of two types of logic elements:

    1. Combinational elements: These operate on data values, and their outputs depend only on the current inputs. They have no internal storage. The adders, ALU and multiplexers are examples.

    2. State elements: These contain state and have some internal storage. They preserve the values stored in them earlier. The instruction memory, data memory and registers are examples.

    A state element has at least two inputs and one output. The required inputs are the data value to be written and the clock, which determines when the data value is written. The simplest state element is a D flip-flop. The output of a state element can be read at any time; the clock only controls when it is written.


    A state element is also called a sequential element, because its output depends on both its inputs and its internal state.

    Clocking Methodology:

    It defines when signals can be read and when they can be written.

    A simple methodology is edge triggering: any value stored in a sequential logic element is updated only on a clock edge.

    Consider two state elements surrounding a block of combinational logic that operates in a single clock cycle:

    All signals must propagate from state element 1, through the combinational logic, to state element 2 within one clock cycle. Edge triggering makes this possible: state element 1 is updated on one clock edge, its output is read and processed during the cycle, and the result is written into state element 2 on the next clock edge.

    Edge triggering may be positive-edge (+ve) or negative-edge (-ve).
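    The edge-triggered behaviour described above can be illustrated with a minimal sketch. The class name Register and the function combinational are hypothetical, chosen only for this example; the logic block (add 1) is an arbitrary stand-in for any combinational circuit.

```python
class Register:
    """Edge-triggered state element: the output changes only on clock_edge()."""

    def __init__(self, value=0):
        self.q = value   # current output, readable at any time
        self.d = value   # value waiting at the data input

    def clock_edge(self):
        # On the clock edge the waiting input becomes the new output.
        self.q = self.d


def combinational(x):
    # Hypothetical logic block: output depends only on the current input.
    return (x + 1) & 0xFFFFFFFF


r1 = Register(5)   # state element 1
r2 = Register(0)   # state element 2

# During the cycle: r1 is read, the logic settles, the result waits at r2's input.
r2.d = combinational(r1.q)
before_edge = r2.q   # r2 still shows its old value
r2.clock_edge()      # next clock edge: r2 captures the result
after_edge = r2.q
print(before_edge, after_edge)
```

    The output of r2 does not change while the logic is settling; it changes only when the next edge arrives.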

    Building data path:

    Start with the major components needed to fetch instructions: two state elements (the instruction memory and the PC) and an adder, as shown below:

    These elements are combined into a data path that fetches the instruction and increments the PC to point to the next instruction.

    Consider the R-type instructions. They require the processor's 32 general-purpose 32-bit registers, which are held in a structure called the register file.

    R-type instructions have three register operands: two registers are read and one is written.

    For a read operation: each register read needs an input to the register file that specifies the register number and an output from the register file that carries the value read. Two registers are read, so two inputs and two outputs are required.

    To write data: one input specifies the register number to be written and one supplies the data to be written into the register.

    The register number inputs are 5 bits wide to specify one of 32 registers (32 = 2^5).

    In total we need four inputs (three for register numbers and one for data) and two outputs for data.
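    The port structure listed above can be sketched as a small model. The class and parameter names are hypothetical; the write enable corresponds to the RegWrite signal introduced later in these notes, and keeping register 0 at zero is standard MIPS behaviour rather than something stated here.

```python
class RegisterFile:
    """32 registers of 32 bits, two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # Two 5-bit register numbers in, two 32-bit data values out.
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def write(self, write_reg, write_data, reg_write):
        # One 5-bit register number and one 32-bit data value in.
        # Register 0 ($zero) always reads as 0 in MIPS, so writes to it are ignored.
        if reg_write and (write_reg & 0x1F) != 0:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 7, reg_write=1)    # written
rf.write(10, 3, reg_write=0)   # ignored, write not enabled
print(rf.read(9, 10))
```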


    The elements required for R-type instructions are:

    The ALU takes two 32-bit inputs and 4 control lines, and produces a 32-bit result and a 1-bit Zero signal that indicates a zero result.

    Consider the memory-reference instructions:

    lw $s1, offset_value($s2) and sw $s1, offset_value($s2)

    Both compute the memory address with the ALU, so they require a sign extension unit to extend the 16-bit offset_value to 32 bits, in addition to the data memory. These two additional elements are shown below:

    Consider the branch instruction beq. It has three operands: two registers that are compared for equality and a 16-bit offset used to calculate the target address.

    To implement this instruction, the branch target address is computed by adding the sign-extended 32-bit offset field to PC+4, the address of the instruction following the branch. Before the addition, the offset field is shifted left by 2 bits, because it is a word offset.

    If the condition is true, the branch is taken; otherwise it is not taken and execution continues at PC+4.


    The structure of the data path that handles the branch instruction is:

    To compute the branch target address and test the condition, the branch datapath includes a sign extension unit, a shift-left-by-2 unit, an adder, and the ALU to compare the two register file operands.

    The ALU provides an output signal that indicates whether its result is 0. If the two operands are equal, the Zero output is 1; otherwise it is 0.

    The jump instruction operates by replacing the lower 28 bits of the PC with the lower 26 bits of the instruction shifted left by 2 bits. This unit is not shown here.
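    The branch computation described above (sign extension, shift left by 2, add to PC+4, compare by subtraction) can be sketched as follows. The function names are hypothetical and the addresses in the usage lines are arbitrary illustrative values.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    # Word offset: shift left by 2, then add to the address after the branch.
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc, rs_val, rt_val, offset16):
    # The ALU subtracts; Zero is 1 when the two operands are equal.
    zero = int(((rs_val - rt_val) & 0xFFFFFFFF) == 0)
    return branch_target(pc, offset16) if zero else (pc + 4) & 0xFFFFFFFF


print(hex(next_pc_beq(0x1000, 5, 5, 3)))   # taken
print(hex(next_pc_beq(0x1000, 5, 6, 3)))   # not taken
```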

    Creating single data path:

    Combine the datapath components of the individual instruction classes into a single datapath and add the control to complete the implementation.

    The simplest design executes every instruction in one clock cycle. No datapath resource can then be used more than once per instruction, so any element needed more than once must be duplicated.

    The operations of memory-reference and arithmetic-logical instructions are largely the same, with some differences:

    1. Memory instructions use the ALU for address calculation, with the second input taken from the sign-extended 16-bit offset field of the instruction, whereas arithmetic-logical instructions use the ALU with both inputs from registers.

    2. For memory instructions the ALU result is always used as a data memory address, whereas for arithmetic-logical instructions it is always written to a register.


    The simple data path of the MIPS architecture for the three instruction classes is shown below:

    The first ALU input always comes from a register. The second comes from a register for R-type instructions, but memory-reference instructions use the ALU for address calculation, so the second ALU input is selected by a MUX from either a register or the sign-extended 16-bit offset field of the instruction. The control signal of this MUX is ALUSrc.

    The value stored in the destination register (write data) comes from the ALU result (for an R-type instruction) or from the data memory (for a load instruction), so it is selected by another MUX. The control signal for this MUX is MemtoReg.

    An additional MUX selects whether the address of the sequentially following instruction, PC+4, or the branch target address is written to the PC. Its control line is PCSrc.


    Simple Implementation:

    To add a simple control function to the datapath, consider the instructions lw, sw, beq, add, sub, and, or and slt.

    ALU control:

    The ALU has 4 control lines, so there are 16 possible ALU functions. Only 6 of them are used here; they are shown in the following table:

    For the three instruction classes, the ALU needs to perform only the first five functions. For memory-reference instructions the ALU computes the memory address by addition. For arithmetic-logical instructions the ALU performs one of the five functions, depending on the value of the 6-bit funct field in the low-order bits of the instruction. For branch equal the ALU must perform subtraction.

    The 4-bit control input of the ALU can be generated by a small control unit with two inputs: a 2-bit control field called ALUOp (ALU operation) and the 6-bit funct field of the instruction.

    The following table shows how the 4-bit ALU control lines are related to the 2-bit ALUOp and the 6-bit funct field of the instruction:

    The table shows multilevel decoding: the main control unit generates ALUOp, and the ALU control unit combines it with funct. There are 8 input bits to generate the 4 output bits. Because only a few of the 256 input combinations matter, the remaining input bits can be treated as don't care (X) terms, which simplifies the logic. The resulting truth table for the ALU control inputs is shown below:
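    The two-level decoding described above can be sketched in a few lines. The tables themselves are missing from these notes, so the 4-bit encodings and funct codes below are an assumption taken from the standard textbook MIPS design; the function names are hypothetical.

```python
# Assumed 4-bit ALU control encodings (standard textbook MIPS values).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# MIPS funct codes for the five R-type instructions considered here.
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,   # add
    0b100010: ALU_SUB,   # sub
    0b100100: ALU_AND,   # and
    0b100101: ALU_OR,    # or
    0b101010: ALU_SLT,   # slt
}


def alu_control(alu_op, funct):
    """Second decoding level: 2-bit ALUOp plus 6-bit funct -> 4-bit ALU control."""
    if alu_op == 0b00:                  # lw / sw: add, funct is a don't care
        return ALU_ADD
    if alu_op == 0b01:                  # beq: subtract, funct is a don't care
        return ALU_SUB
    if alu_op == 0b10:                  # R-type: decided by funct
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("unused ALUOp value")


def to_signed(x):
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (32-bit result, Zero) for the five functions used here."""
    if control == ALU_AND:
        result = a & b
    elif control == ALU_OR:
        result = a | b
    elif control == ALU_ADD:
        result = a + b
    elif control == ALU_SUB:
        result = a - b
    elif control == ALU_SLT:
        result = int(to_signed(a) < to_signed(b))
    else:
        raise ValueError("function not modelled in this sketch")
    result &= 0xFFFFFFFF
    return result, int(result == 0)


print(alu(alu_control(0b01, 0), 7, 7))   # beq compare of equal values
```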


    Designing of main control unit:

    Consider the instruction formats of the R-type, memory-reference and branch instructions shown below:

    The major observations about these instruction formats are:

    1. The opcode field op is always in bits 31:26 (6 bits), usually referred to as Op[5:0]. This is common to all three instruction classes.

    2. The two registers to be read are always specified by the rs and rt fields, at bits 25:21 and 20:16 respectively. This is also common to all three instruction classes.

    3. The base register for load/store instructions is always in bits 25:21 (rs).

    4. The 16-bit offset for branch equal and load/store instructions is in bits 15:0.

    5. The destination register for load and R-type instructions is in one of two places: for a load it is in bits 20:16 (rt), while for an R-type instruction it is in bits 15:11 (rd). A MUX is therefore needed to select which field of the instruction supplies the number of the register to be written.

    The control unit implementation along with datapath is shown below:


    This implementation has seven control lines and a 2-bit ALUOp control signal.

    The functions of the seven control lines are:

    1. RegDst: If 0, the destination register number comes from the rt field (bits 20:16); otherwise it comes from the rd field (bits 15:11).

    2. RegWrite: If 1, the register on the write register input is written with the value on the write data input; otherwise nothing happens.

    3. ALUSrc: If 0, the second ALU operand comes from the second register file output (Read data 2); otherwise it is the sign-extended lower 16 bits of the instruction.

    4. PCSrc: If 0, the PC is replaced by the output of the adder that computes PC+4; otherwise by the output of the adder that computes the branch target.

    5. MemRead: If 1, the data memory contents designated by the address input are put on the read data output; otherwise nothing happens.

    6. MemWrite: If 1, the data memory contents designated by the address input are replaced by the value on the write data input; otherwise nothing happens.

    7. MemtoReg: If 0, the value fed to the register write data input comes from the ALU; otherwise it comes from the data memory.

    The simple data path design with control unit is shown below:

    The control unit generates nine control signal bits (including the 2-bit ALUOp) from the instruction opcode. PCSrc is the exception: it depends on both the opcode and the comparison result. The control unit generates a Branch signal for the branch equal instruction, and this signal is ANDed with the Zero output of the ALU to produce PCSrc.


    The settings of the control lines, which are determined by the opcode field of the instruction, are shown below:

    The operation of the datapath for an R-type instruction such as add $t1, $t2, $t3 is shown in the following figure.

    Everything occurs in one clock cycle, but the flow of the instruction can be viewed as four steps:

    1. The instruction is fetched and the PC is incremented.

    2. Two registers, $t2 and $t3, are read from the register file, and the main control unit computes the setting of the control lines during this step.

    3. The ALU operates on the data read from the register file, using the function code (bits 5:0 of the instruction) to generate the ALU function.

    4. The result from the ALU is written into the register file, using bits 15:11 of the instruction to select the destination register $t1.


    An illustration of the instruction lw $t1, offset($t2) is shown below:

    The five steps for the load instruction are:

    1. The instruction is fetched from the instruction memory and the PC is incremented.

    2. The value of register $t2 is read from the register file.

    3. The ALU computes the sum of the value read from the register file and the sign-extended lower 16 bits of the instruction (offset).

    4. The sum from the ALU is used as the address for the data memory.

    5. The data from the memory unit is written into the register file; the destination register ($t1) is given by bits 20:16 of the instruction.


    An illustration of the instruction beq $t1, $t2, offset is shown below:

    The four steps in the execution of the branch instruction are:

    1. The instruction is fetched from the instruction memory and the PC is incremented.

    2. Two registers, $t1 and $t2, are read from the register file.

    3. The ALU performs a subtraction on the values read from the register file. The value of PC + 4 is added to the sign-extended lower 16 bits of the instruction (offset) shifted left by two. The result is the branch target address.

    4. The Zero result from the ALU is used to decide which adder result to store into the PC.


    Finalizing the control:

    The input signals of the control unit (the opcode bits) and the corresponding outputs are shown in the following truth table:

    Instruction | Op5 Op4 Op3 Op2 Op1 Op0 | RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
    R-type (0)  |  0   0   0   0   0   0  |   1      0       0        1        0       0       0      1      0
    lw (35)     |  1   0   0   0   1   1  |   0      1       1        1        1       0       0      0      0
    sw (43)     |  1   0   1   0   1   1  |   X      1       X        0        0       1       0      0      0
    beq (4)     |  0   0   0   1   0   0  |   X      0       X        0        0       0       1      0      1
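    The truth table above can be read as a lookup from the 6-bit opcode to the control outputs. The following is a minimal sketch of that reading, not a description of the actual gate-level design; the names Control, CONTROL_TABLE, main_control and pc_src are hypothetical, and None stands for a don't care (X).

```python
from typing import NamedTuple, Optional


class Control(NamedTuple):
    reg_dst: Optional[int]      # None marks a don't care (X)
    alu_src: int
    mem_to_reg: Optional[int]
    reg_write: int
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int                 # 2-bit ALUOp (ALUOp1, ALUOp0)


# One row per opcode, in the column order of the truth table.
CONTROL_TABLE = {
    0:  Control(1, 0, 0, 1, 0, 0, 0, 0b10),         # R-type
    35: Control(0, 1, 1, 1, 1, 0, 0, 0b00),         # lw
    43: Control(None, 1, None, 0, 0, 1, 0, 0b00),   # sw
    4:  Control(None, 0, None, 0, 0, 0, 1, 0b01),   # beq
}


def main_control(instruction):
    """Decode bits 31:26 of a 32-bit instruction word."""
    opcode = (instruction >> 26) & 0x3F
    return CONTROL_TABLE[opcode]


def pc_src(control, zero):
    # PCSrc = Branch AND Zero
    return control.branch & zero


# lw $t1, 8($t2): opcode 35, rs = 10 ($t2), rt = 9 ($t1), offset 8
lw_word = (35 << 26) | (10 << 21) | (9 << 16) | 8
print(main_control(lw_word))
```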


    Implementing jump instruction:

    The jump instruction is similar to the branch instruction, but it computes the target address for the PC differently and is not conditional.

    As with a branch, the low-order 2 bits of a jump address are always 00 (binary), because the address is a word address multiplied by 4.

    The next 26 bits of this 32-bit address come from the 26-bit immediate field of the instruction, as shown below:

    The upper 4 bits of the address are taken from the upper 4 bits of PC + 4.

    The jump instruction can therefore be implemented by storing into the PC the concatenation of:

    1) The upper 4 bits of the current PC+4 (bits 31:28 of the sequentially following instruction address)

    2) The 26-bit immediate field of the jump instruction.

    3) The bits 00 (binary).
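    The concatenation listed above can be written directly with shifts and masks. This is a minimal sketch; the function name and the example addresses are hypothetical.

```python
def jump_target(pc, instruction):
    upper4 = (pc + 4) & 0xF0000000      # bits 31:28 of PC+4
    imm26 = instruction & 0x03FFFFFF    # 26-bit immediate field
    return upper4 | (imm26 << 2)        # shifting left by 2 appends the bits 00


# A jump (opcode 2) with an arbitrary immediate, fetched from an arbitrary address.
word = (2 << 26) | 0x0100000
print(hex(jump_target(0x40000008, word)))
```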

    The addition of the control for jump instruction and multiplexer for selecting jump address,

    PC+4 or branch target is shown below:


    Advantages and disadvantages of single cycle implementation:

    The only advantage of the single cycle implementation is its simplicity.

    The disadvantage is its inefficiency and slow speed. The clock cycle has the same length for every instruction and must be long enough for the slowest instruction, so although the CPI is 1, faster instructions waste time and overall performance is poor.

    Because of this inefficiency, this implementation is not used nowadays.

    Performance of single cycle implementation:

    Assume that the operation times of the major functional units in the single cycle implementation are:

    a) Memory units: 200 ps

    b) ALU and adders: 100 ps

    c) Register file (read or write): 50 ps

    Assume that the multiplexers, control unit, PC accesses, sign extension unit and wires have no delay.

    Consider two implementations: in one, every instruction operates in 1 clock cycle of fixed length; in the other, every instruction executes in 1 clock cycle whose length varies according to the requirement of the instruction.

    To compare the performance, assume an instruction mix of 25% loads, 10% stores, 45% ALU instructions, 15% branches and 5% jumps.

    We know that CPU execution time = IC x CPI x Clock cycle time. Since CPI = 1, CPU execution time = IC x Clock cycle time.

    We only have to find the clock cycle time for both cases, since IC and CPI are the same in both.

    The critical path for different class instruction is shown below:

    Using these critical paths, the required length for each instruction class are:

    The clock cycle time of a machine with a single clock for all instructions is determined by the longest instruction, which is 600 ps for load word.

    A machine with a variable clock will have a clock cycle that varies between 200 ps and 600 ps. The average clock cycle for this machine is

    CPU clock cycle = 400 x 0.45 + 600 x 0.25 + 550 x 0.1 + 350 x 0.15 + 200 x 0.05 = 447.5 ps

    Since IC and CPI are the same, the CPU performance ratio can be found from the clock cycle times:

    Performance (variable clock) / Performance (single clock) = 600 / 447.5 = 1.34

    This shows that the variable clock implementation is 1.34 times faster than the single clock implementation.
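    The figures in this comparison can be reproduced from the unit delays and instruction mix assumed above. The sketch below is one way to do the arithmetic; the list of units on each critical path is an assumption based on the standard single cycle datapath, since the critical-path figure is not reproduced in these notes.

```python
# Unit delays in picoseconds, as assumed in the text.
MEM, ALU, REG = 200, 100, 50

# Units on the critical path of each instruction class (assumed ordering).
CRITICAL_PATH = {
    "alu":    [MEM, REG, ALU, REG],        # fetch, register read, ALU, register write
    "load":   [MEM, REG, ALU, MEM, REG],   # fetch, read, address, data memory, write
    "store":  [MEM, REG, ALU, MEM],        # fetch, read, address, data memory
    "branch": [MEM, REG, ALU],             # fetch, read, compare
    "jump":   [MEM],                       # fetch only
}

# Instruction mix from the text.
MIX = {"alu": 0.45, "load": 0.25, "store": 0.10, "branch": 0.15, "jump": 0.05}

cycle_time = {cls: sum(path) for cls, path in CRITICAL_PATH.items()}
fixed_clock = max(cycle_time.values())                       # single clock for all
average_clock = sum(MIX[cls] * cycle_time[cls] for cls in MIX)
speedup = fixed_clock / average_clock

print(cycle_time)
print(fixed_clock, average_clock, round(speedup, 2))
```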

    Implementing a variable clock machine is very difficult and adds overhead during execution.

    A single cycle implementation with a fixed clock length is suitable only for a small instruction set.

    In a single cycle implementation, each functional unit can be used only once per clock cycle; therefore some units must be duplicated, which raises the cost. So it is inefficient both in performance and in hardware cost.

    Multi cycle implementation:

    The drawbacks of the single cycle implementation can be overcome by this method.

    It allows a functional unit to be shared rather than duplicated, as long as the unit is used on different clock cycles. Sharing reduces the amount of hardware required.

    The major advantages of this method are that instructions can take different numbers of clock cycles and that functional units can be shared within the execution of a single instruction.

    The high level view of multi cycle datapath is shown below:

    The main differences from the single cycle implementation are:

    1. A single memory unit is used for both instructions and data.

    2. There is a single ALU, rather than an ALU and two adders.

    3. One or more registers are added after every major functional unit to hold the output of that unit until the value is used in a subsequent clock cycle.

    At the end of a clock cycle, all data that is used in subsequent clock cycles must be stored in a state element. Data used by subsequent instructions is stored in one of the programmer-visible state elements: the register file, the PC or the memory.

    Data used by the same instruction in a later cycle must be stored in one of the additional registers.

    In this design, each clock cycle can accommodate at most one of the following operations: a memory access, a register file access or an ALU operation. The data produced by any of these functional units must therefore be saved in a temporary register for use in a later cycle.

    The temporary registers and their purposes are:

    1. Instruction Register (IR) and Memory Data Register (MDR): save the output of the memory for an instruction read and a data read respectively.

    2. A and B registers: hold the register operand values read from the register file.

    3. ALUOut register: holds the output of the ALU.

    All of these registers except the IR hold data only between a pair of adjacent clock cycles and thus need no write control signal. The IR must hold the instruction until the end of its execution and thus requires a write control signal.

    To share functional units for different purposes, we need more MUXes and must expand the existing ones. Because one memory is used for both instructions and data, a MUX is required to select the memory address from two sources: the PC (for instruction access) and ALUOut (for data access).

    The ALU and two adders of the single cycle implementation are replaced by a single ALU, so additional multiplexers are required at the two ALU inputs. A MUX on the first ALU input chooses between the A register and the PC. The MUX on the second input becomes a 4-way MUX that chooses among the B register, the constant 4 (to increment the PC), the sign-extended offset, and the sign-extended offset shifted left by 2 (used for branch address computation).

    The details of datapath with the additional MUXs are shown below:

    Because the datapath takes multiple clock cycles per instruction, it requires a different set of control signals. The programmer-visible state units (the PC, the memory and the registers) require write control signals, and so does the IR.

    The memory also needs a read control signal.

    The ALU needs control signals similar to those of the single cycle implementation.

    Each multiplexer also needs control lines.

    The datapath with control lines is shown below:

    For jump and branch instructions, there are three possible sources for the value to be written into the PC:

    1. The output of the ALU, which is PC+4 during instruction fetch.

    2. The register ALUOut, which holds the branch target address after it is computed.

    3. The lower 26 bits of the IR shifted left by 2 and concatenated with the upper 4 bits of the incremented PC, which is the source when the instruction is a jump.

    The PC is written both unconditionally and conditionally. During a normal increment and for jumps, the PC is written unconditionally. If the instruction is a conditional branch, the PC is replaced by ALUOut only if the two designated registers are equal. So two separate control signals are required for the PC: PCWrite, which causes an unconditional write of the PC, and PCWriteCond, which causes a write of the PC if the branch condition is true.
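    The PC update logic described above can be summarised in a short sketch. The function names are hypothetical; the PCSource encodings 00, 01 and 10 are the ones used in the execution steps later in these notes.

```python
def pc_write_enable(pc_write, pc_write_cond, zero):
    # The PC is written unconditionally, or conditionally when Zero is asserted.
    return pc_write | (pc_write_cond & zero)


def pc_source_mux(pc_source, alu_result, alu_out, jump_address):
    # 00: ALU output (PC+4), 01: ALUOut (branch target), 10: jump address
    return {0b00: alu_result, 0b01: alu_out, 0b10: jump_address}[pc_source]


def update_pc(pc, pc_write, pc_write_cond, zero, pc_source,
              alu_result, alu_out, jump_address):
    if pc_write_enable(pc_write, pc_write_cond, zero):
        return pc_source_mux(pc_source, alu_result, alu_out, jump_address)
    return pc   # no write: the PC keeps its value


# Branch completion with equal registers: the PC takes the target held in ALUOut.
print(hex(update_pc(0x1004, 0, 1, 1, 0b01, 0, 0x1010, 0)))
```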


    The multicycle datapath and control unit including additional control signals and multiplexer for

    implementing PC updating is shown below:

    The functions of 1-bit control lines are:


    The functions of 2-bit control lines are:

    Fetch, Decode, Execute and Memory Access Cycles:

    Breaking the execution of an instruction into multiple clock cycles should improve the performance of the system.

    Instruction execution is broken into a series of steps, each taking one clock cycle. For example, each step is restricted to contain at most one ALU operation, or one register file access, or one memory access. With this restriction, the clock cycle can be as short as the longest of these operations.

    A MIPS instruction takes three to five steps in the multicycle implementation. They are:

    1. Instruction fetch step:

    Fetch the instruction from memory and compute the address of the next instruction.

    Operation: Send the PC to memory as the address, perform a memory read, and write the instruction into the IR, where it will be stored. Also increment the PC by 4.

    To implement this step, assert MemRead and IRWrite (set to 1), set IorD to 0 to select the PC as the source of the address, set ALUSrcA to 0 to send the PC to the ALU, set ALUSrcB to 01 to send 4 to the ALU, and set ALUOp to 00 to make the ALU add. Also, to store the incremented instruction address back into the PC, set PCSource to 00 and assert PCWrite.

    The increment of the PC and the instruction memory access occur in parallel, and the new value of the PC is not visible until the next clock cycle.

    2. Instruction decode and register fetch step:

    The instruction is decoded and operands are fetched in this step.

    The branch target address is also computed with ALU in this step. The potential branch target

    is saved in ALUOut.


    If the instruction has two register inputs, they are always in the rs and rt fields, and if the instruction is a branch, the offset is always in the low-order 16 bits. This is shown below:

    Operation: Access the register file to read rs and rt and store the results in the A and B registers. Since A and B are overwritten on every cycle, the register file can be read on every cycle with the values stored into A and B. This step also computes the branch target address and stores the result in ALUOut, where it will be used on the next clock cycle if the instruction is a branch.

    The required control signals for this step are: set ALUSrcA to 0 to send the PC to the ALU, set ALUSrcB to 11 to send the sign-extended and shifted offset to the ALU, and set ALUOp to 00 to make the ALU add.

    The register file access and the computation of the branch target address occur in parallel.

    After this clock cycle, the action to take depends on the instruction.

    3. Execution, memory address computation, or branch completion:

    This is the first cycle in which the datapath operation is determined by the instruction class.

    For memory reference:

    Operation: The ALU adds the operands to form the memory address. Set ALUSrcA to 1 to send A to the first ALU input and set ALUSrcB to 10 to send the sign-extended offset to the second ALU input. Set ALUOp to 00 to make the ALU add.

    For R-type instruction:

    Operation: The ALU performs the operation specified by the funct field on the two values read from the register file in the previous cycle. Set ALUSrcA to 1 to send A to the first ALU input and set ALUSrcB to 00 to send B to the second ALU input. Set ALUOp to 10 so that the ALU control unit uses the funct field to generate the ALU control signals.

    For branch:

    Operation: The ALU compares the two registers read in the previous step. The Zero output of the ALU determines whether or not to branch. The required control signals are: set ALUSrcA to 1 and ALUSrcB to 00 to send the A and B registers to the ALU inputs, and set ALUOp to 01 for equality testing (subtract). PCWriteCond must be asserted to update the PC if the Zero output of the ALU is asserted, and PCSource is set to 01 so that the value written to the PC comes from ALUOut, which holds the branch target address. For a conditional branch that is taken, the PC is therefore written twice: once with PC+4 from the ALU output during the instruction fetch step, and once from ALUOut during the branch completion step. The last value written to the PC is used to fetch the next instruction.


    For jump:

    Operation: The PC is replaced by the jump address. PCSource is set to 10 to send the jump address to the PC, and PCWrite is asserted to write the jump address into the PC.

    4. Memory access or R-type instruction completion step:

    During this step, a memory-reference instruction accesses memory and an R-type instruction writes its result.

    When a value is retrieved from memory, it is stored in the MDR and is used on the next clock cycle.

    For memory reference:

    Operation: For a load instruction, a data word is retrieved from memory and written into the MDR. For a store instruction, the data word is written into memory. In both cases the address used was computed during the previous step and stored in ALUOut. For a store instruction, the source operand is in B. MemRead (for a load) or MemWrite (for a store) is asserted. IorD is set to 1 so that the memory address comes from ALUOut.

    For R-type instruction:

    Operation: Place the contents of ALUOut into the result register. RegDst is set to 1 so that the rd field (bits 15:11) selects the register file entry to write. RegWrite is asserted and MemtoReg is set to 0 to write the ALUOut data into the register file.

    5. Memory read completion step:

    During this step, a load instruction completes by writing the data from memory back to the register file.

    For load:

    Operation: Write the load data, which was stored in the MDR during the previous cycle, into the register file. Set MemtoReg to 1 to write the result from memory, assert RegWrite, and set RegDst to 0 to choose the rt field (bits 20:16) as the register number.
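    The five steps above imply that each instruction class needs a different number of clock cycles. As a rough illustration, the sketch below counts the steps per class and computes the resulting average CPI. It reuses the instruction mix from the single cycle performance comparison earlier in these notes, which is an assumption made only for this example; the step labels are shorthand for the step names above.

```python
# Steps taken by each instruction class in the multicycle implementation.
STEPS = {
    "load":   ["fetch", "decode", "address", "memory read", "write-back"],
    "store":  ["fetch", "decode", "address", "memory write"],
    "alu":    ["fetch", "decode", "execute", "R-type completion"],
    "branch": ["fetch", "decode", "branch completion"],
    "jump":   ["fetch", "decode", "jump completion"],
}

# Instruction mix borrowed from the single cycle comparison (assumption).
MIX = {"load": 0.25, "store": 0.10, "alu": 0.45, "branch": 0.15, "jump": 0.05}

cycles = {cls: len(steps) for cls, steps in STEPS.items()}
cpi = sum(MIX[cls] * cycles[cls] for cls in MIX)

print(cycles)
print(round(cpi, 2))
```

    Under this mix the average is a little over four cycles per instruction, compared with the five cycles that every instruction would need if all of them took as long as a load.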


    Design of the Control Unit:

    To design the control unit for the single cycle implementation, a truth table that specifies the setting of the control signals based on the instruction class is used.

    For the multicycle datapath, the control unit is more complex, because an instruction is executed in a series of steps.

    Two techniques are used to design the control unit of a multicycle implementation: one is based on finite state machines (hardwired control) and the other uses microprogramming. Both lead to a control representation that can be implemented using gates, ROMs or PLAs.

    The high level view of the finite state machine control for the five steps of the multicycle implementation is shown below:

    The first two states of the machine, in graphical representation, are shown below:

    State 0 is instruction fetch; after it the FSM switches to state 1, instruction decode/register fetch.

    After state 1, the FSM switches to one of four states, depending on the instruction class.

    For memory-reference instructions:


    For R-type instructions:


    For branch instructions:

    For jump instructions:

    All these states can be implemented by a control unit shown below:

    This FSM can be implemented with a temporary register that holds the current state and a block of combinational logic that determines both the datapath signals to be asserted and the next state.

    The combinational control logic for this FSM can be implemented with either a ROM or a PLA.
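    The state register plus next-state logic can be sketched in Python as follows. Only states 0 and 1 are numbered in these notes; the numbers 2 to 9 and the instruction labels ("lw", "sw", "rtype", "beq", "j") are assumptions that follow the usual textbook numbering for this FSM.

```python
def next_state(state, op):
    """Next-state function of the multicycle control FSM (assumed numbering)."""
    if state == 0:            # instruction fetch
        return 1
    if state == 1:            # instruction decode / register fetch
        return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[op]
    if state == 2:            # memory address computation
        return 3 if op == "lw" else 5
    if state == 3:            # memory read
        return 4              # memory read completion (write back)
    if state == 6:            # R-type execution
        return 7              # R-type completion
    # States 4, 5, 7, 8 and 9 finish the instruction: go back to fetch.
    return 0


def run(op):
    """Return the list of states visited while executing one instruction."""
    state, trace = 0, []
    while True:
        trace.append(state)
        state = next_state(state, op)
        if state == 0:
            return trace


for op in ("lw", "sw", "rtype", "beq", "j"):
    print(op, run(op))
```

    The length of each trace is the number of clock cycles the instruction takes: five for a load, four for a store or R-type instruction, and three for a branch or jump.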

    Microprogramming Control Design:

    A technique for designing complex control units.

    It uses simple hardware that can be programmed to implement a more complex instruction set.

    Enhancing Performance with Pipelining:

    Pipelining is an implementation technique in which multiple instructions are overlapped in

    execution.

    Consider a laundry system with non-pipelining and pipelining approaches.

    In non-pipelining approach:

    In the pipelined approach, as soon as the washer has finished the first load and it has been placed in the dryer, the washer is loaded with the second load. When the first load is dry, it is passed to the folder, the wet load is moved to the dryer and the next load goes into the washer. Next, the first load is put away, the second load is folded, the third load goes into the dryer and the fourth load goes into the washer.

    These two approaches are shown below:

    Pipelining is faster, and it can be applied to MIPS instruction execution because a MIPS instruction classically takes five steps:

    Comparison between single cycle and pipelining implementation:

    Let us consider the total time required by the different units to execute each instruction, as shown below:

    Execution of instructions in the single-cycle, non-pipelined implementation is shown below:

    Execution of instructions in the pipelined implementation is shown below:

    The time required to execute the first four instructions in the non-pipelined implementation is 800 x 4 = 3200 ps, but in the pipelined implementation it is only around 1500 ps.
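    The comparison can be reproduced with a short calculation. This is only a sketch: the 800 ps single-cycle time comes from the notes, while the 200 ps pipeline clock (the time of the slowest stage) and the function names are assumptions, because the timing table is not reproduced here.

```python
def single_cycle_time_ps(n_instructions, cycle_ps=800):
    """Total time when every instruction takes one long clock cycle."""
    return n_instructions * cycle_ps


def pipelined_time_ps(n_instructions, n_stages=5, cycle_ps=200):
    """Total time when one instruction enters the pipeline per clock cycle.

    The first instruction needs n_stages cycles; each later instruction
    finishes one cycle after the previous one.
    """
    return (n_stages + n_instructions - 1) * cycle_ps


print(single_cycle_time_ps(4))   # 4 x 800 ps
print(pipelined_time_ps(4))      # (5 + 4 - 1) x 200 ps
```

    Under these assumptions the pipelined estimate is 1600 ps with full clock cycles. The figure of about 1500 ps quoted above is slightly lower, presumably because the final register write does not need the whole of the last cycle.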

    The speed of the pipelined instructions depends on the number of stages pipelined. Then

    Pipeline hazards: situations in which the next instruction cannot execute in the following clock cycle.

    Three types of pipeline hazards:

    1. Structural hazards: occur when the hardware cannot support the combination of instructions that are set to execute in the same clock cycle.

    2. Data hazards: occur when the pipeline must stall because one step of execution has to wait for another step to complete.

    3. Control hazards: arise from the need to make a decision based on the result of one instruction while others are executing.

    Stalls, forwarding and branch prediction are used to deal with these hazards.

    Pipelining increases the number of simultaneously executing instructions and the rate at which instructions are started and completed. It does not reduce the time taken to complete an individual instruction, which is called latency. Thus pipelining improves instruction throughput rather than the execution time (latency) of an individual instruction.

    Pipelined Datapath:

    Consider the single-cycle datapath and divide it into five stages, one for each of the five execution steps, to form a five-stage pipeline. This means that up to five instructions will be in execution during any single clock cycle.

    The stages are named: 1) IF (Instruction Fetch) 2) ID (Instruction Decode/Register Fetch) 3) EX (Execute/Address Calculation) 4) MEM (Memory Access) 5) WB (Write Back)

    Instructions and data generally move from left to right through the five stages as an instruction completes. There are two exceptions: 1) The WB stage places the result back into the register file, which is in the ID stage in the middle of the datapath. 2) The next value of the PC is selected between PC + 4 and the branch address from the MEM stage.

    The first exception can cause data hazards and the second can cause control hazards.

    The execution of some instructions, each drawn with its own datapath on a common time line, is shown below:

    Here three instructions are drawn with three datapaths, but the functional units are actually shared among the instructions. For example, the IM is used in only one of the five stages of an instruction, so it is free to be used by other instructions during the other four stages.

    Consider the pipelined datapath with the pipeline registers highlighted.

    The pipeline registers are: the IF/ID register between the IF and ID stages, the ID/EX register between the ID and EX stages, the EX/MEM register between the EX and MEM stages and the MEM/WB register between the MEM and WB stages.

    There is no pipeline register at the end of the WB stage. Every instruction must update some state in the register file, memory or PC, so a separate pipeline register would be redundant with the state that is updated.

    Pipelined datapath of load instruction

    The active portions of the datapath, highlighted as a load instruction goes through the first stage of pipelined execution, are shown below:

    Instruction fetch: The instruction is read from memory using the address in the PC and placed in the IF/ID register. The IF/ID register is similar to the IR. The PC address is incremented by 4 and written back to the PC for the next instruction. This incremented address is also saved in the IF/ID register in case it is needed later by an instruction such as beq.

    The active portions of the datapath, highlighted as a load instruction goes through the second stage of pipelined execution, are shown below:

    Instruction decode and register file read: The IF/ID pipeline register supplies the 16-bit offset and the register numbers used to read the two registers. The sign-extended 32-bit offset, the two register values and the incremented PC value are stored in the ID/EX pipeline register.

    The active portions of the datapath, highlighted as a load instruction goes through the third stage of pipelined execution, are shown below:

    Execute and address calculation: The load instruction reads the contents of register 1 and the sign-extended 32-bit offset from ID/EX and adds them using the ALU. The result is placed in the EX/MEM pipeline register.

    The active portions of the datapath, highlighted as a load instruction goes through the fourth stage of pipelined execution, are shown below:

    Memory access: The load instruction reads the data memory using the address from the EX/MEM pipeline register and loads the data into the MEM/WB pipeline register.

    The active portions of the datapath, highlighted as a load instruction goes through the fifth stage of pipelined execution, are shown below:

    Write back: This is the final step. The data is read from the MEM/WB register and written into the register file in the middle of the datapath.

    The above steps show that any information needed in a later stage must be passed to it via a pipeline register.
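    The five steps of the load can be traced with a small Python sketch in which each pipeline register is a dictionary. The field names, the dictionary-based memories and the function names are illustrative assumptions. The sketch also carries the write register number (rt) along with the data, a detail these notes return to under "Corrected datapath of load instruction".

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_load(pc, instr_mem, regs, data_mem):
    """Push one lw instruction through IF, ID, EX, MEM and WB."""
    # IF: fetch the instruction and save PC + 4 in IF/ID.
    if_id = {"instr": instr_mem[pc], "pc_plus_4": pc + 4}

    # ID: read the base register, sign-extend the offset, fill ID/EX.
    instr = if_id["instr"]
    id_ex = {
        "read_data_1": regs[instr["rs"]],
        "offset": sign_extend16(instr["offset"]),
        "pc_plus_4": if_id["pc_plus_4"],
        "write_reg": instr["rt"],      # carried along for the WB stage
    }

    # EX: the ALU adds base and offset; the address goes to EX/MEM.
    ex_mem = {
        "alu_result": id_ex["read_data_1"] + id_ex["offset"],
        "write_reg": id_ex["write_reg"],
    }

    # MEM: read data memory; the loaded word goes to MEM/WB.
    mem_wb = {
        "mem_data": data_mem[ex_mem["alu_result"]],
        "write_reg": ex_mem["write_reg"],
    }

    # WB: write the loaded word into the register file.
    regs[mem_wb["write_reg"]] = mem_wb["mem_data"]
    return regs


# Hypothetical example: lw $10, 20($1) with $1 = 100 and Mem[120] = 55.
regs = run_load(0, {0: {"rs": 1, "rt": 10, "offset": 20}},
                {1: 100, 10: 0}, {120: 55})
print(regs[10])
```

    Each stage reads only the pipeline register on its left, which is the point made by the steps above.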

    Pipelined datapath of store instruction

    The first two stages are the same as for the load instruction. The other stages are shown below:

    Here also information is passed to the next stage via pipeline registers.

    Corrected datapath of load instruction:

    In the final stage of the load instruction, the write register number is required. By then the IM has been used to fetch later instructions, so the IF/ID register holds a different instruction and would supply the wrong write register number. The write register number of the load must therefore be preserved until the last stage by carrying it along in the pipeline registers.

    The corrected datapath, which passes the write register number first to ID/EX, then to EX/MEM and finally to MEM/WB, is shown below:

    Graphical representation of pipelines:

    To understand pipelines better, consider the multiple-clock-cycle pipeline diagram and the single-clock-cycle pipeline diagram.

    The multiple-clock-cycle diagram of the following five-instruction sequence is shown below:

    In this diagram time advances from left to right and instructions advance from top to bottom.
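    A text version of a multiple-clock-cycle diagram can be generated with a few lines of Python. This is a sketch for an ideal pipeline with no stalls; the instruction labels are placeholders, since the five-instruction sequence itself is not reproduced in these notes.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return the rows of a multiple-clock-cycle diagram (no stalls).

    Instruction i (counting from 0) occupies stage s in clock cycle i + s + 1.
    """
    n_cycles = len(instructions) + len(STAGES) - 1
    header = f"{'':10}" + "".join(f"{'CC' + str(c + 1):5}" for c in range(n_cycles))
    rows = [header.rstrip()]
    for i, name in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append((f"{name:10}" + "".join(f"{c:5}" for c in cells)).rstrip())
    return rows


for row in pipeline_diagram(["instr 1", "instr 2", "instr 3", "instr 4", "instr 5"]):
    print(row)
```

    Reading one column of the printed diagram from top to bottom gives the contents of the pipeline in that clock cycle.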

    Another version of the multiple-clock-cycle pipeline diagram is shown below:

    The single clock cycle diagram is shown below:

    It is a vertical slice through the multiple-clock-cycle diagram, showing the state of the pipeline during a single clock cycle.

    Pipelined Control:

    The control lines introduced in the pipelined datapath are shown below:

    All control lines are the same as in the single-cycle implementation without pipelining.

    There are no separate write control signals for the pipeline registers, because they are written during every clock cycle.

    The control lines are divided into five groups according to the pipeline stages.

    1. IF control signals: The control signals to read the IM and write the PC are always asserted, so there is nothing special to control in this pipeline stage.

    2. ID control signals: As in the previous stage, the same thing happens in every clock cycle, so there are no optional control lines to set.

    3. EX control signals: RegDst selects the result register, ALUOp selects the ALU operation, and ALUSrc selects the second ALU input from either Read data 2 or the sign-extended offset.

    4. MEM control signals: Branch is set for a branch instruction, MemRead for a load instruction and MemWrite for a store instruction. The PCSrc signal selects the branch target address when Branch is asserted and the ALU Zero output is set.

    5. WB control signals: MemtoReg selects either the ALU result or the memory read data to send to the register file, and RegWrite writes the chosen value into the register.

    Pipelining does not change the functions of the control lines; they are only grouped by pipeline stage.
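    The grouping of the control lines can be written down as a small Python table. The signal names are those listed above; the specific values per instruction and the two-bit ALUOp encodings follow the usual textbook settings and should be read as assumptions, since the table itself is not part of these notes. None marks a don't-care value.

```python
X = None  # don't care

CONTROL = {
    "rtype": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
              "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
              "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":    {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
              "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
              "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":    {"EX": {"RegDst": X, "ALUOp": 0b00, "ALUSrc": 1},
              "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
              "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":   {"EX": {"RegDst": X, "ALUOp": 0b01, "ALUSrc": 0},
              "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
              "WB": {"RegWrite": 0, "MemtoReg": X}},
}


def pc_src(branch, zero):
    """PCSrc is asserted only for a branch whose ALU Zero output is 1."""
    return branch & zero


print(CONTROL["lw"]["WB"])
print(pc_src(CONTROL["beq"]["MEM"]["Branch"], zero=1))
```

    Because IF and ID need no optional control lines, only the EX, MEM and WB groups appear in the table.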

    The full datapath with pipeline registers and control lines is shown below:

    Data Hazards and Forwarding:

    Consider the following instruction sequence, which has dependences.

    The last four instructions all depend on the result that the first instruction writes to register $2.

    The execution of these instructions in the pipeline is shown below:

    The value of register $2 changes from 10 to -20 in the middle of clock cycle CC5, when the first instruction writes its result. So the add and sw instructions get the correct value -20, but the and instruction and the or instruction get the wrong value 10.

    Looking carefully at the first instruction, its result is available at the end of the EX stage, i.e. at the end of CC3. The and instruction and the or instruction need the data at the beginning of their EX stages, i.e. in CC4 and CC5 respectively.

    Data forwarding resolves this hazard: the data is simply forwarded, as soon as it is available, to any unit that needs it, before it can be read from the register file.

    For simplicity, consider only forwarding to an operation in the EX stage, which is either an ALU operation or an effective address calculation.

    The following figure shows data being forwarded from the pipeline registers to the EX stage:

    Here the required data is available in time for the later instructions in the pipeline registers EX/MEM and MEM/WB.

    If the ALU inputs can be taken from any pipeline register rather than only from ID/EX, then the proper data can be forwarded. This requires multiplexers at the inputs of the ALU with proper control lines. With this arrangement the pipeline runs at full speed in the presence of these data dependences.

    The close-up of the ALU and pipeline registers before and after adding forwarding is shown below:

    Forwarding control is placed in the EX stage, because the ALU forwarding multiplexers are in that stage.

    The operand register numbers are passed from the ID stage via the ID/EX register, so that the forwarding unit can determine whether to forward values.
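    The decision made by the forwarding unit for one ALU operand can be sketched in Python as below. The conditions are the commonly used textbook ones; the field names (RegWrite, Rd), the function name and the select encodings 00, 10 and 01 are assumptions, and the example register numbers are hypothetical.

```python
def forward_select(operand_reg, ex_mem, mem_wb):
    """Return the ALU input multiplexer select for one source register.

    0b00 -- operand comes from the register file value held in ID/EX
    0b10 -- operand is forwarded from EX/MEM (result of the previous instruction)
    0b01 -- operand is forwarded from MEM/WB (result from two instructions back)
    Register 0 is never forwarded because it always reads as zero.
    """
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == operand_reg:
        return 0b10    # the most recent result takes priority
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == operand_reg:
        return 0b01
    return 0b00


# An instruction in EX reads $2 while the instruction that writes $2 is in MEM.
print(forward_select(2, {"RegWrite": 1, "Rd": 2}, {"RegWrite": 0, "Rd": 0}))

# One cycle later the producer of $2 is in WB and an unrelated write is in MEM.
print(forward_select(2, {"RegWrite": 1, "Rd": 12}, {"RegWrite": 1, "Rd": 2}))
```

    The same function is evaluated once for each of the two ALU operands, using the rs and rt register numbers carried in ID/EX.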

    The control values of the multiplexers and their effects are:

    Data hazards and stalls

    Consider the following illustration:

    Data forwarding cannot solve the data hazard introduced by a load instruction, as shown. Here the data is still being read from memory in CC4 while the ALU is already performing the operation for the following instruction.

    This problem is solved by stalling the pipeline for the combination of a load followed by an instruction that reads its result.

    In addition to the forwarding unit, a hazard detection unit is required. It operates during the ID stage so that it can insert a stall between the load and its use.

    If the instruction in the ID stage is stalled, then the instruction in the IF stage must also be stalled. This is accomplished by preventing the PC and the IF/ID pipeline register from changing.

    The stall itself is created by inserting a nop, an instruction that has no effect on execution.

    The following figure shows how a nop is inserted into the pipeline to stall it.

    Here the pipeline execution slot of the and instruction is turned into a nop, and all instructions beginning with the and instruction are delayed by one cycle.

    The hazard forces the and instruction and the or instruction to repeat in CC4 what they did in CC3: the and instruction reads registers and decodes, and the or instruction is re-fetched from instruction memory.

    The pipeline connections for both the hazard detection unit and the forwarding unit are shown below:

    The forwarding unit controls the ALU multiplexers to replace the value from a general-purpose register with the value from the proper pipeline register.

    The hazard detection unit controls the writing of the PC and IF/ID registers, and the multiplexer that chooses between the real control values and all 0s.

    The hazard detection unit stalls the pipeline and de-asserts the control fields if a load-use hazard is detected.
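    The load-use check performed by the hazard detection unit can be sketched as follows. The test is the commonly used textbook condition; the function names and the output names PCWrite, IF/IDWrite and ZeroControls are assumptions made for this illustration.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_controls(stall):
    """Outputs of the hazard detection unit for the current clock cycle."""
    return {
        "PCWrite": not stall,       # hold the PC during a stall
        "IF/IDWrite": not stall,    # hold the instruction in IF/ID
        "ZeroControls": stall,      # select all-0 control values (a nop)
    }


# A load into $2 is in EX while the instruction in ID reads $2: stall.
stall = load_use_stall(id_ex_mem_read=1, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
print(stall, hazard_controls(stall))
```

    After the one-cycle stall, the loaded value can be forwarded from MEM/WB to the ALU in the normal way.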

    Branch hazards or Control hazard

    Consider the following illustrations:

    With pipelining, an instruction is fetched in every clock cycle. But for a branch instruction, the decision about whether to branch is not made until the MEM pipeline stage.

    This delay in determining the proper instruction to fetch is called a branch hazard or control hazard.

    Control hazards are relatively simple to understand and occur less frequently than data hazards.

    Branch stalling

    One option is to stall until the branch is complete, but this is too slow.

    An improvement is to assume that the branch will not be taken and to continue execution down the sequential instruction stream.

    If the branch is taken, the instructions that are being fetched and decoded must be discarded.

    If branches are not taken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards.

    To discard instructions (also called flushing instructions), the control values are changed to 0s, which is similar to the stall for a load-use data hazard. The difference is that the instructions in the IF, ID and EX stages must all be flushed when the branch reaches the MEM stage.
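    The claim that assuming "not taken" halves the cost can be checked with a short calculation. This sketch assumes a penalty of three cycles per flushed branch, matching the three instructions in IF, ID and EX described above; the function names are illustrative.

```python
def stall_always_cost(flush_cycles=3):
    """Average cycles lost per branch when every branch stalls the pipeline."""
    return float(flush_cycles)


def predict_not_taken_cost(taken_fraction, flush_cycles=3):
    """Average cycles lost per branch when only taken branches are flushed."""
    return taken_fraction * flush_cycles


print(stall_always_cost())              # every branch pays the full penalty
print(predict_not_taken_cost(0.5))      # half the branches pay the penalty
```

    With half of the branches taken, the average loss per branch drops from 3 cycles to 1.5 cycles.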

    Branch prediction

    Another way to reduce the cost of control hazards is to reduce the delay of branches.

    The branch decision is made in the MEM stage, but if it is moved to an earlier stage, fewer instructions need to be flushed.

    In the MIPS architecture, branch instructions need only simple tests and do not require a full ALU operation.

    A branch instruction requires two actions: computing the branch target address and evaluating the branch decision.

    The address calculation is the easy part and can be moved from the EX stage to the ID stage, because the immediate offset field is available in the IF/ID pipeline register. The computed address is used only if the branch is taken.

    For the branch decision, the two register values are compared by XORing their corresponding bits and then ORing all the results; the registers are equal when the final result is 0.
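    The equality test described above can be written out bit by bit in Python. This is a sketch that mirrors the gate-level idea; the function name and the 32-bit default width are assumptions of the illustration.

```python
def registers_equal(a, b, width=32):
    """Compare two register values by XORing bits and ORing the results."""
    diff = (a ^ b) & ((1 << width) - 1)   # a 1 marks every bit position that differs
    any_diff = 0
    for bit in range(width):              # OR-reduce all the XOR outputs
        any_diff |= (diff >> bit) & 1
    return any_diff == 0                  # equal only if no bit differs


print(registers_equal(0x1234, 0x1234))
print(registers_equal(0x1234, 0x1235))
```

    No subtraction or carry chain is needed, which is why this test is simple enough to fit in the ID stage.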

    Moving the branch test to the ID stage requires additional forwarding and hazard detection hardware. Two factors complicate the implementation:

    1. During ID, the instruction must be decoded, the need for a bypass to the equality unit decided, and the equality comparison completed, so that the PC can be set to the branch target address if the branch is taken. Branch operands are forwarded in the same way as for data hazards, but the equality test unit in ID requires new forwarding logic.

    2. The values needed for the branch comparison may be produced later in time, which causes a data hazard. This requires a stall.

    Executing branches in the ID stage improves performance by reducing the penalty of a taken branch to only one instruction.

    Consider the following code:

    The implementation is:

    Super Scalar Processor:

    Superscalar processors are dynamic multiple-issue processors: instructions are fetched in order, but the processor decides whether zero, one or more instructions can issue in a given clock cycle.

    This improves the instruction execution rate.

    Dynamic issue decisions are commonly extended with dynamic pipeline scheduling, which chooses which instructions to execute in a given clock cycle while trying to avoid hazards and stalls.

    Consider the following code:

    In this case the sub instruction is ready to execute, but it has to wait for the first two instructions to complete. Dynamic pipeline scheduling avoids this type of hazard either partially or fully.
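    The situation described above can be illustrated with a tiny readiness check in Python. The instruction texts and register names below are hypothetical stand-ins (the original code listing is not reproduced in these notes), and the function is only a sketch of the idea, not a model of real issue logic.

```python
def ready_to_issue(window, pending):
    """Return the instructions whose source registers are all available.

    window  -- list of dicts with the instruction text and its source registers
    pending -- set of registers whose values are still being produced
    """
    return [ins["text"] for ins in window if not (set(ins["srcs"]) & pending)]


# Hypothetical case: a load into $t0 is still waiting on memory.
window = [
    {"text": "addu $t1, $t0, $t2", "srcs": ["$t0", "$t2"]},   # needs the loaded $t0
    {"text": "sub $s4, $s4, $t3", "srcs": ["$s4", "$t3"]},    # independent of the load
]
print(ready_to_issue(window, pending={"$t0"}))
```

    An in-order pipeline would hold the sub behind the stalled addu, whereas a dynamically scheduled pipeline may execute it as soon as it is found to be ready.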

    Dynamic pipeline scheduling

    Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering them to avoid stalls.

    A processor with this facility has three major units: an instruction fetch and issue unit, multiple functional units and a commit unit.

    A typical model is shown below:

    The first unit fetches instructions, decodes them and sends each instruction to the corresponding functional unit for execution.

    Each functional unit has buffers, called reservation stations, that hold the operands and the operation.

    As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated.

    The result is sent to any reservation stations waiting for it, as well as to the commit unit.

    The commit unit also has a buffer, called the reorder buffer. It is also used to supply operands, in much the same way as forwarding.