CS6303 Computer Architecture Notes for 2013 Regulation



Kingston Engineering College
Chittoor Main Road, Katpadi, Vellore 632 059.

Approved by AICTE, New Delhi; affiliated to Anna University, Chennai

    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

    THIRD SEMESTER

    CS6303 COMPUTER ARCHITECTURE

    NOTES

    Prepared By

    Mr. M. AZHAGIRI AP/CSE


CS6303 COMPUTER ARCHITECTURE    L T P C  3 0 0 3

OBJECTIVES:

• To make students understand the basic structure and operation of a digital computer.
• To understand the hardware-software interface.
• To familiarize the students with the arithmetic and logic unit and the implementation of fixed-point and floating-point arithmetic operations.
• To expose the students to the concept of pipelining.
• To familiarize the students with the hierarchical memory system, including cache memories and virtual memory.
• To expose the students to different ways of communicating with I/O devices and standard I/O interfaces.

UNIT I  OVERVIEW & INSTRUCTIONS  9
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.

UNIT II  ARITHMETIC OPERATIONS  7
ALU – Addition and subtraction – Multiplication – Division – Floating point operations – Subword parallelism.

UNIT III  PROCESSOR AND CONTROL UNIT  11
Basic MIPS implementation – Building datapath – Control implementation scheme – Pipelining – Pipelined datapath and control – Handling data hazards & control hazards – Exceptions.

UNIT IV  PARALLELISM  9
Instruction-level parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors.

UNIT V  MEMORY AND I/O SYSTEMS  9
Memory hierarchy – Memory technologies – Cache basics – Measuring and improving cache performance – Virtual memory, TLBs – Input/output system, programmed I/O, DMA and interrupts, I/O processors.

TOTAL: 45 PERIODS

OUTCOMES: At the end of the course, the student should be able to:

• Design an arithmetic and logic unit.
• Design and analyse pipelined control units.
• Evaluate the performance of memory systems.
• Understand parallel processing architectures.

TEXT BOOK:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth edition, Morgan Kaufmann / Elsevier, 2014.

REFERENCES:
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organisation", VI edition, McGraw-Hill Inc, 2012.
2. William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006.
3. Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005.
4. Govindarajalu, "Computer Architecture and Organization, Design Principles and Applications", First edition, Tata McGraw Hill, New Delhi, 2005.
5. John P. Hayes, "Computer Architecture and Organization", Third Edition, Tata McGraw Hill, 1998.
6. http://nptel.ac.in/


    UNIT I

    OVERVIEW & INSTRUCTIONS

Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.

1.1 Eight ideas
1.2 Components of a computer system
1.3 Technology and Performance
1.4 Power wall
1.5 Uniprocessors to multiprocessors
1.6 Instructions – operations and operands
1.7 Representing instructions
1.8 Logical operations
1.9 Control operations
1.10 Addressing and addressing modes

    1.1 EIGHT IDEAS

    These ideas are so powerful they have lasted long after the first computer that used them.

1. Design for Moore's Law

2. Use Abstraction to Simplify Design

    3. Make the Common Case Fast

    4. Performance via Parallelism

    5. Performance via Pipelining

    6. Performance via Prediction

    7. Hierarchy of Memories

8. Dependability via Redundancy

Design for Moore's Law

Moore's Law states that integrated circuit resources double every 18 to 24 months. Computer architects must anticipate where the technology will be when the design finishes rather than design for where it starts, since the resources available per chip can easily double or quadruple between the start and finish of a project.


    Use Abstraction to Simplify Design

A major productivity technique for hardware and software is to use abstractions to represent the design at different levels of representation: lower-level details are hidden to offer a simpler model at higher levels.

    Make the Common Case Fast

Making the common case fast will tend to enhance performance better than optimizing the rare case. Ironically, the common case is often simpler than the rare case and hence is often easier to enhance.

    Performance via Parallelism

    Computer architects have offered designs that get more performance by performing

    operations in parallel.

    Performance via Pipelining

    A particular pattern of parallelism is so prevalent in computer architecture that it merits its

    own name: pipelining.

    Performance via Prediction

With prediction, in some cases it can be faster on average to guess and start working rather than wait until you know for sure, assuming that the mechanism to recover from a misprediction is not too expensive and your prediction is relatively accurate.

    Hierarchy of Memories

Programmers want memory to be fast, large, and cheap, as memory speed often shapes performance, capacity limits the size of problems that can be solved, and the cost of memory today is often the majority of computer cost. Architects address these conflicting demands with a hierarchy of memories, with the fastest, smallest, and most expensive memory per bit at the top of the hierarchy and the slowest, largest, and cheapest per bit at the bottom. Caches give the programmer the illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and cheap as the bottom of the hierarchy.

    Dependability via Redundancy

Computers not only need to be fast; they need to be dependable. Since any physical device can fail, we make systems dependable by including redundant components that can take over when a failure occurs and that can help detect failures.

    1.2 COMPONENTS OF A COMPUTER SYSTEM

Software is organized primarily in a hierarchical fashion, with applications being the outermost ring and a variety of systems software sitting between the hardware and applications software. There are many types of systems software, but two types are central to every computer system today: an operating system and a compiler. An operating system interfaces between a user's program and the hardware and provides a variety of services and supervisory functions. Among the most important functions are:

• Handling basic input and output operations
• Allocating storage and memory
• Providing for protected sharing of the computer among multiple applications using it simultaneously.

Examples of operating systems in use today are Linux, iOS, and Windows.

FIGURE: A simplified view of hardware and software as hierarchical layers.


Compilers perform another vital function: the translation of a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute.

From a High-Level Language to the Language of Hardware

Assembler: This program translates a symbolic version of an instruction into the binary version. For example, the programmer would write

add A,B

and the assembler would translate this notation into 1000110010100000.

The binary language that the machine understands is the machine language. Assembly language requires the programmer to write one line for every instruction that the computer will follow, forcing the programmer to think like the computer. Later, high-level programming languages and compilers were introduced that translate high-level language into instructions. For example:

High-level language:        a = a + b;
Assembly language:          add A,B
Binary / machine language:  1000110010100000

High-level programming languages offer several important benefits:

• They allow the programmer to think in a more natural language, using English words and algebraic notation. Fortran was designed for scientific computation, Cobol for business data processing, and Lisp for symbol manipulation.
• They improve programmer productivity.
• They allow programs to be independent of the computer on which they were developed, since compilers and assemblers can translate high-level language programs to the binary instructions of any computer.


    5 CLASSIC COMPONENTS OF A COMPUTER

The five classic components of a computer are input, output, memory, data path, and control, with the last two sometimes combined and called the processor.

    I/O EQUIPMENT:

    The most fascinating I/O device is probably the graphics display.

Liquid crystal displays (LCDs) are used to get a thin, low-power display. The LCD is not the source of light; instead, it controls the transmission of light.

A typical LCD includes rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display, from either a light source behind the display or, less often, from reflected light. The rods straighten out when a current is applied and no longer bend the light. Since the liquid crystal material is between two screens polarized at 90 degrees, the light cannot pass through unless it is bent. Today, most LCD displays use an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images. A red-green-blue mask associated with each dot on the display determines the intensity of the three colour components in the final image; in a colour active matrix LCD, there are three transistor switches at each point.

The image is composed of a matrix of picture elements, or pixels, which can be represented as a matrix of bits, called a bit map. A colour display might use 8 bits for each of the three colours (red, blue, and green). The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics display at the refresh rate.

The processor is the active part of the computer, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. The processor logically comprises two main components: data path and control, the respective brawn and brain of the processor. The data path performs the arithmetic operations, and control tells the data path, memory, and I/O devices what to do according to the instructions of the program.

The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. The memory is built from DRAM chips. DRAM stands for dynamic random access memory. Multiple DRAMs are used together to contain the instructions and data of a program. In contrast to sequential access memories, such as magnetic tapes, the RAM portion of the term DRAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.

1.3.1 Technology: the chip manufacturing process

The manufacture of a chip begins with silicon, a substance found in sand. Because silicon does not conduct electricity well, it is called a semiconductor. With a special chemical process, it is possible to add materials to silicon that allow tiny areas to transform into one of three devices:

• Excellent conductors of electricity (using either microscopic copper or aluminium wire)
• Excellent insulators from electricity (like plastic sheathing or glass)
• Areas that can conduct or insulate under special conditions (as a switch)

Transistors fall in the last category. A VLSI circuit, then, is just billions of combinations of conductors, insulators, and switches manufactured in a single small package.

The figure shows the process for integrated chip manufacturing. The process starts with a silicon crystal ingot, which looks like a giant sausage. Today, ingots are 8 to 12 inches in diameter and about 12 to 24 inches long. An ingot is finely sliced into wafers no more than 0.1 inches thick. These wafers then go through a series of processing steps, during which patterns of chemicals are placed on each wafer, creating the transistors, conductors, and insulators.

The simplest way to cope with imperfection is to place many independent components on a single wafer. The patterned wafer is then chopped up, or diced, into these components, called dies and more informally known as chips. To reduce the cost, the next-generation process shrinks a large die as it uses smaller sizes for both transistors and wires. This improves the yield and the die count per wafer.


Once you've found good dies, they are connected to the input/output pins of a package, using a process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging, and then they are shipped to customers.

    1.3.2 PERFORMANCE

Running a program on two different desktop computers, you'd say that the faster one is the desktop computer that gets the job done first. If you were running a data centre that had several servers running jobs submitted by many users, you'd say that the faster computer was the one that completed the most jobs during a day. As an individual computer user, you are interested in reducing response time, the time between the start and completion of a task, also referred to as execution time.

Data centre managers are often interested in increasing throughput or bandwidth, the total amount of work done in a given time.

    Measuring Performance:

The computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. CPU execution time, or simply CPU time, is the time the CPU spends computing for a task; it does not include time spent waiting for I/O or running other programs. CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks on behalf of the program, called system CPU time.

The term system performance refers to elapsed time on an unloaded system, and CPU performance refers to user CPU time.

    CPU Performance and Its Factors:

CPU execution time for a program = CPU clock cycles for a program × Clock cycle time


    Alternatively, because clock rate and clock cycle time are inverses,

CPU execution time for a program = CPU clock cycles for a program / Clock rate

    This formula makes it clear that the hardware designer can improve performance by reducing

    the number of clock cycles required for a program or the length of the clock cycle.

    Instruction Performance:

The performance equations above did not include any reference to the number of instructions needed for the program. The execution time must depend on the number of instructions in a program: execution time equals the number of instructions executed multiplied by the average time per instruction. The clock cycles required for a program can therefore be written as

CPU clock cycles = Instructions for a program × Average clock cycles per instruction

The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is often abbreviated as CPI. CPI provides one way of comparing two different implementations of the same instruction set architecture, since the number of instructions executed for a program will be the same.

    The Classic CPU Performance Equation:

The basic performance equation in terms of instruction count (the number of instructions executed by the program), CPI, and clock cycle time:

CPU time = Instruction count × CPI × Clock cycle time

or, since the clock rate is the inverse of clock cycle time:

CPU time = (Instruction count × CPI) / Clock rate

    These formulas are particularly useful because they separate the three key factors that affect performance.

Components of performance              Units of measure
CPU execution time for a program       Seconds for the program
Instruction count                      Instructions executed for the program
Clock cycles per instruction (CPI)     Average number of clock cycles per instruction
Clock cycle time                       Seconds per clock cycle

We can measure the CPU execution time by running the program, and the clock cycle time is usually published as part of the documentation for a computer. The instruction count and CPI can be more difficult to obtain. Of course, if we know the clock rate and CPU execution time, we need only one of the instruction count or the CPI to determine the other.
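To make the equation concrete, here is a minimal sketch in Python (not from the notes; the numbers are made-up illustrative values):

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI x clock cycle time."""
    clock_cycle_time = 1.0 / clock_rate_hz       # seconds per clock cycle
    return instruction_count * cpi * clock_cycle_time

# Example: 10 million instructions, average CPI of 2.0, 1 GHz clock.
t = cpu_time(10_000_000, 2.0, 1_000_000_000)
print(f"CPU time = {t * 1000:.1f} ms")           # CPU time = 20.0 ms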


1.5 UNIPROCESSORS TO MULTIPROCESSORS

A multicore microprocessor is a chip that contains more than one processor, or "core". Hence, a "quad core" microprocessor is a chip that contains four processors or four cores. In the past, programmers could rely on innovations in hardware, architecture, and compilers to double performance of their programs every 18 months without having to change a line of code. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors. Moreover, to get the historic benefit of running faster on new microprocessors, programmers will have to continue to improve performance of their code as the number of cores increases. To reinforce how the software and hardware systems work hand in hand, we use a special section, Hardware/Software Interface, throughout the book, with the first one appearing below. These elements summarize important insights at this critical interface.

1. Increasing the clock speed of a uniprocessor has reached saturation; it cannot be increased beyond a certain limit because of power consumption and heat dissipation issues.

2. As the physical size of the chip decreased while the number of transistors per chip increased, clock speed increased, which boosted heat dissipation across the chip to a dangerous level, creating cooling and heat sink requirement issues.

    3. There were limitations in the use of silicon surface area.

    4. There were limitations in reducing the size of individual gates further.

5. To gain performance within a single core, many techniques like pipelining, superpipelining, and superscalar architectures are used.

6. Most of the early dual core processors were running at lower clock speeds; the rationale behind this is that a dual core processor with each core running at 1 GHz should be equivalent to a single core processor running at 2 GHz.

7. The problem is that this does not work in practice when the applications are not written to take advantage of the multiple processors. Until the software is written this way, unthreaded applications will run faster on a single processor than on a dual core CPU.

8. In multicore processors, the benefit is more on throughput than on response time.

    9. In the past, programmers could rely on innovations in the hardware, Architecture and

    compilers to double performance of their programs every 18 months without having to

    change a line of code.

10. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors, and they also have to improve the performance of their code as the number of cores increases.

The need of the hour is:


11. The ability to write parallel programs.

12. Care must be taken to reduce communication and synchronization overhead. Challenges in scheduling and load balancing have to be addressed.

    1.6 INSTRUCTIONS – OPERATIONS AND OPERANDS

    Operations in MIPS:

    Every computer must be able to perform arithmetic. The MIPS assembly language notation

    add a, b, c

instructs a computer to add the two variables b and c and to put their sum in a.

    This notation is rigid in that each MIPS arithmetic instruction performs only one operation

    and must always have exactly three variables.

EXAMPLE: To add the four variables b, c, d, e and store the result in a:

    add a, b, c # The sum of b and c is placed in a

    add a, a, d # The sum of b, c, and d is now in a

    add a, a, e # The sum of b, c, d, and e is now in a

    Thus, it takes three instructions to sum the four variables.


    Design Principle 1: Simplicity favours regularity.

    EXAMPLE: Compiling Two C Assignment Statements into MIPS

    This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved

    from C, this example and the next few work for either high-level programming language:

    a = b + c;

    d = a – e;

The translation from C to MIPS assembly language instructions is performed by the compiler. A MIPS instruction operates on two source operands and places the result in one destination operand. Hence, the two simple statements above compile directly into these two MIPS assembly language instructions:

    add a, b, c

    sub d, a, e

    Operands in MIPS:

The operands of arithmetic instructions are restricted; they must be from a limited number of special locations built directly in hardware called registers. The size of a register in the MIPS architecture is 32 bits; groups of 32 bits occur so frequently that they are given the name word in the MIPS architecture.

    Design Principle 2: Smaller is faster.

A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther. So, 32 registers are used in the MIPS architecture. The MIPS convention is to use two-character names following a dollar sign to represent a register, e.g. $s0, $s1.

    Example: f = (g + h) – (i + j); instructions using registers.

    add $t0,$s1,$s2 # register $t0 contains g + h

    add $t1,$s3,$s4 # register $t1 contains i + j

    sub $s0,$t0,$t1 # f gets $t0 – $t1, which is (g + h) – (i + j)

Memory Operands: Programming languages have simple variables that contain single data elements, as in these examples, but they also have more complex data structures, such as arrays and structures. These complex data structures can contain many more data elements than there are registers in a computer. The processor can keep only a small amount of data in registers, but computer memory contains billions of data elements. So, MIPS must include instructions that transfer data between memory and registers.


Such instructions are called data transfer instructions. To access a word in memory, the instruction must supply the memory address.

The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a constant and a register used to access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address. The actual MIPS name for this instruction is lw, standing for load word.

EXAMPLE: g = h + A[8];

To get A[8] from memory, use lw:
lw $t0,8($s3)      # Temporary reg $t0 gets A[8]
Then use the result of A[8] stored in $t0:
add $s1,$s2,$t0    # g = h + A[8]

The constant in a data transfer instruction (8) is called the offset, and the register added to form the address ($s3) is called the base register.

In MIPS, words must start at addresses that are multiples of 4. This requirement is called an alignment restriction, and many architectures have it. (Since in MIPS each 32 bits form a word in memory, the address of one word to the next jumps in multiples of 4.)


Byte addressing also affects the array index. To get the proper byte address in the code above, the offset to be added to the base register $s3 must be 4 × 8, or 32 (as per the previous example).

EXAMPLE: g = h + A[8]; (implemented based on byte address)

To get A[8] from memory, use lw and calculate (8 × 4) = 32, which is the actual offset value:
lw $t0,32($s3)     # Temporary reg $t0 gets A[8]
Then use the result of A[8] stored in $t0:
add $s1,$s2,$t0    # g = h + A[8]

The instruction complementary to load is traditionally called store; it copies data from a register to memory. The format of a store is similar to that of a load: the name of the operation, followed by the register to be stored, then the offset to select the array element, and finally the base register. Once again, the MIPS address is specified in part by a constant and in part by the contents of a register. The actual MIPS name is sw, standing for store word.

EXAMPLE: A[12] = h + A[8];

lw $t0,32($s3)     # Temporary reg $t0 gets A[8]; note (8 × 4) used
add $t0,$s2,$t0    # Temporary reg $t0 gets h + A[8]
sw $t0,48($s3)     # Stores h + A[8] back into A[12]; note (12 × 4) used

    Constant or Immediate Operands:

For example, to add the constant 4 to register $s3, we could use the code

lw $t0,AddrConstant4($s1)   # $t0 = constant 4
add $s3,$s3,$t0             # $s3 = $s3 + $t0 ($t0 == 4)

An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one operand is a constant.

Example: the add immediate (addi) instruction.

    addi $s3,$s3,4 # $s3 = $s3 + 4

Constant operands occur frequently, and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.

INSTRUCTIONS AND THEIR TYPES IN MIPS

Registers are referred to in instructions, so there must be a convention to map register names into numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0 to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register 18, . . . , $t0 means register 8, $t1 means register 9, and so on.

MIPS instruction fields:

• op: Basic operation of the instruction, traditionally called the opcode.
• rs: The first register source operand.
• rt: The second register source operand.
• rd: The register destination operand. It gets the result of the operation.
• shamt: Shift amount. (Section 2.6 explains shift instructions and this term; it will not be used until then, and hence the field contains zero in this section.)
• funct: Function. This field, often called the function code, selects the specific variant of the operation in the op field.

    A problem occurs when an instruction needs longer fields than those shown above.

The MIPS designers kept all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. The format above is called R-type (for register) or R-format. A second type of instruction format is called I-type (for immediate) or I-format and is used by the immediate and data transfer instructions. The fields of the I-format are op, rs, rt, and a 16-bit constant or address.

Multiple formats complicate the hardware; we can reduce the complexity by keeping the formats similar. For example, the first three fields of the R-type and I-type formats are the same size and have the same names; the length of the fourth field in I-type is equal to the sum of the lengths of the last three fields of R-type. Note that the meaning of the rt field has changed for the I-type format: the rt field specifies the destination register.

    FIG: MIPS instruction encoding.


    Example: MIPS instruction encoding in computer hardware.

Consider A[300] = h + A[300]; the MIPS instructions for the operations are:

    lw $t0,1200($t1) # Temporary reg $t0 gets A[300]

    add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[300]

    sw $t0,1200($t1) # Stores h + A[300] back into A[300]

The table shows how hardware decodes and determines the three machine language instructions:

op   rs   rt   rd   address/shamt   funct   instruction
35   9    8    –    1200            –       lw
0    18   8    8    0               32      add
43   9    8    –    1200            –       sw

The lw instruction is identified by 35 (op field), the add instruction that follows is specified with 0 (op field), and the sw instruction is identified with 43 (op field).

Binary version of the above table:

100011 01001 01000 0000010010110000        lw
000000 10010 01000 01000 00000 100000      add
101011 01001 01000 0000010010110000        sw
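The same packing of fields into a 32-bit word can be reproduced in a few lines of code. Below is a minimal Python sketch (not from the notes) of the R-type and I-type encodings, with the field layout following the MIPS formats just described:

# MIPS encoding: R-type = op(6) rs(5) rt(5) rd(5) shamt(5) funct(6),
#                I-type = op(6) rs(5) rt(5) immediate(16).

def encode_r(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(op, rs, rt, imm):
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0,1200($t1)  -> op=35, rs=$t1(9), rt=$t0(8), imm=1200
print(f"{encode_i(35, 9, 8, 1200):032b}")       # lw
# add $t0,$s2,$t0   -> op=0, rs=$s2(18), rt=$t0(8), rd=$t0(8), funct=32
print(f"{encode_r(0, 18, 8, 8, 0, 32):032b}")   # add
# sw $t0,1200($t1)  -> op=43, rs=$t1(9), rt=$t0(8), imm=1200
print(f"{encode_i(43, 9, 8, 1200):032b}")       # sw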

    1.8 LOGICAL OPERATIONS

    List of logical operators used in MIPS and other languages along with symbolic notation.

    SHIFT LEFT (sll )

    The first class of such operations is called shifts. They move all the bits in a word to the left

    or right, filling the emptied bits with 0s.


For example, if register $s0 contained

0000 0000 0000 0000 0000 0000 0000 1001two = 9ten

and the instruction to shift left by 4 was executed, the new value would be:

0000 0000 0000 0000 0000 0000 1001 0000two = 144ten

The dual of a shift left is a shift right. The actual names of the two MIPS shift instructions are shift left logical (sll) and shift right logical (srl).

sll $t2,$s0,4    # reg $t2 = reg $s0 shifted left by 4 bits


    LOGICAL OR (or)

It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if the registers $t1 and $t2 are unchanged from the preceding example, i.e.,

register $t2 contains 0000 0000 0000 0000 0000 1101 1100 0000two

and register $t1 contains 0000 0000 0000 0000 0011 1100 0000 0000two

then, after executing the MIPS instruction

or $t0,$t1,$t2    # reg $t0 = reg $t1 | reg $t2

the value in register $t0 would be 0000 0000 0000 0000 0011 1101 1100 0000two.

(Example for bitwise OR, using the OR truth table for each bit:
  00101
  10111
  -----
  10111)

LOGICAL NOT (nor)

The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if the operand bit is a 0, and vice versa. Since MIPS needs a three-operand format, the designers of MIPS decided to include the instruction NOR (NOT OR) instead of NOT. NOR with zero as one operand gives the NOT of the other operand:

Step 1: Perform bitwise OR with a dummy operand register filled with zeros:
  00101
  00000
  -----
  00101

Step 2: Take the inverse of the above result, giving 11010.

Instruction: nor $t0,$t1,$t3    # reg $t0 = ~ (reg $t1 | reg $t3)

Constants are useful in AND and OR logical operations as well as in arithmetic operations, so MIPS also provides the instructions and immediate (andi) and or immediate (ori).
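The logical operations above can be mimicked directly; the following sketch (not from the notes) uses Python integers masked to 32 bits, which is an assumption of the illustration rather than anything MIPS-specific:

MASK32 = 0xFFFFFFFF

def sll(x, n): return (x << n) & MASK32    # shift left logical
def srl(x, n): return x >> n               # shift right logical
def or_(a, b): return (a | b) & MASK32     # bitwise OR
def nor(a, b): return ~(a | b) & MASK32    # NOR; nor(x, 0) == NOT x

print(sll(9, 4))                            # 144, matching the sll example above
print(f"{nor(0b00101, 0) & 0b11111:05b}")   # 11010, matching the NOR example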

    1.9 CONTROL OPERATIONS

    Branch and Conditional branches: Decision making is commonly represented in

    programming languages using the if statement, sometimes combined with go to statements

    and labels. MIPS assembly language includes two decision-making instructions, similar to an

if statement with a go to. The first instruction is

beq register1,register2,L1

This instruction means go to the statement labelled L1 if the value in register1 equals the value in register2.


The mnemonic beq stands for branch if equal. The second instruction is

bne register1,register2,L1

It means go to the statement labelled L1 if the value in register1 does not equal the value in register2.

    The mnemonic bne stands for branch if not equal. These two instructions are traditionally

    called conditional branches.

EXAMPLE: if (i == j) f = g + h; else f = g – h; the MIPS version of the given statements is

      bne $s3,$s4,Else     # go to Else if i ≠ j
      add $s0,$s1,$s2      # f = g + h (skipped if i ≠ j)
      j Exit               # go to Exit
Else: sub $s0,$s1,$s2      # f = g – h (skipped if i = j)
Exit:

Here bne is used instead of beq because testing for the opposite condition (not equal) lets the code fall through to the then-part and branch around it otherwise, which is generally more efficient. This example introduces another kind of branch, often called an unconditional branch. This instruction says that the processor always follows the branch. To distinguish between conditional and unconditional branches, the MIPS name for this type of instruction is jump, abbreviated as j. (In the example, f, g, h, i, and j are variables mapped to five registers $s0 through $s4.)

    Loops:

Decisions are important both for choosing between two alternatives, found in if statements, and for iterating a computation, found in loops. The same assembly instructions are the basic building blocks for both cases (if and loop).

EXAMPLE: while (save[i] == k) i += 1; the MIPS version of the given statement follows.

Assume that i and k correspond to registers $s3 and $s5 and the base of the array save is in $s6.

The first instruction multiplies i by 4 to get the byte offset. To get the address of save[i], we add $t1 and the base of save in $s6, and use that address to load save[i] into a temporary register. The bne instruction performs the loop test, exiting if save[i] ≠ k; the addi adds 1 to i; and the jump at the end of the loop branches back to the while test at the top. The Exit label is added after the loop, and we're done:

Loop: sll $t1,$s3,2      # Temp reg $t1 = i * 4
      add $t1,$t1,$s6    # $t1 = address of save[i]
      lw $t0,0($t1)      # Temp reg $t0 = save[i]
      bne $t0,$s5,Exit   # go to Exit if save[i] ≠ k
      addi $s3,$s3,1     # i = i + 1
      j Loop             # go to Loop
Exit:

    1.10 ADDRESSING AND ADDRESSING MODES.

    Addressing types:

Three-address instructions
Syntax: opcode source1, source2, destination
Eg: ADD A,B,C (operation is A = B + C)

Two-address instructions
Syntax: opcode source, destination
Eg: ADD A,B (operation is A = A + B)

One-address instructions (to fit in one word length)
Syntax: opcode source
Eg: STORE C (copies the content of the accumulator to memory location C), where the accumulator is an implicit processor register.

Zero-address instructions (stack operation)
Syntax: opcode
Eg: PUSH A (all addresses are implicit; pushes the value in A onto the stack)

    Addressing Modes:

The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes. An addressing mode is the method used to determine which part of memory is being referred to by a machine instruction.

Register mode: The operand is the content of a processor register. The register name/address is given in the instruction. Here the value of R2 is moved to R1.

Example: MOV R1, R2


Absolute mode (direct): The operand is in a memory location, and the address of the location is given explicitly. Here the value in A is moved to memory location 1000H.

Example: MOV 1000, A

Immediate mode: Address and data constants can be given explicitly in the instruction. Here the constant value 200 is moved to register R0.

Example: MOV #200, R0

Indirect mode: The processor reads the register content (R1 in this case), which is not the value itself but the address or location in which the value is stored. The fetched value is then added to the value in register R0.

Example: ADD (R1), R0

Indexed / relative addressing mode: The processor takes the R1 register content as the base address and adds the constant 20 (offset / displacement) to it to get the actual memory location of the value stored in memory. It fetches that value and then adds it to register R2.

Example: ADD 20(R1), R2

Auto-increment and auto-decrement modes: The value in the register / address that is supplied in the instruction is incremented or decremented.

Example: Increment R1 (increments the given register / address content by one)
Example: Decrement R2 (decrements the given register / address content by one)
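To make the modes concrete, here is a small illustrative sketch (not from the notes) that models registers and memory as Python dictionaries and evaluates an operand under each mode; the names regs, mem, and the mode tags are assumptions of this example:

# A toy model of operand fetch under different addressing modes.
regs = {"R0": 7, "R1": 100}          # R1 holds an address (for indirect/indexed)
mem  = {100: 55, 120: 99, 1000: 0}

def operand(mode, value):
    if mode == "register":           # operand is the register's content
        return regs[value]
    if mode == "immediate":          # operand is the constant itself
        return value
    if mode == "direct":             # operand is in memory at the given address
        return mem[value]
    if mode == "indirect":           # register holds the address of the operand
        return mem[regs[value]]
    if mode == "indexed":            # base register + offset gives the address
        base, offset = value
        return mem[regs[base] + offset]
    raise ValueError(mode)

print(operand("register", "R0"))         # 7
print(operand("immediate", 200))         # 200
print(operand("indirect", "R1"))         # mem[100] = 55
print(operand("indexed", ("R1", 20)))    # mem[120] = 99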


UNIT II
ARITHMETIC OPERATIONS

ALU – Addition and subtraction – Multiplication – Division – Floating point operations – Subword parallelism.

2.1 ALU
2.2 Addition and subtraction
2.3 Multiplication
2.4 Division
2.5 Floating point operations
2.6 Subword parallelism

2.1 ALU (ARITHMETIC LOGIC UNIT)

An arithmetic logic unit (ALU) is a digital electronic circuit that performs arithmetic and bitwise logical operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point numbers. An ALU is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs). A single CPU, FPU or GPU may contain multiple ALUs.

The inputs to an ALU are the data to be operated on, called operands, and a code indicating the operation to be performed; the ALU's output is the result of the performed operation. In many designs, the ALU also exchanges additional information with a status register, which relates to the result of the current or previous operations.

    A symbolic representation of an ALU and its input and output signals, indicated by arrows pointing into or out of the ALU, respectively. Each arrow represents one or more signals.

Signals

An ALU has a variety of input and output nets, which are the shared electrical connections used to convey digital signals between the ALU and external circuitry. When an ALU is operating, external circuits apply signals to the ALU inputs and, in response, the ALU produces and conveys signals to external circuitry via its outputs.


Data

A basic ALU has three parallel data buses consisting of two input operands (A and B) and a result output (Y). Each data bus is a group of signals that conveys one binary integer number. Typically, the A, B and Y bus widths (the number of signals comprising each bus) are identical and match the native word size of the encapsulating CPU (or other processor).

Opcode

The opcode input is a parallel bus that conveys to the ALU an operation selection code, which is an enumerated value that specifies the desired arithmetic or logic operation to be performed by the ALU. The opcode size (its bus width) is related to the number of different operations the ALU can perform; for example, a four-bit opcode can specify up to sixteen different ALU operations. Generally, an ALU opcode is not the same as a machine language opcode, though in some cases it may be directly encoded as a bit field within a machine language opcode.

Status

The status outputs are various individual signals that convey supplemental information about the result of an ALU operation. These outputs are usually stored in registers so they can be used in future ALU operations or for controlling conditional branching. The collection of bit registers that store the status outputs is often treated as a single, multi-bit register, which is referred to as the "status register" or "condition code register". General-purpose ALUs commonly have status signals such as:

• Carry-out, which conveys the carry resulting from an addition operation, the borrow resulting from a subtraction operation, or the overflow bit resulting from a binary shift operation.
• Zero, which indicates all bits of the Y bus are logic zero.
• Negative, which indicates the result of an arithmetic operation is negative.
• Overflow, which indicates the result of an arithmetic operation has exceeded the numeric range of the Y bus.
• Parity, which indicates whether an even or odd number of bits on the Y bus are logic one.

The status input allows additional information to be made available to the ALU when performing an operation. Typically, this is a "carry-in" bit that is the stored carry-out from a previous ALU operation.

Circuit operation

An ALU is a combinational logic circuit, meaning that its outputs will change asynchronously in response to input changes. In normal operation, stable signals are applied to all of the ALU inputs and, when enough time (known as the "propagation delay") has passed for the signals to propagate through the ALU circuitry, the result of the ALU operation appears at the ALU outputs. The external circuitry connected to the ALU is responsible for ensuring the stability of ALU input signals throughout the operation, and for allowing sufficient time for the signals to propagate through the ALU before sampling the ALU result.


For example, a CPU begins an ALU addition operation by routing operands from their sources (which are usually registers) to the ALU's operand inputs, while the control unit simultaneously applies a value to the ALU's opcode input, configuring it to perform addition. At the same time, the CPU also routes the ALU result output to a destination register that will receive the sum. The ALU's input signals, which are held stable until the next clock, are allowed to propagate through the ALU and to the destination register while the CPU waits for the next clock. When the next clock arrives, the destination register stores the ALU result and, since the ALU operation has completed, the ALU inputs may be set up for the next ALU operation.

FIG: The combinational logic circuitry of the 74181 integrated circuit, which is a simple four-bit ALU.

Functions

A number of basic arithmetic and bitwise logic functions are commonly supported by ALUs. Basic, general purpose ALUs typically include these operations in their repertoires:

Arithmetic operations

• Add: A and B are summed and the sum appears at Y and carry-out.


• Add with carry: A, B and carry-in are summed and the sum appears at Y and carry-out.
• Subtract: B is subtracted from A (or vice versa) and the difference appears at Y and carry-out. For this function, carry-out is effectively a "borrow" indicator. This operation may also be used to compare the magnitudes of A and B; in such cases the Y output may be ignored by the processor, which is only interested in the status bits (particularly zero and negative) that result from the operation.
• Subtract with borrow: B is subtracted from A (or vice versa) with borrow (carry-in) and the difference appears at Y and carry-out (borrow out).
• Two's complement (negate): A (or B) is subtracted from zero and the difference appears at Y.
• Increment: A (or B) is increased by one and the resulting value appears at Y.
• Decrement: A (or B) is decreased by one and the resulting value appears at Y.
• Pass through: all bits of A (or B) appear unmodified at Y. This operation is typically used to determine the parity of the operand or whether it is zero or negative.

Bitwise logical operations

• AND: the bitwise AND of A and B appears at Y.
• OR: the bitwise OR of A and B appears at Y.
• Exclusive-OR: the bitwise XOR of A and B appears at Y.
• One's complement: all bits of A (or B) are inverted and appear at Y.

Bit shift operations

ALU shift operations cause operand A (or B) to shift left or right (depending on the opcode), and the shifted operand appears at Y. Simple ALUs typically can shift the operand by only one bit position, whereas more complex ALUs employ barrel shifters that allow them to shift the operand by an arbitrary number of bits in one operation. In all single-bit shift operations, the bit shifted out of the operand appears on carry-out; the value of the bit shifted into the operand depends on the type of shift.

• Arithmetic shift: the operand is treated as a two's complement integer, meaning that the most significant bit is a "sign" bit and is preserved.
• Logical shift: a logic zero is shifted into the operand. This is used to shift unsigned integers.
• Rotate: the operand is treated as a circular buffer of bits so its least and most significant bits are effectively adjacent.
• Rotate through carry: the carry bit and operand are collectively treated as a circular buffer of bits.
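As a concrete illustration of these functions, here is a minimal sketch (not from the notes) of a 4-bit ALU modelled in Python; the opcode table and flag names are assumptions of this example:

# A toy 4-bit ALU: dispatch on an opcode, mask to 4 bits,
# and report carry/zero/negative status flags.
WIDTH = 4
MASK = (1 << WIDTH) - 1   # 0b1111

def alu(opcode, a, b, carry_in=0):
    if opcode == "add":
        raw = a + b
    elif opcode == "adc":                 # add with carry
        raw = a + b + carry_in
    elif opcode == "sub":                 # A - B via two's complement of B
        raw = a + ((~b & MASK) + 1)
    elif opcode == "and":
        raw = a & b
    elif opcode == "or":
        raw = a | b
    elif opcode == "xor":
        raw = a ^ b
    else:
        raise ValueError(opcode)
    y = raw & MASK
    flags = {
        "carry": (raw >> WIDTH) & 1,          # bit carried out of the result
        "zero": int(y == 0),
        "negative": (y >> (WIDTH - 1)) & 1,   # MSB viewed as a sign bit
    }
    return y, flags

print(alu("add", 0b0111, 0b0001))   # (8, carry=0, zero=0, negative=1)
print(alu("sub", 0b0101, 0b0101))   # (0, carry=1, zero=1, negative=0)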

Big endian: In big endian, you store the most significant byte at the smallest address.
Little endian: In little endian, you store the least significant byte at the smallest address.

    Fixed-point arithmetic

A fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (the decimal point '.' in English decimal notation). Fixed-point number representation can be compared to the more complicated (and more computationally demanding) floating-point number representation. Fixed-point numbers are useful for representing fractional values, usually in base 2 or base 10, when the executing processor has no floating point unit (FPU) or if fixed-point provides improved performance or accuracy for the application at hand.

Sign-Magnitude

The sign-magnitude binary format is the simplest conceptual format. To represent a number in sign-magnitude, we simply use the leftmost bit to represent the sign, where 0 means positive, and the remaining bits to represent the magnitude:

B7 = sign bit; B6 B5 B4 B3 B2 B1 B0 = magnitude

What are the decimal values of the following 8-bit sign-magnitude numbers?

10000011 = -3
00000101 = +5
11111111 = ?
01111111 = ?

1's complement

The 1's complement of a number is found by changing all 1's to 0's and all 0's to 1's. This is called taking the complement or 1's complement. An example of 1's complement is as follows.

2's complement

The 2's complement of a binary number is obtained by adding 1 to the Least Significant Bit (LSB) of the 1's complement of the number.

2's complement = 1's complement + 1

An example of 2's complement is as follows.
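A small sketch (not from the notes) of both complements on 8-bit values; the 8-bit width is an assumption of this illustration:

# 1's and 2's complement of an 8-bit value.
BITS = 8
MASK = (1 << BITS) - 1   # 0xFF

def ones_complement(x):
    return ~x & MASK                         # flip every bit

def twos_complement(x):
    return (ones_complement(x) + 1) & MASK   # 1's complement + 1

x = 0b00000101                       # +5
print(f"{ones_complement(x):08b}")   # 11111010
print(f"{twos_complement(x):08b}")   # 11111011, i.e. -5 in two's complement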


    2.2 ADDITION AND SUBTRACTION

    Binary Addition

Binary addition follows four rules: 0 + 0 = 0; 0 + 1 = 1; 1 + 0 = 1; 1 + 1 = 10. In the fourth case, the addition creates a sum of (1 + 1 = 10), i.e. 0 is written in the given column and a carry of 1 goes over to the next column.

    Example – Addition

Half Adder and Full Adder Circuits

Half Adder

The half adder adds two binary digits, called the augend and addend, and produces two outputs, sum and carry; XOR is applied to both inputs to produce the sum, and an AND gate is applied to both inputs to produce the carry. Using a half adder, you can design simple addition with the help of logic gates.

Half Adder logic circuit: see the half adder block diagram and truth table.


Full Adder

An adder is a digital circuit that performs addition of numbers. The full adder adds 3 one-bit numbers, where two can be referred to as operands and one can be referred to as the bit carried in. It produces a 2-bit output, and these can be referred to as output carry and sum.

This adder is more difficult to implement than a half adder. The difference between a half adder and a full adder is that the full adder has three inputs and two outputs, whereas the half adder has only two inputs and two outputs. The first two inputs are A and B and the third input is an input carry, C-IN. When full adder logic is designed, you string eight of them together to create a byte-wide adder and cascade the carry bit from one adder to the next.

    Full Adder Truth Table:

Implementation of a full adder with two half adders
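As a gate-level sketch (not from the notes), the half and full adder can be expressed with Boolean operators; building the full adder from two half adders follows the construction named in the caption above:

# Gate-level half adder and full adder.
def half_adder(a, b):
    s = a ^ b          # XOR gives the sum bit
    c = a & b          # AND gives the carry bit
    return s, c

def full_adder(a, b, c_in):
    # Two half adders plus an OR gate for the carry.
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, c_in)
    return s, c1 | c2

# Print the truth table of the full adder:
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = full_adder(a, b, c)
            print(a, b, c, "->", "sum", s, "carry", cout)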


N-Bit Parallel Adder

The full adder is capable of adding only two single-digit binary numbers along with a carry input. But in practice we need to add binary numbers which are much longer than just one bit. To add two n-bit binary numbers we need to use the n-bit parallel adder. It uses a number of full adders in cascade. The carry output of the previous full adder is connected to the carry input of the next full adder.

4-Bit Parallel Adder

In the block diagram, A0 and B0 represent the LSBs of the four-bit words A and B. Hence Full Adder-0 is the lowest stage, and its Cin has been permanently made 0. The rest of the connections are exactly the same as those of the n-bit parallel adder shown in the figure. The four-bit parallel adder is a very common logic circuit.

    Block diagram of N-Bit Parallel Adder

Binary Subtraction

Subtraction and borrow are two words that will be used very frequently for binary subtraction. The operation A − B is performed using four rules of binary subtraction:

1. Take the 2's complement of B.
2. Result = A + 2's complement of B.
3. If a carry is generated, then the result is positive and in the true form; in this case the carry is ignored.
4. If a carry is not generated, then the result is negative and in the 2's complement form.
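The four rules can be traced directly in code. A minimal sketch (not from the notes), assuming 4-bit operands:

# Binary subtraction A - B by adding the 2's complement of B (4-bit).
BITS = 4
MASK = (1 << BITS) - 1

def subtract(a, b):
    raw = a + (((~b) & MASK) + 1)      # A + 2's complement of B
    carry = raw >> BITS                # rules 3 and 4: inspect the carry
    result = raw & MASK
    if carry:                          # carry generated: positive, true form
        return result
    else:                              # no carry: negative, in 2's complement form
        return -(((~result) & MASK) + 1)

print(subtract(0b1010, 0b0011))   # 10 - 3 = 7
print(subtract(0b0011, 0b1010))   # 3 - 10 = -7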


N-Bit Parallel Subtractor

The subtraction can be carried out by taking the 1's or 2's complement of the number to be subtracted. For example, we can perform the subtraction (A − B) by adding either the 1's or 2's complement of B to A. That means we can use a binary adder to perform the binary subtraction.

4-Bit Parallel Subtractor

The number to be subtracted (B) is first passed through inverters to obtain its 1's complement. The 4-bit adder then adds A and the 2's complement of B to produce the subtraction. S3 S2 S1 S0 represents the result of the binary subtraction (A − B), and the carry output Cout represents the polarity of the result: if A ≥ B then Cout = 1 and the result is in true binary form (A − B); otherwise Cout = 0 and the result is in the 2's complement form.

Block diagram of N-Bit Subtractor


Half Subtractors

A half subtractor is a combinational circuit with two inputs and two outputs (difference and borrow). It produces the difference between the two binary bits at the input and also produces an output (borrow) to indicate if a 1 has been borrowed. In the subtraction (A − B), A is called the minuend bit and B is called the subtrahend bit.

    Truth Table Circuit Diagram

Full Subtractors

The disadvantage of a half subtractor is overcome by the full subtractor. The full subtractor is a combinational circuit with three inputs A, B, C and two outputs D and C'. A is the minuend, B is the subtrahend, C is the borrow produced by the previous stage, D is the difference output and C' is the borrow output.

    Truth Table Circuit Diagram

    2.3 MULTIPLICATION

Multiplication of decimal numbers in long hand can be used to show the steps of multiplication and the names of the operands.


Binary multiplication is similar to decimal multiplication. It is simpler than decimal multiplication because only 0s and 1s are involved. There are four rules of binary multiplication: 0 × 0 = 0; 0 × 1 = 0; 1 × 0 = 0; 1 × 1 = 1.

    Example

The number of digits in the product is considerably larger than the number in either the multiplicand or the multiplier. The length of the multiplication of an n-bit multiplicand and an m-bit multiplier is a product that is n + m bits long (sign bit ignored). So, n + m bits are required to represent all possible products. In this case one has to consider the overflow condition as well.


The multiplicand is shifted left each time, and the multiplier is shifted right after each bit has performed its intermediate step. The number of iterations to find the product is equal to the number of bits in the multiplier; in this case we have 32 iterations (MIPS).

Example: Multiply 2ten × 3ten, or 0010two × 0011two. (4 bits are used to save space.)
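The hardware's shift-and-add loop can be sketched in a few lines (not from the notes; 4-bit operands are assumed to match the example above):

# Shift-and-add multiplication, mirroring the hardware algorithm:
# examine each multiplier bit, conditionally add the (shifted)
# multiplicand to the product, then shift.
def shift_add_multiply(multiplicand, multiplier, bits=4):
    product = 0
    for _ in range(bits):              # one iteration per multiplier bit
        if multiplier & 1:             # LSB of multiplier decides the add
            product += multiplicand
        multiplicand <<= 1             # shift multiplicand left
        multiplier >>= 1               # shift multiplier right
    return product

print(shift_add_multiply(0b0010, 0b0011))   # 2 x 3 = 6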

    Booth's Algorithm of Multiplication

General steps of Booth's algorithm:


Step 1: Take the multiplicand M and the multiplier Q, and initialize A = 0, Q(n+1) = 0, and the sequence counter SC = n (the number of multiplier bits).
Step 2: Check the bit pair Q(0), Q(n+1).
Step 3: If the bits are 0,1 then add M to A and then perform an arithmetic right shift on A, Q, Q(n+1).
Step 4: If the bits are 1,0 then perform A + (M)' + 1 (i.e., A − M) and then perform the arithmetic right shift; if the bits are 0,0 or 1,1, perform only the shift.
Step 5: Decrement SC and check whether SC has reached 0.
Step 6: Repeat steps 2 to 5 until the count is exhausted.
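A runnable sketch of Booth's algorithm (not from the notes; the 4-bit width and helper names are assumptions of this illustration):

# Booth's multiplication: inspect the Q(0), Q(-1) bit pair, add or
# subtract M accordingly, then arithmetically right-shift A, Q, Q(-1).
def booth_multiply(m, q, bits=4):
    mask = (1 << bits) - 1
    M = m & mask
    A, Q, q_1 = 0, q & mask, 0
    for _ in range(bits):
        pair = (Q & 1, q_1)
        if pair == (0, 1):            # add M to A
            A = (A + M) & mask
        elif pair == (1, 0):          # subtract M from A
            A = (A - M) & mask
        # arithmetic right shift of the combined A, Q, Q(-1)
        q_1 = Q & 1
        Q = ((Q >> 1) | ((A & 1) << (bits - 1))) & mask
        sign = A >> (bits - 1)
        A = ((A >> 1) | (sign << (bits - 1))) & mask
    result = (A << bits) | Q          # 2*bits-wide product in A,Q
    if result >> (2 * bits - 1):      # interpret as a signed value
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(2, 3))    #  6
print(booth_multiply(2, -3))   # -6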


    Binary Division Hardware


Division algorithms fall into two main categories: slow division and fast division. Slow division algorithms produce one digit of the final quotient per iteration. Examples of slow division include restoring, non-performing restoring, non-restoring, and SRT division. Fast division methods start with a close approximation to the final quotient and produce twice as many digits of the final quotient on each iteration. Newton-Raphson and Goldschmidt fall into this category.

Sequential Restoring Division

• A shift register keeps both the (remaining) dividend as well as the quotient.
• With each cycle, the dividend decreases by one digit and the quotient increases by one digit.
• The MSBs of the remaining dividend and the divisor are aligned in each cycle.
• Major difference to multiplication:
  1. we do not know if we can subtract the divisor or not;
  2. if the subtraction failed, we have to restore the original dividend.

Procedure
1. Load the 2n-bit dividend into both halves of the shift register, and add a sign bit to the left.
2. Add a sign bit to the left of the divisor.
3. Generate the 2's complement of the divisor.
4. Shift to the left.
5. Add the 2's complement of the divisor to the upper half of the shift register, including the sign bit (subtract).
6. If the sign of the result is cleared (positive), then set the LSB of the lower half of the shift register to one; else clear the LSB of the lower half and add the divisor to the upper half of the shift register.
7. Repeat from step 4, performing the loop n times.
8. After termination:
• lower half of shift register ⇒ quotient
• upper half of shift register ⇒ remainder

Restoring Algorithm

Assume X is a k-bit dividend register, Y the k-bit divisor, and S a sign bit.

Start: Load 0 into the accumulator (k-bit A), and load the dividend X into the k-bit quotient register MQ.
Step A: Shift the 2k-bit register pair A-MQ left.
Step B: Subtract the divisor Y from A.
Step C: If the sign of A (MSB) = 1, then reset MQ0 (LSB) = 0, else set it to 1.
Step D: If MQ0 = 0, add Y (restoring the effect of the earlier subtraction).
Steps A to D repeat until the total number of cyclic operations = k. At the end, A has the remainder and MQ has the quotient.

The restoring division algorithm:

Do n times:
Shift A and Q left one binary position.
Subtract M from A, placing the answer back in A.
If the sign of A is 1, set q0 to 0 and add M back to A (restore A); otherwise, set q0 to 1.
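A C sketch of this restoring loop for unsigned 8-bit operands (the name restoring_divide is illustrative) keeps the partial remainder in A and builds the quotient in place of the dividend:

    #include <stdint.h>

    /* Restoring division: shift A:MQ left, subtract Y, and if the result
       went negative, clear the quotient bit and add Y back (restore). */
    void restoring_divide(uint8_t x, uint8_t y,
                          uint8_t *quotient, uint8_t *remainder)
    {
        int16_t A  = 0;                  /* accumulator / partial remainder */
        uint8_t MQ = x;                  /* dividend, becomes the quotient  */

        for (int i = 0; i < 8; i++) {    /* k = 8 cycles                    */
            A  = (int16_t)((A << 1) | (MQ >> 7));   /* shift A-MQ left      */
            MQ <<= 1;
            A -= y;                      /* subtract the divisor            */
            if (A < 0)                   /* sign set -> quotient bit 0,     */
                A += y;                  /* restore by adding Y back        */
            else
                MQ |= 1;                 /* sign clear -> quotient bit 1    */
        }
        *quotient  = MQ;
        *remainder = (uint8_t)A;         /* e.g. 8/3 -> quotient 2, rem 2   */
    }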

A restoring-division example: dividing 1000two by 11two (8 by 3), with M = 00011.

                    A        Q
    Initially       00000    1000

    First cycle:
      Shift         00001    000_
      Subtract M    11101
                    11110
      Set q0 = 0, Restore (+00011)
                    00001    0000

    Second cycle:
      Shift         00010    000_
      Subtract M    11101
                    11111
      Set q0 = 0, Restore (+00011)
                    00010    0000

    Third cycle:
      Shift         00100    00__
      Subtract M    11101
                    00001
      Set q0 = 1 (no restore needed)
                    00001    0001

    Fourth cycle:
      Shift         00010    001_
      Subtract M    11101
                    11111
      Set q0 = 0, Restore (+00011)
                    00010    0010

                    remainder = 00010, quotient = 0010


The non-restoring division algorithm:

S1: Do n times:

    If the sign of A is 0, shift A and Q left one binary position and subtract M from A;otherwise, shift A and Q left and add M to A.

    S2: If the sign of A is 1, add M to A.
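A matching C sketch of the non-restoring algorithm, again for unsigned 8-bit operands and with an illustrative function name:

    #include <stdint.h>

    /* Non-restoring division: the sign of A decides whether the next cycle
       subtracts or adds M; a single corrective add fixes a negative A at
       the end (step S2). The left shift preserves A's sign here, so
       testing after the shift matches the S1 rule above. */
    void nonrestoring_divide(uint8_t x, uint8_t m,
                             uint8_t *quotient, uint8_t *remainder)
    {
        int16_t A = 0;
        uint8_t Q = x;

        for (int i = 0; i < 8; i++) {
            A = (int16_t)((A << 1) | (Q >> 7));  /* shift A:Q left          */
            Q <<= 1;
            if (A >= 0) A -= m;                  /* S1: sign 0 -> subtract  */
            else        A += m;                  /*     sign 1 -> add       */
            if (A >= 0) Q |= 1;                  /* q0 = 1 if A non-negative */
        }
        if (A < 0) A += m;                       /* S2: final corrective add */
        *quotient  = Q;
        *remainder = (uint8_t)A;                 /* 8/3 -> 2 remainder 2     */
    }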

Floating-Point Number Representation
A floating-point number (or real number) can represent a very large value (1.23×10^88) or a very small value (1.23×10^-88). It can also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as illustrated:

A floating-point number is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form F×r^E. Decimal numbers use radix 10 (F×10^E), while binary numbers use radix 2 (F×2^E).


Representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2; binary number 1010.1011B can be normalized as 1.0101011B×2^3.

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers (even within a small range, say 0.0 to 0.1), while an n-bit binary pattern can represent only 2^n distinct numbers. Hence, not all real numbers can be represented; the nearest approximation is used instead, resulting in loss of accuracy.

It is also important to note that floating-point arithmetic is much less efficient than integer arithmetic. It can be sped up with a dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.

In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE-754 32-bit Single-Precision Floating-Point Numbers
In 32-bit single-precision floating-point representation:
• The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
• The following 8 bits represent the exponent (E).
• The remaining 23 bits represent the fraction (F).

Normalized Form
Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is normalized with an implicit leading 1, in the form 1.F.


In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D. The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative; in this example S=1, so this is a negative number, i.e., -1.375D.

In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponents. With an 8-bit E ranging from 0 to 255, the excess-127 scheme provides actual exponents of -127 to 128. In this example, E-127 = 129-127 = 2D. Hence, the number represented is -1.375×2^2 = -5.5D.

De-Normalized Form
The normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero! (Convince yourself of this.) The de-normalized form was devised to represent zero and other very small numbers.

For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction, and the actual exponent is always -126. Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126 = 0).

We can also represent very small positive and negative numbers in de-normalized form with E=0. For example, if S=1, E=0, and F=011 0000 0000 0000 0000 0000, the actual fraction is 0.011 = 1×2^-2 + 1×2^-3 = 0.375D. Since S=1, it is a negative number. With E=0, the actual exponent is -126. Hence the number is -0.375×2^-126 ≈ -4.4×10^-39, an extremely small negative number (close to zero).
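Both forms can be decoded mechanically. The C sketch below is a hypothetical helper (the special case E = 255, used for infinity and NaN, is omitted for brevity):

    #include <math.h>
    #include <stdint.h>

    /* Decode an IEEE-754 single-precision bit pattern by hand:
       normalized    (E != 0): value = (-1)^S * 1.F * 2^(E-127)
       de-normalized (E == 0): value = (-1)^S * 0.F * 2^(-126)   */
    double decode_ieee754(uint32_t bits)
    {
        uint32_t S = bits >> 31;           /* 1 sign bit       */
        uint32_t E = (bits >> 23) & 0xFF;  /* 8 exponent bits  */
        uint32_t F = bits & 0x7FFFFF;      /* 23 fraction bits */
        double value;

        if (E == 0)                        /* de-normalized    */
            value = ((double)F / (1 << 23)) * pow(2.0, -126);
        else                               /* normalized       */
            value = (1.0 + (double)F / (1 << 23)) * pow(2.0, (int)E - 127);
        return S ? -value : value;
    }

    /* The example above, 1 10000001 01100000000000000000000, decodes to -5.5. */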

IEEE Standard 754 Floating Point Numbers
There are several ways to represent real numbers on computers. Fixed point places a radix point somewhere in the middle of the digits, and is equivalent to using integers that represent portions of some unit. For example, one might represent 1/100ths of a unit; with four decimal digits, you could represent 10.82, or 00.01. Another approach is to use rationals, and represent every number as the ratio of two integers.

Floating-point representation – the most common solution – uses scientific notation to encode numbers, with a base number and an exponent. For example, 123.456 could be represented as 1.23456 × 10^2. In hexadecimal, the number 123.abc might be represented as 1.23abc × 16^2. In binary, the number 10100.110 could be represented as 1.0100110 × 2^4.

Floating-point solves a number of representation problems. Fixed point has a fixed window of representation, which limits it from representing very large or very small numbers; it is also prone to loss of precision when two large numbers are divided.

Floating point, on the other hand, employs a sort of "sliding window" of precision appropriate to the scale of the number. This allows it to represent numbers from 1,000,000,000,000 to 0.0000000000000001 with ease, while maximizing precision (the number of digits) at both ends of the scale.


Storage Layout
IEEE floating-point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.

The following table shows the layout for single (32-bit) and double (64-bit) precision floating-point values. The number of bits for each field is shown, with bit ranges in square brackets (00 = least-significant bit):

Floating Point Components

                        Sign     Exponent     Fraction
    Single Precision    1 [31]   8 [30-23]    23 [22-00]
    Double Precision    1 [63]   11 [62-52]   52 [51-00]

The Sign Bit
The sign bit is as simple as it gets: 0 denotes a positive number, and 1 denotes a negative number. Flipping the value of this bit flips the sign of the number.

The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200 - 127), or 73. For reasons discussed later, exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.

The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number. It is composed of an implicit leading bit (left of the radix point) and the fraction bits (to the right of the radix point).

To find out the value of the implicit leading bit, consider that any number can be expressed in scientific notation in many different ways. For example, the number fifty can be represented as any of these:

.5000 × 10^2
0.050 × 10^3
5000. × 10^-2

In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, fifty is represented as 5.000 × 10^1.


2.6 SUBWORD PARALLELISM
A subword is a lower-precision unit of data contained within a word. In subword parallelism, multiple subwords are packed into a word, and the hardware then processes whole words. With appropriate subword boundaries, this technique processes the subwords in parallel. Since the same instruction is applied to all subwords within the word, this is a form of SIMD (Single Instruction, Multiple Data) processing.

It is possible to apply subword parallelism to non-contiguous subwords of different sizes within a word, but a practical implementation is simplest when the subwords are the same size and contiguous within the word. Data-parallel programs that benefit from subword parallelism tend to process data items that are all the same size.

For example, if the word size is 64 bits, the subword sizes can be 8, 16, and 32 bits. An instruction then operates on eight 8-bit subwords, four 16-bit subwords, two 32-bit subwords, or one 64-bit subword in parallel.

Subword parallelism is an efficient and flexible solution for media processing, because media algorithms exhibit a great deal of data parallelism on lower-precision data.

    It is also useful for computations unrelated to multimedia that exhibit data parallelism onlower precision data.

    Graphics and audio applications can take advantage of performing simultaneous operationson short vectors

Example: a 128-bit adder can perform:
• sixteen 8-bit adds
• eight 16-bit adds
• four 32-bit adds

    Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data(SIMD)
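The same idea can be imitated in software inside an ordinary 64-bit register (sometimes called SWAR, SIMD within a register). The C sketch below, a hypothetical helper, performs eight independent 8-bit adds at once; the masking keeps a carry in one subword from rippling into its neighbour, which is what the partitioned adder does in hardware:

    #include <stdint.h>

    /* Eight parallel 8-bit adds packed into one 64-bit word. */
    uint64_t packed_add_8x8(uint64_t a, uint64_t b)
    {
        const uint64_t LOW7 = 0x7F7F7F7F7F7F7F7FULL; /* low 7 bits per byte  */

        uint64_t sum = (a & LOW7) + (b & LOW7);      /* no cross-byte carry  */
        uint64_t top = (a ^ b) & ~LOW7;              /* top bit of each byte */
        return sum ^ top;                            /* fold the top bits in */
    }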


    UNIT IIIPROCESSOR AND CONTROL UNIT

    Basic MIPS implementation – Building data path – Control Implementation scheme – Pipelining – Pipelined data path and control – Handling Data hazards & Control hazards – Exceptions.

    3.1 Basic MIPS implementation3.2 Building data path3.3 Control Implementation scheme3.4 Pipelining3.5 Pipelined data path and control3.6 Handling Data hazards & Control hazards3.7 Exceptions.

3.1 A BASIC MIPS IMPLEMENTATION
We will be examining an implementation that includes a subset of the core MIPS instruction set:

• The memory-reference instructions load word (lw) and store word (sw)
• The arithmetic-logical instructions add, sub, AND, OR, and slt
• The instructions branch equal (beq) and jump (j), which we add last

This subset does not include all the integer instructions (for example, shift, multiply, and divide are missing), nor does it include any floating-point instructions. However, the key principles used in creating a data path and designing the control are illustrated.

In examining the implementation, we will have the opportunity to see how the instruction set architecture determines aspects of the implementation, and how the choice of various implementation strategies affects the clock rate and CPI for the computer. In addition, most concepts used to implement the MIPS subset in this chapter are the same basic ideas that are used to construct a broad spectrum of computers, from high-performance servers to general-purpose microprocessors to embedded processors.

An Overview of the Implementation
Consider the core MIPS instructions: the integer arithmetic-logical instructions, the memory-reference instructions, and the branch instructions. Much of what needs to be done to implement these instructions is the same, independent of the exact class of instruction. For every instruction, the first two steps are identical:

1. Send the program counter (PC) to the memory that contains the code and fetch the instruction from that memory.
2. Read one or two registers, using fields of the instruction to select the registers to read. For the load word instruction, we need to read only one register, but most other instructions require that we read two registers.

After these two steps, the actions required to complete the instruction depend on the instruction class. Fortunately, for each of the three instruction classes (memory-reference, arithmetic-logical, and branches), the actions are largely the same, independent of the exact instruction. The simplicity and regularity of the MIPS instruction set simplifies the implementation by making the execution of many of the instruction classes similar.


For example, all instruction classes, except jump, use the arithmetic-logical unit (ALU) after reading the registers. The memory-reference instructions use the ALU for an address calculation, the arithmetic-logical instructions for the operation execution, and branches for comparison. After using the ALU, the actions required to complete various instruction classes differ. A memory-reference instruction will need to access the memory either to read data for a load or write data for a store. An arithmetic-logical or load instruction must write the data from the ALU or memory back into a register. Lastly, for a branch instruction, we may need to change the next instruction address based on the comparison; otherwise, the PC should be incremented by 4 to get the address of the next instruction.

Figure 3.1 shows the high-level view of a MIPS implementation, focusing on the various functional units and their interconnection. Although this figure shows most of the flow of data through the processor, it omits two important aspects of instruction execution.

FIGURE 3.1 An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them.

All instructions start by using the program counter to supply the instruction address to the instruction memory. After the instruction is fetched, the register operands used by an instruction are specified by fields of that instruction. Once the register operands have been fetched, they can be operated on to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or to compare (for a branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file. Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and branch offset are summed) or from an adder that increments the current PC by 4. The thick lines interconnecting the functional units represent buses, which consist of multiple signals. The arrows are used to guide the reader in knowing how


information flows. Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot where the lines cross.

Figure 3.1 shows data going to a particular unit as coming from two different sources. For example, the value written into the PC can come from one of two adders, the data written into the register file can come from either the ALU or the data memory, and the second input to the ALU can come from a register or the immediate field of the instruction. In practice, these data lines cannot simply be wired together; we must add a logic element that chooses from among the multiple sources and steers one of those sources to its destination. This selection is commonly done with a device called a multiplexor, although this device might better be called a data selector. The control lines are set based primarily on information taken from the instruction being executed.

The data memory must read on a load and write on a store. The register file must be written on a load and on an arithmetic-logical instruction. And, of course, the ALU must perform one of several operations. Like the multiplexors, these operations are directed by control lines that are set on the basis of various fields in the instruction.

Figure 3.2 shows the data path of Figure 3.1 with the three required multiplexors added, as well as control lines for the major functional units. A control unit, which has the instruction as an input, is used to determine how to set the control lines for the functional units and two of the multiplexors. The third multiplexor, which determines whether PC + 4 or the branch destination address is written into the PC, is set based on the Zero output of the ALU, which is used to perform the comparison of a beq instruction. The regularity and simplicity of the MIPS instruction set means that a simple decoding process can be used to determine how to set the control lines.

    Logic Design Conventions

FIGURE 3.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines.


The top multiplexor ("Mux") controls which value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled by the gate that ANDs together the Zero output of the ALU and a control signal that indicates that the instruction is a branch. The middle multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction (for a load or store). The added control lines are straightforward and determine the operation performed at the ALU, whether the data memory should read or write, and whether the registers should perform a write operation.
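In software terms, each of these multiplexors is just a two-way selection under a control signal. A minimal sketch, with control-signal names (Branch, MemtoReg, ALUSrc) taken from the usual MIPS textbook convention rather than from these notes:

    #include <stdint.h>

    /* A 2-to-1 multiplexor: the control signal steers one of two sources. */
    static uint32_t mux2(int select, uint32_t in0, uint32_t in1)
    {
        return select ? in1 : in0;
    }

    /* The three multiplexors of Figure 3.2, as selection expressions:
       next_pc    = mux2(Branch & Zero, pc_plus_4,  branch_target);
       write_data = mux2(MemtoReg,      alu_result, mem_read_data);
       alu_input2 = mux2(ALUSrc,        read_data2, sign_ext_offset);  */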

The data path elements in the MIPS implementation consist of two different types of logic elements: elements that operate on data values and elements that contain state. The elements that operate on data values are all combinational, which means that their outputs depend only on the current inputs. Given the same input, a combinational element always produces the same output.

A state element has at least two inputs and one output. The required inputs are the data value to be written into the element and the clock, which determines when the data value is written. The output from a state element provides the value that was written in an earlier clock cycle.

Logic components that contain state are also called sequential, because their outputs depend on both their inputs and the contents of the internal state.

A clocking methodology defines when signals can be read and when they can be written. It is important to specify the timing of reads and writes, because if a signal is written at the same time it is read, the value of the read could correspond to the old value, the newly written value, or even some mix of the two! Computer designs cannot tolerate such unpredictability. A clocking methodology is designed to ensure predictability.

An edge-triggered clocking methodology means that any values stored in a sequential logic element are updated only on a clock edge. Because only state elements can store a data value, any collection of combinational logic must have its inputs come from a set of state elements and its outputs written into a set of state elements. The inputs are values that were written in a previous clock cycle, while the outputs are values that can be used in a following clock cycle.

Figure 3.3 shows the two state elements surrounding a block of combinational logic, which operates in a single clock cycle: all signals must propagate from state element 1, through the combinational logic, and to state element 2 in the time of one clock cycle. The time necessary for the signals to reach state element 2 defines the length of the clock cycle.


FIGURE 3.5 Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address.

The state elements are the instruction memory and the program counter. The instruction memory need only provide read access because the data path does not write instructions. Since the instruction memory only reads, we treat it as combinational logic: the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed. (We will need to write the instruction memory when we load the program; this is not hard to add, and we ignore it for simplicity.) The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a write control signal. The adder is an ALU wired to always add its two 32-bit inputs and place the sum on its output.

FIGURE 3.6 A portion of the data path used for fetching instructions and incrementing the program counter.


    We will draw such an ALU with the label Add, as in Figure 3.5, to indicate that it has been permanently made an adder and cannot perform the other ALU functions.

To execute any instruction, we must start by fetching the instruction from memory. To prepare for executing the next instruction, we must also increment the program counter so that it points at the next instruction, 4 bytes later. Figure 3.6 shows how to combine the three elements from Figure 3.5 to form a data path that fetches instructions and increments the PC to obtain the address of the next sequential instruction.
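Behaviourally, the fetch portion amounts to reading the instruction memory at the PC and then forming PC + 4. A minimal C sketch with illustrative names:

    #include <stdint.h>

    #define IMEM_WORDS 1024
    static uint32_t instruction_memory[IMEM_WORDS];
    static uint32_t pc = 0;

    /* One fetch step: read the word the PC points at, then advance the PC
       by 4 bytes to the next sequential instruction. */
    uint32_t fetch(void)
    {
        uint32_t instruction = instruction_memory[pc / 4]; /* combinational read */
        pc = pc + 4;                      /* the Add unit: PC + 4 each cycle */
        return instruction;
    }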

The processor's 32 general-purpose registers are stored in a structure called a register file. A register file is a collection of registers in which any register can be read or written by specifying the number of the register in the file. The register file contains the register state of the computer. In addition, we will need an ALU to operate on the values read from the registers. R-format instructions have three register operands, so we will need to read two data words from the register file and write one data word into the register file for each instruction. For each data word to be read from the registers, we need an input to the register file that specifies the register number to be read and an output from the register file that will carry the value that has been read from the registers.

To write a data word, we will need two inputs: one to specify the register number to be written and one to supply the data to be written into the register. The register file always outputs the contents of whatever register numbers are on the Read register inputs. Writes, however, are controlled by the write control signal, which must be asserted for a write to occur at the clock edge.

    FIGURE 3.7 The two elements needed to implement R-format ALU operations are theregister file and the ALU.

The register file contains all the registers and has two read ports and one write port. The register file always outputs the contents of the registers corresponding to the Read register inputs; no other control inputs are needed. In contrast, a register write must be explicitly indicated by asserting the write control signal.


Remember that writes are edge-triggered, so all the write inputs (i.e., the value to be written, the register number, and the write control signal) must be valid at the clock edge. Since writes to the register file are edge-triggered, our design can legally read and write the same register within a clock cycle: the read will get the value written in an earlier clock cycle, while the value written will be available to a read in a subsequent clock cycle. The inputs carrying the register number to the register file are all 5 bits wide, whereas the lines carrying data values are 32 bits wide. The operation to be performed by the ALU is controlled with the ALU operation signal, which is 4 bits wide. We will use the Zero detection output of the ALU shortly to implement branches; the overflow output will not be needed.
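This read/write behaviour can be modelled directly. In the sketch below (illustrative names; one call stands in for one clock cycle), both read ports are sampled before the write is applied, so a same-cycle read sees the old value, as stated above:

    #include <stdint.h>

    typedef struct { uint32_t reg[32]; } RegFile;

    /* Two read ports, one write port; the write happens only when the
       RegWrite control signal is asserted. Register numbers are 5 bits. */
    void regfile_cycle(RegFile *rf,
                       unsigned read_reg1, unsigned read_reg2,
                       unsigned write_reg, uint32_t write_data, int reg_write,
                       uint32_t *read_data1, uint32_t *read_data2)
    {
        *read_data1 = rf->reg[read_reg1 & 31];  /* read ports always drive  */
        *read_data2 = rf->reg[read_reg2 & 31];  /* their outputs            */
        if (reg_write && (write_reg & 31) != 0) /* $zero stays hard-wired 0 */
            rf->reg[write_reg & 31] = write_data;
    }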

We will need a unit to sign-extend the 16-bit offset field in the instruction to a 32-bit signed value, and a data memory unit to read from or write to. The data memory must be written on store instructions; hence, data memory has read and write control signals, an address input, and an input for the data to be written into memory. The beq instruction has three operands: two registers that are compared for equality, and a 16-bit offset used to compute the branch target address relative to the branch instruction address. Its form is beq $t1,$t2,offset. To implement this instruction, we must compute the branch target address by adding the sign-extended offset field of the instruction to the PC.

The instruction set architecture specifies that the base for the branch address calculation is the address of the instruction following the branch. Since we compute PC + 4 (the address of the next instruction) in the instruction fetch datapath, it is easy to use this value as the base for computing the branch target address. The architecture also states that the offset field is shifted left 2 bits so that it is a word offset; this shift increases the effective range of the offset field by a factor of 4.
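The branch-target arithmetic is small enough to show directly; a C sketch with an illustrative helper name:

    #include <stdint.h>

    /* Branch target = (PC + 4) + (sign-extended 16-bit offset << 2).
       The shift left by 2 bits turns the word offset into a byte offset. */
    uint32_t branch_target(uint32_t pc, uint16_t offset16)
    {
        int32_t offset = (int16_t)offset16;    /* sign-extend to 32 bits */
        return (pc + 4) + ((uint32_t)offset << 2);
    }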

FIGURE 3.8 The two units needed to implement loads and stores, in addition to the register file and ALU of Figure 3.7.


The data memory unit and the sign extension unit: the memory unit is a state element with inputs for the address and the write data, and a single output for the read result. There are separate read and write controls, although only one of these may be asserted on any given clock. The memory unit needs a read signal since, unlike the register file, reading the value of an invalid address can cause problems.

FIGURE 3.9 The data path for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2 bits.

The unit labelled Shift left 2 is simply a routing of the signals between input and output that adds 00two to the low-order end of the sign-extended offset field; no actual shift hardware is needed, since the amount of the "shift" is constant. Since we know that the offset was sign-extended from 16 bits, the shift will throw away only "sign bits." Control logic is used to decide whether the incremented PC or branch target should replace the PC, based on the Zero output of the ALU.

    Creating a Single Datapath

This simplest datapath will attempt to execute all instructions in one clock cycle. This means that no datapath resource can be used more than once per instruction, so any element needed more than once must be duplicated. We therefore need a memory for instructions separate from the memory for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by different instruction flows.


To share a datapath element between two different instruction classes, we may need to allow multiple connections to the input of an element, using a multiplexor and control signal to select among the multiple inputs.

    Building a Datapath

The datapaths for the arithmetic-logical (or R-type) instructions and the memory instructions are quite similar. The key differences are the following:

The arithmetic-logical instructions use the ALU, with the inputs coming from the two registers. The memory instructions can also use the ALU to do the address calculation, although the second input is the sign-extended 16-bit offset field from the instruction.

The value stored into a destination register comes from the ALU (for an R-type instruction) or the memory (for a load).

To create a datapath with only a single register file and a single ALU, we must support two different sources for the second ALU input, as well as two different sources for the data stored into the register file. Thus, one multiplexor is placed at the ALU input and another at the data input to the register file.

    FIGURE 3.10 The datapath for the memory instructions and the R-type instructions.

    3.3 CONTROL IMPLEMENTATION SCHEME

    Implementation Scheme

We build this simple implementation using the datapath of the last section and adding a simple control function. This simple implementation covers load word (lw), store word (sw),