cs6303 computer architecture notes for 2013 regulation
8/15/2019 CS6303 COMPUTER ARCHITECTURE Notes for 2013 Regulation
Kingston Engineering College, Chittoor Main Road,
Katpadi, Vellore 632 059.
Approved by AICTE, New Delhi affiliated to Anna University, Chennai
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THIRD SEMESTER
CS6303 COMPUTER ARCHITECTURE
NOTES
Prepared By
Mr. M. AZHAGIRI AP/CSE
CS6303 COMPUTER ARCHITECTURE    L T P C: 3 0 0 3
OBJECTIVES:
• To make students understand the basic structure and operation of a digital computer.
• To understand the hardware-software interface.
• To familiarize the students with the arithmetic and logic unit and the implementation of fixed-point and floating-point arithmetic operations.
• To expose the students to the concept of pipelining.
• To familiarize the students with the hierarchical memory system, including cache memories and virtual memory.
• To expose the students to different ways of communicating with I/O devices and standard I/O interfaces.
UNIT I OVERVIEW & INSTRUCTIONS (9 periods)
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.
UNIT II ARITHMETIC OPERATIONS (7 periods)
ALU – Addition and subtraction – Multiplication – Division – Floating Point operations – Subword parallelism.
UNIT III PROCESSOR AND CONTROL UNIT (11 periods)
Basic MIPS implementation – Building datapath – Control Implementation scheme – Pipelining – Pipelined datapath and control – Handling Data hazards & Control hazards – Exceptions.
UNIT IV PARALLELISM (9 periods)
Instruction-level-parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors.
UNIT V MEMORY AND I/O SYSTEMS (9 periods)
Memory hierarchy – Memory technologies – Cache basics – Measuring and improving cache performance – Virtual memory, TLBs – Input/output system, programmed I/O, DMA and interrupts, I/O processors.
TOTAL: 45 PERIODS
OUTCOMES: At the end of the course, the student should be able to:
• Design an arithmetic and logic unit.
• Design and analyse pipelined control units.
• Evaluate the performance of memory systems.
• Understand parallel processing architectures.
TEXT BOOK:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth Edition, Morgan Kaufmann / Elsevier, 2014.
REFERENCES:
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organisation", Sixth Edition, McGraw-Hill Inc, 2012.
2. William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006.
3. Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005.
4. Govindarajalu, "Computer Architecture and Organization, Design Principles and Applications", First Edition, Tata McGraw Hill, New Delhi, 2005.
5. John P. Hayes, "Computer Architecture and Organization", Third Edition, Tata McGraw Hill, 1998.
6. http://nptel.ac.in/
UNIT I
OVERVIEW & INSTRUCTIONS
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing
instructions – Logical operations – control operations – Addressing and addressing modes.
1.1 Eight ideas
1.2 Components of a computer system
1.3 Technology and Performance
1.4 Power wall
1.5 Uniprocessors to multiprocessors
1.6 Instructions – operations and operands
1.7 Representing instructions
1.8 Logical operations
1.9 Control operations
1.10 Addressing and addressing modes
1.1 EIGHT IDEAS
These ideas are so powerful they have lasted long after the first computer that used them.
1. Design for Moore's Law
2. Use Abstraction to Simplify Design
3. Make the Common Case Fast
4. Performance via Parallelism
5. Performance via Pipelining
6. Performance via Prediction
7. Hierarchy of Memories
8. Dependability via Redundancy
Design for Moore's Law
Moore's Law states that integrated circuit resources double every 18 to 24 months. Because computer designs can take years, computer architects must anticipate where the technology will be when the design finishes rather than design for where it starts. The resources available per chip can easily double or quadruple between the start and finish of the project.
Use Abstraction to Simplify Design
A major productivity technique for hardware and software is to use abstractions to represent the design at different levels of representation: lower-level details are hidden to offer a simpler model at higher levels.
Make the Common Case Fast
Making the common case fast will tend to enhance performance better than optimizing the rare case. Ironically, the common case is often simpler than the rare case and hence is often easier to enhance.
Performance via Parallelism
Computer architects have offered designs that get more performance by performing
operations in parallel.
Performance via Pipelining
A particular pattern of parallelism is so prevalent in computer architecture that it merits its
own name: pipelining.
Performance via Prediction
In some cases it can be faster on average to guess and start working rather than wait until you know for sure, assuming that the mechanism to recover from a misprediction is not too expensive and the prediction is relatively accurate.
Hierarchy of Memories
Programmers want memory to be fast, large, and cheap: memory speed often shapes performance, capacity limits the size of problems that can be solved, and the cost of memory today is often the majority of computer cost. Architects address these conflicting demands with a hierarchy of memories, with the fastest, smallest, and most expensive memory per bit at the top of the hierarchy and the slowest, largest, and cheapest per bit at the bottom. Caches give the programmer the illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and cheap as the bottom of the hierarchy.
Dependability via Redundancy
Computers not only need to be fast; they need to be dependable. Since any physical device can fail, we make systems dependable by including redundant components that can take over when a failure occurs and that help detect failures.
1.2 COMPONENTS OF A COMPUTER SYSTEM
Software is organized primarily in a hierarchical fashion, with applications being the outermost ring and a variety of systems software sitting between the hardware and applications software. There are many types of systems software, but two types are central to every computer system today: an operating system and a compiler. An operating system interfaces between a user's program and the hardware and provides a variety of services and supervisory functions. Among the most important functions are:
• Handling basic input and output operations
• Allocating storage and memory
• Providing for protected sharing of the computer among multiple applications using it simultaneously
Examples of operating systems in use today are Linux, iOS, and Windows.
FIGURE: A simplified view of hardware and software as hierarchical layers.
Compilers perform another vital function: the translation of a program written in a high-level language, such as C, C++, Java, or Visual Basic, into instructions that the hardware can execute.
From a High-Level Language to the Language of Hardware
Assembler: This program translates a symbolic version of an instruction into the binary version. For example, the programmer would write
add A,B
and the assembler would translate this notation into
1000110010100000
The binary language that the machine understands is the machine language. Assembly language requires the programmer to write one line for every instruction that the computer will follow, forcing the programmer to think like the computer. Later, high-level programming languages and compilers were introduced; a compiler translates a high-level language program into instructions.
Example:
High-level language: a=a+b;
Assembly language: add A,B
Binary / machine language: 1000110010100000
High-level programming languages offer several important benefits:
• They allow the programmer to think in a more natural language, using English words and algebraic notation. Languages can also be designed for their intended use: Fortran was designed for scientific computation, Cobol for business data processing, and Lisp for symbol manipulation.
• They improve programmer productivity.
• They allow programs to be independent of the computer on which they were developed, since compilers and assemblers can translate high-level language programs to the binary instructions of any computer.
5 CLASSIC COMPONENTS OF A COMPUTER
The five classic components of a computer are input, output, memory, data path, and control, with the last two sometimes combined and called the processor.
I/O EQUIPMENT:
The most fascinating I/O device is probably the graphics display.
Liquid crystal displays (LCDs)
LCDs are used to get a thin, low-power display. The LCD is not the source of light; instead, it controls the transmission of light.
A typical LCD includes rod-shaped molecules in a liquid that form a twisting helix that bends light entering the display, from either a light source behind the display or, less often, from reflected light. The rods straighten out when a current is applied and no longer bend the light.
Since the liquid crystal material is between two screens polarized at 90 degrees, the light cannot pass through unless it is bent. Today, most LCDs use an active matrix that has a tiny transistor switch at each pixel to precisely control current and make sharper images. A red-green-blue mask associated with each dot on the display determines the intensity of the three colour components in the final image; in a colour active matrix LCD, there are three transistor switches at each point.
The image is composed of a matrix of picture elements, or pixels, which can be represented as a matrix of bits, called a bit map. A colour display might use 8 bits for each of the three colours (red, blue, and green), for 24 bits per pixel. The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit pattern per pixel is read out to the graphics display at the refresh rate.
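As a rough illustration of how much storage a frame buffer needs, the sketch below multiplies the number of pixels by the bits per pixel. The function name and the 1920 x 1080 resolution are assumptions chosen for the example; only the figure of 8 bits per colour comes from the notes.

```python
def frame_buffer_bytes(width, height, bits_per_colour=8, colours=3):
    """Bytes needed to hold one full bit map in the frame buffer."""
    bits_per_pixel = bits_per_colour * colours      # 8 bits x 3 colours = 24 bits
    total_bits = width * height * bits_per_pixel    # one entry per pixel
    return total_bits // 8                          # 8 bits per byte


# Hypothetical 1920 x 1080 display (the resolution is an assumption).
size = frame_buffer_bytes(1920, 1080)
print(size, "bytes")
```

The whole buffer is read out once per refresh, so the required read bandwidth grows with both the resolution and the refresh rate.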
The processor is the active part of the computer, following the instructions of a program to the letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. The processor logically comprises two main components: data path and control, the respective brawn and brain of the processor. The data path performs the arithmetic operations, and control tells the data path, memory, and I/O devices what to do according to the wishes of the instructions of the program.
The memory is where the programs are kept when they are running; it also contains the data needed by the running programs. The memory is built from DRAM chips. DRAM stands for dynamic random access memory. Multiple DRAMs are used together to contain the instructions and data of a program. In contrast to sequential access memories, such as magnetic tapes, the RAM portion of the term DRAM means that memory accesses take basically the same amount of time no matter what portion of the memory is read.
1.3.1 TECHNOLOGY: CHIP MANUFACTURING PROCESS
The manufacture of a chip begins with silicon, a substance found in sand. Because silicon does not conduct electricity well, it is called a semiconductor. With a special chemical process, it is possible to add materials to silicon that allow tiny areas to transform into one of three devices:
• Excellent conductors of electricity (using either microscopic copper or aluminium wire)
• Excellent insulators from electricity (like plastic sheathing or glass)
• Areas that can conduct or insulate under special conditions (as a switch)
Transistors fall in the last category. A VLSI circuit, then, is just billions of combinations of conductors, insulators, and switches manufactured in a single small package.
The figure shows the process for integrated chip manufacturing. The process starts with a silicon crystal ingot, which looks like a giant sausage. Today, ingots are 8 to 12 inches in diameter and about 12 to 24 inches long. An ingot is finely sliced into wafers no more than 0.1 inches thick.
These wafers then go through a series of processing steps, during which patterns of chemicals are placed on each wafer, creating the transistors, conductors, and insulators.
A single microscopic flaw in the wafer or in one of the patterning steps can cause that area of the wafer to fail. The simplest way to cope with imperfection is to place many independent components on a single wafer.
The patterned wafer is then chopped up, or diced, into these components, called dies and more informally known as chips. To reduce cost, a large die is often shrunk by moving it to the next-generation process, which uses smaller sizes for both transistors and wires. This improves the yield and the die count per wafer.
Once you've found good dies, they are connected to the input/output pins of a package, using a process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging, and then they are shipped to customers.
1.3.2 PERFORMANCE
When running a program on two different desktop computers, you'd say that the faster one is the desktop computer that gets the job done first. If you were running a data centre that had several servers running jobs submitted by many users, you'd say that the faster computer was the one that completed the most jobs during a day. As an individual computer user, you are interested in reducing response time (the time between the start and completion of a task), also referred to as execution time.
Data centre managers are often interested in increasing throughput or bandwidth (the total amount of work done in a given time).
Measuring Performance:
The computer that performs the same amount of work in the least time is the fastest. Program execution time is measured in seconds per program. CPU execution time, or simply CPU time, is the time the CPU spends computing for this task; unlike elapsed time, it does not include time spent waiting for I/O or running other programs. CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system performing tasks on behalf of the program, called system CPU time.
We use the term system performance to refer to elapsed time on an unloaded system and CPU performance to refer to user CPU time.
CPU Performance and Its Factors:
CPU execution time for a program = CPU clock cycles for a program X Clock cycle time
Alternatively, because clock rate and clock cycle time are inverses,
CPU execution time for a program = CPU clock cycles for a program/Clock rate
This formula makes it clear that the hardware designer can improve performance by reducing
the number of clock cycles required for a program or the length of the clock cycle.
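The two forms of the execution-time formula can be checked with a small sketch. The function names and the workload (20 billion clock cycles on a 2 GHz clock) are assumptions picked for illustration, not measurements from the notes.

```python
def clock_cycle_time(clock_rate_hz):
    """Clock cycle time in seconds is the inverse of the clock rate in Hz."""
    return 1.0 / clock_rate_hz


def cpu_time_from_rate(clock_cycles, clock_rate_hz):
    """CPU execution time = CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


def cpu_time_from_cycle_time(clock_cycles, cycle_time_s):
    """CPU execution time = CPU clock cycles x clock cycle time."""
    return clock_cycles * cycle_time_s


# Hypothetical program: 20 billion clock cycles on a 2 GHz processor.
cycles = 20e9
rate = 2e9
print(cpu_time_from_rate(cycles, rate))                            # seconds
print(cpu_time_from_cycle_time(cycles, clock_cycle_time(rate)))    # same value, other form
```

Halving the cycle count or doubling the clock rate both halve the result, which is the trade-off the hardware designer works with.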
Instruction Performance:
The performance equations above did not include any reference to the number of instructions needed for the program. However, the execution time must depend on the number of instructions in a program. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction. Therefore, the number of clock cycles required for a program can be written as
CPU clock cycles = Instructions for a program X Average clock cycles per instruction
The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is often abbreviated as CPI. CPI provides one way of comparing two different implementations of the same instruction set architecture, since the number of instructions executed for a program will be the same.
The Classic CPU Performance Equation:
The basic performance equation can be written in terms of instruction count (the number of instructions executed by the program), CPI, and clock cycle time:
CPU time = Instruction count X CPI X Clock cycle time
or, since the clock rate is the inverse of clock cycle time:
CPU time = (Instruction count X CPI) / Clock rate
These formulas are particularly useful because they separate the three key factors that affect performance.
Components of performance | Units of measure
CPU execution time for a program | Seconds for the program
Instruction count | Instructions executed for the program
Clock cycles per instruction (CPI) | Average number of clock cycles per instruction
Clock cycle time | Seconds per clock cycle
We can measure the CPU execution time by running the program, and the clock cycle time is usually published as part of the documentation for a computer. The instruction count and CPI can be more difficult to obtain. Of course, if we know the clock rate and CPU execution time, we need only one of the instruction count or the CPI to determine the other.
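The classic equation can be exercised with a short sketch that compares two implementations of the same instruction set. All numbers below (instruction count, CPI values, cycle times) are hypothetical, and the helper names are not from the notes.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s


def cpi_from_measurements(cpu_time_s, clock_rate_hz, instruction_count):
    """Rearranged form: CPI = (CPU time x clock rate) / instruction count."""
    return cpu_time_s * clock_rate_hz / instruction_count


# Hypothetical implementations A and B of one instruction set, so the
# instruction count is the same for both.
instructions = 1e9
time_a = cpu_time(instructions, cpi=2.0, clock_cycle_time_s=250e-12)
time_b = cpu_time(instructions, cpi=1.2, clock_cycle_time_s=500e-12)

print(time_a, time_b)
print("A is faster by a factor of", time_b / time_a)
```

Implementation A has the higher CPI but still finishes first because its clock cycle is shorter, which is why the three factors have to be considered together.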
microprocessors. Hence, a "quad core" microprocessor is a chip that contains four processors or four cores. In the past, programmers could rely on innovations in hardware, architecture, and compilers to double performance of their programs every 18 months without having to change a line of code. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors. Moreover, to get the historic benefit of running faster on new microprocessors, programmers will have to continue to improve performance of their code as the number of cores increases. To reinforce how the software and hardware systems work hand in hand, the textbook uses a special section, Hardware/Software Interface, throughout the book. These elements summarize important insights at this critical interface.
1. The clock speed of a uniprocessor has reached saturation and cannot be increased beyond a certain limit because of power consumption and heat dissipation issues.
2. As the physical size of the chip decreased while the number of transistors per chip increased, clock speed increased, which boosted the heat dissipation across the chip to a dangerous level. This raised cooling and heat sink requirements.
3. There were limitations in the use of silicon surface area.
4. There were limitations in reducing the size of individual gates further.
5. To gain performance within a single core, many techniques such as pipelining, superpipelining, and superscalar architectures are used.
6. Most of the early dual-core processors ran at lower clock speeds; the rationale is that a dual-core processor with each core running at 1 GHz should be equivalent to a single-core processor running at 2 GHz.
7. The problem is that this does not work in practice when the applications are not written to take advantage of the multiple processors. Until the software is written this way, unthreaded applications will run faster on the faster single-core processor than on the slower-clocked dual-core CPU.
8. In multi-core processors, the benefit is more on throughput than on response time.
9. In the past, programmers could rely on innovations in hardware, architecture, and compilers to double performance of their programs every 18 months without having to change a line of code.
10. Today, for programmers to get significant improvement in response time, they need to rewrite their programs to take advantage of multiple processors, and they also have to improve performance of their code as the number of cores increases.
The need of the hour is:
11. Ability to write parallel programs.
12. Care must be taken to reduce communication and synchronization overhead. Challenges in scheduling and load balancing have to be addressed.
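The argument about 1 GHz dual-core versus 2 GHz single-core processors can be illustrated with a deliberately idealised sketch. It assumes the work divides perfectly across the cores a program can use and ignores communication and synchronization overhead; the function name and the workload size are hypothetical.

```python
def run_time_seconds(clock_cycles, clock_rate_hz, cores, software_threads):
    """Idealised run time: work divides evenly over the cores the program can use."""
    usable_cores = min(cores, software_threads)   # an unthreaded program uses one core
    return clock_cycles / (clock_rate_hz * usable_cores)


work = 4e9  # hypothetical workload in clock cycles

single_2ghz = run_time_seconds(work, 2e9, cores=1, software_threads=1)
dual_1ghz_unthreaded = run_time_seconds(work, 1e9, cores=2, software_threads=1)
dual_1ghz_threaded = run_time_seconds(work, 1e9, cores=2, software_threads=2)

print(single_2ghz, dual_1ghz_unthreaded, dual_1ghz_threaded)
```

Under these assumptions the dual-core chip only matches the single-core chip once the program is written with two threads; real programs fall short of this ideal because of the overheads listed in point 12.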
1.6 INSTRUCTIONS – OPERATIONS AND OPERANDS
Operations in MIPS:
Every computer must be able to perform arithmetic. The MIPS assembly language notation
add a, b, c
instructs a computer to add the two variables b and c and to put their sum in a.
This notation is rigid in that each MIPS arithmetic instruction performs only one operation and must always have exactly three variables.
EXAMPLE: To add four variables b, c, d, and e and store the sum in a:
add a, b, c # The sum of b and c is placed in a
add a, a, d # The sum of b, c, and d is now in a
add a, a, e # The sum of b, c, d, and e is now in a
Thus, it takes three instructions to sum the four variables.
Design Principle 1: Simplicity favours regularity.
EXAMPLE: Compiling Two C Assignment Statements into MIPS
This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved
from C, this example and the next few work for either high-level programming language:
a = b + c;
d = a - e;
The translation from C to MIPS assembly language instructions is performed by the compiler. A MIPS instruction operates on two source operands and places the result in one destination operand. Hence, the two simple statements above compile directly into these two MIPS assembly language instructions:
add a, b, c
sub d, a, e
Operands in MIPS:
The operands of arithmetic instructions are restricted; they must be from a limited number of special locations built directly in hardware called registers. The size of a register in the MIPS architecture is 32 bits; groups of 32 bits occur so frequently that they are given the name word in the MIPS architecture.
Design Principle 2: Smaller is faster.
A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther. So, 32 registers are used in the MIPS architecture. The MIPS convention is to use two-character names following a dollar sign to represent a register, e.g. $s0, $s1.
Example: f = (g + h) - (i + j); compiled into instructions using registers, with f in $s0 and g, h, i, and j in $s1 to $s4.
add $t0,$s1,$s2 # register $t0 contains g + h
add $t1,$s3,$s4 # register $t1 contains i + j
sub $s0,$t0,$t1 # f gets $t0 - $t1, which is (g + h) - (i + j)
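The three-instruction sequence above can be traced with a minimal sketch of a register file. The tuple encoding of instructions, the `run` helper, and the starting register values are assumptions made for this illustration; 32-bit overflow is ignored.

```python
def run(program, regs):
    """Execute (op, dest, src1, src2) three-operand instructions in order."""
    for op, rd, rs, rt in program:
        if op == "add":
            regs[rd] = regs[rs] + regs[rt]
        elif op == "sub":
            regs[rd] = regs[rs] - regs[rt]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return regs


# Hypothetical values: g = 10, h = 20, i = 3, j = 4 in $s1..$s4; f lives in $s0.
regs = {"$s0": 0, "$s1": 10, "$s2": 20, "$s3": 3, "$s4": 4, "$t0": 0, "$t1": 0}
program = [
    ("add", "$t0", "$s1", "$s2"),   # $t0 = g + h
    ("add", "$t1", "$s3", "$s4"),   # $t1 = i + j
    ("sub", "$s0", "$t0", "$t1"),   # f = $t0 - $t1
]
run(program, regs)
print(regs["$s0"])   # (10 + 20) - (3 + 4)
```

Each instruction names exactly one destination and two sources, so the temporaries $t0 and $t1 are needed to hold the intermediate sums.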
Memory Operands: Programming languages have simple variables that contain single data elements, as in these examples, but they also have more complex data structures (arrays and structures). These complex data structures can contain many more data elements than there are registers in a computer. The processor can keep only a small amount of data in registers, but computer memory contains billions of data elements. So, MIPS must include instructions that transfer data between memory and registers.
Such instructions are called data transfer instructions. To access a word in memory, the instruction must supply the memory address.
The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a constant and register used to access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address.
The actual MIPS name for this instruction is lw, standing for load word.
EXAMPLE: g = h + A[8]; assume g and h are in $s1 and $s2 and the base address of A is in $s3.
To get A[8] from memory, use lw:
lw $t0,8($s3) # Temporary reg $t0 gets A[8]
Then add h to the value of A[8] held in $t0:
add $s1,$s2,$t0 # g = h + A[8]
The constant in a data transfer instruction (8) is called the offset, and the register added to form the address ($s3) is called the base register. This first version treats the offset as a word index; the byte-addressed version follows.
In MIPS, words must start at addresses that are multiples of 4. This requirement is called an alignment restriction, and many architectures have it. (Each 32-bit word occupies 4 bytes, so the addresses of consecutive words differ by 4.)
Byte addressing also affects the array index. To get the proper byte address in the code above, the offset to be added to the base register $s3 must be 4 x 8, or 32.
EXAMPLE: g = h + A[8]; (implemented with byte addressing)
To get A[8] from memory, use lw with the actual offset 8 x 4 = 32:
lw $t0,32($s3) # Temporary reg $t0 gets A[8]
Then add h to the value of A[8] held in $t0:
add $s1,$s2,$t0 # g = h + A[8]
The instruction complementary to load is traditionally called store; it copies data from a register to memory. The format of a store is similar to that of a load: the name of the operation, followed by the register to be stored, then the offset to select the array element, and finally the base register. Once again, the MIPS address is specified in part by a constant and in part by the contents of a register. The actual MIPS name is sw, standing for store word.
EXAMPLE: A[12] = h + A[8];
lw $t0,32($s3) # Temporary reg $t0 gets A[8]; offset is 8 x 4
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[8]
sw $t0,48($s3) # Stores h + A[8] back into A[12]; offset is 12 x 4
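The load/add/store sequence for A[12] = h + A[8] can be sketched with a dictionary standing in for byte-addressed memory. The helper names, the base address 1000, and the array contents are assumptions for illustration; the alignment check mirrors the rule that word addresses must be multiples of 4.

```python
WORD_BYTES = 4


def check_aligned(address):
    """MIPS words must start at addresses that are multiples of 4."""
    if address % WORD_BYTES != 0:
        raise ValueError(f"unaligned word address: {address}")


def lw(memory, regs, rt, offset, base):
    """Load word: rt = memory[base register + offset]."""
    address = regs[base] + offset
    check_aligned(address)
    regs[rt] = memory.get(address, 0)


def sw(memory, regs, rt, offset, base):
    """Store word: memory[base register + offset] = rt."""
    address = regs[base] + offset
    check_aligned(address)
    memory[address] = regs[rt]


base_address = 1000                       # hypothetical base address of A
A = list(range(100, 113))                 # A[0..12] = 100..112 (made-up data)
memory = {base_address + WORD_BYTES * i: v for i, v in enumerate(A)}
regs = {"$s2": 5, "$s3": base_address, "$t0": 0}   # h = 5, $s3 = base of A

lw(memory, regs, "$t0", 32, "$s3")        # $t0 = A[8], offset 8 x 4
regs["$t0"] = regs["$s2"] + regs["$t0"]   # add $t0,$s2,$t0
sw(memory, regs, "$t0", 48, "$s3")        # A[12] = h + A[8], offset 12 x 4

print(memory[base_address + 48])
```

Using an offset that is not a multiple of 4, such as 30, makes `check_aligned` raise an error, which is the software analogue of the alignment restriction.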
Constant or Immediate Operands:
For example, to add the constant 4 to register $s3, we could load the constant from memory and then add it, assuming that $s1 + AddrConstant4 is the memory address of the constant 4:
lw $t0,AddrConstant4($s1) # $t0 = constant 4
add $s3,$s3,$t0 # $s3 = $s3 + $t0 ($t0 == 4)
An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one operand is a constant.
Example: the add immediate (addi) instruction.
addi $s3,$s3,4 # $s3 = $s3 + 4
Constant operands occur frequently, and by including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.
INSTRUCTIONS AND THEIR TYPES USED IN MIPS
Since registers are referred to in instructions, there must be a convention to map register names into numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0 to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register 18, . . . , $t0 means register 8, $t1 means register 9, and so on.
MIPS fields for instructions
• op: Basic operation of the instruction, traditionally called the opcode.
• rs: The first register source operand.
• rt: The second register source operand.
• rd: The register destination operand. It gets the result of the operation.
• shamt: Shift amount. (Shift instructions, covered under logical operations, use this field; otherwise it contains zero.)
• funct: Function. This field, often called the function code, selects the specific variant of the operation in the op field.
In the 32-bit instruction word, op and funct are 6 bits wide, and rs, rt, rd, and shamt are 5 bits each.
A problem occurs when an instruction needs longer fields than those shown above. The MIPS designers kept all instructions the same length, thereby requiring different kinds of instruction formats for different kinds of instructions. The format above is called R-type (for register) or R-format. A second type of instruction format is called I-type (for immediate) or I-format and is used by the immediate and data transfer instructions. The fields of I-format are op (6 bits), rs (5 bits), rt (5 bits), and a 16-bit constant or address.
Multiple formats complicate the hardware; we can reduce the complexity by keeping the formats similar. For example, the first three fields of the R-type and I-type formats are the same size and have the same names; the length of the fourth field in I-type is equal to the sum of the lengths of the last three fields of R-type. Note that the meaning of the rt field has changed for the load instruction: the rt field specifies the destination register, which receives the result of the load.
FIG: MIPS instruction encoding.
Example: MIPS instruction encoding in computer hardware.
Consider A[300] = h + A[300]; with the base address of A in $t1 and h in $s2, the MIPS instructions for the operation are:
lw $t0,1200($t1) # Temporary reg $t0 gets A[300]
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[300]
sw $t0,1200($t1) # Stores h + A[300] back into A[300]
The table shows the fields of the three machine language instructions in decimal:
Instruction | op | rs | rt | rd | address/shamt | funct
lw | 35 | 9 | 8 | 1200 (16-bit address field)
add | 0 | 18 | 8 | 8 | 0 | 32
sw | 43 | 9 | 8 | 1200 (16-bit address field)
The lw instruction is identified by 35 in the op field, the add instruction by 0 in the op field together with 32 in the funct field, and the sw instruction by 43 in the op field.
Binary version of the above table:
100011 01001 01000 0000 0100 1011 0000 lw
000000 10010 01000 01000 00000 100000 add
101011 01001 01000 0000 0100 1011 0000 sw
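The field packing for these three instructions can be reproduced with a short sketch. The helper names `encode_r` and `encode_i` are hypothetical; the field widths (6, 5, 5, 5, 5, 6 bits for R-type and 6, 5, 5, 16 bits for I-type) and the register numbers follow the MIPS convention described in these notes.

```python
# Register-name to register-number map: $t0-$t7 -> 8-15, $s0-$s7 -> 16-23.
REGISTER_NUMBERS = {f"$t{i}": 8 + i for i in range(8)}
REGISTER_NUMBERS.update({f"$s{i}": 16 + i for i in range(8)})


def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack an R-type instruction: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, immediate):
    """Pack an I-type instruction: op(6) rs(5) rt(5) constant-or-address(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


r = REGISTER_NUMBERS
lw_word = encode_i(35, r["$t1"], r["$t0"], 1200)              # lw  $t0,1200($t1)
add_word = encode_r(0, r["$s2"], r["$t0"], r["$t0"], 0, 32)   # add $t0,$s2,$t0
sw_word = encode_i(43, r["$t1"], r["$t0"], 1200)              # sw  $t0,1200($t1)

for name, word in (("lw", lw_word), ("add", add_word), ("sw", sw_word)):
    print(f"{name:3} {word:032b}")
```

Only the op field differs between the lw and sw words, which shows how keeping the formats similar simplifies decoding.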
1.8 LOGICAL OPERATIONS
Table: logical operators used in MIPS and other languages, along with symbolic notation.
SHIFT LEFT (sll)
The first class of such operations is called shifts. They move all the bits in a word to the left or right, filling the emptied bits with 0s.
For example, if register $s0 contained
0000 0000 0000 0000 0000 0000 0000 1001 (two) = 9 (ten)
and the instruction to shift left by 4 was executed, the new value would be:
0000 0000 0000 0000 0000 0000 1001 0000 (two) = 144 (ten)
The dual of a shift left is a shift right. The two MIPS shift instructions are called shift left logical (sll) and shift right logical (srl).
sll $t2,$s0,4 # reg $t2 = reg $s0 << 4 bits
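The shift example can be checked with a minimal sketch that models a 32-bit register by masking. The function names mirror the MIPS mnemonics, but the Python helpers themselves are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF   # keep results within a 32-bit register


def sll(value, shamt):
    """Shift left logical: bits shifted past bit 31 are discarded, 0s fill in."""
    return (value << shamt) & MASK32


def srl(value, shamt):
    """Shift right logical: 0s fill in from the left."""
    return (value & MASK32) >> shamt


s0 = 9
t2 = sll(s0, 4)                  # sll $t2,$s0,4
print(f"{t2:032b} = {t2}")
print(srl(t2, 4))                # shifting back right recovers the original value
```

Shifting left by i bits gives the same result as multiplying by 2 to the power i (here 9 x 16), as long as no 1 bits are shifted out of the register.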
LOGICAL OR (or)
It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if
register $t2 contains 0000 0000 0000 0000 0000 1101 1100 0000 (two)
and register $t1 contains 0000 0000 0000 0000 0011 1100 0000 0000 (two)
then, after executing the MIPS instruction
or $t0,$t1,$t2 # reg $t0 = reg $t1 | reg $t2
the value in register $t0 would be 0000 0000 0000 0000 0011 1101 1100 0000 (two)
A shorter bitwise example: 00101 OR 10111 = 10111 (apply the OR truth table to each bit position).
LOGICAL NOT (nor)
The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if the operand bit is a 0, and vice versa. To keep the three-operand format, the designers of MIPS decided to include the instruction NOR (NOT OR) instead of NOT. If one operand is zero, NOR is equivalent to NOT: A NOR 0 = NOT (A OR 0) = NOT (A).
Step 1: Perform the bitwise OR with a register filled with zeros (a dummy operand):
00101 OR 00000 = 00101
Step 2: Invert the result to get 11010.
Instruction: nor $t0,$t1,$t3 # reg $t0 = ~ (reg $t1 | reg $t3)
Constants are useful in AND and OR logical operations as well as in arithmetic operations, so MIPS also provides the instructions and immediate (andi) and or immediate (ori).
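The OR and NOR examples can be reproduced with a short sketch on 32-bit values. The helper names are illustrative; the register contents are the ones used in the OR example, and the 5-bit pattern is the one used in the NOT steps.

```python
MASK32 = 0xFFFFFFFF   # keep results within a 32-bit register


def bitwise_or(a, b):
    """or: result bit is 1 if either operand bit is 1."""
    return (a | b) & MASK32


def bitwise_nor(a, b):
    """nor: invert every bit of (a OR b)."""
    return ~(a | b) & MASK32


t1 = 0b0011_1100_0000_0000
t2 = 0b0000_1101_1100_0000
t0 = bitwise_or(t1, t2)                # or $t0,$t1,$t2
print(f"{t0:032b}")

# NOT via NOR with a zero operand; look only at the low 5 bits of the result.
low5 = bitwise_nor(0b00101, 0) & 0b11111
print(f"{low5:05b}")
```

In a full 32-bit register the NOR result also has 1s in all the upper bits, since those bits of the zero-extended operand are 0.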
1.9 CONTROL OPERATIONS
Branch and Conditional branches: Decision making is commonly represented in programming languages using the if statement, sometimes combined with go to statements and labels. MIPS assembly language includes two decision-making instructions, similar to an if statement with a go to. The first instruction is
beq register1, register2, L1
This instruction means go to the statement labelled L1 if the value in register1 equals the value in register2. The mnemonic beq stands for branch if equal. The second instruction is
bne register1, register2, L1
It means go to the statement labelled L1 if the value in register1 does not equal the value in register2. The mnemonic bne stands for branch if not equal. These two instructions are traditionally called conditional branches.
EXAMPLE: if (i == j) f = g + h; else f = g – h; the MIPS version of the given statements is
      bne $s3,$s4,Else   # go to Else if i ≠ j
      add $s0,$s1,$s2    # f = g + h (skipped if i ≠ j)
      j Exit             # go to Exit
Else: sub $s0,$s1,$s2    # f = g – h (skipped if i = j)
Exit:
Here bne is used instead of beq because testing for the opposite condition lets the code branch over the then part, which is generally more efficient. This example introduces another kind of branch, often called an unconditional branch. This instruction says that the processor always follows the branch. To distinguish between conditional and unconditional branches, the MIPS name for this type of instruction is jump, abbreviated as j. (In this example, f, g, h, i, and j are variables mapped to the five registers $s0 through $s4.)
Loops:
Decisions are important both for choosing between two alternatives (found in if statements) and for iterating a computation (found in loops). The same assembly instructions are the basic building blocks for both cases (if and loop).
EXAMPLE: while (save[i] == k)
i += 1; the MIPS version of the given statements
Assume that i and k correspond to registers $s3 and $s5 and the base of the array save is in $s6.
Loop: sll $t1,$s3,2      # Temp reg $t1 = i * 4
To get the address of save[i], we need to add $t1 and the base of save in $s6:
      add $t1,$t1,$s6    # $t1 = address of save[i]
Now we can use that address to load save[i] into a temporary register:
      lw $t0,0($t1)      # Temp reg $t0 = save[i]
The next instruction performs the loop test, exiting if save[i] ≠ k:
      bne $t0,$s5,Exit   # go to Exit if save[i] ≠ k
The next instruction adds 1 to i:
      addi $s3,$s3,1     # i = i + 1
The end of the loop branches back to the while test at the top of the loop. We just add the Exit label after it, and we're done:
      j Loop             # go to Loop
Exit:
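The loop above can be traced with a small Python sketch, one statement per MIPS instruction. The function name run_while_loop and the use of a dictionary as byte-addressed memory are assumptions made only for illustration.

```python
def run_while_loop(memory, base, i, k):
    """Trace of: while (save[i] == k) i += 1;

    memory maps a byte address to the word stored there (illustrative model),
    base plays the role of $s6, i of $s3 and k of $s5.
    """
    while True:
        t1 = i << 2        # sll  $t1,$s3,2    -> i * 4 (words are 4 bytes)
        t1 = t1 + base     # add  $t1,$t1,$s6  -> address of save[i]
        t0 = memory[t1]    # lw   $t0,0($t1)   -> save[i]
        if t0 != k:        # bne  $t0,$s5,Exit
            break
        i += 1             # addi $s3,$s3,1
        # j Loop: control returns to the top of the while
    return i               # Exit:


# save = [7, 7, 7, 3] stored from byte address 1000
memory = {1000: 7, 1004: 7, 1008: 7, 1012: 3}
final_i = run_while_loop(memory, base=1000, i=0, k=7)
```

The shift by 2 is what turns the word index i into a byte offset, which is why the assembly needs sll before the add.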
1.10 ADDRESSING AND ADDRESSING MODES
Addressing types:
Three-address instructions
Syntax: opcode source1, source2, destination
Eg: ADD A, B, C (operation is C = A + B)
Two-address instructions
Syntax: opcode source, destination
Eg: ADD A, B (operation is B = A + B)
One-address instructions (to fit in one word length)
Syntax: opcode source
Eg: STORE C (copies the content of the accumulator to memory location C), where the accumulator is a dedicated processor register that serves as the implicit second operand.
Zero-address instructions (stack operations)
Syntax: opcode
Eg: PUSH A (pushes the value in A onto the stack; the stack location is implicit)
Addressing Modes:
The different ways in which the location of an operand is specified in an instruction are referred to as addressing modes. An addressing mode determines which register or memory location a machine instruction refers to.
Register mode: The operand is the content of a processor register. The register name/address is given in the instruction. Reading the operands as source, destination, the value of R1 is moved to R2.
Example: MOV R1, R2
Absolute mode (direct): The operand is in a memory location, and the address of the location is given explicitly in the instruction. Reading the operands as source, destination, the value in memory location 1000H is moved to A.
Example: MOV 1000, A
Immediate mode: Address and data constants are given explicitly in the instruction. Here the constant 200 is moved to register R0.
Example: MOV #200, R0
Indirect mode: The register named in the instruction (R1 in this case) does not hold the operand itself. Instead, it holds the address of the memory location where the operand is stored. The processor reads R1, fetches the value from that address, and adds it to the value in register R0.
Example: ADD (R1), R0
Indexed / relative addressing mode: The processor takes the content of register R1 as the base address and adds the constant 20 (offset / displacement) to it to obtain the effective memory address of the operand. It fetches the value stored there and adds it to register R2.
Example: ADD 20(R1), R2
Auto increment mode and auto decrement mode: The content of the register / address supplied in the instruction is automatically incremented or decremented.
Example: Increment R1 (increments the given register / address content by one)
Example: Decrement R2 (decrements the given register / address content by one)
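The modes above differ only in how the operand is located. The following Python sketch makes that concrete; the function fetch_operand, the mode names and the dictionary-based register file and memory are all hypothetical. For the auto increment and auto decrement modes it assumes the common textbook convention that the register holds the operand address, with a step of one as in the notes.

```python
def fetch_operand(mode, regs, mem, reg=None, value=None):
    """Return the source operand selected by an addressing mode (illustrative)."""
    if mode == "immediate":        # MOV #200, R0  -> the constant itself
        return value
    if mode == "register":         # MOV R1, R2    -> content of the register
        return regs[reg]
    if mode == "absolute":         # MOV 1000, A   -> content of memory[1000]
        return mem[value]
    if mode == "indirect":         # ADD (R1), R0  -> register holds the address
        return mem[regs[reg]]
    if mode == "indexed":          # ADD 20(R1), R2 -> base register + offset
        return mem[regs[reg] + value]
    if mode == "autoincrement":    # use the address, then increment the register
        operand = mem[regs[reg]]
        regs[reg] += 1
        return operand
    if mode == "autodecrement":    # decrement the register, then use the address
        regs[reg] -= 1
        return mem[regs[reg]]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"R1": 100, "R2": 5}
mem = {100: 42, 120: 9, 1000: 77}
```

For instance, fetch_operand("indexed", regs, mem, reg="R1", value=20) reads memory location 100 + 20 = 120.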
UNIT II ARITHMETIC OPERATIONS
ALU – Addition and subtraction – Multiplication – Division – Floating Point operations – Subword parallelism.
2.1 ALU
2.2 Addition and subtraction
2.3 Multiplication
2.4 Division
2.5 Floating Point operations
2.6 Subword parallelism
2.1 ALU (Arithmetic Logic Unit)
An arithmetic logic unit (ALU) is a digital electronic circuit that performs arithmetic and bitwise logical operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating-point numbers. An ALU is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs). A single CPU, FPU or GPU may contain multiple ALUs.
The inputs to an ALU are the data to be operated on, called operands, and a code indicating the operation to be performed; the ALU's output is the result of the performed operation. In many designs, the ALU also exchanges additional information with a status register, which relates to the result of the current or previous operations.
A symbolic representation of an ALU and its input and output signals, indicated by arrows pointing into or out of the ALU, respectively. Each arrow represents one or more signals.
Signals
An ALU has a variety of input and output nets, which are the shared electrical connections used to convey digital signals between the ALU and external circuitry. When an ALU is operating, external circuits apply signals to the ALU inputs and, in response, the ALU produces and conveys signals to external circuitry via its outputs.
Data
A basic ALU has three parallel data buses consisting of two input operands (A and B) and a result output (Y). Each data bus is a group of signals that conveys one binary integer number. Typically, the A, B and Y bus widths (the number of signals comprising each bus) are identical and match the native word size of the encapsulating CPU (or other processor).
Opcode
The opcode input is a parallel bus that conveys to the ALU an operation selection code, which is an enumerated value that specifies the desired arithmetic or logic operation to be performed by the ALU. The opcode size (its bus width) determines the maximum number of different operations the ALU can perform; for example, a four-bit opcode can specify up to sixteen different ALU operations. Generally, an ALU opcode is not the same as a machine language opcode, though in some cases it may be directly encoded as a bit field within a machine language opcode.
Status
The status outputs are various individual signals that convey supplemental information about the result of an ALU operation. These outputs are usually stored in registers so they can be used in future ALU operations or for controlling conditional branching. The collection of bit registers that store the status outputs is often treated as a single, multi-bit register, which is referred to as the "status register" or "condition code register". General-purpose ALUs commonly have status signals such as:
• Carry-out, which conveys the carry resulting from an addition operation, the borrow resulting from a subtraction operation, or the overflow bit resulting from a binary shift operation.
• Zero, which indicates all bits of the Y bus are logic zero.
• Negative, which indicates the result of an arithmetic operation is negative.
• Overflow, which indicates the result of an arithmetic operation has exceeded the numeric range of the Y bus.
• Parity, which indicates whether an even or odd number of bits on the Y bus are logic one.
The status input allows additional information to be made available to the ALU when performing an operation. Typically, this is a "carry-in" bit that is the stored carry-out from a previous ALU operation.
Circuit operation
An ALU is a combinational logic circuit, meaning that its outputs will change asynchronously in response to input changes. In normal operation, stable signals are applied to all of the ALU inputs and, when enough time (known as the "propagation delay") has passed for the signals to propagate through the ALU circuitry, the result of the ALU operation appears at the ALU outputs. The external circuitry connected to the ALU is responsible for ensuring the stability of ALU input signals throughout the operation, and for allowing sufficient time for the signals to propagate through the ALU before sampling the ALU result.
For example, a CPU begins an ALU addition operation by routing operands from their sources (which are usually registers) to the ALU's operand inputs, while the control unit simultaneously applies a value to the ALU's opcode input, configuring it to perform addition. At the same time, the CPU also routes the ALU result output to a destination register that will receive the sum. The ALU's input signals, which are held stable until the next clock, are allowed to propagate through the ALU and to the destination register while the CPU waits for the next clock. When the next clock arrives, the destination register stores the ALU result and, since the ALU operation has completed, the ALU inputs may be set up for the next ALU operation.
The combinational logic circuitry of the 74181 integrated circuit, which is a simple four-bit ALU
Functions
A number of basic arithmetic and bitwise logic functions are commonly supported by ALUs. Basic, general purpose ALUs typically include these operations in their repertoires:
Arithmetic operations
• Add: A and B are summed and the sum appears at Y and carry-out.
• Add with carry: A, B and carry-in are summed and the sum appears at Y and carry-out.
• Subtract: B is subtracted from A (or vice-versa) and the difference appears at Y and carry-out. For this function, carry-out is effectively a "borrow" indicator. This operation may also be used to compare the magnitudes of A and B; in such cases the Y output may be ignored by the processor, which is only interested in the status bits (particularly zero and negative) that result from the operation.
• Subtract with borrow: B is subtracted from A (or vice-versa) with borrow (carry-in) and the difference appears at Y and carry-out (borrow out).
• Two's complement (negate): A (or B) is subtracted from zero and the difference appears at Y.
• Increment: A (or B) is increased by one and the resulting value appears at Y.
• Decrement: A (or B) is decreased by one and the resulting value appears at Y.
• Pass through: all bits of A (or B) appear unmodified at Y. This operation is typically used to determine the parity of the operand or whether it is zero or negative.
Bitwise logical operations
• AND: the bitwise AND of A and B appears at Y.
• OR: the bitwise OR of A and B appears at Y.
• Exclusive-OR: the bitwise XOR of A and B appears at Y.
• One's complement: all bits of A (or B) are inverted and appear at Y.
Bit shift operations
ALU shift operations cause operand A (or B) to shift left or right (depending on the opcode) and the shifted operand appears at Y. Simple ALUs typically can shift the operand by only one bit position, whereas more complex ALUs employ barrel shifters that allow them to shift the operand by an arbitrary number of bits in one operation. In all single-bit shift operations, the bit shifted out of the operand appears on carry-out; the value of the bit shifted into the operand depends on the type of shift.
• Arithmetic shift: the operand is treated as a two's complement integer, meaning that the most significant bit is a "sign" bit and is preserved.
• Logical shift: a logic zero is shifted into the operand. This is used to shift unsigned integers.
• Rotate: the operand is treated as a circular buffer of bits so its least and most significant bits are effectively adjacent.
• Rotate through carry: the carry bit and operand are collectively treated as a circular buffer of bits.
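A behavioural sketch of such an ALU in Python may help tie the opcode, data and status signals together. The function alu, its string opcodes and the 8-bit default width are assumptions for illustration, and only a subset of the listed functions is modelled. For subtraction the sketch reports the raw carry-out of A + NOT(B) + 1, so a carry of 1 means that no borrow occurred.

```python
def alu(op, a, b=0, carry_in=0, width=8):
    """Illustrative ALU model: returns (Y, status flags)."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    carry = 0
    overflow = 0
    if op == "add":
        full = a + b + carry_in
        y = full & mask
        carry = full >> width
        # signed overflow: both operands differ in sign from the result
        overflow = int(((a ^ y) & (b ^ y) & sign) != 0)
    elif op == "sub":
        full = a + (~b & mask) + 1          # A + two's complement of B
        y = full & mask
        carry = full >> width               # 1 = no borrow, 0 = borrow
        overflow = int(((a ^ b) & (a ^ y) & sign) != 0)
    elif op == "and":
        y = a & b
    elif op == "or":
        y = a | b
    elif op == "xor":
        y = a ^ b
    elif op == "shl":                       # logical shift left by one bit
        y = (a << 1) & mask
        carry = (a >> (width - 1)) & 1      # bit shifted out
    elif op == "shr":                       # logical shift right by one bit
        y = a >> 1
        carry = a & 1                       # bit shifted out
    else:
        raise ValueError(f"unsupported opcode: {op}")
    flags = {
        "carry": carry,
        "zero": int(y == 0),
        "negative": int((y & sign) != 0),
        "overflow": overflow,
    }
    return y, flags
```

A compare operation can reuse "sub" and look only at the zero and negative flags, ignoring Y, as described in the Subtract entry above.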
Big Endian: In big endian, you store the most significant byte in the smallest address.
Little Endian: In little endian, you store the least significant byte in the smallest address.
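The two byte orders can be seen directly with Python's built-in int.to_bytes; the value 0x12345678 is just an illustrative choice. Index 0 of each byte string corresponds to the smallest address.

```python
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex())     # bytes in increasing address order, big endian
print(little.hex())  # bytes in increasing address order, little endian

# Reading the bytes back with the matching byte order recovers the value.
restored = int.from_bytes(little, byteorder="little")
```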
Fixed-point arithmetic
A fixed-point number representation is a real data type for a number that has a fixed number of digits after (and sometimes also before) the radix point (the decimal point '.' in English decimal notation). Fixed-point number representation can be compared to the more complicated (and more computationally demanding) floating-point number representation. Fixed-point numbers are useful for representing fractional values, usually in base 2 or base 10, when the executing processor has no floating-point unit (FPU) or if fixed-point provides improved performance or accuracy for the application at hand.
Sign-Magnitude
The sign-magnitude binary format is the simplest conceptual format. To represent a number in sign-magnitude, we simply use the leftmost bit to represent the sign, where 0 means positive and 1 means negative, and the remaining bits to represent the magnitude.
B7     B6 B5 B4 B3 B2 B1 B0
Sign   Magnitude
What are the decimal values of the following 8-bit sign-magnitude numbers?
10000011 = -3
00000101 = +5
11111111 = ?
01111111 = ?
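The exercise above can be checked with a small decoder. This is a minimal sketch; the function name sign_magnitude_to_int is hypothetical and the input is assumed to be a string of '0' and '1' characters with the sign bit first.

```python
def sign_magnitude_to_int(bits):
    """Decode a sign-magnitude bit string such as '10000011'."""
    sign = -1 if bits[0] == "1" else 1   # leftmost bit is the sign
    magnitude = int(bits[1:], 2)         # remaining bits are the magnitude
    return sign * magnitude


for pattern in ("10000011", "00000101", "11111111", "01111111"):
    print(pattern, "=", sign_magnitude_to_int(pattern))
```

Note that sign-magnitude has two encodings of zero (00000000 and 10000000), both of which this decoder maps to 0.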
1's complement
The 1's complement of a number is found by changing all 1's to 0's and all 0's to 1's. This is called taking the complement or 1's complement. For example, the 1's complement of 10101010 is 01010101.
2's complement
The 2's complement of a binary number is obtained by adding 1 to the least significant bit (LSB) of the 1's complement of the number.
2's complement = 1's complement + 1
For example, the 2's complement of 10101010 is 01010101 + 1 = 01010110.
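Both complements are one-liners once a fixed width is chosen. The sketch below assumes 8-bit values by default; the function names are illustrative.

```python
def ones_complement(x, width=8):
    """Invert every bit of x within the given width."""
    mask = (1 << width) - 1
    return ~x & mask


def twos_complement(x, width=8):
    """1's complement plus 1, wrapped to the given width."""
    mask = (1 << width) - 1
    return (ones_complement(x, width) + 1) & mask


x = 0b00000101                            # +5
print(f"{ones_complement(x):08b}")        # every bit of 00000101 flipped
print(f"{twos_complement(x):08b}")        # 8-bit encoding of -5
```

Applying twos_complement twice returns the original value, which is why the same circuit negates in both directions.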
2.2 ADDITION AND SUBTRACTION
Binary Addition
Binary addition follows four rules: 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1 and 1 + 1 = 10. In the fourth case the sum is 10, i.e. 0 is written in the given column and a carry of 1 is carried over to the next column.
Example – Addition
Half Adder and Full Adder Circuit
Half Adder
The half adder adds two binary digits, called the augend and the addend, and produces two outputs, sum and carry. An XOR gate is applied to both inputs to produce the sum and an AND gate is applied to both inputs to produce the carry. Using half adders, simple addition can be designed with the help of logic gates.
Half Adder Logic Circuit
Half Adder block diagram Half Adder Truth Table
Full Adder
An adder is a digital circuit that performs addition of numbers. The full adder adds three one-bit numbers, two of which are the operands and the third the bit carried in. It produces a 2-bit output consisting of the sum and the output carry.
The full adder is more difficult to implement than a half adder. The difference between a half adder and a full adder is that the full adder has three inputs and two outputs, whereas the half adder has only two inputs and two outputs. The first two inputs are A and B and the third input is an input carry, C-IN. Once the full-adder logic is designed, eight of them can be strung together to create a byte-wide adder, cascading the carry bit from one adder to the next.
Full Adder Truth Table:
Implementation of full adder with two half adders
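The gate-level behaviour of both adders can be sketched in Python using bitwise operators on single bits. The function names are illustrative; the full adder is built from two half adders plus an OR gate, as in the implementation named above.

```python
def half_adder(a, b):
    """Return (sum, carry) for two one-bit inputs."""
    return a ^ b, a & b          # XOR gives the sum, AND gives the carry


def full_adder(a, b, c_in):
    """Full adder built from two half adders and an OR gate."""
    s1, c1 = half_adder(a, b)        # first half adder: A + B
    s, c2 = half_adder(s1, c_in)     # second half adder: partial sum + C-IN
    return s, c1 | c2                # either half adder may produce the carry


# Print the full adder truth table: A B C-IN -> SUM C-OUT
for a in (0, 1):
    for b in (0, 1):
        for c_in in (0, 1):
            s, c_out = full_adder(a, b, c_in)
            print(a, b, c_in, "->", s, c_out)
```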
N-Bit Parallel Adder
The full adder is capable of adding only two single-bit binary numbers along with a carry input. In practice, however, we need to add binary numbers that are much longer than one bit. To add two n-bit binary numbers we use the n-bit parallel adder, which consists of a number of full adders in cascade. The carry output of each full adder is connected to the carry input of the next full adder.
4-Bit Parallel Adder
In the block diagram, A0 and B0 represent the LSBs of the four-bit words A and B. Hence Full Adder-0 is the lowest stage, and its Cin is permanently set to 0. The rest of the connections are exactly the same as those of the n-bit parallel adder shown in the figure. The four-bit parallel adder is a very common logic circuit.
Block diagram of N-Bit Parallel Adder
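The cascade of full adders can be sketched as a loop in which each stage's carry output feeds the next stage's carry input. The names full_adder and parallel_adder are illustrative, and operands are passed as ordinary integers whose bits are picked out one position at a time.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def parallel_adder(a, b, n=4, c_in=0):
    """Add two n-bit numbers with n cascaded full adders.

    Returns (n-bit sum, final carry out). Stage 0 handles the LSBs.
    """
    carry = c_in                       # Cin of Full Adder-0
    total = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        s_i, carry = full_adder(a_i, b_i, carry)   # carry ripples to stage i+1
        total |= s_i << i
    return total, carry


result, c_out = parallel_adder(0b1011, 0b0110)   # 11 + 6 with 4-bit operands
```

Because each stage must wait for the carry from the stage below it, the delay of this arrangement grows with n.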
Binary Subtraction
Subtraction and borrow are the two terms used most frequently in binary subtraction. The operation A - B is performed using the following four steps:
1. Take the 2's complement of B.
2. Compute the result A + 2's complement of B.
3. If a carry is generated, the result is positive and in true form; in this case the carry is ignored.
4. If a carry is not generated, the result is negative and in 2's complement form.
N-Bit Parallel Subtractor
The subtraction can be carried out by taking the 1's or 2's complement of the number to be subtracted. For example, we can perform the subtraction (A-B) by adding either the 1's or the 2's complement of B to A. That means we can use a binary adder to perform binary subtraction.
4-Bit Parallel Subtractor
The number to be subtracted (B) is first passed through inverters to obtain its 1's complement. The 4-bit adder, with its carry input set to 1, then adds A and the 1's complement of B, which is the same as adding A and the 2's complement of B, to produce the difference. S3 S2 S1 S0 represents the result of the binary subtraction (A-B) and the carry output Cout represents the polarity of the result. If A ≥ B then Cout = 1 and the result is in true binary form (A-B); if A < B then Cout = 0 and the result is negative and in 2's complement form.
Block diagram of N-Bit Subtractor
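Subtraction by adding the 2's complement can be sketched as follows. The function subtract and its 4-bit default width are assumptions for illustration; the carry output is returned so that the polarity of the result can be read as described for the four-step rule (carry generated means a positive result in true form).

```python
def subtract(a, b, width=4):
    """Compute A - B as A + NOT(B) + 1 on a width-bit adder.

    Returns (difference bits, carry out).
    """
    mask = (1 << width) - 1
    inverted_b = ~b & mask                 # inverters: 1's complement of B
    total = (a & mask) + inverted_b + 1    # carry-in of 1 completes the 2's complement
    return total & mask, total >> width


def magnitude(diff, c_out, width=4):
    """Recover |A - B| from the adder outputs."""
    mask = (1 << width) - 1
    if c_out:                              # carry generated: result in true form
        return diff
    return (~diff + 1) & mask              # no carry: result is in 2's complement form


d1, c1 = subtract(9, 5)    # A > B
d2, c2 = subtract(5, 9)    # A < B
```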
Half Subtractors
A half subtractor is a combinational circuit with two inputs and two outputs (difference and borrow). It produces the difference between the two binary bits at the input and also produces an output (borrow) to indicate whether a 1 has been borrowed. In the subtraction (A-B), A is called the minuend bit and B is called the subtrahend bit.
Truth Table Circuit Diagram
Full Subtractors
The disadvantage of a half subtractor, that it has no input for a borrow from a previous stage, is overcome by the full subtractor. The full subtractor is a combinational circuit with three inputs A, B, C and two outputs D and C'. A is the minuend, B is the subtrahend, C is the borrow produced by the previous stage, D is the difference output and C' is the borrow output.
Truth Table Circuit Diagram
2.3 MULTIPLICATION
Multiplication of decimal numbers in longhand can be used to show the steps of multiplication and the names of the operands.
Binary multiplication is similar to decimal multiplication. It is simpler than decimal multiplication because only 0s and 1s are involved. There are four rules of binary multiplication: 0 × 0 = 0, 0 × 1 = 0, 1 × 0 = 0 and 1 × 1 = 1.
Example
The number of digits in the product is considerably larger than the number in either the multiplicand or the multiplier. Ignoring the sign bits, multiplying an n-bit multiplicand by an m-bit multiplier gives a product that is n + m bits long.
So, n + m bits are required to represent all possible products. The overflow condition must therefore also be considered whenever the product has to fit into fewer than n + m bits.
The multiplicand is shifted left and the multiplier is shifted right after each multiplier bit has been processed. The number of iterations needed to find the product equals the number of bits in the multiplier. In this case we have 32 iterations (MIPS).
Example: Multiply 2ten × 3ten, or 0010two × 0011two. (4 bits are used to save space.)
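The shift-and-add procedure just described can be sketched in Python for unsigned operands. The function name shift_add_multiply is illustrative; n is the number of multiplier bits and therefore the number of iterations.

```python
def shift_add_multiply(multiplicand, multiplier, n=32):
    """Unsigned shift-and-add multiplication with n iterations."""
    product = 0
    for _ in range(n):
        if multiplier & 1:             # test the least significant multiplier bit
            product += multiplicand    # add the (shifted) multiplicand
        multiplicand <<= 1             # multiplicand moves left
        multiplier >>= 1               # multiplier moves right
    return product


# The 4-bit example: 0010two x 0011two
p = shift_add_multiply(0b0010, 0b0011, n=4)
print(f"{p:08b}")                      # product shown in n + m = 8 bits
```

With two 32-bit operands the product can need up to 64 bits, which is why the result of the loop is not masked to the operand width.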
Booth's Algorithm of Multiplication
General Steps of Booth's Algorithm:-
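Step 1: Take the multiplicand M and the multiplier Q, set A and the extra bit Q(n+1) to 0, and set the sequence counter SC to n, the number of bits in the multiplier.
Step 2: Check the bit pair formed by the least significant bit of Q and the extra bit Q(n+1).
Step 3: If the bits are 0,1, add M to A and then perform an arithmetic right shift of A, Q and Q(n+1).
Step 4: If the bits are 1,0, compute A + (M)' + 1 (that is, subtract M from A) and then perform the arithmetic right shift. If the bits are 0,0 or 1,1, perform only the arithmetic right shift.
Step 5: Decrement SC and check whether it has reached 0.
Step 6: Repeat Steps 2, 3 and 4 until the count reaches 0; the product is then held in A and Q.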
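A minimal Python sketch of Booth's algorithm for n-bit two's complement operands follows. It uses the usual register arrangement (accumulator A, multiplier Q and one extra bit to the right of Q); the function name booth_multiply and the variable names are illustrative. It assumes the multiplicand is not the most negative n-bit value, whose negation does not fit in n bits.

```python
def booth_multiply(m, q, n=8):
    """Multiply two signed n-bit integers with Booth's algorithm."""
    mask = (1 << n) - 1
    sign_bit = 1 << (n - 1)
    m &= mask                       # multiplicand M as an n-bit pattern
    q &= mask                       # multiplier Q as an n-bit pattern
    a = 0                           # accumulator A
    q_extra = 0                     # extra bit to the right of Q
    for _ in range(n):              # the sequence counter runs n times
        pair = (q & 1, q_extra)
        if pair == (0, 1):
            a = (a + m) & mask                  # A = A + M
        elif pair == (1, 0):
            a = (a + (~m & mask) + 1) & mask    # A = A + M' + 1 (A - M)
        # arithmetic right shift of the combined register A, Q, extra bit
        q_extra = q & 1
        q = (q >> 1) | ((a & 1) << (n - 1))
        a = (a >> 1) | (a & sign_bit)           # keep the sign bit of A
    product = (a << n) | q          # 2n-bit result held in A and Q
    if product & (1 << (2 * n - 1)):
        product -= 1 << (2 * n)     # reinterpret as a signed value
    return product
```

For example, booth_multiply(2, 3, n=4) works through the same 0010two × 0011two operands as the earlier multiplication example, and booth_multiply(-3, 5, n=4) shows that negative operands need no special handling.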
Binary Division Hardware
Division algorithms fall into two main categories: slow division and fast division. Slow division algorithms produce one digit of the final quotient per iteration. Examples of slow division include restoring, non-performing restoring, non-restoring, and SRT division. Fast division methods start with a close approximation to the final quotient and produce twice as many digits of the final quotient on each iteration. Newton–Raphson and Goldschmidt fall into this category.
Sequential Restoring Division
• A shift register keeps both the (remaining) dividend and the quotient
• With each cycle, the dividend decreases by one digit and the quotient increases by one digit
• The MSBs of the remaining dividend and the divisor are aligned in each cycle
• Major differences to multiplication:
  1. we do not know if we can subtract the divisor or not
  2. if the subtraction failed, we have to restore the original dividend
Procedure
1. Load the 2n-bit dividend into both halves of the shift register, and add a sign bit to the left
2. Add a sign bit to the left of the divisor
3. Generate the 2's complement of the divisor
4. Shift to the left
5. Add the 2's complement of the divisor to the upper half of the shift register including the sign bit (subtract)
6. If the sign of the result is cleared (positive)
   • then set the LSB of the lower half of the shift register to one
   • else clear the LSB of the lower half and add the divisor to the upper half of the shift register (restore)
7. Repeat from 4. and perform the loop n times
8. After termination:
   • lower half of shift register ⇒ quotient
   • upper half of shift register ⇒ remainder
Restoring Algorithm
Assume: X is the k-bit dividend
Assume: Y is the k-bit divisor
Assume: S is a sign bit
Start: Load 0 into the k-bit accumulator A and load the dividend X into the k-bit quotient register MQ.
Step A: Shift the 2k-bit register pair A-MQ left.
Step B: Subtract the divisor Y from A.
Step C: If the sign of A (msb) = 1, then reset MQ0 (lsb) = 0, else set MQ0 = 1.
Step D: If MQ0 = 0, add Y to A (restore the effect of the earlier subtraction).
Repeat Steps A to D until the total number of cycles = k.
At the end, A has the remainder and MQ has the quotient.
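Steps A to D translate almost line for line into Python. In this sketch the function name restoring_divide is illustrative, operands are unsigned, the divisor is assumed to be non-zero, and the sign of A is modelled by letting the Python integer go negative instead of keeping an explicit sign bit S.

```python
def restoring_divide(x, y, k=4):
    """Restoring division of a k-bit dividend x by a k-bit divisor y.

    Returns (quotient, remainder).
    """
    mask = (1 << k) - 1
    a = 0                  # accumulator A
    mq = x & mask          # quotient register MQ starts with the dividend
    for _ in range(k):
        # Step A: shift the register pair A-MQ left by one position
        a = (a << 1) | (mq >> (k - 1))
        mq = (mq << 1) & mask
        # Step B: subtract the divisor from A
        a -= y
        if a < 0:
            # Steps C and D: sign of A is 1, so MQ0 stays 0 and A is restored
            a += y
        else:
            # Step C: sign of A is 0, so set MQ0 = 1
            mq |= 1
    return mq, a


quotient, remainder = restoring_divide(0b1000, 0b0011)   # 8 divided by 3
```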
The restoring division algorithm:
S1: Do n times
    Shift A and Q left one binary position.
    Subtract M from A, placing the answer back in A.
    If the sign of A is 1, set q0 to 0 and add M back to A (restore A); otherwise, set q0 to 1.

A restoring-division example (dividend 1000, divisor 11):
              A            Q
Initially     0 0 0 0 0    1 0 0 0
M             0 0 0 1 1
Shift         0 0 0 0 1    0 0 0 _
Subtract      1 1 1 0 1
Set q0        1 1 1 1 0
Restore             1 1
              0 0 0 0 1    0 0 0 0     First cycle
Shift         0 0 0 1 0    0 0 0 _
Subtract      1 1 1 0 1
Set q0        1 1 1 1 1
Restore             1 1
              0 0 0 1 0    0 0 0 0     Second cycle
Shift         0 0 1 0 0    0 0 0 _
Subtract      1 1 1 0 1
Set q0        0 0 0 0 1    0 0 0 1     Third cycle
Shift         0 0 0 1 0    0 0 1 _
Subtract      1 1 1 0 1
Set q0        1 1 1 1 1
Restore             1 1
              0 0 0 1 0    0 0 1 0     Fourth cycle
              Remainder    Quotient

The non-restoring division algorithm:
S1: Do n times
    If the sign of A is 0, shift A and Q left one binary position and subtract M from A; otherwise, shift A and Q left and add M to A.
    Now, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.
S2: If the sign of A is 1, add M to A.
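The non-restoring steps S1 and S2 can be sketched in Python as below. The function name non_restoring_divide is illustrative; operands are unsigned, the divisor M is assumed to be non-zero, and the sign of A is modelled by a Python integer that is allowed to go negative. The quotient bit q0 is set to 1 when A is non-negative after the add or subtract, which is the usual textbook formulation.

```python
def non_restoring_divide(dividend, m, n=4):
    """Non-restoring division of an n-bit dividend by divisor m.

    Returns (quotient, remainder).
    """
    mask = (1 << n) - 1
    a = 0                     # register A
    q = dividend & mask       # register Q
    for _ in range(n):        # S1: do n times
        was_negative = a < 0  # sign of A before the shift
        a = 2 * a + (q >> (n - 1))   # shift A and Q left one position
        q = (q << 1) & mask
        if was_negative:
            a += m            # sign of A was 1: add M
        else:
            a -= m            # sign of A was 0: subtract M
        if a >= 0:
            q |= 1            # sign of A is 0: set q0 to 1
    if a < 0:                 # S2: final correction of the remainder
        a += m
    return q, a


quotient, remainder = non_restoring_divide(0b1000, 0b0011)   # 8 divided by 3
```

Unlike the restoring method, a negative A is carried into the next cycle and compensated by an addition, so at most one restore is needed, at the very end.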
Floating-Point Number Representation
A floating-point number (or real number) can represent a very large (1.23×10^88) or a very small (1.23×10^-88) value. It could also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero.
A floating-point number is typically expressed in scientific notation, with a fraction (F) and an exponent (E) of a certain radix (r), in the form F×r^E. Decimal numbers use a radix of 10 (F×10^E), while binary numbers use a radix of 2 (F×2^E).
The representation of a floating-point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2; binary number 1010.1011B can be normalized as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers (even within a small range of, say, 0.0 to 0.1). On the other hand, an n-bit binary pattern can represent only a finite 2^n distinct numbers. Hence, not all real numbers can be represented. The nearest approximation is used instead, resulting in loss of accuracy.
It is also important to note that floating-point arithmetic is much less efficient than integer arithmetic. It can be sped up with a dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.
IEEE-754 32-bit Single-Precision Floating-Point Numbers
In 32-bit single-precision floating-point representation:
• The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
• The following 8 bits represent the exponent (E).
• The remaining 23 bits represent the fraction (F).
Normalized Form
Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:
S = 1
E = 1000 0001
F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D.
The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative numbers. In this example with S=1, this is a negative number, i.e., -1.375D.
In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponents. With an 8-bit E, ranging from 0 to 255, the excess-127 scheme could provide an actual exponent of -127 to 128. In this example, E-127 = 129-127 = 2D.
Hence, the number represented is -1.375×2^2 = -5.5D.
De-Normalized Form
Normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself of this!
De-normalized form was devised to represent zero and other numbers.
For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction, and the actual exponent is always -126. Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126 = 0).
We can also represent very small positive and negative numbers in de-normalized form with E=0. For example, if S=1, E=0, and F=011 0000 0000 0000 0000 0000, the actual fraction is 0.011 = 1×2^-2 + 1×2^-3 = 0.375D. Since S=1, it is a negative number. With E=0, the actual exponent is -126. Hence the number is -0.375×2^-126 = -4.4×10^-39, which is an extremely small negative number (close to zero).
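The normalized and de-normalized decoding rules can be checked with a short Python sketch. The function decode_float32 is illustrative and deliberately leaves out the special values with E = 255; float32_from_bits uses the standard struct module to let the machine interpret the same bit pattern as a cross-check.

```python
import struct


def decode_float32(bits):
    """Decode a 32-bit pattern by hand using the rules above."""
    s = bits >> 31                 # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits
    f = bits & 0x7FFFFF            # 23 fraction bits
    if e == 255:
        raise ValueError("special values (E = 255) are not covered by this sketch")
    if e == 0:
        # de-normalized: implicit leading 0, exponent fixed at -126
        value = (f / 2**23) * 2.0**-126
    else:
        # normalized: implicit leading 1, exponent is E - 127
        value = (1 + f / 2**23) * 2.0**(e - 127)
    return -value if s else value


def float32_from_bits(bits):
    """Cross-check: reinterpret the same 32 bits as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


normalized = 0b1_10000001_01100000000000000000000     # the S=1, E=129 example
denormalized = 0b1_00000000_01100000000000000000000   # the S=1, E=0 example
print(decode_float32(normalized))
print(decode_float32(denormalized))
```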
IEEE Standard 754 Floating Point Numbers
There are several ways to represent real numbers on computers. Fixed point places a radix point somewhere in the middle of the digits, and is equivalent to using integers that represent portions of some unit. For example, one might represent 1/100ths of a unit; if you have four decimal digits, you could represent 10.82, or 00.01. Another approach is to use rationals, and represent every number as the ratio of two integers.
Floating-point representation, the most common solution, uses scientific notation to encode numbers, with a base number and an exponent. For example, 123.456 could be represented as 1.23456 × 10^2. In hexadecimal, the number 123.abc might be represented as 1.23abc × 16^2. In binary, the number 10100.110 could be represented as 1.0100110 × 2^4.
Floating-point solves a number of representation problems. Fixed-point has a fixed window of representation, which limits it from representing very large or very small numbers. Also, fixed-point is prone to a loss of precision when two large numbers are divided.
Floating-point, on the other hand, employs a sort of "sliding window" of precision appropriate to the scale of the number. This allows it to represent numbers from 1,000,000,000,000 to 0.0000000000000001 with ease, while maximizing precision (the number of digits) at both ends of the scale.
Storage Layout
IEEE floating point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.
The following table shows the layout for single (32-bit) and double (64-bit) precision floating-point values. The number of bits for each field is shown (bit ranges are in square brackets, 00 = least-significant bit):
Floating Point Components
                    Sign      Exponent      Fraction
Single Precision    1 [31]    8 [30–23]     23 [22–00]
Double Precision    1 [63]    11 [62–52]    52 [51–00]
The Sign Bit
The sign bit is as simple as it gets. 0 denotes a positive number, and 1 denotes a negative number. Flipping the value of this bit flips the sign of the number.
The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200 – 127), or 73. For reasons discussed later, exponents of −127 (all 0s) and +128 (all 1s) are reserved for special numbers.
The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number. It is composed of an implicit leading bit (left of the radix point) and the fraction bits (to the right of the radix point).
To find out the value of the implicit leading bit, consider that any number can be expressed in scientific notation in many different ways. For example, the number 50 can be represented as any of these:
0.5000 × 10^2
0.050 × 10^3
5000. × 10^-2
In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, fifty is represented as 5.000 × 10^1.
2.6 SUB WORD PARALLELISM
A subword is a lower-precision unit of data contained within a word. In subword parallelism, multiple subwords are packed into a word and whole words are then processed. With the appropriate subword boundaries, this technique results in parallel processing of the subwords.
Since the same instruction is applied to all subwords within the word, this is a form of SIMD (Single Instruction, Multiple Data) processing.
It is possible to apply subword parallelism to noncontiguous subwords of different sizes within a word. In practice, however, implementation is simplest when the subwords are the same size and contiguous within a word. The data-parallel programs that benefit from subword parallelism tend to process data that are of the same size.
For example, if the word size is 64 bits and the subword sizes are 8, 16, and 32 bits, an instruction operates on eight 8-bit subwords, four 16-bit subwords, two 32-bit subwords, or one 64-bit subword in parallel.
Subword parallelism is an efficient and flexible solution for media processing because media algorithms exhibit a great deal of data parallelism on lower-precision data.
It is also useful for computations unrelated to multimedia that exhibit data parallelism on lower-precision data.
Graphics and audio applications can take advantage of performing simultaneous operations on short vectors.
Example: 128-bit adder:
Sixteen 8-bit adds
Eight 16-bit adds
Four 32-bit adds
Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD).
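The idea of packing subwords into one word and processing them with a single operation can be imitated in software. The sketch below is only an illustration, not how a hardware SIMD adder is built: it packs eight 8-bit subwords into a 64-bit word and adds all eight lanes at once, with each lane wrapping modulo 256 and no carry crossing a subword boundary. The names pack_bytes, unpack_bytes and packed_add8 are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
HIGH_BITS = 0x8080808080808080  # most significant bit of each 8-bit lane
LOW_BITS = 0x7F7F7F7F7F7F7F7F   # low 7 bits of each 8-bit lane

def pack_bytes(values):
    """Pack eight 8-bit values into one 64-bit word, lane 0 in the low byte."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word

def unpack_bytes(word):
    """Return the eight 8-bit lanes of a 64-bit word as a list."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(8)]

def packed_add8(a, b):
    """Add eight 8-bit lanes in parallel; carries never cross a lane boundary."""
    # Adding only the low 7 bits of each lane cannot overflow out of the lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is a XOR b XOR carry-in; the carry-in is already in low_sum.
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK64

if __name__ == "__main__":
    x = pack_bytes([1, 2, 3, 4, 250, 255, 128, 0])
    y = pack_bytes([1, 1, 1, 1, 10, 1, 128, 7])
    print(unpack_bytes(packed_add8(x, y)))
```

The same masking idea extends to four 16-bit or two 32-bit lanes by changing the two lane masks.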
UNIT III PROCESSOR AND CONTROL UNIT
Basic MIPS implementation – Building data path – Control Implementation scheme – Pipelining – Pipelined data path and control – Handling Data hazards & Control hazards – Exceptions.
3.1 Basic MIPS implementation
3.2 Building data path
3.3 Control Implementation scheme
3.4 Pipelining
3.5 Pipelined data path and control
3.6 Handling Data hazards & Control hazards
3.7 Exceptions
3.1 A BASIC MIPS IMPLEMENTATION
We will be examining an implementation that includes a subset of the core MIPS instruction set:
The memory-reference instructions load word (lw) and store word (sw)
The arithmetic-logical instructions add, sub, AND, OR, and slt
The instructions branch equal (beq) and jump (j), which we add last
This subset does not include all the integer instructions (for example, shift, multiply, and divide are missing), nor does it include any floating-point instructions. However, the key principles used in creating a data path and designing the control are illustrated.
In examining the implementation, we will have the opportunity to see how the instruction set architecture determines aspects of the implementation, and how the choice of various implementation strategies affects the clock rate and CPI for the computer. In addition, most concepts used to implement the MIPS subset in this chapter are the same basic ideas that are used to construct a broad spectrum of computers, from high-performance servers to general-purpose microprocessors to embedded processors.
An Overview of the Implementation
The core MIPS instructions include the integer arithmetic-logical instructions, the memory-reference instructions, and the branch instructions. Much of what needs to be done to implement these instructions is the same, independent of the exact class of instruction. For every instruction, the first two steps are identical:
1. Send the program counter (PC) to the memory that contains the code and fetch the instruction from that memory.
2. Read one or two registers, using fields of the instruction to select the registers to read. For the load word instruction, we need to read only one register, but most other instructions require that we read two registers.
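Step 2 above relies on fixed bit fields inside the 32-bit instruction word. The sketch below shows how those fields could be extracted; the bit positions are the standard MIPS32 encoding, which this section does not spell out, and the function name decode_fields is illustrative only.

```python
def decode_fields(instruction):
    """Extract the fixed-position fields of a 32-bit MIPS instruction word."""
    return {
        "opcode": (instruction >> 26) & 0x3F,  # bits 31..26
        "rs": (instruction >> 21) & 0x1F,      # bits 25..21, first register to read
        "rt": (instruction >> 16) & 0x1F,      # bits 20..16, second register
        "rd": (instruction >> 11) & 0x1F,      # bits 15..11, R-format destination
        "funct": instruction & 0x3F,           # bits 5..0, R-format operation
        "imm": instruction & 0xFFFF,           # bits 15..0, 16-bit offset field
    }

if __name__ == "__main__":
    # 0x012A4020 encodes add $t0, $t1, $t2; 0x8D280004 encodes lw $t0, 4($t1).
    print(decode_fields(0x012A4020))
    print(decode_fields(0x8D280004))
```

All fields are extracted unconditionally; which of them are meaningful depends on the instruction class.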
After these two steps, the actions required to complete the instruction depend on the instruction class. Fortunately, for each of the three instruction classes (memory-reference, arithmetic-logical, and branches), the actions are largely the same, independent of the exact instruction. The simplicity and regularity of the MIPS instruction set simplifies the implementation by making the execution of many of the instruction classes similar.
For example, all instruction classes, except jump, use the arithmetic-logical unit (ALU) after reading the registers. The memory-reference instructions use the ALU for an address calculation, the arithmetic-logical instructions for the operation execution, and branches for comparison. After using the ALU, the actions required to complete various instruction classes differ. A memory-reference instruction will need to access the memory either to read data for a load or write data for a store. An arithmetic-logical or load instruction must write the data from the ALU or memory back into a register. Lastly, for a branch instruction, we may need to change the next instruction address based on the comparison; otherwise, the PC should be incremented by 4 to get the address of the next instruction.
Figure 3.1 shows the high-level view of a MIPS implementation, focusing on the various functional units and their interconnection. Although this figure shows most of the flow of data through the processor, it omits two important aspects of instruction execution.
FIGURE 3.1 An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them.
All instructions start by using the program counter to supply the instruction address to the instruction memory. After the instruction is fetched, the register operands used by an instruction are specified by fields of that instruction. Once the register operands have been fetched, they can be operated on to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or to perform a comparison (for a branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file. Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and branch offset are summed) or from an adder that increments the current PC by 4. The thick lines interconnecting the functional units represent buses, which consist of multiple signals. The arrows are used to guide the reader in knowing how
information flows. Since signal lines may cross, we explicitly show when crossing lines are connected by the presence of a dot where the lines cross.
First, Figure 3.1 shows data going to a particular unit as coming from two different sources. For example, the value written into the PC can come from one of two adders, the data written into the register file can come from either the ALU or the data memory, and the second input to the ALU can come from a register or the immediate field of the instruction. In practice, these data lines cannot simply be wired together; we must add a logic element that chooses from among the multiple sources and steers one of those sources to its destination. This selection is commonly done with a device called a multiplexor, although this device might better be called a data selector. A multiplexor selects among its inputs according to the setting of its control lines. The control lines are set based primarily on information taken from the instruction being executed.
Second, several of the units must be controlled depending on the type of instruction. For example, the data memory must read on a load and write on a store. The register file must be written on a load and an arithmetic-logical instruction. And, of course, the ALU must perform one of several operations. Like the multiplexors, these operations are directed by control lines that are set on the basis of various fields in the instruction.
Figure 3.2 shows the data path of Figure 3.1 with the three required multiplexors added, as well as control lines for the major functional units. A control unit, which has the instruction as an input, is used to determine how to set the control lines for the functional units and two of the multiplexors. The third multiplexor, which determines whether PC + 4 or the branch destination address is written into the PC, is set based on the Zero output of the ALU, which is used to perform the comparison of a beq instruction. The regularity and simplicity of the MIPS instruction set means that a simple decoding process can be used to determine how to set the control lines.
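The three selections described above can be sketched with a two-input multiplexor function. This is a behavioural illustration only; the signal names alu_src, mem_to_reg and branch are assumptions chosen for readability and do not appear in this section. The PC multiplexor is shown as selected by the branch signal ANDed with the ALU's Zero output.

```python
def mux2(select, input0, input1):
    """2-to-1 multiplexor: steer input1 to the output when select is 1, else input0."""
    return input1 if select else input0

def select_paths(pc_plus_4, branch_target, branch, zero,
                 alu_result, memory_data, mem_to_reg,
                 register_value, immediate, alu_src):
    """Model the three multiplexors of the basic MIPS subset datapath."""
    # PC source: branch target only when the instruction is a branch AND the ALU reports Zero.
    next_pc = mux2(branch & zero, pc_plus_4, branch_target)
    # Register write data: ALU result (arithmetic-logical) or data memory output (load).
    write_data = mux2(mem_to_reg, alu_result, memory_data)
    # Second ALU input: register value or the instruction's immediate/offset field.
    alu_input_b = mux2(alu_src, register_value, immediate)
    return next_pc, write_data, alu_input_b

if __name__ == "__main__":
    print(select_paths(104, 200, 1, 1, 7, 99, 0, 3, 16, 0))
```

Each control input is treated as a single bit (0 or 1), matching the one-control-line-per-multiplexor picture in the text.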
FIGURE 3.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines.
The top multiplexor ("Mux") controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled by the gate that "ANDs" together the Zero output of the ALU and a control signal that indicates that the instruction is a branch. The middle multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction (for a load or store). The added control lines are straightforward and determine the operation performed at the ALU, whether the data memory should read or write, and whether the registers should perform a write operation.
Logic Design Conventions
The data path elements in the MIPS implementation consist of two different types of logic elements: elements that operate on data values and elements that contain state. The elements that operate on data values are all combinational, which means that their outputs depend only on the current inputs. Given the same input, a combinational element always produces the same output.
A state element has at least two inputs and one output. The required inputs are the data value to be written into the element and the clock, which determines when the data value is written. The output from a state element provides the value that was written in an earlier clock cycle.
Logic components that contain state are also called sequential, because their outputs depend on both their inputs and the contents of the internal state.
A clocking methodology defines when signals can be read and when they can be written. It is important to specify the timing of reads and writes, because if a signal is written at the same time it is read, the value of the read could correspond to the old value, the newly written value, or even some mix of the two! Computer designs cannot tolerate such unpredictability. A clocking methodology is designed to ensure predictability.
An edge-triggered clocking methodology means that any values stored in a sequential logic element are updated only on a clock edge. Because only state elements can store a data value, any collection of combinational logic must have its inputs come from a set of state elements and its outputs written into a set of state elements. The inputs are values that were written in a previous clock cycle, while the outputs are values that can be used in a following clock cycle.
Figure 3.3 shows the two state elements surrounding a block of combinational logic, which operates in a single clock cycle: all signals must propagate from state element 1, through the combinational logic, and to state element 2 in the time of one clock cycle. The time necessary for the signals to reach state element 2 defines the length of the clock cycle.
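The edge-triggered behaviour described above can be modelled in a few lines. The following is a simplified sketch, assuming an idealized clock with no propagation delay; the class StateElement and the function run_cycle are hypothetical names. It shows that combinational logic sees the value written in an earlier cycle, even when the source element receives a new value at the same clock edge.

```python
class StateElement:
    """A clocked storage element: the output changes only at a clock edge."""

    def __init__(self, value=0):
        self.value = value        # output: the value written at an earlier edge
        self.next_value = value   # data input, captured at the next clock edge

    def clock_edge(self):
        self.value = self.next_value

def run_cycle(source, destination, combinational):
    """One clock cycle: state element 1 -> combinational logic -> state element 2."""
    # During the cycle the logic reads the old output of the source element.
    destination.next_value = combinational(source.value)
    # At the clock edge every state element captures its input.
    source.clock_edge()
    destination.clock_edge()

if __name__ == "__main__":
    element1, element2 = StateElement(5), StateElement(0)
    element1.next_value = 8   # element 1 is also being written this cycle
    run_cycle(element1, element2, lambda x: x + 1)
    print(element1.value, element2.value)
```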
FIGURE 3.5 Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address.
The state elements are the instruction memory and the program counter. The instruction memory need only provide read access because the data path does not write instructions. Since the instruction memory is only read, we treat it as combinational logic: the output at any time reflects the contents of the location specified by the address input, and no read control signal is needed. (We will need to write the instruction memory when we load the program; this is not hard to add, and we ignore it for simplicity.) The program counter is a 32-bit register that is written at the end of every clock cycle and thus does not need a write control signal. The adder is an ALU wired to always add its two 32-bit inputs and place the sum on its output.
FIGURE 3.6 A portion of the data path used for fetching instructions and incrementing the program counter.
We will draw such an ALU with the label Add, as in Figure 3.5, to indicate that it has been permanently made an adder and cannot perform the other ALU functions.
To execute any instruction, we must start by fetching the instruction from memory. To prepare for executing the next instruction, we must also increment the program counter so that it points at the next instruction, 4 bytes later. Figure 3.6 shows how to combine the three elements from Figure 3.5 to form a data path that fetches instructions and increments the PC to obtain the address of the next sequential instruction.
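The fetch portion of the data path can be sketched as a function that reads the instruction memory at the PC and produces PC + 4. This is a simplified model: the instruction memory is assumed to be a Python dictionary keyed by byte address, and the name fetch is illustrative.

```python
def fetch(instruction_memory, pc):
    """Read the instruction at the PC and compute the next sequential address."""
    instruction = instruction_memory[pc]   # read-only, no control signal needed
    next_pc = (pc + 4) & 0xFFFFFFFF        # dedicated adder: 32-bit PC + 4
    return instruction, next_pc

if __name__ == "__main__":
    program = {0: 0x012A4020, 4: 0x8D280004}  # two example instruction words
    pc = 0
    while pc in program:
        instruction, pc = fetch(program, pc)
        print(hex(instruction), pc)
```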
The processor's 32 general-purpose registers are stored in a structure called a register file. A register file is a collection of registers in which any register can be read or written by specifying the number of the register in the file. The register file contains the register state of the computer. In addition, we will need an ALU to operate on the values read from the registers. R-format instructions have three register operands, so we will need to read two data words from the register file and write one data word into the register file for each instruction. For each data word to be read from the registers, we need an input to the register file that specifies the register number to be read and an output from the register file that will carry the value that has been read from the registers.
To write a data word, we will need two inputs: one to specify the register number to be written and one to supply the data to be written into the register. The register file always outputs the contents of whatever register numbers are on the Read register inputs. Writes, however, are controlled by the write control signal, which must be asserted for a write to occur at the clock edge.
FIGURE 3.7 The two elements needed to implement R-format ALU operations are the register file and the ALU.
The register file contains all the registers and has two read ports and one write port. The register file always outputs the contents of the registers corresponding to the Read register inputs on the outputs; no other control inputs are needed. In contrast, a register write must be explicitly indicated by asserting the write control signal.
Remember that writes are edge-triggered, so that all the write inputs (i.e., the value to be written, the register number, and the write control signal) must be valid at the clock edge. Since writes to the register file are edge-triggered, our design can legally read and write the same register within a clock cycle: the read will get the value written in an earlier clock cycle, while the value written will be available to a read in a subsequent clock cycle. The inputs carrying the register number to the register file are all 5 bits wide, whereas the lines carrying data values are 32 bits wide. The operation to be performed by the ALU is controlled with the ALU operation signal, which will be 4 bits wide. We will use the Zero detection output of the ALU shortly to implement branches. The overflow output will not be needed.
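The register file behaviour described above (two read ports, one write port, edge-triggered writes gated by a write control signal) can be sketched as follows. The class and method names are hypothetical, and the guard that keeps register 0 at zero reflects a MIPS convention that this section does not state.

```python
class RegisterFile:
    """32 registers of 32 bits with two read ports and one edge-triggered write port."""

    def __init__(self):
        self.registers = [0] * 32
        self._pending = None   # write inputs waiting for the clock edge

    def read(self, read_register_1, read_register_2):
        """Reads need no control signal: the outputs always follow the register numbers."""
        return self.registers[read_register_1], self.registers[read_register_2]

    def set_write_inputs(self, write_register, write_data, reg_write):
        """Present the write inputs; nothing is stored until the clock edge."""
        self._pending = (write_register, write_data) if reg_write else None

    def clock_edge(self):
        """Commit the pending write, if the write control signal was asserted."""
        if self._pending is not None:
            number, data = self._pending
            if number != 0:  # assumption: MIPS register 0 always holds zero
                self.registers[number] = data & 0xFFFFFFFF
            self._pending = None

if __name__ == "__main__":
    rf = RegisterFile()
    rf.registers[9] = 5
    rf.set_write_inputs(9, 42, reg_write=1)
    print(rf.read(9, 0))   # same cycle: still the value from an earlier cycle
    rf.clock_edge()
    print(rf.read(9, 0))   # following cycle: the newly written value
```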
For load and store instructions, we will also need a unit to sign-extend the 16-bit offset field in the instruction to a 32-bit signed value, and a data memory unit to read from or write to. The data memory must be written on store instructions; hence, data memory has read and write control signals, an address input, and an input for the data to be written into memory. The beq instruction has three operands: two registers that are compared for equality, and a 16-bit offset used to compute the branch target address relative to the branch instruction address. Its form is beq $t1,$t2,offset. To implement this instruction, we must compute the branch target address by adding the sign-extended offset field of the instruction to the PC.
The instruction set architecture specifies that the base for the branch address calculation is the address of the instruction following the branch. Since we compute PC + 4 (the address of the next instruction) in the instruction fetch datapath, it is easy to use this value as the base for computing the branch target address. The architecture also states that the offset field is shifted left 2 bits so that it is a word offset; this shift increases the effective range of the offset field by a factor of 4.
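The branch target calculation described above combines three steps: sign-extend the 16-bit offset, shift it left by 2 bits, and add it to PC + 4. A minimal sketch, with the illustrative function names sign_extend_16 and branch_target:

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(branch_pc, offset_field):
    """Target = (address of the instruction after the branch) + (offset << 2)."""
    pc_plus_4 = branch_pc + 4
    word_offset = sign_extend_16(offset_field) << 2   # word offset -> byte offset
    return (pc_plus_4 + word_offset) & 0xFFFFFFFF     # keep the result to 32 bits

if __name__ == "__main__":
    print(hex(branch_target(0x100, 3)))        # forward branch by 3 words
    print(hex(branch_target(0x100, 0xFFFF)))   # offset -1: back to the branch itself
```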
FIGURE 3.8 The two units needed to implement loads and stores, in addition to the register file and ALU of Figure 3.7, are the data memory unit and the sign extension unit. The memory unit is a state element with inputs for the address and the write data, and a single output for the read result. There are separate read and write controls, although only one of these may be asserted on any given clock. The memory unit needs a read signal, since, unlike the register file, reading the value of an invalid address can cause problems.
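The data memory unit described above can be sketched as a small class with an address input, a write-data input, and separate read and write controls. This is a simplified model under stated assumptions: memory is a dictionary of words keyed by byte address, unwritten locations read as 0, and a write takes effect immediately rather than at a clock edge. The names DataMemory and access are hypothetical.

```python
class DataMemory:
    """Word-addressed data memory with separate MemRead and MemWrite controls."""

    def __init__(self):
        self.words = {}   # byte address -> 32-bit word

    def access(self, address, write_data, mem_read, mem_write):
        """Perform at most one operation per clock; return read data or None."""
        if mem_read and mem_write:
            raise ValueError("only one of the read and write controls may be asserted")
        if mem_write:
            self.words[address] = write_data & 0xFFFFFFFF
            return None
        if mem_read:
            return self.words.get(address, 0)   # assumption: unwritten words read as 0
        return None   # neither control asserted: the memory is not accessed

if __name__ == "__main__":
    memory = DataMemory()
    memory.access(8, 77, mem_read=0, mem_write=1)         # store
    print(memory.access(8, 0, mem_read=1, mem_write=0))   # load
```

Because nothing is read unless the read control is asserted, an address that is meaningless for a non-memory instruction is never presented to the memory as a real access.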
FIGURE 3.9 The data path for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended, lower 16 bits of the instruction (the branch displacement), shifted left 2 bits.
The unit labelled Shift left 2 is simply a routing of the signals between input and output that adds 00 (binary) to the low-order end of the sign-extended offset field; no actual shift hardware is needed, since the amount of the "shift" is constant. Since we know that the offset was sign-extended from 16 bits, the shift will throw away only "sign bits." Control logic is used to decide whether the incremented PC or branch target should replace the PC, based on the Zero output of the ALU.
Creating a Single Datapath
This simplest datapath will attempt to execute all instructions in one clock cycle. This means that no datapath resource can be used more than once per instruction, so any element needed more than once must be duplicated. We therefore need a memory for instructions separate from one for data. Although some of the functional units will need to be duplicated, many of the elements can be shared by different instruction flows.
To share a datapath element between two different instruction classes, we may need to allow multiple connections to the input of an element, using a multiplexor and control signal to select among the multiple inputs.
Building a Datapath
The datapaths for the arithmetic-logical (or R-type) instructions and the memory instructions are quite similar. The key differences are the following:
The arithmetic-logical instructions use the ALU, with the inputs coming from the two registers. The memory instructions can also use the ALU to do the address calculation, although the second input is the sign-extended 16-bit offset field from the instruction.
The value stored into a destination register comes from the ALU (for an R-type instruction) or the memory (for a load).
To create a datapath with only a single register file and a single ALU, we must support two different sources for the second ALU input, as well as two different sources for the data stored into the register file. Thus, one multiplexor is placed at the ALU input and another at the data input to the register file.
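The shared datapath described above, with one multiplexor at the second ALU input and another at the register file data input, can be sketched behaviourally. This is a simplified model, not the figure's exact wiring: registers are a Python list, data memory is a dictionary keyed by byte address, the destination register number is passed in directly, and the control names alu_src, mem_to_reg, reg_write and mem_write are assumptions.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def alu_add(a, b):
    """Example ALU operation; loads and stores use addition for the address."""
    return a + b

def datapath_step(registers, memory, read_reg_1, read_reg_2, write_reg, offset,
                  alu_operation, alu_src, mem_to_reg, reg_write, mem_write):
    """Execute one R-type, load, or store instruction on the shared datapath."""
    read_data_1 = registers[read_reg_1]
    read_data_2 = registers[read_reg_2]
    # Multiplexor at the ALU input: register value or sign-extended offset.
    alu_input_b = sign_extend_16(offset) if alu_src else read_data_2
    alu_result = alu_operation(read_data_1, alu_input_b) & 0xFFFFFFFF
    if mem_write:   # store: the ALU result is the address
        memory[alu_result] = read_data_2
    if reg_write:
        # Multiplexor at the register file data input: memory (load) or ALU (R-type).
        write_data = memory.get(alu_result, 0) if mem_to_reg else alu_result
        registers[write_reg] = write_data
    return alu_result

if __name__ == "__main__":
    regs, mem = [0] * 32, {}
    regs[9], regs[10] = 100, 7
    # add: both ALU inputs from registers, ALU result written to register 8
    datapath_step(regs, mem, 9, 10, 8, 0, alu_add, 0, 0, 1, 0)
    # store: address = regs[9] + 4, data = regs[10]
    datapath_step(regs, mem, 9, 10, 0, 4, alu_add, 1, 0, 0, 1)
    # load: address = regs[9] + 4, memory data written to register 11
    datapath_step(regs, mem, 9, 0, 11, 4, alu_add, 1, 1, 1, 0)
    print(regs[8], mem, regs[11])
```

The same ALU and register file serve all three instructions; only the two multiplexor selects and the write enables change.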
FIGURE 3.10 The datapath for the memory instructions and the R-type instructions.
3.3 CONTROL IMPLEMENTATION SCHEME
Implementation Scheme
We build this simple implementation using the datapath of the last section and adding a simple control function. This simple implementation covers load word (lw), store word (sw),