m.i.e.t. engineering college · 2019-03-01 · syllabus (theory) sub. code: cs6303 branch / year /...

Subject code/Name: CS6303/CA

M.I.E.T./CSE/II/COMPUTER ARCHITECTURE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

COURSE MATERIAL

CS6303 COMPUTER ARCHITECTURE

II YEAR - III SEMESTER

M.I.E.T. ENGINEERING COLLEGE

(Approved by AICTE and Affiliated to Anna University Chennai)

TRICHY – PUDUKKOTTAI ROAD, TIRUCHIRAPPALLI – 620 007



CS6303 - Computer Architecture Notes

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

COURSE OBJECTIVE

1. To understand the hardware-software interface.
2. To familiarize the students with the arithmetic and logic unit and the implementation of fixed-point and floating-point arithmetic operations.
3. To expose the students to the concept of pipelining.
4. To familiarize the students with the hierarchical memory system, including cache memories and virtual memory.
5. To expose the students to different ways of communicating with I/O devices and standard I/O interfaces.

COURSE OUTCOMES

1. Identify the hardware blocks, instruction sets and addressing modes.
2. Solve problems using arithmetic operations.
3. Detect pipeline hazards and identify possible solutions to those hazards.
4. Overcome the challenges of parallelism and its classifications.
5. Understand the basic concepts of memory and I/O systems.
6. Use various metrics to calculate the performance of a computer system.

Prepared by Verified By

S.SHANMUGA PRIYA HOD

Approved by

PRINCIPAL





DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

SYLLABUS (THEORY)

Sub. Code : CS6303 Branch / Year / Sem : CSE/II/III

Sub.Name : Computer Architecture Staff Name : S.SHANMUGA PRIYA

L T P C

3 0 0 3

UNIT I OVERVIEW & INSTRUCTIONS 9
Eight ideas – Components of a computer system – Technology – Performance – Power wall – Uniprocessors to multiprocessors; Instructions – operations and operands – representing instructions – Logical operations – control operations – Addressing and addressing modes.

UNIT II ARITHMETIC OPERATIONS 7
ALU – Addition and subtraction – Multiplication – Division – Floating Point operations – Subword parallelism.

UNIT III PROCESSOR AND CONTROL UNIT 11
Basic MIPS implementation – Building datapath – Control Implementation scheme – Pipelining – Pipelined datapath and control – Handling Data hazards & Control hazards – Exceptions.

UNIT IV PARALLELISM 9
Instruction-level-parallelism – Parallel processing challenges – Flynn’s classification – Hardware multithreading – Multicore processors.

UNIT V MEMORY AND I/O SYSTEMS 9
Memory hierarchy – Memory technologies – Cache basics – Measuring and improving cache performance – Virtual memory, TLBs – Input/output system, programmed I/O, DMA and interrupts, I/O processors.

TOTAL: 45 PERIODS

TEXTBOOKS

T1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth edition, Morgan Kaufmann / Elsevier, 2014.

REFERENCES:

R1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organisation", VI edition, McGraw-Hill Inc, 2012.

R2. William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006.

R3. Vincent P. Heuring, Harry F. Jordan, "Computer System Architecture", Second Edition, Pearson Education, 2005.

SUBJECT IN-CHARGE HOD






Unit I

OVERVIEW & INSTRUCTIONS

Design Principles of Computer Architecture

• CISC vs. RISC
• Instructions directly executed by hardware
• Maximize instruction issue rate (ILP)
• Simple instructions (easy to decode)
• Access to memory only via load/store
• Plenty of registers
• Pipelining

Basic Computer Organization



Bus-Based Computer Organization

Data path:

Memory

Dynamic Random Access Memory (DRAM)

The choice for main memory

Volatile (contents go away when power is lost)

Fast

Relatively small



DRAM capacity: 2x / 2 years (since '96); 64x size improvement in last decade

Static Random Access Memory (SRAM)

The choice for cache

Much faster than DRAM, but less dense and more costly

Magnetic disks

The choice for secondary memory

Non-volatile

Slower

Relatively large

Capacity: 2x / 1 year (since '97)

250X size in last decade

Solid state (Flash) memory

The choice for embedded computers

Non-volatile

Optical disks

Removable, therefore very large

Slower than disks

Magnetic tape

Even slower

Sequential (non-random) access

The choice for archival

System Software

Operating system – supervising program that interfaces the user’s program with the hardware (e.g., Linux, MacOS, Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Provides for protected sharing among multiple applications

Compiler – translates programs written in a high-level language (e.g., C, Java) into instructions that the hardware can execute

TECHNOLOGY

A transistor is simply an on/off switch controlled by electricity. The integrated circuit (IC) combined dozens to hundreds of transistors into a single chip. To describe the tremendous increase in the number of transistors from hundreds to millions, the adjective very large scale is added to the term, creating the abbreviation VLSI, for very large-scale integrated circuit.

CLASSES OF COMPUTERS

Desktop computers
Designed to deliver good performance to a single user at low cost, usually executing third-party software and usually incorporating a graphics display, a keyboard, and a mouse.

Servers
Used to run larger programs for multiple, simultaneous users, typically accessed only via a network, with a greater emphasis on dependability and (often) security.
- Modern version of what used to be called mainframes, minicomputers and supercomputers
- Large workloads
- Built using the same technology as desktops but with higher capacity
- Gigabytes to terabytes to petabytes of storage
- Expandable
- Scalable
- Reliable
- Large spectrum: from low-end (file storage, small businesses) to supercomputers (high-end scientific and engineering applications)
- Examples: file servers, web servers, database servers

Supercomputers
A high-performance, high-cost class of servers with hundreds to thousands of processors, terabytes of memory and petabytes of storage, used for high-end scientific and engineering applications.

Embedded computers (processors)
A computer inside another device, used for running one predetermined application.
- Microprocessors everywhere! (washing machines, cell phones, automobiles, video games)
- Run one or a few applications
- Specialized hardware integrated with the application (not your common processor)
- Usually stringent limitations (battery power)
- Low tolerance for failure (you don’t want your airplane avionics to fail!)
- Becoming ubiquitous
- Engineered using processor cores
- The core allows the engineer to integrate other functions into the processor for fabrication on the same chip
- Designed using hardware description languages: Verilog, VHDL

Embedded Processor Characteristics
The largest class of computers, spanning the widest range of applications and performance.
- Often have minimum performance requirements.
- Often have stringent limitations on cost.
- Often have stringent limitations on power consumption.
- Often have low tolerance for failure.

PERFORMANCE

Defining Performance

Let’s suppose we define performance in terms of speed. This still leaves two possible definitions. You could define the fastest plane as the one with the highest cruising speed, taking a single passenger from one point to another in the least time. If you were interested in transporting 450 passengers from one point to another, however, the 747 would clearly be the fastest, as the last column of the figure shows. Similarly, we can define computer performance in several different ways.

Throughput and Response Time

Do the following changes to a computer system increase throughput, decrease response time, or both?
1. Replacing the processor in a computer with a faster version
2. Adding additional processors to a system that uses multiple processors for separate tasks (for example, searching the World Wide Web)

Example

Time taken to run a program = 10s on A, 15s on B
Relative performance = Execution Time B / Execution Time A = 15s / 10s = 1.5
So A is 1.5 times faster than B.

Measuring Execution Time

Elapsed time
- Total response time, including all aspects: processing, I/O, OS overhead, idle time
- Determines system performance

CPU time
- Time spent processing a given job
- Discounts I/O time and other jobs’ shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU and system performance

CPU Clocking

Operation of digital hardware is governed by a constant-rate clock.

Clock period: duration of a clock cycle
- e.g., 250 ps = 0.25 ns = 250 × 10^-12 s

Clock frequency (rate): cycles per second
- e.g., 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz

CPU TIME

Example

Computer A: 2GHz clock, 10s CPU time
Designing Computer B
- Aim for 6s CPU time
- Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B clock be?

Instruction Count and CPI

Instruction Count for a program
- Determined by program, ISA and compiler

Average cycles per instruction
- Determined by CPU hardware
- If different instructions have different CPI, the average CPI is affected by the instruction mix

CPI Example

Computer A: Cycle Time = 250ps, CPI = 2.0
Computer B: Cycle Time = 500ps, CPI = 1.2
Same ISA
Which is faster, and by how much?

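Solution to the Computer B clock rate example:

\[ \text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9} \]

\[ \text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}} = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz} \]

Relations between instruction count, CPI and CPU time:

\[ \text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction} \]

\[ \text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \]

Solution to the CPI example (I is the instruction count, the same on both computers):

\[ \text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps} \]

\[ \text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps} \]

\[ \frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2 \]

So A is faster, by a factor of 1.2.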
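The two worked examples above can be checked numerically. The following is a minimal Python sketch; the function name `cpu_time` and the placeholder instruction count are assumptions made for illustration, not part of the original notes.

```python
# Sketch: checking the two worked examples with the basic performance equations.
# Names and the placeholder instruction count are illustrative assumptions.

def cpu_time(instruction_count, cpi, cycle_time):
    """CPU time = instruction count x CPI x clock cycle time (in seconds)."""
    return instruction_count * cpi * cycle_time

# Example 1: Computer A runs at 2 GHz for 10 s; B needs 1.2x the cycles in 6 s.
clock_cycles_a = 10 * 2e9                 # CPU time x clock rate
clock_rate_b = 1.2 * clock_cycles_a / 6   # required cycles / target CPU time

# Example 2: same ISA, so both computers execute the same instruction count.
instruction_count = 1_000_000             # arbitrary; it cancels in the ratio
time_a = cpu_time(instruction_count, 2.0, 250e-12)
time_b = cpu_time(instruction_count, 1.2, 500e-12)

print(f"Clock rate of B: {clock_rate_b / 1e9:.1f} GHz")
print(f"A is {time_b / time_a:.2f} times faster than B")
```

Because the instruction count cancels in the ratio, any positive value gives the same relative performance.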



CPI in More Detail

If different instruction classes take different numbers of cycles:

\[ \text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right) \]

Weighted average CPI:

\[ \text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right) \]

CPI Example

Alternative compiled code sequences using instructions in classes A, B, C

Class                A    B    C
CPI for class        1    2    3
IC in sequence 1     2    1    2
IC in sequence 2     4    1    1

Sequence 1: IC = 5
- Clock Cycles = 2×1 + 1×2 + 2×3 = 10
- Avg. CPI = 10/5 = 2.0

Sequence 2: IC = 6
- Clock Cycles = 4×1 + 1×2 + 1×3 = 9
- Avg. CPI = 9/6 = 1.5
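The weighted-average CPI calculation above can be sketched in a few lines of Python. The function name `average_cpi` and the dictionary layout are assumptions chosen for this illustration.

```python
# Sketch: weighted average CPI for the two code sequences in the example.
# average_cpi and the dictionary layout are illustrative assumptions.

def average_cpi(cpi_per_class, count_per_class):
    """Return (clock cycles, instruction count, average CPI)."""
    cycles = sum(cpi_per_class[c] * n for c, n in count_per_class.items())
    instructions = sum(count_per_class.values())
    return cycles, instructions, cycles / instructions

CPI_PER_CLASS = {"A": 1, "B": 2, "C": 3}
SEQUENCE_1 = {"A": 2, "B": 1, "C": 2}
SEQUENCE_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", SEQUENCE_1), ("Sequence 2", SEQUENCE_2)):
    cycles, ic, cpi = average_cpi(CPI_PER_CLASS, seq)
    print(f"{name}: IC = {ic}, clock cycles = {cycles}, average CPI = {cpi}")
```

Sequence 2 executes more instructions but needs fewer clock cycles, which is why instruction count alone is not a reliable performance measure.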



Performance depends on
- Algorithm: affects IC, possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, Tc (clock cycle time)

Computers are constructed using a clock that determines when events take place in the hardware. These discrete time intervals are called clock cycles (or ticks, clock ticks, clock periods, clocks, cycles). Designers refer to the length of a clock period both as the time for a complete clock cycle (e.g., 250 picoseconds, or 250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse of the clock period. In the next subsection, we will formalize the relationship between the clock cycles of the hardware designer and the seconds of the computer user.

Computer Performance and its Factors

Instruction Performance

The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to execute, is often abbreviated as CPI

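Classic CPU performance equation

\[ \text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} \]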



POWER WALL

UNIPROCESSORS AND MULTIPROCESSORS

The power limit has forced a dramatic change in the design of microprocessors. The figure shows the improvement in response time of programs for desktop microprocessors over time. Since 2002, the rate has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year.

As an analogy, suppose the task was to write a newspaper story. Eight reporters working on the same story could potentially write a story eight times faster. To achieve this increased speed, one would need to break up the task so that each reporter had something to do at the same time. Thus, we must schedule the subtasks. If anything went wrong and just one reporter took longer than the seven others did, then the benefits of having eight writers would be diminished. Thus, we must balance the load evenly to get the desired speedup. Another danger would be if reporters had to spend a lot of time talking to each other to write their sections. You would also fall short if one part of the story, such as the conclusion, couldn’t be written until all of the other parts were completed. Thus, care must be taken to reduce communication and synchronization overhead. For both this analogy and parallel programming, the challenges include scheduling, load balancing, time for synchronization, and overhead for communication between the parties. As you might guess, the challenge is stiffer with more reporters for a newspaper story and more processors for parallel programming.



OPERATIONS OF COMPUTER HARDWARE



OPERANDS OF COMPUTER HARDWARE

A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther. Guidelines such as "smaller is faster" are not absolutes; 31 registers may not be faster than 32. Yet the truth behind such observations causes computer designers to take them seriously. In this case, the designer must balance the craving of programs for more registers with the designer’s desire to keep the clock cycle fast. Another reason for not using more than 32 is the number of bits it would take in the instruction format.

Memory Operands

As explained above, arithmetic operations occur only on registers in MIPS instructions; thus, MIPS must include instructions that transfer data between memory and registers. Such instructions are called data transfer instructions. To access a word in memory, the instruction must supply the memory address. Memory is just a large, single-dimensional array, with the address acting as the index to that array, starting at 0. For example, in the figure, the address of the third data element is 2, and the value of Memory[2] is 10.

Given the importance of registers, what is the rate of increase in the number of registers in a chip over time?
1. Very fast: They increase as fast as Moore’s law, which predicts doubling the number of transistors on a chip every 18 months.
2. Very slow: Since programs are usually distributed in the language of the computer, there is inertia in instruction set architecture, and so the number of registers increases only as fast as new instruction sets become viable.

REPRESENTING INSTRUCTION



LOGICAL INSTRUCTION



Case/Switch Statement

Most programming languages have a case or switch statement that allows the programmer to select one of many alternatives depending on a single value. The simplest way to implement switch is via a sequence of conditional tests, turning the switch statement into a chain of if-then-else statements. Sometimes the alternatives may be more efficiently encoded as a table of addresses of alternative instruction sequences, called a jump address table or jump table; the program then needs only to index into the table and jump to the appropriate sequence. The jump table is just an array of words containing addresses that correspond to labels in the code. The program loads the appropriate entry from the jump table into a register and then jumps using the address in the register. To support such situations, computers like MIPS include a jump register instruction (jr), meaning an unconditional jump to the address specified in a register.

Nested Procedures

Procedures that do not call others are called leaf procedures. Life would be simple if all procedures were leaf procedures, but they aren’t. Just as a spy might employ other spies as part of a mission, who in turn might use even more spies, so do procedures invoke other procedures. Moreover, recursive procedures even invoke "clones" of themselves. Just as we need to be careful when using registers in procedures, more care must also be taken when invoking nonleaf procedures.
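Returning to the case/switch discussion above, the jump table idea can be illustrated in Python by storing function objects in a list and indexing into it. This is only an analogy for the MIPS mechanism: the handler names are hypothetical, and a Python list of functions stands in for an array of code addresses.

```python
# Sketch: a jump table as a list of handlers indexed by the switch value.
# The handler names are hypothetical; a list of functions stands in for
# the array of code addresses described in the text.

def case_zero():
    return "case 0"

def case_one():
    return "case 1"

def case_two():
    return "case 2"

def default_case():
    return "default"

JUMP_TABLE = [case_zero, case_one, case_two]

def switch(k):
    # Bounds check first, as compiled code must do before indexing the table.
    if 0 <= k < len(JUMP_TABLE):
        target = JUMP_TABLE[k]   # load the table entry (an "address")
        return target()          # jump through it (the role jr plays in MIPS)
    return default_case()

print(switch(1))
print(switch(7))
```

The table lookup takes the same time for every case, whereas a chain of if-then-else tests takes longer for later alternatives.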



ADDRESSING MODES

1. Immediate addressing, where the operand is a constant within the instruction itself
2. Register addressing, where the operand is a register
3. Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction
4. PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction
5. Pseudodirect addressing, where the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC
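The three address-forming modes (base, PC-relative and pseudodirect) can be sketched as small Python functions. The sketch assumes the standard 32-bit MIPS conventions, which the list above does not spell out: constants are 16-bit sign-extended values, branch offsets count words relative to PC + 4, and the 26-bit jump field is shifted left by 2 bits. The function names are illustrative.

```python
# Sketch: forming addresses for three MIPS addressing modes.
# Assumes 32-bit MIPS conventions (word offsets, PC + 4 as the base);
# function names are illustrative, not from the notes.

MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def base_address(register_value, offset16):
    """Base/displacement: register + sign-extended constant."""
    return (register_value + sign_extend16(offset16)) & MASK32

def pc_relative_target(pc, offset16):
    """PC-relative: (PC + 4) + word offset converted to bytes."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & MASK32

def pseudodirect_target(pc, address26):
    """Pseudodirect: upper 4 bits of PC + 4, then the 26-bit field, then 00."""
    return ((pc + 4) & 0xF0000000) | ((address26 & 0x03FFFFFF) << 2)

print(hex(base_address(0x1000, 8)))
print(hex(pc_relative_target(0x00400000, 3)))
print(hex(pseudodirect_target(0x00400000, 0x0100000)))
```

Immediate and register addressing need no address calculation, since the operand is already in the instruction or in a register.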



UNIT 2

ADDITION AND SUBTRACTION

Digits are added bit by bit from right to left, with carries passed to the next digit to the left, just as you would do by hand. Subtraction uses addition: the appropriate operand is simply negated before being added.

Binary Addition and Subtraction

Example

1. Add 6ten and 7ten
2. Subtract 6ten from 7ten

Adding 6ten and 7ten can be done as follows:

  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
= 0000 0000 0000 0000 0000 0000 0000 1101two = 13ten

The 4 bits to the right have all the action.

Subtracting 6ten from 7ten can be done directly:

  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
– 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten

or via addition, using the two's complement representation of –6:

  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 1111 1111 1111 1111 1111 1111 1111 1010two = –6ten
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
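The subtraction-by-addition example above can be reproduced with a short Python sketch. Python integers are unbounded, so a 32-bit mask is applied explicitly to model the word size; the names `negate` and `subtract` are chosen for this illustration.

```python
# Sketch: 32-bit two's complement negation and subtraction via addition.
# Python integers are unbounded, so results are masked to 32 bits.

BITS = 32
MASK = (1 << BITS) - 1

def negate(x):
    """Two's complement negation: invert every bit, then add 1."""
    return (~x + 1) & MASK

def subtract(a, b):
    """a - b computed as a + (-b), keeping only the low 32 bits."""
    return (a + negate(b)) & MASK

minus_six = negate(6)
print(f"{minus_six:032b}")       # bit pattern of -6
print(f"{subtract(7, 6):032b}")  # bit pattern of 7 - 6
```

The carry out of the most significant bit is simply discarded by the mask, as it is in the 32-bit hardware.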

The above figure shows the sums and carries. The carries are shown in parentheses, with the arrows showing how they are passed.

Figure: Binary addition, showing carries from right to left. The rightmost bit adds 1 to 0, resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the operation for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit and a carry out of 1. The third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1 and a sum bit of 1. The fourth bit is 1 + 0 + 0, yielding a 1 sum and no carry.

Overflow occurs when the result from an operation cannot be represented with the available hardware, in this case a 32-bit word. When can overflow occur in addition? When adding operands with different signs, overflow cannot occur. The reason is that the sum must be no larger than one of the operands. For example, –10 + 4 = –6. Since the operands fit in 32 bits and the sum is no larger than an operand, the sum must fit in 32 bits as well. Therefore, no overflow can occur when adding positive and negative operands.

There are similar restrictions on the occurrence of overflow during subtraction, but it’s just the opposite principle: when the signs of the operands are the same, overflow cannot occur. To see this, remember that x – y = x + (–y), because we subtract by negating the second operand and then adding. Therefore, when we subtract operands of the same sign, we end up adding operands of different signs. From the prior paragraph, we know that overflow cannot occur in this case either.

Adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully expressed. The lack of a 33rd bit means that when overflow occurs, the sign bit is set with the value of the result instead of the proper sign of the result. Since we need just one extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two positive numbers gives a negative sum, or vice versa. This means a carry out occurred into the sign bit.

Overflow occurs in subtraction when we subtract a negative number from a positive number and get a negative result, or when we subtract a positive number from a negative number and get a positive result. This means a borrow occurred from the sign bit.

Two kinds of Overflow Conditions
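The overflow rules for addition and subtraction described above can be checked with a short Python sketch. Operands are passed as 32-bit words (bit patterns), and the sign-bit tests are written with XOR; the function names are assumptions made for this illustration.

```python
# Sketch: detecting signed overflow in 32-bit addition and subtraction.
# Operands and results are 32-bit words; function names are illustrative.

MASK = 0xFFFFFFFF
SIGN = 0x80000000

def to_signed(word):
    """Interpret a 32-bit word as a two's complement number."""
    return word - (1 << 32) if word & SIGN else word

def add_with_overflow(a, b):
    result = (a + b) & MASK
    # Overflow: both operands have the same sign and the result's sign differs.
    overflow = ((a ^ result) & (b ^ result) & SIGN) != 0
    return result, overflow

def sub_with_overflow(a, b):
    result = (a - b) & MASK
    # Overflow: operands have different signs and the result's sign differs from a's.
    overflow = ((a ^ b) & (a ^ result) & SIGN) != 0
    return result, overflow

print(add_with_overflow(0x7FFFFFFF, 1))          # largest positive + 1
print(add_with_overflow(-10 & MASK, 4))          # different signs: never overflows
print(sub_with_overflow(0x7FFFFFFF, -1 & MASK))  # positive minus negative
```

In both functions only the sign bit is inspected, which matches the observation that a single extra bit is all that is missing when overflow occurs.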

MIPS detects overflow with an exception, also called an interrupt on many computers. An exception or interrupt is essentially an unscheduled procedure call. The address of the instruction that overflowed is saved in a register, and the computer jumps to a predefined address to invoke the appropriate routine for that exception. The interrupted address is saved so that in some situations the program can continue after corrective code is executed.



MIPS includes a register called the exception program counter (EPC) to contain the address of the instruction that caused the exception. The instruction move from system control (mfc0) is used to copy EPC into a general-purpose register so that MIPS software has the option of returning to the offending instruction via a jump register instruction.

MULTIPLICATION

To remind ourselves of the steps of multiplication and the names of the operands, let us first review the multiplication of decimal numbers in longhand. For reasons that will become clear shortly, we limit this decimal example to using only the digits 0 and 1. Multiplying 1000ten by 1001ten:

Multiplicand         1000ten
Multiplier         x 1001ten
                  -------
                     1000
                    0000
                   0000
                  1000
                  -------
Product           1001000ten

The first operand is called the multiplicand and the second the multiplier. The final result is called the product. The algorithm learned in school takes the digits of the multiplier one at a time from right to left, multiplies the multiplicand by that single digit, and shifts the intermediate product one digit to the left of the earlier intermediate products.

The number of digits in the product is considerably larger than the number in either the multiplicand or the multiplier. If the sign bits are ignored, the product of an n-bit multiplicand and an m-bit multiplier is n + m bits long; that is, n + m bits are required to represent all possible products. Multiply must therefore cope with overflow, because we frequently want a 32-bit product as the result of multiplying two 32-bit numbers.

In the above example, where the digits are restricted to 0 and 1, each step of the multiplication is simple:

1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place if the multiplier digit is a 1, or
2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.

SEQUENTIAL VERSION OF THE MULTIPLICATION ALGORITHM AND HARDWARE



This design mimics the algorithm that we learned in school. The following figure shows the hardware.

We have drawn the hardware so that data flows from top to bottom to resemble more closely the paper-and-pencil method. Let’s assume that the multiplier is in the 32-bit Multiplier register and that the 64-bit Product register is initialized to 0. From the paper-and-pencil example above, it’s clear that we will need to move the multiplicand left one digit each step, as it may be added to the intermediate products. Over 32 steps, a 32-bit multiplicand would move 32 bits to the left. Hence, we need a 64-bit Multiplicand register, initialized with the 32-bit multiplicand in the right half and zero in the left half. This register is then shifted left 1 bit each step to align the multiplicand with the sum being accumulated in the 64-bit Product register.

THREE BASIC STEPS NEEDED FOR EACH BIT



The least significant bit of the multiplier (Multiplier0) determines whether the multiplicand is added to the Product register. The left shift in step 2 has the effect of moving the intermediate operands to the left, just as when multiplying with paper and pencil. The shift right in step 3 gives us the next bit of the multiplier to examine in the following iteration. These three steps are repeated 32 times to obtain the product. If each step took a clock cycle, this algorithm would require almost 100 clock cycles to multiply two 32-bit numbers.
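The three steps can be sketched in Python as a loop over the multiplier bits. This is a software model of the first (unrefined) algorithm for unsigned operands, not a description of any particular hardware; the `bits` parameter is an assumption added so that small word sizes can be tried.

```python
# Sketch: the first sequential multiplication algorithm (unsigned operands).
# The bits parameter is an illustrative addition for trying small word sizes.

def multiply(multiplicand, multiplier, bits=32):
    wide_mask = (1 << (2 * bits)) - 1        # Product/Multiplicand are 2*bits wide
    multiplicand &= (1 << bits) - 1          # starts in the right half of its register
    multiplier &= (1 << bits) - 1
    product = 0                              # Product register initialized to 0
    for _ in range(bits):
        if multiplier & 1:                   # step 1: test Multiplier0
            product = (product + multiplicand) & wide_mask  # step 1a: add
        multiplicand = (multiplicand << 1) & wide_mask      # step 2: shift left
        multiplier >>= 1                     # step 3: shift right
    return product

print(multiply(2, 3, bits=4))
print(hex(multiply(0xFFFFFFFF, 0xFFFFFFFF)))
```

Each loop iteration corresponds to the three steps for one multiplier bit, so a 32-bit multiply runs the loop 32 times.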



This algorithm and hardware are easily refined to take 1 clock cycle per step. The speed-up comes from performing the operations in parallel: the multiplier and multiplicand are shifted while the multiplicand is added to the product if the multiplier bit is a 1. The hardware just has to ensure that it tests the right bit of the multiplier and gets the preshifted version of the multiplicand. The hardware is usually further optimized to halve the width of the adder and registers by noticing where there are unused portions of registers and adders.

Replacing arithmetic by shifts can also occur when multiplying by constants. Some compilers replace multiplies by short constants with a series of shifts and adds. Because one bit to the left represents a number twice as large in base 2, shifting the bits left has the same effect as multiplying by a power of 2. Almost every compiler will perform the strength reduction optimization of substituting a left shift for a multiply by a power of 2.
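A small Python sketch illustrates replacing a multiply by a constant with shifts and adds. The constant 10 and the function names are arbitrary choices for illustration; real compilers apply this kind of rewrite internally rather than in source code.

```python
# Sketch: multiplying by constants with shifts and adds (strength reduction).
# The constants and function names are arbitrary illustrative choices.

def times_8(x):
    return x << 3                  # 8 = 2**3, so one left shift suffices

def times_10(x):
    return (x << 3) + (x << 1)     # 10 = 8 + 2

def multiply_by_constant(x, constant):
    """One shift-and-add per 1 bit in a non-negative constant."""
    result, shift = 0, 0
    while constant:
        if constant & 1:
            result += x << shift
        constant >>= 1
        shift += 1
    return result

print(times_8(5), times_10(5), multiply_by_constant(5, 10))
```

The fewer 1 bits the constant has, the fewer additions are needed, which is why the technique pays off mainly for short constants.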

EXAMPLE FOR A MULTIPLY ALGORITHM

Using 4-bit numbers to save space, multiply 2ten × 3ten, or 0010two × 0011two.

The value of each register for each of the steps is given in the above figure. The final value is 0000 0110two or 6ten.

SIGNED MULTIPLICATION

In signed multiplication, the simplest approach is to convert the multiplier and multiplicand to positive numbers and remember the original signs. The algorithm should then be run for 31 iterations, leaving the signs out of the calculation, and the product negated if the original signs disagree. Alternatively, the algorithm can operate on signed numbers directly, in which case the shifting steps need to extend the sign of the product. When the algorithm completes, the lower word holds the 32-bit product.

FASTER MULTIPLICATION

Faster multiplications are possible by essentially providing one 32-bit adder for each bit of the multiplier: one input is the multiplicand ANDed with a multiplier bit, and the other is the output of a prior adder. Connect the outputs of adders on the right to the inputs of adders on the left, making a stack of adders 32 high.



The above figure shows an alternative way to organize 32 additions in a parallel tree. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit add times. Multiply can go even faster than five add times because of the use of carry save adders. It is easy to pipeline such a design to be able to support many multiplies simultaneously.

MULTIPLY IN MIPS

MIPS provides a separate pair of 32-bit registers to contain the 64-bit product, called Hi and Lo. To produce a properly signed or unsigned product, MIPS has two instructions: multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit product, the programmer uses move from lo (mflo). The MIPS assembler generates a pseudoinstruction for multiply that specifies three general-purpose registers, generating mflo and mfhi instructions to place the product into registers.

DIVISION

The reciprocal operation of multiply is divide, an operation that is even less frequent and even more quirky. It even offers the opportunity to perform a mathematically invalid operation: dividing by 0. Let’s start with an example of long division using decimal numbers to recall the names of the operands and the grammar school division algorithm. For reasons similar to those in the previous section, we limit the decimal digits to just 0 or 1.



EXAMPLE

The example is dividing 1,001,010ten by 1000ten:

Divide’s two operands, called the dividend and divisor, and the result, called the quotient, are accompanied by a second result, called the remainder. Here is another way to express the relationship between the components:

Dividend = Quotient × Divisor + Remainder

where the remainder is smaller than the divisor. Infrequently, programs use the divide instruction just to get the remainder, ignoring the quotient. The basic grammar school division algorithm tries to see how big a number can be subtracted, creating a digit of the quotient on each attempt. Our carefully selected decimal example uses only the numbers 0 and 1, so it’s easy to figure out how many times the divisor goes into the portion of the dividend: it’s either 0 times or 1 time. Binary numbers contain only 0 or 1, so binary division is restricted to these two choices, thereby simplifying binary division. Let’s assume that both the dividend and the divisor are positive and hence the quotient and the remainder are nonnegative. The division operands and both results are 32-bit values, and we will ignore the sign for now.



A DIVISION ALGORITHM AND HARDWARE

The above figure shows hardware to mimic our grammar school algorithm. We start with the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move the divisor to the right one digit, so we start with the divisor placed in the left half of the 64-bit Divisor register and shift it right 1 bit each step to align it with the dividend. The Remainder register is initialized with the dividend.



THREE STEPS OF THE FIRST DIVISION ALGORITHM.

The above figure shows three steps of the first division algorithm. Unlike a human, the computer isn’t smart enough to know in advance whether the divisor is smaller than the dividend. It must first subtract the divisor in step 1; remember that this is how we performed the comparison in the set on less than instruction. If the result is positive, the divisor was smaller than or equal to the dividend, so we generate a 1 in the quotient (step 2a). If the result is negative, the next step is to restore the original value by adding the divisor back to the remainder and generate a 0 in the quotient (step 2b). The divisor is shifted right and then we iterate again. The remainder and quotient will be found in their namesake registers after the iterations are complete.

EXAMPLE

A DIVIDE ALGORITHM

Using a 4-bit version of the algorithm to save pages, let’s try dividing 7ten by 2ten, or 0000 0111two by 0010two.

The above figure shows the value of each register for each of the steps, with the quotient being 3ten and the remainder 1ten. Notice that the test in step 2 of whether the remainder is positive or negative simply tests whether the sign bit of the Remainder register is a 0 or 1. The surprising requirement of this algorithm is that it takes n + 1 steps to get the proper quotient and remainder. This algorithm and hardware can be refined to be faster and cheaper. The speedup comes from shifting the operands and the quotient simultaneously with the subtraction. This refinement halves the width of the adder and registers by noticing where there are unused portions of registers and adders.
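The first division algorithm (before the refinement mentioned above) can be sketched in Python for unsigned operands. The `bits` parameter is an assumption added so that the 4-bit example can be reproduced, and the test for a negative remainder uses Python's signed integers rather than an explicit sign bit.

```python
# Sketch: the first (restoring) division algorithm for unsigned operands.
# The bits parameter is an illustrative addition; Python's signed integers
# replace the explicit sign-bit test on the Remainder register.

def divide(dividend, divisor, bits=32):
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = dividend              # Remainder register starts with the dividend
    divisor_reg = divisor << bits     # divisor placed in the left half
    quotient = 0
    for _ in range(bits + 1):         # n + 1 steps
        remainder -= divisor_reg      # step 1: subtract the divisor
        if remainder >= 0:
            quotient = (quotient << 1) | 1   # step 2a: shift in a 1
        else:
            remainder += divisor_reg         # step 2b: restore the old value
            quotient = quotient << 1         #          and shift in a 0
        divisor_reg >>= 1             # step 3: shift the divisor right
    return quotient & ((1 << bits) - 1), remainder

print(divide(7, 2, bits=4))
print(divide(1001010, 1000))
```

Running the 4-bit case takes five iterations; the first three restore the remainder, and the last two produce the quotient bits.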



The following figure shows the revised hardware.

SIGNED DIVISION

The one complication of signed division is that we must also set the sign of the remainder. Remember that the following equation must always hold:

Dividend = Quotient × Divisor + Remainder

To understand how to set the sign of the remainder, let’s look at the example of dividing all the combinations of ±7ten by ±2ten. The first case is easy:

+7 ÷ +2: Quotient = +3, Remainder = +1

Checking the results: 7 = 3 × 2 + (+1) = 6 + 1

If we change the sign of the dividend, the quotient must change as well:

–7 ÷ +2: Quotient = –3

Rewriting our basic formula to calculate the remainder:

Remainder = (Dividend – Quotient × Divisor) = –7 – (–3 × +2) = –7 – (–6) = –1

So,

–7 ÷ +2: Quotient = –3, Remainder = –1

Checking the results again: –7 = –3 × 2 + (–1) = –6 – 1

The reason the answer isn’t a quotient of –4 and a remainder of +1, which would also fit this formula, is that the absolute value of the quotient would then change depending on the sign of the dividend and the divisor! Clearly, if –(x ÷ y) ≠ (–x) ÷ y, programming would be an even greater challenge. This anomalous behavior is avoided by following the rule that the dividend and remainder must have the same signs, no matter what the signs of the divisor and quotient. We calculate the other combinations by following the same rule:

+7 ÷ –2: Quotient = –3, Remainder = +1
–7 ÷ –2: Quotient = +3, Remainder = –1

Thus the correctly signed division algorithm negates the quotient if the signs of the operands are opposite and makes the sign of the nonzero remainder match the dividend.

FASTER DIVISION

We used many adders to speed up multiply, but we cannot do the same trick for divide. The reason is that we need to know the sign of the difference before we can perform the next step of the algorithm, whereas with multiply we could calculate the 32 partial products immediately. There are techniques to produce more than one bit of the quotient per step. The SRT division technique tries to guess several quotient bits per step, using a table lookup based on the upper bits of the dividend and remainder. It relies on subsequent steps to correct wrong guesses. A typical value today is 4 bits per step. The key is guessing the value to subtract. With binary division, there is only a single choice. These algorithms use 6 bits from the remainder and 4 bits from the divisor to index a table that determines the guess for each step. The accuracy of this fast method depends on having proper values in the lookup table.

SUBWORD PARALLELISM

A subword is a lower-precision unit of data contained within a word. In subword parallelism, multiple subwords are packed into a word and the whole word is then processed at once. With the appropriate subword boundaries, this technique results in parallel processing of the subwords. Since the same instruction is applied to all subwords within the word, this is a form of SIMD (Single Instruction, Multiple Data) processing.

It is possible to apply subword parallelism to noncontiguous subwords of different sizes within a word. In practice, however, implementation is simplest if the subwords are the same size and contiguous within a word. The data-parallel programs that benefit from subword parallelism tend to process data that are all of the same size. For example, if the word size is 64 bits, the subword sizes can be 8, 16, and 32 bits. An instruction then operates on eight 8-bit subwords, four 16-bit subwords, two 32-bit subwords, or one 64-bit subword in parallel.

Subword parallelism is an efficient and flexible solution for media processing because the algorithms exhibit a great deal of data parallelism on lower-precision data. It is also useful for computations unrelated to multimedia that exhibit data parallelism on lower-precision data. Graphics and audio applications can take advantage of performing simultaneous operations on short vectors.

Example: a 128-bit adder can perform
• Sixteen 8-bit adds
• Eight 16-bit adds
• Four 32-bit adds

Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)
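To make the idea concrete, here is a minimal software sketch of a partitioned add on eight 8-bit subwords packed into one 64-bit word. It only imitates what a partitioned hardware adder does by blocking carries at subword boundaries; the helper names (`pack8`, `unpack8`, `packed_add8`) are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8080808080808080   # top bit of every 8-bit subword
LOW_BITS = 0x7F7F7F7F7F7F7F7F    # low seven bits of every 8-bit subword


def pack8(values):
    """Pack eight 8-bit values into one 64-bit word (value 0 in the low byte)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack8(word):
    """Split a 64-bit word back into its eight 8-bit subwords."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add8(a, b):
    """Add eight 8-bit subwords at once; each subword wraps around modulo 256."""
    low = (a & LOW_BITS) + (b & LOW_BITS)          # carries stay inside each subword
    return (low ^ ((a ^ b) & HIGH_BITS)) & MASK64  # add the top bits without a carry-out


a = pack8([1, 2, 3, 4, 5, 6, 7, 255])
b = pack8([10, 20, 30, 40, 50, 60, 70, 1])
print(unpack8(packed_add8(a, b)))
```

A single call to `packed_add8` stands in for one SIMD instruction: the same add is applied to all eight subwords, and the last subword (255 + 1) wraps to 0 without disturbing its neighbour.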



UNIT 3

PROCESSOR AND CONTROL UNIT

3.1 Basic MIPS implementation



3.2 BUILDING A DATAPATH



3.6 HANDLING DATA HAZARDS & CONTROL HAZARDS

3.6.1 Datapath to Resolve Hazards via Forwarding



UNIT 4

PARALLELISM



UNIT 5

MEMORY AND I/O SYSTEMS

From the CPU's perspective, an I/O device appears as a set of special-purpose registers, of three general types:

• Status registers provide status information to the CPU about the I/O device. These registers are often read-only, i.e. the CPU can only read their bits, and cannot change them.

• Configuration/control registers are used by the CPU to configure and control the device. Bits in these configuration registers may be write-only, so the CPU can alter them, but not read them back. Most bits in control registers can be both read and written.

• Data registers are used to read data from or send data to the I/O device.

In some instances, a given register may fit more than one of the above categories, e.g. some bits are used for configuration while other bits in the same register provide status information.

The logic circuit that contains these registers is called the device controller, and the software that communicates with the controller is called a device driver.

                    +-------------------+           +-----------+
                    | Device controller |           |           |
+-------+           |                   |<--------->|  Device   |
|       |---------->| Control register  |           |           |
|  CPU  |<----------| Status register   |           |           |
|       |<--------->| Data register     |           |           |
+-------+           |                   |           |           |
                    +-------------------+           +-----------+

Simple devices such as keyboards and mice may be represented by only a few registers, while more complex ones such as disk drives and graphics adapters may have dozens.

Each of the I/O registers, like memory, must have an address so that the CPU can read or write specific registers.

Some CPUs have a separate address space for I/O devices. This requires separate instructions to perform I/O operations.

Other architectures, like the MIPS, use memory-mapped I/O. When using memory-mapped I/O, the same address space is shared by memory and I/O devices. Some addresses represent memory cells, while others represent registers in I/O devices. No separate I/O instructions are needed in a CPU that uses memory-mapped I/O. Instead, we can perform I/O operations using any instruction that can reference memory.

                  +---------------+
                  | +-----------+ |
                  | |    ROM    | |
                  | +-----------+ |
+-------+ address | |           | |
|       |-------->| |    RAM    | |
|  CPU  |         | |           | |
|       |<------->| +-----------+ |
+-------+  data   | |           | |
                  | |    I/O    | |
                  | +-----------+ |
                  +---------------+

On the MIPS, we would access ROM, RAM, and I/O devices using load and store instructions. Which type of device we access depends only on the address used!

    lw  $t0, 0x00000004   # Read ROM
    sw  $t0, 0x00000004   # Write ROM (bus error!)
    lbu $t0, 0x0000ffc1   # Read RAM
    sb  $t0, 0x0000ffc1   # Write RAM
    lbu $t0, 0xffff0000   # Read an I/O device
    sb  $t0, 0xffff0004   # Write to an I/O device

The 32-bit MIPS architecture has a 32-bit address, and hence an address space of 4 gigabytes. Addresses 0x00000000 through 0xfffeffff are used for memory, and addresses 0xffff0000 through 0xffffffff (the last 64 kilobytes) are reserved for I/O device registers. This is a very small fraction of the total address space, and yet far more space than is needed for I/O devices on any one computer.

Each register within an I/O controller must be assigned a unique address within the address space. This address may be fixed for certain devices, and auto-assigned for others. (PC plug-and-play devices have auto-assigned I/O addresses, which are determined during boot-up.)
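The address split described above can be modelled in a few lines. This is a minimal sketch, assuming only the two ranges given in the text; the class `ToyBus` and its methods are hypothetical and exist only to show that one load/store interface reaches both memory and device registers.

```python
IO_BASE = 0xFFFF0000       # first I/O address in the layout described above
ADDRESS_SPACE = 1 << 32    # 32-bit addresses


def region(addr):
    """Classify an address as ordinary memory or an I/O device register."""
    if not 0 <= addr < ADDRESS_SPACE:
        raise ValueError("address does not fit in 32 bits")
    return "io" if addr >= IO_BASE else "memory"


class ToyBus:
    """Hypothetical bus: one load/store interface, two kinds of targets."""

    def __init__(self):
        self.memory = {}        # address -> byte stored in memory
        self.io_registers = {}  # address -> byte held in a device register

    def _target(self, addr):
        return self.io_registers if region(addr) == "io" else self.memory

    def load_byte(self, addr):
        return self._target(addr).get(addr, 0)

    def store_byte(self, addr, value):
        self._target(addr)[addr] = value & 0xFF


io_bytes = ADDRESS_SPACE - IO_BASE
print(region(0x0000FFC1), region(0xFFFF0000), io_bytes, io_bytes / ADDRESS_SPACE)
```

The last line shows how small the I/O window is: 65,536 bytes out of 2^32, or one part in 65,536 of the address space.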




MEMORY HIERARCHY



MEMORY TECHNOLOGIES

Much of the success of computer technology stems from the tremendous progress in storage technology.

Early computers had a few kilobytes of random-access memory. The earliest IBM PCs didn't even have a hard disk.

That changed with the introduction of the IBM PC-XT in 1983, with its 10-megabyte disk. By the year 2010, typical machines had 150,000 times as much disk storage, and the amount of storage was increasing by a factor of 2 every couple of years.

Random-Access Memory

Random-access memory (RAM) comes in two varieties: static and dynamic. Static RAM (SRAM) is faster and significantly more expensive than Dynamic RAM (DRAM). SRAM is used for cache memories, both on and off the CPU chip. DRAM is used for the main memory plus the frame buffer of a graphics system. Typically, a desktop system will have no more than a few megabytes of SRAM, but hundreds or thousands of megabytes of DRAM.

Static RAM

SRAM stores each bit in a bistable memory cell. Each cell is implemented with a six-transistor circuit. This circuit has the property that it can stay indefinitely in either of two different voltage configurations, or states. Any other state will be unstable: starting from there, the circuit will quickly move toward one of the stable states.

Dynamic RAM

DRAM stores each bit as charge on a capacitor. This capacitor is very small, typically around 30 femtofarads, that is, 30 × 10^−15 farads. Recall, however, that a farad is a very large unit of measure. DRAM storage can be made very dense: each cell consists of a capacitor and a single access transistor. Unlike SRAM, however, a DRAM memory cell is very sensitive to any disturbance. When the capacitor voltage is disturbed, it will never recover. Exposure to light rays will cause the capacitor voltages to change. In fact, the sensors in digital cameras and camcorders are essentially arrays of DRAM cells.

Conventional DRAMs

The cells (bits) in a DRAM chip are partitioned into d supercells, each consisting of w DRAM cells. A d × w DRAM stores a total of dw bits of information. The supercells are organized as a rectangular array with r rows and c columns, where rc = d. Each supercell has an address of the form (i, j), where i denotes the row, and j denotes the column.

For example, Figure 6.3 shows the organization of a 16 × 8 DRAM chip with d = 16 supercells, w = 8 bits per supercell, r = 4 rows, and c = 4 columns. The shaded box denotes the supercell at address (2, 1).

Information flows in and out of the chip via external connectors called pins. Each pin carries a 1-bit signal.

The figure shows two sets of pins: eight data pins that can transfer 1 byte in or out of the chip, and two addr pins that carry two-bit row and column supercell addresses. Other pins that carry control information are not shown.

Fig: Conventional DRAM



One reason circuit designers organize DRAMs as two-dimensional arrays instead of linear arrays is to reduce the number of address pins on the chip. For example, if our example 128-bit DRAM were organized as a linear array of 16 supercells with addresses 0 to 15, then the chip would need four address pins instead of two. The disadvantage of the two-dimensional array organization is that addresses must be sent in two distinct steps, which increases the access time.
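The pin-count argument above is simple arithmetic, sketched below for the 16 × 8 example. The function names are illustrative, and the row-major numbering used to turn a linear supercell index into an (i, j) address is an assumption, since the notes do not specify a numbering.

```python
def address_pins_linear(d):
    """Pins needed to name one of d supercells in a single step."""
    return (d - 1).bit_length()


def address_pins_2d(rows, cols):
    """Pins needed when row and column addresses share the same pins in two steps."""
    return max((rows - 1).bit_length(), (cols - 1).bit_length())


def supercell_address(index, cols):
    """Map a linear index to (row, column), assuming row-major numbering."""
    return divmod(index, cols)


d, w, r, c = 16, 8, 4, 4
print(d * w)                      # total bits stored by the chip
print(address_pins_linear(d))     # pins for a linear array
print(address_pins_2d(r, c))      # pins for the r x c array
print(supercell_address(9, c))    # the supercell at address (2, 1)
```

The saving grows with chip size: doubling the number of address bits in the linear case adds only half as many pins in the two-dimensional case, at the cost of the two-step access noted above.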

Disk Storage

Disks are workhorse storage devices that hold enormous amounts of data, on the order of hundreds to thousands of gigabytes, as opposed to the hundreds or thousands of megabytes in a RAM-based memory. However, it takes on the order of milliseconds to read information from a disk, a hundred thousand times longer than from DRAM and a million times longer than from SRAM.



CACHE BASICS – MEASURING AND IMPROVING CACHE PERFORMANCE

Two techniques improve cache performance. The first focuses on reducing the miss rate by reducing the probability that two different memory blocks will contend for the same cache location. The second technique reduces the miss penalty by adding an additional level to the hierarchy. This technique, called multilevel caching, first appeared in high-end computers selling for more than $100,000 in 1990; since then it has become common on desktop computers selling for less than $500!

CPU time can be divided into the clock cycles that the CPU spends executing the program and the clock cycles that the CPU spends waiting for the memory system. Normally, we assume that the costs of cache accesses that are hits are part of the normal CPU execution cycles. Thus,

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

The memory-stall clock cycles come primarily from cache misses, and we make that assumption here. We also restrict the discussion to a simplified model of the memory system. In real processors, the stalls generated by reads and writes can be quite complex, and accurate performance prediction usually requires very detailed simulations of the processor and memory system.

Reads:

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

Writes are more complicated. For a write-through scheme, we have two sources of stalls: write misses, which usually require that we fetch the block before continuing the write, and write buffer stalls, which occur when the write buffer is full when a write occurs.

Calculating Cache Performance:

Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is 4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed. Assume the frequency of all loads and stores is 36%.
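The problem above can be worked per instruction, so the instruction count cancels out. The sketch below assumes every instruction makes one instruction-cache access and that 36% of instructions also make one data-cache access; the function and parameter names are illustrative.

```python
def cpi_with_stalls(base_cpi, inst_miss_rate, data_miss_rate,
                    mem_ops_per_inst, miss_penalty):
    """CPI including memory-stall cycles, computed per instruction."""
    inst_stalls = inst_miss_rate * miss_penalty                     # every instruction is fetched
    data_stalls = mem_ops_per_inst * data_miss_rate * miss_penalty  # only loads and stores
    return base_cpi + inst_stalls + data_stalls


base_cpi = 2
real_cpi = cpi_with_stalls(base_cpi, 0.02, 0.04, 0.36, 100)
speedup = real_cpi / base_cpi     # same clock and instruction count, so CPI ratio = time ratio
print(round(real_cpi, 2), round(speedup, 2))
```

Instruction misses cost 0.02 × 100 = 2.00 cycles per instruction and data misses cost 0.36 × 0.04 × 100 = 1.44, so the CPI with stalls is 2 + 3.44 = 5.44 and the processor with the perfect cache is 5.44 / 2 = 2.72 times faster.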

Reducing Cache Misses by More Flexible Placement of Blocks

So far, when we place a block in the cache, we have used a simple placement scheme: A block can go in exactly one place in the cache. As mentioned earlier, it is called direct mapped because there is a direct mapping from any block address in memory to a single location in the upper level of the hierarchy. However, there is actually a whole range of schemes for placing blocks. Direct mapped, where a block can be placed in exactly one location, is at one extreme. At the other extreme is a scheme where a block can be placed in any location in the cache. Such a scheme is called fully associative, because a block in memory may be associated with any entry in the cache. To find a given block in a fully associative cache, all the entries in the cache must be searched because a block can be placed in any one. To make the search practical, it is done in parallel with a comparator associated with each cache entry. These comparators significantly increase the hardware cost, effectively making fully associative placement practical only for caches with small numbers of blocks.
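The range of placement schemes can be expressed with one parameter, the number of entries a block may choose from. The sketch below assumes the usual modulo mapping from block address to set; the function name `candidate_slots` and the 8-block cache are illustrative choices.

```python
def candidate_slots(block_address, num_blocks, ways):
    """Cache entries where a memory block may be placed.

    ways == 1 gives direct mapped; ways == num_blocks gives fully associative.
    """
    if num_blocks % ways != 0:
        raise ValueError("ways must divide the number of blocks")
    num_sets = num_blocks // ways
    set_index = block_address % num_sets          # assumed modulo mapping
    return list(range(set_index * ways, (set_index + 1) * ways))


print(candidate_slots(12, 8, 1))   # direct mapped: exactly one entry
print(candidate_slots(12, 8, 2))   # in between: a set of two entries
print(candidate_slots(12, 8, 8))   # fully associative: any entry
```

The length of the returned list is also the number of comparators that must work in parallel on a lookup, which is why the fully associative extreme is costly for large caches.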



Choosing Which Block to Replace :

When a miss occurs in a direct-mapped cache, the requested block can go in exactly one position, and the block occupying that position must be replaced. In an associative cache, we have a choice of where to place the requested block, and hence a choice of which block to replace. In a fully associative cache, all blocks are candidates for replacement. In a set-associative cache, we must choose among the blocks in the selected set.

The most commonly used scheme is least recently used (LRU), which we used in the previous example. In an LRU scheme, the block replaced is the one that has been unused for the longest time. The set-associative example uses LRU, which is why we replaced Memory(0) instead of Memory(6). LRU replacement is implemented by keeping track of when each element in a set was used relative to the other elements in the set. For a two-way set-associative cache, tracking when the two elements were used can be implemented by keeping a single bit in each set and setting the bit to indicate an element whenever that element is referenced. As associativity increases, implementing LRU gets harder; in Section 5.5, we will see an alternative scheme for replacement.
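LRU replacement can be simulated with an ordered container per set. The following is a minimal sketch; the block-address trace 0, 8, 0, 6, 8 and the four-block cache are assumptions chosen for illustration, and `simulate_lru` is a hypothetical helper, not a description of any hardware.

```python
from collections import OrderedDict


def simulate_lru(trace, num_sets, ways):
    """Count hits and misses for a trace of block addresses under LRU."""
    sets = [OrderedDict() for _ in range(num_sets)]   # oldest entry comes first
    hits = misses = 0
    for block in trace:
        entries = sets[block % num_sets]
        if block in entries:
            hits += 1
            entries.move_to_end(block)        # now the most recently used
        else:
            misses += 1
            if len(entries) == ways:
                entries.popitem(last=False)   # evict the least recently used
            entries[block] = True
    return hits, misses, [list(s) for s in sets]


trace = [0, 8, 0, 6, 8]
print(simulate_lru(trace, num_sets=4, ways=1))   # direct mapped
print(simulate_lru(trace, num_sets=2, ways=2))   # two-way set associative
print(simulate_lru(trace, num_sets=1, ways=4))   # fully associative
```

On this trace the miss count falls from 5 to 4 to 3 as associativity rises. In the two-way case the final access to block 8 evicts block 0 rather than block 6, because block 6 was referenced more recently.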



VIRTUAL MEMORY

Similarly, the main memory can act as a "cache" for the secondary storage, usually implemented with magnetic disks. This technique is called virtual memory. Historically, there were two major motivations for virtual memory: to allow efficient and safe sharing of memory among multiple programs, and to remove the programming burdens of a small, limited amount of main memory. Four decades after its invention, it's the former reason that reigns today.

Consider a collection of programs running all at once on a computer. Of course, to allow multiple programs to share the same memory, we must be able to protect the programs from each other, ensuring that a program can only read and write the portions of main memory that have been assigned to it. Main memory need contain only the active portions of the many programs, just as a cache contains only the active portion of one program. Thus, the principle of locality enables virtual memory as well as caches, and virtual memory allows us to efficiently share the processor as well as the main memory.

The second motivation for virtual memory is to allow a single user program to exceed the size of primary memory. Formerly, if a program became too large for memory, it was up to the programmer to make it fit. Programmers divided programs into pieces and then identified the pieces that were mutually exclusive. These overlays were loaded or unloaded under user program control during execution, with the programmer ensuring that the program never tried to access an overlay that was not loaded and that the overlays loaded never exceeded the total size of the memory. Overlays were traditionally organized as modules, each containing both code and data.

In virtual memory, the address is broken into a virtual page number and a page offset. Figure 5.20 shows the translation of the virtual page number to a physical page number. The physical page number constitutes the upper portion of the physical address, while the page offset, which is not changed, constitutes the lower portion. The number of bits in the page offset field determines the page size. The number of pages addressable with the virtual address need not match the number of pages addressable with the physical address. Having a larger number of virtual pages than physical pages is the basis for the illusion of an essentially unbounded amount of virtual memory.
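The translation just described amounts to splitting the address, looking up the page number, and re-joining. The sketch below assumes a 12-bit page offset (4 KiB pages) and represents the page table as a plain dictionary; both choices, and the name `translate`, are illustrative assumptions.

```python
PAGE_OFFSET_BITS = 12                      # assumed: 2**12 = 4096-byte pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS


def translate(virtual_address, page_table):
    """Map a virtual address to a physical address through a page table."""
    vpn = virtual_address >> PAGE_OFFSET_BITS      # upper bits: virtual page number
    offset = virtual_address & (PAGE_SIZE - 1)     # lower bits: unchanged page offset
    if vpn not in page_table:
        raise LookupError("page is not resident in main memory")
    ppn = page_table[vpn]                          # physical page number
    return (ppn << PAGE_OFFSET_BITS) | offset


page_table = {0x12345: 0x00042}            # one hypothetical mapping
print(hex(translate(0x12345ABC, page_table)))
```

Only the page number changes: the offset 0xABC appears unmodified in the low 12 bits of the result, and the table may map far more virtual pages than there are physical pages.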

TLBS - INPUT/OUTPUT SYSTEM



Input/Output

The computer system's I/O architecture is its interface to the outside world. This architecture is designed to provide programmed I/O, in which I/O occurs under the direct and continuous control of the program requesting the I/O operation; interrupt-driven I/O, in which a program issues an I/O command and then continues to execute, until it is interrupted by the I/O hardware to signal the end of the I/O operation; and direct memory access (DMA), in which a specialized I/O processor takes over control of an I/O operation to move a large block of data.

Two important examples of external I/O interfaces are FireWire and Infiniband.

Peripherals and the System Bus

• There is a wide variety of peripherals, each with varying methods of operation
   o Impractical for the processor to accommodate all of them
• Data transfer rates are often slower than those of the processor and/or memory
   o Impractical to use the high-speed system bus to communicate directly
• Data transfer rates may be faster than those of the processor and/or memory
   o This mismatch may lead to inefficiencies if improperly managed
• Peripherals often use different data formats and word lengths

Purpose of I/O Modules

• Interface to the processor and memory via the system bus or control switch
• Interface to one or more peripheral devices



External Devices:

External device categories
• Human readable: communicate with the computer user – CRT
• Machine readable: communicate with equipment – disk drive or tape drive
• Communication: communicate with remote devices – may be human readable or machine readable

The External Device – I/O Module

• Control signals: determine the function that will be performed
• Data: set of bits to be sent or received
• Status signals: indicate the state of the device
• Control logic: controls the device's operations
• Transducer: converts data from electrical to other forms of energy
• Buffer: temporarily holds data being transferred

Keyboard/Monitor

• Most common means of computer/user interaction
• Keyboard provides input that is transmitted to the computer
• Monitor displays data provided by the computer
• The character is the basic unit of exchange
• Each character is associated with a 7- or 8-bit code

Disk Drive

• Contains electronics for exchanging data, control, and status signals with an I/O module
• Contains electronics for controlling the disk read/write mechanism
• Fixed-head disk – transducer converts between magnetic patterns on the disk surface and bits in the buffer
• Moving-head disk – must move the disk arm rapidly across the surface



I/O Modules

Module Function
• Control and timing
• Processor communication
• Device communication
• Data buffering
• Error detection

I/O control steps

• Processor checks I/O module for external device status
• I/O module returns status
• If device ready, processor gives I/O module command to request data transfer
• I/O module gets a unit of data from device
• Data transferred from the I/O module to the processor

Processor communication

• Command decoding: I/O module accepts commands from the processor, sent as signals on the control bus
• Data: data exchanged between the processor and I/O module over the data bus
• Status reporting: common status signals BUSY and READY are used because peripherals are slow
• Address recognition: I/O module must recognize a unique address for each peripheral that it controls

I/O module communication

• Device communication: commands, status information, and data
• Data buffering: data comes from main memory in rapid bursts and must be buffered by the I/O module and then sent to the device at the device's rate
• Error detection: responsible for reporting errors to the processor



Typical I/O Device Data Rates

I/O Module Structure: Block Diagram of an I/O Module



• Module connects to the computer through a set of signal lines – system bus
• Data transferred to and from the module are buffered with data registers
• Status provided through status registers – may also act as control registers
• Module logic interacts with processor via a set of control signal lines
• Processor uses control signal lines to issue commands to the I/O module
• Module must recognize and generate addresses for devices it controls
• Module contains logic for device interfaces to the devices it controls
• I/O module functions allow the processor to view devices in a simple-minded way
• I/O module may hide device details from the processor so the processor only functions in terms of simple read and write operations – timing, formats, etc.
• I/O module may leave much of the work of controlling a device visible to the processor – rewind a tape, etc.

I/O channel or I/O processor
• I/O module that takes on most of the detailed processing burden
• Used on mainframe computers

I/O controller or device controller
• Primitive I/O module that requires detailed control
• Used on microcomputers

PROGRAMMED I/O

Overview of Programmed I/O

• Processor executes an I/O instruction by issuing a command to the appropriate I/O module
• I/O module performs the requested action and then sets the appropriate bits in the I/O status register – I/O module takes no further action to alert the processor – it does not interrupt the processor
• The processor periodically checks the status of the I/O module until it determines that the operation is complete

I/O Commands

The processor issues an address, specifying I/O module and device, and an I/O command. The commands are:
• Control: activate a peripheral and tell it what to do
• Test: test various status conditions associated with an I/O module and its peripherals
• Read: causes the I/O module to obtain an item of data from the peripheral and place it into an internal register
• Write: causes the I/O module to take a unit of data from the data bus and transmit it to the peripheral
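The polling behaviour described above can be sketched with a toy model. Everything here is hypothetical: `ToyIOModule`, its `READY` status bit, and the fixed `latency` measured in polls are assumptions made only to show where the processor's time goes.

```python
class ToyIOModule:
    """Hypothetical I/O module with a status register and a data register."""

    READY = 0x1

    def __init__(self, data, latency):
        if latency < 1:
            raise ValueError("latency must be at least one tick")
        self._data = list(data)
        self._latency = latency
        self._countdown = 0
        self.status = 0
        self.data_register = None

    def command_read(self):
        """Read command: start fetching one item from the peripheral."""
        self.status = 0
        self._countdown = self._latency

    def tick(self):
        """Advance the device by one time step; set READY when the item arrives."""
        if self._countdown > 0:
            self._countdown -= 1
            if self._countdown == 0:
                self.data_register = self._data.pop(0)
                self.status |= self.READY   # the module only sets a bit, no interrupt


def programmed_read(module, count):
    """Read count items, busy-waiting on the status register for each one."""
    received, polls = [], 0
    for _ in range(count):
        module.command_read()
        while not (module.status & module.READY):   # Test command, repeated
            polls += 1
            module.tick()
        received.append(module.data_register)
    return received, polls


print(programmed_read(ToyIOModule([10, 20, 30], latency=3), 2))
```

The `polls` count is the cost of this technique: every poll is processor time spent checking status instead of doing other work.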



Three Techniques for Input of a Block of Data

I/O Instructions

• Processor views I/O operations in a similar manner as memory operations
• Each device is given a unique identifier or address
• Processor issues commands containing device address – I/O module must check address lines to see if the command is for itself

I/O mapping

• Memory-mapped I/O
   o Single address space for both memory and I/O devices (disadvantage: uses up valuable memory address space)
   o I/O module registers treated as memory addresses
   o Same machine instructions used to access both memory and I/O devices (advantage: allows for more efficient programming)
   o Single read line and single write line needed
   o Commonly used
• Isolated I/O
   o Separate address spaces for memory and I/O devices
   o Separate memory and I/O select lines needed
   o Small number of I/O instructions
   o Commonly used

DMA AND INTERRUPTS

Interrupt-Driven I/O



• Overcomes the processor having to wait long periods of time for I/O modules
• The processor does not have to repeatedly check the I/O module status

I/O module view point

• I/O module receives a READ command from the processor
• I/O module reads data from desired peripheral into data register
• I/O module interrupts the processor
• I/O module waits until data is requested by the processor
• I/O module places data on the data bus when requested

Processor view point

• The processor issues a READ command
• The processor performs some other useful work
• The processor checks for interrupts at the end of the instruction cycle
• The processor saves the current context when interrupted by the I/O module
• The processor reads the data from the I/O module and stores it in memory
• The processor then restores the saved context and resumes execution
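The processor's side of this exchange can be traced with a toy loop. This is a sketch under simplifying assumptions: one unit of work per instruction cycle, a device that needs a fixed number of cycles, and a context that consists only of the program counter. The function name and log strings are illustrative.

```python
def run_with_interrupt(program_length, device_latency, device_value):
    """Trace a processor that issues a READ and keeps working until interrupted."""
    log = ["issue READ"]           # the processor issues the READ, then moves on
    memory = []
    countdown = device_latency     # cycles until the module has the data
    interrupt_pending = False
    pc = 0
    while pc < program_length:
        log.append(f"work {pc}")   # useful work overlaps with the device
        pc += 1
        if countdown > 0:
            countdown -= 1
            if countdown == 0:
                interrupt_pending = True   # the module raises its interrupt
        if interrupt_pending:              # checked at the end of the instruction cycle
            saved_pc = pc                  # save context
            memory.append(device_value)    # read the data and store it in memory
            log.append("service interrupt")
            pc = saved_pc                  # restore context and resume
            interrupt_pending = False
    return log, memory


print(run_with_interrupt(program_length=4, device_latency=2, device_value=0x41))
```

In contrast to polling, no cycles are spent checking status: the work items before the interrupt would have been wasted waiting under programmed I/O.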

Design Issues

• How does the processor determine which device issued the interrupt?
• How are multiple interrupts dealt with?

Device identification

• Multiple interrupt lines – each line may have multiple I/O modules
• Software poll – poll each I/O module
   o Separate command line – TESTI/O
   o Processor reads status register of I/O module
   o Time consuming
• Daisy chain
   o Hardware poll
   o Common interrupt request line
   o Processor sends interrupt acknowledge
   o Requesting I/O module places a word of data on the data lines – a "vector" that uniquely identifies the I/O module – vectored interrupt
• Bus arbitration
   o I/O module first gains control of the bus
   o I/O module sends interrupt request
   o The processor acknowledges the interrupt request
   o I/O module places its vector on the data lines

Multiple interrupts

• The techniques above not only identify the requesting I/O module but provide methods of assigning priorities
• Multiple lines – processor picks line with highest priority
• Software polling – polling order determines priority
• Daisy chain – daisy chain order of the modules determines priority
• Bus arbitration – arbitration scheme determines priority
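The way chain order doubles as priority can be shown in a few lines. This is a minimal sketch: the module names, the vector values, and the function `identify_by_daisy_chain` are all hypothetical.

```python
def identify_by_daisy_chain(chain, requesting, vectors):
    """Return (module, vector) for the first requester along the chain, or None."""
    for module in chain:               # the acknowledge passes from module to module
        if module in requesting:
            return module, vectors[module]   # this module answers with its vector
    return None


chain = ["disk", "network", "keyboard"]                       # position = priority
vectors = {"disk": 0x20, "network": 0x21, "keyboard": 0x22}   # assumed vector values
print(identify_by_daisy_chain(chain, {"keyboard", "network"}, vectors))
```

The same loop describes software polling if `chain` is read as the polling order; the difference is that in a daisy chain the search is done by hardware rather than by the processor.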

Intel 82C59A Interrupt Controller

Intel 80386 provides
• Single Interrupt Request line – INTR
• Single Interrupt Acknowledge line – INTA
• 8 external devices can be connected to the 82C59A – can be cascaded to 64

82C59A operation – only manages interrupts
• Accepts interrupt requests
• Determines interrupt priority
• Signals the processor using INTR
• Processor acknowledges using INTA
• Places vector information on the data bus
• Processor processes the interrupt and communicates directly with the I/O module



82C59A interrupt modes

• Fully nested – priority from 0 (IR0) to 7 (IR7)
• Rotating – several devices have the same priority; the most recently serviced device gets the lowest priority
• Special mask – processor can inhibit interrupts from selected devices



Intel 82C55A Programmable Peripheral Interface

• Single chip, general purpose I/O module
• Designed for use with the Intel 80386
• Can control a variety of simple peripheral devices
• A, B, C function as 8-bit I/O ports (C can be divided into two 4-bit I/O ports)
• The left side of the diagram shows the interface to the 80386 bus



Direct Memory Access

Drawback of Programmed and Interrupt-Driven I/O
• I/O transfer rate limited to the speed at which the processor can test and service devices
• Processor tied up managing I/O transfers

DMA Function

• DMA module on the system bus is used to mimic the processor
• DMA module uses the system bus only when the processor does not need it, or
• DMA module temporarily forces the processor to suspend operation – cycle stealing

DMA Operation

• The processor issues a command to the DMA module, specifying:
   o Read or write
   o I/O device address using data lines
   o Starting memory address using data lines – stored in address register
   o Number of words to be transferred using data lines – stored in data count register
• The processor then continues with other work
• DMA module transfers the entire block of data – one word at a time – directly to or from memory without going through the processor
• DMA module sends an interrupt to the processor when complete
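The sequence above can be sketched as a loop driven by the two registers the processor loads. This is a toy model under stated assumptions: memory is a word-addressed Python list, the device is a list of words, and each transferred word costs one bus cycle; `dma_transfer` is an illustrative name.

```python
def dma_transfer(device_words, memory, start_address, count):
    """Copy count words from a device into memory, as a DMA module would."""
    address_register = start_address    # loaded by the processor's command
    data_count = count                  # number of words still to transfer
    bus_cycles_used = 0
    source = iter(device_words)
    while data_count > 0:
        memory[address_register] = next(source)   # one word, straight to memory
        address_register += 1
        data_count -= 1
        bus_cycles_used += 1                      # one bus cycle per word (assumed)
    return {"interrupt": True, "bus_cycles_used": bus_cycles_used}


memory = [0] * 8
result = dma_transfer([7, 8, 9], memory, start_address=2, count=3)
print(memory, result)
```

The processor appears only twice in this picture: when it supplies `start_address` and `count`, and when it receives the completion interrupt.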


DMA and Interrupt Breakpoints during Instruction Cycle

• The processor is suspended just before it needs to use the bus.
• The DMA module transfers one word and returns control to the processor.
• Since this is not an interrupt, the processor does not have to save context.
• The processor executes more slowly, but this is still far more efficient than either programmed or interrupt-driven I/O.

DMA Configurations

Single bus – detached DMA module

Each transfer uses bus twice – I/O to DMA, DMA to memory

Processor suspended twice.



Single bus – integrated DMA module

Module may support more than one device

Each transfer uses bus once – DMA to memory

Processor suspended once.

Separate I/O bus

Bus supports all DMA enabled devices

Each transfer uses bus once – DMA to memory

Processor suspended once.

INPUT-OUTPUT PROCESSOR (IOP)

Communicate directly with all I/O devices

Fetch and execute its own instruction

IOP instructions are specifically designed to facilitate I/O transfer



Command

• Instructions that are read from memory by an IOP
• Distinguished from instructions that are read by the CPU
• Commands are prepared by experienced programmers and are stored in memory
• Command words = IOP program

I/O Channels and Processors

The Evolution of the I/O Function

1. Processor directly controls the peripheral device.
2. Addition of a controller or I/O module – programmed I/O.
3. Same as 2, with interrupts added.
4. I/O module is given direct access to memory using DMA.
5. I/O module is enhanced to become processor-like – I/O channel.
6. I/O module has local memory of its own and is computer-like – I/O processor.

• More and more of the I/O function is performed without processor involvement.
• The processor is increasingly relieved of I/O-related tasks, which improves performance.



Characteristics of I/O Channels

• Extension of the DMA concept
• Ability to execute I/O instructions – a special-purpose processor on the I/O channel has complete control over I/O operations
• The processor does not execute I/O instructions itself – it initiates an I/O transfer by instructing the I/O channel to execute a program in memory
• The program specifies:
o Device or devices
o Area or areas of memory
o Priority
o Error condition actions

Two types of I/O channels

• Selector channel
o Controls multiple high-speed devices
o At any one time is dedicated to the transfer of data with one of the devices
o Each device is handled by a controller, or I/O module
o The I/O channel controls these I/O controllers
• Multiplexor channel
o Can handle multiple devices at the same time
o Byte multiplexor – used for low-speed devices
o Block multiplexor – interleaves blocks of data from several devices
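The difference between byte and block multiplexing can be sketched as follows. The function names are hypothetical, and strict round-robin order is a simplifying assumption: a real channel would interleave according to which device has data ready.

```python
from itertools import zip_longest


def byte_multiplex(streams):
    """Interleave one byte at a time from each device, in round-robin order."""
    merged = []
    for group in zip_longest(*streams):
        merged.extend(item for item in group if item is not None)
    return merged


def block_multiplex(streams, block_size):
    """Interleave whole blocks of `block_size` items from each device."""
    blocked = [
        [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
        for stream in streams
    ]
    merged = []
    for group in zip_longest(*blocked):
        for block in group:
            if block is not None:
                merged.extend(block)
    return merged


a = ["A1", "A2", "A3", "A4"]
b = ["B1", "B2", "B3", "B4"]

print(byte_multiplex([a, b]))      # ['A1', 'B1', 'A2', 'B2', 'A3', 'B3', 'A4', 'B4']
print(block_multiplex([a, b], 2))  # ['A1', 'A2', 'B1', 'B2', 'A3', 'A4', 'B3', 'B4']
```

A selector channel, by contrast, would send all of one device's data before turning to the next device.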



The External Interface: FireWire and InfiniBand

Types of Interfaces

o Parallel interface – multiple bits transferred simultaneously

o Serial interface – bits transferred one at a time



I/O module dialog for a write operation

1. The I/O module sends a control signal requesting permission to send data.
2. The peripheral acknowledges the request.
3. The I/O module transfers the data.
4. The peripheral acknowledges receipt of the data.
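The four-step dialog can be sketched as a small handshake. The `Peripheral` class, its method names, and the `"ACK"` return value are hypothetical stand-ins for the control and status signals of a real interface.

```python
class Peripheral:
    """Toy peripheral that accepts data only after granting a request."""

    def __init__(self):
        self.ready = False   # True between steps 2 and 3
        self.received = []   # data accepted so far

    def request_to_send(self):
        """Steps 1 and 2: take the request and acknowledge it."""
        self.ready = True
        return "ACK"

    def accept(self, data):
        """Steps 3 and 4: take the data and acknowledge receipt."""
        if not self.ready:
            raise RuntimeError("data sent without a granted request")
        self.received.append(data)
        self.ready = False
        return "ACK"


def io_module_write(peripheral, data):
    """Run the full write dialog; return True if both acknowledgements arrive."""
    if peripheral.request_to_send() != "ACK":
        return False
    return peripheral.accept(data) == "ACK"


printer = Peripheral()
print(io_module_write(printer, 0x41))  # True
print(printer.received)                # [65]
```

Skipping the request step raises an error in this model, which mirrors the point of the dialog: data moves only after the peripheral has agreed to receive it.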

FireWire Serial Bus – IEEE 1394

• Very high-speed serial bus
• Low cost
• Easy to implement
• Used with digital cameras, VCRs, and televisions

FireWire Configurations

• Daisy chain
• Up to 63 devices on a single port – 64 if you count the interface itself
• Up to 1022 FireWire buses can be interconnected using bridges
• Hot plugging
• Automatic configuration
• No terminations
• Can be tree structured rather than strictly daisy chained

FireWire three-layer stack

Physical layer

• Defines the transmission media that are permissible and the electrical and signaling characteristics of each
• Data rates from 25 to 400 Mbps

• Converts binary data to electrical signals
• Provides arbitration services
o Based on a tree structure
o The root acts as arbiter
o First come, first served
o Natural priority resolves simultaneous requests – the node nearest the root wins
o Fair arbitration
o Urgent arbitration
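The basic arbitration rules above (first come, first served, with natural priority for simultaneous requests) can be sketched as a single selection function. The function name, the node names, and the representation of the tree as a depth table are hypothetical. Breaking a tie between nodes at equal depth by node name is an assumption added to make the result deterministic; the notes do not specify it. Fair and urgent arbitration are not modeled.

```python
def grant_bus(requests, depth):
    """Pick the node that wins arbitration.

    requests: list of (arrival_time, node) pairs
    depth:    dict mapping node -> distance from the root (root is 0)
    """
    # Earliest request wins; simultaneous requests go to the node nearest
    # the root; equal depth is broken by node name (assumption).
    return min(requests, key=lambda r: (r[0], depth[r[1]], r[1]))[1]


depth = {"root": 0, "camera": 1, "disk": 2, "tv": 2}

# Simultaneous requests: the node nearest the root wins.
print(grant_bus([(5, "disk"), (5, "camera")], depth))  # camera

# An earlier request wins even from deeper in the tree.
print(grant_bus([(3, "tv"), (5, "camera")], depth))    # tv
```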

Link layer

• Describes the transmission of data in packets
• Asynchronous
o A variable amount of data and several bytes of transaction data are transferred as a packet
o Uses an explicit address
o Acknowledgement returned
• Isochronous
o A variable amount of data is sent in a sequence of fixed-size packets at regular intervals
o Uses simplified addressing
o No acknowledgement

Transaction layer

• Defines a request-response protocol that hides the lower-layer detail of FireWire from applications.



FireWire Protocol Stack

FireWire Subactions

InfiniBand

• Recent I/O specification aimed at the high-end server market
• First version released in early 2001
• Standard for data flow between processors and intelligent I/O devices
• Intended to replace the PCI bus in servers
• Greater capacity, increased expandability, enhanced flexibility
• Connects servers, remote storage, and network devices to a central fabric of switches and links
• Greater server density
• Independent nodes added as required
• I/O distance from server up to:
o 17 meters using copper
o 300 meters using multimode optical fiber
o 10 kilometers using single-mode optical fiber
• Transmission rates up to 30 Gbps



InfiniBand Operations

• 16 logical channels (virtual lanes) per physical link
• One lane is used for fabric management – all other lanes are used for data transport
• Data is sent as a stream of packets
• A virtual lane is temporarily dedicated to the transfer from one end node to another
• The switch maps traffic from an incoming lane to an outgoing lane
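The lane mapping done by a switch can be sketched as a lookup table. This is a simplified model only: the `Switch` class and its methods are hypothetical, and numbering the management lane as 15 is an assumption of this sketch, since the notes say only that one lane is set aside for fabric management.

```python
NUM_LANES = 16        # virtual lanes per physical link, from the notes
MANAGEMENT_LANE = 15  # assumption: the notes do not say which lane this is


class Switch:
    """Toy switch that maps (port, lane) pairs from input to output."""

    def __init__(self):
        self.lane_map = {}  # (in_port, in_lane) -> (out_port, out_lane)

    def dedicate(self, in_port, in_lane, out_port, out_lane):
        """Temporarily dedicate a data lane to one end-to-end transfer."""
        for lane in (in_lane, out_lane):
            if not 0 <= lane < NUM_LANES:
                raise ValueError("no such lane: %d" % lane)
            if lane == MANAGEMENT_LANE:
                raise ValueError("management lane cannot carry data")
        self.lane_map[(in_port, in_lane)] = (out_port, out_lane)

    def forward(self, in_port, in_lane, packet):
        """Move a packet from its incoming lane to the mapped outgoing lane."""
        out_port, out_lane = self.lane_map[(in_port, in_lane)]
        return out_port, out_lane, packet


switch = Switch()
switch.dedicate(in_port=1, in_lane=0, out_port=4, out_lane=2)
print(switch.forward(1, 0, "packet-0"))  # (4, 2, 'packet-0')
```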
