
Page 1: I/O Subsystem Chapter 8

I/O Subsystem, Chapter 8

N. Guydosh, 4/28/04+

Page 2: I/O Subsystem Chapter 8

Introduction

• Amazing variation of characteristics and behaviors
• Characteristics largely driven by technology
• Not as “elegant” as processors or memory systems
– Traditionally the study of I/O took a “back seat” to processors and memory
– An unfortunate situation, because a computer system is useless without I/O, and Amdahl’s law tells us that ultimately I/O is the performance bottleneck. See example in section 8.1

Fig. 8.1: Typical I/O configuration. The processor and its cache sit on a memory-I/O bus along with main memory and several I/O controllers; the controllers drive disks, graphics output, and a network, and signal the processor via interrupts.

Page 3: I/O Subsystem Chapter 8

I/O Performance Metrics

• A point of confusion: in I/O systems, KB, MB, etc. are traditionally powers of 10 (1,000 and 1,000,000 bytes), but in memory/processor systems these are powers of 2 (1,024 and 1,048,576)
– For simplicity, let’s ignore the small difference and use only one base, say 2
• “Supercomputer” I/O benchmarks
– Typically for check-pointing the machine: want maximum bytes/sec on output
• Transaction processing (TP)
– Response time and throughput important
– Lots of small I/O events, thus the number of disk accesses per second is more important than “bytes/sec”
– Reliability very important
• File system I/O benchmarks
– These exercise the I/O system with I/O commands; examples for UNIX: MakeDir, Copy, ScanDir (traverse a directory tree), ReadAll (scan every byte in every file once), Make (compiling and linking)

Page 4: I/O Subsystem Chapter 8

Types & Characteristics of I/O Devices

• Again, diversity is the problem here
– Devices differ significantly in:
  Behavior
  “Partner”: purely machine interfaced or human interfaced
  Data rate: ranges from a few bytes/sec to tens of millions of bytes/sec
– See text for descriptions of various devices commonly in use
• Disk access time calculation:
– See book on disk organization
– Components of access time:
  Average seek time: move the head to the desired track
  Rotational latency: wait for the sector to reach the head (on average 0.5 rotation at the disk’s rotation rate)
  Transfer time: time to read or write a sector
  Sometimes queuing time is included: waiting for a request to get serviced
– Disk density and size affect performance and usefulness
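The access-time components above can be sketched as a small helper (a sketch; the function name, parameters, and example figures are mine, not from the slides):

```python
# Average disk access time = seek + rotational latency + transfer (+ queuing).
# All names and the 7200 RPM example below are illustrative assumptions.

def disk_access_time_ms(seek_ms, rpm, transfer_mb_per_s, sector_kb, queue_ms=0.0):
    """Average time to read one sector, in milliseconds."""
    rotational_ms = 0.5 * 60_000 / rpm                     # half a rotation, on average
    transfer_ms = (sector_kb / 1024) / transfer_mb_per_s * 1000
    return seek_ms + rotational_ms + transfer_ms + queue_ms

# Example: 8 ms seek, 7200 RPM, 5 MB/sec transfer rate, 0.5 KB sector
t = disk_access_time_ms(8.0, 7200, 5.0, 0.5)   # about 12.26 ms, dominated by seek + rotation
```

Note how little of the total is transfer time: for small accesses, seek and rotational latency dominate.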

Page 5: I/O Subsystem Chapter 8

Connecting The System: Busses

• A “bus” connects subsystems together
– Connects processor, memory, and I/O devices together
– Consists of a set of wires with control logic and a well-defined protocol for using the bus
– Protocol is implemented in hardware
• A “standard” bus design was a prime factor in the success of the personal computer
– Purchase a base system and “grow” it by adding off-the-shelf components
– Historically a very chaotic aspect of the computer industry
  The “bus wars” ... PCI wins, Microchannel loses!
• Busses are a key factor in the overall performance of a computer system

Page 6: I/O Subsystem Chapter 8

Connecting The System: Busses (cont.)

• Some bus tradeoffs:
– Advantage: flexibility in adding new devices & peripherals
– Disadvantage: a serially reusable resource ==> only one user at a time, a communication bottleneck
• Two performance goals
– High bandwidth (data rate, MB/sec)
– Low latency
• A bus consists of a set of data lines and control lines
– Data lines carry both addresses and raw data
– Because the bus is shared, we need a protocol to decide who uses it next
• Bus transaction (send address & receive or send data)
– Terminology is from the point of view of memory (confusing!)
– Input: writes data to memory from I/O
– Output: reads data from memory to I/O
– See examples in figs. 8.7, 8.8

Page 7: I/O Subsystem Chapter 8

Connecting The System:busses (cont.)

Fig. 8.7: Output operation: data from memory “outputted” to a device (a write to disk), shown in three steps.
a. The processor places a memory-read command and the address on the bus (control and data lines).
b. Memory accesses the data.
c. Memory signals a “data ready” response and places the data on the bus; the disk controller captures it.

Page 8: I/O Subsystem Chapter 8

Connecting The System:busses (cont.)

Fig. 8.8: Input operation: data to memory “inputted” from a device (a read from disk), shown in two steps.
a. The processor places a write-register command and the address on the bus.
b. The device places the data on the data lines, and memory captures it.

Page 9: I/O Subsystem Chapter 8

Types of Busses

• Backplane (motherboard) bus
– Interconnects backplane components
– Plug-in feature
– Typical “standard” busses (ISA, AT, PCI ...)
– Connects to other busses
• Processor-memory bus
– Usually proprietary
– High speed
– Direct connection of processor to memory, with links to other busses
• I/O bus
– Typically does not connect directly to memory
– Usually bridges to the backplane or processor-memory bus
– Examples: SCSI, IDE, EIDE, …

Page 10: I/O Subsystem Chapter 8

Types of Busses (cont.)

• A lot of functional overlap in the above 3 types of busses
– Can put memory directly on the backplane bus
• Logic is needed to interconnect busses (bridge chips)
– Ex: backplane to I/O bus
• A system may have a single backplane bus:
– Ex: old PCs (ISA/AT)
• See fig. 8.9, p. 659 for examples ==>

Page 11: I/O Subsystem Chapter 8

Types of Busses Example

Fig. 8.9: Three example organizations.
a. A single backplane bus connecting processor, memory, and the I/O devices: older PCs.
b. A processor-memory bus as the main bus, with bus adapters down to separate I/O busses (ex: an EIDE bus in a PC); the main bus could be a PCI backplane in modern computers.
c. All 3 types of busses utilized here: a processor-memory bus, a bus adapter to a backplane bus (ex: PCI), and further bus adapters to I/O busses. Ex: proprietary (old IBM?)

Page 12: I/O Subsystem Chapter 8

Synchronous vs. Asynchronous Busses

• Synchronous
– Bus includes a clock line among the control lines
– Protocol is not very data dependent: tightly coupled to the clock, highly synchronized, and completely clock driven
– The only asynchronous activity is the generation of commands or requests
– A model for this type of bus is an FSM
– Disadvantages:
  All devices on the bus must run at the same clock speed
  Lines must be short due to the clock skew problem
– Advantage: can have high performance in special applications such as processor-memory bussing
– Sometimes used for the processor-memory bus

Page 13: I/O Subsystem Chapter 8

Synchronous vs. Asynchronous Busses (cont.)

• Asynchronous
– Very little clock dependency
– Event driven
– Keeps in step via handshaking
  See example in figure 8.10
– Very versatile
– Bus can be “arbitrarily” long
– Common for standard busses
– Ex: SBus (Sun), Microchannel, PCI
– Can even connect busses/devices using different clocks
– Disadvantage: lower performance … due to handshaking?
• A model for this type of bus is a pair of interacting FSMs, based on figure 8.10
– See fig. 8.11, p. 664 ... see performance analysis, pp. 662-663

Page 14: I/O Subsystem Chapter 8

Handshaking on an Asynchronous Bus

Fig. 8.10: Timing of the ReadReq, DataRdy, Ack, and data lines over the seven numbered steps below. Color coding in the figure: colored signals are from the device; black signals are from memory.

Operation: data from memory to a device.
Initially: the device raises ReadReq and puts the address on the data lines.
1. Memory sees ReadReq, reads the address from the data bus, and raises Ack
2. The I/O device sees Ack high and releases ReadReq and the data lines
3. Memory sees ReadReq low and drops the Ack line to acknowledge the ReadReq signal
4. Memory puts the data on the data lines and asserts DataRdy
5. The I/O device sees DataRdy, reads the data, and signals Ack
6. Memory sees Ack, drops DataRdy, and releases the data lines
7. The I/O device sees DataRdy drop and drops the Ack line

Note: the bus is bi-directional (the data lines carry first the address, then the data).
Question: what happens if an Ack fails to get issued?
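The seven steps can be walked through in a tiny simulation (a sketch with assumed names, not from the slides; each comment marks the corresponding step in fig. 8.10):

```python
# Simulates the ReadReq/DataRdy/Ack handshake of fig. 8.10 for one read.
# Signal and function names are illustrative assumptions.

def simulate_handshake():
    sig = {"ReadReq": 0, "DataRdy": 0, "Ack": 0, "data": None}
    log = []

    # Initially: device raises ReadReq and puts the address on the data lines
    sig["ReadReq"], sig["data"] = 1, "addr"

    addr = sig["data"]; sig["Ack"] = 1; log.append(1)          # 1: memory latches addr, raises Ack
    sig["ReadReq"], sig["data"] = 0, None; log.append(2)       # 2: device releases ReadReq & data lines
    sig["Ack"] = 0; log.append(3)                              # 3: memory drops Ack
    sig["data"], sig["DataRdy"] = f"M[{addr}]", 1; log.append(4)  # 4: memory drives data, asserts DataRdy
    word = sig["data"]; sig["Ack"] = 1; log.append(5)          # 5: device reads data, asserts Ack
    sig["DataRdy"], sig["data"] = 0, None; log.append(6)       # 6: memory drops DataRdy, releases data lines
    sig["Ack"] = 0; log.append(7)                              # 7: device drops Ack; bus is idle again

    return word, log, sig
```

Every assertion by one side is answered by the other, which is why a lost Ack (the question above) deadlocks the protocol unless a timeout is added.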

Page 15: I/O Subsystem Chapter 8

FSM Model of an Asynchronous Bus (based on the example in fig. 8.10)

Fig. 8.11: A pair of interacting finite state machines, one for memory and one for the I/O device; the numbers in each state correspond to the numbered steps in fig. 8.10.
Memory FSM: (1) record the address from the data lines and assert Ack; (3, 4) drop Ack, put the memory data on the data lines, and assert DataRdy; (6) release the data lines and DataRdy.
I/O device FSM: on a new I/O request, put the address on the data lines and assert ReadReq; (2) release the data lines and deassert ReadReq; (5) read the memory data from the data lines and assert Ack; (7) deassert Ack.
Transitions are guarded by the ReadReq, Ack, and DataRdy signals and their complements.

Page 16: I/O Subsystem Chapter 8

An Example (pp. 662-663)

• Referring to the example in fig. 8.10: we will compare the asynchronous bandwidth (BW) with a synchronous approach
• Asynchronous:
– 40 ns per handshake (one of the 7 steps)
• Synchronous:
– Clock cycle = 50 ns
– Each bus transmission takes one clock cycle
• Both schemes: 32-bit data bus and one-word reads from a 200 ns memory
• Synchronous:
– Send address to memory: 50 ns; read memory: 200 ns; send data to device: 50 ns, for a total time of 300 ns
– BW = 4 bytes/300 ns = 13.3 MB/sec
• Asynchronous:
– Can overlap steps 2, 3, and 4 with the memory access time
– Step 1: 40 ns
– Steps 2, 3, 4: maximum{3×40 ns, 200 ns} = 200 ns (steps 2, 3, 4 “hidden” by the memory access)
– Steps 5, 6, 7: 3×40 ns = 120 ns
– BW = 4 bytes/(40+200+120) ns = 11.1 MB/sec
• Observation: synchronous is only 20% faster, due to the overlap in handshaking
• Comment: asynchronous is usually preferred because it is more technology independent and more versatile in handling different device speeds
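As a check, the two bandwidth figures can be reproduced in a few lines (a sketch; the constant names are mine, not from the slides):

```python
# Recomputes the synchronous vs. asynchronous bandwidth comparison above.
WORD_BYTES = 4
MEM_NS = 200           # memory access time
HANDSHAKE_NS = 40      # per asynchronous handshake step
CLOCK_NS = 50          # synchronous clock cycle

# Synchronous: address cycle + memory access + data cycle = 300 ns
sync_ns = CLOCK_NS + MEM_NS + CLOCK_NS
sync_bw = WORD_BYTES / sync_ns * 1000          # bytes/ns -> MB/sec

# Asynchronous: step 1, then steps 2-4 overlapped with memory, then steps 5-7 = 360 ns
async_ns = HANDSHAKE_NS + max(3 * HANDSHAKE_NS, MEM_NS) + 3 * HANDSHAKE_NS
async_bw = WORD_BYTES / async_ns * 1000

print(round(sync_bw, 1), round(async_bw, 1))   # 13.3 11.1
```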

Page 17: I/O Subsystem Chapter 8

An Example (pp. 665-666): The Effect of Block Size on Synchronous Bus Bandwidth

• Bus description
– Two cases to consider: a memory & bus system supporting access of 4-word blocks (case 1) and of 16-word blocks (case 2), where a word is 32 bits in each case
– 64-bit (2-word) synchronous bus clocked at 200 MHz (5 ns/cycle), with each 64-bit transfer taking 1 clock cycle and 1 clock cycle needed to send the initial address
– Two idle clock cycles needed between bus operations; the bus is assumed to be idle before an access
– A memory access for the first 4 words is 200 ns (40 cycles), and each additional set of 4 words is 20 ns (4 cycles)
– Assume that a bus transfer of the most recently read data and a read of the next 4 words can be overlapped
– Summary: memory is accessed 4 words at a time but must be sent over the bus in two 2-word shots (2 cycles), since the bus is only 2 words wide
• Find: sustained bandwidth, latency (transfer time of 256 words), and # bus transactions/sec for a read of 256 words in two cases: 4-word blocks and 16-word blocks. Note: interpret a “bus transaction” as transferring a (4- or 16-word) block.

Page 18: I/O Subsystem Chapter 8

An Example (pp. 665-666), Case 1: 4-word Block Transfers

• 1 clock cycle to send the address of the block to memory
• The 200 MHz bus has a 5 ns period (5 ns/cycle)
  Memory access time (for the first, and only, 4 words) is 200 ns
  # cycles to read memory = (memory access time)/(clock cycle time) = 200 ns/5 ns = 40 cycles
• 2 clock cycles to send the data from memory
  since we transfer 64 bits = 2 words per cycle and a block is 4 words
• 2 idle cycles between this transfer and the next
• Note: no overlap here, because the entire block is transferred in one access. Overlap occurs only within a block spanning multiple accesses, as in case 2 (next).
• Total number of cycles for a block = 1 + 40 + 2 + 2 = 45 cycles
  256 words to be read results in 256/4 = 64 blocks (transactions)
  thus 45×64 = 2880 cycles are needed for the transfer
  latency = 2880 cycles × 5 ns/cycle = 14,400 ns
  # bus transactions/sec = 64/14,400 ns = 4.44M transactions/sec
  BW = (256×4) bytes/14,400 ns = 71.11 MB/sec

Page 19: I/O Subsystem Chapter 8

An Example (pp. 665-666), Case 2: 16-word Block Transfers

• Timing for a 1-block (16-word) transfer:
  1 cycle for the address, 40 cycles for the first 4-word memory access, then four 4-cycle steps in which the 2-cycle bus transfer (plus the 2 idle cycles) of each 4-word group overlaps the 4-cycle read of the next group. The first portion is essentially case 1.
  Total = 1 + 40 + 4×4 = 57 cycles … was 45 for a 4-word block
• For the full 256-word read:
  Number of transactions (blocks) needed = 256/16 = 16 transactions … was 64 for 4-word blocks
  Total transfer time = 57×16 = 912 cycles … was 2880 for 4-word blocks
  Latency = 912 cycles × 5 ns/cycle = 4560 ns … was 14,400 ns for 4-word blocks
  Transactions/sec = 16/4560 ns = 3.51M transactions/sec … was 4.44M for 4-word blocks
  BW = (256×4) bytes/4560 ns = 224.56 MB/sec … was 71.11 for 4-word blocks
• Note: a 16-word block is read in four 4-word shots, thus there will be overlap.
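Both cases fall out of one parameterized calculation (a sketch; the function name and structure are mine, not from the slides):

```python
# Reproduces cases 1 and 2 above: a 200 MHz, 2-word-wide synchronous bus
# reading `total_words` from memory in blocks of `block_words`.

def bus_read(total_words, block_words, cycle_ns=5):
    """Return (latency_ns, transactions_per_sec, bw_mb_per_sec)."""
    groups = block_words // 4          # memory is accessed 4 words at a time
    # 1 address cycle + 40 cycles for the first 4-word access, then 4 cycles
    # per group: for a 4-word block that is 2 transfer + 2 idle cycles (no
    # overlap); for larger blocks each group's transfer overlaps the next read.
    cycles_per_block = 1 + 40 + 4 * groups
    blocks = total_words // block_words
    latency_ns = cycles_per_block * blocks * cycle_ns
    tps = blocks / (latency_ns * 1e-9)
    bw = total_words * 4 / latency_ns * 1000       # bytes/ns -> MB/sec
    return latency_ns, tps, bw
```

Running `bus_read(256, 4)` and `bus_read(256, 16)` reproduces the 14,400 ns vs. 4560 ns latencies and the roughly 3x bandwidth win for the larger block size.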

Page 20: I/O Subsystem Chapter 8

Controlling Bus Access

• Only one device can use the bus at a time
• Bus control: the “bus master”
– Controls access to the bus
– Initiates & controls all bus requests
• Slave
– Never generates its own requests
– Responds to read and write requests
• Processor: always a master
• Memory: usually a slave
• Having a single bus master could create a bottleneck
– The processor would be involved in every bus transaction
– See fig. 8.12 for an example

Page 21: I/O Subsystem Chapter 8

Bus Control With a Single Master

Fig. 8.12: A disk transfer with the processor as the single bus master, in three steps.
a. The disk makes a request to the processor (over the bus request lines) for a data transfer from memory to disk.
b. The processor responds by asserting the read request line to memory.
c. The processor acks to the disk that the request is being processed; the disk now places the desired address on the bus.

Page 22: I/O Subsystem Chapter 8

Controlling Bus Access – Multiple Masters

• Bus arbitration: deciding which master gets control of the bus (p. 669)
– A chip (the arbiter) decides which device gets the bus next
– Typically each device has a dedicated line to the arbiter for requests
– The arbiter will eventually issue a grant (on a separate line to the device)
– The device is now master, uses the bus, and then signals the arbiter when it is done with the bus
– Devices have priorities
– The bus arbiter may invoke a “fairness” rule for a low-priority device which is waiting
– Arbitration time is overhead and should be overlapped with bus transfers whenever possible; maybe use physically separate lines for arbitration

Page 23: I/O Subsystem Chapter 8

Arbitration Schemes p. 670

• Daisy chain
– The grant line is chained from the highest to the lowest priority device
– A device making a request takes the grant but does not pass it on; the grant is passed on only by non-requesting devices: no fairness, possible starvation

Figure: a bus arbiter drives a grant line chained from Device 1 (highest priority) through Device n (lowest priority), with shared release and request lines back to the arbiter.

Page 24: I/O Subsystem Chapter 8

Arbitration Schemes p. 670 (cont.)

• Centralized, parallel
– Multiple request lines; the chosen device becomes master
– Requires a central arbiter, a potential bottleneck
– Used by PCI
• Distributed arbitration: self selection
– Multiple request lines
– Request: place an id code on the bus; by examining the bus, each device can determine priority
– No need for a central arbiter
– Needs more lines for requests
– Ex: NuBus (Apple Mac)
• Distributed arbitration by collision detection
– Free for all: request the bus at will
– A collision detector then resolves who gets it
– Ethernet uses this.

Page 25: I/O Subsystem Chapter 8

I/O To Memory, Processor, OS Interfaces

• Questions (p. 673)
– How do I/O requests transform to device commands and get transferred to a device?
– How are data transfers between device and memory done?
– What is the role of the operating system?
• The OS:
– Device drivers operating at kernel/supervisory mode
– Performs interrupt handling & DMA services
– Functions:
  Commands to I/O
  Respond to I/O signals ... some are interrupts
  Control data transfer ... buffers, DMA, other algorithms, control priorities

Page 26: I/O Subsystem Chapter 8

Commands To I/O Devices

• Two basic approaches:
– Direct I/O (programmed I/O or “PIO”)
– Memory-mapped I/O
• PIO
– Special I/O instructions: in/out for Intel
– The “address” associated with in/out is put on the address bus, but the op-code context causes the I/O interface (usually registers) to be accessed, causing I/O activity
– The address is an I/O port
• Memory mapped => see next

Page 27: I/O Subsystem Chapter 8

Commands To I/O Devices (cont.)

• Memory mapped
– A certain portion of the address space is reserved for I/O devices
– A program communicates with a device in the same way it does with memory: memory instructions are used
– If the address is in the “device space” range, the device controller responds with the appropriate commands to the device ... read/write
• User programs are not allowed to access memory-mapped I/O space
– The address used by an instruction encodes both the device identity & the type of data transmission
– Memory mapped is usually faster than PIO because DMA is available

Page 28: I/O Subsystem Chapter 8

I/O - Processor Communication: Polling / Memory Mapped

• Polling is the simplest way for I/O to communicate with the processor
– Periodically check status bits to see what to do next
  The I/O device posts status in a special register, ex: “I am busy”
– The processor continually checks for status using either PIO or memory-mapped I/O
– Wastes a lot of processor time, because processors are faster than I/O devices
– Many of the polls occur when the waited-for event has not yet happened
– OK for slow devices such as a mouse
– Under OS control, polls can be limited to periods only when the device is active, thus allowing polling even for faster devices: cheap I/O!

Page 29: I/O Subsystem Chapter 8

I/O - Example

• Examples for slow, medium, & high speed devices
  Determine the impact of polling overhead for 3 devices.
  Assume the number of clock cycles per poll is 400, and a 500 MHz clock.
  In all cases, no data can be missed.
– Example 1: a mouse polled 30 times/sec
  Cycles/sec for polling = 30 polls/sec × 400 cyc/poll = 12,000 cyc/sec
  % of processor cycles consumed = 12,000/500 MHz ≈ 0.002%
  Negligible impact on performance.
– Example 2: a floppy disk
  Transfers data to the processor in 16-bit (2-byte) units and has a data rate of 50 KB/sec
  Polling rate = (50 KB/sec)/(2 bytes/poll) = 25K polls/sec
  Cycles/sec for polling = 25K polls/sec × 400 cyc/poll = 10^7 cyc/sec
  % of processor cycles consumed = (10^7 cyc/sec)/500 MHz = 2%
  Still tolerable

Page 30: I/O Subsystem Chapter 8

I/O - Example (cont.)

– Example 3: a hard drive
  Transfers data in four-word (16-byte) chunks
  Transfer rate is 4 MB/sec
  Must poll at the data rate in 4-word chunks: (4 MB/sec)/(16 bytes/poll), or a polling rate of 250K polls/sec
  Cycles/sec for polling = (250K polls/sec) × (400 cyc/poll) = 10^8 cyc/sec
  % of processor cycles consumed = (10^8 cyc/sec)/500 MHz = 20%
– 1/5 of the processor would be used in polling the disk! Not acceptable.
• The bottom line: polling works OK for low speed devices but not for high speed devices.
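The three polling calculations share one formula, sketched below (names are mine, not from the slides):

```python
# Fraction of a 500 MHz processor consumed by polling at 400 cycles/poll,
# for the mouse, floppy, and hard-drive examples above.

CYCLES_PER_POLL = 400
CLOCK_HZ = 500e6

def polling_overhead(polls_per_sec):
    """Fraction of processor cycles spent polling."""
    return polls_per_sec * CYCLES_PER_POLL / CLOCK_HZ

mouse  = polling_overhead(30)            # 30 polls/sec
floppy = polling_overhead(50e3 / 2)      # 50 KB/sec at 2 bytes/poll = 25K polls/sec
disk   = polling_overhead(4e6 / 16)      # 4 MB/sec at 16 bytes/poll = 250K polls/sec

print(f"{mouse:.4%} {floppy:.0%} {disk:.0%}")   # 0.0024% 2% 20%
```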

Page 31: I/O Subsystem Chapter 8

Interrupt Driven I/O

• The problem with simple polling is that it must be done even when nothing is happening, during a waiting period
• When CPU processing is needed for an I/O event, the processor is interrupted.
– Interrupts are asynchronous
– Not associated with any particular instruction
– Allows instruction completion (compare with exceptions in chapter 5)
– An interrupt must convey further information, such as the identity of the device and its priority
– This additional information is conveyed by using vectored interrupts or a cause register

Page 32: I/O Subsystem Chapter 8

Interrupt Scheme

The “granularity” of an interrupt is a single machine instruction. The check for pending interrupts and the processing of interrupts is done between instructions, i.e., the current instruction is completed before a pending interrupt is processed.

Page 33: I/O Subsystem Chapter 8

Overhead for Interrupt Driven I/O

• Using the previous example of a hard drive (p. 676):
  Data transfers in 4-word chunks
  Transfer rate of 4 MB/sec
– Assume the overhead for each transfer, including the interrupt, is 500 clock cycles
– Find the % of the processor consumed if the hard drive is only transferring data 5% of the time, causing CPU interaction.
• Answer:
  The interrupt rate for a busy disk would be the same as the previous polling rate, to match the transfer rate:
  (250K interrupts/sec) × 500 cycles/interrupt = 125×10^6 cyc/sec
  % processor consumed during a transfer = 125×10^6/500 MHz = 25%
  Assuming the disk is transferring data 5% of the time, the % processor consumed (average) = 25% × 5% = 1.25%
  No overhead when the disk is not actually transferring data: an improvement over polling.
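In code form (a sketch; the constant names are mine, not from the slides):

```python
# Interrupt-overhead calculation for the 4 MB/sec hard drive above.
CLOCK_HZ = 500e6
CYCLES_PER_INTERRUPT = 500
INTERRUPTS_PER_SEC = 4e6 / 16      # 4 MB/sec in 16-byte chunks = 250K/sec

busy_fraction = INTERRUPTS_PER_SEC * CYCLES_PER_INTERRUPT / CLOCK_HZ  # while transferring
average = busy_fraction * 0.05                                        # disk busy 5% of the time

print(f"{busy_fraction:.0%} {average:.2%}")   # 25% 1.25%
```

The key difference from polling is the 5% factor: interrupt overhead is paid only while the disk is actually transferring.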

Page 34: I/O Subsystem Chapter 8

DMA I/O

• Polling and interrupt driven I/O are best with lower bandwidth devices, where cost is more of a factor.
• Both polling and interrupt driven I/O put the burden of moving data and managing the transfer on the CPU.
– Even though the processor may continue processing during an I/O access, it ultimately must move the I/O data from the device when that data becomes available, or perhaps from some I/O buffer to main memory.
– In our previous example of an interrupt driven hard disk, even though the CPU does not have to wait for every I/O event to complete, it would still consume 25% of the CPU cycles while the disk is transferring data. See p. 680.
• Interrupt driven I/O for high bandwidth devices can be greatly improved if we make a device controller transfer data directly to memory without involving the processor: DMA (Direct Memory Access).

Page 35: I/O Subsystem Chapter 8

DMA I/O (cont.)

• DMA is a specialized processor that transfers data between memory and an I/O device while the CPU goes on with other tasks.
• DMA is external to the CPU and must act as a bus master.
• The CPU first sets up the “DMA registers” with a memory address & the number of bytes to be transferred.
– To the requesting program, this may be seen as setting up a “control block” in memory.
• DMA is frequently part of the controller for a device.
• Interrupts are still used with DMA, but only to inform the processor that the I/O transfer is complete or that an error occurred.
• DMA is a form of multi- or parallel processing; not a new idea: IBM channels for mainframes in the ’60s.
– Channels are programmable (with channel control words), whereas DMA is generally not programmable.

Page 36: I/O Subsystem Chapter 8

DMA I/O – How It Works

• Three steps of DMA
– Processor sets up DMA: device id, operation, source/destination, number of bytes to transfer
– DMA controller “arbitrates” for the bus
  Supplies the correct commands to the device, source, destination, etc.
  Then lets the data “rip”.
  Fancy buffering may be used ... ping/pong buffers.
  May be multi-channeled
– Interrupt the processor on completion of DMA or on error
• DMA can still have contention with the processor in competing for memory and the bus.
– Problem: “cycle stealing”. When there is bus/memory contention while the CPU needs a memory word during a DMA transfer, DMA wins out and the CPU pauses instruction execution for that memory cycle (the cycle was “stolen”).

Page 37: I/O Subsystem Chapter 8

Overhead Using DMA

• Again use the previous disk example on page 676.
– Assume the initial setup of DMA takes 1000 CPU cycles
– Assume interrupt handling for DMA completion takes 500 CPU cycles
– The hard drive has a transfer rate of 4 MB/sec and uses DMA
– The average transfer size from disk is 8 KB
• What % of the 500 MHz CPU is consumed if the disk is actively transferring 100% of the time? Ignore any bus contention between the CPU and the DMA controller.
• Answer:
  Each DMA transfer takes 8 KB/(4 MB/sec) = 0.002 sec/transfer
  When the disk is constantly transferring, it takes:
  (1000 + 500 cyc/transfer)/(0.002 sec/transfer) = 750,000 clock cyc/sec
  Since the CPU runs at 500 MHz:
  % of processor consumed = (750,000 cyc/sec)/500 MHz = 0.0015 = 0.15%
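The same calculation as code (a sketch; names are mine; 8 KB is taken as 8000 bytes, matching the slide's 0.002 sec figure):

```python
# DMA-overhead calculation for the disk example above.
CLOCK_HZ = 500e6
SETUP_CYCLES = 1000          # DMA setup
COMPLETION_CYCLES = 500      # completion-interrupt handling
XFER_BYTES = 8e3             # 8 KB average transfer (decimal, per the slide)
DISK_RATE = 4e6              # 4 MB/sec

secs_per_xfer = XFER_BYTES / DISK_RATE                        # 0.002 sec
cycles_per_sec = (SETUP_CYCLES + COMPLETION_CYCLES) / secs_per_xfer
fraction = cycles_per_sec / CLOCK_HZ

print(f"{fraction:.2%}")   # 0.15%
```

Compare 0.15% with 25% for interrupt-driven I/O on the same busy disk: with DMA the CPU pays only per transfer, not per 16-byte chunk.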

Page 38: I/O Subsystem Chapter 8

DMA: Virtual vs. Physical Addressing (p. 683)

• In a VM system, should DMA use virtual addresses or physical addresses?
– This topic in the book is at best flaky; here is my take on it:
• If virtual addresses are used:
– Contiguous pages in VM may not be contiguous in PM.
– A DMA request is made by specifying the virtual address of the starting point of the data to be transferred and the number of bytes to be transferred.
– The DMA unit will have to translate VA to PA for all reads/writes to/from memory, a performance problem. Actually, the address translation may be done by the OS, which provides the DMA unit with the physical addresses: a “scatter/gather” operation. Fancy DMA controllers may be able to chain a series of pages for a single request of more than one page; the OS provides a list of physical page frame addresses corresponding to the multi-page DMA block in VM. Or: restrict DMA block sizes to integral pages & translate only the starting address.
• If physical addresses are used, they may not be contiguous in virtual memory if a page boundary is crossed. Must constrain all DMA transfers to stay within a single page, or requests must be for a page at a time.
• Also, the OS must be savvy enough not to relocate pages in the target/source region during a DMA transfer.

Page 39: I/O Subsystem Chapter 8

DMA: Memory Coherency

• DMA & memory/cache systems
– Without DMA, all memory access is through address translation and the cache
– With DMA, data is transferred directly to/from main memory, bypassing the cache ==> coherency problem
  DMA reads/writes go to main memory
  There is no cache between the DMA controller and main memory
– The value of a memory location seen by DMA & by the CPU may differ
– If DMA writes into main memory at a location for which there are corresponding blocks in the cache, the cache data seen by the CPU will be obsolete.
– If the cache is write back, and DMA reads a value directly from main memory before the cache does a write back (due to “lazy” write backs), then the value read by DMA will be obsolete.
  … remember there is a possibility that DMA will take priority over the CPU in accessing memory, to the CPU’s disadvantage.
• Possible solutions: see next ==>

Page 40: I/O Subsystem Chapter 8

DMA: Memory Coherency (cont.)

• Some solutions: see pp. 683-684
– Route all I/O activity through the cache:
  Performance hit, and may be costly
  May flush out good data needed by the processor ... the I/O data may not be that critical to the processor at the time it arrives; the working set may be messed up.
– OS selectively invalidates the cache for an I/O memory operation, or forces a write back for an I/O read from memory, called cache flushing. (There may be some “read/write” terminology confusion here!) Some HW support is needed here.
– Hardware mechanism to selectively flush (or invalidate) cache entries
  This is a common mechanism in multiprocessor systems, where there are many caches for a common main memory (the MP cache coherency problem). The same technique works for I/O; after all, DMA is a form of multiprocessing.

Page 41: I/O Subsystem Chapter 8

Designing an I/O System – The Problem

• Specifications for a system
– CPU maximum instruction rate: 300 MIPS
  Average number of CPU instructions per I/O in the OS: 50,000
– Bandwidth of the memory backplane bus: 100 MB/sec
– SCSI-2 controllers with a transfer rate of 20 MB/sec
  The SCSI bus on each controller can accommodate up to 7 disks
– Disk drives:
  Read/write bandwidth of 5 MB/sec
  Average seek + rotational latency of 10 ms
• The workload this system must support:
– 64 KB reads, sequential on a track
– The user program needs 100,000 instructions per I/O operation. This is distinct from the instructions in the OS.
• The problem:
  Find the maximum sustainable I/O rate and the number of disks and SCSI controllers required. Assume that reads can always be done on an idle disk if one exists; ignore disk conflicts.

Page 42: I/O Subsystem Chapter 8

Designing an I/O System – The Solution

• Strategy: there are two fixed components in the system, the memory bus and the CPU. Find the I/O rate that each component can sustain and determine which of these is the bottleneck.
– Each I/O takes 100,000 user instructions and 50,000 OS instructions
  Max I/O rate for the CPU = (instruction rate)/(instructions per I/O) = (300×10^6)/[(50+100)×10^3] = 2000 I/Os per sec
– Each I/O transfers 64 KB, thus:
  Max I/O rate of the backplane bus = (bus BW)/(bytes per I/O) = (100×10^6)/(64×10^3) = 1562 I/Os per sec
– The bus is the bottleneck … design the system to support the bus performance of 1562 I/Os per sec.
– Number of disks needed to accommodate 1562 I/Os per sec:
  Time per I/O at the disk = seek/rotational latency + transfer time = 10 ms + 64 KB/(5 MB/sec) = 22.8 ms
  Thus each disk can complete 1/22.8 ms = 43.9 I/Os per sec
  To saturate the bus, we need (1562 I/Os per sec)/(43.9 I/Os per sec) = 36 disks.
– How many SCSI busses is this?
  Required transfer rate per disk = transfer size/transfer time = 64 KB/22.8 ms = 2.74 MB/sec
  Assume we can use all the SCSI bus BW. We can place (SCSI BW)/(transfer rate per disk) = (20 MB/sec)/(2.74 MB/sec) = 7.3 ==> 7 disks on each SCSI bus. Note: a SCSI bus can support a max of 7 disks.
  For 36 disks we need 36/7 = 5.14 ==> 6 buses.
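The whole sizing exercise can be sketched as one script (names and structure are mine, not from the slides; decimal KB/MB throughout, per the metrics slide):

```python
# Sizing the I/O system above: find the bottleneck, then the disk and bus counts.
import math

CPU_IPS = 300e6                 # 300 MIPS
INSTR_PER_IO = 100e3 + 50e3     # user + OS instructions per I/O
BUS_BW = 100e6                  # memory backplane bus, bytes/sec
IO_BYTES = 64e3                 # 64 KB per read
SEEK_ROT_S = 0.010              # 10 ms seek + rotational latency
DISK_BW = 5e6                   # 5 MB/sec per disk
SCSI_BW = 20e6                  # 20 MB/sec per SCSI-2 controller
MAX_DISKS_PER_SCSI = 7

cpu_rate = CPU_IPS / INSTR_PER_IO        # 2000 I/Os per sec
bus_rate = BUS_BW / IO_BYTES             # 1562.5 I/Os per sec
sustained = min(cpu_rate, bus_rate)      # the bus is the bottleneck

time_per_io = SEEK_ROT_S + IO_BYTES / DISK_BW     # 22.8 ms per disk I/O
disk_rate = 1 / time_per_io                       # ~43.9 I/Os per sec per disk
disks = math.ceil(sustained / disk_rate)          # 36 disks to saturate the bus

per_disk_bw = IO_BYTES / time_per_io              # ~2.8 MB/sec per disk
disks_per_bus = min(int(SCSI_BW / per_disk_bw), MAX_DISKS_PER_SCSI)   # 7
buses = math.ceil(disks / disks_per_bus)          # 6 SCSI buses
```

Changing one specification (say, a faster backplane bus) and re-running makes the bottleneck shift visible, which is the point of the strategy above.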