Computer Hardware and System Software Concepts

● Processor Structure

– Von Neumann Machines

– Pipelined

– Clocked logic systems

Von Neumann Machine

● John von Neumann proposed the concept of a stored-program computer in 1945.

● In a von Neumann machine, the program and the data occupy the same memory.

● The machine has a program counter (PC) which points to the current instruction in memory.

● The PC is updated on every instruction.

● When there are no branches, program instructions are fetched from sequential memory locations.

● A branch simply updates the PC to some other location in the program memory.
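The behaviour described above can be sketched as a toy stored-program machine. This is only an illustration: the tuple instruction format, the opcode names (LOAD, ADD, STORE, JUMP, HALT) and the single accumulator are assumptions made for the sketch, not part of any real instruction set.

```python
# Toy von Neumann machine: instructions and data live in the same memory list.
# The instruction format and opcode names are hypothetical.

def run(memory):
    pc = 0    # program counter: index of the current instruction
    acc = 0   # a single accumulator register
    while True:
        op, arg = memory[pc]
        pc += 1                    # the PC is updated on every instruction
        if op == "LOAD":
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "JUMP":
            pc = arg               # a branch simply overwrites the PC
        elif op == "HALT":
            return memory

PROGRAM = [
    ("LOAD", 6),    # 0: acc = memory[6]
    ("JUMP", 3),    # 1: skip the instruction at address 2
    ("ADD", 6),     # 2: never executed
    ("ADD", 7),     # 3: acc += memory[7]
    ("STORE", 8),   # 4: memory[8] = acc
    ("HALT", 0),    # 5: stop
    2,              # 6: data
    3,              # 7: data
    0,              # 8: result goes here
]

result = run(list(PROGRAM))
print(result[8])
```

Because program and data share one memory, nothing stops a STORE from overwriting an instruction; that is a direct consequence of the stored-program design.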

Synchronous Machines

● Most machines nowadays are synchronous, that is, they are controlled by a clock.

● Datapaths

Synchronous Machines

● Registers and combinatorial logic blocks alternate along the data-paths through the machine.

● Data advances from one register to the next on each cycle of the global clock: as the clock edge clocks new data into a register, its current output (processed by passing through the combinatorial block) is latched into the next register in the pipeline.

● The registers are master-slave flip-flops which allow the input to be isolated from the output, ensuring a "clean" transfer of the new data into the register.

Synchronous Machines

● In a synchronous machine, the longest propagation delay, t_pd,max, through any combinatorial block must be less than the shortest clock cycle time, t_cyc - otherwise a pipeline hazard will occur and data from a previous stage will be clocked into a register again.

● If t_cyc < t_pd for any operation in any stage of the pipeline, the clock edge will arrive at the register before data has propagated through the combinatorial block.
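The timing constraint can be checked mechanically. The sketch below assumes a list of per-stage propagation delays in nanoseconds; the delay values and the function names are made up for illustration.

```python
# Check the synchronous timing rule: every stage needs t_pd < t_cyc.
# Stage delays below are hypothetical example values in nanoseconds.

def violating_stages(stage_delays_ns, t_cyc_ns):
    """Return indices of stages whose delay does not fit in one clock cycle."""
    return [i for i, t_pd in enumerate(stage_delays_ns) if t_pd >= t_cyc_ns]

def clock_limit_mhz(stage_delays_ns):
    """Upper bound on clock frequency set by the slowest stage (t_pd,max)."""
    t_pd_max = max(stage_delays_ns)
    return 1000.0 / t_pd_max      # a period of 1 ns corresponds to 1000 MHz

delays = [4.0, 8.0, 5.0]
print(violating_stages(delays, 10.0))   # a 10 ns clock: no stage violates
print(violating_stages(delays, 6.0))    # a 6 ns clock: stage 1 is too slow
print(clock_limit_mhz(delays))          # the clock must stay below this
```

The slowest stage alone sets the limit, which is why designers try to balance the delays of all stages.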

Synchronous Machines...

● There may also be feedback loops - in which the output of the current stage is fed back and latched in the same register: a conventional state machine.

● This sort of logic is used to determine the next operation (i.e. next microcode word or next address for branching purposes).

Basic Processor Structure

● We will consider the basic structure of a simple processor.

Basic Processor Structure...

● ALU

– Arithmetic Logic Unit - this circuit takes two operands on the inputs (labelled A and B) and produces a result on the output (labelled Y). The operations will usually include, as a minimum:

● add, subtract

● and, or, not

● shift right, shift left

– ALUs in more complex processors will execute many more instructions.


● Register File

– A set of storage locations (registers) for storing temporary results.

– Early machines had just one register - usually termed an accumulator.

– Modern RISC processors will have at least 32 registers.

● Instruction Register

– The instruction currently being executed by the processor is stored here.


● Control Unit

– The control unit decodes the instruction in the instruction register and sets signals which control the operation of most other units of the processor.

– For example, the operation code (opcode) in the instruction will be used to determine the settings of control signals for the ALU which determine which operation (+, -, ^, v, ~, shift, etc) it performs.


● Clock

– The vast majority of processors are synchronous, that is, they use a clock signal to determine when to capture the next data word and perform an operation on it.

– In a globally synchronous processor, a common clock needs to be routed (connected) to every unit in the processor.


● Program counter

– The program counter holds the memory address of the next instruction to be executed.

– It is updated every instruction cycle to point to the next instruction in the program.

● Memory Address Register

– This register is loaded with the address of the next data word to be fetched from or stored into main memory.


● Address Bus

– This bus is used to transfer addresses to memory and memory-mapped peripherals.

– It is driven by the processor acting as a bus master.

● Data Bus

– This bus carries data to and from the processor, memory and peripherals. It will be driven by the source of data, i.e. processor, memory or peripheral device.


● Multiplexed Bus

– Of necessity, high performance processors provide separate address and data buses.

– To limit device pin counts and bus complexity, some simple processors multiplex address and data onto the same bus: naturally this has an adverse effect on performance.

– When a bus is used for multiple purposes, e.g. address and data, it's called a multiplexed bus.

Executing Instructions

● Let's examine the steps in the execution of a simple memory fetch instruction, e.g. 0x101c: lw $1,0($2)

● This instruction tells the processor to take the address stored in register 2, add 0 to it and load the word found at that address in main memory into register 1.

● As the next instruction to be executed (our lw instruction) is at memory address 0x101c, the program counter contains 0x101c.

Execution Steps

● The control unit sets the multiplexer to drive the PC onto the address bus.

● The memory unit responds by placing 0x8c410000 - the lw $1,0($2) instruction as encoded for a MIPS processor - on the data bus from where it is latched into the instruction register.

● The control unit decodes the instruction, recognises it as a memory load instruction and directs the register file to drive the contents of register 2 onto the A input of the ALU and the value 0 onto the B input. At the same time, it instructs the ALU to add its inputs.

Execution Steps....

● The output from the ALU is latched into the MAR. The controller ensures that this value is directed onto the address bus by setting the multiplexer.

● When the memory responds with the value sought, it is captured on the internal data bus and latched into register 1 of the register file.

● The program counter is now updated to point to the next instruction and the cycle can start again.
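The encoded lw word quoted in the steps above can be reproduced from the standard MIPS I-type field layout (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate). The helper names below are assumptions made for this sketch.

```python
# MIPS I-type layout: opcode[31:26] rs[25:21] rt[20:16] immediate[15:0]
LW_OPCODE = 0x23   # opcode for lw in the MIPS instruction set

def encode_i_type(opcode, rs, rt, imm):
    """Pack the four I-type fields into one 32-bit instruction word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode_i_type(word):
    """Split a 32-bit word back into (opcode, rs, rt, immediate)."""
    return (word >> 26) & 0x3F, (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF

# lw $1,0($2): base register rs = 2, destination rt = 1, offset 0
lw_word = encode_i_type(LW_OPCODE, rs=2, rt=1, imm=0)
print(f"{lw_word:08x}")          # prints 8c410000
print(decode_i_type(lw_word))    # fields recovered by the control unit
```

Decoding is what the control unit does in hardware: the opcode field selects the control signals, and the rs field selects which register is driven onto the A input of the ALU.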

Another Example

● Let's assume the next instruction is an add instruction: 0x1020: add $1,$3,$4

● This instruction tells the processor to add the contents of registers 3 and 4 and place the result in register 1.

● The control unit sets the multiplexer to drive the PC onto the address bus.

● The memory unit responds by placing 0x00640820 - the encoded add $1,$3,$4 instruction - on the data bus from where it is latched into the instruction register.

Another Example...

● The control unit decodes the instruction, recognizes it as an arithmetic instruction and directs the register file to drive the contents of register 3 onto the A input of the ALU and the contents of register 4 onto the B input. At the same time, it instructs the ALU to add its inputs.

● The output from the ALU is latched into the register file at register address 1.

● The program counter is now updated to point to the next instruction.

Key Terms

● von Neumann machine

– A computer which stores its program in memory and steps through instructions in that memory.

● Pipeline

– A sequence of alternating storage elements (registers or latches) and combinatorial blocks, making up a datapath through the computer.

● program counter

– A register or memory location which holds the address of the next instruction to be executed.

● synchronous (system/machine)

– A computer in which instructions and data move from one pipeline stage to the next under control of a single (global) clock.

Performance

● Assume that the whole system is driven by a clock at f MHz. This means that each clock cycle takes t = 1/f microseconds

● Generally, a processor will execute one step every cycle, thus, for a memory load instruction, our simple processor needs:

Performance...

● 1. PC to bus: 1 cycle

● 2. Memory response: t_ac

● 3. Decode and register access: 1 cycle

● 4. ALU operation and latch result to MAR: 1 cycle

● 5. Memory response: t_ac

● 6. Increment PC: overlaps with step 3

● Total = 3 cycles + 2 x t_ac

Performance...

● If the clock runs at 100MHz (a 10ns cycle) and the memory response time is, say, 100ns, then our simple processor needs 3x10+2x100 = 230ns to execute a load instruction.

● For the add instruction, we make a similar table:

– an add instruction requires 3x10+100 = 130ns to execute.

● A store operation will also need more than 200ns, so instructions will require, on average, about 150ns.

Performance Measures

● One commonly used performance measure is MIPS or millions of instructions per second.

● Our simple processor will achieve: 1/(150 x 10^-9) = ~6.7 x 10^6 instructions per second = ~6.7 MIPS

● 100MHz is a very common figure for processors in 1998.

● A MIPS rating of 6.7 is very ordinary.
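The arithmetic in this section can be reproduced with a few lines. The sketch assumes the figures used above (a 10ns clock cycle, a 100ns memory response time and an average of 150ns per instruction); the function names are made up for illustration.

```python
# Instruction timing for the simple processor described above.
# Assumed figures: 10 ns clock cycle (100 MHz), 100 ns memory access time.

def instruction_time_ns(cpu_cycles, memory_accesses, t_cycle_ns=10, t_ac_ns=100):
    """Time for one instruction: internal cycles plus memory responses."""
    return cpu_cycles * t_cycle_ns + memory_accesses * t_ac_ns

def mips(avg_instruction_ns):
    """Millions of instructions per second for a given average time."""
    return 1e3 / avg_instruction_ns   # 1e9 ns per second / 1e6

load_ns = instruction_time_ns(cpu_cycles=3, memory_accesses=2)  # fetch + data
add_ns = instruction_time_ns(cpu_cycles=3, memory_accesses=1)   # fetch only

print(load_ns, add_ns)         # prints 230 130
print(f"{mips(150):.2f}")      # prints 6.67
```

Setting t_ac_ns to a small value shows how strongly the memory response time dominates the result.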

Bottlenecks

● It will be obvious that access to main memory is a major limiting factor in the performance of a processor.

● Management of the memory hierarchy to achieve maximum performance is one of the major challenges for a computer architect.

● Unfortunately, the hardware maxim smaller is faster conflicts with programmers' and users' desires for more and more capabilities and more elaborate user interfaces in their programs - resulting in programs that require megabytes of main memory to run!

Bottlenecks...

● This has led the memory manufacturers to concentrate on density (improving the number of bits stored in a single package) rather than speed.

● They have been remarkably successful in this: the growth in capacity of the standard DRAM chips which form the bulk of any computer's semiconductor memory has matched the increase in speed of processors.

Bottlenecks...

● However the increase in DRAM access speeds has been much more modest - even if we consider recent developments in synchronous RAM and FRAM.

● Another reason for the manufacturers' concentration on density is that a small increase in DRAM access time has a negligible effect on the effective access time, which needs to include overheads for bus protocols.

Bottlenecks...

● Cache memories are the most significant device used to reduce memory overheads.

● However, a host of other techniques such as pipelining, pre-fetching, branch prediction, etc are all used to alleviate the impact of memory fetch times on performance.

ALU

● The Arithmetic and Logic Unit is the 'core' of any processor: it's the unit that performs the calculations.

● A typical ALU will have two input ports (A and B) and a result port (Y).

● It will also have a control input telling it which operation (add, subtract, and, or, etc) to perform and additional outputs for condition codes (carry, overflow, negative, zero result).

ALU...

● ALUs may be simple and perform only a few operations: integer arithmetic (add, subtract), boolean logic (and, or, complement) and shifts (left, right, rotate).

● Such simple ALUs may be found in small 4- and 8-bit processors used in embedded systems.

ALU...

● More complex ALUs will support a wider range of integer operations (multiply and divide), floating point operations (add, subtract, multiply, divide) and even mathematical functions (square root, sine, cosine, log, etc).

● The largest market for general purpose programmable processors is the commercial one, where the commonest arithmetic operations are addition and subtraction. Integer multiply and all other more complex operations were traditionally performed in software - although this takes considerable time (a 32-bit integer multiply needs 32 adds and shifts), the low frequency of these operations meant that their low speed detracted very little from the machine's overall performance.

ALU...

● Thus designers would allocate their valuable silicon area to cache and other devices which had a more direct impact on processor performance in the target marketplace.

● More recently, transistor geometries have shrunk to the point where it's possible to get 10^7 transistors on a single die.

● Thus it becomes feasible to include floating point ALUs on every chip - probably more economic than designing separate processors without the floating point capability.

ALU...

● In fact, some manufacturers will supply otherwise identical processors with and without floating point capability.

● This can be achieved economically by marking chips which had defects only in the region of the floating point unit as "integer-only" processors and selling them at a lower price for the commercial information processing market!

● This has the desirable effect of increasing your semiconductor yield quite significantly - a floating point unit is quite complex and occupies a considerable area of silicon.

ALU...

● In simple processors, the ALU is a large block of combinatorial logic with the A and B operands and the opcode (operation code) as inputs and a result, Y, plus the condition codes as outputs.

● Operands and opcode are applied on one clock edge and the circuit is expected to produce a result before the next clock edge.

● Thus the propagation delay through the ALU determines a minimum clock period and sets an upper limit to the clock frequency.

ALU...

● In advanced processors, the ALU is heavily pipelined to extract higher instruction throughput.

● Faster clock speeds are now possible because complex operations (eg floating point operations) are done in multiple stages: each individual stage is smaller and faster.

Software or Hardware?

● The question of which instructions should be implemented in hardware and which can be left to software continues to occupy designers.

● A high performance processor with 10^7 transistors is very expensive to design - $10^8 is probably a minimum!

● Thus the trend seems to be to place everything on the die.

● However, there is an enormous market for lower capability processors - for embedded systems, primarily.

Note for hackers

● A small "industry" has grown up around the phenomenon of "clock-chipping" - the discovery that a processor will generally run at a frequency somewhat higher than its specification.

● Of necessity, manufacturers are somewhat conservative about the performance of their products and have to specify performance over a certain temperature range.

● For commercial products this is commonly 0°C - 70°C.

Note for hackers...

● A reputable computer manufacturer will also be somewhat conservative, ensuring that the temperature inside the case of his computer normally never rises above, say, 45°C.

● This allows sufficient margin for error in both directions - chips sometimes degrade with age and computers may encounter unusual environmental conditions - so that systems will continue to function to their specifications.

Note for hackers...

● Clock-chippers rely on the fact that propagation delays usually increase with temperature so that a chip specified at x MHz at 70°C may well run at 1.5x at 45°C.

● Needless to say this is a somewhat reckless strategy: your processor may function perfectly well for a few months in winter - and then start failing, initially occasionally, and then more regularly as summer approaches!

Note for hackers...

● The manufacturer may also have allowed for some degradation with age so that a chip specified for 70°C now will still function at x MHz in two years' time.

● Thus a clock-chipped processor may start to fail after a few months at the higher speed - again the failures may be irregular and occasional initially, and start to occur with greater frequency as the effects of age show themselves.

● Restoring the original clock chip may be all that's needed to give you back a functional computer!

Key terms

● condition codes

– a set of bits which store general information about the result of an operation, e.g. result was zero, result was negative, overflow occurred, etc.

Register File

● The Register File is the highest level of the memory hierarchy.

● In a very simple processor, it consists of a single memory location - usually called an accumulator.

● The result of ALU operations was stored here and could be re-used in a subsequent operation or saved into memory.

● In a modern processor, it's considered necessary to have at least 32 registers for integer values and often 32 floating point registers as well.

Register File...

● Thus the register file is a small, addressable memory at the top of the memory hierarchy.

● It's visible to programs (which address registers directly), so that the number and type (integer or floating point) of registers is part of the instruction set architecture (ISA).

Register File...

● Registers are built from fast multi-ported memory cells.

● They must be fast: a register must be able to drive its data onto an internal bus in a single clock cycle.

● They are multi-ported because a register must be able to supply its data to either the A or the B input of the ALU and accept a value to be stored from the internal data bus.

Register File Capacity

● A modern processor will have at least 32 integer registers each capable of storing a word of 32 (or, more recently, 64) bits.

● A processor with floating point capabilities will generally also provide 32 or more floating point registers, each capable of holding a double precision floating point word.

● These registers are used by programs as temporary storage for values which will be needed for calculations.

Register File Capacity...

● Because the registers are "closest" to the processor in terms of access time - able to supply a value within a single clock cycle - an optimising compiler for a high level language will attempt to retain as many frequently used values in the registers as possible.

● Thus the size of the register file is an important factor in the overall speed of programs.

● Earlier processors with fewer than 32 registers (eg early members of the x86 family) severely hampered the ability of the compiler to keep frequently referenced values close to the processor.


● However, it isn't possible to arbitrarily increase the size of the register file. With too many registers:

– the capacitive load of too many cells on the data line will slow its response,

– the resistance of long data lines needed to connect many cells will combine with the capacitive load to slow the response further,

– the number of bits needed to address the registers will result in longer instructions. A typical RISC instruction has three operands, e.g. sub $5, $3, $6, requiring 15 bits with 32 (= 2^5) registers,

– the complexity of the address decoder (and thus its propagation delay time) will increase as the size of the register file increases.
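The instruction-length cost of a larger register file is easy to quantify. The sketch below assumes a three-operand instruction format as in the sub example; the function name is made up for illustration.

```python
# Bits of an instruction consumed by register operand fields.

def register_operand_bits(num_registers, operands=3):
    """Each operand field needs enough bits to name any one register."""
    bits_per_register = (num_registers - 1).bit_length()   # ceil(log2(n))
    return operands * bits_per_register

for n in (32, 64, 128):
    print(n, register_operand_bits(n))
```

With a fixed 32-bit instruction word, every extra register-address bit is taken from the space available for the opcode and immediate fields.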

Ports

● Register files need at least 2 read ports: the ALU has two input ports and it may be necessary to supply both of its inputs from the same register: e.g. add $3, $2, $2

● The value in register 2 is added to itself and the result stored in register 3. So that both operands can be fetched in the same cycle, each register must have two read ports.

● In superscalar processors, it's necessary to have two read ports and a write port for each functional unit, because such processors can issue an instruction to every functional unit in the same clock cycle.

Key terms

● memory hierarchy

– Storage in a processor may be arranged in a hierarchy, with small, fast memories at the "top" of the hierarchy and slower, larger ones at the bottom. Managing this hierarchy effectively is one of the major challenges of computer architecture.

Cache

● Cache is a key to the performance of a modern processor.

● Typically ~25% of instructions reference memory, so that memory access time is a critical factor in performance.

● By effectively reducing the cost of a memory access, caches enable the greater than one instruction/cycle goal for instruction throughput for modern processors.

Cache...

● A further indication of the importance of cache may be gained from noting that out of 6.6 x 10^6 transistors in the MIPS R10000, 4.4 x 10^6 were used for primary caches.

Locality of Reference

● All programs show some locality of reference. This appears to be a universal property of programs - whether commercial, scientific, games, etc.

● Cache exploits this property to improve the access time to data and reduce the cost of accessing main memory. There are two types of locality:

– Temporal Locality

– Spatial Locality

Temporal Locality

● Once a location is referenced, there is a high probability that it will be referenced again in the near future.

● Instructions

– The simplest example of temporal locality is instructions in loops: once the loop is entered, all the instructions in the loop will be referenced again - perhaps many times - before the loop exits.

– However, commonly called subroutines and functions and interrupt handlers (eg the timer interrupt handler) also have the same property: if they are accessed once, then it's very likely that they will be accessed again soon.

Temporal Locality...

● Many types of data exhibit temporal locality: at any point in a program there will tend to be some "hot" data that the program uses or updates many times before going on to another block of data.

● Some examples are:

– Counters

– Look-up Tables

– Accumulation variables

– Stack variables

Spatial Locality

● When an instruction or datum is accessed it is very likely that nearby instructions or data will be accessed soon.

● Instructions

– It's obvious that an instruction stream will exhibit considerable spatial locality. In the absence of jumps, the next instruction to be executed is the one immediately following the current one.

● Data

– Data also shows considerable spatial locality - particularly when arrays or strings are accessed. Programs commonly step through an array from beginning to end, accessing each element of the array sequentially.

Cache operation

● The most basic cache is a direct-mapped cache. It is a small table of fast memory (modern processors will store 16-256kbytes of data in a first-level cache and access a cache word in 2 cycles).

● There are two parts to each entry in the cache, the data and a tag.

Cache operation...

● Suppose memory addresses have p bits (allowing 2^p bytes of memory to be addressed) and the cache can store 2^k words of memory.

● Then the least significant m bits of the address select a byte within a word. (Each word contains 2^m bytes.)

● The next k bits of the address select one of the 2^k entries in the cache.

Cache operation...

● The p-k-m bits of the tag in this entry are compared with the most significant p-k-m bits of the memory address: if they match, then the data "belongs" to the required memory address and is used instead of data from the main memory.

● When the cache tag matches the high bits of the address, we say that we've got a cache hit.

● Thus a request for data from the CPU may be supplied in 2 cycles rather than the 20-100 cycles that is necessary to fetch the same data from the main memory.
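The address split and tag comparison can be sketched as a small simulation. This is a minimal model under stated assumptions: it stores only tags (no data), holds one word per entry, and the class and method names are made up for illustration.

```python
# Direct-mapped cache model: 2^k entries, 2^m bytes per word.
# Only tags are tracked; real caches also hold the data and a valid bit.

class DirectMappedCache:
    def __init__(self, k, m):
        self.k, self.m = k, m
        self.tags = [None] * (1 << k)   # one tag per cache entry
        self.hits = 0
        self.misses = 0

    def split(self, address):
        """Return (tag, index, byte offset) for a byte address."""
        byte = address & ((1 << self.m) - 1)                # low m bits
        index = (address >> self.m) & ((1 << self.k) - 1)   # next k bits
        tag = address >> (self.m + self.k)                  # remaining high bits
        return tag, index, byte

    def access(self, address):
        """Return True on a cache hit; on a miss, install the new tag."""
        tag, index, _ = self.split(address)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False

cache = DirectMappedCache(k=4, m=2)     # 16 entries of 4-byte words
for _ in range(2):                      # walk a 16-word array twice
    for addr in range(0, 64, 4):
        cache.access(addr)
print(cache.hits, cache.misses)         # prints 16 16
```

The first pass misses on every word and the second pass hits on every word, which is temporal locality at work. Two addresses that share the same k index bits but differ in their tags evict each other, which is the main weakness of a direct-mapped organization.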

Basic operations

● Write-Through

● Write-Back

Cache organizations

● Direct Mapped

● Fully Associative

Memory

● Memory Technologies

– Two different technologies can be used to store bits in semiconductor random access memory (RAM): static RAM and dynamic RAM.

Static RAM

● Static RAM cells use 4-6 transistors to store a single bit of data.

● This provides faster access times at the expense of lower bit densities.

● A processor's internal memory (registers and cache) will be fabricated in static RAM.

● Because the industry has focussed on mass-producing dynamic RAM in ever-increasing densities, static RAM is usually considerably more expensive than dynamic RAM: due both to its lower density and the smaller demand (lower production volumes lead to higher costs!).

Static RAM...

● Static RAM is used extensively for second level cache memory, where its speed is needed and a relatively small memory will lead to a significant increase in performance.

● A high-performance 1998 processor will generally have 512kB to 4Mbyte of L2 cache.

Static RAM...

● Since it doesn't need refresh, static RAM's power consumption is much less than dynamic RAM's, so SRAMs will be found in battery-powered systems.

● The absence of refresh circuitry leads to slightly simpler systems, so SRAM will also be found in very small systems, where the simplicity of the circuitry compensates for the cost of the memory devices themselves.

Dynamic RAM

● The bulk of a modern processor's memory is composed of dynamic RAM (DRAM) chips.

● One of the reasons that memory access times have not reduced as dramatically as processor speeds have increased is probably that the memory manufacturers appear to be involved in a race to produce higher and higher capacity chips.

● It seems there is considerable kudos in being first to market with the next generation of chips. Thus density increases have been similar to processor speed increases.

Dynamic RAM...

● A DRAM memory cell uses a single transistor and a capacitor to store a bit of data.

● Devices are reported to be in limited production which provide 256 Mbits of storage in a single device.

● At the same period, CPUs with 10 million transistors in them are considered state-of-the-art.

● Regularity is certainly a major contributor to this apparent discrepancy: a DRAM is about as regular as it is possible to imagine any device could be - a massive 2-D array of bit storage cells.

● In contrast, a CPU has a large amount of irregular control logic.

Dynamic RAM...

● A typical DRAM cell with a single MOSFET and a storage capacitor

The Memory Hierarchy

● Processors use memory of various types to store the data on which a program operates. Data may start in a file on a magnetic disc ("hard disc"), be read into semiconductor memory ("D-RAM") for processing, transformed and written back to disc.

● As part of the transformation process, individual words of data will be transferred to the processor's registers and thence to the ALU.

● Transfers from semiconductor memory to registers will usually pass (transparently to a programmer) through one or more levels of cache.

The Memory Hierarchy...

● Memory designers are able to trade speed for capacity - the fastest memory (registers) having access times below 10ns but the lowest capacity (10s of words) and the slowest (magnetic tape) having access times of several seconds but the highest capacity (10s of Gbytes).

● Thus the memory in a system can usually be arranged in a hierarchy from the slowest (and highest capacity) to the fastest (and lowest capacity).


● Discrepancy between Processor and Bus Frequencies


● This figure shows the ratio between processor clock frequency ("Core Frequency") and bus frequency for the last dozen years.

● It can be seen that the ratio has started to increase dramatically recently - as processor frequencies have continued to increase and bus frequencies have remained fixed or increased at a slower rate.

● The bus frequency determines the rate at which information can be transferred between a processor and its bulk memory.


● The widening gap between processor and bus frequencies shows that effective management of this memory hierarchy is now critical to performance, posing a challenge for architects and programmers alike.

Further Speed!

● Instruction Issue Unit

– With more than one word in a cache line, every request from the instruction fetch unit to the cache is able to fetch several instructions.

– The instruction issue unit will normally try to take advantage of this by considering all of these instructions as possible candidates for the next instruction to be issued to the ALU.

– In the absence of branches, if the next instruction is unable to be issued because its operands are not available, there may be a subsequent instruction which has no dependencies on currently executing instructions which can be safely issued.

Further Speed!..

● This is the simplest form of pre-fetching: the instruction issue unit fetches and considers all the instructions in the cache line as potential candidates for issue to the ALU(s) next.

● In a superscalar processor, the instruction issue unit will try to issue an instruction to each functional unit in every cycle, so it wants as many "candidates" as possible to check for dependencies!

Further Speed!..

● Branching Costs

– Branching is always expensive in a heavily pipelined processor.

– Even unconditional branches require interruption of current instruction fetching operations and restarting the instruction stream from a new memory location - any instructions which may have been fetched from memory and are waiting in instruction buffers or cache lines must be discarded.

Further Speed!..

● Conditional branches can be much more disruptive as they must often wait for operands to be generated or status bits to be set before the direction of the branch (taken or not taken) can be determined: an efficient processor may have fetched and partially executed many instructions beyond the branch before it is known whether to take it or not.

● Thus considerable effort has been devoted to reducing the cost of branches.

Further Speed!..

● Branch Prediction

– Predicting that a branch will be taken turns out to be a good choice most of the time.

– The majority of branches are those at the end of loops which branch back to the start of the loop. If the loop iterates n times, then the branch will be taken (n-1)/n of the time.
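The (n-1)/n figure can be checked by listing the outcomes of the loop-closing branch and scoring a fixed "always taken" prediction against them. The function name is an assumption made for this sketch.

```python
# Accuracy of a static "always predict taken" policy on a loop-closing branch.

def predict_taken_accuracy(n):
    """Loop body runs n times: the branch is taken n-1 times, then falls through."""
    outcomes = [True] * (n - 1) + [False]          # True = branch taken
    correct = sum(1 for taken in outcomes if taken)  # prediction is always 'taken'
    return correct / n

for n in (2, 10, 100):
    print(n, predict_taken_accuracy(n))
```

The policy mispredicts exactly once per loop, on exit, so its accuracy improves as the iteration count grows.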

Further Speed!..

● Branch Target Buffers
