Unit-5: Concurrent Processors
Unit-5: Concurrent Processors
Vector & Multiple Instruction Issue Processors
Concurrent Processors
• Processors that can execute multiple instructions at the same time (concurrently).
• Concurrent processors can make simultaneous access to memory and can execute multiple operations simultaneously.
• These processors execute a single program stream and have a single instruction counter, but instructions are rearranged so that concurrent instruction execution is achieved.
Concurrent Processors
• Processor performance depends on compiler ability, execution resources and memory system design.
• Sophisticated compilers can detect various types of instruction-level parallelism within a program; depending on the type of concurrent processor, the compiler can then restructure the code to exploit the available concurrency.
Concurrent Processors
• There are two main types of concurrent processors:
– Vector Processors: a single vector instruction replaces multiple scalar instructions. These depend on the compiler's ability to vectorize the code, transforming loops into sequences of vector operations.
– Multiple Issue Processors: instructions whose effects are independent of each other are executed concurrently.
Vector Processors
• A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. Supercomputers like the Cray Y-MP are examples of vector processors.
Vector Processors
• Vector processors are based on the premise that the original program either explicitly declares many of the data operands to be vectors or arrays, or implicitly uses loops whose data references can be expressed as references to a vector of operands (achieved by compilers).
• Vector processors achieve considerable speed up in processor performance over that of simple pipelined processors.
Vector Processors
• To achieve the concurrency in operations and resulting speed up in performance, vector processors have extended instruction set and architecture to support the concurrent execution of commonly used vector operations in hardware.
• Directly supporting vector operations in hardware reduces or eliminates the overhead of loop control that would otherwise be necessary.
Vectors and vector arithmetic
• A vector, v, is a list of elements
v = ( v1, v2, v3, ..., vn )
• The length of a vector is defined as the number of elements in that vector; so the length of v is n.
• When mapping a vector to a computer program, we declare the vector as an array of one dimension.
Vectors and vector arithmetic
• Arithmetic operations may be performed on vectors. Two vectors are added by adding corresponding elements:
s = x + y = (x1+y1, x2+y2, ..., xn+yn), where s is the vector representing the final sum and s, x, and y have been declared as arrays of dimension n. This operation is sometimes called element-wise addition. Similarly, the subtraction of two vectors, x - y, is an element-wise operation.
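As a minimal sketch of this element-wise operation (plain Python; the helper name is hypothetical), the addition can be written as a scalar loop, which is exactly what a single vector instruction would replace:

```python
# Element-wise vector addition as an explicit scalar loop.
# On a scalar machine each iteration is a separate add instruction;
# a vector processor replaces the whole loop with one VADD.
def vector_add(x, y):
    assert len(x) == len(y), "vectors must have the same length"
    return [xi + yi for xi, yi in zip(x, y)]

s = vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(s)  # [11.0, 22.0, 33.0]
```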
Vector Computing Architectural Concepts
• A vector computer contains a set of special arithmetic units called pipelines.
• These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of the vector.
• There can be different sets of arithmetic pipelines to perform vector additions and vector multiplications.
The stages of a floating-point operation
• The steps or stages involved in a floating-point addition s = x + y on a sequential machine with normal floating-point arithmetic hardware are:
• [A:] The exponents of the two floating-point numbers to be added are compared to find the number with the smaller exponent (and hence the smaller magnitude).
• [B:] The significand of the number with the smaller magnitude is shifted so that the exponents of the two numbers agree.
• [C:] The significands are added.
The stages of a floating-point operation
• [D:] The result of the addition is normalized.
• [E:] Checks are made to see if any floating-point exceptions occurred during the addition, such as overflow.
• [F:] Rounding occurs.
• Consider an example of such an addition: the numbers to be added are x = 1234.00 and y = -567.8.
Stages of a Floating-point Addition
• Worked example: x = 0.1234E4, y = -0.5678E3.
– Step A (compare exponents): x = 0.1234E4, y = -0.5678E3
– Step B (align significands): x = 0.12340E4, y = -0.05678E4
– Step C (add significands): 0.066620E4
– Step D (normalize): 0.66620E3
– Step E (exception checks): 0.66620E3
– Step F (round): s = 0.6662E3
Stages of a Floating-point Addition
• Consider this scalar addition performed on all the elements of a pair of vectors (arrays) of length n.
• Each of the six stages needs to be executed for every pair of elements.
• If each stage of the execution takes tau units of time, then each addition takes 6*tau units of time (not counting the time required to fetch and decode the instruction itself or to fetch the two operands).
Stages of a Floating-point Addition
• So number of time units required to add all the elements of the two vectors in a serial fashion would be Ts = 6*n*tau.
An Arithmetic Pipeline
• Suppose the addition operation described previously is pipelined; that is, one of the six stages of the addition for a pair of elements is performed at each stage in the pipeline.
• Each stage of the pipeline has a separate arithmetic unit designed for the operation to be performed at that stage.
• It still takes 6*tau units of time to complete the sum of the first pair of elements, but the sum of the next pair is ready in only tau more units of time.
An Arithmetic Pipeline
• So the time, Tp, to do the pipelined addition of two vectors of length n is
Tp = 6*tau + (n-1)*tau = (n + 5)*tau.
• Thus, this pipelined version of addition is faster than the serial version by almost a factor of the number of stages in the pipeline.
• This is an example of what makes vector processing more efficient than scalar processing.
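The two timing formulas above can be compared directly. This sketch (plain Python, hypothetical function names) assumes the 6-stage adder and a unit stage time tau from the text:

```python
def serial_time(n, tau=1.0, stages=6):
    # Ts = stages * n * tau: every stage repeats for each element pair.
    return stages * n * tau

def pipelined_time(n, tau=1.0, stages=6):
    # Tp = stages*tau for the first result, then one result per tau:
    # Tp = (n + stages - 1) * tau, i.e. (n + 5)*tau for 6 stages.
    return (n + stages - 1) * tau

# The speedup approaches the stage count (6 here) as n grows.
for n in (8, 64, 1024):
    print(n, serial_time(n) / pipelined_time(n))
```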
An Arithmetic Pipeline
• The operations at each stage of a pipeline for floating-point multiplication are slightly different from those for addition.
• A multiplication pipeline may even have a different number of stages than an addition pipeline.
• There may also be pipelines for integer operations.
• Some vector architectures provide greater efficiency by allowing the output of one pipeline to be chained directly into another pipeline.
Vector Functional Units
• All modern vector processors use a vector-register-to-vector-register instruction format.
• So all vector processors contain vector register sets.
• A vector register set consists of eight or more registers, with each register containing from 16 to 64 vector elements.
• Each vector element is a floating-point word (64 bits).
Vector Functional Units
• Vector registers access memory with special Load and Store instructions.
• There are separate and independent functional units to manage the load / store function.
• There are separate vector execution units for each instruction class.
• These execution units are segmented ( pipelined ) to support highest possible execution rate.
Primary Storage Facilities
[Diagram: main memory and the data cache feed the vector registers through vector load (VLD) and vector store (VST) paths; the scalar floating-point registers and integer general-purpose registers sit alongside the vector registers.]
Vector Functional Units
• Vector operations involve operations on a large number of operands (a vector of operands); thus pipelining helps in achieving execution at the cycle rate of the system.
• Vector processors also contain scalar floating point registers, integer (General Purpose) registers and scalar functional units.
• Scalar registers and their contents can interface with vector execution units.
Vector Functional Units
• Vectors as a data structure are not well managed by a data cache, so vector load / store operations avoid the data cache and are implemented directly between memory and vector registers.
• Vector load / store operations can be overlapped with other vector instruction executions. But vector loads must complete before they can be used.
Vector Functional Units
• The ability of the processor to concurrently execute multiple (independent) vector instructions is limited by the number of vector register ports and vector execution units.
• Each concurrent load or store requires a vector register port; vector ALU operations require multiple ports.
Vector Instructions / Operations
• Vector instructions are effective in several ways:
– They significantly improve code density.
– They reduce the number of instructions required to execute a program (reducing I-bandwidth).
– They organize data arguments into regular sequences that can be handled efficiently by the hardware.
– They can represent a simple loop construct, thus removing the control overhead for loop execution.
Types of Vector Operations
• (a) Vector arithmetic and logical operations:
– VADD, VSUB, VMPY, VDIV, VAND, VOR, VEOR
VOP VR1, VR2, VR3
VR2 and VR3 are the vector registers which contain the source operands on which the vector operation is performed, and the result is stored in vector register VR1:
VR1 ← VR2 VOP VR3
Types of Vector Operations
• (b) Compare (VCOMP):
VR1 VCOMP VR2 → S
The result of a vector compare is stored in a scalar register S. Bit Si of the scalar register is set to 1 if v1.i > v2.i (comparison of the ith elements).
• Test (VTEST): V1 VTEST CC → S
Bit Si is 1 if v1.i satisfies CC (the condition code specified in the instruction).
Types of Vector Operations
• (c) Accumulate (VACC): ∑(V1 * V2) → S. Accumulates the sum of the products of corresponding elements of two vectors into a scalar register.
• (d) Expand / Compress (VEXP / VCPRS): VR OP S → VR. Takes logical vectors and applies them to elements in vector registers to create a new vector value.
• Vector Load (gather) and Vector Store (scatter) instructions asynchronously access memory.
Types of Vector Operations
• When a vector operation is done on two vector registers of unequal length, we need some convention for producing result.
• All entries in a vector register which are not explicitly stored are given an invalid-content symbol NaN, and any operation using NaN will also produce NaN regardless of the contents of the other register.
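A small sketch of this NaN-padding convention (plain Python; the function name and fixed register length are hypothetical):

```python
import math

def vop(op, v1, v2, vlen=8):
    # Entries beyond a vector's stored length behave as NaN, and NaN
    # poisons any result computed from it, matching the convention above.
    pad = lambda v: v + [math.nan] * (vlen - len(v))
    return [math.nan if (math.isnan(a) or math.isnan(b)) else op(a, b)
            for a, b in zip(pad(v1), pad(v2))]

r = vop(lambda a, b: a + b, [1.0, 2.0, 3.0], [10.0, 20.0], vlen=4)
# r[0], r[1] hold 11.0 and 22.0; r[2], r[3] are NaN.
```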
Vector Processor Implementation
• Vector processor implementation requires a considerable amount of additional control and hardware.
• Vector registers used to store vector operands generally bypass the data cache.
• Data cache is used solely to store scalar values.
• Since a particular value stored in a vector may be aliased as a scalar and stored in the data cache, all vector references to memory must be checked against the contents of the data cache.
Vector Processor Implementation
• If there is a hit, the current value contained in the data cache is invalidated and a memory update is forced.
• Additional H/W or S/W control is required to ensure that scalar references from data cache to memory do not inadvertently reference a value contained in vector register.
• Earlier vector processors used a memory-to-memory instruction format, but due to severe memory congestion and contention problems most recent vector processors use vector registers to load / store vector operands.
Vector Processor Implementation
• Vector registers generally consist of a set of eight registers, each containing from 16 to 64 entries of the size of a floating-point word.
• The arithmetic pipeline may be shared with the scalar part of the processor.
• Under some conditions, it is possible to execute more than one arithmetic operation per cycle.
• The result of one arithmetic operation can be directly used as an operand in a subsequent vector instruction.
Vector Processor Implementation
• This is called chaining.
• For the two instructions:
VADD VR3, VR1, VR2
VMPY VR5, VR3, VR4
[Diagram: chaining of the VADD and VMPY pipelines. As each element sum (e.g. VR1.2 + VR2.2) emerges from the VADD pipeline into VR3, it is fed directly into the VMPY pipeline (e.g. VR3.1 * VR4.1) while the next pair (VR1.3 + VR2.3) enters the adder.]
Vector Processor Implementation
• The illustrated chained ADD-MPY, with each functional unit having 4 stages, saves 64 cycles.
• If unchained, it would take 4 (startup) + 64 (elements/VR) = 68 cycles for each function, a total of 136 cycles.
• With chaining this is reduced to 4 (add startup) + 4 (multiply startup) + 64 (elements/VR) = 72 cycles, a saving of 64 cycles.
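The cycle counts above can be reproduced with a tiny model (a sketch; the 4-stage startup and 64-element registers are the figures from the text):

```python
def unchained_cycles(n_elems=64, startup=4, n_ops=2):
    # Each operation pays its own startup and streams all elements.
    return n_ops * (startup + n_elems)

def chained_cycles(n_elems=64, startup=4, n_ops=2):
    # The startups overlap once in sequence; elements then stream
    # through both pipelines back to back.
    return n_ops * startup + n_elems

print(unchained_cycles(), chained_cycles())   # 136 72
print(unchained_cycles() - chained_cycles())  # 64 cycles saved
```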
Vector Processor Implementation
• Another important aspect of vector implementation is management of references to memory.
• Memory must have sufficient bandwidth to support a minimum of two, and preferably three, references per cycle (two reads and one write).
• This bandwidth allows for two vector reads and one vector write to be initiated and executed concurrently with execution of a vector arithmetic operation.
Vector Processor Implementation
• Major data paths in a generic vector processor are shown in Figure 7.14 on page no 438 of computer architecture by Michael Flynn.
Vector Memory
• The simple low-order interleaving used in normal pipelined processors is not suitable for vector processors.
• Access in the case of vectors is non-sequential but systematic; thus if the array dimension or stride (the address distance between adjacent elements) is the same as the interleaving factor, all references will concentrate on the same module.
Vector Memory
• It is quite common for these strides to be of the form 2^k or other even dimensions.
• So vector memory designs use address remapping and a prime number of memory modules.
• Hashed addressing is a technique for dispersing addresses.
• Hashing is a strict 1:1 mapping of the bits in X to form a new address X' based on simple manipulations of the bits in X.
Vector Memory
• A memory system used in vector / matrix accessing consists of the following units:
– Address hasher
– 2^k + 1 memory modules
– Module mapper
• This may add certain overhead and extra cycles to memory access, but since the purpose of the memory is to access vectors, this can be overlapped in most cases.
Vector Memory
[Diagram: the X address passes through an address hasher to produce X'; the module index is computed as X' mod (2^k + 1), selecting one of the 2^k + 1 modules, while X' / 2^k gives the address within the module; data is returned from the selected module.]
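The benefit of a prime module count can be seen in a few lines (a sketch: stride-8 accesses against 8 modules versus 17, i.e. 2^4 + 1):

```python
def modules_touched(stride, n_refs, n_modules):
    # Distinct modules hit by n_refs consecutive strided references.
    return len({(i * stride) % n_modules for i in range(n_refs)})

# Power-of-two interleaving: a stride equal to the module count
# concentrates every reference on one module.
print(modules_touched(stride=8, n_refs=64, n_modules=8))    # 1
# A prime module count (17 = 2^4 + 1) spreads the same access
# pattern across all modules, since 8 and 17 are coprime.
print(modules_touched(stride=8, n_refs=64, n_modules=17))   # 17
```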
Modeling Vector Memory Performance
• Vector memory is designed for multiple simultaneous requests to memory.
• Operand fetching and storing is overlapped with vector execution.
• Three concurrent operand accesses to memory are a common target, but the increased cost of the memory system may limit this to two.
• Chaining may require even more accesses.
• Another issue is the degree of bypassing, or out-of-order requests, that a source can make to the memory system.
Modeling Vector Memory Performance
• In case of a conflict, i.e. a request being directed to a busy module, the source can continue to make subsequent requests only if the unserviced requests are held in a buffer.
• Assume each of the s access ports to memory has a buffer of size TBF/s which holds requests delayed by a conflict.
• For each source, degree of bypassing is defined as the allowable number of requests waiting before stalling of subsequent requests occurs.
Modeling Vector Memory Performance
• If Qc is the expected number of denied requests per module and m is the number of modules, then the buffer must be large enough to hold the denied requests:
Buffer = TBF > m·Qc
• If n is the total number of requests made and B is the bandwidth achieved, then
m·Qc = n - B (denied requests)
Gamma (γ) – Binomial Model
• Assume that each vector source issues a request each cycle (δ = 1) and each physical requestor has the same buffer capacity and characteristics.
• If the vector processor can make s requests per cycle and there are t cycles per Tc, then
Total requests per Tc = t·s = n
This is the same as n requests per Tc in the simple binomial model.
Gamma (γ) – Binomial Model
• If γ is the mean queue size of bypassed requests awaiting service, then each of the γ buffered requests also makes a request each cycle.
• From a memory modeling point of view this is equivalent to the buffer requesting service each cycle until the module is free.
Total requests per Tc = t·s + t·s·γ = t·s(1 + γ) = n(1 + γ)
Gamma (γ) – Binomial Model
• Substituting this value into the simple binomial equation:
B(m, n, γ) = m + n(1+γ) - 1/2 - √[ (m + n(1+γ) - 1/2)² - 2nm(1+γ) ]
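This bandwidth expression is easy to evaluate numerically (a sketch of the model; symbols as in the text, function name hypothetical):

```python
import math

def bandwidth(m, n, gamma=0.0):
    # γ-binomial model: the effective request count per Tc is n(1 + γ).
    a = m + n * (1.0 + gamma) - 0.5
    return a - math.sqrt(a * a - 2.0 * n * m * (1.0 + gamma))

# With no bypassing, some requests are denied (B < n) ...
print(bandwidth(8, 4))
# ... while bypass buffering raises the achieved bandwidth toward n.
print(bandwidth(8, 4, gamma=0.375))
```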
Calculating γ opt
• γ is the mean expected bypassed-request queue per source.
• If we continue to increase the number of bypass buffer registers, we can achieve a γopt which totally eliminates contention.
• No contention occurs when B = n, i.e. B(m, n, γ) = n.
• This occurs when ρa = ρ = n/m.
• Since the MB/D/1 queue size is given by
Q = (ρa² - p·ρa) / (2(1 - ρa)) = (n(1+γ) - B) / m
Calculating γopt
• Substituting ρa = ρ = n/m and p = 1/m we get:
Q = (n² - n) / (2(m² - nm)) = (n/m)(n - 1) / (2m - 2n)
• Since Q = (n(1+γ) - B) / m, we have mQ = n(1+γ) - B.
• For γopt, (n - B) = 0, so
γopt = (m/n)·Q = (n - 1) / (2m - 2n)
• The mean total buffer size is TBF = n·γopt.
• To avoid overflow the buffer may be made considerably larger, perhaps 2 × TBF.
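Under the stated assumptions, γopt and the buffer sizing follow directly (a sketch; m modules, n requests per Tc, names hypothetical):

```python
def gamma_opt(m, n):
    # γopt = (n - 1) / (2m - 2n): the bypass depth removing contention.
    return (n - 1) / (2 * m - 2 * n)

def total_buffer(m, n):
    # Mean total buffer TBF = n * γopt; sizing to about 2 x TBF
    # guards against overflow.
    return n * gamma_opt(m, n)

print(gamma_opt(8, 4))        # 0.375
print(2 * total_buffer(8, 4))
```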
Vector Processor Speedup: Performance Relative to Pipelined Processors
Vector processor performance depends on:
1. The amount of the program that can be expressed in vectorizable form.
2. Vector startup costs (length of the pipeline).
3. The number of execution units and support for chaining.
4. The number of operands that can be simultaneously accessed / stored.
5. The number of vector registers.
Vector Processor Speedup
• The overall effect of speedup possible by the vector processor over the pipelined processor is generally limited to 4.
• This assumes concurrent execution of 2 load, 1 store and 1 arithmetic operation.
• If chaining is allowed and the memory system can accommodate an additional concurrent load instruction, the speedup can extend to <6 (3 LDs, 2 arithmetic, and 1 ST).
Vector Processor Speedup
• In practice such speedups are not achievable for the following reasons:
– To sustain such a high speedup the program must consist of purely vector code.
– Due to address arithmetic and control operations, some part of the program is not vectorizable.
– Suppose a maximum speedup of 4 is achievable for 75% of operations; then the overall speedup is
Sp = T1 / (%vector · (T1/4) + %nonvector · T1) ≈ 2.3
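The 2.3 figure is an Amdahl-style computation; a minimal sketch:

```python
def overall_speedup(vector_fraction, vector_speedup):
    # Sp = T1 / (f*(T1/s) + (1 - f)*T1) = 1 / (f/s + (1 - f))
    return 1.0 / (vector_fraction / vector_speedup + (1.0 - vector_fraction))

# 75% of operations sped up 4x yields about 2.3 overall.
print(round(overall_speedup(0.75, 4.0), 1))  # 2.3
```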
Vector Processor Speedup
– The depth of the pipeline also limits the effective speedup.
• Two parameters, R∞ and n1/2, are used to characterize vector processor performance.
• R∞ = 1/Δt is the maximum vector arithmetic execution rate that the processor can sustain in the absence of chaining (Δt is the cycle time of the vector pipeline).
• If c functional units can be chained, the maximum performance is c · R∞ (usually c = 2).
• n1/2 is a parameter that measures the depth of the pipeline.
Vector Processor Speedup
– Assume n is the vector size (number of elements); then n1/2 is the length of vector operand that achieves exactly one half of maximum performance.
– Usually a vector processor cannot start a new instruction until a previous instruction is finished using the pipeline (pipeline flushing).
– This delay and startup time correspond to the depth of the pipeline and are represented by n1/2.
• Vector efficiency R/R∞ = 1 / (1 + n1/2/n).
• Vector efficiency is very low if the vector length n is significantly less than n1/2.
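The efficiency formula R/R∞ = 1 / (1 + n1/2/n) can be checked for a few vector lengths (a sketch; the n1/2 value of 32 is just an illustrative assumption):

```python
def vector_efficiency(n, n_half):
    # Efficiency is exactly 1/2 when the vector length equals n1/2
    # and climbs toward 1 for vectors much longer than n1/2.
    return 1.0 / (1.0 + n_half / n)

for n in (4, 32, 256):
    print(n, vector_efficiency(n, n_half=32))
```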
Vector Processor Speedup
– Memory contention under heavy referencing can add to the delay.
– This contention causes vector registers to be loaded and stored more slowly than vector instructions are executed; the waiting time between two vector operations adds to the delay.
– So generally, for memory-limited processors,
n1/2 = max [ vector startup cycles, vector memory overhead cycles ]
• Vector processor speedup is thus limited below its maximum value of 4 or 5, but for code with significant vector content a speedup of 2-3 is generally achievable.
Multiple-Issue Machines
• These machines evaluate dependencies among groups of instructions, and groups found to be independent are simultaneously dispatched to multiple execution units.
• There are two broad classes of multiple-issue machines:
– Statically scheduled: the detection process is done by the compiler.
– Dynamically scheduled: detection of independent instructions is done by hardware in the decoder at run time.
Statically Scheduled Machines
• Sophisticated compilers search the code at compile time and instructions found independent in their effect are assembled into instruction packets, which are decoded and executed at run time.
• Statically scheduled processors must have some additional information either implicitly or explicitly indicating instruction packet boundaries.
Statically Scheduled Machines
• Early statically scheduled machines include the so-called VLIW (very long instruction word) machines.
• These machines use an instruction word that consists of 8 to 10 instruction fragments.
• Each fragment controls a designated execution unit.
• To accommodate multiple instruction fragments the instruction word is typically over 200 bits long.
Statically Scheduled Machines
• The register set is extensively multi ported to support simultaneous access to multiple execution units.
• To avoid the performance limitation caused by the occurrence of branch instructions, a novel compiler technology called trace scheduling is used.
• In trace scheduling branches are predicted where possible and predicted path is incorporated into a large basic block.
Statically Scheduled Machines
• If an unanticipated (or unpredicted) branch occurs during execution of the code, the proper result is fixed up at the end of the basic block for use by the target basic block.
Dynamically Scheduled Machines
• In dynamically scheduled machines, detection of independent instructions is done by hardware at run time.
• The detection may also be done at compile time and code suitably arranged to optimize execution patterns.
• At run time the search for concurrent instructions is restricted to the localities of the last executing instruction.
Superscalar Machines
• The maximum program speedup available in multiple-issue machines largely depends on sophisticated compiler technology.
• The potential speedup available from the Multiflow compiler using trace scheduling is generally less than 3.
• Recent multiple-issue machines, having more modest objectives, are called superscalar machines.
• The ability to issue multiple instructions in a single cycle is referred to as a superscalar implementation.
Major Data Paths in a Generic M.I. Machine
(Refer to fig 7.28 on page no 458.)
• Simultaneous access to data as required by a VLIW processor mandates extensive use of register ports, which can become a bottleneck.
• Dynamic multiple-issue processors use multiple buses connecting the register sets and functional units, and each bus services multiple functional units.
• This may limit the maximum degree of concurrency.
Comparing Vector and Multiple Issue Processors
• The goal of any processor design is to provide cost-effective computation across a range of applications.
• So we should compare the two technologies based on the following two factors:
– Cost
– Performance
Cost Comparison
• While comparing cost we must approximate the area used by both technologies in the form of additional / required units.
• The cost of the execution units is about the same for both (for the same maximum performance).
• A major difference lies in the storage hierarchy.
• Both rely heavily on multi-ported registers.
Cost Comparison
• These registers occupy a significant amount of area. If p is the number of ports, the area required is
Area = (no. of registers + 3p)(bits per register + 3p) rbe
• Most vector processors have 8 sets of 64-element registers, with each element being 64 bits in size.
• Each vector register is dual-ported (a read port and a write port). Since registers are sequentially accessed, each port can be shared by all elements in the register set.
Cost Comparison
• There is an additional switching overhead to switch each of the vector registers to each of the p external ports:
Switch area = 2 · (bits per register) · p · (no. of registers)
• So the area used by the register sets in a vector processor (supporting 8 ports) is
Area = 8 × [(64 + 6)(64 + 6)] = 39,200 rbe
Switch area = 2 × 64 × 8 × 8 = 8,192 rbe
Cost Comparison
• A multiple-issue processor with 32 registers, each having 64 bits and supporting 8 ports, will require
Area = (32 + 3(8))(64 + 3(8)) = 4,928 rbe
• So vector processors use almost 42,464 rbe of extra area compared to M.I. processors.
• This extra area corresponds to about 70,800 cache bits (0.6 rbe/bit), i.e. approximately 8 KB of data cache.
• Vector processors use a small data cache.
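The rbe figures above follow from the register-area formula; a sketch reproducing them (the 0.6 rbe/bit cache density is the figure from the text):

```python
def register_area(n_reg, bits, ports):
    # Area = (no_of_registers + 3p)(bits_per_register + 3p) rbe
    return (n_reg + 3 * ports) * (bits + 3 * ports)

vec_area = 8 * register_area(64, 64, 2)   # 8 sets of 64 x 64-bit, dual ported
switch_area = 2 * 64 * 8 * 8              # switching 8 registers to 8 ports
mi_area = register_area(32, 64, 8)        # 32 x 64-bit registers, 8 ports
extra = vec_area + switch_area - mi_area
print(vec_area, switch_area, mi_area, extra)  # 39200 8192 4928 42464
cache_bits = extra / 0.6                  # ~70,800 bits, roughly 8 KB
```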
Cost Comparison
• Multiple-issue machines require a larger data cache to ensure high performance.
• Vector processors require support hardware for managing access to the memory system.
• Also, a high degree of interleaving is required in the memory system to support the processor bandwidth.
• M.I. machines must support 4-6 reads and 2-3 writes per cycle. This increases the area required by buses between arithmetic units and registers.
Cost Comparison
• M.I. machines must access and hold multiple instructions each cycle from the I-cache.
• This increases the size of the I-fetch path between the I-cache and the instruction decoder / instruction register.
• At the instruction decoder, multiple instructions must be decoded simultaneously and detection of instruction independence must be performed.
• The main difference depends heavily on the size of the data cache required by M.I. machines and the cost of the memory interleaving required by vector processors.
Performance Comparison
• The performance of vector processors depends primarily on two factors:
– The percentage of code that is vectorizable.
– The average length of vectors.
• We know that n1/2, the vector size at which the vector processor achieves approximately half its asymptotic performance, is roughly the same as the depth of the arithmetic plus memory access pipeline.
• For short vectors the data cache is sufficient in M.I. machines, so for short vectors M.I. processors would perform better than an equivalent vector processor.
Performance Comparison
• As vectors get longer, the performance of the M.I. machine becomes much more dependent on the size of the data cache, and the n1/2 of the vector processor improves.
• So for long vectors, performance would be better in the case of the vector processor.
• The actual difference depends largely on sophistication in compiler technology.
• The compiler can recognize the occurrence of short vectors and treat that portion of code as if it were scalar code.
Shared Memory Multiprocessors
• Beyond the instruction-level concurrency used in vector and multiple-issue processors, there are processing ensembles that consist of n identical processors that share a common memory.
• Multiprocessors are usually designed for at least one of two reasons:
– Fault tolerance
– Program speedup
Shared Memory Multiprocessors
• Fault Tolerant Systems: n identical processors ensure that failure of one processor does not affect the ability of the multiprocessor to continue with program execution.
• These multiprocessors are called high-availability or high-integrity systems.
• These systems may not provide any speedup over a single-processor system.
Shared Memory Multiprocessors
• Program Speed up: Most multiprocessors are designed with main objective of improving program speed up over that of single processor.
• Yet fault tolerance is still an issue as no design for speedup ought to come at the expense of fault tolerance.
• It is generally not acceptable for the whole multiprocessor system to fail if any one of its processors fails.
Shared Memory Multiprocessors
• Basic Issues: Three basic issues are associated with the design of multiprocessor systems:
– Partitioning
– Scheduling of tasks
– Communication and synchronization
Partitioning
• This is the process of dividing a program into tasks, each of which can be assigned to an individual processor for execution.
• The partitioning process occurs at compile time.
• The goal of the partitioning process is to uncover the maximum amount of parallelism possible within certain obvious machine limitations.
• Program overhead, the added time a task takes to be loaded into a processor, defines the minimum task size produced by partitioning a program.
Partitioning
• The program overhead time, which is configuration and scheduling dependent, limits the maximum degree of parallelism among executing subtasks.
• If the amount of parallelism is increased by using finer and finer grain task sizes, the amount of overhead time increases accordingly.
• If the available parallelism exceeds the known number of processors, or several shorter tasks share the same instruction / data working set, clustering is used to group subtasks into a single assignable task.
Partitioning
• The detection of parallelism is done by one of three methods:
– Explicit statement of concurrency in a high-level language: programmers delineate the boundaries among tasks that can be executed in parallel.
– Programmer hints in source statements, which compilers can use or ignore.
– Implicit parallelism: sophisticated compilers can detect parallelism in normal serial code and transform the program code for execution on multiprocessors.
Scheduling
• Scheduling is done both statically (at compile time) and dynamically (at run time).
• Static scheduling is not sufficient to ensure optimum speedup or even fault tolerance.
• Processor availability is difficult to predict and may vary from run to run.
• Run-time scheduling has the advantage of handling changing system environments and program structures.
Scheduling
• Run-time overhead is the prime disadvantage of run-time scheduling.
• It is desirable from a fault tolerance point of view that run-time scheduling can be initiated by any processor and that the process itself is distributed across all available processors.
• The major overheads in run-time scheduling include:
– Information gathering (about the dynamic program state and the state of the system)
– Scheduling
Scheduling
– Dynamic execution control: dynamic clustering or process creation at run time.
– Dynamic data management: assignment of tasks to processors in such a way as to minimize the memory overhead and delay in accessing the data.
Synchronization and Coherency
• In a multiprocessor configuration having high degree of task concurrency, the tasks must follow an explicit order and communication between active tasks must be performed in an orderly way.
• The value passing between different tasks executing on different processors is performed by synchronization primitives or semaphores.
Synchronization and Coherency
• Synchronization is the means to ensure that multiple processors have a coherent or similar view of critical values in memory.
• Memory coherence is the property of memory that ensures that a read operation returns the value stored by the latest write to the same address.
• In complex systems of multiple processors, the program order of memory-related operations may differ from the order in which the operations are actually executed.
Synchronization and Coherency
• In multiprocessors there are different degrees to which ordering may be enforced:
– Sequential consistency: all memory accesses execute atomically in some total order, and all memory accesses of a single process appear to execute in (its) program order.
– Processor consistency: the issuing processor sees a load followed by a store in program order, but a store followed by a load is not necessarily performed in program order.
Synchronization and Coherency
– Weak Consistency: Synch operations are sequentially consistent, other memory operations can occur in any order. All synchronization operations are performed before any subsequent memory operation is performed and all pending memory operations are performed before any synchronization operation is performed.
– Release Consistency: Synch operations are split into acquire (lock) and release (unlock) and these operations are processor consistent.
Means of Synchronization
• The simplest means is a single synchronization variable which is accessed by a read-modify-write function.
• Before access can be made to a shared variable, this location is referenced.
• If this location is '0', then a write can be made to the shared variable and the contents of the synchronization location are changed to '1'. (This indicates that the producer process has produced data for the consumer process to read.)
Means of Synchronization
• Similarly, read access to this shared variable is allowed only when the synchronization variable is '1' (that is, only after data has been written by the producer process). After the read is completed, the synchronization location is changed to '0', allowing the producer process to write to the shared variable.
• If there were multiple consumer processes, we would need additional bits in the synchronization location to coordinate them.
Means of Synchronization
• We can designate certain bit combinations that give exclusive rights to the producer, or to either of the consumers, to change (write) or use (read) the shared variable.
0 0  Producer changing data; consumers locked out
0 1  Data available to any consumer; producer locked out
1 1  A consumer is acquiring data; producer and other consumers are locked out
1 0  Buffer empty; consumers locked out; producer allowed to change data
The access to this region in memory is locked and unlocked by certain bit patterns called semaphores.
Means of Synchronization
• Another synchronization mechanism is the test and set instruction.
• The test and set instruction tests a value in memory and then sets that value to ‘1’.
• If the shared variable is not yet available, the accessing process enters a loop that continues to access the synchronization location until it indicates that the shared variable is set.
• This process of looping on a shared variable with a test and set type of instruction is called a spin lock.
Means of Synchronization
• Spin locks create significant additional memory traffic.
• To reduce this traffic a suspend lock can be implemented.
• In a suspend lock, the status is recorded by the equivalent of a test and set instruction, and then the requesting process goes idle, awaiting an interrupt from the producing process.
• The fetch and add primitive combines multiple test and set instructions into a single instruction that accesses a synchronization variable and returns a value.
Types of Shared Memory Multiprocessors
• The variety in multiprocessors results from the way memory is shared between processors. This can be as follows:
– Shared data cache, shared memory
– Separate data caches, but shared bus to shared memory
– Separate data caches with separate buses to shared memory
– Separate processors and separate memory modules interconnected with a multistage interconnection network
Types of Shared Memory Multiprocessors
• The simpler and more limited the multiprocessor configuration, the easier it is to provide synchronization, communication, and memory coherency.
• As processor speed and number of processors increase, shared data caches and buses run out of bandwidth and become a bottleneck.
• Replicating caches and/or buses to provide additional bandwidth increases coherency traffic.
Memory Coherence in Shared Memory Multiprocessor
• Each node in a multiprocessor system possesses a local cache.
• Since the address space of the processors overlaps, different processors can be holding (caching) the same memory segment at the same time.
• Further, each processor may be modifying these cached locations simultaneously.
• The cache coherency problem is to ensure that all caches contain the same, most up-to-date copy of the data.
Memory Coherence in Shared Memory Multiprocessor
• The protocol that maintains the consistency of data in all the local caches is called the cache coherency protocol.
• The most important distinction among the various protocols is the way they distribute information about writes to memory to the other processors in the system.
– Snoopy Protocol: a write is broadcast to all processors in the system.
– Directory Based: a write is only sent to those processors that are known to possess a copy.
Snoopy Protocol
• Snoopy protocols are simpler, as ownership information need not be stored in the system.
• These protocols create a great deal of additional bus or network traffic while broadcasting to all the processors.
• Snoopy protocols are bus based and are used in shared-bus multiprocessors.
• Snoopy protocols assume that all of the processors are aware of, and receive, all bus transactions (snoop on the bus).
Snoopy Protocol
• Snoopy protocols are further classified based on the type of action the local processor must take when an altered line is recognized.
• There are two types of actions:
– Invalidate: all copies in other caches are invalidated before changes are made to the data in a particular line. The invalidate signal is received from the bus, and all caches which possess the same cache line invalidate their copies.
Snoopy Protocol
– Update: writes are broadcast on the bus, and caches sharing the same line snoop for the data on the bus and update the contents and state of their cache lines.
• Another distinction among protocols is based on whether main memory is updated when a write occurs or only the cache contents are updated.
Snoopy Protocol
• There are four variations based upon the invalidate strategy:
– Write-invalidate
– Synapse
– Illinois
– Berkeley
Snoopy Protocol
• There are two variations based on the update strategy:
– Firefly
– Dragon
Directory Based Protocols
• These protocols maintain the state information of the cache lines at a single location called directory.
• Only the caches that are listed in the directory, and are thus known to possess a copy of the newly altered line, are sent invalidate or write-update information.
• Since there is no need to broadcast to all caches, in contrast to snoopy protocols, directory based protocols can scale better.
Directory Based Protocols
• As the number of processor nodes and the number of cache lines increase, the size of the directory can become very large.
• In addition to the distinction based on the action taken by the local processor (invalidate or update cache lines), there is an important distinction among directory based protocols depending on directory placement:
– Central Directory
– Distributed Directory
Directory Based Protocols
• Central Directory: the directory is placed in, or associated with, the memory system. Each line in memory has an entry defining the users of the line.
[Figure: a central directory with one presence-bit vector per memory line; the bits (e.g. 1 0 1 0) mark which of Cache 3, Cache 2, Cache 1, and Cache 0 hold a copy of the line.]
Directory Based Protocols
• Distributed Directory: memory has only one pointer, which identifies the cache that last requested the line. A subsequent request for that line is then referred to that cache, and the requestor's ID is placed at the head of the list.
[Figure: a distributed directory; memory holds a single pointer to the head of a linked list of sharing caches (e.g. 3 -> 0 -> 2 -> T, where T terminates the list).]
Directory Based Protocols
• There are singly and doubly linked versions called the Scalable Distributed Directory (SDD) and Scalable Coherent Interface (SCI) respectively.
• The invalidate protocols differ in the time it takes to maintain consistency across the network of processors.
• With the central directory, the invalidate traffic can be forwarded in parallel to all using processors.
Directory Based Protocols
• In the distributed directory this process is done serially through the use of the lists.
• In update protocols using a central directory, the directory sends a count to the writer (indicating the number of users).
• Processors having a copy of the line are informed that the line is being updated, along with the write data.
• The processors, after updating, send an acknowledgement to the writing cache. (Total acks = count)
Directory Based Protocols
• In a distributed directory, the writing processor becomes the head of the list; then the update signal, together with the new data, is forwarded down the list.
THE END