Unit-5: Concurrent Processors
Unit-5: Concurrent Processors
Vector & Multiple Instruction Issue Processors
Concurrent Processors
• Processors that can execute multiple instructions at the same time (concurrently).
• Concurrent processors can make simultaneous access to memory and can execute multiple operations simultaneously.
• These processors execute a single program stream and have a single instruction counter, but instructions are rearranged so that concurrent instruction execution is achieved.
Concurrent Processors
• Processor performance depends on compiler ability, execution resources and memory system design.
• Sophisticated compilers can detect various types of instruction-level parallelism within a program; depending on the type of concurrent processor, the compiler can then restructure the code to exploit the available concurrency.
Concurrent Processors
• There are two main types of concurrent processors:
– Vector Processors: a single vector instruction replaces multiple scalar instructions. These depend on the compiler's ability to vectorize the code, transforming loops into sequences of vector operations.
– Multiple Issue Processors: instructions whose effects are independent of each other are executed concurrently.
Vector Processors
• A vector computer or vector processor is a machine designed to efficiently handle arithmetic operations on elements of arrays, called vectors. Such machines are especially useful in high-performance scientific computing, where matrix and vector arithmetic are quite common. Supercomputers like the Cray Y-MP are examples of vector processors.
Vector Processors
• Vector processors are based on the premise that the original program either explicitly declares many of the data operands to be vectors or arrays, or implicitly uses loops whose data references can be expressed as references to a vector of operands (achieved by compilers).
• Vector processors achieve considerable speed up in processor performance over that of simple pipelined processors.
Vector Processors
• To achieve the concurrency in operations and resulting speed up in performance, vector processors have extended instruction set and architecture to support the concurrent execution of commonly used vector operations in hardware.
• Directly supporting vector operations in hardware reduces or eliminates the overhead of loop control that would otherwise be necessary.
Vectors and vector arithmetic
• A vector, v, is a list of elements
v = ( v1, v2, v3, ..., vn )
• The length of a vector is defined as the number of elements in that vector; so the length of v is n.
• When mapping a vector to a computer program, we declare the vector as an array of one dimension.
Vectors and vector arithmetic
• Arithmetic operations may be performed on vectors. Two vectors are added by adding corresponding elements:
s = x + y = (x1+y1, x2+y2, ..., xn+yn), where s is the vector representing the final sum and s, x, and y have been declared as arrays of dimension n. This operation is sometimes called element-wise addition. Similarly, the subtraction of two vectors, x - y, is an element-wise operation.
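As a minimal sketch of this element-wise operation (plain Python; the helper name is hypothetical), the addition can be written as a scalar loop, which is exactly what a single vector instruction would replace:

```python
# Element-wise vector addition as an explicit scalar loop.
# On a scalar machine each iteration is a separate add instruction;
# a vector processor replaces the whole loop with one VADD.
def vector_add(x, y):
    assert len(x) == len(y), "vectors must have the same length"
    return [xi + yi for xi, yi in zip(x, y)]

s = vector_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(s)  # [11.0, 22.0, 33.0]
```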
Vector Computing Architectural Concepts
• A vector computer contains a set of special arithmetic units called pipelines.
• These pipelines overlap the execution of the different parts of an arithmetic operation on the elements of the vector.
• There can be different sets of arithmetic pipelines to perform vector additions and vector multiplications.
The stages of a floating-point operation
• The steps or stages involved in a floating-point addition s = x + y on a sequential machine with normal floating-point arithmetic hardware are:
• [A:] The exponents of the two floating-point numbers to be added are compared to find the number with the smaller exponent (and hence the smaller magnitude).
• [B:] The significand of the number with the smaller magnitude is shifted so that the exponents of the two numbers agree.
• [C:] The significands are added.
The stages of a floating-point operation
• [D:] The result of the addition is normalized.
• [E:] Checks are made to see if any floating-point exceptions occurred during the addition, such as overflow.
• [F:] Rounding occurs.
• Consider an example of such an addition: the numbers to be added are x = 1234.00 and y = -567.8.
Stages of a Floating-point Addition
• Worked example: x = 0.1234E4, y = -0.5678E3.
– Step A (compare exponents): x = 0.1234E4, y = -0.5678E3
– Step B (align significands): x = 0.12340E4, y = -0.05678E4
– Step C (add significands): 0.066620E4
– Step D (normalize): 0.66620E3
– Step E (exception checks): 0.66620E3
– Step F (round): s = 0.6662E3
Stages of a Floating-point Addition
• Consider this scalar addition performed on all the elements of a pair of vectors (arrays) of length n.
• Each of the six stages needs to be executed for every pair of elements.
• If each stage of the execution takes tau units of time, then each addition takes 6*tau units of time (not counting the time required to fetch and decode the instruction itself or to fetch the two operands).
Stages of a Floating-point Addition
• So number of time units required to add all the elements of the two vectors in a serial fashion would be Ts = 6*n*tau.
An Arithmetic Pipeline
• Suppose the addition operation described previously is pipelined; that is, one of the six stages of the addition for a pair of elements is performed at each stage in the pipeline.
• Each stage of the pipeline has a separate arithmetic unit designed for the operation to be performed at that stage.
• It still takes 6*tau units of time to complete the sum of the first pair of elements, but the sum of the next pair is ready in only tau more units of time.
An Arithmetic Pipeline
• So the time, Tp, to do the pipelined addition of two vectors of length n is
Tp = 6*tau + (n-1)*tau = (n + 5)*tau.
• Thus, this pipelined version of addition is faster than the serial version by almost a factor of the number of stages in the pipeline.
• This is an example of what makes vector processing more efficient than scalar processing.
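The two timing formulas above can be compared directly. This sketch (plain Python, hypothetical function names) assumes the 6-stage adder and a unit stage time tau from the text:

```python
def serial_time(n, tau=1.0, stages=6):
    # Ts = stages * n * tau: every stage repeats for each element pair.
    return stages * n * tau

def pipelined_time(n, tau=1.0, stages=6):
    # Tp = stages*tau for the first result, then one result per tau:
    # Tp = (n + stages - 1) * tau, i.e. (n + 5)*tau for 6 stages.
    return (n + stages - 1) * tau

# The speedup approaches the stage count (6 here) as n grows.
for n in (8, 64, 1024):
    print(n, serial_time(n) / pipelined_time(n))
```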
An Arithmetic Pipeline
• The operations at each stage of a pipeline for floating-point multiplication are slightly different from those for addition.
• A multiplication pipeline may even have a different number of stages than an addition pipeline.
• There may also be pipelines for integer operations.
• Some vector architectures provide greater efficiency by allowing the output of one pipeline to be chained directly into another pipeline.
Vector Functional Units
• All modern vector processors use a vector-register-to-vector-register instruction format.
• So all vector processors contain vector register sets.
• A vector register set consists of eight or more registers, with each register containing from 16 to 64 vector elements.
• Each vector element is a floating-point word (64 bits).
Vector Functional Units
• Vector registers access memory with special Load and Store instructions.
• There are separate and independent functional units to manage the load / store function.
• There are separate vector execution units for each instruction class.
• These execution units are segmented ( pipelined ) to support highest possible execution rate.
Primary Storage Facilities
[Diagram: main memory and the data cache feed the vector registers through vector load (VLD) and vector store (VST) paths; the scalar floating-point registers and integer general-purpose registers sit alongside the vector registers.]
Vector Functional Units
• Vector operations involve operations on a large number of operands (a vector of operands); thus pipelining helps in achieving execution at the cycle rate of the system.
• Vector processors also contain scalar floating point registers, integer (General Purpose) registers and scalar functional units.
• Scalar registers and their contents can interface with vector execution units.
Vector Functional Units
• Vectors as a data structure are not well managed by a data cache, so vector load / store operations avoid the data cache and are implemented directly between memory and vector registers.
• Vector load / store operations can be overlapped with other vector instruction executions. But vector loads must complete before they can be used.
Vector Functional Units
• The ability of the processor to concurrently execute multiple (independent) vector instructions is limited by the number of vector register ports and vector execution units.
• Each concurrent load or store requires a vector register port; vector ALU operations require multiple ports.
Vector Instructions / Operations
• Vector instructions are effective in several ways:
– They significantly improve code density.
– They reduce the number of instructions required to execute a program (reducing I-bandwidth).
– They organize data arguments into regular sequences that can be handled efficiently by the hardware.
– They can represent a simple loop construct, thus removing the control overhead for loop execution.
Types of Vector Operations
• (a) Vector arithmetic and logical operations:
– VADD, VSUB, VMPY, VDIV, VAND, VOR, VEOR
VOP VR1, VR2, VR3
VR2 and VR3 are the vector registers which contain the source operands on which the vector operation is performed, and the result is stored in vector register VR1:
VR1 ← VR2 VOP VR3
Types of Vector Operations
• (b) Compare (VCOMP):
VR1 VCOMP VR2 → S
The result of a vector compare is stored in a scalar register S. Bit Si of the scalar register is set to 1 if v1.i > v2.i (comparison of the ith elements).
• Test (VTEST): V1 VTEST CC → S
Bit Si is 1 if v1.i satisfies CC (the condition code specified in the instruction).
Types of Vector Operations
• (c) Accumulate (VACC): ∑(V1 * V2) → S. Accumulates the sum of the products of corresponding elements of two vectors into a scalar register.
• (d) Expand / Compress (VEXP / VCPRS): VR OP S → VR. Takes logical vectors and applies them to elements in vector registers to create a new vector value.
• Vector Load (gather) and Vector Store (scatter) instructions asynchronously access memory.
Types of Vector Operations
• When a vector operation is done on two vector registers of unequal length, we need some convention for producing result.
• All entries in a vector register which are not explicitly stored are given an invalid-content symbol NaN, and any operation using NaN will also produce NaN regardless of the contents of the other register.
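A small sketch of this NaN-padding convention (plain Python; the function name and fixed register length are hypothetical):

```python
import math

def vop(op, v1, v2, vlen=8):
    # Entries beyond a vector's stored length behave as NaN, and NaN
    # poisons any result computed from it, matching the convention above.
    pad = lambda v: v + [math.nan] * (vlen - len(v))
    return [math.nan if (math.isnan(a) or math.isnan(b)) else op(a, b)
            for a, b in zip(pad(v1), pad(v2))]

r = vop(lambda a, b: a + b, [1.0, 2.0, 3.0], [10.0, 20.0], vlen=4)
# r[0], r[1] hold 11.0 and 22.0; r[2], r[3] are NaN.
```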
Vector Processor Implementation
• Vector processor implementation requires a considerable amount of additional control and hardware.
• Vector registers used to store vector operands generally bypass the data cache.
• Data cache is used solely to store scalar values.
• Since a particular value stored in a vector may be aliased as a scalar and stored in the data cache, all vector references to memory must be checked against the contents of the data cache.
Vector Processor Implementation
• If there is a hit, the current value contained in the data cache is invalidated and a memory update is forced.
• Additional H/W or S/W control is required to ensure that scalar references from data cache to memory do not inadvertently reference a value contained in vector register.
• Earlier vector processors used a memory-to-memory instruction format, but due to severe memory congestion and contention problems most recent vector processors use vector registers to load / store vector operands.
Vector Processor Implementation
• Vector registers generally consist of a set of eight registers, each containing from 16 to 64 entries of the size of a floating-point word.
• The arithmetic pipeline may be shared with the scalar part of the processor.
• Under some conditions, it is possible to execute more than one arithmetic operation per cycle.
• The result of one arithmetic operation can be directly used as an operand in a subsequent vector instruction.
Vector Processor Implementation
• This is called chaining.
• For the two instructions:
VADD VR3, VR1, VR2
VMPY VR5, VR3, VR4
[Diagram: chaining of the VADD and VMPY pipelines. As each element sum (e.g. VR1.2 + VR2.2) emerges from the VADD pipeline into VR3, it is fed directly into the VMPY pipeline (e.g. VR3.1 * VR4.1) while the next pair (VR1.3 + VR2.3) enters the adder.]
Vector Processor Implementation
• The illustrated chained ADD-MPY, with each functional unit having 4 stages, saves 64 cycles.
• If unchained, it would take 4 (startup) + 64 (elements/VR) = 68 cycles for each function, a total of 136 cycles.
• With chaining this is reduced to 4 (add startup) + 4 (multiply startup) + 64 (elements/VR) = 72 cycles, a saving of 64 cycles.
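The cycle counts above can be reproduced with a tiny model (a sketch; the 4-stage startup and 64-element registers are the figures from the text):

```python
def unchained_cycles(n_elems=64, startup=4, n_ops=2):
    # Each operation pays its own startup and streams all elements.
    return n_ops * (startup + n_elems)

def chained_cycles(n_elems=64, startup=4, n_ops=2):
    # The startups overlap once in sequence; elements then stream
    # through both pipelines back to back.
    return n_ops * startup + n_elems

print(unchained_cycles(), chained_cycles())   # 136 72
print(unchained_cycles() - chained_cycles())  # 64 cycles saved
```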
Vector Processor Implementation
• Another important aspect of vector implementation is management of references to memory.
• Memory must have sufficient bandwidth to support a minimum of two, and preferably three, references per cycle (two reads and one write).
• This bandwidth allows for two vector reads and one vector write to be initiated and executed concurrently with execution of a vector arithmetic operation.
Vector Processor Implementation
• Major data paths in a generic vector processor are shown in Figure 7.14 on page no 438 of computer architecture by Michael Flynn.
Vector Memory
• The simple low-order interleaving used in normal pipelined processors is not suitable for vector processors.
• Access in the case of vectors is non-sequential but systematic; thus if the array dimension or stride (the address distance between adjacent elements) is the same as the interleaving factor, all references will concentrate on the same module.
Vector Memory
• It is quite common for these strides to be of the form 2^k or other even dimensions.
• So vector memory designs use address remapping and a prime number of memory modules.
• Hashed addressing is a technique for dispersing addresses.
• Hashing is a strict 1:1 mapping of the bits in X to form a new address X' based on simple manipulations of the bits in X.
Vector Memory
• A memory system used in vector / matrix accessing consists of the following units:
– Address hasher
– 2^k + 1 memory modules
– Module mapper
• This may add certain overhead and extra cycles to memory access, but since the purpose of the memory is to access vectors, this can be overlapped in most cases.
Vector Memory
[Diagram: the X address passes through an address hasher to produce X'; the module index is computed as X' mod (2^k + 1), selecting one of the 2^k + 1 modules, while X' / 2^k gives the address within the module; data is returned from the selected module.]
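The benefit of a prime module count can be seen in a few lines (a sketch: stride-8 accesses against 8 modules versus 17, i.e. 2^4 + 1):

```python
def modules_touched(stride, n_refs, n_modules):
    # Distinct modules hit by n_refs consecutive strided references.
    return len({(i * stride) % n_modules for i in range(n_refs)})

# Power-of-two interleaving: a stride equal to the module count
# concentrates every reference on one module.
print(modules_touched(stride=8, n_refs=64, n_modules=8))    # 1
# A prime module count (17 = 2^4 + 1) spreads the same access
# pattern across all modules, since 8 and 17 are coprime.
print(modules_touched(stride=8, n_refs=64, n_modules=17))   # 17
```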
Modeling Vector Memory Performance
• Vector memory is designed for multiple simultaneous requests to memory.
• Operand fetching and storing is overlapped with vector execution.
• Three concurrent operand accesses to memory are a common target, but the increased cost of the memory system may limit this to two.
• Chaining may require even more accesses.
• Another issue is the degree of bypassing, or out-of-order requests, that a source can make to the memory system.
Modeling Vector Memory Performance
• In case of a conflict, i.e. a request being directed to a busy module, the source can continue to make subsequent requests only if the unserviced requests are held in a buffer.
• Assume each of the s access ports to memory has a buffer of size TBF/s which holds requests delayed by a conflict.
• For each source, degree of bypassing is defined as the allowable number of requests waiting before stalling of subsequent requests occurs.
Modeling Vector Memory Performance
• If Qc is the expected number of denied requests per module and m is the number of modules, then the buffer must be large enough to hold the denied requests:
Buffer = TBF > m·Qc
• If n is the total number of requests made and B is the bandwidth achieved, then
m·Qc = n - B (denied requests)
Gamma (γ) – Binomial Model
• Assume that each vector source issues a request each cycle (δ = 1) and each physical requestor has the same buffer capacity and characteristics.
• If the vector processor can make s requests per cycle and there are t cycles per Tc, then
Total requests per Tc = t·s = n
This is the same as n requests per Tc in the simple binomial model.
Gamma (γ) – Binomial Model
• If γ is the mean queue size of bypassed requests awaiting service, then each of the γ buffered requests also makes a request each cycle.
• From a memory modeling point of view this is equivalent to the buffer requesting service each cycle until the module is free.
Total requests per Tc = t·s + t·s·γ = t·s(1 + γ) = n(1 + γ)
Gamma (γ) – Binomial Model
• Substituting this value into the simple binomial equation:
B(m, n, γ) = m + n(1+γ) - 1/2 - √[ (m + n(1+γ) - 1/2)² - 2nm(1+γ) ]
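This bandwidth expression is easy to evaluate numerically (a sketch of the model; symbols as in the text, function name hypothetical):

```python
import math

def bandwidth(m, n, gamma=0.0):
    # γ-binomial model: the effective request count per Tc is n(1 + γ).
    a = m + n * (1.0 + gamma) - 0.5
    return a - math.sqrt(a * a - 2.0 * n * m * (1.0 + gamma))

# With no bypassing, some requests are denied (B < n) ...
print(bandwidth(8, 4))
# ... while bypass buffering raises the achieved bandwidth toward n.
print(bandwidth(8, 4, gamma=0.375))
```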
Calculating γ opt
• γ is the mean expected bypassed-request queue per source.
• If we continue to increase the number of bypass buffer registers, we can achieve a γopt which totally eliminates contention.
• No contention occurs when B = n, i.e. B(m, n, γ) = n.
• This occurs when ρa = ρ = n/m.
• Since the MB/D/1 queue size is given by
Q = (ρa² - p·ρa) / (2(1 - ρa)) = (n(1+γ) - B) / m
Calculating γopt
• Substituting ρa = ρ = n/m and p = 1/m we get:
Q = (n² - n) / (2(m² - nm)) = (n/m)(n - 1) / (2m - 2n)
• Since Q = (n(1+γ) - B) / m, we have mQ = n(1+γ) - B.
• For γopt, (n - B) = 0, so
γopt = (m/n)·Q = (n - 1) / (2m - 2n)
• The mean total buffer size is TBF = n·γopt.
• To avoid overflow the buffer may be made considerably larger, perhaps 2 × TBF.
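Under the stated assumptions, γopt and the buffer sizing follow directly (a sketch; m modules, n requests per Tc, names hypothetical):

```python
def gamma_opt(m, n):
    # γopt = (n - 1) / (2m - 2n): the bypass depth removing contention.
    return (n - 1) / (2 * m - 2 * n)

def total_buffer(m, n):
    # Mean total buffer TBF = n * γopt; sizing to about 2 x TBF
    # guards against overflow.
    return n * gamma_opt(m, n)

print(gamma_opt(8, 4))        # 0.375
print(2 * total_buffer(8, 4))
```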
Vector Processor Speedup: Performance Relative to Pipelined Processors
Vector processor performance depends on:
1. The amount of the program that can be expressed in vectorizable form.
2. Vector startup costs (length of the pipeline).
3. The number of execution units and support for chaining.
4. The number of operands that can be simultaneously accessed / stored.
5. The number of vector registers.
Vector Processor Speedup
• The overall effect of speedup possible by the vector processor over the pipelined processor is generally limited to 4.
• This assumes concurrent execution of 2 load, 1 store and 1 arithmetic operation.
• If chaining is allowed and the memory system can accommodate an additional concurrent load instruction, the speedup can extend to <6 (3 LDs, 2 arithmetic, and 1 ST).
Vector Processor Speedup
• In practice such speedups are not achievable for the following reasons:
– To sustain such a high speedup the program must consist of purely vector code.
– Due to address arithmetic and control operations, some part of the program is not vectorizable.
– Suppose a maximum speedup of 4 is achievable for 75% of operations; then the overall speedup is
Sp = T1 / (%vector · (T1/4) + %nonvector · T1) ≈ 2.3
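The 2.3 figure is an Amdahl-style computation; a minimal sketch:

```python
def overall_speedup(vector_fraction, vector_speedup):
    # Sp = T1 / (f*(T1/s) + (1 - f)*T1) = 1 / (f/s + (1 - f))
    return 1.0 / (vector_fraction / vector_speedup + (1.0 - vector_fraction))

# 75% of operations sped up 4x yields about 2.3 overall.
print(round(overall_speedup(0.75, 4.0), 1))  # 2.3
```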
Vector Processor Speedup
– The depth of the pipeline also limits the effective speedup.
• Two parameters, R∞ and n1/2, are used to characterize vector processor performance.
• R∞ = 1/Δt is the maximum vector arithmetic execution rate that the processor can sustain in the absence of chaining (Δt is the cycle time of the vector pipeline).
• If c functional units can be chained, the maximum performance is c · R∞ (usually c = 2).
• n1/2 is a parameter that measures the depth of the pipeline.
Vector Processor Speedup
– Assume n is the vector size (number of elements); then n1/2 is the length of vector operand that achieves exactly one half of maximum performance.
– Usually a vector processor cannot start a new instruction until a previous instruction is finished using the pipeline (pipeline flushing).
– This delay and startup time correspond to the depth of the pipeline and are represented by n1/2.
• Vector efficiency R/R∞ = 1 / (1 + n1/2/n).
• Vector efficiency is very low if the vector length n is significantly less than n1/2.
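The efficiency formula R/R∞ = 1 / (1 + n1/2/n) can be checked for a few vector lengths (a sketch; the n1/2 value of 32 is just an illustrative assumption):

```python
def vector_efficiency(n, n_half):
    # Efficiency is exactly 1/2 when the vector length equals n1/2
    # and climbs toward 1 for vectors much longer than n1/2.
    return 1.0 / (1.0 + n_half / n)

for n in (4, 32, 256):
    print(n, vector_efficiency(n, n_half=32))
```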
Vector Processor Speedup
– Memory contention under heavy referencing can add to the delay.
– This contention causes vector registers to be loaded and stored more slowly than vector instructions are executed; the waiting time between two vector operations adds to the delay.
– So generally, for memory-limited processors,
n1/2 = max [ vector startup cycles, vector memory overhead cycles ]
• Vector processor speedup is thus limited below its maximum value of 4 or 5, but for code with significant vector content a speedup of 2-3 is generally achievable.
Multiple-Issue Machines
• These machines evaluate dependencies among groups of instructions, and groups found to be independent are simultaneously dispatched to multiple execution units.
• There are two broad classes of multiple-issue machines:
– Statically scheduled: the detection process is done by the compiler.
– Dynamically scheduled: detection of independent instructions is done by hardware in the decoder at run time.
Statically Scheduled Machines
• Sophisticated compilers search the code at compile time and instructions found independent in their effect are assembled into instruction packets, which are decoded and executed at run time.
• Statically scheduled processors must have some additional information either implicitly or explicitly indicating instruction packet boundaries.
Statically Scheduled Machines
• Early statically scheduled machines include the so-called VLIW (very long instruction word) machines.
• These machines use an instruction word that consists of 8 to 10 instruction fragments.
• Each fragment controls a designated execution unit.
• To accommodate multiple instruction fragments the instruction word is typically over 200 bits long.
Statically Scheduled Machines
• The register set is extensively multi ported to support simultaneous access to multiple execution units.
• To avoid the performance limitation caused by the occurrence of branch instructions, a novel compiler technology called trace scheduling is used.
• In trace scheduling branches are predicted where possible and predicted path is incorporated into a large basic block.
Statically Scheduled Machines
• If an unanticipated (or unpredicted) branch occurs during execution of the code, the proper result is fixed up at the end of the basic block for use by the target basic block.
Dynamically Scheduled Machines
• In dynamically scheduled machines, detection of independent instructions is done by hardware at run time.
• The detection may also be done at compile time and code suitably arranged to optimize execution patterns.
• At run time the search for concurrent instructions is restricted to the localities of the last executing instruction.
Superscalar Machines
• The maximum program speedup available in multiple-issue machines largely depends on sophisticated compiler technology.
• The potential speedup available from the Multiflow compiler using trace scheduling is generally less than 3.
• Recent multiple-issue machines, having more modest objectives, are called superscalar machines.
• The ability to issue multiple instructions in a single cycle is referred to as a superscalar implementation.
Major Data Paths in a Generic M.I. Machine
(Refer to fig 7.28 on page no 458.)
• Simultaneous access to data as required by a VLIW processor mandates extensive use of register ports, which can become a bottleneck.
• Dynamic multiple-issue processors use multiple buses connecting the register sets and functional units, and each bus services multiple functional units.
• This may limit the maximum degree of concurrency.
Comparing Vector and Multiple Issue Processors
• The goal of any processor design is to provide cost-effective computation across a range of applications.
• So we should compare the two technologies based on the following two factors:
– Cost
– Performance
Cost Comparison
• While comparing cost we must approximate the area used by both technologies in the form of additional / required units.
• The cost of the execution units is about the same for both (for the same maximum performance).
• A major difference lies in the storage hierarchy.
• Both rely heavily on multi-ported registers.
Cost Comparison
• These registers occupy a significant amount of area. If p is the number of ports, the area required is
Area = (no. of registers + 3p)(bits per register + 3p) rbe
• Most vector processors have 8 sets of 64-element registers, with each element being 64 bits in size.
• Each vector register is dual-ported (a read port and a write port). Since registers are sequentially accessed, each port can be shared by all elements in the register set.
Cost Comparison
• There is an additional switching overhead to switch each of the vector registers to each of the p external ports:
Switch area = 2 · (bits per register) · p · (no. of registers)
• So the area used by the register sets in a vector processor (supporting 8 ports) is
Area = 8 × [(64 + 6)(64 + 6)] = 39,200 rbe
Switch area = 2 × 64 × 8 × 8 = 8,192 rbe
Cost Comparison
• A multiple-issue processor with 32 registers, each having 64 bits and supporting 8 ports, will require
Area = (32 + 3(8))(64 + 3(8)) = 4,928 rbe
• So vector processors use almost 42,464 rbe of extra area compared to M.I. processors.
• This extra area corresponds to about 70,800 cache bits (0.6 rbe/bit), i.e. approximately 8 KB of data cache.
• Vector processors use a small data cache.
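The rbe figures above follow from the register-area formula; a sketch reproducing them (the 0.6 rbe/bit cache density is the figure from the text):

```python
def register_area(n_reg, bits, ports):
    # Area = (no_of_registers + 3p)(bits_per_register + 3p) rbe
    return (n_reg + 3 * ports) * (bits + 3 * ports)

vec_area = 8 * register_area(64, 64, 2)   # 8 sets of 64 x 64-bit, dual ported
switch_area = 2 * 64 * 8 * 8              # switching 8 registers to 8 ports
mi_area = register_area(32, 64, 8)        # 32 x 64-bit registers, 8 ports
extra = vec_area + switch_area - mi_area
print(vec_area, switch_area, mi_area, extra)  # 39200 8192 4928 42464
cache_bits = extra / 0.6                  # ~70,800 bits, roughly 8 KB
```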
Cost Comparison
• Multiple-issue machines require a larger data cache to ensure high performance.
• Vector processors require support hardware for managing access to the memory system.
• Also, a high degree of interleaving is required in the memory system to support the processor bandwidth.
• M.I. machines must support 4-6 reads and 2-3 writes per cycle. This increases the area required by buses between arithmetic units and registers.
Cost Comparison
• M.I. machines must access and hold multiple instructions each cycle from the I-cache.
• This increases the size of the I-fetch path between the I-cache and the instruction decoder / instruction register.
• At the instruction decoder, multiple instructions must be decoded simultaneously and detection of instruction independence must be performed.
• The main difference depends heavily on the size of the data cache required by M.I. machines and the cost of the memory interleaving required by vector processors.
Performance Comparison
• The performance of vector processors depends primarily on two factors:
– The percentage of code that is vectorizable.
– The average length of vectors.
• We know that n1/2, the vector size at which the vector processor achieves approximately half its asymptotic performance, is roughly the same as the depth of the arithmetic plus memory access pipeline.
• For short vectors the data cache is sufficient in M.I. machines, so for short vectors M.I. processors would perform better than an equivalent vector processor.
Performance Comparison
• As vectors get longer, the performance of the M.I. machine becomes much more dependent on the size of the data cache, and the n1/2 of the vector processor improves.
• So for long vectors, performance would be better in the case of the vector processor.
• The actual difference depends largely on sophistication in compiler technology.
• The compiler can recognize the occurrence of short vectors and treat that portion of code as if it were scalar code.
Shared Memory Multiprocessors
• Beyond the instruction-level concurrency used in vector and multiple-issue processors, there are processing ensembles that consist of n identical processors that share a common memory.
• Multiprocessors are usually designed for at least one of two reasons:
– Fault tolerance
– Program speedup
Shared Memory Multiprocessors
• Fault Tolerant Systems: n identical processors ensure that failure of one processor does not affect the ability of the multiprocessor to continue with program execution.
• These multiprocessors are called high-availability or high-integrity systems.
• These systems may not provide any speedup over a single-processor system.
Shared Memory Multiprocessors
• Program Speed up: Most multiprocessors are designed with main objective of improving program speed up over that of single processor.
• Yet fault tolerance is still an issue as no design for speedup ought to come at the expense of fault tolerance.
• It is generally not acceptable for the whole multiprocessor system to fail if any one of its processors fails.
Shared Memory Multiprocessors
• Basic Issues: Three basic issues are associated with the design of multiprocessor systems:
– Partitioning
– Scheduling of tasks
– Communication and synchronization
Partitioning
• This is the process of dividing a program into tasks, each of which can be assigned to an individual processor for execution.
• The partitioning process occurs at compile time.
• The goal of the partitioning process is to uncover the maximum amount of parallelism possible within certain obvious machine limitations.
• Program overhead, the added time a task takes to be loaded into a processor, defines the minimum task size produced by partitioning a program.
Partitioning
• The program overhead time, which is configuration and scheduling dependent, limits the maximum degree of parallelism among executing subtasks.
• If the amount of parallelism is increased by using finer and finer grain task sizes, the amount of overhead time increases accordingly.
• If the available parallelism exceeds the known number of processors, or several shorter tasks share the same instruction / data working set, clustering is used to group subtasks into a single assignable task.
Partitioning
• The detection of parallelism is done by one of three methods:
– Explicit statement of concurrency in a high-level language: programmers delineate the boundaries among tasks that can be executed in parallel.
– Programmer hints in source statements, which compilers can use or ignore.
– Implicit parallelism: sophisticated compilers can detect parallelism in normal serial code and transform the program code for execution on multiprocessors.
Scheduling
• Scheduling is done both statically (at compile time) and dynamically (at run time).
• Static scheduling is not sufficient to ensure optimum speedup or even fault tolerance.
• Processor availability is difficult to predict and may vary from run to run.
• Run-time scheduling has the advantage of handling changing system environments and program structures.
Scheduling
• Run-time overhead is the prime disadvantage of run-time scheduling.
• It is desirable from a fault tolerance point of view that run-time scheduling can be initiated by any processor and that the process itself is distributed across all available processors.
• The major overheads in run-time scheduling include:
– Information gathering (about the dynamic program state and the state of the system)
– Scheduling
Scheduling
– Dynamic execution control: dynamic clustering or process creation at run time.
– Dynamic data management: assignment of tasks to processors in such a way as to minimize the memory overhead and delay in accessing the data.
Synchronization and Coherency
• In a multiprocessor configuration having high degree of task concurrency, the tasks must follow an explicit order and communication between active tasks must be performed in an orderly way.
• The value passing between different tasks executing on different processors is performed by synchronization primitives or semaphores.
Synchronization and Coherency
• Synchronization is the means to ensure that multiple processors have a coherent or similar view of critical values in memory.
• Memory coherence is the property of memory that ensures that a read operation returns the value stored by the latest write to the same address.
• In complex systems of multiple processors, the program order of memory-related operations may differ from the order in which the operations are actually executed.
Synchronization and Coherency
• In multiprocessors there are different degrees to which ordering may be enforced:
– Sequential consistency: all memory accesses execute atomically in some total order, and all memory accesses of a single process appear to execute in (its) program order.
– Processor consistency: the issuing processor sees a load followed by a store in program order, but a store followed by a load is not necessarily performed in program order.
Synchronization and Coherency
– Weak Consistency: Synch operations are sequentially consistent, other memory operations can occur in any order. All synchronization operations are performed before any subsequent memory operation is performed and all pending memory operations are performed before any synchronization operation is performed.
– Release Consistency: Synch operations are split into acquire (lock) and release (unlock) and these operations are processor consistent.
Means of Synchronization
• The simplest means is a single synchronization variable which is accessed by a read-modify-write function.
• Before access can be made to a shared variable, this location is referenced.
• If this location is '0', then a write can be made to the shared variable and the contents of the synchronization location are changed to '1'. (This indicates that the producer process has produced data for the consumer process to read.)
Means of Synchronization
• Similarly, read access to this shared variable is allowed only when the synchronization variable is '1' (that is, only after data has been written by the producer process). After the read is completed, the synchronization location is changed to '0', allowing the producer process to write to the shared variable.
• If there were multiple consumer processes, we would need additional bits in the synchronization location to coordinate them.
Means of Synchronization
• We can designate certain bit combinations that give exclusive rights to the producer, or to either of the consumers, to change (write) or use (read) the shared variable.
0 0  Producer changing data; consumers locked out
0 1  Data available to any consumer; producer locked out
1 1  A consumer is acquiring data; producer and other consumers are locked out
1 0  Buffer empty; consumers locked out; producer allowed to change data
The access to this region in memory is locked and unlocked by certain bit patterns called semaphores.
Means of Synchronization
• Another synchronization mechanism is the test and set instruction.
• The test and set instruction tests a value in memory and then sets that value to ‘1’.
• If the shared variable is not yet available, the accessing process enters a loop that continues to access the synchronization location until it indicates that the shared variable is set.
• This process of looping on a shared variable with a test and set type of instruction is called a spin lock.
Means of Synchronization
• Spin locks create significant additional memory traffic.
• To reduce this traffic a suspend lock can be implemented.
• In a suspend lock, the status is recorded by the equivalent of a test and set instruction, and then the requesting process goes idle, awaiting an interrupt from the producing process.
• The fetch and add primitive combines multiple test and set instructions into a single instruction that accesses a synchronization variable and returns a value.
Types of Shared Memory Multiprocessors
• The variety in multiprocessors results from the way memory is shared between processors. This can be as follows:
– Shared data cache, shared memory
– Separate data caches, but shared bus to shared memory
– Separate data caches with separate buses to shared memory
– Separate processors and separate memory modules interconnected with a multistage interconnection network
Types of Shared Memory Multiprocessors
• The simpler and more limited the multiprocessor configuration, the easier it is to provide synchronization, communication, and memory coherency.
• As processor speed and number of processors increase, shared data caches and buses run out of bandwidth and become a bottleneck.
• Replicating caches and/or buses to provide additional bandwidth increases coherency traffic.
Memory Coherence in Shared Memory Multiprocessor
• Each node in a multiprocessor system possesses a local cache.
• Since the address space of the processors overlaps, different processors can be holding (caching) the same memory segment at the same time.
• Further, each processor may be modifying these cached locations simultaneously.
• The cache coherency problem is to ensure that all caches contain the same, most up-to-date copy of the data.
Memory Coherence in Shared Memory Multiprocessor
• The protocol that maintains the consistency of data in all the local caches is called the cache coherency protocol.
• The most important distinction among the various protocols is the way they distribute information about writes to memory to the other processors in the system.
– Snoopy Protocol: a write is broadcast to all processors in the system.
– Directory Based: a write is only sent to those processors that are known to possess a copy.
Snoopy Protocol
• Snoopy protocols are simpler, as ownership information need not be stored in the system.
• These protocols create a great deal of additional bus or network traffic while broadcasting to all the processors.
• Snoopy protocols are bus based and are used in shared-bus multiprocessors.
• Snoopy protocols assume that all of the processors are aware of, and receive, all bus transactions (snoop on the bus).
Snoopy Protocol
• Snoopy protocols are further classified based on the type of action the local processor must take when an altered line is recognized.
• There are two types of actions:
– Invalidate: all copies in other caches are invalidated before changes are made to the data in a particular line. The invalidate signal is received from the bus, and all caches which possess the same cache line invalidate their copies.
Snoopy Protocol
– Update: writes are broadcast on the bus, and caches sharing the same line snoop for the data on the bus and update the contents and state of their cache lines.
• Another distinction among protocols is based on whether main memory is updated when a write occurs or only the cache contents are updated.
Snoopy Protocol
• There are four variations based upon the invalidate strategy:
– Write-invalidate
– Synapse
– Illinois
– Berkeley
Snoopy Protocol
• There are two variations based on the update strategy:
– Firefly
– Dragon
Directory Based Protocols
• These protocols maintain the state information of the cache lines at a single location called directory.
• Only the caches that are listed in the directory, and are thus known to possess a copy of the newly altered line, are sent invalidate or write-update information.
• Since there is no need to broadcast to all caches, in contrast to snoopy protocols, directory based protocols can scale better.
Directory Based Protocols
• As the number of processor nodes and the number of cache lines increase, the size of the directory can become very large.
• In addition to the distinction based on the action taken by the local processor (invalidate or update cache lines), there is an important distinction among directory based protocols depending on directory placement:
– Central Directory
– Distributed Directory
Directory Based Protocols
• Central Directory: the directory is placed in, or associated with, the memory system. Each line in memory has an entry defining the users of the line.
[Figure: a central directory with one presence-bit vector per memory line; the bits (e.g. 1 0 1 0) mark which of Cache 3, Cache 2, Cache 1, and Cache 0 hold a copy of the line.]
Directory Based Protocols
• Distributed Directory: memory has only one pointer, which identifies the cache that last requested the line. A subsequent request for that line is then referred to that cache, and the requestor's ID is placed at the head of the list.
[Figure: a distributed directory; memory holds a single pointer to the head of a linked list of sharing caches (e.g. 3 -> 0 -> 2 -> T, where T terminates the list).]
Directory Based Protocols
• There are singly and doubly linked versions called the Scalable Distributed Directory (SDD) and Scalable Coherent Interface (SCI) respectively.
• The invalidate protocols differ in the time it takes to maintain consistency across the network of processors.
• With the central directory, the invalidate traffic can be forwarded in parallel to all using processors.
Directory Based Protocols
• In the distributed directory this process is done serially through the use of the lists.
• In update protocols using a central directory, the directory sends a count to the writer (indicating the number of users).
• Processors having a copy of the line are informed that the line is being updated, along with the write data.
• The processors, after updating, send an acknowledgement to the writing cache. (Total acks = count)
Directory Based Protocols
• In a distributed directory, the writing processor becomes the head of the list; then the update signal, together with the new data, is forwarded down the list.
THE END