different parallel processing architectures module 2

Different parallel processing architectures

Module 2

Syllabus

Multithreaded architectures–principles of multithreading

Dataflow operators, Dataflow language properties, advantages & potential problems

Static and dynamic dataflow architectures

Basics..

• Computers are basically designed for execution of instructions, which are stored as programs in the memory.

• These instructions are executed sequentially, which is slow because the next instruction can be executed only after the output of the previous instruction has been obtained.

• As discussed earlier, the concept of parallel processing was introduced to improve speed and throughput.

• To execute more than one instruction simultaneously, one has to identify the independent instructions that can be passed to separate processors.

Basics..

• Parallelism in a multiprocessor can, in principle, be implemented in three main ways:

Instruction Level Parallelism

Data Level Parallelism

Thread Level Parallelism

Instruction-Level Parallelism (ILP)

• The potential of overlap among instructions is called instruction-level parallelism (ILP) since the instructions can be evaluated in parallel.

• Instruction level parallelism is obtained primarily in two ways in uniprocessors: through pipelining and through keeping multiple functional units busy executing multiple instructions at the same time.

Data Level Parallelism

• The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop.

• This type of parallelism is often called loop-level parallelism; a vector processor is an example of a machine that exploits it.
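As a minimal sketch of loop-level parallelism (the data and function names are illustrative assumptions, not from the slides): the first loop has independent iterations, so they could run in any order or all at once, as on a vector processor; the second carries a dependence from one iteration to the next and cannot be split up the same way.

```python
# Hypothetical example data; any equal-length sequences would do.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

def vector_add(a, b):
    """Each iteration touches only a[i] and b[i]: no loop-carried dependence."""
    return [a[i] + b[i] for i in range(len(a))]

def vector_add_any_order(a, b):
    """Same loop visited in reverse order; the result does not change."""
    c = [None] * len(a)
    for i in reversed(range(len(a))):
        c[i] = a[i] + b[i]
    return c

def prefix_sum(a):
    """Iteration i needs the result of iteration i-1: a loop-carried dependence."""
    out = []
    running = 0
    for x in a:
        running += x
        out.append(running)
    return out
```

Because `vector_add_any_order` gives the same answer as `vector_add`, the iterations are free to execute in parallel; `prefix_sum` has no such freedom in this form.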

Thread Level Parallelism

• Thread: a lightweight process.

• Thread level parallelism (TLP) is the act of running multiple flows of execution of a single process simultaneously.

• TLP is most often found in applications that need to run independent, unrelated tasks (such as computing, memory accesses, and IO) simultaneously.

• These types of applications are often found on machines that have a high workload, such as web servers.

• TLP is a popular ground for current research due to the rising popularity of multi-core and multiprocessor systems, which allow for different threads to truly execute in parallel.
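The following is a small sketch of the thread-level programming model, assuming a hypothetical `handle_request` function that stands in for one independent task (for example, one web request). It uses Python's standard `concurrent.futures` thread pool. In standard CPython builds with a global interpreter lock, threads running pure Python code interleave rather than execute truly in parallel, so this illustrates the model rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    """Stand-in for an independent task, such as serving one web request."""
    return f"response-{request_id}"

def serve(request_ids, workers=4):
    # Each request is handed to a thread from the pool; map() returns
    # results in the order of the inputs, regardless of completion order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_request, request_ids))
```

Calling `serve([1, 2, 3])` returns one response per request, in input order.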

Principles of multithreading

• In the multithreaded execution model, a program is a collection of partially ordered threads, and a thread consists of a sequence of instructions which are executed in the conventional von Neumann model.

• Multithreading is the process of executing multiple threads concurrently on a processor.

• Multithreading demands that the processor be designed to handle multiple contexts simultaneously on a context switching basis.

Multithreaded computation model

• Let us consider a system in which the memories are distributed to form a global address space. The machine parameters used to analyze it are:

a. The latency (L): this includes the network delay, the cache miss penalty, and delays caused by contention in split transactions.

b. The number of threads: the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set, and the required context status words.

c. The context switching overhead: this refers to the cycles lost in performing a context switch in a processor. It depends on the switching mechanism and on the amount of processor state devoted to maintaining the active threads.

d. The interval between switches: this refers to the cycles between switches triggered by remote references.
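The four parameters above can be combined into a simple deterministic efficiency model. The sketch below is an assumption for illustration, since the slides give no formulas. It assumes a fixed run length R between switches, a fixed latency L, a fixed switch cost C, and N identical threads; the symbols N, R, and C and the function name are labels introduced here.

```python
def processor_efficiency(n_threads, run_length, latency, switch_cost):
    """Fraction of cycles spent on useful work under a simple deterministic model.

    run_length  (R): cycles between switches triggered by remote references
    latency     (L): cycles to satisfy a remote reference
    switch_cost (C): cycles lost per context switch
    """
    if n_threads == 1:
        # A single thread just stalls for the full latency; no switching occurs.
        return run_length / (run_length + latency)
    # Number of threads beyond which the latency is completely hidden.
    saturation_point = latency / (run_length + switch_cost) + 1
    if n_threads >= saturation_point:
        # Saturated: only the switching overhead is lost.
        return run_length / (run_length + switch_cost)
    # Linear region: efficiency grows in proportion to the number of threads.
    return n_threads * run_length / (run_length + switch_cost + latency)
```

Under these assumptions, adding threads helps only until the saturation point; beyond it, efficiency is limited by the context switching overhead alone.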

Multiple context processor

• Multithreaded systems are constructed with multiple context processors.

• For example, in the Horizon and Tera machines, the compiler detects data dependencies and the hardware enforces them by switching to another context whenever a dependency is detected.

• This is implemented by inserting into each instruction a field which indicates its minimum number of independent successors over all possible control flows.

Context switching policies.

• Switching from one thread to another is performed according to one of the following policies :

Switching on every instruction: the processor switches from one thread to another every cycle.

Switching on a block of instructions: blocks of instructions from different threads are interleaved.

Switching on every load: whenever a thread encounters a load instruction, the processor switches to another thread after that load instruction is issued. The context switch occurs irrespective of whether the data is local or remote.

Switching on cache miss: this policy corresponds to the case where a context is preempted when it causes a cache miss.
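The first two policies can be sketched as a round-robin issue order. This is a simplified illustration: the thread names, instruction labels, and the `issue_order` function are hypothetical, and the load and cache-miss policies are not modeled because they would need per-instruction events.

```python
def issue_order(threads, block_size=1):
    """Round-robin issue order: take `block_size` instructions per thread per turn."""
    pending = {name: list(instrs) for name, instrs in threads.items()}
    order = []
    while any(pending.values()):
        for name in pending:
            taken, pending[name] = pending[name][:block_size], pending[name][block_size:]
            order.extend(taken)
    return order

threads = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2", "b3", "b4"]}
every_instruction = issue_order(threads, block_size=1)  # switch on every instruction
every_block = issue_order(threads, block_size=2)        # switch on blocks of two
```

With `block_size=1` the two threads alternate instruction by instruction; with `block_size=2` they alternate in blocks of two.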

Data flow computers

• Data flow machines are an alternative to the conventional stored-program computer design. The aim of designing a parallel architecture is to obtain a high-performance machine.

• The design of a new computer is based on the following three principles:

To achieve high performance

To match technological progress

To offer better programmability in application areas

• Before we study data flow computers in detail, let us review the drawbacks of processors based on pipeline architecture.

The major hazards are

Structural hazards

Data hazards due to

• true dependences, which occur in the case of RAW, or

• false dependences, also called name dependences: anti and output dependences (WAR or WAW)

Control hazards

If data dependencies can be removed, the performance of the system will improve.

• Data flow computers are based on the principle of data-driven computation, which is very different from the von Neumann architecture: the von Neumann architecture is based on control flow, whereas the data flow architecture is driven by the availability of data.

• Hence they are also called data-driven computers.

Types of flow computer

• From a designer's perspective, there are various ways to design a system, depending on how instructions are executed. Two possible ways are:

• Control flow computers: the next instruction is executed when the previous instruction, as stored in the program, has been executed.

• Data flow computers: an instruction is executed when the data (operands) required for executing it are available.

Data driven computing and languages

• To understand how dataflow differs from control flow, let us look at the working of the von Neumann architecture, which is based on the control flow computing model.

• Here each program is a sequence of instructions stored in memory.

• Each of these addressable instructions stores an operation together with the memory locations of its operands. For an interrupt or a function call, it stores the address to which control has to be transferred; for a conditional transfer, it specifies the status bits to be checked and the location to which control has to be transferred.

The key features of control flow model are

• Data is passed between instructions via reference to shared memory cells

• Flow of control is implicitly sequential but special control operators can be used for explicit parallelism

• Program counters are used to sequence the execution of instructions in a centralized control environment.

• The data-driven model, however, allows an instruction to execute only when its operands are available.

• Data flow programs are represented by directed graphs which show the flow of data between instructions.

• Each instruction consists of an operator, one or two operands and one or more destinations to which the result is to be transferred.

• The key features of the data-driven model are as follows:

• Intermediate results as well as final results are passed directly as data tokens between instructions.

• There is no concept of shared data storage as used in traditional computers.

• In contrast to control-driven computers, where the program has complete control over instruction sequencing, in a data-driven computer program sequencing is constrained only by the data dependencies among the instructions.

• Instructions are examined to check operand availability; if both a functional unit and the operands are available, the instruction is executed immediately.

• Data flow computing is intended to exploit parallelism, so the data dependencies in a program must be analyzed.

• The data flow computational model uses a directed graph G = (V, E), also called a data dependency graph or DataFlow Graph (DFG).

• An important characteristic of a dataflow graph is its ability to expose the parallelism of a computation by revealing the dependencies among the data. The graph consists of nodes, which represent the operations (opcodes), and arcs, which connect two nodes and indicate how data flows between them; in other words, arcs are pointers for forwarding data tokens.

• A DFG is used to describe the behavior of a data-driven computer. A vertex v ∈ V is an actor; a directed edge e ∈ E describes the precedence relationship of a source actor to a sink actor and guarantees proper execution of the dataflow program.

• Tokens are used to indicate presence of data in DFG.

• An actor in a dataflow program can be executed only when the requisite number of data values (tokens) is present on its input edges.

• When an actor fires, a defined number of tokens is consumed from its input edges and a defined number of tokens is produced on its output edges.

• Data flow languages make a clean break from the von Neumann framework, giving a new definition to concurrent programming languages. They manage to make optimal use of the implicit parallelism in a program.
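The firing rule for actors and tokens described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Actor` class, its fields, and the two example actors are hypothetical, each actor consumes exactly one token per input edge, and each edge is modeled as a queue.

```python
from collections import deque

class Actor:
    """A dataflow node that fires when every input edge holds a token."""

    def __init__(self, name, operation, n_inputs):
        self.name = name
        self.operation = operation
        self.inputs = [deque() for _ in range(n_inputs)]  # one queue per input edge
        self.outputs = []  # (destination actor, destination port) pairs

    def enabled(self):
        return all(edge for edge in self.inputs)

    def fire(self):
        # Consume one token from each input edge...
        operands = [edge.popleft() for edge in self.inputs]
        result = self.operation(*operands)
        # ...and produce one token on each output edge.
        for actor, port in self.outputs:
            actor.inputs[port].append(result)
        return result

add = Actor("add", lambda x, y: x + y, 2)
double = Actor("double", lambda x: 2 * x, 1)
add.outputs.append((double, 0))

add.inputs[0].append(3)                 # only one operand present
enabled_with_one_token = add.enabled()  # not yet enabled
add.inputs[1].append(4)                 # both operands present
add.fire()                              # sends a token to `double`
final = double.fire()
```

No program counter appears anywhere in the sketch: `double` can fire only because `add` has placed a token on its input edge.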

Consider the following segment:

1. P = X + Y (waits for availability of input values for X and Y)

2. Q = P / Y (as P is a required input, it must wait for instruction 1 to finish)

3. R = X * P (as P is a required input, it must wait for instruction 1 to finish)

4. S = R - Q (as R and Q are required as inputs, it must wait for instructions 2 and 3 to finish)

5. T = R * P (as R is a required input, it must wait for instruction 3 to finish)

6. U = S / T (as S and T are required as inputs, it must wait for instructions 4 and 5 to finish)

• Permissible computation sequences of the above program for the conventional von Neumann machine are

• (1,2,3,4,5,6)

• (1,3,2,5,4,6)

• (1,3,5,2,4,6)

• (1,2,3,5,4,6) and

• (1,3,2,4,5,6)

• A dataflow program is a graph, where nodes represent operations and edges represent data paths.
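The permissible sequences for the six-instruction segment can be checked mechanically from its data dependencies. The sketch below encodes which instructions each instruction waits for, enumerates every sequential order that respects those dependencies, and then groups the instructions into steps that a data-driven machine could fire together. The names `deps`, `is_permissible`, and `dataflow_steps` are illustrative, and unlimited functional units are assumed.

```python
from itertools import permutations

# deps[i] = instructions whose results instruction i needs
deps = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {1, 3}, 6: {4, 5}}

def is_permissible(order):
    """True if every instruction appears after all instructions it depends on."""
    done = set()
    for instr in order:
        if not deps[instr] <= done:
            return False
        done.add(instr)
    return True

sequential_orders = [p for p in permutations(deps) if is_permissible(p)]

def dataflow_steps():
    """Group instructions into steps; all instructions in a step can fire together."""
    done, steps = set(), []
    while len(done) < len(deps):
        ready = sorted(i for i in deps if i not in done and deps[i] <= done)
        steps.append(ready)
        done.update(ready)
    return steps
```

Under these assumptions the enumeration yields exactly the five sequences listed above, while the data-driven grouping needs only four steps, because instructions 2 and 3, and then 4 and 5, can execute in parallel.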

Data Flow Computer architecture

• Pure dataflow computers are further classified as:

• static

• dynamic

• Any dataflow computer is data driven, and hence it executes a program by receiving, processing, and sending out tokens.

• These tokens consist of some data and a tag. The tags are used for representing all types of dependences between instructions.

• Thus dependencies are handled by translating them into tag matching and tag transformation.

• The processing unit is composed of two parts: a matching unit, used for matching the tokens, and an execution unit, used for the actual execution of instructions.

• When the processing element receives a token, the matching unit performs the matching operation, and once a set of matched tokens is available the execution unit begins processing.

• The type of operation to be performed by the instruction has to be fetched from the instruction store, where it is stored as the tag information. This information contains details about:

• what operation has to be performed on the data

• how to transform the tags.

• There are a variety of static, dynamic, and also hybrid dataflow computing models.

• In the static model, at most one token can be placed on an edge at any time.

• When firing an actor, no token is allowed on the output edge of the actor.

• Control tokens must be used to acknowledge the proper timing in transferring data tokens from one node to another.

• The dynamic model of dataflow computer architecture allows more than one token to be placed on an edge at the same time.

• To implement this feature, tokens are tagged. The tag identifies the conceptual position of the token in the token flow, i.e., the label attached to each token uniquely identifies the context in which that token is used.

• Static and dynamic data flow architectures have a pipelined ring structure, with the ring having four resource sections:

The memories, used for storing the instructions

The processor units, which form the task force for parallel execution of enabled instructions

The routing network, used to pass the result data tokens to their destination instructions

The input/output unit, which serves as an interface between the data flow computer and the outside world

Static Data Flow architecture

• The data flow graph used in the Dennis machine must follow the static execution rule that only one token is allowed to exist on any arc at any given time; otherwise, successive sets of tokens cannot be distinguished. Thus, instead of a FIFO queue of tokens on each arc, a simpler design is used in which an arc can hold at most one data token.

• This is called static because tokens are not labeled, and control tokens are used for acknowledgement so that data tokens are transferred from node to node with proper timing.

• Here the complete program is loaded into memory before execution begins.

• Same storage space is used for storing both the instructions as well as data.

• To implement this, acknowledge arcs are implicitly added to the dataflow graph; they run in the opposite direction to each existing arc and carry an acknowledgment token.

• Some examples of static data flow computers are MIT Static Dataflow, DDM1 Utah Data Driven, LAU System, TI Distributed Data Processor, and NEC Image Pipelined Processor.
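The static execution rule (at most one token per arc, with a producer held back until its output arc has been emptied) can be sketched as below. This is a simplified model, not the Dennis machine's actual mechanism: the `Arc` class and the `can_fire` and `fire` helpers are hypothetical, and emptying an arc stands in for the acknowledgment token.

```python
class Arc:
    """An arc in a static dataflow graph: holds at most one token."""

    def __init__(self):
        self.token = None

    def is_empty(self):
        return self.token is None

    def put(self, value):
        if not self.is_empty():
            raise RuntimeError("static rule violated: arc already holds a token")
        self.token = value

    def take(self):
        value, self.token = self.token, None
        return value  # emptying the arc plays the role of the acknowledgment

def can_fire(input_arcs, output_arc):
    # Enabled only if every input has a token AND the output arc is free.
    return all(not a.is_empty() for a in input_arcs) and output_arc.is_empty()

def fire(operation, input_arcs, output_arc):
    if not can_fire(input_arcs, output_arc):
        return False
    output_arc.put(operation(*[a.take() for a in input_arcs]))
    return True

x, y, out = Arc(), Arc(), Arc()
add = lambda a, b: a + b
x.put(2); y.put(5)
first = fire(add, [x, y], out)    # fires: out now holds 7
x.put(1); y.put(1)
second = fire(add, [x, y], out)   # blocked: out has not been consumed yet
consumed = out.take()             # consumer empties the arc (acknowledgment)
third = fire(add, [x, y], out)    # now allowed to fire
```

The second firing attempt is refused even though both operands are present, which is exactly the restriction the dynamic model removes by tagging tokens.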

Case study of MIT Static dataflow computer

• It consists of five major sections connected by channels through which information is sent in the form of discrete tokens (packets):

• Memory section: It consists of instruction cells, which hold instructions and their operands. The memory section is a collection of memory cells, each composed of three memory words that represent an instruction template. The first word of each instruction cell contains the op-code and destination address(es), and the next two words represent the operands.

• Processing section: It consists of many pipelined functional units, which perform functional operations on data tokens, form the result packet(s), and send the result token(s) to the memory section.

• Arbitration network: It delivers operation packets from the memory section to the processing section. Its purpose is to establish a smooth flow of enabled instructions (i.e., instruction packets) from the memory section to the processing section. An instruction packet contains the corresponding op-code, operand value(s), and destination address(es).

• Control network: It delivers control tokens from the processing section to the memory section. The control network reduces the load on the distribution network by transferring the Boolean tokens and the acknowledgement signals from the processing section to the memory section.

• Distribution network: It delivers data tokens from the processing section to the memory section.
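The three-word instruction cell and the operation packet described for the memory section can be sketched as a small data structure. This is an illustrative model only: the class name, field names, and the string op-code are assumptions, and destinations are simplified to plain cell addresses.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InstructionCell:
    """Three-word instruction template: op-code + destinations, then two operands."""
    opcode: str
    destinations: List[int]  # addresses of the cells that receive the result
    operands: List[Optional[float]] = field(default_factory=lambda: [None, None])

    def receive(self, port, value):
        """Store an arriving data token in operand word `port` (0 or 1)."""
        self.operands[port] = value

    def is_enabled(self):
        return all(v is not None for v in self.operands)

    def to_operation_packet(self):
        """Form the packet handed to the arbitration network, then clear operands."""
        packet = (self.opcode, tuple(self.operands), tuple(self.destinations))
        self.operands = [None, None]
        return packet

cell = InstructionCell("ADD", destinations=[7])
cell.receive(0, 3.0)
enabled_before = cell.is_enabled()   # second operand still missing
cell.receive(1, 4.0)
packet = cell.to_operation_packet()  # op-code, operand values, destinations
```

The packet carries the same three pieces of information the slide lists for an instruction packet: op-code, operand values, and destination addresses.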

Dynamic Dataflow Architecture

• In a dynamic machine, data tokens are tagged (labeled or colored) to allow multiple tokens to appear simultaneously on any input arc of an operator.

• No control tokens are needed to acknowledge the transfer of data tokens among the instructions.

• The tagging is achieved by attaching to each token a label that uniquely identifies the context of that particular token.

• In principle, this dynamically tagged data flow model lets the maximum parallelism in the program graph be exploited.

• Conceptually, each invocation of a subgraph has its own copy of the graph. In reality, only one copy of the graph is kept in memory, and tags are used to distinguish between tokens that belong to each invocation.

• A general instruction format has an opcode, the number of constants stored in the instruction, and the number of destinations for the result token.

• Each destination is identified by four fields: the destination address, the input port at the destination instruction, the number of tokens needed to enable the destination, and the assignment function used in selecting the processing element for the execution of the destination instruction.

• The dynamic architecture has the following characteristics that distinguish it from the static architecture.

• Program nodes can be instantiated at run time, unlike in the static architecture, where the program is loaded at the beginning.

• Several instances of a data packet can be enabled, and separate storage space is used for instructions and data.

• The dynamic architecture requires storage space for the unmatched tokens.

• A first-in, first-out token queue is not suitable for storing the tokens.

• A tag contains a unique subgraph invocation ID, as well as an iteration ID if the subgraph is a loop.

• These pieces of information, taken together, are commonly known as the color of the token.

• No acknowledgement mechanism is required. The term “coloring” is used for the token labeling operations, and tokens with the same color belong together.
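Tag matching with a store for unmatched tokens can be sketched as follows. This is a minimal illustration, assuming a tag is an (invocation ID, iteration ID) pair and that an instruction is enabled once a given number of tokens with the same tag has arrived; the `MatchingStore` class and its method are hypothetical names.

```python
class MatchingStore:
    """Holds unmatched tokens until all partners with the same tag have arrived."""

    def __init__(self):
        self.waiting = {}  # (tag, destination) -> {port: value}

    def receive(self, tag, destination, port, value, needed=2):
        """Return the matched operand list once `needed` tokens are present, else None."""
        key = (tag, destination)
        slot = self.waiting.setdefault(key, {})
        slot[port] = value
        if len(slot) < needed:
            return None
        del self.waiting[key]
        return [slot[p] for p in sorted(slot)]

store = MatchingStore()
# Tokens from two iterations of the same loop arrive interleaved on the same arcs.
r1 = store.receive(("f1", 0), "add", 0, 10)  # iteration 0, left operand: waits
r2 = store.receive(("f1", 1), "add", 0, 20)  # iteration 1, left operand: waits
r3 = store.receive(("f1", 0), "add", 1, 1)   # completes iteration 0
r4 = store.receive(("f1", 1), "add", 1, 2)   # completes iteration 1
```

The store is keyed by tag rather than by arrival order, which is why a plain FIFO queue is not enough: tokens of different colors may sit on the same arc at once and must never be paired with each other.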

Dynamic Data Flow architecture

CONTD..

• In dynamic machines, data tokens are tagged (labeled or colored) to allow multiple tokens to appear simultaneously on any input arc of an operator node.

• No control tokens are needed to acknowledge the transfer of data tokens among instructions.

• Instead, the matching of token tags (labels or colors) is performed to merge them for instructions requiring more than one operand token.

• Therefore, additional hardware is needed to attach tags onto data tokens and to perform tag matching.

CONTD..

• We shall present the Arvind machine. This machine was designed with the following objectives:

1) Modularity: The machine should be constructed from only a few different component types, regularly interconnected, but internally these components will probably be quite complex (e.g., a processor).

2) Reliability and Fault-Tolerance: Components should be pooled, so removal of a failed component may lower speed and capacity but not the ability to complete a computation.

CONTD..

• This project originated at the University of California at Irvine and now continues at the Massachusetts Institute of Technology by Arvind and his associates.

CONTD..

• The Irvine machine was proposed to consist of multiple PE clusters.

• All PE clusters (physical domains) can operate concurrently. Here a PE is organized as a pipelined processor.

• Each box in the figure is a unit that performs work on one item at a time drawn from FIFO input queue(s).

• The physical domains are interconnected by two system buses.

• The token bus is a pair of bidirectional shift-register rings.

• Each ring is partitioned into as many slots as there are PEs, and each slot is either empty or holds one data token.

• The token rings are used to transfer tagged tokens among the PEs.

CONTD..

• Each cluster of PEs (four PEs per cluster, as shown in the figure) shares a local memory through a local bus and a memory controller.

• A global bus is used to transfer data structures among the local memories.

• Each PE must accept all tokens that are sent to it and sort those tokens into groups by activity name.

• When all input tokens for an activity have arrived (through tag matching), the PE must execute that activity.

• The U-interpreter can help implement iterative or procedure computations by mapping the loop or procedure instances onto the PE clusters for parallel execution.

CONTD..

• The Arvind machine at MIT is modified from the Irvine machine, but still based on the ID Language.

• Instead of using token rings, the Arvind machine has chosen to use an N x N packet switch network for inter-PE communications as demonstrated in Figure.

• The machine consists of N PEs, where each PE is a complete computer with an instruction set, a memory, tag-matching hardware, etc.

• Activities are divided among the PEs according to a mapping from tags to PE numbers.

• Each PE uses a statistically chosen assignment function to determine the destination PE number.
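A mapping from tags to PE numbers can be sketched with a hash-style assignment function. This is purely illustrative: the slides do not specify the function used by the Arvind machine, so `assign_pe` and the choice of CRC-32 are assumptions made only to show the idea of spreading activities over N PEs while keeping it deterministic.

```python
import zlib

def assign_pe(tag, n_pes):
    """Map a token's tag to a PE number in the range 0 .. n_pes - 1."""
    # crc32 is used only because it is deterministic across runs;
    # it is a stand-in, not the function used by the actual machine.
    return zlib.crc32(repr(tag).encode("utf-8")) % n_pes

pe_a = assign_pe(("f1", 0), 8)
pe_b = assign_pe(("f1", 0), 8)  # same tag, so same PE
```

Whatever function is chosen, tokens carrying the same tag must map to the same PE so that they can meet in that PE's tag-matching hardware.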

THANK YOU
