DAA Unit V: Parallel Algorithms and Concurrent Algorithms
By Prof. B. A. Khivsara, Assistant Professor,
Department of Computer Engineering, SNJB's COE Chandwad
Outline
• Sequential and Parallel Computing
• RAM and PRAM Models
• Amdahl's Law
• Brent's Theorem
• Parallel Algorithm Analysis and optimal parallel algorithms
• Graph Problems
• Concurrent Algorithms
• Dining Philosophers Problem
Sequential Vs Parallel Computing

Sequential | Parallel
Sequential programming involves a consecutive and ordered execution of processes, one after another. | Parallel programming involves the concurrent computation or simultaneous execution of processes or threads at the same time.
Processes run one after another, in succession. | Multiple processes execute at the same time.
Simple to implement. | Complex to implement.
Processor utilization is poor. | Processor utilization is efficient.
Random Access Machine (RAM) Model

RAM model of serial computers:
• Memory is a sequence of words, each capable of containing an integer.
• Each memory access takes one unit of time.
• Basic operations (add, multiply, compare) take one unit of time.
• Instructions are not modifiable.
• Read-only input tape, write-only output tape.
RAM Sequential Algorithm Steps

• A READ phase, in which the processor reads a datum from a memory location and copies it into a register.
• A COMPUTE phase, in which the processor performs a basic operation on data from one or two of its registers.
• A WRITE phase, in which the processor copies the contents of an internal register into a memory location.
Parallel Algorithm

In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result.
Approaches to parallelism

Shared-memory parallel processing:
• A thread consists of a single flow of control, a program counter, a call stack, and a small amount of thread-specific data.
• Threads share memory, and communicate by reading and writing to that memory.

Message-passing parallel processing:
• A process is a thread that has its own private memory.
• Processes send messages to one another.
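The two approaches can be contrasted in a small Python sketch. This is illustrative only, not from the slides: the names `shared`, `worker`, and `sender` are invented, and Python threads stand in for processes on the message-passing side.

```python
import threading
import queue

# --- Shared-memory parallel processing: threads share one address space ---
shared = {"count": 0}
lock = threading.Lock()          # needed because threads write the same memory

def worker(n):
    for _ in range(n):
        with lock:               # communicate by reading/writing shared memory
            shared["count"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert shared["count"] == 4000

# --- Message-passing parallel processing: no shared state, only messages ---
inbox = queue.Queue()

def sender(msgs):
    for m in msgs:
        inbox.put(m)             # communicate only by sending messages

s = threading.Thread(target=sender, args=([1, 2, 3],))
s.start(); s.join()
received = [inbox.get() for _ in range(3)]
assert received == [1, 2, 3]
```

Without the lock, the shared-memory version could lose updates; the message-passing version avoids shared state entirely, which is the trade-off the two approaches represent.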
PRAM [Parallel Random Access Machine]

PRAM is composed of:
• P processors, each with its own unmodifiable program.
• A single shared memory composed of a sequence of words, each capable of containing an arbitrary integer.
• A read-only input tape.
• A write-only output tape.

The PRAM model is a synchronous, MIMD, shared address space parallel computer.

[Figure: the PRAM model of computation — processors connected to a shared memory]
PRAM model CharacteristicsPRAM model Characteristics
At each unit of time, a processor is eitheractive or idle (depending on id)At each unit of time, a processor is eitheractive or idle (depending on id)
All processors execute same programAll processors execute same programAll processors execute same programAll processors execute same program
At each time step, all processors execute sameinstruction on different data (“data-parallel”)At each time step, all processors execute sameinstruction on different data (“data-parallel”)
Focuses on concurrency onlyFocuses on concurrency only
Variants of PRAM model

                 Exclusive Write   Concurrent Write
Exclusive Read        EREW              ERCW
Concurrent Read       CREW              CRCW

Concurrent (C) means many processors can do the operation simultaneously on the same memory location; Exclusive (E) means they cannot.
More PRAM taxonomy

EREW - exclusive read, exclusive write
• A program isn't allowed to have two processors access the same memory location at the same time.
CREW - concurrent read, exclusive write
CRCW - concurrent read, concurrent write
• Needs a protocol for arbitrating write conflicts.
CROW - concurrent read, owner write
• Each memory location has an official "owner".
Sub-variants of CRCW

Common CRCW
• Concurrent write succeeds iff all processors write the same value.
Arbitrary CRCW
• An arbitrary value from the write set is stored.
Priority CRCW
• The value of the minimum-index processor is stored.
Combining CRCW
• A combination (e.g., sum or maximum) of the written values is stored.
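A hypothetical sketch of how the sub-variants resolve one concurrent write to a single cell; the function `crcw_write` and its policy names are invented for this illustration.

```python
def crcw_write(writes, policy):
    """Resolve a concurrent write. `writes` maps processor index -> value."""
    values = list(writes.values())
    if policy == "common":
        # write succeeds only if all processors write the same value
        assert len(set(values)) == 1, "Common CRCW requires identical values"
        return values[0]
    if policy == "arbitrary":
        return values[0]                  # any one write "wins"
    if policy == "priority":
        return writes[min(writes)]        # minimum-index processor wins
    if policy == "combining":
        return sum(values)                # e.g. combine by summation

writes = {2: 5, 0: 7, 1: 3}
assert crcw_write(writes, "priority") == 7    # processor 0 has the min index
assert crcw_write(writes, "combining") == 15  # 5 + 7 + 3
```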
Amdahl's law

Amdahl's law, also known as Amdahl's argument, is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
In the case of parallelization, Amdahl's law states that if P is the proportion of a program that can be made parallel and (1 − P) is the proportion that cannot be parallelized (serial), then the maximum speedup that can be achieved by using N processors is

S(N) = 1 / ((1 − P) + P/N)

As N goes to infinity, the maximum speedup S(N) = 1/(1 − P).
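The formula can be sketched directly as a function; `amdahl_speedup` is an illustrative helper name, checked here against the worked examples that follow in these slides.

```python
def amdahl_speedup(P, N):
    """Maximum speedup when fraction P of the work is parallelizable over N processors."""
    return 1.0 / ((1 - P) + P / N)

# Example 1 below: 95% parallel on 8 CPUs -> about 5.9x
assert round(amdahl_speedup(0.95, 8), 1) == 5.9
# Limitations example below: 75% parallel on 4 vs 40 processors
assert round(amdahl_speedup(0.75, 4), 3) == 2.286
assert round(amdahl_speedup(0.75, 40), 3) == 3.721
```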
Amdahl's law Example 1

95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

S = 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9
Amdahl's law Example 2

5% of a parallel program's execution time is spent within inherently sequential code. The maximum speedup achievable by this program, regardless of how many PEs are used, is

lim (p→∞) 1 / (0.05 + (1 − 0.05)/p) = 1/0.05 = 20
Limitations of Amdahl's law

Amdahl's law only applies to cases where the problem size is fixed.

Example
If 75% of a process can be parallelized, and there are four processors, then the possible speedup is
1 / ((1 − 0.75) + 0.75/4) = 2.286
But with 40 processors the speedup is only
1 / ((1 − 0.75) + 0.75/40) = 3.721
This has led to the conclusion that having lots of processors won't help very much.
Brent's Theorem

Brent's theorem specifies that for a sequential algorithm with t time steps and a total of m operations, a run time T is definitely possible on a shared memory machine (PRAM) with p processors.

It may be possible to implement the algorithm faster (by scheduling instructions differently to minimise idle processors, for instance), but it is definitely possible to implement the algorithm in this time given p processors.

Brent's Theorem: Given a parallel algorithm A that performs m computation steps in time t, then p processors can execute algorithm A in time

T = t + (m − t)/p
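The bound can be written as a one-line helper; `brent_time` is an invented name, and the two checks preview the summation examples worked out below.

```python
import math

def brent_time(t, m, p):
    """Brent's bound on parallel run time: T = t + (m - t)/p."""
    return t + (m - t) / p

# Sequential chain sum of n numbers: t = n, m = n -> T = n regardless of p
assert brent_time(100, 100, 8) == 100

# Balanced-tree sum of n = 1024 numbers on p = 16 processors:
# t = log2(n) = 10 steps along the longest chain, m = n operations
n, p = 1024, 16
t = math.log2(n)
assert brent_time(t, n, p) == 10 + (1024 - 10) / 16   # 73.375
```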
Brent's Theorem Example

Say we are summing an array:

    for (i = 0; i < length(a); i++)
        sum = sum + a[i];

Using this algorithm, each add operation depends on the result of the previous one, forming a chain of length n, thus t = n. There are n operations, so m = n.

T = n + 0/p

So no matter how many processors are available, this algorithm will take time n.
Brent's Theorem Example

Now consider adding the array in parallel:
sum(a) = ((A0 + A1) + (A2 + A3)) + ((A4 + A5) + (A6 + A7))

We can add A2 to A3 without needing to know what A0 + A1 is.

For an array of length n, the longest chain(s) will be of length log(n). t = log(n), m = n.

T = log(n) + (n − log(n))/p
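The tree-shaped sum can be simulated sequentially to count the parallel rounds; this sketch (with the invented name `tree_sum`) checks that the number of rounds matches log2(n) for the eight-element example above.

```python
import math

def tree_sum(a):
    """Pairwise reduction; each while-iteration is one parallel time step."""
    rounds = 0
    while len(a) > 1:
        # all the pairs in one round could be added simultaneously
        a = [a[i] + a[i + 1] for i in range(0, len(a) - 1, 2)] + \
            ([a[-1]] if len(a) % 2 else [])
        rounds += 1
    return a[0], rounds

total, rounds = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
assert total == 36
assert rounds == 3 == math.log2(8)   # log(n) rounds for n = 8
```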
Brent's Theorem: Analysis/Importance

This tells us many useful things about the algorithm:
• If we have n processors, the algorithm can be implemented in log(n) time.
• If we have n/log(n) processors, the algorithm can be implemented in 2log(n) time (asymptotically this is the same as log(n) time).
• If we have 1 processor, the algorithm can be implemented in n time.
Brent's Theorem: Analysis/Importance

If we consider the amount of work done in each case: with one processor we do n work, with n/log(n) processors we do about 2n work, but with n processors we do n log(n) work.

The implementations with 1 or n/log(n) processors are therefore cost optimal, while the implementation with n processors is not.

It is important to remember that Brent's theorem does not tell us how to implement any of these algorithms in parallel; it merely tells us what is possible.
Efficient and optimal parallel algorithms: Analysis

A parallel algorithm is efficient iff it is fast (e.g. polynomial time) and the product of the parallel time and the number of processors is close to the time of the best known sequential algorithm:

T_sequential ≈ T_parallel × N_processors

A parallel algorithm is optimal iff this product is of the same order as the best known sequential time.
Metrics: Speedup, Efficiency and Cost

Speedup: S = Time(sequential algorithm, Ts) / Time(parallel algorithm, Tp)
Efficiency: E = S / N, where N is the number of processors
Cost: C = N × Tp
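The three metrics can be sketched together; the function name and the example numbers are invented for illustration.

```python
def metrics(Ts, Tp, N):
    """Return (speedup S = Ts/Tp, efficiency E = S/N, cost C = N*Tp)."""
    S = Ts / Tp
    return S, S / N, N * Tp

# e.g. sequential time 64 units, parallel time 10 units on N = 8 processors
S, E, C = metrics(64, 10, 8)
assert S == 6.4 and E == 0.8 and C == 80
```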
Metrics: Analysis

A parallel algorithm is cost-optimal if parallel cost = sequential time:
• Cp = T1
• Ep = 100%

Critical when down-scaling: a parallel implementation may become slower than the sequential one, e.g.:
• T1 = n^3
• Tp = n^2.5 when p = n^2
• Cp = n^4.5
Analysis of a parallel algorithm: Example

Adding n numbers in parallel

Example for 8 numbers: we start with 4 processors and each of them adds 2 items in the first step. The number of items is halved at every subsequent step. Hence log n steps are required for adding n numbers. The processor requirement is O(n).

A parallel algorithm is analyzed mainly in terms of its time, processor and work complexities.
• Time complexity T(n): How many time steps are needed? T(n) = O(log n)
• Processor complexity P(n): How many processors are used? P(n) = O(n)
• Work complexity W(n): What is the total work done by all the processors? W(n) = O(n log n)
Graph Problems: Topic Overview

• General framework
• Minimum Spanning Tree: Prim's Algorithm
• Single-Source Shortest Paths: Dijkstra's Algorithm
• Bipartite Graph

Representation of Graphs
(a) An undirected graph and (b) a directed graph.
Graph Problem: General Framework

Let M be an n×n matrix with nonnegative integer coefficients. Let M* be an n×n matrix defined as:
• M*(i,i) = 0 for every i
• M*(i,j) = min { M(i0,i1) + M(i1,i2) + … + M(ik−1,ik) } for every i ≠ j,
where i0 = i, ik = j, and the min is taken over all sequences i0, i1, …, ik.
Algorithm for M*

M[i,j] = M[i,j] for 1 ≤ i, j ≤ n
for r = 1 to log n do {
    q[i,k,j] = M[i,k] + M[k,j] for 1 ≤ i, j, k ≤ n
    In parallel set M[i,j] = min { q[i,1,j], q[i,2,j], …, q[i,n,j] } for 1 ≤ i, j ≤ n
}
Put M*(i,i) = 0 for all i and M*(i,j) = M[i,j] for i ≠ j
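A sequential sketch of the repeated min-plus squaring (the parallelism is only indicated in comments; names are invented). One assumption differs from the slide's wording: the diagonal is zeroed before squaring so that shorter sequences survive each round.

```python
import math

def min_plus_square(M):
    n = len(M)
    # In parallel (conceptually): q[i][k][j] = M[i][k] + M[k][j],
    # then M[i][j] = min over k of q[i][k][j]
    return [[min(M[i][k] + M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def m_star(M):
    """M*(i,i) = 0; M*(i,j) = min total weight over all sequences from i to j."""
    n = len(M)
    # zero the diagonal so shorter sequences survive each squaring
    A = [[0 if i == j else M[i][j] for j in range(n)] for i in range(n)]
    for _ in range(math.ceil(math.log2(n))):   # log n squarings suffice
        A = min_plus_square(A)
    return A
```

Run on the 6×6 matrix of the example below, this reproduces the stated M* (e.g. M*(1,5) = 0 and M*(3,1) = 1).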
Example

[Figure: a directed graph on vertices 1–6]

M =
0 0 1 1 1 0
1 0 0 0 0 1
1 1 0 1 1 1
1 1 0 0 1 1
1 1 1 0 0 0
0 0 1 0 1 0

M* =
0 0 0 0 0 0
0 0 0 0 0 0
1 1 0 1 1 1
1 1 0 0 1 1
0 0 0 0 0 0
0 0 0 0 0 0

M*(1,5) = 0 since M(1,2) + M(2,5) = 0
M*(3,1) = 1 since for every choice of i1, i2, …, ik−1 the sum M(i0,i1) + M(i1,i2) + … is > 0
Minimum Spanning Tree

A spanning tree of an undirected graph G is a subgraph of G that is a tree containing all the vertices of G.

In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph.

A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.
Minimum Spanning Tree
An undirected graph and its minimum spanning tree.
Minimum Spanning Tree: Prim's Algorithm

Prim's algorithm for finding an MST is a greedy algorithm.
Start by selecting an arbitrary vertex and include it in the current MST.
Grow the current MST by inserting into it the vertex closest to one of the vertices already in the current MST.
Minimum Spanning Tree: Prim's Algorithm

Prim's sequential minimum spanning tree algorithm.
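Since the figure with the pseudocode is not reproduced in this transcript, here is a minimal sequential sketch of Prim's algorithm; the function name is invented, and the input is an adjacency matrix with `float('inf')` marking absent edges.

```python
INF = float("inf")

def prim_mst_weight(W, start=0):
    """Total weight of an MST of the graph given by adjacency matrix W."""
    n = len(W)
    in_mst = [False] * n
    in_mst[start] = True
    d = W[start][:]              # d[v]: lightest edge from v into the current MST
    total = 0
    for _ in range(n - 1):
        # pick the vertex closest to the current MST -- this is the inner
        # loop that the parallel formulation distributes across processes
        u = min((v for v in range(n) if not in_mst[v]), key=lambda v: d[v])
        total += d[u]
        in_mst[u] = True
        for v in range(n):       # update d with edges leaving u
            if not in_mst[v] and W[u][v] < d[v]:
                d[v] = W[u][v]
    return total

# tiny example: triangle with weights 1, 2, 3 -> MST weight 1 + 2 = 3
assert prim_mst_weight([[INF, 1, 3], [1, INF, 2], [3, 2, INF]]) == 3
```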
Prim's Algorithm: Parallel Formulation

The algorithm works in n outer iterations; it is hard to execute these iterations concurrently.
The inner loop is relatively easy to parallelize. Let p be the number of processes, and let n be the number of vertices.
The adjacency matrix is partitioned in a 1-D block fashion, with the distance vector d partitioned accordingly.
In each step, a processor selects the locally closest node, followed by a global reduction to select the globally closest node.
This node is inserted into the MST, and the choice is broadcast to all processors.
Each processor updates its part of the d vector locally.
Prim's Algorithm: Parallel Formulation

The partitioning of the distance array d and the adjacency matrix A among p processes.
Prim's Algorithm: Parallel Analysis

• The cost of a broadcast is O(log p).
• The cost of the local update of the d vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time for n vertices with n iterations is O(n²/p + n log p).
• The corresponding isoefficiency is O(p² log² p).
Single-Source Shortest Paths

For a weighted graph G = (V, E, w), the single-source shortest paths problem is to find the shortest paths from a vertex v ∈ V to all other vertices in V.

Dijkstra's algorithm is similar to Prim's algorithm. It maintains a set of nodes for which the shortest paths are known.

It grows this set based on the node closest to the source, using one of the nodes in the current shortest path set.
Single-Source Shortest Paths: Dijkstra's Algorithm

Dijkstra's sequential single-source shortest paths algorithm.
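As with Prim's algorithm, the figure's pseudocode is not in the transcript, so here is a minimal sequential sketch; the function name is invented and the input is an adjacency matrix with `float('inf')` for absent edges.

```python
INF = float("inf")

def dijkstra(W, src=0):
    """l[v]: shortest distance from src to v; grows the known set greedily."""
    n = len(W)
    known = [False] * n
    l = [INF] * n
    l[src] = 0
    for _ in range(n):
        # pick the unknown node closest to the source -- the step each
        # process performs locally in the parallel formulation
        u = min((v for v in range(n) if not known[v]), key=lambda v: l[v])
        known[u] = True
        for v in range(n):       # relax edges out of the newly added node
            if not known[v] and l[u] + W[u][v] < l[v]:
                l[v] = l[u] + W[u][v]
    return l

# edges 0-1:4, 0-2:1, 1-2:2, 1-3:5, 2-3:8 -> distances from 0
assert dijkstra([[0, 4, 1, INF],
                 [4, 0, 2, 5],
                 [1, 2, 0, 8],
                 [INF, 5, 8, 0]]) == [0, 3, 1, 8]
```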
Dijkstra's Algorithm: Parallel Formulation

The weighted adjacency matrix is partitioned using the 1-D block mapping.
Each process selects, locally, the node closest to the source, followed by a global reduction to select the next node.
The node is broadcast to all processors and the l-vector is updated.
The parallel performance of Dijkstra's algorithm is identical to that of Prim's algorithm.
Dijkstra's Algorithm: Parallel Analysis

• The cost of a broadcast is O(log p).
• The cost of the local update of the distance vector is O(n/p).
• The parallel time per iteration is O(n/p + log p).
• The total parallel time for n vertices with n iterations is O(n²/p + n log p).
• The corresponding isoefficiency is O(p² log² p).
A Bipartite Graph

It is an undirected graph G = (V, E) in which V can be partitioned into two (disjoint) sets V1 and V2 such that (u, v) ∈ E implies either u ∈ V1 and v ∈ V2, or vice versa.

That is, all edges go between the two sets V1 and V2, but are not allowed within V1 or V2.
A Bipartite Graph Example
Graph (Bipartite) Matching

Graph G = (V, E).
A graph matching M is a subset of E:
• if e1, e2 are distinct edges in M,
• then they have no vertex in common.
A matching in a bipartite graph is a set of edges chosen in such a way that no two edges share an endpoint.
A Bipartite Graph MatchingA Bipartite Graph Matching
[Figure: bipartite graph with vertex sets U = {1, 2, 3, 4, 5} and V = {6, 7, 8, 9}, shown (a) unmatched and (b) with matching edges highlighted]
Matching M = {(1,6), (3,7), (5,9)}
|M| = 3 edges
No edge in M touches nodes 2, 4, or 8, so these are free nodes.
(a) Bipartite Graph (b) Bipartite Graph Matching
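The "no shared endpoint" property and the free nodes of this example can be verified in a few lines (`is_matching` is a hypothetical helper, not from the slides):

```python
def is_matching(edges):
    """A set of edges is a matching iff no vertex appears in two edges."""
    seen = set()
    for u, v in edges:
        if u in seen or v in seen:
            return False
        seen.update((u, v))
    return True

M = {(1, 6), (3, 7), (5, 9)}
matched = {x for e in M for x in e}
free = set(range(1, 10)) - matched
print(is_matching(M))  # True
print(sorted(free))    # [2, 4, 8] -- the free nodes from the example
```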
A Bipartite Graph: Maximum Matching
Consider free node 4: add the new edge (4,7) to M and remove edge (3,7) from M, so node 3 becomes free.
Blue edges denote original edges in M, green edges denote edges removed from M, and red edges denote edges newly added to M.
Consider the new free node 3: add edge (3,9) to M and remove edge (5,9) from M, so node 5 becomes free.
A Bipartite Graph: Maximum Matching
The new free node is 5, so the new edge (5,8) is added to M. Now only node 2 is free, but adding edge (2,6) is not feasible. No node remains in the free list, so stop.
There is an alternating path (4,7), (7,3), (3,9), (9,5), (5,8). The path 4-7-3-9-5-8 is an augmenting path P, as it starts and ends at free nodes.
New M = {(1,6), (3,9), (4,7), (5,8)}
|M| = 4 edges
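The update from M to the new M is exactly the symmetric difference M ⊕ P, which can be replayed on the example's own numbers (the `augment` helper is ours):

```python
def augment(M, P):
    """M XOR P: keep edges in exactly one of the two sets. Edges become
    frozensets so (3, 7) and (7, 3) compare equal."""
    M = {frozenset(e) for e in M}
    P = {frozenset(e) for e in P}
    return M ^ P  # symmetric difference grows the matching by one edge

M = [(1, 6), (3, 7), (5, 9)]
P = [(4, 7), (7, 3), (3, 9), (9, 5), (5, 8)]  # augmenting path 4-7-3-9-5-8
new_M = augment(M, P)
print(sorted(tuple(sorted(e)) for e in new_M))
# [(1, 6), (3, 9), (4, 7), (5, 8)] -- |M| went from 3 to 4
```

The matched edges on P (here (3,7) and (5,9)) are dropped and the unmatched edges are added; since P has one more unmatched edge than matched, |M| grows by one.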
Augmenting Path P in G, Given Graph Matching M
• An augmenting path p = (e1, e2, …, ek)
  begins and ends at unmatched vertices, and requires
  e1, e3, e5, …, ek ∈ E − M
  e2, e4, …, e(k−1) ∈ M
Maximum Matching Algorithm
• Algorithm:
  input: graph G = (V, E)
  [1] M ← {}
  [2] while there exists an augmenting path P relative to M do
          M ← M ⊕ P
  [3] output: maximum matching M
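A runnable sequential sketch of this scheme is Kuhn's augmenting-path algorithm. The adjacency below is reconstructed from the edges visible in the running example, so treat it as an assumption:

```python
def max_matching(adj_U):
    """Augmenting-path maximum matching for a bipartite graph (Kuhn's
    algorithm): DFS for an augmenting path from each U-vertex.
    adj_U maps each U-vertex to its V-neighbours."""
    match_V = {}  # V-vertex -> U-vertex currently matched to it

    def try_augment(u, visited):
        for v in adj_U[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if v not in match_V or try_augment(match_V[v], visited):
                match_V[v] = u
                return True
        return False

    for u in adj_U:
        try_augment(u, set())
    return {(u, v) for v, u in match_V.items()}

# Edges taken from the slides' example: U = {1..5}, V = {6..9}
adj = {1: [6], 2: [6], 3: [7, 9], 4: [7], 5: [8, 9]}
M = max_matching(adj)
print(len(M))  # 4 -- node 2 stays free, matching the example
```

Each successful `try_augment` call is one application of M ← M ⊕ P: re-matching along the DFS path flips matched and unmatched edges.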
Parallel Algorithm for Bipartite Graph
For parallelism, we discover multiple vertex-disjoint augmenting paths simultaneously.
When searching for vertex-disjoint augmenting paths, we build alternating forests.
On parallel platforms, we can search for and build (vertex-disjoint) alternating forests in two ways:
(1) Source parallel (DFS style): each processor builds one alternating tree in the forest, starting from an unmatched vertex.
(2) Level parallel (BFS style): all processors together build one level of the BFS forest, then move to the next level.
The first approach employs coarse-grained parallelism, because a processor builds a whole tree.
The second approach employs fine-grained parallelism, since a processor can process a single vertex.
Parallelizing Augmenting Path Searches
[Figure: the two strategies on a small bipartite graph with vertices a–e and a'–e'. Source parallel (DFS based, coarse grained): processors T1 and T2 each grow one alternating tree. Level parallel (BFS based, fine grained): T1 and T2 together build levels L0, L1, L2, L3 of the forest.]
Parallel Algorithm for Bipartite Graph
However, note that this searching is not the same as traditional BFS/DFS.
For example, it is not a single parallel DFS, because of the following differences:
(a) We search from multiple sources, i.e., we run multiple DFSs simultaneously.
(b) The paths are vertex-disjoint, which requires an extra check. For example, edges marked with "x" are not part of the forest, to maintain the vertex-disjointness property.
(c) The paths are alternating, i.e., every other step is trivial.
Parallelizing Augmenting Path Searches
[Figure: parallel DFS on the same graph (vertices a–e, a'–e'); trees T1 and T2 illustrate the three differences: 1. multiple source, 2. vertex-disjoint, 3. alternating DFS]
Outline
Sequential and Parallel Computing
RAM and PRAM Models
Amdahl's Law
Brent's Theorem
Parallel Algorithm Analysis and Optimal Parallel Algorithms
Graph Problems
Concurrent Algorithms
Dining Philosophers Problem
Concurrent Algorithms
Concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other.
The computations may be executing on multiple cores in the same chip, or as preemptively time-shared threads on the same processor.
Models of concurrent computation include the Parallel Random Access Machine (PRAM) model and the Actor model.
Advantages of Concurrency
Concurrent processes can reduce duplication in code.
The overall runtime of the algorithm can be significantly reduced.
More real-world problems can be solved than with sequential algorithms alone.
Redundancy can make systems more reliable.
Disadvantages of Concurrency
Runtime is not always reduced, so careful planning is required.
Concurrent algorithms can be more complex than sequential algorithms.
Shared data can be corrupted.
Communication between tasks is needed.
Achieving Concurrency
Many computers today have more than one processor (multiprocessor machines).
[Figure: two CPUs (CPU 1, CPU 2) connected to a shared memory over a bus]
Achieving Concurrency
Concurrency can also be achieved on a computer with only one processor:
The computer "juggles" jobs, swapping its attention to each in turn.
"Time slicing" allows many users to get CPU resources.
Tasks may be suspended while they wait for something, such as device I/O.
[Figure: one CPU switching among task 1, task 2, and task 3; suspended tasks are shown sleeping]
Concurrency vs. Parallelism
Concurrency is the execution of multiple tasks at the same time, regardless of the number of processors.
Parallelism is the execution of multiple processors on the same task.
Types of Concurrent Systems
Multiprogramming
Multiprocessing
Multitasking
Distributed Systems
Examples of Concurrent Algorithms
Producer/Consumer Bounded-Buffer
Readers and Writers
Dining Philosophers…
Outline
Sequential and Parallel Computing
RAM and PRAM Models
Amdahl's Law
Brent's Theorem
Parallel Algorithm Analysis and Optimal Parallel Algorithms
Graph Problems
Concurrent Algorithms
Dining Philosophers Problem
Dining Philosophers Problem
Philosophers spend their lives alternating between thinking and eating.
Five philosophers are seated around a circular table.
There is a shared bowl of rice.
In front of each one is a plate.
Between each pair of philosophers there is a chopstick, so there are five chopsticks.
It takes two chopsticks to take/eat rice, so while philosopher n is eating, neither n+1 nor n−1 can be eating.
Dining Philosophers Problem
Each one thinks for a while, gets the chopsticks/forks needed, eats, and puts the chopsticks/forks down again, in an endless cycle.
The problem illustrates the difficulty of allocating resources among processes without deadlock and starvation.
Dining Philosophers Problem
The challenge is to grant requests for chopsticks while avoiding deadlock and starvation.
Deadlock can occur if everyone tries to get their chopsticks at once: each gets a left chopstick and is stuck, because each right chopstick is someone else's left chopstick.
The Dining Philosophers Problem
Philosophers:
• Think
• Take forks (one at a time)
• Eat
• Put forks (one at a time)
Eating requires 2 forks.
Pick up one fork at a time.
How to prevent deadlock?
What about starvation?
What about concurrency?
The idea is to capture the concept of multiple processes competing for limited resources.
Dining Philosopher’s Problem(Dijkstra ’71)
What can go wrong?
Starvation: a policy that can leave some philosopher hungry in some situation.
Deadlock: a policy that leaves all the philosophers "stuck", so that nobody can do anything at all.
An Incorrect Solution: Deadlock Is Possible
[Figure: all five philosophers each hold one fork; the marker in the figure means "waiting for this fork"]
An Incorrect Solution: Starvation Is Possible
[Figure: two philosophers alternate eating while a third is blocked indefinitely: starvation]
Dining Philosophers Solution
Each philosopher is a process, with one semaphore per fork:
fork: array[0..4] of semaphores
Initialization: fork[i].count := 1 for i := 0..4

Process Pi:
repeat
    think;
    wait(fork[i]);
    wait(fork[i+1 mod 5]);
    eat;
    signal(fork[i+1 mod 5]);
    signal(fork[i]);
forever

Note: deadlock is possible if each philosopher starts by picking up his left fork!
Dining Philosophers Solution: Possible Solutions to Avoid Deadlock
Allow at most four philosophers to be sitting simultaneously at the table.
Allow a philosopher to pick up the forks only if both are available.
Use an asymmetric solution: an odd-numbered philosopher picks up first the left chopstick and then the right chopstick; an even-numbered philosopher picks up first the right chopstick and then the left chopstick.
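The asymmetric strategy can be sketched in Python with threading.Lock standing in for the fork semaphores (the thread structure, meal count, and names are illustrative assumptions, not from the slides):

```python
import threading

N = 5
forks = [threading.Lock() for _ in range(N)]
done = []

def philosopher(i, meals=3):
    left, right = forks[i], forks[(i + 1) % N]
    # Asymmetric order: odd philosophers take the left fork first, even
    # take the right first, so a cycle of "everyone holds one fork and
    # waits for the next" cannot form.
    first, second = (left, right) if i % 2 else (right, left)
    for _ in range(meals):
        with first:
            with second:
                pass  # eat
    done.append(i)

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(done))  # [0, 1, 2, 3, 4] -- everyone finished, no deadlock
```

Breaking the uniform left-then-right ordering removes the circular-wait condition, which is one of the four necessary conditions for deadlock.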
Dining Philosophers Solution
A solution: admit only 4 philosophers at a time that try to eat.
Then 1 philosopher can always eat when the other 3 are holding 1 fork.
Hence, we can use another semaphore T that limits to 4 the number of philosophers "sitting at the table".
Initialize: T.count := 4

Process Pi:
repeat
    think;
    wait(T);
    wait(fork[i]);
    wait(fork[i+1 mod 5]);
    eat;
    signal(fork[i+1 mod 5]);
    signal(fork[i]);
    signal(T);
forever
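The pseudocode above maps directly onto threading.Semaphore in Python; this is a minimal sketch, and the `table`/`finished` names and the meal count are our assumptions:

```python
import threading

N = 5
forks = [threading.Semaphore(1) for _ in range(N)]
table = threading.Semaphore(N - 1)  # T: at most 4 philosophers at the table
finished = []

def philosopher(i, meals=3):
    for _ in range(meals):
        table.acquire()               # wait(T)
        forks[i].acquire()            # wait(fork[i])
        forks[(i + 1) % N].acquire()  # wait(fork[i+1 mod 5])
        # eat
        forks[(i + 1) % N].release()  # signal(fork[i+1 mod 5])
        forks[i].release()            # signal(fork[i])
        table.release()               # signal(T)
    finished.append(i)

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(finished))  # [0, 1, 2, 3, 4] -- all terminate, no deadlock
```

With at most 4 philosophers holding forks among 5 forks, the pigeonhole argument from the slide guarantees that at least one of them can acquire both forks, so progress is always possible.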