
Exploring parallel programming

ITF30714 Fordypningsemne (specialization course)

Kristoffer Møgster Berge 02.12.2016


Table of contents

Introduction
Parallelism vs Concurrency
Why parallelism
    The memory wall
    Instruction level parallelism
    The power wall
    Multi core processors
    Better algorithms
Challenges with parallel programming
    The parallelizability of problems
    Race conditions
    Amdahl's law
    Dependency
Approaches
    Message passing
    Shared memory
Finding the largest value in an array
    Block decomposition
    Scaling
Message passing
    The Mandelbrot set
    The solution
    Reviewing the results
    The code
Shared memory
    Merge sort
    Identifying dependency
    Introducing threads
    Scaling
Shortest path in a graph
    Incremental parallel implementation
    Full mapping
    Reviewing the results
Summary
References
1 - Drawing the Mandelbrot set
    The main function
    The master function
    The delegator function
    The slave function
2 - Merge sort
3 - Graph search
    The node object
    Tracing the shortest path
    Incremental implementation
    Full mapping
    Sequential breadth first


Introduction

The use of multi-core processors is becoming more widespread every year. Not only are we seeing this in desktop and laptop computers; even devices like smartphones and smartwatches now have multi-core processors. Software developers now face the challenge of creating software that can take advantage of several cores, since single-core performance isn't increasing as much as it used to. A consumer may take this for granted, but creating software that takes advantage of several cores is in many cases challenging, and in some cases impossible. In this report I will briefly explain why we are seeing more devices with multi-core processors by explaining the problems manufacturers are facing. I will mainly cover the problems with creating parallel programs, and create some programs using different approaches and tools.

Parallelism vs Concurrency

Concurrency is the perception of a program doing several things at once. This doesn't necessarily mean parallelism, as concurrency can be present in software running on a single-core processor. Concurrency can be the result of the operating system performing context switching between several threads, or of asynchronous programming. Although not parallel, asynchronous programming shares many of the pitfalls of parallel programming, such as race conditions. For a program to be truly parallel, it has to utilize several virtual or physical cores on the processor. Given a problem that takes 3 seconds to solve, a concurrent program could solve two instances of the problem in about 6 seconds by context switching, whereas a program able to utilize two processor cores should optimally be able to solve both in 3 seconds.

Making a program do several things concurrently on a system with only one core can still benefit the running time, given that some of that time is spent waiting. An application relying on several requests to servers over a network would waste a lot of time by performing these requests sequentially, waiting for the response from one server before issuing the next request. This can be solved asynchronously by freeing the thread after a request is sent and resuming when the response arrives. Although this can save time, the amount of work done by the program is the same; the program just takes advantage of time otherwise spent waiting. In this paper, however, I will focus on parallel programming.
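To make the distinction concrete, here is a minimal sketch (my illustration, not code from this report) where two simulated one-second network "requests" overlap their waiting time. The fetchFromServer function is a hypothetical stand-in for a real network call:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

//Hypothetical stand-in for a network request that mostly waits
int fetchFromServer(int id) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    return id;
}

int main() {
    auto start = std::chrono::steady_clock::now();
    //Both "requests" are in flight at once, so the waiting overlaps
    std::future<int> a = std::async(std::launch::async, fetchFromServer, 1);
    std::future<int> b = std::async(std::launch::async, fetchFromServer, 2);
    a.get();
    b.get();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "Elapsed: " << ms << " ms\n"; //~1000 ms rather than ~2000 ms
}

The same amount of work is done either way; only the waiting is overlapped.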


Why parallelism

The graph below shows the increase in instruction-level parallelism, power usage, clock speed and transistor count in processors over the last four decades. All trends except the transistor count started flattening out in the early 2000s, and even the steady rate of shrinking transistors (Moore's law) is expected to end as early as 2020 [10]. Without improvements in these fields, processors are not going to become much faster. Processor performance is primarily limited by three factors: the memory wall, the power wall and instruction-level parallelism.

Graph source: http://www.extremetech.com/computing/165331-intels-former-chief-architect-moores-law-will-be-dead-within-a-decade

The memory wall

As the graph illustrates, the clock speed of processors had a yearly increase of about 50% between 1970 and the early 2000s. In the same period, the frequency of RAM only had an annual increase of about 9% [4]. This gap presents a bottleneck: the processor needs to receive data from memory, and since the memory can't keep up, it limits the processor's performance. There is little point in creating faster processors if we can't feed them enough data to take advantage of the extra speed. To mitigate this, processors use caches to limit reading from and writing to memory. The caches are small memory modules on the processor that can be accessed much faster than the RAM, but have a very limited size. Although the increase in processor frequency has almost ceased, we still face this challenge, especially on multi-core systems where the combined processing power can consume more data than the memory can provide.


Instruction level parallelism

The code we write is translated down to instructions the processor can execute. Each instruction may consist of several tasks and can be broken down into smaller pieces, each of which can be executed by a separate part of the processor. A scalar processor has a pipeline that takes one instruction at a time, executing one task after the other. To increase the throughput of a scalar processor we need to decrease the time it takes to perform a task, either by increasing the clock speed or by solving the task more efficiently. A superscalar processor, on the other hand, has a wide pipeline that allows it to perform different tasks in parallel. This means we can utilize several parts of the processor at once by stacking instructions in the pipeline. As we break each task down into smaller tasks, the overlap becomes smaller, the clock speed can be increased and the throughput becomes greater. The illustration shows a pipeline with three parallel instructions consisting of several subtasks, where the numbers represent a type of task the processor can perform. On each step along the length of the pipeline, the processor can perform one of these tasks. Instructions can be stacked as long as subtasks of equal type do not overlap. Overlap can introduce problems like race conditions and requires dependency checking, which adds complexity and time delay. Increasing the number of instructions performed in parallel increases the time and cost of checking for dependencies, thereby limiting the width of the pipeline.

The clock cycle of the processor should be the same length as the longest step in any instruction, so to increase the frequency, the longest step needs to be shortened [6]. To shorten the longest step, we can break it down into smaller steps. A lot of effort has gone into shortening these steps, and here too it seems we have hit a wall [6]. Another way to increase the clock speed is to shrink the components, decreasing the distances the electrical impulses have to travel, thereby shortening the time it takes to execute every task by a small amount. Processor manufacturers have done this at a steady pace since the 70s, as predicted by Gordon Moore in 1965 [5], but it's expected that this will cease as early as 2020 [10].


The power wall

By raising the voltage in the processor we can make the transistors switch quicker and shorten all the steps, but increasing the voltage makes the processor generate more heat. When increasing the frequency of a processor by a factor of two, the heat generated increases by a factor of eight [6]. This heat is potentially harmful to the processor if the cooling isn't sufficient. To limit heat generation, some processors use clock gating to turn off parts of the processor that are not in use; no electricity flowing through an unused part means no heat generated by that part during that clock cycle. In competitive overclocking, liquid nitrogen is used to cool the processor. Needless to say, this isn't a viable solution for most consumer products. To meet the demand for more computational power, the manufacturers have found other ways to improve overall performance.

Multi core processors

With no feasible way to overcome these problems with currently available technology, the solution has been to put several cores in processors, as the demand for more processing power is still present. Although more processor cores technically give a device more processing power, making the software utilize all cores can be a challenge. Problems that could run sequentially on a single-core processor have to be broken down into smaller tasks that different cores can solve in parallel. Doing this can make the code more complex and more prone to bugs, and can also require comprehensive synchronization to ensure tasks are performed in the right order. In some cases it's impossible to divide the load between several cores, as the task has to be solved in sequence. Even programs that can take advantage of several cores to a certain degree have problems scaling, as they most likely still contain a portion of sequential code that prevents scaling.

Better algorithms

Rather than throwing more hardware at a problem, using the best-suited algorithm, or even developing a more efficient one, can be a better approach. A 100% parallel algorithm with growth rate n² can be solved twice as fast with twice as much hardware, but if we double the size of the dataset, we need four times the hardware to solve the problem in the same time. To solve larger problems with algorithms of high growth rates, we need implementations that scale well enough to utilize the added hardware. With an implementation that is not 100% parallel, scaling quickly becomes a problem due to Amdahl's law. When an implementation has trouble scaling, more hardware won't help. The solution is to find other ways to solve the problem, either with algorithms of lower growth rates, or by facilitating better scaling by increasing the parallel portion of the code.
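To make the arithmetic concrete, a small sketch (illustration only, not from the report) of how the work of an n² algorithm grows when the input doubles:

#include <cstdio>

int main() {
    long n = 1000;
    long work = n * n;                //baseline amount of work
    long doubled = (2 * n) * (2 * n); //work after doubling the input
    //Prints 4: doubling the input quadruples the work, so keeping
    //the runtime constant requires four times the parallel hardware
    std::printf("work ratio = %ld\n", doubled / work);
}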


Challenges with parallel programming

The parallelizability of problems

Not all problems can be solved using parallel programming. This depends on several factors, but it comes down to how much the independent threads need to communicate. The easiest problems to parallelize are those that rarely or never have to communicate with other threads. We call these embarrassingly parallel problems, because they are so obviously parallelizable that it would be embarrassing not to take advantage of it. At the other end of the scale we find inherently serial problems, which cannot effectively be parallelized as each step requires the result of the previous one [3]. We can divide parallel algorithms into three groups by how much they interact [7]:

● Non-interacting algorithms have none, or a fixed number of, interactions.
● Weakly interacting algorithms have a number of interactions that is less than the big-O order of the algorithm.
● Strongly interacting algorithms have a number of interactions that is equal to or greater than the big-O order of the algorithm.

Race conditions

A race condition can occur when two or more threads have access to, and try to change, the same shared resource at the same time. The problem with running several threads is that we are rarely able to predict which thread will finish first, since we are not in control of the context switching and prioritizing done by the underlying operating system. Let's say we have two threads trying to find the biggest number in two arrays. Both threads have access to a shared variable holding the current largest value. If a thread finds a value larger than the shared variable, it should update it. This is done in three steps: read, check and write. In the case given below, we see that thread 2 accesses the shared variable just after thread 1. Unaware of the new value provided by thread 1, thread 2 overwrites the change, giving us an incorrect result.

Thread 1                  Shared variable    Thread 2
Read                      0
Check if bigger than 3    0                  Read
Write 3                   3                  Check if bigger than 2
                          2                  Write 2

A common approach to prevent race conditions is to use locks. A lock ensures that only one thread can access a resource at once. A thread that encounters a lock in a locked state halts its execution until the lock is unlocked again. When a lock changes its state to unlocked, it broadcasts a message to the threads waiting for it, indicating that they may resume execution. Locks add a bit of overhead since more code is executed, and some threads may spend time waiting for a lock to become unlocked again. Another downside of using locks is that we can encounter deadlock situations, potentially freezing our entire program. A deadlock can occur when two or more threads each lock a resource and wait to access another resource in such a way that the threads end up waiting for each other.

Thread 1           Thread 2
Lock variable A    Lock variable B
Lock variable B    Lock variable A

In this simple example, both threads need to access variables A and B, but since they don't lock the variables in the same order, we end up with a deadlock in the second step. In the second step, thread 1 needs variable B to continue its execution, but the variable is locked, so thread 1 halts and waits for it to become available. Thread 2, which holds the lock on variable B, also needs variable A, but that is locked by thread 1. Thread 1 now waits for thread 2 to finish while thread 2 waits for thread 1 to finish. As a result, the execution of the program never continues.
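The scenario can be reproduced with two mutexes locked in opposite order. This is my minimal sketch, not code from the report; note that locking the mutexes in a consistent order (or acquiring both at once with std::lock) would prevent the deadlock:

#include <mutex>
#include <thread>

std::mutex A, B;

void thread1() {
    std::lock_guard<std::mutex> lockA(A); //locks A first
    std::lock_guard<std::mutex> lockB(B); //then waits for B
}

void thread2() {
    std::lock_guard<std::mutex> lockB(B); //locks B first
    std::lock_guard<std::mutex> lockA(A); //then waits for A
}

int main() {
    std::thread t1(thread1);
    std::thread t2(thread2);
    t1.join();
    t2.join(); //if both threads grabbed their first lock, neither join ever returns
}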

Amdahl's law

Amdahl's law states that the potential speedup of an algorithm is given by the fraction of the code that is parallelizable (P) and the number of cores (N):

Speedup = 1 / ((1 - P) + P/N)

Even at 95% parallel code, the potential speedup will start to flatten out and reach a point where adding more cores doesn't add any significant performance. In theory, if 100% of the code were parallel, the performance would double every time the number of cores is doubled. The theoretical speedup will always be limited by the serial part of the program. Consider a program whose major parts are parallel, but which has a serial part that takes 10 seconds to execute: no matter how many cores we add, the program will never finish in under 10 seconds. Adding more cores will only bring the execution time closer to 10 seconds.
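Plugging numbers into the formula shows the flattening; a short calculation sketch (mine, not the report's):

#include <cstdio>

//Amdahl's law: Speedup = 1 / ((1 - P) + P / N)
double amdahl(double P, int N) {
    return 1.0 / ((1.0 - P) + P / N);
}

int main() {
    int cores[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int N : cores) {
        std::printf("N = %4d  speedup = %5.2f\n", N, amdahl(0.95, N));
    }
    //With P = 0.95 the speedup can never exceed 1 / (1 - 0.95) = 20,
    //no matter how many cores are added
}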


Dependency

As Amdahl's law states, the serial part of a program limits its parallelizability. A more accurate definition of "the serial part" would be the longest chain of dependent calculations, also known as the critical path. A sequence of operations cannot safely be performed in parallel and produce a correct result if the operations are dependent on each other [3]. Bernstein's conditions [3] are used to detect data dependency and decide whether two operations can be interchanged without changing the result of the program. Let p1 be the first operation and p2 the second, with I(p) and O(p) denoting the input and output of an operation p. If O(p1) ∩ I(p2) ≠ Ø, the operations are flow dependent, meaning that p2 depends on the results from p1. The second condition detects anti-dependency: if I(p1) ∩ O(p2) ≠ Ø, the operations are anti-dependent, meaning that p2 overwrites a value that p1 reads. In the example below, operation 2 is anti-dependent on operation 3; if operation 3 were performed before operation 2, it would give a different value for the variable k. Operation 2 is also flow dependent on operation 1.

1. i = 3;
2. k = i + 2;
3. i = 0;

This example also contains an output dependency: operations 1 and 3 both update the variable i, so O(p1) ∩ O(p2) ≠ Ø. Interchanging these two operations would not only affect the variable k, but also the final value of i. Even when two operations are not data dependent, they may still have a dependency in the form of control dependency: an operation is control dependent on another operation if the result of the other operation determines whether it is executed.

i = 3;
k = true;
if (k) {
    i++;
}

There is no data dependency between the variables i and k, but the operation inside the if statement is control dependent on k: the value of k decides whether the statement is performed, thus altering the result of the variable i.

Approaches

Message passing

In the 80s, several different message passing environments were developed. A group of scientists started working on a standard in the early 90s, and in 1994 they released the first version of MPI (Message Passing Interface), which defines a set of routines useful for writing message passing programs [2]. There are many different implementations of MPI, and there have been several revisions since 1994. In this project I use MPICH, an implementation of the MPI 2.0 standard.


Message passing is the process of sending a message to a process, which then invokes the actual code, instead of invoking the code directly. This makes it possible for programs to work together over a network by passing messages. Message passing has its strength in the ability to expand the cluster of computers working together. It can utilize computers of differing computational power and machines placed in different geographical regions; as long as the nodes can communicate across a network, they can work together. Message passing is not the best choice for running weakly or strongly interacting algorithms, although it is possible. The network latency is much greater than the latency between memory and CPU in a shared memory environment [7], and when computers are spread out over different locations, the latency will also vary. This requires comprehensive synchronization by the implementation to ensure that all threads wait for a message to reach its destination. It's not just the overall latency that is the problem: we risk losing packets when transferring our messages over the network. Because of this, an MPI implementation needs some fault tolerance logic, so the nodes can be certain that messages reach their destination. In some cases we risk losing an entire node and need to re-assign the lost work to another node.
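For reference, the smallest possible example of the routines involved; this is a generic MPI sketch (not part of the report's programs) where rank 0 sends a single int to rank 1:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        int value = 42;
        //Blocking send to rank 1 with message tag 0
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        //Blocking receive from rank 0 with message tag 0
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("Rank 1 received %d\n", value);
    }
    MPI_Finalize();
}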

Shared memory

For running algorithms that require more interaction between threads, the execution time can benefit from using a single computer with several processor cores. When the program runs on a single computer, the threads can share a memory space efficiently. This lets threads work on the same instance of a dataset without making copies and passing results to each other. Although the communication latency between processors is lower in shared memory architectures than in MPI clusters [7], shared memory machines may not scale as well due to the limitations set by the speed and bandwidth of the memory [9]. As several processes may read and change parts of the dataset during the execution, race conditions can occur. To prevent this, the implementation might have to use locks and make threads wait for their turn to access or modify the shared data, creating more overhead. Nevertheless, shared memory can give better performance than message passing for problems that have a significant amount of communication between the processes [7].

Finding the largest value in an array

Let's take a simple problem: finding the largest number in an array. This can be solved sequentially by iterating over the array and updating a local variable whenever a bigger number is found. The program is written in C++.

int GetLargestNumber(int* arr, int length) {
    int largest = arr[0];
    for (int i = 1; i < length; i++) {
        if (arr[i] > largest) {
            largest = arr[i];
        }
    }
    return largest;
}


Block decomposition

We can easily divide this task with block decomposition, having one thread find the largest number in the first half and another thread find the largest number in the second half, as they do not depend on exchanging information to execute their parts. To limit the number of interactions, we use a thread-local variable in the task method to hold the current largest number found. As each thread only has to receive its instructions once and return the result after iterating over its range, we have a fixed number of interactions, which makes this a non-interacting algorithm. When both threads finish, we are left with two values: the largest value in each half of the array. To get the largest value for the entire array, we need to collect the results from both threads and compare them to get one answer. One way to solve this is to use a shared variable: when a thread finishes, it checks if the shared variable is smaller than the largest value it found in its part, and updates the shared variable if so. Here we encounter a race condition, where both threads may try to update the shared variable at the same time. To solve this we can use a shared method with a lock guard, ensuring that only one thread can execute the method at once.

#include <thread>
#include <mutex>
using namespace std;

mutex mu;

//Method accessing the shared variable
void setResult(int &result, int candidate) {
    lock_guard<mutex> guard(mu); //only one thread at a time past this point
    if (result < candidate) {
        result = candidate;
    }
}

//Finds the largest value in arr[start..end] (inclusive)
void task(int* arr, int start, int end, int &result) {
    int largest = arr[start];
    for (int i = start + 1; i <= end; i++) {
        if (arr[i] > largest) {
            largest = arr[i];
        }
    }
    //Calling shared method with the result
    setResult(result, largest);
}

int GetLargestNumber(int* arr, int length) {
    int largest = arr[0];
    thread t1(task, arr, 0, length / 2, ref(largest));
    thread t2(task, arr, length / 2 + 1, length - 1, ref(largest));
    t1.join();
    t2.join();
    //Returns the largest value after both threads are finished
    return largest;
}


Scaling

When using two threads to solve this, we would expect half the running time of the sequential solution. For small arrays, however, the overhead of starting two threads can make the total execution time longer than the sequential one. I timed the execution of both the parallel and the sequential implementation above on an Intel i5-2500K, which has four physical cores running at 3.3 GHz, without hyperthreading. The extra work of starting two threads doesn't start to pay off until the array has at least 3 million values; for smaller arrays, the sequential implementation is faster.

The processor has four cores, so with even more threads we can solve the problem faster, given that the array contains at least 5-7 million values. As we would expect from a processor with four cores, four threads seems to give the best performance. Despite the constant time spent spawning the two extra threads, plus the overhead from context switching, using six threads gives an execution time closer to that of four threads than that of two, because it still utilizes all four cores.
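The measurements can be reproduced with a simple timing harness along these lines (my sketch, not the report's benchmark code; it assumes one of the GetLargestNumber implementations above is in scope):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> data(10000000); //10 million values
    for (size_t i = 0; i < data.size(); i++) {
        data[i] = std::rand();
    }
    auto start = std::chrono::steady_clock::now();
    int largest = GetLargestNumber(data.data(), (int)data.size());
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "largest = " << largest << " found in " << ms << " ms\n";
}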


Message passing

To test out message passing, I needed an environment with several nodes. I first set up a Raspberry Pi B+ with the Raspbian OS and installed MPICH on the machine. To get the machines to communicate without having to use a password, I also set up a public/private SSH key pair and added the public key to the machine's own list of authorized keys. This way, any machine running a clone of the original SD card is authorized to access any node in the cluster via SSH. Compiling and installing MPI was excruciatingly slow on the B+, even after overclocking the processor. After two or three hours of compiling, I could make clones of the SD card and set up as many nodes as I wanted. After connecting all the nodes to a switch, I mounted a network folder from the master node on all the other nodes. When the code is compiled in the network folder on the master node, it is automatically distributed to all the other nodes.

Cluster of eight Raspberry Pi B+ nodes

The Mandelbrot set

To familiarize myself with MPI, I chose to create a program that calculates and displays the Mandelbrot set [12]. The code is based on a serial program created by Jan Høiberg. In the Mandelbrot set, the color value of any pixel can be calculated individually, which makes it an embarrassingly parallel problem. The solution can be written so that it has a fixed number of interactions depending on the cluster size and the chunk size; in other words, a non-interacting algorithm. To share the load between all nodes, we can use block decomposition, as we did when finding the largest value in an array, where each node calculates the pixel values of an equally sized portion of the screen. Unfortunately, this does not ensure that all nodes do the same amount of computation.


#include "mandelbrot.h" #include <math.h> #define MAXITER 500 int getDistance(double x, double y){ int iterations = 1; double d = sqrt(x * x + y * y); double zx2 = 0, zy2 = 0, zx1, zy1; while (iterations < MAXITER && d <= 2) { zx1 = zx2; zy1 = zy2; zx2 = zx1 * zx1 - zy1 * zy1 + x; zy2 = 2 * zx1 * zy1 + y; d = sqrt(zx2 * zx2 + zy2 * zy2); iterations++; }

return iterations; }

Determining the distance from a point in the complex plane to the Mandelbrot set requires up to 500 iterations of the while loop in the implementation above. Points far away from the Mandelbrot set reach an absolute value of 2 in a fairly low number of iterations, making them quick to solve, while points inside the Mandelbrot set require the full 500. As we zoom in, the maximum number of iterations has to be increased to keep the same level of detail in the result. Looking at the shape of the set, we see that by splitting the picture horizontally we risk getting some portions containing a lot more, or a lot less, work than others. On a single computer with four processor cores, dividing the work between four threads will very likely cause one or more of the threads to finish before the rest. For the last part of the computation, the computer will then not be able to utilize all its processor cores, as the remaining work is only present in a couple of the original four threads. By splitting the work into even smaller chunks, all four cores share the load more equally, since the operating system can perform context switching. Context switching works well on shared memory architectures because the work is stored in memory that all processor cores can access. On a cluster, however, this is not the case. Each node has its own memory and can perform context switching on its own threads, but it cannot do work on threads spawned on other nodes. Splitting the work into smaller chunks can actually make the program slower across the cluster because of the time spent spawning the extra threads, plus the overhead from context switching; one node still risks receiving a lot more work than the others.


In this illustration, I have divided the set between three nodes. Unfortunately, the second node has been assigned a lot more work than the two others and will likely finish last. It doesn't matter how many pieces we split the problem into; the workload will remain uneven as long as we don't shuffle the tasks.

The solution

Distributing the load evenly throughout the cluster requires three different types of threads: a single master thread receiving the results and drawing them in a window, one or more slave threads performing the calculations, and a single delegator thread that gives the slave threads a new section of points to calculate when requested. All slave threads can address both the master and the delegator thread; neither the master nor the delegator can address any of the slave threads. I use the word thread here since the program can also run on a single node. When a slave thread finishes a chunk of work, it sends a message to the delegator thread requesting a new chunk. The delegator thread keeps track of which portions have been assigned to slave threads, and sends a message back to the slave with the coordinates of a new chunk. This keeps all remaining work available to all threads until a slave starts solving a chunk. Given that the problem is split into enough chunks, all nodes can contribute as long as there is work left. There is of course still a chance that a single node will calculate the very last of the work on its own, but as the chunks become smaller, this becomes an insignificant slowdown. The protocol is summarized below, followed by a sketch of the delegator loop.

1. Slave requests work from the delegator
2. Delegator assigns the slave a section of the set to solve
3. Slave returns the result to the master node
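The report's actual delegator function is listed in the appendix; the loop below is my hedged sketch of the request/assign protocol, where WORK_TAG and the chunk encoding (a plain index, with -1 meaning "no work left") are assumptions:

#include <mpi.h>

const int WORK_TAG = 2; //assumed tag name, not necessarily the report's

void delegatorLoop(int totalChunks, int slaveCount) {
    MPI_Status status;
    int nextChunk = 0;
    int slavesDone = 0;
    while (slavesDone < slaveCount) {
        int request;
        //Wait for any slave to ask for work
        MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, WORK_TAG, MPI_COMM_WORLD, &status);
        //Hand out the next chunk, or -1 when the whole set has been assigned
        int assignment = (nextChunk < totalChunks) ? nextChunk++ : -1;
        if (assignment == -1) slavesDone++;
        MPI_Send(&assignment, 1, MPI_INT, status.MPI_SOURCE, WORK_TAG, MPI_COMM_WORLD);
    }
}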


Reviewing the results

When measuring the execution time of the program, it's interesting to look at the time between when the first node receives an empty value from the delegator (meaning no more chunks are available) and when the last node finishes its chunk; let's call this the tail. The tail can be reduced by decreasing the size of the chunks, meaning more chunks. The time saved by decreasing the tail needs to be greater than the expected increase in execution time in the main part of the program, where all nodes are working. The expected increase is caused by extra communication between nodes, as we need to send results and receive tasks more often when we have a greater number of tasks. During the testing, I installed a GUI on one of the nodes so I could watch as it completed parts of the picture. I chose to run the master and delegator threads on the GUI node and slave threads on the remaining seven nodes. Running a slave thread on the GUI node actually slowed down the total execution time considerably. This may be because the GUI node is responsible for delegating tasks and receiving results; while it spends time calculating a chunk, it is unavailable to the other nodes that need its attention. The time saved by utilizing the GUI node to its full potential is far less than the extra time all the other nodes have to spend waiting on it. I would also guess that this slowdown gets worse as the number of nodes in the cluster increases.

As we would expect, doubling the number of nodes almost halves the execution time. The red line in the graph represents a theoretical 100% parallel algorithm. We clearly see that this program is no exception to Amdahl's law, as the measured speedup starts to deviate from the theoretical 100% parallel speedup, and we will eventually reach a point where adding more nodes isn't going to help much.


When dividing into 20 chunks, the first node finishes after 32 seconds and the last after 39, which means at least one node is idle for seven seconds. To try to take advantage of this time, I increased the number of chunks to 40, 100 and 150. For all sizes the last node still finished after 38 seconds, and at 150 chunks the first node also finished after 38. It seems there is little point in dividing the task into more than 20 partitions to achieve a more even load distribution in this case; the overhead of the extra message passing outweighs the time saved by utilizing the idle node.

Image of the Mandelbrot set drawn by the cluster.

The code

The implementation is written in C++ and runs on MPICH. To send the results from the slave to the master thread, I could have used an int array where the first two values were the start and stop Y coordinates, followed by the pixel values, as this is enough to define the value and position of every point in the chunk. However, I found out that I could define datatypes in MPI and send structs between the threads, so I decided to try this even though it results in more memory use.

struct MandelbrotPixel {
    short x, y;
    int distance;
};

Every point in the result generated by the slave node has an x and y position stored as shorts, since it's very unlikely that we will draw an image with a resolution higher than 32,767 pixels in any direction, and a distance stored as an int. I define a matching MPI datatype and pass it to both the slave and the master node for further use. Having defined an MPI_Datatype matching the variables in my struct, I can pass the struct to MPI_Send by also passing the MPI_Datatype along with it.


MPI_Init(NULL, NULL);

//Defining MPI datatype to match struct
MPI_Datatype pixeltype, dataType[2];
MPI_Aint offset[2], dataTypeSize;
int blockCount[2];

//Adding two short variables to the datatype
offset[0] = 0;
dataType[0] = MPI_SHORT;
blockCount[0] = 2;

//Adding one int variable to the datatype
MPI_Type_extent(MPI_SHORT, &dataTypeSize);
offset[1] = 2 * dataTypeSize;
dataType[1] = MPI_INT;
blockCount[1] = 1;

//Defining the variable
MPI_Type_struct(2, blockCount, offset, dataType, &pixeltype);
MPI_Type_commit(&pixeltype);
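With the datatype committed, sending a finished chunk from a slave to the master (rank 0) reduces to a single call. A short sketch assuming the MandelbrotPixel struct and pixeltype defined above, and the RESULT_TAG constant used in the appendix:

//Send 'count' calculated pixels to the master node
void sendChunk(MandelbrotPixel* pixels, int count, MPI_Datatype pixeltype) {
    MPI_Send(pixels, count, pixeltype, 0 /*master rank*/, RESULT_TAG, MPI_COMM_WORLD);
}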

Shared memory

Merge sort

I chose to parallelize the merge sort algorithm [11], as it is fairly easy to understand and parallelize. Having created two programs in C++, I decided to use Java for this one, to test out parallel programming in a higher-level language. The solution is based on a sequential implementation using a static variable to hold the array of numbers to be sorted, and a buffer array of equal length; this avoids instantiating buffer arrays within the recursive calls. The algorithm works in such a way that we will not get any race conditions in either array, so using locks is unnecessary.

Identifying dependency

The sorting is started by calling the split method with the first and last index of the array to be sorted. The split method calls itself recursively down to the base case (one value) and then starts the merge method, which sorts the numbers in the independent parts. Because of the splitting, the merges in the two split calls never affect the same indexes in the shared list. As they do not affect the same variables, there is no dependency between them: the result will be the same no matter which order they are executed in. The merge call, however, is flow dependent on both split calls at the same level, since it requires the results from both. This means we need to make sure both split operations are finished before starting the merge.

private static void split(int low, int high) {
    if (low < high) {
        int middle = low + (high - low) / 2;
        split(low, middle);
        split(middle + 1, high);
        merge(low, middle, high);
    }
}


Introducing threads

To make the split operations run in parallel in Java, we need a Thread object that can start the split method with the right parameters. Instead of declaring a new class implementing Runnable, with a constructor taking the same parameters as the split method, we can create a method returning a new Thread object with an overridden run method, as shown below. To limit the number of threads created, we introduce a static variable for the maximum tree depth at which to start new threads, and pass the current depth on to the child threads.

private static Thread splitThread(final int low, final int high, final int depth) {
    return new Thread() {
        @Override
        public void run() {
            split(low, high, depth);
        }
    };
}

To start the threads, we first check the depth we are at in the tree. If we are below the threshold, we create and start new threads; if not, we run the splits in sequence. We also increase the depth variable sent to the splitThread method so the child threads know their depth. To make sure both threads are finished before merge is executed, we call the join method on the thread objects, making the parent thread wait until they finish before continuing.

private static void split(int low, int high, int depth) {
    if (low < high) {
        int middle = low + (high - low) / 2;
        if (depth < splitThreshold) {
            Thread leftThread = splitThread(low, middle, depth + 1);
            leftThread.start();
            Thread rightThread = splitThread(middle + 1, high, depth + 1);
            rightThread.start();
            try {
                leftThread.join();
                rightThread.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        } else {
            split(low, middle, depth);
            split(middle + 1, high, depth);
        }
        merge(low, middle, high);
    }
}


Scaling

We will not be able to achieve the same speedup with this algorithm as with the search using block decomposition. A sequential implementation splits the array into a tree, from either left to right or the opposite. If we start the splits as two independent threads, each thread can sort its part of the tree by itself, and we can achieve a theoretical halving of the execution time. But when these two threads finish, we have to merge the results to get the finished sorted array. This leaves us with one thread that has to merge the entire array by itself, as the merge operation cannot be performed in parallel. At the next level, those two threads can split into four, but when those four finish, only two threads can process the results, and so on. In addition to the main thread, at the third level in the tree we have spawned seven threads, but only four of them are running; the rest are waiting for their child threads to finish. To find the optimal number of threads to use, we need to look at the maximum number of simultaneously active threads, not the total number of threads spawned.

For an array with one million values, the tree height will be a bit over 20, but on a processor with eight cores we can only achieve full parallel potential on 17 of those levels. For this project I got the opportunity to experiment on a computer with two Intel Xeon E5-2690 v4 processors. Each processor has 14 cores with hyperthreading, which gives a total of 56 logical cores for both processors combined. To utilize all 56 cores, we need at least 56 threads running at the same time. At the sixth level in the tree, the number of leaf nodes is 64, which gives us enough threads.


Threads    Milliseconds    Speedup
1          170             1
2          105             1.6
4          64              2.6
8          34              5

I plotted the values from sorting one billion values and compared them to the theoretical speedup of Amdahl's law. From this graph we see that this implementation of merge sort is a bit more than 93% parallel. The speedup decreases after about 50 simultaneously running threads, as expected, since the hardware can run at most 56 threads in parallel. This shows the importance of writing effective algorithms: throwing more hardware at the problem, even with 93% parallel code, isn't very cost effective at this scale.


Shortest path in a graph

To try out a more challenging problem, I decided on a graph algorithm. I found some datasets originating from social networks, provided by Stanford University [8], where the nodes are users and the edges are friendships. As these graphs are undirected and unweighted, a modified breadth-first search will give better performance than Dijkstra's algorithm. Since the edges have no weight, the first time a breadth-first traversal encounters the target, it has traveled the shortest path. If the edges had been weighted, this might not have been the case, and we would have to use Dijkstra's algorithm to take the weight of the edges into consideration. A sequential solution to finding the shortest path between two nodes in the graph can be implemented using a FIFO queue [13]. Every time we encounter an undiscovered node, we add it to the queue to return to it later, and update its distance value to the current distance traveled. We remove the first item in the queue and check all its neighbours until we find the target node, or the queue is empty. If the queue is empty, there is no path between the two nodes.

//Static queue
public static LinkedList<Node> Queue = new LinkedList<Node>();

public static void findPath(Node start, Node target) {
    //Resetting queue and setting initial values
    Queue.clear();
    start.distance = 0;
    Node cursor = start;
    cursor.found = true;
    boolean found = false;
    outer:
    while (!found) {
        for (Node n : cursor.Neighbours) {
            if (n == target) {
                //Found shortest path
                found = true;
                break outer;
            }
            //Adding adjacent nodes to queue if not discovered
            if (!n.found) {
                Queue.addLast(n);
                n.distance = cursor.distance + 1;
                n.found = true;
            }
        }
        try {
            cursor = Queue.removeFirst();
        } catch (NoSuchElementException e) {
            //No more elements. No path was found
            break;
        }
    }
}

After running this method, the nodes in the graph have the distance values necessary for us to trace the shortest path from the target back to the root node. All nodes not visited will have a distance of Integer.MAX_VALUE.


Let's say we map this graph breadth first, from left to right. When we reach the target node in blue, the code stops, since there is no use in mapping the rest of the graph. From the target node, we repeatedly select the neighbour with the lowest distance until we reach the root node in green; this is the shortest path. The sequential breadth-first algorithm was unexpectedly effective on these datasets. Even at 60 million nodes with over one billion edges, a dataset taking up 33 GB as a txt file, the shortest path was found in under two seconds, leaving little room for improvement.

Incremental parallel implementation

I can think of two different ways to do this in parallel. One way is to continue with the FIFO queue and share the load of visiting the neighbours of all the elements in the tree one level at a time, collecting the results and then starting new threads for the next level. Let's call this the incremental parallel solution. This way, we can stop when we find the target node, and avoid mapping the entire graph. The drawbacks of this solution are that the threads do not run for very long, and we need a fair amount of sequential code to collect the results and start new threads for every level of the tree. More sequential code means less potential scaling, and stopping and starting for every level means more overhead from spawning threads and waiting for threads to finish. An advantage is that we don't get any serious race conditions: since all threads operate within the same level of the tree, they will write the same value to the adjacent nodes. Several threads may add the same node to their independent queues, which are later combined, causing duplicates. This can cause a tiny amount of extra work, but no errors.


Incremental
+ Not having to map the entire graph
+ No dangerous race conditions
- Less parallel code and less potential speedup

To minimize the overhead associated with spawning new threads, we can use a thread pool that keeps our threads alive until we need them again. The threads cannot directly access the FIFO queue as in the sequential solution, since the linked list is not thread safe: there is no guarantee that the result will be correct when several threads try to add elements to the queue at the same time. Instead, we can receive a list of Future objects from our thread pool executor. A Future object is a promise that the object will, at a later point, contain the result of the task provided to the thread. We can collect the Future objects and then wait for the threads to finish before trying to get the values from the Futures.

outer:
while (true) {
    int partitions = (int) Math.ceil((double) Queue.size() / (double) PARTITION_SIZE);
    List<Callable<List<Node>>> threads = new LinkedList<Callable<List<Node>>>();
    for (int i = 0; i < partitions; i++) {
        int amount = Queue.size() < PARTITION_SIZE ? Queue.size() : PARTITION_SIZE;
        LinkedList<Node> partition = new LinkedList<Node>(Queue.subList(0, amount));
        Queue.subList(0, amount).clear();
        Callable<List<Node>> t = new MapperThread(partition);
        threads.add(t);
    }
    List<Future<List<Node>>> futures = executor.invokeAll(threads);
    Queue.clear();
    for (Future<List<Node>> f : futures) {
        List<Node> nodes = f.get();
        if (nodes.contains(target)) {
            found = true;
            break outer;
        }
        Queue.addAll(nodes);
    }
}

This code runs in a loop until we find the target node. First we create a list to hold the tasks and create a new task for each partition. Each task is assigned a portion of the queue and then added to the list. After completing the list, we pass it to our executor, which divides the tasks between the threads in our pool. The invokeAll method returns a list of Future objects when all tasks have finished, and we can then get the results from each Future object.

Full mapping

Another way to parallelize this is to split into threads at a certain level and let the threads keep running into the deeper levels. The problem with this is that we encounter race conditions. One scenario is when two threads try to set different distance values on a node at the same time; they may not be aware of the changed value between the read, check and write operations. This can result in the wrong distance being written last. Another scenario is when one thread reaches a node and updates its distance, and another thread encounters the same node at a later point with a shorter distance. As shown in the illustration below, even if it updates the node with the correct distance, the mapping needs to be run from that node again to ensure that its neighbours, and its neighbours' neighbours, and so on, also receive the new shorter distance.

Let's assume we have two threads running, red and green. Even if our green thread updates the distance of the current node's neighbours, it needs to continue checking all the neighbours' neighbours to ensure that they also get the correct distance. This can cause a lot of extra work, as the threads may need to correct each other's work. The uneven progression also prevents us from stopping the mapping when we encounter the target, since another thread may find a shorter path at a later point. This gives us a total of three drawbacks for this parallel solution.

Full mapping
+ More parallel code and better potential speedup
- Locking, unlocking, and potentially waiting to access the distance variable
- More work, as threads may have to correct each other's work
- We have to map the entire graph since the progression is uneven

Another challenge with the full mapping solution is splitting the search tree into a fixed number of threads. Since the root node may have a thousand edges, the second level of the tree may have a thousand nodes. To avoid starting too many threads, we can use a thread pool with a fixed number of threads, and send our thousand tasks to the pool, which handles the load balancing across the fixed number of concurrent threads.

Reviewing the results

Measuring the different implementations can be challenging, since the paths between nodes differ. The sequential solution may be quicker at finding one path, but a parallel one can be faster at another. To achieve a fair measurement, I run each solution one hundred times with random root and target nodes and measure the average. Comparing the two implementations to each other and to a theoretical speedup based on Amdahl's law, we clearly see that the full mapping, at close to 95%, is more parallel than the incremental solution, which only performs as a theoretical 88% parallel program. This does not mean that the full mapping is the better solution, since it has a lot of overhead, as stated earlier. I was very surprised at how much this overhead slowed down the implementation.


Running on one thread, the full mapping used 34 seconds to find the shortest path, which is a lot compared to the sequential implementation's 1.6 seconds. Even when scaling as well as a 95% parallel program and allowed to run on 56 processor cores, it only managed to find the shortest path in an average of 2.3 seconds at best.

The incremental solution performed a lot better than I expected. Already at two threads, it's faster than the sequential implementation. Since it is less parallel, there is no significant speedup beyond 20 threads. I changed the code to use locks to prevent several threads from adding the same node to the queue, but the overhead from locking, unlocking and waiting was far greater than the time saved from potentially adding and visiting a node more than once. Partitions of 10,000 nodes each seemed to give the best results. There is no dynamic adjustment of the partition size, so one increment may create several hundred tasks, but all tasks are divided between the threads available in the pool.


public static List<Node> Map(LinkedList<Node> nodes) {
    List<Node> newQueue = new LinkedList<Node>();
    //For all nodes in the assigned queue
    for (Node n : nodes) {
        //Check all neighbours
        for (Node neighbour : n.Neighbours) {
            if (!neighbour.found) {
                //Update distance and status, and add to queue if not found
                neighbour.lock.lock();
                neighbour.found = true;
                neighbour.lock.unlock();
                neighbour.distance = n.distance + 1;
                newQueue.add(neighbour);
            }
        }
    }
    return newQueue;
}

To see how well the caches performed, I installed Intel's Performance Counter Monitor and monitored repeated executions with a given number of threads over a period of 30 seconds to get a fair average. For one and two threads, the L3 cache has a hit ratio of 0.33 and the L2 cache 0.13. For three threads and above, the hit ratio is around 0.2 for L3 and 0.07 for L2. This means that when running one or two threads, the caches help us achieve better performance, since they can provide the processor with data without fetching from memory. For more than two threads, the caches are a lot less helpful.


Larger processor caches would probably give a better parallel speedup, since the threads wouldn't have to wait as much for data to be fetched from memory. Using more than 20 threads didn't give a significant speedup when trying to solve this problem faster; at that point, our money may be better spent on faster memory or a processor with a larger cache rather than on 30 extra cores.


Summary

Using parallel programming and distributed computing is not the solution to all problems requiring large amounts of computational power. The nature of the problems we are trying to solve and the implementation of the algorithms decide how well we can take advantage of parallel computation and clusters. When writing applications for multi-core platforms, we have to consider the additional overhead and complexity that comes with parallel programming before deciding to implement parallel solutions. For many small tasks and problems, the overhead of using several threads might be greater than the time saved, and the synchronization can introduce new bugs and unnecessary complexity. However, being able to identify the problems that can benefit from using several threads, and implementing this, can make applications a lot more efficient.

I started this project with only a basic understanding of what a thread was, and had a plan to write several programs for multiple problems. I quickly realized that I would not have time to do everything I planned, especially after spending almost three full days setting up the Raspberry Pi cluster and getting it to work. I expected parallel programming to be challenging due to race conditions, added complexity and having to think in a different way, but I was surprised by the extent of the problems that came with it. The overhead of starting threads was a lot higher than I expected, but the difficulty of scaling parallel implementations surprised me the most.

If I had more time, I would have liked to try out parallel programming in a functional programming language, which supposedly is better suited for parallel programming than imperative languages. I did not have time to test any problems on GPUs, but if I had, I would have tried to draw the Mandelbrot set. I didn't truly realize the importance of the hardware until the very end of my project. The on-chip cache and memory speeds played a much bigger role than I expected, and I think it would be interesting to do more experiments with different processor architectures and memory speeds, and even to change the software implementation to better fit the hardware and achieve better performance. Parallel programming is without a doubt an exciting and vast field that I have only barely started to explore.


References
[1] http://www.cs.columbia.edu/~sedwards/classes/2012/3827-spring/advanced-arch-2011.pdf
[2] https://en.wikipedia.org/wiki/Message_Passing_Interface
[3] https://en.wikipedia.org/wiki/Parallel_computing
[4] http://www.eecs.berkeley.edu/~culler/courses/cs252-s05/lectures/cs252s05-lec01-intro.ppt#359,15,Memory%20Capacity%20%20(Single%20Chip%20DRAM)
[5] https://en.wikipedia.org/wiki/Moore%27s_law
[6] https://software.intel.com/en-us/blogs/2014/02/19/why-has-cpu-frequency-ceased-to-grow
[7] Arne Maus: A Classification of Parallel Algorithms and why Most Problems will not get Much More Computer Power in the Near Future
[8] http://snap.stanford.edu/data/
[9] https://en.wikipedia.org/wiki/Shared_memory
[10] http://www.extremetech.com/computing/165331-intels-former-chief-architect-moores-law-will-be-dead-within-a-decade
[11] https://en.wikipedia.org/wiki/Merge_sort
[12] https://en.wikipedia.org/wiki/Mandelbrot_set
[13] https://en.wikipedia.org/wiki/Breadth-first_search


1 - Drawing the Mandelbrot set

The main function
Entry point on all nodes. Starts the appropriate function based on the rank assigned to the node.

int main(){
    MPI_Init(NULL, NULL);

    // Defining an MPI datatype to match the MandelbrotPixel struct
    MPI_Datatype pixeltype, dataType[2];
    MPI_Aint offset[2], dataTypeSize;
    int blockCount[2];

    // Adding two short variables to the datatype
    offset[0] = 0;
    dataType[0] = MPI_SHORT;
    blockCount[0] = 2;

    // Adding one int variable to the datatype
    MPI_Type_extent(MPI_SHORT, &dataTypeSize);
    offset[1] = 2 * dataTypeSize;
    dataType[1] = MPI_INT;
    blockCount[1] = 1;

    // Defining and committing the datatype
    MPI_Type_struct(2, blockCount, offset, dataType, &pixeltype);
    MPI_Type_commit(&pixeltype);

    // Setting the initial zoom of the fractal and a standard window size
    int window_width = 1500;
    int window_height = 900;
    double midX = -0.5;
    double midY = 0;
    double step = 0.002;
    int partitions = 50; // Minimum amount of partitions

    // Finding the actual number of partitions based on the window size
    int actualStep = floor(window_height / partitions);
    int actualPartitions = ceil(window_height / actualStep);

    int world_size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if(world_size < 3){
        if(rank == 0) std::cout << "World size must be at least 3\n";
        return 0;
    }

    if(rank == 0){
        // Node is master
        master(window_width, window_height, actualPartitions, pixeltype);
    }
    else if(rank == 1){
        // Node is delegator
        delegator(window_height, window_width, partitions, actualStep);
    }
    else{
        // Node is slave
        slave(midX - ((window_width/2)*step), midY - ((window_height/2)*step), step, pixeltype);
    }
    MPI_Finalize();
}
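The program can be compiled with an MPI compiler wrapper such as mpic++, linking against X11 (for example with the -lX11 flag), and launched with mpirun requesting at least three processes, e.g. mpirun -np 8 ./mandelbrot. The binary name is my assumption; the minimum of three processes follows from the code, which needs one master, one delegator and at least one slave.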


The master function
Triggered by the main function if the node is assigned rank 0.

void master(int window_width, int window_height, int partitions, MPI_Datatype pixeltype){
    // Creating a new window
    GraphicsContainer gCont = getNewDisplay(window_width, window_height);
    MPI_Status status;
    long WHITE = WhitePixel(gCont.display, gCont.screen);
    long BLACK = BlackPixel(gCont.display, gCont.screen);
    long COLOR = 0;
    for(int i = 0; i < partitions; i++){
        // Probe for a message
        MPI_Probe(MPI_ANY_SOURCE, RESULT_TAG, MPI_COMM_WORLD, &status);
        int sectionSize;
        // Query the status of the probed message for its size. The count is
        // taken in pixeltype elements so that it matches the number of
        // pixels sent (counting in MPI_INT would give the wrong size)
        MPI_Get_count(&status, pixeltype, &sectionSize);
        // Allocate an array for the result and receive it
        MandelbrotPixel pixels[sectionSize];
        MPI_Recv(pixels, sectionSize, pixeltype, MPI_ANY_SOURCE, RESULT_TAG, MPI_COMM_WORLD, &status);
        // Draw all pixels
        for(int j = 0; j < sectionSize; j++){
            if(pixels[j].distance == MAXITER){
                COLOR = BLACK;
            }
            else{
                COLOR = (WHITE - (MAXITER - pixels[j].distance) * (WHITE - BLACK)) / MAXITER;
            }
            XSetForeground(gCont.display, gCont.gc, COLOR);
            XDrawPoint(gCont.display, gCont.window, gCont.gc, pixels[j].x, pixels[j].y);
        }
    }
}


The delegator function
Triggered by the main function if the node is assigned rank 1.

void delegator(int height, int width, int partitions, double step){
    MPI_Status status;
    int cursor = 0;
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int section[4];
    section[0] = 0;     // Start X
    section[2] = width; // Stop X
    int node_id;
    // One task per partition, plus one empty "stop" section per slave
    for(int i = 0; i < partitions + world_size - 2; i++){
        // Receiving a request for a new section
        MPI_Recv(&node_id, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST_TAG, MPI_COMM_WORLD, &status);
        // Defining a new section to solve
        section[1] = cursor; // Start Y
        if(cursor >= height){
            // No work left: send an empty section so the slave can exit
            section[3] = cursor;
            MPI_Send(section, 4, MPI_INT, node_id, TASK_TAG, MPI_COMM_WORLD);
            continue;
        }
        else if(height < cursor + step){
            section[3] = height;
            cursor = height;
        }
        else{
            section[3] = cursor + step;
            cursor = cursor + step;
        }
        // Sending the new section to the node that requested it
        MPI_Send(section, 4, MPI_INT, node_id, TASK_TAG, MPI_COMM_WORLD);
    }
}


The slave function
Triggered by the main function if the rank assigned to the node is greater than 1.

void slave(double minX, double minY, double step, MPI_Datatype pixeltype){
    int rank;
    int done = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status status;
    while(done == 0){
        int section[4];
        // Sending a request to the delegator for work
        MPI_Send(&rank, 1, MPI_INT, 1, REQUEST_TAG, MPI_COMM_WORLD);
        // Receiving work from the delegator
        MPI_Recv(&section, 4, MPI_INT, MPI_ANY_SOURCE, TASK_TAG, MPI_COMM_WORLD, &status);
        // Exit if no more work is available (checked before allocating the buffer)
        if(section[1] - section[3] == 0){
            done = 1;
            continue;
        }
        // Finding the size of the received work
        int sectionSize = (section[2] - section[0]) * (section[3] - section[1]);
        MandelbrotPixel pixels[sectionSize];
        // Calculating all points in the assigned section
        int arrayCursor = 0;
        for(int i = section[0]; i < section[2]; i++){
            for(int j = section[1]; j < section[3]; j++){
                pixels[arrayCursor].x = i;
                pixels[arrayCursor].y = j;
                pixels[arrayCursor].distance = getDistance(minX + (step * i), minY + (step * j), MAXITER);
                arrayCursor++;
            }
        }
        // Sending the result to the master
        MPI_Send(pixels, sectionSize, pixeltype, 0, RESULT_TAG, MPI_COMM_WORLD);
    }
}


2 - Merge sort

public class ParallelMergeSorter {
    private static int[] list;
    private static int[] buffer;
    private static int splitThreshold;

    public static int[] sort(int[] values, int ds) {
        list = values;
        splitThreshold = ds;
        buffer = new int[values.length];
        split(0, values.length - 1, 0);
        return list;
    }

    private static Thread splitThread(final int low, final int high, final int depth) {
        return new Thread() {
            @Override
            public void run() {
                split(low, high, depth);
            }
        };
    }

    private static void split(int low, int high, int depth) {
        if (low < high) {
            int middle = low + (high - low) / 2;
            if (depth < splitThreshold) {
                // Sort the two halves in parallel and wait for both to finish
                Thread leftThread = splitThread(low, middle, depth + 1);
                leftThread.start();
                Thread rightThread = splitThread(middle + 1, high, depth + 1);
                rightThread.start();
                try {
                    leftThread.join();
                    rightThread.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            } else {
                split(low, middle, depth);
                split(middle + 1, high, depth);
            }
            merge(low, middle, high);
        }
    }

    private static void merge(int low, int middle, int high) {
        // Copy both parts into the buffer array
        for (int i = low; i <= high; i++) {
            buffer[i] = list[i];
        }
        int i = low;
        int j = middle + 1;
        int k = low;
        // Copy the smallest values from either the left or the right side back to the original array
        while (i <= middle && j <= high) {
            if (buffer[i] <= buffer[j]) {
                list[k] = buffer[i];
                i++;
            } else {
                list[k] = buffer[j];
                j++;
            }
            k++;
        }
        // Copy the rest of the left side of the array into the target array
        // (any remaining right-side values are already in place)
        while (i <= middle) {
            list[k] = buffer[i];
            k++;
            i++;
        }
    }
}
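A minimal usage sketch follows; the array contents and the split threshold are arbitrary choices of mine. A threshold of 2 spawns threads for the first two levels of recursion, giving up to four concurrently sorting subtrees:

public class MergeSortDemo {
    public static void main(String[] args) {
        int[] values = {5, 1, 4, 2, 8, 7, 3, 6};
        // Depth threshold 2: the first two split levels run in parallel
        int[] sorted = ParallelMergeSorter.sort(values, 2);
        System.out.println(java.util.Arrays.toString(sorted)); // [1, 2, ..., 8]
    }
}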


3 - Graph search

The node object

package com.company;

import java.util.ArrayList;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Node {
    public int Id;
    public ArrayList<Node> Neighbours = new ArrayList<Node>();
    public boolean found = false;
    public int distance;
    public Lock lock = new ReentrantLock();

    public Node(int id) {
        Id = id;
    }

    public void addNeighbour(Node n) {
        Neighbours.add(n);
    }
}
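For testing, a small graph can be wired up by hand. The connect() method below is a hypothetical helper of my own that adds an undirected edge by registering each node as the other's neighbour:

public class GraphDemo {
    // Hypothetical helper: adds an undirected edge between a and b
    private static void connect(Node a, Node b) {
        a.addNeighbour(b);
        b.addNeighbour(a);
    }

    public static void main(String[] args) {
        Node a = new Node(0), b = new Node(1), c = new Node(2), d = new Node(3);
        connect(a, b);
        connect(b, c);
        connect(c, d);
        connect(a, c); // shortcut, so the shortest path from a to d is a-c-d
    }
}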

Tracing the shortest path
After mapping the distances between the nodes, this function will find the shortest path from the target node back to the start node and return a list of nodes representing the shortest path.

private static LinkedList<Node> getShortestPath(Node endnode){
    LinkedList<Node> path = new LinkedList<Node>();
    // The end node needs to be in the path
    path.add(endnode);
    Node cursor = endnode;
    while(true){
        // Initial candidate for the neighbour closest to the start node
        Node closest = cursor.Neighbours.get(0);
        // Checking all neighbours
        for(Node n : cursor.Neighbours){
            // Switching to this node if it is closer to the start
            if(n.distance < closest.distance){
                closest = n;
            }
        }
        path.addFirst(closest);
        cursor = closest;
        // The start node will be the only node with distance 0
        if(cursor.distance == 0) return path;
    }
}

Incremental implementation
The incremental parallel implementation of the breadth first search. Entry point is findPath().

package com.company;

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.*;

public class BreadthFirstIncrementalParallel {
    public static LinkedList<Node> Queue = new LinkedList<Node>();
    public static final int PARTITION_SIZE = 10000;
    private static ExecutorService executor;

    static class MapperThread implements Callable {
        private LinkedList<Node> nodes;

        public MapperThread(LinkedList<Node> nodes){
            this.nodes = nodes;
        }

        @Override
        public Object call() throws Exception {
            return Map(nodes);
        }
    }

    public static List<Node> Map(LinkedList<Node> nodes){
        List<Node> newQueue = new LinkedList<Node>();
        // For all nodes in the assigned queue
        for(Node n : nodes){
            // Check all neighbours
            for(Node neighbour : n.Neighbours){
                if(!neighbour.found){
                    // Update distance and status, and add to the queue, if not already found
                    neighbour.lock.lock();
                    neighbour.found = true;
                    neighbour.lock.unlock();
                    neighbour.distance = n.distance + 1;
                    newQueue.add(neighbour);
                }
            }
        }
        return newQueue;
    }

    public static List<Node> findPath(Node start, Node target, int numberOfThreads){
        // Creating a thread pool with the desired number of threads
        executor = new ScheduledThreadPoolExecutor(numberOfThreads);
        // Resetting the queue and setting initial values
        Queue.clear();
        start.distance = 0;
        start.found = true;
        boolean found = false;
        LinkedList<Node> shortestPath = new LinkedList<Node>();
        Queue.add(start);
        outer:
        while(true){
            // Finding out how many partitions we have to create
            int partitions = (int)Math.ceil((double)Queue.size() / (double)PARTITION_SIZE);
            List<Callable<List<Node>>> threads = new LinkedList<Callable<List<Node>>>();
            // Creating a thread object for each partition
            for(int i = 0; i < partitions; i++){
                int amount = Queue.size() < PARTITION_SIZE ? Queue.size() : PARTITION_SIZE;
                LinkedList<Node> partition = new LinkedList<Node>(Queue.subList(0, amount));
                Queue.subList(0, amount).clear();
                Callable t = new MapperThread(partition);
                threads.add(t);
            }
            try {
                // Starting the threads
                List<Future<List<Node>>> futures = executor.invokeAll(threads);
                Queue.clear();
                // Processing the results
                for(Future f : futures){
                    List<Node> nodes = (List<Node>)f.get();
                    if(nodes.contains(target)){
                        found = true;
                        break outer;
                    }
                    Queue.addAll(nodes);
                }
            } catch (InterruptedException e) {
                e.printStackTrace();
            } catch (ExecutionException e) {
                e.printStackTrace();
            }
        }
        if(found){
            shortestPath = getShortestPath(target);
        }
        return shortestPath;
    }
}
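With the graph from the GraphDemo sketch earlier, the search could be invoked like this, continuing GraphDemo.main (the node names and thread count are my own, arbitrary choices):

// Maps distances from node a and returns the shortest path to node d
java.util.List<Node> path = BreadthFirstIncrementalParallel.findPath(a, d, 4);
for (Node n : path) System.out.println(n.Id); // should print 0, 2, 3
// Note: findPath never shuts its pool down, so the JVM may keep running
// after main returns unless the executor is terminated elsewhere
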


Full mapping
The full mapping parallel implementation of the breadth first search. Entry point is findShortestPathParallel().

package com.company;

import java.util.LinkedList;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BreadthFirstFullSearchParallel {
    private static ExecutorService executor;

    private static Thread mapThread(final Node n){
        return new Thread(){
            @Override
            public void run(){
                map(n);
            }
        };
    }

    public static List<Node> findShortestPathParallel(Node from, Node to, int numberOfThreads){
        executor = Executors.newFixedThreadPool(numberOfThreads);
        from.distance = 0;
        MapParallel(from, numberOfThreads);
        List<Node> path = getShortestPath(to);
        return path;
    }

    private static void map(Node baseNode){
        // Thread-local list of nodes to check
        LinkedList<Node> Queue = new LinkedList<Node>();
        Node cursor = baseNode;
        while(true){
            for(Node n : cursor.Neighbours){
                // Locks to prevent a race condition
                n.lock.lock();
                // Sets the new distance if shorter
                if(n.distance > cursor.distance + 1){
                    n.distance = cursor.distance + 1;
                    // If the distance is shorter, path finding has to be run from the node
                    Queue.addLast(n);
                }
                // Unlocks
                n.lock.unlock();
            }
            try{
                // Finished updating the distance to the neighbours; take the next element
                cursor = Queue.removeFirst();
            } catch (NoSuchElementException e){
                // If the queue is empty, we are finished with this path
                break;
            }
        }
    }

    private static void MapParallel(Node baseNode, int numberOfThreads){
        // Thread-local list of nodes to check
        LinkedList<Node> Queue = new LinkedList<Node>();
        Node cursor = baseNode;
        while(true){
            for(Node n : cursor.Neighbours){
                // Locks to prevent a race condition
                n.lock.lock();
                // Sets the new distance if shorter
                if(n.distance > cursor.distance + 1){
                    n.distance = cursor.distance + 1;
                    // If the distance is shorter, path finding has to be run from the node
                    Queue.addLast(n);
                }
                // Unlocks
                n.lock.unlock();
            }
            // Letting the queue build up until it passes the threshold value,
            // then starting a new task for each node in the queue
            if(Queue.size() > numberOfThreads * 10){
                for(Node n : Queue){
                    Thread t = mapThread(n);
                    executor.execute(t);
                }
                executor.shutdown();
                while(!executor.isTerminated()){
                }
                return;
            }
            try{
                // Finished updating the distance to the neighbours; take the next element
                cursor = Queue.removeFirst();
            } catch (NoSuchElementException e){
                // If the queue is empty, we are finished with this path
                return;
            }
        }
    }
}
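Since this version relaxes distances downwards, every node's distance must start out large; I assume the surrounding setup (not shown in the listing) resets distances before each run. Continuing GraphDemo.main, a usage sketch under that assumption:

// My assumption: distances are reset high before mapping, since new Node
// instances default to distance 0, which would block all relaxations
for (Node n : new Node[]{a, b, c, d}) n.distance = Integer.MAX_VALUE;
java.util.List<Node> path = BreadthFirstFullSearchParallel.findShortestPathParallel(a, d, 8);
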


Sequential breadth first
The sequential implementation of the breadth first search.

package com.company;

import java.util.LinkedList;
import java.util.List;
import java.util.NoSuchElementException;

public class BreadthFirstPathFinder {
    public static LinkedList<Node> Queue = new LinkedList<Node>();

    public static List<Node> findPath(Node start, Node target){
        Queue.clear();
        start.distance = 0;
        Node cursor = start;
        cursor.found = true;
        boolean found = false;
        outer:
        while(!found){
            for(Node n : cursor.Neighbours){
                if(n == target){
                    // Found the shortest path
                    found = true;
                    break outer;
                }
                // Adding adjacent nodes to the queue if not yet discovered
                if(!n.found) {
                    Queue.addLast(n);
                    n.distance = cursor.distance + 1;
                    n.found = true;
                }
            }
            try {
                cursor = Queue.removeFirst();
            } catch (NoSuchElementException e){
                break;
            }
        }
        LinkedList<Node> shortestPath = new LinkedList<Node>();
        if(found){
            shortestPath.addFirst(target);
            while(true){
                shortestPath.addFirst(cursor);
                if(cursor == start){
                    break;
                }
                for(Node n : cursor.Neighbours){
                    if(n.found && n.distance < cursor.distance) cursor = n;
                }
            }
        }
        return shortestPath;
    }
}
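For comparison, the sequential version takes only the start and target nodes. Again continuing the hypothetical GraphDemo.main from earlier:

java.util.List<Node> path = BreadthFirstPathFinder.findPath(a, d);
for (Node n : path) System.out.println(n.Id); // should print 0, 2, 3
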
