

Harnessing the multicores: Parallel Programming with Haskell

Marcio Machado Pereira
MO601, November 2010

    Abstract

With the advent of multicore processors, parallel programming has become a major challenge for software developers. Exploiting the potential of these machines requires new parallel software that can break a task into several parts, solve them more or less independently, and assemble the results into a single response. New ideas and programming tools originated in the scientific community are being developed, and they will certainly benefit the technical community by providing new ways to make better use of multicore processors. Our survey begins by discussing some of these techniques and how they are trying to raise the level at which parallel programs can be written. Next, we delve a little into the technique of annotations, i.e., the inclusion of mechanisms that allow the programmer to control the granularity of parallelism by indicating which "computations" can be undertaken in parallel. As we will see, this approach is very promising because the semantics of the program is completely deterministic, and the programmer is not required to identify and create execution threads or to make explicit use of mechanisms for communication and synchronization.

    1. Introduction

According to [1], in the past few years parallel computers have entered mainstream computing with the advent of multicore computers. Previously, most computers were sequential and performed a single operation per time step. Moore's Law drove the improvements in semiconductor technology that doubled the transistors on a chip every two years, which increased the clock speed of computers at a similar rate and also allowed for more sophisticated computer implementations. But this steady improvement ended, so manufacturers shifted to multicore processors that used each Moore's Law generation of transistors to double the number of independent processors on a chip. Each processor ran no faster than its predecessor, and sometimes even slightly slower, but in aggregate a multicore processor could perform twice the amount of computation as its predecessor.

This new computer generation rests on the same problematic foundation of software that the scientific community struggled with in its long experience with parallel computers. Most existing general-purpose software is written for sequential computers and will not run any faster on a multicore computer. Exploiting the potential of these machines requires new, parallel software that can break a task into multiple pieces, solve them more or less independently, and assemble the results into a single answer. Finding better ways to produce parallel software is currently the most pressing problem facing the software development community and is the subject of considerable research and development.


One key problem in parallel programming today is that most of it is conducted at a very low level of abstraction. Programmers must break their code into components that run on specific processors and communicate by writing into shared memory locations or exchanging messages. In many ways, this state of affairs is similar to the early days of computing, when programs were written in assembly languages for a specific computer and had to be rewritten to run on a different machine. In both situations, the problem was not just the lack of reusability of programs, but also that assembly language development was less productive and more error prone than writing programs in higher-level languages.

Fortunately, many parallel programming techniques that originated in the scientific community have influenced the search for new approaches to programming multicore computers. Future improvements in our ability to program multicore computers will benefit all software developers as well as create a new fundamental programming paradigm. In the next sections we will briefly discuss some of these techniques and how they are attempting to raise the level at which parallel programs can be written. After that we will present a technique called annotations and discuss one real implementation and its current issues.

    2. Data Parallel Programming

The oldest and best-established idea is data parallel programming. In this programming paradigm, an operation or sequence of operations is applied simultaneously to all items in a collection of data. The granularity of the operation can range from adding two numbers in a data parallel addition of two matrices to complex data mining calculations in a map-reduce style computation [2]. The appeal of data parallel computation is that parallelism is mostly hidden from the programmer. Each computation proceeds in isolation from the concurrent computations on other data, and the code specifying the computation is sequential. The developer need not worry about the details of moving data and running computations because they are the responsibility of the runtime system. GPUs (graphics processing units) provide hardware support for this style of programming, and they have recently been extended into GPGPUs (general-purpose GPUs) that perform very high-performance numeric computations.

Examples of data parallelism are High-Performance Fortran (HPF) and OpenMP1. Both exploit data parallelism by employing many processors to process different parts of a single array. Typically, the arrays are required to be flat (arrays of floats, for example), but that is often quite inconvenient for the programmer. In ground-breaking work in the 90s, the NESL2 language and its implementation offered data-parallel operations over nested data structures (such as arrays of variably-sized subarrays), and allowed all subarrays to be computed simultaneously in data-parallel [3].

1 OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

2 NESL is one of the most successful parallel functional languages. It is a strict, strongly-typed, data-parallel language with implicit parallelism and implicit thread interaction. It has been implemented on a range of parallel architectures, including several vector computers. NESL fully supports nested sequences and nested parallelism, and has the ability to take a parallel function and apply it over multiple operations over the data. NESL is loosely based on the ML functional language.
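To give a flavour of this style in Haskell itself, here is a minimal sketch (our own, not from the survey) of a flat data-parallel operation, written with the parListChunk combinator from GHC's parallel library, part of the strategies machinery discussed in section 5.2 and Appendix B; the chunk size of 64 and the sample matrices are arbitrary illustrations:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Element-wise addition of two matrices represented as lists of rows.
-- Rows are evaluated in parallel, in chunks of 64, and each row is
-- reduced to normal form (rdeepseq) so the work really happens in parallel.
matAdd :: [[Double]] -> [[Double]] -> [[Double]]
matAdd a b = zipWith (zipWith (+)) a b `using` parListChunk 64 rdeepseq

main :: IO ()
main = print (matAdd [[1, 2], [3, 4]] [[5, 6], [7, 8]])

The sequential code (zipWith of zipWith) is unchanged; the strategy only describes how its result is evaluated.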


Unfortunately, data parallelism is not a programming model that works for all types of problems. Some computations require more communication and coordination. For example, protein folding calculates the forces on all atoms in parallel, but local interactions are computed in a manner different from remote interactions. Other examples of computations that are hard to write as data parallel programs include various forms of adaptive mesh refinement, used in many modern physics simulations, in which local structures, such as clumps of matter or cracks in a material structure, need finer spatial resolution than the rest of the system.

    3. Transactional Memory

A new idea that has recently attracted considerable research attention is transactional memory (TM), a mechanism for coordinating the sharing of data in a multicore computer. Data sharing is a rich source of programming errors because the developer needs to ensure that a processor that changes the value of data has exclusive access to it. If another processor also tries to access the data, one of the two updates can be lost, and if a processor reads the data too early, it might see an inconsistent value. The most common mechanism for preventing this type of error is a lock, which a program uses to prevent more than one processor from accessing a memory location simultaneously. Locks, unfortunately, are low-level mechanisms that are easily and frequently misused in ways that both allow concurrent access and cause deadlocks that freeze program execution.

TM is a higher-level abstraction that allows the developer to identify a group of program statements that should execute atomically, that is, as if no other part of the program were executing at the same time. So instead of having to acquire locks for all the data that the statements might access, the developer shifts the burden to the runtime system and hardware. TM is a promising idea, but many engineering challenges still stand in the way of its widespread use. Currently, TM is expensive to implement without support in the processors, and its usability and utility in large, real-world codes is as yet undemonstrated. If these issues can be resolved, TM promises to make many aspects of multicore programming far easier and less error prone.
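Haskell already exposes this abstraction in software, through the composable memory transactions of [16]. As a minimal sketch (our own, not from the survey), an atomic transfer between two shared accounts can be written with the stm library's Control.Concurrent.STM, with no explicit locks; the Account type and transfer function are illustrative names:

import Control.Concurrent.STM

-- An account is just a transactional variable holding a balance.
type Account = TVar Int

-- Move `amount` from one account to another. The whole block runs
-- atomically: no other thread can observe an intermediate state.
transfer :: Account -> Account -> Int -> IO ()
transfer from to amount = atomically $ do
  fromBal <- readTVar from
  toBal   <- readTVar to
  writeTVar from (fromBal - amount)
  writeTVar to   (toBal + amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  transfer a b 30
  atomically (readTVar b) >>= print   -- prints 30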

    4. Functional Programming

Another new idea is the use of functional programming languages. Peyton Jones3 presents four arguments in support of this claim. First, functional languages do not require explicit language constructs to represent parallelism, synchronization, or communication among tasks. Parallelism is implicit in the algorithm that the programmer chooses, and any independent computations can be performed concurrently if the resources are available; the language implementation automatically handles synchronization and communication among tasks. Second, no special mechanisms are necessary to protect data that is shared among concurrent tasks. Any task that references a common sub-expression may perform the evaluation, and any other task can use the evaluated result. Third, the same formal reasoning that applies to sequential functional programs also applies to parallel ones. Parallel evaluation is a feature of the implementation, but the semantics of the language is defined at an abstract level and applies to both parallel and sequential implementations. Finally, the results of a functional program are determinate. Any correct implementation of a functional language must preserve referential transparency, which means that the program itself must provide a consistent mapping from inputs to outputs. In particular, no external or implementation-dependent factors, such as the scheduling of individual tasks, may influence this mapping.

3 Peyton Jones is a researcher at Microsoft Research in Cambridge, England, and also an Honorary Professor of the Computing Science Department at Glasgow University, where he was a professor during 1990-1998.

Despite these arguments, it has proved hard to realize this potential in practice. Plenty of papers describe promising ideas, but vastly fewer describe real implementations with good wall-clock performance. Actually, the idea of parallel functional programming dates back to 1975 [9, 10], when researchers suggested the technique of evaluating function arguments in parallel, with the possibility of functions absorbing unevaluated arguments and perhaps also exploiting speculative evaluation. In general, the parallelism obtained from referential transparency in pure functional languages is of too fine a granularity, not yielding good performance. The search for ways of controlling the degree of parallelism of functional programs by means of automatic mechanisms, either static or dynamic, has had little success [11, 12]. Compilers that exploit implicit parallelism have faced difficulty in promoting good load balancing amongst processors and in keeping communication costs low.

For example, consider the following naive program for computing the factorial of a positive integer:

    factorial 1 = 1

    factorial n = n * factorial (n - 1)

This program embodies the usual recursive definition of the factorial function: 1! is just 1, and n! is n*(n-1)! for all larger values of n. For a sequential implementation, this program is reasonable. However, it contains no useful parallelism. Calculating n! requires a total of n multiplications, but these multiplications do not have to be performed in the particular sequence that the above program suggests. A parallel program for computing factorial may take the less intuitive divide-and-conquer approach to the problem, as shown below:

factorial n = product (1, n)
  where
    product (lo, hi)
      | lo == hi  = lo
      | otherwise = product (lo, mid) * product (mid + 1, hi)
      where mid = (lo + hi) `div` 2

This program defines n! as the product of all integers in the interval from 1 to n. The product of any interval is then defined in two parts. For a degenerate interval that contains only one integer, the product is just that integer. A larger interval is split at the midpoint into two subintervals, and the product for the whole interval is obtained by multiplying together the products for the two subintervals. This program contains implicit parallelism, because each interval is divided into two independent subintervals that can be processed concurrently. It is no coincidence that the second program is more complicated than the first; even in a functional language, parallel programming is more difficult than sequential programming because of the decomposition of the problem into concurrently executable tasks. Although a compiler can automatically extract parallelism from a functional program, there is some evidence to suggest that explicit programmer control is necessary. One way to try to obtain more reliable performance on parallel architectures is to include annotations in the program to indicate when parallel computation might be beneficial.


In the next section we will describe some of these experiments in a real programming environment: the Glasgow Haskell Compiler (GHC) and the parallel Haskell runtime system, known as Glasgow parallel Haskell (GpH).

    5. Parallelism in Functional Language Haskell

At least in theory, Haskell has a head start in the race to find an effective way to program parallel hardware. Purity-by-default means that there should be a wealth of inherent parallelism in Haskell code, and the ubiquitous lazy evaluation model means that, in a sense, futures are built in. How can we turn these benefits into real speedups on commodity hardware? According to [4], completely implicit parallelism is still a distant goal; one recent attempt at this in the context of Haskell can be found in [5]. In that paper, Harris and Singh based their work around the use of thunks4: thunks provide a natural abstraction with which to look for implicit parallelism. They used speculative techniques to predict which thunks are likely to be good candidates for parallel execution. The figure below shows actual performance on multicore hardware, normalized against single-core performance, using 1, 2, 3, and 4 cores. A performance of 1 means the same as sequential execution, 2 means twice as fast as sequential execution, and so on (Appendix A summarizes the test programs that Harris and Singh used).

As the researchers have noted, there are several tensions here: optimizations for non-strict languages often try to avoid or delay thunk allocation, whereas their approach exploits thunk allocation as a potential source of parallelism. It would be interesting to explore how reducing the level of optimization affects both the available parallelism and the eventual performance achieved in practice.

On the other hand, the approach used in GpH is intermediate between the purely implicit and purely explicit approaches. Actually, the GpH runtime supports three models of parallelism: explicit thread-based concurrency [6], semi-explicit deterministic parallelism [7], and data-parallelism [8]. The semi-explicit GpH programming model has been shown to be remarkably effective [14]. The semantics of the program remains completely deterministic, and the programmer is not required to identify threads, communication, or synchronization. They merely annotate sub-computations that might be evaluated in parallel, leaving the choice of whether to actually do so to the runtime system. These so-called sparks are created and scheduled dynamically, and their grain size varies widely.

4 Lazy evaluation in Haskell is based on the allocation and execution of thunks, which represent suspended computations whose results may or may not be needed. A lazy run-time system does not evaluate a thunk unless it has to. Expressions are translated into a graph and the run-time reduces it, discarding any unneeded thunk unevaluated. Section 6 outlines the GHC run-time system.
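To make footnote 4 concrete, here is a minimal illustration (our own, not from the survey) of a thunk that is allocated but never forced; the numbers are arbitrary:

-- `heavy` is bound to a suspended computation (a thunk). Because `length`
-- only needs the spine of the list, not its elements, the thunk is never
-- evaluated and the program prints 3 immediately instead of summing a
-- billion numbers.
main :: IO ()
main = do
  let heavy = sum [1 .. 10 ^ 9 :: Integer]
  print (length [heavy, heavy, heavy])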

    5.1. Annotations in Haskell

Glasgow parallel Haskell (GpH) provides a mechanism that allows the user to control the granularity of parallelism by indicating which computations may be usefully carried out in parallel. This is done by the addition of two built-in combinators to control parallelism: a parallel combinator par and a sequential combinator seq, with the following types:

    par :: a -> b -> b

    seq :: a -> b -> b

The combinator par indicates to the Haskell run-time system that it may be beneficial to evaluate the first argument in parallel with the second argument. Evaluating (par e1 e2) first adds e1 to a pool of work available for idle processors, and then continues by evaluating e2. The par function returns as its result the value of the second argument. The GpH run-time system does not necessarily create a thread to compute the value of the expression e1. Instead, it creates a spark, which has the potential to be executed on a different thread from the parent thread. A sparked computation expresses the possibility of performing some speculative evaluation. Since a thread is not necessarily created to compute the value of e1, this approach has some similarities with the notion of a lazy future [15]. This parallelism can be further coordinated by using seq to specify a sequence of evaluation: the first expression is evaluated before the second is returned.
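As an aside (our own sketch, not from the survey), the divide-and-conquer factorial from section 4 could be annotated with these combinators so that the two subintervals are evaluated in parallel; par is imported from Control.Parallel, and seq is the standard Prelude function:

import Control.Parallel (par)

factorial :: Integer -> Integer
factorial n = prod 1 n
  where
    prod lo hi
      | lo >= hi  = lo
      -- Spark the lower half, evaluate the upper half in this thread,
      -- then combine the two results.
      | otherwise = left `par` (right `seq` (left * right))
      where
        mid   = (lo + hi) `div` 2
        left  = prod lo mid
        right = prod (mid + 1) hi

main :: IO ()
main = print (factorial 20)

Without a parallel runtime the annotation is harmless: par simply returns its second argument, so the sequential meaning of the program is unchanged.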

    An example of the use of this mechanism is shown below. First, we have a normalHaskell function that sorts a list using a divide-and-conquer approach:

quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = losort ++ p:hisort
  where
    losort = quicksort [y | y <- xs, y < p]
    hisort = quicksort [y | y <- xs, y >= p]


We can parallelize this function by sparking the evaluation of losort while the current thread computes hisort. With the par and seq combinators, the programmer does not need to explicitly create any threads or write any code for inter-thread communication or synchronization. Sometimes it is convenient to write a function of two arguments as an infix function, and this is done in Haskell by writing backquotes around the function name:

quicksort :: Ord a => [a] -> [a]
quicksort []     = []
quicksort (p:xs) = losort `par` hisort `seq` (losort ++ p:hisort)
  where
    losort = quicksort [y | y <- xs, y < p]
    hisort = quicksort [y | y <- xs, y >= p]


The point is that we can make small modifications to the program using existing, well-known techniques, and thereby achieve decent speedup on today's parallel hardware.
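With GHC, programs annotated in this way are compiled with the -threaded option and run with a runtime flag such as +RTS -N2 so that two cores are made available to the scheduler.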

    5.2 Strategies Controlling Evaluation Degree

This section gives an abridged introduction to the parallel programming technique called evaluation strategies. We focus on the language features necessary to achieve the basic functionality and highlight the advantages of this parallel programming technique. A complete description and discussion of evaluation strategies can be found in [7].

Strategies are a remarkably simple idea. In the original formulation, a strategy is a function of type a -> (), for some a:

    type Strategy a = a -> ()

A Strategy specifies the dynamic behavior required when computing a value of a given type. It makes no contribution towards the value being computed by the algorithmic component of the function: it is evaluated purely for effect, and hence it returns just the empty tuple (). Like the quicksort example, strategies may be applied with the using combinator:

    using :: a -> Strategy a -> a

    using x s = s x `seq` x

Some basic strategies can be defined as follows. Because reduction to WHNF5 is the default evaluation degree in GpH, we sometimes need a strategy to reduce an expression to its normal form (NF), i.e. a form that contains no redexes6. This is achieved by the rnf strategy. The rnf strategy can be defined over both built-in and user-defined types, but not over function types or any type incorporating a function type, since few reduction engines support the reduction of inner redexes within functions. The obvious solution is to use a Haskell type class, NFData, to overload the rnf operation. Because NF and WHNF coincide for built-in types such as integers and booleans, the default method for rnf is rwhnf. For each data type an instance of NFData must be declared that specifies how to reduce a value of that type to normal form. Such an instance relies on its element types, if any, being in class NFData. Consider, for example, the instance to use with lists.

5 Normally, GHC evaluates an expression to what we call head normal form (abbreviated HNF). It stops once it reaches the outermost constructor (the head). This is distinct from normal form (NF), in which an expression is completely evaluated. An expression is in weak head normal form (WHNF) if it is a head normal form (HNF) or any lambda abstraction, i.e., the top level is not a redex. The term was coined by Simon Peyton Jones to make explicit the difference between head normal form (HNF) and what graph reduction systems produce in practice. For normal data, weak head normal form is the same as head normal form. The difference only arises for functions.

6 Executing a functional program, i.e. evaluating an expression, means repeatedly applying function definitions until all function applications have been expanded. Every reduction replaces a subexpression, called a reducible expression or redex for short, with an equivalent one, either by appealing to a function definition or by using a built-in function. An expression without redexes is said to be in normal form. Of course, execution stops once a normal form is reached, which is thus the result of the computation.


    rwhnf :: Strategy a

    rwhnf x = x `seq` ()

    class NFData a where

    rnf :: Strategy a

    rnf = rwhnf

    instance NFData a => NFData [a] where

    rnf [] = ()

    rnf (x:xs) = rnf x `seq` rnf xs
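As a further illustration (our own, not from the text), an instance for a user-defined type follows the same pattern, forcing every component; the Tree type here is hypothetical:

-- Assuming the NFData class declared above is in scope: reduce a tree to
-- normal form by forcing both subtrees and the element at each node.
data Tree a = Leaf | Node (Tree a) a (Tree a)

instance NFData a => NFData (Tree a) where
  rnf Leaf         = ()
  rnf (Node l x r) = rnf l `seq` rnf x `seq` rnf r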

Evaluation strategies use lazy higher-order functions to separate the two concerns of specifying the algorithm and specifying the program's dynamic behavior. A function definition is split into two parts, the algorithm and the strategy, with values defined in the former being manipulated in the latter. The algorithmic code is consequently uncluttered by details relating only to the parallel behavior. The primary benefits of the evaluation strategy approach are similar to those that are obtained by using laziness to separate the different parts of a sequential algorithm: the separation of concerns makes both the algorithm and the dynamic behavior easier to comprehend and modify. Changing the algorithm may entail specifying new dynamic behavior; conversely, it is easy to modify the strategy without changing the algorithm. Because evaluation strategies are written using the same language as the algorithm, they have several other desirable properties:

• Strategies are powerful: simpler strategies can be composed, or passed as arguments, to form more elaborate strategies (see the sketch after this list);
• Strategies can be defined over all types in the language;
• Strategies are extensible: the user can define new application-specific strategies;
• Strategies are type safe: the normal type system applies to strategic code;
• Strategies have a clear semantics, which is precisely that used by the algorithmic language.
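As a sketch in the style of [7] (not reproduced from this survey), simple strategies compose into a parallel map; Strategy and using repeat the definitions given above, and par comes from Control.Parallel:

import Control.Parallel (par)

type Strategy a = a -> ()          -- as defined earlier

using :: a -> Strategy a -> a      -- as defined earlier
using x s = s x `seq` x

-- parList applies a strategy to every element of a list in parallel;
-- parMap then evaluates all results of a map according to that strategy.
parList :: Strategy a -> Strategy [a]
parList _     []     = ()
parList strat (x:xs) = strat x `par` parList strat xs

parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap strat f xs = map f xs `using` parList strat

With these definitions, parMap rwhnf f xs evaluates every element of map f xs in parallel to weak head normal form while the algorithmic code, map f xs, stays untouched.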

For instance, strategies can control other aspects of dynamic behavior, thereby avoiding cluttering the algorithmic code with them. A simple example is a thresholding mechanism that controls thread granularity. In the Fibonacci example below, granularity is improved for many machines if threads are not created when the argument is small.

fibonacci :: Int -> Int
fibonacci n
  | n <= 1    = 1
  | otherwise = (n1 + n2) `demanding` strategy
  where
    n1 = fibonacci (n - 1)
    n2 = fibonacci (n - 2)
    strategy = if n > 10
               then rnf n1 `par` rnf n2
               else ()

In the example above, the strategy is invoked with demanding to ensure that it is evaluated. The Haskell flip function simply reorders a binary function's parameters:

    demanding :: a -> () -> a

    demanding = flip seq

Evaluation strategies have been implemented in GpH and used in a number of large-scale parallel programs, including data-parallel complex database queries, a divide-and-conquer linear equation solver, and a pipelined natural-language processor, Lolita. Lolita is large, comprising over 60,000 lines of Haskell. Experience shows that strategies facilitate the top-down parallelization of existing programs.

The figure below summarizes some benchmarks7 reported by Marlow and others in [4]. The benchmarks consist of a selection of small-to-medium-sized programs described in Appendix B. The figure shows the results for each benchmark program after the improvements obtained by the use of evaluation strategies, relative to the performance of the sequential version. By sequential we mean that the single-threaded version of the runtime system was used, in which par is a no-op and there are no synchronization overheads.

    6. Haskell Runtime System

In any programming language supporting concurrency, a great deal of complexity is hidden inside the implementation of the concurrency abstractions. Much of this support takes the form of a runtime system that supports threads, primitives for thread communication (e.g. locks, condition variables and transactional memory), a scheduler, and much else besides. In this section we explore a little of the concurrency support in the Glasgow Haskell Compiler (GHC).

7 In this work Marlow and others used only 7 of the 8 cores on their test system. In fact they did perform the measurements for all 8 cores, but found that the results were far less consistent than the 7-core results, and in some cases performance degraded significantly. On closer inspection the OS appeared to be de-scheduling one or more of their threads, leading to long pauses when the threads needed to synchronize. This effect is discussed in more detail in the paper.


The first abstraction is a Haskell Execution Context, or HEC. A HEC should be thought of as a virtual CPU; the runtime may map it to a real CPU, or to an operating system thread (OS thread). Since a program has multiple HECs, each perhaps executing on a different CPU, the runtime must provide a safe way for the HECs to communicate and synchronize with each other. The standard way to do so, and the one directly supported by most operating systems, is to use locks and other forms of low-level synchronization such as condition variables. However, while locks provide good performance, they are notoriously difficult to use. In particular, program modules written using locks are difficult to compose elegantly and correctly [16].

Even ignoring all these difficulties, however, there is another Very Big Problem with using locks as the runtime's main synchronization mechanism in a lazy language like Haskell. A typical use of a lock is this: take a lock, modify a shared data structure (a global ready-queue, perhaps), and release the lock. The lock is used only to ensure that the shared data structure is mutated in a safe way. Crucially, a HEC never holds a lock for long, because blocking another HEC on the lock completely stops a virtual CPU. Here is how we might realize this pattern in Haskell:

do { takeLock lk
   ; rq <- readIORef readyQueueRef        -- read the shared ready-queue
   ; writeIORef readyQueueRef (update rq) -- modify it in some way
   ; releaseLock lk }                     -- (identifiers here are illustrative)


The GHC runtime uses a work-stealing queue to represent each spark pool. Each pool is a single FIFO (first in, first out) queue, and each HEC is free to take sparks from another HEC's pool if its own queue is empty. Of course, there is a mechanism to prevent more than one HEC from competing for the same resource8.

To evaluate sparks, HECs create spark threads. Each spark thread first executes the sparks already in its own HEC's spark pool. If its own queue is empty, it tries to steal sparks from another HEC. After finishing the computation it took on (evaluating it to WHNF), the spark thread repeats the cycle, until there are no more sparks to execute or steal; at that point the spark thread is stopped.

There can be more than one spark thread per HEC at a given time. This prevents a HEC from becoming idle if one (or more) of its spark threads momentarily blocks (in the operating system, for example). When that is no longer the case, however, the HEC aims to eliminate the excess spark threads.

When a spark thread takes a spark from the queue to run, it prefers the oldest one. In this way, coarse tasks are performed first, which is desirable because executing tiny sparks would cause overhead; parallelism works best when the work units are larger. However, we do not want every grain to be so large that there are not enough of them to divide among all the available processing units, or else a high load imbalance between them results.

    Given the following expression in Haskell:

    a `par` (b `seq` (a + b))

As we saw, par causes a spark to be added to the current HEC's spark pool, representing the computation of its first argument a. The rest of the expression is evaluated in the main thread itself. Now, seq requires b to be evaluated before the addition. However, it is perfectly feasible that b finishes being computed before anyone has taken the sparked a to evaluate it. In this scenario the program proceeds with the calculation of the sum, but at that point the value of a is not yet available, so the main thread itself evaluates a in the normal course of computing the required result. If this occurs, obviously, the sparked a no longer needs to be evaluated. When a spark in the spark pool refers to a value, rather than an unevaluated computation, we say the spark has fizzled; this potential for parallel execution has expired. The runtime system can, and should, remove fizzled sparks from the spark pool so that the storage manager can release the memory they refer to.
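To make the expression above concrete, here is a minimal, self-contained program (our own sketch, not from the survey) in which a and b are two independent calls to a naive Fibonacci function; pseq from Control.Parallel is used in place of seq because, unlike seq, it guarantees that its first argument is evaluated before its second:

import Control.Parallel (par, pseq)

nfib :: Int -> Integer
nfib n | n < 2     = 1
       | otherwise = nfib (n - 1) + nfib (n - 2)

main :: IO ()
main = print (a `par` (b `pseq` (a + b)))
  where
    -- `a` is sparked and may be evaluated by another HEC while the main
    -- thread evaluates `b`; if nobody picks the spark up in time, the main
    -- thread evaluates `a` itself and the spark fizzles.
    a = nfib 33
    b = nfib 32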

    7. Conclusion

We have shown that purely functional languages like Haskell are viable alternatives for addressing parallel programming and for exploiting multicore architectures. In fact, in the last few years Haskell (and GHC in particular) has gained impressive support for parallel programming on commodity multi-core systems. In addition to traditional threads and shared variables, Haskell supports multicore execution out of the box through multiple parallel programming models: evaluation strategies, Concurrent Haskell with STM, and data parallelism.

8 A work-stealing queue is a lock-free data structure with some attractive properties: the owner of the queue can push and pop from one end without synchronization, while other threads can steal from the other end of the queue incurring only a single atomic instruction. When the queue is almost empty, popping also incurs an atomic instruction to avoid a race between the popping thread and a stealing thread.

Besides, the GHC runtime system has received attention recently, with significant improvements in parallel performance. However, optimizing the runtime only addresses half of the problem; the other half is how to tune a given Haskell program to run effectively in parallel. The programmer still has control over task granularity, data dependencies, speculation, and to some extent evaluation order. Getting these wrong can be disastrous for parallel performance. For example, the granularity should be neither too fine nor too coarse. Too coarse, and the runtime will not be able to load-balance effectively to keep all CPUs constantly busy; too fine, and the costs of creating and scheduling the tiny tasks outweigh the benefits of executing them in parallel. Current methods for tuning parallel Haskell programs rely largely on trial and error, experience, and an eye for understanding the limited statistics produced by the runtime system at the end of a program's run.

What we need are effective ways to measure and collect information about the runtime behavior of parallel programs, and tools to communicate this information to the programmer in a way that they can understand and use to solve performance problems with their programs. The ability to profile parallel programs plays an important role in this research field, because the analysis process motivates the need to develop new, specialized strategies to help control evaluation order, extent, and granularity.


    References

[1] J. Larus and D. Gannon. Multicore computing and scientific discovery. In The Fourth Paradigm: Data-Intensive Scientific Discovery, Part 3: Scientific Infrastructure. Microsoft Research, 2009.

[2] D. Gannon and D. Reed. Parallelism and the Cloud. In The Fourth Paradigm: Data-Intensive Scientific Discovery, Part 3: Scientific Infrastructure. Microsoft Research, 2009.

[3] G. E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85-97, 1996.

[4] S. Marlow, S. Peyton Jones, and S. Singh. Runtime support for multicore Haskell. In ICFP'09: Proceedings of the 14th ACM SIGPLAN International Conference on Functional Programming, Edinburgh, Scotland, UK, 2009.

[5] T. Harris and S. Singh. Feedback directed implicit parallelism. In ICFP'07: Proceedings of the 12th ACM SIGPLAN International Conference on Functional Programming, New York, NY, USA, 2007.

[6] S. Peyton Jones, A. Gordon, and S. Finne. Concurrent Haskell. In Proceedings of POPL'96, pages 295-308. ACM Press, 1996.

[7] P. W. Trinder, K. Hammond, H.-W. Loidl, and S. L. Peyton Jones. Algorithm + strategy = parallelism. Journal of Functional Programming, 8:23-60, January 1998.

[8] S. Peyton Jones, R. Leshchinskiy, G. Keller, and M. Chakravarty. Harnessing the multicores: Nested data parallelism in Haskell. In FSTTCS'09: IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, 2009.

[9] R. D. Lins. Functional programming and parallel processing. In 2nd International Conference on Vector and Parallel Processing (VECPAR'96), 1996.

[10] K. Hammond and G. Michaelson. Research Directions in Parallel Functional Programming. Springer-Verlag, 1999.

[11] S. L. Peyton Jones, C. Clack, and J. Salkild. GRIP - a high-performance architecture for parallel graph reduction. In FPCA'87: Conference on Functional Programming Languages and Computer Architecture, 1987.

[12] O. Kaser, C. R. Ramakrishnan, I. V. Ramakrishnan, and R. C. Sekar. Equals - a fast parallel implementation of a lazy language. Journal of Functional Programming, 7(2):183-217, 1997.

[13] D. Jones Jr., S. Marlow, and S. Singh. Parallel performance tuning for Haskell. In Haskell'09: ACM SIGPLAN Symposium on Haskell. ACM, 2009.

[14] H.-W. Loidl, F. Rubio, N. Scaife, K. Hammond, S. Horiguchi, U. Klusik, R. Loogen, G. J. Michaelson, R. Peña, S. Priebe, J. Rebón, and P. W. Trinder. Comparing parallel functional languages: Programming and performance. Higher-Order and Symbolic Computation, 16(3):203-251, 2003.

[15] E. Mohr, D. A. Kranz, and R. H. Halstead. Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3), July 1991.

[16] T. Harris, S. Marlow, S. Peyton Jones, and M. Herlihy. Composable memory transactions. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP'05), June 2005.

[17] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), pages 720-748, September 1999.


    Appendix A

    Summary of the test programs used by Harris and Singh (see section 5.):

    atom Floating point simulation, nofib/spectral

    boyer Gabriel suite boyer benchmark, nofib/spectral

    bsort-1 Sorting circuit model, locally written

    bsort-2 Sorting circuit model, locally written

    cacheprof Cache profiling tool, nofib/real

calendar Prints a given year's calendar, nofib/spectral

    circsim Circuit simulator, nofib/spectral

    clausify Put propositions into clausal form, nofib/spectral

    compress Text compression algorithm, nofib/real

    fft2 Fourier transforms, nofib/spectral

    fibheaps Fibonacci heaps, nofib/spectral

    hidden Line rendering, nofib/real

lcss Hirschberg's LCSS algorithm, nofib/spectral

    multiplier Binary-multiplier simulator, nofib/spectral

    para Paragraph formatting, nofib/spectral

    primetest Primality testing, nofib/spectral

    rewrite Equational rewriting system, nofib/spectral

    scs Circuit simulator, nofib/real

    simple Hydrodynamics and heat-flow, nofib/spectral

sphere Ray tracer, nofib/spectral


    Appendix B

The programs used in Marlow's benchmarks (section 5.2) are mostly easy to parallelize and are not highly optimized, so the results reported in the paper should be interpreted as suggestive rather than conclusive. Nevertheless, their goal was not to optimize the programs, but rather to optimize the implementation so that existing programs parallelize better:

parfib: the ubiquitous parallel Fibonacci function, included as a sanity test to ensure that the implementation is able to parallelize micro-benchmarks. The parallelism is divide-and-conquer-style, using explicit par and seq.

sumeuler: the sum of the values of Euler's function applied to each integer up to a given bound. This is a map/reduce style problem: applications of the Euler function can be performed in parallel, and the results must be summed. The parallelism is expressed using parListChunk from the strategies library (a sketch of this pattern appears after this list).

matmult: a naive matrix-multiply algorithm. The matrix is represented as a [[Int]]. The parallelism is expressed using parListChunk.

ray: a ray-tracer benchmark. The parallelism is expressed using parBuffer, and is quite fine-grained (each pixel to be rendered is a separate spark).

gray: another ray-tracing benchmark, this time taken from an entry in the ICFP'00 programming contest. Only the rendering part of the program has been parallelized, using a parBuffer as above. According to time profiling, the program only spends about 50% of its time in the renderer, so this is expected to limit the achievable parallelism. The parallelism is expressed using a single parBuffer in the renderer.

prsa: a parallel RSA message encoder, encoding a 500KB message. Parallelism is again expressed using parBuffer.

partree: a parallel map and fold over a tree. The program originates in the GUM benchmark suite, and in fact appears to be badly written: it is quadratic in the size of the tree. Nevertheless, it does appear to run in parallel, so the program was used unmodified for the purposes of benchmarking.

mandel: a Mandelbrot-set program originating in the nofib benchmark suite. It generates a lazy list of pixel data (for a 1024x1024 scene), in a similar way to the ray tracer, and it was parallelized in the same way with the addition of parBuffer. The difference in this case is that the parallelism is more coarse-grained: each scan-line of the result is a separate spark.
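As a sketch of the sumeuler pattern (our own reconstruction, not the benchmark's actual source), Euler's totient is mapped over the inputs and the list of results is evaluated in parallel chunks; the chunk size of 100 and the bound of 5000 are arbitrary:

import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Euler's totient: the number of integers in [1..n] coprime to n.
totient :: Int -> Int
totient n = length [ k | k <- [1 .. n], gcd n k == 1 ]

-- Map totient over [1..n], evaluate the results in parallel chunks of 100,
-- then sum them in the main thread.
sumEuler :: Int -> Int
sumEuler n = sum (map totient [1 .. n] `using` parListChunk 100 rseq)

main :: IO ()
main = print (sumEuler 5000)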