
Int J Parallel Prog
DOI 10.1007/s10766-010-0135-4

Extending OpenMP to Survive the Heterogeneous Multi-Core Era

Eduard Ayguadé · Rosa M. Badia · Pieter Bellens · Daniel Cabrera · Alejandro Duran · Roger Ferrer · Marc Gonzàlez · Francisco Igual · Daniel Jiménez-González · Jesús Labarta · Luis Martinell · Xavier Martorell · Rafael Mayo · Josep M. Pérez · Judit Planas · Enrique S. Quintana-Ortí

Received: 27 April 2010 / Accepted: 27 April 2010
© Springer Science+Business Media, LLC 2010

Abstract This paper advances the state of the art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired by the StarSs programming model. The proposed extensions allow the programmer to easily write portable code for a number of different platforms, relieving them of developing the specific code to off-load tasks to the accelerators and to synchronize those tasks. Our results, obtained from the StarSs instantiations for SMPs, the Cell, and GPUs, report reasonable parallel performance. However, the real impact of our approach is the productivity gain it yields for the programmer.

E. Ayguadé (B) · R. M. Badia · P. Bellens · D. Cabrera · A. Duran · R. Ferrer · M. Gonzàlez · D. Jiménez-González · J. Labarta · L. Martinell · X. Martorell · J. M. Pérez · J. Planas
Barcelona Supercomputing Center (Centro Nacional de Supercomputación, BSC-CNS), 08034 Barcelona, Spain
e-mail: [email protected]

P. Bellens, e-mail: [email protected]
D. Cabrera, e-mail: [email protected]
A. Duran, e-mail: [email protected]
R. Ferrer, e-mail: [email protected]
M. Gonzàlez, e-mail: [email protected]
D. Jiménez-González, e-mail: [email protected]
J. Labarta, e-mail: [email protected]
L. Martinell, e-mail: [email protected]
X. Martorell, e-mail: [email protected]
J. M. Pérez, e-mail: [email protected]
J. Planas, e-mail: [email protected]

E. Ayguadé · M. Gonzàlez · D. Jiménez-González · J. Labarta · X. Martorell
Depto. de Arquitectura de Computadores, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain

R. M. Badia
IIIA, Artificial Intelligence Research Institute, CSIC, Spanish National Research Council, Madrid, Spain
e-mail: [email protected]

F. Igual · R. Mayo · E. S. Quintana-Ortí
Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I (UJI), 12071 Castellón, Spain
F. Igual, e-mail: [email protected]
R. Mayo, e-mail: [email protected]
E. S. Quintana-Ortí, e-mail: [email protected]



Keywords Parallel computing · Programming models · Runtime systems · Task-level parallelism · Multi-core processors · Hardware accelerators · Heterogeneous computing

1 Introduction

In response to the combined hurdles of power dissipation, large memory latency, and little instruction-level parallelism left to be exploited, all major hardware manufacturers have adopted the replication of cores on-chip as the mainstream path to deliver higher performance [1]. Today, chips with a few general-purpose, fully functional cores are available from Intel (2–6 cores), AMD (2–4 cores), or Sun (8 cores), to name a few, and the number of cores is expected to increase with each shrink of the process technology. Chips in the near future will potentially integrate hundreds or thousands of cores.

Graphics processors (GPUs) from NVIDIA and AMD/ATI, on the other hand, are already in the many-core era, featuring hundreds of fine-grained stream cores per processor (up to 240 and 320 in their latest designs, respectively). Together with GPUs, hardware accelerators like the heterogeneous IBM/Sony/Toshiba Cell B.E., ClearSpeed ASICs or the FPGAs from multiple vendors are appealing in that, compared with general-purpose multi-core processors, they offer much higher performance-cost and performance-power ratios for certain applications. Therefore, we envision a future of heterogeneous architectures, equipped with a few coarse-grain general-purpose cores and several accelerators, probably of a different nature. Each part of an application will run on the most appropriate technology, while other parts of the processor may be turned off to decrease the power used by the chip.




While the increasing number of transistors on-chip dictated by Moore's Law indicates that many-core processors, with a blend of heterogeneous technologies, will be feasible in the near future, the availability of applications for these architectures, and more specifically of programming models, will really determine the success or failure of these designs. How easy it is to develop programs that efficiently exploit parallelism at all levels (instruction, data, task, and node) in these parallel processors is the key to their future. The heterogeneous nature of these systems, and the existence of multiple separate address spaces, only raises the programmability wall even higher.

The majority of proposals in this current “Tower of Babel” era assume a host-directed programming and execution model with attached accelerator devices. The bulk of the user's application executes on the host while user-specified code regions are offloaded to the accelerator. In general, the specifics of the different accelerator architectures make programming extremely difficult if one plans to use the vendor-provided SDKs (e.g., libspe for the Cell B.E. or CUDA for NVIDIA GPUs). It would be desirable to retain most of the advantages of using these SDKs, but in a productive and portable manner, avoiding the mix of hardware-specific code (for task offloading, data movement, etc.) with application code. The recent attempt of OpenCL to unify the programming models for architectures based on hardware accelerators tries to ensure portability, low-level access to the hardware, and supposedly high performance. We believe, however, that OpenCL still exposes much of the low-level details, making it cumbersome to use for non-experts.

OpenMP [2] survived the explosion of parallel programming languages of the 90s to become the standard for parallelizing regular applications on shared-memory multiprocessors. While recent additions in OpenMP 3.0 accommodate task parallelism, making it more suitable for irregular codes, OpenMP is not ready for the challenges posed by the new generation of multi-core heterogeneous architectures.

Star Superscalar (StarSs) [3] is a promising programming model in this direction that we have used as the starting point to propose a set of extensions to OpenMP. StarSs has its roots in the runtime exploitation of task parallelism, with special emphasis on portability, simplicity and flexibility. Functions that are suitable for parallel execution are annotated as tasks, and their arguments are tagged with their directionality (input, output or both). This information is provided using a simple OpenMP-like annotation of the source code (e.g., pragmas in C/C++) and is used at runtime to build a task dependence graph dynamically. This graph is one of the main elements used by the runtime to schedule tasks as soon as all their dependences are honored and an appropriate resource to execute them is available. For those architectures with local memories, the runtime also takes care of moving in/out the associated data. The scheduler may also be driven by data locality policies; other target-dependent optimizations of the scheduler can also be incorporated into the general framework of StarSs.


Current instantiations of the StarSs programming model and tools include GRIDSs (for the Grid), CellSs (for the Cell B.E.), SMPSs (for general-purpose multi-core processors) and GPUSs (for platforms with multiple GPUs). We are currently extending it to cover platforms with multiple generic hardware accelerators and FPGAs.

2 The StarSs Extensions to OpenMP

OpenMP has traditionally been employed to exploit loop-based parallelism, present in most regular scientific and engineering applications, on shared-memory multiprocessors. Recently, OpenMP 3.0 has been extended with tasks, or deferrable units of work, to accommodate irregular applications as well. In particular, in OpenMP 3.0 the programmer can specify tasks and later ensure that all the tasks defined up to some point have finished. Tasks are generated in the context of a team of threads, while the parallel construct creates such a team. A task is created when the code reaches the task construct, defined as follows:

#pragma omp task [clause-list]
   structured-block

Valid clauses for this construct are untied, if, shared, private and firstprivate. The untied clause specifies that the task can be resumed by a different thread after a possible task switching point; when the expression in the if clause evaluates to false, the encountering thread suspends the current task region and begins execution of the new task immediately. The last three clauses are used for setting the data sharing attributes of variables in the task body, and have the following syntax:

– shared( variable-list )
– private( variable-list )
– firstprivate( variable-list )

where variable-list is a list of identifiers. Naming a variable inside a data sharing clause explicitly sets the data sharing attribute for that variable in the task construct. References in the task construct to variables whose data sharing attribute is private or firstprivate do not refer to the original variable but to private storage of the task. Variables annotated with firstprivate, in addition, have this storage initialized with the value of the original variable when the program execution reaches the task construct. References to variables whose data sharing attribute is shared refer to the original variable.
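As a brief illustration of these standard OpenMP 3.0 rules (a minimal sketch of our own, not taken from the paper's figures; variable names are arbitrary):

#include <stdio.h>

int main(void) {
  int x = 1, y = 2, z = 0;
  #pragma omp parallel
  #pragma omp single
  {
    /* x: private copy, uninitialized inside the task
       y: firstprivate copy, initialized to 2 when the task is created
       z: shared, refers to the original variable */
    #pragma omp task private(x) firstprivate(y) shared(z)
    {
      x = 10;     /* updates only the task-private storage */
      y = y + 1;  /* updates the initialized private copy  */
      z = y;      /* visible outside the task              */
    }
    #pragma omp taskwait
    printf("x=%d y=%d z=%d\n", x, y, z); /* prints x=1 y=2 z=3 */
  }
  return 0;
}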

StarSs extends the task mechanism in OpenMP 3.0 to allow the specification of dependencies between tasks and to map the execution of certain tasks to a type of hardware accelerator (a device). StarSs considers each accelerator (e.g., an SPE, a GPU, an FPGA) as a single execution unit, which can efficiently execute specialized pieces of code. The runtime implementation of the model isolates the user from all the complexities related to task scheduling and offloading. The StarSs extensions are orthogonal to other possible extensions that would help a compiler generate efficient code (e.g., vectorization width, number of threads running on accelerators and code transformations). In the next sections, we explore how these extensions could be mapped onto the OpenMP language.


2.1 Taskifying Functions and Expressing Dependencies

StarSs allows the programmer to specify that a function should be executed as a task. To support this, we have extended the OpenMP task construct so that it can annotate functions in addition to structured blocks:

#pragma omp task [clause-list]
   { function-declaration | function-definition | structured-block }

Whenever the program calls a function annotated in this way, the runtime creates an explicit task. Although this seems to be a simple and naive extension, it associates an implicit name with the task that will be used later in Sect. 2.3.

We have also extended this construct with the StarSs clauses input, output and inout. This information is used to derive dependencies among tasks at runtime. The syntax of these clauses is:

– input( data-reference-list )
– output( data-reference-list )
– inout( data-reference-list )

Dependencies are expressed by means of data-reference-lists, which are a superset of a variable-list. A data-reference in such a list can contain a variable identifier, but also references to subobjects. References to subobjects include array element references (e.g., a[4]), array sections (a[3:6]), field references (a.b), and elaborated shaping expressions ([10][20] p). For simplicity, details on the syntax used to define subobjects are introduced in the following examples as well as in Sect. 2.4. Implicit tasks created in parallel regions are assumed to be totally independent; it is not allowed to use input, output and inout in a parallel construct.

Figure 1 illustrates the use of the extended task construct to parallelize a sequential code that computes the matrix multiplication C = C + A · B. In this particular code, the programmer defines each element of matrices A, B and C as a pointer to a block of BS × BS floats, which are allocated from inside the main function. Each task corresponds to an instantiation of function gemm, and the programmer uses the inout clause to express the data dependence that exists among tasks computing the same block of C (several tasks will overwrite the same block). In this simple example, since the allocation and initialization of the matrices is done sequentially, there is no need to annotate the blocks of matrices A and B with the input clause, as they are not involved in any data dependence.
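Figure 1 itself is not reproduced here; the following minimal sketch conveys its structure under our reading of the text (block size, matrix dimension, and the exact section syntax in the inout clause are assumptions of ours, not the paper's literal code):

#define NB 16 /* blocks per matrix dimension (assumed) */
#define BS 64 /* block size (assumed)                  */

/* each element points to a BS x BS block of floats, allocated in main */
float *A[NB][NB], *B[NB][NB], *C[NB][NB];

#pragma omp task inout([BS][BS] c) /* only C carries a dependence here */
void gemm(float *a, float *b, float *c) {
  for (int i = 0; i < BS; i++)
    for (int j = 0; j < BS; j++)
      for (int k = 0; k < BS; k++)
        c[i * BS + j] += a[i * BS + k] * b[k * BS + j];
}

void matmul(void) {
  for (int i = 0; i < NB; i++)
    for (int j = 0; j < NB; j++)
      for (int k = 0; k < NB; k++)
        gemm(A[i][k], B[k][j], C[i][j]); /* each call becomes a task */
  #pragma omp taskwait
}

Tasks that write the same C[i][j] are serialized by the inout clause, while tasks operating on different blocks of C may run concurrently.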

In order to show an example with a richer set of dependencies, we use the LU factorization of a blocked sparse matrix A consisting of NB × NB blocks, where some of the blocks are void (i.e., all their entries are zero) while some others are fully populated with nonzero entries (they are dense). This is captured in the data structure that holds the matrix, where we assume that storage is only allocated for the dense blocks.

Figure 2 shows the sequential Sparse_LU function annotated with the extended task construct, and Fig. 3 illustrates the dependencies that are encountered during the execution of this code on a particular blocked sparse matrix.


Fig. 1 Matrix multiplication example annotated with our proposed extensions to OpenMP

Using the task construct, the programmer identifies four types of tasks, which correspond to the invocation of kernels lu_getrf, lu_trsm_right, lu_trsm_left, and lu_gemm. In this case, the kernel call is annotated and the programmer needs to indicate the coordinates of the blocks involved in the operation (e.g., A[k][k] for lu_getrf) and their dimensions ([0:BS-1][0:BS-1] in all cases). For kernel lu_gemm, for example, the programmer also specifies that the first and second arguments (A and B) are input parameters (they are only read during the execution of the kernel), while the third argument (C) is inout (it is read and written during the execution of the kernel). Note that all references in the code correspond to blocks of the same matrix, yielding an elaborate dependence graph for this example (see Fig. 3).

Note that the dependencies depend on the input data (the sparsity pattern of the input matrix) and that new blocks can be dynamically generated (line 33 in Fig. 2).
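Since Fig. 2 is not reproduced, the sketch below approximates its structure (kernel argument order, the mapping of the two triangular solves to rows and columns, and the allocation of new blocks are our assumptions; the paper annotates the kernel calls, while here the declarations are annotated for brevity):

#include <stdlib.h>

#define NB 5  /* blocks per dimension, as in the example of Fig. 3 */
#define BS 32 /* block size (assumed)                              */

float *A[NB][NB]; /* NULL for void blocks, allocated for dense blocks */

#pragma omp task inout([BS][BS] a)
void lu_getrf(float *a);

#pragma omp task input([BS][BS] a) inout([BS][BS] b)
void lu_trsm_right(float *a, float *b);

#pragma omp task input([BS][BS] a) inout([BS][BS] b)
void lu_trsm_left(float *a, float *b);

#pragma omp task input([BS][BS] a, [BS][BS] b) inout([BS][BS] c)
void lu_gemm(float *a, float *b, float *c);

void sparse_lu(void) {
  for (int k = 0; k < NB; k++) {
    lu_getrf(A[k][k]);
    for (int j = k + 1; j < NB; j++)
      if (A[k][j]) lu_trsm_right(A[k][k], A[k][j]);
    for (int i = k + 1; i < NB; i++)
      if (A[i][k]) lu_trsm_left(A[k][k], A[i][k]);
    for (int i = k + 1; i < NB; i++)
      for (int j = k + 1; j < NB; j++)
        if (A[i][k] && A[k][j]) {
          if (!A[i][j]) /* a void block becomes dense: dynamic allocation */
            A[i][j] = calloc(BS * BS, sizeof(float));
          lu_gemm(A[i][k], A[k][j], A[i][j]);
        }
  }
  #pragma omp taskwait
}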

The annotations are placed on the original sequential version, with no transformations applied to the code to expose the inherent parallelism available. The runtime employs the information implicit in the graph, transparently to the user/programmer, to extract task parallelism while satisfying the dependencies among tasks.

2.2 Specifying Target Devices: Heterogeneous Cholesky

To target heterogeneous systems composed of general-purpose processors and hardware accelerators, we add to OpenMP a StarSs construct that may precede an existing task pragma:

#pragma omp target device(device-name-list) [clause-list]

The target construct specifies that the execution of the task could be off-loaded to any of the (types of) devices specified in device-name-list (and, as such, its code must be handled by the proper compiler backend).


Fig. 2 Sparse_LU example annotated with our proposed extensions to OpenMP

Fig. 3 Footprint of a 5 × 5 blocked sparse matrix (left) and dependency graph for its sparse LU factorization (right). In the graph, square, triangle, diamond and circle shapes correspond to tasks lu_getrf, lu_trsm_right, lu_trsm_left and lu_gemm, respectively


Fig. 4 Cholesky example annotated with our proposed extensions to OpenMP

If the task is not preceded by a target directive, the default device-name is used; this default is smp and corresponds to a homogeneous shared-memory multicore architecture. Other device-names are vendor specific. We will use three examples in this paper to specify the accelerator: spe for a single SPE of the Cell B.E., cuda for a whole GPU, and fpga for a whole FPGA.

Two additional clauses can be used with the target construct:

– copy_in(data-reference-list)
– copy_out(data-reference-list)

These two clauses, which are ignored for the smp device, specify data movement for shared variables used inside the task. The copy_in clause specifies those variables that must be moved from host memory to device memory. The copy_out clause specifies those variables that need to be moved back from device memory to host memory.

Figure 4 shows code that computes the Cholesky factorization of a dense matrix A consisting of NB × NB blocks with dimension BS × BS each. The operation is decomposed into four types of tasks: Cholesky factorization of a diagonal block (chol_potrf), triangular solve involving the subdiagonal blocks (chol_trsm), symmetric update of the blocks on the diagonal (chol_syrk), and update of the remaining blocks (chol_gemm). The target construct is used here to specify that all these tasks, except for the factorization of the diagonal block, should be computed on a cuda accelerator (i.e., a GPU).
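A condensed sketch in the spirit of Fig. 4 follows (the loop structure, argument order, and the copy_in/copy_out lists are our assumptions; the default smp device is used for chol_potrf):

#define NB 16  /* blocks per dimension (assumed) */
#define BS 128 /* block size (assumed)           */

float *A[NB][NB]; /* pointers to BS x BS blocks */

#pragma omp task inout([BS][BS] a) /* runs on the default smp device */
void chol_potrf(float *a);

#pragma omp target device(cuda) copy_in([BS][BS] a, [BS][BS] b) copy_out([BS][BS] b)
#pragma omp task input([BS][BS] a) inout([BS][BS] b)
void chol_trsm(float *a, float *b);

#pragma omp target device(cuda) copy_in([BS][BS] a, [BS][BS] c) copy_out([BS][BS] c)
#pragma omp task input([BS][BS] a) inout([BS][BS] c)
void chol_syrk(float *a, float *c);

#pragma omp target device(cuda) copy_in([BS][BS] a, [BS][BS] b, [BS][BS] c) copy_out([BS][BS] c)
#pragma omp task input([BS][BS] a, [BS][BS] b) inout([BS][BS] c)
void chol_gemm(float *a, float *b, float *c);

void cholesky(void) {
  for (int k = 0; k < NB; k++) {
    chol_potrf(A[k][k]);              /* factor the diagonal block       */
    for (int i = k + 1; i < NB; i++)
      chol_trsm(A[k][k], A[i][k]);    /* solves with subdiagonal blocks  */
    for (int i = k + 1; i < NB; i++) {
      chol_syrk(A[i][k], A[i][i]);    /* symmetric update of diagonal    */
      for (int j = k + 1; j < i; j++)
        chol_gemm(A[i][k], A[j][k], A[i][j]); /* update remaining blocks */
    }
  }
  #pragma omp taskwait
}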

Other vendor-specific clauses in the target construct are possible for each particular device-name.


Fig. 5 Matrix multiplication example annotated with our proposed extensions to OpenMP

Some restrictions may apply to tasks that target a specific device (for example, they may not contain any other OpenMP directives or perform any input/output on some devices). In addition, tasks offloaded to some specific devices should be tied, or they should execute on the same type of device if thread switching is allowed.

2.3 Specifying Alternative Implementations: Matrix Multiplication Revisited

The target construct also offers the implements clause to specify alternative implementations of a “taskified” function that are tailored to specific accelerator devices. The syntax of the clause is:

– implements(function-name)

Using this clause, in the example in Fig. 5, the programmer specifies three possible options to execute function gemm. The first one uses the original definition of function gemm for the default target architecture. The user also specifies two alternative implementations: gemm_cuda for an NVIDIA GPU, and gemm_spe for the Cell B.E. For all the devices, the runtime is in charge of moving data before and after the execution of the task.

If the original implementation is also appropriate for one of the accelerator types, the programmer should precede the definition of the task with the specification of that target device (as in line 1 of Fig. 5). In this case, the compiler generates two versions of the same function, one going through the native optimizer for the default device and another using the accelerator-specific compiler.
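A sketch of the pattern described for Fig. 5 (the clause placement and the function signatures are assumptions of ours based on the text):

#define BS 64 /* block size (assumed) */

/* original task definition; also valid for the default smp device */
#pragma omp target device(smp)
#pragma omp task input([BS][BS] a, [BS][BS] b) inout([BS][BS] c)
void gemm(float *a, float *b, float *c);

/* alternative implementation for NVIDIA GPUs */
#pragma omp target device(cuda) implements(gemm)
void gemm_cuda(float *a, float *b, float *c);

/* alternative implementation for a Cell B.E. SPE */
#pragma omp target device(spe) implements(gemm)
void gemm_spe(float *a, float *b, float *c);

At run time, any call to gemm creates a task that the runtime may execute with whichever of the three versions matches the device it selects, moving data in and out as needed.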

2.4 Specifying Array Section Dependences

Through the examples so far we have seen dependence specifications for scalar objects or full arrays, but our syntax can also specify sections of arrays. Since C/C++ does not have any way to express ranges of an array, we have borrowed the array-section syntax from Fortran 90. An array section is expressed with a[first:last], meaning all elements of the array a from the first to the last element, inclusive. Both first and last are expressions evaluated at execution time.


Fig. 6 Simple example of array sections

Fig. 7 Shaping expression examples

Figure 6 shows a simple example of array sections where task A fills the bottom half of the array a (i.e., elements 0 to N/2 − 1) and task B fills the remaining elements (i.e., elements N/2 to N − 1). Task C waits until both tasks are finished before executing. For syntactic economy, a[:last] is the same as a[0:last]. For arrays where the upper bound is known, a[first:] and a[:] mean, respectively, a[first:N-1] and a[0:N-1]. Designating an array (i.e., a) in a data reference list without an array section or an array subscript is equivalent to the whole array section (i.e., a[:]). Array sections can also be specified for multidimensional arrays by specifying one section for each dimension (e.g., a[1:2][3:4]).
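A minimal sketch matching the description of Fig. 6 (the task bodies and the consume helper are placeholders of ours):

#define N 1024
float a[N];

void consume(float *v); /* hypothetical helper */

void sections_example(void) {
  #pragma omp task output(a[0:N/2-1]) /* task A: bottom half */
  for (int i = 0; i < N / 2; i++) a[i] = 1.0f;

  #pragma omp task output(a[N/2:N-1]) /* task B: top half */
  for (int i = N / 2; i < N; i++) a[i] = 2.0f;

  #pragma omp task input(a[0:N-1])    /* task C: waits for A and B */
  consume(a);

  #pragma omp taskwait
}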

While technically not arrays, array sections can also be applied to pointers: p[first:last] refers to the elements *(p+first) to *(p+last). Pointers to arrays can use multidimensional sections, but because pointers lack dimensional information, multidimensional sections are not allowed for pointer-to-pointer types without a shaping expression.

In order to use multidimensional sections over pointers, the structural information about the array dimensions needs to be restored. A shaping expression serves that purpose. Shaping expressions are a sequence of dimensions, enclosed in square brackets, followed by a data reference that should refer to a pointer type: [N]p.

For example, in Fig. 7, the input clause on line 3 creates a dependence against the pointer value and not the pointed data. Using a shaping expression, as in line 5, the dependence is instead against the pointed data. As shown in line 7, shaping expressions also enable multidimensional sections over pointers.

Array parameters are implicitly converted to pointer types in C/C++, where the outermost dimension is lost. Thus, a shaping expression is required to define a dependence on the whole array. This is why, in line 10, the input clause creates a dependence against the pointer, but in line 11 the dependence is created against the matrix stored through the pointer.
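Figure 7 is not reproduced, so its exact line numbering is not available here; the following sketch of ours illustrates the cases the text refers to (the composition of shaping expressions with sections may differ slightly from the figure):

#define N 16

#pragma omp task input(p)        /* dependence on the pointer value itself        */
void t1(int *p);

#pragma omp task input([N] p)    /* shaped: dependence on the N ints p points to  */
void t2(int *p);

#pragma omp task input([N][N] p) /* shaping enables multidimensional use of p     */
void t3(int *p);

#pragma omp task input(m)        /* m decays to a pointer: dependence on that     */
void t4(int m[N][N]);            /* pointer only, not on the matrix               */

#pragma omp task input([N] m)    /* shaping restores the outer dimension lost in  */
void t5(int m[N][N]);            /* the decay: dependence on the whole matrix     */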


Fig. 8 Example using the extended taskwait pragma

2.5 Additional Clause for Taskwait

We have also extended the OpenMP taskwait construct with an on clause from StarSs, as follows:

#pragma omp taskwait on(data-reference-list)

in order to wait for the termination of those tasks whose output or inout clauses match data-reference-list.

For example, in the code shown in Fig. 8, the programmer needs to insert the taskwait pragma in order to ensure that the next statement reads the appropriate value of variable x, which is generated by task A. However, task B and task C can run in parallel with the code after the taskwait pragma.
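A minimal sketch along the lines of Fig. 8 (the compute_* and use helpers are placeholders of ours):

int compute_x(void), compute_y(void), compute_z(void);
void use(int v);

void taskwait_on_example(void) {
  int x, y, z;

  #pragma omp task output(x) /* task A */
  x = compute_x();

  #pragma omp task output(y) /* task B */
  y = compute_y();

  #pragma omp task output(z) /* task C */
  z = compute_z();

  /* wait only for the producer of x; B and C may still be running */
  #pragma omp taskwait on(x)

  use(x);

  #pragma omp taskwait /* eventually wait for B and C as well */
}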

3 Extensions to the OpenMP Tasking Execution Model

The runtime supporting the execution of the StarSs extensions dynamically creates explicit tasks, and the memory region specifiers in data-reference-list are used to build a task dependence graph as follows:

– Data references specified in input or inout clauses are checked against the data references specified in the output or inout clauses of all tasks, in the scope of the same parent task, that are in execution or pending execution. If there is a match, a true dependence is added between both tasks.

– Data references specified in output or inout clauses are checked against the data references specified in the input, output or inout clauses of all tasks, in the scope of the same parent task, that are in execution or pending execution. If there is a match, a false dependence appears. This dependence could be eliminated by dynamically renaming the memory region specified in the data reference. Renaming is an optional feature of the runtime that can be activated selectively to increase the potential parallelism in the task graph.

– A variable in a shared data clause, but not in an input, output or inout clause, indicates that the variable is accessed inside the task but is not affected by any data dependence in the current scope of execution (or is protected using another mechanism).
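To make these rules concrete, consider the following small example of ours (the helper functions and sizes are illustrative):

#define N 256
float a[N], b[N], c[N];

void fill(float *v);
void transform(const float *src, float *dst);
void scale(float *v);

void graph_example(void) {
  #pragma omp task output(a[0:N-1])                 /* T1 writes a               */
  fill(a);

  #pragma omp task input(a[0:N-1]) output(b[0:N-1]) /* T2: true dependence on T1 */
  transform(a, b);

  #pragma omp task output(a[0:N-1]) /* T3 rewrites a: a false dependence on T1/T2; */
  fill(a);                          /* renaming a would let T3 run alongside T2    */

  #pragma omp task shared(c) /* c is accessed but carries no tracked dependence */
  scale(c);

  #pragma omp taskwait
}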


When a task is ready for execution (i.e., it has no unsatisfied dependences on previously generated tasks):

– The runtime can choose among the different available targets to execute the task. This decision is implementation-dependent, but it will ideally be tailored to resource availability. If no resource is available, the runtime could stall the task until one becomes available, or launch the execution of the original function on the default smp device.

– The runtime system must copy the variables in the copy_in list from host memory to device memory. Once the task finishes execution, the runtime must copy back the variables in the copy_out list, if necessary. Many optimizations in the way local stores are handled are possible, based on the potentially huge level of lookahead that a task graph represents and the information available about the accessed data. It is, for example, possible to minimize data transfers by caching data already transferred and by scheduling tasks to the accelerators where local copies of their input data are available.

The execution of a taskwait forces the write-back to memory, for the references in data-reference-list, of any data renaming that has been dynamically introduced by the runtime system to eliminate false dependences.

3.1 Proof-of-Concept Implementation for Multi-Core

The SMP Superscalar (SMPSs) runtime targets homogeneous multicore and shared-memory (SMP) machines. This implementation exploits the fact that all cores have coherent load/store access to all data structures, and therefore no specific data movement between the main program and the cores executing the tasks is required.

Although a main thread is responsible for executing the main program, for adding tasks to the data dependence graph, and for synchronizing all threads, the runtime implements distributed scheduling. All threads execute tasks, including the main thread when it would otherwise be idle. A general ready-task list is accessible by all threads, while private ready lists store the ready tasks that must be executed by each core. To favor the exploitation of memory locality, each thread first processes the tasks inserted in its own ready list, where it also inserts new ready tasks released by the completion of those executed on the same core. In addition, to favor load balancing, a thread steals tasks from the main ready list and from the lists of other cores when its own list is empty.

3.2 Proof-of-Concept Implementation for the Cell B.E.

The Cell Superscalar (CellSs) runtime implementation targets execution on Cell B.E.-based platforms. The challenges with this platform are twofold: the heterogeneity of the chip, with one general-purpose (and slower) multi-threaded Power-based processor element (PPE) and eight synergistic processor elements (SPEs); and the memory organization, with a PPE main memory that is not coherent with the local memories of the SPEs.


Data transfers from the main memory to the small (only 256 KB) local stores of the SPEs must be explicitly programmed with DMA.

The CellSs runtime is organized as two threads that run on the PPE (the main thread and the helper thread) and up to sixteen threads that run on the SPEs.¹ The user program starts normal execution in the main thread and, whenever an annotated task is called, a new node is added to the task graph with its corresponding dependences. The helper thread is responsible for task scheduling and for synchronization with the SPEs. Each SPE thread waits for tasks to be scheduled on its core. To reduce communication overhead and the need for explicit communications, tasks are scheduled in sets to be executed on the same SPE. Within a task set, double buffering (data transfers for the next task are performed concurrently with the execution of the current task) and other optimizations (e.g., the reduction of write transfers when a parameter is written more than once in a task set) can be applied. Extracting data-level parallelism for each SPE (simdization) is left in the hands of the native target compiler and/or the programmer (using intrinsics, for example).

3.3 Proof-of-Concept Implementation for NVIDIA GPUs

The GPU SuperScalar (GPUSs) implementation targets the parallelization of applications on platforms consisting of a general-purpose (possibly multi-core) processor (the host) connected to several hardware accelerators (the devices). In our prototype implementation, these accelerators are programmable NVIDIA GPUs, which communicate with the host via a PCIExpress bus. Each GPU can only access its own local memory space; direct communication between GPUs is not possible, so data transfers between them must be performed through the host memory.

Our approach considers each accelerator as a single execution unit, which can efficiently execute specialized pieces of code (in our case, CUDA kernels defined by the programmer as tasks). GPUSs is not aware of the internal architecture of the accelerator; it only exploits task parallelism by mapping/scheduling the execution of tasks onto the hardware accelerators in the system. As in CellSs, extracting data-level parallelism inside a single GPU is left in the hands of the programmer and the native device-specific compiler.

The architecture of the GPUSs runtime is similar to that of CellSs. However, there are some important differences between the two, derived from the architectural features of each system. In particular, data transfers between memory spaces through the PCIExpress bus are a major bottleneck in this type of multi-accelerator system. Therefore, the number of data movements associated with the execution of a given task must be reduced as much as possible to improve performance. To do so, GPUSs views the local store of each accelerator as a cache memory that keeps data blocks recently used by the GPU. The replacement policy (LRU in our current implementation) and the number and size of blocks in the cache can be easily tuned in the runtime.

¹ Although Cell B.E. chips have up to eight SPEs (with only six available on the chips that equip the PlayStation 3), the blades usually come with two of those chips, and the system enables access to all of them from the Power processors.


Inconsistencies between data blocks stored in the caches of the accelerators and the blocks in the host memory are allowed by using a write-back memory coherence policy; thus, data blocks written by a GPU are only updated in the host memory when another GPU has to operate on them. Coherence among the local stores of the GPUs is maintained with a write-invalidate policy.

The system handles the existence of multiple memory spaces by keeping a memory map of the cache of each accelerator. The translation of addresses between different memory spaces is transparent to the user. Additionally, the information stored in the map is used to reduce data transfers by carefully mapping tasks onto the most appropriate execution resource.

The basic architecture of the runtime and many of the techniques implemented in GPUSs for NVIDIA multi-accelerator systems can be easily ported to other (homogeneous or heterogeneous) multi-accelerator platforms with similar architectures.

3.4 Proof-of-Concept Implementation for SGI RASC FPGA

For FPGA-based systems, offloading a task means sending the bitstream corresponding to the task, configuring the FPGA device, and sending/receiving data. In these systems, the time required to load a bitstream into the device (bitstream file read and device configuration) can be significant. For example, it takes approximately one second to load a 4 MB bitstream on the SGI RASC blade; once the bitstream is loaded in memory, reusing it takes just 4 ms. In order to hide this high bitstream loading time, the FPGASs prototype implementation includes a fully associative bitstream cache that keeps information about the bitstreams currently loaded in the FPGA devices. When a task pragma is found, the runtime checks whether the bitstream that implements the task is already configured. A hit in the bitstream cache produces the effective offloading of the task execution. If the runtime detects a miss in the bitstream cache, it applies a least frequently used (LFU) replacement policy. During the miss, and in order to hide the FPGA configuration time, the runtime launches the execution of the task on the host processor. Once the bitstream is configured, the runtime will detect a hit in the bitstream cache and offload the execution of future instances of that task to the FPGA device.
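As an illustration only, here is a toy sketch of the decision logic just described (this is not the FPGASs code; the slot count, the data structures, and the start_async_configuration helper are hypothetical):

#define SLOTS 4 /* bitstreams that fit in the FPGA devices at once (assumed) */

struct slot { int task_id; unsigned uses; int valid; };
static struct slot cache[SLOTS];

void start_async_configuration(int task_id); /* hypothetical helper */

/* returns 1 if the task instance is offloaded to the FPGA,
   0 if it runs on the host while the bitstream is being configured */
int offload_or_fallback(int task_id) {
  int victim = 0;
  for (int i = 0; i < SLOTS; i++) {
    if (cache[i].valid && cache[i].task_id == task_id) {
      cache[i].uses++; /* hit: the bitstream is already configured */
      return 1;
    }
    if (!cache[i].valid ||
        (cache[victim].valid && cache[i].uses < cache[victim].uses))
      victim = i;      /* track the least frequently used slot */
  }
  /* miss: evict the LFU entry, configure the new bitstream in the
     background, and let the host execute this task instance */
  cache[victim] = (struct slot){ task_id, 0, 1 };
  start_async_configuration(task_id);
  return 0;
}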

Since the data transfer overhead between the host and the FPGA device is usually an important issue, the runtime system applies data packing and unpacking, according to the different memory associations detected by the compiler.

3.5 Bringing it all Together: Challenges Ahead

The previous subsections documented implementations of the model for specific accelerators. These implementations (other than SMPSs) use the main processor mainly as a controlling device that spawns tasks to the accelerators, but do not execute tasks on it. With new architectures [4], where a fair amount of the computational power resides in the main processors, this is not a desirable approach. Furthermore, architectures with more than one type of accelerator are already appearing, and a runtime that can handle several of them at the same time will be required.


Supporting more than one accelerator and being able to run tasks on the main processor(s) is not that difficult; by defining clean interfaces that hide architectural details, this becomes more an engineering task than a research one. But then things become more interesting. Once a task can run on any of the processing elements (assuming the user has specified versions for all of them with our extensions), the runtime must decide which one to use. On the one hand, the runtime should maximize the usage of all the processing elements of the system. A naive approach could schedule each new task on the fastest element that is not executing anything, but, of course, different tasks will be faster on different processing elements. Because there is usually some repetitiveness in the kind of tasks executed, we think it will be possible for the runtime to learn, as the execution proceeds, where to send each task. On the other hand, we want to minimize the communication costs. One obvious approach is to schedule tasks that use the same data on the same processing element. This will not always be possible, in which case the runtime will need to take into account the cost of transferring the data from the processing element where they were last used. This is not an easy task, as the communication mechanism may differ from one processing element to another. These two factors (efficiency and reduced communication) will need to be balanced to obtain an optimal schedule. An important factor in this decision is the communication-to-computation ratio, and it is unclear whether the runtime will be able to find the right choice on its own. In any case, further research needs to be conducted on a fully heterogeneous runtime/system to resolve this tradeoff.

4 Experimental Results

In this section we illustrate the general validity, portability, and promising performance of the StarSs model, and therefore of the proposed extensions to OpenMP, for exploiting task-level parallelism in heterogeneous architectures. The Cholesky factorization in Fig. 4 will serve as a case study for SMPSs, CellSs and GPUSs, but we emphasize again that the scope of StarSs is not limited to the parallelization of (dense) linear algebra codes. This operation is appealing in that it is a basic building block for a wide variety of scientific and engineering applications (the Cholesky factorization is the most efficient method to solve certain classes of linear systems of equations). Besides, this factorization shares with many other linear algebra operations a well-known and regular algorithmic structure, and it has traditionally been considered a starting point for HPC community efforts. In all experiments, when reporting the rate of computation, we consider the cost of the Cholesky factorization to be the standard n³/3 floating-point arithmetic operations (flops for short) for square matrices of order n.

For SMPSs, the annotations with the target construct in the code of Fig. 4 are ignored, as all operations are computed on the cores of the shared-memory multiprocessor. The experimental analysis has been performed on an SGI Altix multiprocessor consisting of 16 nodes, each equipped with a dual-core CPU at 1.6 GHz (the peak performance of the system is 204.8 GFLOPS). The memory system presents a cc-NUMA organization, with a local RAM shared by the CPUs in each node, and an SGI NUMAlink interconnection network. The codes employ double-precision arithmetic and were linked with the BLAS in Intel's Math Kernel Library (MKL) 9.1.


Fig. 9 Performance (left: GFLOPS vs. matrix size, Cholesky factorization on 32 cores, SMPSs vs. MKL) and scalability (right: GFLOPS vs. number of cores for an 8000 × 8000 matrix, SMPSs vs. MKL) of the SMPSs Cholesky factorization routine

This package provides highly tuned implementations of basic linear algebra building blocks, such as those used by chol_trsm, chol_gemm, and chol_syrk, for Intel processors. Figure 9 compares the GFLOPS rates attained by the Cholesky factorization routine parallelized with SMPSs and by the (traditional) multi-threaded implementation of this routine in MKL. We note the superior performance of the SMPSs code, due to the higher level of parallelism and scalability exposed by the dynamic task-scheduling mechanism of SMPSs.

The experiments with CellSs were run on an IBM QS22 blade server, with two PowerXCell processors (the high-performance double-precision floating-point version of the Cell B.E. processor) at 3.2 GHz and 12 GB of memory. The results were obtained with CellSs version 2.2 and the IBM SDK version 3.1. Figure 10 presents the results obtained for the Cholesky factorization on the Cell-based platform. The plot on the left shows the absolute performance when executing with 8 SPUs and varying the matrix size up to 4096 × 4096 floats. The results are compared against a hand-coded version in which the graph generation and the scheduling of the tasks are performed statically [5]. The loss in performance with respect to this hand-coded version is 19% for large matrix sizes. We consider this a more than reasonable result, taking into account that the original hand-coded version has 302 lines of code while the CellSs version has 32.² The plot on the right shows the scalability of CellSs from 1 to 8 SPUs, using the elapsed time with 1 SPU as the base case.

Our next experiment employs GPUSs, the prototype extension of StarSs for platforms with multiple GPUs. The target system is a server with two Intel Xeon QuadCore E5405 (2.0 GHz) processors and 8 GBytes of shared DDR2 RAM, connected to an NVIDIA Tesla S870 computing system with four NVIDIA G80 GPUs and 6 GBytes of DDR3 memory (1.5 GBytes per GPU). The Intel 5400 chipset features two PCIExpress Gen2 interfaces connected to the Tesla, which deliver a peak bandwidth of 48 Gbits/second on each interface. We used NVIDIA CUBLAS (version 2.0) built on top of the CUDA API (version 2.0), together with NVIDIA driver 171.05. Single precision was employed in all experiments. The cost of all data transfers between RAM and GPU memories is included in the timings.

² Counting only code lines, without comments and includes. The source code of the tiles, which is the same for both examples, is not counted.


Fig. 10 Performance (left: GFLOPS vs. matrix size, Cholesky factorization on 8 SPUs, hand-coded static scheduling vs. CellSs) and scalability (right: speed-up vs. number of cores for a 4096 × 4096 matrix) of the Cholesky factorization routine parallelized with CellSs

Fig. 11 Performance (left: GFLOPS vs. matrix size, Cholesky factorization on 4 GPUs, GPUSs vs. CUBLAS) and scalability (right: speed-up vs. matrix size for 1, 2, 3 and 4 GPUs) of the Cholesky factorization routine parallelized with GPUSs

Figure 11 illustrates the parallel performance and scalability of the Cholesky factorization codes obtained with our prototype implementation of GPUSs, compared with the original CUBLAS.

5 Related Work

OpenMP [2] grew out of the need to standardize the explosion of parallel programming languages and tools of the 90s. It was initially structured around parallel loops and was meant to handle dense numerical applications. The simplicity of its original interface, the use of a shared-memory model, and the fact that the parallelism of a program is expressed with annotations loosely coupled to the code have all helped OpenMP become well accepted. While recent additions in OpenMP 3.0 [6] accommodate task parallelism, making it more suitable for irregular codes, OpenMP is not ready for the challenges posed by the new generation of multi-core heterogeneous architectures.

The Cell B.E. [7] is likely one of the most challenging heterogeneous architectures to program. IBM developed an OpenMP prototype compiler that generates parallel programs under the master-slave programming model. Data transfers between the master (PPE) and the slaves (SPEs) are transparently introduced employing a software cache, although the compiler can try to optimize very regular access patterns.


Other programming solutions for the Cell B.E., like Sequoia, MPI microtasks, and our own CellSs, are more promising in that they target task parallelism, employing higher-level information to perform more complete optimizations.

GPUs have traditionally been programmed using domain-specific graphics libraries such as OpenGL or DirectX. NVIDIA was one of the pioneering graphics companies to realize the potential of general-purpose graphics processors and the benefits that could be gained by offering a general-purpose application programming interface (API). The result was CUDA [8], a “unified” architecture design featuring a programmable graphics pipeline, and an API to develop parallel programs that exploit data parallelism on this architecture. Unfortunately, in order to develop efficient codes, CUDA programmers (as well as those of Brook+ [9], the data-parallel language for AMD/ATI GPUs) still need to be deeply aware of the underlying architecture. A promising approach that may solve this problem consists in automatically transforming existing OpenMP codes into CUDA.

Hardware description languages (e.g., Verilog or VHDL) are employed to develop FPGA-accelerated computational kernels. In order to solve the programmability problem for these devices, extensions to the C language have appeared. Usually, there are two strategies to compile these extensions. In the first strategy, the section of code to be offloaded to the FPGA is translated from C to VHDL. The second strategy is to map a soft processor (e.g., a Xilinx MicroBlaze) onto the FPGA and to translate the source code to be executed into code that this soft processor understands. In both cases, a host/device communication library, such as RASClib for the SGI RASC architecture, is necessary to offload the execution to one of the available FPGAs, including the data movement.

The PGI [10] and HMPP [11] programming models are two other approaches that try to tackle the accelerator problem with high-level directives. PGI uses compiler technology to offload the execution of loops to the accelerators. HMPP also annotates functions as tasks to be offloaded to the accelerators. We think that StarSs has higher potential in that it shifts part of the intelligence that HMPP and PGI delegate to the compiler into the StarSs runtime system. Although these alternatives do support a fair amount of asynchronous computations expressed as futures or continuations, the level of lookahead they support is limited in practice. In these approaches, synchronization requests (waiting for a given future or selecting among some list of them when the result is needed) have to be explicitly inserted in the main control flow of the program. Besides the additional complexity of the code, this approach implies that certain scheduling decisions are made statically and hardwired in the application code. The approach followed in StarSs exploits much higher levels of lookahead (tens of thousands of tasks) without requiring the programmer to schedule the synchronization operations explicitly, giving the runtime much more flexibility to respond to the foreseeable variability of application characteristics and resource availability.

5.1 Building a Bridge to OpenCL

A recent attempt to unify the programming models for general-purpose multi-core architectures and the different types of hardware accelerators (Cell B.E., GPUs, FPGAs, DSPs, etc.) is OpenCL [12].


The participation of silicon vendors (e.g., Intel, IBM, NVIDIA, and AMD) in the definition of this open standard ensures portability, low-level access to the hardware, and supposedly high performance. We believe, however, that OpenCL still exposes much of the low-level details, making it cumbersome to use by non-experts.

Finding the appropriate interoperability between the StarSs extensions to OpenMP and OpenCL could therefore be a viable solution. While StarSs targets the exploitation of task parallelism by mapping/scheduling the execution of tasks onto the hardware accelerators in the system, OpenCL could be used to express, in a portable way, the data-level parallelism to be exploited inside the accelerators by the native device-specific compiler.

6 Conclusions and Future Work

In this paper we have proposed a number of extensions to the OpenMP language that come from the StarSs research programming model and that try to tackle the programmability of emerging heterogeneous architectures. The first group of extensions aims to enable a more productive and effective parallelization, in which the user specifies the data dependences among the different parallel tasks of the computation; the compiler and runtime are then responsible for extracting the parallelism of the application and performing the needed synchronizations. The second group specifies on which accelerators a task should run. This information also allows the runtime to generate the code that offloads the tasks to the accelerators and, combined with the data dependences, the code that takes care of the data movement. Furthermore, a user can specify optimized versions of the task code for a given accelerator to take advantage of already optimized operations (e.g., the BLAS libraries) while maintaining a single source code that is portable across multiple platforms.

We have presented the challenges of implementing this model on a number of architectures (e.g., multicores, the Cell B.E., GPUs and FPGAs). Our experimental results show that our current prototypes can obtain high performance relative to the amount of effort that the programmer needs to devote. While in some cases there is still a gap with respect to hand-tuned versions, we expect that further research will reduce it.

In particular, our current research focuses on improving the scheduling of tasks to minimize the communication of data across the different processing elements of the system and to increase data locality. Other research directions try to increase the amount of available parallelism by removing false dependences (using renaming, as outlined in Sect. 3, and establishing a trade-off between the memory used for data renaming and the parallelism exploited at runtime).

Finally, we are in the process of integrating all proof-of-concept implementations into a single runtime and of including the intelligence described in Sect. 3.5 to make use of different accelerators at the same time. Further scheduling research is needed to make this efficient.

Acknowledgments We would like to thank the reviewers of this paper for their helpful and constructive comments. The researchers at BSC-UPC were supported by the Spanish Ministry of Science and Innovation (contracts no. TIN2007-60625 and CSD2007-00050), the European Commission in the context of the SARC project (contract no. 27648) and the HiPEAC2 Network of Excellence (contract no. FP7/IST-217068), and the MareIncognito project under the BSC-IBM collaboration agreement.


The researchers at the Universidad Jaume I were supported by the Spanish Ministry of Science and Innovation/FEDER (contracts no. TIN2005-09037-C02-02 and TIN2008-06570-C04-01) and by the Fundación Caixa-Castellón/Bancaixa (contracts no. P1B-2007-19 and P1B-2007-32).

References

1. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27(3), 1–15 (2008)
2. OpenMP Architecture Review Board: OpenMP 3.0 Specification. http://www.openmp.org, May (2008)
3. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a programming model for the Cell BE architecture. In: Proceedings of the ACM/IEEE SC 2006 Conference, November (2006)
4. Turner, J.A.: Roadrunner: heterogeneous petascale computing for predictive simulation. Technical Report LANL UR-07-1037, Los Alamos National Lab, Las Vegas, NV (2007)
5. Kurzak, J., Buttari, A., Dongarra, J.: Solving systems of linear equations on the Cell processor using Cholesky factorization. IEEE Trans. Parallel Distrib. Syst. 19(9), 1175–1186 (2008)
6. Ayguadé, E., Copty, N., Duran, A., Hoeflinger, J., Lin, Y., Massaioli, F., Teruel, X., Unnikrishnan, P., Zhang, G.: The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst. 20(3), 404–418 (2009)
7. Pham, D.C., Aipperspach, T., Boerstler, D., Bolliger, M., Chaudhry, R., Cox, D., Harvey, P., Harvey, P.M., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Pham, M., Pille, J., Posluszny, S., Riley, M., Stasiak, D.L., Suzuoki, M., Takahashi, O., Warnock, J., Weitzel, S., Wendel, D., Yazawa, K.: Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE J. Solid-State Circuits 41(1), 179–196 (2006)
8. NVIDIA: NVIDIA CUDA Compute Unified Device Architecture, Programming Guide (2007)
9. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. In: SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pp. 777–786. ACM Press, New York (2004)
10. PGI: PGI Fortran and C Accelerator Compilers and Programming Model, Technology Preview. The Portland Group (2008)
11. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: a hybrid multi-core parallel programming environment. In: First Workshop on General Purpose Processing on Graphics Processing Units, October (2007)
12. Khronos OpenCL Working Group: The OpenCL Specification. Aaftab Munshi, Ed. (2009)
