

ÉCOLE SUPÉRIEURE D'ÉLECTRICITÉ

POLYTECHNICAL UNIVERSITY OF BUCHAREST

CPU-GPU cluster

Design, Experimentations, Performances

Gabriel Noaje

Supervisors: Stéphane Vialle (SUPELEC)

Dumitru Popescu (PUB)

2008


CPU-GPU cluster

Design, Programming and Experimentation

(for future application to financial computing)

Gabriel Noaje - SUPELEC & UPB

June 30, 2008


Contents

1 Introduction

2 Choosing the cluster hardware
  2.1 Hardware requirements for a GPU cluster
  2.2 Choosing the graphic card
    2.2.1 Asynchronous support in CUDA
    2.2.2 Tests description
    2.2.3 Tests results
  2.3 Choosing the cluster nodes hardware
  2.4 Choosing the interconnection network
  2.5 Summary of the GPU cluster configuration

3 Compiling CUDA + MPI code
  3.1 The Makefile
  3.2 Encountered problems and solutions
  3.3 Summary of CUDA & MPICH compilation

4 First CUDA+MPI code on the GPU cluster
  4.1 The matrix product problem
  4.2 Structure of CUDA + MPI code
  4.3 CUDA code
    4.3.1 Kernel call
    4.3.2 Kernel execution model
  4.4 Performances
  4.5 Summary of the first CUDA+MPI code

5 Optimizations and tweaks
  5.1 Infiniband interconnection network
  5.2 GPU overlap
  5.3 Kernel optimization
  5.4 Results accuracy
  5.5 Summary of optimizations and tweaks

6 Conclusions

7 Appendix A - Graphic cards GPU overlap support test results


List of Figures

2.1 GPUs overlapping test results
4.1 Different types of matrix partitions
4.2 MPI matrix product partition
4.3 Thread batching
4.4 CUDA memory model
4.5 CUDA kernel
4.6 CUDA matrix product partition
4.7 MPI execution times (without and with overlapping) [log scale]
4.8 MPI+CUDA execution times (without and with overlapping)
4.9 MPI+CUDA execution times (without and with overlapping) [log scale]
4.10 GFlops performance [linear and logarithmic scale]
5.1 Gigabit versus Infiniband speedup (Credits: Lars E. Jonsson, William R. Magro - Comparative Performance of InfiniBand Architecture and Gigabit Ethernet Interconnects on Intel® Itanium® 2 Microarchitecture-based Clusters)
5.2 CUDA matrix product partition using streams
5.3 CUDA code using streams
5.4 Execution times for different values of CUDA streams


List of Tables

2.1 List of graphic cards chosen to be tested
2.2 ASUS GeForce 8800GT technical specifications
2.3 ASUS GeForce 8800GT CUDA specifications
2.4 GPU cluster hardware configuration
4.1 Circulation and computation function
4.2 GFlops performances
5.1 MPI execution time (Gigabit vs Infiniband)
5.2 MPI+CUDA execution time (Gigabit vs Infiniband)
5.3 Performance comparison between naive implementation of matrix product and CUBLAS library
5.4 Comparison of numerical deviation of the results using one CPU and one GPU


Chapter 1

Introduction

Nowadays, problems require more and more computation power and speed, as their complexity and the amount of data involved grow much faster than the compute capabilities actually available. Some problems are solvable in principle, but the time required to obtain a result is far too large on a human scale. For this reason hardware researchers face a continuous challenge to provide means that can deal with such problems and solve them in a timely manner. First there was parallelism on a single machine with a single processor, by using threads. Then came the idea of multiple cores placed together on the same die. Each computer now had considerable power for complex computations, but when it came to truly complex problems it still failed to do the job in a reasonable time. So the next logical step was to connect several computers so that they could exchange information: the problem to be solved is distributed among all the processors in the network and the computation is carried out in parallel.

At the time, the major focus was on CPU computation capabilities and everyone was concentrating on how to obtain the last drop of power from them. But in the shadow another device was crunching trillions of bits per second, although only for a specific purpose: graphics. Graphic cards developed over the years with the sole purpose of rendering ultra-realistic scenes, mainly in games and animations, and the programming tools offered by the vendors (e.g. Cg from nVIDIA, HLSL from Microsoft) were targeted at this kind of computation. The sustained rhythm of graphic card innovation drew attention to their graphics processing units, which keep becoming faster and more flexible in their use. Today's GPUs can process huge amounts of data and they are now used not only for rendering 3D scenes, but also for processing images and performing massively parallel computing, such as financial statistics or terrain analysis for finding new oil fields [13].

Regardless of the purpose of their use, GPUs need a software interface to unleash their full power, and the API used for that has to be as easy to use as possible, so that someone with no graphics background can exploit the potential of the GPU. GPU producers foresaw the benefits of General-Purpose Computing on Graphics Processing Units (GPGPU). Both NVIDIA and AMD created tools which facilitate the development of general purpose programs that use the GPU.

On the 15th of February 2007, nVIDIA Corporation released to the public the initial version of the Compute Unified Device Architecture (CUDA). The CUDA SDK allows a programmer to use the C programming language to code algorithms for execution on the GPU. The GeForce 8 Series, featuring G8X GPUs, was the first hardware to support the new framework, which now also covers Tesla solutions and some Quadro solutions. CUDA gives developers unfettered access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, Nvidia GeForce-based GPUs effectively become powerful, programmable open architectures like today's CPUs (Central Processing Units). By opening up the architecture, CUDA provides developers with both a low-level, deterministic API and a high-level API for repeatable access to hardware, which is necessary to develop essential high-level programming tools such as compilers, debuggers, math libraries, and application platforms.

In December 2007, AMD first released its Stream Computing SDK, which runs on Windows XP. The SDK includes "Brook+", an AMD hardware optimized version of the Brook language developed by Stanford University, itself a variant of ANSI C, open-sourced and optimized for stream computing. The AMD Core Math Library (ACML), the AMD Performance Library (APL) with optimizations for the AMD FireStream, and the COBRA video library for video transcoding acceleration are also to be included. Another important part of the SDK, the Compute Abstraction Layer (CAL), is a software development layer aimed at low-level access, through the CTM hardware interface, to the GPU architecture for performance tuning of software written in various high-level programming languages.

The financial domain is one of the fields where GPGPU emerged, and porting computation-intensive applications to the GPU was a real success. Monte-Carlo algorithms used for financial models, like European Option Pricing, can take full advantage of the GPU power.

The objectives of two workpackages of the ANR-CIGC-GCPMF project are to provide programs for Monte-Carlo algorithms suited for GPUs, which will be used to perform fast financial computations. The research is sustained by different partners within the consortium with expertise in computer science, financial mathematics and banking, such as: SUPELEC, INRIA, EDF, Ecole Nationale des Ponts et Chaussees, Misys Summit Banking Systems, BNP Paribas, Pricing Partners, Calyon, Ixis-Cib, Centrale Paris, Paris VI [1]. Workpackage 2.3 of the project is focused on experiments with Monte-Carlo programs on clusters of GPUs. The main objectives of the WP are:

• Efficiently distribute parallel programs with communication on a set of GPUs;

• Transfer data between GPUs as fast as possible to ensure no idle times;

• Use these ideas for Monte-Carlo simulations

The algorithmic strategy for such a cluster is to fully load the GPU with computation, while off-loading the CPU, which can then handle the communications and perform overlapping.


Chapter 2

Choosing the cluster hardware

2.1 Hardware requirements for a GPU cluster

The new CUDA framework makes it possible to lower the CPU load and send the computations to the GPU, which by construction has a highly parallel architecture, thus accelerating the processing of large data arrays by applying an identical set of instructions (called a kernel) to each element of a data array. In this way the processor is only busy with supplying the data and the kernels to the GPU and then collecting the results. The transfer between CPU (host) and GPU (device) might appear to be a bottleneck due to the relatively slow bridge bus, but one can obtain very good results by exposing as much data parallelism as possible in the algorithm and by mapping it to the hardware as efficiently as possible, so that the transfers are minimized.

The craving for more computational power led to the idea of plunging the CUDA framework into a Message Passing Interface (MPI) environment, thus giving rise to the concept of a cluster of GPUs. Inside the cluster, each node of the MPI network passes most of the intensive parallel tasks to the GPU, off-loading the CPU, which is then free to handle the network communication between nodes. Since this is a new concept that emerged only recently in the scientific community, special care was needed before starting the code development.

The following problems had to be addressed in order to create a rock solidplatform for tests:

• Choosing, from all the graphic card models, the one that best suits the computational needs;

• Choosing the correct hardware platform to support the GPU (CPU, motherboard, RAM, HDD, power supply);

• Choosing the networking infrastructure that will support the entirecluster communication flow.

With these items in mind, several tests were carried out to identify the most feasible solution for the following work.

2.2 Choosing the graphic card

2.2.1 Asynchronous support in CUDA

In order to accelerate the GPU cluster computation capability even more, the graphic card that was chosen had to have a special feature: asynchronous concurrent execution support. According to the CUDA Programming Guide Version 1.1, released on the 29th of November 2007 (chapter 4.5.1.5) [10], to facilitate concurrent execution between host and device, some runtime functions are asynchronous, which means that control is returned to the application before the device has completed the requested task. These asynchronous functions are:

• Kernel launches through __global__ functions or cuGridLaunch() and cuGridLaunchAsync();

• The functions that perform memory copies and are suffixed with Async;

• The functions that perform device ↔ device memory copies;

• The functions that set memory.

The same chapter also states that some devices can perform copies between page-locked host memory and device memory concurrently with kernel execution. So the idea was to interleave execution with memory accesses, thus gaining even more performance.
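As a minimal illustration of this behaviour (not taken from the report, with arbitrary buffer names and sizes), an asynchronous copy issued from page-locked host memory returns control immediately, so the CPU can keep working until an explicit synchronization:

    #include <cuda_runtime.h>

    /* Sketch only: the Async copy returns before completion, so CPU work
       can proceed while the transfer is in flight. */
    void async_copy_demo(void)
    {
        const size_t nbytes = 1 << 20;
        float *h_buf, *d_buf;
        cudaMallocHost((void**)&h_buf, nbytes);   /* page-locked host memory */
        cudaMalloc((void**)&d_buf, nbytes);

        cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, 0);
        /* ... other CPU work can run here while the copy proceeds ... */
        cudaThreadSynchronize();                  /* wait for the copy to finish */

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
    }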

2.2.2 Tests description

Choosing the most suitable graphic card for the GPU cluster required a few tests in order to be sure that the desired capabilities were met and the best performance would be achieved.


As a first selection criterion, the graphic card's CUDA compute capability revision number was used. Following the technical specifications presented in Appendix A of the CUDA Programming Guide, a number of GPUs with CUDA 1.1 compute capability were chosen. Table 2.1 presents the graphic cards that were selected for further tests.

    No.  Vendor  Model
    1    ASUS    GeForce 8800 GT (EN8800GT/G/HTDP/512M/A)
    2    PNY     Quadro FX 570
    3    PNY     Quadro FX 1700
    4    HP      Quadro FX 1700
    5    PNY     Quadro FX 3700
    6    PNY     Quadro FX 4600

Table 2.1: List of graphic cards chosen to be tested

Getting back to chapter 4.5.1.5, the CUDA Programming Guide states that an application can call the function cuDeviceGetAttribute() with the CU_DEVICE_ATTRIBUTE_GPU_OVERLAP parameter to query the device capability for asynchronous concurrent execution. Based on this idea, a simple test program was written following the example called deviceQuery from the CUDA SDK. The example presented in the CUDA SDK was coded using the high-level CUDA Runtime API, but the function that queries for the overlapping capability resides in the low-level CUDA Driver API, and since the two APIs cannot be mixed in any way, the entire code was rewritten using the Driver API. The code fragment that queries for this capability is shown below.

    int query;
    CU_SAFE_CALL_NO_SYNC(cuDeviceGetAttribute(&query,
        CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, device));
    printf("GPU Overlap support: %s (%d)\n",
           query ? "AVAILABLE" : "Not available", query);

The return value of the function is simply 1 or 0, meaning that the GPU does or does not have the overlap capability. Testing according to the NVIDIA guidelines should therefore be straightforward. However, things are not that simple. According to the NVIDIA Support Forum [6, 7], the result of the function might be erroneous and a device might report 1 although it is a CUDA 1.0 device. Discussions on the forum suggest that there might be an nVidia Forceware driver bug, but no official statement is available to clarify the problem.


In order to make sure that the device indeed has the overlap capability, another test was carried out using another program from the CUDA SDK. The simpleStreams example illustrates the use of CUDA streams for overlapping kernel execution with device/host memory copies. A kernel is used to initialize an array to a specific value, after which the array is copied to the host (CPU) memory. To increase performance, multiple kernel/memcopy pairs are launched asynchronously, each pair in its own stream. The program computes an estimated time for the streamed execution; if the device indeed executes the streams asynchronously, the measured time should approximately match it. The code fragment which initiates the streams is the following:

    // asynchronously launch nstreams kernels, each operating on
    // its own portion of data
    for (int i = 0; i < nstreams; i++)
        init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i*n/nstreams, d_c);

    // asynchronously launch nstreams memcopies
    for (int i = 0; i < nstreams; i++)
        cudaMemcpyAsync(a + i*n/nstreams, d_a + i*n/nstreams,
                        nbytes/nstreams, cudaMemcpyDeviceToHost, streams[i]);

In this way one can be sure that if the GPU overlap attribute returns true and the execution of the streams goes as expected, then the device definitely supports this capability.

2.2.3 Tests results

The two tests were integrated into a shell script which also prints technical information about the GPU, using the NVIDIA driver command line interface. Running the two tests on the six selected graphic cards revealed the characteristics of each card and their overlap capability. Two relevant cases are presented below (Fig. 2.1); the complete tables with the test results can be found in Appendix A.

The first column of the table shows a normal situation: a CUDA 1.1 device which reports that it has the overlap capability, and for which the simpleStreams test approximately matched the expected time with the measured time. In the second column, however, the Quadro FX 4600 GPU, which is identified as a CUDA 1.0 device, also reports that it has the overlap capability. Yet if the times of the simpleStreams test are compared, it can be clearly seen that the streams are in fact not run concurrently but sequentially, so the measured time is much greater than the expected time. This latter situation exemplifies the problem discussed earlier, where a device erroneously reports that it has the overlap capability when in fact it does not.


Figure 2.1: GPUs overlapping test results

Based on the test results on the available GPUs, the ASUS GeForce 8800GT card was chosen. Though several other devices were identified as supporting overlapping, the ASUS model had more multiprocessors (14 multiprocessors, i.e. 112 stream processors), a larger amount of memory (512MB) and a higher core clock frequency (600MHz). The technical and CUDA specifications are as follows:

2.3 Choosing the cluster nodes hardware

Around the ASUS GeForce 8800 GT GPU a high performance configuration was built to sustain each node. Every computer is a middle tower with a dual core Intel E8200 processor providing a 6MB cache and an Intel motherboard with a 1333MHz front side bus. The motherboard offers two PCI Express 2.0 x16 ports and one PCI Express 1.1 x4 port (routed to a x16 connector), thus allowing a maximum of three graphic cards, or two graphic cards and a fast interconnection network card, to be plugged in (see 2.4). Each computer has 4 GB of DDR3 RAM operating at a memory clock of 133MHz. Each node has a 600W power supply which powers the GPU, which alone uses 240W when loaded [5], and all the other components.


    Technical specification

    Vendor model:       ASUS EN8800GT/G/HTDP/512M/A
    Core clock:         600 MHz
    Memory clock:       900 MHz
    Memory amount:      512 MB
    Memory interface:   256-bit
    Memory bandwidth:   57.6 GB/sec
    Texture fill rate:  33.6 billion/sec

Table 2.2: ASUS GeForce 8800GT technical specifications

    CUDA specification

    Global memory:            536543232 bytes
    Constant memory:          65536 bytes
    Shared memory per block:  16384 bytes
    Registers per block:      8192
    Warp size:                32
    Threads per block:        512
    Block dimension:          512 x 512 x 64
    Grid dimension:           65535 x 65535 x 1
    Maximum memory pitch:     262144 bytes
    Texture alignment:        256 bytes

Table 2.3: ASUS GeForce 8800GT CUDA specifications

The operating system on each node is the GNU/Linux Fedora Core 8 distribution with kernel version 2.6.24.5-85.

2.4 Choosing the interconnection network

Interconnection between the nodes is assured using two types of networks. First, an Ethernet Gigabit network was installed by simply using the motherboards' built-in network cards (featuring the Intel 82566DC-2 Gigabit chipset) and a D-Link DGS-1016D 16 port Gigabit switch. Then a more expensive solution was put in place using an Infiniband connection. This required the addition of a Mellanox Infiniband PCI card in each node, which occupies one of the three PCI Express slots of the motherboard, and the use of an Infiniband switch.


Tests will be carried out in order to see how each connection behaves in different situations and which is the best solution to choose. Each one has its own advantages and disadvantages: Gigabit is more common and thus cheaper to implement, while Infiniband promises 10 times the speed of Gigabit but requires dedicated equipment, which is less common and more expensive. The tests will try to show the tradeoffs for both types of connections and to determine which one is more suitable as a communication infrastructure for the GPU cluster.

In order to use both of these connections, two implementations of MPI are needed and have been installed. The first one, MPICH2 [3], works with the Gigabit connection out of the box; for the Infiniband connection a different implementation is needed, called MVAPICH2 [4], which is aware of the Infiniband stack and knows how to use it for communication.

2.5 Summary of the GPU cluster configuration

In conclusion, we aimed to build a cluster of CPU-GPU nodes with the following characteristics:

• GPU computation enabled, with supplementary CPU-GPU communication overlapping capabilities;

• a fast interconnection network between the nodes;

• enough power supply on each node to support several devices.

The most important part of building the cluster was choosing the most suitable GPU; it showed that such a decision should not rely on a single bit of information which may or may not be trusted. Thorough tests made it possible to precisely identify the graphic card best suited for the GPU cluster. The final configuration of the cluster is presented in Table 2.4.


    16 x NETSTATION I8000

    Graphic card:   ASUS GeForce 8800 GT (EN8800GT/G/HTDP/512M)
    Motherboard:    Intel DX38BT
    Processor:      Intel Core 2 Duo E8200 (2.66GHz, FSB 1333MHz, 6MB cache, 45nm)
    Memory:         4 x 1GB DDR3 1066 133MHz
    Harddisk:       160GB SATA II (7200 RPM, 8MB cache)
    Power supply:   600W Fortron SLI Epsilon EPS 12V

    Connectivity

    Gigabit card (16):    Intel 82566DC-2 Gigabit chipset (motherboard integrated)
    Gigabit switch:       D-Link DGS-1016D 16 port Gigabit Desktop Switch
    Infiniband card (8):  Mellanox InfiniHost III Lx MHGS18-XTC (chipset MT25204)
    Infiniband switch:    Flextronics ODM F-X430066, 8 port 4X 1U SDR InfiniBand switch, unmanaged

Table 2.4: GPU cluster hardware configuration


Chapter 3

Compiling CUDA + MPI code

3.1 The Makefile

The purpose of the study is to create a framework for parallel and distributed computing using a cluster of GPUs. For this reason a Message Passing Interface needs to be used to transport data and information between nodes, while on each node the Compute Unified Device Architecture is used to communicate between CPU and GPU and to instruct the graphics processor to make the desired computations. In order to make the entire framework functional, a hybrid MPI and CUDA code needs to be developed, compiled and deployed over the cluster. As MPI framework the MPICH2 implementation was used, but OpenMPI is also considered for further development.

Both architectures have their own way of compiling source code written for their scope. A brief description of each way of compiling is presented, to better understand how mixed code needs to be compiled.

MPICH2 uses a script called mpicc for compiling C code. Essentially, the script does nothing more than call the gcc compiler with a few parameters to include the MPICH headers and link the MPICH library. Thus compiling MPI C code can be done by just using the gcc compiler with the appropriate parameters.

In the case of CUDA the compilation of a source code is more complex and involves more stages [11]. Source files for CUDA applications consist of a mixture of conventional C++ 'host' code and GPU 'device' functions. The CUDA compilation trajectory separates the device functions from the host code, compiles the device functions using proprietary NVIDIA compilers/assemblers, compiles the host code using any general purpose C/C++ compiler that is available on the host platform (mainly gcc), and afterwards embeds the compiled GPU functions as load images in the host object file. In the linking stage, specific CUDA runtime libraries are added for supporting remote SIMD procedure calling and for providing explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer. This compilation trajectory involves several splitting, compilation, preprocessing, and merging steps for each CUDA source file, and several of these steps are subtly different for different modes of CUDA compilation (such as compilation for device emulation, or the generation of 'fat device code binaries'). It is the purpose of the CUDA compiler driver nvcc to hide the intricate details of CUDA compilation from developers. Additionally, instead of being a specific CUDA compilation driver, nvcc mimics the behavior of general purpose compiler drivers (such as gcc), in that it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process. All non-CUDA compilation steps are forwarded to a general C compiler that is available on the current platform.
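To make this split concrete, a minimal .cu file (illustrative only, not taken from the report) could look as follows; nvcc sends the __global__ kernel to the NVIDIA tool chain, passes the host main() to gcc/g++, and links both against the CUDA runtime:

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Device code: compiled by the proprietary NVIDIA compilers/assemblers. */
    __global__ void scale(float *v, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= s;
    }

    /* Host code: compiled by the general purpose C/C++ compiler. */
    int main(void)
    {
        const int n = 256;
        float h[256], *d;
        for (int i = 0; i < n; i++) h[i] = (float)i;

        cudaMalloc((void**)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 127) / 128, 128>>>(d, 2.0f, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);

        printf("h[10] = %f\n", h[10]);  /* expect 20.0 */
        return 0;
    }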

Regarding the gcc and g++ version used for compiling CUDA programs, version 4.1 or lower is recommended [8] [9]. Any attempt to use a higher version of the compiler resulted in errors of the following kind.

1 "/usr/include/c++/4.2/ i486 -linux -gnu/bits/c++ config.h",2 line 149: error: expected a "{"3 namespace std __attribute__ (( __visibility__ ("default"))) {4 ^5

6 "/usr/include/c++/4.2/ bits/cpp_type_traits.h", line 74:7 error: expected a "{"8 namespace __gnu_cxx __attribute__ (( __visibility__9 ^

10 ("default"))) {11

12

13 "/usr/include/c++/4.2/ bits/cpp_type_traits.h", line 81:14 error: expected a "{"15 namespace std __attribute__ (( __visibility__ ("default"))) {16 ^17

18 "/usr/include/c++/4.2/ bits/cpp_type_traits.h", line 313:19 error: name followed by "::" must be a class or20 namespace name21 struct __is_normal_iterator < __gnu_cxx ::22 ^

Page 25: CPU-GPU cluster - CentraleSupelecvialle/ANR-CIGC-GCPMF/final_report_Gabriel... · Nationale des Ponts et Chaussees, Misys Summit Banking Systems, BNP Paribas, Pricing Partners, Calyon,

3.1. THE MAKEFILE 23

23 __normal_iterator <_Iterator ,24

25

26 "/usr/include/c++/4.2/ bits/cpp_type_traits.h", line 313:27 error: too many arguments for class template28 "<error >:: __is_normal_iterator"29 struct __is_normal_iterator < __gnu_cxx ::30 __normal_iterator <_Iterator ,31 ^32

33 "/usr/include/c++/4.2/ bits/cpp_type_traits.h", line 313:34 error: name followed by "::" must be a class or35 namespace name36 struct __is_normal_iterator < __gnu_cxx ::37 ^38 __normal_iterator <_Iterator ,

    gcc: error trying to exec 'cc1plus': execvp: No such file or directory

Taking into account the complex structure of the CUDA compilation stages, the easiest way of compiling MPICH and CUDA code is to use nvcc along with the MPICH parameters that are normally used, mainly the library inclusion. The nvcc compiler will split the C/C++ code (including MPI) and pass it to the gcc compiler together with the nvcc parameters, resulting in a normal MPICH compilation that is later linked against the CUDA code. A sample Makefile used to compile mixed MPICH and CUDA code looks like the following.

    CC=nvcc
    CFLAGS= -I. -I/usr/local/mpich2-1.0.6p1/include \
            -I/usr/local/cuda/include \
            -I/home/noaje_gab/NVIDIA_CUDA_SDK/common/inc
    LDFLAGS= -L/usr/local/mpich2-1.0.6p1/lib \
            -L/usr/local/cuda/lib \
            -L/home/noaje_gab/NVIDIA_CUDA_SDK/lib \
            -L/home/noaje_gab/NVIDIA_CUDA_SDK/common/lib
    LIB= -lcuda -lcudart -lm -lmpich -lpthread -lrt
    MPICH_FLAG= -DMPICH_IGNORE_CXX_SEEK \
            -DMPICH_SKIP_MPICXX
    SOURCES= CalcCirc.cpp Init.cu main.cpp
    EXECNAME= MatrixProduct

    all:
            $(CC) -v -o $(EXECNAME) $(SOURCES) $(LIB) \
                    $(LDFLAGS) $(CFLAGS) $(MPICH_FLAG)

    clean:
            rm -f *.o core

Each Makefile command and parameter is explained further on.

3.2 Encountered problems and solutions

As a first remark, any source file which contains CUDA code should have a .cu extension so that the nvcc compiler can properly manage it. Otherwise the source file will not be preprocessed and will be forwarded directly to the compiler, resulting in errors similar to the following ones.

    In file included from Init.c:12:
    /home/noaje_gab/NVIDIA_CUDA_SDK/common/inc/cutil.h:123: error:
      expected '=', ',', ';', 'asm' or '__attribute__' before 'cutReadFilef'
    /home/noaje_gab/NVIDIA_CUDA_SDK/common/inc/cutil.h:254: error:
      expected '=', ',', ';', 'asm' or '__attribute__' before 'cutWriteFileui'

Any other source file which does not contain CUDA code should have a .cpp extension, thus forcing nvcc to invoke the g++ compiler instead of gcc. Even if the user code is pure ANSI C, at the inclusion steps of the compilation C++ code coming from the CUDA libraries is sometimes pulled in. The resulting mixed code cannot be handled by the gcc compiler, and by using the .cpp extension the use of g++ is forced. The errors which appear when a .c extension is used are as follows.

    main.c:(.text+0x187): undefined reference to `ProcessorInit'

Two other things which need to be mentioned relate to MPICH. First, chapter 5.3 Special Issues for C++ in [16] states that some users may get error messages such as

    SEEK_SET is #defined but must not be for the C++ bindings of MPI

The problem is that both stdio.h and the MPI C++ interface use SEEK_SET, SEEK_CUR and SEEK_END. This is a known bug in the MPI-2 standard. There are two solutions to this problem. Either add the following lines before mpi.h is included

    #undef SEEK_SET
    #undef SEEK_CUR
    #undef SEEK_END

or add the following parameter to the compiler command line, which will cause the MPI versions of SEEK_SET, etc. to be skipped.

    -DMPICH_IGNORE_CXX_SEEK

The second problem can be seen by running nvcc in verbose mode (nvcc -v). By default, nvcc invokes the CUDA front end (cudafe) with the option -no_exceptions, which disables C++ exceptions. As stated earlier, although the user code is ANSI C, after the inclusion of various headers from the CUDA and MPICH frameworks a mix of C and C++ code results, which cannot be compiled with exceptions deactivated, thus resulting in errors of the following type.

    mpicxx.h : support for exception handling is disabled

In order to solve this problem it is necessary to force the compiler not to load the MPICH C++ bindings, i.e. not to include the mpicxx.h header. This is done by passing the following parameter on the compiler command line.

    -DMPICH_SKIP_MPICXX

3.3 Summary of CUDA & MPICH compilation

Compiling mixed CUDA and MPICH code is a straightforward process, as long as a few small problems are addressed. Using the following Makefile as a skeleton, any source code mixing CUDA and MPICH can be compiled, provided that the paths match the ones on the system where the compilation is performed.

    CC=nvcc
    CFLAGS= -I. -I/usr/local/mpich2-1.0.6p1/include \
            -I/usr/local/cuda/include \
            -I/home/noaje_gab/NVIDIA_CUDA_SDK/common/inc
    LDFLAGS= -L/usr/local/mpich2-1.0.6p1/lib \
            -L/usr/local/cuda/lib \
            -L/home/noaje_gab/NVIDIA_CUDA_SDK/lib \
            -L/home/noaje_gab/NVIDIA_CUDA_SDK/common/lib
    LIB= -lcuda -lcudart -lm -lmpich -lpthread -lrt
    MPICH_FLAG= -DMPICH_IGNORE_CXX_SEEK \
            -DMPICH_SKIP_MPICXX
    SOURCES= CalcCirc.cpp Init.cu main.cpp
    EXECNAME= MatrixProduct

    all:
            $(CC) -v -o $(EXECNAME) $(SOURCES) $(LIB) \
                    $(LDFLAGS) $(CFLAGS) $(MPICH_FLAG)

    clean:
            rm -f *.o core


Chapter 4

First CUDA+MPI code on the GPU cluster

4.1 The matrix product problem

The first experiments on the GPU cluster required a simple problem to work with, but at the same time one with a high computational complexity suitable for a distributed environment. The matrix product problem was chosen for the first experiments. The complexity of the naive implementation of the problem is O(n^3). Though faster algorithms exist, like the Strassen algorithm or the Coppersmith-Winograd algorithm, for experimentation purposes the naive implementation was chosen.

In order to compute the matrix product using a cluster, each matrix needs to be partitioned and mapped onto the cluster nodes. There are different ways a matrix can be partitioned in blocks, depending on the virtual topology that is used in the MPI framework. For the experiments, a ring virtual topology was chosen, leading to several partitioning possibilities (Fig. 4.1):

• lines blocks, in which the first m lines are located on processor P0, the next m on processor P1, etc. In general, processor Pi has the lines im to (i+1)m - 1;

• cyclic lines, in which processor Pi has the lines l for which l mod p = i;

• cyclic lines blocks, in which each processor has d consecutive lines; the initial matrix is divided into submatrices of pd lines, each submatrix being partitioned in lines blocks over the processors.

Naturally, the same partitioning methods can be applied by columns.
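As a small illustration (not part of the original report, with hypothetical helper names), the three row-wise schemes differ only in the index arithmetic that maps a line l to its owning processor:

    #include <stdio.h>

    /* p processors, m = n/p lines per processor, d lines per cyclic block */
    int owner_lines_blocks(int l, int m)         { return l / m;       }
    int owner_cyclic_lines(int l, int p)         { return l % p;       }
    int owner_cyclic_blocks(int l, int d, int p) { return (l / d) % p; }

    int main(void)
    {
        int n = 16, p = 4, m = n / p, d = 2;
        for (int l = 0; l < n; l++)
            printf("line %2d -> blocks: P%d  cyclic: P%d  cyclic blocks: P%d\n",
                   l, owner_lines_blocks(l, m), owner_cyclic_lines(l, p),
                   owner_cyclic_blocks(l, d, p));
        return 0;
    }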


Figure 4.1: Different types of matrix partitions

The partitioning used for the experiments is a mixed one. The A matrix is partitioned in lines blocks, while the B matrix is partitioned in columns blocks. By circulating the A matrix along the ring topology between the processors, each of them can compute a lines block of the C matrix. For example, in the first stage each processor computes a small matrix which is part of its C lines block (Fig. 4.2).

Figure 4.2: MPI matrix product partition

On each node of the cluster, the matrix product between the two blocks that are in memory at a given moment is in fact computed using the GPU, by further partitioning the two block matrices to map them onto the GPU architecture.


4.2 Structure of CUDA + MPI code

The implementation of the matrix product using CUDA and MPI code contains three main parts:

• initialization: where the MPI and CUDA frameworks are initiated, the memory allocations for both CPU and GPU are performed and each block is initialized;

• computation: where the A matrix is circulated among the processors using MPI communication routines, the matrices are transferred to the GPU, the actual computations are performed, and the result matrix is transferred back from the GPU and placed in its position in the C block matrix on each processor;

• finalization: where the resources are freed and the MPI framework isclosed (in the CUDA Runtime API there is no explicit function forfinalizing the CUDA framework)

In the initialization phase the MPI framework is started using the specific function (MPI_Init). The CUDA Runtime API does not need a specific initialization, but a user function is used to detect whether the computer has a CUDA enabled GPU. The matrices on the host and on the device are allocated as linear arrays (row-major order) using the specific functions (host: malloc; device: cudaMalloc), then a user function initializes each lines block.
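A condensed sketch of this phase (assumed variable names and sizes, not the report's exact code) could look like this:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Initialization sketch: start MPI, check for a CUDA device, allocate
       the host and device blocks (row-major linear arrays). N is the
       hypothetical global matrix size. */
    void init_phase(int *argc, char ***argv, int N,
                    float **h_A, float **d_A, int *rank, int *nbProcs)
    {
        MPI_Init(argc, argv);
        MPI_Comm_rank(MPI_COMM_WORLD, rank);
        MPI_Comm_size(MPI_COMM_WORLD, nbProcs);

        int devCount = 0;
        cudaGetDeviceCount(&devCount);        /* user-level check for a CUDA GPU */
        if (devCount == 0) MPI_Abort(MPI_COMM_WORLD, 1);

        size_t blockBytes = (size_t)(N / *nbProcs) * N * sizeof(float);
        *h_A = (float*)malloc(blockBytes);    /* host lines block            */
        cudaMalloc((void**)d_A, blockBytes);  /* device buffer for the block */
    }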

The computation phase consists of the MPI communication routines and the computation on the GPU. In order to achieve overlapping between the MPI communication and the actual computation, the non-blocking Issend function was used [14]. Ssend is a send variation that uses the synchronous mode, meaning it can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. If both sends and receives are blocking operations, then the use of the synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication. A send executed in this mode is non-local.


The non-blocking version uses opaque request objects to identify communication operations and to match the operation that initiates the communication with the operation that terminates it. These are system objects that are accessed via a handle. A request object identifies various properties of a communication operation, such as the send mode, the communication buffer that is associated with it, its context, the tag and destination arguments to be used for a send, or the tag and source arguments to be used for a receive. In addition, this object stores information about the status of the pending communication operation.

Two versions of the function that performs the circulation and computation were used: one with overlapping and one without. The pseudo-code for the two functions is presented in Table 4.1.

    WITH overlap                     WITHOUT overlap

    MPI Communication non-blocking   Sequential Computation
    Sequential Computation           MPI Communication non-blocking
    MPI Wait                         MPI Wait

Table 4.1: Circulation and computation function
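A sketch of the "WITH overlap" column, with hypothetical buffer and helper names (the report's own function is not reproduced here), could look like this: the ring transfer of the A block is started, the GPU product on the current block runs in the meantime, and only then the requests are waited for.

    #include <mpi.h>

    /* Hypothetical wrapper around the CUDA part: copy the blocks to the GPU,
       launch the kernel and copy the result back. */
    void gpu_block_product(float *blockA, float *blockB, float *blockC);

    void circulate_and_compute(float **blockA, float *recvA, float *blockB,
                               float *blockC, int blockElems,
                               int left, int right)
    {
        MPI_Request sreq, rreq;

        /* start the ring communication (non-blocking synchronous send) */
        MPI_Issend(*blockA, blockElems, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &sreq);
        MPI_Irecv (recvA,   blockElems, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &rreq);

        /* sequential computation, overlapped with the transfers */
        gpu_block_product(*blockA, blockB, blockC);

        /* "MPI Wait" step of Table 4.1 */
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);

        *blockA = recvA;   /* the received block becomes the next A block */
    }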

The finalization phase of the program consists of freeing the resources on both host and device and closing the MPI framework using the specific function MPI_Finalize. The CUDA framework does not require an explicit finalization.

4.3 CUDA code

Inside the sequential computation function the two block matrices are copiedto the device using the CUDA specific function cudaMemcpy. Then the kernelis called and in the end the result is copied back from the device. The CUDAhost component uses the high-level CUDA Runtime API.

4.3.1 Kernel call

When programmed through CUDA, the GPU is viewed as a compute device capable of executing a very high number of threads in parallel. The batch of threads that executes a kernel is organized as a grid of thread blocks (Fig. 4.3).


Figure 4.3: Thread batching

A thread block is a batch of threads that can cooperate by efficiently sharing data through fast shared memory and by synchronizing their execution to coordinate memory accesses. Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional index, forming a one-dimensional, two-dimensional, or three-dimensional thread block. However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks. These multiple blocks are organized into a one-dimensional or two-dimensional grid of thread blocks. Each block within the grid can be identified by a one-dimensional or two-dimensional index accessible within the kernel through the built-in blockIdx variable.
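For illustration (a minimal kernel that is not part of the report), the two built-in variables are typically combined to compute a global element index:

    /* Each thread derives the (row, col) of the element it owns from
       blockIdx, blockDim and threadIdx. */
    __global__ void index_demo(float *out, int width)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        out[row * width + col] = (float)(row * width + col);
    }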

Thread blocks are required to execute independently. It must be possibleto execute them in any order, in parallel or in series. This independencerequirement allows thread blocks to be scheduled in any order across anynumber of cores, enabling programmers to write scalable code. The numberof thread blocks in a grid is typically dictated by the size of the data beingprocessed rather than by the number of processors in the system, which itcan greatly exceed.

The number of CUDA threads for each call is specified using a new syntax:<<<grid, threads>>>. The first parameter specifies the number of threadblocks in the grid, while the second parameter specifies the number of threadsinside a thread block.

During the experiments a thread block of 16x16 was used, and the grid size was chosen to accommodate the matrix sizes, which for simplicity were chosen as multiples of the block size (16). Thus the code for calling the kernel was the following:

    #define BLOCK_SIZE 16
    ...
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(WC/threads.x, HC/threads.y);
    matrixMulkern<<<grid, threads>>>(d_C, d_A, d_B, WA, WB);

4.3.2 Kernel execution model

A grid of thread blocks is executed on the device by scheduling blocks for execution on the multiprocessors. Each multiprocessor processes batches of blocks one batch after the other. A block is processed by only one multiprocessor, so that the shared memory space resides in the on-chip shared memory, leading to very fast memory accesses. How many blocks each multiprocessor can process in one batch depends on how many registers per thread and how much shared memory per block are required for a given kernel, since the multiprocessor's registers and shared memory are split among all the threads of the batch of blocks. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
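As a hypothetical example based on the figures reported in Table 2.3 (8192 registers and 16384 bytes of shared memory available to one multiprocessor), a kernel launched with 256 threads per block, needing 10 registers per thread and 2 KB of shared memory per block, would be limited as follows:

    registers:      8192 / (256 x 10) = 3.2   ->  at most 3 active blocks
    shared memory:  16384 / 2048      = 8     ->  at most 8 active blocks
    active blocks per multiprocessor  = min(3, 8) = 3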


The blocks that are processed by one multiprocessor in one batch are referred to as active. Each active block is split into SIMD groups of threads called warps: each of these warps contains the same number of threads, called the warp size, and is executed by the multiprocessor in a SIMD fashion. Active warps, i.e. all the warps from all active blocks, are time-sliced: a thread scheduler periodically switches from one warp to another to maximize the use of the multiprocessor's computational resources. A half-warp is either the first or second half of a warp.

The way a block is split into warps is always the same; each warp containsthreads of consecutive, increasing thread IDs with the first warp containingthread 0. The issue order of the warps within a block is undefined, buttheir execution can be synchronized to coordinate global or shared memoryaccesses. The issue order of the blocks within a grid of thread blocks isundefined and there is no synchronization mechanism between blocks, sothreads from two different blocks of the same grid cannot safely communicatewith each other through global memory during the execution of the grid.

A thread that executes on the device has access only to the device's DRAM and on-chip memory, through the following memory spaces (Fig. 4.4):

• read-write per-thread registers;

• read-write per-thread local memory;

• read-write per-block shared memory;

• read-write per-grid global memory;

• read-only per-grid constant memory;

• read-only per-grid texture memory.
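As an illustrative (non-report) fragment, the first five of these spaces appear in code roughly as follows; texture memory is bound through a separate texture API and is omitted here:

    __constant__ float coeff[16];        /* read-only per-grid constant memory */
    __device__   float gBuf[1024];       /* read-write per-grid global memory  */

    __global__ void memory_demo(float *out)
    {
        __shared__ float tile[16];       /* read-write per-block shared memory */
        float r = coeff[threadIdx.x % 16];   /* r lives in per-thread registers
                                                (or local memory if spilled)   */
        tile[threadIdx.x % 16] = r;
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] =
            tile[threadIdx.x % 16] + gBuf[0];
    }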

The kernel used for the matrix product on the GPU is a naive implementation of the classical algorithm (Fig. 4.5).

Figure 4.4: CUDA memory model

The kernel execution model is mapped onto the GPU architecture. Each of the block matrices transferred into the GPU global memory is further partitioned into blocks of BLOCK_SIZE x BLOCK_SIZE, which are assigned to the blocks of the grid using the built-in blockIdx variable. Each thread block allocates space in shared memory for two sub-blocks, one from the A matrix and one from the B matrix. Then, inside a thread block, each thread computes one element of the Csub matrix, using one line from the A matrix and one column from the B matrix, based on the built-in threadIdx variable (Fig. 4.6).

4.4 Performances

The first performance tests were intended to benchmark the improvement of using CUDA to perform the matrix product on a node, rather than using the CPU. At the same time, the impact of overlapping the MPI communication with the computation was followed. The tests used two matrices of 6080x6080 elements and the Gigabit interconnection network. The measured times were: the computation time (CPU or CUDA sequential product), the communication time (MPI wait time) and the time of the entire communication and computation loop.

The results of running the MPI implementation using one core per node on 1, 2, 4 and 8 nodes are presented in log scale in Fig. 4.7.

It is clear from the graphs that the computation time is much greater than the communication time, thus leading to no improvement when the overlapping strategy is applied. Also, the linear decrease of the execution time denotes good scalability of the algorithm when the number of nodes increases.


The same measurements were performed on the MPI+CUDA version of the program (Fig. 4.8).

Analysing the graphs, one can see that the computation time and the communication time are roughly the same. The interesting fact is that at one point (when running the program on more than 4 nodes) the communication time exceeds the computation time, which means the Gigabit Ethernet connection is not fast enough to feed the GPU, which computes faster than the network can provide data. It can also be noticed that the overlap strategy does have an impact, decreasing the total execution time.

The same graphs from figure 4.8, seen in logarithmic scale, reveal further information (Fig. 4.9).

The irregular decrease of the execution time denotes poor scalability, showing that the overlapping of MPI communication and GPU computation is incomplete, because the hardware does not support full overlapping. Still, as stated before, overlapping appears to be the right strategy since it also improves scalability.

In terms of floating point performance, the MPI+CUDA program achieves 77 GFlops on one node (a 37x speed-up over the MPI-only version), while on 8 nodes it achieves 155 GFlops (a 10x speed-up) (Table 4.2, Fig. 4.10).

    No. of nodes    MPI            MPI+CUDA
    1               2.1 GFlops     77 GFlops
    8               15.5 GFlops    155 GFlops

Table 4.2: GFlops performances
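These GFlops figures presumably follow from the usual 2n^3 operation count of the naive product (an assumption, since the report does not spell the formula out); for n = 6080:

    GFlops = 2*n^3 / (T * 10^9) = 2 * 6080^3 / (T * 10^9) ≈ 449.5 / T

so 77 GFlops on a single node would correspond to roughly 5.8 s for one full product.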

4.5 Summary of the first CUDA+MPI code

After the first experiments it was clear that significant speedups can be achieved using a cluster of GPUs. Running the experiments on one node, the CUDA and MPI program achieved a 37x speedup compared to the MPI-only version. The same programs run on 8 nodes achieved a 10x speedup. The scalability of the problem on the GPU seems to be poor, since it does not exhibit a linear behaviour, a fact caused by the incomplete overlap: the hardware is not capable of transmitting data to the GPU at the same rate it consumes it. Still, the overlapping strategy seems to be the right way to go, since it decreases the overall execution time by a small factor.


Though there is a significant performance increase when the computations are done on the GPU, there is still a lot of room for improvement. Further tests were carried out to probe the impact of different modifications in different areas of the program.


    __global__ void
    matrixMulkern(float* C, float* A, float* B, int wA, int wB)
    {
        int bx = blockIdx.x;
        int by = blockIdx.y;
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        int aBegin = wA * BLOCK_SIZE * by;
        int aEnd   = aBegin + wA - 1;
        int aStep  = BLOCK_SIZE;
        int bBegin = BLOCK_SIZE * bx;
        int bStep  = BLOCK_SIZE * wB;

        float Csub = 0;

        for (int a = aBegin, b = bBegin;
             a <= aEnd;
             a += aStep, b += bStep) {

            __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
            __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

            AS(ty, tx) = A[a + wA * ty + tx];
            BS(ty, tx) = B[b + wB * ty + tx];

            __syncthreads();

            for (int k = 0; k < BLOCK_SIZE; ++k)
                Csub += AS(ty, k) * BS(k, tx);

            __syncthreads();
        }

        int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
        C[c + wB * ty + tx] = Csub;
    }

Figure 4.5: CUDA kernel


Figure 4.6: CUDA matrix product partition

Figure 4.7: MPI execution times (without and with overlapping) [log scale]


Figure 4.8: MPI+CUDA execution times (without and with overlapping)

Figure 4.9: MPI+CUDA execution times (without and with overlapping) [log scale]

Figure 4.10: GFlops performance [linear and logarithmic scale]


Chapter 5

Optimizations and tweaks

5.1 Infiniband interconnection network

As seen in the previous chapter, the Gigabit Ethernet network used for node intercommunication reaches its limits in the CUDA + MPI program, failing to feed the GPU with data at the same speed it processes it. For this reason a different interconnection network has been tested, namely Infiniband.

InfiniBand is a switched fabric communications link primarily used in high-performance computing. Its features include quality of service and failover, and it is designed to be scalable. The serial connection's signalling rate is 2.5 gigabits per second (Gbit/s) in each direction per connection. InfiniBand supports double (DDR) and quad (QDR) data speeds, for 5 Gbit/s or 10 Gbit/s respectively, at the same data-clock rate. Links use 8B/10B encoding: every 10 bits sent carry 8 bits of data, so that the useful data transmission rate is four-fifths of the raw rate. Thus single, double, and quad data rates carry 2, 4, or 8 Gbit/s of useful data respectively. Links can be aggregated in units of 4 or 12, called 4X or 12X. A quad-rate 12X link therefore carries 120 Gbit/s raw, or 96 Gbit/s of useful data.
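As a worked example for the link used in the tests below (a 4X SDR connection), the raw and useful bandwidths follow directly from the figures above:

\[
4 \times 2.5\ \mathrm{Gbit/s} = 10\ \mathrm{Gbit/s\ (raw)}, \qquad
10\ \mathrm{Gbit/s} \times \frac{8}{10} = 8\ \mathrm{Gbit/s\ (useful)} \approx 1\ \mathrm{GB/s\ per\ direction}.
\]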

The hardware available for testing was a 4X SDR connection, as detailed in section 2.4, using the OFED stack and the MVAPICH2 MPI implementation. The tests were run using 8 nodes and matrices of 16384x16384 elements. The results are presented in tables 5.1 and 5.2.

The differences between the execution times using Gigabit Ethernet and Infiniband are not obvious, showing no significant improvement. The reason for these results might be that the speedup for up to 8 nodes is insignificant for either network [15].


Overlap    Gigabit (seconds)    Infiniband (seconds)
No         554.8432             550.8798
Yes        547.7903             551.3156

Table 5.1: MPI execution time (Gigabit vs Infiniband)

Overlap    Gigabit (seconds)    Infiniband (seconds)
No         30.3515              30.7751
Yes        30.3011              30.2159

Table 5.2: MPI+CUDA execution time (Gigabit vs Infiniband)

An improvement can be seen for networks using more than 12 nodes (Fig. 5.1). Since only 8 nodes of the GPU cluster were connected to Infiniband, this idea could not be probed, thus requiring more future investigation.

5.2 GPU overlap

Asynchronous host ↔ device memory copy from pinned memory (allocated with cudaMallocHost) frees up the CPU on CUDA-capable devices. As detailed in section 2.2, the GPU chosen for the cluster has such a capability, allowing concurrent execution of a kernel and a host ↔ device copy from page-locked memory. GPU overlap is implemented using streams. Streams are similar to the usual threads of parallel programming, grouping a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another, or concurrently. A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host ↔ device memory copies. Any kernel launch, memory set, or memory copy for which a zero stream parameter has been specified begins only after all preceding operations are done, including operations that are part of streams, and no subsequent operation may begin until it is done. By default all instructions are grouped in stream 0.
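The following minimal sketch, independent of the matrix product program, shows the ingredients described above: page-locked host memory allocated with cudaMallocHost, one stream per data chunk, and asynchronous copies and kernel launches bound to those streams. The kernel, the number of streams and the chunk size are assumptions chosen only for illustration.

#include <string.h>
#include <cuda_runtime.h>

#define NS    4            // number of streams (arbitrary choice)
#define CHUNK (1 << 20)    // elements processed per stream (arbitrary choice)

// Trivial illustrative kernel working on one chunk
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// src is an ordinary (pageable) host buffer of NS*CHUNK floats
void overlappedProcessing(float *src)
{
    float *pinned, *d_buf;
    cudaStream_t streams[NS];

    cudaMallocHost((void**)&pinned, NS * CHUNK * sizeof(float));  // page-locked host memory
    cudaMalloc((void**)&d_buf, NS * CHUNK * sizeof(float));
    for (int i = 0; i < NS; i++)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NS; i++) {
        // stage the chunk into page-locked memory, then copy it asynchronously;
        // on a compute capability 1.1 device the copy in one stream may overlap
        // with the kernel running in another stream
        memcpy(pinned + i * CHUNK, src + i * CHUNK, CHUNK * sizeof(float));
        cudaMemcpyAsync(d_buf + i * CHUNK, pinned + i * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        dummyKernel<<<CHUNK / 256, 256, 0, streams[i]>>>(d_buf + i * CHUNK, CHUNK);
    }
    cudaThreadSynchronize();   // wait for all streams to finish

    for (int i = 0; i < NS; i++)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_buf);
    cudaFreeHost(pinned);
}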

One strategy of GPU overlapping in the matrix product program was the following (Fig. 5.2):

1. copy the B matrix to the device at the beginning, right after it was initialized, since it does not change during program execution


Figure 5.1: Gigabit versus Infiniband speedup (Credits: Lars E. Jonsson, William R. Magro - Comparative Performance of InfiniBand Architecture and Gigabit Ethernet Interconnects on Intel® Itanium® 2 Microarchitecture-based Clusters)

2. at each step, split the A block matrix into several smaller blocks that are copied asynchronously from host to device

3. execute the kernel concurrently on sub-blocks that have been copied, while there are still unfinished transfers

4. copy the results back to the host, also concurrently with other kernel executions

The simplified code for implementing the stream strategy can be seen in figure 5.3.

Running the program with different values for the number of streams resulted in the graph shown in figure 5.4.

As the graph clearly shows, as the number of streams increases, the execution time also increases. So in these first tests the GPU overlapping did not seem too encouraging. A possible explanation might be the overhead introduced by the restrictions of GPU overlapping, namely the use of page-locked memory. This restriction introduces an extra memory copy step: a first copy is done from the host memory to the host page-locked memory, while a second copy is done from the host page-locked memory to the device memory. The same overhead is present in transfers from device memory to host memory.
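One simple way to quantify this overhead is to time the direct copy from pageable memory against the two-step staged copy that the page-locked restriction imposes. The sketch below uses CUDA events for the timing; the buffer size and variable names are arbitrary choices for illustration and do not correspond to the report's matrices.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 64 << 20;                  // 64 MB test buffer (arbitrary size)
    float *pageable = (float*)malloc(bytes);  // ordinary host memory (e.g. an MPI buffer)
    float *pinned, *d_buf;
    cudaEvent_t start, stop;
    float ms_direct, ms_staged;

    cudaMallocHost((void**)&pinned, bytes);   // page-locked staging buffer
    cudaMalloc((void**)&d_buf, bytes);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // (a) one direct, synchronous copy from pageable host memory to the device
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_direct, start, stop);

    // (b) the staged path required for asynchronous copies:
    //     pageable -> page-locked on the CPU, then page-locked -> device
    cudaEventRecord(start, 0);
    memcpy(pinned, pageable, bytes);
    cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_staged, start, stop);

    printf("direct: %.2f ms   staged: %.2f ms\n", ms_direct, ms_staged);

    cudaEventDestroy(start);  cudaEventDestroy(stop);
    cudaFree(d_buf);  cudaFreeHost(pinned);  free(pageable);
    return 0;
}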


Figure 5.2: CUDA matrix product partition using streams

Further tests and the use of the CUDA profiling tool are necessary to investigate the overlapping execution in depth and to determine the modifications that need to be made in order to achieve an execution speedup through GPU overlapping.

5.3 Kernel optimization

The kernel used for the matrix multiplication uses the classical naive algorithm, but it can be further optimized to achieve even higher performance.

NVIDIA proposes the CUBLAS library, which is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver [12]. The library is self-contained at the API level, that is, no direct interaction with the CUDA driver is necessary. The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. To accomplish this, CUBLAS provides helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from these objects.
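A minimal sketch of this call sequence for a single product C = A x B, including the helper functions mentioned above, could look as follows. It uses the CUBLAS 1.1 C interface; the dimension names (HA, WA, WB) mirror those of the snippet shown later in this section, error checking is omitted, and the code is an illustration rather than the program actually used in the tests.

#include <cublas.h>

// A is HA x WA, B is WA x WB, C is HA x WB, all column-major and
// already allocated and filled on the host.
void cublasMatMul(const float *A, const float *B, float *C,
                  int HA, int WA, int WB)
{
    float *d_A, *d_B, *d_C;

    cublasInit();                                        // attach CUBLAS to the GPU
    cublasAlloc(HA * WA, sizeof(float), (void**)&d_A);   // create objects in GPU memory
    cublasAlloc(WA * WB, sizeof(float), (void**)&d_B);
    cublasAlloc(HA * WB, sizeof(float), (void**)&d_C);

    cublasSetMatrix(HA, WA, sizeof(float), A, HA, d_A, HA);   // fill them with data
    cublasSetMatrix(WA, WB, sizeof(float), B, WA, d_B, WA);

    // single-precision GEMM: C = 1.0 * A x B + 0.0 * C
    cublasSgemm('n', 'n', HA, WB, WA, 1.0f, d_A, HA, d_B, WA, 0.0f, d_C, HA);

    cublasGetMatrix(HA, WB, sizeof(float), d_C, HA, C, HA);   // retrieve the result

    cublasFree(d_A);  cublasFree(d_B);  cublasFree(d_C);
    cublasShutdown();
}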

One "drawback" of the library is that it uses column-major storage and 1-based indexing, for compatibility with Fortran environments.


// Create one stream per sub-block
cudaStream_t *streams = (cudaStream_t*) malloc(Ns * sizeof(cudaStream_t));
for (i = 0; i < Ns; i++)
    CUDA_SAFE_CALL(cudaStreamCreate(&(streams[i])));

// Stage each sub-block of A into page-locked host memory
for (i = 0; i < Ns; i++)
    CUDA_SAFE_CALL(cudaMemcpyAsync(A_h[i], &A[CurrentIndex][i*size_A/Ns],
                                   mem_size_A_h, cudaMemcpyHostToHost, streams[i]));

// Copy each sub-block asynchronously to the device
for (i = 0; i < Ns; i++)
    CUDA_SAFE_CALL(cudaMemcpyAsync(d_A[i], A_h[i], mem_size_A_h,
                                   cudaMemcpyHostToDevice, streams[i]));

// Launch the kernel in each stream, overlapping with pending copies
for (i = 0; i < Ns; i++)
    matrixMulkern<<<grid, threads, 0, streams[i]>>>(d_C[i], d_A[i], d_B, WA, WB);

// Copy the results back, still one stream per sub-block
for (i = 0; i < Ns; i++)
    CUDA_SAFE_CALL(cudaMemcpyAsync(C_h[i], d_C[i], mem_size_C_h,
                                   cudaMemcpyDeviceToHost, streams[i]));

cudaThreadSynchronize();

for (i = 0; i < Ns; i++)
    CUDA_SAFE_CALL(cudaStreamDestroy(streams[i]));

Figure 5.3: CUDA code using streams

Thus, in order to use CUBLAS in a C environment, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays.
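For instance, a 1-based, column-major access macro in the style suggested by the CUBLAS documentation (a sketch, not code taken from the program) is enough for most uses:

// Maps the 1-based (i, j) indices of a column-major matrix with leading
// dimension ld onto a flat, zero-based C array index.
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1))

// usage: a[IDX2F(i, j, lda)] accesses element (i, j) of a matrix stored
// column by column in the one-dimensional array a.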

The instructions that perform the matrix multiplication using the CUBLAS library are the following:

cublasSetMatrix(HA, WA, sizeof(*A), A, HA, d_A, HA);   // upload A
cublasSetMatrix(HB, WB, sizeof(*B), B, HB, d_B, HB);   // upload B
cublasSgemm('n', 'n', HA, WB, WA, 1, d_A, HA, d_B, HB, 0, d_C, HC);   // C = A x B
cublasGetMatrix(HC, WC, sizeof(*C), d_C, HC, C, HC);   // download C

Running a test on a single node with a matrix of 6080x6080 elements showed that the CUBLAS version of the matrix product achieves almost double the GFlops of the naive implementation (Table 5.3).

5.4 Results accuracy

The GeForce 8 and 9 series support only single-precision binary floating-point arithmetic, with a few deviations, thus not fully complying with the IEEE-754 standard. The new GTX 200 GPUs, launched in June 2008, fully comply with the IEEE-754R standard for double precision.


Figure 5.4: Execution times for different values of CUDA streams

                    Loop time (seconds)    GFlops
CUDA + MPI naive    7.7864                 77
CUBLAS              3.8153                 117

Table 5.3: Performance comparison between the naive implementation of the matrix product and the CUBLAS library

Using the GeForce 8800 GT graphic card, a numerical deviation of the results was observed when compared with the results from the CPU. For example, running the matrix product for a matrix of 6080x6080 elements on one CPU and one GPU, the following results were obtained for the middle element of the final matrix.

       Result
CPU    2077557732130422784
GPU    2077556220301934592

Table 5.4: Comparison of the numerical deviation of the results using one CPU and one GPU

The problem might arise from the way the CPU and GPU architectures implement the IEEE-754 standard, and also from the conversions of floating-point values to integers that each processing unit performs.
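How quickly single-precision accumulation loses its low-order digits can be illustrated on the host alone. The sketch below accumulates the same 6080-term dot product (6080 matching the matrix width used above, the values themselves being arbitrary) once in float and once in double; the two results typically differ in the last few significant digits, a relative deviation comparable to the one in Table 5.4.

#include <stdio.h>

int main(void)
{
    float  sum_f = 0.0f;   // single-precision accumulation (as on the GPU)
    double sum_d = 0.0;    // double-precision accumulation (as a reference)

    for (int k = 0; k < 6080; k++) {
        float a = (float)(k % 97) * 1.000001f;
        float b = (float)(k % 89) * 0.999999f;
        sum_f += a * b;
        sum_d += (double)a * (double)b;
    }
    printf("float  accumulation: %.7g\n",  sum_f);
    printf("double accumulation: %.15g\n", sum_d);
    return 0;
}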


5.5 Summary of optimizations and tweaks

Thorough experiments were conducted trying to tweak the execution of the program on the cluster.

The Infiniband network tests did not show a real improvement. Since only 8 of the 16 nodes had an Infiniband connection, more complete tests of the MPI framework could not be performed. Tests performed outside the MPI framework to measure bandwidth and latency showed that for a point-to-point connection Infiniband performs better than Gigabit Ethernet, but further experiments need to be done.

In theory, the GPU overlap strategy also seems an attractive idea, but in practical tests it makes the program perform worse as the number of streams increases. The solution would be to further investigate the execution of the streams in depth using the CUDA Profiler tool and identify the bottlenecks.

Using the CUBLAS library whenever possible also makes sense, since it is highly optimized and achieves high performance.

Future experiments using the new GTX 200 GPUs would be interesting in order to compare the results and see whether the same numerical deviations are observed.


Chapter 6

Conclusions

The idea of a CPU-GPU cluster proved to be a challenging experience. It offers many paths to study and tweak ports of parallel algorithms on the MPI+CUDA framework. The performance tests looked more than encouraging, and the speedups obtained showed that this is a worthy domain for future, deeper investigation.

The first steps were difficult and uncertain, as no other research laboratory in France had tried this approach at the time. At the beginning, the entire hardware architecture had to be chosen with several features in mind that were later investigated. Even choosing the right graphic card was not an easy task, since several tests were necessary to determine exactly the most suitable card. Once the hardware was installed and configured, the study focused on mixing the two frameworks (MPI and CUDA) in one single working program. Through continuous tests and failures, all the problems were solved and a final generic Makefile was created, which allowed the MPI and CUDA libraries to be included and the two frameworks to be compiled into a single executable.

Once the MPI and CUDA environment was set up and the first functional program was created, the research continued along different paths to pursue the best performance. Some of the strategies, like overlapping MPI communication with local computations or using the highly optimized CUBLAS library, seemed to improve the overall performance. Other strategies, like the use of the Infiniband interconnection network and GPU overlapping of host ↔ device memory transfers with kernel executions, which theoretically should improve performance, did not show notable results in the practical tests and need further investigation.


The perspectives opened by the first experiments are numerous, though in-depth investigations need to be carried out in the future. In the short term, the following experimental strategies are to be tried:

• usage of OpenMP to take advantage of the multi-core CPUs

• usage of different MPI implementations which support different interconnection networks (OpenMPI)

• expand the Infiniband network to more nodes in order to see a significant increase in performance compared to Gigabit Ethernet

• usage of CUDA tools (Occupancy Calculator, Profiler) to analyse the execution of kernels and tweak the code

In the mid term, the implementation of the Monte-Carlo algorithm is the main objective of the work package of the ANR-CIGC-GCPMF project, but other test strategies are worth carrying out, such as:

• using newer GPUs (GTX 200 series), which introduce double precision and significantly increase the computational power

• using FPGAs

• expanding the number of nodes

Another mid- and long-term action is the transfer of all the knowledge gained during these experiments, since they were not targeted at one specific domain. The generality of the conducted test strategies qualifies the research more as a proof of concept in the unexplored domain of CPU-GPU clusters. Thus the results and all the information gained can be used in any field requiring high-performance computing.


Chapter 7

Appendix A - Graphic cards GPU overlap support test results


Graphic cards tests

ASUS_EN880GT_G_HTDP_512M_A

  GPU Info using nVIDIA Driver
    Graphic card chipset: GeForce 8800 GT/PCI/SSE2
    Core clock [MHz]: 600    Memory clock [MHz]: 900    Memory amount [MB]: 512
    Nvidia driver version: 169.12    GPU Core Temp: -
  GPU Info using CUDA driver API functions
    There is 1 device supporting CUDA
    Device 0: "GeForce 8800 GT"
      Major revision number: 1    Minor revision number: 1
      GPU Overlap support: AVAILABLE (1)
      Total amount of global memory: 536543232 bytes
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      Total number of registers available per block: 8192
      Warp size: 32    Maximum number of threads per block: 512
      Maximum sizes of each dimension of a block: 512 x 512 x 64
      Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
      Maximum memory pitch: 262144 bytes    Texture alignment: 256 bytes
      Clock rate: 1512000 kilohertz
  GPU Simple Streams Test
    memcopy: 32.87    kernel: 40.63
    non-streamed: 73.59 (73.50 expected)
    8 streams: 43.72 (44.74 expected with compute capability 1.1 or later)

PNY_QuadroFX_570

  GPU Info using nVIDIA Driver
    Graphic card chipset: Quadro FX 570/PCI/SSE2
    Core clock [MHz]: 460    Memory clock [MHz]: 400    Memory amount [MB]: 256
    Nvidia driver version: 169.12    GPU Core Temp: 45
  GPU Info using CUDA driver API functions
    There is 1 device supporting CUDA
    Device 0: "Quadro FX 570"
      Major revision number: 1    Minor revision number: 1
      GPU Overlap support: AVAILABLE (1)
      Total amount of global memory: 268107776 bytes
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      Total number of registers available per block: 8192
      Warp size: 32    Maximum number of threads per block: 512
      Maximum sizes of each dimension of a block: 512 x 512 x 64
      Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
      Maximum memory pitch: 262144 bytes    Texture alignment: 256 bytes
      Clock rate: 918000 kilohertz
  GPU Simple Streams Test
    memcopy: 33.39    kernel: 99.20
    non-streamed: 132.73 (132.58 expected)
    8 streams: 109.06 (103.37 expected with compute capability 1.1 or later)

HP_QuadroFX_1700

  GPU Info using nVIDIA Driver
    Graphic card chipset: Quadro FX 1700/PCI/SSE2
    Core clock [MHz]: 460    Memory clock [MHz]: 400    Memory amount [MB]: 512
    Nvidia driver version: 169.12    GPU Core Temp: 48
  GPU Info using CUDA driver API functions
    There is 1 device supporting CUDA
    Device 0: "Quadro FX 1700"
      Major revision number: 1    Minor revision number: 1
      GPU Overlap support: AVAILABLE (1)
      Total amount of global memory: 536543232 bytes
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      Total number of registers available per block: 8192
      Warp size: 32    Maximum number of threads per block: 512
      Maximum sizes of each dimension of a block: 512 x 512 x 64
      Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
      Maximum memory pitch: 262144 bytes    Texture alignment: 256 bytes
      Clock rate: 918000 kilohertz
  GPU Simple Streams Test
    memcopy: 33.49    kernel: 100.21
    non-streamed: 133.49 (133.71 expected)
    8 streams: 110.05 (104.40 expected with compute capability 1.1 or later)


PNY_QuadroFX_4600

  GPU Info using nVIDIA Driver
    Graphic card chipset: Quadro FX 4600/PCI/SSE2
    Core clock [MHz]: 500    Memory clock [MHz]: 700    Memory amount [MB]: 768
    Nvidia driver version: 169.12    GPU Core Temp: 53
  GPU Info using CUDA driver API functions
    There is 1 device supporting CUDA
    Device 0: "Quadro FX 4600"
      Major revision number: 1    Minor revision number: 0
      GPU Overlap support: AVAILABLE (1)
      Total amount of global memory: 804978688 bytes
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      Total number of registers available per block: 8192
      Warp size: 32    Maximum number of threads per block: 512
      Maximum sizes of each dimension of a block: 512 x 512 x 64
      Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
      Maximum memory pitch: 262144 bytes    Texture alignment: 256 bytes
      Clock rate: 1188000 kilohertz
  GPU Simple Streams Test
    memcopy: 30.56    kernel: 56.28
    non-streamed: 86.86 (86.84 expected)
    8 streams: 87.08 (60.10 expected with compute capability 1.1 or later)

PNY_QuadroFX_3700

  GPU Info using nVIDIA Driver
    Graphic card chipset: Quadro FX 3700/PCI/SSE2
    Core clock [MHz]: 500    Memory clock [MHz]: 800    Memory amount [MB]: 512
    Nvidia driver version: 169.12    GPU Core Temp: 41
  GPU Info using CUDA driver API functions
    There is 1 device supporting CUDA
    Device 0: "Quadro FX 3700"
      Major revision number: 1    Minor revision number: 1
      GPU Overlap support: AVAILABLE (1)
      Total amount of global memory: 536543232 bytes
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      Total number of registers available per block: 8192
      Warp size: 32    Maximum number of threads per block: 512
      Maximum sizes of each dimension of a block: 512 x 512 x 64
      Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
      Maximum memory pitch: 262144 bytes    Texture alignment: 256 bytes
      Clock rate: 1242000 kilohertz
  GPU Simple Streams Test
    memcopy: 33.41    kernel: 45.79
    non-streamed: 79.29 (79.20 expected)
    8 streams: 49.38 (49.97 expected with compute capability 1.1 or later)

PNY_QuadroFX_1700

  GPU Info using nVIDIA Driver
    Graphic card chipset: Quadro FX 1700/PCI/SSE2
    Core clock [MHz]: 460    Memory clock [MHz]: 400    Memory amount [MB]: 512
    Nvidia driver version: 169.12    GPU Core Temp: 47
  GPU Info using CUDA driver API functions
    There is 1 device supporting CUDA
    Device 0: "Quadro FX 1700"
      Major revision number: 1    Minor revision number: 1
      GPU Overlap support: AVAILABLE (1)
      Total amount of global memory: 536543232 bytes
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      Total number of registers available per block: 8192
      Warp size: 32    Maximum number of threads per block: 512
      Maximum sizes of each dimension of a block: 512 x 512 x 64
      Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
      Maximum memory pitch: 262144 bytes    Texture alignment: 256 bytes
      Clock rate: 918000 kilohertz
  GPU Simple Streams Test
    memcopy: 33.52    kernel: 100.21
    non-streamed: 133.56 (133.73 expected)
    8 streams: 110.26 (104.40 expected with compute capability 1.1 or later)


Bibliography

[1] ANR-CIGC-GCPMF project website, http://www.pricingpartners.com/anr-cigc-gcpmf/, June 30, 2008.

[2] GPGPU, http://www.gpgpu.org, June 30, 2008.

[3] MPICH2 website, http://www.mcs.anl.gov/research/projects/mpich2/, May 21, 2008.

[4] MVAPICH2 website, http://mvapich.cse.ohio-state.edu/, May 21, 2008.

[5] New NVIDIA GeForce 8800 GT surprises (review) - power consumption, http://www.tomshardware.com/reviews/geforce-8800-gt,1714-14.html, October 29, 2007.

[6] NVIDIA support forum: asynchronous host-device memcpy + events, http://forums.nvidia.com/index.php?showtopic=55372, January 3, 2008.

[7] NVIDIA support forum: Asynchronous data transfer, http://forums.nvidia.com/index.php?showtopic=53759, December 31, 2007.

[8] NVIDIA support forum: nvcc problem with attribute, http://forums.nvidia.com/index.php?showtopic=45966, September 19, 2007.

[9] NVIDIA support forum: SDK compile errors, http://forums.nvidia.com/index.php?showtopic=63817d, April 3, 2008.

[10] NVIDIA Corporation. CUDA Programming Guide Version 1.1. NVIDIA Corporation, 11/29/2007.

[11] NVIDIA Corporation. The CUDA Compiler Driver NVCC. NVIDIA Corporation, 11/5/2007.


[12] NVIDIA Corporation. CUBLAS Library V1.1. NVIDIA Corporation, September 2007.

[13] Hubert Nguyen, editor. GPU Gems 3. Addison-Wesley, 2007.

[14] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. University of Tennessee, 15/11/2005.

[15] Lars E. Jonsson and William R. Magro. Comparative performance of InfiniBand architecture and Gigabit Ethernet interconnects on Intel® Itanium® 2 microarchitecture-based clusters. In 4th European LS-DYNA Users Conference.

[16] Mathematics and Computer Science Division, Argonne National Laboratory. MPICH2 Installer's Guide Version 0.4. Argonne National Laboratory, 3/2/2005.