Tutorial: High Performance SBSE Using Commodity Graphics Cards

Simon Poulding, University of York, UK
SSBSE, September 2012

© Simon Poulding & The University of York, 2012

SBSE and High Performance Computing

entire search algorithm parallelisable

operations within algorithm parallelisable

[Diagram: search loops made up of EVALUATION, VARIATION and SELECTION steps]



Distributed Computing

Multicore Computing

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

GPU Cards

[Figure: Technical Innovation. GFLOP/s (single precision), on a scale of 0 to 4000, against release date, 2008 to 2013, for the GeForce GTX 280, GeForce GTX 480, GeForce GTX 580 and GeForce GTX 680]

Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012

General Purpose Computing on GPUs (GPGPU)

NVIDIA GPUs (most since 2009)

NVIDIA GPUs
AMD GPUs

Intel HD GPUs

other vendors ... Intel Core CPUs


CUDA Architecture



Physical Architecture

[Diagram: the CPU with its system memory (DRAM) alongside the GPU with its global memory (DRAM) and streaming multiprocessors; each streaming multiprocessor has its own ‘cores’, registers and shared memory]

Adapted from “CUDA C Best Practices Guide”, NVIDIA, May 2012

Logical Architecture

[Diagram: threads, each with local memory, are grouped into blocks; each block has shared memory; all blocks access global memory]

Mapping Logical to Physical

[Diagram: each block is mapped to a streaming multiprocessor and uses that multiprocessor's ‘cores’, registers and shared memory]

CUDA Performance Features

single-instruction multiple-thread

hardware multithreading

coalesced memory access

Single-Instruction Multiple-Thread

1 warp = 32 threads
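One consequence of single-instruction multiple-thread execution is that threads of the same warp which take different branches are executed one branch after the other. The sketch below contrasts a branch that splits every warp with a branch that depends only on the warp index. The kernel names and the helper `countMismatches` are hypothetical, and the host-side check assumes the warp size of 32 stated on the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Neighbouring threads take different branches, so each warp may have to
// execute both branches one after the other.
__global__ void divergentKernel(int * out) {
    const unsigned int gThread = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[gThread] = 1;
    } else {
        out[gThread] = -1;
    }
}

// The branch depends on the warp index only, so all threads of a warp take
// the same path.
__global__ void uniformKernel(int * out) {
    const unsigned int gThread = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int warp = threadIdx.x / warpSize;  // warpSize is built in
    if (warp % 2 == 0) {
        out[gThread] = 1;
    } else {
        out[gThread] = -1;
    }
}

// Hypothetical helper: runs both kernels and counts results that differ from
// the values computed on the host (assumes 32 threads per warp).
int countMismatches() {
    const unsigned int numThreads = 256;
    const unsigned int numBlocks = 64;
    const unsigned int n = numThreads * numBlocks;
    int * hOut = new int[n];
    int * dOut = NULL;
    int mismatches = 0;
    cudaMalloc((void **)&dOut, n * sizeof(int));

    for (unsigned int i = 0; i < n; i++) hOut[i] = 0;
    divergentKernel<<<numBlocks, numThreads>>>(dOut);
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (unsigned int i = 0; i < n; i++) {
        const int expected = ((i % numThreads) % 2 == 0) ? 1 : -1;
        if (hOut[i] != expected) mismatches++;
    }

    for (unsigned int i = 0; i < n; i++) hOut[i] = 0;
    uniformKernel<<<numBlocks, numThreads>>>(dOut);
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (unsigned int i = 0; i < n; i++) {
        const int expected = (((i % numThreads) / 32) % 2 == 0) ? 1 : -1;
        if (hOut[i] != expected) mismatches++;
    }

    cudaFree(dOut);
    delete[] hOut;
    return mismatches;
}

int main() {
    printf("mismatches: %d\n", countMismatches());
    return 0;
}
```

Both kernels produce correct results; only the execution pattern differs. For branches as short as these the compiler may use predication rather than separate passes, so any timing difference should be measured with the profiler rather than assumed.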

Hardware Multithreading & Occupancy

[Diagram: several warps resident on one streaming multiprocessor, sharing its ‘cores’, registers and shared memory]
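Occupancy can be estimated as the number of warps resident on a multiprocessor divided by the maximum the hardware supports. The sketch below is one way to compute this using the runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which was added to the CUDA runtime after these slides were written. The kernel `scaleKernel` and the helper `estimateOccupancy` are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so that there is something to measure.
__global__ void scaleKernel(int * data, int factor) {
    const unsigned int gThread = blockIdx.x * blockDim.x + threadIdx.x;
    data[gThread] = data[gThread] * factor;
}

// Returns the estimated occupancy of scaleKernel on device 0 for the given
// block size as a fraction between 0 and 1, or -1 if a CUDA call fails.
float estimateOccupancy(int blockSize) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1.0f;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scaleKernel, blockSize, 0) != cudaSuccess) {
        return -1.0f;
    }

    const int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    const int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)activeWarps / (float)maxWarps;
}

int main() {
    const int sizes[] = {64, 128, 256, 512};
    for (int i = 0; i < 4; i++) {
        printf("block size %d: occupancy %.2f\n",
               sizes[i], estimateOccupancy(sizes[i]));
    }
    return 0;
}
```

The figures depend on the device and on the registers and shared memory the kernel uses, so the output differs from card to card.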

Coalesced Memory Access

[Diagram: the threads of a warp accessing adjacent locations in global memory]
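A minimal sketch of the difference between coalesced and non-coalesced access follows. In the first kernel consecutive threads touch consecutive elements of global memory; in the second, consecutive threads touch elements 33 apart. The kernel names, the stride of 33 and the helper `copyMismatches` are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Thread k of the grid accesses element k: neighbouring threads access
// neighbouring addresses, which the hardware can combine.
__global__ void coalescedCopy(const int * in, int * out) {
    const unsigned int gThread = blockIdx.x * blockDim.x + threadIdx.x;
    out[gThread] = in[gThread];
}

// Neighbouring threads access addresses that are 'stride' elements apart.
// With n a power of two and an odd stride, every element is still copied once.
__global__ void stridedCopy(const int * in, int * out,
                            unsigned int n, unsigned int stride) {
    const unsigned int gThread = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int i = (gThread * stride) % n;
    out[i] = in[i];
}

// Hypothetical helper: runs one of the kernels and counts wrongly copied elements.
int copyMismatches(bool strided) {
    const unsigned int numThreads = 256;
    const unsigned int numBlocks = 64;
    const unsigned int n = numThreads * numBlocks;
    int * hIn = (int *)malloc(n * sizeof(int));
    int * hOut = (int *)malloc(n * sizeof(int));
    for (unsigned int i = 0; i < n; i++) { hIn[i] = (int)i; hOut[i] = -1; }

    int *dIn = NULL, *dOut = NULL;
    cudaMalloc((void **)&dIn, n * sizeof(int));
    cudaMalloc((void **)&dOut, n * sizeof(int));
    cudaMemcpy(dIn, hIn, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0xFF, n * sizeof(int));  // every element starts as -1

    if (strided) {
        stridedCopy<<<numBlocks, numThreads>>>(dIn, dOut, n, 33);
    } else {
        coalescedCopy<<<numBlocks, numThreads>>>(dIn, dOut);
    }
    cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (unsigned int i = 0; i < n; i++) {
        if (hOut[i] != hIn[i]) mismatches++;
    }
    cudaFree(dIn); cudaFree(dOut);
    free(hIn); free(hOut);
    return mismatches;
}

int main() {
    printf("coalesced mismatches: %d\n", copyMismatches(false));
    printf("strided mismatches:   %d\n", copyMismatches(true));
    return 0;
}
```

Both kernels copy the same data; the expected difference is in memory throughput, which can be checked with the profiler.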


Developing CUDA Applications



Typical CUDA Application Pattern

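[Diagram: timeline between the host (system memory) and the device (global memory)]
memory copy (system memory to global memory)
kernel launch
threads running kernel code
kernel completion
memory copy (global memory to system memory)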

Example Problem

a       b     c = a * b
382     17    ?
1124    17    ?
30      17    ?
2781    98    ?
824     98    ?
4510    98    ?
4088    31    ?
...     ...   ...

256 x 64

Kernel Code (device-side)

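__global__ void exampleKernel(int * a, int * b, int * c) {

    // one copy of b per block, held in shared memory
    __shared__ int sb;

    const unsigned int thread = threadIdx.x;
    const unsigned int block = blockIdx.x;
    const unsigned int gThread = block * blockDim.x + thread;

    // the first thread of each block loads that block's value of b
    if (thread == 0) {
        sb = b[block];
    }

    // wait until sb has been written before any thread reads it
    __syncthreads();

    c[gThread] = a[gThread] * sb;

}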

Launching a Kernel (host-side)

const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;

dim3 gridD(numBlocks, 1, 1);
dim3 blockD(numThreads, 1, 1);

exampleKernel<<<gridD,blockD>>>(a,b,c);

Allocating and Copying Memory (host-side)

const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;

// all three must be pointers: "int * a,b,c;" would declare b and c as plain ints
int *a, *b, *c;

cudaMalloc((void **)&a, numThreads * numBlocks * sizeof(int));
cudaMalloc((void **)&b, numBlocks * sizeof(int));
cudaMalloc((void **)&c, numThreads * numBlocks * sizeof(int));

cudaMemcpy(a, inputA, numThreads * numBlocks * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(b, inputB, numBlocks * sizeof(int), cudaMemcpyHostToDevice);

Putting It All Together

__global__ void exampleKernel(int * a, int * b, int * c) {
    ...
}

int main(...) {
    ...
    cudaMalloc(...);
    cudaMemcpy(...);
    ...
    exampleKernel<<<gridD,blockD>>>(a,b,c);
    ...
    cudaMemcpy(...);
    ...
}
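The outline above can be filled in as a complete program along the following lines. This is a sketch rather than the tutorial's own listing: the kernel and the names `inputA` and `inputB` come from the slides, while `outputC`, the helpers `check` and `runExample`, the test data and the error handling are assumptions added for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel from the slides: each block multiplies its values of a by one value of b.
__global__ void exampleKernel(int * a, int * b, int * c) {
    __shared__ int sb;
    const unsigned int thread = threadIdx.x;
    const unsigned int block = blockIdx.x;
    const unsigned int gThread = block * blockDim.x + thread;
    if (thread == 0) {
        sb = b[block];
    }
    __syncthreads();
    c[gThread] = a[gThread] * sb;
}

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool check(cudaError_t err, const char * what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Runs the example; returns the number of wrong results, or -1 on a CUDA error.
int runExample() {
    const unsigned int numThreads = 256;
    const unsigned int numBlocks = 64;
    const unsigned int n = numThreads * numBlocks;

    // Host-side arrays; the input values are made up for illustration.
    int * inputA = (int *)malloc(n * sizeof(int));
    int * inputB = (int *)malloc(numBlocks * sizeof(int));
    int * outputC = (int *)malloc(n * sizeof(int));
    for (unsigned int i = 0; i < n; i++) inputA[i] = (int)(i % 5000);
    for (unsigned int j = 0; j < numBlocks; j++) inputB[j] = 17 + (int)j;

    int *a = NULL, *b = NULL, *c = NULL;
    bool ok = check(cudaMalloc((void **)&a, n * sizeof(int)), "cudaMalloc a")
           && check(cudaMalloc((void **)&b, numBlocks * sizeof(int)), "cudaMalloc b")
           && check(cudaMalloc((void **)&c, n * sizeof(int)), "cudaMalloc c")
           && check(cudaMemcpy(a, inputA, n * sizeof(int), cudaMemcpyHostToDevice), "copy a")
           && check(cudaMemcpy(b, inputB, numBlocks * sizeof(int), cudaMemcpyHostToDevice), "copy b");

    if (ok) {
        dim3 gridD(numBlocks, 1, 1);
        dim3 blockD(numThreads, 1, 1);
        exampleKernel<<<gridD,blockD>>>(a,b,c);
        // The launch returns immediately; the copy below waits for the kernel.
        ok = check(cudaGetLastError(), "kernel launch")
          && check(cudaMemcpy(outputC, c, n * sizeof(int), cudaMemcpyDeviceToHost), "copy c");
    }

    int wrong = ok ? 0 : -1;
    for (unsigned int i = 0; ok && i < n; i++) {
        if (outputC[i] != inputA[i] * inputB[i / numThreads]) wrong++;
    }

    cudaFree(a); cudaFree(b); cudaFree(c);
    free(inputA); free(inputB); free(outputC);
    return wrong;
}

int main() {
    const int wrong = runExample();
    printf("wrong results: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Assuming the file is saved as `example.cu` (a hypothetical name), it can be built with `nvcc example.cu -o example`.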

Build Process

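[Diagram: nvcc splits a CUDA source file into host source and device source. The device source is compiled to device intermediate code (PTX) and device executable code (cubin), which is embedded in the host source. The host source with embedded device code, together with any non-CUDA source files, is passed to the standard compiler and linker to produce the host executable.]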

Adapted from “CUDA Compiler Driver NVCC”, NVIDIA, May 2012

Compute Capability

compute capability                      1.0          1.1          1.2          1.3          2.x          3.0          3.5
atomic functions (global memory)        No           Yes          Yes          Yes          Yes          Yes          Yes
atomic functions (shared memory)        No           No           Yes          Yes          Yes          Yes          Yes
warp vote functions                     No           No           Yes          Yes          Yes          Yes          Yes
double precision floating point         No           No           No           Yes          Yes          Yes          Yes
additional fence and sync functions     No           No           No           No           Yes          Yes          Yes
max number of threads per block         512          512          512          512          1024         1024         1024
number of registers per multiprocessor  8K           8K           16K          16K          32K          64K          64K
max shared memory per multiprocessor    16KB         16KB         16KB         16KB         48KB         48KB         48KB
local memory per thread                 16KB         16KB         16KB         16KB         512KB        512KB        512KB
max number of instructions per kernel   2 million    2 million    2 million    2 million    512 million  512 million  512 million

Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012
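The compute capability of an installed card, and some of the limits listed in the table, can be read at run time. The sketch below uses `cudaGetDeviceProperties`; the helper name `queryComputeCapability` and its return convention are assumptions. It prints the per-block register and shared memory limits, which are fields available in all versions of the runtime, rather than the per-multiprocessor figures shown in the table.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints selected properties of a device and returns
// its compute capability as major * 10 + minor, or -1 if the query fails.
int queryComputeCapability(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    printf("device:                  %s\n", prop.name);
    printf("compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("warp size:               %d\n", prop.warpSize);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return prop.major * 10 + prop.minor;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; d++) {
        queryComputeCapability(d);
    }
    return 0;
}
```

A program can use the returned value to choose between code paths, for example avoiding double precision on devices below compute capability 1.3.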

Additional Tools and Libraries

Development Tools
debugger
memory checker
profiler

CUDA Libraries
linear algebra (CUBLAS)
sparse matrices (CUSPARSE)
random number generation (CURAND)
fast Fourier transform (CUFFT)
Thrust
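As an illustration of what such a library offers, the sketch below uses Thrust to multiply two vectors element by element on the device and then sum the products, without writing a kernel by hand. The function name `sumOfProducts` and the test data are assumptions made for this example.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// Hypothetical example: computes the sum over i of a[i] * b[i] on the device,
// with a[i] = i % 100 and b[i] = 17.
long long sumOfProducts(int n) {
    thrust::host_vector<int> hostA(n), hostB(n);
    for (int i = 0; i < n; i++) {
        hostA[i] = i % 100;
        hostB[i] = 17;
    }

    // Assigning a host vector to a device vector copies the data to the GPU.
    thrust::device_vector<int> a = hostA;
    thrust::device_vector<int> b = hostB;
    thrust::device_vector<int> c(n);

    // c[i] = a[i] * b[i], executed on the device
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::multiplies<int>());

    // parallel sum of c, accumulated as long long
    return thrust::reduce(c.begin(), c.end(), 0LL, thrust::plus<long long>());
}

int main() {
    printf("sum of products: %lld\n", sumOfProducts(1000));
    return 0;
}
```

Thrust chooses the launch configuration itself, which makes it convenient for standard operations such as transforms, reductions and sorts.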


Case Study: Parallelising Search Algorithm



Bayesian Optimisation Algorithm

Ising Spin Glass

[Diagram: Ising spin glass lattice labelled with values of +1 and -1]

Implementation

EVALUATION: calculate Ising spin glass energy (CUDA kernel)
VARIATION: build Bayesian network model (CUDA kernel)
SELECTION: restricted tournament replacement (CUDA kernel)

Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011
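To give a flavour of the evaluation step, the sketch below computes the energy of a three-dimensional Ising spin glass for each candidate solution, using one thread per candidate. It is not the implementation from the cited competition entry: the energy convention (minus the sum of coupling times the two neighbouring spins), the periodic boundaries, the data layout and all names are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per candidate solution. spins holds popSize lattices of
// side^3 values (+1 or -1); jx, jy and jz hold the coupling from each site
// to its neighbour in the x, y and z direction (periodic boundaries).
__global__ void isingEnergyKernel(const signed char * spins,
                                  const signed char * jx,
                                  const signed char * jy,
                                  const signed char * jz,
                                  int * energy, int side, int popSize) {
    const int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind >= popSize) return;
    const signed char * s = spins + ind * side * side * side;
    int e = 0;
    for (int z = 0; z < side; z++) {
        for (int y = 0; y < side; y++) {
            for (int x = 0; x < side; x++) {
                const int i = (z * side + y) * side + x;
                const int nx = (z * side + y) * side + (x + 1) % side;
                const int ny = (z * side + (y + 1) % side) * side + x;
                const int nz = (((z + 1) % side) * side + y) * side + x;
                e -= s[i] * (jx[i] * s[nx] + jy[i] * s[ny] + jz[i] * s[nz]);
            }
        }
    }
    energy[ind] = e;
}

// Hypothetical helper: energy of a lattice with all couplings +1 and all
// spins +1, optionally with the spin at site 0 flipped to -1.
int exampleEnergy(int side, bool flipOne) {
    const int n = side * side * side;
    signed char * hSpins = (signed char *)malloc(n);
    signed char * hJ = (signed char *)malloc(n);
    for (int i = 0; i < n; i++) { hSpins[i] = 1; hJ[i] = 1; }
    if (flipOne) hSpins[0] = -1;

    signed char *dSpins = NULL, *dJ = NULL;
    int * dEnergy = NULL;
    int hEnergy = 0;
    cudaMalloc((void **)&dSpins, n);
    cudaMalloc((void **)&dJ, n);
    cudaMalloc((void **)&dEnergy, sizeof(int));
    cudaMemcpy(dSpins, hSpins, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dJ, hJ, n, cudaMemcpyHostToDevice);

    // a population of one candidate; the same couplings are used in all directions
    isingEnergyKernel<<<1, 64>>>(dSpins, dJ, dJ, dJ, dEnergy, side, 1);
    cudaMemcpy(&hEnergy, dEnergy, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dSpins); cudaFree(dJ); cudaFree(dEnergy);
    free(hSpins); free(hJ);
    return hEnergy;
}

int main() {
    printf("uniform 4x4x4 lattice: %d\n", exampleEnergy(4, false));
    printf("one spin flipped:      %d\n", exampleEnergy(4, true));
    return 0;
}
```

With one thread per candidate and one lattice after another in memory, neighbouring threads read addresses far apart, so a tuned version would probably change the data layout or assign several threads to each lattice.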

Results

[Figure: GPU speed-up, on a scale of 0 to 100, against problem size: 8x8x8, 12x12x12, 16x16x16, 24x24x24]

Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011


Case Study: Parallelising Fitness Evaluation



Multi-Objective Test Suite Minimisation

                   test cases
                   t1    t2    t3    ...   tl
requirements  r1   1     0     1     ...   0
              r2   1     0     0     ...   1
              r3   0     1     1     ...   1
                   ...   ...   ...   ...   ...
              rm   1     1     0     ...   0
cost               9     7     4           6

Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011

Implementation

MO algorithm: NSGA-II

EVALUATION: calculation of coverage and cost by matrix multiplication

Java jMetal MOEA library
OpenCL using JavaCL wrapper
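One way to read "coverage and cost by matrix multiplication" is sketched below. The requirements matrix, with the cost vector appended as an extra row, is multiplied by a matrix whose columns are candidate test suites; each entry of the product counts the selected tests covering a requirement, and the last row gives the total cost. The cited work uses OpenCL from Java, so this CUDA version, the truncation of the slide's matrix to three requirements and four tests, and all names are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_REQ 3
#define NUM_TESTS 4

// C = A x B, one thread per element of C.
// A is rows x inner, B is inner x cols, all row-major.
__global__ void matMulKernel(const int * A, const int * B, int * C,
                             int rows, int inner, int cols) {
    const int r = blockIdx.y * blockDim.y + threadIdx.y;
    const int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows || p >= cols) return;
    int sum = 0;
    for (int t = 0; t < inner; t++) {
        sum += A[r * inner + t] * B[t * cols + p];
    }
    C[r * cols + p] = sum;
}

// Hypothetical helper: hB is a NUM_TESTS x n matrix in which column p marks
// the tests selected by candidate p. Fills covered[p] with the number of
// requirements covered and cost[p] with the total cost of the selected tests.
void evaluate(const int * hB, int n, int * covered, int * cost) {
    // visible values from the slide; the last row is the cost of each test
    const int hA[(NUM_REQ + 1) * NUM_TESTS] = {
        1, 0, 1, 0,
        1, 0, 0, 1,
        0, 1, 1, 1,
        9, 7, 4, 6 };
    const int rows = NUM_REQ + 1;
    int * hC = new int[rows * n];
    for (int i = 0; i < rows * n; i++) hC[i] = 0;

    int *dA = NULL, *dB = NULL, *dC = NULL;
    cudaMalloc((void **)&dA, sizeof(hA));
    cudaMalloc((void **)&dB, NUM_TESTS * n * sizeof(int));
    cudaMalloc((void **)&dC, rows * n * sizeof(int));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, NUM_TESTS * n * sizeof(int), cudaMemcpyHostToDevice);

    dim3 blockD(16, 16, 1);
    dim3 gridD((n + 15) / 16, (rows + 15) / 16, 1);
    matMulKernel<<<gridD, blockD>>>(dA, dB, dC, rows, NUM_TESTS, n);
    cudaMemcpy(hC, dC, rows * n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int p = 0; p < n; p++) {
        covered[p] = 0;
        for (int r = 0; r < NUM_REQ; r++) {
            if (hC[r * n + p] > 0) covered[p]++;
        }
        cost[p] = hC[NUM_REQ * n + p];
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hC;
}

int main() {
    // candidate 0 selects t1 only; candidate 1 selects the last two tests
    const int B[NUM_TESTS * 2] = { 1, 0,   0, 0,   0, 1,   0, 1 };
    int covered[2], cost[2];
    evaluate(B, 2, covered, cost);
    for (int p = 0; p < 2; p++) {
        printf("candidate %d: covers %d requirements, cost %d\n",
               p, covered[p], cost[p]);
    }
    return 0;
}
```

Because the whole population is evaluated by a single matrix product, the work grows with the number of requirements, tests and candidates, which is the kind of regular computation that suits a GPU.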


Results

[Figure: GPU speed-up, on a scale of 0 to 30, against problem size: 5.92E+4, 6.62E+5, 1.12E+7]



Case Study: Parallelising Software Execution



Implementation

EVALUATION: execute instrumented software with test inputs (CUDA kernel)

research funded by the MOD Centre for Defence Enterprise (CDE)

Language Compatibility

large subset of C++

including:
OO features
templates
math library
IEEE 754 floating point compliance

only in compute capability 2.0+:
dynamic memory allocation
function pointers
function recursion
multiple source code files

missing:
Standard Template Library
runtime type information
network and file IO
rand()


Results

[Figure: GPU speed-up, on a scale of 0 to 80, against problem size: ~20 LOC, ~100 LOC, ~1,500 LOC]


Resources

NVIDIA CUDA Zone

CUDA SDK samples - ‘template’ application

CUDA C Programming Guide
CUDA C Best Practices Guide
