Tutorial: High Performance SBSE Using Commodity Graphics Cards Simon Poulding, University of York, UK SSBSE, September 2012 © Simon Poulding & The University of York, 2012


Page 1: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Tutorial: High Performance SBSE Using Commodity Graphics Cards

Simon Poulding, University of York, UK
SSBSE, September 2012

© Simon Poulding & The University of York, 2012

Page 2: Tutorial: High Performance SBSE Using Commodity Graphics Cards

SBSE and High Performance Computing

entire search algorithm parallelisable

operations within algorithm parallelisable

[Diagram: search iterations made up of VARIATION, EVALUATION and SELECTION steps]

Page 3: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Distributed Computing

Page 4: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Multicore Computing

Page 5: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

Page 6: Tutorial: High Performance SBSE Using Commodity Graphics Cards

GPU Cards

Page 7: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Technical Innovation

[Figure: GFLOP/s (single precision), axis 0 to 4000, against release date, 2008 to 2013, with points for GeForce GTX 280, GeForce GTX 480, GeForce GTX 580 and GeForce GTX 680]

Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012

Page 8: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

NVIDIA GPUs (most since 2009)

NVIDIA GPUs
AMD GPUs
Intel HD GPUs
Intel Core CPUs
other vendors ...

Page 9: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

Page 10: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Physical Architecture

[Diagram: the CPU with system memory (DRAM) alongside the GPU with global memory (DRAM) and streaming multiprocessors, each streaming multiprocessor containing shared memory, registers and ‘cores’]

Adapted from “CUDA C Best Practices Guide”, NVIDIA, May 2012

Page 11: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Logical Architecture

[Diagram: threads, each with local memory, grouped into blocks, each block with shared memory, and global memory available to all blocks]

Page 12: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Mapping Logical to Physical

[Diagram: a block mapped to a streaming multiprocessor, with its shared memory, registers and ‘cores’]

Page 13: Tutorial: High Performance SBSE Using Commodity Graphics Cards

CUDA Performance Features

single-instruction multiple-thread

hardware multithreading

coalesced memory access

Page 14: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Single-Instruction Multiple-Thread

1 warp = 32 threads
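Because the 32 threads of a warp share one instruction stream, a data-dependent branch that sends threads down different paths makes the warp run each path in turn. The sketch below is a plain C++ model of that effect, not GPU code: the function name `divergentPaths` and the idea of labelling each thread's branch outcome with an integer are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // threads per warp, as stated on the slide

// Counts the distinct branch outcomes taken by the threads of one warp.
// Under SIMT a warp issues one instruction at a time, so each distinct
// path is executed in turn while threads on the other paths wait.
std::size_t divergentPaths(const std::vector<int>& branchOfLane) {
    assert(branchOfLane.size() == kWarpSize);
    const std::set<int> outcomes(branchOfLane.begin(), branchOfLane.end());
    return outcomes.size();
}
```

A branch on `thread % 2` gives two paths inside every warp, whereas a branch on `thread / 32` keeps each warp on a single path.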

Page 15: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Hardware Multithreading & Occupancy

[Diagram: multiple groups of threads sharing the shared memory, registers and ‘core’ of one multiprocessor]
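How many blocks can be resident on a multiprocessor at once depends on how much of its registers and shared memory each block consumes. The following is a simplified estimate in plain C++, assuming resources are divided evenly with no allocation granularity. The struct `MultiprocessorLimits`, the function `residentBlocks` and the two limits `maxThreads` and `maxBlocks` are hypothetical names and inputs, not values taken from the slides.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor resources. The register and shared memory figures can be
// read from a compute capability table; maxThreads and maxBlocks are
// assumptions that the caller must supply for the target device.
struct MultiprocessorLimits {
    unsigned registers;       // e.g. 32K for compute capability 2.x
    unsigned sharedMemBytes;  // e.g. 48KB for compute capability 2.x
    unsigned maxThreads;      // assumption: resident thread limit
    unsigned maxBlocks;       // assumption: resident block limit
};

// Simplified estimate of how many blocks fit on one multiprocessor.
unsigned residentBlocks(const MultiprocessorLimits& sm, unsigned threadsPerBlock,
                        unsigned regsPerThread, unsigned sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);
    const unsigned byRegisters = sm.registers / (regsPerThread * threadsPerBlock);
    const unsigned byThreads = sm.maxThreads / threadsPerBlock;
    const unsigned byShared = sharedBytesPerBlock == 0
                                  ? sm.maxBlocks
                                  : sm.sharedMemBytes / sharedBytesPerBlock;
    return std::min({byRegisters, byThreads, byShared, sm.maxBlocks});
}
```

With 256 threads per block, raising register use from 20 to 32 per thread lowers the estimate, which is the kind of trade-off that occupancy tuning explores.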

Page 16: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Coalesced Memory Access

[Diagram: the threads of a warp accessing global memory]
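Global memory accesses by the threads of a warp can be combined when the addresses fall close together. A rough way to see the effect is to count how many memory segments one warp touches for a given access stride. The sketch below is a plain C++ model: the function `segmentsTouched` is hypothetical, and the segment size is left as a parameter because it depends on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct memory segments touched when each of the 32 threads
// of a warp reads element (lane * strideInElements) of an array.
// Fewer segments means fewer memory transactions for the warp.
std::size_t segmentsTouched(std::size_t strideInElements, std::size_t elementBytes,
                            std::size_t segmentBytes) {
    assert(elementBytes > 0 && segmentBytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        const std::size_t byteOffset = lane * strideInElements * elementBytes;
        segments.insert(byteOffset / segmentBytes);
    }
    return segments.size();
}
```

For 4-byte elements and an illustrative 128-byte segment, consecutive elements (stride 1) fall in one segment, while a stride of 32 elements puts every thread in a different segment.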

Page 17: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

Page 18: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Typical CUDA Application Pattern

[Diagram: host (system memory) and device (global memory), with the following steps]

memory copy
kernel launch
threads running kernel code
kernel completion
memory copy

Page 19: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Example Problem

a       b      c = a * b
382     17     ?
1124    17     ?
30      17     ?
2781    98     ?
824     98     ?
4510    98     ?
4088    31     ?
...     ...    ...

256 x 64

Page 20: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Kernel Code (device-side)

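__global__ void exampleKernel(int * a, int * b, int * c) {

    // one value of b per block, held in shared memory for all threads of the block
    __shared__ int sb;

    const unsigned int thread = threadIdx.x;                   // index within the block
    const unsigned int block = blockIdx.x;                     // index of the block
    const unsigned int gThread = block * blockDim.x + thread;  // index across all blocks

    // the first thread of each block loads the block's b value
    if (thread == 0) {
        sb = b[block];
    }

    // wait until sb has been written before any thread reads it
    __syncthreads();

    c[gThread] = a[gThread] * sb;

}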
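When checking a kernel like this, it can help to have a host-side reference that performs the same arithmetic sequentially. The following plain C++ sketch mirrors the kernel's indexing with two nested loops. The function name `exampleReference` and the use of `std::vector` are choices made here for illustration, not part of the tutorial's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for the kernel: block and thread loops stand in for
// blockIdx.x and threadIdx.x, and numThreads stands in for blockDim.x.
void exampleReference(const std::vector<int>& a, const std::vector<int>& b,
                      std::vector<int>& c, unsigned int numBlocks,
                      unsigned int numThreads) {
    assert(a.size() == static_cast<std::size_t>(numBlocks) * numThreads);
    assert(b.size() == numBlocks);
    c.resize(a.size());
    for (unsigned int block = 0; block < numBlocks; ++block) {
        const int sb = b[block];  // the value the kernel keeps in shared memory
        for (unsigned int thread = 0; thread < numThreads; ++thread) {
            const unsigned int gThread = block * numThreads + thread;
            c[gThread] = a[gThread] * sb;
        }
    }
}
```

Comparing the array copied back from the device against the output of such a reference is a simple way to catch indexing mistakes.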

Page 21: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Launching a Kernel (host-side)

const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;

dim3 gridD(numBlocks, 1, 1);
dim3 blockD(numThreads, 1, 1);

exampleKernel<<<gridD,blockD>>>(a,b,c);
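In this example the data divides exactly into 64 blocks of 256 threads, and the kernel relies on that because it has no bounds check. For data sizes that are not a multiple of the block size, a common approach is to round the block count up. The helper below is a hypothetical illustration of that rounding in plain C++; a kernel launched this way would also need to skip indices beyond the end of the data.

```cpp
#include <cassert>

// Number of blocks needed so that numBlocks * threadsPerBlock >= numItems.
unsigned int blocksFor(unsigned int numItems, unsigned int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (numItems + threadsPerBlock - 1) / threadsPerBlock;
}
```

For the 256 x 64 items of the example problem this gives the same 64 blocks used above.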

Page 22: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Allocating and Copying Memory (host-side)

const unsigned int numThreads = 256;
const unsigned int numBlocks = 64;

// device pointers (each name needs its own *)
int *a, *b, *c;

cudaMalloc((void **)&a, numThreads * numBlocks * sizeof(int));
cudaMalloc((void **)&b, numBlocks * sizeof(int));
cudaMalloc((void **)&c, numThreads * numBlocks * sizeof(int));

// inputA and inputB are host arrays holding the input data
cudaMemcpy(a, inputA, numThreads * numBlocks * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(b, inputB, numBlocks * sizeof(int), cudaMemcpyHostToDevice);

Page 23: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Putting It All Together

__global__ void exampleKernel(int * a, int * b, int * c) {...}

int main(...) {
    ...
    cudaMalloc(...);
    cudaMemcpy(...);
    ...
    exampleKernel<<<gridD,blockD>>>(a,b,c);
    ...
    cudaMemcpy(...);
    ...
}

Page 24: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Build Process

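[Diagram: nvcc separates a CUDA source file into host source and device source. The device source is compiled to device intermediate code (PTX) and device executable code (cubin). The standard compiler and linker build the host source with embedded device code, together with any non-CUDA source file, into the host executable.]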

Adapted from “CUDA Compiler Driver NVCC”, NVIDIA, May 2012

Page 25: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Compute Capability

compute capability                        1.0        1.1        1.2        1.3        2.x          3.0          3.5
atomic functions (global memory)          No         Yes        Yes        Yes        Yes          Yes          Yes
atomic functions (shared memory)          No         No         Yes        Yes        Yes          Yes          Yes
warp vote functions                       No         No         Yes        Yes        Yes          Yes          Yes
double precision floating point           No         No         No         Yes        Yes          Yes          Yes
additional fence and sync functions       No         No         No         No         Yes          Yes          Yes
max number of threads per block           512        512        512        512        1024         1024         1024
number of registers per multiprocessor    8K         8K         16K        16K        32K          64K          64K
max shared memory per multiprocessor      16KB       16KB       16KB       16KB       48KB         48KB         48KB
local memory per thread                   16KB       16KB       16KB       16KB       512KB        512KB        512KB
max number of instructions per kernel     2 million  2 million  2 million  2 million  512 million  512 million  512 million

Adapted from “CUDA C Programming Guide”, NVIDIA, July 2012

Page 26: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Additional Tools and Libraries

Development Tools
debugger
memory checker
profiler

CUDA Libraries
linear algebra (CUBLAS)
sparse matrices (CUSPARSE)
random number generation (CURAND)
fast Fourier transform (CUFFT)
Thrust

Page 27: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

Page 28: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Bayesian Optimisation Algorithm

Page 29: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Ising Spin Glass

[Diagram: Ising spin glass lattice labelled with +1 and -1 values]

Page 30: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Implementation

VARIATION: build Bayesian network model (CUDA kernel)
EVALUATION: calculate Ising spin glass energy (CUDA kernel)
SELECTION: restricted tournament replacement (CUDA kernel)

Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011
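The evaluation step calculates the energy of an Ising spin glass configuration. As a rough illustration of what such a fitness function computes, the plain C++ sketch below assumes the common nearest-neighbour form E = -sum of J_ij * s_i * s_j on a cubic lattice with periodic boundaries. The struct `SpinGlass`, its coupling arrays and the function names are hypothetical, and this is not the code from the cited competition entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: an n x n x n lattice with periodic boundaries.
// jx[i], jy[i] and jz[i] hold the coupling between site i and its
// neighbour in the +x, +y and +z direction.
struct SpinGlass {
    int n;
    std::vector<int> jx, jy, jz;
};

int siteIndex(int n, int x, int y, int z) { return (z * n + y) * n + x; }

// Energy of a configuration in which every spin is +1 or -1.
int energy(const SpinGlass& g, const std::vector<int>& spin) {
    const int n = g.n;
    const std::size_t sites = static_cast<std::size_t>(n) * n * n;
    assert(spin.size() == sites && g.jx.size() == sites &&
           g.jy.size() == sites && g.jz.size() == sites);
    int e = 0;
    for (int z = 0; z < n; ++z)
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x) {
                const int i = siteIndex(n, x, y, z);
                // each bond is visited once, from the site at its lower end
                e -= g.jx[i] * spin[i] * spin[siteIndex(n, (x + 1) % n, y, z)];
                e -= g.jy[i] * spin[i] * spin[siteIndex(n, x, (y + 1) % n, z)];
                e -= g.jz[i] * spin[i] * spin[siteIndex(n, x, y, (z + 1) % n)];
            }
    return e;
}
```

Each site's contribution is independent of the others, which is what makes this kind of evaluation a natural fit for one GPU thread per site or per candidate solution.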

Page 31: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Results

[Figure: GPU speed-up (axis 0 to 100) against problem size: 8x8x8, 12x12x12, 16x16x16, 24x24x24]

Poulding, Staunton, Burles, “Full Implementation of an Estimation of Distribution Algorithm on a GPU”, CIGPU Competition Entry, GECCO 2011

Page 32: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

Page 33: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Multi-Objective Test Suite Minimisation

test cases (columns) against requirements (rows)

        t1    t2    t3    ...   tl
r1      1     0     1     ...   0
r2      1     0     0     ...   1
r3      0     1     1     ...   1
...     ...   ...   ...   ...   ...
rm      1     1     0     ...   0
cost    9     7     4     ...   6

Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011

Page 34: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Implementation

MO algorithm: NSGA-II

EVALUATION: calculation of coverage and cost by matrix multiplication

Java jMetal MOEA library
OpenCL using JavaCL wrapper

Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011
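To make the matrix formulation concrete, the following plain C++ sketch evaluates a population of candidate test suites against a coverage matrix and a cost vector. It is a sequential illustration of the arithmetic only, not the OpenCL implementation from the cited paper. The names `Fitness`, `Matrix` and `evaluate` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Fitness { int covered; int cost; };

using Matrix = std::vector<std::vector<int>>;

// covers[r][t] is 1 if test case t covers requirement r.
// population[k][t] is 1 if candidate suite k includes test case t.
std::vector<Fitness> evaluate(const Matrix& covers, const std::vector<int>& cost,
                              const Matrix& population) {
    std::vector<Fitness> result;
    for (const auto& suite : population) {
        assert(suite.size() == cost.size());
        Fitness f{0, 0};
        for (const auto& row : covers) {
            assert(row.size() == suite.size());
            int hits = 0;  // one entry of the product of covers and the suite
            for (std::size_t t = 0; t < suite.size(); ++t) hits += row[t] * suite[t];
            if (hits > 0) ++f.covered;  // requirement met by at least one test
        }
        for (std::size_t t = 0; t < suite.size(); ++t) f.cost += cost[t] * suite[t];
        result.push_back(f);
    }
    return result;
}
```

Every entry of the product can be computed independently, so on a GPU each thread could take one (requirement, candidate) pair.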

Page 35: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Results

[Figure: GPU speed-up (axis 0 to 30) against problem size: 5.92E+4, 6.62E+5, 1.12E+7]

Yoo, Harman, Ur, “Highly Scalable Multi Objective Test Suite Minimisation Using Graphics Cards”, SSBSE 2011

Page 36: Tutorial: High Performance SBSE Using Commodity Graphics Cards

General Purpose Computing on GPUs (GPGPU)

CUDA Architecture

Developing CUDA Applications

Case Studies
Parallelising Search Algorithm
Parallelising Fitness Evaluation
Parallelising Software Execution

Page 37: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Implementation

EVALUATION: execute instrumented software with test inputs (CUDA kernel)

research funded by the MOD Centre for Defence Enterprise (CDE)

Page 38: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Language Compatibility

large subset of C++

including:
OO features
templates
math library
IEEE 754 floating point compliance

only in compute capability 2.0+:
dynamic memory allocation
function pointers
function recursion
multiple source code files

missing:
Standard Template Library
runtime type information
network and file IO
rand()


Page 39: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Results

[Figure: GPU speed-up (axis 0 to 80) against problem size: ~20 LOC, ~100 LOC, ~1,500 LOC]


Page 40: Tutorial: High Performance SBSE Using Commodity Graphics Cards

Resources

NVIDIA CUDA Zone

CUDA SDK samples - ‘template’ application

C Programming Guide

C Best Practices Guide