Parallel Computing and You: A CME213 Review (June 24, 2013, Jason Su)


TRANSCRIPT

Page 1: June 24, 2013 Jason Su

Parallel Computing and You: A CME213 Review

June 24, 2013, Jason Su

Page 2: June 24, 2013 Jason Su

Technologies for C/C++/Fortran

• Single machine, multi-core
  – P(OSIX) threads: bare-metal multi-threading
  – OpenMP: compiler directives that implement various constructs like parallel-for
• Single machine, GPU
  – CUDA/OpenCL: bare-metal GPU coding
  – Thrust: algorithms library for CUDA inspired by the C++ STL
    • Like MATLAB, where you use a set of fast core functions and data structures to implement your program
• Multi-machine
  – Message Passing Interface (MPI): a language-independent communication protocol to coordinate and program a cluster of machines (see the sketch below)
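A minimal sketch of the multi-machine model, assuming the standard MPI C API (mpi.h, MPI_Init, MPI_Reduce); the rank-sum computation is illustrative only, not from the slides:

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes are there?

        int local = rank, total = 0;
        // Combine one value from every rank; rank 0 receives the result.
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("sum of ranks 0..%d = %d\n", size - 1, total);

        MPI_Finalize();
    }

Each process runs the same program on its own machine; all coordination happens through explicit messages like the MPI_Reduce above.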

Page 3: June 24, 2013 Jason Su

Challenges of Parallel Programming

• Correctness
  – Race conditions
  – Synchronization/deadlock
  – Floating point arithmetic is not associative or distributive
• Debugging
  – How do you fix problems that are not reproducible?
  – How do you assess the interaction of many threads running simultaneously?
• Performance
  – Management and algorithm overhead
  – Amdahl's Law

$$ S(n_{\mathrm{threads}}) = \frac{1}{(1 - P) + \dfrac{P}{n_{\mathrm{threads}}}} $$
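Here P is the fraction of the program that can run in parallel and S is the overall speedup on n_threads threads; the (1 - P) serial fraction caps the speedup. As a worked example, with P = 0.95 and 8 threads, S = 1 / (0.05 + 0.95/8) ≈ 5.9, and even with unlimited threads the speedup can never exceed 1 / 0.05 = 20.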

Page 4: June 24, 2013 Jason Su

Challenges: Race Conditions

What we expected:

  Thread 1       Thread 2       Integer value
                                0
  Read ← 0                      0
  Increment                     0
  Write → 1                     1
                 Read ← 1       1
                 Increment      1
                 Write → 2      2

What we got:

  Thread 1       Thread 2       Integer value
                                0
  Read ← 0                      0
                 Read ← 0       0
  Increment                     0
                 Increment      0
  Write → 1                     1
                 Write → 1      1

As a simple example, let us assume that two threads each want to increment the value of a global integer variable by one.
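A minimal C++ sketch of this increment race, assuming C++11 std::thread and std::atomic (not part of the slides): the unsynchronized counter usually loses updates, while the atomic one does not.

    #include <atomic>
    #include <iostream>
    #include <thread>

    int main() {
        int plain = 0;              // unsynchronized read-modify-write: increments can interleave
        std::atomic<int> safe{0};   // atomic increment serializes the update

        auto work = [&] {
            for (int i = 0; i < 1000000; ++i) {
                ++plain;            // racy: two threads can both read the same old value
                ++safe;             // safe
            }
        };

        std::thread t1(work), t2(work);
        t1.join();
        t2.join();

        std::cout << "plain = " << plain << " (often far less than 2000000)\n";
        std::cout << "safe  = " << safe  << " (always 2000000)\n";
    }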

Page 5: June 24, 2013 Jason Su

Challenges: Deadlock

A real-world example is an illogical statute passed by the Kansas legislature in the early 20th century, which stated:

"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone."

Page 6: June 24, 2013 Jason Su

Challenges: Deadlock

• Consider a program that manages bank accounts:

  BankAccount:
      string owner
      float balance
      withdraw(float amount)
      deposit(float amount)

      transfer(Account to, float amount):
          lock(self)
          lock(to)
          self.withdraw(amount)
          to.deposit(amount)
          release(to)
          release(self)
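A C++ sketch of why this transfer can deadlock, plus one common fix: acquiring both account locks together (std::scoped_lock, C++17) so two opposing transfers cannot each hold one lock while waiting for the other. The class layout is illustrative, not the course's code.

    #include <mutex>
    #include <string>

    struct BankAccount {
        std::string owner;
        float balance = 0.0f;
        std::mutex m;

        void withdraw(float amount) { balance -= amount; }
        void deposit(float amount)  { balance += amount; }
    };

    // Deadlock-prone: each thread locks its own 'from' first, then waits on 'to'.
    void transfer_naive(BankAccount& from, BankAccount& to, float amount) {
        std::lock_guard<std::mutex> lock_from(from.m);
        std::lock_guard<std::mutex> lock_to(to.m);     // may wait forever
        from.withdraw(amount);
        to.deposit(amount);
    }

    // Safer: both mutexes are acquired as one operation, without deadlock.
    void transfer_safe(BankAccount& from, BankAccount& to, float amount) {
        std::scoped_lock lock(from.m, to.m);
        from.withdraw(amount);
        to.deposit(amount);
    }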

Page 7: June 24, 2013 Jason Su

Challenges: Deadlock

What we expected:

  Thread 1: Bilbo          Thread 2: Frodo
  Deposit: Ring            Deposit: $1000
  Transfer: Ring →         $1000 + Ring
  $1000                    ← Transfer: $1000
  $1000                    Ring

What we got:

  Thread 1: Bilbo          Thread 2: Frodo
  Deposit: Ring            Deposit: $1000
  Transfer: Ring →         ← Transfer: $1000
  "No, you give me it first, then I'll give it to you!"


Page 8: June 24, 2013 Jason Su

Challenges: Floating Point Arithmetic

Say we had a floating point format with 1 base-10 exponent digit and 1 decimal digit of mantissa:

[Slide figure: the same sum worked out in this toy format under "What we expected" and "What we got"; because every intermediate result is rounded, summing in a different order changes the final digit.]
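As an illustration (not necessarily the numbers on the slide), in a toy format that keeps two significant decimal digits: (9.5 + 0.04) + 0.04 rounds as 9.54 → 9.5, then 9.54 → 9.5 again, while 9.5 + (0.04 + 0.04) = 9.5 + 0.08 = 9.58 → 9.6. The same three numbers give two different sums depending on the order of the additions.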

Page 9: June 24, 2013 Jason Su

Challenges: Floating Point Arithmetic

• Floating point numbers have limited precision
  – Every operation implicitly rounds, which means that order matters
• Float arithmetic is not associative or distributive (see the sketch after the quote below)
• Do operations on values of similar magnitude first
  – This is actually an issue in all code, but it becomes apparent when serial and parallel answers differ

“Many programmers steeped in sequential programming for so many years make the assumption that there is only one right answer for their algorithm. After all, their code has always delivered the same answer every time it was run…however, all the answers are equally correct”
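A short C++ sketch of the point above (standard IEEE doubles rather than the toy format): the same three values summed in two orders give different answers because each addition rounds.

    #include <cstdio>

    int main() {
        double big = 1e20, tiny = 1.0;

        double left  = (big + tiny) - big;   // big + tiny rounds back to big -> 0
        double right = (big - big) + tiny;   // exact cancellation happens first -> 1

        std::printf("left = %g, right = %g\n", left, right);
    }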

Page 10: June 24, 2013 Jason Su

Challenges: Floating Point Arithmetic

• The answer changes significantly with the number of threads– Because the order of operations has changed– Indicates an unstable numerical algorithm– Parallel programming revealed a flaw in the

algorithm which is otherwise unapparent with serial code

Page 11: June 24, 2013 Jason Su

OpenMP

• Uses a fork-join programming paradigm
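A minimal fork-join sketch, assuming a compiler with OpenMP enabled (e.g. -fopenmp); the printed messages are illustrative only:

    #include <cstdio>
    #include <omp.h>

    int main() {
        std::printf("serial region: one thread\n");

        #pragma omp parallel                      // fork: a team of threads starts here
        {
            std::printf("hello from thread %d of %d\n",
                        omp_get_thread_num(), omp_get_num_threads());
        }                                         // join: implicit barrier at the closing brace

        std::printf("serial region again: one thread\n");
    }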

Page 12: June 24, 2013 Jason Su
Page 13: June 24, 2013 Jason Su

OpenMP

• Declare private and shared variables for a thread
• Atomic operations
  – Ensure that an update completes uninterrupted by other threads, important for shared variables
• Parallel for-loop: see MATLAB's parfor

    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < N; i++) { ... }

• Reduction: ℝ^(n_threads) → ℝ (a runnable sketch follows below)

    #pragma omp parallel for reduction(+:val)
    for (size_t i = 0; i < v.size(); i++)
        val += v[i];
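A runnable version of the reduction above (a sketch; variable names are illustrative): each thread accumulates a private partial sum, and OpenMP combines the partial sums with +.

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<double> v(1000000, 0.5);
        double val = 0.0;

        #pragma omp parallel for reduction(+:val)
        for (long i = 0; i < (long)v.size(); i++)
            val += v[i];                     // each thread adds into its own copy of val

        std::printf("sum = %f\n", val);      // 500000.0
    }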

Page 14: June 24, 2013 Jason Su

Moore’s Law

• Serial scaling performance has reached its peak.

• Processors are not getting faster, but wider

Page 15: June 24, 2013 Jason Su

GPUs

Page 16: June 24, 2013 Jason Su

GPUs

Page 17: June 24, 2013 Jason Su

CPU vs GPU

• GPU devotes more transistors to data processing

Page 18: June 24, 2013 Jason Su

CPU vs GPU

• GPU devotes more transistors to data processing

Page 19: June 24, 2013 Jason Su

CPU vs GPU

• CPU minimizes time to complete a given task: latency of a thread

• GPU maximizes number of tasks in a fixed time: throughput of all threads

• Which is better? It depends on the problem

"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"

Seymour Cray

Page 20: June 24, 2013 Jason Su

NVIDIA CUDA

Compute Unified Device Architecture (CUDA)
• A set of libraries and extensions for C/C++ or Fortran
• Also 3rd-party support in MATLAB, Python, Java, and Mathematica

The language is evolving with the hardware
• Newer GPU architectures use different CUDA versions
• Denoted by their Compute Capability

Hardware comes from the desktop market
• NVIDIA offers its own expensive products designed for scientific computing (Tesla and Quadro)
• But desktop options are affordable and effective (GeForce)

Page 21: June 24, 2013 Jason Su

NVIDIA CUDA

Extremely cheap thread creation and switching
• Hides latency with parallelism

Direct access to L1 cache (shared memory, up to 48KB)
• Fast, but up to you to manage it correctly

Single instruction, multiple thread (SIMT) execution model
• May or may not fit your problem

Page 22: June 24, 2013 Jason Su

NVIDIA GPU Architecture: Processors

• The GPU contains many streaming multiprocessors (MP or SM)
  – These do the work
  – Each groups threads into "warps" and executes them in lock-step
  – Different warps are swapped in and out constantly to hide latency = "parallelism"

Page 23: June 24, 2013 Jason Su

What is a warp?

• A warp is a group of 32 threads that are executed in lock-step (SIMT)
• A thread in a warp can either be idle or execute the same instruction as its siblings
  – Divergent code (if/else, loops, return, etc.) can wastefully leave some threads idle (see the sketch below)
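A hedged CUDA sketch of divergence (kernel names are illustrative): in the first kernel, even and odd lanes of every warp take different branches, so the warp runs both paths back to back; in the second, whole warps branch together and stay busy.

    __global__ void divergent(float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)          // lanes within one warp disagree
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }

    __global__ void uniform(float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)   // all 32 lanes of a warp take the same branch
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }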

Page 24: June 24, 2013 Jason Su

NVIDIA GPU Architecture: Memory

Page 25: June 24, 2013 Jason Su

NVIDIA GPU Architecture

• Imagine a pool of threads representing all the work there is to do
  – For example, a thread for each pixel in the image
  – These threads need to be broken up into blocks that fit neatly onto each MP; the block size is chosen by you
• There are several limitations that affect how well these blocks fill up an MP, called its occupancy:
  – Total threads < 2048
  – Total blocks < 16
  – Total shared memory < 48KB
  – Total registers < 64K

Page 26: June 24, 2013 Jason Su

CUDA Program Structure

Host code
• void main()
• Transfers data to global memory on the GPU/device
• Determines how the data and threads are divided to fit onto MPs
• Calls kernel functions on the GPU (a minimal host-side sketch follows below):
    kernel_name<<< gridDim, blockDim >>>(arg1, arg2, …);
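A minimal host-side sketch of this structure, assuming the CUDA runtime API (cudaMalloc, cudaMemcpy); the scale kernel is illustrative, not from the slides.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;                         // guard against the last partial block
    }

    int main() {
        const int N = 1 << 20;
        float* h = new float[N];
        for (int i = 0; i < N; i++) h[i] = 1.0f;

        float* d = nullptr;
        cudaMalloc(&d, N * sizeof(float));                             // allocate global memory
        cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);   // host -> device

        int blockDim = 256;                           // threads per block
        int gridDim  = (N + blockDim - 1) / blockDim; // enough blocks to cover N
        scale<<<gridDim, blockDim>>>(d, N, 2.0f);     // kernel launch

        cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);   // device -> host
        std::printf("h[0] = %f\n", h[0]);             // 2.0

        cudaFree(d);
        delete[] h;
    }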

Page 27: June 24, 2013 Jason Su

CUDA Program Structure

• Memory and processors on the GPU exist in big discrete chunks

• Designing for the GPU is not only about developing a parallel algorithm

• But also about shaping it so that it optimally maps to GPU hardware structures

Page 28: June 24, 2013 Jason Su

CUDA Program Structure

Global kernel functions
• Device code called by the host
• Access data based on threadIdx location
• Do work, hopefully in a non-divergent way so all threads are doing something

Device functions
• Device code called by the device
• Helper functions usually used in kernels (see the sketch below)
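A small sketch of the two kinds of device code (illustrative names): a __device__ helper callable only from GPU code, used inside a __global__ kernel launched by the host.

    __device__ float squared(float x) {                // helper usable only in device code
        return x * x;
    }

    __global__ void square_all(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // unique thread index from blockIdx/threadIdx
        if (i < n)
            x[i] = squared(x[i]);
    }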

Page 29: June 24, 2013 Jason Su

Connected Component Labeling

Finally, we get to the problem.

Page 30: June 24, 2013 Jason Su

Connected Component Labeling

• Say you had a segmented mask of lesions:
  – How would you count the total number of (connected) lesions in the image?
  – How would you track an individual lesion over time?

[Slide figure: example mask with two connected lesions, labeled 1 and 2]

Page 31: June 24, 2013 Jason Su
Page 32: June 24, 2013 Jason Su

Connected labels share a similar property or state

In this case color, where we’ve simplified to black and white

Page 33: June 24, 2013 Jason Su

Stars labeled 1 through 5467

Page 34: June 24, 2013 Jason Su

Initialize with Raster-Scan Numbering

Input Image → Initial Labels:

   1  2  3  4  5  6  7  8
   9 10 11 12 13 14 15 16
  17 18 19 20 21 22 23 24
  25 26 27 28 29 30 31 32
  33 34 35 36 37 38 39 40
  41 42 43 44 45 46 47 48
  49 50 51 52 53 54 55 56
  57 58 59 60 61 62 63 64

• The goal is to label each connected region with a unique number; min(a, b) is used to combine labels.
• That means labels 1, 7, 9, and 54 should remain.

Page 35: June 24, 2013 Jason Su

Kernel A – Label Propagation

• Iterate until nothing changes:
  1. Assign a pixel to each thread
  2. For each thread:
     a) Look at my neighboring pixels
     b) If a neighbor's label is smaller than mine, re-label myself to the lowest adjacent label (a CUDA sketch of one pass follows below)

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

   1  2  3  4  5  6  7  8
   9 10 11 12 13 14 15 16
  17 18 19 20 21 22 23 24
  25 26 27 28 29 30 31 32
  33 34 35 36 37 38 39 40
  41 42 43 44 45 46 47 48
  49 50 51 52 53 54 55 56
  57 58 59 60 61 62 63 64
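A hedged CUDA sketch of one Kernel-A-style pass (simplified from the idea in Hawick et al., not their code; names and the 4-connectivity choice are assumptions): each thread lowers its pixel's label to the smallest label among like-valued neighbors, and the host re-launches the kernel until the changed flag stays 0.

    __global__ void propagate_labels(const int* image, int* labels,
                                     int width, int height, int* changed) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx   = y * width + x;
        int label = labels[idx];
        int best  = label;

        // Only 4-connected neighbors with the same image value can merge labels.
        if (x > 0          && image[idx - 1]     == image[idx]) best = min(best, labels[idx - 1]);
        if (x < width - 1  && image[idx + 1]     == image[idx]) best = min(best, labels[idx + 1]);
        if (y > 0          && image[idx - width] == image[idx]) best = min(best, labels[idx - width]);
        if (y < height - 1 && image[idx + width] == image[idx]) best = min(best, labels[idx + width]);

        if (best < label) {
            labels[idx] = best;   // labels only ever decrease, so the process terminates
            *changed = 1;         // tells the host to launch another iteration
        }
    }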

Page 36: June 24, 2013 Jason Su

Kernel A – Label Propagation

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Page 37: June 24, 2013 Jason Su

Kernel A – Label Propagation

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Page 38: June 24, 2013 Jason Su

Kernel A – Label Propagation

• A label can only propagate itself by a maximum of one cell per iteration
  – Many iterations are required
• Very slow for large clusters in the image
  – We're getting killed by many accesses to global memory, each taking O(100) cycles
  – Even having many parallel threads is not enough to hide it: we need 32 cores/MP × 400 cycles = 12,800 active threads per MP

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Page 39: June 24, 2013 Jason Su

Kernel B – Local Label Propagation

• Take advantage of L1-speed shared memory (a tiled sketch follows below)
• Iterate until global convergence:
  1. Assign a pixel to each thread
  2. For each thread:
     a) Load my own and my neighbors' labels into shared memory
     b) Iterate until local convergence:
        i.  Look at my neighboring pixels
        ii. If a neighbor's label is smaller than mine, re-label myself to the lowest adjacent label

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.
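A hedged CUDA sketch of the shared-memory tiling idea (again simplified, not the authors' code; TILE, the names, and the boundary handling are assumptions): each block stages its tile of labels in on-chip shared memory, iterates to local convergence there, and only then writes back to global memory.

    #define TILE 16

    __global__ void propagate_labels_tiled(const int* image, int* labels,
                                           int width, int height, int* changed) {
        __shared__ int tile[TILE][TILE];
        __shared__ int block_changed;

        int tx = threadIdx.x, ty = threadIdx.y;
        int x = blockIdx.x * TILE + tx;
        int y = blockIdx.y * TILE + ty;
        bool active = (x < width) && (y < height);   // image size may not divide by TILE
        int idx = y * width + x;

        if (active) tile[ty][tx] = labels[idx];      // global -> shared, once
        __syncthreads();

        for (;;) {
            if (tx == 0 && ty == 0) block_changed = 0;
            __syncthreads();

            if (active) {
                int best = tile[ty][tx];
                // Merge with in-tile neighbors that share this pixel's image value.
                if (tx > 0                          && image[idx - 1]     == image[idx]) best = min(best, tile[ty][tx - 1]);
                if (tx < TILE - 1 && x < width - 1  && image[idx + 1]     == image[idx]) best = min(best, tile[ty][tx + 1]);
                if (ty > 0                          && image[idx - width] == image[idx]) best = min(best, tile[ty - 1][tx]);
                if (ty < TILE - 1 && y < height - 1 && image[idx + width] == image[idx]) best = min(best, tile[ty + 1][tx]);
                if (best < tile[ty][tx]) { tile[ty][tx] = best; block_changed = 1; }
            }
            __syncthreads();

            int done = (block_changed == 0);         // every thread reads the flag...
            __syncthreads();                         // ...before thread (0,0) can reset it
            if (done) break;
        }

        if (active && tile[ty][tx] != labels[idx]) { // write back only if the tile changed
            labels[idx] = tile[ty][tx];
            *changed = 1;                            // tells the host to run another global pass
        }
    }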

Page 40: June 24, 2013 Jason Su

Kernel B – Local Label Propagation

   1  2  3  4  5  6  7  8
   9 10 11 12 13 14 15 16
  17 18 19 20 21 22 23 24
  25 26 27 28 29 30 31 32
  33 34 35 36 37 38 39 40
  41 42 43 44 45 46 47 48
  49 50 51 52 53 54 55 56
  57 58 59 60 61 62 63 64

Initial Labels

Page 41: June 24, 2013 Jason Su

Kernel B – Local Label Propagation

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Page 42: June 24, 2013 Jason Su

Kernel B – Local Label Propagation

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Page 43: June 24, 2013 Jason Su

Kernel B – Local Label Propagation

• A label can only propagate itself by a maximum of one block per iteration
  – Iterations are reduced, but there can still be a lot
• Tiling the data into shared memory blocks is a very common technique
  – It appears in many applications, including FDTD

Page 44: June 24, 2013 Jason Su

Kernel D – Label Equivalence

• Completely change the algorithm
  – How do we make it so labels can propagate extensively in one iteration?
  – Track which labels are equivalent and resolve equivalency chains

Iterate until convergence:

Scanning
• Find the minimum neighbor for each pixel and record it in the equivalence list

Analysis
• Traverse through the equivalency chain until you get a label that is equivalent to itself (a sketch of this step follows below)

Relabel the pixels
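A hedged CUDA sketch of the analysis step only (not the authors' code; the equiv array and its layout are assumptions): each thread follows its label's equivalence chain until it reaches a label that maps to itself, then flattens the chain.

    __global__ void resolve_equivalences(int* equiv, int num_labels) {
        int l = blockIdx.x * blockDim.x + threadIdx.x;
        if (l >= num_labels) return;

        int root = equiv[l];
        while (root != equiv[root])   // walk the chain until a self-referential label
            root = equiv[root];
        equiv[l] = root;              // flatten: this label now points straight at its root
    }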

Page 45: June 24, 2013 Jason Su

Kernel D – Label Equivalence

K. Hawick, A. Leist, D. Playne, Parallel graph component labelling with GPUs and CUDA, Parallel Computing 36 (12) (2010) 655–678.

Page 46: June 24, 2013 Jason Su

CPU – MATLAB’s bwlabel

Run-length encode the image
• There are special hardware instructions that make this very fast

Scan the runs
• Assign initial labels and record label equivalences in an equivalence table

Analysis
• Here the chains are traversed to the actual minimum instead of to a self-reference
• This is possible because the code runs serially

Relabel the runs

Page 47: June 24, 2013 Jason Su

Performance

  Input       Solution    CPU: MATLAB's bwlabel   GPU: Kernel A   GPU: Kernel B   GPU: Kernel D
  477×423     23 labels   1.23 ms                 92.9 ms         14.3 ms         1.05 ms
  1463×2233   68 labels   15.2 ms                 5031.1 ms       491.3 ms        12.3 ms

Page 48: June 24, 2013 Jason Su

Results: CPU vs GPU

GPU isn't a cure-all
• A naïve algorithm will still lose to the CPU, badly
• Starting from the best serial algorithm is smart
• Shared memory can offer a 10x improvement

Connected component labeling
• Is actually a more challenging problem for the GPU to beat the CPU on
• A memory-access-bound problem
• Does not take advantage of the TFLOPs available

Page 49: June 24, 2013 Jason Su

Figure Credits

• Christopher Cooper, Boston University
• Eric Darve, Stanford University
• Gordon Erlebacher, Florida State University
• CUDA C Programming Guide

Page 50: June 24, 2013 Jason Su