Medical image processing strategies for multi-core CPUs Daniel Blezek, Mayo Clinic [email protected]

Post on 06-May-2015


Page 1: Medical Image Processing Strategies for multi-core CPUs

Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
[email protected]

Page 2: Medical Image Processing Strategies for multi-core CPUs

Poll

Does your primary computer have more than one core...?

Have you ever written parallel code?

Page 3: Medical Image Processing Strategies for multi-core CPUs

It’s a parallel world...

SMP was formerly the domain of researchers
Thanks to Intel, now it’s everywhere!
Hardware has far outstripped software
– Developers are not trained
– Development of parallel software is difficult
Outside the box
– Erlang, Scala, ...
... but most of us think in serial ...

Page 4: Medical Image Processing Strategies for multi-core CPUs

Parallel Computing – according to Google

“parallel computing”: 1.4M hits on Google
“multithreading”: 10M hits
“multicore”: 2.4M hits
“parallel programming”: 1.1M hits

Why is it so hard?
– the world is parallel
– we all think in parallel
– yet we are taught to program in serial

Page 5: Medical Image Processing Strategies for multi-core CPUs

Degrees of parallelism (my take)

Serial – SISD, single thread of execution
Data parallel – SIMD (fine-grained parallelism)
Embarrassingly parallel – larger-scale SIMD
– CT or MR reconstruction
– Each operation is independent, e.g. iFFT of slices
Worker thread – e.g. virus scanning software
Coarse-grained parallelism – SMP or MIMD
– Focus of this presentation, more in GPU talk
– Concurrency, OpenMP, TBB, pthreads/WinThreads
Large scale – MPI on cluster, tight coupling
Large scale – Grid computing, loose coupling

Page 6: Medical Image Processing Strategies for multi-core CPUs

Pragmatic approach

C/C++ and Fortran are the kings of performance
– (I’ve never written a single line of Fortran, so don’t ask)
“Bolted on” parallel concepts
– Zero language support
Huge existing codebase

Page 7: Medical Image Processing Strategies for multi-core CPUs

Pragmatic approach

Briefly touch on SIMD
Introduce SMP concepts
– Threads, concurrency
Development models
– pthreads/WinThreads
– OpenMP
– TBB
– ITK
Medical Image Processing
– Example problems
– Common errors
Next steps

Page 8: Medical Image Processing Strategies for multi-core CPUs

SIMD

Page 9: Medical Image Processing Strategies for multi-core CPUs

SIMD – basic principles

http://en.wikipedia.org/wiki/SIMD

Page 10: Medical Image Processing Strategies for multi-core CPUs

Data structures for SIMD

Array of Structures

struct Vec {
  float x, y, z;
};

Vec* points = new Vec[sz];

[Figure: four "X Y Z --" vectors combined with a multiply (*), with Pack and Unpack steps]

Page 11: Medical Image Processing Strategies for multi-core CPUs

Data structures for SIMD

Structure of Arrays

struct Vec {
  float* x;
  float* y;
  float* z;
  Vec ( int sz ) {
    x = new float[sz];
    y = new float[sz];
    z = new float[sz];
  }
};

Structure of Arrays

struct Vec {
  Vector4f* v;
  Vec ( int sz ) {
    // must be word aligned
    v = new Vector4f[sz];
  }
};
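The two layouts can be compared side by side. The sketch below is a minimal illustration, not code from the talk: the names `PointAoS`, `PointsSoA`, and `scaleX` are made up for this example, and whether a compiler actually emits SIMD instructions for either loop depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: x, y, z of one point sit next to each other.
struct PointAoS {
  float x, y, z;
};

// Structure of Arrays: all x values are contiguous, then all y, then all z.
struct PointsSoA {
  std::vector<float> x, y, z;
  explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Scale only the x coordinates. With AoS the loop strides over y and z.
void scaleX(std::vector<PointAoS>& pts, float s) {
  for (std::size_t i = 0; i < pts.size(); ++i) pts[i].x *= s;
}

// With SoA the loop walks one dense array, so several x values can be
// loaded into a SIMD register without a separate pack/unpack step.
void scaleX(PointsSoA& pts, float s) {
  for (std::size_t i = 0; i < pts.x.size(); ++i) pts.x[i] *= s;
}
```

The SoA version touches only the memory it needs, which is usually the property that makes simple loops easy to vectorize.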

Page 12: Medical Image Processing Strategies for multi-core CPUs

SIMD pitfalls

Structure alignment
– Usually needs to be aligned on word boundary
Structure considerations
– May need to refactor existing code/structures
Generally not cross-platform
– MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc...
Performance gains are modest
– 2x – 4x common
Limited instructions
– Add, multiply, divide, round
– Not suitable for branching logic
Autovectorizing compilers for simple loops
– -ftree-vectorize (GCC), -fast, -O2 or -O3 (Intel Compiler)

Page 13: Medical Image Processing Strategies for multi-core CPUs

Threads

Page 14: Medical Image Processing Strategies for multi-core CPUs

Threads – they’re everywhere

Page 15: Medical Image Processing Strategies for multi-core CPUs

SMP concepts

Useful to think in terms of “cores”
– 2 dual-core CPUs = 4 “cores”
– Cores share main memory, may share cache
– Threads in same process share memory
Generally, one executing thread per core
– Other threads sleeping

Page 16: Medical Image Processing Strategies for multi-core CPUs

Cores – they’re everywhere

How many cores does your laptop have?

Mine has 50(!)
– 2 Intel CPU cores (Core 2 Duo)
– 32 nVidia cores (9600M GT)
– 16 nVidia cores (9400M)

Page 17: Medical Image Processing Strategies for multi-core CPUs

Parallel concepts for SMP

Process
– Started by the OS
– Single thread executes “main”
– No direct access to memory of other processes
Threads
– Stream of execution under a process
– Access to memory in containing process
– Private memory
– Lifetime may be less than main thread
Concurrency
– Coordination between threads
– High level (mutex, locks, barriers)
– Low level (atomic operations)

Page 18: Medical Image Processing Strategies for multi-core CPUs

Processes & Threads

[Figure: Process and Thread]

Page 19: Medical Image Processing Strategies for multi-core CPUs

Thread construction – pthread example

#include <pthread.h>

// Thread work function, must return pointer to void
void *doWork ( void *work ) {
  // Do work
  return work; // equivalent to pthread_exit ( work );
}
...
pthread_t child;
...
rc = pthread_create ( &child, &attr, doWork, (void *)work );
...
rc = pthread_join ( child, &threadwork );
...

Page 20: Medical Image Processing Strategies for multi-core CPUs

Thread construction – Win32 example

#include <windows.h>

DWORD WINAPI doWork ( LPVOID work ) {
  // Do work
  return 0;
}
...
PMYDATA work;
DWORD childID;
HANDLE child;
child = CreateThread (
  NULL,       // default security attributes
  0,          // use default stack size
  doWork,     // thread function name
  work,       // argument to thread function
  0,          // use default creation flags
  &childID ); // returns the thread identifier

// Wait for the single child thread to finish
// (WaitForMultipleObjects takes an array of handles when there are several threads)
WaitForSingleObject ( child, INFINITE );

Page 21: Medical Image Processing Strategies for multi-core CPUs

Thread construction – Java example

import java.lang.Thread;

class Worker implements Runnable {
  public Worker ( Work work ) {}

  public void run() {
    // Do work here
  }
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();

Page 22: Medical Image Processing Strategies for multi-core CPUs

Race Conditions

[Figure: Serial vs. Parallel access; the parallel case is marked "Problem!"]

Page 23: Medical Image Processing Strategies for multi-core CPUs

Mutex

Mutex – Mutual exclusion lock
– Protects a section of code
– Only one thread has a lock on the object
– Threads may
  • wait for the mutex
  • return a status if the mutex is locked
Semaphore
– N threads
Critical Section
– One thread executes code
– Protects global resources
– Maintain consistent state

Page 24: Medical Image Processing Strategies for multi-core CPUs

Race Conditions

...
N = 0;
...
// Start some threads
...

void* doWork() {
  N++; // get, incr, store
}

Solution w/Mutex

Mutex mutex;
...
mutex.lock();
N++; // get, incr, store
mutex.release();
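The mutex fix can be written with the C++11 standard library. This is a minimal sketch, not the talk's code: `countWithMutex` is a made-up name, and `std::mutex` with `std::lock_guard` stands in for the generic `Mutex` class on the slide.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Increment a shared counter from several threads.
// The lock_guard makes the get/increment/store sequence exclusive.
int countWithMutex(int numThreads, int incrementsPerThread) {
  int n = 0;
  std::mutex m;
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([&n, &m, incrementsPerThread]() {
      for (int i = 0; i < incrementsPerThread; ++i) {
        std::lock_guard<std::mutex> lock(m);  // released at end of scope
        ++n;
      }
    });
  }
  for (std::thread& th : threads) th.join();
  return n;
}
```

Without the lock, two threads can read the same old value of `n` and one increment is lost; with it, `countWithMutex(4, 10000)` always returns 40000.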

Page 25: Medical Image Processing Strategies for multi-core CPUs

Atomic operations

Locks are not perfect
– Cause blocking
– Relatively heavy-weight
Atomic operations
– Simple operations
– Hardware support
– Can implement w/Mutex
Conditions
– Invisibility – no other thread knows about the change
– Atomicity – if operation fails, return to original state
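A hedged illustration of both bullets using C++11 `std::atomic`: a lock-free counter, and a small compare-and-swap loop that either succeeds completely or leaves the value untouched. The function names `countWithAtomic` and `atomicMax` are invented for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Same shared counter as before, but with no lock: fetch_add is one
// indivisible read-modify-write operation.
int countWithAtomic(int numThreads, int incrementsPerThread) {
  std::atomic<int> n(0);
  std::vector<std::thread> threads;
  for (int t = 0; t < numThreads; ++t) {
    threads.emplace_back([&n, incrementsPerThread]() {
      for (int i = 0; i < incrementsPerThread; ++i) n.fetch_add(1);
    });
  }
  for (std::thread& th : threads) th.join();
  return n.load();
}

// Store 'value' into 'target' only if it is larger than the current value.
// If another thread changed 'target' in between, the exchange fails,
// 'seen' is refreshed with the current value, and the loop retries.
void atomicMax(std::atomic<int>& target, int value) {
  int seen = target.load();
  while (seen < value && !target.compare_exchange_weak(seen, value)) {
    // retry with the refreshed 'seen'
  }
}
```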

Page 26: Medical Image Processing Strategies for multi-core CPUs

Deadlock

[Figure: threads and two mutexes, Mutex A and Mutex B]
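One common way to avoid the two-mutex deadlock is to acquire both locks in a single step. The sketch below assumes C++11 and uses `std::lock`, which applies a deadlock-avoidance algorithm; `Buffer` and `moveItems` are hypothetical names for this example only.

```cpp
#include <cassert>
#include <mutex>

// A resource protected by its own mutex.
struct Buffer {
  std::mutex m;
  int count;
  explicit Buffer(int c) : count(c) {}
};

// Move items between two buffers. Locking 'from' then 'to' one at a time
// could deadlock if another thread moves items in the opposite direction;
// std::lock acquires both without that risk.
void moveItems(Buffer& from, Buffer& to, int n) {
  assert(&from != &to);
  std::lock(from.m, to.m);
  std::lock_guard<std::mutex> lockFrom(from.m, std::adopt_lock);
  std::lock_guard<std::mutex> lockTo(to.m, std::adopt_lock);
  from.count -= n;
  to.count += n;
}
```

An alternative with the same effect is to agree on a global lock order (for example, always Mutex A before Mutex B).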

Page 27: Medical Image Processing Strategies for multi-core CPUs

Thread synchronization – barrier

Initialized with the number of threads expected
Threads signal when they are ready
– Wait until all expected threads are there
A stalled or dead thread can stall all the threads

Page 28: Medical Image Processing Strategies for multi-core CPUs

Thread synchronization – Condition variables

Workers atomically release mutex and wait
Master atomically releases mutex and signals
Workers wake up and acquire mutex

[Figure: threads, Mutex A, and a Condition; thread states shown: Wait, Working]

Page 29: Medical Image Processing Strategies for multi-core CPUs

Thread pool & Futures

Maintains a “pool” of Worker threads
Work queued until thread available
Optionally notify through a “Future”
– Future can query status, holds return value
Thread returns to pool, no startup overhead
Core concept for OpenMP and TBB
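The "Future" half of this idea can be tried with C++11 `std::async`. This is only a sketch: `sumSlice` and `sumVolume` are invented names, and `std::async` is not required to use a thread pool, so it illustrates queuing work and collecting results rather than pool reuse.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for any per-slice computation: sum the voxels of one slice.
long sumSlice(const std::vector<short>& slice) {
  return std::accumulate(slice.begin(), slice.end(), 0L);
}

// Submit one task per slice, then collect the results through futures.
long sumVolume(const std::vector<std::vector<short> >& slices) {
  std::vector<std::future<long> > futures;
  for (std::size_t i = 0; i < slices.size(); ++i) {
    futures.push_back(
        std::async(std::launch::async, sumSlice, std::cref(slices[i])));
  }
  long total = 0;
  for (std::size_t i = 0; i < futures.size(); ++i) {
    total += futures[i].get();  // blocks until that task has finished
  }
  return total;
}
```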

Page 30: Medical Image Processing Strategies for multi-core CPUs

OpenMP

Page 31: Medical Image Processing Strategies for multi-core CPUs

Introduction to OpenMP

Scatter / gather paradigm
– Maintains a thread pool
Requires compiler support
– Visual C++, GCC 4.2 or later, Intel Compiler
Easy to adapt existing serial code, easy to debug
– Simple paradigm

Page 32: Medical Image Processing Strategies for multi-core CPUs

OpenMP – simple parallel sections

#pragma omp parallel sections num_threads ( 5 )
{
  // 5 Threads scatter here
  #pragma omp section
  {
    // Do task 1
  }
  #pragma omp section
  {
    // Do task 2
  }
  ...
  #pragma omp section
  {
    // Do task N
  }
  // Implicit barrier
}
...

Page 33: Medical Image Processing Strategies for multi-core CPUs

OpenMP – parallel for

#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork( i );
}
// Implicit barrier

Scheduling the iterations
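How iterations are handed to threads is controlled by the standard OpenMP `schedule` clause. The sketch below is an assumption about what "scheduling" refers to here; `doSomeWork` is given a made-up body whose cost grows with `i`, which is the case where a dynamic schedule tends to help. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: later iterations are more expensive.
long doSomeWork(int i) {
  long sum = 0;
  for (int k = 0; k <= i; ++k) sum += k;
  return sum;
}

std::vector<long> runAll(int numberOfIterations) {
  std::vector<long> results(numberOfIterations);
  // schedule(static):     iterations are divided up front, one block per thread
  // schedule(dynamic, 4): idle threads grab the next chunk of 4 iterations
  #pragma omp parallel for schedule(dynamic, 4)
  for (int i = 0; i < numberOfIterations; i++) {
    results[i] = doSomeWork(i);  // each iteration writes its own element
  }
  return results;
}
```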

Page 34: Medical Image Processing Strategies for multi-core CPUs

OpenMP – reduction

int TotalAmountOfWork = 0;

#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork( i );
}
// Implicit barrier

// TotalAmountOfWork was properly accumulated
// Each thread has local copy, barrier does reduction
// No need to use critical sections

Page 35: Medical Image Processing Strategies for multi-core CPUs

OpenMP – “atomic” reduction

int TotalAmountOfWork = 0;

#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier

// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls

Page 36: Medical Image Processing Strategies for multi-core CPUs

OpenMP – critical

int TotalAmountOfWork = 0;

#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );

  #pragma omp critical
  {
    // Executed by one thread at a time, e.g., "Mutex lock"
    criticalOperation();
  }
}
// Implicit barrier

Page 37: Medical Image Processing Strategies for multi-core CPUs

OpenMP – single

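int TotalAmountOfWork = 0;

#pragma omp parallel
{
  // Threads scatter here
  #pragma omp for reduction ( + : TotalAmountOfWork )
  for ( int i = 0; i < NumberOfIterations; i++ ) {
    // each thread has a private copy of i
    TotalAmountOfWork += doSomeWork( i );
  }
  // Implicit barrier at the end of the loop, reduction is complete

  // "single" is a worksharing construct, so it cannot be nested directly
  // inside the loop; it goes in the enclosing parallel region instead
  #pragma omp single nowait
  {
    // Executed by one thread, use "master" for the main thread
    reportProgress ( TotalAmountOfWork );
  }
  // !! No implicit barrier because of "nowait" clause !!
}
// Implicit barrier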

Page 38: Medical Image Processing Strategies for multi-core CPUs

Threading Building Blocks (TBB)

Page 39: Medical Image Processing Strategies for multi-core CPUs

Introduction to TBB

Commercial and Open Source Licenses
– GPL with runtime exception
Cross-platform C++ library
– Similar to STL
– Usual concurrency classes
Several different constructs for threading
– for, do, reduction, pipeline
Finer control over scheduling
Maintains a thread pool to execute tasks
http://www.threadingbuildingblocks.org/

Page 40: Medical Image Processing Strategies for multi-core CPUs

TBB – parallel for

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

class Worker {
 public:
  Worker ( /* ... */ ) {...};
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                    Worker ( /* ... */ ), tbb::auto_partitioner() );

Page 41: Medical Image Processing Strategies for multi-core CPUs

TBB – parallel reduction

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

class ReducingWorker {
  int mLocalWork;
 public:
  ReducingWorker ( /* ... */ ) {...};

  // Splitting constructor: each new worker starts with an empty local sum
  ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork(0) {};
  // Merge the result of another worker into this one
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };

  void operator() ( const tbb::blocked_range<int>& r ) { ... }
  int getLocalWork() const { return mLocalWork; }
};
...
ReducingWorker w;
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                       w, tbb::auto_partitioner() );

w.getLocalWork();

Page 42: Medical Image Processing Strategies for multi-core CPUs

TBB – parallel reduction

Page 43: Medical Image Processing Strategies for multi-core CPUs

TBB – synchronization

tbb::spin_mutex MyMutex;

void doWork ( /* ... */ ) {
  // Enter critical section, exit when lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );

  // NB: This is an error!!!
  // (no named lock object is holding MyMutex, so nothing stays locked)
  // tbb::spin_mutex::scoped_lock ( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;      // Atomic
int i = MyCounter;  // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter; // Atomic
...
MyCounter = 0; MyCounter += 2; // Watch out for other threads!

Page 44: Medical Image Processing Strategies for multi-core CPUs

ITK Model

Page 45: Medical Image Processing Strategies for multi-core CPUs

ITK Implementation

Threads operate across slices
– Only implemented behavior in ITK
itk::MultiThreader is somewhat flexible
– Requires that you break the ITK model
– Uses Thread Join, higher overhead
– No thread pool

Page 46: Medical Image Processing Strategies for multi-core CPUs

Comparison

Threads (C/C++)
+ Fine-grain control
- Not cross-platform
- Few constructs

ITK
+ Integrated
+ Simple
- Limited control
+/- ITK only

TBB
+/- More complex
+ Fine-grain control
+ Intel (-?)
+ Open Source
+ Some constructs
- Must re-write code

OpenMP
+ Simple
+ Adapt existing code
+/- Industry standard
+/- Compiler support
- Coarse-grain control

Language specific (Java)
+ Fine-grain control
+ Cross-platform easy(?)
+ Many constructs
+/- Language-specific

Page 47: Medical Image Processing Strategies for multi-core CPUs

Medical Imaging

Page 48: Medical Image Processing Strategies for multi-core CPUs

Image class

class Image {
 public:
  short* mData;
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  short** mSlicePointers; // Pointers to the start of each slice
  short getVoxel ( int x, int y, int z ) {...}
  void setVoxel ( int x, int y, int z, short v ) {...}
};
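The elided parts of the class could be filled in as follows. This is a minimal sketch under stated assumptions, not the talk's implementation: storage uses `std::vector` instead of raw pointers, voxels are laid out with x varying fastest, and copying is disabled because the slice pointers refer into the object's own buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class Image {
 public:
  int mWidth, mHeight, mDepth;
  int mVoxelsPerSlice;
  int mVoxelsPerVolume;
  std::vector<short> mData;            // all voxels: x fastest, then y, then z
  std::vector<short*> mSlicePointers;  // start of each slice inside mData

  Image(int w, int h, int d)
      : mWidth(w), mHeight(h), mDepth(d),
        mVoxelsPerSlice(w * h), mVoxelsPerVolume(w * h * d),
        mData(static_cast<std::size_t>(w) * h * d, 0),
        mSlicePointers(d) {
    for (int z = 0; z < d; ++z) {
      mSlicePointers[z] =
          mData.data() + static_cast<std::size_t>(z) * mVoxelsPerSlice;
    }
  }
  Image(const Image&) = delete;             // slice pointers would dangle
  Image& operator=(const Image&) = delete;

  short getVoxel(int x, int y, int z) const {
    assert(inside(x, y, z));
    return mSlicePointers[z][y * mWidth + x];
  }
  void setVoxel(int x, int y, int z, short v) {
    assert(inside(x, y, z));
    mSlicePointers[z][y * mWidth + x] = v;
  }

 private:
  bool inside(int x, int y, int z) const {
    return x >= 0 && x < mWidth && y >= 0 && y < mHeight &&
           z >= 0 && z < mDepth;
  }
};
```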

Page 49: Medical Image Processing Strategies for multi-core CPUs

Trivial problem – threshold

Threshold an image
– If intensity > 100, output 1
– otherwise output 0
Present from simple to complex
– OpenMP
– TBB
– ITK
– pthread (see extra slides)

Page 50: Medical Image Processing Strategies for multi-core CPUs

Threshold – OpenMP #1

void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}

// NB: can loop over slices, rows or columns by moving
// pragma, but must choose at compile time

Page 51: Medical Image Processing Strategies for multi-core CPUs

Threshold – OpenMP #2

void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}

// Likely a lot faster than previous code
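A self-contained version of the flat loop, for experimentation. This sketch replaces the `Image` class with a plain `std::vector<short>` and takes the threshold level as a parameter; both are choices made for this example. Built without OpenMP the pragma is ignored and the function still returns the same result.

```cpp
#include <cassert>
#include <vector>

// Output 1 where intensity > level, otherwise 0.
// Every iteration reads in[s] and writes out[s] only, so iterations are
// independent and no synchronization is needed.
std::vector<short> threshold(const std::vector<short>& in, short level) {
  std::vector<short> out(in.size());
  const long n = static_cast<long>(in.size());
  #pragma omp parallel for
  for (long s = 0; s < n; s++) {
    out[s] = (in[s] > level) ? 1 : 0;
  }
  return out;
}
```

The flat loop avoids the per-voxel index arithmetic of `getVoxel`/`setVoxel` and walks memory contiguously, which is a plausible reason for it being faster than the triple loop.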

Page 52: Medical Image Processing Strategies for multi-core CPUs

Threshold – TBB #1

class Threshold {
  Image* in;
  Image* out;
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int x = r.begin(); x != r.end(); ++x ) {
      if ( in->mData[x] > 100 ) {
        out->mData[x] = 1;
      } else {
        out->mData[x] = 0;
      }
    }
  }
};

...

parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
               Threshold ( in, out ), auto_partitioner() );
// NB: default "grain size" for blocked_range is 1 pixel
// tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )

Page 53: Medical Image Processing Strategies for multi-core CPUs

Threshold – TBB #2

class Threshold {
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
    // Each task handles a 2D tile (rows x cols) through every slice
    for ( int z = 0; z < in->mDepth; z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...

parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                              0, in->mWidth, 32 ),
               Threshold ( in, out ), auto_partitioner() );

Page 54: Medical Image Processing Strategies for multi-core CPUs

Threshold – TBB #3

class Threshold {
 public:
  Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
  void operator() ( const tbb::blocked_range<int>& r ) const {...}
  void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          if ( in->getVoxel(x,y,z) > 100 ) {
            out->setVoxel(x,y,z,1);
          } else {
            out->setVoxel(x,y,z,0);
          }
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
               Threshold ( in, out ), auto_partitioner() );

Page 55: Medical Image Processing Strategies for multi-core CPUs

Threshold – ITK solution

ThreadedGenerateData ( const OutputImageRegionType& out, int threadId )
{
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt ( inputPtr, out );
  ImageRegionIterator<TOut> outputIt ( outputPtr, out );

  inputIt.GoToBegin();
  outputIt.GoToBegin();

  while ( !inputIt.IsAtEnd() ) {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
  }
}

Page 56: Medical Image Processing Strategies for multi-core CPUs

Interesting problem – anisotropic diffusion

Edge preserving smoothing method
– Perona and Malik. Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on (1990) vol. 12 (7) pp. 629 – 639
Iterative process
Demonstrate
– OpenMP
– TBB
– (ITK has an implementation)
– (pthreads are tedious at the very least)
Pop quiz – are the following correct?

Page 57: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}

Page 58: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...

Page 59: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – OpenMP

void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
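In OpenMP, variables declared outside a parallel region are shared by default, while variables declared inside the loop body are private to each thread. The sketch below shows a complete, runnable loop structure built on that rule. It is deliberately simplified: it performs plain linear diffusion (no edge-stopping conductance as in Perona and Malik), and `Volume`, `diffuseStep`, and `diffuse` are names invented for this example. Time steps stay serial because each step depends on the previous one, and each step reads one buffer and writes another.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Volume {
  int w, h, d;
  std::vector<float> v;
  Volume(int w_, int h_, int d_)
      : w(w_), h(h_), d(d_), v(static_cast<std::size_t>(w_) * h_ * d_, 0.0f) {}
  std::size_t index(int x, int y, int z) const {
    return (static_cast<std::size_t>(z) * h + y) * w + x;
  }
  float at(int x, int y, int z) const { return v[index(x, y, z)]; }
  float& at(int x, int y, int z) { return v[index(x, y, z)]; }
};

// One explicit diffusion step over the interior voxels.
// Reads only 'in' and writes only 'out', so slices can run in parallel.
void diffuseStep(const Volume& in, Volume& out, float lambda) {
  #pragma omp parallel for
  for (int z = 1; z < in.d - 1; z++) {
    for (int y = 1; y < in.h - 1; y++) {
      for (int x = 1; x < in.w - 1; x++) {
        // Locals declared here are private to the executing thread.
        const float c = in.at(x, y, z);
        const float lap = in.at(x - 1, y, z) + in.at(x + 1, y, z) +
                          in.at(x, y - 1, z) + in.at(x, y + 1, z) +
                          in.at(x, y, z - 1) + in.at(x, y, z + 1) - 6.0f * c;
        out.at(x, y, z) = c + lambda * lap;
      }
    }
  }
}

// Serial time loop; the result ends up in 'a'. Border voxels are unchanged.
void diffuse(Volume& a, Volume& b, int steps, float lambda) {
  b.v = a.v;  // give both buffers the same border values
  for (int t = 0; t < steps; t++) {
    diffuseStep(a, b, lambda);
    std::swap(a.v, b.v);
  }
}
```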

Page 60: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – TBB #1

class doAD {
 public:
  static ADConstants* sConstants;
  doAD ( Image* in, Image* out ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    if ( sConstants == NULL ) {
      initConstants();
    }
    // process
    ...
  }
};

Page 61: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – TBB #2

class doAD {
 public:
  doAD ( ... ) {...}
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
               doAD ( in, out ), auto_partitioner() );

Page 62: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – TBB #3

class doAD {
 public:
  static tbb::atomic<int> sProgress;
  tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for (...);

Page 63: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – TBB #4

class doAD {
 public:
  static tbb::atomic<int> sProgress;
  static tbb::spin_mutex mMutex;
  doAD ( ... ) {...}
  void reportProgress ( int p ) { ... }
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
    for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
      tbb::spin_mutex::scoped_lock lock ( mMutex );
      sProgress++;
      reportProgress ( sProgress );
      for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
        for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
          ...
        }
      }
    }
  }
};
...
doAD::sProgress = 0;
parallel_for (...);

Page 64: Medical Image Processing Strategies for multi-core CPUs

nowait

Anisotropic diffusion – OpenMP (Progress)

using namespace std;
void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single
      reportProgress ( progress );
      ...
    }
  }
}
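One possible way to count and report progress from inside a parallel loop is sketched below. It is an assumption-laden example rather than the talk's answer: `ProgressLog` is a hypothetical, non-thread-safe reporter, `atomic capture` (OpenMP 3.1) gives each iteration its own counter value, and a `critical` section serializes access to the reporter. Without OpenMP the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reporter: records each progress value it is given.
struct ProgressLog {
  std::vector<int> seen;
  void report(int p) { seen.push_back(p); }
};

int processSlices(int depth, ProgressLog& log) {
  int progress = 0;
  #pragma omp parallel for
  for (int s = 0; s < depth; s++) {
    // ... process slice s here ...

    int mine;  // private: declared inside the loop body
    #pragma omp atomic capture
    mine = ++progress;  // increment and read back in one atomic step

    // Only one thread at a time may touch the reporter.
    #pragma omp critical
    log.report(mine);
  }
  return progress;
}
```

Reporting on every slice keeps the example short; a real reporter would probably throttle updates so the critical section is entered rarely.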

Page 65: Medical Image Processing Strategies for multi-core CPUs

Real-life problem

Compute Frangi’s vesselness measure
– Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
Memory constrained solution
– ITK implementation requires 1.2G for 100M volume
  • Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
Possible solutions using
– OpenMP, TBB

Page 66: Medical Image Processing Strategies for multi-core CPUs

Vesselness

Page 67: Medical Image Processing Strategies for multi-core CPUs

ITK Implementation – computing the Hessian

6 volumes computed in serial
– Individual filters are threaded
– Good CPU usage
– High memory requirements

Page 68: Medical Image Processing Strategies for multi-core CPUs

Design considerations

Break problem into blocks
– Compute Hessian, eigenvalues, and vesselness
– Reduces memory requirements
– Incurs overhead, boundary conditions

Page 69: Medical Image Processing Strategies for multi-core CPUs

Design considerations

Keep CPUs full

Page 70: Medical Image Processing Strategies for multi-core CPUs

Design considerations – boundary condition

Page 71: Medical Image Processing Strategies for multi-core CPUs

Trade-offs

Page 72: Medical Image Processing Strategies for multi-core CPUs

Algorithm sketch – Serial

int BlockSize = 32;
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
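When the image size is not a multiple of the block size, the last block in each direction must be clipped, and filters with a neighborhood need extra voxels around each block. The sketch below enumerates such blocks. The `Block` struct, `makeBlocks`, and the `halo` parameter are assumptions made for this illustration; the slides do not show how `processBlock` handles boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open voxel ranges [x0,x1) x [y0,y1) x [z0,z1).
struct Block {
  int x0, y0, z0;
  int x1, y1, z1;
};

// Cover the volume with blocks of edge length blockSize. Each block is
// grown by 'halo' voxels on every side and then clamped to the volume.
std::vector<Block> makeBlocks(int width, int height, int depth,
                              int blockSize, int halo) {
  assert(blockSize > 0 && halo >= 0);
  std::vector<Block> blocks;
  for (int z = 0; z < depth; z += blockSize) {
    for (int y = 0; y < height; y += blockSize) {
      for (int x = 0; x < width; x += blockSize) {
        Block b;
        b.x0 = std::max(x - halo, 0);
        b.y0 = std::max(y - halo, 0);
        b.z0 = std::max(z - halo, 0);
        b.x1 = std::min(x + blockSize + halo, width);
        b.y1 = std::min(y + blockSize + halo, height);
        b.z1 = std::min(z + blockSize + halo, depth);
        blocks.push_back(b);
      }
    }
  }
  return blocks;
}
```

The resulting list can be processed in any order, which makes it a convenient unit of work for either OpenMP or TBB. The halo is the overhead mentioned under design considerations: neighboring blocks recompute or reread the overlapping voxels.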

Page 73: Medical Image Processing Strategies for multi-core CPUs

Algorithm sketch – OpenMP

int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

Each thread is on a different slice
– May cause cache contention
– Similar problems for “y” direction

Page 74: Medical Image Processing Strategies for multi-core CPUs

Algorithm sketch – OpenMP

int BlockSize = 32;
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    #pragma omp parallel for
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}

All threads on same rows
– May not utilize all CPUs
  • If ratio of Width to BlockSize < # CPUs
– Better cache utilization

Page 75: Medical Image Processing Strategies for multi-core CPUs

Algorithm sketch – TBB

Individual blocks
– Full CPUs
– May not have best cache performance

class Vesselness {
 public:
  void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
    // Process the block, could use ITK here
    processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                   r.cols().size(), r.rows().size(), r.pages().size() );
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 32,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth, 32 ),
               Vesselness ( in, out ), auto_partitioner() );

Page 76: Medical Image Processing Strategies for multi-core CPUs

Next steps

Go try parallel development
– Try threads to gain understanding and insight
– Next OpenMP, adapting existing code
– TBB: more constructs, different approaches
Experiment with new languages
– Erlang, Scala, Reia, Chapel, X10, Fortress...
Check out some of the resources provided
Have fun! It’s a brave new world out there...

Page 77: Medical Image Processing Strategies for multi-core CPUs

Resources

TBB (http://www.threadingbuildingblocks.org/)
OpenMP (http://openmp.org/wp/)
Books/Articles
– Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
– Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
– ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
– The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
Tutorials
– Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
– pthreads (https://computing.llnl.gov/tutorials/pthreads/)
– OpenMP (https://computing.llnl.gov/tutorials/openMP/)
Other
– LLNL (https://computing.llnl.gov/)
– Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
– GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
– Intel Compiler (http://software.intel.com/en-us/intel-compilers/)

Page 78: Medical Image Processing Strategies for multi-core CPUs

Resources

Languages
– Erlang (http://www.erlang.org/)
– Scala (http://www.scala-lang.org/)
– Chapel (http://chapel.cs.washington.edu/)
– X10 (http://x10-lang.org/)
– Unified Parallel C (http://upc.gwu.edu/)
– Titanium (http://titanium.cs.berkeley.edu/)
– Co-Array Fortran (http://www.co-array.org/)
– ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
– High Performance Fortran (http://hpff.rice.edu/)
– Fortress (http://projectfortress.sun.com/Projects/Community/)
– Others (http://www.google.com/search?q=parallel+programming+language)

Page 79: Medical Image Processing Strategies for multi-core CPUs

Medical image processing strategies for multi-core CPUs
Daniel Blezek, Mayo Clinic
[email protected]

Page 80: Medical Image Processing Strategies for multi-core CPUs

Thread construction – pthread example

#include <pthread.h>

void *(*start_routine)(void *);

int
pthread_create ( pthread_t *restrict thread,
                 const pthread_attr_t *restrict attr,
                 void *(*start_routine)(void *),
                 void *restrict arg );

void
pthread_exit ( void *value_ptr );

int
pthread_join ( pthread_t thread, void **value_ptr );

Page 81: Medical Image Processing Strategies for multi-core CPUs

Mutex – pthread example

#include <pthread.h>
pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical Section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
// pthread_mutex_trylock returns 0 on success, EBUSY if the mutex is already locked
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}

Page 82: Medical Image Processing Strategies for multi-core CPUs

Mutex – Java example

import java.lang.*;

class Foo {
  public synchronized int doWork () {
    // only one thread can execute doWork
    ...
  }

  Object resource;
  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}

Page 83: Medical Image Processing Strategies for multi-core CPUs

Threshold – pthread

struct Work { Image* in; Image* out; int start; int end; };
Work workArray[THREADCOUNT];
pthread_t thread[THREADCOUNT];

void* doThreshold ( void* inWork ) {
  Work* work = (Work*) inWork;
  for ( int s = work->start; s < work->end; s++ ) {...}
  return NULL;
}
...
pthread_attr_t attributes;
pthread_attr_init ( &attributes );
pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE );
for ( int t = 0; t < THREADCOUNT; t++ ) {
  initializeWork ( in, out, t, workArray[t] );
  // Pass the address of this thread's Work item
  pthread_create ( &thread[t], &attributes, doThreshold, (void*) &workArray[t] );
}
for ( int t = 0; t < THREADCOUNT; t++ ) {
  pthread_join ( thread[t], NULL );
}
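The slide does not show `initializeWork`. A plausible core for it is a helper that splits the voxel range into nearly equal, non-overlapping pieces, one per thread. The sketch below is hypothetical: `Range` and `partition` are names introduced here, and the even split is only one reasonable policy.

```cpp
#include <cassert>

// Half-open index range [start, end).
struct Range {
  int start;
  int end;
};

// Split [0, total) into threadCount contiguous ranges and return range t.
// The first (total % threadCount) ranges get one extra element, so sizes
// differ by at most one and every index is covered exactly once.
Range partition(int total, int threadCount, int t) {
  assert(total >= 0 && threadCount > 0 && t >= 0 && t < threadCount);
  const int base = total / threadCount;
  const int extra = total % threadCount;
  Range r;
  r.start = t * base + (t < extra ? t : extra);
  r.end = r.start + base + (t < extra ? 1 : 0);
  return r;
}
```

With this helper, thread `t` would be given `start` and `end` from `partition ( mVoxelsPerVolume, THREADCOUNT, t )`.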

Page 84: Medical Image Processing Strategies for multi-core CPUs

Insight Toolkit

Page 85: Medical Image Processing Strategies for multi-core CPUs

Semaphore

Allow N threads access
– Protects limited resources
Binary semaphore
– N = 1
– Equivalent to Mutex

Page 87: Medical Image Processing Strategies for multi-core CPUs

ITK – itk::MultiThreader

#include <itkMultiThreader.h>

// Win32
DWORD doWork ( LPVOID lpThreadParameter );
// Pthread - Linux, Mac, Unix
void* doWork ( void* inWork );

itk::MultiThreader::Pointer threader = itk::MultiThreader::New();

threader->SetNumberOfThreads ( NumberOfThreads );
for ( int i = 0; i < NumberOfThreads; i++ ) {
  threader->SetMultipleMethod ( i, doWork, (void*) work[i] );
}
// Explicit barrier, waits for Thread join
threader->MultipleMethodExecute();

Page 88: Medical Image Processing Strategies for multi-core CPUs

Insight Toolkit

#include <itkImageToImageFilter.h>

template <class In, class Out>
class Worker : public itk::ImageToImageFilter<In, Out> {
  ...
  void BeforeThreadedGenerateData() {
    // Master thread only
    ...
  }
  void ThreadedGenerateData ( const OutputImageRegionType &r, int tid ) {
    // Generate output data for r
    ...
  }
  void AfterThreadedGenerateData() {
    // Master thread only
    ...
  }
  ...
};

// Output split on last dimension
// i.e. Slices for 3D volumes

Page 89: Medical Image Processing Strategies for multi-core CPUs

Anisotropic diffusion – OpenMP

using namespace std;
void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int slice = 0; slice < in->mDepth; slice++ ) {
      ...
    }
  }
}