Exploring Parallel Computing
DESCRIPTION
Presentation in Numerical Methods in Quantum Physics, covering essential questions and comparing two major branches of parallelization, OpenMP and MPI, with two working examples: a simple matrix-matrix multiplication and the approximation of pi.
TRANSCRIPT
Exploring Parallel Computing
Fabian Frie
Numerical Methods in Quantum Physics
February 6th, 2014
F. Frie, Hauptseminar NMQP | Exploring Parallel Computing | February 2014
Syllabus
1 Introduction: What is Parallel Computing?, Scalability
2 Parallel Programming Models: Memory Models, Exploring MPI, Exploring OpenMP, Comparison
3 Examples with OpenMP: Matrix Matrix Multiplication, Approximation of π
4 Conclusion
What is Parallel Computing? (Introduction)
- Parallelization is another optimization technique to reduce execution time.
- Thread: a series of instructions for a processing unit.
- Coarse-grain parallelism: parallelization achieved by distributing domains over different processors.
- Fine-grain parallelism: parallelization achieved by distributing iterations equally over different processors.
Scalability I: Amdahl's Law
- Define the speed-up with respect to the number of threads n by S(n) = Δt(1) / Δt(n).
- Unless the application is embarrassingly parallel, S(n) will deviate from the ideal curve.
- Assume the program has a parallel fraction f; then with n processors the execution time changes according to Δt(n) = (f/n)·Δt(1) + (1 − f)·Δt(1).
Scalability II: Amdahl's Law
- Amdahl's Law states: if the fraction f of a program can be made parallel, then the maximum speedup that can be achieved by using n threads is S(n) = 1 / ((1 − f) + f/n).
Scalability III: Amdahl's Law
Memory Architectures: Shared ↔ Distributed
- Shared memory architectures
  - Symmetric Multi-Processor (SMP): a shared address space with equal access cost for each processor.
  - Non-Uniform Memory Access (NUMA): different memory regions have different access costs.
- Distributed memory architectures
  - Clusters: each processor acts on its own private memory space; for remote data, communication is required.
Shared Memory Architecture: Intel Core i7 980X Extreme Edition
Exploring MPI I: What is MPI?
- MPI ≡ »Message Passing Interface«
- MPI is an extensive parallel programming API for distributed memory (clusters, grids)
- First introduced in 1994
- MPI supports C, C++, and Fortran
- All data is private to each processing unit
- Data communication must be programmed explicitly
Exploring MPI II: What is MPI?
Pros
- Flexibility: can use any cluster of any size
- Widely available
- Widely used: popular in high-performance computing
Cons
- Requires a redesign of the application
- More resources required: typically more memory
- Error-prone and hard to debug, due to many layers
Exploring OpenMP: Parallel Programming Models
- OpenMP ≡ »Open Multi Processing« (API)
- OpenMP is built for shared memory architectures such as Symmetric Multi-Processing (SMP) machines
- Supports both coarse-grained and fine-grained parallelism
- Data can be shared or private
- All threads have access to the same, shared memory
- Mostly implicit synchronization
Comparison: Parallel Programming Models

MPI
- Popular, widely used
- Ready for grids
- Steep learning curve
- No data scoping (shared, private, ...)
- Sequential code is not preserved
- Requires only one library
- Easier model
- Requires a runtime environment

OpenMP
- Popular, widely used
- Limited to one system (SMP), not grid ready
- Easy to learn
- Data scoping required
- Preserves sequential code
- Requires compiler support
- Performance issues are implicit
- No runtime environment required
Simple Tasks with OpenMP: Examples

Matrix Matrix Multiplication
C = AB                                (1)
C_ij = Σ_k A_ik · B_kj                (2)

Approximation of π
∫₀¹ 4/(1 + x²) dx = [4 arctan(x)]₀¹ = π   (3, 4)
Σ_{i=0}^{N} 4/(1 + x_i²) · Δx ≈ π          (5)

⇒ How efficiently can these problems be parallelized?
Matrix Matrix Multiplication: Examples
Approximation of π: Examples
Approximation of π: Source Code

program integ_pi
  use omp_lib
  implicit none

  integer(kind=8) :: ii, num_steps, jj
  integer :: tid, nthreads
  real(kind=8) :: step, xx, pi, summ, start_time, run_time

  num_steps = 100000000
  step = 1d0/dble(num_steps)

  do jj = 1,8 ! Number of requested threads
    pi = 0d0
    call omp_set_num_threads(jj)
    start_time = omp_get_wtime()
    nthreads = omp_get_num_threads() ! outside a parallel region this returns 1

    !$omp single
    write(*,*) "Number of threads: ", nthreads
    !$omp end single

    !$omp parallel do reduction(+:pi) private(ii,xx)
    do ii = 0, num_steps-1 ! midpoints of the num_steps subintervals
      xx = (dble(ii)+0.5d0) * step
      pi = pi + 4d0 / (1d0 + xx*xx)
    enddo
    !$omp end parallel do

    run_time = omp_get_wtime() - start_time
    pi = pi * step
    write(*,*) "pi approx ", pi
    write(*,*) "wtime: ", run_time
  enddo
end program integ_pi
Wrap Up: Prospects
- Hybrid parallelism: combine MPI and OpenMP
- Nested parallelism: divide-and-conquer principle
- Pitfalls: data races and deadlocks
Thank you for your attention!
Enjoy your meal!
References
- Ruud van der Pas, Barbara Chapman, Gabriele Jost. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, Cambridge.
- Miguel Hermanns. "Parallel Programming in Fortran 95 using OpenMP". School of Aeronautical Engineering, 2002.
- Timothy G. Mattson. "A Hands-on Introduction to OpenMP". OpenMP Architecture Review Board, 2008.
- Ruud van der Pas. "Basic Concepts in Parallelization". IWOMP 2010, CCS, University of Tsukuba, 2010.
- W. H. Press et al. Numerical Recipes: The Art of Scientific Computing. 3rd ed. Cambridge University Press, 2007.
,Page 24 | F. Frie, Haupseminar NMQP | Exploring Parallel Computing | February 2014
Matrix Matrix Multiplication ISource Code
1 program matmult2 use omp_lib3 implicit none4
5 integer nra, nca, ncb, tid, nthreads, ii, jj, kk, chunk,nn6 parameter (nra=900)7 parameter (nca=900)8 parameter (ncb=100)9 real*8 a(nra,nca), b(nca,ncb), c(nra,ncb), time
10
11 chunk = 1012 do nn = 1,813 call omp_set_num_threads(nn)14 !$omp parallel shared(a,b,c,nthreads,chunk) private(tid,ii,jj,kk)15 tid = omp_get_thread_num()16
17 ! !$omp single18 ! write(*,*) "threads: ", omp_get_num_threads()
,Page 25 | F. Frie, Haupseminar NMQP | Exploring Parallel Computing | February 2014
Matrix Matrix Multiplication IISource Code
19 ! !$omp end single20
21 !$omp do schedule(static,chunk)22 do ii = 1, nra23 do jj = 1, nca24 a(ii,jj) = (ii-1)+(jj-1)25 enddo26 enddo27 !$omp end do28
29 !$omp do schedule(static,chunk)30 do ii = 1, nca31 do jj = 1, ncb32 b(ii,jj) = (ii-1)*(jj-1)33 enddo34 enddo35 !$omp end do36
,Page 26 | F. Frie, Haupseminar NMQP | Exploring Parallel Computing | February 2014
Matrix Matrix Multiplication IIISource Code
37 !$omp do schedule(static,chunk)38 do ii = 1, nra39 do jj = 1, ncb40 c(ii,jj) = 0d041 enddo42 enddo43 !$omp end do44
45 time = omp_get_wtime()46 !$omp do schedule(static,chunk)47 do ii = 1,nra48 do jj = 1,ncb49 do kk =1,nca50 c(ii,jj) = c(ii,jj) + a(ii,kk) * b(kk,jj)51 enddo52 enddo53 enddo54 !$omp end do
,Page 27 | F. Frie, Haupseminar NMQP | Exploring Parallel Computing | February 2014
Matrix Matrix Multiplication IVSource Code
55
56 !$omp end parallel57 write(*,*) omp_get_wtime() - time58 enddo59 endprogram