Scheduling and Performance Issues for Programming using OpenMP Alan Real ISS, University of Leeds


Page 1: Scheduling and Performance Issues for Programming using OpenMP

Scheduling and Performance Issues for Programming using OpenMP

Alan Real ISS, University of Leeds

Page 2: Scheduling and Performance Issues for Programming using OpenMP

Overview

• Scheduling for worksharing constructs:
– STATIC.
– DYNAMIC.
– GUIDED.
– RUNTIME.

• Performance.

Page 3: Scheduling and Performance Issues for Programming using OpenMP

SCHEDULE clause

• Gives control over which loop iterations are executed by which thread.

• Syntax:
– Fortran: SCHEDULE(kind[,chunksize])
– C/C++: schedule(kind[,chunksize])

• kind is one of STATIC, DYNAMIC, GUIDED or RUNTIME.
• chunksize is an integer expression with positive value; e.g.:

!$OMP DO SCHEDULE(DYNAMIC,4)

• Default is implementation dependent.

Page 4: Scheduling and Performance Issues for Programming using OpenMP

STATIC schedule

• With no chunksize specified:
– The iteration space is divided into (approximately) equal chunks, and one chunk is assigned to each thread (block schedule).

• If chunksize is specified:
– The iteration space is divided into chunks, each of chunksize iterations. The chunks are assigned cyclically to each thread (block cyclic schedule).

Page 5: Scheduling and Performance Issues for Programming using OpenMP

STATIC schematic

• Block-cyclic schedule is good for triangular matrices, to avoid load imbalance; e.g.:

      do i=1,n
        do j=i,n
          block
        end do
      end do

[Schematic: iterations 1–46 on threads T0–T3. SCHEDULE(STATIC) assigns each thread one contiguous block; SCHEDULE(STATIC,4) deals chunks of 4 iterations to the threads cyclically: T0 T1 T2 T3 T0 T1 T2 T3 …]

Page 6: Scheduling and Performance Issues for Programming using OpenMP

DYNAMIC schedule

• Divides the iteration space up into chunks of size chunksize and assigns them to threads on a first-come first-served basis.

• i.e. as a thread finishes a chunk, it is assigned the next chunk in the list.

• When no chunksize is specified it defaults to 1.

Page 7: Scheduling and Performance Issues for Programming using OpenMP

GUIDED schedule

• Similar to DYNAMIC, but the chunks start off large and get smaller exponentially.

• The size of the next chunk is (roughly) the number of remaining iterations divided by the number of threads.

• The chunksize specifies the minimum size of the chunks.

• When no chunksize is specified, it defaults to 1.

Page 8: Scheduling and Performance Issues for Programming using OpenMP

DYNAMIC and GUIDED

[Schematic: iterations 1–46 on threads T0–T3. SCHEDULE(GUIDED,3) starts with large chunks that shrink towards the minimum size of 3; SCHEDULE(DYNAMIC,3) hands out chunks of 3 to whichever thread finishes first.]

Page 9: Scheduling and Performance Issues for Programming using OpenMP

RUNTIME schedule

• Allows the choice of schedule to be deferred until runtime

• Set by the environment variable OMP_SCHEDULE:

      % setenv OMP_SCHEDULE "guided,4"

• There is no chunksize associated with the RUNTIME schedule

Page 10: Scheduling and Performance Issues for Programming using OpenMP

Choosing a schedule

• STATIC is best for balanced loops – least overhead.
• STATIC,n is good for loops with mild or smooth load imbalance – but can introduce “false sharing” (see later).
• DYNAMIC is useful if iterations have widely varying loads, but ruins data locality (can cause many cache misses).

• GUIDED often less expensive than DYNAMIC, but beware of loops where first iterations are the most expensive!

• Use RUNTIME for convenient experimentation

Page 11: Scheduling and Performance Issues for Programming using OpenMP

Performance

• Easy to write OpenMP but hard to write an efficient program!

• 5 main causes of poor performance:
– Sequential code.
– Communication.
– Load imbalance.
– Synchronisation.
– Compiler (non-)optimisation.

Page 12: Scheduling and Performance Issues for Programming using OpenMP

Sequential code

• Amdahl’s law: parallel code will never run faster than the parts which can only be executed in serial.
– Limits performance.
• Need to find ways of parallelising it!
• In OpenMP, all code outside of parallel regions, and inside MASTER, SINGLE and CRITICAL directives, is sequential.
– This code should be as small as possible.

Page 13: Scheduling and Performance Issues for Programming using OpenMP

Communication

• On shared memory machines, communication is “disguised” as increased memory access costs.
– It takes longer to access data in main memory or another processor’s cache than it does from local cache.
• Memory accesses are expensive!
– ~100 cycles for a main memory access, compared to 1–3 cycles for a flop.
• Unlike message passing, communication is spread throughout the program.
– Much harder to analyse and monitor.

Page 14: Scheduling and Performance Issues for Programming using OpenMP

Caches and coherency

• Shared memory programming assumes that a shared variable has a unique value at a given time.
• Caching means that multiple copies of a memory location may exist in the hardware.
• To avoid two processors caching different values of the same memory location, caches must be kept coherent.
• Coherency operations are usually performed on the cache lines in the level of cache closest to memory (level 2 cache on most systems).
• Maxima has:
– 64KB Level 1 cache, 32-byte cache lines (4 d.p. words).
– 8MB Level 2 cache, 512-byte cache lines (64 d.p. words).

Page 15: Scheduling and Performance Issues for Programming using OpenMP

Memory hierarchy

[Diagram: four processors (P), each with a private level 1 cache (L1C) and level 2 cache (L2C), connected through an interconnect to shared memory.]

Page 16: Scheduling and Performance Issues for Programming using OpenMP

Coherency

• Highly simplified view:
– Each cache line can exist in one of 3 states:
• Exclusive: the only valid copy in any cache.
• Read-only: a valid copy, but other caches may contain it.
• Invalid: out of date and cannot be used.
– A READ on an invalid or absent cache line will be cached as read-only.
– A READ on an exclusive cache line will be changed to read-only.
– A WRITE on a line not in an exclusive state will cause all other copies to be marked invalid, and the written line to be marked exclusive.

Page 17: Scheduling and Performance Issues for Programming using OpenMP

Coherency example

[Diagram: the memory hierarchy above, annotated with example reads (R) and writes (W) and the resulting Exclusive, Read-Only and Invalid line states in each cache.]

Page 18: Scheduling and Performance Issues for Programming using OpenMP

False sharing

• Cache lines consist of several words of data.
• What happens when two processors are both writing to different words on the same cache line?
– Each write will invalidate the other processor’s copy.
– Lots of remote memory accesses.
• Symptoms:
– Poor speedup.
– High, non-deterministic numbers of cache misses.
– Mild, non-deterministic, unexpected load imbalance.

Page 19: Scheduling and Performance Issues for Programming using OpenMP

False sharing example 1

      integer count(maxthreads)
!$OMP PARALLEL
      :
      count(myid) = count(myid) + 1

– Each thread writes a different element of the same array.
– Words with consecutive addresses are more than likely in the same cache line.
• Solution:
– Pad the array by the dimension of the cache line:

      parameter (linesize=16)
      integer count(linesize,maxthreads)
!$OMP PARALLEL
      :
      count(1,myid) = count(1,myid) + 1

Page 20: Scheduling and Performance Issues for Programming using OpenMP

False sharing example 2

• Small chunk sizes:

!$OMP DO SCHEDULE(STATIC,1)
      do j = 1,n
        do i = 1,j
          b(j) = b(j) + a(i,j)
        end do
      end do

• Will induce false sharing on b.

Page 21: Scheduling and Performance Issues for Programming using OpenMP

Data affinity

• Data is cached on the processors which access it.
– Must reuse cached data as much as possible.
• Write code with good data affinity:
– Ensure the same thread accesses the same subset of program data as much as possible.
• Try to make these subsets large, contiguous chunks of data.
– Will avoid false sharing and other problems.

Page 22: Scheduling and Performance Issues for Programming using OpenMP

Data affinity example

!$OMP DO PRIVATE(i)
      do j=1,n
        do i=1,n
          a(i,j) = i+j
        end do
      end do

!$OMP DO SCHEDULE(STATIC,16)
!$OMP& PRIVATE(i)
      do j=1,n
        do i=1,j
          b(j) = b(j) + a(i,j)
        end do
      end do

• 1st 16 j’s OK – rest are cache misses!

[Diagram: the (i,j) iteration space of a, showing that the STATIC,16 chunks of j in the second loop only partly line up with the columns each thread cached during the initialisation loop.]

Page 23: Scheduling and Performance Issues for Programming using OpenMP

Distributed shared memory systems

• Location of data in main memory is important.
– OpenMP has no support for controlling this.

• Normally data is placed on the first processor which accesses it.

• Data placement can be controlled indirectly by parallelisation of data initialisation.

• Not an issue on Maxima as CPUs have reasonably uniform memory access within each machine.

Page 24: Scheduling and Performance Issues for Programming using OpenMP

Load imbalance

• Load imbalance can arise from both communication and computation.

• Worth experimenting with different scheduling options– Can use SCHEDULE(RUNTIME).

• If none are appropriate, may be best to do your own scheduling!

Page 25: Scheduling and Performance Issues for Programming using OpenMP

Synchronisation

• Barriers can be very expensive:
– Typically 1000s of cycles to synchronise 32 processors.

• Avoid barriers via:
– Careful use of the NOWAIT clause.
– Parallelise at the outermost level possible.
• May require re-ordering of loops and/or indices.

– Consider using point-to-point synchronisation for nearest-neighbour type problems.

– Choice of CRITICAL / ATOMIC / lock routines may impact performance.

Page 26: Scheduling and Performance Issues for Programming using OpenMP

Compiler (non-)optimisation

• Sometimes the addition of parallel directives can inhibit the compiler from performing sequential optimisations.

• Symptoms:
– 1-thread parallel code has longer execution time and higher instruction count than sequential code.

• Can sometimes be cured by making shared data private, or local to a routine.

Page 27: Scheduling and Performance Issues for Programming using OpenMP

Performance tuning

• My code is giving me poor speedup. I don’t know why. What do I do now?
• A:
– Say “this machine/language is a heap of junk”.
– Give up and go back to your workstation/PC.
• B:
– Try to classify and localise the sources of overhead.
• What type of problem is it, and where in the code does it occur?
– Use any available tools that can help you:
• Add timers, use profilers – e.g. the Sun Performance Analyzer.
– Fix the problems that are responsible for large overheads first.
– Iterate.

• Good Luck!