OpenMP loops Paolo Burgio [email protected]


Page 2: OpenMP under the hood (algo.ing.unimo.it/.../PP1617/13-OMP_Loops.pdf)

© 2016 Università di Modena e Reggio Emilia

Expressing parallelism

– Understanding parallel threads

Memory / data management

– Data clauses

Synchronization

– Barriers, locks, critical sections

Work partitioning

– Loops, sections, single work, tasks…

Execution devices

– Target

Parallel Programming LM – 2016/17 2

Outline

Page 3

Threads

– How to create and properly manage a team of threads

– How to join them with barriers

Memory

– How to create private and shared variables storages

– How to properly ensure memory consistency among parallel threads

Data synchronization

– How to create locks to implement, e.g., mutual exclusion

– How to identify Critical Sections

– How to ensure atomicity on single statements


What we saw so far..

Page 4

But..how can we split an existing workload among parallel threads?

– Say, a loop

Typical scenario

1. Analyze sequential code from customer/boss

2. Parallelize it with OpenMP (for a "generic" parallel machine)

3. Tune num_threads for specific machine

4. Get money/congratulations from customer/boss

Might not be as easy as with PI Montecarlo!

How to do 2. without rewriting/re-engineering the code?


Work sharing between threads

Page 5

Create an array of N elements

– Put inside each array element its index, multiplied by '2'

– arr[0] = 0; arr[1] = 2; arr[2] = 4; ...and so on..

Now, do it in parallel with a team of T threads

– N = 19, with T = 3 or T = 4 threads (N not a multiple of T)

– Hint: Act on the boundaries of the loop

– Hint #2: omp_get_thread_num(), omp_get_num_threads()

Example:

Exercise (let's code!)

[Diagram: N = 19. With T = 3, the threads get iterations 0–6, 7–13 and 14–18; with T = 4, they get 0–4, 5–9, 10–14 and 15–18]

Page 6

Case #1: N multiple of T

– Say, N = 20, T = 4

chunk = #iterations for each thread

Very simple..

Loop partitioning among threads (let's code!)

[Diagram: T = 4 threads, N = 20 iterations; each thread executes a chunk of 5 consecutive iterations: 0–4, 5–9, 10–14, 15–19]

chunk = N / T

i_start = threadID * chunk;  i_end = i_start + chunk

Page 7

Case #2: N not multiple of T

– Say, N = 19, T = 4

chunk = #iterations for each thread (all but the last)

– The last thread gets fewer! (chunk_last)

Loop partitioning among threads (let's code!)

[Diagram: T = 4 threads, N = 19 iterations; chunk = 5 for threads 0–2 (iterations 0–4, 5–9, 10–14), chunk_last = 4 for the last thread (iterations 15–18)]

chunk = N / T + 1  (integer division);  chunk_last = N % chunk

i_start = threadID * chunk;  i_end = i_start + chunk  (if not last thread),  i_start + chunk_last  (if last thread)

Page 8

Unfortunately, we don't know which thread will be "last" in time

But… we don't actually care about the order in which iterations are executed

– If there are no dependencies..

– And.. we do know that

0 <= omp_get_thread_num() < omp_get_num_threads()

We choose the thread with the highest ID as the "last" one


"Last thread"

Page 9

Case #1 (N multiple of T)

Case #2 (N not multiple of T)


Let's put them together!

N multiple of T:
chunk = N / T
i_start = threadID * chunk;  i_end = i_start + chunk

N not multiple of T:
chunk = N / T + 1  (integer division);  chunk_last = N % chunk
i_start = threadID * chunk;  i_end = i_start + chunk  (if not last thread),  i_start + chunk_last  (if last thread)

Page 10

A way to distribute work among parallel threads

– In a simple, and "elegant" manner

– Using pragmas

OpenMP was born for this

– OpenMP 2.5 targets regular, loop-based parallelism

OpenMP 3.x targets irregular/dynamic parallelism

– We will see it later


Work sharing constructs

Page 11

The iterations will be executed in parallel by the threads in the team

The iterations are distributed across the threads executing the parallel region to which the loop region binds

for-loops must have canonical loop form

Parallel Programming LM – 2016/17 11

The for construct

#pragma omp for [clause [[,] clause]...] new-line

for-loops

Where clauses can be:

private(list)

firstprivate(list)

lastprivate(list)

linear(list[ : linear-step])

reduction(reduction-identifier : list)

schedule([modifier [, modifier]:]kind[, chunk_size])

collapse(n)

ordered[(n)]

nowait

Page 12

Canonical loop form

for (init-expr; test-expr; incr-expr)

structured-block

init-expr; test-expr; incr-expr not void

Eases programmers' life

– More structured

– Recommended also for "sequential programmers"

Preferable to while and do..while

– If possible

Page 13

Create an array of N elements

– Put inside each array element its index, multiplied by '2'

– arr[0] = 0; arr[1] = 2; arr[2] = 4; ...and so on..

Now, do it in parallel with a team of T threads

– Using the for construct

Exercise (let's code!)

Page 14

private/firstprivate, reduction we already know…

– Private storage, w/ or w/o initialization

linear, we won't cover


Data sharing clauses

#pragma omp for [clause [[,] clause]...] new-line

for-loops

Where clauses can be:

private(list)

firstprivate(list)

lastprivate(list)

linear(list[ : linear-step])

reduction(reduction-identifier : list)

schedule([modifier [, modifier]:]kind[, chunk_size])

collapse(n)

ordered[(n)]

nowait

Page 15

A list item that appears in a lastprivate clause is subject to the private clause semantics

Also, its value is updated with the one from the sequentially last iteration of the associated loops


The lastprivate clause

Page 16

Create a new storage for the variables, local to the threads (not initialized: the private copies start undefined)


lastprivate variables and memory

int a = 11;

#pragma omp parallel for lastprivate(a) \
                         num_threads(4)
for (int i = 0; i < N; i++)
{
  a = ...
}

[Diagram: the original a (value 11) lives in process memory on T0's stack; each of the four threads then gets its own uninitialized private copy of a; at the end of the loop, the value from the sequentially last iteration (5 in the animation) is copied back into the original a]


Page 20

Modify the "PI Montecarlo" exercise

– Use the for construct

Up to now, each thread executes its "own" loop

– i from 0 to 2499

Using the for construct, they actually share the loop

– No need to modify the boundary!!!

– Check it with printf

Exercise (let's code!)

[Diagram: without the for construct, each of the 4 threads runs its own loop over iterations 0–2499; with the for construct, the threads share a single loop over 0–9999, getting iterations 0–2499, 2500–4999, 5000–7499 and 7500–9999]

Page 21

Create an array of N elements

– Put inside each array element its index, multiplied by '2'

– arr[0] = 0; arr[1] = 2; arr[2] = 4; ...and so on..

Declare the array as lastprivate

– So you can print its value after the parreg, in the sequential zone

– Do this at home

Exercise (let's code!)

Page 22

OpenMP provides three work-sharing constructs

– Loops

– Single

– Sections

Parallel Programming LM – 2016/17 19

OpenMP 2.5

Page 23

The enclosed block is executed by only one thread in the team

..and what about the other threads?


The single construct

#pragma omp single [clause [[,] clause]...] new-line

structured-block

Where clauses can be:

private(list)

firstprivate(list)

copyprivate(list)

nowait

Page 24

Each worksharing construct has an implicit barrier at its end

– Example: a loop

– If one thread is delayed, it prevents the other threads from doing useful work!!


Worksharing constructs and barriers


#pragma omp parallel num_threads(4)

{

#pragma omp for

for(int i=0; i<N; i++)

{

...

} // (implicit) barrier

// USEFUL WORK!!

} // (implicit) barrier


Page 26

The nowait clause removes the barrier at the end of a worksharing (WS) construct

– Applies to all WS constructs

– Does not apply to parregs!


Nowait clause in the for construct

#pragma omp for [clause [[,] clause]...] new-line

for-loops

Where clauses can be:

private(list)

firstprivate(list)

lastprivate(list)

linear(list[ : linear-step])

reduction(reduction-identifier : list)

schedule([modifier [, modifier]:]kind[, chunk_size])

collapse(n)

ordered[(n)]

nowait

Page 27

Removed the barrier at the end of WS construct

– Still, there is a barrier at the end of parreg


Worksharing constructs and barriers


#pragma omp parallel num_threads(4)
{
  #pragma omp for nowait
  for(int i=0; i<N; i++)
  {
    ...
  } // no barrier

  // USEFUL WORK!!
} // (implicit) barrier

Page 28

Each section contains code that is executed by a single thread

– A "switch" for threads

Clauses, we already know..

– lastprivate items are updated by the section executing last (in time)


The sections construct

#pragma omp sections [clause[ [,] clause] ... ] new-line

{

[#pragma omp section new-line]

structured-block

[#pragma omp section new-line]

structured-block

...

}

Where clauses can be:

private(list)

firstprivate(list)

lastprivate(list)

reduction(reduction-identifier : list)

nowait

Page 29

Loops implement data-parallel paradigm

– Same work, on different data

– Aka: data decomposition, SIMD, SPMD

Sections implement task-based paradigm

– Different work, on the same or different data

– Aka: task decomposition, MPSD, MPMD


Sections vs. loops

Page 30

The structured block is executed only by the master thread

– "Similar" to the single construct

It is not a work-sharing construct

– There is no barrier implied!!


The master construct

#pragma omp master new-line

structured-block

No clauses


Page 31

For each WS construct, there is also a compact form

– In this case, the clauses of both constructs apply


Combined parreg+ws

#pragma omp parallel
{
  #pragma omp for
  for(int i=0; i<N; i++)
  {
    ...
  }
} // (implicit) barrier

=

#pragma omp parallel
#pragma omp for
for(int i=0; i<N; i++)
{
  ...
} // (implicit) barrier

=

#pragma omp parallel for
for(int i=0; i<N; i++)
{
  ...
} // (implicit) barrier

Page 32

Download the Code/ folder from the course website

Compile

$ gcc -fopenmp code.c -o code

Run (Unix/Linux)

$ ./code

Run (Win/Cygwin)

$ ./code.exe

How to run the examples (let's code!)

Page 33

"Calcolo parallelo" website

– http://algogroup.unimore.it/people/marko/courses/programmazione_parallela/

My contacts

[email protected]

– http://hipert.mat.unimore.it/people/paolob/

OpenMP specifications

– http://www.openmp.org

– http://www.openmp.org/mp-documents/openmp-4.5.pdf

– http://www.openmp.org/mp-documents/OpenMP-4.5-1115-CPP-web.pdf

A lot of stuff around…

– https://computing.llnl.gov/tutorials/openMP/


References