report of openmp

Report of OpenMP

Speaker : Guo-Zhen Wu Date : 2008/12/28

Question 5: What happens if number of threads is larger than number of cores of host machine?

Exercise 1: modify code of hello.cto show “every thread has its own private variable th_id”, that is, shows th_idhas 5 copies.

Ans : we can use printf to tell each thread print its own th_id printf(“th_id : %p”,&th_id ) ;

Exercise 2: modify code of hello.c, remove clause “private (th_id)”in #pragmadirective, what happens? Can you explain?

Question 6: Why index i must be private variable and a,b,c,N can be shared variable? What happens if we change i to shared variable? What happens if we change a,b,c,N to private variable?

The result are the same !!! What’s happened ?

Change N from shared to private

Change a from shared to private

Change b from shared to private

Segmentation fault

Aborted!

Segmentation fault

Segmentation fault

Change c from shared to private

166.34852.0

5362.1

)8(

)(

coreT

SingleT

Question 7: the limitation of performance improvement is 3, why? Can you use different configuration of schedule clause to improve this number?

dynamic takes 13.024 (s) ! Whereas, static needs only 0.4852(s) .

Dynamic

Number of thread

Cost time (s)

1 1.5362

2 0.8610

4 0.5586

8 0.4852

16 0.6037

32 0.7258

64 0.8244

Number of thread = 8

Number of chunk

Cost time (s)

2 2.1173

8 1.8999

32 1.5697

128 1.4190

512 0.6780

2048 0.5562

8196 0.4944

32784 0.4891

131136 0.4937

524544 0.4881

Question 8: we have three for-loop, one is for “i”, one is for “j” and last one is for “k”, which one is parallelized by OpenMP directive?

Question 9: explain why variable i, j, k, sum, a, b are declared as private? Can we move some of them to shared clause?

Private->Shared i j k sum a b

Result compare with original The same

Error is larger than

10^13

System error!

Error is larger

than 10^7

Error is larger

than 10^5

Error is larger

than 10^5

Can we move it to shared clause Yes No No No No No

Exercise 3: verify subroutine matrixMul_parallel

Matlab code

BLOCK_SIZE WA=HA=WB Error

1 25xBLOCK_SIZE 3.4261e-005



8 25xBLOCK_SIZE 0.0012




Exercise 4: verify following subroutine matrix_parallel, which parallelizes loop-j , not loop-i.

1. Performance between loop-i and loop-j

Conclusion : loop j is a little faster than loop j, and the result of computation is the same.

2. why do we declare index i as shared variable? What happens if we declare index i as private variable?

Exercise 5: verify subroutine matrixMul_block_seq with non-block version, you can use high precision package.

Float Double

Take threads = 1, i.e. sequentially

Double-double Quad-double

Recall : How to modify code such that it can work with arbitrary precision ?

variable-length

2

1

Exercise 6: if we use “double”, how to choose value of BLOCK_SIZE, show your experimental result.

N Total size Thread 1 Thread 2 Thread 4 Thread 8

2 12 MB 1,268 ms 645 ms 319 ms 194 ms

4 48 MB 10,031ms 4,876ms 2,481 ms 1,304 ms

8 192 MB 79,744ms 39,116ms 19,609ms 10,278ms

Block version, BLOCK_SIZE = 362 (double)

If we want to keep size of As and Bs are 1MB, since one double is 8 byte, that is twice of float (4 byte) .

362

512256

512512482

2

n

n

n

Work Station : 140.114.34.1

2

512512222

fd

fd

NN

NnN 3dNtimeCost

Block version, BLOCK_SIZE = 512 (float)

Dimension Thread 1 Thread 2 Thread 4 Thread 8

1024x1024 3,454 ms 1,881ms 882ms 4,63ms

2048x2048 28,990ms 14,302ms 6,991ms 3,540ms

4096x4096 224,142 ms 111,344ms 55,845ms 28,198ms

Dimension Thread 1 Thread 2 Thread 4 Thread 8

1024x1024 3,586 ms 1,824 ms 902 ms 548 ms

2048x2048 28,372ms 13,791ms 7,017 ms 3,688ms

4096x4096 225,550ms 110,640ms 55,463ms 29,071ms

Block version, BLOCK_SIZE = 362 (double)

Conclusion : it cost almost the same time no matter you choose float or double under the same dimension of matrix.

Exercise 7: Can you modify subroutine matrixMul_block_parallel to improve its performance?

Exercise 8: compare parallel computation between CPU and GPU in your host machine

report of openmp

Documents

shared variable

variable i

private variable th

loop j

clause private th

float block version

thread 81024x10243

index i