report of openmp
DESCRIPTION
Report of OpenMP. Speaker : Guo-Zhen Wu Date : 2008/12/28. Question 5 : What happens if number of threads is larger than number of cores of host machine?. - PowerPoint PPT PresentationTRANSCRIPT
Report of OpenMP
Speaker : Guo-Zhen Wu Date : 2008/12/28
Question 5: What happens if number of threads is larger than number of cores of host machine?
Exercise 1: modify code of hello.cto show “every thread has its own private variable th_id”, that is, shows th_idhas 5 copies.
Ans : we can use printf to tell each thread print its own th_id printf(“th_id : %p”,&th_id ) ;
Exercise 2: modify code of hello.c, remove clause “private (th_id)”in #pragmadirective, what happens? Can you explain?
Question 6: Why index i must be private variable and a,b,c,N can be shared variable? What happens if we change i to shared variable? What happens if we change a,b,c,N to private variable?
The result are the same !!! What’s happened ?
Change N from shared to private
Change a from shared to private
Change b from shared to private
Segmentation fault
Aborted!
Segmentation fault
Segmentation fault
Change c from shared to private
166.34852.0
5362.1
)8(
)(
coreT
SingleT
Question 7: the limitation of performance improvement is 3, why? Can you use different configuration of schedule clause to improve this number?
dynamic takes 13.024 (s) ! Whereas, static needs only 0.4852(s) .
Dynamic
Number of thread
Cost time (s)
1 1.5362
2 0.8610
4 0.5586
8 0.4852
16 0.6037
32 0.7258
64 0.8244
Number of thread = 8
Number of chunk
Cost time (s)
2 2.1173
8 1.8999
32 1.5697
128 1.4190
512 0.6780
2048 0.5562
8196 0.4944
32784 0.4891
131136 0.4937
524544 0.4881
Question 8: we have three for-loop, one is for “i”, one is for “j” and last one is for “k”, which one is parallelized by OpenMP directive?
Question 9: explain why variable i, j, k, sum, a, b are declared as private? Can we move some of them to shared clause?
Private->Shared i j k sum a b
Result compare with original The same
Error is larger than
10^13
System error!
Error is larger
than 10^7
Error is larger
than 10^5
Error is larger
than 10^5
Can we move it to shared clause Yes No No No No No
Exercise 3: verify subroutine matrixMul_parallel
Matlab code
BLOCK_SIZE WA=HA=WB Error
1 25xBLOCK_SIZE 3.4261e-005
2 25xBLOCK_SIZE 1.1290e-004
4 25xBLOCK_SIZE 5.0797e-004
8 25xBLOCK_SIZE 0.0012
16 25xBLOCK_SIZE 0.0037
32 25xBLOCK_SIZE 0.0115
64 25xBLOCK_SIZE 0.0316
Exercise 4: verify following subroutine matrix_parallel, which parallelizes loop-j , not loop-i.
1. Performance between loop-i and loop-j
Conclusion : loop j is a little faster than loop j, and the result of computation is the same.
2. why do we declare index i as shared variable? What happens if we declare index i as private variable?
Exercise 5: verify subroutine matrixMul_block_seq with non-block version, you can use high precision package.
Float Double
Take threads = 1, i.e. sequentially
Double-double Quad-double
Recall : How to modify code such that it can work with arbitrary precision ?
variable-length
2
1
Exercise 6: if we use “double”, how to choose value of BLOCK_SIZE, show your experimental result.
N Total size Thread 1 Thread 2 Thread 4 Thread 8
2 12 MB 1,268 ms 645 ms 319 ms 194 ms
4 48 MB 10,031ms 4,876ms 2,481 ms 1,304 ms
8 192 MB 79,744ms 39,116ms 19,609ms 10,278ms
Block version, BLOCK_SIZE = 362 (double)
If we want to keep size of As and Bs are 1MB, since one double is 8 byte, that is twice of float (4 byte) .
362
512256
512512482
2
n
n
n
Work Station : 140.114.34.1
2
512512222
fd
fd
NN
NnN 3dNtimeCost
Block version, BLOCK_SIZE = 512 (float)
Dimension Thread 1 Thread 2 Thread 4 Thread 8
1024x1024 3,454 ms 1,881ms 882ms 4,63ms
2048x2048 28,990ms 14,302ms 6,991ms 3,540ms
4096x4096 224,142 ms 111,344ms 55,845ms 28,198ms
Dimension Thread 1 Thread 2 Thread 4 Thread 8
1024x1024 3,586 ms 1,824 ms 902 ms 548 ms
2048x2048 28,372ms 13,791ms 7,017 ms 3,688ms
4096x4096 225,550ms 110,640ms 55,463ms 29,071ms
Block version, BLOCK_SIZE = 362 (double)
Conclusion : it cost almost the same time no matter you choose float or double under the same dimension of matrix.
Exercise 7: Can you modify subroutine matrixMul_block_parallel to improve its performance?
Exercise 8: compare parallel computation between CPU and GPU in your host machine