asynchronous task creation for task-based parallel ... · asynchronous taskwait •counter of tasks...
TRANSCRIPT
![Page 1: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/1.jpg)
Asynchronous Task Creation for Task-Based Parallel Programming Runtimes
Jaume Bosch ([email protected]), Xubin Tan, Carlos Álvarez, Daniel Jiménez, Xavier Martorell and Eduard Ayguadé
Barcelona, Sept. 24, 2018
![Page 2: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/2.jpg)
1) Introduction 2) OmpSs@FPGA 3) Implementation and design 4) Conclusion
![Page 3: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/3.jpg)
Task-Based Parallel Programming Models
void cholesky(float *A[NT][NT]) { for (int k=0; k<NT; k++) { #pragma omp task inout(A[k][k]) spotrf( A[k][k] ) ; for (int i=k+1; i<NT; i++) { #pragma omp task in(A[k][k]) inout(A[k][i]) strsm( A[k][k], A[k][i] ); } for (i=k+1; i<NT; i++) { for (int j=k+1; j<i; j++) { #pragma omp task in(A[k][i], A[k][j]) inout(A[j][i]) sgemm( A[k][i], A[k][j], A[j][i] ); } #pragma omp task in(A[k][i]) inout(A[i][i]) ssyrk( A[k][i], A[i][i] ); } } }
3
![Page 4: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/4.jpg)
Task-Based Parallel Programming Models
void cholesky(float *A[NT][NT]) { for (int k=0; k<NT; k++) { #pragma omp task inout(A[k][k]) spotrf( A[k][k] ) ; for (int i=k+1; i<NT; i++) { #pragma omp task in(A[k][k]) inout(A[k][i]) strsm( A[k][k], A[k][i] ); } for (i=k+1; i<NT; i++) { for (int j=k+1; j<i; j++) { #pragma omp task in(A[k][i], A[k][j]) inout(A[j][i]) sgemm( A[k][i], A[k][j], A[j][i] ); } #pragma omp task in(A[k][i]) inout(A[i][i]) ssyrk( A[k][i], A[i][i] ); } } }
4
![Page 5: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/5.jpg)
Runtime operational flow of a task
5
Ru
nti
me
Runtime
TDG Ready Task Pool
Task states:
Instantiated Ready Active Completed
![Page 6: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/6.jpg)
Current target model
6
Ru
nti
me
![Page 7: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/7.jpg)
Next? target model
Ru
nti
me
Ru
nti
me
7
![Page 8: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/8.jpg)
1) Introduction
2) OmpSs@FPGA 3) Implementation and design 4) Conclusion
![Page 9: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/9.jpg)
OmpSs: Forerunner of the OpenMP tasking
+ Task prototyping
+ Task dependences
+ Task priorities + Taskloop prototyping
+ Task reductions + Taskwait deps + OMPT impl. + Multideps + Commutative + Data affinity
+ Taskloop dependences
Today
9
![Page 10: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/10.jpg)
OmpSs@FPGA
• Easily offloading tasks to FPGA devices • Automatic generation
of FPGA bitstream from C/C++ tasks
• Automatic data movements and task synchronization between the host and the FPGA
• Support for HW instrumentation integrated in Extrae
10
![Page 11: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/11.jpg)
OmpSs@FPGA | Source code example
void dotProductBlock(float *v1, float *v2, float *result) { int resultLocal = result[0]; for (size_t i = 0; i < BSIZE; ++i) { resultLocal += v1[i]*v2[i]; } result[0] = resultLocal; } int main() { ... for (size_t i = 0; i < VSIZE; i += BSIZE) { dotProductBlock(v1+i, v2+i, results+i/BSIZE); } #pragma omp taskwait ... }
11
#pragma omp task in([BSIZE]v1, [BSIZE]v2) inout([1]result)
#pragma omp target device(fpga) num_instances(2)
![Page 12: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/12.jpg)
OmpSs@FPGA | Compilation process
(fpgacc, fpgacxx)
Native compiler
autoVivado FPGA specific vendor tools
Linker
12
![Page 13: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/13.jpg)
OmpSs@FPGA | Execution
Runtime
TDG Ready Task Pool
Task Manager
ReadyQ FinishQ
13
![Page 14: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/14.jpg)
OmpSs@FPGA | Execution
Runtime
TDG Ready Task Pool
Task Manager
ReadyQ FinishQ
14
![Page 15: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/15.jpg)
1) Introduction 2) OmpSs@FPGA
3) Implementation and design 4) Conclusion
![Page 16: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/16.jpg)
Task creation inside a fpga task
#pragma omp target device(fpga) num_instances(2) #pragma omp task in([BSIZE]v1, [BSIZE]v2) inout([1]result) void dotProductBlock(float *v1, float *v2, float *result) { ... } #pragma omp target device(fpga) #pragma omp task in([VSIZE]v1, [VSIZE]v2) inout([1]result) void dotProduct(float *v1, float *v2, float *result) {
... for (size_t i = 0; i < VSIZE; i += BSIZE) { dotProductBlock(v1+i, v2+i, results+i/BSIZE); }
... } int main() { ... dotProduct(v1, v2, &result); ... }
16
![Page 17: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/17.jpg)
Task creation from FPGA
Runtime
TDG Ready Task Pool
Task Manager
ReadyQ FinishQ
17
NewQ
![Page 18: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/18.jpg)
Task creation from FPGA with Picos
Runtime
TDG Ready Task Pool
TM
18
RQ FQ NQ
![Page 19: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/19.jpg)
Asynchronous task creation
• Shared memory region between the accelerator and the handler • Coherent and consistent
• Tasks must be added and handled in the same order • Ensure sequential order execution
• Handling must cannot be parallelized for the same parent task
• Shared memory can be splited into sub-regions, one for each accelerator
19
![Page 20: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/20.jpg)
Taskwait inside a fpga task
#pragma omp target device(fpga) num_instances(2) #pragma omp task in([BSIZE]v1, [BSIZE]v2) inout([1]result) void dotProductBlock(float *v1, float *v2, float *result) { ... } #pragma omp target device(fpga) #pragma omp task in([VSIZE]v1, [VSIZE]v2) inout([1]result) void dotProduct(float *v1, float *v2, float *result) { for (size_t i = 0; i < VSIZE; i += BSIZE) { dotProductBlock(v1+i, v2+i, results+i/BSIZE); } #pragma omp taskwait } int main() { ... dotProduct(v1, v2, &result); ... }
20
![Page 21: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/21.jpg)
Taskwait inside a fpga task
Runtime
TDG Ready Task Pool
Task Manager
RQ FQ
21
NQ TW Taskwait Manager
Parent Task Id Count
1 1 99
1
-1 0
![Page 22: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/22.jpg)
Asynchronous taskwait
• Counter of tasks executed in each context (parent task id) • Children of SMP tasks not considered, do not have a parent fpga task id
• Accelerator blocks until receives a signal from the Taskwait Manager • The signal is sent when the count becomes 0
• The entry for a context is deleted when the task finalizes • Garbage collector
22
![Page 23: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/23.jpg)
1) Introduction 2) OmpSs@FPGA 3) Implementation and design
4) Conclusion
![Page 24: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/24.jpg)
Execution trace Matrix Multiply 1024*1024, 3 accelerators at 333Mhz 128*128
• With current implementation, we can save around 50% the time of 1 thread
• With the Picos implementation, we can save an additional thread (fpga helper thread)
24
![Page 25: Asynchronous Task Creation for Task-Based Parallel ... · Asynchronous taskwait •Counter of tasks executed in each context (parent task id) •Children of SMP tasks not considered,](https://reader033.vdocuments.mx/reader033/viewer/2022060812/609058555f70866ddf378013/html5/thumbnails/25.jpg)
Conclusions
• Design of a general mechanism for asynchronous task creation
• Suitable for accelerators
• Design successfully implemented over OmpSs@FPGA on an Axiom Board
• Similar performance to SMP task creation version despite the round trip
• Next steps
• Integration with Picos HW manager
• Performance evaluation for iterative solvers
25