![Page 1: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/1.jpg)
Advanced / Other Programming Models
Sathish Vadhiyar
![Page 2: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/2.jpg)
OpenCL – Command Queues, Runtime Compilation, Multiple Devices
Sources: OpenCL overview from AMD; OpenCL learning kit from AMD
![Page 3: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/3.jpg)
Introduction
OpenCL is a programming framework for heterogeneous computing resources
Resources include CPUs, GPUs, Cell Broadband Engine, FPGAs, DSPs
Many similarities with CUDA
![Page 4: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/4.jpg)
![Page 5: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/5.jpg)
Command Queues
A command queue is the mechanism for the host to request that an action be performed by the device: perform a memory transfer, begin executing, etc.
Interesting concept: kernels are enqueued and their dependencies are satisfied using events
A separate command queue is required for each device
Commands within the queue can be synchronous or asynchronous
Commands can execute in-order or out-of-order
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
![Page 6: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/6.jpg)
Example – Image Rotation
![Page 7: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/7.jpg)
Slides 8, 11-16 of lecture 5 in the OpenCL University Kit
![Page 8: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/8.jpg)
Synchronization
![Page 9: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/9.jpg)
Synchronization in OpenCL
Synchronization is required if we use an out-of-order command queue or multiple command queues
Coarse synchronization granularity: per command queue basis
Finer synchronization granularity: per OpenCL operation basis, using events
![Page 10: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/10.jpg)
OpenCL Command Queue Control
Command queue synchronization methods work on a per-queue basis
Flush: clFlush(cl_command_queue) sends all commands in the queue to the compute device; there is no guarantee that they will be complete when clFlush returns
Finish: clFinish(cl_command_queue) waits for all commands in the command queue to complete before proceeding (the host blocks on this call)
Barrier: clEnqueueBarrier(cl_command_queue) enqueues a synchronization point that ensures all prior commands in a queue have completed before any further commands execute (deprecated since OpenCL 1.2 in favor of clEnqueueBarrierWithWaitList)
![Page 11: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/11.jpg)
OpenCL Events
Previous OpenCL synchronization functions only operated on a per-command-queue granularity
OpenCL events are needed to synchronize at a function granularity
Explicit synchronization is required for:
Out-of-order command queues
Multiple command queues
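The idea of per-command event dependencies can be illustrated without any OpenCL calls. The following is a minimal sketch in standard C++, assuming a toy model in which each command carries a wait list of earlier commands; the names `Command` and `legalExecutionOrder` are hypothetical and are not part of the OpenCL API. It computes one order in which an out-of-order queue could legally run the commands.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical toy model: a command plus the indices of the commands
// whose events it waits on (its "wait list"). Not an OpenCL type.
struct Command {
    std::vector<std::size_t> waitList;
};

// Returns one order in which an out-of-order queue could legally run the
// commands: a command becomes ready only once every event in its wait list
// has completed. Returns an empty vector if the wait lists form a cycle.
std::vector<std::size_t> legalExecutionOrder(const std::vector<Command>& cmds) {
    const std::size_t n = cmds.size();
    std::vector<std::size_t> pending(n, 0);  // unfinished dependencies per command
    std::vector<std::vector<std::size_t>> dependents(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t dep : cmds[i].waitList) {
            assert(dep < n);
            ++pending[i];
            dependents[dep].push_back(i);
        }
    }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t c = ready.front();
        ready.pop();
        order.push_back(c);  // command c "completes" here
        for (std::size_t d : dependents[c])
            if (--pending[d] == 0) ready.push(d);
    }
    if (order.size() != n) order.clear();  // cyclic wait lists would deadlock
    return order;
}
```

For example, two buffer writes with empty wait lists, a kernel waiting on both writes, and a read waiting on the kernel can only run with the kernel after both writes and the read last.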
![Page 12: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/12.jpg)
Using User Events
A simple example of user events being triggered and used in a command queue
// Create user event which will start the write of buf1
user_event = clCreateUserEvent(ctx, NULL);
clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
// The write of buf1 is now enqueued and waiting on user_event
X = foo(); // Lots of complicated host processing code
clSetUserEventStatus(user_event, CL_COMPLETE);
// The clEnqueueWriteBuffer to buf1 can now proceed, using the output of foo()
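The gating behaviour of a user event can be mimicked in plain C++. This is only an analogy under stated assumptions: `ToyUserEvent` and `ToyQueue` are hypothetical names, the queue is a simple in-order list, and nothing here calls the OpenCL runtime. The sketch shows that an enqueued command stays blocked until the host marks the event complete.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a user event: starts incomplete and is
// completed explicitly by the host. Not the OpenCL cl_event type.
struct ToyUserEvent {
    bool complete = false;
};

// Hypothetical in-order queue whose commands may each wait on one event.
class ToyQueue {
    struct Entry {
        std::function<void()> work;
        std::shared_ptr<ToyUserEvent> waitOn;  // may be null: no dependency
    };
    std::vector<Entry> commands_;
    std::size_t next_ = 0;  // index of the first command that has not run

public:
    void enqueue(std::function<void()> work, std::shared_ptr<ToyUserEvent> waitOn) {
        assert(work != nullptr);
        commands_.push_back(Entry{std::move(work), std::move(waitOn)});
    }

    // Runs commands from the front until one is blocked on an incomplete
    // event. Returns how many commands ran during this call.
    int progress() {
        int ran = 0;
        while (next_ < commands_.size()) {
            const Entry& cmd = commands_[next_];
            if (cmd.waitOn && !cmd.waitOn->complete) break;
            cmd.work();
            ++next_;
            ++ran;
        }
        return ran;
    }
};
```

In this model, enqueuing the write with the event attached, calling `progress()`, doing the host work, setting `complete = true`, and calling `progress()` again corresponds to the enqueue, `foo()`, and `clSetUserEventStatus` steps above.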
![Page 13: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/13.jpg)
Multiple Devices
![Page 14: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/14.jpg)
Multiple Devices
OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP etc.)
OpenCL does not assume that data can be transferred directly between devices, so commands exist only to move data from host to device or from device to host
Copying from one device to another requires an intermediate transfer to the host
OpenCL events are used to synchronize execution on different devices within a context
![Page 15: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/15.jpg)
Compiling Code for Multiple Devices
![Page 16: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/16.jpg)
Charm++
Source: Tutorial Slides from Parallel Programming Lab, UIUC. Authors: Laxmikant Kale, Eric Bohm
![Page 17: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/17.jpg)
Virtualization: Object-based Decomposition
In MPI, the number of processes is typically equal to the number of processors
Virtualization: divide the computation into a large number of pieces
Independent of the number of processors
Typically larger than the number of processors
Let the system map objects to processors
![Page 18: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/18.jpg)
The Charm++ Model
Parallel objects (chares) communicate via asynchronous method invocations (entry methods).
The runtime system maps chares onto processors and schedules execution of entry methods.
Chares can be dynamically created on any available processor and can be accessed from remote processors
Charm++ Basics
![Page 19: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/19.jpg)
04/21/23 CS 420
Processor Virtualization
User View
System implementation
User is only concerned with interaction between objects (VPs)
![Page 20: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/20.jpg)
Adaptive Overlap via Data-driven Objects
Problem: processors wait for too long at “receive” statements
With virtualization, you get data-driven execution
There are multiple entities (objects, threads) on each processor
No single object or thread holds up the processor
Each one is “continued” when its data arrives
So: achieves automatic and adaptive overlap of computation and communication
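Data-driven execution can be pictured as a scheduler loop that runs whichever object has a message waiting, instead of letting one object block in a receive. The following is a minimal single-processor sketch in standard C++; `Message` and `ToyProcessor` are hypothetical names, and this is a toy model rather than the Charm++ scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical toy model: a message names its target object and carries data.
struct Message {
    std::size_t target;
    int payload;
};

// One processor hosting several objects, each of which accumulates a sum.
class ToyProcessor {
public:
    explicit ToyProcessor(std::size_t numObjects) : sums_(numObjects, 0) {}

    // Called when data for an object arrives.
    void deliver(Message m) {
        assert(m.target < sums_.size());
        inbox_.push_back(m);
    }

    // Scheduler loop: take the next available message and run the target
    // object's handler. No object blocks waiting for its own data, so an
    // object whose data has not arrived never holds up the others.
    void runUntilIdle() {
        while (!inbox_.empty()) {
            const Message m = inbox_.front();
            inbox_.pop_front();
            sums_[m.target] += m.payload;  // stands in for an entry method body
            trace_.push_back(m.target);
        }
    }

    const std::vector<std::size_t>& trace() const { return trace_; }
    int sum(std::size_t obj) const { return sums_.at(obj); }

private:
    std::deque<Message> inbox_;
    std::vector<int> sums_;
    std::vector<std::size_t> trace_;  // order in which objects were run
};
```

If messages arrive for objects 2, 0 and 2 while object 1 is still waiting for its data, the processor stays busy with objects 2 and 0 rather than idling on object 1.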
![Page 21: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/21.jpg)
Load Balancing
![Page 22: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/22.jpg)
Using Dynamic Mapping to Processors
Migration
Charm objects can migrate from one processor to another
Migration creates a new object on the destination processor while destroying the original
Use that for dynamic (and static, initial) load balancing
Measurement-based, predictive strategies
Based on object communication patterns and computational loads
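One simple measurement-based strategy is a greedy one: take the objects in decreasing order of measured load and place each on the processor that is currently least loaded. The sketch below assumes this greedy heuristic and considers computational load only, ignoring communication patterns; `greedyAssign` is a hypothetical name and the code is not taken from the Charm++ load balancers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Greedy sketch: place the heaviest objects first, each on the processor
// that is currently least loaded. measuredLoad[i] is the load measured for
// object i so far. Returns the processor chosen for each object.
std::vector<std::size_t> greedyAssign(const std::vector<double>& measuredLoad,
                                      std::size_t numProcs) {
    assert(numProcs > 0);

    // Object indices sorted by decreasing measured load (ties keep index order).
    std::vector<std::size_t> byLoad(measuredLoad.size());
    std::iota(byLoad.begin(), byLoad.end(), std::size_t{0});
    std::stable_sort(byLoad.begin(), byLoad.end(),
                     [&](std::size_t a, std::size_t b) {
                         return measuredLoad[a] > measuredLoad[b];
                     });

    // Min-heap of (current load, processor id).
    using Entry = std::pair<double, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (std::size_t p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<std::size_t> placement(measuredLoad.size());
    for (std::size_t obj : byLoad) {
        Entry least = procs.top();
        procs.pop();
        placement[obj] = least.second;
        least.first += measuredLoad[obj];
        procs.push(least);
    }
    return placement;
}
```

Objects whose new placement differs from their current processor would then be migrated; with many more objects than processors, the runtime has enough pieces to even out the load.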
![Page 23: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/23.jpg)
Summary: Primary Advantages
Automatic mapping
Migration and load balancing
Asynchronous and message-driven communications
Computation-communication overlap
![Page 24: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/24.jpg)
How does it look?
![Page 25: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/25.jpg)
Asynchronous Hello World
Program’s asynchronous flow
Mainchare sends message to Hello object
Hello object prints “Hello World!”
Hello object sends message back to the mainchare
Mainchare quits the application
![Page 26: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/26.jpg)
Code and Workflow
![Page 27: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/27.jpg)
Hello World: Array Version
Main Code
![Page 28: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/28.jpg)
Array Code
![Page 29: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/29.jpg)
Result
$ ./charmrun +p3 ./hello 10
Running “Hello World” with 10 elements using 3 processors.
“Hello” from Hello chare #0 on processor 0 (told by -1)
“Hello” from Hello chare #1 on processor 0 (told by 0)
“Hello” from Hello chare #2 on processor 0 (told by 1)
“Hello” from Hello chare #3 on processor 0 (told by 2)
“Hello” from Hello chare #4 on processor 1 (told by 3)
“Hello” from Hello chare #5 on processor 1 (told by 4)
“Hello” from Hello chare #6 on processor 1 (told by 5)
“Hello” from Hello chare #7 on processor 2 (told by 6)
“Hello” from Hello chare #8 on processor 2 (told by 7)
“Hello” from Hello chare #9 on processor 2 (told by 8)
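The placement in this output (chares 0-3 on processor 0, 4-6 on processor 1, 7-9 on processor 2) is what a simple block mapping of 10 elements onto 3 processors would give. The sketch below assumes such a block mapping; `blockMap` is a hypothetical helper, and whether the Charm++ runtime computes its default placement exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Block mapping: contiguous index ranges go to the same processor, and the
// first (numElems % numProcs) processors take one extra element each.
std::size_t blockMap(std::size_t index, std::size_t numElems, std::size_t numProcs) {
    assert(numProcs > 0 && index < numElems);
    const std::size_t base = numElems / numProcs;    // minimum block size
    const std::size_t extra = numElems % numProcs;   // number of larger blocks
    const std::size_t cutoff = extra * (base + 1);   // elements in the larger blocks
    if (index < cutoff) return index / (base + 1);
    // Reached only when base > 0, so the division below is safe.
    return extra + (index - cutoff) / base;
}
```

With 10 elements and 3 processors, `base` is 3 and `extra` is 1, so processor 0 holds four elements and processors 1 and 2 hold three each.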
![Page 30: Advanced / Other Programming Models Sathish Vadhiyar](https://reader036.vdocuments.mx/reader036/viewer/2022081503/56649ec95503460f94bd6d12/html5/thumbnails/30.jpg)