
Page 1: OpenCL Guide

Open Computing Language

Page 2: OpenCL Guide

Introduction

• OpenCL (Open Computing Language) is an open, royalty-free standard
• For general-purpose parallel programming across CPUs, GPUs and other processors

OpenCL lets programmers write a single portable program that uses ALL the resources in the heterogeneous platform.

Page 3: OpenCL Guide

OpenCL consists of...

• An API for coordinating parallel computation across heterogeneous processors
• A cross-platform programming language
  - Supports both data- and task-based parallel programming models
  - Utilizes a subset of ISO C99 with extensions for parallelism (a short sketch of these extensions follows below)
  - Defines a configuration profile for handheld and embedded devices
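As a hedged illustration (the kernel name scale4 and its parameters are invented for this example), OpenCL C layers address-space qualifiers, built-in vector types and work-item functions on top of ISO C99:

__kernel void scale4(__global float4* v,  /* __global: address-space qualifier */
                     const float s)
{
    size_t i = get_global_id(0);          /* built-in work-item function */
    v[i] = v[i] * s;                      /* float4: built-in vector type,
                                             one operation on four floats */
}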

Page 4: OpenCL Guide

The BIG Idea behind OpenCL

• OpenCL execution model: execute a kernel at each point in a problem domain
• E.g., process a 1024 x 1024 image with one kernel invocation per pixel, i.e. 1024 x 1024 = 1,048,576 kernel executions (a per-pixel kernel is sketched below)
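As a minimal sketch of this idea (the kernel name brighten and its parameters are hypothetical), a per-pixel kernel over a two-dimensional domain could look like this:

/* Launched with a global size of {1024, 1024}: one execution per pixel. */
__kernel void brighten(__global uchar* img, const int width)
{
    int x = get_global_id(0);    /* this work-item's column */
    int y = get_global_id(1);    /* this work-item's row */
    img[y * width + x] = (uchar)min(img[y * width + x] + 10, 255);
}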

Page 5: OpenCL Guide

To use OpenCL, you must:

• Define the platform
• Execute code on the platform
• Move data around in memory
• Write (and build) programs

A sketch mapping these steps to the corresponding API calls follows below.
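As a hedged outline (the grouping is editorial, not from the specification), the four steps correspond roughly to these standard host-API calls, all of which appear in the sample later in this guide:

/* 1. Define the platform:   clGetPlatformIDs, clGetDeviceIDs, clCreateContext */
/* 2. Write/build programs:  clCreateProgramWithSource, clBuildProgram,        */
/*                           clCreateKernel                                    */
/* 3. Move data in memory:   clCreateBuffer, clEnqueueWriteBuffer,             */
/*                           clEnqueueReadBuffer                               */
/* 4. Execute on a device:   clSetKernelArg, clEnqueueNDRangeKernel            */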

Page 6: OpenCL Guide

OpenCL Platform Model

• One Host + one or more Compute Devices
  - Each Compute Device is composed of one or more Compute Units
  - Each Compute Unit is further divided into one or more Processing Elements

A sketch that queries this hierarchy follows below.
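As a minimal, hedged sketch (error handling omitted; assumes at least one platform with one device), the hierarchy can be inspected with the standard query calls:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    char name[256];
    cl_uint units;

    clGetPlatformIDs(1, &platform, NULL);                           /* first platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL); /* first device */
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
    printf("%s has %u compute units\n", name, units);
    return 0;
}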

Page 7: OpenCL Guide

OpenCL Execution Model

An OpenCL application runs on a host, which submits work to the compute devices.

• Work-item: the basic unit of work on an OpenCL device
• Kernel: the code for a work-item; basically a C function
• Program: a collection of kernels and other functions (analogous to a dynamic library)
• Context: the environment within which work-items execute; includes the devices, their memories and the command queues

Applications queue kernel execution instances:
• Queued in order... one queue per device
• Executed in order or out of order (see the queue sketch below)
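As a hedged sketch (assuming an existing context ctx and device dev; error handling omitted), the execution order of a queue is chosen when it is created:

/* Default: commands execute in the order they are enqueued. */
cl_command_queue inOrderQueue = clCreateCommandQueue(ctx, dev, 0, NULL);

/* With this property, commands may execute out of order; dependencies
   between them are then expressed through events. */
cl_command_queue outOfOrderQueue = clCreateCommandQueue(ctx, dev,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);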

Page 8: OpenCL Guide

Example of the NDRange organization...

• SIMT: Single Instruction, Multiple Thread. The same code is executed in parallel by different threads, and each thread executes the code with different data.
• Work-items are equivalent to CUDA threads.
• Work-groups allow communication and cooperation between work-items. They reflect how work-items are organized. Equivalent to CUDA thread blocks.
• ND-Range: the next organization level, specifying how work-groups are organized (see the ID sketch below).
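As a minimal sketch (the kernel name ids_demo is invented for illustration), the built-in ID functions expose this hierarchy inside a kernel:

__kernel void ids_demo(__global int* out)
{
    size_t gid = get_global_id(0);   /* position within the whole ND-Range */
    size_t lid = get_local_id(0);    /* position within this work-group */
    size_t grp = get_group_id(0);    /* index of this work-group */

    /* The levels are consistent: gid == grp * work-group size + lid. */
    out[gid] = (int)(grp * get_local_size(0) + lid);
}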

Page 9: OpenCL Guide

Example of the NDRange organization... (figure)

Page 10: OpenCL Guide

OpenCL Memory Model (figure)

The figure distinguishes per-work-item private memory, per-work-group local memory, device-wide global and constant memory, and host memory. A kernel sketch using these regions follows below.
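As a hedged sketch (the kernel name memory_demo and its parameters are invented for illustration), the OpenCL C address-space qualifiers map directly onto these regions:

__kernel void memory_demo(__global float* g,    /* global: visible to all work-items */
                          __constant float* c,  /* constant: read-only to kernels */
                          __local float* l)     /* local: shared within one work-group */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    float p = c[0];                  /* private: each work-item's own variable */

    l[lid] = g[gid] + p;             /* stage data in fast local memory */
    barrier(CLK_LOCAL_MEM_FENCE);    /* synchronize the work-group */
    g[gid] = l[lid];                 /* write back to global memory */
}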

Page 11: OpenCL Guide

OpenCL programs

OpenCL programs are divided into two parts:

• One that executes on the device (in our case, on the GPU).
  - Write kernels.
  - The device program is the one you may be most concerned about.
• One that executes on the host (in our case, the CPU).
  - Offers an API so that you can manage your device execution.
  - Can be programmed in C or C++, and controls the OpenCL environment (context, command-queue, ...).

Page 12: OpenCL Guide

Sample: a kernel that adds two vectors

This kernel should take four parameters: two vectors to be added, another to store the result, and the vectors' size. If you write a program that solves this problem on the CPU, it will look something like this:

void vector_add_cpu(const float* src_a, const float* src_b,
                    float* res, const int num)
{
    for (int i = 0; i < num; i++)
        res[i] = src_a[i] + src_b[i];
}

Page 13: OpenCL Guide

Sample: a kernel that adds two vectors

However, on the GPU the logic is slightly different. Instead of having one thread iterate through all elements, we can have each thread compute one element, whose index is the same as the thread's:

__kernel void vectorAdd(__global const float* src_a,
                        __global const float* src_b,
                        __global float* res,
                        const int num)
{
    /* get_global_id(0) returns the ID of the thread in execution. As many
       threads are launched at the same time, executing the same kernel,
       each one will receive a different ID, and consequently perform a
       different computation. */
    const int idx = get_global_id(0);

    /* Now each work-item asks itself: "is my ID inside the vector's range?"
       If the answer is YES, the work-item performs the corresponding
       computation. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}

Page 14: OpenCL Guide

Sample...

// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};

// Number of elements in the vectors to be added
#define SIZE 2048

// Main function
// *********************************************************************
int main(int argc, char **argv)
{
    // Two floating-point source vectors in Host memory
    // (float, to match the vectorAdd kernel on the previous page)
    float HostVector1[SIZE], HostVector2[SIZE];

    // Initialize with some interesting repeating data
    for (int c = 0; c < SIZE; c++)
    {
        HostVector1[c] = InitialData1[c % 20];
        HostVector2[c] = InitialData2[c % 20];
    }

Page 15: OpenCL Guide

Sample...

    // Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
    cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
        NULL, NULL, NULL);

    // Get the list of GPU devices associated with this context
    size_t ParmDataBytes;
    clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
    cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
    clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);

    // Create a command-queue on the first GPU device
    cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext,
        GPUDevices[0], 0, NULL);

    // Allocate GPU memory for source vectors AND initialize from CPU memory
    cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
        CL_MEM_COPY_HOST_PTR, sizeof(float) * SIZE, HostVector1, NULL);
    cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
        CL_MEM_COPY_HOST_PTR, sizeof(float) * SIZE, HostVector2, NULL);

Page 16: OpenCL Guide

Sample...

    // Allocate output memory on GPU
    cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
        sizeof(float) * SIZE, NULL, NULL);

    // Create OpenCL program with source code
    // (OpenCLSource, defined elsewhere, holds the vectorAdd kernel source
    // as an array of 7 strings)
    cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7,
        OpenCLSource, NULL, NULL);

    // Build the program (OpenCL JIT compilation)
    // (a build-log check is sketched after this slide)
    clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

    // Create a handle to the compiled OpenCL function (kernel)
    cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "vectorAdd", NULL);

    // In the next step we associate the GPU memory with the kernel arguments,
    // in the order of the kernel signature: src_a, src_b, res, num
    clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUVector1);
    clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector2);
    clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUOutputVector);
    cl_int NumElements = SIZE;
    clSetKernelArg(OpenCLVectorAdd, 3, sizeof(cl_int), (void*)&NumElements);
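The sample ignores error codes; as a hedged sketch (reusing OpenCLProgram and GPUDevices from the sample), the compiler output can be retrieved with clGetProgramBuildInfo when clBuildProgram fails:

    // Query the size of the build log, then fetch and print it
    size_t LogSize;
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0], CL_PROGRAM_BUILD_LOG,
        0, NULL, &LogSize);
    char* BuildLog = (char*)malloc(LogSize);
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0], CL_PROGRAM_BUILD_LOG,
        LogSize, BuildLog, NULL);
    printf("%s\n", BuildLog);
    free(BuildLog);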

Page 17: OpenCL Guide

Sample...

    // Launch the kernel on the GPU
    size_t WorkSize[1] = {SIZE}; // one-dimensional range
    clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
        WorkSize, NULL, 0, NULL, NULL);

    // Copy the output in GPU memory back to CPU memory
    // (CL_TRUE makes the read blocking, so the results are ready afterwards)
    float HostOutputVector[SIZE];
    clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
        SIZE * sizeof(float), HostOutputVector, 0, NULL, NULL);

    // Cleanup
    free(GPUDevices);
    clReleaseKernel(OpenCLVectorAdd);
    clReleaseProgram(OpenCLProgram);
    clReleaseCommandQueue(GPUCommandQueue);
    clReleaseContext(GPUContext);
    clReleaseMemObject(GPUVector1);
    clReleaseMemObject(GPUVector2);
    clReleaseMemObject(GPUOutputVector);

Page 18: OpenCL Guide

Sample...

    // Print out the results: each element's sum is the ASCII code of a
    // character, so the output spells out "Hello OpenCL World!"
    for (int Rows = 0; Rows < (SIZE / 20); Rows++, printf("\n"))
    {
        for (int c = 0; c < 20; c++)
        {
            printf("%c", (char)HostOutputVector[Rows * 20 + c]);
        }
    }

Page 19: OpenCL Guide

Thanks!