Post on 11-Aug-2015

Page 1: The pocl Kernel Compiler

The pocl Kernel Compiler

Clay Chang

Page 2: The pocl Kernel Compiler

CPU versus GPU

CPU
• Sophisticated Control
• Branch Prediction
• Out-of-Order Execution
• Large Cache

GPU
• Little Control
• No or Limited Branch Prediction
• Simple Execution
• Small or no cache
• Lots of ALUs

Page 3: The pocl Kernel Compiler

OpenCL as the Portable API

Page 4: The pocl Kernel Compiler

Why OpenCL for CPU

• Multi-core CPUs are already out there (e.g. the MediaTek tri-cluster, 10-core SoC)
• The mobile GPU is already busy: ~25% is occupied by the system UI in Android
• Not every program runs well on a GPU (e.g. heavy branch divergence)
• OpenCL makes it easy to exploit multi-core and SIMD (imagine writing pthread + SIMD code in assembly or intrinsics instead)

Page 5: The pocl Kernel Compiler

Running OpenCL Kernels on CPU

One thread per work-item?
• Thousands of threads would be created
• Context-switching problems
• How to synchronize the threads?

How about running one work-group on a CPU thread?

Page 6: The pocl Kernel Compiler

Related Works

• Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors
• MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
• Clover (http://people.freedesktop.org/~steckdenis/clover)
• Shamrock (https://git.linaro.org/gpgpu/shamrock.git)

Page 7: The pocl Kernel Compiler

What is pocl

• POrtable Computing Language
• An efficient implementation of the OpenCL standard that can be easily adapted for new targets
• http://github.com/pocl/pocl
• Main developer: Pekka Jääskeläinen from Tampere University of Technology
• Supported architectures: CPU, tce, cellspu, HSA
• Current version: 0.11

Page 8: The pocl Kernel Compiler

Components in pocl

Page 9: The pocl Kernel Compiler

The pocl Kernel Compiler

[Figure: OpenCL kernel source → Clang / LLVM (at clBuildProgram(…)) → single work-item kernel → pocl kernel compiler (at clEnqueueNDRangeKernel(…, local_size, …)) → transformed kernel]

Page 10: The pocl Kernel Compiler

pocl Compilation Chain

1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc.
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and code generation

Page 11: The pocl Kernel Compiler

Work-group Function Generation

clEnqueueNDRangeKernel(…, group_size, …)

Work-group_function() {
  for (int i = 0; i < work-group_size; i++) {  // WI-loop
    Kernel (single work-item)
  }
}

What if there are barriers?
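A minimal C sketch of the WI-loop idea, assuming a hypothetical vector-add kernel. The single work-item kernel body becomes a plain function that receives the local ID as a parameter, and the work-group function wraps it in a loop over all work-items. The names vec_add_work_item and vec_add_work_group are illustrative only and are not pocl identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Single work-item kernel body. In OpenCL C the local ID would come from
   get_local_id(0); passing it as a parameter is an assumption of this sketch. */
static void vec_add_work_item(size_t local_id,
                              const float *a, const float *b, float *c)
{
    c[local_id] = a[local_id] + b[local_id];
}

/* Work-group function: one CPU thread executes every work-item of the
   work-group, one after another, in a WI-loop. */
static void vec_add_work_group(size_t work_group_size,
                               const float *a, const float *b, float *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < work_group_size; i++) {   /* WI-loop */
        vec_add_work_item(i, a, b, c);
    }
}
```

Because the kernel has no barriers, the work-items are independent and the loop can run them in any order, which is also what lets a backend vectorize it.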

Page 12: The pocl Kernel Compiler

Semantics of barrier Synchronization

OpenCL 1.2 rev19 p.30:

“… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…”

Example of a barrier that not all work-items reach (assuming tid is the work-item ID):

if (tid % 2) { …. barrier(); …}

Page 13: The pocl Kernel Compiler

Kernel Without barriers

• A node in a CFG is a basic block (BB)
  • BB: a branchless sequence of instructions
  • A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl kernel compiler generates a WI-loop around the CFG

Page 14: The pocl Kernel Compiler

Types of Barrier

• Unconditional barrier: a barrier that dominates the exit node
• Conditional barrier: a barrier placed inside
  • an if-else
  • a for-loop (b-loop)

Page 15: The pocl Kernel Compiler

Kernel with unconditional barriers

pocl Kernel Compiler creates WI-loops before and after the barrier

This forms an algorithm:

Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.

Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region.
Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
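The effect of this region formation can be sketched in C for a hypothetical kernel with one unconditional barrier. The kernel, the function name reverse_work_group, and the constant MAX_LOCAL_SIZE are assumptions made for illustration; the sketch shows the shape of the result (one WI-loop per parallel region), not pocl's actual generated code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative bound, not a pocl constant */

/* Hypothetical OpenCL C kernel being transformed:
 *
 *   tmp[lid] = in[lid];
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = tmp[get_local_size(0) - 1 - lid];
 */
static void reverse_work_group(size_t local_size, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];   /* stands in for the __local buffer */
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);

    /* Parallel region 1: from the implicit entry barrier to the explicit barrier. */
    for (size_t lid = 0; lid < local_size; lid++)
        tmp[lid] = in[lid];

    /* The barrier needs no code of its own: the first WI-loop has already
       finished for every work-item before the second one starts. */

    /* Parallel region 2: from the explicit barrier to the implicit exit barrier. */
    for (size_t lid = 0; lid < local_size; lid++)
        out[lid] = tmp[local_size - 1 - lid];
}
```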


Page 16: The pocl Kernel Compiler

A CFG with Two Conditional barriers

Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.

Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier successors.

Page 17: The pocl Kernel Compiler

A CFG with Two Conditional barriers – After Tail Duplication

Easier WI-loop creation!

[Figure: CFG after tail duplication, showing four barriers and two branches marked "?"]

Page 18: The pocl Kernel Compiler

“Peel” the First Loop Iteration

No more ambiguous branches in WI-loops!
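A rough C sketch of how tail duplication and peeling could fit together, assuming a hypothetical kernel with a barrier on each side of an if-else. Since all work-items must reach the same barrier, the first work-item is run on its own (peeled) to find out which barrier that is, and the matching copy of the duplicated tail is then executed in a WI-loop. All names here (before_barrier, cond_barrier_work_group, x_ctx, MAX_LOCAL_SIZE) are illustrative and are not taken from pocl.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative bound, not a pocl constant */

/* Hypothetical OpenCL C kernel being transformed:
 *
 *   int x = in[lid];
 *   if (flag) { x += 1;  barrier(...); x *= 2; }
 *   else      { x += 10; barrier(...); x *= 3; }
 *   out[lid] = x;          // tail shared by both paths
 */

/* Code before the barriers for one work-item. Returns 1 if the work-item
   arrives at the "then" barrier and 0 if it arrives at the "else" barrier. */
static int before_barrier(size_t lid, const int *in, int flag, int *x_ctx)
{
    int x = in[lid];
    if (flag) { x_ctx[lid] = x + 1; return 1; }
    x_ctx[lid] = x + 10;
    return 0;
}

static void cond_barrier_work_group(size_t local_size, const int *in,
                                    int flag, int *out)
{
    int x_ctx[MAX_LOCAL_SIZE];   /* x is live across the barrier */
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);

    /* Peeled first iteration: work-item 0 decides which barrier is reached. */
    int took_then = before_barrier(0, in, flag, x_ctx);

    /* The remaining work-items must arrive at the same barrier. */
    for (size_t lid = 1; lid < local_size; lid++) {
        int same = before_barrier(lid, in, flag, x_ctx);
        assert(same == took_then);
        (void)same;
    }

    /* Each barrier has its own copy of the tail (out[lid] = x). */
    if (took_then) {
        for (size_t lid = 0; lid < local_size; lid++)
            out[lid] = x_ctx[lid] * 2;
    } else {
        for (size_t lid = 0; lid < local_size; lid++)
            out[lid] = x_ctx[lid] * 3;
    }
}
```

The sketch re-evaluates the condition for every work-item and asserts that it agrees; a real implementation would rely on the barrier rule and simply follow the branch chosen by the peeled work-item.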

Page 19: The pocl Kernel Compiler

Barriers in Kernel Loops

Insert implicit barriers:
1. At the end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
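As an illustration, here is a C sketch of a hypothetical tree-reduction kernel whose loop contains a barrier (a b-loop). With the implicit barriers in place, the kernel loop ends up outside and a WI-loop runs inside each iteration. The kernel and the name reduce_work_group are assumptions of this sketch, and the loop counter is assumed to have the same value in every work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OpenCL C kernel being transformed:
 *
 *   for (s = get_local_size(0) / 2; s > 0; s >>= 1) {
 *       if (lid < s) buf[lid] += buf[lid + s];
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *   }
 */
static int reduce_work_group(size_t local_size, int *buf)
{
    /* The sketch requires a power-of-two work-group size. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);

    /* Implicit barrier 1 (end of pre-header): all work-items enter the
       loop together. */
    for (size_t s = local_size / 2; s > 0; s >>= 1) {
        /* Implicit barrier 3 (after the PHI nodes of the header): the
           parallel region for this iteration starts here. */
        for (size_t lid = 0; lid < local_size; lid++) {   /* WI-loop */
            if (lid < s)
                buf[lid] += buf[lid + s];
        }
        /* Explicit barrier and implicit barrier 2 (before the latch branch):
           every work-item finishes this iteration before any starts the next. */
    }
    return buf[0];
}
```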

Page 20: The pocl Kernel Compiler

Horizontal Inner-Loop Parallelization

More parallelization after loop interchange

blockWidth unknown until runtime
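The interchange can be sketched in C with a hypothetical kernel in which each work-item sums block_width values. Before the interchange the WI-loop is outermost; afterwards it is innermost, so consecutive work-items touch consecutive elements even though block_width is only known at run time. The function names, the memory layout, and MAX_LOCAL_SIZE are assumptions of this sketch rather than pocl's actual transformation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative bound, not a pocl constant */

/* WI-loop outside, kernel loop inside (before interchange). */
static void column_sum_wi_outer(size_t local_size, size_t block_width,
                                const int *in, int *out)
{
    for (size_t lid = 0; lid < local_size; lid++) {
        int acc = 0;
        for (size_t k = 0; k < block_width; k++)
            acc += in[k * local_size + lid];
        out[lid] = acc;
    }
}

/* Kernel loop outside, WI-loop inside (after interchange). The private
   accumulator becomes one slot per work-item. */
static void column_sum_wi_inner(size_t local_size, size_t block_width,
                                const int *in, int *out)
{
    int acc[MAX_LOCAL_SIZE];
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);

    for (size_t lid = 0; lid < local_size; lid++)
        acc[lid] = 0;

    for (size_t k = 0; k < block_width; k++)
        for (size_t lid = 0; lid < local_size; lid++)   /* unit-stride WI-loop */
            acc[lid] += in[k * local_size + lid];

    for (size_t lid = 0; lid < local_size; lid++)
        out[lid] = acc[lid];
}
```

Both versions compute the same result; the second one gives the vectorizer an inner loop whose trip count is the work-group size.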

Page 21: The pocl Kernel Compiler

Handling of Kernel Variables

1. There will be two parallel regions
2. a's lifetime is only in the first parallel region (it's a temporary variable)
3. b's lifetime spans both parallel regions

Context Array
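A minimal C sketch of the context-array idea, assuming a hypothetical kernel in which a is used only before the barrier and b is used on both sides of it. The variable a stays an ordinary scalar inside the first WI-loop, while b gets one slot per work-item so its value survives into the second WI-loop. The kernel and the names ctx_work_group, b_ctx, and MAX_LOCAL_SIZE are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64   /* illustrative bound, not a pocl constant */

/* Hypothetical OpenCL C kernel being transformed:
 *
 *   int a = in[lid] * 2;       // temporary, dead after the barrier
 *   int b = a + 1;             // live across the barrier
 *   tmp[lid] = a;
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = b + tmp[(lid + 1) % get_local_size(0)];
 */
static void ctx_work_group(size_t local_size, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];     /* stands in for the __local buffer */
    int b_ctx[MAX_LOCAL_SIZE];   /* context array: one copy of b per work-item */
    assert(local_size > 0 && local_size <= MAX_LOCAL_SIZE);

    /* Parallel region 1 */
    for (size_t lid = 0; lid < local_size; lid++) {
        int a = in[lid] * 2;     /* lives only in this region: plain scalar */
        b_ctx[lid] = a + 1;      /* saved for the second region */
        tmp[lid] = a;
    }

    /* Parallel region 2 */
    for (size_t lid = 0; lid < local_size; lid++)
        out[lid] = b_ctx[lid] + tmp[(lid + 1) % local_size];
}
```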

Page 22: The pocl Kernel Compiler

References

Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation" in International Journal of Parallel Programming, Springer, August 2014.

http://github.com/pocl/pocl