ikra: leveraging object-oriented abstractions in a ruby-to-cuda … · 2019. 12. 14. · 3....

Ikra: Leveraging Object-oriented Abstractions in a Ruby-to-CUDA JIT Translator

Matthias Springer, Hidehiko Masuhara (Tokyo Institute of Technology)

Object Support

[Class diagram: TrafficSimulator (Ruby application). Class Actor with instance variables @max_velocity, @progress, @street; classes Car and Pedestrian, each with a move method; ...]

Architecture

Overview:
1. Merge cascaded commands (kernel fusion)
2. Infer types and whether instance variables are read or written
3. Generate CUDA code
4. Compile shared library (nvcc)
5. Reorder jobs (avoiding warp divergence)
6. Trace reachable objects, allocate and transfer objects (Ruby FFI)
7. Invoke kernel
8. Write back written columns

[Class diagram: Ruby Application <<include>> Ikra. Ikra provides: pmap, pselect, peach, pnew. Command classes: ArrayMapCommand, ArraySelectCommand, ArrayNewCommand, ArrayIdentityCommand (to_command), ArrayEachCommand. ArrayCommand: -@target, +execute, +[](index), +pmap(&block), +pselect(&block), +peach(&block). Association: @target, 0..1, Array.]

Overview
- Acceleration of Ruby programs with GPUs (CUDA)
- High-level goal: GPGPU for Ruby programmers
- Source code analysis and type inference at runtime

Restrictions in parallel sections:
- No metaprogramming/reflection
- Dynamic typing should be avoided

No restrictions outside of parallel sections

Job reordering

[Figure: jobs in original order C C P C C P P; job reordering array 0 1 3 4 2 5 6; threads assigned to warp 1 and warp 2.]

Columnar Object Layout

Actor (before type inference): @max_velocity, @progress, @street, @color

Actor (after inferring types and read/written):
- @max_velocity : float / r (only read)
- @progress : float / rw (read and written)
- @street : Street / rw (read and written)
- @color : Object / (/) (neither read nor written)

[Figure: columns stored as arrays, indexed by object ID 1 to 4. Actor@max_velocity: float[] = 10.75, 12.90, 9.15, 10.00. Actor@progress: float[]. Actor@street: int[] = 4, 2, 8, 3 (object references). Street@length: int[]. (more columns)]

__device__ void meth_Pedestrian_move(int _id, int weather)
{
    // Advance by a random amount bounded by the actor's maximum velocity.
    attr_Actor_progress[_id] += meth_Random_rand(attr_Actor_max_velocity[_id]);

    // At the end of the street, continue on a random connected street.
    if (attr_Actor_progress[_id] >= attr_Street_length[attr_Actor_street[_id]]) {
        attr_Actor_street[_id] = meth_Street_random_connected(attr_Actor_street[_id]);
    }
}

Related Work:
- Job Reordering: Zhang et al. On-the-fly elimination of dynamic irregularities for GPU computing. ASPLOS XVI.
- Kernel Fusion: Wahib et al. Scalable kernel fusion for memory-bound GPU applications. SC '14.
- Columnar Object Layout: Mattis et al. Columnar objects: improving the performance of analytical applications. Onward! 2015.

[Figure: threads 1, 2 and 3 run the same code (attr_Actor_ ..., if attr_Actor_ ... { attr_Actor_street[2] = ... }) on the column attr_Actor_street: 4 2 8 3 4 4 ... (chance for coalescing).]

actor_size = 20
actor_type = [CAR, CAR, ...]
actor_max_velocity = [...]
# ...

def run_simulation
  weather = UI.selected_weather

  (1..actor_size).peach do |i|
    if actor_type[i] == CAR
      move_car(i, weather)
    elsif actor_type[i] == PED
      move_ped(i, weather)
    end
  end
end

@progress += Random.rand(@max_velocity)

if @progress >= @street.length
  @street = @street.random_connected
end

actors = [...]

def run_simulation
  weather = UI.selected_weather

  actors.peach do |actor|
    actor.move(weather)
  end
end

if weather == BAD
  @progress += 0.5 * @max_velocity
else
  @progress += 0.95 * @max_velocity
end

if @progress >= @street.length
  @street = @street.random_connected
end

with objects

[Figure: job reordering. Jobs in their original order cause warp divergence; after applying the job reordering array, the threads process the jobs in the order C C C C P P P.]

- Allocate threads: only objects of the same class per warp
- Improve code quality using class-based programming
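The job reordering idea can be illustrated with a minimal CPU-side sketch in plain Ruby. It is not Ikra's implementation: `job_reordering`, `Car` and `Pedestrian` are hypothetical names defined here, and the sketch only groups job indices by class without padding groups to warp boundaries, which a real implementation would need in order to keep one class per warp.

```ruby
# Hypothetical placeholder classes for the sketch.
class Car; end
class Pedestrian; end

# Returns the job indices grouped by the class of each job.
# group_by keeps classes in order of first appearance and preserves the
# original relative order of indices within each class.
def job_reordering(jobs)
  jobs.each_index.group_by { |i| jobs[i].class }.values.flatten
end

# Jobs in original order: C C P C C P P
jobs = [Car.new, Car.new, Pedestrian.new, Car.new,
        Car.new, Pedestrian.new, Pedestrian.new]

order = job_reordering(jobs)
reordered_classes = order.map { |i| jobs[i].class }
```

Thread `t` would then process `jobs[order[t]]` instead of `jobs[t]`, so neighbouring threads take the same branch of a class-dependent method call.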

Kernel Fusion

Heap Object Tracer

[Figure: roots arr[1], arr[2], arr[3], arr[4], arr[5], env1, env2]

1. Type inference

Actor:
- @max_velocity : float / r
- @progress : float / rw
- @street : Street / rw
- @color : Object / (/)

Scope of the analysis:
- Traverse only classes/methods reachable from main block
- Determine if instance variables are read/written

2. Find objects and calculate column offsets

Object graph traversal. Follow an instance variable if:
- Instance variable is marked as read/written
- Value type is not primitive and was not processed yet
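The traversal in step 2 can be sketched in plain Ruby as follows. This is an illustration under assumptions, not Ikra's tracer: `ACCESSED` stands in for the result of step 1, `trace_offsets` and `primitive?` are hypothetical helpers, offsets are 0-based, subclasses are not handled, and all values are made up.

```ruby
class Street
  def initialize(length)
    @length = length
  end
end

class Color; end

class Actor
  def initialize(max_velocity, street, color)
    @max_velocity = max_velocity
    @progress = 0.0
    @street = street
    @color = color
  end
end

# Assumed result of step 1: instance variables that are read or written.
# @color is absent because it is neither read nor written.
ACCESSED = {
  Actor  => [:@max_velocity, :@progress, :@street],
  Street => [:@length]
}.freeze

PRIMITIVES = [Integer, Float, TrueClass, FalseClass, NilClass].freeze

def primitive?(value)
  PRIMITIVES.any? { |type| value.is_a?(type) }
end

# Walks the object graph from the roots and assigns each reachable object
# the next free offset in the map of its class.
def trace_offsets(roots)
  offsets = Hash.new { |hash, klass| hash[klass] = {}.compare_by_identity }
  worklist = roots.dup
  until worklist.empty?
    obj = worklist.shift
    # Skip primitive values and objects that were already processed.
    next if primitive?(obj) || offsets[obj.class].key?(obj)

    map = offsets[obj.class]
    map[obj] = map.size
    # Follow only instance variables marked as read/written.
    ACCESSED.fetch(obj.class, []).each do |ivar|
      worklist << obj.instance_variable_get(ivar)
    end
  end
  offsets
end

s1 = Street.new(100)
s2 = Street.new(50)
actors = [Actor.new(10.75, s2, Color.new),
          Actor.new(12.9, s1, Color.new),
          Actor.new(9.15, s2, Color.new)]

offsets = trace_offsets(actors)
```

Because `@color` is not followed, no `Color` object receives an offset, so it would not need to be transferred.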

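[Figure: object graph with objects 1 to 7 and the offset maps built from it. Class A: object IDs 1, 7, 2; class B: object IDs 3, 4, 5; ...; each map assigns the offsets 1, 2, 3.]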

- Idea: Represent all objects of a class as fields of arrays (columnar layout)
- Benefit: Chance for coalescing when accessing the same column in parallel
- Implementation: Heap Object Tracer converts object graph to columnar layout

[Figure label: roots (array, lex. vars.)]

[Figure: command chain MapCommand, SelectCommand, MapCommand, IdentityCommand, Array.]

Accessing the content of a command (command[5], command.size, command.execute) triggers fusion of all commands, compilation, etc. (at runtime).
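The lazy behaviour can be mimicked on the CPU with a small plain-Ruby sketch. `SketchCommand` is a hypothetical stand-in, not Ikra's ArrayCommand: it records the chain through `@target` and only runs the blocks, fused into one pass over the input, when the content is accessed. Unlike a GPU kernel that writes results by job index, this sketch returns a compacted array.

```ruby
# Hypothetical stand-in for a command object; runs on the CPU only.
class SketchCommand
  attr_reader :target, :kind, :block

  def self.identity(array)
    new(array, :map, ->(el) { el })
  end

  def initialize(target, kind, block)
    @target = target # previous command, or the base Array
    @kind = kind     # :map or :select
    @block = block
  end

  def pmap(&block)
    SketchCommand.new(self, :map, block)
  end

  def pselect(&block)
    SketchCommand.new(self, :select, block)
  end

  # Fuses the whole chain into a single pass; the result is cached.
  def execute
    @result ||= begin
      input, steps = chain
      input.each_with_object([]) do |el, out|
        keep = steps.all? do |step|
          if step.kind == :map
            el = step.block.call(el)
            true
          else
            step.block.call(el)
          end
        end
        out << el if keep
      end
    end
  end

  def [](index)
    execute[index]
  end

  def size
    execute.size
  end

  private

  # Follows @target back to the base array; returns [array, steps].
  def chain
    cmd = self
    steps = []
    while cmd.is_a?(SketchCommand)
      steps.unshift(cmd)
      cmd = cmd.target
    end
    [cmd, steps]
  end
end

calls = 0
command = SketchCommand.identity([3, 6, 8])
                       .pmap { |el| calls += 1; 2 * el }
                       .pselect { |el| el > 10 }
                       .pmap { |el| el + 1 }

calls_before_access = calls # nothing has run yet
first = command[0]          # triggers the fused pass
```

Building the chain is cheap because no block runs until `[]`, `size` or `execute` is called, which is the point at which a JIT translator could generate and compile one kernel for the whole chain.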

__device__ int block_1(int el) { return el; }
__device__ int block_2(int el) { return 2 * el; }
__device__ bool block_3(int el) { return el > 10; }
__device__ int block_4(int el) { return el + 1; }

__global__ void kernel(int *input, int *result, int *jobs)
{
    // Look up the job for this thread in the job reordering array.
    int job = jobs[threadIdx.x + blockIdx.x * ...];
    int v2 = block_2(block_1(input[job]));
    if (block_3(v2))
        result[job] = block_4(v2);
}

[Figure labels: 1, 2, 3, 4; job reordering array; object ID; @target (four links); Result of step 2]

3. Write object columns

Replace object references with offsets.

Columns: attr_Actor_max_velocity, attr_Actor_progress, attr_Actor_street, attr_Street_length
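Step 3 can be illustrated with a short plain-Ruby sketch that writes one array per instance variable and stores references as offsets into the referenced class's columns. The helper names `offset_map` and `write_columns`, the 0-based offsets, and all values are assumptions made for this example; only the column names follow the poster.

```ruby
class Street
  attr_reader :length

  def initialize(length)
    @length = length
  end
end

class Actor
  attr_reader :max_velocity, :progress, :street

  def initialize(max_velocity, progress, street)
    @max_velocity = max_velocity
    @progress = progress
    @street = street
  end
end

# Maps each object to its position, standing in for the tracer's offset map.
def offset_map(objects)
  objects.each_with_index.to_h
end

# Writes one column per instance variable. The @street reference becomes
# an integer offset into the Street columns.
def write_columns(actors, streets)
  street_offsets = offset_map(streets)
  {
    attr_Actor_max_velocity: actors.map(&:max_velocity),
    attr_Actor_progress: actors.map(&:progress),
    attr_Actor_street: actors.map { |actor| street_offsets.fetch(actor.street) },
    attr_Street_length: streets.map(&:length)
  }
end

streets = [Street.new(100), Street.new(50)]
actors = [Actor.new(10.75, 9.5, streets[1]),
          Actor.new(12.9, 10.2, streets[0])]

columns = write_columns(actors, streets)

# Equivalent of attr_Street_length[attr_Actor_street[_id]] for _id = 0.
length_for_actor0 = columns[:attr_Street_length][columns[:attr_Actor_street][0]]
```

With references stored as offsets, a chained access such as `@street.length` turns into two array lookups, which is the form the generated device code uses.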

no self for class methods

Ikra
