

Ikra: Leveraging Object-oriented Abstractions in a Ruby-to-CUDA JIT Translator

Matthias Springer, Hidehiko Masuhara (Tokyo Institute of Technology)

Object Support

TrafficSimulator << Ruby Application >>

Actor
  - @max_velocity
  - @progress
  - @street

Car (subclass of Actor)
  + move

Pedestrian (subclass of Actor)
  + move

...

Architecture

Overview:
1. Merge cascaded commands (kernel fusion)
2. Infer types and whether instance variables are read or written
3. Generate CUDA code
4. Compile shared library (nvcc)
5. Reorder jobs (avoiding warp divergence)
6. Trace reachable objects, allocate and transfer objects (Ruby FFI)
7. Invoke kernel
8. Write back written columns

Ikra << include >> Ruby Application

Ikra provides: pmap, pselect, peach, pnew

Command classes: ArrayMapCommand, ArraySelectCommand, ArrayNewCommand, ArrayIdentityCommand, ArrayEachCommand

ArrayCommand
  - @target
  + execute
  + [](index)
  + pmap(&block)
  + pselect(&block)
  + peach(&block)

@target: 0..1

Array
  + to_command

Overview
- Acceleration of Ruby programs with GPUs (CUDA)
- High-level goal: GPGPU for Ruby programmers
- Source code analysis and type inference at runtime

Restrictions in parallel sections:
- No metaprogramming/reflection
- Dynamic typing should be avoided

No restrictions outside of parallel sections

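Job reordering

[Figure: jobs in original order belong to classes C C P C C P P; the job reordering array 0 1 3 4 2 5 6 groups them by class before they are assigned to warp 1 and warp 2]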
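A job reordering array like 0 1 3 4 2 5 6 for the class sequence C C P C C P P can be obtained with a stable grouping by class. The following Ruby sketch is only an illustration: the function name `job_reordering` and the empty `Car` and `Pedestrian` classes are assumptions, and the poster does not state how Ikra computes the array internally.

```ruby
# Hypothetical stand-ins for the poster's Car and Pedestrian classes.
class Car; end
class Pedestrian; end

# Returns the job indices grouped by class. Classes appear in order of first
# occurrence, and the original order is kept within each class.
def job_reordering(objects)
  objects.each_index.group_by { |index| objects[index].class }.values.flatten
end

actors = [Car.new, Car.new, Pedestrian.new, Car.new,
          Car.new, Pedestrian.new, Pedestrian.new]
jobs = job_reordering(actors)
p jobs # => [0, 1, 3, 4, 2, 5, 6]
```

Thread i then processes `actors[jobs[i]]`. The sketch does not pad each group to a multiple of the warp size (32 threads on NVIDIA GPUs), which would be needed to guarantee that no warp mixes classes.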

Columnar Object Layout

Actor (before type inference)
  - @max_velocity
  - @progress
  - @street
  - @color

Actor (after inferring types and read/written status)
  - @max_velocity : float / r      (only read)
  - @progress : float / rw         (read and written)
  - @street : Street / rw          (read and written)
  - @color : Object / (/)          (neither read nor written)

Columns (arrays), indexed by object ID:

  ID   Actor @max_velocity   Actor @progress   Actor @street
       float[]               float[]           int[] (object reference)
  1    10.75                 9.50              4
  2    12.90                 10.2              2
  3    9.15                  0.75              8
  4    10.00                 1.10              3

  ID   Street @length
       int[]
  1    75
  2    15
  3    37
  4    9

  (more columns)
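To make the layout concrete, here is a minimal Ruby sketch that turns a few objects into one array per instance variable and replaces object references with integer indices. The class definitions and the sample data are assumptions chosen for illustration; only the column names follow the poster.

```ruby
# Hypothetical Ruby-side classes for the sketch.
class Street
  attr_reader :length

  def initialize(length)
    @length = length
  end
end

class Actor
  attr_reader :max_velocity, :progress, :street

  def initialize(max_velocity, progress, street)
    @max_velocity = max_velocity
    @progress = progress
    @street = street
  end
end

streets = [Street.new(75), Street.new(15), Street.new(37)]
actors = [
  Actor.new(10.75, 9.50, streets[2]),
  Actor.new(12.90, 10.2, streets[0]),
  Actor.new(9.15, 0.75, streets[1])
]

# Map each Street object to its row index, compared by object identity.
street_id = {}.compare_by_identity
streets.each_with_index { |street, index| street_id[street] = index }

# One array per instance variable; object references become integer indices.
attr_Actor_max_velocity = actors.map(&:max_velocity)
attr_Actor_progress     = actors.map(&:progress)
attr_Actor_street       = actors.map { |actor| street_id[actor.street] }
attr_Street_length      = streets.map(&:length)

# Same access pattern as the generated code: follow the reference column.
id = 0
p attr_Street_length[attr_Actor_street[id]] # => 37
```

The sketch uses 0-based indices, whereas the figure numbers objects from 1.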

    __device__ void meth_Pedestrian_move(int _id, int weather) {
        // _id is the object ID (self); it indexes into the Actor columns.
        attr_Actor_progress[_id] += meth_Random_rand(attr_Actor_max_velocity[_id]);

        if (attr_Actor_progress[_id] >= attr_Street_length[attr_Actor_street[_id]]) {
            attr_Actor_street[_id] = meth_Street_random_connected(attr_Actor_street[_id]);
        }
    }

Related Work:
- Job Reordering: Zhang et al. On-the-fly elimination of dynamic irregularities for GPU computing. ASPLOS XVI.
- Kernel Fusion: Wahib et al. Scalable kernel fusion for memory-bound GPU applications. SC '14.
- Columnar Object Layout: Mattis et al. Columnar objects: improving the performance of analytical applications. Onward! 2015.

[Figure: threads 1, 2 and 3 run the same generated code and write attr_Actor_street[2], attr_Actor_street[4] and attr_Actor_street[5], respectively; the column attr_Actor_street (4 2 8 3 4 4 ...) is stored contiguously, which gives a chance for coalescing]

Without objects:

    actor_size = 20
    actor_type = [CAR, CAR, ...]
    actor_max_velocity = [...]
    # ...

    def run_simulation
      weather = UI.selected_weather

      (1..actor_size).peach do |i|
        if actor_type[i] == CAR
          move_car(i, weather)
        elsif actor_type[i] == PED
          move_ped(i, weather)
        end
      end
    end

Body of Pedestrian#move:

    @progress += Random.rand(@max_velocity)

    if @progress >= @street.length
      @street = @street.random_connected
    end

With objects:

    actors = [...]

    def run_simulation
      weather = UI.selected_weather

      actors.peach do |actor|
        actor.move(weather)
      end
    end

Body of a weather-dependent move method (presumably Car#move):

    if weather == BAD
      @progress += 0.5 * @max_velocity
    else
      @progress += 0.95 * @max_velocity
    end

    if @progress >= @street.length
      @street = @street.random_connected
    end

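[Figure: the job reordering array maps threads to jobs. In the original job order, warps mix classes and suffer warp divergence; after reordering, the jobs are grouped as C C C C P P P]

- Allocate threads: only objects of the same class per warp
- Improve code quality using class-based programming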

Heap Object Tracer

[Figure: roots of the object graph: array elements arr[1] to arr[5] and lexical variables env1, env2]

1. Type inference

Actor
  - @max_velocity : float / r
  - @progress : float / rw
  - @street : Street / rw
  - @color : Object / (/)

- Traverse only classes/methods reachable from main block
- Determine if instance variables are read/written
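As an illustration of the read/written classification, the sketch below walks the syntax tree of a method body with Ruby's standard Ripper library. This is a purely syntactic approximation under stated assumptions: the helper names `ivar_name` and `ivar_access` are hypothetical, and the poster does not describe how Ikra performs this analysis.

```ruby
require "ripper"

# Returns the name if the node is an instance-variable token, else nil.
def ivar_name(node)
  node.is_a?(Array) && node[0] == :@ivar ? node[1] : nil
end

# Returns { "@name" => { read: bool, written: bool } } for the given source.
def ivar_access(source)
  access = Hash.new { |hash, name| hash[name] = { read: false, written: false } }
  walk = lambda do |node|
    next unless node.is_a?(Array)
    case node[0]
    when :var_ref                 # instance variable used as a value
      name = ivar_name(node[1])
      access[name][:read] = true if name
    when :var_field               # instance variable on the left of an assignment
      name = ivar_name(node[1])
      access[name][:written] = true if name
    when :opassign                # @x += ... also reads the old value
      target = node[1]
      name = ivar_name(target[1]) if target.is_a?(Array) && target[0] == :var_field
      access[name][:read] = true if name
    end
    node.each { |child| walk.call(child) }
  end
  walk.call(Ripper.sexp(source))
  access
end

source = <<~RUBY
  @progress += Random.rand(@max_velocity)
  if @progress >= @street.length
    @street = @street.random_connected
  end
RUBY

result = ivar_access(source)
result.each { |name, flags| puts "#{name}: #{flags}" }
```

For this method body, @max_velocity comes out as read only, while @progress and @street are both read and written. @color never appears in the source, so it gets no entry, which corresponds to the "(/)" marker.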

2. Find objects and calculate column offsets

Object graph traversal. Follow an instance variable if:
- Instance variable is marked as read/written
- Value type is not primitive and was not processed yet

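[Figure: object graph with objects 1 to 7 and the resulting offset maps; class A holds objects 1, 7, 2 and class B holds objects 3, 4, 5, each mapped to column offsets 1, 2, 3]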
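The traversal that produces such offset maps can be sketched in a few lines of Ruby. Everything here is an assumption made for illustration: the classes, the `ACCESSED` table (standing in for the result of step 1), the `PRIMITIVES` list and the function name `trace` are hypothetical, and offsets are 0-based rather than 1-based as in the figure.

```ruby
# Hypothetical classes standing in for the poster's Actor and Street.
class Street
  def initialize(length)
    @length = length
  end
end

class Actor
  def initialize(max_velocity, street, color)
    @max_velocity = max_velocity
    @progress = 0.0
    @street = street
    @color = color # never accessed in the parallel section
  end
end

# Assumed output of step 1: instance variables marked as read or written.
ACCESSED = {
  Actor => [:@max_velocity, :@progress, :@street],
  Street => [:@length]
}.freeze

PRIMITIVES = [Integer, Float, TrueClass, FalseClass, NilClass].freeze

# Breadth-first traversal from the roots. Returns one offset map per class:
# { klass => { object => column offset } }, keyed by object identity.
def trace(roots, accessed)
  offsets = Hash.new { |hash, klass| hash[klass] = {}.compare_by_identity }
  worklist = roots.dup
  until worklist.empty?
    object = worklist.shift
    next if PRIMITIVES.any? { |primitive| object.is_a?(primitive) }
    map = offsets[object.class]
    next if map.key?(object)   # already processed
    map[object] = map.size     # next free offset within this class
    accessed.fetch(object.class, []).each do |ivar|
      worklist << object.instance_variable_get(ivar)
    end
  end
  offsets
end

s1 = Street.new(75)
s2 = Street.new(15)
actors = [Actor.new(10.75, s2, "red"), Actor.new(12.90, s1, "blue"),
          Actor.new(9.15, s2, "red")]
offsets = trace(actors, ACCESSED)
puts offsets[Street][s2] # => 0
```

Because @color is not marked as read or written, the color strings are never visited and get no offset map.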

- Idea: Represent all objects of a class as fields of arrays (columnar layout)
- Benefit: Chance for coalescing when accessing the same column in parallel
- Implementation: Heap Object Tracer converts object graph to columnar layout

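(Heap Object Tracer figure label: roots are the array and the lexical variables.)

Kernel Fusion

[Figure: command chain MapCommand -> SelectCommand -> MapCommand -> IdentityCommand -> Array]

Accessing the content of a command (command[5], command.size, command.execute) triggers fusion of all commands, compilation, etc. (at runtime).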

    __device__ int block_1(int el) { return el; }
    __device__ int block_2(int el) { return 2 * el; }
    __device__ bool block_3(int el) { return el > 10; }
    __device__ int block_4(int el) { return el + 1; }

    __global__ void kernel(int *input, int *result, int *jobs) {
        int job = jobs[threadIdx.x + blockIdx.x * ...];
        // All four blocks are fused into a single kernel.
        int v2 = block_2(block_1(input[job]));
        if (block_3(v2)) result[job] = block_4(v2);
    }
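The lazy behaviour of the command objects can be mimicked on the CPU. The Ruby sketch below is a simplified assumption, not Ikra's implementation: it uses a single `ArrayCommand` class with a `kind` tag instead of the separate command classes, and `fused_pass`, `stages` and `input` are hypothetical helper names. Calling `pmap` or `pselect` only builds the chain; `execute`, `[]` or `size` runs all blocks in one pass over the input.

```ruby
class ArrayCommand
  attr_reader :target, :kind, :block

  # target is the previous command, or the input Array for kind :identity.
  def initialize(target, kind, block = nil)
    @target = target
    @kind = kind
    @block = block
  end

  def pmap(&block)
    ArrayCommand.new(self, :map, block)
  end

  def pselect(&block)
    ArrayCommand.new(self, :select, block)
  end

  # Accessing the result triggers the fused pass once and caches it.
  def execute
    @result ||= fused_pass
  end

  def [](index)
    execute[index]
  end

  def size
    execute.size
  end

  protected

  # Map and select commands from the root of the chain to this command.
  def stages
    kind == :identity ? [] : target.stages + [self]
  end

  def input
    kind == :identity ? target : target.input
  end

  private

  # One loop over the input that applies every stage to each element.
  def fused_pass
    pipeline = stages
    input.each_with_object([]) do |element, result|
      value = element
      kept = pipeline.all? do |stage|
        if stage.kind == :map
          value = stage.block.call(value)
          true
        else
          stage.block.call(value)
        end
      end
      result << value if kept
    end
  end
end

command = ArrayCommand.new([3, 6, 9], :identity)
                      .pmap { |el| 2 * el }
                      .pselect { |el| el > 10 }
                      .pmap { |el| el + 1 }
p command.execute # => [13, 19]
```

Unlike the kernel, which writes result[job] only for selected elements and leaves the other slots untouched, this sketch compacts the output array.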

[Figure labels: job reordering array mapping jobs 1 to 4 to object IDs; the four @target references that link the command chain]

[Figure label: result of step 2]

3. Write object columns

Replace object references with offsets.

Columns: attr_Actor_max_velocity, attr_Actor_progress, attr_Actor_street, attr_Street_length

Annotation on the generated code: no self argument for class methods.
