ikra: leveraging object-oriented abstractions in a ruby-to-cuda … · 2019. 12. 14. · 3....
TRANSCRIPT
Ikra: Leveraging Object-oriented Abstractions in a Ruby-to-CUDA JIT Translator
Matthias Springer, Hidehiko Masuhara(Tokyo Institute of Technology)
Object Support TrafficSimulator<< Ruby Application >>
Actor-@max_velocity-@progress-@street
Car+move
Pedestrian+move
...
Architecture
Overview:1. Merge cascaded commands (kernel fusion)2. Infer types and whether instance variables are read or written3. Generate CUDA code4. Compile shared library (nvcc)5. Reorder jobs (avoiding warp divergence) 6. Trace reachable objects, allocate and transfer objects (Ruby FFI)7. Invoke kernel8. Write back written columns
Ikra Ruby Application<<include>>
provides:pmap
pselect
peach
pnew
ArrayMapCommandArraySelectCommand
ArrayNewCommandArrayIdentityCommandto_command
ArrayEachCommand
ArrayCommand-@target+execute+[](index)+pmap(&block)+pselect(&block)+peach(&block)
@target
0..1
Array.
Overview- Acceleration of Ruby programs with GPUs (CUDA)- High-level Goal: GPGPU for Ruby programmers- Source code analysis and type inference at runtime
Restrictions in parallel sections:- No meta programming/reflection- Dynamic typing should be avoided
No restrictions outside of parallel sections
C C P C C P P
0 1 3 4 2 5 6
Job reordering
warp 1 warp 2
Columnar Object Layout Actor
-@max_velocity-@progress-@street-@color
Actor-@max_velocity : float / r-@progress : float / rw-@street : Street / rw-@color : Object / (/)
type inf.
infer types andread/written
read andwritten
only read
neither readnor written
Actor@max_velocity
float[]
1234
10.7512.909.15
10.00
@streetint[]
1234
4283
Street@length
int[]
1234
7515379
(morecolumns)
colu
mns
arrays
objectreference@progress
float[]
1234
9.5010.20.751.10
__device__ void meth_Pedestrian_move(int _id, int weather){ attr_Actor_progress[_id] += meth_Random_rand(attr_Actor_maxProgress[_id]);
if attr_Actor_progress[_id] >= attr_Street_length[attr_Actor_street[_id]]) { attr_Actor_street[_id] = meth_Street_random_connected(attr_Actor_street[_id]); }}
Related Work:Zhang et al. On-the-fly elimination of dynamic irregularities for GPU computing. ASPLOS XVI.
- Job Reordering
- Kernel FusionWahib et al. Scalable kernel fusion for memory-bound GPU applications. SC '14.
- Columnar Object LayoutMattis et al. Columnar objects: improving the performance of analytical applications. ONWARD! 2015
attr_Actor_ ...if attr_Actor_ ...{ attr_Actor_street[2] = ...}
attr_Actor_ ...if attr_Actor_ ...{ attr_Actor_street[4] = ...}
attr_Actor_ ...if attr_Actor_ ...{ attr_Actor_street[5] = ...}
thread 1 thread 2 thread 3
attr_Actor_street: 4 2 8 3 4 4 (chance for coalescing)
...
actor_size = 20actor_type = [CAR, CAR, ...]actor_max_velocity = [...]# ...
def run_simulation weather = UI.selected_weather
(1..actor_size).peach do |i| if actor_type[i] == CAR move_car(i, weather) elsif actor_type[i] == PED move_ped(i, weather) end endend
@progress += Random.rand(@max_velocity)
if @progress >= @street.length @street = @street.random_connected end
actors = [...]
def run_simulation weather = UI.selected_weather
actors.peach do |actor| actor.move(weather) endend
if weather == BAD @progress += 0.5 * @max_velocity else @progress += 0.95 * @max_velocity end
if @progress >= @street.length @street = @street.random_connected end
with
obje
cts
job reorderingarray
jobs(original order)
warpdivergence
C C C C P P P
- Allocate threads: only objects of the same class per warp- Improve code quality using class-based programming
threads
Kernel Fusion Heap Object Tracerarr[1]arr[2]arr[3]arr[4]arr[5]
env1env2
1. Type inferenceActor
-@max_velocity : float / r-@progress : float / rw-@street : Street / rw-@color : Object / (/)
- Traverse only classes/methods reachable from main block- Determine if instance variables are read/written
2. Find objects and calc. column offsetsObject graph traversal. Follow instance variable if:- Instance variable is marked as read/written- Value type is not primitive and was not processed yet
12 3
4
56
7
class A172
class B345
...123
123
offset maps
- Idea: Represent all objects of a class as fields of arrays (columnar layout)- Benefit: Chance for coalescing when accessing the same column in parallel- Implementation: Heap Object Tracer converts object graph to columnar layout
roots (array, lex. vars.)MapCommand
SelectCommand
MapCommand
IdentityCommand
Array
command[5]command.sizecommand. execute
Accessing content ofcommand triggers
fusion of all commands,
compilation, etc.(at runtime)
__device__ int block_1(int el) { return el; }__device__ int block_2(int el) { return 2 * el; }__device__ bool block_3(int el) { return el > 10; }__device__ int block_4(int el) { return el + 1; }
__global__ void kernel(int *input, int *result, int *jobs) { int job = jobs[threadIdx.x + blockIdx.x * ...]; int v2 = block_2(block_1(input[job])); if (block_3(v1)) result[job] = block_4(v2); }
1
2
3
4
job reorderingarray
object ID
@target
@target
@target
@target
Result of step 2
3. Write object columnsReplace object references with offsets
attr_Actor_max_velocity
attr_Actor_progress
attr_Actor_street attr_Street_length
no self forclass methods
Ikra