gpus graal
Enabling Heterogeneous Computing in Java with Graal
Juan Fumero, Michel Steuwer, Christophe Dubach
Introduction
API
Runtime Code Generation
Data Management
Results
Conclusion
Enabling Heterogeneous Computing in Java with Graal
Juan Fumero, Michel Steuwer, Christophe Dubach
The University of Edinburgh
7 July 2015, Truffle Workshop
1 Introduction
2 API
3 Runtime Code Generation
4 Data Management
5 Results
6 Conclusion
Heterogeneous Computing
• NBody App (NVIDIA SDK): ~105x speedup over the sequential version
• LU Decomposition (Rodinia Benchmark): ~10x speedup over 32 OpenMP threads
Cool, but how to program?
Example in OpenCL

    // create host buffers
    int *A, ....
    // Initialization
    ...
    // platform
    cl_uint numPlatforms = 0;
    cl_platform_id *platforms;
    status = clGetPlatformIDs(0, NULL, &numPlatforms);
    platforms = (cl_platform_id*) malloc(numPlatforms * sizeof(cl_platform_id));
    status = clGetPlatformIDs(numPlatforms, platforms, NULL);
    cl_uint numDevices = 0;
    cl_device_id *devices;
    status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);
    // Allocate space for each device
    devices = (cl_device_id*) malloc(numDevices * sizeof(cl_device_id));
    // Fill in devices
    status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
    cl_context context;
    context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
    cl_command_queue cmdQ;
    cmdQ = clCreateCommandQueue(context, devices[0], 0, &status);
    cl_mem d_A, d_B, d_C;
    d_A = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, datasize, A, &status);
    d_B = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, datasize, B, &status);
    d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
    ...
    // Check errors
    ...
Example in OpenCL

    const char *sourceFile = "kernel.cl";
    source = readsource(sourceFile);
    program = clCreateProgramWithSource(context, 1, (const char**)&source, NULL, &status);
    cl_int buildErr;
    buildErr = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
    // Create a kernel
    kernel = clCreateKernel(program, "vecadd", &status);

    status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_A);
    status |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_B);
    status |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_C);

    size_t globalWorkSize[1] = { ELEMENTS };
    size_t localitemsize[1] = { 5 };

    clEnqueueNDRangeKernel(cmdQ, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);

    clEnqueueReadBuffer(cmdQ, d_C, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

    // Free memory
OpenCL example

    __kernel void vecadd(
            __global int *a,
            __global int *b,
            __global int *c) {

        int idx = get_global_id(0);
        c[idx] = a[idx] * b[idx];
    }

• Hello world app: ~250 lines of code (including error checking)
• Low-level and specific code
• Knowledge about the target architecture
• If the GPU/accelerator changes, tuning is required
OpenCL programming is hard and error-prone!!
Higher levels of abstraction
Similar works
• Sumatra API (discontinued): Stream API for HSAIL
• AMD Aparapi: Java API for OpenCL
• NVIDIA Nova: functional programming language for CPU/GPU
• Copperhead: subset of Python that can be executed on heterogeneous platforms
Our Approach
Three levels of abstraction:
• Parallel Skeletons: API based on functional programming style (map/reduce)
• High-level optimising library which rewrites operations to target specific hardware
• OpenCL code generation and runtime with data management for heterogeneous architectures
Our Approach: Overview
[Figure: workflow diagram. Boxes: Application + ArrayFunction API; Java source compilation; Java bytecode; Java execution of dotP.apply(input); OpenCL Kernel Generation (using Graal API); OpenCL Kernel; OpenCL Execution through JOCL on the Accelerator; output.]
Example: Saxpy

    // Computation function
    ArrayFunc<Tuple2<Float, Float>, Float> mult = new MapFunction<>(t -> 2.5f * t._1() + t._2());

    // Prepare the input (each element must be allocated before use)
    Tuple2<Float, Float>[] input = new Tuple2[size];
    for (int i = 0; i < input.length; ++i) {
        input[i] = new Tuple2<>((float) (i * 0.323), (float) (i + 2.0));
    }

    // Computation
    Float[] output = mult.apply(input);

If an accelerator is enabled, the map expression is automatically rewritten into lower-level operations:
map(λ) = MapAccelerator(λ) = CopyIn().computeOCL(λ).CopyOut()
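The rewrite of map(λ) into a copy-in, compute, copy-out chain can be illustrated with a small CPU-only sketch. Everything here is an assumption made for illustration: the names `Stage`, `copyIn`, `compute`, `copyOut` and `mapAccelerator` are hypothetical, the "device" is just a cloned Java array, and no OpenCL is involved. It only shows how three stages might be composed into one map operation.

```java
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical names throughout: this is not the JPAI implementation.
final class MapRewriteSketch {

    /** A stage that transforms an input of type I into an output of type O. */
    interface Stage<I, O> {
        O apply(I input);

        /** Chains this stage with the next one, mirroring CopyIn().computeOCL(λ).CopyOut(). */
        default <R> Stage<I, R> then(Stage<O, R> next) {
            return input -> next.apply(this.apply(input));
        }
    }

    /** Stands in for the host-to-device copy: here it just clones the array. */
    static Stage<float[], float[]> copyIn() {
        return host -> host.clone();
    }

    /** Stands in for the kernel launch: applies the lambda element by element on the CPU. */
    static Stage<float[], float[]> compute(Function<Float, Float> lambda) {
        return device -> {
            float[] result = new float[device.length];
            for (int i = 0; i < device.length; i++) {
                result[i] = lambda.apply(device[i]);
            }
            return result;
        };
    }

    /** Stands in for the device-to-host copy. */
    static Stage<float[], float[]> copyOut() {
        return device -> device.clone();
    }

    /** map(λ) rewritten as copyIn, then compute(λ), then copyOut. */
    static Stage<float[], float[]> mapAccelerator(Function<Float, Float> lambda) {
        return copyIn().then(compute(lambda)).then(copyOut());
    }

    public static void main(String[] args) {
        float[] out = mapAccelerator(x -> 2.5f * x + 1.0f).apply(new float[] {0f, 1f, 2f});
        System.out.println(Arrays.toString(out)); // prints [1.0, 3.5, 6.0]
    }
}
```

Keeping each step as a separate stage is one way to let a library swap the middle stage (sequential, threaded, or OpenCL) without touching user code.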
Our Approach: Overview
[Figure: class diagram with Function, ArrayFunc, Map, Reduce, ..., MapThreads and MapOpenCL. Each Map variant provides its own apply():]
Map: apply() { for (i = 0; i < size; ++i) out[i] = f.apply(in[i]); }
MapThreads: apply() { for (thread : threads) thread.performMapSeq(); }
MapOpenCL: apply() { copyToDevice(); execute(); copyToHost(); }
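The MapThreads variant from the diagram can be sketched as follows. This is a minimal illustration under assumptions, not the library's code: `mapSeq` and `mapThreads` are hypothetical names, and the static chunking of the input into one contiguous range per thread is a choice made here for simplicity.

```java
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical sketch of a thread-based map; names and chunking are assumptions.
final class MapThreadsSketch {

    /** Sequential map over the index range [from, to). */
    static <T, R> void mapSeq(Function<T, R> f, T[] in, R[] out, int from, int to) {
        for (int i = from; i < to; i++) {
            out[i] = f.apply(in[i]);
        }
    }

    /** Splits the input into contiguous chunks and maps each chunk on its own thread. */
    static <T, R> void mapThreads(Function<T, R> f, T[] in, R[] out, int numThreads) {
        Thread[] threads = new Thread[numThreads];
        int chunk = (in.length + numThreads - 1) / numThreads; // ceiling division
        for (int t = 0; t < numThreads; t++) {
            final int from = Math.min(in.length, t * chunk);
            final int to = Math.min(in.length, from + chunk);
            threads[t] = new Thread(() -> mapSeq(f, in, out, from, to));
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // after join, the worker's writes to out are visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }

    public static void main(String[] args) {
        Integer[] in = {1, 2, 3, 4, 5};
        Integer[] out = new Integer[in.length];
        mapThreads(x -> x * x, in, out, 2);
        System.out.println(Arrays.toString(out)); // prints [1, 4, 9, 16, 25]
    }
}
```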
Runtime Code Generation: Workflow
[Figure: Java source (Map.apply(f)) → Java bytecode → Graal VM → OpenCL Kernel. Steps inside the Graal VM: 1. Type inference, 2. IR generation (CFG + Dataflow, Graal IR), 3. Optimizations, 4. Kernel generation.]
Java bytecode excerpt shown in the figure:

    ...
    10: aload_2
    11: iload_3
    12: aload_0
    13: getfield
    16: aaload
    18: invokeinterface #apply
    23: aastore
    24: iinc
    27: iload_3
    ...

OpenCL kernel skeleton shown in the figure:

    void kernel(global float* input, global float* output) { ...; ...; }
OpenCL code generated

    double lambda0(float p0) {
        double cast_1 = (double) p0;
        double result_2 = cast_1 * 2.0;
        return result_2;
    }
    __kernel void lambdaComputationKernel(
            __global float *p0,
            __global int *p0_index_data,
            __global double *p1,
            __global int *p1_index_data) {
        int p0_dim_1 = 0; int p1_dim_1 = 0;
        int gs = get_global_size(0);
        int loop_1 = get_global_id(0);
        for ( ; ; loop_1 += gs) {
            int p0_len_dim_1 = p0_index_data[p0_dim_1];
            bool cond_2 = loop_1 < p0_len_dim_1;
            if (cond_2) {
                float auxVar0 = p0[loop_1];
                double res = lambda0(auxVar0);
                p1[p1_index_data[p1_dim_1 + 1] + loop_1] = res;
            } else { break; }
        }
    }
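In the generated kernel, each work item starts at its global id and advances by the global size until it runs past the input length, so the whole array is covered even when there are fewer work items than elements. The sketch below emulates that iteration pattern sequentially in Java. It is an illustration under assumptions: `runKernel` is a hypothetical name, work items run one after another instead of in parallel, and the index metadata arrays (`p0_index_data`, `p1_index_data`) are replaced by the plain array length with a zero output offset.

```java
import java.util.Arrays;

// Hypothetical CPU emulation of the kernel's strided loop; not generated code.
final class GridStrideSketch {

    /** Mirrors lambda0 from the kernel: widen to double, then multiply by 2.0. */
    static double lambda0(float p0) {
        return (double) p0 * 2.0;
    }

    /**
     * Emulates every work item in turn. Work item gid handles indices
     * gid, gid + globalSize, gid + 2 * globalSize, ... while they are in range.
     */
    static double[] runKernel(float[] p0, int globalSize) {
        double[] p1 = new double[p0.length];
        for (int gid = 0; gid < globalSize; gid++) {
            for (int loop = gid; loop < p0.length; loop += globalSize) {
                p1[loop] = lambda0(p0[loop]);
            }
        }
        return p1;
    }

    public static void main(String[] args) {
        double[] out = runKernel(new float[] {1f, 2f, 3f, 4f, 5f}, 2);
        System.out.println(Arrays.toString(out)); // prints [2.0, 4.0, 6.0, 8.0, 10.0]
    }
}
```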
Investigation of runtime for BS
Black-Scholes benchmark: Float[] ⟹ Tuple2<Float, Float>[]
[Figure: amount of total runtime in % (axis from 0.0 to 1.0), broken down into Java overhead, Marshaling, CopyToGPU, GPU Execution, CopyToCPU and Unmarshaling.]
• Un/marshaling data takes up to 90% of the time
• The computation step should be dominant
This is not acceptable. Can we do better?
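To make the cost concrete, the sketch below shows what marshaling and unmarshaling might look like for the Float[] ⟹ Tuple2<Float, Float>[] case: boxed inputs are flattened into a primitive array before the transfer, and primitive results are boxed back into tuple objects afterwards. The class `Pair` and the methods `marshal` and `unmarshal` are hypothetical stand-ins, not the library's types; the point is only that both steps touch every element and allocate on the Java side.

```java
// Hypothetical sketch of marshaling around a device transfer; names are assumptions.
final class MarshalSketch {

    /** Stand-in for Tuple2<Float, Float>. */
    static final class Pair {
        final Float first;
        final Float second;

        Pair(Float first, Float second) {
            this.first = first;
            this.second = second;
        }
    }

    /** Marshal: unbox every element into a primitive array that can be copied to a device. */
    static float[] marshal(Float[] boxed) {
        float[] flat = new float[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            flat[i] = boxed[i];
        }
        return flat;
    }

    /** Unmarshal: allocate one tuple object per element from two primitive result arrays. */
    static Pair[] unmarshal(float[] firsts, float[] seconds) {
        Pair[] result = new Pair[firsts.length];
        for (int i = 0; i < firsts.length; i++) {
            result[i] = new Pair(firsts[i], seconds[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        float[] flat = marshal(new Float[] {1.0f, 2.0f});
        Pair[] tuples = unmarshal(flat, new float[] {10.0f, 20.0f});
        System.out.println(tuples[1].first + " " + tuples[1].second); // prints 2.0 20.0
    }
}
```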
Custom Array Type
[Figure: PArray<Tuple2<Float,Double>>. Programmer's View: an array of Tuple2 elements at indices 0, 1, 2, ..., n-1, each holding a float and a double. Graal-OCL VM: a FloatBuffer holding float float float float... and a DoubleBuffer holding double double double double..., each indexed 0, 1, 2, ..., n-1.]
With this layout, un/marshal operations are not necessary
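A minimal sketch of such a layout follows, assuming one `FloatBuffer` for the first tuple field and one `DoubleBuffer` for the second. The class name `TuplePArraySketch` and its methods are hypothetical and do not reflect the real PArray API. Heap buffers from `allocate` are used to keep the example short; a real runtime would more likely use direct buffers that can be handed to OpenCL without copying.

```java
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

// Hypothetical struct-of-arrays container for (float, double) tuples.
final class TuplePArraySketch {
    private final FloatBuffer firsts;   // all first fields, stored contiguously
    private final DoubleBuffer seconds; // all second fields, stored contiguously

    TuplePArraySketch(int size) {
        this.firsts = FloatBuffer.allocate(size);
        this.seconds = DoubleBuffer.allocate(size);
    }

    /** Stores the tuple's fields directly as primitives; no tuple object is kept. */
    void put(int index, float first, double second) {
        firsts.put(index, first);
        seconds.put(index, second);
    }

    float first(int index) {
        return firsts.get(index);
    }

    double second(int index) {
        return seconds.get(index);
    }

    int size() {
        return firsts.capacity();
    }

    public static void main(String[] args) {
        TuplePArraySketch input = new TuplePArraySketch(3);
        for (int i = 0; i < input.size(); i++) {
            input.put(i, (float) i, (double) i + 2);
        }
        System.out.println(input.first(2) + " " + input.second(2)); // prints 2.0 4.0
    }
}
```

Because the data already sits in primitive buffers, the runtime can copy each buffer to the device as is, which is why the separate marshaling pass disappears.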
Example of JPAI

    ArrayFunc<Tuple2<Float, Double>, Double> f = new MapFunction<>(t -> 2.5f * t._1() + t._2());

    PArray<Tuple2<Float, Double>> input = new PArray<>(size);

    for (int i = 0; i < size; ++i) {
        input.put(i, new Tuple2<>((float) i, (double) i + 2));
    }

    PArray<Double> output = f.apply(input);
Setup
• 5 Applications
• Comparison with:
  • Java Sequential - Graal compiled code
  • AMD and Nvidia GPUs
  • Java Array vs. Custom PArray
  • Java threads
Java Threads Execution
[Figure: speedup vs. Java sequential (axis from 0 to 6) for 1, 2, 4, 8 and 16 Java threads, with small and large inputs for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo.]
CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
OpenCL GPU Execution
[Figure: speedup vs. Java sequential (axis ticks 0.1, 1, 10, 100, 1000) with small and large inputs for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo. Series: Nvidia Marshalling, Nvidia Optimized, AMD Marshalling, AMD Optimized. Two bars are annotated with the value 0.004.]
GPUs: AMD Radeon R9 295, NVIDIA GeForce GTX Titan Black
OpenCL GPU Execution
[Figure: speedup vs. Java sequential (axis ticks 0.1, 1, 10, 100, 1000) with small and large inputs for Saxpy, K-Means, Black-Scholes, N-Body and Monte Carlo. Series: Nvidia Marshalling, Nvidia Optimized, AMD Marshalling, AMD Optimized. Two bars are annotated with the value 0.004; speedups of 10x, 12x and 70x are highlighted.]
GPUs: AMD Radeon R9 295, NVIDIA GeForce GTX Titan Black
.zip(Conclusions).map(Future)
Present
• We have presented an API to enable heterogeneous computing in Java
• Custom array type to reduce overheads when transferring the data
• Runtime system to run heterogeneous applications within Java
Future
• Runtime data type specialization
• Code generation for multiple devices
• Runtime scheduling (Where is the best place to run the code?)
Thanks so much for your attention
This work was supported by a grant from:
Juan Jose [email protected]