© 2015 IBM Corporation
Optimizing X10 code with ROSE compiler framework
IBM Research - Tokyo
Michihiro Horie, Mikio Takeuchi, Kiyokuni Kawachiya
This material is based upon work supported by the U.S. Department of Energy, Office of Science,
Advanced Scientific Computing Research under Award Number DE-SC0008923.
Jan 29, 2015, X10 Day Tokyo
X10
Developed by IBM
A parallel and distributed programming language
– Aims at a tenfold improvement in productivity
– In addition to HPC programming, X10 has recently been used
inside IBM products
APGAS (Asynchronous Partitioned Global Address Space) model
– Places
– Java-like grammar
[Figure: comparison of programming models in terms of threads (processes), data (objects), address spaces, communication, and references. Shared memory (OpenMP, etc.), message passing (MPI, etc.), and PGAS (UPC, CAF, XMP, etc.). In the PGAS model, Place 0, Place 1, and Place 2 are linked by remote references.]
X10 compiler
Built on top of the Polyglot framework
– Implemented in Java
Java backend and C++ backend
AST: Abstract Syntax Tree
XRX: X10 Runtime in X10
XRJ: X10 Runtime in Java
XRC: X10 Runtime in C++
X10RT: X10 Comm. Runtime
[Figure: X10 compiler architecture. The X10 compiler front-end takes X10 source through parsing / type checking, AST optimizations, and AST lowering, producing an X10 AST. The Java back-end performs Java code generation; the Java source is compiled by a Java compiler into Java bytecode that runs on Java VMs with XRJ (Managed X10), with Java interop support for existing Java applications. The C++ back-end performs C++ code generation; the C++ source and CUDA source are compiled by platform compilers into a native executable that runs with XRC in the native environment (CPU, GPU, etc.) (Native X10), alongside existing native (C/C++/etc.) applications. XRX and X10RT appear beneath both back-ends.]
“Proxy applications”
Parallel and distributed applications developed in research projects of the
U.S. Department of Energy (DoE)
Implemented in C/C++ or Fortran with MPI, OpenMP, etc.
– CoMD: molecular dynamics
– MCCK: neutronics, investigating the communication cost
– LULESH: Lagrangian hydrodynamics
Porting proxy applications to X10
– To compare execution performance between the original and X10-port versions
– Porting strategy:
• Keep the original code structure as much as possible
Writing HPC Applications in X10 Parallel Distributed Programming Language,
Hiroki Murata, Michihiro Horie, Koichi Shirahata, Jun Doi, Hideki Tai, Mikio Takeuchi, and Kiyokuni Kawachiya.
IPSJ (Information Processing Society of Japan) 98th Programming Workshop (情報処理学会第98回プログラミング研究会), 2013-5-(2), 7 pages (2014/03/17).
Porting MPI based HPC Applications to X10,
Hiroki Murata, Michihiro Horie, Koichi Shirahata, Jun Doi, Hideki Tai, Mikio Takeuchi, and Kiyokuni Kawachiya.
In Proceedings of the 2014 X10 Workshop (X10 '14), co-located with PLDI '14, 7 pages (2014/06/12).
Original and X10-port version
Original (C++):
long r1[10];
long r2[10];
long R[10];
:
void calc(long i, long k, long t) {
    r1[i] = k * R[i] / t;
    r2[i] = k * R[i] / t;
}
X10 port:
val r1 = new Rail[Long](10);
val r2 = new Rail[Long](10);
val R = new Rail[Long](10);
:
def calc(i : Long, k : Long, t : Long) {
    r1(i) = k * R(i) / t;
    r2(i) = k * R(i) / t;
}
X10 port with common subexpressions unified by hand:
val r1 = new Rail[Long](10);
val r2 = new Rail[Long](10);
val R = new Rail[Long](10);
:
def calc(i : Long, k : Long, t : Long) {
    val a = k * R(i) / t;
    r1(i) = a;
    r2(i) = a;
}
Execution time: Original (C++) < X10 port (X10 was slower…)
Execution time: Original (C++) ≒ X10 port with common subexpressions unified by hand
Environment:
P7IH: 32 Power7 cores 3.84 GHz, 128 GB memory, peak: 982 Gflops
X10 trunk r28794 (built with “-Doptimize=true -DNO_CHECKS=true”)
xlC_r: V12.1
Compile option for the original: -O3 -qinline
Compile option for native X10: -x10rt pami -O -NO_CHECKS
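The hand transformation on this slide can be sketched in C++ as follows. This is only an illustration under assumptions: the struct name `CseSample` is borrowed from the generated code shown at the end of the deck, `calcUnified` is a hypothetical name, and `std::array` stands in for the fixed-size arrays. It shows the repeated expression `k * R[i] / t` being computed once and reused.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the slide's r1, r2, and R arrays.
struct CseSample {
    std::array<long, 10> r1{};
    std::array<long, 10> r2{};
    std::array<long, 10> R{};

    // Straightforward version: k * R[i] / t is written out twice.
    void calc(std::size_t i, long k, long t) {
        r1[i] = k * R[i] / t;
        r2[i] = k * R[i] / t;
    }

    // Hand-unified version: the common subexpression is computed once.
    void calcUnified(std::size_t i, long k, long t) {
        const long a = k * R[i] / t;
        r1[i] = a;
        r2[i] = a;
    }
};
```

Both member functions store the same values; the second one simply makes the reuse explicit instead of relying on the compiler to discover it.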
For better performance in X10-port versions
We often needed to …
– Use val as much as possible
– Declare classes as final
– Unify common subexpressions
– Hoist loop invariants
– Replace short arrays with variables
– Inline methods called in loops (with @Inline)
We would like to enrich the optimizations in the X10 compiler front-end
– Currently, we need to apply the above optimizations by hand
– Our idea is to use the ROSE compiler framework
ROSE compiler framework
Source-to-source translator
– Developed at Lawrence Livermore National Laboratory
– Implemented in C++
– Supports C/C++/Fortran, MPI/OpenMP, etc.
• We also confirmed that it parses and unparses Java code properly
[Figure: ROSE pipeline. The ROSE front-end parses C, C++, or Fortran source into a ROSE AST; the ROSE mid-end applies control flow analysis and AST-based optimizations to produce an optimized ROSE AST; the ROSE back-end unparses it into C, C++, or Fortran source.]
X10 ROSE Connection
To enrich the optimizations in the X10 front-end, we can reuse ROSE's
optimizations
– Convert Polyglot AST to ROSE AST
[Figure: X10-ROSE pipeline. X10 source is parsed by the X10 parser in the X10 compiler front-end into an X10 AST; a converter passes it to the ROSE front-end as a ROSE AST; the ROSE mid-end applies control flow analysis and AST-based optimizations to produce an optimized ROSE AST; the unparser in the ROSE back-end emits faster X10 source.]
Steps annotated in the figure:
– Start X10 compiler for X10 source
– Convert Polyglot AST to ROSE AST through JNI
– Define ROSE AST nodes for X10-specific language constructs
– Generate X10 code from X10 AST
We tried several ROSE optimizations on LULESH
Currently,
– Loop unrolling
– Method inlining
– Stack allocation
– Loop invariant hoisting
We are investigating how to reuse ROSE analyses
– Liveness analysis
– Use-Def analysis
– Pointer analysis, etc.
We estimated how much execution performance can improve by using
ROSE optimizations/analyses
Describing how to handle the input source with ROSE APIs
In driver code written in C++, we can describe how to handle the input
source
– Invoke ROSE analyses/optimizations through its APIs
– Process inherited/synthesized attributes in the AST
[Figure: the pipeline from the “X10 ROSE Connection” slide, with the driver added.]
Loop unrolling
ROSE provides an API:
– loopUnrolling(SgForStatement *loop, size_t factor)
We found 6 targets in LULESH
Before optimization:
protected def calcForceForNodes(domain : Domain) {
    for (var i : Long = 0; i <= (domain.numNode-1); ++i) {
        domain.fx(i) = 0.0;
        domain.fy(i) = 0.0;
        domain.fz(i) = 0.0;
    }
    :
}
After optimization:
protected def calcForceForNodes(domain : Domain) {
    var i_nom_5 : long;
    val _lu_fringe_6 = domain.numNode - 1 - ((domain.numNode == 0 ? domain.numNode : (domain.numNode + 1)) % 4 == 0 ? 0 : 4);
    for (i_nom_5 = 0L; i_nom_5 <= _lu_fringe_6; i_nom_5 += 4) {
        domain.fx(i_nom_5) = 0.0;
        domain.fy(i_nom_5) = 0.0;
        domain.fz(i_nom_5) = 0.0;
        domain.fx(i_nom_5 + 1) = 0.0;
        domain.fy(i_nom_5 + 1) = 0.0;
        domain.fz(i_nom_5 + 1) = 0.0;
        domain.fx(i_nom_5 + 2) = 0.0;
        domain.fy(i_nom_5 + 2) = 0.0;
        domain.fz(i_nom_5 + 2) = 0.0;
        domain.fx(i_nom_5 + 3) = 0.0;
        domain.fy(i_nom_5 + 3) = 0.0;
        domain.fz(i_nom_5 + 3) = 0.0;
    }
    for (; i_nom_5 <= (domain.numNode - 1L); i_nom_5 += 1) {
        domain.fx(i_nom_5) = 0.0;
        domain.fy(i_nom_5) = 0.0;
        domain.fz(i_nom_5) = 0.0;
    }
    :
}
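The shape of the unrolled loop above (a main loop stepping by the unroll factor, followed by a fringe loop for the leftover iterations) can be sketched in C++ as below. This is a hand-written illustration, not ROSE output: the `Domain` struct and the function name `clearForcesUnrolled` are hypothetical, and the sketch assumes the three vectors have the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the three force arrays on the slide.
// Assumption: fx, fy, and fz always have the same length.
struct Domain {
    std::vector<double> fx, fy, fz;
};

// Zeroes all three arrays with a main loop unrolled by a factor of 4.
void clearForcesUnrolled(Domain& d) {
    const std::size_t n = d.fx.size();
    const std::size_t mainEnd = n - (n % 4);  // largest multiple of 4 not above n
    std::size_t i = 0;

    // Main loop: four original iterations per trip.
    for (; i < mainEnd; i += 4) {
        d.fx[i]     = 0.0; d.fy[i]     = 0.0; d.fz[i]     = 0.0;
        d.fx[i + 1] = 0.0; d.fy[i + 1] = 0.0; d.fz[i + 1] = 0.0;
        d.fx[i + 2] = 0.0; d.fy[i + 2] = 0.0; d.fz[i + 2] = 0.0;
        d.fx[i + 3] = 0.0; d.fy[i + 3] = 0.0; d.fz[i + 3] = 0.0;
    }

    // Fringe loop: the remaining 0 to 3 iterations.
    for (; i < n; ++i) {
        d.fx[i] = 0.0; d.fy[i] = 0.0; d.fz[i] = 0.0;
    }
}
```

Computing the main-loop bound from the remainder keeps the fringe loop correct for any length, including lengths below the unroll factor.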
Inlining (@Inline)
ROSE provides an API:
– doInline(SgFunctionCallExp* funcall, bool allowRecursion)
We found 16 targets in LULESH
Before optimization:
def calcVolumeForceForElems(domain : Domain) {
    val hgcoef = domain.hgcoef;
    val determ = new Rail[double](numElem);
    :
    calcHourglassControlForElems(domain,determ,hgcoef);
}
def calcHourglassControlForElems(domain : Domain,
        determ : Rail[double], hgcoef : double) {
    val numElem = domain.numElem;
    val numElem8 = numElem * 8L;
    if (dvdx == null) {
        dvdx = new Rail[double](numElem8);
        dvdy = new Rail[double](numElem8);
        dvdz = new Rail[double](numElem8);
        :
    }
    :
}
After optimization:
def calcVolumeForceForElems(domain : Domain) {
    val hgcoef = domain.hgcoef;
    val determ = new Rail[double](numElem);
    :
    {
        val numElem = domain.numElem;
        val numElem8 = numElem * 8L;
        if (dvdx == null) {
            dvdx = new Rail[double](numElem8);
            dvdy = new Rail[double](numElem8);
            dvdz = new Rail[double](numElem8);
            :
        }
        :
    }
}
def calcHourglassControlForElems(domain : Domain,
        determ : Rail[double], hgcoef : double) {
    val numElem = domain.numElem;
    val numElem8 = numElem * 8L;
    if (dvdx == null) {
        dvdx = new Rail[double](numElem8);
        dvdy = new Rail[double](numElem8);
        dvdz = new Rail[double](numElem8);
        :
    }
    :
}
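As a small illustration of what inlining does at a call site, the C++ sketch below replaces a call inside a loop with the callee's body. The function names and the arithmetic are hypothetical and unrelated to the real LULESH kernels; the parameter names `determ` and `hgcoef` are borrowed from the slide only for familiarity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, called once per element.
double hourglassTerm(double determ, double hgcoef) {
    return determ * hgcoef;
}

// Before: one function call per loop iteration.
double totalWithCalls(const std::vector<double>& determ, double hgcoef) {
    double total = 0.0;
    for (std::size_t i = 0; i < determ.size(); ++i) {
        total += hourglassTerm(determ[i], hgcoef);
    }
    return total;
}

// After: the callee body is substituted at the call site,
// with its parameters replaced by the actual arguments.
double totalInlined(const std::vector<double>& determ, double hgcoef) {
    double total = 0.0;
    for (std::size_t i = 0; i < determ.size(); ++i) {
        total += determ[i] * hgcoef;
    }
    return total;
}
```

The two functions compute the same result; the inlined form removes the per-iteration call and exposes the loop body to further optimizations such as unrolling.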
Stack allocation
ROSE does not provide an optimization API for stack allocation
Instead, we would like to attach X10's @StackAllocate annotation automatically
We are currently implementing an escape analysis to decide whether attaching
@StackAllocate is valid
– Reusing the pointer analysis that ROSE provides
Before optimization:
val x1 = new Rail[double](8L);
After optimization:
@StackAllocate val x1 = @StackAllocate new Rail[double](8L);
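For readers more familiar with C++, the following sketch shows an analogous choice between heap and stack storage for a short, fixed-size array. It is an analogy only, not how X10's @StackAllocate is implemented; the function names are hypothetical. The comment on the stack version states the condition that an escape analysis would have to establish.

```cpp
#include <array>
#include <numeric>
#include <vector>

// Heap version: std::vector allocates its 8 elements on the heap.
double sumHeap() {
    std::vector<double> x1(8, 1.5);
    return std::accumulate(x1.begin(), x1.end(), 0.0);
}

// Stack version: std::array keeps the 8 elements in the function's frame.
// This is valid only because x1 does not escape: no pointer or reference
// to it outlives the call.
double sumStack() {
    std::array<double, 8> x1;
    x1.fill(1.5);
    return std::accumulate(x1.begin(), x1.end(), 0.0);
}
```

If the array were returned by reference or stored in a longer-lived object, the stack version would be invalid, which is why the slide pairs the annotation with an escape analysis.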
Loop invariant hoisting
Although ROSE does not provide an optimization API for this, we can implement it
using primitive ROSE APIs
– insertStatement()
– removeStatement()
We found 1 target in LULESH
To decide whether hoisting a loop invariant is valid, we can use
ROSE’s analyses:
– Use-Definition analysis
– Liveness analysis
Before optimization:
:
val numElem = domain.numElem;
val numElem8 = numElem * 8L;
while ((domain.time < domain.stopTime) &&
        (domain.cycle < Lulesh.this.opts.its)) {
    dvdx = new Rail[double](numElem8);
    dvdy = new Rail[double](numElem8);
    dvdz = new Rail[double](numElem8);
    :
}
After optimization:
val numElem = domain.numElem;
val numElem8 = numElem * 8L;
dvdx = new Rail[double](numElem8);
dvdy = new Rail[double](numElem8);
dvdz = new Rail[double](numElem8);
while ((domain.time < domain.stopTime) &&
        (domain.cycle < Lulesh.this.opts.its)) {
    :
}
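The same transformation, moving a per-iteration allocation whose size does not change out of the loop, can be sketched in C++. The function names and the loop body are hypothetical. The sketch assumes every element of the buffer is overwritten before it is read in each iteration, which is the kind of fact that use-definition and liveness analyses would need to confirm.

```cpp
#include <cstddef>
#include <vector>

// Before: the scratch buffer is allocated anew in every cycle,
// although its size never changes.
double runBefore(std::size_t numElem, int cycles) {
    double total = 0.0;
    for (int c = 0; c < cycles; ++c) {
        std::vector<double> dvdx(numElem * 8);
        for (double& v : dvdx) v = c + 1.0;  // every element is overwritten
        for (double v : dvdx) total += v;
    }
    return total;
}

// After: the allocation is hoisted out of the loop and the buffer is reused.
// Valid here because each iteration overwrites all elements before reading.
double runAfter(std::size_t numElem, int cycles) {
    double total = 0.0;
    std::vector<double> dvdx(numElem * 8);
    for (int c = 0; c < cycles; ++c) {
        for (double& v : dvdx) v = c + 1.0;
        for (double v : dvdx) total += v;
    }
    return total;
}
```

If an iteration relied on the buffer being freshly zero-initialized, or kept a reference to it beyond the iteration, the hoisted version would change behavior.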
Preliminary: Execution performance in single place and single process
These results could change, because so far we have applied the optimizations
only partially and without using ROSE analyses
LULESH elapsed time [s] (smaller is better)
C++ Original: 76.3
(1) X10-port: 83.1
(2) (1) + inlining: 81.4
(3) (2) + loop unrolling: 81.8
(4) (3) + stack allocation: 79.1
(5) (4) + loop invariant hoisting: 74.8
IBM Power775 (13x POWER7 3.84 GHz (8x4 HWT, 128GB mem)), RHEL 6.2
X10 2.4.1 native backend (compile option: -x10rt pami -O -NO_CHECKS)
C/C++ compiler: XL C V12.1 (compile option: -O3 -qinline)
Communication runtime: PAMI (X10) and MPI (Original)
X10 2.5.1 native backend (compile option: -x10rt pami -O -NO_CHECKS)
So far…
To enrich the optimizations available to X10, we reused ROSE’s
optimizations
– This avoids the cost of implementing new features in an existing compiler
Optimizations we tried:
– Method inlining
– Loop unrolling
– Stack allocation
– Loop invariant hoisting
We estimated the execution performance of LULESH in a single place and
a single process
– Improved by 2% compared to the C++ original
– Improved by 10% compared to the X10-port
Much future work remains
We have to extend ROSE because it is not APGAS-oriented
– Just reusing ROSE optimizations/analyses is not sufficient
– (Ex.) A loop invariant may not be hoisted across an “at”
Also, we should support APGAS-specific optimizations
– (Ex.) We can optimize “at”-invariants to avoid unnecessary copying
costs
C++ code generated by the X10 compiler
X10 source:
val r1 = new Rail[Long](10);
val r2 = new Rail[Long](10);
val R = new Rail[Long](10);
:
def calc(i : Long, k : Long, t : Long) {
    r1(i) = k * R(i) / t;
    r2(i) = k * R(i) / t;
}
Generated C++:
:
void CSE_Sample::calc(x10_long i, x10_long k, x10_long t) {
    this->FMGL(r1)->x10::lang::Rail<x10_long >::__set(i,
        ((x10_long) ((((x10_long) ((k)
        * (CSE_Sample::FMGL(R__get)()->x10::lang::Rail<x10_long >::__apply(i)))))
        / x10aux::zeroCheck(t))));
    this->FMGL(r2)->x10::lang::Rail<x10_long >::__set(i,
        ((x10_long) ((((x10_long) ((k)
        * (CSE_Sample::FMGL(R__get)()->x10::lang::Rail<x10_long >::__apply(i)))))
        / x10aux::zeroCheck(t))));
}