optimizing x10 code with rose compiler...

© 2015 IBM Corporation

Optimizing X10 code with ROSE compiler framework

IBM Research - Tokyo

Michihiro Horie（[email protected]） Mikio Takeuchi

Kiyokuni Kawachiya

IBM Research - Tokyo

This material is based upon work supported by the U.S. Department of Energy, Office of Science,

Advanced Scientific Computing Research under Award Number DE-SC0008923.

Jan 29, 2015 1 X10 Day Tokyo

mailto:[email protected]


IBM Research – Tokyo

X10

Developed by IBM

One of the parallel and distributed programming language

– Aiming 10 times productivity

– In addition to the programming for HPC, X10 has been recently used

inside IBM products

APGAS（Asynchronous Partitioned Global Address Space） model

– Places

– Java-like grammar

2

Shared MemoryOpenMP, etc

Thread(Process)

Data(Object)

CommunicationAddressspace

Reference

Message PassingMPI, etc

PGASUPC, CAF, etc

Place 0 Place 1 Place 2

Remote reference

UPC, CAF, XMP, etc. Jan 29, 2015 X10 Day Tokyo

. .



X10 compiler

Built on top of Polyglot framework

– Implemented in Java

Java backend and C++ backend

3

AST: Abstract Syntax Tree

XRX: X10 Runtime in X10

XRJ: X10 Runtime in Java

XRC: X10 Runtime in C++

X10RT: X10 Comm. Runtime

X10

Source

Parsing /

Type Check

AST Optimizations

AST Lowering X10 AST

X10 AST

Java Code

Generation

C++ Code

Generation

Java Source C++ Source

Java Compiler Platform Compilers XRJ XRC XRX

Java Bytecode Native executable

X10RT

X10 Compiler Front-End

Java

Back-End

C++

Back-End

Native Environment

(CPU, GPU, etc) Java VMs

Java or C(JNI)

Existing Java

Application

Existing Native (C/C++/etc)

Application

Java

Interop

Support

CUDA Source

Native X10 Managed X10

Jan 29, 2015 X10 Day Tokyo



“Proxy applications” Parallel and distributed applications that are eveloped in the research projects of

U.S. Department of Energy（DoE）

Implemented in C/C++, Fortran with MPI, OpenMP, etc.

– CoMD: molecular dynamics

– MCCK: neutronics, investigating the communication cost

– LULESH：Lagrangian hydrodynamics

Porting proxy applications in X10

– To compare execution performance between the original and X10-port version

– Porting strategy:

• To keep the original code structures as much as possible

4

Writing HPC Applications in X10 Parallel Distributed Programming Language,

Hiroki Murata, Michihiro Horie, Koichi Shirahata, Jun Doi, Hideki Tai, Mikio Takeuchi, and Kiyokuni Kawachiya.

情報処理学会第98回プログラミング研究会, 2013-5-(2), 7 pages (2014/03/17).

Porting MPI based HPC Applications to X10,

Hiroki Murata, Michihiro Horie, Koichi Shirahata, Jun Doi, Hideki Tai, Mikio Takeuchi, and Kiyokuni Kawachiya.

In Proceedings of the 2014 X10 Workshop (X10 '14), co-located with PLDI '14, 7 pages (2014/06/12). Jan 29, 2015 X10 Day Tokyo

https://www.risec.aist.go.jp/events/ipsj-pro2013-5/



http://x10-lang.org/workshop/workshop14

http://conferences.inf.ed.ac.uk/pldi2014/



Original and X10-port version

val r1 = new Rail[Long](10);


val R = new Rail[Long](10);

:

def calc(i : Long, k : Long, t : Long) {

r1(i) = k * R(i) / t;

r2(i) = k * R(i) / t;

}

X10 port

P7IH: 32 Power7 cores 3.84 GHz, 128 GB memory, peak: 982 Gflops

X10 trunk r28794 (built with “-Doptimize=true -DNO_CHECKS=true”)

xlC_r: V12.1

Compile option for the original : -O3 –qinline

Compile option for native x10: -x10rt pami -O -NO_CHECKS




:


val a = k * R(i) / t;

r1(i) = a;

r2(i) = a;

}

long r1[10];

long r2[10];

long R[10];

:

void calc(long i, long k, long t) {

r1[i] = k * R[i] / t;

r2[i] = k * R[i] / t;

}

Original （C++）

Needed to unify common

subexpressions by hand

＜

≒

X10 was

slower…

5 Jan 29, 2015 5 X10 Day Tokyo



For better performance in x10-port versions

We often needed to …

– Use val as much as possible

– Set class as final

– Unify common expressions

– Hoist loop invariant

– Replace short array with variables

– inline method called in loop (with @Inline)

We would like to enrich optimizations in X10 compiler front-end

– Currently, we need to apply above optimizations by hand

– Our idea is to use ROSE compiler framework

6 X10 Day Tokyo Jan 29, 2015



ROSE compiler framework

Source-to-source translator

– Developed in the Lawrence Livermore National Laboratory

– Implemented in C++

– Support C/C++/Fortran, MPI/OpenMP, etc.

• We also confirmed parsing and unparsing Java code properly

7

C Source

Parser ROSE AST

Control flow analysisAST-based optimizations

ROSE Front-end

Fortran Source

C++ Source

C Source

ROSE Mid-end

OptimizedROSE AST

Unparser

ROSE Back-end

C Source

Fortran Source

C++ Source




X10 ROSE Connection

To enrich optimizations in X10 front-end, we can reuse the ROSE

optimizations

– Convert Polyglot AST to ROSE AST

8


ROSE Mid-end

OptimizedROSE AST

Unparser

ROSE Back-end

FasterX10 Source

ROSE AST

ROSE Front-end

Parser

X10 Compiler Front-end

ConverterX10 ASTX10 Source

Convert Polyglot AST to

ROSE AST through JNI

X10

parser


Define ROSE AST nodes for X10-

specific language constructs

Generate X10 code from

X10 AST

Start X10 compiler for

X10 source



We tried several ROSE optimizations to LULESH

Currently,

– Loop unrolling

– Method inlining

– Stack allocation

– Loop Invariant Hoisting

Under investigation of how we can reuse ROSE analyses

– Liveness analysis

– Use-Def analysis

– Pointer analysis, etc.

Estimated how much execution performance can improve by using

ROSE optimizations/analyses




Describing how to handle input source with ROSE APIs

In driver code written in C++, we can describe how to handle input

source

– Invoke ROSE analyses/optimizations through its APIs

– Process inherited/synthesized attributes in AST


ROSE Mid-end

OptimizedROSE AST

Unparser

ROSE Back-end

FasterX10 Source

ROSE AST

ROSE Front-end

Parser

X10 Compiler Front-end

ConverterX10 ASTX10 Source

Driver



Loop unrolling

ROSE provides an API:

– loopUnrolling(SgForStatement *loop, size_t factor)

We found 6 targets in LULESH

11

protected def calcForceForNodes(domain : Domain) {

for (var i : Long = 0; i <= (domain.numNode-1); ++i) {

domain.fx(i) = 0.0;

domain.fy(i) = 0.0;

domain.fz(i) = 0.0;

}

:

}

protected def calcForceForNodes(domain : Domain) { var i_nom_5 : long; val _lu_fringe_6 = domain.numNode - 1 - ((domain.numNode == 0 ? domain.numNode : (domain.numNode + 1)) % 4 == 0 ? 0 : 4); for (i_nom_5 = 0L; i_nom_5 <= _lu_fringe_6; i_nom_5 += 4) { domain.fx(i_nom_5) = 0.0; domain.fy(i_nom_5) = 0.0; domain.fz(i_nom_5) = 0.0; domain.fx(i_nom_5 + 1) = 0.0; domain.fy(i_nom_5 + 1) = 0.0; domain.fz(i_nom_5 + 1) = 0.0; domain.fx(i_nom_5 + 2) = 0.0; domain.fy(i_nom_5 + 2) = 0.0; domain.fz(i_nom_5 + 2) = 0.0; domain.fx(i_nom_5 + 3) = 0.0; domain.fy(i_nom_5 + 3) = 0.0; domain.fz(i_nom_5 + 3) = 0.0; } for (; i_nom_5 <= (domain.numNode - 1L); i_nom_5 += 1) { domain.fx(i_nom_5) = 0.0; domain.fy(i_nom_5) = 0.0; domain.fz(i_nom_5) = 0.0;

Before optimization

After optimization




Inlining (@Inline)

ROSE provides an API:

– doInline(SgFunctionCallExp* funcall, bool allowRecursion)

We found 16 targets in LULESH

12 12

def calcVolumeForceForElems(domain : Domain) {

val hgcoef = domain.hgcoef;

val determ = new Rail[double](numElem);

:

calcHourglassControlForElems(domain,determ,hgcoef);

}

def calcHourglassControlForElems(domain : Domain,

determ : Rail[double], hgcoef : double) {

val numElem = domain.numElem;

val numElem8 = numElem * 8L;

if (dvdx == null) {

dvdx = new Rail[double](numElem8);

dvdy = new Rail[double](numElem8);

dvdz = new Rail[double](numElem8);

:

}

:

}

def calcVolumeForceForElems(domain : Domain) {

val hgcoef = domain.hgcoef;

val determ = new Rail[double](numElem);

:

{



if (dvdx == null) {




:

}

:

}

}

def calcHourglassControlForElems(domain : Domain,

determ : Rail[double], hgcoef : double) {



if (dvdx == null) {




:

}

Before optimization

After optimization




Stack allocation

ROSE does not provide optimization API for applying stack allocation

Instead, we would like to attach @StackAllocate of X10 automatically

Currently, implementing an escape analysis to decide whether attaching

@StackAllocate is valid or not

– Reusing pointer analysis that ROSE provides

13 13

val x1 = new Rail[double](8L); @StackAllocate val x1 = @StackAllocate new Rail[double](8L);

Before optimization After optimization




Loop invariant hoisting

Althogugh ROSE does not provide optimization API, we can realize by

using primitive ROSE APIs

– insertStatement()

– removeStatement()

We found 1 target in LULESH

To decide whether hoisting loop invariants is valid or not, we can use

ROSE’s analyses:

– Use-Definition analysis

– Liveness analysis

14

:



while ((domain.time < domain.stopTime) &&

(domain.cycle < Lulesh.this.opts.its)) {




:

}






while ((domain.time < domain.stopTime) &&

(domain.cycle < Lulesh.this.opts.its)) {

:

}

Before optimization After optimization




Preliminary: Execution performance in single place and single process

These results could change, because we have just applied optimizations

partly without using ROSE analyses

15

76.3 83.1 81.4 81.8 79.1 74.8

0 10 20 30 40 50 60 70 80 90

C++ Original (1) X10-port

(2) (1) + inlining

(3) (2) + loop unrolling

(4) (3) + stack allocation

(5) (4) + loop invariant hoisting

LULESH elapsed time [s] (Smaller is better)

IBM Power775 (13x POWER7 3.84 GHz (8x4 HWT, 128GB mem)), RHEL 6.2

X10 2.4.1 native backend (compile option: -x10rt pami -O -NO_CHECKS)

C/C++ compiler: XL C V12.1 (compile option: -O3 -qinline)

Communication runtime: PAMI (X10) and MPI (Original)

X10 2.5.1 native backend (compile option: -X10rt pami –O –NO_CHECKS)




So far…

To enhance features of X10 optimizations, we reused ROSE’s

optimizations

– Avoid costs to implement new features to an existing compiler

Optimizations we tried :

– Method inlining

– Loop unrolling

– Stack Allocation

– Loop invariant hoisting

We estimated the execution performance of LULESH in single place and

single process

– Improved by 2% compared to the C++ original

– Improved by 10% compared to the X10-port

16 Jan 29, 2015 X10 Day Tokyo



Many future works

We have to extend ROSE because it is not APGAS-oriented

– Just reusing ROSE optimizations/analyses is not sufficient

– (Ex.) We may not hoist a loop invariant beyond “at”

Also, we should support APGAS-specific optimizations

– (Ex.) We can optimize “at”-invariant to avoid unnecessary copying

cost




C++ code generated by X10 compiler

18 X10 Day Tokyo

:

void CSE_Sample::calc(x10_long i, x10_long k, x10_long t) {

this->FMGL(r1)->x10::lang::Rail<x10_long >::__set(i,

((x10_long) ((((x10_long) ((k)

* (CSE_Sample::FMGL(R__get)()->x10::lang::Rail<x10_long >::__apply(i)))))

/ x10aux::zeroCheck(t))));

this->FMGL(r2)->x10::lang::Rail<x10_long >::__set(i,

((x10_long) ((((x10_long) ((k)

* (CSE_Sample::FMGL(R__get)()->x10::lang::Rail<x10_long >::__apply(i)))))

/ x10aux::zeroCheck(t))));

}




:


r1(i) = k * R(i) / t;

r2(i) = k * R(i) / t;

}

Jan 29, 2015

optimizing x10 code with rose compiler...

Documents