1
A New Optimization Technique for the Inspector-Executor Method
Daisuke Yokota†
Shigeru Chiba‡
Kozo Itano†
† University of Tsukuba  ‡ Tokyo Institute of Technology
2
Computer Simulation is Expensive
Physicists run a parallel computer at our campus every day for simulation.
– Our target parallel computer costs $45,000 every month
(cf. an international phone call between Japan and Canada costs $1/min)
– The programs run for a very long time: a week or more
3
Hardware for Fast Inter-node Communication
– Our computer, the SR2201, has such hardware for avoiding the communication bottleneck
It should be used, but in reality it is not
– At least, not at our computer center
– It is not used by the compiler
It is difficult to generate optimized code for that hardware
– It is not used by programmers
Programmers are physicists, not computer engineers
4
Our HPF Compiler
Optimization for
– Utilizing hardware for inter-node communication
Technique
– The inspector-executor method plus static code optimization
– Compilation is executed in parallel
Target
– Hitachi SR2201
5
Optimizations
Reducing the amount of exchanged data
– Our compiler allocates loop iterations to the appropriate nodes to minimize communication
Merging multiple messages
– Our target computer provides hardware support
– Our compiler tries to use that hardware
Reusing TCW
– Another form of hardware support
– Reduces the setup time for each message sent
6
Merging Multiple Messages
Hardware support: Block-Stride Communication
– Multiple messages are sent as a single message
– (The data must be stored at regular intervals)
[Figure: sender and receiver memory layouts, with strided blocks on the sender packed into one message]
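The block-stride idea can be sketched in Python (a toy model, not the compiler's actual code): data stored at regular intervals is gathered into one buffer so that a single message replaces many small ones.

```python
# Sketch: merging messages whose source data lies at regular intervals
# into one "block-stride" transfer. Memory is modeled as a flat list.

def merge_block_stride(memory, start, block_len, stride, count):
    """Gather `count` blocks of `block_len` elements, spaced `stride`
    apart, into a single contiguous buffer -- one message instead of
    `count` separate sends."""
    buf = []
    for i in range(count):
        base = start + i * stride
        buf.extend(memory[base:base + block_len])
    return buf

memory = list(range(100))
# Three 2-element blocks at addresses 10, 20, 30 -> one 6-element message.
msg = merge_block_stride(memory, start=10, block_len=2, stride=10, count=3)
print(msg)  # [10, 11, 20, 21, 30, 31]
```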
7
Reusing TCW
TCW: Transfer Control Word
– Reuse the parameters passed to the communication hardware

Before optimization:
do I=1,…
  setting
  send
end do

After optimization:
setting
do I=1,…
  send
end do
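The same hoisting idea, sketched with a hypothetical `Channel` object (an assumption for illustration, not the SR2201 interface): the transfer parameters are loop-invariant, so the "setting" step runs once before the loop instead of once per iteration.

```python
# Toy model of TCW reuse: `setup` builds a TCW-like descriptor once;
# `send` reuses the current descriptor for every message.

class Channel:
    def __init__(self):
        self.setups = 0   # how many times the TCW was (re)built
        self.sends = 0    # how many messages were sent
        self.tcw = None
    def setup(self, dest, length):
        self.setups += 1
        self.tcw = (dest, length)
    def send(self, data):
        assert self.tcw is not None  # descriptor must be set up first
        self.sends += 1

ch = Channel()
ch.setup(dest=1, length=4)    # after optimization: setting hoisted out
for i in range(10):
    ch.send([i] * 4)          # ten sends share one setup
print(ch.setups, ch.sends)    # 1 10
```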
8
Implementation:Original Inspector-Executor Method
Goal: parallelize a loop by runtime analysis (the inspector runs at runtime)
Inspector
– Determines which array elements must be exchanged among nodes
– Passes the resulting data of the analysis to the executor
Executor
1. Exchanges array elements
2. Executes the loop body in parallel
3. Exchanges array elements
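A minimal sketch of the classic inspector-executor pattern, under toy assumptions (two "nodes" each own half of an array; the loop body reads `a[idx[i]]` through an arbitrary index array; `fetch` stands in for real inter-node communication):

```python
def inspector(idx, my_lo, my_hi):
    """Runtime analysis: find which remote elements this node needs."""
    return sorted({j for j in idx if not (my_lo <= j < my_hi)})

def executor(a, idx, remote, fetch):
    """1. exchange the needed elements, 2. run the loop body locally."""
    cache = {j: fetch(j) for j in remote}      # communication step
    return [cache.get(j, a[j]) for j in idx]   # local loop body

a = list(range(8))            # node 0 owns a[0:4]
idx = [0, 5, 2, 7]
remote = inspector(idx, 0, 4)
print(remote)                  # [5, 7] -- elements owned by the other node
result = executor(a, idx, remote, fetch=lambda j: a[j])
print(result)                  # [0, 5, 2, 7]
```

Note that the inspector's output here is *data* (the list of remote indices), which the executor consults at runtime; the next slides change exactly this point.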
9
Our ImprovedInspector Executor Method
The inspector produces statically optimized code for the executor.
– The inspector runs off-line.
– Running the inspector is part of the compilation process.
The inspector hands the executor optimized code, not data!
10
Static Code Optimization
The inspector performs constant folding when generating the executor code.
Constant folding eliminates from the executor:
– The table containing the result of the inspector's analysis
Saves memory space (the table is big!)
– The memory accesses for table lookups
Better performance
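One way to picture this (a toy code generator, assuming the same two-node setting as before, not the actual compiler): the off-line inspector emits executor source with its analysis result folded in as literals, so the generated code carries no analysis table and performs no table lookups.

```python
# The inspector's analysis result (which remote indices to fetch) is
# baked into the generated executor as constants.

def generate_executor(remote_indices):
    """Emit executor source with the analysis result constant-folded."""
    fetch_lines = "\n".join(
        f"    cache[{j}] = fetch({j})" for j in remote_indices)
    return f"""
def executor(a, idx, fetch):
    cache = {{}}
{fetch_lines}
    return [cache.get(j, a[j]) for j in idx]
"""

src = generate_executor([5, 7])
env = {}
exec(src, env)                 # "compile" the generated executor
a = list(range(8))
print(env["executor"](a, [0, 5], fetch=lambda j: 100 + j))  # [0, 105]
```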
11
OUTER directive
Specifies the range of analysis by the inspector.
– The OUTER loop
– We assume that the program structure fits the structure of typical simulation programs.
OUTER loop: repeats millions of times during the simulation
INNER loop: this is parallelized; it becomes the executor
12
Restrictions
Programmers must guarantee that every iteration of the OUTER loop exchanges the same set of array elements among nodes,
since the inspector analyzes only the first iteration.
– The set of exchanged array elements is determined without executing inter-node communication
The inspector does not perform the communication, to reduce the compilation time
– Our compiler cannot compile IS of the NAS Parallel Benchmarks
13
Our Compiler Runs on a PC Cluster
For executing the inspector in parallel.
– The inspector must analyze a large amount of data.
– In the original inspector-executor method, the inspector runs in parallel; our inspector is part of the compiler, so the compiler itself runs on a PC cluster.
14
Execution Flow of Our Compiler
Source Program
↓ Translate into SPMD
↓ On each node, in parallel: Generate Inspector → Inspector Log → Analysis → Code Generation
↓ Exchange Information of Messages (among the nodes)
↓ SPMD Parallel Code
15
Our Prototype Compiler
Input: Fortran77 + HPF + the OUTER directive
– Output: SPMD Fortran code
Target machines
– Compilation: Pentium III 733 MHz x 16 nodes, Red Hat 7.1, 100Base Ethernet
– Execution: Hitachi SR2201, PowerPC-based 150 MHz x 16 nodes
16
Experiments: Pde1 benchmark
Poisson equation solver; good for massively parallel computing
– Regular array accesses
– High scalability
– Distributed array accesses are concentrated in a small region of the source code
17
Execution Time (pde1)
[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup]
Ours: 249 sec; Hitachi HPF: 137,100 sec
Hitachi's HPF compiler needs more directives for better performance.
18
Effects by static code optimization (pde1)
[Chart: reduction of execution time (0–100%) vs. number of nodes (1, 2, 4, 8, 16), comparing dynamic and static]
19
Compilation Time (pde1)
[Chart: compilation time (0–250 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend Fortran, sequential, parallel, and data-exchange phases]
The long compilation time pays off if the OUTER loop iterates many times.
20
Experiment: FT-a
3D Fourier transformation
Features
– Irregular array accesses
– Distributed array accesses are concentrated in a small region of the source code
21
Execution Time (FT-a)
[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup]
Ours: 46 sec; Hitachi HPF: 4,898 sec
22
Compilation Time (FT-a)
[Chart: compilation time (0–350 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend, sequential, parallel, and data-exchange phases]
23
Experiments: BT-a
Block tri-diagonal solver
Features
– A small number of irregular array accesses
– Distributed array accesses are scattered all over the source code
24
Execution Time (BT-a)
[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup]
Ours: 1,430 sec; Hitachi HPF: 1,370,000 sec
25
Compilation Time (BT-a)
[Chart: compilation time (0–40,000 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend, sequential, parallel, and data-exchange phases]
The inspector must analyze a huge number of array accesses, so our compiler cannot achieve good performance.
26
Conclusion
An HPF compiler that utilizes hardware for inter-node communication
– The inspector-executor method
– Static code optimization
The inspector produces optimized executor code
– The compiler runs on a PC cluster
Experiments
– A long compilation time is acceptable for simulation programs that run for a long time
27
Backup Slides
28
Reducing Communication Volume (Optimization)
Distribute loop iterations so that the communication volume becomes small
– Data partitioning is specified with HPF
– A preliminary run examines the communication volume that would occur
[Figure: loop iterations assigned to processors PE1 and PE2, annotated with the required communication volume]
29
Merging Multiple Messages
Our compiler collects several messages into a single message
– Messages in a loop with the INDEPENDENT directive can be merged
This directive specifies that the result of the loop is independent of the execution order of its iterations
– Our compiler finds block-stride communication by pattern matching, to reduce the number of communications
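The pattern matching can be sketched as follows (a hypothetical helper under simplified assumptions, not the compiler's code): given the sequence of element addresses a node must send, test whether they form equal blocks at a fixed stride, so one block-stride transfer can replace many small sends.

```python
def find_block_stride(addrs):
    """Return (start, block_len, stride, count) if `addrs` is exactly
    `count` blocks of `block_len` consecutive addresses spaced `stride`
    apart, else None."""
    if len(addrs) < 2:
        return None
    # Length of the first run of consecutive addresses = block length.
    block_len = 1
    while (block_len < len(addrs)
           and addrs[block_len] == addrs[block_len - 1] + 1):
        block_len += 1
    if len(addrs) % block_len != 0:
        return None
    count = len(addrs) // block_len
    stride = addrs[block_len] - addrs[0] if count > 1 else block_len
    # Verify every block matches the (start, stride) hypothesis.
    expected = [addrs[0] + b * stride + k
                for b in range(count) for k in range(block_len)]
    return (addrs[0], block_len, stride, count) if addrs == expected else None

print(find_block_stride([10, 11, 20, 21, 30, 31]))  # (10, 2, 10, 3)
print(find_block_stride([10, 11, 13]))              # None
```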
30
Future Work
Further reduce the number of communications
– Use block-stride communication more aggressively
(even at the cost of sending redundant data, messages could be merged into a small number of communications)
Prevent the generated code from growing too long
– If the data dependences between processors are too complex, our compiler generates too many communication operations
Improve the scalability of compilation time
– The inspector log for BT was too huge
Experiments with real simulations
31
CP-PACS/Pilot-3
Distributed-memory machine
– Center for Computational Physics at University of Tsukuba
– 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
– Hyper-crossbar network
– RDMA
32
Our Optimizer to Solve the Problem
Use of special communication devices
– Parallel machines sometimes have special hardware to reduce the time for inter-node communication
Development of compilers for easy and well-known computer languages
– Fortran77, simple HPF (High Performance Fortran)
Runtime analysis
– A profiler for communication on a PC cluster
33
Effects by static code optimization (pde1)
[Chart: reduction of execution time (0–100%) vs. number of nodes (2, 4, 8, 16), comparing dynamic and static]