1
A New Optimization Technique for the Inspector-Executor Method
Daisuke Yokota†
Shigeru Chiba‡
Kozo Itano†
† University of Tsukuba  ‡ Tokyo Institute of Technology
2
Computer Simulation is Expensive
Physicists run a parallel computer at our campus every day for simulation.
– Our target parallel computer costs $45,000 every month
(cf. an international phone call between Japan and Canada costs $1/min)
– The programs run for a very long time: a week or more
3
Hardware for Fast Inter-node Communication
– Our computer, the SR2201, has such hardware for avoiding the communication bottleneck
It should be used, but in reality it is not
– At least, not at our computer center
– It is not used by the compiler
It is difficult to generate optimized code for that hardware
– It is not used by programmers
Programmers are physicists, not computer engineers
4
Our HPF Compiler
Optimization for
– Utilizing hardware for inter-node communication
Technique
– The inspector-executor method plus static code optimization
– Compilation is executed in parallel
Target
– Hitachi SR2201
5
Optimizations
Reducing the amount of exchanged data
– Our compiler allocates loop iterations to the appropriate nodes to minimize communication
Merging multiple messages
– Our target computer provides hardware support
– Our compiler tries to use that hardware
Reusing TCW
– Another form of hardware support
– Reduces the setup time for each message sent
6
Merging Multiple Messages
Hardware support: Block-Stride Communication
– Multiple messages are sent as a single message
– (The data must be stored at regular intervals)
[Figure: sender and receiver memory layouts, with strided blocks on the sender packed into one message]
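The block-stride idea can be sketched in Python (a toy model, not the compiler's actual code): data stored at regular intervals is gathered into one buffer so that a single message replaces many small ones.

```python
# Sketch: merging messages whose source data lies at regular intervals
# into one "block-stride" transfer. Memory is modeled as a flat list.

def merge_block_stride(memory, start, block_len, stride, count):
    """Gather `count` blocks of `block_len` elements, spaced `stride`
    apart, into a single contiguous buffer -- one message instead of
    `count` separate sends."""
    buf = []
    for i in range(count):
        base = start + i * stride
        buf.extend(memory[base:base + block_len])
    return buf

memory = list(range(100))
# Three 2-element blocks at addresses 10, 20, 30 -> one 6-element message.
msg = merge_block_stride(memory, start=10, block_len=2, stride=10, count=3)
print(msg)  # [10, 11, 20, 21, 30, 31]
```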
7
Reusing TCW
TCW: Transfer Control Word
– Reuse the parameters passed to the communication hardware

Before optimization:
do I=1,…
  setting
  send
end do

After optimization:
setting
do I=1,…
  send
end do
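The same hoisting idea, sketched with a hypothetical `Channel` object (an assumption for illustration, not the SR2201 interface): the transfer parameters are loop-invariant, so the "setting" step runs once before the loop instead of once per iteration.

```python
# Toy model of TCW reuse: `setup` builds a TCW-like descriptor once;
# `send` reuses the current descriptor for every message.

class Channel:
    def __init__(self):
        self.setups = 0   # how many times the TCW was (re)built
        self.sends = 0    # how many messages were sent
        self.tcw = None
    def setup(self, dest, length):
        self.setups += 1
        self.tcw = (dest, length)
    def send(self, data):
        assert self.tcw is not None  # descriptor must be set up first
        self.sends += 1

ch = Channel()
ch.setup(dest=1, length=4)    # after optimization: setting hoisted out
for i in range(10):
    ch.send([i] * 4)          # ten sends share one setup
print(ch.setups, ch.sends)    # 1 10
```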
8
Implementation:Original Inspector-Executor Method
Goal: parallelize a loop by runtime analysis (the inspector runs at runtime)
Inspector
– Determines which array elements must be exchanged among nodes
– Passes the resulting data of the analysis to the executor
Executor
1. Exchanges array elements
2. Executes the loop body in parallel
3. Exchanges array elements
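A minimal sketch of the classic inspector-executor pattern, under toy assumptions (two "nodes" each own half of an array; the loop body reads `a[idx[i]]` through an arbitrary index array; `fetch` stands in for real inter-node communication):

```python
def inspector(idx, my_lo, my_hi):
    """Runtime analysis: find which remote elements this node needs."""
    return sorted({j for j in idx if not (my_lo <= j < my_hi)})

def executor(a, idx, remote, fetch):
    """1. exchange the needed elements, 2. run the loop body locally."""
    cache = {j: fetch(j) for j in remote}      # communication step
    return [cache.get(j, a[j]) for j in idx]   # local loop body

a = list(range(8))            # node 0 owns a[0:4]
idx = [0, 5, 2, 7]
remote = inspector(idx, 0, 4)
print(remote)                  # [5, 7] -- elements owned by the other node
result = executor(a, idx, remote, fetch=lambda j: a[j])
print(result)                  # [0, 5, 2, 7]
```

Note that the inspector's output here is *data* (the list of remote indices), which the executor consults at runtime; the next slides change exactly this point.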
9
Our ImprovedInspector Executor Method
The inspector produces statically optimized code for the executor.
– The inspector runs off-line.
– Running the inspector is part of the compilation process.
The inspector hands the executor optimized code, not data!
10
Static Code Optimization
The inspector performs constant folding when generating the executor code.
Constant folding eliminates from the executor:
– The table containing the result of the inspector's analysis
Saves memory space (the table is big!)
– The memory accesses for table lookups
Better performance
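One way to picture this (a toy code generator, assuming the same two-node setting as before, not the actual compiler): the off-line inspector emits executor source with its analysis result folded in as literals, so the generated code carries no analysis table and performs no table lookups.

```python
# The inspector's analysis result (which remote indices to fetch) is
# baked into the generated executor as constants.

def generate_executor(remote_indices):
    """Emit executor source with the analysis result constant-folded."""
    fetch_lines = "\n".join(
        f"    cache[{j}] = fetch({j})" for j in remote_indices)
    return f"""
def executor(a, idx, fetch):
    cache = {{}}
{fetch_lines}
    return [cache.get(j, a[j]) for j in idx]
"""

src = generate_executor([5, 7])
env = {}
exec(src, env)                 # "compile" the generated executor
a = list(range(8))
print(env["executor"](a, [0, 5], fetch=lambda j: 100 + j))  # [0, 105]
```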
11
OUTER directive
Specifies the range of analysis by the inspector.
– The OUTER loop
– We assume that the program structure fits the structure of typical simulation programs.
OUTER loop: repeats millions of times during the simulation
INNER loop: this is parallelized; it becomes the executor
12
Restrictions
Programmers must guarantee that every iteration of the OUTER loop exchanges the same set of array elements among nodes,
since the inspector analyzes only the first iteration.
– The set of exchanged array elements is determined without executing inter-node communication
The inspector does not perform the communication, to reduce the compilation time
– Our compiler cannot compile IS of the NAS Parallel Benchmarks
13
Our Compiler Runs on a PC Cluster
For executing the inspector in parallel.
– The inspector must analyze a large amount of data.
– In the original inspector-executor method, the inspector runs in parallel; our inspector is part of the compiler, so the compiler itself runs on a PC cluster.
14
Execution Flow of Our Compiler
Source Program
↓ Translate into SPMD
↓ On each node, in parallel: Generate Inspector → Inspector Log → Analysis → Code Generation
↓ Exchange Information of Messages (among the nodes)
↓ SPMD Parallel Code
15
Our Prototype Compiler
Input: Fortran77 + HPF + the OUTER directive
– Output: SPMD Fortran code
Target machines
– Compilation: Pentium III 733 MHz x 16 nodes, Red Hat 7.1, 100Base Ethernet
– Execution: Hitachi SR2201, PowerPC-based 150 MHz x 16 nodes
16
Experiments: Pde1 benchmark
Poisson equation solver; good for massively parallel computing
– Regular array accesses
– High scalability
– Distributed array accesses are concentrated in a small region of the source code
17
Execution Time (pde1)
[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup]
Ours: 249 sec; Hitachi HPF: 137,100 sec
Hitachi's HPF compiler needs more directives for better performance.
18
Effects by static code optimization (pde1)
[Chart: reduction of execution time (0–100%) vs. number of nodes (1, 2, 4, 8, 16), comparing dynamic and static]
19
Compilation Time (pde1)
[Chart: compilation time (0–250 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend Fortran, sequential, parallel, and data-exchange phases]
The long compilation time pays off if the OUTER loop iterates many times.
20
Experiment: FT-a
3D Fourier transformation
Features
– Irregular array accesses
– Distributed array accesses are concentrated in a small region of the source code
21
Execution Time (FT-a)
[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup]
Ours: 46 sec; Hitachi HPF: 4,898 sec
22
Compilation Time (FT-a)
[Chart: compilation time (0–350 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend, sequential, parallel, and data-exchange phases]
23
Experiments: BT-a
Block tri-diagonal solver
Features
– A small number of irregular array accesses
– Distributed array accesses are scattered all over the source code
24
Execution Time (BT-a)
[Chart: speedup vs. number of nodes (1, 2, 4, 8, 16) for Ours, Hitachi HPF, and linear speedup]
Ours: 1,430 sec; Hitachi HPF: 1,370,000 sec
25
Compilation Time (BT-a)
[Chart: compilation time (0–40,000 sec) vs. number of nodes (2, 4, 8, 16), broken down into backend, sequential, parallel, and data-exchange phases]
The inspector must analyze a huge number of array accesses, so our compiler cannot achieve good performance.
26
Conclusion
An HPF compiler that utilizes hardware for inter-node communication
– The inspector-executor method
– Static code optimization
The inspector produces optimized executor code
– The compiler runs on a PC cluster
Experiments
– A long compilation time is acceptable for simulation programs that run for a long time
27
Backup Slides
28
Reducing Communication Volume (Optimization)
Distribute loop iterations so that the communication volume becomes small
– Data partitioning is specified with HPF
– A preliminary run examines the communication volume that would occur
[Figure: loop iterations assigned to processors PE1 and PE2, annotated with the required communication volume]
29
Merging Multiple Messages
Our compiler collects several messages into a single message
– Messages in a loop with the INDEPENDENT directive can be merged
This directive specifies that the result of the loop is independent of the execution order of its iterations
– Our compiler finds block-stride communication by pattern matching, to reduce the number of communications
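The pattern matching can be sketched as follows (a hypothetical helper under simplified assumptions, not the compiler's code): given the sequence of element addresses a node must send, test whether they form equal blocks at a fixed stride, so one block-stride transfer can replace many small sends.

```python
def find_block_stride(addrs):
    """Return (start, block_len, stride, count) if `addrs` is exactly
    `count` blocks of `block_len` consecutive addresses spaced `stride`
    apart, else None."""
    if len(addrs) < 2:
        return None
    # Length of the first run of consecutive addresses = block length.
    block_len = 1
    while (block_len < len(addrs)
           and addrs[block_len] == addrs[block_len - 1] + 1):
        block_len += 1
    if len(addrs) % block_len != 0:
        return None
    count = len(addrs) // block_len
    stride = addrs[block_len] - addrs[0] if count > 1 else block_len
    # Verify every block matches the (start, stride) hypothesis.
    expected = [addrs[0] + b * stride + k
                for b in range(count) for k in range(block_len)]
    return (addrs[0], block_len, stride, count) if addrs == expected else None

print(find_block_stride([10, 11, 20, 21, 30, 31]))  # (10, 2, 10, 3)
print(find_block_stride([10, 11, 13]))              # None
```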
30
Future Work
Further reduce the number of communications
– Use block-stride communication more aggressively
(even at the cost of sending redundant data, messages could be merged into a small number of communications)
Prevent the generated code from growing too long
– If the data dependences between processors are too complex, our compiler generates too many communication operations
Improve the scalability of compilation time
– The inspector log for BT was too huge
Experiments with real simulations
31
CP-PACS/Pilot-3
Distributed-memory machine
– Center for Computational Physics at University of Tsukuba
– 2048 PEs (CP-PACS), 128 PEs (Pilot-3)
– Hyper-crossbar network
– RDMA
32
Our Optimizer to Solve the Problem
Use of special communication devices
– Parallel machines sometimes have special hardware to reduce the time for inter-node communication
Development of compilers for easy and well-known computer languages
– Fortran77, simple HPF (High Performance Fortran)
Runtime analysis
– A profiler for communication on a PC cluster
33
Effects by static code optimization (pde1)
[Chart: reduction of execution time (0–100%) vs. number of nodes (2, 4, 8, 16), comparing dynamic and static]