Automatic Parameterisation of Parallel Linear Algebra Routines
Domingo Giménez Javier Cuenca José González
University of Murcia, Spain
Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Parallèle. Rabat, Morocco, 28-31 May 2001
Outline
Current Situation of Linear Algebra Parallel Routines (LAPRs)
Objective
Approach I: Analytical Model of the LAPRs
Application: Jacobi Method on Origin 2000
Approach II: Exhaustive Executions
Application: Gauss elimination on networks of processors
Validation with the LU factorization
Conclusions
Future Work
Linear Algebra: highly optimizable operations
Optimizations are Platform Specific Traditional method: Hand-Optimization for each platform
Current Situation of Linear Algebra Parallel Routines (LAPRs)
[Chart: execution time (seconds) vs. problem size (512-3072) for the untuned and the hand-tuned routine]
Time-consuming
Incompatible with hardware evolution
Incompatible with changes in the system (architecture and basic libraries)
Unsuitable for dynamic systems
Misuse by non-expert users
Problems of traditional method
ATLAS, FLAME, I-LIB
Analyse platform characteristics in detail
Sequential code
Empirical results of the LAPR + automation
High installation time
Current approaches
Develop a methodology for obtaining Automatically Tuned Software
Execution Environment
Auto-tuning Software
Our objective
Parameterised routines: system parameters and algorithmic parameters
System parameters obtained at installation time:
from the analytical model of the routine and simple installation routines that obtain the system parameters, or
from a reduced number of executions at installation time
Algorithmic parameters obtained at running time:
from the analytical model with the system parameters obtained in the installation process, or
from the file with information generated in the installation process
Methodology
System parameters obtained at installation time
Analytical model of the routine and simple installation routines to obtain the system parameters
Algorithmic parameters obtained at running time
From the analytical model with the system parameters obtained in the installation process
Analytical modelling
The behaviour of the algorithm on the platform is defined
Texec = f(SPs, n, APs), with SPs = f(n, APs)
SPs: System Parameters; APs: Algorithmic Parameters; n: Problem Size
Analytical Model
System Parameters (SPs) reflect:
the physical characteristics of the hardware platform
the current conditions of the system
the basic libraries
How to estimate each SP?
1º Obtain the performance-cost kernel of the LAPR
2º Build an estimation routine from this kernel
Two Kinds of SPs:
Communication System Parameters (CSPs)
Arithmetic System Parameters (ASPs)
Analytical Model
LAPRs Performance
Arithmetic System Parameters (ASPs): tc, the arithmetic cost; when BLAS is used, one cost per level: k1, k2 and k3.
Estimation routine derived from the computation kernel of the LAPR: similar storage scheme, similar quantity of data.
Analytical Model
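As an illustration of how an ASP such as k3 might be measured at installation time, the sketch below times a small matrix-matrix product and divides by the flop count. The naive triple loop stands in for the DGEMM call the real estimation routine would use; the function name and sizes are illustrative assumptions, not the slides' code.

```python
import time

def estimate_k3(b):
    """Estimate the per-flop cost k3 (in microseconds) by timing a
    b x b matrix-matrix product, the computation kernel of the LAPR.
    A naive triple loop stands in for DGEMM here."""
    A = [[1.0] * b for _ in range(b)]
    B = [[1.0] * b for _ in range(b)]
    C = [[0.0] * b for _ in range(b)]
    start = time.perf_counter()
    for i in range(b):
        for k in range(b):
            aik = A[i][k]
            row = B[k]
            out = C[i]
            for j in range(b):
                out[j] += aik * row[j]
    elapsed = time.perf_counter() - start
    flops = 2.0 * b ** 3          # multiply-add count of the product
    return elapsed / flops * 1e6  # microseconds per flop

k3 = estimate_k3(64)
```

Running the same measurement for several block sizes yields a table like the one obtained at installation on the Origin 2000 (k3 as a function of b).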
Communication System Parameters (CSPs): ts, the start-up time, and tw, the word-sending time.
Estimation routine derived from the communication kernel of the LAPR: similar kind of communication, similar quantity of data.
Analytical Model
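The two CSPs can be recovered from timings of the communication kernel by fitting the linear cost model t(L) = ts + L·tw. A minimal sketch of the fitting step follows; the sample measurements are synthetic (chosen to be consistent with ts = 20 µs, tw = 0.1 µs), whereas in the real installation they would come from timing message-passing transfers.

```python
def fit_csp(times):
    """Least-squares fit of t(L) = ts + L*tw from (length, time) pairs,
    reducing timings of the communication kernel to the two CSPs."""
    n = len(times)
    sx = sum(L for L, _ in times)
    sy = sum(t for _, t in times)
    sxx = sum(L * L for L, _ in times)
    sxy = sum(L * t for L, t in times)
    tw = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    ts = (sy - tw * sx) / n
    return ts, tw

# Synthetic measurements consistent with ts = 20 us, tw = 0.1 us/word:
samples = [(L, 20.0 + 0.1 * L) for L in (128, 1024, 8192)]
ts, tw = fit_csp(samples)
```

Using several message lengths rather than two makes the fit robust to timing noise.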
Algorithmic Parameters (APs)
Values chosen in each execution
b block size
p number of processors
r × c logical topology: grid configuration (logical 2D mesh)
Analytical Model
Pre-installing (manual):
1º Make the Analytical Model: Texec = f (SPs, n, APs)
2º Write the Estimation Routines for the SPs
Installing on a Platform (automatic):
3º Estimate the SPs using the Estimation Routines of step 2
4º Write a Configuration File, or include the information in the LAPR:
for each n, the APs that minimize Texec
Execution:
The user executes LAPR for a size n:
LAPR obtains optimal APs
The Methodology. Step by step:
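The run-time step can be sketched as a simple minimisation of the modelled Texec over candidate APs, with the SPs already substituted at installation time. The model function, block sizes and processor range below are hypothetical placeholders, not the slides' actual model.

```python
from itertools import product

def choose_aps(n, texec, block_sizes=(16, 32, 64, 128), max_procs=8):
    """Return the (b, p) pair minimising the modelled execution time.

    texec stands for the installed analytical model
    Texec = f(SPs, n, APs) with the SPs already substituted."""
    candidates = product(block_sizes, range(1, max_procs + 1))
    return min(candidates, key=lambda bp: texec(n, bp[0], bp[1]))

# Hypothetical model: the cubic arithmetic term is shared by p
# processors, while each processor adds overhead growing with b.
model = lambda n, b, p: 2 * n**3 / (3 * p) + p * (20 + 0.1 * n * b)
b, p = choose_aps(1024, model)
```

In practice the minimisation can be done once per n at installation time and stored in the configuration file, so the user-visible call only performs a lookup.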
LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem.
Message-passing with MPI; logical ring & logical 2D mesh
Platform: SGI Origin 2000
Application Example
Application Example. Algorithm Scheme
[Figure: algorithm scheme; block distribution of the matrices B, W and D over the processors, with block size b and n/r block rows per processor]
Application Example: Pre-installing.
1º Make the Analytical Model: Texec = f(SPs, n, APs)

Texec = tari + tVC + tHC

where tari is the arithmetic time, expressed in terms of k1, k3, n, p, b and r, and tVC and tHC are the vertical and horizontal communication times on the 2D mesh, expressed in terms of ts, tw, n, b, c and r.
Application Example: Pre-installing.
2º Write the Estimation Routines for the SPs
k3: matrix-matrix multiplication with DGEMM
k1: Givens rotation of 2 vectors with DROT
ts, tw: communications along the 2 directions of the 2D mesh
Application Example: Installing
3º Estimate the SPs using the Estimation Routines
k1 ≈ 0.01 µs
k3 ≈ 0.005 µs (b = 32), 0.004 µs (b = 64), 0.003 µs (b = 128)
ts ≈ 20 µs
tw ≈ 0.1 µs
Comparison of execution times using different sets of Execution Parameters (4 processors)
Application Example: Executing
[Chart: execution time vs. problem size (512-3072) for Untuned, Tuned with MCAP, Tuned with MVAP, and the Optimal Execution Time]
Comparison of execution times using different sets of Execution Parameters (8 processors)
Application Example: Executing
[Chart: execution time vs. problem size (512-3072) for Untuned, Tuned with MCAP, Tuned with MVAP, and the Optimal Execution Time]
LAPR: One-sided Block Jacobi Method
Algorithmic Parameters: block size, mesh topology
Platform: SGI Origin 2000 with message-passing
System Parameters: arithmetic costs, communication costs
Satisfactory reduction of the execution time: from 25% above the optimal to only 2%
Application Example: Executing
Outline
Current Situation of Linear Algebra Parallel Routines (LAPRs)
Objective
Approach I: Analytical Model of the LAPRs
Application: Jacobi Method on Origin 2000
Approach II: Exhaustive Executions
Application: Gauss elimination on networks of processors
Validation with the LU factorization
Conclusions
Future Work
System parameters obtained at installation time
Installation routines making a reduced number of executions at installation time
Algorithmic parameters obtained at running time
From the file with information generated in the installation process
Exhaustive Execution
The behaviour of the algorithm on the platform is defined (as in Analytical Modelling)
Texec = f(SPs, n, APs), with SPs = f(n, APs)
SPs: System Parameters; APs: Algorithmic Parameters; n: Problem Size
Exhaustive Execution
Identify Algorithmic Parameters (APs) (as in Analytical Modelling)
Values chosen in each execution
b block size
p number of processors
r × c logical topology: grid configuration (logical 2D mesh)
Exhaustive Execution
Pre-installing (manual):
1º Determine the APs
2º Decide heuristics to reduce execution time in the installation process
Installing on a Platform (automatic):
3º The manager decides the problem sizes to be analysed
4º Execute and write a Configuration File, or include the information in the LAPR:
for each n, the APs that minimize Texec
Execution:
The user executes LAPR for a size n:
LAPR obtains optimal APs
The Methodology. Step by step:
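At execution time the configuration file written in step 4º amounts to a table from installed problem sizes to the best APs found; the routine can simply take the entry for the installed size closest to the requested n. A sketch of that lookup follows; the file contents shown are hypothetical.

```python
def lookup_aps(config, n):
    """Return the APs recorded for the installed problem size
    closest to n. config maps installed sizes to (b, p) pairs."""
    nearest = min(config, key=lambda m: abs(m - n))
    return config[nearest]

# Hypothetical configuration file written at installation time:
config = {512: (32, 4), 1536: (64, 8), 2560: (64, 8)}
best = lookup_aps(config, 2000)
```

Interpolating between the two nearest installed sizes, instead of taking the nearest one, is a natural refinement when the optimum varies smoothly with n.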
LAPR: Gaussian elimination.
Message-passing with MPI; logical ring,
rowwise block-cyclic striped partitioning
Platform: networks of processors (heterogeneous system)
Application Example
Application Example: Pre-installing.
1º Determine the APs: logical ring, rowwise block-cyclic striped partitioning
p number of processors
b block size for the data distribution
different block sizes in heterogeneous systems
b0 b1 b2 b0 b1 b2 b0 b1 b2 b0
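Under rowwise block-cyclic striping, the owner of a global row follows directly from the block size and the number of processors; a minimal sketch of the mapping (homogeneous case, illustrative values):

```python
def owner_of_row(i, b, p):
    """Processor owning global row i under rowwise block-cyclic
    striping with block size b over p processors."""
    return (i // b) % p

# Blocks of b consecutive rows are dealt round-robin:
owners = [owner_of_row(i, 2, 3) for i in range(8)]
```

With b = 2 and p = 3, rows 0-1 go to processor 0, rows 2-3 to processor 1, rows 4-5 to processor 2, and the cycle restarts at row 6.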
Application Example: Pre-installing.
2º Decide heuristics to reduce execution time in the installation process
Execution time varies continuously with the problem size and the APs
Consider the system as homogeneous
Installation can finish:
when the analytical and experimental predictions coincide
when a certain time has been spent on the installation
Homogeneous Systems:
3º The manager decides the problem sizes
4º Execute and write a Configuration File, or include the information in the LAPR:
for each n, the APs that minimize Texec
Heterogeneous Systems:
3º The manager decides the problem sizes
4º Execute:
write a Configuration File: for each n, the APs that minimize Texec
write a Speed File, with the relative speeds of the processors in the system
Application Example: Installing
RI-THE: Obtains p and b from the formula.
RI-HOM: Obtains p and b through a reduced number of executions.
RI-HET: 1º. As RI-HOM.
2º. Obtains bi for each processor:

bi = p · b · si / (s1 + ... + sp)

where si is the relative speed of processor i.
Application Example: Installation Routines
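RI-HET's second step can be sketched as splitting the homogeneous block size proportionally to the relative processor speeds, so that each distribution cycle still covers p·b rows. The code below assumes block sizes of the form bi = p·b·si / Σj sj, our reading of the slide's formula.

```python
def het_block_sizes(b, speeds):
    """Per-processor block sizes proportional to relative speeds,
    keeping the total rows per distribution cycle at p * b:
        b_i = p * b * s_i / sum_j s_j
    (assumed form of RI-HET's second step)."""
    p = len(speeds)
    total = sum(speeds)
    return [round(p * b * s / total) for s in speeds]

# A processor twice as fast receives a block twice as large:
sizes = het_block_sizes(32, [1.0, 1.0, 2.0])
```

Rounding can leave the cycle a few rows short or long; a production version would repair the remainder on the fastest processor.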
Three different configurations:
PLA_HOM: 5 SUN Ultra-1
PLA_HYB: 5 SUN Ultra-1 + 1 SUN Ultra-5
PLA_HET: 1 SUN Ultra-1 + 1 SUN Ultra-5 + 1 SUN Ultra-1 (manages the file system)
Application Example: Systems
Experimental results in PLA-HOM:
Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time
[Chart: quotient vs. problem size (500-3000) for RI-THEO, RI-HOMO and RI-HETE]
Application Example: Executing
Experimental results in PLA-HYB:
Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time
[Chart: quotient vs. problem size (500-3000) for RI-THEO, RI-HOMO and RI-HETE]
Application Example: Executing
[Chart: quotient vs. problem size (500-3000) for RI-THEO, RI-HOMO and RI-HETE]
Experimental results in PLA-HET:
Quotient between the execution time with the parameters from the Installation Routine and the optimum execution time
Application Example: Executing
Two techniques for automatic tuning of Parallel Linear Algebra Routines:
1. Analytical Modelling
For predictable systems (homogeneous, static, ...), like the Origin 2000
2. Exhaustive Execution
For less predictable systems (heterogeneous, dynamic, ...), like networks of workstations
Transparent to the user
Execution close to the optimum
Comparison
Outline
Current Situation of Linear Algebra Parallel Routines (LAPRs)
Objective
Approach I: Analytical Model of the LAPRs
Application: Jacobi Method on Origin 2000
Approach II: Exhaustive Executions
Application: Gauss elimination on networks of processors
Validation with the LU factorization
Conclusions
Future Work
To validate the methodology it is necessary to experiment with:
More routines: block LU factorization
More systems:
Architectures: IBM SP2 and Origin 2000
Libraries: reference BLAS, machine BLAS, ATLAS
Validation with the LU factorization
Sequential LU
tari = (2/3) n³ k3 + lower-order terms in k2 and k1 depending on n and the block size b
Analytical Model: Texec= f (SPs,n,APs)
SPs: cost of arithmetic operations of different levels:
k1, k2, k3
APs: block size b

[Figure: blocked LU factorization scheme with block size b]
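The block size b could then be chosen by minimising the modelled tari over candidate values. The model shape used below, tari(b) = (2/3)n³ k3(b) + n²b k2 + (1/3)nb² k1, and the SP values are assumptions for illustration, not the slides' exact formula; k3 is taken as a function of b, as measured at installation.

```python
def best_block_size(n, k1, k2, k3_of_b, candidates=(16, 32, 64, 128)):
    """Pick the LU block size minimising an assumed model of the form
    t_ari(b) = (2/3) n^3 k3(b) + n^2 b k2 + (1/3) n b^2 k1,
    with k3(b) the BLAS-3 cost measured at installation."""
    def t_ari(b):
        return (2 / 3) * n**3 * k3_of_b(b) + n**2 * b * k2 + n * b**2 / 3 * k1

    return min(candidates, key=t_ari)

# SP values of the kind produced at installation (microseconds/flop,
# illustrative numbers only):
k3 = {16: 0.006, 32: 0.005, 64: 0.004, 128: 0.003}
b = best_block_size(1024, 0.01, 0.008, k3.get)
```

The trade-off is visible in the model: larger b improves the BLAS-3 rate k3(b) but inflates the k2 and k1 terms, so the optimal b shifts with n.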
Quotient between different execution times and the optimum execution time
Sequential LU. Comparison in IBM SP2
[Chart: quotient vs. problem size (512-2560) for the modelled, weighted and LAPACK parameter choices]
Quotient between the execution time with the parameters provided by the model and the optimum execution time, with different basic libraries, in SUN 1

Sequential LU. Model execution time / optimum execution time
[Chart: quotient vs. problem size (256-1536) for reference BLAS, machine BLAS and ATLAS]
Parallel LU
tari = (2/3) (n³/p) k3 + lower-order terms in k2 and k1 depending on n, b and the grid dimensions r and c
Analytical Model: Texec= f (SPs,n,APs)
SPs: cost of arithmetic operations: k1, k2, k3
cost of communications: ts, tw
APs: block size b,
number of processors p,
grid configuration rc
[Figure: 2D block-cyclic distribution of the matrix over a 2 × 3 processor grid, block size b]
Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors.
Parallel LU. Comparison in IBM SP2
[Chart: quotient vs. problem size (512-3584) for SEQ, PAR4 and PAR8]
Quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 and 8 processors.
Parallel LU. Comparison in Origin 2000
[Chart: quotient vs. problem size (512-3584) for SEQ, PAR4 and PAR8]
The modelling of the algorithm provides satisfactory results in different systems:
Origin 2000, IBM SP2
reference BLAS, machine BLAS, ATLAS
The prediction is worse in some cases:
when the number of processors increases
in multicomputers where communications are more important (IBM SP2)
Exhaustive Executions
Parallel LU. Conclusions
If the manager installs the routine for sizes 512, 1536 and 2560, and executions are performed for sizes 1024, 2048 and 3072, the execution time is well predicted.
The same policy can be used in the installation of other software:
Quotient between the execution time with the parameters provided by the installation process and the optimum execution time, with ScaLAPACK, in IBM SP2.
Parallel LU. Exhaustive Execution
[Chart: quotient vs. problem size (1024-3072) with 4 and 8 processors]
Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software
Two techniques can be used:
Analytical Modelling
Exhaustive Executions
or a combination of both
Experiments performed in different systems and with different routines
Conclusions
We try to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries:
it is necessary to analyse the methodology in more systems and with more routines
Architecture of an Automatically Tuned Linear Algebra Library
At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes
Future Work
Architecture of an Automatically Tuned Linear Algebra Library

[Diagram: the designer provides the Library, the Installation routines and the Basic routines declaration; the manager provides the Basic routines library and the Installation file; Installation generates the SP file and the AP file; Compilation produces the tuned library]