TRANSCRIPT
1
Introduction to Matlab Distributed Computing Server
(MDCS)
Dan Mazur and Pier-Luc [email protected]
December 1st, 2015
Partners and sponsors
2
3
Exercise 0: Login and Setup
Example hand-out slip: 07 : k41a0?wy#
● Ubuntu login:
  ● Username: csuser07
  ● Password: ___@[S07
● Guillimin login:
  ● ssh [email protected]
  ● Password: k41a0?wy#
4
Outline
● Introduction and Overview
● Configuring MDCS for Guillimin
● Submitting and monitoring jobs on Guillimin
  – batch command
● Parallel toolbox
  – parfor loops (parallel for loops)
  – spmd sections (single program multiple data)
  – distributed arrays (large memory problems)
  – GPUs and Xeon Phis
5
Parallel Computing Toolbox (PCT)
● High-level constructs for parallel programming
  – parallel for loops
  – distributed arrays
  – data parallel (spmd) sections
● Implicit (automatic) parallelism
● Implemented with MPI (MPICH2)
● Restricted to 12 cores on a single node
  – Multi-node scalability is built into MPICH2
  – Scalability is intentionally limited through technological effort
6
MDCS Overview
● MDCS allows parallel toolbox users access to a number of workers (set by the license terms) on any number of nodes
7
MDCS vs. PCT differences
● MDCS jobs are submitted to the batch system on a cluster, not run locally
  – Client-Server model
● In PCT, one explicitly starts a parpool environment
  – In MDCS, this environment is requested in the batch() command
8
MDCS Overview
Diagram (built up over several slides): your PC runs Matlab and sends a .m script plus attached files through MDCS to the job scheduler on Guillimin; the scheduler dispatches the job to the worker nodes, and monitoring information flows back to your PC.
Important: Do not attach large data files. Data transfer to and from Guillimin is best accomplished with scp or sftp. See http://www.hpc.mcgill.ca for large file transfers.
12
MDCS Licensing
One N-worker MDCS Job
● Provided by user (often via institution):
  – Desktop Matlab license
  – Parallel Computing Toolbox license
  – Additional toolbox licenses
● Provided by McGill HPC (pool of 64 MDCS licenses):
  – N x MDCS worker licenses
  – 1 master process worker license
13
MDCS Scenario
● Researchers begin using desktop Matlab using institutional licenses
● Eventually, researchers and research programs depend on the resulting software
● Problem sizes increase with time, eventually necessitating parallel computing
● No problem: to implement their Parallel Computing Toolbox functionality, Mathworks uses an implementation of MPI with good scaling behaviour, provided by the free software community
  – But Mathworks places restrictions on the number of nodes and cores
  – Removing these restrictions requires additional licenses
● Because of decisions they made years ago, researchers find themselves facing either
– Potentially expensive license fees to unlock their software's capabilities, or
– Financial and time barriers to switching vendors (i.e. porting code)
14
MDCS Alternatives
● Compile MPI functions with mex
  – Difficult to maintain, cannot use PCT functions, cannot use the Matlab debugger, must have access to many individual Matlab licenses (e.g. TAH license)
● Use Matlab MPI - use a global file system for MPI-like communication
  – Low performance for tightly-coupled problems
● Use GNU Octave
  – Reduces the switching costs by re-implementing the Matlab programming language
  – Parallel capabilities are less mature than Matlab's
● Port code to another language (Python, R, Fortran, etc.)
  – Significant effort and time
● Contact us for help and advice
  – [email protected]
15
MDCS Desktop Configuration
1) Install scripts used for communicating with scheduler
2) Configure the cluster profile
3) Verify your setup
16
Exercise 1: Install Scripts
● Download and unpack the .tar.gz configuration file on your local machine
  – E.g. Linux:
    cd <workdir>
    wget http://www.hpc.mcgill.ca/downloads/mdcs_config/guillimin_mdcs_config_v2.3.tar.gz
    tar -xvf guillimin_mdcs_config_v2.3.tar.gz
● Copy all "config/toolbox-local/*" files to the "<your_matlab_install>/toolbox/local" folder on your local machine
● Start or restart Matlab. Then test your installation:
>> glmnVersion
17
Permissions
● What if you don't have write access to the toolbox/local folder?
● Create a new folder in your home directory for Matlab scripts
● Add the new path to your Matlab path:
  path('newpath', path);
● Set the new path in a startup.m file
● Use the MATLABPATH environment variable on Mac and Linux OSs
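As a minimal sketch of the startup.m approach: assuming you copied the configuration scripts into a folder named ~/matlab_local (a hypothetical name; use whatever directory you created), a startup.m in your default Matlab folder could add it to the path automatically:

```matlab
% startup.m -- runs automatically each time Matlab starts.
% '~/matlab_local' is a hypothetical folder holding the glmn* scripts;
% getenv('HOME') works on Mac/Linux (on Windows, use your profile path).
addpath(fullfile(getenv('HOME'), 'matlab_local'));
```

addpath() is equivalent to the path('newpath', path) call shown above, prepending the folder to the search path.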
18
MDCS Integration Scripts
● glmnCommSubFcn.m, glmnIndSubFcn.m — Main drivers for submitting jobs
● glmnGetRemoteConn.m — Establishes the connection to the cluster with ssh
● glmnPBS.m — Specifies the submission parameters
● glmnCreateSubScript.m — Creates a script which will run on the cluster to submit the job
● glmnGenSubmitString.m — Generates the qsub command
● glmnExtractJobId.m — Gets the PBS jobID from the cluster
● glmnCommJobWrapper.sh, glmnIndJobWrapper.sh — The script that is submitted to the worker nodes by qsub
● glmnDeleteJobFcn.m — Cancels a job on the cluster through Matlab
● glmnGetJobStateFcn.m — Gets the job status from the cluster
19
Avoiding Metadata Corruption
● Each (server, Matlab installation) pair requires a pair of metadata folders, one on the submitting computer and one on Guillimin
● E.g. installing a new version of Matlab and re-using the same metadata folders will result in corruption
● E.g. submitting to a new MDCS server and re-using the same metadata folders will result in corruption
● E.g. multiple users from the same client will require a shared metadata folder (read and write) or separate profiles
  – Important: You cannot re-use your class account configuration for other Guillimin accounts
20
How many metadata folders?
Servers: guillimin, orcinus
Clients: Lab computer (R2013a and R2014a), Home computer (R2013a)
Answer: 12. There are three client Matlab installations and two servers, giving six (server, installation) pairs, and each pair needs two metadata folders (one on the client, one on the server): 3 x 2 x 2 = 12.
22
Exercise 2: Configure your computer
● We have made a script, glmnConfigCluster.m, to make configuration easier
● Warning: glmnConfigCluster will overwrite any profile called 'guillimin'
>> glmnConfigCluster
Enter a unique name for your local computer (e.g. the hostname): workshop
Home directory on local computer (e.g. /home/alex, /Users/alex, or C:\\Users\\alex): /Users/dmazur
Home directory on guillimin (e.g. /home/alex): /home/dmazur
One last step: please connect to guillimin, and create your Matlab job directory:
mkdir -p /home/dmazur/.matlab/jobs/workshop/guillimin/R2014a
Once done, your local computer will be configured to submit jobs to guillimin.
23
Exercise 3: Validation
● You will want to test your new cluster with simple tests before trying more complicated codes
● Clicking the validation button in Matlab can take a long time, and the final test is expected to fail
● Perform the validation procedure from the McGill HPC documentation
  – Must be performed in the TestParfor directory:
    cd examples/TestParfor
  – In glmnPBS.m, set procsPerNode to 3
24
A simple batch job
● myCluster = parcluster('guillimin')
  – Selects a cluster profile
● j = batch(myCluster, ...)
  – Submits jobs to the cluster
  ● You are prompted for your username
  ● Select 'no' when asked to use an identity file
  ● You are prompted for your password
● wait(j): Waits for the job to finish
25
Exercise 4: Simple Batch Job
>> myCluster = parcluster('guillimin')
>> j = batch(myCluster, @rand, 1, {10, 10}, 'CurrentDirectory', '.');
>> wait(j)
>> r = fetchOutputs(j)
26
glmnPBS.m
● For parallel jobs, we have a script (glmnPBS.m) to make job submission easier
● Place this script in your working directory
● Before submission, check that you have a valid glmnPBS.m file and that your submission parameters are correct:
>> test = glmnPBS();
>> test.getSubmitArgs()
27
classdef glmnPBS
    % Guillimin PBS submission arguments
    properties
        % Local script, remote working directory (home, by default)
        localScript = 'TestParfor';
        workingDirectory = '.';
        % nodes, ppn, gpus, phis and other attributes
        numberOfNodes = 1;
        procsPerNode = 3;
        gpus = 0;
        phis = 0;
        attributes = '';
        % Specify the memory per process required
        pmem = '1700m'
        % Requested walltime
        walltime = '00:30:00'
        % Please use metaq unless you require a specific node type
        queue = 'metaq'
        % All jobs should specify an account or RAPid:
        % e.g. account = 'xyz-123-aa'
        account = '';
        % You may use otherOptions to append a string to the qsub command
        % e.g. otherOptions = '-M email[at]address.com -m bae'
        otherOptions = ''
    end
28
Submitting with glmnPBS.m
>> cluster = parcluster('guillimin');
>> glmnPBS.submitTo(cluster);
● Note that glmnPBS.m must be present for all job submissions, even with batch()
– Called by glmnCommSubFcn.m
methods(Static)
    function job = submitTo(cluster)
        opt = glmnPBS();
        job = batch(cluster, opt.localScript, ...
            'matlabpool', opt.getNbWorkers(), ...
            'CurrentDirectory', opt.workingDirectory ...
        );
    end
end
29
Matlab Job Monitor
● Parallel > Monitor Jobs
● Select Profile: guillimin
● Enter your username
● Select 'no'
● Enter your password
● Tip: Set autoupdate to 'never', or use an identity file. Otherwise, Matlab interrupts your work with password requests.
30
Matlab Job Monitor
The Job Monitor can report the state, and more details such as output and errors (right click).
31
Monitoring Jobs on Guillimin
● Show running and queued jobs:
  qstat -u class01
  – qstat shows both MDCS and other Guillimin jobs
● Detailed scheduler information for the job with jobID=########:
  qstat -f ########
● Meta-data is stored in job-specific folders:
  /home/username/.matlab/jobs/workshop/guillimin/R2014a/Job1
  – The .log files contain output and errors from Matlab itself
  – The .txt files contain output from disp() and fprintf()
● You should create output and save Matlab (.mat) files within your Guillimin storage (scratch, home, or project spaces)
  – fprintf()
  – save()
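As a minimal sketch of that pattern: write text output with fprintf() and save results with save() into your own Guillimin storage (the directory below is a placeholder; substitute your actual scratch, home, or project path):

```matlab
% Write job output inside your Guillimin storage rather than attaching
% large files to the job. 'outdir' is a hypothetical placeholder path.
outdir = '/path/to/your/scratch/space';
outfile = fopen(fullfile(outdir, 'results.txt'), 'w');
fprintf(outfile, 'result = %12f\n', pi);   % text output for inspection
fclose(outfile);
A = rand(100);
save(fullfile(outdir, 'results.mat'), 'A');  % retrieve later with scp/sftp
```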
32
Exercise 5: Submit Parallel Job
● Change the working directory to the examples/TestParfor folder you extracted from the .tar.gz configuration file
● Launch TestParfor.m using glmnPBS.m:
>> cluster = parcluster('guillimin')
>> job = glmnPBS.submitTo(cluster)
33
Make sure you are in the correct directory
>> cluster = parcluster('guillimin')
>> job = glmnPBS.submitTo(cluster)
This script runs for ~15 minutes. You may use showq or the job monitor to monitor its progress.
34
Exercise Codes
● While your job is waiting/running...
● Please download and extract the exercise codes from our website:
http://www.hpc.mcgill.ca/downloads/intro_mdcs/dec2015.tar.gz
35
Parallel Matlab
● Benefits of parallelism
– Computations complete faster
– Scale to larger data sets in the same amount of time
– Work with larger data sets using distributed memory
36
Parallel Matlab
● Implicit (automatic) parallelism
  – Bioinformatics toolbox
  – Image processing toolbox
  – Optimization toolbox
  – Signal processing toolbox
  – Statistics toolbox
  – etc.
● Explicit parallelism
  – Parallel toolbox
    ● parfor
    ● spmd
    ● distributed()
37
TestParfor.m

function TestParfor
clear all;
N = 4000;
filename = '~/output_test_parfor.txt';   % output file located on Guillimin
outfile = fopen(filename, 'w');
fprintf(outfile, 'CALCULATION LOG: \n\n');

% Serial 'for' loop, executed on the head processor
tic;
for k = 1:10
    Ham(:,:,k) = rand(N) + i*rand(N);
    fprintf(outfile, 'Serial: Doing K-point : %3i\n', k);
    inv(Ham(:,:,k));
end
t2 = toc;
fprintf(outfile, 'Time serial = %12f\n', t2);
fclose(outfile);

% Parallel 'parfor' loop, executed on 2 worker nodes
tic;
parfor k = 1:10
    Ham(:,:,k) = rand(N) + i*rand(N);
    outfile = fopen(filename, 'a');
    fprintf(outfile, 'Parallel: Doing K-point : %3i\n', k);
    fclose(outfile);
    inv(Ham(:,:,k));
end
t2 = toc;
outfile = fopen(filename, 'a');
fprintf(outfile, 'Time parallel = %12f\n', t2);
fprintf(outfile, 'CALCULATIONS DONE ... \n\n');
fclose(outfile);
38
Parfor
Diagram: in a serial for loop, the iterations i = 1, 2, 3, 4 execute one after another in time; in a parallel parfor loop with 4 workers, all four iterations execute simultaneously.
39
~/output_test_parfor.txt
CALCULATION LOG:

Serial: Doing K-point :   1
Serial: Doing K-point :   2
Serial: Doing K-point :   3
Serial: Doing K-point :   4
Serial: Doing K-point :   5
Serial: Doing K-point :   6
Serial: Doing K-point :   7
Serial: Doing K-point :   8
Serial: Doing K-point :   9
Serial: Doing K-point :  10
Time serial = 553.056296
Parallel: Doing K-point :   7
Parallel: Doing K-point :   4
Parallel: Doing K-point :   6
Parallel: Doing K-point :   3
Parallel: Doing K-point :   5
Parallel: Doing K-point :   2
Parallel: Doing K-point :   1
Parallel: Doing K-point :   9
Parallel: Doing K-point :   8
Parallel: Doing K-point :  10
Time parallel = 291.879429
CALCULATIONS DONE ...

Ideal speedup = 2.00X
Actual speedup = 1.90X
(Serial 'for' loop executed on the head processor; parallel 'parfor' loop executed on 2 worker nodes.)
40
Parfor loops
● The loop index must be consecutive integers
  – It cannot be altered inside the loop
● Iterations must be independent from one another
  – Local or temporary variables modified inside the parfor loop can't be used after the loop
● parfor loops cannot be nested
  – But a parfor loop doesn't need to be the outermost for loop
● The Matlab editor will automatically warn about problems
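A minimal sketch of the independence rule: the first loop below is valid because each iteration touches only its own index, while the second (commented out) is invalid because each iteration reads the previous one's result.

```matlab
n = 8;
y = zeros(1, n);

% Valid: each iteration depends only on its own index i,
% and y is a 'sliced' output variable.
parfor i = 1:n
    y(i) = i^2;
end

% Invalid: iterations are not independent (y(i) needs y(i-1)),
% so parfor cannot run this loop -- the Matlab editor flags it:
% parfor i = 2:n
%     y(i) = y(i-1) + 1;
% end
```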
41
Load Balancing
● Each iteration of the for loop should do an equal amount of work
Good load balancing:

parfor i = 1:40
    x = rand(1000, 1000);
    inv(x);
end

Bad load balancing:

parfor i = 1:40
    x = rand(100*i, 100*i);
    inv(x);
end

The 40th iteration has much more work than the 1st iteration.
42
Parallel Reduction

>> s = 0;
>> parfor i = 1:40
>>     s = s + i;
>> end
>> disp(s)
   820

● The operation is performed 'atomically'
● The operation must be associative
  ● e.g. addition or multiplication
  ● not subtraction or division
43
Aside: Atomic Operations

>> s = 0;
>> parfor i = 1:40
>>     s = s + i;
>> end
>> disp(s)
   820

Each update of s involves three steps:
Step 1: Read s from memory
Step 2: Add
Step 3: Store the result in s

Diagram (non-atomic addition): Worker 1 and Worker 2 both read s = 0, both compute 0 + 1, and both store 1 — one update is lost.
Diagram (atomic addition): Worker 1 reads s = 0, computes 0 + 1, and stores 1; Worker 2 then reads s = 1, computes 1 + 1, and stores 2 — no update is lost.

Matlab calls 's' a 'reduction variable', and these operations are automatically atomic.
http://www.mathworks.com/help/distcomp/reduction-variables.html
46
Parallel Concatenation

>> y = [];
>> parfor i = 1:10
>>     y = [y, i];
>> end
>> disp(y)
     1     2     3     4     5     6     7     8     9    10

● The matrix is stored in the 'correct' order according to index i
47
Parameter Sweep
● Damped harmonic oscillator
● Give an initial velocity for a variety of k's and b's and watch the maximum response amplitude
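As a sketch of what such a sweep looks like (the parameter grids, time span, and ODE settings below are illustrative, not the ones used in paramsweep.m): solve m*x'' + b*x' + k*x = 0 with an initial velocity for each (b, k) pair and record the peak displacement.

```matlab
% Illustrative parameter sweep over a damped harmonic oscillator.
m = 5;                        % mass (assumed value)
bVals = 0.1:0.05:0.5;         % damping coefficients to sweep
kVals = 1.5:0.1:2.5;          % spring constants to sweep
peak = zeros(numel(bVals), numel(kVals));
for i = 1:numel(bVals)
    for j = 1:numel(kVals)
        % State y = [position; velocity]; m*x'' = -b*x' - k*x
        odefun = @(t, y) [y(2); (-bVals(i)*y(2) - kVals(j)*y(1))/m];
        [~, y] = ode45(odefun, [0 25], [0; 1]);  % initial velocity 1
        peak(i, j) = max(y(:, 1));               % maximum response
    end
end
```

Since each (i, j) iteration is independent, replacing the outer for with parfor is exactly the kind of small change Exercise 6 asks for.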
48
Exercise 6: Parameter Sweep
● paramsweep.m solves a second-order ordinary differential equation (ODE) for varying parameter values
● Modify this code to run in parallel on 2 workers
● Submit your modified code to the MDCS
● Retrieve the resulting plot from Guillimin using scp and view it on your laptop:
[laptop]$ scp [email protected]:~/paramsweep.png ./
49
50
Single Program Multiple Data
● The spmd command allows each worker to execute the same program on different data
● The variables labindex and numlabs are (for example) used to index the data
  – They are automatically defined inside spmd sections
● The functions labSend() and labReceive() are used to send and receive data between the workers
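A minimal sketch of point-to-point communication inside an spmd block, assuming a pool of at least 2 workers is already open:

```matlab
% Worker 1 sends a matrix to worker 2, which receives and uses it.
spmd
    if labindex == 1
        data = magic(4);
        labSend(data, 2);        % send 'data' to worker (lab) 2
    elseif labindex == 2
        data = labReceive(1);    % blocks until worker 1's send arrives
        disp(sum(data(:)));      % do something with the received data
    end
end
```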
51
>> matlabpool(3)  % Or parpool(3) on newer versions of Matlab
Starting matlabpool using the 'local' profile ... connected to 3 workers.
>> spmd
labindex
end
Lab 1: ans = 1
Lab 2: ans = 2
Lab 3: ans = 3
>> spmd
q = magic(labindex + 2);
end
>> q
q =
 Lab 1: class = double, size = [3 3]
 Lab 2: class = double, size = [4 4]
 Lab 3: class = double, size = [5 5]
>> q{1}
ans =
     8     1     6
     3     5     7
     4     9     2
>> q{2}
ans =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
52
SPMD data load example
● spmd can be used to have each worker process data from separate files
● Example: process data stored in files datafile1.mat, datafile2.mat, etc.

spmd
    infile = load(['datafile' num2str(labindex) '.mat']);
    result = myfunc(infile);
end
53
Serial numerical integration
m = 10;
b = pi/2;
dx = b/m;
x = dx/2:dx:b-dx/2;
int = sum(cos(x)*dx)
54
SPMD Integral
● We would like to parallelize this integral using spmd
● In terms of m, b, numlabs and labindex:
  – How many increments per lab?
  – Integration length per lab?
  – Local integration range?
● We can use gplus() to perform a global sum over workers
55
SPMD Integral
● We would like to parallelize this integral using spmd
● In terms of m, b, numlabs and labindex:
  – How many increments per lab?
    ● n = m / numlabs
  – Integration length per lab?
    ● Delta = dx * n = (b / m) * (m / numlabs) = b / numlabs
  – Local integration range?
    ● ai = (labindex - 1) * Delta
    ● bi = labindex * Delta
● We can use gplus(int, 1) to perform a global sum over int from each worker
56
SPMD Integral
e.g. m = 10, numlabs = 5:
  n = 10/5 = 2 increments per lab
  Delta = (pi/2)/5 = pi/10
  ai = (labindex-1)*pi/10
  bi = labindex*pi/10
Sum over increments for a worker:
  int = sum(cos(x)*dx);
Global sum over all workers:
  int = gplus(int, 1);
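Putting the pieces above together, a minimal sketch of the spmd mid-point integration (assuming m is divisible by numlabs):

```matlab
% Mid-point integration of cos(x) on [0, b], split across the workers.
m = 10;
b = pi/2;
spmd
    dx = b / m;
    n  = m / numlabs;              % increments handled by this worker
    Delta = dx * n;                % integration length per worker
    ai = (labindex - 1) * Delta;   % start of this worker's range
    x  = ai + dx/2 : dx : ai + Delta - dx/2;  % mid-points of local bins
    int = sum(cos(x) * dx);        % local partial sum
    int = gplus(int, 1);           % global sum, gathered on lab 1
end
```

With b = pi/2 the global sum approximates the exact integral of cos(x), which is 1.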
57
Exercise 7: Numerical Integration
● integration.m is a serial numerical integration program
● Modify this code to run in parallel using the spmd command
● Submit your modified code to the MDCS using 2 workers
58
Distributed Arrays

matlabpool(4)
A = distributed([ a b c d; e f g h; i j k l; m n o p]);

Diagram: the matrix is split by columns across the 4 MDCS workers — worker 1 holds [a;e;i;m], worker 2 holds [b;f;j;n], worker 3 holds [c;g;k;o], and worker 4 holds [d;h;l;p].
59
Distributed Arrays
● Allow large data sets to be distributed over multiple nodes
● Distributed by columns
● Can be constructed by
  – partitioning a large array already in memory
  – combining smaller arrays into one large array
  – using distributed matrix constructor functions (distributed.rand(), distributed.zeros(), etc.)
● Operations on distributed arrays are automatically parallelized
● Arrays do not persist if the matlabpool is closed
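A minimal sketch of the constructor-function route, including how to bring results back to the client (matrix size and pool size are illustrative):

```matlab
% Build distributed arrays and time a matrix multiply on the pool.
matlabpool(4)                      % or parpool(4) on newer versions
a = distributed.rand(4000);        % 4000x4000, split by columns
b = distributed.rand(4000);
tic;
c = a * b;                         % runs in parallel on the workers
toc
result = gather(c);                % copy back to the client if needed
matlabpool close                   % arrays do not persist after this
```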
60
Codistributed Arrays
● Codistributed arrays provide much more control over how arrays are distributed
  – Can be distributed by any dimension
  – Can distribute different amounts of data to different workers
● Codistributed arrays can be declared inside spmd sections
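As a sketch, a codistributed array built inside an spmd section and distributed by rows (dimension 1) rather than the default columns (the 8x8 size is illustrative):

```matlab
% Declare a codistributed array with an explicit distribution scheme.
spmd
    dist = codistributor1d(1);            % distribute along dimension 1
    A = codistributed.rand(8, 8, dist);   % each worker holds some rows
    localA = getLocalPart(A);             % this worker's local rows
    disp(size(localA));
end
```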
61
Exercise 8: Matrix Multiplication
● matrixmul.m is a serial matrix multiplication
● Modify this file to use distributed arrays
  – create distributed random arrays a, b
  – time a matrix multiplication: tic; c = a*b; toc
● Submit the job for 1 worker and then for 4 workers
  – What is the speedup (serial time / parallel time)?
62
Using GPUs with Matlab
● The Parallel Computing Toolbox can utilize CUDA-capable GPUs on the system (e.g. the K20s on Guillimin)
● GPU-enabled functions
  – fft, filter
  – toolbox functions
  – linear-algebra operations
● Custom CUDA kernels
  – .cu or .ptx formats
63
GPU Arrays
● Matlab can copy arrays to the GPU
● Perform matrix operations on the GPU to speed them up
● e.g.
  – x = rand(1000, 'single', 'gpuArray');
  – x2 = x.*x; % performed on the GPU
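Extending that example into a minimal round trip (allocate on the GPU, compute there, copy back):

```matlab
% Allocate directly on the GPU, compute there, then copy results back.
x  = rand(1000, 'single', 'gpuArray');  % lives in GPU memory
x2 = x .* x;                            % element-wise square on the GPU
y  = fft(x2);                           % GPU-enabled fft
result = gather(y);                     % copy back to host memory
```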
64
Exercise 9: GPU Job
● fourier.m is a serial fast Fourier transform (FFT) code
● Modify this file to perform the same calculation using normal and GPU arrays
● Use tic and toc to time both operations and output the results
● Submit this job to a Guillimin GPU node
  – Hint: simply request a GPU in glmnPBS.m:
numberOfNodes = 1;
procsPerNode = 1;
gpus = 1;
● What is the speedup from the GPU?
65
Summary
● Today we learned:
  – How to configure a desktop installation of Matlab to submit jobs to a cluster computer using MDCS
  – How to submit jobs to a cluster and monitor their output
  – How to write parallel Matlab applications using parfor, spmd, and distributed arrays
● Many Matlab programs can be parallelized with a very small change
● Note that parallel programming is a huge topic, and we have only scratched the surface!
66
Questions
What questions do you have?
67
Using Xeon Phi with Matlab
● Matlab uses the Intel MKL math library
● Version >= 11.0 of MKL has automatic offloading to the Xeon Phi
  – Included in Matlab R2014a and newer
● On Guillimin:
module add ifort_icc
export MKL_MIC_MAX_MEMORY=16G
export MKL_MIC_ENABLE=1
matlab &