Parallel Programming in MATLAB on BioHPC
[web] portal.biohpc.swmed.edu
[email] [email protected]
Updated for 2017-05-17
What is MATLAB?
• High-level language and development environment for:
  - Algorithm and application development
  - Data analysis
  - Mathematical modeling
• Extensive math, engineering, and plotting functionality
• Add-on products for image and video processing, communications, signal processing, and more.
Powerful & easy to learn!
Ways of MATLAB Parallel Computing
• Built-in multithreading (implicit parallelism) * automatically handled by MATLAB
  - Use vector operations in place of for-loops: +, -, *, /, SORT, CEIL, LOG, FFT, IFFT, SVD, …

Example: element-wise product

for i = 1 : size(A, 1)
    for j = 1 : size(A, 2)
        C(i, j) = A(i, j) * B(i, j);
    end
end

vs.

C = A .* B;

If A and B are both 1000-by-1000 matrices, the MATLAB built-in element-wise product is 50~100 times faster than the for-loop.
* Testing environment: MATLAB R2015a @ BioHPC compute node
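As a quick way to see the effect, a minimal timing sketch (matrix size and structure are illustrative; actual speedups vary by machine and MATLAB version):

```
% Compare loop vs. vectorized element-wise multiplication.
A = rand(1000);
B = rand(1000);

tic
C1 = zeros(1000);
for i = 1 : size(A, 1)
    for j = 1 : size(A, 2)
        C1(i, j) = A(i, j) * B(i, j);
    end
end
tLoop = toc;

tic
C2 = A .* B;          % vectorized; MATLAB multithreads this internally
tVec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
isequal(C1, C2)       % both approaches give identical results
```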
Ways of MATLAB Parallel Computing
• Parallel Computing Toolbox * accelerate MATLAB code with very few code changes
  Jobs run on your workstation, thin client, or a reserved compute node
  - Parallel for-loop (parfor)
  - spmd (Single Program Multiple Data)
  - pmode (interactive environment for spmd development)
  - Parallel computing with GPU
• MATLAB Distributed Computing Server (MDCS) * submit MATLAB jobs directly to the BioHPC cluster
  - MATLAB job scheduler integrated with SLURM
  - A single job can use multiple nodes
Parallel Computing Toolbox – Parallel for-Loops (parfor)
• parfor allows several MATLAB workers to execute individual loop iterations simultaneously.
• The only difference between a parfor loop and a for loop is the keyword parfor. When the loop begins, MATLAB opens a parallel pool of sessions called workers that execute the iterations in parallel.
for …
    <statements>
end

<open matlab pool>
parfor …
    <statements>
end
<close matlab pool>
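A minimal runnable sketch of the pattern above (the pool size of 4 is an assumption; adjust to your node):

```
% Square the first 8 integers in parallel.
parpool('local', 4);        % open a pool of 4 workers
n = 8;
y = zeros(1, n);
parfor i = 1 : n
    y(i) = i^2;             % iterations are independent, so parfor applies
end
disp(y)                     % 1 4 9 16 25 36 49 64
delete(gcp('nocreate'));    % close the pool
```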
Parallel Computing Toolbox – Parallel for-Loops (parfor)
Example 1: Estimating an Integral

F = \int_{a}^{b} 0.2 \sin(2x) \, dx

The integral is estimated by calculating the total area under the F(x) curve:

F \approx \sum_{i=1}^{n} height_i \times width_i

We can calculate the subareas in any order!
quad_fun.m (serial version)

function q = quad_fun(n, a, b)
    q = 0.0;
    w = (b - a) / n;
    for i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
end

quad_fun_parfor.m (parfor version)

function q = quad_fun_parfor(n, a, b)
    q = 0.0;
    w = (b - a) / n;
    parfor i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
end
With 12 workers: speedup = t1/t2 = 6.14x

Parallel Computing Toolbox – Parallel for-Loops (parfor)
Example 1: Estimating an Integral
Parallel Computing Toolbox – Parallel for-Loops (parfor)
Q: How many workers can I use for a MATLAB parallel pool?

Node type           Default   Maximum
128GB, 384GB, GPU   12        16*
256GB               12        24*
Workstation         6         6
Thin client         2         2

* Use parpool('local', numOfWorkers) to specify how many workers you want
  (or matlabpool('local', numOfWorkers) if you are using MATLAB R2013a).

2013a/2013b: max number of workers = 12
2014a/2014b/2015a/2015b: max number of workers = number of physical cores
Parallel Computing Toolbox – Parallel for-Loops (parfor)
Q: How many workers can I use for a MATLAB parallel pool?
2016a/2016b/2017a:
• max number of workers = 512
• default number of workers = number of physical cores
• choose smartly

% load the local profile configuration
myCluster = parcluster('local');
% increase the number of workers
myCluster.NumWorkers = 32;
% open matlab pool
parpool(myCluster);
% parfor loop
parfor …
    <statements>
end
% close matlab pool
delete(gcp('nocreate'));

Recommended number of workers:
128GB, 384GB, GPU   <= 32
256GB               <= 48
Workstation         <= 12
Thin client         <= 4
Limitations of parfor

No nested parfor loops*

% not allowed:
parfor i = 1 : 10
    parfor j = 1 : 10
        <statement>
    end
end

% allowed:
parfor i = 1 : 10
    for j = 1 : 10
        <statement>
    end
end

* Because a worker cannot itself open a parallel pool.

Loop iterations must be independent

% not allowed: iteration i reads neighboring elements of A
parfor i = 2 : 10
    A(i) = 0.5 * ( A(i-1) + A(i+1) );
end

The loop variable must consist of consecutive increasing integers

% not allowed:
parfor x = 0 : 0.1 : 1
    <statement>
end

% allowed: index into the non-integer values instead
xAll = 0 : 0.1 : 1;
parfor ii = 1 : length(xAll)
    x = xAll(ii);
    <statement>
end
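When iterations really do depend on each other, one common workaround is a Jacobi-style rewrite that reads from a snapshot, so iterations become independent. Note this changes the numerics (each iteration sees only the old values of A), so it is a sketch of the pattern, not a drop-in replacement:

```
A = rand(1, 11);          % hypothetical data
Aold = A;                 % fixed snapshot of the original values
Anew = A;
parfor i = 2 : 10
    % each iteration reads only the snapshot, so there is no dependence
    Anew(i) = 0.5 * ( Aold(i-1) + Aold(i+1) );
end
A = Anew;
```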
Parallel Computing Toolbox – spmd and Distributed Array
spmd: Single Program, Multiple Data

• single program -- identical code runs on multiple workers.
• multiple data -- each worker can have different, unique data for that code.
• numlabs -- the number of workers.
• labindex -- each worker has a unique identifier between 1 and numlabs.
• point-to-point communication between workers: labSend, labReceive, labSendReceive
• Ideal for: 1) programs that take a long time to execute; 2) programs operating on large data sets.

spmd
    labdata = load(['datafile_' num2str(labindex) '.ascii']);  % multiple data
    result = MyFunction(labdata);                              % single program
end

<open matlab pool>
spmd
    <statements>
end
<close matlab pool>
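A small self-contained spmd sketch (the pool size of 4 is an assumption) that splits a sum across workers and gathers the Composite result on the client:

```
parpool('local', 4);
spmd
    % each worker sums its own slice of 1..100
    chunk = 100 / numlabs;              % assumes numlabs divides 100 evenly
    lo = (labindex - 1) * chunk + 1;
    hi = labindex * chunk;
    partial = sum(lo:hi);
end
total = sum([partial{:}]);              % gather Composite results on the client
disp(total)                             % 5050
delete(gcp('nocreate'));
```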
Parallel Computing Toolbox – spmd and Distributed Array

quad_fun.m

function q = quad_fun(n, a, b)
    q = 0.0;
    w = (b - a) / n;
    for i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
end

quad_fun_spmd.m (spmd version)

a = 0.13;
b = 1.53;
n = 120000000;
spmd
    aa = (labindex - 1) / numlabs * (b - a) + a;
    bb = labindex / numlabs * (b - a) + a;
    nn = round(n / numlabs);
    q_part = quad_fun(nn, aa, bb);
end
q = sum([q_part{:}]);
Example 1: Estimating an Integral. With 12 workers: speedup = t1/t3 = 8.26x.
Only 12 client-worker communications are needed to update q (one per worker), whereas the parfor version needs 120,000,000 communications between client and workers.
Parallel Computing Toolbox – spmd and Distributed Array
distributed array – data is distributed from the client and remains directly accessible from the client. Data is always distributed along the last dimension, as evenly as possible along that dimension among the workers.
codistributed array – data is distributed within an spmd block; the user defines along which dimension to distribute the data. The array is created directly (locally) on the workers.
Matrix multiply A*B

distributed array:

parpool('local');
A = rand(3000);
B = rand(3000);
a = distributed(A);
b = distributed(B);
c = a * b;        % runs on the workers; c is distributed
delete(gcp);

codistributed array:

parpool('local');
A = rand(3000);
B = rand(3000);
spmd
    u = codistributed(A, codistributor1d(1));  % by row, 1st dim
    v = codistributed(B, codistributor1d(2));  % by column, 2nd dim
    w = u * v;
end
delete(gcp);
Parallel Computing Toolbox – pmode: learn parallel programming interactively
Parallel Computing Toolbox – spmd and Distributed Array
Example 2: Ax = b
linearSolver.m (serial version)

n = 10000;
M = rand(n);
X = ones(n, 1);
A = M + M';
b = A * X;
u = A \ b;

linearSolverSpmd.m (spmd version)

n = 10000;
M = rand(n);
X = ones(n, 1);
spmd
    m = codistributed(M, codistributor('1d', 2));
    x = codistributed(X, codistributor('1d', 1));
    A = m + m';
    b = A * x;
    utmp = A \ b;
end
u1 = gather(utmp);
Parallel Computing Toolbox – spmd and Distributed Array
Factors that reduce the speedup of parfor and spmd:
• The computation inside the loop is simple.
• Memory limitations (parallel code creates more data than the serial code).
• Transferring and converting data is time-consuming.
• Unbalanced computational load.
• Synchronization overhead.
Parallel Computing Toolbox – spmd and Distributed Array
Example 3: Contrast Enhancement
contrast_enhance.m

function y = contrast_enhance(x)
    x = double(x);
    n = size(x, 1);
    x_average = sum(sum(x(:,:))) / n / n;
    s = 3.0;   % the contrast factor s should be greater than 1
    y = (1.0 - s) * x_average + s * x((n+1)/2, (n+1)/2);
end

contrast_serial.m (running time is 14.46 s)

x = imread('surfsup.tif');
y = nlfilter(x, [3,3], @contrast_enhance);
y = uint8(y);

Image size: 480 * 640, uint8
(example provided by John Burkardt @ FSU)
Parallel Computing Toolbox – spmd and Distributed Array
Example 3: Contrast Enhancement
contrast_spmd.m (spmd version)

x = imread('surfsup.tif');
xd = distributed(x);
spmd
    xl = getLocalPart(xd);
    xl = nlfilter(xl, [3,3], @contrast_enhance);
end
y = [xl{:}];

With 12 workers, running time is 2.01 s; speedup = 7.19x.

nlfilter performs a general sliding-neighborhood operation with zero padding at the edges, so when the image is divided by columns among the workers, artificial internal boundaries are created!
Solution: add communication between the workers.
Parallel Computing Toolbox – spmd and Distributed Array
Example 3 : Contrast Enhancement
Each worker exchanges one "ghost boundary" column with its previous and next neighbors.
For a 3*3 filtering window, you need 1 column of ghost boundary on each side.

contrast_MPI.m (message-passing version)

x = imread('surfsup.tif');
xd = distributed(x);
spmd
    xl = getLocalPart ( xd );

    % find previous & next workers by labindex
    if ( labindex ~= 1 )
        previous = labindex - 1;
    else
        previous = numlabs;
    end
    if ( labindex ~= numlabs )
        next = labindex + 1;
    else
        next = 1;
    end

    % attach ghost boundaries
    column = labSendReceive ( previous, next, xl(:,1) );
    if ( labindex < numlabs )
        xl = [ xl, column ];
    end
    column = labSendReceive ( next, previous, xl(:,end) );
    if ( 1 < labindex )
        xl = [ column, xl ];
    end

    xl = nlfilter ( xl, [3,3], @contrast_enhance );

    % remove ghost boundaries
    if ( 1 < labindex )
        xl = xl(:,2:end);
    end
    if ( labindex < numlabs )
        xl = xl(:,1:end-1);
    end
    xl = uint8 ( xl );
end
y = [ xl{:} ];

With 12 workers: running time 2.00 s, speedup = 7.23x.
Parallel Computing Toolbox – GPU
The Parallel Computing Toolbox enables MATLAB to use your computer's graphics processing unit (GPU) for matrix operations. For some problems, execution on the GPU is faster than on the CPU.

How to enable GPU computing in MATLAB on the BioHPC cluster:
Step 1: reserve a GPU node.
Step 2: inside a terminal, type
    export CUDA_VISIBLE_DEVICES="0"
to enable hardware rendering.

• You can launch MATLAB directly if you have a GPU card in your workstation.
• Currently supported versions on BioHPC: 2013a/2013b/2014a/2014b/2015a/2015b
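Once the GPU is visible, a minimal sketch of the round trip (the device index 1 is an assumption):

```
g = gpuDevice(1);            % select the first GPU
x = gpuArray(rand(1000));    % copy data from CPU to GPU
y = x .^ 2 + sin(x);         % runs element-wise on the GPU
z = gather(y);               % copy the result back to the CPU
```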
Parallel Computing Toolbox – GPU
Example 2: Ax = b
linearSolverGPU.m (GPU version); speedup = t1/t3 = 2.54x

n = 10000;
M = rand(n);
X = ones(n, 1);
Mgpu = gpuArray(M);     % copy data from CPU to GPU (~0.3 s)
Xgpu = gpuArray(X);
Agpu = Mgpu + Mgpu';
bgpu = Agpu * Xgpu;
ugpu = Agpu \ bgpu;
u = gather(ugpu);       % copy data from GPU to CPU (~0.0002 s)
Parallel Computing Toolbox – GPU
Check GPU information (inside MATLAB)

Q: Is a GPU available?
if (gpuDeviceCount > 0)
    ...
end

Q: How to check memory usage?
currentGPU = gpuDevice;
currentGPU.AvailableMemory

* TotalMemory refers to the GPU accelerator memory, e.g. K40m ~12 GB; K20m ~5 GB.
Parallel Computing Toolbox – benchmark

Out of memory
* The actual memory consumption is much higher than the input data size.
Matlab Distributed Computing Server
Matlab Distributed Computing Server: Setup and Test your MATLAB environment
Home -> ENVIRONMENT -> Parallel -> Manage Cluster Profiles
Add -> Import Cluster Profiles from file, and then select the profile file matching your MATLAB version from:

/project/apps/apps/MATLAB/profile/nucleus_r2017a.settings
/project/apps/apps/MATLAB/profile/nucleus_r2016b.settings
/project/apps/apps/MATLAB/profile/nucleus_r2016a.settings
/project/apps/apps/MATLAB/profile/nucleus_r2015b.settings
/project/apps/apps/MATLAB/profile/nucleus_r2015a.settings
/project/apps/apps/MATLAB/profile/nucleus_r2014b.settings
/project/apps/apps/MATLAB/profile/nucleus_r2014a.settings
/project/apps/apps/MATLAB/profile/nucleus_r2013b.settings
/project/apps/apps/MATLAB/profile/nucleus_r2013a.settings
Matlab Distributed Computing Server: Setup and Test your MATLAB environment
You should have two cluster profiles ready for MATLAB parallel computing:
"local" - for running jobs on your workstation, thin client, or any single compute node of the BioHPC cluster (uses the Parallel Computing Toolbox)
"nucleus_r<version No.>" - for running jobs on the BioHPC cluster with multiple nodes (uses MDCS)

* Ignore the "MATLAB Parallel Cloud" profile if you are using 2016a/2016b/2017a.

Test your parallel profile.
Matlab Distributed Computing Server: Setup SLURM environment
From the MATLAB command window:

ClusterInfo.setQueueName('128GB');                 % use the 128GB partition
ClusterInfo.setNNode(2);                           % request 2 nodes
ClusterInfo.setWallTime('1:00:00');                % set the time limit to 1 hour
ClusterInfo.setEmailAddress('[email protected]');  % email notification

You can check your settings with:

ClusterInfo.getQueueName();
ClusterInfo.getNNode();
ClusterInfo.getWallTime();
ClusterInfo.getEmailAddress();
Matlab Distributed Computing Server: batch processing
Job submission

For scripts:
job = batch('ascript', 'Pool', 15, 'profile', 'nucleus_r2016a');
%            |          |                      |
%            |          |                      configuration (profile) file
%            |          number of workers - 1
%            script name

For functions:
job = batch(@sfun, 1, {50, 4000}, 'Pool', 15, 'profile', 'nucleus_r2016a');
%            |     |   |
%            |     |   cell array containing input arguments
%            |     number of output variables returned by the function
%            function name
Matlab Distributed Computing Server: batch processing
Wait for the batch job to finish:

• Block the session until the job has finished: wait(job, 'finished');
• Occasionally check on the state of the job: get(job, 'state');
• Load results after the job has finished: results = fetchOutputs(job); -or- load(job);
• Delete the job: delete(job);

You can also open the job monitor from Home -> Parallel -> Monitor Jobs.
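Putting the pieces together, a sketch of a complete submit-wait-fetch cycle (queue name, pool size, and profile name are assumptions; quad_fun is the example function from earlier):

```
% Hypothetical end-to-end MDCS batch workflow.
ClusterInfo.setQueueName('128GB');      % partition (assumption)
ClusterInfo.setNNode(1);
ClusterInfo.setWallTime('00:30:00');

job = batch(@quad_fun, 1, {1000000, 0.13, 1.53}, ...
            'Pool', 15, 'profile', 'nucleus_r2016a');

wait(job, 'finished');                  % block until the job completes
results = fetchOutputs(job);            % cell array: results{1} holds q
delete(job);                            % clean up the job record
```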
Matlab Distributed Computing Server
Example 1: Estimating an Integral

quad_submission.m

ClusterInfo.setQueueName('super');
ClusterInfo.setNNode(2);
ClusterInfo.setWallTime('00:30:00');

job = batch(@quad_fun, 1, {120000000, 0.13, 1.53}, 'Pool', 63, 'profile', 'nucleus_r2016a');
wait(job, 'finished');
results = fetchOutputs(job);
Matlab Distributed Computing Server
Running time is 15 s.
* Recall that the serial job needs only 5.7881 s; MDCS is not designed for small jobs.

If we submit the spmd version of the code with MDCS:
job = batch('quad_fun_spmd', 'Pool', 63, 'profile', 'nucleus_r2016a');
Running time is 24 s.
Matlab Distributed Computing Server
Example 4: Vessel Extraction
Pipeline: microscopy image (size: 2126*5517) -> binary image -> 243 subsets of the binary image

Steps: local normalization, remove salt & pepper noise, and threshold; separate vessel groups based on connectivity; find the central line of each vessel (fast matching method).

Timings: ~6.5 hours / ~21.3 seconds / ~6.6 seconds
Matlab Distributed Computing Server
Example 4: Vessel Extraction (step 3)
vesselExtraction_submission.m

% get the file list
subImages = dir(['BWimages', '/*.jpg']);
parfor i = 1 : length(subImages)
    filename = subImages(i).name;
    fastMatching(filename);
end

The time needed to process each sub-image varies. Running the job on 4 nodes requires the minimum computational time: t = 784 s, speedup = 28.3x.
License limitation of Matlab Distributed Computing Server
Maximum concurrent licenses = 128 workers, shared among all BioHPC users; jobs may pend waiting on MDCS licenses.

Solutions:
• Single-node jobs – use the local profile and increase the number of workers (2016a)
• Multi-node jobs – SLURM + GNU Parallel
  - ex_04: example_parallel_sbatch.sh (4 nodes with a pool size of 128 threads, t = 699 s)
SLURM + GNU Parallel
* Image retrieved from https://www.biostars.org/p/63816/

A simple way to parallelize is to run 8 jobs on each CPU in fixed batches and wait for the whole batch to finish; GNU Parallel instead spawns a new process as soon as one finishes, keeping the CPUs active and thus saving time.

GNU Parallel is a shell tool for executing jobs in parallel using one or more computers.
http://www.gnu.org/software/parallel/
Usage: an sbatch script drives GNU Parallel, which launches srun-based task scripts.
Submit multiple jobs to multiple nodes with srun & GNU Parallel
#!/bin/bash
#SBATCH --job-name=parallelExample
#SBATCH --partition=super
#SBATCH --nodes=4
#SBATCH --ntasks=128
#SBATCH --time=1-00:00:00

CORES_PER_TASK=1
INPUTS_COMMAND="ls BWimages"
TASK_SCRIPT="single.sh"

module add parallel

SRUN="srun --exclusive -n1 -N1 -c $CORES_PER_TASK"
PARALLEL="parallel --delay .2 -j $SLURM_NTASKS --joblog task.log"

eval $INPUTS_COMMAND | $PARALLEL $SRUN sh $TASK_SCRIPT {}

single.sh:

#!/bin/bash
module add matlab/2016a
matlab -nodisplay -nodesktop -singleCompThread -r "fastMatching('$1'), exit"

• INPUTS_COMMAND: generates the input file list
• TASK_SCRIPT: the per-task base script
Questions and Comments
For assistance with MATLAB, please contact the BioHPC Group:
Email: [email protected]
Submit help ticket at http://portal.biohpc.swmed.edu
When did it happen? What time? Cluster or client? What job ID? How did you run it?
Which MATLAB version? Is the code compatible across different versions?
Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.