Exploration of the performance of SDMM algorithm
Sheng Wang
August 19, 2016
MSc in High Performance Computing with Data Science
The University of Edinburgh
Year of Presentation: 2016
Abstract
Recently, compressive sensing has become a popular and attractive technique in the
area of signal processing. Its application to radio astronomy has opened new
possibilities in radio interferometry. Current attention in radio interferometry is on
how an array of radio telescopes can be used instead of a single dish while still
obtaining high-quality measurements. Many results have been published in recent
years, some of which even outperform CLEAN, the state-of-the-art imaging algorithm
widely used in this area. Examples include SDMM, PADMM and PD-based
algorithms. In this project, we explore the performance of the SDMM algorithm,
hoping to exploit its potential parallelism. We examine each part of the algorithm
carefully, and present results and analysis at the end of the dissertation.
Contents
Chapter 1 Introduction ............................................................................................... 1
1.1 Difficulties and Diversions from the original plan ............................................. 2
1.2 Structure of the dissertation .............................................................................. 2
Chapter 2 Background theory ..................................................................................... 3
2.1 Radio astronomy .............................................................................................. 3
2.2 Radio interferometry ........................................................................................ 4
2.3 Compressive sensing ........................................................................................ 6
2.3.1 Sparse representation ................................................................................. 8
2.3.2 Measurement operators .............................................................................. 8
2.3.3 Signal reconstruction algorithms ................................................................ 9
2.4 Compressive sensing in radio interferometry ................................................... 10
2.5 Large-scale optimization ................................................................................ 12
2.5.1 Proximal splitting methods ...................................................................... 12
2.5.2 Parallel structure of Φ and Ψ ................................................................... 13
2.5.3 Resulting algorithms ................................................................................ 13
Chapter 3 Current implementation ............................................................................ 19
3.1 Current structure of SDMM ............................................................................ 19
3.2 Problems with current structure ...................................................................... 21
Chapter 4 Optimisation and parallelisation methods .................................................. 22
4.1 Platform ......................................................................................................... 22
4.2 Tools ............................................................................................................. 22
4.2.1 Profiling .................................................................................................. 22
4.2.2 Data visualization .................................................................................... 24
4.3 Parallel languages .......................................................................................... 25
4.3.1 Heterogeneous computing........................................................................ 25
4.3.2 OpenCL, OpenACC and OpenMP ........................................................... 27
Chapter 5 Implementation of the solution ................................................................. 28
5.1 Initial profiling ............................................................................................... 28
5.2 Profiling “ConjugateGradient” ........................................................................ 30
5.3 Three-level parallelism in “update_directions” ................................................ 30
5.3.1 Parallel proximity operator....................................................................... 31
5.3.2 Separable sparsity priors .......................................................................... 32
5.3.3 Splitting of the data.................................................................................. 33
Chapter 6 Summary and conclusions ........................................................................ 35
6.1 Future work ................................................................................................... 35
References ............................................................................................................... 37
Appendix................................................................................................................. 41
List of Tables
Table 1 data fields in the SDMM class ..................................................................... 19
Table 2 two main functions in the SDMM class ........................................................ 20
Table 3 time consumption of each original term........................................................ 31
Table 4 time consumption of each term .................................................................... 33
List of Figures
Figure 1 Arecibo telescope ......................................................................................... 4
Figure 2 Demonstration of radio interferometry .......................................................... 5
Figure 3 the Square Kilometre Array .......................................................................... 6
Figure 4 Demonstration of compressive sensing ......................................................... 7
Figure 5 single-pixel camera ...................................................................................... 7
Figure 6 general process of using compressive sensing in signal acquisition .............. 10
Figure 7 SARA algorithm ........................................................................................ 11
Figure 8 sequential SDMM algorithm ...................................................................... 14
Figure 9 parallel SDMM algorithm .......................................................................... 15
Figure 10 Proximal ADMM algorithm ..................................................................... 16
Figure 11 the PD-based method ............................................................................... 17
Figure 12 Cray Apprentice2 interface ....................................................................... 24
Figure 13 Intel CPU trends .................................................................................... 25
Figure 14 initial profiling of reweighted SDMM ....................................................... 28
Figure 15 initial profiling for “solve_for_xn” function ............................... 29
Figure 16 initial profiling for “update_directions” function ....................................... 29
Figure 17 profiling for “ConjugateGradient” class .................................................... 30
Acknowledgements
I am very grateful to Mr. Adrian Jackson for his constant support and guidance
throughout this project. I would further like to thank my friends and family for their
ongoing support during this dissertation. Finally, thank you to the EPCC for
allowing the time and resources to work on this project.
Chapter 1
Introduction
Radio telescopes have been widely used by astronomers to explore the universe
by detecting radio waves emitted by a wide range of objects. Unlike optical
telescopes, radio telescopes work with signals at a longer wavelength,
which makes the sensing more stable under cloudy skies. Radio telescopes can be
used individually or they can be linked together to create a telescope array, also
known as an interferometer. The interferometer helps to exceed the limit on the
angular resolution that a single radio telescope can achieve. It is therefore an
area that appeals to many scientists.
The Square Kilometre Array (SKA) is a large multi-radio-telescope project planned
for construction in the coming years. The SKA will have a total collecting area of
about one square kilometre, which will make it the most sensitive radio
instrument in the world. The birth of such a large radio interferometer means the
need for high performance central computing engines to process the data it generates.
Compressive sensing is an attractive technique in the area of signal processing.
It can efficiently reduce the number of samples needed in signal acquisition
compared to classical Shannon-Nyquist based methods. Compressive sensing has
recently been applied to the area of radio interferometry, and many imaging
algorithms in the framework of compressive sensing have been produced. Among
them, SDMM, Proximal ADMM and PD-based methods are the ones we focus
on. Although a parallel version of each algorithm has already been proposed,
only the serial SDMM algorithm is implemented in the PURIFY package.
In this project, we explore the potential parallelism of the SDMM
algorithm, and try to understand the limits of its performance.
1.1 Difficulties and Diversions from the original plan
During the progress of the project, we ran into several obstacles. Initially, we
were planning to port the SDMM (or PADMM, PD-based methods) to
accelerators in order to explore the performance of the implementation.
However, we found that the implementation of SDMM, PADMM and PD-based
methods are only done partly. There is only sequential implementation of SDMM
provided, and the other two algorithms are even not fully implemented.
Therefore, we change the original plan of porting the existing code to
accelerators, to explore the performance of the current serial version of SDMM.
The biggest difficulty we met in this project is the complexity of the background
theory, which can affect the understanding of the implementation. Several
unfamiliar concepts came up during the course of the project, such as
compressive sensing, radio interferometry and several convex optimization
methods. Trying to understand these concepts took up most of our time.
In addition, the implementation of the SOPT package requires many third-party
libraries, which makes the build system much more complex. Besides, updates to
some necessary libraries can sometimes lead to failures when building the
project.
1.2 Structure of the dissertation
In this dissertation, we present the work carried out in the following manner.
To begin with, in Chapter 2, we introduce the background theory of our project,
from radio astronomy to the detailed algorithms, and formally define the problem
we are going to solve using equations.
Then, in Chapter 3, we look into the current implementation of the SDMM
algorithm and explain what the current implementation consists of.
In Chapter 4, we introduce the platform on which our experiments run and the tools
and parallel languages we use.
In Chapter 5, we implement some of the ideas and examine their actual effect in
normal usage. A discussion of the reasons for the performance differences is
also conducted in this chapter.
Finally, in Chapter 6, we summarise our work and introduce ideas for future work.
Chapter 2
Background theory
2.1 Radio astronomy
Since the first radio signals from space were detected by Karl Jansky in the
1930s, radio telescopes have been widely used by astronomers to explore the
universe by detecting radio waves emitted by a wide range of objects, including
our Sun and even some stars that are millions of light years away from Earth.
Although optical telescopes are widely used by astronomers, one of their
disadvantages is that they can be hampered by cloud or poor weather conditions on
Earth. Unlike optical telescopes, radio telescopes work with signals at a
longer wavelength, which can be used even under cloudy skies. This advantage
makes radio telescopes an alternative to optical telescopes.
In order to obtain the same level of detail and resolution as optical
telescopes, radio telescopes have to have a much larger collecting area.
Currently the largest single-dish radio telescope in the world is the Arecibo
telescope (see figure 1), which is located in a natural hollow in Puerto Rico.
However, even compared with the currently largest radio telescope in the world,
the resolution of optical telescopes is still far superior.
Simply increasing the size of a single radio telescope to improve its resolution
can lead to other problems. For example, if the size of the telescope is too large,
the ability to point at certain regions may be limited. Radio astronomers have been
able to utilise a technique known as interferometry to get around this limitation in size.
Figure 1 Arecibo telescope
2.2 Radio interferometry
Radio telescopes can be used individually or they can be linked together to
create a telescope array, also known as an interferometer. Scientists found
that the effect of several radio telescopes acting together is the same as
that of a single vast telescope. The resolution of an interferometer depends not
on the diameter of the individual radio telescopes, but on the maximum separation
between them.
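As a rough guide (this formula is an added illustration, not stated in the original text), the angular resolution θ achievable by an interferometer scales with the observing wavelength λ and the maximum baseline B between telescopes:

```latex
\theta \approx \frac{\lambda}{B_{\max}}
```

which is why moving the telescopes further apart improves the resolution.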
Moving them further apart increases the angular resolution, improving the
array's ability to resolve smaller objects in the sky. In an
interferometer, the signals from all of the telescopes are brought together
and processed by a correlator, which combines them to effectively
simulate the signal from a single, much larger telescope. This process is
demonstrated in figure 2.
Figure 2 Demonstration of radio interferometry
There is currently a multi-radio-telescope project called the Square Kilometre
Array (see figure 3), which involves many countries, including the UK, Canada,
India and China, and is aimed at building several arrays of telescopes to
achieve a wide collecting area of about one square kilometre. When the SKA is
completed, it will hopefully surpass the resolution of optical instruments like the
Hubble Space Telescope, one of the largest and most versatile optical
telescopes. However, many telescopes mean large volumes of data. Efficient
algorithms for converting the data collected from separate telescope stations into a
high-resolution image are important and necessary in the SKA era.
Figure 3 the Square Kilometre Array
2.3 Compressive sensing
Traditional approaches to sampling signals or images are based on Shannon’s
celebrated theorem: the sampling rate must be at least twice the maximum
frequency present in the signal (the so-called Nyquist rate). Sampled signals or
images then completely preserve the information of the original signals, so that we
can recover them exactly later on. In fact, this principle underlies nearly all signal
acquisition protocols used in consumer audio and visual electronics, medical
imaging devices, radio receivers, and so on.
Because of this success, the amount of data generated by sensing systems has
grown from a trickle to a torrent. However, in many emerging and important
areas, the resulting Nyquist rate is so high that we end up with far too many
samples. Sometimes, it may be too costly or even physically impossible to build
devices capable of acquiring samples at the necessary rate. Therefore,
in application areas such as medical imaging, remote surveillance,
spectroscopy and radio interferometry, traditional sensing systems based on the
Nyquist theorem can no longer satisfy the need.
Figure 4 Demonstration of compressive sensing
In order to deal with such high-dimensional data, we usually depend on
compression, which aims at finding a representation of a signal that has lower
dimension yet contains as much of the original information as possible. For
example, in figure 4, by using a measurement operator to sense the original signal
x, we get the measured vector y. The measured vector not only contains all the
information necessary for reconstruction of the original signal, but is also much
smaller. Compressive sensing theory asserts that one can recover certain
signals and images from far fewer samples or measurements than traditional
methods use. The key concept of compressive sensing is that we can reduce the
cost of measuring certain signals if the signals have certain
characteristics (i.e. the signals are sparse in a known basis). One typical
application of compressive sensing is the single-pixel camera (see Figure 5).
Instead of having many sensing resources (i.e. photon detectors), we are now
able to use only one photon detector and sample (and simultaneously
compress) the signal (image) to its “information rate” using non-adaptive, linear
measurements.
Figure 5 single-pixel camera
Compressive sensing theory mainly consists of three parts: sparse
representation, measurement operators and signal reconstruction algorithms.
2.3.1 Sparse representation
The sparsity of a signal can be understood simply as the number of non-zero
elements in the signal: the smaller this number, the sparser the signal. CS
theory is based on the principle that the sparsity of a signal can be exploited to
recover the original signal, through optimization, from far fewer samples than
the traditional method requires. Therefore, the original signal being
compressible is an important precondition and basis of compressive sensing
theory.
In reality, few signals are perfectly sparse, which means that we cannot directly
fit them into the CS framework. However, after some transformation, the
original signal may be approximately sparse in some domain. In theory,
any signal is compressible as long as we can find its corresponding sparse
domain. Formally, we describe the problem as follows:

a = Ψ† x,

where x is the original signal, a is the transformed signal and Ψ† is the
transformation.
Classical sparsifying transforms, also known as analytical sparsifying
transforms, such as wavelets and the DCT, have been widely used in compression
standards. Recently, redundant sparsifying dictionaries have become popular,
especially in image denoising, inpainting and image reconstruction. Formally,
we can describe the problem in the synthesis model as follows:

x = D a,

where D is the sparsifying dictionary.
2.3.2 Measurement operators
In the compressive sensing framework, we do not directly measure the original
signals (or transformed signals); instead we measure the signals after some
projection via a measurement operator Φ. The sensing problem can be formally
defined as

y = Φ x + n,

where y denotes the measured signal, Φ is the measurement operator, x is the
original signal and n denotes the additive noise.
The measurement operator has to guarantee that the measured signal must
contain all of the information of the original signal, otherwise we can never
accurately recover it from the measured values. Therefore, the choice of
measurement operators and sparsifying operators must obey the RIP
(Restricted Isometry Property) in the standard CS (i.e. using classical sparsifying
transforms). In addition, in the extended CS, which uses redundant dictionaries
for sparsifying, they must obey the D-RIP (Dictionary Restricted Isometry
Property).
As it has been proved that some matrices can be used as universal
measurement operators while guaranteeing stable signal recovery, the choice of
the measurement operator is usually not a main concern in the CS framework.
Now, with sparse representation involved in the sensing problem, we can further
define the sensing problem as:

y = Φ Ψ a + n.
2.3.3 Signal reconstruction algorithms
The most direct way of recovering the original signal x from the measured signal
y is solving the following optimization problem:

min_a ‖a‖₀, subject to y = Φ Ψ a.

However, the above problem is NP-hard, which means it cannot in general be
solved in polynomial time. Under some conditions, it is equivalent to an L1
minimization problem. The most common equivalent approach is to solve the
following convex problem:

min_a ‖a‖₁, subject to ‖y − Φ Ψ a‖₂ ≤ ε,

where ε is an upper bound on the L2 norm of the noise. The recovered signal is
defined as x̂ = Ψ â, where â is the solution to the above problem.
In later research, it was found that signals often exhibit better sparsity in
an over-complete dictionary, and recent works have therefore begun to address the
case of CS with redundant dictionaries. The equivalent problem of the
original problem, also known as the analysis-based problem, is then redefined as:

min_x ‖Ψ† x‖₁, subject to ‖y − Φ x‖₂ ≤ ε.
By solving the above problem, we can directly obtain the recovered signal
without needing an additional transformation step.
The following figure shows a general process of applying compressive sensing
to signal acquisition.
Figure 6 general process of using compressive sensing in signal acquisition
2.4 Compressive sensing in radio interferometry
The most standard image reconstruction algorithm in radio interferometry is
called CLEAN, which is a non-linear deconvolution method based on local
iterative beam removal. A multi-scale version of CLEAN, MS-CLEAN, has also
been developed, where the sparsity model is improved by multi-scale
decomposition, hence enabling better recovery of the signal. However, these
approaches are known to be slow, sometimes prohibitively so.
Recently, a lot of attention has been paid to compressive sensing and convex
optimization based imaging algorithms. Carrillo et al. (2012) proposed a novel
algorithm, sparsity averaging reweighted analysis (SARA), in the context of
Fourier imaging in radio astronomy.
They found that natural images often include several types of structures
admitting sparse representations in different frames. Therefore, instead of
promoting average sparsity over a single basis, promoting it over a
concatenation of several bases (in the paper, the Dirac basis and the first eight
orthonormal Daubechies wavelet bases, i.e. Db1-Db8) is a very powerful prior.
The SARA algorithm adopts a reweighted L1 minimization scheme, which
replaces the L0 norm by a weighted L1 norm. The reconstruction problem can be
formulated as:

min_x ‖W Ψ† x‖₁, subject to ‖y − Φ x‖₂ ≤ ε,

where W is a diagonal matrix with positive weights and Ψ is the dictionary
mentioned above.
To solve this optimization problem, SARA uses the Douglas-Rachford splitting
algorithm. The reconstruction algorithm is defined as follows:
Figure 7 SARA algorithm
Experimental results demonstrate that the sparsity averaging prior embedded in
the analysis reweighted L1 formulation of SARA outperforms state-of-the-art
priors, based on single frame or gradient sparsity, both in terms of SNR and
visual quality.
The Douglas-Rachford splitting algorithm used by SARA solves the problem by
iteratively minimizing the L1 norm and then projecting the result onto the
constraint set until some stopping criterion is achieved. This iterative algorithm
requires prior knowledge of the operator norm of Ψ to guarantee fast convergence,
which in some cases may be impossible to obtain. In addition, the Douglas-Rachford
algorithm does not offer a parallel structure, which makes it less suitable for the
upcoming telescopes.
2.5 Large-scale optimization
In order to efficiently solve the problem defined above, we can transform the
original problem and use existing efficient convex optimization methods to solve
the transformed problem.
2.5.1 Proximal splitting methods
Proximal splitting methods solve optimization problems of the form:

min_x ∑_{i=1}^{n} g_i(x),

where each g_i is a convex, lower semi-continuous function, not necessarily
differentiable.

The proximity operator of g is defined as:

prox_g(x) = argmin_z g(z) + (1/2)‖x − z‖₂².

With proximal splitting methods, the problem defined above can be transformed
into the form:

min_x ∑_{i=1}^{3} g_i(z_i), subject to z_i = L_i x, for i = 1, 2, 3.
Now based on the new formulation of the problem we are going to solve, one
obvious part we can use for parallelism is that the minimization of each term in
the equation can be done separately, which in theory provides an acceleration
factor of three.
2.5.2 Parallel structure of Φ and Ψ
The other two places we can exploit for parallelism are the measurement operator
Φ and the sparsifying operator Ψ.

Based on the original form of the sensing problem, y = Φ x + n, an efficient
parallel implementation can be achieved by splitting the data into blocks:

y = [y_1; …; y_nd], Φ = [Φ_1; …; Φ_nd],

so that each data block satisfies y_j = Φ_j x + n_j.

For the sparsity priors, the L1 norm is additively separable and a splitting of the
bases can be used:

Ψ = [Ψ_1, …, Ψ_nb].

The problem defined previously can be further redefined as:

min_x ∑_{i=1}^{nb} ‖W_i Ψ_i† x‖₁, subject to ‖y_j − Φ_j x‖₂ ≤ ε_j, for j = 1, …, nd.
2.5.3 Resulting algorithms
There are many existing algorithms that can be used to solve the problem
defined above, but here we mainly focus on three algorithms, SDMM
(simultaneous direction method of multipliers), PADMM (proximal alternating
direction method of multipliers) and the Primal-dual based method.
SDMM
Figure 8 sequential SDMM algorithm
Figure 8 shows the algorithm of the serial version of SDMM, which is also the
version implemented in the current version of SOPT package. After making use
of the three parallel structures mentioned previously, the parallel version of
SDMM is defined as follows (see Figure 9).
Figure 9 parallel SDMM algorithm
The parallel version of the SDMM algorithm has not been implemented in the SOPT
package.
Proximal ADMM
Figure 10 Proximal ADMM algorithm
Figure 10 shows the details of the Proximal ADMM algorithm. The Proximal
ADMM algorithm has not been fully implemented in the SOPT package.
PD-based method
Figure 11 the PD-based method
Figure 11 shows the details of the PD-based method. It is probably easier to
parallelise because it does not require the host node to perform heavy linear
transforms as SDMM does. However, this algorithm has not been
implemented in the SOPT package either.
Chapter 3
Current implementation
3.1 Current structure of SDMM
In the SOPT C++ package, only the serial version of SDMM is implemented. In
general, the implementation of SDMM provides a very general structure,
which makes extending the algorithm much easier.
In terms of data fields, the SDMM class mainly consists of two vectors,
“proximals_” and “transforms_”. Their types and functions are listed below.
Data field Function
vector<Proximal> proximals_ An array of the proximity operators
involved in the computation
vector<LinearTransform> transforms_ An array of the linear transforms
involved in the computation
Table 1 data fields in the SDMM class
The class of “LinearTransform” provides a unified interface for different
components acting in a similar way. For example, a transformation can be
defined directly as a function or through a matrix, since both will finally be
wrapped into functions and used by the SDMM algorithm. The “LinearTransform”
class provides many overloaded functions to deal with the different inputs.
In terms of functions, the SDMM class mainly contains two functions,
“update_directions” and “solve_for_xn”. Their inputs and outputs are listed
below.
Function signature Purpose
void update_directions(vector, vector,
vector)
Calculate an intermediate result for each
term, i.e. the values of r and s used by
the “solve_for_xn” function
Diagnostic solve_for_xn(vector,
vector, vector)
Calculate the reconstructed signal at
the current iteration (the conjugate
gradient algorithm is used to compute
the inversion of the matrix Q)
Table 2 two main functions in the SDMM class
The use of this implementation of SDMM in the CS framework is straightforward.
First, the user can define the measurement operator as follows:

auto const sampling = linear_transform<Scalar>(Sampling(parameter1, parameter2, parameter3));

Then, the sparsifying operator is defined in this way (in this example, we use a
SARA sparsity operator, which is a concatenation of several dictionaries):

SARA const sara(dictionary1, dictionary2, dictionary3);
auto const Psi = linear_transform<Scalar>(sara, image.rows(), image.cols());

After that, the above operators are passed into the SDMM class by the “append”
function:

auto const sdmm = SDMM<Scalar>()
    .append(proximal_g_i, L_i)
    .append(proximal_g_i, L_i)
    ……
The “append” function adds the proximity operator and the transformation
function to the “proximals_” and “transforms_” arrays respectively. In order to
improve the usability of this implementation of SDMM, the SDMM class contains
many overloaded “append” functions to deal with different cases. In addition,
users can append as many proximal-transform pairs as they want, which allows
this implementation to be extended to deal with many different problems.
3.2 Problems with current structure
The advantage of current implementation of SDMM is that it provides a very
general framework for users to use in the interferometric imaging. Users can
bind the measurement operator they choose into the framework or they can
choose to use either a single wavelet for sparisifying or a concatenation of
several wavelets (e.g. sparsity operator used in SARA). In addition, users can
also bind the reweighted algorithm they need to the SDMM model, so that the
model can be used to solve reweighed problems, which may be preferable in the
imaging area. In short, the generality makes the current implementation of
SDMM to be very user-friendly.
However, this advantage is also a disadvantage. Because the implementation
is so general, it is hard to apply optimization or parallelism to a particular term.
For example, in the parallel version of SDMM, steps 8-11, steps 13-16 and steps 19-21
represent the computation of different terms, each parallelised using a slightly
different strategy. Therefore, it is important for the implementation to
be able to apply user-defined optimization strategies to the computation of each
term.
Chapter 4
Optimisation and parallelisation methods
4.1 Platform
The Advanced Research Computing High End Resource (ARCHER) is a Cray
XC30 supercomputer equipped with parallel high-performance file systems as
well as pre- and post-processing capabilities. For this project, ARCHER is the
main platform we are going to use.
There are 4920 compute nodes in total on ARCHER. Each compute node is
equipped with two 12-core Intel Xeon E5-2697 v2 processors, with
hyper-threading available. The default clock rate of the processors
is 2.7 GHz. Of these nodes, 4544 are standard memory compute nodes,
which have 64 GB of memory shared between the two processors, and 376 are high
memory nodes, which have 128 GB. Because the memory is shared
between two processors, each processor forms one Non-Uniform Memory Access
(NUMA) region, so that access to local memory by cores within a NUMA
region has a lower latency than access to memory on the other NUMA region.
4.2 Tools
4.2.1 Profiling
CrayPat is a performance analysis tool offered by Cray for the XC platform.
Basically, CrayPat provides two categories of profiling:
instrumentation-based profiling and sample-based profiling. For the former,
timer calls are inserted at key points into the program under
investigation so that execution counts for routines and source
lines can be tracked. However, although the execution count of a routine is exact,
the measured execution time is not reliable because of the heavy overhead of the
inserted timer calls.
The general workflow for getting performance data using CrayPat is as follows:
1 Unload the darshan module if it is loaded.
2 Load the perftools-base and perftools modules.
3 Build your application; keep .o files.
4 Instrument the application using pat_build.
5 Run the instrumented executable to get a performance data (".xf") file.
6 Run pat_report on the data file to view the results.
With sample-based profiling, the program’s current instruction address is
read at regular intervals and mapped to source lines and/or functions in the
program. The advantage of this approach is that the measured execution time
of a source line or a function is more accurate than with
instrumentation-based profiling, because of the low overhead it adds to the
program. In addition, since it does not modify the code itself, it can also
attribute time to generated assembly code.
CrayPat's Automatic Program Analysis (APA) feature provides an easy way to
combine the two approaches. Using this feature, one generates an
instrumented executable for a sampling experiment. When the binary is
executed, it produces an ASCII text file containing CrayPat's suggested
pat_build tracing options, which can be used to re-instrument the executable
for detailed tracing experiments.
The general workflow for using APA is as follows:
1 Generate the executable for sampling, using the special '-O apa' flag.
2 Run the executable on compute nodes via aprun to generate an .xf file.
3 Run pat_report on the data file.
(This generates the ".ap2" and ".apa" files. The latter contains suggested
pat_build options for building an executable for tracing experiments.)
4 Examine the ".apa" file and, if necessary, customise it for your needs.
5 Rebuild the executable using pat_build's -O option with the .apa file name
as the argument.
6 Run the new executable for a tracing experiment.
7 Run pat_report on the newly created .xf file. This is the tracing result.
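As a concrete illustration, the steps above might look as follows on the ARCHER command line. The executable name purify_sdmm is hypothetical; the module and tool names follow the ARCHER documentation:

```shell
# Basic setup: unload darshan if loaded, then load the CrayPat modules.
module unload darshan
module load perftools-base perftools

make                                  # build the application, keeping .o files
pat_build -O apa purify_sdmm          # 1. instrument for a sampling experiment
aprun -n 24 ./purify_sdmm+pat         # 2. run on compute nodes -> .xf file
pat_report purify_sdmm+pat+*.xf       # 3. produces the .ap2 and .apa files
# 4. examine/edit the .apa file if needed, then:
pat_build -O purify_sdmm+pat+*.apa    # 5. re-instrument for tracing
aprun -n 24 ./purify_sdmm+apa         # 6. run the tracing experiment
pat_report purify_sdmm+apa+*.xf       # 7. view the tracing result
```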
For this project, sample-based profiling can already satisfy our needs as we
focus on the general structure of the algorithm, not the detailed implementation.
However, for further investigation, instrumentation-based profiling can be of
great use.
4.2.2 Data visualization
Cray Apprentice2 displays data that was captured by CrayPat. This visualization
tool displays a variety of different data panels, depending on the type of
performance experiment that was conducted. Its aim is to help identify
conditions such as load imbalance, excessive serialisation, excessive
communication and network contention.
Cray Apprentice2 provides call-graph-based profile information and
timeline-based trace visualisation, both with source-code mapping. It is
capable of running either on the Cray system service nodes or on a remote
Linux server or workstation. Examples of Cray Apprentice2 displays
are shown below.
Figure 12 Cray Apprentice2 interface
The general workflow for using Cray Apprentice2 is as follows:
1. Instrument the application using pat_build.
2. Run the instrumented executable to get a performance data (".xf") file.
3. Run pat_report on the data file to get the ".ap2" file.
4. Run app2 on the ".ap2" file to get graphical reports.
4.3 Parallel languages
4.3.1 Heterogeneous computing
CPU is short for central processing unit, the main functional unit of a
computer. Until approximately ten years ago, the speed of CPUs was improved
mainly by increasing their clock frequency. However, higher clock frequencies
came with more heat generated by the CPU, and eventually the excessive heat
prevented clock frequencies from rising further. Today most CPUs run at 2 to
3 GHz, and have done so for over ten years.
The solution adopted by most CPU manufacturers to keep increasing
performance is to add more CPU cores, which is known as the multi-core
CPU. Good performance can be gained by explicitly making use of these
cores, and this is where parallel computing first entered the technology scene
in a big way. However, software has to be rewritten in a parallel style in order
to actually make use of multiple CPU cores.
Figure 13 Intel CPU trends
The figure above shows the trends in Intel's CPUs over the years. Until about
2005, the clock speed of Intel's CPUs kept increasing. After that, the
improvement in clock speed slowed down and the power consumption levelled
off as well. This suggests that the multi-core CPU improves performance not
only by using more than one core simultaneously but also by improving the
energy efficiency of the CPU.
An alternative to the multi-core CPU for increasing performance is to use
other kinds of processors to supplement the CPU in computational tasks.
Coordinating processors of different architecture types to perform a
computation is known as “heterogeneous computing”.
In the area of heterogeneous computing, the two most common accelerators
are GPUs and the Xeon Phi. In practice, GPUs are more widely used than the
Xeon Phi, partly because most desktop computers already contain a GPU.
Over several years of development, GPUs have advanced from merely
processing pixels to having formidable capabilities for general mathematical
computation. We can therefore gain substantial performance by offloading
heavy mathematical computations from the CPU to the GPU.
However, what makes heterogeneous computing hard to adopt widely is that
most software has to be rewritten in order to use GPUs as accelerators. For
example, CUDA was the first widely adopted platform enabling developers to
write high-performance general-purpose GPU programs; when developing
CUDA programs, developers need to write GPU kernels, manage the different
levels of GPU memory and make trade-offs around data transfers between the
host and the device.
The Xeon Phi is an accelerator released by Intel. It combines many simplified
x86 cores into a new many-core architecture. One benefit of using the Xeon
Phi as an accelerator is that any code compatible with the x86 architecture is
also compatible with the Xeon Phi, which makes developing accelerator-based
high-performance programs much more convenient. However, the
performance of the Xeon Phi is widely reported to fall short of that of GPUs,
meaning that more effort must be put into performance tuning when using the
Xeon Phi as an accelerator.
Although this project does not aim to port the current implementation to
accelerators, it is worthwhile to implement it in a language that works well
with accelerators, as heterogeneous computing is a likely trend for the future.
Therefore, in the next section, we compare three languages that are currently
popular and work well with accelerators.
4.3.2 OpenCL, OpenACC and OpenMP
Different parallel languages can be compared in several ways, for example by
performance. However, the performance of a piece of code is determined by
many factors, including the hardware the code is running on as well as the
code itself. As mentioned above, the two currently popular accelerators, the
GPU and the Xeon Phi, are built on entirely different architectures, and each
has its own advantages over the other: the clock frequency of a single Xeon
Phi core is much higher than that of a GPU core, while a GPU has far more
cores than a Xeon Phi. In this project, we choose the parallel language based
on portability, as we hope our code can run on both the GPU and the Xeon
Phi.
OpenCL is a framework for writing programs that execute across heterogeneous
platforms. However, although NVIDIA and Intel still have some products
supporting OpenCL, the current level of support does not give much cause for
optimism. For example, NVIDIA only released a driver supporting OpenCL 1.2
in 2015, by which time the latest version of the OpenCL specification was 2.2.
In addition, among Intel's products only the integrated GPU on the chip
supports OpenCL, which is of limited appeal to HPC users.
OpenACC is a programming standard for parallel computing developed by Cray,
CAPS, NVIDIA and PGI, designed to simplify parallel programming of
heterogeneous CPU/GPU systems. As many OpenACC members have also
worked in the OpenMP standard group on merging accelerator support into
the OpenMP specification (OpenMP 4.0), OpenACC and OpenMP are likely to
grow more and more similar. Currently, therefore, OpenMP and OpenACC are
probably the best candidates for programming accelerators.
For this project, we choose OpenMP as the parallel language because it is the
most commonly used in multi-core CPU environments. Although the compilers
available to us do not yet support the latest OpenMP specification
(OpenMP 4.0), which means porting the code to the GPU using OpenMP is
currently impossible, this will probably become achievable in the near future.
Chapter 5
Implementation of the solution
5.1 Initial profiling
Figure 14 initial profiling of reweighted SDMM
The figure above displays the result of the initial profiling of the code and its call
tree. It is obvious that the two functions, “solve_for_xn” and “update_directions”,
take up most of the running time of the code.
Figure 15 initial profiling for “solve_for_xn” function
In the function “solve_for_xn”, the conjugate gradient algorithm takes up
most of the function's time (see Figure 15). The conjugate gradient algorithm
is used in step 5 of the serial version of SDMM in order to apply the inverse
of the matrix Q.
Figure 16 initial profiling for “update_directions” function
In the function “update_directions”, the soft thresholding algorithm takes up
most of the time consumed (see Figure 16). Soft thresholding is used here to
compute the proximity operator of the L1 norm.
Therefore, in this chapter, we are going to focus on the improvement of the
“solve_for_xn” and “update_directions” functions.
5.2 Profiling “ConjugateGradient”
Figure 17 profiling for “ConjugateGradient” class
The figure above shows the result of profiling the conjugate gradient
algorithm. Most of the time consumed by the conjugate gradient algorithm is
spent in the general matrix-vector product. As the “Eigen” library already
provides an efficient general-purpose matrix-vector product, there is little
room for us to make further improvement here.
5.3 Three-level parallelism in “update_directions”
As mentioned in the background section, the SDMM structure offers three
main degrees of parallelisation that can be exploited. Firstly, the proximity
operators can be computed in parallel. Secondly, the sparsity priors are
separable. Thirdly, the data vector and the measurement operator can be
partitioned into several blocks so that each compute node can work on its
own piece of the data simultaneously.
5.3.1 Parallel proximity operator
Calculating the proximity operator of each term in parallel is easily achieved
using OpenMP. As our experiment has only three terms, only three threads
are needed to run the code.
#pragma omp parallel for
for (t_uint i = 0; i < transforms().size(); ++i) {
  z[i] += transforms(i) * x;
  y[i] = proximals(i, z[i]);
  z[i] -= y[i];
}
By doing this, in theory, we could get a speed-up factor of three. However, as
the computational burden of the different terms can differ widely, the actual
performance gain may be much smaller.
Term                                                              Time (s)
.append(sopt::proximal::l1_norm<Scalar>, psi.adjoint(), psi)
  (L1-norm prior)                                                 15.63563
.append(sopt::proximal::translate(sopt::proximal::L2Ball<Scalar>(epsilon), -y), sampling)
  (L2-ball data constraint)                                        0.35529
.append(sopt::proximal::positive_quadrant<Scalar>)
  (positive quadrant)                                              0.41683
Time consumption after parallelisation                            16.59599
Table 3 time consumption of each original term
Table 3 shows the time consumed by the three different terms in our problem.
It suggests that even if the three terms were perfectly computed in parallel,
the performance gain would still be less than 5%, since the first term
dominates. In reality, overheads such as thread fork/join also affect the final
performance: after parallelisation, the time consumption is actually higher
than before (16.59599 seconds).
5.3.2 Separable sparsity priors
In theory, the sparsity prior can be split into any number of blocks, but
splitting it according to the bases used is a very natural choice. In our
experiment, the sparsifying operator consists of three bases, one “DB3“ and
two “DB1” at different levels, so we split it into three blocks.
auto const psi0 = sopt::linear_transform<Scalar>(sara[0], image.rows(), image.cols());
auto const psi1 = sopt::linear_transform<Scalar>(sara[1], image.rows(), image.cols());
auto const psi2 = sopt::linear_transform<Scalar>(sara[2], image.rows(), image.cols());
auto const sdmm
= sopt::algorithm::SDMM<Scalar>()
.append(sopt::proximal::l1_norm<Scalar>, psi0.adjoint(), psi0)
.append(sopt::proximal::l1_norm<Scalar>, psi1.adjoint(), psi1)
.append(sopt::proximal::l1_norm<Scalar>, psi2.adjoint(), psi2)
……
Here we manually split the sparsity prior “psi” into three blocks (“psi0”,
“psi1” and “psi2”) and append them to the SDMM. Combined with the
previous modification to the proximity operator, we now have five terms
calculated in parallel.
Term                                                              Time (s)
.append(sopt::proximal::l1_norm<Scalar>, psi0.adjoint(), psi0)     5.21192
.append(sopt::proximal::l1_norm<Scalar>, psi1.adjoint(), psi1)     6.98543
.append(sopt::proximal::l1_norm<Scalar>, psi2.adjoint(), psi2)     5.18503
.append(sopt::proximal::translate(sopt::proximal::L2Ball<Scalar>(epsilon), -y), sampling)
                                                                   0.31529
.append(sopt::proximal::positive_quadrant<Scalar>)                 0.41297
Time consumption after parallelisation                             9.83453
Table 4 time consumption of each term
Table 4 shows the time consumed by each of the five terms and the actual
time consumption after parallelisation. Although the total time is reduced, the
growing parallel overheads push the measured time further away from the
ideal (the 6.98543 seconds of the slowest term).
Although splitting the sparsity prior according to the bases it uses is natural
and straightforward, it is not yet known whether the result is optimal. For
example, in the table above, the second term split from the original prior
costs noticeably more time than the other two, so for load balance this may
not be the best split. However, a different splitting strategy may prevent the
use of fast algorithms for computing the operator.
5.3.3 Splitting of the data
As introduced in the previous section, the measured vector can easily be
divided into several blocks:
y = [ y_1^T, y_2^T, ..., y_d^T ]^T
This form of parallelisation, also known as data parallelism, focuses on
distributing the data across different parallel computing nodes. Its advantage
is that it usually provides good scalability for the program.
In our context, we also have to split the measurement operator into the same
number of blocks as the measured vector:
Φ = [ Φ_1^T, Φ_2^T, ..., Φ_d^T ]^T
In order to achieve that, we design supporting matrices M_i that select the
part of the discrete Fourier plane needed by each block of the measurement
operator:
Φ = [ (G_1 M_1)^T, (G_2 M_2)^T, ..., (G_d M_d)^T ]^T F Z
where F is the Fourier transform, Z the zero-padding operator and G_i the
gridding matrix for block i.
Therefore, with this distributed optimisation approach, each piece of the
measured vector and the measurement operator can be kept local to a
compute node, distributing both the memory requirement and the processing
load. However, because of time pressure and the difficulty of implementing
this distributed approach, we only present the idea here.
Chapter 6
Summary and conclusions
This project first introduced a currently popular technique, compressive
sensing. It is a promising approach to signal acquisition because it reduces
the number of samples that must be taken during sensing, samples that would
otherwise waste resources. In the “big data” era, such faster techniques are
increasingly necessary.
Compressive sensing has also been applied to radio interferometry, in the
hope that it can help to deal with the large amount of measurement data. In
this project we introduced four algorithms used in this area: SARA, SDMM,
PADMM and the PD-based method. The latter three are attractive because
they offer a parallel structure, which is preferable for dealing with large
amounts of data. However, currently only SARA and SDMM are implemented
in the SOPT package, while PADMM and the PD-based method are not, or not
fully, implemented. In addition, the current SDMM implementation makes no
use of its parallel structure. In this project, we attempted to implement a
parallel SDMM algorithm.
The paper introducing SDMM mentions three degrees of parallelism in the
algorithm. However, based on our experiments, it may be difficult to balance
the load if the splitting of the data is not handled well. Especially since SOPT
is a public library, it is expected to provide sufficiently general functionality.
How to solve the load-balancing problem, and how to implement the solution
in a user-friendly way, are questions we need to consider in the future.
6.1 Future work
A wide range of further work could be attempted on this project. Here are
some options considered worthwhile:
SDMM might be much slower than the latest PD-based algorithm for
interferometric imaging, as the PD-based method offers a full splitting
structure, while SDMM still needs the host node to do heavy linear
computation. At the time of this project the PD-based algorithm had not
been implemented, so we had to use SDMM instead. In the future, with
the new algorithm available, the performance may be more attractive.
This project only works with multi-core CPUs. In the future, with a full
splitting algorithm available, we could study the performance of the
algorithm on different accelerators, or its scalability.
Currently some unexpected costs affect the final performance of the
parallel implementation, which can become critical when scaling to large
systems. Studying how these costs, such as thread fork/join or thread
contention, affect the final result, and what can be done to reduce them,
would be an interesting direction.
Apart from OpenMP, further work could implement the algorithm in other
parallel languages. Although, as mentioned in the previous section,
OpenMP may be the most portable of OpenACC, OpenCL and OpenMP, a
more specific parallel language might deliver better performance, for
example CUDA for GPUs, although this usually comes at the cost of
substantial program rewriting.
For real big-data applications, we may have to make use of heterogeneous
systems to obtain optimal performance. How to port the current
implementation to such systems is therefore also worth investigating; for
example, we may need to consider how to combine MPI with OpenMP,
CUDA or another parallel language to scale the implementation to large
systems.
Appendix
Commands for compiling the code on ARCHER:
module load cmake/3.5.2
module swap PrgEnv-cray PrgEnv-gnu
module load fftw
module load git
module load anaconda
export CRAYPE_LINK_TYPE=dynamic
CC=cc CXX=CC FC=ftn cmake -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH='/opt/cray/fftw/3.3.4.5/sandybridge;WHERE EIGEN IS' \
  WHERE SOPT IS
make
* Replace “WHERE EIGEN IS” and “WHERE SOPT IS” with the path to the eigen
folder and the path to the sopt folder.