
Exploration of the performance of SDMM algorithm

Sheng Wang

August 19, 2016

MSc in High Performance Computing with Data Science

The University of Edinburgh

Year of Presentation: 2016

Abstract

Compressive sensing has recently become a popular and attractive technique in the area of signal processing. Its application to radio astronomy has stimulated new work in radio interferometric imaging. Current attention in this area is on how an array of radio telescopes, rather than a single dish, can still deliver high-quality measurements. Many results have been published in recent years, and some even outperform CLEAN, the state-of-the-art imaging algorithm widely used in this area. Examples include SDMM, PADMM and PD-based algorithms. In this project, we explore the performance of the SDMM algorithm, hoping to exploit its potential parallelism. We examine each part of the algorithm carefully and present results and analysis at the end of the dissertation.


Contents

Chapter 1 Introduction ............................................................................................... 1

1.1 Difficulties and Diversions from the original plan ............................................. 2

1.2 Structure of the dissertation .............................................................................. 2

Chapter 2 Background theory ..................................................................................... 3

2.1 Radio astronomy .............................................................................................. 3

2.2 Radio interferometry ........................................................................................ 4

2.3 Compressive sensing ........................................................................................ 6

2.3.1 Sparse representation ................................................................................. 8

2.3.2 Measurement operators .............................................................................. 8

2.3.3 Signal reconstruction algorithms ................................................................ 9

2.4 Compressive sensing in radio interferometry ................................................... 10

2.5 Large-scale optimization ................................................................................ 12

2.5.1 Proximal splitting methods ...................................................................... 12

2.5.2 Parallel structure of Φ and Ψ ........................................................ 13

2.5.3 Resulting algorithms ................................................................................ 13

Chapter 3 Current implementation ............................................................................ 19

3.1 Current structure of SDMM ............................................................................ 19

3.2 Problems with current structure ...................................................................... 21

Chapter 4 Optimisation and parallelisation methods .................................................. 22

4.1 Platform ......................................................................................................... 22

4.2 Tools ............................................................................................................. 22

4.2.1 Profiling .................................................................................................. 22


4.2.2 Data visualization .................................................................................... 24

4.3 Parallel languages .......................................................................................... 25

4.3.1 Heterogeneous computing........................................................................ 25

4.3.2 OpenCL, OpenACC and OpenMP ........................................................... 27

Chapter 5 Implementation of the solution ................................................................. 28

5.1 Initial profiling ............................................................................................... 28

5.2 Profiling “ConjugateGradient” ........................................................................ 30

5.3 Three-level parallelism in “update_directions” ................................................ 30

5.3.1 Parallel proximity operator....................................................................... 31

5.3.2 Separable sparsity priors .......................................................................... 32

5.3.3 Splitting of the data.................................................................................. 33

Chapter 6 Summary and conclusions ........................................................................ 35

6.1 Future work ................................................................................................... 35

References ............................................................................................................... 37

Appendix................................................................................................................. 41


List of Tables

Table 1 data fields in the SDMM class ..................................................................... 19

Table 2 two main functions in the SDMM class ........................................................ 20

Table 3 time consumption of each original term........................................................ 31

Table 4 time consumption of each term .................................................................... 33


List of Figures

Figure 1 Arecibo telescope ......................................................................................... 4

Figure 2 Demonstration of radio interferometry .......................................................... 5

Figure 3 the Square Kilometre Array .......................................................................... 6

Figure 4 Demonstration of compressive sensing ......................................................... 7

Figure 5 single-pixel camera ...................................................................................... 7

Figure 6 general process of using compressive sensing in signal acquisition .............. 10

Figure 7 SARA algorithm ........................................................................................ 11

Figure 8 sequential SDMM algorithm ...................................................................... 14

Figure 9 parallel SDMM algorithm .......................................................................... 15

Figure 10 Proximal ADMM algorithm ..................................................................... 16

Figure 11 the PD-based method ............................................................................... 17

Figure 12 Cray Apprentice2 interface ....................................................................... 24

Figure 13 Intel CPU trends .................................................................................... 25

Figure 14 initial profiling of reweighted SDMM ....................................................... 28

Figure 15 initial profiling for “solve_for_xn” function ............................... 29

Figure 16 initial profiling for “update_directions” function ....................................... 29

Figure 17 profiling for “ConjugateGradient” class .................................................... 30


Acknowledgements

I am very grateful to Mr. Adrian Jackson for his constant support and guidance

throughout this project. I would further like to thank my friends and family for their

ongoing support during this dissertation. Finally, thank you to the EPCC for

allowing the time and resources to work on this project.


Chapter 1

Introduction

Radio telescopes have been widely used by astronomers to explore the universe

by detecting radio waves emitted by a wide range of objects. Unlike optical

telescopes, radio telescopes are working with signals at a longer wavelength,

which makes the sensing more stable in cloudy skies. Radio telescopes can be

used individually or they can be linked together to create a telescope array, also

known as interferometer. The interferometer helps to exceed the limit of the

angular resolution that a single radio telescope can get. Therefore, it is an area

appealing many scientists to join in.

The Square Kilometre Array (SKA) is a large multi-telescope project planned for construction over the coming years. The SKA will have a total collecting area of about one square kilometre, which will make it the most sensitive radio instrument of its kind. The arrival of such a large radio interferometer creates the need for high-performance central computing engines to process the data it generates.

Compressive sensing is an attractive technique in the area of signal processing. It can efficiently reduce the number of samples needed in signal acquisition compared to classical Shannon-Nyquist based methods. Compressive sensing has recently been applied to radio interferometry, and many imaging algorithms have been produced within this framework. Among them, SDMM, Proximal ADMM and PD-based methods are the ones we focus on. Although a parallel version of each algorithm has already been proposed, only the serial SDMM algorithm is implemented in the PURIFY package. In this project, we explore the potential parallelism of the SDMM algorithm and try to understand the limits of its performance.


1.1 Difficulties and Diversions from the original plan

During the course of the project, we ran into several obstacles. Initially, we planned to port the SDMM (or PADMM, or PD-based) methods to accelerators in order to explore the performance of the implementation. However, we found that the implementations of SDMM, PADMM and the PD-based methods are only partly done. Only a sequential implementation of SDMM is provided, and the other two algorithms are not fully implemented. Therefore, we changed the original plan of porting the existing code to accelerators, and instead explore the performance of the current serial version of SDMM.

The biggest difficulty we met in this project is the complexity of the background theory, which affects the understanding of the implementation. Several unfamiliar concepts came up during the project, such as compressive sensing, radio interferometry and several convex optimization methods, and understanding them took up most of our time.

In addition, the SOPT package requires many third-party libraries, which makes the build system complex. Updates to some of the required libraries can sometimes break the build of the project.

1.2 Structure of the dissertation

In this dissertation, we present the work carried out in the following manner.

To begin with, in Chapter 2, we introduce the background theory of our project,

from radio astronomy to the detailed algorithm. We try to define the problem we

are going to solve formally using equations.

Then, in Chapter 3, we look into the current implementation of the SDMM

algorithm and explain what the current implementation consists of.

In Chapter 4, we introduce the platform on which our experiments run, as well as the tools and parallel languages we use.

In Chapter 5, we implement some of the ideas and examine their actual effect in normal usage. The reasons for the observed performance differences are also discussed in this chapter.

Finally, in Chapter 6, we summarise our work and present ideas for future work.


Chapter 2

Background theory

2.1 Radio astronomy

Since the first radio signals from space were detected by Karl Jansky in the

1930s, radio telescopes have been widely used by astronomers to explore the

universe by detecting radio waves emitted by a wide range of objects, including

our Sun and even some stars that are millions of light years away from Earth.

Although optical telescopes are widely used by astronomers, one of their disadvantages is that they can be hampered by cloud or poor weather conditions on Earth. Radio telescopes, in contrast, work with signals at longer wavelengths and can be used even under cloudy skies. This advantage makes radio telescopes a useful alternative to optical telescopes.

In order to obtain the same level of detail and resolution as optical telescopes, radio telescopes need a much larger collecting area. Currently the largest single-dish radio telescope in the world is the Arecibo telescope (see Figure 1), which is located in a natural hollow in Puerto Rico, in the Caribbean. However, even with the largest radio dish in the world, the resolution achieved is still far inferior to that of optical telescopes.

Simply increasing the size of a single radio telescope to improve the resolution of

radio telescopes can lead to other problems. For example, if the size of the

telescope is too large, the ability to point at certain regions may be limited. Radio

astronomers have been able to utilise a technique known as interferometry in

order to get round this limitation in size.


Figure 1 Arecibo telescope

2.2 Radio interferometry

Radio telescopes can be used individually or they can be linked together to

create a telescope array, also known as an interferometer. Scientists found

that the effect of more than one radio telescope acting together is the same as

that of a single vast telescope. The resolution of an interferometer depends not

on the diameter of individual radio telescopes, but on the maximum separation

between them.
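As a rough guide, the standard diffraction-limit estimate gives the achievable angular resolution as

$\theta \approx \frac{\lambda}{B}$,

where $\lambda$ is the observing wavelength and $B$ is the diameter of a single dish or, for an interferometer, the maximum baseline between telescopes. Because radio wavelengths are so much longer than optical ones, a very large $B$ is needed to reach a comparable $\theta$.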

Moving them further apart increases the angular resolution, i.e. the ability of the telescope to resolve smaller objects in the sky. In an interferometer, the signals from all of the telescopes are brought together and processed by a correlator, which combines them to effectively simulate a single much larger telescope. This process is demonstrated in Figure 2.


Figure 2 Demonstration of radio interferometry

Recently, a multi-telescope project called the Square Kilometre Array (see Figure 3) has been launched. It involves many countries, including the UK, Canada, India and China, and aims to build several arrays of telescopes with a total collecting area of about one square kilometre. When the SKA is completed, it will hopefully surpass the resolution of optical instruments such as the Hubble Space Telescope, one of the largest and most versatile optical telescopes. However, many telescopes mean a lot of data. Efficient algorithms for converting the data collected from separate telescope stations into high-resolution images are important and necessary in the SKA era.


Figure 3 the Square Kilometre Array

2.3 Compressive sensing

Traditional approaches to sampling signals or images are based on Shannon’s

celebrated theorem: the sampling rate must be at least twice the maximum

frequency present in the signal (the so-called Nyquist rate). Sampled signals or

images completely keep the information of the original signals so that we can

recover them exactly later on. In fact, this principle underlies nearly all signal

acquisition protocols used in consumer audio and visual electronics, medical

imaging devices, radio receivers, and so on.

Because of this success, the amount of data generated by sensing systems has

grown from a trickle to a torrent. However, in many emerging and important

areas, the resulting Nyquist rate is still so high that we end up with far too many

samples. Sometimes, it may be too costly or even physically impossible to build

devices that are capable of acquiring samples at the necessary rate. Therefore,

in some application areas such as medical imaging, remote surveillance,

spectroscopy and radio interferometry, traditional sensing systems based on the

Nyquist theorem cannot satisfy the need anymore.


Figure 4 Demonstration of compressive sensing

In order to deal with such high-dimensional data, we usually depend on

compression, which aims at finding an appropriate representation of a signal that

has lower dimension and contains as much original information as possible. For

example, in Figure 4, by using a measurement operator to sense the original signal

x, we get the measured vector y. The measured vector not only contains all the

information necessary for reconstruction of the original signal, but also is much

smaller. Compressive sensing theory asserts that one can recover certain

signals and images from far fewer samples or measurements than traditional

methods use. The key concept of compressive sensing is that we can reduce the

cost of the measurement of certain signals if the signals have certain

characteristics (i.e. the signals are sparse in a known basis). One typical

application of compressive sensing is the single-pixel camera (see Figure 5).

Instead of having many sensing resources (i.e. photon detector), we are now

able to use only one photon detector and sample (and simultaneously

compress) the signal (image) to its “information rate” using non-adaptive, linear

measurement.

Figure 5 single-pixel camera


Compressive sensing theory mainly consists of three parts: sparse

representation, measurement operators and signal reconstruction algorithms.

2.3.1 Sparse representation

The sparsity of the signal can be simply understood as the number of non-zero

elements in the signal. The smaller the number is, the sparser the signal is. CS

theory is based on the principle that the sparsity of a signal can be exploited to

recover the original signal through optimization from far fewer samples than

required by traditional methods. Therefore, the original signal being compressible is one important precondition and basis of compressive sensing

theory.

In reality, few signals are perfectly sparse, which means that we cannot directly

fit them into the CS framework. However, after some transformation of the

original signal, they may be approximately sparse in some domain. In theory,

any signal is compressible as long as we can find its corresponding sparse domain. Formally, we describe the problem as follows:

$x = \Psi a$,

where $x$ is the original signal, $a$ is the transformed (sparse) signal and $\Psi$ is the sparsifying transformation.

Classical sparsifying transforms, also known as analytical sparsifying transforms, such as wavelets and the DCT, have been widely used in compression standards. Recently, redundant sparsifying dictionaries have become popular, especially in image denoising, inpainting and image reconstruction. Formally, we can describe the problem in the synthesis model as follows:

$x = D a$,

where $D$ is the sparsifying dictionary.

2.3.2 Measurement operators

In the compressive sensing framework, we do not directly measure the original signals (or transformed signals); instead we measure the signals after some projection via a measurement operator $\Phi$. The sensing problem can be formally defined as

$y = \Phi x + n$,

where $y$ denotes the measured signal, $\Phi$ is the measurement operator, $x$ is the original signal and $n$ denotes the additive noise.


The measurement operator has to guarantee that the measured signal must

contain all of the information of the original signal, otherwise we can never

accurately recover it from the measured values. Therefore, the choice of

measurement operators and sparsifying operators must obey the RIP

(Restricted Isometry Property) in the standard CS (i.e. using classical sparsifying

transforms). In addition, in the extended CS, which uses redundant dictionaries

for sparsifying, they must obey the D-RIP (Dictionary Restricted Isometry Property).
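In its standard form, the RIP requires that, for every s-sparse coefficient vector $a$ and some small constant $\delta_s \in (0, 1)$,

$(1 - \delta_s)\,\|a\|_2^2 \;\leq\; \|\Phi \Psi a\|_2^2 \;\leq\; (1 + \delta_s)\,\|a\|_2^2$,

i.e. the combined operator approximately preserves the energy of all sufficiently sparse signals; the D-RIP imposes the analogous condition on signals that are sparse in the redundant dictionary.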

As it has been proved that some matrices can be used as universal measurement operators and guarantee stable signal recovery, the choice of the measurement operator is usually not a main concern in the CS framework.

Now, with the sparse representation involved in the sensing problem, we can further define it as:

$y = \Phi \Psi a + n$

2.3.3 Signal reconstruction algorithms

The most direct way of recovering the original signal x from the measured signal

y is solving the following optimization problem:

‖ ‖ , subject to

However, solving the above equation is a NP problem, which means it is hard to

get the answer in non-polynomial time. This problem can be equivalent to L1

minimization problem under some constraints. The most common equivalent

approach is to solve the following convex problem:

‖ ‖ , subject to ‖ ‖ ,

where is an upper bound on the L2 norm of the noise. The recovered signal is

defined as , where a is the solution to the above problem.

In later research it was found that signals often exhibit better sparsity in an over-complete dictionary, so recent works have begun to address the case of CS with redundant dictionaries. The equivalent of the original problem, also known as the analysis-based problem, is then redefined as:

$\min_{x} \|\Psi^\dagger x\|_1 \quad \text{subject to} \quad \|y - \Phi x\|_2 \leq \epsilon$


By solving the above problem, we can directly get the recovered signal instead

of doing one more transformation step.

The following figure shows a general process of applying compressive sensing

to signal acquisition.

Figure 6 general process of using compressive sensing in signal acquisition

2.4 Compressive sensing in radio interferometry

The most standard image reconstruction algorithm in radio interferometry is

called CLEAN, which is a non-linear deconvolution method based on local

iterative beam removal. A multi-scale version of CLEAN, MS-CLEAN, has also

been developed, where the sparsity model is improved by multi-scale

decomposition, hence enabling better recovery of the signal. However, these

approaches are known to be slow, sometimes prohibitively so.

Recently, a lot of attention has been put into compressive sensing and convex

optimization based imaging algorithms. Carrillo et al. (2012) proposed a novel

sparsity analysis (SARA) in the context of Fourier imaging in radio astronomy.


They found that natural images often include several types of structures

admitting sparse representations in different frames. Therefore, instead of

promoting average sparsity over a single basis, promoting it over a

concatenation of several bases (In the paper, they are Dirac and the first eight

orthonormal Daubechies wavelet bases, i.e. Db1-Db8) is a very powerful prior.

The SARA algorithm adopts a reweighted L1 minimization scheme, which

replaces the L0 norm by a weighted L1 norm. The reconstruction problem can be

formulated as:

$\min_{x} \|W \Psi^\dagger x\|_1 \quad \text{subject to} \quad \|y - \Phi x\|_2 \leq \epsilon$,

where $W$ is a diagonal matrix with positive weights and $\Psi$ is the dictionary mentioned above.
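A common way to choose the weights, in the spirit of the iterative reweighting of Candès, Wakin and Boyd (the exact update used by SARA may differ in detail), is

$W_{ii}^{(t+1)} = \frac{\delta}{\delta + \left|\big(\Psi^\dagger x^{(t)}\big)_i\right|}$,

where $x^{(t)}$ is the current estimate and $\delta > 0$ is a small stabilising constant, so that coefficients that are already large are penalised less in the next iteration.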

To solve this optimization problem, SARA uses the Douglas-Rachford splitting

algorithm. The reconstruction algorithm is defined as follows:

Figure 7 SARA algorithm

Experimental results demonstrate that the sparsity averaging prior embedded in

the analysis reweighted L1 formulation of SARA outperforms state-of-the-art

priors, based on single frame or gradient sparsity, both in terms of SNR and

visual quality.


The Douglas-Rachford splitting algorithm used by SARA solves the problem by iteratively minimizing the L1 norm and then projecting the result onto the constraint set until some stopping criterion is met. This iterative algorithm requires prior knowledge of the operator norm of $\Phi$ to guarantee fast convergence, which in some cases may be impossible to obtain. In addition, the Douglas-Rachford algorithm does not offer a parallel structure, which makes it less suitable for the upcoming telescopes.

2.5 Large-scale optimization

In order to efficiently solve the problem defined above, we can do some

transformation on the original problem and use some existing efficient convex

optimization methods to solve the transformed problem.

2.5.1 Proximal splitting methods

Proximal splitting methods solve optimization problems of this form:

$\min_{x} \; \sum_{i=1}^{n} g_i(L_i x)$,

where each $g_i$ is a convex lower semi-continuous function, not necessarily differentiable, and each $L_i$ is a linear operator.

The proximity operator of a function $g$ is defined as:

$\operatorname{prox}_{g}(x) = \arg\min_{z} \; g(z) + \frac{1}{2}\|x - z\|_2^2$
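For example, for the scaled L1 norm $g(z) = \lambda\|z\|_1$ that appears in our problem, the proximity operator reduces to the component-wise soft-thresholding operation

$\operatorname{prox}_{\lambda\|\cdot\|_1}(x)_i = \operatorname{sign}(x_i)\,\max(|x_i| - \lambda,\, 0)$,

which is exactly the soft-thresholding step that later turns out to dominate the cost of the “update_directions” function (Section 5.1).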

With proximal splitting methods, the problem defined above can be transformed into this form:

$\min_{x,\, z_1, \ldots, z_n} \; \sum_{i=1}^{n} g_i(z_i) \quad \text{subject to} \quad z_i = L_i x, \quad \text{for } i = 1, \ldots, n.$

Now, based on this new formulation of the problem we are going to solve, one obvious source of parallelism is that the minimization (proximity operator) of each term in the equation can be computed separately, which for the three terms of our problem in theory provides an acceleration factor of three.


2.5.2 Parallel structure of Φ and Ψ

The other two places where parallelism can be exploited are the measurement operator $\Phi$ and the sparsifying operator $\Psi$.

Based on the original form of the sensing problem, $y = \Phi x + n$, an efficient parallel implementation can be achieved by splitting the data into blocks:

$y = \begin{bmatrix} y_1 \\ \vdots \\ y_{n_d} \end{bmatrix}, \qquad \Phi = \begin{bmatrix} \Phi_1 \\ \vdots \\ \Phi_{n_d} \end{bmatrix}$

For the sparsity priors, the L1 norm is additively separable and a splitting of the bases can be used:

$\Psi = \begin{bmatrix} \Psi_1, & \ldots, & \Psi_{n_b} \end{bmatrix}$

The problem defined previously can then be further redefined as:

$\min_{x} \; \sum_{i=1}^{n_b} \|W_i \Psi_i^\dagger x\|_1 \;+\; \sum_{j=1}^{n_d} \iota_{\mathcal{B}_j}(\Phi_j x) \;+\; \iota_{\mathbb{R}_+^N}(x)$,

where $\iota_{\mathcal{B}_j}$ denotes the indicator function of the L2 ball $\mathcal{B}_j = \{v : \|y_j - v\|_2 \leq \epsilon_j\}$ and $\iota_{\mathbb{R}_+^N}$ that of the positive quadrant.

2.5.3 Resulting algorithms

There are many existing algorithms that can be used to solve the problem

defined above, but here we mainly focus on three algorithms, SDMM

(simultaneous direction method of multipliers), PADMM (proximal alternating

direction method of multipliers) and the Primal-dual based method.

SDMM


Figure 8 sequential SDMM algorithm

Figure 8 shows the algorithm of the serial version of SDMM, which is also the

version implemented in the current version of SOPT package. After making use

of the three parallel structures mentioned previously, the parallel version of

SDMM is defined as follows (see Figure 9).


Figure 9 parallel SDMM algorithm

The parallel version of the SDMM algorithm has not been implemented in the SOPT

package.

Proximal ADMM


Figure 10 Proximal ADMM algorithm

Figure 10 shows the details of the algorithm of Proximal ADMM. The Proximal

ADMM algorithm has not been fully implemented in the SOPT package.

PD-based method


Figure 11 the PD-based method

Figure 11 shows the details of PD-based method. It is probably easier to be

parallelised because it does not require the host node to do heavy linear

18

transform like what SDMM does. However, this algorithm has not been

implemented in the SOPT package either.


Chapter 3

Current implementation

3.1 Current structure of SDMM

In the SOPT C++ package, only the serial version of SDMM is implemented. In

general, the implementation of SDMM provides a very general structure, which makes the algorithm much easier to extend.

In terms of data fields, the SDMM class mainly consists of two vectors,

“proximals_” and “transforms_”. Their types and functions are listed below.

Data field                             Function
vector<Proximal> proximals_            An array of the proximity operators involved in the computation
vector<LinearTransform> transforms_    An array of the linear transforms involved in the computation

Table 1 data fields in the SDMM class

The class of “LinearTransform” provides a unified interface for different

components acting in a similar way. For example, a transformation can be

defined directly as a function or through a matrix, as both will finally be wrapped into functions and used by the SDMM algorithm. The “LinearTransform” class provides a lot of

overloaded functions to deal with the different inputs.

In terms of functions, the SDMM class mainly contains two functions,

“update_directions” and “solve_for_xn”. Their inputs and outputs are listed

below.


Function signature                                  Purpose
void update_directions(vector, vector, vector)      Calculate an intermediate result for each term, i.e. the values of r and s used by the “solve_for_xn” function
Diagnostic solve_for_xn(vector, vector, vector)      Calculate the reconstructed signal at the current iteration (the conjugate gradient algorithm is used to compute the inversion of the matrix Q)

Table 2 two main functions in the SDMM class
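To make the interplay between these two functions concrete, one iteration of the solver can be sketched as follows. This is only an illustration: the variable names, argument order and stopping test are hypothetical, and the real SOPT code differs in detail.

// Schematic SDMM iteration (hypothetical scaffolding, not the SOPT source)
for (t_uint niter = 0; niter < itermax; ++niter) {
  // Solve for the next image estimate x: build the right-hand side from the
  // current y and z vectors and apply the inverse of Q via conjugate gradient.
  auto const diagnostic = solve_for_xn(x, y, z);
  // For every appended term i: apply the linear transform L_i to x, evaluate the
  // proximity operator of g_i, and update the auxiliary vectors y[i] and z[i].
  update_directions(y, z, x);
  // Stop once the change in x (or the diagnostic) is small enough.
  if (converged(x)) break;
}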

The use of this implementation of SDMM in the CS framework is straightforward. First, the user can define the measurement operator as follows:

auto const sampling = linear_transform<Scalar>(Sampling(parameter1, parameter2, parameter3));

Then, the sparsifying operator is defined in this way (in this example, we use a SARA sparsity operator, which is a concatenation of several dictionaries):

SARA const sara(dictionary1, dictionary2, dictionary3);
auto const psi = linear_transform<Scalar>(sara, image.rows(), image.cols());

After that, the above operators are passed into the SDMM class through the “append” function:

auto const sdmm = SDMM<Scalar>()
    .append(proximal g_i, L_i)
    .append(proximal g_i, L_i)
    ……

The “append” function adds the proximity operator and the transformation function to the “proximals_” and “transforms_” arrays respectively. In order to improve the usability of this implementation of SDMM, the SDMM class provides many overloaded “append” functions to deal with different cases. In addition, users can append as many proximal-transform pairs as they want, which allows this implementation to be extended to deal with many different problems.

3.2 Problems with current structure

The advantage of the current implementation of SDMM is that it provides a very general framework for users to use in interferometric imaging. Users can bind the measurement operator of their choice into the framework, and they can choose to use either a single wavelet basis for sparsifying or a concatenation of several wavelets (e.g. the sparsity operator used in SARA). In addition, users can also bind the reweighting algorithm they need to the SDMM model, so that the model can be used to solve reweighted problems, which may be preferable in the imaging area. In short, this generality makes the current implementation of SDMM very user-friendly.

However, this advantage is also a disadvantage. Because the implementation is so general, it is hard to apply optimization or parallelism to a particular term. For example, in the parallel version of SDMM, steps 8-11, steps 13-16 and steps 19-21 represent the computation of different terms, and each is parallelised with a slightly different strategy. Therefore, it is important for the implementation to be able to apply user-defined optimization strategies to the computation of each term.


Chapter 4

Optimisation and parallelisation methods

4.1 Platform

The Advanced Research Computing High End Resource (ARCHER) is a Cray

XC30 supercomputer equipped with parallel high-performance file systems as

well as pre- and post-processing capabilities. For this project, ARCHER is the

main platform we are going to use.

There are 4920 compute nodes on ARCHER in total. Each node is equipped with two Intel Xeon E5-2697 12-core processors with hyper-threading enabled, and the default clock rate of the processors is 2.7 GHz. 4544 of the nodes are standard memory nodes, with 64 GB shared between the two processors, and 376 are high memory nodes with 128 GB. Because the memory is shared between two processors, each processor forms one Non-Uniform Memory Access (NUMA) region, so that access to local memory by cores within a NUMA region has a lower latency than access to memory in the other NUMA region.

4.2 Tools

4.2.1 Profiling

CrayPat is a performance analysis tool offered by Cray for the XC platform.

Basically, CrayPat provides two categories of profiling methods:

instrumentation-based profiling and sample-based profiling. For the former, the

compiler inserts timer calls at key points into the program that is going to be

investigated so that it can track the execution counts for routines and source

lines. However, although the execution count of a routine is exact, the execution

time is not reliable because of the heavy overhead of those inserted timer calls.


The general workflow for getting performance data using CrayPat is as follows:

1 Unload the darshan module if it is loaded.

2 Load the perftools-base and perftools modules.

3 Build your application; keep .o files.

4 Instrument the application using pat_build.

5 Run the instrumented executable to get a performance data (".xf") file.

6 Run pat_report on the data file to view the results.

For the latter, with sample-based profiling, the program’s current instruction

address is read and tracked at a certain interval. The instruction address is then

mapped to source lines and/or functions in the program. The advantage of this

kind of profiling is that the execution time of a source line or a function is more

accurate compared to instrumentation-based profiling because of the low

overhead this method adds to the program. In addition, as it does not modify the

code itself, we can also track the execution time of some generated assembly

code.

CrayPat's Automatic Program Analysis (APA) feature provides an easy way for

such a purpose. Using this feature, one can generate an instrumented

executable for a sampling experiment. When the binary is executed, it generates

an ASCII text file that contains CrayPat's suggestion for pat_build tracing

options, which can be used to re-instrument the executable for detailed tracing

experiments.

The general workflow for using APA is as follows:

1 Generate the executable for sampling, using the special '-O apa' flag.

2 Running the executable on compute nodes via aprun generates an xf file.

3 Run pat_report on the data file.

(It will generate the “.ap2” and “.apa” files. The latter contains suggested pat_build options for building an executable for tracing experiments.)

4 Examine the “.apa” file and, if necessary, customize it for your needs.

5 Rebuild an executable using pat_build's -O option with the apa file name

as the argument.

6 Run the new executable for a tracing experiment.

7 Run pat_report on the newly created xf file. This is the tracing result.

For this project, sample-based profiling can already satisfy our needs as we

focus on the general structure of the algorithm, not the detailed implementation.

However, for further investigation, instrumentation-based profiling can be of

great use.


4.2.2 Data visualization

Cray Apprentice2 displays data that was captured by CrayPat. This visualization

tool displays a variety of different data panels, depending on the type of

performance experiment that was conducted. Its target is to help identify

conditions including load imbalance, excessive serialization, excessive

communication and network contention.

Cray Apprentice2 provides call-graph-based profile information with source code

mapping and timeline-based trace visualization, also with source code

mappings. It is capable of running either on the Cray system service nodes, or

on a remote Linux server or workstation. Examples of Cray Apprentice2 displays

are demonstrated below.

Figure 12 Cray Apprentice2 interface

The general workflow for using Cray Apprentice2 is as follows:

1. Instrument the application using pat_build.

2. Run the instrumented executable to get a performance data (".xf") file.

3. Run pat_report on the data file to get the “.ap2” file.

4. Run app2 on the “.ap2” file to get graphical reports.


4.3 Parallel languages

4.3.1 Heterogeneous computing

CPU is short for central processing unit, the main functional unit in a computer. Until approximately ten years ago, the speed of CPUs was improved mainly by increasing their clock frequency. However, higher clock frequency comes with increased heat generation, and eventually excessive heat prevented clock frequencies from rising further. Today most CPUs run at 2 to 3 GHz, and this has been the case for over ten years.

The current solution proposed by most CPU manufacturers to keep increasing

the speed of the CPUs is to add more CPU cores, which is known as multi-core

CPU. We can gain good performance from explicitly making use of those

multi-cores. This is where parallel computing first entered the technology scene

in a big way. However, the software has to be re-written in a parallel style in order

for programs to actually make use of those CPU cores.

Figure 13 Intel CPU trends

The figure above shows the trends in Intel’s CPUs over the years. Until about 2005, the clock speed of Intel’s CPUs kept increasing. After that, the improvement in clock speed slowed down and, at the same time, the power consumption stopped growing. This suggests that multi-core CPUs improve performance not only by using more than one core simultaneously but also by improving the energy efficiency of the CPU.

An alternative way of increasing performance, besides adding more CPU cores, is to use other kinds of processors to supplement the CPU in carrying out computational tasks. Coordinating more than one processor of different

architecture types to perform computation is known as “heterogeneous

computing”.

In the area of “heterogeneous computing”, two common accelerators are GPUs

and the Xeon Phi respectively. In practice, GPUs are more widely used than the Xeon Phi because most desktop computers already contain a GPU. Over several years of development, GPUs have advanced from merely dealing with pixels to having formidable capabilities for mathematical computation. Therefore, we can gain great performance by offloading heavy mathematical computations originally executed on CPUs to GPUs.

However, what makes the “heterogeneous computing” hard to promote is that

most software has to be re-written in order to make use of GPUs as accelerators.

For example, CUDA is the first highly adopted platform which enables

developers to write high-performance general-purpose GPU programs.

Developers need to write GPU kernels, manage different levels of GPU memory

and make trade-offs for data transfers between the host and the device during

the development of CUDA programs.

The Xeon Phi is an accelerator released by Intel. It uses many simplified x86 cores to form a many-core architecture. One of the benefits of using the Xeon Phi as an accelerator is that any code compatible with the x86 architecture is also compatible with the Xeon Phi, which makes the development of accelerator-based high-performance programs much more convenient. However, the performance of the Xeon Phi is widely reported to be not as good as that of GPUs, which means we need to put more effort into performance tuning if we use the Xeon Phi as an accelerator.

Although this project is not intended to port the current implementation to

accelerators, it is worthwhile to use languages in the implementation that work well with accelerators, as heterogeneous computing is a possible trend in

the future. Therefore, in the next section, we mainly compare three languages

that are currently popular and working well with accelerators.


4.3.2 OpenCL, OpenACC and OpenMP

Comparison between different parallel languages can be done in several ways,

for example, performance. However, the performance of a piece of code is

determined by many factors including the hardware the code is running on and

the code itself. As mentioned above, currently the two popular accelerators,

GPU and Xeon Phi, are totally built on different architectures, and each of them

has its own advantage over the other. For example, the frequency of one single

core on Xeon Phi is much higher than GPU, while the number of cores on GPU

is instead much more than that on Xeon Phi. In this project, we choose the

parallel languages based on their portability as we hope our code can be run on

both GPU and Xeon Phi.

OpenCL is a framework for writing programs that execute across heterogeneous platforms. However, although NVIDIA and Intel still have some products supporting OpenCL, the current level of support does not give one much optimism. For example, NVIDIA only released a driver supporting OpenCL 1.2 in 2015, while the latest version of OpenCL is already 2.2. In addition, among Intel’s products, only the integrated GPU on the chip supports OpenCL, which does not seem attractive enough for HPC users.

OpenACC is a programming standard for parallel computing developed by Cray,

CAPS, NVIDIA and PGI. The standard is designed to simplify parallel

programming of heterogeneous CPU/GPU systems. As many OpenACC

members have also worked in the OpenMP standard group to merge accelerator support into the OpenMP specification (OpenMP 4.0), OpenACC and OpenMP are likely to become more and more similar. Therefore, currently OpenMP and OpenACC might

be the best languages for programming accelerators.

For this project, we choose to use OpenMP as the parallel language because it

is more commonly used in the multi-core CPU environment. Although at the moment the compiler at hand does not support the latest OpenMP (OpenMP 4.0), which means porting the code to GPUs using OpenMP is currently impossible, this will probably become achievable in the near future.


Chapter 5

Implementation of the solution

5.1 Initial profiling

Figure 14 initial profiling of reweighted SDMM

The figure above displays the result of the initial profiling of the code and its call

tree. It is obvious that the two functions, “solve_for_xn” and “update_directions”,

take up most of the running time of the code.


Figure 15 initial profiling for “solve_for_xn” function

In the function “solve_for_xn”, we find that the conjugate gradient algorithm takes up most of the function’s time (see Figure 15). The conjugate gradient algorithm is used in step 5 of the serial version of SDMM in order to apply the inverse of the matrix Q.

Figure 16 initial profiling for “update_directions” function

In the function “update_directions”, we find that the soft thresholding algorithm takes up most of the time consumed (see Figure 16). Soft thresholding is the proximity operator of the L1 norm terms.

Therefore, in this chapter, we are going to focus on the improvement of the

“solve_for_xn” and “update_directions” functions.


5.2 Profiling “ConjugateGradient”

Figure 17 profiling for “ConjugateGradient” class

The figure above shows the result of profiling the conjugate gradient algorithm.

The result shows that most of the time consumed by the conjugate gradient

algorithm is actually consumed by the general matrix-vector product. As the

“Eigen” library already provides an efficient general-purpose matrix-vector product, there is little room for us to make any improvement here.
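To illustrate why the matrix-vector product dominates, the following stand-alone sketch (not the SOPT code) solves a dense symmetric positive-definite system with Eigen’s ConjugateGradient solver; each internal CG iteration is dominated by one product with Q:

#include <Eigen/Dense>
#include <Eigen/IterativeLinearSolvers>
#include <iostream>

int main() {
  // Build a small symmetric positive-definite matrix Q and a right-hand side b.
  int const n = 200;
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd Q = A.transpose() * A + 200.0 * Eigen::MatrixXd::Identity(n, n);
  Eigen::VectorXd b = Eigen::VectorXd::Random(n);

  // Conjugate gradient: every iteration performs essentially one product Q * p.
  Eigen::ConjugateGradient<Eigen::MatrixXd, Eigen::Lower | Eigen::Upper> cg;
  cg.compute(Q);
  Eigen::VectorXd x = cg.solve(b);

  std::cout << "iterations: " << cg.iterations()
            << ", estimated error: " << cg.error() << "\n";
  return 0;
}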

5.3 Three-level parallelism in “update_directions”

As mentioned in the background section, the SDMM structure offers mainly

three degrees of parallelisation that can be further exploited. Firstly, the

proximity operators can be implemented in parallel. Then, the sparsity priors are

separable. In addition, the data vector and the measurement operator can be


partitioned into several blocks so that each compute node can work on each

piece of the data simultaneously.

5.3.1 Parallel proximity operator

Calculating the proximity operator of each term in parallel can be easily achieved

using OpenMP. As in our experiment we have only three terms, we only need three threads to run the code.

#pragma omp parallel for
for (t_uint i = 0; i < transforms().size(); ++i) {
  z[i] += transforms(i) * x;      // apply the linear transform L_i to the current estimate
  y[i] = proximals(i, z[i]);      // evaluate the proximity operator of term i
  z[i] -= y[i];                   // update the auxiliary vector for term i
}

By doing this, in theory, we can get an acceleration factor of three. However, as

the computational burden of the different terms can be very different, the performance

gain in reality may not be that much.

Term                                                                                          Time consumption (second)
.append(sopt::proximal::l1_norm<Scalar>, psi.adjoint(), psi)
    (the term $\|\Psi^\dagger x\|_1$)                                                         15.63563
.append(sopt::proximal::translate(sopt::proximal::L2Ball<Scalar>(epsilon), -y), sampling)
    (the term $\|y - \Phi x\|_2 \leq \epsilon$)                                               0.35529
.append(sopt::proximal::positive_quadrant<Scalar>)
    (positive quadrant)                                                                       0.41683
Time consumption after parallelisation                                                        16.59599

Table 3 time consumption of each original term


Table 3 shows the time consumption by the three different terms in our problem.

This suggests that even if these three terms are perfectly computed in parallel,

the performance gain is still less than 5%. In reality, costs such as the fork/join of threads affect the final performance of the code as well. After parallelisation, the time consumption is in fact higher than before (16.59599 seconds).

5.3.2 Separable sparsity priors

In theory, the sparsity prior can be split into any number of blocks. However,

splitting it based on the bases used is a very natural way. In our experiment, the

sparifying operator consists of three bases, one “DB3“ and two “DB1” with

different levels. Therefore, in our experiment, we split it into three blocks.

auto const psi0 = sopt::linear_transform<Scalar>(sara[0], image.rows(), image.cols());
auto const psi1 = sopt::linear_transform<Scalar>(sara[1], image.rows(), image.cols());
auto const psi2 = sopt::linear_transform<Scalar>(sara[2], image.rows(), image.cols());

auto const sdmm
    = sopt::algorithm::SDMM<Scalar>()
          .append(sopt::proximal::l1_norm<Scalar>, psi0.adjoint(), psi0)
          .append(sopt::proximal::l1_norm<Scalar>, psi1.adjoint(), psi1)
          .append(sopt::proximal::l1_norm<Scalar>, psi2.adjoint(), psi2)
          ……

Here we manually split the sparsity prior “psi” into three blocks (“psi0”, “psi1”,

and “psi2”), and append them to the SDMM. Combined with the previous modification to the proximity operators, we now have five terms that are calculated in parallel.

Terms                                                                                         Time consumption (second)
.append(sopt::proximal::l1_norm<Scalar>, psi0.adjoint(), psi0)                                5.21192
.append(sopt::proximal::l1_norm<Scalar>, psi1.adjoint(), psi1)                                6.98543
.append(sopt::proximal::l1_norm<Scalar>, psi2.adjoint(), psi2)                                5.18503
.append(sopt::proximal::translate(sopt::proximal::L2Ball<Scalar>(epsilon), -y), sampling)     0.31529
.append(sopt::proximal::positive_quadrant<Scalar>)                                            0.41297
Time consumption after parallelisation                                                        9.83453

Table 4 time consumption of each term

Table 4 shows the time consumption by the five different terms and the actual

time consumption after the parallelisation. Although the total time consumption is reduced, more and more overhead is involved, moving the result further away from the theoretical expectation.

Although splitting the sparsity prior based on the bases it uses is a natural and straightforward approach, whether the result is optimal is still unknown. For example, in the above table, the second term split from the original prior costs comparatively more time than the other two, so for load balance this may not be the best splitting of the prior. However, a different splitting strategy may not allow the use of fast algorithms for the computation of the operator.

5.3.3 Splitting of the data

As introduced in the previous section, the measured vector can easily be divided into several blocks as follows:

$y = \begin{bmatrix} y_1 \\ \vdots \\ y_{n_d} \end{bmatrix}$

This form of parallelisation, also known as data parallelism, focuses on

distributing the data across different parallel computing nodes. The advantage of

data parallelism is that usually it can provide good scalability for the program.

In our context, we also have to split the measurement operator into the same blocks as the measured vector:

$\Phi = \begin{bmatrix} \Phi_1 \\ \vdots \\ \Phi_{n_d} \end{bmatrix}$

In order to achieve that, we have to design supporting matrices $M_j$ that help divide the measurement operator:

$\Phi = \begin{bmatrix} \Phi_1 \\ \vdots \\ \Phi_{n_d} \end{bmatrix} = \begin{bmatrix} G_1 M_1 \\ \vdots \\ G_{n_d} M_{n_d} \end{bmatrix} F Z,$

where $Z$ is the zero-padding operator, $F$ the FFT, $M_j$ the mask selecting the discrete Fourier coefficients needed by block $j$, and $G_j$ the corresponding interpolation (gridding) kernels.

Therefore, with this distributed optimization approach, each piece of the measured vector and of the measurement operator can be local to a compute node, which distributes both the memory requirement and the processing load. However, because of time pressure and the difficulty of implementing this distributed approach, we only provide the idea here, together with the brief sketch below.
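As an illustration only, a shared-memory approximation of this idea could look as follows. The helper “make_block_measurement_operator” is hypothetical (it stands for whatever builds the block operator $\Phi_b = G_b M_b F Z$) and is not part of SOPT; a real distributed version would instead place each block on its own MPI rank.

// Hypothetical sketch: split the measured vector y into contiguous blocks and
// append one data-fidelity (L2-ball) term per block.
t_uint const n_blocks = 4;                      // chosen for illustration
t_uint const block_size = y.size() / n_blocks;  // assume y.size() divides evenly
auto sdmm = sopt::algorithm::SDMM<Scalar>();
for (t_uint b = 0; b < n_blocks; ++b) {
  // b-th block of visibilities (Eigen's segment(offset, length))
  auto const y_b = y.segment(b * block_size, block_size).eval();
  // block measurement operator; building it is the part left open in this project
  auto const phi_b = make_block_measurement_operator(b);   // assumed helper
  sdmm.append(sopt::proximal::translate(sopt::proximal::L2Ball<Scalar>(epsilon), -y_b), phi_b);
}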


Chapter 6

Summary and conclusions

This project first introduces a currently popular technique, compressive sensing. It is a promising path for signal acquisition, as it can reduce the number of samples taken in the sensing process, samples which would otherwise be a waste of resources. In the “big data” era, it is necessary to use techniques that are much faster.

Compressive sensing has also been applied to radio interferometry, in the hope that it can help to deal with the large amounts of measurement data. In this project, we introduce four algorithms used in this area: SARA, SDMM, PADMM and the PD-based method. The latter three are attractive because they offer a parallel structure, which is preferable when dealing with large amounts of data. However, currently only SARA and SDMM are implemented in the SOPT package, while PADMM and the PD-based method are not, or not fully, implemented. In addition, the current SDMM implementation makes no use of its parallel structure. In this project, we tried to implement a parallel SDMM algorithm.

The paper in which SDMM is introduced mentions three degrees of parallelism in the SDMM algorithm. However, based on our experiments, it may be difficult to balance the load if the splitting of the data is not handled well. Especially since SOPT is a public library, it is expected to provide functionality with enough generality. How to deal with the load balancing problem and how to implement the parallel structures in a user-friendly way is something to consider in the future.

6.1 Future work

There is a wide range of further work that could be attempted on this project. Here are some options that are considered worthwhile:

- SDMM might be much slower than the latest PD-based algorithm for interferometric imaging, as the PD-based method offers a full splitting structure, while SDMM still needs the host node to do heavy linear computation. At the time of this project, the PD-based algorithm had not been implemented, so we had to use SDMM instead. However, in the future, with the new algorithm available, the performance may be more attractive.

- This project only works with multi-core CPUs. In the future, with a full splitting algorithm available, we can begin to study the performance of the algorithm on different accelerators, or its scalability.

- Currently some unexpected cost affects the final performance of the parallel implementation, which can be critical when it is scaled to large systems. Studying how those costs, such as fork/join of threads or thread contention, affect the final result and what we can do to reduce them might be an interesting direction.

- Apart from OpenMP, further work could use other parallel languages to implement the algorithm. Although, as mentioned in the previous section, OpenMP might be the most portable among OpenACC, OpenCL and OpenMP, using a more specific parallel language, for example CUDA for GPUs, might result in better performance, although this usually comes at the cost of substantial program rewriting.

- For real big-data applications, we may have to make use of heterogeneous systems to get optimal performance. Therefore, how to port the current implementation to such a system is also worth investigating. For example, we may need to consider how to use MPI along with OpenMP, CUDA or any other parallel language to make the implementation scale to large systems.


References

[1] Archer.ac.uk. (2016). ARCHER » 5. Performance analysis. [online] Available at:

http://www.archer.ac.uk/documentation/best-practice-guide/performance.php

[Accessed 19 Aug. 2016]

[2] Candes, E. and Wakin, M. (2008). An Introduction To Compressive Sampling.

IEEE Signal Process. Mag., 25(2), pp.21-30.

[3] Carrillo, R., McEwen, J. and Wiaux, Y. (2014). PURIFY: a new approach to

radio-interferometric imaging. Monthly Notices of the Royal Astronomical

Society, 439(4), pp.3591-3604.

[4] Carrillo, R., McEwen, J. and Wiaux, Y. (2012). Sparsity Averaging Reweighted

Analysis (SARA): a novel algorithm for radio-interferometric imaging. Monthly

Notices of the Royal Astronomical Society, 426(2), pp.1223-1234.

[5] Cornwell, T. (2009). Hogbom's CLEAN algorithm. Impact on astronomy and

beyond. Astronomy and Astrophysics, 500(1), pp.65-66.

[6] Dsp.rice.edu. (2016). Compressive Imaging: A New Single-Pixel Camera | Rice

DSP. [online] Available at: http://dsp.rice.edu/cscamera [Accessed 19 Aug.

2016].

[7] Eldar, Y. and Kutyniok, G. (2012). Compressed sensing. Cambridge: Cambridge

University Press.

[8] LI, S. and WEI, D. (2009). A Survey on Compressive Sensing. Acta Automatica

Sinica, 35(11), pp.1369-1377.

[9] Nersc.gov. (2016). CrayPat. [online] Available at:

http://www.nersc.gov/users/software/performance-and-debugging-tools/craypat

/ [Accessed 19 Aug. 2016].

[10] Onose, A., Carrillo, R., Repetti, A., McEwen, J., Thiran, J., Pesquet, J. and


Wiaux, Y. (2016). Scalable splitting algorithms for big-data interferometric imaging

in the SKA era. Mon. Not. R. Astron. Soc., p.stw1859.

[11] Rong, R. (2013). Splitting algorithms for convex optimization and applications

to sparse matrix factorization. [Los Angeles]: University of California, Los

Angeles.

[12] Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T. and Yan, S. (2010). Sparse

Representation for Computer Vision and Pattern Recognition. Proceedings of the

IEEE, 98(6), pp.1031-1044.

[13] Wikipedia. (2016). OpenCL. [online] Available at:

https://en.wikipedia.org/wiki/OpenCL [Accessed 19 Aug. 2016].

[14] Wikipedia. (2016). OpenACC. [online] Available at:

https://en.wikipedia.org/wiki/OpenACC [Accessed 19 Aug. 2016].

[15] Wikipedia. (2016). OpenMP. [online] Available at:

https://en.wikipedia.org/wiki/OpenMP [Accessed 19 Aug. 2016].

[16] The Next Platform. (2015). Is OpenACC The Best Thing To Happen To

OpenMP?. [online] Available at:

http://www.nextplatform.com/2015/11/30/is-openacc-the-best-thing-to-happen-to-o

penmp/ [Accessed 19 Aug. 2016].

[17] K. Davis, “Data transfer of an “array of pointers” using the Intel Language

Extensions for Offload (LEO) for the Intel Xeon Phi coprocessor,” Intel Corporation,

22 August 2014. [Online]. Available:

https://software.intel.com/en-us/articles/xeon-phi-coprocessor-data-transfer-array-of

-pointers-using-language-extensions-for-offload. [Accessed 30 July 2015].

[18] OpenACC-standard.org, “The OpenACC Application Programming Interface,”

OpenACC-standard.org, 2013.

[19] M. Noack, “HAM - Heterogenous Active Messages for Efficient Offloading on

the Intel Xeon Phi,” The Zuse Institute Berlin (ZIB), Berlin, 2014.

[20] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,”


ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1-1:25, 2011.

[21] R. Boisvert, R. Pozo and K. Remington, “The Matrix Market Exchange Formats

: Initial Design,” National Institute of Standards and Technology, Gaithersburg,

1996.

[22] Pgroup.com. (2016). PGInsider September 2013: Tesla vs. Xeon Phi vs.

Radeon. [online] Available at:

https://www.pgroup.com/lit/articles/insider/v5n2a1.htm [Accessed 19 Aug. 2016].

[23] Cmake.org. (2016). CMake. [online] Available at: https://cmake.org/ [Accessed

19 Aug. 2016].

[24] Openmp.org. (2016). OpenMP.org » OpenMP Specifications. [online]

Available at: http://openmp.org/wp/openmp-specifications/ [Accessed 19 Aug.

2016].

[25] The Khronos Group. (2016). OpenCL - The open standard for parallel

programming of heterogeneous systems. [online] Available at:

https://www.khronos.org/opencl/ [Accessed 19 Aug. 2016].

[26] OpenMP Architecture Review Board, “OpenMP Application Program

Interface,” OpenMP Architecture Review Board, 2013.

[27] R. W. Green, “OpenMP* Loop Scheduling,” Intel Corporation, 29 August 2014.

[Online]. Available:

https://software.intel.com/en-us/articles/openmp-loop-scheduling. [Accessed 9

August 2016].

[28] A. D. Robison, “SIMD Parallelism using Array Notation,” Intel Developer

Zone, 3 September 2010. [Online]. Available:

https://software.intel.com/en-us/blogs/2010/09/03/simd-parallelism-using-array-not

ation/?wapkw=array+notation. [Accessed 30 July 2016].

[29] P. Kennedy, “Intel Xeon Phi 5110P Coprocessor – Many Integrated Core

Unleashed,” Serve The Home (STH), 13 November 2012. [Online]. Available:

http://www.servethehome.com/introducing-intel-xeon-phi-5110p-coprocessor-intels


-integrated-core-unleased/. [Accessed 16 August 2016].

[30] S. Cepeda, “Optimization and Performance Tuning for Intel Xeon Phi

Coprocessors,” Intel Corporation, 12 November 2012. [Online]. Available:

https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-i

ntel-xeon-phi-coprocessors-part-2-understanding. [Accessed 31 July 2016].

[31] J. Fenlason, “GNU gprof,” Free Software Foundation, Inc., November 2008.

[Online]. Available: https://sourceware.org/binutils/docs/gprof/. [Accessed 30 July

2016].


Appendix

Commands for compiling the code on ARCHER:

module load cmake/3.5.2
module swap PrgEnv-cray PrgEnv-gnu
module load fftw
module load git
module load anaconda
export CRAYPE_LINK_TYPE=dynamic
CC=cc CXX=CC FC=ftn cmake -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_PREFIX_PATH='/opt/cray/fftw/3.3.4.5/sandybridge;WHERE EIGEN IS' \
    WHERE SOPT IS
make

* Replace “WHERE EIGEN IS” and “WHERE SOPT IS” with the path to the Eigen folder and the path to the SOPT source folder respectively.