
The Journal of Supercomputing, 1–18 (to be published). © Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Performance Tuning Software DSM Applications using Visualisation

MATS BRORSSON AND MARTIN KRAL [email protected]

Department of Information Technology, Lund University, Sweden

Editor: Mark Clement

Abstract. Small organisations can now have access to high raw processing power using networks of workstations (NOW) as parallel computing platforms. Software Distributed Shared Memory (software DSM) packages have been developed to facilitate the programming of such systems. However, because of the high interprocess latencies in a NOW, the performance of a software DSM application is more susceptible to the partitioning of the problem than might be expected.

This paper presents an approach for a tool that visualises the execution of a program in a way that highlights performance bottlenecks. The tool associates identified bottlenecks with the corresponding source code lines in order to determine what piece of code is the cause of poor performance. The visualisation technique is demonstrated in two case studies. They clearly show that the visualisation is indeed useful and provides an effective way to acquire an understanding of what characterises an application's sharing behaviour.

Keywords: shared memory, performance tuning, software DSM, visualisation

1. Introduction

Networks of workstations (NOW) are attractive as parallel computing platforms since the hardware is most often already in place and the only thing needed is a way of making use of it for parallel applications. It is undoubtedly much easier to convert an existing sequential program to a parallel one using a shared memory programming model than using message passing. Several systems for software distributed shared memory (software DSM) have been proposed to facilitate this process on a NOW. However, if these programs are not carefully analysed and tuned to minimise communication, the performance will in many cases not come close to what can be achieved with message passing. In the end, it may require as much effort from the programmer to tune the performance of an application for software DSM as would have been needed to rewrite the program using message passing in the first place.

This paper presents an approach to study the performance effects of shared memory accesses in a software DSM system. We do that with a tool that visualises the extent to which shared memory accesses contribute to the execution time, with links back to the corresponding source code lines. The rationale is that programmers need help to get as much performance out of their programs as possible. Programmers are usually experts in some field other than computer engineering, and it is unreasonable to expect them to take appropriate actions depending on the kind of platform being used at the moment.


Visualisation is widely recognised as important in the computing sciences and is now also gaining ground in parallel application performance debugging [4]. We have previously shown that tools based on visualisation of problematic access patterns can help programmers find out whether there is a performance problem in the way their programs are written [2, 3]. These tools were developed within the context of tightly coupled shared memory multiprocessors. A software DSM system is much more vulnerable to access patterns that cause excessive communication, and there is much more to gain from having a tool that provides programmer feedback.

We have implemented our tool on top of the TreadMarks software DSM system [1]. TreadMarks uses the memory protection mechanism to detect accesses to shared memory and to enforce consistency across the system. A lazy release consistency protocol allows multiple processors, in a data-race free program, to simultaneously read and write the same virtual page. Consistency is enforced at the next synchronisation operation.

We have modified TreadMarks to perform execution time measurements and to log events that cause interprocess communication. To reduce the impact of tracing we postpone as much as possible of the analysis to a post-mortem phase. The current analysis of the execution log is triggered by barrier operations and it is for this reason only meaningful to log programs that use barriers for interprocess synchronisation. The post-mortem analysis and visualisation of the execution is done by a tool called Papp (Programmer's Aid for Parallel Programming [6]).

The visual display in Papp is based on a space-time diagram showing the execution time for each processor subdivided into barrier intervals. Each barrier interval is further divided into busy time, lock acquisition time, shared data acquisition time, I/O interrupt time and overhead time. The focus in this paper is on the data acquisition time, which can be traced to write accesses in previous barrier intervals. The execution log contains information about where these write accesses were performed. Papp therefore not only provides a possibility to spot performance bottlenecks, but also provides links directly to the sources of these memory accesses, which otherwise would be very difficult to find.

We have used two case studies, a 3-D FFT program and two different implementations of an SOR algorithm, to show that the visualisation does indeed correspond to the behaviour of the program being traced. Comparing the two different implementations of the SOR algorithm, we demonstrate how a poor implementation can degrade the performance of the parallel system and how this is detected using Papp.

With the 3-D FFT program we discuss how the granularity of the data sharing in a parallel application affects the overhead of a software DSM system and how this is shown in Papp. This is done by means of a comparison of the performance visualisation for two different work loads for FFT.

The main contributions of this work are: (i) a method by which we can directly measure the impact certain memory accesses have on the execution time, and (ii) a method by which we can graphically display this information to the programmer. As far as we know, this is a new concept in the area of software DSM systems, but we discuss the most important related work in section 5.


2. The TreadMarks software DSM system

The performance of a software DSM system is of course intimately related to the actual system used. We therefore describe the TreadMarks system in some detail so as to understand the performance consequences of shared data accesses in TreadMarks applications.

2.1. The function of TreadMarks

The TreadMarks programming model is similar to other fork-based models in which shared memory has to be explicitly allocated. The number of processors is specified at the beginning of the execution, and each processor executes the same program image (TreadMarks currently enforces one process per processor and we will use these terms interchangeably). The execution is coordinated through barriers and mutually exclusive synchronisation primitives. A processor can distinguish itself from the other processors through a global variable that holds its identity.
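To make the model concrete, the following is a minimal sketch of a TreadMarks-style program, assuming the Tmk_* interface described in the TreadMarks literature (Tmk_startup, Tmk_malloc, Tmk_distribute, Tmk_barrier, Tmk_proc_id, Tmk_nprocs, Tmk_exit); the header name and exact signatures may differ between releases, so it should be read as illustrative rather than as the definitive API.

    /* Illustrative TreadMarks-style program skeleton (C).
     * Assumes the Tmk_* calls from the TreadMarks papers; names and
     * signatures are indicative only. */
    #include <stdio.h>
    #include "Tmk.h"                       /* TreadMarks header (assumed name) */

    static double *shared;                 /* pointer to explicitly allocated shared memory */

    int main(int argc, char **argv)
    {
        int i, n = 1024;

        Tmk_startup(argc, argv);           /* start the specified number of processes */

        if (Tmk_proc_id == 0) {
            shared = (double *) Tmk_malloc(n * sizeof(double));   /* explicit shared allocation */
            Tmk_distribute((char **) &shared, sizeof(shared));    /* make the pointer known to all */
            for (i = 0; i < n; i++)
                shared[i] = 0.0;
        }
        Tmk_barrier(0);                    /* everyone waits for the initialisation */

        /* each process works on its own part of the shared array */
        for (i = Tmk_proc_id; i < n; i += Tmk_nprocs)
            shared[i] += 1.0;

        Tmk_barrier(1);                    /* consistency enforced at synchronisation */
        if (Tmk_proc_id == 0)
            printf("done\n");
        Tmk_exit(0);
        return 0;
    }

Processor 0 allocates and initialises the shared data, distributes the pointer, and all processors then coordinate through barriers; this mirrors the structure of the SOR and 3DFFT applications studied in section 4.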

A network of standard workstations has of course no shared memory, and there is no hardware support to maintain memory consistency except the address translation mechanism used for memory protection and virtual memory. This means that coherence must be maintained at page-level granularity. In order to reduce the impact of false sharing, TreadMarks uses a multiple-writer, lazy release consistency protocol.

Figure 1 shows the essence of how lazy release consistency works. In this example there are three processors, P0, P1 and P2. There are also two variables, V1 and V2, that are located in the same page. First processors P0 and P2 update V1 and V2, respectively. Then there is a barrier synchronisation after which processor P1 accesses both V1 and V2. From the start, there is a valid copy of the page containing V1 and V2 in the memory of all three processors. This page is write-protected but read accesses can be done in parallel.

At the first write reference from P0 (P2) to V1 (V2), a segmentation violation will occur (called Segv in Unix signal notation) since the page is write-protected. TreadMarks is notified that a write access has occurred and creates a copy, a twin, of the original page, before the process is allowed to modify V1 (V2). P0 and P2 are now free to make additional changes in the page. It is, however, assumed that these changes do not overlap, i.e., that the program is data-race free.

Processor 0 acts as the barrier manager. All processors report to P0 and when they all have arrived, P0 sends a message to each processor releasing it for execution. When the processors arrive at the barrier, all processors exchange information on what changes they have made during the preceding barrier interval. In this example P1 and P2 will be notified that P0 has modified the page containing V1 and V2. Similarly, P0 and P1 will be notified that P2 has modified the same page. This information, which is created locally upon entering the barrier routine, is called a write notice and is just a note that a particular page has been modified by a particular processor during a synchronisation interval.


[Figure 1 diagram: per-processor timelines annotated with "First write reference to V1 (V2): create twin", "Compute diff for V1 (V2): discard twin", "Execute barrier: exchange write notices and invalidate pages for which write notices have been received", and "First reference to page by P1: obtain and apply diffs, continue with memory reference".]

Figure 1. The lazy release consistency protocol in TreadMarks. Processors are not notified of changes until the barrier and pages do not get updated until the first reference after the barrier.

All pages for which a write notice has arrived are invalidated (read and write privileges are removed) before processing is resumed. When P1 issues a memory reference to any address in this page, it takes a segmentation fault and TreadMarks requests page updates from processors P0 and P2. However, they will not send the entire page to P1. This would not only take longer than necessary to transfer, but would also require some kind of processing at the receiving end in order to merge the changes made by both processors. Instead they each send a diff, which is an encoding of the changes made. The diff is created by comparing the page word by word with the twin created earlier. After the diff has been created, the twin can be safely discarded. When the diffs have arrived at processor P1, they are applied to the page and P1 can continue with its memory reference.

The description of TreadMarks' function is of course limited here due to space constraints; more detailed descriptions can be found in [1] and [5]. We now go on to discuss the performance aspects of maintaining a shared memory image in software on standard networks.
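To illustrate the twin/diff mechanism just described, the following is a minimal sketch of how a diff could be produced by comparing a page word by word against its twin, and later applied at the requesting processor. The record layout and function names are our own illustrative assumptions, not the actual TreadMarks diff encoding.

    /* Sketch of word-by-word diff creation against a twin (illustrative only). */
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_WORDS (4096 / sizeof(uint32_t))

    struct diff_entry {            /* one changed word: offset within the page + new value */
        uint16_t word_offset;
        uint32_t value;
    };

    /* Compare the current page with its twin and collect the changed words.
     * Returns the number of entries written to 'out'. */
    size_t create_diff(const uint32_t *page, const uint32_t *twin,
                       struct diff_entry *out)
    {
        size_t i, n = 0;
        for (i = 0; i < PAGE_WORDS; i++) {
            if (page[i] != twin[i]) {
                out[n].word_offset = (uint16_t) i;
                out[n].value = page[i];
                n++;
            }
        }
        return n;                  /* the twin can be discarded by the caller afterwards */
    }

    /* Applying a diff at the requesting processor is the inverse operation. */
    void apply_diff(uint32_t *page, const struct diff_entry *d, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            page[d[i].word_offset] = d[i].value;
    }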

2.2. Performance aspects

Figure 1 indicates where network communication takes place in TreadMarks. Data is implicitly communicated from processors P0 and P2 to P1, but there are eight explicit messages. First the barrier has to be processed; this involves sending a message to the barrier manager (P0 in this example), which awaits a message from all other processors before sending back a reply to each processor.


Table 1. Shared memory access fault overheads.

Fault handling overhead        Time (µs)
Local software processing        927
Hardware                         112
Remote software processing       845
Total time                      1884

Then processor P1 requests the diffs, which implies replies from P0 and P2. Most of the messages are small (on the order of ten bytes). Only the diff messages can be of substantial size, up to the size of a page.

Table 1 indicates the size of the access fault overheads involved, as measured on a network of IBM RS/6000 workstations for a situation similar to the example [8]. In this table we have subdivided the overhead into local processing (at processor P1 according to the example in figure 1), hardware overhead and remote processing at the nodes that respond to the requests.

We see here that the hardware overhead is a relatively small portion of the total cost involved in lock acquisition and fault handling. What really takes time is the software processing, both at the requesting side (the processor that takes an access fault) and at the serving side. A large portion of this software overhead is actually communication protocol software execution.

An earlier study of TreadMarks application performance shows that the direct overhead from fault handling ranges from 2% to 14% for a selection of applications [8]. This does not seem to be too much of a problem, but there are a number of indirect effects to consider. A high fault time overhead often results in an unbalanced program, which will lead to longer waiting times at the barriers. Access faults also result in an interrupt (Unix signal handling) at the serving processor, which will slow down this processor. It does not, however, affect the actual barrier execution time very much since the size of the write notices exchanged is small enough not to affect the communication time significantly.

Note that it is not sufficient to look at the processor state at the access fault alone.This will only provide information on where the fault is taken and on what address.If we would like to find out which of the access faults that are the most important,we can order them according to how long a time they take to service, but we stillcannot infer where in the program this interprocess communication originates. Thereason for this is that the source of a fault is the writes done to this page byother processors in an earlier synchronisation interval. For the example in figure 1it is the writes done by processors P0 and P2 before the barrier that cause thecommunication after the barrier.We have developed a method to extract the information on which write accesses

contribute the most to access fault time overhead in subsequent barrier synchroni-


Figure 2. The user interface of Papp visualisation.

Figure 3. Legend for the Papp display.

This information can then be presented in graphical form to the programmer. The next section describes this method and the graphical tool, Papp.

3. The Papp visualisation tool

3.1. Visualisation tool user interface

Figure 2 shows the user interface window for a TreadMarks execution. The display shows a time-line for each processor subdivided into barrier intervals. For each interval the execution time is subdivided into busy time and various overheads according to the legend shown in figure 3. A detailed description is found in section 3.2. For the purpose of this study the most important overheads are:

• Acquiring shared data—time when the processor requests and awaits diffs from other processors


Figure 4. The analysis of a specific barrier with the instruction list for processor 2 and the corresponding instruction location.

• Sigio—time when the processor is interrupted and servicing requests from other processors

• Idle—time spent waiting for one or more processors to reach the barrier

The width of each barrier interval is proportional to the execution time. From this display the programmer can easily get an overview of the execution of the program, showing which barrier intervals take the longest time to execute and where there is a lot of memory access time overhead.

From this view the user can select to show only one barrier. This brings up the selected barrier only, as seen in figure 4. When the user clicks on the red field within a barrier interval (shown as the darker shade in the figures), a window is displayed that shows a list of the instructions that have caused the communication in this interval. This list consists of store operations that in earlier intervals wrote to pages accessed in this interval. The list is sorted according to the time it took to transmit the corresponding diff. If the same instruction accessed several different pages, the transmission time for each page diff is accumulated. For each instruction the program counter value is displayed together with information on the number of accesses that the instruction has done, the total size of all diffs associated with that instruction and, finally, the total transmission time.

When the user selects an instruction from the list, a popup window displays the line number, procedure name and source file where the source code line generating the instruction is located. The instruction list window and instruction location window can be seen in figure 4. This information is used by the programmer to locate performance bottlenecks and understand what in the program is causing the degradation of execution.
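As a sketch of the bookkeeping behind this instruction list, the following shows one way the per-instruction statistics could be accumulated from the logged write and diff records; the structure and function names are our own illustrative assumptions, not Papp's actual data structures.

    /* Illustrative accumulation of per-instruction diff statistics (not Papp's code). */
    #include <stddef.h>

    struct instr_stat {
        unsigned long pc;          /* program counter of the store instruction */
        unsigned long accesses;    /* number of logged shared writes           */
        unsigned long diff_bytes;  /* total size of all diffs tied to this pc  */
        double        diff_time;   /* accumulated transmission time (seconds)  */
    };

    /* Add one (pc, diff size, diff transmission time) record to the table. */
    void account(struct instr_stat *tab, size_t *n,
                 unsigned long pc, unsigned long bytes, double time)
    {
        size_t i;
        for (i = 0; i < *n; i++) {
            if (tab[i].pc == pc) {           /* same instruction, possibly another page */
                tab[i].accesses++;
                tab[i].diff_bytes += bytes;
                tab[i].diff_time  += time;   /* per-page diff times are accumulated */
                return;
            }
        }
        tab[*n].pc = pc;                     /* first record for this instruction */
        tab[*n].accesses = 1;
        tab[*n].diff_bytes = bytes;
        tab[*n].diff_time = time;
        (*n)++;
    }
    /* The table is then sorted by diff_time before being shown to the user. */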


Table 2. Execution time components.

Execution time component    Comment
Busy                        The processor is performing useful work
Lock acquisition            Lock processing and waiting to acquire a lock
Barrier                     Barrier processing
Sigio                       I/O interrupt processing when the processor was busy
Acquiring shared data       Processing the segmentation fault and waiting for the shared data to arrive
Idle                        Waiting for other processors to arrive at a barrier
Papp overhead               Overhead for tracing and measurements

Let us now discuss the data acquisition that forms the basis for the post-mortem analysis.

3.2. Data acquisition

There are two types of information that are needed in Papp: timing measurements to produce the visual display, and information about writes to shared memory in order to associate performance-degrading writes with source code lines. This information is collected at run-time and output to file for post-mortem analysis during barrier processing, so as to disturb the relative execution of the processors as little as possible.

In this section we first describe how the timing measurements are performed and then how in-line tracing of writes is employed to keep track of the sources of interprocess communication.

3.2.1. Timing measurements A barrier interval ends, and the next one starts, when a process exits the barrier procedure. We measure the time from the start of a barrier interval until the next barrier. This time is hopefully mostly busy time, but it also includes time for lock acquisitions, segmentation faults (Segv), I/O interrupt processing (Sigio), trace code time and other overheads, summarised in table 2. We measure these overhead times separately and deduct them from the time between two barriers in order to obtain the actual busy time.

The overhead due to Papp consists of the time to process the trace code (see section 3.2.2) and the time for file output. There is a small amount of time not accounted for, mainly because the start of one measurement and the end of another do not overlap exactly. The time to execute the extra assembly instructions needed to call the trace subroutine, as described in the next section, is also not accounted for. Furthermore, there is an overhead caused by the system call used to obtain the current time, which is difficult to estimate and in addition differs between systems.
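The following sketch illustrates the accounting this implies: the elapsed time between two barrier exits is measured, and the separately measured overheads are deducted to obtain the busy time. The use of gettimeofday and the structure layout are our own assumptions for illustration, not the actual Papp implementation.

    /* Per-interval timing bookkeeping (illustrative sketch, not Papp's code). */
    #include <sys/time.h>

    struct interval_stat {
        double lock, barrier, sigio, segv, trace, busy;   /* seconds */
    };

    static double now_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Called when a process exits the barrier procedure: closes the previous
     * interval, whose overheads have been accumulated separately. */
    static void end_interval(struct interval_stat *s, double interval_start,
                             double interval_end)
    {
        double elapsed = interval_end - interval_start;
        /* busy time is what remains after deducting the measured overheads */
        s->busy = elapsed - (s->lock + s->barrier + s->sigio + s->segv + s->trace);
    }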


L..111:
        lwz     9,0(29)
        lwzx    11,31,11
        slwi    9,9,2
        # insert trace code
        stwu    1,-512(1)
        stmw    3,72(1)
        add     3,9,11
        mfxer   15
        stw     15,196(1)
        mflr    16
        stw     16,188(1)
        mfctr   17
        stw     17,64(1)
        mfcr    18
        stw     18,56(1)
        stw     0,204(1)
        bl      .trace_st       # call trace_st
        cror    31,31,31
        lwz     0,204(1)
        lwz     18,56(1)
        lwz     17,64(1)
        lwz     16,188(1)
        lwz     15,196(1)
        mtcrf   0xff,18
        mtctr   17
        mtlr    16
        mtxer   15
        lmw     3,72(1)
        addi    1,1,512
        # end of trace code
        stfsx   31,9,11         # store instruction
        .line   28
L..93:

Figure 5. POWER2 assembly code with trace call.

3.2.2. Assembly tracing In order to keep track of the write operations to shared memory, we insert a call to a trace routine in the assembly code for each store instruction that is not to a stack-pointer relative address.

Figure 5 shows the block of POWER assembly code inserted before all non stack-relative store instructions. Since the trace code is inserted directly in the assembly code without a live-register analysis, we must be very careful to save the processor state before calling the trace routine.


Table 3. Execution time overhead using Papp tracing.

                                        SOR                            SORhigh
Version of TreadMarks                   Execution time (s)  Slowdown   Execution time (s)  Slowdown
Original                                  25.37              1.0         220.19              1.0
Original + trace code, no analysis        67.00              2.6         287.82              1.3
With full tracing and analysis          1208.44             47.6        2714.24             12.3

We do this through an extra activation frame on the stack where we can save all registers during the trace call. The trace routine collects information on the program counter values and destination addresses of all shared data writes.
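A minimal sketch of what such a trace routine might record is shown below. The name trace_st matches the call in figure 5, but the argument convention, the buffer layout and the barrier-time flushing are our own assumptions for illustration, not the actual implementation.

    /* Illustrative trace routine for shared-memory stores (not the actual Papp code). */
    #include <stdio.h>

    struct write_record {
        unsigned long pc;      /* program counter of the store instruction */
        unsigned long addr;    /* destination address of the store         */
    };

    #define TRACE_BUF_SIZE 65536
    static struct write_record trace_buf[TRACE_BUF_SIZE];
    static unsigned long trace_count;

    /* Called from the inserted assembly stub before every non stack-relative store. */
    void trace_st(unsigned long pc, unsigned long addr)
    {
        if (trace_count < TRACE_BUF_SIZE) {
            trace_buf[trace_count].pc = pc;
            trace_buf[trace_count].addr = addr;
            trace_count++;
        }
    }

    /* Flushed to file during barrier processing so that the perturbation of the
     * relative execution of the processors is kept small. */
    void trace_flush(FILE *log)
    {
        fwrite(trace_buf, sizeof(struct write_record), trace_count, log);
        trace_count = 0;
    }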

The tracing, processing and disk I/O involved in the statistics gathering procedure is of course associated with considerable overhead. Table 3 shows how much the execution time is affected for two of the algorithms studied in section 4, SOR and SORhigh. These numbers come from executions of the programs on a small network of Sun/ELC workstations on a 10 Mbit/s Ethernet. The amount of overhead due to tracing is roughly the same in both SOR and SORhigh, but the execution time is much higher for SORhigh. Therefore the slowdown when tracing SOR is much higher than when tracing SORhigh.

The overhead may seem alarming, but for the class of programs considered in this paper, barrier-coordinated programs, it does not affect the behaviour of the application much. Recall that when barriers are used as the primary synchronisation primitive and with the lazy release consistency memory model, there are no data races or races for locks. The memory access behaviour between two barriers is thus always the same, irrespective of how the overheads are distributed.

Let us now discuss the use of Papp to analyse the performance behaviour of two different applications: SOR and FFT.

4. Case studies

The objective of this section is to demonstrate the visualisation technique and how it can be used to explain interprocess communication in software DSM applications. We first describe the two experimental platforms used before going into detail on the applications.

4.1. Experimental platforms

We have used two different system architectures in our experiments. They illustrate how the speed of the interconnecting network affects the interprocess communication overhead.

Four Sun ELC workstations interconnected by a 10 Mbit/s Ethernet have been used as an example of a system with slow interprocess communication. The network is shared with other users doing various work and the execution is thus perturbed in an unpredictable way.


[Figure 6 diagram: timelines for processors P0–P3 showing, within one iteration, the calculation of the black matrix using red, a barrier, and the calculation of the red matrix using black.]

Figure 6. An iteration of the SOR calculation.

As an example of a system connected with a high-bandwidth network we have used an IBM SP2 with a total of 110 nodes. However, in our experiments we have only used up to 16 nodes. Each node in the SP2 machine basically consists of a POWER2-architecture RS/6000 workstation running at 66.7 MHz. The nodes share data over a high-performance switch with a 40 MB/s channel bandwidth.

4.2. SOR

4.2.1. Algorithm SOR implements red-black Successive Over-Relaxation on an m by 2n grid. It is an iterative procedure used to solve partial differential equations. The sequential algorithm uses two matrices, red and black. In order to produce a new value in the black matrix, a weighted mean of the elements neighbouring the corresponding element in the red matrix is calculated. One of these belongs to the row above the element being calculated and one to the row below. The work is parallelised so that each processor is responsible for a number of consecutive rows. After the black matrix has been calculated, a barrier synchronises the computation and computation starts on the red matrix. This ends one iteration, as illustrated in figure 6.

In order to calculate boundary rows, elements that belong to another processor have to be accessed, and that is the origin of the interprocess communication shown in figure 6, where processor P1 needs a value previously produced by processor P2.

We also have a second version of SOR called SORhigh. The difference is how the grid is stored in memory. SOR stores a row consecutively in memory while SORhigh stores columns consecutively in memory. This is of decisive importance to the performance.
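The following is a minimal sketch of the parallel structure described above: each processor updates its block of consecutive rows of the black matrix from the red matrix, a barrier follows, and the roles are then reversed. The array dimensions, loop bounds and the assumed Tmk_barrier signature are illustrative only; the actual SOR program differs in detail.

    /* Illustrative red-black SOR iteration with row-block partitioning
     * (sketch only; not the actual TreadMarks SOR benchmark). */
    #define M 250
    #define N 250

    extern float (*red)[N];     /* shared matrices, allocated by processor 0 */
    extern float (*black)[N];
    extern int proc_id, nprocs;
    extern void Tmk_barrier(unsigned id);    /* TreadMarks barrier (assumed signature) */

    void sor_iteration(void)
    {
        int first = proc_id * (M / nprocs) + 1;       /* first interior row of my block */
        int last  = (proc_id + 1) * (M / nprocs);     /* last row of my block           */
        int i, j;

        /* update my rows of the black matrix from the red matrix;
         * the first and last rows read boundary rows owned by neighbours */
        for (i = first; i < last && i < M - 1; i++)
            for (j = 1; j < N - 1; j++)
                black[i][j] = 0.25f * (red[i - 1][j] + red[i + 1][j] +
                                       red[i][j - 1] + red[i][j + 1]);
        Tmk_barrier(0);

        /* then the red matrix is updated from black in the same way */
        for (i = first; i < last && i < M - 1; i++)
            for (j = 1; j < N - 1; j++)
                red[i][j] = 0.25f * (black[i - 1][j] + black[i + 1][j] +
                                     black[i][j - 1] + black[i][j + 1]);
        Tmk_barrier(1);
    }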


Figure 7. SORhigh calculating a 250 × 250 matrix running on four SUN workstations.

Storing rows consecutively in memory means that each row will occupy a minimum number of different virtual pages while each column will occupy the maximum number of pages. Storing columns consecutively in memory will render the opposite situation.

4.2.2. Comparing SORhigh with SOR Figure 7 shows the Papp display for 10 iterations of SORhigh executed on the Sun platform. The darker shaded areas represent memory access overhead and we can thus see that the amount of overhead is very high. The reason for this high overhead lies in the way the work is distributed among the processors, as described in the previous section.

The initialisation phase, which consists of the allocation of shared memory and the initialisation of the shared matrices, is done during interval 0. Since only processor 0 is involved in this, the other three processes spend their time in the idle state waiting for processor 0 to finish.

In interval 1 accesses to shared memory are done for the first time by all processors. This is the first part of the SOR algorithm. Since all shared memory is initialised by processor 0 and consequently resides in the memory of processor 0, it has to be distributed to the other processors so that they can access their allotted rows in the matrix. This process is reflected in the diagram where processors 1 through 3 spend most of their time in interval 1 receiving data due to such cold misses. Processor 0 also has a small amount of shared data acquisition overhead. This is because the first write to shared memory causes some overhead when TreadMarks keeps track of which pages have been changed. Processor 0 otherwise spends most of its time administering requests for shared memory from other processes, which is shown in the diagram by the lighter shaded field indicating the amount of Sigio time.

Interval 2 shows the time spent calculating the second part of the SOR algorithm. It is to a large extent similar to the first part, but uses a different set of elements as input. The only elements that have to be requested from other processors are the elements belonging to boundary rows.


Figure 8. SOR calculating a 250 × 250 matrix running on four SUN workstations.

Because of the matrix layout in memory, however, the amount of communication is high and roughly of the same size as in interval 1, despite the fact that the number of matrix elements transferred is lower in interval 2 than in interval 1.

In the following intervals the two parts of the SOR algorithm are iterated 10 times. Since the memory access pattern is repeated every two intervals, the distribution of the execution time is the same throughout the last 19 intervals.

In SORhigh the processors spend more time requesting data from other processors than on the calculations solving the problem. This shows that it would probably pay off to examine where this communication emanates from.

The trace of the other implementation of the SOR algorithm, simply called SOR, is visualised in figure 8. Since the matrix being calculated on is laid out in memory in another fashion than in SORhigh, the amount of communication is much lower.

Interval 0 is the same as for SORhigh since all accesses to shared memory are done by one processor only. In interval 1, however, where the distribution of data takes place, the amount of communication is much lower than in SORhigh despite the fact that the size of the matrix being distributed to each processor is the same. This is a consequence of how the matrix resides in memory; in SORhigh each processor accesses a higher number of virtual pages, which causes a large amount of overhead in TreadMarks because many small diff messages are sent.

The rest of the barrier intervals clearly show that the high communication in SOR is a transient state caused by the initialisation and distribution of data, as opposed to SORhigh where communication is caused by page accesses that overlap to a great extent. The amount of idle time is quite high. The reason for this is that the idle time also contains time spent waiting for other processors to finish their measurement and tracing overhead.

The layout of the matrix in memory, in combination with the way the matrix is divided among processors, thus has a first-order effect on the performance. Storing rows consecutively in memory means that when a processor accesses a row that has been previously written by another processor, it only has to request diffs from a minimum number of pages.


Figure 9. Four iterations of 3DFFT with a 32 × 32 × 32 cube running on 4 processors on an IBM SP2.

When columns are stored consecutively, as in SORhigh, the number of pages whose diffs have to be requested is greater, but each diff is smaller since each page contains a smaller number of elements from a specific row. Since each diff causes overhead, both in size and in the time the OS needs to handle it, a large number of small diffs is more expensive in terms of execution time than transferring a smaller number of diffs whose individual size is larger.
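As a back-of-the-envelope illustration of this layout effect, the following computes how many virtual pages one boundary row touches under the two layouts, assuming 4-byte elements, 4096-byte pages and the 250 × 250 grid from the case study (none of these figures are stated explicitly in the paper).

    /* Rough page-count arithmetic for one boundary row (illustrative assumptions:
     * 4-byte float elements, 4096-byte pages, 250 x 250 grid). */
    #include <stdio.h>

    int main(void)
    {
        const int n = 250;            /* elements per row and per column */
        const int elem = 4;           /* bytes per element (assumed)     */
        const int page = 4096;        /* virtual page size (assumed)     */

        /* rows consecutive (SOR): a row is one contiguous 1000-byte block,
         * which touches at most two pages if it straddles a page boundary */
        int pages_row_major = (n * elem + page - 1) / page + 1;

        /* columns consecutive (SORhigh): consecutive row elements are a whole
         * column (n * elem = 1000 bytes) apart, so each page holds only ~4 of them */
        int elems_per_page = page / (n * elem);                          /* = 4    */
        int pages_col_major = (n + elems_per_page - 1) / elems_per_page; /* ~63    */

        printf("row-major: <= %d pages, column-major: ~%d pages\n",
               pages_row_major, pages_col_major);
        return 0;
    }

Under these assumptions a boundary row occupies at most two pages when rows are stored consecutively, but spreads over roughly 60 pages when columns are stored consecutively, which is exactly the many-small-diffs situation described above.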

4.3. 3DFFT

4.3.1. Algorithm 3DFFT is a program that numerically solves a partial differential equation using three-dimensional (3-D) forward and inverse FFTs iteratively. The input data, X0, is a 3-D array with n1 × n2 × n3 complex elements stored as a vector in memory with the real and imaginary parts stored separately. The real and imaginary values are separated by the n3 elements in the last dimension.

Each processor calculates the values in one dimension of the 3-D matrix. The (inverse) FFTs in the first two dimensions can be carried out entirely without communication. For the last dimension, the matrix is transposed so that the computation once again can be carried out locally. Because of the way the transpose is done, the amount of interprocess communication generated depends on the size of the input array. The workload size thus has a first-order effect on the computation-to-communication ratio.

4.3.2. Analysing 3DFFT at different workloads Figure 9 shows the execution behaviour of four iterations of 3DFFT executed on an IBM SP2 using four processors with an array of size 32 × 32 × 32. We can see from the display that there is a fair amount of overhead in the iterative part.


Figure 10. 3DFFT with a 64 × 64 × 64 cube running on 4 processors on the IBM SP2.

In intervals 0 and 1 the memory that is to be shared by all processors is allocated by processor 0 and each processor initialises its part of the data. The first half of the forward FFT, done in interval 2, does not cause any interprocess communication since all elements were already accessed during interval 1 and are thus resident in each processor's memory. The second half of the forward FFT is done in interval 3, and the matrix transpose done in this interval involves accesses to elements belonging to other processes. This causes the communication overhead seen in interval 3. The same barrier interval also contains the first half of an inverse 3-D FFT that is the start of the iterative algorithm. The first half of the inverse 3-D FFT only involves accesses to elements belonging to the respective processor, so it does not contribute to the communication seen.

Interval 4 contains the second half of the inverse 3-D FFT involving a transpose. From the links maintained by Papp to the source code we find that this transpose is the cause of the large amount of overhead seen in barrier 4. During the matrix transpose each process reads elements belonging to other processors and stores them in its own elements. Reading and storing are done in two different arrays to avoid data races. After the matrix transpose, each process accesses its own elements in its calculations. Interval 5 repeats the first half of the inverse FFT done in interval 3 and marks the start of the second of the four iterations.

Let us now study the execution of 3DFFT on a 64 × 64 × 64 matrix instead.

As can be seen in figure 10, the amount of overhead is much smaller than in figure 9. The reason for this is that the interprocess communication caused by the matrix transpose depends on the size of the virtual page. During a matrix transpose each processor copies elements belonging to other processors to its own elements. This copying consists of reading blocks of elements from the other processors' memory. The number of blocks each processor reads depends on the number of processors involved in the computation. The more processors are involved, the higher the number of blocks. The size of each block depends on the work load and the number of processors involved. The bigger the work load, the bigger the blocks.


When transposing a 64 × 64 × 64 matrix using four processors, the size of the blocks is an integral number of virtual pages. This means each page's diff is sent only once over the network since each page is accessed by one processor only. In contrast, transposing a 32 × 32 × 32 matrix on four processors means that each block is not an integral number of pages, and some processors will have to send diffs to more than one processor. And since each diff is large, because all elements are accessed at some point, the overhead is large.
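A rough sanity check of this page-alignment argument can be done with the following sketch, under assumptions of our own (8-byte elements, 4096-byte pages and a transpose that exchanges contiguous blocks of n × n / P elements); the paper does not state these figures, so the numbers are only indicative of the effect.

    /* Block-size versus page-size arithmetic (illustrative assumptions only:
     * 8-byte elements, 4096-byte pages, blocks of n*n/P contiguous elements). */
    #include <stdio.h>

    static void check(int n, int nprocs)
    {
        const int elem = 8, page = 4096;
        long block_bytes = (long) n * n / nprocs * elem;
        printf("n=%d, P=%d: block=%ld bytes, integral number of pages: %s\n",
               n, nprocs, block_bytes, block_bytes % page == 0 ? "yes" : "no");
    }

    int main(void)
    {
        check(64, 4);   /* 8192 bytes = 2 pages  -> each page diffed to one processor  */
        check(32, 4);   /* 2048 bytes = 1/2 page -> pages straddle block boundaries    */
        return 0;
    }

Under these assumptions the 64-cube blocks are exactly two pages each, so every page is diffed towards a single processor, whereas the 32-cube blocks are half a page, so pages straddle block boundaries and their diffs go to more than one processor.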

We have now seen how the Papp visualisation tool is helpful, first of all to understand how a parallel software DSM application functions and where the overheads are, but also how it can help to find the source of interprocess communication in a relatively complex application such as 3DFFT. The next section discusses some of the related work performed by other researchers.

5. Related work

Even though software DSM systems have been around for quite some time now, very little work has been done in the area of performance tuning tools for software DSM systems. In fact, the area of performance tuning tools for shared memory parallel systems at large is in many respects unresearched.

The most relevant work has been performed by Rajamony and Cox at Rice University [9]. Through an assembly instrumentation routine similar to ours, they analyse the data and control dependencies in a shared memory application at run-time. From the analysis they can then suggest a number of program transformations that would increase performance. They report results from an implementation of their method on TreadMarks that show performance improvements of from 1.32 to 34 times on a range of applications. In contrast to our approach, however, they do not have a visual interface to display the execution behaviour of an application.

Within the Paradyn project [7], Xu et al. have implemented a shared-memory performance profiling tool [12] for a network of workstations using the Blizzard-S fine-grain distributed shared memory system [10]. Since the Blizzard-S system invokes the coherence action routines on every shared memory access, they do not have to instrument their code to gather information, since this can be achieved directly in the coherence software. Thanks to the use of the powerful Paradyn statistics gathering techniques, they also achieve very little run-time overhead. Their tool identifies the data and source code location of coherence overheads, but fails to identify the sources of these coherence overheads, i.e., the write operations that earlier led to interprocess communication. This is in contrast to both Papp and the work reported by Rajamony, which can identify the real sources of interprocess communication.

6. Conclusions

The research community has worked hard to make software DSM systems a viable alternative for cost-effective parallel computing on networks of workstations. State-of-the-art software DSM systems such as TreadMarks make it possible to develop high-performance parallel applications for NOWs. However, considerable effort might be required to performance-tune an application because of the high network communication latencies involved.

We have in this paper presented a technique to visualise the execution of shared memory applications on software DSM systems, implemented on TreadMarks. By means of two case studies we have shown that the visualisation undeniably represents the execution behaviour, and that the technique used to gather run-time execution data can be used to find the source code origins of performance bottlenecks.

The main contributions of the work reported in this paper are:

• The identification of the information needed to provide feedback to the programmer about performance bottlenecks.

• A method to measure and visualise this information in order to present feedback in an easy-to-understand fashion to the programmer.

However, there is still much work to be done in order to make Papp a practical and useful tool for parallel application developers. Although the tool supports links back to the source code, the current implementation of Papp does not aid the programmer in how to restructure the program so as to reduce the interprocess communication overhead. We also lack an identification of oversynchronisation. Techniques that address these deficiencies are under development elsewhere, such as by Rajamony and Cox [9], and we plan to adapt their results to our visualisation environment.

Acknowledgments

The research in this paper was in part supported by the Swedish National Board for Industrial and Technical Development (NUTEK) under project number P855 and by the Swedish Council for High Performance Computing (HPDR) and Parallelldatorcentrum (PDC), Royal Institute of Technology.

References

1. C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu and W. Zwaenepoel. TreadMarks: Shared Memory Computing on Networks of Workstations. IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996.

2. M. Brorsson. SM-prof: A Tool to Visualise and Find Cache Coherence Bottlenecks in Multiprocessor Programs. In Proceedings of the 1995 ACM SIGMETRICS International Conference on Measurement & Modeling of Computer Systems, pp. 178-187, Ottawa, Canada, May 1995.

3. M. Brorsson. Performance Tuning of Small Scale Shared Memory Multiprocessor Applications using Visualisation. In Proceedings of the 10th International Conference on Parallel and Distributed Computing Systems, New Orleans, October 1997.

4. M. T. Heath, A. D. Malony and D. T. Rover. The Visual Display of Parallel Performance Data. IEEE Computer, pp. 21-28, November 1995.

5. P. Keleher, S. Dwarkadas, A. L. Cox and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the Winter 94 Usenix Conference, pp. 115-131, January 1994.

6. M. Kral. Programmer's Aid for Parallel Programming. MSc thesis, Department of Information Technology, Lund University, P.O. Box 118, SE-221 00 Lund, Sweden.

7. B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam and T. Newhall. The Paradyn Performance Tools. IEEE Computer, Vol. 28, No. 11, November 1995.

8. E. W. Parsons, M. Brorsson and K. C. Sevcik. Predicting the Performance of Distributed Virtual Shared Memory Applications. IBM Systems Journal, Vol. 36, No. 4.

9. R. Rajamony and A. L. Cox. Performance Debugging Shared Memory Parallel Programs Using Run-Time Dependence Analysis. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Seattle, WA, June 1997.

10. I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus and D. A. Wood. Fine-grain Access Control for Distributed Shared Memory. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI), pp. 297-307, October 1994.

11. J. P. Singh, W.-D. Weber and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5-44, March 1992.

12. Z. Xu, J. R. Larus and B. P. Miller. Shared-Memory Performance Profiling. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'97), Las Vegas, Nevada, 1997.