


Software Technologies for High-Performance Parallel Signal Processing

Jeremy Kepner and James Lebak

Lincoln Laboratory Journal, Volume 14, Number 2, 2003

■ Real-time signal processing consumes the majority of the world’s computing power. Increasingly, programmable parallel processors are used to address a wide variety of signal processing applications (e.g., scientific, video, wireless, medical, communication, encoding, radar, sonar, and imaging). In programmable systems the major challenge is no longer the speed of the hardware but the complexity of optimized software. Specifically, the key technical hurdle lies in mapping an algorithm onto a parallel computer in a general manner that preserves performance while providing software portability. We have developed the Parallel Vector Library (PVL) to allow signal processing algorithms to be written with high-level mathematical constructs that are independent of the underlying parallel mapping. Programs written using PVL can be ported to a wide range of parallel computers without sacrificing performance. Furthermore, the mapping concepts in PVL provide the infrastructure for enabling new capabilities such as fault tolerance and self-optimization. This article discusses PVL with a particular emphasis on quantitative comparisons with standard parallel signal programming practices.

Real-time signal processing is critical to a wide variety of applications, including radar signal processing, sonar signal processing, digital encoding, wireless communication, video compression, medical imaging, and scientific data processing. These applications feature large and growing computation and communication requirements that can be met only through the use of multiple processors and fast low-latency networks. A balance between computation and communication is one of the key characteristics of this type of processing. Advanced software techniques are needed to manage the large number of processors and complex networks required and to achieve the desired performance.

Military sensing platforms employ a variety of signal processing systems at all stages: initial target detection, tracking, target discrimination, intercept, and engagement assessment. Although high-performance embedded computing is widely used throughout commercial enterprises, the Department of Defense (DoD) often has the most demanding requirements, particularly for radar, sonar, and imaging sensor platforms. Figure 1 shows many of these platforms, along with a graph that illustrates the growth in computational requirements for future real-time signal processing applications. The challenge for these systems is the cost-effective implementation of complex algorithms on complex hardware. This challenge is made all the more difficult by the need to stay abreast of rapidly changing commercial off-the-shelf (COTS) hardware. The key to meeting this challenge lies in utilizing advanced software techniques that allow new hardware to be inserted while preserving the software investment.


FIGURE 1. Military platforms are among the largest drivers of embedded real-time signal processing applications. Department of Defense (DoD) systems in the 2010 time frame are projected to require multiple teraflops of computation.

The standard approach to writing high-performance signal processing applications has been to use vendor-supplied computation and communication libraries that are highly optimized to a specific vendor platform and guaranteed to deliver the highest possible performance. Unfortunately, this approach usually leads to software that is difficult to write, is not compatible with other platforms, and is dependent on the number of processors on which the application is deployed.

The Parallel Vector Library

Our answer to the challenge of writing reusable software has been to develop the Parallel Vector Library (PVL). The major goals of the library are (1) to make the job of writing distributed signal processing software easier, (2) to provide for portability of application-level code, (3) to separate the job of mapping distributed data and computations from the job of signal processor development, and (4) to achieve high performance on many different parallel platforms.

The primary technical challenge in deploying parallel processing systems is the assignment of different parts of the algorithm to processors in the parallel computing hardware. This process, illustrated in Figure 2, is referred to as mapping. The mapping of an application needs to be optimized for each system the application runs on. Thus, for a program to run on a variety of systems, the algorithm description must not depend on details of the parallel hardware; that is, it must be map independent.

Map independence is an important element of portability that goes beyond the simple, standards-based portability of recent efforts. Standards such as the Message-Passing Interface (MPI) [1] and the Vector, Signal, and Image Processing Library (VSIPL) [2] provide a portable interface to parallel applications, that is, one that is the same for different platforms. However, unless special care is taken with the design of a program using MPI and VSIPL, the application code must be modified every time the number of hardware elements used changes, as illustrated in Figure 3. Such modifications may be required, for example, when moving the application program from a single-processor prototyping environment to a full-scale deployed parallel system, or when moving an application from one generation of technology to another.

[Figure 1 plot: computational requirements in teraflops (0.001 to 100, log scale) versus year (1990 to 2010), showing near-term applications and the projected growth curve.]


FIGURE 2. Parallel mapping is the process of assigning different parts of an algorithm to separate processing elements in the parallel-computing hardware. Mapping needs to be optimized for each system the application runs on. Here a three-stage pipeline, consisting of a beamformer, a finite impulse response (FIR) filter, and a detector, is mapped to distinct sets of processors in a parallel computer. A program that runs on a variety of systems, and doesn’t depend on the specific details of the parallel hardware implementation, is said to be map independent.

PVL avoids the need for such modifications through the concept of map independence.

Map independence is achieved by raising the level of abstraction at which signal processing applications are written. In current practice, it is common for the application writer to have to deal with multiple objects describing the same memory area. For example, a buffer containing a vector may need an MPI object for communication operations and a VSIPL object for computation operations.

FIGURE 3. The challenge of portability and scalability in parallel systems. Consider a basic algorithm and mapping, where Stage 1 is mapped to processor nodes 1 and 2 and sends data to Stage 2, which is mapped onto nodes 3 and 4. Existing software approaches typically hard-code the processor mapping information directly into the application, and thus require significant changes to run the application on a different number of processors. When processor nodes 5 and 6 are added to Stage 2, the code for the new mapping must be modified, as shown below. The challenge is to create algorithms that are map independent, and don’t need modifications when moved to different processing systems.

[Figure 2 diagram: a signal processing algorithm with three stages (Beamform: Xout = W*Xin; Filter: Xout = FIR(Xin); Detect: Xout = |Xin| > c) mapped onto sets of processors in a parallel computer.]

[Figure 3 diagram: Stage 1 runs on processors 1 and 2; Stage 2 runs on processors 3 and 4 under mapping 1, and on processors 3 through 6 under mapping 2.]

Code for mapping 1:

while (!done) {
    if (rank() == 1 || rank() == 2) stage1();
    else if (rank() == 3 || rank() == 4) stage2();
}

Code for mapping 2:

while (!done) {
    if (rank() == 1 || rank() == 2) stage1();
    else if (rank() == 3 || rank() == 4 ||
             rank() == 5 || rank() == 6) stage2();
}
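For contrast, a map-independent version of the same loop can read its stage-to-node assignment at run time, so the control flow never changes when the mapping does. The sketch below is not PVL's actual interface: load_maps and run_stage are hypothetical stand-ins, and a real system would read the assignment from an external map file rather than a literal.

#include <mpi.h>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-ins for illustration only.
std::vector<std::set<int>> load_maps() {
    // Mapping 1; mapping 2 would simply return {{1, 2}, {3, 4, 5, 6}}.
    return {{1, 2}, {3, 4}};
}
void run_stage(std::size_t s) { /* computation for stage s */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<std::set<int>> stage_nodes = load_maps();
    // This loop is map independent: adding processors to a stage changes
    // only the data returned by load_maps(), not the application code.
    for (int frame = 0; frame < 100; ++frame)
        for (std::size_t s = 0; s < stage_nodes.size(); ++s)
            if (stage_nodes[s].count(rank))
                run_stage(s);
    MPI_Finalize();
    return 0;
}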


FIGURE 4. Evolution of the Parallel Vector Library (PVL). Over the past decade, there has been a continuous transition of computing technology from non-real-time scientific computing to the real-time domain, and the development focus has shifted from procedural languages such as Fortran to object-oriented languages such as C++. Thus much of the functionality of the Fortran-based linear algebra package (LAPACK) was made suitable for embedded computing in the Vector, Signal, and Image Processing Library (VSIPL). The deployment of the Message-Passing Interface (MPI) made possible the development of the Fortran-based scalable version of LAPACK, called ScaLAPACK. PVL, which evolved from the earlier C-based parallel Space-Time Adaptive Processing Library (STAPL), builds on MPI and VSIPL in a way analogous to the way ScaLAPACK builds on MPI and LAPACK, also incorporating new object-oriented technologies such as those in the Portable Expression Template Engine (PETE) that makes mathematical programming easier and more efficient.

Coordinating these descriptions can be a tedious and error-prone task. PVL provides objects useful for both computation and communication, and handles the coordination of these operations without explicit direction from the application programmer. This coordination has the added benefit of making the job of writing distributed signal processing software easier.

PVL is based on earlier work on the Space-Time Adaptive Processing Library (STAPL), which was used to field a 1000-CPU embedded signal processor [3]. PVL incorporates the lessons learned from the STAPL project into a new object-oriented library written in the C++ programming language. C++ was selected because of its growing acceptance in the embedded community and its powerful abstraction capabilities that allow more to be done with fewer lines of code. Figure 4 shows how PVL's capabilities build on industry standards and previous libraries. The base version of PVL is actually built on the open standards of VSIPL and MPI to provide an added level of portability. PVL has been successfully implemented on a wide range of workstation, cluster, and embedded architectures, as illustrated in Figure 5.

In summary, PVL achieves its goals of increasing productivity and portability by allowing the independence of map and application and increasing the level of abstraction. We discuss the implementation of these concepts in more detail in the following section on the PVL programming model. The subsequent section on PVL performance shows how we achieve these goals without sacrificing our primary goal of high performance. A later section describes ongoing research involving PVL.

[Figure 4 diagram: library lineage arranged over time and by applicability, from scientific (non-real-time) computing to real-time signal processing. The Fortran-based single-processor library LAPACK leads, via the parallel communications layer MPI, to the Fortran, object-based parallel library ScaLAPACK; analogously, the C, object-based libraries VSIPL and MPI, together with PETE and the earlier STAPL, lead to the C++, object-oriented parallel library PVL.]


FIGURE 5. PVL layered architecture. The main purpose of a software library as middleware is to insulate the application from the hardware. The user writes an application in the high-productivity layer. The middleware implements constructs necessary to deliver high performance on parallel computers. The portability of the middleware is achieved by isolating the hardware-specific details to a math kernel such as VSIPL and a messaging kernel such as MPI.

[Figure 5 diagram: a user interface separates the application (productivity) layer, with input, analysis, and output, from the Parallel Vector Library (performance) layer, with vector/matrix, computation, conduit, and task objects supported by map, grid, and distribution objects; a hardware interface with a math kernel (VSIPL) and a messaging kernel (MPI) separates the library from the hardware (portability) layer: workstations, Intel and PowerPC clusters, embedded multicomputers, and embedded boards.]

The PVL Programming Model

Real-time signal processing applications generally make use of two types of parallelism. The first is data-parallel operations on vectors and matrices, which means operating on a vector or matrix that is itself distributed. The second is task-parallel operation, which may be used to construct pipelines or to process incoming data sets in a round-robin fashion. A PVL program can make use of both of these types of parallelism.

The core capability of PVL is that of assigning, or mapping, the portions of a parallel program to be executed on or stored in specific components of the machine. The assignment information is contained in a map object. PVL programs are written using four user-level signal processing and control objects: data objects (i.e., vectors and matrices), computations, tasks, and conduits. These objects are illustrated in Figure 6. The mapping of the user-level objects to processors is controlled by the mapping objects shown in the same figure.

In broad terms, users obtain task parallelism by writing an application as a set of tasks and conduits. A task represents a scope for a particular computation stage; the effect of multiple tasks is to allow different processor groups to perform different computations at the same time. Within a task, operations are performed with vectors, matrices, and computations, using a single-program multiple-data paradigm that allows for data parallelism. Conduits allow communication of data objects between different scopes or tasks.

The overall effect of these objects is to make it possible to write multistage algorithms independent of the underlying hardware. For example, a simple data-parallel implementation of a basic frequency-domain filter can be constructed, as shown in Figure 7(a).


A complete filtering system can then be constructed by inserting the basic filtering algorithm into a task and adding appropriate input tasks and output tasks connected by conduits, as shown in Figure 7(b). In each case, the application code does not include any references to the number of processors being used and can therefore be mapped to any parallel architecture.

In the following sections, we give more detail regarding the mapping, data-parallel features, and task-parallel features of PVL.

FIGURE 7. (a) Development of a basic frequency-domain filtering application, using an FFT and an inverse FFT (IFFT) on input data. PVL vector/matrix and computation objects allow signal processing algorithms to be implemented quickly by using high-level constructs. (b) PVL tasks and conduit objects allow complete filtering systems to be built.

[Figure 7 diagrams: (a) Xin is transformed by an FFT, multiplied by the weight vector W, and inverse transformed by an IFFT to produce Xout; (b) an input task reading files feeds, through an input conduit, an analysis task computing Xout = IFFT[W*FFT(Xin)], which feeds an output task writing files through an output conduit.]

FIGURE 6. PVL objects. PVL is based on four basic user-level signal processing and control objects (highlighted in color): vectors and matrices, computations (for example, a fast Fourier transform, or FFT), tasks, and conduits. Each of these is independently mappable onto a grid of processors.

[Figure 6 table: PVL signal processing, control, and mapping classes; the pairing of the parallelism column with the Computation and Grid rows is reconstructed from context, since the extracted figure lists these entries separately from the class rows.]

Class           Description                                                  Parallelism
Vector/matrix   Used to perform matrix/vector algebra on data spanning      Data
                multiple processors
Computation     Performs signal/image processing functions on matrices/    Data and task
                vectors (e.g., FFT, FIR, QR decomposition)
Task            Supports algorithm decomposition (i.e., the boxes in a      Task and pipeline
                signal flow diagram)
Conduit         Supports data movement between tasks (i.e., the arrows     Task and pipeline
                on a signal flow diagram)
Map             Specifies how tasks, vectors/matrices, and computations    Data, task, and pipeline
                are distributed on processors
Grid            Organizes processors into a two-dimensional layout         Data, task, and pipeline



Maps

Maps consist of a set of processor nodes and a description of how those nodes are to be used. Objects that can have a map are called mappable. There are three general classes of mappable type, and three corresponding types of maps. User functions, data, and calculations are represented in the library by tasks, distributed data objects, and distributed computation objects, respectively. Conduits are not themselves mappable types, but each endpoint of a conduit is a task object.

The map contains all information specific to how the object is placed onto the processors. Within each map is a two-dimensional grid with a specific list of nodes and a distribution description specific to the type of object. For example, a distribution description for a data object such as a matrix or vector determines how data are to be distributed among processors (block, cyclic, or block-cyclic) and includes an overlap description that describes whether elements should be duplicated on multiple processors (this may be done, for example, to support boundary conditions). Figure 8 shows the structure of a map object.
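A rough C++ sketch of the information such a map carries is shown below. The type and field names are invented for illustration; they are not PVL's actual declarations.

#include <vector>

// Illustrative only: the approximate shape of a PVL-style map.
enum class Distribution { Block, Cyclic, BlockCyclic };

struct Grid {
    int rows, cols;          // two-dimensional processor layout
    std::vector<int> nodes;  // physical nodes, e.g., {0, 2, 4, 6, 8, 10}
};

struct Overlap {
    int boundary_elements;   // elements duplicated on neighboring processors
};

struct Map {
    Grid grid;               // which nodes, in what arrangement
    Distribution distribution;
    Overlap overlap;
};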

Data Parallel Operations

A PVL application program performs mathematical operations on distributed data objects (matrices and vectors). These operations are written at a high level, using mathematical expressions. Objects are assigned to particular nodes by using the map object. The distribution descriptions for data objects allow the user to specify an arbitrary block-cyclic distribution for each matrix or vector. (For a complete description of the block-cyclic data distribution, see the High-Performance Fortran specification [4].)

As an example, Figure 9 shows three simple distributions of a vector of length eight: (a) a block distribution on three processors, (b) a block distribution over two processors, and (c) a cyclic distribution over four processors. The application program does not explicitly specify the distribution, but merely includes a reference to the map, which is itself stored in an external file.

FIGURE 8. Structure of a PVL map object. All PVL user objects contain a map. Each map is composed of three components: a grid, a distribution description, and an overlap description. Within the grid is the list of physical nodes onto which the PVL object is mapped.

FIGURE 9. Example distributions of a vector of length eight: (a) block distribution on three processor nodes, (b) block distribution on two processor nodes, (c) cyclic distribution on four processor nodes.



The map can be changed without changing the application code. Also, vectors or matrices used in an expression are not required to have the same distribution. In the expression C = A + B, the vectors A, B, and C are not required to be distributed in the same way (though there may be compelling performance advantages to doing so).
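To make the distributions concrete, the following self-contained sketch computes which processor owns each element under a block-cyclic distribution; a block distribution is the special case of block size ceil(N/P), and a cyclic distribution the special case of block size 1. This is illustrative arithmetic, not PVL code.

#include <cstdio>

// Owner and local index of global element i under a block-cyclic
// distribution with block size b over P processors.
int owner(int i, int b, int P)       { return (i / b) % P; }
int local_index(int i, int b, int P) { return (i / (b * P)) * b + (i % b); }

int main() {
    const int N = 8;
    // The Figure 9 examples: block on three nodes (b = 3), block on two
    // nodes (b = 4), and cyclic on four nodes (b = 1).
    struct Case { const char* name; int b, P; };
    const Case cases[] = {{"block, P=3", 3, 3},
                          {"block, P=2", 4, 2},
                          {"cyclic, P=4", 1, 4}};
    for (const Case& c : cases) {
        std::printf("%-11s:", c.name);
        for (int i = 0; i < N; ++i)
            std::printf(" %d->p%d", i, owner(i, c.b, c.P));
        std::printf("\n");
    }
    return 0;
}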

PVL computation objects allow complex operations such as a fast Fourier transform (FFT) to be set up in advance and to be executed in a parallel manner. Advance setup, or early binding, of computation is common practice in embedded signal processing. As with computation objects (such as the FFT) in VSIPL, distributed computation objects provide a convenient place for the library to store constants and precalculated data without exposing them to the application programmer.

Furthermore, these objects allow computations to be mapped to special-purpose hardware, and they manage the communication into and out of that hardware. The map associated with the distributed computation object specifies where the computation takes place. It may also specify characteristics of the algorithm to be used to perform the computation; for example, the particular FFT algorithm to be used may be specified in the map of an FFT object.
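Early binding is the same pattern the FFTW library (used later in this article's measurements) exposes directly: an expensive planning step runs once, and a cheap execute step runs for every data set. The sketch below uses the modern FFTW3 interface, which postdates the work described here, so it illustrates the early-binding pattern rather than PVL's internals.

#include <fftw3.h>

int main() {
    const int n = 1024;
    fftw_complex* in  = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex* out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * n);

    // Setup (early binding): FFTW searches for a fast algorithm for this
    // length and caches twiddle factors and other constants in the plan.
    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    for (int frame = 0; frame < 1000; ++frame) {
        // ... fill 'in' with the next data set ...
        fftw_execute(plan);  // execution: no per-call setup cost
        // ... consume 'out' ...
    }

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}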

Task Parallel Operations

A PVL program consists of a set of tasks connected by conduits. Within a task, an application is written using distributed matrix and vector objects as previously described. Operations within a task are written using a single-program multiple-data paradigm. Distributed data objects are communicated between different tasks by using conduits. An individual data object exists only in the scope of one particular task at a time; by using a conduit, the data contained in these items can be sent to and received from other tasks in a non-blocking way, possibly with multibuffering. The use of the conduit abstraction separates the application programmer from the details of the communication service being used, and provides more opportunity to optimize the communication operation.

Conduits are critical because, in many signal processing applications, the primary challenge is not performing the computations but moving data so that the data are in a position to be acted upon. One of the most common examples of this data movement is the corner turn operation, which, in its simplest form, can be described as a matrix transpose operation. This operation is performed on almost every system with multiple input channels (such as radar, sonar, and communications); typically, processing within a particular channel is followed by processing that cuts across channels.

For a two-dimensional data object, a corner turn is necessary when two consecutive operations require access to data in different ways, one by rows and one by columns. A corner turn between the two operations permits each operation to operate on data stored in a favorable access pattern, maximizing memory performance. In a local corner turn, data associated with an object are merely moved in memory. In a distributed corner turn, all-to-all communication between processor groups may be required in addition to local data movement. By using the conduit object, the corner turn is no more difficult for the programmer to orchestrate than any other data movement.
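A minimal sketch of a distributed corner turn written directly against MPI, the machinery a conduit hides, follows. It assumes an N x N matrix of doubles with N divisible by the number of ranks P, each rank holding N/P contiguous rows in row-major order; after the exchange each rank holds N/P rows of the transpose.

#include <mpi.h>
#include <vector>

// Distributed corner turn: 'rows' holds this rank's (N/P) x N row slice;
// on return 'cols_t' holds this rank's (N/P) x N slice of the transpose.
void corner_turn(const double* rows, double* cols_t, int N, MPI_Comm comm) {
    int P;
    MPI_Comm_size(comm, &P);
    const int r = N / P;  // rows per rank (assumes N % P == 0)
    std::vector<double> sendbuf(N * r), recvbuf(N * r);

    // Pack: the r x r block bound for rank k is our rows restricted to
    // columns [k*r, (k+1)*r).
    for (int k = 0; k < P; ++k)
        for (int i = 0; i < r; ++i)
            for (int j = 0; j < r; ++j)
                sendbuf[(k * r + i) * r + j] = rows[i * N + k * r + j];

    // All-to-all exchange of the r x r blocks between all ranks.
    MPI_Alltoall(sendbuf.data(), r * r, MPI_DOUBLE,
                 recvbuf.data(), r * r, MPI_DOUBLE, comm);

    // Unpack, locally transposing each received block.
    for (int k = 0; k < P; ++k)
        for (int i = 0; i < r; ++i)
            for (int j = 0; j < r; ++j)
                cols_t[i * N + k * r + j] = recvbuf[(k * r + j) * r + i];
}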

Efficient performance of a collective communication operation requires a considerable amount of setup or early binding of communication. At program setup time, a PVL application program specifies the tasks that each conduit connects. During this connection operation, the PVL conduit object can compute in advance the source and destination of all messages, set up all necessary buffers, create multiple buffers if necessary, and set up special hardware for communication.

PVL Benefits

PVL supports both task and data parallelism. Data parallelism is a well-understood way to achieve speedup for dense matrix calculation. The major benefit of PVL's task-parallel features is that the stages may be run on separate hardware (for example, separate processors of a parallel machine). By breaking the system up into multiple stages, the system designer gains an extra level of parallelism. This process, which is referred to as pipelining, allows customization of the system resources for each individual stage.


An alternative approach would be to have all the system resources work on an individual data set in turn, before going on to the next one. The latency for each individual data set processed by the system will probably be greater in the pipelined scheme than in the uniform processing scheme. However, an increase in throughput is attained with the pipelined approach because several data sets are being processed at the same time. In addition, gains in efficiency may be realized because of the smaller number of processing resources assigned to each stage, and because communication may occur concurrently with computation.

Another major benefit of PVL is the portability and reusability of software modules. Beyond the obvious benefits for technology refresh and system upgrades, this eases system development, because application writers can gradually introduce parallelism. An application may be tested with single-processor maps and then retested with the maps for a parallel processor after the code has been verified. The separation of application and mapping also makes it easier to debug on inexpensive platforms such as networks of workstations, reducing contention for embedded target hardware.

From a software engineering perspective, the use of tasks and conduits in the PVL programming model provides an effective framework for integrating modules developed by different software engineers. Application developers can concentrate on the signal processing and linear algebra requirements of their piece of the algorithm independently of other pieces.

Performance Results

Parallel machines have various complex memory hierarchies, and the capabilities of the individual processors vary greatly from machine to machine. Ideally, mapping the application to the hardware should be entirely separate from the application code writing. As we have described, PVL uses abstractions to hide the details of single-processor computation and the transfer of data between processors, allowing the application to become scalable to different numbers of processors. These same abstractions therefore provide for portability of the application between different parallel machines or succeeding generations of the same parallel machine.

Use of a high level of abstraction may have two effects on performance. High-level objects may introduce overhead and reduce performance; however, a high level of abstraction may also enable library optimizations to be hidden from the application programmer. In this section, we describe the performance of PVL from both of these perspectives. First, in the following subsection, we demonstrate that PVL does not introduce an unacceptable amount of computational overhead for complex FFT operations. Then, in the subsequent subsection, we demonstrate that the level of abstraction introduced in PVL can result in improved performance in some cases.

Abstraction and Overhead

A major goal of PVL is that a reasonably mapped application should be able to achieve performance close to that which can be achieved by native mathematical and message-passing libraries on a given platform. To examine our success in this goal, we measured the amount of overhead PVL adds to an optimized FFT operation. This section describes these performance results in more detail.

Understanding the PVL FFT. The use of the PVL FFT can be divided into three phases: declaration, setup, and execution. The distinguishing characteristics of the PVL FFT are that the input vector and the output vector each have a map associated with them that describes how they are distributed, and that, in addition, a computation map is associated with the FFT object that can be different from either the input or the output vector map. The computation map describes how and where the computations take place.

Methodology. We measured computation times for out-of-place complex-to-complex vector FFT operations using vectors of length L = 2^n, n ∈ {8, 9, …, 14} (i.e., L ranged from 256 to 16,384). These numbers were picked because they reflect a good range of interest for the embedded processing space. Each function of interest was run a given number of iterations and the total time was recorded. The first iteration was not counted because cache effects were likely to be most pronounced the first time the function was called. Measured times considered only the computation portion of the program, and did not include time to create or construct the FFT object, allocate memory for the vectors, or perform other functions.


Computation rates were calculated by assuming 5nL floating-point operations per complex FFT. In all cases we ran the measurement program six times, averaged the time over a thousand iterations each time, and used the minimum of the six results.
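A timing harness of this shape, written here with std::chrono purely to illustrate the methodology (the fft_gflops name is invented, and the original measurements used the platform's own timers), might look like:

#include <chrono>
#include <cmath>

// GFLOP/s over 'iters' timed calls to fft(), assuming 5*n*L floating-point
// operations per complex FFT of length L = 2^n. The first call is a
// warm-up and is not timed, excluding first-use cache effects.
template <typename F>
double fft_gflops(F fft, int n, int iters) {
    const double L = std::pow(2.0, n);
    fft();  // warm-up, not counted
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fft();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return 5.0 * n * L * iters / dt.count() / 1.0e9;
}

The best-of-six rule then amounts to calling this function six times and keeping the highest rate, the run least disturbed by the operating system.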

FIGURE 10. Overhead for PVL fast Fourier transform (FFT) execution time, relative to that of the Fastest Fourier Transform in the West (FFTW) library for vectors of various lengths. In all cases the overhead is less than 2%.

FIGURE 11. C++ expression templates. C++ typically requires the use of temporary variables in order to write high-level mathematical expressions. Obtaining high performance from C++ requires technology such as expression templates that eliminate the normal creation of temporary variables in an expression.

Performance Results. The PVL FFT object whose performance is measured here used the Fastest Fourier Transform in the West (FFTW) library [5]. Because this library is a collection of self-optimizing FFT routines that has been shown to outperform vendor-optimized libraries in some cases, we can be confident that it represents a well-performing FFT operation. Figure 10 shows the execution time of PVL relative to FFTW. It is clear that PVL adds very little overhead to the underlying FFTW call.

Abstraction and Optimization

PVL is implemented by using the C++ programming language, which allows the user to write programs using high-level mathematical constructs such as

A = B + C * D,

where A, B, C, and D are all distributed vectors or matrices. Such expressions are enabled by the operator-overloading feature of C++ [6]. A naive implementation of operator overloading in C++ results in the creation of temporary data structures for each substep of the expression, such as the intermediate multiply C * D, which can result in a significant performance penalty.

[Figure 10 plot: relative performance of PVL (execution time relative to FFTW, 1.000 to 1.020) versus vector length (256 to 16,384).]

[Figure 11 diagram: for the expression A = B + C, operator+ is passed references to B and C and returns an expression parse tree rather than a temporary vector; operator= receives the tree (of type Binarynode<opassign, vector, Binarynode<opadd, vector, vector>>) and calculates the result and performs the assignment in a single pass.]


This penalty can be avoided by the use of expression templates, which allow the compiler to analyze a chained expression and eliminate the temporary variables. A tool called the Portable Expression Template Engine (PETE), developed by the Advanced Computing Laboratory at Los Alamos National Laboratory, allows easy generation of expression-template code for user-defined types. Figure 11 illustrates this process for the simpler expression A = B + C. In many instances it is possible to achieve better performance with expression templates than with standard C-based libraries, because the C++ expression-template code can achieve superior cache performance for long expressions [7].
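The essence of the technique fits in a few dozen lines. The following stand-alone sketch is in the spirit of PETE but is not PETE or PVL source: operator+ builds a lightweight parse-tree node that holds references, and assignment walks the tree once per element, so a chained expression such as A = B + C + D runs as a single loop with no temporary vectors.

#include <cstddef>
#include <cstdio>
#include <vector>

// Parse-tree node for elementwise addition; holds references, not copies.
template <typename L, typename R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from any expression evaluates the whole tree in one pass,
    // with no intermediate vectors.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ computes nothing; it only builds a tree node.
template <typename L, typename R>
Add<L, R> operator+(const L& l, const R& r) { return Add<L, R>{l, r}; }

int main() {
    const std::size_t n = 4;
    Vec a(n), b(n, 1.0), c(n, 2.0), d(n, 3.0);
    a = b + c + d;              // builds Add<Add<Vec, Vec>, Vec>; one loop
    std::printf("%g\n", a[0]);  // prints 6
    return 0;
}

A production library adds the other operators and, for distributed operands, inserts the distribution checks and communication inside the assignment loop.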

Figures 12, 13, and 14 show the performance PVL obtained with three different templated expressions on an eight-node processor cluster. For long expressions, PVL code that uses templated expressions is able to equal or exceed the performance of existing signal processing libraries (VSIPL and MPI).

Advanced Research and Future Work

This section describes some of the current research work being done to add new capabilities to PVL in the areas of fault tolerance, self-optimization, and dynamic parallelism.

FIGURE 12. Comparison of single-processor performance of VSIPL (C), PVL (C++) on top of VSIPL (C), and PVL (C++) on top of the Portable Expression Template Engine (PETE) (C++) for three different expressions with different vector lengths. PVL with VSIPL or PETE is able to equal or improve upon the performance of VSIPL.

[Figure 12 plots: relative execution time versus vector length (8 to 131,072) for A = B + C, A = B + C*D, and A = B + C*D/E + FFT(F), comparing VSIPL, PVL/VSIPL, and PVL/PETE.]

FIGURE 13. Comparison of multiprocessor (no communication) performance of VSIPL (C), PVL (C++) on top of VSIPL (C), and PVL (C++) on top of PETE (C++) for three different expressions with different vector lengths. PVL with VSIPL or PETE is comparable to or better than the performance of VSIPL.

[Figure 13 plots: relative execution time versus vector length (8 to 131,072) for the same three expressions, comparing C, C++/VSIPL, and C++/PETE.]


FIGURE 15. Dynamic mapping for fault tolerance. The separation of mapping from the algorithm allows the application to adapt more easily to changing hardware by modifying mapping at runtime. Remapping a matrix, for example, allows the program to react to failed processors and assign the work load to a spare processor, without requiring any fundamental change in the signal processing algorithm.

FIGURE 14. Comparison of multiprocessor (with communication) performance of VSIPL (C), PVL (C++) on top of VSIPL (C), and PVL (C++) on top of PETE (C++) for three different expressions with different vector lengths. PVL with VSIPL or PETE is comparable to the performance of VSIPL.

Fault Tolerance and Dynamic Mapping

One of the key challenges in bringing massively parallel computing to real-time signal processing is how to implement fault-tolerance capabilities in software. Signal processors need to be able to adjust quickly to processor failure so that they can continue to carry out their mission. The simplest approach is to double or triple the size of the computer and use a redundant voting scheme. Unfortunately, there are many situations in which this solution is too costly or the system is physically too large. In addition, this type of redundancy will not catch certain errors.

In theory, massively parallel systems present an opportunity for greatly increasing the fault tolerance of the system by adding only a few additional processors.

[Figure 15 diagram: a parallel processor with a spare node runs an input task (Xin) and an output task (Xout); after a failure, a matrix mapped by Map0 (nodes 0 and 1, 2 × 1 grid, row distribution) is remapped using Map1 (nodes 0 and 2, 2 × 1 grid, column distribution) or Map2 (nodes 1 and 3, 2 × 1 grid, column distribution).]

[Figure 14 plots: relative execution time versus vector length (8 to 131,072) for the same three expressions, comparing C, C++/VSIPL, and C++/PETE.]


Exploiting this capability requires software that can abstract the functionality of the program away from the hardware and allow processing tasks to be easily moved from one processor to another. In terms of the PVL library, the fault-recovery problem can be summarized as finding an efficient and dynamic way to give an object a new map after a failure has been detected, as illustrated in Figure 15.

We have implemented three fault-recovery strategies that solve this problem in different ways. In the first, referred to as remapping, the task object is destroyed and rebuilt from scratch on a new set of nodes. In the second strategy, referred to as redundant objects, a task object is built for each different possible failure scenario, and the appropriate object is used when a fault is detected. In the third strategy, the library allows a task to be mapped to a larger set of processors and only a subset is used at any given time. The set of nodes associated with the object can dynamically change to be a new subset of the larger set. This strategy is referred to as rebinding.

These three strategies have different effects on the complexity of the application (in terms of memory use and code size) and the performance of the application (in terms of samples processed per second). The remapping strategy is the least complicated of the three, but has the largest performance impact. The redundant object strategy has the best performance of the three, but is very complicated for the application programmer. The rebinding strategy puts the burden of complexity on the library developer and offers better performance than the remapping strategy. A more detailed description of our results in this area can be found elsewhere [8].
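A sketch of the rebinding idea, with invented names (this is not PVL's interface), is shown below: the object is created over a superset of nodes, only some of which are active, and recovery simply swaps the failed node for an idle one.

#include <algorithm>
#include <vector>

// Illustrative only: a task bound to a subset of a larger node pool.
struct RebindableTask {
    std::vector<int> pool;    // all nodes the task may ever use
    std::vector<int> active;  // nodes currently in use

    // Replace a failed node with an idle node from the pool. Buffers and
    // routes for the whole pool were set up once at creation, so no
    // expensive re-initialization is needed here.
    bool rebind(int failed_node) {
        for (int n : pool) {
            bool in_use =
                std::find(active.begin(), active.end(), n) != active.end();
            if (!in_use) {
                std::replace(active.begin(), active.end(), failed_node, n);
                return true;
            }
        }
        return false;  // no spare node left
    }
};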

Self-Optimization

PVL allows algorithm concerns to be separated from mapping concerns. However, the issue of determining the mapping of an application has not changed. As a software application is moved from one piece of hardware to another, the optimal mapping needs to change, as illustrated in Figure 16.

FIGURE 16. Optimal mapping challenge. Real applications can easily involve several complicated processing stages with potentially thousands of mapped vectors, matrices, and computations. The maps that produce the best performance will change, depending on the hardware. Automated capabilities are necessary to generate maps and to determine which maps are optimal.

[Figure 16 diagram: a multistage application (input, low-pass filter with FIR1 and FIR2 and weights W1 and W2, beamform with weight W3, and matched filter with weight W4 using an FFT and IFFT) has different optimal maps on different hardware: workstations, Intel and PowerPC clusters, embedded multicomputers, and embedded boards.]


FIGURE 18. S3P performance. The S3P framework is tested on two different problem sizes in a four-stage application with two different criteria: minimize latency for a given number of processors and maximize the throughput for a given number of processors. In all cases S3P picks the correct mapping, as shown by the four-component processor assignment string, and predicts the performance of the best map to within a few percent.

Currently, people with knowledge of both the algorithm and the parallel computer use intuition and simple rules to try to estimate what the best mapping will be for a particular application on particular hardware. Unfortunately, this method is time consuming, labor intensive, and inaccurate, and it does not in any way guarantee an optimal solution.

The generation of an optimal mapping is an important problem, because without this key piece of information it is not possible to write truly portable parallel software. To address this problem, we developed a framework called Self-optimizing Software for Signal Processing (S3P) [9].

The S3P framework, which is illustrated conceptually in Figure 17, requires certain capabilities from an application. First, the application must be made of tasks that are capable of being mapped to multiple configurations of hardware.

FIGURE 17. Self-optimizing Software for Signal Processing (S3P). This framework combines the dynamic mapping capabilities found in PVL with the self-optimizing software techniques developed in FFTW. The resulting framework allows an optimal mapping to be found automatically for any application on any parallel architecture. Each available path is a complete system mapping, and the best path (shown in green) is the optimal system mapping.

[Figure 18 plots: predicted and achieved latency (seconds) and throughput (frames/second) versus number of processors (4 to 8) for small (48 × 4k) and large (48 × 128k) problems; each point is labeled with its four-component processor assignment across the stages (e.g., 1-1-2-1, 1-2-2-2), the final three stages being beamform, filter, and detect.]


Furthermore, each task needs to be able to measure its computing resources (e.g., number of processors, memory, execution time) in each configuration. Given these capabilities, S3P can assemble a system graph that contains all possible mappings and find the best path through this graph, which corresponds to the optimal mapping. Because the resource measurements are empirical, there is no need for ad hoc modeling, and the entire process is automatic. This framework has been tested on different problem sizes and with different criteria, and it is able to pick the correct mapping and to predict the performance of the best map to within a few percent. Figure 18 shows S3P performance for two problem sizes and two performance criteria: latency and throughput. In each example, S3P finds the correct mapping and closely predicts the achieved performance.
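As a simplified illustration of the graph search (not the S3P implementation itself), suppose each stage s has a measured execution time t[s][p] when mapped onto p processors. Minimizing latency, the sum of stage times under a total processor budget, is then a small dynamic program; maximizing throughput would instead minimize the largest per-stage time.

#include <algorithm>
#include <limits>
#include <vector>

// Minimal pipeline latency: choose a processor count per stage, using at
// most P processors in total, minimizing the sum of measured stage times.
// t[s][p] is the measured time of stage s on p processors (index 0 unused).
double best_latency(const std::vector<std::vector<double>>& t, int P) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> best(P + 1, INF);  // best[q]: min latency using q procs
    best[0] = 0.0;
    for (const std::vector<double>& stage : t) {
        std::vector<double> next(P + 1, INF);
        for (int used = 0; used <= P; ++used) {
            if (best[used] == INF) continue;
            for (int p = 1; used + p <= P && p < (int)stage.size(); ++p)
                next[used + p] =
                    std::min(next[used + p], best[used] + stage[p]);
        }
        best = next;
    }
    return *std::min_element(best.begin(), best.end());
}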

Dynamic Load Balancing

Another key area for continued development is the need to be able to address problems where the work load is dynamic. A typical example of this is post-detection processing in a signal processing system, in which the amount of work done is proportional to the number of targets found. This situation is difficult to address with a parallel computer because it is not possible to predict how to distribute the data so that the work load is even.

The challenge presented by dynamic work loads is classically described by the “balls into bins” problem.

FIGURE 19. The problem of dynamic load balancing. A typical signal or image processing pipeline, shown in the upper row, begins by performing many regular operations such as detection that depend only upon the size of the input data, followed by additional operations such as estimation that depend upon the precise location of objects of interest within the data. The detection processing lends itself naturally to parallel computing by statically chopping up the data into equal regions, as shown in the lower row. This partition of the data results in a balanced work load among the parallel processors. The estimation processing, however, is dynamic in nature, and a static distribution leads to unbalanced work loads on the parallel processors. Researchers are seeking solutions that dynamically distribute the work load among the processors on an as-needed basis.

[Figure 19 diagram: an image processing pipeline with a detection stage whose work load is static and proportional to the number of pixels in the image, followed by an estimation stage whose work load is dynamic and proportional to the number of detections; under a static parallel implementation the detection work load is balanced across processors, while the estimation work load is unbalanced.]


Typically, data in a signal processing system are distributed uniformly on a parallel processor, but targets or detections are randomly distributed in these data, which results in some processors having more work than others. The nonuniformity of the distribution of work can quickly lead to the vast majority of processors being idle while the application waits for one processor to finish working. Figure 19 illustrates this problem with an image processing pipeline, showing how a static parallel implementation results in an unbalanced work load. The solution to this problem lies in the ability to dynamically distribute the work on an as-needed basis. Figure 20 shows how a dynamic distribution can dramatically improve processing efficiency, compared to static distribution. This improvement becomes particularly significant as the number of processors increases. A variety of technologies are useful in making this type of solution feasible on a parallel computer, including shared memory, fast broadcast, and multithreading.
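On a shared-memory node, the as-needed distribution can be as simple as workers claiming the next detection from a shared counter. The following self-scheduling sketch is illustrative (written with modern C++ threads for brevity, not drawn from the article):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int num_detections = 1000;
    unsigned num_workers = std::thread::hardware_concurrency();
    if (num_workers == 0) num_workers = 4;
    std::atomic<int> next(0);  // index of the next unclaimed detection

    auto worker = [&] {
        // Each worker claims one detection at a time, so fast workers
        // naturally absorb more of the uneven estimation work.
        for (int i = next.fetch_add(1); i < num_detections;
             i = next.fetch_add(1)) {
            /* estimate parameters for detection i */
        }
    };

    std::vector<std::thread> threads;
    for (unsigned w = 0; w < num_workers; ++w) threads.emplace_back(worker);
    for (std::thread& th : threads) th.join();
    std::printf("processed %d detections\n", num_detections);
    return 0;
}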

Summary

Exploiting parallel processing for real-time streaming applications presents unique software challenges. We have developed a software library to address many of these challenges.

FIGURE 20. Static versus dynamic load balancing. Randomly distributed targets in data that are statically distributed on a parallel processor always result in an unbalanced processor distribution because of random fluctuations in the work load. This unbalance leads to a decrease in processing efficiency as the number of processors increases. If the work can be distributed dynamically among the processors on an as-needed basis, the efficiency can be much higher and is limited only by the granularity of the work.

[Figure 20 plot: parallel speedup versus number of processors (1 to 1024, log-log scale), comparing linear speedup, a dynamic distribution (up to about 94% efficient), and a static distribution (falling to about 15% efficient at large processor counts).]

The library exploits advanced features of C++ to easily express data and task parallelism without making the application dependent on the underlying parallel hardware. This approach delivers high-performance execution comparable to or better than standard approaches.

Our future efforts will focus on adding to the features of this technology to exploit dynamic parallelism, integrate high-performance parallel software underneath mainstream programming environments, and use self-optimizing techniques to maintain performance.

Acknowledgments

We would like to thank the entire PVL software development team, including Matt Anderson, Robert Bond, Hector Chan, Jim Daly, Mike DiMare, Ted Hall, Mike Harrison, Andy Heckerling, Paul Hergt, Hank Hoffmann, Michael Moore, Fran Rayson, Patrick Richardson, Ed Rutledge, and Glenn Schrader.


References

1. “MPI: A Message-Passing Interface Standard,” Message-Passing Interface Forum (University of Tennessee, Knoxville, Tenn., Apr. 1994), <http://www.mpi-forum.org>.

2. D.A. Schwartz, R.R. Judd, W.J. Harrod, and D.P. Manley, “Vector, Signal, and Image Processing Library (VSIPL) 1.0 Application Programmer’s Interface,” Technical Report (Georgia Tech Research Corporation, Atlanta, Ga., Mar. 2000), <http://www.vsipl.org>.

3. C.M. DeLuca, C.W. Heisey, R.A. Bond, and J.M. Daly, “A Portable Object-Based Parallel Library and Layered Framework for Real-Time Radar Signal Processing,” ISCOPE ’97: First Conference on International Scientific Computing in Object-Oriented Parallel Environments, Marina del Rey, Calif., 8–11 Dec. 1997, pp. 241–248.

4. “High-Performance Fortran Language Specification,” High-Performance Fortran Forum, Scientific Programming 2 (1), 1993, pp. 1–170.

5. M. Frigo and S.G. Johnson, “FFTW: An Adaptive Software Architecture for the FFT,” Proc. 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing 3, Seattle, 12–15 May 1998, pp. 1381–1384.

6. B. Stroustrup, The C++ Programming Language, 3rd ed. (Addison-Wesley, Reading, Mass., 1997).

7. S. Haney, J. Crotinger, S. Karmesin, and S. Smith, “Easy Expression Templates Using PETE, the Portable Expression Template Engine,” Dr. Dobb’s J. 24 (10), 1999.

8. J.M. Lebak, J.V. Kepner, and H. Hoffmann, “Software Fault Recovery for Real-Time Signal Processing on Massively Parallel Computers,” Proc. Tenth SIAM Conf. on Parallel Processing for Scientific Computing, Portsmouth, Va., 12–14 Mar. 2001.

9. H. Hoffmann, J.V. Kepner, and R.A. Bond, “S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures,” Proc. Fifth Annual High-Performance Embedded Computing (HPEC) Workshop, Nov. 2001.


Jeremy Kepner is a staff member in the Embedded Digital Systems group. His research has addressed the development of parallel algorithms and tools, and the application of massively parallel computing to a variety of data-intensive problems. He received a B.A. degree in astrophysics from Pomona College, and a Ph.D. degree in astrophysics from Princeton University in 1998, after which he joined Lincoln Laboratory.

James Lebak is a staff member in the Embedded Digital Systems group. He is interested in parallel and numerical algorithms for digital signal processing. He received B.S. degrees in mathematics and electrical engineering in 1989, and an M.S. degree in electrical engineering in 1991, from Kansas State University. He received a Ph.D. degree in electrical engineering from Cornell University in 1997.