Assessing Accelerator-Based HPC Reverse Time Migration

Mauricio Araya-Polo, Javier Cabezas, Mauricio Hanzich, Miquel Pericas, Felix Rubio,

Isaac Gelado, Muhammad Shafiq, Enric Morancho, Nacho Navarro, Member, IEEE,

Eduard Ayguade, Jose Maria Cela, and Mateo Valero, Fellow, IEEE

Abstract—Oil and gas companies trust Reverse Time Migration (RTM), the most advanced seismic imaging technique, with crucial decisions on drilling investments. The economic value of the oil reserves that require RTM to be localized is on the order of 10^13 dollars. But RTM requires vast computational power, which has somewhat hindered its practical success. Although accelerator-based architectures deliver enormous computational power, little attention has been devoted to assessing the effort of implementing RTM on them. The aim of this paper is to identify the major limitations imposed by different accelerators during RTM implementations, and potential bottlenecks regarding architectural features. Moreover, we suggest a wish list that, from our experience, should be included as features in the next generation of accelerators to cope with the requirements of applications like RTM. We present an RTM algorithm mapping to the IBM Cell/B.E., NVIDIA Tesla, and an FPGA platform modeled after the Convey HC-1. All three implementations outperform a traditional processor (Intel Harpertown) in terms of performance (10x), but at the cost of huge development effort, mainly due to immature development frameworks and the lack of well-suited programming models. These results show that accelerators are well-positioned platforms for this kind of workload. Because our RTM implementation is based on an explicit high-order finite difference scheme, some of the conclusions of this work can be extrapolated to applications with a similar numerical scheme, for instance, magneto-hydrodynamics or atmospheric flow simulations.

Index Terms—Reverse time migration, accelerators, GPU, Cell/B.E., FPGA, geophysics.


1 INTRODUCTION

The necessity for High-Performance Computing (HPC) will keep increasing, as there is always a problem that needs more computational power than is currently available. However, the technological issues of recent years have put an end to frequency scaling, and hence to traditional single-processor architectures. Thus, processor designers and application developers have turned to multicore architectures that include accelerator units in the search for performance.

During this quest, three main accelerators arise as possible solutions for the new generation of HPC hardware: Cell/B.E., GPUs, and FPGAs. The Cell/B.E. architecture proposed by IBM [1] includes a general-purpose PPC host (PPE: Power Processing Element) and eight SIMD accelerators (SPEs: Synergistic Processing Elements), all in one chip and interconnected through a fast ring network. GPUs (Graphics Processing Units) [2], mainly provided by ATI and NVIDIA, come from the graphics field and are normally integrated into PCI cards connected to a general-purpose processor. GPUs are basically a set of elements for processing vertices and pixels; in order to handle general-purpose programming and HPC, these elements were unified and generalized as computing cores. The last alternative, FPGAs (e.g., the Xilinx Virtex series), is a completely different approach based on configurable hardware [3]. Inside an FPGA, the hardware logic layout is configured before doing the computation, usually by generating a custom computation unit and replicating it as many times as possible. This allows FPGAs to achieve high throughput even while running at frequencies far below those of ISA processors or other accelerators. Any of the proposed architectures can provide the needed performance. However, this performance does not come for free; the development cost increases. As these architectures all differ from traditional homogeneous processors, they have their own particularities, and considerable effort must be invested to adapt the algorithm to the architectural features.

The aim of this work is to show how a key application in the oil industry can be effectively mapped to different accelerator architectures, and to analyze its performance and drawbacks. Moreover, a wish list for each architecture is given as suggestions for future generations of such accelerators, considering the major limitations found during the software development process.

The application that we consider is Reverse Time Migration (RTM) [4]. RTM is an algorithm based on the calculation of a wave equation through a volume representing the earth's subsurface. RTM's major strength is the capability of showing the bottom of salt bodies several kilometers (approximately 6 km) beneath the earth's surface.


. M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericas, F. Rubio, M. Shafiq, E. Ayguade, J.M. Cela, and M. Valero are with the Barcelona Supercomputing Center, 31 Jordi Girona, 08034 Barcelona, Spain. E-mail: {mauricio.araya, javier.cabezas, mauricio.hanzich, miquel.pericas, felix.rubio, muhammad.shafiq, eduard.ayguade, josem.cela, mateo.valero}@bsc.es.

. I. Gelado, E. Morancho, and N. Navarro are with the Universitat Politecnica de Catalunya, Campus Nord UPC, 1-3 Jordi Girona, 08034 Barcelona, Spain. E-mail: {igelado, enricm, nacho}@ac.upc.edu.

Manuscript received 24 Sept. 2009; revised 10 Mar. 2010; accepted 2 June 2010; published online 28 July 2010. Recommended for acceptance by D.A. Bader, D. Kaeli, and V. Kindratenko. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDSSI-2009-09-0462. Digital Object Identifier no. 10.1109/TPDS.2010.144.

1045-9219/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

In order to understand the economic impact of RTM, we just have to review the USA Mineral Management Service (MMS) reports [5]. The oil reserves of the Gulf of Mexico under the salt layer are approximately 5 × 10^10 barrels. Moreover, the reserves on both Atlantic coasts, Africa and South America, also lie under a similar salt structure. A conservative estimation of the value of all these reserves is on the order of 10^13 dollars. RTM is the key application to localize these reserves. RTM is the method that produces the best subsurface images; however, its computational cost (at least one order of magnitude higher than other migration methods) hinders its adoption in daily industry work. Another possibility that we do not cover here is to design a special-purpose machine (a la Anton [6]). This approach may be economically feasible due to the investment of the oil and gas industry in RTM development.

Our RTM implementation is based on an explicit 3D high-order Finite Difference numerical scheme for the wave equation, including absorbing boundary conditions. Therefore, other simulations based on the same numerical scheme will benefit from some of our conclusions, for example, magneto-hydrodynamics in astrophysics, atmospheric mesoscale modeling, and some DNS turbulent flow simulations.

The remainder of this paper is organized as follows: Section 2 introduces the seismic problem solved by RTM. Section 3 explains the RTM algorithm and its hotspots. Section 4 shows the RTM implementation on homogeneous processors. Section 5 describes the selected accelerator architectures, while Section 6 describes the RTM implementation on each of them. Section 7 shows performance results for the different RTM implementations. In Section 8, we propose, as users, some suggestions for future hardware architectures. Finally, Section 9 presents our conclusions.

2 INTRODUCTION TO THE SEISMIC IMAGING PROBLEM

Seismic imaging tries to generate images of the terrain in order to see the geological structures. This is the basic tool of the oil industry to identify possible reservoir locations. Moreover, this technology is needed for several other geological activities, for example, CO2 sequestration.

The main technique for seismic imaging generates acoustic waves and records the earth's response at some distance from the source. As in medical imaging, by applying signal processing algorithms to the recorded data, we want to rebuild the properties of the propagation medium, i.e., the local earth structure.

The propagation of waves in the earth is a complex phenomenon that requires a sophisticated wave equation to be modeled accurately. Unfortunately, the amount of information about the local earth properties is very limited, so simplified models for wave propagation are used. The simplest model assumes that wave propagation is described by the isotropic acoustic wave equation:

∂²p(r,t)/∂t² − c(r)² ∇²p(r,t) = f(r,t),   (1)

where c(r) is the sound speed at each terrain point r, p(r,t) is the pressure wave value, and f(r,t) is the input wave. This simple model uses a single parameter, c(r), to model the terrain properties.

Using this acoustic model, it is not possible to reproduce the complete waveform of the received wave. However, the model is good enough to reproduce the arrival time of the received waves. Even though the amplitude information is lost by the acoustic model, the phase information is preserved.

The problem that we want to solve is an inverse problem based on (1). The inverse problem solution is decomposed into two functions:

1. Tomography is the function that builds an approximate velocity model, c(r), using a terrain image and the empirical data.

2. Migration is the function that builds a terrain image, using a velocity model and the empirical data.

The interaction between tomography and migration is based on the fact that a column of terrain should produce the same image for all the input waves, if the velocity model is the correct one. Then, if we join the images of the same terrain column produced by different source waves, a set of horizontal lines associated with different terrain reflectors should be observed. If these lines are curved up or down, the velocity value is too high or too low. This idea allows the velocity model to be corrected and closes the loop between tomography and migration.

From the computational point of view, migration is the most expensive function. In the past, due to computational limitations, simplified models for migration were used. For example, the migration known as Kirchhoff [7] uses rays (an infinite-frequency approximation) to avoid solving the wave equation. However, this simplified migration algorithm fails when strong velocity discontinuities appear. This happens, for example, when saline bodies are present in the subsurface, which is the case in areas important to the oil industry, like the Gulf of Mexico.

The best migration algorithm for subsalt imaging is RTM. RTM generates an output proportional to the gradient of the cost function when using the adjoint method in an inverse problem. The methodology of RTM can be used for any inverse problem related to a wave equation. The importance of RTM is not just the quality of the image, but the fact that RTM is a data-driven algorithm, as can be observed in Figs. 1 and 2. This means that images produced by RTM show information present in the empirical data but not in the velocity model. Due to this fact, the oil and gas industry believes that the synergy between RTM and tomography to solve the inverse problem leads to the best solution.


Fig. 1. Synthetic velocity model with a salt body. Courtesy of REPSOL.

Fig. 1 shows a synthetic velocity model that includes a salt body, used to illustrate this fact. Fig. 2b shows the image produced by the approximate velocity model in Fig. 2a, which has no salt. We can observe that the top layer of the salt body appears in the image, even though it is not present in the velocity model. The data-driven behavior of RTM is shown in [8].

3 RTM ALGORITHM

Because the acoustic shots (medium perturbations) are introduced at different moments, we can process them independently. The outermost loop of RTM sweeps over all shots. This embarrassingly parallel loop can be distributed over a cluster or a grid of computers. The number of shots ranges from 10^5 to 10^7, depending on the size of the area to be analyzed. For each shot, we need to prepare the data of the velocity model and the proper set of seismic traces associated with the shot.

In this paper, we are only interested in the RTM algorithm needed to process one shot, which we will call the RTM kernel. Fig. 3 shows the pseudocode of this algorithm. RTM is based on solving the wave equation (1) twice: first, using the input shot as the right-hand side (forward propagation), and second, using the receivers' traces as the right-hand side (backward propagation). Then, the two computed wave fields are correlated at each point to obtain the image.

The statements in Fig. 3 stand for the following (a simplified sketch of the resulting time loop is given after the list):

. Line 3: Computes the Laplacian operator and the time integration. Spatial discretization uses the Finite Difference method [9], and time integration uses an explicit method. Typically, for stability conditions, 10^3 points per space dimension and 10^4 time steps are needed. This is also the most computationally intensive step.

. Line 6: Introduces the source wave (shot or receivers).

. Line 9: Computes the absorbing boundary conditions (ABC).

. Line 12: Performs the cross-correlation between both wave fields (backward phase only) and the needed I/O.
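
To make the control flow of Fig. 3 concrete, the following C sketch shows how the per-shot time loop can be organized. It is a minimal sketch: the helper functions are hypothetical placeholders for the four steps listed above (they are not the authors' code), and the stack parameter anticipates the correlation rate discussed in Section 3.1.2.

    #include <stddef.h>

    /* Hypothetical helpers standing in for lines 3, 6, 9, and 12 of Fig. 3
       (placeholders only, not the authors' actual routines). */
    static void wave_propagate(float *next, const float *cur, const float *prev,
                               const float *vel, size_t n)
    { (void)next; (void)cur; (void)prev; (void)vel; (void)n; }
    static void insert_source(float *next, int step, size_t n)
    { (void)next; (void)step; (void)n; }
    static void apply_abc(float *next, size_t n)
    { (void)next; (void)n; }
    static void correlate_and_io(const float *next, int step, size_t n)
    { (void)next; (void)step; (void)n; }

    /* One propagation sweep (forward or backward) of the RTM kernel. */
    static void rtm_sweep(float *prev, float *cur, float *next, const float *vel,
                          size_t n, int nsteps, int stack)
    {
        for (int step = 0; step < nsteps; step++) {
            wave_propagate(next, cur, prev, vel, n); /* line 3: Laplacian + time integration */
            insert_source(next, step, n);            /* line 6: shot (forward) or receivers (backward) */
            apply_abc(next, n);                      /* line 9: absorbing boundary conditions */
            if (step % stack == 0)
                correlate_and_io(next, step, n);     /* line 12: correlation and I/O every 'stack' steps */
            float *tmp = prev; prev = cur; cur = next; next = tmp; /* rotate time levels */
        }
    }

    int main(void) { (void)rtm_sweep; return 0; }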

3.1 RTM Implementation Problems

RTM implementations have well-known hotspots; on top of that, when the RTM implementation targets a heterogeneous architecture, the list of those hotspots grows in particularities but not in diversity. We can divide the hotspots into three groups: memory, input/output, and computation. In the next items, we describe these groups.

3.1.1 Memory

RTM memory consumption is related to the frequency at which the migration should be done. Higher frequencies (e.g., over 20-30 Hz) may imply the usage of several GiB (>10 GiB) of memory for migrating one single shot. The total amount of required memory could be greater than the amount available in a single computational node, forcing a domain decomposition technique to process one shot.

A 3D Finite Difference stencil has a memory access pattern [10] that can be observed in Fig. 4c; the stencil is represented by the cross-shaped object (Figs. 4a and 4b). As can be seen from Fig. 4c, only one direction (Z in that case) has its data consecutively stored in memory, so accessing memory along the other directions is very expensive in terms of cache misses. The stencil memory access pattern is a main concern when designing the RTM kernel code [11], because it is strongly dependent on the memory hierarchy structure of the target architecture.


Fig. 2. (a) Velocity model without any salt body, and (b) generated image. Courtesy of REPSOL.

Fig. 3. The RTM Algorithm.

Fig. 4. (a) A generic 3D stencil structure, (b) a 3D seven-point stencil, and (c) its memory access pattern.

Besides, due to the reduced size of the L1, L2, and L3 caches, blocking techniques must be applied to efficiently distribute the data among them [12], at least for classical multicore platforms.
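
To illustrate why only one direction is cheap to access, the short C program below (illustrative only; the grid sizes are made up) prints the memory strides of the stencil neighbors when the Z index is the fastest-varying one, as in Fig. 4c: the Z neighbors are adjacent, while each X neighbor is a whole Z line away and each Y neighbor is a whole Z-X plane away.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative grid dimensions (not taken from the paper). */
        long nz = 512, nx = 512;  /* ny only affects the total size, not the strides */

        /* Linearized index, Z fastest: idx(z, x, y) = z + nz * (x + nx * y). */
        long stride_z = 1;        /* (z +/- 1, x, y): contiguous                 */
        long stride_x = nz;       /* (z, x +/- 1, y): one whole Z line away      */
        long stride_y = nz * nx;  /* (z, x, y +/- 1): one whole Z-X plane away   */

        printf("Z-neighbor stride: %ld floats\n", stride_z);
        printf("X-neighbor stride: %ld floats\n", stride_x);
        printf("Y-neighbor stride: %ld floats (%.1f MiB for 4-byte floats)\n",
               stride_y, stride_y * 4.0 / (1024.0 * 1024.0));
        return 0;
    }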

Moreover, modern HPC environments (e.g., Cell/B.E. or SGI Altix) have a Non-Uniform Memory Access (NUMA) time, which depends on the physical location of the data in memory. Thus, a time penalty may be paid if data are not properly distributed among memory banks.

3.1.2 Input/Output

We divide the I/O problem into three categories: data size (>1 TiB), storage limitations, and concurrency. On one hand, seeking high accuracy, the spatial discretization may produce a huge computational domain. On the other hand, the time discretization may imply a large number of time steps.

RTM implementations store the whole computational domain for each of a large number of time steps (line 12 in Fig. 3), which may overwhelm the storage capacity (>300 GiB). To avoid RTM becoming an I/O-bound application, it is mandatory to overlap computation and I/O using asynchronous libraries. Additionally, data compression techniques can be used to reduce the amount of data transferred. Finally, the correlation can be performed only every n steps at the expense of image quality (we call this rate the stack).
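
One portable way to overlap these writes with computation is POSIX asynchronous I/O. The following sketch is illustrative only (the file name, sizes, and busy-wait are placeholders, error handling is minimal, and compression is omitted); on Linux it links with -lrt.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t snap_bytes = 1u << 20;   /* one (already compressed) snapshot */
        float *snapshot = malloc(snap_bytes);
        if (!snapshot) return 1;
        memset(snapshot, 0, snap_bytes);

        int fd = open("wavefield.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = snapshot;
        cb.aio_nbytes = snap_bytes;
        cb.aio_offset = 0;                    /* offset of this time step's snapshot */

        if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

        /* ... the next time steps would be computed here, concurrently ... */

        while (aio_error(&cb) == EINPROGRESS) /* wait only when the buffer is needed again */
            ;                                 /* (a real code would compute instead of spinning) */
        printf("wrote %zd bytes asynchronously\n", aio_return(&cb));

        close(fd);
        free(snapshot);
        return 0;
    }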

As a distributed file system is generally used for sharing the global velocity model and seismic traces, negative behavior can be observed as the number of shots concurrently accessing the shared data increases. Therefore, using global file systems imposes new constraints: the required available storage network bandwidth and the maximum number of concurrent requests that can be served.

3.1.3 Computation

In order to efficiently exploit the vector functional units present in modern processors, we have to overcome two main problems: the low computation versus memory access (c/ma) ratio and the vectorization of the stencil computation. In order to use the pool of vector registers completely, unrolling techniques are needed.

The low c/ma ratio means that many neighbor points are accessed to compute just one central point and, even worse, many of the accessed points are not reused when the next central point is computed. This effect is called a low data locality ratio. For instance, the generic stencil structure in Fig. 4a defines a (3 × (n × 2)) + 1 point stencil. If n = 4, then 25 points are required to compute just one central point, so the c/ma ratio is 0.04, which is far from the ideal c/ma = 1 ratio. To tackle this problem, strategies that increase data reuse must be deployed.
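
The arithmetic behind these numbers is straightforward; the fragment below just evaluates the (3 × (n × 2)) + 1 point count from Fig. 4a and the corresponding c/ma ratio for several stencil half-widths (it is a worked check of the figures in the text, not part of the RTM code).

    #include <stdio.h>

    int main(void)
    {
        for (int n = 1; n <= 4; n++) {           /* n = stencil half-width per axis */
            int points = 3 * (n * 2) + 1;        /* neighbors on 3 axes plus the center */
            printf("n = %d: %2d points read per output point, c/ma = %.2f\n",
                   n, points, 1.0 / points);
        }
        return 0;
    }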

4 RTM ON HOMOGENEOUS ARCHITECTURES

To set a baseline for comparison with the RTM implementations for accelerators, we have mapped the algorithm to a traditional cache-coherent multicore platform. This platform is based on commodity hardware, available in an HPC-oriented configuration. Detailed hardware specifications are reported in Table 1.

As mentioned in Section 3, the kernel is the most computationally demanding task of RTM. The computational weight of the kernel comes from the stencil sweeps, which perform a low number of operations per data point relative to the memory traffic they generate [13]. We tackle that problem with blocking [12], [14]. To apply blocking, in particular the Rivera [14] strategy, the original 3D space is divided into slices, where the X axis of each slice has a size that optimally fits the cache hierarchy (as in the Cell implementation, see Fig. 8).

As far as the optimization of computation is concerned, we exploit all the forms of parallelism provided by the architecture: the thread-level parallelism provided by the multiple cores, and the data-level parallelism provided by the SIMD instruction set. We use all the independent threads (eight) per blade with a parallelization strategy that follows the blocking: each core processes its assigned set of slices independently (a sketch of this scheme is given below). Since each core has its own L2 cache, interference among threads is minimal. Our implementation employs OpenMP [15], and thanks to careful loop management, we provide OpenMP with more opportunities for scheduling, thus enhancing scalability.
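
The sketch below illustrates the scheme just described in C with OpenMP: the interior of the volume is cut into X slices and the slices are distributed among threads, with the unit-stride Z loop innermost. It is a simplified sketch; the slice width, the grid sizes, and the second-order stencil with made-up coefficients are illustrative stand-ins for the tuned high-order production kernel.

    #include <stdlib.h>

    #define IDX(z, x, y, nz, nx) ((z) + (size_t)(nz) * ((x) + (size_t)(nx) * (y)))

    /* One blocked stencil sweep: the X axis is cut into slices of width bx and
       the slices are distributed among the OpenMP threads (Rivera-style blocking). */
    static void sweep(float *out, const float *in, int nz, int nx, int ny, int bx)
    {
        int nslices = (nx - 2 + bx - 1) / bx;            /* interior X range is 1..nx-2 */

        #pragma omp parallel for schedule(dynamic)
        for (int s = 0; s < nslices; s++) {
            int x0 = 1 + s * bx;
            int x1 = x0 + bx < nx - 1 ? x0 + bx : nx - 1;
            for (int y = 1; y < ny - 1; y++)
                for (int x = x0; x < x1; x++)
                    for (int z = 1; z < nz - 1; z++)      /* unit-stride innermost loop */
                        out[IDX(z, x, y, nz, nx)] =
                            0.5f * in[IDX(z, x, y, nz, nx)] +
                            0.1f * (in[IDX(z - 1, x, y, nz, nx)] + in[IDX(z + 1, x, y, nz, nx)] +
                                    in[IDX(z, x - 1, y, nz, nx)] + in[IDX(z, x + 1, y, nz, nx)] +
                                    in[IDX(z, x, y - 1, nz, nx)] + in[IDX(z, x, y + 1, nz, nx)]);
        }
    }

    int main(void)
    {
        int nz = 64, nx = 64, ny = 64;
        float *a = calloc((size_t)nz * nx * ny, sizeof *a);
        float *b = calloc((size_t)nz * nx * ny, sizeof *b);
        if (a && b) sweep(b, a, nz, nx, ny, 8);
        free(a); free(b);
        return 0;
    }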

As mentioned earlier, the RTM algorithm requires huge data sets to be transferred to and from disk. To tackle this problem, the wavefield size is reduced by means of data compression prior to being stored, and asynchronous I/O is used to overlap disk transfers and computation.

5 ACCELERATOR ARCHITECTURES

5.1 Cell/B.E. Architecture

As a reference Cell/B.E. platform, we consider the IBM BladeCenter QS22 blade (see Table 2 for the full specifications), which is a dual-socket blade.

Each Cell/B.E. processor (see Fig. 5) on a QS22 blade contains a general-purpose 64-bit PowerPC-type PPE with cache memories, and eight SPEs with software-managed scratchpad memories called Local Stores (LS). The PPE and the SPEs both have a (slightly different) 128-bit-wide SIMD instruction set, which allows, for example, four single-precision floating-point operands to be processed simultaneously.

On the programmability front, programming this architecture requires special attention to the following issues:

1. The use of SIMD instructions requires an appropriate data layout (padding and alignment).


TABLE 1. Technical Specifications of the Homogeneous System Employed in Our Experiments.

2. The branch predictors are simple, and the misprediction penalty is high; control-flow-intensive code should be rewritten as data-flow-intensive code when possible.

3. Load/store instructions only operate on the LS, which is small (256 KiB, shared by code and data); the programmer must divide the computation into fragments that operate on working sets small enough to fit in the LS.

4. Accesses to main memory only happen via Direct Memory Access (DMA) operations, and are performed independently by a Memory Flow Controller (MFC); the programmer must use the MFC to overlap computation and transfers, to hide the shorter of the two latencies.

5. DMA performance is also influenced by usage parameters (transfer block size and alignment, average concurrent requests, bank congestion, controller congestion, NUMA issues); it is the programmer's responsibility to adopt congestion-avoiding memory access patterns.

Every stated feature is important in order to exploit the Cell/B.E. capabilities; however, the last two are the most relevant for our test case.
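
Issues 4 and 5 are typically addressed with double buffering on the SPE. The fragment below is a sketch in C using the Cell SDK's MFC intrinsics (to be built with spu-gcc); the plane size, the effective address ea, and process_plane are illustrative, and a real code would also respect the 16 KiB per-transfer DMA limit and push results back with mfc_put.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define PLANE_BYTES 4096                      /* one plane fragment, <= 16 KiB per DMA */

    static volatile float buf[2][PLANE_BYTES / sizeof(float)]
        __attribute__((aligned(128)));            /* two LS buffers for double buffering */

    static void process_plane(volatile float *plane) { (void)plane; } /* placeholder compute */

    /* Stream 'nplanes' planes from main memory starting at effective address 'ea',
       overlapping each DMA with the computation on the previously fetched plane. */
    void stream_planes(uint64_t ea, int nplanes)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, PLANE_BYTES, cur, 0, 0);           /* prefetch first plane */

        for (int p = 0; p < nplanes; p++) {
            int nxt = cur ^ 1;
            if (p + 1 < nplanes)                                  /* start fetching the next plane */
                mfc_get(buf[nxt], ea + (uint64_t)(p + 1) * PLANE_BYTES, PLANE_BYTES, nxt, 0, 0);

            mfc_write_tag_mask(1u << cur);                        /* wait only for the current buffer */
            mfc_read_tag_status_all();

            process_plane(buf[cur]);                              /* compute while the other DMA runs */
            cur = nxt;
        }
    }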

5.2 GPU Architecture

Modern 3D graphics processing units (GPUs) are designed to take advantage of the data-parallel characteristics found in image generation algorithms. They contain hundreds of functional units that can operate in parallel on different data elements. The computing power of these devices easily exceeds that of general-purpose multicore CPUs. Thus, GPU manufacturers like ATI and NVIDIA have generalized the design of these processors to allow them to execute general-purpose programs using C-like programming languages.

The reference hardware used in the implementation of the RTM algorithm and in the benchmarks in this paper is based on the NVIDIA Tesla architecture [16], introduced in the GeForce 8800 GPU. As seen in Fig. 6, a GPU is presented as a set of streaming multiprocessors (SMs), each with its own set of streaming processors (SPs), registers, and shared memory (a user-managed cache). SPs are fully capable of executing integer and single/double-precision floating-point arithmetic. However, all the SPs in an SM share the fetch, control, and memory units. Therefore, each SM may be better conceptualized as an eight-wide vector processor. All SMs have access to the global device memory, which is not cached by the hardware. Latency is hidden by executing thousands of threads concurrently. Register and shared memory resources are partitioned among the currently executing threads.

5.2.1 CUDA Programming Model

CUDA allows programming GPUs for parallel computation without any graphics knowledge. A CUDA program is organized into a host program, consisting of one or more sequential threads running on the CPU, and one or more parallel kernels that are executed on a parallel processing device like the GPU.

A kernel executes a scalar sequential program on a set of parallel threads. CUDA arranges threads into thread blocks. All threads in a thread block can read and write any shared memory location assigned to that thread block. Consequently, threads within a thread block can communicate via shared memory, or use shared memory as a user-managed cache, since shared memory latency is two orders of magnitude lower than that of global memory. A barrier primitive is provided so that all threads in a thread block can synchronize their execution.


TABLE 2. Technical Specifications of the Cell/B.E.-Based System Employed in Our Experiments.

Fig. 6. Tesla unified graphics and computing architecture of a GeForce GTX 280 or Tesla C1060 GPU with 240 SP functional units, organized in 30 SM multithreaded multiprocessors. Each SM can handle up to 1,024 concurrent threads; the GPU executes up to 30,720 concurrent threads [17].

Fig. 5. Functional block diagram of the Cell/B.E. processor.

In order to take full advantage of the computing power offered by the architecture, a number of issues must be taken into account:

. Memory transfers through the PCI Express (PCIe) bus are much slower than host memory transfers. However, they can be overlapped with kernel execution in the GPU. Therefore, this kind of transfer must be minimized and overlapped with computation.

. The RAM memory found in GPUs is designed to deliver very high bandwidth. In order to maximize the usage of the memory bus, consecutive threads in a thread block must access adjacent memory positions simultaneously. Moreover, the data layout must allow this memory access pattern.

. Shared memory should be used if memory positions are read more than once by the threads in a thread block, since global memory is not cached.

5.2.2 GPU Reference Platform

As a reference GPU platform, we considered a dual-processor host with an NVIDIA Tesla C1060 device (Table 3).

5.3 FPGA Architecture

FPGAs are an extreme case of a multicore architecture in that they implement thousands of logic cells, each of which can be configured to implement a simple logical function. Together, they can implement highly complex functionalities. From the developer's view, an FPGA is a chip that can be programmed to implement a hardware circuit; in this view, an FPGA can be considered a virtualized VLSI hardware substrate. It is therefore programmed with tools similar to those of Electronic Design Automation (EDA). The toolchain includes phases such as synthesis, placement, and routing.

Given that the FPGA substrate can be configured to perform highly parallel tasks with a high degree of customization, FPGAs have become an attractive option for accelerating HPC tasks. However, they also have several hurdles that need to be overcome, as explained in Section 5.3.3.

5.3.1 Modeled FPGA System Architecture

Our RTM port for FPGAs was developed and verified on an SGI RASC RC100 platform. However, this platform is not competitive with current technologies. Not only are the two FPGAs it features two families behind (Virtex-4 versus the current Virtex-5 and Virtex-6), but the system architecture imposes a restricted memory model and provides limited bandwidth (3.2 GiB/s), to the point of severely hampering performance, particularly in a bandwidth-hungry code such as RTM.

For this reason, and in order to provide fair results, we attempt a performance prediction for more recent hardware, based on a detailed analysis of the performance we achieved on the RC100. We developed a system model based on a current FPGA-based platform, the Convey HC-1. As with the HC-1, our platform model contains four Virtex-5 LX330 FPGAs to accelerate the applications. The overall compute model is a coprocessor model featuring a large, cache-coherent shared memory between the processor and the coprocessor. While we only select the most notable features from the HC-1 to produce our performance estimate, we expect the performance reported here to be reachable with this system. A descriptive system diagram is shown in Fig. 7.

The acceleration engines are provided with a memory subsystem capable of transferring at up to 80 GiB/s by interleaving data over 16 DIMM channels. In a maximal configuration filling all slots, the coprocessor system can host up to 128 GiB of data, more than enough for the typical requirements of an RTM run.

5.3.2 Virtex-5 LX330

The acceleration engines of the FPGA coprocessor are powered by 4× Virtex-5 LX330 chips. The V5 LX330, which was announced in December 2006, is the largest device of the Virtex-5 family. It is implemented in a 65 nm process and contains 288 × 36 Kbit BlockRAMs, totaling 10,368 Kbit of data (1.3 MiB). The chip contains 330K logic cells and 192 DSP slices, each containing a 25 × 18 multiplier, an adder, and an accumulator.

5.3.3 Programming Model

FPGAs are mostly programmed like circuits, using Hardware Description Languages (HDL) and tools originating from the Electronic Design Automation (EDA) community. HDL is a low-level programming approach that provides complete control of the circuit execution at the cycle level.


Fig. 7. FPGA system architecture modeled after the Convey HC-1. In this architecture, FPGAs have access to global memory via a cache-coherent memory controller.

TABLE 3. Technical Specifications of the Host with the NVIDIA Tesla C1060 Devices.

This makes programming in HDL quite a tedious task. Instead of being debugged, circuits are normally verified by first simulating them thoroughly using a software simulator. This allows the developer to check the correctness of all signals before compiling the HDL code for the FPGA. Contrary to a software binary, the result of the compilation toolchain is a configuration file (called a bitstream) which is loaded onto the FPGA.

As a general rule, to obtain performance from an FPGA, the focus is on placing multiple computation engines so as to exploit the massive parallelism effectively. To this end, it is useful to minimize the area of the computation engines by customizing the computation via partial evaluation or by specializing the data types. Customizing the design also improves the performance-per-Watt relationship. Another goal is maximizing on-chip data reuse to reduce the external memory requirements. Of course, the number of compute engines that can be placed is also a function of the external bandwidth; adding more engines to a memory-bound design only wastes area.

6 RTM ON ACCELERATORS

6.1 RTM Implementation on Cell/B.E.

The number of problems to be tackled when implementing a competitive RTM system may seem cumbersome. However, in this section, we show how to face each of the stated problems, one by one, on the Cell/B.E. platform.

. Memory: As stated before (Section 3.1), at least a couple of memory problems have to be tackled using specific techniques: blocking and NUMA primitives.

We use blocking as a strategy to evenly distribute work and data among the SPEs. The main goal of this technique is to use the LS memory and bandwidth efficiently; recall that the bus that communicates the PPE and the SPEs is shared by all of them. Fig. 8 shows the blocking technique used to divide and scatter the data space among the SPEs in the Cell/B.E. processor.

Notice that the 3D space is split along the X direction, and each subcube is then given to one SPE to be processed. Y is the traversing direction while each subcube is processed. This is particularly important in the Cell/B.E. processor, due to the limited size of the SPEs' local stores. Thus, a stream of Z-X planes constantly provides the SPEs with the required data to compute.

The second memory consideration is related to the NUMA characteristic of the QS22 blades, which basically means that the blade does not provide the same access time for every memory address. In the QS22 case, this happens when the mapping of physical and logical addresses does not link to the memory bank that is closest to the SPEs. This situation also hinders the scalability of the application, in particular when the full set of SPEs is in use (as in our blocking scheme). Fortunately, in such NUMA systems, special primitives allow the programmer to associate a physical memory address with a given logical address. Fig. 9 shows the results for the same computation with and without NUMA mapping.

. Input/Output (I/O): To solve the problems related to I/O, we have used several techniques: compression, asynchronous I/O (AI/O), and double buffering.

As initially mentioned in Section 3.1, RTM implementations demand a huge amount of I/O effort; both disk and bandwidth are overwhelmed. The RTM algorithm is composed of two main parts: the forward and backward stages. In the forward stage, wavefields are stored. Then, in the backward stage, the same wavefields are restored (line 12 in Fig. 3). This mechanism enables the correlation task that in the end produces the final image. In Fig. 10a, we can observe the behavior of the implementation with the regular I/O mechanism active. It is clear that every SPE has to wait until the I/O task is done (dark regions in the first row of the execution trace, the PPE thread). In the following paragraphs, we describe how we handle this problem.


Fig. 8. Data accessing and vectorization pattern.

Fig. 9. QS22 RTM scalability with and without NUMA considerations.

Fig. 10. (a) Segments of the RTM execution traces without and (b) with AI/O utilization. The execution trace is composed of rows that represent first the PPE and then the SPE threads; for the sake of clarity, we present traces with only eight SPEs.

First, we compress/decompress the stored/restored data, achieving a compression factor of 10; this is carried out plane by plane, following the scheme depicted in Fig. 8. Second, following the same plane scheme, data are transferred to/from main memory using a double-buffering technique. Finally, when the compressed data are in main memory, AI/O is deployed to store/restore them. This approach is easy to implement and to debug and, most importantly, allows perfect overlapping between computation and I/O.

The traces in Fig. 10 show a good load balance among the threads. Further, the first row of Fig. 10b shows how the I/O tasks are completely overlapped with computation. As can be deduced, the introduction of AI/O saves many time steps, depending on the wavefield size, bandwidth, and disk speed. When computing the backward phase of RTM, the data previously written to disk must be read back in order to correlate it with the data that is being computed. Our approach allows us to begin reading data from disk early enough that the SPEs find the data ready to be used in the buffer.

. Computation: On the pure computation side, we take advantage of the semistencil technique [13] to reduce the number of planes to be processed while, at the same time, increasing data locality. This technique also increases the computation/memory access ratio. It is important to remark that, once data are in the LS, the latency of reads and stores is the same; thus, the only optimization techniques that remain useful are register blocking, loop unrolling, and software prefetching. All of them are in use in the RTM implementation.

The RTM implementation takes advantage of the SIMD registers. Fig. 11a shows the stencil we calculate in scalar (i.e., not SIMD) code and Fig. 11b shows the SIMD stencil, which uses SIMD registers and lets us calculate four points of the data space in parallel. Notice that both stencils take the same time to compute, but the SIMD version computes four points while the scalar code computes just one (an illustrative SIMD fragment is given after this list).

Considering the number of SIMD registers in the Cell/B.E. (Table 2), it is possible to optimize the code to compute up to 20 points of the data field simultaneously, which leads to high performance. In general-purpose processors (i.e., most homogeneous processors), however, the number of SIMD registers is normally smaller, thus limiting the number of points that can be processed in parallel. To exploit SIMD, we have adopted an aligned and padded data layout, and manually tuned the computational kernel by using the SIMD SPE intrinsics and language extensions available in the IBM Cell SDK 3.0.

Every line described in Fig. 3 has been implemented as SPE code, and all these pieces of code have been SIMDized. This is very important to maintain efficient overall application performance, following Amdahl's law.
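
As an illustration of the four-points-at-a-time idea, the fragment below updates four consecutive Z points with the SPU intrinsics (for spu-gcc). It is only a sketch: just the X- and Y-neighbor terms are shown, because their strides are whole Z lines or planes and the vector loads stay 16-byte aligned (assuming nz is a multiple of four and the pointers are aligned); the Z-direction terms require additional shuffle operations that are omitted here, and the coefficient handling is simplified.

    #include <spu_intrinsics.h>

    /* SIMD update of four consecutive Z points at once (illustrative fragment). */
    void add_xy_neighbors(float *out, const float *in, int nz, int nx, float coeff)
    {
        vector float c = spu_splats(coeff);
        const vector float *xm = (const vector float *)(in - nz);      /* (x-1) neighbors */
        const vector float *xp = (const vector float *)(in + nz);      /* (x+1) neighbors */
        const vector float *ym = (const vector float *)(in - nz * nx); /* (y-1) neighbors */
        const vector float *yp = (const vector float *)(in + nz * nx); /* (y+1) neighbors */
        vector float *o = (vector float *)out;

        /* four points per multiply-add, one per neighbor direction */
        vector float acc = o[0];
        acc = spu_madd(c, xm[0], acc);
        acc = spu_madd(c, xp[0], acc);
        acc = spu_madd(c, ym[0], acc);
        acc = spu_madd(c, yp[0], acc);
        o[0] = acc;
    }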

6.2 RTM Implementation on GPUs

RTM consists of a mixture of serial control logic and inherently parallel computation. Furthermore, most of these computations are data-parallel. This directly matches the programming model provided by CUDA.

GPU performance is hugely penalized by frequent and large memory transfers through the PCIe bus. Therefore, we have implemented all the steps described in Section 3 as GPU kernels that operate on data residing in the GPU's GDDRAM. The host code only orchestrates the execution environment, the kernel invocations, and the I/O transfers to/from disk whenever they are necessary.

6.2.1 Memory Usage

. Forward. The 3D stencil computation step only needs to access the wavefield data of the current time step. However, the velocity model and the contents of the two previous time steps of the wavefield are also required for the time integration and the absorbing boundary condition steps. Additionally, a fifth volume is necessary to store the image illumination. The size in memory of a wavefield volume depends on the dimensions of the field, but an additional ghost area is used in each dimension to calculate the 3D stencil. The velocity model and the illumination volumes, on the other hand, do not need the ghost area.

Data are compressed before being transferred to disk in order to reduce both the PCIe and disk data communication times. A buffer is used to store the compressed volume while it is being copied to disk. Having a dedicated buffer to keep this information also allows the transfer to be performed concurrently with the computation of the next time step.

. Backward. The memory requirements during this phase are very similar. The illumination volume used in the forward phase is replaced by another one used to compute the final image by performing the correlation between the forward and backward wavefields. Additionally, the information of the receivers must also be stored in the GDDRAM.

Calculations show that the memory required to compute a single shot using realistic data sizes (e.g., 1,000 × 1,000 × 800) can easily exceed 16 GiB. However, current NVIDIA products offer 4 GiB of GDDRAM at most. Hence, a multi-GPU implementation, using either a shared-memory or a multinode system, is mandatory.


Fig. 11. (a) Scalar and (b) SIMD stencil representation.

We decompose the problem domain into as many subdomains as there are GPUs in the system. The domain is partitioned along the Y dimension so that subdomains are, internally, contiguous in memory. However, values in the subdomains' boundaries have to be communicated between GPUs every time step, as they are necessary for the next time step's computation. A number of planes equal to half the stencil's order must be transferred (four planes in the case of our RTM implementation). Thus, GPUs are synchronized at the beginning of each iteration in order to guarantee the coherence of the data. This additional overhead limits the performance of the GPU implementation, especially when communication is performed between different nodes.
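
The per-time-step boundary exchange can be sketched on the host side in C with the CUDA runtime API, staging through host memory since peer-to-peer copies are not available. Everything below (device pointer names, the single direction shown, the lack of streams) is illustrative; the real implementation exchanges halos in both directions for every neighboring pair of subdomains and overlaps the copies where possible.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Copy the 'nhalo' boundary planes (half the stencil order, four in our case)
       from GPU 'src_dev' to the ghost region of GPU 'dst_dev', staging through
       host memory because peer DMA is not available. */
    int exchange_halo(int src_dev, const float *d_src_planes,
                      int dst_dev, float *d_dst_ghost,
                      size_t plane_elems, int nhalo)
    {
        size_t bytes = plane_elems * nhalo * sizeof(float);
        float *h_stage = (float *)malloc(bytes);
        if (!h_stage) return -1;

        cudaSetDevice(src_dev);                                   /* device -> host */
        cudaMemcpy(h_stage, d_src_planes, bytes, cudaMemcpyDeviceToHost);

        cudaSetDevice(dst_dev);                                   /* host -> device */
        cudaMemcpy(d_dst_ghost, h_stage, bytes, cudaMemcpyHostToDevice);

        free(h_stage);
        return cudaGetLastError() == cudaSuccess ? 0 : -1;
    }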

6.2.2 I/O

During the backward phase, a correlation between the forward and backward wavefields must be performed at every stack step. That implies that the forward wavefields for these time steps have to be accessible in the backward phase. In the case of GPUs, an additional transfer between the GDDR and the host memory is necessary, as the data cannot be transferred directly to an external device (this is only possible with peer DMA, which is not supported in current GPUs). In order to minimize the performance impact of this data transfer, two measures have been adopted:

. Data compression. Data are compressed in the GPU memory before being transferred to the host memory and to disk. A huge size reduction is achieved at the expense of a new computational kernel that takes approximately 1/10th of the stencil computation time.

. Asynchronous I/O (AI/O). In addition to the host threads used to control the GPUs and synchronize the execution, one per-GPU host thread is used to write to disk as soon as the compressed data have been produced in the device. The execution of the following time steps can continue in parallel (see Fig. 12).

In the backward phase, the forward wavefields are read, transferred, and decompressed in the same manner.

Thread affinity: A graphics device is usually connected to a system processor through a dedicated high-bandwidth PCIe bus. In multiprocessor machines with many GPU devices, a bad CPU-to-GPU mapping can harm performance (we have experienced slowdowns of up to 2x in some runs). As we have a multicore multiprocessor system, we have used pthread_setaffinity_np to arrange for the control and I/O threads that communicate with the same GPU to be scheduled on different cores of the same processor. With these changes, we never observe the aforementioned slowdowns.
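
For reference, pinning a host thread in this way takes only a few lines of C on Linux (glibc-specific; the core number below is an arbitrary example):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one core so that a control or I/O thread stays
       on the processor (socket) closest to the GPU it serves. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_core(0) != 0) {   /* e.g., core 0; pick a core near the GPU's socket */
            fprintf(stderr, "pthread_setaffinity_np failed\n");
            return 1;
        }
        printf("thread pinned to core 0\n");
        return 0;
    }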

6.2.3 Computation

. 3D Stencil: Global memory is not implicitly cached by the hardware. Furthermore, accesses to global memory are very expensive. Therefore, optimizing global memory bandwidth usage is a must in order to obtain good performance from the GPU. A k-order stencil computation calculates the value of each point by using the k/2 neighbor elements in each direction, that is, 3k + 1 reads per calculated element. This number is referred to as read redundancy by Micikevicius in [18]. The same concept also applies to writes, and the sum of the two is called overall redundancy. We used this metric to analyze global memory accesses and improve memory bandwidth usage.

For the 3D stencil computation kernel, we use a 2D sliding-window approach. Shared memory is used to store an (n + k) × (m + k) tile which holds the elements of the z and x dimensions of the wavefield. Each thread loads an element into shared memory. Moreover, since threads load elements that are consecutive in memory, the loads can be coalesced in order to maximize the global memory bandwidth. Accesses to neighbor elements in these dimensions are then fetched from shared memory. The kernel iterates along the y dimension; thus, each thread computes a column along this dimension. Neighbor elements in the y dimension are stored in k registers of each thread, which behave like a queue: every time a tile is done, the oldest element is popped out and a new element is queued. Using this approach, we obtain a read redundancy of 3 and a write redundancy of 1 for 16 × 16 tiles.

. ABC: We have implemented a different kernel for each face of the 3D wavefield, as they require different memory access patterns. The approach used is the same as the one used for the stencil computation: a front of threads traverses the points that belong to the ABC area and updates them accordingly. In this case, the dimension which is traversed depends on the face that is being computed. This makes memory coalescing impossible when computing two of the faces, since neither of the two dimensions of these faces is consecutive in memory. The slowdown for these two faces can be up to 6x compared to any of the other four faces.

. Shot insertion: This computation is very simple and could be run on the host. However, since the wavefields reside in the GPU memory, additional memory transfers would have to be performed in order to keep the data coherent. Thus, we have implemented a simple kernel that uses a single thread in the GPU.

. Receivers' data insertion: The backward phase of the RTM algorithm requires exciting the medium with the data previously gathered by the receivers. The algorithm is the one used for the shot insertion, so the code has been reused. Furthermore, there can be up to thousands of receivers. Thus, we exploit the GPU parallelism and create one thread per receiver to calculate all of them in parallel.


Fig. 12. Execution trace of the host threads of the program. Threads 1 and 3 use the first GPU (control and I/O, respectively) and threads 2 and 4 use the second GPU. A wavefield is stored to disk every five time steps. Disk I/O can be overlapped with the next computational steps.


6.3 RTM on FPGAs

In this section, we describe how we estimate the performance of RTM on the FPGA node, and we describe the portions that are implemented in the FPGA.

The FPGA implementation of RTM (see Fig. 13) is based on a streaming approach in which the volumes are partitioned into subvolumes, which are then streamed through the FPGA.

6.3.1 3D Stencil

To maximize performance and minimize off-chip accesses, we concentrate on maximizing data reuse in the 3D stencil. Four streams are used for the input volumes in the forward phase (current volume, previous volume, illumination, and velocity volume), and one output stream is used for the output volume, one for the illumination, and another for the compressed output (only when disk writes need to be performed). In the backward phase, the illumination stream is replaced by the correlation stream.

A special-purpose cache focusing on data reuse has been designed based on the FPGA's internal Block RAM (BRAM). In the ideal case, every point of the previous volume loaded onto the FPGA's BRAM would be used exactly 25 times before it is removed from the FPGA, as there are 25 stencil computations that make use of every point. In practice, however, the reuse ratio is slightly lower because no output is generated for ghost points. One benefit of our modeled platform is its global shared memory, which allows computation to proceed without the need to communicate the ghost points between time steps. The subvolumes are sized such that nine contiguous planes can be kept simultaneously in the BRAMs. These planes form the smallest volume that allows computing one plane of the output subvolume. To complete the remaining planes of the output subvolume, two techniques are used. First, internally, planes are streamed from the subvolumes in the Y direction. Second, externally, domain decomposition is used to partition the volume into subvolumes along the Z and X axes. This completes the computation of the whole data set. Because the stencil requires access to volume points from the neighboring subvolumes, the real subvolume that is streamed already includes these ghost points.

The stencil data are laid out internally in the FPGA BRAM in a three-level memory hierarchy from which all necessary input points can be read in a single cycle (see Fig. 14). For the Virtex-4 LX200 device present in the SGI Altix 4700, the dimensions of the extended subvolume (i.e., including ghost points) are 200 points in the Z dimension and 75 in the X dimension. No output points are computed for these ghost points; therefore, the reuse degree is slightly smaller, 21.44 for the subvolumes used in this mapping. We assume the same dimensions for the Virtex-5 chip, even though this chip has more on-chip memory and might thus enable somewhat larger subvolumes with less overhead. Planes are streamed sequentially in the Y direction, so there is no limit on the number of planes in this direction. Thanks to data reuse and an aggressive data cache, this design can internally generate a huge supply of data. Unfortunately, this supply cannot be matched by the compute units, because synthesizing floating-point (FP) units on FPGA chips is costly in terms of area. In general, implementing standard floating-point units on FPGAs should be avoided due to the complexity of the IEEE 754 standard, which requires, among other things, continuous normalization after each operation and handling of rounding modes, NaNs, etc. For FPGAs, it is much more efficient to use fixed-point units, which map better to the available DSP units. For RTM, an interesting option to reduce area is to avoid rounding and normalization between each partial FP operation and do it only once before the data are stored back to main memory, as in [19]. On the other hand, the data front-end can easily scale to much higher bandwidth [20].


Fig. 13. A top-level view of the streaming system: the memory controller reads the data and fills the input FIFOs of the FPGA design, giving the data cache and compute units the view of a streaming environment.

Fig. 14. Layout of the special-purpose data cache for the 3D stencil computation in the FPGA implementation.

In this basic implementation, the compute units are standard data-flow versions of the stencil and time integration. In one Virtex-4 LX200, two compute units are implemented, running at twice the frequency of the data front-end. This allows the basic design to generate four output points per cycle. However, factoring in the plane and column switching overheads results in a steady-state performance of 3.14 points/cycle (1.57 results/cycle per compute unit).

Using Xilinx ISE 11.1, we conclude that, even without implementing the single-normalization option, three compute units can be implemented in each of the four Virtex-5 LX330 devices present in the modeled FPGA platform. Each compute unit consists of 27 adders and 16 multipliers. We expect the data cache to run at 150 MHz and the compute units at 300 MHz. This configuration will deliver a steady-state performance of 18.84 points/cycle at 150 MHz. Thus, the FPGA model requires 36 GiB/s of input bandwidth (three volumes × 18.84 points/cycle × 67/63 subvolume overhead × 4 bytes/point × 150 MHz) and 11.3 GiB/s of output bandwidth. This is less than the 80 GiB/s that the coprocessor memory can provide. Given that the memory access patterns are completely deterministic, an intelligent memory scheduler should have no problems exploiting this bandwidth by minimizing memory bank access conflicts.
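
These bandwidth figures follow from a few multiplications; the short C program below reproduces them, together with the 21.44 reuse ratio quoted earlier (assuming a ghost width of four points on each side of the 200 × 75 subvolume). It reports decimal GB/s, so the values differ marginally from the GiB/s figures in the text.

    #include <stdio.h>

    int main(void)
    {
        double results_per_cycle = 4 * 3 * 1.57;   /* 4 FPGAs x 3 units x 1.57 results/cycle */
        double freq_hz = 150e6;                    /* data front-end clock */
        double out_bw = results_per_cycle * 4.0 * freq_hz;                       /* 1 output volume */
        double in_bw  = 3.0 * results_per_cycle * (67.0 / 63.0) * 4.0 * freq_hz; /* 3 input volumes + overhead */

        printf("points/cycle: %.2f\n", results_per_cycle);          /* 18.84 */
        printf("output bandwidth: %.1f GB/s\n", out_bw / 1e9);      /* ~11.3 */
        printf("input  bandwidth: %.1f GB/s\n", in_bw / 1e9);       /* ~36   */

        /* BRAM reuse: ideally 25 uses per point, reduced by the ghost area of the
           200 x 75 extended subvolume (ghost width of 4 on each side assumed). */
        double reuse = 25.0 * (200 - 8) * (75 - 8) / (200.0 * 75.0);
        printf("effective reuse: %.2f\n", reuse);                   /* 21.44 */
        return 0;
    }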

We complete the estimation by analyzing the performance that can be obtained if we also accelerate the remaining parts of the code: the absorbing boundary conditions, the illumination, the correlation, and the compression/decompression.

6.3.2 ABC

Regarding the boundary conditions, they can be implemented using the same logic as the 3D stencil and time integration, but streaming planes from the volume ghost points. This way, we reuse the slices of the stencil and only implement a little additional logic. This will not deliver the best performance and will not be very efficient, but since the processing of ghost points is small compared to the stencil (less than 10 percent additional points for the volumes considered here), we do not consider it critical to accelerate this part any further.

6.3.3 Correlation and Illumination

These operations should also be accelerated. The two embarrassingly parallel operations are computationally very simple, but they require reading and writing a whole volume. They can be computed just after completing the stencil and time integration. Given that reading and writing a volume to/from coprocessor memory proceeds at 11.3 GiB/s, we need 22.6 GiB/s to accommodate this operation without a performance penalty. Overall, the computation requires 70 GiB/s, which is still below the 80 GiB/s maximum bandwidth.

6.3.4 Compression and Decompression

These steps are necessary to reduce the I/O requirements. We integrate these computationally simple operations into the stencil processing unit, both to compress a volume during the forward phase before storing it, and to decompress it during the backward phase. This requires 11 GiB/s more data bandwidth because a new volume is generated. Fortunately, these operations can be performed when no illumination/correlation is being computed.

7 PERFORMANCE EVALUATION

We have carried out experiments to verify, first, the numerical soundness and, second, the performance of the implementations. The experimental results show the appealing performance of the GPU, Cell/B.E., and FPGA with respect to the traditional multicore architecture. The results are averages over repeated runs, to eliminate spurious effects (e.g., bus traffic or unpredictable operating system events).

Fig. 15a shows that all the accelerators outperform the homogeneous multicore, from 6 times (Cell/B.E.) to 24 times (Tesla C1060). The Tesla C1060 outperforms all other accelerators because it is more recent than the Cell/B.E., and its hardware characteristics, and especially its architecture, are well suited to the algorithm mapping.

In Fig. 15b, two types of experiments are described. Experiment type 1 stands for 100 forward and 100 backward time steps, where the velocity model is a cube of 256 up to 512 grid points per side. Also, in type 1 experiments, the wavefield is stored/restored and correlated every five time steps.


Fig. 15. Elapsed times of representative experiments with all mentioned platform implementations of RTM. (a) Computation-only experiments, 100 steps, forward and backward. (b) Computation plus I/O experiments, forward and backward.

Experiments of type 2 have the same setup as type 1 experiments, but the number of time steps is 1,000 and the correlation frequency is 10 time steps. The latter kind of experiment is close to real industrial executions, both in terms of model size and number of time steps.

As can be foreseen, the I/O technologies attached to the tested architectures become an important bottleneck. This is because the accelerators deliver ready-to-be-stored data at a rate that the I/O is unable to handle. To avoid this problem, we take advantage of two main strategies: increasing the stack rate or applying data compression. Fig. 16 depicts the I/O requirements for some RTM test cases, where the stack has been set to five steps, compression is in place, and the problem dimension ranges from 256 to 512 cubic points. As can be observed, under the mentioned conditions, a Hypernode (similar to a SATA 2 10,000 RPM disk) cannot handle the work for every accelerator; furthermore, for the GPU and FPGA cases, better I/O technologies are a must. If the compression level has to be reduced, then even in the Cell/B.E. case there will be a severe I/O bottleneck.
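A rough model of the requirement plotted in Fig. 16: the sustained disk bandwidth one RTM instance needs is approximately

\[ BW_{\mathrm{disk}} \approx \frac{N^{3}\times 4\ \mathrm{B}}{s \, c \, t_{\mathrm{step}}}, \]

where N is the cube side, s the stack (one snapshot every s steps), c the compression ratio, and t_step the wall-clock time per time step on the given accelerator. For instance, a 512^3 single-precision volume is 512 MiB, so with s = 5 and a hypothetical compression ratio c = 4, roughly 26 MiB of compressed data must be written per time step; the faster the accelerator (the smaller t_step), the higher the resulting bandwidth demand, which is why the GPU and FPGA cases stress the Hypernode the most.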

8 WISHLIST

The performance obtained for RTM on each platform used in this study was limited by both RTM and the platforms' characteristics. Given that the RTM algorithm features are fixed, we believe that an effort could be made to adapt the architectures to problems like this one. The following sections explain our suggestions for modifying, adding, or removing certain characteristics of each of the platforms used in this work.

8.1 Cell/B.E.

Probably, the main drawbacks of the Cell/B.E. architecture are related to the limited memory bandwidth and the size of the LS. Both limitations affect the way an application should be designed and implemented, producing large and complex codes. Obviously, as users, we believe these two characteristics should be improved in new generations of the platform. However, after implementing RTM, we have identified several other aspects of the hardware and development software that, we think, could be improved. Regarding the hardware architecture, aspects such as PPE performance, the DMA list structure, or I/O bandwidth may limit application performance. Moreover, just as the Cell/B.E. architecture imposes restrictions on the program structure, so does the development environment provided with the architecture. We believe that an autovectorizing compiler or a higher-level programming model may reduce the development cost. In the following sections, we describe what we believe are the major issues related to the Cell/B.E. hardware and software development platform.

8.1.1 Hardware

. PPE performance: At the beginning of the RTM development process, we implemented some routines to be run on the PPE. However, we realized that the execution time of those routines was higher than the SPE execution time. Hence, most of the work was moved to the SPEs, with the associated development cost (e.g., due to vectorization, DMA management, and memory constraints). If the PPE performed a little better, the coding effort would be reduced, because it is easier to implement code for the PPE than for the SPEs. An example of such an issue arises when implementing the source wave introduction (line 6 in Fig. 3). The code for such a task contains few lines and represents less than 1 percent of the whole computation on a homogeneous processor. However, for certain cases, this code takes almost 50 percent of the computation time when executed on the PPE, while the rest of the RTM phases are executed on the SPEs. Moreover, porting such code to the SPEs is not trivial, because it is not easy to parallelize and implement efficiently as SIMD code.

. DMA lists: We extensively use DMA lists (Section 5.1), which support 32-bit addresses only. This is due to the list structure, where the most significant bits of the address are the same for every element in the list. That forces us to create a mechanism that considers what happens when our data surpass a 4 GiB boundary, increasing the code complexity even more; 64-bit addresses should be directly supported in the list elements. (The address-splitting sketch after this list illustrates the bookkeeping this imposes.)

. Access concentrator sharing: Regarding the connection between Cells within the same blade, some adjustments should be introduced. The Access Concentrator (AC) is active in one Cell/B.E. only, forcing the other Cell/B.E. to go through an external channel to access it. This creates imbalanced memory access times between the two processors. This is partially responsible for the NUMA solutions that were applied in Section 6.1. The ideal solution would be to have both ACs active, and a protocol to keep them coherent. Nevertheless, we realize that this may not be a cheap solution.

Fig. 16. RTM forward and backward with stack 5 and a high level of compression. Hypernode is a technology proposed by IBM for providing high-performance I/O, for instance, for the Cell/B.E. platform.

. I/O bandwidth: Another limitation outside the chip is the I/O bandwidth through the Hypernode (which is the best known I/O solution). As the number of concurrent RTM instances or the size of the case (see Fig. 16) increases, the disk access time through the same Hypernode also increases, imposing a dramatic I/O performance penalty on each RTM instance. In order to mitigate this effect, it is possible to increase the number of channels of the Hypernode. In Fig. 16, it can be observed that I/O does not seem to be a problem for the Cell architecture. However, the test was done considering a high level of compression and dedicated Hypernodes, which might not be the standard situation.
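The address-splitting sketch referred to in the DMA lists item above: a host-side C fragment (the names are ours, not Cell SDK calls) that breaks a transfer into pieces that never cross a 4 GiB boundary, so that all elements handed to one DMA list can share the same upper 32 address bits.

    #include <stdint.h>
    #include <stdio.h>

    /* Split [addr, addr + size) into chunks that never cross a 4 GiB
     * boundary; each chunk can then be described by a single DMA list
     * whose elements share the same upper 32 address bits. */
    static void split_at_4gib(uint64_t addr, uint64_t size)
    {
        const uint64_t FOUR_GIB = 1ULL << 32;
        while (size > 0) {
            uint64_t room  = FOUR_GIB - (addr & (FOUR_GIB - 1)); /* bytes left in this window */
            uint64_t chunk = (size < room) ? size : room;
            printf("list: hi=0x%08llx lo=0x%08llx bytes=%llu\n",
                   (unsigned long long)(addr >> 32),
                   (unsigned long long)(addr & 0xffffffffULL),
                   (unsigned long long)chunk);
            addr += chunk;
            size -= chunk;
        }
    }

    int main(void) { split_at_4gib(0xfff00000ULL, 3ULL << 20); return 0; }

With 64-bit addresses in the list elements, none of this bookkeeping would be needed.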

8.1.2 Software

. Compiler: Probably, the most important element we would like to find in the next generation of the development environment is an improved level of automation. For example, some degree of automatic vectorization would dramatically decrease the development cost; Fig. 11 shows the vectorial implementation that had to be written by hand. Furthermore, automation of memory accesses from the SPEs would be welcome.

. Programming Model: In addition to the compiler improvements, a simpler programming model would decrease the development cost. Such a programming model should hide as much as possible the restrictions imposed by the underlying architecture. For example, the double buffering used for SPE processing could be implemented automatically inside the programming model as some sort of prefetching, as in CellSs [21] (a sketch of the buffering pattern such a model would have to generate is given after this list).

. DMA concept: The DMA system can manage both individual and collective transfers, which only support a base address and a number of bytes to be transferred. If the concept of a stride were included, there would be no need to keep huge DMA lists for accessing data with constant strides. This would again decrease the code complexity and the amount of extra data (i.e., DMA lists) needed to control access to the data used for computation.

. NUMA considerations: The Cell/B.E. QS22 platform is a NUMA architecture. In order to obtain the best performance from the system, it was necessary to take this characteristic into account when developing RTM (see Fig. 9). However, some kind of automation included in the programming model or the compiler could manage thread affinity and memory bank allocation in order to reduce the memory access time.

. SPE debug: Debugging SPE code is a time-consuming task. One of the main problems we face with this kind of debugging is the lack of feedback. Moreover, there is no protection between code and data accesses inside the SPE LS, which allows the user to corrupt the code. From our point of view, some enhancements should be made regarding these debugging topics.
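As an illustration of the pattern such a programming model would have to generate for the SPEs, here is a schematic double-buffered loop in plain C. The asynchronous DMA get is stood in by memcpy (so there is no real overlap here); on a real SPE the two transfers would be issued with MFC DMA commands on different tags and waited on before use.

    #include <string.h>

    #define CHUNK 4096   /* elements per buffer; sized to fit in the 256 KiB LS in practice */

    /* Process a stream in CHUNK-sized pieces with two local buffers:
     * while chunk i is being computed on, chunk i+1 is being fetched. */
    void process_stream(const float *in, float *out, int nchunks)
    {
        static float buf[2][CHUNK];
        int cur = 0;

        memcpy(buf[cur], in, CHUNK * sizeof(float));           /* prefetch chunk 0 */
        for (int i = 0; i < nchunks; ++i) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                                /* start fetching chunk i+1 */
                memcpy(buf[nxt], in + (size_t)(i + 1) * CHUNK, CHUNK * sizeof(float));
            for (int j = 0; j < CHUNK; ++j)                     /* compute on chunk i */
                out[(size_t)i * CHUNK + j] = 2.0f * buf[cur][j];
            cur = nxt;
        }
    }

This is the kind of buffering (and the corresponding tag management) that a CellSs-like model could derive automatically from a plain sequential loop.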

8.2 GPU

Current GPUs offer outstanding computational power, but their performance is hugely penalized by two main factors. First, the GDDRAM memory size of current devices is not big enough to handle problems like RTM directly, and imposes the usage of techniques like domain decomposition. Second, GPU devices are connected to the host through a PCIe bus that has a limited bandwidth. Since communication between GPU devices is required due to the constrained amount of memory, the global performance of the application is bounded by the communication bandwidth (Fig. 12). We propose some modifications that could help to mitigate these problems.

8.2.1 Hardware

. Cache memory: Global memory is throughput-oriented. The kernels that benefit the most are those that perform sequential memory accesses, as the width of the memory bus allows many of them to be served at a time. If some values have to be reused by different threads, the common practice is to store those values in the shared memory of each thread block. However, in computations like stencils, values from neighboring thread blocks may also be necessary. Thus, there are memory locations that are read many times (Section 6.2.3). Having a memory cache shared among all the SMs would avoid the extra accesses to the GDDRAM. It has recently been announced that the next generation of NVIDIA GPUs (called Fermi1) will have a two-level cache hierarchy. The shared-memory tile sketch after this list makes the redundant reads explicit.

1. http://www.nvidia.com/fermi.

. Direct GPU/GPU communication: Since domain decomposition is mandatory in RTM due to memory constraints, several MiB of data must be transferred between devices every time step (Section 6.2.1). Currently, data communications between GPU devices require an intermediate copy in the host memory. A technique to communicate GPU devices directly, without using host memory, such as Peer DMA, would avoid this extra copy (see the halo-exchange sketch after this list). Moreover, GPUs can be connected through a Scalable Link Interface (SLI) that has greater bandwidth than the PCIe bus. However, communicating GPU devices over SLI is not currently exposed to the programmer.

. Memory controller: Although high memory bandwidth is achieved, better memory coalescing would be necessary to reach the bandwidth peak, for instance, when nonregular access patterns are used (like the memory patterns used to compute the ABC of some faces).

. Dynamic thread layout: Launching kernels on the GPU has an overhead. Being able to execute many kernels without the intervention of the host would eliminate this extra work. Currently, the main problem in doing this is that each kernel has its own thread configuration and resource allocation (e.g., shared memory and registers) and cannot be reconfigured dynamically. Fermi devices will reduce the kernel launch overhead.
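The shared-memory tile sketch referred to in the cache memory item above: it loads a 2D tile plus its halo for a radius-4 (8th-order) stencil plane. Block dimensions and coefficients are illustrative, not those of our kernels. The halo points loaded on a block border are fetched again from GDDRAM by the neighboring block, which is exactly the redundant traffic a cache shared among SMs would absorb.

    #define R  4    /* stencil radius (8th order in space) */
    #define BX 16
    #define BY 16

    __global__ void stencil_plane(const float *in, float *out, int nx, int ny)
    {
        __shared__ float tile[BY + 2 * R][BX + 2 * R];

        int gx = blockIdx.x * BX + threadIdx.x;
        int gy = blockIdx.y * BY + threadIdx.y;

        /* Cooperative load of the tile and its halo. */
        for (int dy = threadIdx.y; dy < BY + 2 * R; dy += BY)
            for (int dx = threadIdx.x; dx < BX + 2 * R; dx += BX) {
                int x = min(max((int)(blockIdx.x * BX) + dx - R, 0), nx - 1);  /* clamp at edges */
                int y = min(max((int)(blockIdx.y * BY) + dy - R, 0), ny - 1);
                tile[dy][dx] = in[y * nx + x];
            }
        __syncthreads();

        if (gx >= nx || gy >= ny) return;

        /* In-plane contributions of the high-order stencil (placeholder coefficients). */
        float acc = 0.5f * tile[threadIdx.y + R][threadIdx.x + R];
        for (int k = 1; k <= R; ++k)
            acc += 0.1f * (tile[threadIdx.y + R][threadIdx.x + R - k] +
                           tile[threadIdx.y + R][threadIdx.x + R + k] +
                           tile[threadIdx.y + R - k][threadIdx.x + R] +
                           tile[threadIdx.y + R + k][threadIdx.x + R]);
        out[gy * nx + gx] = acc;
    }

Launched with 16 x 16 thread blocks, each block reads (16 + 8)^2 = 576 points to produce 256 outputs, i.e., more than twice the compulsory global-memory traffic at these tile sizes.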
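The halo-exchange sketch referred to in the direct GPU/GPU communication item above: host-side CUDA code for moving a boundary slab from device 0 to device 1 as it must currently be done, staging through pinned host memory (buffer names and sizes are illustrative). Peer DMA would collapse the two PCIe transfers into one.

    #include <cuda_runtime.h>

    /* Copy `bytes` of halo data from a buffer on device 0 to a buffer on
     * device 1 via a pinned host staging buffer: two PCIe transfers. */
    void exchange_halo(const void *d_src_dev0, void *d_dst_dev1, size_t bytes)
    {
        void *h_stage;
        cudaMallocHost(&h_stage, bytes);                    /* pinned host buffer */

        cudaSetDevice(0);
        cudaMemcpy(h_stage, d_src_dev0, bytes, cudaMemcpyDeviceToHost);

        cudaSetDevice(1);
        cudaMemcpy(d_dst_dev1, h_stage, bytes, cudaMemcpyHostToDevice);

        cudaFreeHost(h_stage);
    }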

8.2.2 Software

. Virtual memory: The distinction between the data structures that are allocated in the host and in the GPU memory harms programmability. The coherence of the data must be maintained manually by the programmer. A system that automatically keeps track of the modified (or dirty) pages in the host and GPU memories and makes them coherent could be implemented to ease GPU application development, such as GMAC [22].

. Fine-tuning optimizations: The CUDA toolchain performs program compilation in two steps. The first step generates an intermediate closed byte code. Later, the CUDA driver recompiles the byte code and generates code optimized for the device on which it is going to run. That makes it impractical to manually tune assembly code and to identify hotspots and real register usage.

. Debugging: Starting from the CUDA SDK 2.2 version, debugging GPU kernel code is possible. However, some features are still missing. First, when the code performs an illegal memory access, the execution is aborted and the information about the faulting instruction and the CUDA thread identifier is lost. Second, since the CUDA runtime allows up to 30,720 threads to run simultaneously, an extension to facilitate the debugging of such a number of threads would be welcome.

. Profiling: Fine-grained profiling information (e.g., thread block scheduling) is not provided by the current tools. This information would enable a deeper analysis of the kernel execution behavior in order to optimize it.

8.3 FPGA

The experience of porting RTM to the SGI RASC RC100 revealed several problems that considerably harmed the possibilities offered by the FPGA chip itself. Our projected performance based on the more recent Convey HC-1 showed that many of these bottlenecks are addressed by current products. However, many issues can still be improved:

8.3.1 Hardware

. Floating Point: Contrary to the GPU and Cell implementations, the FPGA implementation of RTM turns out to be compute bound. While previous research showed that a storage subsystem feeding 16 compute engines can be built in a year-2005 Virtex4-LX200 device [20], only three compute engines actually fit in a year-2007 Virtex5-LX330. To better support the implementation of FP stencils, future FPGA devices should include more hardware blocks to increase the density of floating-point adders and multipliers or, in the extreme case, include full implementations of single-precision FP blocks.

8.3.2 Software

. High-Level Synthesis: For performance and tool-support reasons, FPGA chips are still mostly programmed in Hardware Description Languages (HDLs), which require low-level details (such as wires and signals) to be specified (see Section 5.3.3). This limits developer productivity. To overcome this limitation, many tools have been developed that allow FPGAs to be programmed using variants of the C language. While the performance of recent C-to-gates tools has considerably improved, diversity in extensions and language constructs still limits user adoption.

. Debugging: Both the compilation times and the debugging methodologies for FPGA application development are very slow and painful when compared to CPU application development. Debuggers such as gdb have been extended to read FPGA registers, but this approach is not very useful for debugging timing errors or control problems. Model simulations are normally required for detecting such issues. However, these tools are unfamiliar to software developers and require detailed analysis of multiple signals in order to understand which signals are causing trouble. Clearly, future tools will need to dramatically improve these productivity issues. C-to-gates would also help, as these methodologies allow verification to be performed completely at the software level.

. Portability: The lack of standards across FPGA platforms considerably harms user adoption. In general, code targeting one platform encodes characteristics of that platform, such as busses, the memory subsystem, host interfaces, etc. As a result, the HDL code can only be ported to a different platform with a great deal of effort. In our case, porting our SGI RASC implementation of RTM to a different platform would have required a considerable effort to adapt the external I/O and host interfaces. There is little motivation for undertaking such a huge engineering task, particularly in a research environment. It is clear that standards and methodologies need to be developed that make user code more portable. For example, the development of some standard across C-to-gates tools would greatly improve portability, as the remaining platform differences could be handled automatically by the compiler or platform library. This would greatly increase user adoption.

9 CONCLUSIONS

In general terms, GPUs, the Cell/B.E., and FPGAs outperform traditional multicores by one order of magnitude. However, to achieve this, a great development effort is required, mainly because the programming environments are still immature. The principal issues faced when implementing RTM are:

. FPGA: The RTM port to the FPGA is the one that requires the most effort. All operations need to be described in HDL. IP cores provided by Xilinx CoreGen were used to increase productivity. However, in the future, high-level productivity tools will be critical to allow developers to harness the potential of FPGA technology.

. Cell/B.E.: In order to obtain high performance on the Cell/B.E., the developer has to go through internals of the platform that should be hidden. Thus, we would like to see tools that, for example, help with the SIMDization of the code, DMA management, and automatic double buffering. Also, in pure performance terms, the platform is lagging behind the fast release cycle of GPUs.


. GPU: Developing the GPU version has been the fastest among all platforms, mainly thanks to the CUDA programming model. The performance results are very impressive, although the peak memory bandwidth has not been reached. In order to achieve even more performance, better optimization tools are required.

Summarizing, accelerators deliver high-performance computing, surpassing their homogeneous counterparts, but the development effort is higher due to many architectural particularities. Furthermore, we believe that accelerators can cope with current HPC workloads like 3DFD-based applications, especially if the programmability shortcomings are solved.

ACKNOWLEDGMENTS

The authors thank Repsol and the Barcelona Supercomputing Center (under the Kaleidoscope Project2) for their permission to publish the material reported in this paper.

2. See www.kaleidoscopeproject.info.

REFERENCES

[1] J.A. Kahle, M.N. Day, H.P. Hofstee, C.R. Johns, T.R. Maeurer, and D. Shippy, "Introduction to the Cell Multiprocessor," IBM J. Research and Development, vol. 49, nos. 4/5, pp. 589-604, 2005.

[2] S. Patel and W.-M.W. Hwu, "Accelerator Architectures," IEEE Micro, vol. 28, no. 4, pp. 4-12, July/Aug. 2008.

[3] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. Morgan Kaufmann Publishers Inc., 2007.

[4] E. Baysal, D.D. Kosloff, and J.W.C. Sherwood, "Reverse Time Migration," Geophysics, vol. 48, no. 11, pp. 1514-1524, 1983.

[5] R. Baud, R. Peterson, G. Richardson, L. French, J. Regg, T. Montgomery, T. Williams, C. Doyle, and M. Dorner, "Deepwater Gulf of Mexico 2002: America's Expanding Frontier," OCS Report, vol. MMS 2002-021, pp. 1-133, 2002.

[6] D.E. Shaw, M.M. Deneroff, R.O. Dror, J.S. Kuskin, R.H. Larson, J.K. Salmon, C. Young, B. Batson, K.J. Bowers, J.C. Chao, M.P. Eastwood, J. Gagliardo, J.P. Grossman, C.R. Ho, D.J. Ierardi, I. Kolossvary, J.L. Klepeis, T. Layman, C. McLeavey, M.A. Moraes, R. Mueller, E.C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, and S.C. Wang, "Anton, a Special-Purpose Machine for Molecular Dynamics Simulation," Proc. 34th Ann. Int'l Symp. Computer Architecture (ISCA '07), pp. 1-12, 2007.

[7] Y. Sun, F. Qin, S. Checkles, and J.P. Leveille, "3D Prestack Kirchhoff Beam Migration for Depth Imaging," Geophysics, vol. 65, pp. 1592-1603, 2000.

[8] A.G.F. Ortigosa, Q. Liao, and W. Cai, "Speeding Up RTM Velocity Model Building Beyond Algorithmics," Proc. SEG Int'l Exposition and 78th Ann. Meeting, Nov. 2008.

[9] A. Ray, G. Kondayya, and S.V.G. Menon, "Developing a Finite Difference Time Domain Parallel Code for Nuclear Electromagnetic Field Simulation," IEEE Trans. Antennas and Propagation, vol. 54, no. 4, pp. 1192-1199, Apr. 2006.

[10] S. Operto, J. Virieux, P. Amestoy, L. Giraud, and J.Y. L'Excellent, "3D Frequency-Domain Finite-Difference Modeling of Acoustic Wave Propagation Using a Massively Parallel Direct Solver: A Feasibility Study," SEG Technical Program Expanded Abstracts, pp. 2265-2269, 2006.

[11] S. Kamil, P. Husbands, L. Oliker, J. Shalf, and K. Yelick, "Impact of Modern Memory Subsystems on Cache Optimizations for Stencil Computations," Proc. Workshop Memory System Performance (MSP '05), pp. 36-43, 2005.

[12] M.E. Wolf and M.S. Lam, "A Data Locality Optimizing Algorithm," ACM SIGPLAN Notices, vol. 26, no. 6, pp. 30-44, 1991.

[13] R. de la Cruz, M. Araya-Polo, and J.M. Cela, "Introducing the Semi-Stencil Algorithm," Proc. Eighth Int'l Conf. Parallel Processing and Applied Math., 2009.

[14] G. Rivera and C.W. Tseng, "Tiling Optimizations for 3D Scientific Computations," Proc. High Performance Networking and Computing Conf., 2000.

[15] L. Dagum and R. Menon, "OpenMP: An Industry-Standard API for Shared-Memory Programming," IEEE Computational Science and Eng., vol. 5, no. 1, pp. 46-55, Jan. 1998.

[16] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, pp. 39-55, Mar./Apr. 2008.

[17] M. Garland, S.L. Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel Computing Experiences with CUDA," IEEE Micro, vol. 28, no. 4, pp. 13-27, July/Aug. 2008.

[18] P. Micikevicius, "3D Finite Difference Computation on GPUs Using CUDA," Proc. Second Workshop General Purpose Processing on Graphics Processing Units (GPGPU-2), pp. 79-84, 2009.

[19] C. He, G. Qin, M. Lu, and W. Zhao, "An Efficient Implementation of High-Accuracy Finite Difference Computing Engine on FPGAs," Proc. 17th IEEE CS Int'l Conf. Application-Specific Systems, Architectures and Processors (ASAP '06), pp. 95-98, 2006.

[20] M. Shafiq, M. Pericas, R. de la Cruz, M. Araya, N. Navarro, and E. Ayguade, "Exploiting Memory Customization in FPGA for 3D Stencil Computations," Proc. Int'l Conf. Field-Programmable Technology (FPT '09), 2009.

[21] P. Bellens, J.M. Perez, R.M. Badia, and J. Labarta, "CellSs: A Programming Model for the Cell BE Architecture," Proc. ACM/IEEE Conf. Supercomputing (SC '06), p. 86, 2006.

[22] I. Gelado, J. Cabezas, J. Stone, S. Patel, N. Navarro, and W.-M. Hwu, "An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems," Proc. 15th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2010.

Mauricio Araya-Polo received the engineering degree in computer science in 2001 from the University of Chile. He received the master's and PhD degrees from the University of Nice-Sophia Antipolis at the Institut National de Recherche en Informatique et Automatique (INRIA), France, in 2003 and 2005, respectively. Since 2007, he is a researcher, and since 2009 a senior researcher, on computational geophysics at the Barcelona Supercomputing Center (BSC). His research interests cover the areas of multicore architectures, programming models, numerical algorithms, and code optimization techniques for HPC. He is a member of the ACM, the SIAM, the SEG, and the EAGE.

Javier Cabezas received the bachelor's degree in computer science and the master's degree in computer architecture from the Universitat Politecnica de Catalunya (UPC). Since 2006, he is a PhD student in the Computer Architecture Department at UPC. His research is focused on operating system support for heterogeneous massively parallel computing systems and massively parallel accelerators.

Mauricio Hanzich received the bachelor's degree in computer science from the Universidad del Comahue (Neuquen) and the PhD degree from the Universitat Autonoma de Barcelona. He is a senior researcher in the Barcelona Supercomputing Center at the Spanish National Supercomputing Institute. He is currently researching and developing seismic imaging tools for the oil industry. Prior to this position, he was a professor at the Universitat Autonoma de Barcelona and an information technology consultant for the Argentinian government. Raised in Neuquen, Argentina, he now lives in Barcelona.



Miquel Pericas received the engineering degree in telecommunications from the Technical University of Catalonia (UPC) in 2002, and the PhD degree in computer architecture in 2008, also from UPC. From 2003 to 2005, he lectured on computer organization at the Barcelona School of Informatics (FIB). He is currently a researcher with the Computer Sciences Group at the Barcelona Supercomputing Center (BSC). His research interests focus on heterogeneous supercomputing and, in particular, on applying FPGA technology to supercomputing.

Felix Rubio received the degree in telecommunication engineering in 2006 from the Universitat Politecnica de Catalunya, and is now a master's candidate in computer architecture. He is also a researcher at the Barcelona Supercomputing Center, where he works on optimization strategies for heterogeneous architectures. Currently, he is mainly focused on the optimization of demanding geophysical high-performance codes.

Isaac Gelado received the engineering degree in telecommunications in 2003 from the University of Valladolid. Since 2004, he is working toward the PhD degree in the Computer Architecture Department at the Universitat Politecnica de Catalunya (UPC). He is currently a lecturer in the Telecommunications Engineering School at UPC. His research covers operating systems and architecture support for heterogeneous massively parallel computing systems.

Muhammad Shafiq received the MSc (Electronics) degree in 1999 from Quaid-i-Azam University, Islamabad. Later, he worked for seven years on embedded systems design and development for NESCOM Pakistan. Since November 2007, he has been working toward the PhD degree in computer architecture at the Universitat Politecnica de Catalunya (UPC). His main research interests include reconfigurable accelerators, with a focus on efficient data management strategies for HPC applications.

Enric Morancho received the degree in computer science in 1992 and the PhD degree in computer science in 2002, both from the Universitat Politecnica de Catalunya (UPC), Spain. In 1993, he joined the Department of Computer Architecture at UPC, where he is currently an associate professor. His research interests include processor microarchitecture, memory hierarchy, and awareness of architecture in programming.

Nacho Navarro received the PhD degree in computer science from the Universitat Politecnica de Catalunya (UPC), Barcelona, Spain, in 1991, where he has been an associate professor since 1994. He is also a senior researcher at the Barcelona Supercomputing Center (BSC). His current research interests include tools for the evaluation of multicore microprocessors, application-specific computer architectures, dynamic reconfigurable logic, and resource management in heterogeneous environments and sensor networks. He is also doing research on the programmability and support of hardware accelerators like GPUs at the IMPACT Research Group, University of Illinois. He is a member of the IEEE, the IEEE Computer Society, the ACM, and the USENIX.

Eduard Ayguade received the engineering degree in telecommunications in 1986 and the PhD degree in computer science in 1989, both from the Universitat Politecnica de Catalunya (UPC), Spain. Since 1987, he has been lecturing on computer organization and architecture. Since 1997, he is a full professor in the Computer Architecture Department at UPC. He is currently an associate director for research on computer sciences at the Barcelona Supercomputing Center (BSC). His research interests cover the areas of multicore architectures, and programming models and compilers for high-performance architectures.

Jose Maria Cela received the PhD degree in telecommunication engineering from the Universidad Politecnica de Cataluna. Currently, he is the director of the Computer Applications in Science and Engineering Department (CASE) at the Barcelona Supercomputing Center. His research interests cover the areas of parallel computing, PDE solvers, and optimization algorithms.

Mateo Valero is a professor in the Computer Architecture Department of UPC, Barcelona. His research interests include high-performance architectures. He has published approximately 500 papers, has served in the organization of more than 200 international conferences, and has given more than 300 invited talks. He is the director of the Barcelona Supercomputing Center, the National Center of Supercomputing in Spain. He has been honored with several awards, including the Eckert-Mauchly Award and the "King Jaime I" Award in Research, and two National Awards on Informatics and on Engineering. In December 1994, he became a founding member of the Royal Spanish Academy of Engineering. In 2005, he was elected correspondent academic of the Spanish Royal Academy of Science; in 2006, a member of the Royal Spanish Academy of Doctors; and, in 2008, a member of the Academia Europaea. He is a fellow of the IEEE, a fellow of the ACM, and an Intel distinguished research fellow.

