

SAR Image Synthesis Algorithm Improvement on Multi-processor/Multi-core Computers: vectoring on massively parallel processors

Carole E. Nahum∗, Hubert M.J. Cantalloube∗∗

∗Direction Générale de l'Armement, 5/7 rue des Mathurins, 92221 Bagneux CEDEX, France. email: [email protected]

∗∗Office National d'Études et Recherches Aérospatiales, Chemin de la Hunière, 91761 Palaiseau CEDEX, France. email: [email protected]

Abstract: SAR image synthesis, though highly systematic and hence clearly highly parallelisable, requires local variations in the processing in the case of airborne acquisitions, to compensate for fluctuations of the effective trajectory around the planned flight path. We present here an alternative to the parallelisation into concurrent independent threads described earlier: vectorisation on massively parallel processors. The changes in data organisation and in the division of the computation imposed by the hardware specificities are described, and performance is compared with multi-threading on multi-CPU systems.

1. Introduction

At the end of 2008, ONERA changed the carrier aircraft for its airborne SAR system RAMSES. The previous system was deployed through the side door of a wide-body aircraft, while the new system is mounted in pods under the wings. The proximity of the wing, the reduced height of the antennae and the lower frequency induce a frequency dependency of the antenna pattern, which motivated a major change in the SAR processing software. The opportunity was taken to adapt the program to multiprocessors: it is now parallelised using the Native POSIX Thread Library (NPTL) interface for multi-core computers [4].

However, the need for "near real time" image synthesis, for on-board assessment of the success of an acquisition, motivated us to adapt the algorithm to massively parallel computers such as programmable graphics processing units (GPUs) or other architectures (Cell, Niagara). Unlike NPTL multi-threading, in which different CPU cores can process different stages simultaneously (pipelining) or the same stage out of synchronism (data parallelism), the subunits of massively parallel systems execute the same code in a more or less simultaneous manner (some GPUs are strictly vector processing units, while others are only loosely so, the processing being scheduled by independent groups of fixed size). Adapting the software to this type of computer is therefore not a straightforward transposition of the multi-threaded NPTL coding.

The OpenCL programming interface was chosen because it is available on both Nvidia and ATI GPUs and on several other architectures (it is intended to be "device agnostic"), and because it uses "on the fly" compilation, which allows the application to produce a program source optimised for the exact waveform and processing parameters of the run. (Our radar system is very flexible, and the waveform is often changed between acquisitions, hence this combination of flexibility and optimisation was appealing.)


2. Processing organisation

The SAR processor uses the ω−k algorithm [1, 2, 3] and is organised (Fig. 1) in successive steps operating mainly either along the rows (initially the sampling window or "fast time", ultimately the "range" axis) or along the columns (initially the succession of pulses or "slow time", ultimately the "azimuth" axis).

Figure 1: Left: synoptic of the SAR processor (blue = optional steps, red = bistatic-specific steps, green = test points, red background = mono-static-only steps). Right: portions of the processing performed along rows and along columns.

3. Data reorganisation for GPU processing

For best efficiency on a GPU, data at successive addresses should be accessed in parallel, which is naturally the case when processing along columns. However, since GPUs are not (yet) capable of handling complex floating-point values natively, the real and imaginary parts are interleaved by blocs, instead of the per-sample real/imaginary interleaving used in CPU processing.

For the stages processed along rows, the data is not in the appropriate order. The easiest solution would be to transpose rows and columns of the data, but since the processed data bloc need not be square, this operation would have an excessive cost. Our approach is instead to transpose the data matrix by square blocs of size equal to the minimum vector scheduling unit size (the "warp" size). The processing itself remains very similar: the bloc transposition can be seen as a permutation of a few low address bits (e.g. 5 bits for a warp size of 32).

4. Code reorganisation for GPU processing

4.1. The local synchronisation

The main difficulty in the code reorganisation is due to the "loose" synchronisation between the elementary processors in a GPU (this looseness of synchronisation is crucial for their high throughput, since it allows interleaving of thread instructions, thus hiding memory access latency). The only way of synchronising all the threads globally is to end the program (a "kernel") and start a new one. This may seem a strong constraint, but it also has the positive aspect of allowing a different tessellation of the data for different program blocs. Furthermore, the OpenCL programming interface allows natural chaining of the kernels, which are put in a batch queue; the completion of any kernel may be (part of) the condition for starting another kernel.

4.2. The global memory segmentation

GPU global memory is segmented, hence an allocated bloc is limited in size, which may be smaller than the SAR processing bloc size. Furthermore, the address of a bloc may change between kernel invocations. This last point forbids the use of an array of pointers to access the SAR processing bloc. In the original (CPU) algorithm, the pointers are circularly permuted between successive SAR processing blocs, in order to shift the bloc overlap from the end of the previous processing bloc to the beginning of the next one. For the GPU version, this was changed to a circular buffer, split into up to 4 segments; the choice between the segments is made by two integer comparisons (against hard-coded constant values) of the offset index value.

Furthermore, as global GPU memory is smaller than what is currently available to CPUs, the possibility of using part of the main buffer as output buffer was added. This is possible only if the overlap between successive processing blocs is smaller than 50%. Moreover, it saves memory only if the increase in row length in the main buffer does not outweigh the decrease in the number of rows. (Note that since range re-sampling is done just prior to the transfers back to CPU RAM, by blocs of 32 rows, the output buffer row length is much smaller when the processing is split by sub-bandwidth.)

5. Code optimisation issues

5.1. Fourier transform optimisation

Since all Fourier transforms in the processing are applied to a large number of rows or columns of identical sizes, the code is optimised for simultaneous execution of several consecutive sets of transforms (each consecutive set contains one warp size worth of transforms). For efficiency, the advice of [5, 6] was followed, and the algorithm uses base-32 transforms performed by one kernel for each stage. The first-stage kernel works out of place because it also performs the bit-reversal permutation (thus avoiding a global synchronisation issue), while the higher stages work in place (except the last one, which puts the result back into the original array, and can be of base 2 to 32 depending on the Fourier transform size). Internally, at each stage, a bloc of 32 threads performs one base-32 butterfly on 32 transforms, keeping the data in registers. Each thread computes one base-32 butterfly as an 8-point base-2 FFT followed by a 4-point base-2 FFT. The resulting code has a very low occupancy (2 to 3%) and a very high register usage, but thanks to its low memory bandwidth it reaches a good performance level.

The capability of performing a small number of non-consecutive blocs of consecutive Fourier transforms within the same kernel invocation is used for frequency-agile waveforms (at the range compression stage) and for the pre-compensation of the second and higher order motion-induced (and bistatic) quadratic phase.

5.2. Triple-buffering of computer to/from GPU data transfers

A severe bottleneck in GPU systems is the transfer between CPU memory and GPU global memory. Presently, the GPU interfaces to the CPU through a PCIe bus, whose bandwidth is much smaller than that of CPU memory, and SAR processing is intensive in data transfers: the example of Fig. 2 requires reading 34.3 Gb of signal and writing 8.9 Gb of image (and, optionally, 4.5 Gb of illumination pattern). A transfer (e.g. from CPU to GPU) requires three operations: first, the data is written into a locked (pinned) area of CPU memory; second, the data is copied through the PCIe bus to GPU memory; third, the data is formatted into a form usable by the GPU (namely, the signal, which is real with one byte per sample, is converted to complex floating-point numbers with real and imaginary parts interleaved by blocs of 32, the warp size). Since these operations can be executed simultaneously for successive blocs, it is more efficient to use three buffers in locked memory and three buffers in GPU global memory, and to pipeline the three operations in a "round robin" manner. Indeed, the first operation requires the CPU to write to its memory; the second operation uses dedicated hardware, direct memory access (DMA) on the CPU side and the copy engine on the GPU side; and the third operation uses the GPU processing units. (Note also that the raw signal is transmitted to the GPU before its conversion to complex floats, even if the byte ordering does not match that of the GPU, because this minimises the number of transmitted bytes.)

6. Performances

Programming is still in progress, but the code is already operational with some restrictions.

Missing functionalities: computation of the illumination pattern (not strictly required of a SAR processor, but useful for SAR image compositing), part of the radio-frequency interference filtering for the deramp-on-receive and Frequency-Modulated Continuous-Wave (FMCW) waveforms, and the bistatic processing.

Required enhancements: triple buffering of signal input and image output (tested on a toy problem, but not in the processor); optimising some kernels (especially the one applying the quadratic phase compensation, and the nominal processing) to ensure coalesced memory accesses; pipelining between the stages processed in the main buffer and the stages in the output buffer; and moving to the GPU the computation of the terrain elevation in image coordinates and of the motion compensation, which are currently performed on the CPU although they are very close to depth-buffering and ray-tracing, the historical tasks of GPUs.

Tests have been run on a desktop computer with 2 × 4-core Core2 CPUs and two C2050 GPUs (the C2050 GPU has 14 MPUs with 32 scalar units each, and 3 Gb of memory). The benchmarking problem is that of [4] (Fig. 2), which is representative of present-day SAR images produced at ONERA. For comparison, the CPU computations in [4] are performed by a server with 8 hyper-threading Xeons (16 equivalent scalar processors) and 64 Gb of RAM, and a laptop with one hyper-threading Pentium (2 equivalent scalar processors) and 2 Gb of RAM. The server reads the signal data from a 16-hard-disk-wide RAID and writes the image to a 4-hard-disk-wide RAID. All other computers read and write to a single hard disk.

Figure 2: Test signal: acquisition at X-band, with bandwidth = 1.224 GHz (obtained by 5 successive frequency agilities covering 240 MHz each), duration = 120 s (34.3 Gb), full-resolution swath = 1200 m slant, square-resolution stripe length = 9300 m, target = down-town Toulouse (France).

computer   processing configuration       memory usage (Mb)   elapsed time   GPU time   notes
server     5 proc.×scalar                 ?                   10:15          -          (a)
server     1 proc.×16-threads             3943                02:23          -          (b)
laptop     1 proc.×2-threads×5 runs       1984                19:15          -
server     1 proc.×16-threads×5 runs      1984                02:35          -
server     5 proc.×3-threads              1984×5              01:35          -
server     5 proc.×15-threads             1984×5              01:20          -          (c)
desktop    1 GPU×1 queue×5 runs           1145                01:12          00:09:25   (d)
desktop    2 GPUs×1 queue×2 runs,         1145×2              00:43          00:05:40
           and 1 GPU×1 queue
desktop    2 GPUs×2 queues,               2290×2              00:38          00:05:40   (e)
           and 1 GPU×1 queue
desktop    1 GPU×1 queue                  2763                00:13:25       00:04:40   (b)

Table 1: Benchmark results (multi-thread tests included for reference).
notes: (a) the old scalar code did not report memory usage, unlike the multi-threaded and OpenCL codes. (b) here the full bandwidth is processed in a single run, on a single process with 16 threads / a single GPU. (c) the 75 scheduled threads exceed the number (16) of simultaneously executed threads. (d) the desktop has a single hard disk; the read overhead is 5:50 per run and is not hidden by the triple buffering. (e) a C2050 with 3072 Mb can accommodate the computation of two sub-bandwidths per GPU, allowing a partial hiding of the PCIe transfer latency.

Results in Table 1 show that processing the full bandwidth in a single run is a clear winner, and should be preferred whenever enough memory is available. However, even in this case, the computation itself accounts for only 34% of the elapsed time. The speed-up, from 16 CPU cores to 448 GPU "cores" (plus about 30% of a host CPU core), is only ×6, while the increase in the number of processing units is ×28. This emphasises the importance of the latency of data transfers between GPU and CPU (hence the importance of implementing the triple buffering of the transfers). Tests should be conducted with faster storage, namely reading and writing to separate RAID arrays, and with triple buffering, to be fully conclusive (though the GPU solution is already faster, cheaper and with lower power requirements).

7. Perspectives

As the computation time for one sub-bandwidth is 107 s on the C2050 GPU (for an acquisition of 120 s, corresponding to 46 overlapping processing blocs), it could be envisioned to compute the full-resolution image in real time using 5 GPUs, provided the input data rate and the output disk write rate are adapted, and the PCIe bus transfer latencies are hidden.

Of course, the algorithm and operating mode should be adapted to real time. First, the nominal (straight line) trajectory to which the motion is compensated is currently computed by least squares over the full acquisition trajectory; instead, the real-time processing should compensate towards a trajectory objective that the aircraft pilot follows with accuracy. Second, the motion compensation should be computed separately for each processing bloc instead of once at the beginning (in fact this was the case for the original scalar code); this includes computing the dGPS trajectory hybridisation in real time as well. Third, all RFI filtering should be local (at least to the processing bloc). With all these points solved, it would be possible to have the image corresponding to a processing bloc, in our example case, about 10 s after the illumination of the beginning of the image bloc (note that a bloc of 2 000×12 000 pixels would then be output every 2.6 s after the first 10 s interval, which also raises the question of using the produced image at this rate!).

Both multi-threading and vectoring on GPU are also interesting for other compute-intensive tasks in the SAR data handling chain. The first to be considered is the time-domain SAR processor, especially for the production of flash-looks (images in range × squint coordinates for a fixed integration interval) used, for example, in circular SAR acquisitions. Next is the autofocus module, important for longer-range and circular acquisitions (for which the trajectory is generally of poor accuracy).
The most efficient algorithms involve the synthesis of a large number of polar-format sub-images, sub-image correlations and singular value decompositions. The last compute-intensive tasks are the image geocoding and compositing module, and the image refocusing post-processing module, which can be useful for real-time processed images in case the trajectory is corrected after the processing is done (for example through autofocus, or dGPS from a ground station).

References

[1] M. Soumekh, "Synthetic Aperture Radar Signal Processing with Matlab Algorithms", John Wiley, 1999.

[2] I. Cumming, F. Wong, "Digital Signal Processing of Synthetic Aperture Radar Data", Artech House, 2005.

[3] H. Cantalloube, "Non-Stationary Bistatic Synthetic Aperture Radar Processing: Assessment of Frequency Domain Processing from Simulated and Real Signals", PIERS, Vol. 5, No. 2, 2009, pp. 196-200 / Proc. of PIERS Conference, Beijing, China, March 23-27, 2009, pp. 392-396.

[4] H. Cantalloube, C. Nahum, "Mono and Bi-static SAR Image Synthesis Algorithm Improvement on Multi-processor/Multi-core Computers", Proc. of EUSAR 2010 Conference, June 7-10, 2010, Aachen, Germany.

[5] V. Volkov, "Better Performance at Lower Occupancy", Proc. of GPU Technology Conference 2010, October 11-14, 2010, San Jose, CA, USA.

[6] V. Volkov, B. Kazian, "Fitting FFT onto the G80 Architecture", CS 252 final project report, University of California, Berkeley, USA, 2008.