Image Parallel Processing Based on GPU
Nan Zhang Changchun Institute of Optics, Fine Mechanics and
Physics, Chinese Academy of Sciences
Graduate School of the Chinese Academy of Sciences
Changchun China; Beijing China
Yun-shan Chen Changchun Institute of Optics, Fine Mechanics and
Physics, Chinese Academy of Sciences
Graduate School of the Chinese Academy of Sciences
Changchun China; Beijing China
Jian-li Wang * Changchun Institute of Optics, Fine Mechanics and
Physics, Chinese Academy of Sciences
Changchun China
Abstract: Image processing is compute-intensive. To address this, a parallel acceleration technique for image processing is proposed that exploits the parallel computing capability of the GPU. First, the architecture of the GPU is introduced, together with the features that make it more computationally efficient than the CPU. Then, two representative image processing algorithms, the Sobel edge detector and homomorphic filtering, are implemented on the GPU to validate the technique. Finally, test images of different resolutions are processed on CPU and GPU hardware platforms to compare the computational efficiency of the two. Experimental results indicate that, even when the data transfer time between host memory and device memory is taken into account, the GPU implementations of the two algorithms run approximately 25 times and 49 times as fast as the CPU implementations, respectively, which shows that the GPU is practical for image processing.
Keywords: Image Processing; Parallel Operation; GPU; CUDA
I. INTRODUCTION
Recently, the performance of programmable Graphics Processing Units (GPUs) has improved rapidly, and the GPU now offers powerful parallel computing capability. At present, the peak performance of an advanced single-chip GPU has reached 1 TFLOPS and its memory bandwidth is up to 141 GB/s, both more than 10 times those of a mainstream CPU; in other words, an advanced GPU can be comparable to a small computer cluster. With the emergence of NVIDIA CUDA, GPU program development has become more flexible and efficient, and GPU acceleration techniques for image processing have received enormous attention [1] [2] [3].
Many image processing algorithms are computationally expensive and parallelizable. Moreover, traditional processing methods cannot satisfy real-time requirements when processing large images. The GPU is useful only for highly data-parallel workloads, in which similar calculations are executed on large quantities of data arranged in a regular grid-like fashion, which makes it an ideal solution for large images. In this paper, the GPU is used to implement parallel processing of the Sobel edge detector and homomorphic filtering. Results show that the parallel algorithms achieve remarkable speedup compared with sequential CPU-based methods.
The rest of the paper is organized as follows. Section 2 introduces the hardware and software architecture of the GPU. Parallel strategies for the Sobel edge detector and homomorphic filtering on the GPU are presented in Sections 3 and 4. Experimental results are analyzed in Section 5. Section 6 summarizes the work and gives the conclusion.
II. GPU DESCRIPTION
A. Hardware Architecture
The hardware architecture is illustrated in Fig. 1. The device is a set of multiprocessors. Each multiprocessor is a set of processors with a SIMD (Single Instruction Multiple Data) architecture: at each clock cycle, every processor of the multiprocessor executes the same instruction but operates on different data. The device has its own DRAM, referred to as device memory, which is of three types: global memory, constant memory, and texture memory, all of which can communicate with host memory. Each multiprocessor has four types of on-chip memory: registers, shared memory, a constant cache that speeds up reads from constant memory, and a texture cache that speeds up reads from texture memory [1].
978-1-4244-5848-6/10/$26.00 2010 IEEE
Figure 1. Hardware model [1].
Figure 2. Thread batching.
B. Programming Model
CUDA (Compute Unified Device Architecture) is a hardware and programming architecture for issuing and managing computations on the GPU, released by NVIDIA in 2007. The CUDA software stack contains a hardware driver, an API (application programming interface) and its runtime, and two higher-level mathematical libraries, CUBLAS (CUDA Basic Linear Algebra Subprograms) and CUFFT (CUDA Fast Fourier Transform) [1]. CUDA provides users with a C-like development environment, and the GPU is viewed as a data-parallel computing device, with no need to map programs onto graphics APIs. Program development for the GPU therefore becomes more efficient and flexible.
The philosophical and architectural underpinning of CUDA is to create a large amount of thread-level parallelism that can be dynamically exploited by the hardware. The CUDA programming model regards the GPU as a compute device that is capable of executing a very large number of threads in parallel and that operates as a coprocessor to the host CPU. In other words, a portion of an application that is executed many times on different data can be isolated into a function that is executed on the device as many different threads. Such a function running on the GPU is called a kernel. As shown in Fig. 2, a kernel is executed by a grid of thread blocks, and a thread block is a group of at most 512 threads that execute in parallel, each operating on different data selected by its thread ID.
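The mapping from block and thread IDs to data elements can be emulated on the CPU. The Python sketch below is a sequential stand-in, not CUDA code: the names `launch`, `scale_kernel`, `grid_dim`, and `block_dim` are hypothetical, and the loops visit one after another the (block, thread) pairs that a GPU would run in parallel.

```python
MAX_THREADS_PER_BLOCK = 512  # per-block limit stated in the text


def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D kernel launch by visiting every (block, thread) pair."""
    if block_dim > MAX_THREADS_PER_BLOCK:
        raise ValueError("a thread block holds at most 512 threads")
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


def scale_kernel(block_idx, thread_idx, block_dim, src, dst, factor):
    """Each emulated thread derives one global index from its IDs."""
    i = block_idx * block_dim + thread_idx
    if i < len(src):  # guard threads that fall past the end of the data
        dst[i] = src[i] * factor


src = list(range(10))
dst = [0] * len(src)
block_dim = 4
grid_dim = (len(src) + block_dim - 1) // block_dim  # ceiling division
launch(scale_kernel, grid_dim, block_dim, src, dst, 2)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block do nothing, which is why the bounds check inside the kernel is needed.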
GPU computing architecture and software no longer require knowledge of graphics concepts, so designing a GPU-accelerated image processing algorithm comes down to expressing the algorithm in a highly parallel fashion.
III. GPU IMPLEMENTATION OF FAST IMAGE EDGE DETECTION
The Sobel edge detector is a popular and effective edge detector that is based on convolving the image with a filter in the horizontal and vertical directions [5] [6]. It is therefore relatively inexpensive in terms of arithmetic complexity, and it is often used as a pre-processing step in computer vision algorithms. The operator uses two 3×3 kernels, which are convolved with the original image to calculate approximations of the derivatives, as represented in (1).
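$$
h_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \qquad
h_y = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \tag{1}
$$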
The Sobel edge detector is a template-based operator in which each output pixel is determined by the pixels in its 8-neighborhood. The multiple-point access method [7] can be used to improve the efficiency of data access: several contiguous data items are read and placed into registers, and later points then reuse the earlier points held in registers for their calculations, without repeated access to global memory. Multiple outputs are thus obtained at once. This access pattern is shown in Fig. 3, where 4 output pixels need only 12 input pixels. Accessing global memory takes 400 to 600 clock cycles, whereas accessing a register takes only 4 clock cycles, so this pattern notably improves access efficiency.
Meanwhile, template operations give rise to a boundary problem that reduces parallel computing efficiency, so texture memory can be used in this case. Texture fetching offers several benefits for image processing; for example, boundary cases can be handled automatically by specifying the addressing mode used when accessing texture memory. Convolution at the image edges is thus simplified by the wrap or clamp addressing mode at texture borders.
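The effect of the two addressing modes can be illustrated on the CPU. The following Python sketch only mimics the index arithmetic; it is not the texture API, and the helper names `clamp_index`, `wrap_index`, and `fetch` are hypothetical.

```python
def clamp_index(i, n):
    """Clamp mode: out-of-range coordinates snap to the nearest valid index."""
    return max(0, min(i, n - 1))


def wrap_index(i, n):
    """Wrap mode: out-of-range coordinates repeat periodically."""
    return i % n


def fetch(image, x, y, address=clamp_index):
    """Read pixel (x, y) from a list-of-rows image with the given addressing mode."""
    height, width = len(image), len(image[0])
    return image[address(y, height)][address(x, width)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

left_of_first_clamped = fetch(image, -1, 0)              # reads image[0][0]
left_of_first_wrapped = fetch(image, -1, 0, wrap_index)  # reads image[0][2]
```

Because every fetch returns a valid pixel, a 3×3 template can be applied at the image border without a separate code path for edge pixels.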
First, the image data are copied from host memory to a CUDA array, and the array is then bound to texture memory. Any kernel launch, as mentioned in Section 2, must specify the execution configuration, which defines the dimensions of the grid and blocks used to execute the kernel on the device. According to the characteristics of the Sobel operator and the GPU programming mechanism, each thread block is set to process one row of the image, and each thread handles 8 output pixels; that is, each thread reads 30 input pixels and outputs 8 pixels. Finally, all the results are read back to host memory.
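The work assigned to one thread can be sketched sequentially in Python. This is an illustrative emulation under stated assumptions, not the authors' kernel: the function names are hypothetical, the kernel signs follow the common Sobel convention, border pixels use clamp addressing, and the two responses are combined as |gx| + |gy|, which the paper does not specify.

```python
H_X = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
H_Y = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
OUTPUTS_PER_THREAD = 8


def clamp(i, n):
    return max(0, min(i, n - 1))


def sobel_thread(image, row, first_col, out):
    """One emulated thread: load a 3 x 10 window once, emit up to 8 outputs."""
    height, width = len(image), len(image[0])
    # 3 rows x (8 + 2) columns = 30 input pixels, held locally like registers.
    window = [[image[clamp(row + dy, height)][clamp(first_col + dx, width)]
               for dx in range(-1, OUTPUTS_PER_THREAD + 1)]
              for dy in (-1, 0, 1)]
    for k in range(OUTPUTS_PER_THREAD):
        col = first_col + k
        if col >= width:  # last thread of a row may have fewer than 8 outputs
            break
        gx = sum(H_X[r][c] * window[r][k + c] for r in range(3) for c in range(3))
        gy = sum(H_Y[r][c] * window[r][k + c] for r in range(3) for c in range(3))
        out[row][col] = abs(gx) + abs(gy)  # assumed magnitude approximation


def sobel(image):
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for row in range(height):  # one emulated block per image row
        for first_col in range(0, width, OUTPUTS_PER_THREAD):  # one thread per 8 columns
            sobel_thread(image, row, first_col, out)
    return out
```

Neighboring outputs share two of their three window columns, so reading the window once replaces 8 × 9 = 72 separate pixel reads with 30.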
IV. GPU IMPLEMENTATION OF FAST IMAGE ENHANCEMENT
Homomorphic filtering is a well-known approach to image enhancement that can reduce undesired contributions within an image caused by nonuniformity of the light source [8] [9]. The approach is derived from an image illumination-reflectance model, and it removes the effects of nonuniform illumination by compressing the dynamic range of the image and enhancing its contrast.
The image f(x, y) can be expressed as the product of two components, the illumination component i(x, y) and the reflectance component r(x, y), namely f(x, y) = i(x, y) r(x, y).
Figure 3. Multiple points access method.
Figure 4. The processing of homomorphic filtering.
The general idea of homomorphic filtering is shown in Fig. 4. First, a logarithm is applied to the image, which separates it into its two components by turning their product into a sum. Then, a forward FFT is applied to the result, and its representation in the spatial frequency domain is modified by a filter that implements contrast enhancement by applying different weights. Finally, the modified image is transformed back using the IFFT, and the result is passed through an exponential operation that reverses the effect of the logarithm.
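The chain of operations can be written out step by step. In the sketch below, only f, i, and r come from the model above; the symbols z, Z, H, S, s, and g are introduced here for illustration, with H(u, v) denoting the frequency-domain filter and g(x, y) the enhanced image.

```latex
\begin{aligned}
z(x,y) &= \ln f(x,y) = \ln i(x,y) + \ln r(x,y) \\
Z(u,v) &= \mathcal{F}\{ z(x,y) \} \\
S(u,v) &= H(u,v)\, Z(u,v) \\
s(x,y) &= \mathcal{F}^{-1}\{ S(u,v) \} \\
g(x,y) &= \exp\bigl( s(x,y) \bigr)
\end{aligned}
```

Because the logarithm makes the two components additive and the Fourier transform is linear, the filter H acts on the illumination and reflectance terms separately.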
Each step of the process in Fig. 4 depends on the result of the preceding step; therefore, parallel processing can be exploited only within each step. The logarithm, the point-wise multiplication, and the exponentiation are point operations, that is, each output pixel depends only on the corresponding input pixel, with no interaction with other pixels. Consequently, each of these three steps can be parallelized easily and is well suited to the SIMD stream architecture of the GPU. For the Fourier transform, the API of the CUFFT library provided with CUDA can simply be called. The image is processed as follows:
First, a .bmp file of size imageW × imageH is copied from host memory to global memory. Second, for the parallel point operations, the kernel threads are grouped into blocks of 256 threads (16×16), and each thread generally handles a single pixel; that is, the image is divided into (imageH/16) × (imageW/16) blocks. All the blocks, and the threads within them, are then executed in parallel. The FFT and IFFT are implemented by calling the cufftExecC2C() function from the CUFFT library. Finally, the resulting image is copied back from global memory to host memory.
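The processing steps can be sketched end to end on the CPU for a tiny image. This Python version is only an illustration under several assumptions: a naive DFT stands in for the CUFFT calls, the filter is a Gaussian high-frequency-emphasis filter with hypothetical parameters `gamma_l`, `gamma_h`, and `d0` (the paper does not give its filter), and `log1p`/`expm1` are used instead of a plain logarithm and exponential so that zero-valued pixels are handled.

```python
import cmath
import math


def dft_1d(values, inverse=False):
    """Naive O(n^2) DFT, standing in for the CUFFT call."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def dft_2d(image, inverse=False):
    """Separable 2-D DFT: transform every row, then every column."""
    rows = [dft_1d(row, inverse) for row in image]
    cols = [dft_1d(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def high_emphasis(u, v, height, width, gamma_l, gamma_h, d0):
    """Assumed filter H(u, v): gamma_l at zero frequency, rising toward gamma_h."""
    du = min(u, height - u)  # distance to zero frequency in an unshifted spectrum
    dv = min(v, width - v)
    d2 = du * du + dv * dv
    return (gamma_h - gamma_l) * (1.0 - math.exp(-d2 / (d0 * d0))) + gamma_l


def homomorphic(image, gamma_l=0.5, gamma_h=2.0, d0=1.0):
    height, width = len(image), len(image[0])
    log_img = [[math.log1p(p) for p in row] for row in image]    # point operation
    spectrum = dft_2d(log_img)                                    # forward transform
    filtered = [[spectrum[u][v]
                 * high_emphasis(u, v, height, width, gamma_l, gamma_h, d0)
                 for v in range(width)] for u in range(height)]   # point-wise multiply
    restored = dft_2d(filtered, inverse=True)                     # inverse transform
    return [[math.expm1(c.real) for c in row] for row in restored]  # point operation
```

The three list comprehensions marked as point operations touch each pixel independently, which is what allows one GPU thread per pixel; only the two transforms require communication between pixels. With `gamma_l = gamma_h = 1` the filter is the identity and the function returns the input image up to rounding error.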
V. EXPERIMENT RESULTS AND ANALYSIS
For each image processing algorithm, both GPU parallel code and CPU serial code were written, and the calculations use single-precision floating-point values to ensure that the results are comparable. The execution time was then recorded and averaged over 10 iterations per test. Unlike the CPU timings, the GPU timings include the time needed to copy data to the GPU and to copy the results back from the GPU. Results for five image sizes are shown in Fig. 5 and Table 1. The hardware used is given below:
CPU: Intel Pentium Dual E2160, 1 GB RAM
GPU: NVIDIA GeForce 8800 GT (1.5 GHz), 512 MB global memory
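The measurement procedure described above, averaging the execution time over 10 iterations, can be sketched as follows. The helper name `average_time_ms` is hypothetical, and the sketch uses Python's wall-clock timer rather than whatever timer the authors used.

```python
import time


def average_time_ms(func, *args, iterations=10):
    """Run func(*args) `iterations` times and return the mean wall-clock time in ms."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args)
        total += time.perf_counter() - start
    return 1000.0 * total / iterations


# Example: time a small CPU-side workload.
mean_ms = average_time_ms(sum, range(100000))
```

For a GPU measurement to match the description in the text, the timed function would need to include the host-to-device copy, the kernel execution, and the device-to-host copy.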
The speed comparison for the Sobel edge detector is shown in Fig. 5. The GPU significantly improves computing speed, and the gain grows as the image size increases. The speedups achieved range from 3.6 (128×128) to 25.3 (2048×2048). Fig. 6 shows the result of parallel Sobel edge detection on the standard Lena image, computed by the GPU program.
For homomorphic filtering, a speedup of around 40 is gained, as shown in Table 1. The results in the table show a moderate increase in speedup as the image size increases. The speedup reaches an early plateau, above 49, at an image size of 1024×1024; at that size an image is processed in about 16 ms, which satisfies the real-time requirement.
The result of parallel homomorphic filtering on a workroom image is displayed in Fig. 7, and the quality evaluation is calculated for each of the five image sizes. The original image is nonuniformly illuminated, so that several objects in the dark areas are indistinct. According to the data in Table 2 and Fig. 7, the image enhanced by homomorphic filtering has clearly increased contrast and sharpness, and detail that is indistinguishable in the original (objects in the bookcase and on the desk) is easily noticed in the enhanced image.
Figure 5. Time comparison for Sobel edge detection. The x-axis is at its respective scale; the values on the left y-axis are for the GPU, and those on the right y-axis are for the CPU.
Figure 6. Sobel edge detection result (256×256 size).
VI. CONCLUSION
In this paper, two parallel image processing algorithms, Sobel edge detection and homomorphic filtering, are presented and implemented on the GPU, and compared with sequential CPU-based implementations. Performance results indicate that significant speedup can be achieved and that the speedup increases with image size. Sobel edge detection and homomorphic filtering achieve speedups of up to 25 and 49, respectively, compared with the CPU-based implementations. The GPU thus provides a novel and efficient acceleration technique for image processing, and it is cheaper in hardware implementation. Future work will involve mapping more complex image processing algorithms onto the GPU and a deeper analysis of parallelization strategies to make the best of the computing resources provided by the GPU.
Table 1. Time comparison of the homomorphic filter
Image size    GPU (ms)    CPU (ms)     Speedup
128×128       1.191       10.491       8.81
256×256       1.586       34.643       21.84
512×512       4.325       202.479      46.82
1024×1024     16.387      810.211      49.44
2048×2048     77.685      3303.834     42.53
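The speedup column is the ratio of CPU time to GPU time. As a check, the short Python sketch below recomputes it from the timings in Table 1; the names `TABLE_1` and `speedups` are illustrative.

```python
# (image side in pixels, GPU time in ms, CPU time in ms), copied from Table 1
TABLE_1 = [
    (128, 1.191, 10.491),
    (256, 1.586, 34.643),
    (512, 4.325, 202.479),
    (1024, 16.387, 810.211),
    (2048, 77.685, 3303.834),
]


def speedups(rows):
    """Return {image side: CPU time / GPU time}, rounded to two decimals."""
    return {side: round(cpu_ms / gpu_ms, 2) for side, gpu_ms, cpu_ms in rows}


result = speedups(TABLE_1)
best_side = max(result, key=result.get)  # image side with the largest speedup
```

The recomputed ratios agree with the tabulated speedups, and the largest value occurs at 1024×1024.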
Figure 7. Homomorphic filtering result (1024×1024 size). The original image with nonuniform illumination is on the left, and the homomorphically enhanced image is on the right.
Table 2. Quality evaluation before and after enhancement
Image         Size         Brightness    Contrast    Entropy
Nonuniform    1024×1024    10.52         12.94       4.83
Enhanced      1024×1024    37.38         26.11       5.53
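The paper does not define the three quality measures. One common choice, assumed here purely for illustration, is the mean gray level for brightness, the standard deviation for contrast, and the Shannon entropy of the gray-level histogram (in bits) for entropy. A minimal Python sketch under those assumptions, with the hypothetical name `quality_metrics`:

```python
import math
from collections import Counter


def quality_metrics(image):
    """Return (mean, standard deviation, histogram entropy in bits) of an image."""
    pixels = [p for row in image for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    # Shannon entropy over the relative frequency of each gray level.
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())
    return mean, std, entropy


brightness, contrast, entropy = quality_metrics([[0, 0], [255, 255]])
```

Under these definitions, an image with a wider spread of gray levels scores higher on both contrast and entropy, which is the direction of change reported in Table 2.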
ACKNOWLEDGMENT
I am extremely grateful to my advisor, Dr. Jian-li Wang, for his patience, wisdom, and kindness. This work would have failed miserably without his constant guidance.
My parents set me on the path of learning and work, and they are still my most valued teachers. Finally, I thank the conference organizers for the opportunity they provided.
REFERENCES
[1] NVIDIA Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, 2007.
[2] T. R. Halfhill, Parallel Processing with CUDA, Microprocessor Report, Scottsdale, Arizona, Jan. 28, 2008.
[3] NVIDIA CUDA, http://forums.nvidia.com
[4] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Second Edition, 2003, 3.
[5] L. S. Davis, A Survey of Edge Detection Techniques, CGIP, vol. 4, pp. 248-270, 1975.
[6] V. Podlozhnyuk, Image Convolution with CUDA, http://www.nvidia.com/object/cuda_home
[7] H. R. Zuo, Fast Sobel Edge Detection Algorithm Base on GPU, Opto-Electronic Engineering, vol. (1), pp. 8-12, 2009.
[8] C. N. Chen and Y. J. Wang, Image contrast enhancement by homomorphic filtering in frequency field, Micro-Computer Information, vol. 23(2-3), pp. 264-266, 2007.
[9] Ponomarev and I. Vladimir, Image enhancement by homomorphic filters, Proceedings of SPIE - The International Society for Optical Engineering, vol. 2564, pp. 153-159, 1995.
[10] K. David, NVIDIA's GT200 -- Inside a Parallel Processor, http://www.realworldtech.com, 2008.
[11] NVIDIA Corporation, NVIDIA CUFFT Library, Version 1.1, 2007.
[12] NVIDIA Corporation, NVIDIA CUBLAS Library, Version 1.1, 2007.
[13] M. Frigo and S. Johnson, An Adaptive Software Architecture for the FFT, ICASSP conference proceedings, Seattle, Washington, USA, pp. 1381-1384, 1998.