


    Image Parallel Processing Based on GPU

    Nan Zhang
    Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences
    Graduate School of the Chinese Academy of Sciences
    Changchun, China; Beijing, China
    [email protected]

    Yun-shan Chen
    Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences
    Graduate School of the Chinese Academy of Sciences
    Changchun, China; Beijing, China
    [email protected]

    Jian-li Wang *
    Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences
    Changchun, China
    [email protected]

    Abstract: To address the compute-intensive nature of image processing, a parallel acceleration technique that exploits the parallel computing capability of the GPU is proposed. First, the GPU architecture, which offers higher computational efficiency than the CPU, is introduced. Then, the Sobel edge detector and homomorphic filtering, two representative image processing algorithms, are implemented on the GPU to validate the technique. Finally, test images of different resolutions are processed on CPU and GPU hardware platforms to compare their computational efficiency. Experimental results indicate that, when the data transfer time between host memory and device memory is taken into account, the two algorithms run approximately 25 times and 49 times faster on the GPU than on the CPU, respectively, and that the GPU is practical for image processing.

    Keywords: Image Processing; Parallel operation; GPU; CUDA

    I. INTRODUCTION

    In recent years, the performance of programmable Graphics Processing Units (GPUs) has improved rapidly, and GPUs now offer powerful parallel computing capability. At present, the floating-point throughput of an advanced single-chip GPU has reached 1 TFLOPS and its memory bandwidth is up to 141 GB/s, more than 10 times the figures for a mainstream CPU; in other words, an advanced GPU is comparable to a small computer cluster. With the emergence of NVIDIA CUDA, GPU program development has become more flexible and efficient, and GPU acceleration techniques for image processing have received enormous attention [1] [2] [3].

    Many image processing algorithms are computationally expensive yet parallelizable, and traditional processing methods cannot satisfy real-time requirements for large images. The GPU is effective only for highly data-parallel workloads, in which similar calculations are performed on large amounts of data arranged in a regular, grid-like fashion; images have exactly this structure, so the GPU is a natural fit for processing large images. In this paper, the GPU is used to implement parallel versions of the Sobel edge detector and homomorphic filtering. Results show that the parallel algorithms achieve remarkable speedup compared with sequential CPU-based methods.

    The rest of the paper is organized as follows. Section II introduces the hardware and software architecture of the GPU. Parallel strategies for the Sobel edge detector and homomorphic filtering on the GPU are presented in Sections III and IV. Experimental results are analyzed in Section V. Section VI summarizes the work and gives the conclusion.

    II. GPU DESCRIPTION

    A. Hardware Architecture

    The hardware architecture is illustrated in Fig. 1. The device is a set of multiprocessors. Each multiprocessor is a set of processors with a SIMD (Single Instruction Multiple Data) architecture: at each clock cycle, every processor of the multiprocessor executes the same instruction but operates on different data. The device has its own DRAM, referred to as device memory, which is divided into three types: global memory, constant memory and texture memory, all of which can exchange data with host memory. Each multiprocessor has four types of on-chip memory: registers, shared memory, a constant cache that speeds up reads from constant memory, and a texture cache that speeds up reads from texture memory [1].


    978-1-4244-5848-6/10/$26.00 ©2010 IEEE


    Figure 1. Hardware model [1].

    Figure 2. Thread batching.

    B. Programming Model

    CUDA (Compute Unified Device Architecture) is a hardware and programming architecture for issuing and managing computations on the GPU, released by NVIDIA in 2007. The CUDA software stack contains a hardware driver, an API (application programming interface) with its runtime, and two higher-level mathematical libraries, CUBLAS (CUDA Basic Linear Algebra Subprograms) and CUFFT (CUDA Fast Fourier Transform) [1]. CUDA provides users with a C-like development environment, and the GPU is viewed as a data-parallel computing device, with no need to map programs onto graphics APIs. Program development for the GPU therefore becomes more efficient and flexible.

    The philosophical and architectural underpinning of CUDA is to create a large amount of thread-level parallelism that the hardware can exploit dynamically. The CUDA programming model regards the GPU as a compute device that is capable of executing a very large number of threads in parallel and that operates as a coprocessor to the host CPU. In other words, a portion of an application that is executed many times on different data can be isolated into a function that is executed on the device as many different threads. Such a function running on the GPU is called a kernel. As shown in Fig. 2, a kernel is executed by a grid of thread blocks, and a thread block is a group of at most 512 threads that execute in parallel, each operating on different data selected by its thread ID.
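
    The mapping from thread IDs to data can be illustrated with a minimal CPU-side emulation, written here in plain Python rather than CUDA. The names launch_1d and scale_kernel are hypothetical and not part of any CUDA API; the nested loops stand in for threads that the hardware would run in parallel.

```python
# CPU-side emulation of how a 1-D CUDA grid maps thread IDs to data elements.
# Illustrative only: on a GPU, the body of scale_kernel would be the kernel and
# the two loops in launch_1d would be executed in parallel by the hardware.

MAX_THREADS_PER_BLOCK = 512  # per-block limit quoted in the text


def launch_1d(kernel, n, threads_per_block, *args):
    """Run `kernel` once per (block, thread) pair and return the block count."""
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be in 1..512")
    num_blocks = (n + threads_per_block - 1) // threads_per_block  # ceiling division
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, n, *args)
    return num_blocks


def scale_kernel(block_idx, block_dim, thread_idx, n, src, dst, factor):
    """Each thread handles the single element selected by its global ID."""
    i = block_idx * block_dim + thread_idx
    if i < n:  # threads past the end of the data do nothing
        dst[i] = src[i] * factor


src = list(range(1000))
dst = [0] * len(src)
blocks = launch_1d(scale_kernel, len(src), 256, src, dst, 2)
print(blocks, dst[:4], dst[-1])
```

    With 1000 elements and 256 threads per block, four blocks are launched and the last block contains idle threads, which is why the bounds check inside the kernel is needed.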

    Because the GPU computing architecture and software no longer require knowledge of graphics concepts, designing a GPU-accelerated image processing algorithm comes down to expressing the algorithm in a highly parallel fashion.

    III. GPU IMPLEMENTATION OF FAST IMAGE EDGE DETECTION

    The Sobel edge detector is a popular and effective edge detector based on convolving the image with a small filter in the horizontal and vertical directions [5] [6]. It is therefore relatively inexpensive in terms of arithmetic complexity and is often used as a pre-processing step in computer vision algorithms. The operator uses two 3×3 kernels, given in (1), which are convolved with the original image to approximate the derivatives.

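    $$
    h_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}, \qquad
    h_y = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \tag{1}
    $$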
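
    As a CPU reference for the kernels in (1), the following sketch applies both kernels with NumPy (assumed to be available). Several choices here are assumptions, not taken from the paper: the minus signs follow the usual Sobel convention, border pixels are replicated, the two responses are combined as a Euclidean magnitude, and the function names are hypothetical. The code uses correlation; true convolution flips the kernels, which only changes the sign of each response and leaves the magnitude unchanged.

```python
import numpy as np

# Kernels from (1), with the usual Sobel sign convention (an assumption).
H_X = np.array([[-1, -2, -1],
                [ 0,  0,  0],
                [ 1,  2,  1]], dtype=np.float32)
H_Y = np.array([[-1, 0, 1],
                [-2, 0, 2],
                [-1, 0, 1]], dtype=np.float32)


def correlate3x3(image, kernel):
    """Correlate a 2-D image with a 3x3 kernel, replicating border pixels."""
    padded = np.pad(image.astype(np.float32), 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the padded image weighted by one kernel tap.
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_magnitude(image):
    """Gradient magnitude from the two directional responses (assumed combination)."""
    gx = correlate3x3(image, H_X)
    gy = correlate3x3(image, H_Y)
    return np.hypot(gx, gy)


# A vertical step edge: left half 0, right half 1.
step = np.zeros((5, 6), dtype=np.float32)
step[:, 3:] = 1.0
edges = sobel_magnitude(step)
print(edges[0])
```

    On this step image only the two columns adjacent to the edge respond, since h_x sees identical rows and h_y sees a difference only across the step.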

    The Sobel edge detector is a template-based operator in which each output pixel is determined by the pixels in its eight-neighborhood. The multiple points access method [7] improves the efficiency of data access: several consecutive pixels are read once and placed in registers, so that later output points can reuse the values already loaded for earlier points instead of accessing global memory again, and several outputs are produced per thread. This access pattern is shown in Fig. 3, where 4 output pixels need only 12 input pixels. An access to global memory takes 400 to 600 clock cycles, whereas a register access takes only 4, so the method notably improves access efficiency.
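
    The reuse idea can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the method of [7]: one emulated thread reads a 3 × (n + 2) window once, keeps it in local variables standing in for registers, and derives n adjacent outputs from it. The function name and the Euclidean magnitude are assumptions.

```python
import math

# One emulated "thread" computing several horizontally adjacent Sobel outputs
# from a single 3 x (n_out + 2) window that is read only once.


def sobel_multi_output(rows, x0, n_out):
    """Return (outputs, reads) for n_out pixels starting at column x0.

    `rows` holds three image rows (above, centre, below) as lists, padded so
    that columns x0 - 1 .. x0 + n_out are valid.
    """
    # Single pass over "global memory": 3 * (n_out + 2) reads.
    window = [row[x0 - 1:x0 + n_out + 1] for row in rows]
    reads = sum(len(r) for r in window)

    outputs = []
    for k in range(n_out):
        top, mid, bot = (r[k:k + 3] for r in window)  # reuse of cached values
        gx = (bot[0] + 2 * bot[1] + bot[2]) - (top[0] + 2 * top[1] + top[2])
        gy = (top[2] + 2 * mid[2] + bot[2]) - (top[0] + 2 * mid[0] + bot[0])
        outputs.append(math.hypot(gx, gy))
    return outputs, reads


rows = [[0] * 5 + [1] * 5 for _ in range(3)]
outputs, reads = sobel_multi_output(rows, x0=1, n_out=8)
print(reads, outputs)
```

    For 8 outputs the window holds 30 values, against 64 reads if every output fetched its eight neighbors separately.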

    Template operations also raise a boundary problem that reduces parallel computing efficiency, and texture memory can be used to address it. Texture fetching offers several benefits for image processing; for example, boundary cases are handled automatically by specifying the addressing mode used when accessing texture memory. Convolution at image borders is thus simplified by the wrap or clamp addressing mode.
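
    What the two addressing modes do to an out-of-range coordinate can be emulated in a few lines of Python. The helper names are hypothetical; on the GPU the texture unit performs this mapping in hardware.

```python
def clamp_address(i, n):
    """Clamp mode: out-of-range indices are pinned to the nearest valid index."""
    return max(0, min(i, n - 1))


def wrap_address(i, n):
    """Wrap mode: indices repeat periodically with period n."""
    return i % n


def fetch(image, x, y, address):
    """Read image[y][x] after mapping both coordinates through `address`."""
    h, w = len(image), len(image[0])
    return image[address(y, h)][address(x, w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(fetch(image, -1, 0, clamp_address))  # one column left of the first pixel
print(fetch(image, -1, 0, wrap_address))
```

    With clamping, the fetch to the left of the first pixel returns the border value itself; with wrapping, it returns the last pixel of the same row. Either way the kernel needs no special-case branch at the image border.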

    First, the image data are copied from host memory to a CUDA array, and the array is then bound to texture memory. Any kernel launch (see Section II) must specify the execution configuration, which defines the dimensions of the grid and blocks used to execute the kernel on the device. Given the characteristics of the Sobel operator and the GPU programming model, each thread block is set to process one row of the image and each thread produces 8 output pixels; that is, each thread reads 30 input pixels and writes 8 output pixels. Finally, all results are read back to host memory.
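
    The arithmetic behind this configuration can be checked with a short sketch. It assumes one block per image row and a 3-row window with one extra column on each side per thread, which reproduces the 30 input pixels stated above; the function name is hypothetical.

```python
PIXELS_PER_THREAD = 8        # outputs per thread, from the text
MAX_THREADS_PER_BLOCK = 512  # block size limit quoted in Section II


def sobel_launch_config(width, height, pixels_per_thread=PIXELS_PER_THREAD):
    """Return (blocks, threads_per_block, inputs_per_thread), one block per row."""
    threads_per_block = (width + pixels_per_thread - 1) // pixels_per_thread
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError("row too wide for a single thread block")
    # 3 rows, each covering the outputs plus one neighbor column on each side.
    inputs_per_thread = 3 * (pixels_per_thread + 2)
    return height, threads_per_block, inputs_per_thread


for size in (128, 256, 512, 1024, 2048):
    print(size, sobel_launch_config(size, size))
```

    Under these assumptions a 2048-pixel row needs 256 threads, which stays within the 512-thread block limit.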

    IV. GPU IMPLEMENTATION OF FAST IMAGE ENHANCEMENT

    Homomorphic filtering is a well-known approach to image enhancement that reduces undesired contributions within an image caused by nonuniform light sources [8] [9]. The approach is derived from an image illumination-reflectance model and removes the effects of nonuniform illumination by compressing the dynamic range of the image and enhancing its contrast.

    The image f(x, y) can be expressed as the product of two components, the illumination component i(x, y) and the reflectance component r(x, y), namely f(x, y) = i(x, y) · r(x, y).
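
    Because the two components multiply, a linear filter cannot act on them separately; taking the logarithm turns the product into a sum. The chain of operations can be written as follows, where z, Z, S, H and g are symbols introduced here only for illustration (the log image, its spectrum, the filtered spectrum, the frequency-domain filter and the output image):

    $$
    \begin{aligned}
    z(x,y) &= \ln f(x,y) = \ln i(x,y) + \ln r(x,y) \\
    Z(u,v) &= \mathcal{F}\{\ln i(x,y)\} + \mathcal{F}\{\ln r(x,y)\} \\
    S(u,v) &= H(u,v)\,Z(u,v) \\
    g(x,y) &= \exp\!\left(\mathcal{F}^{-1}\{S(u,v)\}\right)
    \end{aligned}
    $$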

    Figure 3. Multiple points access method.

    Figure 4. The processing of homomorphic filtering.

    The general idea of homomorphic filtering is shown in Fig. 4. First, the logarithm of the image is taken, which separates the image into two additive components. Then, a forward FFT is applied to the result, and its representation in the spatial frequency domain is modified by a filter that weights frequency components differently to enhance contrast. Finally, the modified image is transformed back with an IFFT, and the result is passed through an exponential that reverses the effect of the logarithm.
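
    The pipeline just described can be sketched on the CPU with NumPy (assumed to be available). The paper does not specify the filter, so the Gaussian high-emphasis filter and its parameters gamma_l, gamma_h, d0 and c are assumptions, as are the function names. The sketch uses log(1 + f) to avoid log(0) and double precision for simplicity, whereas the GPU version works in single precision.

```python
import numpy as np


def gaussian_high_emphasis(shape, gamma_l=0.5, gamma_h=2.0, d0=30.0, c=1.0):
    """Assumed filter: gamma_l at DC, rising smoothly to gamma_h at high frequency."""
    rows, cols = shape
    u = np.fft.fftfreq(rows) * rows  # integer frequency indices in FFT order
    v = np.fft.fftfreq(cols) * cols
    d2 = u[:, None] ** 2 + v[None, :] ** 2  # squared distance from DC
    return (gamma_h - gamma_l) * (1.0 - np.exp(-c * d2 / d0 ** 2)) + gamma_l


def homomorphic_filter(image, **filter_args):
    """log -> FFT -> point-wise multiply by H -> IFFT -> exp."""
    z = np.log1p(image.astype(np.float64))      # log(1 + f) avoids log(0)
    spectrum = np.fft.fft2(z)                   # forward FFT
    h = gaussian_high_emphasis(image.shape, **filter_args)
    filtered = np.fft.ifft2(spectrum * h).real  # filter, then inverse FFT
    return np.expm1(filtered)                   # undo the logarithm


# Synthetic test image: brightness ramps from dark (top) to bright (bottom).
img = np.outer(np.linspace(1.0, 50.0, 64), np.ones(64))
out = homomorphic_filter(img)
print(out.shape)
```

    Choosing gamma_l below 1 attenuates the slowly varying illumination, while gamma_h above 1 boosts the rapidly varying reflectance; with both set to 1 the filter is the identity and the input image is recovered.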

    Each step of the process in Fig. 4 depends on the output of the preceding step; therefore, parallel processing can be applied only within each step. The logarithm, the point-wise multiplication and the exponential are point operations, that is, each output pixel depends only on the corresponding input pixel and not on any other pixel. Consequently, each of these three steps is easily parallelized and is well suited to the SIMD stream architecture of the GPU. For the Fourier transforms, the CUFFT library provided with CUDA can simply be called. The image is processed as follows:

    First, an (imageW × imageH) .bmp image is copied from host memory to global memory. Second, for the point operations, the kernel threads are grouped into blocks of 256 (16×16) threads, and each thread generally processes a single pixel, so the image is divided into (imageH/16) × (imageW/16) blocks; all the blocks and the threads within them are then executed in parallel. The FFT and IFFT are performed by calling the cufftExecC2C() function from the CUFFT library. Finally, the resulting image is copied back from global memory to host memory.
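
    The grid dimensions implied by the 16×16 block size can be tabulated with a short sketch. Rounding up for image sizes that are not multiples of 16 is an assumption added here, and the function names are hypothetical.

```python
BLOCK_W = BLOCK_H = 16  # 16 x 16 = 256 threads per block, as in the text


def point_op_grid(image_w, image_h):
    """Return (blocks_x, blocks_y), rounding up for sizes not divisible by 16."""
    blocks_x = (image_w + BLOCK_W - 1) // BLOCK_W
    blocks_y = (image_h + BLOCK_H - 1) // BLOCK_H
    return blocks_x, blocks_y


def pixel_of_thread(block_x, block_y, thread_x, thread_y):
    """Pixel coordinates handled by one thread (one thread per pixel)."""
    return block_x * BLOCK_W + thread_x, block_y * BLOCK_H + thread_y


for size in (128, 256, 512, 1024, 2048):
    bx, by = point_op_grid(size, size)
    print(f"{size}x{size}: {bx} x {by} blocks, {bx * by * BLOCK_W * BLOCK_H} threads")
```

    For the five test sizes, which are all multiples of 16, the number of threads equals the number of pixels, so no thread is idle.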

    V. EXPERIMENT RESULTS AND ANALYSIS

    For each image processing algorithm, both a GPU parallel version and a CPU serial version were implemented, and all calculations use single-precision floating-point values to ensure that the results are comparable. Execution time was recorded and averaged over 10 iterations per test. For the GPU, the measured time includes copying the data to the GPU and copying the results back, which the CPU version does not need. Results for five image sizes are shown in Fig. 5 and Table 1. The hardware used is as follows:

    CPU: Intel Pentium Dual E2160 [email protected], 1 GB RAM

    GPU: NVIDIA GeForce 8800 GT (1.5 GHz), 512 MB global memory
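
    The averaging procedure described above can be sketched as a small timing helper. This is a generic host-side harness, not the authors' measurement code, and the name average_time_ms is hypothetical. For a GPU measurement, the timed function would need to include both memory copies and wait for the device to finish before returning, because kernel launches are asynchronous.

```python
import time


def average_time_ms(func, *args, iterations=10):
    """Mean wall-clock time of func(*args) in milliseconds over `iterations` runs."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args)
        total += time.perf_counter() - start
    return 1000.0 * total / iterations


# Stand-in workload; a real test would time the CPU or GPU image routine.
elapsed = average_time_ms(sum, range(100_000))
print(f"{elapsed:.3f} ms per call")
```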

    The speed comparison for the Sobel edge detector is shown in Fig. 5. The GPU advantage grows as the image size increases: the speedup ranges from 3.6× (128×128) to 25.3× (2048×2048). Fig. 6 shows the result of parallel Sobel edge detection on the standard Lena image, computed by the GPU program.

    For homomorphic filtering, a speedup of around 40× is obtained, as shown in Table 1. The speedup increases with image size and reaches a plateau at an image size of 1024×1024, where it exceeds 49×; for the 2048×2048 image it is 42.53×. The 1024×1024 image is processed in about 16 ms, which satisfies real-time requirements.
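
    The speedup column of Table 1 and the frame rate implied by the 16 ms figure can be reproduced from the measured times. The sketch below copies the timings from Table 1; the helper names are hypothetical.

```python
# (image side, GPU ms, CPU ms) copied from Table 1
TABLE_1 = [
    (128, 1.191, 10.491),
    (256, 1.586, 34.643),
    (512, 4.325, 202.479),
    (1024, 16.387, 810.211),
    (2048, 77.685, 3303.834),
]


def speedup(cpu_ms, gpu_ms):
    """Ratio of CPU time to GPU time."""
    return cpu_ms / gpu_ms


def frames_per_second(ms_per_frame):
    """Sustained frame rate if every frame takes ms_per_frame milliseconds."""
    return 1000.0 / ms_per_frame


for side, gpu_ms, cpu_ms in TABLE_1:
    print(f"{side}x{side}: speedup {speedup(cpu_ms, gpu_ms):.2f}, "
          f"GPU rate {frames_per_second(gpu_ms):.1f} frames/s")
```

    At 16.387 ms per 1024×1024 image, the GPU version sustains roughly 61 frames per second, transfers included.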

    The result of parallel homomorphic filtering on an image of a workroom is displayed in Fig. 7, and a quality evaluation was calculated for each of the five image sizes. The original image is nonuniformly illuminated, so several objects in the dark regions are indistinct. According to Table 2 and Fig. 7, the image enhanced by homomorphic filtering has clearly higher contrast and sharpness, and details that are indistinguishable in the original (objects in the bookcase and on the desk) are easily noticed in the enhanced image.

    Figure 5. Execution time comparison for the Sobel edge detector. The GPU and CPU curves are plotted on separate scales: the left y-axis gives the GPU times and the right y-axis gives the CPU times.


    Figure 6. Sobel edge detection result (256×256).

    VI. CONCLUSION

    In this paper, two parallel image processing algorithms, Sobel edge detection and homomorphic filtering, are implemented on the GPU and compared with sequential CPU-based implementations. Performance results indicate that significant speedup can be achieved and that the speedup increases with image size. Sobel edge detection and homomorphic filtering achieve speedups of up to 25× and 49×, respectively, over the CPU-based implementations. The GPU thus provides a novel and efficient acceleration technique for image processing at a lower hardware cost. Future work will involve mapping more complex image processing algorithms onto the GPU and a deeper analysis of parallelization strategies to make the best use of the computing resources the GPU provides.

    Table 1. Execution time comparison for homomorphic filtering.

    Image size    GPU (ms)    CPU (ms)    Speedup
    128×128          1.191      10.491       8.81
    256×256          1.586      34.643      21.84
    512×512          4.325     202.479      46.82
    1024×1024       16.387     810.211      49.44
    2048×2048       77.685    3303.834      42.53

    Figure 7. Homomorphic filtering result (1024×1024). The original image with nonuniform illumination is on the left, and the homomorphically enhanced image is on the right.

    Table 2. Quality evaluation before and after enhancement.

    Image         Size         Brightness    Contrast    Entropy
    Nonuniform    1024×1024         10.52       12.94       4.83
    Enhanced      1024×1024         37.38       26.11       5.53

    ACKNOWLEDGMENT

    I am extremely grateful to my advisor, Dr. Jian-li Wang, for his patience, wisdom, and kindness. This work would have failed miserably without his constant guidance.

    My parents set me on the path of learning and work, and they are still my most valued teachers. Finally, I thank the conference organizers for the opportunity they provided.

    REFERENCES

    [1] NVIDIA Corporation, NVIDIA CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, 2007.

    [2] T. R. Halfhill, Parallel Processing with CUDA, Microprocessor Report, Scottsdale, Arizona, Jan. 28, 2008.

    [3] NVIDIA CUDA, http://forums.nvidia.com

    [4] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Second Edition, 2003, 3.

    [5] L. S. Davis, A Survey of Edge Detection Techniques, CGIP, vol. 4, pp. 248-270, 1975.

    [6] V. Podlozhnyuk, Image Convolution with CUDA, http://www.nvidia.com/object/cuda_home

    [7] H. R. Zuo, Fast Sobel Edge Detection Algorithm Based on GPU, Opto-Electronic Engineering, vol. (1), pp. 8-12, 2009.

    [8] C. N. Chen and Y. J. Wang, Image contrast enhancement by homomorphic filtering in frequency field, Micro-Computer Information, vol. 23(2-3), pp. 264-266, 2007.

    [9] Ponomarev and I. Vladimir, Image enhancement by homomorphic filters, Proceedings of SPIE - The International Society for Optical Engineering, vol. 2564, pp. 153-159, 1995.

    [10] K. David, NVIDIA's GT200: Inside a Parallel Processor, http://www.realworldtech.com, 2008.

    [11] NVIDIA Corporation, NVIDIA CUFFT Library, Version 1.1, 2007.

    [12] NVIDIA Corporation, NVIDIA CUBLAS Library, Version 1.1, 2007.

    [13] M. Frigo and S. Johnson, An Adaptive Software Architecture for the FFT, ICASSP Conference Proceedings, Seattle, Washington, USA, pp. 1381-1384, 1998.
