

Improving FCM and T2FCM Algorithms Performance using GPUs for Medical Images Segmentation

Mohammed A. Shehab, Mahmoud Al-Ayyoub and Yaser Jararweh

Jordan University of Science and Technology
Irbid, Jordan

Emails: [email protected], {maalshbool, yijararweh}@just.edu.jo

Abstract—Image segmentation has gained popularity recently due to its numerous applications in many fields such as computer vision and medical imaging. As its name suggests, segmentation is concerned with partitioning the image into separate regions where one of them is of special interest. Such a region is called the Region of Interest (RoI) and it is very important for many medical imaging problems. Clustering is one of the segmentation approaches typically used on medical images despite its long running time. In this work, we propose to leverage the power of the Graphics Processing Unit (GPU) to improve the performance of such approaches. Specifically, we focus on the Fuzzy C-Means (FCM) algorithm and its more recent variation, the Type-2 Fuzzy C-Means (T2FCM) algorithm. We propose a hybrid CPU-GPU implementation to speed up the execution time without affecting the algorithm’s accuracy. The experiments show that such an approach reduces the execution time by up to 80% for FCM and 74% for T2FCM.

Index Terms—Medical Image Segmentation; Fuzzy C-Means; Type-2 Fuzzy C-Means; GPU; CUDA

I. INTRODUCTION

Recently, medical image processing (for the different existing modalities such as magnetic resonance imaging (MRI), computed tomography (CT), digital mammography, etc.) has become more popular due to its obvious benefits in the diagnosis of many diseases. Researchers are continuously trying to come up with more accurate and efficient techniques [1]. However, due to the recent advances in medical image modalities and the increased size and resolution of medical images, the processing capabilities of typical CPUs are no longer sufficient. A recent trend is to exploit the capabilities of the Graphics Processing Unit (GPU) in order to improve the performance of medical image processing tasks [2], [3], [4].

Image segmentation is one of the fundamental tasks in image processing. It focuses on how to extract objects from images. It separates different regions of the image where one region is of special interest. Such a region is called the Region of Interest (RoI) and it is very important for many medical imaging problems [5], [6]. For example, segmentation is an integral step in many Computer-Aided Diagnosis (CAD) systems [7], [8], [9], [10]. Many approaches have been proposed for this task, such as threshold-based methods, clustering methods, compression-based methods, histogram-based methods and region-growing methods [11], [12], [13], [14]. We focus here on the clustering techniques for segmentation. Specifically, we are concerned with the celebrated Fuzzy C-Means (FCM) technique [15].

Due to its importance, several enhancements of FCM have appeared over the past three decades, trying to improve the accuracy and performance of FCM. For the latter objective, [16], [1], [17], [18] proposed to use GPU capabilities. GPUs use single instruction multiple data (SIMD) parallel programming. While both CPUs and GPUs can manage thousands of threads via time-slicing, modern CPUs can run only 4-12 threads truly in parallel, whereas GPUs can run a thousand threads at a time [1], [19].

In this work, we show how to improve the performance of FCM as well as a variation of it called Type-2 Fuzzy C-Means (T2FCM) using the GPU. Following the finding of [1], we devise a hybrid CPU-GPU implementation and compare it with the CPU implementation on two medical images.

The structure of this paper is as follows. The following section briefly discusses a few similar works. Section III presents our methodology, which covers the sequential as well as the hybrid implementations, and Section IV discusses the experiments we conducted and the results we obtained. Finally, we conclude our work and provide some directions for future research.

II. RELATED WORKS

Most research efforts have focused on improving the accuracy of FCM, with some researchers focusing on how to improve its performance. For instance, Rowińska et al. [16] implemented FCM on a parallel architecture. They used CUDA to convert the sequential code of FCM to a parallel one. The testing data were composed of different colored images with different sizes. They transferred two main functions of FCM to be executed on the GPU platform. The membership matrix and new centroid calculations ran on the GPU side while the objective function and the termination condition ran on the CPU side. Their experiments were conducted on an Intel Core i3 machine with an NVIDIA GeForce GTX 560 video card and the Windows 7 64-bit operating system. Their CUDA implementation was tested against two sequential implementations (in C++ and MATLAB). Two types of experiments were conducted with one-/two-dimensional feature spaces. The GPU implementation was 7 times faster than the CPU implementation, which corresponds to a reduction in running time of about 86%.

2015 6th International Conference on Information and Communication Systems (ICICS)

978-1-4799-7348-4/15/$31.00 ©2015 IEEE

Walters et al. [17] used CUDA to parallelize two algorithms: a Markov random fields (MRF) based segmentation algorithm and HMMER’s Viterbi algorithm. They focused on liver medical images as the dataset for this research. The machine they used had an AMD Athlon 275 processor with 8 GB of memory and an NVIDIA GeForce 8800 GTX video card with 768 MB of memory. The performance speedup was around 130x and 38.6x for the MRF and HMMER algorithms, respectively.

Similarly, Pan et al. [18] considered the Region Growing and Multi-Degree Immersing Watershed segmentation algorithms. They used abdomen images and brain images as their dataset. The hardware used included a GeForce 8500 GT with 2 GB of GPU memory and a Pentium 4 2.4 GHz CPU with 2 GB of RAM. The performance speedup was 84% for the Region Growing method and 65% for the Multilevel Watershed method.

III. METHODOLOGY

In this section, we start by presenting the sequential implementation of FCM and T2FCM. Then, we show how we convert this implementation into a hybrid CPU-GPU implementation. After that, we display the resulting images after running the FCM and T2FCM algorithms.

A. Sequential Implementations

In this section we introduce FCM and T2FCM and discuss their sequential implementations. The programming language used in this study is C#. FCM [15] is one of the best-known clustering algorithms used for segmentation. The segmentation process is done in three main steps. The first step calculates the centroid of each cluster (initially, these centroids are generated randomly) as follows.

$$v_j = \frac{\sum_{i=1}^{N} \mu_{ij}^{m} x_i}{\sum_{i=1}^{N} \mu_{ij}^{m}}, \tag{1}$$

where $m$ is the fuzziness factor, $N$ is the number of points and $v_j$ is the center of cluster $j$. The second step calculates the membership of each data point with all cluster centroids as follows.

$$\mu_{ij} = \left( \sum_{L=1}^{C} \frac{|x_i - v_j|}{|x_i - v_L|} \right)^{-1}, \tag{2}$$

where $C$ is the number of clusters and $x_i$ is the object point. The final step, called the objective function computation, calculates the total distance between points and cluster centers. These steps are repeated until the change in this total distance is less than or equal to some error threshold [16]. The objective function is calculated as follows.

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} \left[ \mu_{ij}^{m} \, \| x_i - v_j \|^{2} \right]. \tag{3}$$
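Equations 1 to 3 map almost directly onto code. The following is a minimal sketch in Python, not the C# implementation used in this study. It assumes one-dimensional gray-level pixel values, and the names `update_centers`, `update_memberships`, `objective` and `fcm`, as well as the parameters `exponent`, `eps`, `init_centers`, `max_iter` and `seed`, are illustrative assumptions. With `exponent=1.0` the membership update follows Equation 2 as it appears in the text; the textbook FCM update raises each distance ratio to the power 2/(m-1), which can be selected through the same parameter.

```python
import random


def update_centers(points, u, m):
    """Equation 1: each center is the membership-weighted mean of the points."""
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))
    return centers


def update_memberships(points, centers, exponent=1.0, eps=1e-12):
    """Equation 2: membership of each point in each cluster from distance ratios.

    eps (an assumption) avoids division by zero when a point sits on a center.
    """
    u = []
    for x in points:
        dists = [abs(x - v) + eps for v in centers]
        u.append([1.0 / sum((dj / dl) ** exponent for dl in dists) for dj in dists])
    return u


def objective(points, centers, u, m):
    """Equation 3: membership-weighted sum of squared distances."""
    return sum(u[i][j] ** m * (x - v) ** 2
               for i, x in enumerate(points)
               for j, v in enumerate(centers))


def fcm(points, n_clusters, m=2.0, tol=1e-6, max_iter=100,
        init_centers=None, seed=0):
    """Iterate Equations 1-3 until the objective changes by less than tol."""
    if init_centers is None:
        # Random initialization: pick n_clusters of the data points.
        init_centers = random.Random(seed).sample(points, n_clusters)
    centers = list(init_centers)
    u = update_memberships(points, centers)
    j_old = objective(points, centers, u, m)
    for _ in range(max_iter):
        centers = update_centers(points, u, m)
        u = update_memberships(points, centers)
        j_new = objective(points, centers, u, m)
        if abs(j_new - j_old) < tol:
            break
        j_old = j_new
    return centers, u


if __name__ == "__main__":
    pixels = [10.0, 11.0, 12.0, 198.0, 200.0, 205.0]
    centers, u = fcm(pixels, 2, init_centers=[0.0, 255.0])
    print(centers)
```

On this toy input the two centers settle near the dark group (around 11) and the bright group (around 201), and every row of the membership matrix sums to one.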

The steps of the FCM algorithm are as follows.

1) Input the number of clusters $C$, the fuzziness parameter $m$ and the termination criterion $\epsilon$. Set the loop counter $k = 0$.
2) Randomly initialize the cluster centers.
3) Initialize the membership matrix $U = [\mu_{ij}]$ according to Equation 2.
4) Calculate the objective function $J_k$ according to Equation 3.
5) Increment the loop counter.
6) Calculate the cluster center vectors $V^{(k)} = [v_j]$ according to Equation 1.
7) Calculate the membership matrix $U = [\mu_{ij}]$ according to Equation 2.
8) Calculate the objective function $J_{k+1}$ according to Equation 3.
9) Stop if $|J_{k+1} - J_k| < \epsilon$; otherwise, return to Step 5.

Rhee et al. [11] presented T2FCM, which is very similar to FCM except that it derives a new membership value from each FCM membership value, as shown in Equation 4.

$$a_{ij} = \mu_{ij} - \frac{1 - \mu_{ij}}{2}. \tag{4}$$
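Equation 4 is an element-wise transformation of the membership matrix, so it can be illustrated in a few lines. The sketch below is in Python rather than the C# used in this study, and the function name `type2_membership` is a hypothetical choice. Algebraically, Equation 4 equals (3µ − 1)/2, so a full membership of 1 is left unchanged, while any membership below 1/3 maps to a negative value.

```python
def type2_membership(u):
    """Equation 4 applied element-wise: a_ij = u_ij - (1 - u_ij) / 2."""
    return [[u_ij - (1.0 - u_ij) / 2.0 for u_ij in row] for row in u]


if __name__ == "__main__":
    # Three pixels, two clusters; each row sums to one.
    u = [[1.0, 0.0], [0.5, 0.5], [0.8, 0.2]]
    for row in type2_membership(u):
        print([round(a, 3) for a in row])
```

For the three example rows the transformed memberships are approximately (1, −0.5), (0.25, 0.25) and (0.7, −0.2): confident memberships are barely changed, whereas weak ones are pushed down.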

The steps of the T2FCM algorithm are as follows.

1) Input the number of clusters $C$, the fuzziness parameter $m$ and the termination criterion $\epsilon$. Set the loop counter $k = 0$.
2) Randomly initialize the cluster centers.
3) Initialize the membership matrix $U = [\mu_{ij}]$ according to Equation 2.
4) Calculate the objective function $J_k$ according to Equation 3.
5) Increment the loop counter.
6) Calculate the cluster center vectors $V^{(k)} = [v_j]$ according to Equation 1.
7) Calculate the membership matrix $U = [\mu_{ij}]$ according to Equation 2.
8) Calculate the new membership matrix $A = [a_{ij}]$ according to Equation 4.
9) Calculate the objective function $J_{k+1}$ according to Equation 3.
10) Stop if $|J_{k+1} - J_k| < \epsilon$; otherwise, return to Step 5.

B. Parallel Implementations

In this section we discuss how to improve the performance of FCM and T2FCM using hybrid CPU-GPU implementations. For FCM, the hybrid CPU-GPU implementation is similar to the sequential one except that we transfer two functions to run on the GPU side. The first function calculates the membership of the data points with the cluster centroids using Equation 2, whereas the second function updates and segments the image pixels according to their strength of membership with the cluster centroids. We run the centroid and objective function calculations on the CPU side because these two functions need to calculate summations over all data points, as shown in Equations 1 and 3. Such a summation requires synchronizing all threads so that each one reads its data point value and adds it to a single memory location [19]. Thread synchronization causes delays on the GPU since the summation is not a parallel operation due to the data dependencies it contains. In this case the CPU is expected to perform better, especially if we take into account the delay of transferring data between GPU memory and CPU memory.

TABLE I
GPU UTILIZATION PERCENTAGES.

Threads per block   Compute capability
                    1.0    1.1    1.2    1.3    2.0    2.1    3.0
64                  67     67     50     50     33     33     50
96                  100    100    75     75     50     50     75
128                 100    100    100    100    67     67     100
192                 100    100    94     94     100    100    94
256                 100    100    100    100    100    100    100
384                 100    100    75     75     100    100    94
512                 67     67     100    100    100    100    100
768                 N/A    N/A    N/A    N/A    100    100    75
1024                N/A    N/A    N/A    N/A    67     67     100

On the other hand, Equation 2 shows that the membership function also has a summation operation; however, this summation iterates over the number of clusters, which is always less than the number of data points [15]. In addition, in this function each pixel needs to calculate the Euclidean distance between itself and each cluster centroid, and this can be done independently because every pixel can read the values of the cluster centers without affecting the other pixels.
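The division of work described above can be illustrated with a CPU-only sketch in Python; it is not the authors' C#/CUDA code, and all function names are hypothetical. Each call to `membership_kernel` or `segment_kernel` stands in for the work of one GPU thread on one pixel, while `cpu_side_centers` shows the reduction over all pixels that stays on the CPU. Gray-level (one-dimensional) pixel values are assumed.

```python
def membership_kernel(pixel, centers, eps=1e-12):
    """Work of one GPU thread: memberships of a single pixel (Equation 2).

    It reads only its own pixel and the shared, read-only cluster centers,
    so it needs no synchronization with other threads.
    """
    dists = [abs(pixel - v) + eps for v in centers]
    return [1.0 / sum(dj / dl for dl in dists) for dj in dists]


def segment_kernel(memberships):
    """Work of one GPU thread: label a pixel with its strongest cluster."""
    return max(range(len(memberships)), key=lambda j: memberships[j])


def gpu_side(pixels, centers):
    """Stand-in for the two kernel launches: one independent call per pixel."""
    u = [membership_kernel(p, centers) for p in pixels]
    labels = [segment_kernel(row) for row in u]
    return u, labels


def cpu_side_centers(pixels, u, m=2.0):
    """Equation 1 on the CPU: a sum over all pixels for every cluster."""
    centers = []
    for j in range(len(u[0])):
        num = sum(row[j] ** m * p for row, p in zip(u, pixels))
        den = sum(row[j] ** m for row in u)
        centers.append(num / den)
    return centers


if __name__ == "__main__":
    pixels = [10.0, 12.0, 200.0, 204.0]
    u, labels = gpu_side(pixels, [0.0, 255.0])
    print(labels, cpu_side_centers(pixels, u))
```

The per-pixel functions touch no shared mutable state, which is what makes them suitable for one-thread-per-pixel execution, whereas `cpu_side_centers` accumulates every pixel into one value per cluster.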

We also improve T2FCM’s performance by running the membership matrix $U$ calculation (Equation 2), the new membership matrix $A$ calculation (Equation 4) and the segmentation function on the GPU side. Equations 2 and 4 have the same properties, so they can both be run successfully on the GPU side.

Other challenges in the parallel implementation include memory optimization and selecting a suitable number of threads. As mentioned in [19], if a programmer uses an amount of memory in their code that is larger than the data size, then the GPU will take longer to finish its job. So, we calculate the memory size (i.e., the number of blocks) by dividing the size of the data by the number of threads. This optimizes the memory size and speeds up GPU performance. Relevant to this is the number of threads. It is known in the literature [19] that a large number of threads is not always a good way to speed up GPU performance. Also, not many models of NVIDIA GPUs support a very large number of threads per block, such as 1,024 threads. In our experiments, we find that using a fixed number of threads (256 threads) leads to the best performance and the best GPU utilization, as shown in Table I.
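As a small worked example of the block calculation, the Python sketch below divides the number of pixels by 256 threads per block for the four image sizes reported in Section IV. Rounding up to a whole block is an assumption, since the text only says that the data size is divided by the number of threads, and the function name `blocks_needed` is hypothetical.

```python
def blocks_needed(n_pixels, threads_per_block=256):
    """Ceiling division: enough blocks so that every pixel gets one thread."""
    return (n_pixels + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    for width, height in [(208, 176), (383, 344), (323, 380), (348, 368)]:
        n = width * height
        print(f"{width} x {height}: {n} pixels -> {blocks_needed(n)} blocks")
```

This gives 143, 515, 480 and 501 blocks, respectively. Only the 208 × 176 image fills its last block exactly; for the others the last block is partly idle, so a kernel launched this way would need to check that its thread index is smaller than the number of pixels.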

IV. RESULTS AND DISCUSSION

In this section we present the results for both the sequential and hybrid CPU-GPU implementations and compare them. We measure performance as the running time of each implementation from start to finish. Then, we calculate the performance improvement using the following equation.

$$\text{performance} = \frac{\text{new} - \text{original}}{\text{original}} \times 100.$$

Each experiment is repeated ten times and the averages are reported.

We now describe the experimental setup. For the dataset, we run the FCM and T2FCM algorithms on two brain MRI images and two mammograms. The MRI images have sizes 208 × 176 and 383 × 344 while the mammograms have sizes 323 × 380 and 348 × 368. The machine we use for the experiments has an Intel Core i7 CPU with 6 GB of RAM and an NVIDIA GT 740M video card with 2 GB of GPU RAM. The operating system is Windows 8.1 and the code is implemented in the C# programming language. To run C# code on the GPU we use CUDA and an integration library for Visual Studio 2013.¹ The number of clusters in all test cases is five, and we make sure that our implementations do not compromise the accuracy by matching the output of each sequential implementation with that of its parallel counterpart. The results are shown in Figures 1-4. Also, Figures 5 and 6 show the final output of each algorithm, highlighting the differences in accuracy between FCM and T2FCM.

As expected, the hybrid CPU-GPU implementation significantly outperformed the sequential implementation. For the brain MRI images, FCM’s average times for the sequential and the hybrid CPU-GPU implementations are 34.09 sec and 6.91 sec, respectively, while T2FCM’s average times for the sequential and the hybrid CPU-GPU implementations are 37.91 sec and 9.97 sec, respectively. As for the mammograms, FCM’s average times for the sequential and the hybrid CPU-GPU implementations are 142.76 sec and 28.72 sec, respectively, while T2FCM’s average times for the sequential and the hybrid CPU-GPU implementations are 136.33 sec and 35.27 sec, respectively. From these times, we can see that the performance has improved by 80% for FCM. This improvement is obtained by moving two functions to be performed by the GPU. We also use some optimization techniques to speed up the GPU execution time, such as reducing the number of data transfers from and to the GPU memory. For example, in this implementation, we transfer only the cluster index to the GPU each time, and we copy this data back only when the code finishes, in order to display the image in our application’s GUI. As for T2FCM, the performance has improved by 74% after we moved three functions to the GPU and ran two functions on the CPU side. Figure 7 shows the performance comparison between the sequential and the hybrid CPU-GPU implementations for both FCM and T2FCM.
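The reported percentages can be checked by applying the performance equation to the average times quoted above. The Python sketch below does this; the function name `improvement` and the extra speedup column (sequential time divided by hybrid time) are illustrative additions rather than part of the original analysis.

```python
def improvement(new, original):
    """Relative change in running time in percent (negative means faster)."""
    return (new - original) / original * 100.0


# Average running times in seconds, as reported above: (sequential, hybrid).
TIMES = {
    ("FCM", "brain MRI"): (34.09, 6.91),
    ("FCM", "mammogram"): (142.76, 28.72),
    ("T2FCM", "brain MRI"): (37.91, 9.97),
    ("T2FCM", "mammogram"): (136.33, 35.27),
}

if __name__ == "__main__":
    for (algo, data), (seq, hyb) in TIMES.items():
        change = improvement(hyb, seq)
        print(f"{algo:5s} {data:9s} change {change:6.1f}%  speedup {seq / hyb:.2f}x")
```

The equation yields a negative value because the new running time is smaller than the original one. Its magnitude is roughly 79.7% and 79.9% for FCM and 73.7% and 74.1% for T2FCM, in line with the 80% and 74% figures quoted in the text, and corresponds to speedups of about 4.9x and 3.8x.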

Our improvements are comparable to the improvements reported in some of the existing works (such as [18]) and smaller than others (such as [16]). This is due to many reasons, such as the difference in the hardware used. We use a GPU from the GT series while [16], [17] used GPUs from the GTX series. As reported in many sources including NVIDIA’s documentation, a GTX series GPU is expected to perform better than a GT series GPU. This difference is shown in Figure 8. Figure 8(b) shows the estimated improvement in performance if we were to use a GTX 560 GPU. As shown in the figure, the estimated improvement would exceed 100%, reaching 331% for FCM and 307% for T2FCM.

¹ https://cudafy.codeplex.com/


(a) Cluster 1. (b) Cluster 2. (c) Cluster 3. (d) Cluster 4. (e) Cluster 5.

Fig. 1. The clusters resulting from applying FCM on the brain MRI image.

(a) Cluster 1. (b) Cluster 2. (c) Cluster 3. (d) Cluster 4. (e) Cluster 5.

Fig. 2. The clusters resulting from applying T2FCM on the brain MRI image.

(a) Cluster 1. (b) Cluster 2. (c) Cluster 3. (d) Cluster 4. (e) Cluster 5.

Fig. 3. The clusters resulting from applying FCM on the breast mammogram.

(a) Cluster 1. (b) Cluster 2. (c) Cluster 3. (d) Cluster 4. (e) Cluster 5.

Fig. 4. The clusters resulting from applying T2FCM on the breast mammogram.


(a) Original. (b) FCM. (c) T2FCM.

Fig. 5. The final results of applying FCM and T2FCM on the brain MRI image.

(a) Original. (b) FCM. (c) T2FCM.

Fig. 6. The final results of applying FCM and T2FCM on the breast mammogram.

Fig. 7. Performance of the four implementations under consideration.


(a) The difference computed by the Passmark benchmark [20].

(b) Estimated difference.

Fig. 8. Approximate performance of GTX-560 vs GT-740M.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we explored the potential gain of using GPU capabilities to improve the segmentation of medical images. We focused on the FCM and T2FCM segmentation algorithms and presented our hybrid CPU-GPU implementation for each algorithm. We tested our hybrid implementations on brain MRI images and mammograms and the results showed that they significantly outperform the sequential implementations.

REFERENCES

[1] A. Eklund, P. Dufort, D. Forsberg, and S. M. LaConte, “Medical image processing on the GPU – past, present and future,” Medical Image Analysis, vol. 17, no. 8, pp. 1073–1094, 2013.

[2] S. AlZubi, Y. Jararweh, and R. Shatnawi, “Medical volume segmentation using 3D multiresolution analysis,” in 2012 International Conference on Innovations in Information Technology (IIT), 2012.

[3] Y. Jararweh, M. Jarrah, and S. Hariri, “Exploiting GPUs for compute-intensive medical applications,” in Multimedia Computing and Systems (ICMCS), 2012 International Conference on. IEEE, 2012, pp. 29–34.

[4] Y. Jararweh, S. Hariri, and T. Moukabary, “Simulating of cardiac electrical activity with autonomic run time adjustments,” AHSC Frontiers in Biomedical Research, 2009.

[5] S. D. Olabarriaga and A. W. Smeulders, “Interaction in the segmentation of medical images: A survey,” Medical Image Analysis, vol. 5, no. 2, pp. 127–142, 2001.

[6] S. A. Begum and O. M. Devi, “A rough type-2 fuzzy clustering algorithm for MR image segmentation,” International Journal of Computer Applications, vol. 54, no. 4, pp. 4–11, 2012.

[7] M. Al-Ayyoub, D. Alawad, K. Al-Darabsah, and I. Aljarrah, “Automatic detection and classification of brain hemorrhages,” WSEAS Transactions on Computers, vol. 12, no. 10, pp. 395–405, 2013.

[8] A. Oqaily et al., “Localization of coronary artery thrombosis using coronary angiography,” in The Third International Conference on Informatics Engineering and Information Science (ICIEIS2014). The Society of Digital Information and Wireless Communication, 2014, pp. 310–316.

[9] K. Al-Darabsah and M. Al-Ayyoub, “Breast cancer diagnosis using machine learning based on statistical and texture features extraction,” in Proceedings of the 4th International Conference on Information and Communication Systems (ICICS 2013), 2013.

[10] K. Alawneh et al., “Computer-aided diagnosis of lumbar disk herniation,” in Proceedings of the 6th International Conference on Information and Communication Systems (ICICS 2015), 2015.

[11] F. C.-H. Rhee and C. Hwang, “A type-2 fuzzy c-means clustering algorithm,” in IFSA World Congress and 20th NAFIPS International Conference, 2001. Joint 9th, vol. 4. IEEE, 2001, pp. 1926–1929.

[12] L. Garcia Ugarriza, E. Saber, S. R. Vantaram, V. Amuso, M. Shaw, and R. Bhaskar, “Automatic image segmentation by dynamic region growth and multiresolution merging,” Image Processing, IEEE Transactions on, vol. 18, no. 10, pp. 2275–2288, 2009.

[13] F. Y. Shih and S. Cheng, “Automatic seeded region growing for color image segmentation,” Image and Vision Computing, vol. 23, no. 10, pp. 877–886, 2005.

[14] J. Tang, “A color image segmentation algorithm based on region growing,” in Computer Engineering and Technology (ICCET), 2010 2nd International Conference on, vol. 6. IEEE, 2010, pp. V6–634.

[15] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering algorithm,” Computers & Geosciences, vol. 10, no. 2, pp. 191–203, 1984.

[16] Z. Rowińska and J. Gocławski, “CUDA based fuzzy c-means acceleration for the segmentation of images with fungus grown in foam matrices,” Image Processing & Communications, vol. 17, no. 4, pp. 191–200, 2012.

[17] J. P. Walters, V. Balu, S. Kompalli, and V. Chaudhary, “Evaluating the use of GPUs in liver image segmentation and HMMER database searches,” in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–12.

[18] L. Pan, L. Gu, and J. Xu, “Implementation of medical image segmentation in CUDA,” in Information Technology and Applications in Biomedicine, 2008. ITAB 2008. International Conference on. IEEE, 2008, pp. 82–85.

[19] S. Cook, CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs. Newnes, 2012.

[20] “GeForce GTX 560 vs GT 740M,” Online, Dec 2011, http://gpuboss.com/gpus/GeForce-GTX-560-vs-GeForce-GT-740M [Accessed Mar-2015].
