
Parallel Generation of Digitally Reconstructed Radiographs on Heterogeneous Multi-GPU Workstations

Marwan Abdellah1†, Student Member, IEEE, EMBS, Asem Abdelaziz2, Student Member, IEEE, EMBS,
Eslam Ali2, Sherief Abdelaziz2, Abdelrahman Sayed2,

Mohamed I. Owis3, Member, IEEE, EMBS, Ayman Eldeib4, Senior Member, IEEE, EMBS

Abstract— The growing importance of three-dimensional radiotherapy treatment has been associated with the active presence of advanced computational workflows that can simulate conventional x-ray films from computed tomography (CT) volumetric data to create digitally reconstructed radiographs (DRRs). These simulated x-ray images are used to continuously verify the patient alignment in image-guided therapies with 2D-3D image registration. Present DRR rendering pipelines are too limited to handle the huge imaging stacks generated by recent state-of-the-art CT imaging modalities. We present a high performance x-ray rendering pipeline that is capable of generating high quality DRRs from large scale CT volumes. The pipeline is designed to harness the immense computing power of all the heterogeneous computing platforms that are connected to the system, relying on OpenCL. Load-balancing optimization is also addressed to equalize the rendering load across the entire system. The performance benchmarks demonstrate the capability of our pipeline to generate high quality DRRs from relatively large CT volumes at interactive frame rates using cost-effective multi-GPU workstations. A 512×512 DRR frame can be rendered from 1024×2048×2048 CT volumes at 85 frames per second.

I. INTRODUCTION & RELATED WORK

Digitally reconstructed radiographs are computer generated x-ray-like images that simulate the decay of an x-ray point source within the human tissue. They play an essential role in image-guided radiation therapy, mainly in the planning and verification of radiotherapy treatments [1]. DRRs are used in 2D-3D image registration packages to verify the patient positioning before treatment delivery. They map the alignment of dynamic intraoperative images with volumetric anatomical stacks that were captured preoperatively in the planning stage [2], [3], [4].

A. Rationale and Challenges

Recent CT scanners are capable of imaging fine structures of the tissue at very high spatial resolution. Over the last few years, the volume of the data produced by those advanced scanners has increased dramatically. This leap has imposed further challenges on the interactive generation of

1Marwan Abdellah is an M.Sc. graduate from the Department of Systems and Biomedical Engineering, Faculty of Engineering, Cairo University, Egypt. †Corresponding author [email protected]

2Asem Abdelaziz, Sherief Abdelaziz, Eslam Ali, and Abdelrahman Sayed are undergraduate students with the Department of Systems and Biomedical Engineering, Faculty of Engineering, Cairo University, Egypt.

3Mohamed I. Owis is an Associate Professor and 4Ayman Eldeib is a Professor with the Department of Systems and Biomedical Engineering, Faculty of Engineering, Cairo University, Egypt.

high quality DRRs from large scale CT datasets. These challenges can be trivially resolved by purchasing high-end imaging workstations that are powerful enough to load and render large CT volumes, relying on traditional software packages and robust, highly-specialized rendering devices, for example NVIDIA Quadro GPUs [5]. This approach is extremely expensive and might not be affordable in certain situations. If lower DRR quality can be tolerated, a single mid-range GPU can be used to render the DRR interactively. The quality of the generated DRR is proportional to the resampling factor that is used for down-scaling the CT volume to fit in the memory of the GPU. Nevertheless, this solution is not acceptable in some procedures and can be harmful if the down-sampling rate is lower than a certain threshold.

In this work, we discuss an alternative low-cost solution that is capable of rendering high quality DRRs at interactive frame rates. High performance distributed ray tracing is used to render DRR images on all the computing nodes that are connected to the system. The architecture of the rendering pipeline uses the heterogeneous computing model of OpenCL to avoid limiting the system to a specific hardware vendor or computing platform.

B. Heterogeneous Computing with OpenCL

In 2009, OpenCL was introduced to spark a new era of affordable high performance computing. This framework allows designing software pipelines that can be executed across a set of different heterogeneous computing devices, mainly CPUs and GPUs [6]. OpenCL is extremely advantageous over similar computing frameworks such as CUDA, DirectCompute and Stream: it can run seamlessly on many computing architectures and is supported by various hardware vendors [7], [8]. Moreover, the heterogeneous computing model of OpenCL is designed to leverage the performance of applications that exhibit task- and data-parallelism [9]. Our x-ray rendering pipeline is implemented with OpenCL to exploit the advantages provided by all the aforementioned features. This pipeline is presented to overcome the limitations of existing ones that are not scalable enough to render large scale CT data and are also vendor-specific [10], [11], [12], [13].

II. THEORY: X-RAY VOLUME RENDERING

DRRs can be reconstructed either in the spatial domain or the k-space. k-space methods are normally faster than the

978-1-4577-0220-4/16/$31.00 ©2016 IEEE 3953


spatial domain ones; however, they are limited by their huge memory requirements. The generation of a conventional radiograph is modeled with the Beer-Lambert law to account for the exponential decay of a photon source in terms of depth of penetration [14]. This model corresponds to an absorption-only optical model that can be used to simulate the light transport in participating media and solve the volume rendering integral [15]. Using this optical model, the DRR can be generated from a CT volume if the attenuation coefficients of the tissue are available and the energy of the x-ray source is known a priori.
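For concreteness, the absorption-only model mentioned above reduces to the standard Beer-Lambert form (notation ours, not taken verbatim from [14]):

```latex
I(d) = I_0 \exp\!\left( - \int_0^d \mu(s)\, ds \right)
```

where I_0 is the intensity of the x-ray source, μ(s) is the attenuation coefficient of the tissue along the ray, and d is the penetration depth. Taking the negative logarithm of the detected intensity therefore yields the line integral of μ, which is exactly the quantity a DRR pixel encodes.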

A faster alternative for generating the DRR can rely on an emission-only optical model that is based on the x-ray transform. This transform is a direct projection of the CT volume on a 2D image plane. In this case, x-ray volume rendering is used to create the DRR by integrating the values of the volume samples that a scan line passes through [16]. The scan line is defined in the parametric form L(t) = O + ωt, where O is the starting point of the line, ω is a unit vector along the line direction and t is the distance between any point on the line and the origin. The continuous x-ray transform of the volume V on the line L is defined as

\[
\mathcal{X}[V(L)] = \int_{L} V = \int_{-\infty}^{\infty} V(O + \omega t)\, dt \qquad (1)
\]

We apply the discrete x-ray transform on multiple bricks that are partitioned from the volume and blend them later to render the final DRR that corresponds to the entire volume.
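The additivity that makes this brick-wise scheme valid can be sketched in a few lines of Python (a toy illustration, not the authors' OpenCL code): because the x-ray transform in (1) is a line integral, the per-brick integrals along the same ray sum to the full-volume integral.

```python
def render_drr(volume, dt=1.0):
    """Integrate voxel values along z for every (y, x) detector pixel."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [[sum(volume[z][y][x] for z in range(depth)) * dt
             for x in range(width)] for y in range(height)]

def composite(tiles):
    """Additive (sort-last) blend: sum the per-brick tiles pixel-wise."""
    height, width = len(tiles[0]), len(tiles[0][0])
    return [[sum(t[y][x] for t in tiles) for x in range(width)]
            for y in range(height)]

# A toy 4x2x2 volume split into two z-slabs (bricks); rays travel along z.
volume = [[[z + y + x for x in range(2)] for y in range(2)] for z in range(4)]
bricks = [volume[:2], volume[2:]]
tiles = [render_drr(b) for b in bricks]
assert composite(tiles) == render_drr(volume)  # brick sum equals full integral
```

The same additivity is what lets the compositing stage described in Section III blend the per-GPU tiles without any ordering artifacts.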

III. MULTI-GPU DRR RENDERING PIPELINE

We use sort-last x-ray volume rendering to generate the DRRs. The rendering pipeline is split into two main stages. The first one pre-processes the volume data so it gets distributed efficiently across the different compute nodes that are shipped with the system. Afterwards, the rendering loop is activated and recalled on demand following any user interactions such as changing the film resolution, image brightness, viewing parameters or hardware block configurations.

During the pre-processing step, the employed workstation is scanned for all the heterogeneous devices (CPUs and GPUs) that are connected to it. The devices are categorized based on their types and ranked according to their compute capabilities. This ranking allows us to integrate static load balancing [17], [18], [19] into the rendering pipeline to efficiently partition and distribute the data to the different computing devices. The higher the rank of the device, the larger the data brick that will be sent to it for rendering. This maximizes the utilization of the rendering resources, reduces the response delays, and achieves optimal performance balance across the entire system. We have also added support for memory balancing if load balancing fails to distribute the data due to memory limitations of any of the devices. In this case, the devices are ranked with respect to their memory capacity and the volume is partitioned accordingly.
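The rank-proportional partitioning can be sketched as follows; the scores and the exact rounding policy are illustrative assumptions, since the paper does not give the ranking formula:

```python
def partition_slices(total_slices, scores):
    """Assign z-slices to devices proportionally to their compute score."""
    total = sum(scores)
    counts = [total_slices * s // total for s in scores]
    # hand any remainder from integer division to the highest-ranked device
    counts[max(range(len(scores)), key=lambda i: scores[i])] += \
        total_slices - sum(counts)
    return counts

# e.g. two GPUs whose compute-unit counts differ by a factor of three
print(partition_slices(1024, [15, 5]))  # -> [768, 256]
```

Swapping the compute scores for per-device memory capacities gives the memory-balanced fallback described above.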

As seen in Figure 1, a DRR frame is generated in three steps: (1) partitioning the CT volume into smaller chunks, or bricks, that are distributed across all the rendering devices,

(2) rendering each brick in a ray tracer, and (3) gathering the images generated from each brick and compositing them into the final DRR image. To exploit the advanced features of GPU textures, the rendering kernels are executed on the GPUs and the compositing kernels are executed on the CPU. If the workstation contains more than one CPU, the CPU with the higher compute capabilities will be selected for executing the compositing kernel.

Fig. 1. Sort-last rendering of a DRR on three GPUs with ray tracing. Each GPU renders a partition (colored brick) of the volume and the resulting images are sorted and blended together in a parallel compositor to create the DRR of the entire volume.

After the volume partitioning, every brick is uploaded to a 3D texture object on the corresponding GPU. Each GPU is responsible for rendering the given brick into a tile that represents a part of the final DRR frame. The host assigns different asynchronous tasks to the GPUs, including (1) pre-processing the volume brick with O(N³) filters, (2) x-ray rendering of the brick to yield a DRR tile, and (3) applying local post-processing filters on the generated tile. The pre- and post-processing kernels are optional, while the x-ray rendering kernel is essentially executed on a per-frame basis. The rendering pipeline is designed in a multi-threaded fashion, where every device is controlled by a corresponding worker thread on the host. After the generation of each tile, the associated GPU is blocked until the entire DRR frame is rendered. The generation of all the tiles triggers the compositing thread that downloads the tiles back to the CPU to blend them and create the final DRR frame. Afterwards, a global post-processing filter is applied on the generated DRR, using an optional CPU kernel, before finally displaying it. A high-level block diagram of the rendering pipeline is illustrated in Figure 2.
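The host-side threading model just described can be sketched minimally in Python; the `render` callable stands in for the OpenCL GPU kernel, and the join on all workers stands in for the per-GPU blocking before compositing (this is a sketch of the control flow, not the authors' implementation):

```python
import threading

def run_frame(bricks, render):
    """One worker thread per device renders a tile; composite when all done."""
    tiles = [None] * len(bricks)

    def worker(i):
        tiles[i] = render(bricks[i])  # placeholder for the GPU ray tracer

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(bricks))]
    for t in threads:
        t.start()
    for t in threads:   # block until every tile of the frame is generated
        t.join()
    # additive compositing on the CPU: pixel-wise sum of all tiles
    return [sum(pixel) for pixel in zip(*tiles)]

frame = run_frame([[1, 2], [3, 4], [5, 6]], lambda b: b)
print(frame)  # -> [9, 12]
```

The join-then-composite ordering mirrors the pipeline's behavior: no compositing work starts until the slowest device finishes, which is precisely why the load balancing of Section III matters for frame rate.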



Fig. 2. A high level diagram of the architecture of our rendering pipeline. All the rendering devices (GPUs) operate in parallel to produce multiple tiles (sub-frames) that are sorted and blended by the compositor to generate the final DRR image. The ray tracing kernels are executed on the GPUs, while the compositor kernel is executed on a multi-core CPU. All the kernels are written in OpenCL.

IV. EXPERIMENTAL RESULTS & DISCUSSION

Figure 3 shows a high resolution DRR of the entire human body reconstructed with our rendering pipeline. The dataset used in this experiment is obtained from the Visible Human Project datasets repository [20].

A. Experimental Setup

The performance of our system is analyzed on a relatively low-end machine shipped with an Intel Core i3 CPU, 8 GBytes of memory and two affordable NVIDIA GPUs that are available in the commodity hardware market. The first GPU is a GeForce GTX 570 that has 1.25 GBytes of memory and 15 OpenCL compute units running at 1464 MHz. The other GPU is a GeForce GTX 750 that contains five compute units running at 1940 MHz and has 4 GBytes of memory. The performance benchmarks are recorded for a fixed DRR resolution (512×512) and multiple volume dimensions to address the impact of using multiple GPUs on the scalability and the interactivity of the rendering pipeline. The resolutions of the volumetric datasets selected for experimental benchmarking were 512³, 1024×512², 1024²×512, 1024³, 2048×1024²

and 2048²×1024. The two modes of volume partitioning, the load-balanced and memory-balanced decomposition, are investigated in all benchmarking tasks.

B. Benchmarking Analysis

Figure 4 demonstrates the average benchmarking results, in frames per second (FPS), for rendering the datasets on the selected GPUs in parallel. The pipeline can render the largest dataset (2048²×1024) that can be partitioned to fill both GPUs at 88 FPS. The benchmarks exhibit the significance of load-balancing as the resolution of the volumes increases. The largest dataset (1024³) that can fulfill the two balancing strategies can be rendered at 113 and 132 FPS for memory- and load-balancing respectively.

Figure 5 shows the profiling of each subroutine executed for rendering a single DRR frame. The average time taken by each subroutine is represented as a fraction of the total rendering loop time. The load-balancing has improved the relative performance of the rendering kernels from 72%/28% to 58%/42% on average.

Fig. 4. Performance profiles of our rendering pipeline for multiple datasets with different sizes. The benchmarks in blue and green represent the average performance (in FPS) of rendering 512×512 DRRs with load- and memory-balancing respectively. The load-balancing is not applicable in the last two cases due to the memory constraints of the employed GPUs.
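A back-of-the-envelope footprint check makes the memory constraint plausible. The paper does not state the voxel depth, so 8-bit voxels and a 15:5 split mirroring the two GPUs' compute-unit counts are assumptions used only for illustration:

```python
def brick_gib(dims, share, bytes_per_voxel=1):
    """GiB needed on one device for its share of the volume (8-bit voxels assumed)."""
    voxels = dims[0] * dims[1] * dims[2]
    return voxels * bytes_per_voxel * share / 2**30

# load-balanced share of the higher-ranked GPU: 15 of 20 compute units
print(brick_gib((1024, 1024, 1024), 15 / 20))  # -> 0.75  (fits in 1.25 GB)
print(brick_gib((2048, 1024, 1024), 15 / 20))  # -> 1.5   (exceeds 1.25 GB)
```

Under these assumptions the 1024³ volume still fits on the GTX 570 after a rank-proportional split, while the two largest volumes do not, which is consistent with the fallback to memory-balancing reported for the last two cases.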

V. CONCLUSION & FUTURE WORK

We presented an efficient rendering pipeline targeting the interactive generation of high quality DRRs from high resolution, large scale CT volumes that are acquired by advanced state-of-the-art CT scanners. We highlighted the drawbacks of relevant pipelines and how OpenCL is a promising computing framework for distributed rendering, bringing a high degree of code portability and hardware heterogeneity. In contrast to existing solutions, our system does not require overpriced high end workstations or dedicated hardware platforms. The pipeline is designed to run in a parallel, multi-threaded, distributed fashion on affordable heterogeneous multi-GPU workstations relying on OpenCL. The rendering kernel is optimized by integrating load- and memory-balancing implementations, allowing effective distribution of the volume data across the entire system. The performance benchmarks reflect the efficiency and scalability of our rendering pipeline using an affordable workstation containing a moderate CPU and two mid-range GPUs. A 512×512 DRR frame can be reconstructed from 1024×2048×2048 CT volumes at 85 FPS.

The system is being extended to support parallel distributed rendering on multiple workstations connected together with a high bandwidth network. This support will be essential to complement the presented approach if the employed workstation is limited to a single low-end GPU.

Fig. 3. A sagittal DRR of the Visible Human female dataset rendered with our proposed pipeline. The dataset is obtained from the Visible Human Project [20].

Fig. 5. Relative performance profiles for every subroutine contributing to DRR rendering for memory- and load-balanced rendering.

REFERENCES

[1] Marwan Abdellah, Ayman Eldeib, and Mohamed I. Owis, "Accelerating DRR generation using Fourier slice theorem on the GPU," in Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE. IEEE, 2015, pp. 4238–4241.

[2] Natasa Milickovic, Dimos Baltas, S. Giannouli, M. Lahanas, and N. Zamboglou, "CT imaging based digitally reconstructed radiographs and their application in brachytherapy," Physics in Medicine and Biology, vol. 45, no. 10, pp. 2787, 2000.

[3] Stefania Pallotta and Marta Bucciolini, "A simple method to test the geometrical reliability of digital reconstructed radiograph (DRR)," Journal of Applied Clinical Medical Physics, vol. 11, no. 1, 2010.

[4] C. S. Moore, G. Avery, S. Balcam, L. Needler, A. Swift, A. W. Beavis, and J. R. Saunderson, "Use of a digitally reconstructed radiograph-based computer simulation for the optimisation of chest radiographic techniques for computed radiography imaging systems," The British Journal of Radiology, 2014.

[5] Hui Yan, Lei Ren, Devon J. Godfrey, and Fang-Fang Yin, "Accelerating reconstruction of reference digital tomosynthesis using graphics hardware," Medical Physics, vol. 34, no. 10, pp. 3768–3776, 2007.

[6] John E. Stone, David Gohara, and Guochun Shi, "OpenCL: A parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 1-3, pp. 66–73, 2010.

[7] David B. Kirk and W. Hwu Wen-mei, Programming Massively Parallel Processors: A Hands-on Approach, Newnes, 2012.

[8] Kamran Karimi, Neil G. Dickson, and Firas Hamze, "A performance comparison of CUDA and OpenCL," arXiv preprint arXiv:1005.2581, 2010.

[9] Jonathan Tompson and Kristofer Schlachter, "An introduction to the OpenCL programming model," Person Education, 2012.

[10] Marwan Abdellah, Ayman Eldeib, and Amr Sharawi, "High performance GPU-based Fourier volume rendering," International Journal of Biomedical Imaging, vol. 2015, Article ID 590727, 13 pages, 2015.

[11] Marwan Abdellah, Ayman Eldeib, and Mohamed I. Owis, "GPU acceleration for digitally reconstructed radiographs using bindless texture objects and CUDA/OpenGL interoperability," in Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE. IEEE, 2015, pp. 4242–4245.

[12] Marwan Abdellah, Yassin Amer, and Ayman Eldeib, "Optimized GPU-accelerated framework for x-ray rendering using k-space volume reconstruction," in XIV Mediterranean Conference on Medical and Biological Engineering and Computing 2016, pp. 372–377. Springer International Publishing, 2016.

[13] Marwan Abdellah, Ayman Eldeib, and Amr Shaarawi, "Constructing a functional Fourier volume rendering pipeline on heterogeneous platforms," in Biomedical Engineering Conference (CIBEC), 2012 Cairo International. IEEE, 2012, pp. 77–80.

[14] Moritz Ehlke, Heiko Ramm, Hans Lamecker, Hans-Christian Hege, and Stefan Zachow, "Fast generation of virtual x-ray images for reconstruction of 3D anatomy," IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2673–2682, 2013.

[15] Nelson Max, "Optical models for direct volume rendering," IEEE Transactions on Visualization and Computer Graphics, vol. 1, no. 2, pp. 99–108, 1995.

[16] Amir Averbuch and Yoel Shkolnisky, "3D discrete x-ray transform," Applied and Computational Harmonic Analysis, vol. 17, no. 3, pp. 259–276, 2004.

[17] Carlos S. De La Lama, Pablo Toharia, Jose Luis Bosque, and Oscar D. Robles, "Static multi-device load balancing for OpenCL," in Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on. IEEE, 2012, pp. 675–682.

[18] Kwan-Liu Ma, James S. Painter, Charles D. Hansen, and Michael F. Krogh, "A data distributed, parallel algorithm for ray-traced volume rendering," in Proceedings of the 1993 Symposium on Parallel Rendering. ACM, 1993, pp. 15–22.

[19] Thomas W. Crockett, "An introduction to parallel rendering," Parallel Computing, vol. 23, no. 7, pp. 819–843, 1997.

[20] "Visible human project CT datasets," https://mri.radiology.uiowa.edu/visible_human_datasets.html.
