implementation of cell-projection parallel volume rendering with dynamic load balancing

Implementation of Cell-Projection Parallel Volume Renderingwith Dynamic Load Balancing

M. Takayama 1, Y. Shinomoto 1, M. Goshima 1, S. Mori 1, Y. Nakashima 2, and S. Tomita 1

1: Graduate School of InformaticsKyoto University

Kyoto, 606-8501, JAPAN

2: Graduate School of EconomicsKyoto University

Kyoto, 606-8501, JAPAN

AbstractA parallel volume rendering system for unstructured

grid volume data is proposed in this paper. By imple-menting the mechanism for dynamic load balancing intothe system, the authors solve the issue of load imbal-ance due to the view dependency and run-time featuresof early ray termination. An experimental implemen-tation of the system achieved a 4.32-times performanceimprovement.

Keywords: Dynamic Load Balancing, Parallel Pro-cessing, Scientific Visualization, Volume Rendering,Cell-Projection

1 Introduction

PC-cluster based large-scale simulation has beenrapidly increasing in popularity. We have noticedthe strong demand for simultaneous visualizationof large-scale simulation results on a PC cluster.Indeed, we have already been doing research onhardware acceleration techniques for structured-grid volume rendering[1, 2]. Now, we are going totackle unstructured grid volume rendering for si-multaneous visulaization of large-scale simulationresults on a PC cluster. The volume rendering ofunstructured-grid volume data can be implementedby direct volume rendering or indirect volume ren-dering algorithms.

The indirect volume rendering algorithm con-verts the unstructured grid volume data into struc-tured grid volume data and then performs directvolume rendering with this structured grid volumedata. Due to the regularity of the converted vol-ume data, this direct volume rendering can benefit

from hardware acceleration. On the other hand,however, it may suffer an unacceptable increase ofdata size in general.

Direct volume rendering algorithms can be clas-sified into either ray-casting or projection meth-ods. Projection methods may be further cat-egorized as cell-projected[3], slice-projected, orvortex-projected schemes[4]. In this paper, wehave focused on the cell-projection scheme, andwe discuss its parallel implementation and dynamicload balancing.

This paper is organized as follows. The nextsection introduces cell-projection volume render-ing and its parallel implementation. Then, dynamicload balancing for cell-projection parallel volumerendering is discussed in Section 3. Section 4shows some experimental results, followed by theconclusion in Section 5.

2 Cell-Projection Volume Render-ing

In the cell-projection(CP) scheme[3], unstructuredgrid volume data is rendered by the following threesteps(Figure 1). For simplicity of discussion, weassume the cell to be a tetrahydra.

1. Projection Phase: Project a given three-dimensional data(cell) into a two-dimensionalscreen and find the projection area(�) for eachdata(cell).

2. Scan Conversion Phase: Perform scan con-version for each projected cell. More pre-cisely, for each pixel � �� in the projection

https://www.researchgate.net/publication/2510864_Parallel_Visualization_of_Large-scale_Aerodynamics_Calculations_A_Case_Study_on_the_Cray_T3E?el=1_x_8&enrichId=rgreq-360fe4ffa32b542708532eee83d01298-XXX&enrichSource=Y292ZXJQYWdlOzIyMTEzMzQzMDtBUzoxMDMzNTI3MzgyNTQ4NThAMTQwMTY1MjYyMzAxOQ==

area(�) on the screen, calculate the cell’s con-tribution (color(RGB) and opacity(�)) to pixel� �� and depth values (��) forboth the front and back intersection pointswhere the ray corresponding to pixel � �� intersects the cell. In the rest of this paper, werefer to the data structure consisting of thesefour parameters (RGB, �, ��) asray segment. As the result of the scan con-version of a cell, ray segments for each pixelin the projection area(�) are computed. Forevery cell, compute the ray segments. Afterthat, for each pixel on the screen, gather theray-segments corresponding to the pixel andmake a depth-sorted list of these ray segments(Figure 2).

3. Composition Phase: Given the ray-segmentlists for all pixels on the screen, calculate thepixel value(color) by compositing the depth-sorted ray segments in the ray-segment listfrom front to back by alpha blending usingPorter-Duff’s over operation[5]. Now, wefinally obtain the volume rendered image.

Here, we have to note that, even in the scanconversion phase, one may perform compositionof the neighboring ray segments in the ray-segmentlist for a ray if these two ray-segments are derivedfrom the neighboring cells contacting each otheralong the ray. We refer to this composition in thescan conversion phase as partial composition[6](Figure 3).

The resultant of the partial composition of tworay segments is again a ray segment. If we canapply partial composition successfully when a newray-segment is inserted into the ray-segment list,we can reduce the length of the ray-segment listand list manipulation overhead due to the length ofthe list.

Moreover, the opacity of a partially composedray segment is higher than the opacity of eachray segment used in this partial composition. Thisfeature greatly helps us to develop the optimizationscheme described below.

1. project 2. scan convert

3. composite

Figure 1: Cell Projection

NULL

Figure 2: Ray-Segment List

PartialComposition

PartialComposition

Figure 3: Partial Composition

3 Parallel Implementation

3.1 Fundamental Implementation

As we have mentioned before, the goal of our re-search is volume visualization of simulation resultson a PC cluster. So, without a lack of generality,we can assume that the unstructured volume datahas already been distributed among the nodes ofthe PC cluster. However, we have to remenber thatthe best data distribution pattern for the simulationmay not be the best pattern for the visualization.This is the starting point of our parallel volumerendering(PVR) implementation.

In the following discussion, we assume the sys-tem has working nodes(WNs) for computationand one control node(CN) for global managementof the load distribution and user interface includingfinal image output.

https://www.researchgate.net/publication/30874000_Tompositing_Digital_Images?el=1_x_8&enrichId=rgreq-360fe4ffa32b542708532eee83d01298-XXX&enrichSource=Y292ZXJQYWdlOzIyMTEzMzQzMDtBUzoxMDMzNTI3MzgyNTQ4NThAMTQwMTY1MjYyMzAxOQ==

Once the data(cell) has been generated and dis-tributed among the working nodes (WNs), thecomputation for the projection and scan conver-sion phases can easily be parallelized since thereis no essential dependency between these compu-tations for each cell. So, as the first step, each WNperforms the projection and scan conversion of thecells which are assigned to the WN and generatesits own ray-segment lists.

When all WNs complete this computation, thecomposition phase starts as the second step. Now,we can utilize the pixel-level parallelism, so weassign the computation for a certain region ofthe screen to each WN and perform the parallelcomposition as follows. For each pixel in theassigned region, each node 1)gathers from othernodes the ray-segment lists corresponding to thepixel, 2)merges them into a single depth-sortedray-segment list, and then 3)calculates the pixelvalue(color) by compositing the depth-sorted raysegments in the ray-segment list from front toback by alpha blending using Porter-Duff’s overoperation[5]. Now, each node finally obtains thevolume rendered image for its own region. Then,each WN sends its image to the control node tooutput the overall volume rendered image onto thedisplay. Though we cannot explain the detailed im-plementation of the global composition phase dueto the page limit of this paper, we have adopted theBinary-Swap Image Composition(BSC) scheme[7]to reduce both the communication overhead andthe load imbalance.

We also refer to this composition phase as the fi-nal composition phase or global composition phaseif we need to distinguish it from the partial(local)composition in the scan conversion phase.

Our fundamental parallel implementation itselfis quite simple. The most important feature of ourfundamental implementation is that it aggressivelyapplies the partial composition in the scan conver-sion phase so that it can reduce the data size ofeach ray-segment list required for exchange in thefinal composition phase.

The previous work by Ma, et al.[4] proposedthe parallel implementation which executes boththe scan conversion and (global) composition pro-cesses concurrently. Matsui, et al. [8] have alsoadopted a similar technique for their structured-

(Depth Sort)ProjectionScan Conversion

(Partial Composition)(Early Ray Termination)

(Dynamic Load Balancing)

Cell-Based DataParallel

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � SynchronizationGlobal Composition

(Binary Swap Image Composition)Pixel-Based DataParallel

( ) indicates optimization technique.

Figure 4: Parallel Implementation

grid parallel volume rendering system. We alsoconsidered adopting this idea. But our conclusionhas been negative so far, because 1)it may incur theproblem of frequent asynchronous interprocessorcommunication, 2)our partial composition schememay reduce the cost of global composition and alsoit can be thought of as a concurrent execution ofthe projection phase and the composition phase,and furthermore, 3)we are currently developinga technique like pipelining which may hide thecost(latency) of the global composition in someother work.

3.2 Optimizations

3.2.1 Approximate Depth Sorting

By adopting partial composition, if the programcould process the scan conversion of the cells indepth-sorted order, it could reduce the length of theray-segment list, and thus it could reduce the listmanipulation cost at the same time. However, thedepth order of cells may change pixel by pixel, ingeneral. So, it is diffic oult, or rather impossible, todetermine a perfect depth order of cells for volumerendering.

Our solution to this issue is approximate depthsorting which sorts the cells in the depth order oftheir center of gravity. Instead of determining thenear-perfect depth order for paying the computa-tional cost for both sorting and list manipulation,as in an octree search, we chose a rather simple andless costly sorting scheme, since the order of cellsshould be recalculated every time the viewpointchanges.



3.2.2 Early Ray Termination

Erly ray termination(ERT)[9] is an optimizationtechnique proposed for front-to-back ray-castingvolume rendering. The concept of ERT is that theobjects which are located behind less transparentobjects (voxel, cell, and so on) may have veryfew or no contributions to the final pixel coloreven in volume rendering, thus, it is possible toterminate the composition computation along theray before the ray passes through the volume dataspace. Therefore, the ERT contribute too much tothe reduction of the volume rendering cost, thoughits effect depends on the opacity of each object[6].

Since ERT is a pixel-based optimization tech-nique, it is easily implemented in the composi-tion phase of our algorithm. However, the mosttime-consuming process in cell-projection parallelvolume rendering(CP-PVR) is the scan conversiontime. Thus, if we could introduce the conceptof ERT in the scan-conversion phase, we couldreduce the computation time much more.

The ERT-table scheme[10] has been proposedfor this purpose. In the ERT-table scheme, thedecision of termination is made per region-baseinstead of per pixel base. For this purpose, wehave introduced a small look up table, which wecall the ERT-table, which represents the status ofeach region and indicates whether the opacities ofall pixels inside the region have already becomesufficiently high. In the ERT-table scheme, af-ter computing the projection area(�) of a cell, itchecks the regional information of the ERT-tablecorresponding to � to decide if scan conversion ofthe cell is required. If the decision is ”no need,”then the remaining process concerning the cell isterminated.

An important feature of the ERT-table schemeis that it checks only a sufficient, though notnecessary, condition for termination. This approx-imation may possibly reduce the effect of ERT,but it significantly reduces the cost to maintainthe information in the ERT-table and thus makesit possible to apply the ERT technique in the scanconversion phase.

Again, the partial composition used in our im-plementation contributes to the increase of theefficiency of ERT because it increases the opacity

of each ray segment.

3.2.3 Weak Sharing of ERT Informationamong PCs

The previous section introduced the ERT techniqueinto cell-projection(CP) volume rendering by con-figuring the ERT-table with each WN. Now, weare going to extend this technique to our parallelprogram(CP PVR). The fundamental idea is thatthe accuracy, or completeness, of the informationin the ERT-table may increase if WNs exchangetheir own information. The typical situation is asfollows. WNb holds cells which are located behindthe cells in WNa at a time when WNa terminatesits scan conversion due to ERT. In this case, WNbhas no need to continue its scan-conversion. IfWNa and WNb exchange the ERT information intheir ERT-tables, WNb can also benefit from ERT.This is what we call ERT sharing. The frequentexchange of ERT information may increase the ac-curacy of ERT information; however, it may incurundesired inter-processor communication. Thus,we have introduced the weak sharing of ERT infor-mation into our CP PVR program. The meaningof weak sharing is that the frequency of ERT in-formation exchange is far beneath the frequency ofERT information updates at each WN.

4 Dynamic Load Balancing

4.1 Load Imbalance in CP PVR

In this section, we first discuss the three majorsources of load imbalance in our CP PVR program.

1. Load imbalance due to initial data distri-bution: As we have mentioned before, thegoal of our research is volume visualizationof simulation results on a PC cluster. In thiscase, the best data distribution pattern for thesimulation may not be the best pattern for thevisualization.

2. Load imbalance due to view dependency ofscan conversion time: The scan conversiontime of each cell is proportional to the size ofits projected area; thus, it may vary according

https://www.researchgate.net/publication/220183505_Efficient_Ray_Tracing_of_Volume_Data?el=1_x_8&enrichId=rgreq-360fe4ffa32b542708532eee83d01298-XXX&enrichSource=Y292ZXJQYWdlOzIyMTEzMzQzMDtBUzoxMDMzNTI3MzgyNTQ4NThAMTQwMTY1MjYyMzAxOQ==

to the viewpoint. This is what we call theview dependency of scan conversion time.

3. Load imbalance due to run-time feature ofERT : The effect of ERT strongly depends onthe opacity of each cell and the viewing direc-tion. Both the opacity and viewing directionare the fundamental parameters for volumevisualization, and they frequently change dur-ing the visualization. Thus, it is impossibleto estimate the effect on the computation ofERT in advance.

Because of the irregularity of the unstructuredvolume data, static load distribution such as re-gional partitioning (3D-block cyclic[8], and so on)does not help too much. The simple static load dis-tribution scheme which distributes an equal num-ber of cells to each WN may reduce the degree ofimbalance, but it cannot solve the second and thirdsources of load imbalance.

The first two sources of load imbalance canbe solved by preprocessing once a viewpoint isgiven. Previous work[4, 11]adopted such a pre-processing based load balancing scheme. In thisscheme, researchers performed data redistributionas a pre-processing phase, taking the effect of viewdependency into account. But, it cannot deal withthe third source of load imbalance. Furthermore,it is hard to keep track of the quick movement ofthe viewpoint.

In order to solve these three sources of load im-balance, we chose the distributed work-stealingscheme as the dynamic load balancing(DLB)scheme for postprocessing. In this scheme, when aworking node(WN) completes the scan conversionof its own cells, it receives the cells from anotherWN which still has cells to be processed. Due tothe dynamic behavior of this scheme, it can easilydeal with the dynamic feature of the ERT effect.

4.2 Work Stealing

As we mentioned in Section 3, our CP PVR pro-gram is composed of two different kinds of dataparallel regions(Figure 4). According to our pre-liminary study[10], we find that the first regionwhich utilizes cell-based data parallelism is quitedominant in the overall performance. We also find

that the second region (global composition phase)is rather static compared to the first region, andthe cost of data migration to implement DLB isrelatively high in the second region. Therefore, weapply the DLB scheme only to the cell-based par-allel region, while the static load balancing schemeis applied to the pixel-based parallel region.(1)Amount of Data(cells) To Be Migrated At aTime.

In order to implement the DLB scheme fora PC cluster system, it is mandatory to lessen thefrequency of data migration. Furthermore, the totalof the data migration cost and the cost required toprocess the migrated data should be comparableto, or smaller than, the cost required to process theremainding data after migration. Therefore, theamount of data to be migrated at at time shouldbe adapted according to the amount of remaindingdata. For this to be achievable, we chose a datasize based on a somewhat guided self-scheduling.At the time of migration, 1�� of the remainingcells are sent to the idle WN on request. Currently,we have chosen 3 as x, based on the preliminaryexperimental results.(2)Node Selection Policy

For the simplicity of implementation, our cur-rent DLB scheme relies on the CN to maintainthe information required to make decisions onthe source node of data migration. And inter-node communication to gather any DLB-relatedimplementation is performed with polling-basedasynchronous messages, where each node weak-periodically checks the arrival of messages.

The following three policies have already beenimplemented as a node selection policy.Random Scheme : The CN maintains the list ofbusy WNs above a certain threshold which haveremaining cells to be processed. When the CNreceives requests for data migration from an idleWN, it chooses a busy WN at random from the listof busy WNs and informs the idle WN about theselected WN.Max-Remain Scheme: The CN maintains the listof busy WNs, as in the random scheme. But theCN also corrects the information about the amountof remaining cells on each of the busy WNs. Whenthe CN receives requests for data migration froman idle WN, it chooses the WN which may have


Figure 5: Sample Data (Human Aorta)

the biggest work to do from the list of busy WNsand informs the idle WN about the selected WN.Bounding Box Scheme: This scheme utilizes ge-ometric information in order to find busy WNs. Inthis scheme, each WN first computes the boundingbox for all cells inside the node. This can be donesimultaneously at an approximate depth-sortingtime without paying any cost. Once the boundingbox on each node is computed, WNs inform theCN of their bounding box information. When theCN receives requests for data migration from anidle WN, it chooses the WN whose bounding boxoverlaps the idle WN’s bounding box and informsthe idle WN about the selected WN. This scheme isproposed in antcipation of the increase of the pos-sibility of partial composition. We may possiblyimplement this scheme without any interventionof the CN if all WNs share their bounding boxinformation.

5 Evaluation

In this section, we evaluate the effect of the dy-namic load balancing schemes.

For a relatively large sample of volume data,we used segmented human aorta simulationdata(Figure 5). This dataset consists of 307,565 un-structured cells of tetrahydra, with 62,475 verticesin total. Figure 6 shows bounding box represen-tations of the initial cell assignment viewed fromtwo orthogonal view directions, (A) and (B). Therectangular boxes in these figures represent thebounding-box of each WN. A almost equal numberof cells are assigned to each WN initially. An 8-nodes PC cluster(Pentium4 3GHz, 1GbE) for WNsand a 1-node PC(Pentium4 2GHz, 1GbE) for theCN are used in our experiment.

Figure 7 shows how our DLB schemes im-

(a)Viewpoint(A) (b)Viewpoint(B)

Figure 6: Bounding Box Representations of InitialCell Assignment Viewed from Two OrthogonalView Directions.

Table 1: Overall EffectsScan Conversion Time Tsc[sec]

Optimization Level (A) (B)Base 45.37 25.48ERT 20.47 10.93

ERT + DLB 18.40 8.87ERT + DLB + ERTsharing 15.82 5.89ERT:Early Ray TerminationDLB:Dynamic Load Balancing

prove the load imbalance in the scan conversionphase with ERT. The y-axis of this figure indi-cates the scan conversion time(Tsc) at each WNand the horizontal line labeled oracle indicates thearithmetic mean of the scan conversion time ofeach WN without applying DLB. From this figure,we can confirm the effect of DLB. We can alsoconfirm that the max-remain scheme achieves thebest performance for both two viewpoints in thisenvironment. The bounding-box scheme was ex-pected to perform the best for the viewpoint (B),however, Figure 7(b) doesn’t show such a result.It is because our current implementation of thebounding-box scheme doesn’t consider the loadimbalance among WNs whose bounding boxes donot overlap each other, and thus it somewhat re-stricts the possibility of the load balancing. Wethink this problem can be solved by combining thebounding-box scheme and max-remain scheme.

Table 1 summarizes the overall effects of theoptmization techniques used in our CP-PVR pro-gram. ”Base” stands for the parallel processingwithout ERT and DLB. From this table, we canconfirm the 2.86- and 4.32-times speedup com-pared to the ”Base” implementation for viewpoint(A) and (B), respectively.

0

5

10

15

20

25

1 2 3 4 5 6 7 8

random max-remain bounding-box

oracle : 15.85

max-remain : 18.40

nothing : 20.38

Tsc (sec) nothing

WN

(a) Viewpoint (A)

0

2

4

6

8

10

12

1 2 3 4 5 6 7 8

Tsc (sec)

WN

nothing random max-remain bounding-box

oracle : 7.70

max-remain : 8.87

nothing : 10.93

(b) Viewpoint (B)

Figure 7: Comparison of DLB Schemes

6 Conclusion

We have proposed a cell-projection parallel volumerendering system for simultaneous visualization ofsimulation results on a PC cluster. By adoptingthe early ray termination and dynamic load bal-ancing optimization techniques, it could achievea 4.32-times performance improvement in our ex-perimental environment. We would like to furtherinvestigate the various features of our program inthe near future.

7 Acknowledgment

We would like to thank Prof. T. Matsuzawaand Dr. M. Watanabe of JAIST for providingthe aorta dataset. Part of this research was sup-ported by the Grant-in-Aid for Scientific Research(S)#16100001 and (B)#13480083 from JSPS.

References

[1] Mori, S., et al.: ReVolver/C40: A Scalable Par-allel Computer for Volume Rendering –Designand Implementation–, IEICE Trans. Inf. & Syst.,Vol.E86-D, No.10, pp.2006-2015, 2003.

[2] Maruyama, Y., et al.: Parallel Volume Render-ing with Commodity Graphics Hardware, IPSJSIG Meeting(2003-ARC-154), Vol.2003, No.84,pp.61–66, Aug. (2003).

[3] Max, N., et al.: Area and Volume Coherencefor Efficient Visualization of 3D Scalar Func-tions, Computer Graphics (San Diego Workshopon Volume Visualization), Vol.24, No.5, pp.27–33(1990).

[4] Ma, K.-L. and Crockett, T. W.: Parallel visual-ization of large-scale aerodynamics calculations:A case study on the Cray T3E, IEEE ParallelRendering Symposium, pp. 95–104 (1997).

[5] Porter, T. and Duff., T.: Compositing DigitalImages, ACM Computer Graphics (SIGGRAPH’84), Vol.18, No. 3, pp. 253-259 (1984).

[6] Lichtenbelt, B., et al.: Introduction to VolumeRendering, Hewlett-Packard Professional Books,Prentice Hall PTR, (1998).

[7] Ma, K.-L., et al.: Parallel volume rendering us-ing binary-swap compositing, IEEE ComputerGraphics and Applications, Vol.14, No.4, pp.59–68 (1994).

[8] Matsui, M., et al.: Reducing the Complexityof Parallel Volume Rendering by PropagatingAccumulated Opacity, Technical Report of IE-ICE,Vol.103, No.249, pp.13–18(2003).

[9] Levoy, M.: Efficient ray tracing of volume data,ACM Transactions on Graphics, Vol.9, No.3,pp.245–261(1990).

[10] Takayama, M.: Dynamic Load Balancing on Par-allel Visualization of Large Scale UnstructuredGrid, Master’s Thesis, Graduate School of Infor-matics, Kyoto University, Feb. (2004).

[11] Chen, L., et al.: Parallel performance optimiza-tion of large-scale unstructured data visualizationfor the earth simulator, Proc. of the Fourth Eu-rographics Workshop on Parallel Graphics andVisualization, Eurographics Association, pp. 133–140 (2002).











implementation of cell-projection parallel volume rendering with dynamic load balancing

Documents