LUNAR TEMPERATURE CALCULATIONS ON A GPU

Kyle M. Berney
Department of Information & Computer Sciences
Department of Mathematics
University of Hawai‘i at Mānoa
Honolulu, HI 96822

ABSTRACT

Lunar surface temperature is a crucial parameter for the retention of volatiles, such as ice. Near the lunar poles, temperature is determined by reflected sunlight rather than direct sunlight. Temperature modeling of craters with high-resolution topography is a formidable computational challenge, because the computational cost increases rapidly with spatial resolution. Graphics Processing Units (GPUs), a novel type of computer hardware, are now programmable and inexpensive, and they provide low-latency memory units, hardware-implemented trigonometric functions, and multi-threaded functionality. Using the CUDA C programming language, we implement a simplified model for the surface energy balance that retains the computational complexity of the full problem. We experiment with different approaches and compare the runtime of our CUDA C programs to the runtime of the same surface energy balance model written in the C programming language. Our objective is to develop a computationally efficient algorithm for surface energy balance calculations on a GPU, which may be used for future research and applications. We achieved a speedup by a factor of over 100 compared to a CPU.

INTRODUCTION

Recent and ongoing lunar missions have provided high-resolution topography, from the Kaguya spacecraft (Araki et al., 2009) and from the Lunar Orbiter Laser Altimeter (LOLA) (Smith et al., 2010a, 2010b), and surface temperature from the DIVINER instrument onboard the Lunar Reconnaissance Orbiter (Paige et al., 2010a, 2010b). Figure 1 below shows a high-resolution temperature map of the south polar region of the Moon. The brightly colored areas (red, yellow, and green) receive ample direct sunlight, while the darker colored areas (blue and purple) receive little to no direct sunlight. The temperature of these darker areas is determined by the reflected light that each surface element receives from the other surface elements in its field of view. Given the high spatial resolution of the available data, modeling temperature is computationally challenging. In particular, the surface energy balance is the most computationally expensive component, since an algorithm has to visit many surface elements at each time step.


Figure 1: High-resolution temperature map of the south polar region of the Moon. The inner white circle represents 85°S latitude. (Adopted from Paige et al. 2010a).

Modern GPUs are programmable and can be used effectively as numerical coprocessors (Kirk and Hwu, 2010). GPU-based clusters are becoming increasingly common everywhere from university research labs to the world’s fastest supercomputers. Thus, a single GPU can now replace a sizeable conventional computer cluster. GPUs are relatively inexpensive compared to an equal number of CPU (Central Processing Unit) cores; they provide extensive low-latency memory units and hardware-implemented trigonometric functions, and they can execute thousands of threads in parallel. For this research project, an NVidia Tesla C1060 GPU was used, shown in Figure 2 below. The NVidia Tesla C1060 has 30 multi-processors with 8 cores each, for a total of 240 cores. Each core runs at a 1.3 GHz clock rate, and a block can contain a maximum of 512 threads. For comparison, a commonplace Intel Xeon CPU has 4 cores with 2 threads per core and a clock rate of 2.3 GHz.

Figure 2: An NVidia Tesla C1060 GPU, which can be mounted in a regular PCIe x16 card slot. The Tesla series of GPUs from NVidia is designed solely for general-purpose GPU computing; hence it serves as a numerical coprocessor for massively parallel computations.


In order to utilize our GPU for numerical computations, we learned CUDA C, a programming language developed by NVidia. A CUDA C program consists of host code, which runs on the CPU, and device code, which runs on the GPU. CUDA C is essentially the C programming language with extra keywords and functions which provide GPU-related functionality. For example, the keyword ‘__global__’ declares that a function will run on the GPU rather than the CPU. These functions are called kernels and are executed in parallel on the GPU. Other examples are the functions ‘cudaMalloc()’, ‘cudaFree()’, and ‘cudaMemcpy()’, which are analogous to the C functions ‘malloc()’, ‘free()’, and ‘memcpy()’, respectively. The function ‘cudaMalloc()’ allocates DRAM (Dynamic Random Access Memory) space on the GPU, ‘cudaFree()’ de-allocates space which was allocated via ‘cudaMalloc()’, and ‘cudaMemcpy()’ copies data from GPU memory to CPU memory or vice versa.

One of the main features of CUDA C is that it provides explicit use of multiple types of GPU memory. There are five different GPU memory types: global memory, local memory, shared memory, constant memory, and texture memory. Local memory is stored off chip in DRAM, is specific to one thread, and can be read and written to; it is cached on the GPU only on devices of compute capability 2.x, so not on the Tesla C1060 (compute capability 1.3). Global memory is also stored in DRAM and can be read and written to; it likewise has no cache on our GPU, and it can be accessed by all threads. Shared memory is the only one of these memory types stored on the GPU chip itself. It can be both read and written to, but it can only be accessed by the threads of one group, called a block. Constant and texture memory are similar: both are stored in DRAM, are read only, and have caches on the GPU. They differ in their use: constant memory is intended for threads that access the same memory location at the same time, while texture memory is intended for threads that access nearby memory locations. Figure 3 below provides a graphical overview of the different types of CUDA memory.

Figure 3: Graphical overview of the memory types on a CUDA device. (Adopted from NVidia 2010)


METHODS

In order to limit the set-up time for our model, we used a simplified “toy” model which retains the computational complexity of the full problem. Our model consists of a collection of elements, which are the analog of lunar surface elements. Each of these elements reflects light and illuminates other elements in its field of view. To mimic solar illumination, an artificial incoming energy flux is computed, which changes with time. Time is determined by sun position, which is measured by the hour angle (radians east of noon). Hence, the incoming energy for element $i$ consists of direct sunlight and reflected sunlight from other surface elements. This brings us to our mathematical formulation:

$$Q_i^{(n)} = S_i^{(n)} + A \sum_{j \neq i} \alpha_{i,j}\, Q_j^{(n-1)},$$

where $Q_i^{(n)}$ is the incoming energy for surface element $i$ at time step $n$, $A$ is albedo, $S_i^{(n)}$ is the direct incoming solar radiation (insolation) for surface element $i$ at time step $n$, $\alpha_{i,j}$ is the angle subtended by the surface element $j$ as seen by surface element $i$, and $Q_j^{(n-1)}$ is the incoming energy for surface element $j$ at time step $n-1$. Every surface element experiences sunrise and sunset, which depend on its azimuthal orientation and the latitude. We compute the time average of $Q_i$ over one day.

For each of the two semesters spent on this research project, a separate lunar surface energy model was developed and implemented. In the first semester, we developed a model which consists of a simple ring of surface elements, shown below in Figure 4. For this model, every surface element is in the field of view of all other surface elements. In addition, $\alpha_{i,j}$ is independent of $i$ and $j$. It can be calculated by using the formula $\alpha_{i,j} = \Delta\phi / 2$, where $\Delta\phi = 2\pi / N$ and $N$ is the number of surface elements. Similarly, the azimuth value for each surface element $i$ can be calculated using $\phi_i = i\,\Delta\phi$. Due to this, we do not need to store the $\alpha_{i,j}$ values nor the azimuth values in memory; we instead calculate them on the fly when needed.

In the second semester of our research, we wanted to develop a surface energy balance model in which some surface elements are not in the field of view of all other surface elements, and which stores the $\alpha_{i,j}$ values and the azimuth value of each surface element $i$ in memory. Hence, we created a model which consists of two conjoined regular octagons, shown below in Figure 4. For this model, there are no shortcut formulas to calculate the azimuth values and the $\alpha_{i,j}$ values; thus, we compute the values once and store them in memory.
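As a rough illustration of the ring model, the following C sketch performs one time step of the energy balance on the CPU: each element receives its own insolation plus the albedo times the angle-weighted incoming energy of the other elements from the previous step. The function names (`ring_delta`, `ring_azimuth`, `ring_alpha`, `ring_step`), the use of `double`, and the choice to skip the element itself in the sum are assumptions made for this example, not the authors' code. The insolation values are taken as an input array because the text does not give their formula.

```c
#include <assert.h>
#include <stddef.h>

static const double PI = 3.14159265358979323846;

/* Central angle spanned by one element of an n-element ring: 2*pi/n. */
double ring_delta(size_t n)
{
    assert(n > 0);
    return 2.0 * PI / (double)n;
}

/* Azimuth of element i, computed on the fly instead of being stored. */
double ring_azimuth(size_t i, size_t n)
{
    return (double)i * ring_delta(n);
}

/* Angle subtended by any element as seen from any other element on the
 * ring: half the central angle, the same for every pair. */
double ring_alpha(size_t n)
{
    return 0.5 * ring_delta(n);
}

/* One time step: q_new[i] = s[i] + albedo * sum over j != i of
 * alpha * q_old[j]. Skipping j == i is an assumption of this sketch.
 * The double loop is what makes the cost grow with n squared. */
void ring_step(const double *s, const double *q_old, double *q_new,
               size_t n, double albedo)
{
    const double alpha = ring_alpha(n);
    for (size_t i = 0; i < n; ++i) {
        double reflected = 0.0;
        for (size_t j = 0; j < n; ++j) {
            if (j != i)
                reflected += alpha * q_old[j];
        }
        q_new[i] = s[i] + albedo * reflected;
    }
}
```

A driver would call `ring_step` once per time step, swapping the roles of `q_old` and `q_new` between steps and accumulating `q_new` to form the daily average. On a GPU, the outer loop over `i` is the natural candidate to be distributed across threads.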


Figure 4: Lunar surface energy balance models developed. The first semester’s model is depicted to the left and the second semester’s model is depicted to the right.

Multiple CUDA C programs were developed for each of the two surface energy balance models. The programs differ mainly in which type of GPU memory the data structures are stored in. Both models use two 1-dimensional arrays of size $N$ for the per-element values. The second semester’s model additionally uses a 1-dimensional array of size $N$ for the azimuth values and a flattened square $N$ by $N$ matrix, stored as a 1-dimensional array of size $N^2$, for the $\alpha_{i,j}$ values.

Two CUDA C programs were developed for the first semester’s surface energy balance model. The first CUDA C program, global_ring.cu, stored both arrays in global memory, while the second CUDA C program, constant_ring.cu, stored one array in global memory and the other in constant memory. For the second semester’s model, five different CUDA C programs were developed. The first program, global.cu, stores all data structures in global memory. The second program, constant.cu, stores one of the size-$N$ arrays in constant memory and all other data structures in global memory. The third program, transpose_global.cu, is the same as global.cu except that the $\alpha_{i,j}$ matrix is transposed. Similarly, the fourth program, transpose_constant.cu, is the same as constant.cu except that the $\alpha_{i,j}$ matrix is transposed. For the last program, timeloop.cu, we move our time loop onto the GPU. In other words, instead of launching a kernel at every time step, we launch a single kernel which iterates over the time steps. The $\alpha_{i,j}$ matrix is transposed in timeloop.cu as well.
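To make the flattened, transposed matrix concrete, here is a small host-side C sketch. It assumes the square matrix is flattened row by row, so that entry (row, col) sits at `row * n + col`; the text does not state the layout, and the names `flat_index` and `transpose_flat` are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Index of entry (row, col) in an n-by-n matrix flattened row by row. */
size_t flat_index(size_t row, size_t col, size_t n)
{
    assert(row < n && col < n);
    return row * n + col;
}

/* Write the transpose of the flattened n-by-n matrix `a` into `at`,
 * so that at[j*n + i] == a[i*n + j]. The two buffers must not overlap. */
void transpose_flat(const float *a, float *at, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            at[flat_index(j, i, n)] = a[flat_index(i, j, n)];
    }
}
```

Because the matrix is computed once and then reused at every time step, a one-time transpose on the host like this costs little compared to the time-stepping itself.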


RESULTS

Figure 5: Runtime graph of the first semester’s surface energy balance model programs. The black trend line is shown for comparison.

Figure 5 above shows the runtimes of the programs developed for the first semester’s surface energy balance model. The first thing we see is that our C program does indeed show quadratic, $O(N^2)$, runtime growth. Also, both global_ring.cu and constant_ring.cu run much faster than the C program, with constant_ring.cu being the fastest. This is due to the fact that constant memory has a cache on the GPU, which makes constant memory reads significantly faster than global memory reads. However, we are limited by the size of our GPU’s constant memory space. Our NVidia Tesla C1060 GPU has a total of 64 kB of constant memory space, which translates to 16,384 4-byte floating point or integer values. Hence, constant_ring.cu can only operate on fewer than 16,384 surface elements. We found that the peak speedup occurred at 15,000 surface elements, with global_ring.cu running 115 times faster than the C program and constant_ring.cu running 207 times faster.
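The constant memory limit quoted above follows from simple arithmetic, sketched here in C. The sketch assumes that 64 kB means 64 × 1024 bytes and that each value occupies 4 bytes; the helper name `max_elements` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole elements of `elem_bytes` bytes that fit into a memory
 * space of `space_bytes` bytes. */
size_t max_elements(size_t space_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return space_bytes / elem_bytes;
}
```

Calling `max_elements(64 * 1024, 4)` gives 16384, the upper bound on the number of values that one constant memory array can hold.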

[Figure 5 plot: Runtime (log-log, base 2). Vertical axis: Time (seconds). Horizontal axis: Number of Surface Elements, N. Series: First Semester C Program, global_ring.cu, constant_ring.cu. Trend line: y = 0.0007x^2.]


Figure 6: Runtime graph of the second semester’s surface energy balance model programs. The black trend line is shown for comparison.

Figure 6 above shows the runtimes of the programs developed for the second semester’s surface energy balance model. Just as with the first semester’s C program, the second semester’s C program shows quadratic, $O(N^2)$, growth. The first two CUDA programs developed, global.cu and constant.cu, both run faster than the C program; however, they are only 3-4 times faster, a much smaller speedup than we saw in the first semester. We traced the cause of the slowdown to the memory access pattern of our $\alpha_{i,j}$ matrix. By transposing the matrix, the memory access pattern becomes coalesced. In other words, consecutive threads now access consecutive memory addresses. This small change in the organization of the matrix translates to a very significant speedup. We find that the peak speedup occurs at 15,260 elements, with transpose_global.cu running 72 times faster than the C program, transpose_constant.cu running 106 times faster, and timeloop.cu running 74 times faster. Since transpose_constant.cu utilizes constant memory, it is limited to fewer than 16,384 surface elements. However, both transpose_global.cu and timeloop.cu are able to operate on a larger number of surface elements. We also see that the runtimes of transpose_global.cu and timeloop.cu are similar to each other; thus, whether the time loop is on the CPU or the GPU is not significant.
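The effect of the transpose on the access pattern can be checked with index arithmetic alone. The C sketch below assumes that thread number `tid` handles surface element `tid` and, at inner-loop step `j`, reads the matrix entry for the pair (`tid`, `j`). This thread-to-element mapping and all function names are assumptions made for this example; the text does not spell them out.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index read by thread `tid` at inner-loop step `j` when entry
 * (tid, j) of the n-by-n matrix is stored at tid*n + j. */
size_t read_index_original(size_t tid, size_t j, size_t n)
{
    assert(tid < n && j < n);
    return tid * n + j;
}

/* The same read when the matrix is stored transposed, at j*n + tid. */
size_t read_index_transposed(size_t tid, size_t j, size_t n)
{
    assert(tid < n && j < n);
    return j * n + tid;
}

/* Distance, in array elements, between the reads issued by threads
 * tid and tid+1 at the same step j. A distance of 1 means that
 * consecutive threads touch consecutive addresses. */
size_t neighbor_stride(size_t (*read_index)(size_t, size_t, size_t),
                       size_t tid, size_t j, size_t n)
{
    assert(tid + 1 < n);
    return read_index(tid + 1, j, n) - read_index(tid, j, n);
}
```

Under these assumptions, neighboring threads are `n` elements apart in the original layout and 1 element apart in the transposed layout, which is the consecutive-address pattern that the hardware can coalesce.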

[Figure 6 plot: Runtime (log-log, base 2). Vertical axis: Time (seconds). Horizontal axis: Number of Surface Elements, N. Series: Second Semester C Program, global.cu, constant.cu, transpose_global.cu, transpose_constant.cu, timeloop.cu. Trend line: y = 0.0007x^2.]


CONCLUSION

In conclusion, we were successful in achieving a significant speedup for crater reflection calculations. For both of the surface energy balance models developed, the CUDA C program which utilized constant memory for one of its arrays ran the fastest, with constant_ring.cu having a peak speedup factor of 206 and transpose_constant.cu having a peak speedup factor of 106. However, those programs are limited to fewer than 16,384 surface elements. The key realization that we take from this research project is that where data are stored and how threads access them are both very important. In the first semester, we saw that the choice of memory, global versus constant, has a big impact; in the second semester, we saw that the access pattern, non-coalesced versus coalesced, is very significant. Our results also suggest that GPUs could be used successfully in 3-dimensional models of the surface temperature in lunar craters.

REFERENCES

Araki, H. et al. (2009) Lunar global shape and polar topography derived from Kaguya-LALT laser altimetry. Science 323, 897-900.

Kirk, D. B. and Hwu, W. W. (2010) Programming Massively Parallel Processors. Morgan Kaufmann.

NVidia (2010) CUDA C Best Practices Guide.

Paige, D. A. et al. (2010a) Diviner Lunar Radiometer observations of cold traps in the Moon’s south polar region. Science 330, 479-482.

Paige, D. A. et al. (2010b) The Lunar Reconnaissance Orbiter Diviner Lunar Radiometer Experiment. Space Sci. Rev. 150, 125-160.

Smith, D. E. et al. (2010a) The Lunar Orbiter Laser Altimeter Investigation on the Lunar Reconnaissance Orbiter Mission. Space Sci. Rev. 150, 209-241.

Smith, D. E. et al. (2010b) Initial Observations from the Lunar Orbiter Laser Altimeter (LOLA). Geophys. Res. Lett. 37, L18204.
