


Jitu Das, Jonathan Gluck, David Poliakoff, Julian Musshoff, Erik Schumacher

Carnegie Mellon University, Swarthmore College, Millsaps College, TU Dortmund
Center for Computational Technology, Louisiana State University, Baton Rouge, Louisiana 70803

Acknowledgments
We thank Dr. Mark Jarrell, Dr. Randy Hall, Dr. Juana Moreno, Professor J. (Ram) Ramanujam, Kathryn Traxler, Dr. Gabrielle Allen, Dr. Bety Rodriguez-Mila, Brittany Shannon, Kuang Shing Chen, the team of room 280, Dr. Ed Seidel, and the NSF REU initiative. We also thank Carnegie Mellon University, Swarthmore College, and Millsaps College.

Conclusions
From our results we conclude that heterogeneous processing has a tremendous future in scientific computation. Our experiments in general-purpose graphics processing yielded strong results in favor of further inquiry.

This field is at the bleeding edge and, as such, its results should be approached with some caution. We used CUDA because of its efficiency and established tooling; future work with the more portable OpenCL might be beneficial.

In heterogeneous computing, close attention must be paid to the architecture of the processor on which the work runs. This is an art more familiar to earlier generations of programmers, whose code was necessarily written with the underlying architecture in mind.

Work such as ours is a precursor to the future of heterogeneous computing. In the near future, dedicated hardware may well become a larger presence in the supercomputing scene.

Literature cited
"CUDA C Programming Guide 3.1." Nvidia Toolkit Website. Nvidia Corp., 28 May 2010. Web. 20 Jul 2010. <http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf>.

"CUBLAS Library." Nvidia Developer. Nvidia Corp., May 2010. Web. 20 Jul 2010. <http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/CUBLAS_Library_3.1.pdf>.

Mikelsons, K., Macridin, A., D'Azevedo, E., Tomko, K., and Jarrell, M. "Optimization of Quantum Monte Carlo Methods for Fermionic Hubbard model." Draft of article, Louisiana State University, 2010.

For further information
Please contact any of the authors: David Poliakoff ([email protected]), Jitu Das ([email protected]), Jonathan Gluck ([email protected]).

GPGPU Implementation of Quantum Physical Simulation of Electron Excitation: Examining Markov Chain Monte Carlo Techniques

One result that must not be ignored is cost. We compared the price of the physical hardware on which we ran our GPGPU implementation with that of the hardware behind the existing MPI implementation and found another advantage. This fiscal advantage of the graphics processor becomes even clearer when viewed in terms of performance per dollar: the following visualization shows a striking increase in value for the graphics card.

Figure 6: The number of times the program can be run per dollar per second. The lighter color is based on CPU time; the darker color on wall time.
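One plausible reading of this metric, sketched here with placeholder numbers rather than our measured runtimes or actual hardware prices, is runs per second divided by hardware price:

#include <cstdio>

int main() {
    // Placeholder values for illustration only.
    double runtimeSeconds = 100.0;   // time for one run of the program
    double priceDollars   = 1200.0;  // price of the hardware used
    // Runs per second is 1 / runtime; dividing by price gives runs per dollar per second.
    double runsPerDollarPerSecond = 1.0 / (runtimeSeconds * priceDollars);
    std::printf("runs per dollar per second: %g\n", runsPerDollarPerSecond);
    return 0;
}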

Figure 5: Graphical representation of the MSRP of four Xeon processors versus one Nvidia Tesla C1060.

Introduction
There has been a push toward parallelization in recent scientific computation. This has been the logical step forward, as it becomes more difficult to coax additional computation out of each dollar as Moore's law slows. Much of this parallelization, however, has been homogeneous, running on machines built from many copies of the same processor. This is not necessarily the most efficient approach: not all processors share the same strengths, some being better suited to certain tasks than others. It has been argued that the future of processing is heterogeneous. The hypothesis we examine is whether the run time of quantum physics simulations can be accelerated using General Purpose Graphics Processing Units (GPGPUs), which could provide greater computational power at a fraction of the cost.

Materials and methods

We implemented our GPGPU program using the CUDA architecture and programming model. Our GPGPU hardware was provided by the Spider cluster on LONI (the Louisiana Optical Network Initiative), which uses Nvidia Tesla T10 GPUs. The Queenbee cluster, on which we ran the existing MPI code, is composed of Intel Xeon 5430 processors.
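To make the programming model concrete, here is a minimal CUDA sketch under our own assumptions (a toy kernel named scale operating on placeholder data; it is not the simulation code itself): each thread derives a global index from its block and thread IDs, and the launch configuration spreads the work across millions of threads, which is the pattern Figure 2 alludes to.

#include <cuda_runtime.h>

// Toy kernel: each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 21;                   // ~2 million elements, one thread each
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // millions of threads in flight
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}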

The work to be accelerated simulates electron excitation in superconductors. This required knowledge of the Metropolis-Hastings algorithm and the Hirsch-Fye quantum Monte Carlo method.
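For readers unfamiliar with it, the textbook Metropolis-Hastings acceptance rule can be sketched as follows (a generic illustration with a function name of our own; it is not the Hirsch-Fye update used in the actual simulation): a proposed configuration is accepted with probability min(1, ratio), where ratio is the weight of the proposed configuration relative to the current one.

// Accept a proposed move with probability min(1, ratio), where `ratio` is the
// weight of the proposed configuration divided by the weight of the current
// one, and `uniform01` is a random number drawn uniformly from [0, 1).
__host__ __device__ bool metropolisAccept(double ratio, double uniform01) {
    if (ratio >= 1.0) return true;   // moves that raise the weight are always accepted
    return uniform01 < ratio;        // otherwise accept with probability `ratio`
}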

Figure 2. Layout of blocks and threads in CUDA; note that while many parallel architectures operate on thousands of threads, CUDA kernels frequently operate with millions.

Figure 1. Specifications of the Nvidia Tesla architecture used in the T10 GPUs.

Results
After studying the CUDA documentation and the existing MPI code, we wrote a series of evolving prototypes to run on the Spider cluster of GPGPUs. After extensive benchmarking, we obtained results for a series of data points.

These results were not especially promising on their own, but viewed from a CPU-time perspective we began to see a definite advantage.

It may be surmised that the wall-time results in Figure 3 arise because a single graphics card is doing four times the work accomplished by any one of the CPUs.
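A hypothetical worked example of the difference between the two measures (the numbers are placeholders, not our results): CPU time accumulates on every core an MPI run occupies, while the GPU run occupies only its single host core.

#include <cstdio>

int main() {
    // Placeholder numbers for illustration only.
    double mpiWallSeconds = 100.0;  // wall time of a four-core MPI run
    int    mpiCores       = 4;
    double gpuWallSeconds = 90.0;   // wall time of the single-GPU run

    double mpiCpuTime = mpiWallSeconds * mpiCores;  // 400 s of CPU time consumed
    double gpuCpuTime = gpuWallSeconds * 1.0;       // ~90 s on the one host core

    std::printf("MPI CPU time: %.0f s, GPU-run CPU time: %.0f s\n",
                mpiCpuTime, gpuCpuTime);
    return 0;
}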

Figure 4: CPU time comparison of CPU and GPU runtimes.

Figure 3: Wall time comparison of CPU and GPU runtimes.