Post on 28-May-2020
Quantitative Performance Assessment
of Proxy Apps and Parents Report for ECP Proxy App Project Milestone AD-CD-PA-504-5
David Richards1, Omar Aaziz2, Jeanine Cook2, Hal Finkel3, Brian Homerding3, Peter McCorquodale4, Tiffany Mintz5, Shirley Moore5, Vinay Ramakrishnaiah6,
Courtenay Vaughan2, and Greg Watson5
1Lawrence Livermore National Laboratory, Livermore, CA 2Sandia National Laboratories, Albuquerque, NM
3Argonne National Laboratory, Chicago, IL 4Lawrence Berkeley National Laboratory, Berkeley, CA
5Oak Ridge National Laboratory, Oak Ridge, TN 6Los Alamos National Laboratory, Los Alamos, NM
Contents

1 Executive Summary
2 Methodology
3 miniQMC and QMCPACK Performance Assessment
4 miniVite and Vite Performance Assessment
5 PICSARlite and WarpX Performance Assessment
6 Profugus, XSBench, and Shift Performance Assessment
7 Thornado-mini Performance Assessment
8 MACSio Performance Assessment
9 GPU Performance Assessment of SW4 and SW4lite
10 GPU Performance Assessment of Nek5000 and Nekbone
11 Summary and Conclusion
1 Executive Summary
Proxy applications are small, simplified codes that allow application developers to share important features of large applications without forcing collaborators to assimilate large and complex code bases. Proxies can also be thought of as models for larger codes. Because proxies omit features and capabilities of their parent applications (otherwise they wouldn’t be small), the question naturally arises whether any given proxy is actually a faithful model for its parent application for some particular modeling purpose. This report sets out to answer that question for some of the proxies in the ECP proxy app collection1, with particular attention to performance. It also completes the AD-CD-PA-504 Milestone:
We will assess the fidelity of 8 proxy applications with respect to their respective parent applications using the methodology previously developed.
Note that we use the term “parent” somewhat loosely here. In some cases there is a clear lineage relationship, where the proxy was created directly from the large application we are comparing it to. In others, the proxy may have been created independently, or even from an application other than the one we are comparing it to. Hence, in this report, “parent” merely refers to the application against which the proxy is being evaluated for model fidelity.
The complete set of proxy-parent pairs evaluated in this report is shown in Table 1. Because application performance can vary significantly depending on the problem being solved, for each proxy and parent we took special care to identify input parameters to specify problems that were as similar as possible from proxy to parent and also representative of exascale-class problems. These problem specifications are one of the key artifacts of this report. Summary descriptions of the problems are given in this report and full specifications are published on the ECP Proxy App web site.
We found that some of the proxies are very good models for the performance of their parents. In particular, miniQMC, miniVite, and XSBench have performance characteristics that are very similar to those of their corresponding parents. In contrast, PICSARlite, ProfugusMC, MACSio, SW4lite, and Nekbone showed significant performance differences from their parents. The reasons for these differences vary and are more fully explored in the respective sections of the document. Although we highlight the performance differences between proxies and parents, we do not want to imply that any of these proxies are poor or should not be used. We do, however, warn users that blindly
Proxy                  Parent
-----                  ------
miniQMC                QMCPACK
miniVite               Vite
PICSARlite             WarpX
ProfugusMC, XSBench    Shift
SW4lite                SW4
Nekbone                Nek5000

Table 1: Proxies and parents analyzed in this report.
1The current version of the ECP Proxy App Suite can be found at http://proxyapps.exascaleproject.org/ecp-proxy-apps-suite
assuming that all proxies are faithful representations of all aspects of the parent performance will lead to incorrect conclusions.
Finally, we note that the majority of the work in this report was performed on CPU platforms. This is somewhat unfortunate as GPU hardware will provide the majority of the computing power on exascale-class systems. We will continue to publish GPU-based assessments as more proxies and parents are ported to GPUs and assessment tools improve.
2 Methodology

This section describes our methodology for both CPU- and GPU-based assessments. We give specifications for the computer hardware systems used and for the tools we used for profiling and measurement. Compiler details are presented for each proxy/parent pair in its respective assessment section. Finally, during the course of this work we uncovered some significant problems with tools and systems. We have provided a list of these issues to inform the broader community.
2.1 CPU Performance Assessment
For CPU-based assessment of proxy apps and parents, the goal is to understand performance similarity between the proxy and parent at both a high and a low level. We aim to understand high-level similarity through profiling and models such as roofline. At a lower level, the desire is to understand similarities in how the applications execute on the underlying hardware by examining their performance using key metrics. A stretch goal is to understand the similarity of performance bottlenecks at the hardware level. The outcome of this assessment is a clear understanding of how well the proxy represents the parent, and of the limits of that representativeness.
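One illustrative way to make the low-level comparison concrete (a sketch, not part of the report's formal methodology) is to collect the same set of per-instruction hardware metrics for proxy and parent and reduce them to a single similarity score, for example cosine similarity. The metric names and values below are hypothetical examples.

```python
# Illustrative sketch: compare proxy and parent hardware-metric
# vectors with cosine similarity. All values are hypothetical.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length metric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical per-instruction rates: L1 misses, L2 misses, branch misses
proxy  = [0.021, 0.004, 0.008]
parent = [0.019, 0.005, 0.009]

score = cosine_similarity(proxy, parent)
print(f"similarity: {score:.3f}")  # a score near 1.0 suggests similar behavior
```

A score near 1.0 indicates that the two codes stress the hardware in similar proportions, though it says nothing about absolute performance.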
To this end, our CPU assessment for each proxy/parent pair minimally includes the following:
1. a clear qualitative comparison and description of the proxy versus the parent with respect to primary functionality,
2. dynamic profiling to understand if key kernels and functions implementing these kernels are as expected and consistent across the two applications,
3. roofline modeling to understand at a high level whether the two applications are characterized by the same bottlenecks (e.g., compute-bound versus memory-bound, cache-bound, or floating-point-operation-bound),
4. hotspot analysis that indicates memory or compute boundedness for each function and/or kernel.
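The roofline modeling in item 3 caps a kernel's attainable performance at the minimum of the machine's compute peak and the product of its memory bandwidth and the kernel's arithmetic intensity. A minimal sketch follows; the peak numbers are illustrative placeholders, not figures for any platform in this report.

```python
# Minimal roofline sketch. Peak values are illustrative placeholders.
PEAK_FLOPS = 2.0e12   # peak double-precision FLOP/s (illustrative)
PEAK_BW    = 1.0e11   # peak DRAM bandwidth in bytes/s (illustrative)

def attainable_flops(arithmetic_intensity):
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

def classify(arithmetic_intensity):
    """Label a kernel by which roof limits it."""
    ridge = PEAK_FLOPS / PEAK_BW   # intensity where the two roofs meet
    return "compute-bound" if arithmetic_intensity >= ridge else "memory-bound"

# Two hypothetical kernels: 0.5 and 50 FLOPs per byte moved from DRAM
for ai in (0.5, 50.0):
    print(ai, attainable_flops(ai), classify(ai))
```

Plotting each kernel of the proxy and the parent on the same roofline makes it easy to see whether the two codes sit under the same roof.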
For some proxy/parent pairs, further analysis was done. Note that in this assessment, we examine the performance of MACSio and how well it can generate I/O behavior that matches an ECP application of interest. MACSio is very different from the other proxies analyzed in this work. It is a parameterizable I/O proxy that generates I/O behavior based on input parameters that are extracted from the application of interest. The assessment methodology is, therefore, quite different for this proxy, as noted in Section 8.
We also did some assessment of GPU-enabled proxy applications. The assessment methodology for this is described in the next section.
2.2 GPU Performance Assessment
Most applications use GPUs by offloading computationally intensive kernels to these accelerators. Programming models used include CUDA, OpenACC, OpenMP 4.5, and the portability frameworks RAJA and Kokkos. The GPU performance assessment is carried out by identifying the most important GPU kernels and measuring their performance characteristics. We attempt to compare characteristics of corresponding kernels between the proxy and the parent application.
The most important GPU kernels cannot necessarily be identified by running the GPU version of the code on a default input set and collecting a timing profile. The input set used for the profiling should ideally be representative of the challenge exascale problem, scaled down to represent performance on a single node or small number of nodes. Determination of the appropriate input
set is carried out by consulting milestone reports for the relevant ECP parent application project and/or by consulting with the proxy and/or full application developers. In some cases, the proxy app may use an input for a synthetic or analytical problem to avoid complexities and dependencies of the full application. We attempt to use recent versions of the proxy and parent applications, implemented in the programming model(s) that the parent application development team is using as it scales toward exascale.
Once the most important GPU kernels have been determined, we carry out more detailed performance data collection and analysis for these kernels. We collect data to calculate instruction mix, arithmetic intensity, achieved memory bandwidth, and percentage of peak attained for each kernel. We use the arithmetic intensity to place the kernel on the relevant GPU roofline model. We compare the values of the performance data and derived metrics between corresponding kernels for the proxy and full application to try to determine how well the proxy represents the full application. We use the NVIDIA nvprof command-line profiler to collect performance data.
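As a rough illustration of the derived metrics named above, the sketch below computes arithmetic intensity, achieved bandwidth, and percentage of peak from raw counter-style values. The counter values and device peaks are hypothetical stand-ins for numbers a profiler such as nvprof would report.

```python
# Deriving per-kernel metrics from raw counter values.
# All input numbers here are hypothetical stand-ins.
flop_count    = 4.0e9    # double-precision FLOPs executed (hypothetical)
dram_bytes    = 8.0e8    # bytes moved to/from DRAM (hypothetical)
kernel_time_s = 2.0e-3   # kernel execution time in seconds (hypothetical)
PEAK_FLOPS    = 7.0e12   # device peak FLOP/s (hypothetical)
PEAK_BW       = 9.0e11   # device peak DRAM bandwidth, bytes/s (hypothetical)

arithmetic_intensity = flop_count / dram_bytes       # FLOPs per byte
achieved_flops       = flop_count / kernel_time_s    # sustained FLOP/s
achieved_bw          = dram_bytes / kernel_time_s    # sustained bytes/s
pct_peak_flops       = 100.0 * achieved_flops / PEAK_FLOPS
pct_peak_bw          = 100.0 * achieved_bw / PEAK_BW

print(f"AI = {arithmetic_intensity:.2f} FLOPs/byte, "
      f"{pct_peak_flops:.1f}% of peak FLOP/s, "
      f"{pct_peak_bw:.1f}% of peak bandwidth")
```

Computing the same derived metrics for the corresponding proxy and parent kernels, and comparing them side by side, is the core of the kernel-level comparison.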
2.3 Computational Platforms
Between the various groups working on this project at the different labs, three primary platforms were used for measurement of CPU characteristics (for all proxy/parent pairs except MACSio). These are the Cori system at NERSC, the Blake (Intel Sk