
Revealing Performance Artifacts in Parallel Codes through Multi-domain Visualizations

ABHINAV BHATELE | TODD GAMBLIN | BRIAN T N GUNNEY | MARTIN SCHULZ | PEER-TIMO BREMER | KATHERINE E ISAACS

Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-POST-527971).

The HAC model

[Figure: the HAC model. Data analysis and visualization connect the three domains in which performance data is measured: the application domain (physical simulation space), the hardware domain (flops, cache misses, network topology), and the communication domain (virtual topology).]

Performance data is measured in domains such as the application, communication or hardware domain. The HAC model provides an intuitive way of analyzing performance by projecting data obtained from one domain into another. By doing this, we can correlate symptoms of performance problems in one domain with their root causes in a different one. Projecting data across domains can be more complicated for adaptive applications because the relationships between the different domains change as the simulation progresses. Hence, we need to keep track of the movement of data/work units and of changes in communication across phases.
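As a rough illustration, a projection amounts to re-binning measurements from one domain by their counterparts in another. The sketch below aggregates hypothetical per-patch measurements onto the MPI ranks that own the patches (application domain to communication domain); the function and data names are illustrative assumptions, not an actual SAMRAI or measurement-tool API.

```python
# Minimal sketch of one cross-domain projection in the spirit of the HAC
# model: aggregate per-patch measurements (application domain) onto the
# MPI rank that owns each patch (communication domain).
# Names and data are illustrative, not a real SAMRAI or tool API.

def project_patches_to_ranks(patch_metric, patch_owner):
    """patch_metric: patch id -> measured value (e.g. work, cache misses)
       patch_owner:  patch id -> owning MPI rank"""
    per_rank = {}
    for patch, value in patch_metric.items():
        rank = patch_owner[patch]
        per_rank[rank] = per_rank.get(rank, 0.0) + value
    return per_rank

# For an adaptive code, patch_owner changes between phases, so the mapping
# has to be re-recorded after every regrid/load-balance step.
work  = {"p0": 125.0, "p1": 125.0, "p2": 1000.0}
owner = {"p0": 0, "p1": 0, "p2": 3}
print(project_patches_to_ranks(work, owner))   # {0: 250.0, 3: 1000.0}
```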

In this poster, we present projections of measured data across multiple domains to discover performance bottlenecks in an adaptive code.

Projections on the application domain

Structured adaptive mesh refinement (SAMR) is an application domain with challenges in scaling and performance analysis. We study a linear advection (LinAdv) benchmark that uses a popular SAMR library called SAMRAI. SAMR applications decompose the simulation space hierarchically into multiple mesh levels. Each mesh level is broken down into patches that contain cells in the areas where that level has been refined. LinAdv has three patch levels, 0, 1 and 2, with increasingly finer resolution.
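A minimal sketch of this hierarchy is shown below; the classes are illustrative only, not SAMRAI's actual data structures.

```python
# Illustrative sketch of an SAMR hierarchy: a hierarchy holds levels, each
# level holds patches, and each patch records its cell box and the MPI rank
# that owns it. Not SAMRAI's real classes.

from dataclasses import dataclass, field
from math import prod

@dataclass
class Patch:
    lower: tuple     # lower cell index of the box, e.g. (0, 0, 0)
    upper: tuple     # upper cell index (inclusive)
    owner_rank: int  # MPI rank that owns this patch

    def num_cells(self):
        return prod(u - l + 1 for l, u in zip(self.lower, self.upper))

@dataclass
class Level:
    refinement_ratio: int            # refinement relative to the coarser level
    patches: list = field(default_factory=list)

@dataclass
class Hierarchy:
    levels: list = field(default_factory=list)   # level 0 is the coarsest

# A LinAdv-style hierarchy with three levels (0, 1, 2).
h = Hierarchy([Level(1), Level(2), Level(2)])
h.levels[0].patches.append(Patch((0, 0, 0), (4, 4, 4), owner_rank=0))
print(h.levels[0].patches[0].num_cells())   # 125
```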

On the left, we use VisIt to visualize the application domain for a 1024-core run on Blue Gene/P. Instead of coloring the patches by their physical properties, we color them by the MPI rank of the process that owns them. This gives an indication of whether neighboring patches in the application domain get mapped to nearby physical processors in the hardware domain, in this case a three-dimensional torus network.

We can see that nearby patches are colored in similar shades of red/yellow, blue or green. Even across different levels, patches at the same Cartesian coordinates have similar colors. We can also color the patches by the average or maximum number of hops traversed by messages from a patch to its neighbors. These mappings can be used to analyze whether communication between patches is efficient.
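On a torus, the hop metric reduces to a per-dimension distance with wraparound. The sketch below shows one way to compute the maximum and average hop counts from a patch's node to its neighbors' nodes; the helper names and coordinates are only examples, not the actual mapping used in the study.

```python
# Hedged sketch: hop counts between nodes on a 3D torus, taking the shorter
# of the direct and wraparound directions in each dimension.

def torus_hops(a, b, dims):
    """Minimum hop count between coordinates a and b on a torus of shape dims."""
    hops = 0
    for x, y, d in zip(a, b, dims):
        delta = abs(x - y)
        hops += min(delta, d - delta)   # the wraparound link may be shorter
    return hops

def max_and_avg_hops(patch_coord, neighbor_coords, dims):
    """Max and average hops from one patch's node to its neighbors' nodes."""
    h = [torus_hops(patch_coord, c, dims) for c in neighbor_coords]
    return max(h), sum(h) / len(h)

dims = (8, 8, 16)                                # 8 x 8 x 16 torus
print(torus_hops((0, 0, 0), (7, 0, 15), dims))   # 2: both hops use wraparound
print(max_and_avg_hops((1, 1, 1), [(1, 2, 1), (5, 1, 1)], dims))   # (4, 2.5)
```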

Traditional methods of performance analysis can require significant effort and time on the part of the application developer. Here, we present a case study to demonstrate the use of visualizations in multiple domains: application, hardware and communication, for discovering performance artifacts in a parallel application. Using a structured adaptive mesh refinement code, we present visualization and analysis techniques that make the process of performance debugging faster and more intuitive.

Projections on the hardware domain

Communication is becoming the dominant bottleneck as we scale to a large number of cores. It becomes important to analyze communication in terms of contention on specific links and the distribution of network traffic between the various directions. Boxfish can be used to visualize such data. In the figures below, an 8 x 8 x 16 torus is shown with the nodes colored by their load on the left and by the time spent in a phase of load balancing on the right.
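As a hedged sketch of the bookkeeping behind such a torus view, the code below maps MPI ranks to node coordinates on an 8 x 8 x 16 torus and aggregates a per-rank load onto nodes for coloring. The row-major mapping and one-rank-per-node assumption are illustrative, not Blue Gene/P's actual task mapping or Boxfish's API.

```python
# Illustrative data preparation for a torus view: rank -> node coordinate,
# then aggregate per-rank load per node for coloring.

def rank_to_coord(rank, dims):
    """Assumed row-major mapping of an MPI rank to (x, y, z) on a torus."""
    _, y_dim, z_dim = dims
    return (rank // (y_dim * z_dim), (rank // z_dim) % y_dim, rank % z_dim)

def node_colors(per_rank_load, dims):
    """Aggregate per-rank load onto torus coordinates for node coloring."""
    colors = {}
    for rank, load in per_rank_load.items():
        coord = rank_to_coord(rank, dims)
        colors[coord] = colors.get(coord, 0.0) + load
    return colors

dims = (8, 8, 16)
loads = {0: 3.0, 1: 2.5, 1023: 4.0}
print(node_colors(loads, dims))
# {(0, 0, 0): 3.0, (0, 0, 1): 2.5, (7, 7, 15): 4.0}
```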

Projections on the communication domain

Load balancing in SAMRAI takes more and more time as the problem is scaled to a larger number of processors. The times spent in the three phases within load balancing were recorded and plotted linearly by MPI rank (above). Phase 1, i.e., load distribution, appears to lead to longer times spent in the other phases. Coloring the nodes in the communication domain, which happens to be a binary tree (right), by the amount of load on each processor did not provide any clues.

We then projected the timing data for phase 1 onto the communication domain (figure on the left). This time, we clearly see that a part of the tree is significantly delayed (in red). Coloring the edges by the number of boxes sent down the tree shows a flow problem arising from the movement of load from the rest of the tree to this sub-tree.
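The projection onto the tree is conceptually simple: annotate each tree node with its rank's phase-1 time, and weight each edge by the boxes flowing through the subtree below it. The sketch below uses an implicit-heap layout (children of rank r are 2r+1 and 2r+2) as an assumption; it is not necessarily how SAMRAI's load balancer builds its tree.

```python
# Hedged sketch: projecting per-rank data onto a binary-tree communication
# domain. The implicit-heap layout is an illustrative assumption.

def children(rank, num_ranks):
    """Children of a rank in an implicit binary tree over MPI ranks."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < num_ranks]

def subtree_total(rank, per_rank_value, num_ranks):
    """Sum a per-rank metric (e.g. boxes sent down) over a rank's subtree;
    one way to compute an edge weight for the tree visualization."""
    total = per_rank_value.get(rank, 0)
    for c in children(rank, num_ranks):
        total += subtree_total(c, per_rank_value, num_ranks)
    return total

# Example with 7 ranks: node color = phase-1 time, edge weight into each
# child = boxes handled by that child's entire subtree.
num_ranks = 7
phase1_time = {r: 0.01 for r in range(num_ranks)}
phase1_time[5] = 0.05                      # a delayed subtree stands out in red
boxes = {r: 1 for r in range(num_ranks)}
for child in children(0, num_ranks):
    print(child, subtree_total(child, boxes, num_ranks))   # 1 -> 3, 2 -> 3
```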

[Figure: Different phases of load balancing (1024 cores). Time (s), from 0 to 0.05, plotted against MPI rank (0 to 1024) for phase 1, phase 2 and phase 3.]

Performance improvements

[Figure: Time spent in load balancing. Wall clock time per cell update (s) versus number of cores (256 to 64K) for the indirect send and the direct send.]

[Figure: Time spent in solving and adaptation. Wall clock time (s) versus number of cores (256 to 64K) for box sizes 5,5,5; 7,7,7; 9,9,9 and 11,11,11.]

We tried some preliminary solutions to mitigate the problem observed in the load balancing phase. We used two strategies to reduce the amount of data being moved: 1. send the patch destinations to the source processor directly instead of sending the data over the tree network, and 2. increase the box size, which reduces the number of boxes. Both measures reduce the time spent in load balancing and also the overall time. A scalable solution will be implemented in the future to completely eliminate this bottleneck.
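The effect of the second strategy can be seen with back-of-the-envelope arithmetic: for a fixed cell count, the number of boxes, and hence the box metadata sent over the tree, shrinks roughly with the cube of the box edge length. The cell count below is illustrative, not a measured value from the runs above.

```python
# Rough sketch (not measured data) of why larger boxes reduce load-balancing
# traffic: box count falls roughly as the cube of the box edge length.

def num_boxes(total_cells, box_edge):
    """Approximate box count for a given box edge length (cells per edge)."""
    cells_per_box = box_edge ** 3
    return -(-total_cells // cells_per_box)   # ceiling division

total_cells = 10_000_000                      # illustrative cell count
for edge in (5, 7, 9, 11):                    # box sizes used in the poster
    print(edge, num_boxes(total_cells, edge))
# 5 -> 80000, 7 -> 29155, 9 -> 13718, 11 -> 7514: roughly 10x fewer boxes
# at box size 11,11,11 than at 5,5,5.
```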