

Lawrence Berkeley National Laboratory Creative Services Office (CSO) #25378 CRD_PowerPoint_Poster_36x48_Template

Automatic Tuning of Parallel Read and Write Performance

Suren Byna

Motivation

•  Parallel I/O is an essential component of modern HPC:
   - Scientific simulations periodically write their state as checkpoint datasets.
   - Input and output datasets have become larger and larger.
•  Optimal I/O performance is critical for utilizing the power of HPC machines.

Performance Modeling and Tuning Writes

Nonlinear regression model

•  We consider smooth, nonlinear models, which can be written as linear combinations of n_b nonlinear basis functions φ_k:

(Eq. 1)   m(x; β) = Σ_{k=1}^{n_b} β_k φ_k(x)

•  Once a basis φ has been selected, the hyper-parameters β can be selected by standard regression- or optimization-based approaches.

•  There can be many choices of basis functions; for simplicity, we focus on terms that are low-degree polynomials in either the parameter, x_i, or its inverse, 1/x_i:

(Eq. 2)   { ∏_{i=1}^{n_x} x_i^{p_i} : p_i ∈ {−1, 0, 1}, i = 1, …, n_x }

•  One of our goals in building a model is simplicity: incorporate only a handful of basis terms, n_b, from the set (Eq. 2).
•  Starting from an initially empty set P, we follow a greedy forward-model-selection procedure, adding to P the term that most reduces the prediction error. Formally, for a candidate term p, we determine the p that solves

(Eq. 3)   min_{p ∈ {−1, 0, 1}^{n_x}}  Σ_{j=1}^{n_y} ( (m̂(x_j; P ∪ {p}) − y_j) / y_j )²

•  We repeat this until we reach a desired limit on the number of terms to include.
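This forward-selection loop can be sketched in Python. This is a minimal illustration under assumptions, not the poster's implementation: candidate terms are the products from (Eq. 2), and each candidate model is fit by ordinary least squares with rows scaled by 1/y_j so that the fit minimizes the relative squared error of (Eq. 3). All function names are illustrative.

```python
from itertools import product

def basis_value(x, p):
    # Evaluate one candidate term: prod_i x_i**p_i with p_i in {-1, 0, 1} (Eq. 2).
    v = 1.0
    for xi, pi in zip(x, p):
        v *= xi ** pi
    return v

def solve(A, b):
    # Tiny Gaussian elimination with partial pivoting for the normal equations.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (M[r][n] - sum(M[r][k] * beta[k] for k in range(r + 1, n))) / M[r][r]
    return beta

def relative_sq_error(terms, xs, ys):
    # Fit beta by least squares on the chosen terms, rows scaled by 1/y_j,
    # then return the Eq. 3 objective: sum_j ((m(x_j) - y_j) / y_j)^2.
    A = [[basis_value(x, p) / y for p in terms] for x, y in zip(xs, ys)]
    b = [1.0] * len(ys)  # target is y_j / y_j
    n = len(terms)
    AtA = [[sum(A[j][r] * A[j][c] for j in range(len(A))) for c in range(n)] for r in range(n)]
    Atb = [sum(A[j][r] * b[j] for j in range(len(A))) for r in range(n)]
    beta = solve(AtA, Atb)
    err = 0.0
    for x, y in zip(xs, ys):
        m = sum(bk * basis_value(x, p) for bk, p in zip(beta, terms))
        err += ((m - y) / y) ** 2
    return err

def forward_select(xs, ys, n_b):
    # Greedy forward model selection: repeatedly add the candidate term
    # that most reduces the prediction error, up to n_b terms.
    nx = len(xs[0])
    candidates = list(product((-1, 0, 1), repeat=nx))
    P = []
    for _ in range(n_b):
        best = min((c for c in candidates if c not in P),
                   key=lambda c: relative_sq_error(P + [c], xs, ys))
        P.append(best)
    return P
```

For example, on synthetic data generated as y = 5/x_0, the first selected term is the exponent tuple (−1, 0), i.e. the 1/x_0 basis function.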

This is a visualization of the 1-trillion-electron dataset at time step 1905. All of the particles with energy > 1.3 are shown in grey, while particles with energy > 1.5 are shown in color. A total of ~165 million particles with energy > 1.3 and ~423 thousand particles with energy > 1.5 appear to be accelerated preferentially along the direction of the mean magnetic field, corresponding to the formation of four jets.

Complex Parallel I/O Software Stack

•  Different optimization parameters at each layer of the I/O software stack

•  Complex inter-dependencies between these layers

•  Highly dependent on the application, HPC platform, and problem size/concurrency

•  Application developers need good I/O performance without becoming I/O experts
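To make the combinatorial difficulty concrete, here is a hypothetical parameter space spanning the file-system, MPI-IO, and HDF5 layers. The parameter names and value ranges are illustrative assumptions, not taken from the poster; even this small grid multiplies into over a hundred candidate configurations, too many to measure exhaustively at scale.

```python
from itertools import product

# Hypothetical tunable parameters at three layers of the I/O stack.
# Names and value ranges are illustrative; real ones depend on the platform.
param_space = {
    "lustre_stripe_count":   [4, 8, 16, 32],  # parallel file system layer
    "lustre_stripe_size_mb": [1, 4, 16],      # parallel file system layer
    "mpiio_cb_nodes":        [16, 32, 64],    # MPI-IO collective buffering aggregators
    "hdf5_alignment_mb":     [1, 4, 16],      # HDF5 layer
}

# Cartesian product of all per-parameter value lists.
configs = [dict(zip(param_space, values))
           for values in product(*param_space.values())]
print(len(configs))  # → 108
```

The inter-dependencies between layers mean these parameters cannot be tuned one at a time, which is what motivates the model-based approach described in the next sections.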

Results

[Figure: 3-D particle visualization described in the caption above; axes Ux, Uy, Uz ranging from −1.0 to 1.0; color bar shows Energy from 1.300 to 1.880]


Contributors: Prabhat, John Wu, Bin Dong (LBNL), Babak Behzad and Marc Snir (UIUC), Houjun Tang, Chris Zou, Nagiza Samatova (NCSU), Quincey Koziol (The HDF Group)

This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors would like to thank NERSC and Cray staff for troubleshooting I/O issues on Hopper and Edison. We also thank TACC and Stampede admin staff. We would also like to thank members of the HDF Group and Burlen Loring and Oliver Rübel of the Visualization group at LBNL CRD.

Tuning Writing Data with Model-based training

Overview of Dynamic Model-driven I/O tuning

[Diagram: Training Phase — an I/O Kernel is run over a Training Set on the HPC and Storage System, and Model Generation develops an I/O Model. Pruning — from All Possible Values, the model selects the Top k Configurations for Exploration, and the best-performing configuration is selected from the measured Performance Results. Refitting — the model is refit, controlled by the user.]

•  Three phases:
   - Training phase: using a training set, develop an empirical write-performance model.
   - Pruning phase: using the model-predicted write times, select the top k configurations with the best predicted performance; store the tuned parameters of the best-performing configuration.
   - Refitting phase (optional): refit the model to capture the current characteristics of the I/O system.
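The three phases can be sketched as a driver loop. Everything here is a stand-in: `measure_write_time` represents actually running the I/O kernel, and `NearestConfigModel` is a toy predictor used for illustration, not the nonlinear regression model described above.

```python
class NearestConfigModel:
    # Toy stand-in model: predict the measured time of the most similar
    # already-measured configuration (count of matching parameter values).
    def fit(self, observed):
        self.observed = observed  # list of (config, measured_time) pairs

    def predict(self, config):
        def similarity(item):
            c, _ = item
            return sum(c[k] == config[k] for k in c)
        _, t = max(self.observed, key=similarity)
        return t

def tune(configs, model, measure_write_time, training_set, k=3):
    # Training phase: run the I/O kernel on a small training set of
    # configurations and fit an empirical write-performance model.
    observed = [(c, measure_write_time(c)) for c in training_set]
    model.fit(observed)

    # Pruning phase: rank all candidate configurations by predicted
    # write time and keep only the top-k for actual measurement.
    top_k = sorted(configs, key=model.predict)[:k]

    # Exploration: measure the pruned set and select the best performer.
    return min(top_k, key=measure_write_time)
```

The refitting phase would simply repeat `model.fit` on newly observed (configuration, time) pairs when the user decides the system's characteristics have drifted.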

Results

[Bar charts: I/O Bandwidth (GB/s) of the VPIC-IO, VORPAL-IO, and GCRM-IO kernels on Edison, Hopper, and Stampede, comparing tuned configurations against defaults; speedups of 55x, 31x, and 21x are highlighted]

Performance improvement

Impact of Aggregators to Stripe Count ratio

Tuning Read Performance

[Bar chart: time to read data (sec.) for the queries [:, :, 100:600], [:, :, 4569:5789], and [:, :, 20000:30000], comparing the SDS Reorganized File against the Original File]

•  A strategy for collecting read traces and analyzing them for patterns
•  Based on the observed patterns, create partial replicas of the data in a layout that will benefit similar future read patterns
•  Transparently redirect future reads to the replicas
   - The original dataset is 3D in xyz dimensions
   - The replicated dataset is transposed to zxy dimensions

Performance improvement with the Electrocorticography (ECoG) I/O kernel
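The benefit of the transposed replica can be illustrated with plain Python standing in for the on-disk layout: a query selecting a contiguous z-range touches many scattered extents in the original xyz (row-major) layout, but a single contiguous extent in the zxy replica. The helper functions below are illustrative; they count the contiguous reads each layout would need.

```python
def flat_offsets_xyz(nx, ny, nz, z0, z1):
    # Flattened (row-major x, y, z) offsets touched by the query [:, :, z0:z1].
    return [(x * ny + y) * nz + z
            for x in range(nx) for y in range(ny) for z in range(z0, z1)]

def flat_offsets_zxy(nx, ny, nz, z0, z1):
    # Same query against the transposed replica stored in (z, x, y) order.
    return [(z * nx + x) * ny + y
            for z in range(z0, z1) for x in range(nx) for y in range(ny)]

def contiguous_runs(offsets):
    # Number of separate contiguous reads needed for a sorted offset list.
    runs = 1
    for a, b in zip(offsets, offsets[1:]):
        if b != a + 1:
            runs += 1
    return runs

nx = ny = 4
nz = 100
xyz = sorted(flat_offsets_xyz(nx, ny, nz, 10, 20))
zxy = sorted(flat_offsets_zxy(nx, ny, nz, 10, 20))
print(contiguous_runs(xyz), contiguous_runs(zxy))  # → 16 1
```

In the original layout the query needs one read per (x, y) pair, i.e. nx × ny scattered reads, while the replica serves it with one contiguous read; this is the access-pattern gap the SDS reorganized file exploits.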