HPC User Forum 9/10/2008, Scott Klasky (klasky@ornl.gov)

TRANSCRIPT

  • Slide 1
  • HPC User Forum, 9/10/2008. Scott Klasky (klasky@ornl.gov), Oak Ridge National Laboratory, managed by UT-Battelle for the Department of Energy. With S. Ethier, S. Hodson, C. Jin, Z. Lin, J. Lofstead, R. Oldfield, M. Parashar, K. Schwan, A. Shoshani, M. Wolf, Y. Xiao, F. Zheng.
  • Slide 2
  • GTC. EFFIS. ADIOS. Workflow. Dashboard. Conclusions.
  • Slide 3
  • 2008-2009 (Jaguar at ORNL):
      • 275 TF system: 7,832 compute nodes (35.2 GF/node); 1 socket (AM2/HT1) per node, 4 cores per socket (31,328 cores total); 2.2 GHz AMD Opteron cores; 2 GB memory per core (DDR2-800); aggregate memory 63 TB; 232 service & I/O nodes; local storage ~750 TB at 41 GB/s; 3D torus interconnect, SeaStar 2.1 NIC.
      • 1.0 PF system: 13,888 compute nodes (73.6 GF/node); 2 sockets (F/HT1) per node, 4 cores per socket (111,104 cores total); 2.3 GHz AMD Opteron cores; 2 GB memory per core (DDR2-800); aggregate memory 222 TB; 256 service & I/O nodes; local storage ~10 PB at 200+ GB/s; 3D torus interconnect, SeaStar 2.1 NIC; 150 cabinets, 3,400 ft², 6.5 MW power.
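      (As a quick consistency check on these figures: 7,832 nodes × 35.2 GF/node ≈ 276 TF and 31,328 cores × 2 GB/core ≈ 63 TB for the first system; 13,888 nodes × 73.6 GF/node ≈ 1.02 PF and 111,104 cores × 2 GB/core ≈ 222 TB for the second, matching the quoted peaks and aggregate memory.)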
  • Slide 4
  • Big simulations for early 2008: GTC science goals and impact.
      • Science goals: use GTC (classic) to analyze cascades and propagation in Collisionless Trapped Electron Mode (CTEM) turbulence, and resolve the critical question of ρ* scaling of confinement in large tokamaks such as ITER; what are the consequences of departure from this scaling? Avalanches and turbulence spreading tend to break gyro-Bohm scaling, but zonal flows tend to restore it by shearing apart extended eddies: a competition. Use GTC-S (shaped) to study electron temperature gradient (ETG) drift turbulence and compare against NSTX experiments. NSTX is a spherical torus with a very low major-to-minor aspect ratio and a strongly shaped cross-section; its experiments have produced very interesting high-frequency, short-wavelength modes. Are these kinetic electron modes? ETG is a likely candidate, but only a fully nonlinear kinetic simulation with the exact shape and experimental profiles can address this.
      • Science impact: further the understanding of CTEM turbulence by validation against modulated ECH heat-pulse propagation studies on the DIII-D, JET, and Tore Supra tokamaks. Is CTEM the key mechanism for electron thermal transport? Electron temperature fluctuation measurements will shed light. Understand the role of the nonlinear dynamics of precession drift resonance in CTEM turbulence. This is the first time a direct comparison between simulation and experiment on ETG drift turbulence is possible; GTC-S possesses the right geometry and the right nonlinear physics to possibly resolve this, and will help pinpoint the micro-turbulence activities responsible for energy loss through the electron channel in NSTX plasmas.
  • Slide 5
  • GTC early application: electron microturbulence in fusion plasma.
      • Scientific discovery: a transition to favorable scaling of confinement for both ions and electrons is now observed in simulations for ITER plasmas (good news for ITER!). Electron transport is less understood but more important in ITER, since fusion products first heat the electrons; simulating electron turbulence is more demanding due to shorter time scales and smaller spatial scales.
      • A recent GTC simulation of electron turbulence used 28,000 cores for 42 hours in a dedicated run on Jaguar at ORNL, producing 60 TB of data currently being analyzed. This run pushes 15 billion particles for 4,800 major time cycles.
      • (Figure: ion transport and electron transport.)
  • Slide 6
  • 3D fluid data analysis provides critical information to characterize microturbulence, such as radial eddy size and eddy auto-correlation time. The flux-surface electrostatic potential demonstrates a ballooning structure; radial turbulence eddies have an average size of ~5 ion gyroradii.
  • Slide 7
  • EFFIS builds on:
      • From the SDM center: the Kepler workflow engine, provenance support, and wide-area data movement.
      • From universities: code coupling (Rutgers) and visualization (Rutgers).
      • Newly developed technologies: adaptable I/O (ADIOS, with Georgia Tech) and the dashboard (with the SDM center).
      • (Diagram: foundation and enabling technologies, spanning visualization, code coupling, wide-area data movement, dashboard, workflow, adaptable I/O, and provenance and metadata.)
      • Approach: place highly annotated, fast, easy-to-use I/O methods in the code, which can be monitored and controlled; have a workflow engine record all of the information; visualize this on a dashboard; move desired data to the user's site; and have everything reported to a database.
  • Slide 8
  • GTC. EFFIS. ADIOS. Conclusions.
  • Slide 9
  • Those fine fort.* files!
      • Multiple HPC architectures: BlueGene, Cray, IB-based clusters.
      • Multiple parallel filesystems: Lustre, PVFS2, GPFS, Panasas, pNFS.
      • Many different APIs: MPI-IO, POSIX, HDF5, netCDF. GTC (fusion) has changed its IO routines 8 times so far, based on performance when moving to different platforms.
      • Different IO patterns: restarts, analysis, diagnostics. Different combinations provide different levels of IO performance.
      • Compensate for inefficiencies in the current IO infrastructures to improve overall performance.
  • Slide 10
  • ADIOS overview:
      • Allows plug-ins for different I/O implementations and abstracts the API from the method used for I/O.
      • Simple API, almost as easy as an F90 write statement; best-practice, optimized IO routines for all supported transports come for free.
      • Componentization: a thin API; an XML file with data groupings, annotation, IO method selection, and buffer sizes; common tools (buffering, scheduling); pluggable IO routines.
      • (Architecture diagram: scientific codes call the ADIOS API, driven by external metadata (the XML file); pluggable methods include MPI-CIO, LIVE/DataTap, MPI-IO, POSIX IO, pHDF-5, pnetCDF, viz engines, and others, with buffering, scheduling, and feedback.)
  • Slide 11
  • Simple API, very similar to standard Fortran or C POSIX IO calls, and as close to identical as possible for the C and Fortran APIs: open, read/write, and close are the core; set_path, end_iteration, begin/end_computation, and init/finalize are the auxiliaries. No changes to the API are needed for different transport methods.
  • Metadata and configuration are defined in an external XML file parsed once on startup: it describes the various IO groupings (including attributes and hierarchical path structures for elements) as adios-groups, defines the transport method used for each adios-group, and gives parameters for communication/writing/reading. What is written can be changed on a per-element basis; how the IO is handled can be changed on a per-adios-group basis.
  • Slide 12
  • ADIOS is an IO componentization which allows us to abstract the API from the IO implementation, switch from synchronous to asynchronous IO at runtime, and change from real-time visualization to fast IO at runtime. It combines fast I/O routines, ease of use, a scalable architecture (from hundreds of cores to millions of processes), QoS, metadata-rich output, visualization applied during simulations, analysis and compression techniques applied during simulations, and provenance tracking.
  • Slide 13
  • ADIOS: a Fortran- and C-based API almost as simple as standard POSIX IO. External configuration to describe metadata and control IO settings. Takes advantage of existing IO techniques (no new native IO methods). Fast, simple-to-write, efficient IO for multiple platforms without changing the source code.
  • Slide 14
  • Data groupings: logical groups of related items written at the same time, though not necessarily one group per writing event.
  • IO methods: choose what works best for each grouping; each is vetted, improved, and/or written by experts (switching a group's method is a one-line XML change, as sketched below):
      • POSIX (Wei-keng Liao, Northwestern)
      • MPI-IO (Steve Hodson, ORNL)
      • MPI-IO collective (Wei-keng Liao, Northwestern)
      • NULL (Jay Lofstead, GT)
      • Ga Tech DataTap asynchronous (Hasan Abbasi, GT)
      • phdf5
      • others (pnetcdf on the way)
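    To illustrate the per-group method selection, here is a hypothetical two-line XML fragment (the method strings and attribute names are illustrative and may not match the exact syntax of the 2008 ADIOS release) showing how a group could be moved from the synchronous MPI-IO method to the asynchronous DataTap method without touching the source code:
        <method group="restart" method="MPI"/>
        <!-- or, switching the same group to the asynchronous transport: -->
        <method group="restart" method="DATATAP"/>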
  • Slide 15
  • Specialty APIs: HDF-5 (complex API), parallel netCDF (no structure).
  • File-system-aware middleware: the MPI ADIO layer (file system connection, complex API).
  • Parallel file systems: Lustre (metadata server issues), PVFS2 (client complexity), LWFS (client complexity); GPFS, pNFS, and Panasas may have other issues.
  • Slide 16
  • Platforms tested: Cray CNL (ORNL Jaguar), Cray Catamount (SNL Red Storm), Linux Infiniband/Gigabit (ORNL Ewok); BlueGene/P is now being tested/debugged; looking for future OS X support.
  • Native IO methods: MPI-IO independent, MPI-IO collective, POSIX, NULL, Ga Tech DataTap asynchronous, Rutgers DART asynchronous, POSIX-NxM, phdf5, pnetcdf, kepler-db.
  • Slide 17
  • MPI-IO method: the GTC and GTS codes have achieved over 20 GB/sec on the Cray XT at ORNL, writing 30 GB diagnostic files every 3 minutes, 1.2 TB restart files every 30 minutes, and 300 MB of other diagnostic files every 3 minutes. DART:
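    (Taken together, those periodic outputs average well under the demonstrated burst rate: roughly (30 GB + 0.3 GB) / 180 s ≈ 0.17 GB/s of diagnostics plus 1.2 TB / 1,800 s ≈ 0.67 GB/s of restarts, i.e. under 1 GB/s sustained against a 20+ GB/s write burst.)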
  • Slide 18
  • June 7, 2008: 24-hour GTC run on Jaguar at ORNL using 93% of the machine (28,672 cores), with a mixed MPI-OpenMP model on quad-core nodes (7,168 MPI procs). Three interruptions in total (simple node failures), with two 10+ hour runs. Wrote 65 TB of data at >20 GB/sec (25 TB for post analysis); IO overhead was ~3% of wall-clock time. Mixed IO methods of synchronous MPI-IO and POSIX IO were configured in the XML file.
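    (A rough check of the quoted figures: 65 TB written at more than 20 GB/sec takes at most about 3,250 s, a bit under an hour of the 24-hour run, consistent with the reported ~3% IO overhead.)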
  • Slide 19
  • Chimera IO performance (supernova code), 2x scaling. The plot shows the minimum value from 5 runs with 9 restarts per run; error bars show the maximum time for the method.
  • Slide 20
  • Chimera benchmark results: why is ADIOS better than pHDF5? ADIOS MPI-IO method vs. pHDF5 with the MPI independent IO driver, using 512 cores and 5 restart dumps. Numbers below are sums over all PEs (parallelism not shown).
      ADIOS_MPI_IO:
        Function               # of calls    Time (s)
        write                        2560     2218.28
        MPI_File_open                2560       95.80
        MPI_Recv                     2555       24.68
        buffer_write              6136320       10.29
        fopen                         512        9.86
        bp_calsize_stringtag      3179520        4.44
        other                          --         ~40
      pHDF5:
        Function               # of calls    Time (s)
        write                     1440653     3109.67
        MPI_Bcast (sync)          3148001     2259.30
        MPI_File_open                2560      325.17
        MPI_File_set_size            2560       23.76
        MPI_Comm_dup                 5120       16.34
        H5P, H5D, etc.                 --        8.71
        other                          --         ~20
      Conversion time on 1 processor for the 2048-core job = 3.6 s (read) + 5.6 s (write) + 6.9 s (other) = 18.8 s.
  • Slide 21
  • A research transport to study asynchronous data movement. It uses server-directed I/O to maintain high bandwidth and low overhead for data extraction, and performs I/O scheduling to reduce the perturbation caused by asynchronous I/O.
  • Slide 22
  • Due to perturbations caused by asynchronous I/O, the overall performance of the application may actually get worse. We therefore schedule the data movement using application state information to prevent asynchronous I/O from interfering with MPI communication. Example: for 800 GB of data, scheduled I/O takes 2x longer to move the data, but its overhead on the application is 2x less.
  • Slide 23
  • XML configuration file (sketched below) and Fortran90 code:
        ! initialize the system, loading the configuration file
        call adios_init ("config.xml", err)
        ! open a write path for that group
        call adios_open (h1, "output", "restart.n1", "w", err)
        call adios_group_size (h1, size, total_size, comm, err)
        ! write the data items
        call adios_write (h1, "g_NX", 1000, err)
        call adios_write (h1, "g_NY", 800, err)
        call adios_write (h1, "lo_x", x_offset, err)
        call adios_write (h1, "lo_y", y_offset, err)
        call adios_write (h1, "l_NX", x_size, err)
        call adios_write (h1, "l_NY", y_size, err)
        call adios_write (h1, "temperature", u, err)
        ! commit the writes for asynchronous transmission
        call adios_close (h1, err)
        ! ... do more work ...
        ! shut down the system at the end of the run
        call adios_finalize (mype, err)
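    A minimal sketch of what the referenced XML configuration might contain, assuming the group and variable names from the Fortran example above; the element and attribute names follow later public ADIOS releases and may not match the 2008 syntax exactly:
        <adios-config host-language="Fortran">
          <adios-group name="output" coordination-communicator="comm">
            <var name="g_NX" type="integer"/>
            <var name="g_NY" type="integer"/>
            <var name="lo_x" type="integer"/>
            <var name="lo_y" type="integer"/>
            <var name="l_NX" type="integer"/>
            <var name="l_NY" type="integer"/>
            <var name="temperature" type="double" dimensions="l_NX,l_NY"/>
          </adios-group>
          <method group="output" method="MPI"/>
          <buffer size-MB="100" allocate-time="now"/>
        </adios-config>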
  • Slide 24
  • C code:
        // parse the XML file and determine buffer sizes
        adios_init ("config.xml");
        // open and write the retrieved group
        adios_open (&h1, "restart", "restart.n1", "w");
        adios_group_size (h1, size, &total_size, comm);
        adios_write (h1, "n", &n);        // int n;
        adios_write (h1, "mi", &mi);      // int mi;
        adios_write (h1, "zion", zion);   // float zion[10][20][30][40];
        // ... write more variables ...
        // commit the writes for synchronous transmission, or
        // generally initiate the write for asynchronous transmission
        adios_close (h1);
        // ... do more work ...
        // shut down the system at the end of the run
        adios_finalize (mype);
  • XML configuration file: srv=ewok001.ccs.ornl.gov
  • Slide 25
  • Petascale GTC runs will produce 1 PB per simulation; coupling GTC with an edge code (core-edge coupling) gives 4 PB of data per run. We can't store all of the GTC runs at ORNL unless we go to tape (about 12 days to grab the data from tape if we get 1 GB/sec), with roughly 1.5 FTE looking at the data.
  • Need more real-time analysis of data: workflows, data-in-transit (IO graphs), ...?
  • Can we create a staging area with fat nodes? Move data from computational nodes to fat nodes using the network of the HPC resource; reduce data on the fat nodes; allow users to plug in analysis routines on the fat nodes.
  • How fat? Shared memory helps (we don't have to parallelize all analysis codes). A typical upper bound for the codes we studied is writing 1/20th of memory per core for analysis, so we want 1/20th of the resources (5% overhead) and 2x memory per core for analysis (2x memory overhead for in-data plus out-data). On the Cray at ORNL this means roughly 750 (quad-core) sockets for fat memory, with 34 GB of shared memory. Also useful for codes which require memory but not as many nodes. Can we have shared memory on this portion? What are the other solutions?
  • Slide 26
  • GTC is a code which is scaling to petascale computers (BG/P, Cray XT). The new changes are new science and new IO (ADIOS). A major challenge in the future is speeding up the data analysis.
  • ADIOS is an IO componentization, and is being integrated into Kepler. It achieved over 50% of peak IO performance for several codes on Jaguar, can change IO implementations at runtime, and keeps metadata in an XML file.
  • Petascale science starts with petascale applications. We need enabling technologies to scale, and we need to rethink ways to do science.