XSEDE14 BoF: Drilling Down: Understanding User-Level Activity on Today's Supercomputers


Outline

• Brief presentation
• Open discussion
• Demo


Can you

• Accurately say how many users and projects link a particular library into their code?
• Determine if a library was never used?
• Differentiate user-built app usage from center-provided app usage?
• Determine after the fact which users used a buggy library?
• Help a user figure out how they built their code (provenance information)?
• Determine usage trends in libraries/compilers?
• Catch differences between the run-time and compile-time environments?
• Determine which routines from a math or I/O library are used the most?
• Identify applications in use that are older than a certain age?


If not, but you want to

• We will describe our new tool, XALT
• First, a little background
• Then, a brief description of XALT


Robert McLay, TACC


My Passions

• Protect new users but stay out of the veterans' way
• Make staff support efficient and effective
• Automate detection, correction, prevention
• Make the repeat tickets go away!


Making a difference…


Maintain consistent, compatible software environment

$ module swap mvapich2 impi
Inactive Modules:
  1) vasp
Due to MODULEPATH changes the following have been reloaded:
  1) fftw3/3.3.2

$ module load mvapich2
Lmod Error: You can only have one MPI module loaded at a time.
You already have impi loaded.

Lmod and related tools
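The one-MPI-at-a-time rule in the session above can be modelled in a few lines. Lmod itself expresses this kind of rule with the family() function in a modulefile; the Python below is only a toy model of the behaviour, not Lmod's implementation. The names FAMILIES and family_conflict, and the mapping of modules to families, are assumptions made for illustration.

```python
# Toy model of a "one module per family" rule; not Lmod's implementation.
# FAMILIES and family_conflict are hypothetical names for this sketch.

FAMILIES = {"mvapich2": "MPI", "impi": "MPI"}  # assumed family membership


def _base(name):
    """Strip an optional version suffix: 'fftw3/3.3.2' -> 'fftw3'."""
    return name.split("/", 1)[0]


def family_conflict(loaded, name, families=FAMILIES):
    """Return the already-loaded module that blocks `name`, or None.

    A module is blocked when a different module from the same family
    is already in `loaded`. Modules without a family never conflict.
    """
    family = families.get(_base(name))
    if family is None:
        return None
    for other in loaded:
        if _base(other) != _base(name) and families.get(_base(other)) == family:
            return other
    return None
```

With impi loaded, family_conflict(["impi", "fftw3/3.3.2"], "mvapich2") returns "impi", which is the situation behind the error message on the slide.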


Making a difference…


Detect potential problems and alert users

TACC: Starting up job 423224
******************************************************
WARNING: Your MPI Environment is: mvapich2/1.9a2
         Your executable was built with: impi/4.1.0.030
******************************************************

Lariat and related tools
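The check behind a warning like this can be sketched as a comparison between the MPI module active at job launch and the one recorded at build time. How Lariat obtains either value is not shown in the slides, so the function below simply takes both as strings. The names module_name and mpi_mismatch_warning are hypothetical, and comparing only the module name (not the version) is an assumption.

```python
# Sketch of a build-time vs run-time MPI check; not Lariat's code.
# Assumption: modules are written "name/version" and only the name is compared.


def module_name(module):
    """Return the part before the first slash: 'impi/4.1.0.030' -> 'impi'."""
    return module.split("/", 1)[0]


def mpi_mismatch_warning(runtime_mpi, build_mpi):
    """Return a warning banner if the two MPI stacks differ, else None."""
    if module_name(runtime_mpi) == module_name(build_mpi):
        return None
    bar = "*" * 54
    return "\n".join(
        [
            bar,
            f"WARNING: Your MPI Environment is: {runtime_mpi}",
            f"         Your executable was built with: {build_mpi}",
            bar,
        ]
    )
```

Calling mpi_mismatch_warning("mvapich2/1.9a2", "impi/4.1.0.030") produces a banner in the style of the one above; matching stacks return None, so nothing is printed for a consistent environment.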


Making a difference…


Job-level usage data on libraries and applications: ALTD (Mark Fahey, NICS)

Kraken 2012: Library usage at compilation

Rank  Version  Count  %     Users
1     libsci   54672  12.2  339
2     fftw/    27633   6.2  277
3     hdf5/    24427   5.4  163
4     papi/     9465   2.1   57
5     acml      8264   1.8  119

Kraken 2012: Application usage at job launch

Rank  Application  Count   %     Users
1     arps         511236  26.9   27
2     namd         338000  17.8  108
3     amber        262556  13.8   35
4     vasp          87628   4.6   59
5     lammps        22687   1.2   39


Joining forces…


Lariat: detect potential problems and alert users
ALTD: job-level usage data on libraries and applications
Lariat and ALTD join forces in XALT


My own not-so-hidden agenda...

• Looking for XALT beta users
• Hungry for ideas, needs, feedback
• Wanting to begin a conversation with kindred souls


Mark Fahey, UTK


ALTD

• Tracks
  – Which executables use the largest number of core hours?
    • Are they managed by the center? Do they use the system efficiently?
  – Which libraries, applications, or tools are being used?
    • Are there libraries we should remove? Are there libraries we should install?
  – What percentage of executables are scripts?
    • Are these scripts being used because the job starter isn't sophisticated enough?
  – Are there any executables with modification times older than 1 year?
    • Should we ask the user to recompile?
• In use by several centers already
  – NERSC, NCCS, NICS, CSCS, NCSA/BW, and most recently KAUST
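One of the questions above, whether any executables have modification times older than one year, can be illustrated with a small filesystem check. This is only a sketch: the slides do not say how ALTD records or evaluates modification times, so reading them directly with os.stat is an assumption, and stale_executables and SECONDS_PER_YEAR are hypothetical names.

```python
import os
import time

SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years; fine for a rough cutoff


def stale_executables(paths, max_age_years=1.0, now=None):
    """Return the paths whose modification time is older than the cutoff.

    `now` is a Unix timestamp and defaults to the current time; passing it
    explicitly makes the function easy to test. Missing or unreadable
    paths are skipped rather than reported.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_years * SECONDS_PER_YEAR
    stale = []
    for path in paths:
        try:
            mtime = os.stat(path).st_mtime
        except OSError:
            continue  # executable was removed or cannot be read
        if mtime < cutoff:
            stale.append(path)
    return stale
```

The resulting list could feed a "please recompile" notice, with max_age_years left as a site policy choice.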


ALTD is enabled on all major computing platforms at NERSC


[Figure: Library usage on Hopper (June 21, 2012 - Jan 17, 2013). x-axis: Libraries; y-axis: Number of unique users, with ticks at 1, 10, 100, 1000, 10000.]

Libraries shown: mpich, zlib, libsci, hdf5, netcdf, fftw, shmem, acml, metis, parmetis, papi, petsc, superlu, hypre, gsl, mumps, trilinos, ga, boost, perftools, pnetcdf, tscotch, tpsl, mkl, cfitsio, ipm, ezcdf, pspline, sundials, slepc, tau, adios, onesided, libtool, scalapack, ncar, silo, sprng, dfftpack, parpack, tcl, szip, udunits, h5part, iobuf, libfast, mpip, libelf


Applications of ALTD

• Understanding current library usage and planning for future software needs

• Providing usage statistics to developers and vendors

• Restoring the program environment where user applications were built

• Assisting with debugging system issues


An ALTD tool to restore the build environment for an application:

aryal@edison12:~> linkinfo.sh /global/homes/a/aryal/bin/gvasp5.3.2
User : zz217
Linked on : 2013-01-03
Executable Name: vasp
Libraries Used :
//usr/lib64/libhugetlbfs.a
../vasp.5.lib/libdmy.a
/opt/cray/atp/1.6.0/lib//libAtpSigHCommData.a
/opt/cray/atp/1.6.0/lib//libAtpSigHandler.a
/opt/cray/libsci/12.0.00/cray/81/sandybridge/lib/libsci_cray_mp.a
/opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a
/opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpich_cray.a
/opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpl.a
/opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64/libxpmem.a
/opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
/opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64/libugni.a
/opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64/libudreg.a
/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpslli.a
/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpsutil.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libpgas-dmapp.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
/opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64/libdmapp.a
/opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libfi.a
/opt/gcc/4.4.4/snos/lib64/libstdc++.a
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libf.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libcraymath.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libcraymp.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libcsup.a
//usr/lib64/librt.a
/opt/cray/cce/8.1.2/craylibs/x86-64/libtcmalloc_minimal.a
//usr/lib64/libpthread.a
//usr/lib64/libc.a
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
//usr/lib64/libm.a
/opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc.a
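A list of library paths like the one above is easier to summarise once each path is reduced to a package and version. The sketch below uses a heuristic that is purely an assumption about the path layout: the first path component that looks like a dotted version number is taken as the version, and the component before it as the package. It is not part of ALTD or linkinfo.sh, and package_and_version and VERSION_RE are hypothetical names.

```python
import re

# A component counts as a version if it starts with digits.digits, e.g. "3.3.0.1"
# or "0.1-2.0500.36799.3.6.ari". Plain numbers such as "74" do not match.
VERSION_RE = re.compile(r"^\d+(\.\d+)+")


def package_and_version(path):
    """Return 'package/version' for a library path, or None if no version is found.

    Heuristic: scan the path components left to right and stop at the first
    one that looks like a version; the preceding component is the package.
    """
    parts = [p for p in path.split("/") if p]
    for prev, part in zip(parts, parts[1:]):
        if VERSION_RE.match(part):
            return f"{prev}/{part}"
    return None
```

For the FFTW entry above this gives "fftw/3.3.0.1", while system paths such as //usr/lib64/libc.a carry no version and return None. Sites with a different directory layout would need a different rule.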


ALTD at CSCS

• In production at CSCS since 2011
• Rock solid: just a single downtime in two years
  – Rosa (Cray XE6) since March 2011
    • 600K compilations, 2.8M jobs
  – Todi (Cray XK6/XK7) since October 2012
    • 470K compilations, 500K jobs
  – Daint (Cray XC30) since March 2013
    • 100K compilations, 550K jobs

• We’ve added an additional SQL table “accounting” which logs more data about the application execution – number of cores used, number of cores claimed, number of threads, MPI processes, processes per node, …

• We want to be able to detect situations like the use of a buggy or non-performant library


How we mine data: a hypothetical situation


A critical bug has been identified in FFTW version 3.3.0.2, affecting code correctness


First, find which users have linked this library


mysql> select distinct username
    -> from altd_rosa_link_tags, altd_rosa_linkline
    -> where altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
    ->   and exit_code = 0
    ->   and linkline like '%fftw/3.3.0.2/%';

+----------+
| username |
+----------+
| tkachenn |
| boswald  |
| liang    |
| robinson |
| yunding  |
| zilia    |
+----------+
5 rows in set (4.33 sec)

– Querying the ALTD database reveals that several users have applications linked to the buggy library
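The lookup above can be tried without a live ALTD database. The following is a minimal sketch using Python's sqlite3 module and an in-memory database. The table and column names come from the query on the slide, but which table holds username and exit_code is not shown there, so the schema below is an assumption, and the rows and usernames are invented demo data.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with an assumed, simplified ALTD-like schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE altd_rosa_linkline (
            linking_inc INTEGER PRIMARY KEY,
            linkline    TEXT,
            exit_code   INTEGER
        );
        CREATE TABLE altd_rosa_link_tags (
            tag_id      INTEGER PRIMARY KEY,
            linkline_id INTEGER,
            username    TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO altd_rosa_linkline VALUES (?, ?, ?)",
        [
            (1, "/opt/fftw/3.3.0.2/lib/libfftw3.a", 0),
            (2, "/opt/fftw/3.3.0.1/lib/libfftw3.a", 0),
            (3, "/opt/fftw/3.3.0.2/lib/libfftw3.a", 1),  # failed link, filtered out
        ],
    )
    conn.executemany(
        "INSERT INTO altd_rosa_link_tags VALUES (?, ?, ?)",
        [(10, 1, "alice"), (11, 1, "alice"), (12, 2, "bob"), (13, 3, "carol")],
    )
    return conn


def users_linking(conn, pattern):
    """Return the sorted, distinct users with a successful link matching `pattern`."""
    rows = conn.execute(
        """
        SELECT DISTINCT t.username
        FROM altd_rosa_link_tags AS t
        JOIN altd_rosa_linkline AS l ON t.linkline_id = l.linking_inc
        WHERE l.exit_code = 0 AND l.linkline LIKE ?
        ORDER BY t.username
        """,
        (pattern,),
    )
    return [row[0] for row in rows]
```

The explicit JOIN is equivalent to the comma join on the slide, and passing the LIKE pattern as a bound parameter keeps the same function reusable for other libraries and versions.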


Now, check if they are using the buggy application

mysql> select altd_rosa_jobs.*
    -> from altd_rosa_link_tags, altd_rosa_linkline, altd_rosa_jobs
    -> where altd_rosa_jobs.tag_id = altd_rosa_link_tags.tag_id
    ->   and altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
    ->   and exit_code = 0
    ->   and linkline like '%fftw/3.3.0.2/%'
    ->   and altd_rosa_jobs.username = "robinson";

+---------+--------+------------------------+----------+------------+--------+---------------+
| run_inc | tag_id | executable             | username | run_date   | job_id | build_machine |
+---------+--------+------------------------+----------+------------+--------+---------------+
| 2410158 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
| 2410172 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
| 2410198 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
| 2410222 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
+---------+--------+------------------------+----------+------------+--------+---------------+
4 rows in set (0.65 sec)

• The query confirms that user “robinson” is running the application linked to the buggy library

• It’s now up to the user services group to contact the user and recommend relinking their applications against the newer version of FFTW, which has fixed the bug


This methodology is clearly unmanageable!

• Ideally, user support specialists would be alerted automatically to “situations of interest”
  • Users running applications linked to legacy, less-performant, or buggy libraries
  • Users running legacy versions of applications
  • Users building code with legacy compilers
  • Users making use of their own libs or apps when more optimized versions are available centrally

How can we automate the processes of data mining, reporting and alerting?
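One possible shape for such automation is a rule table that is applied to every job record, producing one alert list per user. The sketch below is an illustration under stated assumptions, not something the slides or XALT prescribe: job records are assumed to be dictionaries with username, executable, and linkline keys, the rule table FLAGGED is hypothetical, and the example link line is invented.

```python
from collections import defaultdict

# Hypothetical rule table: link-line fragment -> reason it is of interest.
FLAGGED = {"fftw/3.3.0.2/": "critical bug affecting code correctness"}


def alerts_by_user(jobs, flagged=None):
    """Group flagged (executable, fragment, reason) triples by username.

    `jobs` is an iterable of dicts with 'username', 'executable' and
    'linkline' keys (an assumed record format). Repeated runs of the
    same executable produce a single alert.
    """
    flagged = FLAGGED if flagged is None else flagged
    alerts = defaultdict(set)
    for job in jobs:
        for fragment, reason in flagged.items():
            if fragment in job["linkline"]:
                alerts[job["username"]].add((job["executable"], fragment, reason))
    return {user: sorted(items) for user, items in alerts.items()}
```

Run periodically against new job records, the returned dictionary could drive a report or a ticket for the user services group; adding a new "situation of interest" then means adding a rule rather than writing a new query.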


TACC_Stats

• Job-level transparent performance monitoring from HPC compute nodes
  – CPU performance counters
  – IB statistics
  – Lustre statistics
  – Scheduler job statistics
  – Host data
  – OS statistics

• Analyses integrate available Lariat data (XALT in the future)


XALT: Understanding the Software Needs of High End Computer Users

• NSF-funded project
• Combining the best of Lariat and ALTD
• Collecting job-level and link-time data, followed by analytics
  – Alpha version for collection
  – Working on subsequent analytics
• Building a community around analytics (potentially one of many tools)
• Will make it available to the community
  – Optional interface to XDMoD/SUPReMM


XALT Goals

• Goal is a census of libraries and applications and automatic filtering of user issues
  – What additional user problems can we detect and report (perhaps correct) automatically?
  – How can we leverage lessons learned by the TACC_Stats team to implement additional automatic filtering?
  – Plan to add tracking of function calls as well

• Want to balance the need for portability with support for site-specific capabilities

• Want to simplify the processes system administrators use to install, configure, and manage


XALT Agenda

• New tracking infrastructure: XALT
• Alpha version available today
  – Deployed at NICS and TACC
  – LANL and CSCS testing it
• Some new functionality still to add
  – Detect function calls
  – Check runtime environment versus compile-time environment
  – Analytics
• SourceForge
  – http://sourceforge.net/projects/xalt/
  – [email protected]
• Want feedback, hungry for ideas


Thanks to

• Richard Gerber and Zhengji Zhao, NERSC
• Tim Robinson, CSCS
• Bill Barth, TACC
• Bilel Hadri, KAUST
• Julius Westerman, LANL


Contact Info

• Mark R. Fahey
  – [email protected]
• Robert McLay
  – [email protected]
