Progress on Release, API Discussions, Vote on APIs, and Quarterly Report
Al Geist
May 6-7, 2004
Chicago, IL
Coordinator: Al Geist
Participating Organizations
ORNL, ANL, LBNL, PNNL
PSC, SDSC, IBM, SGI
SNL, LANL, Ames, NCSA
Cray, Intel
How do we position ourselves for the DOE Ultrascale facility winner, to be announced May 12?
Regardless of who is chosen, we should try to be in a position to help with the system software needs of the facility.
IBM, Cray, Intel, SGI
Scalable Systems Software
Participating Organizations
ORNL, ANL, LBNL, PNNL
NCSA, PSC, SDSC
SNL, LANL, Ames
Problem
• Computer centers use an incompatible, ad hoc set of systems tools
• Present tools are not designed to scale to multi-Teraflop systems
Goals
• Collectively (with industry) define standard interfaces between systems components for interoperability
• Create scalable, standardized management tools for efficiently running our large computing centers
Resource Management
Accounting & user mgmt
System Build & Configure
Job management
System Monitoring
To learn more visit www.scidac.org/ScalableSystems
Scalable Systems Software Suite: Updates to this diagram
[Diagram: components of the Scalable Systems Software Suite. Labels: standard XML interfaces; authentication; communication]
Components shown: Grid Interfaces; Meta Services (Meta Scheduler, Meta Monitor, Meta Manager); Accounting; Usage Reports; Event Manager; Service Directory; Scheduler; Allocation Management; Job Queue Manager; Node State Manager; System & Job Monitor; Process Manager; Checkpoint/Restart; Node Configuration & Build Manager; Hardware Infrastructure Manager; Validation & Testing; Packaging & Install
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
Scalable Systems Software Center
January 15-16
Argonne
Review of Last Meeting
Details in main project notebook
Highlights from Jan. mtg
Craig – 1280-node dual-Xeon cluster “Titanium” is available this evening to test the scalability of the SSS suite.
- One node will be used as the head node to install our suite and run on the entire cluster.
- Could build everything but Bamboo and ssslib, due to Xerces.
- Will begin to be available at 6 pm.
Late-night session on the 1280-node testbed
- PM ran at 1280, worked at 4000, hung at 6000.
- Warehouse had a problem at 1280 and took out the head node.
- RM components ran on the head node OK until Warehouse crashed it.
Scott Jackson – Gold running on 11 TF PNNL cluster
Thomas Naughton – 2nd release in March. Discussion of how many orgs in our group could shake down the tarball. Group feels it is better to have a few very reliable components than all components.
Highlights from Jan. mtg (cont.)
Rusty Lusk – Process Manager spec for first vote
Presentation and discussion…
- Who is responsible for limit enforcement, PM or QM? I.e., must use a certain amount of memory, must not execute OS commands (in general, things that happen after fork).
- Rusty says the question is good and he needs to think about how this may affect the interface.
Other items to think about:
- use of wildcard as “to be returned” operator – OK
- inclusion but don’t show me
- dynamic jobs and PM
- improve readability
Delay vote until we have a written proposal.
Highlights from Jan. mtg
Discussion of having two XML syntax styles (functional, object)
- Al says he would like to see one common syntax across the suite; he doesn’t care which one, as long as the whole group can agree.
Narayan – Restriction Syntax overview. An issue of uniqueness was raised, which Narayan is to take into consideration.
Rusty Lusk – Restriction Syntax on Chiba City
- David would like to see a paper on the requirements of the Chiba effort.
Andrew, Paul, and Craig offer to investigate a prototype translator, to see how, or if, it is possible.
Investigate standardization of tokens across the two syntaxes.
Scalable Systems Software Center
January-May
Progress Since Last Meeting
SciDAC PI mtg – March 22-24, 2004
In Charleston, SC, with several attending for Scalable Systems
- 2-page project summary report
- Annual report for Fred
- 20-minute talk, presented by Rusty
- Fred asked each ISIC to use a new speaker
- Poster presentation, by Stephen/John
Systems Software Suite 2nd Release
Target date March ’04, so we could announce it at the PI meeting. Real status?
SSS-OSCAR – will hear more in next talk
Need a way to test that the suite is installed correctly
Five Project Notebooks
A main notebook for general information
And individual notebooks for each working group
• Over 300 total pages
• BC and PM groups need to get specs into their notebooks
• Add Telecom meeting notes even if short (Kudos to RM group)
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on side bar or at “project notebooks” at bottom of page
Bi-Weekly Working Group Telecoms
RM notes are the only ones I see in the notebook
Resource management, scheduling, and accounting
Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”
Process management, monitoring, and checkpointing
Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910
Node build, configuration, and information service
Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)
Scalable Systems Software Center
May 6-7, 2004
This Meeting
Major Topics this Meeting
Stability of Systems Software Suite – second release is out. Are we ready for outside users?
Quarterly Report Due – would like to get one to Fred by end of May. Will need text from WG leaders.
Formal API presentations and voting - we left several things hanging last meeting
MICS PI Mtg - August 9-12 at Argonne. A good time to have a highlight of outside user(s)
SC04 Mtg - November in Pittsburgh. Talks? Tutorial? Birds of a feather?
Agenda – May 6
 8:30 Al Geist – Project Status
 9:15 Thomas Naughton – SSS OSCAR software suite release
Working Group Reports
- Progress report on what their group has done
- API proposals for adoption by the group
- Progress on software suite improvements
 9:30 Narayan Desai – Node Build, Configure
10:30 Break
11:30 Will McClendon – Validation and Testing
12:30 Lunch (on own – cafeteria)
 1:30 Ron Oldfield – ASAP testing, and formalism issues
 2:00 Paul Hargrove – Process Management
      Craig and Rusty
 3:00 Scott Jackson – Resource Management
 4:00 Paul/Craig – findings about trying to build a syntax translator
 4:30 Group discussion on getting outside users of 2nd release
 5:00 Al – Discussion on SC04, other conferences, papers, etc.
 5:30 Adjourn
Agenda – May 7
 8:30 Discussion, proposals, votes
      Craig – discussion
      Paul – straw vote on two syntaxes
      Rusty – Process Manager proposal (deferred)
      Scott – Allocation Manager proposal (deferred)
      Al – Quarterly report, papers, SC04, other meetings
10:30 Break
11:00 Al Geist – Release 2 and outside users (Jazz? Ram? NCSA? SNL?)
      MICS PI Mtg August at Argonne (news to come)
      Next meeting date: August 26-27, 2004
      Location: Argonne
12:00 Meeting ends
Meeting notes
Al Geist – presents project overview and goals for this meeting
Thomas Naughton – SSS-OSCAR
- In the tarball: Bamboo, BLCR, Gold, LAM/MPI, MAUI-SSS, SSSLib, Warehouse, MPD2
- SSSLib contains SD, EM, PM, BCM, NSM, NHw, plus communication
- Todo: bug tracker; test sss-oscar-v2a6-v3.0 for pre-release; documentation (use SciDAC review 1-pager); add license-sss to directory
- Need: a test suite and a few test machines to test on
- Discussion on APITest and who creates tests, etc. Each does individual
- Establish release schedule through SC04
- Add easier way for authors to “test just their stuff”
- SC04: fully tested release v1.0 with all SSS components; code freeze Friday, September 3
Meeting notes
Narayan Desai – Build and Configure
- Library improvements: bugfixes, testing of Java support, SSL testing
- Infrastructure improvements: sss Python library improvements, EM bugfixes
- BCM component usage experience
- Hardware infrastructure: still seeking purpose
- Restriction Syntax examples given and discussed
- Craig thankful that !d (don’t display this field) now works
- Uniqueness issue: default is to return all duplicates
- New flag “unique=true” to remove duplicates. Much discussion. Rusty suggests removing only duplicate lines.
- Paul brings up the problem with “action” commands, i.e. killing jobs twice
- Al says the problem is not solvable in general in restriction syntax
- Scott asked whether RMAP syntax can handle this
- Much work on the board, and a question of the atomicity of queries that require multiple SQL queries to complete
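The uniqueness discussion above can be illustrated with a minimal Python sketch. It is not the restriction syntax implementation; the row data and the `query` helper are hypothetical. By default every matching row comes back, duplicates included; with `unique=True`, only rows identical in every field are collapsed, along the lines of Rusty's suggestion to remove only duplicate lines.

```python
def query(rows, unique=False):
    """Return result rows; optionally drop rows that exactly repeat an earlier row."""
    if not unique:
        return list(rows)              # default: every duplicate is returned
    seen = set()
    result = []
    for row in rows:
        key = tuple(sorted(row.items()))   # the whole row is the identity
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

# Hypothetical query results: the first two rows are exact duplicates
rows = [
    {"node": "n1", "state": "up"},
    {"node": "n1", "state": "up"},
    {"node": "n2", "state": "up"},
]
all_rows = query(rows)
unique_rows = query(rows, unique=True)
```

This only concerns what a query returns; it does not by itself address the “action” command case Paul raised.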
Meeting notes
Will McClendon – Component Interface Testing
- APITest v0.1.2 is now available by FTP, after putting it under the GPL Cplant license
  ftp://ftp.sandia.gov/outgoing/apitest (also in notebook)
- Not integrated back into ssslib
- HTTP interface development
- “Twisted Python” framework (info and www.effbot.org)
- Scott helped find a bug in Python popen3; now uses Twisted SpawnProcess
- Better support for browsing test data within session
- Batch and test data stored in memory in XML file format; writing out data to file available soon
- Shows an XML example that runs a test. Several questions answered.
- Shows an XML batch file example.
- Runs live demo; works fine. Discussion follows.
Ron Oldfield – replacing Eric DeBenedictis, who is moving to other SNL jobs
- ORNL help to set up a testing environment
- Testing for correct installation and individual tests, then whole-suite test
Meeting notes
Ron Oldfield (cont.)
- Simulating real workloads
- Performance and scalability testing needed in the future
- Portability is important for our reference implementation
- Discussion: code portability vs. feature portability
- Authorization also needs testing
- What are the issues in a lightweight OS?
- Standard naming conventions, both format and semantics
- Someone really needs to go through the existing schemas
- RMAP dictionary makes a good starting point
Paul Hargrove – Process management
- Development continues on all three components
- Syntax translation effort to be discussed later today
- Checkpoint: pre-emption (suspend and resume) works; checkpointing (ckpt works, restart in progress)
- Todo: migration; checkpoint file management so as not to overflow disks (list, delete)
- Query: “can I restart here?”
Meeting notes
Paul Hargrove – Process management (cont.)
- Suspend/resume works with Bamboo, SD, EM, OM, PM components
- Still need to design restart-time interactions with RM group
- Open files support under testing
- Bug fix releases as needed
- Checkpoint manager outstanding issues
- Implement full interface, using restriction syntax, event generation, error reporting
- Must implement file management (think ls and rm, expiration)
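A minimal Python sketch of the “think ls and rm, expiration” idea follows. The directory layout, the `.ckpt` suffix, and the function names are assumptions for illustration only; they are not the checkpoint manager's interface.

```python
import os
import tempfile
import time

def list_checkpoints(directory, suffix=".ckpt"):
    """'ls': return (name, size_bytes, mtime) for each checkpoint file."""
    entries = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(path):
            st = os.stat(path)
            entries.append((name, st.st_size, st.st_mtime))
    return entries

def expire_checkpoints(directory, max_age_seconds, now=None, suffix=".ckpt"):
    """'rm' with expiration: delete checkpoint files older than max_age_seconds."""
    now = time.time() if now is None else now
    removed = []
    for name, _size, mtime in list_checkpoints(directory, suffix):
        if now - mtime > max_age_seconds:
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed

# Demonstration in a throwaway directory (hypothetical file names)
demo_dir = tempfile.mkdtemp()
for name, age in (("job1.ckpt", 7200), ("job2.ckpt", 60)):
    path = os.path.join(demo_dir, name)
    with open(path, "wb") as f:
        f.write(b"\0" * 16)
    past = time.time() - age
    os.utime(path, (past, past))   # backdate the file by `age` seconds

removed = expire_checkpoints(demo_dir, max_age_seconds=3600)
remaining = [name for name, _size, _mtime in list_checkpoints(demo_dir)]
```

Reporting sizes in the listing is one way to support the stated goal of not overflowing disks; a real policy might expire by total space used rather than by age.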
Craig Steffan – no slides
- Tried run on 1280 nodes on Tungsten: failed; did run on 128
- Can now run on 1024 nodes. Being stopped by #sockets limit
- Harvesting can now be done of other info, e.g. Myrinet HW
- Next: adding support for “job” management; start interfacing with Build group; help to get it on Chiba
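One common cause of a socket-count ceiling is the per-process file descriptor limit, since each open socket consumes a descriptor (on many Linux systems the default soft limit is 1024). Whether that is the limit hit here is an assumption; the notes do not say. A minimal Python sketch (Unix only, with a hypothetical helper name) that checks whether a planned number of connections fits under the soft limit:

```python
import resource

def fits_descriptor_limit(connections, reserved=16):
    """True if `connections` sockets plus `reserved` other descriptors
    fit under this process's soft RLIMIT_NOFILE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True                    # no per-process cap configured
    return connections + reserved <= soft

# Inspect the current limits and check a 1280-connection plan
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
ok_for_1280 = fits_descriptor_limit(1280)
```

If the check fails, the soft limit can usually be raised up to the hard limit, or connections can be fanned out through intermediate nodes so that no single process holds them all.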
Meeting notes
Rusty Lusk – Process manager update
- PM component: added “limits” interface; dynamic jobs (mpi_comm_spawn) can spawn lots of nodes and then use “unused” ones as needed; show limits spec
- MPD2 improvements found by production use on Chiba: support for limits; support for mpi_comm_spawn; interactive debugging via mpigdb, which allows control of stdin, stderr, stdout
- Future: need to work more closely with QM; QM interface for requesting dynamic jobs
Meeting notes
Scott Jackson – Resource manager update
- Diagram on board
- Released SSSRMAP v3 spec. New things: wire protocol, message format, job groups
- Latest software release (in OSCAR) uses SSSRMAP v2
- Second release of Bamboo in March with epilogue and prologue support
- Gold now fully SSSRMAP v2: second alpha release due June, which will be in Perl (first release in Java ran into memory size limits); user guide done; first release running on PNNL’s SGI Altix
- Testing using APITest begun
- Silver: several various improvements in XML
- Future work: implement SSSRMAP v3 in the components; merger of Maui 3.2 and SSS; integrate chkpt/restart; limit enforcement (now SSS affects all Maui users); ability to handle dynamic jobs
[Diagram labels: Job group, Job, Task (T), Task group, Multi-step job]
Meeting notes
Paul – translator report (no slides)
- Looking at the two syntaxes to see if we could automate translation between SSSRMAP and restriction syntax (RS)
- Found: SSSRMAP can say 4<proc<16, but RS cannot
- RS band-aid: special operators to handle ranges
- For multiple-table queries, nested RS syntax doesn’t have the information (primary data type) to know how to combine multiple SQL results
- There is no way to translate between these cases.
Paul discourages the implementation of a translator.
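To illustrate the range case, here is a minimal Python sketch. It uses neither SSSRMAP nor restriction syntax; the tuple representation and field names are hypothetical. The point is only that 4<proc<16 is a conjunction of two comparisons on the same field, so a representation that allows one comparison per field needs an extra operator to express ranges.

```python
import operator

# Comparison operators supported by this toy representation
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def matches(record, conditions):
    """True if the record satisfies every (field, op, value) condition."""
    return all(OPS[op](record[field], value) for field, op, value in conditions)

# 4 < proc < 16 written as two conditions on the same field
range_query = [("proc", ">", 4), ("proc", "<", 16)]

jobs = [{"id": "a", "proc": 2}, {"id": "b", "proc": 8}, {"id": "c", "proc": 16}]
selected = [job["id"] for job in jobs if matches(job, range_query)]
```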
Meeting notes – Day 2
Craig – General thoughts on official V1.0 (no slides)
- Released at SC04, this will be the first time many people will see it
- Our orthogonal directions in syntax are damaging project progress towards V1.0 if we don’t make a decision soon
- Brett, who works with both, favors SSSRMAP. He likes its more descriptive nature and its OO nature.
Rusty says that we need two written proposals for a component that we can compare and vote on; otherwise we are just all talk.
Paul says that one is better but two is not too bad. Scott doesn’t think we can reconcile.
Paul asks for a straw vote for a preference; Scott seconds.
- SSSRMAP: 7 votes from 5 institutions (but one is Al)
- Restriction Syntax: 3 votes, all ANL
- Abstain: 3 votes from 2 institutions
Craig says he will do whatever it takes to make either work; he is going to make ssslib SSSRMAP work.
Neil says “users” are the guiding factor and RMAP is better there.
Paul says understandability and acceptability are key and RMAP is better.
Both say that RS is more compact and elegant.
Meeting notes – Day 2 (cont)
Narayan asks: does it just need documentation and tutorials?
Paul says no. There is a closer match for SOAP et al. The OO was not a factor in his choice, but it is more popular today.
Neil says potential users won’t have a Narayan to figure this out. Components are both client and server, so the developer has to know the syntax.
Rusty asks whether something could be added to RS to make it easier to use or understand. He is not sure it is a good idea.
Will – documentation is better in RMAP, and he has looked at RMAP more. Would all this stuff be more abstracted? Users do as little as they can and read the manual only after they get stuck. Doesn’t care as long as we pick ONE! Need to have the same look and feel across the project.
Rick – I don’t care which. I don’t like XML. What about the SD and EM that are already accepted?
Al says he feels that RMAP would be more acceptable to vendors, and that this would be critical to the long-term success of the project.
Paul says that the Process Manager document is not complete enough to vote on at this time.
Meeting notes – Day 2 (cont)
Discussion -