Progress on Release, API Discussions, Vote on APIs, and Quarterly Report Al Geist May 6-7, 2004 Chicago, ILL


Page 1

Progress on Release, API Discussions, Vote on APIs, and Quarterly Report

Al Geist
May 6-7, 2004
Chicago, ILL

Page 2

Coordinator: Al Geist

Participating Organizations

ORNL, ANL, LBNL, PNNL

PSC, SDSC, IBM, SGI

SNL, LANL, Ames, NCSA

Cray, Intel

How do we position ourselves for the DOE Ultrascale facility winner, to be announced May 12?

Regardless of who is chosen we should try to be in a position to help with the system software needs of the facility.

Page 3

IBM, Cray, Intel, SGI

Scalable Systems Software

Participating Organizations

ORNL, ANL, LBNL, PNNL

NCSA, PSC, SDSC

SNL, LANL, Ames

Problem

• Computer centers use an incompatible, ad hoc set of system tools

• Present tools are not designed to scale to multi-Teraflop systems

Goals

• Collectively (with industry) define standard interfaces between system components for interoperability

• Create scalable, standardized management tools for efficiently running our large computing centers

Resource Management

Accounting & User Management

System Build & Configure

Job Management

System Monitoring

To learn more, visit www.scidac.org/ScalableSystems

Page 4

Scalable Systems Software Suite (updates to this diagram)

Diagram components: Grid Interfaces; Meta Services; Meta Scheduler; Meta Monitor; Meta Manager; Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint/Restart; Validation & Testing; Hardware Infrastructure Manager; Packaging & Install. Standard XML interfaces; authentication; communication.

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite.

Page 5

Scalable Systems Software Center
January 15-16, Argonne

Review of Last Meeting

Details in main project notebook

Page 6

Highlights from Jan. mtg

Craig – 1280 dual-Xeon cluster “Titanium” is available this evening to test the scalability of the SSS suite. One node will be used as the head node to install our suite and run on the entire cluster. Could build everything but Bamboo and ssslib, due to Xerces. Will begin to be available at 6 pm.

Late-night session on the 1280-node testbed: PM ran at 1280; worked at 4000, hung at 6000. Warehouse had a problem at 1280 and took out the head node. RM components ran on the head node OK until Warehouse crashed it.

Scott Jackson – Gold running on 11 TF PNNL cluster

Thomas Naughton – 2nd release in March. Discussion of how many orgs in our group could shake down the tarball. Group feels it is better to have a few very reliable components than all components.

Page 7

Highlights from Jan. mtg (cont.)

Rusty Lusk – Process Manager spec for first vote. Presentation and discussion… Who is responsible for limit enforcement, PM or QM? I.e., must use a certain amount of memory, must not execute an OS command (in general, things that happen after fork). Rusty says the question is good and he needs to think about how this may affect the interface. Other items to think about:
- use of wildcard as “to be returned” operator – OK
- inclusion but don’t show me
- dynamic jobs and PM
- improve readability

Delay vote until we have a written proposal.

Page 8

Highlights from Jan. mtg

Discussion of having two XML syntax styles (functional, object). Al says he would like to see one common style across the suite; he didn’t care which one, as long as the whole group could agree.

Narayan – Restriction Syntax Overview. An issue of uniqueness was brought up and was to be taken into consideration by Narayan.

Rusty Lusk – Restriction Syntax on Chiba City. David would like to see a paper on the requirements of the Chiba effort.

Andrew, Paul, and Craig offer to investigate a prototype translator to see whether and how it is possible.

Investigate standardization of tokens across the two syntaxes.

Page 9

Scalable Systems Software Center

January-May

Progress Since Last Meeting

Page 10

SciDAC PI mtg – March 22-24, 2004

In Charleston, SC, with several attending for Scalable Systems:

• 2-page project summary report

• Annual report for Fred

• 20-minute talk – presented by Rusty (Fred asked each ISIC to use a new speaker)

• Poster presentation – by Stephen/John

Page 11

Systems Software Suite 2nd Release

Target date March ’04, so we could announce it at the PI meeting. Real status?

SSS-OSCAR – will hear more in the next talk. Need a way to test that the suite is installed correctly.

Page 12

Five Project Notebooks

A main notebook for general information

And individual notebooks for each working group

• Over 300 total pages

• BC and PM groups need to get specs into their notebooks

• Add Telecom meeting notes even if short (Kudos to RM group)

Get to all notebooks through the main web site www.scidac.org/ScalableSystems

Click on the sidebar or on “project notebooks” at the bottom of the page

Page 13

Bi-Weekly Working Group Telecoms

RM is the only group whose notes I see in the notebook.

Resource management, scheduling, and accounting

Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”

Process management, monitoring, and checkpointing

Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910

Node build, configuration, and information service

Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)

Page 14

Scalable Systems Software Center

May 6-7, 2004

This Meeting

Page 15

Major Topics this Meeting

Stability of Systems Software Suite – second release is out. Are we ready for outside users?

Quarterly Report Due – would like to get one to Fred by end of May. Will need text from WG leaders.

Formal API presentations and voting - we left several things hanging last meeting

MICS PI Mtg - August 9-12 at Argonne. A good time to have a highlight of outside user(s)

SC04 Mtg - November in Pittsburgh. Talks? Tutorial? Birds of a feather?

Page 16

Agenda – May 6

8:30 Al Geist – Project status
9:15 Thomas Naughton – SSS-OSCAR software suite release

Working Group Reports:
• Progress report on what their group has done
• API proposals for adoption by the group
• Progress on software suite improvements

9:30 Narayan Desai – Node Build, Configure
10:30 Break
11:30 Will McClendon – Validation and Testing
12:30 Lunch (on own – cafeteria)
1:30 Ron Oldfield – ASAP testing, and formalism issues
2:00 Paul Hargrove – Process Management (with Craig and Rusty)
3:00 Scott Jackson – Resource Management
4:00 Paul/Craig – findings about trying to build a syntax translator
4:30 Group discussion on getting outside users of 2nd release
5:00 Al – Discussion on SC04, other conferences, papers, etc.
5:30 Adjourn

Page 17

Agenda – May 7

8:30 Discussion, proposals, votes
Craig – discussion
Paul – straw vote on the two syntaxes
Rusty – Process Manager proposal (deferred)
Scott – Allocation Manager proposal (deferred)
Al – Quarterly report, papers, SC04, other meetings

10:30 Break

11:00 Al Geist – Release 2 and outside users (Jazz? Ram? NCSA? SNL?)
MICS PI Mtg, August at Argonne (news to come)
Next meeting date: August 26-27, 2004
Location: Argonne

12:00 Meeting ends

Page 18

Meeting notes

Al Geist – presents project overview and goals for this meeting

Thomas Naughton – SSS-OSCAR: the tarball contains Bamboo, BLCR, Gold, LAM/MPI, MAUI-SSS, SSSLib, Warehouse, MPD2. SSSLib contains SD, EM, PM, BCM, NSM, NHw, plus communication.
Todo: bug tracker; test sss-oscar-v2a6-v3.0 for pre-release; documentation (use SciDAC review 1-pager); add license-sss to directory.
Need: a test suite and a few test machines to test on.
Discussion on APItest and who creates tests, etc. Each does individual tests.
Establish release schedule through SC04.
Add an easier way for authors to “test just their stuff.”
SC04 – fully tested release v1.0 with all SSS components

Code freeze Friday, September 3

Page 19

Meeting notes

Narayan Desai – Build and Configure
Library improvements: bug fixes, testing of Java support, SSL testing
Infrastructure improvements: sss Python library improvements, EM bug fixes
BCM component usage experience

Hardware infrastructure – still seeking purpose.
Restriction syntax examples given and discussed.

Craig is thankful that !d (don’t display this field) now works.
Uniqueness issue: the default is to return all duplicates.

New flag “unique=true” to remove duplicates; much discussion. Rusty suggests removing only duplicate lines. Paul brings up the problem with “action” commands, i.e., killing jobs twice.

Al says the problem is not solvable in general in restriction syntax. Scott asked whether RMAP syntax can handle this. Much work on the board, and the question of atomicity for queries that require multiple SQL queries to complete.
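To make the uniqueness discussion concrete, here is a minimal Python sketch. The names `query_result`, `unique`, and `kill_targets`, and the record fields `JobId` and `Node`, are hypothetical illustrations, not part of ssslib or of either syntax specification. It shows why removing duplicate lines, as Rusty suggests, is not by itself enough for action commands: two distinct lines can still name the same job.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]  # hypothetical: one result line as field -> value

def _line_key(rec: Record) -> Tuple[Tuple[str, str], ...]:
    # Two lines are duplicates when every field/value pair is identical.
    return tuple(sorted(rec.items()))

def query_result(records: List[Record], unique: bool = False) -> List[Record]:
    """Default: return every matching line, duplicates included.
    unique=True: drop repeated lines, keeping the first occurrence in order."""
    if not unique:
        return list(records)
    seen = set()
    out: List[Record] = []
    for rec in records:
        key = _line_key(rec)
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out

def kill_targets(records: List[Record], id_field: str = "JobId") -> List[str]:
    """For an action command, deduplicate on the object's identity rather than
    on whole lines, so each job is acted on once even if it appears on several
    distinct lines."""
    seen = set()
    ids: List[str] = []
    for rec in records:
        jid = rec[id_field]
        if jid not in seen:
            seen.add(jid)
            ids.append(jid)
    return ids

records = [
    {"JobId": "17", "Node": "n1"},
    {"JobId": "17", "Node": "n1"},  # exact duplicate line
    {"JobId": "17", "Node": "n2"},  # same job, different line
]
print(len(query_result(records)), len(query_result(records, unique=True)), kill_targets(records))
# prints: 3 2 ['17']
```

Line-level deduplication still leaves two rows for job 17, one per node; only deduplicating on the job identifier yields a single kill target. This is one reading of why the notes treat action commands as a separate problem.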

Page 20

Meeting notes

Will McClendon – Component Interface Testing
APITest v0.1.2 is now available by FTP, having been put under the GPL Cplant license: ftp://ftp.sandia.gov/outgoing/apitest (also in notebook).
Not integrated back into ssslib.
HTTP interface development: “Twisted Python” framework info, and www.effbot.org.
Scott helped find a bug in Python popen3; now uses Twisted SpawnProcess.
Better support for browsing test data within a session.
Batch and test data stored in memory in an XML file format.

Writing out data to a file available soon.
Shows an XML example that runs a test; several questions answered.
Shows an XML batch file example.
Runs live demo – works fine. Discussion follows.

Ron Oldfield – replacing Eric DeBenedictis, who is moving to other SNL jobs.
- ORNL help set up a testing environment
- Testing for correct installation and individual tests, then whole-suite test

Page 21

Meeting notes

Ron Oldfield (cont.) – simulating real workloads
Performance and scalability testing needed in the future.
Portability is important for our reference implementation.
Discussion: code portability vs. feature portability.
Authorization also needs testing.

What are the issues in lightweight OSes?
Standard naming conventions, both format and semantics.

Someone really needs to go through the existing schemas; the RMAP dictionary makes a good starting point.

Paul Hargrove – Process Management
Still continuing development on all three components.
Syntax translation effort to be discussed later today.
Checkpoint:
- pre-emption (suspend and resume) works
- checkpointing (checkpoint works, restart in progress)
Todo: migration; checkpoint file management so checkpoints do not overflow disks (list, delete); query “can I restart here?”

Page 22

Meeting notes

Paul Hargrove – Process Management (cont.)
Suspend/resume works with Bamboo, SD, EM, OM, and PM components.
Still need to design restart-time interactions with the RM group.
Open-files support under testing.
Bug-fix releases as needed.
Checkpoint manager outstanding issues:
- implement full interface using restriction syntax, event generation, error reporting
- must implement file management (think ls and rm, expiration)
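The file-management item (“think ls and rm, expiration”) might look roughly like the Python sketch below. It assumes, hypothetically, that checkpoints are plain files with a `.ckpt` suffix in a single directory and that age is judged by modification time; the notes do not specify the checkpoint manager's actual layout or interface, and the function names are illustrative only.

```python
import time
from pathlib import Path
from typing import List, Optional

CKPT_SUFFIX = ".ckpt"  # hypothetical naming convention for checkpoint files

def list_checkpoints(ckpt_dir: str) -> List[Path]:
    """'ls': return checkpoint files in ckpt_dir, oldest first."""
    files = [p for p in Path(ckpt_dir).iterdir()
             if p.is_file() and p.suffix == CKPT_SUFFIX]
    return sorted(files, key=lambda p: p.stat().st_mtime)

def remove_checkpoint(path: Path) -> None:
    """'rm': delete a single checkpoint file."""
    path.unlink()

def expire_checkpoints(ckpt_dir: str, max_age_s: float,
                       now: Optional[float] = None) -> List[Path]:
    """Expiration: delete checkpoints older than max_age_s seconds so they
    cannot fill the disk; return the paths that were removed."""
    now = time.time() if now is None else now
    removed: List[Path] = []
    for p in list_checkpoints(ckpt_dir):
        if now - p.stat().st_mtime > max_age_s:
            remove_checkpoint(p)
            removed.append(p)
    return removed
```

Passing `now` explicitly keeps expiration deterministic for testing; a real manager would more likely key expiration on job state (for example, a job that has completed) than on file age alone.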

Craig Steffen – no slides
Tried a run on 1280 nodes on Tungsten; it failed, but did run on 128. Can now run on 1024 nodes. Being stopped by the #sockets limit.
Harvesting of other info can now be done, e.g., Myrinet HW.
Next: adding support for “job” management; start interfacing with the Build group; help to get it on Chiba.

Page 23

Meeting notes

Rusty Lusk – Process Manager update
PM component:
- added “limits” interface
- dynamic jobs (mpi_comm_spawn): can spawn lots of nodes and then use “unused” ones as needed
- shows limits spec
MPD2 improvements found by production use on Chiba:
- support for limits
- support for mpi_comm_spawn
- interactive debugging via mpigdb, which allows control of stdin, stderr, stdout
Future: need to work more closely with QM; QM interface for requesting dynamic jobs.
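The limits question raised in January (who enforces limits that apply after fork, PM or QM) can be illustrated with the Python standard library on POSIX systems. This is a hypothetical sketch: `launch_with_memory_limit` is not part of MPD2 or the PM interface. It only shows a launcher applying an address-space limit in the child, between fork() and exec(), using `preexec_fn` and `resource.setrlimit`.

```python
import resource
import subprocess
from typing import List

def launch_with_memory_limit(argv: List[str], max_bytes: int) -> subprocess.CompletedProcess:
    """Run argv with its address-space limit capped at max_bytes (POSIX only)."""
    def apply_limits() -> None:
        # Executes in the child after fork() and before exec(),
        # i.e. the "things that happen after fork" from the notes.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # An unprivileged process may lower but not raise its hard limit.
        cap = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
    return subprocess.run(argv, preexec_fn=apply_limits,
                          capture_output=True, text=True)
```

`preexec_fn` is not safe in multithreaded parents, so a production launcher would more likely set limits in a small wrapper executable. The sketch also leaves open the design question from the notes: whichever component calls it (PM or QM) becomes the enforcer.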

Page 24

Meeting notes

Scott Jackson – Resource Manager update
Diagram on board.
Released SSSRMAP v3 spec. New things:
- wire protocol
- message format
- job groups
Latest software release (in OSCAR) uses SSSRMAP v2.
Second release of Bamboo in March, with epilogue and prologue support.
Gold now fully SSSRMAP v2:
- second alpha release due June, which will be in Perl (first release in Java ran into memory size limits)
- user guide done
- first release running on PNNL’s SGI Altix
Testing using APITest begun.
Silver: various improvements in XML.
Future work:
- implement SSSRMAP v3 in the components
- merger of Maui 3.2 and SSS
- integrate chkpt/restart
- limit enforcement (now SSS affects all Maui users)
- ability to handle dynamic jobs

Board diagram: Job group; Job (tasks T T T T); Task group (T); Multi-step job (Job, Job)
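One possible reading of the board diagram, expressed as a hypothetical Python data model: a job is made of tasks, tasks can be collected into task groups, and a job group (for example, a multi-step job) holds several jobs in order. The class and field names are illustrative only; the authoritative definitions of job groups are in the SSSRMAP v3 spec, which is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    node: str  # hypothetical: where this task runs

@dataclass
class TaskGroup:
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Job:
    name: str
    task_groups: List[TaskGroup] = field(default_factory=list)

    def task_count(self) -> int:
        return sum(len(g.tasks) for g in self.task_groups)

@dataclass
class JobGroup:
    # e.g. the steps of a multi-step job, kept in execution order
    jobs: List[Job] = field(default_factory=list)

    def task_count(self) -> int:
        return sum(j.task_count() for j in self.jobs)

# A two-step job: step 1 runs four tasks, step 2 runs one.
step1 = Job("step1", [TaskGroup([Task(f"n{i}") for i in range(4)])])
step2 = Job("step2", [TaskGroup([Task("n0")])])
multi_step = JobGroup([step1, step2])
```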

Page 25

Meeting notes

Paul – translator report (no slides)
Looked at the two syntaxes to see whether translation between SSSRMAP and restriction syntax could be automated.

Found:
- SSSRMAP can say 4 < proc < 16, but RS cannot. RS band-aid: special operators to handle ranges.
- For multiple-table queries, nested RS syntax does not have the information (primary data type) to know how to combine multiple SQL results. There is no way to translate between these cases.

Paul discourages the implementation of a translator.
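The range finding can be sketched under one labeled assumption: if restriction syntax carries each field as a single XML attribute, it can hold only one comparison per field, because XML does not allow an attribute to repeat on one element. An open range such as 4 < proc < 16 then needs either two conditions on the same field or a special range operator. The `Cond` type and the `expand_range`/`matches` helpers below are hypothetical and belong to neither syntax; they only show the two-condition form a range operator would have to expand into.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical comparison names; neither SSSRMAP nor RS is being quoted here.
OPS: Dict[str, Callable[[int, int], bool]] = {
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge, "eq": operator.eq,
}

@dataclass(frozen=True)
class Cond:
    name: str   # field being constrained, e.g. "proc"
    op: str     # key into OPS
    value: int

def expand_range(name: str, low: int, high: int) -> List[Cond]:
    """Open range low < name < high as two conditions on the same field."""
    return [Cond(name, "gt", low), Cond(name, "lt", high)]

def matches(record: Dict[str, int], conds: List[Cond]) -> bool:
    """A record matches only if every condition holds (logical AND)."""
    return all(OPS[c.op](record[c.name], c.value) for c in conds)

conds = expand_range("proc", 4, 16)
print([matches({"proc": n}, conds) for n in (4, 8, 16)])  # [False, True, False]
```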

Page 26

Meeting notes – Day 2

Craig – General thoughts on official V1.0 (no slides)
Released at SC04, this will be the first time many people will see it. Our orthogonal directions in syntax are damaging project progress towards V1.0 if we don’t make a decision soon.
Brett, who works with both, favors SSSRMAP. He likes its more descriptive and OO nature.
Rusty says that we need two written proposals for a component that we can compare and vote on; otherwise we are just all talk.
Paul says one is better, but two is not too bad.
Scott doesn’t think we can reconcile them.
Paul asks for a straw vote on a preference; Scott seconds.
• SSSRMAP – 7 (5 institutions, but one is Al)
• Restriction Syntax – 3 (all ANL)
• Abstain – 3 (2 institutions)
Craig says he will do whatever it takes to make either work; he is going to make ssslib SSSRMAP work.
Neil says “users” are the guiding factor, and RMAP is better there.
Paul says understandability and acceptability are key, and RMAP is better.
Both say that RS is more compact and elegant.

Page 27

Meeting notes – Day 2 (cont.)

Narayan asks: does it just need documentation and tutorials? Paul says no. There is a closer match to SOAP et al. The OO was not a factor in his choice, but it is more popular today.
Neil says potential users won’t have a Narayan to figure this out. Components are both client and server, so the developer has to know the syntax.
Rusty – if there were something else added to RS that made it easier to use or understand… He is not sure it is a good idea.
Will – documentation is better in RMAP, and he has looked at RMAP more. Would all this stuff be more abstracted? Users do as little as they can and read the manual only after they get stuck. Doesn’t care, as long as we pick ONE! Need to have the same look and feel across the project.
Rick – I don’t care which. I don’t like XML. What about the SD and EM that are already accepted?
Al says that he feels RMAP would be more acceptable to vendors, and that this would be critical to the long-term success of the project.

Paul says that the Process Manager document is not complete enough to vote on at this time.

Page 28

Meeting notes – Day 2 (cont.)

Discussion -