
GridPP use- interoper- communic- ability

Tony Doyle

“Gridability”, 1 November 2006, Tony Doyle - University of Glasgow

Introduction

A. Is the system usable?
B. How will GridPP and NGS interoperate?
C. Communication and discussion


A. “Usability” (Prequel)

• GridPP runs a major part of the EGEE/LCG Grid, which supports ~3000 users
• The Grid is not (yet) as transparent as end-users want it to be
• The underlying overall failure rate is ~10%
• User (interface)s, middleware and operational procedures (need to) adapt
• (See the talk by Jeremy for more information on performance and operations)
• Procedures to manage the underlying problems so that the system is usable are highlighted
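To see why procedures matter at a ~10% failure rate, the sketch below computes the chance that a job eventually succeeds if it is simply resubmitted. It assumes each attempt fails independently with the same probability, which is an idealisation: real Grid failures are often correlated (a misconfigured site can fail many jobs in a row).

```python
def success_within(attempts: int, failure_rate: float = 0.10) -> float:
    """Probability that a job succeeds within `attempts` tries, ASSUMING each
    attempt fails independently with probability `failure_rate`."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    return 1.0 - failure_rate ** attempts


def expected_attempts(failure_rate: float = 0.10) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    return 1.0 / (1.0 - failure_rate)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} attempt(s): {success_within(n):.3f}")
    print(f"mean attempts per job: {expected_attempts():.3f}")
```

Under this assumption one automatic resubmission lifts the success probability from 0.90 to 0.99, at a cost of about 11% extra attempts on average.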


[Figure: EGEE CPU hours, 1 April 2006 to 31 July 2006, annotated "5 million hours"]

An “active” user requires thousands of CPU hours


Virtual Organisations

• Users are grouped into Virtual Organisations (VOs)
 – VO membership varies from 1 to 806 users (and growing)
• Broadly four classes of VO
 – LHC experiments
 – EGEE supported
 – Worldwide (mainly non-LHC particle physics)
 – Local/regional, e.g. UK PhenoGrid
• Sites can choose which VOs to support, subject to MoU/funding commitments
 – Most GridPP sites support ~20 VOs
 – GridPP nominally allocates 1% of resources to EGEE non-HEP VOs
 – GridPP currently contributes 30% of the EGEE CPU resources


User View?

• Perspective matters. This is not
 – a usability survey
 – unbiased
 – representative
• Straw poll
 – users overcame initial registration hurdles within ~two weeks
 – users adapt to the Grid in (un-)coordinated ways
 – the Grid was sufficiently flexible for many analysis applications


Physics Analysis

[Figure: physics analysis data flow. Boxes: Raw Data; Calibration Data; ESD (Data or Monte Carlo); Event Tags and Event Selection; Analysis Object Data (AOD); Analysis, Skims; Physics Objects. Levels: Collaboration-wide Tasks, Analysis Groups, Individual Physicists. Side label: INCREASING DATA FLOW]


User evolution

Number of UK Grid users (excluding Deployment Team)
Quarter:   05Q4   06Q2   06Q3
Value:     1342   1831   2777

Many EGEE VOs supported, c.f. 3000 EGEE target

Number of active users (> 10 jobs per month)
Quarter:   05Q4   06Q1   06Q2
Value:     83     166    201
Fraction:  6.2%   n/a    11.0%

Viewpoint: growing fairly rapidly, but perhaps not as active as they could be? It depends on the definition of “active”.
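The fraction row appears to be the active-user count divided by the total user count for the same quarter: 83/1342 ≈ 6.2% for 05Q4 and 201/1831 ≈ 11.0% for 06Q2, with no total given for 06Q1. A minimal sketch reproducing that arithmetic from the figures above:

```python
from typing import Optional

# Figures from the slide: all UK Grid users, and active users (> 10 jobs/month).
total_users = {"05Q4": 1342, "06Q2": 1831, "06Q3": 2777}
active_users = {"05Q4": 83, "06Q1": 166, "06Q2": 201}


def active_fraction(quarter: str) -> Optional[float]:
    """Active users as a percentage of all users, or None when either
    count is missing for that quarter."""
    if quarter not in total_users or quarter not in active_users:
        return None
    return 100.0 * active_users[quarter] / total_users[quarter]


if __name__ == "__main__":
    for q in ("05Q4", "06Q1", "06Q2"):
        frac = active_fraction(q)
        print(q, "n/a" if frac is None else f"{frac:.1f}%")
```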


Know your users? UK-enabled VOs

Members  VO
806  atlas
763  dzero
577  cms
566  dteam
150  lhcb
131  alice
75   bio
65   dteamsgm
41   esr
31   ilc
27   atlassgm
27   alicesgm
21   cmsprg
18   atlasprg
17   fusn
15   zeus
13   dteamprg
13   cmssgm
11   hone
9    pheno
9    geant
7    babar
6    aliceprg
5    lhcbsgm
5    biosgm
3    babarsgm
2    zeussgm
2    t2k
2    geantsgm
2    cedar
1    phenosgm
1    minossgm
1    lhcbprg
1    ilcsgm
1    honesgm
1    cdf


User Interface

• The GUI is relatively low-level (jobs, file collections)
• Dynamic panels for higher-level functions

[Screenshot of the Ganga GUI, with labelled panels: Job details, Logical Folders, Job Monitoring, Log window, Job builder, Scriptor, Dockable windows]


Complex Applications

ATLAS
• GANGA software framework (jointly with LHCb)
• Data challenges
• Producing Monte Carlo data: 10 million CPU hours per year

CMS
• Monte Carlo production, data transfer, job submission
• CMS transfers have topped a petabyte a month for the last three months

LHCb
• DIRAC software to submit analysis jobs using the Grid
• 2006 analysis job completion efficiency improved to 91%
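For scale, a petabyte a month corresponds to a sustained average rate that can be worked out directly. The sketch assumes a decimal petabyte (10^15 bytes) and a 30-day month, and gives roughly 386 MB/s, i.e. about 3.1 Gb/s averaged round the clock.

```python
SECONDS_PER_DAY = 86_400


def average_rate(bytes_moved: float, days: float) -> float:
    """Mean transfer rate in bytes per second sustained over `days` days."""
    return bytes_moved / (days * SECONDS_PER_DAY)


PETABYTE = 1e15  # decimal petabyte (assumption); a binary PiB, 2**50 bytes, is ~12.6% larger

if __name__ == "__main__":
    rate = average_rate(PETABYTE, days=30)  # assume a 30-day month
    print(f"{rate / 1e6:.0f} MB/s, or {rate * 8 / 1e9:.2f} Gb/s")
```

Peak rates during transfer tests would need to sit well above this average, which is why link capacity matters when scheduling them.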


WLCG MoU

• Particle physicists collaborate, play roles and delegate, e.g.
 – “prg”: production group
 – “sgm”: software group managers
• Underpinned by Memoranda of Understanding
• Current MoU signatories: China, France, Germany, Italy, India, Japan, Netherlands, Pakistan, Portugal, Romania, Taiwan, UK, USA
• Pending signatures: Australia, Belgium, Canada, Czech Republic, Nordic, Poland, Russia, Spain, Switzerland, Ukraine
• Negotiation w.r.t. resource and service level


Resource allocation

• Need to assign quotas and priorities to VOs and measure delivery
• VOMS provides group/role information in the proxy
• Tools to control quotas and priorities in site services are being developed
 – So far only at whole-VO level
 – The Maui batch scheduler is flexible and easy to map to groups/roles
 – Sites set the target shares
 – Sites can publish VO/group-specific values in the GLUE schema, so the RB can use them for scheduling
• Accounting tool (APEL) measures CPU use at global level (UK task)
 – Storage accounting currently being added
 – GridPP monitors storage across the UK
 – Privacy issues around user-level accounting are being solved by encryption
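The target-share idea can be sketched as a generic fair-share rule. This is an illustration under stated assumptions, not Maui's actual priority formula: the target shares and usage figures are hypothetical, and usage records are keyed by VOMS FQAN (e.g. /atlas/Role=production/Capability=NULL) but aggregated at whole-VO level, as the slide says current tools do.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical site configuration: target share (% of CPU) per VO.
TARGET_SHARES: Dict[str, float] = {"atlas": 50.0, "cms": 30.0, "pheno": 20.0}


def vo_from_fqan(fqan: str) -> str:
    """Return the VO name: the first path component of a VOMS FQAN
    such as '/atlas/Role=production/Capability=NULL'."""
    return fqan.strip("/").split("/")[0]


def delivered_shares(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Turn (fqan, cpu_hours) accounting records into % of total CPU per VO."""
    per_vo: Dict[str, float] = defaultdict(float)
    for fqan, hours in records:
        per_vo[vo_from_fqan(fqan)] += hours
    total = sum(per_vo.values())
    if total == 0:
        return {}
    return {vo: 100.0 * hours / total for vo, hours in per_vo.items()}


def fairshare_priority(records: Iterable[Tuple[str, float]],
                       weight: float = 1.0) -> Dict[str, float]:
    """Generic fair-share rule (an assumption, not Maui's exact formula):
    a VO's priority rises the further its delivered share is below target."""
    used = delivered_shares(records)
    return {vo: weight * (target - used.get(vo, 0.0))
            for vo, target in TARGET_SHARES.items()}


if __name__ == "__main__":
    usage = [
        ("/atlas/Role=production/Capability=NULL", 600.0),
        ("/atlas/Role=NULL/Capability=NULL", 100.0),
        ("/cms/Role=NULL/Capability=NULL", 300.0),
    ]
    ranked = sorted(fairshare_priority(usage).items(), key=lambda kv: -kv[1])
    for vo, priority in ranked:
        print(f"{vo:6s} {priority:+.1f}")
```

Here ATLAS has used 70% of the CPU against a 50% target, so it drops to the bottom, while pheno (no usage, 20% target) moves to the top. Extending shares below whole-VO level would mean keying the targets on group/role rather than on `vo_from_fqan`.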


User Support

• Becoming vital as the number of users grows
 – But only modest effort is available in the various projects
• The Global Grid User Support (GGUS) portal at Karlsruhe provides a central ticket interface
 – Problems are categorised
• Tickets are classified by an on-duty Ticket Process Manager (TPM) and assigned to an appropriate support unit
 – UK (GridPP) contributes support effort
• GGUS has a web-service interface to the ticketing systems at each ROC
 – Other support units are local mailing lists
 – Mostly best-effort support, working hours only
• Currently ~tens of tickets/week
 – Manageable, but may not scale much further
 – Some tickets slip through the net
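The classify-then-assign flow can be sketched in a few lines. Everything here is hypothetical (the category names, the support-unit labels and the `Ticket` type); it models only the routing step described above, not the GGUS web-service interface. Keeping tickets with an unrecognised category with the TPM, rather than guessing a destination, is one way to stop them slipping through the net.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical categories and support units; the real GGUS ones are not listed here.
SUPPORT_UNITS: Dict[str, str] = {
    "site": "ROC ticketing system",
    "middleware": "middleware support unit",
    "vo-software": "VO support mailing list",
}
FALLBACK = "Ticket Process Manager"


@dataclass
class Ticket:
    ticket_id: int
    category: str
    assigned_to: str = "unassigned"


def triage(ticket: Ticket) -> Ticket:
    """Route a categorised ticket to a support unit, as the on-duty TPM does;
    unrecognised categories stay with the TPM for manual review."""
    ticket.assigned_to = SUPPORT_UNITS.get(ticket.category, FALLBACK)
    return ticket


if __name__ == "__main__":
    for t in (Ticket(1, "site"), Ticket(2, "vo-software"), Ticket(3, "unknown")):
        print(t.ticket_id, "->", triage(t).assigned_to)
```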


Documentation & Training

• Need documentation and training for both system managers and users
 – Mostly expert users up to now, but the user community is expanding
 – Induction of new VOs is a particular problem: no peer support
 – EGEE is running User Fora for users to share experience
  • Next in Manchester in May ’07 (with OGF)
 – EGEE has a dedicated training activity run by NeSC/Edinburgh
• Documentation is often a low priority, with little dedicated effort
 – The rapid pace of change means that material requires constant review
• Effort on documentation is now increasing
 – GridPP has appointed a documentation officer
  • GridPP web site, wiki
 – The installation manual for admins is good
  • There is also a wiki for admins to share experience
 – Focus is now on user documentation
  • New EGEE web site coming soon


Alternative view?

• The number of users in the Grid School for the Gifted is ~manageable now

• The system may be too complex, requiring too much work by the “average user”?

• Or the (virtual) help desk may not be enough?

• Or the documentation may be misleading?

• Or...

• Having smart users helps (the current ones are)


B. “Interoperability”

• GridPP/NGS meeting, Nottingham EMCC, September 2006

• Present: Tony Doyle, David Britton, Paul Jeffreys, David Wallom, Robin Middleton, Andy Richards, Stephen Pickles, Steven Young, Dave Colling, Peter Clarke, Neil Geddes

• Agenda:
 1. Ultimate goals, the model for achieving them, and any constraints
 2. Timetables
 3. Required software (in both directions)


B. “Interoperability”

• Goals: a general discussion on what we might hope to achieve and why.

• Several key points were made:
 – It is an open question whether we ever actually need any closer partnership
 – GridPP is focused on a relatively immediate goal and will always be constrained in some way by the broader LCG requirements
 – NGS should be further from the bleeding edge in grid developments
 – An NGS affiliation and partnership model exists
 – GridPP T2s all have MoUs which will need revamping under GridPP3. This will be an ideal opportunity to formalise any relationship between GridPP (T2s) and the NGS.

• It is unclear who is using EGEE (in the UK) and who could or would want to use it

• EGEE-UKI needs to do a better PR job within the UK
• PhenoGrid is registering with EGEE


B. “Interoperability”

• The current "minimal software stack" approach of NGS is being reviewed as a greater variety of partner resources (data centres and research facilities) are considered

• Different "stacks" will be relevant to different sorts of partners, i.e. there is likely to be a range of "NGS Profiles"

• For the foreseeable future, NGS is likely to exist in a world with multiple parallel software stacks, and it will not be possible to merge them

• Installing parallel stacks or profiles is not a problem if they are easy to install and do not interfere

• One possibility is that the different NGS profiles would reflect different stacks, such as GT4 or gLite

• Operations: can we present accounting information consistently?


B. “Interoperability”

• What benefit is there in a GridPP site joining NGS?
 – Much less relevant for sites where the resources are essentially dedicated to HEP. Where facilities are shared with other fields, the generic and shared nature of the NGS can provide ready-made interfaces for the broader communities. We are clearly a long way from being able to merge both activities completely, e.g. GridPP requirements on monitoring and accounting could not currently be met by NGS nodes, and NGS would not require all partners to report à la GridPP. (Of course, this does not preclude project-specific layers, such as this accounting, on top of the basic NGS profiles for relevant partners.)

• There is a concern that "joining" the NGS would put an additional load on GridPP sites. Looking further ahead, the intention is that this is not the case: supporting the standard NGS profiles should be exactly the same work as required to meet (a subset of) the GridPP requirements. This can only be guaranteed if there is sufficient representation of GridPP sites within the NGS.


B. “Interoperability”

• Next steps/timetable
 – GridPP3 MoUs: no action required. This can wait until next year and should be informed by lessons learned over the next 6-12 months. GridPP sites currently meet the minimal requirements for NGS through the standard GridPP installations.

• If sites enable the NGS VO, this effectively gives them NGS affiliation, if they wish.

• Formal affiliation would, however, require that the interface be monitored by NGS. It was agreed that the next step should be to understand in detail what is actually required for NGS partnership.


B. “Interoperability”

• Next steps/timetable
 – Agreed to focus on two sites, Glasgow and LeSC, with the aim of being ready to achieve NGS “partnership” by Christmas 2006.
 – The decision on whether or not to actually apply for formal partnership can be left until later in the year. The principal goal is to understand the steps and requirements.
 – It was agreed that NGS should provide a gLite CE for core NGS nodes, which would allow the nodes to be part of the EGEE/LCG SAM infrastructure.

• Accounting and monitoring are areas which are still developing and where it is not clear what the best solution is (for NGS)

• Meet once more before Christmas.


=> Implementation…

• GU should concentrate on delivering:
 1. A job submission mechanism
 2. A method to prepare the job's environment (input files, etc.)

This means we can offer:
 1. gsissh login to the head node, with access to some shared space (e.g. the home directory for the NGS pool accounts)
 2. Job submission from the head node to the gatekeeper, using either GRAM (globus-job-submit) or EGEE methods (edg-job-submit)

This would seem to qualify us as an NGS partner site, comparing with http://www.grid-support.ac.uk/index.php?option=content&task=view&id=143

• The SLAs on offer seem none too onerous
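The two submission routes from the head node can be sketched as follows. This is a hedged illustration, not GU's actual setup: the file names (run.sh, input.dat, job.jdl), the arguments and the gatekeeper host ce.example.ac.uk are hypothetical, the JDL uses only the basic EDG/gLite attributes (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox), and the commands are only built and printed, never executed.

```python
import shlex
from typing import Dict, List


def make_jdl(executable: str, arguments: str, input_files: List[str]) -> str:
    """Build a minimal JDL job description for EGEE-style (edg-job-submit) submission."""
    def jdl_list(items: List[str]) -> str:
        return "{" + ", ".join(f'"{item}"' for item in items) + "}"

    return "\n".join([
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {jdl_list(input_files)};",   # files shipped with the job
        f"OutputSandbox = {jdl_list(['std.out', 'std.err'])};",
    ]) + "\n"


def submit_commands(jdl_path: str, gatekeeper: str, executable: str) -> Dict[str, List[str]]:
    """Command lines for the two routes (returned, not run)."""
    return {
        "egee": ["edg-job-submit", jdl_path],                 # via the resource broker
        "gram": ["globus-job-submit", gatekeeper, executable],  # straight to the gatekeeper
    }


if __name__ == "__main__":
    print(make_jdl("run.sh", "--events 100", ["run.sh", "input.dat"]))
    for route, cmd in submit_commands("job.jdl", "ce.example.ac.uk", "/bin/hostname").items():
        print(route, ":", shlex.join(cmd))
```

Preparing the job's environment then amounts to listing its input files in the sandbox (EGEE route) or staging them into the shared space reachable after gsissh login (GRAM route).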


C. “Communicability”

1. "T0-T1-T2 Service Challenges"

Panel Members: Tony Cass, Jeremy Coles, Dave Colling, John Gordon, Dave Kant, Mark Leese, Jamie Shiers. [notes recorded by: Neasan O'Neill]

2. "Analysis on the Grid" Panel Members: Roger Barlow, Giuliano Castelli, David Grellscheid, Mike Kenyon, Gennady Kuznetsov, Steve Lloyd, Andrew McNab, Caitriana Nicholson, James Werner. [notes recorded by: Giuseppe Mazza]

3. "How is/will data be managed at the T1/T2s?" Panel Members: Phil Clark, Greig Cowan, Brian Davies, Alessandra Forti, David Martin, Paul Millar, Jens Jensen, Sam Skipsey, Gianfranco Sciacca, Robin Tasker, Paul Trepka. [notes recorded by: Tom Doherty]

4. "Experiment Service Challenges" Panel Members: Dave Colling, Catalin Condurache, Peter Hobson, Roger Jones, Raja Nandakumar, Glenn Patrick. [notes recorded by: Caitriana Nicholson]

5. "Beyond GridPP2 and e-Infrastructure" Panel Members: Pete Clarke, Dave Britton, Tony Doyle, Neil Geddes, John Gordon, Neasan O'Neill, Joanna Schmidt, John Walsh, Pete Watkins. [notes recorded by: Duncan Rand]

6. "Site Installation and Management" Panel Members: Tony Cass, Pete Gronbech, Dave Kelsey, Winnie Lacesso, Colin Morey, Mark Nelson, Derek Ross, Graeme Stewart, Steve Thorn, John Walsh. [notes recorded by: Mark Leese]

7. "What is a workable Tier-2 Deployment Model?" Panel Members: Olivier van der Aa, Jeremy Coles, Santanu Das, Alessandra Forti, Pete Gronbech, Peter Love, Giuseppe Mazza, Duncan Rand, Graeme Stewart, Pete Watkins. [notes recorded by: Gianfranco Sciacca]

8. "What is Middleware Support?" Panel Members: Mona Aggarwal, Tom Doherty, Barney Garrett, Jens Jensen, Andrew McNab, Robin Middleton, Paul Millar, Robin Tasker. [notes recorded by: Catalin Condurache]


1. "LCG Service Challenges"

• This session brought out the detailed planning of the Service Challenges (SC).
 1. SC is a great idea, a kind of reality check: “reality” is imminent data, the increasing complexity of experiment-led initiatives, and more users
 2. Need more documentation and support: still true(!) despite effort
 3. Time scales and deadlines are needed for deployment: well known and widely communicated via Jamie – Jeremy…
 4. The storage model is an important issue, especially for the storage group: an increasingly large issue – dedicated discussion
 5. Communication on experience: forthcoming discussions will take place at DTeam and PMB meetings
 6. Networks will play an important part in SC4: they underpin the file transfer tests, but need to be embedded within these; disk performance (being understood) v network performance (many [hidden] variables)


There was a list of specific actions

• Implement a better user support model: ONGOING
• Support the deployment of an SRM at every Tier-2 site: DONE
• Revisit site plans for implementing promised resources: DONE
• Support the installation of any required local catalogues at sites: GENERALLY LIMITED TO TIER-1. DONE
• Investigate the experiment VO box requests. Make a recommendation to Tier-2s. Revisit as GridPP: NOT REQD. (CURRENTLY)
• Better understand network links to sites (we do not want to saturate links): ONGOING
• Schedule transfer tests from Tier-1 to Tier-2 to test rates and stability: DONE AND ONGOING
• Work closer with experiments?: CAN IMPROVE


There was a list of specific actions

• User support (mail lists, web form, TPMs, GGUS integration): NEED TO ENSURE USERS “KNOW” (AND KEEP REMINDING THEM)
• SRM at T2 (almost done): DONE
• Site plans revised (SRIF3, FEC): ONGOING
• Local catalogues (wiki, SC3, plan for rest)
• VO boxes (review group): DISAPPEARING..
• Network links (10 easy questions, wiki): FIREWALL+GRID http://www.ggf.org/documents/GFD.83.pdf
• T1-T2 tests (plan, stalled, dcache/dpm): DONE
• Experiment links (some progress): MORE REQD.


2. "Running Applications on the Grid"

(Why won't my jobs run?)

Summary
• A number of people say things are working well: a pleasant surprise, easier than LSF! A SUBSET OF USERS ATTEND GRIDPP MEETINGS
• VO setup and requirements: we don't want each VO to have to talk to each site. The VO should provide a list of requirements for a site to support it. THERE ARE A LARGE NUMBER OF RESPONSIBILITIES TO BE HANDLED BY EACH EXPT.

• Certificates: need to improve the situation. Once over this hurdle, using the grid is plainer sailing. INTRINSIC TIME DEPENDENCE OF CA-RA-USER TRUST ESTABLISHMENT (NECESSARY)

• Data management issues are more of a problem than job or RB problems. How do we get information to the user about failures and support channels? INCREASINGLY TRUE: MANY AD-HOC DELETIONS FOLLOWING E.G. FTS FAILURES

• Monitoring real file transfers would be an interesting addition. USER MECHANISMS TO TRACE OVERALL PROGRESS, BUT NOT MANY INDIVIDUAL USER TOOLS/SCRIPTS APPEARING, E.G. TNT (Tag Navigator Tool) PLUG-IN TO GANGA FOR ATLAS FILE COLLECTIONS WOULD NEED TO COMMUNICATE WITH THE MonAMI FTS PLUG-IN


3. "Grid Documentation"

(What documentation is needed/missing? Is it a question of organisation?)

• Could updates to documents be raised at meetings?
• A mailing list specifically for document updates may be useful.
• Competition between different solutions to one problem.
• For all experiments: link in all documentation and give responsibility to a line manager (for example) to oversee its maintenance.
• What are the mechanisms, or how do we find out what is inadequate within a document? A document should be checked every few months to point out its inadequacies => should a review process be set up by SB?

• Roles and responsibilities should be established.
• Important documents should be highlighted; an index of useful docs and of the available sources of documents may be useful.

• Much progress has been made by Stephen Burke in many of these areas. Steve attends PMB


5. "Beyond GridPP2 and e-Infrastructure"

• (What is the current status of planning?)
• EGEE II may be superseded by a European infrastructure: EGEE III NOW BEING PLANNED
• DTI planning a UK infrastructure
• Integrate better with NGS: SEE EARLIER SLIDES
• More things developed by GridPP will be supported centrally: NEED TO CONVINCE UK COMMUNITY OF THE USEFULNESS AND ADAPTABILITY OF GLITE AS A COMPONENT PART OF PERVASIVE INFRASTRUCTURE


6. "Managing Large Facilities in the LHC era"

• (What works? What doesn't? What won't?)
• Sys admins seem happy with their package managers.
• We should share common knowledge (about software tools) more. ONGOING
• Extra costs (over and above the price of the hardware) involved in having large clusters. ONGOING

• IMPROVED, BUT CAN IMPROVE FURTHER
METRIC: T(INSTALL – USER AVAILABILITY) + AVAILABILITY


7. "What is a workable Tier-2 Deployment Model?"

• Conclusion: Deployment is under control
 – testing has made good progress
 – operations are still an issue

METRIC: T(INSTALL – USER AVAILABILITY) + OVERALL AVAILABILITY * # SYSTEM MANAGER(S)

“EXCELLENT” T2 SUPPORT STRUCTURE REQD.


8. "What is Middleware Support?"

• (Really all about:)
• gLite test bed
• EGEE2: dedicated testing/certification system
• Using the wiki was a good idea. Consolidate into documents.
• Need some structure to make sure the wiki doesn't get out of control.
• Need some moderators for the wiki.
• Developers are not getting the correct requirements for s/w: sysadmin questions were not the same questions that were in the minds of the developers.
• Bad if the wiki is incorrect.
• Need someone to move what is in the wiki to some sort of more formal docs (LaTeX or DocBook) which have been properly checked and signed off by the developers.

• ONGOING, LIMITED PROGRESS: INTRINSIC LIMITATION? (THERE WILL ALWAYS BE OUT OF DATE/LIMITED DOCUMENTATION?)

• NEED A DOCUMENTATION REVIEW CHALLENGE?


Conclusion

• All sessions were felt to be worthwhile
• Some produced hard actions
• Some areas have made progress since
• Positive correlation between the subjects which made progress and those where GridPP had existing structures in place (Deployment, Documentation)
 – Counter-examples: middleware, experiments

• Let’s do this again, but next time take more care to task people with subsequent progress and look for new structures to deliver results.

• “MAKE IT SO”
• The logical end of a talk on “Gridability” (or the emperor's new clothes?)