Post on 24-Dec-2015
John Gordon
j.c.gordon@rl.ac.uk
LCG and Grid Operations
John Gordon
CCLRC e-Science Centre, UK
LCG Grid Operations
Outline
• The monitoring tools
• How we use them in operations
• What is still to be done
Grid Operations
• Once middleware has been developed, tested and deployed, grid operations are the set of actions and procedures to keep a grid running for the users.
The Vision
• GOC Processes and Activities
– Coordinating Grid Operations
– Defining Service Level Parameters
– Monitoring Service Performance Levels
– First-Level Fault Analysis
– Interacting with Local Support Groups
– Coordinating Security Activities
– Operations Development
Have we delivered?
• Coordinating Grid Operations: Yes, RAL, CERN & Taipei
• Defining Service Level Parameters: No
• Monitoring Service Performance Levels: up or down
• First-Level Fault Analysis: Yes
• Interacting with Local Support Groups: Yes
• Coordinating Security Activities: Policies, not operation
• Operations Development: Monitoring and accounting
Monitoring the Grid is a Challenge!
Why We Monitor
• Keep systems up and running
• Notice failures; grid-wide services; MDS
• Knowing what services a site should be running
– no point raising an alert if the site isn’t meant to run it!
– definition of services and which sites run them (SLA)
What Tools Do We Use
• Job Submission; GridIce; Nagios; GIIS Monitor
• How – Database
• Developments Planned – Nagios
Monitoring Overview
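The point that there is no use raising an alert for a service a site is not meant to run can be sketched as a simple filter. This is only an illustration: the site names, the `EXPECTED_SERVICES` table and the `alerts_for` function are hypothetical and do not come from the GOC tools.

```python
# Hypothetical table: which services each site is meant to run (the "SLA" view).
EXPECTED_SERVICES = {
    "site-a": {"ce", "se", "bdii"},
    "site-b": {"ce"},
}


def alerts_for(site, failed_services):
    """Return only the failed services that the site is expected to run."""
    expected = EXPECTED_SERVICES.get(site, set())
    return sorted(set(failed_services) & expected)


# site-a is not meant to run an RB, so a failing RB test there raises no alert.
print(alerts_for("site-a", ["se", "rb"]))
print(alerts_for("site-b", ["se"]))
```

Keeping the expected services in one table means every monitoring tool can ask the same question before alerting.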
• We have only fragmentary information about the services that sites are running.
• We don’t know what RBs/SEs/Sites the VOs are using for data challenges.
• We don’t know what the core services are and who is running them.
• We don’t have a toolkit to test specific core services.
• We have to concentrate on functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?
• Not all the tests that we perform are effective at finding problems so we must take tests written by the experts and integrate them into GOC monitoring.
• We must develop tests which simulate the life cycle of real applications in a Grid environment.
• There are lots of monitoring tools available, so we need to bring them together.
• Do we spend time investigating new tools, or make the ones which we already have better?
• …and probably lots more!
Monitoring Challenges
• There are many frameworks which can be used to monitor distributed environments
• MAPCENTRE http://mapcenter.in2p3.fr/
• GPPMON http://goc.grid-support.ac.uk/
• GRIDICE http://grid-ice.esc.rl.ac.uk
• NAGIOS http://www.nagios.org/
• MONALISA http://monalisa.cacr.caltech.edu/
• GIIS Monitor http://goc.grid.sinica.edu.tw/gstat/
• Ganglia
– Example: Mapcentre 30 sites ~ 500 lines in config file (static version)
– Example: Nagios 30 sites, 12 individual config files with dependencies
– Developed tools to configure these services to make the job easier: NAGIOS, MAPCENTER and GPPMON
Monitoring Services
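Generating the monitoring configuration from a list of sites, rather than editing it by hand, might look roughly like the sketch below. It renders Nagios `define host` objects from rows of (site, service, hostname). The rows, the hostnames and the `generic-host` template name are assumptions for illustration, and this is not the GOC's actual generator.

```python
from string import Template

# One Nagios host object per node; "generic-host" is an assumed template name.
HOST_TEMPLATE = Template(
    "define host {\n"
    "    use        generic-host\n"
    "    host_name  $host\n"
    "    alias      $site $service\n"
    "    address    $host\n"
    "}\n"
)


def nagios_hosts(nodes):
    """Render one host definition per (site, service, hostname) row."""
    return "\n".join(
        HOST_TEMPLATE.substitute(site=site, service=service, host=host)
        for site, service, host in nodes
    )


# Made-up rows standing in for the result of a query on the site database.
rows = [
    ("SITE-A", "ce", "ce.site-a.example.org"),
    ("SITE-A", "se", "se.site-a.example.org"),
]
print(nagios_hosts(rows))
```

With the site list held in one place, adding a site becomes a database entry rather than an edit to each tool's config files.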
GOC Configuration Database
[Diagram: Resource Centres (RC) manage their entries securely over HTTPS / X.509 through GOC GridSite, backed by a MySQL server. Entries cover people, contact information, resources and site information (ce, se, bdii, rb nodes for EDG, LCG-1, LCG-2, …) and scheduled maintenance. Monitoring reads the database via SQL.]
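GOC Job Submission Flow Diagram
Simple job forked on CE using globus
[Flow diagram with numbered steps 1 to 5: the GOC (UI) runs an SQL query on the site DB to build a list of CE and RB resources; it creates a job script (GLOBUS.CE) and runs it with globus-job-run on the CE; the job sends an acknowledgement with wget http://goc_ui/ack.cgi?GLOBUS.CE; the GOC records the received acknowledgement.]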
GPPMON - 2
GPPMON - 3
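Simple job through local jobmanager on CE via Resource Broker Job MatchMaking
[Flow diagram: the GOC (UI) runs an SQL query on the site DB to build a list of CE and RB resources; it creates a job script (RB.CE) and submits it to the RB with edg-job-submit, selecting the CE through Other.GlueCEUniqueID; the job runs on a WN behind the CE and sends an acknowledgement with wget http://goc_ui/ack.cgi?RB.CE; the GOC records the received acknowledgement.]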
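The acknowledgement bookkeeping in these two flows can be sketched as follows. The sketch assumes the query string after `ack.cgi?` is a tag of the form route.CE (GLOBUS or RB, then the CE name), which is one reading of the diagram labels. The function names and hostnames are hypothetical, and no job is actually submitted.

```python
def ack_tag(route, ce):
    """Tag identifying one test job: submission route plus target CE."""
    return f"{route}.{ce}"


def ack_url(route, ce, base="http://goc_ui/ack.cgi"):
    """URL the test job would fetch (e.g. with wget) to acknowledge."""
    return f"{base}?{ack_tag(route, ce)}"


def summarise(submitted, received_tags):
    """Mark each submitted (route, ce) as acknowledged or not."""
    received = set(received_tags)
    return {
        (route, ce): "ok" if ack_tag(route, ce) in received else "no acknowledgement"
        for route, ce in submitted
    }


# Made-up example: the direct globus job acknowledged, the RB job did not.
submitted = [("GLOBUS", "ce.site-a.example.org"), ("RB", "ce.site-a.example.org")]
received = ["GLOBUS.ce.site-a.example.org"]

print(ack_url("RB", "ce.site-a.example.org"))
for target, state in summarise(submitted, received).items():
    print(target, state)
```

Comparing the two routes for the same CE hints at where a failure lies: if the direct job acknowledges but the RB job does not, the CE itself can run jobs.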
LCG2 Site Status: 21 July 2004 10.00am
GPPMON – 1
GRIDICE - 1
http://grid-ice.esc.rl.ac.uk/gridice
Ganglia Monitoring - 1
• http://gridpp.ac.uk/ganglia
• Can use Ganglia to monitor a cluster
RAL Tier-1 Centre
LCG PBS Server displays Job status for each VO
Ganglia Monitoring - 2
• Can also use Ganglia to monitor clusters of clusters
• Provide ROCs with a package to monitor the resources in the region
• Tailored Monitoring
• ROCs may upload their own maps
• JAVA GUI to automate site locations on the map
Hierarchical view of Resources
• Example GridPP made up of virtual T2 centres
Regional Monitoring - 1
[Diagram: EGEE → France, UK/I, S.E.E; UK/I → GridPP; GridPP → LondonT2, ScotGrid; LondonT2 → IMPERIAL, QMUL; ScotGrid → Edinburgh]
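A hierarchical view like this lends itself to rolling site status up to the Tier-2, national and regional level. The sketch below uses the node names from the diagram, but the parent/child links are a reading of that diagram, and the status values are invented for illustration.

```python
# Node names come from the diagram; the links between them are an assumption.
HIERARCHY = {
    "EGEE": ["France", "UK/I", "S.E.E"],
    "UK/I": ["GridPP"],
    "GridPP": ["LondonT2", "ScotGrid"],
    "LondonT2": ["IMPERIAL", "QMUL"],
    "ScotGrid": ["Edinburgh"],
}


def leaves(node):
    """Return the nodes below `node` that have no children of their own."""
    children = HIERARCHY.get(node, [])
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves(child)]


def rollup(node, site_up):
    """Return (sites up, sites total) for everything below `node`."""
    sites = leaves(node)
    up = sum(1 for site in sites if site_up.get(site, False))
    return up, len(sites)


# Invented status values for the three sites shown under GridPP.
status = {"IMPERIAL": True, "QMUL": False, "Edinburgh": True}
print("GridPP:", rollup("GridPP", status))
print("LondonT2:", rollup("LondonT2", status))
```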
http://goc.grid-support.ac.uk/roc_map/map.php Active map to select individual regions
Regional Monitoring - 2
Regional Monitoring - 3
UK/I Monitoring displays GRIDPP and NGS resources.
Replica Manager Tests - 1
• GOC to take over the site certification testing currently done daily by the CERN deployment team (e.g. reports by Piotr Nyczyk)
• The first step towards this was running a series of replica manager tests which register files on the grid, move them around and delete them, and make third-party copies from a remote SE, e.g. Castorgrid
• Demonstrates that we can integrate other people’s tools into GPPMON
• Development of a portal which will:
– Make it easy to retrieve debug information from the job output.
– Connect with information provided by other monitoring tools, e.g. Taipei GIIS Monitor.
– Provide testing “on-demand” to site administrators through a secure interface.
http://goc.grid-support.ac.uk/gridsite/status/rmtest.php?action=table
Results of each test are shown as a coloured index on the map.
Distinguishes between jobs that have completed, have failed or are still running.
Replica Manager Tests - 2
Description of the tests
Job Outputs
GIIS Monitor Information
Replica Manager Tests - 3
GIIS Monitor
• Developed by Min Tsai (GOC Taipei)
• Tool to display and check information published by the site GIIS
• http://goc.grid.sinica.edu.tw/gstat/
Job Accounting - 1
http://goc.grid-support.ac.uk/ROC/docs/accounting/accounting.php
Program publishes PBS log file information through RGMA to the GOC
GOC aggregates data across all sites.
Job Accounting - 2
• Offline testing of program using data from the CORE sites completed.
• Development of an accounting portal underway to provide accounting on-demand for each site, and aggregated for each EGEE region
• Challenge! Dealing with a large database: 1 row per LCGPBS job per site!
• http://goc-dev.esc.rl.ac.uk/jpg/goc_demo.php
• http://goc-dev.esc.rl.ac.uk/jpg/goc_demo3.php
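The aggregation step (PBS log information from every site, summed at the GOC) can be illustrated with a small parser. The sketch assumes PBS accounting records of the form `date time;type;job id;key=value ...`, uses the Unix `group` field as a stand-in for the VO, and works on invented sample lines. It does not reproduce the real accounting program or the R-GMA transport.

```python
from collections import defaultdict


def hms_to_seconds(text):
    """Convert an HH:MM:SS duration (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line):
    """Return (job_id, fields) for a job-end ('E') record, else None."""
    parts = line.strip().split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    fields = dict(item.split("=", 1) for item in parts[3].split() if "=" in item)
    return parts[2], fields


def aggregate(site_logs):
    """Sum wall-clock seconds per (site, group) over all job-end records."""
    totals = defaultdict(int)
    for site, lines in site_logs.items():
        for line in lines:
            record = parse_end_record(line)
            if record is None:
                continue
            _job_id, fields = record
            walltime = fields.get("resources_used.walltime")
            if walltime:
                group = fields.get("group", "unknown")
                totals[(site, group)] += hms_to_seconds(walltime)
    return dict(totals)


# Invented sample records; real logs carry many more fields per line.
logs = {
    "site-a": [
        "07/21/2004 10:00:00;E;101.ce.site-a.example.org;user=u1 group=atlas resources_used.walltime=01:00:00",
        "07/21/2004 10:05:00;S;102.ce.site-a.example.org;user=u2 group=cms",
        "07/21/2004 11:00:00;E;102.ce.site-a.example.org;user=u2 group=cms resources_used.walltime=00:30:00",
        "07/21/2004 12:00:00;E;103.ce.site-a.example.org;user=u1 group=atlas resources_used.walltime=00:15:30",
    ],
}
print(aggregate(logs))
```

Because there is one row per job per site, storing pre-aggregated totals such as these is one way to keep portal queries manageable.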
GridPP Accounting
EDG-network monitoring
Security
• Worked with Security Group
• Defined a Security Policy
– and auditing procedures
• Have a list for security contacts
– but not really exercised it yet
– still need to define procedures in the event of security incidents
Keeping the Work Flowing
• Regular monitoring of job submission
– shows sites that have problems running jobs
• Nagios tracks individual services
– plus certificate lifetime
• RM tests show whether data can be moved
• GridICE and Ganglia show what is running
• Limited by RB behaviour
– we can see that jobs are not getting to sites but not why.
What have we delivered?
• A set of monitoring tools
• A monitoring regime
• Two GOCs (RAL and Taipei)
• Security Policy
Still to do
• Effective problem tracking
– we see site problems and get them fixed
– but don’t manage long-term problems
• Integration with User Support
– we track problems we see
– but problems users notice are not effectively dealt with
• Automatic alerts
– Nagios does this, but EMS from Taipei looks promising
• Remote repair
– agents until middleware can support this directly
• Security
• Deploy accounting
• Distribute monitoring to EGEE ROCs and others
What Next ? (1)
• RSS used to send tailored streams
– sites, ROCs, management can all decide what to subscribe to
• Accounting
– being tested in LCG C&T testbed
– should be in next LCG release
– then get T2 accounts
• keep your PBS logs, msgs and gatekeeper logs
Monitoring Feeds
• GOC server generates a lot of monitoring information.
• Need a way to give this information to the right people, e.g. site administrators
• Really Simple Syndication (RSS) is an XML schema
• Used by many sites which want to syndicate content, e.g. BBC, Slashdot
• Client Pull model: GOC creates RSS formatted documents; clients pull these feeds and render them in HTML.
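As a rough illustration of the server side of this pull model, the sketch below builds a minimal RSS 2.0 document from a list of test results using only the Python standard library. The feed title, the example.org URLs and the result text are invented, and the GOC's real feed layout is not known from these slides.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description, results):
    """Build an RSS 2.0 document with one <item> per test result."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link, item_text in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
        ET.SubElement(item, "description").text = item_text
    return ET.tostring(rss, encoding="unicode")


# Invented results for a single site feed.
feed = build_feed(
    "Test results for one CE",
    "http://monitoring.example.org/site-a",
    "Latest monitoring results for a single site",
    [
        ("Job submission: OK", "http://monitoring.example.org/site-a/job", "Acknowledgement received"),
        ("Replica manager test: FAILED", "http://monitoring.example.org/site-a/rm", "Copy to remote SE failed"),
    ],
)
print(feed)
```

One feed per site or region would let each subscriber pull only the streams they care about.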
Aggregator RSSReader (Windows Client)
GOC generates RSS feeds which clients can pull using an RSS aggregator.
Aggregators available for Linux, Windows and MacOS
The aggregator shown displays test results for the RAL CE. These results are archived and pop up on the desktop when the feed is updated.
What next? (2)
• GGUS developments
– operations issues forwarded to UK GSC helpdesk
• Weekly LCG GDA Operations Meeting
– see next slide
• EGEE ROCs taking support load
– UK ready?
• EGEE CICs taking operations load on weekly rotation
Proposal
• 2 hour weekly meeting, with VRVS for remote participation
– use the existing GDA slot
– Fully open meeting
• Weekly operations reports (written in advance - previous Friday evening) from
– Each EGEE ROC (NE should include Nordugrid ops)
– Taipei GOC
– Grid3 (covering FNAL and BNL Tier 1s)
– Other LCG Tier 1 sites (where different from the above) - Triumf, Tokyo – others?
– ROCs and Tier 1s will report on and represent the sites they support
• Weekly reports (written, submitted in advance) from customers:
– LHC experiments
– Bio-med
– Others as they come on-line
• During the meeting only issues should be brought up and resolved
• Need to have good representation from ROCs and Tier 1s
• Need application reps involved in grid work to attend
• Once a month have more general discussions (presentation style), e.g.:
– Middleware developments
– Larger issues - batch system problems, etc
• Minutes, attendance and problems will be public
UK view
• RAL CIC will take on part of ongoing GOC work
– including development for LCG/EGEE
• UK/I ROC will monitor and support UK/I sites
– Helpdesk/DTeam/GOC
– Maps tailored for Tier2s