
Service Availability Monitor tests for ATLAS

Current Status

Tests in development

To Do

Alessandro Di Girolamo CERN IT/PSS-ED


4 Dec 2007 Alessandro Di Girolamo 2

SAM Critical Tests: Current Status

Now running standard OPS tests using ATLAS credentials (i.e. the original SAM tests run under the ATLAS VO)

• List of sites from GOCDB
• SE & SRM:
  – put: lcg-cr using cern-prod LFC, files in SAM test directory
  – get: lcg-cp from site to the SAM UI
  – del: lcg-del, to clean the catalog and the storage
• CE:
  – Check CA RPMs version
  – Job submission
  – On a WN: tests VO swdir (sw installation directory)
• LFC: lfc-ls, lfc-mkdir
• FTS: glite-transfer-channel-list, Information System configuration and publication
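The SE & SRM put/get/del cycle can be sketched as a sequence of command lines. This is only an illustration, not the actual SAM sensor code: the function name, the LFN, the host name and the local paths are hypothetical, and the option names (--vo, -d, -l, -a) are recalled from the lcg_util tools and should be checked against the installed version. The sketch only builds the argument lists; it does not run anything.

```python
def se_test_commands(se_host, lfn, local_src, local_dst, vo="atlas"):
    """Build the put / get / del command lines for one storage endpoint.

    All arguments are placeholders chosen for illustration.
    """
    return [
        # put: copy a local file to the SE and register it in the catalog
        ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn, "file:" + local_src],
        # get: copy the registered file back to the SAM UI
        ["lcg-cp", "--vo", vo, lfn, "file:" + local_dst],
        # del: remove all replicas and the catalog entry
        ["lcg-del", "--vo", vo, "-a", lfn],
    ]


# Hypothetical values, for illustration only.
commands = se_test_commands(
    se_host="srm.site-a.example",
    lfn="lfn:/grid/atlas/sam-test/testfile",
    local_src="/tmp/sam-put.txt",
    local_dst="/tmp/sam-get.txt",
)
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the commands as argument lists means they could later be passed to a process runner without shell quoting problems.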


Work in progress

We are developing and testing ATLAS-specific SAM tests in order to:
• monitor the availability of ATLAS critical Site Services
• verify the correct installation and the proper functioning of the ATLAS software on each site

SE & SRM & CE endpoints definition: intersection between GOCDB and TiersOfATLAS (ATLAS-specific sites configuration file with Cloud Model)
• different services and endpoints might need to be tested using different VOMS credentials
• ATLAS endpoints and paths must be explicitly tested (i.e. /dq2 area)
• the LFC of the Cloud (residing in the T1) is used
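The endpoint definition amounts to a set intersection. A minimal sketch, assuming both sources have already been reduced to plain collections of host names; the function name and the host names are hypothetical, and the real GOCDB and TiersOfATLAS formats are not shown in these slides:

```python
def endpoints_to_test(gocdb_hosts, toa_hosts):
    """Return the endpoints listed in both GOCDB and TiersOfATLAS, sorted."""
    return sorted(set(gocdb_hosts) & set(toa_hosts))


# Hypothetical host names, for illustration only.
gocdb = {"srm.site-a.example", "srm.site-b.example", "ce.site-c.example"}
toa = {"srm.site-b.example", "ce.site-c.example", "srm.site-d.example"}

selected = endpoints_to_test(gocdb, toa)
print(selected)
```

An endpoint known to only one of the two sources (site-a and site-d here) is not tested.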


Development: Tests and Alarms

• SE & SRM (centrally from SAM UI):
  – put: lcg-cr with Cloud LFC, with and without using BDII information
  – get: lcg-cp
• CE (job submitted on each ATLAS CE):
  – keep running a large part of the OPS suite
  – for ATLAS Tier1 and Tier2:
    • Check the presence of the required version of the ATLAS sw
    • Compile and execute a real analysis job based on a sample dataset
    • Test put/get to local storage via native protocols (dccp, rfcp …)

Alarm system:
• SE / SRM / CE tests failing: site contact persons will be alerted via the SAM Alarm System (mail and/or SMS)
• Grid Services (FTS, LFC etc.) tests failing: alarms to
  – the service responsible
  – the ATLAS dedicated services (DDM, etc.) that use those services
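The alarm routing described above can be sketched as a simple lookup. This is an illustration under assumptions, not the SAM Alarm System itself: the function name and the recipient labels are hypothetical, and only the service types named on the slide are listed (the "etc." entries are left out).

```python
SITE_SERVICES = {"SE", "SRM", "CE"}
GRID_SERVICES = {"FTS", "LFC"}  # the slide also says "etc."; not filled in here


def alarm_recipients(service_type):
    """Return who should be alerted when a test on this service type fails."""
    if service_type in SITE_SERVICES:
        return ["site contact persons"]
    if service_type in GRID_SERVICES:
        return ["service responsible", "dependent ATLAS services (e.g. DDM)"]
    raise ValueError("unknown service type: " + service_type)


print(alarm_recipients("SRM"))
print(alarm_recipients("LFC"))
```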


Reliability & Availability results

SAM Critical Tests not reliable for:
– France: BDII configuration (ATLAS endpoint should be explicitly put)
– NDGF/BNL: different service setup

SAM Critical Tests, failures in the last months:
– FZK: real SRM failures. Problems under investigation with the site responsible
– SARA: (mainly) unscheduled network problems


To Do

• New ATLAS-specific tests (now running in pre-production) will be more realistic for the Experiment
• Improve completeness of monitoring information:
  – Information across TiersOfATLAS, GOCDB and BDII
  – ATLAS Cloud topology view
  – Integration with Ganga Robot and other ATLAS tools
  – Integration with the ATLAS dashboard


Backup slides

• …


SAM ATLAS SE (SRM) tests

All SRM endpoints (v1 and v2) can be considered as SE:

• SE tests are sent to the list of SRM endpoints resulting from the intersection of ToA & GOCDB



SAM results on Gridmap

Thanks to CERN openlab / EDS

Topology:
• Possibility to include the ATLAS Cloud view
• Possibility to change the metrics for the site size

The collaboration with the Gridmap developers has already started


Other SAM tests

Many more tests, not critical, are running


Site Availability: T0/T1 Site Services

Availability: Site Services X = CE, SE, SRM
• Down: if all services of type X of a site are Down
• Ok: if all services of type X are Ok
• Degraded: if some services of type X are Ok and others are Down

Site BDII: Ok or Down by taking the status of the site BDII instance

Site Availability: the AND of each single Site Services Availability
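The rules above can be written down as a short sketch. The function names and the example site are hypothetical. The slide does not say how a Degraded service type enters the AND, so the sketch makes this an explicit parameter and, as an assumption, counts Degraded as available by default.

```python
def service_type_status(instance_states):
    """Combine the Ok/Down states of all instances of one type (CE, SE or SRM)."""
    if not instance_states:
        raise ValueError("no instances for this service type")
    if all(state == "Ok" for state in instance_states):
        return "Ok"
    if all(state == "Down" for state in instance_states):
        return "Down"
    return "Degraded"


def site_available(instances_by_type, bdii_state, degraded_is_available=True):
    """AND of the per-type availabilities and the site BDII status.

    degraded_is_available is an assumption, not stated on the slide.
    """
    statuses = {
        service_type: service_type_status(states)
        for service_type, states in instances_by_type.items()
    }
    statuses["BDII"] = bdii_state
    accepted = {"Ok", "Degraded"} if degraded_is_available else {"Ok"}
    return all(status in accepted for status in statuses.values()), statuses


# Hypothetical site with two CEs (one Down), one SE and two SRM endpoints.
site = {"CE": ["Ok", "Down"], "SE": ["Ok"], "SRM": ["Ok", "Ok"]}
available, detail = site_available(site, bdii_state="Ok")
print(available, detail)
```

With the stricter reading (degraded_is_available=False) the same site would be reported as not available, because its CE type is Degraded.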


Site Availability: one example


Storage Space Monitor via SAM

A specific SAM test could be sent on the VOBOXes to check storage disk space, as already done for the IT cloud