Data & Storage Services
CERN IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it
DSS
Tape Monitoring
Vladimír Bahyl, IT DSS TAB
Storage Analytics Seminar, February 2011
Overview
• From low level
  – Tape drives; libraries
• Via middle layer
  – LEMON
  – Tape Log DB
• To high level
  – Tape Log GUI
  – SLS
• TSMOD
• What is missing?
• Conclusion
Low level – towards the vendors
• Oracle Service Delivery Platform (SDP)
  – Automatically opens tickets with Oracle
  – We also receive notifications
  – Requires a “hole” in the firewall, but quite useful
• IBM TS3000 console
  – Central point collecting all information from 4 (out of 5) libraries
  – Call home via the Internet (not a modem)
  – Engineers come on site to fix issues
Low level – CERN usage
• SNMP
  – Using it (traps) whenever available
  – Need MIB files with SNMPTT actuators:
  – IBM libraries send traps on errors
  – ACSLS sends activity traps
• ACSLS
  – Event log messages on multiple lines concatenated into one
  – Forwarded via syslog to a central store
  – Useful for tracking issues with library components (PTP)
  EVENT ibm3584Trap004 .1.3.6.1.4.1.2.6.182.1.0.4 ibm3584Trap CRITICAL
  FORMAT ON_BEHALF: $A SEVERITY: '3' $s MESSAGE: 'ASC/ASCQ $2, Frame/Drive $6, $7'
  EXEC /usr/local/sbin/ibmlib-report-problem.sh $A CRITICAL
  NODES ibmlib0 ibmlib1 ibmlib2 ibmlib3 ibmlib4
  SDESC
  Trap for library TapeAlert 004.
  EDESC
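The ACSLS concatenation step above can be sketched as follows. This is a minimal illustration, not the production actuator; it assumes (purely for the example) that continuation lines of an event start with whitespace while new events start at column 0:

```python
def concatenate_events(lines):
    """Join multi-line ACSLS event log messages into single-line records.

    Assumption for illustration: continuation lines are indented,
    new events begin at column 0.
    """
    events = []
    for line in lines:
        if line[:1].isspace() and events:
            # Continuation: fold it into the previous event's line.
            events[-1] += " " + line.strip()
        else:
            events.append(line.rstrip())
    return events
```

The single-line records can then be handed to syslog unchanged, which is what makes central storage and later searching practical.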
Middle layer – LEMON
• Actuators constantly check local log files
• 4 situations covered:
  1. Tape drive not operational
  2. Request stuck for at least 3600 seconds
  3. Cartridge is write protected
  4. Bad MIR (Media Information Record)
• Ticket is created = e-mail is sent
  – All relevant information is provided within the ticket to speed up the resolution
• Workflow is followed to find a solution
Dear SUN Tape Drive maintainer team,

this is to report that tape drive T10B661D@tpsrv963 has become non-operational. Tape T05653 has been disabled.

PROBABLE ERRORS
01/28 15:33:05 10344 rlstape: tape alerts: hardware error 0, media error 0, read failure 0, write failure 0
01/28 15:33:05 10344 chkdriveready: TP002 - ioctl error : Input/output error
01/28 15:33:05 10344 rlstape: TP033 - drive T10B661D@tpsrv963 not operational

IDENTIFICATION
Drive Name: T10B661D   Location: acs0,6,1,13   Serial Nr:
Volume ID: T05653   Library: SL8600_1   Model: T10000   Producer: STK   Density: 1000GC
Free Space: 0   Nb Files: 390   Status: FULL|DISABLED   Pool Name: compass7_2
Tape Server: tpsrv963
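The classification step the actuators perform can be sketched like this. The regular expressions are assumptions derived from the sample ticket above, not the actual LEMON actuator patterns:

```python
import re

# One pattern per covered situation; the exact log message texts
# are illustrative guesses based on the sample ticket.
SITUATIONS = {
    "drive_not_operational": re.compile(r"not operational"),
    "request_stuck": re.compile(r"stuck for \d+ seconds"),
    "write_protected": re.compile(r"write protected"),
    "bad_mir": re.compile(r"bad MIR"),
}

def classify(log_line):
    """Return the name of the matched situation, or None if the line is benign."""
    for name, pattern in SITUATIONS.items():
        if pattern.search(log_line):
            return name
    return None
```

A matching line would trigger ticket creation with the drive and volume identification gathered around it.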
Middle layer – Tape Log DB
• CASTOR log messages from all tape servers are processed and forwarded to a central database
• Allows correlation of independent errors (not a complete list):
  – X input/output errors with Y tapes on 1 drive
  – X write errors on Y tapes on 1 drive
  – X positioning errors on Y tapes on 1 drive
  – X bad MIRs for 1 tape on Y drives
  – X write/read errors on 1 tape on Y drives
  – X positioning errors on 1 tape on Y drives
  – Too many errors on a library
• All logs archived for 120 days, split by VID and tape server
  – Q: What happened to this tape?
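The drive-side correlations above boil down to counting distinct tapes per failing drive (the tape-side correlations are symmetric). A minimal sketch, with an illustrative threshold rather than the production one:

```python
from collections import defaultdict

def correlate(errors, threshold=3):
    """Flag drives that have failed against several distinct tapes.

    `errors` is a list of (drive, tape, error_kind) tuples; a drive
    accumulating errors with `threshold` or more distinct tapes is
    suspect. The threshold value is illustrative.
    """
    tapes_per_drive = defaultdict(set)
    for drive, tape, _kind in errors:
        tapes_per_drive[drive].add(tape)
    return [d for d, tapes in tapes_per_drive.items() if len(tapes) >= threshold]
```

Errors on one tape across many drives point at the medium instead; the same counting applied with drive and tape swapped covers that case.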
Tape Log – the data
• Origin: rtcpd & taped log messages
  – All tape servers sending data in parallel
• Content: various file state information
• Volume:
  – Depends on the activity of the tape infrastructure
  – Past 7 days: ~30 GB of text files (raw data)
• Frequency:
  – Depends on the activity of the tape infrastructure
  – Easily > 1000 lines / second
• Format: plain text
Tape Log – data transport
• Protocol: (r)syslog log messages
• Volume: ~150 KB/second
• Accepted delays: YES/NO
  – YES: If the tape log server cannot upload processed data into the database, it will retry later, as it keeps a local text log file
  – NO: If the rsyslog daemon is not running on the tape log server, lost messages will not be processed
• Losses acceptable: YES (to some small extent)
  – The system is only used for statistics or slow reactive monitoring
  – A serious problem will reoccur elsewhere
  – We use TCP in order not to lose messages
Tape Log – data storage
• Medium: Oracle database
• Data structure: 3 main tables
  – Accounting
  – Errors
  – Tape history
• Amount of data in store:
  – 2 GB
  – 15-20 million records (2 years' worth of data)
• Aging: no, data kept forever
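A toy version of the three-table layout can be sketched with SQLite standing in for Oracle. The column names are assumptions; only the three table names come from the slide:

```python
import sqlite3

# Illustrative schema for the three main tables; columns are guesses,
# and the real store is an Oracle database, not SQLite.
SCHEMA = """
CREATE TABLE accounting   (ts TEXT, vid TEXT, drive TEXT, bytes INTEGER);
CREATE TABLE errors       (ts TEXT, vid TEXT, drive TEXT, message TEXT);
CREATE TABLE tape_history (ts TEXT, vid TEXT, event TEXT);
"""

def open_store(path=":memory:"):
    """Create the three-table store and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With no aging policy, 2 GB over two years suggests roughly 1 GB of growth per year at the activity levels quoted earlier.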
Tape Log – data processing
• No additional post-processing once data is stored in the database
• Data mining and visualization done online
  – Can take up to a minute
High level – Tape Log GUI
• Oracle APEX on top of the data in the DB
• Trends
  – Accounting
  – Errors
  – Media issues
• Graphs
  – Performance
  – Problems
• http://castortapeweb
Tape Log – pros and cons
• Pros
  – Used by the DG in his talk!
  – Uses a standard transfer protocol
  – Only uses in-house supported tools
  – Developed quickly; requires little/no support
• Cons
  – Charting limitations
    • Can live with that; see point 1 – not worth supporting something special
  – Does not really scale
    • OK if only looking at the last year's data
High level – SLS
• Service view for users
• Live availability information as well as capacity/usage trends
  – Partially reuses Tape Log DB data
• Information organized per VO
  – Text and graphs
  – Per day/week/month
TSMOD
• Tape Service Manager on Duty
  – Weekly changing role to:
    • Resolve issues
    • Talk to vendors
    • Supervise interventions
• Acts on a twice-daily summary e-mail which monitors:
  – Drives stuck in (dis-)mounting
  – Drives not in production without any reason
  – Requests running or queued for too long
  – Queue size too large
  – Supply tape pools running low
  – Too many disabled tapes since the last run
• Goal: have one common place to watch
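Assembling the checks above into one e-mail body can be sketched as follows; the check names and output format are illustrative, the thresholds would live inside each check:

```python
def summarise(checks):
    """Build the body of a TSMOD-style summary e-mail.

    `checks` maps a check name to the list of offending items it found;
    only checks that found something appear in the summary, so one glance
    shows whether anything needs attention.
    """
    lines = []
    for name, items in checks.items():
        if items:
            lines.append("%s (%d): %s" % (name, len(items), ", ".join(items)))
    return "\n".join(lines) or "Nothing to report."
```

Sending the same body twice a day, whether or not anything fired, keeps the "one common place to watch" property: an absent e-mail itself becomes a signal.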
What is missing?
• We often need the full chain
  – When was the tape last successfully read?
  – On which drive?
  – What was the firmware of that drive?
• Users hidden within upper layers
  – We do not know which exact user is reading/writing right now
  – The only information we have is the experiment name, and that is deduced from the stager hostname
• Detailed investigations often require the request ID
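The missing "full chain" query amounts to a join across read history and drive firmware records. A sketch with in-memory stand-ins for what would be Tape Log DB joins; the data shapes and the firmware level shown are invented for the example:

```python
def last_successful_read(vid, reads, firmware):
    """Answer the 'full chain' question for one tape.

    `reads` holds (timestamp, vid, drive) tuples for successful reads;
    `firmware` maps drive -> firmware level. Both are illustrative
    stand-ins for joins the Tape Log DB cannot do today.
    """
    matching = [(ts, drive) for ts, v, drive in reads if v == vid]
    if not matching:
        return None
    ts, drive = max(matching)  # latest timestamp wins
    return {"when": ts, "drive": drive, "firmware": firmware.get(drive)}
```

The hard part in practice is not the join itself but that the drive firmware at read time is not recorded alongside the read, which is exactly the gap this slide points out.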
Conclusion
• CERN has extensive tape monitoring covering all layers
• The monitoring is fully integrated with the rest of the infrastructure
• It is flexible enough to support new hardware (e.g. higher capacity media)
• The system is being improved as new requirements arise