TeraGrid Operations Overview

Mike Pingleton

NCSA TeraGrid Operations, December 2nd, 2004



TeraGrid Operations Center

Provides continuous and coordinated operational support, user assistance, and incident response for the nation-wide TeraGrid

TOC Capabilities

24/7 single source of assistance for TeraGrid users and staff, via email or telephone

Dedicated TeraGrid trouble-ticket system (TTS) ensures timely resolution of problems and event response

Leverages and pools vast experience of existing operations staff and system administrators

Capable of monitoring systems/queues at multiple remote sites

“Use existing infrastructure” - NSF

TOC Technical Approach

TG Operations Center staffed by NCSA and SDSC Operations staff, one 12-hour shift for each site

TOC provides front-line evaluation, resolution, and routing of problems

TOC coordinates and participates in event response (security issues, downtime, etc.)
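The two-site, 12-hour coverage model can be sketched as a simple lookup. This is an illustration only: the slides say that NCSA and SDSC each staff a 12-hour shift, but the handover hours and the order of the sites below are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical handover hours in UTC; the slides state only
# "12 hour shift for each site".
FIRST_SHIFT_START = 8    # assumed
SECOND_SHIFT_START = 20  # assumed


def site_on_duty(when: datetime) -> str:
    """Return the site staffing the TOC at a timezone-aware time."""
    hour = when.astimezone(timezone.utc).hour
    if FIRST_SHIFT_START <= hour < SECOND_SHIFT_START:
        return "NCSA"  # assumed to take the first shift
    return "SDSC"      # assumed to take the second shift
```

Because the two half-open windows partition the day, every hour maps to exactly one site, which is what gives 24/7 coverage.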

NCSA & SDSC Ops Centers: Expanded Scope, but Business as Usual

Monitoring Capabilities

Monitoring

Currently ‘passively’ monitoring most TeraGrid clusters using CluMon

Ramping up efforts to monitor the TeraGrid network

Monitoring capacity is untapped at this point (not yet monitoring grid fabric)

TeraGrid Ticketing System

Technical Approach - TeraGrid Ticketing System

[email protected] (an @teragrid.org address) or a toll-free number receives all incoming requests

TTS is a browser-based, database-driven system developed from NCSA’s in-house ticketing system (use existing infrastructure!)

Users are able to track the progress of their tickets

New TG sites are easily integrated into the system (all new ETF sites already integrated)

Technical Approach - TeraGrid Ticketing System (continued)

Problem Resolution: a tiered approach
  Front-line evaluation, routing or resolution by TG Ops staff
  Site-specific issues routed to site-leads for resolution
  TG-wide issues routed to user support team to coordinate resolution by technical leads

Front-line Resolution an important factor
  22% of all trouble tickets resolved by TOC staff
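The tiered approach above can be sketched as a small routing function. This is a minimal illustration, not the TTS implementation: the `Scope` enum, the `route` function, and the returned strings are hypothetical names chosen to mirror the three tiers on the slide.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    """Hypothetical classification made during front-line evaluation."""
    FRONT_LINE = "front_line"        # TG Ops staff can resolve it directly
    SITE_SPECIFIC = "site_specific"  # tied to a single TeraGrid site
    TG_WIDE = "tg_wide"              # affects TeraGrid as a whole


def route(scope: Scope, site: Optional[str] = None) -> str:
    """Return who handles a ticket under the tiered approach."""
    if scope is Scope.FRONT_LINE:
        return "TG Ops staff"
    if scope is Scope.SITE_SPECIFIC:
        if not site:
            raise ValueError("a site-specific ticket needs a site name")
        return f"site lead: {site}"
    # TG-wide issues go to the user support team, which coordinates
    # resolution by the technical leads.
    return "user support team"
```

For example, `route(Scope.SITE_SPECIFIC, "SDSC")` returns `"site lead: SDSC"`, while a TG-wide ticket is handed to the user support team regardless of site.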

Trouble Ticket Processing: From Open To Close

When a ticket is created, user receives auto-notification with ticket number

User receives personal reply within 30 minutes

Ticket is assigned to a project & to someone

User is kept updated on progress, resolution

Problem behind ticket is resolved

User is notified

User receives auto-notification of closure, with summary
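The open-to-close sequence can be modelled as an ordered list of stages plus a check on the 30-minute reply target. The sketch below is an assumption-laden illustration: the stage names, the `Ticket` class, and its methods are hypothetical and do not describe the actual TTS schema; only the 30-minute figure comes from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

REPLY_TARGET = timedelta(minutes=30)  # from the slide: personal reply within 30 minutes

# Hypothetical stage labels for the steps listed on the slide.
STAGES = ["created", "replied", "assigned", "resolved", "closed"]


@dataclass
class Ticket:
    number: int
    created_at: datetime
    stage: str = "created"
    replied_at: Optional[datetime] = None
    log: List[str] = field(default_factory=list)

    def advance(self, now: datetime) -> str:
        """Move to the next stage in order; stages cannot be skipped."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise ValueError("ticket is already closed")
        self.stage = STAGES[index + 1]
        if self.stage == "replied":
            self.replied_at = now
        # Each transition is logged so the user can be kept updated.
        self.log.append(f"{now.isoformat()} {self.stage}")
        return self.stage

    def reply_on_time(self) -> bool:
        """True if the personal reply met the 30-minute target."""
        return (
            self.replied_at is not None
            and self.replied_at - self.created_at <= REPLY_TARGET
        )
```

A ticket created at 09:00 and advanced to "replied" at 09:20 satisfies `reply_on_time()`; one answered at 09:45 does not.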

Problem Resolution Workflow

[Diagram: TeraGrid User; [email protected]; TeraGrid Operations; User Support Team; TeraGrid Sites]

TeraGrid Ticket Breakdown

[Chart. Legend: Site Specific Tickets; TeraGrid-Wide Tickets; TG Ops Center. Values: 729 tickets, 22%; 317 tickets, 10%; 2249 tickets, 68%]
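The percentages in the ticket breakdown follow directly from the three counts. The short check below reproduces them. The helper name `percent_shares` is hypothetical, and the counts are deliberately left unpaired with the legend entries.

```python
def percent_shares(counts):
    """Return each count as a whole-number percentage of the total."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


counts = [729, 317, 2249]      # ticket counts from the breakdown
print(sum(counts))             # 3295
print(percent_shares(counts))  # [22, 10, 68]
```

The 22% share (729 of 3295 tickets) matches the figure quoted for tickets resolved by TOC staff.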

Pulling Ops Centers Together:

A common set of web-based procedures documentation:
  Routing & Assignment Guides
  ‘20 Questions’ Guides for problem determination
  Basic operational policies and procedures

‘Shift Turnover’ phone calls

Open communication & assistance

Challenges

TeraGrid is a huge learning curve for Ops Staff (must know at least a little bit about everything)

Keeping abreast of a constant state of change

Working with people who are very far away (and sometimes on vacation)

Promoting the concept of Problem Resolution (new to some) and getting everyone to use the Ticketing System

Inexperienced users on the horizon

Lessons Learned

More tickets than anyone expected

Problem Resolution on a global scale is expensive with respect to time and talent consumed

TG Ops Center is more than just a problem-routing switchboard

Communication & coordination between RPs, services and TOC is vital to success