AARNet Copyright 2006
AARNet Out of Hours Support
QUESTnet Workshop
QUT 5 November 2008
[email protected]@aarnet.edu.au
Core Operational Functions
• Monitoring
• Measurement
• Contact & Communication
• Logging / Record Keeping
• Fault Diagnosis
• Co-ordination
• Restoration
Things go wrong… IT happens!
• Power failures
• Fibre optic cables get dug up, broken (Dial-after-you-Dig)
• Equipment misbehaves, fails
• Operator errors – misconfigurations
• Denial of Service and Distributed Denial of Service attacks
• Maintenance without notice
• Fires in computer rooms
• Floods in computer rooms
• Floods in the middle of the desert
• Lightning strikes
• Rodents chew through cables
• Ships drag anchor, trains go off the rails (and onto fibre optic cables)
• The list goes on…
Maximise up-time via design of the Network
• Design philosophy is to have diversity and redundancy
• Ideally no single points of failure
• Dual PoPs in Major Capital Cities
• Diverse transmission paths
• Dual customer connections
• I know:
  – No diverse path north of Townsville
  – Single PoP in NT & Tasmania
  – No option for dual connections for many customers
• You can solve all problems with the application of cash
AARNet’s International Network
AARNet’s national network
© 2008, AARNet Pty Ltd Private and Confidential
Monitoring - Nagios
• Nagios – a ‘free’ software package
• Runs continuously on servers in Sydney and Perth
• Constantly polls / probes nearly 1,000 network elements (hosts & services)
• Raises alarm in the event of a fault
  – SMS messages sent to key operations staff
    • Format under review – include customer contact
  – E-mail to NOC mailing list
  – Web based on-line, real-time displays & reports
• Measures and records availability
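The poll / alarm / availability cycle described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Nagios code or configuration: the names `ServiceState`, `poll_once` and the `notify` callback are hypothetical, and the message wording is made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServiceState:
    """Availability bookkeeping for one monitored host or service."""
    checks: int = 0
    failures: int = 0
    alarmed: bool = False

    @property
    def availability(self) -> float:
        """Fraction of polls that succeeded (1.0 if never polled)."""
        if self.checks == 0:
            return 1.0
        return (self.checks - self.failures) / self.checks


def poll_once(
    probes: Dict[str, Callable[[], bool]],
    states: Dict[str, ServiceState],
    notify: Callable[[str], None],
) -> None:
    """Run every probe once, record the result, and alarm on state changes."""
    for name, probe in probes.items():
        state = states.setdefault(name, ServiceState())
        state.checks += 1
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that raises is counted as a failed check
        if not ok:
            state.failures += 1
            if not state.alarmed:  # alarm once per fault, not on every poll
                state.alarmed = True
                notify(f"PROBLEM: {name} is down")
        elif state.alarmed:
            state.alarmed = False
            notify(f"RECOVERY: {name} is up")
```

In practice `notify` would stand in for the SMS and e-mail channels listed above, and `poll_once` would be called on a timer; both are left out to keep the sketch self-contained.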
Nagios
Measurement – SNMP, MRTG
• Interface/circuit information gathered every five minutes with Simple Network Management Protocol (SNMP)
• Visualised with Multi Router Traffic Grapher (MRTG) on all important interfaces on the network
• Link/circuit utilisation
  – Bits per second
  – Packets per second
  – Flows per second
• Blue/inbound, Green/outbound
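A bits-per-second figure is derived from two successive readings of an interface octet counter. The sketch below shows that arithmetic for the five-minute (300 second) polling interval mentioned above, allowing for one wrap of a 32-bit counter. The function names are hypothetical and the SNMP fetch itself is left out; this is not MRTG's code.

```python
def rate_bps(
    prev_octets: int,
    curr_octets: int,
    interval_s: float = 300.0,
    counter_bits: int = 32,
) -> float:
    """Average bits per second between two octet-counter samples.

    The modulo handles a single counter wrap between samples; more than
    one wrap in an interval cannot be detected from two readings.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


def utilisation_pct(bps: float, link_bps: float) -> float:
    """Link utilisation as a percentage of nominal link capacity."""
    if link_bps <= 0:
        raise ValueError("link_bps must be positive")
    return 100.0 * bps / link_bps
```

For example, 37,500,000 octets counted over 300 seconds averages 1 Mbit/s, which is 10% of a 10 Mbit/s circuit. On fast links a 32-bit counter can wrap more than once in five minutes, which is why 64-bit counters (`counter_bits=64`) are preferred where the device supports them.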
Measurement – SmokePing
• Measures and records Round-trip Time (RTT) and Packet-Loss to selected targets
• Ideal for visualising latency and jitter (variance in latency)
• Y-axis represents latency
• X-axis represents time
• ‘Smoke’ represents degree of jitter
• Coloured bars represent packet-loss:
  – Green – no packet-loss
  – Blue – 1 in 20 lost (5%)
  – Purple – 4 in 20 lost (20%)
  – Red – 19 out of 20 packets lost!
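The numbers behind one column of such a graph can be sketched as follows: a round of 20 probes is reduced to a median RTT, a spread figure and a loss count. This is an illustrative sketch, not SmokePing's implementation. Two assumptions are mine: jitter is approximated by the population standard deviation of the RTTs, and the four loss levels listed on the slide are treated as the lower bounds of colour bands (the slide does not say how intermediate values are coloured).

```python
from statistics import median, pstdev
from typing import Dict, List, Optional


def summarise_round(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarise one round of probes; None marks a probe that got no reply."""
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]
    lost = sent - len(received)
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "median_ms": median(received) if received else None,
        # Spread of the replies; an assumed stand-in for the 'smoke'.
        "jitter_ms": pstdev(received) if received else None,
    }


def loss_colour(lost: int, sent: int = 20) -> str:
    """Map a loss count to a colour band (band edges are an assumption)."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if lost == 0:
        return "green"
    scaled = lost * 20 / sent  # express loss as 'out of 20'
    if scaled >= 19:
        return "red"
    if scaled >= 4:
        return "purple"
    return "blue"
```

A round with 16 replies of 10 ms and 4 timeouts gives 20% loss, a 10 ms median and zero spread, and falls in the purple band.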
Measurement - NetFlow
• Flow: a conversation in one direction between two computers on the network
• Flow record: information about the flow – timestamp, source, destination, number of packets and bytes, protocol used, application type, etc.
• Member’s Edge Router generates flow records and exports them to the Member’s Edge Server
• Raw flow records are analysed and processed to produce various reports
• Extremely useful when investigating security incidents
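As a rough illustration of how raw flow records become a report, the sketch below models a flow record with the fields named above and totals bytes per source address, the kind of "top talkers" view that helps in a security investigation. The `FlowRecord` class and `top_sources` function are hypothetical; this does not parse the NetFlow export format, and the addresses in the usage note are documentation-only ranges.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional flow, simplified to the fields named on the slide."""
    timestamp: float
    src: str
    dst: str
    protocol: str
    packets: int
    octets: int


def top_sources(flows: Iterable[FlowRecord], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n source addresses that sent the most bytes."""
    totals: Counter = Counter()
    for flow in flows:
        totals[flow.src] += flow.octets
    return totals.most_common(n)
```

Given two flows from 192.0.2.1 (1,500 and 3,000 bytes) and one from 192.0.2.2 (400 bytes), `top_sources(flows, 1)` returns `[("192.0.2.1", 4500)]`.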
Contact & Communication
• 24 x 7 Call Centre / Help Desk
  – (02) 9963-3538
  – New number: 1300 APL NOC (1300 275 662)
• Calls & SMS to the On-Call Officer
• Escalates to:
  – Secondary On-Call Officer
  – Tertiary On-Call Officer
• Response normally within 15 minutes
Contact & Communication
• National and State based e-mail distribution lists for AARNet operational notices
• Contain e-mail addresses of the form [email protected]
  – Member controls which individuals receive AARNet operational notices
  – Member can add or change recipients without any intervention by AARNet
  – Better quality distribution lists – more up-to-date
Contact & Communication
[Diagram: channels linking the AARNet NOC with customers, suppliers and peers. Labels: Nagios, auto SMS, auto e-mail, web site, mailing list, direct e-mail, telephone, telephone call centre (24 x 7, auto SMS).]
Contact & Communication
• Under development – Common Contact Database
  – Single, centralised and definitive
  – Shared by Operations staff
  – All Customer Technical Contacts
    • Preference for mobile phone numbers
  – All Supplier contacts
    • Other NOCs
    • Co-location facilities
    • Circuit IDs, rack locations, etc.
Logging / Record Keeping / Trouble Tickets
• JIRA by Atlassian
• General purpose, web based ‘issue tracking’ software
• Multi-user – anyone with a web browser (and username and password)
• Queue oriented – multiple queues
• One queue dedicated to network trouble tickets (NOCTTS)
• Log of issues including
  – fault descriptions
  – current fault status
  – contact information
  – communications
  – sequence of events
  – comments
  – through to resolution and ‘closed issue’
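The shape of such a ticket log can be sketched as a small data structure: a description, a contact, a status and a time-stamped sequence of events that ends with the resolution. This is an illustrative model only; it is not JIRA's data model or API, and the `TroubleTicket` class and the `NOCTTS-1` key format in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass
class TroubleTicket:
    """Minimal trouble ticket: what broke, who to call, and what happened."""
    key: str
    description: str
    contact: str
    status: str = "open"
    events: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str) -> None:
        """Append a time-stamped entry to the sequence of events."""
        self.events.append((datetime.now(timezone.utc), note))

    def close(self, resolution: str) -> None:
        """Record the resolution as the final event and close the ticket."""
        self.log(f"Resolved: {resolution}")
        self.status = "closed"
```

For example, a ticket created as `TroubleTicket("NOCTTS-1", "Link down", "NOC contact")` would collect one `log()` entry per update and finish with `close("Fibre repaired")`.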
In Summary…
• Increased emphasis on monitoring
• Continued focus on measurement
• Contact and communication vital to our success
• What does AARNet need from you?
  – Quality contact information
  – Pro-active communication if something goes wrong
  – Co-operation and collaboration to fix the problem
AARNet provides 24 x 7 On-call coverage
• The 24 x 7 Helpdesk function is outsourced to a call centre (Link:Q Communications)
  – Customers, peers and service providers call Link:Q (not AARNet directly)
  – A call centre operator answers the phone, takes details and pages the AARNet on-call officer
  – An escalation roster is sent to Link:Q weekly – primary, secondary and tertiary AARNet contacts for the week
  – Upon receiving a call and taking details, Link:Q then call and SMS the primary AARNet on-call officer
  – If the call/SMS is not acknowledged with Link:Q within 20 mins, Link:Q call and SMS the secondary
  – If the secondary call/SMS is not acknowledged with Link:Q within 20 mins, Link:Q call and SMS the tertiary
  – Hence the AARNet SLA of responding within the hour (although usually it is much sooner)
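The "within the hour" figure follows from three roster levels with a 20 minute acknowledgement window each. The sketch below works through that timing. The function names are hypothetical, and one detail is an assumption: the tertiary officer is also given a 20 minute window, which the slide implies but does not state.

```python
from typing import Dict, List, Optional, Sequence, Tuple

ACK_TIMEOUT_MIN = 20
ROSTER = ("primary", "secondary", "tertiary")


def contact_schedule(
    roster: Sequence[str] = ROSTER, timeout_min: int = ACK_TIMEOUT_MIN
) -> List[Tuple[int, str]]:
    """Minutes after the original call at which each officer is contacted."""
    return [(i * timeout_min, role) for i, role in enumerate(roster)]


def worst_case_minutes(
    roster: Sequence[str] = ROSTER, timeout_min: int = ACK_TIMEOUT_MIN
) -> int:
    """Latest acknowledgement time if every level uses its full window."""
    return len(roster) * timeout_min


def first_acknowledgement(
    ack_delay_min: Dict[str, Optional[float]],
    roster: Sequence[str] = ROSTER,
    timeout_min: int = ACK_TIMEOUT_MIN,
) -> Optional[Tuple[str, float]]:
    """Who acknowledges first, and how long after the original call.

    ack_delay_min maps a role to the minutes that officer takes to
    acknowledge once contacted; None or a missing role means no answer.
    """
    for start, role in contact_schedule(roster, timeout_min):
        delay = ack_delay_min.get(role)
        if delay is not None and delay <= timeout_min:
            return role, start + delay
    return None
```

With the default roster, officers are contacted at 0, 20 and 40 minutes, so the worst case is 60 minutes. If the primary never answers and the secondary acknowledges 5 minutes after being contacted, the response comes 25 minutes after the original call.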
On-Call Officer
• The on-call officer is responsible for:
  – Taking and acknowledging calls to the AARNet Helpdesk
  – Responding immediately to the caller
  – Gathering as much information as possible in the first instance and creating a ticket for each call
  – Deciding whether to take responsibility for the ticket personally or assign the ticket to someone else
  – If taking responsibility for the ticket:
    • Analysing, troubleshooting, testing, solving, working the ticket to closure
    • Monitoring communications, events and developments related to the ticket, including e-mail to the AARNet NOC
    • Communicating/updating relevant information to all parties involved or affected for the duration of the ticket
On-Call Officer... Business Hours
• Additionally, during business hours, the on-call officer is responsible for:
  – Monitoring alarms, communications, events and developments via e-mail messages to the AARNet NOC
  – Creating and updating tickets to record and document faults and difficulties
  – Creating and updating tickets and calendar entries to reflect periods of scheduled maintenance
  – Informing all affected parties of outages or hazardous conditions due to faults or periods of scheduled maintenance
On-Call Allowance
• AARNet pays the primary on-call officer an allowance per day, usually for the 7-day week they are on call.
• In addition, the primary on-call officer is paid time-and-a-half if called out-of-hours:
  – Out-of-hours is usually regarded as outside Mon-Fri 8:00am-6:00pm, plus public holidays.
  – Paid for a minimum of 2 hours per day, again only if called out-of-hours.
• All Operational Staff are on the roster!
  – E.g. Sys Admins cover network faults
  – Improves the skills and familiarity across all Operational areas
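The out-of-hours rule and the two-hour minimum can be expressed as a short worked example. This is a sketch of the rule as stated on the slide, not AARNet's payroll logic. The function names and the hourly rate in the usage note are hypothetical, and treating 6:00pm exactly as out-of-hours is an assumption.

```python
from datetime import date, datetime, time
from typing import AbstractSet

BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)
MIN_PAID_HOURS = 2.0
OVERTIME_MULTIPLIER = 1.5  # time-and-a-half


def is_out_of_hours(
    when: datetime, public_holidays: AbstractSet[date] = frozenset()
) -> bool:
    """True outside Mon-Fri 8:00am-6:00pm, or at any time on a public holiday."""
    if when.date() in public_holidays:
        return True
    if when.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (BUSINESS_START <= when.time() < BUSINESS_END)


def callout_pay(hours_worked: float, hourly_rate: float) -> float:
    """Pay for one day's out-of-hours call-outs at time-and-a-half.

    The two-hour minimum applies only if the officer was actually called.
    """
    if hours_worked <= 0:
        return 0.0
    return max(hours_worked, MIN_PAID_HOURS) * hourly_rate * OVERTIME_MULTIPLIER
```

At a hypothetical rate of 40.00 per hour, a half-hour call-out at 7:00pm on a weekday is paid as two hours at time-and-a-half (120.00), while a three-hour call-out is paid for the full three hours (180.00).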
Thank you