lecture2 - fault management

Upload: kawish11

Post on 09-Apr-2018


  • 8/8/2019 Lecture2 - Fault Management

    1/23

    CIT 443: Enterprise Network Management

    Fault Management


    Fault?

    An event that causes adverse, unintended, or non-specification operating conditions in or on an enterprise network system

    May be masked by automatic error correction routines

    May be perceived initially as performance problems

    Incidents that recur with increasing frequency may indicate more serious issues


    Classification of Faults

    Event Type   Severity        Response
    Incident     Informational   Notice
    Problem      Alarm           Alert
    Error        Emergency       Caution
    Failure      Critical        Warning
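
    The classification table can be held as a small lookup structure. The sketch below is illustrative only: the names FaultClass, FAULT_CLASSES, and classify are hypothetical and do not come from the lecture; only the four rows of the table do.

```python
from typing import NamedTuple


class FaultClass(NamedTuple):
    """Severity and response attached to one event type (hypothetical type)."""
    severity: str
    response: str


# Event type -> (severity, response), copied row by row from the slide's table.
FAULT_CLASSES = {
    "Incident": FaultClass("Informational", "Notice"),
    "Problem": FaultClass("Alarm", "Alert"),
    "Error": FaultClass("Emergency", "Caution"),
    "Failure": FaultClass("Critical", "Warning"),
}


def classify(event_type: str) -> FaultClass:
    """Return the severity and response for an event type (case-insensitive)."""
    key = event_type.strip().capitalize()
    if key not in FAULT_CLASSES:
        raise ValueError(f"unknown event type: {event_type!r}")
    return FAULT_CLASSES[key]
```

    For example, classify("problem") returns the pair (Alarm, Alert).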


    Fault Management

    The process of identifying, locating, documenting, and resolving adverse, unintended, or non-specification operating conditions of enterprise network systems

    Also includes the policies, processes, and/or procedures needed for each of these steps


    Benefits of Fault Management

    Reduce down-time

    Reduce the need for fire-fighting

    Allow more time for other management tasks


    Elements of Fault Management

    System Monitoring

    Alarm Processing

    Fault Resolution


    Forouzan, B. A. TCP/IP Protocol Suite, Second Edition. McGraw Hill, 2003.

    System Monitoring & Alarm Processing

    3 Relevant Protocols

    SNMP (v3): Defines the format of packets exchanged between a manager and an agent. It reads and changes the status (values) of objects (variables) in SNMP packets. (Forouzan, p. 625)

    MIB (v2): Creates a collection of named objects, their types, and their relationships to each other in an entity to be managed. (Forouzan, p. 625)

    SMI (v2): A guideline for SNMP that emphasizes three attributes to handle an object:

    1. Name

    2. Data Type

    3. Encoding Method


    SNMP: Managers and Agents

    Framework for managing devices in an internetwork using the TCP/IP protocol suite.

    Manager: Host that runs the SNMP client program

    Agent: Host (router, switch, etc.) that runs the SNMP server

    Agent maintains information in a database to be queried and/or modified by the manager

    Agent can also contribute to the management process by sending unsolicited messages (traps) to the manager to notify it of system events


    SNMP: Three Management Functions

    1. Manager can query an agent for information

    2. Manager can force an agent to perform a task

    3. Agent can contribute to management process (traps)
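
    The three functions can be illustrated with a toy in-memory model. This is a minimal sketch, not real SNMP: nothing is sent over a network, and the classes ToyAgent and ToyManager and their methods are hypothetical. The variable names ifOperStatus and ifAdminStatus and the trap name linkDown are borrowed from the standard interfaces MIB purely as labels.

```python
from typing import Callable, Dict, List, Tuple

Trap = Tuple[str, str]  # (agent name, trap name)


class ToyAgent:
    """Hypothetical stand-in for an SNMP agent: a host holding named variables."""

    def __init__(self, name: str, send_trap: Callable[[str, str], None]) -> None:
        self.name = name
        # 1 = up, 2 = down for both variables.
        self.variables: Dict[str, int] = {"ifOperStatus": 1, "ifAdminStatus": 1}
        self._send_trap = send_trap

    def get(self, variable: str) -> int:
        return self.variables[variable]

    def set(self, variable: str, value: int) -> None:
        self.variables[variable] = value

    def link_failed(self) -> None:
        """Simulate a link failure and notify the manager without being asked."""
        self.variables["ifOperStatus"] = 2
        self._send_trap(self.name, "linkDown")


class ToyManager:
    """Hypothetical stand-in for an SNMP manager."""

    def __init__(self) -> None:
        self.traps: List[Trap] = []

    def query(self, agent: ToyAgent, variable: str) -> int:
        # Function 1: the manager queries an agent for information.
        return agent.get(variable)

    def command(self, agent: ToyAgent, variable: str, value: int) -> None:
        # Function 2: the manager forces a task by changing a variable.
        agent.set(variable, value)

    def receive_trap(self, agent_name: str, trap_name: str) -> None:
        # Function 3: the agent contributes by sending an unsolicited trap.
        self.traps.append((agent_name, trap_name))
```

    A manager would be wired to an agent with ToyAgent("router1", manager.receive_trap); calling link_failed() on the agent then appends ("router1", "linkDown") to the manager's trap log.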


    Structure of Management Info

    Abstract Syntax Notation One (ASN.1) is used to define and access information contained within the MIB structure.

    A notation system that identifies data structures for reliable encoding, transmission, and decoding of messages.

    Nearly all entities managed by SNMP have an object ID that starts with 1.3.6.1.2.1 (iso.org.dod.internet.mgmt.mib-2)
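
    As a rough illustration of how an object ID is named and encoded, the sketch below checks whether an OID sits under the mib-2 prefix and encodes it using the Basic Encoding Rules (BER) that accompany ASN.1. The function names are hypothetical, and the encoder only handles the short length form (bodies under 128 bytes).

```python
from typing import List

# The six arcs of 1.3.6.1.2.1 and their names, as given on the slide.
MIB2_PREFIX = [1, 3, 6, 1, 2, 1]
MIB2_ARC_NAMES = ["iso", "org", "dod", "internet", "mgmt", "mib-2"]


def parse_oid(dotted: str) -> List[int]:
    """Split a dotted OID string into its integer arcs."""
    return [int(part) for part in dotted.strip(".").split(".")]


def is_under_mib2(dotted: str) -> bool:
    """True when the OID starts with 1.3.6.1.2.1."""
    return parse_oid(dotted)[: len(MIB2_PREFIX)] == MIB2_PREFIX


def encode_base128(value: int) -> bytes:
    """Encode one arc in base 128, high bit set on every byte except the last."""
    chunks = [value & 0x7F]
    value >>= 7
    while value:
        chunks.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(chunks))


def ber_encode_oid(dotted: str) -> bytes:
    """BER-encode an OID: tag 0x06, length, then the packed arcs."""
    arcs = parse_oid(dotted)
    if len(arcs) < 2:
        raise ValueError("an OID needs at least two arcs")
    # The first two arcs share one value: 40 * first + second.
    body = encode_base128(40 * arcs[0] + arcs[1])
    for arc in arcs[2:]:
        body += encode_base128(arc)
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x06, len(body)]) + body
```

    Under these rules the prefix 1.3.6.1.2.1 encodes to the bytes 06 05 2b 06 01 02 01, where 0x2b (43) is 40 * 1 + 3.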


    Fault Resolution Process

    1. Identify the fault

        What are the fault symptoms?

        What could be the problem?

    2. Isolate the fault

    3. Prioritize the fault

    4. Correct the fault (if possible)

    5. Fault Reporting


    Identify a Fault - Collect Information

    Log Network Events

        Through the use of SNMP traps, etc.

        Which device(s) originated the events?

    Watchdog Timers

        Reset with the completion of a given task

        Generate a trap when the timer expires and the task is not complete

    Polling

        Periodic monitoring of network activity

        Polled data is often logged to a server

        Useful in trend analysis and resolving intermittent faults

        Useful for resolving problems after the fact

        Polling uses bandwidth: shorter polling intervals require more bandwidth
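
    Two of these ideas lend themselves to a short sketch: a watchdog timer that is reset whenever its task completes, and the arithmetic behind the polling bandwidth trade-off. Everything here is an assumption made for illustration: the class and function names are hypothetical, time is passed in explicitly as seconds rather than read from a clock, and the per-poll byte count is a made-up figure.

```python
class WatchdogTimer:
    """Expires unless reset() is called within timeout_s seconds."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def reset(self, now: float) -> None:
        """Call when the monitored task completes; pushes the deadline out."""
        self.deadline = now + self.timeout_s

    def expired(self, now: float) -> bool:
        """True once the deadline has passed; a real agent would send a trap here."""
        return now >= self.deadline


def polling_load_bps(devices: int, bytes_per_poll: int, interval_s: float) -> float:
    """Average bandwidth consumed by polling, in bits per second.

    bytes_per_poll is the assumed size of one request plus its response.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return devices * bytes_per_poll * 8 / interval_s
```

    With 200 devices and an assumed 500 bytes per poll, a 60-second interval averages about 13.3 kbit/s, while a 10-second interval averages 80 kbit/s: cutting the interval to one sixth multiplies the load by six.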


    Isolate the Fault

    Look Beyond the Symptoms

    Use a Fault Isolation Methodology

        Top Down

        Bottom Up

    Intermittent Problems are Difficult! Why?

        Attempt to take a snapshot of the network at the time of service interruption

        Take note of recurrence time

        Attempt to correlate data: What is the same?

        Determine whether the fault is part of a Common Cause Fault (Failure) Group

    Root Cause Analysis
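
    The "What is the same?" question can be sketched as a set intersection over snapshots taken at each occurrence. This is a minimal illustration, not a correlation engine: the function name and the attribute keys used in the example are hypothetical.

```python
from typing import Dict, List


def common_attributes(events: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the attribute/value pairs that appear in every event snapshot."""
    if not events:
        return {}
    shared = set(events[0].items())
    for event in events[1:]:
        # Keep only the pairs this event has in common with all earlier ones.
        shared &= set(event.items())
    return dict(shared)


# Hypothetical snapshots of three occurrences of an intermittent fault.
snapshots = [
    {"switch": "sw-3", "vlan": "20", "hour": "09"},
    {"switch": "sw-3", "vlan": "30", "hour": "09"},
    {"switch": "sw-3", "vlan": "20", "hour": "14"},
]
suspects = common_attributes(snapshots)
```

    Here only the switch is the same across all three occurrences, which makes it the first candidate for a common cause.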


    Prioritize Faults

    Not all faults are of the same priority

    Determine which faults to take immediate action on and which to defer

    Some prioritization can be performed at the help desk level

    Divide and conquer


    Prioritize Faults

                          Impact
    Criticality    High   Medium   Low
    Low             3       4       5
    Medium          2       3       4
    High            1       2       3
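
    The matrix translates directly into a lookup table. The sketch below assumes that 1 is the most urgent priority, which the slide implies (high criticality with high impact) but does not state; the function names and the fault tuples are hypothetical.

```python
from typing import Dict, List, Tuple

# PRIORITY[criticality][impact], copied from the slide's matrix.
PRIORITY: Dict[str, Dict[str, int]] = {
    "High":   {"High": 1, "Medium": 2, "Low": 3},
    "Medium": {"High": 2, "Medium": 3, "Low": 4},
    "Low":    {"High": 3, "Medium": 4, "Low": 5},
}


def priority(criticality: str, impact: str) -> int:
    """Look up the priority number for a criticality/impact pair."""
    try:
        return PRIORITY[criticality][impact]
    except KeyError:
        raise ValueError(f"unknown level: {criticality!r}/{impact!r}") from None


def triage(faults: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Order (description, criticality, impact) tuples, most urgent first.

    Assumes a lower priority number means more urgent.
    """
    return sorted(faults, key=lambda fault: priority(fault[1], fault[2]))
```

    Because sorted() is stable, faults with the same priority number keep the order in which they were reported.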


    Correct the Fault

    Repair, Restore, Replace, then Reevaluate

    Remember, faults can be caused by just about anything in the network, including users.

    Fixing the underlying fault may require a change in the policies governing how users interact with network systems


    Fault Reporting

    Symptoms

    Effect on Network Operations

    Cause

    Resolution

    Update Documentation

    What is the purpose of reporting?


    Reporting/Documentation

    MTTF (Mean Time To Failure)

    MTBF (Mean Time Between Failures)

    Failure Rate

    MTTR (Mean Time To Repair)
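
    These four figures can be derived from an outage log. The slide gives no formulas, so the sketch below assumes one common convention for a repairable system: MTTF is average uptime per failure, MTTR is average downtime per failure, MTBF = MTTF + MTTR, and failure rate = 1 / MTBF. Other texts define failure rate as 1 / MTTF. The function name and input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def reliability_metrics(
    outages: List[Tuple[float, float]], period_hours: float
) -> Dict[str, float]:
    """Compute MTTF, MTTR, MTBF, and failure rate over an observation period.

    outages holds (start_hour, end_hour) for each failure inside the period.
    """
    failures = len(outages)
    if failures == 0:
        raise ValueError("no failures recorded; the metrics are undefined")
    downtime = sum(end - start for start, end in outages)
    uptime = period_hours - downtime
    mttf = uptime / failures      # average running time before a failure
    mttr = downtime / failures    # average time to restore service
    mtbf = mttf + mttr            # average time from one failure to the next
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "failure_rate": 1.0 / mtbf,  # failures per hour, under the stated convention
    }
```

    For three outages totalling 8 hours in a 1000-hour period, this gives an MTTR of about 2.67 hours, an MTTF of about 330.7 hours, an MTBF of about 333.3 hours, and a failure rate of 0.003 failures per hour.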


    Fault Management: Network Entities

    PBX

    Hubs

    Routers

    Switches

    Servers

    Workstations

    Firewalls

    Intrusion Detection/Prevention Systems

    Wireless Access Points

    Power Management Systems

    Network SCADA systems

    Temperature Management Systems (HVAC)

    Home Appliances?

    Others?


    Industry Trends

    Enterprise Network Management is a key initiative for large companies

    All-encompassing: manage every part of the enterprise network

    Automate Correlation

        80% of time is spent trying to isolate and determine the fault (root cause analysis)

        Notify a manager or engineer of what to fix

    Automate Fault Resolution - Device manager fixes problems local to the box or to a network comprised of the same components


    Topics for Further Investigation

    1. Technologies for Automating Fault Diagnosis

    2. Methods for Automated Fault Resolution

    3. Evolution of Protocols for Fault Notification and Trapping

    4. Fault Management System Architectures

    5. Enterprise Network Management: Best Practices and Lessons Learned

    6. Corporate Implementations of Enterprise Network Management Systems

    7. Current Issues with Enterprise Network Management

    8. Enterprise Network Management of Wireless Networks

    9. Enterprise Network Management of Converged Networks with Differentiated Services


    Forouzan, B. A. TCP/IP Protocol Suite, Second Edition. McGraw Hill, 2003.

    Questions?