Occam's Razor - An Introduction to Holistic Troubleshooting

Upload: wes-morgan

Post on 03-Apr-2018


  • 7/28/2019 Occam's Razor - An Introduction to Holistic Troubleshooting


    2011 IBM Corporation

    ID902 Occam's Razor: An Introduction to Holistic Troubleshooting

    Wes Morgan, Senior Software Engineer


    Agenda

    Why are we here?

    Increasingly Complex Architectures

    Specialization within IT/IS

    Command and Control Issues

    Consequences of "Fix it NOW!"

    The Holistic Approach

    Occam's Razor

    Preparation

    Understanding Your Deployment

    Knowing Your Routine

    Knowing Your Limits

    Execution

    Ask Your Neighbors

    Identify/Refine Your Target

    Problem vs. Routine

    Client, Server or Both?

    Recent Changes

    Lather, Rinse, Repeat...

    Questions & Answers


    Why Are We Here? Complex Architectures

    Fault Tolerance/Redundancy

    Load Balancers

    Firewalls

    Intranet/Extranet

    Virtualization


    Why Are We Here? IT/IS Specialization

    "We don't handle that"

    Different team

    Communication often rare and/or difficult

    Simple questions answered slowly

    No one really sees big picture


    Why Are We Here? Command and Control

    "We can't do that until the next window"

    Change Control != everyone informed

    Software integration demands team integration as well

    Multiple vendors/contractors may be involved


    Why Are We Here? "Fix It NOW" Consequences

    Panic mode

    Time-to-resolution sometimes faces arbitrary limits

    All hands on deck

    Overall technical guidance lacking

    Troubleshooting becomes scattershot


    The Holistic Approach: Occam's Razor

    "Pluralitas non est ponenda sine necessitate."

    "Plurality should not be posited without necessity."

    William of Ockham, c. 1285-1349

    Close relatives:

    When two theories explain the same phenomenon, choose the simpler

    "admit no more causes...than such as are both true and sufficient..." (Newton)

    KISS: Keep It Simple, Stupid


    Why Use Occam's Razor?

    Multiple failures highly unlikely

    Far more likely that one root failure triggered additional problems

    Playing "it could be..." introduces complexity and (probably) politics

    Don't chase rabbits!
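
    A rough arithmetic sketch of why coincident failures are the less likely explanation. The component names and daily failure probabilities are hypothetical, and the calculation assumes the failures are independent, which real systems only approximate.

```python
import math

# Hypothetical chance that each component fails on a given day (assumed values).
p_fail = {"load_balancer": 0.01, "directory": 0.02, "san": 0.005}

# If the failures are unrelated, the chance of all three on the same day
# is the product of the individual probabilities.
coincidence = math.prod(p_fail.values())

# Compare with the single most failure-prone component on its own.
most_likely_single = max(p_fail.values())

print(f"three unrelated failures at once: {coincidence:.6%}")
print(f"most likely single failure:       {most_likely_single:.2%}")
```

    With numbers anywhere near these, one root failure that cascades is a far better first hypothesis than several unrelated faults.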


    Preparation: Understand Your Deployment

    It's far more than just your stuff

    Hardware (or lack thereof!)

    Operating System

    Network (within the data center)

    Network (long haul/extranet/VPN)

    Dependencies (directory, SAN)

    Special-purpose devices (firewalls/proxies/reverse-proxies)

    Network appliances

    KNOW YOUR DATA PATH!


    Preparation: Know Your Routine

    Profile your systems!

    perfpmr (AIX), perfmon (Windows), iostat/vmstat (Linux)

    Understand what normal looks like

    Be sure to profile peak time too!

    Logins/sessions per day

    User patterns (e.g. Accounting end-of-month)

    Domino platform statistics can be VERY useful
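
    One way to record what "normal" looks like is to reduce collected samples to a few summary numbers per metric, once for routine hours and once for peak. A minimal sketch, assuming the samples have already been transcribed from a tool such as vmstat, iostat or perfmon; the metric names and values are hypothetical.

```python
import statistics

def build_baseline(samples):
    """Summarize metric samples: mean, standard deviation, and max per metric."""
    baseline = {}
    for metric, values in samples.items():
        baseline[metric] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "max": max(values),
        }
    return baseline

# Hypothetical samples for a routine period and a peak period.
routine_samples = {
    "cpu_busy_pct": [22, 25, 19, 24, 21, 23],
    "disk_queue_len": [0.4, 0.6, 0.5, 0.5, 0.7, 0.3],
}
peak_samples = {
    "cpu_busy_pct": [61, 66, 70, 64, 68, 63],
    "disk_queue_len": [1.8, 2.1, 2.4, 1.9, 2.2, 2.0],
}

profiles = {
    "routine": build_baseline(routine_samples),
    "peak": build_baseline(peak_samples),
}
for period, profile in profiles.items():
    for metric, stats in profile.items():
        print(f"{period:8} {metric:15} mean={stats['mean']:.2f} max={stats['max']}")
```

    Keeping the routine and peak profiles separate avoids mistaking a normal busy period for a problem later on.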


    Preparation: Knowing Your Limits

    Compare your routine use to:

    Vendor benchmarks

    Third party testing/whitepapers

    Software specifications

    Know how much wiggle room you have

    CPU utilization

    RAM consumption

    ESPECIALLY important in virtual environments
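
    "Wiggle room" can be expressed as the fraction of a limit still unused at peak. A minimal sketch; the peak figures, the limits, and the 20% warning threshold are all assumptions chosen for illustration, not vendor numbers.

```python
def headroom(peak_usage, limit):
    """Return the fraction of the limit still unused at peak (negative if over)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (limit - peak_usage) / limit

# Hypothetical observed peaks and the limits they are compared against.
peaks = {"cpu_pct": 70.0, "ram_gb": 14.5, "sessions": 4200}
limits = {"cpu_pct": 100.0, "ram_gb": 16.0, "sessions": 5000}
WARN_BELOW = 0.20  # assumed threshold: flag anything with under 20% headroom

report = {name: headroom(peaks[name], limits[name]) for name in peaks}
tight = sorted(name for name, room in report.items() if room < WARN_BELOW)

for name, room in report.items():
    print(f"{name:10} headroom {room:.1%}")
print("little wiggle room:", tight)
```

    In a virtual environment the limit to use is the guest's actual allocation, which may be smaller than the physical host suggests.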


    Execution: Ask Your Neighbors

    Many deployments in your environment share potential points of failure

    Load Balancers

    SAN

    Quick check with peers may identify common problem quickly

    Formalize this process if you can (weekly outage reports?)

    May also be indicative of general network issues

    Allows you to handle some issues without vendor involvement
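
    If each team records which shared components its deployment depends on, the quick check with peers becomes a set intersection. A minimal sketch; the deployment names, component names, and the `common_suspects` helper are hypothetical.

```python
# Hypothetical map of deployments to the shared infrastructure they rely on.
dependencies = {
    "mail": {"lb-1", "san-a", "ldap-1"},
    "chat": {"lb-1", "san-b", "ldap-1"},
    "portal": {"lb-2", "san-a", "ldap-1"},
    "wiki": {"lb-2", "san-b", "ldap-2"},
}

def common_suspects(dependencies, troubled):
    """Components shared by every troubled deployment and by no healthy one."""
    troubled_sets = [dependencies[name] for name in troubled]
    shared = set.intersection(*troubled_sets) if troubled_sets else set()
    healthy = set()
    for name, deps in dependencies.items():
        if name not in troubled:
            healthy |= deps
    return shared - healthy

print(common_suspects(dependencies, ["mail", "chat"]))
```

    This is only a first cut: a component that healthy deployments also use can still be failing partially, so an empty result does not clear the shared infrastructure.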


    Execution: Identify/Refine the Target

    Most missed aspect of troubleshooting

    Identify scope/range of affected users

    Identify scope/range of affected servers

    LOOK FOR COMMON FACTORS!

    Third-party applications

    Same location

    Same release

    Time of day

    Check for customizations

    Follow the data flow!
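
    Looking for common factors can be done mechanically once affected and unaffected users are described by the same attributes. A minimal sketch; the attribute names, the sample records, the 80% threshold, and the `common_factors` helper are hypothetical.

```python
from collections import Counter

def common_factors(affected, unaffected, threshold=0.8):
    """Attribute values seen in most affected records and in no unaffected ones."""
    if not affected:
        return []
    counts = Counter()
    for record in affected:
        counts.update(record.items())
    seen_ok = {item for record in unaffected for item in record.items()}
    return sorted(
        item for item, n in counts.items()
        if n / len(affected) >= threshold and item not in seen_ok
    )

# Hypothetical user records: location, release, and a third-party application.
affected = [
    {"site": "Boston", "release": "A", "third_party_app": "yes"},
    {"site": "Austin", "release": "A", "third_party_app": "yes"},
    {"site": "Boston", "release": "A", "third_party_app": "yes"},
]
unaffected = [
    {"site": "Boston", "release": "A", "third_party_app": "no"},
    {"site": "Austin", "release": "B", "third_party_app": "no"},
]

print(common_factors(affected, unaffected))
```

    Here release "A" is common to all affected users but is also present on a healthy one, so only the third-party application is reported.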


    Execution: Problem vs. Routine

    Take a snapshot of the problem

    Compare it to routine data

    May identify particular areas of concern

    May allow vendor to focus their efforts better/faster

    Examples:

    Domino NSD: NAMElookup activity

    Perfmon/perfpmr/iostat: disk queuing

    Pay particular attention to period just BEFORE problem (last 10 minutes)

    Be prepared to be pointed in a different direction!
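
    Comparing the lead-up to a problem against routine data can be sketched as two small steps: select the samples from the last 10 minutes before the problem, then check how far they sit from the routine mean. The sample values, the routine mean and standard deviation, and the three-sigma cutoff are assumptions for illustration.

```python
from datetime import datetime, timedelta

def window_before(samples, problem_start, minutes=10):
    """Return samples taken in the `minutes` just before the problem began."""
    start = problem_start - timedelta(minutes=minutes)
    return [(ts, v) for ts, v in samples if start <= ts < problem_start]

def deviates(values, routine_mean, routine_stdev, sigmas=3.0):
    """True if the window's average is more than `sigmas` deviations from routine."""
    if not values:
        return False
    average = sum(values) / len(values)
    if routine_stdev == 0:
        return average != routine_mean
    return abs(average - routine_mean) / routine_stdev > sigmas

# Hypothetical disk-queue samples every 5 minutes; problem reported at 14:30.
problem_start = datetime(2011, 1, 31, 14, 30)
readings = [0.5, 0.6, 0.4, 0.5, 3.8, 4.6, 9.0]
samples = [
    (datetime(2011, 1, 31, 14, 0) + timedelta(minutes=5 * i), value)
    for i, value in enumerate(readings)
]

lead_up = [value for _, value in window_before(samples, problem_start)]
flagged = deviates(lead_up, routine_mean=0.5, routine_stdev=0.15)
print(lead_up, flagged)
```

    A metric that is already far from routine before users notice anything is a good place to point the next round of data gathering.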


    Execution: Client, Server or Both?

    DON'T GO AFTER A FLY WITH A SLEDGEHAMMER!

    Resist the urge to turn on all the debug

    Overly ambitious debug can present its own performance cost

    DEBUG_TCP_ALL in IBM Lotus Domino

    VP_TRACE_ALL in IBM Lotus Sametime

    debug=FINEST in Java

    It's worth a round of data gathering to target server debug more specifically

    High-level client-side debug correlates well with trace logs

    Live HTTP Headers (Firefox add-on)

    Firebug (Firefox add-on)

    Fiddler (MSIE proxy)

    Again, gather twice - routine and problem - when possible
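
    The same idea of targeted rather than blanket debug can be illustrated with Python's standard logging module. This is an analogy only, not the Domino, Sametime or Java settings named above; the logger names are hypothetical.

```python
import logging

logging.basicConfig()

# Hypothetical application with two subsystems under a common parent logger.
app_log = logging.getLogger("app")
network_log = logging.getLogger("app.network")
storage_log = logging.getLogger("app.storage")

# Quiet default for the whole application.
app_log.setLevel(logging.WARNING)

# A first round of data pointed at the network layer, so only it gets DEBUG.
network_log.setLevel(logging.DEBUG)

network_log.debug("connection opened")   # emitted
storage_log.debug("block cache lookup")  # suppressed by the inherited WARNING level
```

    Everything outside the targeted subsystem stays at its normal level, so the extra debug cost is confined to the area the evidence points at.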


    Execution: Recent Changes

    Back to Change Control

    Look for ANY changes close to start of problem

    Don't forget to check for OS patches/updates

    Look for new stuff too...

    Check all along the data flow
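
    If change records are available in any structured form, checking them against the data flow is a simple filter. A minimal sketch; the record layout, component names, seven-day lookback, and the `recent_changes` helper are hypothetical.

```python
from datetime import datetime, timedelta

def recent_changes(changes, problem_start, data_path, lookback_days=7):
    """Changes to data-path components made shortly before the problem began."""
    earliest = problem_start - timedelta(days=lookback_days)
    hits = [
        c for c in changes
        if c["component"] in data_path and earliest <= c["when"] <= problem_start
    ]
    # Most recent first: changes closest to the problem are the first suspects.
    return sorted(hits, key=lambda c: c["when"], reverse=True)

# Hypothetical data path and change log.
problem_start = datetime(2011, 1, 31, 9, 0)
data_path = {"client", "firewall", "load-balancer", "app-server", "directory"}
changes = [
    {"when": datetime(2011, 1, 10, 22, 0), "component": "app-server", "what": "application fix pack"},
    {"when": datetime(2011, 1, 29, 23, 0), "component": "firewall", "what": "rule set update"},
    {"when": datetime(2011, 1, 30, 2, 0), "component": "app-server", "what": "OS patches"},
    {"when": datetime(2011, 1, 30, 3, 0), "component": "print-server", "what": "driver update"},
]

suspects = recent_changes(changes, problem_start, data_path)
for change in suspects:
    print(change["when"], change["component"], change["what"])
```

    The filter is only as good as the change log: changes that bypassed Change Control will not appear, which is why the slide says to look for ANY changes.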


    Lather, Rinse, Repeat...

    Be prepared to cycle through this process several times

    Apply same principles to each area of troubleshooting

    Example:

    Identify/Refine shows only particular users suffering

    Logs show directory issues

    Now, users not experiencing problems are "routine"

    Troubleshoot directory by comparing problem users against routine users

    e.g. get LDIF dumps for both

    Only go where the evidence takes you!
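
    Comparing LDIF dumps for a problem user and a routine user can be sketched as parsing each entry into attributes and listing where they differ. The parser below is a deliberate simplification (no base64 values, no folded lines), and the entries, attribute values, and helper names are hypothetical.

```python
def parse_ldif_entry(text):
    """Parse one simple LDIF entry into {attribute: set of values}.

    Simplification: skips comments and blank lines, and does not handle
    base64 values ("attr:: ...") or folded continuation lines.
    """
    entry = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        entry.setdefault(name.strip().lower(), set()).add(value.strip())
    return entry

def diff_entries(problem, routine, ignore=("dn", "uid", "mail")):
    """Attributes whose values differ, skipping ones expected to be per-user."""
    differences = {}
    for attr in sorted(set(problem) | set(routine)):
        if attr in ignore:
            continue
        p, r = problem.get(attr, set()), routine.get(attr, set())
        if p != r:
            differences[attr] = (sorted(p), sorted(r))
    return differences

# Hypothetical dumps for one problem user and one routine user.
problem_ldif = """dn: uid=pat,ou=people,dc=example,dc=com
uid: pat
objectClass: inetOrgPerson
mail: pat@example.com
employeeType: contractor
"""
routine_ldif = """dn: uid=ron,ou=people,dc=example,dc=com
uid: ron
objectClass: inetOrgPerson
mail: ron@example.com
employeeType: staff
telephoneNumber: 555-0100
"""

diff = diff_entries(parse_ldif_entry(problem_ldif), parse_ldif_entry(routine_ldif))
for attr, (problem_values, routine_values) in diff.items():
    print(f"{attr}: problem={problem_values} routine={routine_values}")
```

    Attributes that differ are candidates to investigate, not conclusions; the comparison only narrows where to look next.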


    QUESTIONS & ANSWERS

    Please complete a session evaluation!

    More questions? Find me in the Lotus Solutions Development Lab!

    THANKS FOR BEING HERE!