7.17 1130am adv.perform.forensics_bb

Advanced Performance Forensics
Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering
Stephen Feldman, Senior Director, Performance Engineering and Architecture
[email protected]


Page 1: 7.17 1130am adv.perform.forensics_bb

Advanced Performance Forensics

Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering

Stephen Feldman, Senior Director, Performance Engineering and Architecture
[email protected]

Page 2: 7.17 1130am adv.perform.forensics_bb

Session Goals

The goals of today’s session are…

• Introduce the practice of performance forensics.
• Present an argument for session-level analysis.

• Discuss the difference between Resources and Interfaces.

• Present tools that can be used for performance forensics at different layers of the architectural stack and the client layer.

Page 3: 7.17 1130am adv.perform.forensics_bb

Definition of Performance Forensics

• The practice of collecting evidence, performing interviews, and modeling for the purpose of root-cause analysis of a performance or scalability problem.
  – In the context of a performance (response time) problem
  – Discussing an individual event (a session experience)

• Performance problems can be classified into two main categories:
  – Response Time Latency
  – Queuing Latency

Page 4: 7.17 1130am adv.perform.forensics_bb

Performance Forensics Methodology

[Methodology diagram] Stages: Identify the Problem → Interviewing → Collecting Evidence → Data Analysis → Modeling and Visualizing → Method-R Sampling and Simulating → Root Cause, with Session Inspection performed along the way.

• Identify the most important operations that affect your business.
• Develop a problem statement.
• Formulate a hypothesis.
• Establish a diagnosis.
• Turn the problem statement into a diagnosis to get to root cause.

Page 5: 7.17 1130am adv.perform.forensics_bb

Putting Performance Forensics in Context

• Emphasis on the user and the user’s actions and experiences.
  – How can this be measured?

• Capture the response time experience and the response time expectations of the user.
  – Put user actions into perspective, in line with the goals of Method-R (what’s most important to the business).

• Identify the contributors to response latency.

• Everyone needs to be involved

Page 6: 7.17 1130am adv.perform.forensics_bb

Measuring the Session

• When should this happen?
  – When a problem statement cannot be developed from the data you do have (evidence or interviews) and more data needs to be collected.

• How should you go about this?
  – Minimize disruption to the production environment.
  – Use adaptive collection: less intensive to more intensive over time.

Collection spectrum: Basic Sampling → Continuous Collection → Profiling

Page 7: 7.17 1130am adv.perform.forensics_bb

Resources vs. Interfaces

• One of the most critical data points to collect.

• Interfaces are critical for understanding throughput and queuing models.
  – Queuing is another cause of latency.
  – Also a cause of time-outs.

• Resources are critical for understanding the cost of performing a transaction.
  – Core Resources: CPU, Memory, and I/O.

• Response Time = Service Time + Queue Time
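
To make the Response Time = Service Time + Queue Time relationship concrete, here is a minimal sketch. The M/M/1 approximation used to estimate queue time from utilization is my own assumption for the example, not something from the slide.

```python
# Illustration of Response Time = Service Time + Queue Time.
# The M/M/1 estimate (queue time = service_time * utilization / (1 - utilization))
# is an assumption for this sketch, not part of the original slide.

def response_time(service_time_s: float, utilization: float) -> float:
    """Estimate response time as service time plus queue time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    queue_time = service_time_s * utilization / (1.0 - utilization)
    return service_time_s + queue_time

if __name__ == "__main__":
    for util in (0.50, 0.80, 0.90, 0.95):
        print(f"utilization={util:.2f}  response_time={response_time(0.2, util):.2f}s")
```

The point of the sketch is that the same 0.2 s of service time turns into multi-second response times as queuing grows, which is why the two latency categories are worth separating.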

Page 8: 7.17 1130am adv.perform.forensics_bb

The Importance of Wait Events

• Rise of session-level forensics
  – The underlying theme with all of these tools is that the “Session” is more important than the “System”.

• Wait event tuning is used to account for latency.
  – Exists in SQL Server (Waits and Queues) and Oracle (event 10046).
  – Other components are not mature enough to represent it.

• Waits are statistical explanations of latency.

• Each individual wait event might be deceiving, but looking at both aggregates and outliers can explain why a performance problem exists (a minimal sketch follows below).

• When sampling directly, you usually have only about 1 hour to act on the data.
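
The aggregates-versus-outliers point can be shown with a tiny self-contained sketch. The wait-event names and timings below are invented sample data, purely for illustration.

```python
# Sketch: summarize sampled wait events by total time, average, and worst case,
# so a single deceptive sample does not hide (or exaggerate) the real problem.
# Event names and timings are invented sample data for illustration only.
from collections import defaultdict
from statistics import mean

samples = [  # (wait_event, wait_time_ms)
    ("db file sequential read", 4), ("db file sequential read", 6),
    ("db file sequential read", 950),   # a single outlier skews the average
    ("log file sync", 2), ("log file sync", 3),
    ("latch free", 1),
]

by_event = defaultdict(list)
for event, ms in samples:
    by_event[event].append(ms)

for event, times in sorted(by_event.items(), key=lambda kv: -sum(kv[1])):
    print(f"{event:26s} total={sum(times):5d}ms  avg={mean(times):7.1f}ms  max={max(times):4d}ms")
```

Looking at the total, the average, and the maximum together is what tells you whether latency is systemic or driven by a few outlier waits.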

Page 9: 7.17 1130am adv.perform.forensics_bb

Performance Forensics Tools

Page 10: 7.17 1130am adv.perform.forensics_bb

Categories of Tools

• HTTP and User Experience

• JVM Instrumentation Tools

• Database Instrumentation
  – Session and Wait Event
  – Cost Execution Plans
  – Profilers

Page 11: 7.17 1130am adv.perform.forensics_bb

Breaking Down Latency

Page 12: 7.17 1130am adv.perform.forensics_bb

Fiddler2

• Fiddler 2 measures end-to-end client responsiveness of a web request.

• Little to no overhead (less intrusive forensics).

• Captures requests in order to present HTTP status codes, object sizes, loading sequence, time to process the request, and performance by bandwidth speed.
  – Rough estimation of the user experience based on locality.

• Inspects every detail of the HTTP request.
  – Detailed session inspection
  – Breakdown of the HTTP transformation

• Other tools in this category: YSlow/Firebug, Charles Proxy, LiveHTTPHeaders, and IEInspector.
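
Fiddler itself is a GUI proxy, but the kind of per-request evidence it captures (status code, object size, elapsed time) can be approximated with a few lines of standard-library Python; the URL below is a placeholder.

```python
# Rough, Fiddler-like capture of one request: status code, object size, elapsed time.
# Standard library only; the URL is a placeholder for illustration.
import time
import urllib.request

def time_request(url: str) -> None:
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{resp.status} {url}  bytes={len(body)}  elapsed={elapsed_ms:.1f}ms")

if __name__ == "__main__":
    time_request("https://example.com/")
```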

Page 13: 7.17 1130am adv.perform.forensics_bb
Page 14: 7.17 1130am adv.perform.forensics_bb

Coradiant Truesight

• Commercial tool used for passive user experience monitoring.

• Captures page, object, and session level data.

• Capable of defining Service Level Thresholds and Automatic Incident Management.

• Used to trace back a session as if you were watching over the user’s shoulder.

• Exceptional tool for trend analysis (less intrusive).

• Primarily used in forensics as evidence for analysis.

• Other tools in the category: Quest User Experience and Citrix EdgeSight.

Page 15: 7.17 1130am adv.perform.forensics_bb

Coradiant Truesight

Page 16: 7.17 1130am adv.perform.forensics_bb

Coradiant Truesight

Page 17: 7.17 1130am adv.perform.forensics_bb

Log Analyzers

• Both commercial and open-source tools are available to parse and analyze HTTP access logs.

• Provide trend data, client statistical data, and HTTP summary information.

• Recommend using this data to study request and bandwidth trends for correlation with resource utilization graphs.
  – There is such a large volume of data.
  – Recommend working within small time slices (a minimal sketch follows below).

• Post-processing tools (no impact to the application).

• Examples: Urchin, Summary, WebTrends, SawMill, Surfstats, and AlterWind Log Analyzer.
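
A minimal sketch of the small-time-slice recommendation: bucket requests and bytes per minute from a combined-format access log so the trend can be laid over resource-utilization graphs. The file name and the simplified regular expression are assumptions for illustration.

```python
# Sketch: count requests and bytes per minute from a combined-format HTTP access log.
# The file name and the simplified regex are assumptions for illustration.
import re
from collections import Counter, defaultdict

LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)')

requests_per_minute = Counter()
bytes_per_minute = defaultdict(int)

with open("access.log") as log:
    for raw in log:
        m = LINE.search(raw)
        if not m:
            continue
        minute = m.group("ts")[:17]        # e.g. "23/Jan/2015:11:30" (drop seconds and zone)
        requests_per_minute[minute] += 1
        if m.group("bytes") != "-":
            bytes_per_minute[minute] += int(m.group("bytes"))

for minute in sorted(requests_per_minute):
    print(f"{minute}  {requests_per_minute[minute]:6d} requests  {bytes_per_minute[minute]:10d} bytes")
```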

Page 18: 7.17 1130am adv.perform.forensics_bb

JSTAT

• Low-intrusion statistics collector that provides:
  – Percentages of usage by each region
  – Frequency/counts of collections
  – Time spent in the pause state

• Can be invoked at any time without restarting the JVM, by obtaining the Process ID.
  – The exception is on Windows when the JVM is run as a background service.

• Critical for understanding windows of stall time between samples.
  – Assume you collect every 5 seconds and observe a 3-second pause time.
  – That means the application could only work for 2 seconds (a minimal sketch follows below).
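
A minimal sketch of that calculation, driving jstat from Python: the PID is taken from the command line, the 5-second interval is an assumption, and the sketch assumes the -gcutil output format in which the cumulative total GC time (GCT, in seconds) is the last column.

```python
# Sketch: run `jstat -gcutil <pid> 5000` and, for each sampling window, estimate how much
# of the window was spent paused in GC (delta of the cumulative GCT column) versus working.
# The PID, the 5-second interval, and the GCT-is-last-column assumption are for illustration.
import subprocess
import sys

pid, interval_ms = sys.argv[1], 5000
proc = subprocess.Popen(["jstat", "-gcutil", pid, str(interval_ms)],
                        stdout=subprocess.PIPE, text=True)

previous_gct = None
for line in proc.stdout:
    fields = line.split()
    if not fields or not fields[-1].replace(".", "", 1).isdigit():
        continue                      # skip header rows
    gct = float(fields[-1])           # cumulative total GC time in seconds
    if previous_gct is not None:
        paused = gct - previous_gct
        working = interval_ms / 1000.0 - paused
        print(f"window: {paused:.2f}s paused, ~{working:.2f}s available for work")
    previous_gct = gct
```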

Page 19: 7.17 1130am adv.perform.forensics_bb

JSTAT

Page 20: 7.17 1130am adv.perform.forensics_bb

Process of Garbage Collection

Page 21: 7.17 1130am adv.perform.forensics_bb

Process of Garbage Collection

Page 22: 7.17 1130am adv.perform.forensics_bb

-VerboseGC and -Xloggc

• JVM flags that enable GC logging.

• Verbose JVM logging is a low-overhead collector (a less intrusive measurement).
  – Requires a restart of the instance to enable.

• -XX:+PrintGCDetails is a recommended setting to be used with:
  – -XX:+PrintGCApplicationConcurrentTime
  – -XX:+PrintGCApplicationStoppedTime

• Provides aggregate statistics about Pause Times versus Working Times.
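
A minimal parsing sketch for those aggregates, assuming the HotSpot log lines produced by the two flags above ("Application time: N seconds" and "Total time for which application threads were stopped: N seconds"); the log file name is a placeholder.

```python
# Sketch: aggregate stopped time versus application (working) time from a GC log written
# with -Xloggc plus -XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCApplicationConcurrentTime.
# The log file name and the exact HotSpot message wording are assumptions for illustration.
import re

STOPPED = re.compile(r"Total time for which application threads were stopped: ([\d.]+) seconds")
RUNNING = re.compile(r"Application time: ([\d.]+) seconds")

stopped = running = 0.0
with open("gc.log") as log:
    for line in log:
        if m := STOPPED.search(line):
            stopped += float(m.group(1))
        elif m := RUNNING.search(line):
            running += float(m.group(1))

total = stopped + running
if total:
    print(f"stopped {stopped:.1f}s, working {running:.1f}s "
          f"({100.0 * stopped / total:.1f}% of the window spent paused)")
```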

Page 23: 7.17 1130am adv.perform.forensics_bb

-VerboseGC and -Xloggc

Page 24: 7.17 1130am adv.perform.forensics_bb

IBM Pattern Modeling Tool for Java GC

• Post-processing tool used for visualizing a -VerboseGC or -Xloggc log file.

• Can make the effort of analyzing a log file substantially easier.

• Represents pauses/stalls at particular times.

• Has no effect on the application environment, as it reads a log file that is dormant.

Page 25: 7.17 1130am adv.perform.forensics_bb

IBM Pattern Modeling Tool for Java GC

Page 26: 7.17 1130am adv.perform.forensics_bb

JHAT, JMAP and SAP Memory Analyzer

• JHat: the Java Heap Analysis Tool takes a heap dump and parses the data into useful, human-digestible information about what's in the JVM's memory.

• JMap: Java Memory Map is a JVM tool that provides information about what is in the heap at a given time.
  – Provides text and OQL views into the JHat data.

• SAP Memory Analyzer will visualize the JHat output.

• Should be run when a problem is occurring right now (a minimal sketch follows below):
  – When the system is unresponsive
  – When the JVM runs into continuous collections
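
A minimal sketch of driving the two JDK tools while the problem is live; the PID, dump file name, and port are placeholders, and the flags are the classic JDK 6/7-era jmap and jhat options.

```python
# Sketch: take a binary heap dump with jmap, then parse and serve it with jhat for browsing
# (including OQL queries). The PID, dump file name, and port are placeholders.
import subprocess
import sys

pid = sys.argv[1]
dump_file = "heap.hprof"

# 1. Capture the heap of the live JVM (this briefly pauses the target process).
subprocess.run(["jmap", f"-dump:format=b,file={dump_file}", pid], check=True)

# 2. Parse the dump and expose it over HTTP on port 7000 for inspection.
subprocess.run(["jhat", "-port", "7000", dump_file], check=True)
```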

Page 27: 7.17 1130am adv.perform.forensics_bb
Page 28: 7.17 1130am adv.perform.forensics_bb

ASH

• ASH: Active Session History
  – Samples session activity in the system every second.
  – Keeps 1 hour of history in memory for immediate access at your fingertips.

• ASH in memory:
  – Collects active-session data only.
  – History of v$session_wait + v$session + extras.
  – Circular buffer: 1M to 128M (~2% of the SGA).
  – Flushed to disk every hour, or when the buffer is 2/3 full (it protects itself, so you can relax).

• Tools to consider: SessPack and SessSnaper.
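
A minimal sketch of pulling the last hour of ASH samples into a top-waits summary; the cx_Oracle connection string is a placeholder, and querying v$active_session_history assumes the Diagnostics Pack is licensed.

```python
# Sketch: summarize the top wait events from the last hour of Active Session History.
# The connection string is a placeholder; v$active_session_history assumes the
# Diagnostics Pack license.
import cx_Oracle

SQL = """
    SELECT event, COUNT(*) AS samples
      FROM v$active_session_history
     WHERE sample_time > SYSDATE - 1/24
       AND event IS NOT NULL
     GROUP BY event
     ORDER BY samples DESC
"""

with cx_Oracle.connect("perf_user/secret@dbhost/orcl") as connection:   # placeholder credentials
    cursor = connection.cursor()
    for event, samples in cursor.execute(SQL):
        print(f"{samples:6d}  {event}")
```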

Page 29: 7.17 1130am adv.perform.forensics_bb

SQL Server Performance Dashboard

• Feature of SQL Server 2005 SP2.

• Template reports that take advantage of DMVs.

• Provides views into wait events:
  – Doesn’t link events to SQL IDs in the report.
  – Provides aggregate views of wait events.
  – Session-level DMVs (sys.dm_os_wait_stats and sys.dm_exec_sessions), which can also be queried directly (see the sketch below).

• Complementary tools: SQL Server Health and History Tool and Quest Spotlight for SQL Server.
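
A minimal sketch of querying one of those DMVs directly rather than through the dashboard report; the ODBC connection string is a placeholder.

```python
# Sketch: pull the heaviest aggregate waits straight from sys.dm_os_wait_stats,
# the DMV behind the dashboard's wait views. The connection string is a placeholder.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=dbhost;DATABASE=master;Trusted_Connection=yes;")
cursor = conn.cursor()
cursor.execute("""
    SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
      FROM sys.dm_os_wait_stats
     ORDER BY wait_time_ms DESC
""")
for wait_type, tasks, wait_ms, signal_ms in cursor.fetchall():
    print(f"{wait_type:40s} tasks={tasks:10d} wait={wait_ms:12d}ms signal={signal_ms:10d}ms")
conn.close()
```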

Page 30: 7.17 1130am adv.perform.forensics_bb
Page 31: 7.17 1130am adv.perform.forensics_bb

Importance of Cost Execution Plans

• Can be run on databases with low overhead.
  – Do not need the literal values to run.
  – Both SQL Server and Oracle can run “Estimated Cost Plans” (a minimal sketch follows below).

• Each database uses an “Optimizer” that determines the best path of execution for the SQL.
  – Calculates IO, CPU, and the number of executes (loop conditions).

• Understanding cost operations on a particular object can help change your tuning strategy (e.g., TABLE ACCESS BY INDEX ROWID).

• Cost is time.
  – Query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration.
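
A minimal sketch of producing an estimated plan on the Oracle side via EXPLAIN PLAN and DBMS_XPLAN, which never executes the statement; the connection string and the sample query are placeholders.

```python
# Sketch: generate an estimated cost plan without executing the statement, using
# EXPLAIN PLAN and DBMS_XPLAN. The connection string and sample query are placeholders.
import cx_Oracle

STATEMENT = "SELECT * FROM orders WHERE order_date > SYSDATE - 7"   # placeholder query

with cx_Oracle.connect("perf_user/secret@dbhost/orcl") as connection:   # placeholder credentials
    cursor = connection.cursor()
    cursor.execute("EXPLAIN PLAN FOR " + STATEMENT)
    for (line,) in cursor.execute("SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY)"):
        print(line)
```

SQL Server offers the equivalent through SET SHOWPLAN_XML ON (or the estimated-plan option in Management Studio).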

Page 32: 7.17 1130am adv.perform.forensics_bb

RML and Profiler

• The RML utilities process SQL Server trace files and produce reports showing how SQL Server is performing:
  – Which application, database, or login is using the most resources, and which queries are responsible for that.
  – Whether there were any plan changes for a batch during the time the trace was captured, and how each of those plans performed.
  – Which queries are running slower in today's data compared to a previous set of data.

• Profiler captures statements, query counts/statistics, and wait events.
  – Can capture and correlate profile data with Perfmon data.

• Heavy overhead with both.

• Other tools to consider: Quest Performance Analysis for SQL Server.

Page 33: 7.17 1130am adv.perform.forensics_bb

Oracle OEM and 10046

• Oracle finally delivered with OEM’s web-based interface.
  – The Performance dashboard provides a great historical and present overview.
  – Access to ADDM and ASH simplifies the job of the DBA.
  – SQL History.

• Problems:
  – Licensing is somewhat cost prohibitive.
  – Still doesn’t provide wait events.

• For 10046 you still need to consider profiling on your own and using a profiler reader like Hotsos P4 (a minimal sketch follows below).
  – Difficult to trace and capture sessions.
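
A minimal sketch of turning the 10046 extended SQL trace on and off for your own session before handing the resulting trace file to a profiler reader such as Hotsos P4; the connection string and the traced statement are placeholders.

```python
# Sketch: enable extended SQL trace (event 10046, level 12 = binds + waits) for the current
# session, run the work to be profiled, then turn tracing off. The connection string and the
# traced statement are placeholders; the trace file is written to the server's trace directory.
import cx_Oracle

with cx_Oracle.connect("perf_user/secret@dbhost/orcl") as connection:   # placeholder credentials
    cursor = connection.cursor()
    cursor.execute("ALTER SESSION SET tracefile_identifier = 'forensics'")
    cursor.execute("ALTER SESSION SET EVENTS '10046 trace name context forever, level 12'")

    cursor.execute("SELECT COUNT(*) FROM orders")                       # placeholder workload
    print(cursor.fetchone()[0])

    cursor.execute("ALTER SESSION SET EVENTS '10046 trace name context off'")
```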

Page 34: 7.17 1130am adv.perform.forensics_bb
Page 35: 7.17 1130am adv.perform.forensics_bb

Want More?

• Check out my blog for postings of the presentation: http://sevenseconds.wordpress.com

• To view my resources and references for this presentation, visit www.scholar.com

• Simply click “Advanced Search” and search by [email protected] and the tags ‘bbworld08’ or ‘forensics’.