Oracle Operational Timing Data
Paper 36571

Cary Millsap ([email protected])

Hotsos Enterprises, Ltd.
OracleWorld 2003 / San Francisco, California USA
8:30a–9:30a Wednesday 10 September 2003

www.hotsos.com
Copyright © 1999–2003 by Hotsos Enterprises, Ltd.

Additional resources for you…

• Paper update
  – hotsos.com
• Questions
  – Hotsos booth in exhibit hall
• Book
  – Coming soon (est. July)

Motive

Three great “discoveries” are revolutionizing the effectiveness of Oracle performance analysts today.

• User action focus
• Response time focus
• Amdahl’s Law
• Recommendation
  – Eli Goldratt’s The Goal
  – “Goal readers are now doing the best work of their lives.”

Great Discovery #1: user action focus.

You can’t extrapolate detail from an aggregate.

• Some problems can’t be detected by observing system-wide data

• Hence, system-wide focus is not reliable

• This is one reason “tuning” is so hard

Example: What’s the problem here? It’s obvious, right?

• CIO’s #1 problem
  – “We need PYUGEN to execute 2x faster”
• DBA’s viewpoint
  – “The system’s number one wait event is latch free”

Wrong. You cannot trust system-wide data to guide your way.

Resource profile for PYUGEN

Response Time Component              Duration     # Calls     Dur/Call
----------------------------- --------------- ----------- ------------
SQL*Net message from client     984.0s  49.6%      95,161    0.010340s
SQL*Net more data from client   418.8s  21.1%       3,345    0.125208s
db file sequential read         279.3s  14.1%      45,084    0.006196s
CPU service                     248.7s  12.5%     222,760    0.001116s
unaccounted-for                  27.9s   1.4%
latch free                       23.7s   1.2%      34,695    0.000683s
log file sync                     1.1s   0.1%         506    0.002154s
SQL*Net more data to client       0.8s   0.0%      15,982    0.000052s
log file switch completion        0.3s   0.0%           3    0.093333s
enqueue                           0.3s   0.0%         106    0.002358s
SQL*Net message to client         0.2s   0.0%      95,161    0.000003s
other                             0.2s   0.0%
----------------------------- --------------- ----------- ------------
Total                         1,985.4s 100.0%

Great Discovery #2: response time focus.

You can’t tell how long something took by counting how many times it happened.

• Try it…

Example: What’s the problem here? It’s obvious, right?

Response Time Component          # Calls
------------------------------ ---------
CPU service                       18,750
SQL*Net message to client          6,094
SQL*Net message from client        6,094
db file sequential read            1,740
log file sync                        681
SQL*Net more data to client          108
SQL*Net more data from client         71
db file scattered read                34
direct path read                       5
free buffer waits                      4
log buffer space                       2
direct path write                      2
log file switch completion             1
latch free                             1

Wrong. You cannot trust call counts to guide your way.

Response Time Component                  Duration   # Calls    Dur/Call
------------------------------ ------------------ --------- -----------
SQL*Net message from client        166.60s  91.7%     6,094   0.027338s
CPU service                          9.65s   5.3%    18,750   0.000515s
unaccounted-for                      2.22s   1.2%
db file sequential read              1.59s   0.9%     1,740   0.000914s
log file sync                        1.12s   0.6%       681   0.001645s
SQL*Net more data from client        0.25s   0.1%        71   0.003521s
SQL*Net more data to client          0.11s   0.1%       108   0.001019s
free buffer waits                    0.09s   0.0%         4   0.022500s
SQL*Net message to client            0.04s   0.0%     6,094   0.000007s
db file scattered read               0.04s   0.0%        34   0.001176s
log file switch completion           0.03s   0.0%         1   0.030000s
log buffer space                     0.01s   0.0%         2   0.005000s
latch free                           0.01s   0.0%         1   0.010000s
direct path read                     0.00s   0.0%         5   0.000000s
direct path write                    0.00s   0.0%         2   0.000000s
------------------------------ ------------------ --------- -----------
Total                              181.76s 100.0%
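
As a rough illustration of why the two views disagree, the sketch below ranks a few rows of this profile once by call count and once by duration. The numbers are copied from the slide; the names PROFILE and top_by are made up for this example.

```python
# Rows copied from the slide's resource profile: (component, seconds, calls).
PROFILE = [
    ("SQL*Net message from client", 166.60, 6094),
    ("CPU service", 9.65, 18750),
    ("db file sequential read", 1.59, 1740),
    ("log file sync", 1.12, 681),
    ("SQL*Net more data from client", 0.25, 71),
    ("SQL*Net more data to client", 0.11, 108),
    ("SQL*Net message to client", 0.04, 6094),
]


def top_by(rows, index):
    """Return the component name with the largest value in column `index`."""
    return max(rows, key=lambda row: row[index])[0]


busiest_by_calls = top_by(PROFILE, 2)      # ranking a count-only report suggests
busiest_by_duration = top_by(PROFILE, 1)   # ranking the timing data gives

print("most calls:   ", busiest_by_calls)
print("most duration:", busiest_by_duration)
```

The count-based ranking points at CPU service, while the duration-based ranking points at SQL*Net message from client, which accounts for 91.7% of the response time.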

Great Discovery #3: Amdahl’s Law.

Performance improvement is proportional to how much a program uses the thing you improved.

• Gene Amdahl’s realization, formalized in 1967
  – Many “tuning” efforts are misplaced
  – This is one reason so many “successful” performance improvement projects create no impact
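
A small worked example may help here. It applies the idea to the PYUGEN resource profile shown earlier (total 1,985.4s, latch free 23.7s, SQL*Net message from client 984.0s). The function name and the two "what if" scenarios are assumptions chosen for illustration, not results from the talk.

```python
def new_response_time(total, component, reduction):
    """Response time after shrinking one component by the fraction `reduction`."""
    return total - component * reduction


TOTAL = 1985.4  # PYUGEN total response time in seconds, from the earlier profile

# Scenario A (hypothetical): eliminate latch free waits completely.
after_latch_fix = new_response_time(TOTAL, 23.7, 1.0)

# Scenario B (hypothetical): halve the time spent in SQL*Net message from client.
after_network_fix = new_response_time(TOTAL, 984.0, 0.5)

for label, after in (("latch free removed", after_latch_fix),
                     ("SQL*Net message from client halved", after_network_fix)):
    saved_pct = 100.0 * (TOTAL - after) / TOTAL
    print(f"{label}: {after:.1f}s ({saved_pct:.1f}% faster)")
```

Even a perfect latch fix can save only the 1.2% of response time that latch free contributes, whereas a partial fix to the largest component saves about a quarter of it.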

These great discoveries lead us to a simple, obviously efficient Oracle optimization strategy…

Work first to reduce the biggest response time component of a business’s most important user action.

• Sounds easy
• So why don’t people do it?

• Because it’s hard to get response time data for a specific user action

Oracle Operational Timing Data

So why do Oracle analysts use counts instead of timings?

• Because counts are easier to obtain
• Counts are easy
  – Select from V$SESSTAT union V$SESSION_EVENT at t0
  – Select again at t1
  – Then compute results[t1] - results[t0]
  – Maximum error is ±1 event (not a big deal)
• But times are hard
  – Same method produces potentially enormous duration errors
  – Example…

Accounting for response time with SQL queries of Oracle fixed view data is hard.

$ perl vprof.pl --username=system --password=manager 8
Press <Enter> to mark time t0:
Press <Enter> to mark time t1:

Resource Profile for Session 8

                  t0 = 15:31:50.507000
                  t1 = 15:32:02.696000
   interval duration = 12.189000s
accounted-for duration = 23.510000s

Response Time Component              Duration (seconds)        Calls   Dur/Call
---------------------------------- -------------------- ------------ ----------
SQL*Net message from client               22.78  186.9%            2  11.390000
db file sequential read                    0.60    4.9%          199   0.003015
CPU service                                0.09    0.7%            2   0.045000
db file scattered read                     0.04    0.3%            4   0.010000
SQL*Net message to client                  0.00    0.0%            2   0.000000
unaccounted-for                          -11.32  -92.9%            1 -11.321000
---------------------------------- -------------------- ------------ ----------
Total                                     12.19  100.0%

Source: Millsap, C. 2003. Optimizing Oracle Performance, Sebastopol CA: O’Reilly (July 2003)
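
The arithmetic behind the negative unaccounted-for row can be checked directly. The sketch below only re-adds the figures printed in the profile above; the variable names are chosen for this example.

```python
INTERVAL = 12.189  # seconds between t0 and t1, as reported above

# Durations (seconds) that the fixed-view snapshots attributed to the session.
COMPONENTS = {
    "SQL*Net message from client": 22.78,
    "db file sequential read": 0.60,
    "CPU service": 0.09,
    "db file scattered read": 0.04,
    "SQL*Net message to client": 0.00,
}

accounted = sum(COMPONENTS.values())
unaccounted = INTERVAL - accounted
share = 100.0 * COMPONENTS["SQL*Net message from client"] / INTERVAL

print(f"accounted-for: {accounted:.3f}s")      # more than the interval itself
print(f"unaccounted:   {unaccounted:.3f}s")    # hence a negative remainder
print(f"largest share: {share:.1f}%")
```

The snapshots attribute 23.51 seconds of activity to a 12.189-second interval, so a single component appears to account for 186.9% of the elapsed time.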

One easy way out of the problem is to use Oracle’s extended SQL trace data.

• Easy to activate
• Low overhead if used sensibly
• Produces a chronological record
• Contains all the performance data you need in most cases
• Doesn’t require any special tools to get started

The easiest way to activate extended SQL trace is to add a few statements to your application’s source code.

alter session set timed_statistics=true
alter session set max_dump_file_size=unlimited
alter session set tracefile_identifier='POX20031031a'
alter session set events '10046 trace name context forever, level 8'

/* code to be traced goes here */

alter session set events '10046 trace name context off'

Raw trace data lines are not so hard to understand.

PARSE #54:c=20000,e=11526,p=0,cr=2,cu=0,mis=1,r=0,dep=1,og=0,tim=1017039304725071

EXEC #1:c=10000,e=12137,p=0,cr=22,cu=0,mis=0,r=1,dep=0,og=4,tim=1017039275981174

FETCH #3:c=10000,e=306,p=0,cr=3,cu=0,mis=0,r=1,dep=2,og=4,tim=1017039275973158

WAIT #1: nam='SQL*Net message to client' ela= 40 p1=1650815232 p2=1 p3=0

WAIT #1: nam='SQL*Net message from client' ela= 1709 p1=1650815232 p2=1 p3=0

WAIT #34: nam='db file sequential read' ela= 14118 p1=52 p2=2755 p3=1

WAIT #44: nam='latch free' ela= 1327989 p1=-1721538020 p2=87 p3=13
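
To show how little machinery is needed to read these lines, here is a minimal parsing sketch. It handles only the two line shapes shown on this slide (PARSE/EXEC/FETCH and WAIT) and is not a complete trace-file parser; the regular expressions and the parse_line helper are assumptions written for this example.

```python
import re

# Database call lines, e.g. "FETCH #3:c=10000,e=306,...,dep=2,og=4,tim=..."
CALL_RE = re.compile(r"^(PARSE|EXEC|FETCH) #(\d+):(.*)$")
# Wait event lines, e.g. "WAIT #1: nam='SQL*Net message to client' ela= 40 ..."
WAIT_RE = re.compile(r"^WAIT #(\d+): nam='([^']*)' ela=\s*(\d+)")


def parse_line(line):
    """Return a dict describing one trace line, or None if it is not recognized."""
    line = line.strip()
    match = CALL_RE.match(line)
    if match:
        kind, cursor, rest = match.groups()
        fields = dict(pair.split("=", 1) for pair in rest.split(","))
        return {"type": kind, "cursor": int(cursor),
                "c": int(fields["c"]), "e": int(fields["e"]),
                "dep": int(fields["dep"])}
    match = WAIT_RE.match(line)
    if match:
        cursor, nam, ela = match.groups()
        return {"type": "WAIT", "cursor": int(cursor),
                "nam": nam, "ela": int(ela)}
    return None


fetch = parse_line("FETCH #3:c=10000,e=306,p=0,cr=3,cu=0,mis=0,r=1,dep=2,og=4,"
                   "tim=1017039275973158")
wait = parse_line("WAIT #44: nam='latch free' ela= 1327989 "
                  "p1=-1721538020 p2=87 p3=13")
print(fetch)
print(wait)
```

Only the cursor id, c, e, dep, nam, and ela are extracted here, since those are the fields the following slides rely on.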

The five fields I’ll speak about today are the cursor id, c, e, nam, and ela (see MetaLink 39817.1 for more info).

Field  Description
-----  -----------
#n     n is the id of the cursor upon which the action is executing
c      The approximate total CPU capacity (user-mode + kernel-mode) consumed by the database call
e      The approximate wall time that elapsed during the database call
nam    The name assigned by an Oracle kernel developer to a sequence of instructions (often including a system call) in the Oracle kernel
ela    The approximate wall time that elapsed during the wait event

There are two types of wait event: events issued within db calls, and events issued between db calls.

• Events issued between db calls are recognizable by nam
  – SQL*Net message from client
  – SQL*Net message to client
  – single-task message
  – pipe get
  – rdbms ipc message
  – pmon timer
  – smon timer

• Most other events are issued within db calls

A db call’s elapsed time is approximately its total CPU consumption plus the total duration of its wait events.

WAIT #4: nam='db file sequential read' ela= 13060 p1=1 p2=53903 p3=1
WAIT #4: nam='db file sequential read' ela= 6978 p1=1 p2=4726 p3=1
FETCH #4:c=0,e=21340,p=2,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=1033064137953092

$e \approx c + \sum_{\text{db call}} ela$
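
The relationship in this slide’s title can be checked against the three trace lines shown. The sketch below uses the values exactly as printed, in the trace file’s own time units; the helper name unaccounted is an assumption for this example.

```python
# Values copied from the three trace lines on this slide.
wait_elas = [13060, 6978]   # ela of each WAIT #4 line
fetch_c = 0                 # c field of the FETCH #4 line
fetch_e = 21340             # e field of the FETCH #4 line


def unaccounted(e, c, elas):
    """Part of e that is explained by neither CPU time nor wait events."""
    return e - (c + sum(elas))


gap = unaccounted(fetch_e, fetch_c, wait_elas)
print("accounted-for:  ", fetch_c + sum(wait_elas))
print("unaccounted-for:", gap)
```

The two waits plus the reported CPU time explain 20,038 of the 21,340 units of elapsed time, which is why the relationship is stated as approximate.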

A whole trace file’s elapsed duration is the sum of the db call durations plus the sum of the between-call stuff.

PARSE #9:c=0,e=0,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=4,tim=1716466757
WAIT #9: nam='SQL*Net message to client' ela= 0 p1=1413697536 p2=1 p3=0
WAIT #9: nam='SQL*Net message from client' ela= 3 p1=1413697536 p2=1 p3=0
...
PARSE #9:c=0,e=0,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=1716466760

$R \approx \sum_{dep=0} e + \sum_{\text{between calls}} ela$
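
A sketch of that sum over the excerpt shown on this slide follows. It assumes the lines have already been parsed into dictionaries (the record layout is invented for this example), counts e only for calls at dep=0 so that recursive calls are not added twice, and treats a wait as a between-call event when its nam is in the list given on the earlier slide. The lines elided by "..." in the excerpt are not represented.

```python
# Between-call event names, as listed on the earlier slide.
BETWEEN_CALL_EVENTS = {
    "SQL*Net message from client", "SQL*Net message to client",
    "single-task message", "pipe get", "rdbms ipc message",
    "pmon timer", "smon timer",
}


def trace_duration(records):
    """Sum e over dep=0 db calls plus ela over between-call wait events."""
    total = 0
    for rec in records:
        if rec["type"] == "WAIT":
            if rec["nam"] in BETWEEN_CALL_EVENTS:
                total += rec["ela"]
        elif rec["dep"] == 0:
            total += rec["e"]
    return total


# The visible lines of the excerpt, already parsed (hypothetical record format).
RECORDS = [
    {"type": "PARSE", "e": 0, "dep": 0, "tim": 1716466757},
    {"type": "WAIT", "nam": "SQL*Net message to client", "ela": 0},
    {"type": "WAIT", "nam": "SQL*Net message from client", "ela": 3},
    {"type": "PARSE", "e": 0, "dep": 0, "tim": 1716466760},
]

R = trace_duration(RECORDS)
tim_delta = RECORDS[-1]["tim"] - RECORDS[0]["tim"]
print("R =", R, "  tim delta =", tim_delta)
```

For these lines the computed duration matches the difference between the two tim values, which is the kind of cross-check the trace data makes possible.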

Extended SQL trace is reliable for diagnosing the root cause of virtually any performance problem.

• Method is reliable for even more problem types by isolating the “un-measured” components
  – M: measurement intrusion effect
  – E: quantization error
  – N: time spent not executing
  – U: un-instrumented Oracle code (systematic error + bugs)
  – S: CPU consumption that is counted twice

[Diagram: a db call’s e broken down into c, the ela values of its wait events, and the components M, E, N, U, and S]

What you can do with the timing data is spectacular.

• Performance problems cannot hide from you
• No more solving the wrong problem
• No more stabs in the dark or trial-and-error “tuning”
• No more multi-month performance diagnosis projects
• No more CTD [Vaidyanatha and Deshpande (2000)]

You either solve the right problem quickly or prove that solving it is not worth the effort.

Performance problems cannot hide from this method.

• Partial laundry list of problems we’ve solved
  – System-wide problems, and specific user-action problems
  – Query mistakes: bad SQL, faulty indexing, data density, …
  – Application mistakes: excessive parsing, no arrays, …
  – Serialization mistakes: locks, latches, memory buffers, …
  – Network mistakes: protocol selection, bad segments, …
  – Disk I/O mistakes: poorly sized caches, imbalances, …
  – Capacity shortages: swapping, paging, …

Why I use extended SQL trace data instead of Oracle’s V$ fixed views…

• I don’t use V$ data anymore, because it’s a mess
  – Can’t poll fast enough with SQL
  – Too complicated to attach directly to SGA
  – Too many data sources (V$SESSTAT, V$SESSION_EVENT, V$LATCH, V$LOCK, V$FILESTAT, V$WAITSTAT, V$SQL, …)
  – No notion of e; therefore, can’t isolate M, E, N, U, S
  – No read consistency
  – Statistics are unreliable (CPU time used by this session, SECONDS_IN_WAIT, …)
  – Statistics are susceptible to overflow
  – Hard (impossible?) to determine recursive SQL relationships

Where do Oracle’s timing statistics actually come from?

procedure dbcall {
    e0 = gettimeofday;    # mark the wall time
    c0 = getrusage;       # obtain resource usage statistics
    ...                   # execute the db call (may call wevent)
    c1 = getrusage;       # obtain resource usage statistics
    e1 = gettimeofday;    # mark the wall time
    e = e1 - e0;          # elapsed duration of dbcall
    c = (c1.utime + c1.stime) - (c0.utime + c0.stime);
                          # total CPU time consumed by dbcall
}

procedure wevent {
    ela0 = gettimeofday;  # mark the wall time
    ...                   # execute the wait event (syscall)
    ela1 = gettimeofday;  # mark the wall time
    ela = ela1 - ela0;    # elapsed duration of wevent
}
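
The same bracketing pattern can be tried outside Oracle. The Python sketch below is an analogy only, not Oracle kernel code: time.monotonic() stands in for gettimeofday, os.times() stands in for getrusage, and a sleep stands in for the system call behind a wait event. All function names are assumptions made for this example.

```python
import os
import time


def cpu_seconds():
    """User-mode plus kernel-mode CPU time of this process (getrusage stand-in)."""
    t = os.times()
    return t.user + t.system


def wevent(seconds, waits):
    """Time one simulated wait event and record its elapsed duration (ela)."""
    ela0 = time.monotonic()          # mark the wall time
    time.sleep(seconds)              # stand-in for the awaited system call
    ela1 = time.monotonic()          # mark the wall time
    waits.append(ela1 - ela0)


def dbcall(work):
    """Run `work`, returning elapsed time e, CPU time c, and the list of ela values."""
    waits = []
    e0 = time.monotonic()            # mark the wall time
    c0 = cpu_seconds()               # obtain resource usage statistics
    work(waits)                      # execute the call (may call wevent)
    c1 = cpu_seconds()
    e1 = time.monotonic()
    return {"e": e1 - e0, "c": c1 - c0, "ela": waits}


def sample_work(waits):
    sum(i * i for i in range(200_000))   # burn a little CPU
    wevent(0.05, waits)                  # then "wait" for 50 ms


result = dbcall(sample_work)
gap = result["e"] - result["c"] - sum(result["ela"])
print(f"e={result['e']:.4f}s c={result['c']:.4f}s "
      f"ela={sum(result['ela']):.4f}s unaccounted={gap:+.4f}s")
```

Because the CPU clock and the wall clock have different resolutions, the unaccounted remainder printed at the end can come out slightly positive or slightly negative from run to run.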

Understanding where Oracle timings come from motivates all sorts of interesting research projects…

• How to quantify measurement intrusion effect
• How to bound quantization error
• How to bound un-instrumented call effects
• How to estimate the amount of kernel-mode time double-counting
• …

Questions & Answers
