Performance Scenario: Diagnosing and resolving a sudden slowdown on a two-node RAC

Introduction…

• Karl Arao, OCP-DBA, RHCT
• Senior Consultant at SQL*Wizard
• RAC user for 3 years
• 1st environment on VMware
• I “heart” performance
• Don’t like to guess when troubleshooting

Scenario

One Thursday…a client called…

There was a SUDDEN slow down on ALL of the applications

…a big impact to the Business

And it’s running on RAC

No changes on the RAC nodes and on the applications

Some of the 10g Performance Features

• OEM Performance Page
• ADDM
• SQL Tuning advisor
• AWR (DBA_HIST_)
• ASH
• Time Model (total time for all db calls)
• Wait Class (12 wait classes)
• Metrics (v$ performance metric deltas)
• Services

Setup

• Server and Storage: SunFire X4200 (2CPU, 12GB memory) with LUNs on EMC CX300
• OS: RHEL 4.3 ES
• Database and clusterware: Oracle 10.2.0.3
• Database Files, Flash Recovery Area, OCR, and Voting disk are located on OCFS2 filesystems
• Application: Forms and Reports (6i and also lower)

Troubleshooting Principle

Systematic/Layered approach..

Understand..

Then Fix..

Let’s get it on!

1. Measured the OS stack

• Monitored the following
  – cpu (vmstat, top, mpstat)
  – io (iostat)
  – memory (vmstat, meminfo)
  – network (netstat)
  – process info (top, ps)

• CPU on server1
• CPU on server2
• Datafiles on server1
• Datafiles on server2
• OCR & voting disk on server1
• OCR & voting disk on server2
• Archivelogs on server1
• Archivelogs on server2
• Flash Recovery Area on server1
• Flash Recovery Area on server2
• Memory on server1
• Memory on server2

2. Checked the DB environment

• Compared my past & current RDA of the database
• Queried some v$ views.. a query on v$session showed that server1 has more connections (89% of the total users)

This could be because of:

1) The clients having lower versions (< SQL*Plus 8.1 or OCI8, see Note 97926.1) that may not support TAF (FAILOVER_MODE) and Load Balancing (LOAD_BALANCE)

OR

2) They are using TNS entries explicitly connecting to server1


• Users don’t have FAILOVER capabilities
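For comparison, a client-side TNS entry that spreads connections across both nodes and enables TAF usually looks something like the sketch below. Everything specific in it is a placeholder rather than a value from this environment: the alias RACDB, the host names server1-vip and server2-vip, port 1521, the service name racdb, and the RETRIES/DELAY values are all assumptions for illustration.

```
# tnsnames.ora (illustrative entry, all names are placeholders)
RACDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = yes)
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = server1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = server2-vip)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5))
    )
  )
```

An entry that lists only server1 in its ADDRESS_LIST, or a client too old to understand these parameters, would pin every session to one node, which fits the 89% skew seen in v$session.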


• Checked the application module usage on server1


• How about I graph it in Excel? Will the data be more meaningful?

.. YES, most of the users use the xxxlogin.fmx module


3. Checked instance‐wide DB performance

• Graphed the ASH data..

.. suffering from “gc cr block lost” and “gc cr multi block request” from 7am to 4pm


• Researched on Metalink for known issues.. Found Doc ID: 563566.1 gc lost blocks diagnostics

• Was able to pinpoint the peak period from the graph. Then, generated ADDM and AWR report on that peak period..


• ADDM

Elapsed Time: 60min

DB Time: 61.83min

AAS: 1.03

Max CPU: 2
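
The AAS figure follows directly from the two numbers above it: average active sessions is DB Time divided by elapsed time over the same interval. A minimal sketch of that arithmetic, using a hypothetical helper named `aas` (not part of any Oracle tool) and awk for the floating-point division:

```shell
#!/usr/bin/env bash
# aas DB_TIME_MIN ELAPSED_MIN
# Average Active Sessions = DB Time / elapsed wall-clock time (same units).
aas() {
  awk -v db="$1" -v el="$2" 'BEGIN { printf "%.2f\n", db / el }'
}

aas 61.83 60   # DB Time and Elapsed Time from the ADDM summary
```

An AAS of about 1 against 2 CPUs means that, on average, roughly one session was active at any moment, so the host was not short of CPU during the peak hour.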


• Should I follow these recommendations right away?

Nope, collect more facts, numbers, figures


• AWR


• Do we have a workload distribution problem? Nope, even with distributed users..

We still have a performance problem..

4. Checked session‐level DB performance

• The database has too much activity, where do I start? Where to drill down?
• gv$session_longops & gv$session_wait output too many users, and require repetitive monitoring
• In the spirit of Method-R…

"WORK FIRST TO REDUCE THE BIGGEST RESPONSE TIME COMPONENT OF A BUSINESS' MOST IMPORTANT USER ACTION"

• Went to the Accounting Department, checked on the desktop terminals


• Users PC1069 (with SID 601) and PC918 (with SID 483) are in a total hang


• Checked on the
  – performance/wait counters
  – the current SQLs


• v$session_wait (SID 601)

• v$sesstat (SID 601)

• v$sql, v$sql_plan, v$sql_plan_statistics (SID 601)
• Running for 98 minutes
• Just 12.14 seconds on CPU

• v$sesstat (SID 483)

• v$sql, v$sql_plan, v$sql_plan_statistics (SID 483)
• Running for 3 hours
• Just 2.68 seconds on CPU
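
To put those two pairs of numbers side by side, it helps to express CPU time as a share of elapsed time. A rough sketch, using a hypothetical helper named `cpu_share` that takes CPU seconds and elapsed minutes:

```shell
#!/usr/bin/env bash
# cpu_share CPU_SECONDS ELAPSED_MINUTES
# Prints the percentage of elapsed time that the session spent on CPU.
cpu_share() {
  awk -v cpu="$1" -v min="$2" 'BEGIN { printf "%.2f%%\n", 100 * cpu / (min * 60) }'
}

cpu_share 12.14 98    # SID 601: 12.14 s of CPU in 98 minutes
cpu_share 2.68 180    # SID 483: 2.68 s of CPU in 3 hours
```

Both sessions come out well under 1% of their elapsed time on CPU, so nearly all of the time was spent waiting rather than working.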


• Another graph of ASH

5. Drilled down on the network interconnect

• Generated a “cat & egrep” command to look for problems in the interconnect from the OS Watcher “netstat” output (from Metalink Doc ID: 563566.1 gc lost blocks diagnostics)


$ cat server1_netstat.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
34096 fragments dropped after timeout
306030 packet reassembles failed
15 packet receive errors
15 packet receive errors
… output snipped …
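
The `netstat -s` counters are cumulative, so a single large value does not show whether the problem is still happening. Comparing the first and last snapshot in the file does. The sketch below is one way to do that; the function name `summarise_netstat_counters` is hypothetical, and it assumes each matching line has the Linux-style form "<count> <counter name>" seen in the output above (lines in any other format are skipped).

```shell
#!/usr/bin/env bash
# summarise_netstat_counters FILE
# For each interconnect-related counter, print the first and last value seen
# in FILE and how much it grew between those two snapshots.
summarise_netstat_counters() {
  grep -E -i 'udpInOverflows|packet receive errors|fragments dropped|reassembles failed' "$1" |
    awk '$1 ~ /^[0-9]+$/ {
      count = $1
      name = $0
      sub(/^[ \t]*[0-9]+[ \t]+/, "", name)      # strip the leading count
      if (!(name in first)) {                   # remember first value and print order
        first[name] = count
        order[++n] = name
      }
      last[name] = count                        # keep overwriting: ends as last value
    }
    END {
      for (i = 1; i <= n; i++) {
        name = order[i]
        printf "%s: first=%d last=%d growth=%d\n",
               name, first[name], last[name], last[name] - first[name]
      }
    }'
}

# Usage (file name taken from the command above):
#   summarise_netstat_counters server1_netstat.dat
```

A counter whose growth is non-zero during the slow period is still being incremented, which is the pattern to look for when chasing "gc cr block lost".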


• Restarted the switch

STILL THERE IS A PERFORMANCE PROBLEM


• Replaced the switch

THEY GOT FAST


karao@karl:~/Desktop$ cat karlarao.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors
0 packet receive errors


• Another graph of ASH (Stacked graph)


• Another graph of ASH (3d view)

Conclusion

You don’t have to guess..

Even if it’s a RAC environment..

It just takes facts, numbers, figures to solve a performance problem

References and Tools

• http://karlarao.wordpress.com
• http://blog.tanelpoder.com
  – http://www.tanelpoder.com/files/TPT_public.zip
  – http://www.tanelpoder.com/files/PerfSheet.zip
  – Neil Gunther & Tanel Poder, Multidimensional Visualization of Oracle Performance using Barry007, http://arxiv.org/pdf/0809.2532
• http://ashmasters.com
• http://www.perfvision.com
• http://www.method-r.com
• Metalink Doc ID 97926.1 Failover Issues and Limitations [Connect-time failover and TAF]
• Metalink Doc ID 563566.1 gc lost blocks diagnostics
• Metalink Doc ID 301137.1 OS Watcher User Guide

Join Oracle Users – Philippines

• Facebook: http://www.facebook.com/home.php#/pages/Oracle-Users-Philippines/86773013086?ref=ts
• LinkedIn: http://www.linkedin.com/groups?home=&gid=2028295&trk=anet_ug_hm

Contact me through:

[email protected]

0919‐267‐3389

889‐6999
