RAC Performance Tuning (ILOUG)


Page 1: Rac performance tuning iloug

Oracle RAC

David Yahalom CTO Naya Technologies www.naya-tech.co.il [email protected]

Page 2: Rac performance tuning iloug

Oracle RAC Architecture

[Diagram: four nodes (Node1-Node4) connected by the interconnect and attached to shared storage; a firewall is also shown.]

Page 3: Rac performance tuning iloug

Oracle RAC Architecture

[Diagram: services and the public network sit in front of the cluster. Each node (Node 1, Node 2, ..., Node n) runs the operating system, Oracle Clusterware, a database instance (instance 1 ... instance n), ASM, a VIP (VIP1 ... VIPn), and a listener. Shared storage holds the redo/archive logs of all instances, the database and control files, and the OCR and voting disks, managed by ASM.]

Page 4: Rac performance tuning iloug

Let’s get some terminology out of the way…

Page 5: Rac performance tuning iloug

Let’s get some terminology out of the way…

Global Cache Service (GCS)

Page 6: Rac performance tuning iloug

Global Cache Service (GCS)

•  Manages coherent access to data in the buffer caches of all instances in the cluster.

•  Minimizes access time to data that is not in the local cache
   •  access to data in the global cache is faster than disk access.

•  Implements fast direct memory access over high-speed interconnects
   •  for all data blocks and types (current - write, CR - read).

•  Uses an efficient and scalable messaging protocol
   •  never more than 3 hops.

•  Optimizations for read-mostly applications.
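As a rough way to see how much cross-instance block traffic GCS is handling, the per-instance statistics in gv$sysstat can be queried directly. A minimal sketch (statistic names as in recent Oracle releases; adjust to your version):

    select inst_id, name, value
    from   gv$sysstat
    where  name in ('gc cr blocks received',
                    'gc current blocks received',
                    'gc cr blocks served',
                    'gc current blocks served')
    order by inst_id, name;

Instances that serve far more blocks than they receive (or vice versa) are a first hint of skewed workload placement.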

Page 7: Rac performance tuning iloug

Cache Hierarchy: Data in Remote Cache

[Diagram: local cache miss, data block requested, remote cache hit, data block returned by the remote LMS.]

Page 8: Rac performance tuning iloug

Let’s get some terminology out of the way…

Oracle block “master”.

Page 9: Rac performance tuning iloug

Oracle RAC block master:

•  The master can be thought of as the directory node for a block or an object.
•  The global state of the data block - whether it is cached or on disk, which instances have the block cached, and whether the block can be shared immediately or has a modification pending - is completely known at the master.
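To see which instance currently masters a given object, the dynamic resource mastering view can be queried. A hedged sketch (SCOTT.EMP is only an example object; instance numbering in this view is typically 0-based):

    -- which instance masters the blocks of a given object
    select o.owner, o.object_name,
           m.current_master, m.previous_master, m.remaster_cnt
    from   v$gcspfmaster_info m
    join   dba_objects o on o.data_object_id = m.data_object_id
    where  o.owner = 'SCOTT' and o.object_name = 'EMP';

A high remaster count for a segment suggests its access pattern keeps shifting between instances.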

Page 10: Rac performance tuning iloug

Cache Hierarchy: Data On Disk

[Diagram: local cache miss, data block requested, remote cache miss, grant returned by LMS, disk read by the requestor.]

Page 11: Rac performance tuning iloug

GC Current block 2-way

Page 12: Rac performance tuning iloug

GC Current block 3-way

Page 13: Rac performance tuning iloug


What can go wrong?

Page 14: Rac performance tuning iloug


Common Problems and Symptoms

•  “Lost Blocks”: Interconnect or Switch Problems.

•  Slow or bottlenecked disks: one node becomes a bottleneck, entire cluster waits.

•  System load and scheduling: high CPU – “frozen” LMS processes.

•  Contention: frequent access to same resources.

•  Unexpectedly high latencies: network issues.

Page 15: Rac performance tuning iloug

Best practice #1: Tune the interconnect

• Dropped packets/fragments
• Buffer overflows / high load on NICs
• Packet reassembly failures or timeouts
• TX/RX errors
• Verify low utilization

Often overlooked, but always important.
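A quick sanity check is to confirm which network Oracle actually registered as the interconnect, so tuning effort goes to the right NIC. A minimal sketch:

    select inst_id, name, ip_address, is_public, source
    from   gv$cluster_interconnects
    order by inst_id;

IS_PUBLIC should be NO for the interconnect interface; if the public network shows up here, the private interconnect is misconfigured.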

Page 16: Rac performance tuning iloug

“Lost Blocks”: NIC Receive Errors

ifconfig -a:

eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04

inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95

TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0

Overruns indicate that the NIC's internal buffers should be increased, while dropped packets may indicate that the driver and OS layers cannot drain the queued messages fast enough.
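Lost blocks also show up inside the database: on a healthy interconnect the corresponding statistics stay at or near zero. A hedged sketch:

    select inst_id, name, value
    from   gv$sysstat
    where  name in ('gc blocks lost', 'gc blocks corrupt')
    order by inst_id, name;

Any steadily growing 'gc blocks lost' count points back at the NIC, driver, or switch checks above.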

Page 17: Rac performance tuning iloug

Finding a Problem with the Interconnect

Top 5 Timed Events
Event              Waits    Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
-----------------  -------  -------  ------------  ----------------  ----------
log file sync      286,038   49,872       174            41.7        Commit
gc buffer busy     177,315   29,021       164            24.3        Cluster
gc cr block busy   110,348    5,703        52             4.8        Cluster
gc cr block lost     4,272    4,953      1159             4.1        Cluster   <-- should never be here
cr request retry     6,316    4,668       739             3.9        Other     <-- should never be here

Page 18: Rac performance tuning iloug

Interconnect Statistics - Automatic Workload Repository (AWR)

Target     Avg Latency  Stddev    Avg Latency  Stddev
Instance   500B msg     500B msg  8K msg       8K msg
---------  -----------  --------  -----------  ------
1              .79        .65        1.04       1.06
2              .75        .57         .95        .78
3              .55        .59         .53        .59
4             1.59       3.16        1.46       1.82

Latency probes for different message sizes.
Exact throughput measurements (not shown).
Send and receive errors, dropped packets (not shown).

Page 19: Rac performance tuning iloug

Interconnect latency

Event                   Waits    Time (s)  AVG (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062    5,767      18        19.0
gc current block 2-way  201,663    4,063      20        13.4
gc buffer busy          111,372    3,970      36        13.1
CPU time                           2,938                 9.7
gc cr block busy         40,688    1,670      41         5.5

Tackle latency first, then tackle busy events.

Expected: to see 2-way and 3-way events.
Unexpected: to see > 1 ms (AVG ms should be around 1 ms).
Cause: high load, slow interconnect, contention...
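The same average block latencies can be derived outside AWR from gv$sysstat: the receive-time statistics are kept in centiseconds, so multiplying by 10 and dividing by the block counts gives milliseconds per block. A minimal sketch (statistic names as in recent releases):

    select inst_id,
           round(10 * sum(case when name = 'gc cr block receive time' then value end)
                    / nullif(sum(case when name = 'gc cr blocks received' then value end), 0), 2)
             as avg_cr_block_ms,
           round(10 * sum(case when name = 'gc current block receive time' then value end)
                    / nullif(sum(case when name = 'gc current blocks received' then value end), 0), 2)
             as avg_current_block_ms
    from   gv$sysstat
    where  name in ('gc cr block receive time',      'gc cr blocks received',
                    'gc current block receive time', 'gc current blocks received')
    group by inst_id
    order by inst_id;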

Page 20: Rac performance tuning iloug

Cache Fusion messaging traffic

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~
                                 Per Second  Per Transaction
                                 ----------  ---------------
Global Cache blocks received:          4.30             3.65
Global Cache blocks served:           23.44            19.90
GCS/GES messages received:           133.03           112.96
GCS/GES messages sent:                78.61            66.75
DBWR Fusion writes:                    0.11             0.10
Est Interconnect traffic (KB):       263.20

Network traffic received  = Global Cache blocks received * DB block size = 4.30 * 8192 bytes ≈ 0.03 MB/sec

Network traffic generated = Global Cache blocks served * DB block size = 23.44 * 8192 bytes ≈ 0.19 MB/sec

Page 21: Rac performance tuning iloug

What to do?

• Dedicated interconnect NICs and switches.
• Tune IPC buffer sizes.
• Ensure enough OS resources are available
  •  a spinning process can consume all network ports.
• Disable any firewall on the interconnect.
• Use "Jumbo Frames" where supported.
• Make sure network utilization is low (under 20%).

Page 22: Rac performance tuning iloug

Best practice #2: I/O is critical to RAC

Storage is global to the cluster: a single badly behaving node, or a poorly balanced disk configuration, can degrade read and write performance for every node.

Page 23: Rac performance tuning iloug

• Log flush IO delays can cause "busy" buffers: LGWR always writes before a block changes ownership.

  Bad LGWR latency means bad overall RAC performance (a quick latency check is sketched after this list).

• “Bad” queries on one node can saturate a disk where the redo logs are located.

•  IO is issued from ALL nodes to shared storage.
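Log write latency can be checked per instance without pulling a full AWR report: 'log file parallel write' is the LGWR write itself, 'log file sync' is what committing sessions experience. A minimal sketch:

    select inst_id, event, total_waits,
           round(time_waited_micro / 1000 / nullif(total_waits, 0), 2) as avg_ms
    from   gv$system_event
    where  event in ('log file parallel write', 'log file sync')
    order by inst_id, event;

If 'log file parallel write' averages are already high, the problem is the redo storage, not the sessions.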

Page 24: Rac performance tuning iloug

Cluster-Wide I/O Impact

Node 1 - Top 5 Timed Events
Event             Waits    Time(s)  Avg wait(ms)  %Total Call Time
----------------  -------  -------  ------------  ----------------
log file sync     286,038   49,872      174             41.7
gc buffer busy    177,315   29,021      164             24.3
gc cr block busy  110,348    5,703       52              4.8

Node 2 - Load Profile
                   Per Second
                -------------
Redo size:          40,982.21
Logical reads:      81,652.41
Physical reads:     51,193.37

An expensive query on Node 2 impacts Node 1:
1. IO on the disk group containing the redo logs is bottlenecked.
2. Block shipping for hot blocks is delayed by log flush IO.
3. Serialization/queues build up.

Page 25: Rac performance tuning iloug

Drill-down on node 2: An IO capacity problem

I/O contention

Top 5 Timed Events
Event                      Waits       Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
-------------------------  ----------  -------  ------------  ----------------  ----------
db file scattered read      3,747,683  368,301       98             33.3        User I/O
gc buffer busy              3,376,228  233,632       69             21.1        Cluster
db file parallel read       1,552,284  225,218      145             20.4        User I/O
gc cr multi block request  35,588,800  101,888        3              9.2        Cluster
read by other session       1,263,599   82,915       66              7.5        User I/O

Page 26: Rac performance tuning iloug

After “killing” the session…

Top 5 Timed Events
Event                    Waits    Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
-----------------------  -------  -------  ------------  ----------------  ----------
CPU time                            4,580                      65.4
log file sync            276,281    1,501        5              21.4        Commit
log file parallel write  298,045      923        3              13.2        System I/O
gc current block 3-way   605,628      631        1               9.0        Cluster
gc cr block 3-way        514,218      533        1               7.6        Cluster

1. Log file writes are normal

2. Global serialization has disappeared

Page 27: Rac performance tuning iloug

What to do?

•  Tune the IO layout: RAC is much more sensitive to full table scans, full index scans, etc.

•  Tune queries that consume a lot of IO.

•  One busy node can affect the entire cluster.

•  Separate the storage of redo log files and data files.

•  Make sure Async I/O is enabled! (sketch below)
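Checking and enabling asynchronous I/O is a parameter change; a hedged sketch for a file-system based database (SETALL also enables direct I/O, and the change only takes effect after the instances restart; verify the right value for your platform and storage):

    show parameter disk_asynch_io
    show parameter filesystemio_options

    -- enable asynchronous (and direct) I/O for database files on file systems;
    -- static change, applies after the instances are restarted
    alter system set filesystemio_options = setall scope=spfile sid='*';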

Page 28: Rac performance tuning iloug

Best practice #3: single-node CPU load matters

Top 5 Timed Events
Event                       Waits    Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
--------------------------  -------  -------  ------------  ----------------  ----------
gc current block congested  275,004   21,054       77             21.3        Cluster
gc cr grant congested       177,044   13,495       76             13.6        Cluster
gc cr block congested        85,975    8,917      104              9.0        Cluster

Congested: LMS could not dequeue messages fast enough.
Cause: long run queue, CPU starvation.
Solution: high process priority for LMS; start more LMS processes.

If LMS cannot be scheduled to process the messages arriving in its request queue, the time spent in the run queue adds to the data access time for users on the other nodes.

Never use more LMS processes than CPUs
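The number of LMS processes is controlled by the GCS_SERVER_PROCESSES parameter (static, so a change requires restarting the instances); keeping it at or below the CPU count per node follows the guidance above. A sketch (the value 4 is only an example):

    show parameter gcs_server_processes

    -- example: run 4 LMS processes on every instance, never more than the number of CPUs
    alter system set gcs_server_processes = 4 scope=spfile sid='*';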

Page 29: Rac performance tuning iloug

Event                   Waits    Time (s)  AVG (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062    5,767      18        19.0
gc current block 2-way  201,663    4,063      20        13.4
gc buffer busy          111,372    3,970      36        13.1
CPU time                           2,938                 9.7
gc cr block busy         40,688    1,670      41         5.5

Best practice #4: avoid block contention

•  Any frequently accessed data may have hotspots that are sensitive to how many users access the same data concurrently.
•  It is very likely that CR BLOCK BUSY and GC BUFFER BUSY are related.

•  RAC can magnify a resource bottleneck.
•  Identify "hot" blocks and reduce concurrency (see the query sketch below).
•  If possible, "partition" the application workload.
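Hot blocks can usually be pinned down with the segment-level statistics: the segments with the highest 'gc buffer busy' counts are the first candidates for reducing concurrency or partitioning the workload. A minimal sketch (statistic name as used in 10g/11g):

    select inst_id, owner, object_name, object_type, value
    from   gv$segment_statistics
    where  statistic_name = 'gc buffer busy'
    and    value > 0
    order by value desc;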

Page 30: Rac performance tuning iloug

Best practice #5: Smart application design

•  There are no fundamentally different design and coding practices for RAC.

BUT:

•  Flaws in execution or design have a higher impact in RAC.
•  Performance and scalability in RAC are more sensitive to bad plans or bad schema design.
•  Serializing contention makes applications less scalable.

•  Standard SQL and schema tuning solves > 80% of performance problems.

Page 31: Rac performance tuning iloug

Major scalability pitfalls

•  Serializing contention on a small set of data/index blocks:
   •  a monotonically increasing index (sequence numbers) is not scalable when the index is modified from all nodes.
   •  frequent updates of small cached tables ("hot blocks").
   •  sparse blocks (PCTFREE 99) will reduce serialization.

•  Concurrent DDL and DML (frequent invalidation of cursors = many data dictionary reads and syncs).

•  Segments without Automatic Segment Space Management (ASSM) or Free List Groups (FLG).

•  Sequence caching (see the sketch after this list).

Page 32: Rac performance tuning iloug

Major scalability pitfalls (continued)

•  Full table scans: direct reads do not need to be globally synchronized (hence less CPU spent on the global cache).
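Two of the pitfalls above have simple DDL-level fixes; a hedged sketch showing a generously cached, non-ordered sequence and an ASSM tablespace (the object names and the +DATA disk group are illustrative only):

    -- a large cache plus NOORDER lets each instance hand out its own range of values,
    -- avoiding cross-instance synchronization on the sequence and index hot blocks
    create sequence order_id_seq cache 1000 noorder;

    -- Automatic Segment Space Management spreads inserts across blocks,
    -- reducing contention on segment header and free-list blocks
    create tablespace app_data
      datafile '+DATA' size 1g
      extent management local
      segment space management auto;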