RAC Performance Tuning (ILOUG)


Page 1: Rac performance tuning iloug

Oracle RAC

David Yahalom CTO Naya Technologies www.naya-tech.co.il [email protected]

Page 2: Rac performance tuning iloug

Oracle RAC Architecture

[Diagram: four nodes (Node1-Node4) connected by the interconnect and attached to shared storage; a firewall is also shown.]

Page 3: Rac performance tuning iloug

Oracle RAC Architecture

[Diagram: services and the public network sit in front of the cluster. Each node (Node 1, Node 2, ..., Node n) runs the operating system, Oracle Clusterware, a database instance (instance 1 ... instance n), ASM, a VIP (VIP1 ... VIPn), and a listener. Shared storage holds the redo/archive logs of all instances, the database and control files, and the OCR and voting disks, managed by ASM.]

Page 4: Rac performance tuning iloug

Let’s get some terminology out of the way…

Page 5: Rac performance tuning iloug

Let’s get some terminology out of the way…

Global Cache Service (GCS)

Page 6: Rac performance tuning iloug

Global Cache Service (GCS)

•  Manages coherent access to data in the buffer caches of all instances in the cluster.

•  Minimizes access time to data that is not in the local cache
   •  access to data in the global cache is faster than disk access.

•  Implements fast direct memory access over high-speed interconnects
   •  for all data blocks and types (current - write, CR - read).

•  Uses an efficient and scalable messaging protocol
   •  never more than 3 hops.

•  Optimizations for read-mostly applications.
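As a rough way to see how much cross-instance block traffic GCS is handling, the per-instance statistics in gv$sysstat can be queried directly. A minimal sketch (statistic names as in recent Oracle releases; adjust to your version):

    select inst_id, name, value
    from   gv$sysstat
    where  name in ('gc cr blocks received',
                    'gc current blocks received',
                    'gc cr blocks served',
                    'gc current blocks served')
    order by inst_id, name;

Instances that serve far more blocks than they receive (or vice versa) are a first hint of skewed workload placement.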

Page 7: Rac performance tuning iloug

Cache Hierarchy: Data in Remote Cache

[Diagram: local cache miss, data block requested, remote cache hit, data block returned by the remote LMS.]

Page 8: Rac performance tuning iloug

Let’s get some terminology out of the way…

Oracle block “master”.

Page 9: Rac performance tuning iloug

Oracle RAC block master:

•  The master can be thought of as the directory node for a block or an object.
•  The global state of the data block - whether it is cached or on disk, which instances have the block cached, and whether the block can be shared immediately or has a modification pending - is completely known at the master.
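To see which instance currently masters a given object, the dynamic resource mastering view can be queried. A hedged sketch (SCOTT.EMP is only an example object; instance numbering in this view is typically 0-based):

    -- which instance masters the blocks of a given object
    select o.owner, o.object_name,
           m.current_master, m.previous_master, m.remaster_cnt
    from   v$gcspfmaster_info m
    join   dba_objects o on o.data_object_id = m.data_object_id
    where  o.owner = 'SCOTT' and o.object_name = 'EMP';

A high remaster count for a segment suggests its access pattern keeps shifting between instances.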

Page 10: Rac performance tuning iloug

Cache Hierarchy: Data On Disk

[Diagram: local cache miss, data block requested, remote cache miss, grant returned by LMS, disk read by the requestor.]

Page 11: Rac performance tuning iloug

GC Current block 2-way

Page 12: Rac performance tuning iloug

GC Current block 3-way

Page 13: Rac performance tuning iloug


What can go wrong?

Page 14: Rac performance tuning iloug


Common Problems and Symptoms

•  “Lost Blocks”: Interconnect or Switch Problems.

•  Slow or bottlenecked disks: one node becomes a bottleneck, entire cluster waits.

•  System load and scheduling: high CPU – “frozen” LMS processes.

•  Contention: frequent access to same resources.

•  Unexpectedly high latencies: network issues.

Page 15: Rac performance tuning iloug

Best practice #1: Tune the interconnect

• Dropped packets/fragments
• Buffer overflows / high load on NICs
• Packet reassembly failures or timeouts
• TX/RX errors
• Verify low utilization

Often overlooked, but always important.
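A quick sanity check is to confirm which network Oracle actually registered as the interconnect, so tuning effort goes to the right NIC. A minimal sketch:

    select inst_id, name, ip_address, is_public, source
    from   gv$cluster_interconnects
    order by inst_id;

IS_PUBLIC should be NO for the interconnect interface; if the public network shows up here, the private interconnect is misconfigured.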

Page 16: Rac performance tuning iloug

“Lost Blocks”: NIC Receive Errors

ifconfig -a:

eth0 Link encap:Ethernet HWaddr 00:0B:DB:4B:A2:04

inet addr:130.35.25.110 Bcast:130.35.27.255 Mask:255.255.252.0

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

RX packets:21721236 errors:135 dropped:0 overruns:0 frame:95

TX packets:273120 errors:0 dropped:0 overruns:0 carrier:0

Overruns indicate that the NIC's internal buffers should be increased, while dropped packets may indicate that the driver and OS layers cannot drain the queued messages fast enough.
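Lost blocks also show up inside the database: on a healthy interconnect the corresponding statistics stay at or near zero. A hedged sketch:

    select inst_id, name, value
    from   gv$sysstat
    where  name in ('gc blocks lost', 'gc blocks corrupt')
    order by inst_id, name;

Any steadily growing 'gc blocks lost' count points back at the NIC, driver, or switch checks above.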

Page 17: Rac performance tuning iloug

Finding a Problem with the Interconnect

Top 5 Timed Events
Event              Waits    Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
-----------------  -------  -------  ------------  ----------------  ----------
log file sync      286,038   49,872       174            41.7        Commit
gc buffer busy     177,315   29,021       164            24.3        Cluster
gc cr block busy   110,348    5,703        52             4.8        Cluster
gc cr block lost     4,272    4,953      1159             4.1        Cluster   <-- should never be here
cr request retry     6,316    4,668       739             3.9        Other     <-- should never be here

Page 18: Rac performance tuning iloug

Interconnect Statistics - Automatic Workload Repository (AWR)

Target     Avg Latency  Stddev    Avg Latency  Stddev
Instance   500B msg     500B msg  8K msg       8K msg
---------  -----------  --------  -----------  ------
1              .79        .65        1.04       1.06
2              .75        .57         .95        .78
3              .55        .59         .53        .59
4             1.59       3.16        1.46       1.82

Latency probes for different message sizes.
Exact throughput measurements (not shown).
Send and receive errors, dropped packets (not shown).

Page 19: Rac performance tuning iloug

Interconnect latency

Event                   Waits    Time (s)  AVG (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062    5,767      18        19.0
gc current block 2-way  201,663    4,063      20        13.4
gc buffer busy          111,372    3,970      36        13.1
CPU time                           2,938                 9.7
gc cr block busy         40,688    1,670      41         5.5

Tackle latency first, then tackle busy events.

Expected: to see 2-way and 3-way events.
Unexpected: to see > 1 ms (AVG ms should be around 1 ms).
Cause: high load, slow interconnect, contention...
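The same average block latencies can be derived outside AWR from gv$sysstat: the receive-time statistics are kept in centiseconds, so multiplying by 10 and dividing by the block counts gives milliseconds per block. A minimal sketch (statistic names as in recent releases):

    select inst_id,
           round(10 * sum(case when name = 'gc cr block receive time' then value end)
                    / nullif(sum(case when name = 'gc cr blocks received' then value end), 0), 2)
             as avg_cr_block_ms,
           round(10 * sum(case when name = 'gc current block receive time' then value end)
                    / nullif(sum(case when name = 'gc current blocks received' then value end), 0), 2)
             as avg_current_block_ms
    from   gv$sysstat
    where  name in ('gc cr block receive time',      'gc cr blocks received',
                    'gc current block receive time', 'gc current blocks received')
    group by inst_id
    order by inst_id;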

Page 20: Rac performance tuning iloug

Cache Fusion messaging traffic

Global Cache Load Profile
~~~~~~~~~~~~~~~~~~~~~~~~~
                                 Per Second  Per Transaction
                                 ----------  ---------------
Global Cache blocks received:          4.30             3.65
Global Cache blocks served:           23.44            19.90
GCS/GES messages received:           133.03           112.96
GCS/GES messages sent:                78.61            66.75
DBWR Fusion writes:                    0.11             0.10
Est Interconnect traffic (KB):       263.20

Network traffic received  = Global Cache blocks received * DB block size = 4.30 * 8192 bytes ≈ 0.03 MB/sec

Network traffic generated = Global Cache blocks served * DB block size = 23.44 * 8192 bytes ≈ 0.19 MB/sec

Page 21: Rac performance tuning iloug

What to do?

• Dedicated interconnect NICs and switches.
• Tune IPC buffer sizes.
• Ensure enough OS resources are available
  •  a spinning process can consume all network ports.
• Disable any firewall on the interconnect.
• Use "Jumbo Frames" where supported.
• Make sure network utilization is low (under 20%).

Page 22: Rac performance tuning iloug

Best practice #2: I/O is critical to RAC

Storage is global to the cluster: a single badly behaving node, or a poorly balanced disk configuration, can degrade read and write performance for every node.

Page 23: Rac performance tuning iloug

• Log flush IO delays can cause "busy" buffers: LGWR always writes before a block changes ownership.

  Bad LGWR latency means bad overall RAC performance (a quick latency check is sketched after this list).

• “Bad” queries on one node can saturate a disk where the redo logs are located.

•  IO is issued from ALL nodes to shared storage.
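Log write latency can be checked per instance without pulling a full AWR report: 'log file parallel write' is the LGWR write itself, 'log file sync' is what committing sessions experience. A minimal sketch:

    select inst_id, event, total_waits,
           round(time_waited_micro / 1000 / nullif(total_waits, 0), 2) as avg_ms
    from   gv$system_event
    where  event in ('log file parallel write', 'log file sync')
    order by inst_id, event;

If 'log file parallel write' averages are already high, the problem is the redo storage, not the sessions.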

Page 24: Rac performance tuning iloug

Cluster-Wide I/O Impact

Node 1 - Top 5 Timed Events
Event             Waits    Time(s)  Avg wait(ms)  %Total Call Time
----------------  -------  -------  ------------  ----------------
log file sync     286,038   49,872      174             41.7
gc buffer busy    177,315   29,021      164             24.3
gc cr block busy  110,348    5,703       52              4.8

Node 2 - Load Profile
                   Per Second
                -------------
Redo size:          40,982.21
Logical reads:      81,652.41
Physical reads:     51,193.37

An expensive query on Node 2 impacts Node 1:
1. IO on the disk group containing the redo logs is bottlenecked.
2. Block shipping for hot blocks is delayed by log flush IO.
3. Serialization/queues build up.

Page 25: Rac performance tuning iloug

Drill-down on node 2: An IO capacity problem

I/O contention

Top 5 Timed Events
Event                      Waits       Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
-------------------------  ----------  -------  ------------  ----------------  ----------
db file scattered read      3,747,683  368,301       98             33.3        User I/O
gc buffer busy              3,376,228  233,632       69             21.1        Cluster
db file parallel read       1,552,284  225,218      145             20.4        User I/O
gc cr multi block request  35,588,800  101,888        3              9.2        Cluster
read by other session       1,263,599   82,915       66              7.5        User I/O

Page 26: Rac performance tuning iloug

After “killing” the session…

Top 5 Timed Events
Event                    Waits    Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
-----------------------  -------  -------  ------------  ----------------  ----------
CPU time                            4,580                      65.4
log file sync            276,281    1,501        5              21.4        Commit
log file parallel write  298,045      923        3              13.2        System I/O
gc current block 3-way   605,628      631        1               9.0        Cluster
gc cr block 3-way        514,218      533        1               7.6        Cluster

1. Log file writes are normal

2. Global serialization has disappeared

Page 27: Rac performance tuning iloug

What to do?

•  Tune the IO layout: RAC is much more sensitive to full table scans, full index scans, etc.

•  Tune queries that consume a lot of IO.

•  One busy node can affect the entire cluster.

•  Separate the storage of redo log files and data files.

•  Make sure Async I/O is enabled! (sketch below)
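Checking and enabling asynchronous I/O is a parameter change; a hedged sketch for a file-system based database (SETALL also enables direct I/O, and the change only takes effect after the instances restart; verify the right value for your platform and storage):

    show parameter disk_asynch_io
    show parameter filesystemio_options

    -- enable asynchronous (and direct) I/O for database files on file systems;
    -- static change, applies after the instances are restarted
    alter system set filesystemio_options = setall scope=spfile sid='*';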

Page 28: Rac performance tuning iloug

Best practice #3: single-node CPU load matters

Top 5 Timed Events
Event                       Waits    Time(s)  Avg wait(ms)  %Total Call Time  Wait Class
--------------------------  -------  -------  ------------  ----------------  ----------
gc current block congested  275,004   21,054       77             21.3        Cluster
gc cr grant congested       177,044   13,495       76             13.6        Cluster
gc cr block congested        85,975    8,917      104              9.0        Cluster

Congested: LMS could not dequeue messages fast enough.
Cause: long run queue, CPU starvation.
Solution: high process priority for LMS; start more LMS processes.

If LMS cannot be scheduled to process the messages arriving in its request queue, the time spent in the run queue adds to the data access time for users on the other nodes.

Never use more LMS processes than CPUs
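The number of LMS processes is controlled by the GCS_SERVER_PROCESSES parameter (static, so a change requires restarting the instances); keeping it at or below the CPU count per node follows the guidance above. A sketch (the value 4 is only an example):

    show parameter gcs_server_processes

    -- example: run 4 LMS processes on every instance, never more than the number of CPUs
    alter system set gcs_server_processes = 4 scope=spfile sid='*';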

Page 29: Rac performance tuning iloug

Event                   Waits    Time (s)  AVG (ms)  % Call Time
----------------------  -------  --------  --------  -----------
gc cr block 2-way       317,062    5,767      18        19.0
gc current block 2-way  201,663    4,063      20        13.4
gc buffer busy          111,372    3,970      36        13.1
CPU time                           2,938                 9.7
gc cr block busy         40,688    1,670      41         5.5

Best practice #4: avoid block contention

•  Any frequently accessed data may have hotspots that are sensitive to how many users access the same data concurrently.
•  It is very likely that CR BLOCK BUSY and GC BUFFER BUSY are related.

•  RAC can magnify a resource bottleneck.
•  Identify "hot" blocks and reduce concurrency (see the query sketch below).
•  If possible, "partition" the application workload.
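Hot blocks can usually be pinned down with the segment-level statistics: the segments with the highest 'gc buffer busy' counts are the first candidates for reducing concurrency or partitioning the workload. A minimal sketch (statistic name as used in 10g/11g):

    select inst_id, owner, object_name, object_type, value
    from   gv$segment_statistics
    where  statistic_name = 'gc buffer busy'
    and    value > 0
    order by value desc;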

Page 30: Rac performance tuning iloug

Best practice #5: Smart application design

•  There are no fundamentally different design and coding practices for RAC.

BUT:

•  Flaws in execution or design have a higher impact in RAC.
•  Performance and scalability in RAC are more sensitive to bad plans or bad schema design.
•  Serializing contention makes applications less scalable.

•  Standard SQL and schema tuning solves > 80% of performance problems.

Page 31: Rac performance tuning iloug

Major scalability pitfalls

•  Serializing contention on a small set of data/index blocks:
   •  a monotonically increasing index (sequence numbers) is not scalable when the index is modified from all nodes.
   •  frequent updates of small cached tables ("hot blocks").
   •  sparse blocks (PCTFREE 99) will reduce serialization.

•  Concurrent DDL and DML (frequent invalidation of cursors = many data dictionary reads and syncs).

•  Segments without Automatic Segment Space Management (ASSM) or Free List Groups (FLG).

•  Sequence caching (see the sketch after this list).

Page 32: Rac performance tuning iloug

Major scalability pitfalls (continued)

•  Full table scans: direct reads do not need to be globally synchronized (hence less CPU spent on the global cache).
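Two of the pitfalls above have simple DDL-level fixes; a hedged sketch showing a generously cached, non-ordered sequence and an ASSM tablespace (the object names and the +DATA disk group are illustrative only):

    -- a large cache plus NOORDER lets each instance hand out its own range of values,
    -- avoiding cross-instance synchronization on the sequence and index hot blocks
    create sequence order_id_seq cache 1000 noorder;

    -- Automatic Segment Space Management spreads inserts across blocks,
    -- reducing contention on segment header and free-list blocks
    create tablespace app_data
      datafile '+DATA' size 1g
      extent management local
      segment space management auto;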