Database Server Health Check
TRANSCRIPT

Josh Berkus
PostgreSQL Experts Inc.
pgCon 2010

[Slide image: "DATABASE SERVER HELP 5¢" booth cartoon]
Program of Treatment
● What is a Healthy Database?
● Know Your Application
● Load Testing
● Doing a database server checkup
  ● hardware
  ● OS & FS
  ● PostgreSQL
  ● application
● Common Ailments of the Database Server
What is a Healthy Database Server?
● Response Times
  ● lower than required
  ● consistent & predictable
● Capacity for more
  ● CPU and I/O headroom
  ● low server load
[Chart: median and max response time vs. number of clients, with the expected load range marked]
What is an Unhealthy Database Server?
● Slow response times
● Inconsistent response times
● High server load
● No capacity for growth
[Chart: the same response-time curve for an unhealthy server; response times climb steeply before the expected load is reached]
A healthy database server is able to maintain consistent and acceptable response times under expected loads with margin for error.
[Chart: median response time vs. number of clients, showing response times spiking once the server runs out of capacity]
Hitting The Wall
CPUs Floored
Average:  CPU   %user  %system  %iowait  %idle
Average:  all   69.36     0.13    24.87   5.77
Average:    0   88.96     0.09    10.03   1.11
Average:    1   12.09     0.02    86.98   0.00
Average:    2   98.90     0.00     0.00  10.10
Average:    3   77.52     0.44     1.70  20.34
16:38:29 up 13 days, 22:10, 3 users, load average: 11.05, 9.08, 8.13
IO Saturated
Device:       tps   MB_read/s   MB_wrtn/s
sde        414.33        0.40       38.15
sdf       1452.00       99.14       29.00

Average:  CPU   %user  %system  %iowait  %idle
Average:  all   34.75     0.13    58.75   6.37
Average:    0    8.96     0.09    90.03   1.11
Average:    1   12.09     0.02    86.98   0.00
Average:    2   91.90     0.00     7.00  10.10
Average:    3   27.52     0.44    51.70  20.34
Out of Connections
FATAL: connection limit exceeded for non-superusers
How close are you to the wall?
The Checkup (full physical)
1. Analyze application
2. Analyze platform
3. Correct anything obviously wrong
4. Set up load test
5. Monitor load test
6. Analyze Results
7. Correct issues
The Checkup (semi-annual)
1. Check response times
2. Check system load
3. Check previous issues
4. Check for Signs of Illness
5. Fix new issues
Know your application!
Application database usage
Which does your application do?
✔ small reads
✔ large sequential reads
✔ small writes
✔ large writes
✔ long-running procedures/transactions
✔ bulk loads and/or ETL
What Color Is My Application?
● Web Application (Web)
  ● DB smaller than RAM
  ● 90% or more simple queries
● Online Transaction Processing (OLTP)
  ● DB slightly larger than RAM, up to 1TB
  ● 20-40% small data write queries
  ● some long transactions and complex read queries
● Data Warehousing (DW)
  ● large to huge databases (100GB to 100TB)
  ● large complex reporting queries
  ● large bulk loads of data
  ● also called "Decision Support" or "Business Intelligence"
What Color Is My Application?
● Web Application (Web)
  ● CPU-bound
  ● Ailments: idle connections/transactions, too many queries
● Online Transaction Processing (OLTP)
  ● CPU or I/O bound
  ● Ailments: locks, database growth, idle transactions, database bloat
● Data Warehousing (DW)
  ● I/O or RAM bound
  ● Ailments: database growth, longer-running queries, memory usage growth
Special features required?
● GIS
  ● heavy CPU for GIS functions
  ● lots of RAM for GIS indexes
● TSearch
  ● lots of RAM for indexes
  ● slow response time on writes
● SSL
  ● response time lag on connections
Load Testing
[Chart: requests per second over a 24-hour day, showing two daily traffic peaks]

[Chart: the same traffic curve against server capacity; DOWNTIME occurs where peak load exceeds capacity]
When preventing downtime, it is not average load which matters, it is peak load.
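A toy illustration of the point above, with made-up numbers: a server sized to the average looks healthy most of the day and still falls over at peak.

```python
# Hypothetical requests-per-second samples over a day (not real data).
samples = [12, 15, 14, 20, 35, 70, 68, 40, 22, 16, 13, 12]

average = sum(samples) / len(samples)
peak = max(samples)
print(f"average load: {average:.1f} req/s")
print(f"peak load:    {peak} req/s")

# If the server tops out at 50 req/s it looks fine "on average",
# but it hits the wall during every peak period.
capacity = 50
overloaded = [s for s in samples if s > capacity]
print(f"samples over capacity: {len(overloaded)}")
```

The 50 req/s capacity figure is invented for the example; the lesson is that you size for `max(samples)`, not `sum(samples)/len(samples)`.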
What to load test
● Load should be as similar as possible to your production traffic
● You should be able to create your target level of traffic
  ● better: incremental increases
● Test the whole application as well
  ● the database server may not be your weak point
How to Load Test
1. Set up a load testing tool
you'll need test servers for this*
2. Turn on PostgreSQL, HW, application monitoring
all monitoring should start at the same time
3. Run the test for a defined time
1 hour is usually good
4. Collect and analyze data
5. Re-run at higher level of traffic
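Step 4 ("collect and analyze data") can be sketched in a few lines: given the per-request response times from one run, report the two metrics this talk keeps returning to, median and max. All latencies below are invented.

```python
# A minimal analysis sketch for load-test results; numbers are hypothetical.
import statistics

def summarize(latencies_ms):
    """Return (median, max) response time for one load level."""
    return (statistics.median(latencies_ms), max(latencies_ms))

run_50_clients  = [12, 14, 13, 15, 18, 14, 13, 16]        # healthy
run_200_clients = [14, 15, 90, 17, 400, 16, 18, 1500]     # hitting the wall

for label, run in [("50 clients", run_50_clients),
                   ("200 clients", run_200_clients)]:
    med, mx = summarize(run)
    print(f"{label}: median={med}ms  max={mx}ms")
```

Note how the max blows up long before the median does: inconsistent response times are the first symptom of an unhealthy server.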
Test Servers
● Must be as close as reasonable to production servers
  ● otherwise you don't know how production will be different
  ● there is no predictable multiplier
● Double them up as your development/staging or failover servers
● If your test server is much smaller, then you need to do a same-load comparison
Tools for Load Testing
Production Test
1. Determine the peak load hour on the production servers
2. Turn on lots of monitoring during that peak load hour
3. Analyze results
Pretty much your only choice without a test server.
Issues with Production Test
● Not repeatable
− load won't be exactly the same ever again
● Cannot test target load
− just whatever happens to occur during that hour
− can't test incremental increases either
● Monitoring may hurt production performance
● Cannot test experimental changes
The Ad-Hoc Test
● Get 10 to 50 coworkers to open several sessions each
● Have them go crazy on using the application
Problems with Ad-Hoc Testing
● Not repeatable
  ● minor changes in response times may be due to changes in worker activity
● Labor intensive
  ● each test run shuts down the office
● Can't reach target levels of load
  ● unless you have a lot of coworkers
Siege
● HTTP traffic generator
  ● all test interfaces must be addressable as URLs
  ● useless for non-web applications
● Simple to use
  ● create a simple load test in a few hours
● Tests the whole web application
  ● cannot test the database separately
● http://www.joedog.org/index/siege-home
pgReplay
● Replays your activity logs at variable speed
  ● get exactly the traffic you get in production
● Good for testing just the database server
● Can take time to set up
  ● need database snapshot, collect activity logs
  ● must already have production traffic
● http://pgreplay.projects.postgresql.org/
tsung
● Generic load generator in Erlang
  ● a load testing kit rather than a tool
  ● generate a tsung file from your activity logs using pgFouine and test the database
  ● generate load for a web application using custom scripts
● Can be time-consuming to set up
  ● but highly configurable and advanced
  ● very scalable: cluster of load testing clients
● http://tsung.erlang-projects.org/
pgBench
● Simple micro-benchmark
  ● not like any real application
● Version 9.0 adds multi-threading, customization
  ● write custom pgBench scripts
  ● run against a real database
● Fairly ad-hoc compared to other tools
  ● but easy to set up
● Ships with PostgreSQL
Benchmarks
● Many “real” benchmarks available
  ● DBT2, EAstress, CrashMe, DBT5, DBMonster, etc.
● Useful for testing your hardware
  ● not useful for testing your application
● Often time-consuming and complex
Platform-specific
● Web framework or platform tests
  ● Rails: ActionController::PerformanceTest
  ● J2EE: OpenDemand, Grinder, many more
    – JBoss, BEA have their own tools
  ● Zend Framework Performance Test
● Useful for testing specific application performance
  ● such as performance of specific features, modules
● Not all platforms have them
Flight-Check
● Attend the tutorial tomorrow!
monitoring PostgreSQL during load test
logging_collector = on
log_destination = 'csvlog'
log_filename = 'load_test_1_%h'
log_rotation_age = 60min
log_rotation_size = 1GB

log_min_duration_statement = 0
log_connections = on
log_disconnections = on
log_temp_files = 100kB
log_lock_waits = on
monitoring hardware during load test
sar -A -o load_test_1.sar 30 240
(all statistics, sampled every 30 seconds, 240 times: a 2-hour run)
iostat or fsstat or zpool iostat
monitoring application during load test
● Collect response times
  ● with timestamp
  ● with activity
● Monitor hardware and utilization
  ● activity
  ● memory & CPU usage
● Record errors & timeouts
Checking Hardware
● CPUs and Cores
● RAM
● I/O & disk support
● Network
CPUs and Cores
● Pretty simple:
  ● number
  ● type
  ● speed
  ● L1/L2 cache
● Rules of thumb
  ● fewer faster CPUs is usually better than more slower ones
  ● core != cpu
  ● thread != core
  ● virtual core != core
CPU calculations
● ½ to 1 core for OS
● ½ to 1 core for software RAID or ZFS
● 1 core for postmaster and bgwriter
● 1 core per:
  ● DW: 1 to 3 concurrent users
  ● OLTP: 10 to 50 concurrent users
  ● Web: 100 to 1000 concurrent users
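These rules of thumb are easy to turn into a rough calculator. This sketch uses the midpoint of each range from the slide and budgets a full core each for the OS, the postmaster/bgwriter, and (optionally) software RAID; the exact ratios are judgment calls, not hard numbers.

```python
# Rough core-count estimate from the slide's rules of thumb.
# The per-user ratios are midpoints of the quoted ranges.
import math

CORES_PER_USER = {"dw": 1 / 2, "oltp": 1 / 30, "web": 1 / 500}

def cores_needed(workload, concurrent_users, sw_raid=False):
    cores = 1                      # OS
    cores += 1                     # postmaster and bgwriter
    if sw_raid:
        cores += 1                 # software RAID / ZFS
    cores += concurrent_users * CORES_PER_USER[workload]
    return math.ceil(cores)

print(cores_needed("web", 1000))               # -> 4
print(cores_needed("dw", 4, sw_raid=True))     # -> 5
```

Treat the result as a starting point for sizing, to be validated by load testing, not as a guarantee.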
CPU tools
● sar
● mpstat
● pgTop
in praise of sar
● collects data about all aspects of HW usage
  ● available on most OSes
  ● but output is slightly different per OS
● easiest tool for collecting basic information
  ● often enough for server-checking purposes
● BUT: does not report all data on all platforms
sar
CPUs:    sar -P ALL and sar -u
Memory:  sar -r and sar -R
I/O:     sar -b and sar -d
Network: sar -n
sar CPU output
Linux:
06:05:01 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
06:15:01 AM  all  14.26   0.09     6.01     1.32    0.00  78.32
06:15:01 AM    0  14.26   0.09     6.01     1.32    0.00  78.32

Solaris:
15:08:56  %usr  %sys  %wio  %idle
15:09:26    10     5     0     85
15:09:56     9     7     0     84
15:10:26    15     6     0     80
15:10:56    14     7     0     79
15:11:26    15     5     0     80
15:11:56    14     5     0     81
Memory
● Only one statistic: how much?
● Not generally an issue on its own
  ● low memory can cause more I/O
  ● low memory can cause more CPU time
memory sizing
[Diagram: memory sizing. Data is either in shared_buffers ("in buffer"), in the filesystem cache ("in cache"), or on disk; work_mem and maintenance_work_mem are separate allocations.]
Figure out Memory Sizing
● What is the active portion of your database?
  ● i.e. gets queried frequently
● How large is it?
● Where does it fit into the size categories?
● How large is the inactive portion of your database?
  ● how frequently does it get hit? (remember backups)
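The sizing question above can be sketched as a small classifier that maps the active set against RAM, using the application "colors" from earlier in the talk. The 0.75 and 4x thresholds here are illustrative assumptions, not figures from the slides.

```python
# Hedged sketch: where does the active data set fall relative to RAM?
# Thresholds are illustrative, not official guidance.
def size_category(active_set_gb, ram_gb):
    if active_set_gb <= ram_gb * 0.75:
        return "fits in RAM (Web-style: cache everything)"
    elif active_set_gb <= ram_gb * 4:
        return "slightly larger than RAM (OLTP-style: cache the hot set)"
    else:
        return "much larger than RAM (DW-style: plan for I/O throughput)"

print(size_category(8, 16))     # small active set, plenty of RAM
print(size_category(500, 64))   # warehouse-sized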
Memory Sizing
● Other needs for RAM, via work_mem:
  ● sorts and aggregates: do you do a lot of big ones?
  ● GIN/GiST indexes: these can be huge
  ● hashes: for joins and aggregates
  ● VACUUM
I/O Considerations
● Throughput
  ● how fast can you get data off disk?
● Latency
  ● how long does it take to respond to requests?
● Seek Time
  ● how long does it take to find random disk pages?

I/O Considerations
● Throughput
  ● important for large databases
  ● important for bulk loads
● Latency
  ● huge effect on small writes & reads
  ● not so much on large scans
● Seek Time
  ● important for small writes & reads
  ● very important for index lookups

I/O Considerations
● Web
  ● concerned about read latency & seek time
● OLTP
  ● concerned about write latency & seek time
● DW/BI
  ● concerned about throughput & seek time
------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-
Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
32096M 79553 99 240548 45 50646 5 72471 94 185634 10 1140 1
        ------Sequential Output------ --Sequential Input-- --Random-
        -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--  --Seeks--
Size    K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP   /sec %CP
24G              260044 33  62110 17            89914  15   1167  25
Latency          6549ms     4882ms              3395ms      107ms
Common I/O Types
● Software RAID & ZFS
● Hardware RAID Array
● NAS/SAN
● SSD
Hardware RAID Sanity Check
● RAID 1 / 10, not 5
● Battery-backed write cache?
  ● otherwise, turn write cache off
● SATA < SCSI/SAS
  ● about ½ real throughput
● Enough drives?
  ● 4-14 for OLTP application
  ● 8-48 for DW/BI
Sw RAID / ZFS Sanity Check
● Enough CPUs?
  ● will need one for the RAID
● Enough disks?
  ● same as hardware RAID
● Extra configuration?
  ● caching
  ● block size
NAS/SAN Sanity Check
● Check latency!
● Check real throughput
  ● drivers are often a problem
● Enough network bandwidth?
  ● multipath or fiber required to get HW RAID performance
SSD Sanity Check
● 1 SSD = 4 drives
  ● relative performance
● Check write cache configuration
  ● make sure data is safe
● Test real throughput, seek times
  ● drivers are often a problem
● Research durability stats
IO Tools
● I/O Tests
  ● dd test
  ● Bonnie++
  ● IOZone
  ● filebench
● Monitoring Tools
  ● sar
  ● mpstat iowait
  ● iostat
  ● on ZFS: fsstat, zpool iostat
  ● EXPLAIN ANALYZE
Network
● Throughput
  ● not usually an issue, except:
    – iSCSI / NAS / SAN
    – ETL & bulk load processes
  ● remember that gigabit is only 100MB/s!
● Latency
  ● real issue for Web / OLTP
  ● consider putting app ↔ database on a private network
Checkups for the Cloud
Just like real HW, except ...
● Low ceiling on #cpus, RAM
● Virtual Core < Real Core
  ● "CPU stealing"
  ● last-generation hardware
  ● calculate 50% more cores
Cloud I/O Hell
● I/O tends to be very slow, erratic
  ● comparable to a USB thumb drive
  ● horrible latency, up to ½ second
  ● erratic, speeds go up and down
● RAID together several volumes on EBS
● use asynchronous commit
  – or at least commit_siblings
#1 Cloud Rule
If your database doesn't fit in RAM,
don't host it on a public cloud
Checking Operating Systemand Filesystem
OS Basics
● Use recent versions
  ● large performance, scaling improvements in Linux & Solaris in the last 2 years
● Check OS tuning advice for databases
  ● advice for Oracle is usually good for PostgreSQL
● Keep up with information about issues & patches
  ● frequently specific releases have major issues
  ● especially check HW drivers
OS Basics
● Use Linux, BSD or Solaris!
  ● Windows has poor performance and weak diagnostic tools
  ● OSX is optimized for desktop and has poor hardware support
  ● AIX and HPUX require expertise just to install, and lack tools
Filesystem Layout
● One array / one big pool
● Two arrays / partitions
  ● OS and transaction log
  ● Database
● Three arrays
  ● OS & stats file
  ● Transaction log
  ● Database
Linux Tuning
● XFS > Ext3 (but not that much)
  ● Ext3 tuning: data=writeback,noatime,nodiratime
  ● XFS tuning: noatime,nodiratime
    – for transaction log: nobarrier
● "deadline" I/O scheduler
● Increase SHMMAX and SHMALL
  ● to ½ of RAM
● Cluster filesystems also a possibility
  ● OCFS, RHCFS
Solaris Tuning
● Use ZFS
  ● no advantage to UFS anymore
  ● mixed filesystems cause caching issues
  ● set recordsize
    – 8K for small databases
    – 128K for large databases
    – check for throughput/latency issues
Solaris Tuning
● Set OS parameters via "projects"
● For all databases:
  ● project.max-shm-memory=(priv,12GB,deny)
● For high-connection databases:
  ● use libumem
  ● project.max-shm-ids=(priv,32768,deny)
  ● project.max-sem-ids=(priv,4096,deny)
  ● project.max-msg-ids=(priv,4096,deny)
FreeBSD Tuning
● ZFS: same as Solaris
  ● definite win for very large databases
  ● not so much for small databases
● Other tuning per docs
PostgreSQL Checkup
postgresql.conf: formulae
shared_buffers = available RAM / 4
postgresql.conf: formulae
max_connections =
    web:    100 to 200
    OLTP:    50 to 100
    DW/BI:    5 to 20

if you need more, use pooling!
postgresql.conf: formulae
Web/OLTP:
work_mem = Av.RAM * 2 / max_connections

DW/BI:
work_mem = Av.RAM / max_connections
postgresql.conf: formulae
Web/OLTP:
maintenance_work_mem = Av.RAM / 16

DW/BI:
maintenance_work_mem = Av.RAM / 8
postgresql.conf: formulae
autovacuum = on

DW/BI & bulk loads:
autovacuum = off
autovacuum_max_workers = 1 to 2
postgresql.conf: formulae
checkpoint_segments =
    web:    8 to 16
    OLTP:  32 to 64
    DW/BI: 128 to 256
postgresql.conf: formulae
wal_buffers = 8MB
effective_cache_size = AvRAM * 0.75
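The formulae from the preceding slides can be collected into one small function for sanity-checking a server. This is a sketch of the talk's starting-point rules of thumb, not authoritative tuning advice; `av_ram_mb` (RAM available to PostgreSQL, in MB) and the function name are mine.

```python
# Sketch of the slide formulae. Values are in MB; integer division keeps
# the output usable directly in postgresql.conf.
def pg_conf_estimates(av_ram_mb, max_connections, workload="web"):
    est = {
        "shared_buffers": av_ram_mb // 4,
        "effective_cache_size": int(av_ram_mb * 0.75),
    }
    if workload in ("web", "oltp"):
        est["work_mem"] = av_ram_mb * 2 // max_connections
        est["maintenance_work_mem"] = av_ram_mb // 16
    else:  # dw/bi
        est["work_mem"] = av_ram_mb // max_connections
        est["maintenance_work_mem"] = av_ram_mb // 8
    return est

# a hypothetical 16GB web server with 200 connections
print(pg_conf_estimates(16384, 200, "web"))
```

As the slides stress elsewhere: these are starting points to be validated by load testing, and work_mem in particular must leave headroom since every sort in every connection can allocate it.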
How much recoverability do you need?
● None:
  ● fsync = off
  ● full_page_writes = off
  ● consider using a ramdrive
● Some loss OK:
  ● synchronous_commit = off
  ● wal_buffers = 16MB to 32MB
● Data integrity critical:
  ● keep everything on
File Locations
● Database
● Transaction Log
● Activity Log
● Stats File
● Tablespaces?
Database Checks: Indexes
select relname, seq_scan, seq_tup_read,
       pg_size_pretty(pg_relation_size(relid)) as size,
       coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0)
         + coalesce(n_tup_del,0) as update_activity
from pg_stat_user_tables
where seq_scan > 1000
  and pg_relation_size(relid) > 1000000
order by seq_scan desc
limit 10;

    relname     | seq_scan | seq_tup_read |  size   | update_activity
----------------+----------+--------------+---------+-----------------
 permissions    |    12264 |        53703 | 2696 kB |             365
 users          |    11697 |       351635 | 17 MB   |             741
 test_set       |     9150 |  18492353300 | 275 MB  |           27643
 test_pool      |     5143 |   3141630847 | 212 MB  |           77755
Database Checks: Indexes

SELECT indexrelid::regclass as index,
       relid::regclass as table
FROM pg_stat_user_indexes
  JOIN pg_index USING (indexrelid)
WHERE idx_scan < 100
  AND indisunique IS FALSE;

         index          |    table
------------------------+--------------
 acct_acctdom_idx       | accounts
 hitlist_acct_idx       | hitlist
 hitlist_number_idx     | hitlist
 custom_field_acct_idx  | custom_field
 user_log_accstrt_idx   | user_log
 user_log_idn_idx       | user_log
 user_log_feed_idx      | user_log
 user_log_inbdstart_idx | user_log
 user_log_lead_idx      | user_log
Database Checks: Large Tables

      relname      | total_size | table_size
-------------------+------------+------------
 operations_2008   | 9776 MB    | 3396 MB
 operations_2009   | 9399 MB    | 3855 MB
 request_by_second | 7387 MB    | 5254 MB
 request_archive   | 6975 MB    | 3349 MB
 events            | 92 MB      | 66 MB
 event_edits       | 82 MB      | 68 MB
 2009_ops_eoy      | 33 MB      | 19 MB
Database Checks: Heavily-Used Tables

select relname,
       pg_size_pretty(pg_relation_size(relid)) as size,
       coalesce(n_tup_ins,0) + coalesce(n_tup_upd,0)
         + coalesce(n_tup_del,0) as update_activity
from pg_stat_user_tables
order by update_activity desc
limit 10;

       relname       |  size   | update_activity
---------------------+---------+-----------------
 session_log         | 344 GB  |         4811814
 feature             | 279 MB  |         1012565
 daily_feature       | 28 GB   |          984406
 cache_queue_2010_05 | 2578 MB |          981812
 user_log            | 30 GB   |          796043
 vendor_feed         | 29 GB   |          479392
 vendor_info         | 23 GB   |          348355
 error_log           | 239 MB  |          214376
 test_log            | 945 MB  |          185785
 settings            | 215 MB  |          117480
Database Unit Tests
● You need them!
  ● you will be changing database objects and rewriting queries
  ● find bugs in testing … or in production
● Various tools
  ● pgTAP
  ● Framework-level tests
    – Rails, Django, Catalyst, JBoss, etc.
Application Stack Checkup
The Layer Cake

[Diagram: the stack as layers. Application (queries, transactions), Middleware (drivers, connections, caching), PostgreSQL (config, schema), Operating System (kernel, filesystem), Hardware (storage, RAM/CPU, network)]
The Funnel

[Diagram: a funnel narrowing from Application through Middleware, PostgreSQL, and OS down to HW]
Check PostgreSQL Drivers
● Does the driver version match the PostgreSQL version?
● Have you applied all updates?
● Are you using the best driver?
  ● there are several Python, C++ drivers
  ● don't use ODBC if you can avoid it
● Does the driver support cached plans & binary data?
  ● if so, are they being used?
Check Caching
● Does the application use data caching?
  ● what kind?
  ● could it be used more?
  ● what is the cache invalidation strategy?
  ● is there protection from "cache refresh storms"?
● Does the application use HTTP caching?
  ● could they be using it more?
Check Connection Pooling
● Is the application using connection pooling?
  ● all web applications should, and most OLTP
  ● external or built into the application server?
● Is it configured correctly?
  ● max. efficiency: transaction / statement mode
  ● make sure timeouts match
Check Query Design
● PostgreSQL does better with fewer, bigger statements
● Check for common query mistakes
  ● joins in the application layer
  ● pulling too much data and discarding it
  ● huge OFFSETs
  ● unanchored text searches
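The "huge OFFSETs" mistake above deserves a concrete sketch. This uses sqlite3 as a stand-in (the principle is the same in PostgreSQL): OFFSET makes the database walk and discard every skipped row, while keyset pagination ("WHERE id > last_seen") seeks straight to the page through the index.

```python
# OFFSET pagination vs. keyset pagination, sketched on an in-memory table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table items (id integer primary key, val text)")
con.executemany("insert into items values (?, ?)",
                [(i, f"item{i}") for i in range(1, 10001)])

# anti-pattern: the database must walk 9000 rows just to throw them away
offset_page = con.execute(
    "select id from items order by id limit 10 offset 9000").fetchall()

# better: remember the last id the client saw and seek past it
last_seen = 9000
keyset_page = con.execute(
    "select id from items where id > ? order by id limit 10",
    (last_seen,)).fetchall()

assert offset_page == keyset_page  # same page, very different work
```

The table and column names are invented for the example; the pattern applies to any ordered, unique pagination key.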
Check Transaction Management
● Are transactions being used for loops?
  ● batches of inserts or updates can be 75% faster if wrapped in a transaction
● Are transactions aborted properly?
  ● on error
  ● on timeout
  ● transactions being held open while non-database activity runs
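The batching claim above can be sketched with sqlite3 as a stand-in for any autocommit-per-statement setup (the 75% figure is the talk's; the actual speedup varies, and is largest on disk where every commit forces a sync). The helper name and table are invented for the example.

```python
# Batched inserts: one commit for the whole loop vs. a commit per row.
import sqlite3

def load_rows(con, rows, one_transaction):
    cur = con.cursor()
    if one_transaction:
        cur.execute("begin")
    for r in rows:
        cur.execute("insert into t values (?)", (r,))
        if not one_transaction:
            con.commit()        # a commit (and, on disk, a sync) per row
    if one_transaction:
        con.commit()            # a single commit for the whole batch

con = sqlite3.connect(":memory:", isolation_level=None)  # manual txn control
con.execute("create table t (x integer)")
load_rows(con, range(1000), one_transaction=True)
print(con.execute("select count(*) from t").fetchone()[0])  # -> 1000
```

In-memory the two paths take similar time; on a real disk (or a PostgreSQL server with synchronous commit), the per-row-commit path pays a sync for every statement.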
Common Ailments of the Database Server

Check for them, monitor for them
● ailments could throw off your response time targets
  ● database could even "hit the wall"
● check for them during the health check
  ● and during each checkup
● add daily/continuous monitors for them
  ● Nagios check_postgres.pl has checks for many of these things
Database Growth
● Checkup:
  ● check both total database size and largest table(s) size daily or weekly
● Symptoms:
  ● database grows faster than expected
  ● some tables grow continuously and rapidly

Database Growth
● Caused By:
  ● faster than expected increase in usage
  ● "append forever" tables
  ● Database Bloat
● Leads to:
  ● slower seq scans and index scans
  ● swapping & temp files
  ● slower backups

Database Growth
● Treatment:
  ● check for Bloat
  ● find largest tables and make them smaller
    – expire data
    – partitioning
  ● horizontal scaling (if possible)
  ● get better storage & more RAM, sooner
Database Bloat

-[ RECORD 1 ]+----------------------
schemaname   | public
tablename    | user_log
tbloat       | 3.4
wastedpages  | 2356903
wastedbytes  | 19307749376
wastedsize   | 18 GB
iname        | user_log_accttime_idx
ituples      | 941451584
ipages       | 9743581
iotta        | 40130146
ibloat       | 0.2
wastedipages | 0
wastedibytes | 0
wastedisize  | 0 bytes
Database Bloat
● Caused by:
  ● autovacuum not keeping up
    – or not enough manual vacuum
    – often on specific tables only
  ● FSM set wrong (before 8.4)
  ● Idle In Transaction
● Leads To:
  ● slow response times
  ● unpredictable response times
  ● heavy I/O

Database Bloat
● Treatment:
  ● make autovacuum more aggressive
    – on specific tables with bloat
  ● fix max_fsm_relations / max_fsm_pages
  ● check when tables are getting vacuumed
  ● check for Idle In Transaction
Memory Usage Growth

00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:00        0        0      100        0        0      100        0        0
02:00:00        0        0      100        0        0      100        0        0
03:00:00        0        0      100        0        0      100        0        0
04:00:00        0        0      100        0        0      100        0        0

00:00:01  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
01:00:00     3788      115       98        0        0      100        0        0
02:00:00    21566      420       78        0        0      100        0        0
03:00:00   455721     1791       59        0        0      100        0        0
04:00:00      908        6       96        0        0      100        0        0
Memory Usage Growth
● Caused by:
  ● Database Growth or Bloat
  ● work_mem limit too high
  ● bad queries
● Leads To:
  ● database out of cache
    – slow response times
  ● OOM errors (OOM Killer)

Memory Usage Growth
● Treatment:
  ● look at ways to shrink queries, DB
    – partitioning
    – data expiration
  ● lower work_mem limit
  ● refactor bad queries
  ● or just buy more RAM
Idle Connections
select datname, usename, count(*)
from pg_stat_activity
where current_query = '<IDLE>'
group by datname, usename;

 datname | usename | count
---------+---------+-------
 track   | www     |   318
Idle Connections
● Caused by:
  ● poor session management in application
  ● wrong connection pool settings
● Leads to:
  ● memory usage for connections
  ● slower response times
  ● out-of-connections at peak load

Idle Connections
● Treatment:
  ● refactor application
  ● reconfigure connection pool
    – or add one
Idle In Transaction
select datname, usename,
       max(now() - xact_start) as max_time,
       count(*)
from pg_stat_activity
where current_query ~* '<IDLE> in transaction'
group by datname, usename;

 datname | usename  |   max_time    | count
---------+----------+---------------+-------
 track   | admin    | 00:00:00.0217 |     1
 track   | www      | 01:03:06.0709 |     7
Idle In Transaction
● Caused by:
  ● poor transaction control by application
  ● abandoned sessions not being terminated fast enough
● Leads To:
  ● locking problems
  ● database bloat
  ● out of connections

Idle In Transaction
● Treatment:
  ● refactor application
  ● change driver/ORM settings for transactions
  ● change session timeouts & keepalives on pool, driver, database
Longer Running Queries
● Detection:
  ● log slow queries to the PostgreSQL log
  ● do a daily or weekly report (pgFouine)
● Symptoms:
  ● number of long-running queries in the log increasing
  ● slowest queries getting slower

Longer Running Queries
● Caused by:
  ● database growth
  ● poorly-written queries
  ● wrong indexes
  ● out-of-date stats
● Leads to:
  ● out-of-CPU
  ● out-of-connections

Longer Running Queries
● Treatments:
  ● refactor queries
  ● update indexes
  ● make autoanalyze more aggressive
  ● control database growth
Too Many Queries
● Caused By:
  ● joins in middleware
  ● not caching
  ● poll cycles without delays
  ● other application code issues
● Leads To:
  ● out-of-CPU
  ● out-of-connections

Too Many Queries
● Treatment:
  ● characterize queries using logging
  ● refactor application
Locking
● Detection:
  ● log_lock_waits
  ● scan activity log for deadlock warnings
  ● query pg_stat_activity and pg_locks
● Symptoms:
  ● deadlock error messages
  ● number and duration of lock waits getting larger

Locking
● Caused by:
  ● long-running operations with exclusive locks
  ● inconsistent foreign key updates
  ● poorly planned runtime DDL
● Leads to:
  ● poor response times
  ● timeouts
  ● deadlock errors

Locking
● Treatment:
  ● analyze locks
  ● refactor operations taking locks
    – establish a canonical order of updates for long transactions
    – use pessimistic locks with NOWAIT
  ● rely on cascades for FK updates
    – not on middleware code
Temp File Usage
● Detection:
  ● log_temp_files = 100kB
  ● scan logs for temp files weekly or daily
● Symptoms:
  ● temp file usage getting more frequent
  ● queries using temp files getting longer

Temp File Usage
● Caused by:
  ● sorts, hashes & aggregates too big for work_mem
● Leads to:
  ● slow response times
  ● timeouts

Temp File Usage
● Treatment:
  ● find swapping queries via the logs
  ● set work_mem higher for that ROLE, or
  ● refactor them to need less memory, or
  ● buy more RAM
All healthy now?
See you in six months!
Q&A
● Josh Berkus
  ● [email protected]
  ● it.toolbox.com/blogs/database-soup
● PostgreSQL Experts
  ● www.pgexperts.com
  ● pgCon Sponsor
● Also see:
  ● Load Testing (tomorrow)
  ● Testing BOF (Friday)
Copyright 2010 Josh Berkus & PostgreSQL Experts Inc. Distributable under the Creative Commons Attribution license, except for 3rd-party images which are property of their respective owners.