mysql zfs best practices
Post on 27-Apr-2015
NEELAKANTH NADGIR'S BLOG
Tuesday May 26, 2009
MySQL Innodb ZFS Best Practices
One of the cool things about talking about MySQL performance with ZFS is
that there is not much tuning to be done. Tuning ZFS is considered
evil, but it is a necessity at times. In this blog I will describe some of the tunings
that you can apply to get better performance with ZFS, and point out
the performance bugs which, when fixed, will remove the need for some of these
tunings.
For the impatient, here is the summary. See below for the reasoning behind
these recommendations and some gotchas.
1. Match ZFS recordsize with Innodb page size (16KB for Innodb
Datafiles, and 128KB for Innodb log files).
2. If you have a write heavy workload, use a separate ZFS intent log.
3. If your database working set size does not fit in memory, you can get a
big boost by using an SSD as L2ARC.
4. While using storage devices with battery backed caches or while
comparing ZFS with other filesystems, turn off the cache flush.
5. Prefer to cache within MySQL/Innodb over the ZFS Adaptive
replacement cache (ARC).
6. Disable ZFS prefetch.
7. Disable Innodb double write buffer.
Let's look at all of them in detail.
WHAT Match ZFS recordsize with Innodb page size (16KB for
datafiles, and 128KB for Innodb log files).
HOW zfs set recordsize=16k tank/db
WHY
The biggest boost in performance can be obtained by
matching the ZFS recordsize with the size of the IO. Since an
Innodb page is 16KB in size, most read IO is of size 16KB
(except for some prefetch IO which can get coalesced). The
default recordsize for ZFS is 128KB. The mismatch between
the read size and the ZFS recordsize can result in severely
inflated IO. If you issue a 16KB read and the data is not
already in the ARC, you have to read 128KB of data to
get it. ZFS cannot do a small read because the checksum is
calculated for the whole block and you have to read it all to
verify data integrity. The other reason to match the IO size
and the ZFS recordsize is the read-modify-write penalty. With
a ZFS recordsize of 128KB, when Innodb modifies a page and
the ZFS record is not already in memory, the record needs to be
read in from disk and modified before it is written back. This
increases the IO latency significantly. Luckily, matching the
ZFS recordsize with the IO size removes both problems.
For the Innodb log files, the writes are usually sequential and
vary in size. By using a ZFS recordsize of 128KB you
amortize the cost of read-modify-write.
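To put a number on the read inflation described above, here is a minimal shell sketch of the arithmetic. It uses only the two sizes from the text (a 16KB Innodb page and the 128KB default recordsize); the variable and function names are mine.

```shell
#!/bin/sh
# Read inflation when one Innodb page is read and the ARC does not hold it.
page_kb=16            # Innodb page size, from the text

inflation() {
    # $1 = ZFS recordsize in KB.
    # Prints how many times more data is read from disk than Innodb asked for.
    echo $(( $1 / page_kb ))
}

echo "recordsize=128k: $(inflation 128)x the requested data"
echo "recordsize=16k:  $(inflation 16)x the requested data"
```

With the default recordsize every cold 16KB read pulls in eight pages' worth of data; with a 16KB recordsize the disk read matches the request.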
NOTE
You need to set the recordsize before creating the database
files. If you have already created the files, you need to copy
them to pick up the new recordsize. You can use the stat(1)
command to check the recordsize (look for IO Block:).
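One way to follow this advice is to give the datafiles and the log files their own ZFS filesystems, each created with the right recordsize before MySQL writes anything. The sketch below assumes the filesystem tank/db from the HOW line already exists; the child names data and log are hypothetical. It only prints the commands unless you run it with DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: separate filesystems for Innodb datafiles and log files.
# tank/db comes from the post; the child names "data" and "log" are made up.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

setup_datasets() {
    run zfs create -o recordsize=16k tank/db/data    # matches the 16KB page
    run zfs create -o recordsize=128k tank/db/log    # sequential log writes
    run zfs get recordsize tank/db/data tank/db/log  # confirm the settings
}

setup_datasets
```

MySQL can then be pointed at the two mountpoints with innodb_data_home_dir and innodb_log_group_home_dir in my.cnf.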
WHAT If you have a write heavy workload, use a separate intent log
(slog).
HOW zpool add <pool> log c4t0d0 c4t1d0
WHY
Write latency is extremely critical for many MySQL workloads.
Typically, a query will read some data, do some calculations,
update some data and then commit the transaction. To
commit, the Innodb log has to be updated. Many transactions
can be committing at the same time. It is very important that
this "wait" for commit be fast. Luckily in ZFS, synchronous
writes can be accelerated by using the separate intent log.
In our tests with Sysbench read-write, we have seen around
10-20% improvement with the slog.
NOTE
If your query execution involves a physical read from
disk, the time for the write may not be that important. Be
sure to check this suggestion with your real workload.
Until Bug 6574286 is fixed, you cannot remove a slog.
Innodb actually issues multiple kinds of writes (log write,
dataspace write, insert buffer write). Of these, the most
critical one is the Innodb log write. The slog feature is
pool wide and thus some writes (like dataspace writes),
which need not go to the slog still do. This will be fixed
via Bug 6832481 (ZFS separate intent log bypass
property).
It is also possible that during ZFS transaction sync time,
the ZFS IO queue (35 deep) can get full. This means
that a write has to wait for a slot to become empty. Bug
6471212: need reserved I/O scheduler slots to improve
I/O latency of critical ops solves this using reserved slots.
Bug 6721168 slog latency impacted by I/O scheduler
during spa_sync is also worth checking out.
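As a rough sketch of adding the slog, the commands below use the two device names from the HOW line and assume a pool called tank (borrowed from the recordsize example). Mirroring the two log devices is my assumption, not something the tests above used; it seems prudent given that a slog cannot be removed once added. The script prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: add a mirrored separate intent log to a pool.
# POOL defaults to "tank" (an assumption); device names are from the post.
DRY_RUN=${DRY_RUN:-1}
POOL=${POOL:-tank}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

add_slog() {
    # "mirror" is this sketch's choice; drop it to stripe the two devices.
    run zpool add "$POOL" log mirror c4t0d0 c4t1d0
    # The new devices should show up under a "logs" section.
    run zpool status "$POOL"
}

add_slog
```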
WHAT L2ARC (or Level 2 ARC)
HOW zpool add <pool> cache c4t0d0
WHY
If your database does not fit in memory, every time you miss
the database cache, you have to read a block from disk. This
cost is quite high with regular disks. You can minimize the
database cache miss latency by using one or more SSDs as
a level-2 cache or L2ARC. Depending on your database
working set size, memory and L2ARC size you may see
several orders of magnitude improvement in performance.
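A minimal sketch of attaching an SSD as L2ARC and then watching whether it absorbs reads. The pool name tank and the SSD variable are assumptions for illustration; the device name is the one from the HOW line, and a device can only serve one role in a pool. Commands are printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: add an SSD as an L2ARC cache device and observe it.
# POOL and SSD defaults are assumptions; adjust for your system.
DRY_RUN=${DRY_RUN:-1}
POOL=${POOL:-tank}
SSD=${SSD:-c4t0d0}

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

add_l2arc() {
    run zpool add "$POOL" cache "$SSD"
    # Per-device statistics, 3 samples at 5 second intervals; the cache
    # device is listed separately from the main pool devices.
    run zpool iostat -v "$POOL" 5 3
}

add_l2arc
```

The L2ARC fills gradually as blocks are evicted from the ARC, so expect the benefit to appear only after the workload has run for a while.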
WHAT When it is safe, turn off ZFS cache flush
HOW The ZFS Evil tuning guide has more information about setting
this tunable. Refer to it for the best way to achieve this.
WHY
ZFS is designed to work reliably with disks with caches.
Every time it needs data to be stored persistently on disk, it
issues a cache flush command to the disk. Disks with
battery backed caches need not do anything (i.e. the cache
flush command is a nop). Many storage devices interpret this
correctly and do the right thing when they receive a cache
flush command. However, there are still a few storage systems
which do not interpret the cache flush command correctly. For
such storage systems, preventing ZFS from sending the cache
flush command results in a big reduction in IO latency. In our
tests with Sysbench read-write test we saw a 30%
improvement in performance.
NOTE
Setting this tunable on a system without a battery backed
cache can cause inconsistencies in case of a crash.
When comparing ZFS with filesystems that blindly enable
the write cache, be sure to set this to get a fair
comparison.
WHAT Prefer to cache within MySQL/Innodb over the ARC.
HOW Via my.cnf and by limiting the ARC size
WHY
You have multiple levels of caching when you are using
MySQL/Innodb with ZFS. Innodb has its own buffer pool and
ZFS has the ARC. Both of them make independent decisions
on what to cache and what to flush. It is possible for both of
them to cache the same data. By caching inside Innodb, you
get a much shorter (and faster) code path to the data.
Moreover, when the Innodb buffer cache is full, a miss in the
Innodb buffer cache can lead to flushing of a dirty buffer, even
if the data was cached in the ARC. This leads to unnecessary
writes. Even though the ARC dynamically shrinks and expands
in response to memory pressure, it is more efficient to just limit it. In
our tests, we have found that it is better (7-200%) to cache
inside Innodb rather than ZFS.
NOTE
The ARC can be tuned to cache everything, just metadata or
nothing on a per filesystem basis. See below for tuning advice
about this.
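The HOW above is terse, so here is a hedged sketch of the two ZFS-side knobs. It assumes the Solaris zfs_arc_max tunable in /etc/system and the per-filesystem primarycache property; the 4 GiB cap is purely illustrative, not a recommendation from the tests. On the MySQL side, the matching change is a larger innodb_buffer_pool_size in my.cnf.

```shell
#!/bin/sh
# Sketch: cap the ARC so memory goes to the Innodb buffer pool instead.

arc_max_line() {
    # $1 = ARC cap in GiB (illustrative value, choose your own).
    # Prints the /etc/system line with the cap expressed in bytes (hex).
    bytes=$(( $1 * 1024 * 1024 * 1024 ))
    printf 'set zfs:zfs_arc_max = 0x%x\n' "$bytes"
}

arc_max_line 4

# Per-filesystem alternative mentioned in the NOTE: cache only metadata
# in the ARC for the database filesystem (printed here, not executed).
echo "zfs set primarycache=metadata tank/db"
```

A change in /etc/system takes effect after a reboot, while the primarycache property applies to the filesystem as soon as it is set.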
WHAT Disable ZFS Prefetch.
HOW In /etc/system: set zfs:zfs_prefetch_disable = 1
WHY
Most filesystems implement some kind of prefetch. ZFS
prefetch detects linear (increasing and decreasing), strided,
multiblock strided IO streams and issues prefetch IO when it
will help performance. These prefetch IO have a lower priority
than regular reads and are generally very beneficial. ZFS also
has a lower level prefetch (commonly called vdev prefetch) to
help with spatial locality of data.
In Innodb, rows are stored in order of primary index. Innodb
issues two kinds of prefetch requests; one is triggered while
accessing sequential pages and other is triggered via random
access in an extent. While issuing prefetch IO, Innodb
assumes that the file is laid out in the order of the primary key.
This is not true for ZFS. We have yet to investigate the impact
of Innodb prefetch.
It is well known that OLTP workloads access data in a random
order and hence do not benefit from prefetch. Thus we
recommend that you turn off ZFS prefetch.
NOTE
If you have changed the primary cache caching strategy
to just cache metadata, you will not trigger file level
prefetch.
If you have set recordsize to 16k, you will not trigger the
lower level prefetch.
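If you script the /etc/system change, it helps to make it idempotent so that repeated runs do not stack duplicate lines. The helper below is a sketch under my own assumptions: ensure_line is a hypothetical name, it relies on a POSIX grep, and it writes to a scratch file by default rather than the real /etc/system.

```shell
#!/bin/sh
# Sketch: add the prefetch tunable to a file only if it is not there yet.

ensure_line() {
    # $1 = file, $2 = exact line that must be present.
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# TARGET is a scratch file by default; point it at /etc/system (as root)
# only once you are happy with the result.
TARGET=${TARGET:-${TMPDIR:-/tmp}/system.sketch}

ensure_line "$TARGET" "set zfs:zfs_prefetch_disable = 1"
ensure_line "$TARGET" "set zfs:zfs_prefetch_disable = 1"   # no-op the second time

grep -c "zfs_prefetch_disable" "$TARGET"   # number of matching lines
```

As with other /etc/system settings, the tunable is read at boot.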
WHAT Disable Innodb Double write buffer.
HOW skip-innodb_doublewrite in my.cnf
WHY
Innodb uses a double write buffer for safely updating pages in
a tablespace. Innodb first writes the changes to the double
write buffer before updating the data page. This is to prevent
partial writes. Since ZFS does not allow partial writes, you can
safely turn off the double write buffer. In our tests with
Sysbench read-write, we saw a 5% improvement in
performance.
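To pull the my.cnf side of these recommendations together, the sketch below prints a fragment with the doublewrite buffer disabled, the datafiles and logs on separate filesystems, and a buffer pool size. The option names are standard MySQL ones; the paths and the 8G figure are hypothetical placeholders, not values from the tests.

```shell
#!/bin/sh
# Sketch: print a my.cnf fragment reflecting the recommendations.
# $1 = datafile directory, $2 = log directory, $3 = buffer pool size.
mycnf_fragment() {
    cat <<EOF
[mysqld]
innodb_data_home_dir      = $1
innodb_log_group_home_dir = $2
innodb_buffer_pool_size   = $3
skip-innodb_doublewrite
EOF
}

# Hypothetical paths and size, for illustration only.
mycnf_fragment /tank/db/data /tank/db/log 8G
```

Review the output before merging it into a real my.cnf, and only skip the doublewrite buffer when the datafiles actually live on ZFS.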
Posted at 01:21PM May 26, 2009 by Neelakanth Nadgir in MySQL