PostgreSQL on EXT3/4, XFS, BTRFS and ZFS
comparing modern (Linux) file systems
Tomas Vondra
Linux file systems
● plenty of choices, with different
– goals, features, tuning options
– maturity level, reliability
● ext3/4, XFS
– traditional, design from the 90s
– improving over time, reasonably “modern”
● BTRFS, ZFS
– next-generation, new architecture / design
● other (not included in this talk)
– log-structured file systems, distributed, clustered, ...
EXT3, EXT4, XFS
EXT3, EXT4, XFS - history
● ext3 (2001) / ext4 (2008)
– evolution of the original Linux file system (ext, ext2, ...)
– continuous improvements / fixes
● XFS (2002)
– originally from SGI Irix 5.3 (1994)
– 2000 released under GPL
– 2002 merged into 2.5.36
● both are
– reliable journaling file systems
– proven by time on many deployments
EXT3, EXT4, XFS - features
● traditional design with journal
● not handling
– multiple devices
– volume management
– snapshots
– ...
● need additional layers for those things
– hardware RAID
– software RAID (dm)
– LVM / LVM2
EXT3, EXT4, XFS - evolution
● conceived in times of rotational storage
– mostly work well on SSDs
– stop-gap for future storage (NVRAM, ...)
● evolution, not a revolution (mostly)
– fixing bugs (some real, some imaginary)
– adding features (e.g. TRIM, barriers, ...)
– scalability improvements (metadata, ...)
– be careful when reading old articles / benchmarks
– be wary of anecdotal evidence (without context)
– synthetic benchmarks are misleading
EXT3, EXT4, XFS - sources
● Linux Filesystems: Where did they come from?
  (Dave Chinner @ linux.conf.au 2014)
  https://www.youtube.com/watch?v=SMcVdZk7wV8
● Ted Ts'o on the ext4 Filesystem
  (Ted Ts'o, NYLUG, 2013)
  https://www.youtube.com/watch?v=2mYDFr5T4tY
● XFS: There and Back … and There Again?
  (Dave Chinner @ Vault 2015)
  https://lwn.net/Articles/638546/
● XFS: Recent and Future Adventures in Filesystem Scalability
  (Dave Chinner, linux.conf.au 2012)
  https://www.youtube.com/watch?v=FegjLbCnoBw
● XFS: the filesystem of the future?
  (Jonathan Corbet, Dave Chinner, LWN, 2012)
  http://lwn.net/Articles/476263/
BTRFS, ZFS
BTRFS, ZFS - goals
● ideas
– integrate the layers
– design for commodity hardware (expect failures)
– design for huge data volumes
● so that we get …
– flexible management
– built-in snapshotting
– compression, deduplication
– checksums
– ...
BTRFS, ZFS - history
● BTRFS
– merged in 2009, but considered “experimental”
– on-disk format “stable” (1.0)
– some claim it’s “stable” but I doubt that …
– (What are the criteria for a file system to be “stable”?)
● ZFS
– originally from Solaris, but got Oracled :-(
– development is a bit fragmented today
– available on BSD systems (FreeBSD)
– “ZFS on Linux” project (CDDL vs. GPL)
Tuning options
Generic tuning options
● TRIM (discard)
– enable / disable TRIM on SSDs
– impacts garbage collection / wear leveling
● write barriers
– prevent the disk from reordering writes
– may still lose data, but no file system corruption
– write cache + battery => disable barriers
● SSD alignment
– alignment on SSDs matters (pages, blocks, …)
– no dedicated tuning options (can use stripe unit / width)
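A minimal sketch of how the two on/off decisions above (TRIM and barriers) could be turned into a mount option string. The helper `mount_opts`, its arguments and the mount point `/mnt/pgdata` are made up for illustration; `noatime`, `discard` and `nobarrier` are the mount options used in this talk. Newer kernels may no longer accept `nobarrier` on every file system, so treat this as matching the kernel era of the benchmarks. The script only prints commands, it does not run them.

```shell
#!/bin/sh
# Hypothetical helper: build a mount option string from two yes/no decisions.
#   $1 = "yes" to issue TRIM online (discard), "no" to rely on periodic fstrim
#   $2 = "yes" if the write cache is protected (battery / capacitor)
mount_opts() {
    opts="defaults,noatime"
    if [ "$1" = "yes" ]; then
        opts="$opts,discard"
    fi
    # Only drop barriers when the write cache survives a power loss.
    if [ "$2" = "yes" ]; then
        opts="$opts,nobarrier"
    fi
    printf '%s\n' "$opts"
}

# Dry run: print the commands instead of executing them.
echo "mount -o $(mount_opts yes yes) /dev/sda1 /mnt/pgdata"
echo "fstrim -v /mnt/pgdata   # periodic alternative to the discard option"
```

Keeping the barrier decision behind an explicit argument mirrors the rule on the slide: barriers are only disabled when the write cache is protected.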
BTRFS / ZFS tuning options
● nodatacow (BTRFS)
– disable copy on write
– still can do snapshots (will do necessary COW)
– disables checksums (needs full COW)
● zfs_arc_max (ZFS)
– limit the size of ARC cache
– should be released automatically, but ...
ZFS tuning options
● recordsize=8kB
– match the file system record size to the PostgreSQL page size
● ashift=13 (8kB)
– align the writes to SSD pages
● primarycache=metadata
– prevent double buffering (shared buffers)
http://open-zfs.org/wiki/Performance_tuning
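The ZFS settings above might be applied roughly as follows. This is a dry-run sketch that only prints the commands. The pool, dataset and device names (`pgpool`, `pgpool/pgdata`, `/dev/sda1`) are the ones used in the mkfs slide of this talk. The helper `arc_max_bytes` is hypothetical, the 5GB limit is taken from the "custom" test configuration, and `/etc/modprobe.d/zfs.conf` is the usual place for ZFS on Linux module options but may differ per distribution.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in GB to bytes for zfs_arc_max.
arc_max_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Dry run: print the commands rather than running them.
# ashift is a pool property set at creation time: 2^13 = 8192 bytes.
echo "zpool create -o ashift=13 pgpool /dev/sda1"
# recordsize and primarycache are per-dataset properties.
echo "zfs create -o recordsize=8k -o primarycache=metadata pgpool/pgdata"
# ZFS on Linux takes zfs_arc_max (in bytes) as a kernel module parameter.
echo "options zfs zfs_arc_max=$(arc_max_bytes 5)   # line for /etc/modprobe.d/zfs.conf"
```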
file systems
● ext3
– default
● ext4
– default
– discard, nobarrier, stripe-width
● xfs
– default
– LVM
– LVM + snapshot
– discard, nobarrier
– discard, nobarrier, agcount, sunit/swidth
● btrfs
– default
– nodatacow
– nodiscard (+fstrim)
● zfs
– default
– recordsize=8k, ashift=13, primarycache=metadata (open-zfs)
– recordsize=8k, ashift=13, zfs_arc_max=5GB (custom)
benchmarks
pgbench (TPC-B)
● transactional benchmark
– small queries (access by PK, ...)
● modes
– read-only
– read-write
● scales
– small (~200MB)
– medium (~50% RAM)
– large (~200% RAM)
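A rough sketch of how these sizes could be translated into pgbench scale factors. It assumes about 15MB of data per scale unit (100,000 rows in `pgbench_accounts`), which is a rule of thumb and not a figure from the talk. The helper `scale_for_mb`, the database name `bench`, the client count and the run duration are all made up; the 8GB of RAM is the test machine's, and 150MB is the size the result charts give for the small data set. The script only prints the pgbench invocations.

```shell
#!/bin/sh
# Assumption: roughly 15 MB of data per pgbench scale unit.
MB_PER_SCALE=15

# Hypothetical helper: smallest scale factor giving at least $1 MB of data.
scale_for_mb() {
    echo $(( ($1 + MB_PER_SCALE - 1) / MB_PER_SCALE ))
}

RAM_MB=8192   # 8GB RAM, as on the test machine

small=$(scale_for_mb 150)                   # fits easily in memory
medium=$(scale_for_mb $(( RAM_MB / 2 )))    # ~50% of RAM
large=$(scale_for_mb $(( RAM_MB * 2 )))     # ~200% of RAM

# Dry run: print the commands (client count and duration are illustrative).
for s in "$small" "$medium" "$large"; do
    echo "pgbench -i -s $s bench"                # initialize at this scale
    echo "pgbench -S -c 4 -j 4 -T 60 bench"      # read-only (select-only) run
    echo "pgbench -c 4 -j 4 -T 60 bench"         # read-write (TPC-B-like) run
done
```

Sizing relative to RAM is what separates the three cases: the small set is served from cache, while the large set is forced to hit the storage.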
TPC-DS
● warehouse, analytical
– large amounts of data
– queries processing a lot of data
● complex queries
– aggregations
– joins
– CTEs
– …
● successor to TPC-H
– more elaborate / realistic
System
● PostgreSQL 9.4.1
● Gentoo with kernel 3.17
● CPU: Intel i5-2500k
– 4 cores @ 3.3 GHz (3.7 GHz turbo)
– 6MB cache
– 2011-2013
● 8GB RAM (DDR3 1333)
● SSD Intel S3500 100GB (SATA)
pgbench read-only
[chart: pgbench / small (150MB) / read-only; transactions per second, axis 0 to 60000]
[chart: pgbench / medium (50% RAM) / read-only; transactions per second, axis 0 to 60000]
[chart: pgbench / large (200% RAM) / read-only; transactions per second, axis 0 to 45000]
configurations shown: btrfs, btrfs-nodatacow, btrfs-nodiscard-fstrim, ext3, ext4, ext4-discard-nobarrier-stripe, xfs, xfs-discard-lvm-snapshot, xfs-discard-nobarrier, xfs-lvm, xfs-tuned-agcount-su-sw, zfs, zfs-tuned, zfs-tuned-2 (the large chart also includes ext4-discard-lvm-snapshot)
pgbench read-write
[chart: pgbench / small (150MB) / read-write; transactions per second, axis 0 to 8000]
[chart: pgbench / medium (50% RAM) / read-write; transactions per second, axis 0 to 6000]
[chart: pgbench / large (200% RAM) / read-write; transactions per second, axis 0 to 5000]
configurations shown: btrfs, btrfs-nodatacow, btrfs-nodiscard-fstrim, ext3, ext4, ext4-discard-nobarrier-stripe, xfs, xfs-discard-lvm-snapshot, xfs-discard-nobarrier, xfs-lvm, xfs-tuned-agcount-su-sw, zfs, zfs-tuned, zfs-tuned-2 (the large chart also includes ext4-discard-lvm-snapshot)
performance variability
EXT / XFS conclusions
● EXT4
– good “default” choice
– disable barriers (with protected write cache)
– tune alignment to match the SSD
– very “smooth” results
● XFS
– does not outperform ext4 (in this test)
– not much worse, if properly tuned
– disable write barriers, tune alignment to SSD
– more anomalies than ext4 (sudden performance drops, ...)
BTRFS & ZFS
TPC-DS
mkfs / mount options
● ext4, xfs
– mkfs.ext4 -E stripe-width=256 /dev/sda1
– mkfs.xfs -d su=512k,sw=1 -l su=512k -f /dev/sda1
– mount: defaults,noatime,discard,nobarrier
● btrfs
– mkfs.btrfs -l 8192 -L pgdata /dev/sda1
– mount: defaults,noatime,ssd,discard,nobarrier [compress=lzo]
● zfs
– zpool create pgpool /dev/sda1
– zfs create pgpool/pgdata
– zfs set recordsize=8k pgpool/pgdata
– zfs set atime=off pgpool/pgdata
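The TPC-DS results also cover compressed variants (btrfs with lzo, zfs with lz4), but only the btrfs mount option is listed above. A plausible sketch of how both might be enabled, assuming the same device, pool and dataset names: the helper `enable_compression` and the mount point `/mnt/pgdata` are made up, and while `compression=lz4` is a standard ZFS dataset property, the talk does not show the exact command used. The script only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper: print the command enabling compression for a variant.
enable_compression() {
    case "$1" in
        btrfs-lzo)
            # btrfs enables compression through a mount option
            echo "mount -o defaults,noatime,ssd,discard,nobarrier,compress=lzo /dev/sda1 /mnt/pgdata"
            ;;
        zfs-lz4)
            # ZFS enables compression through a dataset property
            echo "zfs set compression=lz4 pgpool/pgdata"
            ;;
        *)
            echo "unknown variant: $1" >&2
            return 1
            ;;
    esac
}

enable_compression btrfs-lzo
enable_compression zfs-lz4
# The achieved ratio on ZFS can be inspected afterwards with:
echo "zfs get compressratio pgpool/pgdata"
```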
[chart: TPC-DS load duration on EXT4, XFS, BTRFS and ZFS; duration in seconds (axis 0 to 6000), split into data and indexes, for ext4, xfs, btrfs, btrfs (lzo), zfs, zfs (lz4)]
[chart: TPC-DS query performance on EXT4, XFS, BTRFS and ZFS; duration in seconds (axis 0 to 700), for ext4, xfs, btrfs, btrfs (lzo), zfs, zfs (lz4)]
[chart: TPC-DS space used on EXT4, XFS, BTRFS and ZFS; size in GB (axis 0 to 70), for ext4, xfs, btrfs, btrfs (lzo), zfs, zfs (lz4)]
TPC-DS summary
● EXT4, XFS, BTRFS
– about the same performance
● compression is nice
– uncompressed: 60GB
– compressed: ~30GB
– mostly saves storage capacity, queries are not faster
● ZFS much slower :-(