


The Extent of GFS2

Dr Steven Whitehouse
22/23 March 2017
Linux Foundation Vault 2017


Topics

● Quick tour of GFS2

● Where have we got to?

● Where are we going?


What is GFS2?

● 64 bit, symmetric cluster filesystem

● Uses DLM for locking
● Abstracted through glocks – cache control mechanism

● Inodes are single blocks (unit of caching)
● Equal height metadata tree using pointer blocks

● Directories use “extensible hashing”

● Hidden metadata filesystem contains system data
● One journal per node
● Also quota & statfs data
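As a rough illustration of the “extensible hashing” idea used for directories, here is a toy Python model of a hash table whose pointer table doubles when a leaf fills. This is a sketch only: the class, hash function, and leaf capacity are invented for illustration, and GFS2's on-disk hash-table format differs in detail.

```python
# Toy model of extensible hashing for directory lookup (illustrative only;
# GFS2's on-disk format and hash function differ).

def dir_hash(name: str) -> int:
    """Toy 32-bit name hash (GFS2 uses a CRC-based hash on disk)."""
    h = 0
    for ch in name.encode():
        h = (h * 131 + ch) & 0xFFFFFFFF
    return h

class ExtensibleHashDir:
    def __init__(self):
        self.depth = 1                 # number of hash bits in use
        self.table = [set(), set()]    # pointers to leaf "blocks"

    def _slot(self, name):
        # The top `depth` bits of the hash index the pointer table.
        return dir_hash(name) >> (32 - self.depth)

    def insert(self, name, leaf_capacity=4):
        leaf = self.table[self._slot(name)]
        leaf.add(name)
        if len(leaf) > leaf_capacity:
            self._double()

    def _double(self):
        # Doubling the table lets a full leaf split without rehashing the
        # whole directory at once.
        self.depth += 1
        old = self.table
        self.table = [set() for _ in range(2 ** self.depth)]
        for leaf in old:
            for name in leaf:
                self.table[self._slot(name)].add(name)

    def lookup(self, name):
        return name in self.table[self._slot(name)]
```

The point of the structure is that lookups cost one hash plus one leaf scan, and growth is incremental rather than a full rehash.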


Where did GFS2 come from?

● GFS started out as a research project at the University of Minnesota

● Initial purpose was storage of ocean current simulation data

● Spun out into Sistina Software circa 2000
● Red Hat bought Sistina Software in Dec 2003

● GFS2 was a development from GFS
● Very similar on-disk structures – allows in-place upgrade
● Code clean-up & some improvements
● Went upstream in 2.6.19 (Nov 2006)


Where is GFS2 used today?

● Lots of different applications…
● Web/FTP servers
● Backup solutions
● Message Queue (IBM Websphere, Tibco MQ, ActiveMQ)
● Various SAS workloads
● and many more...

● Many different sectors
● Financial, IT, Retail, Manufacturing, ...


What workloads is GFS2 best at?

● Small numbers of nodes (<=16)

● When (almost) POSIX compliance is required

● When the workload can be mostly localized
● This point is very important for performance

● When HA is an important consideration

● Avoid:
● Highly non-local workloads
● Polling the filesystem for inter-node communication


Recent Developments


Resource Group Scalability (1)

● Like ext3 block group / XFS allocation group
● Subsection of the filesystem with allocation bitmap

● Internally held in an rbtree for quick access

● At allocation time we have a choice of which rgrp to use

● We want locality with previous allocations● We want to avoid inter-node contention


Resource Group Scalability (2 - locality)

● Each (in core) resource group has a list of block reservations associated with it

● The reservations are created at write or page_mkwrite time, where a size hint is calculated

● A node-local reservation is then created for a number of blocks, even though fewer may be allocated

● Future allocations will try to use the reservation, before looking elsewhere for space

● Avoids the multiple streaming writes issue
● A big performance improvement for that specific case
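The reservation idea above can be sketched roughly as follows. This is a simplified Python model, not the kernel code: the `ResourceGroup` class, the free-list representation, and the default hint size are all invented for illustration.

```python
# Sketch of per-inode block reservations: a write reserves a run of blocks
# larger than it immediately needs, and later writes to the same inode
# consume the reservation before searching the bitmap again.

class ResourceGroup:
    def __init__(self, start, length):
        self.free = list(range(start, start + length))  # toy free-block list
        self.reservations = {}  # inode -> list of reserved block numbers

    def reserve(self, inode, size_hint):
        """Carve out a run of blocks for this inode up front."""
        run, self.free = self.free[:size_hint], self.free[size_hint:]
        self.reservations.setdefault(inode, []).extend(run)

    def alloc(self, inode, nblocks, size_hint=32):
        rsv = self.reservations.get(inode, [])
        if len(rsv) < nblocks:
            # Top up the reservation; the hint deliberately over-reserves
            # so consecutive writes to one inode stay adjacent on disk.
            self.reserve(inode, max(size_hint, nblocks))
            rsv = self.reservations[inode]
        blocks, self.reservations[inode] = rsv[:nblocks], rsv[nblocks:]
        return blocks
```

With two inodes being written concurrently, each keeps allocating from its own reserved run, so the streams end up in disjoint regions instead of interleaved block by block.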


Resource Group Scalability (3 – inter-node)

● We want to avoid inter-node contention on rgrps
● How hard should we try to allocate from a particular rgrp?
● Orlov allocator (as per ext3) gives first level of contention avoidance

● The second level is given by lock stats – did we have to wait longer than average for this rgrp? If so it might be contended

● If we have a reservation we ignore the lock stats, to avoid excessive fragmentation
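The two-level heuristic above can be sketched as a small decision function. This is a simplified illustration, not the kernel logic: the function name, the scalar "wait time" per rgrp, and the fallback policy are invented.

```python
# Sketch of the congestion heuristic: skip a resource group whose DLM lock
# wait time is worse than the average across groups, unless we already
# hold a reservation in it.

def pick_rgrp(rgrps, preferred, have_reservation):
    """rgrps: dict name -> observed lock wait time (arbitrary units)."""
    if have_reservation:
        # Honouring an existing reservation avoids fragmenting the file,
        # so congestion statistics are ignored in that case.
        return preferred
    avg = sum(rgrps.values()) / len(rgrps)
    if rgrps[preferred] <= avg:
        return preferred
    # Preferred group looks contended; fall back to the least-waited one.
    return min(rgrps, key=rgrps.get)
```

The trade-off is visible in the code: ignoring the stats when a reservation exists trades a little contention for much less fragmentation.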


Glock scalability

● Glocks are kept in a single big hash table
● Indexed by type and glock number (inode/rgrp number)

● Lookups mostly occur on inode creation/lookup
● Glock references are kept by inodes for their lifetime

● Recent change to use rhashtable improves scalability
● Keeps RCU locking & lockref advantages
● Scales according to number of glocks/inodes

● Big performance improvement with lots of inodes (>1m)


Xattrs & SELinux (1)

● In GFS2 xattrs are stored in a separate block to the inode

● Two disk reads may be required for each inode

● Solution:

● If we create xattrs at inode creation time (e.g. for SELinux labels) then we can allocate 2 blocks (inode & xattr) contiguously

● We then mark the directory entry, so we know that there are two blocks to read, not just one.

● When we read the inode, we can then issue a single read for both blocks
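The read-side benefit can be sketched with a tiny disk model. Everything here is hypothetical and invented for illustration (the dict-as-disk, the dirent tuple, and the flag name); it only shows the shape of the optimisation, not the real metadata layout.

```python
# Sketch: if the directory entry records that the xattr block directly
# follows the inode block, both can be fetched with one two-block read.

def read_inode(disk, dirent):
    """disk: dict block_number -> bytes.
    dirent: (inode_block, has_adjacent_xattr) tuple."""
    blk, adjacent_xattr = dirent
    if adjacent_xattr:
        # One I/O covering both blocks instead of two separate reads.
        return disk[blk], disk[blk + 1]
    inode = disk[blk]
    # Without the hint, any xattr block needs a second read once the
    # inode has been parsed; omitted in this sketch.
    return inode, None
```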


Xattrs & SELinux (2)

● SELinux has historically not been cluster coherent
● No way for GFS2 to invalidate SELinux labels

● This is now fixed upstream, so SELinux can be used in a fully cluster coherent manner

● Combined with the xattr performance improvement, SELinux is now a viable option for GFS2


Multi-threaded streaming scalability

● Journal can be a source of contention with multi-threaded workloads

● A recent patch avoids taking the journal lock in the case where the block in question is already in the journal

● For streaming workloads this is very likely to be the case for the inode and some of the indirect blocks, for example

● Improvements seen of around 50% with fast storage
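The pattern behind this optimisation resembles a classic double-checked fast path, sketched here in Python (illustrative only; the names and the set-based journal model are invented, and the real kernel code uses buffer flags rather than a set):

```python
# Sketch: test the "already in journal" state without the lock, and only
# take the lock when the block actually needs adding. The state must be
# re-checked under the lock before modifying it.
import threading

journal_lock = threading.Lock()
in_journal = set()

def add_to_journal(block):
    if block in in_journal:            # unlocked fast path, no contention
        return
    with journal_lock:
        if block not in in_journal:    # re-check under the lock
            in_journal.add(block)
```

For streaming writes the same inode and indirect blocks are hit repeatedly, so almost every call takes the lock-free path, which is where the reported improvement comes from.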


Fsck.gfs2 performance improvements

● As filesystems get larger, fsck time becomes a major issue

● The design of GFS2’s fsck is based on multiple passes
● The amount of memory used for storage of state has been reduced
● Readahead has been added
● pass1c has been removed (combined with pass1)

● Work is continuing on improvements in this area


What’s next?


DLM Lock Timing Analysis

● Using the gfs2_glock_lock_time tracepoint

● The tdiff field reports the time of each DLM lock request

● srtt, srttb, srttvar, srttvarb
● Smoothed round-trip times (b = blocking) and variance

● sirt, sirtvar
● Smoothed inter-request times

● dcount – Number of DLM requests

● qcount – Number of (local) glock requests
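These smoothed fields are exponentially weighted estimates in the spirit of TCP round-trip estimation. A sketch of that style of update is below; the gains (1/8 and 1/4) are the classic TCP values and the real GFS2 code, which uses integer arithmetic, may weight things differently.

```python
# Sketch of TCP-style exponential smoothing for a round-trip estimate and
# its mean deviation, in the spirit of the srtt/srttvar fields.

def update_stats(srtt, srttvar, sample):
    """Return updated (srtt, srttvar) after one lock round-trip sample."""
    err = sample - srtt
    srtt += err / 8                       # gain 1/8: slow-moving mean
    srttvar += (abs(err) - srttvar) / 4   # gain 1/4: mean deviation
    return srtt, srttvar
```

A high srttvar relative to srtt indicates erratic lock latency, which is often a better contention signal than the mean alone.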


Journal Flushing (1)

● This can take a long time
● Increases glock release latency
● Stops new transactions while journal is being flushed

● Causes:
● Ordered write mode means data is flushed before the journal
● Inability to start transactions while journal flush is in progress


Journal Flushing (2)

● Things are not all bad
● We have streamlined the journal I/O already
● Builds large bio I/Os – very efficient
● Works well under memory pressure

● Design allows adding new data and being backwards compatible

● Some space left in data structures, so lots of options
● A big win would be to eliminate the ordered write list flushing


Ordered write list

● A list of inodes to which data has been written

● At journal flush time:
● Sorts the ordered write list by inode number
● Writes back the data for each inode
● Waits for the data for each inode
● Then flushes the journal

● Can we avoid this?
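The flush sequence above can be written down as a minimal sketch (a simplified model, not the kernel implementation; the callback parameters are invented to keep it self-contained):

```python
# Sketch of the ordered-write flush sequence: sort, issue writeback,
# wait, then commit the journal.

def journal_flush(ordered_list, writeback, wait, flush_journal):
    for inode in sorted(ordered_list):    # sort by inode number
        writeback(inode)                  # start data writeback
    for inode in sorted(ordered_list):
        wait(inode)                       # wait for data to reach disk
    flush_journal()                       # only then commit the journal
    ordered_list.clear()
```

The sketch makes the cost visible: every journal flush is serialized behind a full writeback-and-wait cycle over the ordered list, which is exactly what the extent-based approach on the next slide tries to avoid.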


Introducing extents

● One potential solution to the ordered write issue
● Add additional information to the journal indicating newly allocated extents
● Then we can avoid the pre-journal-flush writeback

● Backwards compatibility
● Yes, from the journal PoV
● No, in the case of mixed clusters (old & new)

● Could provide a way in which to introduce more general support for extents into GFS2


iomap

● Recently introduced upstream
● Would enable multi-page write
● Spread locking overhead across multiple pages
● Performance win for streaming writes

● Also to fix a FIEMAP issue
● Improve efficiency of mapping holes in sparse files

● One nice side effect
● Should be possible to write a generic SEEK_DATA/SEEK_HOLE for iomap-based filesystems
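A generic SEEK_DATA/SEEK_HOLE can be modelled over an iomap-style extent list like this (a toy Python model only; the extent representation and function shape are invented, and the real kernel interface works on struct iomap mappings):

```python
# Toy model of SEEK_DATA/SEEK_HOLE over a list of allocated extents.
# extents: list of (start, length) runs holding data; gaps are holes,
# and POSIX defines an implicit hole at end of file.

def seek(extents, offset, whence, file_size):
    """Return the next data ('DATA') or hole ('HOLE') offset >= offset,
    or None if seeking for data past the last extent."""
    for start, length in sorted(extents):
        end = start + length
        if whence == "DATA":
            if offset < end:
                return max(offset, start)
        else:  # "HOLE"
            if offset < start:
                return offset          # already inside a hole
            if offset < end:
                offset = end           # skip over this data extent
    return None if whence == "DATA" else min(offset, file_size)
```

Once the filesystem exposes its extent map through iomap, this loop never touches page contents, which is why a shared implementation becomes possible.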


Thank-you!