spectrum scale: metadata & concepts part ii
TRANSCRIPT
![Page 1: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/1.jpg)
© Copyright IBM Corporation 2016. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
Indulis Bernsteins
Systems Architect
indulisb uk.ibm.com
Spectrum Scale: Metadata & Concepts Part II
Ehningen, 03 Mar 2020
![Page 2: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/2.jpg)
“Metadata” means different things to different people
o Filesystem metadata
• Directory structure, access permissions, creation time, owner ID, last modified etc.
o Scientist’s metadata
• EPIC persistent identifier, Grant ID, data description, data source, publication ID, etc.
o Medical patient’s metadata
• National Health ID, Scan location, Scan technician, etc.
o Object metadata
• MD5 checksum, Account, Container, etc.
![Page 3: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/3.jpg)
Filesystem Metadata (MD)
• Used to find, access, and manage data (in files)
o Hierarchical directory structure
o POSIX standard for information in the filesystem metadata (Linux,
UNIX)
• POSIX specifies what not how
▪ Filesystem handles how it stores and works with its Metadata
• Can add other information or functions…
as long as the POSIX functions work
![Page 4: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/4.jpg)
Why focus on Filesystem Metadata (MD)?
• Can be become a performance bottleneck
o Examples:
• Scan for files changed since last backup
• Delete files owned by user Mustermann (who just left the company)
• Migrate least used files from the SSD tier to the disk tier
• Delete snapshot #5, out of a total of 10 snapshots
![Page 5: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/5.jpg)
Why focus on Filesystem Metadata (MD)?
• Can be a significant cost
o For performance, MD may need to be on Flash or SSD or NVMe
• Let’s try to get the capacity and performance right!
• Performance on NVMe is not infinite!
![Page 6: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/6.jpg)
Spectrum Scale Filesystem Metadata (MD)
o POSIX compatible filesystem, including multi-user locking (to 10,000+ users!)
o Designed to support extra Spectrum Scale functions:
• Small amounts of file data inside Metadata inode
• Multi-site “stretch cluster”
▪ Via replication of Metadata and Data
• HSM / ILM / Tiering of files to Object storage or Tape tier
▪ Via MD Extended Attributes
• Fast directory scans using many servers in parallel (“Policy Engine”)
▪ Bypasses “normal” POSIX directory functions using special Spectrum Scale MD design
• Other types of metadata using Extended Attributes (EAs)
▪ EAs can “tag” the file with user defined information
• Snapshots
![Page 7: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/7.jpg)
Other Spectrum Scale Metadata (besides filesystem MD)
• Filesystem Descriptor: stored on reserved area, multiple NSDs/disks
• Cluster Configuration Data: mmsdrfs file on server’s local disk
…etc.
• We will only talk about Filesystem Metadata
![Page 8: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/8.jpg)
Filesystem MD capacity
• Filesystem MD capacity is used up- mostly by
• File inodes = 1 per file
+ Indirect Blocks as needed (might take up a lot of capacity)
• Directory inodes = 1 per directory
+ Directory Blocks as needed
• Extended Attributes: in Data inode
+ EA blocks as needed
• MD also used up by other things… (more info later!)
![Page 9: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/9.jpg)
NSDs and Storage pools
9
![Page 10: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/10.jpg)
Spectrum Scale Storage pools
• Operating System LUN Spectrum Scale “NSD”
“Disk” when allocated to filesystem
o An NSD belongs to a Storage Pool
o Each filesystem can have 1 to 8 Storage Pools
o A Storage Pool can have many NSDs in it
• Each NSD can only belong to a single filesystem
![Page 11: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/11.jpg)
More pools! System and other Storage pools• Max of 8 internal storage pools per filesystem
• One System storage pool per filesystem
o Required: one and only one per filesystem
o Only System pool can store MetaData
o System pool can also store Data
• NSDs defined as Metadata Only, or Data only, or Data
And Metadata
• Many optional User-defined Data storage pools
o Data only, no MD
o All NSDs must be defined as Data Only
o Filesystem default placement policy required
• One System.log storage pool (optional)
o For recovery logs normally stored in System pool
Filesystem 1
Spectrum Scale Cluster
System pool
NSD
MD
Data1 pool
NSD
Data
NSD
DATA
NSD
MD+DATA
Data2 pool
NSD
Data
System.log poolLog
Data
![Page 12: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/12.jpg)
Filesystem 1
LUN ↔ “NSD” ↔ “Disk”
Spectrum Scale Cluster
Storage Pool
GPFS “Disk”
= NSD
which is
allocated to
filesystem
Network
Shared Disk
“NSD”
(unallocated)
LUN
known to
OS, not to
GPFS mmcrnsd
turns OS LUN
NSD,
available to be
allocated to a
filesystem.
NSD is not yet
not yet
allocated to a
Storage pool or
a filesystem
mmcrfs sets the Storage Pool the
NSDs will belong to, and if each NSD
will be Data, Metadata, or Data+MD. It
allocates the NSDs to the filesystem
![Page 13: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/13.jpg)
Metadata (MD) creation, important attributes
• Metadata characteristics are set at filesystem creation
o Blocksize for MD, can’t be changed later
• Optionally different to Data blocksize
o Inode size, can’t be changed later
• Sizes: 512 bytes, 1K, 4K
• inodes can be pre-allocated and can be changed or extended later
o Not all MD space is used for inodes!!!!
o Not all MD space is used for inodes!!!!
• …so do not pre-allocate all of MD for inodes
• …leave room in MD for Directory Blocks, Indirect Blocks etc
![Page 14: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/14.jpg)
Where does Metadata live? Example #1
Spectrum ScaleCluster
System pool
NSD#1
Data+MD
Filesystem 1
![Page 15: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/15.jpg)
Where does Metadata live? Example
Spectrum ScaleCluster
System pool
NSD#1
MD
Filesystem 1
Data1 pool
NSD#2
Data
![Page 16: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/16.jpg)
fs1Blocksizes:Data = 1 MiBMD = 64 KiB
Different MD and Data blocksizes
System pool
Data1 pool
NSD
t111
Data
Data2 pool
NSD
t121
Data
NSD
t110
MD
NSD
t120
MD
fs2Blocksizes:Data = 4 MiBMD = 256 KiB
System pool
V5.0.2 onwards: small variable size subblocks = typically no reason to choose separate blocksizes
![Page 17: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/17.jpg)
Basic NSD Layout: divided into fixed size blocks
17
…
…Block 0
Block 1
…Block 2
…Block 3
…
• Blocksizes for Data and MD chosen at file system
creation time- can’t be changed later
• Full block: largest contiguously allocated unit
• Subblock: Part of a block, smallest allocatable unit:
– 1/1024 to 1/32 of a block*, depending on the filesystem blocksize
• And whether the Metadata and Data are in separate storage pools
– Data block fragments (one or more contiguous subblocks)
– Indirect blocks, directory blocks, …
• Choices: 64k, 128k, 256k, 512k, 1M, 2M, 4M, 8M, 16M
– Try to set as multiple of RAID stripe size for best write perf
– ESS GNR vdisk track size must match filesystem blocksize
• 3-way, 4-way replication: 256k, 512k, 1M, 2M
• 8+2P, 8+3P: 512k, 1M, 2M, 4M, 8M, 16M
*Before V5.0.2 a subblock was always 1/32 of a block
![Page 18: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/18.jpg)
IBM Storage & SDISub-blocks in Spectrum Scale V5.0.2 onwards
18
Fit 512 8k files in a single
4MB block!New
Default
![Page 19: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/19.jpg)
What is in the Filesystem Metadata zoo?
• Metadata capacity allocated as needed to:
o inodes: for files and directories
o Indirect blocks
o Directory blocks
o Extended Attribute block
o etc.
All MD information is
ultimately held in “files”
in MD space.
Some are fixed size,
some are extendable.
Some are accessed
using normal file access
routines, others using
special code
![Page 20: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/20.jpg)
Other Metadata stuff- “low level files”
• Files, not visible to users, special read/write and locking code
o Log files
o Fileset metadata file, policy file
o Block allocation map
• A bitmap of all the sub-blocks in the filesystem, 32 to 1024 bits per block (TBC!)
• Organised into n equal sized “regions” which contain some blocks from each
NSD
▪ This is what mmcrfs -n NumNodes does
▪ Node “owns” a region and independently does striped allocation across all disks
▪ Filesystem manager allocates a region to a node dynamically
o inode allocation map
• Similar to block allocation map, but keeps track of inodes in use
![Page 21: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/21.jpg)
inodes
• Fixed size “chunks” which hold some (not all!) filesystem metadata
o Allocated on System storage pool (NSDs)- metadataOnly or dataAndMetadata
o 512, 1024, or 4096 Bytes each: set at filesystem creation, can’t change later
o Held in one invisible inode file, extended as required, replicated if policy is set
• inodes are used for Data or Directory use, and can contain:
o Disk block pointers = Disk Addresses = DAs
• ..or pointers to Indirect blocks which then point to Data blocks
o File data (for very small files)
o Directory entries, or pointers to Directory blocks
o Ext Attributes (XATTRS), or pointer to EA blocks
![Page 22: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/22.jpg)
File inode, with Data in inode
inode header
File Data
Data in inode
inode header
Data + EAs
in inode
Extended
Attributes
File Data
![Page 23: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/23.jpg)
inode header
Data NSD
Extended
Attributes
File Data in blocks
& sub-blocks
File inodes containing DAs (Disk Addresses) = pointers
DA (pointer)
DA (pointer)
DA (pointer)
Data pointers
+ EAs in inode
inode header
Data + EAs
in inode
Extended
Attributes
File Data
File data too
large to fit
inside inode
![Page 24: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/24.jpg)
File inode, with Indirect Block containing DAs
24
NSD containing MD
(Data+MD, or MD-only)
inode header
inode pointer to
indirect block
DA (pointer)
File Data in blocks
& sub-blocksindirect block
DA (pointer)
DA (pointer)Data NSD
Data NSD
DA (pointer)
DA (pointer)
DA (pointer)
![Page 25: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/25.jpg)
Data and MD replication: maximum replicas
• Max replication settings for Data and Metadata
o Max can only be set at filesystem creation (mmcrfs)
o Max replicas for MD has little effect on MD capacity used, until the default is changed to >1 (TBC!)
• And mmrestripefs -R is run
o Max replicas for Data, multiplies the MD capacity used
• Reserves space in MD for the replicas even if no files are replicated!
• Default replicas of MD and Data
o Set at filesystem creation, can change later
• mmchfs … -m DefaultMetadataReplicas … -r DefaultDataReplicas
o Number of data replicas does not have to be the same for all files in a filesystem
• You can change from data replication on a file-by-file basis using policies or mmchattr
![Page 26: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/26.jpg)
Replication (animation)
inode
indirect blocks
…disk blocks & sub-blocks
Max Data replicas = 1, current replica setting for file = 1
![Page 27: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/27.jpg)
EM
PT
Y
EM
PT
Y
EM
PT
Y
EM
PT
Y
EM
PT
Y
EM
PT
Y
Replication (animation)
inode
indirect blocks
…disk blocks & sub-blocks
Max Data replicas = 2, current replica setting for file = 1
![Page 28: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/28.jpg)
Replication (animation)
inode
indirect blocks
…disk blocks & sub-blocks
Max Data replicas = 2, current replica setting for file = 2
![Page 29: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/29.jpg)
EA blocks
• If EAs don’t fit into inode
• inode has a single pointer to EA block
o Max 64KiB
![Page 30: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/30.jpg)
Critical file size
• File larger than a certain size “spills” into an indirect block
o Significantly increases MD capacity used
• Indirect blocks are large compared to inodes, 32K vs 4K
o That’s 8x more expensive MD capacity if a file is bigger than a certain “critical” size
• Critical file size is based on inode size, blocksize, default # of replicas, EAs/XATTRs
• Critical file size is 2.8 GBytes (approx.), tested with a V5.0.2 filesystem on ESS
o 4K inodes, 1 MiB MD blocksize, 16 MiB Data blocksize, 2 Data replicas max, no Eas
o For 1 Data replica max, would be approx. 5.6 Gbytes
o For 3 Data replicas max, would be approx. 5.6 Gbytes
NSD containing MD
(Data+MD, or MD-only)
inode header
inode pointer to
indirect block
DA (pointer)
File Data in blocks
& sub-blocks
indirect block
DA (pointer)
DA (pointer)Data NSD
Data NSD
DA (pointer)
DA (pointer)
DA (pointer)
inode header
Data NSD
Extended
Attributes
File Data in blocks
& sub-blocks
DA (pointer)
DA (pointer)
DA (pointer)
Data pointers
+ EAs in inode
![Page 31: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/31.jpg)
MD Space for Directories
Directory inodes can contain file or dir name info to save space
• Embeds file/dir name + pointer to file/dir inode in directory inode
• 512 byte directory inode can contain 12 x 32 byte file entries
o (512 -128)/32 = 12
o For 1 KiB inode: (1024 – 128)/32 = 24 file entries
• The average directory size ranges from 10-16 entries, so this is useful
o 1st 32 byte entry allows names up to 20 bytes (32 – 12 byte header)
o Longer names take up additional 16 byte blocks
31
![Page 32: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/32.jpg)
Block Allocation Map
• Can take up a large % of MD capacity
• Logically a large bit vector with 32 to 1024 bits for each disk block
(one bit per subblock) (TBC!)
• Divided into a fixed number of n equal size “regions”
o Each region contains bits representing blocks across all disks
o Different nodes can use different regions and still stripe across all
disks (many more regions than nodes)
32
![Page 33: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/33.jpg)
Inode Allocation Map• Keeps track of the status (free/in-use) of all inodes in the inode file
• Same idea as block allocation map:
o Logically a large bit vector stored in a metadata file
o Divided into regions, so different nodes can work with different regions
o A region can consist of multiple segments to allow growing the inode allocation map when more
inodes are added to the file system (inode file expansion)
• Two bits per inode to encode four states:
o Free
o In-use
o Being-created: short-term transitional state during file creation
o Being-deleted: transitional state during file deletion
• Being-deleted state may be longer lasting for files that were deleted but are still open:
o Directory entry is gone, but inode, indirect blocks and data blocks cannot be freed until the file is close
by everyone that has it open (“destroy-on-last-close”).
33
![Page 34: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/34.jpg)
Snapshots and MD
• NOT Just # of snapshots x % of data changed!
• Data change % is small, but might touch many files & blocks
• Many data blocks changed Many inodes & indirect blocks changed
▪ All changed inodes and indirect blocks have to be copied!
➢ Single inode changed whole “block” of inode file is copied
➢ 16 MiB blocksize filesystem = 0.5 MiB indirect block
• Filename changes Directory inodes & directory blocks changed
o Whole directory copied, even for a single change
There is no way to accurately predict the effect of
Snapshots on MD capacity, except by observation!
![Page 35: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/35.jpg)
When do I need Flash/SSD for MD?• Bottom Line: When pagepool in memory or LROC regularly does not contain the
needed inode entries, retrieving metadata from Flash storage may make sense
• 1 x shared storage pool for Data and MD, h/w does not provide enough I/O for both
simultaneously
o Or MD and Data NSDs are on separate pools and LUNs, but share physical disk drives!
• This can happen if storage is managed by a team who prevent visibility of physical LUNs used
o Or MD and Data workloads interfere with each other’s performance
• Sometimes they don’t interfere!
• Think of a job that just does creates, then reads, then writes, then deletes
35
![Page 36: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/36.jpg)
Workloads which might need Flash/SSD for MD
o Intensive use of the Spectrum Scale policy engine
• Spectrum Protect Incremental backups (mmbackup),
• ILM/tiering: diskFlash/SSD, disktape, etc.
o Snapshots- deletes of a “middle” snapshot in a series
• Reconciliation to later snapshots is MD intensive
o Lots of “find”, or “create file”, or “delete file” tasks, esp from OS
o Work on small files with data in inodes
36
☺
![Page 37: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/37.jpg)
Flash/SSD: “I am tired of your constant writing!”
• Run out of pre-cleared Flash areas = write performance drops
o “Pre-conditioning curve”, present in SSDs and most Flash
• >50% drop in IOps after 0.5-1 hour of continuous operation
▪ at 70% Read 30% Write (8K)
o Can delay drop in performance by designing SSDs/Flash with more spare
capacity
• = More $$$, less usable capacity = Enterprise drives not Workstation
• Note that IBM Flashsystem contains unique features to reduce “write fatigue”
o Some technologies do not need pre-conditioning no “write cliff”
▪ As at July 2019 this seems to only be Intel/Micron “3D Xpoint” Optane, approx. 10x cost
37
Source: storagereview.com review of Intel 2TB P3700 SSD, Dec 2015Note: this is a good result!
![Page 38: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/38.jpg)
MD Recommendations
38
![Page 39: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/39.jpg)
Recommendations
• Understand the data!
o How many files and of what sizes?
o Which are the file sizes which might take up lots of MD space?
• Too big for data to live inside inode… so which size inodes
make sense?
• Too big for one inode to point to the data, so need 32K indirect
blocks
• Understand the workload!
• Replication enabled? How many?, snapshots? ILM? Xattrs/EAs?
• Data and Workloads can change!!!
![Page 40: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/40.jpg)
General recommendations
• Do not pre-allocate more than 25% of MD-only pool capacity to inodes
o Pre-allocation of inodes is a performance strategy
o Need room for indirect blocks + directory blocks + EA blocks + …
o For combined Data and Metadata, pre-allocate small %
![Page 41: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/41.jpg)
General recommendations V5.0.2+ filesystem format
o Use same blocksize for MD and Data, actual subblocks are based
on minimum of MD and Data blocksizes (TBC!)
o Even large blocksizes now have small subblocks
o Larger blocksizes are good for sequential performance
• But watch out for Read-Modify-Write of small blocks being randomly
written
o Use separate MD and Data pools for data that is treated differently
by policy
• E.g. MD default replicas=2 going to 2 separate Failure Groups one per
site, Data default replicas=1 with only some data at both sites (thanks to
Simon Thompson at Uni of Birmingham for this example)
![Page 42: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/42.jpg)
MD Recommendations: LROC and HAWC
o LROC=read caching in SSD on Client node
• Can set to cache only Data, only inodes, only Dir blocks, or
combination
• mmchconfig options
o HAWC=write caching in SSD on Client nodes (replicated) or
central Flash/SSD device
• Best for lots of small block writes, so probably good for MD
• Section on HAWC in manuals & KnowledgeCenter
![Page 43: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/43.jpg)
Recommendations: RAID
• Remember:
o Scale tries to coalesce writes & read-ahead reads
• Hates I/Os that are separated by time and space
o Flash storage is not infinite speed!
• Storage RAID recommendations:
o I/O penalty for RAID5/6 write sizes < RAID stripe size!
o RAID1 is better for MD, if there will be a lot of writes
• RAID5 and RAID6 impose up to 6x I/O penalty!
▪ https://theithollow.com/2012/03/21/understanding-raid-penalty/
• Filesystem blocksize should be a multiple of RAID stripe
size
o Or can suffer read-modify-write for random I/O workloads
![Page 44: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/44.jpg)
MD recommendations
• Do not pre-allocate all of MD capacity for inodes!!!
• Do not always use 3% of Data rule-of-thumb for MD allocation
o Safe in most cases, but… can result in a waste of expensive resources
• e.g. 3% of 2 PB is not required to support 20,000 large files for a TSM storage pool
• Try to get a histogram of file numbers and capacities
• Don’t panic! About the size of Indirect Blocks… unless you have a lot of files which use them!
![Page 45: Spectrum Scale: Metadata & Concepts Part II](https://reader030.vdocuments.mx/reader030/viewer/2022012707/61a8464d1a17457c02168bde/html5/thumbnails/45.jpg)
45
The End of the Tail