Performance Bottlenecks for Metadata Workload in Gluster with Poornima Gurusiddaiah, Rajesh Joseph
TRANSCRIPT
Metadata Performance Bottlenecks in Gluster
Poornima G, Rajesh Joseph
Agenda
Existing performance xlators
Metadata operations: create files, mkdir recursive, ls -lR, rm files, rmdir recursive, mv files, small file create / PUT, small file read / GET
Performance Xlators
Write-behind and flush-behind, io-cache, read-ahead, quick-read, md-cache, readdir-ahead, open-behind, symlink-cache
Test Setup and Tools
CentOS 7.2 clients and servers
3 FUSE clients simultaneously performing operations on their home directories
Latest master, with the below volume options:
cluster.lookup-optimize on, readdir-optimize on, consistent-metadata no
Performance metrics collected using volume profile
Workload is a set of simple scripts (see the sketch below).
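As a rough illustration, the setup can be driven through the standard gluster CLI. The sketch below is assumption-laden: the volume name "testvol" is hypothetical, and the abbreviated options on the slide are assumed to expand to cluster.readdir-optimize and cluster.consistent-metadata.

# Minimal sketch (Python 3) of applying the volume options and collecting
# profile data via the gluster CLI. Volume name and expanded option names
# are assumptions, not taken from the talk.
import subprocess

VOLUME = "testvol"  # hypothetical volume name

def gluster(*args):
    """Run a gluster CLI command and return its output as text."""
    return subprocess.check_output(("gluster",) + args).decode()

gluster("volume", "set", VOLUME, "cluster.lookup-optimize", "on")
gluster("volume", "set", VOLUME, "cluster.readdir-optimize", "on")
gluster("volume", "set", VOLUME, "cluster.consistent-metadata", "no")

gluster("volume", "profile", VOLUME, "start")
# ... run the workload scripts from the FUSE clients here ...
print(gluster("volume", "profile", VOLUME, "info"))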
#Create Files
STAT/LOOKUP - 13% (negative lookups ~12%)
CREATE - 12%
REMOVEXATTR - 13%
SETATTR - 8%
INODELK - 25% (removexattr + setattr)
ENTRYLK - 10%
FLUSH - 10%
READDIRP - 1%
Create Files
Negative lookups – 11-12%
Removexattr (security.ima) – 23%
A lease on the parent dir and the newly created files can eliminate sending inodelk (25%) and entrylk (10%) to the bricks, but only benefits single-client use cases.
Create is not a function of the number of bricks.
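For reference, the create-files numbers come from simple scripts run on each FUSE client. Below is a minimal Python sketch of such a script; the mount point, per-client directory and file count are hypothetical, chosen only to illustrate the kind of load.

# Minimal sketch of a "create files" workload, assuming the volume is
# FUSE-mounted at /mnt/gluster and each client uses its own home directory.
import os

HOME = "/mnt/gluster/client1"   # hypothetical per-client home directory
NUM_FILES = 10000               # hypothetical file count

os.makedirs(HOME, exist_ok=True)
for i in range(NUM_FILES):
    # Each create appears in the profile above as a negative LOOKUP, a CREATE,
    # lock FOPs (INODELK/ENTRYLK) and a FLUSH when the fd is closed.
    open(os.path.join(HOME, "file-%06d" % i), "w").close()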
#mkdir recursive
STAT/LOOKUP - 13% (negative lookups ~12%)
MKDIR - 10%
SETXATTR - 9%
INODELK - 43% (SETXATTR and MKDIR)
ENTRYLK - 21%
mkdir recursive
Negative lookups ~10%, plus cache misses
In DHT, can setxattr be compounded with mkdir? That would save the inodelks and the separate setxattr.
A lease on the parent dir and the newly created directories can eliminate sending inodelk and entrylk to the bricks, but only benefits single-client use cases.
It is not a function of the number of DHT subvols, as mkdir is sent first to the hashed subvol, followed by parallel mkdir on the rest.
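A minimal sketch of the kind of recursive-mkdir workload behind the profile above; the root path, depth and fan-out are hypothetical.

# Minimal sketch of a recursive mkdir workload over a hypothetical FUSE mount.
import os

ROOT = "/mnt/gluster/client1/tree"   # hypothetical per-client directory
DEPTH = 4                            # hypothetical tree depth
FANOUT = 8                           # hypothetical directories per level

def make_tree(path, depth):
    # Each new directory shows up in the profile above as LOOKUP + MKDIR,
    # with SETXATTR for the DHT layout and the associated INODELK/ENTRYLK.
    os.makedirs(path, exist_ok=True)
    if depth == 0:
        return
    for i in range(FANOUT):
        make_tree(os.path.join(path, "d%02d" % i), depth - 1)

make_tree(ROOT, DEPTH)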
#ls -lR
STAT/LOOKUP - 34% (~20% on dirs, ~10% cache misses)
OPENDIR - 19%
READDIRP - 47%
ls -lR
Readdirp is sequential and a function of the number of DHT subvols, hence not scalable. E.g. reading 800 empty directories on a 10*1 volume = 16000 sequential readdirp calls; on a 30*1 volume = 48000 sequential readdirp calls (see the call-count sketch below).
Readdir-ahead only masks part of this latency, hence parallel readdirp in DHT.
Eliminate the EOD-detection readdirp call
DHT – do not NULL the readdirp entries of a directory
Md-cache – eliminate revalidate lookups
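The 16000 and 48000 figures are consistent with assuming two readdirp calls per directory per subvol (one call that returns entries plus one end-of-directory detection call, the EOD call mentioned above). That factor of two is an assumption; a small sketch of the arithmetic:

# Sketch of the sequential readdirp call-count arithmetic. The factor of 2
# per directory per subvol (data call + EOD-detection call) is an assumption
# that matches the figures quoted in the talk.
CALLS_PER_DIR_PER_SUBVOL = 2

def sequential_readdirp_calls(num_dirs, num_subvols):
    return num_dirs * num_subvols * CALLS_PER_DIR_PER_SUBVOL

print(sequential_readdirp_calls(800, 10))  # 16000 on a 10*1 volume
print(sequential_readdirp_calls(800, 30))  # 48000 on a 30*1 volume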
#rm files in one large dir
STAT/LOOKUP - 2%
ENTRYLK - 36%
UNLINK - 19%
READDIRP - 44%
Delete files
Parallel readdirp may not improve performance drastically for large directories.
A lease on the parent directory can make entrylk client-side only, but helps only in the single-client use case.
#rmdir small directories recursively
STAT/LOOKUP - 12%
OPENDIR - 9%
INODELK - 55% (rmdir sequential inodelk)
RMDIR - 3%
ENTRYLK - 5%
READDIRP - 18%
rmdir Recursive
Rmdir is a function of the number of DHT subvolumes, hence not scalable. Inodelks are taken sequentially on all DHT subvols before the rmdir (the rmdir and the inodelk unlock are parallel). Work is in progress to remove the inodelk.
A lease on the parent directory can make inodelk and entrylk client-side only, but helps only in the single-client use case.
Revalidate lookups – md-cache
Small file create
STAT/LOOKUP - 20% (negative LOOKUP - 15-16%)
ENTRYLK - 21%
CREATE - 10%
WRITE - 1%
FLUSH - 50% (FINODELK + FXATTROP + WRITE + FLUSH)
Small file Create / PUT
ENTRYLK and INODELK can be made client-side only by using leases, but that helps only the single-client use case.
Compound fop: XATTROP + WRITE
A PUT fop: only gfapi clients (Swift, coreutils) would be able to leverage it. (A workload sketch follows.)
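A minimal sketch of a small-file create/PUT style workload over the FUSE mount; the directory, file count and payload size are hypothetical. The gfapi-based variants mentioned above would replace the open/write/close with library calls and are not shown here.

# Minimal sketch of a small-file create/PUT workload: create a file, write a
# small payload, close (the close triggers the FLUSH seen in the profile).
import os

HOME = "/mnt/gluster/client1/smallfiles"   # hypothetical per-client directory
NUM_FILES = 10000                          # hypothetical file count
PAYLOAD = b"x" * 4096                      # hypothetical 4 KB "small file"

os.makedirs(HOME, exist_ok=True)
for i in range(NUM_FILES):
    with open(os.path.join(HOME, "obj-%06d" % i), "wb") as f:
        f.write(PAYLOAD)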
Small file Read
STAT/LOOKUP - 95%
OPEN - 2.5%
READ - 1.2%
FLUSH - 0.5%
Conclusion
We have not yet exhausted the possible performance improvements. The performance of these metadata operations can be improved greatly, without compromising on consistency.