Performance Bottlenecks for Metadata Workload in Gluster with Poornima Gurusiddaiah, Rajesh Joseph
TRANSCRIPT
Metadata Performance Bottlenecks in Gluster
Poornima G, Rajesh Joseph
Agenda
Existing performance xlators
Metadata operations: create files, mkdir recursive, ls -lR, rm files, rmdir recursive, mv files, small file create / PUT, small file read / GET
Performance Xlators
Write-behind and flush-behind, io-cache, read-ahead, quick-read, md-cache, readdir-ahead, open-behind, symlink-cache
Test Setup and Tools
CentOS 7.2 clients and servers
3 FUSE clients simultaneously performing operations on their home directories
Latest master, with the below volume options:
cluster.lookup-optimize on, readdir-optimize on, consistent-metadata no
Performance metrics collected using volume profile
Workload is a set of simple scripts (see the sketch below).
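As a rough illustration, the setup can be driven through the standard gluster CLI. The sketch below is assumption-laden: the volume name "testvol" is hypothetical, and the abbreviated options on the slide are assumed to expand to cluster.readdir-optimize and cluster.consistent-metadata.

# Minimal sketch (Python 3) of applying the volume options and collecting
# profile data via the gluster CLI. Volume name and expanded option names
# are assumptions, not taken from the talk.
import subprocess

VOLUME = "testvol"  # hypothetical volume name

def gluster(*args):
    """Run a gluster CLI command and return its output as text."""
    return subprocess.check_output(("gluster",) + args).decode()

gluster("volume", "set", VOLUME, "cluster.lookup-optimize", "on")
gluster("volume", "set", VOLUME, "cluster.readdir-optimize", "on")
gluster("volume", "set", VOLUME, "cluster.consistent-metadata", "no")

gluster("volume", "profile", VOLUME, "start")
# ... run the workload scripts from the FUSE clients here ...
print(gluster("volume", "profile", VOLUME, "info"))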
#Create Files
STAT/LOOKUP - 13% (negative lookups ~12%)
CREATE - 12%
REMOVEXATTR - 13%
SETATTR - 8%
INODELK - 25% (removexattr + setattr)
ENTRYLK - 10%
FLUSH - 10%
READDIRP - 1%
Create Files
Negative lookups – 11-12%
Removexattr (security.ima) – 23%
A lease on the parent dir and the newly created files can eliminate sending inodelk (25%) and entrylk (10%) to the bricks, but only benefits single-client use cases.
Create is not a function of the number of bricks.
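For reference, the create-files numbers come from simple scripts run on each FUSE client. Below is a minimal Python sketch of such a script; the mount point, per-client directory and file count are hypothetical, chosen only to illustrate the kind of load.

# Minimal sketch of a "create files" workload, assuming the volume is
# FUSE-mounted at /mnt/gluster and each client uses its own home directory.
import os

HOME = "/mnt/gluster/client1"   # hypothetical per-client home directory
NUM_FILES = 10000               # hypothetical file count

os.makedirs(HOME, exist_ok=True)
for i in range(NUM_FILES):
    # Each create appears in the profile above as a negative LOOKUP, a CREATE,
    # lock FOPs (INODELK/ENTRYLK) and a FLUSH when the fd is closed.
    open(os.path.join(HOME, "file-%06d" % i), "w").close()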
#mkdir recursive
STAT/LOOKUP - 13% (negative lookups ~12%)
MKDIR - 10%
SETXATTR - 9%
INODELK - 43% (SETXATTR and MKDIR)
ENTRYLK - 21%
mkdir recursive
Negative lookups ~10%, plus cache misses
In DHT, can setxattr be compounded with mkdir? That would save the inodelks and the separate setxattr.
A lease on the parent dir and the newly created directories can eliminate sending inodelk and entrylk to the bricks, but only benefits single-client use cases.
It is not a function of the number of DHT subvols, as mkdir is sent first to the hashed subvol, followed by parallel mkdir on the rest.
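A minimal sketch of the kind of recursive-mkdir workload behind the profile above; the root path, depth and fan-out are hypothetical.

# Minimal sketch of a recursive mkdir workload over a hypothetical FUSE mount.
import os

ROOT = "/mnt/gluster/client1/tree"   # hypothetical per-client directory
DEPTH = 4                            # hypothetical tree depth
FANOUT = 8                           # hypothetical directories per level

def make_tree(path, depth):
    # Each new directory shows up in the profile above as LOOKUP + MKDIR,
    # with SETXATTR for the DHT layout and the associated INODELK/ENTRYLK.
    os.makedirs(path, exist_ok=True)
    if depth == 0:
        return
    for i in range(FANOUT):
        make_tree(os.path.join(path, "d%02d" % i), depth - 1)

make_tree(ROOT, DEPTH)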
#ls -lR
STAT/LOOKUP - 34% (~20% on dirs, ~10% cache misses)
OPENDIR - 19%
READDIRP - 47%
ls -lR
Readdirp is sequential and a function of the number of DHT subvols, hence not scalable. E.g. reading 800 empty directories on a 10*1 volume = 16000 sequential readdirp calls; on a 30*1 volume = 48000 sequential readdirp calls (see the call-count sketch below).
Readdir-ahead only masks part of this latency, hence parallel readdirp in DHT.
Eliminate the EOD-detection readdirp call
DHT – do not NULL the readdirp entries of a directory
Md-cache – eliminate revalidate lookups
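The 16000 and 48000 figures are consistent with assuming two readdirp calls per directory per subvol (one call that returns entries plus one end-of-directory detection call, the EOD call mentioned above). That factor of two is an assumption; a small sketch of the arithmetic:

# Sketch of the sequential readdirp call-count arithmetic. The factor of 2
# per directory per subvol (data call + EOD-detection call) is an assumption
# that matches the figures quoted in the talk.
CALLS_PER_DIR_PER_SUBVOL = 2

def sequential_readdirp_calls(num_dirs, num_subvols):
    return num_dirs * num_subvols * CALLS_PER_DIR_PER_SUBVOL

print(sequential_readdirp_calls(800, 10))  # 16000 on a 10*1 volume
print(sequential_readdirp_calls(800, 30))  # 48000 on a 30*1 volume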
#rm files in one large dir
STAT/LOOKUP - 2%
ENTRYLK - 36%
UNLINK - 19%
READDIRP - 44%
Delete files
Parallel readdirp may not improve performance drastically for large directories.
A lease on the parent directory can make entrylk client-side only, but helps only in the single-client use case.
#rmdir small directories recursively
STAT/LOOKUP - 12%
OPENDIR - 9%
INODELK - 55% (rmdir sequential inodelk)
RMDIR - 3%
ENTRYLK - 5%
READDIRP - 18%
rmdir Recursive
Rmdir is a function of the number of DHT subvolumes, hence not scalable. Inodelks are taken sequentially on all DHT subvols before the rmdir (the rmdir and the inodelk unlock are parallel). Work is in progress to remove the inodelk.
A lease on the parent directory can make inodelk and entrylk client-side only, but helps only in the single-client use case.
Revalidate lookups – md-cache
Small file create
STAT/LOOKUP - 20% (negative LOOKUP - 15-16%)
ENTRYLK - 21%
CREATE - 10%
WRITE - 1%
FLUSH - 50% (FINODELK + FXATTROP + WRITE + FLUSH)
Small file Create / PUT
ENTRYLK and INODELK can be made client-side only by using leases, but that helps only the single-client use case.
Compound fop: XATTROP + WRITE
A PUT fop: only gfapi clients (Swift, coreutils) would be able to leverage it. (A workload sketch follows.)
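A minimal sketch of a small-file create/PUT style workload over the FUSE mount; the directory, file count and payload size are hypothetical. The gfapi-based variants mentioned above would replace the open/write/close with library calls and are not shown here.

# Minimal sketch of a small-file create/PUT workload: create a file, write a
# small payload, close (the close triggers the FLUSH seen in the profile).
import os

HOME = "/mnt/gluster/client1/smallfiles"   # hypothetical per-client directory
NUM_FILES = 10000                          # hypothetical file count
PAYLOAD = b"x" * 4096                      # hypothetical 4 KB "small file"

os.makedirs(HOME, exist_ok=True)
for i in range(NUM_FILES):
    with open(os.path.join(HOME, "obj-%06d" % i), "wb") as f:
        f.write(PAYLOAD)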
Small file Read
STAT/LOOKUP - 95%
OPEN - 2.5%
READ - 1.2%
FLUSH - 0.5%
Conclusion
We have not yet exhausted the possible performance improvements. The performance of these metadata operations can be improved greatly, without compromising on consistency.