ddn.com ©2012 DataDirect Networks. All Rights Reserved.

Lustre performance monitoring and trouble shooting

March, 2015 Patrick Fitzhenry and Ian Costello


Agenda

► EXAScaler (Lustre) Monitoring
  • NCI test kit hardware details
  • What is it? How does it work?
  • Demo

► Lustre troubleshooting
  • General points
  • 4 examples


Introduction

► Patrick Fitzhenry
  • Director, Technical Services & Support, South Asia & ANZ

► Ian Costello
  • Senior Application Support Engineer


Lustre Performance Monitoring


NCI test kit hardware details

► 20 x Fujitsu compute nodes
  • Dual E5-2670, 2.60GHz processors, 32GB
  • Single Rail FDR

► SFA12KX-40
  • 400 x 3TB NL-SAS
  • 4 x OSSs:
    o Dual E5-2670
    o 128GB
    o CentOS 6.4

► Metadata
  • 12 x 600GB 15K SAS
  • 2 x MDs:
    o Dual E5-2670
    o 128GB
    o CentOS 6.4


Lustre Monitoring Background

► DDN development project
► Uses information from Linux's /proc
► Goals:
  • Collect near real-time data (at intervals as short as 1 sec) and visualize it
  • All Lustre statistics can be collected
  • Support Lustre 1.8.x, 2.x and beyond
  • Application-aware monitoring (job stats)
  • Administrators can build custom graphs in the web browser
  • Configurable, intuitive dashboard
  • Scalable and lightweight, with no performance impact
  • Helpful for debugging and I/O analysis

► Lustre is a distributed, scalable file system. The monitoring/analysis tool must be aware of this.

► A Lustre monitoring tool helps in understanding current and past file system behavior and helps prevent performance slowdowns.


ExaScaler Monitoring

[Figure: architecture diagram. OSS/MDS nodes and Lustre clients run collectd with the DDN monitoring plugin; a monitoring server runs collectd with the Graphite plugin and Graphite. Transport: UDP (TCP)/IP based small text message transfer.]

• Lightweight
• Near real-time
• Massive scale
• Customizable

• File system, OST pool, OST/MDT stats, etc.
• Job ID, UID/GID, aggregation of an application's stats, etc.
• Archive of data by policy
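The "small text message" transport can be illustrated with Graphite's plaintext protocol, in which each data point is one line of the form `path value timestamp`. The following is a minimal sketch only: the metric path is hypothetical (the real naming scheme is defined by the plugin), and the `collectd.` prefix and port 2003 are borrowed from the commented-out `write_graphite` section of the example configuration.

```python
import socket
import time


def graphite_line(path, value, timestamp=None, prefix="collectd."):
    """Format one data point as a Graphite plaintext-protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{prefix}{path} {value} {int(timestamp)}\n"


def send_udp(line, host, port=2003):
    """Send one plaintext line to a Carbon listener over UDP (not called here)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric path, used for illustration only.
line = graphite_line("tmds1.lustre.md_stats_open", 1234, timestamp=1408263676)
print(line, end="")
```

Each message is a few dozen bytes, which is why this style of transfer stays lightweight even with many nodes reporting every few seconds.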


OpenTSDB Architecture

► The end-to-end OpenTSDB workflow: [Figure: OpenTSDB workflow diagram]


A new Lustre plugin for collectd

► Using collectd (http://collectd.org)
  • Runs on many enterprise/HPC systems
  • Written in C for performance and portability
  • Includes optimizations and features to handle hundreds of thousands of data sets
  • Comes with over 90 plugins, which range from standard cases to very specialized and advanced topics
  • Provides powerful networking features and is extensible in numerous ways
  • Actively developed, supported and well documented

► The Lustre plugin extends collectd to collect Lustre statistics while inheriting its advantages

► It is possible to port the Lustre plugin to a better framework if necessary


XML definition of Lustre's /proc information

► Tree-structured descriptions of how to collect statistics from Lustre proc entries

► Modular
  • A hierarchical framework comprising a core logic layer (the Lustre plugin) and a statistics definition layer (XML files)
  • Extendable without updating any source code of the Lustre plugin
  • Easy to keep the core logic stable

► Centralized
  • A single XML file for all definitions of Lustre data collection
  • No need to maintain masses of error-prone scripts
  • Easy to verify correctness
  • Easy to support multiple versions and to update for new versions of Lustre


XML definition of Lustre's /proc information (continued)

► Precise
  • Strict rules using regular expressions can be configured to filter out everything except exactly what we want
  • Locations to save collected statistics are explicitly defined and configurable

► Powerful
  • Any statistic can be collected as long as there is a proper regular expression to match it

► Extendable
  • Any newly wanted statistic can be collected quickly by adding a definition to the XML file

► Efficient
  • No matter how many definitions are predefined in the XML file, only the definitions in use are traversed at run time.
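The idea of keeping statistics definitions as data, separate from the core logic, can be sketched in a few lines. This is an illustration of the approach only, not the plugin's XML schema: the `DEFINITIONS` table, its regular expressions and the `collect` helper are all hypothetical, and the sample text merely imitates the layout of a Lustre stats file.

```python
import re

# Hypothetical definitions: metric name -> regular expression with one capture group.
# They stand in for the XML definition layer and are not the plugin's real schema.
DEFINITIONS = {
    "md_stats_open": r"^open\s+(\d+)\s+samples",
    "md_stats_close": r"^close\s+(\d+)\s+samples",
    "md_stats_mkdir": r"^mkdir\s+(\d+)\s+samples",
}


def collect(text, enabled):
    """Apply only the enabled definitions to the text of one stats file."""
    results = {}
    for name in enabled:
        match = re.search(DEFINITIONS[name], text, flags=re.MULTILINE)
        if match:
            results[name] = int(match.group(1))
    return results


sample = """open 90267 samples [regs]
close 90239 samples [regs]
"""
stats = collect(sample, enabled=["md_stats_open", "md_stats_close"])
print(stats)
```

Adding a new statistic means adding one entry to the table; the `collect` function is untouched, and definitions that are not enabled cost nothing at run time.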


Example of a collectd.conf

This is an example of /etc/collectd.conf from an MDS (tmds1):

[root@tmds1 ~]# cat /etc/collectd.conf
#
# collectd.conf for DDN LustreMon
#

Interval 5

WriteQueueLimitHigh 1000000
WriteQueueLimitLow 800000

LoadPlugin match_regex

LoadPlugin syslog
<Plugin syslog>
    #LogLevel info
    LogLevel err
</Plugin>

LoadPlugin lustre
<Plugin "lustre">
    <Common>
        DefinitionFile "/etc/lustre-ieel-2.5_definition.xml"
    </Common>
    # OST stats
    # <Item>
    #     Type "ost_kbytestotal"
    #     Query_interval 300
    # </Item>
    # <Item>
    #     Type "ost_kbytesfree"
    #     Query_interval 300
    # </Item>
    <Item>
        Type "ost_stats_write"
    </Item>
    <Item>
        Type "ost_stats_read"
    </Item>
    # MDT stats
    # <Item>
    #     Type "mdt_filestotal"
    #     Query_interval 300
    # </Item>
    # <Item>
    #     Type "mdt_filesfree"
    #     Query_interval 300
    # </Item>
    <Item>
        Type "md_stats_open"
    </Item>
    <Item>
        Type "md_stats_close"
    </Item>
    <Item>
        Type "md_stats_mknod"
    </Item>
    <Item>
        Type "md_stats_unlink"
    </Item>
    <Item>
        Type "md_stats_mkdir"
    </Item>
    <Item>
        Type "md_stats_rmdir"
    </Item>
    <Item>
        Type "md_stats_rename"
    </Item>
    <Item>
        Type "md_stats_getattr"
    </Item>
    <Item>
        Type "md_stats_setattr"
    </Item>
    <Item>
        Type "md_stats_getxattr"
    </Item>
    <Item>
        Type "md_stats_setxattr"
    </Item>
    <Item>
        Type "md_stats_statfs"
    </Item>
    <Item>
        Type "md_stats_sync"
    </Item>
    <Item>
        Type "ost_jobstats"
        <Rule>
            Field "job_id"
        </Rule>
    </Item>
    <Item>
        Type "mdt_jobstats"
        <Rule>
            Field "job_id"
        </Rule>
    </Item>

    <ItemType>
        Type "mdt_jobstats"
        <ExtendedParse>
            # Parse the field job_id
            Field "job_id"
            # Match the pattern
            Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
            <ExtendedField>
                Index 1
                Name pbs_job_uid
            </ExtendedField>
            <ExtendedField>
                Index 2
                Name pbs_job_gid
            </ExtendedField>
            <ExtendedField>
                Index 3
                Name pbs_job_id
            </ExtendedField>
        </ExtendedParse>
        TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
    </ItemType>
    <ItemType>
        Type "ost_jobstats"
        <ExtendedParse>
            # Parse the field job_id
            Field "job_id"
            # Match the pattern
            Pattern "u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)"
            <ExtendedField>
                Index 1
                Name pbs_job_uid
            </ExtendedField>
            <ExtendedField>
                Index 2
                Name pbs_job_gid
            </ExtendedField>
            <ExtendedField>
                Index 3
                Name pbs_job_id
            </ExtendedField>
        </ExtendedParse>
        TsdbTags "pbs_job_uid=${extendfield:pbs_job_uid} pbs_job_gid=${extendfield:pbs_job_gid} pbs_job_id=${extendfield:pbs_job_id}"
    </ItemType>
</Plugin>

loadPlugin "write_tsdb"
<Plugin "write_tsdb">
    <Node>
        Host "10.10.108.33"
        Port "8500"
    </Node>
</Plugin>

#loadPlugin "write_graphite"
#<Plugin "write_graphite">
#  <Carbon>
#    Host "172.21.66.181"
#    Port "2003"
#    Prefix "collectd."
#    Protocol "udp"
#  </Carbon>
#</Plugin>
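To see what the `ExtendedParse` sections do with a `job_id`, the pattern can be tried outside collectd. The sketch below translates the POSIX class `[[:digit:]]` to `[0-9]` for Python's `re` module and maps capture groups to names the way the `Index`/`Name` pairs do. The job_id string is made up, and whether the plugin anchors the match or searches within the field is an assumption here.

```python
import re

# Python translation of the POSIX pattern from the configuration:
#   u([[:digit:]]+)[.]g([[:digit:]]+)[.]j([[:digit:]]+)
JOB_ID_PATTERN = re.compile(r"u([0-9]+)[.]g([0-9]+)[.]j([0-9]+)")

# Index -> Name, mirroring the <ExtendedField> blocks.
FIELDS = {1: "pbs_job_uid", 2: "pbs_job_gid", 3: "pbs_job_id"}


def parse_job_id(job_id):
    """Split a job_id into named fields, or return None if it does not match."""
    match = JOB_ID_PATTERN.search(job_id)
    if match is None:
        return None
    return {name: match.group(index) for index, name in FIELDS.items()}


def tsdb_tags(fields):
    """Render the fields as space-separated name=value tags."""
    return " ".join(f"{name}={value}" for name, value in fields.items())


fields = parse_job_id("u1000.g100.j424242")  # made-up job_id for illustration
print(tsdb_tags(fields))
```

With tags like these attached to every data point, a graph can be filtered by user, group or batch job without changing what is collected.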


Demo

► Show the OpenTSDB layout
► Show the Grafana layout
► Show adding an MDT-based stat, then update it with a filter on a job ID
► Show adding an OST-based stat


Troubleshooting Lustre


Process when Troubleshooting Lustre


Lustre debugging

► Lustre is a complex environment with many tightly coupled moving parts:
  • Storage (data, metadata)
  • OSS
  • MDS
  • Network
  • Lustre server
  • Lustre client
  • Operating systems

► The software resides in kernel space, which makes it more difficult to debug than user-space software.

► It is possible to debug Lustre
  • Lustre bugs do get resolved: search Jira (if the issue is in Lustre)
  • Many tools have been developed specifically for Lustre debugging.
  • The Lustre community is very active and provides strong support.


What to do when a Lustre issue occurs (1)

► Understand the problem
  • What is the failure type? (kernel crash / LBUG / system call failure / stuck process / incorrect result / unexpected behavior / performance regression)
  • Which nodes cause the problem?
    o Is it a server-side problem or a client-side problem?
    o Is it a problem limited to a single client?
    o Is it a metadata or data access problem?
  • How critical is the problem? The impacted services could be:
    o The whole system, e.g. crash or deadlock on MGS/MDS;
    o All of the services on a server, e.g. crash or deadlock on OSS;
    o A certain service of the whole system, e.g. quota failure on QMT/QSD;
    o All of the operations on the client(s), e.g. crash or deadlock on client.


What to do when a Lustre issue occurs (2)

► Find a simple and reliable reproduction method
  • Step 1: Confirm which program causes the bug;
  • Step 2: Write a simple program which can reproduce the problem repeatedly;
  • Step 3: Simplify the program as much as possible.
  • A simple and reliable reproduction method:
    o Simplifies the description of the issue and so helps other people understand it quickly;
    o Reduces the volume of collected logs and so reduces the time needed to analyze them;
    o Speeds up the confirmation of candidate fixes and so speeds up the fix process.
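A reproducer in the spirit of Steps 2 and 3 can be very small. The sketch below is a generic example rather than one taken from a real case: it creates and removes a file in a target directory over and over and reports the first attempt that fails, together with the errno name. The function name, file naming and attempt count are arbitrary choices.

```python
import errno
import os
import tempfile


def reproduce(directory, attempts=100):
    """Create and remove a file repeatedly; return (attempt, errno name) of the first failure."""
    for attempt in range(1, attempts + 1):
        path = os.path.join(directory, f"repro-{attempt}")
        try:
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
            os.close(fd)
            os.unlink(path)
        except OSError as exc:
            return attempt, errno.errorcode.get(exc.errno, str(exc.errno))
    return None  # no failure observed


# Run against a scratch directory here; on a real system, point it at the
# directory on the Lustre mount where the problem was seen.
with tempfile.TemporaryDirectory() as scratch:
    result = reproduce(scratch, attempts=10)
print(result)
```

Because it uses only a few system calls, the same loop can be run under strace or with Lustre debug masks enabled without drowning the logs.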


What to do when a Lustre issue occurs (3)

► Collect logs on the involved nodes
  • System logs are always valuable for determining the state of Lustre nodes.
  • Use the 'strace' command to collect logs of system calls:
    o Which system call returns failure?
    o Which errno does this system call return? The errno is essential for understanding and debugging the issue, e.g. EIO(5) usually means disk I/O has some problems.
  • Collect the kernel dump file when a crash happens
    o Kdump should always be enabled on production systems.
    o It is especially useful for 'NULL pointer dereference'.
  • Collect Lustre messages for further analysis
  • Tips:
    o A few lines of critical messages are much more helpful than other messages.
    o The first messages when the bug happens are more important.
    o Masses of messages printed days before the bug happened are less valuable.
    o Redundant messages are always better than missing messages.
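Picking the failing system call and its errno out of strace output is easy to automate. The following sketch relies on the usual strace convention that a failed call ends in `= -1 ERRNAME (description)`; the two sample lines are illustrative, not captured from a real system, and the regular expression is a simplification that ignores PID prefixes and unfinished calls.

```python
import errno
import re

# A failed call in strace output ends with "= -1 ERRNAME (description)".
FAILED_CALL = re.compile(r"(\w+)\(.*\)\s*=\s*-1\s+(E[A-Z0-9]+)\s+\(([^)]*)\)")


def failed_syscalls(strace_output):
    """Return (syscall, errno name, errno number, description) for each failed call."""
    failures = []
    for line in strace_output.splitlines():
        match = FAILED_CALL.search(line)
        if match:
            name, err, text = match.groups()
            failures.append((name, err, getattr(errno, err, None), text))
    return failures


# Illustrative strace lines (not real output).
sample = '''stat("/lustre/dir2", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/lustre/dir2/a", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = -1 E2BIG (Argument list too long)
'''
for failure in failed_syscalls(sample):
    print(failure)
```

The errno number comes from Python's `errno` module, so it reflects the platform the script runs on.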


What to do when a Lustre issue occurs (4)

► Collect Lustre messages
  • Command: lctl debug_kernel
  • Different masks can be used: trace, inode, super, ext2, malloc, cache, info, ioctl, neterror, net, warning, buffs, other, dentry, nettrace, page, dlmtrace, error, emerg, ha, rpctrace, vfstrace, reada, mmap, config, console, quota, sec, lfsck, hsm
  • Default masks are "warning, error, emerg, console", but it might be necessary to change the mask to collect the desired messages.

Mask | Usage
trace | Useful for tracing the process flow of the Lustre software stack. Frequently used.
quota | Useful for debugging quota problems.
dlmtrace | Useful for debugging LDLM problems.
ioctl | Useful for debugging ioctl problems.
malloc | Useful for debugging memory leak problems. Usually used together with leak_finder.pl.
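When masks are enabled from a script, validating their names first avoids silent typos. The sketch below only builds command lines and never runs them. The `lctl debug_kernel` command comes from the slide; the `lctl set_param debug=+<mask>` form used to add a mask is an assumption that should be checked against the Lustre manual for the version in use, and the log path is arbitrary.

```python
# Mask names as listed on the slide.
KNOWN_MASKS = {
    "trace", "inode", "super", "ext2", "malloc", "cache", "info", "ioctl",
    "neterror", "net", "warning", "buffs", "other", "dentry", "nettrace",
    "page", "dlmtrace", "error", "emerg", "ha", "rpctrace", "vfstrace",
    "reada", "mmap", "config", "console", "quota", "sec", "lfsck", "hsm",
}


def debug_commands(masks, log_path):
    """Build (but do not run) commands to add debug masks and dump the log."""
    unknown = sorted(set(masks) - KNOWN_MASKS)
    if unknown:
        raise ValueError(f"unknown debug mask(s): {', '.join(unknown)}")
    # Assumed syntax: one "set_param debug=+mask" call per mask.
    commands = [["lctl", "set_param", f"debug=+{mask}"] for mask in masks]
    # Dump the kernel debug buffer to a file after reproducing the problem.
    commands.append(["lctl", "debug_kernel", log_path])
    return commands


for command in debug_commands(["trace", "quota"], "/tmp/lustre-debug.log"):
    print(" ".join(command))
```

Enable the masks, run the reproducer, then dump the buffer, so that the collected messages cover only the failing operation.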


What to do when a Lustre issue occurs (5)

► Fix the issue
  • Check whether the same issue has been fixed in the master branch of the Lustre git repository
    o The Lustre master branch is evolving quickly, which means many bugs fixed there might still exist in older versions.
  • Check whether any similar issue has been reported
    o A fix or workaround might already have proved successful.
  • Keep the faith that a fix will show up naturally once the problem is fully understood.
  • Compromise if you have to:
    o Find a temporary way to recover the service of the production system quickly, e.g. reboot/e2fsck.
    o If it is impossible to understand or fix the root cause of the issue right now, try to find a way to work around it.


Real examples of fixing Lustre bugs 1

► RM-135/LU-4478
  • Problem description: When formatting a Lustre OST, the kernel crashes.
  • Reproduction steps:
    o Apply a debug patch which returns failure from ldiskfs_acct_on()
    o Formatting a Lustre OST will trigger the crash
  • Collected log: Kernel dump file collected by Kdump
  • Analysis:
    o The log shows that the kernel crashes in ext4_get_sb()/get_sb_bdev()/kill_block_super()/generic_shutdown_super()/iput()/clear_inode() because of 'BUG: unable to handle kernel NULL pointer dereference at 00000000000001e0'
    o Using 'crash' commands, it is confirmed that EXT4_SB((inode)->i_sb) is NULL
    o Further analysis shows that the failure of ldiskfs_acct_on() in ldiskfs_fill_super() is not handled correctly.
  • Fix: Add code to handle failure of ldiskfs_acct_on() in ldiskfs_fill_super(). (http://review.whamcloud.com/10938)


Real examples of fixing Lustre bugs 2

► RM-185/LU-5054
  • Problem description: Creating a pool name of length 16 and setting it on a directory succeeds. However, creating a file under that directory fails.
  • Reproduction steps:
    o [root@penguin1 ~]# lfs setstripe -p aaaaaaaaaaaaaaaa /lustre/dir2
    o [root@penguin1 ~]# touch /lustre/dir2/a
      touch: cannot touch `/lustre/dir2/a': Argument list too long
  • Errno: E2BIG(7)
  • Collected log: Lustre trace log, to check which function returns the E2BIG errno.
  • Analysis: The log shows that lod_generate_and_set_lovea() returns E2BIG, because the pool name inherited from the parent directory is longer than the length limit.
  • Fix: Clean up all related code to enforce a consistent length limit on pool names. (http://review.whamcloud.com/10306)
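The class of bug described here, two code paths enforcing different limits on the same name, and the shape of the fix can be modeled in a few lines. This is an illustration only and not Lustre code: the function names are hypothetical and the limit of 15 characters is an assumed value chosen so that a 16-character name is rejected.

```python
# Assumed limit for illustration; the real constant lives in the Lustre source.
MAX_POOL_NAME_LEN = 15


def check_pool_name(name, max_len=MAX_POOL_NAME_LEN):
    """Single shared validator, so every code path enforces the same limit."""
    if len(name) > max_len:
        raise ValueError(f"pool name too long: {len(name)} > {max_len}")
    return name


def set_directory_pool(name):
    """Hypothetical stand-in for the path that sets a pool on a directory."""
    return check_pool_name(name)


def create_file_in_pool(name):
    """Hypothetical stand-in for the path that creates a file using the inherited pool."""
    return check_pool_name(name)


try:
    set_directory_pool("a" * 16)
    outcome = "accepted"
except ValueError as exc:
    outcome = f"rejected: {exc}"
print(outcome)
```

With one validator shared by both paths, an over-long name is refused when it is set on the directory, instead of surfacing later as E2BIG when a file is created.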


Real examples of fixing Lustre bugs 3

► LU-5808
  • Problem description: When using one MGT to manage two file systems named 'lustre' and 'lustre2T', it is impossible to mount their MDTs on different servers because parsing of the MGS llog fails.
  • Reproduction steps:
    o mkfs.lustre --mgs --reformat /dev/sdb1;
    o mkfs.lustre --fsname lustre --mdt --reformat --mgsnode=192.168.3.122@tcp --index=0 /dev/sdb2;
    o mkfs.lustre --fsname lustre2T --mdt --reformat --mgsnode=192.168.3.122@tcp --index=0 /dev/sdb3;
    o mount -t lustre /dev/sdb1 /mnt/mgs;
    o mount -t lustre /dev/sdb2 /mnt/mdt-lustre;
    o mount -t lustre /dev/sdb3 /mnt/mdt-lustre2T;
    o lctl conf_param lustre.quota.ost=ug;
    o mount -t ldiskfs /dev/sdb1 /mnt/ldiskfs;
    o llog_reader /mnt/ldiskfs/CONFIGS/lustre2T-MDT0000 | grep quota.ost;
    o The output of the last command is:
      #10 (224)marker 8 (flags=0x01, v2.5.25.0) lustre 'quota.ost' Mon Oct 27 21:26:23 2014-
      #11 (088)param 0:lustre 1:quota.ost=ug
      #12 (224)marker 8 (flags=0x02, v2.5.25.0) lustre 'quota.ost' Mon Oct 27 21:26:23 2014-
  • Collected log:
    o Lustre trace log, to check which function returns the failure when mounting MDTs
    o Lustre trace log, to check how the MGS handles llog names
  • Analysis: The log shows that the MGS matches the llog of 'lustre2T' even when it tries to update the llog of 'lustre'
  • Fix: Update the MGS code to match llog names strictly, to avoid invalid records (http://review.whamcloud.com/12437)
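The analysis above describes a name-matching problem: one file system name is a prefix of the other. The sketch below is a model of that failure class, not the MGS code. The log name `lustre2T-MDT0000` appears in the reproduction steps; `lustre-MDT0000` is assumed by analogy, and both matching functions are hypothetical.

```python
def loose_match(llog_name, fsname):
    """Buggy idea: any log whose name starts with the fsname is taken as a match."""
    return llog_name.startswith(fsname)


def strict_match(llog_name, fsname):
    """Strict idea: the fsname must be followed by the '-' separator."""
    return llog_name.startswith(fsname + "-")


logs = ["lustre-MDT0000", "lustre2T-MDT0000"]

loose = [name for name in logs if loose_match(name, "lustre")]
strict = [name for name in logs if strict_match(name, "lustre")]
print("loose: ", loose)
print("strict:", strict)
```

The loose comparison selects both logs for the file system 'lustre', which is how a parameter meant for one file system can end up recorded in the other's log.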


Performance Issue during commissioning (1)

Background:
► Lustre system being commissioned in Asia
► DDN storage, white-box servers, DDN Lustre
► Hardware assembled by a third-party contractor
  • No pre- or post-installation documentation

Problem Statement:
► Low OSS performance
► Failing performance acceptance tests


Performance Issue during commissioning (2)

► Local team spent many hours trying to resolve the issue
► Escalated to the (remote) DDN APAC Lustre Support team
► Steps to resolve:
  • Determine what the problem is in the first place
    o Multiple tests to confirm where the problem is occurring
      – ior and iozone
      – obdfilter-survey
      – lnet-selftest
      – raw IB test utilities ib_[write,read]_bw
      – Make sure to specify the correct HCA you want to test on.
  • Based on the results of the above testing, investigate the hardware
  • lspci -vv was our friend


Performance Issue during commissioning (3)

► Resolution
  • Onsite engineer moved one HCA to an 8-lane PCI slot on all servers
  • Re-ran the tests to confirm the fix: it worked, and the system achieved the 10GB/s read/write performance profile.
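A card sitting in a slot with too few lanes shows up in `lspci -vv` as a negotiated link width (`LnkSta`) below the width the device supports (`LnkCap`). The sketch below parses those two lines and flags the mismatch. The sample text is illustrative and was not captured from the system in this case study, and the helper names are hypothetical.

```python
import re

# Matches the "Width xN" field on LnkCap and LnkSta lines of `lspci -vv` output.
WIDTH = re.compile(r"(LnkCap|LnkSta):.*?Width x(\d+)")


def link_widths(lspci_text):
    """Return the first LnkCap and LnkSta widths found in the text."""
    widths = {}
    for line in lspci_text.splitlines():
        match = WIDTH.search(line)
        if match and match.group(1) not in widths:
            widths[match.group(1)] = int(match.group(2))
    return widths


def is_degraded(widths):
    """True when the negotiated width is below what the device supports."""
    return widths.get("LnkSta", 0) < widths.get("LnkCap", 0)


# Illustrative excerpt for one device (not real output from the site).
sample = """\
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s
        LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+
"""
widths = link_widths(sample)
print(widths, "degraded" if is_degraded(widths) else "ok")
```

Running a check like this on every server during commissioning would catch a misplaced HCA before any benchmark is started.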


Performance Issue during commissioning (4)

► 20/20 hindsight is a beautiful thing:
  • Obvious once the issue is known

► Lessons learned:
  • Detailed documentation of the installation is needed; the issue would have been resolved easily had it been available


What makes Lustre debugging easier?

Difficulty to debug | Easy | Middle | Hard
Ability to reproduce | Every time | Sometimes | Never
Time to reproduce | Seconds | Minutes | Hours
Program to reproduce | A few system calls | Single node application | Parallel application
Condition to reproduce | A certain condition of a single process | Race condition with multiple processes | Uncertain/Unknown condition
Involved nodes | Client | MDS or OSS | Client & MDS & OSS
Involved software components | Single component | Multiple components on a single node | Multiple components on multiple nodes with RPCs
Ways of failing | Omission failure (crash, request loss, or no reply) | Commission failure (wrong processing of request, incorrect reply, corrupted state) | Arbitrary/Byzantine failure (unpredictable result)
Types of error | Syntax error (compile error) | Semantic defect (unintended result) | Design deficiency
Problem description | Clear description with reproduction steps | Clear text description | Ambiguous description
Collected logs | Precise logs since the bug occurred | Massive unfiltered logs | Not enough logs


Fini – Questions?


Lustre debugging

► Lustre is a very complex piece of software which is hard to debug
  • It has many software components with tightly coupled interfaces.
  • It is a distributed file system with multiple types of nodes connected together by a network.
  • The software resides in kernel space, which makes it more difficult to debug than user-space software.

► It is possible to debug Lustre
  • Most Lustre bugs get fixed eventually: search Jira.
  • Many tools have been developed specifically for Lustre debugging.
  • The Lustre community is very active and provides strong support.


Lustre DDN branch Client Performance optimization


Genomic Analysis Application

► It's a standardized job set (pipeline), but...
  • More than 2000 jobs run in a single pipeline.
    o Alignment and mapping with genomics reference databases
    o Annotations: adding references (metadata) to data
    o Analysis by each application
  • There are 100+ analysis applications, but no MPI applications. A lot of single jobs!
  • Each application has a lot of options/libraries.
  • All jobs are managed by the job scheduler, which allocates them very efficiently.
  • Many analysis pipelines run on the same HPC cluster simultaneously.


Complex, Complex and Complex...

[Figure: job dependency graph for a single pipeline (job1 to job6, job101 to job107, job201 to job206, job301 to job306). Labels: Single Pipeline, After Finish job, waiting jobs, Dependency.]


Pipeline aware I/O performance monitoring

► Developed a Lustre performance monitoring tool
  • Near real-time data point collection (every second)
  • Any type of I/O monitoring is possible (UID/GID/JOBID or any type of custom ID)

[Figure: ExaScaler Monitor graphs for Total, Pipeline1, Pipeline2, Pipeline3 and Pipeline4]

Performance monitoring is NOT only a daily/hourly report; it is critical for performance optimization.


Problem at MMBK

► The elapsed time of a pipeline job on the lustre-2.5 client system is longer than on the lustre-1.8 client system.

[Figure: job timelines for the lustre-1.8 client system and the lustre-2.5 client system, from "Job started" to "Finished job". Annotation: 10 hours.]

One analysis takes 2.5 days!


Lustre performance optimization for genomic applications

Worked exclusively with Intel to optimize the current Lustre 2.5 client code for better I/O performance in genomic applications.

► mmap() I/O performance improvements
  • Bug fixes, optimization and improvements
  • BTW, there is a crucial issue with mmap() in GPFS

► Performance improvements for a single shared file
  • Parallel reads of the same region of a file from a single client

► CPU/memory resource reduction
  • Many CPU-intensive applications; CPU usage is always high

► Large bulk I/O size support and enhancement
  • Support for up to 16MB I/O size (4MB was the limit)
  • Aggressive readahead engine for large I/O


Fix mmap() performance problem and improvements

# cat /proc/fs/lustre/llite/*/stats
llite.share1-ffff881067f9b800.stats=
snapshot_time 1408263676.546716 secs.usecs
read_bytes 589388 samples [bytes] 0 2147479552 258867698600
write_bytes 1025093126 samples [bytes] 1 4194304 637173439272
osc_read 3880442 samples [bytes] 8 1048576 3667025741928
osc_write 640640 samples [bytes] 5 1048576 637252863026
ioctl 17938 samples [regs]
open 90267 samples [regs]
close 90239 samples [regs]
mmap 10523 samples [regs]
seek 6997546 samples [regs]
fsync 16 samples [regs]
readdir 48874 samples [regs]
setattr 252 samples [regs]
truncate 12 samples [regs]
getattr 2097773 samples [regs]
create 3465 samples [regs]
link 1 samples [regs]
unlink 2890 samples [regs]
statfs 2069 samples [regs]
alloc_inode 8423 samples [regs]
getxattr 1025105141 samples [regs]
inode_permission 229899278 samples [regs]

Several applications call mmap() heavily: the number of mmap() calls is more than 10% of the number of open() calls!

[Figure: mmap() read performance improvements by block size (32K, 128K, 512K, 1024K). Series: Lustre-1.8.9, fixed DDN branch.]

[Figure: mmap() read performance (1MB block size). Series: lustre-1.8.9, lustre-2.5.2, fixed DDN branch.]

After rework, 2.5x speed-up over the 1.8 client.
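The "10%+" observation can be reproduced directly from the counters in the stats output. The sketch below parses lines of the form `name count samples [unit] ...` and computes the ratio of mmap() calls to open() calls. The sample text is a subset of the output shown on this slide, and the parser is a simplification that keeps only the sample count.

```python
def parse_stats(text):
    """Map each counter name to its sample count from a Lustre stats file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Counter lines look like: "<name> <count> samples [<unit>] ..."
        if len(parts) >= 3 and parts[2] == "samples":
            counters[parts[0]] = int(parts[1])
    return counters


# Subset of the llite stats shown above.
sample = """snapshot_time 1408263676.546716 secs.usecs
open 90267 samples [regs]
close 90239 samples [regs]
mmap 10523 samples [regs]
"""
counters = parse_stats(sample)
ratio = counters["mmap"] / counters["open"]
print(f"mmap/open = {ratio:.1%}")
```

The same parser works for any counter in the file, so ratios such as getxattr per open can be tracked over time in the same way.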


Performance improvements for the same region of a shared file

The applications are not MPI; instead, many single applications refer to one reference database file and perform mapping operations against it.

[Figure: processes on a single client all reading a reference database file]

[Figure: fix and optimization for parallel read (no cache). Cases: 4KB single, 4KB parallel, 1MB single, 1MB parallel. Series: lustre-1.8.9, lustre-2.5.2, fixed DDN branch. Speed-up annotations: 9X, 8X, 2X, 2X, 2X, 12X.]

The Sanger Institute in the UK hit similar performance regressions with the lustre-2.5.2 client. After they applied our patches, job elapsed time dropped significantly: 24 hours (fixed DDN Lustre branch), down from 40 hours (lustre-2.5.2).


Optimization of performance under heavy CPU loads

► CPU utilization on all clients is quite high, and the job scheduler allocates the next jobs very efficiently.

► We found Lustre-2.5 performance regressions under heavy CPU load.

► Many Java applications do not seem to manage memory well, and the Lustre client also consumes memory.
• Several applications are implemented on an old architecture (assuming everything fits in the cache?).
• With less buffer cache available to Lustre, more I/O goes to disk rather than being served from cache...


Large bulk I/O size support

Engineering Technical Conference 2014

Where ideas become reality

[Chart: SFA12K/Lustre Performance (Read) (w/ large bulk I/O patches); x-axis: 320 x NLSAS, 400 x NLSAS; series: 1MB I/O, 4MB I/O, 16MB I/O]

# cat /proc/fs/lustre/obdfilter/*/brw_stats
snapshot_time: 1406696961.271996 (secs.usecs)

                             read            |         write
pages per bulk r/w       rpcs    %  cum %    |      rpcs    %  cum %
1:                    1091416    1    1      |    681741    2    2
2:                      62166    0    1      |    164562    0    2
4:                      96568    0    1      |     60799    0    2
8:                     115945    0    1      |     10054    0    2
16:                    170813    0    1      |     11361    0    2
32:                    242152    0    1      |     18944    0    2
64:                    444827    0    2      |     37609    0    2
128:                   861561    0    3      |    107677    0    3
256:                 99436837   96  100      |  32549912   96  100

                             read            |         write
discontiguous pages      rpcs    %  cum %    |      rpcs    %  cum %
0:                  102060933   99   99      |  33641331   99   99
1:                     177850    0   99      |      1196    0   99
2:                      27307    0   99      |        39    0   99
3:                      10447    0   99      |        27    0   99
4:                       5502    0   99      |        16    0   99

- snip -

                             read            |         write
discontiguous blocks     rpcs    %  cum %    |      rpcs    %  cum %
0:                  102029460   99   99      |  31615681   93   93
1:                     208894    0   99      |   2026762    6   99
2:                      27592    0   99      |       131    0   99
3:                      10511    0   99      |        25    0   99
4:                       5549    0   99      |         9    0   99

- snip -

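Judging from the server-side I/O stats, most incoming I/O is large and sequential: 96% of both read and write RPCs carry the full 256 pages, and 99% of RPCs have no discontiguous pages.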
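The share of full-size bulk RPCs can be checked directly from the "pages per bulk r/w" histogram. The following Python sketch is one hedged way to do it: the row format is assumed from the brw_stats output on this slide, and the names ROW, parse_histogram and PAGES_PER_BULK are illustrative, not part of Lustre. With a 4 KiB page size (an assumption, typical for x86_64), 256 pages correspond to a 1 MiB RPC.

```python
import re

# One histogram row: "<bucket>: <rpcs> <%> <cum %> | <rpcs> <%> <cum %>"
ROW = re.compile(
    r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\d+)\s+\|\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def parse_histogram(rows):
    """Map bucket -> (read_rpcs, write_rpcs) for one brw_stats histogram."""
    hist = {}
    for line in rows.splitlines():
        m = ROW.match(line)
        if m:
            bucket, r_rpcs, _, _, w_rpcs, _, _ = map(int, m.groups())
            hist[bucket] = (r_rpcs, w_rpcs)
    return hist


# "pages per bulk r/w" rows copied from the slide.
PAGES_PER_BULK = """\
1: 1091416 1 1 | 681741 2 2
2: 62166 0 1 | 164562 0 2
4: 96568 0 1 | 60799 0 2
8: 115945 0 1 | 10054 0 2
16: 170813 0 1 | 11361 0 2
32: 242152 0 1 | 18944 0 2
64: 444827 0 2 | 37609 0 2
128: 861561 0 3 | 107677 0 3
256: 99436837 96 100 | 32549912 96 100
"""

hist = parse_histogram(PAGES_PER_BULK)
total_reads = sum(r for r, _ in hist.values())
total_writes = sum(w for _, w in hist.values())
full_reads, full_writes = hist[256]

print(f"reads at 256 pages:  {full_reads / total_reads:.1%}")
print(f"writes at 256 pages: {full_writes / total_writes:.1%}")
```

The percentages recomputed this way may differ by a point from the "%" column in brw_stats, which appears to be truncated to whole numbers.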

[Chart: SFA12K/Lustre Performance (Write) (w/ large bulk I/O patches); x-axis: 320 x NLSAS, 400 x NLSAS; series: 1MB I/O, 4MB I/O, 16MB I/O]


Performance results after reworking all improvements (1/3 scale test case)

[Timeline chart: job start and job finish for the Fixed Lustre Branch and Lustre-1.8.9]

After the rework: 5 hours faster than lustre-1.8.


Summary

► We learned the I/O patterns of genomic analysis applications.
• Each job's I/O access pattern is not difficult on its own, but the genomic analysis pipeline as a whole makes it complex.

► We have done performance monitoring, analysis and optimization of Lustre.
• Real-time Lustre performance monitoring helps performance analysis and performance optimization.

► There are still many areas we can optimize.
• A lot of legacy code and old system architecture remains in the base.
• Changing the applications is really hard (researchers are busy and I/O optimization is not their main work), but adapting to and optimizing for their applications is possible.


Trouble shooting

► Using two real examples to discuss/illustrate troubleshooting Lustre:

1. Performance Issue during commissioning

2. Three bugs in a mature running system


Generic Grafana graphing


Grafana IOR run


OpenTSDB web interface
