ISILON TUNING – L3 CACHE: REDUCING LATENCIES

Vinicius Segantin Viteri, Storage Specialist, Locaweb, [email protected]
Luiz Pissinatti, Account Manager, EMC, [email protected]
Table of Contents
Executive Summary
Introduction
Audience
Describing the Storage Environment
    Previous Environment
    POC
    New Cluster Architecture
Evaluation
    Analyzing Reads
    Comparing Cluster With and Without L3 Cache
Enabling L3 Cache
    Execution Times
Results
    Comparative Results
Conclusion
Appendix - OneFS Upgrade from 6.5 to 7.1
    Migration
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes or
methodologies.
Executive Summary
We were intrigued by a seemingly simple question: what would be better, using SSDs to keep the application index files always on flash, or using all the flash drives as L3 cache? The only thing we knew was that the answer would not be that simple, and that it would demand a lot of dedication and hours of pure fun. Guaranteed! In this article we share our experience enabling L3 cache on a multi-petabyte Isilon® cluster. We discuss all the steps, concerns, and activities, and especially the benefits we achieved by using SSDs as L3 cache for metadata acceleration.
L3 cache is a feature introduced in OneFS version 7.1.1 that allows the flash drives to be used in a different way. Although it is not the main objective of this article, we will also explain what we did to plan the upgrade from OneFS 6.5 (not a rolling upgrade, by the way) with minimal downtime to our services.
We will discuss all the steps we took to plan and execute this very exciting challenge of upgrading the code, freeing the SSDs, and enabling L3. We explain and show, using performance graphs, the historical behavior and the metrics we considered important for planning this execution. We had to be very careful about which Isilon jobs should run at scheduled times, such as deleting snapshots and archiving old data, because the email application we support is extremely sensitive to latency.
The workload described in this article consists of random access to small files, a workload rarely found in Isilon deployments and one for which it is very difficult to find resources and references. Isilon clusters usually serve large sequential data such as media or video surveillance. However, the flexibility embedded in the Isilon software allowed us to tune it to support this unusual workload, with very interesting results.
Introduction
The year is 2014. Location: The Venetian hotel in Las Vegas. We were at the EMC World conference, and EMC had just announced the L3 cache feature for Isilon clusters. That sounded like great and exciting news! All you need to do is upgrade OneFS and be happy. No licenses required!
However, real life is not that easy. The first real challenge was to upgrade from OneFS 6.5 to OneFS 7.1 (not a rolling upgrade, by the way) with minimal impact to our services. After that, we still had the "million dollar question" to answer: would it help at all? Considering that we already had more than 15 PB of Isilon deployed, this is actually worth a lot more than a million dollars, which made the challenge even more interesting!
This article is about a real customer scenario supporting real applications; however, details such as the customer name will not be disclosed or may be fictitious. We would rather spend consecutive nights in the datacenter tuning or troubleshooting Isilon clusters (and we do, from time to time ;-) than spend 15 minutes with the legal team trying to identify the applicable terms and conditions and whether or not this or that should be mentioned. Our apologies to any lawyer friends reading this, no offense intended; we just found this way easier ;-)
Getting back to the cache discussion, we had a lot of doubts in our minds and a lot of unanswered questions: What kind of benefit would we get in our environment? Do we need to enable this feature? How do we do it? How many jobs will run? What about network latencies? Will we reduce the I/O load on the disks?

In this article we share our experience enabling L3 cache in a multi-petabyte Isilon cluster. We discuss all the steps, concerns, and activities, and especially the benefits we achieved by using SSDs as L3 cache for metadata acceleration.
Though not the main objective of this article, we will also explain what we did to plan the OneFS upgrade with minimal downtime to our services. Due to changes in the file system storage layout, we would need to re-protect the entire cluster, and that could run for weeks given the amount of data residing in it. We decided to start a new cluster on the new OneFS version and migrate the data to it. We will explore the tests made with two migration tools: rsync and SyncIQ. The first gave us the comfort of an easy rollback plan, while SyncIQ gave us much more throughput, saving a lot of time compared with rsync. We will briefly discuss important aspects to consider when planning Isilon migrations.
Thanks. Enjoy your reading!
Audience
This article is intended for Isilon administrators, storage architects, and other technical professionals who already have some knowledge of or experience managing OneFS and the Isilon architecture. We will discuss existing configurations in our clusters and some design decisions we made to support our workload. Explaining OneFS concepts and implementation is beyond the scope of this article.
Describing the Storage Environment
Previous Environment
In 2008 we were challenged by our company to build a storage infrastructure to support webmail services. The decision at that time was to use COTS hardware and open source software to provide NFS services. By leveraging OpenSolaris and ZFS (Zettabyte File System) we were able to use a mix of SATA and flash drives to accelerate performance and also scale capacity.
We defined a couple of building blocks based on usage and on capacity or performance needs. The performance boxes had about 7.1 TB of usable capacity, while the others had about 30 usable terabytes and served as replication targets and/or backups.
When more space was needed, it was just a matter of deploying a new box and adding its NFS mount points to scale out the environment. We also created automation scripts and standards to expedite the deployment of those new storage units.
We started with a couple of those storage units but had to double that number in less than three months. By the end of the first year we were at about 10 times the first deployment. Long story short, in less than three years we had grown to about 700 storage units supporting over 10 PB of capacity.
We were very proud of our achievement to that point, especially because no one had any idea it would grow so fast! It was also clear that a lot would need to change to continue supporting that growth. As each storage unit is a standalone management point, you can imagine the challenge we were facing. Some problems were eroding our service level agreement (SLA). Crucial issues, such as lack of space on one unit or natural changes in workload behavior like heavy IOPS or write peaks, were forcing us to do numerous data migrations back and forth, and each migration meant a 1-hour downtime for 100,000 email accounts.
Isilon OneFS, a single file system and a single namespace with automated capacity balancing and storage tiers, was absolutely key for our environment. It is a true scale-out solution, able to scale capacity and performance (CPU, network) without losing control of the environment.
POC
It was time to start a POC (proof of concept)! After a lot of preparation and work to understand our application, we started with a 6-node cluster. The initial idea was to have two different pools: one for hot content running on SAS drives and the other running on SATA drives for archiving old data. Our performance tests indicated that, for the webmail workload, each S200 node would support about 100,000 email accounts.
After six months in production, we decided to replace all NL-Series nodes with X-Series nodes, which significantly improved the environment. Using X400 nodes increased the cluster's CPU and memory capacity compared to the NL-Series with the same disk capacity, and added another very important capability: flash drives that enable Global Namespace Acceleration (GNA). Our new scenario became three S200 nodes and three X400 nodes.
New Cluster Architecture
Since the POC was very successful, we decided to build our clusters in what we called
“failure domains”. Each cluster would have:
16 S200 nodes (15+1): to support 1.5 million accounts per cluster
10 X400 nodes: providing 1 PB of storage capacity per cluster
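This sizing follows directly from the POC result of roughly 100,000 accounts per S200 node: 15 active S200 nodes × 100,000 accounts ≈ 1.5 million accounts per cluster, with the sixteenth node providing headroom for failures and maintenance.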
On the application side, we have two types of files: index files and mailboxes. The index files are used to index all messages stored in the mailbox files. All mailboxes are in mdbox format and are configured for a maximum of 10 MB per file. All index files are around 30 KB in size.
On the Isilon side, we have two node types, S200 and X400, which form two separate pools. We call the S200 pool PoolA; it stores all data from the last seven days. The X400 pool is PoolB, which holds all data older than seven days. FilePoolPolicies exist to move these two types of files to the correct pool: all index files are created on PoolA and always stay on PoolA, while mdbox files are created on PoolA and are moved to PoolB once their creation time (ctime) attribute becomes older than seven days.
Using ctime, we guarantee that only the newest files are in PoolA, and depending on PoolA's disk usage we can change the threshold from seven to fifteen days. This strategy gives our customers more performance, since all recent messages are read from SAS drives, while keeping the used space under control.
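To make the tiering rule concrete, the sketch below shows roughly how such a policy can be expressed from the OneFS command line. The policy name, the /ifs/mail path, and the exact filter options are illustrative assumptions (file pool CLI flags vary between OneFS releases), so treat this as an outline of the idea rather than our production policy definition.

# Illustrative sketch only: move mdbox files whose ctime is older than seven
# days to the archive pool (PoolB). Flag names differ between OneFS releases;
# check "isi filepool policies create --help" on your cluster before using.
isi filepool policies create Move_mdbox_to_PoolB \
    --begin-filter --path=/ifs/mail --and --changed-time=7D --operator=gt --end-filter \
    --data-storage-target=PoolB

# List the policies currently defined, and run the SmartPools job that applies them
isi filepool policies list
isi job jobs start SmartPools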
On average, all clusters were running fine, but from time to time we still faced sporadic increases in response time, sometimes severe latency spikes during high-workload periods of the day. After extensive troubleshooting, we found that enabling metadata acceleration on the SSDs was an important change for this workload.
Using SSDs for metadata acceleration helped increase performance and avoid peaks in response time, and it became our standard for all cluster deployments. We also made other tuning adjustments that are outside the scope of this article and would be enough material for another discussion.
Only about 6% of all SSD capacity in the cluster was being used, one of the reasons we were able to consider enabling L3 cache.
When we heard about the new option of using SSDs for L3 cache, we immediately understood that there was a potential benefit for this particular workload. This stimulated the discussion of what would be better: SSDs as L3 cache, SSDs for metadata acceleration, or SSDs to store the application index files? The cluster and pool summary below, captured from the OneFS command line, shows how little of the SSD capacity was actually in use:
Cluster Name: EMAIL-IG-0001
Cluster Health: [ OK ]
Cluster Storage: HDD SSD Storage
Size: 1.3P (1.3P Raw) 28T (28T Raw)
VHS Size: 9.3T
Used: 989T (77%) 1.6T (6%)
Avail: 298T (23%) 26T (94%)
Node Group Name: FAST-01 Protection: +2d:1n w18
Pool Storage: HDD SSD Storage
Size: 212T (214T Raw) 6.5T (6.5T Raw)
VHS Size: 1.5T
Used: 49T (23%) 209G (3%)
Avail: 164T (77%) 6.3T (97%)
Evaluation
The decision was not easy. Out of the 156 nodes we have, 128 belong to the clusters eligible to have L3 enabled. Each of these clusters supports about 150K NFS ops/s, and together they hold about 5 PB of stored user data for more than 8 million email accounts.
At this time, we were using SmartPools policies to maintain hot data on the performance nodes, and all SSDs were being used for metadata acceleration in both tiers. We noticed, however, that the utilization ratio of the SSDs was very low, on average below 6% of capacity. Although performance targets were being met with metadata acceleration, we felt the SSDs were underutilized and that we could extract more performance out of those flash drives.
Since we now had a better understanding of the application and its files, we started to consider whether it would be better to use SSDs as L3 cache, or to allow user data on flash and force all application index files to live on flash from cradle to grave while other files followed SmartPools policies.

To better understand the behavior, we enabled L3 cache on one of our new clusters, since it was just being deployed, and leveraged InsightIQ to analyze cluster performance data. Here are some key findings:
Analyzing Reads
About 63% of all read operations are namespace_reads, which are essentially metadata; each of these operations moves 512 bytes and is served from RAM.
About 21% of read operations are READs of user data; each access to disk is 8 kilobytes. This is where we focused our analysis: these are all disk operations, and the idea is to move this workload off the mechanical disks and onto the SSDs using L3 cache.
Each X400 in our case has 2.4 TB of SSD and each S200 has 400 GB. For the entire cluster, we would have more than 30 TB of SSD available for cache. As this is a distributed system, user and system access is shared among all nodes, regardless of which node holds the data or which port serves the request.
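Taking the node counts described earlier (16 S200 and 10 X400 per cluster), the available cache capacity works out to roughly 10 × 2.4 TB + 16 × 400 GB = 24 TB + 6.4 TB ≈ 30.4 TB of SSD per cluster, which is where the "more than 30 TB" figure comes from.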
Comparing Cluster With and Without L3 Cache
In the following graphics we can see that in cluster PD-01 we have about 80% cache hit in
L3 requesting “user data”; that is, the email messages themselves. While on PD-03, all user
data access is provided by spinning disks.
We found that L1 and L2 usage behavior is much more uniform and stable in the cluster with
L3 cache enabled. To clarify if you are curious, the two drops in L3 cache shown in the
graph are related to data migrations where some services have been stopped.
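If you want to observe these hit rates directly on a node rather than through InsightIQ, OneFS ships a node-local cache statistics utility; the invocation below is how we recall using it, so verify the exact options on your release.

# Run on an Isilon node: prints L1/L2/L3 cache hit and miss counters
# (output format and options may differ between OneFS releases).
isi_cache_stats -v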
The main objective of this paper is to demonstrate that by enabling L3 cache on all SSDs in the cluster, we avoid overloading the SAS drives. Every cache hit is one less I/O going to spinning disks. Achieving such a high L3 hit rate for user data (around 80% in our case) means that the SSDs are delivering data and significantly relieving the SAS back end. The majority of our performance problems were related to the SAS drives being overloaded; now we hope those issues are gone forever.
Enabling L3 Cache
Considering that we are already running a version of OneFS where L3 cache is an option, we now need to think about the impact of enabling it on a running production cluster. If you are not yet running 7.1, please read the Appendix (OneFS Upgrade from 6.5 to 7.1) for additional information on how to migrate with minimal impact.

The Isilon engineering procedure to enable L3 cache mentions that SmartPools and FlexProtect need to be executed in order to free up the SSDs and re-protect data within the cluster. However, this should be taken into serious consideration and planned very carefully.
The email workload we support here tends to be very sensitive to latency, and with our particular workload some Isilon jobs can have long execution times and frequently impact latency and response time.
We have developed a very particular schedule to address all Isilon configuration needs as well as to take and delete snapshots, allow replacement of failed disks, and move data between node pools, among all the other administrative tasks. The SmartPools job, for instance, runs once a week to move data down to the X400 nodes.
When we enable L3, all data on all SSDs in the cluster is flushed out by SmartPools, and at the end FlexProtect orchestrates a controlled smartfail of each SSD out of the storage pool. Since we run SmartPools only once a week, we feared that doing all the data movement already scheduled plus freeing up the SSDs would be too much. In this particular case, we scheduled the SmartPools job to run two days before enabling L3 cache, so it would move the great majority of the user data, leaving less work for the next execution.
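For reference, the overall sequence looks roughly like the sketch below. The pool name is a placeholder and the command syntax is from our recollection of OneFS 7.1.1, so confirm it with the built-in help on your cluster before relying on it.

# 1. Two days ahead: run SmartPools so most scheduled tier moves are already done
isi job jobs start SmartPools

# 2. Enable L3 cache on the node pool (pool name is a placeholder); OneFS then
#    flushes the SSDs and FlexProtect re-protects the data
isi storagepool nodepools modify s200_pool --l3 true

# 3. Follow the running jobs and overall cluster state while the SSDs are freed
isi job jobs list
isi status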
By doing this, we had near zero negative impact in all executions. One FlexProtect job was started for each of the two pools on each cluster. The execution time depends on the load (client access) and on the data volume in each dataset, but we were able to do it over the weekend, when access is lowest.
Execution Times
PoolA (S200): 105 TB of capacity in use, about 4 hours of execution time
PoolB (X400): 1.52 PB of capacity in use, about 16 hours of execution time
We were also worried about how to warm up the new cache with data, since all metadata had previously been accelerated on the SSDs. At first, we developed a number of scripts that would run on the NFS clients and walk the entire filesystem structure, reading everything to /dev/null. The good news is that this was not even necessary: we simply ran another SmartPools session and that was enough to warm all the metadata.
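For reference, a minimal sketch of that kind of warm-up script is shown below; the mount point is a placeholder and the parallel xargs fan-out is just an illustrative choice. As noted above, the extra SmartPools run made this unnecessary.

#!/bin/sh
# Warm the cluster read caches from an NFS client by reading every file once.
# MAIL_MOUNT is a placeholder for the NFS mount point of the dataset to warm.
MAIL_MOUNT=/mnt/mail
find "$MAIL_MOUNT" -type f -print0 \
  | xargs -0 -n 64 -P 8 cat > /dev/null 2>&1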
Results
After the activation process concluded, all SSDs were presented to OneFS as L3 cache and were no longer available as storage drives.

The greatest benefit we obtained is lower latency for NFS operations. For our particular use case, that means a huge service level improvement for the application. Before enabling L3 cache we had an average latency of 72 ms; after enabling it, latency dropped to 54 ms, a very significant 25% improvement in service level.
Latency average: 72.31 ms – before enabling L3 cache
Latency average: approximately 54 ms – after enabling L3 cache
After enabling L3 cache, we see L3 hits rising to about 90%, while the behavior of the L1 and L2 caches became uniform (flat lines instead of recurring spikes).
Comparative Results
The final results in all the graphs shown here are very positive. The cache warms up as clients access the data. We also see a change in the behavior of L1 and L2, which went from oscillating frenetically between 0 and 80% to a much higher and more constant watermark after L3 activation. The first graph shows the activation of the L3 cache; the second graph shows its efficiency.
After enabling L3 cache on another cluster, we saw a decrease in disk IOPS on the X400 nodes while NFS operations maintained the same behavior as before.
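The same comparison can also be watched from the CLI with the OneFS statistics command. The exact flags vary by release, so the two invocations below are only starting points; see their built-in help for the options on your cluster.

# Per-drive operation counters, useful to confirm the drop in X400 disk IOPS
isi statistics drive

# Per-protocol operation rates and latencies, to confirm NFS behavior is unchanged
isi statistics protocol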
Conclusion
After about three months of extensive work, planning, and building a deep understanding of our application and storage environment, we were able to successfully upgrade five email clusters from 6.x to 7.1 by leveraging data migration techniques, and also to enable L3 cache on all of them.
We have increased the service level and lowered response times for the application while decreasing disk utilization in the clusters, making day-to-day operations more comfortable. We may now potentially add more email accounts per cluster and delay additional infrastructure investments.
For our workload in particular, we concluded that enabling L3 cache was a very good use of the SSDs. We encourage you to evaluate your own environment and identify what benefits that configuration might bring to your applications.
Appendix - OneFS Upgrade from 6.5 to 7.1
One of the biggest challenges we faced was the OneFS upgrade from 6.5 to 7.1. The changes in protection groups normally require a huge internal re-layout of all data stored in the cluster. The good news is that Isilon OneFS can do this online and automatically, with no downtime to applications.

The bad news is that, to accomplish this, the FlexProtect job might run for a couple of days in a row. Since our application is very sensitive to latency, this would cause severe performance impact, so we had to try a new approach.

We decided to build a new cluster on version 7.1 and migrate the data from 6.5. Once the old cluster was empty, it would be reinstalled with the new version to support new email accounts.
We studied alternatives for executing the migration, which we briefly explore below.
1-) SyncIQ
Pros: High performance when sending data, due to the configurable number of workers per node.
Cons: The failover/failback options would have been an interesting alternative because they allow a quick rollback in case of problems. However, since we were migrating between different OneFS versions, the failover feature was not available; we would have to break the replication in order to fail over to the second cluster, and that would ruin any rollback plan.
2-) Rsync
Pros: Existing know-how in the application team, which already used it for migrations.
Cons: Low performance.
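For illustration, a typical rsync invocation for this kind of migration looks roughly like the one below; the source and destination paths are placeholders (the exact options we used are not recorded here), and running one instance per dataset in parallel is a common way to compensate for rsync's single-stream performance.

# Copy one dataset from the old cluster mount to the new cluster mount,
# preserving permissions, ownership, and hard links, and deleting files
# removed at the source. A final pass with the application stopped picks
# up the last deltas before cutover.
rsync -aH --numeric-ids --delete /mnt/old-cluster/dataset01/ /mnt/new-cluster/dataset01/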
We decided on SyncIQ because of its ease of use with high data volumes, its great performance, and its file data integrity. Our tests showed SyncIQ to be very reliable and efficient, even when using the graphical interface.
Migration
The scenario required the migration of about 700 TB in 16 datasets of roughly 43 TB each. The biggest risk was a rollback: rolling back a 43 TB dataset would take more than 10 hours, with downtime to the application. By planning accordingly, we migrated 3 datasets per day, with about 10 minutes of data unavailability per dataset and no end-user complaints filed with our helpdesk.
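At that pace, the 16 datasets translate into roughly six migration days (16 datasets / 3 datasets per day ≈ 5.3 days of actual data movement).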
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY
DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.