ISILON TUNING – L3 CACHE: REDUCING LATENCIES

Vinicius Segantin Viteri, Storage Specialist, Locaweb, [email protected]
Luiz Pissinatti, Account Manager, EMC, [email protected]
Table of Contents
Executive Summary
Introduction
Audience
Describing the Storage Environment
    Previous Environment
    POC
    New Cluster Architecture
Evaluation
    Analyzing Reads
    Comparing Cluster With and Without L3 Cache
Enabling L3 Cache
    Execution Times
Results
    Comparative Results
Conclusion
Appendix - OneFS Upgrade from 6.5 to 7.1
    Migration
Disclaimer: The views, processes or methodologies published in this article are those of the
authors. They do not necessarily reflect EMC Corporation’s views, processes or
methodologies.
Executive Summary
We were intrigued by a seemingly simple question: what would be better, using SSDs to keep the application index files always on flash, or using all the flash drives as L3 cache? The only thing we knew was that the answer would not be that simple, and that it would demand a lot of dedication and hours of pure fun. Guaranteed! In this article we share our experience enabling L3 cache on a multi-petabyte Isilon® cluster. We discuss all the steps, concerns, and activities, and especially the benefits we achieved by using SSDs as L3 cache for metadata acceleration.
L3 cache is a feature introduced in OneFS version 7.1.1 that allows the flash drives to be used in a different way. Although it is not the main objective of this article, we will also explain what we did to plan the upgrade from OneFS 6.5 (not a rolling upgrade, by the way) with minimal downtime to our services.
We will discuss all the steps we took to plan and execute this very exciting challenge of upgrading the code, freeing the SSDs, and enabling L3. We explain and show, using performance graphs, the historical behavior and the metrics we considered important for planning this execution. We had to be very careful about which Isilon jobs should run at scheduled times, such as deleting snapshots and archiving old data, because the email application we support is extremely sensitive to latency.
The workload described in this article consists of random access to small files, a workload rarely found in Isilon deployments and one for which it is very difficult to find resources and references. Isilon clusters usually serve large sequential data such as media or video surveillance. However, the flexibility embedded in the Isilon software allowed us to tune it to support this unusual workload, with very interesting results.
Introduction
The year is 2014. Location: The Venetian hotel in Las Vegas. We were at the EMC World conference, and EMC had just announced the L3 cache feature for Isilon clusters. That sounded like great and exciting news! All you need to do is upgrade OneFS and be happy. No licenses required!
However, real life is not that easy. The first real challenge was to upgrade from OneFS 6.5 to OneFS 7.1 (not a rolling upgrade, by the way) with minimal impact to our services. After that, we still had the "million dollar question" to answer: would it help at all? Considering that we already had more than 15 PB of Isilon deployed, this is actually worth a lot more than a million dollars, which made the challenge even more interesting!
This article is about a real customer scenario supporting real applications; however, details such as the customer name will not be disclosed or may be fictitious. We would rather spend consecutive nights in the datacenter tuning or troubleshooting Isilon clusters (and we do, from time to time ;-) than spend 15 minutes with the legal team trying to identify the applicable terms and conditions and whether or not this or that should be mentioned. Our apologies to any lawyer friends reading this, no offense intended; we just found this way easier ;-)
Getting back to the cache discussion, we had a lot of doubts in our minds and a lot of unanswered questions: What kind of benefit would we get in our environment? Do we need to enable this feature? How do we do it? How many jobs will run? What about network latencies? Will we reduce the I/O load on the disks?

In this article we share our experience enabling L3 cache in a multi-petabyte Isilon cluster. We discuss all the steps, concerns, and activities, and especially the benefits we achieved by using SSDs as L3 cache for metadata acceleration.
Though not the main objective of this article, we will also explain what we did to plan the OneFS upgrade with minimal downtime to our services. Due to changes in the file system storage layout, we would need to re-protect the entire cluster, and that could run for weeks given the amount of data residing in it. We decided to start a new cluster on the new OneFS version and migrate the data to it. We will explore the tests made with two migration tools: rsync and SyncIQ. The first gave us the comfort of an easy rollback plan, while SyncIQ gave us much more throughput, saving a lot of time compared with rsync. We will briefly discuss important aspects to consider when planning Isilon migrations.
Thanks. Enjoy your reading!
Audience
This article is intended for Isilon administrators, storage architects, and other technical professionals who already have some knowledge of or experience managing OneFS and the Isilon architecture. We will discuss existing configurations in our clusters and some design decisions we made to support our workload. Explaining OneFS concepts and implementation is beyond the scope of this article.
Describing the Storage Environment
Previous Environment
In 2008 we were challenged by our company to build a storage infrastructure to support webmail services. The decision at that time was to use COTS hardware and open source software to provide NFS services. By leveraging OpenSolaris and ZFS (Zettabyte File System) we were able to use a mix of SATA and flash drives to accelerate performance and also scale capacity.
We defined a couple of building blocks based on usage and on capacity or performance needs. The performance boxes had about 7.1 TB of usable capacity, while the others had about 30 usable terabytes and served as replication targets and/or backups.
When more space was needed, it was just a matter of deploying a new box and adding its NFS mount points to scale out the environment. We also created automation scripts and standards to expedite the deployment of those new storage units.
We started with a couple of those storage units but had to double that number in less than three months. By the end of the first year we were at about 10 times the first deployment. Long story short, in less than three years we had grown to about 700 storage units supporting over 10 PB of capacity.
We were very proud of our achievement to that point, especially because no one had any idea it would grow so fast! It was also clear that a lot would need to change to continue supporting that growth. As each storage unit is a standalone management point, you can imagine the challenge we were facing. Some problems were eroding our service level agreement (SLA). Crucial issues, such as lack of space on one unit or natural changes in workload behavior like heavy IOPS or write peaks, were forcing us to do numerous data migrations back and forth, and each migration meant a 1-hour downtime for 100,000 email accounts.
Isilon OneFS, a single file system and a single namespace with automated capacity balancing and storage tiers, was absolutely key for our environment. It is a true scale-out solution, able to scale capacity and performance (CPU, network) without losing control of the environment.
POC
It was time to start a POC (proof of concept)! After a lot of preparation and work to understand our application, we started with a 6-node cluster. The initial idea was to have two different pools: one for hot content running on SAS drives and the other running on SATA drives for archiving old data. Our performance tests indicated that, for the webmail workload, each S200 node would support about 100,000 email accounts.
After six months in production, we decided to replace all NL-Series nodes with X-Series nodes, which significantly improved the environment. Using X400 nodes increased the cluster's CPU and memory capacity compared to the NL-Series with the same disk capacity, and added another very important capability: flash drives that enable Global Namespace Acceleration (GNA). Our new scenario became three S200 nodes and three X400 nodes.
New Cluster Architecture
Since the POC was very successful, we decided to build our clusters in what we called
“failure domains”. Each cluster would have:
16 S200 nodes (15+1): to support 1.5 million accounts per cluster
10 X400 nodes: providing 1 PB of storage capacity per cluster
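This sizing follows directly from the POC result of roughly 100,000 accounts per S200 node: 15 active S200 nodes × 100,000 accounts ≈ 1.5 million accounts per cluster, with the sixteenth node providing headroom for failures and maintenance.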
On the application side, we have two types of files: index files and mailboxes. The index files are used to index all messages stored in the mailbox files. All mailboxes are in mdbox format and are configured for a maximum of 10 MB per file. All index files are around 30 KB in size.
On the Isilon side, we have two node types, S200 and X400, which form two separate pools. We call the S200 pool PoolA; it stores all data from the last seven days. The X400 pool is PoolB, which holds all data older than seven days. FilePoolPolicies exist to move these two types of files to the correct pool: all index files are created on PoolA and always stay on PoolA, while mdbox files are created on PoolA and are moved to PoolB once their creation time (ctime) attribute becomes older than seven days.
Using ctime, we guarantee that only the newest files are in PoolA, and depending on PoolA's disk usage we can change the threshold from seven to fifteen days. This strategy gives our customers more performance, since all recent messages are read from SAS drives, while keeping the used space under control.
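To make the tiering rule concrete, the sketch below shows roughly how such a policy can be expressed from the OneFS command line. The policy name, the /ifs/mail path, and the exact filter options are illustrative assumptions (file pool CLI flags vary between OneFS releases), so treat this as an outline of the idea rather than our production policy definition.

# Illustrative sketch only: move mdbox files whose ctime is older than seven
# days to the archive pool (PoolB). Flag names differ between OneFS releases;
# check "isi filepool policies create --help" on your cluster before using.
isi filepool policies create Move_mdbox_to_PoolB \
    --begin-filter --path=/ifs/mail --and --changed-time=7D --operator=gt --end-filter \
    --data-storage-target=PoolB

# List the policies currently defined, and run the SmartPools job that applies them
isi filepool policies list
isi job jobs start SmartPools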
On average, all clusters were running fine, but from time to time we still faced sporadic increases in response time, sometimes severe latency spikes during high-workload periods of the day. After extensive troubleshooting, we found that enabling metadata acceleration on the SSDs was an important change for this workload.
Using SSDs for metadata acceleration helped increase performance and avoid peaks in response time, and it became our standard for all cluster deployments. We also made other tuning adjustments that are outside the scope of this article and would be enough material for another discussion.
Only about 6% of all SSD capacity in the cluster was being used, one of the reasons we were able to consider enabling L3 cache.
When we heard about the new option of using SSDs for L3 cache, we immediately understood that there was a potential benefit for this particular workload. This stimulated the discussion of what would be better: SSDs as L3 cache, SSDs for metadata acceleration, or SSDs to store the application index files? The cluster and pool summary below, captured from the OneFS command line, shows how little of the SSD capacity was actually in use:
Cluster Name: EMAIL-IG-0001
Cluster Health: [ OK ]
Cluster Storage: HDD SSD Storage
Size: 1.3P (1.3P Raw) 28T (28T Raw)
VHS Size: 9.3T
Used: 989T (77%) 1.6T (6%)
Avail: 298T (23%) 26T (94%)
Node Group Name: FAST-01 Protection: +2d:1n w18
Pool Storage: HDD SSD Storage
Size: 212T (214T Raw) 6.5T (6.5T Raw)
VHS Size: 1.5T
Used: 49T (23%) 209G (3%)
Avail: 164T (77%) 6.3T (97%)
Evaluation
The decision was not easy. Out of the 156 nodes we have, 128 belong to the clusters eligible to have L3 enabled. Each of these clusters supports about 150K NFS ops/s, and together they hold about 5 PB of stored user data for more than 8 million email accounts.
At this time, we were using SmartPools policies to maintain hot data on the performance nodes, and all SSDs were being used for metadata acceleration in both tiers. We noticed, however, that the utilization ratio of the SSDs was very low, on average below 6% of capacity. Although performance targets were being met with metadata acceleration, we felt the SSDs were underutilized and that we could extract more performance out of those flash drives.
Since we now had a better understanding of the application and its files, we started to consider whether it would be better to use SSDs as L3 cache, or to allow user data on flash and force all application index files to live on flash from cradle to grave while other files followed SmartPools policies.

To better understand the behavior, we enabled L3 cache on one of our new clusters, since it was just being deployed, and leveraged InsightIQ to analyze cluster performance data. Here are some key findings:
Analyzing Reads
About 63% of all read operations are namespace_reads, which are essentially metadata; each of these operations moves 512 bytes and is served from RAM.
About 21% of read operations are READs of user data; each access to disk is 8 kilobytes. This is where we focused our analysis: these are all disk operations, and the idea is to move this workload off the mechanical disks and onto the SSDs using L3 cache.
Each X400 in our case has 2.4 TB of SSD and each S200 has 400 GB. For the entire cluster, we would have more than 30 TB of SSD available for cache. As this is a distributed system, user and system access is shared among all nodes, regardless of which node holds the data or which port serves the request.
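Taking the node counts described earlier (16 S200 and 10 X400 per cluster), the available cache capacity works out to roughly 10 × 2.4 TB + 16 × 400 GB = 24 TB + 6.4 TB ≈ 30.4 TB of SSD per cluster, which is where the "more than 30 TB" figure comes from.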
Comparing Cluster With and Without L3 Cache
In the following graphics we can see that in cluster PD-01 we have about 80% cache hit in
L3 requesting “user data”; that is, the email messages themselves. While on PD-03, all user
data access is provided by spinning disks.
We found that L1 and L2 usage behavior is much more uniform and stable in the cluster with
L3 cache enabled. To clarify if you are curious, the two drops in L3 cache shown in the
graph are related to data migrations where some services have been stopped.
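If you want to observe these hit rates directly on a node rather than through InsightIQ, OneFS ships a node-local cache statistics utility; the invocation below is how we recall using it, so verify the exact options on your release.

# Run on an Isilon node: prints L1/L2/L3 cache hit and miss counters
# (output format and options may differ between OneFS releases).
isi_cache_stats -v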
The main objective of this paper is to demonstrate that by enabling L3 cache on all SSDs in the cluster, we avoid overloading the SAS drives. Every cache hit is one less I/O going to spinning disks. Achieving such a high L3 hit rate for user data (around 80% in our case) means that the SSDs are delivering data and significantly relieving the SAS back end. The majority of our performance problems were related to the SAS drives being overloaded; now we hope those issues are gone forever.
Enabling L3 Cache
Considering that we are already running a version of OneFS where L3 cache is an option, we now need to think about the impact of enabling it on a running production cluster. If you are not yet running 7.1, please read the Appendix (OneFS Upgrade from 6.5 to 7.1) for additional information on how to migrate with minimal impact.

The Isilon engineering procedure to enable L3 cache mentions that SmartPools and FlexProtect need to be executed in order to free up the SSDs and re-protect data within the cluster. However, this should be taken into serious consideration and planned very carefully.
The email workload we support here tends to be very sensitive to latency, and with our particular workload some Isilon jobs can have long execution times and frequently impact latency and response time.
We have developed a very particular schedule to address all Isilon configuration needs as well as to take and delete snapshots, allow replacement of failed disks, and move data between node pools, among all the other administrative tasks. The SmartPools job, for instance, runs once a week to move data down to the X400 nodes.
When we enable L3, all data on all SSDs in the cluster is flushed out by SmartPools, and at the end FlexProtect orchestrates a controlled smartfail of each SSD out of the storage pool. Since we run SmartPools only once a week, we feared that doing all the data movement already scheduled plus freeing up the SSDs would be too much. In this particular case, we scheduled the SmartPools job to run two days before enabling L3 cache, so it would move the great majority of the user data, leaving less work for the next execution.
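For reference, the overall sequence looks roughly like the sketch below. The pool name is a placeholder and the command syntax is from our recollection of OneFS 7.1.1, so confirm it with the built-in help on your cluster before relying on it.

# 1. Two days ahead: run SmartPools so most scheduled tier moves are already done
isi job jobs start SmartPools

# 2. Enable L3 cache on the node pool (pool name is a placeholder); OneFS then
#    flushes the SSDs and FlexProtect re-protects the data
isi storagepool nodepools modify s200_pool --l3 true

# 3. Follow the running jobs and overall cluster state while the SSDs are freed
isi job jobs list
isi status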
By doing this, we had near zero negative impact in all executions. One FlexProtect job was started for each of the two pools on each cluster. The execution time depends on the load (client access) and on the data volume in each dataset, but we were able to do it over the weekend, when access is lowest.
Execution Times
PoolA (S200): 105 TB of capacity in use, about 4 hours of execution time
PoolB (X400): 1.52 PB of capacity in use, about 16 hours of execution time
We were also worried about how to warm up the new cache with data, since all metadata had previously been accelerated on the SSDs. At first, we developed a number of scripts that would run on the NFS clients and walk the entire filesystem structure, reading everything to /dev/null. The good news is that this was not even necessary: we simply ran another SmartPools session and that was enough to warm all the metadata.
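For reference, a minimal sketch of that kind of warm-up script is shown below; the mount point is a placeholder and the parallel xargs fan-out is just an illustrative choice. As noted above, the extra SmartPools run made this unnecessary.

#!/bin/sh
# Warm the cluster read caches from an NFS client by reading every file once.
# MAIL_MOUNT is a placeholder for the NFS mount point of the dataset to warm.
MAIL_MOUNT=/mnt/mail
find "$MAIL_MOUNT" -type f -print0 \
  | xargs -0 -n 64 -P 8 cat > /dev/null 2>&1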
Results
After the activation process concluded, all SSDs were presented to OneFS as L3 cache and were no longer available as storage drives.

The greatest benefit we obtained is lower latency for NFS operations. For our particular use case, that means a huge service level improvement for the application. Before enabling L3 cache we had an average latency of 72 ms; after enabling it, latency dropped to 54 ms, a very significant 25% improvement in service level.
Latency average: 72.31 ms – before enabling L3 cache
Latency average: approximately 54 ms – after enabling L3 cache
After enabling L3 cache, we see L3 hits rising to about 90%, while the behavior of the L1 and L2 caches became uniform (flat lines instead of recurring spikes).
Comparative Results
The final results in all the graphs shown here are very positive. The cache warms up as clients access the data. We also see a change in the behavior of L1 and L2, which went from oscillating frenetically between 0 and 80% to a much higher and more constant watermark after L3 activation. The first graph shows the activation of the L3 cache; the second graph shows its efficiency.
After enabling L3 cache on another cluster, we saw a decrease in disk IOPS on the X400 nodes while NFS operations maintained the same behavior as before.
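The same comparison can also be watched from the CLI with the OneFS statistics command. The exact flags vary by release, so the two invocations below are only starting points; see their built-in help for the options on your cluster.

# Per-drive operation counters, useful to confirm the drop in X400 disk IOPS
isi statistics drive

# Per-protocol operation rates and latencies, to confirm NFS behavior is unchanged
isi statistics protocol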
Conclusion
After about three months of extensive work, planning, and building a deep understanding of our application and storage environment, we were able to successfully upgrade five email clusters from 6.x to 7.1 by leveraging data migration techniques, and also to enable L3 cache on all of them.
We have increased the service level and lowered response times for the application while decreasing disk utilization in the clusters, making day-to-day operations more comfortable. We may now potentially add more email accounts per cluster and delay additional infrastructure investments.
For our workload in particular, we concluded that enabling L3 cache was a very good use of the SSDs. We encourage you to evaluate your own environment and identify what benefits that configuration might bring to your applications.
Appendix - OneFS Upgrade from 6.5 to 7.1
One of the biggest challenges we faced was the OneFS upgrade from 6.5 to 7.1. The changes in protection groups normally require a huge internal re-layout of all data stored in the cluster. The good news is that Isilon OneFS can do this online and automatically, with no downtime to applications.

The bad news is that, to accomplish this, the FlexProtect job might run for a couple of days in a row. Since our application is very sensitive to latency, this would cause severe performance impact, so we had to try a new approach.

We decided to build a new cluster on version 7.1 and migrate the data from 6.5. Once the old cluster was empty, it would be reinstalled with the new version to support new email accounts.
We studied alternatives for executing the migration, which we briefly explore below.
1-) SyncIQ
Pros: High performance when sending data, due to the configurable number of workers per node.
Cons: The failover/failback options would have been an interesting alternative because they allow a quick rollback in case of problems. However, since we were migrating between different OneFS versions, the failover feature was not available; we would have to break the replication in order to fail over to the second cluster, and that would ruin any rollback plan.
2-) Rsync
Pros: Existing know-how in the application team, which already used it for migrations.
Cons: Low performance.
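For illustration, a typical rsync invocation for this kind of migration looks roughly like the one below; the source and destination paths are placeholders (the exact options we used are not recorded here), and running one instance per dataset in parallel is a common way to compensate for rsync's single-stream performance.

# Copy one dataset from the old cluster mount to the new cluster mount,
# preserving permissions, ownership, and hard links, and deleting files
# removed at the source. A final pass with the application stopped picks
# up the last deltas before cutover.
rsync -aH --numeric-ids --delete /mnt/old-cluster/dataset01/ /mnt/new-cluster/dataset01/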
We decided on SyncIQ because of its ease of use with high data volumes, its great performance, and its file data integrity. Our tests showed SyncIQ to be very reliable and efficient, even when using the graphical interface.
Migration
The scenario required the migration of about 700 TB in 16 datasets of roughly 43 TB each. The biggest risk was a rollback: rolling back a 43 TB dataset would take more than 10 hours, with downtime to the application. By planning accordingly, we migrated 3 datasets per day, with about 10 minutes of data unavailability per dataset and no end-user complaints filed with our helpdesk.
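At that pace, the 16 datasets translate into roughly six migration days (16 datasets / 3 datasets per day ≈ 5.3 days of actual data movement).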
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY
DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.