MICROSOFT CLOUD PLATFORM SYSTEM
STORAGE PERFORMANCE
DAN LOVINGER, KRISHNA NITHIN CHEEMALAVAGUPALLI, DIPAK VADNERE, ANITHA ADUSUMILLI
MARCH 2015
© 2015 Microsoft Corporation. All rights reserved. This document is provided "as-is." Information and views
expressed in this document, including URL and other Internet Web site references, may change without notice. You
bear the risk of using it. This document does not provide you with any legal rights to any intellectual property in any
Microsoft product. You may copy and use this document for your internal, reference purposes. You may modify this
document for your internal, reference purposes.
1 INTRODUCTION

The Microsoft Cloud Platform System1 (CPS) is designed specifically to reduce the complexity and risk of
implementing a self-service cloud. CPS includes all the needed software and hardware for service
providers and enterprises to provide their customers with the self-service offerings they demand. As a
result, these providers can respond quickly to business opportunities—without dealing with the
complexities associated with deploying and operating a cloud.
CPS is an integrated, ready-to-run private cloud solution for your datacenter, powered by Dell hardware
and Microsoft cloud software. CPS maximizes the economic benefits of a software-defined datacenter
when operating cloud services. A layer of software abstraction across the physical layers (storage,
network, and compute) enables separation of the fabric from the tenant services that run on top of it.
Windows Azure Pack provides a consistent self-service approach that is common between Microsoft
Azure and CPS. System Center provides the administrative controls and Windows Server provides the
platform for virtualizing a wide range of computing services. The entire solution is pre-configured before
arriving at your loading dock, offering a turnkey private cloud solution.
This paper provides an overview of storage-focused performance of a single rack CPS stamp in the
following three scenarios, scaled across a deployment of tenant virtual machines (VMs):
Scenario 1. Boot storm: cold start of VMs.
Scenario 2. VM microbenchmarks: synthetic storage loads that are generated within VMs.
Scenario 3. VM database OLTP: simulated database online transaction processing (OLTP) using
Microsoft SQL Server, run within the VMs.
The following is a summary of the results.
- Boot storm: ~1800 Azure A1-sized2 VMs cold started within 150 seconds at a median start time
of 20 seconds/VM
- VM microbenchmarks across 112 Azure A1-sized VMs:
  o 1.01 million 4KiB random read IOPS at an average latency of 0.90 milliseconds (ms), with
44% of SSU connectivity utilized
  o 321,000 mixed 4KiB (70:30) read/write IOPS at average latencies of 2.49ms read and
4.47ms write
- VM SQL Server database OLTP across 84 Azure A4-sized VMs:
  o sustained ~35,000 transactions/second
  o 222,000 total IOPS (182,000 read and 40,000 write) at an average 1.5ms read and 9.0ms
write latency
1 For more information about CPS, see http://www.microsoft.com/cps.
2 For details on Azure VM specifications, see http://azure.microsoft.com/pricing/details/virtual-machines.
2 CPS OVERVIEW

CPS is delivered in units that are referred to as a stamp. A CPS stamp is a complete management and
hosting domain. A single stamp ranges from a minimum of one rack to a maximum of four racks. As
cloud capacity needs grow, you can scale out a smaller stamp to expand the aggregate pool of compute,
storage and network resources.
The contents of each rack are divided into three service units.
- Network scale unit (NSU). The NSU contains the fault tolerant networking fabric.
- Compute scale unit (CSU). The CSU is a dense compute chassis that provides the physical
infrastructure for three service clusters: one each for management services (on the first rack
only), edge connectivity services, and tenant compute services.
- Storage scale unit (SSU). The SSU contains the primary storage in CPS. It consists of four servers
cross-linked to four JBOD chassis. Each JBOD chassis contains 60 disks: 48 hard disk drives
(HDDs) and 12 solid-state drives (SSDs). These are combined into three tiered pools of storage
using the Storage Spaces feature of Windows Server 2012 R2. The servers then provide the
storage as a Scale-Out File Server cluster to the compute servers in the CSU.
Two of the three storage pools are used for tenant workloads. The third pool is used for backup. The
following table provides a detailed description of each pool type in a single SSU.

Storage Pools      Description
Tenant pools (x2)  In each tenant pool: 64 HDDs; 20 SSDs for increased tiering capacity;
                   triple-mirrored virtual disks to provide redundancy and high performance;
                   striped for JBOD failure resilience
Backup pool (x1)   64 HDDs; 8 SSDs; dual-parity virtual disks for increased capacity;
                   striped for JBOD failure resilience
The following figure shows the composition of the SSU on a CPS rack, including the striping of the pools
on the JBODs to enable JBOD enclosure fault tolerance, and the fault tolerant network and storage
connectivity. The SSU is provisioned with 80Gb of Remote Direct Memory Access (RDMA) connectivity
to the CSU (eight 10Gb links), and any given CSU node is capable of 20Gb in isolation. Within the SSU, each
server has four 4x6Gb SAS links; one per JBOD. (You can find additional details about the hardware in
the Appendix.)
Figure 1. CPS rack – showing connectivity and distribution of storage cluster pools and media
[Figure 1 shows: each of the four JBODs contributes 16x HDD + 5x SSD to Tenant Pool 1, 16x HDD + 5x SSD
to Tenant Pool 2, and 16x HDD + 2x SSD to the Backup Pool. The four SSU servers (the Storage Cluster)
connect to the four JBODs over a fault tolerant SAS fabric (16x 6Gb SAS per node), and to the 32 CSU
nodes (Mgmt, Edge, and Compute clusters) over the fault tolerant storage network (2x 10Gb RDMA per
node).]
Tiered storage spaces are a crucial design element for CPS. Storage Spaces periodically analyzes the
access pattern of the tenant workload and moves frequently accessed content to the SSD tier, while
demoting less frequently accessed content to the HDD tier. For any new workload, there is an initial
period during which Storage Spaces acquires this information. Over time, this process ensures that
performance-critical content is served from the fast SSD tier.
For the purpose of this paper, all workloads are shown after this initial period, with the content
optimally located within the tiers3.
3 For more information about tiered storage spaces, see http://technet.microsoft.com/library/dn789160.aspx.
3 BOOT STORM SCENARIO

The boot storm scenario demonstrates the ability of the CPS SSU to sustain the load that is generated by
a rapid boot sequence across a large number of VMs in a stamp that was initially idle. This is intended to
model the scenario of a cold stamp start and the typical start-of-workday boot and logon activity.
The following test environment was used for this scenario:
- Windows Server 2008 R2 guest operating system, A1 size (1.75 GiB RAM, 1 vCore)
- 8.93 GiB operating system VHD utilized, average (dynamic VHD, 20GiB maximum)
- 28 nodes of the CSU were used
- 64 VMs per CSU node (28 x 64 = 1792 VMs total)
- Even distribution of VMs to tenant storage
All VMs had their storage optimized before the boot storm measurements, and were turned off at the
start of the load sequence. Of the average 8.93 GiB that was utilized by the operating system VHD, an
average of 5.96 GiB remained on the HDD and 2.97 GiB was optimized to SSD storage based on access
patterns. In total, approximately 5.2 TiB of SSD storage was used.
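The 5.2 TiB SSD footprint follows directly from the per-VM tiering split quoted above; a minimal sketch of the arithmetic, using only figures from this section:

```python
# Boot storm storage footprint, using the per-VM figures from the text.
VM_COUNT = 1792          # 28 CSU nodes x 64 VMs
VHD_USED_GIB = 8.93      # average space utilized in each OS VHD
SSD_GIB_PER_VM = 2.97    # portion optimized onto the SSD tier
HDD_GIB_PER_VM = 5.96    # portion remaining on the HDD tier

# The per-VM split should account for the whole utilized VHD.
assert abs((SSD_GIB_PER_VM + HDD_GIB_PER_VM) - VHD_USED_GIB) < 0.01

total_ssd_tib = VM_COUNT * SSD_GIB_PER_VM / 1024
print(f"SSD tier footprint: {total_ssd_tib:.1f} TiB")  # ~5.2 TiB, as quoted
```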
To start the boot storm load, each CSU node began a sequence of calls to the Windows PowerShell
Start-VM cmdlet, addressing each VM in turn. Each Start-VM call was observed to have a latency of
approximately two seconds, which reflects the time for the VM to transition to the 'on' state and begin
its boot sequence. (The state transition is the time taken by Hyper-V to allocate physical memory, to
start the VM worker process, and for the worker to provision the virtual hardware resources.) Every two
seconds, another wave of 28 VMs started their boot sequence; one per CSU node.
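This pacing also accounts for the overall completion time reported below; a rough sketch of the timeline math, assuming (as the text implies) that each node works through its 64 VMs at one Start-VM call every two seconds:

```python
# Boot storm pacing: one wave of 28 VMs (one per CSU node) every ~2 seconds.
NODES = 28
VMS_PER_NODE = 64
WAVE_INTERVAL_S = 2      # observed latency of each Start-VM call
BOOT_TIME_S = 20         # typical per-VM time from Start-VM to logon

total_vms = NODES * VMS_PER_NODE                   # 1792 VMs
last_wave_start = (VMS_PER_NODE - 1) * WAVE_INTERVAL_S   # 126 s
est_completion = last_wave_start + BOOT_TIME_S     # ~146 s, consistent with
                                                   # the observed 150 s
print(total_vms, est_completion)
```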
Time measurements began at the Start-VM call. Each VM was configured for automatic log on of a test
user. A startup task in the user’s profile noted the logon time, which indicated completion of the boot
sequence. This reflects the point at which a user or hosted service could have begun operations within
the VM.
The following figure shows a representative sequence on one CSU node.
Figure 2. Representative CSU node boot sequence
The sequencing of VM start times reflects the one-at-a-time nature of the Start-VM loop. Roughly 10-12
VMs are booting at any given time on this node, and boot times are consistent throughout at about 20s
per VM.
The following figure shows the distribution of VM completion times, through logon, for all VMs in the
test (1,792 VMs).
Figure 3. Distribution of VM boot completion times in boot storm scenario
Nearly all VMs (95%) boot within 21 seconds, and all within 24 seconds.
The following figure shows the aggregate I/O load on the SSU through the boot sequence.
Figure 4. Aggregate I/O load on the SSU during boot storm
Notice the following:
- The increase in load as the full intensity of 10-12 VMs per node sets in (at around the 30 second
mark)
- The load is sustained until boot completion for all VMs occurs by the 150 second mark (2
minutes, 30 seconds)
The SSU in aggregate is sustaining 310,000 - 320,000 combined IOPS during the sequence. The majority
of the IOPS are coming from the optimized storage on the SSD tier.
The boot storm scenario shows the capability of CPS to sustain a burst workload that is part of the
normal rhythm of a cloud platform.
Summary of results:
- All 1,792 VMs booted in 150 seconds
- 95% of VMs booted within 21 seconds of the off/on transition
- 310,000 - 320,000 combined IOPS were sustained from the CPS SSU
4 SCALED VM MICROBENCHMARK SCENARIO

The scaled VM microbenchmark scenario tested synthetic storage loads (generated within VMs) on the
SSU.
The design of this microbenchmark included VMs that were producing a series of high intensity loads.
Compared to the boot storm load, a much smaller number of VMs were used in the test. This is primarily
because the common method used here for storage characterization maintains a constant number of
outstanding I/Os per load generator. This method quickly produces a relatively high load when scaled
across a number of VMs.
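The constant-outstanding-I/O method ties throughput, concurrency, and latency together through Little's law (IOPS = outstanding I/Os / average latency); a quick sanity check using the 4KiB random read figures reported later in this section:

```python
# Little's law sanity check: IOPS = outstanding I/Os / average latency.
# Figures from the 4KiB random read result reported in this section.
outstanding = 896            # aggregate request depth (112 VMs x 8 per VM)
avg_latency_s = 0.90e-3      # measured average read latency (0.90 ms)

predicted_iops = outstanding / avg_latency_s
print(f"{predicted_iops:,.0f} IOPS")   # ~995,000, within ~2% of the measured
                                       # 1.01 million IOPS
```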
The following test environment was used for this microbenchmark scenario:
- Windows Server 2012 R2 guest operating system, A1 VM size (1.75 GiB RAM, 1 vCore)
- 14 nodes of the CSU were used
- 8 VMs per CSU node (14 x 8 = 112 VMs total)
- 5GiB load file per VM
- Even distribution of VMs to tenant storage
- Two load mixes: 100% random reads, and a 70:30 ratio of read/write
- All loads driven to the SSD tier (no HDD)
The DISKSPD load generator (version 2.0.14) was used to drive I/O load, using a single thread per VM.
DISKSPD is an open-source tool available at the following locations:
- Binary release: http://aka.ms/diskspd
- Source: https://github.com/microsoft/diskspd
In the following figure, a sequence of random 4KiB read tests is driven to the SSU, starting at one
outstanding operation per VM and scaling to 32 outstanding.
Figure 5. Driving 1.01 million random read IOPS from the CPS SSU
At an aggregate queue depth of 896 (eight per VM), the SSU reaches 1.01 million IOPS and a 0.90ms
average latency. The way that latency changes shows that this point may be close to optimal from a
total performance perspective. At the next increment of outstanding I/O, latency increases significantly.
At this point, the 95th percentile latency is 1.62ms.
Another way to look at this result is through the provisioned connectivity between the SSU and CSU
nodes. The 1.01 million 4KiB I/Os represent utilization of approximately 44% of the eight 10Gb RDMA
links. This is fault-tolerant performance that would still be available in degraded operation caused by
switch servicing or other temporary conditions.
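The quoted 44% can be approximated from payload throughput alone; a minimal sketch (payload only, so the small gap up to the quoted on-wire 44% is attributed here, as an assumption, to SMB and network framing overhead):

```python
# Approximate utilization of the eight 10Gb RDMA links from payload alone.
iops = 1_013_721
io_size_bytes = 4096                  # 4KiB reads
links, link_gbps = 8, 10

payload_gbps = iops * io_size_bytes * 8 / 1e9    # ~33.2 Gb/s of payload
utilization = payload_gbps / (links * link_gbps)
print(f"{utilization:.0%}")   # ~42% payload-only; protocol framing plausibly
                              # accounts for the quoted ~44% on the wire
```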
A 70:30 ratio of small reads to writes is a common reference for many mixed loads, including the OLTP
workload of a database. Using a 4KiB I/O size, the following two figures show the load in the SSU.
Figure 6. Driving 321,000 combined mixed 4KiB IOPS from the SSU
Figure 7. Average latency optimum occurring at 896 aggregate outstanding operations
Again, looking for the increase in latency, the SSU reaches approximately 321,000 combined IOPS to the
SSD under this load at 896 outstanding operations. The 95th percentile latencies at this point are 4.97ms
for reads and 37.4ms for writes. Note that while increased IOPS are available at higher request depths,
the average latency increase may be significant for a production workload.
These microbenchmark results set a good baseline for the performance of the CPS SSU.
Microbenchmark results are a common reference for storage subsystems, but do not fully reflect the
way in which latency may feed back into the sequence of real workload patterns of access. In the next
section, results from a true end-to-end production workload are shown, which combine all of these
effects.
Summary of results (in aggregate from 112 VMs, end-to-end):
- Random 4KiB reads: 1.01 million IOPS at 0.90ms average latency, 1.62ms at the 95th percentile
- Random 70:30 4KiB read/write: 321,000 IOPS at 2.49ms read and 4.47ms write average latency,
with 4.97ms and 37.4ms 95th percentile latency, respectively
5 SCALED VM OLTP SCENARIO

In this scenario, the performance of a simulated database online transaction processing (OLTP) workload
using Microsoft SQL Server (on VMs) was tested.
The value of using an end-to-end SQL Server workload is to demonstrate performance with a complex,
latency-sensitive load. The overall transaction rate is a function of read latency to fill buffer pool misses,
the write latency of the buffer pool lazy writer committing dirty pages for completed transactions, and
latency of the transaction log I/O itself. The way these latencies combine into steady-state performance
complements the microbenchmark results from the earlier section, and together they more effectively
demonstrate the capabilities of the system.
The following test environment was used for the OLTP benchmark scenario:
- Windows Server 2012 guest operating system, A4 VM size (14 GiB RAM, 8 vCores)
- 23 nodes of the CSU were used
- 84 VMs total (15 CSU nodes with 4 VMs, and 8 CSU nodes with 3 VMs)
- Even distribution to tenant storage (6 VMs per each of the 14 tenant shares)
- Microsoft SQL Server 2012, 2GiB buffer pool
- OLTP database of 50GiB with a 20GiB working set per VM
Load was applied for one hour. The OLTP workload was delivered by a lightweight framework which
modeled 330 independent streams of user activity per SQL Server database. The buffer pool limitation
to 2GiB was designed to amplify the storage dependency. With an unconstrained buffer pool, the
system could have driven more absolute transactions per second, but would not have delivered
significant I/O load because far more of the database content would have been cached on the CSU
nodes.
The success criterion for this load was that the 90th percentile transaction latency should not exceed
one second. Adjusting for the latency-to-rate tradeoff, in this case the framework reached saturation
with the most complex transaction at 110ms.
The results are shown in the following figures.
Figure 8. Approximately 35,000 transactions/second OLTP in steady state
Figure 9. 222,000 IOPS (182K read, 40K write) at average latencies of 1.5ms read and 9.0ms write
The load runs through a warmup phase and transitions to steady state between the five and ten minute
marks. One notable aspect of warmup is the initial jump in read latency, which results from the larger
I/Os used to populate the initially empty buffer pool caches in the set of VMs. The aggregated results
quoted are from the final half of the run, where nominal performance has been achieved.
Note that the I/O sizes for database file I/O are in units of 8K buffer pool pages. This is double the size of
the microbenchmark results shown in Section 4, and represents the smallest possible I/O that will be
generated for a database file. (The database will frequently gather multiple pages into single I/O
operations.) The steady state 222,000 IOPS of the OLTP load represents this more complex mix4.
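One derived metric worth noting from these steady-state figures is the storage cost per transaction; a small sketch of the arithmetic, using only the aggregate numbers quoted in this section:

```python
# Storage cost per OLTP transaction at steady state.
transactions_per_s = 34_960
read_iops, write_iops = 182_000, 40_000
total_iops = read_iops + write_iops   # 222,000

ios_per_tx = total_iops / transactions_per_s
read_share = read_iops / total_iops
print(f"{ios_per_tx:.1f} I/Os per transaction, {read_share:.0%} reads")
# ~6.4 I/Os per transaction, ~82% reads
```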
Summary of results (in aggregate across the 84 SQL Server instances for the final 30 minutes of steady
state):
- 34,960 transactions/s, 416 transactions/s/instance (consistent)
- 222,000 IOPS in a mix of 182,000 reads/s and 40,000 writes/s
- 1.5ms average read and 9.0ms average write latencies
4 See the following paper (focused on SQL OLTP Performance over SMB) for a deeper structural analysis of the OLTP storage load: http://www.microsoft.com/en-us/download/details.aspx?id=36793.
6 CONCLUSION

CPS, built in partnership with Dell, is Microsoft's response to the software-defined datacenter. It offers
an integrated hardware and software stack for a turnkey private cloud. CPS is an Azure-consistent
cloud-in-a-box solution that delivers the performance outlined in this paper.
The results that are outlined in this paper show the capabilities of the storage and networking platform
that underlies CPS, and demonstrate that CPS can host intense production workloads at scale.
Summary of results:
- Boot storm. In this scenario of a 9:00 AM cold start of a single rack CPS stamp, ~1800 Azure A1-
sized VMs cold started within 150 seconds at a median start time of 20 seconds/VM.
- VM microbenchmarks. This scenario demonstrated the basic scaling of randomized loads across
112 Azure A1-sized VMs.
  o 1.01 million 4KiB random read IOPS at an average latency of 0.90ms, representing
approximately 44% of the eight 10Gb RDMA links. This is fault-tolerant performance
that would still be available in degraded operation caused by switch servicing or other
temporary conditions.
  o 321,000 mixed 4KiB (70:30) read/write IOPS at average latencies of 2.49ms read and
4.47ms write
- VM SQL Server database OLTP. This scenario demonstrated a broadly-scaled set of moderate
intensity OLTP loads, aggregated across 84 Azure A4-sized VMs.
  o sustained ~35,000 transactions/second
  o 222,000 total IOPS (182,000 read and 40,000 write) at an average 1.5ms read and 9.0ms
write latency
7 ACKNOWLEDGEMENTS

Thanks to the following people for their assistance and work behind this paper, and in particular to
Dipak Vadnere and Krishna Nithin Cheemalavagupalli for building the infrastructure to scale individual
workloads across a full deployment of VMs.
Jim Pinkerton, Spencer Shepler, Tanmay Waghmare
8 APPENDIX

Rack manifest

A single rack CPS deployment was used for this work, in retail production configuration. The CPS rack
manifest is summarized in the following table.

Network
- 6x Dell Force10 S4810P (48x 10GbE): 2x for Edge in 1st rack, 2x for Tenant, 2x for CSU to SSU.
  Note: Edge and Tenant networks do not carry storage traffic.
- 1x Dell Force10 S55 (48x 1GbE): chassis management network

Compute scale unit
- 8x Dell PowerEdge C6220ii (4x nodes per chassis)
- Node: dual socket Intel IvyBridge E5-2650v2 (2.6GHz 8c16t), 256GiB DRAM, 1 local 200GB SSD
  (boot/paging), 2x 10GbE Mellanox ConnectX-3 (Tenant), 2x 10GbE Chelsio T520-CR (CSU to SSU,
  iWARP/RDMA)
- Node distribution: 2x Edge cluster, 6x Mgmt in 1st rack, 24x Tenant in 1st rack

Storage scale unit
- 4x Dell PowerEdge R620v2: dual socket Intel IvyBridge E5-2650v2 (2.6GHz 8c16t), 128GiB DRAM,
  1 local 200GB SSD (boot/paging), 2x LSI 9207-8e SAS controllers (JBOD connectivity), 2x 10GbE
  Chelsio T520-CR (CSU to SSU, iWARP/RDMA)
- 4x Dell PowerVault MD3060e JBODs (60 bay); each JBOD: 48x 7.2K RPM 4TB HDD, 12x 800GB SSD
- Tenant workload pools (2): 128x HDD and 40x SSD; backup pool: 64x HDD and 8x SSD