MICROSOFT CLOUD PLATFORM SYSTEM
STORAGE PERFORMANCE
DAN LOVINGER, KRISHNA NITHIN CHEEMALAVAGUPALLI, DIPAK VADNERE, ANITHA ADUSUMILLI
MARCH 2015
© 2015 Microsoft Corporation. All rights reserved. This document is provided "as-is." Information and views
expressed in this document, including URL and other Internet Web site references, may change without notice. You
bear the risk of using it. This document does not provide you with any legal rights to any intellectual property in any
Microsoft product. You may copy and use this document for your internal, reference purposes. You may modify this
document for your internal, reference purposes.
1 INTRODUCTION

The Microsoft Cloud Platform System1 (CPS) is designed specifically to reduce the complexity and risk of
implementing a self-service cloud. CPS includes all the needed software and hardware for service
providers and enterprises to provide their customers with the self-service offerings they demand. As a
result, these providers can respond quickly to business opportunities—without dealing with the
complexities associated with deploying and operating a cloud.
CPS is an integrated, ready-to-run private cloud solution for your datacenter, powered by Dell hardware
and Microsoft cloud software. CPS maximizes the economic benefits of a software-defined datacenter
when operating cloud services. A layer of software abstraction across the physical layers (storage,
network, and compute) enables separation of the fabric from the tenant services that run on top of it.
Windows Azure Pack provides a consistent self-service approach that is common between Microsoft
Azure and CPS. System Center provides the administrative controls and Windows Server provides the
platform for virtualizing a wide range of computing services. The entire solution is pre-configured before
arriving at your loading dock, offering a turnkey private cloud solution.
This paper provides an overview of storage-focused performance of a single rack CPS stamp in the
following three scenarios, scaled across a deployment of tenant virtual machines (VMs):
Scenario 1. Boot storm: cold start of VMs.
Scenario 2. VM microbenchmarks: synthetic storage loads that are generated within VMs.
Scenario 3. VM database OLTP: simulated database online transaction processing (OLTP) using
Microsoft SQL Server, run within the VMs.
The following is a summary of the results.
- Boot storm: ~1800 Azure A1-sized2 VMs cold started within 150 seconds at a median start time
of 20 seconds/VM
- VM microbenchmarks across 112 Azure A1-sized VMs:
  o 1.01 million 4KiB random read IOPS at an average latency of 0.90 milliseconds (ms), with
44% of SSU connectivity utilized
  o 321,000 mixed 4KiB (70:30) read/write IOPS at average latencies of 2.49ms read and
4.47ms write
- VM SQL Server database OLTP across 84 Azure A4-sized VMs:
  o sustained ~35,000 transactions/second
  o 222,000 total IOPS (182,000 read and 40,000 write) at an average 1.5ms read and 9.0ms
write latency
1 For more information about CPS, see http://www.microsoft.com/cps.
2 For details on Azure VM specifications, see http://azure.microsoft.com/pricing/details/virtual-machines.
2 CPS OVERVIEW

CPS is delivered in units that are referred to as a stamp. A CPS stamp is a complete management and
hosting domain. A single stamp ranges from a minimum of one rack to a maximum of four racks. As
cloud capacity needs grow, you can scale out a smaller stamp to expand the aggregate pool of compute,
storage and network resources.
The contents of each rack are divided into three service units.
- Network scale unit (NSU). The NSU contains the fault tolerant networking fabric.
- Compute scale unit (CSU). The CSU is a dense compute chassis that provides the physical
infrastructure for three service clusters: one each for management services (on the first rack
only), edge connectivity services, and tenant compute services.
- Storage scale unit (SSU). The SSU contains the primary storage in CPS. It consists of four servers
cross-linked to four JBOD chassis. Each JBOD chassis contains 60 disks: 48 hard disk drives
(HDDs) and 12 solid-state drives (SSDs). These are combined into three tiered pools of storage
using the Storage Spaces feature of Windows Server 2012 R2. The servers then provide the
storage as a Scale-Out File Server cluster to the compute servers in the CSU.
Two of the three storage pools are used for tenant workloads. The third pool is used for backup. The
following table provides a detailed description of each pool type in a single SSU.

Storage Pools      Description
Tenant pools (x2)  In each tenant pool: 64 HDDs; 20 SSDs for increased tiering capacity;
                   triple-mirrored virtual disks to provide redundancy and high performance;
                   striped for JBOD failure resilience
Backup pool (x1)   64 HDDs; 8 SSDs; dual-parity virtual disks for increased capacity;
                   striped for JBOD failure resilience
The following figure shows the composition of the SSU on a CPS rack, including the striping of the pools
on the JBODs to enable JBOD enclosure fault tolerance, and the fault tolerant network and storage
connectivity. The SSU is provisioned with 80Gb of Remote Direct Memory Access (RDMA) connectivity
to the CSU (eight 10Gb links), and any given CSU node is capable of 20Gb in isolation. Within the SSU, each
server has four 4x6Gb SAS links; one per JBOD. (You can find additional details about the hardware in
the Appendix.)
Figure 1. CPS rack – showing connectivity and distribution of storage cluster pools and media
[Figure 1 shows: each of the four JBODs contributes 16x HDD + 5x SSD to Tenant Pool 1, 16x HDD + 5x SSD
to Tenant Pool 2, and 16x HDD + 2x SSD to the Backup Pool. The four SSU servers (the Storage Cluster)
connect to the four JBODs over a fault tolerant SAS fabric (16x 6Gb SAS per node), and to the 32 CSU
nodes (Mgmt, Edge, and Compute clusters) over the fault tolerant storage network (2x 10Gb RDMA per
node).]
Tiered storage spaces are a crucial design element for CPS. Storage Spaces periodically analyzes the
access pattern of the tenant workload and moves frequently accessed content to the SSD tier, while
demoting less frequently accessed content to the HDD tier. For any new workload, there is an initial
period during which Storage Spaces acquires this information. Over time, this process ensures that
performance-critical content is served from the fast SSD tier.
For the purpose of this paper, all workloads are shown after this initial period, with the content
optimally located within the tiers3.
3 For more information about tiered storage spaces, see http://technet.microsoft.com/library/dn789160.aspx.
3 BOOT STORM SCENARIO

The boot storm scenario demonstrates the ability of the CPS SSU to sustain the load that is generated by
a rapid boot sequence across a large number of VMs in a stamp that was initially idle. This is intended to
model the scenario of a cold stamp start and the typical start-of-workday boot and logon activity.
The following test environment was used for this scenario:
- Windows Server 2008 R2 guest operating system, A1 size (1.75 GiB RAM, 1 vCore)
- 8.93 GiB operating system VHD utilized, average (dynamic VHD, 20GiB maximum)
- 28 nodes of the CSU were used
- 64 VMs per CSU node (28 x 64 = 1792 VMs total)
- Even distribution of VMs to tenant storage
All VMs had their storage optimized before the boot storm measurements, and were turned off at the
start of the load sequence. Of the average 8.93 GiB that was utilized by the operating system VHD, an
average of 5.96 GiB remained on the HDD and 2.97 GiB was optimized to SSD storage based on access
patterns. In total, approximately 5.2 TiB of SSD storage was used.
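The 5.2 TiB SSD footprint follows directly from the per-VM tiering split quoted above; a minimal sketch of the arithmetic, using only figures from this section:

```python
# Boot storm storage footprint, using the per-VM figures from the text.
VM_COUNT = 1792          # 28 CSU nodes x 64 VMs
VHD_USED_GIB = 8.93      # average space utilized in each OS VHD
SSD_GIB_PER_VM = 2.97    # portion optimized onto the SSD tier
HDD_GIB_PER_VM = 5.96    # portion remaining on the HDD tier

# The per-VM split should account for the whole utilized VHD.
assert abs((SSD_GIB_PER_VM + HDD_GIB_PER_VM) - VHD_USED_GIB) < 0.01

total_ssd_tib = VM_COUNT * SSD_GIB_PER_VM / 1024
print(f"SSD tier footprint: {total_ssd_tib:.1f} TiB")  # ~5.2 TiB, as quoted
```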
To start the boot storm load, each CSU node began a sequence of calls to the Windows PowerShell
Start-VM cmdlet, addressing each VM in turn. Each Start-VM call was observed to have a latency of
approximately two seconds, which reflects the time for the VM to transition to the 'on' state and begin
its boot sequence. (The state transition is the time taken by Hyper-V to allocate physical memory, to
start the VM worker process, and for the worker to provision the virtual hardware resources.) Every two
seconds, another wave of 28 VMs started their boot sequence; one per CSU node.
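This pacing also accounts for the overall completion time reported below; a rough sketch of the timeline math, assuming (as the text implies) that each node works through its 64 VMs at one Start-VM call every two seconds:

```python
# Boot storm pacing: one wave of 28 VMs (one per CSU node) every ~2 seconds.
NODES = 28
VMS_PER_NODE = 64
WAVE_INTERVAL_S = 2      # observed latency of each Start-VM call
BOOT_TIME_S = 20         # typical per-VM time from Start-VM to logon

total_vms = NODES * VMS_PER_NODE                   # 1792 VMs
last_wave_start = (VMS_PER_NODE - 1) * WAVE_INTERVAL_S   # 126 s
est_completion = last_wave_start + BOOT_TIME_S     # ~146 s, consistent with
                                                   # the observed 150 s
print(total_vms, est_completion)
```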
Time measurements began at the Start-VM call. Each VM was configured for automatic log on of a test
user. A startup task in the user’s profile noted the logon time, which indicated completion of the boot
sequence. This reflects the point at which a user or hosted service could have begun operations within
the VM.
The following figure shows a representative sequence on one CSU node.
Figure 2. Representative CSU node boot sequence
The sequencing of VM start times reflects the one-at-a-time nature of the Start-VM loop. Roughly 10-12
VMs are booting at any given time on this node, and boot times are consistent throughout at about 20s
per VM.
The following figure shows the distribution of VM completion times, through logon, for all VMs in the
test (1,792 VMs).
Figure 3. Distribution of VM boot completion times in boot storm scenario
Nearly all VMs (95%) boot within 21 seconds, and all within 24 seconds.
The following figure shows the aggregate I/O load on the SSU through the boot sequence.
Figure 4. Aggregate I/O load on the SSU during boot storm
Notice the following:
- The increase in load as the full intensity of 10-12 VMs per node sets in (at around the 30 second
mark)
- The load is sustained until boot completion for all VMs occurs by the 150 second mark (2
minutes, 30 seconds)
The SSU in aggregate is sustaining 310,000 - 320,000 combined IOPS during the sequence. The majority
of the IOPS are coming from the optimized storage on the SSD tier.
The boot storm scenario shows the capability of CPS to sustain a burst workload that is part of the
normal rhythm of a cloud platform.
Summary of results:
- All 1,792 VMs booted in 150 seconds
- 95% of VMs booted within 21 seconds of the off/on transition
- 310,000 - 320,000 combined IOPS were sustained from the CPS SSU
4 SCALED VM MICROBENCHMARK SCENARIO

The scaled VM microbenchmark scenario tested synthetic storage loads (generated within VMs) on the
SSU.
The design of this microbenchmark included VMs that were producing a series of high intensity loads.
Compared to the boot storm load, a much smaller number of VMs were used in the test. This is primarily
because the common method used here for storage characterization maintains a constant number of
outstanding I/Os per load generator. This method quickly produces a relatively high load when scaled
across a number of VMs.
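The constant-outstanding-I/O method ties throughput, concurrency, and latency together through Little's law (IOPS = outstanding I/Os / average latency); a quick sanity check using the 4KiB random read figures reported later in this section:

```python
# Little's law sanity check: IOPS = outstanding I/Os / average latency.
# Figures from the 4KiB random read result reported in this section.
outstanding = 896            # aggregate request depth (112 VMs x 8 per VM)
avg_latency_s = 0.90e-3      # measured average read latency (0.90 ms)

predicted_iops = outstanding / avg_latency_s
print(f"{predicted_iops:,.0f} IOPS")   # ~995,000, within ~2% of the measured
                                       # 1.01 million IOPS
```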
The following test environment was used for this microbenchmark scenario:
- Windows Server 2012 R2 guest operating system, A1 VM size (1.75 GiB RAM, 1 vCore)
- 14 nodes of the CSU were used
- 8 VMs per CSU node (14 x 8 = 112 VMs total)
- 5GiB load file per VM
- Even distribution of VMs to tenant storage
- Two load mixes: 100% random reads, and a 70:30 ratio of read/write
- All loads driven to the SSD tier (no HDD)
The DISKSPD load generator (version 2.0.14) was used to drive I/O load, using a single thread per VM.
DISKSPD is an open-source tool available at the following locations:
- Binary release: http://aka.ms/diskspd
- Source: https://github.com/microsoft/diskspd
In the following figure, a sequence of random 4KiB read tests is driven to the SSU, starting at one
outstanding operation per VM and scaling to 32 outstanding.
Figure 5. Driving 1.01 million random read IOPS from the CPS SSU
At an aggregate queue depth of 896 (eight per VM), the SSU reaches 1.01 million IOPS and a 0.90ms
average latency. The way that latency changes shows that this point may be close to optimal from a
total performance perspective. At the next increment of outstanding I/O, latency increases significantly.
At this point, the 95th percentile latency is 1.62ms.
Another way to look at this result is through the provisioned connectivity between the SSU and CSU
nodes. The 1.01 million 4KiB I/Os represent utilization of approximately 44% of the eight 10Gb RDMA
links. This is fault-tolerant performance that would still be available in degraded operation caused by
switch servicing or other temporary conditions.
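The quoted 44% can be approximated from payload throughput alone; a minimal sketch (payload only, so the small gap up to the quoted on-wire 44% is attributed here, as an assumption, to SMB and network framing overhead):

```python
# Approximate utilization of the eight 10Gb RDMA links from payload alone.
iops = 1_013_721
io_size_bytes = 4096                  # 4KiB reads
links, link_gbps = 8, 10

payload_gbps = iops * io_size_bytes * 8 / 1e9    # ~33.2 Gb/s of payload
utilization = payload_gbps / (links * link_gbps)
print(f"{utilization:.0%}")   # ~42% payload-only; protocol framing plausibly
                              # accounts for the quoted ~44% on the wire
```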
A 70:30 ratio of small reads to writes is a common reference for many mixed loads, including the OLTP
workload of a database. Using a 4KiB I/O size, the following two figures show the load in the SSU.
Figure 6. Driving 321,000 combined mixed 4KiB IOPS from the SSU
Figure 7. Average latency optimum occurring at 896 aggregate outstanding operations
Again, looking for the increase in latency, the SSU reaches approximately 321,000 combined IOPS to the
SSD under this load at 896 outstanding operations. The 95th percentile latencies at this point are 4.97ms
for reads and 37.4ms for writes. Note that while increased IOPS are available at higher request depths,
the average latency increase may be significant for a production workload.
These microbenchmark results set a good baseline for the performance of the CPS SSU.
Microbenchmark results are a common reference for storage subsystems, but do not fully reflect the
way in which latency may feed back into the sequence of real workload patterns of access. In the next
section, results from a true end-to-end production workload are shown, which combine all of these
effects.
Summary of results (in aggregate from 112 VMs, end-to-end):
- Random 4KiB reads: 1.01 million IOPS at 0.90ms average latency, 1.62ms at the 95th percentile
- Random 70:30 4KiB read/write: 321,000 IOPS at 2.49ms read and 4.47ms write average latency,
with 4.97ms and 37.4ms 95th percentile latency, respectively
5 SCALED VM OLTP SCENARIO

In this scenario, the performance of a simulated database online transaction processing (OLTP) workload
using Microsoft SQL Server (on VMs) was tested.
The value of using an end-to-end SQL Server workload is to demonstrate performance with a complex,
latency-sensitive load. The overall transaction rate is a function of read latency to fill buffer pool misses,
the write latency of the buffer pool lazy writer committing dirty pages for completed transactions, and
latency of the transaction log I/O itself. The way these latencies combine into steady-state performance
complements the microbenchmark results from the earlier section, and together they more effectively
demonstrate the capabilities of the system.
The following test environment was used for the OLTP benchmark scenario:
- Windows Server 2012 guest operating system, A4 VM size (14 GiB RAM, 8 vCores)
- 23 nodes of the CSU were used
- 84 VMs total (15 CSU nodes with 4 VMs, and 8 CSU nodes with 3 VMs)
- Even distribution to tenant storage (6 VMs per each of the 14 tenant shares)
- Microsoft SQL Server 2012, 2GiB buffer pool
- OLTP database of 50GiB with a 20GiB working set per VM
Load was applied for one hour. The OLTP workload was delivered by a lightweight framework which
modeled 330 independent streams of user activity per SQL Server database. The buffer pool limitation
to 2GiB was designed to amplify the storage dependency. With an unconstrained buffer pool, the
system could have driven more absolute transactions per second, but would not have delivered
significant I/O load because far more of the database content would have been cached on the CSU
nodes.
The success criterion for this load was that the 90th percentile transaction latency should not exceed
one second. Adjusting for the latency-to-rate tradeoff, in this case the framework reached saturation
with the most complex transaction at 110ms.
The results are shown in the following figures.
Figure 8. Approximately 35,000 transactions/second OLTP in steady state
Figure 9. 222,000 IOPS (182K read, 40K write) at average latencies of 1.5ms read and 9.0ms write
The load runs through a warmup phase and transitions to steady state between the five and ten minute
marks. One notable aspect of warmup is the initial jump in read latency, which results from the larger
I/Os used to populate the initially empty buffer pool caches in the set of VMs. The aggregated results
quoted are from the final half of the run, where nominal performance has been achieved.
Note that the I/O sizes for database file I/O are in units of 8K buffer pool pages. This is double the size of
the microbenchmark results shown in Section 4, and represents the smallest possible I/O that will be
generated for a database file. (The database will frequently gather multiple pages into single I/O
operations.) The steady state 222,000 IOPS of the OLTP load represents this more complex mix4.
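One derived metric worth noting from these steady-state figures is the storage cost per transaction; a small sketch of the arithmetic, using only the aggregate numbers quoted in this section:

```python
# Storage cost per OLTP transaction at steady state.
transactions_per_s = 34_960
read_iops, write_iops = 182_000, 40_000
total_iops = read_iops + write_iops   # 222,000

ios_per_tx = total_iops / transactions_per_s
read_share = read_iops / total_iops
print(f"{ios_per_tx:.1f} I/Os per transaction, {read_share:.0%} reads")
# ~6.4 I/Os per transaction, ~82% reads
```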
Summary of results (in aggregate across the 84 SQL Server instances for the final 30 minutes of steady
state):
- 34,960 transactions/s, 416 transactions/s/instance (consistent)
- 222,000 IOPS in a mix of 182,000 reads/s and 40,000 writes/s
- 1.5ms average read and 9.0ms average write latencies
4 See the following paper (focused on SQL OLTP Performance over SMB) for a deeper structural analysis of the OLTP storage load: http://www.microsoft.com/en-us/download/details.aspx?id=36793.
6 CONCLUSION

CPS, built in partnership with Dell, is Microsoft's response to the software-defined datacenter. It offers
an integrated hardware and software stack for a turnkey private cloud. CPS is an Azure-consistent
cloud-in-a-box solution that delivers the performance outlined in this paper.
The results that are outlined in this paper show the capabilities of the storage and networking platform
that underlies CPS, and demonstrate that CPS can host intense production workloads at scale.
Summary of results:
- Boot storm. In this scenario of a 9:00 AM cold start of a single rack CPS stamp, ~1800 Azure A1-
sized VMs cold started within 150 seconds at a median start time of 20 seconds/VM.
- VM microbenchmarks. This scenario demonstrated the basic scaling of randomized loads across
112 Azure A1-sized VMs.
  o 1.01 million 4KiB random read IOPS at an average latency of 0.90ms, representing
approximately 44% of the eight 10Gb RDMA links. This is fault-tolerant performance
that would still be available in degraded operation caused by switch servicing or other
temporary conditions.
  o 321,000 mixed 4KiB (70:30) read/write IOPS at average latencies of 2.49ms read and
4.47ms write
- VM SQL Server database OLTP. This scenario demonstrated a broadly-scaled set of moderate
intensity OLTP loads, aggregated across 84 Azure A4-sized VMs.
  o sustained ~35,000 transactions/second
  o 222,000 total IOPS (182,000 read and 40,000 write) at an average 1.5ms read and 9.0ms
write latency
7 ACKNOWLEDGEMENTS

Thanks to the following people for their assistance and work behind this paper, and in particular to
Dipak Vadnere and Krishna Nithin Cheemalavagupalli for building the infrastructure to scale individual
workloads across a full deployment of VMs.
Jim Pinkerton, Spencer Shepler, Tanmay Waghmare
8 APPENDIX

Rack manifest

A single rack CPS deployment was used for this work, in retail production configuration. The CPS rack
manifest is summarized in the following table.

Network
- 6x Dell Force10 S4810P (48x 10GbE): 2x for Edge in 1st rack, 2x for Tenant, 2x for CSU to SSU.
  Note: Edge and Tenant networks do not carry storage traffic.
- 1x Dell Force10 S55 (48x 1GbE): chassis management network

Compute scale unit
- 8x Dell PowerEdge C6220ii (4x nodes per chassis)
- Node: dual socket Intel IvyBridge E5-2650v2 (2.6GHz 8c16t), 256GiB DRAM, 1 local 200GB SSD
  (boot/paging), 2x 10GbE Mellanox ConnectX-3 (Tenant), 2x 10GbE Chelsio T520-CR (CSU to SSU,
  iWARP/RDMA)
- Node distribution: 2x Edge cluster, 6x Mgmt in 1st rack, 24x Tenant in 1st rack

Storage scale unit
- 4x Dell PowerEdge R620v2: dual socket Intel IvyBridge E5-2650v2 (2.6GHz 8c16t), 128GiB DRAM,
  1 local 200GB SSD (boot/paging), 2x LSI 9207-8e SAS controllers (JBOD connectivity), 2x 10GbE
  Chelsio T520-CR (CSU to SSU, iWARP/RDMA)
- 4x Dell PowerVault MD3060e JBODs (60 bay); each JBOD: 48x 7.2K RPM 4TB HDD, 12x 800GB SSD
- Tenant workload pools (2): 128x HDD and 40x SSD; backup pool: 64x HDD and 8x SSD