VMware vSphere 5 Design Best Practice Guide

www.viKernel.com, Gareth Hogarth

Advisory Information: This document has been put together using VMware vSphere 5 best practice white papers; in most cases the information is simply copied and pasted. These are my notes used in preparation for the VCAP5-DCD exam. Please note that I may have intentionally not included all information from the white papers, just pertinent information relating to areas that I feel should provide the appropriate knowledge of vSphere components. Use at own risk. This material is VMware Copyrighted. The information is publicly available from pubs.vmware.com.

Purpose: The purpose of this document is to provide the necessary supporting knowledge for the VCAP5-DCD exam. The information provided is not intended as a comprehensive study guide, but can be used to assist you with some of the topics highlighted in the exam blueprint.

Version: 1.0
Date: 14/05/2014
Author: Gareth Hogarth
Description: vSphere 5.0, 5.1 specific content.

Table of Contents

1. VMware vSphere VMFS Technical Overview and Best Practices

2. VMware Fault Tolerance Recommendations and Considerations on VMware vSphere

3. Networking Best Practices

4. VMware vSphere High Availability 5.0 Deployment Best Practices

5. vSphere ESXi vCenter Server 5.0 Availability Guide (High Level)

6. Best Practices for Running VMware vSphere on iSCSI

7. Best Practices for Running VMware vSphere on Network Attached Storage

8. VMware vSphere 5.0 Upgrade Best Practices

9. Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs

10. Performance Best Practices for VMware vSphere 5.0

11. VMware vSphere Distributed Switch Best Practices

12. VMware Network I/O Control: Architecture, Performance and Best Practices

13. Storage I/O Control Technical Overview and Considerations for Deployment

1. VMware vSphere VMFS Technical Overview and Best Practices

VMFS5

• Provides Distributed Infrastructure Services for Multiple vSphere Hosts

• VMFS enables virtual disk files to be shared by as many as 32 vSphere hosts. Furthermore, it manages storage access for multiple vSphere hosts and enables them to read and write to the same storage pool at the same time

• Facilitates Dynamic Growth

• Provides Intelligent Cluster Volume Management

• Optimizes Storage Utilization

• Enables High Availability with Lower Management Overhead

• Simplifies Disaster Recovery

Best Practices for Deployment and Use of VMFS

Topics Addressed: How Large a LUN?

• The best way to configure a LUN for a given VMFS volume is to size for throughput first and capacity second. That is, you should aggregate the total I/O throughput for all applications or virtual machines that might run on a given shared pool of storage; then make sure you have provisioned enough back-end disk spindles (disk array cache) and appropriate storage service to meet the requirements.

• Because there is no single correct answer to the question of how large your LUNs should be for a VMFS volume, the more important question to ask is, “How long would it take one to restore the virtual machines on this datastore if it were to fail?” The recovery time objective (RTO) is now the major consideration when deciding how large to make a VMFS datastore. This equates to how long it would take an administrator to restore all of the virtual machines residing on a single VMFS volume if there were a failure that caused data loss.

• The main concern now is how long it would take to recover from a catastrophic storage failure. Another important question to ask is, “How does one determine whether a certain datastore is overprovisioned or underprovisioned?”

• vSphere Storage DRS, introduced in vSphere 5.0, can also be a useful feature to leverage for load balancing virtual machines across multiple datastores, from both a capacity and a performance perspective.
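The RTO-driven sizing question above can be turned into simple arithmetic. The following is a minimal sketch, not a VMware tool: the function names and the idea of a single sustained restore rate (in MB/s) are assumptions made for illustration, and real restores are rarely this linear.

```python
def restore_hours(datastore_used_gb: float, restore_mb_per_s: float) -> float:
    """Estimated hours to restore all data on one datastore at a sustained rate."""
    if restore_mb_per_s <= 0:
        raise ValueError("restore throughput must be positive")
    seconds = (datastore_used_gb * 1024) / restore_mb_per_s
    return seconds / 3600


def max_datastore_gb(rto_hours: float, restore_mb_per_s: float) -> float:
    """Largest amount of data (GB) that can be restored within the RTO."""
    if rto_hours < 0 or restore_mb_per_s <= 0:
        raise ValueError("RTO must be non-negative and throughput positive")
    return rto_hours * 3600 * restore_mb_per_s / 1024


if __name__ == "__main__":
    # Hypothetical inputs: a 4-hour RTO and a restore path sustaining 200 MB/s.
    print(f"Max data per datastore: {max_datastore_gb(4, 200):.1f} GB")
    print(f"Restoring 2048 GB takes about {restore_hours(2048, 200):.2f} h")
```

If the measured restore rate is lower than expected, the same calculation shows how much smaller each datastore needs to be to stay within the RTO.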

Isolation or Consolidation

• The basic answer depends on the nature of the I/O access patterns of that virtual machine. If you have a very heavy I/O-generating application, in many cases VMware vSphere Storage I/O Control can assist in managing fairness of I/O resources among virtual machines. Another consideration in addressing the “noisy neighbor” problem is that it might be worth the potentially inefficient use of resources to allocate a single LUN to a single virtual machine. This can be accomplished using either an RDM or a VMFS volume that is dedicated to a single virtual machine. These two types of volumes perform similarly (within 5 percent of each other), with varying read and write sizes and I/O access patterns.

Isolated Storage Resources

• One school of thought suggests limiting the access of a single LUN to a single virtual machine. In the physical world, this is quite common. When using RDMs, such isolation is implicit, because each RDM volume is mapped to a single virtual machine.

• The downside to this approach is that as you scale the virtual environment, you soon reach the upper limit of 256 LUNs per host.

Consolidated Pools of Storage

• The consolidation school wants to gain additional management productivity and resource utilization by pooling the storage resource and sharing it, with many virtual machines running on several vSphere hosts. Dividing this shared resource among many virtual machines enables better flexibility as well as easier provisioning and ongoing management of the storage resources for the virtual environment.

• Compared to strict isolation, consolidation normally offers better utilization of storage resources. The cost is additional resource contention, which under some circumstances can lead to reduction in virtual machine I/O performance. However, vSphere offers Storage I/O Control and vSphere Storage DRS to mitigate these risks.

Best Practice: Mix Consolidation with Some Isolation

• In general, use vSphere Storage DRS to detect and mitigate storage latency and capacity bottlenecks by load balancing virtual machines across multiple VMFS volumes. Additionally, vSphere Storage I/O Control can be leveraged to ensure fairness of I/O resource distribution among many virtual machines sharing the same VMFS datastore.

• Because workloads can vary significantly, there is no exact formula that determines the limits of performance and scalability regarding the number of virtual machines per LUN. These limits also depend on the number of vSphere hosts sharing concurrent access to a given VMFS volume. The key is to remember the upper limit of 256 LUNs per vSphere host and consider that this number can diminish the consolidation ratio if you take the concept of “one LUN per virtual machine” too far.
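To see how the 256-LUN limit constrains consolidation, here is a rough sketch. Only the limit of 256 comes from the text; the function name and the example figure of 15 virtual machines per shared LUN are hypothetical placeholders, not VMware recommendations.

```python
MAX_LUNS_PER_HOST = 256  # per-host LUN limit quoted in the text


def max_vms(isolated_vms: int, luns_per_isolated_vm: int, vms_per_shared_lun: int) -> int:
    """Total VMs addressable when some VMs get dedicated LUNs and the rest share."""
    dedicated_luns = isolated_vms * luns_per_isolated_vm
    if dedicated_luns > MAX_LUNS_PER_HOST:
        raise ValueError("dedicated LUNs alone exceed the per-host LUN limit")
    # Every LUN not dedicated to a single VM is assumed to back a shared datastore.
    shared_luns = MAX_LUNS_PER_HOST - dedicated_luns
    return isolated_vms + shared_luns * vms_per_shared_lun


if __name__ == "__main__":
    print("Strict isolation, 1 LUN per VM:", max_vms(256, 1, 15))
    print("20 isolated VMs (2 LUNs each), rest shared:", max_vms(20, 2, 15))
```

Under these assumptions, strict isolation caps the cluster at 256 virtual machines, while isolating only the heaviest workloads leaves most of the LUN budget for shared datastores.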

Use of RDMs or VMFS

• An RDM file is a special file in a VMFS volume that manages metadata for its mapped device.

• Employing RDMs provides the advantages of direct access to a physical device while keeping some advantages of a virtual disk in the VMFS file system. In effect, the RDM merges VMFS manageability with raw device access.

• An RDM is a symbolic link from a VMFS volume to a raw volume.

Using RDMs, you can do the following:

• Use vMotion to migrate virtual machines using raw volumes.

• Add raw volumes to virtual machines using the VI client.

• Use file system features such as distributed file locking, permissions and naming.

RDMs have the following two compatibility modes:

• Virtual compatibility mode enables a mapping to act exactly like a virtual disk file, including the use of storage array snapshots.

• Physical compatibility mode enables direct access of the SCSI device, for those applications needing lower level control.

vMotion, vSphere DRS and vSphere HA are all supported for RDMs that are in both physical and virtual compatibility modes.

Why Use VMFS?

• For most applications, VMFS is the clear choice. It provides the automated file system capabilities that make it easy to provision and manage storage for virtual machines running on a cluster of vSphere hosts. VMFS has an automated hierarchical file system structure with user-friendly file-naming access.

• It enables a higher disk utilization rate by facilitating the process of provisioning the virtual disks from a shared pool of clustered storage.

• As you scale the number of vSphere hosts and the total capacity of shared storage, VMFS greatly simplifies the process. It also enables a larger pool of storage than might be addressed via RDMs. Because the number of LUNs that a given cluster of vSphere hosts can discover is currently capped at 256, you can reach this number rather quickly if mapping a set of LUNs to every virtual machine running on the vSphere host cluster.

• Using RDMs usually requires more frequent and varied dependence on the storage administration team, because each LUN must be sized for the needs of each specific virtual machine to which it is mapped.

• With VMFS, however, you can carve out many smaller VMDKs for virtual machines from a single VMFS volume. This enables the partitioning of a larger VMFS volume (or a single LUN) into several smaller virtual disks, which allows a centralized management utility (vCenter) to be used as a control point.

• With RDMs, there is no way to break up the LUN and address it as anything more than a single disk for a given virtual machine.

Why Use RDMs?

Even with all the advantages of VMFS, there still are some cases where it makes more sense to use RDM storage access. The following scenarios call for raw disk mapping:

• Migrating an existing application from a physical environment to virtualization

• Using Microsoft Cluster Service (MSCS) for clustering in a virtual environment

• Implementing N-Port ID Virtualization (NPIV)

• Separating heavy I/O workloads from the shared pool of storage

RDM Scenario 1: Migrating an Existing Application to a Virtual Server

Figure 4 shows a typical migration from a physical server to a virtual one. Before migration, the application running on the physical server has two disks (LUNs) associated with it. One disk is for the OS and application files; a second disk is for the application data.

To begin, use VMware vCenter Converter to build the virtual machine and to load the OS and application data into the new virtual machine.

Next, remove access to the data disk from the physical machine and make sure the disk is properly zoned and accessible from the vSphere host. Then create an RDM for the new virtual machine pointing to the data disk. This enables the contents of the existing data disk to be accessed just as they are, without the need to copy them to a new location.

RDM Scenario 2: Using Microsoft Cluster Service in a Virtual Environment

Another common use of RDMs is for MSCS configurations.

When and How to Use Disk Spanning

It is generally best to begin with a single LUN in a VMFS volume. To increase the size of that resource pool, you can provide additional capacity by either 1) adding a new VMFS extent to the VMFS volume or 2) increasing the size of the VMFS volume on an underlying LUN that has been expanded in the array (via a dynamic expansion within the storage array). Adding a new extent to the existing VMFS volume will result in the existing VMFS volume’s spanning across more than one LUN. However, until the initial capacity is filled, that additional allocation of capacity is not yet put to use.

Expanding the VMFS volume on an existing, larger LUN will also increase the size of the VMFS volume, but it should not be confused with spanning.

From a management perspective, it is preferable that a single large LUN with a single extent host your VMFS. Using multiple LUNs to back multiple extents of a VMFS volume entails presenting every LUN to each of the vSphere hosts sharing the datastore. Although multiple extents might have been required prior to the release of vSphere 5 and VMFS5 to produce VMFS volumes larger than 2TB, VMFS5 now supports single-extent volumes up to 64TB.

Gaining Additional Throughput and Storage Capacity

Additional capacity with disk spanning does not necessarily increase I/O throughput capacity for that VMFS volume. It does, however, result in increased storage capacity.

Suggestions for Rescanning

In prior versions of vSphere, it was recommended that before adding a new VMFS extent to a VMFS volume, you make sure a rescan of the SAN is executed for all nodes in the cluster that share the common pool of storage. However, in more recent versions of vSphere, there is an automatic rescan that is triggered when the target detects a new LUN, so that each vSphere host updates its shared storage information when a change is made on that shared storage resource. This auto rescan is the default setting in vSphere and is configured to occur every 300 seconds.

2. VMware Fault Tolerance Recommendations and Considerations on VMware vSphere

VMware High Availability Features Timeline

VMware Fault Tolerance (FT)

VMware FT is a feature available with VMware vSphere 4 (i.e., ESX 4 and vCenter Server 4) that allows a virtual machine to continue running even when the underlying physical server fails.

It is a software solution that runs on commodity hardware and does not require any modifications to the guest operating system or applications running inside the virtual machine.

Overview

When VMware FT is enabled on a virtual machine (called the Primary VM), a copy of the Primary VM (called the Secondary VM) is automatically created on another host, chosen by VMware Distributed Resource Scheduler (DRS).

If VMware DRS is not enabled, the target host is chosen from the list of available hosts. VMware FT then runs the Primary and Secondary VMs in lockstep with each other – essentially mirroring the execution state of the Primary VM to the Secondary VM. In the event of a hardware failure that causes the Primary VM to fail, the Secondary VM immediately picks up where the Primary VM left off, and continues to run without any loss of network connections, transactions, or data.

VMware FT keeps the Primary and Secondary VMs in lockstep using VMware vLockstep technology. vLockstep technology ensures that the Primary and Secondary VMs execute the same x86 instructions in an identical sequence. The Primary VM captures all nondeterministic events and sends them across a VMware FT logging network to the Secondary VM.

As both the Primary and Secondary VMs execute the same instruction sequence, both initiate I/O operations. However, the outputs of the Primary VM are the only ones that take effect: disk writes are committed, network packets are transmitted, and so on. All outputs of the Secondary VM are suppressed by ESX. Thus, only a single virtual machine instance appears to the outside world.

Transparent Failover

Along with keeping the Primary and Secondary VMs in sync, VMware Fault Tolerance must rapidly detect and respond to hardware failures of the physical machines running the Primary or the Secondary VM. When vLockstep technology is initiated, the ESX hypervisor starts sending heartbeats over the FT logging network between the ESX hosts where the Primary and Secondary VMs reside. This allows VMware FT to detect immediately if a host fails and execute a transparent failover where the remaining VMware FT virtual machine continues running the protected workload without interruption.

Consider a VMware HA cluster of three ESX hosts, two of which are running a Primary and Secondary VM.

If the host running the Primary VM fails, the Secondary VM is immediately activated to replace the Primary VM. A new Secondary VM is created and fault tolerance is re-established in a short period of time. Unlike the initial creation of the Secondary VM where DRS chooses the target ESX host, for failovers VMware HA chooses the target ESX host for the new Secondary VM. Users experience no interruption in service and no loss of data during the transparent failover.

Lifecycle of a fault-tolerant virtual machine

Turning on and enabling VMware FT for a virtual machine affects the virtual machine’s lifecycle, but it is entirely transparent to the end-user client and does not disrupt client connections or the client’s workload. The following steps outline the lifecycle of a VMware FT virtual machine:

1. Administrator selects a virtual machine in either the powered-on or off state and turns on VMware FT.

2. The virtual machine becomes the Primary VM and a Secondary VM is automatically created and assigned to an ESX host, sharing the same disk as the ESX host running the Primary VM.

3. If the Primary VM is already powered on when VMware FT is turned on, its active state is immediately migrated using a special form of VMotion to the Secondary VM on an automatically chosen ESX host. If the Primary VM is powered off, then the migration of its active state to the Secondary VM occurs right after the Primary VM is powered on.

4. The Secondary VM stays synchronized with the Primary VM through VMware vLockstep technology.

5. If the ESX host running the Primary VM goes down, the Secondary VM will immediately “go live” and become the Primary VM.

6. VMware HA automatically starts a new Secondary VM on another available host to restore protection.

7. The Secondary VM is powered off when the Primary VM powers off or when VMware FT is disabled. The Secondary VM is removed altogether when VMware FT is turned off.

Requirements:

Cluster and Host Requirements

• VMware FT can only be used in a VMware HA cluster.

• Ensure that all ESX hosts in the VMware HA cluster have identical ESX versions and patch levels. vLockstep technology only works between Primary and Secondary VMs on hosts running identical versions of ESX. Please see the section on Patching hosts running VMware FT virtual machines for recommendations on how to upgrade hosts that are running FT virtual machines.

• ESX host processors must be VMware FT capable and belong to the same processor model family. VMware FT capable processors required changes in both the performance counter architecture and virtualization hardware assists of both AMD and Intel (AMD Opteron based on the AMD Barcelona, Budapest and Shanghai processor families; and Intel Xeon processors based on the Penryn and Nehalem micro-architectures and their successors).

• VMware FT does not disable AMD’s Rapid Virtualization Indexing (i.e., nested page tables) or Intel’s Extended Page Tables for the ESX host, but these features are automatically disabled for the virtual machine when turning on VMware FT. Virtual machines without FT enabled can still take advantage of these hardware-assisted virtualization features.

• VMware FT is supported on ESX hosts which have hyper-threading enabled or disabled. Hyper-threading does not have to be disabled on these systems for VMware FT to work.

Storage Requirements

• Shared storage required – Fibre Channel, iSCSI, or NAS.

• Turning on VMware FT for a virtual machine first requires the virtual machine’s virtual disk (VMDK) files to be eager-zeroed and thick-provisioned. Thin-provisioned or lazy-zeroed disks can be converted during off-peak times through two methods:

a. Use the vmkfstools --diskformat eagerzeroedthick option in the vSphere CLI when the virtual machine is powered off. Please see the vSphere Command-Line Interface Installation and Reference Guide for details: http://www.vmware.com/pdf/vsphere4/r40/vsp_40_vcli.pdf

b. Set the cbtmotion.forceEagerZeroedThick = “true” flag in the .vmx file before powering on the virtual machine. Then use VMware Storage VMotion to do the conversion.

• Backup solutions within the guest operating system for file or disk-level backups are supported. However, these applications may lead to the saturation of the VMware FT logging network if heavy read access is performed.

• Saturation of the FT logging network could occur for any disk-intensive workload.

• Do not run a lot of VMware FT virtual machines with high disk reads and high network inputs on the same ESX host.

Networking Recommendations

• At a minimum, use 1 GbE NICs for VMware FT logging network. Use 10 GbE NICs for increased bandwidth of FT logging traffic.

• Ensure that the networking latency between ESX hosts is low. Sub-millisecond latency is recommended for the FT logging network. Use vmkping to measure the latency.

• VMware vSwitch settings on the hosts should also be uniform, such as using the same VLAN for VMware FT logging, to make these hosts available for placement of Secondary VMs. Consider using a VMware vNetwork Distributed Switch to avoid inconsistencies in the vSwitch settings.

Baseline Recommendation:

Preferably, each host has separate 1 GbE NICs for FT logging traffic and VMotion. The reason for recommending separate NICs is that the creation of the Secondary VM is done by migrating the Primary VM with VMotion. This can produce significant traffic on the VMotion NIC and could affect VMware FT logging traffic if the NICs are shared.

• In addition, it is preferable that the VMware FT logging NIC has redundancy, so that no unnecessary failovers occur if a single NIC is lost.

• As described in the steps below, the VMware FT logging NIC and VMotion NIC can be configured so that they will automatically share the remaining NIC if one or the other NIC fails.

1. Create a vSwitch that is connected to at least two physical NICs.

2. Create a VMware VMkernel connection (displayed as VMkernel Port in vSphere Client) for VMotion and another one for FT traffic.

3. Make sure that different IP addresses are set for the two VMkernel connections.

4. Assign the NIC teaming properties to ensure that VMotion and FT use different NICs as the active NIC:

a. For VMotion: Set NIC A as active and NIC B as passive.

b. For FT: Set NIC B as active and NIC A as passive.
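The crossed active/passive arrangement in step 4 can be checked mechanically. The sketch below models it in plain Python; it does not call any vSphere API, and the `Teaming` class, the function name and the `vmnic0`/`vmnic1` labels are illustrative assumptions. The text's "passive" NIC is called `standby` here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Teaming:
    """Active and standby (passive) uplink for one VMkernel port."""
    active: str
    standby: str


def validate_ft_vmotion_teaming(vmotion: Teaming, ft: Teaming) -> List[str]:
    """Return a list of problems; an empty list means the layout matches step 4."""
    problems = []
    if vmotion.active == ft.active:
        problems.append("VMotion and FT logging share the same active NIC")
    if vmotion.standby != ft.active:
        problems.append("VMotion standby NIC should be the FT active NIC")
    if ft.standby != vmotion.active:
        problems.append("FT standby NIC should be the VMotion active NIC")
    return problems


if __name__ == "__main__":
    # NIC A = vmnic0, NIC B = vmnic1 (hypothetical names).
    good = validate_ft_vmotion_teaming(Teaming("vmnic0", "vmnic1"), Teaming("vmnic1", "vmnic0"))
    bad = validate_ft_vmotion_teaming(Teaming("vmnic0", "vmnic1"), Teaming("vmnic0", "vmnic1"))
    print("crossed layout problems:", good)
    print("identical layout problems:", bad)
```

With the crossed layout, each traffic type normally has its own NIC, and both fall back to the surviving NIC if one fails.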

Not supported:

Source port ID or source MAC address based load balancing policies do not distribute FT logging traffic. However, if there are multiple VMware FT host pairs, some load balancing is possible with an IP-hash load balancing scheme, though IP-hash may require physical switch changes such as EtherChannel setup. VMware FT will not automatically change any vSwitch settings.

VMware FT Usage Scenarios

VMware FT can be used to protect mission-critical workloads, while VMware HA protects the other workloads by restarting the virtual machine in the event of a virtual machine or ESX host failure.

Running VMware FT and VMware HA virtual machines on the same ESX host is fully supported. VMware HA also helps protect VMware FT virtual machines in the unlikely case where the ESX hosts running the Primary and Secondary VMs both fail. In that case, VMware HA will trigger the restart of the Primary VM as well as re-spawn a new Secondary VM onto another host. Note that if the guest operating system in the Primary VM fails, such as resulting from a blue screen in Windows, the Secondary VM will experience the same failure. The VMware HA feature called VM Monitoring will detect this Primary VM failure through VMware Tools heartbeats and VMware HA will automatically restart the failed Primary VM and re-spawn a new Secondary VM.

VMware FT on-demand

The process of turning on VMware FT for a virtual machine takes on the order of minutes. Turning off VMware FT occurs in seconds. This allows VMware FT to be turned on and off for virtual machines on demand when needed. Turning VMware FT on and off can also be automated by scheduling the task for certain times using the vSphere CLI.

During critical times in your datacenter, such as the last three days of the quarter when any outage can be disastrous, VMware FT on-demand can be scheduled to protect virtual machines for the critical 72 or 96 hours when protection is vital.

When the critical period ends VMware FT is turned off again, and the resources used for the Secondary VM are no longer allocated.

Patching hosts running VMware FT virtual machines

When ESX hosts are running VMware FT virtual machines, the ESX hosts running the Primary and Secondary VMs must be running the same ESX version and patch level. This requirement must be carefully considered when updating the ESX hosts. The following two approaches are recommended for patching ESX hosts with FT virtual machines.

The first approach is suggested for environments where disabling VMware FT for virtual machines can be tolerated for the amount of time required to update all ESX hosts in the cluster.

For each virtual machine protected by VMware FT in the cluster, right-click the virtual machine, highlight Fault Tolerance and select Disable Fault Tolerance (note: turning off VMware FT would work but turning it back on later would take longer).

After updating all hosts in the cluster to the same version and patch level, right-click each virtual machine you wish to protect with VMware FT, highlight Fault Tolerance, and select Enable Fault Tolerance.

Please note that the performance data of the Secondary VM will be lost when you turn off VMware FT for the virtual machine. This data is not lost when you disable VMware FT.

Recommendations for Reliability

Removing single points of failure from your environment is the most important practice in increasing reliability. Reduce single points of failure by implementing multiple NICs, multiple HBAs, multiple power supplies, storage RAID, etc. Fully redundant NIC teaming and storage multipathing are recommended to improve reliability.

VMware FT does attempt a failover if the Primary VM loses all paths to Fibre Channel storage and the Secondary VM still has connection to Fibre Channel storage, but customers should not rely on this. Instead they should implement fully redundant NIC teaming and storage multipathing.

Other recommendations to improve reliability include:

• Ensuring VMotion and VMware FT logging NICs use a private network.

• Using vNetwork Distributed Switches for all networks and hosts.

• Minimizing VMotion migrations of the Primary or Secondary VMs to reduce network and compute resources required by VMware FT. The administrator may also prefer to keep the Primary and Secondary VMs on specific hosts.

• Ensuring that ESX hosts deliver consistent CPU cycles by making the power management usage consistent among hosts.

• When using network-attached storage (NAS), ensuring that the NAS device itself has sufficient resources.

Uniformity of Hosts:

The ESX hosts in your cluster should be as uniform to each other as possible – as described in the Cluster and host requirements section. For better performance, the hosts running the Primary and Secondary VMs should operate at roughly the same processor frequencies in order to ensure the highest level of fault tolerance. Processor speed differences greater than 400 MHz in frequency may become problematic for CPU-bound workloads.

CPU frequency scaling may cause the Secondary VM to run slower than the Primary VM and will cause the Primary VM to slow down.

It is therefore recommended that BIOS-based power management features be used consistently across hosts and that certain settings should be avoided on hosts with VMware FT virtual machines.
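As a small illustration of the 400 MHz guideline, the sketch below flags host pairs whose nominal clock speeds differ by more than that amount. The host names and frequencies are made up, and the function is a plain-Python helper, not part of any VMware tooling.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MAX_FREQ_DELTA_MHZ = 400  # threshold quoted in the text for CPU-bound workloads


def frequency_mismatches(host_mhz: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (host_a, host_b, delta_mhz) for pairs differing by more than 400 MHz."""
    flagged = []
    for a, b in combinations(sorted(host_mhz), 2):
        delta = abs(host_mhz[a] - host_mhz[b])
        if delta > MAX_FREQ_DELTA_MHZ:
            flagged.append((a, b, delta))
    return flagged


if __name__ == "__main__":
    # Hypothetical cluster inventory: host name -> nominal CPU frequency in MHz.
    cluster = {"esx01": 2600, "esx02": 2800, "esx03": 3200}
    for a, b, delta in frequency_mismatches(cluster):
        print(f"{a} and {b} differ by {delta} MHz; avoid pairing them for FT")
```

A pair that differs by exactly 400 MHz is not flagged, since the text only warns about differences greater than 400 MHz.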

VMware Distributed Power Management (DPM) will not recommend a host for power off unless it can successfully recommend VMotion migrations of all virtual machines off that host.

Since VMware FT virtual machines are VMware DRS disabled and cannot be migrated by VMotion recommendations, VMware DPM will not recommend powering off any host with running VMware FT virtual machines.

However, VMware DPM can still be enabled on a VMware HA cluster running VMware FT virtual machines and will simply provide power on or off recommendations for hosts not running VMware FT virtual machines.

Placement of Fault Tolerant Virtual Machines

VMware FT creates Secondary VMs and places them onto another ESX host. If VMware DRS is enabled, DRS decides the target host for the Secondary VM when VMware FT is turned on. If DRS is not enabled, the target host is chosen from the list of available hosts.

After a failover, VMware HA decides the target host for the new Secondary VM. When enabling VMware FT for many virtual machines, you may want to avoid the situation where many Primary and Secondary VMs are placed on the same host. The number of fault tolerant virtual machines that you can safely run on each host cannot be stated precisely because the number is based on the ESX host size, the virtual machine size, and workload factors, all of which can vary widely. VMware does expect the number of supportable VMware FT VMs running on a host to be bound by the saturation of the VMware FT logging network

Given this, it is recommended that no more than four Primary and Secondary VMs be placed onto the same ESX host. For running more than four VMware FT virtual machines on a host, refer to the following:

As described in the section on VMware vLockstep technology, the VMware FT logging network traffic depends on the amount of nondeterministic events and external inputs that are recorded at the Primary VM. Since the bulk of this traffic usually consists of incoming network packets and disk reads one could calculate the amount of networking bandwidth required for VMware FT logging using the following:

VMware FT logging bandwidth ~= (Avg disk reads (MB/s) x 8 + Avg network input (Mbps)) x 1.2 [20% headroom]

The above calculation reserves an additional 20 percent of networking bandwidth on top of the disk and network inputs to the virtual machine. This 20 percent headroom is recommended for transmitting nondeterministic CPU events and for TCP/IP overhead. You can measure the characteristics of your workload through the vSphere Client. Click the Performance tab of the virtual machine to see disk and network I/O.
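The bandwidth formula above translates directly into code. This is a minimal sketch: the function name and the sample workload figures are illustrative, while the factor of 8 (MB/s to Mbps) and the 20 percent headroom come from the text.

```python
def ft_logging_bandwidth_mbps(avg_disk_reads_mb_s: float,
                              avg_network_input_mbps: float,
                              headroom: float = 0.20) -> float:
    """Estimated FT logging bandwidth in Mbps for one protected virtual machine."""
    if avg_disk_reads_mb_s < 0 or avg_network_input_mbps < 0:
        raise ValueError("workload inputs must be non-negative")
    # Disk reads are in megabytes per second, so multiply by 8 to get megabits.
    raw_mbps = avg_disk_reads_mb_s * 8 + avg_network_input_mbps
    return raw_mbps * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical VM: 10 MB/s of disk reads and 50 Mbps of incoming network traffic.
    need = ft_logging_bandwidth_mbps(10, 50)
    print(f"Estimated FT logging bandwidth: {need:.0f} Mbps")
    print(f"VMs of this profile per 1 GbE link: {int(1000 // need)}")
```

Summing the result over all Primary VMs on a host gives a rough indication of when the FT logging NIC is likely to saturate.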

When running multiple VMware FT virtual machines on the same ESX host, mix Primary and Secondary VMs together. The bulk of the VMware FT logging traffic flows from the Primary VM to the Secondary VM. Much less traffic flows from the Secondary VM to the Primary VM. Therefore, the bandwidth of the VMware FT logging NICs will be better utilized if each host has a mix of Primary and Secondary VMs, rather than all Primary VMs or all Secondary VMs. Also, the Secondary VM does not perform any I/O to the virtual machine network and disk. So, the utilization of the virtual machine network and disk will also be more balanced if a host has a mix of Primary and Secondary VMs.
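The two placement guidelines (no more than four FT virtual machines per host, and a mix of Primary and Secondary VMs) can be sketched as a simple review function. The data layout of `(host, role)` pairs and all names below are assumptions for illustration; nothing here queries vCenter.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_FT_VMS_PER_HOST = 4  # guideline quoted in the text


def review_ft_placement(placements: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return warnings per host for (host, role) pairs; role is 'primary' or 'secondary'."""
    roles_by_host: Dict[str, List[str]] = defaultdict(list)
    for host, role in placements:
        if role not in ("primary", "secondary"):
            raise ValueError(f"unknown FT role: {role!r}")
        roles_by_host[host].append(role)

    warnings: Dict[str, List[str]] = {}
    for host, roles in roles_by_host.items():
        notes = []
        if len(roles) > MAX_FT_VMS_PER_HOST:
            notes.append(f"{len(roles)} FT VMs exceeds the guideline of {MAX_FT_VMS_PER_HOST}")
        if len(roles) > 1 and len(set(roles)) == 1:
            notes.append(f"all FT VMs are {roles[0]}; consider mixing roles")
        if notes:
            warnings[host] = notes
    return warnings


if __name__ == "__main__":
    # Hypothetical layout: esx01 holds only Primary VMs, esx02 holds a mix.
    layout = [("esx01", "primary"), ("esx01", "primary"),
              ("esx02", "secondary"), ("esx02", "primary")]
    print(review_ft_placement(layout))
```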

Timekeeping Recommendations

To avoid time mismatch issues in a virtual machine after a VMware FT failover, perform the following steps:

1. Synchronize the guest operating system time with a time source, which will depend on whether the guest is Windows or Linux.

2. Synchronize the time of each ESX server host with a network time protocol (NTP) server.

Windows guest operating system time synchronization

For Windows Server 2003 guest operating systems, synchronize time with the appropriate domain controllers within their Microsoft Active Directory (AD) domain. In turn, each domain controller should sync its clock with the primary domain controller emulator (PDC Emulator) of the domain. All PDC Emulators should be time synchronized with the PDC Emulator of the root forest domain. Finally, the PDC Emulator of the root forest domain should be time synchronized with a stratum 1 time source such as an NTP time server or a hardware atomic clock. If AD is not being used in your environment, synchronize time directly with the NTP time server or another reliable external time source. Please refer to your Windows documentation for details.

Linux guest operating system time synchronization

For Linux guest operating systems, synchronize time with an NTP server by performing the following steps:

1. Open the VMware Tools Properties dialog box from within the guest. Under Miscellaneous Options, make sure the “Time synchronization between the virtual machine and the ESX Server” option is not checked.

2. Synchronize time with an NTP time server. Please refer to Installing and Configuring Linux Guest Operating Systems for configuration details. http://www.vmware.com/resources/techresources/1076

If your guest operating system is very time-sensitive, then synchronize the guest operating system directly with the NTP server. The method to do this varies depending on the guest operating system. Please consult your guest operating system documentation for details.

VMware FT Application Recommendations

Here are a few example recommendations for protecting applications with FT.

Example 1: High availability for a multi-tiered SAP application

SAP NetWeaver 7.0 is a service-oriented application and integration platform that serves as the foundation for all other SAP applications. Within this multi-tiered SAP NetWeaver 7.0 application, the ABAP SAP Central Services (ASCS) instance is a single point of failure. (ABAP stands for Advanced Business Application Programming.) ASCS is a group of two servers: the Message Server and the Enqueue Server.

The Message Server handles all communications in the SAP system. Message Server failures cause internal communications between SAP dispatchers to fail. Other problems include failures in user logon and in batch job scheduling.

The Enqueue Server manages the logical locks for SAP documents and objects during transactions. Enqueue Server failures result in automatic rollbacks of all transactions holding locks, and SAP updates that are requesting locks are aborted.

Since the ASCS is a single point of failure, it requires a high availability solution. For moderate use cases of client connections, a single vCPU virtual machine running ASCS will suffice. Running these services on a single vCPU virtual machine on another host will allow it to be protected with VMware FT.

ESX #1: Virtual machine with two vCPUs running the database and SAP Central Instance (minus the Message and Enqueue Servers). Note: This host is also running an SAP-specific load driver benchmark called the Sales and Distribution (SD) Benchmark. This benchmark was used to validate continuous transaction execution with VMware FT during host failover.

ESX #2: Virtual machine with one vCPU running ASCS (i.e., the Message and Enqueue Servers). This virtual machine has VMware FT turned on and acts as the Primary VM.

ESX #3: Virtual machine with one vCPU acting as the Secondary VM for the ASCS.

Upon failure of either ESX #2 or #3, VMware FT allows the virtual machine on the other host to immediately take over execution.

Thus, the ASCS services will not lose any data and will not experience any interruption in service. This can be tested by manually checking lock integrity via SAP transaction SM12, the SAP lock management transaction. If ESX #1 fails, the database (protected via VMware HA) will temporarily go down but will not force a client disconnection for users logged onto separate dialog instance virtual machines (not shown above). The client will only experience a pause until the database comes back online, either when the host is rebooted or when the database virtual machine is restarted on another host through VMware HA.

Example 2: High availability for the Blackberry Enterprise Server

The Blackberry Enterprise Server (BES) 4.1.6 for Microsoft Exchange enables push-based access in delivering Exchange email, calendar, contacts, scheduling, instant messaging, and other Web services to Blackberry devices. Running BES in a single vCPU virtual machine can support up to 200 users that receive an average of 100-200 email messages per day. Unless there is a failover mechanism in place, the loss of BES due to hardware failure will result in the disruption of Blackberry users’ ability to sync with Exchange. VMware FT can be turned on for the BES virtual machine as shown in Figure 7 to provide continuous availability that can survive ESX host failures.

ESX #1: Virtual machine with two vCPUs running the database and Microsoft Exchange server.

ESX #2: Virtual machine with one vCPU running BES 4.1.6. This virtual machine has VMware FT turned on and acts as the Primary VM.

ESX #3: Virtual machine with one vCPU acting as the Secondary VM for BES 4.1.6.

A failure of either ESX #2 or #3 results in no loss of email delivery to the Blackberry device. VMware FT ensures that the BES workload is uninterrupted. Currently there are a number of different methods to protect BES from failure, ranging from simple backup plans to having offline standby servers prepared. However, VMware FT is the only software solution to offer uninterrupted protection for BES service while remaining cost-effective and user-friendly.

Summary of Performance Recommendations

• For each virtual machine there are two VMware FT-related actions that can be taken: turning FT on/off and enabling/disabling FT.

• “Turning on FT” prepares the virtual machine for VMware FT by prompting for the removal of unsupported devices, disabling unsupported features, and setting the virtual machine’s memory reservation to be equal to its memory size (thus avoiding ballooning or swapping).

• “Enabling FT” performs the actual creation of the Secondary VM by live-migrating the Primary VM.

• Note: Turning on VMware FT for a powered-on virtual machine will also automatically “Enable FT” for that virtual machine.

• Each of these operations has performance implications.

• Do not turn on VMware FT for a virtual machine unless you will be using (i.e., Enabling) VMware FT for that machine. Turning on VMware FT automatically disables some features for the specific virtual machine that can help performance, such as hardware virtual MMU (if the processor supports it).

• Enabling VMware FT for a virtual machine uses additional resources (for example, the Secondary VM uses as much CPU and memory as the Primary VM). Therefore make sure you are prepared to devote the resources required before enabling VMware FT.

• The live migration that takes place when VMware FT is enabled can briefly saturate the VMotion network link and can also cause spikes in CPU utilization.

• If the VMotion network link is also being used for other operations, such as VMware FT logging, the performance of those other operations can be impacted. For this reason, it is best to have separate and dedicated NICs for FT logging traffic and also for VMotion, especially when multiple VMware FT virtual machines reside on the same host.

• Because this potentially resource-intensive live migration takes place each time FT is enabled, it is recommended that VMware FT not be frequently enabled and disabled.

• Because VMware FT logging traffic is asymmetric (the majority of the traffic flows from Primary to Secondary VM), congestion on the logging NIC can be avoided by distributing primaries onto multiple hosts. For example, on a cluster with two ESX hosts and two virtual machines with VMware FT enabled, placing one of the Primary VMs on each of the hosts allows the network bandwidth to be utilized bidirectionally.

• VMware FT virtual machines that receive large amounts of network traffic or perform lots of disk reads can generate significant traffic on the VMware FT logging NIC. This is true of machines that routinely do these things as well as machines doing them only intermittently, such as during a backup operation. To avoid saturating the network link used for logging traffic, limit the number of VMware FT virtual machines on each host or limit disk read bandwidth and network receive bandwidth of those virtual machines.

• Make sure the VMware FT logging traffic is carried by at least a 1 GbE-rated NIC (which should in turn be connected to at least 1 GbE-rated infrastructure).

• Avoid placing more than four VMware FT-enabled virtual machines on a single host. In addition to reducing the possibility of saturating the network link used for logging traffic, this also limits the number of live migrations needed to create new Secondary VMs in the event of a host failure.

• If the Secondary VM lags too far behind the Primary VM (which can happen when the Primary VM is CPU bound and the Secondary VM is not getting enough CPU cycles), the hypervisor may slow down execution on the Primary VM to allow the Secondary VM to catch up. This can be avoided by making sure the hosts on which the Primary and Secondary VMs run are relatively closely matched with similar CPU make, model, and frequency. It is recommended to disable certain power management settings that do not allow for adjustments based on workload. As another alternative, enabling CPU reservations for the Primary VM (which will be duplicated for the Secondary VM) will help ensure that the Secondary VM gets CPU cycles when it requires them.

• Though timer interrupt rates do not significantly affect VMware FT performance, high timer interrupt rates create additional network traffic on the FT logging NIC. Therefore, if possible, reduce timer interrupt rates as described in the “Guest Operating System CPU Considerations” section of “Performance Best Practices for VMware vSphere 4.”
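Two of the placement recommendations above (keep no more than four FT-enabled virtual machines on a host, and mix Primary and Secondary VMs on each host) lend themselves to a mechanical check. The following is a minimal Python sketch that works on a hand-written inventory. The host names, the `placement` layout, and the function name are hypothetical; it does not query vCenter, and counting both Primary and Secondary VMs toward the limit of four is an assumption of the sketch.

```python
from collections import Counter

MAX_FT_VMS_PER_HOST = 4  # limit taken from the recommendation above


def check_ft_placement(placement):
    """Return a list of warnings for a {host: [(vm_name, role), ...]} mapping.

    role is "primary" or "secondary". The data layout is an assumption made
    for this sketch; it is not a vSphere API structure.
    """
    warnings = []
    for host, vms in sorted(placement.items()):
        roles = Counter(role for _, role in vms)
        total = sum(roles.values())
        if total > MAX_FT_VMS_PER_HOST:
            warnings.append(f"{host}: {total} FT VMs exceeds {MAX_FT_VMS_PER_HOST}")
        # With two or more FT VMs, a host holding only one role sends (or
        # receives) nearly all of its logging traffic in a single direction.
        if total >= 2 and len(roles) == 1:
            only = next(iter(roles))
            warnings.append(f"{host}: all FT VMs are {only}; mix roles")
    return warnings


if __name__ == "__main__":
    example = {
        "esx01": [("vm-a", "primary"), ("vm-b", "primary")],
        "esx02": [("vm-a", "secondary"), ("vm-b", "secondary")],
    }
    for line in check_ft_placement(example):
        print(line)
```

In the example inventory both hosts are flagged, because one holds only Primary VMs and the other only Secondary VMs. Swapping one pair so that each host holds one Primary and one Secondary clears the warnings.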

Fault Tolerance Host Networking Configuration Example

This example describes the host network configuration for Fault Tolerance in a typical deployment with four 1Gb NICs. This is one possible deployment that ensures adequate service to each of the traffic types identified in the example and could be considered a best practice configuration.

Fault Tolerance provides full uptime during the course of a physical host failure due to power outage, system panic, or similar reasons. Network or storage path failures or any other physical server components that do not impact the host running state may not initiate a Fault Tolerance failover to the Secondary VM. Therefore, customers are strongly encouraged to use appropriate redundancy (for example, NIC teaming) to reduce the chance of losing virtual machine connectivity to infrastructure components like networks or storage arrays.

NIC Teaming policies are configured on the vSwitch (vSS) Port Groups (or Distributed Virtual Port Groups for vDS) and govern how the vSwitch will handle and distribute traffic over the physical NICs (vmnics) from virtual machines and vmkernel ports. A unique Port Group is typically used for each traffic type with each traffic type typically assigned to a different VLAN.

Host Networking Configuration Guidelines

The following guidelines allow you to configure your host's networking to support Fault Tolerance with different combinations of traffic types (for example, NFS) and numbers of physical NICs.

• Distribute each NIC team over two physical switches ensuring L2 domain continuity for each VLAN between the two physical switches.

• Use deterministic teaming policies to ensure particular traffic types have an affinity to a particular NIC (active/standby) or set of NICs (for example, originating virtual port ID).

• Where active/standby policies are used, pair traffic types to minimize impact in a failover situation where both traffic types will share a vmnic.

• Where active/standby policies are used, configure all the active adapters for a particular traffic type (for example, FT Logging) to the same physical switch. This minimizes the number of network hops and lessens the possibility of oversubscribing the switch to switch links.

Configuration Example with Four 1Gb NICs

Figure 3-2 depicts the network configuration for a single ESXi host with four 1Gb NICs supporting Fault Tolerance. Other hosts in the FT cluster would be configured similarly.

This example uses four port groups configured as follows:

• VLAN A: Virtual Machine Network Port Group, active on vmnic2 (to physical switch #1); standby on vmnic0 (to physical switch #2).

• VLAN B: Management Network Port Group, active on vmnic0 (to physical switch #2); standby on vmnic2 (to physical switch #1).

• VLAN C: vMotion Port Group, active on vmnic1 (to physical switch #2); standby on vmnic3 (to physical switch #1).

• VLAN D: FT Logging Port Group, active on vmnic3 (to physical switch #1); standby on vmnic1 (to physical switch #2).
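To see how the active/standby pairing in this four-NIC example behaves when an uplink is lost, the assignments can be modelled in a few lines of Python. This is a simplified sketch, not a vSphere API: the dictionary and function names are hypothetical, and the model only assumes that a port group uses its active vmnic when that vmnic is up and its standby otherwise.

```python
# Active/standby assignments copied from the four port groups listed above.
PORT_GROUPS = {
    "Virtual Machine Network": ("vmnic2", "vmnic0"),
    "Management Network": ("vmnic0", "vmnic2"),
    "vMotion": ("vmnic1", "vmnic3"),
    "FT Logging": ("vmnic3", "vmnic1"),
}


def uplinks_after_failure(failed_nics):
    """Map each vmnic to the port groups it carries once failed_nics are down.

    Simplified model: a port group uses its active vmnic if it is up,
    otherwise its standby; if both are down the port group is left out.
    """
    carried = {}
    for name, (active, standby) in PORT_GROUPS.items():
        for nic in (active, standby):
            if nic not in failed_nics:
                carried.setdefault(nic, []).append(name)
                break
    return carried


if __name__ == "__main__":
    print(uplinks_after_failure(set()))        # every port group on its own vmnic
    print(uplinks_after_failure({"vmnic3"}))   # FT Logging moves to vmnic1
```

With no failures each traffic type has a vmnic to itself. When vmnic3 fails, FT Logging falls back to vmnic1 and shares it with vMotion, which is the pairing of traffic types that the guidelines above ask you to plan for.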

vMotion and FT Logging can share the same VLAN (configure the same VLAN number in both port groups), but require their own unique IP addresses residing in different IP subnets. However, separate VLANs might be preferred if Quality of Service (QoS) restrictions are in effect on the physical network with VLAN based QoS. QoS is of particular use where competing traffic comes into play, for example, where multiple physical switch hops are used or when a failover occurs and multiple traffic types compete for network resources.

3. Networking Best Practices

Extract from page 88: http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50-networking-guide.pdf

• Separate network services from one another to achieve greater security and better performance.

• Put a set of virtual machines on a separate physical NIC. This separation allows for a portion of the total networking workload to be shared evenly across multiple CPUs. The isolated virtual machines can then better serve traffic from a Web client, for example.

• Keep the vMotion connection on a separate network devoted to vMotion. When migration with vMotion occurs, the contents of the guest operating system’s memory are transmitted over the network. You can do this either by using VLANs to segment a single physical network or by using separate physical networks (the latter is preferable).

• When using pass-through devices with a Linux kernel version 2.6.20 or earlier, avoid MSI and MSI-X modes because these modes have significant performance impact.

• To physically separate network services and to dedicate a particular set of NICs to a specific network service, create a vSphere standard switch or vSphere distributed switch for each service. If this is not possible, separate network services on a single switch by attaching them to port groups with different VLAN IDs. In either case, confirm with your network administrator that the networks or VLANs you choose are isolated in the rest of your environment and that no routers connect them.

• You can add and remove network adapters from a standard or distributed switch without affecting the virtual machines or the network service that is running behind that switch. If you remove all the running hardware, the virtual machines can still communicate among themselves. If you leave one network adapter intact, all the virtual machines can still connect with the physical network.

• To protect your most sensitive virtual machines, deploy firewalls in virtual machines that route between virtual networks with uplinks to physical networks and pure virtual networks with no uplinks.

• For best performance, use vmxnet3 virtual NICs.

• Every physical network adapter connected to the same vSphere standard switch or vSphere distributed switch should also be connected to the same physical network.

• Configure all VMkernel network adapters to the same MTU. When several VMkernel network adapters are connected to vSphere distributed switches but have different MTUs configured, you might experience network connectivity problems.

4. VMware vSphere High Availability 5.0 Deployment Best Practices

vSphere makes it possible to reduce both planned and unplanned downtime. With the revolutionary VMware vSphere vMotion® capabilities in vSphere, it is possible to perform planned maintenance with zero application downtime.

VMware vSphere High Availability (HA) specifically reduces unplanned downtime by leveraging multiple VMware vSphere ESXi hosts configured as a cluster, to provide rapid recovery from outages and cost-effective high availability for applications running in virtual machines.

vSphere HA provides for application availability in the following ways:

• It reacts to hardware failure and network disruptions by restarting virtual machines on active hosts within the cluster.

• It detects operating system (OS) failures by continuously monitoring a virtual machine and restarting it as required.

• It provides a mechanism to react to application failures.

• It provides the infrastructure to protect all workloads within the cluster, in contrast to other clustering solutions.

Users can combine HA with VMware vSphere Distributed Resource Scheduler (DRS) to protect against failures and to provide load balancing across the hosts within a cluster.

Design Principles for High Availability

Host Selection

Overall vSphere availability starts with proper host selection. This includes items such as redundant power supplies, error-correcting memory, remote monitoring and notification and so on. Consideration should also be given to removing single points of failure in host location. This includes distributing hosts across multiple racks or blade chassis to ensure that rack or chassis failure cannot impact an entire cluster.

When deploying a vSphere HA cluster, it is a best practice to build the cluster out of identical server hardware. The use of identical hardware provides a number of key advantages, such as the following ones:

• Simplifies configuration and management of the servers using Host Profiles.

• Increases ability to handle server failures and reduces resource fragmentation. The use of drastically different hardware leads to an unbalanced cluster, as described in the Admission Control section. By default, vSphere HA prepares for the worst-case scenario in which the largest host in the cluster fails. To handle the worst case, more resources across all hosts must be reserved, making them essentially unusable.

Additionally, care should be taken to remove any inconsistencies that would prevent a virtual machine from being started on any cluster host. Inconsistencies such as the mounting of datastores to a subset of the cluster hosts or the implementation of vSphere DRS-required virtual machine-to-host affinity rules are scenarios to consider carefully. The avoidance of these conditions will increase the portability of the virtual machine and provide a higher level of availability.

The overall size of a cluster is another important factor to consider. Smaller clusters require a larger relative percentage of the available cluster resources to be set aside as reserve capacity to handle failures adequately. For example, to ensure that a cluster of three nodes can tolerate a single host failure, about 33 percent of the cluster resources are reserved for failover. A 10-node cluster requires that only 10 percent be reserved. On the other hand, as cluster size increases, so does the management complexity of the cluster. However, this increase in management complexity is overshadowed by the benefits a large cluster can provide.
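The 33 percent and 10 percent figures follow from simple arithmetic if all hosts are the same size: tolerating k host failures in an n-host cluster means setting aside k/n of the capacity. A minimal Python sketch of that calculation follows. The function name is hypothetical, and the identical-host assumption is the sketch's own; clusters with mixed hardware need the worst-case reasoning described under Host Selection.

```python
def failover_reserve_percent(hosts, failures_to_tolerate=1):
    """Percentage of total cluster capacity to set aside for failover.

    Assumes identically sized hosts, so tolerating k host failures in an
    n-host cluster means reserving k/n of the capacity.
    """
    if hosts < 2 or not 0 < failures_to_tolerate < hosts:
        raise ValueError("need at least 2 hosts and 0 < failures < hosts")
    return 100.0 * failures_to_tolerate / hosts


if __name__ == "__main__":
    for n in (3, 10, 32):
        print(f"{n:>2} hosts: reserve {failover_reserve_percent(n):.1f}%")
```

The relative cost of the reserve falls quickly as hosts are added, which is the trade-off against management complexity discussed above.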

Host Versioning

An ideal configuration is one in which all the hosts contained within the cluster use the latest version of ESXi. When adding a host to vSphere 5.0 clusters, it is always a best practice to upgrade the host to ESXi 5.0 and to avoid using clusters with mixed host versions.

Mixed clusters are supported but not recommended because there are some differences in vSphere HA performance between host versions and these differences can introduce operational variances in a cluster.

These differences arise from the fact that earlier host versions do not offer the same capabilities as later versions. For example, VMware ESX® 3.5 hosts do not support certain properties present within ESX 4.0 and greater. These properties were added to ESX 4.0 to inform vSphere HA of conditions warranting a restart of a virtual machine. As a result, HA will not restart virtual machines that crash while running on ESX 3.5 hosts but will restart such a virtual machine if it was running on an ESX 4.0 or later host.

The following apply if using a vSphere HA–enabled cluster that includes hosts with differing versions:

• Users should be aware of the general limitations of using a mixed cluster, as previously mentioned.

• Users should also know that ESX/ESXi 3.5 hosts within a 5.0 cluster must include a patch to address an issue involving file locks. For ESX 3.5 hosts, users must apply the ESX350-201012401-SG patch. For ESXi 3.5, they must apply the ESXe350-201012401-I-BG patch. Prerequisite patches must be applied before applying these patches. HA will not enable an ESX/ESXi 3.5 host to be added to the cluster if it does not meet the patch requirements.

• Users should avoid deploying mixed clusters if VMware vSphere Storage vMotion® or VMware vSphere Storage DRS is required. The vSphere 5.0 Availability Guide has more information on this topic.

VMware vCenter Server Availability Considerations

VMware vCenter Server is the management focal point for any vSphere environment. Although vSphere HA will continue to protect any environment without vCenter Server, the ability to manage the environment is severely impacted without it.

It is highly recommended that users protect their vCenter Server instance as well as possible. The following methods can help to accomplish this:

• Use of VMware vCenter Server Heartbeat, a specially designed high availability solution for vCenter Server

• Use of vSphere HA, which is useful in environments in which the vCenter Server instance is virtualized, such as when using the VMware vCenter Server Appliance

It is extremely critical when using ESXi Auto Deploy that both the Auto Deploy service and the vCenter Server instance used are highly available. In the event of a loss of the vCenter Server instance, Auto Deploy hosts might not be able to reboot successfully in certain situations. However, it bears repeating here that if vSphere HA is used to make vCenter Server highly available, the vCenter Server virtual machine must be configured with a restart priority of high.

Additionally, this virtual machine should be configured to run on two or more hosts that are not managed by Auto Deploy. This can be done by using a DRS virtual machine-to-host “must run on” rule or by deploying the virtual machine on a datastore accessible to only these hosts. Because Auto Deploy depends upon the availability of vCenter Server in certain circumstances, this ensures that the vCenter Server virtual machine is able to come online. This does not require that vSphere DRS be enabled if users employ DRS rules, because these rules will remain in effect after DRS has been disabled.

Networking Design Considerations

General Networking Guidelines

• If the physical network switches that connect the servers support PortFast (or an equivalent setting), this should be enabled. If this feature is not enabled, it can take a while for a host to regain network connectivity after booting due to the execution of lengthy spanning tree algorithms. While this execution is occurring, virtual machines cannot run on the host and HA will report the host as isolated or dead. Isolation will be reported if the host and an FDM master can access the host’s heartbeat datastores.

• Host monitoring should be disabled when performing any network maintenance that might disable all heartbeat paths (including storage heartbeats) between the hosts within the cluster, because this might trigger an isolation response.

• With vSphere HA 5.0, all dependencies on DNS have been removed.

• Users should employ consistent port group names and network labels on VLANs for public networks.

• If users employ inconsistent names for the original server and the failover server, virtual machines are disconnected from their networks after failover. Network labels are used by virtual machines to reestablish network connectivity upon restart. Use of a documented naming scheme is highly recommended. Issues with port naming can be completely mitigated by use of a VMware vSphere Distributed Switch.

• Configure the management networks so that the vSphere HA agent on a host in the cluster can reach the agents on any of the other hosts using one of the management networks. Without such a configuration, a network partition condition can occur after a master host is elected.

• Configure the fewest possible number of hardware segments between the servers in a cluster. This limits single points of failure. Additionally, routes with too many hops can cause networking packet delays for heartbeats and increase the possible points of failure.

• In environments where both IPv4 and IPv6 protocols are used, the user should configure the distributed switches on all hosts to enable access to both networks. This prevents network partition issues due to the loss of a single IP networking stack or host failure.

• Ensure that TCP/UDP port 8182 is open on all network switches and firewalls that are used by the hosts for interhost communication. vSphere HA will open this port automatically on the hosts when enabled and close it when disabled. User action is required only if there are firewalls in place between hosts within the cluster, as in a stretched cluster configuration.

• Configure redundant management networking from ESXi hosts to network switching hardware if possible along with heartbeat datastores. Using network adaptor teaming will enhance overall network availability.

• Configuration of hosts with management networks on different subnets as part of the same cluster is supported. One or more isolation addresses for each subnet should be configured accordingly. Refer to the Host Isolation section for more details.

• The management network supports the use of jumbo frames as long as the MTU values and physical network switch configurations are set correctly. Ensure that the network supports jumbo frames end to end.
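The guideline above on consistent port group names can be verified before a failover exposes the problem. The following Python sketch compares the port group labels defined on each host and reports which labels a host is missing. The input mapping and the function name are hypothetical (the data would have to be exported from your own inventory), and the sketch compares labels exactly, including case, on the assumption that labels must match character for character.

```python
def missing_port_groups(host_port_groups):
    """Return {host: sorted list of port group names it lacks}.

    host_port_groups maps a host name to the set of port group names
    defined on it. A name counts as missing on a host when at least one
    other host in the mapping defines it. Comparison is exact.
    """
    all_names = set().union(*host_port_groups.values())
    gaps = {}
    for host, names in host_port_groups.items():
        missing = sorted(all_names - set(names))
        if missing:
            gaps[host] = missing
    return gaps


if __name__ == "__main__":
    inventory = {
        "esx01": {"Management", "VM Network"},
        "esx02": {"Management", "VM network"},  # differs only in case
    }
    for host, names in missing_port_groups(inventory).items():
        print(host, "is missing:", ", ".join(names))
```

An empty result means every host defines the same set of labels. As noted above, using a vSphere Distributed Switch removes this class of error altogether.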

Setting Up Redundancy for vSphere HA Networking

Networking redundancy between cluster hosts is absolutely critical for vSphere HA reliability. Redundant management networking enables the reliable detection of failures.

NOTE: Because this document is primarily focused on vSphere 5.0, its use of the term “management network” refers to the VMkernel network selected for use as a management network. Refer to the vSphere Availability Guide for information regarding the service console network when using VMware ESX® 4.1, ESX 4.0, or ESX 3.5x.

Network Adaptor Teaming and Management Networks

Using a team of two network adaptors connected to separate physical switches can improve the reliability of the management network. The cluster is more resilient to failures because the hosts are connected to each other through two network adaptors and through two separate switches and thus they have two independent paths for cluster communication.

To configure a network adaptor team for the management network, it is recommended to configure the vNICs in the distributed switch configuration for the ESXi host in an active/standby configuration. This is illustrated in the following example:

Requirements:

• Two physical network adaptors

• VLAN trunking

• Two physical switches

The distributed switch should be configured as follows:

• Load balancing set to route based on the originating virtual port ID (default)

• Failback set to No

• vSwitch0: Two physical network adaptors (for example, vmnic0 and vmnic2)

• Two port groups (for example, vMotion and management)

In this example, the management network runs on vSwitch0 as active on vmnic0 and as standby on vmnic2. The vMotion network runs on vSwitch0 as active on vmnic2 and as standby on vmnic0.

It is recommended to use NIC ports from different physical NICs and it is preferable that the NICs are different makes and models.

Failback is set to no because in the case of physical switch failure and restart, ESXi might falsely determine that the switch is back online when its ports first come online. However, the switch itself might not be forwarding any packets until it is fully online. Therefore, when failback is set to no and an issue arises, both the management network and vMotion network will be running on the same network adaptor and will continue running until the user manually intervenes.

Management Network Changes in a vSphere HA Cluster

vSphere HA uses the management network as its primary communication path. As a result, it is critical that proper precautions are taken whenever a maintenance action will affect the management network.

As a general rule, whenever maintenance is to be performed on the management network, the host monitoring functionality of vSphere HA should be disabled. This will prevent HA from determining that the maintenance action is a failure and from consequently triggering the isolation responses.

If there are changes involving the management network, it is advisable to reconfigure HA on all hosts in the cluster after the maintenance action is completed. This ensures that any pertinent changes are recognized by HA. Changes that cause a loss of management network connectivity are grounds for performing a reconfiguration of HA. An example of this is the addition or deletion of networks used for management network traffic when the host is not in maintenance mode.

Storage Design Considerations

Best practices for storage design reduce the likelihood of hosts losing connectivity to the storage used by the virtual machines and to the storage used by vSphere HA for heartbeating. To maintain a constant connection between an ESXi host and its storage, ESXi supports multipathing, a technique that enables users to employ more than one physical path to transfer data between the host and an external storage device.

In case of a failure of any element in the SAN, such as an adapter, switch or cable, ESXi can move to another physical path that does not use the failed component.

In addition to path failover, multipathing provides load balancing, which is the process of distributing I/O loads across multiple physical paths. Load balancing reduces or removes potential bottlenecks.

Storage Heartbeats

A new feature of vSphere HA in vSphere 5.0 makes it possible to use storage subsystems as a means of communication between the hosts of a cluster. Storage heartbeats are used when the management network is unavailable to enable a slave HA agent to communicate with a master HA agent.

The feature also makes it possible to distinguish accurately between the different failure scenarios of dead, isolated or partitioned hosts.

• Storage heartbeats enable detection of cluster partition scenarios that are not supported with previous versions of vSphere. This results in a more coordinated failover when host isolation occurs.

• By default, vCenter Server automatically selects two datastores to use for storage heartbeats. It is intended to select datastores that are connected to the highest number of hosts. The algorithm is designed to select datastores that are backed by different LUNs or NFS servers. A preference is given to VMware vSphere VMFS–formatted datastores over NFS-hosted datastores.

• vCenter Server selects the heartbeat datastores when HA is enabled, when a datastore is added or removed from a host and when the accessibility to a datastore changes. Users can, however, configure vSphere HA to give preference to a subset of the datastores mounted by the hosts in the cluster. Alternately, they can require that HA choose only from a subset of these.

• VMware recommends that users employ the default setting unless there are datastores in the cluster that are more highly available than others. If there are some more highly available datastores, VMware recommends that users configure vSphere HA to give preference to these.

• VMware does not recommend restricting vSphere HA to using only a subset of the datastores because this setting restricts the system’s ability to respond when a host loses connectivity to one of its configured heartbeat datastores.

• NOTE: vSphere HA datastore heartbeating is very lightweight and will not impact in any way the use of the datastores by virtual machines.

• Although users can increase to four the number of heartbeat datastores chosen for each host, increasing the number does not make the cluster significantly more tolerant of failures. (See the vSphere Metro Storage Cluster white paper for details about heartbeat datastore recommendations specific to stretched clusters.)

• Environments that provide only network-based storage must work optimally with the network architecture to realize fully the potential of the storage heartbeat feature. If the storage network traffic and the management network traffic flow through the same network components, disruptions in network service might disrupt both. It is recommended that these networks be separated as much as possible or that datastores with a different failure domain be used for heartbeating.

• In cases where converged networking is used, VMware recommends that users leave heartbeating enabled. This is because even with converged networking failures can occur that disrupt only the management network traffic. For example, the VLAN tags for the management network might be incorrectly changed without impacting those used for storage traffic.

• It is also recommended that all hosts within a cluster have access to the same datastores. This promotes virtual machine portability because the virtual machines can then run on any of the hosts within the cluster. Such a configuration is also beneficial because it maximizes the chance that an isolated or partitioned host can communicate with a master during a network partition or isolation event.

• If network partitions or isolations are anticipated within the environment, users should ensure that a minimum of two shared datastores is provisioned to all hosts in the cluster.
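The default selection preferences described in the bullets above (widest host connectivity, different backing devices, VMFS before NFS) can be sketched roughly as follows. This is only an illustration of those stated preferences, not vCenter Server's actual algorithm: the `Datastore` type, its field names and the tie-breaking order are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    name: str
    kind: str        # "VMFS" or "NFS"
    backing: str     # hypothetical ID of the backing LUN or NFS server
    host_count: int  # number of cluster hosts that mount this datastore


def pick_heartbeat_datastores(datastores, count=2):
    """Pick `count` datastores, favouring wide connectivity, then VMFS,
    and avoiding two datastores that share the same backing device."""
    ranked = sorted(
        datastores,
        key=lambda d: (-d.host_count, d.kind != "VMFS", d.name),
    )
    chosen, used_backings = [], set()
    for ds in ranked:
        if ds.backing not in used_backings:
            chosen.append(ds)
            used_backings.add(ds.backing)
        if len(chosen) == count:
            return chosen
    # Not enough distinct backings: fall back to the best remaining ones.
    for ds in ranked:
        if ds not in chosen:
            chosen.append(ds)
        if len(chosen) == count:
            break
    return chosen
```

With two VMFS datastores on the same LUN and one NFS datastore, all mounted by every host, the sketch picks one VMFS datastore and the NFS datastore, because sharing a backing device would defeat the purpose of having two heartbeat datastores.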

Cluster Configuration Considerations

Host Isolation

One key mechanism within vSphere HA is the ability for a host to detect when it has become network-isolated from the rest of the cluster. With this information, vSphere is able to take administrator-specified action with respect to running virtual machines on the host that has been isolated.

Depending on network layout and specific business needs, the administrator might wish to tune the vSphere HA response to an isolated host to favor rapid failover or to leave the virtual machine running so clients can continue to access it. The following section explains how a vSphere HA node detects when it has been isolated from the rest of the cluster, and the response options available to that node after that determination has been made.

Host Isolation Detection

Host isolation detection happens at the individual host level. Isolation fundamentally means a host is no longer able to communicate over the management network. To determine if it is network-isolated, the host attempts to ping its configured isolation addresses.

The isolation address used should always be reachable by the host under normal situations, because after five seconds have elapsed with no response from the isolation addresses, the host then declares itself isolated.

The default isolation address is the gateway specified for the management network. Advanced settings can be used to modify the isolation addresses used for your particular environment. The option das.isolationaddress[X] (where X is 0–9) is used to configure multiple isolation addresses. Additionally, das.usedefaultisolationaddress is used to indicate whether the default isolation address (the default gateway) should be used to determine if the host is network-isolated. If the default gateway is not able to receive ICMP ping packets, you must set this option to false.
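As a rough illustration of how these two advanced options fit together, the sketch below builds the key/value pairs for a set of custom isolation addresses. The helper name, the IP addresses (taken from the 192.0.2.0/24 documentation range) and the exact key spelling with the index appended directly (das.isolationaddress0, das.isolationaddress1, and so on) are assumptions for the example; verify the option names against the documentation for your vSphere version before use.

```python
def isolation_address_options(addresses, use_default_gateway=True):
    """Build HA advanced options for custom isolation addresses.

    `addresses` is a list of pingable IP addresses; at most ten are
    allowed because the index X runs from 0 to 9.
    """
    if len(addresses) > 10:
        raise ValueError("das.isolationaddress[X] supports X = 0-9 only")
    options = {
        f"das.isolationaddress{i}": addr for i, addr in enumerate(addresses)
    }
    # Set to "false" when the default gateway does not answer ICMP pings.
    options["das.usedefaultisolationaddress"] = (
        "true" if use_default_gateway else "false"
    )
    return options


# Hypothetical example: two custom addresses, default gateway not used.
example = isolation_address_options(
    ["192.0.2.1", "192.0.2.2"], use_default_gateway=False
)
```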

Host Isolation Response

Tuning the host isolation response is typically based on whether loss of connectivity to a host via the management network would also indicate that clients accessing the virtual machine are affected. In this case it is likely that administrators would want the virtual machines shut down so other hosts with operational networks can start them up. If failures of the management network are not likely correlated with failures of the virtual machine network, where the loss of the management network simply results in the inability to manage the virtual machines on the isolated host, it is often preferable to leave the virtual machines running while the management network connectivity is restored.

The Host Isolation Response setting provides a means to set the action preferred for the powered-on virtual machines maintained by a host when that host has declared it is isolated. There are three possible isolation response values that can be configured and applied to a cluster or individually to a specific virtual machine.

• Leave Powered On

• Power Off

• Shut Down

Leave Powered On

With this option, virtual machines hosted on an isolated host are left powered on. In situations where a host loses all management network access, a virtual machine might still have the ability to access the storage subsystem and the virtual machine network. By selecting this option, the user enables the virtual machine to continue to function if this were to occur. This is the default isolation response setting in vSphere HA 5.0.

Power Off

When this isolation response option is used, the virtual machines on the isolated host are immediately stopped. This is similar to removing the power from a physical host. This can induce inconsistency with the file system of the OS used in the virtual machine. The advantage of this action is that vSphere HA will attempt to restart the virtual machine more quickly than when using the Shut Down option.

Shut Down

Through the use of the VMware Tools package installed within the guest OS of a virtual machine, this option attempts to shut down the guest OS gracefully before powering off the virtual machine. This is more desirable than using the Power Off option because it provides the OS with time to commit any outstanding I/O activity to disk.

HA will wait for a default time period of 300 seconds (five minutes) for this graceful shutdown to occur. If the OS is not gracefully shut down by this time, it will initiate a power off of the virtual machine.

Changing the das.isolationshutdowntimeout attribute will modify this timeout if it is determined that more time is required to shut down an OS gracefully. The Shut Down option requires that the VMware Tools package be installed in the guest OS. Otherwise, it is equivalent to the Power Off setting.
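The three isolation responses and the Shut Down fallback rules can be summarized in a small decision function. This is a minimal sketch of the behavior described in the text, not HA agent code: the function name, its parameters and the returned strings are hypothetical, and `shutdown_timeout` stands in for the value controlled by das.isolationshutdowntimeout.

```python
def effective_isolation_action(response, tools_installed=True,
                               seconds_waited=0, shutdown_timeout=300):
    """Return the action applied to a VM on an isolated host."""
    if response == "Leave Powered On":
        return "leave running"
    if response == "Power Off":
        return "power off"
    if response == "Shut Down":
        # Without VMware Tools, Shut Down is equivalent to Power Off.
        if not tools_installed:
            return "power off"
        # The guest gets `shutdown_timeout` seconds to shut down cleanly.
        if seconds_waited >= shutdown_timeout:
            return "power off"
        return "guest shutdown in progress"
    raise ValueError(f"unknown isolation response: {response!r}")
```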

In environments that use only network-based storage protocols, such as iSCSI and NFS, and those that share physical network components between the management and storage traffic, the recommended isolation response is Power Off. With these environments, it is likely that a network outage causing a host to become isolated will also affect the host’s ability to communicate to the datastores. If the isolated host loses access to the datastores, a master HA agent could power on a second copy of a virtual machine while the isolated host still has the original powered on. This situation might be problematic if both instances of the virtual machine retain access to the virtual machine network. The Power Off isolation response recommendation reduces the impact of this issue by having the isolated HA agent power off the virtual machines on the isolated host.

The following table lists the recommended isolation policy for converged network configurations:

Host Monitoring

The host monitoring setting determines whether vSphere HA restarts virtual machines on other hosts in the cluster after a host isolation, after a host failure, or after the virtual machines crash for some other reason. This setting does not impact the VM/application monitoring feature. If host monitoring is disabled, isolated hosts won’t apply the configured isolation response, and vSphere HA won’t restart virtual machines that fail for any reason. Disabling host monitoring also impacts VMware vSphere Fault Tolerance (FT) because it controls whether HA will restart an FT secondary virtual machine after a failure event.

Cluster Partitions

A cluster partition is a situation where a subset of hosts within the cluster loses the ability to communicate with the rest of the hosts in the cluster but can still communicate with each other.

This can occur for various reasons, but the most common cause is the use of a stretched cluster configuration. A stretched cluster is defined as a cluster that spans multiple sites within a metropolitan area.

When a cluster partition occurs, one subset of hosts is still able to communicate to a master node. The other subset of hosts cannot. For this reason, the second subset will go through an election process and elect a new master node. Therefore, it is possible to have multiple master nodes in a cluster partition scenario, with one per partition. This situation will last only as long as the partition exists. After the network issue causing the partition is resolved, the master nodes will be able to communicate and discover multiple master roles. Anytime multiple master nodes exist and can communicate with each other over the management network, all but one will abdicate. Robust management network architecture helps to avoid cluster partition situations.

Additionally, if a network partition occurs, users should ensure that each host retains access to its heartbeat datastores, and that the masters are able to access the heartbeat datastores used by the slave hosts.

vSphere Metro Storage Cluster Considerations

VMware vSphere Metro Storage Clusters (vMSC), or stretched clusters as they are often called, are environments that span multiple sites within a metropolitan area (typically up to 100km). Storage systems in these environments typically enable a seamless failover between sites. Because this is a complex environment, a paper specific to the vMSC has been produced. Download it here: http://www.vmware.com/resources/techresources/10299

Auto Deploy Considerations

Auto Deploy utilizes a PXE boot infrastructure to provision a host automatically. No host-‐state information is stored on the host itself.

The best practices recommendation from VMware staff for environments using Auto Deploy is as follows:

• Deploy vCenter Server Heartbeat. vCenter Server Heartbeat delivers high availability for vCenter Server, protecting the virtual and cloud infrastructure from application-, configuration-, OS- or hardware-related outages. (EOA)

• Avoid using Auto Deploy in stretched cluster environments, because this complicates the environment.

• Deploy vCenter Server in a virtual machine. Run the vCenter Server virtual machine in a vSphere HA–enabled cluster and configure the virtual machine with a vSphere HA restart priority of high. Perform one of the following actions:

o Include two or more hosts in the cluster that are not managed by Auto Deploy and pin the vCenter Server virtual machine to these hosts by using a rule (vSphere DRS–required virtual machine–to-host rule). Users can set up the rule and then disable DRS if they do not wish to use DRS in the cluster.

o Deploy vCenter Server and Auto Deploy in a separate management environment, that is, by hosts managed by a different vCenter server.

Virtual Machine and Application Health Monitoring

These features enable the vSphere HA agent on a host to detect heartbeat information on a virtual machine through VMware Tools or an agent running within the virtual machine that is monitoring the application health.

After the loss of a defined number of VMware Tools heartbeats on the virtual machine, vSphere HA will reset the virtual machine.

Virtual machine and application monitoring are not dependent on the virtual machine protection state attribute as reported by the vSphere Client.

This attribute signifies that vSphere HA detects that the preferred state of the virtual machine is to be powered on. For this reason, HA will attempt to restart the virtual machine assuming that there is nothing restricting the restart. Conditions that might restrict this action include insufficient resources available and a disabled virtual machine restart priority. This functionality is not available when the vSphere HA agent on a host is in the uninitialized state, as would occur immediately after the vSphere HA agent has been installed on the host or when the host is not available. Additionally, the number of missed heartbeats is reset after the vSphere HA agent on the host reboots, which should occur rarely if at all, or after vSphere HA is reconfigured on the host.

Because virtual machines exist only for the purposes of hosting an application, it is highly recommended that virtual machine health monitoring be enabled. All virtual machines must have the VMware Tools package installed within the guest OS. NOTE: Guest OS sleep states are not currently supported by virtual machine monitoring and can trigger an unnecessary restart of the virtual machine.

vSphere HA and vSphere FT

Often vSphere HA is used in conjunction with vSphere FT and provides protection for extremely critical virtual machines where any loss of service is intolerable.

vSphere HA detects the use of FT to ensure proper operation. This section describes some of the unique behavior specific to vSphere FT with vSphere HA. Additional vSphere FT best practices can be found in the vSphere 5.0 Availability Guide.

Host Partitions

vSphere HA will restart a secondary virtual machine of a vSphere FT virtual machine pair when the primary virtual machine is running in the same partition as the master HA agent that is responsible for the virtual machine. If this condition is not met, the secondary virtual machine in 5.0 cannot be restarted until the partition ends.

Host Isolation

Host isolation responses are not performed on virtual machines enabled with vSphere FT. The rationale is that the primary and secondary FT virtual machine pairs are already communicating via the FT logging network. So they either continue to function and have network connectivity or they have lost network and they are not heartbeating over the FT logging network, in which case one of them will then take over as a primary FT virtual machine. Because vSphere HA does not offer better protection than that, it bypasses FT virtual machines when initiating host isolation response.

Ensure that the FT logging network that is used is implemented with redundancy to provide greater resiliency to failures for FT.

Admission Control

vCenter Server uses HA admission control to ensure that sufficient resources in the cluster are reserved for virtual machine recovery in the event of host failure.

Admission control will prevent the following if there is encroachment on resources reserved for virtual machines restarted due to failure:

• The power-on of new virtual machines

• Changes of virtual machine memory or CPU reservations

• A vMotion instance of a virtual machine introduced into the cluster from another cluster

This mechanism is highly recommended to guarantee the availability of virtual machines. With vSphere 5.0, HA offers the following configuration options for choosing users’ admission control strategy:

Host Failures Cluster Tolerates (default):

HA ensures that a specified number of hosts can fail and that sufficient resources remain in the cluster to fail over all the virtual machines from those hosts. HA uses a concept called slots to calculate available resources and required resources for failing over virtual machines from a failed host. Under some configurations, this policy might be too conservative in its reservations. The slot size can be controlled using several advanced configuration options. In addition, an advanced option can be used to specify the default slot size value for CPU. This value is used when no CPU reservation has been specified for a virtual machine. The value was changed in vSphere 5.0 from 256MHz to 32MHz. When no memory reservation is specified for a virtual machine, the largest memory overhead for any virtual machine in the cluster will be used as the default slot size value for memory. See the vSphere Availability Guide for more information on slot-size calculation and tuning.
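The slot concept can be illustrated with a short calculation. The sketch below assumes the slot size is the largest CPU reservation (with the 32MHz default applied to virtual machines that have none) and the largest memory reservation plus memory overhead across the powered-on virtual machines; the `VM` type, field names and sample numbers are hypothetical, and the advanced options that cap the slot size are not modeled. Treat it as a simplified reading of the policy, and consult the vSphere Availability Guide for the authoritative rules.

```python
from dataclasses import dataclass

DEFAULT_CPU_SLOT_MHZ = 32  # vSphere 5.0 default when no CPU reservation is set


@dataclass
class VM:
    cpu_reservation_mhz: int  # 0 means no reservation
    mem_reservation_mb: int   # 0 means no reservation
    mem_overhead_mb: int


def slot_size(vms):
    """Return (cpu_mhz, mem_mb) for the cluster-wide slot."""
    cpu = max(vm.cpu_reservation_mhz or DEFAULT_CPU_SLOT_MHZ for vm in vms)
    # With no memory reservation, only the overhead contributes.
    mem = max(vm.mem_reservation_mb + vm.mem_overhead_mb for vm in vms)
    return cpu, mem


def slots_per_host(cpu_capacity_mhz, mem_capacity_mb, slot):
    """A host holds as many slots as its scarcer resource allows."""
    cpu_slot, mem_slot = slot
    return min(cpu_capacity_mhz // cpu_slot, mem_capacity_mb // mem_slot)


vms = [VM(0, 0, 90), VM(500, 1024, 120), VM(0, 2048, 100)]
slot = slot_size(vms)                          # (500, 2148)
per_host = slots_per_host(9000, 16384, slot)   # 7, memory is the limit
```

One virtual machine with a large reservation sets the slot size for every virtual machine, which is why this policy can be overly conservative in mixed clusters.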

Percentage of Cluster Resources Reserved as failover spare capacity:

vSphere HA ensures that a specified percentage of memory and CPU resources are reserved for failover. This policy is recommended for situations where the user must host virtual machines with significantly different CPU and memory reservations in the same cluster or have different-sized hosts in terms of CPU and memory capacity (vSphere 5.0 adds the ability to specify different percentages for memory and CPU through the vSphere Client). A key difference between this policy and the Host Failures Cluster Tolerates policy is that with this option the capacity set aside for failures can be fragmented across hosts.

Specify a Failover Host:

vSphere HA designates a specific host or hosts as a failover host(s). When a host fails, HA attempts to restart its virtual machines on the specified failover host(s). The ability to specify more than one failover host is a new feature in vSphere HA 5.0. When a host is designated as a failover host, HA admission control does not enable the powering on of virtual machines on that host, and DRS will not migrate virtual machines to the failover host. It effectively becomes a hot standby.

With each of the three admission control policies there is a chance in specific scenarios that, at the time of failing over a virtual machine, there might be insufficient contiguous capacity available on a single host to power on a given virtual machine. Although these are corner case scenarios, this has been taken into account and HA will request vSphere DRS, if it is enabled, to attempt to defragment the capacity in such situations.

Further, if a host had been put into standby and vSphere DPM is enabled, it will attempt to power up a host if defragmentation is not sufficient.

The best practices recommendation from VMware staff for admission control is as follows:

Select the Percentage of Cluster Resources Reserved policy for admission control. This policy offers the most flexibility in terms of host and virtual machine sizing and is sufficient for most situations. When configuring this policy, the user should choose a percentage for CPU and memory that reflects the number of host failures they wish to support.

For example, if the user wants vSphere HA to set aside capacity for two host failures and there are 10 hosts of equal capacity in the cluster, then they should specify 20 percent (2/10). If there are not equal capacity hosts, then the user should specify a percentage that equals the capacity of the two largest hosts as a percentage of the cluster capacity.
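The percentage rule in this example can be written as a short calculation. The helper below is a sketch under the assumptions stated in the text (reserve the capacity of the N largest hosts as a share of total cluster capacity); the function name is hypothetical, and rounding up to a whole percent is a choice made for the example. It would be applied once for CPU and once for memory, since vSphere 5.0 accepts separate percentages.

```python
import math


def failover_percentage(host_capacities, host_failures):
    """Percentage of cluster capacity to reserve so that the
    `host_failures` largest hosts can fail."""
    if not 0 < host_failures < len(host_capacities):
        raise ValueError("host_failures must be between 1 and hosts - 1")
    largest = sorted(host_capacities, reverse=True)[:host_failures]
    return math.ceil(100 * sum(largest) / sum(host_capacities))


# Ten equal hosts, two failures tolerated: 2/10 of the cluster.
equal = failover_percentage([100] * 10, 2)            # 20

# Unequal hosts: reserve the share held by the two largest hosts.
unequal = failover_percentage([64, 64, 32, 32, 32, 32], 2)   # 50
```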

• If the Host Failures Cluster Tolerates policy is used, attempt to keep virtual machine resource reservations similar across all configured virtual machines. Host Failures Cluster Tolerates uses a notion of “slot sizes” to calculate the amount of capacity needed as a reserve for each virtual machine. The slot size is based on the largest reserved memory and CPU needed for any virtual machine. Mixing virtual machines of greatly different CPU and memory requirements will cause the slot size calculation to default to the largest possible virtual machine, limiting consolidation. See the vSphere 5.0 Availability Guide for more information on slot-size calculation and overriding slot-size calculation in cases where it is necessary to configure different-sized virtual machines in the same cluster.

• If the Failover Host policy is used, decide how many host failures to support, and then specify this number of hosts as failover hosts.

• Ensure that all cluster hosts are sized equally. If unequally sized hosts are used with the Host Failures Cluster Tolerates policy, vSphere HA will reserve excess capacity to handle failures of the largest N hosts, where N is the number of host failures specified. With Percentage of Cluster Resources Reserved policy, unequally sized hosts will require that the user increase the percentages to reserve enough capacity for the planned number of host failures. Finally, with the Specify a Failover Host policy, users must specify failover hosts that are as large as the largest nonfailover hosts in the cluster. This ensures that there is adequate capacity in case of failures.

HA added a capability in vSphere 4.1 to balance virtual machine loading on failover, thereby reducing the issue of resource imbalance in a cluster after a failover. With this capability, there is less likelihood for vMotion instances after a failover. Also in vSphere 4.1, HA invokes vSphere DRS to create more contiguous capacity on hosts. This increases the chance for larger virtual machines to be restarted if some virtual machines cannot be restarted because of resource fragmentation. This does not guarantee enough contiguous resources to restart all the failed virtual machines. It simply means that vSphere will make the best effort to restart all virtual machines with the host resources remaining after a failure.

The admission control policy is evaluated against the current state of the cluster, not the normal state of the cluster. The normal state means that all hosts are connected and healthy. Admission control does not take into account resources of hosts that are disconnected or in maintenance mode. Only healthy and connected hosts (including standby hosts, if vSphere DPM is enabled) can provide resources that are reserved for tolerating host failures.

Affinity Rules

A virtual machine–host affinity rule specifies that the members of a selected virtual machine DRS group should or must run on the members of a specific host DRS group. Unlike a virtual machine–virtual machine affinity rule, which specifies affinity (or anti-affinity) between individual virtual machines, a virtual machine–host affinity rule specifies an affinity relationship between a group of virtual machines and a group of hosts. There are required rules (designated by the term “must”) and preferred rules (designated by the term “should”). See the vSphere Resource Management Guide for more details on setting up virtual machine–host affinity rules.

When restarting virtual machines after a failure, HA ignores the preferential virtual machine–host rules but follows the required rules. If HA violates any preferential rule, DRS will attempt to correct it after the failover is complete by migrating virtual machines. Additionally, vSphere DRS might be required to migrate other virtual machines to make space on the preferred hosts. If required rules are specified, vSphere HA will restart virtual machines on an ESXi host in the same host DRS group only. If no available hosts are in the host DRS group or the hosts are resource constrained, the restart will fail. Any required rules defined when DRS is enabled are enforced even if DRS is subsequently disabled. So to remove the effect of such a rule, it must be explicitly disabled. Limit the use of required virtual machine–host affinity rules to situations where they are necessary, because such rules can restrict HA target host selection when restarting a virtual machine after a failure.
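The effect of required rules on restart placement can be sketched as a simple filter. This is an illustration of the behavior described above, not HA's placement engine: the function name, the dictionary layout and the host and virtual machine names are hypothetical.

```python
def ha_restart_candidates(vm, available_hosts, required_host_groups):
    """Return the hosts on which HA may restart `vm`.

    `required_host_groups` maps a VM name to the set of hosts allowed by
    a required ("must") rule. Preferred ("should") rules are deliberately
    not consulted, mirroring the text above.
    """
    allowed = required_host_groups.get(vm)
    if allowed is None:
        return list(available_hosts)
    return [host for host in available_hosts if host in allowed]


hosts = ["esx1", "esx2", "esx3"]
rules = {"db01": {"esx2", "esx4"}}

unrestricted = ha_restart_candidates("web01", hosts, rules)   # all hosts
restricted = ha_restart_candidates("db01", hosts, rules)      # ["esx2"]
no_target = ha_restart_candidates("db01", ["esx1"], rules)    # [] -> restart fails
```

The empty result in the last call is the case the text warns about: a required rule leaves HA with no valid target, so the restart fails even though other hosts are healthy.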

Log Files

In the latest version of HA, the changes in the architecture enabled changes in how logging is performed. Previous versions of HA stored the operational logging information across several distinct log files. In vSphere HA 5.0, this information is consolidated into a single operational log file. This log file utilizes a circular log rotation mechanism, resulting in multiple files, with each file containing a part of the overall retained log history. To improve the ability of the VMware support staff to diagnose problems, VMware recommends configuring logging to retain approximately one week of history. The following table provides recommended log capacities for several sample cluster configurations.

The preceding recommendations are sufficient for most environments. If the user notices that the HA log history does not span one week after implementing the recommended settings in the preceding table, they should consider increasing the capacity beyond what is noted.

Increasing the log capacity for HA involves specifying the number of log rotations that are preserved and the size of each log file in the rotation. For log capacities up to 30MB, use a 1MB file size; for log capacities greater than 30MB, use a 5MB file size.
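The file-size rule translates into a rotation count as follows. This is a minimal sketch: the function name is hypothetical, rounding the file count up is an assumption, and how the resulting values map onto the HA advanced options (for example das.config.log.maxFileNum) should be checked against the product documentation.

```python
import math


def log_rotation_settings(capacity_mb):
    """Return (file_size_mb, file_count) for a target log capacity."""
    if capacity_mb <= 0:
        raise ValueError("capacity must be positive")
    # 1MB files up to 30MB of total capacity, 5MB files beyond that.
    file_size_mb = 1 if capacity_mb <= 30 else 5
    file_count = math.ceil(capacity_mb / file_size_mb)
    return file_size_mb, file_count


small = log_rotation_settings(20)   # (1, 20)
large = log_rotation_settings(50)   # (5, 10)
```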

1. The default log settings are sufficient for ESXi hosts that are logging to persistent storage.

2. The default log setting is sufficient for ESXi 5.0 hosts if the following conditions are met: (i) they are not managed by Auto Deploy and (ii) they are configured with the default log location in a scratch directory on a vSphere VMFS partition.

NOTE: The name of the vSphere HA logger is Fault Domain Manager (FDM).

General Logging Recommendations for All ESX Versions

• Ensure that the location where the log files will be stored has sufficient space available.

• For ESXi hosts, ensure that logging is being done to a persistent location.

• When changing the directory path, ensure that it is present on all hosts in the cluster and is mapped to a different directory for each host.

• Configure each HA cluster separately.

• In vSphere 5.0, if a cluster contains 5.0 and earlier host versions, setting the das.config.log.maxFileNum advanced option will cause the 5.0 hosts to maintain two copies of the log files, one maintained by the 5.0 logging mechanism discussed in the ESXi 5.0 documentation (see the following) and one maintained by the pre-5.0 logging mechanism, which is configured using the advanced options previously discussed. In vSphere 5.0U1, this issue has been resolved. In this version, to maintain two sets of log files, the new HA advanced configuration option das.config.log.outputToFiles must be set to true, and das.config.log.maxFileNum must be set to a value greater than two.

• After changing the advanced options, reconfigure HA on each host in the cluster. The log values users configure in this manner will be preserved across vCenter Server updates. However, applying an update that includes a new version of the HA agent will require HA to be reconfigured on each host for the configured values to be reapplied.

5. vSphere ESXi vCenter Server 5.0 Availability Guide (High Level)

Business Continuity and Minimizing Downtime

vSphere makes it possible for organizations to dramatically reduce planned downtime. Because workloads in a vSphere environment can be dynamically moved to different physical servers without downtime or service interruption, server maintenance can be performed without requiring application and service downtime. With vSphere, organizations can:

• Eliminate downtime for common maintenance operations.

• Eliminate planned maintenance windows.

• Perform maintenance at any time without disrupting users and services.

The vSphere vMotion and Storage vMotion functionality in vSphere makes it possible for organizations to reduce planned downtime because workloads in a VMware environment can be dynamically moved to different physical servers or to different underlying storage without service interruption.

Preventing Unplanned Downtime

Key availability capabilities are built into vSphere:

• Shared storage. Eliminate single points of failure by storing virtual machine files on shared storage, such as Fibre Channel or iSCSI SAN, or NAS. SAN mirroring and replication features can be used to keep updated copies of virtual disks at disaster recovery sites.

• Network interface teaming. Provide tolerance of individual network card failures.

• Storage multipathing. Tolerate storage path failures.

vSphere HA Provides Rapid Recovery from Outages

Unlike other clustering solutions, vSphere HA provides the infrastructure to protect all workloads with the infrastructure:

• You do not need to install special software within the application or virtual machine. All workloads are protected by vSphere HA. After vSphere HA is configured, no actions are required to protect new virtual machines. They are automatically protected.

• You can combine vSphere HA with vSphere Distributed Resource Scheduler (DRS) to protect against failures and to provide load balancing across the hosts within a cluster.

Minimal setup

After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.

Reduced hardware cost and setup

The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines. When you use vSphere HA, you must have sufficient resources to fail over the number of hosts you want to protect with vSphere HA. However, the vCenter Server system automatically manages resources and configures clusters.

Increased application availability

Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and restarting nonresponsive virtual machines, vSphere HA also protects against guest operating system crashes.

DRS and vMotion integration

If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, vSphere HA can help recover from that failure.

vSphere Fault Tolerance Provides Continuous Availability

vSphere HA provides a base level of protection for your virtual machines by restarting virtual machines in the event of a host failure. vSphere Fault Tolerance provides a higher level of availability, allowing users to protect any virtual machine from a host failure with no loss of data, transactions, or connections.

How vSphere HA Works

When you create a vSphere HA cluster, a single host is automatically elected as the master host. The master host communicates with vCenter Server and monitors the state of all protected virtual machines and of the slave hosts.

The master host must distinguish between a failed host and one that is in a network partition or that has become network isolated. The master host uses datastore heartbeating to determine the type of failure.

Master and Slave Hosts

When you add a host to a vSphere HA cluster, an agent is uploaded to the host and configured to communicate with other agents in the cluster. Each host in the cluster functions as a master host or a slave host.

When vSphere HA is enabled for a cluster, all active hosts (those not in standby or maintenance mode, or not disconnected) participate in an election to choose the cluster's master host.

The host that mounts the greatest number of datastores has an advantage in the election.

Only one master host exists per cluster and all other hosts are slave hosts. If the master host fails, is shut down, or is removed from the cluster a new election is held.

The master host in a cluster has a number of responsibilities:

• Monitoring the state of slave hosts. If a slave host fails or becomes unreachable, the master host identifies which virtual machines need to be restarted.

• Monitoring the power state of all protected virtual machines. If one virtual machine fails, the master host ensures that it is restarted. Using a local placement engine, the master host also determines where the restart should be done.

• Managing the lists of cluster hosts and protected virtual machines.

• Acting as vCenter Server management interface to the cluster and reporting the cluster health state.

The slave hosts primarily contribute to the cluster by running virtual machines locally, monitoring their runtime states, and reporting state updates to the master host. A master host can also run and monitor virtual machines. Both slave hosts and master hosts implement the VM and Application Monitoring features.

One of the functions performed by the master host is virtual machine protection. When a virtual machine is protected, vSphere HA guarantees that it attempts to power it back on after a failure.

A master host commits to protecting a virtual machine when it observes that the power state of the virtual machine changes from powered off to powered on in response to a user action. If a failover occurs, the master host must restart the virtual machines that are protected and for which it is responsible. This responsibility is assigned to the master host that has exclusively locked a system-defined file on the datastore that contains a virtual machine's configuration file.

NOTE If you disconnect a host from a cluster, all of the virtual machines registered to that host are unprotected by vSphere HA.

Host Failure Types and Detection

In a vSphere HA cluster, three types of host failure are detected:

• A host stops functioning (that is, fails).

• A host becomes network isolated.

• A host loses network connectivity with the master host.

The master host monitors the liveness of the slave hosts in the cluster. This communication is done through the exchange of network heartbeats every second.

When the master host stops receiving these heartbeats from a slave host, it checks for host liveness before declaring the host to have failed. The liveness check that the master host performs is to determine whether the slave host is exchanging heartbeats with one of the datastores.

Also, the master host checks whether the host responds to ICMP pings sent to its management IP addresses.

If a master host is unable to communicate directly with the agent on a slave host, the slave host does not respond to ICMP pings, and the agent is not issuing heartbeats, the slave host is considered to have failed.

The host's virtual machines are restarted on alternate hosts.

If such a slave host is exchanging heartbeats with a datastore, the master host assumes that it is in a network partition or network isolated and so continues to monitor the host and its virtual machines.
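The liveness checks described above can be read as a small decision procedure. The following Python sketch is illustrative only and is not vSphere HA code; the function name, parameter names, and returned labels are hypothetical, and the one combination the text does not cover is reported as undetermined.

```python
def classify_slave_host(network_heartbeat: bool,
                        responds_to_ping: bool,
                        datastore_heartbeat: bool) -> str:
    """Return the master host's view of a slave host (illustrative only)."""
    if network_heartbeat:
        # Network heartbeats are still arriving, so no liveness check is needed.
        return "live"
    if datastore_heartbeat:
        # The host still updates a heartbeat datastore: it is assumed to be
        # partitioned or isolated, and the master keeps monitoring it.
        return "partitioned or isolated"
    if not responds_to_ping:
        # No agent contact, no ICMP reply, no datastore heartbeat.
        return "failed"
    # Ping answers but the agent and datastore heartbeats are silent.
    # The notes above do not describe this case.
    return "undetermined"


print(classify_slave_host(False, False, False))
print(classify_slave_host(False, False, True))
```

Only the "failed" outcome leads to the host's virtual machines being restarted on alternate hosts.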

Host network isolation occurs when a host is still running, but it can no longer observe traffic from vSphere HA agents on the management network. If a host stops observing this traffic, it attempts to ping the cluster isolation addresses. If this also fails, the host declares itself as isolated from the network.

The master host monitors the virtual machines that are running on an isolated host and if it observes that they power off, and the master host is responsible for the virtual machines, it restarts them.

NOTE If you ensure that the network infrastructure is sufficiently redundant and that at least one network path is available at all times, host network isolation should be a rare occurrence.

Network Partitions

Datastore Heartbeating

When the master host in a vSphere HA cluster cannot communicate with a slave host over the management network, the master host uses datastore heartbeating to determine whether the slave host has failed, is in a network partition, or is network isolated. If the slave host has stopped datastore heartbeating, it is considered to have failed and its virtual machines are restarted elsewhere.

You can use the advanced attribute das.heartbeatdsperhost to change the number of heartbeat datastores selected by vCenter Server for each host. The default is two and the maximum valid value is five.

vSphere HA creates a directory at the root of each datastore that is used for both datastore heartbeating and for persisting the set of protected virtual machines. The name of the directory is .vSphere-HA. Do not delete or modify the files stored in this directory, because this can have an impact on operations.

vSphere HA Security

vSphere HA uses TCP and UDP port 8182 for agent-to-agent communication. The firewall ports open and close automatically to ensure they are open only when needed.

vSphere HA stores configuration information on the local storage or on ramdisk if there is no local datastore. These files are protected using file system permissions and they are accessible only to the root user.

For ESXi 5.x hosts, vSphere HA writes to syslog only by default, so logs are placed where syslog is configured to put them. The log file names for vSphere HA are prepended with fdm (Fault Domain Manager), which is a service of vSphere HA.

All communication between vCenter Server and the vSphere HA agent is done over SSL.

vSphere HA requires that each host have a verified SSL certificate. Each host generates a self-signed certificate when it is booted for the first time. This certificate can then be regenerated or replaced with one issued by an authority. If the certificate is replaced, vSphere HA needs to be reconfigured on the host. If a host becomes disconnected from vCenter Server after its certificate is updated and the ESXi or ESX Host agent is restarted, then vSphere HA is automatically reconfigured when the host is reconnected to vCenter Server. If the disconnection does not occur because vCenter Server host SSL certificate verification is disabled at the time, verify the new certificate and reconfigure vSphere HA on the host.

Using vSphere HA and DRS Together

Using vSphere HA with Distributed Resource Scheduler (DRS) combines automatic failover with load balancing.

When vSphere HA performs failover and restarts virtual machines on different hosts, its first priority is the immediate availability of all virtual machines. After the virtual machines have been restarted, those hosts on which they were powered on might be heavily loaded, while other hosts are comparatively lightly loaded.

In a cluster using DRS and vSphere HA with admission control turned on, virtual machines might not be evacuated from hosts entering maintenance mode. This behavior occurs because of the resources reserved for restarting virtual machines in the event of a failure. You must manually migrate the virtual machines off of the hosts using vMotion.

In some scenarios, vSphere HA might not be able to fail over virtual machines because of resource constraints. This can occur for several reasons.

• HA admission control is disabled and Distributed Power Management (DPM) is enabled. This can result in DPM consolidating virtual machines onto fewer hosts and placing the empty hosts in standby mode, leaving insufficient powered-on capacity to perform a failover.

• VM-Host affinity (required) rules might limit the hosts on which certain virtual machines can be placed.

• There might be sufficient aggregate resources but these can be fragmented across multiple hosts so that they cannot be used by virtual machines for failover.

In such cases, vSphere HA can use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual machines to defragment the cluster resources) so that HA can perform the failovers.

If DPM is in manual mode, you might need to confirm host power-on recommendations. Similarly, if DRS is in manual mode, you might need to confirm migration recommendations.

If you are using VM-Host affinity rules that are required, be aware that these rules cannot be violated. vSphere HA does not perform a failover if doing so would violate such a rule.

vSphere HA Admission Control

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected. Three types of admission control are available.

• Host: Ensures that a host has sufficient resources to satisfy the reservations of all virtual machines running on it.

• Resource Pool: Ensures that a resource pool has sufficient resources to satisfy the reservations, shares, and limits of all virtual machines associated with it.

• vSphere HA: Ensures that sufficient resources in the cluster are reserved for virtual machine recovery in the event of host failure.

Admission control imposes constraints on resource usage and any action that would violate these constraints is not permitted. Examples of actions that could be disallowed include the following:

• Powering on a virtual machine.

• Migrating a virtual machine onto a host or into a cluster or resource pool.

• Increasing the CPU or memory reservation of a virtual machine.

Of the three types of admission control, only vSphere HA admission control can be disabled. However, without it there is no assurance that the expected number of virtual machines can be restarted after a failure. VMware recommends that you do not disable admission control, but you might need to do so temporarily, for the following reasons:

• If you need to violate the failover constraints when there are not enough resources to support them, for example, if you are placing hosts in standby mode to test them for use with Distributed Power Management (DPM).

• If an automated process needs to take actions that might temporarily violate the failover constraints (for example, as part of an upgrade directed by vSphere Update Manager).

• If you need to perform testing or maintenance operations.

NOTE When vSphere HA admission control is disabled, vSphere HA ensures that there are at least two powered-on hosts in the cluster even if DPM is enabled and can consolidate all virtual machines onto a single host. This is to ensure that failover is possible.

Host Failures Cluster Tolerates Admission Control Policy

You can configure vSphere HA to tolerate a specified number of host failures. With the Host Failures Cluster Tolerates admission control policy, vSphere HA ensures that a specified number of hosts can fail and sufficient resources remain in the cluster to fail over all the virtual machines from those hosts.

With the Host Failures Cluster Tolerates policy, vSphere HA performs admission control in the following way:

• Calculates the slot size. A slot is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster.

• Determines how many slots each host in the cluster can hold.

• Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.

• Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user).

If it is, admission control disallows the operation.

Slot Size Calculation

Slot size is comprised of two components, CPU and memory.

vSphere HA calculates the CPU component by obtaining the CPU reservation of each powered-on virtual machine and selecting the largest value. If you have not specified a CPU reservation for a virtual machine, it is assigned a default value of 32MHz. (You can change this value by using the das.vmcpuminmhz advanced attribute.)

vSphere HA calculates the memory component by obtaining the memory reservation, plus memory overhead, of each powered-on virtual machine and selecting the largest value. There is no default value for the memory reservation.

If your cluster contains any virtual machines that have much larger reservations than the others, they will distort slot size calculation. To avoid this, you can specify an upper bound for the CPU or memory component of the slot size by using the das.slotcpuinmhz or das.slotmeminmb advanced attributes, respectively.
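The slot size rules above can be sketched in Python. This is a minimal illustration, not VMware's implementation: the function name, the tuple layout, and the sample reservations are hypothetical, and the two cap parameters stand in for the das.slotcpuinmhz and das.slotmeminmb upper bounds.

```python
def slot_size(vms, cpu_default_mhz=32, cpu_cap_mhz=None, mem_cap_mb=None):
    """Return (cpu_mhz, mem_mb) for the slot.

    vms: list of (cpu_reservation_mhz, mem_reservation_mb, mem_overhead_mb)
    tuples, one per powered-on virtual machine.
    """
    # CPU: largest reservation, with the 32MHz default for unreserved VMs.
    cpu = max(cpu_res if cpu_res > 0 else cpu_default_mhz
              for cpu_res, _, _ in vms)
    # Memory: largest reservation plus overhead (no default reservation).
    mem = max(mem_res + overhead for _, mem_res, overhead in vms)
    # Optional upper bounds keep one large VM from distorting the slot.
    if cpu_cap_mhz is not None:
        cpu = min(cpu, cpu_cap_mhz)
    if mem_cap_mb is not None:
        mem = min(mem, mem_cap_mb)
    return cpu, mem


# Hypothetical cluster: two small VMs and one with large reservations.
vms = [(0, 0, 100), (500, 1024, 120), (4000, 8192, 300)]
print(slot_size(vms))                                   # uncapped
print(slot_size(vms, cpu_cap_mhz=1000, mem_cap_mb=2048))  # capped
```

With the caps applied, the large virtual machine no longer sets the slot size for the whole cluster, which is the distortion the upper bounds are meant to avoid.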

Using Slots to Compute the Current Failover Capacity

After the slot size is calculated, vSphere HA determines each host's CPU and memory resources that are available for virtual machines. These amounts are those contained in the host's root resource pool, not the total physical resources of the host. The resource data for a host that is used by vSphere HA can be found by using the vSphere Client to connect to the host directly, and then navigating to the Resource tab for the host. If all hosts in your cluster are the same, this data can be obtained by dividing the cluster-level figures by the number of hosts. Resources being used for virtualization purposes are not included. Only hosts that are connected, not in maintenance mode, and that have no vSphere HA errors are considered.

The maximum number of slots that each host can support is then determined. To do this, the host’s CPU resource amount is divided by the CPU component of the slot size and the result is rounded down. The same calculation is made for the host's memory resource amount. These two numbers are compared and the smaller number is the number of slots that the host can support.

The Current Failover Capacity is computed by determining how many hosts (starting from the largest) can fail and still leave enough slots to satisfy the requirements of all powered-on virtual machines.

Admission Control Using Host Failures Cluster Tolerates Policy

The way that slot size is calculated and used with this admission control policy is shown in an example. Make the following assumptions about a cluster:

• The cluster is comprised of three hosts, each with a different amount of available CPU and memory resources. The first host (H1) has 9GHz of available CPU resources and 9GB of available memory, while Host 2 (H2) has 9GHz and 6GB and Host 3 (H3) has 6GHz and 6GB.

• There are five powered-on virtual machines in the cluster with differing CPU and memory requirements. VM1 needs 2GHz of CPU resources and 1GB of memory, while VM2 needs 2GHz and 1GB, VM3 needs 1GHz and 2GB, VM4 needs 1GHz and 1GB, and VM5 needs 1GHz and 1GB.

• The Host Failures Cluster Tolerates is set to one.

1. Slot size is calculated by comparing both the CPU and memory requirements of the virtual machines and selecting the largest.

The largest CPU requirement (shared by VM1 and VM2) is 2GHz, while the largest memory requirement (for VM3) is 2GB. Based on this, the slot size is 2GHz CPU and 2GB memory.

2. Maximum number of slots that each host can support is determined. H1 can support four slots. H2 can support three slots (which is the smaller of 9GHz/2GHz and 6GB/2GB) and H3 can also support three slots.

3. Current Failover Capacity is computed. The largest host is H1 and if it fails, six slots remain in the cluster, which is sufficient for all five of the powered-on virtual machines. If both H1 and H2 fail, only three slots remain, which is insufficient. Therefore, the Current Failover Capacity is one.

The cluster has one available slot (the six slots on H2 and H3 minus the five used slots).
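The arithmetic of this example can be checked with a short Python sketch. It is an illustration of the steps described above rather than vSphere HA code; the variable and function names are hypothetical, and resources are expressed in whole GHz and GB as in the example.

```python
# (available CPU in GHz, available memory in GB) per host
hosts = {"H1": (9, 9), "H2": (9, 6), "H3": (6, 6)}
# (CPU requirement in GHz, memory requirement in GB) per powered-on VM
vms = {"VM1": (2, 1), "VM2": (2, 1), "VM3": (1, 2),
       "VM4": (1, 1), "VM5": (1, 1)}

# Step 1: the slot size is the largest CPU and the largest memory requirement.
slot_cpu = max(cpu for cpu, _ in vms.values())
slot_mem = max(mem for _, mem in vms.values())

# Step 2: slots per host, rounded down, limited by the scarcer resource.
slots_per_host = {name: min(cpu // slot_cpu, mem // slot_mem)
                  for name, (cpu, mem) in hosts.items()}


def current_failover_capacity(slots, slots_needed):
    """Step 3: count how many hosts can fail, largest first."""
    remaining = sorted(slots.values(), reverse=True)
    failures = 0
    while remaining:
        remaining.pop(0)  # the largest remaining host fails
        if sum(remaining) < slots_needed:
            break
        failures += 1
    return failures


capacity = current_failover_capacity(slots_per_host, len(vms))
print(slot_cpu, slot_mem, slots_per_host, capacity)
```

The sketch yields a 2GHz/2GB slot, four slots on H1, three each on H2 and H3, and a Current Failover Capacity of one, matching the figures in the example.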

Percentage of Cluster Resources Reserved Admission Control Policy

You can configure vSphere HA to perform admission control by reserving a specific percentage of cluster CPU and memory resources for recovery from host failures.

With the Percentage of Cluster Resources Reserved admission control policy, vSphere HA ensures that a specified percentage of aggregate CPU and memory resources are reserved for failover.

With the Percentage of Cluster Resources Reserved policy, vSphere HA enforces admission control as follows:

• Calculates the total resource requirements for all powered-on virtual machines in the cluster.

• Calculates the total host resources available for virtual machines.

• Calculates the Current CPU Failover Capacity and Current Memory Failover Capacity for the cluster.

• Determines if either the Current CPU Failover Capacity or Current Memory Failover Capacity is less than the corresponding Configured Failover Capacity (provided by the user). If so, admission control disallows the operation.

vSphere HA uses the actual reservations of the virtual machines. If a virtual machine does not have reservations, meaning that the reservation is 0, a default of 0MB memory and 32MHz CPU is applied.

NOTE The Percentage of Cluster Resources Reserved admission control policy also checks that there are at least two vSphere HA-enabled hosts in the cluster (excluding hosts that are entering maintenance mode). If there is only one vSphere HA-enabled host, an operation is not allowed, even if there is a sufficient percentage of resources available. The reason for this extra check is that vSphere HA cannot perform failover if there is only a single host in the cluster.

Computing the Current Failover Capacity

The total resource requirements for the powered-on virtual machines are comprised of two components, CPU and memory. vSphere HA calculates these values as follows:

• The CPU component by summing the CPU reservations of the powered-on virtual machines. If you have not specified a CPU reservation for a virtual machine, it is assigned a default value of 32MHz (this value can be changed using the das.vmcpuminmhz advanced attribute).

• The memory component by summing the memory reservation (plus memory overhead) of each powered-on virtual machine.

The total host resources available for virtual machines is calculated by adding the hosts' CPU and memory resources. These amounts are those contained in the host's root resource pool, not the total physical resources of the host. Resources being used for virtualization purposes are not included. Only hosts that are connected, not in maintenance mode, and have no vSphere HA errors are considered.

The Current CPU Failover Capacity is computed by subtracting the total CPU resource requirements from the total host CPU resources and dividing the result by the total host CPU resources. The Current Memory Failover Capacity is calculated similarly.

Admission Control Using Percentage of Cluster Resources Reserved Policy

The way that Current Failover Capacity is calculated and used with this admission control policy is shown with an example. Make the following assumptions about a cluster:

• The cluster is comprised of three hosts, each with a different amount of available CPU and memory resources. The first host (H1) has 9GHz of available CPU resources and 9GB of available memory, while Host 2 (H2) has 9GHz and 6GB and Host 3 (H3) has 6GHz and 6GB.

• There are five powered-on virtual machines in the cluster with differing CPU and memory requirements. VM1 needs 2GHz of CPU resources and 1GB of memory, while VM2 needs 2GHz and 1GB, VM3 needs 1GHz and 2GB, VM4 needs 1GHz and 1GB, and VM5 needs 1GHz and 1GB.

• The Configured Failover Capacity is set to 25%.

The total resource requirements for the powered-on virtual machines are 7GHz and 6GB. The total host resources available for virtual machines are 24GHz and 21GB. Based on this, the Current CPU Failover Capacity is 70% ((24GHz - 7GHz)/24GHz). Similarly, the Current Memory Failover Capacity is 71% ((21GB - 6GB)/21GB).

Because the cluster's Configured Failover Capacity is set to 25%, 45% (70 - 25) of the cluster's total CPU resources and 46% (71 - 25) of the cluster's memory resources are still available to power on additional virtual machines.
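The same figures can be reproduced with a short Python sketch. It is illustrative only; the names are hypothetical, and percentages are truncated to whole numbers, an assumption chosen because it matches the 70% and 71% quoted in the example.

```python
hosts = [(9, 9), (9, 6), (6, 6)]                   # (GHz, GB) per host
vms = [(2, 1), (2, 1), (1, 2), (1, 1), (1, 1)]     # (GHz, GB) per VM
configured_pct = 25                                 # Configured Failover Capacity


def failover_capacity_pct(total, required):
    """(total - required) / total as a whole-number percentage (truncated)."""
    return (total - required) * 100 // total


total_cpu = sum(cpu for cpu, _ in hosts)    # 24 GHz
total_mem = sum(mem for _, mem in hosts)    # 21 GB
need_cpu = sum(cpu for cpu, _ in vms)       # 7 GHz
need_mem = sum(mem for _, mem in vms)       # 6 GB

cpu_pct = failover_capacity_pct(total_cpu, need_cpu)
mem_pct = failover_capacity_pct(total_mem, need_mem)

# An operation is disallowed if either capacity is below the configured value.
allowed = cpu_pct >= configured_pct and mem_pct >= configured_pct
headroom = (cpu_pct - configured_pct, mem_pct - configured_pct)
print(cpu_pct, mem_pct, allowed, headroom)
```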

Specify Failover Hosts Admission Control Policy

You can configure vSphere HA to designate specific hosts as the failover hosts.

With the Specify Failover Hosts admission control policy, when a host fails, vSphere HA attempts to restart its virtual machines on one of the specified failover hosts. If this is not possible, for example the failover hosts have failed or have insufficient resources, then vSphere HA attempts to restart those virtual machines on other hosts in the cluster.
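The restart placement order described above (designated failover hosts first, then the rest of the cluster) can be sketched as follows. This is a simplified illustration with hypothetical names: capacity is reduced to a single number per host, and a failed host is represented by None.

```python
def pick_restart_host(vm_demand, failover_hosts, other_hosts):
    """Return the name of the first host that can take the VM, or None.

    failover_hosts / other_hosts: dicts of host name -> free capacity,
    where None marks a host that has itself failed.
    """
    # Designated failover hosts are tried before any other cluster host.
    for pool in (failover_hosts, other_hosts):
        for name, free in pool.items():
            if free is not None and free >= vm_demand:
                return name
    return None


failover = {"esx-spare": None}            # the failover host has failed
others = {"esx01": 2, "esx02": 8}
print(pick_restart_host(4, failover, others))
```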

To ensure that spare capacity is available on a failover host, you are prevented from powering on virtual machines or using vMotion to migrate virtual machines to a failover host. Also, DRS does not use a failover host for load balancing.

NOTE If you use the Specify Failover Hosts admission control policy and designate multiple failover hosts, DRS does not load balance failover hosts and VM-VM affinity rules are not supported.

The Current Failover Hosts appear in the vSphere HA section of the cluster's Summary tab in the vSphere Client. The status icon next to each host can be green, yellow, or red.

• Green. The host is connected, not in maintenance mode, and has no vSphere HA errors. No powered-on virtual machines reside on the host.

• Yellow. The host is connected, not in maintenance mode, and has no vSphere HA errors. However, powered-on virtual machines reside on the host.

• Red. The host is disconnected, in maintenance mode, or has vSphere HA errors.

Choosing an Admission Control Policy

You should choose a vSphere HA admission control policy based on your availability needs and the characteristics of your cluster. When choosing an admission control policy, you should consider a number of factors.

Avoiding Resource Fragmentation

Resource fragmentation occurs when there are enough resources in aggregate for a virtual machine to be failed over, but those resources are spread across multiple hosts and are unusable because a virtual machine can run on only one ESXi host at a time.

• The Host Failures Cluster Tolerates policy avoids resource fragmentation by defining a slot as the maximum virtual machine reservation.

• The Percentage of Cluster Resources policy does not address the problem of resource fragmentation.

• With the Specify Failover Hosts policy, resources are not fragmented because hosts are reserved for failover.
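A small numeric sketch makes the fragmentation problem concrete. The function name and the sample figures are hypothetical; only memory is considered to keep the illustration short.

```python
def fragmentation_check(vm_reservation_mb, free_mb_per_host):
    """Return (aggregate_ok, single_host_ok) for one virtual machine."""
    # Enough free memory across the whole cluster?
    aggregate_ok = sum(free_mb_per_host) >= vm_reservation_mb
    # Enough free memory on any one host, which is what a restart needs?
    single_host_ok = any(free >= vm_reservation_mb
                         for free in free_mb_per_host)
    return aggregate_ok, single_host_ok


# 5120MB is free in total, but no single host can hold a 4096MB VM.
print(fragmentation_check(4096, [2048, 2048, 1024]))
```

A percentage-based check looks only at the first value, which is why that policy does not protect against fragmentation.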

Flexibility of Failover Resource Reservation

Admission control policies differ in the granularity of control they give you when reserving cluster resources for failover protection. The Host Failures Cluster Tolerates policy allows you to set the failover level as a number of hosts. The Percentage of Cluster Resources policy allows you to designate up to 100% of cluster CPU or memory resources for failover. The Specify Failover Hosts policy allows you to specify a set of failover hosts.

Heterogeneity of Cluster

Clusters can be heterogeneous in terms of virtual machine resource reservations and host total resource capacities. In a heterogeneous cluster, the Host Failures Cluster Tolerates policy can be too conservative because it only considers the largest virtual machine reservations when defining slot size and assumes the largest hosts fail when computing the Current Failover Capacity. The other two admission control policies are not affected by cluster heterogeneity.

NOTE vSphere HA includes the resource usage of Fault Tolerance Secondary VMs when it performs admission control calculations. For the Host Failures Cluster Tolerates policy, a Secondary VM is assigned a slot, and for the Percentage of Cluster Resources policy, the Secondary VM's resource usage is accounted for when computing the usable capacity of the cluster.

vSphere HA Checklist

• All hosts must be licensed for vSphere HA.

NOTE ESX/ESXi 3.5 hosts are supported by vSphere HA but must include a patch to address an issue involving file locks. For ESX 3.5 hosts, you must apply the patch ESX350-201012401-SG, while for ESXi 3.5 you must apply the patch ESXe350-201012401-I-BG. Prerequisite patches need to be applied before applying these patches.

• You need at least two hosts in the cluster.

• All hosts need to be configured with static IP addresses. If you are using DHCP, you must ensure that the address for each host persists across reboots.

• To ensure that any virtual machine can run on any host in the cluster, all hosts should have access to the same virtual machine networks and datastores. Similarly, virtual machines must be located on shared, not local, storage; otherwise they cannot be failed over in the case of a host failure.

NOTE vSphere HA uses datastore heartbeating to distinguish between partitioned, isolated, and failed hosts. Accordingly, you must ensure that datastores reserved for vSphere HA are readily available at all times.

• For VM Monitoring to work, VMware Tools must be installed.

• Host certificate checking should be enabled.

• vSphere HA supports both IPv4 and IPv6. A cluster that mixes the use of both of these protocol versions, however, is more likely to result in a network partition.

Enabling or Disabling Admission Control

You can enable or disable admission control for the vSphere HA cluster.

Enable: Disallow VM power on operations that violate availability constraints

Enables admission control and enforces availability constraints and preserves failover capacity. Any operation on a virtual machine that decreases the unreserved resources in the cluster and violates availability constraints is not permitted.

Disable: Allow VM power on operations that violate availability constraints

Disables admission control. Virtual machines can, for example, be powered on even if that causes insufficient failover capacity. When you do this, no warnings are presented, and the cluster does not turn red. If a cluster has insufficient failover capacity, vSphere HA can still perform failovers and it uses the VM Restart Priority setting to determine which virtual machines to power on first.

vSphere HA provides three policies for enforcing admission control, if it is enabled.

• Host failures the cluster tolerates

• Percentage of cluster resources reserved as failover spare capacity

• Specify failover hosts

Virtual Machine Options

Default virtual machine settings control the order in which virtual machines are restarted (VM restart priority) and how vSphere HA responds if hosts lose network connectivity with other hosts (host isolation response).

VM Restart Priority Setting

VM restart priority determines the relative order in which virtual machines are restarted after a host failure. Such virtual machines are restarted sequentially on new hosts, with the highest priority virtual machines first and continuing to those with lower priority until all virtual machines are restarted or no more cluster resources are available.

The values for this setting are: Disabled, Low, Medium (the default), and High. If you select Disabled, vSphere HA is disabled for the virtual machine, which means that it is not restarted on other ESXi hosts if its host fails.

The Disabled setting does not affect virtual machine monitoring, which means that if a virtual machine fails on a host that is functioning properly, that virtual machine is reset on that same host.

The restart priority settings for virtual machines vary depending on user needs. VMware recommends that you assign higher restart priority to the virtual machines that provide the most important services.

For example, in the case of a multitier application you might rank assignments according to functions hosted on the virtual machines.

• High. Database servers that will provide data for applications.

• Medium. Application servers that consume data in the database and provide results on web pages.

• Low. Web servers that receive user requests, pass queries to application servers, and return results to users.
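The effect of the restart priority values can be sketched as a simple ordering. This is an illustration with hypothetical names and virtual machines, not vSphere HA's placement logic; it only shows that Disabled virtual machines are skipped and the rest are taken from High to Low.

```python
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def restart_order(vms):
    """vms: list of (name, priority). Return names in restart order."""
    # Disabled means vSphere HA does not restart the VM on another host.
    eligible = [(name, prio) for name, prio in vms if prio != "Disabled"]
    # sorted() is stable, so VMs of equal priority keep their input order.
    ordered = sorted(eligible, key=lambda vm: PRIORITY_ORDER[vm[1]])
    return [name for name, _ in ordered]


vms = [("web01", "Low"), ("db01", "High"), ("app01", "Medium"),
       ("test01", "Disabled"), ("db02", "High")]
print(restart_order(vms))
```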

Host Isolation Response Setting

Host isolation response determines what happens when a host in a vSphere HA cluster loses its management network connections but continues to run. Host isolation responses require that Host Monitoring Status is enabled. If Host Monitoring Status is disabled, host isolation responses are also suspended.

A host determines that it is isolated when it is unable to communicate with the agents running on the other hosts and it is unable to ping its isolation addresses.

When this occurs, the host executes its isolation response. The responses are:

• Leave powered on (the default)

• Power off

• Shut down

You can customize this property for individual virtual machines. To use the Shut down VM setting, you must install VMware Tools in the guest operating system of the virtual machine. Virtual machines that are in the process of shutting down will take longer to fail over while the shutdown completes. Virtual machines that have not shut down within 300 seconds, or within the time (in seconds) specified in the advanced attribute das.isolationshutdowntimeout, are powered off.

NOTE After you create a vSphere HA cluster, you can override the default cluster settings for Restart Priority and Isolation Response for specific virtual machines. Such overrides are useful for virtual machines that are used for special tasks. For example, virtual machines that provide infrastructure services like DNS or DHCP might need to be powered on before other virtual machines in the cluster.

If a host has its isolation response disabled (that is, it leaves virtual machines powered on when isolated) and the host loses access to both the management and storage networks, a "split brain" situation can arise. In this case, the isolated host loses the disk locks and the virtual machines are failed over to another host even though the original instances of the virtual machines remain running on the isolated host.

When the host comes out of isolation, there will be two copies of the virtual machines, although the copy on the originally isolated host does not have access to the vmdk files and data corruption is prevented. In the vSphere Client, the virtual machines appear to be flipping back and forth between the two hosts.

To recover from this situation, ESXi generates a question on the virtual machine that has lost the disk locks for when the host comes out of isolation and realizes that it cannot reacquire the disk locks. vSphere HA automatically answers this question and this allows the virtual machine instance that has lost the disk locks to power off, leaving just the instance that has the disk locks.

VM and Application Monitoring

VM Monitoring restarts individual virtual machines if their VMware Tools heartbeats are not received within a set time. Similarly, Application Monitoring can restart a virtual machine if the heartbeats for an application it is running are not received. You can enable these features and configure the sensitivity with which vSphere HA monitors non-responsiveness.

When you enable VM Monitoring, the VM Monitoring service (using VMware Tools) evaluates whether each virtual machine in the cluster is running by checking for regular heartbeats and I/O activity from the VMware Tools process running inside the guest. If no heartbeats or I/O activity are received, this is most likely because the guest operating system has failed or VMware Tools is not being allocated any time to complete tasks.

In such a case, the VM Monitoring service determines that the virtual machine has failed and the virtual machine is rebooted to restore service.

Occasionally, virtual machines or applications that are still functioning properly stop sending heartbeats. To avoid unnecessary resets, the VM Monitoring service also monitors a virtual machine's I/O activity.

If no heartbeats are received within the failure interval, the I/O stats interval (a cluster-level attribute) is checked.

The I/O stats interval determines if any disk or network activity has occurred for the virtual machine during the previous two minutes (120 seconds). If not, the virtual machine is reset. This default value (120 seconds) can be changed using the advanced attribute das.iostatsinterval.
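The two-stage check (missed heartbeats first, then recent I/O) can be sketched as below. The function and parameter names are hypothetical, the 30-second failure interval in the usage lines is only an illustrative value, and treating an I/O stats interval of 0 as "reset on missed heartbeats alone" is an interpretation of the statement that 0 disables the check.

```python
def should_reset_vm(seconds_since_heartbeat, seconds_since_io,
                    failure_interval, io_stats_interval=120):
    """Decide whether VM Monitoring would reset the VM (illustrative)."""
    if seconds_since_heartbeat < failure_interval:
        # Heartbeats arrived within the failure interval: VM is healthy.
        return False
    if io_stats_interval == 0:
        # I/O check disabled (assumed meaning): missed heartbeats suffice.
        return True
    # Reset only if there was no disk or network I/O in the interval.
    return seconds_since_io >= io_stats_interval


print(should_reset_vm(45, 20, failure_interval=30))    # recent I/O
print(should_reset_vm(45, 200, failure_interval=30))   # no recent I/O
```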

To enable Application Monitoring, you must first obtain the appropriate SDK (or be using an application that supports VMware Application Monitoring) and use it to set up customized heartbeats for the applications you want to monitor. After you have done this, Application Monitoring works much the same way that VM Monitoring does.

vSphere HA Advanced Attributes

Attribute Description

das.isolationaddress[...]

Sets the address to ping to determine if a host is isolated from the network. This address is pinged only when heartbeats are not received from any other host in the cluster. If not specified, the default gateway of the management network is used. This default gateway has to be a reliable address that is available, so that the host can determine if it is isolated from the network. You can specify multiple isolation addresses (up to 10) for the cluster: das.isolationaddressX, where X = 1-10. Typically you should specify one per management network. Specifying too many addresses makes isolation detection take too long.

das.usedefaultisolationaddress

By default, vSphere HA uses the default gateway of the console network as an isolation address. This attribute specifies whether or not this default is used (true|false).

das.isolationshutdowntimeout

The period of time the system waits for a virtual machine to shut down before powering it off. This only applies if the host's isolation response is Shut down VM. Default value is 300 seconds.

das.slotmeminmb

Defines the maximum bound on the memory slot size. If this option is used, the slot size is the smaller of this value or the maximum memory reservation plus memory overhead of any powered-‐on virtual machine in the cluster.

das.slotcpuinmhz

Defines the maximum bound on the CPU slot size. If this option is used, the slot size is the smaller of this value or the maximum CPU reservation of any powered-on virtual machine in the cluster.

das.vmmemoryminmb

Defines the default memory resource value assigned to a virtual machine if its memory reservation is not specified or zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default is 0 MB.

das.vmcpuminmhz

Defines the default CPU resource value assigned to a virtual machine if its CPU reservation is not specified or zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default is 32 MHz.

das.iostatsinterval

Changes the default I/O stats interval for VM Monitoring sensitivity. The default is 120 (seconds). Can be set to any value greater than or equal to 0. Setting to 0 disables the check.

das.ignoreinsufficienthbdatastore

Disables configuration issues created if the host does not have sufficient heartbeat datastores for vSphere HA. Default value is false.

das.heartbeatdsperhost

Changes the number of heartbeat datastores required. Valid values range from 2 to 5; the default is 2.

NOTE: If you change the value of any of the following advanced attributes, you must disable and then re-enable vSphere HA before your changes take effect.

• das.isolationaddress[...]

• das.usedefaultisolationaddress

• das.isolationshutdowntimeout

Options No Longer Supported

• das.defaultfailoverhost

• das.failureDetectionTime

• das.failureDetectionInterval

Best Practices for vSphere HA Clusters

Setting Alarms to Monitor Cluster Changes

When vSphere HA or Fault Tolerance takes action to maintain availability (for example, a virtual machine failover), you can be notified about the change. Configure alarms in vCenter Server to be triggered when these actions occur, and have alerts, such as emails, sent to a specified set of administrators.

Several default vSphere HA alarms are available.

• Insufficient failover resources (a cluster alarm)

• Cannot find master (a cluster alarm)

• Failover in progress (a cluster alarm)

• Host HA status (a host alarm)

• VM monitoring error (a virtual machine alarm)

• VM monitoring action (a virtual machine alarm)

• Failover failed (a virtual machine alarm)

Monitoring Cluster Validity

A valid cluster is one in which the admission control policy has not been violated.

A cluster enabled for vSphere HA becomes invalid (red) when the number of virtual machines powered on exceeds the failover requirements, that is, the current failover capacity is smaller than configured failover capacity. If admission control is disabled, clusters do not become invalid.

Admission Control Best Practices

The following recommendations are best practices for vSphere HA admission control:

• Select the Percentage of Cluster Resources Reserved admission control policy. This policy offers the most flexibility in terms of host and virtual machine sizing. In most cases, a calculation of 1/N, where N is the number of total nodes in the cluster, yields adequate sparing.

• Ensure that you size all cluster hosts equally. An unbalanced cluster results in excess capacity being reserved to handle failure of the largest possible node.

• Try to keep virtual machine sizing requirements similar across all configured virtual machines. The Host Failures Cluster Tolerates admission control policy uses slot sizes to calculate the amount of capacity needed to reserve for each virtual machine. The slot size is based on the largest reserved memory and CPU needed for any virtual machine. When you mix virtual machines of different CPU and memory requirements, the slot size calculation defaults to the largest possible, which limits consolidation.
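The slot size rule used by the Host Failures Cluster Tolerates policy can be illustrated with a short sketch. It follows the descriptions of das.vmcpuminmhz, das.vmmemoryminmb, das.slotcpuinmhz and das.slotmeminmb given in these notes; the function name and the dictionary keys are hypothetical, and the way the default memory value is combined with the overhead is an assumption.

```python
def slot_size(vms, vmcpuminmhz=32, vmmemoryminmb=0,
              slotcpuinmhz=None, slotmeminmb=None):
    """Return (cpu_mhz, mem_mb) slot size for a list of powered-on VMs.

    Each VM is a dict with the hypothetical keys 'cpu_res_mhz',
    'mem_res_mb' and 'mem_overhead_mb'.
    """
    # A reservation of zero falls back to the das.vm*min* default.
    cpu = max((vm["cpu_res_mhz"] or vmcpuminmhz) for vm in vms)
    # Assumption: the overhead is added on top of the (defaulted) reservation.
    mem = max((vm["mem_res_mb"] or vmmemoryminmb) + vm["mem_overhead_mb"]
              for vm in vms)
    # The optional das.slot* attributes cap the slot size from above.
    if slotcpuinmhz is not None:
        cpu = min(cpu, slotcpuinmhz)
    if slotmeminmb is not None:
        mem = min(mem, slotmeminmb)
    return cpu, mem


if __name__ == "__main__":
    vms = [
        {"cpu_res_mhz": 0, "mem_res_mb": 0, "mem_overhead_mb": 90},
        {"cpu_res_mhz": 2000, "mem_res_mb": 4096, "mem_overhead_mb": 120},
        {"cpu_res_mhz": 500, "mem_res_mb": 1024, "mem_overhead_mb": 100},
    ]
    print(slot_size(vms))                                       # one large VM sets the slot
    print(slot_size(vms, slotcpuinmhz=1000, slotmeminmb=2048))  # capped slot
```

In the first call a single virtual machine with large reservations dictates the slot size for the whole cluster, which is why mixed sizing limits consolidation under this policy.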

Using Auto Deploy with vSphere HA

You can use vSphere HA and Auto Deploy together to improve the availability of your virtual machines. Auto Deploy provisions hosts when they power up and you can also configure it to install the vSphere HA agent on such hosts during the boot process. To have Auto Deploy install the vSphere HA agent, the image profile you assign to the host must include the vmware-fdm VIB.

Best Practices for Networking

Network Configuration and Maintenance

• When making changes to the networks that your clustered ESXi hosts are on, VMware recommends that you suspend the Host Monitoring feature. Changing your network hardware or networking settings can interrupt the heartbeats that vSphere HA uses to detect host failures, and this might result in unwanted attempts to fail over virtual machines.

• When you change the networking configuration on the ESXi hosts themselves, for example, adding port groups, or removing vSwitches, VMware recommends that in addition to suspending Host Monitoring, you place the hosts on which the changes are being made into maintenance mode. When the host comes out of maintenance mode, it is reconfigured, which causes the network information to be reinspected for the running host. If not put into maintenance mode, the vSphere HA agent runs using the old network configuration information.

Networks Used for vSphere HA Communications

To identify which network operations might disrupt the functioning of vSphere HA, you should know which management networks are being used for heartbeating and other vSphere HA communications.

• On legacy ESX hosts in the cluster, vSphere HA communications travel over all networks that are designated as service console networks. VMkernel networks are not used by these hosts for vSphere HA communications.

• On ESXi hosts in the cluster, vSphere HA communications, by default, travel over VMkernel networks, except those marked for use with vMotion. If there is only one VMkernel network, vSphere HA shares it with vMotion, if necessary. With ESXi 4.x and ESXi, you must also explicitly enable the Management Network checkbox for vSphere HA to use this network.

NOTE VMware recommends that you do not configure hosts with multiple vmkNICs on the same subnet. If this is done, be aware that vSphere HA sends packets using any pNIC that is associated with a given subnet if at least one vNIC for that subnet has been configured for management traffic.

Network Isolation Addresses

A network isolation address is an IP address that is pinged to determine whether a host is isolated from the network. This address is pinged only when a host has stopped receiving heartbeats from all other hosts in the cluster. If a host can ping its network isolation address, the host is not network isolated, and the other hosts in the cluster have failed. However, if the host cannot ping its isolation address, it is likely that the host has become isolated from the network and no failover action is taken.

By default, the network isolation address is the default gateway for the host. Only one default gateway is specified, regardless of how many management networks have been defined. You should use the das.isolationaddress[...] advanced attribute to add isolation addresses for additional networks.
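As a rough sketch of how the isolation check fits together, the following builds the list of isolation addresses from the advanced attributes and then applies the rule described above. The helper names are hypothetical, the attribute numbering (1 to 10) follows these notes, and treating a successful ping to any one address as "not isolated" is an assumption.

```python
def isolation_addresses(default_gateway, advanced_options):
    """Collect the addresses a host would ping (hypothetical helper)."""
    addrs = []
    use_default = advanced_options.get("das.usedefaultisolationaddress", "true")
    if use_default.lower() == "true":
        addrs.append(default_gateway)
    # das.isolationaddressX, with X = 1-10 as stated in these notes.
    for i in range(1, 11):
        addr = advanced_options.get(f"das.isolationaddress{i}")
        if addr:
            addrs.append(addr)
    return addrs


def is_isolated(heartbeats_received, ping_ok):
    """ping_ok maps each isolation address to True if the ping succeeded."""
    # Isolation addresses are only consulted once heartbeats have stopped.
    if heartbeats_received:
        return False
    # Assumption: reaching any one isolation address means "not isolated".
    return not any(ping_ok.values())


if __name__ == "__main__":
    opts = {"das.isolationaddress1": "192.168.10.1",
            "das.usedefaultisolationaddress": "false"}
    print(isolation_addresses("10.0.0.1", opts))
    print(is_isolated(False, {"192.168.10.1": False}))
```

The IP addresses are placeholders; in practice each isolation address should be a reliable device on the corresponding management network.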

Other Networking Considerations

• Configuring Switches. If the physical network switches that connect your servers support the Port Fast (or an equivalent) setting, enable it. This setting prevents a host from incorrectly determining that a network is isolated during the execution of lengthy spanning tree algorithms.

• Port Group Names and Network Labels. Use consistent port group names and network labels on VLANs for public networks. Port group names are used to reconfigure access to the network by virtual machines. If you use inconsistent names between the original server and the failover server, virtual machines are disconnected from their networks after failover. Network labels are used by virtual machines to reestablish network connectivity upon restart.

• Configure the management networks so that the vSphere HA agent on a host in the cluster can reach the agents on any of the other hosts using one of the management networks. If you do not set up such a configuration, a network partition condition can occur after a master host is elected.

Network Path Redundancy

Network path redundancy between cluster nodes is important for vSphere HA reliability. A single management network ends up being a single point of failure and can result in failovers although only the network has failed.

You can implement network redundancy at the NIC level with NIC teaming, or at the management network level. In most implementations, NIC teaming provides sufficient redundancy, but you can use or add management network redundancy if required. Redundant management networking allows the reliable detection of failures and prevents isolation conditions from occurring, because heartbeats can be sent over multiple networks.

Configure the fewest possible number of hardware segments between the servers in a cluster. The goal is to limit single points of failure. Additionally, routes with too many hops can cause networking packet delays for heartbeats, and increase the possible points of failure.

Network Redundancy Using NIC Teaming

Using a team of two NICs connected to separate physical switches improves the reliability of a management network. Because servers connected through two NICs (and through separate switches) have two independent paths for sending and receiving heartbeats, the cluster is more resilient. To configure a NIC team for the management network, configure the vNICs in vSwitch configuration for Active or Standby configuration. The recommended parameter settings for the vNICs are:

Default load balancing = route based on originating port ID

Failback = No

After you have added a NIC to a host in your vSphere HA cluster, you must reconfigure vSphere HA on that host.

Network Redundancy Using a Secondary Network

As an alternative to NIC teaming for providing redundancy for heartbeats, you can create a secondary management network connection, which is attached to a separate virtual switch. The primary management network connection is used for network and management purposes. When the secondary management network connection is created, vSphere HA sends heartbeats over both the primary and secondary management network connections. If one path fails, vSphere HA can still send and receive heartbeats over the other path.

6. Best Practices for Running VMware vSphere on iSCSI

iSCSI considerations

For datacenters with centralized storage, iSCSI offers customers many benefits. It is comparatively inexpensive and it is based on familiar SCSI and TCP/IP standards. In comparison to FC and Fibre Channel over Ethernet (FCoE) SAN deployments, iSCSI requires less hardware, it uses lower-cost hardware, and more IT staff members might be familiar with the technology. These factors contribute to lower-cost implementations.

One major difference between iSCSI and FC relates to I/O congestion. When an iSCSI path is overloaded, the TCP/IP protocol drops packets and requires them to be resent. FC communication over a dedicated path has a built-in pause mechanism when congestion occurs.

When a network path carrying iSCSI storage traffic is oversubscribed, a bad situation quickly grows worse and performance further degrades as dropped packets must be resent. There can be multiple reasons for an iSCSI path being overloaded, ranging from oversubscription (too much traffic), to network switches that have a low port buffer.

Another consideration is the network bandwidth. Network bandwidth is dependent on the Ethernet standards used (1Gb or 10Gb). There are other mechanisms such as port aggregation and bonding links that deliver greater network bandwidth.

When implementing software iSCSI that uses network interface cards rather than dedicated iSCSI adapters, gigabit Ethernet interfaces are required. These interfaces tend to consume a significant amount of CPU resources.

One way of overcoming this demand for CPU resources is to use a feature called a TOE (TCP/IP offload engine). TOEs shift TCP packet processing tasks from the server CPU to specialized TCP processors on the network adaptor or storage device.

iSCSI was long considered a technology that did not work well over most shared wide-area networks, and it has mostly been approached as a local area network technology. However, this is changing. For synchronous replication writes (in the case of high availability) or remote data writes, iSCSI might not be a good fit, because added latency delays data transfers and might impact application performance. For asynchronous replication, which is not latency sensitive, iSCSI is an ideal solution.

VMware vCenter Site Recovery Manager may build upon iSCSI asynchronous storage replication for simple, reliable site disaster protection.

iSCSI Architecture

iSCSI initiators must manage multiple, parallel communication links to multiple targets. Similarly, iSCSI targets must manage multiple, parallel communications links to multiple initiators. Several identifiers exist in iSCSI to make this happen, including iSCSI Name, ISID (iSCSI session identifiers), TSID (target session identifier), CID (iSCSI connection identifier) and iSCSI portals.

iSCSI Names

iSCSI nodes have globally unique names that do not change when Ethernet adapters or IP addresses change. iSCSI supports two name formats as well as aliases. The first name format is the Extended Unique Identifier (EUI). An example of an EUI name might be eui.02004567A425678D.

The second name format is the iSCSI Qualified Name (IQN). An example of an IQN name might be iqn.1998-01.com.vmware:tm-pod04-esx01-6129571c.
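The two name formats can be told apart with a pair of regular expressions. The sketch below is a simplified check, not a full validator for the iSCSI naming rules: it assumes an IQN has the shape iqn.YYYY-MM.reversed-domain with an optional colon-separated suffix, and that an EUI name is "eui." followed by 16 hexadecimal digits. The function name is hypothetical.

```python
import re

# iqn.<year>-<month>.<reversed domain>[:<unique suffix>]
IQN_RE = re.compile(r"^iqn\.\d{4}-(0[1-9]|1[0-2])\.[a-z0-9][a-z0-9.-]*(:.+)?$")
# eui.<16 hex digits> (a 64-bit identifier)
EUI_RE = re.compile(r"^eui\.[0-9A-Fa-f]{16}$")


def iscsi_name_type(name):
    """Return 'iqn', 'eui', or None for an unrecognized name."""
    if IQN_RE.match(name):
        return "iqn"
    if EUI_RE.match(name):
        return "eui"
    return None


if __name__ == "__main__":
    print(iscsi_name_type("iqn.1998-01.com.vmware:tm-pod04-esx01-6129571c"))
    print(iscsi_name_type("eui.02004567A425678D"))
    print(iscsi_name_type("not-an-iscsi-name"))
```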

iSCSI Initiators and Targets

A storage network consists of two types of equipment: initiators and targets. Initiators, such as hosts, are data consumers. Targets, such as disk arrays or tape libraries, are data providers. In the context of vSphere, iSCSI initiators fall into three distinct categories. They can be software, hardware dependent or hardware independent.

Software iSCSI Adapter

A software iSCSI adapter is VMware code built into the VMkernel. It enables your host to connect to the iSCSI storage device through standard network adaptors. The software iSCSI adapter handles iSCSI processing while communicating with the network adaptor. With the software iSCSI adapter, you can use iSCSI technology without purchasing specialized hardware.

Dependent Hardware iSCSI Adapter

This hardware iSCSI adapter depends on VMware networking and iSCSI configuration and management interfaces provided by VMware. This type of adapter can be a card that presents a standard network adaptor and iSCSI offload functionality for the same port. The iSCSI offload functionality depends on the host’s network configuration to obtain the IP and MAC addresses, as well as other parameters used for iSCSI sessions. An example of a dependent adapter is the iSCSI licensed Broadcom 5709 NIC.

Independent Hardware iSCSI Adapter

This type of adapter implements its own networking and iSCSI configuration and management interfaces. An example of an independent hardware iSCSI adapter is a card that presents either iSCSI offload functionality only or iSCSI offload functionality and standard NIC functionality. The iSCSI offload functionality has independent configuration management that assigns the IP address, MAC address, and other parameters used for the iSCSI sessions. An example of an independent hardware adapter is the QLogic QLA4052 adapter.

iSCSI Portals

iSCSI nodes keep track of connections via portals, enabling separation between names and IP addresses. A portal manages an IP address and a TCP port number. Therefore, from an architectural perspective, sessions can be made up of multiple logical connections, and portals track connections via TCP/IP port/address.

iSCSI Implementation Options

With the hardware-initiator iSCSI implementation, the iSCSI HBA provides the translation from SCSI commands to an encapsulated format that can be sent over the network. A TCP offload engine (TOE) does this translation on the adapter.

The software-initiator iSCSI implementation leverages the VMkernel to perform the SCSI to IP translation and requires extra CPU cycles to perform this work. As mentioned previously, most enterprise-level networking chip sets offer TCP offload or checksum offloads, which greatly reduce CPU overhead.

Mixing iSCSI Options

Having both software iSCSI and hardware iSCSI enabled on the same host is supported. However, use of both software and hardware adapters on the same vSphere host to access the same target is not supported. A host cannot access the same target through a mix of dependent hardware, independent hardware and software iSCSI adapters for multipathing purposes.

Networking Settings

Network design is key to making sure iSCSI works. In a production environment, gigabit Ethernet is essential for software iSCSI. Hardware iSCSI, in a VMware Infrastructure environment, is implemented with dedicated HBAs.

iSCSI should be considered a local-area technology, not a wide-area technology, because of latency issues and security concerns. You should also segregate iSCSI traffic from general traffic. Layer-2 VLANs are a particularly good way to implement this segregation.

Beware of oversubscription. Oversubscription occurs when more users are connected to a system than can be fully supported at the same time. Networks and servers are almost always designed with some amount of oversubscription, assuming that users do not all need the service simultaneously. If they do, delays are certain and outages are possible. Oversubscription is permissible on general-purpose LANs, but you should not use an oversubscribed configuration for iSCSI.

Best practice is to have a dedicated LAN for iSCSI traffic and not share the network with other network traffic. It is also best practice not to oversubscribe the dedicated LAN.

Finally, because iSCSI leverages the IP network, VMkernel NICs can be placed into teaming configurations. However, the VMware recommendation is to use port binding rather than NIC teaming. Port binding is explained in detail later in this paper; in short, with port binding iSCSI can leverage VMkernel multipath capabilities such as failover on SCSI errors and the Round Robin path policy for performance.

In the interest of completeness, both methods will be discussed. However, port binding is the recommended best practice.

VMkernel Network Configuration

A VMkernel network is required for IP storage and thus is required for iSCSI. A best practice would be to keep the iSCSI traffic separate from other networks, including the management and virtual machine networks.

IPv6 Supportability Statements

At the time of this writing, there is no IPv6 support for either hardware iSCSI or software iSCSI adapters in vSphere 5.1.

Throughput Options

There are a number of options available to improve iSCSI performance.

1. 10GbE – This is an obvious option to begin with. If you can provide a larger pipe, the likelihood is that you will achieve greater throughput. Of course, if there is not enough I/O to fill a 1GbE connection, then a larger connection isn’t going to help you. But let’s assume that there are enough virtual machines and enough datastores for 10GbE to be beneficial.

2. Jumbo frames – This feature can deliver additional throughput by increasing the size of the payload in each frame from a default MTU of 1,500 to an MTU of 9,000. However, great care and consideration must be used if you decide to implement it. All devices sitting in the I/O path (iSCSI target, physical switches, network interface cards and VMkernel ports) must be able to implement jumbo frames for this option to provide the full benefits. For example, if the MTU is not correctly set on the switches, the datastores might mount but I/O will fail. A common issue with jumbo-frame configurations is that the MTU value on the switch isn't set correctly. In most cases, this must be higher than that of the hosts and storage, which are typically set to 9,000. Switches must be set higher, to 9,198 or 9,216 for example, to account for protocol header overhead. Refer to switch-vendor documentation as well as storage-vendor documentation before attempting to configure jumbo frames.

3. Round Robin path policy – Round Robin uses an automatic path selection rotating through all available paths, enabling the distribution of load across the configured paths. This path policy can help improve I/O throughput. For active/passive storage arrays, only the paths to the active controller will be used in the Round Robin policy. For active/active storage arrays, all paths will be used in the Round Robin policy. For ALUA arrays (Asymmetric Logical Unit Access), Round Robin uses only the active/optimized (AO) paths. These are the paths to the disk through the managing controller. Active/nonoptimized (ANO) paths to the disk through the nonmanaging controller are not used.

Not all arrays support the Round Robin path policy. Refer to your storage-array vendor's documentation for recommendations on using this Path Selection Policy (PSP).
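The path-selection rule for Round Robin can be sketched as follows. This is a simplified illustration, not the VMware PSP: the path names and state labels are hypothetical, and the rotation happens on every request, whereas the real policy rotates after a configurable amount of I/O.

```python
from itertools import cycle


def round_robin_paths(paths):
    """Return an endless iterator over the paths Round Robin would use.

    paths: list of (name, state) tuples. Hypothetical state labels:
    'active', 'active_optimized', 'active_nonoptimized', 'standby'.
    """
    # Standby paths (passive controller) and active/nonoptimized (ANO)
    # paths are skipped; only active or active/optimized paths rotate.
    usable = [name for name, state in paths
              if state in ("active", "active_optimized")]
    return cycle(usable)


if __name__ == "__main__":
    alua_paths = [
        ("vmhba33:C0:T0:L0", "active_optimized"),
        ("vmhba33:C1:T0:L0", "active_optimized"),
        ("vmhba33:C0:T1:L0", "active_nonoptimized"),
    ]
    rr = round_robin_paths(alua_paths)
    for _ in range(4):
        print(next(rr))
```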

Minimizing Latency

Because iSCSI on VMware uses TCP/IP to transfer I/O, latency can be a concern. To decrease latency, one should always try to minimize the number of hops between the storage and the vSphere host. Ideally, one would not route traffic between the vSphere host and the storage array, and both would coexist on the same subnet.

NOTE: If iSCSI port bindings are implemented for the purposes of multipathing, you cannot route your iSCSI traffic.

Routing

A vSphere host has a single routing table for all of its VMkernel Ethernet interfaces. This imposes some limits on network communication. Consider a configuration that uses two Ethernet adapters with one VMkernel TCP/IP stack. One adapter is on the 10.17.1.1/24 IP network and the other on the 192.168.1.1/24 network. Assume that 10.17.1.253 is the address of the default gateway. The VMkernel can communicate with any servers reachable by routers that use the 10.17.1.253 gateway. It might not be able to talk to all servers on the 192.168 network unless both networks are on the same broadcast domain.

The VMkernel TCP/IP Routing Table

Another consequence of the single routing table affects one approach you might otherwise consider for balancing I/O. Consider a configuration in which you want to connect to iSCSI storage and also want to enable NFS mounts. It might seem that you can use one Ethernet adapter for iSCSI and a separate Ethernet adapter for NFS traffic to spread the I/O load. This approach does not work because of the way the VMkernel TCP/IP stack handles entries in the routing table.

For example, you might assign an IP address of 10.16.156.66 to the VMkernel adapter you want to use for NFS. The routing table then contains an entry for the 10.16.156.x network for this adapter. If you then set up a second adapter for iSCSI and assign it an IP address of 10.16.156.25, the routing table contains a new entry for the 10.16.156.x network for the second adapter. However, when the TCP/IP stack reads the routing table, it never reaches the second entry, because the first entry satisfies all routes to both adapters. Therefore, no traffic ever goes out on the iSCSI network, and all IP storage traffic goes out on the NFS network.

The fact that all 10.16.156.x traffic is routed on the NFS network causes two types of problems. First, you do not see any traffic on the second Ethernet adapter. Second, if you try to add trusted IP addresses both to iSCSI arrays and NFS servers, traffic to one or the other comes from the wrong IP address.
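The first-match behavior can be reproduced with a toy routing table. This is only a model of the lookup described above, not the VMkernel implementation; the interface names are hypothetical and a /24 mask is assumed for the 10.16.156.x network.

```python
import ipaddress


def select_interface(routing_table, destination):
    """Return the interface of the first entry whose network contains destination."""
    dest = ipaddress.ip_address(destination)
    for network, interface in routing_table:
        if dest in ipaddress.ip_network(network):
            return interface  # first match wins; later entries are never read
    return None


# Both VMkernel adapters sit on the same subnet, NFS adapter added first.
table = [
    ("10.16.156.0/24", "vmk1-nfs"),    # VMkernel adapter 10.16.156.66
    ("10.16.156.0/24", "vmk2-iscsi"),  # VMkernel adapter 10.16.156.25
]

if __name__ == "__main__":
    # Traffic to an iSCSI array on the same subnet still leaves via the NFS adapter.
    print(select_interface(table, "10.16.156.200"))
```

Placing the iSCSI and NFS adapters on different subnets gives each its own routing entry, so the lookup can distinguish them.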

Using Static Routes

As mentioned before, for vSphere hosts, the management network is on a VMkernel port and therefore uses the default VMkernel gateway. Only one VMkernel default gateway can be configured on a vSphere host. You can, however, add static routes to additional gateways/routers from the command line.

Availability Options – Multipathing or NIC Teaming

NIC Teaming for Availability

A best practice for iSCSI is to avoid the vSphere feature called teaming (on the network interface cards) and instead use port binding. Port binding introduces multipathing for availability of access to the iSCSI targets and LUNs. If for some reason this is not suitable (for instance, you wish to route traffic between the iSCSI initiator and target), then teaming might be an alternative.

If you plan to use teaming to increase the availability of your network access to the iSCSI storage array, you must turn off port security on the switch for the two ports on which the virtual IP address is shared.

The purpose of this port security setting is to prevent spoofing of IP addresses, so many network administrators enable it. However, if you do not change it, the port security setting prevents failover of the virtual IP from one switch port to another and teaming cannot fail over from one path to another. For most LAN switches, the port security is enabled on a port level and thus can be set on or off for each port.

iSCSI Multipathing via Port Binding for Availability

Another way to achieve availability is to create a multipath configuration. This method is preferred over NIC teaming, because it fails over I/O to alternate paths based on SCSI sense codes and not just network failures. Port binding also gives administrators the opportunity to load-balance I/O over multiple paths to the storage device.

Error Correction Digests

iSCSI header and data digests check the end-to-end, noncryptographic data integrity beyond the integrity checks that other networking layers provide, such as TCP and Ethernet. They check the entire communication path, including all elements that can change the network-level traffic, such as routers, switches and proxies.

Enabling header and data digests does require additional processing for both the initiator and the target and can affect throughput and CPU utilization.

Some systems can offload the iSCSI digest calculations to the network processor, thus reducing the impact on performance.
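To give a sense of the per-byte work a digest adds, here is a plain, unoptimized sketch of CRC32C, the checksum that the iSCSI specification (RFC 3720) defines for header and data digests. The function name is chosen for illustration; production initiators use table-driven or hardware-assisted implementations instead.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        # Process the byte one bit at a time, least significant bit first.
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    # Standard CRC check string; the CRC32C check value is 0xE3069283.
    print(hex(crc32c(b"123456789")))
    # A single changed byte in the payload produces a different digest.
    print(crc32c(b"SCSI payload") != crc32c(b"SCSI pay1oad"))
```

The inner loop runs eight times for every byte transferred, which is why digest calculation is a natural candidate for offload.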

Flow Control

The general consensus from our storage partners is that hardware-based flow control is recommended for all network interfaces and switches.

Security Considerations

Private Network

iSCSI storage traffic is transmitted in an unencrypted format across the LAN. Therefore, it is considered best practice to use iSCSI on trusted networks only. All iSCSI-array vendors agree that it is good practice to isolate iSCSI traffic for security reasons, either on its own separate physical switches or on a dedicated VLAN (IEEE 802.1Q).

Encryption

iSCSI supports several types of security. IPSec (Internet Protocol Security) is a developing standard for security at the network or packet-processing layer of network communication. IKE (Internet Key Exchange) is an IPSec standard protocol used to ensure security for VPNs. However, at the time of this writing IPSec was not supported on vSphere hosts.

Authentication

There are also a number of authentication methods supported with iSCSI.

• Kerberos (not supported in vSphere 5.1)

• SRP (Secure Remote Password) (not supported in vSphere 5.1)

• SPKM1/2 (Simple Public-Key Mechanism) (not supported in vSphere 5.1)

• CHAP (Challenge Handshake Authentication Protocol) (supported)

At the time of this writing (vSphere 5.1), a vSphere host does not support Kerberos, SRP or public-key authentication methods for iSCSI. The only authentication protocol supported is CHAP. CHAP verifies identity using a hashed transmission.

The target initiates the challenge. Both parties know the secret key. The target periodically repeats the challenge to guard against replay attacks. CHAP is a one-way protocol, but it might be implemented in two directions to provide security for both ends. The iSCSI specification defines the CHAP security method as the only must-support protocol. The VMware implementation uses this security option. Initially, VMware supported only unidirectional CHAP, but bidirectional CHAP is now supported.
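The "hashed transmission" can be illustrated with the CHAP response calculation from RFC 1994, where the response is the MD5 hash of the identifier, the shared secret and the challenge. The sketch below shows one direction only (target authenticating the initiator); the function names and the sample secret are hypothetical, and real iSCSI CHAP negotiation carries these values in login key/value pairs that are not modeled here.

```python
import hashlib
import hmac
import os


def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """Initiator side: MD5(identifier || secret || challenge), per RFC 1994."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def target_verifies(identifier: int, secret: bytes,
                    challenge: bytes, response: bytes) -> bool:
    """Target side: recompute the hash with its own copy of the secret."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    secret = b"example-chap-secret"   # placeholder shared secret
    challenge = os.urandom(16)        # a fresh random challenge per attempt
    response = chap_response(1, secret, challenge)
    print(target_verifies(1, secret, challenge, response))
    print(target_verifies(1, b"wrong-secret", challenge, response))
```

The secret itself never crosses the network, and because each challenge is random, a captured response cannot be reused for a later challenge.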

iSCSI Datastore Provisioning Steps

1. Create a new VMkernel port group for IP storage on an already existing virtual switch (vSwitch) or on a new vSwitch when it is configured. The vSwitch can be a vSphere Standard Switch (VSS) or a VMware vSphere Distributed Switch.

2. Ensure that the iSCSI initiator on the vSphere host(s) is enabled.

3. Ensure that the iSCSI storage is configured to export a LUN accessible to the vSphere host iSCSI initiators on a trusted network.

Why Use iSCSI Multipathing?

The primary use case of this feature is to create a multipath configuration with storage that presents only a single storage portal, such as the DELL EqualLogic and the HP LeftHand.

Without iSCSI multipathing, this type of storage would have one path only between the VMware ESX® host and each volume. iSCSI multipathing enables us to multipath to this type of clustered storage.

Another benefit is the ability to use alternate VMkernel networks outside of the vSphere host management network. This means that if the management network suffers an outage, you continue to have iSCSI connectivity via the VMkernel ports participating in the iSCSI bindings.

NOTE: VMware considers the implementation of iSCSI multipathing rather than NIC teaming a best practice.

Software iSCSI Multipathing Configuration Steps

For port binding to work correctly, the initiator must be able to reach the target directly on the same subnet – iSCSI port binding in vSphere 5.0 does not support routing.

In this configuration, if I place my VMkernel ports on VLAN 74, they can reach the iSCSI target without the need of a router. This is an important point and requires further elaboration because it causes some confusion. If I do not implement port binding and use a standard VMkernel port, then my initiator can reach the targets through a routed network.

This is supported and works well. It is only when iSCSI binding is implemented that a direct, non-routed network between the initiators and targets is required. In other words, initiators and targets must be on the same subnet.
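The same-subnet requirement is easy to check before binding a port. The sketch below is a hypothetical pre-flight check, not a VMware tool; the addresses are placeholders.

```python
import ipaddress


def port_binding_reachable(vmk_cidr, target_ip):
    """True if the target is on the VMkernel port's own subnet (no router needed).

    vmk_cidr: VMkernel port address with prefix length, e.g. '10.0.74.11/24'.
    """
    vmk = ipaddress.ip_interface(vmk_cidr)
    return ipaddress.ip_address(target_ip) in vmk.network


if __name__ == "__main__":
    print(port_binding_reachable("10.0.74.11/24", "10.0.74.50"))  # same subnet
    print(port_binding_reachable("10.0.74.11/24", "10.0.75.50"))  # would need routing
```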

There is another important point to note when it comes to the configuration of iSCSI port bindings. On VMware standard switches that contain multiple vmnic uplinks, each VMkernel (vmk) port used for iSCSI bindings must be associated with a single vmnic uplink. The other uplink(s) on the vSwitch must be placed into an unused state. This is only a requirement when there are multiple vmnic uplinks on the same vSwitch. If you are using multiple VSSs with their own vmnic uplinks, then this is not an issue.

Continuing with the network configuration, a second VMkernel (vmk) port is created. Now there are two vmk ports, labeled iSCSI1 and iSCSI2. These will be used for the iSCSI port binding/multipathing configuration. The next step is to configure the bindings and iSCSI targets. This is done in the properties of the software iSCSI adapter. Since vSphere 5.0, there is a new Network Configuration tab in the Software iSCSI Initiator Properties window. This is where the VMkernel ports used for binding to the iSCSI adapter are added.

After selecting the VMkernel adapters for use with the software iSCSI adapter, the Port Group Policy tab will tell you whether or not these adapters are compliant for binding. If you have more than one active uplink on a vSwitch that has multiple vmnic uplinks, the vmk interfaces will not show up as compliant. Only one uplink should be active. All other uplinks should be placed into an unused state.
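The compliance rule can be expressed as a simple check over the teaming settings of each iSCSI port group. This is an illustrative model of the rule stated above, not the vSphere Client logic; the dictionary layout and names are hypothetical, and treating a standby uplink as noncompliant is an assumption based on the requirement that all other uplinks be unused.

```python
def check_port_groups(port_groups):
    """Map each port group name to True if it is usable for iSCSI port binding.

    port_groups: name -> {'active': [...], 'standby': [...], 'unused': [...]}
    """
    report = {}
    for name, teaming in port_groups.items():
        # Exactly one active uplink; every other uplink must be unused,
        # so any standby uplink also makes the port group noncompliant.
        report[name] = len(teaming["active"]) == 1 and not teaming["standby"]
    return report


if __name__ == "__main__":
    groups = {
        "iSCSI1": {"active": ["vmnic2"], "standby": [], "unused": ["vmnic3"]},
        "iSCSI2": {"active": ["vmnic2", "vmnic3"], "standby": [], "unused": []},
    }
    print(check_port_groups(groups))
```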

Interoperability Considerations

Storage I/O Control

Storage I/O Control (SIOC) prevents a single virtual machine residing on one vSphere host from consuming more than its share of bandwidth on a datastore that it shares with other virtual machines residing on other vSphere hosts.

Historically, the disk shares feature can be set up on a per–vSphere host basis. This works well for all virtual machines residing on the same vSphere host sharing the same datastore built on a local disk. However, this cannot be used as a fairness mechanism for virtual machines from different vSphere hosts sharing the same datastore.

This is what SIOC does. SIOC modifies the I/O queues on various vSphere hosts to ensure that virtual machines with a higher priority get more queue entries than those virtual machines with a lower priority, enabling these higher-priority virtual machines to send more I/O than their lower-priority counterparts.

SIOC is a congestion-driven feature. When latency remains below a specific latency value, SIOC is dormant. It is triggered only when the latency value on the datastore rises above a predefined threshold.

SIOC is recommended if you have a group of virtual machines sharing the same datastore spread across multiple vSphere hosts and you want to prevent the impact of a single virtual machine’s I/O on the I/O (and thus performance) of other virtual machines. With SIOC you can set shares to reflect the priority of virtual machines, but you can also implement an IOPS limit per virtual machine. This means that you can limit the impact, in number of IOPS, which a single virtual machine can have on a shared datastore.
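The share-based behavior can be illustrated with a simplified sketch. It is not the actual SIOC algorithm: it only shows the idea that queue entries are divided in proportion to shares, and only once latency exceeds the threshold. The function name, the total of 32 queue slots and the 30 ms threshold are assumptions chosen for the example.

```python
def allocate_queue_slots(shares, total_slots, latency_ms, threshold_ms):
    """Divide queue slots among VMs in proportion to their shares.

    Returns None while latency is at or below the threshold, because the
    mechanism is dormant until the datastore is congested.
    """
    if latency_ms <= threshold_ms:
        return None
    total_shares = sum(shares.values())
    return {vm: total_slots * s // total_shares for vm, s in shares.items()}


vm_shares = {"db": 2000, "web": 1000, "test": 1000}  # hypothetical VMs
print(allocate_queue_slots(vm_shares, 32, latency_ms=10, threshold_ms=30))
print(allocate_queue_slots(vm_shares, 32, latency_ms=35, threshold_ms=30))
```

With congestion present, the VM holding half of the shares receives half of the queue slots.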

SIOC is available in the VMware vSphere Enterprise Plus Edition.

Network I/O Control

The Network I/O Control (NIOC) feature ensures that when the same network interface cards are used for multiple traffic types, other traffic types on the same network interface cards do not impact iSCSI traffic. It works by setting priority and bandwidth using priority tags in TCP/IP packets. With 10GbE networks, this feature can be very useful, because there is one pipe that is shared with multiple other traffic types. With 1GbE networks, you have probably dedicated the pipe solely to iSCSI traffic. Like SIOC, NIOC is congestion driven: it takes effect only when there are different traffic types competing for bandwidth and the performance of one traffic type is likely to be impacted.

Whereas SIOC assists in dealing with the noisy-neighbor problem from a datastore-sharing perspective, NIOC assists in dealing with the noisy-neighbor problem from a network perspective.

Using NIOC, one can also set the priority levels of different virtual machine traffic. If certain virtual machine traffic is important to you, these virtual machines can be grouped into one virtual machine port group and lower-priority virtual machines can be placed into another virtual machine port group. NIOC can now be used to prioritize virtual machine traffic and ensure that the high-priority virtual machines get more bandwidth when there is competition for bandwidth on the pipe.

SIOC and NIOC can coexist and in fact complement each other. NIOC is available in the vSphere Enterprise Plus Edition.

vSphere Storage DRS

VMware vSphere Storage DRS, introduced with vSphere 5.0, fully supports VMFS datastores on iSCSI. When you enable vSphere Storage DRS on a datastore cluster (group of datastores), it automatically configures balancing based on space usage.

The threshold is set to 80 percent but can be modified. This means that if 80 percent or more of the space on a particular datastore is utilized, vSphere Storage DRS will try to move virtual machines to other datastores in the datastore cluster using VMware vSphere Storage vMotion® to bring this usage value back down below 80 percent.

If the cluster is set to the automatic mode of operation, vSphere Storage DRS uses vSphere Storage vMotion to automatically migrate virtual machines to other datastores in the datastore cluster if the threshold is exceeded.

If the cluster is set to manual, the administrator is given a set of recommendations to apply. vSphere Storage DRS will provide the best recommendations to balance the space usage of the datastores. After you apply the recommendations, vSphere Storage vMotion, as seen before, moves one or more virtual machines between datastores in the same datastore cluster.
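As a rough illustration of the space-usage trigger, the following sketch lists the datastores in a cluster that sit at or above the 80 percent default. The function name and the datastore figures are made up for the example; the real placement and migration logic is far more involved.

```python
def over_space_threshold(datastores, threshold_pct=80.0):
    """Return the names of datastores at or above the space threshold.

    datastores maps a name to a (used_gb, capacity_gb) tuple.
    """
    return sorted(
        name
        for name, (used_gb, capacity_gb) in datastores.items()
        if 100.0 * used_gb / capacity_gb >= threshold_pct
    )


cluster = {"ds1": (850, 1000), "ds2": (400, 1000), "ds3": (800, 1000)}
print(over_space_threshold(cluster))
```

In manual mode these datastores would produce recommendations; in automatic mode they would be candidates for vSphere Storage vMotion operations.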

Another feature of vSphere Storage DRS is that it can balance virtual machines across datastores in the datastore cluster based on I/O metrics, and specifically latency.

vSphere Storage DRS uses SIOC to evaluate datastore capabilities and capture latency information regarding all the datastores in the datastore cluster. As mentioned earlier, the purpose of SIOC is to ensure that no single virtual machine uses all the bandwidth of a particular datastore. It achieves this by modifying the queue depth for the datastores on each vSphere host.

In vSphere Storage DRS, its implementation is different. SIOC, on behalf of vSphere Storage DRS, checks the capabilities of the datastores in a datastore cluster by injecting various I/O loads. After this information is normalized, vSphere Storage DRS can determine the types of workloads that a datastore can handle. This information is used in initial placement and load-balancing decisions.

vSphere Storage DRS continuously uses SIOC to monitor how long it takes an I/O to do a round trip. This is the latency. This information about the datastore is passed back to vSphere Storage DRS. If the latency value for a particular datastore is above the threshold value (the default is 15 milliseconds) for a significant percentage of time over an observation period (the default is 16 hours), vSphere Storage DRS tries to rebalance the virtual machines across the datastores in the datastore cluster so that the latency value returns below the threshold. This might involve one or more vSphere Storage vMotion operations. In fact, even if vSphere Storage DRS is unable to bring the latency below the defined threshold value, it might still move virtual machines between datastores to balance the latency.
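The latency trigger can be sketched in the same spirit. The text does not say what counts as "a significant percentage of time", so the `min_fraction` parameter below (set to 0.9) is purely an assumption for illustration, as is the function name.

```python
def latency_exceeded(samples_ms, threshold_ms=15.0, min_fraction=0.9):
    """Return True if enough latency samples exceed the threshold.

    samples_ms holds latency samples collected over the observation
    period; min_fraction is a hypothetical cut-off, not a VMware value.
    """
    if not samples_ms:
        return False
    above = sum(1 for s in samples_ms if s > threshold_ms)
    return above / len(samples_ms) >= min_fraction


print(latency_exceeded([20.0] * 9 + [10.0]))      # mostly above 15 ms
print(latency_exceeded([20.0] * 5 + [10.0] * 5))  # only half above 15 ms
```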

When evaluating vSphere Storage DRS, VMware makes the same best practice recommendation made for vSphere Storage DRS initially. The recommendation is to run vSphere Storage DRS in manual mode first and then monitor the recommendations that vSphere Storage DRS is surfacing, ensuring that they make sense. After a period of time, if the recommendations make sense and you build a comfort level using vSphere Storage DRS, consider switching it to automated mode.

There are a number of considerations when using vSphere Storage DRS with certain array features. Check your storage vendor’s recommendation for using vSphere Storage DRS. There might be specific interaction with some advanced features on the array that you want to be aware of. VMware has already produced a very detailed white paper regarding the use of vSphere Storage DRS with array features such as tiered storage, thin provisioning and deduplication. More details regarding vSphere Storage DRS interoperability with storage-‐array features can be found in the VMware vSphere Storage DRS Interoperability white paper.

vSphere Storage APIs – Array Integration

This API enables the vSphere host to offload certain storage operations to the storage array rather than consuming resources on the vSphere host by doing the same operations.

For block storage arrays, no additional VIBs need to be installed on the vSphere host. All software necessary to use vSphere Storage APIs – Array Integration is preinstalled on the hosts.

The first primitive to discuss is Extended Copy (XCOPY), which enables the vSphere host to offload a clone operation or template deployments to the storage array.

NOTE: This primitive also supports vSphere Storage vMotion.

The next primitive is called Write Same. When creating VMDKs on block datastores, one of the options is to create an Eager Zeroed Thick (EZT) VMDK, which means zeroes get written to all blocks that make up that VMDK. With the Write Same primitive, the act of writing zeroes is offloaded to the array. This means that we don’t have to send lots of zeroes across the wire, which speeds up the process. In fact, for some arrays this is simply a metadata update, which means a very fast zeroing operation.

Our final primitive is Atomic Test & Set (ATS). ATS is a block primitive that replaces SCSI reservations when metadata updates are done on VMFS volumes.

Thin provisioning (TP) primitives were introduced with vSphere 5.0. They include such features as the raising of an alarm when a TP volume reaches 75 percent of capacity at the back end, TP-Stun and, of course, the UNMAP primitive.

vSphere Storage DRS leverages the 75 percent of capacity event. After the alarm is triggered, vSphere Storage DRS no longer considers that datastore as a destination for initial placement or ongoing load balancing of virtual machines.

The vSphere Storage APIs – Array Integration primitive TP-Stun was introduced to detect out-of-space conditions on SCSI LUNs. If a datastore reaches full capacity and has no additional free space, any virtual machines that require additional space will be stunned. Virtual machines that do not require additional space continue to work normally. After the additional space has been added to the datastore, the suspended virtual machines can be resumed.

Finally, the UNMAP primitive is used as a way to reclaim dead space on a VMFS datastore built on thin-‐provisioned LUNs. A detailed explanation of vSphere Storage APIs – Array Integration can be found in the white paper, VMware vSphere Storage APIs – Array Integration (VAAI).

NOTE: At the time of this writing, there was no support from vSphere Storage APIs – Array Integration for storage appliances. Support from vSphere Storage APIs – Array Integration is available only on physical storage arrays.

vSphere Storage vMotion

The only other considerations with vSphere Storage vMotion are relevant to both block operations and NAS. This is the configuration maximum. At the time of this writing, the maximum number of concurrent vSphere Storage vMotion operations per vSphere host is two and the maximum number of vSphere Storage vMotion operations per datastore is eight. This is to prevent any single datastore from being unnecessarily impacted.
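The two maximums can be expressed as a simple admission check. This is a sketch only: the function name and data layout are hypothetical, and counting an operation against both its source and its destination datastore is an assumption made for the example.

```python
MAX_PER_HOST = 2       # concurrent Storage vMotion operations per host
MAX_PER_DATASTORE = 8  # concurrent Storage vMotion operations per datastore


def can_start_svmotion(active, host, src_ds, dst_ds):
    """Return True if a new migration stays within both maximums.

    active is a list of (host, source_datastore, destination_datastore)
    tuples describing the migrations already in flight.
    """
    if sum(1 for h, _, _ in active if h == host) >= MAX_PER_HOST:
        return False
    for ds in {src_ds, dst_ds}:
        # Assumption: an operation counts against both datastores it touches.
        if sum(1 for _, s, d in active if ds in (s, d)) >= MAX_PER_DATASTORE:
            return False
    return True


in_flight = [("esx1", "dsA", "dsB"), ("esx1", "dsA", "dsC")]
print(can_start_svmotion(in_flight, "esx1", "dsD", "dsE"))  # host is full
print(can_start_svmotion(in_flight, "esx2", "dsA", "dsB"))
```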

vSphere Storage vMotion operations

vSphere Storage vMotion has gone through quite a few architectural changes over the years. The latest version in vSphere 5.x uses a mirror driver to split writes to the source and destination datastores after a migration is initiated. This means speedier migrations because there is only a single copy operation now required, unlike the recursive copy process used in previous versions that leveraged Changed Block Tracking (CBT).

One consideration that has been called out already is that vSphere Storage vMotion operations of virtual machines between datastores cannot be offloaded to the array without support from vSphere Storage APIs – Array Integration. In those cases, the software data mover does all vSphere Storage vMotion operations.

NOTE: A new enhancement in vSphere 5.1 enables up to four VMDKs belonging to the same virtual machine to be migrated in parallel, as long as the VMDKs reside on unique datastores.

Sizing Considerations

Recommended Volume Size

When creating this paper, we asked a number of our storage partners if there was a volume size that worked well for iSCSI. All partners said that there was no performance gain or degradation depending on the volume size and that customers might build iSCSI volumes of any size, so long as it was below the array vendor’s supported maximum. The datastore sizes vary greatly from customer to customer.

DELL recommends starting with a datastore that is between 500GB and 750GB for their Compellent range of arrays. However, because VMFS datastores can be easily extended on the fly, their general recommendation is to start with smaller and more manageable datastore sizes initially and expand them as needed. This seems like good advice.

Sizing of volumes is typically proportional to the number of virtual machines you attempt to deploy, in addition to snapshots/changed blocks created for backup purposes. Another consideration is that many arrays now have deduplication and compression features, which will also reduce capacity requirements. A final consideration is Recovery Point Objective (RPO) and Recovery Time Objective (RTO). These determine how fast you can restore your datastore with your current backup platform.

Recommended Block Size

This parameter is not tunable, for the most part. Some vendors have it hard set to 4KB and others have it hard set to 8KB. Block sizes are typically a multiple of 4KB. These align nicely with the 4KB grain size used in the VMDK format of VMware. For those vendors who have it set to 8KB, the recommendation is to format the volumes in the guest operating system (OS) to a matching 8KB block size for optimal performance. In this area, it is best to speak to your storage-‐array vendor to get vendor-‐specific advice.

Maximum Number of Virtual Machines per Datastore

The number of virtual machines that can run on a single datastore is directly proportional to the infrastructure and the workloads running in the virtual machines. For example, one might be able to run many hundreds of low-‐I/O virtual machines but only a few very intensive I/O virtual machines on the same datastore. Network congestion is an important factor. Users might consider using the Round Robin path policy on all storage devices to achieve optimal performance and load balancing. In fact, since vSphere 5.1 EMC now has the Round Robin path policy associated with its SATP (Storage Array Type Plug-‐in) in the VMkernel, so that when an EMC storage device is discovered, it will automatically use Round Robin.

The other major factor is related to the backup and recovery Service-Level Agreement (SLA). If you have one datastore with many virtual machines, there is a question of how long you are willing to wait while service is restored in the event of a failure. This is becoming the major topic in the debate over how many virtual machines per datastore is optimal.

The snapshot technology used by the backup product is an important question—specifically, whether it uses array-‐based snapshots or virtual machine snapshots. Performance is an important consideration if virtual machine snapshots are used to concurrently capture point-‐in-‐time copies of virtual machines. In many cases, array-‐based snapshots have less impact on the datastores and are more scalable when it comes to backups. There might be some array-‐based limitations to look at also. For instance, the number of snapshot copies of a virtual machine that a customer wants to maintain might exceed the number of snapshot copies an array can support. This varies from vendor to vendor. Check this configuration maximum with your storage-‐array vendor.

KB article 1015180 includes further details regarding snapshots and their usage. As shown in KB article 1025279, virtual machines can support up to 32 snapshots in a chain, but VMware recommends that you use only two or three snapshots in a chain and also that you use no single snapshot for more than 24–72 hours.
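The snapshot guidance above lends itself to a small audit helper. The sketch below encodes only the numbers quoted in the text (32 supported, two or three recommended, 24–72 hours of use); the function name and warning strings are hypothetical.

```python
SUPPORTED_MAX_CHAIN = 32        # maximum snapshots in a chain
RECOMMENDED_MAX_CHAIN = 3       # recommended upper bound (two or three)
RECOMMENDED_MAX_AGE_HOURS = 72  # upper end of the 24-72 hour guidance


def snapshot_warnings(chain_length, oldest_age_hours):
    """Return a list of warnings for one virtual machine's snapshot chain."""
    warnings = []
    if chain_length > SUPPORTED_MAX_CHAIN:
        warnings.append("chain exceeds the supported maximum of 32 snapshots")
    elif chain_length > RECOMMENDED_MAX_CHAIN:
        warnings.append("chain is longer than the recommended two or three")
    if oldest_age_hours > RECOMMENDED_MAX_AGE_HOURS:
        warnings.append("a snapshot is older than the recommended 24-72 hours")
    return warnings


print(snapshot_warnings(2, 12))
print(snapshot_warnings(5, 100))
```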

Booting a vSphere Host from Software iSCSI

VMware introduced support for iSCSI with ESX 3.x. However, ESX could boot only from an iSCSI LUN if a hardware iSCSI adapter was used. Hosts could not boot via the software iSCSI initiator of VMware. In vSphere 4.1, VMware introduced support making it possible to boot the host from an iSCSI LUN via the software iSCSI adapter.

NOTE: Support was introduced for VMware ESXi only, and not classic ESX.

Not all of our storage partners support iSCSI Boot Firmware Table (iBFT) boot from SAN. Refer to the partner’s own documentation for clarification.

Why Boot from SAN?

It quickly became clear that there was a need to boot via software iSCSI. Partners of VMware were developing blade chassis containing blade servers, storage and network interconnects in a single rack. The blades were typically diskless, with no local storage. The requirement was to have the blade servers boot off of an iSCSI LUN using network interface cards with iSCSI capabilities, rather than using dedicated hardware iSCSI initiators.

Compatible Network Interface Card

Much of the configuration for booting via software iSCSI is done via the BIOS settings of the network interface cards and the host. Check the VMware Hardware Compatibility List (HCL) to ensure that the network interface card is compatible. This is important, but a word of caution is necessary. If you select a particular network interface card and you see iSCSI as a feature, you might assume that you can use it to boot a vSphere host from an iSCSI LUN. This is not the case.

To see if a particular network interface card is supported for iSCSI boot, set the I/O device type to Network (not iSCSI) in the HCL and then check the footnotes. If the footnotes state that iBFT is supported, then this card can be used for boot from iSCSI.

Advanced Settings

There are a number of tunable parameters available when using iSCSI datastores. Before drilling into these advanced settings in more detail, you should understand that the recommended values for some of these settings might (and probably will) vary from storage-array vendor to storage-array vendor.

LoginTimeout

When iSCSI establishes a session between initiator and target, it must log in to the target. It will try to log in for a period of LoginTimeout seconds. If that is exceeded, the login fails.

LogoutTimeout

When iSCSI finishes a session between initiator and target, it must log out of the target. It will try to log out for a period of LogoutTimeout seconds. If that is exceeded, the logout fails.

RecoveryTimeout

The other options relate to how a dead path is determined. RecoveryTimeout is used to determine how long we should wait, in seconds, after PDUs are no longer being sent or received before placing a once-active path into a dead state. Realistically it’s a bit longer than that, because other considerations are taken into account as well.

NoopInterval and NoopTimeout

The noop settings are used to determine if a path is dead when it is not the active path. iSCSI will passively discover if this path is dead by using the noop timeout. This test is carried out on nonactive paths every NoopInterval seconds. If a response isn’t received by NoopTimeout, measured in seconds, the path is marked as dead.

Unless faster failover times are desirable, it is not required to change these parameters from their default settings. Use caution when modifying these parameters, because if paths fail too quickly and then recover, you might have LUNs/devices moving ownership unnecessarily between targets, and that can lead to path thrashing.
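The interaction between NoopInterval and NoopTimeout can be sketched with a toy timeline. This is a simplified model of the description above, not the initiator's real state machine, and the interval and timeout values in the example are arbitrary rather than documented defaults.

```python
def first_dead_time(response_delays, noop_interval, noop_timeout):
    """Return the time (in seconds) at which a nonactive path is marked dead.

    response_delays[i] is how long the reply to the i-th noop took, or
    None if no reply arrived. Noops are sent every noop_interval seconds.
    Returns None if every noop was answered within noop_timeout.
    """
    for i, delay in enumerate(response_delays):
        sent_at = i * noop_interval
        if delay is None or delay > noop_timeout:
            return sent_at + noop_timeout
    return None


# Third noop is never answered: sent at t=30, declared dead at t=40.
print(first_dead_time([0.1, 0.2, None], noop_interval=15, noop_timeout=10))
```

Lowering either value shortens detection time, which is the trade-off against path thrashing noted above.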

QFullSampleSize and QFullThreshold

Some of our storage partners require the use of the parameters QFullSampleSize and QFullThreshold to enable the adaptive queue-depth algorithm of VMware. With the algorithm enabled, no additional I/O throttling is required on the vSphere hosts. Refer to your storage-array vendor’s documentation to see if this is applicable to your storage.

Disk.DiskMaxIOSize

To improve the performance of virtual machines that generate large I/O sizes, administrators can consider setting the advanced parameter Disk.DiskMaxIOSize. Some of our partners suggest setting this to 128KB to enhance storage performance. However, it would be best to understand the I/O size that the virtual machine is generating before setting this parameter. A different size might be more suitable to your application.

DelayedAck

A host receiving a stream of TCP data segments, as in the case of iSCSI, can increase efficiency in both the network and the hosts by sending fewer than one acknowledgment (ack) segment per data segment received. This is known as a delayed ack. The common practice is to send an ack for every other full-sized data segment and not to delay the ack for a segment by more than a specified threshold. This threshold varies between 100 and 500 milliseconds. vSphere hosts, as do most other servers, use a delayed ack because of its benefits.

Some arrays, however, take the very conservative approach of retransmitting only one lost data segment at a time and waiting for the host’s ack before retransmitting the next one. This approach slows read performance to a halt in a congested network and might require the delayed ack feature to be disabled on the vSphere host. More details can be found in KB article 1002598.

Additional Considerations

Disk Alignment

This is not a recommendation specific to iSCSI, because misalignment can have an adverse effect on the performance of all block storage. Nevertheless, to account for every contingency, it should be considered a best practice to have the partitions of the guest OS running within the virtual machine aligned to the storage.
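A quick way to reason about alignment is to check whether the partition's starting byte offset is a whole multiple of the array block size. The sketch below assumes 512-byte sectors and uses the 4KB and 8KB block sizes mentioned earlier; the function name is hypothetical.

```python
def is_aligned(partition_start_sector, array_block_bytes, sector_bytes=512):
    """Return True if the partition start falls on an array block boundary."""
    return (partition_start_sector * sector_bytes) % array_block_bytes == 0


print(is_aligned(63, 4096))    # legacy start at sector 63
print(is_aligned(2048, 4096))  # start at sector 2048 (1MB offset)
print(is_aligned(2048, 8192))
```

A partition starting at sector 63 begins at byte 32,256, which is not a multiple of 4,096, so every guest block straddles two array blocks.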

Microsoft Clustering Support

With the release of vSphere 5.1, VMware supports as many as five nodes in a Microsoft Cluster. However, at the time of this writing, VMware does not support the cluster quorum disk over the iSCSI protocol.

In-Guest iSCSI Support

A number of in-‐guest iSCSI software solutions are available. The iSCSI driver of Microsoft is one commonly seen running in a virtual machine when the guest OS is a version of Microsoft Windows. The support statement for this driver can be found in KB article 1010547, which states that “if you encounter connectivity issues using a third-‐party software iSCSI initiator to the third-‐party storage device, engage the third-‐party vendors for assistance. If the third-‐party vendors determine that the issue is due to a lack of network connectivity to the virtual machine, contact VMware for troubleshooting assistance.”

All Paths Down and Permanent Device Loss

All Paths Down (APD) can occur on a vSphere host when a storage device is removed in an uncontrolled manner or if the device fails and the VMkernel core storage stack cannot detect how long the loss of device access will last. One possible scenario for an APD condition is an FC switch failure that brings down all the storage paths, or, in the case of an iSCSI array, a network connectivity issue that similarly brings down all the storage paths.

A new condition known as Permanent Device Loss (PDL) was introduced in vSphere 5.0. The PDL condition enabled the vSphere host to take specific actions when it detected that the device loss was permanent. The vSphere host can be informed of a PDL situation by specific SCSI sense codes sent by the target array.

In vSphere 5.1, VMware introduced a PDL detection method for those iSCSI arrays that present only one LUN for each target. These arrays were problematic, because after LUN access was lost, the target also was lost. Therefore, the vSphere host had no way of reclaiming any SCSI sense codes.

vSphere 5.1 extends PDL detection to those arrays that have only a single LUN per target. With vSphere 5.1, for those iSCSI arrays that have a single LUN per target, an attempt is made to log in again to the target after a dropped session. If there is a PDL condition, the storage system rejects the effort to access the device. Depending on how the array rejects the efforts to access the LUN, the vSphere host can determine whether the device has been lost permanently (PDL) or is temporarily unreachable.

Round Robin Path Policy Setting IOPS=1

A number of our partners have documented that if using the Round Robin path policy, best results can be achieved with an IOPS=1 setting. This might well be true in very small environments where there are a small number of virtual machines and a small number of datastores. However, as the environment scales with a greater number of virtual machines and a greater number of datastores, VMware considers the default settings associated with the Round Robin path policy to be sufficient. Consult your storage array vendor for advice on this setting.

Data Center Bridging (DCB) Support

Our storage partner Dell now supports iSCSI over DCB under the PVSP (Partner Verified and Supported Products) program of VMware. This is for the Dell EqualLogic (EQL) array only with certain Converged Network Adapters (CNAs) and only on vSphere version 5.1. See KB article 2044431 for further details.

7. Best Practices for running VMware vSphere on Network Attached Storage

Background

VMware introduced the support of IP based storage in release 3 of the ESX server. Prior to that release, the only option for shared storage pools was Fibre Channel (FC). With VI3, both iSCSI and NFS storage were introduced as storage resources that could be shared across a cluster of ESX servers.

The addition of new choices has led to a number of people asking “What is the best storage protocol choice for one to deploy a virtualization project on?” The answer to that question has been the subject of much debate, and there seems to be no single correct answer.

The considerations for this choice tend to hinge on the issues of cost, performance, availability, and ease of manageability. However, an additional factor should also be the legacy environment and the storage administrator’s familiarity with one protocol vs. the other based on what is already installed.

The bottom line is, rather than ask “which storage protocol to deploy virtualization on,” the question should be, “Which virtualization solution enables one to leverage multiple storage protocols for their virtualization environment?” And, “Which will give them the best ability to move virtual machines from one storage pool to another, regardless of what storage protocol it uses, without downtime, or application disruption?” Once those questions are considered, the clear answer is VMware vSphere.

However, to investigate the options a bit further, performance of FC is perceived as being a bit more industrial strength than IP based storage. However, for most virtualization environments, NFS and iSCSI provide suitable I/O performance. The comparison has been the subject of many papers and projects. One posted on VMTN is located at: http://www.vmware.com/files/pdf/storage_protocol_perf.pdf.

The general conclusion reached by the above paper is that for most workloads, the performance is similar with a slight increase in ESX Server CPU overhead per transaction for NFS and a bit more for software iSCSI. For most virtualization environments, the end user might not even be able to detect the performance delta from one virtual machine running on IP based storage vs. another on FC storage.

The more important consideration that often leads people to choose NFS storage for their virtualization environment is the ease of provisioning and maintaining NFS shared storage pools. NFS storage is often less costly than FC storage to set up and maintain. For this reason, NFS tends to be the choice taken by small to medium businesses that are deploying virtualization—as well as the choice for deployment of virtual desktop infrastructures. This paper will investigate the trade offs and considerations in more detail.

Overview of the Steps to Provision NFS Datastores

Before NFS storage can be addressed by an ESX server, the following requirements need to be met:

• A virtual switch needs to be configured for IP based storage.

• The ESX host needs to be configured to enable its NFS client.

• The NFS storage server needs to have been configured to export a mount point that is accessible to the ESX server on a trusted network.

For more details on NFS storage options and setup, consult the best practices for VMware provided by the storage vendor.

EMC with VMware vSphere 4 Applied Best Practices

NetApp and VMware vSphere Storage Best Practices

Regarding item one above, to configure the vSwitch for IP storage access you will need to create a new vSwitch under the ESX server configuration, networking tab in vCenter. Indicating that it is a VMkernel type connection will automatically add a VMkernel port to the vSwitch. You will need to populate the network access information.

Regarding item two above, to configure the ESX host for running its NFS client, you’ll need to open a firewall port for the NFS client. To do this, select the configuration tab for the ESX Server in Virtual Center and click on Security Profile (listed under software options) and then check the box for NFS Client listed under the remote access choices in the Firewall Properties screen.

With these items addressed, an NFS datastore can now be added to the ESX server following the same process used to configure a datastore for block based (FC or iSCSI) datastores.

• On the ESX Server configuration tab in VMware VirtualCenter, select storage (listed under hardware options) and then click the add button.

• On the screen for select storage type, select Network File System and in the next screen enter the IP address of the NFS server, mount point for the specific destination on that server and the desired name for that new datastore.

• If everything is completed correctly, the new NFS datastore will show up in the refreshed list of datastores available for that ESX server.

The main differences in provisioning NFS datastores compared to block based storage datastores are:

• For NFS there are fewer screens to navigate through but more data entry required than for block based storage.

• The NFS device needs to be specified via an IP address and folder (mount point) on that filer, rather than a pick list of options to choose from.

Issues to Consider for High Availability

To achieve high availability, the LAN on which the NFS traffic will run needs to be designed with availability, downtime-avoidance, isolation, and no single point of failure in mind.

Multiple administrators need to be involved in designing for high availability: both the virtual administrator and the network administrator. If done correctly, the failover capabilities of an IP based storage network can be as robust as those of an FC storage network.

Terminology

First, it is important to define a few terms that often cause confusion in the discussion of IP based storage networking. Some common terms and their definitions are as follows:

NIC/adapter/port/link – End points of a network connection.

Teamed/trunked/bonded/bundled ports – Pairing of two connections that are treated as one connection by a network switch or server. The result of this pairing is also referred to as an ether-channel. This pairing of connections is defined as Link Aggregation in the 802.3 networking specification.

Cross stack ether channel – A pairing of ports that can span across two physical LAN switches managed as one logical switch. This is only an option with a limited number of switches that are available today.

IP hash – Method of switching to an alternate path based on a hash of the IP address of both end points for multiple connections.

Virtual IP (VIF) – An interface used by the NAS device to present the same IP address out of two ports from that single array.

Avoiding single points of failure at the NIC, switch, and filer levels

The first level of High Availability (HA) is to avoid a single point of failure being a NIC card in an ESX server, or the cable between the NIC card and the switch. The solution is to have two NICs connected to the same LAN switch, configured as teamed at the switch, with IP hash failover enabled at the ESX server.

The second level of HA is to avoid a single point of failure being a loss of the switch to which the ESX connects. With this solution, one has four potential NIC cards in the ESX server configured with IP hash failover and two pairs going to separate LAN switches – with each pair configured as teamed at the respective LAN switches.

The third level of HA protects against a filer (or NAS head) becoming unavailable. With storage vendors that provide clustered NAS heads that can take over for one another in the event of a failure, one can configure the LAN such that downtime can be avoided in the event of losing a single filer or NAS head.

An even higher level of performance and HA can build on the previous HA level with the addition of Cross Stack Ether-channel capable switches. With certain network switches, it is possible to team ports across two separate physical switches that are managed as one logical switch. This provides additional resilience as well as some performance optimization: one can get HA with fewer NICs, or have more paths available across which to distribute load sharing.

Caveat: NIC teaming provides failover but not load-balanced performance (in the common case of a single NAS datastore)

It is also important to understand that there is only one active pipe for the connection between the ESX server and a single storage target (LUN or mount point). This means that although there may be alternate connections available for failover, the bandwidth for a single datastore and the underlying storage is limited to what a single connection can provide. To leverage more of the available bandwidth when an ESX server has multiple connections to storage targets, one would need to configure multiple datastores, with each datastore using a separate connection between the server and the storage. This is where one often runs into the distinction between load balancing and load sharing. Spreading traffic across two or more datastores configured on separate connections between the ESX server and the storage array is load sharing.

Security Considerations

The VMware vSphere implementation of NFS supports NFS version 3 over TCP. There is currently no support for NFS version 2, UDP, or CIFS/SMB. Kerberos is also not supported in ESX Server 4, and as such traffic is not encrypted. Storage traffic is transmitted as clear text across the LAN. Therefore, it is considered best practice to use NFS storage on trusted networks only, and to isolate the traffic on separate physical switches or leverage a private VLAN.

Another security concern is that the ESX Server must mount the NFS server with root access. This raises some concerns about hackers getting access to the NFS server. To address the concern, it is best practice to use either a dedicated LAN or a VLAN to provide protection and isolation.

Additional Attributes of NFS Storage

There are several additional options to consider when using NFS as a shared storage pool for virtualization. Some additional considerations are thin provisioning, de-duplication, and the ease of backup and restore of virtual machines, virtual disks, and even files on a virtual disk via array-based snapshots.

Thin Provisioning

Virtual disks (VMDKs) created on NFS datastores are in thin provisioned format by default. This capability offers better disk utilization of the underlying storage capacity in that it removes what is often considered wasted disk space. For the purpose of this paper, VMware will define wasted disk space as allocated but not used. The thin-provisioning technology removes a significant amount of wasted disk space.

On NFS datastores, the default virtual disk format is thin. As such, less storage space is allocated than would be needed for the same set of virtual disks provisioned in thick format.

De-duplication

Some NAS storage vendors offer data de-duplication features that can greatly reduce the amount of storage space required. It is important to distinguish between in-place de-duplication and de-duplication for backup streams. Both offer significant savings in space requirements, but in-place de-duplication seems to be far more significant for virtualization environments. Some customers have been able to reduce their storage needs by up to 75 percent of their previous storage footprint with the use of in-place de-duplication technology.

Summary of Best Practices

Networking Settings

To isolate storage traffic from other networking traffic, it is considered best practice to use either dedicated switches or VLANs for your NFS and iSCSI ESX server traffic. The minimum NIC speed should be 1GbE. In VMware vSphere, use of 10GbE is supported. Check the VMware HCL to confirm which models are supported.

It is important not to over-subscribe the network connection between the LAN switch and the storage array. The retransmitting of dropped packets can further degrade the performance of an already heavily congested network fabric.

Datastore Settings

The default setting for the maximum number of mount points/datastores an ESX server can concurrently mount is eight, although the limit can be increased to 64 in the existing release. If you increase max NFS mounts above the default setting of eight, make sure to also increase Net.TcpipHeapSize. If 32 mount points are used, increase Net.TcpipHeapSize to 30MB.

TCP/IP Heap Size

The safest way to calculate the tcpip heap size given the number of NFS volumes configured is to linearly scale the default values. For 8 NFS volumes the default min/max sizes of the tcpip heap are respectively 6MB/30MB. This means a host configured with 64 NFS volumes should have the min/max tcpip heap sizes set to 48MB/240MB.
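The linear scaling described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name is hypothetical, and rounding up and never going below the 8-volume defaults are assumptions rather than anything stated in the whitepaper.

```python
import math

# Baseline from the text: 8 NFS volumes -> 6MB min / 30MB max tcpip heap.
BASE_VOLUMES = 8
BASE_HEAP_MIN_MB = 6
BASE_HEAP_MAX_MB = 30


def scaled_tcpip_heap_mb(nfs_volumes):
    """Return (min_mb, max_mb) scaled linearly from the 8-volume baseline."""
    if nfs_volumes < 1:
        raise ValueError("nfs_volumes must be at least 1")
    # Assumption: hosts with fewer than 8 volumes simply keep the defaults.
    volumes = max(nfs_volumes, BASE_VOLUMES)
    # Assumption: round up to whole megabytes.
    heap_min = math.ceil(BASE_HEAP_MIN_MB * volumes / BASE_VOLUMES)
    heap_max = math.ceil(BASE_HEAP_MAX_MB * volumes / BASE_VOLUMES)
    return heap_min, heap_max


print(scaled_tcpip_heap_mb(64))  # (48, 240), matching the figures in the text
```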

Filer Settings

In a VMware cluster, it is important to make sure to mount datastores the same way on all ESX servers: use the same host (hostname/FQDN/IP), export and datastore name. Also make sure NFS server settings are persistent on the NFS filer (use FilerView, exportfs -p, or edit /etc/exports).
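As a rough sketch of how this consistency rule could be checked, the Python below compares mount definitions gathered from each ESX server. The input structure, host names and function name are all hypothetical; how the mount list is collected from each host is left out.

```python
def find_inconsistent_mounts(mounts_by_host):
    """Report datastores that are not mounted identically on every host.

    mounts_by_host: {esx_host: {datastore_name: (nfs_server, export_path)}}
    Returns {datastore_name: {esx_host: (nfs_server, export_path) or None}}.
    """
    all_names = set()
    for mounts in mounts_by_host.values():
        all_names.update(mounts)

    problems = {}
    for name in sorted(all_names):
        # None means the datastore is missing on that host.
        seen = {host: mounts.get(name) for host, mounts in mounts_by_host.items()}
        if len(set(seen.values())) > 1:
            problems[name] = seen
    return problems


# Hypothetical example: esx02 mounts the same export by IP instead of by FQDN.
cluster = {
    "esx01": {"nfs_ds1": ("filer1.example.com", "/vol/vmware1")},
    "esx02": {"nfs_ds1": ("10.0.0.5", "/vol/vmware1")},
}
print(find_inconsistent_mounts(cluster))
```

The comparison is deliberately literal: a hostname and an IP address that resolve to the same filer are still reported as a mismatch, in line with the advice to mount datastores the same way everywhere.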

ESX Server Advanced Settings and Timeout settings

When setting the NIC teaming settings, it is considered best practice to select “no” for the NIC teaming failback option. If there is some intermittent behavior in the network, this will prevent flip-flopping between the NICs being used.

When setting up VMware HA, it is a good starting point to also set the following ESX server timeouts and settings under the ESX server advanced setting tab.

• NFS.HeartbeatFrequency = 12

• NFS.HeartbeatTimeout = 5

• NFS.HeartbeatMaxFailures = 10

NFS Heartbeats

NFS heartbeats are used to determine if an NFS volume is still available. NFS heartbeats are actually GETATTR requests on the root file handle of the NFS volume. There is a system world that runs every NFS.HeartbeatFrequency seconds to check if it needs to issue heartbeat requests for any of the NFS volumes. If a volume is marked available, a heartbeat will only be issued if it has been more than NFS.HeartbeatDelta seconds since a successful GETATTR (not necessarily a heartbeat GETATTR) for that volume was issued. The NFS heartbeat world will always issue heartbeats for NFS volumes that are marked unavailable. Here is the formula to calculate how long it can take ESX to mark an NFS volume as unavailable:

RoundUp(NFS.HeartbeatDelta, NFS.HeartbeatFrequency) + (NFS.HeartbeatFrequency * (NFS.HeartbeatMaxFailures - 1)) + NFS.HeartbeatTimeout

Once a volume is back up it can take NFS.HeartbeatFrequency seconds before ESX marks the volume as available. See Appendix 2 for more details on these settings.
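To make the heartbeat formula concrete, here is a small Python sketch. It reads RoundUp(a, b) as "round a up to the next multiple of b", which is an interpretation rather than something the text spells out. The frequency, timeout and max-failures values are the HA starting points listed earlier; the NFS.HeartbeatDelta value of 5 is purely an assumed example, since the text does not give one.

```python
import math


def seconds_to_mark_unavailable(frequency, timeout, max_failures, delta):
    """Worst-case seconds before ESX marks an NFS volume unavailable."""
    # Assumption: RoundUp(delta, frequency) rounds delta up to the next
    # multiple of the heartbeat frequency.
    rounded_delta = math.ceil(delta / frequency) * frequency
    return rounded_delta + frequency * (max_failures - 1) + timeout


# frequency=12, timeout=5, max_failures=10 come from the settings above;
# delta=5 is an assumed value for illustration only.
print(seconds_to_mark_unavailable(frequency=12, timeout=5, max_failures=10, delta=5))
# 12 + (12 * 9) + 5 = 125 seconds
```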

Previously thought to be Best Practices

Some early adopters of VI3 on NFS created some best practices that are no longer viewed favorably. They are:

• Turning off NFS locking within the ESX server

• Not placing virtual machine swap space on NFS storage

Both of these have been debunked, and the following section provides more details as to why they are no longer considered best practices.

NFS Locking

NFS locking on ESX does not use the NLM protocol. VMware has established its own locking protocol. These NFS locks are implemented by creating lock files on the NFS server. Lock files are named “.lck-<fileid>”, where <fileid> is the value of the “fileid” field returned from a GETATTR request for the file being locked. Once a lock file is created, VMware periodically (every NFS.DiskFileLockUpdateFreq seconds) sends updates to the lock file to let other ESX hosts know that the lock is still active. The lock file updates generate small (84 byte) WRITE requests to the NFS server. Changing any of the NFS locking parameters will change how long it takes to recover stale locks. The following formula can be used to calculate how long it takes to recover a stale NFS lock:

(NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout

If any of these parameters are modified, it’s very important that all ESX hosts in the cluster use identical settings. Having inconsistent NFS lock settings across ESX hosts can result in data corruption!
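The stale-lock formula and the "identical settings on all hosts" rule above can be sketched together in Python. The function names, host names and the numeric values (10, 3, 5) are illustrative assumptions, not values taken from the text.

```python
LOCK_KEYS = (
    "NFS.DiskFileLockUpdateFreq",
    "NFS.LockRenewMaxFailureNumber",
    "NFS.LockUpdateTimeout",
)


def stale_lock_recovery_seconds(settings):
    """Apply the stale-lock recovery formula to one host's settings dict."""
    return (
        settings["NFS.DiskFileLockUpdateFreq"]
        * settings["NFS.LockRenewMaxFailureNumber"]
        + settings["NFS.LockUpdateTimeout"]
    )


def lock_settings_consistent(settings_by_host):
    """True if every host uses the same three NFS lock parameters."""
    distinct = {
        tuple(settings[key] for key in LOCK_KEYS)
        for settings in settings_by_host.values()
    }
    return len(distinct) <= 1


# Illustrative values only (assumed, not taken from the text).
example = {
    "NFS.DiskFileLockUpdateFreq": 10,
    "NFS.LockRenewMaxFailureNumber": 3,
    "NFS.LockUpdateTimeout": 5,
}
print(stale_lock_recovery_seconds(example))  # (10 * 3) + 5 = 35
print(lock_settings_consistent({"esx01": example, "esx02": dict(example)}))  # True
```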

In vSphere the option to change the NFS.Lockdisable setting has been removed. This was done to remove the temptation to disable the VMware locking mechanism for NFS. So it is no longer an option to turn it off in vSphere.

Virtual Machine Swap Space Location

Keeping the virtual machine swap space on the NFS datastore is now considered to be the best practice.

NFS – Advanced Options

• NFS.DiskFileLockUpdateFreq Time between updates to the NFS lock file on the NFS server. Increasing this value will increase the time it takes to recover stale NFS locks. (See NFS Locking)

• NFS.LockUpdateTimeout Amount of time VMware waits before aborting a lock update request. (See NFS Locking)

• NFS.LockRenewMaxFailureNumber Number of lock update failures that must occur before VMware marks the lock as stale. (See NFS Locking)

• NFS.HeartbeatFrequency How often the NFS heartbeat world runs to see if any NFS volumes need a heartbeat request. (See NFS Heartbeats)

• NFS.HeartbeatTimeout Amount of time VMware waits before aborting a heartbeat request. (See NFS Heartbeats)

• NFS.HeartbeatDelta Amount of time after a successful GETATTR request before the heartbeat world will issue a heartbeat request for a volume. If an NFS volume is in an unavailable state, an update will be sent every time the heartbeat world runs (NFS.HeartbeatFrequency seconds). (See NFS Heartbeats)

• NFS.HeartbeatMaxFailures Number of consecutive heartbeat requests that must fail before VMware marks a server as unavailable. (See NFS Heartbeats)

• NFS.MaxVolumes Maximum number of NFS volumes that can be mounted. The TCP/IP heap must be increased to accommodate the number of NFS volumes configured. (See TCP/IP Heap Size)

• NFS.SendBufferSize This is the size of the send buffer for NFS sockets. This value was chosen based on internal performance testing. Customers should not need to adjust this value.

• NFS.ReceiveBufferSize This is the size of the receive buffer for NFS sockets. This value was chosen based on internal performance testing. Customers should not need to adjust this value.

• NFS.VolumeRemountFrequency This determines how often VMware tries to mount an NFS volume that was initially unmountable. Once a volume is mounted, it never needs to be remounted. The volume may be marked unavailable if VMware loses connectivity to the NFS server, but it will still remain mounted.

8. VMware vSphere 5.0 Upgrade Best Practices

VMware vSphere 5.0 – What’s New

• Industry’s largest virtual machines – VMware can support even the largest applications with the introduction of virtual machines that can grow to as many as 32 vCPUs and can use up to 1TB of memory. This enhancement is 4x bigger than the previous release. vSphere can now support business-critical applications of any size and dimension.

• vSphere High Availability (VMware HA) – New architecture ensures the most simplified setup and the best guarantees for the availability of business-critical applications. Setup of the most widely used VMware HA technology in the industry has never been easier. VMware HA can now be set up in just minutes.

• VMware vSphere® Auto Deploy – In minutes, you can deploy more vSphere hosts running the ESXi hypervisor architecture “on the fly.” After it is running, Auto Deploy simplifies patching by enabling you to do a one-time patch of the source ESXi image and then push the updated image out to your ESXi hosts, as opposed to the traditional method of having to apply the same patch to each host individually.

• Profile-Driven Storage – You can reduce the steps in the selection of storage resources by grouping storage according to a user-defined policy.

• vSphere Storage DRS – Automated load balancing now analyzes storage characteristics to determine the best place for a given virtual machine’s data to live when it is created and then used over time.

• vSphere Web Client – This rich browser-based client provides full virtual machine administration, and now has multiplatform support and optimized client/server communication, which delivers faster response and a more efficient user experience that helps take care of business needs faster.

• VMware vCenter Appliance (VCSA) – This VMware vCenter Server preinstalled virtual appliance simplifies the deployment and configuration of vCenter Server, slipstreams future upgrades and patching, and reduces the time and cost associated with managing vCenter Server. (Upgrading to the VMware vCenter Appliance from the installable vCenter Server is not supported.)

• Licensing Reporting Manager – With the new vSphere vRAM licensing introduced with vSphere 5.0, vCenter Server is enabled to show not only installed licenses but the vRAM license memory pooling and its real-time utilization. This allows administrators to see the benefits of vRAM pooling and how to size as the business grows.

Upgrading to VMware vCenter Server 5.0

The first step in any vSphere migration project should always be the upgrade of vCenter Server. Your vCenter Server must be running at version 5.0 in order to manage an ESXi 5.0 host.

Upgrading to vCenter Server 5.0 involves upgrading the vCenter Server machine, its accompanying database, and any configured plug-ins, including VMware vSphere® Update Manager and VMware vCenter Orchestrator.

As of vSphere 4.1, vCenter Server requires a 64-bit server running a 64-bit operating system (OS). If you are currently running vCenter Server on a 32-bit OS, you must migrate to the 64-bit architecture first. With the 64-bit vCenter Server, you also must use a 64-bit database source name (DSN) for the vCenter database.

Planning the Upgrade

It is recommended that you create an inventory of the current components and that you validate compatibility with the requirements of vCenter 5.0.

Requirements

These are supported minimums. Scaling and sizing of vCenter Server and components should be based on the size of the current virtual environment and anticipated growth.

• Processor: Two CPUs 2.0GHz or higher Intel or AMD x86 processors, with processor requirements higher if the database runs on the same machine

• Memory: 4GB RAM, with RAM requirements higher if your database runs on the same machine

• Disk storage: 4GB, with disk requirements higher if your database runs on the same machine

• Networking: 1Gb recommended

• OS: 64-bit

• Supported database platform

Upgrade Process

The following diagram depicts possible upgrade scenarios

NOTE: With the release of vSphere 5.0, vCenter Server is also offered as a Linux-based appliance, referred to as the vCenter Server Appliance (VCSA), which can be deployed in minutes. Due to the architectural differences between the installable vCenter and the new VCSA, there is no migration path or database conversion tool to migrate to the VCSA. You must deploy a new VCSA and attach all the infrastructure components before recreating and attaching inventory objects.

We will explore the three most common scenarios:

• vCenter 4.0 and Update Manager 4.0, and a 32-bit OS with a local database

• vCenter 4.1 and Update Manager 4.1, a 64-bit OS with a local database, and the requirement to migrate to a remote database

• vCenter 4.1, a 64-bit OS with a remote database, and a separate Update Manager server

Backing Up Your vCenter Configuration

Before starting the upgrade procedure, it is recommended to back up your current vCenter Server to ensure that you can restore to the previous configuration in the case of an unsuccessful upgrade. It is important to realize that there are multiple objects that must be backed up to provide the ability to roll back:

• SSL certificates

• vpxd.cfg

• Database

Depending on the type of platform used to host your vCenter Server, it might be possible to simply create a clone or snapshot of your vCenter Server and database to allow for a simple and effective rollback scenario.

In most cases, however, it is recommended that you back up each of the aforementioned items separately to allow for a more granular recovery when required, following the database software vendor’s best practices and documentation.

The vCenter configuration file vpxd.cfg and the SSL certificates can be simply backed up by copying them to a different location. It is recommended that you copy them to a location external to the vCenter Server. The SSL certificates are located in a folder named “SSL” under the following folders—vpxd.cfg can be in the root of these folders:

Windows 2003: %ALLUSERSPROFILE%\Application Data\VMware\VMware VirtualCenter\

Windows 2008: %systemdrive%\ProgramData\VMware\VMware VirtualCenter\

It is important to also document any changes made to the vCenter configuration and to your database configuration settings, such as the database DSN, user name and password. Before any upgrade is undertaken, it is recommended that you back up your database and vCenter Server.

Host Agents

It is recommended that you validate that the current configuration meets the vCenter Server requirements. This can be done manually or by using the Agent Pre-Upgrade Checker, which is provided with the vCenter Server installation media.

The Agent Pre-Upgrade Checker will investigate each of the ESX/ESXi hosts in the environment, and will report whether or not the agent on the host can be updated.

Upgrading a 32-Bit vCenter 4.0 OS with a Local Database

This scenario will describe an upgrade of vCenter Server 4.0 with a local database running on a 32-bit version of a Microsoft Windows 2003 OS. As vCenter 5.0 is a 64-bit platform, an in-place upgrade is not possible. A VMware Data Migration Tool included with the vCenter Server media can be utilized to migrate data and settings from the old 32-bit OS to the new 64-bit OS.

The Data Migration Tool should be unzipped in both the source and destination vCenter Server.

Backup Configuration Using the Data Migration Tool

Stop the following services on the “source” vCenter Server:

• VMware vSphere Update Manager service

• vCenter Management Web Services

• vCenter Server service

Open a Command Prompt and go to the location from which datamigration.zip was extracted. Type backup.bat. Decide whether the host patches should be backed up or not. To minimize stored data, we recommend not backing them up; instead, download new patches afterward and exclude ESX patches.

Installing vCenter Using Data Provided by the Data Migration Tool

• Copy the contents of the “source” vCenter Server’s datamigration folder to the new vCenter Server.

• Open up a Command Prompt and go to the folder containing the datamigration tools that you just copied.

• Run install.bat.

Using the Data Migration Tool, you can easily migrate the vCenter Server 4.0 32-bit OS using Microsoft SQL Server 2005 Express to a 64-bit OS. As with any tool, there are some caveats. We have listed the most accessed VMware knowledge base articles regarding the Data Migration Tool for your convenience as follows:

• Backing up the vCenter Server 4.x bundle using the Data Migration tool fails with the error: Object reference not set to an instance of an object (http://kb.vmware.com/kb/1036228)

• Data migration tool fails with the error: RESTORE cannot process database ‘VIM_VCDB’ because it is in use by this session (http://kb.vmware.com/kb/2001184)

• vCenter Server 4.1 Data Migration Tool fails with the error: HResult 0x2, Level 16, State 1 (http://kb.vmware.com/kb/1024490)

• Using the Data Migration Tool to upgrade from vCenter Server 4.0 to vCenter Server 4.1 fails (http://kb.vmware.com/kb/1024380)

• When upgrading to vCenter Server 4.1, running install.bat of the Data Migration Tool fails (http://kb.vmware.com/kb/1029663)

Upgrading a 64-Bit vCenter 4.1 Server with a Remote Database

Of the three scenarios this is the most straightforward, but we still suggest that you back up your current vCenter configuration and database to provide a rollback scenario.

• Insert the VMware vCenter Server 5.0 CD. Select vCenter Server and click Install.

• Select the appropriate language and click OK.

• Install .NET Framework 3.5 SP1 by clicking Install.

• The installer should now detect that vCenter is already installed. Upgrade the current installation by clicking Next.

Upgrading a 64-Bit vCenter 4.1 Server with a Local Database to a Remote Database

When upgrading your environment from vCenter Server 4.1 to vCenter Server 5.0, it might also be the right time to make adjustments to your design decisions. One of those changes might be the location of the vCenter Server database, where instead of using a local Microsoft SQL Server Express 2005 database, a remote SQL server is used. In this scenario, we will primarily focus on how to migrate the database. The upgrade of vCenter Server 4.1 can be done in two different ways, which we will briefly explain at the end of the migration workflow section.

If vCenter Server is currently installed as a virtual machine, we recommend that you create a new virtual machine for vCenter Server 5.0. That way, in case a rollback is required, the vCenter Server 4.1 virtual machine can be powered on with a minimal impact on your management environment.

• Download the Microsoft SQL Server Management Studio Express and install it on your vCenter Server (Guide assumes you are using SQL Express).

• Stop the service named “VMware VirtualCenter Server.”

• Start the Microsoft SQL Server Management Studio Express application and log in to the local SQL instance.

• Right-click your vCenter Server Database “VIM_VCDB” and click Back Up under Tasks.

Copy this database from the selected location to your new Microsoft SQL Database Server.

Create a new database on your destination Microsoft SQL Server 2008.

• Open Microsoft SQL Server Management Studio Express.

• Log in to the local Microsoft SQL Server instance.

• Right-click Databases and select New Database.

• Give the new database a name and select an appropriate owner.

• Use the database calculator to identify the initial size of the database. Leave this set to the default and click OK.

Now that the database has been created, the old database must be restored to this newly created database.

• Open Microsoft SQL Server Management Studio Express.

• Log in to the local Microsoft SQL Server instance.

• Unfold Databases.

• Right-click the newly created database and select Restore Database.

• Select From device. Select the correct database.

• Ensure that the correct database is selected to restore, as depicted in the following.

• Select Overwrite the existing database (WITH REPLACE).

If you want to reuse your current environment, go to the existing vCenter Server and recreate the system DSN. If you prefer to keep the existing vCenter Server intact, go to the new vCenter Server and create a new system DSN.

• Open the ODBC Data Source Administrator.

• Click the System DSN tab.

• Remove the listed VMware VirtualCenter system DSN entry.

• Add a new system DSN using the Microsoft SQL Server Native Client. If this option is not available, download it here: http://www.microsoft.com/downloads/en/details.aspx?FamilyId=C6C3E9EF-BA29-4A43-8D69-A2BED18FE73C&displaylang=en

If the current vCenter Server environment is reused, uninstall vCenter Server first as described below. If a new vCenter Server is used, skip the uninstall and reboot steps. We have tested the upgrade without uninstalling vCenter Server. Although it was successful, we recommend removing it every time to avoid unexpected behavior or results.

• Uninstall vCenter Server.

• Reboot the vCenter Server host.

In both cases, vCenter Server must be reinstalled.

• Install vCenter Server.

• In the installation wizard, select the newly created DSN that connects to your SQL2008 database. Select the Do not overwrite, leave my existing database in place option.

– Ensure that the authentication type used in SQL2008 is the same as that used on SQLExpress2005.

– Reset the permissions of the vCenter account that connects to the database as the database owner (dbo) user of the MSDB system database.

Details regarding this migration procedure can also be found in VMware knowledge base article 1028601 (http://kb.vmware.com/kb/1028601), Migrating the vCenter Server 4.x database from SQL Express 2005 to SQL Server 2008.

Upgrading to VMware ESXi 5.0

Following the vCenter Server upgrade, you are ready to begin upgrading your ESXi hosts. You can upgrade your ESX/ESXi 4.x hosts to ESXi 5.0 using either the ESXi Installer or vSphere Update Manager. Each method has a unique set of advantages and disadvantages.

Choosing an Upgrade Path

The two upgrade methods work equally well, but there are specific requirements that must be met before a host can be upgraded to ESXi 5.0. The following chart takes into account the various upgrade requirements and can be used as a guide to help determine both your upgrade eligibility and your upgrade path.

Verifying Hardware Compatibility

ESXi 5.0 supports only 64-bit servers. Supported servers are listed on the vSphere Hardware Compatibility List (HCL). When verifying hardware compatibility, it’s also important to consider firmware versions. VMware will often annotate firmware requirements in the footnotes of the HCL.

Verifying ESX/ESXi Host Version

Only hosts running ESX/ESXi 4.x can be directly upgraded to ESXi 5.0. Hosts running older releases must first be upgraded to ESX/ESXi 4.x. While planning your ESXi 5.0 upgrade, evaluate the benefit of upgrading older servers against the benefit of replacing them with new hardware.

Boot-Disk Free-Space Requirements

Upgrading from ESXi 4.x

When upgrading from ESXi 4.x, using either the ESXi Installer or Update Manager, a minimum of 50MB of free space is required in the host’s local VMware vSphere® VMFS (VMFS) datastore. This space is used to temporarily store the host configuration.

Upgrading from ESX 4.x

When upgrading from ESX 4.x, the free-space requirements vary depending on whether you are using the ESXi Installer or Update Manager.

ESXi Installer

When using the ESXi Installer, a minimum of 50MB of free space is required in the host’s local VMFS datastore. This space is used to temporarily store the host configuration.

VMware vSphere Update Manager

When using Update Manager, in addition to having 50MB of free space on the local VMFS datastore, there is an additional requirement of 350MB free space in the “/boot” partition. This space is used as a temporary staging area where Update Manager will copy the ESXi 5.0 image and required upgrade scripts.

NOTE: Due to differences in the boot disk partition layout between ESX 3.5 and ESX 4.x, ESX 4.x hosts upgraded from ESX 3.x might not have the required 350MB of free space and therefore cannot be upgraded to ESXi 5.0 using Update Manager. In this case, use the ESXi Installer to perform the upgrade.

Disk Partitioning Requirements

Upgrading an existing ESX/ESXi 4.x host to ESXi 5.0 modifies the host’s boot disk. As such, a successful upgrade is highly dependent on having a supported boot disk partition layout.

Disk Partitioning Requirements for ESXi

ESXi 5.0 uses the same boot disk layout as ESXi 4.x. Therefore, in most cases the boot disk partition table does not require modification as part of the 5.0 upgrade. One notable exception is with an ESXi 3.5 host that is upgraded to ESXi 4.x and then immediately upgraded to ESXi 5.0. In ESXi 3.5, the boot banks are 48MB. In ESXi 4.x, the size of the boot banks changed to 250MB. When a host is upgraded from ESXi 3.5 to ESXi 4.x, only one of the two boot banks is resized. This results in a situation where a host will have one boot bank at 250MB and the other at 48MB, a condition referred to as having “lopsided boot banks.” An ESXi host with lopsided boot banks must have a new partition table written to the disk during the upgrade. Update Manager cannot be used to upgrade a host with lopsided boot banks. The ESXi Installer must be used instead.
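The version, free-space and boot-bank rules described so far can be collected into a simple pre-check. The Python below is only a sketch: the function name, parameters and message strings are hypothetical, and the inputs (free space, boot bank sizes) are assumed to have been gathered from the host by other means.

```python
def upgrade_blockers(product, version, method, vmfs_free_mb,
                     boot_free_mb=None, boot_bank_sizes_mb=None):
    """Return a list of reasons why a host cannot be upgraded as planned.

    product: "ESX" or "ESXi"; method: "installer" or "update_manager".
    An empty list means none of the rules sketched here were violated.
    """
    blockers = []
    # Only ESX/ESXi 4.x hosts can be upgraded directly to ESXi 5.0.
    if not version.startswith("4."):
        blockers.append("host must first be upgraded to ESX/ESXi 4.x")
    # 50MB on the local VMFS datastore is needed with either method.
    if vmfs_free_mb < 50:
        blockers.append("need at least 50MB free on the local VMFS datastore")
    if method == "update_manager":
        # ESX via Update Manager also needs 350MB free in /boot.
        if product == "ESX" and (boot_free_mb is None or boot_free_mb < 350):
            blockers.append("need at least 350MB free in /boot; use the ESXi Installer instead")
        # Lopsided boot banks (e.g. 250MB and 48MB) rule out Update Manager.
        if product == "ESXi" and boot_bank_sizes_mb and len(set(boot_bank_sizes_mb)) > 1:
            blockers.append("lopsided boot banks; use the ESXi Installer instead")
    return blockers


print(upgrade_blockers("ESXi", "4.1", "update_manager", 100, boot_bank_sizes_mb=(250, 48)))
```

Hardware compatibility (the HCL check) is deliberately not covered here, since it cannot be derived from these inputs.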

Disk Partitioning Requirements for ESX

When upgrading an ESX 4.x host to ESXi 5.0, the ESX boot disk partition table is modified to support the dual-image bank architecture used by ESXi. The VMFS-3 partition is the only partition that is retained. All other partitions on the disk are destroyed.

Limitations of an Upgraded ESXi 5.0 Host

There are some side effects associated with upgrading an ESX host to ESXi 5.0 as compared to performing a fresh installation. These include the following:

• Upgraded hosts retain the legacy MSDOS-based partition label and are still limited to a physical disk that is less than 2TB in size. Installing ESXi on a disk larger than 2TB requires a fresh install.

• Upgraded hosts do not have a dedicated scratch partition. Instead, a scratch directory is created and mounted off a VMFS volume. Aside from the scratch partition, all other disk partitions, such as the boot banks, locker and vmkcore, are identical to those of a freshly installed ESXi 5.0 host.

• The existing VMFS partition is not upgraded from VMFS-3 to VMFS-5. You can manually upgrade the VMFS partition after the upgrade. ESXi 5.0 is compatible with VMFS-3 partitions, so upgrading to VMFS-5 is required only to enable new vSphere 5.0 features.

• For hosts in which the VMFS partition is on a separate disk from the boot drive, the VMFS partition is left intact and the entire boot disk is overwritten. Any extra data on the disk is erased.

Preserving the ESX/ESXi Host Configuration

During the upgrade, most of the ESX/ESXi host configuration is retained. However, not all of the host settings are preserved. The following list highlights key configuration settings that are not carried forward during an upgrade:

• The service console port group

• Local users and groups on the ESX/ESXi host

• NIS settings

• Rulesets and custom firewall rules

• Any data in custom disk partitions

• Any custom or third-party scripts/agents running in the ESX service console

• SSH configurations for ESX hosts (SSH settings are kept for ESXi hosts)

Third-Party Software Packages

Some customers run optional third-party software components on their ESX/ESXi 4.x hosts. When upgrading, if third-party components are detected, you are warned that they will be lost during the upgrade.

If a host being upgraded contains third-party software components, such as CIM providers or nonstandard device drivers, either these components can be reinstalled after the upgrade or you can use vSphere 5.0 Image Builder CLI to create a customized ESXi installation image with these packages bundled.

VMware ESXi Upgrade Best Practices

Using vMotion/Storage vMotion

Virtual machines cannot be running on the ESX/ESXi host while it is upgraded. To avoid virtual machine downtime, use vMotion and Storage vMotion to migrate virtual machines and their related data files off the host prior to upgrading. If virtual machines are not migrated off the hosts, they must be shut down for the duration of the upgrade. If you don’t have a license for vMotion or Storage vMotion, leverage the vCenter 60-day trial period to access these features for the duration of the upgrade.

Placing ESX Hosts into Clusters and Enabling HA/DRS

Placing ESX hosts into a DRS-enabled HA cluster will facilitate migrating virtual machines off the host and ensure continued availability of your virtual machines. When running virtual machines on a DRS-enabled HA cluster, virtual machines on shared storage will automatically be migrated off the host when it is placed into maintenance mode. In addition, DRS will ensure that the cluster workload remains balanced as you roll the upgrade through the hosts in the vSphere cluster. Again, if you don’t have a license for HA/DRS, leverage the vCenter 60-day trial period to access these features for the duration of the upgrade.

Watching Out for Local Storage

Virtual machines running on local storage cannot be accessed by other ESXi hosts in your datacenter. They therefore cannot be resumed or “taken over” by another host in the rare event that you encounter a problem during the upgrade. If a problem develops, all virtual machines on local datastores will be down until the problem is resolved and the host is restored. If the problem is severe and you must resort to reinstalling ESX/ESXi, you are at risk of losing all your local virtual machines. To avoid unnecessary virtual machine downtime and eliminate the risk of unwanted virtual machine deletion, migrate local virtual machines and their data files off the host and onto shared storage using vMotion/Storage vMotion. Again, leverage the vCenter 60-day trial period to enable vMotion and Storage vMotion if not already available.

Backing Up Your Host Configuration Before Upgrading

Prior to beginning a host migration, it’s always a good idea to back up the host configuration. The steps to backing up the host configuration differ depending on whether the host is running ESX or ESXi.

Backing Up Your ESX Host Configuration:

Before you upgrade an ESX host, back up the host’s configuration and local VMFS volumes. This backup ensures that you will not lose data during the upgrade.

Procedure

• Back up the /etc/passwd, /etc/group, /etc/shadow and /etc/gshadow files. The /etc/shadow and /etc/gshadow files might not be present on all installations.
• Back up any custom scripts.
• Back up your .vmx files.
• Back up local images, such as templates, exported virtual machines and .iso files.

Backing Up Your ESXi Host Configuration:

Procedure

Install the vSphere CLI.

In the vSphere CLI, run the vicfg-cfgbackup command with the -s flag to save the host configuration to a specified backup filename.

~# vicfg-cfgbackup --server <ESXi-host-ip> --portnumber <port_number> --protocol <protocol_type> --username <username> --password <password> -s <backup-filename>
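When several hosts need backing up, the command above can be assembled programmatically. The following is a minimal sketch rather than a supported tool: the function name, the default port 443 and the default protocol https are assumptions for illustration, and only the flags shown in the command above are used.

```python
from typing import List


def build_cfgbackup_command(server: str, username: str, password: str,
                            backup_file: str, port: int = 443,
                            protocol: str = "https") -> List[str]:
    """Return the argument list for the vicfg-cfgbackup call shown above.

    The port and protocol defaults are assumptions; adjust them to your setup.
    """
    return [
        "vicfg-cfgbackup",
        "--server", server,
        "--portnumber", str(port),
        "--protocol", protocol,
        "--username", username,
        "--password", password,
        "-s", backup_file,
    ]


if __name__ == "__main__":
    # Placeholder values only; replace with real host details before use.
    cmd = build_cfgbackup_command("192.0.2.10", "root", "secret",
                                  "esxi01-config.bak")
    print(" ".join(cmd))
    # To execute it, the vSphere CLI must be installed and on PATH:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building an argument list instead of a single string avoids shell quoting problems if a password or filename contains spaces.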

In addition, it’s a good idea to document the host configuration and to have this information available in the event that problems arise during the host upgrade.

Summary of Upgrade Requirements and Recommendations

The following list provides a summary of the upgrade requirements and recommendations:

• Verify that your hardware is supported with ESXi 5.0 by using the vSphere 5.0 Hardware Compatibility List (HCL) at http://www.vmware.com/resources/compatibility/search.php.
• Consider phasing out the older servers and refreshing your hardware in conjunction with an ESXi 5.0 upgrade.
• Back up your host before attempting an upgrade. The upgrade process modifies the ESX/ESXi host’s boot disk partition table, preventing automated rollback.
• Verify that the boot disk partition table meets the upgrade requirements, particularly regarding the size of the /boot partition and the location of the VMFS partition. The VMFS partition can be preserved only when it is physically located beyond the 1GB mark, that is, after the ESX boot partition (partition 4) and after the extended disk partition on the disk (8192 + 1835008 sectors).
• Use Image Builder CLI to add optional third-party software components, such as CIM providers and device drivers, to your ESXi 5.0 installation image.
• Move virtual machines on local storage over to shared storage, where they can be kept highly available using vMotion and Storage vMotion together with VMware HA and DRS.
• If the host was upgraded from ESXi 3.5, watch out for lopsided boot banks. Upgrade hosts with lopsided boot banks using the ESXi Installer.
• If the ESXi Installer does not provide an option to upgrade, verify that the required disk space is available (350MB in /boot, 50MB in VMFS).

Upgrading to ESXi 5.0 Using Update Manager

Requirements As a reminder, the following requirements must be met to perform an upgrade using Update Manager:

• Perform a full backup of the ESX/ESXi host.
• Ensure that you have 50MB of free space on the boot disk VMFS datastore.
• Ensure that you have 350MB free on the ESX host’s “/boot” partition (ESX only).
• Ensure that the VMFS partition begins beyond the 1GB mark (starts after sector 1843200).
• Ensure that the host was not recently upgraded from ESXi 3.5 (ESXi only).
• Use vMotion/Storage vMotion to migrate all virtual machines off the host (alternatively, power the virtual machines down).
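The requirements above lend themselves to a simple checklist function. The sketch below is illustrative only: the HostFacts record and its field names are hypothetical, and the values are assumed to be collected by hand or by your own tooling. The thresholds (50MB, 350MB, sector 1843200) come from the list above; the backup requirement is not modeled.

```python
from dataclasses import dataclass
from typing import List

# Threshold from the text: 8192 + 1835008 sectors.
VMFS_MIN_START_SECTOR = 8192 + 1835008


@dataclass
class HostFacts:
    """Hypothetical record of facts gathered for one host before upgrading."""
    is_classic_esx: bool        # True for ESX, False for ESXi
    vmfs_free_mb: int           # free space on the boot disk VMFS datastore
    boot_free_mb: int           # free space in /boot (checked for ESX only)
    vmfs_start_sector: int      # first sector of the VMFS partition
    upgraded_from_esxi35: bool  # host was previously upgraded from ESXi 3.5
    running_vms: int            # powered-on virtual machines still on the host


def upgrade_blockers(host: HostFacts) -> List[str]:
    """Return the unmet requirements; an empty list means none were found."""
    blockers = []
    if host.vmfs_free_mb < 50:
        blockers.append("less than 50MB free on the boot disk VMFS datastore")
    if host.is_classic_esx and host.boot_free_mb < 350:
        blockers.append("less than 350MB free in /boot")
    if host.vmfs_start_sector <= VMFS_MIN_START_SECTOR:
        blockers.append("VMFS partition does not start after sector 1843200")
    if not host.is_classic_esx and host.upgraded_from_esxi35:
        blockers.append("host was upgraded from ESXi 3.5")
    if host.running_vms > 0:
        blockers.append("virtual machines are still running on the host")
    return blockers
```

A host that returns an empty list has passed these checks only; the Update Manager scan remains the authoritative test.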

Uploading the ESXi Installation ISO

Start the upgrade by uploading the ESXi 5.0 installation image into Update Manager. From the Update Manager screen, choose the ESXi Images tab and click the link to Import ESXi Image... . Follow the wizard to import the ESXi 5.0 Image.

Creating an Upgrade Baseline

Create an upgrade baseline using the uploaded ESXi 5.0 image. From the Update Manager screen, choose the Baselines and Groups tab. From the Baselines section on the left, choose Create... to create a new baseline. Follow the wizard to create a new baseline.

Attaching the Baseline to Your Cluster/Host

Attach the upgrade baseline to your host or cluster. From the vCenter Hosts and Clusters view, select the Update Manager tab and choose Attach... . Select the upgrade baseline created previously. If you have any other upgrade baselines attached, remove them.

Scanning the Cluster/Host

Scan your hosts to ensure that the host requirements are met and you are ready to upgrade. From the vCenter Hosts and Clusters view, select the host/cluster, select the Update Manager tab and select Scan... . Wait for the scan to complete.

If the hosts return a status of Non-Compliant, you are ready to proceed with upgrading the host.

If a host returns a status of Incompatible with the reason being an invalid boot disk, you cannot use Update Manager to upgrade. Try using the ESXi Installer.

If a host returns a status of Incompatible with the reason being that optional third-party software was detected, you can proceed with the upgrade and reinstall the optional software packages afterward or you can proactively add the optional packages to the ESXi installation image using Image Builder CLI.
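The three scan outcomes described above amount to a small decision table. The sketch below encodes it; the status and reason strings are hypothetical labels chosen for illustration and are not values returned by any Update Manager API.

```python
def next_step_after_scan(status: str, reason: str = "") -> str:
    """Map a scan result to the action recommended in the text above."""
    status = status.strip().lower()
    reason = reason.strip().lower()
    if status == "non-compliant":
        return "remediate the host with Update Manager"
    if status == "incompatible" and "boot disk" in reason:
        return "use the ESXi Installer instead of Update Manager"
    if status == "incompatible" and "third-party" in reason:
        return ("upgrade and reinstall the packages afterward, "
                "or add them to the image with Image Builder CLI")
    # Anything else is outside the cases covered in the text.
    return "investigate manually"
```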

Remediating Your Host

After the scan completes and your host is flagged as Non-Compliant, you are ready to perform the upgrade. From the Hosts and Clusters view, select the host/cluster, select the Update Manager tab and select Remediate. You will get a pop-up asking if you want to install patches, upgrade, or do both. Choose the upgrade option and follow the wizard to complete the remediation.

Assuming that DRS is enabled and running in fully automated mode, Update Manager will proceed to place the host into maintenance mode (if not already in maintenance mode) and perform the upgrade. If DRS is not enabled, you must evacuate the virtual machines off the host and put it into maintenance mode before remediating.

After the upgrade, the host will reboot and Update Manager will take it out of maintenance mode and return the host into operation.

Using Update Manager to Upgrade an Entire Cluster

You can use Update Manager to remediate an individual host or an entire cluster. If you choose to remediate an entire cluster, Update Manager will “roll” the upgrade through the cluster, upgrading each host in turn. You have flexibility in determining how Update Manager will treat the virtual machines during the upgrade. You can choose to either power them off or use vMotion to migrate them to another host. If you chose to power off the virtual machines, Update Manager will first power off all the virtual machines in the cluster and then proceed to upgrade the entire cluster in parallel. If you choose to migrate the virtual machines, Update Manager will evacuate as many hosts as it can (keeping within the HA admission control constraints) and upgrade the evacuated hosts in parallel. Then, after they are upgraded, it will move on to the next set of hosts.

Rolling Back from a Failed Update Manager Upgrade

During the upgrade, the files on the boot disk are overwritten. This prevents any kind of automated rollback if problems arise. To restore a host to its pre-upgrade state, reinstall the ESX/ESXi 4.x software and restore the host configuration from the backup.

Upgrading Using the ESXi Installer

Requirements

As a reminder, the following requirements must be met to perform an upgrade using the ESXi Installer:

• Perform a full backup of the ESX/ESXi host.
• Ensure that you have 50MB of free space on the boot disk VMFS datastore.
• Ensure that the VMFS partition begins beyond the 1GB mark (starts after sector 1843200).
• Use vMotion/Storage vMotion to migrate all virtual machines off the host (alternatively, power the virtual machines down).

Placing the Host into Maintenance Mode

Use vMotion/Storage vMotion to evacuate all virtual machines off the host and put the host into maintenance mode. If DRS is enabled in fully automated mode, the virtual machines on shared storage will be automatically migrated when the host is put into maintenance mode. Alternatively, you can power off any virtual machines running on the host.

Booting Off the ESXi 5.0 Installation Media

Connect to the host console and boot the host off the ESXi 5.0 installation media. From the boot menu, select the option to boot from the ESXi Installer.

Selecting Option to Migrate and Preserving the VMFS Datastore

When an existing ESX/ESXi 4.x installation is detected, the ESXi Installer will prompt to both migrate (upgrade) the host and preserve the existing VMFS datastore, or to do a fresh install (with options to preserve or overwrite the VMFS datastore). Select the Migrate ESX, preserve VMFS datastore option.

Third-Party Software Warning

If third-party software components are detected, a warning is displayed indicating that these components will be lost.

If the identified software components are required, ensure either that they are included with the ESXi installation media (use Image Builder CLI to add third-party software packages to the install media) or that you reinstall them after the upgrade. Press Enter to continue the install or Escape to cancel.

Confirming the Upgrade

The system is then scanned in preparation for the upgrade. When the scan completes, the user is asked to confirm the upgrade by pressing the F11 key.

The ESXi Installer will then proceed to upgrade the host to ESXi 5.0. After the installation, the user will be asked to reboot the host.

Then reconnect the host and exit maintenance mode.

Post-Upgrade Considerations

Configuring the VMware ESXi 5.0 Dump Collector

A core dump is the state of working memory in the event of host failure. By default, an ESXi core dump is saved to the local boot disk. Use the VMware ESXi Dump Collector to consolidate core dumps onto a network server to ensure that they are available for use if debugging is required. You can install the ESXi Dump Collector on the vCenter Server or on a separate Windows server that has a network connection to the vCenter Server. Refer to the vSphere Installation and Setup Guide for more information on setting up the ESXi Dump Collector.

Configuring the ESXi 5.0 Syslog Collector

Install the vSphere Syslog Collector to enable ESXi system logs to be directed to a network server rather than to the local disk. You can install the Syslog Collector on the vCenter Server or on a separate Windows server that has a network connection to the vCenter Server. Refer to the vSphere Installation and Setup Guide for more information on setting up the ESXi Syslog Collector.

Configuring a Remote Management Host

Most ESXi host administration will be done through the vCenter Server, using the vSphere Client. There also will be occasions when remote command-line access is beneficial, such as for scripting, troubleshooting and some advanced configuration tuning. ESXi provides a rich set of APIs that are accessible using VMware vSphere® Command Line Interface (vCLI) and Windows-based VMware vSphere® PowerCLI.

Upgrading Virtual Machines

After you perform an upgrade, you must determine if you will also upgrade the virtual machines that reside on the upgraded hosts. Upgrading virtual machines ensures that they remain compatible with the upgraded host software and can take advantage of new features. Upgrading your virtual machines entails upgrading the version of VMware Tools as well as the virtual machine’s virtual hardware version.

VMware Tools

The first step in upgrading virtual machines is to upgrade VMware Tools.

vSphere 5.0 supports virtual machines running both VMware Tools version 4.x and 5.0. Running virtual machines with VMware Tools version 5.0 on older ESX/ESXi 4.x hosts is also supported.

Therefore, virtual machines running VMware Tools 4.x or higher do not require upgrading following the ESXi host upgrade. However, only the upgraded virtual machines will benefit from the new features and latest performance benefits associated with the most recent version of VMware Tools.

Virtual Hardware

The second step in upgrading virtual machines is to upgrade the virtual hardware version. Before upgrading the virtual hardware, you must first upgrade the VMware Tools. The hardware version of a virtual machine reflects the virtual machine’s supported virtual hardware features. These features correspond to the physical hardware available on the ESXi host on which you create the virtual machine. Virtual hardware features include BIOS and EFI, available virtual PCI slots, maximum number of CPUs, maximum memory configuration, and other characteristics typical to hardware. One important consideration when upgrading the virtual hardware is that virtual machines running the latest virtual hardware version (version 8) can run only on ESXi 5.0 hosts. Do not upgrade the virtual hardware for virtual machines running in a mixed cluster made up of ESX/ESXi 4.x hosts and ESXi 5.0 hosts. Only upgrade a virtual machine’s virtual hardware version after all the hosts in the cluster have been upgraded to ESXi 5.0. Upgrading the virtual machine’s virtual hardware version is a one-‐way operation. There is no option to reverse the upgrade after it is done.
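The ordering rules above (VMware Tools first, and no virtual hardware upgrade while any ESX/ESXi 4.x host remains in the cluster) can be expressed as a guard. This is a minimal sketch with hypothetical names; version strings such as "4.1" or "5.0" are assumed to be supplied by the caller, and treating hosts newer than 5.0 as acceptable is an assumption of this example.

```python
from typing import Iterable


def can_upgrade_virtual_hardware(host_versions: Iterable[str],
                                 tools_upgraded: bool) -> bool:
    """Return True only when upgrading to hardware version 8 is safe.

    host_versions: version strings of every host in the cluster.
    tools_upgraded: whether VMware Tools has already been upgraded in the VM.
    """
    versions = list(host_versions)
    if not versions:
        # No host information means the check cannot pass.
        return False
    # Every host must be at major version 5 or later.
    all_hosts_ready = all(int(v.split(".")[0]) >= 5 for v in versions)
    return tools_upgraded and all_hosts_ready
```

Because the hardware upgrade cannot be reversed, a guard like this is best run immediately before the change rather than at planning time.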

Orchestrated Upgrade of VMware Tools and Virtual Hardware

An orchestrated upgrade enables you to upgrade both the VMware Tools and the virtual hardware of the virtual machines in your vSphere inventory at the same time. Use Update Manager to perform an orchestrated upgrade.

You can perform an orchestrated upgrade of virtual machines at the folder or datacenter level. Update Manager makes the process of upgrading the virtual machines convenient by providing baseline groups. When you remediate a virtual machine against a baseline group containing the “VMware Tools Upgrade to Match Host” baseline and the “VM Hardware Upgrade to Match Host” baseline, Update Manager sequences the upgrade operations in the correct order. As a result, the guest operating system is in a consistent state at the end of the upgrade.

Upgrading VMware vSphere VMFS

After you perform an ESX/ESXi upgrade, you might need to upgrade your VMFS to take advantage of the new features. vSphere 5.0 supports both VMFS version 3 and version 5, so it is not necessary to upgrade your VMFS volumes unless you need to leverage new 5.0 features. However, VMFS-5 offers a variety of new features such as larger single-extent volumes (approximately 60TB), larger VMDKs with unified 1MB block size (2TB), smaller subblocks (8KB) to reduce the amount of stranded/unused space, and an improvement in performance and scalability via the implementation of the vSphere Storage API for Array Integration (VAAI) primitive Atomic Test & Set (ATS) across all datastore operations. VMware recommends that customers move to VMFS-5 to benefit from these features. A complete set of VMFS-5 enhancements can be found in the What’s New in vSphere 5.0 Storage white paper.

Considerations – Upgrade to VMFS-5 or Create New VMFS-5

Although a VMFS-3 that is upgraded to VMFS-5 provides you with most of the same capabilities as a newly created VMFS-5, there are some differences. Both upgraded and newly created VMFS-5 support single-extent volumes up to approximately 60TB and both support VMDK sizes of 2TB, no matter what the VMFS file block size is. However, the additional differences, although minor, should be considered when making a decision on upgrading to VMFS-5 or creating new VMFS-5 volumes.

• VMFS-5 upgraded from VMFS-3 continues to use the previous file block size, which might be larger than the unified 1MB file block size. This can lead to stranded/unused disk space when there are many small files on the datastore.

• VMFS-5 upgraded from VMFS-3 continues to use 64KB subblocks, not the new 8KB subblocks. This can also lead to stranded/unused disk space.

• VMFS-5 upgraded from VMFS-3 continues to have a file limit of 30720 rather than the new file limit of >100000 for a newly created VMFS-5. This has an impact on the scalability of the file system.
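To make the stranded-space point concrete, the sketch below compares how much space a small file occupies under the two layouts. It uses a deliberately simplified allocation model (a file that fits in one subblock uses one subblock, anything larger is rounded up to whole file blocks), and the 8MB file block size for the upgraded volume is an assumed example value. Real VMFS allocation has further details that this model ignores.

```python
KB = 1024
MB = 1024 * KB


def allocated_bytes(file_size: int, block_size: int, subblock_size: int) -> int:
    """Space consumed by one file under the simplified model described above."""
    if file_size <= 0:
        return 0
    if file_size <= subblock_size:
        return subblock_size
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


# Upgraded volume: keeps its old file block size (8MB assumed) and 64KB subblocks.
upgraded = dict(block_size=8 * MB, subblock_size=64 * KB)
# Newly created VMFS-5: unified 1MB file blocks and 8KB subblocks.
fresh = dict(block_size=1 * MB, subblock_size=8 * KB)

if __name__ == "__main__":
    for size in (4 * KB, 100 * KB):
        print(size // KB, "KB file:",
              allocated_bytes(size, **upgraded) // KB, "KB on upgraded,",
              allocated_bytes(size, **fresh) // KB, "KB on newly created")
```

Under these assumptions a 4KB file strands 60KB on the upgraded volume but only 4KB on a newly created one, which is the effect the bullets above describe.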

For these reasons, VMware recommends using newly created VMFS-5 volumes if you have the luxury of doing so. You can then migrate the virtual machines from the original VMFS-3 to VMFS-5. If you do not have the available space to create new VMFS-5 volumes, upgrading VMFS-3 to VMFS-5 will still provide you with most of the benefits that come with a newly created VMFS-5.

Online Upgrade

If you do decide to upgrade VMFS-3 to VMFS-5, it is a simple, single-click operation. After you have upgraded the host to ESXi 5.0, go to the Configuration tab > Storage view. Select the VMFS-3 datastore. Above the Datastore Details window, an option to Upgrade to VMFS-5... will be displayed.

The upgrade process is online and non-disruptive. Virtual machines can continue to run on the datastore while it is being upgraded. Upgrading VMFS is a one-way operation. There is no option to reverse the upgrade after it is done. Also, after a file system has been upgraded, it will no longer be accessible by older ESX/ESXi 4.x hosts, so you must ensure that all hosts accessing the datastore are running ESXi 5.0. In fact, there are checks built in to vSphere that will prevent you from upgrading to VMFS-5 if any of the hosts accessing the datastore are running a version of ESX/ESXi that is older than 5.0.

As with any upgrade, VMware recommends that a backup of your virtual machines be made prior to upgrading your VMFS-3 to VMFS-5.

After the VMFS-5 volume is in place, the size can be extended to approximately 60TB, even if it is a single extent, and 2TB virtual machine disks (VMDKs) can be created, no matter what the underlying file block size. These features are available “out of the box,” without any additional configuration steps.

Refer to the vSphere Upgrade Guide for more information on features that require VMFS version 5, the differences between VMFS versions 3 and 5, and how to upgrade.

The following table provides a matrix showing the supported VMware Tools, virtual hardware and VMFS versions in ESXi 5.0.

9. Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs

This section summarizes our findings and recommends best practices to tune the different layers of an application’s environment for similar latency-sensitive workloads. By latency-sensitive, we mean workloads that require optimizing for a few microseconds to a few tens of microseconds end-to-end latencies; we don’t mean workloads in the hundreds of microseconds to tens of milliseconds end-to-end latencies. In fact, many of the recommendations in this paper that can help with the microsecond level latency can actually end up hurting the performance of applications that are tolerant of higher latency.

Please note that the exact benefits and effects of each of these configuration choices will be highly dependent upon the specific applications and workloads, so we strongly recommend experimenting with the different configuration options with your workload before deploying them in a production environment.

BIOS Settings

Most servers with new Intel and AMD processors provide power savings features that use several techniques to dynamically detect the load on a system and put various components of the server, including the CPU, chipsets, and peripheral devices into low power states when the system is mostly idle.

There are two parts to power management on ESXi platforms:

1. The BIOS settings for power management, which influence what the BIOS advertises to the OS/hypervisor about whether it should be managing power states of the host or not.

2. The OS/hypervisor settings for power management, which influence the policies of what to do when it detects that the system is idle.

For latency-sensitive applications, any form of power management adds latency to the path where an idle system (in one of several power savings modes) responds to an external event. So our recommendation is to set the BIOS setting for power management to “static high,” that is, no OS-controlled power management, effectively disabling any form of active power management. Note that achieving the lowest possible latency and saving power on the hosts and running the hosts cooler are fundamentally at odds with each other, so we recommend carefully evaluating the trade-offs of disabling any form of power management in order to achieve the lowest possible latencies for your application’s needs.

Servers with Intel Nehalem class and newer (Intel Xeon 55xx and newer) CPUs also offer two other power management options: C-states and Intel Turbo Boost. Leaving C-states enabled can increase memory latency and is therefore not recommended for low-latency workloads. Even the enhanced C-state known as C1E introduces longer latencies to wake up the CPUs from halt (idle) states to full-power, so disabling C1E in the BIOS can further lower latencies. Intel Turbo Boost, on the other hand, will step up the internal frequency of the processor should the workload demand more power, and should be left enabled for low-latency, high-performance workloads. However, since Turbo Boost can over-clock portions of the CPU, it should be left disabled if the applications require stable, predictable performance and low latency with minimal jitter.

How power management–related settings are changed depends on the OEM make and model of the server. For example, for HP ProLiant servers:

• Set the Power Regulator Mode to Static High Mode.
• Disable Processor C-State Support.
• Disable Processor C1E Support.
• Disable QPI Power Management.
• Enable Intel Turbo Boost.

For Dell PowerEdge servers:

• Set the Power Management Mode to Maximum Performance.
• Set the CPU Power and Performance Management Mode to Maximum Performance.
• Processor Settings: set Turbo Mode to enabled.
• Processor Settings: set C States to disabled.

NUMA

The high latency of accessing remote memory in NUMA (Non-Uniform Memory Access) architecture servers can add a non-trivial amount of latency to application performance. ESXi uses a sophisticated, NUMA-aware scheduler to dynamically balance processor load and memory locality.

For best performance of latency-sensitive applications in guest OSes, all vCPUs should be scheduled on the same NUMA node and all VM memory should fit and be allocated out of the local physical memory attached to that NUMA node.
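The sizing rule above can be checked with a few lines of arithmetic. The sketch below is illustrative: the function and parameter names are hypothetical, and it assumes the caller already knows the per-node core count and the memory actually available per node (which is lower than the installed amount once hypervisor overhead is accounted for).

```python
def fits_single_numa_node(vcpus: int, vm_memory_gb: float,
                          cores_per_node: int,
                          memory_per_node_gb: float) -> bool:
    """Return True if the VM's vCPUs and memory both fit inside one NUMA node."""
    return vcpus <= cores_per_node and vm_memory_gb <= memory_per_node_gb


if __name__ == "__main__":
    # Example host: 2 sockets, 8 cores and 64GB usable per node (assumed values).
    print(fits_single_numa_node(4, 32, cores_per_node=8, memory_per_node_gb=64))
    print(fits_single_numa_node(12, 32, cores_per_node=8, memory_per_node_gb=64))
```

A VM that fails this check spans nodes, so some of its memory accesses will be remote regardless of how the scheduler places it.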

Processor affinity for vCPUs to be scheduled on specific NUMA nodes, as well as memory affinity for all VM memory to be allocated from those NUMA nodes, can be set using the vSphere Client under VM Settings -> Options tab -> Advanced General -> Configuration Parameters by adding an entry for “numa.nodeAffinity=0, 1, ...,” where 0, 1, etc. are the processor socket numbers.

Note that when you constrain NUMA node affinities, you might interfere with the ability of the NUMA scheduler to rebalance virtual machines across NUMA nodes for fairness. Specify NUMA node affinity only after you consider the rebalancing issues. Note also that when a VM is migrated (for example, using vMotion) to another host with a different NUMA topology, these advanced settings may not be optimal on the new host and could lead to sub-optimal performance of your application on the new host. You will need to re-tune these advanced settings for the NUMA topology for the new host.

ESXi 5.0 and newer also support vNUMA where the underlying physical host’s NUMA architecture can be exposed to the guest OS by providing certain ACPI BIOS tables for the guest OS to consume. Exposing the physical host’s NUMA topology to the VM helps the guest OS kernel make better scheduling and placement decisions for applications to minimize memory access latencies.

vNUMA is automatically enabled for VMs configured with more than 8 vCPUs that are wider than the number of cores per physical NUMA node. For certain latency-sensitive workloads running on physical hosts with fewer than 8 cores per physical NUMA node, enabling vNUMA may be beneficial. This is achieved by adding an entry for "numa.vcpu.min = N", where N is less than the number of vCPUs in the VM, in the vSphere Client under VM Settings -> Options tab -> Advanced General -> Configuration Parameters.

To learn more about this topic, please refer to the NUMA sections in the "vSphere Resource Management Guide" and the white paper explaining the vSphere CPU Scheduler: http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf

Choice of Guest OS

Certain older guest OSes like RHEL5 incur higher virtualization overhead for various reasons, such as frequent accesses to virtual PCI devices for interrupt handling, frequent accesses to the virtual APIC (Advanced Programmable Interrupt Controller) for interrupt handling, high virtualization overhead when reading the current time, inefficient mechanisms to idle, and so on.

Moving to a more modern guest OS (like SLES11 SP1 or RHEL6 based on 2.6.32 Linux kernels, or Windows Server 2008 or newer) minimizes these virtualization overheads significantly. For example, RHEL6 is based on a “tickless” kernel, which means that it doesn’t rely on high-frequency timer interrupts at all. For a mostly idle VM, this saves the power consumed when the guest wakes up for periodic timer interrupts, finds out there is no real work to do, and goes back to an idle state.

Note however, that tickless kernels like RHEL6 can incur higher overheads in certain latency-sensitive workloads because the kernel programs one-shot timers every time it wakes up from idle to handle an interrupt, while the legacy periodic timers are pre-programmed and don’t have to be programmed every time the guest OS wakes up from idle. To override tickless mode and fall back to the legacy periodic timer mode for such modern versions of Linux, pass the nohz=off kernel boot-time parameter to the guest OS.

These newer guest OSes also have better support for MSI-X (Message Signaled Interrupts), which is more efficient than legacy INTx-style APIC-based interrupts for interrupt delivery and acknowledgement from the guest OSes.

Since there is a certain overhead when reading the current time, due to overhead in virtualizing various timer mechanisms, we recommend minimizing the frequency of reading the current time (using gettimeofday() or currentTimeMillis() calls) in your guest OS, either via the latency-sensitive application doing so directly, or via some other software component in the guest OS doing this. The overhead in reading the current time was especially high in Linux versions older than RHEL 5.4, due to the underlying timer device they relied on as their time source and the overhead in virtualizing them. Versions of Linux after RHEL 5.4 incur significantly lower overhead when reading the current time.
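One common way for an application to read the clock less often is to cache a timestamp and refresh it only periodically. The sketch below illustrates that idea; it is not taken from the white paper, the class name and refresh policy are assumptions, and the trade-off is that returned timestamps can be stale by up to one refresh interval.

```python
import time
from typing import Callable


class CoarseClock:
    """Returns a cached timestamp, re-reading the real clock every N calls."""

    def __init__(self, refresh_every: int = 100,
                 source: Callable[[], float] = time.time) -> None:
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self._refresh_every = refresh_every
        self._source = source
        self._calls_since_refresh = 0
        self._cached = source()
        self.real_reads = 1  # number of times the underlying clock was read

    def now(self) -> float:
        """Return the cached time, refreshing it on every N-th call."""
        self._calls_since_refresh += 1
        if self._calls_since_refresh >= self._refresh_every:
            self._cached = self._source()
            self._calls_since_refresh = 0
            self.real_reads += 1
        return self._cached
```

With refresh_every=100, a loop that stamps one million messages reads the underlying clock about ten thousand times instead of one million; whether the reduced precision is acceptable depends on the application.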

To learn more about best practices for time keeping in Linux guests, please see the VMware KB 1006427: http://kb.vmware.com/kb/1006427. To learn more about how timekeeping works in VMware VMs, please read http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf.

Physical NIC Settings

Most 1GbE or 10GbE NICs (Network Interface Cards) support a feature called interrupt moderation or interrupt throttling, which coalesces interrupts from the NIC to the host so that the host doesn’t get overwhelmed and spend all its CPU cycles processing interrupts.

However, for latency-sensitive workloads, the time the NIC is delaying the delivery of an interrupt for a received packet or a packet that has successfully been sent on the wire is the time that increases the latency of the workload.

Most NICs also provide a mechanism, usually via the ethtool command and/or module parameters, to disable interrupt moderation. Our recommendation is to disable physical NIC interrupt moderation on the ESXi host as follows:

# esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"

This example applies to the Intel 10GbE driver called ixgbe. You can find the appropriate module parameter for your NIC by first finding the driver using the ESXi command:

# esxcli network nic list

Then find the list of module parameters for the driver used:

# esxcli system module parameters list -m <driver>

Note that while disabling interrupt moderation on physical NICs is extremely helpful in reducing latency for latency-sensitive VMs, it can lead to some performance penalties for other VMs on the ESXi host, as well as higher CPU utilization to handle the higher rate of interrupts from the physical NIC.

Disabling physical NIC interrupt moderation can also defeat the benefits of Large Receive Offloads (LRO), since some physical NICs (like Intel 10GbE NICs) that support LRO in hardware automatically disable it when interrupt moderation is disabled, and ESXi’s implementation of software LRO has fewer packets to coalesce into larger packets on every interrupt. LRO is an important offload for driving high throughput for large-message transfers at reduced CPU cost, so this trade-off should be considered carefully.

Virtual NIC Settings

ESXi VMs can be configured to have one of the following types of virtual NICs (http://kb.vmware.com/kb/1001805): Vlance, VMXNET, Flexible, E1000, VMXNET 2 (Enhanced), or VMXNET 3.

We recommend you choose VMXNET 3 virtual NICs for your latency-sensitive or otherwise performance-critical VMs. VMXNET 3 is the latest generation of our paravirtualized NICs designed from the ground up for performance, and is not related to VMXNET or VMXNET 2 in any way. It offers several advanced features including multi-queue support (Receive Side Scaling), IPv4/IPv6 offloads, and MSI/MSI-X interrupt delivery. Modern enterprise Linux distributions based on 2.6.32 or newer kernels, like RHEL6 and SLES11 SP1, ship with out-of-the-box support for VMXNET 3 NICs.

VMXNET 3 by default also supports an adaptive interrupt coalescing algorithm, for the same reasons that physical NICs implement interrupt moderation. This virtual interrupt coalescing helps drive high throughputs to VMs with multiple vCPUs with parallelized workloads (for example, multiple threads), while at the same time striving to minimize the latency of virtual interrupt delivery.

However, if your workload is extremely sensitive to latency, then we recommend you disable virtual interrupt coalescing for VMXNET 3 virtual NICs as follows.

To do so through the vSphere Client, go to VM Settings -> Options tab -> Advanced General -> Configuration Parameters and add an entry for ethernetX.coalescingScheme with the value of disabled.

Please note that this new configuration option is only available in ESXi 5.0 and later. An alternative way to disable virtual interrupt coalescing for all virtual NICs on the host, which affects all VMs and not just the latency-sensitive ones, is to set the advanced networking performance option CoalesceDefaultOn (Configuration -> Advanced Settings -> Net) to 0 (disabled). See http://communities.vmware.com/docs/DOC-10892 for details.

Another feature of VMXNET 3 that helps deliver high throughput with lower CPU utilization is Large Receive Offload (LRO), which aggregates multiple received TCP segments into a larger TCP segment before delivering it up to the guest TCP stack. However, for latency-sensitive applications that rely on TCP, the time spent aggregating smaller TCP segments into a larger one adds latency. It can also affect TCP algorithms like delayed ACK, which now causes the TCP stack to delay an ACK until the two larger TCP segments are received, also adding to end-to-end latency of the application.

Therefore, you should also consider disabling LRO if your latency-sensitive application relies on TCP. To do so for Linux guests, you need to reload the vmxnet3 driver in the guest:

# modprobe -r vmxnet3

Add the following line in /etc/modprobe.conf (Linux version dependent):

options vmxnet3 disable_lro=1

Then reload the driver using:

# modprobe vmxnet3

VM Settings

If your application is multi-threaded or consists of multiple processes that could benefit from using multiple CPUs, you can add more virtual CPUs (vCPUs) to your VM. However, for latency-sensitive applications, you should not overcommit vCPUs as compared to the number of pCPUs (processors) on your ESXi host. For example, if your host has 8 CPU cores, limit your number of vCPUs for your VM to 7. This will ensure that the ESXi VMkernel scheduler has a better chance of placing your vCPUs on pCPUs which won’t be contended by other scheduling contexts, like vCPUs from other VMs or ESXi helper worlds.
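The 8-core example can be written as a small sizing check. This is only a sketch: the function names are hypothetical, and the rule of leaving exactly one core free is an assumption generalized from the single example above, not a documented formula.

```python
def max_vcpus_for_latency_vm(host_cores, reserved_cores=1):
    """Largest vCPU count that still leaves `reserved_cores` pCPUs free.

    Assumption: one spare core (the default) is enough for other scheduling
    contexts; the text only gives the 8 -> 7 example.
    """
    if host_cores <= reserved_cores:
        raise ValueError("host has no cores left after the reservation")
    return host_cores - reserved_cores


def is_overcommitted(total_vcpus, host_cores):
    """True when the vCPUs of all VMs on the host exceed its pCPUs."""
    return total_vcpus > host_cores


if __name__ == "__main__":
    print(max_vcpus_for_latency_vm(8))
    print(is_overcommitted(total_vcpus=12, host_cores=8))
```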

If your application needs a large amount of physical memory when running unvirtualized, consider configuring your VM with a lot of memory as well, but again, try to refrain from overcommitting the amount of physical memory in the system. You can look at the memory statistics in the vSphere Client under the host’s Resource Allocation tab under Memory -> Available Capacity to see how much memory you can configure for the VM after all the virtualization overheads are accounted for.

If you want to ensure that the VMkernel does not deschedule your VM when the vCPU is idle (most systems generally have brief periods of idle time, unless you’re running an application which has a tight loop executing CPU instructions without taking a break or yielding the CPU), you can add the following configuration option. Go to VM Settings -> Options tab -> Advanced -> General -> Configuration Parameters and add monitor_control.halt_desched with the value of false.

Note that this option should be considered carefully, because it effectively forces the vCPU to consume all of its allocated pCPU time: when that vCPU in the VM idles, the VM Monitor will spin on the CPU without yielding the CPU to the VMkernel scheduler, until the vCPU needs to run in the VM again. However, for extremely latency-sensitive VMs which cannot tolerate the latency of being descheduled and scheduled, this option has been seen to help.

A slightly more power-conserving approach, which still results in lower latencies when the guest needs to be woken up soon after it idles, is to set the following advanced configuration parameters (see also http://kb.vmware.com/kb/1018276):

• For > 1 vCPU VMs, set monitor.idleLoopSpinBeforeHalt to true

• For 1 vCPU VMs, set monitor.idleLoopSpinBeforeHaltUP to true

This option will cause the VM Monitor to spin for a small period of time (by default 100 us, configurable through monitor.idleLoopMinSpinUS) before yielding the CPU to the VMkernel scheduler, which may then idle the CPU if there is no other work to do.
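Which of the two parameters applies depends only on the vCPU count. A minimal sketch of that choice follows; the function name is hypothetical, and writing the values as the strings "true" and a decimal number is an assumption about how the entries are typed into Configuration Parameters.

```python
def idle_spin_params(vcpu_count, min_spin_us=None):
    """Pick the idle-loop spin parameter for a VM with `vcpu_count` vCPUs.

    `min_spin_us` optionally overrides the spin time (the text gives a
    default of 100 us); leave it as None to keep the default.
    """
    if vcpu_count < 1:
        raise ValueError("a VM needs at least one vCPU")
    if vcpu_count == 1:
        key = "monitor.idleLoopSpinBeforeHaltUP"
    else:
        key = "monitor.idleLoopSpinBeforeHalt"
    params = {key: "true"}
    if min_spin_us is not None:
        params["monitor.idleLoopMinSpinUS"] = str(min_spin_us)
    return params


if __name__ == "__main__":
    print(idle_spin_params(1))
    print(idle_spin_params(4, min_spin_us=50))
```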

New in vSphere 5.5 is a VM option called Latency Sensitivity, which defaults to Normal. Setting this to High can yield significantly lower latencies and jitter, as a result of the following mechanisms that take effect in ESXi:

• Exclusive access to physical resources, including pCPUs dedicated to vCPUs with no contending threads for executing on these pCPUs.

• Full memory reservation eliminates ballooning or hypervisor swapping leading to more predictable performance with no latency overheads due to such mechanisms.

• Halting in the VM Monitor when the vCPU is idle, leading to faster vCPU wake-up from halt, and bypassing the VMkernel scheduler for yielding the pCPU. This also conserves power as halting makes the pCPU enter a low power mode, compared to spinning in the VM Monitor with the monitor_control.halt_desched=FALSE option.

• Disabling interrupt coalescing and LRO automatically for VMXNET 3 virtual NICs.

• Optimized interrupt delivery path for VM DirectPath I/O and SR-IOV passthrough devices, using heuristics to derive hints from the guest OS about optimal placement of physical interrupt vectors on physical CPUs.

To learn more about this topic, please refer to the technical whitepaper: http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

Polling Versus Interrupts

For applications or workloads that are allowed to use more CPU resources in order to achieve the lowest possible latency, polling in the guest for I/O to complete, instead of relying on the device delivering an interrupt to the guest OS, could help. Traditional interrupt-based I/O processing incurs additional overheads at various levels, including interrupt handlers in the guest OS, accesses to the interrupt subsystem (APIC, devices) that incur emulation overhead, and deferred interrupt processing in guest OSes (Linux bottom halves/NAPI poll, Windows DPC), which hurts latency to the applications.

With polling, the driver and/or the application in the guest OS will spin waiting for I/O to be available and can immediately indicate the completed I/O up to the application waiting for it, thereby delivering lower latencies. However, this approach consumes more CPU resources, and therefore more power, and hence should be considered carefully.

Note that this approach is different from what the idle=poll kernel parameter for Linux guests achieves. This approach requires writing a poll-mode device driver for the I/O device involved in your low latency application, which constantly polls the device (for example, looking at the receive ring for data to have been posted by the device) and indicates the data up the protocol stack immediately to the latency-sensitive application waiting for the data.

Guest OS Tips and Tricks

If your application uses Java, then one of the most important optimizations we recommend is to configure both the guest OS and Java to use large pages. Add the following command-line option when launching Java:

-XX:+UseLargePages

For other important guidelines when tuning Java applications running in VMware VMs, please refer to http://www.vmware.com/resources/techresources/1087.

Another source of latency for networking I/O can be guest firewall rules like Linux iptables. If your security policy for your VM can allow for it, consider stopping the guest firewall.

Similarly, security infrastructure like SELinux can also add to application latency, since it intercepts every system call to do additional security checks. Consider disabling SELinux if your security policy can allow for that.

10. Performance Best Practices for VMware vSphere 5.0

Validate Your Hardware

Before deploying a system we recommend the following:

• Verify that all hardware in the system is on the hardware compatibility list for the specific version of VMware software you will be running.

• Make sure that your hardware meets the minimum configuration supported by the VMware software you will be running.

• Test system memory for 72 hours, checking for hardware errors.

Hardware CPU Considerations

This section provides guidance regarding CPUs for use with vSphere 5.0.

General CPU Considerations

• When selecting hardware, it is a good idea to consider CPU compatibility for VMware vMotion™ (which in turn affects DRS) and VMware Fault Tolerance. See “VMware vMotion and Storage vMotion” on page 51, “VMware Distributed Resource Scheduler (DRS)” on page 52, and “VMware Fault Tolerance” on page 59.

Hardware-Assisted Virtualization

Most recent processors from both Intel and AMD include hardware features to assist virtualization. These features were released in two generations:

• The first generation introduced CPU virtualization

• The second generation added memory management unit (MMU) virtualization

For the best performance, make sure your system uses processors with second-generation hardware-assist features.

• Hardware-Assisted CPU Virtualization (VT-x and AMD-V): The first generation of hardware virtualization assistance, VT-x from Intel and AMD-V from AMD, became available in 2006. These technologies automatically trap sensitive events and instructions, eliminating the overhead required to do so in software. This allows the use of a hardware virtualization (HV) virtual machine monitor (VMM) as opposed to a binary translation (BT) VMM.

• Hardware-Assisted MMU Virtualization (Intel EPT and AMD RVI): More recent processors also include second generation hardware virtualization assistance that addresses the overheads due to memory management unit (MMU) virtualization by providing hardware support to virtualize the MMU. ESXi supports this feature both in AMD processors, where it is called rapid virtualization indexing (RVI) or nested page tables (NPT), and in Intel processors, where it is called extended page tables (EPT).

• Hardware-assisted MMU virtualization allows an additional level of page tables that map guest physical memory to host physical memory addresses, eliminating the need for ESXi to maintain shadow page tables. This reduces memory consumption and speeds up workloads that cause guest operating systems to frequently modify page tables. While hardware-assisted MMU virtualization improves the performance of the vast majority of workloads, it does increase the time required to service a TLB miss, thus potentially reducing the performance of workloads that stress the TLB.

• Hardware-Assisted I/O MMU Virtualization (VT-d and AMD-Vi): An even newer processor feature is an I/O memory management unit that remaps I/O DMA transfers and device interrupts. This can allow virtual machines to have direct access to hardware I/O devices, such as network cards, storage controllers (HBAs) and GPUs. In AMD processors this feature is called AMD I/O Virtualization (AMD-Vi or IOMMU) and in Intel processors the feature is called Intel Virtualization Technology for Directed I/O (VT-d).

Hardware Storage Considerations

Storage performance is a vast topic that depends on workload, hardware, vendor, RAID level, cache size, stripe size, and so on. Consult the appropriate documentation from VMware as well as the storage vendor.

Many workloads are very sensitive to the latency of I/O operations. It is therefore important to have storage devices configured correctly. VMware Storage vMotion performance is heavily dependent on the available storage infrastructure bandwidth.

Consider choosing storage hardware that supports VMware vStorage APIs for Array Integration (VAAI). VAAI can improve storage scalability by offloading some operations to the storage hardware instead of performing them in ESXi. On SANs, VAAI offers the following features:

• Hardware-accelerated cloning (sometimes called “full copy” or “copy offload”) frees resources on the host and can speed up workloads that rely on cloning, such as Storage vMotion.

• Block zeroing speeds up creation of eager-zeroed thick disks and can improve first-time write performance on lazy-zeroed thick disks and on thin disks.

• Scalable lock management (sometimes called “atomic test and set,” or ATS) can reduce locking-related overheads, speeding up thin-disk expansion as well as many other administrative and file system-intensive tasks. This helps improve the scalability of very large deployments by speeding up provisioning operations like boot storms, expansion of thin disks, snapshots, and other tasks.

• Thin provision UNMAP allows ESXi to return no-longer-needed thin-provisioned disk space to the storage hardware for reuse.

On NAS devices, VAAI offers the following features:

• Hardware-accelerated cloning (sometimes called “full copy” or “copy offload”) frees resources on the host and can speed up workloads that rely on cloning. (Note that Storage vMotion does not make use of this feature on NAS devices.)

• Space reservation allows ESXi to fully pre-allocate space for a virtual disk at the time the virtual disk is created. Thus, in addition to the thin provisioning and eager-zeroed thick provisioning options that non-VAAI NAS devices support, VAAI NAS devices also support lazy-zeroed thick provisioning.

Though the degree of improvement is dependent on the storage hardware, VAAI can reduce storage latency for several types of storage operations, can reduce the ESXi host CPU utilization for storage operations, and can reduce storage network traffic.

• Performance design for a storage network must take into account the physical constraints of the network, not logical allocations. Using VLANs or VPNs does not provide a suitable solution to the problem of link oversubscription in shared configurations. VLANs and other virtual partitioning of a network provide a way of logically configuring a network, but don't change the physical capabilities of links and trunks between switches.

• If you have heavy disk I/O loads, you might need to assign separate storage processors (SPs) to separate systems to handle the amount of traffic bound for storage.

• To optimize storage array performance, spread I/O loads over the available paths to the storage (that is, across multiple host bus adapters (HBAs) and storage processors).

• Make sure that end-to-end Fibre Channel speeds are consistent to help avoid performance problems. For more information, see KB article 1006602.

• Configure maximum queue depth for Fibre Channel HBA cards. For additional information see VMware KB article 1267.

• Applications or systems that write large amounts of data to storage, such as data acquisition or transaction logging systems, should not share Ethernet links to a storage device with other applications or systems. These types of applications perform best with dedicated connections to storage devices.

• For iSCSI and NFS, make sure that your network topology does not contain Ethernet bottlenecks, where multiple links are routed through fewer links, potentially resulting in oversubscription and dropped network packets. Any time a number of links transmitting near capacity are switched to a smaller number of links, such oversubscription is a possibility. Recovering from these dropped network packets results in large performance degradation. In addition to time spent determining that data was dropped, the retransmission uses network bandwidth that could otherwise be used for new transactions.

• For iSCSI and NFS, if the network switch deployed for the data path supports VLAN, it might be beneficial to create a VLAN just for the ESXi host's vmknic and the iSCSI/NFS server. This minimizes network interference from other packet sources.
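The oversubscription described above can be estimated as the ratio of aggregate bandwidth entering a switch to the aggregate bandwidth of the links that carry it onward. The sketch below is illustrative only: the function names and the link speeds in the example are assumptions, and a ratio above 1 only means drops become possible when the inbound links run near capacity.

```python
def oversubscription_ratio(ingress_gbps, egress_gbps):
    """Ratio of total inbound link capacity to total outbound link capacity.

    Both arguments are lists of physical link speeds in Gbit/s. VLANs do not
    change these numbers, since they only partition the links logically.
    """
    total_in = sum(ingress_gbps)
    total_out = sum(egress_gbps)
    if total_out <= 0:
        raise ValueError("at least one outbound link with positive speed is required")
    return total_in / total_out


def is_oversubscribed(ingress_gbps, egress_gbps):
    """True when more capacity can arrive than the outbound links can carry."""
    return oversubscription_ratio(ingress_gbps, egress_gbps) > 1.0


if __name__ == "__main__":
    # Hypothetical example: eight 1 Gbit/s host links funnelled into two
    # 1 Gbit/s links towards the iSCSI/NFS server.
    print(oversubscription_ratio([1] * 8, [1] * 2))
```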

• Be aware that with software-initiated iSCSI and NFS the network protocol processing takes place on the host system, and thus these might require more CPU resources than other storage options.

• Local storage performance might be improved with write-back cache. If your local storage has write-back cache installed, make sure it’s enabled and contains a functional battery module. For more information, see KB article 1006602.

Hardware Networking Considerations

Before undertaking any network optimization effort, you should understand the physical aspects of the network. The following are just a few aspects of the physical layout that merit close consideration:

• Consider using server-class network interface cards (NICs) for the best performance.

• Make sure the network infrastructure between the source and destination NICs doesn’t introduce bottlenecks. For example, if both NICs are 10 Gigabit, make sure all cables and switches are capable of the same speed and that the switches are not configured to a lower speed.

For the best networking performance, we recommend the use of network adapters that support the following hardware features:

• Checksum offload
• TCP segmentation offload (TSO)
• Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
• Ability to handle multiple Scatter Gather elements per Tx frame
• Jumbo frames (JF)
• Large receive offload (LRO)

On some 10 Gigabit Ethernet hardware network adapters, ESXi supports NetQueue, a technology that significantly improves performance of 10 Gigabit Ethernet network adapters in virtualized environments.

In addition to the PCI and PCI-X bus architectures, we now have the PCI Express (PCIe) architecture. Ideally single-port 10 Gigabit Ethernet network adapters should use PCIe x8 (or higher) or PCI-X 266 and dual-port 10 Gigabit Ethernet network adapters should use PCIe x16 (or higher). There should preferably be no “bridge chip” (e.g., PCI-X to PCIe or PCIe to PCI-X) in the path to the actual Ethernet device (including any embedded bridge chip on the device itself), as these chips can reduce performance.

Multiple physical network adapters between a single virtual switch (vSwitch) and the physical network constitute a NIC team. NIC teams can provide passive failover in the event of hardware failure or network outage and, in some configurations, can increase performance by distributing the traffic across those physical network adapters.

Hardware BIOS Settings

General BIOS Settings

• Make sure you are running the latest version of the BIOS available for your system.

• Make sure the BIOS is set to enable all populated processor sockets and to enable all cores in each socket.

• Enable “Turbo Boost” in the BIOS if your processors support it.

• Make sure hyper-threading is enabled in the BIOS for processors that support it.

• Some NUMA-capable systems provide an option in the BIOS to disable NUMA by enabling node interleaving. In most cases you will get the best performance by disabling node interleaving (in other words, leaving NUMA enabled).

• Make sure any hardware-assisted virtualization features (VT-x, AMD-V, EPT, RVI, and so on) are enabled in the BIOS.

• Disable from within the BIOS any devices you won’t be using. This might include, for example, unneeded serial, USB, or network ports.

• Cache prefetching mechanisms (sometimes called DPL Prefetch, Hardware Prefetcher, L2 Streaming Prefetch, or Adjacent Cache Line Prefetch) usually help performance, especially when memory access patterns are regular. When running applications that access memory randomly, however, disabling these mechanisms might result in improved performance.

• If the BIOS allows the memory scrubbing rate to be configured, we recommend leaving it at the manufacturer’s default setting.

Power Management BIOS Settings

VMware ESXi includes a full range of host power management capabilities in the software that can save power when a host is not fully utilized. We recommend that you configure your BIOS settings to allow ESXi the most flexibility in using (or not using) the power management features offered by your hardware, then make your power-management choices within ESXi.

In order to allow ESXi to control CPU power-saving features, set power management in the BIOS to “OS Controlled Mode” or equivalent. Even if you don’t intend to use these power-saving features, ESXi provides a convenient way to manage them.

Availability of the C1E halt state typically provides a reduction in power consumption with little or no impact on performance. When “Turbo Boost” is enabled, the availability of C1E can sometimes even increase the performance of certain single-threaded workloads. We therefore recommend that you enable C1E in BIOS.

However, for a very few workloads that are highly sensitive to I/O latency, especially those with low CPU utilization, C1E can reduce performance. In these cases, you might obtain better performance by disabling C1E in BIOS, if that option is available.

C-states deeper than C1/C1E (i.e., C3, C6) allow further power savings, though with an increased chance of performance impacts. We recommend, however, that you enable all C-states in BIOS, then use ESXi host power management to control their use.

ESXi and Virtual Machines

ESXi General Considerations

This subsection provides guidance regarding a number of general performance considerations in ESXi.

• Plan your deployment by allocating enough resources for all the virtual machines you will run, as well as those needed by ESXi itself.

• Allocate to each virtual machine only as much virtual hardware as that virtual machine requires. Provisioning a virtual machine with more resources than it requires can, in some cases, reduce the performance of that virtual machine as well as other virtual machines sharing the same host.

• Disconnect or disable any physical hardware devices that you will not be using. These might include devices such as:

o COM ports
o LPT ports
o USB controllers
o Floppy drives
o Optical drives (that is, CD or DVD drives)
o Network interfaces
o Storage controllers

• Disabling hardware devices (typically done in BIOS) can free interrupt resources. Additionally, some devices, such as USB controllers, operate on a polling scheme that consumes extra CPU resources. Lastly, some PCI devices reserve blocks of memory, making that memory unavailable to ESXi.

• Unused or unnecessary virtual hardware devices can impact performance and should be disabled. For example, Windows guest operating systems poll optical drives (that is, CD or DVD drives) quite frequently. When virtual machines are configured to use a physical drive, and multiple guest operating systems simultaneously try to access that drive, performance could suffer. This can be reduced by configuring the virtual machines to use ISO images instead of physical drives, and can be avoided entirely by disabling optical drives in virtual machines when the devices are not needed.

• ESXi 5.0 introduces virtual hardware version 8. By creating virtual machines using this hardware version, or upgrading existing virtual machines to this version, a number of additional capabilities become available. Some of these, such as support for virtual machines with up to 1TB of RAM and up to 32 vCPUs, support for virtual NUMA, and support for 3D graphics, can improve performance for some workloads. This hardware version is not compatible with versions of ESXi prior to 5.0, however, and thus if a cluster of ESXi hosts will contain some hosts running pre-5.0 versions of ESXi, the virtual machines running on hardware version 8 will be constrained to run only on the ESXi 5.0 hosts. This could limit vMotion choices for Distributed Resource Scheduling (DRS) or Distributed Power Management (DPM).
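The effect of hardware version 8 on placement in a mixed cluster can be illustrated with a short filter. This is a sketch under stated assumptions: the host names and the dictionary layout are hypothetical, and only the rule stated above (version 8 needs ESXi 5.0 or later) is encoded; other hardware versions are deliberately rejected rather than guessed.

```python
# Minimum ESXi (major, minor) version per virtual hardware version.
# Only the rule given in the text is encoded here.
MIN_ESXI_FOR_HW_VERSION = {8: (5, 0)}


def eligible_hosts(vm_hw_version, hosts):
    """Return the names of hosts that can run a VM of the given hardware version.

    `hosts` maps a host name to its ESXi version as a (major, minor) tuple.
    """
    if vm_hw_version not in MIN_ESXI_FOR_HW_VERSION:
        raise ValueError(f"no rule recorded for hardware version {vm_hw_version}")
    minimum = MIN_ESXI_FOR_HW_VERSION[vm_hw_version]
    return [name for name, version in hosts.items() if version >= minimum]


if __name__ == "__main__":
    cluster = {"esx-a": (4, 1), "esx-b": (5, 0), "esx-c": (5, 0)}
    # A version 8 VM can only be placed on the two ESXi 5.0 hosts.
    print(eligible_hosts(8, cluster))
```

The shorter the returned list, the fewer vMotion targets DRS and DPM have for that VM.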

ESXi CPU Considerations

CPU virtualization adds varying amounts of overhead depending on the percentage of the virtual machine’s workload that can be executed on the physical processor as is and the cost of virtualizing the remainder of the workload:

• For many workloads, CPU virtualization adds only a very small amount of overhead, resulting in performance essentially comparable to native.

• Many workloads to which CPU virtualization does add overhead are not CPU-bound; that is, most of their time is spent waiting for external events such as user interaction, device input, or data retrieval, rather than executing instructions. Because otherwise-unused CPU cycles are available to absorb the virtualization overhead, these workloads will typically have throughput similar to native, but potentially with a slight increase in latency.

• For a small percentage of workloads, for which CPU virtualization adds overhead and which are CPU-bound, there might be a noticeable degradation in both throughput and latency.

• If an ESXi host becomes CPU saturated (that is, the virtual machines and other loads on the host demand all the CPU resources the host has), latency sensitive workloads might not perform well. In this case you might want to reduce the CPU load, for example by powering off some virtual machines or migrating them to a different host (or allowing DRS to migrate them automatically).

• It is a good idea to periodically monitor the CPU usage of the host. This can be done through the vSphere Client or by using esxtop or resxtop. Below we describe how to interpret esxtop data:

o If the load average on the first line of the esxtop CPU panel is equal to or greater than 1, this indicates that the system is overloaded.

o The usage percentage for the physical CPUs on the PCPU line can be another indication of a possibly overloaded condition. In general, 80% usage is a reasonable ceiling and 90% should be a warning that the CPUs are approaching an overloaded condition. However, organizations will have varying standards regarding the desired load percentage.
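The two esxtop rules of thumb above can be expressed as a simple check. The sketch below does not parse esxtop output; it assumes the load average and the PCPU usage percentage have already been read off the CPU panel, and the function name and message wording are hypothetical. The thresholds are parameters because organizations use different standards.

```python
def assess_host_cpu(load_average, pcpu_used_pct, ceiling_pct=80.0, warning_pct=90.0):
    """Apply the load-average and PCPU usage rules of thumb.

    Returns a list of findings; an empty list means neither rule fired.
    """
    findings = []
    if load_average >= 1.0:
        findings.append("load average >= 1: system is overloaded")
    if pcpu_used_pct >= warning_pct:
        findings.append("PCPU usage at warning level: approaching overload")
    elif pcpu_used_pct > ceiling_pct:
        findings.append("PCPU usage above the recommended ceiling")
    return findings


if __name__ == "__main__":
    for finding in assess_host_cpu(load_average=1.2, pcpu_used_pct=95.0):
        print(finding)
```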

• Configuring a virtual machine with more virtual CPUs (vCPUs) than its workload can use might cause slightly increased resource usage, potentially impacting performance on very heavily loaded systems. Common examples of this include a single-threaded workload running in a multiple-vCPU virtual machine or a multi-threaded workload in a virtual machine with more vCPUs than the workload can effectively use.

• Most guest operating systems execute an idle loop during periods of inactivity. Within this loop, most of these guest operating systems halt by executing the HLT or MWAIT instructions. Some older guest operating systems (including Windows 2000 (with certain HALs), Solaris 8 and 9, and MS-DOS), however, use busy-waiting within their idle loops. This results in the consumption of resources that might otherwise be available for other uses (other virtual machines, the VMkernel, and so on). ESXi automatically detects these loops and de-schedules the idle vCPU. Though this reduces the CPU overhead, it can also reduce the performance of some I/O-heavy workloads. For additional information see VMware KB articles 1077 and 2231.

• The guest operating system’s scheduler might migrate a single-threaded workload amongst multiple vCPUs, thereby losing cache locality.

UP vs. SMP HALs/Kernels

NOTE When changing an existing virtual machine running Windows from multi-core to single-core the HAL usually remains SMP. For best performance, the HAL should be manually changed back to UP.

Hyper-Threading

Hyper-threading technology (sometimes also called simultaneous multithreading, or SMT) allows a single physical processor core to behave like two logical processors, essentially allowing two independent threads to run simultaneously. Unlike having twice as many processor cores (which can roughly double performance), hyper-threading can provide anywhere from a slight to a significant increase in system performance by keeping the processor pipeline busier.

If the hardware and BIOS support hyper-threading, ESXi automatically makes use of it. For the best performance we recommend that you enable hyper-threading.

Be careful when using CPU affinity on systems with hyper-threading. Because the two logical processors share most of the processor resources, pinning vCPUs, whether from different virtual machines or from a single SMP virtual machine, to both logical processors on one core (CPUs 0 and 1, for example) could cause poor performance.

ESXi provides configuration parameters for controlling the scheduling of virtual machines on hyper-threaded systems (Edit virtual machine settings > Resources tab > Advanced CPU). When choosing hyper-threaded core sharing choices, the Any option (which is the default) is almost always preferred over None.

The “None” option indicates that when a vCPU from this virtual machine is assigned to a logical processor, no other vCPU, whether from the same virtual machine or from a different virtual machine, should be assigned to the other logical processor that resides on the same core. That is, each vCPU from this virtual machine should always get a whole core to itself and the other logical CPU on that core should be placed in the halted state. This option is like disabling hyper-threading for that one virtual machine.

For nearly all workloads, custom hyper-threading settings are not necessary. In cases of unusual workloads that interact badly with hyper-threading, however, choosing the None hyper-threading option might help performance. For example, even though the ESXi scheduler tries to dynamically run higher-priority virtual machines on a whole core for longer durations, you can further isolate a high-priority virtual machine from interference by other virtual machines by setting its hyper-threading sharing property to None.

Non-Uniform Memory Access (NUMA)

By default, ESXi NUMA scheduling and related optimizations are enabled only on systems with a total of at least four CPU cores and with at least two CPU cores per NUMA node. On such systems, virtual machines can be separated into the following two categories:

• Virtual machines with a number of vCPUs equal to or less than the number of cores in each physical NUMA node. These virtual machines will be assigned to cores all within a single NUMA node and will be preferentially allocated memory local to that NUMA node. This means that, subject to memory availability, all their memory accesses will be local to that NUMA node, resulting in the lowest memory access latencies.

• Virtual machines with more vCPUs than the number of cores in each physical NUMA node (called “wide virtual machines”). These virtual machines will be assigned to two (or more) NUMA nodes and will be preferentially allocated memory local to those NUMA nodes. Because vCPUs in these wide virtual machines might sometimes need to access memory outside their own NUMA node, they might experience higher average memory access latencies than virtual machines that fit entirely within a NUMA node.
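The NUMA rules above (when NUMA scheduling is enabled, and which of the two categories a VM falls into) can be sketched as follows. The function names and return labels are hypothetical. The node count for a wide VM is computed as a simple ceiling division, which is an assumption about the minimum number of nodes needed and not a statement of how ESXi actually splits a wide VM.

```python
import math


def numa_category(vcpus, total_cores, cores_per_node):
    """Classify a VM according to the NUMA rules described above."""
    if vcpus < 1 or cores_per_node < 1:
        raise ValueError("vcpus and cores_per_node must be positive")
    # NUMA scheduling needs at least four cores in total and two per node.
    if total_cores < 4 or cores_per_node < 2:
        return "numa-scheduling-disabled"
    if vcpus <= cores_per_node:
        return "single-node"  # all memory accesses can stay node-local
    return "wide"  # spans two or more NUMA nodes


def min_nodes_spanned(vcpus, cores_per_node):
    """Smallest number of nodes whose cores can hold all vCPUs (assumption)."""
    return math.ceil(vcpus / cores_per_node)


if __name__ == "__main__":
    # Hypothetical host: two NUMA nodes with eight cores each.
    print(numa_category(4, total_cores=16, cores_per_node=8))
    print(numa_category(12, total_cores=16, cores_per_node=8))
    print(min_nodes_spanned(12, cores_per_node=8))
```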

Host Power Management in ESXi

Power Policy Options in ESXi

ESXi 5.0 offers the following power policy options:

• High performance - This power policy maximizes performance, using no power management features.

• Balanced - This power policy (the default in ESXi 5.0) is designed to reduce host power consumption while having little or no impact on performance.

• Low power - This power policy is designed to more aggressively reduce host power consumption at the risk of reduced performance.

• Custom - This power policy starts out the same as Balanced, but allows for the modification of individual parameters.

While the default power policy in ESX/ESXi 4.1 was High performance, in ESXi 5.0 the default is now Balanced. This power policy will typically not impact the performance of CPU-intensive workloads. Rarely, however, the Balanced policy might slightly reduce the performance of latency sensitive workloads. In these cases, selecting the High performance power policy will provide the full hardware performance.

ESXi Memory Considerations

Memory Overhead

Virtualization causes an increase in the amount of physical memory required due to the extra memory needed by ESXi for its own code and for data structures. This additional memory requirement can be separated into two components:

1. A system-wide memory space overhead for the VMkernel and various host agents (hostd, vpxa, etc.).

2. An additional memory space overhead for each virtual machine.

The per-virtual-machine memory space overhead can be further divided into the following categories:

• Memory reserved for the virtual machine executable (VMX) process. This is used for data structures needed to bootstrap and support the guest (i.e., thread stacks, text, and heap).

• Memory reserved for the virtual machine monitor (VMM). This is used for data structures required by the virtual hardware (i.e., TLB, memory mappings, and CPU state).

• Memory reserved for various virtual devices (i.e., mouse, keyboard, SVGA, USB, etc.)

• Memory reserved for other subsystems, such as the kernel, management agents, etc.

The amounts of memory reserved for these purposes depend on a variety of factors, including the number of vCPUs, the configured memory for the guest operating system, whether the guest operating system is 32-bit or 64-bit, and which features are enabled for the virtual machine. For more information about these overheads, see vSphere Resource Management.

Memory Sizing

• You should allocate enough memory to hold the working set of applications you will run in the virtual machine, thus minimizing thrashing.

• You should also avoid over-allocating memory. Allocating more memory than needed unnecessarily increases the virtual machine memory overhead, thus consuming memory that could be used to support more virtual machines.

Memory Overcommit Techniques

ESXi uses five memory management mechanisms (page sharing, ballooning, memory compression, swap to host cache, and regular swapping) to dynamically reduce the amount of machine physical memory required for each virtual machine.

• Page Sharing: ESXi uses a proprietary technique to transparently and securely share memory pages between virtual machines, thus eliminating redundant copies of memory pages. In most cases, page sharing is used by default regardless of the memory demands on the host system. (The exception is when using large pages, as discussed in “Large Memory Pages for Hypervisor and Guest Operating System” on page 28.)

• Ballooning: If the virtual machine’s memory usage approaches its memory target, ESXi will use ballooning to reduce that virtual machine’s memory demands. Using a VMware-supplied vmmemctl module installed in the guest operating system as part of VMware Tools suite, ESXi can cause the guest operating system to relinquish the memory pages it considers least valuable. Ballooning provides performance closely matching that of a native system under similar memory constraints. To use ballooning, the guest operating system must be configured with sufficient swap space.

• Memory Compression: If the virtual machine’s memory usage approaches the level at which host-level swapping will be required, ESXi will use memory compression to reduce the number of memory pages it will need to swap out. Because the decompression latency is much smaller than the swap-in latency, compressing memory pages has significantly less impact on performance than swapping out those pages.

• Swap to Host Cache: If memory compression doesn’t keep the virtual machine’s memory usage low enough, ESXi will next forcibly reclaim memory using host-level swapping to a host cache (if one has been configured). Swap to host cache is a new feature in ESXi 5.0 that allows users to configure a special swap cache on SSD storage. In most cases this host cache (being on SSD) will be much faster than the regular swap files (typically on hard disk storage), significantly reducing access latency. Thus, although some of the pages ESXi swaps out might be active, swap to host cache has a far lower performance impact than regular host-level swapping.

• Regular Swapping: If the host cache becomes full, or if a host cache has not been configured, ESXi will next reclaim memory from the virtual machine by swapping out pages to a regular swap file. Like swap to host cache, some of the pages ESXi swaps out might be active. Unlike swap to host cache, however, this mechanism can cause virtual machine performance to degrade significantly due to its high access latency.
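The escalation order described in the list above can be sketched as a small Python function. This is only an illustrative model: the function name and its three boolean inputs are hypothetical simplifications, not ESXi settings or APIs, and the real reclamation decisions depend on host state that is not modelled here.

```python
def reclamation_steps(compression_sufficient: bool,
                      host_cache_configured: bool,
                      host_cache_full: bool) -> list[str]:
    """Return, in order, the techniques applied as memory pressure grows.

    Hypothetical model of the order described in the text; the inputs are
    illustrative flags, not values exposed by ESXi.
    """
    # Page sharing runs by default; ballooning and compression come next.
    steps = ["page sharing", "ballooning", "memory compression"]
    if compression_sufficient:
        return steps
    # Host-level swapping goes to the SSD host cache first, if configured.
    if host_cache_configured:
        steps.append("swap to host cache")
        if not host_cache_full:
            return steps
    # No host cache, or the host cache is full: fall back to the swap file.
    steps.append("regular swapping")
    return steps


if __name__ == "__main__":
    print(reclamation_steps(False, True, False))
    print(reclamation_steps(False, False, False))
```

Only the last case in each call is the one to avoid: once "regular swapping" appears in the result, active pages may be going to the high-latency swap file.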

While ESXi uses page sharing, ballooning, memory compression, and swap to host cache to allow significant memory overcommitment, usually with little or no impact on performance, you should avoid overcommitting memory to the point that active memory pages are swapped out with regular host-level swapping. The following checks help determine whether memory overcommitment is affecting a virtual machine:

• In the vSphere Client, select the virtual machine in question, select the Performance tab, then look at the value of Memory Balloon (Average). An absence of ballooning suggests that ESXi is not under heavy memory pressure and thus memory overcommitment is not affecting the performance of that virtual machine.

• In the vSphere Client, select the virtual machine in question, select the Performance tab, then compare the values of Consumed Memory and Active Memory. If consumed is higher than active, this suggests that the guest is currently getting all the memory it requires for best performance.

• In the vSphere Client, select the virtual machine in question, select the Performance tab, then look at the values of Swap-In and Decompress. Swapping in and decompressing at the host level indicate more significant memory pressure.

• Check for guest operating system swap activity within that virtual machine. This can indicate that ballooning might be starting to impact performance, though swap activity can also be related to other issues entirely within the guest (or can be an indication that the guest memory size is simply too small).

Memory Swapping Optimizations

Because ESXi uses page sharing, ballooning, and memory compression to reduce the need for host-level memory swapping, don’t disable these techniques.

If you choose to overcommit memory with ESXi, be sure you have sufficient swap space on your ESXi system. At the time a virtual machine is first powered on, ESXi creates a swap file for that virtual machine equal in size to the difference between the virtual machine's configured memory size and its memory reservation. The available disk space must therefore be at least this large (plus the space required for VMX swap, as described in “Memory Overhead” on page 25).

You can optionally configure a special host cache on an SSD (if one is installed) to be used for the new swap to host cache feature.

NOTE Placing the regular swap file on SSD and using swap to host cache on SSD (as described above) are two different approaches to improving host swapping performance. Because it is unusual to have enough SSD space for a host’s entire swap file needs, we recommend using local SSD for swap to host cache.
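The swap file sizing rule above (configured memory minus memory reservation) is simple arithmetic, sketched here in Python. The function names and the example virtual machine sizes are made up for illustration, and the result deliberately excludes the additional VMX swap space mentioned in the text.

```python
def swap_file_size_mb(configured_mb: int, reservation_mb: int) -> int:
    """Swap file size for one VM: configured memory minus its reservation."""
    if not 0 <= reservation_mb <= configured_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_mb - reservation_mb


def total_swap_space_mb(vms: list[tuple[int, int]]) -> int:
    """Sum of swap file sizes for (configured_mb, reservation_mb) pairs.

    Excludes VMX swap, which the text says must be added on top.
    """
    return sum(swap_file_size_mb(cfg, res) for cfg, res in vms)


if __name__ == "__main__":
    # Hypothetical VMs: (configured memory, memory reservation) in MB.
    example_vms = [(8192, 2048), (4096, 4096), (16384, 0)]
    print(total_swap_space_mb(example_vms))  # 6144 + 0 + 16384 = 22528
```

A fully reserved virtual machine (the second one in the example) needs no swap file space under this rule, which is one reason reservations and swap placement are worth planning together.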
If you can’t use SSD storage, place the regular swap file on the fastest available storage. This might be a Fibre Channel SAN array or a fast local disk.

Placing swap files on local storage (whether SSD or hard drive) could potentially reduce vMotion performance. This is because if a virtual machine has memory pages in a local swap file, they must be swapped in to memory before a vMotion operation on that virtual machine can proceed.

Regardless of the storage type or location used for the regular swap file, for the best performance, and to avoid the possibility of running out of space, swap files should not be placed on thin-provisioned storage.

Large Memory Pages for Hypervisor and Guest Operating System

In addition to the usual 4KB memory pages, ESXi also provides 2MB memory pages (commonly referred to as “large pages”). By default ESXi assigns these 2MB machine memory pages to guest operating systems that request them, giving the guest operating system the full advantage of using large pages. The use of large pages results in reduced memory management overhead and can therefore increase hypervisor performance.

If an operating system or application can benefit from large pages on a native system, that operating system or application can potentially achieve a similar performance improvement on a virtual machine backed with 2MB machine memory pages.

Use of large pages can also change page sharing behavior. While ESXi ordinarily uses page sharing regardless of memory demands, it does not share large pages. Therefore with large pages, page sharing might not occur until memory overcommitment is high enough to require the large pages to be broken into small pages.

ESXi Storage Considerations

VMware vStorage APIs for Array Integration (VAAI)

For the best storage performance, consider using VAAI-capable storage hardware. The performance gains from VAAI (described in “Hardware Storage Considerations” on page 11) can be especially noticeable in VDI environments (where VAAI can improve boot-storm and desktop workload performance), large data centers (where VAAI can improve the performance of mass virtual machine provisioning and of thin-provisioned virtual disks), and in other large-scale deployments.

LUN Access Methods, Virtual Disk Modes, and Virtual Disk Types

You can use RDMs in virtual compatibility mode or physical compatibility mode:

• Virtual mode specifies full virtualization of the mapped device, allowing the guest operating system to treat the RDM like any other virtual disk file in a VMFS volume.

• Physical mode specifies minimal SCSI virtualization of the mapped device, allowing the greatest flexibility for SAN management software or other SCSI target-based software running in the virtual machine.

ESXi supports multiple virtual disk types:

Thick - Thick virtual disks, which have all their space allocated at creation time, are further divided into two types: eager zeroed and lazy zeroed.

• Eager-zeroed - An eager-zeroed thick disk has all space allocated and zeroed out at the time of creation. This increases the time it takes to create the disk, but results in the best performance, even on the first write to each block.

• Lazy-zeroed - A lazy-zeroed thick disk has all space allocated at the time of creation, but each block is zeroed only on first write. This results in a shorter creation time, but reduced performance the first time a block is written to. Subsequent writes, however, have the same performance as on eager-zeroed thick disks.

• Thin - Space required for a thin-provisioned virtual disk is allocated and zeroed upon first write, as opposed to upon creation. There is a higher I/O cost (similar to that of lazy-zeroed thick disks) during the first write to an unwritten file block, but on subsequent writes thin-provisioned disks have the same performance as eager-zeroed thick disks.

Partition Alignment

The alignment of file system partitions can impact performance. VMware makes the following recommendations for VMFS partitions: Like other disk-based filesystems, VMFS filesystems suffer a performance penalty when the partition is unaligned. Using the vSphere Client to create VMFS partitions avoids this problem since, beginning with ESXi 5.0, it automatically aligns VMFS3 or VMFS5 partitions along the 1MB boundary.

SAN Multipathing

By default, ESXi uses the Most Recently Used (MRU) path policy for devices on Active/Passive storage arrays. To avoid LUN path thrashing, do not use the Fixed path policy for Active/Passive storage arrays.

NOTE With some Active/Passive storage arrays that support ALUA (described below) ESXi can use Fixed path policy without risk of LUN path thrashing.

By default, ESXi uses the Fixed path policy for devices on Active/Active storage arrays. When using this policy you can maximize the utilization of your bandwidth to the storage array by designating preferred paths to each LUN through different storage controllers. For more information, see the VMware SAN Configuration Guide.

In addition to the Fixed and MRU path policies, ESXi can also use the Round Robin path policy, which can improve storage performance in some environments. Round Robin policy provides load balancing by cycling I/O requests through all Active paths, sending a fixed (but configurable) number of I/O requests through each one in turn.

If your storage array supports ALUA (Asymmetric Logical Unit Access), enabling this feature on the array can improve storage performance in some environments. ALUA, which is automatically detected by ESXi, allows the array itself to designate paths as “Active Optimized.” When ALUA is combined with the Round Robin path policy, ESXi cycles I/O requests through these Active Optimized paths.
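The Round Robin behaviour described above (a fixed number of I/O requests down each Active path in turn) can be pictured with a short Python generator. This is a conceptual sketch, not how the ESXi path selection plugin is implemented; the function name, the path labels, and the per-path count used in the example are all assumptions chosen for illustration.

```python
from itertools import cycle
from typing import Iterator, Sequence


def round_robin_paths(active_paths: Sequence[str],
                      ios_per_path: int,
                      total_ios: int) -> Iterator[str]:
    """Yield the path chosen for each of total_ios I/O requests.

    Sends ios_per_path consecutive requests down one path, then moves
    to the next Active path, wrapping around at the end of the list.
    """
    if not active_paths:
        raise ValueError("at least one active path is required")
    if ios_per_path < 1:
        raise ValueError("ios_per_path must be at least 1")
    sent = 0
    for path in cycle(active_paths):
        for _ in range(ios_per_path):
            if sent == total_ios:
                return
            yield path
            sent += 1


if __name__ == "__main__":
    # Hypothetical path labels; 2 I/Os per path, 5 I/Os in total.
    print(list(round_robin_paths(["pathA", "pathB"], 2, 5)))
    # ['pathA', 'pathA', 'pathB', 'pathB', 'pathA']
```

With ALUA in use, the list passed in would correspond to the Active Optimized paths only, since those are the ones the text says ESXi cycles through.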
Storage I/O Resource Allocation

VMware vSphere provides mechanisms to dynamically allocate storage I/O resources, allowing critical workloads to maintain their performance even during peak load periods when there is contention for I/O resources. This allocation can be performed at the level of the individual host or for an entire datastore.

• The storage I/O resources available to an ESXi host can be proportionally allocated to the virtual machines running on that host by using the vSphere Client to set disk shares for the virtual machines (select Edit virtual machine settings, choose the Resources tab, select Disk, then change the Shares field).

• The maximum storage I/O resources available to each virtual machine can be set using limits. These limits, set in I/O operations per second (IOPS), can be used to provide strict isolation and control on certain workloads. By default, these are set to unlimited. When set to any other value, ESXi enforces the limits even if the underlying datastores are not fully utilized.

• An entire datastore’s I/O resources can be proportionally allocated to the virtual machines accessing that datastore using Storage I/O Control (SIOC). When enabled, SIOC evaluates the disk share values set for all virtual machines accessing a datastore and allocates that datastore’s resources accordingly. SIOC can be enabled using the vSphere Client (select a datastore, choose the Configuration tab, click Properties... (at the far right), then under Storage I/O Control add a checkmark to the Enabled box).
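As a rough illustration of how disk shares divide a datastore's I/O resources, the Python sketch below compares the two cases: without SIOC each host gets an equal portion that is then split by shares among its own virtual machines, while with SIOC the shares are evaluated across all hosts. It is an idealized model that assumes every virtual machine is contending for I/O and every host runs at least one virtual machine; the function name, host names, VM names, and share values are hypothetical.

```python
def datastore_portions(host_vm_shares: dict[str, dict[str, int]],
                       sioc_enabled: bool) -> dict[str, float]:
    """Return each VM's fraction of a datastore's I/O resources.

    host_vm_shares maps host name -> {vm name: disk shares}.
    Idealized: assumes all VMs are actively contending for I/O.
    """
    portions: dict[str, float] = {}
    if sioc_enabled:
        # Shares are evaluated globally across every host.
        total = sum(s for vms in host_vm_shares.values() for s in vms.values())
        for vms in host_vm_shares.values():
            for vm, shares in vms.items():
                portions[vm] = shares / total
    else:
        # Each host gets an equal slice; shares only split that slice.
        host_portion = 1 / len(host_vm_shares)
        for vms in host_vm_shares.values():
            host_total = sum(vms.values())
            for vm, shares in vms.items():
                portions[vm] = host_portion * shares / host_total
    return portions


if __name__ == "__main__":
    hosts = {"esx1": {"vmA": 1000, "vmB": 1000}, "esx2": {"vmC": 1000}}
    print(datastore_portions(hosts, sioc_enabled=False))  # vmC gets 0.5
    print(datastore_portions(hosts, sioc_enabled=True))   # each VM gets 1/3
```

In the example all three virtual machines have equal shares, yet without SIOC the lone virtual machine on the second host receives twice as much as either virtual machine on the first host.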

With SIOC disabled (the default), all hosts accessing a datastore get an equal portion of that datastore’s resources. Any shares values determine only how each host’s portion is divided amongst its virtual machines. With SIOC enabled, the disk shares are evaluated globally and the portion of the datastore’s resources each host receives depends on the sum of the shares of the virtual machines running on that host relative to the sum of the shares of all the virtual machines accessing that datastore.

General ESXi Storage Recommendations

I/O latency statistics can be monitored using esxtop (or resxtop), which reports device latency, time spent in the kernel, and latency seen by the guest operating system.

Make sure that the average latency for storage devices is not too high. This latency can be seen in esxtop (or resxtop) by looking at the GAVG/cmd metric. A reasonable upper value for this metric depends on your storage subsystem. If you use SIOC, you can use your SIOC setting as a guide: your GAVG/cmd value should be well below your SIOC setting. The default SIOC setting is 30 ms, but if you have very fast storage (SSDs, for example) you might have reduced that value. For further information on average latency see VMware KB article 1008205.

You can adjust the maximum number of outstanding disk requests per VMFS volume, which can help equalize the bandwidth across virtual machines using that volume. For further information see VMware KB article 1268.

If you will not be using Storage I/O Control and often observe QFULL/BUSY errors, enabling and configuring queue depth throttling might improve storage performance. This feature can significantly reduce the number of commands returned from the array with a QFULL/BUSY error. If any system accessing a particular LUN or storage array port has queue depth throttling enabled, all systems (both ESX hosts and other systems) accessing that LUN or storage array port should use an adaptive queue depth algorithm. Queue depth throttling is not compatible with Storage DRS. For more information about both QFULL/BUSY errors and this feature see KB article 1008113.

Running Storage Latency Sensitive Applications

By default the ESXi storage stack is configured to drive high storage throughput at low CPU cost. While this default configuration provides better scalability and higher consolidation ratios, it comes at the cost of potentially higher storage latency. Applications that are highly sensitive to storage latency might therefore benefit from the following:

Adjust the host power management settings: Some of the power management features in newer server hardware can increase storage latency. Disable them as follows:

• Set the ESXi host power policy to Maximum performance (as described in “Host Power Management in ESXi” on page 23; this is the preferred method) or disable power management in the BIOS (as described in “Power Management BIOS Settings” on page 14).

• Disable C1E and other C-states in BIOS (as described in “Power Management BIOS Settings” on page 14).

• Enable Turbo Boost in BIOS (as described in “General BIOS Settings” on page 14).

ESXi Networking Considerations

In a native environment, CPU utilization plays a significant role in network throughput. To process higher levels of throughput, more CPU resources are needed. The effect of CPU resource availability on the network throughput of virtualized applications is even more significant. Because insufficient CPU resources will limit maximum throughput, it is important to monitor the CPU utilization of high-throughput workloads.

Use separate virtual switches, each connected to its own physical network adapter, to avoid contention between the VMkernel and virtual machines, especially virtual machines running heavy networking workloads.

To establish a network connection between two virtual machines that reside on the same ESXi system, connect both virtual machines to the same virtual switch. If the virtual machines are connected to different virtual switches, traffic will go through the wire and incur unnecessary CPU and network overhead.

Network I/O Control (NetIOC)

Network I/O Control (NetIOC) allows the allocation of network bandwidth to network resource pools. You can either select from among seven predefined resource pools (Fault Tolerance traffic, iSCSI traffic, vMotion traffic, management traffic, vSphere Replication (VR) traffic, NFS traffic, and virtual machine traffic) or you can create user-defined resource pools. Each resource pool is associated with a portgroup and, optionally, assigned a specific 802.1p priority level. Network bandwidth can be allocated to resource pools using either shares or limits:

• Shares can be used to allocate to a resource pool a proportion of a network link’s bandwidth equivalent to the ratio of its shares to the total shares. If a resource pool doesn’t use its full allocation, the unused bandwidth is available for use by other resource pools.

• Limits can be used to set a resource pool’s maximum bandwidth utilization (in Mbps) from a host through a specific virtual distributed switch (vDS). These limits are enforced even if a vDS is not saturated, potentially limiting a resource pool’s bandwidth while simultaneously leaving some bandwidth unused. On the other hand, if a resource pool’s bandwidth utilization is less than its limit, the unused bandwidth is available to other resource pools.

NetIOC can guarantee bandwidth for specific needs and can prevent any one resource pool from impacting the others.

DirectPath I/O

vSphere DirectPath I/O leverages Intel VT-d and AMD-Vi hardware support (described in “Hardware-Assisted I/O MMU Virtualization (VT-d and AMD-Vi)” on page 10) to allow guest operating systems to directly access hardware devices. In the case of networking, DirectPath I/O allows the virtual machine to access a physical NIC directly rather than using an emulated device (E1000) or a para-virtualized device (VMXNET, VMXNET3). While DirectPath I/O provides limited increases in throughput, it reduces CPU cost for networking-intensive workloads.

DirectPath I/O is not compatible with certain core virtualization features, however. This list varies with the hardware on which ESXi is running:

• New for vSphere 5.0, when ESXi is running on certain configurations of the Cisco Unified Computing System (UCS) platform, DirectPath I/O for networking is compatible with vMotion, physical NIC sharing, snapshots, and suspend/resume. It is not compatible with Fault Tolerance, NetIOC, memory overcommit, VMCI, or VMSafe.

• For server hardware other than the Cisco UCS platform, DirectPath I/O is not compatible with vMotion, physical NIC sharing, snapshots, suspend/resume, Fault Tolerance, NetIOC, memory overcommit, or VMSafe.

Typical virtual machines and their workloads don't require the use of DirectPath I/O. For workloads that are very networking intensive and don't need the core virtualization features mentioned above, however, DirectPath I/O might be useful to reduce CPU usage.

SplitRx Mode

SplitRx mode, a new feature in ESXi 5.0, uses multiple physical CPUs to process network packets received in a single network queue. This feature can significantly improve network performance for certain workloads. These workloads include:

• Multiple virtual machines on one ESXi host all receiving multicast traffic from the same source. (SplitRx mode will typically improve throughput and CPU efficiency for these workloads.)

• Traffic via the vNetwork Appliance (DVFilter) API between two virtual machines on the same ESXi host. (SplitRx mode will typically improve throughput and maximum packet rates for these workloads.)

This feature, which is supported only for VMXNET3 virtual network adapters, is individually configured for each virtual NIC using the ethernetX.emuRxMode variable in each virtual machine’s .vmx file (where X is replaced with the network adapter’s ID). The possible values for this variable are:

• ethernetX.emuRxMode = "0" This value disables splitRx mode for ethernetX.

• ethernetX.emuRxMode = "1" This value enables splitRx mode for ethernetX.
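To make the key and value format concrete, here is a minimal Python sketch that sets ethernetX.emuRxMode in the text of a .vmx file. The helper name set_splitrx is hypothetical, and the function only manipulates a string; it does not touch a datastore or any VMware API, and the sample .vmx contents are made up for the example.

```python
import re


def set_splitrx(vmx_text: str, nic_index: int, enabled: bool) -> str:
    """Return vmx_text with ethernet<nic_index>.emuRxMode set to "1" or "0".

    Replaces an existing entry for that NIC, or appends one if absent.
    Operates on text only (illustrative helper, not a VMware tool).
    """
    key = f"ethernet{nic_index}.emuRxMode"
    line = f'{key} = "{1 if enabled else 0}"'
    pattern = re.compile(rf"^{re.escape(key)}\s*=.*$", re.MULTILINE)
    if pattern.search(vmx_text):
        return pattern.sub(line, vmx_text)
    # Make sure the new entry starts on its own line.
    separator = "" if not vmx_text or vmx_text.endswith("\n") else "\n"
    return vmx_text + separator + line + "\n"


if __name__ == "__main__":
    sample = 'ethernet0.virtualDev = "vmxnet3"\n'
    print(set_splitrx(sample, 0, True), end="")
```

As the text notes, a change to this variable takes effect only after the virtual machine has been restarted.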

To change this variable through the vSphere Client:

1. Select the virtual machine you wish to change, then click Edit virtual machine settings.
2. Under the Options tab, select General, then click Configuration Parameters.
3. Look for ethernetX.emuRxMode (where X is the number of the desired NIC). If the variable isn’t present, click Add Row and enter it as a new variable.
4. Click on the value to be changed and configure it as you wish.

The change will not take effect until the virtual machine has been restarted.

Running Network Latency Sensitive Applications

By default the ESXi network stack is configured to drive high network throughput at low CPU cost. While this default configuration provides better scalability and higher consolidation ratios, it comes at the cost of potentially higher network latency. Applications that are highly sensitive to network latency might therefore benefit from the following:

• Use VMXNET3 virtual network adapters.

• Adjust the host power management settings (Maximum Performance, disable C1E and other C-States, Enable Turbo Boost in BIOS).

• Disable VMXNET3 virtual interrupt coalescing for the desired NIC. In some cases this can improve performance for latency-sensitive applications. In other cases, most notably applications with high numbers of outstanding network requests, it can reduce performance.

Guest Operating Systems

• Install the latest version of VMware Tools in the guest operating system.

• Disable screen savers and Window animations in virtual machines.

• Schedule backups and virus scanning programs in virtual machines to run at off-peak hours.

• For the most accurate timekeeping, consider configuring your guest operating system to use NTP, Windows Time Service, the VMware Tools time-synchronization option, or another timekeeping utility suitable for your operating system. We recommend, however, that within any particular virtual machine you use either the VMware Tools time-synchronization option or another timekeeping utility, but not both.

Measuring Performance in Virtual Machines

Be careful when measuring performance from within virtual machines. Timing numbers measured from within virtual machines can be inaccurate, especially when the processor is overcommitted.

NOTE One possible approach to this issue is to use a guest operating system that has good timekeeping behavior when run in a virtual machine, such as a guest that uses the NO_HZ kernel configuration option (sometimes called “tickless timer”). More information about this topic can be found in Timekeeping in VMware Virtual Machines (http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf).

Measuring performance from within virtual machines can fail to take into account resources used by ESXi for tasks it offloads from the guest operating system, as well as resources consumed by virtualization overhead.

Guest Operating System CPU Considerations

Many operating systems keep time by counting timer interrupts. The timer interrupt rates vary between different operating systems and versions. For example:

• Unpatched 2.4 and earlier Linux kernels typically request timer interrupts at 100 Hz (that is, 100 interrupts per second), though this can vary with version and distribution.

• Linux kernels have used a variety of timer interrupt rates, including 100 Hz, 250 Hz, and 1000 Hz, again varying with version and distribution.

• The most recent 2.6 Linux kernels introduce the NO_HZ kernel configuration option (sometimes called “tickless timer”) that uses a variable timer interrupt rate.

• Microsoft Windows operating system timer interrupt rates are specific to the version of Microsoft Windows and the Windows HAL that is installed. Windows systems typically use a base timer interrupt rate of 64 Hz or 100 Hz.

• Running applications that make use of the Microsoft Windows multimedia timer functionality can increase the timer interrupt rate. For example, some multimedia applications or Java applications increase the timer interrupt rate to approximately 1000 Hz.

In addition to the timer interrupt rate, the total number of timer interrupts delivered to a virtual machine also depends on a number of other factors:

• Virtual machines running SMP HALs/kernels (even if they are running on a UP virtual machine) require more timer interrupts than those running UP HALs/kernels.

• The more vCPUs a virtual machine has, the more interrupts it requires.

Delivering many virtual timer interrupts negatively impacts virtual machine performance and increases host CPU consumption. If you have a choice, use guest operating systems that require fewer timer interrupts. For example:

• If you have a UP virtual machine use a UP HAL/kernel.

• In some Linux versions, such as RHEL 5.1 and later, the “divider=10” kernel boot parameter reduces the timer interrupt rate to one tenth its default rate. See VMware KB article 1006427 for further information.

• Kernels with tickless-timer support (NO_HZ kernels) do not schedule periodic timers to maintain system time. As a result, these kernels reduce the overall average rate of virtual timer interrupts, thus improving system performance and scalability on hosts running large numbers of virtual machines.
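The effect of the timer rate, the vCPU count, and the divider boot parameter can be estimated with simple arithmetic, sketched below in Python. The sketch assumes that timer interrupts scale linearly with the number of vCPUs, which the text implies but does not quantify, so treat the numbers as a rough estimate only; the function name is hypothetical.

```python
def timer_interrupts_per_second(base_hz: float, vcpus: int, divider: int = 1) -> float:
    """Estimate periodic timer interrupts per second for one VM.

    Assumption (not stated exactly in the source): each vCPU receives
    interrupts at base_hz / divider, so the total scales with vCPU count.
    """
    if base_hz <= 0 or vcpus < 1 or divider < 1:
        raise ValueError("base_hz must be positive; vcpus and divider at least 1")
    return base_hz / divider * vcpus


if __name__ == "__main__":
    # A 1000 Hz guest kernel on a 4-vCPU virtual machine.
    print(timer_interrupts_per_second(1000, 4))              # 4000.0
    # The same guest booted with divider=10.
    print(timer_interrupts_per_second(1000, 4, divider=10))  # 400.0
```

Multiplied across many virtual machines on one host, the difference between these two figures is the host CPU saving the recommendations above are aiming at.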

Virtual NUMA (vNUMA)

• Virtual NUMA (vNUMA), a new feature in ESXi 5.0, exposes NUMA topology to the guest operating system, allowing NUMA-aware guest operating systems and applications to make the most efficient use of the underlying hardware’s NUMA architecture.

• Virtual NUMA, which requires virtual hardware version 8, can provide significant performance benefits, though the benefits depend heavily on the level of NUMA optimization in the guest operating system and applications.

• You can obtain the maximum performance benefits from vNUMA if your clusters are composed entirely of hosts with matching NUMA architecture. This is because the very first time a vNUMA-enabled virtual machine is powered on, its vNUMA topology is set based in part on the NUMA topology of the underlying physical host on which it is running. Once a virtual machine’s vNUMA topology is initialized it doesn’t change unless the number of vCPUs in that virtual machine is changed. This means that if a vNUMA virtual machine is moved to a host with a different NUMA topology, the virtual machine’s vNUMA topology might no longer be optimal for the underlying physical NUMA topology, potentially resulting in reduced performance.

• Size your virtual machines so they align with physical NUMA boundaries. For example, if you have a host system with six cores per NUMA node, size your virtual machines with a multiple of six vCPUs (i.e., 6 vCPUs, 12 vCPUs, 18 vCPUs, 24 vCPUs, and so on).

• NOTE Some multi-core processors have NUMA node sizes that are different than the number of cores per socket. For example, some 12-core processors have two six-core NUMA nodes per processor.
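The sizing rule above amounts to checking that the vCPU count is a multiple of the cores per NUMA node. A minimal Python sketch follows; the function names are hypothetical, and the cores-per-node value must be the NUMA node size of the actual host, which per the NOTE can be smaller than the number of cores per socket.

```python
def aligns_with_numa(vcpus: int, cores_per_numa_node: int) -> bool:
    """True if vcpus is a positive multiple of the NUMA node size."""
    if cores_per_numa_node < 1:
        raise ValueError("cores_per_numa_node must be at least 1")
    return vcpus > 0 and vcpus % cores_per_numa_node == 0


def aligned_vcpu_sizes(cores_per_numa_node: int, max_vcpus: int) -> list[int]:
    """List the vCPU counts up to max_vcpus that align with NUMA boundaries."""
    if cores_per_numa_node < 1:
        raise ValueError("cores_per_numa_node must be at least 1")
    return list(range(cores_per_numa_node, max_vcpus + 1, cores_per_numa_node))


if __name__ == "__main__":
    # Host with six cores per NUMA node, as in the example in the text.
    print(aligned_vcpu_sizes(6, 24))   # [6, 12, 18, 24]
    print(aligns_with_numa(8, 6))      # False
```

For a 12-core processor that exposes two six-core NUMA nodes, the value to pass is 6, not 12.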

Guest Operating System Storage Considerations

• The default virtual storage adapter in ESXi 5.0 is either BusLogic Parallel, LSI Logic Parallel, or LSI Logic SAS, depending on the guest operating system and the virtual hardware version. However, ESXi also includes a paravirtualized SCSI storage adapter, PVSCSI (also called VMware Paravirtual). The PVSCSI adapter offers a significant reduction in CPU utilization as well as potentially increased throughput compared to the default virtual storage adapters, and is thus the best choice for environments with very I/O-intensive guest applications.

• The depth of the queue of outstanding commands in the guest operating system SCSI driver can significantly impact disk performance. A queue depth that is too small, for example, limits the disk bandwidth that can be pushed through the virtual machine.

• In some cases large I/O requests issued by applications in a virtual machine can be split by the guest storage driver. Changing the guest operating system’s registry settings to issue larger block sizes can eliminate this splitting, thus enhancing performance. For additional information see VMware KB article 9645697.

• Make sure the disk partitions within the guest are aligned.

Guest Operating System Networking Considerations

The default virtual network adapter emulated in a virtual machine is either an AMD PCnet32 device (vlance) or an Intel E1000 device (E1000). VMware also offers the VMXNET family of paravirtualized network adapters, however, that provide better performance than these default adapters and should be used for optimal performance within any guest operating system for which they are available.

• For the best performance, use the VMXNET3 paravirtualized network adapter for operating systems in which it is supported. This requires that the virtual machine use virtual hardware version 7 or later, and that VMware Tools be installed in the guest operating system.

• The VMXNET3, Enhanced VMXNET, and E1000 devices support jumbo frames for better performance. (Note that the vlance device does not support jumbo frames.) To enable jumbo frames, set the MTU size to 9000 in both the guest network driver and the virtual switch configuration. The physical NICs at both ends and all the intermediate hops/routers/switches must also support jumbo frames.

• In ESXi, TCP Segmentation Offload (TSO) is enabled by default in the VMkernel, but is supported in virtual machines only when they are using the VMXNET3 device, the Enhanced VMXNET device, or the E1000 device. TSO can improve performance even if the underlying hardware does not support TSO.

• In some cases, low receive throughput in a virtual machine can be caused by insufficient receive buffers in the receiver network device. If the receive ring in the guest operating system’s network driver overflows, packets will be dropped in the VMkernel, degrading network throughput. A possible workaround is to increase the number of receive buffers, though this might increase the host physical CPU workload.

• For VMXNET, the default number of receive and transmit buffers is 100 each, with the maximum possible being 128. For Enhanced VMXNET, the default numbers of receive and transmit buffers are 150 and 256, respectively, with the maximum possible receive buffers being 512. You can alter these settings by changing the buffer size defaults in the .vmx (configuration) files for the affected virtual machines. For additional information see VMware KB article 1010071.

• Receive-side scaling (RSS) allows network packet receive processing to be scheduled in parallel on multiple CPUs. Without RSS, receive interrupts can be handled on only one CPU at a time. With RSS, received packets from a single NIC can be processed on multiple CPUs concurrently. This helps receive throughput in cases where a single CPU would otherwise be saturated with receive processing and become a bottleneck. To prevent out-of-order packet delivery, RSS schedules all of a flow’s packets to the same CPU.
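The last point in the list above, that RSS keeps every packet of a flow on the same CPU while spreading different flows across CPUs, can be illustrated with a toy Python function. This is a conceptual sketch only: real RSS hashing is performed by the NIC and driver, and CRC32 is used here purely as a stand-in hash; the function name and the addresses in the example are hypothetical.

```python
import zlib


def cpu_for_flow(src_ip: str, src_port: int,
                 dst_ip: str, dst_port: int,
                 num_cpus: int) -> int:
    """Map a flow to a CPU index in the range 0..num_cpus-1.

    The same flow tuple always hashes to the same CPU, which is what
    keeps a flow's packets in order. CRC32 is an illustrative stand-in,
    not the hash that RSS hardware uses.
    """
    if num_cpus < 1:
        raise ValueError("num_cpus must be at least 1")
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode("ascii")
    return zlib.crc32(key) % num_cpus


if __name__ == "__main__":
    flow = ("10.0.0.5", 40000, "10.0.0.9", 443)
    # Every packet of this flow lands on the same CPU.
    print(cpu_for_flow(*flow, num_cpus=4) == cpu_for_flow(*flow, num_cpus=4))
```

Because the mapping is per flow, a single very heavy flow still ends up on one CPU; the benefit comes when receive traffic is spread over many flows.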

Virtual Infrastructure Management

Use resource settings (that is, Reservation, Shares, and Limits) only if needed in your environment. If you expect frequent changes to the total available resources, use Shares, not Reservation, to allocate resources fairly across virtual machines. If you use Shares and you subsequently upgrade the hardware, each virtual machine stays at the same relative priority (keeps the same number of shares) even though each share represents a larger amount of memory or CPU.

Use Reservation to specify the minimum acceptable amount of CPU or memory, not the amount you would like to have available. After all resource reservations have been met, ESXi allocates the remaining resources based on the number of shares and the resource limits configured for your virtual machine.

When specifying the reservations for virtual machines, always leave some headroom for memory virtualization overhead and migration overhead. In a DRS-enabled cluster, reservations that fully commit the capacity of the cluster or of individual hosts in the cluster can prevent DRS from migrating virtual machines between hosts. As you approach fully reserving all capacity in the system, it also becomes increasingly difficult to make changes to reservations and to the resource pool hierarchy without violating admission control.

VMware vCenter

This section lists VMware vCenter practices and configurations recommended for optimal performance. It also includes a few features that are controlled or accessed through vCenter.

The performance of vCenter Server is dependent in large part on the number of managed entities (hosts and virtual machines) and the number of connected VMware vSphere Clients. Exceeding the maximums specified in Configuration Maximums for VMware vSphere 5.0, in addition to being unsupported, is thus likely to impact vCenter Server performance.

Whether run on virtual machines or physical systems, make sure you provide vCenter Server and the vCenter Server database with sufficient CPU, memory, and storage resources for your deployment size. To minimize the latency of vCenter operations, keep to a minimum the number of network hops between the vCenter Server system and the vCenter Server database.

Although VMware vCenter Update Manager can be run on the same system and use the same database as vCenter Server, for maximum performance, especially on heavily-loaded vCenter systems, consider running Update Manager on its own system and providing it with a dedicated database. Similarly, VMware vCenter Converter can be run on the same system as vCenter Server, but doing so might impact performance, especially on heavily-loaded vCenter systems.

VMware vCenter Database Considerations

VMware vCenter Database Network and Storage Considerations

To minimize the latency of operations between vCenter Server and the database, keep to a minimum the number of network hops between the vCenter Server system and the database system.

The hardware on which the vCenter database is stored, and the arrangement of the files on that hardware, can have a significant effect on vCenter performance:

• The vCenter database performs best when its files are placed on high-performance storage.

• The database data files generate mostly random read I/O traffic, while the database transaction logs generate mostly sequential write I/O traffic. For this reason, and because their traffic is often significant and simultaneous, vCenter performs best when these two file types are placed on separate storage resources that share neither disks nor I/O bandwidth.

VMware vCenter Database Configuration and Maintenance

Configure the vCenter statistics level to a setting appropriate for your uses. This setting can range from 1 to 4, but a setting of 1 is recommended for most situations. Higher settings can slow the vCenter Server system. You can also selectively disable statistics rollups for particular collection levels.

To avoid frequent log file switches, ensure that your vCenter database logs are sized appropriately for your vCenter inventory. For example, with a large vCenter inventory running with an Oracle database, the size of each redo log should be at least 512MB.

vCenter Server starts up with a database connection pool of 50 threads. This pool is then dynamically sized, growing adaptively as needed based on the vCenter Server workload, and does not require modification. However, if a heavy workload is expected on the vCenter Server, the size of this pool at startup can be increased, with the maximum being 128 threads. Note that this might result in increased memory consumption by vCenter Server and slower vCenter Server startup.

Update statistics of the tables and indexes on a regular basis for better overall performance of the database. As part of the regular database maintenance activity, check the fragmentation of the index objects and recreate indexes if needed (i.e., if fragmentation is more than about 30%).

Microsoft SQL Server Database Recommendations

If you are using a Microsoft SQL Server database, the following points can improve vCenter Server performance:

• Setting the transaction logs to Simple recovery mode significantly reduces the database logs’ disk space usage as well as their storage I/O load. If it isn’t possible to set this to Simple, make sure to have a high-performance storage subsystem.

• To further improve database performance for large inventories, place tempDB on a different disk than either the database data files or the database transaction logs.

• We recommend a fill factor of about 70% for the four VPX_HIST_STAT tables (vpx_hist_stat1, vpx_hist_stat2, vpx_hist_stat3, and vpx_hist_stat4). If the fill factor is set too high, the server must take time splitting pages when they fill up. If the fill factor is set too low, the database will be larger than necessary due to the unused space on each page, thus increasing the number of pages that need to be read during normal operations.

Oracle Database Recommendations

If you are using an Oracle database, the following points can improve vCenter Server performance:

• When using Automatic Memory Management (AMM) in Oracle 11g, or Automatic Shared Memory Management (ASMM) in Oracle 10g, allocate sufficient memory for the Oracle database.

• Set appropriate PROCESSES or SESSIONS initialization parameters. Oracle creates a new server process for every new connection that is made to it. The number of connections an application can make to the Oracle instance thus depends on how many processes Oracle can create. PROCESSES and SESSIONS together determine how many simultaneous connections Oracle can accept. In large vSphere environments (as defined in vSphere Installation and Setup for vSphere 5.0) we recommend setting PROCESSES to 800.

• If database operations are slow, after checking that the statistics are up to date and the indexes are not fragmented, you should move the indexes to separate tablespaces (i.e., place tables and primary key (PK) constraint index on one tablespace and the other indexes (i.e., BTree) on another tablespace).

• For large inventories (i.e., those that approach the limits for the number of hosts or virtual machines), increase the db_writer_processes parameter to 4.

VMware vMotion and Storage vMotion

VMware vMotion

ESXi 5.0 introduces virtual hardware version 8. Because virtual machines running on hardware version 8 can’t run on prior versions of ESX/ESXi, such virtual machines can be moved using VMware vMotion only to other ESXi 5.0 hosts. ESXi 5.0 is also compatible with virtual machines running on virtual hardware version 7 and earlier, however, and these machines can be moved using VMware vMotion to ESX/ESXi 4.x hosts.

vMotion performance will increase as additional network bandwidth is made available to the vMotion network. Consider provisioning 10Gb vMotion network interfaces for maximum vMotion performance.

Multiple vMotion vmknics, a new feature in ESXi 5.0, can provide a further increase in network bandwidth available to vMotion. All vMotion vmknics on a host should share a single vSwitch. Each vmknic's portgroup should be configured to leverage a different physical NIC as its active vmnic. In addition, all vMotion vmknics should be on the same vMotion network.

While a vMotion operation is in progress, ESXi opportunistically reserves CPU resources on both the source and destination hosts in order to ensure the ability to fully utilize the network bandwidth. ESXi will attempt to use the full available network bandwidth regardless of the number of vMotion operations being performed. The amount of CPU reservation thus depends on the number of vMotion NICs and their speeds: 10% of a processor core for each 1Gb network interface, 100% of a processor core for each 10Gb network interface, and a minimum total reservation of 30% of a processor core. Therefore leaving some unreserved CPU capacity in a cluster can help ensure that vMotion tasks get the resources required in order to fully utilize available network bandwidth.

vMotion performance could be reduced if host-level swap files are placed on local storage (whether SSD or hard drive).
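As a worked example of the vMotion CPU reservation rule quoted above (10% of a core per 1Gb interface, 100% of a core per 10Gb interface, minimum 30% in total), the following sketch computes the reservation for a few NIC combinations. The function name is hypothetical and the calculation is only a reading of the rule as stated, not a description of how ESXi implements it.

```python
def vmotion_cpu_reservation_pct(nics_1gb, nics_10gb):
    """Return the vMotion CPU reservation as a percentage of one core.

    Rule as stated in the text: 10% per 1Gb NIC, 100% per 10Gb NIC,
    with a minimum total reservation of 30%.
    """
    if nics_1gb < 0 or nics_10gb < 0 or nics_1gb + nics_10gb == 0:
        raise ValueError("need at least one vMotion NIC")
    total = 10 * nics_1gb + 100 * nics_10gb
    return max(total, 30)


for config in [(1, 0), (2, 0), (4, 0), (0, 1), (2, 2)]:
    pct = vmotion_cpu_reservation_pct(*config)
    print(f"{config[0]} x 1Gb, {config[1]} x 10Gb -> {pct}% of a core")
```

Note that with one to three 1Gb interfaces the 30% minimum applies, so the per-NIC figure only matters from the fourth 1Gb interface onward.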
VMware Storage vMotion

VMware Storage vMotion performance depends strongly on the available storage infrastructure bandwidth between the ESXi host where the virtual machine is running and both the source and destination datastores. During a Storage vMotion operation the virtual disk to be moved is being read from the source datastore and written to the destination datastore. At the same time the virtual machine continues to read from and write to the source datastore while also writing to the destination datastore. This additional traffic takes place on storage that might also have other I/O loads (from other virtual machines on the same ESXi host or from other hosts) that can further reduce the available bandwidth.

Storage vMotion will have the highest performance during times of low storage activity (when available storage bandwidth is highest) and when the workload in the virtual machine being moved is least active.

During a Storage vMotion operation, the benefits of moving to a faster datastore will be seen only when the migration has completed. However, the impact of moving to a slower datastore will gradually be felt as the migration progresses.

Storage vMotion will often have significantly better performance on VAAI-capable storage arrays.

VMware Distributed Resource Scheduler (DRS)

Cluster Configuration Settings

When deciding which hosts to group into DRS clusters, try to choose hosts that are as homogeneous as possible in terms of CPU and memory. This improves performance predictability and stability. When heterogeneous systems have compatible CPUs, but have different CPU frequencies and/or amounts of memory, DRS generally prefers to locate virtual machines on the systems with more memory and higher CPU frequencies (all other things being equal), since those systems have more capacity to accommodate peak loads.

VMware vMotion is not supported across hosts with incompatible CPUs. Hence with ‘incompatible CPU’ heterogeneous systems, the opportunities DRS has to improve the load balance across the cluster are limited. You can also use Enhanced vMotion Compatibility (EVC) to facilitate vMotion between different CPU generations.

The more vMotion compatible ESXi hosts DRS has available, the more choices it has to better balance the DRS cluster.

Virtual machines with smaller memory sizes and/or fewer vCPUs provide more opportunities for DRS to migrate them in order to improve balance across the cluster. Virtual machines with larger memory sizes and/or more vCPUs add more constraints in migrating the virtual machines. This is one more reason to configure virtual machines with only as many vCPUs and only as much virtual memory as they need.

Have virtual machines in DRS automatic mode when possible, as they are considered for cluster load balancing migrations across the ESXi hosts before the virtual machines that are not in automatic mode.

Powered-on virtual machines consume memory resources, and typically consume some CPU resources, even when idle. Thus even idle virtual machines, though their utilization is usually small, can affect DRS decisions. For this and other reasons, a marginal performance increase might be obtained by shutting down or suspending virtual machines that are not being used.

Resource pools help improve manageability and troubleshooting of performance problems. We recommend, however, that resource pools and virtual machines not be made siblings in a hierarchy. Instead, each level should contain only resource pools or only virtual machines.

DRS affinity rules can keep two or more virtual machines on the same ESXi host (“VM/VM affinity”) or make sure they are always on different hosts (“VM/VM anti-affinity”). DRS affinity rules can also be used to make sure a group of virtual machines runs only on (or has a preference for) a specific group of ESXi hosts (“VM/Host affinity”) or never runs on (or has a preference against) a specific group of hosts (“VM/Host anti-affinity”).

In most cases leaving the affinity settings unchanged will provide the best results. In rare cases, however, specifying affinity rules can help improve performance. To change affinity settings, select a cluster from within the vSphere Client, choose the Summary tab, click Edit Settings, choose Rules, click Add, enter a name for the new rule, choose a rule type, and proceed through the GUI as appropriate for the rule type you selected. Besides the default setting, the affinity setting types are:

• Keep Virtual Machines Together: This affinity type can improve performance due to lower latencies of communication between machines.

• Separate Virtual Machines: This affinity type can maintain maximal availability of the virtual machines. For instance, if they are both web server front ends to the same application, you might want to make sure that they don't both go down at the same time. Also co-location of I/O intensive virtual machines could end up saturating the host I/O capacity, leading to performance degradation. DRS currently does not make virtual machine placement decisions based on their I/O resources usage.

• Virtual Machines to Hosts (including Must run on..., Should run on..., Must not run on..., and Should not run on...): These affinity types can be useful for clusters with software licensing restrictions or specific availability zone requirements.

To allow DRS the maximum flexibility:

• Place virtual machines on shared datastores accessible from all hosts in the cluster.

• Make sure virtual machines are not connected to host devices that would prevent them from moving off of those hosts.

The drmdump files produced by DRS can be very useful in diagnosing potential DRS performance issues during a support call. For particularly active clusters, or those with more than about 16 hosts, it can be helpful to keep more such files than can fit in the default maximum drmdump directory size of 20MB. This maximum can be increased using the DumpSpace option, which can be set using DRS Advanced Options.

Cluster Sizing and Resource Settings

Exceeding the maximum number of hosts, virtual machines, or resource pools for each DRS cluster specified in Configuration Maximums for VMware vSphere 5.0 is not supported. Even if it seems to work, doing so could adversely affect vCenter Server or DRS performance.

Carefully select the resource settings (that is, reservations, shares, and limits) for your virtual machines:

• Setting reservations too high can leave few unreserved resources in the cluster, thus limiting the options DRS has to balance load.

• Setting limits too low could keep virtual machines from using extra resources available in the cluster to improve their performance.

Use reservations to guarantee the minimum requirement a virtual machine needs, rather than what you might like it to get.
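The interplay of reservations, shares, and limits described in this section can be sketched with a simplified allocation model: reservations are satisfied first, and whatever is left is divided in proportion to shares without exceeding any virtual machine's limit. This is an illustrative model only, not the ESXi scheduler; the function name, the VM names, and the MHz figures are all made up for the example.

```python
def allocate(capacity, vms):
    """Split `capacity` (for example MHz) across virtual machines.

    `vms` maps a VM name to a dict with "reservation", "shares" and "limit".
    Simplified model: reservations first, then the remainder in proportion
    to shares, never exceeding a VM's limit. Shares must be positive.
    """
    alloc = {name: vm["reservation"] for name, vm in vms.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")

    # VMs that can still accept more resources.
    active = {name for name, vm in vms.items() if alloc[name] < vm["limit"]}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[name]["shares"] for name in active)
        handed_out = 0.0
        for name in list(active):
            offer = remaining * vms[name]["shares"] / total_shares
            room = vms[name]["limit"] - alloc[name]
            if offer >= room:
                # The VM hits its limit; the surplus is redistributed next round.
                alloc[name] = vms[name]["limit"]
                handed_out += room
                active.discard(name)
            else:
                alloc[name] += offer
                handed_out += offer
        remaining -= handed_out
    return alloc


vms = {
    "db": {"reservation": 1000, "shares": 2000, "limit": 6000},
    "web": {"reservation": 500, "shares": 1000, "limit": 6000},
    "batch": {"reservation": 0, "shares": 1000, "limit": 1000},
}
for name, mhz in allocate(6000, vms).items():
    print(f"{name}: {mhz:.0f} MHz")
```

In this example the "batch" VM is capped by its limit, which shows how a limit set too low keeps a virtual machine from using capacity that would otherwise be available to it.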

Note that shares take effect only when there is resource contention. Note also that additional resources reserved for virtual machine memory overhead need to be accounted for when sizing resources in the cluster.

If the overall cluster capacity might not meet the needs of all virtual machines during peak hours, you can assign relatively higher shares to virtual machines or resource pools hosting mission-critical applications to reduce the performance interference from less-critical virtual machines.

If you will be using vMotion, it’s a good practice to leave some unused CPU capacity in your cluster. As described in “VMware vMotion” above, when a vMotion operation is started, ESXi reserves some CPU resources for that operation.

DRS Performance Tuning

The migration threshold for fully automated DRS (cluster > DRS tab > Edit... > vSphere DRS) allows the administrator to control the aggressiveness of the DRS algorithm. The migration threshold should be set to more aggressive levels when the following conditions are satisfied:

• If the hosts in the cluster are relatively homogeneous.

• If the virtual machines' resource utilization does not vary much over time and you have relatively few constraints on where a virtual machine can be placed.

The migration threshold should be set to more conservative levels in the converse situations.

NOTE If the most conservative threshold is chosen, DRS will only apply move recommendations that must be taken either to satisfy hard constraints, such as affinity or anti-affinity rules, or to evacuate virtual machines from a host entering maintenance or standby mode.

VMware Distributed Power Management (DPM)

VMware Distributed Power Management (DPM) conserves power by migrating virtual machines to fewer hosts when utilizations are low. DPM is most appropriate for clusters in which composite virtual machine demand varies greatly over time; for example, clusters in which overall demand is higher during the day and significantly lower at night. If demand is consistently high relative to overall cluster capacity DPM will have little opportunity to put hosts into standby mode to save power.

Because DPM uses DRS, most DRS best practices (described in “VMware Distributed Resource Scheduler (DRS)” above) are relevant to DPM as well.

DPM considers historical demand in determining how much capacity to keep powered on and keeps some excess capacity available for changes in demand. DPM will also power on additional hosts when needed for unexpected increases in the demand of existing virtual machines or to allow virtual machine admission.

The aggressiveness of the DPM algorithm can be tuned by adjusting the DPM Threshold in the cluster settings menu. This parameter controls how far outside the target utilization range per-host resource utilization can be before DPM makes host power-on/power-off recommendations. The default setting for the threshold is 3 (medium aggressiveness).

For datacenters that often have unexpected spikes in virtual machine resource demands, you can use the DPM advanced option MinPoweredOnCpuCapacity (default 1 MHz) or MinPoweredOnMemCapacity (default 1 MB) to ensure that a minimum amount of CPU or memory capacity is kept on in the cluster.

DPM can be disabled on individual hosts that are running mission-critical virtual machines, and the VM/Host affinity rules can be used to ensure that these virtual machines are not migrated away from these hosts.

DPM can be enabled or disabled on a predetermined schedule using Scheduled Tasks in vCenter Server. When DPM is disabled, all hosts in a cluster will be powered on. This might be useful, for example, to reduce the delay in responding to load spikes expected at certain times of the day or to reduce the likelihood of some hosts being left in standby for extended periods.

In a cluster with VMware High Availability (HA) enabled, DRS/DPM maintains excess powered-on capacity to meet the High Availability settings. The cluster might therefore not allow additional virtual machines to be powered on and/or some hosts might not be powered down even when the cluster appears to be sufficiently idle. These factors should be considered when configuring HA.

If VMware HA is enabled in a cluster, DPM always keeps a minimum of two hosts powered on. This is true even if HA admission control is disabled or if no virtual machines are powered on.

VMware Storage Distributed Resource Scheduler (Storage DRS)

A new feature in vSphere 5.0, Storage Distributed Resource Scheduler (Storage DRS), provides I/O load balancing across datastores within a datastore cluster (a new vCenter object). This load balancing can avoid storage performance bottlenecks or address them if they occur.

When deciding which datastores to group into a datastore cluster, try to choose datastores that are as homogeneous as possible in terms of host interface protocol (i.e., FCP, iSCSI, NFS), RAID level, and performance characteristics. We recommend not mixing SSD and hard disks in the same datastore cluster.

While a datastore cluster can have as few as two datastores, the more datastores a datastore cluster has, the more flexibility Storage DRS has to better balance that cluster’s I/O load.

As you add workloads you should monitor datastore I/O latency in the performance chart for the datastore cluster, particularly during peak hours. If most or all of the datastores in a datastore cluster consistently operate with latencies close to the congestion threshold used by Storage I/O Control (set to 30ms by default, but sometimes tuned to reflect the needs of a particular deployment), this might be an indication that there aren't enough spare I/O resources left in the datastore cluster. In this case, consider adding more datastores to the datastore cluster or reducing the load on that datastore cluster.

NOTE Make sure, when adding more datastores to increase I/O resources in the datastore cluster, that your changes do actually add resources, rather than simply creating additional ways to access the same underlying physical disks.

By default, Storage DRS affinity rules keep all of a virtual machine’s virtual disks on the same datastore (using intra-VM affinity). However you can give Storage DRS more flexibility in I/O load balancing, potentially increasing performance, by overriding the default intra-VM affinity rule. This can be done for either a specific virtual machine (from the vSphere Client, select Edit Settings > Virtual Machine Settings, then deselect Keep VMDKs together) or for the entire datastore cluster (from the vSphere Client, select Home > Inventory > Datastore and Datastore Clusters, select a datastore cluster, select the Storage DRS tab, click Edit, select Virtual Machine Settings, then deselect Keep VMDKs together).

Inter-VM anti-affinity rules can be used to keep the virtual disks from two or more different virtual machines from being placed on the same datastore, potentially improving performance in some situations. They can be used, for example, to separate the storage I/O of multiple workloads that tend to have simultaneous but intermittent peak loads, preventing those peak loads from combining to stress a single datastore.

VMware High Availability

VMware High Availability (HA) minimizes virtual machine downtime by monitoring hosts, virtual machines, or applications within virtual machines, then, in the event a failure is detected, restarting virtual machines on alternate hosts.

• When vSphere HA is enabled in a cluster, all active hosts (those not in standby mode, maintenance mode, or disconnected) participate in an election to choose the master host for the cluster; all other hosts become slaves. The master has a number of responsibilities, including monitoring the state of the hosts in the cluster, protecting the powered-on virtual machines, initiating failover, and reporting cluster health state to vCenter Server. The master is elected based on the properties of the hosts, with preference being given to the one connected to the greatest number of datastores. Serving in the role of master will have little or no effect on a host’s performance.

• When the master host can’t communicate with a slave host over the management network, the master uses datastore heartbeating to determine the state of that slave host. By default, vSphere HA uses two datastores for heartbeating, resulting in very low false failover rates. In order to reduce the chances of false failover even further (at the potential cost of a very slight performance impact), you can use the advanced option das.heartbeatdsperhost to change the number of datastores (up to a maximum of five).

• Enabling HA on a host reserves some host resources for HA agents, slightly reducing the available host capacity for powering on virtual machines.

• When HA is enabled, the vCenter Server reserves sufficient unused resources in the cluster to support the failover capacity specified by the chosen admission control policy. This can reduce the number of virtual machines the cluster can support.

VMware Fault Tolerance

For each virtual machine there are two FT-related actions that can be taken: turning on or off FT and enabling or disabling FT. “Turning on FT” prepares the virtual machine for FT by prompting for the removal of unsupported devices, disabling unsupported features, and setting the virtual machine’s memory reservation to be equal to its memory size (thus avoiding ballooning or swapping). “Enabling FT” performs the actual creation of the secondary virtual machine by live-migrating the primary. Each of these operations has performance implications.

• Don’t turn on FT for a virtual machine unless you will be using (i.e., Enabling) FT for that machine. Turning on FT automatically disables some features for the specific virtual machine that can help performance, such as hardware virtual MMU (if the processor supports it).

• Enabling FT for a virtual machine uses additional resources (for example, the secondary virtual machine uses as much CPU and memory as the primary virtual machine). Therefore make sure you are prepared to devote the resources required before enabling FT.

The live migration that takes place when FT is enabled can briefly saturate the vMotion network link and can also cause spikes in CPU utilization.

• If the vMotion network link is also being used for other operations, such as FT logging (transmission of all the primary virtual machine’s inputs, such as incoming network traffic and disk reads, to the secondary host), the performance of those other operations can be impacted. For this reason it is best to have separate and dedicated NICs (or use Network I/O Control, described in “Network I/O Control (NetIOC)”) for FT logging traffic and vMotion, especially when multiple FT virtual machines reside on the same host.

• Because this potentially resource-intensive live migration takes place each time FT is enabled, we recommend that FT not be frequently enabled and disabled.

• FT-enabled virtual machines must use eager-zeroed thick-provisioned virtual disks. Thus when FT is enabled for a virtual machine with thin-provisioned virtual disks or lazy-zeroed thick-provisioned virtual disks, these disks need to be converted. This one-time conversion process uses fewer resources when the virtual machine is on storage hardware that supports VAAI (described in “Hardware Storage Considerations”).

• Because FT logging traffic is asymmetric (the majority of the traffic flows from primary to secondary), congestion on the logging NIC can be reduced by distributing primaries onto multiple hosts. For example on a cluster with two ESXi hosts and two virtual machines with FT enabled, placing one of the primary virtual machines on each of the hosts allows the network bandwidth to be utilized bidirectionally.

• FT virtual machines that receive large amounts of network traffic or perform lots of disk reads can consume significant bandwidth on the NIC specified for the logging traffic. This is true of machines that routinely do these things as well as machines doing them only intermittently, such as during a backup operation. To avoid saturating the network link used for logging traffic, limit the number of FT virtual machines on each host or limit disk read bandwidth and network receive bandwidth of those virtual machines.

• Make sure the FT logging traffic is carried by at least a Gigabit-rated NIC (which should in turn be connected to at least Gigabit-rated network infrastructure).

• Avoid placing more than four FT-enabled virtual machines on a single host. In addition to reducing the possibility of saturating the network link used for logging traffic, this also limits the number of simultaneous live-migrations needed to create new secondary virtual machines in the event of a host failure.

If the secondary virtual machine lags too far behind the primary (which usually happens when the primary virtual machine is CPU bound and the secondary virtual machine is not getting enough CPU cycles), the hypervisor might slow the primary to allow the secondary to catch up. The following recommendations help avoid this situation:

• Make sure the hosts on which the primary and secondary virtual machines run are relatively closely matched, with similar CPU make, model, and frequency.

• Make sure that power management scheme settings (both in the BIOS and in ESXi) that cause CPU frequency scaling are consistent between the hosts on which the primary and secondary virtual machines run.

• Enable CPU reservations for the primary virtual machine (which will be duplicated for the secondary virtual machine) to ensure that the secondary gets CPU cycles when it requires them.

Though timer interrupt rates do not significantly affect FT performance, high timer interrupt rates create additional network traffic on the FT logging NICs. Therefore, if possible, reduce timer interrupt rates as described in “Guest Operating System CPU Considerations”.

VMware vCenter Update Manager

VMware vCenter Update Manager provides a patch management framework for VMware vSphere. It can be used to apply patches, updates, and upgrades to VMware ESX and ESXi hosts, VMware Tools and virtual hardware, and so on.

Update Manager Setup and Configuration

• When there are more than 300 virtual machines or more than 30 hosts, separate the Update Manager database from the vCenter Server database.

• When there are more than 1000 virtual machines or more than 100 hosts, separate the Update Manager server from the vCenter Server and the Update Manager database from the vCenter Server database.

• Allocate separate physical disks for the Update Manager patch store and the Update Manager database.

• To reduce network latency and packet drops, keep to a minimum the number of network hops between the Update Manager server system and the ESXi hosts.

• In order to cache frequently used patch files in memory, make sure the Update Manager server host has at least 2GB of RAM.

Update Manager General Recommendations

• When viewing compliance for all attached baselines, latency increases linearly with the number of attached baselines. We therefore recommend the removal of unused baselines, especially when the inventory size is large.

• Upgrading VMware Tools is faster if the virtual machine is already powered on. Otherwise, Update Manager must power on the virtual machine before the VMware Tools upgrade, which could increase the overall latency.

• Upgrading virtual machine hardware is faster if the virtual machine is already powered off. Otherwise, Update Manager must power off the virtual machine before upgrading the virtual hardware, which could increase the overall latency.

NOTE Because VMware Tools must be up to date before virtual hardware is upgraded, Update Manager might need to upgrade VMware Tools before upgrading virtual hardware. In such cases the process is faster if the virtual machine is already powered on.

Update Manager Cluster Remediation

• Limiting the remediation concurrency level (i.e., the maximum number of hosts that can be simultaneously updated) to half the number of hosts in the cluster can reduce vMotion intensity, often resulting in better overall host remediation performance. (This option can be set using the cluster remediate wizard.)

• When all hosts in a cluster are ready to enter maintenance mode (that is, they have no virtual machines powered on), concurrent host remediation will typically be faster than sequential host remediation.

• Cluster remediation is most likely to succeed when the cluster is no more than 80% utilized. Thus for heavily used clusters, cluster remediation is best performed during off-peak periods, when utilization drops below 80%. If this is not possible, it is best to suspend or power off some virtual machines before the operation is begun.
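The two numeric guidelines above (concurrency capped at half the hosts, and the 80% utilization ceiling) can be captured in a small planning helper. This is only a sketch: the function name `remediation_plan` and its inputs are hypothetical and are not part of Update Manager.

```python
def remediation_plan(total_hosts, cluster_utilization_pct):
    """Return (max_concurrent_hosts, proceed_now) for a cluster remediation.

    Hypothetical helper that only encodes the guidelines from the text.
    """
    if total_hosts < 1:
        raise ValueError("cluster must contain at least one host")
    # Guideline: limit concurrency to half the number of hosts in the cluster.
    max_concurrent = max(1, total_hosts // 2)
    # Guideline: remediation is most likely to succeed at no more than 80% utilization.
    proceed_now = cluster_utilization_pct <= 80
    return max_concurrent, proceed_now


print(remediation_plan(8, 65))  # lightly loaded 8-host cluster
print(remediation_plan(5, 92))  # heavily used 5-host cluster
```

When the second value is false, the text suggests waiting for an off-peak period or suspending or powering off some virtual machines first.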

Update Manager Bandwidth Throttling

• During remediation or staging operations, hosts download patches. On slow networks you can prevent network congestion by configuring hosts to use bandwidth throttling. By allocating comparatively more bandwidth to some hosts, those hosts can more quickly finish remediation or staging.

• To ensure that network bandwidth is allocated as expected, the sum of the bandwidth allocated to multiple hosts on a single network link should not exceed the bandwidth of that link. Otherwise, the hosts will attempt to utilize bandwidth up to their allocation, resulting in bandwidth utilization that might not be proportional to the configured allocations.

• Bandwidth throttling applies only to hosts that are downloading patches. If a host is not in the process of patch downloading, any bandwidth throttling configuration on that host will not affect the bandwidth available in the network link.
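The allocation rule above can be checked before configuring throttling. The following is a minimal sketch; the host names, the numbers and the function `check_throttle_allocations` are illustrative assumptions and not an Update Manager interface.

```python
def check_throttle_allocations(link_mbps, host_allocations_mbps):
    """Return (fits, total): whether the per-host allocations fit on the shared link."""
    total = sum(host_allocations_mbps.values())
    # Rule from the text: the sum of the allocations should not exceed the link bandwidth.
    return total <= link_mbps, total


allocations = {"esx01": 400, "esx02": 300, "esx03": 200}  # hypothetical hosts, Mbps
fits, total = check_throttle_allocations(1000, allocations)
print(fits, total)
```

If the check fails, the configured proportions are unlikely to be honored, because each host will try to use bandwidth up to its own allocation.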

11. VMware vSphere Distributed Switch Best Practices

Design Considerations

The following three main aspects influence the design of a virtual network infrastructure:

1. Customer’s infrastructure design goals

2. Customer’s infrastructure component configurations

3. Virtual infrastructure traffic requirements

Let’s take a look at each of these aspects in a little more detail.

Infrastructure Design Goals

Customers want their network infrastructure to be available 24/7, to be secure from any attacks, to perform efficiently throughout day-to-day operations, and to be easy to maintain. In the case of a virtualized environment, these requirements become increasingly demanding as growing numbers of business-critical applications run in a consolidated setting. These requirements on the infrastructure translate into design decisions that should incorporate the following best practices for a virtual network infrastructure:

• Avoid any single point of failure in the network

• Isolate each traffic type for increased resiliency and security

• Make use of traffic management and optimization capabilities

Infrastructure Component Configurations

In every customer environment, the utilized compute and network infrastructures differ in terms of configuration, capacity and feature capabilities. These different infrastructure component configurations influence the virtual network infrastructure design decisions. The following are some of the configurations and features that administrators must look out for:

• Server configuration: rack or blade servers

• Network adapter configuration: 1GbE or 10GbE network adapters, number of available adaptors, offload function of these adaptors if any

• Physical network switch infrastructure capabilities: switch clustering

It is impossible to cover all the different virtual network infrastructure design deployments based on the various combinations of type of servers, network adaptors and network switch capability parameters. In this paper, the following four commonly used deployments that are based on standard rack server and blade server configurations are described:

• Rack server with eight 1GbE network adaptors

• Rack server with two 10GbE network adaptors

• Blade server with two 10GbE network adapters

• Blade server with hardware-assisted multiple logical Ethernet network adaptors

It is assumed that the network switch infrastructure has standard layer 2 switch features (high availability, redundant paths, fast convergence, port security) available to provide reliable, secure and scalable connectivity to the server infrastructure.

Virtual Infrastructure Traffic

vSphere virtual network infrastructure carries different traffic types. To manage the virtual infrastructure traffic effectively, vSphere and network administrators must understand the different traffic types and their characteristics. The following are the key traffic types that flow in the vSphere infrastructure, along with their traffic characteristics:

Management traffic: This traffic flows through a vmknic and carries VMware ESXi host-to-VMware vCenter configuration and management communication as well as ESXi host-to-ESXi host high availability (HA)-related communication. This traffic has low network utilization but has very high availability and security requirements.

VMware vSphere vMotion traffic: With advancement in vMotion technology, a single vMotion instance can consume almost a full 10Gb of bandwidth. A maximum of eight simultaneous vMotion instances can be performed on a 10Gb uplink; four simultaneous vMotion instances are allowed on a 1Gb uplink. vMotion traffic has very high network utilization and can be bursty at times. Customers must make sure that vMotion traffic doesn’t impact other traffic types, because it might consume all available I/O resources. Another property of vMotion traffic is that it is not sensitive to throttling and makes a very good candidate on which to perform traffic management.

Fault-tolerant traffic: When VMware Fault Tolerance (FT) logging is enabled for a virtual machine, all the logging traffic is sent to the secondary fault-tolerant virtual machine over a designated vmknic port. This process can require a considerable amount of bandwidth at low latency because it replicates the I/O traffic and memory-state information to the secondary virtual machine.

iSCSI/NFS traffic: IP storage traffic is carried over vmknic ports. This traffic varies according to disk I/O requests. With end-to-end jumbo frame configuration, more data is transferred with each Ethernet frame, decreasing the number of frames on the network. This larger frame reduces the overhead on server/targets and improves the IP storage performance. On the other hand, congested and lower-speed networks can cause latency issues that disrupt access to IP storage. It is recommended that users provide a high-speed path for IP storage and avoid any congestion in the network infrastructure.

Virtual machine traffic: Depending on the workloads that are running on the guest virtual machine, the traffic patterns will vary from low to high network utilization. Some of the applications running in virtual machines might be latency sensitive as is the case with VOIP workloads.
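The jumbo frame effect noted for iSCSI/NFS traffic can be illustrated with simple arithmetic. The sketch below assumes a 20-byte IPv4 header and a 20-byte TCP header with no options inside each frame, and ignores iSCSI/NFS protocol headers; these assumptions and the function name are for illustration only.

```python
IP_TCP_HEADER_BYTES = 40  # assumption: 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(payload_bytes, mtu):
    """Number of Ethernet frames needed to carry payload_bytes at the given MTU."""
    per_frame = mtu - IP_TCP_HEADER_BYTES  # usable payload per frame
    return -(-payload_bytes // per_frame)  # integer ceiling division


one_mib = 1024 * 1024
print(frames_needed(one_mib, 1500))  # standard MTU
print(frames_needed(one_mib, 9000))  # jumbo frames
```

Under these assumptions, moving 1 MiB takes roughly six times fewer frames with a 9000-byte MTU, which is where the reduced per-frame overhead on servers and targets comes from.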

Table 1 summarizes the characteristics of each traffic type.

To understand the different traffic flows in the physical network infrastructure, network administrators use network traffic management tools. These tools help monitor the physical infrastructure traffic but do not provide visibility into virtual infrastructure traffic. With the release of vSphere 5, VDS now supports the NetFlow feature, which enables exporting the internal (virtual machine-to-virtual machine) virtual infrastructure flow information to standard network management tools. Administrators now have the required visibility into virtual infrastructure traffic. This helps administrators monitor the virtual network infrastructure traffic through a familiar set of network management tools. Customers should make use of the network data collected from these tools during the capacity planning or network design exercises.

Example Deployment Components

After looking at the different design considerations, this section provides a list of components that are used in an example deployment. This example deployment helps illustrate some standard VDS design approaches. The following are some common components in the virtual infrastructure. The list doesn’t include storage components that are required to build the virtual infrastructure. It is assumed that customers will deploy IP storage in this example deployment.

Hosts

Four ESXi hosts provide compute, memory and network resources according to the configuration of the hardware. Customers can have different numbers of hosts in their environment, based on their needs. One VDS can span across 350 hosts. This capability to support large numbers of hosts provides the required scalability to build a private or public cloud environment using VDS (excellent use case).

Clusters

A cluster is a collection of ESXi hosts and associated virtual machines with shared resources. Customers can have as many clusters in their deployment as are required. With one VDS spanning across 350 hosts, customers have the flexibility of deploying multiple clusters with a different number of hosts in each cluster. For simple illustration purposes, two clusters with two hosts each are considered in this example deployment. One cluster can have a maximum of 32 hosts.

VMware vCenter Server

VMware vCenter Server centrally manages a vSphere environment. Customers can manage VDS through this centralized management tool, which can be deployed on a virtual machine or a physical host. The vCenter Server system is not shown in the diagrams, but customers should assume that it is present in this example deployment. It is used only to provision and manage VDS configuration. When provisioned, hosts and virtual machine networks operate independently of vCenter Server. All components required for network switching reside on ESXi hosts. Even if the vCenter server system fails, the hosts and virtual machines will still be able to communicate.

Network Infrastructure

Physical network switches in the access and aggregation layer provide connectivity between ESXi hosts and to the external world. These network infrastructure components support standard layer 2 protocols providing secure and reliable connectivity. Along with the preceding four components of the physical infrastructure in this example deployment, some of the virtual infrastructure traffic types are also considered during the design. The following section describes the different traffic types in the example deployment.

Virtual Infrastructure Traffic Types

In this example deployment, there are standard infrastructure traffic types, including iSCSI, vMotion, FT, management and virtual machine. Customers might have other traffic types in their environment, based on their choice of storage infrastructure (FC, NFS, FCoE). Figure 1 shows the different traffic types along with associated port groups on an ESXi host. It also shows the mapping of the network adapters to the different port groups.

Important Virtual and Physical Switch Parameters

Before going into the different design options in the example deployment, let’s take a look at the virtual and physical network switch parameters that should be considered in all of the design options. There are some key parameters that vSphere and network administrators must take into account when designing VMware virtual networking. Because the configuration of virtual networking goes hand in hand with physical network configuration, this section will cover both the virtual and physical switch parameters.

VDS Parameters

VDS simplifies the challenges of the configuration process by providing one single pane of glass to perform virtual network management tasks. As opposed to configuring a vSphere standard switch (VSS) on each individual host, administrators can configure and manage one single VDS. All centrally configured network policies on VDS get pushed down to the host automatically when the host is added to the distributed switch. In this section, an overview of key VDS parameters is provided.

Host Uplink Connections (vmnics) and dvuplink Parameters

VDS has a new abstraction, called dvuplink, for the physical Ethernet network adaptors (vmnics) on each host. It is defined during the creation of the VDS and can be considered as a template for individual vmnics on each host. All the properties, including network adaptor teaming, load balancing and failover policies on VDS and dvportgroups, are configured on dvuplinks. These dvuplink properties are automatically applied to vmnics on individual hosts when a host is added to the VDS and when each vmnic on the host is mapped to a dvuplink.

This dvuplink abstraction therefore provides the advantage of consistently applying teaming and failover configurations to all the hosts’ physical Ethernet network adaptors (vmnics).

Figure 2 shows two ESXi hosts with four Ethernet network adaptors each. When these hosts are added to the VDS, with four dvuplinks configured on a dvuplink port group, administrators must assign the network adaptors (vmnics) of the hosts to dvuplinks. To illustrate the mapping of the dvuplinks to vmnics, Figure 2 shows one type of mapping, where each ESXi host’s vmnic0 is mapped to dvuplink1, vmnic1 to dvuplink2 and so on. Customers can choose a different mapping if required, where vmnic0 can be mapped to a different dvuplink instead of dvuplink1. VMware recommends having consistent mapping across different hosts because it reduces complexity in the environment.

Figure 2. dvuplink-to-vmnic Mapping

As a best practice, customers should also try to deploy hosts with the same number of physical Ethernet network adaptors with similar port speeds. Also, because the number of dvuplinks on VDS depends on the maximum number of physical Ethernet network adaptors on a host, administrators should take that into account during dvuplink port group configuration. Customers always have an option to modify this dvuplink configuration based on the new hardware capabilities.
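The recommendation to keep the dvuplink-to-vmnic mapping consistent across hosts can be verified with a simple comparison against a reference mapping. The data structure and the function `inconsistent_hosts` below are hypothetical and do not use any vSphere API; in practice the mappings would come from the VDS configuration.

```python
def inconsistent_hosts(reference, host_mappings):
    """Return the hosts whose vmnic-to-dvuplink mapping differs from the reference."""
    return sorted(host for host, mapping in host_mappings.items() if mapping != reference)


# Reference mapping as in Figure 2: vmnic0 -> dvuplink1, vmnic1 -> dvuplink2, and so on.
reference = {f"vmnic{i}": f"dvuplink{i + 1}" for i in range(4)}

hosts = {
    "esx01": dict(reference),
    # Hypothetical host where vmnic0 and vmnic1 were swapped by mistake.
    "esx02": {**reference, "vmnic0": "dvuplink2", "vmnic1": "dvuplink1"},
}
print(inconsistent_hosts(reference, hosts))
```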

Traffic Types and dvportgroup Parameters

Similar to port groups on standard switches, dvportgroups define how the connection is made through the VDS to the network. The VLAN ID, traffic shaping, port security, teaming and load balancing parameters are configured on these dvportgroups. The virtual ports (dvports) connected to a dvportgroup share the same properties configured on that dvportgroup. When customers want a group of virtual machines to share the security and teaming policies, they must make sure that the virtual machines are part of one dvportgroup. Customers can choose to define different dvportgroups based on the different traffic types they have in their environment or based on the different tenants or applications they support in the environment. If desired, multiple dvportgroups can share the same VLAN ID.

In this example deployment, the dvportgroup classification is based on the traffic types running in the virtual infrastructure. After administrators understand the different traffic types in the virtual infrastructure and identify specific security, reliability and performance requirements for individual traffic types, the next step is to create unique dvportgroups associated with each traffic type. As was previously mentioned, the dvportgroup configuration defined at VDS level is automatically pushed down to every host that is added to the VDS. For example, in Figure 2, the two dvportgroups, PG-A (yellow) and PG-B (green), defined at the distributed switch level are each available on each of the ESXi hosts that are part of that VDS.

dvportgroup Specific Configuration

After customers decide on the number of unique dvportgroups they want to create in their environment, they can start configuring them. The configuration options/parameters are similar to those available with port groups on vSphere standard switches. There are some additional options available on VDS dvportgroups that are related to teaming setup and are not available on vSphere standard switches. Customers can configure the following key parameters for each dvportgroup.

• Number of virtual ports (dvports)

• Port binding (static, dynamic, ephemeral)

• VLAN trunking/private VLANs

• Teaming and load balancing along with active and standby link

• Bidirectional traffic-shaping parameters

• Port security

As part of the teaming algorithm support, VDS provides a unique approach to load balancing traffic across the teamed network adaptors. This approach is called load-based teaming (LBT), which distributes the traffic across the network adaptors based on the percentage utilization of traffic on those adaptors. The LBT algorithm works on both ingress and egress direction of the network adaptor traffic, as opposed to the hashing algorithms that work only in egress direction (traffic flowing out of the network adaptor). Also, LBT prevents the worst-case scenario that might happen with hashing algorithms, where all traffic hashes to one network adaptor of the team while other network adaptors are not used to carry any traffic. To improve the utilization of all the links/network adaptors, VMware recommends the use of this advanced feature, LBT, of VDS. The LBT approach is recommended over EtherChannel on physical switches and route-based IP hash configuration on the virtual switch.
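The idea of balancing by utilization rather than by hash can be illustrated with a toy model. This is not VMware's LBT implementation: the 75% threshold, the per-flow loads and the function `rebalance` are assumptions chosen only to show how traffic that all landed on one uplink could be spread out.

```python
def rebalance(uplink_flows, capacity_mbps, threshold=0.75):
    """Move the smallest flows off any uplink loaded above threshold * capacity.

    Toy model of utilization-based balancing; flows are given as Mbps values.
    """
    limit = capacity_mbps * threshold
    flows = {name: sorted(values) for name, values in uplink_flows.items()}
    for name in list(flows):
        while sum(flows[name]) > limit and len(flows[name]) > 1:
            others = [n for n in flows if n != name]
            if not others:
                break
            target = min(others, key=lambda n: sum(flows[n]))
            smallest = flows[name][0]
            # Do not move a flow if that would overload the target uplink instead.
            if sum(flows[target]) + smallest > limit:
                break
            flows[target].append(flows[name].pop(0))
    return flows


# Worst case for hashing: every flow ended up on the same 1GbE uplink.
result = rebalance({"dvuplink7": [400, 300, 200], "dvuplink8": []}, capacity_mbps=1000)
print(result)
```

In this example the 200 Mbps flow is moved to the idle uplink, bringing the busy uplink back under the assumed threshold.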

Port security policies at port group level enable customer protection from certain activity that might compromise security. For example, a hacker might impersonate a virtual machine and gain unauthorized access by spoofing the virtual machine’s MAC address. VMware recommends setting the MAC address “Changes” and “Forged Transmits” to “Reject” to help protect against attacks launched by a rogue guest operating system. Customers should set the “Promiscuous Mode” to “Reject” unless they want to monitor the traffic for network troubleshooting or intrusion detection purposes.

NIOC

Network I/O control (NIOC) is the traffic management capability available on VDS. The NIOC concept revolves around resource pools that are similar in many ways to the ones existing for CPU and memory. vSphere and network administrators now can allocate I/O shares to different traffic types similarly to allocating CPU and memory resources to a virtual machine. The share parameter specifies the relative importance of a traffic type over other traffic and provides a guaranteed minimum when the other traffic competes for a particular network adaptor. The shares are specified in abstract units numbered 1 to 100. Customers can provision shares to different traffic types based on the amount of resources each traffic type requires.

This capability of provisioning I/O resources is very useful in situations where there are multiple traffic types competing for resources. For example, in a deployment where vMotion and virtual machine traffic types are flowing through one network adaptor, it is possible that vMotion activity might impact the virtual machine traffic performance. In this situation, shares configured in NIOC provide the required isolation to the vMotion and virtual machine traffic type and prevent one flow (traffic type) from dominating the other flow.

NIOC configuration provides one more parameter that customers can utilize if they want to put any limits on a particular traffic type. This parameter is called “the limit.” The limit configuration specifies the absolute maximum bandwidth for a traffic type on a host. The configuration of the limit parameter is specified in Mbps. NIOC limits and shares parameters work only on the outbound traffic, i.e., traffic that is flowing out of the ESXi host.

VMware recommends that customers utilize this traffic management feature whenever they have multiple traffic types flowing through one network adaptor, a situation that is more prominent with 10 Gigabit Ethernet (GbE) network deployments but can happen in 1GbE network deployments as well. The common use case for using NIOC in 1GbE network adaptor deployments is when the traffic from different workloads or different customer virtual machines is carried over the same network adaptor. As multiple-workload traffic flows through a network adaptor, it becomes important to provide I/O resources based on the needs of the workload. With the release of vSphere 5, customers now can make use of the new user-defined network resource pools capability and can allocate I/O resources to the different workloads or different customer virtual machines, depending on their needs. This user-defined network resource pools feature provides the granular control in allocating I/O resources and meeting the service-level agreement (SLA) requirements for the virtualized tier 1 workloads.

Bidirectional Traffic Shaping

Besides NIOC, there is another traffic-shaping feature that is available in the vSphere platform. It can be configured on a dvportgroup or dvport level. Customers can shape both inbound and outbound traffic using three parameters: average bandwidth, peak bandwidth and burst size. Customers who want more granular traffic-shaping controls to manage their traffic types can take advantage of this capability of VDS along with the NIOC feature. It is recommended that network administrators in your organization be involved while configuring these granular traffic parameters. These controls make sense only when there are oversubscription scenarios (caused by the oversubscribed physical switch infrastructure or virtual infrastructure) that are causing network performance issues. So it is very important to understand the physical and virtual network environment before making any bidirectional traffic-shaping configurations.
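The NIOC share and limit semantics described above can be worked through numerically. The sketch below is a simplification and not the actual NIOC scheduler: it assumes every listed traffic type is actively competing on one adaptor, divides the link in proportion to shares, then applies any limit, and it does not redistribute bandwidth freed by a limit. The share values and the function `nioc_allocation` are illustrative.

```python
def nioc_allocation(link_mbps, shares, limits_mbps=None):
    """Outbound bandwidth (Mbps) per traffic type under full contention."""
    limits_mbps = limits_mbps or {}
    total_shares = sum(shares.values())
    allocation = {}
    for traffic_type, share in shares.items():
        bandwidth = link_mbps * share / total_shares  # proportional to shares
        if traffic_type in limits_mbps:
            bandwidth = min(bandwidth, limits_mbps[traffic_type])  # absolute cap
        allocation[traffic_type] = bandwidth
    return allocation


shares = {"vmotion": 50, "virtual_machine": 100}  # hypothetical share values
print(nioc_allocation(10000, shares))
print(nioc_allocation(10000, shares, limits_mbps={"vmotion": 2000}))
```

With these example values, vMotion is held to about one third of a 10GbE adaptor while virtual machine traffic competes, and a 2000 Mbps limit caps it further regardless of contention.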

Physical Network Switch Parameters

The configurations of the VDS and the physical network switch should go hand in hand to provide resilient, secure and scalable connectivity to the virtual infrastructure. The following are some key switch configuration parameters the customer should pay attention to.

VLAN

If VLANs are used to provide logical isolation between different traffic types, it is important to make sure that those VLANs are carried over to the physical switch infrastructure. To do so, enable virtual switch tagging (VST) on the virtual switch, and trunk all VLANs to the physical switch ports. For security reasons, it is recommended that customers not use the VLAN ID 1 (default) for any VMware infrastructure traffic.

Spanning Tree Protocol

Spanning Tree Protocol (STP) is not supported on virtual switches, so no configuration is required on VDS. But it is important to enable this protocol on the physical switches. STP makes sure that there are no loops in the network. As a best practice, customers should configure the following:

1. Use PortFast on ESXi host-facing physical switch ports. With this setting, network convergence on these switch ports will take place quickly after the failure because the port will enter the STP forwarding state immediately, bypassing the listening and learning states.

2. Use the PortFast Bridge Protocol Data Unit (BPDU) guard feature to enforce the STP boundary. This configuration protects against any invalid device connection on the ESXi host-facing access switch ports. As was previously mentioned, VDS doesn’t support STP, so it doesn’t send any BPDU frames to the switch port. However, if any BPDU is seen on these ESXi host-facing access switch ports, the BPDU guard feature puts that particular switch port in error-disabled state. The switch port is completely shut down, which prevents it from affecting the spanning tree topology.

The recommendation of enabling PortFast and the BPDU guard feature on the switch ports is valid only when customers connect non-switching/bridging devices to these ports (e.g., ESXi hosts). The switching/bridging devices can be hardware-based physical boxes or servers running a software-based switching/bridging function. Customers should make sure that there is no switching/bridging function enabled on the ESXi hosts that are connected to the physical switch ports. However, in the scenario where the ESXi host has a guest virtual machine that is configured to perform a bridging function, the virtual machine will generate BPDU frames and send them out to the VDS, which then forwards the BPDU frames through the network adaptor to the physical switch port. When the switch port configured with BPDU guard receives the BPDU frame, the switch will disable the port and the virtual machine will lose connectivity. To avoid this network failure scenario when running the software bridging function on an ESXi host, customers should disable the PortFast and BPDU guard configuration on the physical switch port and run STP. If customers are concerned about hacks that can generate BPDU frames, they should make use of VMware vShield App, which can block the frames and protect the virtual infrastructure from such layer 2 attacks. Refer to VMware vShield product documentation for more details on how to secure your vSphere virtual infrastructure: http://www.vmware.com/products/vshield/overview.html.

Link Aggregation Setup

Link aggregation is used to increase throughput and improve resiliency by combining multiple network connections. There are various proprietary solutions on the market along with vendor-independent IEEE 802.3ad (LACP) standard-based implementation. All solutions establish a logical channel between the two endpoints, using multiple physical links. In the vSphere virtual infrastructure, the two ends of the logical channel are the VDS and physical switch.
These two switches must be configured with link aggregation parameters before the logical channel is established. Currently, VDS supports static link aggregation configuration and does not provide support for dynamic LACP. When customers want to enable link aggregation on a physical switch, they should configure static link aggregation on the physical switch and select IP hash as network adaptor teaming on the VDS. When establishing the logical channel with multiple physical links, customers should make sure that the Ethernet network adaptor connections from the host are terminated on a single physical switch. However, if customers have deployed clustered physical switch technology, the Ethernet network adaptor connections can be terminated on two different physical switches. The clustered physical switch technology is referred to by different names by networking vendors. For example, Cisco calls their switch clustering solution Virtual Switching System (Nexus 6K, 7K use vPC); Brocade calls theirs Virtual Cluster Switching. Refer to the networking vendor guidelines and configuration details when deploying switch clustering technology.

Link-State Tracking

Link-state tracking is a feature available on Cisco switches to manage the link state of downstream ports (ports connected to servers) based on the status of upstream ports (ports connected to aggregation/core switches). When there is any failure on the upstream links connected to aggregation or core switches, the associated downstream link status goes down. The server connected on the downstream link is then able to detect the failure and reroute the traffic on other working links. This feature therefore provides the protection from network failures due to the failed upstream ports in non-mesh topologies. Unfortunately, this feature is not available on all vendors’ switches, and even if it is available, it might not be referred to as link-state tracking. Customers should talk to the switch vendors to find out whether a similar feature is supported on their switches.

Figure 3 shows the resilient mesh topology on the left and a simple loop-free topology on the right. VMware highly recommends deploying the mesh topology shown on the left, which provides highly reliable redundant design and doesn’t need a link-state tracking feature. Customers who don’t have high-end networking expertise and are also limited in number of switch ports might prefer the deployment shown on the right. In this deployment, customers don’t have to run STP because there are no loops in the network design.
The downside of this simple design is seen when there is a failure in the link between the access and aggregation switches. In that failure scenario, the server will continue to send traffic on the same network adaptor even when the access layer switch is dropping the traffic at the upstream interface. To avoid this black holing of server traffic, customers can enable link-state tracking on the virtual and physical switches and indicate any failure between access and aggregation switch layers to the server through link-state information.

VDS has default network failover detection configuration set as “link status only.” Customers should keep this configuration if they are enabling the link-state tracking feature on physical switches. If link-state tracking capability is not available on physical switches, and there are no redundant paths available in the design, customers can make use of the beacon probing feature available on VDS. The beacon probing function is a software solution available on virtual switches for detecting link failures upstream from the access layer physical switch to the aggregation/core switches. Beacon probing is most useful with three or more uplinks in a team.

Maximum Transmission Unit

Make sure that the maximum transmission unit (MTU) configuration matches across the virtual and physical network switch infrastructure.

Rack Server in Example Deployment

After looking at the major components in the example deployment and key virtual and physical switch parameters, let’s take a look at the different types of servers that customers can have in their environment. Customers can deploy an ESXi host on either a rack server or a blade server. This section discusses a deployment in which the ESXi host is running on a rack server. Two types of rack server configuration will be described in the following section:

• Rack server with eight 1GbE network adaptors

• Rack server with two 10GbE network adaptors

The various VDS design approaches will be discussed for each of the two configurations.

Rack Server with Eight 1GbE Network Adaptors

In a rack server deployment with eight 1GbE network adaptors per host, customers can either use the traditional static design approach of allocating network adaptors to each traffic type or make use of advanced features of VDS such as NIOC and LBT. The NIOC and LBT features help provide a dynamic design that efficiently utilizes I/O resources. In this section, both the traditional and new design approaches are described, along with their pros and cons.

Design Option 1 - Static Configuration

This design option follows the traditional approach of statically allocating network resources to the different virtual infrastructure traffic types. As shown in Figure 4, each host has eight Ethernet network adaptors. Four are connected to the first access layer switch; the other four are connected to the second access layer switch, to avoid a single point of failure. Let’s look in detail at how VDS parameters are configured.

dvuplink Configuration

To support the maximum of eight 1GbE network adaptors per host, the dvuplink port group is configured with eight dvuplinks (dvuplink1 through dvuplink8). On the hosts, dvuplink1 is associated with vmnic0, dvuplink2 is associated with vmnic1, and so on. It is a recommended practice to change the names of the dvuplinks to something meaningful and easy to track. For example, dvuplink1, which gets associated with a vmnic on the motherboard, can be renamed as “LOM-uplink1”; dvuplink2, which gets associated with a vmnic on an expansion card, can be renamed as “Expansion-uplink1.”

If the hosts have some Ethernet network adaptors as LAN on motherboard (LOM) and some on expansion cards, VMware recommends selecting one network adaptor from LOM and one from an expansion card when configuring network adaptor teaming, for better resiliency. To configure this teaming on a VDS, administrators must pay attention to the dvuplink and vmnic association along with dvportgroup configuration where network adaptor teaming is enabled. In the network adaptor teaming configuration on a dvportgroup, administrators must choose the various dvuplinks that are part of a team. If the dvuplinks are named appropriately according to the host vmnic association, administrators can select “LOM-uplink1” and “Expansion-uplink1” when configuring the teaming option for a dvportgroup.

dvportgroup Configuration

As described in Table 2, there are five different port groups that are configured for the five different traffic types. Customers can create up to 5,000 unique port groups per VDS. In this example deployment, the decision on creating different port groups is based on the number of traffic types. According to Table 2, dvportgroup PG-A is created for the management traffic type. There are other dvportgroups defined for the other traffic types. The following are the key configurations of dvportgroup PG-A:

• Teaming option: Explicit failover order provides a deterministic way of directing traffic to a particular uplink. By selecting dvuplink1 as an active uplink and dvuplink2 as a standby uplink, management traffic will be carried over dvuplink1 unless there is a failure on dvuplink1. All other dvuplinks are configured as unused. Configuring the failback option to “No” is also recommended, to avoid the flapping of traffic between two network adaptors. The failback option determines how a physical adaptor is returned to active duty after recovering from a failure. If failback is set to “No,” a failed adaptor is left inactive, even after recovery, until another currently active adaptor fails and requires a replacement.

• VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.

• There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure these parameters based on their environment needs. For example, customers can configure PVLAN to provide isolation when there are limited VLANs available in the environment.
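The explicit failover order and the failback behaviour described in the list above can be sketched as a small state model. This is only a toy illustration under stated assumptions: the class `ExplicitFailoverTeam` and its methods are hypothetical names invented here, not part of any VMware API, and the model covers a single active uplink with a single standby uplink.

```python
# Toy model of "explicit failover order" teaming (illustration only, not ESXi code).
# All names here are hypothetical.

class ExplicitFailoverTeam:
    def __init__(self, active, standby, failback=False):
        self.order = [active, standby]           # configured failover order
        self.up = {active: True, standby: True}  # link state per uplink
        self.failback = failback                 # False models failback = "No"
        self.current = active                    # uplink currently carrying traffic

    def _select(self):
        # Pick the first healthy uplink in failover order, or None if all are down.
        self.current = next((u for u in self.order if self.up[u]), None)

    def link_down(self, uplink):
        self.up[uplink] = False
        if uplink == self.current:
            self._select()

    def link_up(self, uplink):
        self.up[uplink] = True
        # With failback "No", a recovered adaptor stays idle until it is needed.
        if self.failback or self.current is None:
            self._select()


team = ExplicitFailoverTeam("dvuplink1", "dvuplink2", failback=False)
team.link_down("dvuplink1")   # traffic moves to the standby uplink
team.link_up("dvuplink1")     # dvuplink1 recovers, traffic stays on dvuplink2
print(team.current)
team.link_down("dvuplink2")   # only now does traffic return to dvuplink1
print(team.current)
```

With `failback=True` the same sequence would move traffic back to dvuplink1 as soon as it recovers, which is the flapping risk that the recommendation to set failback to “No” is meant to avoid.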

As you follow the dvportgroups configuration in Table 2, you can see that each traffic type is carried over a specific dvuplink, with the exception of the virtual machine traffic type. The virtual machine traffic type uses two active links, dvuplink7 and dvuplink8, and these links are utilized through the LBT algorithm. As was previously mentioned, the LBT algorithm is much more efficient than the standard hashing algorithm in utilizing link bandwidth.

Physical Switch Configuration

The external physical switch, to which the rack servers’ network adaptors are connected, is configured as a trunk with all the appropriate VLANs enabled. As described in the “Physical Network Switch Parameters” section, the following switch configurations are performed based on the VDS setup described in Table 2.

• Enable STP on the trunk ports facing the ESXi hosts, along with the PortFast mode and BPDU guard feature.

• The teaming configuration on VDS is static, so no link aggregation is configured on the physical switches.

• Because of the mesh topology deployment, as shown in Figure 4, the link-state tracking feature is not required on the physical switches.

In this design approach, resiliency to the infrastructure traffic is achieved through active/standby uplinks, and security is accomplished by providing separate physical paths for the different traffic types. However, with this design, the I/O resources are underutilized because the dvuplink2 and dvuplink6 standby links are not used to send or receive traffic. Also, there is no flexibility to allocate more bandwidth to a traffic type when it needs it.

There is another variation to the static design approach that addresses the need of some customers to provide higher bandwidth to the storage and vMotion traffic types. In the static design that was previously described, iSCSI and vMotion traffic is limited to 1Gb. If a customer wants to support higher bandwidth for iSCSI, they can make use of the iSCSI multipathing solution. Also, with the release of vSphere 5, vMotion traffic can be carried over multiple Ethernet network adaptors through the support of multi-network adaptor vMotion, thereby providing higher bandwidth to the vMotion process. For more details on how to set up iSCSI multipathing, refer to the VMware vSphere Storage guide: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html.

The configuration of multi-network adaptor vMotion is quite similar to the iSCSI multipath setup, where administrators must create two separate VMkernel interfaces and bind each one to a separate dvportgroup. This configuration with two separate dvportgroups provides the connectivity to two different Ethernet network adaptors or dvuplinks.

Table 3. Static Design Configuration with iSCSI Multipathing and Multi-Network Adaptor vMotion

As shown in Table 3, there are two entries each for the vMotion and iSCSI traffic types. Also shown is a list of the additional dvportgroup configurations required to support the multi-network adaptor vMotion and iSCSI multipathing processes. For multi-network adaptor vMotion, dvportgroups PG-B1 and PG-B2 are listed, configured with dvuplink3 and dvuplink4 respectively as active links. For iSCSI multipathing, dvportgroups PG-D1 and PG-D2 are connected to dvuplink5 and dvuplink6 respectively as active links. Load balancing across the multiple dvuplinks is performed by the multipathing logic in the iSCSI process and by the ESXi platform in the vMotion process. Configuring the teaming policies for these dvportgroups is not required.

FT, management and virtual machine traffic-type dvportgroup configuration and physical switch configuration for this design remain the same as those described in “Design Option 1” of the previous section. This static design approach improves on the first design by using advanced capabilities such as iSCSI multipathing and multi-network adaptor vMotion. But at the same time, this option has the same challenges related to underutilized resources and inflexibility in allocating additional resources on the fly to different traffic types.

Design Option 2 - Dynamic Configuration with NIOC and LBT

After looking at the traditional design approach with static uplink configurations, let’s take a look at the VMware recommended design option that takes advantage of the advanced VDS features such as NIOC and LBT. In this design, the connectivity to the physical network infrastructure remains the same as that described in the static design option. However, instead of allocating specific dvuplinks to individual traffic types, the ESXi platform utilizes those dvuplinks dynamically. To illustrate this dynamic design, each virtual infrastructure traffic type’s bandwidth utilization is estimated. In a real deployment, customers should first monitor the virtual infrastructure traffic over a period of time, to gauge the bandwidth utilization, and then come up with bandwidth numbers for each traffic type. The following are some bandwidth numbers estimated by traffic type for the scenario:

• Management traffic (<1Gb)

• vMotion (1Gb)

• FT (1Gb)

• iSCSI (1Gb)

• Virtual machine (2Gb)

Based on this bandwidth information, administrators can provision appropriate I/O resources to each traffic type by using the NIOC feature of VDS. Let’s take a look at the VDS parameter configurations for this design, as well as the NIOC setup. The dvuplink port group configuration remains the same, with eight dvuplinks created for the eight 1GbE network adaptors. The dvportgroup configuration is described in the following section.

dvportgroup Configuration

In this design, all dvuplinks are active and there are no standby or unused uplinks, as shown in Table 4. All dvuplinks are therefore available for use by the teaming algorithm. The following are the key parameter configurations of dvportgroup PG-A:

• Teaming option: LBT is selected as the teaming algorithm. With LBT configuration, the management traffic initially will be scheduled based on the virtual port ID hash. Depending on the hash output, management traffic is sent out over one of the dvuplinks. Other traffic types in the virtual infrastructure can also be scheduled on the same dvuplink initially. However, when the utilization of the dvuplink goes beyond the 75 percent threshold, the LBT algorithm will be invoked and some of the traffic will be moved to other underutilized dvuplinks. It is possible that management traffic will be moved to other dvuplinks when such an LBT event occurs.

• The failback option means going from using a standby link to using an active uplink after the active uplink comes back into operation after a failure. This failback option works when there are active and standby dvuplink configurations. In this design, there are no standby dvuplinks. So when an active uplink fails, the traffic flowing on that dvuplink is moved to another working dvuplink. If the failed dvuplink comes back, the LBT algorithm will schedule new traffic on that dvuplink. This option is left as the default.

• There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure these parameters based on their environment needs. For example, they can configure PVLAN to provide isolation when there are limited VLANs available in the environment.
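The LBT behaviour described in the list above (initial placement by virtual port ID hash, then moving traffic once an uplink exceeds the 75 percent threshold) can be illustrated with a deliberately simplified model. This is a sketch under loud assumptions: the modulo hash, the choice of the smallest flow as the one to move, and all function names are invented here for illustration and do not reproduce the actual algorithm inside ESXi.

```python
# Toy model of load-based teaming (LBT). Illustration only: the hash, the
# choice of which port to move, and all names are assumptions, not ESXi logic.

THRESHOLD = 0.75  # utilization threshold quoted in the text

def initial_placement(port_ids, uplinks):
    """Place each virtual port on an uplink using a simple port-ID hash."""
    return {pid: uplinks[pid % len(uplinks)] for pid in port_ids}

def utilization(placement, load_mbps, uplinks, capacity_mbps):
    """Return the fraction of capacity used on every uplink."""
    used = {u: 0.0 for u in uplinks}
    for pid, uplink in placement.items():
        used[uplink] += load_mbps[pid]
    return {u: used[u] / capacity_mbps for u in uplinks}

def rebalance(placement, load_mbps, uplinks, capacity_mbps):
    """Move one port off each overloaded uplink to the least-utilized uplink."""
    placement = dict(placement)
    for uplink in uplinks:
        util = utilization(placement, load_mbps, uplinks, capacity_mbps)
        if util[uplink] <= THRESHOLD:
            continue
        ports = [p for p, u in placement.items() if u == uplink]
        target = min(uplinks, key=util.get)
        if target == uplink or len(ports) < 2:
            continue
        mover = min(ports, key=load_mbps.get)  # assumed: move the smallest flow
        # Only move if the target stays at or below the threshold afterwards.
        if util[target] + load_mbps[mover] / capacity_mbps <= THRESHOLD:
            placement[mover] = target
    return placement


uplinks = ["dvuplink1", "dvuplink2"]
load_mbps = {0: 600, 1: 100, 2: 300}      # hypothetical per-port load in Mbps
before = initial_placement(load_mbps, uplinks)
after = rebalance(before, load_mbps, uplinks, capacity_mbps=1000)
print(utilization(before, load_mbps, uplinks, 1000))
print(utilization(after, load_mbps, uplinks, 1000))
```

In this example ports 0 and 2 hash to dvuplink1 and push it to 90 percent utilization, so the model moves port 2 to dvuplink2. The point is only to show why traffic, including management traffic, may end up on a different dvuplink after an LBT event.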

As you follow the dvportgroups configuration in Table 4, you can see that each traffic type has all dvuplinks active and that these links are utilized through the LBT algorithm. Let’s now look at the NIOC configuration described in the last two columns of Table 4.

Table 4. Dynamic Design Configuration with NIOC and LBT

The NIOC configuration in this design helps provide the appropriate I/O resources to the different traffic types (through shares). Based on the previously estimated bandwidth numbers per traffic type, the shares parameter is configured in the NIOC shares column in Table 4. The shares values specify the relative importance of specific traffic types, and NIOC ensures that during contention scenarios on the dvuplinks, each traffic type gets the allocated bandwidth. For example, a shares configuration of 10 for vMotion, iSCSI and FT allocates equal bandwidth to these traffic types. Virtual machines get the highest bandwidth with 20 shares and management gets lower bandwidth with 5 shares.

To illustrate how share values translate to bandwidth numbers, let’s take an example of a 1Gb capacity dvuplink carrying all five traffic types. This is a worst-case scenario where all traffic types are mapped to one dvuplink. This will never happen when customers enable the LBT feature, because LBT will balance the traffic based on the utilization of uplinks. This example shows how much bandwidth each traffic type will be allowed on one dvuplink during a contention or oversubscription scenario and when LBT is not enabled.

• Total shares: management (5) + vMotion (10) + FT (10) + iSCSI (10) + virtual machine (20) = 55

• 1Gb = 1000Mbps

o Management: 5 shares; (5/55) x 1000 = 90.91Mbps
o vMotion: 10 shares; (10/55) x 1000 = 181.82Mbps
o FT: 10 shares; (10/55) x 1000 = 181.82Mbps
o iSCSI: 10 shares; (10/55) x 1000 = 181.82Mbps
o Virtual machine: 20 shares; (20/55) x 1000 = 363.64Mbps
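The arithmetic in the list above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `bandwidth_by_shares` and the optional `active` parameter are hypothetical and only serve the illustration. The `active` parameter models the fact that a per-uplink scheduler divides bandwidth only among the traffic types actually present on that uplink.

```python
# Sketch: convert NIOC share values into bandwidth under full contention.
# Function and parameter names are hypothetical, chosen for this example.

def bandwidth_by_shares(shares, link_mbps, active=None):
    """Split link_mbps among the active traffic types in proportion to shares."""
    active = list(shares) if active is None else list(active)
    total = sum(shares[t] for t in active)
    return {t: shares[t] / total * link_mbps for t in active}


shares = {
    "management": 5,
    "vMotion": 10,
    "FT": 10,
    "iSCSI": 10,
    "virtual machine": 20,
}

# Worst case from the text: all five traffic types contend on one 1Gb dvuplink.
for traffic, mbps in bandwidth_by_shares(shares, 1000).items():
    print(f"{traffic}: {mbps:.2f} Mbps")

# Only FT and management present on the uplink: FT gets twice the bandwidth.
print(bandwidth_by_shares(shares, 1000, active=["FT", "management"]))
```

The first loop gives 90.91 Mbps for management, 181.82 Mbps each for vMotion, FT and iSCSI, and 363.64 Mbps for virtual machine traffic.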

Note: Given a workload requirement for a port group in Mbps, be able to identify the required share value.

To calculate the bandwidth numbers during contention, you should first calculate the percentage of bandwidth for a traffic type by dividing its share value by the total available share number (55). In the second step, the total bandwidth of the dvuplink (1Gb) is multiplied by the percentage of bandwidth calculated in the first step. For example, 5 shares allocated to management traffic translate to 90.91Mbps of bandwidth to the management process on a fully utilized 1Gb network adaptor. In this example, custom share configuration is discussed, but a customer can make use of predefined high (100), normal (50) and low (25) shares when assigning them to different traffic types.

The vSphere platform takes these configured share values and applies them per uplink. The schedulers running at each uplink are responsible for making sure that the bandwidth resources are allocated according to the shares. In the case of an eight 1GbE network adaptor deployment, there are eight schedulers running. Depending on the number of traffic types scheduled on a particular uplink, the scheduler will divide the bandwidth among the traffic types, based on the share numbers. For example, if only FT (10 shares) and management (5 shares) traffic are flowing through dvuplink5, FT traffic will get double the bandwidth of management traffic, based on the shares value. Also, when there is no management traffic flowing, all bandwidth can be utilized by the FT process. This flexibility in allocating I/O resources is the key benefit of the NIOC feature.

The NIOC limits parameter of Table 4 is not configured in this design. The limits value specifies an absolute maximum limit on egress traffic for a traffic type. Limits are specified in Mbps. This configuration provides a hard limit on any traffic, even if I/O resources are available to use. Using limits configuration is not recommended unless you really want to control the traffic, even though additional resources are available.
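The reverse calculation mentioned in the note (from a bandwidth requirement in Mbps to a share value) follows from the same ratio. If a traffic type needs B Mbps on a link of capacity C Mbps and the other traffic types hold S shares in total, its share value s must satisfy s / (s + S) x C >= B, which gives s >= B x S / (C - B). The sketch below applies that inequality; the function name `required_shares` is hypothetical and integer inputs are assumed.

```python
# Sketch: smallest integer share value that guarantees a target bandwidth
# during contention. The function name is hypothetical; inputs are integers.
import math
from fractions import Fraction

def required_shares(target_mbps, other_shares, link_mbps):
    """Return the minimum share value s with s / (s + other_shares) * link >= target."""
    if target_mbps >= link_mbps:
        raise ValueError("target must be below the link capacity")
    # Exact rational arithmetic avoids float rounding at integer boundaries.
    return math.ceil(Fraction(target_mbps * other_shares, link_mbps - target_mbps))


# Virtual machine traffic needs 400 Mbps on a 1Gb uplink; the other four
# traffic types from the example hold 5 + 10 + 10 + 10 = 35 shares.
s = required_shares(400, 35, 1000)
print(s, s / (s + 35) * 1000)
```

For this example the result is 24 shares, which corresponds to roughly 406.8 Mbps under full contention on the uplink.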

There is no change in physical switch configuration in this design approach, even with the choice of the new LBT algorithm. The LBT teaming algorithm doesn’t require any special configuration on physical switches. Refer to the physical switch settings described in “Design Option 1.”

This design does not provide higher than 1Gb bandwidth to the vMotion and iSCSI traffic types, as is the case with the static design using multi-network adaptor vMotion and iSCSI multipathing. The LBT algorithm cannot split the infrastructure traffic across multiple dvuplink ports and utilize all the links. So even if vMotion dvportgroup PG-B has all eight 1GbE network adaptors as active uplinks, vMotion traffic will be carried over only one of the eight uplinks. The main advantage of this design is evident in the scenarios where the vMotion process is not using the uplink bandwidth, and other traffic types are in need of the additional resources. In these situations, NIOC makes sure that the unused bandwidth is allocated to the other traffic types that need it.

This dynamic design option is the recommended approach because it takes advantage of the advanced VDS features and utilizes I/O resources efficiently. This option also provides active-active resiliency where no uplinks are in standby mode. In this design approach, customers allow the vSphere platform to make the optimal decisions on scheduling traffic across multiple uplinks.

Some customers who have restrictions in the physical infrastructure in terms of bandwidth capacity across different paths and limited availability of the layer 2 domain might not be able to take advantage of this dynamic design option. When deploying this design option, it is important to consider all the different traffic paths that a traffic type can take and to make sure that the physical switch infrastructure can support the specific characteristics required for each traffic type. VMware recommends that vSphere and network administrators work together to understand the impact of the vSphere platform’s traffic scheduling feature over the physical network infrastructure before deploying this design option.

Every customer environment is different, and the requirements for the traffic types are also different. Depending on the need of the environment, a customer can modify these design options to fit their specific requirements. For example, customers can choose to use a combination of static and dynamic design options when they need higher bandwidth for iSCSI and vMotion activities. In this hybrid design, four uplinks can be statically allocated to iSCSI and vMotion traffic types while the remaining four uplinks are used dynamically for the remaining traffic types (it may also be that the IP storage infrastructure uses separate physical switches). Table 5 shows the traffic types and associated port group configurations for the hybrid design. As shown in the table, management, FT and virtual machine traffic will be distributed on dvuplink1 to dvuplink4 through the vSphere platform’s traffic scheduling features, LBT and NIOC. The remaining four dvuplinks are statically assigned to vMotion and iSCSI traffic types.

Rack Server with Two 10GbE Network Adaptors

The two 10GbE network adaptors deployment model is becoming very common because of the benefits it provides through I/O consolidation. The key benefits include better utilization of I/O resources, simplified management and reduced CAPEX and OPEX. Although this deployment provides these benefits, there are some challenges when it comes to the traffic management aspects. Especially in highly consolidated virtualized environments where more traffic types are carried over fewer 10GbE network adaptors, it becomes critical to prioritize traffic types that are important and provide the required SLA guarantees. The NIOC feature available on the VDS helps in this traffic management activity. In the following sections, you will see how to utilize this feature in the different designs.

As shown in Figure 5, rack servers with two 10GbE network adaptors are connected to the two access layer switches to avoid any single point of failure. Similar to the rack server with eight 1GbE network adaptors, the different VDS and physical switch parameter configurations are taken into account with this design. On the physical switch side, the new 10Gb switches might have support for FCoE, which enables convergence of SAN and LAN traffic. This document covers only the standard 10Gb deployments that support IP storage traffic (iSCSI/NFS) and not FCoE. In this section, two design options are described; one is a traditional approach and the other one is a VMware recommended approach.

Figure 5. Rack Server with Two 10GbE Network Adaptors

Design Option 1 - Static Configuration

The static configuration approach for rack server deployment with 10GbE network adaptors is similar to the one described in “Design Option 1” of rack server deployment with eight 1GbE adaptors. There are a few differences in the configuration: the number of dvuplinks is changed from eight to two, and the dvportgroup parameters are different. Let’s take a look at the configuration details on the VDS front.

dvuplink Configuration

To support the maximum of two Ethernet network adaptors per host, the dvuplink port group is configured with two dvuplinks (dvuplink1, dvuplink2). On the hosts, dvuplink1 is associated with vmnic0 and dvuplink2 is associated with vmnic1.

dvportgroup Configuration

As described in Table 6, there are five different dvportgroups that are configured for the five different traffic types. For example, dvportgroup PG-A is created for the management traffic type. The following are the other key configurations of dvportgroup PG-A:

• Teaming option: An explicit failover order provides a deterministic way of directing traffic to a particular uplink. By selecting dvuplink1 as an active uplink and dvuplink2 as a standby uplink, management traffic will be carried over dvuplink1 unless there is a failure with it. Configuring the failback option to “No” is also recommended, to avoid the flapping of traffic between two network adaptors. The failback option determines how a physical adaptor is returned to active duty after recovering from a failure. If failback is set to “No,” a failed adaptor is left inactive, even after recovery, until another currently active adaptor fails, requiring its replacement.

• There are various other parameters that are part of the dvportgroup configuration. Customers can choose to configure these parameters based on their environment needs.

Table 6 provides the configuration details for all the dvportgroups. According to the configuration, dvuplink1 carries management, iSCSI and virtual machine traffic; dvuplink2 handles vMotion, FT and virtual machine traffic. As you can see, the virtual machine traffic type makes use of two uplinks, and these uplinks are utilized through the LBT algorithm. With this deterministic teaming policy, customers can decide to map different traffic types to the available uplink ports, depending on environment needs. For example, if iSCSI traffic needs higher bandwidth and other traffic types have relatively low bandwidth requirements, customers can decide to keep only iSCSI traffic on dvuplink1 and move all other traffic to dvuplink2. When deciding on these traffic paths, customers should understand the physical network connectivity and the paths’ bandwidth capacities.

Physical Switch Configuration

The external physical switch, which the rack servers’ network adaptors are connected to, has trunk configuration with all the appropriate VLANs enabled. As described in the physical network switch parameters sections, the following switch configurations are performed based on the VDS setup described in Table 6.

• Enable STP on the trunk ports facing ESXi hosts, along with the PortFast mode and BPDU guard feature.

• The teaming configuration on VDS is static and therefore no link aggregation is configured on the physical switches.

• Because of the mesh topology deployment shown in Figure 5, the link-state tracking feature is not required on the physical switches.

Table 6. Static Design Configuration

This static design option provides flexibility in the traffic path configuration, but it cannot protect against one traffic type dominating the others. For example, there is a possibility that a network-intensive vMotion process might take away most of the network bandwidth and impact virtual machine traffic. Bidirectional traffic-shaping parameters at port group and port levels can provide some help in managing different traffic rates. However, using this approach for traffic management requires customers to limit the traffic on the respective dvportgroups. Limiting traffic to a certain level through this method puts a hard limit on the traffic types, even when the bandwidth is available to utilize. This underutilization of I/O resources because of hard limits is overcome through the NIOC feature, which provides flexible traffic management based on the shares parameters. “Design Option 2,” described in the following section, is based on the NIOC feature.

Design Option 2 - Dynamic Configuration with NIOC and LBT

This dynamic design option is the VMware-recommended approach that takes advantage of the NIOC and LBT features of the VDS. Connectivity to the physical network infrastructure remains the same as that described in “Design Option 1.” However, instead of allocating specific dvuplinks to individual traffic types, the ESXi platform utilizes those dvuplinks dynamically. To illustrate this dynamic design, each virtual infrastructure traffic type’s bandwidth utilization is estimated. In a real deployment, customers should first monitor the virtual infrastructure traffic over a period of time to gauge the bandwidth utilization, and then come up with bandwidth numbers. The following are some bandwidth numbers estimated by traffic type:

• Management traffic (<1Gb)

• vMotion (2Gb)

• FT (1Gb)

• iSCSI (2Gb)

• Virtual machine (2Gb)

These bandwidth estimates are different from the ones considered for the rack server deployment with eight 1GbE network adaptors. Let’s take a look at the VDS parameter configurations for this design. The dvuplink port group configuration remains the same, with two dvuplinks created for the two 10GbE network adaptors. The dvportgroup configuration is as follows.

dvportgroup Configuration

In this design, all dvuplinks are active and there are no standby or unused uplinks, as shown in Table 7. All dvuplinks are therefore available for use by the teaming algorithm. The following are the key configurations of dvportgroup PG-A:

• Teaming option: LBT is selected as the teaming algorithm. With LBT configuration, management traffic initially will be scheduled based on the virtual port ID hash. Based on the hash output, management traffic will be sent out over one of the dvuplinks. Other traffic types in the virtual infrastructure can also be scheduled on the same dvuplink with LBT configuration. Subsequently, if the utilization of the uplink goes beyond the 75 percent threshold, the LBT algorithm will be invoked and some of the traffic will be moved to other underutilized dvuplinks. It is possible that management traffic will get moved to other dvuplinks when such an event occurs.

• There are no standby dvuplinks in this configuration, so the failback setting is not applicable for this design approach. The default setting for this failback option is “Yes.”

• VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.

• There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure these parameters based on their environment needs.

As you follow the dvportgroups configuration in Table 7, you can see that each traffic type has all the dvuplinks as active and these uplinks are utilized through the LBT algorithm. Let’s take a look at the NIOC configuration.

The NIOC configuration in this design not only helps provide the appropriate I/O resources to the different traffic types but also provides SLA guarantees by preventing one traffic type from dominating others. Based on the bandwidth assumptions made for different traffic types, the shares parameters are configured in the NIOC shares column in Table 7. To illustrate how share values translate to bandwidth numbers in this deployment, let’s take an example of a 10Gb capacity dvuplink carrying all five traffic types. This is a worst-case scenario in which all traffic types are mapped to one dvuplink. This will never happen when customers enable the LBT feature, because LBT will move the traffic type based on the uplink utilization. The following example shows how much bandwidth each traffic type will be allowed on one dvuplink during a contention or oversubscription scenario and when LBT is not enabled:

• Total shares: management (5) + vMotion (20) + FT (10) + iSCSI (20) + virtual machine (20) = 75

• 10Gb = 10000Mbps

o Management: 5 shares; (5/75) x 10Gb = 667Mbps
o vMotion: 20 shares; (20/75) x 10Gb = 2.67Gbps
o FT: 10 shares; (10/75) x 10Gb = 1.33Gbps
o iSCSI: 20 shares; (20/75) x 10Gb = 2.67Gbps
o Virtual machine: 20 shares; (20/75) x 10Gb = 2.67Gbps
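As a quick cross-check of the 10Gb figures above, the sketch below compares the bandwidth each traffic type is allowed during contention with the bandwidth estimates listed earlier for this deployment. The names `guaranteed_mbps` and `estimates_mbps` are hypothetical. Management is left out of the comparison on purpose, because its estimate is given only as an upper bound (less than 1Gb).

```python
# Sketch: do the share values cover the estimated bandwidth per traffic type
# on one fully contended 10Gb dvuplink? Names are hypothetical.

def guaranteed_mbps(shares, link_mbps):
    """Bandwidth each traffic type is allowed when all types contend."""
    total = sum(shares.values())
    return {t: s / total * link_mbps for t, s in shares.items()}


shares = {"management": 5, "vMotion": 20, "FT": 10, "iSCSI": 20, "virtual machine": 20}

# Estimates from the text, in Mbps (management omitted: only "<1Gb" is given).
estimates_mbps = {"vMotion": 2000, "FT": 1000, "iSCSI": 2000, "virtual machine": 2000}

allowed = guaranteed_mbps(shares, 10_000)
for traffic, need in estimates_mbps.items():
    status = "ok" if allowed[traffic] >= need else "short"
    print(f"{traffic}: needs {need} Mbps, allowed {allowed[traffic]:.0f} Mbps ({status})")
```

Under these assumptions every estimated requirement is covered even in the worst case where all five traffic types share one uplink.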

For each traffic type, first the percentage of bandwidth is calculated by dividing the share value by the total available share number (75), and then the total bandwidth of the dvuplink (10Gb) is used to calculate the bandwidth share for the traffic type. For example, 20 shares allocated to vMotion traffic translate to 2.67Gbps of bandwidth to the vMotion process on a fully utilized 10GbE network adaptor. In this 10GbE deployment, customers can provide bigger pipes to individual traffic types without the use of trunking or multipathing technologies. This was not the case with an eight-1GbE deployment.

There is no change in physical switch configuration in this design approach, so refer to the physical switch settings described in “Design Option 1” in the previous section.

Table 7. Dynamic Design Configuration

This design option utilizes the advanced VDS features and provides customers with a dynamic and flexible design approach. In this design, I/O resources are utilized effectively and SLAs are met based on the shares allocation.

Blade Server in Example Deployment

Blade servers are server platforms that provide higher server consolidation per rack unit as well as lower power and cooling costs. Blade chassis that host the blade servers have proprietary architectures and each vendor has its own way of managing resources in the blade chassis. It is difficult in this document to look at all of the various blade chassis available on the market and to discuss their deployments. In this section, we will focus on some generic parameters that customers should consider when deploying VDS in a blade chassis environment. From a networking point of view, all blade chassis provide the following two options:

• Integrated switches: With this option, the blade chassis enables built-in switches to control traffic flow between the blade servers within the chassis and the external network.

• Pass-through technology: This is an alternative method of network connectivity that enables the individual blade servers to communicate directly with the external network.

In this document, the integrated switch option is described, where the blade chassis has a built-in Ethernet switch. This Ethernet switch acts as an access layer switch, as shown in Figure 6. This section discusses a deployment in which the ESXi host is running on a blade server. The following two types of blade server configuration will be described in the next section:

• Blade server with two 10GbE network adaptors

• Blade server with hardware-assisted multiple logical network adaptors

For each of these two configurations, various VDS design approaches will be discussed.

Blade Server with Two 10GbE Network Adaptors

This deployment is quite similar to that of a rack server with two 10GbE network adaptors in which each ESXi host is provided with two 10GbE network adaptors. As shown in Figure 6, an ESXi host running on a blade server in the blade chassis is also provided with two 10GbE network adaptors.

Figure 6. Blade Server with Two 10GbE Network Adaptors

In this section, two design options are described. One is a traditional static approach and the other one is a VMware recommended dynamic configuration with NIOC and LBT features enabled. These two approaches are exactly the same as the deployment described in the “Rack Server with Two 10GbE Network Adaptors” section. Only blade chassis-specific design decisions will be discussed as part of this section. For all other VDS and switch-related configurations, refer to the “Rack Server with Two 10GbE Network Adaptors” section of this document.

Design Option 1 - Static Configuration

The configuration of this design approach is exactly the same as that described in the “Design Option 1” section under “Rack Server with Two 10GbE Network Adaptors.” Refer to Table 6 for dvportgroup configuration details. Let’s take a look at the blade server-specific parameters that require attention during the design. Network and hardware reliability considerations should be incorporated during the blade server design as well. In these blade server designs, customers must focus on the following two areas:

• High availability of blade switches in the blade chassis

• Connectivity of blade server network adaptors to internal blade switches

High availability of blade switches can be achieved by having two Ethernet switching modules in the blade chassis. And the connectivity of two network adaptors on the blade server should be such that one network adaptor is connected to the first Ethernet switch module, and the other network adaptor is hooked to the second switch module in the blade chassis. Another aspect that requires attention in the blade server deployment is the network bandwidth availability across the midplane of the blade chassis and between the blade switches and aggregation layer. If there is an oversubscription scenario in the deployment, customers must think about utilizing traffic shaping and prioritization (802.lp tagging) features available in the vSphere platform. The prioritization feature enables customers to tag the important traffic coming out of the vSphere platform. These high-‐priority—

tagged packets are then treated according to priority by the external switch infrastructure. During congestion scenarios, the switch will drop lower-‐priority packets first and avoid dropping the important, high-‐priority packets. This static design option provides customers with the flexibility to choose different network adaptors for different traffic types. However, when doing the traffic allocation on a limited, two 1OGbE network adaptors, administrators ultimately will schedule multiple traffic types on a single adaptor. As multiple traffic types flow through one adaptor, the chances of one traffic type’s dominating others increases. To avoid the performance impact of the “noisy neighbors” (dominating traffic type), customers must utilize the traffic management tools provided in the vSphere platform. One of the traffic management features is NIOC, and that feature is utilized in “Design Option 2,” which is described in the following section. Design Option 2 -‐ Dynamic Configuration with NIOC and LBT This dynamic configuration approach is exactly the same as that described in the “Design Option 2” section under “Rack Server with Two 1OGbE Network Adaptors.” Refer to Table 7 for the dvportgroup configuration details and NIOC settings. The physical switch—related configuration in the blade chassis deployment is the same as that described in the rack server deployment. For the blade center-‐specific recommendation on reliability and traffic management, refer to the previous section. VMware recommends this design option, which utilizes the advanced VDS features and provides customers with a dynamic and flexible design approach. With this design, I/O resources are utilized effectively and SLAs are met based on the shares allocation. Blade Server with Hardware-‐Assisted Logical Network Adaptors (HP Flex-‐lO-‐ or Cisco UCS-‐like Deployment) Some of the new blade chassis support traffic management capabilities that enable customers to carve I/O resources. 
This is achieved by providing logical network adaptors for the ESXi hosts. Instead of two 10GbE network adaptors, the ESXi host now sees multiple physical network adaptors that operate at different configurable speeds. As shown in Figure 7, each ESXi host is provided with eight Ethernet network adaptors that are carved out of two 10GbE network adaptors.

Figure 7. Multiple Logical Network Adaptors

This deployment is quite similar to that of the rack server with eight 1GbE network adaptors. However, instead of 1GbE network adaptors, the capacity of each network adaptor is configured at the blade chassis level. In the blade chassis, customers can carve out different capacity network adaptors based on the need of each traffic type. For example, if iSCSI traffic needs 2.5Gb of bandwidth, a logical network adaptor with that amount of I/O resources can be created on the blade chassis and provided for the blade server.

As for the configuration of the VDS and blade chassis switch infrastructure, the configuration described in “Design Option 1” under “Rack Server with Eight 1GbE Network Adaptors” is more relevant for this deployment. The static configuration option described in that design can be applied as is in this blade server environment. Refer to Table 2 for the dvportgroup configuration details, and to the switch configurations described in that section for physical switch configuration details.

The question now is whether NIOC capability adds any value in this specific blade server deployment. NIOC is a traffic management feature that helps in scenarios where multiple traffic types flow through one uplink or network adaptor. If in this particular deployment only one traffic type is assigned to a specific Ethernet network adaptor, the NIOC feature will not add any value. However, if multiple traffic types are scheduled over one network adaptor, customers can make use of NIOC to assign appropriate shares to different traffic types. This NIOC configuration will ensure that bandwidth resources are allocated to traffic types and that SLAs are met.

As an example, let’s consider a scenario in which vMotion and iSCSI traffic is carried over one 3Gb logical uplink. To protect the iSCSI traffic from network-intensive vMotion traffic, administrators can configure NIOC and allocate shares to each traffic type. If the two traffic types are equally important, administrators can configure shares with equal values (10 each). With this configuration, when there is contention, NIOC will make sure that the iSCSI traffic gets half of the 3Gb uplink bandwidth (1.5Gb) and is shielded from the impact of the vMotion process.

VMware recommends that the network and server administrators work closely together when deploying the traffic management features of the VDS and blade chassis.
To achieve the best end-to-end quality of service (QoS) result, a considerable amount of coordination is required during the configuration of the traffic management features.

Operational Best Practices

After a customer successfully designs the virtual network infrastructure, the next challenges are how to deploy the design and how to keep the network operational. VMware provides various tools, APIs, and procedures to help customers effectively deploy and manage their network infrastructure. The following are some key tools available in the vSphere platform:

• VMware vSphere Command-Line Interface (vSphere CLI)

• VMware vSphere API

• Virtual network monitoring and troubleshooting

  • NetFlow

  • Port mirroring

In the following section, we will briefly discuss how vSphere and network administrators can utilize these tools to manage their virtual network. Refer to the vSphere documentation for more details on the tools.

VMware vSphere Command-Line Interface

vSphere administrators have several ways to access vSphere components through vSphere interface options, including VMware vSphere Client, vSphere Web Client, and vSphere Command-Line Interface. The vSphere CLI command set enables administrators to perform configuration tasks by using a vSphere vCLI package installed on supported platforms or by using VMware vSphere Management Assistant (vMA). Refer to the Getting Started with vSphere CLI document for more details on the commands: http://www.vmware.com/support/developer/vcli. The entire networking configuration can be performed through vSphere vCLI, helping administrators automate the deployment process.

VMware vSphere API

The networking setup in the virtualized datacenter involves configuration of virtual and physical switches. VMware has provided APIs that enable network switch vendors to get information about the virtual infrastructure, which helps them to automate the configuration of the physical switches and the overall process.

For example, vCenter can trigger an event after the vMotion process of a virtual machine is performed. After receiving this event trigger and related information, the network vendors can reconfigure the physical switch port policies such that when the virtual machine moves to another host, the VLAN/access control list (ACL) configurations are migrated along with the virtual machine. Multiple networking vendors have provided this automation between physical and virtual infrastructure configurations through integration with vSphere APIs. Customers should check with their networking vendors to learn whether such an automation tool exists that will bridge the gap between physical and virtual networking and simplify the operational challenges.

Virtual Network Monitoring and Troubleshooting

Monitoring and troubleshooting network traffic in a virtual environment require tools similar to those available in the physical switch environment. With the release of vSphere 5, VMware gives network administrators the ability to monitor and troubleshoot the virtual infrastructure through features such as NetFlow and port mirroring. NetFlow capability on a distributed switch, along with a NetFlow collector tool, helps monitor application flows and measures flow performance over time. It also helps in capacity planning and in ensuring that I/O resources are utilized properly by different applications, based on their needs. Port mirroring capability on a distributed switch is a valuable tool that helps network administrators debug network issues in a virtual infrastructure. Granular control over monitoring ingress, egress or all traffic of a port helps administrators fine-tune what traffic is sent for analysis.

vCenter Server on a Virtual Machine

As mentioned earlier, vCenter Server is only used to provision and manage VDS configurations. Customers can choose to deploy it on a virtual machine or a physical host, depending on their management resource design requirements.
In case of vCenter Server failure scenarios, the VDS will continue to provide network connectivity, but no VDS configuration changes can be performed. By deploying vCenter Server on a virtual machine, customers can take advantage of vSphere platform features such as vSphere High Availability (HA) and VMware Fault Tolerance (FT) to provide higher resiliency to the management plane. In such deployments, customers must pay more attention to the network configurations. This is because if the networking for a virtual machine hosting vCenter Server is misconfigured, the network connectivity of vCenter Server is lost. This misconfiguration must be fixed. However, customers need vCenter Server to fix the network configuration because only vCenter Server can configure a VDS. As a workaround, customers must use vSphere Client to connect directly to the host on which the vCenter Server virtual machine is running. Then they must reconnect the virtual machine hosting vCenter Server to a VSS that is also connected to the management network of hosts. After the virtual machine running vCenter Server is reconnected to the network, it can manage and configure the VDS. Refer to the community article “Virtual Machine Hosting a vCenter Server Best Practices” for guidance regarding the deployment of vCenter on a virtual machine: http://communities.vmware.com/servlet/JiveServlet/previewBody/14089-102-1-16292/VMhostVCBestPracitices.html.

Conclusion

A VMware vSphere distributed switch provides customers with the right measure of features, capabilities and operational simplicity for deploying a virtual network infrastructure. As customers move on to build private or public clouds, VDS provides the scalability numbers for such deployments. Advanced capabilities such as NIOC and LBT are key for achieving better utilization of I/O resources and for providing better SLAs for virtualized business-critical applications and multitenant deployments.
Support for standard networking visibility and monitoring features such as port mirroring and NetFlow helps administrators manage and troubleshoot a virtual infrastructure through familiar tools. VDS also is an extensible platform that enables integration with other networking vendor products through open vSphere APIs.

12. VMware Network I/O Control: Architecture, Performance and Best Practices

The Network I/O Control (NetIOC) feature available in VMware® vSphere™ 4.1 (“vSphere”) addresses the challenges of network consolidation by introducing a software approach to partitioning physical network bandwidth among the different types of network traffic flows. It does so by providing appropriate quality of service (QoS) policies enforcing traffic isolation, predictability and prioritization, therefore helping IT organizations overcome the contention resulting from consolidation. The experiments conducted in VMware performance labs using industry-standard workloads show that NetIOC:

• Maintains NFS and/or iSCSI storage performance in the presence of other network traffic such as vMotion™ and bursty virtual machines.

• Provides network service level guarantees for critical virtual machines.

• Ensures adequate bandwidth for VMware Fault Tolerance (VMware FT) logging.

• Ensures predictable vMotion performance and duration.

• Facilitates any situation where a minimum or weighted level of service is required for a particular traffic type independent of other traffic types.

The sections that follow discuss:

• Use cases and application of NetIOC with 10GbE in contrast to traditional 1GbE deployments

• The NetIOC technology and architecture used within the vNetwork Distributed Switch (vDS)

• How to configure NetIOC from the vSphere Client

• Examples of NetIOC usage to illustrate possible deployment scenarios

• Results from actual performance tests using NetIOC to illustrate how NetIOC can protect and prioritize traffic in the face of network contention and oversubscription

• Best practices for deployment

Moving from 1GbE to 10GbE

Virtualized datacenters are characterized by newer and complex types of network traffic flows such as vMotion and VMware FT logging traffic. In today’s virtualized datacenters where 10GbE connectivity is still not commonplace, networking is typically based on large numbers of 1GbE physical connections that are used to isolate different types of traffic flows and to provide sufficient bandwidth.

Table 1. Typical Deployment and Provisioning of 1GbE NICs with vSphere 4.0

Provisioning a large number of GbE network adapters to accommodate peak bandwidth requirements of these different types of traffic flows has a number of shortcomings:

• Limited bandwidth: Flows from an individual source (virtual machine, vMotion interface, and so on) are limited and bound to the bandwidth of a single 1GbE interface even if more bandwidth is available within a team

• Excessive complexity: Use of large numbers of 1GbE adapters per server leads to excessive complexity in cabling and management, with an increased likelihood of misconfiguration

• Higher capital costs: Large numbers of 1GbE adapters require more physical switch ports, which in turn leads to higher capital costs including additional switches and rack space

• Lower utilization: Static bandwidth allocation to accommodate peak bandwidth for different traffic flows means poor average network bandwidth utilization

10GbE provides ample bandwidth for all the traffic flows to coexist and share the same physical 10GbE link. Flows that were limited to the bandwidth of a single 1GbE link are now able to use as much as 10GbE. While the use of a 10GbE solution greatly simplifies the networking infrastructure and addresses all the shortcomings listed above, there are a few challenges that still need to be addressed to maximize the value of a 10GbE solution. One means of optimizing the 10GbE network bandwidth is to prioritize the network traffic by traffic flows. This ensures that latency-‐sensitive and critical traffic flows can access the bandwidth they need.

NetIOC enables the convergence of diverse workloads on a single networking pipe. It provides sufficient controls to the vSphere administrator in the form of limits and shares parameters to enable and ensure predictable network performance when multiple traffic types contend for the same physical network resources.

NetIOC Architecture

Prerequisites for NetIOC

NetIOC is only supported with the vNetwork Distributed Switch (vDS). With vSphere 4.1, a single vDS can span up to 350 ESX/ESXi hosts (500 as of vSphere 5.1), providing a simplified and more powerful management environment versus the per-host switch model using the vNetwork Standard Switch (vSS). The vDS also provides a superset of features and capabilities over that of the vSS, such as network vMotion, bi-directional traffic shaping and private VLANs.

Configuring and managing a vDS involves use of distributed port groups (DV Port Groups) and distributed virtual uplinks (dvUplinks). DV Port Groups are port groups associated with a vDS similar to port groups available with vSS. dvUplinks provide a level of abstraction for the physical NICs (vmnics) on each vSphere host.

NetIOC Feature Set

NetIOC provides users with the following features:

• Isolation: ensure traffic isolation so that a given flow will never be allowed to dominate over others, thus preventing drops and undesired jitter.

• Shares: allow flexible networking capacity partitioning to help users deal with overcommitment when flows compete aggressively for the same resources

• Limits: enforce traffic bandwidth limit on the overall vDS set of dvUplinks

• Load-Based Teaming: efficiently use a vDS set of dvUplinks for networking capacity

NetIOC Traffic Classes

The NetIOC concept revolves around resource pools that are similar in many ways to the ones already existing for CPU and Memory.

NetIOC classifies traffic into six predefined resource pools as follows:

• vMotion

• iSCSI

• FT logging

• Management

• NFS

• Virtual machine traffic

Figure 1. NetIOC Architecture

Shares

A user can specify the relative importance of a given resource-‐pool flow using shares that are enforced at the dvUplink level. The underlying dvUplink bandwidth is then divided among resource-‐pool flows based on their relative shares in a work-‐conserving way. This means that unused capacity will be redistributed to other contending flows and won’t go to waste. As shown in Figure 1, the network flow scheduler is the entity responsible for enforcing shares and therefore is in charge of the overall arbitration under overcommitment. Each resource-‐pool flow has its own dedicated software queue inside the scheduler so that packets from a given resource pool won’t be dropped due to high utilization by other flows.
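The work-conserving division described above can be sketched in a few lines of Python. This is only an illustrative model of proportional sharing on a single dvUplink, not VMware's scheduler: the function name `allocate_by_shares`, the flow names and all of the demand figures are assumptions made up for the example.

```python
from typing import Dict


def allocate_by_shares(
    capacity_mbps: float,
    shares: Dict[str, int],
    demand_mbps: Dict[str, float],
) -> Dict[str, float]:
    """Divide one uplink's bandwidth among flows in proportion to their shares.

    Work-conserving: a flow never gets more than it asks for, and whatever it
    leaves unused is re-divided among the flows that still want more.
    Shares are assumed to be positive integers.
    """
    allocation = {flow: 0.0 for flow in shares}
    # Flows that are actually sending and have not yet been fully served.
    active = {flow for flow in shares if demand_mbps.get(flow, 0.0) > 0.0}
    remaining = float(capacity_mbps)

    while active and remaining > 1e-9:
        total_shares = sum(shares[flow] for flow in active)
        fair = {f: remaining * shares[f] / total_shares for f in active}
        # Flows whose whole demand fits inside their proportional part.
        satisfied = {f for f in active if demand_mbps[f] <= fair[f]}
        if not satisfied:
            # Everyone wants more than their part: pure proportional split.
            for flow in active:
                allocation[flow] = fair[flow]
            break
        for flow in satisfied:
            allocation[flow] = demand_mbps[flow]
            remaining -= demand_mbps[flow]
        active -= satisfied
    return allocation


if __name__ == "__main__":
    # Hypothetical 10GbE uplink with three contending resource pools.
    print(allocate_by_shares(
        10_000,
        shares={"ft": 100, "nfs": 50, "vmotion": 50},
        demand_mbps={"ft": 1_000, "nfs": 6_000, "vmotion": 8_000},
    ))
```

In this made-up case the FT flow needs less than its proportional part, so the capacity it leaves unused is split between the NFS and vMotion flows according to their equal shares.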

Limits

A user can specify an absolute shaping limit for a given resource-‐pool flow using a bandwidth capacity limiter. As opposed to shares that are enforced at the dvUplink level, limits are enforced on the overall vDS set of dvUplinks, which means that a flow of a given resource pool will never exceed a given limit for a vDS out of a given vSphere host.
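To make the difference in scope concrete, the following Python sketch caps the combined rate of one resource pool across all dvUplinks of a host. The text only states that the limit applies to the whole set of dvUplinks; how the cap is spread over individual uplinks is not described, so the proportional scaling used here, like the function name `apply_host_limit` and the sample rates, is purely an assumption for illustration.

```python
from typing import Dict, Optional


def apply_host_limit(
    per_uplink_mbps: Dict[str, float],
    limit_mbps: Optional[float],
) -> Dict[str, float]:
    """Cap one resource pool's total rate across a host's dvUplinks.

    per_uplink_mbps: rate the pool would otherwise send on each dvUplink.
    limit_mbps: the pool's limit for this vDS on this host, or None if unlimited.
    Assumption: when the cap applies, each uplink is scaled by the same factor.
    """
    if limit_mbps is None:
        return dict(per_uplink_mbps)
    total = sum(per_uplink_mbps.values())
    if total <= limit_mbps:
        return dict(per_uplink_mbps)
    scale = limit_mbps / total
    return {uplink: rate * scale for uplink, rate in per_uplink_mbps.items()}


if __name__ == "__main__":
    # Hypothetical vMotion pool limited to 3000 Mbps on a two-uplink team.
    print(apply_host_limit({"dvUplink1": 4_000, "dvUplink2": 2_000}, 3_000))
```

Whatever the per-uplink split turns out to be, the point of the sketch is that the sum over the team, not the rate on any single dvUplink, is what the limit bounds.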

Load-‐Based Teaming (LBT)

vSphere 4.1 introduced a load-based teaming (LBT) policy that ensures vDS dvUplink capacity is optimized. LBT avoids the situation, possible with other teaming policies, in which some of the dvUplinks in a DV Port Group’s team are idle while others are completely saturated just because the teaming policy in use is statically determined (for example, IP hashing). LBT reshuffles port bindings dynamically based on load and dvUplink usage to make efficient use of the available bandwidth. LBT only moves ports to dvUplinks configured for the corresponding DV Port Group’s team. Note that LBT does not use shares or limits to make its judgment while rebinding ports from one dvUplink to another. LBT is not the default teaming policy in a DV Port Group, so it is up to the user to configure it as the active policy.

LBT will only move a flow when the mean send or receive utilization on an uplink exceeds 75 percent of capacity over a 30-‐second period. LBT will not move flows more often than every 30 seconds.
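The trigger rule above can be expressed as a small decision function. This Python sketch only models the two conditions stated in the text (mean send or receive utilization above 75 percent over 30 seconds, and no more than one move per 30 seconds); the function names, the sampling scheme and the example numbers are hypothetical and say nothing about how ESXi measures utilization internally.

```python
from statistics import mean
from typing import Optional, Sequence

LBT_THRESHOLD = 0.75  # fraction of uplink capacity, as stated in the text
LBT_WINDOW_S = 30.0   # averaging window and minimum interval between moves


def uplink_is_overloaded(
    tx_mbps: Sequence[float],
    rx_mbps: Sequence[float],
    capacity_mbps: float,
) -> bool:
    """True if mean send OR mean receive over the window exceeds 75 percent.

    tx_mbps / rx_mbps: non-empty utilization samples covering the last 30 s.
    """
    busiest = max(mean(tx_mbps), mean(rx_mbps))
    return busiest > LBT_THRESHOLD * capacity_mbps


def may_move_port(
    now_s: float,
    last_move_s: Optional[float],
    tx_mbps: Sequence[float],
    rx_mbps: Sequence[float],
    capacity_mbps: float,
) -> bool:
    """Combine the overload test with the 30-second spacing between moves."""
    if last_move_s is not None and now_s - last_move_s < LBT_WINDOW_S:
        return False
    return uplink_is_overloaded(tx_mbps, rx_mbps, capacity_mbps)


if __name__ == "__main__":
    # Hypothetical 10GbE uplink averaging 8 Gbps outbound over the window.
    print(may_move_port(100.0, None, [9_000, 8_000, 7_000], [100, 100, 100], 10_000))
```

A short burst that pushes a single sample above the threshold does not qualify on its own, because the comparison is made against the mean of the whole window.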

Configuring NetIOC

NetIOC is configured through the vSphere Client in the Resource Allocation tab of the vDS from within the “Home->Inventory->Networking” panel.

NetIOC is enabled by clicking on “Properties...” on the right side of the panel and then checking “Enable network I/O control on this vDS” in the pop up box.

The Limits and Shares for each traffic type are configured by right-clicking on the traffic type (for example, Virtual Machine Traffic) and selecting “Edit Settings...” This will bring up a Network Resource Pool Setting dialog box in which you can select the Limits and Shares values for that traffic type.

NetIOC Usage

Unlike the limits that are specified in absolute units of Mbps, shares are used to specify the relative importance of the flows. Shares are specified in abstract units with a value ranging from 1 to 100. In this section, we provide an example that describes the usage of shares.

Figure 6. NetIOC shares usage example

Figure 6 highlights the following characteristics of the shares:

• In absence of any other traffic, a particular traffic flow gets 100 percent of the bandwidth available, even if it was configured with 25 shares

• During the periods of contention, the bandwidth is divided among the traffic flows based on their relative shares

NetIOC Performance

In this section, we describe in detail the test-bed configuration, the workloads used to generate the network traffic flows, and the test results.

Test Configuration

In our test configuration, we used an ESX cluster that comprised two Dell PowerEdge R610 servers running the GA release of ESX 4.1. Each of the servers was configured with dual-socket, quad-core 2.27 GHz Intel Xeon L5520 processors, 96 GB of RAM, and a 10 GbE Intel Oplin NIC. The following figure depicts the hardware configuration used in our tests. The complete hardware details are provided in Appendix A.

Figure 7. Physical Hardware Setup Used in the Tests

In our test configuration, we used a single vDS that spanned both vSphere hosts. We configured the vDS with a single dvUplink (dvUplink1). The 10GbE physical NIC port on each of two vSphere hosts was mapped to dvUplink1. We configured the vDS with four DV Port Groups as follows:

• dvPortGroup-FT for FT logging traffic

• dvPortGroup-NFS for NFS traffic

• dvPortGroup-VM for virtual machine traffic

• dvPortGroup-vMotion for vMotion traffic

Using four distinct DV Port Groups enabled us to easily track the network bandwidth usage of the different traffic flows. As shown in Figure 8, on both vSphere hosts, the virtual network adapters (vNICs) of all the virtual machines used for virtual machine traffic, and the VMkernel interfaces (vmknics) used for vMotion, NFS, and VMware FT logging were configured to use the same 10GbE physical network adapter through the vDS interface.

Figure 8. vDS Configuration Used in the Tests

Workloads Used for Performance Testing

To simulate realistic high network I/O load scenarios, we used the industry-‐standard workloads SPECweb2005 and SPECjbb2005, as they are representative of what most customers would run in their environments.

SPECweb2005 workload: SPECweb2005 is an industry-‐standard web server workload that is comprised of three component workloads. The support workload emulates a vendor support site that provides downloads — such as driver updates and documentation — over HTTP. It is a highly intensive networking workload. The performance score of the workload is measured in terms of the number of simultaneous user sessions that meet the quality of service requirements specified by the benchmark.

SPECjbb2005 workload: SPECjbb2005 is an industry-‐standard server-‐side Java benchmark. It is a highly memory-‐intensive workload because of Java’s usage of the heap and associated garbage collection. Due to these characteristics, when a virtual machine running a SPECjbb2005 workload is subject to vMotion, one could expect to generate heavy vMotion network traffic. This is because during vMotion the entire memory state of the virtual machine is transferred from the source ESX server to a destination ESX server through a high-‐speed network. During the process of migration, if the memory state of the virtual machine is actively changing, vMotion will need multiple iterations to transfer the active memory state that results in an increase in duration of vMotion and the associated network traffic.
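The effect of an actively changing memory state on migration traffic can be illustrated with a simplified iterative pre-copy model. This is a generic textbook-style approximation, not VMware's vMotion implementation: the function `precopy_rounds`, the stop threshold, the round cap and every number in the example are assumptions chosen only to show the trend.

```python
from typing import Tuple


def precopy_rounds(
    memory_mb: float,
    dirty_rate_mb_s: float,
    link_mb_s: float,
    stop_threshold_mb: float = 50.0,
    max_rounds: int = 30,
) -> Tuple[int, float]:
    """Estimate copy rounds and total data sent for an iterative pre-copy.

    Round 1 sends all memory. Each later round resends the pages dirtied
    while the previous round was on the wire. Copying stops once the dirty
    set is small enough to send in one short final pass, or after
    max_rounds (the model does not converge if dirtying outpaces the link).
    Returns (number_of_rounds, total_megabytes_sent).
    """
    to_send = memory_mb
    total_sent = 0.0
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        total_sent += to_send
        seconds = to_send / link_mb_s
        dirtied = min(memory_mb, dirty_rate_mb_s * seconds)
        if dirtied <= stop_threshold_mb:
            total_sent += dirtied  # final short pass
            break
        to_send = dirtied
    return rounds, total_sent


if __name__ == "__main__":
    # Hypothetical 4 GB virtual machine on a link moving 1000 MB/s.
    for label, dirty in (("idle", 0.0), ("moderate", 100.0), ("heavy", 500.0)):
        print(label, precopy_rounds(4096.0, dirty, 1000.0))
```

In this model a virtual machine that dirties memory faster needs more rounds and sends more data in total, which matches the qualitative behavior described above for the SPECjbb2005 virtual machine.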

IOmeter workload: IOmeter was used to generate NFS traffic.

NetIOC Performance Test Scenarios

Impact of the (or lack of the) network resource management controls is evident only when aggregate bandwidth requirements of the competing traffic flows exceed the available interface bandwidth. The impact is more apparent when one of the competing traffic flows is latency sensitive. Accordingly, we designed three different test scenarios with a mix of critical and noncritical traffic flows, with the aggregate bandwidth requirements of all the traffic flows under consideration exceeding the capacity of the network interface.

To evaluate and compare the performance and scalability of the virtualized environment with and without NetIOC controls, we used the following scenarios:

• Virtual machine and vMotion traffic flows contending on a vmnic

• NFS, VMware FT, virtual machine, and vMotion traffic flows contending on a vmnic

• Multiple vMotion traffic flows initiated from different vSphere hosts converging onto the same destination vSphere host

The goal was to determine if NetIOC provides good controls in achieving the QoS requirements in SPECweb2005 testing environments that otherwise would not have been met in absence of NetIOC.

Test Scenario 1: Using Two Traffic Flows—Virtual Machine Traffic and vMotion Traffic

We chose latency-‐sensitive SPECweb2005 traffic and vMotion traffic flows in our first set of tests. The goal was to evaluate the performance of a SPECweb2005 workload in a virtualized environment with and without NetIOC when latency-‐sensitive SPECweb2005 traffic and vMotion traffic contended for the same physical network resources. As shown in Figure 9, our test-‐bed was configured such that both the traffic flows used the same 10GbE physical network adapter. This was done by mapping the virtual network adapters of the virtual machines (used for SPECweb2005 traffic) and the VMkernel interface (used for vMotion traffic) to the same 10GbE physical network adapter. The complete experimental setup details for these tests are provided in Appendix B.

Figure 9. Setup for the Test Scenario 1

At first, we measured the bandwidth requirements of the SPECweb2005 virtual machine traffic and vMotion traffic flows in isolation. The bandwidth usage of the virtual machine traffic while running 17,000 SPECweb2005 user sessions was a little more than 7Gbps during the steady-‐state interval of the benchmark. The peak network bandwidth usage of the vMotion traffic flow used in our tests was measured to be more than 8Gbps. Thus, if both traffic flows used the same physical resources, the aggregate bandwidth requirements would certainly exceed the 10GbE interface capacity. In the test scenario, during the steady-‐state period of the SPECweb2005 benchmark, we initiated vMotion traffic flow, which resulted in both the vMotion traffic and the virtual machine traffic flows contending on the same physical 10GbE link.

Figure 10 shows the performance of the SPECweb2005 workload in a virtualized environment without NetIOC. The graph plots the number of SPECweb2005 user sessions that meet the QoS requirements (“Time Good”) at a given time. In this graph, the first dip corresponds to the start of the steady-‐state interval of the SPECweb2005 benchmark when the statistics are cleared. The second dip corresponds to the loss of QoS due to vMotion traffic competing for the same physical network resources.

Figure 10. SPECweb2005 Performance without NetIOC

We note that when we repeated the same test scenario several times, the loss of performance shown in the graph varied, possibly due to the nondeterministic nature of vMotion traffic. Nevertheless, these tests clearly demonstrate that the lack of network resource management controls results in a loss of both the performance and the predictability required to guarantee SLAs for critical traffic flows.

Figure 11 shows the performance of a SPECweb2005 workload in a virtualized environment with NetIOC controls in place. We configured the virtual machine traffic with twice the number of shares configured for vMotion traffic. In other words, we ensured the virtual machine traffic had twice the priority of vMotion traffic when both traffic flows competed for the same physical network resources. Our tests revealed that although the duration of the vMotion doubled due to the controls enforced by NetIOC, the SPECweb2005 performance, as shown in Figure 11, was unperturbed by the vMotion traffic.

Figure 11. SPECweb2005 Performance with NetIOC

Test Scenario 2: Using Four Traffic Flows — NFS Traffic, Virtual Machine Traffic, VMware FT Traffic and vMotion Traffic

In this test scenario, we chose a very realistic customer deployment scenario that featured fault-‐tolerant Web servers.

A recent VMware customer survey found that Web servers topped the list of popular applications used in conjunction with the VMware FT feature. This is no coincidence, because fault-tolerant Web servers provide some compelling features that are not available with typical Web server-farm deployment scenarios using load balancers that redirect user requests when a Web server goes down. Such load-balancer-based solutions may not be the most customer-friendly for Web sites that provide very large downloads, such as driver updates and documentation. As an example, consider a failure of a Web server while a user is downloading a large user manual. In a load-balancer-based Web-farm deployment scenario, the user request will fail (or time out) and the user will need to resubmit it. On the other hand, in a VMware FT–enabled Web server environment, the user will not experience such a failure, because a secondary hypervisor has full information on pending I/O operations from the failed primary virtual machine and commits all the pending I/O. Refer to VMware vSphere 4 Fault Tolerance: Architecture and Performance for more information on VMware FT.

As shown in Figure 12, our test-‐bed was configured such that all the traffic flows used in the test mix contended for the same network resources. The complete experimental setup details for these tests are provided in Appendix B.

Figure 12. Setup for the Test Scenario 2

Our test-‐bed configuration featured four virtual machines that included:

• Two VMware FT–enabled Web server virtual machines serving SPECweb2005 benchmark requests (that generated virtual machine traffic and VMware FT logging traffic)

• One virtual machine (VM3) accessing an NFS store (that generated NFS traffic)

• One virtual machine (VM4) running a SPECjbb2005 workload (used to generate vMotion traffic)

At first, we measured the network bandwidth usage of all four traffic flows in isolation. Table 2 describes the network bandwidth usage.

Table 2. Network Bandwidth Usage of the Four Traffic Flows used in the Test Environment

The goal was to evaluate the latencies of critical traffic flows including VMware FT and NFS traffic in a virtualized environment with and without NetIOC controls when four traffic flows contended for the same physical network resources. The test scenario had three phases:

Phase 1: The SPECweb2005 workload in the two VMware FT–enabled virtual machines was in the steady state.

Phase 2: The NFS workload in VM3 became active. SPECweb2005 workload in the other two virtual machines continued to be active.

Phase 3: The VM4 running the SPECjbb2005 workload was subject to vMotion while the NFS and SPECweb2005 workloads remained active in the other virtual machines.

The following figures depict the performance of different traffic flows in absence of NetIOC.

Let us first consider the performance of the VMware FT–enabled Web server virtual machines. The graph (Figure 13) plots the number of SPECweb2005 user sessions that meet the QoS requirements (“Time Good”) at a given time. In this graph, the first dip corresponds to the start of the steady-state interval of the SPECweb2005 benchmark when the statistics are cleared. The second dip corresponds to the loss of QoS due to multiple traffic flows competing for the same physical network resources. The number of SPECweb2005 user sessions that meet the QoS requirements dropped by about 67 percent during the period of contention. We note that the SPECweb2005 performance degradation in the VMware FT environment was much more severe in the absence of NetIOC than what we observed in the first test scenario. This is because in a VMware FT environment, the primary and secondary virtual machines run in vLockstep, and so the network link between the primary and secondary ESX hosts plays a critical role in performance. During the periods of heavy contention on the network link, the primary virtual machine will make little or no forward progress.

Figure 13. SPECweb2005 Performance in a VMware FT Environment without NetIOC

Figure 14. NFS Access Latency without NetIOC

Similarly, we noticed a significant jump in the NFS store access latencies. As shown in Figure 14, the maximum I/O latency reported by the IOmeter increased from a mere 162 ms to 2166 ms (a factor of 13).

Figure 15. Network Bandwidth Usage of Traffic Flows in Different Phases without NetIOC

A detailed explanation of the bandwidth usage in each phase follows:

Phase 1: In this phase, the VMware FT–enabled VM1 and VM2 were active and the SPECweb2005 benchmark was in a steady-‐state interval. The aggregate network bandwidth usage of the virtual machine traffic flow and the VMware FT logging traffic flows was less than 4Gbps.

Phase 2: At the beginning of this phase, VM3 became active and added NFS traffic flow to the test mix. This resulted in three traffic flows competing for the network resources. Even so there was no difference in the QoS, as the aggregate bandwidth usage was still less than 5Gbps.

Phase 3: The addition of the vMotion traffic flow to the test mix resulted in the aggregate bandwidth requirements of the four traffic flows exceeding the capacity of the physical 10GbE link. Lack of any control mechanism to manage access to the 10GbE bandwidth resulted in vSphere sharing the bandwidth among all the traffic flows. Critical traffic flows including VMware FT and NFS traffic flows got the same treatment as the vMotion traffic flow, which resulted in a significant drop in performance.

The performance requirements of the different traffic flows must be considered to put network I/O resource controls in place. In general, the bandwidth requirement of the VMware FT logging traffic is expected to be much smaller than the requirements of the other traffic flows. However, given its impact on performance, we configured VMware FT logging traffic with the highest priority over other traffic flows. We also ensured NFS traffic and virtual machine traffic flows had higher priority over vMotion traffic. Figure 16 shows shares assigned to the different traffic flows.

Figure 16. Share Allocation to Different Traffic Flows with NetIOC

Figure 17 shows the network bandwidth usage of the different traffic flows in different phases. As shown in the figure, thanks to the network I/O resource controls, vSphere was able to enforce priority among the traffic flows, and so the bandwidth usage of the critical traffic flows remained unperturbed during the period of contention.

Figure 17. Network Bandwidth Usage of Traffic Flows in Different Phases with NetIOC

The following figures show the performance of SPECweb2005 and NFS workloads in a VMware FT–enabled virtualized environment with NetIOC in place. As shown in the figures, vSphere was able to ensure service level guarantees to both the workloads in all the phases.

Figure 18. SPECweb2005 Performance in FT Environment with NetIOC

Figure 19. NFS Access Latency with NetIOC

The maximum I/O latency reported by the IOmeter remained unchanged at 162 ms in all the phases, and the SPECweb2005 performance remained unaffected by the network bandwidth usage spike caused by the vMotion traffic flow.

Test Scenario 3: Using Multiple vMotion Traffic Flows

In this final test scenario, we will show how NetIOC can be used in combination with Traffic Shaper to provide a comprehensive network convergence solution in a virtualized datacenter environment.

While NetIOC enables you to limit vMotion traffic initiated from a vSphere host, it fails to prevent performance loss when multiple vMotion traffic flows initiated on different vSphere hosts converge onto a single vSphere host and possibly overwhelm the latter. We will show how a solution based on NetIOC and Traffic Shaper can prevent such an unlikely event.

In vSphere 4.0, support for traffic shaping was introduced, providing some rudimentary controls on network bandwidth usage. For instance, it only provided bandwidth usage controls at the port level, and did not enforce prioritization among traffic flows. These controls were provided for both egress and ingress traffic. In vSphere deployment, the egress and ingress traffic are with respect to a vDS (or vSS). The traffic going into a vDS is ingress/input, and traffic leaving a vDS is egress/output. So, from the perspective of a vNIC port (or vmknic port), the network traffic from the physical network (or vmnic) will ingress into the vDS and egress from vDS to vNIC. Similarly, the traffic flow from vNIC will ingress into the vDS and egress to the physical network (or vmnic). In other words, the ingress and egress need to be interpreted as follows:

Ingress traffic: traffic from a vNIC (or vmknic) to vDS

Egress traffic: traffic from vDS to the vNIC (or vmknic)

In this final test scenario, we added a third vSphere host to the same cluster that we used in our previous tests. As shown in Figure 20, the cluster used for this test comprised three vSphere hosts.

We initiated vMotion traffic (peak network bandwidth usage of 9Gbps) from vSphere Host 2, and vMotion traffic (peak network bandwidth usage close to 1Gbps) from vSphere Host 3. Both of these traffic flows converged onto the same destination vSphere host (Host 1). Below, we describe the results of the three test configurations.

Without NetIOC

As a point of reference, we first disabled NetIOC in our test configuration. Our tests indicated that, without any controls, the receive link on Host 1 was fully saturated due to multiple vMotion traffic flows whose aggregate network bandwidth usage exceeded the link capacity.

With NetIOC

As shown in Figure 21, we used NetIOC to enforce limits on vMotion traffic.

Figure 21. NetIOC Settings to Enforce Limits on vMotion Traffic Flow

Figure 22 shows the Rx network bandwidth usage on Host 1 (with NetIOC controls in place) as multiple vMotion traffic flows converge on it.

Figure 22. Rx Network Bandwidth Usage on Host 1 with Multiple vMotions (with NetIOC On)


Phase 1: In this phase, vMotion from Host 3 to Host 1 was active. Due to the 1GbE link capacity on Host 3, the bandwidth usage of the vMotion traffic flow was limited to 1Gbps.

Phase 2: At the beginning of this phase, vMotion from Host 2 to Host 1 became active, resulting in two active vMotion traffic flows converging onto the same destination vSphere host. Thanks to the NetIOC controls, the vMotion traffic flow from Host 2 was limited to 3Gbps. The aggregate network bandwidth usage of both the active vMotion flows was close to 4Gbps.

NOTE: If there had been more concurrent vMotions (even if such an event is very unlikely), NetIOC would have failed to prevent these vMotions from saturating the receive link on the Host 1.
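The arithmetic behind this note can be sketched in a few lines. This is an illustration only: the 3Gbps per-host limit and the 10GbE receive link come from the test above, while the function names and the four-source case are hypothetical and not part of the original test.

```python
LINK_CAPACITY_GBPS = 10.0  # receive link on the destination host (10GbE)
NETIOC_LIMIT_GBPS = 3.0    # per-host vMotion limit, as in the test above


def capped_rate(offered_gbps, limit_gbps=NETIOC_LIMIT_GBPS):
    """Rate a source host actually transmits once its NetIOC limit applies."""
    return min(offered_gbps, limit_gbps)


def receive_load(offered_rates_gbps):
    """Aggregate rate arriving at the destination from all source hosts."""
    return sum(capped_rate(r) for r in offered_rates_gbps)


# Tested case: Host 2 offers about 9Gbps, Host 3 is bound by its 1GbE link.
tested_load = receive_load([9.0, 1.0])

# Hypothetical case (not from the test): four 10GbE sources at once.
hypothetical_load = receive_load([9.0, 9.0, 9.0, 9.0])

print(tested_load, tested_load > LINK_CAPACITY_GBPS)
print(hypothetical_load, hypothetical_load > LINK_CAPACITY_GBPS)
```

Because each limit is enforced on the sending host, the destination sees the sum of the capped rates, which is why a receive-side control is still needed.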

With NetIOC and Traffic Shaper

With NetIOC controls in place, we also used Traffic Shaper to enforce limits on the egress traffic. NetIOC controls obviate the need for traffic-shaping policies on ingress traffic. Accordingly, as shown in Figure 23, we used Traffic Shaper to enforce policies only on egress traffic. Also note that each of the DV Port Groups can have its own traffic-shaping policy. In our example, we configured the dvPortGroup-vMotion with the traffic-shaping policies shown in Figure 23.

Figure 24 shows the Rx network bandwidth usage on Host 1 (with both NetIOC and Traffic Shaper controls in place) as multiple vMotion traffic flows converge on it.


Phase 1: In this phase, vMotion from Host 3 to Host 1 was active. Due to the 1GbE link capacity on Host 3, the bandwidth usage of the vMotion traffic flow was limited to 1Gbps.

Phase 2: At the beginning of this phase, vMotion from Host 2 to Host 1 became active, resulting in two active vMotion traffic flows converging onto the same destination vSphere host. With both NetIOC and Traffic Shaper controls in place, the aggregate bandwidth usage on the receiver never exceeded 3Gbps.

These tests confirm that NetIOC in combination with Traffic Shaper can be a viable solution that provides effective controls on both receive and transmit traffic flows in a virtualized datacenter environment.

NetIOC Best Practices

NetIOC is a very powerful feature that will make your vSphere deployment even more suitable for your I/O-‐consolidated datacenter. However, follow these best practices to optimize the usage of this feature:

Best practice 1: When using bandwidth allocation, use “shares” instead of “limits,” as the former has greater flexibility for unused capacity redistribution. Partitioning the available network bandwidth among different types of network traffic flows using limits has shortcomings. For instance, allocating 2Gbps bandwidth by using a limit for the virtual machine resource pool provides a maximum of 2Gbps bandwidth for all the virtual machine traffic even if the team is not saturated. In other words, limits impose hard limits on the amount of the bandwidth usage by a traffic flow even when there is network bandwidth available.

Best practice 2: If you are concerned about physical switch and/or physical network capacity, consider imposing limits on a given resource pool. For instance, you might want to put a limit on vMotion traffic flow to help in situations where multiple vMotion traffic flows initiated on different ESX hosts at the same time could possibly oversubscribe the physical network. By limiting the vMotion traffic bandwidth usage at the ESX host level, we can prevent the possibility of jeopardizing performance for other flows going through the same points of contention.

Best practice 3: Fault tolerance is a latency-sensitive traffic flow, so it is recommended to always set the corresponding resource-pool shares to a reasonably high relative value in the case of custom shares. However, in the case where you are using the predefined default shares value for VMware FT, leaving it set to high is recommended.

Best practice 4: We recommend that you use LBT as your vDS teaming policy while using NetIOC in order to maximize the networking capacity utilization.

NOTE: As LBT moves flows among uplinks it may occasionally cause reordering of packets at the receiver.

Best practice 5: Use the DV Port Group and Traffic Shaper features offered by the vDS to maximum effect when configuring the vDS. Configure each of the traffic flow types with a dedicated DV Port Group. Use DV Port Groups as a means to apply configuration policies to different traffic flow types, and more important, to provide additional Rx bandwidth controls through the use of Traffic Shaper. For instance, you might want to enable Traffic Shaper for the egress traffic on the DV Port Group used for vMotion. This can help in situations when multiple vMotions initiated on different vSphere hosts converge to the same destination vSphere server.
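The contrast drawn in best practice 1 between shares and limits can be illustrated with a small model. This is a simplified sketch, not NetIOC's actual scheduler: the function names, share values, and demand figures are assumptions chosen for illustration; only the 2Gbps limit example and the 10GbE link come from the text.

```python
def allocate_by_shares(capacity, demands, shares):
    """Split capacity in proportion to shares among flows with demand.

    Bandwidth a flow does not need is handed back and re-split among the
    flows that still want more (simplified model, assumed for illustration).
    """
    alloc = {name: 0.0 for name in demands}
    active = {name for name, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(shares[name] for name in active)
        satisfied = set()
        granted = 0.0
        for name in active:
            fair = remaining * shares[name] / total_shares
            give = min(fair, demands[name] - alloc[name])
            alloc[name] += give
            granted += give
            if alloc[name] >= demands[name] - 1e-9:
                satisfied.add(name)
        remaining -= granted
        if not satisfied:
            break  # every active flow took its full proportional slice
        active -= satisfied
    return alloc


def allocate_by_limit(demand, limit):
    """A limit is a hard cap, applied whether or not the link is busy."""
    return min(demand, limit)


shares = {"vm": 100, "vmotion": 50}  # hypothetical share values

# Link otherwise idle: VM traffic wants 6Gbps, no vMotion running.
idle_with_shares = allocate_by_shares(10, {"vm": 6, "vmotion": 0}, shares)
idle_with_limit = allocate_by_limit(6, 2)  # the 2Gbps limit from the text

# Contention: both flows want more than the link can carry together.
busy_with_shares = allocate_by_shares(10, {"vm": 6, "vmotion": 9}, shares)

print(idle_with_shares, idle_with_limit, busy_with_shares)
```

In the idle case the share-based split lets the virtual machine flow use everything it asks for, whereas the limit holds it at 2Gbps even though the rest of the link is unused.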

Conclusions

Consolidating the legacy GbE networks in a virtualized datacenter environment with 10GbE offers many benefits — ease of management, lower capital costs and better utilization of network resources. However, during the peak periods of contention, the lack of control mechanisms to share the network I/O resources among the traffic flows can result in significant performance drop of critical traffic flows. Such performance loss is unpredictable and uncontrollable if the access to the network I/O resources is unmanaged. NetIOC provides a mechanism to manage the access to the network I/O resources when multiple traffic flows compete. The experiments conducted in VMware performance labs using industry standard workloads show that:

• Lack of NetIOC can result in unpredictable loss in performance of critical traffic flows during periods of contention.
• NetIOC can effectively provide service level guarantees to the critical traffic flows. Our test results showed that NetIOC eliminated a performance drop of as much as 67 percent observed in an unmanaged scenario.
• NetIOC in combination with Traffic Shaper provides a comprehensive network convergence solution enabling features that are not available with any of the hardware solutions in the market today.

13. Storage I/O Control Technical Overview and Considerations for Deployment

What’s new in vSphere 5.0

vSphere Storage I/O Control now supports NFS – Set storage quality of service priorities per virtual machine for better access to storage resources for high-‐priority applications

Storage I/O Control (SIOC) provides storage I/O performance isolation for virtual machines, thus enabling VMware® vSphere™ (“vSphere”) administrators to comfortably run important workloads in a highly consolidated virtualized storage environment. It protects all virtual machines from undue negative performance impact due to misbehaving I/O-heavy virtual machines, often known as the “noisy neighbor” problem.

Furthermore, the service level of critical virtual machines can be protected by SIOC by giving them preferential I/O resource allocation during periods of congestion. SIOC achieves these benefits by extending the constructs of shares and limits, used extensively for CPU and memory, to manage the allocation of storage I/O resources.

SIOC improves upon the previous host-‐level I/O scheduler by detecting and responding to congestion occurring at the array, and enforcing share-‐based allocation of I/O resources across all virtual machines and hosts accessing a datastore.

With SIOC, vSphere administrators can mitigate the performance loss of critical workloads due to high congestion and storage latency during peak load periods. The use of SIOC will produce better and more predictable performance behavior for workloads during periods of congestion. Benefits of leveraging SIOC:

• Provides performance protection by enforcing proportional fairness of access to shared storage
• Detects and manages bottlenecks at the array
• Maximizes your storage investments by enabling higher levels of virtual-machine consolidation across your shared datastores

The purpose of this paper is to explain the basic mechanics of how SIOC, a new feature in vSphere 4.1, works and to discuss considerations for deploying it in your VMware virtualized environments.

The Challenge of Shared Resources

Controlling the dynamic allocation of resources in distributed systems has been a long-‐standing challenge. Virtualized environments introduce further challenges because of the inherent sharing of physical resources by many virtual machines. VMware has provided ways to manage shared physical resources, such as CPU and memory, and to prioritize their use among all the virtual machines in the environment. CPU and memory controls have worked well since memory and CPU resources are shared only at a local-‐host level, for virtual machines residing within a single ESX® server.

The task of regulating shared resources that span multiple ESX hosts, such as shared datastores, presents new challenges, because these resources are accessed in a distributed manner by multiple ESX hosts. Previous disk shares did not address this challenge, as the shares and limits were enforced only at a single ESX host level, and were only enforced in response to host-‐side HBA bottlenecks, which occur rarely. This approach had the problem of potentially allowing lower-‐priority virtual machines greater access to storage resources based on their placement across different ESX hosts, as well as neglecting to provide benefits in the case that the datastore is congested but the host-‐side queue is not. An ideal I/O resource-‐management solution should provide the allocation of I/O resources independent of the placement of virtual machines and with consideration of the priorities of all virtual machines accessing the shared datastore. It should also be able to detect and control all instances of congestion happening at the shared resource.

The Storage I/O Control Solution

SIOC solves the problem of managing shared storage resources across ESX hosts. It provides a fine-‐grained storage-‐control mechanism by dynamically managing the size of, and access to, ESX host I/O queues based on assigned shares. SIOC enhances the disk-‐shares capabilities of previous releases of VMware ESX Server by enforcing these disk shares not only at the local-‐host level but also at the per-‐datastore level. Additionally, for the first time, vSphere with SIOC provides storage-‐device latency monitoring and control, with which SIOC can throttle back storage workloads according to their priority in order to maintain total storage-‐device latency below a certain threshold.

How Storage I/O Control Works

SIOC monitors the latency of I/Os to datastores at each ESX host sharing that device. When the average normalized datastore latency exceeds a set threshold (30ms by default), the datastore is considered to be congested, and SIOC kicks in to distribute the available storage resources to virtual machines in proportion to their shares. This is to ensure that low-priority workloads do not monopolize or reduce I/O bandwidth for high-priority workloads. SIOC accomplishes this by throttling back the storage access of the low-priority virtual machines by reducing the number of I/O queue slots available to them. Depending on the mix of virtual machines running on each ESX server and the relative I/O shares they have, SIOC may need to reduce the number of device queue slots that are available on a given ESX server.

Host-‐Level Versus Datastore-‐Level Disk Schedulers

It is important to understand the way queuing works in the VMware virtualized storage stack to have a clear understanding of how SIOC functions. SIOC leverages the existing host device queue to control I/O prioritization. Prior to vSphere 4.1, the ESX server device queues were static and virtual-machine storage access was controlled within the context of the storage traffic on a single ESX server host. With vSphere 4.1, SIOC provides datastore-wide disk scheduling that responds to congestion at the array, not just on the host-side HBA. This provides an ability to monitor and dynamically modify the size of the device queues of each ESX server based on storage traffic and the priorities of all the virtual machines accessing the shared datastore.

An example of a local host-‐level disk scheduler is as follows:

Figure 1 shows the local scheduler influencing ESX host-‐level prioritization as two virtual machines are running on the same ESX server with a single virtual disk on each.

Figure 1. I/O Shares for Two Virtual Machines on a Single ESX Server (Host-‐Level Disk Scheduler)

When the I/O shares for the virtual disks (VMDKs) of those virtual machines are set to different values, the local scheduler prioritizes the I/O traffic, but only if the local HBA becomes congested.

This described host-‐level capability has existed for several years in ESX Server prior to vSphere 4.1. It is this local-‐host level disk scheduler that also enforces the limits set for a given virtual-‐machine disk. If a limit is set for a given VMDK, the I/O will be controlled by the local disk scheduler so as to not exceed the defined amount of I/O per second.

vSphere 4.1 has added two key capabilities: (1) the enforcement of I/O prioritization across all ESX servers that share a common datastore, and (2) detection of array-‐side bottlenecks. These are accomplished by way of a datastore-‐wide distributed disk scheduler that uses I/O shares per virtual machine to determine whether device queues need to be throttled back on a given ESX server to allow a higher-‐priority workload to get better performance. The datastore-‐wide disk scheduler totals up the disk shares for all the VMDKs that a virtual machine has on the given datastore. The scheduler then calculates what percentage of the shares the virtual machine has compared to the total number of shares of all the virtual machines running on the datastore. This percentage of shares is displayed in the list of details shown in the view of virtual machines tab for each datastore, as seen in Figure 2.

Figure 2. Datastore View of Disk Share Allocation Among Virtual Machines
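The share-percentage calculation described above (total the disk shares of each virtual machine's VMDKs on the datastore, then divide by the total for all virtual machines) can be sketched as follows. The function name, virtual machine names, and share values are hypothetical and only illustrate the arithmetic.

```python
def share_percentages(vmdk_shares_by_vm):
    """Return each VM's percentage of the total disk shares on a datastore.

    vmdk_shares_by_vm maps a VM name to the list of share values of the
    VMDKs that VM has on this datastore.
    """
    per_vm_total = {vm: sum(vmdks) for vm, vmdks in vmdk_shares_by_vm.items()}
    datastore_total = sum(per_vm_total.values())
    return {vm: 100.0 * total / datastore_total
            for vm, total in per_vm_total.items()}


# Hypothetical datastore: vm-a has two VMDKs, the others one each.
example = {
    "vm-a": [1000, 500],
    "vm-b": [500],
    "vm-c": [500],
}
percentages = share_percentages(example)
print(percentages)
```

Note that the percentage is computed per datastore, so a virtual machine with disks on several datastores has a separate percentage on each.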

As described before, SIOC engages only after a certain device-level latency is detected on the datastore. Once engaged, it begins to assign fewer I/O queue slots to virtual machines with lower shares and more I/O queue slots to virtual machines with higher shares. It throttles back the I/O for the lower-priority virtual machines, those with fewer shares, in exchange for the higher-priority virtual machines getting more access to issue I/O traffic. However, it is important to understand that the maximum number of I/O queue slots that can be used by the virtual machines on a given host cannot exceed the maximum device-queue depth for the device queue of that ESX host. The ESX maximum queue depth varies by HBA model. The queue-depth maximum value is typically in the range of 32 to 128. The lowest that SIOC can reduce the device queue depth to is 4.

Figure 3a shows that, without SIOC, a virtual machine with a lower number of shares, “VM C,” may get a larger percentage of the available storage-array device-queue slots and thus greater storage array performance, while a virtual machine with higher I/O shares, “VM A,” gets fewer than its fair share and reduced storage array performance. However, with SIOC engaged on that datastore, as in Figure 3b, the result will be that the lower-priority virtual machine that is by itself on a separate host will be assigned a reduced number of I/O queue slots. That will result in fewer storage array queue slots being used and a reduction in average device latency. The reduction in average device latency provides VM A and VM B higher storage performance, as the same number of I/Os that they were previously issuing now complete faster due to the reduced latency for each of those I/Os.

For instance, assume that VM A was using 18 I/O slots as shown in Figure 3a. Without SIOC, the storage array latency could be unbounded and the I/O workloads being performed by the lower priority VM C could cause a high storage device latency of, say, 40ms. In this example, VM A would have 18 I/Os @ 40ms worth of storage performance. Once enabled, SIOC controls the latency at the configured congestion threshold, say 30ms. SIOC determines the number of storage array queue slots that can be used while still maintaining an average device latency below the SIOC congestion threshold. Although SIOC does not directly manage the storage array queue, it is able to indirectly control the storage array device queue by managing the ESX device queues that feed into it. As shown in Figure 3b, SIOC has determined that 30 host-side storage queue slots can be used while still maintaining the desired average device latency. SIOC then distributes those storage array queue slots to the various virtual machine workloads according to their priorities. The net effect in this example is that VM C is throttled back to use only its correct relative share of the storage array.

VM A, entitled to 60 percent of the queue slots (1500/2500 = 60 percent), is still able to issue the same 18 I/Os but at a reduced 30ms latency. SIOC provides VM A greater storage performance by controlling VM C and ensuring it uses only its appropriate allocation of total storage resources. By throttling the ESX device-queue depths in proportion to the priorities of the virtual machines that are using them, SIOC is able to control storage congestion at the storage array and distribute storage array performance appropriately.

Figure 3. SIOC Device-‐Queue Management with Prioritized Disk Shares
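The slot arithmetic in this example can be reproduced with a short sketch. The text gives VM A's 1500 shares, the 2500-share total, and the 30 usable queue slots; the 500-share values for VM B and VM C are assumptions made here so that the total reaches 2500. The function is hypothetical and models a per-VM split only, whereas SIOC itself works by adjusting per-host device-queue depths.

```python
def distribute_slots(total_slots, shares):
    """Split whole queue slots among VMs in proportion to their shares.

    Uses largest-remainder rounding so the slot counts always add up to
    total_slots (an illustrative choice, not SIOC's documented behavior).
    """
    total_shares = sum(shares.values())
    exact = {vm: total_slots * s / total_shares for vm, s in shares.items()}
    slots = {vm: int(value) for vm, value in exact.items()}
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(exact, key=lambda vm: exact[vm] - slots[vm],
                          reverse=True)
    for vm in by_remainder[:leftover]:
        slots[vm] += 1
    return slots


# VM A's 1500 shares are from the text; VM B and VM C are assumed values.
disk_shares = {"VM A": 1500, "VM B": 500, "VM C": 500}
slots = distribute_slots(30, disk_shares)
print(slots)
```

With these inputs VM A receives 18 of the 30 slots, matching the 60 percent entitlement worked out above.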

SIOC provides isolation and prioritized distribution of storage resources even when vSphere administrators have not manually set individual disk-share priorities on each VMDK per virtual machine. SIOC protects virtual machines that are running on more highly consolidated ESX servers. In Figures 4a and 4b, all virtual machine disks have the default (1000 shares), that is, equal disk shares. Without SIOC, VM A and VM B are penalized and not provided equal access to storage resources simply because they are running together on the same ESX server and sharing the same ESX device queue, whereas VM C, running on a less consolidated ESX host, is given unfair preference to storage resources. Even administrators who do not wish to individually set VMDK disk shares can benefit from this feature. SIOC lets these vSphere administrators enable storage isolation for all virtual machines accessing a datastore by simply checking a single check box at the datastore level. This new storage management capability allows vSphere administrators to run more highly consolidated virtual environments by preventing imbalances of storage resource allocation during times of storage contention.

Figure 4. SIOC Device-‐Queue Management with Equal Disk Shares

In these examples, SIOC is able to fully manage the storage array queue by throttling the ESX host device queues. This is possible because all the workloads impacting the storage array queue are coming from the ESX hosts and are under SIOC’s control. However, SIOC is able to provide storage workload isolation/prioritization even in scenarios in which external workloads, not under SIOC’s control, are competing with those that it controls. In this scenario, SIOC will first automatically detect this situation, and then will increase the number of device-queue slots it makes available to the virtual machine workloads so that they can compete more fairly for total storage resources against external workloads. Using this approach, SIOC is able to maintain a balance between workload isolation/prioritization and storage I/O throughput even when it cannot directly control or influence the external workload. This behavior continues as long as the external workload persists, and SIOC resumes normal operation once it stops detecting the external workload.

Enabling Storage I/O Control

Since SIOC is an attribute of a datastore, it is set under the properties of a specific datastore. By default SIOC is not enabled on the datastore. The default value for SIOC to kick in is 30ms, but this value can be modified by selecting the “Advanced” option where one enables SIOC in the vCenter interface as shown in Figure 5.

Figure 5. Datastore Properties — SIOC Enablement and Congestion Threshold Setting

SIOC can be used on any FC, iSCSI, or locally attached block storage device that is supported with vSphere 4.1. Review the vSphere 4.1 Hardware Compatibility List (http://www.vmware.com/go/hcl) for the entire list of supported storage devices. SIOC is supported with FC and iSCSI storage devices that have automated tiered storage capabilities. However, when using SIOC with automated tiered storage, the SIOC Congestion Threshold must be set appropriately to make sure the storage device’s automated tiered storage capabilities are not impacted by SIOC.

In vSphere 4.1, SIOC is not supported with NFS storage devices (vSphere 5.0 adds NFS support, as noted at the start of this section) or with Raw Device Mapping (RDM) virtual disks. SIOC is also not supported with datastores that have multiple extents or are being managed by multiple vCenter Management Servers.

For complete step-by-step instructions on how to enable SIOC or change the default latency threshold for a datastore, and for other limitations, consult the documentation or see “Managing Storage I/O Resources” (Chapter 4) in the vSphere 4.1 Resource Management Guide (http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf).

Considerations for Deploying Storage I/O Control

Configuring Disk Shares

Disk shares specify the relative priority a virtual machine has on a given storage resource. When you assign disk shares to a virtual disk/virtual machine, you specify the priority for that virtual machine’s access to storage resources relative to other powered-on virtual machines. Disk shares in vSphere 4.1 can be leveraged at both a local, per–ESX host level, and now at a datastore level when SIOC is enabled and actively prioritizing storage resources. Disk shares are set by selecting “Edit Settings” for a virtual machine and are set on each VMDK, as seen in Figure 6. When SIOC is not enabled, disk shares and the relative priority they specify are enforced only at a local–ESX host level, and then only when local HBAs are saturated. Virtual machines running on the same ESX hosts will be prioritized relative to other virtual machines on the same host but not relative to virtual machines running on other ESX hosts.

When SIOC is enabled and actively controlling the ESX hosts to control storage latencies, disk shares and relative priorities are enforced across all the ESX servers that access the SIOC controlled datastore. So a virtual machine running on one ESX host will have access to storage resources based on the number of disk shares the virtual machine has compared to the total number of disk shares in use on the datastore by all virtual machines across all ESX hosts in the shared storage environment. If a virtual machine does not fully use its allocation of I/O access, the extra I/O slots are redistributed proportionally to the other virtual machines that are actively issuing I/O requests on the datastore.

Figure 6. Virtual Machine Properties — Disk Shares and IOP Limits

As part of vSphere 4.1, I/O per second (IOPS) limits on a per-‐VMDK level can be set to further manage and prioritize virtual machine workloads. Limits (expressed in terms of IOPS) are implemented at the local-‐disk scheduler level and are always enforced regardless of whether or not SIOC is enabled.

Configuring the Storage I/O Control Congestion Latency Value

SIOC is designed to only engage and enforce storage I/O shares when the storage resource becomes contended. This is very similar to CPU scheduling, in that it is only enforced when the resource is contended. To determine when a storage device is contended, SIOC uses a congestion-threshold latency value that vSphere administrators can specify. The default congestion-threshold latency, 30ms, in vSphere 4.1, is a conservative value that should work well for most users. The SIOC congestion-threshold value is configurable, so vSphere administrators have the opportunity to maximize the benefits of SIOC suited to their own virtual environment and storage-management preferences. This section discusses the considerations and recommendations for changing this key parameter.

The SIOC threshold represents a balance between (1) isolation and prioritized access to the storage resource at lower latencies, and (2) higher throughput. When the SIOC congestion threshold is set low, SIOC can begin prioritizing storage access earlier and throttle storage workloads more aggressively in order to maintain a datastore-‐wide latency below the congestion latency threshold. The more aggressive throttling needed to maintain a lower latency might reduce the overall storage throughput. When the congestion threshold is set higher, SIOC will not engage and begin prioritizing resources among virtual machines until the higher latency is reached. When using a higher SIOC congestion latency, SIOC does not need to throttle storage workloads as much in order to maintain the storage latency below the higher congestion threshold. This may allow for higher overall storage throughput.

The default congestion threshold has been set to minimize the impact of throttling on storage throughput while still providing reasonably low storage latency and isolation for high-‐priority virtual machines. In most cases it is not necessary to modify the storage congestion threshold from its default value. However, a user may decide to modify the value depending on the type and speed of their storage device, the characteristics of the workloads in their virtual environment, and their storage-‐management preference between workload isolation/prioritization and workload throughput. Because various storage devices have different latency characteristics, users may need to modify the congestion threshold depending on their storage type. See Table 1 to determine the recommended range of values for your storage-‐device type.

Table 1. SIOC Congestion Threshold Recommendations

The congestion threshold may also need to be adjusted when using automated tiered storage devices. These are systems that contain two or more types of storage media and automatically and transparently migrate data between the storage types in order to optimize I/O performance. These systems typically try to keep the most frequently accessed or “hot” data on faster storage such as SSD, and less frequently accessed or “cold” data on slower media such as SAS or FC disks. This means that the type of storage media backing a particular LUN can change over time.

For full LUN auto-‐tiering storage devices, in which the entire LUN is migrated between different storage tiers, use the recommended value or range for the slowest tier of storage in the device. For example, in a full LUN auto-‐tiering storage device that contains SSD and Fibre Channel disks, use the congestion threshold value that is recommended for Fibre Channel.

With sub-LUN or block-level auto-tiering storage, in which individual storage blocks inside a LUN are migrated between storage tiers, combine the recommended congestion threshold values/ranges for each storage type in the auto-tiering storage devices. For example, in a sub-LUN / block-level auto-tiering storage device that contains an SSD storage tier and a Fibre Channel storage tier, use an SIOC congestion threshold value in the range of 10–30ms. The exact SIOC congestion-threshold value to use is based on your individual storage-device characteristics and your preference of isolation (using a smaller SIOC congestion-threshold value) or throughput (using a larger SIOC congestion-threshold value). For example, in the SSD-FC scenario, the more SSD storage you have in the array, the more your storage device characteristics will match that of the SSD storage type and thus the closer your threshold should be to the SSD recommended value of 10ms, the low end of the combined SSD-FC range. Customers can use the midpoint of the range as a conservative congestion threshold value that provides a balance between the preference for isolation and the preference for throughput. In the SSD-FC example in which there was a range of 10–30ms, the conservative congestion threshold value would be 20ms.
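The range-combining and midpoint rule for sub-LUN auto-tiering can be written out as a small helper. The function names are hypothetical. The 10ms SSD low end and the 30ms upper bound come from the text; the inner endpoints (15 and 20) are placeholder assumptions that do not affect the result, so take real per-tier ranges from Table 1.

```python
def combined_range(tier_ranges_ms):
    """Combine per-tier (low, high) threshold ranges into one overall range."""
    lows = [low for low, _ in tier_ranges_ms]
    highs = [high for _, high in tier_ranges_ms]
    return min(lows), max(highs)


def conservative_threshold(tier_ranges_ms):
    """Midpoint of the combined range, balancing isolation and throughput."""
    low, high = combined_range(tier_ranges_ms)
    return (low + high) / 2


# Outer endpoints (10 and 30) are from the text; 15 and 20 are placeholders.
ssd_range_ms = (10, 15)
fc_range_ms = (20, 30)

ssd_fc_range = combined_range([ssd_range_ms, fc_range_ms])
ssd_fc_midpoint = conservative_threshold([ssd_range_ms, fc_range_ms])
print(ssd_fc_range, ssd_fc_midpoint)
```

An array dominated by SSD capacity would move the chosen value from the midpoint toward the low end of the combined range, as the paragraph above describes.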

When modifying the SIOC congestion threshold, keep in mind that the SIOC latency is a calculated metric, normalized for I/O size and for the aggregate number of IOPS across all the storage workloads accessing the datastore. SIOC uses a normalized latency to take into consideration that not all storage workloads are the same. Some storage workloads may issue larger I/O operations, which naturally take longer for the device to service. Normalizing the storage-workload latencies allows SIOC to compare and prioritize workloads more accurately by bringing them all into a common measurement. Because the SIOC value is normalized, the actual latency observed from the guest OS inside the virtual machine, or from an individual ESX host, may differ from the calculated SIOC-normalized latency per datastore.

Monitoring Storage I/O Control Effects

SIOC includes new metrics inside vCenter to allow users to observe SIOC’s actions and latency measurements. There are two new SIOC metrics in vCenter: SIOC normalized latency and SIOC Aggregated IOPS. The SIOC normalized latency is the value that SIOC calculates per datastore and uses when comparing with the SIOC congestion latency threshold to determine what actions to take, if any. SIOC calculates these metrics every four seconds and they are refreshed in the vCenter display every 20 seconds. These metrics can be viewed on the datastore performance screen inside vCenter, as seen in Figure 7. Additionally, vCenter reports the device-queue depths for each ESX host. The ESX hosts’ device-queue depth metrics can be reviewed to determine what actions SIOC is taking on individual ESX hosts and their device queues in order to maintain a datacenter-wide SIOC latency on the datastore under the set congestion threshold.

Figure 7. vCenter Datastore Performance and SIOC Metrics

SIOC detects the moment when external workloads, not under SIOC’s control, may be impacting the virtual environment’s storage resources. When SIOC detects an external workload, it will trigger a “Non-VI workload detected” informational alert in vCenter. In most cases, this alert is purely informational and requires no action on the part of the vSphere administrator. However, the alert may be an indicator of an incorrectly configured SIOC environment. vSphere administrators should verify that they are running a supported SIOC configuration and that all datastores that utilize the same disk spindles have SIOC enabled with identical SIOC congestion-threshold values. The alert might also be triggered by some backup products and other administrative workloads that bypass the ESX host and directly access the datastore in order to accomplish their tasks. SIOC is supported in these configurations and the alert can be safely ignored for these products. Refer to VMware KB article 1020651 for more details on the “Non-VI workload detected” alert.

Benefits of using Storage I/O Control

SIOC enables improved I/O resource management for a multitude of conditions and provides peace of mind when running business-critical, I/O-intensive applications in a shared VMware virtualization environment.

Provides performance protection

A common concern in any shared resource environment is that one consumer may get far more than its fair share of that resource and adversely impact the performance of the other users that share the resource. SIOC provides the ability, at the datastore level, to support multi-tenant environments that share a datastore, by enabling service-level protections during periods of congestion. SIOC prevents a single virtual machine from monopolizing the I/O throughput of a datastore even when the virtual machines have default (equal value) I/O shares set.

Detects and manages bottlenecks at the array only when congestion exists

SIOC detects a bottleneck at the datastore level, and manages I/O queue slot distribution across the ESX servers that share a datastore. SIOC expands the I/O resource control beyond the bounds of a single ESX server to work across all ESX servers that share a datastore.
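To make the idea of share-proportional queue slot distribution concrete, the sketch below divides a fixed number of queue slots among hosts in proportion to the total I/O shares of the virtual machines running on each host. It is an illustration only: the function name, the host names and the slot count are hypothetical, and this is not SIOC's actual algorithm, which the source does not describe in detail.

```python
from typing import Dict


def allocate_queue_slots(host_shares: Dict[str, int], total_slots: int) -> Dict[str, int]:
    """Split total_slots across hosts in proportion to their I/O shares.

    host_shares maps a host name to the sum of the disk shares of its
    virtual machines on the datastore. Fractional slots are handed out
    using the largest-remainder method so the totals always add up.
    """
    total_shares = sum(host_shares.values())
    if total_shares <= 0:
        raise ValueError("total shares must be positive")

    exact = {h: total_slots * s / total_shares for h, s in host_shares.items()}
    alloc = {h: int(e) for h, e in exact.items()}  # round down first

    # Give the leftover slots to the hosts with the largest fractional parts.
    leftover = total_slots - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for host in by_remainder[:leftover]:
        alloc[host] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical example: esx1 hosts VMs with three times the shares of esx2.
    print(allocate_queue_slots({"esx1": 3000, "esx2": 1000}, 64))
```

With equal shares on every host the slots are split evenly, which corresponds to the default case described above where no single virtual machine can monopolize the datastore.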

When SIOC is enabled on a datastore and no congestion exists at the device level, it will not be engaged in managing I/O resources and will have no effect on I/O latency or throughput. In an optimized and well-configured environment, SIOC may only engage at certain peak periods during the day. During these times of congestion and in the presence of external or non-SIOC-controlled workloads, SIOC strikes a balance between aggregate throughput and enforcement of virtual machine I/O shares.

SIOC helps vSphere administrators understand when more I/O throughput (device capacity) is needed. If SIOC is engaged for significant periods of time during the day, it raises the question of whether the storage configuration needs to change. In this case, an administrator might consider either adding more I/O capacity or using VMware Storage vMotion to migrate I/O-intensive virtual machines to an alternate datastore.
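One rough way to judge how often SIOC is engaged is to compare recorded normalized-latency samples for a datastore against its congestion threshold. The sketch below assumes the samples have already been exported into a plain list of millisecond values. It makes no vCenter API calls, and the function name and the sample data are hypothetical.

```python
from typing import Sequence


def engaged_fraction(samples_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the fraction of samples above the congestion threshold.

    Each sample is one SIOC normalized latency reading for the datastore.
    A high fraction suggests the datastore is congested for much of the
    day and that the storage configuration may need to change.
    """
    if not samples_ms:
        return 0.0
    above = sum(1 for s in samples_ms if s > threshold_ms)
    return above / len(samples_ms)


if __name__ == "__main__":
    # Hypothetical readings in milliseconds against a 30ms threshold.
    readings = [12.0, 35.0, 41.0, 18.0]
    print(f"engaged for {engaged_fraction(readings, 30.0):.0%} of samples")
```

What counts as a significant fraction is a judgment call for the administrator. The source gives no numeric cutoff.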

Enables higher levels of consolidation with less storage expense

SIOC enables vSphere administrators to maximize their storage investments by running more virtual machines on their existing storage infrastructure with confidence that periodic peak periods of high I/O activity will be controlled. Without SIOC, administrators will often overprovision their storage to avoid latency issues that pop up during peak periods of storage activity. With SIOC, the administrators can now comfortably run more virtual machines on a single datastore with confidence that the storage I/O will be controlled and managed at the device level.

Leveraging SIOC can reduce storage costs because the cost of overprovisioning a storage environment, to the point that no contention occurs, could be prohibitively expensive. Alternatively, the cost of storage may drop dramatically when SIOC is used to manage the I/O queue slot allocations, ensuring proportional fairness and prioritization of virtual machines based on their I/O shares.

Conclusion

SIOC offers I/O prioritization to virtual machines accessing shared storage resources. It allows vSphere administrators to give high-priority virtual machines better storage performance and lower latency than lower-priority virtual machines. It monitors datastore latency and engages when a preset congestion threshold has been exceeded. SIOC gives vSphere administrators a new means to manage their VMware virtualized environments by allowing quality of service to be expressed for storage workloads. As such, SIOC is a big step forward in the journey toward automated, policy-based management of shared storage resources.

SIOC provides the means to better control a consolidated shared-storage resource by providing datastore-wide I/O prioritization, helping to manage traffic on a shared and congested datastore. With the introduction of SIOC in vSphere 4.1, vSphere administrators have a new tool to help them increase consolidation density with the peace of mind that, during periods of peak I/O activity, prioritization and proportional fairness will be enforced across all the virtual machines accessing that shared resource.
