How to Fail at VDI

Posted on 22-May-2015


BriForum London 2012


BriForum | © TechTarget

Welcome


Dan Brinkmann (@dbrinkmann), blog.danbrinkmann.com
Solutions Architect, VMware vExpert
Lewan & Associates (Denver, CO)

How to Fail at VDI

“What business problem are we solving?”


Business/Expectation VDI Failures

● No business problem
● Desktop virtualization is not server virtualization
● Saving money
● Project in the hands of the vSphere administrator
● No success criteria
● Assume you know what users do
● The same or better experience remotely as locally


Agenda

● Compute
● Storage
● Guessing

To understand what causes VDI failures


How to Fail at VDI

● Test with 5 users
● Use vendor-provided users/core sizing
● Use vendor-provided IOPs estimates
● Ignore anti-virus
● Ignore user profile management
● Use existing desktop images for physical PCs
● Guess

The technology failure points

Compute

● Multi-threaded apps
● Latency-sensitive workloads
● Hyperthreading
● Latency = Health

It’s magic until it stops working

Compute

● The CPU scheduler in vSphere is entitlement/consumption based, not priority based (unlike Windows)
● There is no priority in the CPU scheduler
● Given equal entitlement, the more a VM/world consumes, the more likely it is to be preempted by another VM/world
● http://www.vmware.com/resources/techresources/10131
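To make the entitlement/consumption idea concrete, here is a toy proportional-share model in Python. It is a simplified illustration under stated assumptions, not vSphere's actual scheduler: one pCPU, integer ticks, every world always runnable, and hypothetical `shares` values standing in for entitlement. Each tick goes to the world that has consumed the least CPU relative to its entitlement, so there is no priority anywhere, and the world that has consumed the most is the one that gets passed over.

```python
from dataclasses import dataclass


@dataclass
class World:
    """A schedulable entity (e.g. a vCPU) in a toy proportional-share model."""
    name: str
    shares: int        # hypothetical entitlement weight
    consumed: int = 0  # CPU ticks consumed so far

    def progress(self) -> float:
        # Consumption relative to entitlement; lower means "more owed".
        return self.consumed / self.shares


def run(worlds: list[World], ticks: int) -> dict[str, int]:
    """Give each tick of one pCPU to the world furthest behind its entitlement.

    There are no priorities: a world that has consumed more than its fair
    share simply loses the next tick to one that has consumed less.
    Ties go to the world listed first (min() returns the first minimum).
    """
    granted = {w.name: 0 for w in worlds}
    for _ in range(ticks):
        chosen = min(worlds, key=lambda w: w.progress())
        chosen.consumed += 1
        granted[chosen.name] += 1
    return granted


heavy = World("heavy", shares=1000, consumed=40)
light = World("light", shares=1000, consumed=10)
print(run([heavy, light], 30))  # light runs until it catches up
print(run([heavy, light], 10))  # then the two alternate
```

With equal shares, the world that has already consumed 40 ticks gets nothing until the lighter one has caught up; giving one world twice the shares instead gives it roughly twice the ticks.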

CPU scheduler in vSphere

Compute with a Physical PC

[Diagram: CPU 1 serving a single OS/Apps/Profile stack]

Compute with Citrix XenApp

[Diagram: CPU 1 and CPU 2 shared by eight OS/Apps/Profile stacks]

Compute with VDI

[Diagram: CPU 1 and CPU 2]

vSphere Compute

This is poor performance monitoring

vSphere Compute


This is better performance monitoring - ESXTOP

Display | Metric | Threshold (%) | Explanation
CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive use of vSMP, or a limit has been set (check %MLMTD).
CPU | %CSTP | 3 | Excessive use of vSMP. Decrease the number of vCPUs for this particular VM; this should lead to more scheduling opportunities.
CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by a VM with high I/O. Check other metrics and the VM for the root cause.
CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” setting. If larger than 0, the world is being throttled by the CPU limit.
CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
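The CPU thresholds above can be applied mechanically. A minimal Python sketch, assuming you have already pulled per-vCPU percentages out of esxtop into a dictionary; the `cpu_flags` function and the sample values are hypothetical. Per-vCPU numbers matter because esxtop's per-VM (group) row aggregates values such as %RDY across all of a VM's vCPUs.

```python
# Thresholds (percent) taken from the esxtop CPU table above.
CPU_THRESHOLDS = {
    "%RDY": 10.0,
    "%CSTP": 3.0,
    "%SYS": 20.0,
    "%MLMTD": 0.0,
    "%SWPWT": 5.0,
}


def cpu_flags(sample: dict[str, float]) -> list[str]:
    """Return the metrics in `sample` that exceed their threshold.

    Assumes `sample` holds per-vCPU percentages; metrics missing from
    the sample are skipped rather than treated as zero.
    """
    return [
        metric
        for metric, limit in CPU_THRESHOLDS.items()
        if metric in sample and sample[metric] > limit
    ]


# Hypothetical per-vCPU readings for one VM.
print(cpu_flags({"%RDY": 12.5, "%CSTP": 4.1, "%SYS": 2.0, "%MLMTD": 0.0, "%SWPWT": 0.0}))
```

In this hypothetical sample both %RDY and %CSTP are flagged, the pattern where co-stop from too many vCPUs is likely driving ready time.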


vSphere Compute


vSphere Compute

%CSTP probably driving %RDY values

vSphere Compute

Now with fewer vCPUs

Summary on Compute

● Multithreading, vSMP
● Not priority based
● % utilization is not the complete picture
● Latency = Health
● http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017926


Storage

● #1 cause of performance issues in server virtualization
● #1 cause of performance issues in desktop virtualization
● Latency = Health
  - 20 ms: in trouble
  - 50 ms: your users hate you

The wrath of the math

What You Need to Know

● Capacity vs performance
● Random vs sequential
● Average vs peak
● Where it’s coming from
● Most are guessing


Storage

Spinning disk

RAID Penalty


The Math – RAID 5 50/50

● 500 users, Windows 7, 20 IOPs avg, 50/50 read/write, RAID 5
● 500 * 20 = 10,000 IOPs – 5,000 read, 5,000 write
● 5,000 write * 4 (RAID 5 write penalty) = 20,000; + 5,000 read = 25,000 IOPs
● 25,000 IOPs on 15K spindles (200 IOPs each) = 125 spindles

Some back-of-the-napkin math

The Math – RAID 10 50/50

● 500 users, Windows 7, 20 IOPs avg, 50/50 read/write, RAID 10
● 500 * 20 = 10,000 IOPs – 5,000 read, 5,000 write
● 5,000 write * 2 (RAID 10 write penalty) = 10,000; + 5,000 read = 15,000 IOPs
● 15,000 IOPs on 15K spindles (200 IOPs each) = 75 spindles

Some back-of-the-napkin math

The Math – RAID 10 20/80

● 500 users, Windows 7, 20 IOPs avg, 20/80 read/write, RAID 10
● 500 * 20 = 10,000 IOPs – 2,000 read, 8,000 write
● 8,000 write * 2 (RAID 10 write penalty) = 16,000; + 2,000 read = 18,000 IOPs
● 18,000 IOPs on 15K spindles (200 IOPs each) = 90 spindles


Some back-of-the-napkin math
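All three napkin calculations follow one formula: back-end IOPs = reads + writes × RAID write penalty, then divide by per-spindle IOPs and round up to whole disks. A Python sketch; the RAID 5 (4) and RAID 10 (2) penalties and the 200 IOPs per 15K spindle come from the slides, while the RAID 0, RAID 1, and RAID 6 entries are the commonly cited figures, included as assumptions. The function names are hypothetical.

```python
import math

# Back-end disk I/Os generated per front-end write. RAID 5 = 4 and RAID 10 = 2
# are the values used on the slides; the others are commonly cited figures.
RAID_WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 5": 4, "RAID 6": 6, "RAID 10": 2}


def backend_iops(users: int, iops_per_user: int, read_pct: int, raid: str) -> float:
    """Front-end IOPs split by read percentage, writes multiplied by the RAID penalty."""
    front_end = users * iops_per_user
    reads = front_end * read_pct / 100
    writes = front_end * (100 - read_pct) / 100
    return reads + writes * RAID_WRITE_PENALTY[raid]


def spindles(backend: float, iops_per_spindle: int = 200) -> int:
    """Disks needed at ~200 IOPs per 15K spindle, rounded up to whole disks."""
    return math.ceil(backend / iops_per_spindle)


for read_pct, raid in [(50, "RAID 5"), (50, "RAID 10"), (20, "RAID 10")]:
    total = backend_iops(500, 20, read_pct, raid)
    print(f"{raid} {read_pct}/{100 - read_pct}: {total:,.0f} IOPs, {spindles(total)} spindles")
```

This reproduces the slides' 25,000/125, 15,000/75, and 18,000/90. The round-up only matters when the numbers don't divide evenly, as they happen to here.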


vSphere Storage Latency

[Diagram: the storage I/O path. Guest: Application, Filesystem, I/O Drivers. VMkernel: Virtual SCSI, Filesystem, Device Queue. Latency measurement points are labeled A, G, K, D, S, and R.]

Application Latency
R = Physical Disk “Disk Secs/Transfer”
G = Guest Latency
K = ESX Kernel
D = Device Latency

vSphere Storage


Performance monitoring for storage

Display | Metric | Threshold | Explanation
DISK | GAVG | 20 ms | GAVG is the sum of DAVG and KAVG; look at both to find the source.
DISK | DAVG | 20 ms | Disk latency most likely caused by the array.
DISK | KAVG | 2 ms | Disk latency caused by the VMkernel; high KAVG usually means queuing. Check QUED.
DISK | QUED | 1 | Queue maxed out. The queue depth is possibly set too low; check with the array vendor for the optimal queue depth value.
DISK | ABRTS/s | 1 | Aborts issued by the guest (VM) because storage is not responding. Can be caused by failed paths.
DISK | RESETS/s | 1 | The number of commands reset per second.
DISK | CONS/s | 20 | SCSI reservation conflicts per second. Can be caused by too many VMDKs on a datastore.
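The storage thresholds combine naturally: GAVG is DAVG plus KAVG, the 20 ms / 50 ms bands from the Storage slide grade overall health, and whichever component is over its own threshold points at the cause. A Python sketch with a hypothetical `diagnose` function; treating exactly 20 ms and 50 ms as the start of each band is an assumption, since the slides don't say which side the boundary falls on.

```python
# Thresholds in milliseconds, from the esxtop DISK table and the Storage slide.
DAVG_LIMIT_MS = 20.0
KAVG_LIMIT_MS = 2.0
TROUBLE_MS = 20.0
USERS_HATE_YOU_MS = 50.0


def diagnose(davg_ms: float, kavg_ms: float) -> tuple[float, str, list[str]]:
    """Return (GAVG, health band, likely causes) for one device or VM."""
    gavg_ms = davg_ms + kavg_ms  # GAVG = DAVG + KAVG
    # Assumed boundary: a reading at exactly 20 or 50 ms falls into the worse band.
    if gavg_ms >= USERS_HATE_YOU_MS:
        health = "your users hate you"
    elif gavg_ms >= TROUBLE_MS:
        health = "in trouble"
    else:
        health = "ok"
    causes = []
    if davg_ms > DAVG_LIMIT_MS:
        causes.append("array (DAVG)")
    if kavg_ms > KAVG_LIMIT_MS:
        causes.append("VMkernel queuing (KAVG): check QUED")
    return gavg_ms, health, causes


print(diagnose(18.0, 4.0))
```

In the sample call the array itself (DAVG 18 ms) is under its threshold, but 4 ms of kernel queuing pushes GAVG to 22 ms, so the place to look is QUED and the queue depth rather than the array.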


Building for Read IOPs

● Memory: storage controller cache, PVS
● Host/hypervisor: CBRC, IntelliCache
● Storage: SSD tiering / flash cache


Fairly easy
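Why reads are "fairly easy" and writes "much harder" shows up in the arithmetic. A Python sketch, assuming a read cache (controller cache, CBRC, flash, and so on) with a hypothetical 90% hit rate that absorbs reads and does nothing for writes; the `backend_with_read_cache` name is hypothetical, and real caches vary in how they treat writes.

```python
def backend_with_read_cache(reads: float, writes: float, write_penalty: int, read_hit_pct: int) -> float:
    """Back-end IOPs when a read cache absorbs read_hit_pct% of reads.

    Assumption: the cache only serves reads; every write still reaches
    the disks and pays the RAID write penalty.
    """
    disk_reads = reads * (100 - read_hit_pct) / 100
    return disk_reads + writes * write_penalty


# RAID 5, 50/50 from the slides, without and with a hypothetical 90% read hit rate.
print(backend_with_read_cache(5000, 5000, 4, 0))
print(backend_with_read_cache(5000, 5000, 4, 90))
```

For the RAID 5 50/50 case a 90% read hit rate only cuts back-end load from 25,000 to 20,500 IOPs, because the 20,000 penalized write IOPs are untouched; for RAID 10 20/80 (2,000 reads, 8,000 writes) it goes from 18,000 to 16,200.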


Building for Write IOPs

● Profiles/Apps
● Spinning disk
● SSD tiering
● Local disk
● IO optimization (dedupe, serializing IO)

Much harder, and expensive

Storage Summary

● 25,000 IOPs R5 50/50 – 125 spindles
● 15,000 IOPs R10 50/50 – 75 spindles
● 18,000 IOPs R10 20/80 – 90 spindles
● Latency is the key metric
● Write IOPs, and whatever generates them, are the #1 focus


How does this relate to VDI failure?

● Pilot performance is great, then terrible in production
● Boot storm vs login storm
● Applications in gold image vs streamed
● Read/write ratio is important
● Anti-virus software
● Existing desktop images


Guessing

● Initial sizing
● Determine peaks and when they occur
● Baseline application impact
● Monitor application impact over time
● Application updates/changes


You need to use tools to do this
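Determining peaks, and when they happen, is the kind of thing monitoring tools compute for you. As a minimal sketch of the idea in Python: given timestamped IOPs samples (the values and dates below are hypothetical, not measurements from this deck), report the average, the peak, when it occurred, and the peak-to-average ratio that sizing on averages would miss. `peak_report` is a hypothetical name.

```python
from datetime import datetime
from statistics import mean


def peak_report(samples: list[tuple[datetime, float]]) -> dict:
    """Summarize (timestamp, IOPs) samples: average, peak, and when the peak hit."""
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(value for _, value in samples)
    peak_at, peak = max(samples, key=lambda s: s[1])
    return {
        "average": avg,
        "peak": peak,
        "peak_at": peak_at,
        "peak_to_average": peak / avg,
    }


# Hypothetical samples for one day: a login storm shows up at 08:30.
samples = [
    (datetime(2024, 1, 15, 8, 0), 4000.0),
    (datetime(2024, 1, 15, 8, 30), 14000.0),
    (datetime(2024, 1, 15, 12, 0), 6000.0),
    (datetime(2024, 1, 15, 15, 0), 8000.0),
]
report = peak_report(samples)
print(report["average"], report["peak"], report["peak_at"], report["peak_to_average"])
```

Sizing on the 8,000 IOPs average here would leave the 08:30 login storm, at 1.75 times that, under-provisioned.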


Project testing

● Unit/system testing
● Application testing
● Performance/scalability testing
● Operational testing
● User acceptance testing

Good to know what you are and aren’t doing

Summary

● Understand your limited resources (compute/storage)
● Don’t guess
● 5 users = what kind of testing? What are you really accomplishing?
