
Upload: aron-sanders

Post on 14-Dec-2015


TRANSCRIPT

Page 1: Visual Studio Windows Azure Portal Rest APIs / PS Cmdlets US-North Central Region FC TOR PDU Servers TOR PDU Servers TOR PDU Servers TOR PDU
Page 2:

Windows Azure Fabric Controller Internals: Building and Updating Highly Available Applications

Igal Figlin, Sr. Program Manager Lead

3-627

Page 3:

MICROSOFT CONFIDENTIAL – INTERNAL ONLY

Session Objectives And Takeaways

Session Objective(s): Understand how Azure works behind the scenes and, as a result, be more of an expert to customers. Answer questions such as:
• How are customer services updated without downtime? Which updateability parameters can I fine-tune?
• Which types of hardware failures exist, and how can I configure a service to avoid impact?
• How does Windows Azure update its infrastructure, and how does this impact my service?
• How can Azure customers build automation to implement custom update workflows?

Key Takeaways:
• Plan for Updates and Redundancy when Designing a Service
• Select the best update modes for PaaS services
• Use Availability Sets for IaaS service tiers
• Design IaaS updates to ensure high availability for a redundant service

Page 4:

Scope and Agenda

Agenda:
• Intro to WA Internals
• Running Highly Available Cloud Services
• Running Highly Available Cloud Virtual Machines
• Infrastructure Operations Impacting Customer Services

In-Scope:
• Understanding how the Windows Azure Fabric Controller works
• Understanding update and fault recovery of the Windows Azure Compute Services
• Understanding how to design Windows Azure Compute Services to ensure high availability and fault recovery

Out of Scope:
• Windows Azure Introduction
• Storage considerations
• Performance
• Broad guidance on service architecture / app patterns

Questions at the End of Each Section

Page 5:

Intro to WA Internals

Page 6:

Windows Azure

Windows Azure is an OS for the data center
• Handles resource management, provisioning, and monitoring
• Manages application lifecycle
• Allows developers to concentrate on business logic

Windows Azure provides common building blocks for distributed applications
• Compute resources, like Virtual Machines and Cloud Services
• Reliable queuing, simple structured storage, SQL storage
• Application services like access control, caching, and connectivity

Fabric Controller (FC) manages compute infrastructure
• Deploys and manages the health of the compute services
• Manages datacenter infrastructure (hardware & software), recovers from failures
• Drives infrastructure updates

Page 7:

Deploying a Service to the Cloud: The 10K foot view

• Develop using Visual Studio or any other IDE
• Package upload to Windows Azure portal
  • Optionally using the Visual Studio developer upload experience
  • … or PowerShell/REST APIs through automation
  • Service package passed to RDFE
• Red Dog Front End (RDFE) sends the service to a Fabric Controller (FC) based on service requirements
  • Region, affinity groups
  • Available resources
• Fabric Controller (FC) deploys the service to the right racks/servers

[Diagram: Visual Studio, Windows Azure Portal, and REST APIs / PS Cmdlets feed the RDFE Service, which targets one of several FCs in the US-North Central Region; each FC manages racks of servers, each rack with a TOR switch and a PDU.]

Page 8:

Datacenter Architecture
• A region can comprise multiple datacenters
• Datacenters are divided into “clusters”
• Each rack provides a unit of fault isolation

[Diagram, top to bottom: Datacenter Routers; Aggregation Routers and Load Balancers (Agg); Cluster Network Aggregation for Clusters 1 to 5; Top of Rack Switches (TOR); Racks of servers; Power Distribution Units (PDU).]

Page 9:

Inside a Cluster
• Each cluster is managed by a Fabric Controller (FC)
  • Manages DC hardware and services
  • Allocates resources and manages lifecycles
• FC is a distributed, stateful application running on servers spread across racks
  • One FC instance is the primary and all others keep their view of the world in sync

[Diagram: a cluster of racks (Rack 1, Rack 2, … Rack 20) under an aggregation router (AGG); each rack has a TOR switch, a PDU, and servers, with Fabric Controller instances spread across the racks.]

Page 10:

Inside a Physical Server
• CPU, memory, disk & networking resources are committed when allocating the service.

[Diagram: a Physical Server behind a TOR Switch and PDU, connected to the Fabric Controller. The Host Partition runs the FC Host Agent; across the trust boundary are three VMs (two PaaS VM Role Instances and one IaaS VM Role) and two Guest Agents. The server's CPUs are shown as allocated to VMs or as Unallocated CPUs.]

Page 11:

Running Highly Available Cloud Services

Page 12:

Leveraging Fault Domains
• A Fault Domain is a physical unit of failure
  • A rack can be considered a fault domain.
• Node Healing = moving VMs off the faulted server while keeping the allocation constraints

[Diagram: Rack 1 (Fault Domain 1) and Rack 2 (Fault Domain 2), each with its own TOR Switch and PDU, under a shared AGG.]

Page 13:

PaaS: Leveraging Fault Domains
• FC deploys the role instances in (at least) two different fault domains.
• Different roles are allocated to fault domains independently
• An even distribution is maintained when scaling up or down
• No way to control the Fault Domain mapping, but it can be queried for each role instance:
  • Portal
  • REST service mgmt. APIs (“FaultDomain”)
• Queuing can be defined between the layers (only LB by default)

[Diagram: Azure Load Balancer, Web Role, and Worker Role instances with their fault domains]
• Web Role Instance 0: Fault Domain 0
• Web Role Instance 1: Fault Domain 1
• Web Role Instance 2: Fault Domain 0
• Worker Role Instance 0: Fault Domain 0
• Worker Role Instance 1: Fault Domain 1

Page 14:

PaaS: Leveraging Update Domains
• Update Domains (UD) control how the service is updated.
• A single UD is updated for a role at a time.
• Scenarios:
  • User Initiated: PaaS service owner updates the service package or chooses a different Guest OS
  • Platform Initiated: Update Guest OS for PaaS services when a new version is released (e.g. security fixes); update the server (hypervisor)
• Implementation Details:
  • Role instances are assigned to different UDs, circularly
  • Alignment between UDs of the different roles
  • Up to 20 UDs per Service (5 by default)

[Diagram: Azure Load Balancer, Web Role, and Worker Role instances with their update domains]
• Web Role Instance 0: Update Domain 0
• Web Role Instance 1: Update Domain 1
• Web Role Instance 2: Update Domain 2
• Worker Role Instance 0: Update Domain 0
• Worker Role Instance 1: Update Domain 1

Page 15:

PaaS Service Setup & Management Demo

Page 16:

Mapping Instances to UDs/FDs - PaaS
• Update domains are always spread across fault domains

Usage:
• Setting the update domain count: you can change the number of update domains through the upgradeDomainCount attribute in the ServiceDefinition.csdef file (upgradeDomainCount=n, requires service redeployment)
• Role instance count can be changed dynamically through a REST API change configuration request
• Determining Update and Fault Domains for your instance through REST management APIs: RoleInstance element => read only, get role instance names and counts
  • RoleInstance.Role.Name
  • RoleInstance.ID
  • RoleInstance.FaultDomain, RoleInstance.UpdateDomain

Web Role (placement consistent with the fault and update domain diagrams on Pages 13 and 14):
• UD0: IN_0 (FD0)
• UD1: IN_1 (FD1)
• UD2: IN_2 (FD0)

Worker Role:
• UD0: IN_0 (FD0)
• UD1: IN_1 (FD1)
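The instance-to-domain mapping on this slide can be reproduced with a few lines of Python. This is only a sketch: the slides say that update domains are assigned circularly, but the alternating fault domain assignment (`i % fd_count`) is an assumption read off the diagrams, and `map_instances` is a hypothetical helper, not an Azure API.

```python
def map_instances(instance_count, ud_count=5, fd_count=2):
    """Return {instance_name: (update_domain, fault_domain)}.

    Update domains are assigned circularly, as the slides describe.
    The fault domain rule is an assumption inferred from the diagrams.
    """
    return {
        f"IN_{i}": (i % ud_count, i % fd_count)
        for i in range(instance_count)
    }


web = map_instances(3)     # Web Role with three instances
worker = map_instances(2)  # Worker Role with two instances

for name, (ud, fd) in web.items():
    print(f"Web Role {name}: UD{ud}, FD{fd}")
```

Each role is mapped independently, which matches the statement that different roles are allocated to fault domains independently.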

Page 17:

Automating PaaS Service Updates

Method 1: Defining a service update mode (“mode” in configuration):
• Auto – UD walk by Fabric Controller when a new package is uploaded (default)
• Manual – Call Walk Upgrade Domain for each domain
• Simultaneous – update all role instances simultaneously, ignoring Upgrade Domains

Notes on usage:
• By utilizing a manual UD walk, it is possible to manually control the speed of the update (risk management) and to roll back the update when unsuccessful
• By utilizing simultaneous update and update events in each role instance, it is possible to build custom flows

Method 2: Swapping Staging vs Production Environments
• Define a staging environment and swap between the production and staging environments
• Allows final validation before production entry

Changing service size based on load, time of day, auto-scaling, etc.
• Calling Change Deployment Configuration with the new service instance count
• Calling Delete Role Instances
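To make the difference between the update modes concrete, here is a minimal Python sketch of how instances could be batched and walked. It does not call any Azure API: `update_batches`, `walk_update`, `apply_update`, and `is_healthy` are hypothetical names, and the health check stands in for whatever validation a service owner performs between upgrade domains.

```python
from collections import defaultdict


def update_batches(instance_uds, mode="auto"):
    """Group instances into the batches that are updated together.

    instance_uds maps an instance name to its update domain number.
    "auto" and "manual" both go one UD at a time (they differ in who
    drives the walk); "simultaneous" ignores upgrade domains.
    """
    if mode == "simultaneous":
        return [sorted(instance_uds)]
    by_ud = defaultdict(list)
    for name, ud in instance_uds.items():
        by_ud[ud].append(name)
    return [sorted(by_ud[ud]) for ud in sorted(by_ud)]


def walk_update(instance_uds, apply_update, is_healthy, mode="manual"):
    """Update batch by batch and stop when a batch fails validation.

    Returns (updated_instances, completed). completed is False when the
    walk stopped early, so the caller can decide to roll back.
    """
    updated = []
    for batch in update_batches(instance_uds, mode):
        for name in batch:
            apply_update(name)
            updated.append(name)
        if not all(is_healthy(name) for name in batch):
            return updated, False
    return updated, True


service = {"web_0": 0, "web_1": 1, "web_2": 2, "worker_0": 0, "worker_1": 1}
log = []
done, ok = walk_update(service, log.append, lambda name: name != "web_1")
```

In this run the walk stops after the second upgrade domain because `web_1` fails validation, leaving `web_2` on the old version.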

Page 18:

Running Highly Available Cloud Virtual Machines

Page 19:

Migrated 3-Tier Enterprise Application Sample

• Sample application to demonstrate Windows Azure usage (application migrated from customer premises).
• Sample application specifics:
  • High redundancy for each component
  • Load balancer for the front end
  • Data layer can be implemented by SQL Server or SQL Azure (here); alternatively, can utilize Windows Azure storage
• Set up the whole application in the same affinity group to gain physical proximity

[Diagram: Azure Load Balancer; Frontend Availability Set with three Front End VMs; queueing or load-balancing; Backend Availability Set with two Backend VMs; Geo-Distributed Storage or SQL Azure.]

Page 20:

IaaS: Leveraging Availability Sets

• Availability sets instruct how to allocate VMs in the datacenters to isolate the impact of hardware faults and infrastructure updates.
• Availability sets are defined through the portal or REST APIs.
• Availability sets have to be defined for each redundant application tier to achieve the 99.95% SLA
  • We do not offer an SLA unless there are at least 2 VM instances defined and used in each availability set
• Application SLA is compositional and dependent on the multiplication of the SLA components (each tier, compute, networking, etc.)
  • e.g. Front End may cause unavailability of the entire service.
• No correspondence between fault domains used in different availability sets
  • Thus, queuing or load-balancing is added between the availability sets

[Diagram: Azure Load Balancer; Frontend Availability Set with Front End VMs in Fault Domains 1, 2, and 1; queueing or load-balancing; Backend Availability Set with Backend VMs in Fault Domains 2 and 1; Geo-Distributed Storage or SQL Azure.]
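The "multiplication of the SLA components" can be worked through numerically. The sketch below assumes that every component must be up for the application to be up, that failures are independent, and that a month has 30 days; the two-tier example simply applies the 99.95% figure from the slide to a frontend and a backend tier.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def composite_availability(*components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def monthly_downtime_minutes(availability):
    """Expected downtime per month for a given availability."""
    return (1 - availability) * MINUTES_PER_MONTH


single_tier = 0.9995
two_tiers = composite_availability(0.9995, 0.9995)

print(f"one tier:  {monthly_downtime_minutes(single_tier):.1f} min/month")
print(f"two tiers: {monthly_downtime_minutes(two_tiers):.1f} min/month")
```

Under these assumptions two tiers at 99.95% each compose to roughly 99.90%, which about doubles the downtime budget compared with a single tier.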

Page 21:

IaaS: Updating the Service and the Infrastructure

• Scenario: Platform-initiated update of the servers which run the IaaS VM instances.
• Goal: high redundancy for the IaaS service
• Each role is allocated to a different update domain (up to 5)
  • When physical servers are updated, only a fraction of the capacity will be touched at a time (or less).
• No mapping between update domains in different availability sets.
• IaaS service update is the customer's responsibility.
  • In some cases, customer VM update and infrastructure update can happen at the same time.
  • IaaS update notifications are sent to avoid this.
• Hardware failures can occur at any time. Thus, platform update + hardware failure could still cause a service outage for dual-VM availability sets.

[Diagram: Azure Load Balancer; Frontend Availability Set with Front End VMs in Update Domains 0, 1, and 2; queueing or load-balancing; Backend Availability Set with Backend VMs in Update Domains 1 and 0; Geo-Distributed Storage or SQL Azure.]
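How much capacity a single update domain can take offline follows from the VM count and the UD limit. The helper below is a rough sketch; it assumes VMs in an availability set are spread over update domains circularly (the slides state this explicitly only for PaaS role instances), and `max_fraction_down` is a hypothetical name.

```python
from math import ceil


def max_fraction_down(vm_count, ud_count=5):
    """Largest share of VMs that a single update domain can hold.

    Assumes circular assignment of VMs to update domains.
    """
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    used_uds = min(vm_count, ud_count)
    return ceil(vm_count / used_uds) / vm_count


for n in (2, 5, 7, 10):
    print(f"{n} VMs: up to {max_fraction_down(n):.0%} down per UD")
```

With only two VMs, half of the tier is touched during each platform update step, which is why a hardware failure during an update can still take down a dual-VM availability set.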

Page 22:

Managing IaaS Service Demo

Page 23:

High Availability IaaS VMs Usage Guidance

Using IaaS VMs Correctly
• Analogous to deploying different PaaS services for each tier
• Update strategy should be clear/upfront
• Use Availability Sets to get the platform scenario working correctly
• Do not use single-instance availability sets for production applications
• Each availability set is completely independent from the infrastructure standpoint
• Mix PaaS roles and IaaS availability sets as needed
• Use Affinity Groups to enforce physical proximity of the different services

Page 24:

Defining and Updating VMs in Availability Set

Guidance for administrator-initiated IaaS update
• Update one update domain at a time
• When removing/restarting/shutting down VMs, make sure to keep the remaining VMs evenly distributed in FDs and UDs
• Prepare for/detect platform update happening in parallel; same for server hardware failures
• Validate VM status before walking the next UD
• 3 UDs will minimize collision risk with platform update
  • Single IaaS instances will get a notification before the update

Add service auto-scaling
• Capture role for an existing stopped VM or pre-create it; add a new role from it; shut down / delete role when scaling down
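One way to follow the guidance about keeping the remaining VMs evenly distributed is to always remove a VM from the most crowded update domain, breaking ties by the most crowded fault domain. The sketch below illustrates that heuristic only; `pick_vm_to_remove` and the VM names are hypothetical, and the UD/FD values would in practice be read from the management APIs.

```python
from collections import Counter


def pick_vm_to_remove(vms):
    """Choose which VM to shut down when scaling in.

    vms maps a VM name to its (update_domain, fault_domain) pair.
    Picks a VM from the most crowded UD, then the most crowded FD,
    so the remaining VMs stay as evenly spread as possible.
    """
    ud_load = Counter(ud for ud, _ in vms.values())
    fd_load = Counter(fd for _, fd in vms.values())
    return max(
        sorted(vms),  # sorted for a deterministic choice on ties
        key=lambda name: (ud_load[vms[name][0]], fd_load[vms[name][1]]),
    )


frontend = {"fe0": (0, 0), "fe1": (1, 1), "fe2": (2, 0), "fe3": (0, 1)}
victim = pick_vm_to_remove(frontend)
print(victim)
```

Here UD0 holds two VMs, so one of them is chosen; after removal every update domain holds exactly one VM.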

Page 25:

Infrastructure Operations Impacting Customer Services

Page 26:

Server, VM and Role Health Maintenance

• FC maintains service availability by monitoring the software & hardware health
  • Based primarily on heartbeats
  • Automatically “heals” affected roles

Symptom: Issue with customer code or customer VM
• Healing Operation: Reboot the VM(s)
• Potential Causes:
  • Role instance or Guest OS crash (PaaS)
  • Customer OS crash (IaaS)

Symptom: Issue with physical server or rack
• Healing Operation: Allocate the impacted customer VMs to different server(s)
• Potential Causes:
  • Physical server software failure
  • Physical server hardware failure
  • Rack / PDU / ToR failure
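The symptom-to-healing table can be read as a small decision rule. The sketch below is an illustration, not the Fabric Controller's actual logic: it assumes two separate heartbeat signals, one for the VM and one for the physical server, and all names are hypothetical.

```python
from enum import Enum


class Healing(Enum):
    NONE = "no action"
    REBOOT_VM = "reboot the VM(s)"
    REALLOCATE = "allocate the impacted VMs to different server(s)"


def choose_healing(vm_heartbeat_ok, host_heartbeat_ok):
    """Pick a healing operation from two assumed heartbeat signals."""
    if not host_heartbeat_ok:
        # Server, rack, PDU, or ToR problem: move the VMs elsewhere.
        return Healing.REALLOCATE
    if not vm_heartbeat_ok:
        # Role instance, Guest OS, or customer OS crash: reboot in place.
        return Healing.REBOOT_VM
    return Healing.NONE


print(choose_healing(vm_heartbeat_ok=False, host_heartbeat_ok=True).value)
```

A host-level failure takes precedence because rebooting a VM on a faulted server would not restore the role.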

Page 27:

Updating the Host OS

• Initiated by the Windows Azure team (~once per month)
• Goal: update all machines as quickly as possible
• VMs might be rebooted when the server is updated (new OS, BIOS, etc.).
• Constraint: must not violate service SLA
• Algorithm: Fabric Controller performs a UD walk (keeping the UD constraints for each service)
  • Each server is updated in such a way that it won’t violate UD & FD constraints for the services utilizing it
  • Might take many hours for services with a large UD count
• Note: your role instance keeps the same VM and VHDs, preserving cached data in the resource volume
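The UD constraint behind the host OS update can be expressed as a check on a candidate batch of servers: no service may have VMs from more than one update domain rebooted at once. The following sketch covers only that check, under the assumption that VM placements are known; it ignores fault domains and updates already in progress, and `can_update_together` is a hypothetical helper.

```python
def can_update_together(servers, placements):
    """Check whether a batch of servers can be rebooted at the same time.

    placements maps a server name to a list of (service, update_domain)
    pairs for the VMs it hosts. The batch is safe if, for every service,
    all affected VMs belong to a single update domain.
    """
    uds_hit = {}
    for server in servers:
        for service, ud in placements.get(server, []):
            uds_hit.setdefault(service, set()).add(ud)
    return all(len(uds) == 1 for uds in uds_hit.values())


placements = {
    "s1": [("shop", 0), ("blog", 1)],
    "s2": [("shop", 0)],
    "s3": [("shop", 1)],
}

print(can_update_together(["s1", "s2"], placements))
print(can_update_together(["s1", "s3"], placements))
```

The first batch only touches UD0 of the "shop" service and UD1 of "blog", so it is allowed; the second would take down two update domains of "shop" together and is rejected.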

Page 28:

Summary: Highly Available Cloud Service vs Azure VMs

Aspect: Fault Domain count
• Cloud Services (PaaS): Two per Role
• Azure VMs (IaaS): Two per Availability Set

Aspect: Update Domain count
• Cloud Services (PaaS): Five by default; up to twenty
• Azure VMs (IaaS): Five

Aspect: Platform update
• Cloud Services (PaaS): UD by UD
• Azure VMs (IaaS): UD by UD

Aspect: Administrator initiated update
• Cloud Services (PaaS): UD by UD, or Blast, or Customer Controlled UD walk or VIP-Swap
• Azure VMs (IaaS): Administrator controlled (can be automated using PowerShell or REST management APIs)

Aspect: Frontend and backend highly-available addressability
• Cloud Services (PaaS): Windows Azure provides Load-Balancer per role; queuing recommended for backend roles
• Azure VMs (IaaS): Administrator defines endpoints in VMs and maps them to a load-balanced set; queuing recommended for backend roles

Aspect: SLA
• Cloud Services (PaaS): 99.95% uptime for roles with two or more role instances
• Azure VMs (IaaS): 99.95% uptime for Availability Sets with two or more VMs

Aspect: Multi-service collocation
• Cloud Services (PaaS): Yes, using Affinity Groups
• Azure VMs (IaaS): Yes, using Affinity Groups

Aspect: UD/FD automated management when service grows / shrinks
• Cloud Services (PaaS): Yes (except when deleting a specific instance)
• Azure VMs (IaaS): Yes when service grows; no when it shrinks

Page 29:

In Review: Key Takeaways

• Plan for Updates and Redundancy when Designing a Service
• Select the best update mode for PaaS services; utilize update notifications as needed
• Use Availability Sets for IaaS service tiers
• Design IaaS updates to ensure high availability for a redundant service


Page 31:

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.