failsafe 1 hour 2013

MICROSOFT CONFIDENTIAL – INTERNA Ulrich Homann Marc Mercuri FailSafe Patterns for Implementing Resilient Cloud Applications ARC302 Presented in 2013

Upload: marc-mercuri

Post on 08-Apr-2017


TRANSCRIPT

Page 1: Failsafe 1 hour   2013

Ulrich Homann, Marc Mercuri

FailSafe: Patterns for Implementing Resilient Cloud Applications

ARC302

Presented in 2013

Page 2: Failsafe 1 hour   2013

Resiliency

Page 3: Failsafe 1 hour   2013

End Slide

Page 4: Failsafe 1 hour   2013
Page 5: Failsafe 1 hour   2013

Netflix is currently unavailable. Try again later.

Page 6: Failsafe 1 hour   2013
Page 7: Failsafe 1 hour   2013

What is a modern application?

Page 8: Failsafe 1 hour   2013

Microsoft Dynamics: Line of Business Applications

Retail HQ

Manufacturing

Project

Point of Sale

Industry Operational Workloads

HR

Finance

SRM

Expense HCM

Sales and Distribution

Customer Care

Citizen Portal

Sales Force Automation

Marketing Automation

Administrative Core

Horizontal Operational Workloads

CRM Workloads

Page 9: Failsafe 1 hour   2013

FailSafe Services

Page 10: Failsafe 1 hour   2013

Cloud Services, Roles and Instances

Cloud Service is a management, configuration, security, networking and service model boundary

[Diagram labels: Windows Azure; Fabrikam-CloudSvc; Cloud Service 1; ROLES; INSTANCES; VM1, VM2, VM3, VM4, VM5, VM…; WA Web Roles; Data Access; SQL Database]

Page 11: Failsafe 1 hour   2013

What are the “9”s

90% ("one nine")
99% ("two nines")
99.9% ("three nines")
99.99% ("four nines")
99.999% ("five nines")
99.9999% ("six nines")
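Each level above implies a yearly downtime budget. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name `downtime_minutes_per_year` is illustrative and not from the slides.

```python
# Allowed downtime per year for each availability level listed on the slide.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for `nines` nines."""
    unavailability = 10 ** -nines  # e.g. three nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 7):
    availability = 100 * (1 - 10 ** -n)
    print(f"{availability:.4f}% -> {downtime_minutes_per_year(n):,.2f} min/year")
```

For example, three nines allows about 525.6 minutes (roughly 8.76 hours) of downtime per year, while five nines allows about 5.26 minutes.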

Page 12: Failsafe 1 hour   2013

The Truth About 9s

Page 13: Failsafe 1 hour   2013

Throttling

Page 14: Failsafe 1 hour   2013

Decompose by Workload

Page 15: Failsafe 1 hour   2013
Page 16: Failsafe 1 hour   2013

Define Lifecycle Model

Workload 1

Workload 2

Workload 1

Workload 2

Page 17: Failsafe 1 hour   2013

Availability Model and Plan

Page 18: Failsafe 1 hour   2013

Failure Points

Page 19: Failsafe 1 hour   2013

Failure Modes

Page 20: Failsafe 1 hour   2013

Failure Mode Example

catch (Exception e)

Page 21: Failsafe 1 hour   2013

Scale

Resources

Demands

Unit of Scale

Workloads

Page 22: Failsafe 1 hour   2013

Workload 1

Workload 2

Bottom Ramp Peak

Page 23: Failsafe 1 hour   2013

Fault Domains

Fault and upgrade domains

• Failed component can’t take down service

• Isolated infrastructure

• Physical hosts, racks

• Network equipment

• Two by default

• Role instances across 2+ fault domains

Upgrade Domains

• VM rolling upgrades, no availability impact

• Logical grouping of role instances

• Five by default

• Role instances spread over upgrade domains

• Deployment upgraded all at once or one upgrade domain at a time
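The effect of spreading role instances over domains can be sketched as below. The defaults (two fault domains, five upgrade domains) come from the slide; the round-robin placement and the names `place_instances` and `survivors` are assumptions for illustration, since the slides do not describe the platform's actual placement policy.

```python
FAULT_DOMAINS = 2    # "Two by default" on the slide
UPGRADE_DOMAINS = 5  # "Five by default" on the slide


def place_instances(count: int) -> list[dict]:
    """Assign each role instance a fault and an upgrade domain round-robin.

    Round-robin is an assumption made for this sketch only.
    """
    return [
        {
            "instance": i,
            "fault_domain": i % FAULT_DOMAINS,
            "upgrade_domain": i % UPGRADE_DOMAINS,
        }
        for i in range(count)
    ]


def survivors(placement: list[dict], key: str, failed_domain: int) -> list[int]:
    """Instances still serving while one domain is down or being upgraded."""
    return [p["instance"] for p in placement if p[key] != failed_domain]


placement = place_instances(6)
print(survivors(placement, "fault_domain", 0))    # instances outside fault domain 0
print(survivors(placement, "upgrade_domain", 2))  # instances outside upgrade domain 2
```

With a single instance, `survivors(place_instances(1), "fault_domain", 0)` is empty, which illustrates why a role needs instances in at least two fault domains.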

Page 24: Failsafe 1 hour   2013

Deployment Redundancy

Page 25: Failsafe 1 hour   2013

Application considerations

Page 26: Failsafe 1 hour   2013

Circuit Breaker at Netflix

Page 27: Failsafe 1 hour   2013

Circuit Breaker at Netflix - Fallbacks
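A generic circuit breaker with a fallback can be sketched as follows. This is not Netflix's implementation; the class, its parameters (`max_failures`, `reset_after`, `clock`) and the fallback value are hypothetical and only illustrate the pattern named on the slide.

```python
import time


class CircuitBreaker:
    """Minimal breaker: closed, open after N failures, half-open after a timeout."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without calling the dependency
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result


def flaky():
    raise ConnectionError("dependency down")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
for _ in range(3):
    # The third call never reaches flaky(): the circuit is already open.
    print(breaker.call(flaky, fallback=lambda: "cached recommendations"))
```

The fallback keeps the caller responsive with degraded data while the failing dependency is given time to recover.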

Page 28: Failsafe 1 hour   2013

Incorporate Open Standards

Page 29: Failsafe 1 hour   2013

Data Partitioning

Page 30: Failsafe 1 hour   2013

Data Decomposition

• Apply functional composition to the database layer too

• Don’t force partitioning for the sake of partitioning; you will lose manageability

• Partition where and when required to reduce dependency and to allow independent management and scale

• Reduce logic in SQL Databases; CRUD is acceptable; say NO to others

Page 31: Failsafe 1 hour   2013

Understanding the 3Vs

Page 32: Failsafe 1 hour   2013

Understanding Queryability

Page 36: Failsafe 1 hour   2013

Data – to cache or not to cache….

Page 37: Failsafe 1 hour   2013

Data on the inside – Data on the outside
http://msdn.microsoft.com/en-us/library/ms954587.aspx

Page 38: Failsafe 1 hour   2013

“Query Ready” Cache

Query patterns

• Push the data close to where it is queried
  Example: Bing Maps

• Process, structure, produce, format etc. data and cache “query ready” data

• Light/cheap data production is OK
  Pure and idempotent operations are usually good candidates

• Duplication is OK
  Same data in a different format
  Same data in multiple places

This requires processing data before it is queried, NOT at query time

• All data can be cached

• Some data can be cached: frequently used; process-heavy, expensive data; build as you go
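The idea of producing “query ready” data ahead of query time can be sketched as below. The order records, the JSON response format and the function names are hypothetical sample choices, not part of the presentation.

```python
import json

# Hypothetical raw records; a real system would read these from its primary store.
ORDERS = [
    {"customer": "alice", "total": 30.0},
    {"customer": "bob", "total": 12.5},
    {"customer": "alice", "total": 7.5},
]


def build_query_ready_cache(orders):
    """Aggregate and format per-customer summaries before any query arrives."""
    totals = {}
    for order in orders:
        totals[order["customer"]] = totals.get(order["customer"], 0.0) + order["total"]
    # Store the final response format so that a query is a plain lookup.
    return {
        customer: json.dumps({"customer": customer, "total": total}, sort_keys=True)
        for customer, total in totals.items()
    }


CACHE = build_query_ready_cache(ORDERS)  # rebuilt when data changes, not per query


def query(customer):
    """Query-time path: no aggregation or formatting, only a dictionary lookup."""
    return CACHE.get(customer)


print(query("alice"))
```

The aggregation is pure and idempotent, so rebuilding the cache from the same records always yields the same entries, which fits the slide's description of good candidates.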

Page 39: Failsafe 1 hour   2013
Page 40: Failsafe 1 hour   2013

Backup and Restore

Page 41: Failsafe 1 hour   2013

CDN

Page 42: Failsafe 1 hour   2013

Latency shifts

Page 43: Failsafe 1 hour   2013

Traffic Management

• Direct users to the service in the closest region

[Diagram labels: Traffic Manager (Monitoring, Policies); foo.cloudapp.net; foo-us.cloudapp.net; foo-europe.cloudapp.net; foo-asia.cloudapp.net; DNS response 1.2.3.4]
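Directing a user to the closest healthy region can be sketched as a simple selection. The endpoint names come from the slide; the latency figures, health flags and the `resolve` function are made-up sample data and names, and this is not the Traffic Manager API.

```python
# Endpoint names are from the slide; latencies (ms) and health flags are sample data.
ENDPOINTS = {
    "foo-us.cloudapp.net": {
        "healthy": True,
        "latency_ms": {"us": 20, "europe": 110, "asia": 180},
    },
    "foo-europe.cloudapp.net": {
        "healthy": True,
        "latency_ms": {"us": 100, "europe": 15, "asia": 160},
    },
    "foo-asia.cloudapp.net": {
        "healthy": True,
        "latency_ms": {"us": 170, "europe": 150, "asia": 25},
    },
}


def resolve(client_region, endpoints=ENDPOINTS):
    """Return the healthy endpoint with the lowest latency for the client's region."""
    healthy = {name: e for name, e in endpoints.items() if e["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy endpoint available")
    return min(healthy, key=lambda name: healthy[name]["latency_ms"][client_region])


print(resolve("europe"))
```

Filtering on health before choosing by latency means a failed region is skipped and its users are sent to the next closest one.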

Page 44: Failsafe 1 hour   2013

Cross-Premises Connectivity

Cloud Enterprise

• Application-Layer Connectivity & Messaging: Service Bus

• Data Synchronization: SQL Database Data Sync

• Secure Machine-to-Machine Connectivity: Windows Azure Connect

• Secure Site-to-Site Network Connectivity: Windows Azure Virtual Network

• App Monitoring & Management: System Center

Page 45: Failsafe 1 hour   2013

Design for operations

Page 46: Failsafe 1 hour   2013

What is a health model?

Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist

Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped

Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well-designed applications may support partial operation, e.g. read only
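The three health model terms can be sketched as a small data structure. The condition names, the `order-service` entity and the rule that an entity is as healthy as its worst aspect are assumptions for illustration; the slides do not specify a roll-up rule.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Condition(IntEnum):
    """Operational conditions ordered from best to worst (illustrative names)."""
    NORMAL = 0     # fully functional
    READ_ONLY = 1  # partial operation
    DOWN = 2


@dataclass
class ManagedEntity:
    name: str
    external: bool = False
    # One operational condition per aspect, e.g. security, performance, backup.
    aspects: dict = field(default_factory=dict)

    def health(self) -> Condition:
        """Roll up (assumed rule): the entity is as healthy as its worst aspect."""
        return max(self.aspects.values(), default=Condition.NORMAL)


orders = ManagedEntity(
    "order-service",
    aspects={
        "security": Condition.NORMAL,
        "performance": Condition.READ_ONLY,
        "backup": Condition.NORMAL,
    },
)
print(orders.health().name)
```

Keeping one condition per aspect lets each functional team report on its own area while an operator still sees a single health state per entity.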

Page 47: Failsafe 1 hour   2013

Troubleshooting Workflow

Page 48: Failsafe 1 hour   2013

Tools

Page 49: Failsafe 1 hour   2013

Demo

FailSafe Modeling Tool

Page 50: Failsafe 1 hour   2013

Test Plans Creation and Execution
- Create, review, execute, and save test plans and executions
- Test Execution Reports

Multi-Subscription Test Execution
- Send disruptions to multiple Cloud applications
- Ability to define the disruptions execution order

Multiple Disruption Delivery Mechanisms
- Use WA Management API and/or Overlord Agent
- Mix the Disruption delivery

Extensible Disruptions’ Database
- Template Engine for PowerShell Scripts
- Ability to execute programs (NotMyFault.exe)

Cloud Overlord Testing Framework
- Fault Injection Testing Framework
- Generate consistent and repeatable platform level disruptions

Page 51: Failsafe 1 hour   2013

© 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.