TRANSCRIPT
Spark the future.
May 4–8, 2015, Chicago, IL
Behind the Curtain: How We Run Exchange Online
Perry Clarke, CVP, Outlook/Exchange Engineering
Vivek Sharma, Director of PM, Office 365
BRK3186
We’re engineers, not marketing
What Marketing likes to show off… …how we actually get there
Caveats (this is the slide they make us put in)
We are going to try and show you details, including:
• Technology which only lives @ Microsoft
• Process and culture
• Technology backing the service
Some of these are outside the Exchange product and are only used within Microsoft
Some points will necessarily be vague or omitted altogether, e.g.:
• HW/vendor: we cycle through a LOT of gear from a variety of vendors
• Interdependent systems that do not exist in the enterprise product
• Exact numbers of things—always changing, easily misinterpreted
BUT…
EXACT DETAILS NOT AVAILABLE
Exchange Online Scale
Brilliant physicist and engineer. Understands all the laws of thermodynamics. STILL MANAGES TO BURN MICROWAVE POPCORN…
Global Scale: DC growth
[Map: datacenter locations including San Antonio, Quincy, Des Moines, Chicago, Boydton, Dublin, Amsterdam, Austria, Finland, Brazil, India, Beijing*, Shanghai*, Hong Kong, Singapore, Japan, Australia (*operated by 21Vianet)]
1 million+ servers • 100+ Datacenters in over 40 countries • 1,500 network agreements and 50 Internet connections
Microsoft has datacenter capacity around the world…and we’re growing
Global Scale: Capacity growth
# of servers running Exchange Online from Aug 1st, 2012 to 3/23/14—600% growth
# of servers running Exchange Online from Aug 1st, 2012 to 4/23/15—1350% growth
What is the “Cloud” all about?
The conversation for every service provider and customer has always been:
• COST – how much do I get for what $$?
• RISK – am I comfortable running my most important work on the service?
• USER EXPERIENCE – what value do I get?
We think all three can ONLY be satisfied by running in a large-scale cloud
Everything fails at global scale. Low Mean Time To Resolution (MTTR) is only possible via:
Comprehensive, timely and accurate monitoring
Rapid deployment with safety
Operational engine that can scale effectively with security
Everything eventually breaks; we have to continuously (re)invent
Data Driven: Big Data, External Signals, System Signals, Machine Learning
Secure: Access, Approval, Auditing, Compliance
Automated: Changes, Safety, Orchestration, Repair
Case in Point: 4-Site and Isolated Network
From complex…
• Large failure domains
• High rates of change due to shared nature
• Expensive network gear
• Spanning Tree problems
To simpler…
• Built to survive network device and/or DC failure (4 locations)
• FE is LB only; BE isolated with NAT
• MBX data replicated between stamps in a forest via the internal backend network
[Network diagrams: Core, AR (L3), firewalls, L2 Agg and L2 host layers, load balancers, hosts (multiple properties); link types 10G routed, 10G Dot1Q, 1G or 10G Dot1Q, 1G Dot1Q, 1G routed]
Global Scale: Transaction growth
[Chart: end-user authentication transactions per day, 1/1/2014 through 3/22/2014, scale 0–9 billion]
End user authentication transactions went from 5 billion to 8 billion (160%) in just three months
Our product and service infra has to handle spikes and unexpected growth
Global Scale: Transaction growth
End user authentication transactions went from 8 billion to 55 billion (687%) in the span of a year.
Fundamental architecture changes required just to scale from one month to the next!
[Chart: Peak Requests/Day (Billion), 0–60, Jan through the following Feb]
Global scale: Transforming our technology
Our surface area is too big and partitioned to manage sanely; service management is largely done via our Datacenter Fabric.
[Diagram: Office365.com spanning partitions (North America 1 … North America n, Europe 1), all driven by datacenter automation]
Zoom in: our Service Fabric
…is made up of a lot of stuff
Orchestration: Central Admin (CA), the change/task engine for the service
Deployment/Patching: Build, system orchestration (CA) + specialized system and server setup
Monitoring: eXternal Active Monitoring (XAM) for outside-in probes; Local Active Monitoring (LAM/MA) for server probes and recovery; Data Insights (DI) for system health assessment/analysis
Diagnostics/Perf: Extensible Diagnostics Service (EDS) for perf counters, Watson (per server)
Data (Big, Streaming): Cosmos, Data Pumpers/Schedulers, Data Insights streaming analysis
On-call Interfaces: Office Service Portal, Remote PowerShell admin access
Notification/Alerting: Smart Alerts (phone, email alerts), on-call scheduling, automated alerts
Provisioning/Directory: Service Account Forest Model (SAFM) via AD; tenant/user additions/updates via the Provisioning Pipeline
Networking: Routers, Load Balancers, NATs
New Capacity Pipeline: Fully automated server/device/capacity deployment
Demo
Visualizing Scale
Data & Monitoring
Using signals to improve the service via our Data Systems
Monitoring is a HARD problem
Multi-signal analysis
Data-driven automation; confidence in the data
[Diagram: automated actions: communicate, snooze, recover, block]
Evolution of Monitoring == Data Analysis
Data Insights Engine: has to process 500+ million events/hour—and growing every day
Highly purposed to collect, aggregate and reach conclusions
Built on Microsoft Azure, SQL Azure and open source tech (Cassandra)
Uses the latest streaming tech, similar to Storm and Spark
100s of millions of users • 100K+ servers • 15 TB/day • dozens of components • dozens of datacenters • many regions • 500M+ events per hour • millions of organizations
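To make the "collect, aggregate, reach conclusions" flow concrete, here is a minimal, illustrative C# sketch of the kind of windowed aggregation such an engine performs: group raw events into per-component, per-scope time buckets and emit a conclusion when a bucket's error rate crosses a threshold. The event shape, the one-minute window, and the 1% threshold are assumptions for illustration, not the real Data Insights pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a toy windowed aggregator in the spirit of the Data Insights
// engine described above. The event shape, window size, and error threshold are
// assumptions, not the real system.
record ServiceEvent(DateTime Time, string Component, string Scope, bool IsError);

class WindowedAggregator
{
    private readonly TimeSpan _window = TimeSpan.FromMinutes(1); // tumbling window (assumed)

    // Group events into (window, component, scope) buckets and flag buckets whose
    // error rate exceeds the threshold - the "conclusion" the pipeline would emit.
    public IEnumerable<string> Conclusions(IEnumerable<ServiceEvent> events, double errorThreshold = 0.01)
    {
        return events
            .GroupBy(e => (Window: new DateTime(e.Time.Ticks - e.Time.Ticks % _window.Ticks),
                           e.Component, e.Scope))
            .Select(g => new { g.Key, Total = g.Count(), Errors = g.Count(e => e.IsError) })
            .Where(b => b.Total > 0 && (double)b.Errors / b.Total > errorThreshold)
            .Select(b => $"{b.Key.Window:HH:mm} {b.Key.Component}/{b.Key.Scope}: " +
                         $"{b.Errors}/{b.Total} errors - investigate");
    }
}
```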
Signals: Outside-In monitoring
[Diagram: synthetic probes exercising Office365.com partitions from multiple external networks]
Each scenario tests each DB worldwide every ~5 minutes—ensuring near-continuous verification of availability
From two or more locations to ensure accuracy and redundancy in the system
250 million test transactions per day to verify the service
Synthetics create a robust “baseline” or heartbeat for the service
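As a rough illustration of the outside-in idea, here is a hedged C# sketch of a prober that exercises each target on a fixed cadence and records pass/fail with latency. In the real system each vantage point runs its own probes; the endpoint shape, the 5-minute delay, and the ProbeResult type here are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch of outside-in synthetic monitoring: probe every target on a
// fixed cadence and record pass/fail. Types and cadence are assumptions.
record ProbeResult(DateTime Time, string Location, string Target, bool Success, long LatencyMs);

class SyntheticProber
{
    private static readonly HttpClient Http = new() { Timeout = TimeSpan.FromSeconds(30) };

    public async Task<ProbeResult> ProbeAsync(string location, string targetUrl)
    {
        var started = DateTime.UtcNow;
        try
        {
            using var response = await Http.GetAsync(targetUrl);
            var elapsed = (long)(DateTime.UtcNow - started).TotalMilliseconds;
            return new ProbeResult(started, location, targetUrl, response.IsSuccessStatusCode, elapsed);
        }
        catch (Exception)
        {
            var elapsed = (long)(DateTime.UtcNow - started).TotalMilliseconds;
            return new ProbeResult(started, location, targetUrl, false, elapsed);
        }
    }

    // Run each target roughly every 5 minutes, forever, and hand results to a recorder.
    public async Task RunAsync(IEnumerable<string> locations, IEnumerable<string> targets,
                               Action<ProbeResult> record)
    {
        while (true)
        {
            foreach (var location in locations)
                foreach (var target in targets)
                    record(await ProbeAsync(location, target));

            await Task.Delay(TimeSpan.FromMinutes(5)); // cadence from the slide (~5 min)
        }
    }
}
```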
Signals: Usage based
Aggregated error signals from real service usage
Tells us when something is wrong at the user/client level
Allows us to catch failure scenarios that we didn’t anticipate
System handles 50K transactions/sec, over 1 billion “user” records per day
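A hedged sketch of how such passive signals can be rolled up: raw per-request records are bucketed by minute, scenario, and scope so that failure modes we never anticipated still surface as an elevated failure rate. The record shape and the scope choices (forest, database) are illustrative assumptions.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative sketch of usage-based (passive) signal aggregation. Record shape and
// scopes are assumed, not the real telemetry schema.
record UsageRecord(DateTime Time, string Scenario, string Forest, string Database, bool Failed);

class UsageSignalAggregator
{
    // (minute, scenario, scope) -> (failures, total)
    private readonly ConcurrentDictionary<(DateTime, string, string), (long Failed, long Total)> _buckets = new();

    public void Ingest(UsageRecord r)
    {
        var minute = new DateTime(r.Time.Year, r.Time.Month, r.Time.Day, r.Time.Hour, r.Time.Minute, 0);
        foreach (var scope in new[] { r.Forest, r.Database })   // aggregate at multiple scopes
        {
            _buckets.AddOrUpdate(
                (minute, r.Scenario, scope),
                _ => (r.Failed ? 1 : 0, 1),
                (_, v) => (v.Failed + (r.Failed ? 1 : 0), v.Total + 1));
        }
    }

    public double FailureRate(DateTime minute, string scenario, string scope) =>
        _buckets.TryGetValue((minute, scenario, scope), out var v) && v.Total > 0
            ? (double)v.Failed / v.Total
            : 0.0;
}
```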
Build confidence via multiple signals
If something is happening across many entities/signals, then it must be true
Apply “baseline” from outside-in as a source of truth
If each signal has reasonable fidelity—you get ~100% accuracy
We use this technique to build “Red Alerts”
+’ing these signals lets us triangulate
Combining signals, errors and scopes we can tell what is wrong and where
Dramatically reduces MTTR as no one has to fumble around for root cause
Allows automated recovery—with high confidence comes incredible power
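To make the triangulation idea concrete, here is an illustrative C# sketch in which an alert is raised only when independent signals (the synthetic baseline, the usage signal, and an anomaly flag) agree that the same scope is unhealthy. The thresholds and the ScopeHealth type are assumptions, not the actual Red Alert rules.

```csharp
using System;

// Illustrative sketch of multi-signal correlation ("Red Alerts"): only raise a
// high-confidence alert when independent signals agree on the same scope.
record ScopeHealth(
    string Scope,                 // e.g. a forest or database availability group
    double SyntheticAvailability, // from outside-in probes (the "baseline")
    double UsageFailureRate,      // from aggregated real-user errors
    bool AnomalyDetected);        // from the anomaly-detection signal

static class RedAlert
{
    // Require at least two independent signals to agree before declaring an incident.
    public static bool ShouldRaise(ScopeHealth h)
    {
        int agreeing = 0;
        if (h.SyntheticAvailability < 0.99) agreeing++;  // baseline says the scope is failing
        if (h.UsageFailureRate > 0.05) agreeing++;       // real users are failing too
        if (h.AnomalyDetected) agreeing++;               // traffic deviates from its history
        return agreeing >= 2;                            // thresholds are assumed for illustration
    }
}
```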
[Chart: days 1–31 of the month by forest (apcprd01–apcprd04, apcprd06, eurprd01–eurprd07, jpnprd01, lamprd80, namprd01–namprd09)]
Anomaly detection via machine learning
Deviation from normal means something might be wrong
• 99.5% and 0.5% historical thresholds
• Moving average +/- 2 standard deviations
Methodology for how the data is computed
Auto-posting to the health dashboard. 4:46 PM is when the alert was raised.
This is 4:46 PM!
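A minimal sketch of the band test described above: compare the latest value to a moving average plus or minus two standard deviations over a trailing window. The 24-point window is an assumption; the two-sigma band is from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the anomaly test on this slide: moving average +/- 2 standard
// deviations over a trailing window. Window length is assumed.
static class AnomalyDetector
{
    public static bool IsAnomalous(IReadOnlyList<double> history, double latest, int window = 24)
    {
        if (history.Count < window) return false;            // not enough history to judge

        var recent = history.Skip(history.Count - window).ToArray();
        double mean = recent.Average();
        double stdDev = Math.Sqrt(recent.Sum(v => (v - mean) * (v - mean)) / recent.Length);

        // Outside the band => deviation from normal, something might be wrong.
        return latest > mean + 2 * stdDev || latest < mean - 2 * stdDev;
    }
}
```

In practice the historical percentile thresholds (99.5%/0.5%) and the moving-average band would be combined, so that a value has to look abnormal by more than one measure before an alert is posted.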
Demo
Signal Analysis
Automation Systems
Doing work @ massive scale via “Central Admin” (CA)
All actions through “CA”
Why? Changes can be easily destructive, especially if done directly by humans. Solution: conduct all actions via a safe, reliable, high-throughput system.
Central Admin is essentially the “brain” operating the service
Engineers express intent to CA via code (C# “workflows”) or PowerShell
• The engine then safely applies the change across the desired scopes
• Data from the DI pipeline informs CA to ensure safety (a sketch of the idea follows below)
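The real Central Admin API is internal to Microsoft, but a hedged sketch of the idea looks roughly like this: a workflow expresses intent in code, and the engine applies it scope by scope, consulting a health signal (fed by something like Data Insights) before and after each step. All names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch only: the real CA workflow API is internal. This shows the shape
// of the idea - express intent as code, let the engine apply it scope by scope, and
// gate every step on a health signal.
interface IHealthSignal { Task<bool> IsScopeHealthyAsync(string scope); }

abstract class Workflow
{
    public abstract Task ApplyToScopeAsync(string scope);

    // Apply the change one scope at a time, stopping if the health signal degrades.
    public async Task RunAsync(IEnumerable<string> scopes, IHealthSignal health)
    {
        foreach (var scope in scopes)
        {
            if (!await health.IsScopeHealthyAsync(scope))
                throw new InvalidOperationException($"Halting: {scope} is not healthy enough to change.");

            await ApplyToScopeAsync(scope);

            if (!await health.IsScopeHealthyAsync(scope))
                throw new InvalidOperationException($"Halting: {scope} regressed after the change.");
        }
    }
}
```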
CA at work: workflows for service mgmt.
All actions are built using ‘workflows’, e.g. deployment, repair, recovery, patching
Even higher order work is done in CA, e.g. rebalance a DAG automatically
Every month we run ~200 million workflows. The system is robust enough to handle failures without human intervention
[Diagram: deployment rings: Ring 0 = feature teams; Ring 1 = Office 365 team; Ring 2 = Microsoft; Ring 3 = First Release, once validated by the Microsoft ring; Ring 4 = worldwide, once validated by the First Release ring]
“Orchestration” vs. “Automation”
New Capacity Pipeline
Using the systems approach, we shrank the time to add new capacity from months to DAYS
Pipeline built on CA and rich Data systems, including DI for “service green-lighting”
Orchestration must include entire HW lifecycle
We fix/replace 50K or more HW issues every month
Repair == detect + triage + ticket + track completion + bring back to service
DiskControllers are still king—but NICs and “software failure” via deployment are catching up!
HDDs surprisingly low.
Speaking of repair… “Repairbox”
Specialized CA WF that scans and fixes a variety of service issues:
• Consistency checks (e.g. member of the right server group)
• HW repair (automated detection, ticket opening and closing)
• NW repair (e.g. firewall ACL)
• “Base config” repair such as hyper-threading on/off
Our solution to scale along with the number of servers, deal with stragglers, etc.
Issue                Live Capacity   New Capacity   Total
HyperThreading       398             44             442
HardDisk             195             30             225
PPM                  105                            105
WinrmConnectivity    96              1              97
Memory               53              10             63
HardDiskPredictive   39              14             53
Motherboard          41              2              43
NotHW                34              4              38
DiskController       28              9              37
PowerSupply          16              6              22
CPU                  9               13             22
OobmBootOrder        19              2              21
Other                18              3              21
ILO IP               12              4              16
ILO Reset            14              2              16
Fan                  10              3              13
NIC                  9               2              11
InventoryData        4               2              6
NIC Firmware         5                              5
ILO Password         1               4              5
OobmPowerState       5                              5
Cache Module         4               1              5
High Temp            2               1              3
PSU                  2                              2
Cable                                               1
Spare                                               1
Total                1120            158            1278
Tickets Opened: 1278
Tickets Closed: 1431
Tickets Currently Active: 196
% Automated Found: 77%
Average time to complete (hrs): 9.43
95th Percentile (hrs): 28.73
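As an illustration of the Repairbox pattern (scan, detect, open a ticket, track it to closure, bring the server back into service), here is a hedged C# sketch; the specific checks and the Server/Ticket shapes are assumptions chosen to mirror the examples above.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of a Repairbox-style scan: detect -> triage -> ticket -> track
// completion -> return to service. The check list and types are assumptions; the real
// system is an internal CA workflow.
record Server(string Name, bool HyperThreadingEnabled, bool InCorrectServerGroup, int FailedDisks);
record Ticket(string Server, string Issue);

class RepairScanner
{
    // Consistency and hardware checks, in the spirit of the slide's examples.
    public IEnumerable<Ticket> Scan(IEnumerable<Server> servers)
    {
        foreach (var s in servers)
        {
            if (!s.HyperThreadingEnabled)
                yield return new Ticket(s.Name, "HyperThreading");   // "base config" repair
            if (!s.InCorrectServerGroup)
                yield return new Ticket(s.Name, "ConsistencyCheck"); // group membership drift
            if (s.FailedDisks > 0)
                yield return new Ticket(s.Name, "HardDisk");         // HW repair
        }
    }
}
```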
Repairs: software
1) Run a simple patching cmd to initiate patching: Request-CAReplaceBinary
2) CA creates a patching approval request email with all relevant details
3) CA applies patching in stages (scopes) and notifies the requestor
4) Approver reviews scopes and determines if the patch should be approved
5) Once approved, CA will start staged rollout (first scope only contains 2 servers)
6) CA moves to the next stage ONLY if the previous scope has been successfully patched AND health index is high (see the sketch after this list)
7) Supports “Persistent Patching” mode
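Steps 5 and 6 are the heart of the staged rollout; a hedged C# sketch of that gating logic follows. The delegate shapes and the 0.99 health bar are assumptions, not the real CA implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of the staged patch rollout described above: the first scope is
// deliberately tiny (2 servers on the slide), and each subsequent scope is attempted
// only if the previous one patched cleanly AND the health index stays high.
class StagedPatcher
{
    public async Task RollOutAsync(
        IReadOnlyList<IReadOnlyList<string>> scopes,        // scopes[0] would hold just 2 servers
        Func<string, Task<bool>> patchServerAsync,          // applies the binary to one server
        Func<Task<double>> healthIndexAsync,                // service-wide health signal (e.g. from DI)
        double requiredHealth = 0.99)                       // assumed bar, not the real value
    {
        foreach (var scope in scopes)
        {
            foreach (var server in scope)
            {
                if (!await patchServerAsync(server))
                    throw new InvalidOperationException($"Patch failed on {server}; rollout halted.");
            }

            if (await healthIndexAsync() < requiredHealth)
                throw new InvalidOperationException("Health index dropped after this scope; rollout halted.");
        }
    }
}
```

The design point is that the rollout never advances on a timer alone: every scope boundary is a checkpoint where both the patch result and the service health have to look good.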
Demo
Capacity Management
Secure Operations
Making engineers responsible and responsive
Our Service Philosophy
Principles:
• Engineering first—processes help but are not the only solution
• High emphasis on MTTR across both automated and manual recovery
• Use of automation to “run” the service wherever possible, by the scenario owners
• Direct Escalation, Directly Responsible Individual as the default model
• Low tolerance for making the same mistake twice
• Low tolerance for “off the shelf” solutions to solve core problems
• Bias towards customer trust, compliance and security
All of this backed by rigorous, hands-on attention to the service by the entire team:
• MSR (Monthly Service Review – scrutiny on all service aspects)
• Weekly IM hand-off (managers)
• Monthly Service Readiness review (track customer satisfaction)
• Component-level hand-offs, incident/post-mortem reviews (everyone)
What we are today is a mix of experimentation, learning from others and industry trends (and making a lot of mistakes!)
Our “DevOps” model
Traditional IT
• Highly skilled, domain-specific IT (not true Tier 1)
• Success depends on static, predictable systems

Service IT
• Tiered IT
• Progressive escalations (tier-to-tier)
• “80/15/5” goal

Direct Support
• Tier 1 used for routing/escalation only
• 10-12 engineering teams provide direct support of the service 24x7

DevOps
• Direct escalations
• Operations applied to specific problem spaces (i.e. deployment)
• Emphasize software and automation over human processes
[Diagram: layered models for each stage, showing Service / Operations / Product Team layers, Tier 1 and Tier 2 Operations, software-aided processes, support and other, with a “we are here” marker]
Roles and responsibilities
On-call Engineers: everyone is expected to be on-call across PM and Engineering
Incident Managers: Team managers are expected to act as senior individuals as needed for incidents
SLT Rotation: “exec IMs” for when customers are impacted
Communications Management: updates comms to customer portal, conduit between support and engineering
Provide any additional feedback on your on call experience.
“When I am on-call, I hate my job and consider switching. It's only that I work with great people when I'm not on-call.”
“Build Freshness report has had bugs more than once. Need better reliability before deploying changes.”
“Too many alerts.”
“It was much easier this time around than last.”
Incident Management
All about managing the heat of the moment—doing the right things at the right time
IMs get paged automatically to:
• Find the right people to work on a problem
• Make sure recovery is priority 0
• Keep everyone focused and sane
• Be accountable for making decisions under pressure
Incident Protocol is a set of customer-centric guidelines for each IM:
• Classification of incident types
• Rules of engagement
• SLA and other customer promises
Demo
On-call, escalations, secure operations
We are here to serve you
The investments we make allow us to continuously improve the service for everyone
We are always looking for [email protected] or @pucknorris
[email protected] Get in touch. I’m harmless.
Visit Myignite at http://myignite.microsoft.com or download and use the Ignite Mobile App with the QR code above.
Please evaluate this session. Your feedback is important to us!
© 2015 Microsoft Corporation. All rights reserved.