TRANSCRIPT
Spark the future.
May 4–8, 2015, Chicago, IL
Behind the Curtain: How We Run Exchange Online
Perry Clarke, CVP, Outlook/Exchange Engineering
Vivek Sharma, Director of PM, Office 365
BRK3186
We’re engineers, not marketing
What Marketing likes to show off… …how we actually get there
Caveats (this is the slide they make us put in)
We are going to try and show you details, including:
• Technology which only lives @ Microsoft
• Process and culture
• Technology backing the service
Some of these are outside the Exchange product and are only used within Microsoft
Some points will necessarily be vague or omitted altogether, e.g.:
• HW/vendor: we cycle through a LOT of gear from a variety of vendors
• Interdependent systems that do not exist in the enterprise product
• Exact numbers of things—always changing, easily misinterpreted
BUT…
EXACT DETAILS NOT AVAILABLE
Exchange Online Scale
Brilliant physicist and engineer. Understands all the laws of thermodynamics. STILL MANAGES TO BURN MICROWAVE POPCORN…
Global Scale: DC growth
[Map: datacenter locations including San Antonio, Quincy, Des Moines, Chicago, Boydton, Dublin, Amsterdam, Austria, Finland, Brazil, India, Beijing*, Shanghai*, Hong Kong, Singapore, Japan, Australia (*operated by 21Vianet)]
1 million+ servers • 100+ Datacenters in over 40 countries • 1,500 network agreements and 50 Internet connections
Microsoft has datacenter capacity around the world…and we’re growing
Global Scale: Capacity growth
# of servers running Exchange Online from Aug 1st, 2012 to 3/23/14—600% growth
# of servers running Exchange Online from Aug 1st, 2012 to 4/23/15—1350% growth
What is the “Cloud” all about?
The conversation for every service provider and customer has always been:
• COST – how much do I get for what $$?
• RISK – am I comfortable running my most important work on the service?
• USER EXPERIENCE – what value do I get?
We think all three can ONLY be satisfied by running in a large-scale cloud
Everything fails at global scale. Low Mean Time To Resolution (MTTR) is only possible via:
Comprehensive, timely and accurate monitoring
Rapid deployment with safety
Operational engine that can scale effectively with security
Everything eventually breaks; we have to continuously (re)invent
Data Driven: Big Data, External Signals, System Signals, Machine Learning
Secure: Access, Approval, Auditing, Compliance
Automated: Changes, Safety, Orchestration, Repair
Case in Point: 4-Site and Isolated Network
From complex…
• Large failure domains
• High rates of change due to shared nature
• Expensive network gear
• Spanning Tree problems
To simpler…
• Built to survive network device and/or DC failure (4 locations)
• FE is LB only; BE isolated with NAT
• MBX data replicated between stamps in a forest via the internal backend network
[Network diagrams: Core, AR (L3), firewalls, L2 Agg and L2 host layers, load balancers, hosts (multiple properties); link types 10G routed, 10G Dot1Q, 1G or 10G Dot1Q, 1G Dot1Q, 1G routed]
Global Scale: Transaction growth
[Chart: end-user authentication transactions per day, 1/1/2014 through 3/22/2014, scale 0–9 billion]
End user authentication transactions went from 5 billion to 8 billion (160%) in just three months
Our product and service infra has to handle spikes and unexpected growth
Global Scale: Transaction growth
End user authentication transactions went from 8 billion to 55 billion (687%) in the span of a year.
Fundamental architecture changes required just to scale from one month to the next!
[Chart: Peak Requests/Day (Billion), 0–60, Jan through the following Feb]
Global scale: Transforming our technology
Our surface area is too big and partitioned to manage sanely; service management is largely done via our Datacenter Fabric.
[Diagram: Office365.com spanning partitions (North America 1 … North America n, Europe 1), all driven by datacenter automation]
Zoom in: our Service Fabric
…is made up of a lot of stuff
Orchestration: Central Admin (CA), the change/task engine for the service
Deployment/Patching: Build, system orchestration (CA) + specialized system and server setup
Monitoring: eXternal Active Monitoring (XAM) for outside-in probes; Local Active Monitoring (LAM/MA) for server probes and recovery; Data Insights (DI) for system health assessment/analysis
Diagnostics/Perf: Extensible Diagnostics Service (EDS) for perf counters, Watson (per server)
Data (Big, Streaming): Cosmos, Data Pumpers/Schedulers, Data Insights streaming analysis
On-call Interfaces: Office Service Portal, Remote PowerShell admin access
Notification/Alerting: Smart Alerts (phone, email alerts), on-call scheduling, automated alerts
Provisioning/Directory: Service Account Forest Model (SAFM) via AD; tenant/user additions/updates via the Provisioning Pipeline
Networking: Routers, Load Balancers, NATs
New Capacity Pipeline: Fully automated server/device/capacity deployment
Demo
Visualizing Scale
Data & Monitoring
Using signals to improve the service via our Data Systems
Monitoring is a HARD problem
Multi-signal analysis
Data-driven automation; confidence in the data
[Diagram: automated actions: communicate, snooze, recover, block]
Evolution of Monitoring == Data Analysis
Data Insights Engine: has to process 500+ million events/hour—and growing every day
Highly purposed to collect, aggregate and reach conclusions
Built on Microsoft Azure, SQL Azure and open source tech (Cassandra)
Uses the latest streaming tech, similar to Storm and Spark
100s of millions of users • 100K+ servers • 15 TB/day • dozens of components • dozens of datacenters • many regions • 500M+ events per hour • millions of organizations
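To make the "collect, aggregate, reach conclusions" flow concrete, here is a minimal, illustrative C# sketch of the kind of windowed aggregation such an engine performs: group raw events into per-component, per-scope time buckets and emit a conclusion when a bucket's error rate crosses a threshold. The event shape, the one-minute window, and the 1% threshold are assumptions for illustration, not the real Data Insights pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a toy windowed aggregator in the spirit of the Data Insights
// engine described above. The event shape, window size, and error threshold are
// assumptions, not the real system.
record ServiceEvent(DateTime Time, string Component, string Scope, bool IsError);

class WindowedAggregator
{
    private readonly TimeSpan _window = TimeSpan.FromMinutes(1); // tumbling window (assumed)

    // Group events into (window, component, scope) buckets and flag buckets whose
    // error rate exceeds the threshold - the "conclusion" the pipeline would emit.
    public IEnumerable<string> Conclusions(IEnumerable<ServiceEvent> events, double errorThreshold = 0.01)
    {
        return events
            .GroupBy(e => (Window: new DateTime(e.Time.Ticks - e.Time.Ticks % _window.Ticks),
                           e.Component, e.Scope))
            .Select(g => new { g.Key, Total = g.Count(), Errors = g.Count(e => e.IsError) })
            .Where(b => b.Total > 0 && (double)b.Errors / b.Total > errorThreshold)
            .Select(b => $"{b.Key.Window:HH:mm} {b.Key.Component}/{b.Key.Scope}: " +
                         $"{b.Errors}/{b.Total} errors - investigate");
    }
}
```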
Signals: Outside-In monitoring
[Diagram: synthetic probes exercising Office365.com partitions from multiple external networks]
Each scenario tests each DB worldwide every ~5 minutes—ensuring near-continuous verification of availability
From two or more locations to ensure accuracy and redundancy in the system
250 million test transactions per day to verify the service
Synthetics create a robust “baseline” or heartbeat for the service
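As a rough illustration of the outside-in idea, here is a hedged C# sketch of a prober that exercises each target on a fixed cadence and records pass/fail with latency. In the real system each vantage point runs its own probes; the endpoint shape, the 5-minute delay, and the ProbeResult type here are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch of outside-in synthetic monitoring: probe every target on a
// fixed cadence and record pass/fail. Types and cadence are assumptions.
record ProbeResult(DateTime Time, string Location, string Target, bool Success, long LatencyMs);

class SyntheticProber
{
    private static readonly HttpClient Http = new() { Timeout = TimeSpan.FromSeconds(30) };

    public async Task<ProbeResult> ProbeAsync(string location, string targetUrl)
    {
        var started = DateTime.UtcNow;
        try
        {
            using var response = await Http.GetAsync(targetUrl);
            var elapsed = (long)(DateTime.UtcNow - started).TotalMilliseconds;
            return new ProbeResult(started, location, targetUrl, response.IsSuccessStatusCode, elapsed);
        }
        catch (Exception)
        {
            var elapsed = (long)(DateTime.UtcNow - started).TotalMilliseconds;
            return new ProbeResult(started, location, targetUrl, false, elapsed);
        }
    }

    // Run each target roughly every 5 minutes, forever, and hand results to a recorder.
    public async Task RunAsync(IEnumerable<string> locations, IEnumerable<string> targets,
                               Action<ProbeResult> record)
    {
        while (true)
        {
            foreach (var location in locations)
                foreach (var target in targets)
                    record(await ProbeAsync(location, target));

            await Task.Delay(TimeSpan.FromMinutes(5)); // cadence from the slide (~5 min)
        }
    }
}
```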
Signals: Usage based
Aggregated error signals from real service usage
Tells us when something is wrong at the user/client level
Allows us to catch failure scenarios that we didn’t anticipate
System handles 50K transactions/sec, over 1 billion “user” records per day
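A hedged sketch of how such passive signals can be rolled up: raw per-request records are bucketed by minute, scenario, and scope so that failure modes we never anticipated still surface as an elevated failure rate. The record shape and the scope choices (forest, database) are illustrative assumptions.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative sketch of usage-based (passive) signal aggregation. Record shape and
// scopes are assumed, not the real telemetry schema.
record UsageRecord(DateTime Time, string Scenario, string Forest, string Database, bool Failed);

class UsageSignalAggregator
{
    // (minute, scenario, scope) -> (failures, total)
    private readonly ConcurrentDictionary<(DateTime, string, string), (long Failed, long Total)> _buckets = new();

    public void Ingest(UsageRecord r)
    {
        var minute = new DateTime(r.Time.Year, r.Time.Month, r.Time.Day, r.Time.Hour, r.Time.Minute, 0);
        foreach (var scope in new[] { r.Forest, r.Database })   // aggregate at multiple scopes
        {
            _buckets.AddOrUpdate(
                (minute, r.Scenario, scope),
                _ => (r.Failed ? 1 : 0, 1),
                (_, v) => (v.Failed + (r.Failed ? 1 : 0), v.Total + 1));
        }
    }

    public double FailureRate(DateTime minute, string scenario, string scope) =>
        _buckets.TryGetValue((minute, scenario, scope), out var v) && v.Total > 0
            ? (double)v.Failed / v.Total
            : 0.0;
}
```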
Build confidence via multiple signals
If something is happening across many entities/signals, then it must be true
Apply “baseline” from outside-in as a source of truth
If each signal has reasonable fidelity—you get ~100% accuracy
We use this technique to build “Red Alerts”
+’ing these signals lets us triangulate
Combining signals, errors and scopes we can tell what is wrong and where
Dramatically reduces MTTR as no one has to fumble around for root cause
Allows automated recovery—with high confidence comes incredible power
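To make the triangulation idea concrete, here is an illustrative C# sketch in which an alert is raised only when independent signals (the synthetic baseline, the usage signal, and an anomaly flag) agree that the same scope is unhealthy. The thresholds and the ScopeHealth type are assumptions, not the actual Red Alert rules.

```csharp
using System;

// Illustrative sketch of multi-signal correlation ("Red Alerts"): only raise a
// high-confidence alert when independent signals agree on the same scope.
record ScopeHealth(
    string Scope,                 // e.g. a forest or database availability group
    double SyntheticAvailability, // from outside-in probes (the "baseline")
    double UsageFailureRate,      // from aggregated real-user errors
    bool AnomalyDetected);        // from the anomaly-detection signal

static class RedAlert
{
    // Require at least two independent signals to agree before declaring an incident.
    public static bool ShouldRaise(ScopeHealth h)
    {
        int agreeing = 0;
        if (h.SyntheticAvailability < 0.99) agreeing++;  // baseline says the scope is failing
        if (h.UsageFailureRate > 0.05) agreeing++;       // real users are failing too
        if (h.AnomalyDetected) agreeing++;               // traffic deviates from its history
        return agreeing >= 2;                            // thresholds are assumed for illustration
    }
}
```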
[Chart: days 1–31 of the month by forest (apcprd01–apcprd04, apcprd06, eurprd01–eurprd07, jpnprd01, lamprd80, namprd01–namprd09)]
Anomaly detection via machine learning
Deviation from normal means something might be wrong
• 99.5% and 0.5% historical thresholds
• Moving average +/- 2 standard deviations
Methodology for how the data is computed
Auto-posting to the health dashboard. 4:46 PM is when the alert was raised.
This is 4:46 PM!
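A minimal sketch of the band test described above: compare the latest value to a moving average plus or minus two standard deviations over a trailing window. The 24-point window is an assumption; the two-sigma band is from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the anomaly test on this slide: moving average +/- 2 standard
// deviations over a trailing window. Window length is assumed.
static class AnomalyDetector
{
    public static bool IsAnomalous(IReadOnlyList<double> history, double latest, int window = 24)
    {
        if (history.Count < window) return false;            // not enough history to judge

        var recent = history.Skip(history.Count - window).ToArray();
        double mean = recent.Average();
        double stdDev = Math.Sqrt(recent.Sum(v => (v - mean) * (v - mean)) / recent.Length);

        // Outside the band => deviation from normal, something might be wrong.
        return latest > mean + 2 * stdDev || latest < mean - 2 * stdDev;
    }
}
```

In practice the historical percentile thresholds (99.5%/0.5%) and the moving-average band would be combined, so that a value has to look abnormal by more than one measure before an alert is posted.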
Demo
Signal Analysis
Automation Systems
Doing work @ massive scale via “Central Admin” (CA)
All actions through “CA”
Why? Changes can be easily destructive, especially if done directly by humans. Solution: conduct all actions via a safe, reliable, high-throughput system.
Central Admin is essentially the “brain” operating the service
Engineers express intent to CA via code (C# “workflows”) or PowerShell
• The engine then safely applies the change across the desired scopes
• Data from the DI pipeline informs CA to ensure safety (a sketch of the idea follows below)
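The real Central Admin API is internal to Microsoft, but a hedged sketch of the idea looks roughly like this: a workflow expresses intent in code, and the engine applies it scope by scope, consulting a health signal (fed by something like Data Insights) before and after each step. All names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch only: the real CA workflow API is internal. This shows the shape
// of the idea - express intent as code, let the engine apply it scope by scope, and
// gate every step on a health signal.
interface IHealthSignal { Task<bool> IsScopeHealthyAsync(string scope); }

abstract class Workflow
{
    public abstract Task ApplyToScopeAsync(string scope);

    // Apply the change one scope at a time, stopping if the health signal degrades.
    public async Task RunAsync(IEnumerable<string> scopes, IHealthSignal health)
    {
        foreach (var scope in scopes)
        {
            if (!await health.IsScopeHealthyAsync(scope))
                throw new InvalidOperationException($"Halting: {scope} is not healthy enough to change.");

            await ApplyToScopeAsync(scope);

            if (!await health.IsScopeHealthyAsync(scope))
                throw new InvalidOperationException($"Halting: {scope} regressed after the change.");
        }
    }
}
```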
CA at work: workflows for service mgmt.
All actions are built using ‘workflows’, e.g. deployment, repair, recovery, patching
Even higher order work is done in CA, e.g. rebalance a DAG automatically
Every month we run ~200 million workflows. The system is robust enough to handle failures without human intervention
[Diagram: deployment rings: Ring 0 = feature teams; Ring 1 = Office 365 team; Ring 2 = Microsoft; Ring 3 = First Release, once validated by the Microsoft ring; Ring 4 = worldwide, once validated by the First Release ring]
“Orchestration” vs. “Automation”
New Capacity Pipeline
Using the systems approach, we shrank the time to add new capacity from months to DAYS
Pipeline built on CA and rich Data systems, including DI for “service green-lighting”
Orchestration must include entire HW lifecycle
We fix/replace 50K or more HW issues every month
Repair == detect + triage + ticket + track completion + bring back to service
DiskControllers are still king—but NICs and “software failure” via deployment are catching up!
HDDs surprisingly low.
Speaking of repair… “Repairbox”
Specialized CA WF that scans and fixes a variety of service issues:
• Consistency checks (e.g. member of the right server group)
• HW repair (automated detection, ticket opening and closing)
• NW repair (e.g. firewall ACL)
• “Base config” repair such as hyper-threading on/off
Our solution to scale along with the number of servers, deal with stragglers, etc.
Issue                Live Capacity   New Capacity   Total
HyperThreading       398             44             442
HardDisk             195             30             225
PPM                  105                            105
WinrmConnectivity    96              1              97
Memory               53              10             63
HardDiskPredictive   39              14             53
Motherboard          41              2              43
NotHW                34              4              38
DiskController       28              9              37
PowerSupply          16              6              22
CPU                  9               13             22
OobmBootOrder        19              2              21
Other                18              3              21
ILO IP               12              4              16
ILO Reset            14              2              16
Fan                  10              3              13
NIC                  9               2              11
InventoryData        4               2              6
NIC Firmware         5                              5
ILO Password         1               4              5
OobmPowerState       5                              5
Cache Module         4               1              5
High Temp            2               1              3
PSU                  2                              2
Cable                                               1
Spare                                               1
Total                1120            158            1278
Tickets Opened: 1278
Tickets Closed: 1431
Tickets Currently Active: 196
% Automated Found: 77%
Average time to complete (hrs): 9.43
95th Percentile (hrs): 28.73
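As an illustration of the Repairbox pattern (scan, detect, open a ticket, track it to closure, bring the server back into service), here is a hedged C# sketch; the specific checks and the Server/Ticket shapes are assumptions chosen to mirror the examples above.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of a Repairbox-style scan: detect -> triage -> ticket -> track
// completion -> return to service. The check list and types are assumptions; the real
// system is an internal CA workflow.
record Server(string Name, bool HyperThreadingEnabled, bool InCorrectServerGroup, int FailedDisks);
record Ticket(string Server, string Issue);

class RepairScanner
{
    // Consistency and hardware checks, in the spirit of the slide's examples.
    public IEnumerable<Ticket> Scan(IEnumerable<Server> servers)
    {
        foreach (var s in servers)
        {
            if (!s.HyperThreadingEnabled)
                yield return new Ticket(s.Name, "HyperThreading");   // "base config" repair
            if (!s.InCorrectServerGroup)
                yield return new Ticket(s.Name, "ConsistencyCheck"); // group membership drift
            if (s.FailedDisks > 0)
                yield return new Ticket(s.Name, "HardDisk");         // HW repair
        }
    }
}
```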
Repairs: software
1) Run a simple patching cmd to initiate patching: Request-CAReplaceBinary
2) CA creates a patching approval request email with all relevant details
3) CA applies patching in stages (scopes) and notifies the requestor
4) Approver reviews scopes and determines if the patch should be approved
5) Once approved, CA will start staged rollout (first scope only contains 2 servers)
6) CA moves to the next stage ONLY if the previous scope has been successfully patched AND health index is high (see the sketch after this list)
7) Supports “Persistent Patching” mode
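Steps 5 and 6 are the heart of the staged rollout; a hedged C# sketch of that gating logic follows. The delegate shapes and the 0.99 health bar are assumptions, not the real CA implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of the staged patch rollout described above: the first scope is
// deliberately tiny (2 servers on the slide), and each subsequent scope is attempted
// only if the previous one patched cleanly AND the health index stays high.
class StagedPatcher
{
    public async Task RollOutAsync(
        IReadOnlyList<IReadOnlyList<string>> scopes,        // scopes[0] would hold just 2 servers
        Func<string, Task<bool>> patchServerAsync,          // applies the binary to one server
        Func<Task<double>> healthIndexAsync,                // service-wide health signal (e.g. from DI)
        double requiredHealth = 0.99)                       // assumed bar, not the real value
    {
        foreach (var scope in scopes)
        {
            foreach (var server in scope)
            {
                if (!await patchServerAsync(server))
                    throw new InvalidOperationException($"Patch failed on {server}; rollout halted.");
            }

            if (await healthIndexAsync() < requiredHealth)
                throw new InvalidOperationException("Health index dropped after this scope; rollout halted.");
        }
    }
}
```

The design point is that the rollout never advances on a timer alone: every scope boundary is a checkpoint where both the patch result and the service health have to look good.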
Demo
Capacity Management
Secure Operations
Making engineers responsible and responsive
Our Service Philosophy
Principles:
• Engineering first—processes help but are not the only solution
• High emphasis on MTTR across both automated and manual recovery
• Use of automation to “run” the service wherever possible, by the scenario owners
• Direct Escalation, Directly Responsible Individual as the default model
• Low tolerance for making the same mistake twice
• Low tolerance for “off the shelf” solutions to solve core problems
• Bias towards customer trust, compliance and security
All of this backed by rigorous, hands-on attention to the service by the entire team:
• MSR (Monthly Service Review – scrutiny on all service aspects)
• Weekly IM hand-off (managers)
• Monthly Service Readiness review (track customer satisfaction)
• Component-level hand-offs, incident/post-mortem reviews (everyone)
What we are today is a mix of experimentation, learning from others and industry trends (and making a lot of mistakes!)
Our “DevOps” model
Traditional IT
• Highly skilled, domain-specific IT (not true Tier 1)
• Success depends on static, predictable systems

Service IT
• Tiered IT
• Progressive escalations (tier-to-tier)
• “80/15/5” goal

Direct Support
• Tier 1 used for routing/escalation only
• 10-12 engineering teams provide direct support of the service 24x7

DevOps
• Direct escalations
• Operations applied to specific problem spaces (i.e. deployment)
• Emphasize software and automation over human processes
[Diagram: layered models for each stage, showing Service / Operations / Product Team layers, Tier 1 and Tier 2 Operations, software-aided processes, support and other, with a “we are here” marker]
Roles and responsibilities
On-call Engineers: everyone is expected to be on-call across PM and Engineering
Incident Managers: Team managers are expected to act as senior individuals as needed for incidents
SLT Rotation: “exec IMs” for when customers are impacted
Communications Management: updates comms to customer portal, conduit between support and engineering
Provide any additional feedback on your on call experience.
“When I am on-call, I hate my job and consider switching. It's only that I work with great people when I'm not on-call.”
“Build Freshness report has had bugs more than once. Need better reliability before deploying changes.”
“Too many alerts.”
“It was much easier this time around than last.”
Incident Management
All about managing the heat of the moment—doing the right things at the right time
IMs get paged automatically to:
• Find the right people to work on a problem
• Make sure recovery is priority 0
• Keep everyone focused and sane
• Be accountable for making decisions under pressure
Incident Protocol is a set of customer-centric guidelines for each IM:
• Classification of incident types
• Rules of engagement
• SLA and other customer promises
Demo
On-call, escalations, secure operations
We are here to serve you
The investments we make allow us to continuously improve the service for everyone
We are always looking for [email protected] or @pucknorris
[email protected] Get in touch. I’m harmless.
Visit Myignite at http://myignite.microsoft.com or download and use the Ignite Mobile App with the QR code above.
Please evaluate this session. Your feedback is important to us!
© 2015 Microsoft Corporation. All rights reserved.