building big: lessons learned from windows azure customers – part two
DESCRIPTION
Building Big: Lessons learned from Windows Azure customers – Part Two. Mark Simms (@mabsimms), Principal Program Manager, Microsoft; Simon Davies (@simongdavies), Windows Azure Technical Specialist, Microsoft. Session 3-030.
TRANSCRIPT
![Page 1: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/1.jpg)
Building Big: Lessons learned from Windows Azure customers – Part Two
Mark Simms (@mabsimms), Principal Program Manager, Microsoft
Simon Davies (@simongdavies), Windows Azure Technical Specialist, Microsoft
3-030
![Page 2: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/2.jpg)
Session Objectives
Designing large-scale services requires careful design and architecture choices.
This session will explore customer deployments on Azure and illustrate the key choices, tradeoffs and learnings.
Two-part session:
• Part 1: Building for Scale
• Part 2: Building for Availability
![Page 3: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/3.jpg)
Other Great Sessions
This session will focus on architecture and design choices for delivering highly available services.
If this isn’t a compelling topic, there are many other great sessions happening right now!

| Room | Level | Title | Presenter |
|---|---|---|---|
| Nexus/Normandy | 300 | Designing awesome XAML apps in Visual Studio and Blend for Windows 8 and Windows Phone 8 | Jeffrey Ferman |
| Trident/Thunder | 300 | Developing Mobile Solutions with Windows Azure Part II | Nick Harris, Chris Risner |
| Odyssey | 200 | Desktop apps: WPF 4.5 and Visual Studio 2012 | Pete Brown (DPE) |
| Magellan | 200 | WP8 HTML5/IE10 for Developers | Rick Xu, Jorge Peraza |
![Page 4: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/4.jpg)
Agenda
• Building Big – the availability challenge
• Everything will fail – design for failure
• Get insight – instrument everything
![Page 5: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/5.jpg)
Designing and Deploying Internet Scale Services
James Hamilton, https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
Part 1: Design for Scale
• Partition the service
• Support geo-distribution
• Optimize for density
Part 2: Design for Availability
Design for Failure
• Do not trust underlying components
• Decouple components
• Avoid single points of failure
Instrument everything
• Implement inter-service monitoring and alerting
• Instrument for production testing
• Configurable logging
![Page 6: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/6.jpg)
What are the 9’s?
![Page 7: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/7.jpg)
The Hard Reality of the 9’s
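As background for the two slides on the 9's, here is a minimal sketch of the usual availability arithmetic: the yearly downtime budget implied by an availability figure, and the combined availability of a request path that needs several dependencies at once. The function names are hypothetical, the sample figures are illustrative rather than taken from the slides, and the composite calculation assumes dependencies fail independently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def composite_availability(*dependencies: float) -> float:
    """Availability of a path that needs every dependency to be up.

    Assumes the dependencies fail independently of each other.
    """
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result


if __name__ == "__main__":
    for label, a in [("99%", 0.99), ("99.9%", 0.999), ("99.99%", 0.9999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} minutes/year")
    print(f"three 99.9% services in series: {composite_availability(0.999, 0.999, 0.999):.4%}")
```

Each extra nine cuts the downtime budget by a factor of ten, while every additional dependency in the request path lowers the achievable figure.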
![Page 8: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/8.jpg)
Design for Failure
Given enough time and pressure, everything fails.
How will your application behave?
• Gracefully handle failure modes, continue to deliver value
• Not so gracefully …
Fault types:
• Transient. Temporary service interruptions, self-healing.
• Enduring. Require intervention.
![Page 9: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/9.jpg)
Failure Scope
• Node: individual nodes may fail. Connectivity issues (transient failures), hardware failures, configuration and code errors.
• Service: entire services may fail. Service dependencies (internal and external).
• Region: regions may become unavailable. Connectivity issues, acts of nature.
![Page 10: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/10.jpg)
Node Failures
Use fault-handling frameworks that recognize transient errors:
• CloudFX
• P+P TFH
Appropriate retry and backoff policies
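A fault-handling wrapper of the kind this slide recommends can be sketched as follows. This is a minimal Python illustration, not the CloudFX or TFH API: `TransientError`, `call_with_retry` and its parameters are hypothetical names, and the exponential backoff with jitter is one reasonable policy among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults expected to clear on their own."""


def call_with_retry(operation, max_retries=3, base_delay=0.05, sleep=time.sleep):
    """Run operation(); retry only on TransientError, backing off each time."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            # Exponential backoff with jitter so that many callers
            # do not retry in lockstep against a struggling service.
            delay = base_delay * (2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

Any exception other than `TransientError` propagates immediately, which keeps enduring faults from being retried.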
![Page 11: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/11.jpg)
Don’t do this – why?
![Page 12: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/12.jpg)
Sample Retry Policies

| Platform | Context | Sample target e2e latency (max) | “Fast First” | Retry count | Delay | Backoff |
|---|---|---|---|---|---|---|
| SQL Database | Synchronous (e.g. render web page) | 200 ms | Yes | 3 | 50 ms | Linear |
| SQL Database | Asynchronous (e.g. process queue item) | 60 seconds | No | 4 | 5 s | Exponential |
| Azure Cache | Synchronous (e.g. render web page) | 100 ms | Yes | 3 | 10 ms | Linear |
| Azure Cache | Asynchronous (e.g. process queue item) | 500 ms | Yes | 3 | 100 ms | Exponential |
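The sample policies above can be held as data so that each call site looks up its policy instead of hard-coding delays. The sketch below is a hedged illustration: `RetryPolicy` and `POLICIES` are hypothetical names, and the way the delays are derived is an assumption, since the slide does not define it. Here "Fast First" is read as an immediate first retry, "Linear" as the delay growing by one base step per retry, and "Exponential" as doubling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_latency_ms: int  # sample target for end-to-end latency
    fast_first: bool     # assumed: first retry fires without waiting
    retry_count: int
    delay_ms: int        # base delay
    backoff: str         # "linear" or "exponential"

    def delays_ms(self):
        """Wait before each retry, under the interpretation assumed above."""
        if self.retry_count == 0:
            return []
        waits = self.retry_count - 1 if self.fast_first else self.retry_count
        steps = range(1, waits + 1)
        if self.backoff == "linear":
            schedule = [self.delay_ms * n for n in steps]
        else:
            schedule = [self.delay_ms * 2 ** (n - 1) for n in steps]
        return ([0] if self.fast_first else []) + schedule


# Values copied from the sample table; the keys are illustrative.
POLICIES = {
    ("SQL Database", "sync"): RetryPolicy(200, True, 3, 50, "linear"),
    ("SQL Database", "async"): RetryPolicy(60_000, False, 4, 5_000, "exponential"),
    ("Azure Cache", "sync"): RetryPolicy(100, True, 3, 10, "linear"),
    ("Azure Cache", "async"): RetryPolicy(500, True, 3, 100, "exponential"),
}
```

Whatever the exact schedule, the target latency is best treated as an overall deadline for the call and all of its retries.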
![Page 13: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/13.jpg)
Decoupling Components
At some point, your request is blocking the line.
Fail gracefully, and get out of the queue!
Too much retry, too much trust of downstream service.
[Chart: Web Request Response Latency; series: Avg Latency, Response latency]
![Page 14: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/14.jpg)
Decoupling Components
• Leverage asynchronous I/O. Beware: not all apparently async calls are “purely” async.
• Ensure that all external service calls are bounded. Bound the overall call latency (including retries); beware of thread pool pressure.
• Beware of convoy effects on failure recovery. Trying too hard to catch up can flood newly recovered services.
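Bounding the overall latency of an external call, retries included, might look like the sketch below. It uses Python's `asyncio` purely as an illustration; `bounded_call` and its parameters are hypothetical, and treating `ConnectionError` as the retryable fault is an assumption.

```python
import asyncio


async def bounded_call(make_call, overall_timeout_s, retries=2, retry_delay_s=0.01):
    """Await make_call() with retries, all inside one overall deadline."""

    async def attempts():
        for attempt in range(retries + 1):
            try:
                return await make_call()
            except ConnectionError:
                if attempt == retries:
                    raise
                await asyncio.sleep(retry_delay_s)

    # wait_for cancels the whole retry sequence once the deadline passes,
    # so a slow dependency cannot hold the caller indefinitely.
    return await asyncio.wait_for(attempts(), timeout=overall_timeout_s)
```

The deadline wraps the retry loop rather than each attempt, so adding retries never extends the worst-case latency seen by the caller.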
![Page 15: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/15.jpg)
Service Level Failures
Entire services will have outages
• SQL Azure, Windows Azure Storage – SLA < 100%
• External services may be unavailable or unreachable
Application needs to work around these
• Return fail code to user (please try again later)
• Queue and try later (we’ve received your order…)
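The "queue and try later" option can be sketched as below. This is an illustrative Python outline under stated assumptions: `ServiceUnavailable`, `submit_order` and `drain` are hypothetical names, and the in-memory `deque` stands in for a durable queue.

```python
from collections import deque


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a dependency is down."""


pending_orders = deque()  # stand-in for a durable queue


def submit_order(order, process):
    """Try to process now; if the dependency is down, queue for later."""
    try:
        process(order)
        return "completed"
    except ServiceUnavailable:
        pending_orders.append(order)
        return "accepted"  # the user sees "we've received your order"


def drain(process, max_items=10):
    """Replay queued orders in small batches once the service is back."""
    done = 0
    while pending_orders and done < max_items:
        try:
            process(pending_orders[0])
        except ServiceUnavailable:
            break  # still down: leave the order queued
        pending_orders.popleft()
        done += 1
    return done
```

Capping each drain pass with `max_items` limits how hard the backlog hits a service that has only just recovered.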
![Page 16: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/16.jpg)
Region Level Failure
Regional failure will occur
Load needs to be spread over multiple regions
Route around failures
![Page 17: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/17.jpg)
Digimarc
• 8 datacentres
• Digital Watermarks
• Mobile Integration
![Page 18: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/18.jpg)
Example Distribution with Traffic Manager
Global load does not necessarily give uniform distribution
![Page 19: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/19.jpg)
Information publishing
• Hosted service(s) per data centre
• Each service is autonomous: services independently receive or pull data from source
• Azure Traffic Manager can direct traffic to “nearest” service
• Use probing to determine service health*
[Diagram: Source Data and Azure Traffic Manager in front of Region 1, Region 2 and Region 3; each region contains a Web Role, Worker Role, Cache Role, DB and Azure Storage]
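The routing decision described here, sending traffic to the "nearest" service that passes its health probe, can be sketched in a few lines. This is a conceptual Python illustration of the decision only, not the Traffic Manager API; `pick_region` and its inputs are hypothetical.

```python
def pick_region(latency_ms, healthy):
    """Return the lowest-latency region whose health probe passed, else None.

    latency_ms: region name -> measured latency from the caller
    healthy:    region name -> result of the most recent health probe
    """
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])
```

For example, with latencies of 20, 80 and 150 ms for Region 1, 2 and 3 and a failing probe on Region 1, the function returns Region 2.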
![Page 20: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/20.jpg)
Service Insight
Deep and detailed data needed for management, monitoring, alerting and failure diagnosis
Capture, transport, storage and analysis of this data requires careful design
![Page 21: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/21.jpg)
Characterizing Insight
![Page 22: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/22.jpg)
Build and Buy (or rent)
No “one size fits all” for all perspectives at scale
• Near real-time monitoring & alerting, deep diagnostics, long term trending
Mix of platform components and services
• Windows Azure Diagnostics, application logging, Azure portal, 3rd party services
![Page 23: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/23.jpg)
New Relic
• Free/$24/$149 pricing model (per month, per server)
• Agent installation on server (role instance)
• Hooks application via Profiling API
![Page 24: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/24.jpg)
App Dynamics
• Free -> $979.00 (6 agents)
• Agent based, hooking profiling API
• Cross-instance correlation
![Page 25: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/25.jpg)
OpsTera
• Leverages Windows Azure Diagnostics (WAD) data
• Graphing, alerts, auto-scaling
![Page 26: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/26.jpg)
PagerDuty
• On-call scheduling, alerting and incident management
• $9/$18 per user per month
• Integration with monitoring tools (e.g. New Relic, others), HTTP API, email
![Page 27: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/27.jpg)
Windows Azure Diagnostics (WAD)
• Azure platform service (agent) for collection and distribution of telemetry
• Standard structured storage formats (perf counters, events)
• Code or XML driven configuration
• Partially dynamic (post updated file to blob store)
![Page 28: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/28.jpg)
Windows Azure Diagnostics (WAD)
Data source -> storage target:
• Perf Counters -> WAD Performance Counters Table
• Windows Events -> WAD Windows Events Logs Table
• Diag Events -> WAD Logs Table
• IIS Log Files, Failed Logs -> Wad-iis-log files, Wad-iis-failed log files
• Crash Dumps -> Wad-crash-dumps
![Page 29: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/29.jpg)
Limitations of Default Configuration
![Page 30: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/30.jpg)
Understanding Azure Table Store
• Azure table storage is the target for performance counter and application log data
• General maximum throughput is 1000 entities / partition / table
• Performance counters:
  • Part of the timestamp is used as the partition key (limits the number of concurrent entity writes)
  • Each partition key is 60 seconds wide, and entities are written asynchronously in bulk
  • The more entities in a partition (i.e. the number of performance counter entries * the number of role instances), the slower the queries
• Impact: to maintain acceptable read performance, large-scale sites may need to:
  • Increase the performance counter collection period (1 minute -> 5 minutes)
  • Decrease the number of log records written into the activity table (by increasing the filtering level: WARN or ERROR, no INFO)
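The effect of time-bucketed partition keys can be illustrated with a little arithmetic. The sketch below is an assumption-laden Python illustration: `partition_bucket` uses epoch seconds rather than the key format WAD actually writes, and the counter and instance counts in the usage note are made up.

```python
from datetime import datetime, timezone


def partition_bucket(ts: datetime, width_s: int = 60) -> int:
    """Start of the width_s-second window containing ts, in epoch seconds.

    All samples taken in the same window share this value, so they would
    land in the same partition.
    """
    epoch = int(ts.timestamp())
    return epoch - epoch % width_s


def entities_in_range(counters: int, instances: int, sample_period_s: int,
                      range_s: int = 3600) -> int:
    """Entities a query over range_s seconds has to read."""
    samples_per_source = range_s // sample_period_s
    return counters * instances * samples_per_source
```

With 20 counters on 100 role instances, a one-hour query reads 120,000 entities at a 1-minute collection period and 24,000 at a 5-minute period, which is why lengthening the period helps read performance.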
![Page 31: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/31.jpg)
Managing the Deluge
Per-application server data sources:
• IIS logs
• Application logs
• Performance counters
High value data: filter, aggregate, publish
High value data consumer:
• Generate alerts
• Display dashboard
• Operational intelligence
High volume data: batch, partition, archive
High volume data consumer:
• Data mining / analysis
• Historical trends
• Root cause analysis
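The two paths on this slide can be sketched as one function that produces a small aggregate for the high value consumer and raw batches for the high volume consumer. This is a hypothetical Python outline: `split_telemetry`, the sample shape and the use of a plain average as the aggregate are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def split_telemetry(samples, batch_size=100):
    """Split raw samples into a high value summary and high volume batches.

    samples: iterable of (counter_name, value) pairs
    Returns (summary, batches):
      summary - per-counter average, small enough to publish for alerting
      batches - the raw samples in fixed-size chunks, ready to archive
    """
    samples = list(samples)
    by_name = defaultdict(list)
    for name, value in samples:
        by_name[name].append(value)
    summary = {name: mean(values) for name, values in by_name.items()}
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    return summary, batches
```

The summary stays cheap to query for dashboards, while the batches keep the full detail available for later analysis.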
![Page 32: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/32.jpg)
Extending the Experience
• Add high-bandwidth (chunky) logging and telemetry channels for verbose data logging
• Capture tracing via core System.Diagnostics (or log4net, NLog, etc.) with:
  • WARN/ERROR -> Table storage
  • VERBOSE/INFO -> Blob storage
• Run-time configurable logging channels to enable selective verbose logging to table (e.g. just log database information)
• Leverage the features of the core Diagnostic Monitor
  • Use custom directory monitoring to copy files to blob storage
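Routing log records to different channels by severity can be sketched with Python's standard `logging` module, used here only as an analogue of the System.Diagnostics, log4net or NLog setups named on the slide. `ListHandler` is a hypothetical in-memory sink standing in for table and blob storage, and `build_logger` is an illustrative name.

```python
import logging


class MaxLevelFilter(logging.Filter):
    """Pass only records below a given level."""

    def __init__(self, below):
        super().__init__()
        self.below = below

    def filter(self, record):
        return record.levelno < self.below


class ListHandler(logging.Handler):
    """Hypothetical sink that keeps messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_logger(name="app"):
    table_sink, blob_sink = ListHandler(), ListHandler()
    table_sink.setLevel(logging.WARNING)                  # WARN/ERROR channel
    blob_sink.addFilter(MaxLevelFilter(logging.WARNING))  # VERBOSE/INFO channel
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    logger.addHandler(table_sink)
    logger.addHandler(blob_sink)
    return logger, table_sink, blob_sink
```

Each record goes to exactly one sink, so verbose output never adds load to the low-volume channel.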
![Page 33: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/33.jpg)
Extending Diagnostics
Data source -> storage target:
• Perf Counters -> WAD Performance Counters Table
• Windows Events -> WAD Windows Events Logs Table
• Diag Events -> WAD Logs Table
• IIS Log Files, Failed Logs -> Wad-iis-log files, Wad-iis-failed log files
• Crash Dumps -> Wad-crash-dumps
• Verbose Perf Ctrs -> Verbose Perfcounter logs
• Verbose Events -> Verbose Event logs
![Page 34: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/34.jpg)
Logging and Retry with CloudFX
• Handling transient failures
• Logging transient failures
• Logging all external API calls with timing
• Logging full exception (not .ToString())
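Logging every external call with its timing, and the full exception on failure, can be sketched as a small wrapper. This Python example is an analogue rather than CloudFX code: `timed_call` and the `external` logger name are hypothetical.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("external")


@contextmanager
def timed_call(name):
    """Log how long an external call took; on failure, log the traceback."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # log.exception attaches the full traceback to the record,
        # not just the exception's message string.
        log.exception("%s failed after %.1f ms", name, elapsed_ms)
        raise
    else:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("%s succeeded in %.1f ms", name, elapsed_ms)
```

A call site wraps the dependency call in `with timed_call("sql.query"):`, and the exception still propagates after being logged so that a retry policy can act on it.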
![Page 35: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/35.jpg)
Demo: Multiple Logging Channels using NLog and WAD
![Page 36: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/36.jpg)
Logging Configuration
• Traditional .NET log configuration (System.Diagnostics) is hard coded against System.Configuration (app.config/web.config)
• Anti-pattern for Azure deployment
• Leverage external configuration store (e.g. Service Configuration or blob storage) for run-time dynamic configuration
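Run-time dynamic configuration amounts to re-reading log levels from a store outside the deployed package and applying them without a redeploy. The sketch below shows the apply step in Python; `apply_log_levels` and the JSON shape are assumptions, and fetching the document from a configuration store is left out.

```python
import json
import logging


def apply_log_levels(config_text):
    """Apply {"logger name": "LEVEL"} pairs from an external JSON document.

    Raises ValueError if a level name is not recognised.
    """
    levels = json.loads(config_text)
    for logger_name, level_name in levels.items():
        logging.getLogger(logger_name).setLevel(level_name.upper())
```

Calling this whenever the stored document changes lets an operator turn on verbose logging for a single component, such as `{"app.db": "debug"}`, and turn it off again later.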
![Page 37: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/37.jpg)
Recap and Resources
Building big:
• The Availability Challenge
• Design for Failure
• Get Insight into Everything
Resources:
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services
• TODO: failsafe doc link
![Page 38: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/38.jpg)
Resources
• Follow us on Twitter @WindowsAzure
• Get Started: www.windowsazure.com/build
Please submit session evals on the Build Windows 8 App or at http://aka.ms/BuildSessions
![Page 39: Building Big: Lessons learned from Windows Azure customers – Part Two](https://reader036.vdocuments.mx/reader036/viewer/2022081502/56815fce550346895dcecf21/html5/thumbnails/39.jpg)
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.