Windows Azure Conference 2014
Emil Velinov, Senior Program Manager, AzureCAT, Microsoft
Designing Applications for Scalability
Session Objective(s):
• Understand the scale limits and contention points of core Azure services and resources
• Understand patterns and best practices for architecting for scale and availability
Azure architecture is based on scale-out; composing multiple small-scale units to build large distributed systems
Understand contention points, scalability asymptotes and composability
NOTE: This is an architectural session; No code or demos!
Session Objectives And Takeaways
Every Azure service or infrastructure component has finite resources
Scaling Azure services:
How much work can I get from each unit (density)?
How many can I compose together (scale)?
How many are available (scale limits)?
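The three questions above amount to a small capacity calculation. The sketch below is only an illustration: the function names and the demand, density and limit figures are assumptions chosen for the example, not measurements of any particular service.

```python
import math

def units_needed(demand_per_sec: float, density_per_unit: float) -> int:
    """Scale: how many units must be composed to serve the demand."""
    return math.ceil(demand_per_sec / density_per_unit)

def fits_within_limit(units: int, scale_limit: int) -> bool:
    """Scale limits: are that many units actually available?"""
    return units <= scale_limit

# Illustrative figures (assumptions): 70,000 operations/sec of demand,
# 20,000 operations/sec per unit (density), at most 5 units available.
needed = units_needed(70_000, 20_000)
print(needed, fits_within_limit(needed, 5))
```

If the required unit count exceeds the limit, the design has to change (for example by composing units from a second subscription or service) rather than simply adding more of the same.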
Azure Application Architecture
[Diagram: Azure application architecture]
Subscriptions: Subscription A, Subscription B, Subscription C, ...
Azure Data Center: Cloud Services, each behind an Azure Load Balancer and made up of Roles, each with multiple Role Instances
Azure Platform Services: Storage Accounts (Blobs, Queues, Tables), SQL DB (Logical Servers, each hosting multiple DBs), Azure Service Bus, Access Control Service, Infrastructure as a Service (VMs), Media Services, Content Delivery Network
Azure Platform Infrastructure: Machines, Storage, Bandwidth, Electricity / Cooling
Azure Compute
The foundational container of Azure Compute is a Cloud Service:
Each Cloud service is composed of multiple roles (up to 25)
Each role, either Worker or Web, consists of a number of instances.
Each instance is a VM, running an immutable package of software
Role instances within the same service can communicate directly (using private IP addresses)
Services expose endpoints (up to 25)
Azure Compute – Cloud Service
[Diagram: a Cloud Service behind the Azure Load Balancer and a Network Address Translator (Router), containing Roles, each with multiple Role Instances]
Sizing
Virtual machine size | CPU cores | Memory  | OS disk space (cloud services) | OS disk space (virtual machines) | Max. data disks (1 TB each) | Max. IOPS (500 per disk)
ExtraSmall (A0)      | Shared    | 768 MB  | 19 GB                          | 20 GB                            | 1                           | 1x500
Small (A1)           | 1         | 1.75 GB | 224 GB                         | 70 GB                            | 2                           | 2x500
Medium (A2)          | 2         | 3.5 GB  | 489 GB                         | 135 GB                           | 4                           | 4x500
Large (A3)           | 4         | 7 GB    | 999 GB                         | 285 GB                           | 8                           | 8x500
ExtraLarge (A4)      | 8         | 14 GB   | 2,039 GB                       | 605 GB                           | 16                          | 16x500
A5                   | 2         | 14 GB   | 489 GB                         | 135 GB                           | 4                           | 4x500
A6                   | 4         | 28 GB   | 999 GB                         | 285 GB                           | 8                           | 8x500
A7                   | 8         | 56 GB   | 2,039 GB                       | 605 GB                           | 16                          | 16x500
In general, design for a larger number of smaller instances
• Aligning capacity and demand: using smaller instances = smaller steps for increasing capacity
• Instance placement & fragmentation
Small is the smallest size recommended for production workloads.
Select a VM with 4 or 8 CPU cores when using SQL Server Enterprise Edition.
Cloud Services require more disk space than a VM due to system requirements. For the Web and Worker roles, the system reserves 4 GB of space for the Windows pagefile and 2 GB of space for the Windows dumpfile.
Extra large and Ax instances are a good choice for large memory unit workloads
• Example: caching
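The "smaller instances = smaller steps" point can be made concrete with a little arithmetic. This is a sketch under an assumed demand of 9 CPU cores; the `provisioned` helper is hypothetical and the core counts are the ones in the sizing table.

```python
import math
from typing import Tuple

def provisioned(demand_cores: int, cores_per_instance: int) -> Tuple[int, int]:
    """Return (instance count, idle cores) when demand is met with one instance size."""
    count = math.ceil(demand_cores / cores_per_instance)
    idle = count * cores_per_instance - demand_cores
    return count, idle

# Assumed demand: 9 cores. Larger instances overshoot by larger amounts.
for name, cores in [("Small (A1)", 1), ("Medium (A2)", 2),
                    ("Large (A3)", 4), ("ExtraLarge (A4)", 8)]:
    count, idle = provisioned(9, cores)
    print(f"{name}: {count} instances, {idle} idle cores")
```

With these assumed numbers, Small instances leave no idle cores while ExtraLarge instances leave seven, which is the capacity/demand misalignment the slide warns about.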
Azure Compute – Instance Sizing
Azure Storage
Abstractions: Blob, Table, Queue
Windows Azure Storage Stamps
[Diagram: Windows Azure Storage stamps]
Access blob storage via the URL: http://<account>.blob.core.windows.net/
Components shown: Storage Location Service; two storage stamps, each with a load balancer (LB) for data access, Front-Ends, a Partition Layer and a Stream Layer; intra-stamp replication within each stamp; inter-stamp (geo) replication between stamps
Data written to a storage account is auto-replicated to three storage nodes (across three fault/upgrade domains)
Geo-failover happens automatically (transparent DNS redirect)
SLA: 99.9% for processing valid requests and for gateway connectivity
Access controlled via primary / secondary storage keys
Azure Storage Account
Storage capacity: 200 TB
Operations (transactions): 20,000 / sec.
Bandwidth (Geo Redundant): Ingress 5 Gibps, Egress 10 Gibps
Bandwidth (Locally Redundant): Ingress 10 Gibps, Egress 15 Gibps
Availability: 99.9%
Geo-DR: Automatic
Throttling response: 503 (server busy)
Storage accounts / subscription: 5 (default)
Azure Storage Account - Blob
Max blob size (block): 200 GB (50k blocks)
Max block size: 4 MB
Max blob size (page): 1 TB
Max page size: 512 bytes
Max bandwidth / blob: 480 Mbps
Latency bounds (per operation): 100 ms nominal; 1-3 sec during load balancing
Scale-out unit: Container + Blob
Scale-out impedance: Low
• Use the appropriate blob type
  • Prefer block blobs with immutable / append-only data
• Use the largest practical block size
  • Note: network performance may require smaller blocks for “long-haul”
• Implement retry logic with back-off for 503 (service unavailable) errors
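Retry with back-off, which recurs throughout this deck, might look like the following minimal sketch. `ServerBusy` is a hypothetical stand-in for however your storage client surfaces a 503; the delay values and the use of random jitter are assumptions, not platform guidance.

```python
import random
import time

class ServerBusy(Exception):
    """Hypothetical stand-in for an HTTP 503 (server busy) response."""

def with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation(); on ServerBusy wait and retry, doubling the delay ceiling each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServerBusy:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # A random delay below the ceiling keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

A caller would wrap each storage call, for example `with_backoff(lambda: client.put(...))`, where `client.put` is whatever operation may be throttled. The `sleep` parameter exists so the wait can be replaced in tests.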
Blobs - Example
Customer has 30k+ locations, periodically uploading usage data to Azure storage
• Parallel upload - multiple concurrent upload threads to maximize throughput
• Geo-distance. Leverage “closer” storage accounts for globally distributed clients.
Scale
• Simultaneous upload. Consider use of multiple storage accounts if concurrent spikes expected.
• Use round-robin, etc, mechanism from central service to route destination
Availability
• Transient connection. Handle reconnect for individual block sets on upload
• Disaster Recovery. Azure storage handles geo-replication on successful file upload
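The parallel-upload idea can be sketched as splitting the data into blocks, uploading them on several threads, then committing the ordered block list. `put_block` and `commit` are hypothetical callables standing in for a real storage client; only the 4 MB block size comes from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, the maximum block size quoted earlier

def split_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield (block_index, chunk) pairs covering the data in order."""
    for index, offset in enumerate(range(0, len(data), block_size)):
        yield index, data[offset:offset + block_size]

def upload_parallel(data, put_block, commit, workers=4, block_size=BLOCK_SIZE):
    """Upload blocks concurrently, then commit the block ids in their original order.

    put_block(block_id, chunk) and commit(block_ids) are assumed to be supplied
    by the caller; they are not part of any specific SDK.
    """
    blocks = list(split_blocks(data, block_size))
    ids = [f"{index:06d}" for index, _ in blocks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(put_block, block_id, chunk)
                   for block_id, (_, chunk) in zip(ids, blocks)]
        for future in futures:
            future.result()  # re-raises the first upload failure, if any
    commit(ids)
    return ids
```

Because each block is uploaded independently, a transient failure only requires re-sending the affected block rather than the whole file, which matches the "individual block sets" bullet.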
Azure Storage Account - Table
Max operations / second per partition: 2,000
Max row size (names + data): 1 MB
Max column size (byte[] or string): 64 KB
Maximum number of rows: N/A (up to storage account size limit)
Scale-out unit: Table + Partition
Scale-out impedance: Low
• Use appropriate partition keys to co-locate data
• Use appropriate partition keys to break data up into more partitions
• Implement retry logic with back-off for 503 (service unavailable) errors
• Avoid use of table storage for applications requiring non-trivial aggregation or function projection
• Leverage multiple storage accounts (not multiple tables) to increase operations/second
Table - Example
Web application storing audit trail record information in table storage
• Partition selection. Choose natural/distributed partition strategy (audit data -> time slice).
• Chunky vs. chatty. Use entity group transactions where possible for multiple records / update.
Scale
• Partition selection. Choose appropriate time slice for partition key (max > 1k / partition).
• Use multiple tables (in multiple storage accounts) for scale-out. Leverage mod based (instance MOD count) or lookup-based service for routing.
Availability
• Transient connection. Handle reconnect for operations, with appropriate back-off (exponential) to avoid convoy effect
• Disaster Recovery. Azure storage handles geo-replication on successful commit
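For the audit-trail example, a time-slice partition key and MOD-based routing could be sketched as follows. The account names, the five-minute slice and the key format are assumptions made for illustration.

```python
from datetime import datetime

ACCOUNTS = ["audit0", "audit1", "audit2"]  # hypothetical storage account names

def partition_key(timestamp: datetime, slice_minutes: int = 5) -> str:
    """Round the timestamp down to its time slice; the format sorts chronologically."""
    minute = timestamp.minute - timestamp.minute % slice_minutes
    start = timestamp.replace(minute=minute, second=0, microsecond=0)
    return start.strftime("%Y%m%d%H%M")

def account_for(instance_index: int, accounts=ACCOUNTS) -> str:
    """MOD-based routing: role instance index modulo the number of accounts."""
    return accounts[instance_index % len(accounts)]

print(partition_key(datetime(2014, 3, 1, 10, 7, 42)), account_for(4))
```

A shorter slice spreads writes over more partitions; a longer one keeps related records together for entity group transactions. The right value depends on the write rate per partition.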
Azure Storage Account - Queues
Max messages in a queue: N/A (up to storage account size limit)
Max lifetime of a message: 1 week (auto purged)
Max message size: 64 KB
Max throughput: 2,000 messages / second
Scale-out unit: Queue
Scale-out impedance: Medium
• Optimize storage format to reduce message size / avoid 64KB limit
• For larger messages leverage Service Bus or Queues + Blob
• Leverage multiple queues to increase msgs/sec
• Vertical partitioning: split queues by function
• Horizontal partitioning: split messages between queues (round robin/direct assignment)
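The "Queues + Blob" suggestion for large messages can be sketched as follows: small payloads travel inline, large ones are written to blob storage and only a reference is queued. The list and dict below are in-memory stand-ins for a queue and a blob container, and the envelope format is an assumption.

```python
import base64
import json
import uuid

MAX_MESSAGE_BYTES = 64 * 1024  # queue message size limit quoted above

def enqueue(payload: bytes, queue: list, blobs: dict) -> None:
    """Queue the payload inline if the encoded message fits, otherwise by reference."""
    inline = json.dumps({"kind": "inline",
                         "body": base64.b64encode(payload).decode("ascii")})
    if len(inline.encode("utf-8")) <= MAX_MESSAGE_BYTES:
        queue.append(inline)
        return
    blob_name = str(uuid.uuid4())
    blobs[blob_name] = payload  # write the blob first so the reference never dangles
    queue.append(json.dumps({"kind": "blob", "name": blob_name}))

def dequeue(queue: list, blobs: dict) -> bytes:
    """Return the payload of the oldest message, fetching it from blobs if needed."""
    message = json.loads(queue.pop(0))
    if message["kind"] == "inline":
        return base64.b64decode(message["body"])
    return blobs.pop(message["name"])
```

Note that the size check is applied to the encoded message, not the raw payload, because base64 and the envelope add overhead.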
Queues - Example
Azure storage queues used to distribute messages (work items) to multiple worker roles
• Pipeline model. Single queue polling thread dispatching work to multiple worker agents (separate I/O and CPU bound work)
Scale
• Use multiple queues (possibly in multiple storage accounts) for scale-out. Leverage mod based (instance MOD count) or lookup-based service for routing.
Availability
• Transient connection. Handle reconnect and throttling for operations, with appropriate back-off (exponential) to avoid convoy effect
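The pipeline model (one polling thread, many workers) might be sketched like this. The in-process `queue.Queue` stands in for an Azure storage queue, and using `None` as a shutdown signal is an assumption of this sketch.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(source: queue.Queue, handle, workers: int = 4):
    """One polling thread drains `source` and hands each item to a worker pool."""
    results = []
    lock = threading.Lock()

    def work(item):
        outcome = handle(item)
        with lock:  # results is shared between worker threads
            results.append(outcome)

    def poll(pool):
        while True:
            item = source.get()      # only this thread talks to the queue
            if item is None:         # shutdown signal (assumption of this sketch)
                break
            pool.submit(work, item)  # processing happens off the polling thread

    with ThreadPoolExecutor(max_workers=workers) as pool:
        poller = threading.Thread(target=poll, args=(pool,))
        poller.start()
        poller.join()
    # Leaving the `with` block waits for all submitted work to finish.
    return results
```

Keeping a single poller limits the number of queue transactions, while the pool size can be tuned separately to the cost of handling each item.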
Azure SQL Database
SQL Databases operate in a multi-tenant shared environment
• Other users can consume key resources (worker threads, transaction log, IOPS) and throttle your database
• Run time performance may have a level of unpredictability, especially for high-load systems using large individual databases
For larger applications, transaction throughput is the bottleneck, not size!
SQL Database
To compose multiple databases together (more size + I/O):
• Manually shard databases (handle all aspects in the application layer)
• “Unified” shard management
• Connection routing
• Range-based partitioning
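Range-based partitioning with application-layer routing can be sketched as a sorted list of boundaries searched with `bisect`. The class, the boundary values and the database names are hypothetical.

```python
import bisect

class RangeShardMap:
    """Maps an integer key to the database that owns the range it falls in."""

    def __init__(self, boundaries, databases):
        # boundaries[i] is the exclusive upper bound of databases[i];
        # the last database takes every key at or above the last boundary.
        if len(databases) != len(boundaries) + 1:
            raise ValueError("need exactly one more database than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted")
        self._boundaries = list(boundaries)
        self._databases = list(databases)

    def database_for(self, key: int) -> str:
        return self._databases[bisect.bisect_right(self._boundaries, key)]

shards = RangeShardMap([1000, 2000], ["db0", "db1", "db2"])
print(shards.database_for(999), shards.database_for(1000), shards.database_for(5000))
```

The application would look up the database for a key and open its connection there; splitting a hot range means inserting a new boundary and moving the affected rows.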
SQL Database – Scale out
SQL Database - Example
Store new user registration information for a web application
• Standard “SQL” optimization
• Minimize “chatty” interaction, minimize SQL batches (leverage table valued functions, etc)
Scale
• Manually compose multiple databases for additional size and I/O (max 150 GB size / DB; the I/O limit is generally hit much sooner)
• Handle partitioning / routing at the application layer
Availability
• Transient connections. Handle reconnect and throttling for operations, with appropriate back-off (exponential) to avoid convoy effect
• Automatic replicas/fail-over within data center. No OOB geo-DR. Backup into blob storage for low-RTO DR
Azure Distributed Cache
NOTE: _Not_ Azure Cache Service Preview
For low-latency / non-durable data or heavy read / light write workloads
Provides distributed access to key/value dictionary (k->{ byte[] })
Run Azure caching as worker role in Cloud service
HA configuration (multiple copies)
Cache
Maximum VMs: 32 XL
Average operations / second / node (XL): 45,000
Latency: 1.2 ms (server); sub-ms (local)
Max object size: 1 MB
Maximum cache size: Available memory
Object durability: None
Data Categorization
Activity
• Read-Write, User specific (no concurrent access)
• Examples: Shopping cart content, survey response

Resource
• Read-Write, Shared between users (concurrent access)
• Examples: Number of units in stock, online survey results

• “Read-Heavy”, Shared between users (concurrent access)
• Examples: Product description, customer profile data
Categorize and evaluate your data for caching
Choose the right topology for your workloads
Cache Topologies
Co-Located
• Competes for memory, CPU and network
• Scales with your app
• Use for ASP.NET Session State, ASP.NET Page Output Caching
Dedicated
• Dedicated resources for predictability, isolation and scale
• Add a new worker role dedicated for caching
• Scale cache independently of your app
Choose the right strategies for populating cache
Cache Population
Pre-Populate (All)
• Use worker roles to populate all cache
• For example: weather, stocks, transportation and reference data
Pre-Populate (Partial)
• Use worker roles to populate some data
• For example: configuration service where 80% of requests resolve the same data
• Good for mix & match with On-Demand
On-Demand
• Build as you go
• Use async patterns i.e. use background thread rather than IO
Capacity Planning
Throughput
High Availability
• Secondary copies require more memory
• Number of Reads/sec
• Number of Writes/sec
• Average Object Size (Post-Serialization)
• Maximum Number of Objects
• High Availability Enabled
• Dedicated
• Co-Located
Understand implications of caching.
Cache Summary
Scale
• Leverage multiple cache instances to increase available cache size and operations / second (no client code changes required)
• Optimize use of cache / minimize cache misses for read operations
• Optimize transcoder/serialization efficiency (no DataContract, use Protobuf) to reduce wire and memory size in cache
Availability
• Low availability; cache is flushed during application deployment (i.e. on role reboot). Code paths should always fallback to pull from durable store
• WARNING: Plan for total cache loss, spike load on store (i.e. SQL, storage)
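The advice that code paths should always fall back to the durable store can be sketched as a read-through helper. `InMemoryCache` and `CacheUnavailable` are hypothetical stand-ins, not the Azure caching client API.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""

class InMemoryCache:
    """Stand-in for a distributed cache client; flush() simulates total cache loss."""
    def __init__(self):
        self._items = {}
    def get(self, key):
        return self._items.get(key)
    def put(self, key, value):
        self._items[key] = value
    def flush(self):
        self._items.clear()

def read_through(key, cache, load_from_store):
    """Serve from cache when possible; otherwise load from the durable store."""
    try:
        cached = cache.get(key)
    except CacheUnavailable:
        cached = None  # treat an unreachable cache as a miss
    if cached is not None:
        return cached
    value = load_from_store(key)
    try:
        cache.put(key, value)
    except CacheUnavailable:
        pass  # the request still succeeds without the cache
    return value
```

After a flush every request becomes a miss at once, so the durable store needs enough headroom (or throttling in front of it) to absorb that spike.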
Go Big!
Keep the lights on.
Explore the evolution of a “typical” web app
• Quantify the load and anticipated growth profile
• Migrate a “stock” on-premise web application to Azure (lift and shift)
• Identify key areas of concern (density, composability & availability)
Optimize a Cloud service
• Optimize individual services and components
• Address common chokepoints and bottlenecks
Design for multiple Cloud services / data centers
• Design for large scale and availability
Designing for Scale and Availability
Standard “on-premise” web application
3-tier architecture behind a load balancer
Pure stateless web/app tiers
Scale-up SQL database
Active/passive cluster for resiliency
Starting point – classic web application
[Diagram: a Load Balancer in front of Web Servers and App Servers, backed by a SQL database with a Passive Node]
Classic lift-and-shift directly into Azure (1:1 architectural mapping)
Lift and shift into Azure Platform-as-a-Service
[Diagram: an Azure Cloud Service with an External Load Balanced Endpoint in front of a Web Role (four instances) and a Worker Role (three instances), backed by SQL Azure. Load-balancing: queue-based or ACL'ed LB endpoint]
Good
• Simplified scaling (add web/worker instances).
• Automatic round-robin load balancing for incoming requests
• Automatic resiliency of SQL Azure
• Smaller scale unit for SQL Azure (can’t throw hardware at the problem any more); throttling limits
• Need to explicitly account for transient connection failures (load balancer / gateway connection to SQL Azure)
Scalability contention points and bottlenecks
• Can easily scale web and worker instances, but how efficiently can they communicate?
• Throughput and throttling against single SQL Azure database
• Database throttling will bring down the site (machines are available, data is not)
Availability failure points
• No single point of component failure (multiple web/worker instances, spread across multiple fault/upgrade domains)
• A SQL Database DB will automatically fail over to replicas
• Database throttling will bring down the site (machines are available, data is not)
Quantifying Scale and Availability
Optimizations at the Cloud Service level
Use ASP.NET Web API for REST services
• More efficient (dense) than WCF, simpler code paths
Use more efficient JSON or binary serializers
• Transcoding (serialization/deserialization) hugely impactful for throughput and latency
Use network friendly data transfer objects (DTOs)
• Do not mix data/state and logic in your data classes, avoid triggers/dependencies
• Avoid cyclical dependencies (e.g. if you can’t easily serialize the class, you have a bad design)
Use asynchronous communication
• Minimize blocking IIS dispatch pipeline
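The effect of serializer choice on wire size can be illustrated with a plain DTO encoded two ways. The `Reading` type and its fields are hypothetical, and the fixed binary layout is just one possible compact encoding, not a recommendation of a specific format.

```python
import json
import struct
from dataclasses import dataclass, asdict

@dataclass
class Reading:
    """A plain DTO: data only, no behaviour, no references to other objects."""
    device_id: int
    timestamp: int
    value: float

# Little-endian uint32, uint64, float64 with no padding: 20 bytes per record.
_LAYOUT = struct.Struct("<IQd")

def to_json(reading: Reading) -> bytes:
    return json.dumps(asdict(reading)).encode("utf-8")

def to_binary(reading: Reading) -> bytes:
    return _LAYOUT.pack(reading.device_id, reading.timestamp, reading.value)

def from_binary(payload: bytes) -> Reading:
    return Reading(*_LAYOUT.unpack(payload))

sample = Reading(device_id=42, timestamp=1393670400, value=21.5)
print(len(to_json(sample)), len(to_binary(sample)))
```

The binary form is smaller because field names are not repeated in every message; the trade-off is that both sides must agree on the layout ahead of time.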
Optimizing communication
Use multiple SQL Database DBs
• Compose multiple databases to increase size / throughput
• Choose appropriate partitioning strategy to avoid distributed data dependencies (cross-machine joins)
Use Table Storage where appropriate
• Durable key/value (data columns) lookups
Use multiple storage accounts
• Leverage multiple storage accounts for additional storage, throughput
Leverage caching
Shift to scale-out for data storage
Publish work items to queues
Decouple receiving work from executing work (allow web/worker roles to scale independently)
Leverage appropriate queuing approach (multiple Azure queues, Service Bus queues, etc) to allocate and schedule work
Scale work tasks via queues
[Diagram: an Azure Hosted Service in which Web Role instances publish work items to an Azure Queue that Worker Role instances consume]
Go Really Big
Multiple services, multiple data centers
Cloud Service has a finite capacity
• Number of web/worker roles, inbound and outbound connections, etc.
Data centers have a finite capacity
• Available VMs, service capacity (Azure storage, SQL Database, Service Bus, etc.)
Truly large applications need to compose data centers
Every individual component has a “mirror” (or “replica”)
Sharing Resources
Design for application and feature pods
• Each (identical) deployment is assigned a slice of the workload (range of users, etc.)
• No dependency on other peer services
Affinitize work to the application “pod”
• Need to route work to the appropriate application pod
• Cannot affinitize at the load balancer or traffic manager; need application level routing
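Application-level routing to pods could take the form of a lookup directory that pins each user to one pod. This is only one possible design: the class, the pod names and the least-loaded placement rule are assumptions, and a real directory would live in durable shared storage rather than in memory.

```python
class PodDirectory:
    """Lookup-based routing: once a user is assigned to a pod, they stay there."""

    def __init__(self, pods):
        self._pods = list(pods)
        self._assignments = {}

    def add_pod(self, pod: str) -> None:
        """Bring a new (identical) deployment into rotation for new users."""
        self._pods.append(pod)

    def pod_for(self, user_id: str) -> str:
        pod = self._assignments.get(user_id)
        if pod is None:
            # New user: place on the pod with the fewest assigned users.
            counts = {p: 0 for p in self._pods}
            for assigned in self._assignments.values():
                counts[assigned] += 1
            pod = min(self._pods, key=lambda p: counts[p])
            self._assignments[user_id] = pod
        return pod
```

Unlike MOD-based routing, adding a pod here does not move existing users, so no data has to be shipped between pods when capacity grows.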
Implementing Horizontal Sharding
Partition for scale-out and availability
Multiple Cloud services across multiple data centers
May require shipping data from one data center to another (e.g. reference data or user data)
Horizontal application partitioning
[Diagram: a Routing Service (role instances behind an Azure Load Balancer) in front of several identical deployments of Web Service A. Each deployment has its own Azure Load Balancer, Web Role Instances and Service State (Storage Account A, SQL Azure)]
Beyond The Pure Technical
Distributed Systems in Windows Azure
Azure Services come with characteristics and factors that impact:
• Performance
• Availability
Your app requirements and the platform characteristics influence:
• Component usage
• Application design
• Troubleshooting mind set and expectations
• Telemetry requirements and design
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.