TRANSCRIPT
Windows Azure: Scaling SDN in the Public Cloud
Albert Greenberg
Director of Development
Windows Azure Networking
• Microsoft’s big bet on public cloud
• Companies move their IT infrastructure to the cloud
• Elastic scaling and less expensive than on-premises DC
• Runs major Microsoft properties (Office 365, OneDrive, Skype, Bing, Xbox)
Summary • Scenario: BYO Virtual Network to the Cloud
• Per customer, with capabilities equivalent to its on-premises counterpart
• Challenge: How do we scale virtual networks across millions of servers?
• Solution: Host SDN solves it: scale, flexibility, timely feature rollout, debuggability
• Virtual networks, software load balancing, …
• How: Scaling flow processing to millions of nodes
• Flow tables on the host, with on-demand rule dissemination
• RDMA to storage
• Demo: ExpressRoute to the Cloud (Bing it!)
Infrastructure as a Service: Develop, test, run your apps
Easy VM portability
If it runs on Hyper-V, it runs in Windows Azure: Windows, Linux, … (Ubuntu, redis, mongodb, …)
Deploy VMs anywhere with no lock-in
What Does IaaS Mean for Networking? Scenario: BYO Network
Windows Azure Virtual Networks
• Goal: BYO Address Space + Policy
• Azure is just another branch office of your enterprise, via VPN
• Communication between tenants of your Azure deployment should be efficient and scalable
[Diagram: two 10.1/16 address spaces — on-premises and Azure — connected via a secure tunnel]
Public Cloud Scale
[Charts: growth from 2010 to 2014 in Compute Instances, Azure Storage, and Azure DC Network Capacity]
Windows Azure momentum
How do we support 50k+ virtual networks, spread over a single 100k+ server deployment in a DC?
Start by finding the right abstractions
SDN: Building the right abstractions for Scale
Abstract by separating management, control, and data planes
[Diagram: Azure Frontend (management plane) → Controller (control plane) → Switch (data plane)]
Example: ACLs
• Management plane: create a tenant
• Control plane: plumb these tenant ACLs to these switches
• Data plane: apply these ACLs to these flows
• Data plane needs to apply per-flow policy to millions of VMs
• How do we apply billions of flow policy actions to packets?
Solution: Host Networking
• If every host performs all packet actions for its own VMs, scale is much more tractable
• Use a tiny bit of the distributed computing power of millions of servers to solve the SDN problem
• If millions of hosts work to implement billions of flows, each host only needs thousands
• Build the controller abstraction to push all SDN to the host
VNets on the Host
• A VNet is essentially a set of mappings from a customer defined address space (CAs) to provider addresses (PAs) of hosts where VMs are located
• Separate the interface to specify a VNet from the interface to plumb mappings to switches via a Network Controller
• All CA <-> PA mappings for a local VM reside on the VM's host, and are applied there
[Diagram: the Azure Frontend receives customer config via the Northbound API; the Controller holds the VNet description (CAs) and L3 forwarding policy (CAs <-> PAs), and programs the VMSwitches hosting the Blue and Green VMs' CA spaces via the Southbound API]
VNet Controller
[Diagram: the Azure Frontend passes customer config (VNet description, L3 forwarding policy) to the Controller, which coordinates with secondary controllers via a consensus protocol and programs the Azure VMSwitch on each node: Node1 (10.1.1.5) hosts Blue VM1 and Green VM1 (both CA 10.1.1.2); Node2 (10.1.1.6) hosts Red VM1 (10.1.1.2) and Green VM2 (10.1.1.3); Node3 (10.1.1.7) hosts the Green S2S GW (10.1.2.1), which reaches the Green enterprise network 10.2/16 via a VPN GW]
Forwarding Policy: Traffic to on-prem
[Diagram walk-through]
1. Green VM1 (CA 10.1.1.2) on Node1 (PA 10.1.1.5) sends a packet: Src 10.1.1.2, Dst 10.2.0.9
2. Policy lookup in the controller's L3 forwarding policy: 10.2/16 routes to the gateway on the host with PA 10.1.1.7
3. The Azure VMSwitch encapsulates: outer Src 10.1.1.5, Dst 10.1.1.7, GRE key: Green; inner Src 10.1.1.2, Dst 10.2.0.9
4. On Node3 (10.1.1.7), the VMSwitch decapsulates and delivers to the Green S2S GW (10.1.2.1), which forwards the inner packet (Src 10.1.1.2, Dst 10.2.0.9) over L3VPN/PPP through the VPN GW to the Green enterprise network 10.2/16
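The CA-to-PA lookup and encapsulation in this walk-through can be sketched as below. This is a minimal illustration, not Azure's implementation; the `VNET_MAPPINGS` table and function names are assumptions, with addresses taken from the slides.

```python
import ipaddress

# Per-VNet CA -> PA mapping table (illustrative; addresses from the slides).
VNET_MAPPINGS = {
    "Green": {
        "10.1.1.2/32": "10.1.1.5",   # Green VM1 -> Node1
        "10.1.1.3/32": "10.1.1.6",   # Green VM2 -> Node2
        "10.2.0.0/16": "10.1.1.7",   # on-prem prefix -> S2S gateway host (Node3)
    }
}

def lookup_pa(vnet, dst_ca):
    """Longest-prefix match of a customer address (CA) in the VNet's CA->PA map."""
    dst = ipaddress.ip_address(dst_ca)
    best = None
    for prefix, pa in VNET_MAPPINGS[vnet].items():
        net = ipaddress.ip_network(prefix)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, pa)
    return best[1] if best else None

def encap(vnet, local_pa, src_ca, dst_ca):
    """Wrap the inner CA packet in an outer PA header with a per-VNet GRE key."""
    return {
        "outer_src": local_pa,
        "outer_dst": lookup_pa(vnet, dst_ca),
        "gre_key": vnet,                      # identifies the tenant VNet
        "inner_src": src_ca,
        "inner_dst": dst_ca,
    }

pkt = encap("Green", "10.1.1.5", "10.1.1.2", "10.2.0.9")
print(pkt["outer_dst"])  # 10.1.1.7 — the gateway host, as in the walk-through
```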
Cloud Load Balancing
• All infrastructure runs behind an LB to enable high availability and application scale
• How do we make application load balancing scale to the cloud?
• Challenges:
• Load balancing the load balancers
• Hardware LBs are expensive, and cannot support the rapid creation/deletion of LB endpoints required in the cloud
• Support 10s of Gbps per cluster
• Support a simple provisioning model
[Diagram: an LB fronting Web Server VMs, which reach SQL Services and IaaS VMs through NAT]
All-Software Load Balancer: Scale using the Hosts
[Diagram: edge routers spread VIP traffic over stateless tunnels to LB VMs; each Azure VMSwitch maps VIP to DIP for its local VMs (DIPs 10.1.1.2–10.1.1.5); return traffic goes directly from DIP to client (direct return). A NAT Controller pushes the tenant definition (VIPs, # of DIPs) and mappings]
• Goal of an LB: map a Virtual IP (VIP) to a Dynamic IP (DIP) set of a cloud service
• Two steps: Load Balance (select a DIP) and NAT (translate VIP->DIP and ports)
• Pushing the NAT to the vswitch makes the LBs stateless (ECMP) and enables direct return
• SDN controller abstracts out LB/vswitch interactions
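The two steps — stateless DIP selection plus host-side NAT — can be sketched as follows. This is an illustration under assumed names (`VIP_POOLS`, `select_dip`), not the SLB implementation; hashing the 5-tuple stands in for whatever consistent selection the real LB uses.

```python
import hashlib

# Hypothetical tenant: one VIP fronting four DIPs (addresses from the diagram).
VIP_POOLS = {"79.3.1.2": ["10.1.1.2", "10.1.1.3", "10.1.1.4", "10.1.1.5"]}

def select_dip(vip, five_tuple):
    """Stateless load balancing: hash the flow's 5-tuple onto the DIP set.
    The same flow always lands on the same DIP, so the LB keeps no per-flow
    state and any LB replica reached via ECMP gives the same answer."""
    pool = VIP_POOLS[vip]
    digest = hashlib.sha256("|".join(map(str, five_tuple)).encode()).hexdigest()
    return pool[int(digest, 16) % len(pool)]

def host_dnat(packet):
    """At the target host's vswitch: rewrite VIP -> DIP (the NAT step).
    The reverse SNAT on the return path enables direct return to the client."""
    flow = (packet["src"], packet["sport"], packet["dst"], packet["dport"], "tcp")
    packet["dst"] = select_dip(packet["dst"], flow)
    return packet

pkt = {"src": "93.184.216.34", "sport": 51000, "dst": "79.3.1.2", "dport": 80}
print(host_dnat(pkt)["dst"])  # a stable DIP from the pool for this flow
```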
How We Scaled Host SDN
Flow Tables are the right abstraction
Node: 10.4.1.5 — Azure VMSwitch, Blue VM1 (10.1.1.2), NIC
Controller pushes: Tenant Description, VNet Description, VNet Routing Policy, ACLs, NAT Endpoints

Flow tables, one per policy:
VNET — TO: 10.2/16 → Encap to GW; TO: 10.1.1.5 → Encap to 10.5.1.7; TO: !10/8 → NAT out of VNET
LB NAT — TO: 79.3.1.2 → DNAT to 10.1.1.2; TO: !10/8 → SNAT to 79.3.1.2
ACLs — TO: 10.1.1/24 → Allow; TO: 10.4/16 → Block; TO: !10/8 → Allow
• VMSwitch exposes a typed Match-Action-Table API to the controller
• One table per policy
• Key insight: Let controller tell the switch exactly what to do with which packets (e.g. encap/decap), rather than trying to use existing abstractions (Tunnels, …)
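A typed match-action pipeline of this shape can be sketched as below — one table per policy, rules taken from the slide (with shorthand prefixes like "10/8" expanded). This is a toy illustration of the abstraction, not the VMSwitch API.

```python
import ipaddress

def match(prefix, addr):
    neg = prefix.startswith("!")
    inside = ipaddress.ip_address(addr) in ipaddress.ip_network(prefix.lstrip("!"))
    return inside != neg          # "!10.0.0.0/8" matches addresses outside 10/8

# One typed table per policy, applied in order; first matching rule wins per table.
TABLES = [
    ("VNET", [("10.2.0.0/16", "encap-to-gw"),
              ("10.1.1.5/32", "encap-to-10.5.1.7"),
              ("!10.0.0.0/8", "nat-out-of-vnet")]),
    ("LB-NAT", [("!10.0.0.0/8", "snat-to-79.3.1.2")]),
    ("ACL", [("10.1.1.0/24", "allow"),
             ("10.4.0.0/16", "block"),
             ("!10.0.0.0/8", "allow")]),
]

def process(dst):
    """Walk the typed tables in order; collect the first matching action in each."""
    actions = []
    for name, rules in TABLES:
        for prefix, action in rules:
            if match(prefix, dst):
                actions.append((name, action))
                break
    return actions

print(process("10.2.0.9"))  # [('VNET', 'encap-to-gw')]
print(process("8.8.8.8"))   # NAT out of the VNet, SNAT, and ACL allow
```

The point of the typing is that the controller states exactly what happens to which packets (encap, DNAT, allow), rather than configuring generic tunnel or ACL objects.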
1. Table typing and flow caching are critical to Dataplane Performance
[Diagram repeats: per-policy flow tables (VNET, LB NAT, ACLs) on the Azure VMSwitch at Node 10.4.1.5]
• COGS in the cloud is driven by VM density – 40GbE is here
• NIC Offloads are critical to achieving density
• Requires significant design work in the VMSwitch to scale overlay / NAT / ACL policy to line speed
• First-packet actions can be complex, but established-flow matches need to be typed, predictable, and simple
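The first-packet/established-flow split can be sketched as a flow cache: the first packet pays for the full multi-table walk, and the compiled action is cached by 5-tuple. All names here are illustrative.

```python
# First packet of a flow runs the full, possibly complex policy walk; the
# compiled, typed action is cached so established flows take a simple,
# predictable fast path (the part a NIC offload can also handle).

flow_cache = {}
full_lookups = 0

def full_policy_walk(five_tuple):
    """Stand-in for the expensive multi-table walk (VNET, LB NAT, ACLs)."""
    global full_lookups
    full_lookups += 1
    return ("encap", "10.1.1.7")          # an illustrative compiled action

def forward(five_tuple):
    action = flow_cache.get(five_tuple)   # established flow: one hash lookup
    if action is None:                    # first packet: slow path
        action = full_policy_walk(five_tuple)
        flow_cache[five_tuple] = action
    return action

flow = ("10.1.1.2", 44000, "10.2.0.9", 443, "tcp")
for _ in range(1000):
    forward(flow)
print(full_lookups)  # 1 — only the first packet paid for the full walk
```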
2. Separate Controllers By Application
[Diagram: a Network Controller exposes the Northbound API; behind it, a VNet Controller (tenant description, VNet description, VNet routing policy) and an LB Controller (VIP endpoints, NAT) each program their own per-policy flow tables (VNET, LB NAT, ACLs) on the Azure VMSwitch at Node 10.4.1.5]
3. Eventing: Agents are also per-Application
• Attempting to give each VMSwitch a synchronously consistent view of the entire network is not scalable
• Separate rapidly changing policy (location mappings of VMs in a VNet) from static provisioning policy
• VMSwitches should request needed mappings on demand via eventing
• We need a smart host agent to handle eventing and look up mappings
[Diagram: the VNet Controller pushes policy to the VNet Agent once; when the Azure VMSwitch finds no policy for a packet, it raises a mapping-request event to the agent, which queries the replicated Mapping Service and plumbs the returned mappings into the VNET flow table]
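The eventing path can be sketched as below: a policy miss in the vswitch raises an event to the host agent, which fetches the mapping and installs it. The class and variable names are illustrative, not actual Azure components.

```python
# On-demand rule dissemination: the vswitch holds only the mappings it has
# needed so far; misses are resolved through the agent and mapping service.

MAPPING_SERVICE = {"10.1.1.3": "10.1.1.6"}    # CA -> PA, held by the service

class VNetAgent:
    def __init__(self):
        self.local_mappings = {}

    def on_miss(self, dst_ca):
        """Event from the vswitch: no policy found for this packet."""
        pa = MAPPING_SERVICE[dst_ca]          # mapping request (an RPC in reality)
        self.local_mappings[dst_ca] = pa      # plumb the mapping into the vswitch
        return pa

class VMSwitch:
    def __init__(self, agent):
        self.agent = agent

    def forward(self, dst_ca):
        pa = self.agent.local_mappings.get(dst_ca)
        if pa is None:
            pa = self.agent.on_miss(dst_ca)   # slow path, first packet only
        return pa

agent = VNetAgent()
switch = VMSwitch(agent)
print(switch.forward("10.1.1.3"))  # miss -> lookup -> 10.1.1.6
print(switch.forward("10.1.1.3"))  # hit: served from local mappings
```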
Eventing: The Real API is on the Host
• The wire protocols between the controller, agent, and related services are now application-specific (rather than generic SDN APIs)
• The real southbound API (implemented by VNet, LB, ACLs, etc.) is now between the agents and the VMSwitch
• A high-performance OS-level API rather than a wire protocol
• We have found that eventing is a requirement of any nontrivial SDN application
[Diagram repeats: the VNet application — the VNet Controller pushes policy once to the VNet Agent; on a miss the Azure VMSwitch raises a mapping-request event, the agent queries the Mapping Service, and the mappings are installed through the Southbound API]
4. Separate Regional and Local Controllers
• VNet scope is a region – 100k+ nodes. One controller can't manage them all!
• Solution: the regional controller defines the VNet; local controllers program the end hosts
• Make the Mapping Service hierarchical, enabling DNS-style recursive lookup
[Diagram: regional controllers hold the VNet description and regional mappings; each local controller serves policy and mapping requests to the agents on its hosts, which install the VNET flow tables (TO: 10.2/16 → Encap to GW; TO: 10.1.1.5 → Encap to 10.5.1.7; TO: !10/8 → NAT out of VNET); mapping requests that miss locally recurse up to a regional controller]
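The DNS-style recursion can be sketched as a two-level resolver: the local controller answers from its own mappings when it can and recurses to the regional controller otherwise, caching the result. Class names and addresses are illustrative.

```python
# Hierarchical mapping service: local controllers are caches/resolvers,
# regional controllers are authoritative for the region's CA -> PA mappings.

class RegionalController:
    def __init__(self, mappings):
        self.mappings = mappings       # authoritative for the whole region

    def resolve(self, ca):
        return self.mappings[ca]

class LocalController:
    def __init__(self, parent, local):
        self.parent = parent
        self.local = dict(local)       # mappings for the hosts it manages

    def resolve(self, ca):
        if ca in self.local:           # answer locally when possible
            return self.local[ca]
        pa = self.parent.resolve(ca)   # recurse, like a DNS resolver
        self.local[ca] = pa            # cache for later requests
        return pa

regional = RegionalController({"10.1.1.9": "10.7.3.2"})
local = LocalController(regional, {"10.1.1.2": "10.1.1.5"})
print(local.resolve("10.1.1.2"))  # 10.1.1.5 (local hit)
print(local.resolve("10.1.1.9"))  # 10.7.3.2 (recursive lookup)
```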
A complete virtual network needs storage as well as compute!
How do we make Azure Storage scale?
Storage is Software Defined, Too
• Erasure Coding provides durability of 3-copy writes with small (<1.5x) overhead by distributing coded blocks over many servers
• Lots of network I/O for each storage I/O
[Diagram: write path — write, commit, then erasure-code, with coded blocks spread across many servers]
• We want to make storage clusters scale cheaply on commodity servers
To make storage cheaper, we use lots more network!
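The overhead arithmetic behind the <1.5x claim, plus a toy reconstruction, can be sketched as below. A k-data, m-parity code stores (k+m)/k times the data versus 3.0x for triple replication; the single XOR parity block here is a trivial k+1 code, not Azure Storage's actual coding scheme.

```python
def overhead(k, m):
    """Storage overhead of a k-data, m-parity erasure code."""
    return (k + m) / k

print(overhead(12, 4))  # ~1.33x, comfortably under the 1.5x on the slide

def xor_parity(blocks):
    """XOR equal-length blocks together; also reconstructs a missing block."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return parity

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)
# Lose block 1; recover it by XOR-ing the parity with the remaining blocks.
recovered = xor_parity([data[0], data[2], parity])
print(recovered == data[1])  # True
```

The network cost is visible here too: every coded write fans blocks out to many servers, which is why cheaper storage means more network I/O.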
RDMA – High Performance Transport for Storage
• Remote DMA primitives (e.g. Read address, Write address) implemented on-NIC
• Zero copy (NIC handles all transfers via DMA)
• Zero CPU utilization at 40Gbps (NIC handles all packetization)
• <2μs end-to-end latency
• RoCE enables Infiniband RDMA transport over IP/Ethernet network (all L3)
• Enabled at 40GbE for Windows Azure Storage, achieving massive COGS savings by eliminating many CPUs in the rack
All the logic is in the host: Software Defined Storage now scales with the Software Defined Network
[Diagram: RDMA write — the application writes local Buffer A directly into remote Buffer B; the NICs move the memory and Buffer B is filled with no CPU involvement]
Just so we're clear… 40Gbps of I/O with 0% CPU
Hybrid Cloud: How do we Onboard Enterprise?
Public internet
ExpressRoute: Direct Connection to Your VNet
• All VNet policy to tunnel traffic to/from the customer circuit is implemented on the hosts
• Predictable low latency and high throughput to the cloud
ExpressRoute: Now live in MSIT!
ExpressRoute: Entirely Automated SDN Solution
[Diagram: the customer router connects through an edge router to the host; a Gateway VM with a BGP RIB sits behind the VMSwitch, and the VNET Agent, Gateway Controller, VNET Controller, SLB, and Mapping Service automate the whole path]
DEMO: ExpressRoute
Result: We made SDN Scale
• VNET, SLB, ACLs, Metering, and more scale to millions of servers
• Tens of Thousands of VNETs
• Tens of Thousands of Gateways
• Hundreds of Thousands of VIPs
• 10s of Tbps of LB’d traffic
• Billions of Flows… all in the host!
[Chart: bandwidth served by SLB to a storage cluster over a week, fluctuating between 20Gbps and 40Gbps]
Host Networking makes Physical Network Fast and Scalable
• Massive, distributed 40GbE network built on commodity hardware
• No hardware per-tenant ACLs
• No hardware NAT
• No hardware VPN / overlay
• No vendor-specific control, management, or data plane
• All policy is in software – and everything’s a VM!
• Network services deployed like all other services
• Battle-tested solutions in Windows Azure are coming to private cloud
We bet our infrastructure on Host SDN, and it paid off
• The incremental cost of deploying a new tenant, new VNet, or new load balancer is tiny – everything is in software
• Using scale, we are cheaper and faster than any tenant deployed by an admin on-prem
• Public cloud is the future! Join us!