engagement case study google sdn peering: an early · pdf fileengagement case study murali...

52
Google SDN Peering: An Early Engagement Case Study Murali Suriar, [email protected] On behalf of Google Technical Infrastructure and Network Infrastructure SRE August 30, 2017

Upload: duongnhu

Post on 07-Mar-2018

223 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Google SDN Peering: An Early Engagement Case StudyMurali Suriar, [email protected] behalf of Google Technical Infrastructure and Network Infrastructure SRE

August 30, 2017

Page 2: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Who am I?

● Murali Suriar● Seven years at Google*

○ Network Engineer, Dublin○ SRE, London

■ Initially working on proxies/load balancing■ Currently running SDN control systems

● @msuriar on Github, Twitter, IRC

* = minus a brief stint on a boat

Page 3: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Today's talk

● What is SDN?● A brief history of SDN at Google● An overview of Espresso (SDN internet peering)● SRE early engagement with the Espresso dev team

Page 4: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

What is SDN?

Page 5: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Traditional networking

● Common protocols and standards (mostly).● Proprietary/vertically integrated implementations.

Page 6: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

An aside - why hardware?

● IP networking all about packets per second (pps).● Weird standards.

Page 7: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Planes of a switch/router

Page 8: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Planes of a switch/router ("swouter")

Control

Management

Forwarding/data

Page 9: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Planes of a switch/router ("swouter")

● Control plane scales with protocol/network complexity.

● Network vendors use long-term supported hardware.

● Long depreciation cycles lead to underpowered control plane.

Control

Management

Forwarding/data

Page 10: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

The dream of SDN

● Create standard for programming the forwarding plane.

● Separate control plane from network devices.

ManagementControl

ForwardingForwarding

ForwardingForwarding

Page 11: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Complexities of SDN

● Need a new network to connect control and data plane together.

● Network engineers need to learn about running binaries and managing machines.

● Or sysadmins/SREs need to learn about networking.

Page 12: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

New failure modes of SDN

● Less shared fate between control plane and data plane.

● Single controller outage has (potentially) large impact on data plane.

● Increased latency in reacting to some classes of failures.

Page 13: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

A brief history of SDN at Google

Page 14: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

B4WAN

Interconnect

Andromeda NFV and network

virtualization

JupiterDatacenter Networking

The Pillars of SDN @ Google

Page 15: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

B4: [Jain et al, SIGCOMM 13] BwE: [Jain et al, SIGCOMM 15]

B4: Google's Software Defined WAN

Page 16: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

B4: [Jain et al, SIGCOMM 13] BwE: [Jain et al, SIGCOMM 15]

B4: From Copy Network to Business Critical

B4 tr

affic

2012 — 2016

Page 17: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

10.1.4/24

VNET: 5.4/16

VNET: 192.168.32/24

VNET: 10.1.1/24 Load Balancing

DoS

ACLs

VPN

NFVInternal Network

Andromeda

ToR

Google Infrastructure Services

10.1.1/24

ToR

10.1.2/24

ToR

10.1.3/24

ToR

Page 18: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Watchtower

Saturn

Firehose 1.1

Google Datacenter Network InnovationAnd hardware scale that we could not buy

18

Time

Capa

city

Firehose 1.0

Jupiter

4 Post

1.3Pb/s clusters in 2013

Page 19: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

B4WAN

Interconnect

Andromeda NFV and network

virtualization

JupiterDatacenter Networking

The Pillars of SDN @ Google

PublicInternet?

Page 20: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

B4WAN

Interconnect

Andromeda NFV and network

virtualization

JupiterDatacenter Networking

The Pillars of SDN @ Google

Espresso SDN for public

Internet

Page 21: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Enter Espresso

Page 22: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso in Context

B4

Jupiter Data CenterGoogle

Page 23: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso in Context

B4

B2

Peering Metro

Jupiter Data CenterGoogle

Google

Page 24: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso in Context

B4Espresso

B2

Internet

Peering Metro

User

Jupiter Data CenterGoogle

Google

Page 25: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Cloud 1.0Espresso

SDNPeering

RouterCentric

Protocols

Espresso: Before and After

Local viewConnectivity firstCoarse fault recovery

Per-metro and global viewApplication signalsReal-time optimization

Page 26: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso Architecture Overview

Label-switched Fabric

BGP speaker

External Peer

Espresso Metro

Peering Fabric

eBGP Peering

Page 27: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso Architecture Overview

Label-switched Fabric

HostHostHostHostHost

Host

Packet Processor

BGP speaker

External PeereBGP Peering

Espresso Metro

Labeled packets specify egress

HostHostHostHostHost

Peering Fabric

Page 28: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso Architecture Overview

Label-switched Fabric

HostHostHostHostHost

Host

Packet Processor

LocalControl

Global Controller

BGP speaker

External PeereBGP Peering

Espresso Metro

Application Signals

Labeled packets specify egress

HostHostHostHostHost

Peering Fabric

Page 29: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

SRE for Espresso

Page 30: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Complexities of Espresso

● Large set of distributed systems.● Many teams, different skill sets.● Massive, top to bottom change.● How do we contain and direct all of this so we make

progress?

Page 31: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso team

● Cross functional team○ Network engineers○ SREs○ Developers○ Testers○ ...

Page 32: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Espresso team

● Responsible for supporting Espresso from inception to production.

● Set up testing infrastructure.● Set up job control, monitoring.● Oncall when Espresso shipped its first bytes.● Eventually spun down and handed off oncall to

permanent teams.

Page 33: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Test/release infrastructure

● Unit tests on everything.● Some software integration tests.● Automated hardware integration tests.● CD pipeline cutting a release every night from latest

green commit and deploying to hardware testbeds.

Page 34: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Production environment

● Reused/adapted standard building blocks.○ Borg○ Chubby○ PrometheusBorgmon

● Had a post lab, prod-parallel testbed which paged Espresso oncall.

Page 35: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

"I have a question…"

"Do you know how to let a Borg job SSH into a production machine?"

"Yes. I'm not going to tell you how, though. What are you trying to do?"

(SSH is almost never used for system to system communication at Google; we prefer RPCs.)

Page 36: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

"I have a question…"

"I want to save some binary data to disk, then log in, copy it off, and then get it into Dremel."

"So… you want to save some structured (ProtoBuf?) logs into Dremel."

"Yes."

(It turns out Google has an existing toolkit to solve precisely this problem.)

Page 37: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Monitoring/alerting

● Lots of possible points of failure:○ Peering Fabric.○ Packet processing on

hosts.○ Software (Local

controller, BGP speakers).

○ Global control plane.● How to tell what's broken?

Page 38: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Monitoring/alerting

● Lots of possible points of failure○ Peering Fabric.○ Packet processing on

hosts.○ Software (Local

controller, BGP speakers).

○ Global control plane.● How to tell what's broken?

Page 39: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Monitoring/alerting

"Network devices have counters everywhere. If we page on the drop counters, that'll catch all the failures we see with traditional peering devices?"

"Oooor… we could build some blackbox probing infrastructure to catch failures which don't show up in counters?"

Page 40: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Monitoring/alerting

● Built a couple of high signal, symptom based alerts○ Black box prober, doing end to end test of control-

and dataplane.● Used lots of whitebox telemetry to help point to root

cause.○ ALL THE GRAPHS.

Page 41: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Monitoring/alerting

GFE

USPS (ACL)

Monitoring

1. Blackbox realtime monitoring of PF availability + encap + decap + GFE reachability.2. Greybox realtime monitoring of pocket processor ACL: decap + ACL-is-blocking + ACL-is-permitting.3. Passive loss/blackhole monitoring.

PFInternet B2 GFE

Packet processor

Monitoring

Page 42: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Introspection

● Alerting/monitoring tells you something is broken.● How do find out what exactly is causing you to be

paged?

Page 43: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Introspection tools

● Google has standard HTTP endpoints for debugging.○ "Show me the important things about this binary."○ "Packet processor, what do you know about

192.0.2.1?"● Custom traceroute-like tools for debugging dataplane.

Page 44: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

What broke?

● Most common failure mode: control plane breakage.● Example: Local controller OOM on new version.

○ No traffic impact. (Fail static.)○ Caught in first production canary.○ Added regression test.

Page 45: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

What broke?

● SDN management.● Example: accidentally disabled non-SSH access to

Peering Fabrics.○ No traffic impact. (Fail static)○ Used SSH access to restore SDN management.○ Added more conservative canarying for device

management changes.

Page 46: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Comprehensibility

● Complex system needed an architecture diagram.● Espresso architecture doc has:

○ All components.○ What talked to what.○ Links to individual design docs.○ (Later) Who was oncall for what.

Page 47: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Oncall

● Everyone in Espresso team in the oncall rotation:○ SREs.○ Developers.○ Network engineers.

● Some people never oncall before.● Some people already oncall for other stuff.● Needed to account for all of this in oncall practices.

Page 48: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Oncall

● Initially Espresso team oncall for all Espresso deployments.

● Then only for a couple of sites where we were testing new features.

● Eventually spun down and handed off to many existing teams.

Page 49: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Summary

Page 50: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

What did early engagement get us?

● Dev familiarity with production.○ When you're paged by a bug, you fix it faster.

● Broad knowledge across lots of disciplines.● Significant design changes:

○ Reusing more production infrastructure.○ Symptom based monitoring.

Page 51: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Lessons learned

● Design for testability.● Reuse whatever you can.● System architecture diagrams are great.● Focus on a few, high signal, symptom based alerts.● Lots of white box telemetry to aid with root causing.

Page 52: Engagement Case Study Google SDN Peering: An Early · PDF fileEngagement Case Study Murali Suriar, msuriar@google.com ... BGP speaker eBGP Peering External Peer Espresso Metro Labeled

Thank You!Thank You!