B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN


Page 1: B4 and After: Managing Hierarchy, Partitioning, and ...conferences.sigcomm.org/sigcomm/2018/files/slides/paper_2.2.pdf · B4 WAN BF BF BF BF CF CF CF CF 5.12 Tbps To Clusters 5.12


(“Chi”) Chi-yao Hong, Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi, Kondapa Naidu B., Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev, Steve Padgett, Faro Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn, Jonathan Zolla, Joon Ong, Amin Vahdat

On behalf of many others in Google Network Infrastructure and Network SREs


[Figure: B4 evolution timeline, 2011 to 2018, toward a highly available, massive-scale network with >100x more traffic. Labels: Saturn (first-generation B4 network, copy network); J-POP; Stargate; availability milestones of 99%, 99.9%, and 99.99%.]


Previous B4 paper published in SIGCOMM 2013


Background: B4 with SDN Traffic Engineering (TE) Deployed in 2012

[Figure: B4 TE architecture. Labels: 12-site topology; demand matrix (via Google BwE); central TE controller; site-level tunnels (tunnels & tunnel splits); per-site domain TE controllers.]


Background: B4 with SDN Traffic Engineering (TE) Deployed in 2012

Key Takeaways:

❏ High efficiency: Lower per-byte cost compared with B2 (Google global backbone running RSVP TE on vendor gear)

❏ Deterministic convergence: Fast, global TE optimization and failure handling

❏ Rapid software iteration: ~1 month for developing and deploying a median-size software feature


But, it also comes with new challenges


Grand Challenge #1: High Availability Requirements

Service Class | Application Examples | Availability SLO
SC4 | Search ads, DNS, WWW | 99.99%
SC3 | Proto service backend, Email | 99.95%
SC2 | Ads database replication | 99.9%
SC1 | Search index copies, logs | 99%
SC0 | Bulk transfer | N/A

B4 initially had 99% availability in 2013



Very demanding goal, given:
● inherent unreliability of long-haul links
● necessary management operations
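To make these targets concrete, the sketch below converts each availability SLO from the service-class table into a yearly downtime budget. It assumes availability is measured as a fraction of wall-clock time over a 365-day year, which is a simplification for illustration only; the slides do not say how B4 availability is measured, and `downtime_minutes_per_year` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, time-based availability


def downtime_minutes_per_year(availability_percent):
    """Return the minutes per year a service may be unavailable at the given SLO."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# SLO values taken from the service-class table on the slide.
slos = {"SC4": 99.99, "SC3": 99.95, "SC2": 99.9, "SC1": 99.0}
for service_class, slo in slos.items():
    print(f"{service_class}: {slo}% -> {downtime_minutes_per_year(slo):.1f} min/year")
```

Under that assumption, moving from 99% to 99.99% shrinks the yearly budget from about 3.65 days to under an hour.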


Grand Challenge #2: Scale Requirements

Our bandwidth requirement doubled every ~9 months


traffic increased by >100x in 5 years
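The two growth figures on these slides can be cross-checked with one line of arithmetic: doubling every ~9 months compounds to roughly 100x over 5 years. The sketch below assumes exactly 9 months per doubling and exactly 60 months; `growth_factor` is an illustrative name, not something from the talk.

```python
def growth_factor(months, doubling_period_months=9):
    """Compound growth after `months` if demand doubles every `doubling_period_months`."""
    return 2 ** (months / doubling_period_months)


five_years = growth_factor(60)
print(f"Growth over 5 years: {five_years:.1f}x")
```

The result lands just above 100x, so the ">100x in 5 years" claim is consistent with the stated doubling period.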


Scale increased across dimensions:
● #Cluster prefixes: 8x
● #B4 sites: 3x
● #Control domains: 16x
● #Tunnels: 60x


Other challenges: No disruption to existing traffic, maintain high cost efficiency and high feature velocity


To meet these demanding requirements, we’ve had to aggressively develop many point solutions


Lessons Learned

1. Flat topology scales poorly and hurts availability

2. Solving the capacity asymmetry problem in hierarchical topology is key to achieving high availability at scale

3. Scalable switch forwarding rule management is essential to hierarchical TE


[Figure: Saturn, the first-generation B4 site fabric. Sites connect over the B4 WAN; the site shown has four BF and four CF chassis, with 5.12 Tbps to clusters and 5.12 / 6.4 Tbps to WAN (other B4 sites).]


Scaling option #1: Add more chassis, up to 8 chassis per Saturn fabric

[Figure: Saturn, the first-generation B4 site fabric, with BF and CF chassis, 5.12 Tbps to clusters, and 5.12 / 6.4 Tbps to WAN (other B4 sites).]


Scaling option #2: Build multiple B4 sites in close proximity

● Slower central TE controller
● Limited switch table size
● Complicated capacity planning and job allocation


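Jumpgate: Two-layer Topology

[Figure: a Jumpgate site built from supernodes. Each supernode has a layer of spine switches and a layer of edge switches (multiplicity labels: x16, x32), with 80 Tbps toward WAN / clusters / sidelinks.]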


Jumpgate: Two-layer Topology

● Support horizontal scaling by adding more supernodes to a site
● Support vertical scaling by upgrading a supernode in place to a new generation
● Improve availability with a granular, per-supernode control domain


Lessons Learned

1. Flat topology scales poorly and hurts availability

2. Solving the capacity asymmetry problem in hierarchical topology is key to achieving high availability at scale

3. Scalable switch forwarding rule management is essential to hierarchical TE


[Figure: Sites A, B, and C drawn at supernode level and at site level. Supernode-level links are labeled 4; the site-level links are labeled 16, the sum of supernode-level link capacity.]


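[Figure: Sites A, B, and C at supernode level and at site level after a capacity change. Supernode-level links are labeled 2; site-level labels read "14?", "16", and "8".]

Bottleneck!

Abstraction loss 43% = (14 - 8) / 14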
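The loss figure on this slide is a simple ratio: the capacity the site-level abstraction advertises (14) minus what the supernode-level bottleneck can actually carry (8), divided by the advertised capacity. A minimal sketch of that arithmetic follows; the function and parameter names are illustrative, not from the talk.

```python
def abstraction_loss(abstract_capacity, usable_capacity):
    """Fraction of advertised site-level capacity that cannot actually be carried."""
    return (abstract_capacity - usable_capacity) / abstract_capacity


loss = abstraction_loss(abstract_capacity=14, usable_capacity=8)
print(f"Abstraction loss: {loss:.0%}")
```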


[Figure: CDF. y-axis: cumulative function of site-level links and topology events; x-axis: site-level link capacity loss due to topology abstraction / total capacity (log10 scale).]

● 100% capacity loss in 18% of cases
● 2% capacity loss in the median case due to striping inefficiency


Solution = Sidelinks + Supernode-level TE


[Figure: Sites A, B, and C with sidelinks; supernode-level labels read 3.5.]

● 57% toward next site
● 43% toward self site


Solution = Sidelinks + Supernode-level TE

Multi-layer TE (site-level & supernode-level) turns out to be challenging!


Design Proposals

Hierarchical Tunneling
● Site-level tunnels + supernode-level sub-tunnels
● Two layers of IP encapsulation lead to inefficient hashing

Supernode-level TE
● Supernode-level tunnels
● Scaling challenge: increases path allocation run time by 188x


Tunnel Split Group (TSG)

● Supernode-level traffic splits
● No packet encapsulation
● Calculated per site-level link
● Assume balanced ingress traffic
● Maximize admissible demand subject to fairness and link capacity constraints

[Figure: Site A (4 supernodes) connected to Site B (2 supernodes).]


Greedy Exhaustive Waterfill Algorithm

Iteratively allocate each flow on its direct path (w/o sidelinks) or, alternatively, on its indirect paths (w/ sidelinks on the source site) until any flow cannot be allocated further

● Provably forwarding-loop free
● Low abstraction capacity loss
● Takes less than 1 second to run

[Figure: Site A (4 supernodes) and Site B (2 supernodes).]
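The allocation loop described on this slide can be illustrated with a toy version. The sketch below is a simplified reading of the description, not the authors' implementation: it models one source site, assumes one flow per supernode (balanced ingress), integer capacities that are multiples of `step`, a uniform sidelink capacity, and rounds that are all-or-nothing so every flow is admitted equally. All names (`greedy_waterfill`, `direct_cap`, `sidelink_cap`) are hypothetical.

```python
def greedy_waterfill(direct_cap, sidelink_cap, step=1):
    """Simplified waterfill over one site-level link.

    direct_cap[i] -- capacity from supernode i straight to the next site
    sidelink_cap  -- capacity of each directed sidelink between two supernodes
    Returns alloc, where alloc[i][j] is the traffic that enters at supernode i
    and leaves the site over supernode j's direct links (j != i uses a sidelink).
    """
    n = len(direct_cap)
    residual = list(direct_cap)
    side_residual = {(i, j): sidelink_cap
                     for i in range(n) for j in range(n) if i != j}
    alloc = [[0] * n for _ in range(n)]

    while True:
        # Plan one round on scratch copies: every flow gains `step` or none does,
        # which keeps the admitted traffic equal across supernodes.
        res, side, choices = list(residual), dict(side_residual), []
        for i in range(n):
            if res[i] >= step:
                exit_node = i  # direct path, no sidelink
            else:
                candidates = [k for k in range(n)
                              if k != i and res[k] >= step and side[(i, k)] >= step]
                if not candidates:
                    return alloc  # some flow cannot grow further: stop
                exit_node = max(candidates, key=lambda k: res[k])
                side[(i, exit_node)] -= step
            res[exit_node] -= step
            choices.append(exit_node)
        residual, side_residual = res, side
        for i, exit_node in enumerate(choices):
            alloc[i][exit_node] += step


# Hypothetical source site: supernode 3 has lost all of its direct capacity.
alloc = greedy_waterfill(direct_cap=[4, 4, 4, 0], sidelink_cap=2)
for i, row in enumerate(alloc):
    print(f"supernode {i}: {row}")
```

In this made-up input, supernodes 0 to 2 send everything on their direct links, while supernode 3 spreads its traffic evenly over sidelinks to its three siblings, and all 12 units of direct capacity end up used. The per-supernode rows, normalized, play the role of the traffic splits.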


[Figure: CDF. y-axis: cumulative function of site-level links and topology events; x-axis: site-level link capacity loss due to topology abstraction / total capacity (log10 scale). Annotations: 100% loss; < 2% loss.]


TSG Sequencing Problem

[Figure: current TSGs and target TSGs over supernodes A1, A2, B1, and B2.]

Bad properties during update:
● Forwarding loop
● Blackhole


Dependency Graph based TSG Update

1. Map target TSGs to a supernode dependency graph
2. Apply TSG update in reverse topological ordering*

● Loop-free and no extra blackhole
● Requires no packet tagging
● One or two steps in >99.7% of TSG ops

* Shares ideas with work on IGP updates:
● Francois & Bonaventure, Avoiding Transient Loops during IGP Convergence in IP Networks, INFOCOM’05
● Vanbever et al., Seamless Network-wide IGP Migrations, SIGCOMM’11
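The two-step procedure above can be sketched with Python's standard `graphlib` module (Python 3.9+). The dependency rule used here is an assumption made for illustration: supernode X depends on supernode Y when X's target TSG forwards traffic to Y over a sidelink, so Y is updated before X. The function name and the example TSGs are hypothetical; only the supernode names A1, A2, B1, and B2 come from the slides.

```python
from graphlib import TopologicalSorter


def tsg_update_steps(target_tsgs):
    """Group supernodes into update steps in reverse topological order.

    target_tsgs maps each supernode to the set of sibling supernodes that its
    target TSG forwards traffic to over sidelinks (assumed dependency rule).
    """
    sorter = TopologicalSorter(target_tsgs)
    sorter.prepare()  # raises graphlib.CycleError if the target TSGs form a loop
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all dependencies already updated
        steps.append(ready)
        sorter.done(*ready)
    return steps


# Hypothetical target state: A2 forwards via A1, and B1 forwards via B2.
steps = tsg_update_steps({"A1": set(), "A2": {"A1"}, "B1": {"B2"}, "B2": set()})
for number, batch in enumerate(steps, start=1):
    print(f"step {number}: update {batch}")
```

Updating the supernodes that others forward to first means that, by the time an upstream supernode switches to its target split, its downstream neighbors already hold their target state. In this example the update finishes in two steps.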


Lessons Learned

1. Flat topology scales poorly and hurts availability

2. Solving the capacity asymmetry problem in hierarchical topology is key to achieving high availability at scale

3. Scalable switch forwarding rule management is essential to hierarchical TE


Multi-stage Hashing across Switches in Clos Supernode

[Figure: a B4 site and its Clos supernode (labels: x16, x32).]

1. Ingress traffic at edge switches:
   a. Site-level tunnel split
   b. TSG site-level split (to self-site or next-site)
2. At spine switches:
   a. TSG supernode-level split
   b. Egress edge switch split
3. Egress traffic at edge switches:
   a. Egress port/trunk split

Enable hierarchical TE at scale: overall throughput improved by >6%
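The staged splits listed above can be pictured as a chain of independent weighted hash decisions, one per stage, so that no single table has to enumerate every end-to-end combination. The sketch below is only an illustration of that idea: the stage names follow the slide, but the hash function, the choices, the weights, and the flow key are all made up, and it does not model how the stages are distributed across edge and spine switches.

```python
import hashlib


def pick(flow_key, stage, weighted_choices):
    """Deterministically pick one choice for a flow at one stage, honoring weights."""
    digest = hashlib.sha256(f"{stage}|{flow_key}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") % sum(w for _, w in weighted_choices)
    for choice, weight in weighted_choices:
        if point < weight:
            return choice
        point -= weight


def route(flow_key, stages):
    """Apply every stage in order and return the choice made at each one."""
    return [pick(flow_key, name, choices) for name, choices in stages]


# Stage names follow the slide; all choices and weights are hypothetical.
stages = [
    ("site-level tunnel split", [("tunnel-1", 3), ("tunnel-2", 1)]),
    ("TSG site-level split", [("next-site", 4), ("self-site", 3)]),
    ("TSG supernode-level split", [("supernode-1", 1), ("supernode-2", 1)]),
    ("egress edge switch split", [("edge-1", 1), ("edge-2", 1)]),
    ("egress port/trunk split", [("port-1", 1), ("port-2", 1)]),
]
flow = "10.0.0.1:1234->10.1.0.9:443/tcp"
print(route(flow, stages))
```

Because each stage hashes the same flow key with a different stage label, a given flow always follows the same path, while different flows spread across the choices roughly in proportion to the weights.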


[Figure: B4 evolution timeline, 2011 to 2018, toward a highly available, massive-scale network with >100x more traffic. Labels: Saturn (flat topology, SDN TE tunneling, copy network); J-POP; Stargate; Jumpgate (two-layer topology); two service classes; TSG (hierarchical TE); efficient switch rule management & more service classes; availability milestones of 99%, 99.9%, and 99.99%.]


Conclusions

❏ Highly available WAN with plentiful bandwidth offers unique benefits to many cloud services (e.g., Spanner)

❏ Future work: limit the blast radius of rare yet catastrophic failures
   ❏ Reduce dependencies across components
   ❏ Network operation via per-QoS canary


B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN

Before | After
Copy network with 99% availability | Highly available network with 99.99% availability
Inter-DC WAN with moderate number of sites | 100x more traffic, 60x more tunnels
Saturn: flat site topology & per-site domain TE controller | Jumpgate: hierarchical topology & granular TE control domain
Site-level tunneling | Site-level tunneling in conjunction with supernode-level TE (“Tunnel Split Group”)
Tunnel splits implemented at ingress switches | Multi-stage hashing across switches in Clos supernode


[Figure: a B4 site and its supernode (labels: x16, x32).]

Switch Pipeline: ACL (Flow Match), ECMP (Port Hashing), Encap (+Tunnel IP)


Switch Pipeline: ACL (Flow Match), ECMP (Port Hashing), Encap (+Tunnel IP)

Size(ACL) ≥ (#Sites ✕ #PrefixesPerSite ✕ #ServiceClasses)
● #PrefixesPerSite: >16 aggregated IPv4 & IPv6 cluster prefixes
● #ServiceClasses: 6 aggregated QoSes
● Size(ACL): up to 3K entries

Scaling bottleneck: Hit ACL table limit with ~32 sites
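The ~32-site limit follows directly from the inequality on this slide. The sketch below plugs in the slide's numbers, taking 16 prefixes per site as the minimum of the ">16" figure and reading "3K" as 3 × 1024 entries; both readings are assumptions, as are the function and constant names.

```python
def acl_entries(sites, prefixes_per_site, service_classes):
    """Lower bound on ACL entries: one per (site, prefix, service class)."""
    return sites * prefixes_per_site * service_classes


ACL_LIMIT = 3 * 1024  # assumption: "3K" read as 3 * 1024 entries

for sites in (12, 32, 33):
    needed = acl_entries(sites, prefixes_per_site=16, service_classes=6)
    print(f"{sites} sites -> {needed} entries, fits: {needed <= ACL_LIMIT}")
```

With these inputs, 32 sites exactly fill the table and a 33rd site no longer fits.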


Switch Pipeline (Before): ACL (Flow Match), ECMP (Port Hashing), Encap (+Tunnel IP)

Switch Pipeline (After): VFP (QoS Match), Per-VRF LPM (Prefix Match), ACL (Flow Match, shown twice), ECMP (Port Hashing), Encap (+Tunnel IP)

● Increase # supported sites by 60x
● Enable new features: Disable per-flow tunneling


Switch Pipeline: VFP (QoS Match), Per-VRF LPM (Prefix Match), ACL (Flow Match, shown twice), ECMP (Port Hashing), Encap (+Tunnel IP)

Size(ECMP) ≥ (#Sites ✕ #PathingClasses ✕ TunnelsSplits ✕ TSG_Splits ✕ SwitchSplits)

Annotated values: 33 sites, 3 classes, and splits of 4, 32, and 16 ways

198K entries required; 16K supported by our switches
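The 198K figure can be reproduced by multiplying the annotated values. The sketch below treats the three split widths (4, 32, and 16 ways) as the three split factors of the inequality without asserting which width belongs to which factor, since the product is the same either way, and it reads "K" as 1024; the function name is illustrative.

```python
from math import prod


def ecmp_entries(sites, pathing_classes, split_ways):
    """Lower bound on ECMP entries if every split is expanded in one flat table."""
    return sites * pathing_classes * prod(split_ways)


required = ecmp_entries(sites=33, pathing_classes=3, split_ways=(4, 32, 16))
supported = 16 * 1024  # assumption: "16K" read as 16 * 1024 entries

print(f"required: {required} entries (~{required // 1024}K)")
print(f"supported: {supported} entries")
print(f"shortfall: {required / supported:.1f}x")
```

A flat table would need roughly 12 times more entries than the switches support, which is the gap that splitting the decision across stages is meant to close.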


Switch Pipeline: VFP (QoS Match), Per-VRF LPM (Prefix Match), ACL (Flow Match, shown twice), ECMP (Port Hashing), Encap (+Tunnel IP)

Size(ECMP) ≥ (#Sites ✕ #PathingClasses ✕ TunnelsSplits ✕ TSG_Splits ✕ SwitchSplits)

Scaling bottleneck: Hit ACL table limit with ~32 sites

[Figure: supernode (labels: x16, x32).]

● Overall throughput improved by >6%
● Support more sites & pathing classes


[Figure: a B4 site, its supernode (labels: x16, x32), and the switch pipeline: ACL (Flow Match), ECMP (Port Hashing), Encap (+Tunnel IP).]

● Support up to only 32 sites
● Reduced efficiency with lower path split granularity
● Efficient flow matching via virtual routing & forwarding (VRF)
● Multi-stage hashing by leveraging source MAC marking and packet load balancing via spine-layer switches
