SRECon-Americas-2017: TrafficShift: Avoiding Disasters at Scale
TRANSCRIPT
![Page 1: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/1.jpg)
TrafficShift - Avoiding Disasters at Scale
Michael Kehoe, Staff SRE, LinkedIn
Anil Mallapur, SRE, LinkedIn
![Page 2: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/2.jpg)
Overview
LinkedIn Architectural Overview
Fabric Disaster Recovery
Questions
![Page 3: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/3.jpg)
467+ million members
World’s largest professional network
200+ Countries
![Page 4: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/4.jpg)
Who are we?
Production-SRE team at LinkedIn
● Assist in restoring stability to services during site-critical issues
● Develop applications to improve MTTD and MTTR
● Provide direction and guidelines for site monitoring
● Build tools for efficient site issue troubleshooting, issue detection & correlation
![Page 5: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/5.jpg)
Terminologies
Fabric/Colo: Data center with the full application stack deployed
PoP/Edge: Entry point to the LinkedIn network (TCP/SSL termination)
Load Test: Planned stress testing of data centers
![Page 6: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/6.jpg)
Timeline (2003, 2010, 2011, 2013, 2014, 2015):
Active & Passive
Active & Active
Multi-colo 3-way Active & Active
Multi-colo n-way Active & Active
![Page 7: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/7.jpg)
2017
4 Data Centers
13 PoPs
1000+ services
![Page 8: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/8.jpg)
![Page 9: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/9.jpg)
What are Disasters?
Service Degradation
Infrastructure Issues
Human Error
Data Center on Fire
![Page 10: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/10.jpg)
One solution for all disasters
TrafficShift: Reroute user traffic to different data centers without any user interruption.
![Page 11: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/11.jpg)
Whaaaat ?
![Page 12: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/12.jpg)
Diagram: Internet, Border Router, EDGE (IPVS, ATS), FABRIC (ATS, Frontend), Stickyrouting Service
![Page 13: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/13.jpg)
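Diagram: how ATS at the EDGE chooses a FABRIC for a request
● If the cookie names DC1, the request goes to DC1
● If there is no cookie in the header, the Stickyrouting Service gets the primary colo for the user
● Got DC2 as primary colo for the user, so the request goes to DC2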
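The routing decision on this slide could be sketched roughly as follows. This is a minimal illustration, not LinkedIn's implementation: the cookie name `primary_colo`, the function `pick_colo`, and the `lookup_primary_colo` callback (standing in for a call to the Stickyrouting Service) are all hypothetical names.

```python
from typing import Callable, Mapping

# Hypothetical cookie name; the slides do not say what the cookie is called.
COLO_COOKIE = "primary_colo"


def pick_colo(
    cookies: Mapping[str, str],
    user_id: str,
    lookup_primary_colo: Callable[[str], str],
) -> str:
    """Return the colo that should serve this request.

    If the request already carries a colo cookie, honor it. Otherwise ask
    the sticky-routing lookup (an assumed callback) for the user's primary colo.
    """
    colo = cookies.get(COLO_COOKIE)
    if colo:
        return colo
    return lookup_primary_colo(user_id)
```

A request with `{"primary_colo": "DC1"}` is routed to DC1 without a lookup; a request with no cookie falls through to the lookup, which in the slide's example answers DC2.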
![Page 14: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/14.jpg)
Diagram: Stickyrouting, BUCKETS (1, 2, 3, ... 10 and 91, 92, 93, ... 100), FABRIC (US-East)
![Page 15: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/15.jpg)
How does StickyRouting assign users to a colo?
Capacity of Fabric
Offline job to assign colo to users
Geographic distance to users
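The slide names only the inputs to the offline job (fabric capacity and geographic distance), not the algorithm. As one possible illustration, here is a greedy sketch that gives each user the nearest fabric that still has room. The function name `assign_colos`, the data shapes, and the greedy strategy itself are assumptions, not the method described in the talk.

```python
from typing import Dict


def assign_colos(
    user_distances: Dict[str, Dict[str, float]],
    capacity: Dict[str, int],
) -> Dict[str, str]:
    """Assign each user to the closest fabric with remaining capacity.

    user_distances maps user -> {fabric: distance}; capacity maps
    fabric -> number of users it can take. Both shapes are hypothetical.
    """
    remaining = dict(capacity)  # copy so the caller's dict is untouched
    assignment: Dict[str, str] = {}
    for user, distances in user_distances.items():
        # Try fabrics from nearest to farthest.
        for fabric in sorted(distances, key=distances.get):
            if remaining.get(fabric, 0) > 0:
                assignment[user] = fabric
                remaining[fabric] -= 1
                break
        else:
            raise ValueError(f"no fabric has capacity left for {user}")
    return assignment
```

With two users who are both closest to DC1 and a DC1 capacity of one, the first user processed gets DC1 and the second spills over to the next-nearest fabric.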
![Page 16: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/16.jpg)
Advantages of sticky routing
Lower latency for users
Store data where it’s necessary
Provides precise control over capacity allotment
![Page 17: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/17.jpg)
When to TrafficShift?
Impact Mitigation
Planned Maintenance
Stress Test
![Page 18: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/18.jpg)
Site Traffic and Disaster Recovery
Diagram: EDGE distributing load across four fabrics (US-West, US-Central, US-East, APAC); two fabrics show 0% distributed load and two show 50%
Traffic stops being served to offline fabrics
Traffic is shifted to online fabrics
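The effect shown on this slide, offline fabrics dropping to 0% while the online fabrics absorb the load, can be sketched as below. The even split is an assumption made for illustration (a real system would presumably weight by capacity), and the function name `distribute_load` and the choice of which fabrics are offline in the usage example are hypothetical.

```python
from typing import Dict, Iterable, Set


def distribute_load(fabrics: Iterable[str], offline: Set[str]) -> Dict[str, float]:
    """Return the percentage of site traffic each fabric should serve.

    Offline fabrics get 0%; the remaining traffic is split evenly
    across online fabrics (an illustrative simplification).
    """
    fabrics = list(fabrics)
    online = [f for f in fabrics if f not in offline]
    if not online:
        raise ValueError("no online fabric left to serve traffic")
    share = 100.0 / len(online)
    return {f: (0.0 if f in offline else share) for f in fabrics}


if __name__ == "__main__":
    all_fabrics = ["US-West", "US-Central", "US-East", "APAC"]
    # Which two fabrics are offline on the slide is not recoverable;
    # this pair is picked only for the example.
    print(distribute_load(all_fabrics, {"US-West", "APAC"}))
```

Taking two of the four fabrics offline leaves 50% on each of the other two, matching the percentages on the slide.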
![Page 19: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/19.jpg)
TrafficShift Architecture
Web application
Salt master
Stickyrouting Service
Couchbase Backend
Worker Processes
FABRIC BUCKETS
![Page 20: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/20.jpg)
What is Load Testing?
3 times a week
Peak hour traffic
Fixed SLA
Target Data Center: USW
![Page 21: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/21.jpg)
Load Testing
Diagram: FABRIC Target (US-West, US-East), Traffic Percentage (50%)
![Page 22: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/22.jpg)
Benefits of Load Test
Capacity Planning
Leverage production traffic to stress test services
Identify bugs in production
Confidence in Disaster Recovery
![Page 23: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/23.jpg)
Big Red Button
Kill switch (No Kidding)
Failout of a datacenter and PoP in less than 10 minutes
Minimal user impact
![Page 24: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/24.jpg)
Key Takeaways
● Design infrastructure to facilitate disaster recovery
● Stress test regularly to avoid surprises
● Automate everything to reduce time to mitigate impact
![Page 25: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/25.jpg)
Questions
![Page 26: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/26.jpg)
![Page 27: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/27.jpg)
Edge Failout
![Page 28: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/28.jpg)
Edge Presence
![Page 29: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/29.jpg)
LinkedIn’s PoP Architecture
• Using IPVS - Each PoP announces a unicast address and a regional anycast address
• APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the ‘best’ PoP
• DNS will either provide users with an anycast or unicast address for www.linkedin.com
• US and EU members are nearly all anycast
• APAC is all unicast
![Page 30: SRECon-Americas-2017: Trafficshift: Avoiding disasters at scale](https://reader035.vdocuments.mx/reader035/viewer/2022062503/58ecfedb1a28ab7d438b4675/html5/thumbnails/30.jpg)
LinkedIn’s PoP DR
• Sometimes need to fail out of PoPs
  • 3rd party provider issues (e.g. transit links going down)
  • Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on proxy to drain unicast traffic
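Deliberately failing a healthcheck to drain traffic can be illustrated with a small sketch. The slides do not describe how the proxy healthcheck works, so everything here is an assumption: the class name `ProxyHealthcheck`, the drain flag, and the use of HTTP-style status codes 200 and 503.

```python
class ProxyHealthcheck:
    """Tracks whether a PoP proxy should report itself healthy.

    Hypothetical model: whatever steers unicast traffic to this proxy is
    assumed to stop sending traffic once the check reports failure.
    """

    def __init__(self) -> None:
        self.draining = False

    def drain(self) -> None:
        """Start failing the healthcheck so traffic moves elsewhere."""
        self.draining = True

    def restore(self) -> None:
        """Pass the healthcheck again so traffic can return."""
        self.draining = False

    def status_code(self) -> int:
        """Return 503 while draining, 200 otherwise (assumed convention)."""
        return 503 if self.draining else 200
```

The point of the design is that the proxy itself stays up while draining: it keeps serving in-flight requests and only signals that it should stop receiving new traffic.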