milena talavera senior infrastructure manager@slack
TRANSCRIPT
![Page 1: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/1.jpg)
Flannel Slack’s Secret to Scale
Milena Talavera Senior Infrastructure Manager@Slack
![Page 2: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/2.jpg)
Our Mission: To make people’s working lives simpler, more pleasant, and more productive.
![Page 3: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/3.jpg)
![Page 4: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/4.jpg)
![Page 5: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/5.jpg)
![Page 6: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/6.jpg)
Slack Scale
❖ 8M+ Daily Active Users
   3M+ paid users; 65% of Fortune 100 Companies
❖ 100+ countries
   50%+ of DAU outside of US
![Page 7: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/7.jpg)
From supporting small teams 3-4 years ago
To serving gigantic organizations of hundreds of thousands of users today
![Page 8: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/8.jpg)
Slack Scale
To support such rapid growth of yesterday and today, Slack’s Infrastructure has to get ahead of customer growth
![Page 9: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/9.jpg)
![Page 10: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/10.jpg)
![Page 11: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/11.jpg)
Biggest Teams
2015 8,000 users
2016 26,000 users
2018 266,000 users
![Page 12: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/12.jpg)
Slack Architecture History Lesson
![Page 13: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/13.jpg)
Fat, greedy client
Fat, lazy client
Flannel Powered
Lazy + Flannel Powered
Resiliency
Scale: Pub/Sub
Slack Architecture History Lesson
2015 2017 2018
![Page 14: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/14.jpg)
![Page 15: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/15.jpg)
Fat, Greedy Client
[Architecture diagram labels: WebApp (PHP/Hack); Messaging Server (Java); HTTP; WebSocket; MySQL]
![Page 16: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/16.jpg)
User Connect Flow in 2015
Sequence diagram between Client and Server over connect time:
1. https://slack.com/api/rtm.start
2. HTTP response: a snapshot of the team
3. Long-lived WebSocket connection: real time events
![Page 17: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/17.jpg)
User Connect Flow in 2015
Advantages
○ Every Slack Object available locally on the client
○ User experience was super speedy
○ Enabled us to move fast
![Page 18: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/18.jpg)
User Connect Flow in 2015
Limitations
○ Expensive connection/reconnection
○ Large client memory footprint (grows with team size)
○ Susceptible to thundering herd
![Page 19: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/19.jpg)
Team Snapshot Size

| Number of users | Number of channels | Snapshot size (bytes) |
|---|---|---|
| 30 | 10 | 200K |
| 500 | 200 | 2.5M |
| 3,000 | 7,000 | 20M |
| 30,000 | 1,000 | 60M |

Max Team Sizes in 2015: ~8,000 users
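
To make the cost of a mass reconnect concrete, the arithmetic below multiplies each snapshot size in the table by the number of users on that team. It is a rough sketch only: it assumes the K and M suffixes mean decimal kilobytes and megabytes, and that every user reconnects at once and downloads the full snapshot. Neither assumption comes from the deck.

```python
# Rows from the Team Snapshot Size table: (users, channels, snapshot size).
# Assumption: "K" and "M" mean 10**3 and 10**6 bytes.
SNAPSHOTS = [
    (30, 10, 200_000),
    (500, 200, 2_500_000),
    (3_000, 7_000, 20_000_000),
    (30_000, 1_000, 60_000_000),
]


def herd_bytes(users: int, snapshot_bytes: int) -> int:
    """Bytes served if every user on the team reconnects at once."""
    return users * snapshot_bytes


for users, _channels, size in SNAPSHOTS:
    total_gb = herd_bytes(users, size) / 1e9
    print(f"{users:>6} users x {size / 1e6:>5.1f} MB = {total_gb:,.3f} GB")
```

Under these assumptions the 30,000-user row works out to 1.8 TB for a single simultaneous reconnect, which is why the snapshot approach is listed as susceptible to thundering herd.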
![Page 20: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/20.jpg)
Fat, greedy client
Fat, lazy client
Flannel Powered
Lazy + Flannel Powered
Resiliency
Scale: Pub/Sub
Slack Architecture History Lesson
2015 2017 2018
![Page 21: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/21.jpg)
User Connect Flow in 2015
Sequence diagram between Client and Server over connect time:
1. https://slack.com/api/rtm.start
2. HTTP response: a snapshot of the team
3. Long-lived WebSocket connection: real time events
![Page 22: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/22.jpg)
User Connect Flow in 2016
Sequence diagram between Client and Server over connect time:
1. https://slack.com/api/rtm.start
2. HTTP response: a partial snapshot of the objects
3. Long-lived WebSocket connection: pruned real time events
4. Asynchronous fetch of non-essential objects
![Page 23: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/23.jpg)
User Connect Flow in 2016
Incremental Improvements
○ Load less data at client boot time
○ Parallelized, lazy loading on demand
○ Simplified objects
On a 10,000-user team, these changes alone saved a few megabytes of data.
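
The "lazy loading on demand" idea can be sketched as a client-side store that asks the server for an object only the first time it is needed and keeps it locally afterwards. This is an illustration, not Slack's client code: `LazyStore` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real network call.

```python
from typing import Callable, Dict


class LazyStore:
    """Client-side object store that fetches objects only when first needed."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch          # hypothetical stand-in for a server call
        self._cache: Dict[str, dict] = {}
        self.fetch_count = 0         # how many times the server was asked

    def get(self, object_id: str) -> dict:
        if object_id not in self._cache:
            self.fetch_count += 1
            self._cache[object_id] = self._fetch(object_id)
        return self._cache[object_id]


def fake_fetch(object_id: str) -> dict:
    # Stand-in for a network request; returns a simplified object.
    return {"id": object_id, "name": f"user-{object_id}"}


store = LazyStore(fake_fetch)
store.get("U1")
store.get("U1")   # served from the local cache, no second fetch
store.get("U2")
print(store.fetch_count)
```

Nothing is loaded at boot in this sketch; the trade-off is that the first access to each object pays a round trip.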
![Page 24: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/24.jpg)
User Connect Flow in 2016
Still Limitations
○ Still susceptible to thundering herd if clients dump their cache
○ Still grows with team size
![Page 25: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/25.jpg)
Fat, greedy client
Fat, lazy client
Powered by Flannel
Lazy + Powered by Flannel
Resiliency
Scale: Client Pub/Sub
Slack Architecture History Lesson
2015 2017 2018
![Page 26: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/26.jpg)
Flannel Powered Slack
Flannel: Slack’s edge cache service
○ A query engine backed by cache on edge locations
![Page 27: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/27.jpg)
Powered by Flannel
[Architecture diagram labels: WebApp (PHP/Hack); Messaging Server (Java); MySQL; Cache; Edge PoPs; Client; Edges; Non-edge locations]
1. WebSocket connection
2. HTTP POST: download a snapshot of the team
3. WebSocket: stream JSON events to keep cache updated
![Page 28: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/28.jpg)
Flannel Deployment Architecture
[Diagram: Client; GeoDNS; Edge Region A, Edge Region B, Edge Region C. Each edge region contains HAProxy, Team Affinity, and several Flannel hosts.]
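
The diagram labels the routing step only as "Team Affinity". One common way to obtain that kind of affinity is a consistent-hash ring, sketched below so that every connection for a given team lands on the same host. The deck does not say which mechanism is used, so the ring, the host names, and the `TeamAffinityRing` class are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class TeamAffinityRing:
    """Maps a team ID to one host so that team's cache lives in one place."""

    def __init__(self, hosts: List[str], replicas: int = 100) -> None:
        # Each host gets several points on the ring to even out the load.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{host}#{i}"), host)
            for host in hosts
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def host_for(self, team_id: str) -> str:
        # First ring point clockwise from the team's hash, wrapping around.
        index = bisect.bisect(self._points, _hash(team_id)) % len(self._ring)
        return self._ring[index][1]


# Hypothetical host names.
ring = TeamAffinityRing(["flannel-1", "flannel-2", "flannel-3", "flannel-4"])
print(ring.host_for("T12345"))
```

With a ring like this, removing one host only remaps the teams that were on that host; teams on the other hosts keep their warm caches.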
![Page 29: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/29.jpg)
Flannel Powered Slack 2017
Advantages
○ Clients have low latency access to key big objects through edge/pop regions
○ Minimal client changes were needed to implement
○ More query flexibility and filtering than typical cache solutions like memcache
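
The last advantage, query flexibility beyond a key-value cache, can be illustrated with a small in-memory cache that answers both a plain key lookup and a filtered prefix search. This is a sketch under assumptions: the `TeamUserCache` class, its field names, and the `deleted` flag are hypothetical and are not taken from Flannel.

```python
from typing import Dict, List


class TeamUserCache:
    """In-memory cache of one team's users that can answer simple queries."""

    def __init__(self) -> None:
        self._users: Dict[str, dict] = {}

    def put(self, user: dict) -> None:
        self._users[user["id"]] = user

    def get(self, user_id: str) -> dict:
        # Plain key lookup: what a key-value cache offers.
        return self._users[user_id]

    def search(self, prefix: str, limit: int = 5,
               include_deleted: bool = False) -> List[dict]:
        # Prefix match plus a filter, evaluated next to the cached data.
        prefix = prefix.lower()
        matches = [
            u for u in self._users.values()
            if u["name"].lower().startswith(prefix)
            and (include_deleted or not u.get("deleted", False))
        ]
        return sorted(matches, key=lambda u: u["name"])[:limit]


cache = TeamUserCache()
for uid, name, deleted in [("U1", "alice", False), ("U2", "alan", False),
                           ("U3", "albert", True), ("U4", "bob", False)]:
    cache.put({"id": uid, "name": name, "deleted": deleted})

print([u["name"] for u in cache.search("al")])
```

A prefix search of this shape is the kind of query that a feature such as name autocomplete needs, and it returns a handful of matches instead of the whole user list.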
![Page 30: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/30.jpg)
Features Powered by Flannel
Quick Switcher
![Page 31: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/31.jpg)
Features Powered by Flannel
Mention Suggestions
![Page 32: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/32.jpg)
Features Powered by Flannel
Channel Header
![Page 33: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/33.jpg)
Features Powered by Flannel
Channel Sidebar
![Page 34: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/34.jpg)
Flannel Powered Slack 2017
Limitations
○ Keeping Flannel cache updated is expensive (firehose feed of events)
○ Thundering herd phenomenon is still a possibility
○ Cache on the websocket is in the critical path
![Page 35: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/35.jpg)
Fat, greedy client
Fat, lazy client
Powered by Flannel
Lazy + Powered by Flannel
Resiliency
Scale: Client Pub/Sub
Slack Architecture History Lesson
2015 2017 2018
![Page 36: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/36.jpg)
Powered by Flannel V1.5
Impactful Improvements
○ Thrift Pub/Sub reducing number of events processed by 1000X
![Page 37: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/37.jpg)
Powered by Flannel V1.5
[Architecture diagram labels: WebApp (PHP/Hack); Messaging Server (Java); MySQL; Cache; Edge PoPs; Client; Edges; Non-edge locations]
1. WebSocket connection
2. HTTP POST: download a snapshot of the team
3. Pub/Sub Thrift events to keep cache updated
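
The difference between a firehose feed and pub/sub can be sketched with a tiny broker that delivers an event only to hosts subscribed to that event's team. This is an illustration with assumptions stated up front: plain Python dicts stand in for Thrift messages, the `Broker` class is hypothetical, and the event mix is made up, so the ratio it prints says nothing about Slack's real traffic.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class Broker:
    """Delivers each event only to hosts subscribed to the event's team."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, team_id: str, handler: Handler) -> None:
        self._subscribers[team_id].append(handler)

    def publish(self, event: dict) -> int:
        # Returns how many subscribers received the event.
        handlers = self._subscribers.get(event["team"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)


received: List[dict] = []
broker = Broker()
broker.subscribe("T1", received.append)   # this host only caches team T1

# Made-up traffic: 1,000 events spread evenly over 100 teams.
events = [{"team": f"T{i % 100}", "type": "user_change"} for i in range(1000)]
for event in events:
    broker.publish(event)

print(len(events), len(received))
```

A host fed by a firehose would have to process all 1,000 events here; the subscribed host sees only the 10 that belong to the team it caches.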
![Page 38: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/38.jpg)
[Chart: number of events, Before vs. After]
![Page 39: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/39.jpg)
Powered by Flannel V1.5
Impactful Improvements
○ Client lazily loads primary objects (users, channels, channel membership), significantly reducing boot time
Max Team Sizes in 2018: ~266,000 users
![Page 40: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/40.jpg)
Fat, greedy client
Fat, lazy client
Powered by Flannel
Lazy + Powered by Flannel
Resiliency
Scale: Client Pub/Sub
Slack Architecture History Lesson
2015 2017 2018
![Page 41: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/41.jpg)
Resiliency
As scale increases, failures are more likely to happen. Our goal is to minimize blast radius and recovery time of failure modes.
![Page 42: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/42.jpg)
Resiliency
Our observation: when failures happen, they happen faster than one can blink an eye. The solution cannot rely on human intervention.
![Page 43: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/43.jpg)
Resiliency
Automated Admission Control
![Page 44: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/44.jpg)
Resiliency
Measures Taken
○ Automated Admission Control based on various metrics. Examples: memory pressure, concurrent requests, etc.
Automated Admission Control
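
A minimal sketch of admission control driven by the two example metrics named above, memory pressure and concurrent requests. The thresholds, the `Limits` and `AdmissionController` names, and the idea of passing the memory reading in as a fraction are assumptions for illustration; the deck does not describe the actual policy.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    max_memory_fraction: float = 0.85   # hypothetical threshold
    max_concurrent: int = 1000          # hypothetical threshold


class AdmissionController:
    """Admits or rejects new requests based on current load metrics."""

    def __init__(self, limits: Limits) -> None:
        self.limits = limits
        self.in_flight = 0

    def try_admit(self, memory_fraction: float) -> bool:
        if memory_fraction >= self.limits.max_memory_fraction:
            return False                # shed load under memory pressure
        if self.in_flight >= self.limits.max_concurrent:
            return False                # too many concurrent requests
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Call when an admitted request finishes.
        self.in_flight = max(0, self.in_flight - 1)


controller = AdmissionController(Limits(max_memory_fraction=0.85, max_concurrent=2))
print(controller.try_admit(memory_fraction=0.40))   # admitted
print(controller.try_admit(memory_fraction=0.40))   # admitted
print(controller.try_admit(memory_fraction=0.40))   # rejected: concurrency limit
controller.release()
print(controller.try_admit(memory_fraction=0.90))   # rejected: memory pressure
```

Because the decision is a cheap local check, it can run on every request without waiting for a human to react.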
![Page 45: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/45.jpg)
Resiliency
Circuit Breakers
![Page 46: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/46.jpg)
Resiliency
Measures Taken
○ Built-in Circuit Breakers to mitigate cascading failures and protect services from each other’s bad behaviours
Circuit Breakers
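
The circuit breaker pattern itself can be sketched in a few lines: after repeated failures the breaker stops calling the failing dependency for a cool-down period, then lets one trial call through. This is a generic textbook version, not Slack's implementation; the `CircuitBreaker` class, its parameters, and the default values are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: half-open, allow one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller wraps each request to a dependency, for example `breaker.call(lambda: fetch_snapshot(team_id))` with a hypothetical `fetch_snapshot`, and treats `CircuitOpenError` as a signal to fail fast or fall back instead of piling more load onto a struggling service.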
![Page 47: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/47.jpg)
What Else
Regional Failover
Auto Scaling
![Page 48: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/48.jpg)
Fat, greedy client
Fat, lazy client
Powered by Flannel
Lazy + Powered by Flannel
Resiliency
Scale: Client Pub/Sub
Slack Architecture History Lesson
2015 2017 2018
![Page 49: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/49.jpg)
Sneak Peek into the Future
Expand Pub/Sub to Client Side
○ Reduce events clients have to handle
○ Track what is in the current view
○ Subscribe/Unsubscribe to events when view changes
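
The three bullets above can be sketched as a small helper that tracks which objects are in the current view and subscribes or unsubscribes as the view changes. The deck describes this only as a future direction, so the `ViewSubscriptions` class, its callbacks, and the object IDs are hypothetical.

```python
from typing import Callable, List, Set, Tuple


class ViewSubscriptions:
    """Keeps subscriptions in sync with the objects visible in the current view."""

    def __init__(self, subscribe: Callable[[str], None],
                 unsubscribe: Callable[[str], None]) -> None:
        self._subscribe = subscribe
        self._unsubscribe = unsubscribe
        self._current: Set[str] = set()

    def on_view_change(self, visible_ids: Set[str]) -> None:
        for object_id in sorted(self._current - visible_ids):
            self._unsubscribe(object_id)    # no longer on screen
        for object_id in sorted(visible_ids - self._current):
            self._subscribe(object_id)      # newly on screen
        self._current = set(visible_ids)


log: List[Tuple[str, str]] = []
subs = ViewSubscriptions(
    subscribe=lambda oid: log.append(("sub", oid)),
    unsubscribe=lambda oid: log.append(("unsub", oid)),
)
subs.on_view_change({"U1", "U2"})
subs.on_view_change({"U2", "U3"})
print(log)
```

Only the difference between the old and new view generates traffic: in the second call `U2` stays subscribed, `U1` is dropped, and `U3` is added.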
![Page 50: Milena Talavera Senior Infrastructure Manager@Slack](https://reader030.vdocuments.mx/reader030/viewer/2022012421/6175b76410054e36a05fe50d/html5/thumbnails/50.jpg)
THANK YOU! Got Questions?
Milena