card: a congestion-aware request dispatching scheme for ... · •data collected and stored in an...
TRANSCRIPT
![Page 1: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/1.jpg)
CARD: A Congestion-Aware Request Dispatching Scheme for Replicated Metadata Server Cluster
Shangming Cai, Dongsheng Wang,
Zhanye Wang and Haixia Wang
Tsinghua University
1
![Page 2: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/2.jpg)
Background: Massive-scale ML in product environments
• Datasets updated hourly or daily
• data collected and stored in an HDFS-like distributed filesystem
• periodically offline training for online inference
• Challenges of the data-reader pipeline while training
• extremely heavy read workloads: millions to billions of files per epoch
• random access pattern: up-level shuffling for convergence speed
2
![Page 3: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/3.jpg)
Background: Massive-scale ML in product environments
• Workers interact with a DFS
• Metadata request
-> metadata server (MDS)
• File I/O
-> object storage devices (OSD)
3
OSD
Distributed filesystem
requests / data
Metadata Server
OSD OSD OSD
Training workers
……
![Page 4: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/4.jpg)
OSD
When the number of training workers grows…
• Extremely stressed workloads
• Metadata access stepbottlenecks the data-reader pipeline
• Potential single point of failure on MDS
4
Distributed filesystem
requests / data
Metadata Server
Training workers
……
……OSD OSD OSD
![Page 5: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/5.jpg)
Typical industrial response: Scaling out likewise
• Concerns to be addressed:
• Cost-effectiveness
• Scalability
• Run-time stability
5
OSD
Distributed filesystem
requests / data
MDS
OSD OSD OSD
Training workers
……
MDS MDS……
……
![Page 6: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/6.jpg)
To achieve load-balance…
6
OSD
Distributed filesystem
MDS
OSD OSD OSD
Training workers
……
MDS MDS……
……
Load balancer
•A middle layer load-balancer• Pros:
• good global load balancing
• more features are optional
• Cons:• load-balancer is stressed
• reintroduce a potential single point of failure
• not cost-effective
![Page 7: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/7.jpg)
To achieve load-balance…
7
OSD
Distributed filesystem
MDS
OSD OSD OSD
Training workers
……
MDS MDS……
……
Load balancer
•A middle layer load-balancer• Pros:
• good global load balancing
• more features are optional
• Cons:• load-balancer is stressed
• reintroduce a potential single point of failure
• not cost-effective
![Page 8: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/8.jpg)
Try client-side solutions
• Easy to implement
• Cost-effective
8
OSD
Distributed filesystem
MDS
OSD OSD OSD
Training workers
……
MDS MDS……
……
client−side solutions
![Page 9: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/9.jpg)
Client-side solution: Round-Robin
• Round-Robin
• Pros:• simple yet effective in
homogeneous environments
• Cons:• inflexible and inefficient in
shifting or heterogeneous environments
9
MDS 0
Clients (training workers)
MDS 1
MDS 3
MDS 2
![Page 10: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/10.jpg)
Client-side solution: Heuristic selection
•Heuristic selection• e.g., prefer lowest MART (moving
average of response time)
• Pros:• effective when facing light-
weight workloads• Cons:
• cause herd-behavior and load-oscillations
1010
MDS 0
MDS 1
MDS 3
MDS 2
20 ms40 ms 15 ms 25 ms
Clients
![Page 11: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/11.jpg)
Client-side solution: Round-Robin with Throttling
11
MDS 0
MDS 1
MDS 3
MDS 2
30 ms25 ms 5 ms 20 ms
Threshold: 50 ms
• Round-Robin with throttling• e.g., LADS, preset a MART threshold
to mark servers as congested
• Light-weight workloads• = Round-Robin
• Heavy workloads• = Heuristic selection • herd-behavior and load-
oscillations remain
Clients
![Page 12: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/12.jpg)
• Round-Robin with throttling• e.g., LADS, preset a MART threshold
to mark servers as congested
• Light-weight workloads• = Round-Robin
• Heavy workloads• = Heuristic selection • herd-behavior and load-
oscillations remain
12
MDS 0
MDS 1
MDS 3
MDS 2
60 mscongested
55 mscongested
40 ms
Threshold: 50 ms
65 mscongested
Client-side solution: Round-Robin with ThrottlingClients
![Page 13: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/13.jpg)
CARD: Congestion-Aware Request Dispatching scheme
• Core idea: Round-Robin with adaptive rate-control• inspired by CUBIC for TCP protocol• counting-based implementation• no extra info required from servers
• Light-weight workloads• = Round-Robin
• Heavy workloads• redirect requests from overloaded MDS to underloaded MDS• suppress upcoming requests: if and only if all servers are overloaded
13
![Page 14: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/14.jpg)
• Queue: place pending requests
• Selector: Round-Robin dispatching
• Rate-limiter: rate-control module
• Feedback: process feedbacks and forward replies
14
Congestion-aware rate-control mechanism
MDS 0
Process unit at clients
MDS 1
MDS 3
MDS 2
RL
Selector Feedback
RL RL RL
repliesrequests
Queue
![Page 15: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/15.jpg)
• Restrict requests routed to each MDSper 𝛿 time window
• Gradually increase the restriction according to a cubic growth function
• Feedback module computes receivingrates after each time window and forwards to RLs
15
Congestion-aware rate-control mechanism
MDS 0
Process unit at clients
MDS 1
MDS 3
MDS 2
RL
Selector Feedback
RL RL RL
repliesrequests
Queue
![Page 16: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/16.jpg)
• How to identify a congestion event?• sending rate > receiving rate
• elapsed time since last sending rate ↑event > 𝜆 (a hysteresis period )
• What to do then?• record current sending rate as
saturated sending rate
• reduce current sending rate
16
Congestion-aware rate-control mechanism
MDS 0
Process unit at clients
MDS 1
MDS 3
MDS 2
RL
Selector Feedback
RL RL RL
repliesrequests
Queue
![Page 17: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/17.jpg)
• ∆𝑡: elapsed time since the last congestion event
• 𝑀𝑖𝑗 : saturated sending rate
• Changed to current sending rate adaptively whenever a congestion event happens
• Then, current sending rate reduced to (1 − 𝛽) ∙ 𝑀𝑖𝑗 , and start to grow all over again accordingly
17
The cubic growth function for the rate-control
![Page 18: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/18.jpg)
Evaluation setup
• We implemented a prototype RMSC for simulation purposes
• Up to 8 servers to measure system scalability
• Crafted descending setup for heterogeneous experiments
• 10 clients run on separate machines launching request with Poisson arrivals
• 𝛿 = 5 ms, 𝜆 = 10 ms, 𝛽 =0.20
• To compare against CARD, we implemented aforementionedRound-Robin, MART and LADS as well
• Refer to the paper for more setup details
18
![Page 19: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/19.jpg)
Evaluation highlights
• Do CARD’s rate-control mechanism work as expected?
• Yes, the rate-control process is effective and adaptive
• Loads among servers are balanced under heavy workloads
• Can CARD achieve better scalability?
• In homogeneous clusters: CARD ≈ Round-Robin > other strategies
• In heterogeneous clusters: Yes, CARD > other strategies
19
![Page 20: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/20.jpg)
Examples of the rate-control procedure
The sending rate from each client to each server is adjusted adaptively according to the receiving rate
20
![Page 21: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/21.jpg)
Overall arriving rates in the homogeneous cluster
1) Heuristic selections cause severe herd behavior and load-oscillations
2) A data loading job is completed earlier when using CARD 21
CARDMART
![Page 22: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/22.jpg)
Overall arriving rates in the heterogeneous cluster
22
CARDLADS
1) A basic threshold throttling strategy is not sufficient enough
2) Arriving rates are stabilized around servers’ capacity when using CARD
![Page 23: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/23.jpg)
Overall throughput in the homogeneous cluster
23
• Heuristic selection is a bad choice under heavy workloads
• In ideal homogenous environments, Round-Robin and CARD achieve great scalability
![Page 24: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/24.jpg)
• Round-Robin is ineligible when facing heterogenous setups
• CARD outperforms other strategies and achieves excellent scalability
Overall throughput in the heterogeneous cluster
24
![Page 25: CARD: A Congestion-Aware Request Dispatching Scheme for ... · •data collected and stored in an HDFS-like distributed filesystem •periodically offline training for online inference](https://reader033.vdocuments.mx/reader033/viewer/2022051606/601991a48a7c3e21923b4040/html5/thumbnails/25.jpg)
Summary: CARD
• Adaptive client-side throttling method: easy and efficient
• Redirect requests from the overloaded server to the underloaded server adaptively under heavy workloads
• Degrade into pure Round-Robin when facing light-weight workloads
• Boosts throughput significantly over competing strategies in heterogeneous environments
25