![Distributed Machine Learning. Asynchronous Byzantine Machine Learning, ICML 2018. Byzantine resilience against f/n workers, 3f < n; optimal slowdown; provable (almost sure) convergence](https://reader030.vdocuments.mx/reader030/viewer/2022040609/5ecd7d5f860b3d1d3b6328ff/html5/thumbnails/1.jpg)
Distributed Machine Learning
Georgios Damaskinos, 2018
Machine Learning?
Machine Learning “in a nutshell”
Machine Learning algorithm
Safety?
Convergence
Cost function
Machine Learning?
Think big!
Example: Image Classification
Data: ImageNet, 1.3 million training images (224 x 224 x 3)
Model: ResNet-152, 60.2 million parameters (model size)
Training time (single node, TensorFlow): 19 days!
Performance?
Training time (single node, TensorFlow): 19 days
Training time (distributed, 1024 nodes, theoretical): 25 minutes
CSCS (3rd top supercomputer, 4500+ GPUs, state-of-the-art interconnect):
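The "theoretical" number is just ideal linear scaling, a back-of-the-envelope check (real systems lose time to communication and stragglers):

```python
# Ideal linear speedup: divide single-node training time by the node count.
single_node_days = 19
minutes_total = single_node_days * 24 * 60   # 27,360 minutes on one node
nodes = 1024
ideal_minutes = minutes_total / nodes        # assumes zero coordination cost
print(round(ideal_minutes, 1))               # ~26.7, i.e. roughly 25 minutes
```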
State-of-the-art (ResNet-50): 1 hour [GP+17]
● Batch size = 8192
● 256 GPUs
[GP+17] Goyal, Priya, et al. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." 2017.
Distributed … how?
Data Parallelism
Batch Learning
Parallel Batch Learning
• Partition data
• Parallel compute on partitions
Parallel Synchronous Mini-Batch Learning
• More frequent updates
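The synchronous scheme can be sketched as a toy in plain Python (an illustration under simplified assumptions, not the deck's code; the quadratic loss, `grad`, and `sync_step` are invented for the example):

```python
# Toy synchronous parallel mini-batch SGD: each worker computes a gradient on
# its data shard, the server waits for all of them (barrier), averages, and
# applies a single update to the shared parameters.
def grad(w, shard):
    # Gradient of the loss 0.5*(w - x)^2 averaged over the shard.
    return sum(w - x for x in shard) / len(shard)

def sync_step(w, shards, lr=0.1):
    grads = [grad(w, s) for s in shards]      # barrier: all workers finish
    return w - lr * sum(grads) / len(grads)   # server averages, then updates

shards = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]  # partitioned data
w = 0.0
for _ in range(200):
    w = sync_step(w, shards)
print(round(w, 3))  # converges near the data mean 2.5
```

The barrier is the cost: every step advances at the pace of the slowest worker, which motivates the asynchronous variant on the next slides.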
Parallel Asynchronous Mini-Batch Learning
• Gradients computed using stale parameters
• Increased utilization
• Central lock
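The staleness effect can be simulated in a few lines (a toy sketch; the 3-step delay and quadratic loss are arbitrary choices for illustration):

```python
# Toy simulation of asynchronous SGD staleness: a slow worker returns a
# gradient computed at an old ("stale") parameter snapshot, and the server
# applies it without any barrier.
def grad(w):              # gradient of the loss 0.5*(w - 2)^2
    return w - 2.0

w, lr = 0.0, 0.1
pending = []              # parameter snapshots grabbed by in-flight workers
for step in range(100):
    pending.append(w)     # worker reads the current parameters...
    if len(pending) > 3:  # ...but its gradient arrives 3 steps late
        stale_w = pending.pop(0)
        w -= lr * grad(stale_w)   # update applied with a stale gradient
print(round(w, 3))
```

With bounded staleness and a small learning rate the toy run still converges near 2.0; large delays or large steps can make such schemes diverge, which is why staleness must be controlled.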
Distributed ML
● Parallelism
    ○ Model
    ○ Data
● Learning
    ○ Synchronous
    ○ Asynchronous
Distributed ML: Challenges
1. Scalability
2. Privacy
3. Security
Scalability - Asynchrony
Scalability - Communication
ImageNet classification (ResNet-152): model/update size ≈ 250 MB
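The ~250 MB figure follows from the parameter count above, assuming each parameter is stored as a 32-bit float:

```python
# Back-of-the-envelope model/update size for ResNet-152.
params = 60.2e6               # ~60.2 million parameters
bytes_per_param = 4           # float32
size_mb = params * bytes_per_param / 1e6
print(round(size_mb, 1))      # ~240.8 MB, i.e. roughly 250 MB per update
```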
Scalability - Communication
ImageNet classification (ResNet-152): model/update size ≈ 250 MB
Compression
● Distillation [PPA+18]
● Quantization [DGL+17]
    ○ SignSGD [BJ+18]
[DGL+17] Alistarh, Dan, et al. "QSGD: Communication-efficient SGD via gradient quantization and encoding." NIPS 2017.
[PPA+18] Polino, Antonio, Razvan Pascanu, and Dan Alistarh. "Model compression via distillation and quantization." ICLR 2018.
[BJ+18] Bernstein, Jeremy, et al. "signSGD: compressed optimisation for non-convex problems." ICML 2018.
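A minimal sketch of the sign-compression idea: send one bit per coordinate instead of 32. The coordinate-wise majority vote below follows the spirit of signSGD's voting variant, but this is an illustration, not the paper's algorithm:

```python
# Workers compress gradients to their signs; the server aggregates by a
# coordinate-wise majority vote over the received sign vectors.
def sign(x):
    return 1 if x > 0 else (-1 if x < 0 else 0)

def compress(grad):                 # 32-bit floats -> 1-bit signs
    return [sign(g) for g in grad]

def majority_vote(sign_grads):      # server: vote per coordinate
    return [sign(sum(col)) for col in zip(*sign_grads)]

workers = [[0.3, -1.2, 0.7], [0.1, -0.4, -0.2], [2.0, -0.9, 0.5]]
agg = majority_vote([compress(g) for g in workers])
print(agg)  # [1, -1, 1]
```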
Distributed ML: Challenges
1. Scalability
    a. Asynchrony
    b. Communication efficiency
2. Privacy
3. Security
• Medical data
• Photos
• Search logs
• ...
Privacy
Differential Privacy
● Decentralized Learning [BGT+18]
● Compression <-> DP [AST+18]
Local Privacy
● MPC
[BGT+18] Bellet, A., Guerraoui, R., Taziki, M., & Tommasi, M. "Personalized and Private Peer-to-Peer Machine Learning." AISTATS 2018.
[AST+18] Agarwal, N., Suresh, A. T., Yu, F., Kumar, S., & McMahan, H. B. "cpSGD: Communication-efficient and differentially-private distributed SGD." NIPS 2018.
Distributed ML: Challenges
1. Scalability
    a. Asynchrony
    b. Communication efficiency
2. Privacy
    a. Differential Privacy
    b. Local Privacy
3. Security
Security: Byzantine worker
Security: Synchronous BFT
Krum [Blanchard, Peva, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. "Machine learning with adversaries: Byzantine tolerant gradient descent." NIPS 2017.]
● Byzantine resilience against f/n workers, 2f + 2 < n
● Provable convergence (i.e., safety)
How?
1. Worker i: score(i) = sum of squared distances from gradient i to its n - f - 2 closest gradients; select the gradient with the minimum score
2. m-Krum: average the m gradients with the lowest scores
Majority + squared-distance-based decision -> BFT
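The scoring rule can be written down directly; below is a toy scalar version of Krum (the actual algorithm operates on gradient vectors, but the rule is the same):

```python
# Krum: score each gradient by the sum of squared distances to its
# n - f - 2 closest neighbours, then select the lowest-scoring gradient.
def krum(grads, f):
    n = len(grads)
    assert 2 * f + 2 < n                     # resilience condition
    scores = []
    for i, gi in enumerate(grads):
        dists = sorted((gi - gj) ** 2 for j, gj in enumerate(grads) if j != i)
        scores.append(sum(dists[: n - f - 2]))   # n - f - 2 closest neighbours
    return grads[scores.index(min(scores))]

honest = [1.0, 1.1, 0.9, 1.05]
byzantine = [100.0]                   # one attacker sends a huge gradient
print(krum(honest + byzantine, f=1))  # 1.05: a gradient from the honest cluster
```

Because only the closest n - f - 2 distances count, an outlier gradient cannot make an honest gradient look bad, and the outlier's own score is enormous.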
Security: Asynchronous BFT
Kardam [Damaskinos, G., El Mhamdi, E. M., Guerraoui, R., Patra, R., & Taziki, M. "Asynchronous Byzantine Machine Learning." ICML 2018.]
● Byzantine resilience against f/n workers, 3f < n
● Optimal slowdown
● Provable (almost sure) convergence (i.e., safety)
How?
1. Lipschitz filtering component => Byzantine resilience
2. Staleness dampening component => asynchronous convergence
Asynchrony can be viewed as Byzantine behavior
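A loose sketch of the Lipschitz-filtering idea (the paper's exact test, quantile, and bookkeeping differ; the median rule and `factor=2.0` threshold here are invented for illustration, and scalars are used for brevity):

```python
# Idea: an honest gradient should change about as fast as the parameters do,
# so reject updates whose empirical "Lipschitz coefficient"
# |g - g_prev| / |w - w_prev| is far above those observed so far.
def lipschitz_coeff(g, g_prev, w, w_prev):
    return abs(g - g_prev) / max(abs(w - w_prev), 1e-12)

def accept(coeff, history, factor=2.0):
    if not history:
        return True
    median = sorted(history)[len(history) // 2]
    return coeff <= factor * median   # hypothetical threshold rule

history = [1.0, 1.1, 0.9, 1.05]               # coefficients from past updates
honest = lipschitz_coeff(2.0, 1.9, 0.5, 0.4)  # ~1.0, consistent with history
attack = lipschitz_coeff(50.0, 1.9, 0.5, 0.4) # gradient jumps far too much
print(accept(honest, history), accept(attack, history))  # True False
```

The same filter also penalizes extremely stale gradients, which is one way to read the slide's point that asynchrony can be viewed as Byzantine behavior.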
Distributed ML: Challenges
1. Scalability
    a. Asynchrony
    b. Communication efficiency
2. Privacy
    a. Differential Privacy
    b. Local Privacy
3. Security
    a. Synchronous BFT
    b. Asynchronous BFT
Distributed ML: Frameworks
TensorFlow: Why?
Popularity
https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a
TensorFlow: Why?
Support
● Visualization tools
● Documentation
● Tutorials
TensorFlow: Why?
Portability - Flexibility - Scalability
TensorFlow: What is it?
● Dataflow graph computation
● Automatic differentiation (also for while loops [Y+18])
[Y+18] Yu, Yuan, et al. "Dynamic control flow in large-scale machine learning." EuroSys. ACM, 2018.
Tensor?
Multidimensional array of numbers
Examples:
● A scalar
● A vector
● A matrix
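In code, a tensor's rank is just the number of indices needed to address one entry (a toy nested-list version; real frameworks use contiguous arrays):

```python
# Rank 0, 1, and 2 tensors as plain Python values.
scalar = 3.0                       # rank 0: no index needed
vector = [1.0, 2.0, 3.0]           # rank 1: one index
matrix = [[1.0, 2.0], [3.0, 4.0]]  # rank 2: two indices

def rank(t):
    # Count nesting depth of (uniform) nested lists; illustrative only.
    r = 0
    while isinstance(t, list):
        r += 1
        t = t[0]
    return r

print(rank(scalar), rank(vector), rank(matrix))  # 0 1 2
```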
DataFlow?
● Computations are graphs
    ○ Nodes: operations
    ○ Edges: tensors
● Program phases:
    ○ Construction: create the graph
    ○ Execution: push data through the graph
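The two phases can be mimicked with a toy dataflow graph in plain Python (not the TensorFlow API; `Node` and its operations are invented for illustration):

```python
# Toy dataflow graph: build the graph first, then push data through it.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self, feed):                      # execution phase
        if self.op == "input":
            return feed[self]
        vals = [n.run(feed) for n in self.inputs]
        return vals[0] + vals[1] if self.op == "add" else vals[0] * vals[1]

# Construction phase: no computation happens here, only graph building.
x = Node("input")
y = Node("input")
z = Node("mul", Node("add", x, y), y)         # z = (x + y) * y

# Execution phase: feed tensors into the graph's input nodes.
print(z.run({x: 2.0, y: 3.0}))  # (2 + 3) * 3 = 15.0
```

Separating construction from execution is what lets a framework optimize, differentiate, and distribute the graph before any data flows.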
TensorFlow vs. Dataflow Frameworks
● Batch Processing
● Relaxed consistency
● Simplicity
    ○ No join operations
    ○ Input diff => new batch
Architecture
Learning
Is TensorFlow BFT? No!
How can we make it BFT?
[Damaskinos G., El Mhamdi E., Guerraoui R., Guirguis A., Rouault S.]