TRANSCRIPT
Securing Distributed Machine Learning in High Dimensions

Jiaming Xu

The Fuqua School of Business, Duke University

Joint work with Yudong Chen (Cornell) and Lili Su (MIT)

Workshop on Information, Learning and Decision, ShanghaiTech, July 1, 2018
Distributed Machine Learning: Robustness

• An attractive solution to large-scale problems
  ◦ Algorithms: [Boyd et al. 11], [Jordan, Lee and Yang 16], etc.
  ◦ Systems: [MapReduce, Dean and Ghemawat 08], etc.
• The necessity of robustness: Corrupted data
  ◦ Statistical noise: [Candès et al., JACM 11], [Loh and Wainwright, NIPS 11]
  ◦ Adversarial corruption, no structural assumptions: [Chen, Caramanis and Mannor, ICML 13], [Diakonikolas et al., FOCS 16], [Charikar et al., STOC 17]
Our Goal

• Implicit assumption of previous work: Reliable learning system
  ◦ Each computing device follows its designed specification
• Our focus: Unreliable learning system
  ◦ Adversarial attacks: Some unknown subset of computing devices are compromised and behave adversarially, such as sending out malicious messages

Goal: Secure model training in an unreliable learning system
Why consider an unreliable learning system?
Privacy Risk in the Conventional Learning Paradigm

• Data is collected from providers and stored in clouds
• Serious privacy risks:
  ◦ Facebook data scandal
  ◦ PRISM: Facebook, Google, Yahoo!, Apple, Microsoft, Dropbox, etc.

Jiaming Xu (Duke) Machine Learning Security 5
New Learning Paradigm: Federated Learning

Key idea: Leave training data on mobile devices

• Learning with external workers (data providers)
• Proposed by Google researchers [McMahan 16]
• Tested with Gboard, the Google keyboard, on Android
Security Risk in Federated Learning

• Less secure implementation environment
• External workers are prone to adversarial attack: they can be reprogrammed by system hackers and behave maliciously
Goal: Secure model training in an unreliable learning system
Challenges of Securing Unreliable Learning Systems

• Low local data volume versus high model complexity
  ◦ Local estimators are statistically inaccurate
  ◦ Hard to distinguish statistical errors from adversarial errors
  ◦ Calls for close interaction between the learner (cloud) and the workers
• Communication constraints: Data transmission suffers high latency and low throughput

Objectives
• Tolerate adversarial failures of the external workers
• Accurately learn highly complex models with low local data volume
• Use only a few communication rounds
Outline of the Remainder

1 Problem formulation
2 Algorithm 1: Geometric median of means
3 Algorithm 2 (optimal algorithm): Iterative reweighting + projecting + filtering
4 Summary and concluding remarks
Problem Formulation: Learning Model

• N i.i.d. data points X_i ∼ µ
• Collectively kept by m workers; each worker keeps N/m data points
• The learner wants to pick a model in Θ ⊆ R^d
• Loss function f(x, θ): loss induced by x ∈ X under the model choice θ ∈ Θ

Target: θ* ∈ arg min_{θ∈Θ} F(θ) := E[f(X, θ)]

NOTE: the population risk F(θ) is unknown
Example: Linear Regression

• N i.i.d. data points X_i = (w_i, y_i) ∼ µ
  ◦ w_i can be the features of a house/apartment, and y_i its sale price
• Θ ⊆ R^d: the set of possible linear predictors
• Loss function f(x, θ) = (1/2)(y − ⟨w, θ⟩)²

Target: θ* ∈ arg min_{θ∈Θ} E[(1/2)(y − ⟨w, θ⟩)²]
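For concreteness, the squared loss above has gradient ∇f(x, θ) = (⟨w, θ⟩ − y)·w. A minimal NumPy sketch (hypothetical random data, introduced only for illustration) checks this formula against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w, y = rng.normal(size=d), 1.5
theta = rng.normal(size=d)

def loss(theta):
    # f(x, theta) = (1/2) * (y - <w, theta>)^2
    return 0.5 * (y - w @ theta) ** 2

def grad(theta):
    # Analytic gradient: (<w, theta> - y) * w
    return (w @ theta - y) * w

# Central finite-difference approximation of the gradient.
eps = 1e-6
fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(d)])
assert np.allclose(fd, grad(theta), atol=1e-5)
```

Since the loss is quadratic in θ, the central difference is exact up to floating-point rounding, so the check is tight.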
Problem Formulation: Byzantine Fault Model

• In any iteration, up to q out of the m workers are compromised and behave arbitrarily;
• the set of faulty workers may differ across iterations;
• faulty workers have complete knowledge of the system;
• faulty workers can collude.
Algorithm: Byzantine Gradient Descent

The learner:
1 Broadcasts the current model parameter estimate θ_{t−1};
2 Waits to receive the gradients g_t^(j) from all workers j;
3 Robustly aggregates the gradients to obtain ∇F̂(θ_{t−1});
4 Updates: θ_t ← θ_{t−1} − η_t ∇F̂(θ_{t−1}).

Non-faulty worker j:
1 Computes the sample gradient g_t^(j) = Σ_{local data X_i} ∇f(X_i, θ_{t−1});
2 Sends g_t^(j) back to the learner.

Simple averaging, i.e., taking ∇F̂(θ_{t−1}) = (1/m) Σ_{j=1}^m g_t^(j), is not robust to even a single Byzantine failure!
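The loop above can be sketched end to end. The snippet below is a hypothetical setup, not the talk's actual aggregator: m = 10 workers hold synthetic linear-regression data, one Byzantine worker sends a huge vector, and a coordinate-wise median stands in for the robust aggregation step (the talk's analyzed aggregators, geometric median of means and iterative filtering, come later):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_local = 5, 10, 200
theta_star = rng.normal(size=d)

# Each worker j holds n_local samples (w_i, y_i) with y = <w, theta*> + noise.
W = [rng.normal(size=(n_local, d)) for _ in range(m)]
Y = [Wj @ theta_star + 0.1 * rng.normal(size=n_local) for Wj in W]

def local_gradient(j, theta):
    # Sample gradient of the squared loss at worker j (averaged over its data).
    return W[j].T @ (W[j] @ theta - Y[j]) / n_local

def byzantine_gd(aggregate, q=1, eta=0.1, T=50):
    theta = np.zeros(d)
    for _ in range(T):
        grads = np.array([local_gradient(j, theta) for j in range(m)])
        grads[:q] = 1e6          # q compromised workers send a malicious vector
        theta = theta - eta * aggregate(grads)
    return theta

theta_mean = byzantine_gd(lambda g: g.mean(axis=0))       # simple averaging
theta_med = byzantine_gd(lambda g: np.median(g, axis=0))  # a robust stand-in
```

With a single faulty worker, the averaged run is dragged arbitrarily far from θ*, while the robustly aggregated run still converges to a small neighborhood of θ*.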
Generic Key Technical Challenges

Target: θ* ∈ arg min_{θ∈Θ} F(θ) := E[f(X, θ)]

• If F(θ) were known: perfect gradient descent, θ_t = θ_{t−1} − η ∇F(θ_{t−1})
• But F(θ) is unknown: approximate gradient descent,
  θ'_t = θ'_{t−1} − η_t ∇F̂(θ'_{t−1}) = θ'_{t−1} − η_t ∇F(θ'_{t−1}) + ε(θ'_{t−1}).
  ◦ The elements of {ε(θ'_{t−1})}_{t=1}^∞ are dependent on each other;
  ◦ Complicated interplay between the randomness and the arbitrary behaviors of Byzantine workers.

Our analysis plan: show uniform convergence, i.e., show ε(θ) ≈ 0 uniformly for all θ ∈ Θ

Standard concentration results might not suffice
Algorithm I: Median of Means
Robust Gradient Aggregation: Median of Means

Median of means: given kn points X_1, . . . , X_{kn},

φ̂_MM := median( (1/n) Σ_{i=1}^n X_i, · · · , (1/n) Σ_{i=(k−1)n+1}^{kn} X_i )

Definition (Geometric median)
y* := med{y_1, · · · , y_m} = arg min_{y∈R^d} Σ_{i=1}^m ‖y − y_i‖_2

Efficient computation of the geometric median: nearly linear time [Cohen et al., STOC 2016]
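A small numerical sketch of both definitions (NumPy; the Weiszfeld fixed-point iteration used here is one standard way to compute the geometric median, not the nearly-linear-time method of [Cohen et al.]): batch means are formed as above, a few batches are corrupted, and the geometric median of the batch means stays near the true mean while the plain average does not.

```python
import numpy as np

def geometric_median(ys, iters=100, eps=1e-8):
    # Weiszfeld iteration for arg min_y sum_i ||y - y_i||_2:
    # repeatedly average the points weighted by 1/distance to the current iterate.
    y = ys.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(ys - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)   # eps floor avoids division by zero
        y = (w[:, None] * ys).sum(axis=0) / w.sum()
    return y

rng = np.random.default_rng(1)
k, n, d = 12, 50, 3
X = rng.normal(size=(k * n, d))                 # kn points with true mean 0
batch_means = X.reshape(k, n, d).mean(axis=1)   # the k means of consecutive batches
batch_means[:3] = 100.0                         # 3 of the k batch means are corrupted

mm = geometric_median(batch_means)              # geometric median of means
```

On this run ‖mm‖ stays at the scale of a single good batch mean (≈ 1/√n), whereas the plain average of the batch means is dragged far toward the corrupted batches.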
Robustness of Geometric Median

Definition (Geometric median)
y* := med{y_1, · · · , y_m} = arg min_{y∈R^d} Σ_{i=1}^m ‖y − y_i‖_2

• One-dimensional case: geometric median = standard median.
  If strictly more than ⌊n/2⌋ points lie in [−r, r] for some r ∈ R, then the median ALSO lies in [−r, r]
• Multi-dimensional case:

Lemma (Minsker et al. 2015)
For any α ∈ (0, 1/2) and given r ∈ R, if Σ_{i=1}^n 1{‖y_i‖_2 ≤ r} ≥ (1 − α)n, then ‖y*‖_2 ≤ C_α r, where C_α = (1 − α)/√(1 − 2α).

Intuition: majority voting in the noisy setting
Performance with Median of Means

(1) q ≥ 1: the maximum number of Byzantine workers; (2) d: model dimension, i.e., Θ ⊆ R^d

Theorem (Informal)
Suppose some mild technical assumptions hold, and 2(1 + ε)q ≤ k ≤ m. Assume F(θ) is M-strongly convex with L-Lipschitz gradient. Then whp

‖θ_t − θ*‖ ≤ ρ^t ‖θ_0 − θ*‖ + C √(dk/N), ∀t ≥ 1,

where ρ = 1/2 + (1/2)√(1 − M²/(4L²)) ∈ (0, 1).

• After log N rounds, √(dq/N) becomes the dominant part
• When q = 0, we choose k = 1
• When q is large, we choose k = 2(1 + ε)q, resulting in an error of O(√(dq/N))
Drawbacks of Geometric Median in High Dimensions

y* = arg min_y Σ_{i=1}^m ‖y − y_i‖ ⟺ Σ_{i=1}^m (y_i − y*)/‖y_i − y*‖ = 0

[Figure: a cloud of good data points at radius ≈ √d around the true mean, plus a concentrated cluster of corrupted points that shifts the geometric median by ≈ ε√d]

• Good data y_i ∼ N(µ, I_d)
• An ε fraction is adversarially corrupted
• GM suffers an ε√d error

Jiaming Xu (Duke) Machine Learning Security 21
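The ε√d effect is easy to reproduce numerically. A hypothetical sketch (Weiszfeld iteration for the geometric median; good points N(0, I_d), an ε = 0.2 fraction placed far out along the first coordinate): a back-of-the-envelope balance of the unit vectors in the optimality condition above predicts a shift of about ε√d/(1 − ε) ≈ 5 here, even though every coordinate-wise statistic of the good data is O(1).

```python
import numpy as np

def geometric_median(ys, iters=200, eps=1e-8):
    # Weiszfeld iterations, as in the earlier sketch of the definition.
    y = ys.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(ys - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)
        y = (w[:, None] * ys).sum(axis=0) / w.sum()
    return y

rng = np.random.default_rng(0)
m, d, frac = 400, 400, 0.2
ys = rng.normal(size=(m, d))   # good data: N(0, I_d), true mean 0
nbad = int(frac * m)
ys[:nbad] = 0.0
ys[:nbad, 0] = 1e4             # corrupted points placed far along e_1

gm = geometric_median(ys)
# gm[0] lands near frac * sqrt(d) / (1 - frac), i.e. around 5,
# while the remaining coordinates stay near 0.
```

At fixed ε the shift grows like √d, which is precisely the high-dimensional drawback the next algorithm removes.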
Algorithm II: Optimal Algorithm in High Dimension

[Su and Xu, 2018] improves the estimation error from O(√(dq/N)) to O(√(d/N) + √(q/N)), matching the minimax error rate in the ideal failure-free setting as long as q = O(d).
Ideas of Iterative Filtering [SCV ’18]

[Figure: good data cloud centered at µ, with corrupted points concentrated along an extreme direction u]

• If the center µ were known: from
  uu^T ∈ arg max_U Σ_i (y_i − µ)^T U (y_i − µ)  s.t. U ⪰ 0, Tr(U) ≤ 1,
  filter out outliers based on ⟨y_i − µ, u⟩²
• However, µ is unknown!
• Idea: represent y_i through Σ_j W_{ji} y_j; W_{ji} is constrained to be around 1/((1 − ε)m)
Iterative Filtering Algorithm [SCV ’18]

Define the cost function

φ(W, U) = Σ_{i∈S} c_i (y_i − Σ_{j∈S} W_{ji} y_j)^T U (y_i − Σ_{j∈S} W_{ji} y_j)

1 Compute a saddle point:
  (Center approximation) W* ∈ arg min_W max_U φ(W, U)
  (Extreme direction) U* ∈ arg max_U min_W φ(W, U)
2 If φ(W*, U*) is small enough, stop; otherwise, down-weight c_i proportionally to (y_i − Σ_{j∈S} W*_{ji} y_j)^T U* (y_i − Σ_{j∈S} W*_{ji} y_j), throw away data points for which c_i ≤ 1/2, and repeat.
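A much-simplified sketch of this loop (hypothetical data and stopping threshold; the weighted mean stands in for the saddle-point center W*, and the top eigenvector of the weighted covariance stands in for the extreme direction U*, rather than the exact min-max computation of [SCV ’18]):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, eps = 300, 20, 0.1
ys = rng.normal(size=(m, d))            # good points, true center 0
nbad = int(eps * m)
ys[:nbad] = rng.normal(size=d) + 8.0    # a colluding corrupted cluster

c = np.ones(len(ys))                    # per-point weights c_i
for _ in range(30):
    mu = (c[:, None] * ys).sum(axis=0) / c.sum()  # weighted center (stand-in for W*)
    z = ys - mu
    cov = (c[:, None] * z).T @ z / c.sum()
    evals, evecs = np.linalg.eigh(cov)
    u, lam = evecs[:, -1], evals[-1]              # extreme direction (stand-in for U*)
    if lam < 4.0:                                 # stop: no direction has inflated variance
        break
    score = (z @ u) ** 2                          # outlier score <y_i - mu, u>^2
    c = c * (1.0 - score / score.max())           # down-weight proportionally to the score
    keep = c > 0.5                                # throw away points with c_i <= 1/2
    ys, c = ys[keep], c[keep]
mu_hat = mu
```

On this toy run the corrupted cluster is removed in a single filtering round and ‖µ̂‖ stays at the √(d/m) noise scale, while the plain mean of the contaminated sample is several units away from the true center.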
Guarantees of Iterative Filtering Algorithm [SCV ’18]

Lemma (SCV ’18)
Define µ_S = (1/m) Σ_{i=1}^m y_i. Suppose that

‖(1/m) Σ_i (y_i − µ_S)(y_i − µ_S)^T‖_2 ≤ σ².

Then for ε ≤ 1/4, the Iterative Filtering Algorithm outputs µ̂ such that ‖µ̂ − µ_S‖ = O(σ√ε).

• Gradient vectors {g_j(θ_{t−1})}_{j=1}^m are not i.i.d.
• Apply with y_j = gradient functions: g_j(θ) = (1/|S_j|) Σ_{i∈S_j} ∇f(X_i, θ)
• Need concentration of the matrix [g_1(θ), . . . , g_m(θ)] uniformly over θ
Uniform Concentration of the Sample Covariance Matrix

• If the gradient functions g_j(θ) are sub-Gaussian, use an ε-net
• However, in many cases, such as linear regression, g_j(θ) is sub-exponential
• Existing tail bounds for matrices with sub-exponential columns are not tight. State of the art [ALPTJ ’10]: √(md) + d

Theorem (SX ’18)
Let A be a d × m matrix whose columns A_j are i.i.d. sub-exponential and zero-mean. Then with probability at least 1 − e^{−d},

‖A‖_2 ≲ √m + d log³ d

Remark: Tight up to poly-log factors
Guarantees of Aggregated Gradient by Iterative Filtering

Theorem (SX ’18)
Suppose some mild technical assumptions hold and N ≳ d². Let ∇F̂(θ) be the aggregated gradient function produced by the Iterative Filtering Algorithm. Then with probability at least 1 − 2e^{−√d},

‖∇F̂(θ) − ∇F(θ)‖ ≲ (√(q/N) + √(d/N)) ‖θ − θ*‖ + (√(q/N) + √(d/N))

• N ≳ d² is due to our sub-exponential assumption and is inevitable
• If assuming sub-Gaussian instead, only N ≳ d is needed
Main Convergence Result

Theorem (SX ’18)
Suppose some mild technical assumptions hold and N ≳ d². Assume F(θ) is M-strongly convex with L-Lipschitz gradient. Then whp,

‖θ_t − θ*‖ ≲ (1 − M²/(16L²))^t ‖θ_0 − θ*‖ + (√(q/N) + √(d/N)).

• Improves over geometric median (√(dq/N))
• If q = O(d), the error rate is optimal
• Tolerates up to a q/m = Θ(1) fraction of Byzantine workers
• Exponential convergence → only logarithmically many communication rounds
References

• Lili Su and Jiaming Xu, Securing Distributed Machine Learning in High Dimensions, arXiv:1804.10140, April 2018
• Yudong Chen, Lili Su, and Jiaming Xu, Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent
  ◦ Conference version: SIGMETRICS 2018;
  ◦ Journal version: POMACS (Proceedings of the ACM on Measurement and Analysis of Computing Systems), Dec. 2017.