Federated Optimization in Heterogeneous Networks
Tian Li (CMU), Anit Kumar Sahu (BCAI), Manzil Zaheer (Google Research), Maziar Sanjabi (Facebook AI), Ameet Talwalkar (CMU & Determined AI), Virginia Smith (CMU)

Federated Learning: privacy-preserving training in heterogeneous, (potentially) massive networks
• Networks of remote devices (e.g., cell phones): next-word prediction
• Networks of isolated organizations (e.g., hospitals): healthcare
Example Applications
• Voice recognition on mobile phones
• Adapting to pedestrian behavior on autonomous vehicles
• Personalized healthcare on wearable devices
• Predictive maintenance for industrial machines
Workflow & Challenges

[Figure: the server sends the global model w^t to selected devices; each device performs local training to produce a local model w′; the server aggregates these into w^{t+1}]

• Systems heterogeneity: variable hardware, network connectivity, power, etc.
• Statistical heterogeneity: highly non-identically distributed data
• Expensive communication: potentially massive network; wireless communication
• Privacy concerns: privacy leakage through parameters

A standard setup. Objective:

min_w f(w) = ∑_{k=1}^N p_k F_k(w)

where F_k is the loss on device k and p_k is its weight (typically the fraction of the data held by device k).
A Popular Method: Federated Averaging (FedAvg) [1]

Works well in many settings! (especially non-convex)

[1] McMahan, H. Brendan, et al. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS, 2017.
At each communication round:
1. The server randomly selects a subset of devices and sends them the current global model w^t.
2. Each selected device k updates w^t with E epochs of SGD to optimize F_k, and sends the new local model back.
3. The server aggregates the local models to form a new global model w^{t+1}.
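As an illustration, here is a minimal sketch of one such round in Python/NumPy; the device interface (.batches(), .grad(w, batch), .num_samples) and all other names are hypothetical, not the authors' implementation:

```python
import random
import numpy as np

def fedavg_round(w_t, devices, K, E, lr):
    """One FedAvg communication round (illustrative sketch).

    w_t:     current global model, a flat NumPy array
    devices: objects exposing .num_samples, .batches(), .grad(w, batch)
    K:       number of devices sampled this round
    E:       local epochs of SGD per selected device
    """
    selected = random.sample(devices, K)      # 1. server samples K devices
    local_models, n_samples = [], []
    for dev in selected:
        w = w_t.copy()                        # device starts from w^t
        for _ in range(E):                    # 2. E local epochs of SGD on F_k
            for batch in dev.batches():
                w = w - lr * dev.grad(w, batch)
        local_models.append(w)
        n_samples.append(dev.num_samples)
    # 3. weighted average of local models -> new global model w^{t+1}
    p = np.asarray(n_samples, dtype=float) / sum(n_samples)
    return sum(p_k * w_k for p_k, w_k in zip(p, local_models))
```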
What can go wrong? What are the issues?

• Statistical heterogeneity: FedAvg simply averages updates computed on highly non-identically distributed data.
• Systems heterogeneity: stragglers; deployed systems simply drop slow devices [2].

[Figure: FedAvg training loss with 0% vs. 90% stragglers]

FedAvg is a heuristic method and is not guaranteed to converge.

[2] Bonawitz, Keith, et al. "Towards Federated Learning at Scale: System Design." MLSys, 2019.
Outline
• Motivation
• FedProx Method
• Theoretical Analysis
• Experiments
• Future Work
FedProx: High Level

• Systems heterogeneity: FedAvg simply drops stragglers; FedProx allows for variable amounts of work and safely incorporates them.
• Statistical heterogeneity: FedAvg averages simple SGD updates; FedProx encourages more well-behaved updates.
• Theory: a convergence rate as a function of statistical heterogeneity, accounting for stragglers.

Contributions: (1) convergence guarantees and (2) more robust empirical performance, for federated learning in heterogeneous networks.
FedProx: A Framework for Federated Optimization

Objective:  min_w f(w) = ∑_{k=1}^N p_k F_k(w)

At each communication round, each selected device k optimizes a local objective:  min_{w_k} F_k(w_k)

Idea 1: Allow for variable amounts of work to be performed on local devices, to handle stragglers.
Idea 2: A modified local subproblem with a proximal term:

min_{w_k} F_k(w_k) + (μ/2) ∥w_k − w^t∥²
FedProx: A Framework for Federated Optimization

Modified local subproblem:  min_{w_k} F_k(w_k) + (μ/2) ∥w_k − w^t∥²

The proximal term (1) safely incorporates noisy updates and (2) explicitly limits the impact of local updates.

• Generalization of FedAvg (recovered with μ = 0 and SGD as the local solver)
• Can use any local solver
• More robust and stable empirical performance
• Strong theoretical guarantees (with some assumptions)
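As a concrete sketch (hypothetical names, same illustrative device interface as in the FedAvg sketch above), the proximal term simply adds μ(w − w^t) to every local gradient step:

```python
import numpy as np

def fedprox_local_update(w_t, device, E, lr, mu):
    """Approximately solve  min_w F_k(w) + (mu/2) * ||w - w_t||^2
    with E epochs of local SGD (illustrative sketch).

    The proximal contribution mu * (w - w_t) pulls each step back
    toward the global model w_t; mu = 0 recovers FedAvg's local update.
    """
    w = w_t.copy()
    for _ in range(E):
        for batch in device.batches():
            g = device.grad(w, batch) + mu * (w - w_t)  # loss grad + proximal grad
            w = w - lr * g
    return w
```

Because only the local objective changes, any local solver can be swapped in for SGD here, which is what makes FedProx a framework rather than a single algorithm.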
Convergence Analysis

Challenges: device subsampling, non-IID data, local updates.
High level: FedProx converges despite these challenges.

We introduce the notion of B-dissimilarity* to characterize statistical heterogeneity:
• IID data: B = 1
• non-IID data: B > 1

Assumption 1 (dissimilarity is bounded):  𝔼_k[∥∇F_k(w)∥²] ≤ ∥∇f(w)∥² B²

* Used in other contexts, e.g., gradient diversity [3], to quantify the benefits of scaling distributed SGD.
[3] Yin, Dong, et al. "Gradient Diversity: a Key Ingredient for Scalable Distributed Learning." AISTATS, 2018.
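For intuition, Assumption 1 corresponds to the dissimilarity measure below (this restates the paper's definition; the expectation is over devices k drawn with weights p_k):

B(w) = √( 𝔼_k[∥∇F_k(w)∥²] / ∥∇f(w)∥² )

With IID data, each local gradient ∇F_k(w) matches the global gradient ∇f(w) in expectation, so the ratio tends to 1; the more the local gradients disagree, the larger B(w), which is what makes B a measure of statistical heterogeneity.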
Convergence Analysis

The proximal term makes the method more amenable to theoretical analysis!

Assumption 2: The modified local subproblem is convex and smooth.
Assumption 3: Each local subproblem is solved to some accuracy.
• Flexible communication/computation tradeoff
• Accounts for partial work in the rates

The rate is general:
• Covers both convex and non-convex loss functions
• Independent of the local solver; agnostic to the sampling method
• Gives the same asymptotic convergence guarantee as SGD; can converge much faster than distributed SGD in practice
Convergence Analysis

[Theorem] FedProx obtains ε-suboptimality after T rounds, with

T = O((f(w⁰) − f*) / (ρ ε)),

where ρ is some constant, a function of (B, μ, …).
Experiments: Zero Systems Heterogeneity + Fixed Statistical Heterogeneity

Benchmark: LEAF (leaf.cmu.edu)

[Figure: training loss for FedAvg vs. FedProx (μ > 0)]

FedProx with μ > 0 leads to more stable convergence under statistical heterogeneity. Similar benefits for all datasets.
Experiments: High Systems Heterogeneity + Fixed Statistical Heterogeneity

[Figure: training loss for FedAvg, FedProx (μ = 0), and FedProx (μ > 0)]

Allowing for variable amounts of work to be performed helps convergence in the presence of systems heterogeneity. FedProx with μ > 0 leads to more stable convergence under statistical & systems heterogeneity. Similar benefits for all datasets.

In terms of test accuracy: on average, a 22% absolute accuracy improvement compared with FedAvg in highly heterogeneous settings.
Experiments: Impact of Statistical Heterogeneity

• Increasing heterogeneity leads to worse convergence.
• Setting μ > 0 can help to combat this.
• In addition, B-dissimilarity captures statistical heterogeneity (see paper).
Future Work
• Privacy & security: better privacy metrics & mechanisms
• Personalization: automatic fine-tuning
• Productionizing: cold-start problems
• Hyper-parameter tuning: setting μ automatically
• Diagnostics: determining heterogeneity a priori; leveraging the heterogeneity for improved performance

Whitepaper: "Federated Learning: Challenges, Methods, and Future Directions," IEEE Signal Processing Magazine, 2020 (also on arXiv).
Thanks!
Paper & code: cs.cmu.edu/~litian/
Benchmark: leaf.cmu.edu
Poster: #3, this room
On-device Intelligence Workshop: Wednesday, this room
Backup 1: Relations with previous work

• Proximal term
  • Elastic SGD: employs a more complex moving average to update parameters; limited to SGD as a local solver; has only been analyzed for quadratic problems.
  • DANE and inexact DANE: add an additional gradient correction term; assume full device participation (unrealistic); discouraging empirical performance. See also FedDANE: A Federated Newton-Type Method (arXiv).
  • Other works: different purposes, such as speeding up SGD on a single machine; different analysis assumptions (IID data, solving subproblems exactly).
• B-dissimilarity term
  • Used for other purposes, such as quantifying the benefit of scaling SGD for IID data.
Backup 2

• Data statistics [table in the slides]
• Systems heterogeneity simulation: fix a global number of epochs E, and force some devices to perform fewer updates than E epochs. In particular, for the varying-heterogeneity settings, assign x epochs (chosen uniformly at random from [1, E]) to 0%, 50%, and 90% of the selected devices.
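A minimal sketch of this simulation (function and variable names are hypothetical):

```python
import random

def assign_local_epochs(selected_devices, E, straggler_frac):
    """Simulate systems heterogeneity: a straggler_frac fraction of the
    selected devices performs x ~ Uniform{1, ..., E} local epochs
    instead of the full E (straggler_frac is 0%, 50%, or 90% in the
    experiments)."""
    n = len(selected_devices)
    stragglers = set(random.sample(range(n), int(straggler_frac * n)))
    return [random.randint(1, E) if i in stragglers else E
            for i in range(n)]
```

Under FedAvg these stragglers would simply be dropped; under FedProx their partial solutions are still safely averaged in, thanks to the proximal term.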
Backup 3

• The original FedAvg algorithm [algorithm listing in the slides]
Backup 4: Complete theorem

Assume the functions F_k are non-convex and L-Lipschitz smooth, and that there exists L_− > 0 such that ∇²F_k ⪰ −L_− I, with μ̄ := μ − L_− > 0. Suppose that w^t is not a stationary solution and that the local functions F_k are B-dissimilar, i.e., B(w^t) ≤ B. If μ, K, and γ_k^t are chosen such that

ρ^t = (1/μ − γ^t B/μ − B(1 + γ^t)√2 / (μ̄√K) − L B(1 + γ^t) / (μ̄ μ) − L(1 + γ^t)² B² / (2 μ̄²) − (L B²(1 + γ^t)² / (μ̄² K)) (2√(2K) + 2)) > 0,

then at iteration t of FedProx, we have the following expected decrease in the global objective:

𝔼_{S_t}[f(w^{t+1})] ≤ f(w^t) − ρ^t ∥∇f(w^t)∥²,

where S_t is the set of K devices chosen at iteration t and γ^t = max_{k∈S_t} γ_k^t.