Federated Optimization in Heterogeneous Networks
Tian Li (CMU), Anit Kumar Sahu (BCAI), Manzil Zaheer (Google Research), Maziar Sanjabi (Facebook AI), Ameet Talwalkar (CMU & Determined AI), Virginia Smith (CMU)

Federated Learning: privacy-preserving training in heterogeneous, (potentially) massive networks
• Networks of remote devices (e.g., cell phones): next-word prediction
• Networks of isolated organizations (e.g., hospitals): healthcare
Example Applications
• Voice recognition on mobile phones
• Adapting to pedestrian behavior on autonomous vehicles
• Personalized healthcare on wearable devices
• Predictive maintenance for industrial machines
Workflow & Challenges

[Figure: the server sends the global model w^t to selected devices; each device performs local training to produce a local model w′; the server aggregates these into w^{t+1}]

• Systems heterogeneity: variable hardware, network connectivity, power, etc.
• Statistical heterogeneity: highly non-identically distributed data
• Expensive communication: potentially massive network; wireless communication
• Privacy concerns: privacy leakage through parameters

A standard setup. Objective:

min_w f(w) = ∑_{k=1}^N p_k F_k(w)

where F_k is the loss on device k and p_k is its weight (typically the fraction of the data held by device k).
A Popular Method: Federated Averaging (FedAvg) [1]

Works well in many settings! (especially non-convex)

[1] McMahan, H. Brendan, et al. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS, 2017.
At each communication round:
1. The server randomly selects a subset of devices and sends them the current global model w^t.
2. Each selected device k updates w^t with E epochs of SGD to optimize F_k, and sends the new local model back.
3. The server aggregates the local models to form a new global model w^{t+1}.
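As an illustration, here is a minimal sketch of one such round in Python/NumPy; the device interface (.batches(), .grad(w, batch), .num_samples) and all other names are hypothetical, not the authors' implementation:

```python
import random
import numpy as np

def fedavg_round(w_t, devices, K, E, lr):
    """One FedAvg communication round (illustrative sketch).

    w_t:     current global model, a flat NumPy array
    devices: objects exposing .num_samples, .batches(), .grad(w, batch)
    K:       number of devices sampled this round
    E:       local epochs of SGD per selected device
    """
    selected = random.sample(devices, K)      # 1. server samples K devices
    local_models, n_samples = [], []
    for dev in selected:
        w = w_t.copy()                        # device starts from w^t
        for _ in range(E):                    # 2. E local epochs of SGD on F_k
            for batch in dev.batches():
                w = w - lr * dev.grad(w, batch)
        local_models.append(w)
        n_samples.append(dev.num_samples)
    # 3. weighted average of local models -> new global model w^{t+1}
    p = np.asarray(n_samples, dtype=float) / sum(n_samples)
    return sum(p_k * w_k for p_k, w_k in zip(p, local_models))
```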
What can go wrong? What are the issues?

• Statistical heterogeneity: FedAvg simply averages updates computed on highly non-identically distributed data.
• Systems heterogeneity: stragglers; deployed systems simply drop slow devices [2].

[Figure: FedAvg training loss with 0% vs. 90% stragglers]

FedAvg is a heuristic method and is not guaranteed to converge.

[2] Bonawitz, Keith, et al. "Towards Federated Learning at Scale: System Design." MLSys, 2019.
Outline
• Motivation
• FedProx Method
• Theoretical Analysis
• Experiments
• Future Work
FedProx: High Level

• Systems heterogeneity: FedAvg simply drops stragglers; FedProx allows for variable amounts of work and safely incorporates them.
• Statistical heterogeneity: FedAvg averages simple SGD updates; FedProx encourages more well-behaved updates.
• Theory: a convergence rate as a function of statistical heterogeneity, accounting for stragglers.

Contributions: (1) convergence guarantees and (2) more robust empirical performance, for federated learning in heterogeneous networks.
FedProx: A Framework for Federated Optimization

Objective:  min_w f(w) = ∑_{k=1}^N p_k F_k(w)

At each communication round, each selected device k optimizes a local objective:  min_{w_k} F_k(w_k)

Idea 1: Allow for variable amounts of work to be performed on local devices, to handle stragglers.
Idea 2: A modified local subproblem with a proximal term:

min_{w_k} F_k(w_k) + (μ/2) ∥w_k − w^t∥²
FedProx: A Framework for Federated Optimization

Modified local subproblem:  min_{w_k} F_k(w_k) + (μ/2) ∥w_k − w^t∥²

The proximal term (1) safely incorporates noisy updates and (2) explicitly limits the impact of local updates.

• Generalization of FedAvg (recovered with μ = 0 and SGD as the local solver)
• Can use any local solver
• More robust and stable empirical performance
• Strong theoretical guarantees (with some assumptions)
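As a concrete sketch (hypothetical names, same illustrative device interface as in the FedAvg sketch above), the proximal term simply adds μ(w − w^t) to every local gradient step:

```python
import numpy as np

def fedprox_local_update(w_t, device, E, lr, mu):
    """Approximately solve  min_w F_k(w) + (mu/2) * ||w - w_t||^2
    with E epochs of local SGD (illustrative sketch).

    The proximal contribution mu * (w - w_t) pulls each step back
    toward the global model w_t; mu = 0 recovers FedAvg's local update.
    """
    w = w_t.copy()
    for _ in range(E):
        for batch in device.batches():
            g = device.grad(w, batch) + mu * (w - w_t)  # loss grad + proximal grad
            w = w - lr * g
    return w
```

Because only the local objective changes, any local solver can be swapped in for SGD here, which is what makes FedProx a framework rather than a single algorithm.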
Convergence Analysis

Challenges: device subsampling, non-IID data, local updates.
High level: FedProx converges despite these challenges.

We introduce the notion of B-dissimilarity* to characterize statistical heterogeneity:
• IID data: B = 1
• non-IID data: B > 1

Assumption 1 (dissimilarity is bounded):  𝔼_k[∥∇F_k(w)∥²] ≤ ∥∇f(w)∥² B²

* Used in other contexts, e.g., gradient diversity [3], to quantify the benefits of scaling distributed SGD.
[3] Yin, Dong, et al. "Gradient Diversity: a Key Ingredient for Scalable Distributed Learning." AISTATS, 2018.
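For intuition, Assumption 1 corresponds to the dissimilarity measure below (this restates the paper's definition; the expectation is over devices k drawn with weights p_k):

B(w) = √( 𝔼_k[∥∇F_k(w)∥²] / ∥∇f(w)∥² )

With IID data, each local gradient ∇F_k(w) matches the global gradient ∇f(w) in expectation, so the ratio tends to 1; the more the local gradients disagree, the larger B(w), which is what makes B a measure of statistical heterogeneity.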
Convergence Analysis

The proximal term makes the method more amenable to theoretical analysis!

Assumption 2: The modified local subproblem is convex and smooth.
Assumption 3: Each local subproblem is solved to some accuracy.
• Flexible communication/computation tradeoff
• Accounts for partial work in the rates

The rate is general:
• Covers both convex and non-convex loss functions
• Independent of the local solver; agnostic to the sampling method
• Gives the same asymptotic convergence guarantee as SGD; can converge much faster than distributed SGD in practice
Convergence Analysis

[Theorem] FedProx obtains ε-suboptimality after T rounds, with

T = O((f(w⁰) − f*) / (ρ ε)),

where ρ is some constant, a function of (B, μ, …).
Experiments: Zero Systems Heterogeneity + Fixed Statistical Heterogeneity

Benchmark: LEAF (leaf.cmu.edu)

[Figure: training loss for FedAvg vs. FedProx (μ > 0)]

FedProx with μ > 0 leads to more stable convergence under statistical heterogeneity. Similar benefits for all datasets.
Experiments: High Systems Heterogeneity + Fixed Statistical Heterogeneity

[Figure: training loss for FedAvg, FedProx (μ = 0), and FedProx (μ > 0)]

Allowing for variable amounts of work to be performed helps convergence in the presence of systems heterogeneity. FedProx with μ > 0 leads to more stable convergence under statistical & systems heterogeneity. Similar benefits for all datasets.

In terms of test accuracy: on average, a 22% absolute accuracy improvement compared with FedAvg in highly heterogeneous settings.
Experiments: Impact of Statistical Heterogeneity

• Increasing heterogeneity leads to worse convergence.
• Setting μ > 0 can help to combat this.
• In addition, B-dissimilarity captures statistical heterogeneity (see paper).
Future Work
• Privacy & security: better privacy metrics & mechanisms
• Personalization: automatic fine-tuning
• Productionizing: cold-start problems
• Hyper-parameter tuning: setting μ automatically
• Diagnostics: determining heterogeneity a priori; leveraging the heterogeneity for improved performance

Whitepaper: "Federated Learning: Challenges, Methods, and Future Directions," IEEE Signal Processing Magazine, 2020 (also on arXiv).
Thanks!
Paper & code: cs.cmu.edu/~litian/
Benchmark: leaf.cmu.edu
Poster: #3, this room
On-device Intelligence Workshop: Wednesday, this room
Backup 1: Relations with previous work

• Proximal term
  • Elastic SGD: employs a more complex moving average to update parameters; limited to SGD as a local solver; has only been analyzed for quadratic problems.
  • DANE and inexact DANE: add an additional gradient correction term; assume full device participation (unrealistic); discouraging empirical performance. See also FedDANE: A Federated Newton-Type Method (arXiv).
  • Other works: different purposes, such as speeding up SGD on a single machine; different analysis assumptions (IID data, solving subproblems exactly).
• B-dissimilarity term
  • Used for other purposes, such as quantifying the benefit of scaling SGD for IID data.
Backup 2

• Data statistics [table in the slides]
• Systems heterogeneity simulation: fix a global number of epochs E, and force some devices to perform fewer updates than E epochs. In particular, for the varying-heterogeneity settings, assign x epochs (chosen uniformly at random from [1, E]) to 0%, 50%, and 90% of the selected devices.
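A minimal sketch of this simulation (function and variable names are hypothetical):

```python
import random

def assign_local_epochs(selected_devices, E, straggler_frac):
    """Simulate systems heterogeneity: a straggler_frac fraction of the
    selected devices performs x ~ Uniform{1, ..., E} local epochs
    instead of the full E (straggler_frac is 0%, 50%, or 90% in the
    experiments)."""
    n = len(selected_devices)
    stragglers = set(random.sample(range(n), int(straggler_frac * n)))
    return [random.randint(1, E) if i in stragglers else E
            for i in range(n)]
```

Under FedAvg these stragglers would simply be dropped; under FedProx their partial solutions are still safely averaged in, thanks to the proximal term.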
Backup 3

• The original FedAvg algorithm [algorithm listing in the slides]
Backup 4: Complete theorem

Assume the functions F_k are non-convex and L-Lipschitz smooth, and that there exists L_− > 0 such that ∇²F_k ⪰ −L_− I, with μ̄ := μ − L_− > 0. Suppose that w^t is not a stationary solution and that the local functions F_k are B-dissimilar, i.e., B(w^t) ≤ B. If μ, K, and γ_k^t are chosen such that

ρ^t = (1/μ − γ^t B/μ − B(1 + γ^t)√2 / (μ̄√K) − L B(1 + γ^t) / (μ̄ μ) − L(1 + γ^t)² B² / (2 μ̄²) − (L B²(1 + γ^t)² / (μ̄² K)) (2√(2K) + 2)) > 0,

then at iteration t of FedProx, we have the following expected decrease in the global objective:

𝔼_{S_t}[f(w^{t+1})] ≤ f(w^t) − ρ^t ∥∇f(w^t)∥²,

where S_t is the set of K devices chosen at iteration t and γ^t = max_{k∈S_t} γ_k^t.