Federated Optimization in Heterogeneous Networks

Tian Li (CMU), Anit Kumar Sahu (BCAI), Manzil Zaheer (Google Research), Maziar Sanjabi (Facebook AI), Ameet Talwalkar (CMU & Determined AI), Virginia Smith (CMU) Federated Optimization in Heterogeneous Networks [email protected]


Page 1: Federated Optimization in Heterogeneous Networks

Tian Li (CMU), Anit Kumar Sahu (BCAI), Manzil Zaheer (Google Research), Maziar Sanjabi (Facebook AI), Ameet Talwalkar (CMU & Determined AI), Virginia Smith (CMU)

[email protected]

Source: litian/assets/slides/fedprox_mlsys20.pdf

Page 2: Federated Learning

Privacy-preserving training in heterogeneous, (potentially) massive networks:

- Networks of remote devices, e.g., cell phones (next-word prediction)
- Networks of isolated organizations, e.g., hospitals (healthcare)

Page 3: Example Applications

- Voice recognition on mobile phones
- Adapting to pedestrian behavior on autonomous vehicles
- Personalized healthcare on wearable devices
- Predictive maintenance for industrial machines

Page 4: Workflow & Challenges

[diagram: the server sends the global model $w^t$ to devices; each device performs local training and returns an updated model $w'$; the server aggregates these into $w^{t+1}$]

Challenges:

- Systems heterogeneity: variable hardware, network connectivity, power, etc.
- Statistical heterogeneity: highly non-identically distributed data
- Expensive communication: potentially massive network; wireless communication
- Privacy concerns: privacy leakage through parameters

A standard setup. Objective:

$\min_w f(w) = \sum_{k=1}^{N} p_k F_k(w)$

where $F_k$ is the loss on device $k$.
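The objective above can be sketched numerically. A minimal, hypothetical example (the quadratic per-device losses and the sample counts are illustrative stand-ins, not the paper's setup):

```python
import numpy as np

# Hypothetical per-device losses F_k(w): quadratics centered at
# device-specific optima, standing in for non-identical local data.
device_optima = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]

def local_loss(w, k):
    """F_k(w): loss of model w on device k's data."""
    return 0.5 * np.sum((w - device_optima[k]) ** 2)

# p_k: device weights, e.g. proportional to local sample counts, summing to 1.
n_samples = np.array([100, 300, 600])
p = n_samples / n_samples.sum()

def global_objective(w):
    """f(w) = sum_k p_k * F_k(w)."""
    return sum(p[k] * local_loss(w, k) for k in range(len(p)))

print(global_objective(np.zeros(2)))
```

Note that the global optimum of $f$ need not minimize any single $F_k$ when the data are non-identical, which is exactly the tension the rest of the talk addresses.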

Page 5: A Popular Method: Federated Averaging (FedAvg) [1]

At each communication round:

- The server randomly selects a subset of devices & sends the current global model $w^t$
- Each selected device $k$ updates $w^t$ for $E$ epochs of SGD to optimize $F_k$ & sends the new local model back
- The server aggregates the local models to form a new global model $w^{t+1}$

Works well in many settings! (especially non-convex)

[1] McMahan, H. Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." AISTATS, 2017.
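The round structure above can be sketched as follows. This is a toy NumPy sketch under assumed details (least-squares local losses, 10 devices, 3 sampled per round); it illustrates the select / local-SGD / average loop, not the paper's experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-IID devices: each holds data from a device-specific linear model.
def make_device(k):
    X = rng.normal(size=(50, 2))
    w_true = np.array([1.0 + k, -float(k)])
    y = X @ w_true + 0.1 * rng.normal(size=50)
    return X, y

devices = [make_device(k) for k in range(10)]

def local_sgd(w, X, y, epochs=5, lr=0.01):
    """E epochs of SGD on the local least-squares loss F_k."""
    w = w.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

w_global = np.zeros(2)
for t in range(20):                                    # communication rounds
    chosen = rng.choice(len(devices), size=3, replace=False)  # subsample devices
    local_models = [local_sgd(w_global, *devices[k]) for k in chosen]
    w_global = np.mean(local_models, axis=0)           # simple average -> w^{t+1}
```

The simple average in the last line is precisely the step the next slide questions under heterogeneity.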

What can go wrong?

Page 6: What are the issues?

- Statistical heterogeneity: highly non-identically distributed data; FedAvg simply averages the updates
- Systems heterogeneity: stragglers; deployed systems simply drop slow devices [2]

[figure: FedAvg training loss with 0% vs. 90% stragglers]

FedAvg is a heuristic method: it is not guaranteed to converge.

[2] Bonawitz, Keith, et al. "Towards Federated Learning at Scale: System Design." MLSys, 2019.

Page 7: Outline

- Motivation
- FedProx Method
- Theoretical Analysis
- Experiments
- Future Work

Page 8: FedProx — High Level

- Statistical heterogeneity: instead of averaging simple SGD updates, encourage more well-behaved updates
- Systems heterogeneity: instead of simply dropping stragglers, allow for variable amounts of work & safely incorporate it
- Theory: convergence rate as a function of statistical heterogeneity; accounts for stragglers

Contributions: (1) convergence guarantees and (2) more robust empirical performance for federated learning in heterogeneous networks.

Page 9: FedProx: A Framework for Federated Optimization

Objective: $\min_w f(w) = \sum_{k=1}^{N} p_k F_k(w)$

At each communication round, the local objective is $\min_{w_k} F_k(w_k)$.

Idea 1: Allow for variable amounts of work to be performed on local devices to handle stragglers.

Idea 2: Modified local subproblem with a proximal term:

$\min_{w_k} F_k(w_k) + \frac{\mu}{2} \| w_k - w^t \|^2$
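The modified subproblem can be solved approximately with plain gradient steps. A minimal sketch, where the quadratic local loss and all hyperparameter values are hypothetical:

```python
import numpy as np

def prox_local_solve(grad_Fk, w_t, mu=0.1, lr=0.05, steps=100):
    """Approximately solve min_w F_k(w) + (mu/2) * ||w - w_t||^2.

    The gradient of the proximal term is mu * (w - w_t), which pulls the
    local iterate back toward the current global model w_t.
    """
    w = w_t.copy()
    for _ in range(steps):
        w -= lr * (grad_Fk(w) + mu * (w - w_t))
    return w

# Hypothetical local loss: F_k(w) = 0.5 * ||w - c_k||^2, so grad F_k(w) = w - c_k.
c_k = np.array([4.0, -2.0])
w_t = np.zeros(2)

w_no_prox = prox_local_solve(lambda w: w - c_k, w_t, mu=0.0)  # drifts to local optimum c_k
w_prox = prox_local_solve(lambda w: w - c_k, w_t, mu=1.0)     # stays closer to w_t
```

With μ = 0 this recovers plain local minimization (the FedAvg inner step); larger μ keeps the local model closer to $w^t$, which is the "limit the impact of local updates" effect described on the next slide.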

Page 10: FedProx: A Framework for Federated Optimization

Modified local subproblem: $\min_{w_k} F_k(w_k) + \frac{\mu}{2} \| w_k - w^t \|^2$

The proximal term (1) safely incorporates noisy updates and (2) explicitly limits the impact of local updates.

- Generalization of FedAvg
- Can use any local solver
- More robust and stable empirical performance
- Strong theoretical guarantees (with some assumptions)

Page 11: Outline

- Motivation
- FedProx Method
- Theoretical Analysis
- Experiments
- Future Work

Page 12: Convergence Analysis

High level: FedProx converges despite these challenges.

Challenges: device subsampling, non-IID data, local updates.

Introduces the notion of B-dissimilarity* to characterize statistical heterogeneity:

$\mathbb{E}_k \left[ \| \nabla F_k(w) \|^2 \right] \le \| \nabla f(w) \|^2 B^2$

IID data: $B = 1$; non-IID data: $B > 1$.

*Used in other contexts, e.g., gradient diversity [3] to quantify the benefits of scaling distributed SGD.

[3] Yin, Dong, et al. "Gradient Diversity: a Key Ingredient for Scalable Distributed Learning." AISTATS, 2018.
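The quantity bounded above can be estimated empirically at a fixed $w$ by comparing local gradients to their weighted average. A sketch with hypothetical gradient vectors (the helper name and inputs are illustrative):

```python
import numpy as np

def b_dissimilarity(local_grads, weights):
    """Estimate B(w) = sqrt( E_k[||grad F_k(w)||^2] / ||grad f(w)||^2 ).

    local_grads: per-device gradients evaluated at the same point w.
    weights: the p_k, summing to 1.
    """
    g = np.array(local_grads)
    p = np.array(weights)
    mean_sq = np.sum(p * np.sum(g ** 2, axis=1))  # E_k ||grad F_k(w)||^2
    global_grad = p @ g                           # grad f(w) = sum_k p_k grad F_k(w)
    return np.sqrt(mean_sq / np.sum(global_grad ** 2))

# IID-like case: identical gradients on every device -> B = 1.
same = [np.array([1.0, 2.0])] * 4
print(b_dissimilarity(same, [0.25] * 4))   # -> 1.0

# Non-IID case: conflicting gradients partially cancel in the average -> B > 1.
mixed = [np.array([1.0, 0.0]), np.array([-1.0, 0.5])]
print(b_dissimilarity(mixed, [0.5, 0.5]))
```

Intuitively, when local gradients disagree, the averaged gradient shrinks while the per-device magnitudes do not, so the ratio $B$ grows.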

Page 13: Convergence Analysis

- Assumption 1: Dissimilarity is bounded
- Assumption 2: The modified local subproblem is convex & smooth
- Assumption 3: Each local subproblem is solved to some accuracy
  → flexible communication/computation tradeoff; partial work is accounted for in the rates

The proximal term makes the method more amenable to theoretical analysis!

Page 14: Convergence Analysis

[Theorem] FedProx obtains $\varepsilon$-suboptimality after $T$ rounds, with

$T = O\!\left( \frac{f(w^0) - f^*}{\rho \varepsilon} \right)$

where $\rho$ is a constant that depends on $(B, \mu, \ldots)$.

The rate is general:

- Covers both convex and non-convex loss functions
- Independent of the local solver; agnostic to the sampling method
- The same asymptotic convergence guarantee as SGD; can converge much faster than distributed SGD in practice

Page 15: Outline

- Motivation
- FedProx Method
- Theoretical Analysis
- Experiments
- Future Work

Page 16: Experiments: Zero Systems Heterogeneity + Fixed Statistical Heterogeneity

Benchmark: LEAF (leaf.cmu.edu)

[figure: training loss, FedAvg vs. FedProx with μ > 0]

FedProx with μ > 0 leads to more stable convergence under statistical heterogeneity.

Page 17:

[figure: training loss on each dataset, FedAvg vs. FedProx with μ > 0]

Similar benefits for all datasets.

Page 18: Experiments: High Systems Heterogeneity + Fixed Statistical Heterogeneity

[figure: training loss, FedAvg vs. FedProx with μ = 0 vs. FedProx with μ > 0]

Allowing for variable amounts of work to be performed helps convergence in the presence of systems heterogeneity.

FedProx with μ > 0 leads to more stable convergence under statistical & systems heterogeneity.

Page 19: Federated Optimization in Heterogeneous Networkslitian/assets/slides/fedprox_mlsys20.pdf · • FedDANE: A Federated Newton-Type Method, Arxiv. • Other works: different purposes

FedAvg FedProx, μ = 0 FedProx, μ > 0

19

Similarbenefitsforalldatasets

Intermsoftestaccuracy:

onaverage,22%absoluteaccuracyimprovementcomparedwithFedAvgin

highlyheterogeneoussettings

Page 20: Experiments: Impact of Statistical Heterogeneity

[figure: training loss under increasing statistical heterogeneity]

Increasing heterogeneity leads to worse convergence. Setting μ > 0 can help to combat this.

In addition, B-dissimilarity captures statistical heterogeneity (see paper).

Page 21: Outline

- Motivation
- FedProx Method
- Theoretical Analysis
- Experiments
- Future Work

Page 22: Future Work

- Privacy & security: better privacy metrics & mechanisms
- Personalization: automatic fine-tuning
- Productionizing: cold-start problems
- Hyper-parameter tuning: setting μ automatically
- Diagnostics: determining heterogeneity a priori; leveraging the heterogeneity for improved performance

White paper: Federated Learning: Challenges, Methods, and Future Directions, IEEE Signal Processing Magazine, 2020 (also on arXiv).

Page 23: Thanks!

Paper & code: cs.cmu.edu/~litian/
Benchmark: leaf.cmu.edu
Poster: #3, this room
On-device Intelligence Workshop, Wednesday, this room

Page 24: Backup 1

Relations with previous works:

- Proximal term
  - Elastic SGD: employs a more complex moving average to update parameters; limited to SGD as a local solver; has only been analyzed for quadratic problems
  - DANE and inexact DANE: add an additional gradient correction term, assume full device participation (unrealistic); discouraging empirical performance
  - FedDANE: A Federated Newton-Type Method, arXiv
  - Other works: different purposes, such as speeding up SGD on a single machine; different analysis assumptions (IID data, solving subproblems exactly)
- B-dissimilarity term
  - Used for other purposes, such as quantifying the benefit of scaling SGD for IID data

Page 25: Backup 2

- Data statistics
- Systems heterogeneity simulation: fix a global number of epochs $E$, and force some devices to perform fewer than $E$ epochs of updates. In particular, for the varying-heterogeneity settings, assign $x$ epochs (chosen uniformly at random from $[1, E]$) to 0%, 50%, and 90% of the selected devices.
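The simulation described above can be sketched as follows; the epoch budget and straggler fractions follow the slide, while the function name and random seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def assign_epochs(n_selected, E=20, straggler_frac=0.9):
    """Assign per-device epoch counts for one round.

    A straggler_frac fraction of the selected devices gets x epochs,
    with x drawn uniformly at random from [1, E]; the rest run all E epochs.
    """
    epochs = np.full(n_selected, E)
    n_stragglers = int(straggler_frac * n_selected)
    stragglers = rng.choice(n_selected, size=n_stragglers, replace=False)
    epochs[stragglers] = rng.integers(1, E + 1, size=n_stragglers)  # inclusive [1, E]
    return epochs

for frac in (0.0, 0.5, 0.9):
    print(frac, assign_epochs(10, E=20, straggler_frac=frac))
```

Under FedAvg these partial solutions would be dropped; FedProx instead incorporates them, weighted safely via the proximal term.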

Page 26: Backup 3

The original FedAvg algorithm.

Page 27: Backup 4

Complete theorem. Assume the functions $F_k$ are non-convex, $L$-Lipschitz smooth, and there exists $L_- > 0$ such that $\nabla^2 F_k \succeq -L_- I$, with $\bar{\mu} = \mu - L_- > 0$. Suppose that $w^t$ is not a stationary solution and the local functions $F_k$ are $B$-dissimilar, i.e., $B(w^t) \le B$. If $\mu$, $K$, and $\gamma_k^t$ are chosen such that

$\rho^t = \left( \frac{1}{\mu} - \frac{\gamma^t B}{\mu} - \frac{B(1+\gamma^t)\sqrt{2}}{\bar{\mu}\sqrt{K}} - \frac{L B (1+\gamma^t)}{\bar{\mu}\mu} - \frac{L (1+\gamma^t)^2 B^2}{2\bar{\mu}^2} - \frac{L B^2 (1+\gamma^t)^2}{\bar{\mu}^2 K} \left( 2\sqrt{2K} + 2 \right) \right) > 0,$

then at iteration $t$ of FedProx, we have the following expected decrease in the global objective:

$\mathbb{E}_{S_t}\!\left[ f(w^{t+1}) \right] \le f(w^t) - \rho^t \| \nabla f(w^t) \|^2,$

where $S_t$ is the set of $K$ devices chosen at iteration $t$ and $\gamma^t = \max_{k \in S_t} \gamma_k^t$.