


Federated Learning in Mobile Edge Networks: A Comprehensive Survey

Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Fellow, IEEE, Qiang Yang, Fellow, IEEE, Dusit Niyato, Fellow, IEEE, and Chunyan Miao

Abstract—In recent years, mobile devices have been equipped with increasingly advanced sensing and computing capabilities. Coupled with advancements in Deep Learning (DL), this opens up countless possibilities for meaningful applications, e.g., for medical purposes and in vehicular networks. Traditional cloud-based Machine Learning (ML) approaches require the data to be centralized in a cloud server or data center. However, this results in critical issues related to unacceptable latency and communication inefficiency. To this end, Mobile Edge Computing (MEC) has been proposed to bring intelligence closer to the edge, where data is produced. However, conventional enabling technologies for ML at mobile edge networks still require personal data to be shared with external parties, e.g., edge servers. Recently, in light of increasingly stringent data privacy legislation and growing privacy concerns, the concept of Federated Learning (FL) has been introduced. In FL, end devices use their local data to train an ML model required by the server. The end devices then send the model updates rather than raw data to the server for aggregation. FL can serve as an enabling technology in mobile edge networks since it enables the collaborative training of an ML model and also enables DL for mobile edge network optimization. However, in a large-scale and complex mobile edge network, heterogeneous devices with varying constraints are involved. This raises challenges of communication costs, resource allocation, and privacy and security in the implementation of FL at scale. In this survey, we begin with an introduction to the background and fundamentals of FL. Then, we highlight the aforementioned challenges of FL implementation and review existing solutions. Furthermore, we present the applications of FL for mobile edge network optimization. Finally, we discuss the important challenges and future research directions in FL.

Index Terms—Federated Learning, mobile edge networks, resource allocation, communication cost, data privacy, data security

W. Y. B. Lim is with Alibaba Group and the Alibaba-NTU Joint Research Institute, Nanyang Technological University, Singapore. Email: [email protected].

N. C. Luong is with the Faculty of Information Technology, PHENIKAA University, Hanoi 12116, Vietnam, and is with PHENIKAA Research and Technology Institute (PRATI), A&A Green Phoenix Group JSC, No. 167 Hoang Ngan, Trung Hoa, Cau Giay, Hanoi 11313, Vietnam. Email: [email protected].

D. T. Hoang is with the School of Electrical and Data Engineering, University of Technology Sydney, Australia. E-mail: [email protected].

Y. Jiao, D. Niyato and C. Miao are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mails: [email protected], [email protected], [email protected].

Y.-C. Liang is with the Center for Intelligent Networking and Communications (CINC), University of Electronic Science and Technology of China (UESTC), Chengdu 611731, China. E-mail: [email protected].

Q. Yang is with Hong Kong University of Science and Technology, Hong Kong, China. Email: [email protected].

I. INTRODUCTION

Currently, there are nearly 7 billion connected Internet of Things (IoT) devices [1] and 3 billion smartphones around the world. These devices are equipped with increasingly advanced sensors, computing, and communication capabilities. As such, they can potentially be deployed for various crowdsensing tasks, e.g., for medical purposes [2] and air quality monitoring [3]. Coupled with the rise of Deep Learning (DL) [4], the wealth of data collected by end devices opens up countless possibilities for meaningful research and applications.

In the traditional cloud-centric approach, data collected by mobile devices is uploaded and processed centrally in a cloud-based server or data center. In particular, data collected by IoT devices and smartphones such as measurements [5], photos [6], videos [7], and location information [8] are aggregated at the data center [9]. Thereafter, the data is used to provide insights or produce effective inference models. However, this approach is no longer sustainable for the following reasons. Firstly, data owners are increasingly privacy sensitive. Following privacy concerns among consumers in the age of big data, policy makers have responded with the implementation of data privacy legislation such as the European Commission's General Data Protection Regulation (GDPR) [10] and the Consumer Privacy Bill of Rights in the US [11]. In particular, the consent (GDPR Article 6) and data minimisation (GDPR Article 5) principles limit data collection and storage only to what is consumer-consented and absolutely necessary for processing. Secondly, a cloud-centric approach involves long propagation delays and incurs unacceptable latency [12] for applications in which real-time decisions have to be made, e.g., in self-driving car systems [13]. Thirdly, the transfer of data to the cloud for processing burdens the backbone networks, especially in tasks involving unstructured data, e.g., in video analytics [14]. This is exacerbated by the fact that cloud-centric training is relatively reliant on wireless communications [15]. As a result, this can potentially impede the development of new technologies.

With data sources mainly located outside the cloud today [16], Mobile Edge Computing (MEC) has naturally been proposed as a solution in which the computing and storage capabilities [12] of end devices and edge servers are leveraged to bring model training closer to where data is produced [17]. As defined in [15], an end-edge-cloud computing network comprises: (i) end devices, (ii) edge nodes, and (iii) a cloud server. For model training in conventional MEC approaches, a collaborative paradigm has been proposed in which training data are first sent to the edge servers for model training up to the lower-level DNN layers, before more computation-intensive tasks are offloaded to the cloud [18], [19] (Fig. 1).



Fig. 1: Edge AI approach brings AI processing closer to where data is produced. In particular, FL allows training on devices where the data is produced.

However, this arrangement incurs significant communication costs and is unsuitable especially for applications that require persistent training [15]. In addition, computation offloading and data processing at edge servers still involve the transmission of potentially sensitive personal data. This can discourage privacy-sensitive consumers from taking part in model training, or even violate increasingly stringent privacy laws [10]. Although various privacy preservation methods, e.g., differential privacy (DP) [20], have been proposed, a number of users are still not willing to expose their private data for fear that their data may be inspected by external servers. In the long run, this discourages the development of technologies as well as new applications.

To guarantee that training data remains on personal devices and to facilitate the collaborative machine learning of complex models among distributed devices, a decentralized ML approach called Federated Learning (FL) is introduced in [21]. In FL, mobile devices use their local data to cooperatively train an ML model required by an FL server. They then send the model updates, i.e., the model's weights, to the FL server for aggregation. The steps are repeated in multiple rounds until a desirable accuracy is achieved. This implies that FL can be an enabling technology for ML model training at mobile edge networks. As compared to conventional cloud-centric ML model training approaches, the implementation of FL for model training at mobile edge networks features the following advantages.

• Highly efficient use of network bandwidth: Less information is required to be transmitted to the cloud. For example, instead of sending the raw data over for processing, participating devices only send the updated model parameters for aggregation. As a result, this significantly reduces the costs of data communication and relieves the burden on backbone networks.

• Privacy: Following the above point, the raw data of users need not be sent to the cloud. Under the assumption that FL participants and servers are non-malicious, this enhances user privacy and reduces the probability of eavesdropping to a certain extent. In fact, with enhanced privacy, more users will be willing to take part in collaborative model training and so, better inference models can be built.

• Low latency: With FL, ML models can be consistently trained and updated. Meanwhile, in the MEC paradigm, real-time decisions, e.g., event detection [22], can be made locally at the edge nodes or end devices. Therefore, the latency is much lower than that when decisions are made in the cloud before transmitting them to the end devices. This is vital for time-critical applications such as self-driving car systems in which the slightest delays can potentially be life threatening [13].

Given the aforementioned advantages, FL has seen recent successes in several applications. For example, the Federated Averaging algorithm (FedAvg) proposed in [23] has been applied to Google's Gboard [24] to improve next-word prediction models. In addition, several studies have also explored the use of FL in a number of scenarios in which data is sensitive in nature, e.g., to develop predictive models for diagnosis in health AI [25] and to foster collaboration across multiple hospitals [26] and government agencies [27].

Besides being an enabling technology for ML model training at mobile edge networks, FL has also been increasingly applied as an enabling technology for mobile edge network optimization. Given the computation and storage constraints of increasingly complex mobile edge networks, conventional network optimization approaches that are built on static models fare relatively poorly in modelling dynamic networks [15]. As such, a data-driven Deep Learning (DL) based approach [28] for optimizing resource allocation is increasingly popular. For example, DL can be used for representation learning of network conditions [29] whereas Deep Reinforcement Learning (DRL) can optimize decision making through interactions with the dynamic environment [30]. However, the aforementioned approaches require user data as an input, and these data may be sensitive in nature or inaccessible due to regulatory constraints. As such, in this survey, we also consider FL's potential to serve as an enabling technology for optimizing mobile edge networks, e.g., in cell association [31], computation offloading [32], and vehicular networks [33].


However, there are several challenges to be solved before FL can be implemented at scale. Firstly, even though raw data no longer needs to be sent to the cloud servers, communication costs remain an issue due to the high dimensionality of model updates and the limited communication bandwidth of participating mobile devices. In particular, state-of-the-art DNN model training can involve the communication of millions of parameters for aggregation. Secondly, in a large and complex mobile edge network, the heterogeneity of participating devices in terms of data quality, computation power, and willingness to participate has to be well managed from the resource allocation perspective. Thirdly, FL does not guarantee privacy in the presence of malicious participants or aggregating servers. In particular, recent research works have clearly shown that a malicious participant may exist in FL and can infer the information of other participants just from the shared parameters alone. As such, privacy and security issues in FL still need to be considered.

Although there are surveys on MEC and FL, the existing studies usually treat the two topics separately. For existing surveys on FL, the authors in [34] place more emphasis on discussing the architecture and categorization of different FL settings to be used for the varying distributions of training data. The authors in [35] highlight the applications of FL in wireless communications but do not discuss the issues pertaining to FL implementation. In addition, the focus of [35] is on cellular network architecture rather than mobile edge networks. In contrast, the authors in [36] provide a brief tutorial on FL and the challenges related to its implementation, but do not consider the issue of resource allocation in FL, or the potential applications of FL for mobile edge network optimization. On the other hand, for surveys in MEC that focus on implementing ML model training at edge networks, a macroscopic approach is usually adopted in which FL is briefly mentioned as one of the enabling technologies in the MEC paradigm, but without detailed elaboration with regard to its implementation or the related challenges. In particular, the authors in [14], [37], and [38] study the architectures and processes of training and inference at edge networks without considering the challenges to FL implementation. In addition, surveys studying the implementation of DL for mobile edge network optimization mostly do not focus on FL as a potential solution to preserve data privacy. For example, the authors in [12], [19], [39]–[42] discuss strategies for optimizing caching and computation offloading for mobile edge networks, but do not consider the use of privacy-preserving federated approaches in their studies. Similarly, [30] considers the use of DRL in communications and networking but does not include federated DRL approaches.

In summary, most existing surveys on FL do not consider the applications of FL in the context of mobile edge networks, whereas existing surveys on MEC do not consider the challenges to FL implementation or the potential of FL to be applied in mobile edge network optimization. This motivates us to provide a comprehensive survey that has the following contributions:

• We motivate FL as an important paradigm shift towards enabling collaborative ML model training. Then, we provide a concise tutorial on FL implementation and present to the reader a list of useful open-source frameworks that pave the way for future research on FL and its applications.

• We discuss the unique features of FL relative to a centralized ML approach and the resulting implementation challenges. For each of these challenges, we present to the reader a comprehensive discussion of existing solutions and approaches explored in the FL literature.

• We discuss FL as an enabling technology for mobile edge network optimization. In particular, we discuss the current and potential applications of FL as a privacy-preserving approach for applications in edge computing.

• We discuss the challenges and future research directions of FL.

For the reader's convenience, we classify the related studies to be discussed in this survey in Fig. 2. The classification is based on (i) FL at mobile edge networks, i.e., studies that focus on solving the challenges and issues related to implementing the collaborative training of ML models on end devices, and (ii) FL for mobile edge networks, i.e., studies that specifically explore the application of FL for mobile edge network optimization. While the former group of studies works on addressing the fundamental issues of FL, the latter group uses FL as an application tool to solve issues in edge computing. We also present a list of common abbreviations for reference in Table II.

The rest of this paper is organized as follows. Section II introduces the background and fundamentals of FL. Section III reviews solutions provided to reduce communication costs. Section IV discusses resource allocation approaches in FL. Section V discusses privacy and security issues. Section VI discusses applications of FL for mobile edge network optimization. Section VII discusses the challenges and future research directions in FL. Section VIII concludes the paper.

II. BACKGROUND AND FUNDAMENTALS OF FEDERATED LEARNING

Artificial Intelligence (AI) has become an essential part of our lives today, following the recent successes and progression of DL in several domains, e.g., Computer Vision (CV) [43] and Natural Language Processing (NLP) [44]. In traditional training of Deep Neural Networks (DNNs), a cloud-based approach is adopted whereby data is centralized and model training occurs in powerful cloud servers. However, given the ubiquity of mobile devices that are equipped with increasingly advanced sensing and computing capabilities, the trend of migrating intelligence from the cloud to the edge, i.e., in the MEC paradigm, has naturally arisen. In addition, amid growing privacy concerns, the concept of FL has been proposed.

FL involves the collaborative training of DNN models on end devices. There are, in general, two steps in the FL training process, namely (i) local model training on end devices and (ii) global aggregation of updated parameters in the FL server. In this section, we first provide a brief introduction to DNN model training, which generalizes local model training in FL.


TABLE I: An overview of selected surveys in FL and MEC

Ref. | Subject | Contribution
[34] | FL  | Introductory tutorial on the categorization of different FL settings, e.g., vertical FL, horizontal FL, and Federated Transfer Learning
[35] | FL  | FL in optimizing resource allocation for wireless networks while preserving data privacy
[36] | FL  | Tutorial on FL and discussions of implementation challenges in FL
[14] | MEC | Computation offloading strategy to optimize DL performance in edge computing
[37] | MEC | Survey on architectures and frameworks for edge intelligence
[38] | MEC | ML for IoT management, e.g., network management and security
[19] | MEC | Survey on computation offloading in MEC
[30] | MEC | Survey on DRL approaches to address issues in communications and networking
[39] | MEC | Survey on techniques for computation offloading
[40] | MEC | Survey on architectures and applications of MEC
[41] | MEC | Survey on computing, caching, and communication issues at mobile edge networks
[42] | MEC | Survey on the phases of caching and comparison among the different caching schemes
[12] | MEC | Survey on joint mobile computing and wireless communication resource management in MEC

Fig. 2: Classification of related studies to be discussed in this survey.

TABLE II: List of common abbreviations.

Abbreviation | Description
BAA    | Broadband Analog Aggregation
CNN    | Convolutional Neural Network
CV     | Computer Vision
DDQN   | Double Deep Q-Network
DL     | Deep Learning
DNN    | Deep Neural Network
DP     | Differential Privacy
DQL    | Deep Q-Learning
DRL    | Deep Reinforcement Learning
FedAvg | Federated Averaging
FL     | Federated Learning
GAN    | Generative Adversarial Network
IID    | Independent and Identically Distributed
IoT    | Internet of Things
IoV    | Internet of Vehicles
LSTM   | Long Short Term Memory
MEC    | Mobile Edge Computing
ML     | Machine Learning
MLP    | Multilayer Perceptron
NLP    | Natural Language Processing
OFDMA  | Orthogonal Frequency-Division Multiple Access
QoE    | Quality of Experience
RNN    | Recurrent Neural Network
SGD    | Stochastic Gradient Descent
SMPC   | Secure Multiparty Computation
SNR    | Signal-to-Noise Ratio
SVM    | Support Vector Machine
TFF    | TensorFlow Federated
UE     | User Equipment
URLLC  | Ultra-Reliable Low-Latency Communication

Note that while FL can be applied to the training of ML models in general, we focus specifically on DNN model training in this section, as a majority of the papers that we subsequently review study the federated training of DNN models. In addition, DNN models are easily aggregated and outperform conventional ML techniques especially when the data is large. The implementation of FL at mobile edge networks can thus naturally leverage the increasing computing power and wealth of data collected by distributed end devices, both of which are driving forces contributing to the rise of DL [45]. As such, a brief introduction to general DNN model training will be useful for subsequent sections. Thereafter, we proceed to provide a tutorial of the FL training process that incorporates both global aggregation and local training. In addition, we also highlight the statistical challenges of FL model training and present the protocols and open-source frameworks of FL.

A. Deep Learning

Conventional ML algorithms rely on hand-engineered feature extractors to process raw data [46]. As such, domain expertise is often a prerequisite for building an effective ML model. In addition, feature selection has to be customized and reinitiated for each new problem. On the other hand, DNNs are representation learning based, i.e., DNNs can automatically discover and learn these features from raw data [4] and thus often outperform conventional ML algorithms, especially when there is an abundance of data.

DL lies within the domain of the brain-inspired computing paradigm, of which the neural network is an important part [47]. In general, a neural network design emulates that of a neuron [48]. It comprises three layers: (i) input layer, (ii) hidden layer, and (iii) output layer.


In a conventional feedforward neural network, a weighted and bias-corrected input value is passed through a non-linear activation function to derive an output [49] (Fig. 3). Some activation functions include the ReLU and softmax functions [44]. A typical DNN comprises multiple hidden layers that map an input to an output. For example, the goal of a DNN trained for image classification [50] is to produce a vector of scores as the output, in which the positional index of the highest score corresponds to the class to which the input image is classified to belong. As such, the objective of training a DNN is to optimize the weights of the network such that the loss function, i.e., the difference between the ground truth and the model output, is minimized.

Fig. 3: In the forward pass, an output is derived from the weights and inputs. In backpropagation, the input gradient e is used to calibrate the weights of the DNN model.

Before training, the dataset is first split into the training and test datasets. Then, the training dataset is used as input data for the optimization of weights in the DNN. The weights are calibrated through stochastic gradient descent (SGD), in which the weights are updated by the product of (i) the learning rate lr, i.e., the step size of gradient descent in each iteration, and (ii) the partial derivative of the loss function L with respect to the weight W. The SGD formula is as follows:

W = W - lr \frac{\partial L}{\partial W},   (1)

\frac{\partial L}{\partial W} \approx \frac{1}{m} \sum_{i \in B} \frac{\partial l^{(i)}}{\partial W}.   (2)

Note that the SGD formula presented in (1) is that of a mini-batch GD. In particular, equation (2) is derived as the average gradient matrix over the gradient matrices of B batches, in which each batch is a random subset consisting of m training samples. This is preferred over the full-batch GD, i.e., where the entirety of the training set is included in computing the partial derivative, since the full-batch GD can lead to slow training and batch memorization [51]. The gradient matrices are derived through backpropagation from the input gradient e (Fig. 3) [48].
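For concreteness, the following minimal NumPy sketch illustrates the mini-batch weight update in (1)-(2) for a generic differentiable loss. The helper names (grad_loss, sgd_step) and the mean-squared-error loss of a linear model used here are illustrative assumptions, not part of the original text.

import numpy as np

def grad_loss(W, X_batch, y_batch):
    # Gradient of a mean-squared-error loss for a linear model y ~ X W,
    # used here only as a stand-in for the averaged per-sample gradients in (2).
    m = X_batch.shape[0]
    residual = X_batch @ W - y_batch
    return (X_batch.T @ residual) / m          # average gradient over the minibatch

def sgd_step(W, X_batch, y_batch, lr=0.01):
    # Mini-batch SGD update from (1): W <- W - lr * dL/dW
    return W - lr * grad_loss(W, X_batch, y_batch)

# Toy usage: one epoch of mini-batch SGD over a random dataset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 10)), rng.normal(size=(256, 1))
W = np.zeros((10, 1))
batch_size = 32
for start in range(0, len(X), batch_size):
    batch = slice(start, start + batch_size)
    W = sgd_step(W, X[batch], y[batch], lr=0.05)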

The training iterations are then repeated over many epochs, i.e., full passes over the training set, for loss minimization. A well-trained DNN generalizes well, i.e., achieves high inference accuracy when applied to data that it has not seen before, e.g., the test set. There are alternatives to supervised learning, e.g., semi-supervised learning [52], unsupervised learning [53] and reinforcement learning [54]. In addition, there also exist several DNN architectures tailored to process the varying natures of input data, e.g., the Multilayer Perceptron (MLP) [55], the Convolutional Neural Network (CNN) [56] typically for CV tasks, and the Recurrent Neural Network (RNN) [57] usually for sequential tasks. However, an in-depth discussion is out of the scope of this paper. We refer interested readers to [58]–[63] for comprehensive discussions of DNN architectures and training strategies. We next focus on FL, an important paradigm shift towards enabling privacy-preserving and collaborative DL model training.

B. Federated Learning

Motivated by privacy concerns among data owners, the concept of FL is introduced in [21]. FL allows users to collaboratively train a shared model while keeping personal data on their devices, thus alleviating their privacy concerns. As such, FL can serve as an enabling technology for ML model training at mobile edge networks. For an introduction to the categorizations of different FL settings, e.g., vertical and horizontal FL, we refer the interested readers to [34].

In general, there are two main entities in the FL system, i.e., the data owners (viz. participants) and the model owner (viz. FL server). Let N = {1, ..., N} denote the set of N data owners, each of which has a private dataset D_i, i ∈ N. Each data owner i uses its dataset D_i to train a local model w_i and sends only the local model parameters to the FL server. Then, all collected local models are aggregated, w = ∪_{i∈N} w_i, to generate a global model w_G. This is different from traditional centralized training, which uses D = ∪_{i∈N} D_i to train a model w_T, i.e., data from each individual source is aggregated first before model training takes place centrally.

A typical architecture and training process of an FL system is shown in Fig. 4. In this system, the data owners serve as the FL participants which collaboratively train an ML model required by an aggregate server. An underlying assumption is that the data owners are honest, which means they use their real private data to do the training and submit the true local models to the FL server. Of course, this assumption may not always be realistic [64], and we discuss the proposed solutions subsequently in Sections IV and V.

In general, the FL training process includes the following three steps. Note that the local model refers to the model trained at each participating device, whereas the global model refers to the model aggregated by the FL server.


Fig. 4: General FL training process involving N participants.

• Step 1 (Task initialization): The server decides the training task, i.e., the target application, and the corresponding data requirements. The server also specifies the hyperparameters of the global model and the training process, e.g., the learning rate. Then, the server broadcasts the initialized global model w_G^0 and the task to the selected participants.

• Step 2 (Local model training and update): Based on the global model w_G^t, where t denotes the current iteration index, each participant respectively uses its local data and device to update the local model parameters w_i^t. The goal of participant i in iteration t is to find optimal parameters w_i^t that minimize the loss function L(w_i^t), i.e.,

w_i^{t*} = \arg\min_{w_i^t} L(w_i^t).   (3)

The updated local model parameters are subsequently sent to the server.

• Step 3 (Global model aggregation and update): The server aggregates the local models from the participants and then sends the updated global model parameters w_G^{t+1} back to the data owners. The server wants to minimize the global loss function L(w_G^t), i.e.,

L(w_G^t) = \frac{1}{N} \sum_{i=1}^{N} L(w_i^t).   (4)

Steps 2-3 are repeated until the global loss function converges or a desirable training accuracy is achieved.

Note that the FL training process can be used for different ML models that essentially use the SGD method, such as Support Vector Machines (SVMs) [65], neural networks, and linear regression [66]. A training dataset usually contains a set of n data feature vectors x = {x_1, ..., x_n} and a set of corresponding data labels^1 y = {y_1, ..., y_n}. In addition, let y_j = f(x_j; w) denote the predicted result from the model w updated/trained by data vector x_j. Table III summarizes several loss functions of common ML models [67].

TABLE III: Loss functions of common ML models

Model | Loss function L(w_i^t)
Neural network | (1/n) \sum_{j=1}^{n} (y_j - f(x_j; w))^2 (Mean Squared Error)
Linear regression | (1/2) ||y_j - w^T x_j||^2
K-means | \sum_j ||x_j - f(x_j; w)||, where f(x_j; w) is the centroid of all objects assigned to x_j's class
Squared-SVM | (1/n) \sum_{j=1}^{n} max(0, 1 - y_j (w^T x_j - bias)) + λ||w^T||^2, where bias is the bias parameter and λ is a constant
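As a concrete reference for Table III, the following NumPy sketch implements two of the listed losses, the mean squared error and the squared-SVM objective. The function names and the use of a linear predictor f(x; w) = w^T x are illustrative assumptions.

import numpy as np

def mse_loss(w, X, y, predict):
    # Mean Squared Error from Table III: (1/n) * sum_j (y_j - f(x_j; w))^2
    preds = predict(X, w)
    return np.mean((y - preds) ** 2)

def squared_svm_loss(w, X, y, bias=0.0, lam=0.01):
    # Squared-SVM objective from Table III:
    # (1/n) * sum_j max(0, 1 - y_j (w^T x_j - bias)) + lam * ||w||^2
    margins = np.maximum(0.0, 1.0 - y * (X @ w - bias))
    return np.mean(margins) + lam * np.dot(w, w)

# Toy usage with a linear predictor f(x; w) = w^T x.
linear_predict = lambda X, w: X @ w
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
y_reg = X @ w + 0.1 * rng.normal(size=100)   # regression targets
y_cls = np.sign(X @ w)                       # labels in {-1, +1}
print(mse_loss(w, X, y_reg, linear_predict), squared_svm_loss(w, X, y_cls))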

Global model aggregation is an integral part of FL. A straightforward and classical algorithm for aggregating the local models is the FedAvg algorithm proposed in [23], which is similar to that of local SGD [68]. The pseudocode for FedAvg is given in Algorithm 1.

Algorithm 1 Federated averaging algorithm [23]
Require: Local minibatch size B, number of participants m per iteration, number of local epochs E, and learning rate η.
Ensure: Global model w_G.
 1: [Participant i]
 2: LocalTraining(i, w):
 3: Split the local dataset D_i into minibatches of size B, which are included in the set B_i.
 4: for each local epoch j from 1 to E do
 5:     for each b ∈ B_i do
 6:         w ← w − η∇L(w; b)   (η is the learning rate and ∇L is the gradient of L on b.)
 7:     end for
 8: end for
 9:
10: [Server]
11: Initialize w_G^0
12: for each iteration t from 1 to T do
13:     Randomly choose a subset S_t of m participants from N
14:     for each participant i ∈ S_t in parallel do
15:         w_i^{t+1} ← LocalTraining(i, w_G^t)
16:     end for
17:     w_G^t = (1 / \sum_{i∈N} D_i) \sum_{i=1}^{N} D_i w_i^t   (Averaging aggregation)
18: end for

As described in Step 1 above, the server first initializes the task (lines 11-16). Thereafter, in Step 2, participant i implements the local training and optimizes the target in (3) on minibatches from the original local dataset (lines 2-8). Note that a minibatch refers to a randomized subset of each participant's dataset. At the t-th iteration (line 17), the server minimizes the global loss in (4) by the averaging aggregation, which is formally defined as

w_G^t = \frac{1}{\sum_{i \in N} D_i} \sum_{i=1}^{N} D_i w_i^t.   (5)

The FL training process is iterated until the global loss function converges or a desirable accuracy is achieved.
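To make the training loop concrete, the following self-contained NumPy sketch mirrors Algorithm 1 and the dataset-size-weighted aggregation in (5) for a simple linear model. Participant data, the loss, and the helper names (local_training, fedavg_round) are illustrative assumptions rather than the reference implementation of [23]; for simplicity, the aggregation here averages over the participants sampled in the round.

import numpy as np

rng = np.random.default_rng(42)

def local_training(w, X, y, epochs=5, batch_size=16, lr=0.05):
    # Lines 2-8 of Algorithm 1: E local epochs of mini-batch SGD on one participant.
    w = w.copy()
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # MSE gradient
            w -= lr * grad
    return w

def fedavg_round(w_global, datasets, m=3):
    # Lines 12-18 of Algorithm 1: sample m participants, train locally,
    # then aggregate with weights proportional to local dataset sizes, as in (5).
    chosen = rng.choice(len(datasets), size=m, replace=False)
    sizes, local_models = [], []
    for i in chosen:
        X, y = datasets[i]
        local_models.append(local_training(w_global, X, y))
        sizes.append(len(X))
    sizes = np.array(sizes, dtype=float)
    return sum(s * w for s, w in zip(sizes, local_models)) / sizes.sum()

# Toy federation: N participants holding private (X_i, y_i) shards.
true_w = np.array([1.0, -2.0, 0.5])
datasets = []
for _ in range(10):
    X = rng.normal(size=(64, 3))
    datasets.append((X, X @ true_w + 0.1 * rng.normal(size=64)))

w_G = np.zeros(3)
for t in range(20):                     # communication rounds
    w_G = fedavg_round(w_G, datasets)
print("learned global model:", w_G)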

C. Statistical Challenges of FL

Following an elaboration of the FL training process in the previous section, we now proceed to discuss the statistical challenges faced in FL.

1 In the case of unsupervised learning, there is no data label.

Page 7: Federated Learning in Mobile Edge Networks: A ...Index Terms—Federated Learning, mobile edge networks, re source allocation, communication cost, data privacy, data security W. Y

7

In traditional distributed ML, the central server has access to the whole training dataset. As such, the server can split the dataset into subsets that follow similar distributions. The subsets are subsequently sent to participating nodes for distributed training. However, this approach is impractical for FL since the local dataset is only accessible by the data owner.

In the FL setting, the participants may have local datasets that follow different distributions, i.e., the datasets of participants are non-IID. While the authors in [23] show that the aforementioned FedAvg algorithm is able to achieve desirable accuracy even when data is non-IID across participants, the authors in [69] found otherwise. For example, a FedAvg-trained CNN model has 51% lower accuracy than a centrally trained CNN model for CIFAR-10 [70]. This deterioration in accuracy is further shown to be quantified by the earth mover's distance (EMD) [71], i.e., the difference between an FL participant's data distribution and the population distribution. As such, when data is non-IID and highly skewed, data-sharing is proposed in which a shared dataset with uniform distribution across all classes is sent by the FL server to each FL participant. Then, the participant trains its local model on its private data together with the received data. The simulation results show that accuracy can be increased by 30% with 5% shared data due to the reduced EMD. However, a common dataset may not always be available for sharing by the FL server. An alternative solution to gather contributions towards building the common dataset is subsequently discussed in Section IV.
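As a rough illustration of how the label-distribution skew described above might be measured, the sketch below compares one participant's label distribution against a uniform population distribution. The helper name label_distribution and the use of SciPy's 1-D Wasserstein distance as an EMD proxy are assumptions for illustration, not the exact metric of [69].

import numpy as np
from scipy.stats import wasserstein_distance

def label_distribution(labels, num_classes):
    # Empirical class proportions held by one participant.
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

num_classes = 10
population = np.full(num_classes, 1.0 / num_classes)        # uniform reference
skewed_labels = np.random.default_rng(0).choice(
    num_classes, size=500,
    p=[0.4, 0.3, 0.1, 0.05, 0.05, 0.025, 0.025, 0.02, 0.02, 0.01])
participant = label_distribution(skewed_labels, num_classes)

# Two simple skew measures: an L1 (total-variation style) gap and a 1-D EMD proxy.
l1_gap = np.abs(participant - population).sum()
emd = wasserstein_distance(np.arange(num_classes), np.arange(num_classes),
                           u_weights=participant, v_weights=population)
print(l1_gap, emd)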

The authors in [72] also find that global imbalance, i.e., the situation in which the collection of data held across all FL participants is class imbalanced, leads to a deterioration in model accuracy. As such, the Astraea framework is proposed. On initialization, the FL participants first send their data distribution to the FL server. A rebalancing step is introduced before training begins, in which each participant performs data augmentation [73] on the minority classes, e.g., through random rotations and shifts. After training on the augmented data, a mediator is created to coordinate intermediate aggregation, i.e., before sending the updated parameters to the FL server for global aggregation. The mediator selects participants with data distributions that best contribute to a uniform distribution when aggregated. This is done through a greedy algorithm approach to minimize the Kullback-Leibler divergence [74] between the local data distribution and the uniform distribution. The simulation results show accuracy improvements when tested on imbalanced datasets.
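The greedy mediator step can be pictured with the short sketch below, which scores candidate participants by how much adding their class counts would reduce the KL divergence of the aggregate distribution from uniform. All function names and the toy numbers are illustrative assumptions, not the Astraea implementation.

import numpy as np

def kl_to_uniform(counts):
    # KL divergence between the normalized class counts and the uniform distribution.
    p = counts / counts.sum()
    u = np.full_like(p, 1.0 / len(p))
    mask = p > 0                      # 0 * log(0) is treated as 0
    return np.sum(p[mask] * np.log(p[mask] / u[mask]))

def greedy_select(current_counts, candidates):
    # Pick the candidate whose class counts, when aggregated, are closest to uniform.
    scores = [kl_to_uniform(current_counts + c) for c in candidates]
    return int(np.argmin(scores))

# Toy example with 4 classes: the mediator already holds a skewed aggregate.
current = np.array([50.0, 10.0, 5.0, 5.0])
candidates = [np.array([40.0, 0.0, 0.0, 0.0]),     # makes the skew worse
              np.array([0.0, 20.0, 25.0, 25.0])]   # rebalances the aggregate
print(greedy_select(current, candidates))           # -> 1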

Given the heterogeneity of data distributions across devices, there has been an increasing number of studies that borrow concepts from multi-task learning [75] to learn separate, but structurally related, models for each participant. Instead of minimizing the conventional loss function presented previously in Table III, the loss function is modified to also model the relationship amongst tasks [76]. Then, the MOCHA algorithm is proposed in which an alternating optimization approach [77] is used to approximately solve the minimization problem. Interestingly, MOCHA can also be calibrated based on the resource constraints of a participating device. For example, the quality of approximation can be adaptively adjusted based on the network conditions and CPU states of the participating devices. However, MOCHA cannot be applied to non-convex DL models.

Similarly, the authors in [78] also borrow concepts from multi-task learning to deal with the statistical heterogeneity in FL. The FEDPER approach is proposed in which all FL participants share a set of base layers trained using the FedAvg algorithm. Then, each participant separately trains another set of personalization layers using its own local data. In particular, this approach is suitable for building recommender systems given the diverse preferences of participants. The authors show empirically using the Flickr-AES dataset [79] that the FEDPER approach outperforms a pure FedAvg approach since the personalization layer is able to represent the personal preference of an FL participant. However, it is worth noting that the collaborative training of the base layers is still important to achieve a high test accuracy, since each participant has insufficient local data samples for purely personalized model training.
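One minimal way to express the base/personalization split is sketched below in PyTorch: only the parameters of the shared base leave the device for FedAvg-style aggregation, while each participant keeps its personalization head local. The module structure and helper names are illustrative assumptions, not the FEDPER reference code.

import torch
import torch.nn as nn

class FedPerStyleModel(nn.Module):
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        # Base layers: federated, aggregated across all participants.
        self.base = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # Personalization layers: trained and kept locally by each participant.
        self.personal = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.personal(self.base(x))

def shared_state(model):
    # Only the base parameters leave the device for aggregation.
    return {k: v.clone() for k, v in model.base.state_dict().items()}

def load_shared_state(model, base_state):
    # Overwrite the base with the aggregated global parameters; keep the head untouched.
    model.base.load_state_dict(base_state)

# Usage: two participants share a base but keep separate personalization heads.
alice, bob = FedPerStyleModel(), FedPerStyleModel()
global_base = shared_state(alice)          # e.g., obtained after FedAvg over base updates
load_shared_state(bob, global_base)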

Apart from data heterogeneity, the convergence of a distributed learning algorithm is always a concern. A higher convergence rate helps to save a large amount of time and resources for the FL participants, and also significantly increases the success rate of the federated training since fewer communication rounds imply reduced participant dropouts. To ensure convergence, the study in [80] proposes FedProx, which modifies the loss function to also include a tunable parameter that restricts how much local updates can affect the prevailing model parameters. The FedProx algorithm can be adaptively tuned, e.g., when the training loss is increasing, model updates can be tuned to affect the current parameters less. Similarly, the authors in [81] also propose the LoAdaBoost FedAvg algorithm to complement the aforementioned data-sharing approach [69] in ML on medical data. In LoAdaBoost FedAvg, participants train the model on their local data and compare the cross-entropy loss with the median loss from the previous training round. If the current cross-entropy loss is higher, the model is retrained before global aggregation so as to increase learning efficiency. The simulation results show that faster convergence is achieved as a result.
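The FedProx modification can be thought of as augmenting each participant's local objective with a proximal term (mu/2) * ||w - w_global||^2 that limits how far local updates drift from the prevailing global model. A hedged PyTorch sketch of that local objective is given below; the parameter name mu and the surrounding helper are assumptions rather than the exact formulation of [80].

import torch

def fedprox_local_loss(local_loss, local_params, global_params, mu=0.01):
    # Local FedProx-style objective: task loss + (mu/2) * ||w - w_global||^2.
    # The proximal term restricts how much local updates can drift from the
    # prevailing global parameters; mu can be tuned per round, e.g., raised
    # when the training loss starts to increase.
    prox = sum(((w - w_g.detach()) ** 2).sum()
               for w, w_g in zip(local_params, global_params))
    return local_loss + 0.5 * mu * prox

# Usage inside a participant's training step (model, global_model, criterion assumed given):
# loss = criterion(model(x), y)
# loss = fedprox_local_loss(loss, model.parameters(), global_model.parameters())
# loss.backward()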

In fact, the statistical challenges of FL coexist with other issues that we explore in subsequent sections. For example, the communication costs incurred in FL can be reduced by faster convergence. Similarly, resource allocation policies can also be designed to solve statistical heterogeneity. As such, we revisit these concepts in greater detail subsequently.

D. FL Protocols and Frameworks

To improve scalability, an FL protocol has been proposed in [82] from the system level. This protocol deals with issues regarding unstable device connectivity, communication security, etc.

The FL protocol (Fig. 5) consists of three phases in each training round:

1) Selection: In the participant selection phase, the FL server chooses a subset of connected devices to participate in a training round. The selection criteria may subsequently be calibrated to the server's needs, e.g., training efficiency [83]. In Section IV, we further elaborate on proposed participant selection methods.



2) Configuration: The server is configured according to the preferred aggregation mechanism, e.g., simple or secure aggregation [84]. Then, the server sends the training schedule and the global model to each participant.

3) Reporting: The server receives updates from participants. Thereafter, the updates can be aggregated, e.g., using the FedAvg algorithm.

In addition, to manage device connections according to the varying FL population size, pace steering is also recommended. Pace steering adaptively manages the optimal time window for participants to reconnect to the FL server [82]. For example, when the FL population is small, pace steering is used to ensure that a sufficient number of participating devices connect to the server simultaneously. In contrast, when there is a large population, pace steering randomly chooses devices to participate to prevent the situation in which too many participating devices are connected at one point in time.

Apart from communication efficiency, communication security during local update transmission is another problem to be resolved. Specifically, there are mainly two aspects of communication security:

1) Secure aggregation: To prevent local updates from being traced and utilized to infer the identity of the FL participant, a virtual and trusted third-party server is deployed for local model aggregation [84]. The Secret Sharing mechanism [85] is also used for the transmission of local updates with authenticated encryption.

2) Differential privacy: Similar to secure aggregation, differential privacy (DP) prevents the FL server from identifying the owner of a local update. The difference is that, to achieve the goal of privacy preservation, DP in FL [86] adds a certain degree of noise to the original local update while providing theoretical guarantees on the model quality, as illustrated in the sketch below.
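A common way to realize the second mechanism is to clip each local update and add calibrated Gaussian noise before it leaves the device. The NumPy sketch below illustrates that pattern only; the clipping norm and noise multiplier are illustrative assumptions and do not reproduce the specific mechanism or privacy accounting of [86].

import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    # Clip the update to a bounded L2 norm, then add Gaussian noise whose scale
    # is proportional to that bound, so that any single participant's contribution
    # is harder to identify from the submitted parameters.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# Usage: a participant perturbs its model delta before sending it to the FL server.
local_delta = np.random.default_rng(0).normal(size=1000)
noisy_delta = privatize_update(local_delta, clip_norm=1.0, noise_multiplier=1.1)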

These concepts on privacy and security are presented in detail in Section V. Recently, some open-source frameworks for FL have been developed as follows:

1) TensorFlow Federated (TFF): TFF [87] is a framework based on TensorFlow developed by Google for decentralized ML and other distributed computations. TFF consists of two layers: (i) FL and (ii) Federated Core (FC). The FL layer is a high-level interface that allows the application of FL to existing TF models without the user having to implement the FL algorithms personally. The FC layer combines TF with communication operators to allow users to experiment with customized and newly designed FL algorithms.

2) PySyft: PySyft [88] is a framework based on PyTorch for performing encrypted, privacy-preserving DL and implementations of related techniques, such as Secure Multiparty Computation (SMPC) and DP, in untrusted environments while protecting data. PySyft is developed such that it retains the native Torch interface, i.e., the ways to execute all tensor operations remain unchanged from those of PyTorch. When a SyftTensor is created, a LocalTensor is automatically created to also apply the input command to the native PyTorch tensor. To simulate FL, participants are created as Virtual Workers. Data, i.e., in the structure of tensors, can be split and distributed to the Virtual Workers as a simulation of a practical FL setting. Then, a PointerTensor is created to specify the data owner and storage location. In addition, model updates can be fetched from the Virtual Workers for global aggregation (a hedged usage sketch is given after this list).

3) LEAF: An open-source framework [89] of datasets that can be used as benchmarks in FL, e.g., Federated Extended MNIST (FEMNIST), an MNIST [90] dataset partitioned based on the writer of each character, and Sentiment140 [91], a dataset partitioned based on different users. In these datasets, the writer or user is assumed to be a participant in FL, and their corresponding data is taken to be the local data held in their personal devices. The implementation of newly designed algorithms on these benchmark datasets allows for reliable comparison across studies.

4) FATE: Federated AI Technology Enabler (FATE) is an open-source framework by WeBank [92] that supports the federated and secure implementation of several ML models.
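As a flavor of how simulation with these frameworks looks, the snippet below follows the PySyft workflow described in item 2 above (Virtual Workers holding data shards, PointerTensors referencing them). It assumes the 0.2.x-era PySyft API (TorchHook, VirtualWorker, send/get) and may not run unmodified on newer releases.

import torch
import syft as sy  # assumes the 0.2.x-era PySyft API described above

hook = sy.TorchHook(torch)                       # extend torch tensors with send/get
alice = sy.VirtualWorker(hook, id="alice")       # simulated FL participant
bob = sy.VirtualWorker(hook, id="bob")

data = torch.randn(100, 10)
# Split the data and distribute the shards; each send() returns a PointerTensor
# that records the owner (alice or bob) and the remote storage location.
alice_ptr = data[:50].send(alice)
bob_ptr = data[50:].send(bob)

# Computation is issued against the pointers and executed on the virtual workers;
# only the results are fetched back, mimicking the exchange of model updates in FL.
alice_mean = alice_ptr.mean(dim=0).get()
bob_mean = bob_ptr.mean(dim=0).get()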

E. Unique Characteristics and Issues of FL

Besides the statistical challenges we present in Section II-C, FL has some unique characteristics and features [93] as compared to other distributed ML approaches:

1) Slow and unstable communication: In traditional distributed training in a data center, the communication environment can be assumed to be perfect, where the information transmission rate is very high and there is no packet loss. However, these assumptions are not applicable to the FL environment where heterogeneous devices are involved in training. For example, the Internet upload speed is typically much slower than the download speed [94]. Also, some participants with unstable wireless communication channels may consequently drop out due to disconnection from the Internet.

2) Heterogeneous devices: Apart from bandwidth constraints, FL involves heterogeneous devices with varying resource constraints. For example, the devices can have different computing capabilities, i.e., CPU states and battery levels. The devices can also have different levels of willingness to participate, i.e., FL training is resource consuming and, given the distributed nature of training across numerous devices, there is a possibility of free ridership.

3) Privacy and security concerns: As we have previously discussed, data owners are increasingly privacy sensitive. However, as will be subsequently presented in Section V, malicious participants are able to infer sensitive information from shared parameters, which potentially negates privacy preservation. In addition, we have previously assumed that all participants and FL servers are trustworthy. In reality, they may be malicious.


Fig. 5: Federated learning protocol [82].


These unique characteristics of FL lead to several practical issues in FL implementation, mainly in four aspects that we now proceed to discuss, i.e., (i) statistical challenges, (ii) communication costs, (iii) resource allocation, and (iv) privacy and security. In the following sections, we review related work that addresses each of these issues.

III. COMMUNICATION COST

In FL, a number of rounds of communication between the participants and the FL server may be required to achieve a target accuracy (Fig. 5). For complex DL model training involving, e.g., CNNs, each update may comprise millions of parameters [95]. The high dimensionality of the updates can result in high communication costs and can lead to a training bottleneck. In addition, the bottleneck can be worsened due to (i) unreliable network conditions of participating devices [96] and (ii) the asymmetry in Internet connection speeds, in which the upload speed is typically slower than the download speed, resulting in delays in model uploads by participants [94]. As such, there is a need to improve the communication efficiency of FL. The following approaches to reduce communication costs are considered:

• Edge and End Computation: In the FL setup, the communication cost often dominates the computation cost [23]. The reason is that the on-device dataset is relatively small whereas the mobile devices of participants have increasingly fast processors. On the other hand, participants may only be willing to participate in the model training if they are connected to Wi-Fi [94]. As such, more computation can be performed on edge nodes or on end devices before each global aggregation so as to reduce the number of communication rounds needed for the model training. In addition, algorithms that ensure faster convergence can also reduce the number of communication rounds involved, at the expense of more computation on edge servers and end devices.

• Model Compression: This is a technique commonly used in distributed learning [97]. Model or gradient compression involves the communication of an update that is transformed to be more compact, e.g., through sparsification, quantization, or subsampling [98], rather than the communication of the full update. However, since the compression may introduce noise, the objective is to reduce the size of the update transferred during each round of communication while maintaining the quality of trained models [99] (see the compression sketch after this list).

• Importance-based Updating: This strategy involves selective communication such that only the important or relevant updates [100] are transmitted in each communication round. In fact, besides saving on communication costs, omitting some updates from participants can even improve the global model performance.
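To illustrate the compression idea referenced in the Model Compression bullet above, the following NumPy sketch applies simple top-k sparsification and uniform 8-bit quantization to a model update before transmission. The exact schemes and parameter choices are illustrative assumptions, not those of [94] or [98].

import numpy as np

def topk_sparsify(update, k_ratio=0.01):
    # Keep only the k largest-magnitude entries; send (indices, values) pairs.
    k = max(1, int(k_ratio * update.size))
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def quantize_8bit(values):
    # Uniformly quantize values to 8-bit integers plus a per-update scale factor.
    scale = np.abs(values).max() / 127.0 + 1e-12
    return np.round(values / scale).astype(np.int8), scale

def dequantize(q_values, scale):
    return q_values.astype(np.float32) * scale

# Usage: compress a participant's update, then reconstruct it at the server.
update = np.random.default_rng(0).normal(size=100_000).astype(np.float32)
idx, vals = topk_sparsify(update, k_ratio=0.01)
q_vals, scale = quantize_8bit(vals)
reconstructed = np.zeros_like(update)
reconstructed[idx] = dequantize(q_vals, scale)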

A. Edge and End Computation

To decrease the number of communication rounds, additional computation can be performed on participating end devices before each iteration of communication for global aggregation (Fig. 6(a)). The authors in [23] consider two ways to increase computation on participating devices: (i) increasing parallelism, in which more participants are selected to participate in each round of training, and (ii) increasing computation per participant, whereby each participant performs more local updates before communication for global aggregation. A comparison is conducted for the FederatedSGD (FedSGD) algorithm and the proposed FedAvg algorithm. For the FedSGD algorithm, all participants are involved and only one pass is made per training round, in which the minibatch size comprises the entirety of the participant's dataset.


This is similar to the full-batch training in centralized DL frameworks. For the proposed FedAvg algorithm, the hyperparameters are tuned such that more local computations are performed by the participants. For example, the participant can make more passes over its dataset or use a smaller local minibatch size to increase computation before each communication round. The simulation results show that increased parallelism does not lead to significant improvements in reducing communication cost, once a certain threshold is reached. As such, more emphasis is placed on increasing computation per participant while keeping the fraction of selected participants constant. For MNIST CNN simulations, increased computation using the proposed FedAvg algorithm can reduce communication rounds by more than 30 times when the dataset is IID. For the non-IID dataset, the improvement is less significant (2.8 times) using the same hyperparameters. However, for Long Short Term Memory (LSTM) simulations [101], improvements are more significant even for non-IID data (95.3 times). In addition, FedAvg increases the accuracy eventually since model averaging produces regularization effects similar to dropout [102], which prevents overfitting.

As an extension, the authors in [103] also validate that a concept similar to that of [23] works for vertical FL. In vertical FL, collaborative model training is conducted across the same set of participants with different data features. The Federated Stochastic Block Coordinate Descent (FedBCD) algorithm is proposed, in which each participating device performs multiple local updates first before communication for global aggregation. In addition, a convergence guarantee is also provided with an approximate calibration of the number of local updates per interval of communication.

Another way to decrease communication cost is through modifying the training algorithm to increase convergence speed, e.g., through the aforementioned LoAdaBoost FedAvg in [81]. Similarly, the authors in [104] also propose increased computation on each participating device by adopting a two-stream model (Fig. 6(b)) commonly used in transfer learning and domain adaptation [106]. During each training round, the global model is received by the participant and fixed as a reference in the training process. During training, the participant learns not just from local data, but also from other participants with reference to the fixed global model. This is done through the incorporation of Maximum Mean Discrepancy (MMD) into the loss function. MMD measures the distance between the means of two data distributions [106], [107]. Through minimizing the MMD loss between the local and global models, the participant can extract more generalized features from the global model, thus accelerating the convergence of the training process and reducing communication rounds. The simulation results on the CIFAR-10 and MNIST datasets using DL models such as AlexNet [56] and 2-CNN, respectively, show that the proposed two-stream FL can reach the desirable test accuracy in 20% fewer communication rounds even when data is non-IID. However, while convergence speed is increased, more computation resources have to be consumed by end devices for the aforementioned approaches.
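The two-stream idea hinges on adding an MMD term between feature representations produced under the local model and the fixed global model. The sketch below computes a simple linear-kernel MMD between two batches of features in PyTorch; the kernel choice and the weighting parameter (lambda_mmd) are illustrative assumptions rather than the exact formulation of [104].

import torch

def linear_mmd(features_local, features_global):
    # Linear-kernel MMD: squared distance between the feature means of the
    # local-model stream and the (fixed) global-model reference stream.
    return ((features_local.mean(dim=0) - features_global.mean(dim=0)) ** 2).sum()

# Usage inside a participant's training step (models and batch x assumed given):
# with torch.no_grad():
#     feats_global = global_model.features(x)      # frozen reference stream
# feats_local = local_model.features(x)
# loss = task_loss + lambda_mmd * linear_mmd(feats_local, feats_global)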

Fig. 6: Approaches to increase computation at edge and end devices include (a) increased computation at end devices, e.g., more passes over the dataset before communication [23], [103], (b) two-stream training with the global model as a reference [104], and (c) intermediate edge server aggregation [105].

Given the energy constraints of participating mobile devices in particular, this necessitates the resource allocation optimization that we subsequently discuss in Section IV.

While the aforementioned studies consider increasing computation on participating devices, the authors in [105] propose an edge computing inspired paradigm in which proximate edge servers can serve as intermediary parameter aggregators, given that the propagation latency from participant to edge server is smaller than that of participant-cloud communication (Fig. 6(c)). A hierarchical FL (HierFAVG) algorithm is proposed whereby, for every few local participant updates, the edge server aggregates the collected local models. After a predefined number of edge server aggregations, the edge server communicates with the cloud for global model aggregation. As such, the communication between the participants and the cloud occurs only once after an interval of multiple local updates. Comparatively, for the FedAvg algorithm proposed in [23], the global aggregation occurs more frequently since no intermediate edge server aggregation is involved. The authors further prove the convergence of HierFAVG for both convex and non-convex objective functions given non-IID user data. The simulation results show that for the same number of local updates between two global aggregations, more intermediate edge aggregations before each global aggregation can lead to reduced communication overhead as compared to the FedAvg algorithm. This result holds for both IID and non-IID data, implying that intermediate aggregation on edge servers may be implemented on top of FedAvg so as to reduce communication costs. However, when applied to non-IID data, the simulation results show that HierFAVG fails to converge to the desired accuracy level (90%) in some instances, e.g., when the edge-cloud divergence is large or when there are many edge servers involved. As such, a further study is required to better understand the tradeoffs between adjusting local and edge aggregation intervals, so as to ensure that the parameters of the HierFAVG algorithm can be optimally calibrated to suit other settings. Nevertheless, HierFAVG is a promising approach for the implementation of FL at mobile edge networks, since it leverages the proximity of intermediate edge servers to reduce communication costs and potentially relieve the burden on the remote cloud.
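The hierarchical aggregation pattern can be sketched as a nested loop: participants perform a few local updates, an edge server aggregates them several times, and only then does the cloud aggregate the edge models. The following is a simplified sketch in the spirit of HierFAVG, not its implementation; the helper names, the (k1, k2) interval parameters, and the data structures are illustrative assumptions.

```python
import numpy as np

def weighted_avg(models, sizes):
    """Average model vectors weighted by dataset sizes."""
    w = np.array(sizes, dtype=float) / sum(sizes)
    return sum(wi * mi for wi, mi in zip(w, models))

def hierarchical_round(w_cloud, edges, local_update, k1=2, k2=3):
    """One cloud round: each edge server performs k2 edge aggregations, and
    each edge aggregation follows k1 local updates per participant; only the
    final edge models reach the cloud.
    `edges` is a list of edge servers, each a list of (data, size) participants;
    `local_update(w, data, steps)` is a hypothetical training routine."""
    edge_models, edge_sizes = [], []
    for participants in edges:
        w_edge = w_cloud.copy()
        for _ in range(k2):                                   # edge aggregations
            models = [local_update(w_edge, d, k1) for d, _ in participants]
            sizes = [s for _, s in participants]
            w_edge = weighted_avg(models, sizes)              # aggregate at edge
        edge_models.append(w_edge)
        edge_sizes.append(sum(s for _, s in participants))
    return weighted_avg(edge_models, edge_sizes)              # aggregate at cloud
```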

B. Model Compression

To reduce communication costs, the authors in [94] propose structured and sketched updates to reduce the size of model updates sent from participants to the FL server during each communication round.

Structured updates restrict participant updates to have a pre-specified structure, i.e., low rank and random mask. For the low rank structure, each update is enforced to be a low rank matrix expressed as the product of two matrices. Here, one matrix is generated randomly and held constant during each communication round whereas the other is optimized. As such, only the optimized matrix needs to be sent to the server. For the random mask structure, each participant update is restricted to be a sparse matrix following a pre-defined random sparsity pattern generated independently during each round. As such, only the non-zero entries are required to be sent to the server.

On the other hand, sketched updates refer to the approach of encoding the update in a compressed form before communication with the server, which subsequently decodes the updates before aggregation. One example of a sketched update is the subsampling approach, in which each participant communicates only a random subset of the update matrix. The server then averages the subsampled updates to derive an unbiased estimate of the true average. Another example of a sketched update is the probabilistic quantization approach [108], in which the update matrices are vectorized and quantized for each scalar. To reduce the error from quantization, a structured random rotation that is the product of a Walsh-Hadamard matrix and a binary diagonal matrix [109] can be applied before quantization.
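A rough sketch of the two sketching tools is given below: random subsampling with rescaling so that the server-side average remains unbiased, and probabilistic (stochastic) uniform quantization of each scalar. This is an illustrative approximation of the techniques in [94], not their implementation; the shared-seed convention, the keep_frac value, and the omission of the structured random rotation are simplifying assumptions.

```python
import numpy as np

def subsample(update, keep_frac=0.25, seed=0):
    """Keep a random fraction of the update's entries and rescale, so that the
    average over many participants is an unbiased estimate of the true average
    (the mask is assumed to be reproducible from a seed shared with the server)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(update.shape) < keep_frac
    return np.where(mask, update, 0.0) / keep_frac

def stochastic_quantize(update, bits=2):
    """Probabilistic uniform quantization of each scalar to 2**bits levels,
    unbiased in expectation."""
    lo, hi = update.min(), update.max()
    levels = 2 ** bits - 1
    scaled = (update - lo) / max(hi - lo, 1e-12) * levels
    floor = np.floor(scaled)
    q = floor + (np.random.random(update.shape) < (scaled - floor))
    return q / levels * (hi - lo) + lo

# A participant would send subsample(...) or stochastic_quantize(...) of its
# update matrix instead of the raw update.
```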

The simulation results on the CIFAR-10 image classification task show that for structured updates, the random mask performs better than the low rank approach. The random mask approach also achieves higher accuracy than the sketching approaches, since the latter involve a removal of some information obtained during training. However, the combination of all three sketching tools, i.e., subsampling, quantization, and rotation, can achieve a higher compression rate and faster convergence, albeit with some sacrifices in accuracy. For example, by using 2 bits for quantization and sketching out all but 6.25% of update data, the number of bits needed to represent updates can be reduced by 256 times while the accuracy level achieved is 85%. In addition, sketched updates can achieve higher accuracy in training when more participants are trained per round. This suggests that for practical implementations of FL where many participants are available, more participants can be selected for training per round so that subsampling can be more aggressive to reduce communication costs.

Fig. 7: The compression techniques considered are summarized in this diagram adapted from the authors in [99]: (i) federated dropout to reduce the size of the model, (ii) lossy compression of the model, (iii) decompression for training, (iv) compression of participant updates, (v) decompression, and (vi) global aggregation.

The authors in [99] extend the studies in [94] by proposing lossy compression and federated dropout to reduce server-to-participant communication costs. A summary of the proposed techniques, adapted from the authors' work, is given in Fig. 7. For the participant-to-server upload of model parameters that we discussed previously, the decompressed updates can be averaged over many updates to obtain an unbiased estimate. However, there is no such averaging for server-to-participant communication since the same global model is sent to all participants during each round of communication. Similar to [94], subsampling and probabilistic quantization are considered. For the application of a structured random rotation before subsampling and quantization, Kashin's representation [110] is applied instead of the Hadamard transformation since the former is found to perform better in terms of the accuracy-size tradeoff.

In addition to the subsampling and quantization approaches, the federated dropout approach is also considered, in which a fixed number of activations at each fully-connected layer is removed to derive a smaller sub-model. The sub-model is then sent to the participants for training. The updated sub-model can then be mapped back to the global model to derive a complete DNN model with all weights updated during subsequent aggregation. This approach reduces the server-to-participant communication cost, and also the size of participant-to-server updates. In addition, local computation is reduced since fewer parameters have to be updated. The simulations are performed on the MNIST, CIFAR-10, and EMNIST [111] datasets. For lossy compression, it is shown that the subsampling approach taken by [94] does not reach an acceptable level of performance. The reason is that the update errors can be averaged out for participant-to-server uploads but not for server-to-participant downloads. On the other hand, quantization with Kashin's representation can achieve the same performance as the baseline without compression while reducing communication cost by nearly 8 times when the model is quantized to 4 bits. For federated dropout approaches, the results show that a dropout rate of 25% of the weight matrices of fully-connected layers (or filters in the case of CNNs) can achieve acceptable accuracy in most cases while ensuring around a 43% reduction in the size of models communicated. However, if dropout rates are more aggressive, convergence of the model can be slower.
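The federated dropout step can be pictured as slicing a fully-connected weight matrix down to a randomly kept subset of units and later writing the trained slice back into the full model. The sketch below illustrates this for a single layer; the helper names and the 25% default drop rate mirror the discussion above but are otherwise illustrative assumptions rather than the construction used in [99].

```python
import numpy as np

def make_submodel(weight, drop_rate=0.25, rng=None):
    """Federated-dropout style sub-model for one fully-connected layer:
    keep a random subset of output units and return the reduced weight
    matrix together with the kept indices (needed to map updates back)."""
    rng = rng or np.random.default_rng()
    n_out = weight.shape[1]
    keep = rng.choice(n_out, size=int(n_out * (1 - drop_rate)), replace=False)
    return weight[:, keep], keep

def merge_submodel(weight, sub_weight, keep):
    """Write the updated sub-model weights back into the full global layer;
    dropped columns keep their previous values for this round."""
    merged = weight.copy()
    merged[:, keep] = sub_weight
    return merged
```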

The aforementioned two studies suggest useful model compression approaches for reducing communication costs in both server-to-participant and participant-to-server communications. As one may expect, the reduction in communication costs comes with sacrifices in model accuracy. It will thus be useful to formalize the compression-accuracy tradeoff, especially since it varies across tasks and across different numbers of FL participants.

C. Importance-based Updating

Based on the observation that most parameter values of a DNN model are sparsely distributed and close to zero [112], the authors in [100] propose the edge Stochastic Gradient Descent (eSGD) algorithm, which selects only a small fraction of important gradients to be communicated to the FL server for parameter update during each communication round. The eSGD algorithm keeps track of the loss values at two consecutive training iterations. If the loss value of the current iteration is smaller than that of the preceding iteration, this implies that the current training gradients and model parameters are important for training loss minimization and thus, their respective hidden weights are assigned a positive value. In addition, the gradient is also communicated to the server for parameter update. Once this does not hold, i.e., the loss increases as compared to the previous iteration, other parameters are selected to be updated based on their hidden weight values. A parameter with a larger hidden weight value is more likely to be selected since it has been labeled as important several times during training. To account for small gradient values that can delay convergence if they are ignored and never updated [113], these gradient values are accumulated as residual values. Since the residuals may arise from different training iterations, each update to the residual is weighted with a discount factor using the momentum correction technique [114]. Once the accumulated residual gradients reach a threshold, they are chosen to replace the least important gradient coordinates according to the hidden weight values. The simulation results show that eSGD with a 50% drop ratio can achieve higher accuracy than the thresholdSGD algorithm proposed in [112], which uses a fixed threshold value to determine which gradient coordinates to drop. eSGD can also save a large proportion of the gradient size communicated. However, eSGD still suffers from accuracy loss as compared to standard SGD approaches. For example, when tested on simple classification tasks using the MNIST dataset, the model accuracy converges to just 91.22% whereas standard SGD can achieve 99.77% accuracy. If extended to more sophisticated tasks, the accuracy can potentially deteriorate to a larger extent. In addition, the accuracy and convergence speed of the eSGD approach fluctuate arbitrarily depending on the hyperparameters used, e.g., minibatch size. As such, further studies have to be conducted to formally balance the tradeoff between communication costs and training performance.
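A compact way to picture importance-based updating of this kind is a selector that uploads only a fraction of gradient coordinates, votes on coordinate importance using the loss trend, and accumulates what is left out in a discounted residual. The class below is a simplified, hedged sketch of this logic rather than the eSGD algorithm itself; the class name, the keep_frac, discount, and threshold values, and the exact residual bookkeeping are assumptions.

```python
import numpy as np

class ImportanceSelector:
    """Simplified sketch of eSGD-style importance-based updating: upload only a
    fraction of gradient coordinates each iteration; accumulate the rest in a
    discounted residual that is flushed once it grows past a threshold."""

    def __init__(self, dim, keep_frac=0.5, discount=0.9, threshold=1e-2):
        self.hidden = np.zeros(dim)        # importance "votes" per coordinate
        self.residual = np.zeros(dim)      # discounted sum of unsent gradients
        self.prev_loss = np.inf
        self.keep_frac = keep_frac
        self.discount = discount
        self.threshold = threshold

    def select(self, grad, loss):
        k = max(1, int(len(grad) * self.keep_frac))
        if loss < self.prev_loss:
            # Loss decreased: treat the current largest gradients as important.
            chosen = np.argsort(-np.abs(grad))[:k]
            self.hidden[chosen] += 1.0
        else:
            # Otherwise fall back on coordinates voted important in the past.
            chosen = np.argsort(-self.hidden)[:k]
        self.prev_loss = loss

        mask = np.zeros(len(grad), dtype=bool)
        mask[chosen] = True
        # Unsent coordinates accumulate into a discounted residual.
        self.residual = self.discount * self.residual + np.where(mask, 0.0, grad)
        flush = (~mask) & (np.abs(self.residual) > self.threshold)
        # Send the chosen gradients plus any residuals large enough to flush.
        sparse = np.where(mask, grad, np.where(flush, self.residual, 0.0))
        self.residual[flush] = 0.0
        return sparse       # only the non-zero entries need to be communicated
```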

While [100] studies the selective communication of gradients, the authors in [96] propose the Communication-Mitigated Federated Learning (CMFL) algorithm, which uploads only relevant local model updates to reduce communication costs while guaranteeing global convergence. In each iteration, a participant's local update is first compared with the global update to determine whether the update is relevant. A relevance score is computed as the percentage of parameters that have the same sign in the local and global updates. In practice, the global update is not known in advance of aggregation. As such, the global update made in the previous iteration is used as an estimate for comparison, since it was found empirically that more than 99% of the normalized differences between two sequential global updates are smaller than 0.05 for both MNIST CNN and Next-Word-Prediction LSTM. An update is considered irrelevant if its relevance score is smaller than a predefined threshold. The simulation results show that CMFL requires 3.47 times and 13.97 times fewer communication rounds to reach 80% accuracy for MNIST CNN and Next-Word-Prediction LSTM, respectively, as compared to the benchmark FedAvg algorithm. In addition, CMFL can save significantly more communication rounds as compared to Gaia. Note that Gaia is a geo-distributed ML approach suggested in [115] which measures relevance based on the magnitude of updates rather than the sign of parameters. When applied with the aforementioned MOCHA algorithm (Section II-C) [76], CMFL can reduce communication rounds by 5.7 times for the Human Activity Recognition dataset [116] and 3.3 times for the Semeion Handwritten Digit dataset [117]. In addition, CMFL can achieve slightly higher accuracy since it eliminates irrelevant updates that are outliers which harm training.
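The relevance check at the heart of CMFL reduces to counting sign agreements between the local update and the previous global update. A minimal sketch, assuming flattened parameter vectors and an arbitrarily chosen threshold value:

```python
import numpy as np

def relevance_score(local_update, prev_global_update):
    """Fraction of coordinates whose signs agree between the local update and
    the previous global update (used as an estimate of the unknown current one)."""
    return np.mean(np.sign(local_update) == np.sign(prev_global_update))

def should_upload(local_update, prev_global_update, threshold=0.7):
    """CMFL-style check: updates with low sign agreement are deemed irrelevant
    and are not uploaded (the threshold here is an arbitrary illustrative value)."""
    return relevance_score(local_update, prev_global_update) >= threshold
```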

D. Summary and Lessons Learned

In this section, we have reviewed three main approaches for communication cost reduction in FL, and for each approach, we discuss the solutions proposed in different studies. We summarize the approaches along with references in Table IV.

TABLE IV: Approaches to communication cost reduction in FL.

| Approaches | Ref. | Key Ideas | Tradeoffs and Shortcomings |
| --- | --- | --- | --- |
| Edge and End Computation | [23] | More local updates in between communication for global aggregation, to reduce instances of communication | Increased computation cost and poor performance in the non-IID setting |
| Edge and End Computation | [103] | Similar to the ideas of [23], but with convergence guarantees for vertical FL | Increased computation cost and delayed convergence if global aggregation is too infrequent |
| Edge and End Computation | [104] | Transfer learning-inspired two-stream model for FL participants to learn from the fixed global model so as to accelerate training convergence | Increased computation cost and delayed convergence |
| Edge and End Computation | [105] | MEC-inspired edge server assisted FL that aids in intermediate parameter aggregation to reduce instances of communication | System model is not scalable when there are more edge servers |
| Model Compression | [94] | Structured and sketched updates to compress local models communicated from participant to FL server | Model accuracy and convergence issues |
| Model Compression | [99] | Similar to the ideas of [94], but for communication from FL server to participants | Model accuracy and convergence issues |
| Importance-based Updating | [100] | Selective communication of gradients that are assigned importance scores, i.e., to reduce training loss | Only empirically tested on simple datasets and tasks, with fluctuating results |
| Importance-based Updating | [96] | Selective communication of local model updates that have higher relevance scores when compared to the previous global model | Difficult to implement when global aggregations are less frequent |

From this review, we gather the following lessons learned:

• Communication cost is a key issue to be resolved before we can implement FL at scale. In particular, state-of-the-art DL models have high inference accuracy but are increasingly complex, with millions of parameters. The slow upload speed of mobile devices can thus impede the implementation of efficient FL.

• This section explores several key approaches to communication cost reduction. However, many of the approaches, e.g., model compression, result in a deterioration in model accuracy or incur high computation cost. For example, when too many local updates are performed between communication rounds, the communication cost is indeed reduced but convergence can be significantly delayed [103]. The tradeoff between these sacrifices and communication cost reduction thus has to be well-managed.

• The current studies of this tradeoff are often mainly empirical in nature, e.g., several experiments have to be done to find the optimal number of local training iterations before communication. With more effective optimization approaches formalized theoretically and tested empirically, FL can eventually become more scalable in nature. For example, the authors in [118] study the tradeoff between the completion time of FL training and the energy cost expended. Then, a weighted sum of completion time and energy consumption is minimized using an iterative algorithm. For delay-sensitive scenarios, the weights can be adjusted such that the FL participants consume more energy for completion time minimization.

• Apart from working to directly reduce the size of the model communicated, studies on FL can draw inspiration from applications and approaches in the MEC paradigm. For example, a simple case study introduced in [100] considers the base station as an intermediate model aggregator to reduce instances of device-cloud communication. Unfortunately, there are convergence issues when more edge servers or mobile devices are considered. This is exacerbated by the non-IID distribution of data across different edge nodes. For future works, this statistical challenge can be met, e.g., through inspiration from multi-task learning as we have discussed in Section II-C. In addition, more effective and innovative system models can be explored such that FL networks can utilize the wealth of computing and storage resources that are closer to the data sources to facilitate efficient FL.

• For the studies that we have discussed in this section, the heterogeneity among mobile devices, e.g., in computing capabilities, is often not considered. For example, one of the ways to reduce communication cost is to increase computation on edge devices, e.g., by performing more local updates [23] before each communication round. In fact, this does not merely lead to the expenditure of greater computation cost. The approach may also not be feasible for devices with weak processing power, and can lead to the straggler effect. As such, we further explore issues on resource allocation in the next section.

IV. RESOURCE ALLOCATION

FL involves the participation of heterogeneous devices that have different dataset qualities, computation capabilities, energy states, and willingness to participate. Given the device heterogeneity and resource constraints, i.e., in device energy states and communication bandwidth, resource allocation has to be optimized to maximize the efficiency of the training process. In particular, the following resource allocation issues need to be considered:

• Participant Selection: As part of the FL protocol presented in Section II-D, participant selection refers to the selection of devices to participate in each training round. Typically, a set of participants is randomly selected by the server to participate. Then, the server has to aggregate parameter updates from all participating devices in the round before taking a weighted average of the models [23]. As such, the training progress of FL is limited by the training time of the slowest participating devices, i.e., stragglers [119]. New participant selection protocols are thus investigated in order to address the training bottleneck in FL.

• Joint Radio and Computation Resource Management: Even though the computation capabilities of mobile devices have grown rapidly, many devices still face a scarcity of radio resources [120]. Given that local model transmission is an integral part of FL, there has been a growing number of studies that focus on developing novel wireless communication techniques for efficient FL.

• Adaptive Aggregation: As discussed in Section II-B, FL involves global aggregation in which model parameters are communicated to the FL server for aggregation. The conventional approach to global aggregation is a synchronous one, i.e., global aggregations occur at fixed intervals after all participants complete a certain number of rounds of local computation. However, adaptive calibration of the global aggregation frequency can be investigated to increase training efficiency subject to resource constraints [119].

• Incentive Mechanism: In the practical implementation of FL, participants may be reluctant to participate in a federation without receiving compensation since training models is resource-consuming. In addition, there exists information asymmetry between the FL server and participants since participants have greater knowledge of their available computation resources and data quality. Therefore, incentive mechanisms have to be carefully designed to both incentivize participation and reduce the potential adverse impacts of information asymmetry.

A. Participant Selection

To mitigate the training bottleneck, the authors in [83] propose a new FL protocol called FedCS. This protocol is illustrated in Fig. 8. The system model is an MEC framework in which the MEC operator is the FL server that coordinates training in a cellular network comprising participating mobile devices with heterogeneous resources. Accordingly, the FL server first conducts a Resource Request step to gather information such as wireless channel states and computing capabilities from a subset of randomly selected participants. Based on this information, the MEC operator selects the maximum possible number of participants that can complete training within a prespecified deadline for the subsequent global aggregation phase. By selecting the maximum possible number of participants in each round, the accuracy and efficiency of training are preserved. To solve the maximization problem, a greedy algorithm [121] is proposed, i.e., participants that take the least time for model upload and update are iteratively selected for training. The simulation results show that compared with an FL protocol which only accounts for the training deadline without performing participant selection, FedCS can achieve higher accuracy since FedCS is able to involve more participants in each training round [23]. However, FedCS has been tested only on simple DNN models. When extended to the training of more complex models, it may be difficult to estimate how many participants should be selected. For example, more training rounds may be needed for the training of complex models, and the selection of too few participants may lead to poor performance considering that some participants may drop out during training. In addition, there is a bias towards selecting participants with devices that have better computing capabilities. These participants may not hold data that is representative of the population distribution. We revisit this fairness issue [122] subsequently in this section.
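The core of the FedCS selection step is a greedy rule: keep adding the participants with the smallest estimated update-and-upload times while the round still fits within the deadline. The sketch below captures only this greedy idea and ignores the sequential upload scheduling of the full protocol; the function name, the dictionary input format, and the example timings are illustrative assumptions.

```python
def greedy_select(clients, deadline):
    """Simplified FedCS-style selection: greedily admit the participants with
    the smallest estimated (local update + upload) time until the round
    deadline would be exceeded.  `clients` maps client id -> estimated time
    in seconds (a hypothetical input format)."""
    selected, elapsed = [], 0.0
    for cid, t in sorted(clients.items(), key=lambda kv: kv[1]):
        if elapsed + t <= deadline:
            selected.append(cid)
            elapsed += t
    return selected

# Example: ten clients with heterogeneous compute/radio conditions.
# print(greedy_select({f"c{i}": 5 + 3 * i for i in range(10)}, deadline=60))
```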

Fig. 8: Participant selection under the FedCS and Hybrid-FL protocol.

While FedCS addresses the heterogeneity of resources among participants in FL, the authors in [123] extend their work on the FedCS protocol with the Hybrid-FL protocol, which deals with differences in data distributions among participants. The datasets of participants in FL may be non-IID since each is reflective of an individual user's specific characteristics. As we have discussed in Section II-C, non-IID datasets may significantly degrade the performance of the FedAvg algorithm [69]. One proposed measure to address the non-IID nature of the data is to distribute publicly available data to participants, such that the EMD between their on-device dataset and the population distribution is reduced. However, such a dataset may not always exist, and participants may not download it for security reasons. Thus, an alternative solution is to construct an approximately IID dataset using inputs from a limited number of privacy-insensitive participants [123]. In the Hybrid-FL protocol, during the Resource Request step (Fig. 8), the MEC operator asks random participants whether they permit their data to be uploaded. During the participant selection phase, apart from selecting participants based on computing capabilities, participants are selected such that their uploaded data can form an approximately IID dataset in the server, i.e., the amount of collected data in each class has close values (Fig. 8). Thereafter, the server trains a model on the collected IID dataset and merges this model with the global model trained by participants. The simulation results show that even with just 1% of participants sharing their data, the classification accuracy for non-IID data can be significantly improved as compared to the aforementioned FedCS benchmark where data is not uploaded at all. However, the recommended protocol can violate the privacy and security of users, especially if the FL server is malicious. In the case where participants are malicious, data can be falsified before uploading, as we will further discuss in Section V. In addition, the proposed measure can be costly, especially in the case of videos and images. As such, it is unlikely that participants will volunteer for data uploading when they can free ride on the efforts of other volunteers. For feasibility, a well-designed incentive and reputation mechanism is needed to ensure that only trustworthy participants are allowed to upload their data.

In general, the mobile edge network environment in which FL is implemented is dynamic and uncertain, with variable constraints, e.g., wireless network and energy conditions. This can lead to training bottlenecks. To this end, Deep Q-Learning (DQL) can be used to optimize resource allocation for model training, as proposed in [124]. The system model is a Mobile Crowd Machine Learning (MCML) setting which enables participants in a mobile crowd network to collaboratively train DNN models required by an FL server. The participating mobile devices are constrained by energy, CPU, and wireless bandwidth. Thus, the server needs to determine the proper amounts of data, energy, and CPU resources that the mobile devices use for training so as to minimize energy consumption and training time. Under the uncertainty of the mobile environment, a stochastic optimization problem is formulated. In the problem, the server is the agent, the state space includes the CPU and energy states of the mobile devices, and the action space includes the number of data units and energy units taken from the mobile devices. To achieve the objective, the reward function is defined as a function of the accumulated data, energy consumption, and training latency. To overcome the large state and action space issues of the server, a DQL technique based on Double Deep Q-Network (DDQN) [125] is adopted to solve the server's problem. The simulation results show that the DQL scheme can reduce energy consumption by around 31% compared with the greedy algorithm, and training latency is reduced by up to 55% compared with the random scheme. However, the proposed scheme is applicable only to federations with few participating mobile devices.

As an extension to [124], the authors in [126] also propose a resource allocation approach using DRL, with the added uncertainty that FL participants are mobile and may thus venture out of the network coverage range with a certain probability. Without prior knowledge of the mobile network, the FL server is able to optimize resource allocation across participants, e.g., channel selection and device energy consumption.

The aforementioned resource allocation approaches focus on improving the training efficiency of FL. However, this may cause some FL participants to be left out of the aggregation phase because they are stragglers with limited computing or communication resources.

One consequence of this is unfair resource allocation, a topic that is commonly explored in resource allocation for wireless networks [127] and ML [128]. For example, if the participant selection protocol selects mobile devices with higher computing capabilities to participate in each training round [83], the FL model will be overrepresented by the distribution of data owned by participants with devices that have higher computing capabilities. Therefore, the authors in [122] and [129] consider fairness as an additional objective in FL. Fairness is defined in [129] as the variance of the performance of an FL model across participants. If the variance of the testing accuracy is large, this implies the presence of more bias or less fairness, since the learned model may be highly accurate for certain participants and less so for other underrepresented participants. The authors in [129] propose the q-Fair FL (q-FFL) algorithm that reweighs the objective function in FedAvg to assign higher weights in the loss function to devices with higher loss. The modified objective function is as follows:

\min_{w} F_q(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w) \qquad (6)

where F_k refers to the standard loss functions presented in Table III, q refers to the calibration of fairness in the system model, i.e., setting q = 0 recovers the typical FL objective, and p_k refers to the ratio of local samples to the total number of training samples. In fact, this is a generalization of the Agnostic FL (AFL) algorithm proposed in [122], in which the device with the highest loss dominates the entire loss function. The simulation results show that the proposed q-FFL can achieve a lower variance of testing accuracy and converges more quickly than the AFL algorithm. However, as expected, for some calibrations of the q-FFL algorithm, there can be a convergence slowdown since stragglers can delay the training process. As such, an asynchronous aggregation approach (to be subsequently discussed in this section) can potentially be considered for use with the q-FFL algorithm.
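Eq. (6) itself is straightforward to evaluate once the per-participant losses are known. A minimal sketch, assuming hypothetical client_losses and client_sizes inputs and that p_k is the sample-count ratio defined above:

```python
import numpy as np

def q_ffl_objective(client_losses, client_sizes, q):
    """q-FFL objective of Eq. (6): sum_k p_k / (q+1) * F_k(w)^(q+1).
    q = 0 recovers the usual weighted-average FL objective; larger q places
    more weight on participants with higher loss (more fairness)."""
    p = np.asarray(client_sizes, dtype=float) / np.sum(client_sizes)
    F = np.asarray(client_losses, dtype=float)
    return np.sum(p / (q + 1) * F ** (q + 1))

# Example: two participants, the second performing worse.
# q_ffl_objective([0.2, 1.0], [100, 100], q=0)  -> ordinary weighted average
# q_ffl_objective([0.2, 1.0], [100, 100], q=2)  -> dominated by the high-loss device
```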

In contrast, the authors in [130] propose a neural network based approach to estimate the local models of FL participants that are left out during training. In the system model, resource blocks are first allocated by the base station to users whose models have larger effects on the global FL model. In particular, one user is selected to always be connected to the base station. This user's model parameters are then used as input to a feedforward neural network to estimate the model parameters of users who are left out during the training iteration. This allows the base station to integrate more locally trained FL model parameters into each iteration of global aggregation, thus improving the FL convergence speed.

B. Joint Radio and Computation Resource Management

While most FL studies have previously assumed orthogonal-access schemes such as Orthogonal Frequency-Division Multiple Access (OFDMA) [131], the authors in [132] propose a multi-access Broadband Analog Aggregation (BAA) design for communication-latency reduction in FL. Instead of performing communication and computation separately during global aggregation at the server, the BAA scheme builds on the concept of over-the-air computation [133] to integrate computation and communication by exploiting the signal superposition property of a multiple-access channel. The proposed BAA scheme allows the reuse of the whole bandwidth (Fig. 9(a)) whereas OFDMA orthogonalizes bandwidth allocation (Fig. 9(b)). As such, for orthogonal-access schemes, communication latency increases in direct proportion to the number of participants, whereas for multi-access schemes, latency is independent of the number of participants. The bottleneck of the signal-to-noise ratio (SNR) during BAA transmission is the participating device with the longest propagation distance, given that nearer devices have to lower their transmission power for amplitude alignment with devices located further away. To increase the SNR, participants with longer propagation distances have to be dropped. However, this leads to the truncation of model parameters. As such, to manage the SNR-truncation tradeoff, three scheduling schemes are considered, namely (i) cell-interior scheduling, in which participants beyond a distance threshold are not scheduled, (ii) all-inclusive scheduling, in which all participants are considered, and (iii) alternating scheduling, in which the edge server alternates between the two aforementioned schemes. The simulation results show that the proposed BAA scheme can achieve similar test accuracy as the OFDMA scheme while achieving a latency reduction of 10 to 1000 times. Comparing the three scheduling schemes, the cell-interior scheme outperforms the all-inclusive scheme in terms of test accuracy for high mobility networks where participants have rapidly changing locations. For low mobility networks, the alternating scheduling scheme outperforms cell-interior scheduling.

Fig. 9: A comparison [132] between (a) BAA by over-the-air computation, which reuses the whole bandwidth, and (b) OFDMA, which uses only the allocated bandwidth.

As an extension, the authors in [134] also introduce error accumulation and gradient sparsification in addition to over-the-air computation. In [132], gradient vectors that are not transmitted as a result of power constraints are completely dropped. To improve the model accuracy, the untransmitted gradient vectors can first be stored in an error accumulation vector. In the next round, local gradient estimates are then corrected using the error vector. In addition, when there are bandwidth limitations, the participating device can apply gradient sparsification to keep only the elements with the highest magnitudes for transmission. The elements that are not transmitted are subsequently added to the error accumulation vector for gradient estimate correction in the next round. The simulation results show that the proposed scheme can achieve higher test accuracy than over-the-air computation without error accumulation or gradient sparsification, since it corrects gradient estimates with the error accumulation vector and allows for a more efficient utilization of the bandwidth.
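Gradient sparsification with error accumulation can be sketched as a per-round routine: correct the gradient with the carried-over error, transmit only the largest-magnitude entries, and store the remainder for the next round. This is a generic sketch of the mechanism rather than the scheme of [134] (which couples it with over-the-air aggregation and power constraints); the function name and the choice of k are assumptions.

```python
import numpy as np

def sparsify_with_error_feedback(grad, error, k):
    """Keep only the k largest-magnitude entries of (gradient + accumulated
    error); everything left out is carried over to the next round."""
    corrected = grad + error                      # error accumulation / feedback
    keep = np.argsort(-np.abs(corrected))[:k]
    sparse = np.zeros_like(corrected)
    sparse[keep] = corrected[keep]
    new_error = corrected - sparse                # untransmitted part is stored
    return sparse, new_error

# Per participant, per round (error starts as a zero vector):
# sparse_update, error = sparsify_with_error_feedback(local_grad, error, k=1000)
```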

Similar to [132] and [134], the authors in [135] propose an integration of computation and communication via over-the-air computation. However, it is observed that the aggregation error incurred during over-the-air computation can lead to a drop in model accuracy [136] as a result of signal distortion. As such, a participant selection algorithm is proposed in which the number of devices selected for training is maximized so as to improve statistical learning performance [23] while keeping the signal distortion below a threshold. Due to the nonconvexity [137] of the mean-square-error (MSE) constraint and the intractability of the optimization problem, a difference-of-convex functions (DC) algorithm [138] is proposed to solve the maximization problem. The simulation results show that the proposed DC algorithm is scalable and can achieve near-optimal performance comparable to global optimization, which is non-scalable due to its exponential time complexity. In comparison with other state-of-the-art approaches such as the semidefinite relaxation (SDR) technique proposed in [139], the proposed DC algorithm can also select more participants, thus achieving higher model accuracy.

C. Adaptive Aggregation

The proposed FedAvg algorithm aggregates parameters synchronously, as shown in Fig. 10(a), and is thus susceptible to the straggler effect, i.e., each training round only progresses as fast as the slowest device since the FL server waits for all devices to complete local training before global aggregation can take place [119]. In addition, the model does not account for participants that join halfway when the training round is already in progress. As such, the asynchronous model is proposed to improve the scalability and efficiency of FL. In asynchronous FL, the server updates the global model whenever it receives a local update (Fig. 10(b)). The authors in [119] find empirically that an asynchronous approach is robust to participants joining halfway through a training round, as well as to federations involving participating devices with heterogeneous processing capabilities. However, model convergence is found to be significantly delayed when data is non-IID and unbalanced. As an improvement, [140] proposes the FedAsync algorithm, in which each newly received local update is adaptively weighted according to its staleness, defined as the difference between the current epoch and the iteration to which the received update belongs. For example, a stale update from a straggler is outdated since it should have been received in a previous training round; as such, it is weighted less. In addition, the authors also prove a convergence guarantee for a restricted family of non-convex problems. However, the hyperparameters of the FedAsync algorithm still have to be tuned to ensure convergence in different settings. As such, the algorithm is still unable to generalize to the dynamic computation constraints of heterogeneous devices. In fact, given the uncertainty surrounding the reliability of asynchronous FL, synchronous FL remains the most commonly used approach today [82].
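Staleness-aware asynchronous aggregation can be summarized as mixing each arriving local model into the global model with a weight that decays with the update's age. The sketch below uses a polynomial decay in the spirit of FedAsync; the function names, the decay exponent a, and the base mixing rate alpha are illustrative assumptions.

```python
def staleness_weight(alpha, current_round, update_round, a=0.5):
    """Polynomial staleness discount: the older the local update (the larger
    current_round - update_round), the smaller its mixing weight."""
    return alpha * (current_round - update_round + 1) ** (-a)

def async_aggregate(w_global, w_local, alpha, current_round, update_round):
    """Server updates the global model as soon as one local update arrives,
    mixing it in with a staleness-discounted weight.  Both models are assumed
    to be lists of per-layer parameter arrays (or plain floats)."""
    a_t = staleness_weight(alpha, current_round, update_round)
    return [(1 - a_t) * g + a_t * l for g, l in zip(w_global, w_local)]

# A very stale update (update_round << current_round) gets a small a_t and
# therefore contributes little to the new global model.
```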

Fig. 10: A comparison between (a) synchronous and (b) asynchronous FL.

For most existing implementations of the FedAvg algorithm, the global aggregation phase occurs after a fixed number of training rounds. To better manage the dynamic resource constraints, the authors in [67] propose an adaptive global aggregation scheme which varies the global aggregation frequency so as to ensure desirable model performance while ensuring an efficient use of available resources, e.g., energy, during the FL training process. In [67], the MEC system model consists of (i) the local update phase where the model is trained using local data, (ii) the edge aggregation phase where intermediate aggregation occurs, and (iii) the global aggregation phase where updated model parameters are received and aggregated by the FL server. In particular, the authors study how the training loss is affected when the total number of edge server aggregations and local updates between global aggregation intervals varies. For this, a convergence bound of gradient descent with non-IID data is first derived. Then, a control algorithm is subsequently proposed to adaptively choose the optimal global aggregation frequency based on the most recent system state. For example, if global aggregation is too time consuming, more edge aggregations will take place before communication with the FL server is initiated. The simulation results show that the adaptive aggregation scheme outperforms the fixed aggregation scheme in terms of loss function minimization and accuracy within the same time budget. However, the convergence guarantee of the adaptive aggregation scheme is currently only provided for convex loss functions.

D. Incentive Mechanism

The authors in [141] propose a service pricing scheme in which participants serve as training service providers for a model owner. In addition, to overcome energy inefficiency in the transfer of model updates, a cooperative relay network is proposed to support model update transfer and trading. The interaction between participants and the model owner is modelled as a Stackelberg game [142] in which the model owner is the buyer and participants are the sellers. In the proposed Stackelberg game, each rational participant can non-cooperatively decide on its own profit-maximizing price. In the lower-level subgame, the model owner determines the size of training data to maximize its profit, taking into account the increasing concave relationship between the learning accuracy of the model and the size of training data. In the upper-level subgame, the participants decide the price per unit of data to maximize their individual profits. The simulation results show that the proposed mechanism can ensure the uniqueness of the Stackelberg equilibrium. For example, model updates that contain valuable information are priced higher at the Stackelberg equilibrium. In addition, model updates can be transferred cooperatively, thus reducing congestion in communication and improving energy efficiency. However, the simulation environment only involves relatively few mobile devices.

Similar to [141], the authors in [143], [144] also model the interaction between participants and the model owner as a Stackelberg game, which is well-suited to represent the FL server-participant interaction involved in FL.

However, unlike the aforementioned conventional approaches to solving Stackelberg formulations, a DRL-based approach is adopted together with the Stackelberg game by the authors in [145]. In the DRL formulation, the FL server acts as an agent that decides on a payment in response to the participation level and payment history of edge nodes, with the objective of minimizing incentive expenses. Then, the edge nodes determine an optimal participation level in response to the payment policy. This learning-based incentive mechanism design enables the FL server to derive an optimal policy in response to its observed state, without requiring any prior information.

In contrast to [141], [143]–[145], the authors in [146] propose an incentive design using a contract theoretic [147] approach to attract participants with high-quality data for FL. In particular, well-designed contracts can reduce information asymmetry through self-revealing mechanisms in which participants select only the contracts specifically designed for their types. For feasibility, each contract must satisfy the Individual Rationality (IR) and Incentive Compatibility (IC) constraints. For IR, each participant is assured of a positive utility when it participates in the federation. For IC, every utility-maximizing participant only chooses the contract designed for its type. The model owner aims to maximize its own profit subject to the IR and IC constraints. As illustrated in Fig. 11, the optimal contracts derived are self-revealing such that each high-type participant with higher data quality only chooses contracts designed for its type, whereas each low-type participant with lower data quality does not have the incentive to imitate high-type participants. The simulation results show that all types of participants only achieve maximum utility when they choose the contract that matches their types. In addition, the proposed contract theory approach also has better performance in terms of profit for the model owner compared with the Stackelberg game-based incentive mechanism. This is because under the contract theoretic approach, the model owner can extract more profit from the participants, whereas under the Stackelberg game approach, the participants can optimize their individual utilities. In fact, the information asymmetry between FL servers and participants makes contract theory a powerful and efficient tool for mechanism design in FL. As an extension, the authors in [148] introduce a multi-dimensional contract in which each FL participant determines the optimal computation power and image quality it is willing to contribute for model training, in exchange for contract rewards in each iteration.

Fig. 11: Participants with unknown resource constraints maximize their utility only if they choose the bundle that best reflects their constraints.

The authors in [146] further introduce reputation as a metric to measure the reliability of FL participants and design a reputation-based participant selection scheme for reliable FL [64]. In this setting, each participant has a reputation value [149] derived from two sources: (i) direct reputation opinions from past interactions with the FL server and (ii) indirect reputation opinions from other task publishers, i.e., other FL servers. The indirect reputation opinions are stored in an open-access reputation blockchain [150] to ensure secure reputation management in a decentralized manner. Before model training, each participant chooses a contract that best fits its dataset accuracy and resource conditions. Then, the FL server chooses the participants that have reputation scores larger than a prespecified threshold. After the FL task is completed, i.e., a desirable accuracy is achieved, the FL server updates the reputation opinions, which are subsequently stored in the reputation blockchain. The simulation results show that the proposed scheme can significantly improve the accuracy of the FL model since unreliable workers are detected and not selected for FL training.

E. Summary and Lessons Learned

In this section, we have discussed four main issues in resource allocation. The issues and approaches are summarized in Table V.

TABLE V: Approaches to resource allocation in FL.

| Approaches | Ref. | Key Ideas | Tradeoffs and Shortcomings |
| --- | --- | --- | --- |
| Participant Selection | [83] | FedCS to select participants based on computation capabilities so as to complete FL training before a specified deadline | Difficult to estimate training duration accurately for complex models |
| Participant Selection | [123] | Following [83], Hybrid-FL to select participants so as to accumulate IID, distributable data for FL model training | Request of data sharing may defeat the original intent of FL |
| Participant Selection | [124] | DRL to determine resource consumption by FL participants | DRL models are difficult to train, especially when the number of FL participants is large |
| Participant Selection | [126] | Following [124], DRL for resource allocation with mobility-aware FL participants | DRL models are difficult to train, especially when the number of FL participants is large |
| Participant Selection | [129] | Fair resource allocation to reduce variance of model performance | Convergence delays with more fairness |
| Joint Radio and Computation Resource Management | [132] | BAA to integrate computation and communication through exploiting the signal superposition property of the multiple-access channel | Signal distortion can lead to a drop in accuracy; scalability is also an issue when large heterogeneous networks are involved |
| Joint Radio and Computation Resource Management | [134] | Improves on [132] by accounting for gradient vectors that are not transmitted due to power constraints | Signal distortion can lead to a drop in accuracy; scalability is also an issue when large heterogeneous networks are involved |
| Joint Radio and Computation Resource Management | [135] | Improves on [132] using the DC algorithm to minimize aggregation error | Signal distortion can lead to a drop in accuracy; scalability is also an issue when large heterogeneous networks are involved |
| Adaptive Aggregation | [119] | Asynchronous FL where model aggregation occurs whenever local updates are received by the FL server | Significant delay in convergence for non-IID and unbalanced datasets |
| Adaptive Aggregation | [67] | Adaptive global aggregation frequency based on resource constraints | Convergence guarantees are limited to restrictive assumptions |
| Incentive Mechanism | [141], [143]–[145] | Stackelberg game for incentivizing higher quantities of training data or compute resources contributed | FL server derives lower profits; also, assumption that there is only one FL server |
| Incentive Mechanism | [146], [148] | Contract theoretic approach to incentivize FL participants | Assumption that there is only one FL server |
| Incentive Mechanism | [64] | Reputation mechanism to select effective workers | Assumption that there is only one FL server |

From this section, the lessons learned are as follows:

• In heterogeneous mobile networks, the consideration of resource allocation is important to ensure efficient FL. For example, each training iteration is only conducted as quickly as the slowest FL participant, i.e., the straggler effect. In addition, the model accuracy is highly dependent on the quality of data used for training by FL participants. In this section, we have explored different dimensions of resource heterogeneity for consideration, e.g., varying computation and communication capabilities, willingness to participate, and quality of data for local model training. In addition, we have explored various tools that can be considered for resource allocation. For example, DRL is useful given the dynamic and uncertain wireless network conditions experienced by FL participants, whereas contract theory can serve as a powerful tool for mechanism design in the context of information asymmetry. Naturally, traditional optimization approaches have also been well explored in radio resource management for FL, given the high dependency of FL on communication efficiency.

• In Section III, communication cost reduction comes with a sacrifice in terms of either higher computation costs or lower inference accuracy. Similarly, there exist different tradeoffs to be considered in resource allocation. A scalable model is thus one that enables customization to suit varying needs. For example, the study in [129] allows the FL server to calibrate levels of fairness when allocating training importance, whereas the study in [118] enables the tradeoff between training completion time and energy expense to be calibrated by the FL system administrator.

• In synchronous FL, the FL system is susceptible to the straggler effect. As such, asynchronous FL has been proposed as a solution in [119] and [140]. In addition, asynchronous FL also allows participants to join the FL training halfway even while a training round is in progress. This is more reflective of practical FL settings and can be an important contributing factor towards ensuring the scalability of FL. However, synchronous FL remains the most common approach used due to convergence guarantees [82]. Given the many advantages of asynchronous FL, new asynchronous algorithms should be investigated. In particular, for future proposed algorithms, the convergence guarantee in a non-IID setting for non-convex loss functions needs to be considered.

• The study of incentive mechanism design is a particularly important aspect of FL. In particular, due to data privacy concerns, the FL servers are unable to check the quality of training data. With the use of self-revealing mechanisms in contract theory, or through modeling the interactions between the FL server and participants with game theoretic concepts, contributions of high quality data can be elicited from FL participants. However, existing studies in [64], [141], and [146] generally assume that a federation enjoys a monopoly. In particular, each system model is assumed to consist only of multiple individual participants collaborating with a sole FL server. There can be exceptions to this setting as follows: (i) the participants may be competing data owners who are reluctant to share their model parameters since the competitors also benefit from a trained global model, and (ii) the FL servers may compete with other FL servers, i.e., model owners. In this case, the formulation of the incentive mechanism design will be vastly different from that proposed. A relatively novel approach has been to model the regret [151] of each FL participant in joining the various competing federations for model training. For future works, a system model with multiple competing federations can be considered together with Stackelberg games and contract theoretic approaches.

• In this section, we have assumed that FL assures the privacy and security of participants. However, as discussed in the following section, this assumption may not hold in the presence of malicious participants or a malicious FL server.

V. PRIVACY AND SECURITY ISSUES

One of the main objectives of FL is to protect the privacy of participants, i.e., the participants only need to share the parameters of the trained model instead of sharing their actual data. However, some recent research works have shown that privacy and security concerns may arise when the FL participants or FL servers are malicious in nature. In particular, this defeats the purpose of FL since the resulting global model can be corrupted, or the participants may even have their privacy compromised during model training. In this section, we discuss the following issues:

• Privacy: Even though FL does not require the exchange of data for collaborative model training, a malicious participant can still infer sensitive information, e.g., gender, occupation, and location, from other participants based on their shared models. For example, in [152], when training a binary gender classifier on the FaceScrub [153] dataset, the authors show that they can infer whether a certain participant's inputs are included in the dataset just from inspecting the shared model, with a very high accuracy of up to 90%. Thus, in this section, we discuss privacy issues related to the shared models in FL and review solutions proposed to preserve the privacy of participants.

• Security: In FL, the participants locally train the model and share trained parameters with other participants in order to improve the accuracy of prediction. However, this process is susceptible to a variety of attacks, e.g., data and model poisoning, in which a malicious participant can send incorrect parameters or corrupted models to falsify the learning process during global aggregation. Consequently, the global model will be updated incorrectly, and the whole learning system becomes corrupted. This section discusses more details on emerging attacks in FL as well as some recent countermeasures to deal with such attacks.

A. Privacy Issues

1) Information exploiting attacks in machine learning - a brief overview: One of the first research works that shows the possibility of extracting information from a trained model is [154]. In this paper, the authors show that during the training phase, the correlations implied in the training samples are gathered inside the trained model. Thus, if the trained model is released, it can lead to unexpected information leakage to attackers. For example, an adversary can infer the ethnicity or gender of a user from its trained voice recognition system.


In [155], the authors develop a model-inversion algorithm which is very effective in exploiting information from trained decision tree-based or face recognition models. The idea of this approach is to compare the target feature vector with each of the possible values and then derive a weighted probability estimate of the correct value. The experiment results reveal that by using this technique, the adversary can reconstruct an image of the victim's face from its label with a very high accuracy.

Recently, the authors in [156] show that it is even possible for an adversary to infer information about a victim through queries to the prediction model. In particular, this occurs when a malicious participant has access to make prediction queries on a trained model. The malicious participant can then use the prediction queries to extract the trained model from the data owner. More importantly, the authors point out that this kind of attack can successfully extract model information from a wide range of training models such as decision trees, logistic regressions, SVMs, and even complex training models including DNNs. Some recent research works have also demonstrated the vulnerabilities of DNN-based training models against model extraction attacks [157]–[159]. Therefore, this raises a serious privacy concern for participants sharing training models in FL.

2) Differential privacy-based protection solutions for FL participants: In order to protect the privacy of parameters trained by DNNs, the authors in [20] introduce a technique, called differentially private stochastic gradient descent, which can be effectively implemented on DL algorithms. The key idea of this technique is to add some "noise" to the trained parameters by using a differential privacy-preserving randomized mechanism [160], e.g., a Gaussian mechanism, before sending such parameters to the server. In particular, at the gradient averaging step of a normal FL participant, a Gaussian distribution is used to approximate the differentially private stochastic gradient descent. Then, during the training phase, the participant keeps calculating the probability that malicious participants can exploit information from its shared parameters. Once a predefined threshold is reached, the participant will stop its training process. In this way, the participant can mitigate the risk of revealing private information through its shared parameters.
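A minimal sketch of one differentially private SGD step, assuming per-example gradients are available: clip each gradient to bound sensitivity, average, and perturb with Gaussian noise before applying the update. The function name, the clipping norm, and the noise multiplier are illustrative assumptions and not the exact parameterization of [20].

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr, clip_norm=1.0, noise_mult=1.1):
    """One differentially private SGD step in the spirit of [20]: clip each
    per-example gradient to bound its sensitivity, average, then add Gaussian
    noise scaled by the noise multiplier before applying the update."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    noise = np.random.normal(0.0, noise_mult * clip_norm / len(clipped),
                             size=avg.shape)
    return w - lr * (avg + noise)

# A participant would run such noisy steps locally while tracking its cumulative
# privacy loss and stop training once a preset privacy budget is reached,
# before sharing the trained parameters with the server.
```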

Inspired by this idea, the authors in [161] develop an approach which can achieve a better privacy-protection solution for participants. In this approach, the authors propose two main steps to process data before the trained parameters are sent to the server. In particular, for each learning round, the aggregation server first selects a random subset of participants to train the global model. Then, if a participant is selected to train the global model in a learning round, the participant adopts the method proposed in [20], i.e., using a Gaussian distribution to add noise to the trained model before sending the trained parameters to the server. In this way, a malicious participant cannot infer information of other participants from the parameters of the shared global model, as it has no information regarding who has participated in the training process in each learning round.

Fig. 12: Selective parameter sharing model. The server maintains an aggregator and the global parameters, while each participant trains on its local dataset with SGD and selects which parameters to upload and which global parameters to update.

3) Collaborative training solutions: While DP solutions can protect the private information of an honest participant from other malicious participants in FL, they only work well if the server is trustworthy. If the server is malicious, it can pose a more serious privacy threat to all participants in the network. Thus, the authors in [162] introduce a collaborative DL framework that enables multiple participants to learn the global model without uploading their explicit training models to the server. The key idea of this technique is that instead of uploading the whole set of trained parameters to the server and applying the whole set of global parameters to its local model, each participant wisely selects the number of gradients to upload and the number of parameters from the global model to update, as illustrated in Fig. 12. In this way, malicious participants cannot infer explicit information from the shared model. One interesting result of this paper is that even when the participants do not share all trained parameters and do not update all parameters from the shared model, the accuracy of the proposed solution is still close to that of the case where the server has the entire dataset to train the global model. For example, for the MNIST dataset [163], the accuracy of the prediction model when the participants agree to share 10% and 1% of their parameters is 99.14% and 98.71%, respectively, compared with 99.17% for the centralized solution in which the server has the full data to train. However, the approach is yet to be tested on more complex classification tasks.
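
A minimal sketch of the selective sharing step in [162] is given below: each participant uploads only a fraction of its gradients, here chosen by magnitude, and applies only a fraction of the downloaded global parameters. The fraction values and the flat parameter vectors are illustrative assumptions.

```python
import numpy as np

def select_upload(local_grad, frac=0.1):
    """Upload only the top-frac fraction of gradients by magnitude; the rest stay local."""
    k = max(1, int(frac * local_grad.size))
    idx = np.argsort(np.abs(local_grad))[-k:]       # indices of the largest gradients
    sparse = np.zeros_like(local_grad)
    sparse[idx] = local_grad[idx]
    return sparse

def select_download(local_params, global_params, frac=0.1):
    """Overwrite only a randomly chosen fraction of local parameters with global ones."""
    k = max(1, int(frac * local_params.size))
    idx = np.random.choice(local_params.size, size=k, replace=False)
    updated = local_params.copy()
    updated[idx] = global_params[idx]
    return updated
```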

Although selective parameter sharing and DP solutions can make information exploiting attacks more challenging, the authors in [164] show that these solutions are susceptible to a new type of attack, called the powerful attack, developed based on Generative Adversarial Networks (GANs) [165]. A GAN is an ML technique which uses two neural networks, namely a generator network and a discriminator network, that compete with each other during training. The generator network tries to generate fake data by adding some "noise" to the real data. Then, the generated fake data is passed to the discriminator network for classification. After the training process, the GAN can generate new data with the same statistics as the training dataset. Inspired by this idea, the authors in [164] develop a powerful attack which allows a malicious participant to infer sensitive information from a victim participant even with just a part of the shared parameters from the victim, as illustrated in Fig. 13. To deal with the GAN attack, the authors in [166] introduce a solution using a secret sharing scheme with an extreme boosting algorithm. This approach executes a lightweight secret sharing protocol before transmitting the newly trained model in plaintext to the server at each round. Thereby, other participants in the network cannot infer information from the shared model. However, the limitation of this approach is its reliance on a trusted third party to generate signature key pairs.

Different from all the aforementioned works, the authors in [167] introduce a collaborative training model in which all participants cooperate to train a federated GAN model. The key idea of this method is that the federated GAN can generate artificial data that replaces the participants' real data, thus protecting the privacy of the real data of honest participants. In particular, in order to guarantee participants' data privacy while still maintaining flexibility in training tasks, this approach produces a federated generative model. This model can output artificial data that does not belong to any real user in particular, but comes from the common cross-user data distribution. As a result, this approach can significantly reduce the possibility of malicious exploitation of information from real data. However, this approach inherits the existing limitations of GANs, e.g., training instability due to the generated fake data, which can dramatically reduce the performance of collaborative learning models.

4) Encryption-based Solutions: Encryption is an effective way to protect the data privacy of the participants when they want to share the trained parameters in FL. In [168], the homomorphic encryption technique is introduced to protect the privacy of participants' shared parameters from an honest-but-curious server. An honest-but-curious server is defined to be a user who wants to extract information from the participants' shared parameters, but keeps all operations in FL in proper working condition. The idea of this solution is that the participants' trained parameters are encrypted using the homomorphic encryption technique before they are sent to the server. This approach is effective in protecting sensitive information from the curious server, and also achieves the same accuracy as that of the centralized DL algorithm. A similar concept is also presented in [84], with a secret sharing mechanism used to protect the information of FL participants.

Although both the encryption techniques presented in [168] and [84] can prevent the curious server from extracting information, they require multi-round communications and cannot preclude collusion between the server and participants. Thus, the authors in [169] propose a hybrid solution which integrates both additively homomorphic encryption and DP in FL. In particular, before the trained parameters are sent to the server, they are encrypted using the additively homomorphic encryption mechanism together with intentional noise that perturbs the original parameters. As a result, this hybrid scheme can simultaneously prevent the curious server from exploiting information and solve the collusion problem between the server and malicious participants. However, the authors do not compare the accuracy of the proposed approach with the case without homomorphic encryption and DP. Thus, the performance of the proposed approach, i.e., in terms of model accuracy, is not clear.
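
To make the aggregation pattern concrete, the following is a minimal sketch of additively homomorphic aggregation of model updates, assuming the third-party python-paillier (phe) package is available; the added Gaussian perturbation loosely mimics the hybrid scheme of [169], and the noise scale is an illustrative assumption rather than a calibrated DP parameter.

```python
import numpy as np
from phe import paillier  # assumed third-party package (python-paillier)

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

def client_update(local_weights, sigma=0.01):
    """Perturb the local weights with noise, then encrypt them element-wise."""
    noisy = local_weights + np.random.normal(0, sigma, size=local_weights.shape)
    return [public_key.encrypt(float(w)) for w in noisy]

def server_aggregate(encrypted_updates):
    """The server only adds ciphertexts; it never sees individual plaintext weights."""
    total = encrypted_updates[0]
    for update in encrypted_updates[1:]:
        total = [a + b for a, b in zip(total, update)]
    return total  # still encrypted

# toy usage with three participants holding 4-dimensional models;
# in the actual schemes only the participants hold the decryption key
updates = [client_update(np.random.randn(4)) for _ in range(3)]
encrypted_sum = server_aggregate(updates)
average = [private_key.decrypt(c) / len(updates) for c in encrypted_sum]
```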

B. Security Issues

1) Data Poisoning Attacks: In FL, a participant trains on its data and sends the trained model to the server for further processing. In this case, it is intractable for the server to check the real training data of a participant. Thus, a malicious participant can poison the global model by creating dirty-label data and using it to train the global model with the aim of generating falsified parameters. For example, a malicious participant can generate a number of samples, e.g., photos, under a designed label, e.g., a clothing brand, and use them to train the global model to achieve its business goal, e.g., the prediction model shows results of the targeted clothing brand. Dirty-label data poisoning attacks are demonstrated to achieve high misclassification rates in DL processes, up to 90%, when a malicious participant injects relatively few dirty-label samples (around 50) into the training dataset [170]. This calls for urgent solutions to deal with data poisoning attacks in FL.

In [171], the authors investigate the impact of a sybil-based data poisoning attack on an FL system. In particular, in the sybil attack, a malicious participant tries to improve the effectiveness of data poisoning in training the global model by creating multiple malicious participants. In Table VI, the authors show that with only two malicious participants, the attack success rate can reach up to 96.2%, at which point the FL model is unable to correctly classify images of "1" (instead, it always incorrectly predicts them to be images of "7"). To mitigate sybil attacks, the authors then propose a defense strategy, namely FoolsGold. The key idea of this approach is that honest participants can be distinguished from sybil participants based on their updated gradients. Specifically, in the non-IID FL setting, each participant's training data has its own particularities, and sybil participants will contribute gradients that appear more similar to each other than those of honest participants. With FoolsGold, the system can defend against the sybil data poisoning attack with minimal changes to the conventional FL process and without requiring any auxiliary information outside of the learning process. Through simulation results on three diverse datasets (MNIST [163], KDDCup [172], Amazon Reviews [172]), the authors show that FoolsGold can mitigate the attack under a variety of conditions, including different distributions of participant data, varying poisoning targets, and various attack strategies.
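
The following is a simplified sketch of the gradient-similarity idea behind FoolsGold [171], not the full re-weighting scheme of the paper: updates whose directions are unusually similar to another participant's are down-weighted before aggregation, and the similarity threshold is an illustrative assumption.

```python
import numpy as np

def similarity_weights(updates, threshold=0.9):
    """Down-weight participants whose updates are near-duplicates of another's.
    Assumes at least two participants, each update a flat numpy vector."""
    n = len(updates)
    unit = [u / (np.linalg.norm(u) + 1e-12) for u in updates]
    weights = np.ones(n)
    for i in range(n):
        # maximum cosine similarity to any other participant's update
        max_sim = max(float(unit[i] @ unit[j]) for j in range(n) if j != i)
        if max_sim > threshold:          # suspiciously similar => likely sybil
            weights[i] = max(0.0, 1.0 - max_sim)
    return weights / (weights.sum() + 1e-12)

def robust_aggregate(updates):
    w = similarity_weights(updates)
    return sum(wi * ui for wi, ui in zip(w, updates))
```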

2) Model Poisoning Attacks: Unlike data poisoning attacks, which aim to generate fake data to cause adverse impacts on the global model, a model poisoning attack attempts to directly poison the model update that a participant sends to the server for aggregation.

Fig. 13: GAN attack on collaborative deep learning. The attacker H trains a local GAN (generator G and discriminator D) on the parameters downloaded from the server in order to reconstruct data belonging to the victim V's class.

TABLE VI: The accuracy and attack success rates for the no-attack scenario and attacks with 1 and 2 sybils in a FL system with the MNIST dataset [163].

                                 Baseline   Attack 1   Attack 2
Number of honest participants    10         10         10
Number of sybil participants     0          1          2
The accuracy (digits: 0, 2-9)    90.2%      89.4%      88.8%
The accuracy (digit: 1)          96.5%      60.7%      0.0%
Attack success rate              0.0%       35.9%      96.2%

As shown in [173] and [174], model poisoning attacks are much more effective than data poisoning attacks, especially for large-scale FL with many participants. The reason is that in data poisoning attacks, a malicious participant's updates are scaled based on its dataset and the number of participants in the federation. However, in model poisoning attacks, a malicious participant can directly modify the updated model that is sent to the server for aggregation. As a result, even with a single attacker, the whole global model can be poisoned. The simulation results in [173] also confirm that even a highly constrained adversary with limited training data can achieve a high success rate in performing model poisoning attacks. Thus, solutions to protect the global model from model poisoning attacks have to be developed.

In [173], some solutions are suggested to prevent model poisoning attacks. Firstly, based on an updated model shared by a participant, the server can check whether the shared model helps to improve the global model's performance. If not, the participant is marked as a potential attacker, and after a few rounds of observing the updated models from this participant, the server can determine whether this is a malicious participant or not. The second solution is based on a comparison among the updated models shared by the participants. In particular, if an updated model from a participant is too different from the others, the participant can potentially be a malicious one. Then, the server continues observing updates from this participant before it can determine whether this is a malicious user or not. However, model poisoning attacks are extremely difficult to prevent because, when training with millions of participants, it is intractable to evaluate the improvement from every single participant. As such, more effective solutions need to be further investigated.
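
The two checks suggested in [173] can be sketched as follows; the validation evaluator, improvement margin, and deviation threshold are illustrative assumptions.

```python
import numpy as np

def improves_global(candidate_update, global_model, evaluate, margin=0.0):
    """Check 1: does applying the participant's update improve validation performance?
    `evaluate` is assumed to return an accuracy on a held-out validation set."""
    return evaluate(global_model + candidate_update) >= evaluate(global_model) + margin

def deviates_from_peers(candidate_update, peer_updates, num_std=3.0):
    """Check 2: is the update an outlier relative to the other participants' updates?
    Assumes at least two peer updates, all flat numpy vectors."""
    if len(peer_updates) < 2:
        return False
    dists = [np.linalg.norm(candidate_update - u) for u in peer_updates]
    pairwise = [np.linalg.norm(a - b) for i, a in enumerate(peer_updates)
                for b in peer_updates[i + 1:]]
    return np.mean(dists) > np.mean(pairwise) + num_std * (np.std(pairwise) + 1e-12)
```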

In [174], the authors introduce a more effective model poisoning attack which is demonstrated to achieve 100% accuracy on the attacker's task within just a single learning round. In particular, a malicious participant can share a poisoned model which not only is trained for its intended purpose, but also contains a backdoor function. In this paper, the authors consider injecting a semantic backdoor function into the global model. The reason is that this function can make the global model misclassify even without the need to modify the input data of the malicious participant. For example, an image classification backdoor function can assign an attacker-chosen label to all images with certain features, e.g., all dogs with black stripes can be misclassified as cats. In the simulations, the authors show that this attack can greatly outperform conventional FL data poisoning attacks. For example, in a word-prediction task with 80,000 total participants, compromising just eight of them is enough to achieve 50% backdoor accuracy, compared to the 400 malicious participants needed to perform the data poisoning attack.

3) Free-Riding Attacks: Free-riding is another attack in FL that occurs when a participant wants to benefit from the global model without contributing to the learning process. The malicious participant, i.e., the free-rider, can pretend that it has a very small number of samples to train on, or it can select only a small subset of its real dataset to train on, e.g., to save its resources. As a result, the honest participants need to contribute more resources to the FL training process. To address this problem, the authors in [175] introduce a blockchain-based FL architecture, called BlockFL, in which the participants' local learning model updates are exchanged and verified by leveraging blockchain technology. In particular, each participant trains and sends the trained model to its associated miner in the blockchain network and then receives a reward that is proportional to the number of trained data samples, as illustrated in Fig. 14.

Fig. 14: An illustration of (a) conventional FL, where participants upload updates to a central server, and (b) the proposed BlockFL architecture, where participants send updates to miners in a blockchain network that cross-verify them and return the reward and the updated global model.

In this way, this framework can not only prevent the participants from free-riding, but also incentivize all participants to contribute to the learning process. A similar blockchain-based model is also introduced in [176] to provide data confidentiality, computation auditability, and incentives for the participants of FL. However, the utilization of blockchain technology implies the incurrence of a significant cost for implementing and maintaining the miners that operate the blockchain network. Furthermore, consensus protocols used in blockchain networks, e.g., proof-of-work (PoW), can cause a long delay in information exchange, and thus they may not be appropriate for implementation on FL models.

C. Summary and Lessons Learned

In this section, we have discussed two key issues, i.e., privacy and security, that arise when trained models are exchanged in FL. In general, it is believed that FL is an effective privacy-preserving learning solution for participants to perform collaborative model training. However, in this section, we have shown that a malicious participant can exploit the process and gain access to sensitive information of other participants. Furthermore, we have also shown that by using the shared model in FL, an attacker can perform attacks which can not only break down the whole learning system, but also falsify the trained model to achieve its malicious goal. In addition, solutions to deal with these issues have also been reviewed, which are especially important in order to guide FL system administrators in designing and implementing the appropriate countermeasures. We summarize the key information of attacks and their corresponding countermeasures in Table VII.

VI. APPLICATIONS OF FEDERATED LEARNING FOR MOBILE EDGE COMPUTING

In the aforementioned studies, we have discussed the issues pertaining to the implementation of FL as an enabling technology that allows collaborative learning at mobile edge networks. In this section, we focus instead on the applications of FL for mobile edge network optimization.

As highlighted by the authors in [35], the increasing complexity and heterogeneity of wireless networks enhance the appeal of adopting a data-driven ML based approach [28] for optimizing system designs and resource allocation decision making for mobile edge networks. However, as discussed in previous sections, the private data of users may be sensitive in nature. As such, existing learning based approaches can be combined with FL for privacy-preserving applications.

In this section, we consider four applications of FL in edge computing:

• Cyberattack Detection: The ubiquity of IoT devices and the increasing sophistication of cyberattacks [177] imply that there is a need to improve existing cyberattack detection tools. Recently, DL has been widely successful in cyberattack detection. Coupled with FL, cyberattack detection models can be learned collaboratively while maintaining user privacy.

• Edge Caching and Computation Offloading: Given the computation and storage capacity constraints of edge servers, some computationally intensive tasks of end devices have to be offloaded to the remote cloud server for computation. In addition, commonly requested files or services should be placed on edge servers for faster retrieval, i.e., users do not have to communicate with the remote cloud when they want to access these files or services. As such, an optimal caching and computation offloading scheme can be collaboratively learned and optimized with FL.

• Base Station Association: In a dense network, it is important to optimize base station association so as to limit interference faced by users. However, traditional learning based approaches that utilize user data often assume that such data is centrally available. Given user privacy constraints, an FL based approach can be adopted instead.

• Vehicular Networks: The Internet of Vehicles (IoV) [178] features smart vehicles with data collection, computation, and communication capabilities for relevant functions, e.g., navigation and traffic management. However, this wealth of knowledge is, again, private and sensitive in nature since it can reveal the driver's location and personal information. In this section, we discuss the use of FL based approaches for traffic queue length prediction and for energy demand forecasting in electric vehicle charging stations, carried out at the edge of IoV networks.

A. Cyberattack Detection

Cyberattack detection is one of the most important steps to promptly prevent and mitigate the serious consequences of attacks in mobile edge networks. Among the different approaches to detect cyberattacks, DL is considered to be the most effective tool to detect a wide range of attacks with high accuracy. In [188], the authors show that DL can outperform all conventional ML techniques with very high accuracy in detecting intrusions on three datasets, i.e., KDDCup 1999, NSL-KDD [189], and UNSW-NB15 [190]. However, the detection accuracy of DL-based solutions depends very much on the available datasets.

TABLE VII: A summary of attacks and countermeasures in FL.

Information exploiting attacks (privacy issues)
Attack method: Attackers try to illegally exploit information from the shared model.
Countermeasures:
• Differentially private stochastic gradient descent: Add "noise" to the trained parameters by using a differential privacy-preserving randomized mechanism [20].
• Differentially private and selective participants: Add "noise" to the trained parameters and randomly select the participants that train the global model in each round [161].
• Selective parameter sharing: Each participant wisely selects the number of gradients to upload and the number of parameters from the global model to update [162].
• Secret sharing scheme with extreme boosting algorithm: Execute a lightweight secret sharing protocol before transmitting the newly trained model in plaintext to the server at each round [166].
• GAN model training: All participants cooperate to train a federated GAN model [167].

Data poisoning attacks
Attack method: Attackers poison the global model by creating dirty-label data and using such data to train the global model.
Countermeasures:
• FoolsGold: Distinguish honest participants based on their updated gradients. It relies on the fact that in the non-IID FL setting, each participant's training data has its own particularities, and malicious participants will contribute gradients that appear more similar to each other than those of the honest participants [171].

Model poisoning attacks
Attack method: Attackers attempt to directly poison the global model that they send to the server for aggregation.
Countermeasures:
• Based on an updated model shared by a participant, the server can check whether the shared model helps to improve the global model's performance. If not, the participant is marked as a potential attacker [173].
• Compare the updated models shared by the participants; if an updated model from a participant is too different from the others, it could be a potential malicious participant [173].

Free-riding attacks
Attack method: Attackers benefit from the global model without contributing to the learning process, e.g., by pretending that they have a very small number of samples to train on.
Countermeasures:
• BlockFL: Participants' local model updates are exchanged and verified by leveraging blockchain technology. Each participant trains and sends the trained model to its associated miner in the blockchain network and then receives a reward that is proportional to the number of trained data samples [175].

TABLE VIII: FL based approaches for mobile edge network optimization.

Applications                               Ref.    Description
Cyberattack detection                      [179]   Cyberattack detection with edge nodes as participants
                                           [180]   Cyberattack detection with IoT gateways as participants
                                           [181]   Blockchain to store model updates
Edge caching and computation offloading    [32]    DRL for caching and offloading in UEs
                                           [182]   DRL for computation offloading in IoT devices
                                           [183]   Stacked autoencoder learning for proactive caching
                                           [184]   Greedy algorithm to optimize service placement schemes
Base station association                   [185]   Deep echo state networks for VR applications
                                           [31]    Mean field game with imitation for cell association
Vehicular networks                         [186]   Extreme value theory for large queue length prediction
                                           [187]   Energy demand learning in electric vehicular networks

Specifically, DL algorithms can outperform other ML techniques only when given sufficient data to train on. However, this data may be sensitive in nature. Therefore, some FL-based attack detection models for mobile edge networks have been introduced recently to address this problem.

In [179], the authors propose a cyberattack detection model for an edge network empowered by FL. In this model, each edge node operates as a participant who owns a set of data for intrusion detection. To improve the accuracy in detecting attacks, after training the global model, each participant sends its trained model to the FL server. The server aggregates all parameters from the participants and sends the updated global model back to all the participants, as illustrated in Fig. 15. In this way, each edge node can learn from other edge nodes without the need to share its real data. As a result, this method can not only improve accuracy in detecting attacks, but also enhance the privacy of intrusion data at the edge nodes and reduce traffic load for the whole network. A similar idea is also presented in [180], in which IoT gateways operate as FL participants and an IoT security service provider works as a server node to aggregate the trained models shared by the participants. The authors in [180] show empirically that by using FL, the system can successfully detect 95.6% of attacks in approximately 257 ms without raising any false alarm when evaluated in a real-world smart home deployment setting.
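
The aggregation step performed by the FL server in such schemes is essentially federated averaging; a minimal sketch is given below, with models represented as flat weight vectors, which is an illustrative simplification.

```python
import numpy as np

def federated_average(local_weights, num_samples):
    """Weighted average of participants' model weights, weighted by their data sizes."""
    total = sum(num_samples)
    return sum((n / total) * w for w, n in zip(local_weights, num_samples))

# toy usage: three edge nodes with different amounts of intrusion data
local_weights = [np.random.randn(8) for _ in range(3)]
num_samples = [1200, 400, 900]
global_weights = federated_average(local_weights, num_samples)
```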

In both [179] and [180], it is assumed that the participants, i.e., edge nodes and IoT gateways, are honest and willing to contribute to training by sharing their updated model parameters. However, if some of the participants are malicious, they can corrupt the whole intrusion detection system. Thus, the authors in [181] propose to use blockchain technology to manage the data shared by the participants. By using the blockchain, all incremental updates to the anomaly detection ML model are stored in the ledger, and thus a malicious participant can be easily identified. Furthermore, based on the shared models from honest participants stored in the ledger, the intrusion detection system can easily recover the proper global model if the current global model is poisoned.
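
A minimal sketch of this idea, recording hash-chained model updates so that a poisoned global model can be rebuilt from the updates of participants later deemed honest, is shown below; the record format and recovery rule are illustrative assumptions rather than the design of [181].

```python
import hashlib
import json

ledger = []  # append-only chain of model updates

def append_update(participant_id, weights):
    """Record a participant's update, chained to the previous record by its hash."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    record = {"participant": participant_id,
              "weights": [round(float(w), 6) for w in weights],
              "prev_hash": prev_hash}
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    ledger.append(record)

def recover_global(is_honest):
    """Rebuild the global model by averaging only the updates of honest participants."""
    honest = [r["weights"] for r in ledger if is_honest(r["participant"])]
    if not honest:
        return None
    return [sum(col) / len(honest) for col in zip(*honest)]
```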

B. Edge Caching and Computation Offloading

To account for the dynamic and time-varying conditions in an MEC system, the authors in [32] propose the use of DRL with FL to optimize caching and computation offloading decisions. The MEC system consists of a set of user equipments (UEs) covered by base stations.

Fig. 15: FL-based attack detection architecture for IoT edge networks, in which edge nodes exchange models with a cloud server.

For caching, the DRL agent decides whether to cache the downloaded file, and which local file to replace should caching occur. For computation offloading, the UEs can choose to either offload computation tasks to the edge node via wireless channels, or perform the tasks locally. This caching and offloading decision process is illustrated in Fig. 16. The states of the MEC system include wireless network conditions, UE energy consumption, and task queuing states, whereas the reward function is defined as the quality of experience (QoE) of the UEs. Given the large state and action space in the MEC environment, a DDQN approach is adopted. To protect the privacy of users, an FL approach is proposed in which training can occur with data remaining on the UEs. In addition, existing FL algorithms, e.g., FedAvg [23], can also ensure that training is robust to the unbalanced and non-IID data of the UEs. The simulation results show that the DDQN with FL approach achieves average utilities among UEs similar to those of the centralized DDQN approach, while consuming less communication resources and preserving user privacy. However, the simulations are only performed with 10 UEs. If the implementation is expanded to target a larger number of heterogeneous UEs, there can be significant delays in the training process, especially since the training of a DRL model is computationally intensive. As an extension, transfer learning [191] can be used to increase the efficiency of training, i.e., training is not initialized from scratch.
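
For reference, the double DQN target used by such a DDQN agent can be sketched as follows, in PyTorch-style Python; the state dimension, number of actions, network sizes, and discount factor are illustrative assumptions, and in the federated variant the agents' network weights would simply be averaged as in FedAvg.

```python
import torch
import torch.nn as nn

online_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))  # Q(s, a)
target_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(online_net.state_dict())
gamma = 0.99

def ddqn_target(reward, next_state, done):
    """Double DQN: the online net picks the next action, the target net evaluates it."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q

# toy usage on a batch of 5 transitions
r, s2, d = torch.zeros(5), torch.randn(5, 6), torch.zeros(5)
y = ddqn_target(r, s2, d)
```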

Similar to [32], the authors in [182] propose the use of DRL to optimize computation offloading decisions in IoT systems. The system model consists of IoT devices and edge nodes. The IoT devices can harvest energy units [192] from the edge nodes to be stored in the energy queue. In addition, an IoT device also maintains a local task queue with unprocessed and unsuccessfully processed tasks. These tasks can be processed locally or offloaded to the edge nodes for processing, in First In First Out (FIFO) order [14]. In the DRL problem formulation, the network states are defined to be a function of the energy queue length, task execution delay, task handover delay from edge node association, and channel gain between the IoT device and the edge nodes. A task can fail to be executed, e.g., when there are insufficient energy units or insufficient communication bandwidth for computation offloading. The utility considered is a function of the task execution delay, task queuing delay, number of failed tasks, and penalty of execution failure. The DRL agent makes the decision to either offload computation to the edge nodes or perform computation locally. To ensure the privacy of users, the agent is trained without users having to upload their own data to a centralized server. In each training round, a random set of IoT devices is selected to download the model parameters of the DRL agent from the edge networks. The model parameters are then updated using their own data, e.g., energy resource level, channel gain, and local sensing data. Then, the updated parameters of the DRL agent are sent to the edge nodes for model aggregation. The simulation results show that the FL based approach can achieve the same level of total utility as the centralized DRL approach. This is robust to varying task generation probabilities. In addition, when task generation probabilities are higher, i.e., there are more tasks for computation in the IoT device, the FL based scheme can achieve a lower number of dropped tasks and shorter queuing delay than the centralized DRL scheme. However, the simulation only involves 15 IoT devices serviced by relatively many edge nodes. To better reflect practical scenarios where fewer edge nodes have to cover several IoT devices, further studies can be conducted on optimizing the edge-IoT collaboration. For example, the limited communication bandwidth can cause significant task handover delay during computation offloading. In addition, with more IoT devices, the DRL training will take a longer time to converge, especially since the devices have heterogeneous computation capabilities.

Instead of using a DRL approach, the authors in [183] propose the use of an FL based stacked autoencoder learning model, i.e., the FL based proactive content caching scheme (FPCC), to predict content popularity for optimized caching while protecting user privacy. In the system model, each user is equipped with a mobile device that connects to the base station covering its geographical location. Using a stacked autoencoder learning model, the latent representation of a user's information, e.g., location, and file rating, i.e., content request history, is learned. Then, a similarity matrix between the user and its historically requested files is obtained, in which each element of the matrix represents the distance between the user and the file. Based on this similarity matrix, the K nearest neighbours of each user are determined, and the similarity between the user's historical watch list and the neighbours' is computed. An aggregation approach is then used to predict the most popular files for caching, i.e., the files with the highest similarity scores across all users.

Fig. 16: FL-based (a) caching and (b) computation offloading.

Being the most popular and most frequently retrieved files across users, the cached files need not be re-downloaded from the source server every time they are requested. To protect the privacy of users, FL is adopted to learn the parameters of the stacked autoencoder without the user having to reveal its personal information or its content request history to the FL server. In each training round, the user first downloads a global model from the FL server. Then, the model is trained and updated using the user's local data. The updated models are subsequently uploaded to the FL server and aggregated using the FedAvg algorithm. The simulation results show that the proposed FPCC scheme achieves the highest cache efficiency, i.e., the ratio of cached files matching user requests, as compared to other caching methods such as the Thompson sampling method [193]. In addition, the privacy of the user is preserved.
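
A simplified sketch of the similarity-based popularity scoring used for cache selection is given below; the latent user and file vectors are assumed to come from the (federated) stacked autoencoder, and cosine similarity with a simple top-N rule stands in for the full FPCC aggregation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def popular_files(user_latents, file_latents, cache_size=10):
    """Score each file by its average similarity to all users; cache the top files."""
    scores = []
    for f_id, f_vec in file_latents.items():
        scores.append((np.mean([cosine(u, f_vec) for u in user_latents]), f_id))
    return [f_id for _, f_id in sorted(scores, reverse=True)[:cache_size]]

# toy usage: latent vectors would come from the locally trained autoencoders
users = [np.random.randn(16) for _ in range(50)]
files = {f"file_{i}": np.random.randn(16) for i in range(200)}
cache = popular_files(users, files)
```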

The authors in [184] introduce a privacy-aware service placement scheme to deploy user-preferred services on edge servers with consideration for the resource constraints in the edge cloud. The system model consists of a mobile edge cloud serving various mobile devices. The user's preference model is first built based on information such as the number of requests for a service, and other user context information, e.g., age and location. However, since this can involve sensitive personal information, an FL based approach is proposed to train the preference model while keeping users' data on their personal devices. Then, an optimization problem is formulated in which the objective is to maximize the quantity of services demanded from the edge based on user preferences, subject to constraints on storage capacity, computation capability, and uplink and downlink bandwidth. The optimization problem is then solved using a greedy algorithm, i.e., the service which most improves the objective function is added until the resource constraints are met. The simulation results show that the proposed scheme can outperform the popular service placement scheme, i.e., where only the most popular services are placed on the edge cloud, in terms of the number of requests processed on edge clouds, since it also considers the aforementioned resource constraints in maximizing the quantity of services.
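
The greedy placement step can be sketched as follows; the single aggregate capacity budget and the demand-per-cost selection rule are illustrative simplifications of the multi-resource constraints and objective in [184].

```python
def greedy_placement(services, capacity):
    """services: dict name -> (predicted demand, resource cost). Greedily place the
    service with the best demand-per-cost ratio until the capacity budget is used up."""
    placed, used = [], 0.0
    remaining = dict(services)
    while remaining:
        # pick the service that most improves the objective per unit of resource
        best = max(remaining, key=lambda s: remaining[s][0] / remaining[s][1])
        demand, cost = remaining.pop(best)
        if used + cost <= capacity:
            placed.append(best)
            used += cost
    return placed

# toy usage: predicted demands would come from the federated preference model
services = {"nav": (120, 2.0), "video": (300, 8.0), "ar": (90, 3.0), "iot": (60, 1.0)}
print(greedy_placement(services, capacity=10.0))
```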

C. Base Station Association

The authors in [185] propose an FL based deep echo state network (ESN) approach to minimize breaks in presence (BIPs) [194] for users of virtual reality (VR) applications. A BIP event can be a result of delay in information transmission, which can occur when the user's body movements obstruct the wireless link. BIPs make the user aware that they are in a virtual environment, thus reducing their quality of experience. As such, a user association policy has to be designed such that BIPs are minimized. The system model consists of base stations that cover a set of VR users. The base stations receive uploaded tracking information from each associated user, e.g., physical location and orientation, while the users download VR videos for use in the VR application. For data transmission, the VR users have to associate with one of the base stations. As such, a minimization problem is formulated where BIPs are minimized with respect to the expected locations and orientations of the VR users. To derive a prediction of user locations and orientations, the base station has to rely on the historical information of users. However, the historical information stored at each base station only captures partial data from each user, i.e., a user connects to multiple base stations and its data is distributed across them. As such, an FL based approach is implemented whereby each base station first trains a local model using its partial data. Then, the local models are aggregated to form a global model capable of generalization, i.e., comprehensively predicting a user's mobility and orientations. The simulation results show that the federated ESN algorithm achieves lower BIPs experienced by users as compared to the centralized ESN algorithm proposed in [195], since a centralized approach only makes partial predictions with the incomplete data of individual base stations, whereas the federated ESN approach makes predictions based on a model learned collaboratively from more complete data.

Following the ubiquity of IoT devices, the traditional cloud-based approach may no longer be sufficient to cater to dense cellular networks. As computation and storage move to the edge of networks, the association of users to base stations is increasingly important to facilitate efficient ML model training among the end users. To this end, the authors in [31] consider solving the problem of cell association in dense wireless networks with a collaborative learning approach. In the system model, the base stations cover a set of users in an LTE cellular system.

In a cellular system, users are likely to face similar channel conditions as their neighbours and thus can benefit from learning from neighbours that are already associated with base stations. As such, the cell association problem is formulated as a mean-field game (MFG) with imitation [196] in which each user maximizes its own throughput while minimizing the cost of imitation. The MFG is further reduced to a single-user Markov decision process that is then solved by a neural Q-learning algorithm. In most other proposed solutions for cell association, it is assumed that all information is known to the base stations and users. However, given privacy concerns, the assumption of information sharing may not be practical. As such, a collaborative learning approach can be considered where only the outcome of the learning algorithm is exchanged during the learning process, whereas usage data is kept locally on each user's own device. The simulation results show that imitating users can attain higher utility within a shorter training duration as compared to non-imitating users.

D. Vehicular Networks

Ultra reliable low latency communication (URLLC) in vehicular networks is an essential prerequisite towards developing an intelligent transport system. However, existing radio resource management techniques do not account for rare events such as large queue lengths at the tail end of the distribution. To model the occurrence of such low probability events, the authors in [186] propose the use of extreme value theory (EVT) [197]. The approach requires sufficient samples of queue state information (QSI) and data exchange among vehicles. As such, an FL approach is proposed in which vehicular users (VUEs) train the learning model with data kept locally and upload only their updated model parameters to the roadside units (RSUs). The RSU then averages the model parameters and returns an updated global model to the VUEs. In a synchronous approach, all VUEs upload their models at the end of a prespecified interval. However, the simultaneous uploading by multiple vehicles can lead to delays in communication. In contrast, in an asynchronous approach, each VUE only evaluates and uploads its model parameters after a predefined number of QSI samples are collected. The global model is also updated whenever a local update is received, thus reducing communication delays. To further reduce overhead, Lyapunov optimization [198] for power allocation is also utilized. The simulation results show that under this framework, there is a reduction in the number of vehicles experiencing large queue lengths, whereas FL ensures minimal data exchange relative to a centralized approach.
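
The asynchronous update rule described above can be sketched as follows; the mixing weight alpha and the flat parameter vector are illustrative assumptions.

```python
import numpy as np

global_model = np.zeros(16)   # parameters maintained at the RSU

def on_local_update(local_model, alpha=0.5):
    """Apply a VUE's update as soon as it arrives, instead of waiting for all VUEs."""
    global global_model
    global_model = (1.0 - alpha) * global_model + alpha * local_model
    return global_model
```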

Apart from QSI, the vehicles in vehicular networks are also exposed to a wealth of useful captured images that can be adopted to build better inference models, e.g., for traffic optimization. However, these images are sensitive in nature since they can give away the location information of vehicular clients. As such, an FL approach can be used to facilitate collaborative ML while ensuring privacy preservation. However, the images captured by vehicles are often of varying quality due to motion blur. In addition, another source of heterogeneity is the difference in computing capabilities of vehicles. Given the information asymmetry involved, the authors in [148] propose a multi-dimensional contract design in which the FL server designs contract bundles comprising varying levels of data quality, compute resources, and contractual payoffs. Then, the vehicular client chooses the contract bundle that maximizes its utility, in accordance with its hidden type. Similar to the results in [146], the simulation results show that the FL server derives the greatest utility under the proposed contract theoretic approach, in contrast to the linear pricing or Stackelberg game approach.

The authors in [187] propose a federated energy demand learning (FEDL) approach to manage energy resources in charging stations (CSs) for electric vehicles (EVs). When a large number of EVs congregate at a CS, this can lead to energy transfer congestion. To resolve this, energy is supplied from the power grids and reserved in advance to meet the real-time demands from the EVs [199], rather than having the CSs request energy from the power grid only upon receiving charging requests. As such, there is a need to forecast energy demand for EV networks using historical charging data. However, this data is usually stored separately at each of the CSs that the EVs utilize and is private in nature. As such, an FEDL approach is adopted in which each CS trains the demand prediction model on its own dataset before sending only the gradient information to the charging station provider (CSP). Then, the gradient information from the CSs is aggregated for global model training. To further improve model accuracy, the CSs are clustered using the constrained K-means algorithm [200] based on their physical locations. The clustering-based FEDL reduces the cost of biased prediction [201]. The simulation results show that the root mean squared error of a clustered FEDL model is lower than that of conventional ML algorithms, e.g., the multi-layer perceptron regressor [202]. However, the privacy of user data is still not protected by this approach, since user data is stored in each of the CSs. As an extension, the user data can possibly be stored in each EV separately, and model training can be conducted in the EVs rather than the CSs. This can allow more user features to be considered to enhance the accuracy of EDL, e.g., user consumption habits.

Summary: In this section, we have discussed how FL can also be used for mobile edge network optimization. In particular, DL and DRL approaches are suitable for modelling the dynamic environment of increasingly complex edge networks but require sufficient data for training. With FL, model training can be carried out while preserving the privacy of users. A summary of the approaches is presented in Table VIII.

VII. CHALLENGES AND FUTURE RESEARCH DIRECTIONS

Apart from the aforementioned issues, there are still challenges and new research directions in deploying FL at scale, which are discussed as follows.

• Dropped participants: The approaches discussed in Section IV, e.g., [83], [123], and [124], propose new algorithms for participant selection and resource allocation to address the training bottleneck and resource heterogeneity. In these approaches, the wireless connections of the participants are assumed to be always available.

However, in practice, participating mobile devices may go offline and drop out of the FL system due to connectivity or energy constraints. A large number of devices dropping out of training participation can significantly degrade the performance [23], e.g., accuracy and convergence speed, of the FL system. New FL algorithms need to be robust to device drop out in the networks and anticipate scenarios in which only a small number of participants are left connected to participate in a training round. One potential solution is for the FL model owner to provide free dedicated/special connections, e.g., cellular connections, as an incentive for the participants to avoid dropping out.

• Privacy concerns: FL is able to protect the privacy of each participant since model training may be conducted locally, with just the model parameters exchanged with the FL server. However, as specified in [154], [155], and [156], communicating the model updates during the training process can still reveal sensitive information to an adversary or a third party. The current approaches propose security solutions such as DP, e.g., [20], [161], and [203], and collaborative training, e.g., [162] and [164]. However, the adoption of these approaches sacrifices performance, i.e., model accuracy. They also require significant computation on participating mobile devices. Thus, the tradeoff between privacy guarantees and system performance has to be well balanced when implementing the FL system.

• Unlabeled data: It is important to note that the approaches reviewed in this survey are proposed for supervised learning tasks. This means that the approaches assume that labels exist for all of the data in the federated network. However, in practice, the data generated in the network may be unlabeled or mislabeled [204]. This poses a big challenge for the server to find participants with appropriate data for model training. Tackling this challenge may require the challenges of scalability, heterogeneity, and privacy in FL systems to be addressed. One possible solution is to enable mobile devices to construct their labeled data by learning the "labeled data" from each other. Emerging studies have also considered the use of semi-supervised learning inspired techniques [205].

• Interference among mobile devices: The existing resource allocation approaches, e.g., [83] and [124], address participant selection based on the resource states of the mobile devices. In fact, these mobile devices may be geographically close to each other, i.e., in the same cell. This introduces an interference issue when they upload local models to the server. As such, channel allocation policies may need to be combined with the resource allocation approaches to address the interference issue. While the studies in [132], [134], and [135] consider multi-access schemes and over-the-air computation, it remains to be seen if such approaches are scalable, i.e., able to support a large federation of many participants. To this end, data-driven learning based solutions, e.g., federated DRL, can be considered to model the dynamic environment of mobile edge networks and make optimized decisions.

• Communication security: The privacy and security threats studied in Section V-B revolve mainly around data-related compromises, e.g., data and model poisoning. Due to the exposed nature of the wireless medium, FL is also vulnerable to communication security issues such as Distributed Denial-of-Service (DDoS) [206] and jamming attacks [207]. In particular, for jamming attacks, an attacker can transmit radio frequency jamming signals with high power to disrupt or cause interference to the communications between the mobile devices and the server. Such an attack can cause errors in the model uploads/downloads and consequently degrade the performance, i.e., accuracy, of the FL system. Anti-jamming schemes [208] such as frequency hopping, e.g., sending an additional copy of the model update over different frequencies, can be adopted to address the issue.

• Asynchronous FL: In synchronous FL, each training round only progresses as quickly as the slowest device, i.e., the FL system is susceptible to the straggler effect. As such, asynchronous FL has been proposed as a solution in [119] and [140]. In addition, asynchronous FL also allows participants to join the FL training halfway, even while a training round is in progress. This is more reflective of practical FL settings and can be an important contributing factor towards ensuring the scalability of FL. However, synchronous FL remains the most common approach used due to its convergence guarantees [82]. Given the many advantages of asynchronous FL, new asynchronous algorithms should be explored. In particular, for future proposed algorithms, the convergence guarantee in a non-IID setting for non-convex loss functions needs to be considered. An approach to be considered is the possible inclusion of backup workers, following the studies of [113].

• Comparisons with other distributed learning methods: Following the increased scrutiny on data privacy, there has been a growing effort on developing new privacy-preserving distributed learning algorithms. One study proposes split learning [209], which also enables collaborative ML without requiring the exchange of raw data with an external server. In split learning, each participant first trains the neural network up to a cut layer. Then, the outputs from training are transmitted to an external server that completes the remaining layers of training. The resulting gradients are then backpropagated up to the cut layer, and eventually returned to the participants to complete the local training (a minimal sketch of this training step is given after this list). In contrast, FL typically involves the communication of full model parameters. The authors in [210] conduct an empirical comparison between the communication efficiencies of split learning and FL. The simulation results show that split learning performs well when the model size involved is larger, or when there are more participants involved, since the participants do not have to transmit the weights to an aggregating server.

However, FL is much easier to implement since the participants and the FL server run the same global model, i.e., the FL server is just in charge of aggregation, and thus FL can work with one of the participants serving as the master node. As such, more research efforts can be directed towards guiding system administrators to make an informed decision as to which scenario warrants the use of either learning method.

• Further studies on learning convergence: One of the essential considerations of FL is the convergence of the algorithm. FL finds the weights that minimize the global model aggregation objective. This is in fact a distributed optimization problem, and convergence is not always guaranteed. Theoretical analysis and evaluations of the convergence bounds of gradient descent based FL for convex and non-convex loss functions are important research directions. While existing studies have covered this topic, many of the guarantees are limited by restrictions, e.g., convexity of the loss function.

• Usage of tools to quantify statistical heterogeneity: Mobile devices typically generate and collect data in a non-IID manner across the network. Moreover, the number of data samples among the mobile devices may vary significantly. To improve the convergence of FL algorithms, the statistical heterogeneity of the data needs to be quantified. Recent works, e.g., [211], have developed tools for quantifying statistical heterogeneity through metrics such as local dissimilarity. However, these metrics cannot be easily calculated over the federated network before training begins. The importance of these metrics motivates future directions such as the development of efficient algorithms to quickly determine the level of heterogeneity in federated networks.

• Combined algorithms for communication reduction: Currently, there are three common techniques for communication reduction in FL, as discussed in Section III. It is important to study how these techniques can be combined with each other to further improve performance. For example, the model compression technique can be combined with edge server-assisted FL. The combination is able to significantly reduce the size of model updates, as well as the instances of communication with the FL server. However, the feasibility of this combination has not been explored. In addition, the tradeoff between accuracy and communication overhead for the combined technique needs to be further evaluated. In particular, for the simulation results we discuss in Section III, the accuracy-communication cost reduction tradeoff is difficult to manage since it varies for different settings, e.g., data distribution, quantity, number of edge servers, and number of participants.

• Cooperative mobile crowd ML: In the existing approaches, mobile devices need to communicate with the server directly, and this may increase energy consumption. In fact, nearby mobile devices can be grouped into a cluster, and the model downloading/uploading between the server and the mobile devices can be facilitated by a "cluster head" that serves as a relay node [212]. The model exchange between the mobile devices and the cluster head can then be done over Device-to-Device (D2D) connections. Such a model can improve energy efficiency significantly. Efficient coordination schemes for the cluster head can thus be designed to further improve the energy efficiency of an FL system.

• Applications of FL: Given the advantages of guaranteeing data privacy, FL has an increasingly important role to play in many applications, e.g., healthcare, finance, and transport systems. For most current studies on FL applications, the focus lies mainly in the federated training of the learning model, with the implementation challenges neglected. For future studies on the applications of FL, besides the need to consider the aforementioned issues in this survey, i.e., communication costs, resource allocation, and privacy and security, there is also a need to consider the specific issues related to the system model in which FL will be adopted. For example, for delay-critical applications, e.g., in vehicular networks, there will be more emphasis on training efficiency and less on energy expense.
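
As referenced in the comparison with split learning above, the following is a minimal sketch of one split learning training step, assuming PyTorch is available; the layer sizes, cut point, batch, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

client_net = nn.Sequential(nn.Linear(784, 128), nn.ReLU())   # layers up to the cut layer
server_net = nn.Sequential(nn.Linear(128, 10))                # remaining layers on the server
opt_client = torch.optim.SGD(client_net.parameters(), lr=0.1)
opt_server = torch.optim.SGD(server_net.parameters(), lr=0.1)

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))      # toy local batch

# participant: forward pass up to the cut layer ("smashed" activations)
smashed = client_net(x)
# "transmit" the activations; detaching mimics the network boundary
smashed_srv = smashed.detach().requires_grad_(True)

# server: complete the forward pass, compute the loss, and backpropagate to the cut
loss = nn.functional.cross_entropy(server_net(smashed_srv), y)
opt_server.zero_grad()
loss.backward()
opt_server.step()

# the gradient at the cut layer is "returned" to the participant, which finishes backprop
opt_client.zero_grad()
smashed.backward(smashed_srv.grad)
opt_client.step()
```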

VIII. CONCLUSION

This paper has presented a tutorial on FL and a comprehensive survey of the issues regarding FL implementation. Firstly, we began with an introduction to the motivation for MEC and how FL can serve as an enabling technology for collaborative model training at mobile edge networks. Then, we described the fundamentals of DNN model training, FL, and system design towards FL at scale. Afterwards, we provided detailed reviews, analyses, and comparisons of approaches to emerging implementation challenges in FL. The issues include communication cost, resource allocation, data privacy, and data security. Furthermore, we also discussed the implementation of FL for privacy-preserving mobile edge network optimization. Finally, we discussed challenges and future research directions.

REFERENCES

[1] K. L. Lueth, "State of the IoT 2018: Number of IoT devices now at 7B – market accelerating," IoT Analytics, 2018.

[2] R. Pryss, M. Reichert, J. Herrmann, B. Langguth, and W. Schlee, "Mobile crowd sensing in clinical and psychological trials – a case study," in 2015 IEEE 28th International Symposium on Computer-Based Medical Systems. IEEE, 2015, pp. 23–24.

[3] R. K. Ganti, F. Ye, and H. Lei, "Mobile crowdsensing: current state and future challenges," IEEE Communications Magazine, vol. 49, no. 11, pp. 32–39, 2011.

[4] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.

[5] D. Oletic and V. Bilas, "Design of sensor node for air quality crowdsensing," in 2015 IEEE Sensors Applications Symposium (SAS). IEEE, 2015, pp. 1–5.

[6] Y. Jing, B. Guo, Z. Wang, V. O. Li, J. C. Lam, and Z. Yu, "Crowdtracker: Optimized urban moving object tracking using mobile crowd sensing," IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3452–3463, 2017.

[7] H.-J. Hong, C.-L. Fan, Y.-C. Lin, and C.-H. Hsu, "Optimizing cloud-based video crowdsensing," IEEE Internet of Things Journal, vol. 3, no. 3, pp. 299–313, 2016.

[8] W. He, G. Yan, and L. Da Xu, "Developing vehicular data cloud services in the IoT environment," IEEE Transactions on Industrial Informatics, vol. 10, no. 2, pp. 1587–1595, 2014.

[9] P. Li, J. Li, Z. Huang, T. Li, C.-Z. Gao, S.-M. Yiu, and K. Chen, "Multi-key privacy-preserving deep learning in cloud computing," Future Generation Computer Systems, vol. 74, pp. 76–85, 2017.

[10] B. Custers, A. Sears, F. Dechesne, I. Georgieva, T. Tani, and S. van derHof, EU Personal Data Protection in Policy and Practice. Springer,2019.

[11] B. M. Gaff, H. E. Sussman, and J. Geetter, “Privacy and big data,”Computer, vol. 47, no. 6, pp. 7–9, 2014.

[12] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A surveyon mobile edge computing: The communication perspective,” IEEECommunications Surveys & Tutorials, vol. 19, no. 4, pp. 2322–2358,2017.

[13] G. Ananthanarayanan, P. Bahl, P. Bodík, K. Chintalapudi, M. Philipose, L. Ravindranath, and S. Sinha, "Real-time video analytics: The killer app for edge computing," Computer, vol. 50, no. 10, pp. 58–67, 2017.

[14] H. Li, K. Ota, and M. Dong, “Learning iot in edge: deep learning forthe internet of things with edge computing,” IEEE Network, vol. 32,no. 1, pp. 96–101, 2018.

[15] Y. Han, X. Wang, V. Leung, D. Niyato, X. Yan, and X. Chen,“Convergence of edge computing and deep learning: A comprehensivesurvey,” arXiv preprint arXiv:1907.08349, 2019.

[16] Cisco, "Cisco global cloud index: Forecast and methodology, 2016–2021," white paper.

[17] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Visionand challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp.637–646, 2016.

[18] X. Chen, L. Jiao, W. Li, and X. Fu, “Efficient multi-user computationoffloading for mobile-edge cloud computing,” IEEE/ACM Transactionson Networking, vol. 24, no. 5, pp. 2795–2808, 2015.

[19] P. Mach and Z. Becvar, “Mobile edge computing: A survey on archi-tecture and computation offloading,” IEEE Communications Surveys &Tutorials, vol. 19, no. 3, pp. 1628–1656, 2017.

[20] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov,K. Talwar, and L. Zhang, “Deep learning with differential privacy,”in Proceedings of the 2016 ACM SIGSAC Conference on Computerand Communications Security. ACM, 2016, pp. 308–318.

[21] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, “Federatedlearning of deep networks using model averaging,” 2016.

[22] F. Cicirelli, A. Guerrieri, G. Spezzano, A. Vinci, O. Briante, A. Iera,and G. Ruggeri, “Edge computing and social internet of things forlarge-scale smart environments development,” IEEE Internet of ThingsJournal, vol. 5, no. 4, pp. 2557–2571, 2017.

[23] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al.,“Communication-efficient learning of deep networks from decentral-ized data,” arXiv preprint arXiv:1602.05629, 2016.

[24] A. Hard, K. Rao, R. Mathews, F. Beaufays, S. Augenstein, H. Eichner,C. Kiddon, and D. Ramage, “Federated learning for mobile keyboardprediction,” arXiv preprint arXiv:1811.03604, 2018.

[25] T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis,and W. Shi, “Federated learning of predictive models from federatedelectronic health records,” International journal of medical informatics,vol. 112, pp. 59–67, 2018.

[26] K. Powell, "Nvidia Clara federated learning to deliver AI to hospitals while protecting patient data," https://blogs.nvidia.com/blog/2019/12/01/clara-federated-learning/, 2019.

[27] D. Verma, S. Julier, and G. Cirincione, “Federated ai for building aisolutions across multiple agencies,” arXiv preprint arXiv:1809.10036,2018.

[28] O. Simeone, “A very brief introduction to machine learning with appli-cations to communication systems,” IEEE Transactions on CognitiveCommunications and Networking, vol. 4, no. 4, pp. 648–664, 2018.

[29] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile andwireless networking: A survey,” IEEE Communications Surveys &Tutorials, 2019.

[30] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C.Liang, and D. I. Kim, “Applications of deep reinforcement learningin communications and networking: A survey,” IEEE CommunicationsSurveys & Tutorials, 2019.

[31] K. Hamidouche, A. T. Z. Kasgari, W. Saad, M. Bennis, and M. Debbah,“Collaborative artificial intelligence (ai) for user-cell association inultra-dense cellular systems,” in 2018 IEEE International Conferenceon Communications Workshops (ICC Workshops). IEEE, 2018, pp.1–6.

[32] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-edgeai: Intelligentizing mobile edge computing, caching and communicationby federated learning,” arXiv preprint arXiv:1809.07857, 2018.

[33] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, "Distributed federated learning for ultra-reliable low-latency vehicular communications," arXiv preprint arXiv:1807.08127, 2018.

[34] Q. Yang, Y. Liu, T. Chen, and Y. Tong, “Federated machine learning:Concept and applications,” ACM Transactions on Intelligent Systemsand Technology (TIST), vol. 10, no. 2, p. 12, 2019.

[35] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning forwireless communications: Motivation, opportunities and challenges,”arXiv preprint arXiv:1908.06847, 2019.

[36] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learn-ing: Challenges, methods, and future directions,” arXiv preprintarXiv:1908.07873, 2019.

[37] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edgeintelligence: Paving the last mile of artificial intelligence with edgecomputing,” arXiv preprint arXiv:1905.10083, 2019.

[38] L. Cui, S. Yang, F. Chen, Z. Ming, N. Lu, and J. Qin, “A survey onapplication of machine learning for internet of things,” InternationalJournal of Machine Learning and Cybernetics, vol. 9, no. 8, pp. 1399–1417, 2018.

[39] K. Kumar, J. Liu, Y.-H. Lu, and B. Bhargava, “A survey of computationoffloading for mobile systems,” Mobile Networks and Applications,vol. 18, no. 1, pp. 129–140, 2013.

[40] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edgecomputing: A survey,” IEEE Internet of Things Journal, vol. 5, no. 1,pp. 450–465, 2017.

[41] S. Wang, X. Zhang, Y. Zhang, L. Wang, J. Yang, and W. Wang, “Asurvey on mobile edge networks: Convergence of computing, cachingand communications,” IEEE Access, vol. 5, pp. 6757–6779, 2017.

[42] J. Yao, T. Han, and N. Ansari, “On mobile edge caching,” IEEECommunications Surveys & Tutorials, vol. 21, no. 3, pp. 2525–2553,2019.

[43] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolu-tional network for image super-resolution,” in European conference oncomputer vision. Springer, 2014, pp. 184–199.

[44] J. Schmidhuber, “Deep learning in neural networks: An overview,”Neural networks, vol. 61, pp. 85–117, 2015.

[45] X.-W. Chen and X. Lin, “Big data deep learning: challenges andperspectives,” IEEE access, vol. 2, pp. 514–525, 2014.

[46] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou,B. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speechemotion recognition using a deep convolutional recurrent network,” in2016 IEEE international conference on acoustics, speech and signalprocessing (ICASSP). IEEE, 2016, pp. 5200–5204.

[47] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processingof deep neural networks: A tutorial and survey,” Proceedings of theIEEE, vol. 105, no. 12, pp. 2295–2329, 2017.

[48] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” inNeural networks for perception. Elsevier, 1992, pp. 65–93.

[49] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learningactivation functions to improve deep neural networks,” arXiv preprintarXiv:1412.6830, 2014.

[50] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel imagedataset for benchmarking machine learning algorithms,” arXiv preprintarXiv:1708.07747, 2017.

[51] G. Hinton, N. Srivastava, and K. Swersky, "Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent," 2012.

[52] X. J. Zhu, “Semi-supervised learning literature survey,” University ofWisconsin-Madison Department of Computer Sciences, Tech. Rep.,2005.

[53] A. Radford, L. Metz, and S. Chintala, “Unsupervised representationlearning with deep convolutional generative adversarial networks,”arXiv preprint arXiv:1511.06434, 2015.

[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcementlearning,” arXiv preprint arXiv:1312.5602, 2013.

[55] H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptronsand singular value decomposition,” Biological cybernetics, vol. 59, no.4-5, pp. 291–294, 1988.

[56] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Advances in neuralinformation processing systems, 2012, pp. 1097–1105.

[57] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur,“Recurrent neural network based language model,” in Eleventh annualconference of the international speech communication association,2010.

[58] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, “A survey on deep learningfor big data,” Information Fusion, vol. 42, pp. 146–157, 2018.

[59] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,“A brief survey of deep reinforcement learning,” arXiv preprintarXiv:1708.05866, 2017.

[60] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection:a survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.

[61] B. Zhao, J. Feng, X. Wu, and S. Yan, “A survey on deep learning-based fine-grained object classification and semantic segmentation,”International Journal of Automation and Computing, vol. 14, no. 2,pp. 119–135, 2017.

[62] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, “Deep learningfor sensor-based activity recognition: A survey,” Pattern RecognitionLetters, vol. 119, pp. 3–11, 2019.

[63] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath,“Deep reinforcement learning: A brief survey,” IEEE Signal ProcessingMagazine, vol. 34, no. 6, pp. 26–38, 2017.

[64] J. Kang, Z. Xiong, D. Niyato, S. Xie, and J. Zhang, "Incentive mechanism for reliable federated learning: A joint optimization approach to combining reputation and contract theory," Sep. 2019.

[65] C. J. Burges, “A tutorial on support vector machines for patternrecognition,” Data mining and knowledge discovery, vol. 2, no. 2, pp.121–167, 1998.

[66] R. H. Myers and R. H. Myers, Classical and modern regression withapplications. Duxbury press Belmont, CA, 1990, vol. 2.

[67] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, andK. Chan, “Adaptive federated learning in resource constrained edgecomputing systems,” IEEE Journal on Selected Areas in Communica-tions, 2019.

[68] T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local sgd,” arXiv preprint arXiv:1808.07217, 2018.

[69] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federatedlearning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.

[70] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of featuresfrom tiny images,” Citeseer, Tech. Rep., 2009.

[71] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s distance asa metric for image retrieval,” International journal of computer vision,vol. 40, no. 2, pp. 99–121, 2000.

[72] M. Duan, “Astraea: Self-balancing federated learning for improvingclassification accuracy of mobile deep learning applications,” arXivpreprint arXiv:1907.01132, 2019.

[73] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, “Under-standing data augmentation for classification: when to warp?” in 2016international conference on digital image computing: techniques andapplications (DICTA). IEEE, 2016, pp. 1–6.

[74] J. M. Joyce, “Kullback-leibler divergence,” International encyclopediaof statistical science, pp. 720–722, 2011.

[75] M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann,and M. I. Jordan, “Communication-efficient distributed dual coordinateascent,” in Advances in neural information processing systems, 2014,pp. 3068–3076.

[76] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federatedmulti-task learning,” in Advances in Neural Information ProcessingSystems, 2017, pp. 4424–4434.

[77] J. C. Bezdek and R. J. Hathaway, “Convergence of alternating opti-mization,” Neural, Parallel & Scientific Computations, vol. 11, no. 4,pp. 351–368, 2003.

[78] M. G. Arivazhagan, V. Aggarwal, A. K. Singh, and S. Choud-hary, “Federated learning with personalization layers,” arXiv preprintarXiv:1912.00818, 2019.

[79] J. Ren, X. Shen, Z. Lin, R. Mech, and D. J. Foran, “Personalized imageaesthetics,” in Proceedings of the IEEE International Conference onComputer Vision, 2017, pp. 638–647.

[80] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the conver-gence of fedavg on non-iid data,” arXiv preprint arXiv:1907.02189,2019.

[81] L. Huang, Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu, “Loadaboost:Loss-based adaboost federated machine learning on medical data,”arXiv preprint arXiv:1811.12629, 2018.

[82] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman,V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al.,“Towards federated learning at scale: System design,” arXiv preprintarXiv:1902.01046, 2019.

[83] T. Nishio and R. Yonetani, “Client selection for federated learn-ing with heterogeneous resources in mobile edge,” arXiv preprintarXiv:1804.08333, 2018.

[84] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan,S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggre-gation for privacy-preserving machine learning,” in Proceedings of the2017 ACM SIGSAC Conference on Computer and CommunicationsSecurity. ACM, 2017, pp. 1175–1191.

[85] A. Shamir, "How to share a secret," Communications of the ACM, vol. 22, no. 11, pp. 612–613, 1979.

[86] H. B. Mcmahan, D. Ramage, K. Talwar, and L. Zhang, “Learning dif-ferentially private recurrent language models,” international conferenceon learning representations, 2018.

[87] Google, "TensorFlow Federated: Machine learning on decentralized data."

[88] T. Ryffel, A. Trask, M. Dahl, B. Wagner, J. Mancuso, D. Rueckert,and J. Passerat-Palmbach, “A generic framework for privacy preservingdeep learning,” arXiv preprint arXiv:1811.04017, 2018.

[89] S. Caldas, P. Wu, T. Li, J. Konecny, H. B. McMahan, V. Smith,and A. Talwalkar, “Leaf: A benchmark for federated settings,” arXivpreprint arXiv:1812.01097, 2018.

[90] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," AT&T Labs, 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist

[91] A. Go, R. Bhayani, and L. Huang, "Sentiment140," 2013. [Online]. Available: http://help.sentiment140.com/site-functionality (accessed 2016).

[92] WeBank, "FATE: An industrial grade federated learning framework," https://fate.fedai.org, 2018.

[93] J. Konecny, H. B. McMahan, D. Ramage, and P. Richtárik, “Federatedoptimization: Distributed machine learning for on-device intelligence,”arXiv preprint arXiv:1610.02527, 2016.

[94] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, andD. Bacon, “Federated learning: Strategies for improving communica-tion efficiency,” arXiv preprint arXiv:1610.05492, 2016.

[95] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in Proceedings of the IEEE conference on computer visionand pattern recognition, 2016, pp. 770–778.

[96] L. Wang, W. Wang, and B. Li, "CMFL: Mitigating communication overhead for federated learning."

[97] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, andS. Wright, “Atomo: Communication-efficient learning via atomic spar-sification,” in Advances in Neural Information Processing Systems,2018, pp. 9850–9861.

[98] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd withmemory,” in Advances in Neural Information Processing Systems,2018, pp. 4447–4458.

[99] S. Caldas, J. Konecny, H. B. McMahan, and A. Talwalkar, “Expandingthe reach of federated learning by reducing client resource require-ments,” arXiv preprint arXiv:1812.07210, 2018.

[100] Z. Tao and Q. Li, "eSGD: Communication efficient distributed deep learning on the edge," in USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), 2018.

[101] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neuralcomputation, vol. 9, no. 8, pp. 1735–1780, 1997.

[102] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-dinov, “Dropout: a simple way to prevent neural networks fromoverfitting,” The journal of machine learning research, vol. 15, no. 1,pp. 1929–1958, 2014.

[103] Y. Liu, Y. Kang, X. Zhang, L. Li, Y. Cheng, T. Chen, M. Hong,and Q. Yang, “A communication efficient vertical federated learningframework,” arXiv preprint arXiv:1912.11187, 2019.

[104] X. Yao, C. Huang, and L. Sun, “Two-stream federated learning: Reducethe communication costs,” in 2018 IEEE Visual Communications andImage Processing (VCIP). IEEE, 2018, pp. 1–4.

[105] L. Liu, J. Zhang, S. Song, and K. B. Letaief, “Edge-assisted hierarchicalfederated learning with non-iid data,” arXiv preprint arXiv:1905.06641,2019.

[106] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferablefeatures with deep adaptation networks,” in Proceedings of the 32ndInternational Conference on International Conference on MachineLearning-Volume 37. JMLR. org, 2015, pp. 97–105.

[107] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, “Transfer featurelearning with joint distribution adaptation,” in Proceedings of the IEEEinternational conference on computer vision, 2013, pp. 2200–2207.

[108] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressingdeep neural networks with pruning, trained quantization and huffmancoding,” arXiv preprint arXiv:1510.00149, 2015.

[109] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan, “Distributedmean estimation with limited communication,” in Proceedings ofthe 34th International Conference on Machine Learning-Volume 70.JMLR. org, 2017, pp. 3329–3337.

[110] B. S. Kashin, “Diameters of some finite-dimensional sets and classesof smooth functions,” Izvestiya Rossiiskoi Akademii Nauk. SeriyaMatematicheskaya, vol. 41, no. 2, pp. 334–351, 1977.

[111] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “Emnist: an exten-sion of mnist to handwritten letters,” arXiv preprint arXiv:1702.05373,2017.

[112] N. Strom, “Scalable distributed dnn training using commodity gpucloud computing,” in Sixteenth Annual Conference of the InternationalSpeech Communication Association, 2015.

[113] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, “Revisitingdistributed synchronous sgd,” arXiv preprint arXiv:1604.00981, 2016.

[114] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradientcompression: Reducing the communication bandwidth for distributedtraining,” arXiv preprint arXiv:1712.01887, 2017.

[115] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu, "Gaia: Geo-distributed machine learning approaching LAN speeds," in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 629–647.

[116] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A pub-lic domain dataset for human activity recognition using smartphones.”in Esann, 2013.

[117] M. Buscema and S. Terzi, “Semeion handwritten digit data set,” Centerfor Machine Learning and Intelligent Systems, California, USA, 2009.

[118] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energyefficient federated learning over wireless communication networks,”arXiv preprint arXiv:1911.02417, 2019.

[119] M. R. Sprague, A. Jalalirad, M. Scavuzzo, C. Capota, M. Neun,L. Do, and M. Kopp, “Asynchronous federated learning for geospatialapplications,” in Joint European Conference on Machine Learning andKnowledge Discovery in Databases. Springer, 2018, pp. 21–28.

[120] M. I. Jordan, J. D. Lee, and Y. Yang, “Communication-efficientdistributed statistical inference,” Journal of the American StatisticalAssociation, vol. 114, no. 526, pp. 668–681, 2019.

[121] M. Sviridenko, “A note on maximizing a submodular set functionsubject to a knapsack constraint,” Operations Research Letters, vol. 32,no. 1, pp. 41–43, 2004.

[122] M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,”arXiv preprint arXiv:1902.00146, 2019.

[123] N. Yoshida, T. Nishio, M. Morikura, K. Yamamoto, and R. Yonetani,“Hybrid-fl: Cooperative learning mechanism using non-iid data inwireless networks,” arXiv preprint arXiv:1905.07210, 2019.

[124] T. T. Anh, N. C. Luong, D. Niyato, D. I. Kim, and L.-C. Wang, “Effi-cient training management for mobile crowd-machine learning: A deepreinforcement learning approach,” arXiv preprint arXiv:1812.03633,2018.

[125] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learningwith double q-learning,” in Thirtieth AAAI conference on artificialintelligence, 2016.

[126] H. T. Nguyen, N. C. Luong, J. Zhao, C. Yuen, and D. Niyato, “Resourceallocation in mobility-aware federated learning networks: A deepreinforcement learning approach,” arXiv preprint arXiv:1910.09172,2019.

[127] M. J. Neely, E. Modiano, and C.-P. Li, “Fairness and optimal stochasticcontrol for heterogeneous networks,” IEEE/ACM Transactions OnNetworking, vol. 16, no. 2, pp. 396–409, 2008.

[128] M. Feldman, "Computational fairness: Preventing machine-learned discrimination," 2015.

[129] T. Li, M. Sanjabi, and V. Smith, “Fair resource allocation in federatedlearning,” arXiv preprint arXiv:1905.10497, 2019.

[130] M. Chen, H. V. Poor, W. Saad, and S. Cui, “Convergence timeoptimization for federated learning over wireless networks,” arXivpreprint arXiv:2001.07845, 2020.

[131] D. López-Pérez, A. Valcarce, G. De La Roche, and J. Zhang, "OFDMA femtocells: A roadmap on interference avoidance," IEEE Communications Magazine, vol. 47, no. 9, pp. 41–48, 2009.

[132] G. Zhu, Y. Wang, and K. Huang, “Low-latency broadbandanalog aggregation for federated edge learning,” arXiv preprintarXiv:1812.11494, 2018.

[133] B. Nazer and M. Gastpar, “Compute-and-forward: Harnessing inter-ference through structured codes,” IEEE Transactions on InformationTheory, vol. 57, no. 10, pp. 6463–6486, 2011.

[134] M. M. Amiri and D. Gunduz, “Federated learning over wireless fadingchannels,” arXiv preprint arXiv:1907.09769, 2019.

[135] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” arXiv preprint arXiv:1812.11750, 2018.

[136] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P.Tang, “On large-batch training for deep learning: Generalization gapand sharp minima,” arXiv preprint arXiv:1609.04836, 2016.

[137] S. Boyd and L. Vandenberghe, Convex optimization. Cambridgeuniversity press, 2004.

[138] P. D. Tao et al., “The dc (difference of convex functions) programmingand dca revisited with dc models of real world nonconvex optimizationproblems,” Annals of operations research, vol. 133, no. 1-4, pp. 23–46,2005.

[139] Z.-Q. Luo, N. D. Sidiropoulos, P. Tseng, and S. Zhang, “Approxima-tion bounds for quadratic optimization with homogeneous quadraticconstraints,” SIAM Journal on optimization, vol. 18, no. 1, pp. 1–28,2007.

[140] C. Xie, S. Koyejo, and I. Gupta, “Asynchronous federated optimiza-tion,” arXiv preprint arXiv:1903.03934, 2019.

[141] S. Feng, D. Niyato, P. Wang, D. I. Kim, and Y.-C. Liang, “Joint servicepricing and cooperative relay communication for federated learning,”arXiv preprint arXiv:1811.12082, 2018.

[142] M. J. Osborne et al., An introduction to game theory. Oxford universitypress New York, 2004, vol. 3, no. 3.

[143] Y. Sarikaya and O. Ercetin, “Motivating workers in federated learning:A stackelberg game perspective,” arXiv preprint arXiv:1908.03092,2019.

[144] L. U. Khan, N. H. Tran, S. R. Pandey, W. Saad, Z. Han, M. N. Nguyen,and C. S. Hong, “Federated learning for edge networks: Resource opti-mization and incentive mechanism,” arXiv preprint arXiv:1911.05642,2019.

[145] Y. Zhan, P. Li, Z. Qu, D. Zeng, and S. Guo, “A learning-based incentivemechanism for federated learning,” IEEE Internet of Things Journal,2020.

[146] J. Kang, Z. Xiong, D. Niyato, H. Yu, Y.-C. Liang, and D. I. Kim,“Incentive design for efficient federated learning in mobile networks:A contract theory approach,” arXiv preprint arXiv:1905.07479, 2019.

[147] P. Bolton, M. Dewatripont et al., Contract theory. MIT Press, 2005.

[148] D. Ye, R. Yu, M. Pan, and Z. Han, "Federated learning in vehicular edge computing: A selective model aggregation approach," IEEE Access, 2020.

[149] R. Jurca and B. Faltings, “An incentive compatible reputation mecha-nism,” in EEE International Conference on E-Commerce, 2003. CEC2003. IEEE, 2003, pp. 285–292.

[150] R. Dennis and G. Owen, “Rep on the block: A next generationreputation system based on the blockchain,” in 2015 10th Interna-tional Conference for Internet Technology and Secured Transactions(ICITST). IEEE, 2015, pp. 131–138.

[151] H. Yu, Z. Liu, Y. Liu, T. Chen, M. Cong, X. Weng, D. Niyato, andQ. Yang, “A fairness-aware incentive scheme for federated learning,”AIES 20: Proceedings of the AAAI/ACM Conference on AI, Ethics, andSociety, 2020.

[152] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, “Exploitingunintended feature leakage in collaborative learning.” IEEE, 2019.

[153] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning largeface datasets,” in 2014 IEEE International Conference on ImageProcessing (ICIP). IEEE, 2014, pp. 343–347.

[154] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, andG. Felici, “Hacking smart machines with smarter ones: How to ex-tract meaningful data from machine learning classifiers,” InternationalJournal of Security, vol. 10, no. 3, pp. 137–150, 2015.

[155] M. Fredrikson, S. Jha, and T. Ristenpart, “Model inversion attacksthat exploit confidence information and basic countermeasures,” inProceedings of the 22nd ACM SIGSAC Conference on Computer andCommunications Security. ACM, 2015, pp. 1322–1333.

[156] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction APIs," in 25th USENIX Security Symposium (USENIX Security 16), 2016, pp. 601–618.

[157] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membershipinference attacks against machine learning models,” in 2017 IEEESymposium on Security and Privacy (SP). IEEE, 2017, pp. 3–18.

[158] N. Papernot, P. McDaniel, and I. Goodfellow, “Transferability in ma-chine learning: from phenomena to black-box attacks using adversarialsamples,” arXiv preprint arXiv:1605.07277, 2016.

[159] P. Laskov et al., “Practical evasion of a learning-based classifier: Acase study,” in 2014 IEEE symposium on security and privacy. IEEE,2014, pp. 197–211.

[160] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noiseto sensitivity in private data analysis,” in Theory of cryptographyconference. Springer, 2006, pp. 265–284.

[161] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federatedlearning: A client level perspective,” arXiv preprint arXiv:1712.07557,2017.

[162] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” inProceedings of the 22nd ACM SIGSAC conference on computer andcommunications security. ACM, 2015, pp. 1310–1321.

[163] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., “Gradient-basedlearning applied to document recognition,” Proceedings of the IEEE,vol. 86, no. 11, pp. 2278–2324, 1998.

[164] B. Hitaj, G. Ateniese, and F. Pérez-Cruz, “Deep models under the gan:information leakage from collaborative deep learning,” in Proceedingsof the 2017 ACM SIGSAC Conference on Computer and Communica-tions Security. ACM, 2017, pp. 603–618.

[165] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” inAdvances in neural information processing systems, 2014, pp. 2672–2680.

[166] Y. Liu, Z. Ma, S. Ma, S. Nepal, and R. Deng, "Boosting privately: Privacy-preserving federated extreme boosting for mobile crowdsensing," https://arxiv.org/pdf/1907.10218.pdf, 2019.

[167] A. Triastcyn and B. Faltings, “Federated generative privacy,” arXivpreprint arXiv:1910.08385, 2019.

[168] Y. Aono, T. Hayashi, L. Wang, S. Moriai et al., “Privacy-preservingdeep learning via additively homomorphic encryption,” IEEE Transac-tions on Information Forensics and Security, vol. 13, no. 5, pp. 1333–1345, 2017.

[169] M. Hao, H. Li, G. Xu, S. Liu, and H. Yang, “Towards efficient andprivacy-preseving federated deep learning,” in 2019 IEEE InternationalConference on Communications. IEEE, 2019, pp. 1–6.

[170] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoorattacks on deep learning systems using data poisoning,” arXiv preprintarXiv:1712.05526, 2017.

[171] C. Fung, C. J. Yoon, and I. Beschastnikh, “Mitigating sybils infederated learning poisoning,” arXiv preprint arXiv:1808.04866, 2018.

[172] A. Asuncion and D. Newman, "UCI machine learning repository," 2007.

[173] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo, "Analyzing federated learning through an adversarial lens," arXiv preprint arXiv:1811.12470, 2018.

[174] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How tobackdoor federated learning,” arXiv preprint arXiv:1807.00459, 2018.

[175] H. Kim, J. Park, M. Bennis, and S.-L. Kim, “On-device federatedlearning via blockchain and its latency analysis,” arXiv preprintarXiv:1808.03949, 2018.

[176] J. Weng, J. Weng, J. Zhang, M. Li, Y. Zhang, and W. Luo, “Deepchain:Auditable and privacy-preserving deep learning with blockchain-basedincentive,” Cryptology ePrint Archive, Report 2018/679, 2018.

[177] I. Stojmenovic, S. Wen, X. Huang, and H. Luan, “An overview offog computing and its security issues,” Concurrency and Computation:Practice and Experience, vol. 28, no. 10, pp. 2991–3005, 2016.

[178] M. Gerla, E.-K. Lee, G. Pau, and U. Lee, “Internet of vehicles: Fromintelligent grid to autonomous cars and vehicular clouds,” in 2014 IEEEworld forum on internet of things (WF-IoT). IEEE, 2014, pp. 241–246.

[179] A. Abeshu and N. Chilamkurti, “Deep learning: the frontier fordistributed attack detection in fog-to-things computing,” IEEE Com-munications Magazine, vol. 56, no. 2, pp. 169–175, 2018.

[180] T. D. Nguyen, S. Marchal, M. Miettinen, M. H. Dang, N. Asokan, and A.-R. Sadeghi, "DÏoT: A crowdsourced self-learning approach for detecting compromised IoT devices," arXiv preprint arXiv:1804.07474, 2018.

[181] D. Preuveneers, V. Rimmer, I. Tsingenopoulos, J. Spooren, W. Joosen,and E. Ilie-Zudor, “Chained anomaly detection models for federatedlearning: An intrusion detection case study,” Applied Sciences, vol. 8,no. 12, p. 2663, 2018.

[182] J. Ren, H. Wang, T. Hou, S. Zheng, and C. Tang, “Federatedlearning-based computation offloading optimization in edge computing-supported internet of things,” IEEE Access, vol. 7, pp. 69 194–69 201,2019.

[183] Z. Yu, J. Hu, G. Min, H. Lu, Z. Zhao, H. Wang, and N. Georgalas, “Fed-erated learning based proactive content caching in edge computing,”in 2018 IEEE Global Communications Conference (GLOBECOM).IEEE, 2018, pp. 1–6.

[184] Y. Qian, L. Hu, J. Chen, X. Guan, M. M. Hassan, and A. Alelaiwi,“Privacy-aware service placement for mobile edge computing viafederated learning,” Information Sciences, vol. 505, pp. 562–570, 2019.

[185] M. Chen, O. Semiari, W. Saad, X. Liu, and C. Yin, “Federated echostate learning for minimizing breaks in presence in wireless virtualreality networks,” arXiv preprint arXiv:1812.01202, 2018.

[186] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Federatedlearning for ultra-reliable low-latency v2v communications,” in 2018IEEE Global Communications Conference (GLOBECOM). IEEE,2018, pp. 1–7.

[187] Y. Saputra, H. Dinh, D. Nguyen, E. Dutkiewicz, M. Mueck, andS. Srikanteswara, “Energy demand prediction with federated learningfor electric vehicle networks,” in IEEE Global Communications Con-ference (GLOBECOM), 2019.

[188] K. K. Nguyen, D. T. Hoang, D. Niyato, P. Wang, D. Nguyen, andE. Dutkiewicz, “Cyberattack detection in mobile cloud computing: Adeep learning approach,” in 2018 IEEE Wireless Communications andNetworking Conference (WCNC). IEEE, 2018, pp. 1–6.

[189] The University of New Brunswick, "NSL-KDD dataset," https://www.unb.ca/cic/datasets/nsl.html.

[190] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set fornetwork intrusion detection systems (unsw-nb15 network data set),”in 2015 military communications and information systems conference(MilCIS). IEEE, 2015, pp. 1–6.

[191] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transac-tions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.

[192] S. Priya and D. J. Inman, Energy harvesting technologies. Springer,2009, vol. 21.

[193] O. Chapelle and L. Li, “An empirical evaluation of thompson sam-pling,” in Advances in neural information processing systems, 2011,pp. 2249–2257.

[194] J. Chung, H.-J. Yoon, and H. J. Gardner, “Analysis of break in presenceduring game play using a linear mixed model,” ETRI journal, vol. 32,no. 5, pp. 687–694, 2010.

[195] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong,“Caching in the sky: Proactive deployment of cache-enabled unmannedaerial vehicles for optimized quality-of-experience,” IEEE Journal onSelected Areas in Communications, vol. 35, no. 5, pp. 1046–1061,2017.

[196] O. Guéant, J.-M. Lasry, and P.-L. Lions, “Mean field games andapplications,” in Paris-Princeton lectures on mathematical finance2010. Springer, 2011, pp. 205–266.

[197] L. De Haan and A. Ferreira, Extreme value theory: an introduction.Springer Science & Business Media, 2007.

[198] M. J. Neely, “Stochastic network optimization with application tocommunication and queueing systems,” Synthesis Lectures on Com-munication Networks, vol. 3, no. 1, pp. 1–211, 2010.

[199] P. You and Z. Yang, “Efficient optimal scheduling of charging stationwith multiple electric vehicles via v2v,” in 2014 IEEE InternationalConference on Smart Grid Communications (SmartGridComm). IEEE,2014, pp. 716–721.

[200] P. Bradley, K. Bennett, and A. Demiriz, “Constrained k-means cluster-ing,” Microsoft Research, Redmond, vol. 20, no. 0, p. 0, 2000.

[201] W. Li, T. Logenthiran, V.-T. Phan, and W. L. Woo, “Implemented iot-based self-learning home management system (shms) for singapore,”IEEE Internet of Things Journal, vol. 5, no. 3, pp. 2212–2219, 2018.

[202] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar,F. Estrada-Solano, and O. M. Caicedo, “A comprehensive survey onmachine learning for networking: evolution, applications and researchopportunities,” Journal of Internet Services and Applications, vol. 9,no. 1, p. 16, 2018.

[203] L. Jiang, R. Tan, X. Lou, and G. Lin, “On lightweight privacy-preserving collaborative learning for internet-of-things objects,” 2019.

[204] Z. Gu, H. Jamjoom, D. Su, H. Huang, J. Zhang, T. Ma, D. Pendarakis,and I. Molloy, “Reaching data confidentiality and model accountabilityon the caltrain,” in 2019 49th Annual IEEE/IFIP International Confer-ence on Dependable Systems and Networks (DSN). IEEE, 2019, pp.336–348.

[205] A. Albaseer, B. S. Ciftler, M. Abdallah, and A. Al-Fuqaha, “Exploitingunlabeled data in smart cities using federated learning,” arXiv preprintarXiv:2001.04030, 2020.

[206] F. Lau, S. H. Rubin, M. H. Smith, and L. Trajkovic, "Distributed denial of service attacks," in SMC 2000 Conference Proceedings. 2000 IEEE International Conference on Systems, Man and Cybernetics. 'Cybernetics Evolving to Systems, Humans, Organizations, and their Complex Interactions' (cat. no. 0), vol. 3. IEEE, 2000, pp. 2275–2280.

[207] W. Xu, K. Ma, W. Trappe, and Y. Zhang, “Jamming sensor networks:attack and defense strategies,” IEEE network, vol. 20, no. 3, pp. 41–47,2006.

[208] M. Strasser, C. Popper, S. Capkun, and M. Cagalj, “Jamming-resistantkey establishment using uncoordinated frequency hopping,” in 2008IEEE Symposium on Security and Privacy (sp 2008). IEEE, 2008,pp. 64–78.

[209] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learningfor health: Distributed deep learning without sharing raw patient data,”arXiv preprint arXiv:1812.00564, 2018.

[210] A. Singh, P. Vepakomma, O. Gupta, and R. Raskar, “Detailed com-parison of communication efficiency of split learning and federatedlearning,” arXiv preprint arXiv:1909.09145, 2019.

[211] I. I. Eliazar and I. M. Sokolov, “Measuring statistical heterogeneity:The pietra index,” Physica A: Statistical Mechanics and its Applica-tions, vol. 389, no. 1, pp. 117–125, 2010.

[212] J.-S. Leu, T.-H. Chiang, M.-C. Yu, and K.-W. Su, “Energy efficientclustering scheme for prolonging the lifetime of wireless sensor net-work with isolated nodes,” IEEE communications letters, vol. 19, no. 2,pp. 259–262, 2014.