HeterPS: Distributed Deep Learning with Reinforcement Learning Based Scheduling in Heterogeneous Environments

Ji Liu †1, Zhihua Wu †1, Dianhai Yu 1, Yanjun Ma 1, Danlei Feng 1, Minxu Zhang 1, Xinxuan Wu 1, Xuefeng Yao 1, Dejing Dou 1

1 Baidu Inc.

ABSTRACT

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs and GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers to diverse computing resources is critical for the training process. To efficiently train a DNN model using the heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables an efficient training process of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to efficiently schedule the workload of each layer to appropriate computing resources to minimize the cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The code of the framework is publicly available at: https://github.com/PaddlePaddle/Paddle.

1 INTRODUCTION

Deep Neural Networks (DNNs) have achieved major success in various domains, such as computer vision, natural language processing, and advertising systems. Big models with many layers, neurons, and huge amounts of parameters are generally trained with large amounts of data (Mundhenk et al., 2016), which drastically improves ultimate accuracy (Dean et al., 2012). For instance, the Click-Through Rate (CTR) prediction model (Zhao et al., 2019), BERT (Devlin et al., 2019), and ERNIE (Sun et al., 2020) exploit large numbers of parameters, e.g., from 110 million to 340 million parameters for BERT.

Large models are generally composed of both data-intensive and compute-intensive layers. For instance, CTR models process high-dimensional input data. The high-dimensional input data contain a large number of sparse features. Sparse features refer to a small proportion of non-zero features, which are processed through an embedding layer to generate a low-dimensional embedding (Zhao et al., 2020). As the embedding layer processes large amounts of data, e.g., around 10 TB or even bigger, the processing incurs high Input/Output (IO) cost and is data-intensive. However, the training process of some other layers in deep neural networks, e.g., fully-connected layers, is compute-intensive because of heavy computing workloads.

As the processing units, e.g., CPUs, diverse types of GPUs, and AI processors, become heterogeneous, it is critical to take full advantage of heterogeneous computing resources for the distributed training process of large-scale DNN models. Some computing resources, e.g., CPUs, favor data-intensive tasks, while some others, e.g., GPUs, prefer compute-intensive tasks. In this case, the scheduling between tasks and diverse computing resources is of much importance for distributed training.

While the scheduling problem is a typical NP-hard problem (Du & Leung, 1989; Liu et al., 2020), some straightforward methods already exist. For instance, as the first layer generally deals with big amounts of data, it can be scheduled to CPUs, while the other layers can be scheduled to GPUs (Zhao et al., 2019). As not all DNN models have the same structure, this method may not be suitable for diverse types of DNN structures.


Heuristics such as Genetic (Barika et al., 2019) and Greedy (Shi et al., 2020) algorithms can be directly applied to address the layer scheduling problem, while they may fall into a local optimum, which corresponds to high cost. In addition, as a black-box optimization method, the Bayesian Optimization (BO)-based scheduling method (Dolatnia et al., 2016) can also be applied. However, BO may incur some randomness, which corresponds to high costs in some corner cases.

After scheduling the tasks to proper heterogeneous computing resources, parallelism can be achieved to accelerate the training process. Data parallelism (Verbraeken et al., 2020) is widely used to parallelize the training process of large-scale DNN models, while pipeline parallelism (Huang et al., 2018; Narayanan et al., 2019; Fan et al., 2021) emerges as a promising approach to address large DNN models. With the data parallelism approach, the training data is partitioned as many times as the number of computing resources, and all computing resources subsequently apply the same DNN model to process different chunks of the data sets (Verbraeken et al., 2020). With the pipeline approach, each computing resource processes the training data with a stage of the DNN model, while the processing of each stage can be parallelized (Narayanan et al., 2019). A stage of the DNN model is composed of several continuous layers, while two different stages may have data dependencies, i.e., the output of one stage is the input of the other stage. The data parallelism and the pipeline parallelism can be combined to achieve fine-grained parallelism (Fan et al., 2021; Narayanan et al., 2019).

While parallelism reduces the training time, multiple computing resources may incur a big monetary cost. The training process generally has a hard limitation in terms of throughput to train a DNN model within an acceptable time. Thus, it is beneficial to minimize the monetary cost under the throughput limit. As the computing resources may be elastic, i.e., the number of computing resources may scale up or shrink on demand, the elasticity of the computing resources can be exploited to ensure the throughput constraint while minimizing the monetary cost. In this case, it is critical to determine the number of computing resources to use for the distributed training.

In this paper, we propose Paddle-Heterogeneous Parameter Server (Paddle-HeterPS) to enable the distributed training of large-scale DNNs with elastic heterogeneous computing resources. Paddle-HeterPS is composed of three modules, i.e., a DNN layer scheduling module, a data management module, and a distributed training module. The DNN layer scheduling module generates a scheduling plan and a provisioning plan. The provisioning plan defines the number of computing resources of each type for the distributed training process, while the scheduling plan maps each layer to a proper type of computing resources. The data management module handles the data transfer among different clusters or servers. A cluster is a set of interconnected computing resources (Liu et al., 2014). The distributed training module parallelizes the training process of the model while exploiting the combination of data parallelism and pipeline parallelism.

To take advantage of heterogeneous computing resources, we propose a DNN layer scheduling method in the scheduling module. Within a DNN model, multiple layers may have diverse characteristics, e.g., data-intensive or compute-intensive. An embedding layer is typically data-intensive, while a fully-connected layer is generally compute-intensive because of its heavy computing workload. To reduce the training time, we schedule each layer to a proper type of computing resource, e.g., CPUs or GPUs of a certain type. Then, we combine several successive layers into a stage, which is scheduled to the same type of computing resources to reduce the time to transfer data among different computing resources. In this way, a scheduling plan is generated. Afterward, we generate a provisioning plan to adjust the number of computing resources of each type to achieve load balance and to reduce the monetary cost while meeting the throughput limitation. We exploit both the data parallelism and the pipeline parallelism to parallelize the training process.

We summarize our main contributions as follows:

• We propose a framework denoted by Paddle-HeterPS to enable the distributed training of large-scale DNNs with elastic heterogeneous computing resources. The framework manages the data storage and data communication among distributed computing resources.

• We propose a reinforcement learning-based layer scheduling method to schedule each layer to a proper type of computing resources while minimizing the total cost and ensuring the throughput. In addition, based on the scheduling method, we also present a method to determine the proper number of computing resources for distributed training.

• We carry out extensive experiments based on DNN models of diverse structures to show the advantages of our approach compared with baseline approaches.

The rest of the paper is organized as follows. Section 2 presents the background and related work. In Section 3, we introduce the system design of Paddle-HeterPS, including the system architecture and the data management methods. In Section 4, we explain the cost model and formally define the scheduling problem. In Section 5, we propose a reinforcement learning-based scheduling method. Section 6 presents extensive experimental evaluation results to show the advantages of Paddle-HeterPS. Finally, Section 7 concludes.


Figure 1. An example of a CTR prediction model. The circles represent neurons and the arrows represent the data flow. The input data are sparse and the execution is data-intensive. The second layer is an embedding layer, which is also data-intensive. Then, two fully-connected layers, which are compute-intensive, follow the embedding layer. Finally, an output layer produces the final result, which is generally data-intensive as it incurs IO cost.

2 BACKGROUND & RELATED WORKS

In this section, we first present DNN models, e.g., CTR models. Then, we present the related work for distributed training and scheduling methods.

2.1 CTR Prediction Model

CTR prediction models are generally exploited in large commercial web search engines (Fan et al., 2019; Zhao et al., 2019; 2020) to predict the likelihood of an advertisement being clicked when it is shown as a response to a submitted query. Traditional CTR models are based on heuristics and handcrafted features generated from the historical click data using Bayesian methods (Richardson et al., 2007), decision trees, and logistic regression (Szwabe et al., 2017; He et al., 2014). Recently, many new approaches (Chen et al., 2016; Chan et al., 2018; Zhou et al., 2018; Liu et al., 2018b; Chen & Li, 2021) proposed for CTR prediction are based on deep neural network models. These approaches exploit DNNs to automatically learn from the historical raw queries. As shown in Figure 1, DNN-based CTR prediction models generally employ an embedding layer to learn low-dimensional representations from sparse inputs. The execution of the embedding layer is data-intensive as it incurs much IO cost. Then, fully-connected layers are exploited to learn the latent correlations among features. Some other layers, e.g., LSTM (Chen & Li, 2021), can also be used to explore specific interactions among features. Finally, an output layer, which generally takes advantage of the softmax function, generates the prediction results. A layer is data-intensive when it takes less time to perform the corresponding execution tasks with CPUs than with GPUs. Otherwise, the layer is compute-intensive.
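The following sketch illustrates the layer structure just described (sparse input, embedding lookup, fully-connected layers, softmax output). It is a minimal NumPy illustration with made-up sizes, a sum-pooled embedding lookup, and random weights; it is not the authors' CTR model or the Paddle-HeterPS implementation.

```python
import numpy as np

# Minimal sketch of a DNN-based CTR model: embedding layer (data-intensive lookup)
# followed by fully-connected layers (compute-intensive) and a softmax output.
rng = np.random.default_rng(0)
VOCAB, EMB_DIM, HIDDEN, NUM_CLASSES = 1000, 16, 32, 2

emb_table = rng.normal(scale=0.01, size=(VOCAB, EMB_DIM))  # embedding table
W1 = rng.normal(scale=0.1, size=(EMB_DIM, HIDDEN))         # fully-connected layer 1
W2 = rng.normal(scale=0.1, size=(HIDDEN, NUM_CLASSES))     # output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ctr_forward(sparse_feature_ids):
    # Gather and sum the embeddings of the non-zero (sparse) features.
    h = emb_table[sparse_feature_ids].sum(axis=0)
    h = np.maximum(0.0, h @ W1)   # fully-connected layer + ReLU
    return softmax(h @ W2)        # click / no-click probabilities

print(ctr_forward([3, 17, 256]))
```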

2.2 Distributed DNN Training

DNN training consists of multiple iterations, which refine the model parameters until the model achieves a convergence point with the minimum loss based on a loss function. Each iteration is composed of three phases (Jiang et al., 2020c): forward propagation, backward propagation, and model update. The forward propagation processes a batch of input data and calculates the loss based on the DNN model. The backward propagation calculates the gradients of the parameters of each neuron based on the loss calculated from the forward propagation. The model update updates the model using the gradients with a certain optimizer, e.g., SGD as shown in Formula 1, Adam (Kingma & Ba, 2015), etc. In Formula 1, ω_t is the weights of the model at Iteration t, η_t is the learning rate at Iteration t, and the stochastic gradient g_t is the gradient of a defined loss function.

ω_{t+1} := ω_t − η_t · g_t,    (1)
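As a small illustration of the model-update phase, the snippet below applies Formula 1 once; the weights, gradients, and learning rate are illustrative values, not taken from any experiment.

```python
import numpy as np

# Minimal sketch of one SGD model-update step (Formula 1): w_{t+1} = w_t - eta_t * g_t.
def sgd_update(weights, grads, lr):
    return weights - lr * grads

w = np.array([0.5, -1.2, 3.0])   # current weights (illustrative)
g = np.array([0.1, -0.2, 0.05])  # gradients from backward propagation (illustrative)
print(sgd_update(w, g, lr=0.01))
```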

Two types of parallelism, i.e., data parallelism (Verbraeken et al., 2020; Li et al., 2020) and pipeline parallelism (Huang et al., 2018; Narayanan et al., 2019; Fan et al., 2021), can be used to parallelize the forward and backward propagation. The data parallelism approach enables each iteration to process large amounts of data. The input data is partitioned as many times as the number of computing resources, and each part of the data is used to calculate the gradients at each computing resource. The model parameters or gradients at each computing resource are averaged and used to update the parameters in each computing resource, using either a parameter server architecture (Li et al., 2014) or a ring-allreduce architecture (Gibiansky, 2017). The parameter server architecture exploits a server to calculate the global parameters or gradients based on a distributed optimization algorithm, while the ring-allreduce architecture exploits a decentralized protocol to calculate the global parameters or gradients. The pipeline approach partitions the DNN model into multiple stages, while each stage is processed in a computing resource (Narayanan et al., 2019). The data parallelism and the pipeline parallelism can be combined to parallelize the execution of each stage (Fan et al., 2021; Narayanan et al., 2019).
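The sketch below illustrates the data-parallel pattern described above: each worker computes gradients on its own data shard and a parameter-server-style step averages them before the update. It is a toy linear-regression example with assumed shapes and functions, not the Paddle-HeterPS communication code.

```python
import numpy as np

# Minimal sketch of data-parallel training with parameter-server-style averaging.
def local_gradient(weights, x_batch, y_batch):
    # Toy linear-regression gradient computed on one worker's shard of the data.
    pred = x_batch @ weights
    return x_batch.T @ (pred - y_batch) / len(x_batch)

def server_aggregate(worker_grads):
    # The parameter server averages the gradients collected from all workers.
    return np.mean(worker_grads, axis=0)

rng = np.random.default_rng(0)
weights = np.zeros(4)
shards = [(rng.normal(size=(8, 4)), rng.normal(size=8)) for _ in range(3)]  # 3 workers
for _ in range(10):  # training iterations
    grads = [local_gradient(weights, x, y) for x, y in shards]
    weights -= 0.1 * server_aggregate(grads)
print(weights)
```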

Although many works (Fan et al., 2021; Narayanan et al., 2019; Huang et al., 2019) have been proposed to combine the data parallelism and the pipeline parallelism to accelerate the distributed DNN training, they focus on GPUs and cannot utilize heterogeneous computing resources, e.g., CPUs and diverse types of GPUs. Parameter servers can exploit CPUs to calculate the global parameters or gradients (Li et al., 2014), while GPUs can also be used in parameter servers to accelerate the processing (Cui et al., 2016; Zhao et al., 2020). In addition, the parameter server architecture can address fault tolerance and exploit high-performance communication libraries, i.e., MPI (Mamidala et al., 2018), big data processing frameworks, e.g., Spark (Zaharia et al., 2010; Jiang et al., 2020a), or specific communication architectures for distributed deep learning, e.g., Poseidon (Zhang et al., 2017). However, these approaches cannot take advantage of the combination of CPUs and GPUs to handle both data-intensive tasks and compute-intensive tasks. Some works (Kim et al., 2019; Jiang et al., 2020c; 2019) combine the parameter server architecture (for sparse layers) and the ring-allreduce architecture (for dense layers), while they do not focus on heterogeneous computing resources. AIBox (Zhao et al., 2019) and BytePS (Jiang et al., 2020b) can exploit CPUs and GPUs, while they use heuristics to statically schedule tasks to CPUs or GPUs, which cannot take advantage of the elasticity of available computing resources.

Figure 2. The infrastructure architecture of Paddle-HeterPS. A coordinator is connected to each computing resource (worker). Each worker can be CPU-based, GPU-based, or XPU-based. XPU represents diverse types of processors optimized for the training process of DNN models. The training data is stored in a cluster, which is also connected to the workers and the coordinator.

2.3 Scheduling Method

Traditional distributed machine learning (Li et al., 2014; Gibiansky, 2017) is generally carried out with homogeneous computing resources, e.g., CPUs or GPUs. Heterogeneous computing resources are considered for the distributed training process of CTR models in (Zhao et al., 2019), which exploits a simple heuristic scheduling method that directly schedules the first layer to CPUs and the other layers to GPUs. Greedy methods (Shi et al., 2020) can be utilized to minimize the energy cost for federated edge learning in CPU-GPU heterogeneous computing environments, but they may fall into a local optimum, corresponding to a high cost. Other heuristics, e.g., Genetic (Barika et al., 2019), can be directly applied to perform the scheduling process, which is similar to Greedy and corresponds to low performance, i.e., relatively high cost. Bayesian optimization can be used to generate good scheduling plans, while it may add much randomness to the scheduling process, and the corresponding cost may be high in some corner cases. In our work, we distinguish data-intensive layers and compute-intensive layers based on profiling results measured from the execution of the forward and backward processing of each layer, and we schedule each layer to proper computing resources in order to reduce the monetary cost while meeting the throughput limit within heterogeneous environments.

Figure 3. The functional architecture of Paddle-HeterPS: distributed training, scheduling, and data management.

3 SYSTEM DESIGN

In this section, we present the system design of Paddle-HeterPS. We first present the system architecture that enables distributed deep learning with heterogeneous computing resources. Then, we present the functionality of each module in Paddle-HeterPS, including the distributed training and data management methods.

3.1 System Architecture

As shown in Figure 2, the infrastructure architecture of Paddle-HeterPS consists of four parts, i.e., the training data cluster, the coordinator, CPU workers, and GPU/XPU workers. The training data is stored in the training data cluster, e.g., an HDFS cluster (Borthakur et al., 2008). There are three types of computing resources, i.e., CPU workers, parameter servers, and GPU/XPU workers. XPU represents diverse types of processors optimized for the training process of DNN models, e.g., Kunlun AI processors (Ouyang et al., 2020). A coordinator is connected to each worker and to the training data cluster.

As shown in Figure 3, the functional architecture of Paddle-HeterPS consists of three modules: distributed training, scheduling, and data management. The distributed training module performs the forward propagation, backward propagation, and model update. The scheduling module schedules each layer to a proper type of computing resources. The data management module manages the data storage and data communication. This module exploits the host memory, GPU memory, and SSDs to store the data during the training process. The distributed training module combines the parameter server architecture and the ring-allreduce architecture to fully exploit heterogeneous computing resources. The distributed training module utilizes the scheduling module to place each layer into proper computing resources and the data management module to interact with storage resources or to manage data communication among multiple computing resources. All three modules are implemented in the coordinator, while the training module and the data management module are implemented in each worker.

The distributed training module exploits the existing DNN training framework, i.e., PaddlePaddle (Ma et al., 2019), to perform the forward propagation and the backward propagation. Please note that PaddlePaddle is an industrial platform to train DNN models, while Paddle-HeterPS is an architecture within PaddlePaddle for the training process of DNN models in heterogeneous environments. During the training process, the CPU workers exploit the parameter server architecture, while GPU/XPU workers of the same type take advantage of the ring-allreduce architecture, which corresponds to smaller data transfer and a balanced workload (Kim et al., 2019). In addition, we parallelize the computation and the data communication during the training process. For instance, while a computing resource is performing the forward propagation or backward propagation of a batch of training data, the computed gradients or parameters of another batch of training data can be transferred to other computing resources at the same time.

The scheduling module dynamically schedules each layer to proper computing resources based on profiled information and a cost model. To satisfy a predefined throughput constraint, the scheduling module dynamically generates a provisioning plan and a scheduling plan (see details in Section 5), while reducing the total monetary cost based on a cost model (see details in Section 4).

The data management module manages both data storage and data communication. As it takes much time to transfer huge amounts of training data from the training data cluster to CPU workers, Paddle-HeterPS prefetches some input training data and caches it in the memory of CPU workers. When the storage space of CPU workers is limited, the data can also be cached in SSDs or hard disks of CPU workers. During the training process, a monitor counts the access (read or write) frequency of each parameter. If the access frequency is high, the monitor marks the parameters as hot parameters, and the data management module dynamically moves them into high-speed storage devices (Liu et al., 2018a), e.g., the host memory of CPU workers or the GPU memory of GPU workers. Otherwise, the monitor marks the parameters as cold parameters, and the data management module puts them into SSDs or normal hard disks. To accelerate the data communication among multiple workers, the data management module dynamically aggregates the data to send in order to reduce the overhead of the data communication. Please note that we also exploit data compression during the data communication in the data management module.
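The following sketch illustrates the hot/cold placement idea just described: a monitor counts parameter accesses and maps frequently accessed parameters to fast storage tiers. The threshold, tier names, parameter identifiers, and class are hypothetical, not the actual data management module.

```python
from collections import Counter

# Minimal sketch of access-frequency-based hot/cold parameter placement.
class ParameterPlacementMonitor:
    def __init__(self, hot_threshold=100):
        self.access_counts = Counter()
        self.hot_threshold = hot_threshold  # accesses per monitoring window (assumed)

    def record_access(self, param_id):
        self.access_counts[param_id] += 1

    def placement(self, param_id):
        # Hot parameters go to fast tiers (GPU/host memory), cold ones to SSD/disk.
        if self.access_counts[param_id] >= self.hot_threshold:
            return "gpu_or_host_memory"
        return "ssd_or_disk"

monitor = ParameterPlacementMonitor(hot_threshold=3)
for pid in ["emb_row_42", "emb_row_42", "emb_row_42", "fc1_w"]:
    monitor.record_access(pid)
print(monitor.placement("emb_row_42"), monitor.placement("fc1_w"))
```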

4 PROBLEM FORMULATION

In this section, we present a cost model to estimate the throughput and the monetary cost of carrying out the distributed training. Then, we formally formulate the scheduling problem to address.

4.1 Cost Model

We propose a cost model to estimate the throughput and the monetary cost of training a DNN model with a specific scheduling plan and provisioning plan. The cost model is implemented in the scheduling module and targets a specific execution environment, e.g., with distributed heterogeneous computing resources (CPUs and GPUs).

Let us assume that a DNN model is partitioned into K stages ({S_1, S_2, ..., S_K}), each of which is composed of sequential layers scheduled to the same type of computing resources based on a scheduling plan. The computing resources are set up based on a provisioning plan. We assume that the computing resources are homogeneous within a stage, while the computing resources can be heterogeneous across different stages. In addition, we assume that we have the profiling information of each Stage S_i on a single unit of computing resource (one CPU core or one GPU) with a small batch size B_o, e.g., the Original Computation Time (OCT_i) and the Original Time for Data Communication (ODT_i) from the current stage to the next stage, where i represents the index of the stage. The batch size refers to the number of training examples utilized in one batch during the training, and one batch of data is processed in each iteration. The whole training process is composed of multiple epochs, and each epoch contains a certain fixed number of batches. For instance, when we have M training examples in total and the batch size is B, the number of batches for each epoch is M / B. The profiling information can be easily obtained by launching the training process on a single server with limited resources. Then, for each stage, we can estimate the computation time (CT_i) and the data communication time (DT_i) of Stage S_i for a single iteration with k_i computing resources using the following formulas based on Amdahl's law (Sun & Chen, 2010):

CT_i = (OCT_i / B_o) * (1 − α_i + α_i / k_i),    (2)

DT_i = (ODT_i / B_o) * (1 − β_i + β_i / k_i),    (3)

where α_i and β_i represent the parts of the workload that can be parallelized for computation and data communication, respectively. These two parameters can be obtained using different numbers of units of computing resources and the corresponding execution times, as presented in (Liu et al., 2016).


Table 1. Summary of Main Notations

Notation                 Definition
K; {S_1, S_2, ..., S_K}  The number of stages; the set of stages
OCT; ODT                 The original computation time; the original data communication time
B; B_o                   The batch size; the batch size of a small batch used to measure OCT and ODT
CT; DT                   The computation time; the data communication time
k_i                      The number of computing resources of Stage i
α_i; β_i                 The parts of the workload that can be parallelized for computation and data communication
ET_i                     The execution time of Stage i
L                        The number of epochs
Cost                     The monetary cost of the training process
p_t                      The cost to use a computing resource of Type t
l; a_l; L                The index of a layer; the scheduling action of Layer l; the number of all the layers
T                        The number of computing resource types
G                        The total number of the gradients of weights used to calculate the expected value

Then, we can estimate the execution time (ET_i) of Stage S_i while assuming that the computation and the data communication can be overlapped, as shown in the following formula:

ET_i = max{CT_i, DT_i}.    (4)

Then, the throughput of the stage is:

Throughput_i = B / ET_i,    (5)

where B is the batch size, i.e., the total number of examples used in one iteration. As the training process is realized using a pipeline method, the overall throughput is limited by the smallest stage throughput. The overall throughput of the training process can be estimated as:

Throughput = min_{i ∈ {1, 2, ..., K}} Throughput_i.    (6)

Finally, the total execution time (ET) of the training process, when the number of epochs is L, is:

ET = L * M / Throughput.    (7)

In addition, the monetary cost of the training process can be estimated as:

Cost = ET * Σ_{t=1}^{T} p_t * k_t,    (8)

where p_t represents the cost to use a computing resource of Type t, which equals the price of the computing resource, e.g., the price of using a V100 GPU per minute; k_t represents the number of computing resources of Type t; and T is the number of computing resource types. p_t and k_t are specified in the scheduling plan SP. The notations are summarized in Table 1.
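A minimal sketch of the cost model (Formulas 2-8) is given below, assuming per-stage profiling values OCT_i and ODT_i measured with batch size B_o, and assuming for simplicity that each stage runs on its own resource type; all numbers are illustrative, not measured values.

```python
# Minimal sketch of the throughput and monetary cost estimation (Formulas 2-8).
def stage_execution_time(oct_i, odt_i, alpha_i, beta_i, k_i, b_o):
    ct = (oct_i / b_o) * (1 - alpha_i + alpha_i / k_i)   # Formula 2
    dt = (odt_i / b_o) * (1 - beta_i + beta_i / k_i)     # Formula 3
    return max(ct, dt)                                   # Formula 4: ET_i

def training_cost(stages, batch_size, num_epochs, num_examples, prices, counts, b_o):
    # stages: per-stage (OCT_i, ODT_i, alpha_i, beta_i); prices/counts: per stage's type.
    ets = [stage_execution_time(*s, k, b_o) for s, k in zip(stages, counts)]
    throughput = min(batch_size / et for et in ets)      # Formulas 5-6
    total_time = num_epochs * num_examples / throughput  # Formula 7
    return total_time * sum(p * k for p, k in zip(prices, counts))  # Formula 8

stages = [(2.0, 0.5, 0.9, 0.6), (4.0, 1.0, 0.95, 0.7)]   # illustrative profiling values
print(training_cost(stages, batch_size=512, num_epochs=2, num_examples=10_000,
                    prices=[0.04, 2.42], counts=[8, 2], b_o=64))
```

In the real system, OCT_i and ODT_i would come from profiling a single unit of each resource type with a small batch, as described above.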

4.2 Problem Formulation

In this section, we formulate the problem of scheduling each layer to a type of computing resource, e.g., CPUs or a type of GPU, while minimizing the monetary cost and satisfying the throughput constraint. The scheduling process generates a scheduling plan, which is a mapping between a layer and a type of computing resource. The scheduling plan is composed of a matrix of decision variables, and each decision variable is defined as follows:

Schedule(l, t) = { 1  if Layer l is scheduled to Type t,
                   0  otherwise.    (9)

A layer can only be scheduled to one type of computing resource, e.g., CPUs or a type of GPU. The consecutive layers that are scheduled to the same type of computing resources constitute a stage. When there are T types of computing resources, each layer can be scheduled to one of T types. The problem we address in this paper is defined as:

min_{SP ∈ S} Cost(SP)    (10)

s.t.  Throughput(SP) > Throughput_limit,
      N_t(SP) ≤ N_{t,limit},    (11)

where Throughput_limit represents the minimum throughput constraint, and N_{t,limit} represents the maximum number of computing resources of Type t. The size of the search space of the scheduling plan is T^L, with L representing the number of layers in a DNN model. It is obvious that the search space increases exponentially with L. The scheduling problem is a typical NP-hard problem (Du & Leung, 1989; Liu et al., 2020; 2016).
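To make the T^L search space concrete, the sketch below enumerates every layer-to-type assignment and keeps the cheapest plan that satisfies the throughput constraint. The cost and throughput functions are illustrative stand-ins for the cost model of Section 4.1, and this brute-force enumeration is only practical for small L.

```python
from itertools import product

# Minimal sketch of the T^L scheduling search space (Formulas 9-11).
NUM_TYPES, NUM_LAYERS = 2, 4          # T resource types, L layers
THROUGHPUT_LIMIT = 5.0

def throughput(plan):                  # stand-in estimator, not Formula 6
    return 4.0 + 2.0 * sum(plan) / len(plan)

def cost(plan):                        # stand-in estimator: type 1 is pricier than type 0
    return sum(1.0 + 2.0 * t for t in plan)

best_plan, best_cost = None, float("inf")
for plan in product(range(NUM_TYPES), repeat=NUM_LAYERS):   # T^L candidate plans
    if throughput(plan) > THROUGHPUT_LIMIT and cost(plan) < best_cost:
        best_plan, best_cost = plan, cost(plan)
print(best_plan, best_cost)
```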


5 REINFORCEMENT LEARNING-BASED SCHEDULING

In this section, we first present a method to generate a provisioning plan for each stage to achieve load balancing with a given scheduling plan. Then, we present a reinforcement learning-based scheduling method to generate a scheduling plan.

5.1 Provision for Load Balancing

To fully exploit the computing resources of each stage, we try to achieve load balancing for the execution among multiple stages. We choose a proper number of computing resources for each stage, so that the execution time of a single iteration of each stage is similar. When the batch size is small, the execution is communication-intensive, and the execution time of a stage equals the communication time of the data. When the batch size is big enough, the execution is computation-intensive, i.e., the execution time of a stage equals the computation time.

Let us illustrate the method to achieve load balancing given a scheduling plan SP for the case when the batch size is big. In this case, we assume that we can find {k_1, k_2, ..., k_K} such that the throughput of each stage is the same. Then, we have:

Throughput_i = Throughput_1, ∀i ∈ {2, 3, ..., K}.    (12)

Based on Formulas 2, 4, and 5, we can find the relationship between k_i and k_1, as shown in the following formula:

k_i = α_i / ((OCT_1 / OCT_i) * (1 − α_1 + α_1 / k_1) − (1 − α_i)).    (13)

Based on the constraints defined in Formula 11, we have:

k_1 > min{ α_1 * OCT_1 / (Throughput_limit * B_o − (1 − α_1) * OCT_1),
           β_1 * ODT_1 / (Throughput_limit * B_o − (1 − β_1) * ODT_1) },    (14)

for the throughput constraint.

In practice, we find that the monetary cost increases when k_1 increases beyond the value calculated based on Formula 14. k_1 also has an upper limit because the number of computing resources of Type 1 is limited, as defined in Formula 11. We use the Newton method (Hansen, 1978) to find the value of k_1 that minimizes the monetary cost of the training defined in Formula 10 while satisfying Formula 11. Then, the value of k_i for each i ∈ {2, ..., K} can be calculated using Formula 13. In addition, as we exploit the parameter server architecture for the distributed learning, we add an appropriate number of CPU cores to perform the functionality of parameter servers, which is based on historical profiling results. Finally, we obtain a provisioning plan for the training with a specific scheduling plan.
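The sketch below illustrates the provisioning step under the stated assumptions: it computes a lower bound for k_1 from Formula 14 and derives the remaining k_i from Formula 13. The profiled values are illustrative, and using the Formula 14 bound directly stands in for the paper's Newton-based search for k_1.

```python
import math

# Minimal sketch of load-balancing provisioning (Formulas 13-14) for two stages.
def k_i_from_k1(alpha_i, oct_i, alpha_1, oct_1, k_1):
    # Formula 13: choose k_i so that Throughput_i matches Throughput_1.
    denom = (oct_1 / oct_i) * (1 - alpha_1 + alpha_1 / k_1) - (1 - alpha_i)
    return math.ceil(alpha_i / denom)

def k1_lower_bound(alpha_1, oct_1, beta_1, odt_1, thr_limit, b_o):
    # Formula 14: bound on k_1 implied by the throughput constraint.
    c = alpha_1 * oct_1 / (thr_limit * b_o - (1 - alpha_1) * oct_1)
    d = beta_1 * odt_1 / (thr_limit * b_o - (1 - beta_1) * odt_1)
    return math.ceil(min(c, d))

oct_, odt_, alpha, beta = [2.0, 4.0], [0.5, 1.0], [0.9, 0.95], [0.6, 0.7]  # illustrative
k_1 = max(1, k1_lower_bound(alpha[0], oct_[0], beta[0], odt_[0], thr_limit=50.0, b_o=64))
plan = [k_1] + [k_i_from_k1(alpha[i], oct_[i], alpha[0], oct_[0], k_1) for i in range(1, 2)]
print(plan)  # number of computing resources per stage
```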

5.2 Reinforcement Learning Based Scheduling

In this section, we present our reinforcement learning (Williams, 1992a)-based scheduling method. As it can well capture the influence of the scheduling decisions of different layers, we use a Long Short-Term Memory (LSTM) model (Greff et al., 2017) to generate a scheduling plan. Then, we present the reinforcement learning-based method to train the LSTM model.

As shown in Figure 4, we build an LSTM model with a length equal to the number of layers. Each LSTM cell takes the input features of one layer and generates the scheduling decision (action) for that layer. In the LSTM model, we take the index of the layer as the time step so that the model can well capture the influence of the decision actions of diverse layers. The input of each cell consists of the following five features of each layer: (1) the index of the layer (one-hot encoding); (2) the type of the layer, e.g., fully-connected, embedding, etc. (one-hot encoding); (3) the size of the input data of the layer (floating point); (4) the size of the weights of the layer (floating point); (5) the data communication time (floating point). The output of each LSTM cell is a vector of T dimensions, i.e., the number of types of computing resources. Then we apply the softmax function on the output vector to get the probability of scheduling the layer to each type of computing resources. Finally, we generate a scheduling decision by choosing, for each layer, the type of computing resources that corresponds to the highest probability according to the LSTM model.
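The sketch below illustrates the per-layer input features and the softmax output over T resource types. A single linear scoring layer stands in for the LSTM cell, and the layer list, sizes, and weights are illustrative assumptions rather than the actual model.

```python
import numpy as np

# Minimal sketch of the per-layer features (five inputs) and the T-way softmax output.
LAYER_TYPES = ["embedding", "fully_connected", "lstm", "output"]
T = 3  # number of computing resource types (e.g., CPUs and two GPU types)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def layer_features(layer_index, num_layers, layer_type, input_size, weight_size, comm_time):
    return np.concatenate([
        one_hot(layer_index, num_layers),                          # 1. layer index (one-hot)
        one_hot(LAYER_TYPES.index(layer_type), len(LAYER_TYPES)),  # 2. layer type (one-hot)
        [input_size, weight_size, comm_time],                      # 3-5. sizes and comm time
    ])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
layers = [("embedding", 1e4, 1e6, 0.8), ("fully_connected", 1e3, 1e5, 0.1)]  # illustrative
W = rng.normal(scale=0.1, size=(T, len(layers) + len(LAYER_TYPES) + 3))      # linear stand-in
for i, (ltype, in_sz, w_sz, comm) in enumerate(layers):
    probs = softmax(W @ layer_features(i, len(layers), ltype, in_sz, w_sz, comm))
    print(i, ltype, "->", int(np.argmax(probs)), probs.round(3))  # chosen resource type
```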

We train the LSTM model based on the rules defined in (Williams, 1992b):

∇_θ R(θ) = E_{P(a_{1:L}; θ)} [ ( Σ_{l=1}^{L} ∇_θ log P(a_l | a_{(l−1):1}; θ) ) R_i ],    (15)

where ∇_θ R(θ) represents the gradients of the weights of the LSTM model, θ is the set of parameters in the LSTM model, R_i is the reward at Iteration i, a_l is the scheduling action of Layer l, and P(a_l | a_{(l−1):1}; θ) is the value obtained from the softmax function of each layer. We take the monetary cost of the i-th iteration of the training as the reward, which can be calculated based on Formula 8. To approximate the above gradient, we utilize G gradients of the weights to calculate the expected value:

∇_θ R(θ) ≈ (1 / G) Σ_{g=1}^{G} ( Σ_{l=1}^{L} ∇_θ log P(a_l | a_{(l−1):1}; θ) ) R_i,    (16)

where G equals the product of the number of iterations and the batch size.


Figure 4. The architecture of the LSTM model. "i" represents the index of the layer, "t" represents the type of the layer, "s" represents the size of the input data of the layer, "w" represents the size of the weights of the layer, and "c" represents the data communication time. P_{l,t} represents the probability of scheduling Layer l to Type t.

Algorithm 1 Reinforcement Learning-Based Training
Input:
  N: the number of scheduling plans used to train the network for each round;
  I: the maximum round of the training process;
  γ: the rate to update the baseline value;
Output:
  θ: the parameters of the LSTM model;
1: θ ← randomly initialize the LSTM model, b ← 0
2: for i ∈ {1, 2, ..., I} do
3:   V ← generate a set of N scheduling plans
4:   for SP ∈ V do
5:     R_n ← Cost(SP)
6:   end for
7:   Update θ according to Formula 18
8:   b ← (1 − γ) * b + (γ / N) * Σ_{n=1}^{N} R_n
9: end for

Although the original approximation in Eq. 16 is unbiased, it has a very high variance. Inspired by (Zoph & Le, 2016), we alleviate this issue by subtracting a baseline value, denoted as b, from the reward. We take the moving average of all mean rewards from the batches before the current batch as the baseline value and obtain the following formula:

∇_θ R(θ) ≈ (1 / G) Σ_{g=1}^{G} ( Σ_{l=1}^{L} ∇_θ log P(a_l | a_{(l−1):1}; θ) ) (R_i − b).    (17)

Then, we update the parameters of the LSTM model using the following formula:

θ' = θ + η * ∇_θ R(θ),    (18)

where η represents the learning rate.

The training process of the LSTM model based on reinforcement learning is shown in Algorithm 1, which can be used in both the pre-training process and the distributed training process. First, the model is initialized (Line 1). Over I rounds, the LSTM model is updated (Line 7) using the calculated rewards (Line 5) of the generated scheduling plans (Line 3). Within the pre-training process, the scheduling plans are randomly generated in Line 3, and the monetary cost is calculated based on Formula 8. Within the distributed training process, the scheduling plans are generated based on the updated LSTM model, and the monetary cost is calculated based on Formula 8 with the real throughput. Afterward, the baseline value is updated with the calculated rewards (Line 8).
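The following sketch illustrates the REINFORCE-style update with a moving-average baseline (Formulas 16-18 and Algorithm 1). A per-layer softmax over a score matrix stands in for the LSTM policy, the cost function is an illustrative stand-in, and the negative cost is used as the reward so that gradient ascent reduces cost (an assumption; the paper only states that the monetary cost is taken as the reward).

```python
import numpy as np

# Minimal sketch of policy-gradient training with a moving-average baseline.
rng = np.random.default_rng(0)
NUM_LAYERS, NUM_TYPES = 4, 2
theta = np.zeros((NUM_LAYERS, NUM_TYPES))     # policy parameters (stand-in for the LSTM)
baseline, gamma, lr, N = 0.0, 0.1, 0.5, 8     # b, update rate, eta, plans per round

def plan_cost(plan):                          # illustrative cost: type 1 is expensive
    return float(sum(plan)) + 1.0

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for round_idx in range(50):                   # I rounds
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    rewards = []
    for _ in range(N):                        # sample N scheduling plans
        plan = [rng.choice(NUM_TYPES, p=probs[l]) for l in range(NUM_LAYERS)]
        reward = -plan_cost(plan)             # lower cost => higher reward (assumed sign)
        rewards.append(reward)
        for l, a in enumerate(plan):          # grad of log-prob of the sampled action
            grad[l] += (np.eye(NUM_TYPES)[a] - probs[l]) * (reward - baseline)
    theta += lr * grad / N                    # Formula 18 (ascent on expected reward)
    baseline = (1 - gamma) * baseline + gamma * np.mean(rewards)  # Algorithm 1, Line 8
print(softmax(theta).round(2))                # probabilities should favor the cheaper type
```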

6 EXPERIMENTAL EVALUATION

In this section, we present experimental results to show the advantages of Paddle-HeterPS, including the provisioning method, the reinforcement learning-based scheduling method, and the throughput of the Paddle-HeterPS framework. First, we present a comparison to show that our proposed provisioning method outperforms existing provisioning methods. Then, we present the experimental results based on simulation and real execution corresponding to diverse scheduling methods. Please note that the aforementioned comparison is carried out within Paddle-HeterPS using diverse approaches. Finally, we present the comparison of Paddle-HeterPS and TensorFlow (Abadi et al., 2015) in terms of throughput. We implemented Paddle-HeterPS using C++ and Python, and the code will be available upon acceptance.

Within the experiments presented in this section, each CPU server has 2 Intel Gold 6271C CPUs, and each CPU contains 24 physical cores. Two NVMe SSDs are attached to each CPU server to support data loading. Each GPU server has the same CPU configuration as that of the CPU servers, while it has 8 Nvidia Tesla V100 GPUs. Both GPU servers and CPU servers are equipped with 512 GB of memory. All servers are connected with an InfiniBand NIC with 100 Gbps bandwidth. The price of each CPU core is 0.04 United States Dollars (USD) per hour, and the price of each GPU card is 2.42 USD per hour.

6.1 Computing Resource Provisioning

In this section, we present the experimental results by comparing the provisioning plan generated based on our method and static methods, i.e., StaRatio and StaPSRatio. StaRatio represents a static ratio between the number of GPU cards and the number of CPU cores of 1:6, which is the default ratio within a single server. StaPSRatio refers to the ratio that takes parameter servers into consideration, i.e., number of GPU cards : number of CPU cores for training : number of CPU cores for parameter servers = 1:6:6, which adds extra CPU cores for parameter servers based on heuristic experience. We obtain the numbers of computing resources based on the provisioning method detailed in Section 5. In addition, we use the RL scheduling method to perform the scheduling process. In this experimentation, we take a CTR model as a use case.

Figure 5. The comparison of different methods in terms of monetary cost. The unit of the monetary cost is USD.

As shown in Figure 5, we can see that our provisioning method significantly outperforms the heuristics in terms of cost, i.e., StaRatio (up to 57.9%) and StaPSRatio (up to 48.3%). This result is expected, as our method tries to achieve load balancing among multiple stages and chooses an appropriate number of computing resources for the distributed training process. In addition, we find that StaPSRatio outperforms StaRatio (up to 55.8%) as it takes into consideration the number of CPU cores for parameter servers.

6.2 Layer Scheduling

In this section, we present the experimentation to show the advantages of our proposed scheduling method, i.e., the Reinforcement Learning-based method (RL). We compare RL with seven baseline methods, i.e., Brute Force (BF), the Bayesian Optimization-based method (BO) (Dolatnia et al., 2016), CPU, GPU, Genetic (Barika et al., 2019), Greedy (Shi et al., 2020), and Heuristic (Zhao et al., 2019). CPU and GPU represent that the execution of all the layers is carried out on CPUs and GPUs, respectively. Heuristic represents that the execution of the first layer is carried out on CPUs and the rest on GPUs. We take the V100 GPU with different prices to simulate multiple types of GPUs in this section.

We focus on four types of models: MATCHNET (16), CTRDNN (16), 2EMB (10), and NCE (5), with the number of layers in parentheses. Although MATCHNET and CTRDNN have the same number of layers, MATCHNET is more complex than CTRDNN because of its diverse types of layers.

Table 2. Scheduling time corresponding to Brute Force (BF) and the Reinforcement Learning-based method (RL). The number in parentheses represents the number of GPU types; we consider the combination of CPUs and diverse types of GPUs. "E" represents estimated time because of the long scheduling time. As the scheduling times of RL for different numbers of GPU types are similar, we only present the average scheduling time of RL with 2 and 4 GPU types. "/" represents that the scheduling time is too big to be compared. The time unit is seconds.

Number of layers   BF(2)     BF(4)             RL
8                  0.068     11.37             15.91
12                 0.71      3776.82           16.83
16                 11.74     1254017.23 (E)    17.16
20                 217.26    /                 19.13

Figure 6. The cost corresponding to each scheduling method. The cost is normalized by multiplying by a constant value for the sake of easy comparison.

First, we modify the structure of CTRDNN by removing or adding Full Connection (FC) layers to simulate networks of 8, 12, 16, and 20 layers. We use a Brute Force (BF) method and the RL method to carry out the scheduling process. We find that the costs of BF and RL are the same, while the scheduling time of BF is much longer than that of RL when the number of layers is big, as shown in Table 2. Although it can generate the optimal scheduling plans, BF corresponds to an extremely long scheduling time when the number of layers is big, which is not practical. In addition, we find that the scheduling plans generated by the RL method are the same as the optimal plans generated by BF.

We calculate the cost based on the simulation with the cost model presented in Section 4.1 and on the throughput and cost measured from the real execution. The costs corresponding to diverse scheduling methods while ensuring the throughput are shown in Figure 6. In terms of the cost, RL significantly outperforms the baseline methods, i.e., BO (up to 27.9%), Genetic (up to 289.3%), Greedy (up to 291.4%), GPU (up to 304.2%), CPU (up to 4137.3%), and Heuristic (up to 312.3%). This is reasonable as RL can learn more information than BO, which outperforms the other heuristics in most cases. In addition, although the price of CPU is much lower than that of GPU, we find that the cost of using CPU for the training is higher than that of GPU in the simulation. Please note that "cost" is calculated based on the price of CPU or GPU and the execution time, which is different from "price" (the amount of money for using a computing resource within a time unit). CPU corresponds to bigger numbers of servers to achieve a similar throughput as that of the other scheduling methods, which incurs high cost. In addition, Figure 7 shows the normalized throughput of all the scheduling methods, which reflects that all the scheduling methods can meet the throughput constraint.

Figure 7. The normalized throughput corresponding to each scheduling method. The throughput is normalized by dividing by the throughput limit for the sake of easy comparison.

Figure 8. The cost corresponding to diverse models. The cost is normalized by multiplying by a constant value for the sake of easy comparison.

Furthermore, Figure 8 shows the cost of training multiple models with diverse methods. RL significantly outperforms the baseline methods, i.e., BO (up to 38.1%), Genetic (up to 6.2%), Greedy (up to 29.3%), GPU (up to 229.0%), and Heuristic (up to 57.4%). Although it performs as well as RL for NCE and 2EMB, BO corresponds to a high cost for CTRDNN and has a big variance, which is incurred by the randomness within the sampling process of BO. In addition, as the structure of CTRDNN is more complex than those of the other networks, it is hard for BO to find a good scheduling plan. Figure 9 demonstrates the throughput for multiple models with different methods. The throughput of all cases meets the constraint, except for the case of CPU for CTRDNN. This is because the number of CPU servers is restricted by a limit, i.e., there are not enough CPU servers for the training process of CPU to achieve the throughput limit.

Figure 9. The throughput corresponding to diverse models. The throughput is normalized by dividing by the throughput limit for the sake of easy comparison.

Figure 10. The cost corresponding to diverse models and scheduling methods based on real execution. The unit of the monetary cost is USD.

Finally, we carry out real execution using CPUs and GPUs. We get the same results, i.e., RL significantly outperforms BO (up to 197.8%), Genetic (13.46%), Greedy (159.2%), CPU (336%), GPU (1044%), and Heuristic (159.2%), as shown in Figure 10. The cost corresponding to CPU significantly differs from the simulated results. After analysis, we attribute the difference to the overhead brought by small batches within the CPU environments in the profiling information.

6.3 Throughput

To show the efficiency of Paddle-HeterPS, we compare the throughput of Paddle-HeterPS with that of TensorFlow (Abadi et al., 2015). We execute a CTRDNN of 7 layers, which corresponds to two models of different dimensions. We take "CTRDNN1" to denote the low-dimension one and "CTRDNN2" to denote the high-dimension one. We take 4 CPU servers and 4 GPU servers while using the RL scheduling method to carry out the scheduling process. We conduct the training process of CTRDNN1 with TensorFlow of the CPU configuration (TF-CPU), Paddle-HeterPS with the CPU scheduling method (HeterPS-CPU), and Paddle-HeterPS using both CPUs and GPUs (HeterPS1), and that of CTRDNN2 with TensorFlow of the GPU configuration (TF-GPU), Paddle-HeterPS with the GPU scheduling method (HeterPS-GPU), and Paddle-HeterPS using both CPUs and GPUs (HeterPS2). Figure 11 shows that HeterPS-CPU (9.5 times), HeterPS-GPU (3.8 times), HeterPS1 (up to 14.5 times), and HeterPS2 (6.9 times) significantly outperform TensorFlow with the CPU and GPU configurations, respectively, in terms of throughput.

Figure 11. The throughput corresponding to TF-CPU, TF-GPU, HeterPS-CPU, HeterPS-GPU, and HeterPS. CPU represents the experiments with CTRDNN1, and GPU represents those with CTRDNN2.

7 CONCLUSION

To efficiently train a DNN model using heterogeneous computing resources, we propose Paddle-HeterPS. The functional architecture of Paddle-HeterPS consists of three modules, i.e., distributed training, scheduling, and data management. We exploit the parameter server architecture to handle the distributed machine learning. We utilize multiple data management methods to improve efficiency. In addition, we propose a reinforcement learning-based scheduling method to schedule each layer to a proper type of computing resource to minimize the cost while ensuring the throughput limit, based on a provisioning method. We carried out extensive experimentation, which demonstrates the advantages of our framework. The experimental results show that the provisioning method can be up to 57.9% better than the baseline methods, the scheduling method can significantly outperform state-of-the-art methods (up to 312.3% in terms of the monetary cost), and the throughput of the framework is up to 14.5 times higher than that of TensorFlow.

In the future, we plan to extend Paddle-HeterPS to multiple data centers to process decentralized data. As the privacy and security of some sensitive data are of much importance, some data may be stored and accessed in a specific data center. While Paddle-HeterPS is compatible with vertical federated learning (Liu et al., 2021), we can optimize the system architecture and the scheduling method of Paddle-HeterPS to efficiently train a big model while ensuring the security and privacy of the decentralized data.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://github.com/tensorflow/tensorflow/tree/r1.15. Software available from tensorflow.org.

Barika, M., Garg, S., Chan, A., and Calheiros, R. Scheduling algorithms for efficient execution of stream workflow applications in multicloud environments. IEEE Transactions on Services Computing, 2019.

Borthakur, D. et al. HDFS architecture guide. Hadoop Apache Project, 53(1-13):2, 2008.

Chan, P. P., Hu, X., Zhao, L., Yeung, D. S., Liu, D., and Xiao, L. Convolutional neural networks based click-through rate prediction with multiple feature sequences. In IJCAI, pp. 2007–2013, 2018.

Chen, J., Sun, B., Li, H., Lu, H., and Hua, X.-S. Deep CTR prediction in display advertising. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 811–820, 2016.

Chen, Q. and Li, D. Improved CTR prediction algorithm based on LSTM and attention. In Proceedings of the 5th International Conference on Control Engineering and Artificial Intelligence, pp. 122–125, 2021.

Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In European Conference on Computer Systems, pp. 1–16, 2016.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A. W., Tucker, P. A., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Conference on Neural Information Processing Systems (NeurIPS), pp. 1232–1240, 2012.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186, 2019.

Dolatnia, N., Fern, A., and Fern, X. Z. Bayesian optimization with resource constraints and production. In International Conference on Automated Planning and Scheduling (ICAPS), pp. 115–123, 2016.

Du, J. and Leung, J. Y.-T. Complexity of scheduling parallel task systems. SIAM Journal on Discrete Mathematics, 2(4):473–487, 1989.

Fan, M., Guo, J., Zhu, S., Miao, S., Sun, M., and Li, P. MOBIUS: Towards the next generation of query-ad matching in Baidu's sponsored search. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2509–2517, 2019.

Fan, S., Rong, Y., Meng, C., Cao, Z., Wang, S., Zheng, Z., Wu, C., Long, G., Yang, J., Xia, L., et al. DAPPLE: A pipelined data parallel approach for training large models. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 431–445, 2021.

Gibiansky, A. Bringing HPC techniques to deep learning. https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/, 2017. Online; accessed 2020-08-12.

Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., and Schmidhuber, J. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.

Hansen, E. Interval forms of Newton's method. Computing, 20:153–163, 1978.

He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9, 2014.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Jiang, B., Deng, C., Yi, H., Hu, Z., Zhou, G., Zheng, Y., Huang, S., Guo, X., Wang, D., Song, Y., et al. XDL: an industrial deep learning framework for high-dimensional sparse data. In Int. Workshop on Deep Learning Practice for High-Dimensional Sparse Data, pp. 1–9, 2019.

Jiang, J., Xiao, P., Yu, L., Li, X., Cheng, J., Miao, X., Zhang, Z., and Cui, B. PSGraph: How Tencent trains extremely large-scale graphs with Spark? In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1549–1557. IEEE, 2020a.

Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., and Guo, C. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 463–479. USENIX Association, 2020b.

Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., and Guo, C. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 463–479, 2020c.

Kim, S., Yu, G.-I., Park, H., Cho, S., Jeong, E., Ha, H., Lee, S., Jeong, J. S., and Chun, B.-G. Parallax: Sparsity-aware data parallel training of deep neural networks. In EuroSys, pp. 1–15, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Int. Conf. on Learning Representations (ICLR), 2015.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598, 2014.

Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., Damania, P., and Chintala, S. PyTorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow., 13(12):3005–3018, August 2020. ISSN 2150-8097.

Liu, J., Pacitti, E., Valduriez, P., and Mattoso, M. Parallelization of scientific workflows in the cloud. PhD thesis, INRIA, 2014.

Liu, J., Pacitti, E., Valduriez, P., de Oliveira, D., and Mattoso, M. Multi-objective scheduling of scientific workflows in multisite clouds. Future Generation Computer Systems, 63:76–95, 2016.

Liu, J., Pineda, L., Pacitti, E., Costan, A., Valduriez, P., Antoniu, G., and Mattoso, M. Efficient scheduling of scientific workflows using hot metadata in a multisite cloud. IEEE Transactions on Knowledge and Data Engineering, 31(10):1940–1953, 2018a.

Liu, J., Huang, J., Zhou, Y., Li, X., Ji, S., Xiong, H., and Dou, D. From distributed machine learning to federated learning: A survey. arXiv preprint arXiv:2104.14362, 2021.

Liu, L., Yu, H., Sun, G., Luo, L., Jin, Q., and Luo, S. Job scheduling for distributed machine learning in optical WAN. Future Generation Computer Systems (FGCS), 112:549–560, 2020.

Liu, W., Tang, R., Li, J., Yu, J., Guo, H., He, X., and Zhang, S. Field-aware probabilistic embedding neural network for CTR prediction. In Proceedings of the 12th ACM Conference on Recommender Systems, pp. 412–416, 2018b.

Ma, Y., Yu, D., Wu, T., and Wang, H. PaddlePaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing, 1(1):105, 2019.

Mamidala, A. R., Kollias, G., Ward, C., and Artico, F. MXNet-MPI: Embedding MPI parallelism in parameter server task model for scaling deep learning. arXiv preprint arXiv:1801.03855, 2018.

Mundhenk, T. N., Konjevod, G., Sakla, W. A., and Boakye, K. A large contextual dataset for classification, detection and counting of cars with deep learning. In European Conference on Computer Vision, pp. 785–800. Springer, 2016.

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. PipeDream: generalized pipeline parallelism for DNN training. In ACM Symposium on Operating Systems Principles, pp. 1–15, 2019.

Ouyang, J., Noh, M., Wang, Y., Qi, W., Ma, Y., Gu, C., Kim, S., Hong, K.-i., Bae, W. K., Zhao, Z., Wang, J., Wu, P., Gong, X., Shi, J., Zhu, H., and Du, X. Baidu Kunlun: an AI processor for diversified workloads. In 2020 IEEE Hot Chips 32 Symposium (HCS), pp. 1–18, 2020.

Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pp. 521–530, 2007.

Shi, W., Zhou, S., and Niu, Z. Device scheduling with fast convergence for wireless federated learning. In IEEE Int. Conf. on Communications (ICC), pp. 1–6, 2020.

Sun, X. and Chen, Y. Reevaluating Amdahl's law in the multicore era. Journal of Parallel and Distributed Computing, 70(2):183–188, 2010.

Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H., and Wang, H. ERNIE 2.0: A continual pre-training framework for language understanding. In AAAI Conf. on Artificial Intelligence, pp. 8968–8975, 2020.

Szwabe, A., Misiorek, P., and Ciesielczyk, M. Logistic regression setup for RTB CTR estimation. In Proceedings of the 9th International Conference on Machine Learning and Computing, pp. 61–70, 2017.

Verbraeken, J., Wolting, M., Katzy, J., Kloppenburg, J., Verbelen, T., and Rellermeyer, J. S. A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2):1–33, 2020.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992a.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992b.

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., Stoica, I., et al. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.

Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., and Xing, E. P. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In USENIX Annual Technical Conference (USENIX ATC), pp. 181–193, 2017.

Zhao, W., Zhang, J., Xie, D., Qian, Y., Jia, R., and Li, P. AIBox: CTR prediction model training on a single node. In Zhu, W., Tao, D., Cheng, X., Cui, P., Rundensteiner, E. A., Carmel, D., He, Q., and Yu, J. X. (eds.), ACM International Conference on Information and Knowledge Management (CIKM), pp. 319–328. ACM, 2019.

Zhao, W., Xie, D., Jia, R., Qian, Y., Ding, R., Sun, M., and Li, P. Distributed hierarchical GPU parameter server for massive scale deep learning ads systems. In Dhillon, I. S., Papailiopoulos, D. S., and Sze, V. (eds.), Machine Learning and Systems (MLSys), 2020.

Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068, 2018.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.