

Towards Randomness in Learning Rate

Anonymous Authors

Abstract

Stochastic gradient methods play a crucial role in the optimization of deep neural networks, which is a highly non-convex problem. Therefore, the network is easily trapped in local minima and saddle points. To obtain a better model, substantial research has focused on introducing random noise directly or indirectly to the gradient update in stochastic gradient methods. It is well known that such randomness in the gradient update helps the model escape local minima or saddle points with higher probability. Unlike previous methods that focus on introducing randomness to gradients, we propose a simple yet effective strategy that introduces randomness by adding noise to the learning rate in each epoch during training, which comes at no extra computational cost. Extensive results indicate that our random learning rate strategy helps the model achieve higher accuracy and alleviates overfitting, reducing the gap between test accuracy and training accuracy. (Supplemental materials and code are available at https://github.com/ExpHub/ijcai2019 random learning rate/ .)

1 Introduction

The optimization of a deep neural network is a highly non-convex optimization problem [Kleinberg et al., 2018]. To tackle this notoriously difficult problem, a range of techniques that leverage random noise have received increasing attention, such as adding random noise to data preprocessing [Perez and Wang, 2017; Wu et al., 2015; DeVries and Taylor, 2017], weights [Kingma and Welling, 2013; Hinton et al., 2012; Huang et al., 2016], and gradients [Welling and Teh, 2011; Neelakantan et al., 2015; Jin et al., 2017]. These seemingly arbitrary techniques tend to achieve surprisingly satisfactory results. In fact, such methods essentially introduce random noise directly or indirectly into the gradient update $\eta \cdot \nabla_{\theta} L(\theta)$, which is the most critical operation of currently widely-used gradient-based optimization methods such as Stochastic Gradient Descent (SGD) and its variants, including Adam [Kingma and Ba, 2014] and RMSProp [Tieleman and Hinton, 2012].

Arguably, random noise 1) limits the amount of information flowing through the network, thus forcing the network to learn meaningful data representations [Hinton et al., 2012]; and 2) provides "exploration energy" for finding better optimization solutions during the gradient update [Neelakantan et al., 2015]. Essentially, such advantages of randomness are due to the regularization resulting from the noise, which enhances the model robustness and lowers the chance of the model being trapped in local minima or saddle points [Ge et al., 2015; Jin et al., 2017].

Another way to add random noise to the gradient update $\eta \cdot \nabla_{\theta} L(\theta)$ is to perturb the learning rate $\eta$, which, however, is rarely explored. Usually, research about the learning rate includes 1) designing effective learning rate schedules, such as cyclical learning rates [Smith, 2017] and warm restarts [Loshchilov and Hutter, 2016]; and 2) figuring out its role in the optimization of a deep learning model: for example, Nar and Sastry [2018] demonstrated that the learning rate determines to which subset of local minima or saddle points the model can converge.

Differing from previous research about the learning rate, in this paper we utilize the learning rate to perturb the gradient update $\eta \cdot \nabla_{\theta} L(\theta)$, in the hope that the model has a better chance to escape local minima or saddle points. This easy-to-implement method does not incur any extra computational cost. Specifically, we add Uniform noise to existing learning rate schedules, including the constant learning rate, step decay, and others. As shown in Figure 1, our random learning rate strategy achieves better performance and alleviates the overfitting problem compared to the baseline. In addition, we empirically test our method in various settings, including different models, optimizers, datasets, and tasks, which demonstrates the effectiveness of our random learning rate strategy.

2 Randomness in Deep Learning

Randomness pervades deep learning, including data preprocessing, weights, gradients, and learning rates [An, 1996]. In this section, we briefly summarize the techniques that incorporate random noise in deep learning.

2.1 Indirect Randomness in Gradient Update

Randomness in data preprocessing. Data augmentation [Perez and Wang, 2017] is a widely-used method to supplement insufficient data in deep learning for visual tasks; it randomly applies operations such as rotation, reflection, flipping, and zooming to the input visual data and thus leads to a more robust model [Wu et al., 2015]. Many state-of-the-art data augmentation techniques like Cutout [DeVries and Taylor, 2017] are used prevalently.

Figure 1: Existing learning rate schedules with random noise (RN). We add random noise to the later epochs of the training process, when the model is more likely to overfit [Mou et al., 2017]. (a*) and (b*) show the training and test accuracy on CIFAR-10 of step decay, CLR [Smith, 2017], and their counterparts with random noise. The learning rate schedules with random noise yield lower training accuracy but higher test accuracy, indicating the better regularization that the random noise brings.

Randomness in weights. Kingma and Welling [2013] added Gaussian noise to hidden layers; the noise destroys "excess information", forcing the network to learn compact representations of the data. Hinton et al. [2012] injected noise into the hidden layers by randomly dropping units from the neural network during training, which forces the network to overcome the internal "perturbations". Huang et al. [2016] extended the idea of dropout from dropping information at a per-unit level to a per-layer level, dropping entire layers randomly.

2.2 Direct Randomness in Gradient Update

Randomness in gradients. Welling and Teh [2011] injected Gaussian noise into the parameter updates so that the trajectory of the parameters converges to the full posterior distribution rather than just the maximum a posteriori mode. Neelakantan et al. [2015] showed that adding Gaussian noise to gradients alleviates overfitting and results in lower training loss. Besides, Jin et al. [2017] found that perturbed gradient descent benefits the training of deep neural networks owing to a higher chance of escaping saddle points.

Randomness in the learning rate. It is known that randomness in the learning rate is of critical importance, as it increases the chance that the model escapes local minima or saddle points. However, to date, most existing works focus on designing new learning rate schedules [Smith, 2017; Loshchilov and Hutter, 2016], neglecting the role the learning rate plays in perturbing the gradients. Recently, Blier et al. [2018] proposed Alrao, which applies random learning rates to different features and uses a set of parallel copies of the original classifiers; this is related to our work. However, 1) Alrao does not bring a conspicuous performance gain, while our method outperforms all the learning rate strategies we use, such as step decay; 2) Alrao is hard to implement, because it applies different random learning rates to different features of the model and incorporates model averaging, while our random learning rate strategy only entails a simple modification of existing learning rate schedules; 3) Alrao multiplies the number of parameters in the classification layer by the number of classifier copies, while our method introduces only two extra parameters (details in Section 3), to which the model is not sensitive.

3 Methods

3.1 Motivation

Typically, the optimization of a deep learning model can be described as

$$\theta = \theta - \eta \cdot \nabla_{\theta} L(\theta), \quad (1)$$

where $\theta$ denotes the parameters to be optimized, $L(\theta)$ is the loss function, $\nabla_{\theta} L(\theta)$ is the gradient of the loss, and $\eta$ is the learning rate.

Random noise is critical for the optimization of a deep neural network [Hochreiter and Schmidhuber, 1997; Chaudhari et al., 2016; Smith and Le, 2018; Keskar et al., 2017]. A larger variance of the gradient update $\eta \cdot \nabla_{\theta} L(\theta)$, within a valid range, generally contributes to the model robustness and thus leads to better optimization. For example, data augmentation [DeVries and Taylor, 2017; Tanner and Wong, 1987] and dropout [Hinton et al., 2012] bring noise indirectly to the gradient update during training; in addition, considerable work [Welling and Teh, 2011; Neelakantan et al., 2015; Jin et al., 2017] advocates random noise in the gradient $\nabla_{\theta} L(\theta)$. However, research on combining the learning rate with random noise is scarce. Therefore, we inject random noise into $\eta$, instead of the well-explored $\nabla_{\theta} L(\theta)$.

3.2 Why Random Noise Works

Random noise has been demonstrated to be an effective means of reaching a valid minimum in a large corpus of previous work. Ge et al. [2015] demonstrated that stochastic gradient descent ultimately escapes all saddle points in a polynomial number of iterations under the condition that the objective function satisfies the strict saddle property. Jin et al. [2017] reported that perturbed gradient descent escapes all saddle points with lower complexity than [Ge et al., 2015].

Figure 2: Step decay with random noise. The noise is only added to the later $\lambda S$ epochs of the entire training process and should be circumscribed within an appropriate range $[\eta_{\min}, \eta_{\max}]$.

Although the above methods improve the performance of the networks, they usually require high computational complexity and strict constraint conditions. In light of previous work on perturbed gradient descent, we propose a simple but effective method that introduces random noise to the learning rate without additional computational burden.

3.3 Adding Random Noise to the Learning Rate

A learning rate schedule generates a series of learning rates, one for each epoch. Denoting the original learning rate by $\eta$, we randomize the learning rate by multiplying $\eta$ by a number sampled from the standard Uniform or Gaussian distribution, where a scaling factor $\alpha$ controls the magnitude of the random noise. We choose multiplicative rather than additive noise because the weights in the model tend to become so large that additive noise would be insignificant; multiplicative noise does not suffer from this pathological problem and hence ensures the effectiveness of the noise. Empirically, we find that noise from a Uniform distribution and from a Gaussian distribution yields similar performance. For simplicity, in this paper we mainly consider the Uniform distribution. Therefore, the random learning rate can be described as

$$\eta' = \eta \times p \times \alpha, \quad p \sim U(0, 1), \quad (2)$$

where $\eta$ denotes the original learning rate, and $\eta'$ denotes the randomized learning rate obtained by multiplying $\eta$ by the noise $p$ sampled from a standard Uniform distribution $U(0, 1)$ and by the scaling factor $\alpha$.
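As a minimal illustration (our own sketch, not the released implementation), Eq. (2) amounts to a one-line transformation of whatever learning rate the underlying schedule produces:

import numpy as np

def randomize_lr(lr, alpha=8.0, rng=np.random):
    # Eq. (2): multiply the scheduled learning rate by p * alpha, with p ~ U(0, 1)
    p = rng.uniform(0.0, 1.0)
    return lr * p * alpha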

Traditional wisdom dictates that deep network practitioners should initialize the learning rate with a large value to accelerate the training process, and then decay the learning rate to a relatively small value to make the model converge. However, a small learning rate easily leads to a local minimum or a saddle point [Mou et al., 2017; Hochreiter and Schmidhuber, 1997]. Thus, we circumscribe the random noise to the later epochs, which is essentially a regularization technique to improve the generalization ability of the model.

As a result, our proposed learning rate can be formulated as

$$\eta' = \begin{cases} \eta, & e < (1 - \lambda) S \\ \eta \times p \times \alpha,\ p \sim U(0, 1), & e \ge (1 - \lambda) S, \end{cases} \quad (3)$$


Figure 3: Visualization of optimization paths on different test functions. The test function in (a1) and (a2) has a global minimum at $(-1, 0)$ and a local minimum at $(1, 0)$. The test function in (b1) and (b2) has one global minimum at $(0, 0)$ and multiple local minima in its neighborhood. The constant learning rate without and with random noise is used in (a1), (b1) and (a2), (b2), respectively. Compared to (a1) and (b1), (a2) and (b2) enjoy a better chance of converging to the global minima thanks to the noise, which encourages more aggressive exploration in the optimization process.

where $S$ denotes the total number of epochs in the training process, $e$ denotes the current epoch, and $\lambda$ represents the percentage of epochs in which random noise is introduced (i.e., the last $\lambda S$ epochs), as shown in Figure 2.
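As a concrete sketch of Eq. (3), the schedule below wraps the step decay of Section 4.2 in a Keras LearningRateScheduler callback; the function names are ours, the milestones come from Section 4.2, and the threshold follows the convention that the noise covers the last $\lambda S$ epochs. This is an illustration under those assumptions, not the authors' released code:

import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

TOTAL_EPOCHS = 200   # S
ALPHA = 8.0          # noise scaling factor alpha
LAM = 0.6            # fraction of final epochs that receive noise (lambda)

def step_decay(epoch):
    # baseline step decay: start at 0.1, divide by 10 at epochs 80, 120, 160, halve at 180
    lr = 0.1
    for milestone in (80, 120, 160):
        if epoch >= milestone:
            lr /= 10.0
    if epoch >= 180:
        lr /= 2.0
    return lr

def random_lr(epoch):
    # Eq. (3): keep the scheduled rate early on, randomize it in the last LAM * S epochs
    lr = step_decay(epoch)
    if epoch >= (1.0 - LAM) * TOTAL_EPOCHS:
        lr = lr * np.random.uniform(0.0, 1.0) * ALPHA
    return lr

rn_callback = LearningRateScheduler(random_lr)
# pass rn_callback to model.fit(..., callbacks=[rn_callback]) to apply the schedule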

3.4 A Better Path Towards the Global Optimum

To better understand how random noise in the learning rate works, we visualize the optimization paths with and without randomness in the learning rate, as shown in Figure 3. The baseline learning rate schedule is the constant learning rate. We use four popular optimizers: SGD, SGD with Momentum, Adam, and RMSProp. Because training a deep neural network is essentially a non-convex optimization problem, we use two non-convex test functions as the objective functions: one is a quadratic function combined with two Gaussian functions creating minima at $(-1, 0)$ and $(1, 0)$, where $(-1, 0)$ is the global minimum and $(1, 0)$ is a local minimum; the other is a 2-dimensional Rastrigin function [Muhlenbein et al., 1991] with a global minimum at $(0, 0)$ and multiple local minima in its neighborhood.

From Figure 3, we notice that our learning rate strategy helps the optimizers find a better path towards the global optimum. For example, from the comparison between (a1) and (a2), we can see that our random learning rate helps Momentum and RMSProp escape local minima with a higher chance, while the constant learning rate tends to fail. This is mainly because our random learning rate introduces more dynamics into the original gradient update.


4 Experiments

4.1 Experiment Settings

We evaluate our method mainly on two standard image classification datasets: CIFAR-10 and CIFAR-100. We normalize the datasets with the per-channel mean and standard deviation. A standard data augmentation scheme is also incorporated. Additionally, images are randomly mirrored horizontally with 50% probability. We report the highest validation accuracy, following common practice.

We set the scaling factor $\alpha$ in Eq. 2 to 8 and the percentage of epochs $\lambda$ in Eq. 3 in which random noise (RN) is added to 0.6 in most of our experiments. Unless otherwise specified, the default batch size is 128, the optimizer is SGD with momentum of 0.9 [Sutskever et al., 2013], the initial learning rate is 0.1, the model is ResNet20 [He et al., 2016], and the total number of training epochs is 200. For fair comparison, all code is implemented in the Keras framework and run on a GTX 1080Ti GPU.
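Under these defaults, the noise window and the expected multiplier follow directly from Eqs. (2) and (3); the short check below simply makes those numbers explicit (our own illustration):

alpha, lam, total_epochs = 8.0, 0.6, 200
print("noise starts at epoch:", int((1 - lam) * total_epochs))  # 80, i.e. the last 60% of training
print("expected LR multiplier:", alpha * 0.5)                   # E[p * alpha] = alpha / 2 = 4.0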

4.2 Learning Rate Schedules

We apply different learning rate schedules to image classification tasks. The commonly-used learning rate schedules are as follows.

Constant learning rate (Const) maintains the same learning rate throughout the entire training process. We experiment with a range of learning rates and find that a learning rate of 0.01 achieves the best performance. Therefore, a constant learning rate of 0.01 serves as the baseline.

Step decay (SD) is the most widely used learning rate schedule in practice. It decreases the learning rate by a constant factor every few epochs. We set the initial learning rate to 0.1, decay it by a factor of 10 at the 80th, 120th, and 160th epochs, and finally halve it at the 180th epoch.

Exponential decay (ED) is another commonly used learning rate schedule. It has the mathematical form $lr = lr_0 \cdot r^{k/T}$, where $lr_0$ is the initial learning rate, which we set to 0.1, $r$ is the decay factor, $k$ is the iteration, and $T$ is the decay step. In our experiments, we set $r$ to 0.96 and $T$ to 600.

Cyclical learning rates (CLR) [Smith, 2017] linearly increase the learning rate from the minimum to the maximum and then decrease it within each cycle. In our experiments, we set the maximum learning rate to 0.3 and the minimum learning rate to 0.1 as specified in [Smith, 2017]. The number of iterations in a cycle is 16 times that in an epoch. For comparability with step decay, the maximum and minimum learning rates decay following the pattern of step decay, as shown in Figure 1.

Learning rates with warm restarts (WR) [Loshchilov and Hutter, 2016] decrease the learning rate in a cosine manner and restart by reverting to the initial learning rate. In our experiments, we adopt the recommended hyper-parameter setting in [Loshchilov and Hutter, 2016], where the initial restart cycle is $T_e = 10$ and the cycle length is multiplied by $T_{mul} = 2$ after each restart.
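For reference, the exponential decay and cyclical schedules above can be written in a few lines of plain Python; the triangular form of CLR is our reading of [Smith, 2017], and the number of iterations per epoch (391 for CIFAR-10 with batch size 128) is a placeholder we supply for illustration:

def exponential_decay(k, lr0=0.1, r=0.96, T=600):
    # lr = lr0 * r**(k / T), where k is the training iteration and T the decay step
    return lr0 * r ** (k / T)

def cyclical_lr(k, lr_min=0.1, lr_max=0.3, cycle_iters=16 * 391):
    # triangular CLR: rise linearly from lr_min to lr_max over half a cycle, then fall back
    half = cycle_iters / 2.0
    phase = k % cycle_iters
    frac = phase / half if phase < half else (cycle_iters - phase) / half
    return lr_min + (lr_max - lr_min) * frac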

We compare the test accuracy of all the above learning rate schedules with and without our method (+RN), as shown in Table 1. The consistent increase demonstrates the strong generalization ability of adding random noise to different learning rate schedules. We believe that such performance gain comes from the enhanced model robustness resulting from random noise.

Table 1: Top-1 test accuracy (%) of different learning rate schedules on CIFAR-10 and CIFAR-100. Our random learning rates consistently outperform their counterparts. We conduct each experiment three times and report the average and variance of the accuracy.

Learning Rate Schedules    CIFAR-10        CIFAR-100

SD                         91.66 ± 0.03    69.25 ± 0.49
SD+RN (Ours)               92.15 ± 0.07    69.31 ± 0.57

Const                      88.49 ± 0.44    59.85 ± 0.51
Const+RN (Ours)            91.34 ± 0.37    69.56 ± 0.29

ED                         91.57 ± 0.10    67.44 ± 0.17
ED+RN (Ours)               91.61 ± 0.10    67.74 ± 0.59

CLR                        91.83 ± 0.22    69.53 ± 0.65
CLR+RN (Ours)              92.09 ± 0.26    69.73 ± 0.84

WR                         91.70 ± 0.47    68.78 ± 0.20
WR+RN (Ours)               91.77 ± 0.07    69.23 ± 0.40

We show the training and test accuracy of step decay and CLR in Figure 1. We observe that our method yields higher test accuracy but lower training accuracy, which indicates that our method reduces the generalization gap of the model.

4.3 Different Optimizers, Models and Datasets

To further evaluate the advantage of adding randomness to learning rates on different optimization problems, we apply our method to different experimental settings, including different optimizers, models, and datasets.

Different optimizers. SGD with momentum [Sutskever et al., 2013] is the most commonly used optimizer and is the default in our experiments. In addition, the randomized learning rate can also operate on top of well-known adaptive learning rate optimizers, such as Adam [Kingma and Ba, 2014], Adagrad [Duchi et al., 2011], and RMSProp [Tieleman and Hinton, 2012]. We set their initial learning rates according to the defaults: 1e-1 for SGD and Adagrad, and 1e-3 for Adam and RMSProp.

Different models. We test our method on several popular image classification models, including VGG16 [Simonyan and Zisserman, 2014], ResNet56 [He et al., 2016], DenseNet with depth 40 and growth rate 12 [Huang et al., 2017], and Wide Residual Networks (WRNs) [Zagoruyko and Komodakis, 2016]. We use WRN-16-8 to denote a WRN with depth 16 and width 8. We construct all the models according to the official Keras implementations.

Different datasets. We hand-construct two naive convolutional neural networks (CNNs), on which we test MNIST and SVHN [Netzer et al., 2011]. In particular, the MNIST experiments use a network comprised of two convolutional layers followed by two fully connected ones. For SVHN, the network comprises six convolutional layers, each followed by a batch normalization layer and a ReLU activation. IMDb [Maas et al., 2011] is a benchmark dataset of movie reviews used for sentiment classification tasks. Pascal VOC 2012 [Everingham et al., 2012] is a standard semantic segmentation dataset. DeepLab V3 [Chen et al., 2017] serves as the baseline model, using ResNet as the backbone and mean intersection over union (mIoU) as the evaluation metric for the image segmentation task.

Experimental results show that our method outperforms all the baselines on different models, optimizers, and datasets, as shown in Table 2.

Table 2: Experimental results with diverse settings. We use top-1 test accuracy (%) as the evaluation metric, except for Pascal VOC 2012, where mIoU is used. We try various optimizers, models, and datasets. Our method retains its competence in heterogeneous scenarios. SGD+Mom refers to SGD with momentum of 0.9.

Models        Datasets          Optimizers    SD             SD+RN (Ours)

ResNet20      CIFAR-10          SGD+Mom       91.66          92.15
ResNet20      CIFAR-10          Adam          91.10          91.58
ResNet20      CIFAR-10          Adagrad       89.32          89.84
ResNet20      CIFAR-10          RMSProp       91.19          91.45

VGG16         CIFAR-10          SGD+Mom       90.17          91.28
ResNet56      CIFAR-10          SGD+Mom       93.56          94.23
DenseNet      CIFAR-10          SGD+Mom       94.09          94.54
WRN-16-8      CIFAR-10          SGD+Mom       94.51          94.81

Naive CNN     MNIST             SGD+Mom       99.19          99.28
Naive CNN     SVHN              SGD+Mom       95.94          96.01

LSTM          IMDb              Adam          82.85          83.59
DeepLab V3    Pascal VOC 2012   SGD+Mom       74.20 (mIoU)   75.44 (mIoU)

4.4 Ablation Study

Random noise from different probability distributions. The noise can be sampled from commonly-used distributions such as a Uniform distribution or a Gaussian distribution. We compare the results using both distributions, as shown in Table 3. We observe that the learning rate schedule with either Uniform noise or Gaussian noise outperforms the baseline.

Table 3: Classification results on CIFAR-10 and CIFAR-100 with Uniform noise and Gaussian noise.

Learning Rate Schedules                  CIFAR-10   CIFAR-100

SD                                       91.66      69.25
SD+Uniform RN                            92.15      69.31
SD+Gaussian RN (μ = 4, σ = 1)            92.09      70.03
SD+Gaussian RN (μ = 2, σ = 1)            92.14      69.37
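The Gaussian rows in Table 3 replace the Uniform factor $p \times \alpha$ with a Gaussian factor; the exact form below, $\eta' = \eta \cdot g$ with $g \sim N(\mu, \sigma^2)$ clipped at zero, is our assumption of how that noise is applied, chosen so that $E[g] = \mu = 4$ matches the Uniform case $E[p\alpha] = \alpha/2 = 4$:

import numpy as np

def gaussian_rn_lr(lr, mu=4.0, sigma=1.0, rng=np.random):
    # assumed Gaussian counterpart of Eq. (2): multiply the scheduled LR by g ~ N(mu, sigma^2)
    g = max(rng.normal(mu, sigma), 0.0)   # clip at zero so the learning rate stays non-negative
    return lr * g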

The choice of hyper-parameters (e.g., $\alpha$, $\lambda$). When adding random noise to existing learning rate schedules, we introduce two hyper-parameters: the scaling factor of the random noise $\alpha$ and the percentage of epochs $\lambda$ in which random noise is added. Table 4 shows the classification results with different combinations of $\alpha$ and $\lambda$, indicating that our method outperforms the baseline for all tested values of $\alpha$. As for $\lambda$, we observe performance degradation when adding random noise to the entire training process ($\lambda = 1$) and when adding it only to the last 20% of epochs ($\lambda = 0.2$). Therefore, we argue that too much randomness hurts the training process, especially during the initial epochs, and that our method has little effect when the noise is confined to the very last epochs, where the learning rate is already very small and the noise becomes nominal. Specifically, the performance shows clear improvement when $\lambda$ ranges from 0.4 to 0.8. Based on the above analysis, we set $\alpha$ to 8 and $\lambda$ to 0.6 in all our experiments.

Table 4: Classification results on CIFAR-10 with different hyper-parameter settings of the random noise.

Method     α    λ     Acc        Method     α    λ     Acc

Baseline   -    -     91.66      Baseline   -    -     91.66
Ours       2    0.6   91.86      Ours       8    1.0   91.59
Ours       4    0.6   91.91      Ours       8    0.8   91.81
Ours       8    0.6   92.15      Ours       8    0.6   92.15
Ours       16   0.6   91.98      Ours       8    0.4   91.80
Ours       32   0.6   92.07      Ours       8    0.2   91.57

It is the "exploration energy" rather than the amplification that helps. One may wonder whether it is the amplification of the learning rate by the random noise or the "exploration energy" that brings the better performance. We set $\alpha = 10$ for the constant learning rate with random noise, which is initialized at 1e-2, and compare it with three differently-initialized constant learning rate schedules without random noise, ranging from 1e-2 to 1e-1. As shown in Table 5, the random noise group outperforms the others, indicating the importance of the "exploration energy" in the random noise strategy.

Table 5: Classification results on CIFAR-10 and CIFAR-100 using a constant learning rate schedule with three different initial learning rates, with and without random noise. The learning rate with random noise outperforms all its counterparts, including those with larger magnitudes.

Learning Rate Schedules   Init Learning Rate   CIFAR-10   CIFAR-100

Const                     1e-2                 88.49      59.85
Const                     5e-2                 86.99      61.22
Const                     1e-1                 86.30      59.15
Const+RN (Ours)           1e-2                 91.34      69.56

Comparison with L4. Linearized Loss-based optimaL Learning-rate (L4) is a loss-based step size adaptation method. We compare our method with L4 on the basis of the constant learning rate schedule. The experimental settings follow the original paper [Rolinek and Martius, 2018], where the optimal constant learning rate is 0.004, the total number of training epochs is 400, and the batch size is 128. The experimental results and the learning rate schedules are displayed in Figure 4. The empirical results demonstrate that the random regime of our method outperforms the adaptive method of L4. Moreover, our method incurs far fewer computations, which shows the effectiveness and simplicity of the randomness-based method.

Figure 4: Classification results on CIFAR-10 with the constant learning rate, the constant learning rate with random noise, and the L4 algorithm. Our method achieves the highest accuracy.

4.5 Effectively Utilizing Random Noise

Random noise brings more variance to the model's training. Thus, we argue that our method achieves a more pronounced performance gain when the variance during training is small. We choose two scenarios to demonstrate this phenomenon.

Batch Sizes

When training a model with a large batch size, the variance of the gradient tends to be small, so the benefits brought by random noise become more obvious. Large-batch training leads to a degradation in generalization performance due to the elimination of inherent noise in the gradient [Keskar et al., 2017]. We believe that the "variance vanishing" resulting from a large batch size can be compensated by random noise in the learning rate.

We choose 0.1 as the initial learning rate for batch size 128 to stay consistent with the other experiments. Accordingly, the initial learning rates for batch sizes 32, 256, and 2,048 are 0.05, 0.14, and 0.4, respectively. Because for different batch sizes $M$ the initial learning rate satisfies $\eta \propto \sqrt{M}$, this change keeps the covariance matrix of the parameter update step the same across batch sizes [Hoffer et al., 2017].
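The square-root scaling can be checked directly; the snippet below reproduces the quoted initial learning rates from the batch-128 reference point (our own arithmetic check):

import math

base_lr, base_bs = 0.1, 128
for bs in (32, 256, 2048):
    print(bs, round(base_lr * math.sqrt(bs / base_bs), 2))
# prints 0.05 for 32, 0.14 for 256, and 0.4 for 2048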

The experimental results are summarized in Table 6. We obtain a larger margin of accuracy improvement for the large batch size of 2,048, where the variance within a mini-batch is much smaller and thus the injected random noise yields a more pronounced increase in the variance of the gradient update.

Table 6: Top-1 test accuracy (%) on CIFAR-10 with different batch sizes: 32, 256, and 2,048. Adding random noise to the learning rate schedule brings a performance gain across batch sizes; the larger the batch size, the larger the margin of improvement.

Batch Sizes      32      256     2,048

SD               92.18   91.66   90.70
SD+RN (Ours)     92.27   91.91   91.44
Improvement      0.09    0.23    0.74

Dropout and Data Augmentation

A series of regularization methods, including dropout [Hinton et al., 2012] and data augmentation [DeVries and Taylor, 2017; Tanner and Wong, 1987], have been widely used in deep learning. Such methods enhance the model robustness by purposefully introducing random noise and help the model escape local minima or saddle points. We argue that our method of adding noise to the learning rate is essentially another technique that introduces randomness, and it works orthogonally with other regularization methods. When such classic regularization methods are not incorporated, which leads to a smaller variance during model training, our method brings a more conspicuous improvement in performance.

Dropout [Hinton et al., 2012] drops some hidden units during the training of a deep neural network. Data augmentation [DeVries and Taylor, 2017; Tanner and Wong, 1987] produces synthetic data and thus enriches the dataset. As shown in Table 7, our method works orthogonally with these two methods in yielding better performance. In addition, when neither dropout nor data augmentation is used, so that the variance during training is small, our method sees a more conspicuous performance gain.

Table 7: Top-1 test accuracy (%) on CIFAR-10. Data Aug indicates standard data augmentation, and Improv means improvement.

Models   Dropout   Data Aug   SD      SD+RN (Ours)   Improv

VGG16    ✓         ✓          90.17   91.28          1.11
VGG16    ✓                    87.01   87.87          0.86
VGG16              ✓          83.21   83.58          0.37
VGG16                         77.38   80.02          2.64

5 Summary

In this work, we associate randomness with learning rates by adding random noise directly to existing learning rate schedules, which achieves satisfactory results. The intuition comes from the ubiquitously-leveraged randomness in deep learning, where randomizing the learning rate is rarely explored. We observe that such a simple yet effective method can operate upon various learning rate schedules, models, optimizers, datasets, and tasks, with a consistent performance gain, which we believe stems from the regularization effects brought by random noise. We expect that our easy-to-implement method, with no extra computational cost, will count towards the future toolset of practitioners.


References

[An, 1996] Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. NC, 1996.

[Blier et al., 2018] Leonard Blier, Pierre Wolinski, and Yann Ollivier. Learning with random learning rates. arXiv preprint arXiv:1810.01322, 2018.

[Chaudhari et al., 2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.

[Chen et al., 2017] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[DeVries and Taylor, 2017] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.

[Everingham et al., 2012] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012.

[Ge et al., 2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In COLT, 2015.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[Hinton et al., 2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jurgen Schmidhuber. Flat minima. NC, 1997.

[Hoffer et al., 2017] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NeurIPS, 2017.

[Huang et al., 2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.

[Huang et al., 2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.

[Jin et al., 2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

[Keskar et al., 2017] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.

[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[Kleinberg et al., 2018] Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In ICML, 2018.

[Loshchilov and Hutter, 2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[Maas et al., 2011] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In ACL, 2011.

[Mou et al., 2017] Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.

[Muhlenbein et al., 1991] Heinz Muhlenbein, M Schomisch, and Joachim Born. The parallel genetic algorithm as function optimizer. PC, 1991.

[Nar and Sastry, 2018] Kamil Nar and Shankar Sastry. Step size matters in deep learning. In NeurIPS, 2018.

[Neelakantan et al., 2015] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

[Netzer et al., 2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS workshop, 2011.

[Perez and Wang, 2017] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.

[Rolinek and Martius, 2018] Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. In NeurIPS, 2018.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[Smith and Le, 2018] Samuel L Smith and Quoc V Le. A bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.

[Smith, 2017] Leslie N Smith. Cyclical learning rates for training neural networks. In WACV, 2017.

[Sutskever et al., 2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.

[Tanner and Wong, 1987] Martin A Tanner and Wing Hung Wong. The calculation of posterior distributions by data augmentation. JASA, 1987.

[Tieleman and Hinton, 2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[Welling and Teh, 2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In ICML, 2011.

[Wu et al., 2015] Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, and Gang Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.

[Zagoruyko and Komodakis, 2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.


Towards Randomness in Learning Rate - Supplementary Materials

1 Some Perspectives on Why the Random Learning Rate Helps

1.1 A Perspective from Information Theory

A fixed learning rate $\eta$ has been widely used in previous model training. However, from the view of information theory, it is more convincing to consider the learning rate as a random variable rather than a fixed value [Cover and Thomas, 2012]. Specifically, the principle of maximum entropy states that if we only have incomplete information about an unknown variable, the most suitable probability distribution to represent it is the one with the largest entropy. Obviously, the entropy of a fixed learning rate is zero, which is the worst case.

Theorem 1 The maximum entropy probability distribution on a closed interval is the uniform distribution.

Empirically, a suitable learning rate should be neither too large nor too small. Hence, we can sample it within a range. As a result, the ideal assumption about the learning rate is a uniform distribution over that range.

1.2 A Perspective from Optimization

A large corpus of previous work has indicated the benefits of randomized gradient descent in reaching a valid minimum. Ge et al. [2015] demonstrates that stochastic gradient descent ultimately escapes all saddle points in a polynomial number of iterations under the condition that the objective function satisfies the strict saddle property. Jin et al. [2017] reports that perturbed gradient descent (see Algorithm 1) converges to an approximate minimum with lower complexity than [Ge et al., 2015]. We will show that the random learning rate bears a resemblance to perturbed gradient descent.

Although the above methods are far from efficient due to their intense computational complexity and strict constraint conditions, they reveal the benefits of uncertainty in gradient descent. In light of previous work on perturbed gradient descent, we attempt to randomize the learning rate in gradient descent, which demonstrates strong empirical effectiveness.

Algorithm 1 Perturbed Gradient Descent (Meta-algorithm) [Jin et al., 2017]
Input: previous model parameter $x_t$
Output: updated parameter $x_{t+1}$
1: for $t = 0, 1, \cdots$ do
2:   if the perturbation condition holds then
3:     $x_t \leftarrow x_t + \xi_t$, with $\xi_t$ sampled uniformly from $B_0(r)$
4:   end if
5:   $x_{t+1} \leftarrow x_t - \eta \nabla f(x_t)$
6: end for

In Algorithm 1, we can rewrite the updating formula as

$$
\begin{aligned}
x_{t+1} &= x_t + \xi_t - \eta \nabla f(x_t + \xi_t) \\
        &\approx x_t + \xi_t - \eta [\nabla f(x_t) + H(f(x_t))\, \xi_t] \\
        &= \underbrace{x_t - \eta \nabla f(x_t)}_{\text{naive gradient descent}} + \underbrace{(I - \eta H(f(x_t)))\, \xi_t}_{\text{random gradient descent}},
\end{aligned} \quad (1)
$$

where $H$ denotes the Hessian matrix. In our method of adding randomness to the learning rate, the updating formula can be written as

$$
\begin{aligned}
x_{t+1} &= x_t - \xi \eta \nabla f(x_t) \\
        &= x_t - \eta \nabla f(x_t) + (1 - \xi) \eta \nabla f(x_t) \\
        &= \underbrace{x_t - \eta \nabla f(x_t)}_{\text{naive gradient descent}} + \underbrace{\mathrm{diag}(\nabla f(x_t)) (\eta - \eta \xi)\, \vec{e}}_{\text{random gradient descent}},
\end{aligned} \quad (2)
$$

where $\vec{e} = [1, 1, \cdots, 1]^T$. From both rewritten updating formulas, we observe that their first parts are exactly identical to naive gradient descent, and their second parts can be interpreted as a matrix depending on $f(x_t)$ multiplied by a random vector. Therefore, it is reasonable to rely on the resemblance of our method to perturbed gradient descent, which has been proved to converge to an approximate minimum with complexity $O(\log^4(d)/\epsilon^2)$ [Jin et al., 2017].
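To make this resemblance concrete, the toy sketch below implements one step of each update rule; the quadratic objective, the perturbation radius, and the noise parameters are our own illustrative choices, not the paper's experimental setup:

import numpy as np

def grad(x):
    # toy objective f(x) = 0.5 * ||x||^2, so the gradient is simply x (illustration only)
    return x

def perturbed_gd_step(x, eta=0.1, r=0.01, perturb=True, rng=np.random):
    # Algorithm 1: optionally perturb the iterate, then take a plain gradient step
    if perturb:
        xi = rng.uniform(-r, r, size=x.shape)   # crude box noise standing in for B_0(r)
        x = x + xi
    return x - eta * grad(x)

def random_lr_step(x, eta=0.1, alpha=8.0, rng=np.random):
    # our method: scale the learning rate by xi = alpha * p with p ~ U(0, 1), as in Eq. (2)
    xi = alpha * rng.uniform(0.0, 1.0)
    return x - xi * eta * grad(x)

x = np.array([1.0, -2.0])
print(perturbed_gd_step(x), random_lr_step(x))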

1.3 A Perspective from the Variance of the Learning Rate

To ensure the effectiveness of the random noise, the variance of the gradient update $\eta \cdot \nabla_{\theta} L(\theta)$ should increase. This can be achieved by increasing either the variance of the learning rate $\eta$ or that of the gradient $\nabla_{\theta} L(\theta)$, thus enhancing the model robustness and leading to better regularization [Welling and Teh, 2011; Neelakantan et al., 2015; Jin et al., 2017]. Unlike previous studies that focus on introducing random noise in gradients, in this work we focus on randomizing the learning rate. We treat the learning rate and the random noise in each epoch as two random variables $X$ and $Y$. Under the assumption that $X$ and $Y$ are independent, a simple way to increase the variance of $X$ is to add $Y$ to $X$, so that $D(X + Y) = D(X) + D(Y) \ge D(X)$. However, in practice, the scale of the random noise should be proportional to the magnitude of the learning rate, which cannot be achieved without manually tuning the random noise in each epoch. Therefore, instead of simply adding $Y$ to $X$, we adopt a multiplicative combination of $X$ and $Y$ to increase $D(X)$.

If $X$ and $Y$ are independent, then

$$
\begin{aligned}
D(XY) &= E(X^2 Y^2) - [E(XY)]^2 \\
      &= E(X^2) E(Y^2) - [E(X)]^2 [E(Y)]^2, \\
D(X)  &= E(X^2) - [E(X)]^2.
\end{aligned} \quad (3)
$$

Therefore, if $E(Y) \ge 1$, then

$$
\begin{aligned}
D(XY) - D(X) &= E(X^2) E(Y^2) - [E(X)]^2 [E(Y)]^2 - E(X^2) + [E(X)]^2 \\
             &= E(X^2)[E(Y^2) - 1] - [E(X)]^2 \{[E(Y)]^2 - 1\} \\
             &\ge E(X^2)\{[E(Y)]^2 - 1\} - [E(X)]^2 \{[E(Y)]^2 - 1\} \\
             &= \{E(X^2) - [E(X)]^2\} \{[E(Y)]^2 - 1\} \\
             &= D(X) [E(Y) - 1][E(Y) + 1].
\end{aligned} \quad (4)
$$

In our experiments, the random noise $Y$ is sampled from a Uniform distribution or a Gaussian distribution, and we make sure that $E(Y) \ge 1$. As a result,

$$D(XY) - D(X) \ge D(X)[E(Y) - 1][E(Y) + 1] \ge 0 \;\Rightarrow\; D(XY) \ge D(X), \quad (5)$$

which indicates that the random noise increases the variance of the learning rate, hence enhancing the model robustness.
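A quick Monte Carlo sanity check of this inequality (our own illustration, with $X$ standing for the scheduled learning rate and $Y = \alpha p$ the multiplicative noise with $\alpha = 8$, so $E(Y) = 4 \ge 1$):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1e-3, 1e-1, size=1_000_000)       # a spread of scheduled learning rates (illustrative)
Y = 8.0 * rng.uniform(0.0, 1.0, size=1_000_000)   # multiplicative noise with E[Y] = 4
print(X.var(), (X * Y).var())                     # the second value is larger: D(XY) >= D(X)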

1.4 How to Choose the Parameters in Experiments

Referencing the proofs from Section 1.3, we consider a concrete case. We treat the random noise and the gradient in each epoch as two random variables $X$ and $Y$. If $X$ and $Y$ are independent, $X \sim U(0, a)$, and $Y \sim U(0, b)$, then we can conclude that when $a^2 > \frac{12}{7}$, we have $D(XY) > D(Y)$.

Proof:

$$E(Y^2) = \int_0^b x^2 \frac{1}{b} \, dx = \frac{1}{3} b^2, \qquad D(Y) = E(Y^2) - [E(Y)]^2 = \frac{1}{12} b^2.$$

Since

$$
\begin{aligned}
D(XY) &= E(X^2 Y^2) - [E(XY)]^2 \\
      &= E(X^2) E(Y^2) - [E(X)]^2 [E(Y)]^2 \\
      &= \frac{1}{3} a^2 \times \frac{1}{3} b^2 - \frac{1}{4} a^2 \times \frac{1}{4} b^2 \\
      &= \frac{7}{144} a^2 b^2,
\end{aligned}
$$

when $a^2 > \frac{12}{7}$ we obtain

$$D(XY) = \frac{7}{144} a^2 b^2 > \frac{7}{144} \times \frac{12}{7} b^2 = \frac{1}{12} b^2 = D(Y).$$

This simple proof gives a definite bound for setting the parameters of the uniform distribution. If the gradient satisfies a uniform distribution $Y \sim U(0, b)$, then to increase its variance, the parameter $a$ of the random noise distribution should be larger than $\sqrt{12/7}$.
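A numerical check of this bound (our own illustration): with $X \sim U(0, a)$ and $Y \sim U(0, b)$, $D(XY)$ exceeds $D(Y) = b^2/12$ once $a$ passes $\sqrt{12/7} \approx 1.31$:

import numpy as np

rng = np.random.default_rng(1)
b = 1.0
for a in (1.0, (12 / 7) ** 0.5, 2.0):
    X = rng.uniform(0.0, a, size=1_000_000)
    Y = rng.uniform(0.0, b, size=1_000_000)
    print(round(a, 3), round((X * Y).var(), 4), round(Y.var(), 4))
# D(XY) is below D(Y) for a = 1, roughly equal at a = 1.309, and above it for a = 2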

2 A Better Path Towards the Global Optimum

In this section, we visualize the paths in the optimization process with and without randomness in the learning rate, as shown in Figure 1. The baseline learning rate schedule is the constant learning rate. We use four popular optimizers: SGD, SGD with Momentum, Adam, and RMSProp. Because training a deep neural network is essentially a non-convex optimization problem, we use three non-convex test functions as the objective functions to be optimized.

One test function is a quadratic function combined with two Gaussian functions, creating minima at $(-1, 0)$ and $(1, 0)$, which is represented as

$$f(x, y) = x^2 + y^2 - a e^{-\frac{(x+1)^2 + y^2}{d}} - b e^{-\frac{(x-1)^2 + y^2}{d}}, \quad (6)$$

where the magnitudes of the two minima are controlled by $a$ and $b$. Here we set $a$ to 3 and $b$ to 2, so that $(-1, 0)$ is the global minimum and $(1, 0)$ is a local minimum.

Since in practice the problems in deep learning are far more difficult and complex than the above test function, we also use the Rastrigin function [Muhlenbein et al., 1991], with multiple local minima, as a harder optimization task. The 2-dimensional Rastrigin function is described as

$$f(x, y) = x^2 + y^2 - \sin(2\pi x) - \sin(2\pi y), \quad (7)$$

where $x, y \in [-5.12, 5.12]$. It has a global minimum at $(0, 0)$, where $f(x, y) = 0$, and multiple local minima in its neighborhood.

Another, harder test function is the Ackley function [Molga and Smutnicki, 2005], which is defined as

$$f(x, y) = -20 \exp\!\left[-0.2 \sqrt{0.5 (x^2 + y^2)}\right] - \exp\!\left[0.5 (\cos(2\pi x) + \cos(2\pi y))\right] + e + 20, \quad (8)$$


where $x, y \in [-5, 5]$. Its global minimum is $f(0, 0) = 0$, surrounded by many local minima. Unlike the Rastrigin function, where the regions between local minima are relatively flat, the Ackley function has steep gradients between different local minima. Thus, it is harder to escape local minima when optimizing the Ackley function.

From Figure 1, we notice that adding randomness to the learning rate helps the optimizers find a better path towards the global optimum. For example, in the comparison between the baseline (a) and RN (b), Momentum and RMSProp successfully escape the local minimum. This phenomenon can be explained by the random learning rate introducing more dynamics into the original gradient update, which makes the model much more inclined to escape saddle points or local minima.
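The sketch below reproduces the flavor of this experiment with plain gradient descent (no Momentum or Adam); the quadratic-plus-Gaussians objective follows Eq. (6) with $a = 3$, $b = 2$, while the Gaussian width $d$, the step count, and the starting point are our own choices, since they are not reported:

import numpy as np

def quad_two_gaussians(p, a=3.0, b=2.0, d=1.0):
    # Eq. (6); the width d is assumed, the paper does not report it
    x, y = p
    return (x**2 + y**2
            - a * np.exp(-((x + 1)**2 + y**2) / d)
            - b * np.exp(-((x - 1)**2 + y**2) / d))

def num_grad(f, p, eps=1e-5):
    # central-difference gradient, to keep the sketch self-contained
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

def descend(f, p0, eta=0.05, steps=300, random_lr=False, alpha=8.0, rng=np.random):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        lr = eta * rng.uniform(0.0, 1.0) * alpha if random_lr else eta
        p = p - lr * num_grad(f, p)
    return p

start = (0.5, 0.5)   # assumed starting point near the shallower basin
print("constant LR:", descend(quad_two_gaussians, start))
print("random LR  :", descend(quad_two_gaussians, start, random_lr=True))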

References

[Cover and Thomas, 2012] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[Ge et al., 2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In COLT, 2015.

[Jin et al., 2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

[Molga and Smutnicki, 2005] Marcin Molga and Czesław Smutnicki. Test functions for optimization needs. 2005.

[Muhlenbein et al., 1991] Heinz Muhlenbein, M Schomisch, and Joachim Born. The parallel genetic algorithm as function optimizer. PC, 1991.

[Neelakantan et al., 2015] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.

[Welling and Teh, 2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In ICML, 2011.



Figure 1: Visualization of optimization paths. Each row corresponds to a different test function, and each column to a different start point and learning rate schedule. The quadratic function in the first row has a global minimum at $(-1, 0)$ and a local minimum at $(1, 0)$. The Rosenbrock function in the second row has one global minimum at $(0, 0)$ and multiple local minima in its neighborhood. The Ackley function is similar to the function in the second row, but with steeper gradients around local minima. (a) and (c) show the optimization paths on the test functions with the baseline constant learning rate, and (b) and (d) show those with the constant learning rate with random noise. Each path corresponds to a different optimizer. The runs in (a) and (b), and in (c) and (d), share the same initialization. Compared to (a) and (c), (b) and (d) have a better chance of converging to the global minima thanks to the noise, which encourages more aggressive exploration in the optimization process.