

    Why are Adaptive Methods Good for Attention Models?

    Jingzhao Zhang, MIT

    Sai Praneeth Karimireddy, EPFL

    Andreas Veit, Google Research

    Seungyeon Kim, Google Research

    Sashank Reddi, Google Research

    Sanjiv Kumar, Google Research

    Suvrit Sra, MIT

    Abstract

    While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD’s poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an adaptive coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.

    1 Introduction

    Stochastic gradient descent (SGD) is the canonical algorithm for training neural networks [24]. SGD iteratively updates model parameters in the negative gradient direction and seamlessly scales to large-scale settings. Though a well-tuned SGD outperforms adaptive methods [31] in many tasks including ImageNet classification (see Figure 1a), certain tasks necessitate the use of adaptive variants of SGD (e.g., Adagrad [10], Adam [14], AMSGrad [23]), which employ adaptive learning rates. For instance, consider training an attention model [29] using BERT [9]. Figure 1e shows that in spite of extensive hyperparameter tuning, SGD converges much slower than Adam during BERT training.

    In this work, we provide one explanation for why adaptivity can facilitate convergence, with theoretical and empirical evidence. The key hint that motivates our work comes from the distribution of the stochastic gradients. For ImageNet, the norms of the mini-batch gradients are typically quite small and well concentrated around their mean. On the other hand, the mini-batch gradient norms for BERT take a wide range of values and are sometimes much larger than their mean value. More formally, while the distribution of the stochastic gradients in ImageNet is well approximated by a Gaussian, the distribution for BERT seems to be heavy-tailed. This observation leads us to the question: does adaptivity stabilize optimization under heavy-tailed noise?

    34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

    arXiv:1912.03194v2 [math.OC] 23 Oct 2020

    We provide a positive answer to the above question by performing both theoretical and empirical studies of the convergence of optimization methods under heavy-tailed noise. In this setting, some of the stochastic gradients are much larger than the mean and can excessively influence the updates of SGD. This makes SGD unstable and leads to its poor performance. A natural strategy to stabilize the updates is to clip the magnitude of the stochastic gradients. We prove that clipping is indeed sufficient to ensure convergence even under heavy-tailed noise. Based on the analysis, we then motivate the design of a novel algorithm (ACClip) that outperforms Adam on BERT-related tasks. Specifically, we make the following contributions:

    • We empirically show that in tasks on which Adam outperforms SGD (BERT pretraining), the noise in stochastic gradients is heavy-tailed. On the other hand, on tasks where traditionally SGD outperforms Adam (ImageNet training), we show that the noise is well concentrated.

    • In Section 3, we study the convergence of gradient methods under the heavy-tailed noise condition, where SGD’s performance degrades and its convergence might fail. We then establish (with upper and lower bounds) the convergence of clipped gradient methods under the same condition and prove that they obtain theoretically optimal rates.

    • Though clipping speeds up SGD, it does not close the gap between SGD and Adam. In Section 4, we motivate a novel adaptive-threshold coordinate-wise clipping algorithm, and in Section 5 we experimentally show that it outperforms Adam on BERT training tasks.

    1.1 Related work

    Adaptive step sizes. Adaptive step sizes during optimization have long been studied [3, 21]. More recently, Duchi et al. [10] developed the Adagrad algorithm that benefits from the sparsity in stochastic gradients. Inspired by Adagrad, several adaptive methods have been proposed in the deep learning community [14, 28]. Recently, there has been a surge in interest to study the theoretical properties of these adaptive gradient methods due to [23], which pointed out the non-convergence of Adam and proposed an alternative algorithm, AMSGrad. Since then, many works studied different interesting aspects of adaptive methods, see [1, 6, 13, 16–18, 27, 30, 32–36]. Another direction of related work is normalized gradient descent, which has been studied for quasi-convex and non-convex settings [12, 15]. In contrast to our work, these prior works assume standard noise distributions that might not be applicable to key modern applications such as attention models, which exhibit heavy-tailed noise. Furthermore, convergence rates of adaptive methods are mostly worse than those of SGD.

    Noise in neural network. There has been little study of the actual stochastic gradient noise distributions in neural network training. To our knowledge, [19, 25, 26] start the topic and observe heavy-tailed noise in network training. Our work differs in two important ways. First, we treat the noise as a high-dimensional vector, while [25] treat deviations in each coordinate as scalar noise to estimate the tail index. Hence, we observe that the example given in [25] is well-concentrated when viewed as a random vector. This is also confirmed by [20]. More experimental comparisons are in Appendix H. Second, we focus on the convergence of optimization algorithms, whereas the previously mentioned works focus on Langevin dynamics and escaping saddle points. The convergence rate given in [19] is for globally Hölder-continuous functions, which restricts the function variations and excludes examples like quadratic functions. Our analysis instead provides the first convergence rates under the standard L-smoothness setting. Further, [11] studies accelerated first-order methods under less concentrated noise; however, there “heavy-tailedness” refers to non-sub-Gaussianity.

    2 Heavy-tailed noise in stochastic gradients

    To gain intuition about the difference between SGD and adaptive methods, we start our discussion with the study of noise distributions of stochastic gradients that arise during neural network training. In particular, we focus on noise distributions while training two popular deep learning models, BERT and ResNet. Note that BERT and ResNet are typically trained with Adam and SGD (with momentum) respectively, and can thus provide insight into the difference between these optimizers.

    We first investigate the distribution of the gradient noise norm ‖g − ∇f(x)‖ in the aforementioned neural network models, where g is the stochastic gradient computed from a minibatch sample. In particular, we fix the model at initialization without doing any updates. We then iterate through the dataset to compute the noise norm for each minibatch. Figure 1 (b) and (f) show these distributions for ResNet50 on ImageNet and BERT on the Wikipedia and books dataset at model initialization respectively. For comparison, we plot distributions of a normalized sum of squared Gaussians, a well-concentrated distribution, and a Levy-α-stable distribution, a heavy-tailed distribution, in Figure 1 (c) and (g) respectively. We observe that the noise distribution for BERT appears heavy-tailed, while that of ResNet50 is well-concentrated. Results for noise distributions at other stages of training are displayed in Figure 2.

    [Figure 1: eight panels. Panels (a) and (e) plot loss against iterations for SGD momentum and ADAM. Panels (b) "ImageNet training", (c) "Synthetic Gaussian", (f) "Bert pretraining" and (g) "Synthetic Levy-stable" are density histograms, with the noise norm on the horizontal axis in (b) and (f). Panels (d) "ImageNet variance" and (h) "Bert variance" plot the estimated variance against sample size.]

    Figure 1: (a) Validation loss for ResNet50 trained on ImageNet. SGD momentum outperforms Adam. (b) Histogram of sampled gradient noise for ResNet50 on the ImageNet dataset. (c) Histogram of samples from a sum of squared Gaussians. (d) Estimated variance of the stochastic gradient for ResNet50. (e) Validation loss for BERT pretraining. Although hyperparameters for SGD are finetuned, a large performance gap is still observed between SGD and Adam. (f) Histogram of sampled gradient noise for BERT on the Wikipedia+Books dataset. (g) Histogram of samples from a sum of squared α-stable random variables. (h) Estimated variance of the stochastic gradient for the BERT model.

    To support this observation, in Figure 1 (d) and (h) we further show the empirical variance of stochastic gradients with respect to the sample size used in the estimation. The results highlight that while the corresponding estimator converges for ImageNet, the empirical variance does not converge in BERT training even as the sample size approaches $10^7$.
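
The variance-versus-sample-size diagnostic described above can be illustrated on synthetic data. The following is a minimal sketch, not the paper's measurement code: it assumes a standard Gaussian as a stand-in for well-concentrated noise and a Pareto distribution with tail index 1.5 (finite mean, infinite variance) as a stand-in for heavy-tailed noise. The names `sample_variance` and `checkpoints` are illustrative.

```python
import random


def sample_variance(values):
    """Plain empirical variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
checkpoints = [2_000, 20_000, 200_000]

# Well-concentrated stand-in: standard Gaussian, true variance 1.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(checkpoints[-1])]
# Heavy-tailed stand-in (assumption): Pareto with tail index 1.5,
# which has a finite mean but an infinite variance.
pareto = [rng.paretovariate(1.5) for _ in range(checkpoints[-1])]

# Re-estimate the variance from growing prefixes of each sample.
gaussian_estimates = [sample_variance(gaussian[:n]) for n in checkpoints]
pareto_estimates = [sample_variance(pareto[:n]) for n in checkpoints]

for n, gv, pv in zip(checkpoints, gaussian_estimates, pareto_estimates):
    print(f"n={n:>7}  gaussian={gv:.3f}  pareto={pv:.1f}")
```

For the Gaussian sample the estimate settles near 1, whereas for the Pareto sample a handful of extreme draws keeps dominating the estimate, so it has no finite value to settle on as the sample grows.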

    From the observation that the noise can be heavy-tailed, we hypothesize that this is one major aspect that determines the performance of SGD and adaptive methods. In the rest of the paper, we argue and provide evidence that adaptive methods can be faster than SGD in scenarios where heavy-tailed noise distributions arise. More experiment details can be found in Section 5.

    3 Convergence of gradient methods under heavy-tailed noise

    In this section we study the performance of SGD and adaptive methods under heavy-tailed noise. More precisely, we analyze algorithms of the following form

    $$x_{k+1} = x_k - \eta_k g_k, \qquad (1)$$

    where $x_k$ denotes the current parameters, $\eta_k$ is the step size and $g_k$ is the stochastic (mini-batch) gradient evaluated at $x_k$. We show that if the stochasticity in the gradient $g_k$ is heavy-tailed, it is critical for the step sizes to be adaptive, i.e. $\eta_k$ must depend on the observed gradients. We propose to use one such algorithm, GClip, and prove that it obtains optimal convergence rates.

    Heavy-tailed noise. Neural network training can be seen as minimizing a differentiable stochastic function $f(x) = \mathbb{E}_\xi[f(x, \xi)]$, where $f : \mathbb{R}^d \to \mathbb{R}$ can be potentially nonconvex and $\xi$ represents the mini-batches. At each iteration, we assume access to an unbiased stochastic gradient $g(x) = \nabla f(x, \xi)$, with $\mathbb{E}[g(x)] = \nabla f(x)$, corresponding to the parameters $x$ and mini-batch $\xi$. We also need to bound how much noise is present in our stochastic gradients. In lieu of the usual bounded variance assumption, we use

    Assumption 1 (Bounded α-moment). There exist real numbers $\alpha \in (1, 2]$ and $\sigma > 0$ such that for all $x$, $\mathbb{E}[\|g(x) - \nabla f(x)\|^\alpha] \le \sigma^\alpha$. We say noise is heavy-tailed if $\alpha < 2$.

    The above assumption with α = 2 corresponds to the standard variance bound, but in general is weaker. It is indeed possible (e.g. Pareto or α-stable Levy random variables) for the variance of g(x) to be unbounded, while simultaneously satisfying Assumption 1 for α < 2. One should note that even if the variance may not actually be infinite in practice, it might be too large to be practically useful. All our analyses and insights carry over to this setting as well.

    Table 1: Error bounds ($f(x) - f^*$ for convex functions, $\|\nabla f(x)\|$ for nonconvex functions) after $k$ iterations. Define the α-moment as $\mathbb{E}[\|g(x) - \nabla f(x)\|^\alpha] \le \sigma^\alpha$ (Assumption 1) in the smooth nonconvex case and $\mathbb{E}[\|g(x)\|^\alpha] \le G^\alpha$ (Assumption 4) in the strongly convex case. In the standard setting (α = 2), GClip recovers the optimal rates. For heavy-tailed noise (α ∈ (1, 2)), GClip converges both for convex (Thm 4) and non-convex functions (Thm 2). We also show matching lower bounds for all α ∈ (1, 2], proving the optimality of clipping methods (Thm 5).

                   Strongly Convex Function                    Non-Convex Function
                   Heavy-tailed noise     Standard noise       Heavy-tailed noise         Standard noise
                   (α ∈ (1, 2))           (α ≥ 2)              (α ∈ (1, 2))               (α ≥ 2)
    SGD            N/A                    O(k^{-1})            N/A                        O(k^{-1/4})
    GClip          O(k^{-(α-1)/α})        O(k^{-1})            O(k^{-(α-1)/(3α-2)})       O(k^{-1/4})
    Lower bound    Ω(k^{-(α-1)/α})        Ω(k^{-1})            Ω(k^{-(α-1)/(3α-2)})       Ω(k^{-1/4})

    The possibility that the variance is unbounded has a profound impact on the optimization process.

    Remark 1 (Nonconvergence of SGD). Consider the function $f(x) = x^2/2$ with noise satisfying $\mathbb{E}[\|g(x) - \nabla f(x)\|^\alpha] = \sigma^\alpha$ for $\alpha < 2$, and $\mathbb{E}[\|g(x) - \nabla f(x)\|^2] = \infty$. Then, for any positive constants $\eta_k$ that do not depend on $g_k$, we have that $\mathbb{E}[\|\nabla f(x_k)\|^2] = \infty$.

    Proof. We denote the stochastic gradient $g_k := g(x_k) = \nabla f(x_k) + \xi_k = x_k + \xi_k$, where $\xi_k \in \mathbb{R}^d$ is a random variable with $\mathbb{E}\|\xi_k\|^2 = \infty$, $\mathbb{E}\|\xi_k\|^\alpha = \sigma^\alpha$, $\mathbb{E}[\xi_k] = \vec{0}$. Then,

    $$\mathbb{E}[\|\nabla f(x_{k+1})\|^2] = \mathbb{E}[\|x_{k+1}\|^2] = \mathbb{E}\|x_k - \eta_k g_k\|^2 = \mathbb{E}\|x_k - \eta_k (x_k + \xi_k)\|^2 = \mathbb{E}\|(1 - \eta_k) x_k - \eta_k \xi_k\|^2$$
    $$= \mathbb{E}\|(1 - \eta_k) x_k\|^2 - 2 (1 - \eta_k) \eta_k x_k^\top \mathbb{E}[\xi_k] + \eta_k^2 \mathbb{E}\|\xi_k\|^2 \ge \eta_k^2 \mathbb{E}\|\xi_k\|^2 = \infty.$$

    Note that this holds for any fixed $\eta_k > 0$ even if allowed to depend on the statistics of the noise distribution (such as σ or α).

    The issue is that SGD is easily influenced by a single stochastic gradient, which could be very large and incorrect. A simple strategy to circumvent this issue is to use a biased clipped stochastic gradient estimator. This allows us to circumvent the problem of unbounded variance and ensures optimal convergence rates even under heavy-tailed noise. Our results are summarized in Table 1, and all proofs are relegated to the Appendices.
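
To make the effect of a single large stochastic gradient concrete, the sketch below simulates the one-dimensional case $f(x) = x^2/2$ from Remark 1 with and without clipping. It is an illustration under stated assumptions, not the paper's experiment: the noise is a symmetrized Pareto variable with tail index 1.5 (zero mean, infinite variance), and the names `heavy_tailed_noise` and `run`, the step size 0.01, and the threshold 5.0 are all hypothetical choices.

```python
import random


def heavy_tailed_noise(rng, tail_index=1.5):
    """Zero-mean symmetric noise with Pareto tails (infinite variance for tail_index < 2)."""
    magnitude = rng.paretovariate(tail_index)
    return magnitude if rng.random() < 0.5 else -magnitude


def run(num_steps, step_size, clip_threshold=None, seed=0):
    """Run (clipped) SGD on f(x) = x^2 / 2 and track the largest single move."""
    rng = random.Random(seed)
    x = 1.0
    largest_move = 0.0
    for _ in range(num_steps):
        g = x + heavy_tailed_noise(rng)  # gradient of x^2 / 2 is x, plus noise
        if clip_threshold is not None and abs(g) > clip_threshold:
            g *= clip_threshold / abs(g)  # rescale so that |g| equals the threshold
        x_new = x - step_size * g
        largest_move = max(largest_move, abs(x_new - x))
        x = x_new
    return x, largest_move


_, sgd_move = run(10_000, step_size=0.01)
_, clipped_move = run(10_000, step_size=0.01, clip_threshold=5.0)
print("largest single move, plain SGD:", sgd_move)
print("largest single move, clipped  :", clipped_move)
```

With clipping, no single update can move the iterate by more than the step size times the threshold, whereas the plain SGD update is proportional to whatever noise value happens to be drawn.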

    3.1 Convergence of Clipped Methods

    A simple clipping strategy is to globally clip the norm of the update to threshold $\tau_k$:

    $$x_{k+1} = x_k - \eta_k \min\left\{ \frac{\tau_k}{\|g_k\|}, 1 \right\} g_k, \qquad \tau_k \in \mathbb{R}_{\ge 0} \qquad \text{(GClip)}$$

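    We refer to this strategy as GClip (Global Clip), as opposed to coordinate-wise clipping which we discuss later. We first state the rates for smooth non-convex functions.

    Theorem 2 (Non-convex convergence). Suppose that $f$ is $L$-smooth and that the stochastic gradients satisfy Assumption 1 for $\alpha \in (1, 2]$. Let $\{x_k\}$ be the iterates of GClip with parameters $\eta_k = \eta = \min\left\{ \frac{1}{4L}, \frac{\sigma^\alpha}{L \tau^\alpha}, \frac{1}{24 L \tau} \right\}$ and $\tau_k = \tau = \max\left\{ 2,\ 48^{1/(\alpha-1)} \sigma^{\alpha/(\alpha-1)},\ 8\sigma,\ (f_0 \sigma^2 K)^{\frac{\alpha}{3\alpha-2}} / L^{\frac{2\alpha-2}{3\alpha-2}} \right\}$. Then for $F_0 := f(x_0) - f^*$,

    $$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left[\min\{\|\nabla f(x_k)\|, \|\nabla f(x_k)\|^2\}\right] = O\left(K^{\frac{-2\alpha+2}{3\alpha-2}}\right).$$

    Remark 3. When $\|\nabla f(x_k)\| \le \epsilon \ll 1$, $\|\nabla f(x_k)\|^2 \ll \|\nabla f(x_k)\|$. Hence the dominant term on the left hand side of the inequality above is $\|\nabla f(x_k)\|^2$. The right hand side is easily observed to be $O(K^{-\frac{2(\alpha-1)}{3\alpha-2}})$. Together, this implies a convergence rate of $\mathbb{E}\|\nabla f(x)\| \le O(K^{-\frac{\alpha-1}{3\alpha-2}})$.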
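
The GClip update can be written in a few lines. The following is a minimal sketch on plain Python lists, assuming a fixed step size and threshold supplied by the caller; the function name `gclip_step` and the example values are illustrative and not taken from the paper's code.

```python
import math


def gclip_step(x, g, step_size, threshold):
    """One GClip update: x - eta * min(tau / ||g||, 1) * g."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    # The gradient is left untouched when its norm is already within the threshold.
    scale = 1.0 if norm <= threshold else threshold / norm
    return [xi - step_size * scale * gi for xi, gi in zip(x, g)]


# A gradient of norm 5 is rescaled to norm 1 before the step is taken.
print(gclip_step([0.0, 0.0], [3.0, 4.0], step_size=0.1, threshold=1.0))
# A gradient of norm 0.5 passes through unchanged.
print(gclip_step([0.0, 0.0], [0.3, 0.4], step_size=0.1, threshold=1.0))
```

Because the whole vector is multiplied by one scalar, the direction of the update is preserved and only its length is capped.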

    We prove improved rates of convergence for non-smooth strongly-convex functions in a bounded domain. Due to limited space, we relegate the definitions and assumptions to Appendix A.


    Theorem 4 (Strongly-convex convergence). Suppose that the stochastic gradients satisfy Assumption 4 for $\alpha \in (1, 2]$. Let $\{x_k\}$ be the iterates of projected GClip (proj-GClip) with clipping parameter $\tau_k = G k^{\alpha-1}$ and step-size $\eta_k = \frac{4}{\mu(k+1)}$. Define the output to be a $k$-weighted combination of the iterates: $\bar{x}_k = \sum_{j=1}^{k} j\, x_{j-1} \big/ \left(\sum_{j=1}^{k} j\right)$. Then the output $\bar{x}_k$ satisfies:

    $$\mathbb{E}[f(\bar{x}_k)] - f(x^\star) \le \frac{16 G^2}{\mu (k+1)^{2(\alpha-1)/\alpha}}.$$

    The rates of convergence for the strongly convex and non-convex cases in Theorem 4 and Theorem 2 exactly match those of the usual SGD rates ($O(1/\sqrt{k})$ for convex and $O(k^{-1/4})$ for non-convex) when α = 2 and gracefully degrade for α ∈ (1, 2]. As we will next show, both the strongly convex rates and non-convex rates of GClip are in fact optimal for every α ∈ (1, 2].

    3.2 Theoretic lower bounds

    We prove that the rates obtained with GClip are optimal up to constants. First, we show a strong lower bound for the class of convex functions with stochastic gradients satisfying $\mathbb{E}[|g(x)|^\alpha] \le 1$. This matches the upper bounds of Theorems 4 and 8 for strongly-convex functions, showing that the simple clipping mechanism of GClip is (up to constants) information theoretically optimal, providing a strong justification for its use.

    Theorem 5. For any $\alpha \in (1, 2]$ and any (possibly randomized) algorithm $\mathcal{A}$, there exists a problem $f$ which is 1-strongly convex and 1-smooth ($\mu = 1$ and $L = 1$), and stochastic gradients which satisfy Assumption 4 with $G \le 1$ such that the output $x_k$ of the algorithm $\mathcal{A}$ after processing $k$ stochastic gradients has an error

    $$\mathbb{E}[f(x_k)] - f(x^\star) \ge \Omega\left(\frac{1}{k^{2(\alpha-1)/\alpha}}\right).$$

    Next, we examine non-convex functions.

    Theorem 6. Given any $\alpha \in (1, 2]$, smoothness constant $L$, and (possibly randomized) algorithm $\mathcal{A}$, there exists a constant $c_1$ and an $L$-smooth function $f$ with stochastic gradients satisfying Assumption 1 for any given $\sigma \ge c_1 \sqrt{(f(0) - f^*) L}$ such that the output $x_k$ of the algorithm $\mathcal{A}$ after processing $k$ stochastic gradients has an error

    $$\mathbb{E}[\|\nabla f(x_k)\|] \ge \Omega\left(\frac{1}{k^{(\alpha-1)/(3\alpha-2)}}\right).$$

    Theorem 6, proven in Appendix G, extends the recent work of [2, Theorem 1] to heavy-tailed noise. Here, the lower bound matches the upper bound in Theorem 2 up to constants, proving its optimality.
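
To get a feel for how the rates degrade as the tail gets heavier, the exponents stated in Theorems 4 and 5 (strongly convex, $k^{-2(\alpha-1)/\alpha}$) and in Remark 3 and Theorem 6 (non-convex, $k^{-(\alpha-1)/(3\alpha-2)}$) can be evaluated for a few values of α. This is only a small arithmetic sketch using exact fractions; the function names are illustrative.

```python
from fractions import Fraction


def strongly_convex_exponent(alpha):
    """Exponent p of the k^{-p} rate in Theorems 4 and 5: p = 2 (alpha - 1) / alpha."""
    return 2 * (alpha - 1) / alpha


def nonconvex_exponent(alpha):
    """Exponent p of the k^{-p} rate in Remark 3 and Theorem 6: p = (alpha - 1) / (3 alpha - 2)."""
    return (alpha - 1) / (3 * alpha - 2)


for alpha in (Fraction(2), Fraction(3, 2), Fraction(5, 4)):
    print(f"alpha={alpha}: strongly convex k^-{strongly_convex_exponent(alpha)}, "
          f"non-convex k^-{nonconvex_exponent(alpha)}")
```

At α = 2 the exponents are 1 and 1/4; at α = 3/2 they drop to 2/3 and 1/5, and both tend to 0 as α approaches 1.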

    4 Faster Optimization with Adaptive Coordinate-wise Clipping

    The previous section showed that adaptive step sizes (which depend on the gradients) are essential for convergence under heavy-tailed noise, and also showed that GClip provides the optimal rates. There are of course other adaptive methods such as Adam, which employs not only the current gradients but also all past gradients to adaptively set coordinate-wise step-sizes. In this section, we study why coordinate-wise clipping may yield even faster convergence than GClip, and show how to modify GClip to design an Adaptive Coordinate-wise Clipping algorithm (ACClip).

    4.1 Coordinate-wise clipping

    The first technique we use is applying coordinate-wise clipping instead of global clipping. We had previously assumed that the α-moment of the norm (or variance) of the stochastic gradient is globally bounded by σ. However, σ might be hiding some dependence on the dimension d. We introduce a more fine-grained model of the noise in order to tease out this dependence.

    Assumption 2 (Coordinate-wise α moment). Denote $\{g_i(x)\}$ to be the coordinate-wise stochastic gradients for $i \in [d]$. We assume there exist constants $\{B_i\} \ge 0$ and $\alpha \in (1, 2]$ such that $\mathbb{E}[|g_i(x)|^\alpha] \le B_i^\alpha$.

    For the sake of convenience, we denote $B = [B_1; B_2; \cdots; B_d] \in \mathbb{R}^d$ and $\|B\|_a = \left(\sum_i B_i^a\right)^{1/a}$. Under this more refined assumption, we can show the following corollary:

    Corollary 7 (GClip under coordinate-wise noise). Suppose we run GClip under Assumption 2 to obtain the sequence $\{x_k\}$. If $f$ is $\mu$-strongly convex, with appropriate step-sizes and averaging, the output $\bar{x}_k$ satisfies

    $$\mathbb{E}[f(\bar{x}_k)] - f(x^\star) \le \frac{16\, d\, \|B\|_\alpha^2}{\mu (k+1)^{2(\alpha-1)/\alpha}}.$$


    [Figure 2: two rows of density histograms over the noise norm. (a) Development of noise distribution during BERT training, at iterations 0k, 4k, 12k and 36k. (b) Development of noise distribution during ResNet50 training on ImageNet, at iterations 0k, 5k, 10k and 60k.]

    Figure 2: The distribution of gradient noise is non-stationary during BERT training, while it remains almost unchanged for ResNet training on ImageNet.

    Thus, the convergence of GClip can have a strong dependence on d, which for large-scale problems might be problematic. We show next that using coordinate-wise clipping removes this dependency:

    $$x_{k+1} = x_k - \eta_k \min\left\{ \frac{\tau_k}{|g_k|}, 1 \right\} g_k, \qquad \tau_k \in \mathbb{R}^d_{\ge 0}. \qquad \text{(CClip)}$$

    Theorem 8 (CClip under coordinate-wise noise). Suppose we run CClip under Assumption 2 with $\tau_k = B k^{\alpha-1}$ to obtain the sequence $\{x_k\}$. Then, if $f$ is $\mu$-strongly convex, with appropriate step-sizes and averaging, the output $\bar{x}_k$ satisfies

    $$\mathbb{E}[f(\bar{x}_k)] - f(x^\star) \le \frac{16 \|B\|_2^2}{\mu (k+1)^{2(\alpha-1)/\alpha}}.$$

    Note that $\|B\|_2 \le \|B\|_\alpha$. CClip has a worst-case convergence independent of d under the coordinate-wise noise model. A similar comparison between GClip and CClip can be made in the non-convex setting too, but we skip it for conciseness. Though we only compare upper bounds here, when the noise across coordinates is independent the upper bounds may be tight (see Lemma 12).
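
The difference between the two clipping rules shows up as soon as one coordinate of the gradient is much larger than the others. The sketch below is an illustration under assumed values, not the paper's implementation: `cclip_step` and `gclip_step` are hypothetical helper names, and the gradient `[100.0, 0.5]` is chosen by hand to mimic a single heavy-tailed coordinate.

```python
import math


def gclip_step(x, g, step_size, threshold):
    """Global clipping: one scalar factor min(tau / ||g||, 1) for the whole vector."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = 1.0 if norm <= threshold else threshold / norm
    return [xi - step_size * scale * gi for xi, gi in zip(x, g)]


def cclip_step(x, g, step_size, thresholds):
    """Coordinate-wise clipping: coordinate i is scaled by min(tau_i / |g_i|, 1)."""
    new_x = []
    for xi, gi, tau_i in zip(x, g, thresholds):
        clipped = max(-tau_i, min(tau_i, gi))  # same as min(tau_i / |g_i|, 1) * g_i
        new_x.append(xi - step_size * clipped)
    return new_x


g = [100.0, 0.5]  # one outlier coordinate, one well-behaved coordinate
global_update = gclip_step([0.0, 0.0], g, step_size=0.1, threshold=1.0)
coordinate_update = cclip_step([0.0, 0.0], g, step_size=0.1, thresholds=[1.0, 1.0])
print("GClip:", global_update)
print("CClip:", coordinate_update)
```

Under global clipping the outlier forces the shared scaling factor down, so the well-behaved coordinate barely moves; under coordinate-wise clipping only the outlier coordinate is truncated.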

    4.2 Online moment estimation

    We now present the second technique, which is motivated by our observation in Figure 2. There, the distribution of gradient noise at the beginning of different epochs is shown during training for BERT with Wikipedia (top) as well as ResNet with ImageNet (bottom). The result highlights that the noise distribution is not only heavy-tailed, but also non-stationary during BERT training, and becomes increasingly more concentrated. In contrast, for the ResNet model the noise distribution remains mostly unchanged.

    Since the scale of the noise changes drastically during training for the BERT model and our theoretical analysis suggests that we should clip proportionally to the noise level, we propose to use an exponential moving average estimator to estimate the moment and clip the gradient accordingly (lines 4 and 5 of Alg 1). This, combined with the momentum term, leads to our proposed ACClip algorithm in Algorithm 1. At a high level, the algorithm applies clipping to the momentum term, where the clipping threshold is proportional to the moment estimated using an exponential moving average. In our experiments, we found that the conservative choice of α = 1 leads to the best performance.

    5 Experiments

    In this section, we first verify the effect of coordinate-wise clipping and moment estimation introduced in Section 4. We then perform extensive evaluations of ACClip on BERT pre-training and fine-tuning tasks and demonstrate its advantage over Adam in Section 5.2. For completeness, an experiment on ImageNet is included in Appendix I. Finally, we present a few more experiments on the noise distribution in neural network training.


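    Algorithm 1 ACClip
    1: $x, m_k \leftarrow x_0, 0$
    2: for $k = 1, \cdots, T$ do
    3:     $m_k \leftarrow \beta_1 m_{k-1} + (1 - \beta_1) g_k$
    4:     $\tau_k^\alpha \leftarrow \beta_2 \tau_{k-1}^\alpha + (1 - \beta_2) |g_k|^\alpha$
    5:     $\hat{g}_k \leftarrow \min\left\{ \frac{\tau_k}{|m_k| + \epsilon}, 1 \right\} m_k$
    6:     $x_k \leftarrow x_{k-1} - \eta_k \hat{g}_k$
    7: end for
    return $x_K$, where random variable $K$ is supported on $\{1, \cdots, T\}$.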
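
Lines 3 to 6 of Algorithm 1 translate almost directly into code. The following is a minimal sketch on plain Python lists rather than the authors' implementation. The defaults β1 = 0.9, β2 = 0.99, ε = 1e-5 and α = 1 follow the values reported in the text. Two things are assumptions: the name `acclip_step`, and the zero initial value of $\tau^\alpha$, which Algorithm 1 does not specify. The toy objective in the usage lines is illustrative only.

```python
def acclip_step(x, m, tau_pow, g, lr, beta1=0.9, beta2=0.99, alpha=1.0, eps=1e-5):
    """One ACClip update following lines 3-6 of Algorithm 1.

    tau_pow holds tau^alpha per coordinate (the quantity updated on line 4).
    """
    new_x, new_m, new_tau_pow = [], [], []
    for xi, mi, ti, gi in zip(x, m, tau_pow, g):
        mi = beta1 * mi + (1.0 - beta1) * gi                # line 3: momentum
        ti = beta2 * ti + (1.0 - beta2) * abs(gi) ** alpha  # line 4: moment estimate
        tau = ti ** (1.0 / alpha)
        scale = min(tau / (abs(mi) + eps), 1.0)             # line 5: clip the momentum
        new_x.append(xi - lr * scale * mi)                  # line 6: parameter update
        new_m.append(mi)
        new_tau_pow.append(ti)
    return new_x, new_m, new_tau_pow


# Toy usage on f(x) = ||x||^2 / 2 with exact gradients g = x.
x = [1.0, -2.0]
m = [0.0, 0.0]
tau_pow = [0.0, 0.0]  # assumption: Algorithm 1 does not state how tau is initialized
for _ in range(100):
    x, m, tau_pow = acclip_step(x, m, tau_pow, g=list(x), lr=0.1)
print(x)
```

Since the scaling factor never exceeds $\tau / (|m| + \epsilon)$, each coordinate moves by at most the learning rate times its current threshold estimate, which is how the clipping level tracks the running noise scale.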

    [Figure 3: three loss curves. (a) Training loss of TransformerXL against thousand iterations for SGD, Global clipping, Coordinate-wise clipping, ADAM and ACClip. (b) Train loss of BERT pretraining against iterations for SGD momentum, Clipped SGD momentum, ADAM and ACClip. (c) Validation loss of BERT pretraining against iterations for the same four methods.]

    Figure 3: (a) Performance of different algorithms for training a toy Transformer-XL model described in Section 5.1. (b) Train and (c) validation loss for BERTbase pretraining with the sequence length of 128. While there remains a large gap between non-adaptive methods and adaptive methods, clipped SGD momentum achieves faster convergence compared to standard SGD momentum. The proposed algorithm for adaptive coordinate-wise clipping (ACClip) achieves a lower loss than Adam.

    5.1 From GClip to ACClip

    In this section we instantiate the argument in Section 4 with a set of experiments. As seen in Figure 3a, global clipping improves the vanilla SGD algorithm but is still far from the Adam baseline. We apply two techniques (coordinate-wise clipping and online moment estimation) onto the clipped SGD algorithm analyzed in Section 3. We use a set of experiments on Transformer-XL training to demonstrate the effect of each technique.

    Experiment setup. We train a 6-layer Transformer-XL model [8] on the PTB dataset as a proof of concept. Our main experiments will be on BERT pretraining and finetuning, described in Section 5.2. We adapt the authors' GitHub repo (footnote 1), and replace the number of layers of the base model by 6. We then select the PTB data as input and set the maximum target length to be 128. The results are shown in Figure 3a.

    Observations. From Figure 3a, we can tell that global clipping (orange curve) indeed speeds up vanilla SGD but is still much worse compared to the Adam baseline provided by the code base. After replacing global clipping with coordinate-wise clipping, we see that the performance is already comparable to the Adam baseline. Finally, after using the moment estimation to determine the clipping threshold, we are able to achieve faster convergence than Adam.

    5.2 Performance of ACClip for BERT pre-training and fine-tuning

    We now evaluate the empirical performance of our proposed ACClip algorithm on BERT pre-training as well as fine-tuning using the SQUAD v1.1 dataset. As a baseline, we use the Adam optimizer and the same training setup as in the BERT paper [9]. For ACClip, we set τ = 1, learning rate = 1e-4, β1 = 0.9, β2 = 0.99, ε = 1e-5 and weight decay = 1e-5. We compare both setups on BERT models of three different sizes, BERTbase with 6 and 12 layers as well as BERTlarge with 24 layers.

    Figures 3b and 3c show the loss for pretraining BERTbase using SGD with momentum, GClip, Adam and ACClip. The learning rates and hyperparameters for each method have been extensively tuned to provide the best performance on the validation set. However, even after extensive tuning, there remains a large gap between (clipped) SGD momentum and adaptive methods. Furthermore, clipped SGD achieves faster convergence as well as lower final loss compared to standard SGD. Lastly, the proposed optimizer ACClip achieves a lower loss than Adam. Table 2 further shows that ACClip achieves lower loss and higher masked-LM accuracy for all model sizes.

    Footnote 1: https://github.com/kimiyoung/transformer-xl/tree/master/pytorch

    Table 2: BERT pretraining: Adam vs ACClip. Compared to Adam, the proposed ACClip algorithm achieves better evaluation loss and Masked LM accuracy for all model sizes.

              BERT Base 6 layers       BERT Base 12 layers      BERT Large 24 layers
              Val. loss   Accuracy     Val. loss   Accuracy     Val. loss   Accuracy
    Adam      1.907       63.45        1.718       66.44        1.432       70.56
    ACClip    1.877       63.85        1.615       67.16        1.413       70.97

    Table 3: SQUAD v1.1 dev set: Adam vs ACClip. The mean and standard deviation of F1 and exact match score for 5 runs. The first row contains results reported from the original BERT paper, which are obtained by picking the best ones out of 10 repeated experiments.

                                  BERT Base 6 layers             BERT Base 12 layers            BERT Large 24 layers
                                  EM             F1              EM             F1              EM             F1
    Adam (Devlin et al., 2018)    -              -               80.8           88.5            84.1           90.9
    Adam                          76.85 ± 0.34   84.79 ± 0.33    81.42 ± 0.16   88.61 ± 0.11    83.94 ± 0.19   90.87 ± 0.12
    ACClip                        78.07 ± 0.24   85.87 ± 0.13    81.62 ± 0.18   88.82 ± 0.10    84.93 ± 0.29   91.40 ± 0.15

    [Figure 4: four density histograms over the noise norm. (a) Attention + Wikipedia. (b) Attention + Gaussian. (c) Resnet + Wikipedia. (d) Resnet + Gaussian.]

    Figure 4: Distribution of gradient noise norm in Attention and ResNet models on two data sources: Wikipedia and synthetic Gaussian. The heavy-tailed noise pattern results from the interaction of both model architecture as well as data distribution.

    Next, we evaluate ACClip on the SQUAD v1.1 fine-tuning task. We again follow the procedure outlined in [9] and present the results on the Dev set in Table 3. Both for F1 as well as for exact match, the proposed algorithm outperforms Adam on all model sizes. The experimental results on BERT pretraining and fine-tuning indicate the effectiveness of the proposed algorithm.

    5.3 Noise Patterns in BERT and ImageNet Training

    In our initial analysis in Figure 1, we observe that training an attention model on Wikipedia leads to heavy-tailed noise whereas training a ResNet on ImageNet data leads to well-concentrated noise. Here, we aim to disentangle the effect that model architecture and training data have on the shape of gradient noise. To this end, we measure the distribution of the gradient noise norm in an Attention and a ResNet model on both Wikipedia and synthetic Gaussian data. We used BERTbase as the Attention model, and the ResNet is constructed by removing the self-attention modules within the transformer blocks. Gaussian synthetic data is generated by replacing the token embedding layer with normalized Gaussian input. The resulting noise histograms are shown in Figure 4. The figure shows that the Attention model leads to heavy-tailed noise independently of input data. For the ResNet model, we observe that Gaussian input leads to Gaussian noise, whereas Wikipedia data leads to heavy-tailed noise. We thus conclude that the heavy-tailed noise pattern results from both the model architecture and the data distribution.

    6 Discussion

    One immediate extension of this work is to view RMSProp as a clipping algorithm and prove its convergence under shifting noise. The updates for RMSProp and for ACClip with $\beta_1 = 0$ can be written with effective step-sizes $h_{\mathrm{rms}}$ and $h_{\mathrm{clip}}$ respectively as below:

    $$x_{k+1} = x_k - \frac{\alpha}{\epsilon + \sqrt{\beta_2 v_k + (1-\beta_2)|g_k|^2}}\, g_k =: x_k - h_{\mathrm{rms}}\, g_k, \quad\text{and}$$

    $$x_{k+1} = x_k - \eta_k \min\left\{\frac{\tau_k}{|g_k|}, 1\right\} g_k =: x_k - h_{\mathrm{clip}}\, g_k.$$

    Given any set of parameters for RMSProp, if we set the parameters for ACClip as

    $$\eta_k = \frac{2\alpha}{\epsilon + \sqrt{\beta_2 v_k}} \quad\text{and}\quad \tau_k = \frac{\epsilon + \sqrt{\beta_2 v_k}}{\sqrt{1-\beta_2}},$$

    then $\tfrac{1}{2} h_{\mathrm{clip}} \le h_{\mathrm{rms}} \le 2 h_{\mathrm{clip}}$. Thus, RMSProp can be seen as ACClip where $\tau_k$ is set using $\sqrt{v_k}$, which estimates $\mathbb{E}[|g_k|^2]^{1/2}$, and a correspondingly decreasing step-size. An analysis of RMSProp (and Adam) that views them as adaptive clipping methods is a promising direction for future work.
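The correspondence between the RMSProp-style update and the clipped update can be explored numerically. The sketch below is illustrative only: it treats a single scalar coordinate, the function names and default hyperparameters are assumptions rather than values from the experiments, and it simply evaluates the two effective step sizes for the matched choice of the step size and threshold described above, so that one can see they stay within a constant factor of each other.

```python
import math


def h_rms(g, v, alpha=1e-3, beta2=0.999, eps=1e-8):
    """Effective step size of the RMSProp-style update for one coordinate."""
    return alpha / (eps + math.sqrt(beta2 * v + (1.0 - beta2) * g * g))


def h_clip(g, v, alpha=1e-3, beta2=0.999, eps=1e-8):
    """Effective step size of the clipped update with the matched eta_k, tau_k."""
    eta = 2.0 * alpha / (eps + math.sqrt(beta2 * v))
    tau = (eps + math.sqrt(beta2 * v)) / math.sqrt(1.0 - beta2)
    # min{tau / |g|, 1}; a zero gradient is never clipped.
    scale = 1.0 if g == 0 else min(tau / abs(g), 1.0)
    return eta * scale


def step_size_ratios(gradients, second_moments):
    """Ratios h_rms / h_clip over a grid of (hypothetical) g_k and v_k values."""
    return [h_rms(g, v) / h_clip(g, v) for g in gradients for v in second_moments]


ratios = step_size_ratios([1e-4, 1e-2, 1.0, 100.0], [1e-6, 1e-2, 1.0])
print(min(ratios), max(ratios))
```

Small gradients leave the clipped update unclipped, while large gradients activate the threshold; in both regimes the printed ratio remains bounded.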

    In summary, our work theoretically and empirically ties the advantage of adaptive methods over SGD to the heavy-tailed nature of gradient noise. A careful analysis of the noise and its impact yielded two insights: that clipping is an excellent strategy to deal with heavy-tailed noise, and that ACClip yields state-of-the-art performance for training attention models. Our results add to a growing body of work which demonstrates the importance of the structure of the noise in understanding neural network training. We believe additional such investigations into the source of the heavy-tailedness, as well as a characterization of the noise, can lead to further insights with significant impact on practice.

    Broader impact

    We study convergence rates of gradient methods under a more relaxed noise condition. The result under this setting reaches conclusions that are closer to practice compared to results under the standard setting. Hence, our work provides one way to bridge the theory-practice gap and can facilitate more future work in this direction.

    References

    [1] N. Agarwal, B. Bullins, X. Chen, E. Hazan, K. Singh, C. Zhang, and Y. Zhang. The case for full-matrix adaptive regularization. arXiv preprint arXiv:1806.02958, 2018.

    [2] Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

    [3] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

    [4] S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

    [5] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.

    [6] X. Chen, S. Liu, R. Sun, and M. Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.

    [7] A. Cutkosky and H. Mehta. Momentum improves normalized sgd. arXiv preprint arXiv:2002.03305, 2020.

    [8] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

    [9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

    [10] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

    [11] E. Gorbunov, M. Danilova, and A. Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. arXiv preprint arXiv:2005.10785, 2020.

    [12] E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1594–1602, 2015.

    [13] H. Huang, C. Wang, and B. Dong. Nostalgic adam: Weighting more of the past gradients when designing the adaptive learning rate. arXiv preprint arXiv:1805.07557, 2018.

    [14] D. P. Kingma and J. Ba. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

    [15] K. Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.

    [16] X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114, 2018.

    [17] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

    [18] J. Ma and D. Yarats. On the adequacy of untuned warmup for adaptive optimization. arXiv preprint arXiv:1910.04209, 2019.

    [19] T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, and G. Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. arXiv preprint arXiv:1906.09069, 2019.

    [20] A. Panigrahi, R. Somani, N. Goyal, and P. Netrapalli. Non-gaussianity of stochastic gradient noise. arXiv preprint arXiv:1910.09626, 2019.

    [21] B. T. Polyak. Introduction to optimization. Optimization Software, Inc., Publications Division, New York, 1, 1987.

    [22] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.

    [23] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of ADAM and beyond. arXiv preprint arXiv:1904.09237, 2019.

    [24] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

    [25] U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. arXiv preprint arXiv:1901.06053, 2019.

    [26] U. Şimşekli, L. Zhu, Y. W. Teh, and M. Gürbüzbalaban. Fractional underdamped langevin dynamics: Retargeting sgd with momentum under heavy-tailed gradient noise. arXiv preprint arXiv:2002.05685, 2020.

    [27] M. Staib, S. J. Reddi, S. Kale, S. Kumar, and S. Sra. Escaping saddle points with adaptive gradient methods. arXiv preprint arXiv:1901.09149, 2019.

    [28] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.

    [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

    [30] R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.

    [31] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.

    [32] J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020.

    [33] D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.

    [34] Z. Zhou, Q. Zhang, G. Lu, H. Wang, W. Zhang, and Y. Yu. Adashift: Decorrelation and convergence of adaptive learning rate methods. arXiv preprint arXiv:1810.00143, 2018.

    [35] F. Zou and L. Shen. On the convergence of weighted adagrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408, 2018.

    [36] F. Zou, L. Shen, Z. Jie, W. Zhang, and W. Liu. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11127–11135, 2019.

    A Additional definitions and assumptions

    Here we describe some of the formal assumptions which were previously skipped.

    A.1 Assumptions in the nonconvex setting

    We define the standard notion of smoothness.

    Assumption 3 (L-smoothness). $f$ is $L$-smooth, i.e. there exists a positive constant $L$ such that for all $x, y$,

    $$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2.$$

    Note that we only need the smoothness assumption for non-convex functions.

    A.2 Assumptions in the strongly convex setting

    For strongly convex optimization, instead of bounding the noise, we assume that the stochastic oracle has a bounded moment.

    Assumption 4 (bounded $\alpha$ moment). There exist positive real numbers $\alpha \in (1, 2]$ and $G > 0$ such that for all $x$, $\mathbb{E}[\|g(x)\|^\alpha] \le G^\alpha$.

    Note that the above assumption implies a uniform bound on the gradient norm. Such a bound is necessary for nonsmooth strongly convex problems, as one can no longer factor out the gradient norm using the smoothness assumption. See, for example, [22].

    Assumption 5 ($\mu$-strong-convexity). $f$ is $\mu$-strongly convex, i.e. there exists a positive constant $\mu$ such that for all $x, y$,

    $$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2.$$

    The strong convexity assumption and the bounded gradient assumption imply that the domain is bounded, which we state explicitly below.

    Assumption 6 (bounded domain). We look for a solution $x$ within a bounded convex set $\mathcal{X}$.

    We do not upper bound the domain diameter, as it is not used explicitly in the proof. To ensure all updates stay within the domain, we use the projected version of (GClip), defined as follows:

    $$x_{k+1} = \mathrm{proj}_{\mathcal{X}}\left\{x_k - \eta_k \min\left\{\frac{\tau_k}{\|g_k\|}, 1\right\} g_k\right\}, \quad \tau_k \in \mathbb{R}_{\ge 0}. \qquad \text{(proj-GClip)}$$

    The projection operator $x = \mathrm{proj}_{\mathcal{X}}(y)$ finds the point $x \in \mathcal{X}$ that has the least distance to $y$.
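A single projected, globally clipped step can be sketched as follows. This is a minimal illustration, not the experimental implementation: the domain is assumed to be a Euclidean ball of a hypothetical `radius` (the text only requires a bounded convex set), and all function names are made up for the example.

```python
import math


def clip_global(g, tau):
    """Return min{tau / ||g||, 1} * g, so the result has norm at most tau."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = 1.0 if norm == 0.0 else min(tau / norm, 1.0)
    return [scale * gi for gi in g]


def project_ball(y, radius):
    """Euclidean projection onto {x : ||x|| <= radius} (assumed domain)."""
    norm = math.sqrt(sum(yi * yi for yi in y))
    if norm <= radius:
        return list(y)
    return [radius * yi / norm for yi in y]


def proj_gclip_step(x, g, eta, tau, radius):
    """One step: clip the stochastic gradient, move, then project back."""
    g_hat = clip_global(g, tau)
    y = [xi - eta * gi for xi, gi in zip(x, g_hat)]
    return project_ball(y, radius)


# A gradient of norm 50 is rescaled to norm tau = 5 before the step is taken.
print(proj_gclip_step([0.0, 0.0], [30.0, 40.0], eta=0.1, tau=5.0, radius=1.0))
```

Clipping bounds the length of every step by the step size times the threshold, and the projection keeps the iterate feasible regardless of how large the raw gradient was.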

    B Effect of global clipping on variance and bias

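    We focus on (GClip) under stochastic gradients which satisfy Assumption 1.

    Lemma 9. For any $g(x)$ suppose that Assumption 1 holds with $\alpha \in (1, 2]$. If $\mathbb{E}[\|g(x)\|^\alpha] \le G^\alpha$, then the estimator $\hat{g} := \min\{\tau_k/\|g_k\|, 1\}\, g_k$ from (GClip) with clipping parameter $\tau \ge 0$ satisfies:

    $$\mathbb{E}[\|\hat{g}(x)\|^2] \le G^\alpha \tau^{2-\alpha} \quad\text{and}\quad \|\mathbb{E}[\hat{g}(x)] - \nabla f(x)\|^2 \le G^{2\alpha}\tau^{-2(\alpha-1)}.$$

    Proof. First, we bound the variance.

    $$\mathbb{E}[\|\hat{g}(x)\|^2] = \mathbb{E}[\|\hat{g}(x)\|^\alpha \|\hat{g}(x)\|^{2-\alpha}].$$

    By the fact that $\|\hat{g}(x)\| \le \tau$, we get

    $$\mathbb{E}[\|\hat{g}(x)\|^2] \le \mathbb{E}[\|\hat{g}(x)\|^\alpha \tau^{2-\alpha}] \le G^\alpha \tau^{2-\alpha}.$$

    Next, we bound the bias,

    $$\begin{aligned}
    \|\mathbb{E}[\hat{g}(x)] - \nabla f(x)\| &= \|\mathbb{E}[\hat{g}(x) - g(x)]\| \\
    &\le \mathbb{E}[\|\hat{g}(x) - g(x)\|] = \mathbb{E}[\|\hat{g}(x) - g(x)\| \mathbb{1}_{\{\|g(x)\| \ge \tau\}}] \\
    &\le \mathbb{E}[\|g(x)\| \mathbb{1}_{\{\|g(x)\| \ge \tau\}}] \\
    &\le \mathbb{E}[\|g(x)\|^\alpha \mathbb{1}_{\{\|g(x)\| \ge \tau\}}] / \tau^{\alpha-1}.
    \end{aligned}$$

    The first inequality follows by Jensen's inequality. The second inequality follows by the definition of $\hat{g}$. The third inequality follows by $\|g(x)\|^\alpha \mathbb{1}_{\{\|g(x)\| \ge \tau\}} \ge \|g(x)\| \tau^{\alpha-1} \mathbb{1}_{\{\|g(x)\| \ge \tau\}}$. Since the last expression is at most $G^\alpha \tau^{1-\alpha}$, squaring both sides gives the stated bias bound.

    As we increase the clipping parameter $\tau$, note that the variance (the first term in Lemma 9) increases while the bias (the second term) decreases. This way, we can carefully trade off the variance of our estimator against its bias, thereby ensuring convergence of the algorithm.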
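The bias-variance trade-off in Lemma 9 can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions: the noise is a symmetric Pareto-type sample (an arbitrary heavy-tailed choice, not the distribution measured in the experiments), the empirical mean stands in for the true gradient, and the empirical moment stands in for $G^\alpha$. Because both bounds hold sample by sample, they also hold for the empirical distribution.

```python
import random


def clip_scalar(g, tau):
    """One-dimensional clipping: min{tau / |g|, 1} * g."""
    return max(-tau, min(tau, g))


def heavy_tailed_samples(n, tail_index, seed=0):
    """Symmetric Pareto-type samples; tail_index is an assumed illustration value."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(tail_index) for _ in range(n)]


def lemma9_check(samples, alpha, tau):
    """Compare empirical second moment and bias of the clipped estimator to the bounds."""
    n = len(samples)
    moment = sum(abs(g) ** alpha for g in samples) / n   # plays the role of G^alpha
    mean = sum(samples) / n                              # plays the role of grad f(x)
    clipped = [clip_scalar(g, tau) for g in samples]
    return {
        "second_moment": sum(c * c for c in clipped) / n,
        "second_moment_bound": moment * tau ** (2.0 - alpha),
        "bias": abs(sum(clipped) / n - mean),
        "bias_bound": moment * tau ** (1.0 - alpha),
    }


samples = heavy_tailed_samples(2000, tail_index=1.5)
for tau in (1.0, 5.0, 50.0):
    print(tau, lemma9_check(samples, alpha=1.2, tau=tau))
```

Running the loop with increasing thresholds shows the second-moment bound growing while the bias bound shrinks, which is the trade-off used in the convergence proofs.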

    C Non-convex Rates (Proof of Theorem 2)

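    The lemma in the previous section can be readily used in the nonsmooth strongly convex setting. However, we need a variant of Lemma 9 in the smooth case.

    Lemma 10. For any $g(x)$ suppose that Assumption 1 holds with $\alpha \in (1, 2]$. If $\|\nabla f(x)\| \le \tau/2$, then the estimator $\hat{g} := \min\{1, \tau/\|g_k\|\}\, g_k$ from (GClip) with global clipping parameter $\tau \ge 0$ satisfies:

    $$\mathbb{E}[\|\hat{g}(x)\|^2] \le 2\|\nabla f(x)\|^2 + 4\sigma^\alpha \tau^{2-\alpha} \quad\text{and}\quad \|\mathbb{E}[\hat{g}(x)] - \nabla f(x)\|^2 \le 4\sigma^{2\alpha}\tau^{-2(\alpha-1)}.$$

    Proof. First, we bound the variance.

    $$\begin{aligned}
    \mathbb{E}[\|\hat{g}(x)\|^2] &\le \mathbb{E}[2\|\nabla f(x)\|^2 + 2\|\nabla f(x) - \hat{g}(x)\|^2] \\
    &= \mathbb{E}[2\|\nabla f(x)\|^2 + 2\|\nabla f(x) - \hat{g}(x)\|^\alpha \|\nabla f(x) - \hat{g}(x)\|^{2-\alpha}] \\
    &\le \mathbb{E}[2\|\nabla f(x)\|^2 + 2\|\nabla f(x) - \hat{g}(x)\|^\alpha (2\tau)^{2-\alpha}] \\
    &\le 2\|\nabla f(x)\|^2 + 4\tau^{2-\alpha}\, \mathbb{E}[\|\nabla f(x) - g(x)\|^\alpha] \\
    &\le 2\|\nabla f(x)\|^2 + 4\tau^{2-\alpha}\sigma^\alpha.
    \end{aligned}$$

    The expectation is taken with respect to the randomness in the noise. The third line follows by the fact that $\|\nabla f(x) - \hat{g}(x)\| < 2\tau$.

    Next, we bound the bias,

    $$\begin{aligned}
    \|\mathbb{E}[\hat{g}(x)] - \nabla f(x)\| &= \|\mathbb{E}[\hat{g}(x) - g(x)]\| \\
    &\le \mathbb{E}[\,|\|g(x)\| - \tau|\, \mathbb{1}_{\{\|g(x)\| > \tau\}}] \\
    &\le \mathbb{E}[\|g(x) - \nabla f(x)\| \mathbb{1}_{\{\|g(x)\| > \tau\}}] \\
    &\le \mathbb{E}[\|g(x) - \nabla f(x)\| \mathbb{1}_{\{\|g(x) - \nabla f(x)\| > \tau/2\}}] \\
    &\le \mathbb{E}[\|g(x) - \nabla f(x)\|^\alpha] (\tau/2)^{1-\alpha} \le 2\sigma^\alpha \tau^{1-\alpha}.
    \end{aligned}$$

    The last line follows by

    $$\|g(x) - \nabla f(x)\| \mathbb{1}_{\{\|g(x) - \nabla f(x)\| > \tau/2\}} \le \frac{\|g(x) - \nabla f(x)\|^\alpha}{(\tau/2)^{\alpha-1}} \mathbb{1}_{\{\|g(x) - \nabla f(x)\| > \tau/2\}}.$$

    Next, we need a subprocedure from the end of the proof of Lemma 2 in [7].

    Lemma 11 (Lemma 2 in [7]). For any nonzero vector $v \in \mathbb{R}^d$,

    $$\left\langle \frac{v}{\|v\|}, \nabla f(x) \right\rangle \ge \frac{\|\nabla f(x)\|}{3} - \frac{8\|v - \nabla f(x)\|}{3}.$$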
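The inequality of Lemma 11 lower-bounds the alignment between a normalized vector and the gradient in terms of the distance between the two. The following sketch checks it on random pairs; the helper names, the dimension and the scale range are arbitrary choices made for this illustration, and `grad` is just a stand-in vector for the gradient.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def lemma11_gap(v, grad):
    """Left-hand side minus right-hand side of Lemma 11 for a nonzero v."""
    lhs = dot(v, grad) / norm(v)
    diff = [a - b for a, b in zip(v, grad)]
    rhs = norm(grad) / 3.0 - 8.0 * norm(diff) / 3.0
    return lhs - rhs


def smallest_gap(trials=1000, dim=5, seed=0):
    """Smallest gap over random (v, grad) pairs at several relative scales."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = 10.0 ** rng.uniform(-2.0, 2.0)
        v = [gi + scale * rng.gauss(0.0, 1.0) for gi in grad]
        worst = min(worst, lemma11_gap(v, grad))
    return worst


print(smallest_gap())
```

A non-negative smallest gap is what the lemma predicts: when the vector is close to the gradient the alignment is at least a third of the gradient norm, and when it is far the right-hand side is so negative that the bound is trivially satisfied.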

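    Finally, we are ready to show the proof.

    Proof. At each iteration, we consider two cases, either $\|\nabla f(x_k)\| < \tau/2$ or $\|\nabla f(x_k)\| \ge \tau/2$.

    Case 1: $\|\nabla f(x_k)\| < \tau/2$. For simplicity, we denote $\hat{g}_k = \min\{1, \tau/\|g_k\|\}\, g_k$ and the bias $b_k = \mathbb{E}[\hat{g}_k] - \nabla f(x_k)$. By Assumption 3, we have

    $$\begin{aligned}
    f(x_k) &\le f(x_{k-1}) + \langle \nabla f(x_{k-1}), -\eta_{k-1}\hat{g}_{k-1}\rangle + \frac{\eta_{k-1}^2 L}{2}\|\hat{g}_{k-1}\|^2 \\
    &\le f(x_{k-1}) - \eta_{k-1}\|\nabla f(x_{k-1})\|^2 - \eta_{k-1}\langle \nabla f(x_{k-1}), b_{k-1}\rangle + \frac{\eta_{k-1}^2 L}{2}\|\hat{g}_{k-1}\|^2 \\
    &\le f(x_{k-1}) - \eta_{k-1}\|\nabla f(x_{k-1})\|^2 + \frac{\eta_{k-1}}{2}\|\nabla f(x_{k-1})\|^2 + \frac{\eta_{k-1}}{2}\|b_{k-1}\|^2 + \frac{\eta_{k-1}^2 L}{2}\|\hat{g}_{k-1}\|^2.
    \end{aligned}$$

    Here the last step used the AM-GM inequality. Then, taking expectation on both sides and using Lemma 10 gives

    $$\begin{aligned}
    \mathbb{E}[f(x_k) \mid x_{k-1}] &\le f(x_{k-1}) - \left(\frac{\eta_{k-1}}{2} - \eta_{k-1}^2 L\right)\|\nabla f(x_{k-1})\|^2 + 2\eta_{k-1}^2 L \sigma^\alpha \tau^{2-\alpha} + \frac{2\eta_{k-1}\sigma^{2\alpha}}{\tau^{2\alpha-2}} \\
    &\le f(x_{k-1}) - \frac{\eta_{k-1}}{4}\|\nabla f(x_{k-1})\|^2 + 2\eta_{k-1}^2 L \sigma^\alpha \tau^{2-\alpha} + \frac{2\eta_{k-1}\sigma^{2\alpha}}{\tau^{2\alpha-2}}.
    \end{aligned}$$

    In the last step we used $\eta_k = \eta \le \frac{1}{4L}$.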

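    Case 2: $\|\nabla f(x_k)\| > \tau/2$. Recall $\hat{g}_k = \min\{1, \tau/\|g_k\|\}\, g_k$ and the parameter choices

    $$\eta_k = \eta = \min\left\{\frac{1}{4L}, \frac{1}{L\tau^\alpha}, \frac{1}{24L\tau}\right\} \quad\text{and}\quad \tau_k = \tau = \max\left\{2,\ 48^{1/(\alpha-1)}\sigma^{\alpha/(\alpha-1)},\ 8\sigma,\ \sigma K^{\frac{1}{3\alpha-2}}\right\}.$$

    We use $\nabla f$ as a shorthand for $\nabla f(x_k)$ and write $p$ for the probability of the event $\{\|g_k\| \le \tau\}$:

    $$\begin{aligned}
    \mathbb{E}[\langle \nabla f, g_k\rangle \mathbb{1}_{\{\|g_k\| \le \tau\}}] &\ge \mathbb{E}[(\|\nabla f\|^2 - \|\nabla f\|\|g_k - \nabla f\|)\mathbb{1}_{\{\|g_k\| \le \tau\}}] \\
    &\ge \mathbb{E}\big[\|\nabla f\|^2 \mathbb{1}_{\{\|g_k\| \le \tau\}} - \tfrac{1}{2}\|\nabla f\|^2 \mathbb{1}_{\{\|g_k\| \le \tau,\ \|g_k - \nabla f\| \le \tau/4\}} \\
    &\qquad - \|\nabla f\|\|g_k - \nabla f\| \mathbb{1}_{\{\|g_k\| \le \tau,\ \|g_k - \nabla f\| \ge \tau/4\}}\big] \\
    &\ge \frac{p}{2}\|\nabla f\|^2 - \|\nabla f\|\, \mathbb{E}[\|g_k - \nabla f\| \mathbb{1}_{\{\|g_k - \nabla f\| \ge \tau/4\}}] \\
    &\ge \frac{p}{2}\|\nabla f\|^2 - \|\nabla f\| \frac{\sigma^\alpha}{(\tau/4)^{\alpha-1}}. \qquad (2)
    \end{aligned}$$

    The first inequality uses $\langle \nabla f, g_k\rangle = \|\nabla f\|^2 + \langle \nabla f, g_k - \nabla f\rangle$.

    The second line follows by

    $$\|\nabla f\| > \tau/2 \ \text{and}\ \|g_k - \nabla f\| < \tau/4 \implies -\|\nabla f\|\|g_k - \nabla f\| \ge -\|\nabla f\|^2/2.$$

    The last inequality follows by $\sigma^\alpha \ge \mathbb{E}[\|g_k - \nabla f\|^\alpha] \ge \mathbb{E}[\|g_k - \nabla f\| (\tau/4)^{\alpha-1} \mathbb{1}_{\{\|g_k - \nabla f\| \ge \tau/4\}}]$.

    With the above, we get

    $$\begin{aligned}
    \mathbb{E}[\langle \nabla f, \hat{g}_k\rangle] &= \mathbb{E}[\langle \nabla f, g_k\rangle \mathbb{1}_{\{\|g_k\| \le \tau\}}] + \mathbb{E}[\langle \nabla f, g_k/\|g_k\|\rangle \mathbb{1}_{\{\|g_k\| \ge \tau\}}] \\
    &\ge \frac{p}{2}\|\nabla f\|^2 - \|\nabla f\|\frac{\sigma^\alpha}{(\tau/4)^{\alpha-1}} + (1-p)\|\nabla f\|/3 - \frac{8}{3}\mathbb{E}[\|\nabla f - g_k\|] \\
    &\ge \|\nabla f\|/3 - \|\nabla f\|\frac{\sigma^\alpha}{(\tau/4)^{\alpha-1}} - \frac{8\sigma}{3} \\
    &\ge \|\nabla f\|/3 - \|\nabla f\|/12 - \|\nabla f\|/6 \\
    &\ge \|\nabla f\|/12.
    \end{aligned}$$

    The second line follows by Lemma 11 and (2). The third line follows because $\tau \ge 2$ and $\|\nabla f\| \ge \tau/2$ imply $\frac{p}{2}\|\nabla f\|^2 \ge p\|\nabla f\|/3$. Then, by $\tau \ge 48^{1/(\alpha-1)}\sigma^{\alpha/(\alpha-1)}$, we have $\frac{\sigma^\alpha}{(\tau/4)^{\alpha-1}} \le \frac{1}{12}$. By $\tau \ge 8\sigma$, $\frac{8}{3}\sigma \le \tau/3 \le \|\nabla f\|/6$.

    $$\begin{aligned}
    \mathbb{E}[f(x_k)] &\le f(x_{k-1}) + \mathbb{E}[\langle \nabla f(x_{k-1}), -\eta_k \hat{g}_k\rangle] + \frac{\eta_k^2 L}{2}\tau^2 \\
    &\le f(x_{k-1}) - \eta_k\|\nabla f(x_{k-1})\|/12 + \eta_k^2 L \tau \|\nabla f(x_{k-1})\| \\
    &\le f(x_{k-1}) - \eta_k\|\nabla f(x_{k-1})\|/24.
    \end{aligned}$$

    The last inequality above follows by $\eta_k \le \frac{1}{24L\tau}$.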

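    Combining the two cases, we have

    $$\mathbb{E}[f(x_k) \mid x_{k-1}] \le f(x_{k-1}) - \frac{\eta}{24}\min\{\|\nabla f(x_{k-1})\|^2, \|\nabla f(x_{k-1})\|\} + 2\eta^2 L \sigma^\alpha \tau^{2-\alpha} + \frac{2\eta\sigma^{2\alpha}}{\tau^{2\alpha-2}}.$$

    Rearrange and sum the terms above for some fixed step-size and threshold $\tau_k = \tau$ to get

    $$\begin{aligned}
    \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}[\min\{\|\nabla f(x_{k-1})\|^2, \|\nabla f(x_{k-1})\|\}] &\le \frac{24}{\eta K}(f(x_0) - \mathbb{E}[f(x_K)]) + 48\eta L \sigma^\alpha \tau^{2-\alpha} + \frac{48\sigma^{2\alpha}}{\tau^{2\alpha-2}} \\
    &\le \underbrace{\frac{24}{\eta K}(f(x_0) - f^\star)}_{T_1} + \underbrace{48\eta L \sigma^\alpha \tau^{2-\alpha} + \frac{48\sigma^{2\alpha}}{\tau^{2\alpha-2}}}_{T_2}.
    \end{aligned}$$

    Since we use a step-size $\eta \le \frac{1}{L\tau^\alpha}$, we can simplify $T_2$ as

    $$\eta L \sigma^\alpha \tau^{2-\alpha} + \frac{\sigma^{2\alpha}}{\tau^{2\alpha-2}} \le \frac{\sigma^{2\alpha} + \sigma^\alpha}{\tau^{2\alpha-2}}.$$

    Denote $F_0 = f(x_0) - f^\star$ to ease notation. Then, adding $T_2$ back to $T_1$ and using a threshold $\tau \ge \sigma K^{\frac{1}{3\alpha-2}}$ we get

    $$\begin{aligned}
    T_1 + T_2 &\le \frac{24 F_0}{K}(L\tau^\alpha + 4L + 24L\tau) + 48\frac{\sigma^{2\alpha} + \sigma^\alpha}{\tau^{2\alpha-2}} \\
    &\le 48(\sigma^2 + \sigma^{2-\alpha}) K^{\frac{-2\alpha+2}{3\alpha-2}} \\
    &\quad + 24 F_0 L K^{-1}\left(4 + \max\left\{4,\ 48^{\alpha/(\alpha-1)}\sigma^{\alpha^2/(\alpha-1)},\ 64\sigma^\alpha,\ \sigma^\alpha K^{\frac{\alpha}{3\alpha-2}}\right\}\right) \\
    &\quad + 24 F_0 L K^{-1}\left(\max\left\{2,\ 48^{1/(\alpha-1)}\sigma^{\alpha/(\alpha-1)},\ 8\sigma,\ \sigma K^{\frac{1}{3\alpha-2}}\right\}\right) \\
    &= O\left(K^{\frac{-2\alpha+2}{3\alpha-2}}\right).
    \end{aligned}$$

    This proves the statement of the theorem.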
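The parameter choices used in the proof can be collected into a small helper. The sketch below simply evaluates the stated maximum and minimum for given problem constants; the function name and the example constants are hypothetical, and in practice the smoothness and noise constants are unknown, so this is meant to make the formulas concrete rather than to serve as a tuning recipe.

```python
def gclip_parameters(L, sigma, alpha, K):
    """Threshold tau and step size eta as chosen in the non-convex proof.

    L: smoothness constant, sigma: noise moment bound, alpha in (1, 2],
    K: number of iterations.
    """
    tau = max(
        2.0,
        48.0 ** (1.0 / (alpha - 1.0)) * sigma ** (alpha / (alpha - 1.0)),
        8.0 * sigma,
        sigma * K ** (1.0 / (3.0 * alpha - 2.0)),
    )
    eta = min(
        1.0 / (4.0 * L),
        1.0 / (L * tau ** alpha),
        1.0 / (24.0 * L * tau),
    )
    return eta, tau


eta, tau = gclip_parameters(L=1.0, sigma=1.0, alpha=2.0, K=16)
print(eta, tau)
```

For a small iteration budget the constant terms in the maximum dominate the threshold; the term that grows with the number of iterations only takes over once the budget is large.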

    D Strongly-Convex Rates (Proof of Theorem 4)

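    For simplicity, we denote $\hat{g}_k = \min\{\tau_k/\|g_k\|, 1\}\, g_k$ and the bias $b_k = \mathbb{E}[\hat{g}_k] - \nabla f(x_k)$.

    $$\begin{aligned}
    \|x_k - x^*\|^2 &= \|\mathrm{proj}_{\mathcal{X}}(x_{k-1} - \eta_k \hat{g}_{k-1}) - x^*\|^2 \\
    &\le \|x_{k-1} - \eta_k \hat{g}_{k-1} - x^*\|^2 \\
    &= \|x_{k-1} - x^*\|^2 - 2\eta_k\langle x_{k-1} - x^*, \nabla f(x_{k-1})\rangle - 2\eta_k\langle x_{k-1} - x^*, b_{k-1}\rangle + \eta_k^2\|\hat{g}_{k-1}\|^2 \\
    &\le (1 - \mu\eta_k)\|x_{k-1} - x^*\|^2 - 2\eta_k(f(x_{k-1}) - f^*) + 2\eta_k\left(\frac{\mu}{4}\|x_{k-1} - x^*\|^2 + \frac{4}{\mu}\|b_k\|^2\right) + \eta_k^2\|\hat{g}_{k-1}\|^2.
    \end{aligned}$$

    The first inequality follows by the nonexpansivity of projections onto convex sets.

    Rearranging, we get

    $$f(x_{k-1}) - f^* \le \frac{\eta_k^{-1} - \mu/2}{2}\|x_{k-1} - x^*\|^2 - \frac{\eta_k^{-1}}{2}\|x_k - x^*\|^2 + \frac{4}{\mu}\|b_k\|^2 + \frac{\eta_k}{2}\|\hat{g}_{k-1}\|^2.$$

    After taking expectation and applying the inequality from Lemma 9, we get

    $$\mathbb{E}[f(x_{k-1})] - f^* \le \mathbb{E}\left[\frac{\eta_k^{-1} - \mu/2}{2}\|x_{k-1} - x^*\|^2 - \frac{\eta_k^{-1}}{2}\|x_k - x^*\|^2\right] + 4G^{2\alpha}\tau^{2-2\alpha}\mu^{-1} + \eta_k G^\alpha \tau^{2-\alpha}/2.$$

    Then take $\eta_k = \frac{4}{\mu(k+1)}$, $\tau_k = G k^{\frac{1}{\alpha}}$ and multiply both sides by $k$ to get

    $$k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le \frac{\mu}{8}\mathbb{E}\left[k(k-1)\|x_{k-1} - x^*\|^2 - k(k+1)\|x_k - x^*\|^2\right] + 8G^2 k^{\frac{2-\alpha}{\alpha}}\mu^{-1}.$$

    Notice that $\sum_{k=1}^{K} k^{\frac{2-\alpha}{\alpha}} \le \int_0^{K+1} k^{\frac{2-\alpha}{\alpha}}\, dk \le (K+1)^{2/\alpha}$. Summing over $k$, the first term telescopes and we get

    $$\sum_{k=1}^{K} k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le \frac{\mu}{8}\mathbb{E}\left[-K(K+1)\|x_K - x^*\|^2\right] + 8G^2 (K+1)^{\frac{2}{\alpha}}\mu^{-1}.$$

    Divide both sides by $\frac{K(K+1)}{2}$ and we get

    $$\frac{2}{K(K+1)}\sum_{k=1}^{K} k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le 8G^2 K^{-1}(K+1)^{\frac{2-\alpha}{\alpha}}\mu^{-1}.$$

    Notice that for $K \ge 1$, $K^{-1} \le 2(K+1)^{-1}$. We have

    $$\frac{2}{K(K+1)}\sum_{k=1}^{K} k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le 16G^2 (K+1)^{\frac{2-2\alpha}{\alpha}}\mu^{-1}.$$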

    The theorem then follows by Jensen’s inequality.
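The schedule from this proof can be written as a short loop. The sketch below works in one dimension with an interval as the bounded domain, and it returns the weighted average of the iterates with weight proportional to the iteration index; that averaging is an assumption inferred from the weighted sum to which Jensen's inequality is applied. The function names and the noise-free example oracle are illustrative only.

```python
def clip_scalar(g, tau):
    """One-dimensional clipping: min{tau / |g|, 1} * g."""
    return max(-tau, min(tau, g))


def project_interval(y, lo, hi):
    """Projection onto the interval [lo, hi] (assumed bounded convex domain)."""
    return max(lo, min(hi, y))


def proj_gclip_strongly_convex(grad_oracle, x0, mu, G, alpha, K, lo, hi):
    """Projected clipped SGD with eta_k = 4 / (mu (k + 1)) and tau_k = G k^(1/alpha).

    Returns the last iterate and the average of x_0, ..., x_{K-1}
    in which x_{k-1} receives weight proportional to k.
    """
    x = x0
    weighted_sum = 0.0
    for k in range(1, K + 1):
        weighted_sum += k * x                    # weight k on x_{k-1}
        eta = 4.0 / (mu * (k + 1))
        tau = G * k ** (1.0 / alpha)
        g_hat = clip_scalar(grad_oracle(x), tau)
        x = project_interval(x - eta * g_hat, lo, hi)
    x_bar = 2.0 * weighted_sum / (K * (K + 1))
    return x, x_bar


# Noise-free example on f(x) = (x - 0.3)^2 / 2, so the gradient is x - 0.3.
last, average = proj_gclip_strongly_convex(
    lambda x: x - 0.3, x0=0.0, mu=1.0, G=1.0, alpha=1.5, K=50, lo=0.0, hi=0.5
)
print(last, average)
```

The growing threshold means clipping is applied less aggressively over time, while the decaying step size controls the accumulated variance; a stochastic oracle can be passed in place of the noise-free one.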

    E Effect of coordinate-wise moment bound

    We now examine how the rates would change if we replace Assumption 4 with Assumption 2.

    E.1 Convergence of GClip (proof of Corollary 7)

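    We now look at (GClip) under Assumption 2.

    The proofs of both the convex and non-convex rates follow directly from the following lemma.

    Lemma 12. For any $g(x)$ suppose that Assumption 2 holds with $\alpha \in (1, 2]$. Then suppose we have a constant upper bound

    $$\mathbb{E}[\|g(x)\|^\alpha] \le D.$$

    Then $D$ satisfies

    $$d^{\frac{\alpha}{2}-1}\|B\|_\alpha^\alpha \le D \le d^{\alpha/2}\|B\|_\alpha^\alpha.$$

    Proof. Note that the function $(\cdot)^{\alpha/2}$ is concave for $\alpha \in (1, 2]$. Using Jensen's inequality we can rewrite as:

    $$D \ge \mathbb{E}[\|g(x)\|^\alpha] = d^{\alpha/2}\, \mathbb{E}\left[\left(\frac{1}{d}\sum_{i=1}^{d} |g(x)^{(i)}|^2\right)^{\alpha/2}\right] \ge d^{\alpha/2-1}\, \mathbb{E}\left[\sum_{i=1}^{d} |g(x)^{(i)}|^\alpha\right].$$

    Since the right-hand side can be as large as $d^{\frac{\alpha}{2}-1}\|B\|_\alpha^\alpha$, we have our first inequality. On the other hand, we also have an upper bound below:

    $$\begin{aligned}
    \mathbb{E}[\|g(x)\|^\alpha] &= \mathbb{E}\left[\left(\sum_{i=1}^{d} |g(x)^{(i)}|^2\right)^{\alpha/2}\right] \le \mathbb{E}\left[\left(d \left(\max_{i=1}^{d} |g(x)^{(i)}|\right)^2\right)^{\alpha/2}\right] \\
    &\le \mathbb{E}\left[d^{\alpha/2}\left(\max_{i=1}^{d} |g(x)^{(i)}|\right)^\alpha\right] \le \mathbb{E}\left[d^{\alpha/2}\sum_{i=1}^{d} |g(x)^{(i)}|^\alpha\right] \le d^{\alpha/2}\sum_{i=1}^{d} B_i^\alpha,
    \end{aligned}$$

    where $\|B\|_\alpha^\alpha = \sum_{i=1}^{d} B_i^\alpha$. Thus, we have shown that

    $$d^{\frac{\alpha}{2}-1}\|B\|_\alpha^\alpha \le \mathbb{E}[\|g(x)\|^\alpha] \le d^{\alpha/2}\|B\|_\alpha^\alpha.$$

    We know that Jensen's inequality is tight when all the coordinates have equal values. This means that if the noise across the coordinates is linearly correlated the lower bound is tighter, whereas the upper bound is tighter if the coordinates depend upon each other in a more complicated manner or are independent of each other.

    Substituting this bound on G in Theorems 4 and 2 gives us our corollaries.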
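The two norm comparisons behind Lemma 12 already hold for every fixed vector, before any expectation is taken. The sketch below evaluates both sides for a given vector; the function name and the example vectors are illustrative assumptions. A vector with equal coordinates makes the lower comparison tight, which matches the remark about Jensen's inequality.

```python
def norm_sandwich(g, alpha):
    """Return (lower, middle, upper) with lower <= middle <= upper, where
    middle = ||g||_2^alpha and the outer terms rescale sum_i |g_i|^alpha
    by d^(alpha/2 - 1) and d^(alpha/2) respectively."""
    d = len(g)
    middle = sum(gi * gi for gi in g) ** (alpha / 2.0)
    coord_sum = sum(abs(gi) ** alpha for gi in g)
    lower = d ** (alpha / 2.0 - 1.0) * coord_sum
    upper = d ** (alpha / 2.0) * coord_sum
    return lower, middle, upper


# Equal coordinates: the lower comparison is (numerically) an equality.
print(norm_sandwich([1.0, 1.0, 1.0, 1.0], alpha=1.5))
# One dominant coordinate: the vector sits strictly between the two bounds.
print(norm_sandwich([3.0, 0.1, 0.1, 0.1], alpha=1.5))
```

The gap between the two outer terms is exactly a factor of the dimension, which is why the dependence on dimension in the resulting corollaries can be loose for some noise structures.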

    E.2 Convergence of CClip (Proof of Theorem 8)

    The proof relies on a key lemma which captures the bias-variance trade-off under the new noise assumption and coordinate-wise clipping.

    Lemma 13. For any $g(x)$ suppose that Assumption 2 holds with $\alpha \in (1, 2]$. Denote $g_i$ to be the $i$th component of $g(x)$ and $\nabla f(x)_i$ to be the $i$th component of $\nabla f(x)$. Then the estimator $\hat{g}(x) = [\hat{g}_1; \cdots; \hat{g}_d]$ from (CClip) with clipping parameter $\tau = [\tau_1; \tau_2; \cdots; \tau_d]$ satisfies:

    $$\mathbb{E}[|\hat{g}_i|^2] \le B_i^\alpha \tau_i^{2-\alpha} \quad\text{and}\quad |\mathbb{E}[\hat{g}_i] - \nabla f(x)_i|^2 \le B_i^{2\alpha}\tau_i^{-2(\alpha-1)}.$$

    Proof. Apply Lemma 9 to the one-dimensional case in each coordinate.
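The difference between coordinate-wise and global clipping is easy to see on a vector with a single heavy coordinate. In the sketch below the coordinate-wise estimator is assumed to clip each component to its own threshold, which is the one-dimensional form of the global rule applied per coordinate; the function names and numbers are illustrative.

```python
import math


def clip_coordinatewise(g, tau):
    """Clip each coordinate g_i to [-tau_i, tau_i], i.e. min{tau_i / |g_i|, 1} * g_i."""
    return [max(-t, min(t, gi)) for gi, t in zip(g, tau)]


def clip_global(g, tau):
    """Rescale the whole vector so that its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = 1.0 if norm == 0.0 else min(tau / norm, 1.0)
    return [scale * gi for gi in g]


g = [10.0, -0.5, 0.2]   # one coordinate carries a heavy-tailed spike
print(clip_coordinatewise(g, [1.0, 1.0, 1.0]))
print(clip_global(g, 1.0))
```

Coordinate-wise clipping truncates only the spiking coordinate and leaves the well-behaved ones untouched, whereas global clipping shrinks every coordinate by the same factor.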

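    Proof of Theorem 8. The argument mirrors the proof of Theorem 4. For simplicity, we denote $\hat{g}_k = \hat{g}(x_k)$ and the bias $b_k = \mathbb{E}[\hat{g}_k] - \nabla f(x_k)$.

    $$\begin{aligned}
    \|x_k - x^*\|^2 &= \|x_{k-1} - \eta_k \hat{g}_{k-1} - x^*\|^2 \\
    &= \|x_{k-1} - x^*\|^2 - 2\eta_k\langle x_{k-1} - x^*, \nabla f(x_{k-1})\rangle - 2\eta_k\langle x_{k-1} - x^*, b_{k-1}\rangle + \eta_k^2\|\hat{g}_{k-1}\|^2 \\
    &\le (1 - \mu\eta_k)\|x_{k-1} - x^*\|^2 - 2\eta_k(f(x_{k-1}) - f^*) + 2\eta_k\left(\frac{\mu}{4}\|x_{k-1} - x^*\|^2 + \frac{4}{\mu}\|b_k\|^2\right) + \eta_k^2\|\hat{g}_{k-1}\|^2.
    \end{aligned}$$

    Rearranging, we get

    $$f(x_{k-1}) - f^* \le \frac{\eta_k^{-1} - \mu/2}{2}\|x_{k-1} - x^*\|^2 - \frac{\eta_k^{-1}}{2}\|x_k - x^*\|^2 + \frac{4}{\mu}\|b_k\|^2 + \frac{\eta_k}{2}\|\hat{g}_{k-1}\|^2.$$

    After taking expectation and applying the inequality from Lemma 13, we get

    $$\mathbb{E}[f(x_{k-1})] - f^* \le \mathbb{E}\left[\frac{\eta_k^{-1} - \mu/2}{2}\|x_{k-1} - x^*\|^2 - \frac{\eta_k^{-1}}{2}\|x_k - x^*\|^2\right] + \sum_{i=1}^{d}\left(4B_i^{2\alpha}\tau_i^{2-2\alpha}\mu^{-1} + \eta_k B_i^\alpha \tau_i^{2-\alpha}/2\right).$$

    Then take $\eta_k = \frac{4}{\mu(k+1)}$, $\tau_i = B_i k^{\frac{1}{\alpha}}$ and multiply both sides by $k$ to get

    $$k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le \frac{\mu}{8}\mathbb{E}\left[k(k-1)\|x_{k-1} - x^*\|^2 - k(k+1)\|x_k - x^*\|^2\right] + 8\sum_{i=1}^{d} B_i^2 k^{\frac{2-\alpha}{\alpha}}\mu^{-1}.$$

    Notice that $\sum_{k=1}^{K} k^{\frac{2-\alpha}{\alpha}} \le \int_0^{K+1} k^{\frac{2-\alpha}{\alpha}}\, dk \le (K+1)^{2/\alpha}$. Summing over $k$, the first term telescopes and we get

    $$\sum_{k=1}^{K} k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le \frac{\mu}{8}\mathbb{E}\left[-K(K+1)\|x_K - x^*\|^2\right] + 8\sum_{i=1}^{d} B_i^2 (K+1)^{\frac{2}{\alpha}}\mu^{-1}.$$

    Divide both sides by $\frac{K(K+1)}{2}$ and we get

    $$\frac{2}{K(K+1)}\sum_{k=1}^{K} k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le 8\sum_{i=1}^{d} B_i^2 K^{-1}(K+1)^{\frac{2-\alpha}{\alpha}}\mu^{-1}.$$

    Notice that for $K \ge 1$, $K^{-1} \le 2(K+1)^{-1}$. We have

    $$\frac{2}{K(K+1)}\sum_{k=1}^{K} k\left(\mathbb{E}[f(x_{k-1})] - f^*\right) \le 16\sum_{i=1}^{d} B_i^2 (K+1)^{\frac{2-2\alpha}{\alpha}}\mu^{-1}.$$

    The theorem then follows by Jensen's inequality.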

    F Lower Bound (Proof of Theorem 5)

    We consider the following simple one-dimensional function class parameterized by $b$:

    $$\min_{x \in [0, 1/2]} \left\{ f_b(x) = \tfrac{1}{2}(x - b)^2 \right\}, \quad \text{for } b \in [0, 1/2]. \qquad (3)$$

    Also suppose that for $\alpha \in (1, 2]$ and $b \in [0, 1/2]$ the stochastic gradients are of the form:

    $$g(x) \sim \nabla f_b(x) + \chi_b, \quad \mathbb{E}[g(x)] = \nabla f_b(x), \quad\text{and}\quad \mathbb{E}[|g(x)|^\alpha] \le 1. \qquad (4)$$

    Note that the function class (3) has $\mu = 1$ and optimum value $f_b(b) = 0$, and the $\alpha$-moment of the noise in (4) satisfies $G = B \le 1$. Thus, we want to prove the following:

    Theorem 14. For any $\alpha \in (1, 2]$ there exists a distribution $\chi_b$ such that the stochastic gradients satisfy (4). Further, for any (possibly randomized) algorithm $\mathcal{A}$, define $\mathcal{A}_k(f_b + \chi_b)$ to be the output of the algorithm $\mathcal{A}$ after $k$ queries to the stochastic gradient $g(x)$. Then:

    $$\max_{b \in [0, 1/2]} \mathbb{E}[f_b(\mathcal{A}_k(f_b + \chi_b))] \ge \Omega\left(\frac{1}{k^{2(\alpha-1)/\alpha}}\right).$$

    Our lower bound construction is inspired by Theorem 2 of [4]. Let $\mathcal{A}_k(f_b + \chi_b)$ denote the output of any possibly randomized algorithm $\mathcal{A}$ after processing $k$ stochastic gradients of the function $f_b$ (with noise drawn i.i.d. from distribution $\chi_b$). Similarly, let $\mathcal{D}_k(f_b + \chi_b)$ denote the output of a deterministic algorithm after processing the $k$ stochastic gradients. Then from Yao's minimax principle we know that for any fixed distribution $\mathcal{B}$ over $[0, 1/2]$,

    $$\min_{\mathcal{A}} \max_{b \in [0, 1/2]} \mathbb{E}_{\mathcal{A}}[\mathbb{E}_{\chi_b} f_b(\mathcal{A}_k(f_b + \chi_b))] \ge \min_{\mathcal{D}} \mathbb{E}_{b \sim \mathcal{B}}[\mathbb{E}_{\chi_b} f_b(\mathcal{D}_k(f_b + \chi_b))].$$

    Here we denote $\mathbb{E}_{\mathcal{A}}$ to be the expectation over the randomness of the algorithm $\mathcal{A}$ and $\mathbb{E}_{\chi_b}$ to be over the stochasticity of the noise distribution $\chi_b$. Hence, we only have to analyze deterministic algorithms to establish the lower bound. Further, since $\mathcal{D}_k$ is deterministic, for any bijective transformation $h$ which transforms the stochastic gradients, there exists a deterministic algorithm $\tilde{\mathcal{D}}$ such that $\tilde{\mathcal{D}}_k(h(f_b + \chi_b)) = \mathcal{D}_k(f_b + \chi_b)$. This implies that for any bijective transformation $h(\cdot)$ of the gradients:

    $$\min_{\mathcal{D}} \mathbb{E}_{b \sim \mathcal{B}}[\mathbb{E}_{\chi_b} f_b(\mathcal{D}_k(f_b + \chi_b))] = \min_{\mathcal{D}} \mathbb{E}_{b \sim \mathcal{B}}[\mathbb{E}_{\chi_b} f_b(\mathcal{D}_k(h(f_b + \chi_b)))].$$

    In the rest of the proof, we will obtain a lower bound for the right-hand side above.

    We now describe our construction of the three quantities to be defined: the problem distribution $\mathcal{B}$, the noise distribution $\chi_b$, and the bijective mapping $h(\cdot)$. All of our definitions are parameterized by $\alpha \in (1, 2]$ (which is given as input) and by $\epsilon \in (0, 1/8]$ (which represents the desired target accuracy). We will pick $\epsilon$ to be a fixed constant which depends on the problem parameters (e.g. $k$) and should be thought of as being small.

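    • Problem distribution: $\mathcal{B}$ picks $b_0 = 2\epsilon$ or $b_1 = \epsilon$ at random, i.e. $\nu \in \{0, 1\}$ is chosen by an unbiased coin toss and then we pick

    $$b_\nu = (2 - \nu)\epsilon. \qquad (5)$$

    • Noise distribution: Define a constant $\gamma = (4\epsilon)^{1/(\alpha-1)}$ and $p_\nu = (\gamma^\alpha - 2\nu\gamma\epsilon)$. Simple computations verify that $\gamma \in (0, 1/2]$ and that

    $$p_\nu = (4\epsilon)^{\frac{\alpha}{\alpha-1}} - 2\nu(4\epsilon^\alpha)^{\frac{1}{\alpha-1}} = (4 - 2\nu)(4\epsilon^\alpha)^{\frac{1}{\alpha-1}} \in (0, 1).$$

    Then, for a given $\nu \in \{0, 1\}$ the stochastic gradient $g(x)$ is defined as

    $$g(x) = \begin{cases} x - \frac{1}{2\gamma} & \text{with prob. } p_\nu, \\ x & \text{with prob. } 1 - p_\nu. \end{cases} \qquad (6)$$

    To see that we have the correct gradient in expectation, verify that

    $$\mathbb{E}[g(x)] = x - \frac{p_\nu}{2\gamma} = x - \frac{\gamma^{\alpha-1}}{2} + \nu\epsilon = x - (2 - \nu)\epsilon = x - b_\nu = \nabla f_{b_\nu}(x).$$

    Next, to bound the $\alpha$-moment of $g(x)$ we see that

    $$\mathbb{E}[|g(x)|^\alpha] \le \gamma^\alpha \left|x - \frac{1}{2\gamma}\right|^\alpha + x^\alpha \le \frac{1}{2} + \frac{1}{2} = 1.$$

    The above inequality used the bounds that $\alpha \ge 1$, $x \in [0, 1/2]$, and $\gamma \in (0, 1/2]$. Thus $g(x)$ defined in (6) satisfies condition (4).

    • Bijective mapping: Note that here the only unknown variable is $\nu$, which only affects $p_\nu$. Thus the mapping is bijective as long as the frequencies of the events are preserved. Hence given a stochastic gradient $g(x_i)$ the mapping we use is:

    $$h(g(x_i)) = \begin{cases} 1 & \text{if } g(x_i) = x_i - \frac{1}{2\gamma}, \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$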
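The construction in the list above can be evaluated directly to confirm the two properties it relies on: the stochastic gradient is unbiased, and its moment of order alpha is at most one. The sketch below computes the exact expectation and moment of the two-point gradient distribution; the function names and the example values of the accuracy and of alpha are choices made for this illustration.

```python
def hard_instance(eps, alpha, nu):
    """Constants of the construction: gamma, p_nu and the minimizer b_nu."""
    gamma = (4.0 * eps) ** (1.0 / (alpha - 1.0))
    p = gamma ** alpha - 2.0 * nu * gamma * eps
    b = (2 - nu) * eps
    return gamma, p, b


def oracle_mean(x, gamma, p):
    """Exact expectation of the two-point stochastic gradient at x."""
    return p * (x - 1.0 / (2.0 * gamma)) + (1.0 - p) * x


def oracle_alpha_moment(x, gamma, p, alpha):
    """Exact moment of order alpha of the stochastic gradient at x."""
    return p * abs(x - 1.0 / (2.0 * gamma)) ** alpha + (1.0 - p) * abs(x) ** alpha


for nu in (0, 1):
    gamma, p, b = hard_instance(eps=1.0 / 16.0, alpha=2.0, nu=nu)
    for x in (0.0, 0.3, 0.5):
        mean = oracle_mean(x, gamma, p)
        moment = oracle_alpha_moment(x, gamma, p, 2.0)
        print(nu, x, mean - (x - b), moment)
```

The printed difference between the mean and the true gradient is zero up to rounding for both coin outcomes, while the two instances differ only in how often the rare large-magnitude gradient appears.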

    Given the definitions above, the output of algorithm $\mathcal{D}_k$ is thus simply a function of $k$ i.i.d. samples drawn from the Bernoulli distribution with parameter $p_\nu$ (which is denoted by $\mathrm{Bern}(p_\nu)$). We now show how achieving a small optimization error implies being able to guess the value of $\nu$.

    Lemma 15. Suppose we are given problem and noise distributions defined as in (5) and (6), and a bijective mapping $h(\cdot)$ as in (7). Further suppose that there is a deterministic algorithm $\mathcal{D}_k$ whose output after processing $k$ stochastic gradients satisfies

    $$\mathbb{E}_{b \sim \mathcal{B}}[\mathbb{E}_{\chi_b} f_b(\mathcal{D}_k(h(f_b + \chi_b)))] < \epsilon^2/64.$$

    Then, there exists a deterministic function $\tilde{\mathcal{D}}_k$ which, given $k$ independent samples of $\mathrm{Bern}(p_\nu)$, outputs $\nu' = \tilde{\mathcal{D}}_k(\mathrm{Bern}(p_\nu)) \in \{0, 1\}$ such that

    $$\Pr[\tilde{\mathcal{D}}_k(\mathrm{Bern}(p_\nu)) = \nu] \ge \frac{3}{4}.$$

    Proof. Suppose that we are given access to $k$ samples of $\mathrm{Bern}(p_\nu)$. Use these $k$ samples as the input $h(f_b + \chi_b)$ to the procedure $\mathcal{D}_k$ (this is valid as previously discussed), and let the output of $\mathcal{D}_k$ be $x_k^{(\nu)}$. The assumption in the lemma states that

    $$\mathbb{E}_\nu\left[\mathbb{E}_{\chi_b} |x_k^{(\nu)} - b_\nu|^2\right] < \frac{\epsilon^2}{32}, \quad\text{which implies that}\quad \mathbb{E}_{\chi_b} |x_k^{(\nu)} - b_\nu|^2 < \frac{\epsilon^2}{16} \ \text{almost surely.}$$

    Then, using Markov's inequality (and then taking square roots on both sides) gives

    $$\Pr\left[|x_k^{(\nu)} - b_\nu| \ge \frac{\epsilon}{2}\right] \le \frac{1}{4}.$$

    Consider a simple procedure $\tilde{\mathcal{D}}_k$ which outputs $\nu' = 0$ if $x_k^{(\nu)} \ge \frac{3\epsilon}{2}$, and $\nu' = 1$ otherwise. Recall that $|b_0 - b_1| = \epsilon$ with $b_0 = 2\epsilon$ and $b_1 = \epsilon$. With probability $\frac{3}{4}$, $|x_k^{(\nu)} - b_\nu| < \frac{\epsilon}{2}$ and hence the output $\nu'$ is correct.

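    Lemma 15 shows that if the optimization error of $\mathcal{D}_k$ is small, there exists a procedure $\tilde{\mathcal{D}}_k$ which distinguishes between the Bernoulli distributions with parameters $p_0$ and $p_1$ using $k$ samples. To argue that the optimization error is large, one simply has to argue that a large number of samples are required to distinguish between $\mathrm{Bern}(p_0)$ and $\mathrm{Bern}(p_1)$.

    Lemma 16. For any deterministic procedure $\tilde{\mathcal{D}}_k(\mathrm{Bern}(p_\nu))$ which processes $k$ samples of $\mathrm{Bern}(p_\nu)$ and outputs $\nu'$,

    $$\Pr[\nu' = \nu] \le \frac{1}{2} + \sqrt{k(4\epsilon)^{\frac{\alpha}{\alpha-1}}}.$$

    Proof. Here it would be convenient to make the dependence on the samples explicit. Denote $\mathbf{s}_k^{(\nu)} = (s_1^{(\nu)}, \ldots, s_k^{(\nu)}) \in \{0, 1\}^k$ to be the $k$ samples drawn from $\mathrm{Bern}(p_\nu)$ and denote the output as $\nu' = \tilde{\mathcal{D}}(\mathbf{s}_k^{(\nu)})$. With some slight abuse of notation where we use the same symbols to denote the realization and their distributions, we have:

    $$\Pr\left[\tilde{\mathcal{D}}(\mathbf{s}_k^{(\nu)}) = \nu\right] = \frac{1}{2}\Pr\left[\tilde{\mathcal{D}}(\mathbf{s}_k^{(1)}) = 1\right] + \frac{1}{2}\Pr\left[\tilde{\mathcal{D}}(\mathbf{s}_k^{(0)}) = 0\right] = \frac{1}{2} + \frac{1}{2}\mathbb{E}\left[\tilde{\mathcal{D}}(\mathbf{s}_k^{(1)}) - \tilde{\mathcal{D}}(\mathbf{s}_k^{(0)})\right].$$

    Next, using Pinsker's inequality we can upper bound the right-hand side as:

    $$\mathbb{E}\left[\tilde{\mathcal{D}}(\mathbf{s}_k^{(1)}) - \tilde{\mathcal{D}}(\mathbf{s}_k^{(0)})\right] \le \left|\tilde{\mathcal{D}}(\mathbf{s}_k^{(1)}) - \tilde{\mathcal{D}}(\mathbf{s}_k^{(0)})\right|_{TV} \le \sqrt{\frac{1}{2}\mathrm{KL}\left(\tilde{\mathcal{D}}(\mathbf{s}_k^{(1)}), \tilde{\mathcal{D}}(\mathbf{s}_k^{(0)})\right)},$$

    where $|\cdot|_{TV}$ denotes the total-variation distance and $\mathrm{KL}(\cdot, \cdot)$ denotes the KL-divergence. Recall two properties of the KL-divergence: i) for product measures $(p_1, \ldots, p_k)$ and $(q_1, \ldots, q_k)$ defined over the same measurable space,

    $$\mathrm{KL}((p_1, \ldots, p_k), (q_1, \ldots, q_k)) = \sum_{i=1}^{k} \mathrm{KL}(p_i, q_i),$$

    and ii) for any deterministic function $\tilde{\mathcal{D}}$,

    $$\mathrm{KL}(p, q) \ge \mathrm{KL}(\tilde{\mathcal{D}}(p), \tilde{\mathcal{D}}(q)).$$

    Thus, we can simplify as

    $$\begin{aligned}
    \Pr\left[\tilde{\mathcal{D}}(\mathbf{s}_k^{(\nu)}) = \nu\right] &\le \frac{1}{2} + \sqrt{\frac{k}{8}\mathrm{KL}(\mathrm{Bern}(p_1), \mathrm{Bern}(p_0))} \\
    &\le \frac{1}{2} + \sqrt{\frac{k}{8}\frac{(p_0 - p_1)^2}{p_0(1 - p_0)}} \\
    &\le \frac{1}{2} + \sqrt{\frac{k(\gamma\epsilon)^2}{4\gamma^\alpha}} \\
    &= \frac{1}{2} + \sqrt{k\left(4^{(2-1/\alpha)}\epsilon\right)^{\frac{\alpha}{\alpha-1}}}.
    \end{aligned}$$

    Recalling that $\alpha \in (1, 2]$ gives us the statement of the lemma.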

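    If we pick $\epsilon$ to be

    $$\epsilon = \frac{1}{16 k^{(\alpha-1)/\alpha}},$$

    we have that

    $$\frac{1}{2} + \sqrt{k(4\epsilon)^{\frac{\alpha}{\alpha-1}}} < \frac{3}{4}.$$

    Given Lemmas 15 and 16, this implies that for the above choice of $\epsilon$,

    $$\mathbb{E}_{b \sim \mathcal{B}}[\mathbb{E}_{\chi_b} f_b(\mathcal{D}_k(h(f_b + \chi_b)))] \ge \epsilon^2/64 = \frac{1}{2^{14} k^{2(\alpha-1)/\alpha}}.$$

    This finishes the proof of the theorem. Note that the readability of the proof was prioritized over optimality and it is possible to obtain significantly better constants.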
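To get a feel for the size of this lower bound, the constants can be evaluated for concrete values. The sketch below computes the chosen accuracy, the cap on the success probability from Lemma 16 and the resulting lower bound on the expected suboptimality; the function names are hypothetical, and the loop values are arbitrary.

```python
import math


def target_accuracy(k, alpha):
    """The accuracy eps = 1 / (16 k^((alpha - 1) / alpha)) picked in the proof."""
    return 1.0 / (16.0 * k ** ((alpha - 1.0) / alpha))


def guess_probability_cap(k, eps, alpha):
    """Cap 1/2 + sqrt(k (4 eps)^(alpha / (alpha - 1))) on identifying nu."""
    return 0.5 + math.sqrt(k * (4.0 * eps) ** (alpha / (alpha - 1.0)))


def lower_bound(k, alpha):
    """The value eps^2 / 64 = 1 / (2^14 k^(2 (alpha - 1) / alpha))."""
    eps = target_accuracy(k, alpha)
    return eps * eps / 64.0


for alpha in (1.2, 1.5, 2.0):
    for k in (1, 100, 10000):
        eps = target_accuracy(k, alpha)
        print(alpha, k, guess_probability_cap(k, eps, alpha), lower_bound(k, alpha))
```

For a fixed number of gradient queries, a heavier tail (alpha closer to one) makes the bound decay more slowly in the number of queries, which is the qualitative message of the theorem.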

    G Non-convex Lower Bound (Proof of Theorem 6)

    The proof is based on the proof of Theorem 1 in [2]. The only difference is that we assume a bounded $\alpha$-moment of the stochastic oracle instead of bounded variance as in the original proof. We refer readers to [2] for more background and intuition. For convenience, we study the stochastic setting ($K = 1$ in [2]) instead of the batched setting. We denote a $d$-dimensional vector $x$ as $x = [x^{(1)}; \ldots; x^{(d)}]$. Let $\mathrm{support}(x)$ denote the set of coordinates where $x$ is nonzero, i.e.

    $$\mathrm{support}(x) = \{i \in [d] \mid x^{(i)} \ne 0\} \subseteq [d].$$

    Denote $\mathrm{prog}_\beta(x)$ as the highest index whose entry is $\beta$-far from zero:

    $$\mathrm{prog}_\beta(x) = \max\{i \in [d] \mid |x^{(i)}| > \beta\} \in [d].$$
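Both definitions are simple to compute, and a small example makes the role of the threshold concrete. In the sketch below coordinates are indexed from 1 as in the text, and the progress function is assumed to return 0 when no coordinate exceeds the threshold (a convention the display above leaves implicit).

```python
def support(x):
    """Set of 1-indexed coordinates where x is nonzero."""
    return {i for i, xi in enumerate(x, start=1) if xi != 0}


def prog(x, beta):
    """Highest 1-indexed coordinate with |x_i| > beta; 0 if there is none (assumed)."""
    indices = [i for i, xi in enumerate(x, start=1) if abs(xi) > beta]
    return max(indices) if indices else 0


x = [0.9, 0.3, 0.0, 0.0]
print(support(x))
print([prog(x, beta) for beta in (0.0, 0.25, 0.5, 1.0)])
```

Raising the threshold can only lower the reported progress, since fewer coordinates qualify as being far from zero.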

    20

  • Note that the function progβ( · ) is decreasing in β. The function we use to prove the theorem is thesame as in [2, 5]. We denote

    fd(x) = −Ψ(1)Φ(x(1)) +d∑i=2

    (Ψ(−x(i−1))Φ(−x(i))−Ψ(x(i−1))Φ(x(i))

    ), where

    Ψ(x) =

    {0, x ≤ 1/2exp(1− 1(2x−1)2 ), x > 1/2

    ,Φ(x) =√e

    ∫ x−∞

    e−t2

    2 dt.

    The above function satisfies the following important properties,Lemma 17 (Lemma 2 in [2]). The function fd satisfies the following properties,

    1. fd(0)− infx fd(x) ≤ 12d.

    2. fd is L0-smooth, where L0 = 152.

    3. For all x, ‖∇fd(x)‖∞ ≤ 23.

    4. For all x, prog0(∇fd(x)) ≤ prog 12 (x) + 1

    5. For all x, if prog1(x) < d, then ‖∇fd(x)‖2 ≥ 1.

    We also define the stochastic oracle gd(x) as below

    gd(x)(i) =

    (1 + 1

    {i = prog 1

    4(x) + 1

    }(zp− 1))

    ∂x(i)fd(x)

    where z ∼ Bernoulli(p). The stochasticity of gd(x) is only in the (prog 14(x) + 1)th coordinate. It is

    easy to see that gd(x) is a probability-p zero chain as in [2, Definition 2] i.e. it satisfies

    P(∃x, s.t. prog0(gd(x)) = prog 14 (x) + 1

    )≤ p,

    P(∃x, s.t. prog0(gd(x)) > prog 14 (x) + 1

    )= 0.

    The second claim is because progβ( · ) is decreasing in β and

    prog 14(∇fd(x)) ≤ prog0(∇fd(x)) ≤ prog 12 (x) + 1 ≤ prog 14 (x) + 1 .

    The first claim is because if $z = 0$, then we explicitly set the $(\mathrm{prog}_{1/4}(x) + 1)$-th coordinate to 0. The stochastic gradient additionally has a bounded $\alpha$-moment, as we show next.

    Lemma 18. The stochastic oracle above is an unbiased estimator of the true gradient, and for any $\alpha \in (1, 2]$,

    $$E[\|g_d(x)\|^\alpha] \le 2\|\nabla f_d(x)\|^\alpha + \frac{2 \cdot 23^\alpha}{p^{\alpha-1}}.$$
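To make the oracle concrete, the sketch below applies its definition to an arbitrary gradient vector and checks unbiasedness by averaging over the two outcomes of z exactly. The gradient is passed in as a plain list of illustrative values rather than computed from the hard function, to keep the example short. The helper names `prog`, `oracle_given_z`, and `sample_oracle` are hypothetical, and `prog` returning 0 when no entry qualifies is an assumed convention.

```python
import random


def prog(x, beta):
    """Highest 1-based index with |x_(i)| > beta; 0 if none (assumed)."""
    hits = [i + 1 for i, v in enumerate(x) if abs(v) > beta]
    return max(hits) if hits else 0


def oracle_given_z(grad, x, p, z):
    """Oracle output for a fixed Bernoulli outcome z in {0, 1}.

    Only coordinate prog_{1/4}(x) + 1 is rescaled, by 1 + (z/p - 1) = z/p.
    """
    g = list(grad)
    j = prog(x, 0.25)  # noisy coordinate is j + 1 (1-based), index j (0-based)
    if j < len(g):     # no noisy coordinate exists once prog_{1/4}(x) = d
        g[j] *= z / p
    return g


def sample_oracle(grad, x, p, rng=random):
    """Draw z ~ Bernoulli(p) and return the corresponding oracle output."""
    z = 1 if rng.random() < p else 0
    return oracle_given_z(grad, x, p, z)


grad = [-1.5, 0.7, 0.0]  # stand-in for the true gradient (illustrative)
x = [0.4, 0.0, 0.0]      # prog_{1/4}(x) = 1, so coordinate 2 is the noisy one
p = 0.1
g1 = oracle_given_z(grad, x, p, 1)
g0 = oracle_given_z(grad, x, p, 0)
mean = [p * a + (1 - p) * b for a, b in zip(g1, g0)]
print(mean)  # matches grad up to floating-point rounding
```

With z = 0 the noisy coordinate is zeroed out, so a query reveals a new coordinate only with probability p. The price is that the surviving coordinate is inflated by 1/p, which is the source of the heavy tail.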

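    Proof. Unbiasedness is easy to verify. For the bounded $\alpha$-moment, observe that only the $(\mathrm{prog}_{1/4}(x) + 1)$-th coordinate is noisy, and it differs by a factor of $(\frac{z}{p} - 1)$. Hence, we have

    $$\begin{aligned}
    E[\|g_d(x)\|^\alpha] &\le 2\|\nabla f_d(x)\|^\alpha + 2E[\|g_d(x) - \nabla f_d(x)\|^\alpha] \\
    &\le 2\|\nabla f_d(x)\|^\alpha + \|\nabla f_d(x)\|_\infty^\alpha \, E\left[\left|\frac{z}{p} - 1\right|^\alpha\right] \\
    &\le 2\|\nabla f_d(x)\|^\alpha + \|\nabla f_d(x)\|_\infty^\alpha \, \frac{p(1-p)^\alpha + (1-p)p^\alpha}{p^\alpha} \\
    &\le 2\|\nabla f_d(x)\|^\alpha + \frac{2 \cdot 23^\alpha}{p^{\alpha-1}}.
    \end{aligned}$$

    The first inequality follows from Jensen's inequality and the convexity of $\|\cdot\|^\alpha$ for $\alpha \in (1, 2]$: $\|u + v\|^\alpha \le 4\|\frac{u+v}{2}\|^\alpha \le 2(\|u\|^\alpha + \|v\|^\alpha)$ for any $u, v$. The last inequality uses $\|\nabla f_d(x)\|_\infty \le 23$ (Lemma 17.3) and $p(1-p)^\alpha + (1-p)p^\alpha \le 2p$.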
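The only random quantity in the proof is $|z/p - 1|^\alpha$ with $z \sim \mathrm{Bernoulli}(p)$, and its expectation has a closed form. The sketch below evaluates that expectation and compares it with $2/p^{\alpha-1}$ on an arbitrary grid. The function names and grid values are illustrative choices, not taken from the paper.

```python
def noise_moment(p, alpha):
    """Exact E|z/p - 1|^alpha for z ~ Bernoulli(p), with 0 < p <= 1."""
    # z = 1 (probability p) gives |1/p - 1|^alpha; z = 0 gives |0 - 1|^alpha = 1.
    return p * (1.0 / p - 1.0) ** alpha + (1.0 - p)


def moment_bound(p, alpha):
    """Upper bound 2 / p^(alpha - 1) on the moment above."""
    return 2.0 / p ** (alpha - 1.0)


for alpha in (1.1, 1.5, 2.0):
    for p in (0.5, 0.1, 0.001):
        print(alpha, p, noise_moment(p, alpha), moment_bound(p, alpha))
```

For $\alpha = 2$ the moment equals $(1-p)/p$, the variance of $z/p$, which diverges as $p \to 0$. For smaller $\alpha$ the growth in $1/p$ is slower, which is what lets a small $p$ coexist with a fixed $\alpha$-moment budget.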

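    Now we are ready to prove Theorem 6. Given accuracy parameter $\epsilon$, suboptimality $\Delta = f(0) - f^*$, smoothness constant $L$, and bounded $\alpha$-moment $G^\alpha$, we define

    $$f(x) = \frac{L\lambda^2}{152} f_d\left(\frac{x}{\lambda}\right),$$

    where $\lambda = \frac{304\epsilon}{L}$ and $d = \left\lfloor \frac{\Delta L}{7296\epsilon^2} \right\rfloor$. Then,

    $$g(x) = \frac{L\lambda}{152} g_d(x/\lambda) = 2\epsilon \, g_d(x/\lambda).$$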
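The constants 304 and 7296 are chosen so that the rescaled function meets the budgets exactly. By the chain rule the Hessian of $f$ is $L/152$ times that of $f_d$ (so $f$ is $L$-smooth), the gradient scale is $L\lambda/152 = 2\epsilon$, and the suboptimality gap is at most $\frac{L\lambda^2}{152} \cdot 12d \le \Delta$. The sketch below checks this arithmetic exactly with rational numbers. The function name and the sample values of $\epsilon$, $L$, and $\Delta$ are illustrative.

```python
from fractions import Fraction
from math import floor


def hard_instance_constants(eps, L, Delta):
    """Return (lambda, d, gradient scale, gap bound) for the rescaled function."""
    lam = 304 * eps / L
    d = floor(Delta * L / (7296 * eps ** 2))
    grad_scale = L * lam / 152               # multiplies grad f_d and g_d
    gap_bound = L * lam ** 2 / 152 * 12 * d  # uses f_d(0) - inf f_d <= 12 d
    return lam, d, grad_scale, gap_bound


eps, L, Delta = Fraction(1, 100), Fraction(2), Fraction(5)
lam, d, grad_scale, gap_bound = hard_instance_constants(eps, L, Delta)
print(d, grad_scale == 2 * eps, gap_bound <= Delta)  # prints: 13 True True
```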

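    Using Lemma 18, we have

    $$E[\|g(x)\|^\alpha] \le 8\epsilon^\alpha \|\nabla f_d(x/\lambda)\|^\alpha + \frac{5000\epsilon^\alpha}{p^{\alpha-1}}.$$

    When $G \ge 4\sqrt{\Delta L}$, we can set $p = \frac{(5000\epsilon)^{\frac{\alpha}{\alpha-1}}}{(G - 4\sqrt{\Delta L})^{\frac{\alpha}{\alpha-1}}}$ and get $E[\|g(x)\|^\alpha] \le G^\alpha$.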
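The choice of $p$ can be sanity-checked numerically. The sketch below computes $p$ and confirms that the noise term $5000\epsilon^\alpha / p^{\alpha-1}$ stays below $(G - 4\sqrt{\Delta L})^\alpha$. It does not verify the full bound by $G^\alpha$. It assumes a strict inequality $G > 4\sqrt{\Delta L}$ and an $\epsilon$ small enough that $p \le 1$. The function name and sample values are illustrative.

```python
import math


def choose_p(eps, G, Delta, L, alpha):
    """Return (p, margin) with margin = G - 4 * sqrt(Delta * L)."""
    margin = G - 4.0 * math.sqrt(Delta * L)
    if margin <= 0.0:
        raise ValueError("need G > 4 * sqrt(Delta * L)")
    p = (5000.0 * eps / margin) ** (alpha / (alpha - 1.0))
    if p > 1.0:
        raise ValueError("eps is too large for p to be a probability")
    return p, margin


eps, G, Delta, L, alpha = 1e-4, 10.0, 1.0, 1.0, 1.5
p, margin = choose_p(eps, G, Delta, L, alpha)
noise_term = 5000.0 * eps ** alpha / p ** (alpha - 1.0)
print(p, noise_term, margin ** alpha)
```

Substituting $p$ back in gives a noise term of $5000^{1-\alpha}(G - 4\sqrt{\Delta L})^\alpha$, which is below $(G - 4\sqrt{\Delta L})^\alpha$ because $\alpha > 1$.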

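    Let $x_k$ be the output of any zero-respecting algorithm $A$. By [2, Lemma 1], we know that with probability at least $1/2$, $\mathrm{prog}_1(x_k) \le \mathrm{prog}_0(x_k) < d$ for all $k \le \frac{d-1}{2p}$. Now applying Lemma 17.5, we have that for all $k \le \frac{d-1}{2p}$:

    $$E[\|\nabla f(x_k)\|] \ge \frac{1}{2} \cdot \frac{L\lambda}{152} \, E\left[\|\nabla f_d(x_k/\lambda)\| \mid \{\mathrm{prog}_1(x_k) < d\}\right] \ge \epsilon.$$

    Therefore, $E\|\nabla f(x_k)\| \ge \epsilon$ for all

    $$k \le \frac{d-1}{2p} = \frac{(G - 4\sqrt{\Delta L})^{\frac{\alpha}{\alpha-1}} \Delta L}{7296 \times 5000^{\frac{\alpha}{\alpha-1}} \, \epsilon^{2 + \frac{\alpha}{\alpha-1}}} = c(\alpha)(G - 4\sqrt{\Delta L})^{\frac{\alpha}{\alpha-1}} \Delta L \, \epsilon^{-\frac{3\alpha-2}{\alpha-1}}.$$

    By eliminating $\epsilon$, we can rewrite this in terms of $k$. Finally, the techniques from [2, Theorem 3] show how to lift lower bounds for zero-respecting algorithms to any randomized method.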
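To see what eliminating $\epsilon$ yields, treat everything except $\epsilon$ as a constant $C$. The threshold then reads $k \le C\epsilon^{-e}$ with $e = \frac{3\alpha-2}{\alpha-1}$, and choosing $\epsilon = (C/k)^{1/e}$ shows that the expected gradient norm after $k$ steps is at least of order $k^{-(\alpha-1)/(3\alpha-2)}$. The sketch below tabulates both exponents exactly. The function names are illustrative.

```python
from fractions import Fraction


def eps_exponent(alpha):
    """Exponent e such that the bound reads k <= C * eps^(-e)."""
    return (3 * alpha - 2) / (alpha - 1)


def rate_exponent(alpha):
    """Exponent r such that eliminating eps gives a k^(-r) lower bound."""
    return 1 / eps_exponent(alpha)


for alpha in (Fraction(2), Fraction(3, 2), Fraction(11, 10)):
    print(alpha, eps_exponent(alpha), rate_exponent(alpha))
```

For $\alpha = 2$ (bounded variance) the exponents are 4 and 1/4, matching the $\epsilon^{-4}$ complexity of the bounded-variance setting in [2]. As $\alpha \to 1$ the rate exponent tends to 0, so the lower bound decays ever more slowly in $k$.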

    H A Comparison with [25]

    We are not the first to study heavy-tailed noise behavior in neural network training. The novel work by Simsekli et al. [25] studies the noise behavior of AlexNet on CIFAR-10 and observes that the noise does not seem to come from a Gaussian distribution. However, in our AlexNet training with ImageNet data, we observe that the noise histogram looks Gaussian, as in Figure 5(a, b). We believe the difference arises because the authors of [25] treat the noise in each coordinate as an independent scalar noise, as described in the original work on applying the tail index estimator. We, on the other hand, consider the noise as a high-dimensional random vector computed from a minibatch. We are also able to observe heavy-tailed noise if we fix a single minibatch and plot the noise in each dimension, as shown in Figure 5(c). The fact that noise is well concentrated on CIFAR is also observed by Panigrahi et al. [20].

    [Figure 5: three histogram panels (a), (b), (c); x-axis: norm of noise, y-axis: density.]

    Figure 5: (a) Noise histogram of AlexNet on ImageNet data at initialization. (b) Noise histogram of AlexNet on ImageNet data at 5k iterations. (c) The per-dimension noise distribution within a single minibatch at initialization.

    Furthermore, we used the tail index estimator presented in [25] to estimate the tail index of the noise norm distribution. Though some assumptions of the estimator are not satisfied (in our case, the symmetry assumption; in [25], the symmetry assumption and the independence assumption), we think it can serve as an indicator of the “heaviness” of the tail distribution.
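For readers who want to experiment, the sketch below implements a block-sum tail-index estimator of the kind used in [25]. It assumes the estimator has the form $\frac{1}{\hat{\alpha}} = \frac{1}{\log K_1}\left(\frac{1}{K_2}\sum_i \log|Y_i| - \frac{1}{K}\sum_i \log|X_i|\right)$, where $Y_i$ is the sum of the $i$-th block of $K_1$ samples and $K = K_1 K_2$. This form is an assumption to be checked against [25] before use. The function name and the synthetic Gaussian and Cauchy samples are illustrative and are not the gradient noise measured in the paper.

```python
import math
import random


def tail_index(samples, k1):
    """Estimate the tail index alpha from scalar samples using blocks of size k1."""
    k2 = len(samples) // k1
    if k1 < 2 or k2 < 1:
        raise ValueError("need k1 >= 2 and at least k1 samples")
    xs = samples[: k1 * k2]  # drop the incomplete trailing block
    block_sums = [sum(xs[i * k1:(i + 1) * k1]) for i in range(k2)]
    mean_log_y = sum(math.log(abs(y)) for y in block_sums) / k2
    mean_log_x = sum(math.log(abs(v)) for v in xs) / len(xs)
    inv_alpha = (mean_log_y - mean_log_x) / math.log(k1)
    return 1.0 / inv_alpha


rng = random.Random(0)
n = 100_000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
print(tail_index(gaussian, 100))  # expected to be close to 2
print(tail_index(cauchy, 100))    # expected to be close to 1
```

The estimator works because a sum of $K_1$ i.i.d. $\alpha$-stable variables scales like $K_1^{1/\alpha}$, so the gap between the two log-averages is about $\frac{1}{\alpha}\log K_1$.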

    [Figure 6: two histogram panels; x-axis: noise norm, y-axis: density. (a) ImageNet training, $\hat{\alpha} = 1.99$. (b) BERT pretraining, $\hat{\alpha} = 1.08$.]

    Figure 6: Tail index estimation of gradient noise in ImageNet training and BERT training.

    I ACClip in ImageNet Training

    For completeness, we test ACClip on ImageNet training with ResNet50. After hyperparameter tuning for all algorithms, ACClip achieves better performance than Adam but worse performance than SGD. This is as expected because the noise distribution in ImageNet + ResNet50 training is well concentrated. The validation accuracies for SGD, Adam, and ACClip are 0.754, 0.716, and 0.730, respectively.

    [Figure 7: validation loss versus iterations (up to 1e5) for ImageNet with ResNet50; curves: SGD momentum, ADAM, ACClip.]

    Figure 7: Validation loss for ResNet50 trained on ImageNet. SGD outperforms Adam and ACClip.
