
LEARNING HYPERPARAMETERS FOR NEURAL MACHINE TRANSLATION

A Dissertation

Submitted to the Graduate School

of the University of Notre Dame

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

by

Kenton Murray

David Chiang, Director

Graduate Program in Computer Science and Engineering

Notre Dame, Indiana

December 2019


LEARNING HYPERPARAMETERS FOR NEURAL MACHINE TRANSLATION

Abstract

by

Kenton Murray

Machine Translation, the subfield of Computer Science that focuses on translating be-

tween two human languages, has greatly benefited from neural networks. However, these

neural machine translation systems have complicated architectures with many hyperpa-

rameters that need to be manually chosen. Frequently, these are selected either through a

grid search over values, or by using values commonplace in the literature. However, these

are not theoretically justified and the same values are not optimal for all language pairs and

datasets.

Fortunately, the innate structure of the problem allows for optimization of these hyper-

parameters during training. Traditionally, the hyperparameters of a system are chosen and

then a learning algorithm optimizes all of the parameters within the model. In this work,

I propose three methods to learn the optimal hyperparameters during the training of the

model, allowing for one step instead of two. First, I propose using group regularizers to

learn the number, and size of, the hidden neural network layers. Second, I demonstrate

how to use a perceptron-like tuning method to solve known problems of undertranslation

and label bias. Finally, I propose an Expectation-Maximization based method to learn the

optimal vocabulary size and granularity. Using various techniques from machine learn-

ing and numerical optimization, this dissertation covers how to learn hyperparameters of a

Neural Machine Translation system while training the model itself.


CONTENTS

Figures

Tables

Acknowledgements

Chapter 1: Introduction
  1.1 Overview
  1.2 Background
    1.2.1 Machine Translation
    1.2.2 Corpora and Evaluation
    1.2.3 Neural Machine Translation
      1.2.3.1 Vocabulary
      1.2.3.2 Encoder
      1.2.3.3 Decoder
      1.2.3.4 Attention
    1.2.4 Beam Search and Decoding
  1.3 Outline of Proposed Work

Chapter 2: Auto-Sizing
  2.1 Auto-Sizing Neural Networks: with applications to n-gram language models
  2.2 Introduction
  2.3 Method
  2.4 Optimization
    2.4.1 Proximal gradient method
    2.4.2 Group regularization
      2.4.2.1 Group regularization
  2.5 Experiments
    2.5.1 Evaluating perplexity and network size
    2.5.2 A closer look at training
    2.5.3 Evaluating on machine translation
  2.6 Discussion

Chapter 3: Auto-Sizing the Transformer Network
  3.1 Introduction
  3.2 Hyperparameter Search
  3.3 GPU Optimized Proximal Gradient Descent
  3.4 Transformer
    3.4.1 Auto-sizing Transformer
    3.4.2 Random Search
  3.5 Experiments
    3.5.1 Settings
      3.5.1.1 Baseline
      3.5.1.2 Auto-sizing parameters
      3.5.1.3 Random search parameters
    3.5.2 Auto-sizing vs. Random Search
    3.5.3 Training times
    3.5.4 Auto-sizing Sub-Components
      3.5.4.1 FFN matrices and multi-head attention
      3.5.4.2 FFN matrices
      3.5.4.3 Encoder vs. Decoder
    3.5.5 Random Search plus Auto-sizing
  3.6 Conclusion

Chapter 4: Efficiency through Auto-Sizing
  4.1 Introduction
  4.2 Auto-Sizing the Transformer
  4.3 Experiments
    4.3.1 Settings
    4.3.2 Auto-sizing sub-components
    4.3.3 Results
  4.4 Conclusion

Chapter 5: Rigging the Lottery: On the Intersection of Lottery Tickets and Structured Regularization
  5.1 Introduction
  5.2 Background and Related Work
  5.3 Methods
  5.4 Structured Pruning on LeNet
    5.4.1 Structured Pruning on Transformer
  5.5 Experiments
    5.5.1 MNIST
    5.5.2 Machine Translation
  5.6 Conclusion

Chapter 6: Correcting Length Bias in Neural Machine Translation
  6.1 Introduction
  6.2 Problem
    6.2.1 Label bias in sequence labeling
    6.2.2 Length bias in NMT
  6.3 Correcting Length
    6.3.1 Models
    6.3.2 Training
  6.4 Experiments
    6.4.1 Data and settings
    6.4.2 Solving the length problem solves the beam problem
      6.4.2.1 Baseline
      6.4.2.2 Word reward
      6.4.2.3 Wider beam
      6.4.2.4 Short sentences
      6.4.2.5 Length ratio
    6.4.3 Tuning word reward
      6.4.3.1 Sensitivity to word reward value
      6.4.3.2 Optimized word reward values
      6.4.3.3 Tuning time
    6.4.4 Word reward vs. length normalization
  6.5 Conclusion

Chapter 7: An Expectation-Maximization Approach to BPE Segmentation
  7.1 Introduction
  7.2 Background and Related Work
    7.2.1 Neural machine translation
    7.2.2 Handling unknown words
    7.2.3 Hyperparameter search
  7.3 Method
    7.3.1 Model
    7.3.2 Vocabulary
    7.3.3 Training
    7.3.4 Decoding
  7.4 Experiments
    7.4.1 Datasets
    7.4.2 Translation model
    7.4.3 Results
    7.4.4 Stopping condition
  7.5 Conclusion

Chapter 8: Conclusion

Bibliography

FIGURES

1.1 An Attentional Sequence to Sequence Model. The bidirectional encoder is represented in green, the decoder is in blue, and the attention mechanism is in red.

2.1 The (unsquared) ℓ2 norm and ℓ∞ norm both have sharp tips at the origin that encourage sparsity.

2.2 Examples of the two possible cases for the ℓ2 gradient update. Point v is drawn with a hollow dot, and point w is drawn with a solid dot.

2.3 The proximal operator for the ℓ∞ norm (with strength ηλ) decreases the maximal components until the total decrease sums to ηλ. Projection onto the ℓ1-ball (of radius ηλ) decreases each component by an equal amount until they sum to ηλ.

2.4 Number of units in first hidden layer over time, with various starting sizes (λ = 0.1). If we start with too many units, we end up with the same number, although if we start with a smaller number of units, a few are still pruned away.

2.5 Above: Number of units in first hidden layer over time, for various regularization strengths λ. A regularization strength of ≤ 0.01 does not zero out any rows, while a strength of 1 zeros out rows right away. Below: Perplexity over time. The runs with λ ≤ 0.1 have very similar learning curves, whereas λ = 1 is worse from the beginning.

2.6 Evolution of the first hidden layer weight matrix after 1, 5, and 10 iterations (with rows sorted by ℓ∞ norm). A nonlinear color scale is used to show small values more clearly. The four vertical blocks correspond to the four context words. The light bar at the bottom is the rows that are close to zero, and the white bar is the rows that are exactly zero.

3.1 Illustration of Algorithm 2. The shaded area, here with value ηλ = 2, represents how much the ℓ∞ proximal step will remove from a sorted vector.

3.2 Architecture of the Transformer [109]. We apply the auto-sizing method to the feed-forward (blue rectangles) and multi-head attention (orange rectangles) in all n layers of the encoder and decoder. Note that there are residual connections that can allow information and gradients to bypass any layer we are auto-sizing.

4.1 Auto-sizing FFN network. For a row in the parameter matrix W1 that has been driven completely to 0.0 (shown in white), the corresponding column in W2 (shown in blue) no longer has any impact on the model. Both the column and the row can be deleted, thereby shrinking the model.

5.1 Best accuracy at each pruning iteration for MNIST experiments.

5.2 Rows remaining after pruning iterations for MNIST experiments.

5.3 Parameters remaining after pruning iterations for MNIST experiments.

5.4 Accuracy score compared to rows remaining for MNIST experiments.

5.5 Accuracy compared to remaining parameters for MNIST experiments.

5.6 Best validation BLEU score at each pruning iteration for Machine Translation with Arn-Spa MT experiments.

5.7 Rows remaining after pruning iterations for Machine Translation with Arn-Spa MT experiments.

5.8 Parameters remaining after pruning iterations for Machine Translation with Arn-Spa MT experiments.

5.9 BLEU score compared to rows remaining for Machine Translation with Arn-Spa MT experiments.

5.10 BLEU score compared to remaining parameters for Machine Translation with Arn-Spa MT experiments.

6.1 Label bias causes this toy word-by-word translation model to translate French un helicoptere incorrectly to an autogyro.

6.2 A locally normalized model must determine, at each time step, a "budget" for the total remaining log-probability. In this example sentence, "The British women won Olymp ic gold in p airs row ing," the empty translation has initial position 622 in the beam. Already by the third step of decoding, the correct translation has a lower score than the empty translation. However, using greedy search, a nonempty translation would be returned.

6.3 Impact of beam size on BLEU score when varying reference sentence lengths (in words) for Russian–English. The x-axis is cumulative moving right; length 20 includes sentences of length 0-20, while length 10 includes 0-10. As reference length increases, the BLEU scores of a baseline system with beam size of 10 remain nearly constant. However, a baseline system with beam 1000 has a high BLEU score for shorter sentences, but a very low score when the entire test set is used. Our tuned reward and normalized models do not suffer from this problem on the entire test set, but take a slight performance hit on the shortest sentences.

6.4 Histogram of length ratio between generated sentences and gold varied across methods and beam size for Russian–English. Note that the baseline method skews closer to 0 as the beam size increases, while our other methods remain peaked around 1.0. There are a few outliers to the right that have been cut off, as well as the peaks at 0.0 and 1.0.

6.5 Effect of word penalty on BLEU and hypothesis length for Russian–English (top) and German-English (bottom) on 1000 unseen dev examples with beams of 50. Note that the vertical bars represent the word reward that was found during tuning.

7.1 Example application of BPE rules.

7.2 Value of λb for English-to-Vietnamese over the course of training. Early in training, more BPE operations are preferred as this results in shorter sequences. However, as training progresses and the model improves, it is able to deal with finer-grained segmentation.

7.3 Value of λb for Arn-Spa over the course of training (top) and corresponding BLEU score on the development set (bottom). We measure the performance of our model during training by comparing the BLEU score corresponding to the segmentation with the highest λb. Note that early in training, more BPE operations are preferred, but by the end of training, fewer operations result in higher BLEU scores. Even as the model switches between coarseness levels at later epochs, dev BLEU scores are very consistent.

7.4 Value of λb for Japanese-to-Vietnamese over the course of training. Consistent with the other language pairs, early in training, more BPE operations are preferred. As training progresses, the model converges to the optimal value of 4k also found by grid search.

7.5 Value of λb for Uighur-to-English over the course of training. As with the other language pairs, early on in training, the model prefers more BPE operations. However, consistent with the results from the grid search, 64k operations remains the preferred level throughout training.

TABLES

2.1 Comparison of ℓ∞,1 regularization on 2-gram, 3-gram, and 5-gram neural language models. The network initially started with 1,000 units in the first hidden layer and 50 in the second. A regularization strength of λ = 0.1 consistently is able to prune units while maintaining perplexity, even though the final number of units varies considerably across models. The vocabulary size is 100k.

2.2 Results from training a 5-gram neural LM on the AFP portion of the Gigaword dataset. As with the smaller Europarl corpus (Table 2.1), a regularization strength of λ = 0.1 is able to prune units while maintaining perplexity.

2.3 A regularization strength of λ = 0.1 is best across different vocabulary sizes.

2.4 Results using ℓ2,1 regularization.

2.5 The improvements in translation accuracy due to the neural LM (shown in parentheses) are affected only slightly by ℓ∞,1 regularization. For the Europarl LM, there is no statistically significant difference, and for the Gigaword AFP LM, a statistically significant but small decrease of −0.3.

3.1 Number of parallel sentences in training bitexts. The French-English and Arabic-English data is from the 2017 IWSLT campaign [65]. The much smaller Hausa-English and Tigrinya-English data is from the LORELEI project.

3.2 Comparison of BLEU scores, model size, and training time on Tigrinya-English, Hausa-English, and French-English. Model size is the total number of parameters. Training time is measured in seconds. Baseline is the recommended low-resource architecture in fairseq. Random search represents the best model found from 72 (Tigrinya), 40 (Hausa), and 10 (French) different randomly generated architecture hyperparameters. Both auto-sizing methods, on both languages, start with the exact same initialization and number of parameters as the baseline, but converge to much smaller models across all language pairs. On the very low-resource languages of Hausa and Tigrinya auto-sizing finds models with better BLEU scores. Random search is eventually able to find better models on French and Hausa, but is an order of magnitude slower.

3.3 Overall training times in seconds on a Nvidia GeForce GTX 1080Ti GPU for small regularization values. Note that high regularization values will delete too many values and cause training to end sooner. In general, ℓ2,1 regularization does not appreciably slow down training, but ℓ∞,1 can be twice as slow. Per epoch, roughly the same ratios in training times hold.

3.4 BLEU scores and percentage of parameter rows deleted by auto-sizing on various sub-components of the model, across varying strengths of ℓ2,1 regularization. 0.0 refers to the baseline without any regularizer. Blank spaces mean less than 1% of parameters were deleted. In the two very low-resource language pairs (Hausa-English and Tigrinya-English), deleting large portions of the encoder can actually help performance. However, deleting the decoder hurts performance.

3.5 BLEU scores and percentage of model deleted using auto-sizing with various ℓ∞,1 regularization strengths. On the very low-resource language pairs of Hausa-English and Tigrinya-English, auto-sizing the feed-forward networks of the encoder and decoder can improve BLEU scores.

3.6 Test BLEU scores for the models with the best dev perplexity found using random search over number of layers and size of layers. Regularization values of ℓ2,1 = 1.0 and ℓ∞,1 = 10.0 were chosen based on tables 3.4 and 3.5 as they encouraged neurons to be deleted. For the very low-resource language pairs, auto-sizing helped in conjunction with random search.

4.1 Comparison of BLEU scores and model sizes on newstest2014 and newstest2015. Applying auto-sizing to the feed-forward neural network sub-components of the transformer resulted in the most amount of pruning while still maintaining good BLEU scores.

6.1 Results of the Russian–English translation system. We report BLEU and METEOR scores, as well as the ratio of the length of generated sentences compared to the correct translations (length). γ is the word reward score discovered during training. Here, we examine a much larger beam (1000). The beam problem is more pronounced at this scale, with the baseline system losing over 20 BLEU points when increasing the beam from size 10 to 1000. However, both our tuned length reward score and length normalization recover most of this loss.

6.2 Results of the high-resource German–English system. Rows: BLEU, METEOR, length = ratio of output to reference length; γ = learned parameter value. While baseline performance decreases with beam size due to the brevity problem, other methods perform more consistently across beam sizes. Length normalization (norm) gets the best BLEU scores, but similar METEOR scores to the word reward.

6.3 Results of low-resource French–English and English–French systems. Rows: BLEU, METEOR, length = ratio of output to reference length; γ = learned parameter value. While baseline performance decreases with beam size due to the brevity problem, other methods perform more consistently across beam sizes. Word reward gets the best scores in both directions on METEOR. Length normalization (norm) gets the best BLEU scores in Fra-Eng due to the slight bias of BLEU towards shorter translations.

6.4 Tuning time on top of baseline training time. Times are in minutes on 1000 dev examples (German–English) or 892 dev examples (French–English). Due to the much larger model size, we only looked at beam sizes up to 75 for German–English.

6.5 Varying the size of the Russian–English training dataset results in different optimal word reward scores (γ). In all settings, the tuned score alleviates the beam problem. As the datasets get smaller, using a tuned larger beam improves the BLEU score over a smaller tuned beam. This suggests that lower-resource systems are more susceptible to the beam problem.

7.1 Number of training, development, and test sentences in the four parallel datasets used.

7.2 On English-to-Vietnamese, EM is able to find a better coarseness than grid search in less time. Time is in wall-clock seconds. Ops. = the number of BPE operations selected.

7.3 On Mapudungun-to-Spanish, EM is able to find the same coarseness as grid search in slightly less time. Time is in wall-clock seconds. Ops. = the number of BPE operations selected.

7.4 On Uighur-to-English, EM is able to find the same coarseness as grid search in much less time. Time is in wall-clock seconds. Ops. = the number of BPE operations selected.

7.5 On Japanese-to-Vietnamese, EM is able to find a better coarseness than grid search in much less time. Time is in wall-clock seconds. Ops. = the number of BPE operations selected.

ACKNOWLEDGEMENTS

I would like to thank my advisor, David Chiang, for all of the help and insights over

the years. I’ve learned so much and been pushed to take my work to a higher level. This

dissertation would not exist without his extensive support. I would also like to thank my

committee for the many insightful discussions we have had (and for reading this). I’d also

like to thank my labmates: Tomer, Antonis, Arturo, Toan, Bryan, Justin, Xing, and Darcy.


They’ve made the paper deadlines, coursework, and long nights of grad school bearable.

I need to thank all of my collaborators outside of Notre Dame. From various Machine

Translation Marathons: Hieu, Tomas, Xuan, Gaurav, Huda, Jeremy, Marianna, Paul, Kevin,

Marine, Yash. At CMU: Alex, Alex’s advisor Alex, Chris, Christos, Daniel, Sriram, Ed,

Naoki, Skyler, Seth, Nisarga, Manaal, Waleed, Victor, Jeff, Wang. At AI2: Jayant and

Oyvind.

Naturally, I need to thank all of my friends at Notre Dame who have made South

Bend winters manageable. My numerous roommates: Eamonn, Gary, Sharif, Sean, Sal,

James, and Kola. Friends within the department: Joel, Andrey, Sam, Aidan, Gabe, Mar-

tin, Nate, Jeff, Paige, Andrew, Charley, Mandana, Sophia, Pam, PT, John, Trenton, Yizhe,

Louis. From the broader engineering school: Wes, Matt, Skye, Hasan, Arash, Abbas,

Emma, Nico, Nick, Cody. People who have helped me Auto-Size my own neurons through

C2H5OH instead of regularization: Ricardo, Cedric, Magda, Julia, DJ, Jessie, Tony, Crys-

tal, Shireen, Matyas, Pancho, Eleanor, Chris, other Chris, Seckin, Iulia, Lauren, Bryce,

Karly, Farya, Cagri, Steven, John, Justin, Marwa, Erika, Gavi, Elvin, Geneva, Maggie,

Kevin, Lorenzo, Greg, Annie, Leo, Michael, Elis, Nadia, Sana, and Seth.

I would also like to thank my family for all of the support over the many, many, many

years of grad school. It doesn’t just take one person to get through this process and a

supportive family is a big help. Also, I’m sure my friends in South Bend would like to

give a special thanks to my Mom for the support she provided to all of them every time she

visited.


CHAPTER 1

INTRODUCTION

1.1 Overview

Natural Language Processing (NLP) is the subfield of Computer Science that focuses

on the interplay between computers and human languages. It is an area heavily influenced,

by computing, machine learning, statistics, data mining, and linguistics, among a variety

of other disciplines. Human languages are very complex, and NLP itself can be subdivided

into many different specialties.

Machine translation (MT), the study of translating one human language into another,

is one of these subfields. It is often related to many other NLP disciplines, because trans-

lating between languages may require other tasks, such as parsing or segmentation. It is

a large research field, with a long history. Going beyond academia, there are numerous

commercial applications to MT, as well as interest from governments at all levels.

Over the last few years, the field of Natural Language Processing has been rapidly

transformed by techniques from Neural Networks and Deep Learning. This has been par-

ticularly true within machine translation [4, 106]. Many state-of-the-art methods and sys-

tems now rely heavily on neural networks [47, 118]. However, these methods are very

sensitive to hyperparameters, architectures, and settings. Most commonly, best practice

settings have been learned through trial-and-error experimentation, rather than more prin-

cipled theory. Trial and error is unsatisfying because it lacks insight into why the methods

work, and is time-consuming because researchers must run multiple similar experiments.

On the other hand, using settings from previously published papers is unsatisfying because


these may not be optimal for all problems, and it slows down research progress because

evaluations might not be using the best possible configurations. Furthermore, these are

computational and resource intensive undertakings. While this is less of a problem for

large, corporate research labs, it is a more significant challenge for smaller research groups

and academic settings.

A neural machine translation system generally has three main parts. First, an encoder

takes a sentence which has been preprocessed and converts it into a vector representation.

Most commonly, this is a recurrent neural network (RNN) with one or more layers of potentially varying sizes. This vector is then

fed into a decoder to generate a target sentence. This too is generally a type of RNN

with one or more layers of varying sizes. Finally, there is an attention mechanism which

calculates an additional vector based on the encoder states. All of these are trained jointly.

We note that there are numerous hyperparameters that need to be selected for these

models. For instance, the encoder alone requires choosing the number of layers, then

the size of the layers, as well as ways to represent the vocabulary. Similar observations

can be made for other parts of the model. These models are trained by optimizing an

objective function to get the optimal parameters. However, as we have demonstrated in

previous work, the structure of a neural network can be learned jointly with the parameters

by augmenting our objective function [67].

This dissertation covers methods to learn the majority of the hyperparameters in a neu-

ral machine translation system. We will rely on three main classes of optimization methods.

First, we modify the training objective function with various group regularizers. Second, we use a perceptron-based method for solving a model deficiency. Third, we use an expectation-maximization algorithm to learn how to segment the input and output representations of

natural language.

Thesis Statement: Hyperparameters for Neural Machine Translation, though often ig-

nored or under-explored, are integral for translation accuracy. Though time-consuming


and computationally expensive to tune manually, they can be optimized through machine

learning techniques for smaller, faster, and better performing models.


1.2 Background

1.2.1 Machine Translation

Machine translation traces its history back to some of the earliest days of computing.

In 1949, influenced by the success of cryptography during World War II and as the Cold

War began in earnest, Warren Weaver sent a memorandum to colleagues in which he proposed

using these methods on language. Finally published in 1955, and often quoted, he wrote

“When I look at an article in Russian, I say: ‘This is really written in English, but it has

been coded in some strange symbols. I will now proceed to decode.’ ” [115]. Over the

years, numerous paradigms for Machine Translation have been proposed, including Rule

Based and Interlingua [111]. However, methods based on Machine Learning and Statistics have proved to be dominant. The first successful statistics-based systems

were developed a quarter of a century ago [17]. These methods were based on digitized

bitexts - corpora that were aligned at the sentence level. Word level co-occurrence statistics

were used to learn probabilistic translations. About a decade later, phrase based [53, 54]

and hierarchical [21, 22] systems were developed which looked at groupings of words as

well as tree structures. Collectively, these were referred to as Statistical Machine Transla-

tion (SMT), and were state of the art until the recent advent of Neural Machine Translation

(NMT).

1.2.2 Corpora and Evaluation

For both Statistical and Neural Machine Translation, systems are trained on corpora of

bitexts. Most commonly, these are sentence aligned between two languages. The language

being translated from is referred to as the “Source”, and the language being translated to

is the “Target”. These are often abbreviated to “src” and “trg”. Though systems can begin

to learn to translate at the level of tens-of-thousands of training sentences, high resource,

top-performing systems are often trained on millions of sentences.


In addition to the training set, in order to run proper experiments, there is generally

at least one development set and a held-out test set. Both of these are generally on the

order of thousands of aligned sentences. Optionally, there may be monolingual data that is

used to augment these systems - through the use of a language model. This is more of a

necessity for traditional SMT systems, though they can help with NMT (particularly low-

resource) systems as well. For SMT systems, the development set was used for tuning the

potentially thousands of feature weights. For NMT, the development set is used to adjust

learning rates when the model is overfitting the training data.

Evaluating Machine Translation systems is an entire field in and of itself. Ideally, using

humans to judge translations would be best, but this is time-consuming, expensive,

and not practical. There are some opportunities for manual evaluation, such as the WMT

conference, which has humans rank potential translations. However, the most common

way is to use automatic metrics such as BLEU [84]. Over the years, numerous metrics

have been proposed, leveraging everything from paraphrasing [24, 58] to edit rates [104].

However, though it may not correlate as closely with human judgments, BLEU remains

the de facto standard. It is a modified n-gram precision metric that measures word-level

overlap between a candidate translation and the gold translation in the test set.
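To make the n-gram precision component concrete, the following is a minimal sketch; it is illustrative only, omits BLEU's brevity penalty and geometric mean over n-gram orders, and real evaluations should use a standard implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hyp, ref, n):
    """Modified (clipped) n-gram precision: overlap counts are capped by the reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return overlap / max(sum(hyp_counts.values()), 1)

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print([round(clipped_precision(hyp, ref, n), 2) for n in (1, 2, 3, 4)])
```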

1.2.3 Neural Machine Translation

Neural Machine Translation is generally framed as a sequence to sequence model (often

referred to as Seq2Seq) [106]. A sequence of input tokens is passed through one or more

layers of a Neural Network. Generally, a vector or tensor representation of the sentence

is output by this first subnetwork and used to seed another subnetwork. This secondary

portion is referred to as the decoder - harking back to Warren Weaver's initial observation. At

each step of the decoder, an output token is chosen by passing the vector output by the

decoder through a softmax over the vocabulary.


Figure 1.1. An Attentional Sequence to Sequence Model. The bidirectional encoder is represented in green, the decoder is in blue, and the attention mechanism is in red.


1.2.3.1 Vocabulary

The most common way to input natural language tokens into these models is through

a one-hot vector. A fixed vocabulary size is chosen, and each word in the vocabulary is

given a unique integer ID. When a word is passed into the model, it is a vector of all zeros,

except for a one at the ID of the word. The vector has to be of a fixed length, so a limit must

be put on the size of the vocabulary. This is commonly done by thresholding words based

on frequency in the training text. All words below the threshold are assigned to a special

<unk> token. The same process is also applied to the target language, where a one-hot

vector is selected by an argmax over the softmax output.
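As a rough illustration of this frequency thresholding and one-hot lookup, here is a minimal sketch; the toy corpus, cutoff, and helper names are invented for the example.

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Keep the max_size most frequent words; everything else maps to <unk>."""
    counts = Counter(word for sentence in corpus for word in sentence)
    vocab = {"<unk>": 0}
    for word, _ in counts.most_common(max_size - 1):
        vocab[word] = len(vocab)
    return vocab

def to_one_hot(word, vocab):
    """One-hot vector: all zeros except a one at the word's ID."""
    vec = [0] * len(vocab)
    vec[vocab.get(word, vocab["<unk>"])] = 1
    return vec

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus, max_size=4)
print(vocab)
print(to_one_hot("cat", vocab), to_one_hot("zebra", vocab))   # unseen word falls back to <unk>
```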

Using a fixed size vocabulary has numerous downsides. Rare words, such as many

semantically important proper nouns, are often assigned to the <unk> token. Words not

seen in the training set, referred to as out-of-vocabulary or OOV, are as well. This can have

a large impact on performance as translations containing the <unk> token in the middle of

sentences will not have n-gram matches. Heuristics such as dictionary lookups or copying

the source word through can help this problem to a certain degree [45, 63].

Alternatively, NMT systems can translate at subword units. At the most basic level,

only characters can be input into the model. This alleviates OOV issues as there is a fixed

character set. However, the downside is that the sequences are much longer, which makes

the networks slower and harder to train. Common practice today is to use a compromise

between word and character level. Frequently, Byte Pair Encoding, or BPE, is used [99].

We discuss this algorithm in much more detail in Chapter 7 as we propose enhancements to

this model.
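As a preview of the idea (Chapter 7 treats the algorithm properly), this small sketch applies a hand-picked list of merge rules to a single word; the merges and the "@@" continuation marker follow the common subword-nmt convention but are purely illustrative.

```python
def apply_bpe(word, merges):
    """Apply each merge rule, in priority order, to a sequence of symbols."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]   # merge the adjacent pair
            else:
                i += 1
    # mark all but the final symbol as word-internal subwords
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges))   # ['low@@', 'er'] with these toy merges
```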

1.2.3.2 Encoder

The encoder takes the source sentence, X, and compresses it into a vector c. The en-

coder is represented by the green part of the model in figure 1.1, and c is represented

by the arrow going directly to the blue decoder. X is composed of a series of one-hot


vectors x1, ..., xT . Generally, the encoder is an RNN where the next state is defined as

ht = f (xt, ht−1). f represents any type of RNN, with GRUs and LSTMs being the most

frequent. c is generally taken to be the final vector, hT , though other methods exist.
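A toy version of this recurrence, with a plain tanh cell standing in for the GRU or LSTM f above; the sizes and random initialization are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 10, 4

# Parameters of a single-layer tanh RNN standing in for the cell f.
W_xh = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def encode(one_hot_inputs):
    """Run h_t = f(x_t, h_{t-1}) over the source tokens; return all states and c = h_T."""
    h = np.zeros(hidden_size)
    states = []
    for x in one_hot_inputs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states), states[-1]

# Three one-hot source tokens (word IDs 2, 5, and 7 are arbitrary).
X = [np.eye(vocab_size)[i] for i in (2, 5, 7)]
encoder_states, c = encode(X)
print(encoder_states.shape, c.shape)   # (3, 4) and (4,)
```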

1.2.3.3 Decoder

The decoder (the blue part of figure 1.1) generates the target sentence. It too is an RNN,

but the initial state is the vector c output by the encoder. Note that at each step m of the

RNN, not only is the previous step fed in, but also an attention vector cm which is defined

below. The probability of a target sentence translation Y is:

p(Y) = \prod_{m=1}^{M} p(y_m \mid \{y_1, \ldots, y_{m-1}\}, c_m) \qquad (1.1)

This probability is again calculated using some variant of an RNN.
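Equation (1.1) is just the chain rule applied one target token at a time; in log space it becomes a running sum, as in this sketch, where the per-step probabilities are invented stand-ins for the decoder's softmax outputs.

```python
import math

# Hypothetical softmax probabilities assigned to each generated target token,
# i.e. p(y_m | y_1..y_{m-1}, c_m) for m = 1..4.
step_probs = [0.4, 0.7, 0.9, 0.95]

log_p_Y = sum(math.log(p) for p in step_probs)
print(f"log p(Y) = {log_p_Y:.3f}, p(Y) = {math.exp(log_p_Y):.4f}")
```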

1.2.3.4 Attention

One of the downsides of early NMT models was that an entire source sentence was

compressed into a single vector at the end of a sentence. This is an information bottleneck,

and though RNNs are good at encoding sequences, some information was being lost. As

a remedy, an attention mechanism was added to the encoder for use by the decoder [4]. Various atten-

tional mechanisms have been proposed, though Bahdanau’s original model is still widely

used [63].

The attentional model works by taking the output vectors ht at each timestep t of the

encoder. At timestep m in the decoder, the context vector c_m is calculated as:

c_m = \sum_{t=1}^{T} \alpha_{mt} h_t. \qquad (1.2)

α is parameterized as a feedforward neural network. This attentional model is jointly


trained with the rest of the encoder and decoder. In figure 1.1, it is the red part of the

model, which is attending over all of the outputs of the encoder. The intuition behind this

model is that the decoder not only has the information from the previous steps of the RNN,

but also information from each step of the encoder.
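Equation (1.2) amounts to a softmax-weighted average of the encoder states. The sketch below uses a simple dot-product score between the decoder state and each encoder state purely for illustration; as noted above, Bahdanau's α is actually parameterized by a small feedforward network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """c_m = sum_t alpha_{mt} * h_t, with alpha a softmax over per-position scores."""
    scores = encoder_states @ decoder_state      # one score per source timestep t
    alpha = softmax(scores)                      # attention weights, summing to 1
    return alpha @ encoder_states, alpha         # context vector c_m and the weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))    # encoder outputs h_1 .. h_5, hidden size 4
s = rng.normal(size=4)         # current decoder state
c_m, alpha = attention_context(s, H)
print(alpha.round(3), c_m.shape)
```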

1.2.4 Beam Search and Decoding

Decoding, the process by which a trained model is used to translate an input sentence

is intractable. Whereas traditional SMT systems were NP-hard [50], NMT systems assign

probability mass to every token at every step. For a vocabulary of size V , the number

of potential translations is exponential in the length, l, of the sentence: V^l. However, l

is not necessarily known. As such, for both SMT and NMT systems, heuristics are used

at generation time - with the most frequent being beam search. In beam search, at each

time step, the k highest scoring hypotheses are kept. The next time step then proceeds by

expanding all k hypotheses as inputs. The outputs of all hypotheses are then pruned, and

once again, the top k are kept. Beam search is not globally optimal, but it has been shown

to improve performance. If k is infinite, our model is guaranteed to find the highest scoring

hypothesis.

It is worth mentioning that the process by which an NMT system is trained has a funda-

mental mismatch with decoding. At training time, the negative log-likelihood is calculated

between the softmax prediction and the correct word at each timestep. This is the loss that

is backpropagated through the model. There is no beam at this point, and the model has

only seen correct prefixes. This mismatch between training and testing has some negative

properties regarding the beam. This can lead to undertranslation (the brevity problem) and

the wide-beam problem. We address these in more detail in Chapter 6.


1.3 Outline of Proposed Work

This dissertation looks at a few of the most common baseline Neural Machine Trans-

lation architectures - the attentional sequence-to-sequence model and the Transformer net-

work [109]. Both of these rely on encoders, decoders, and attention – though specific

implementations of these sub-components differ. Within these models, there are a few

standard hyperparameters that are chosen.

Encoder Layer Sizes
Number of Encoder Layers
Decoder Layer Sizes
Number of Decoder Layers
Input Vocabulary/BPE Size
Output Vocabulary/BPE Size
Beam Size
Length Penalty
Softmaxes (Feedforward)
Embeddings (Feedforward)

There are other settings and parameters used in training that do not fall under this scope.

Optimizer (SGD/ADAM/etc.)

Learning Rates (Decay and Initial)

Epochs

Minibatch Size

Max Length (just a cutoff, not useful)

Though the previous items are important to training an NMT model, none of these are

inherent properties of the trained model and will not be considered.

Of the settings to consider, we can break this up into three main categories: architec-

tural hyperparameters, input and output hyperparameters, and inference time hyperparam-

eters. The first category, architectural parameters, deals with how the network is structured.

How many neurons are in a layer? How many layers are in the encoder? For this category

of parameters, we introduce methods that modify the training objective function through

regularization. This is covered in Chapters 2 through 5 and has yielded three publications

[67, 71, 72].


The second category of hyperparameters, input and output parameters, deals with the fact

that neural networks expect fixed-sized vectors for their input and output representations

of data. Within natural language processing, and specifically machine translation, this is

the vocabularies of the source and target languages. Due to the long-tailed property of

human languages, subword units are used. Our method for learning the optimal coarseness

of these units is presented in Chapter 7 and is currently under review.

The final category is the parameters used at inference time – after the model has been

trained and is in use. In particular, we focus on the length penalty, which we propose

to learn using a perceptron based algorithm. We also demonstrate that this is inherently

related to the beam size, which is the other major hyperparameter normally chosen to

generate translations using a trained model. We discuss this further in Chapter 6. This

work was published in 2018 at WMT [70].

Hyperparameter                     Method                      Publication
Auto-Sizing Feedforward            Group Regularization        [68]
Encoder Layer Sizes                Group Regularization        [71, 72]
Number of Encoder Layers
Decoder Layer Sizes
Number of Decoder Layers
Input Vocabulary or BPE Size       Expectation-Maximization    Under Review
Output Vocabulary or BPE Size
Beam Size                          Perceptron-like Tuning      [70]
Length Penalty


CHAPTER 2

AUTO-SIZING

We first turn to the hyperparameter for the number of neurons in a layer of a neural

network. This value is chosen before a model is trained and can have implications for the

overall performance of a model. In this chapter, as well as the next three, we focus on

modifying the training objective used to learn the parameter weights of a neural network

by introducing a regularizer.

2.1 Auto-Sizing Neural Networks: with applications to n-gram language models

Neural networks have been shown to improve performance across a range of natural-language tasks. However, designing and training them can be complicated. Frequently, researchers resort to repeated experimentation to pick optimal settings. In this paper, we address the issue of choosing the correct number of units in hidden layers. We introduce a method for automatically adjusting network size by pruning out hidden units through ℓ∞,1 and ℓ2,1 regularization. We apply this method to language modeling and demonstrate its ability to correctly choose the number of hidden units while maintaining perplexity. We also include these models in a machine translation decoder and show that these smaller neural models maintain the significant improvements of their unpruned versions.

2.2 Introduction

Neural networks have proven to be highly effective at many tasks in natural language.

For example, neural language models and joint language/translation models improve ma-

chine translation quality significantly [28, 108]. However, neural networks can be com-

plicated to design and train well. Many decisions need to be made, and performance can


be highly dependent on making them correctly. Yet the optimal settings are non-obvious

and can be laborious to find, often requiring an extensive grid search involving numerous

experiments.

In this section, we focus on the choice of the sizes of hidden layers in a feedforward

neural network. We introduce a method for automatically pruning out hidden layer units,

by adding a sparsity-inducing regularizer that encourages units to deactivate if not needed,

so that they can be removed from the network. Thus, after training with more units than

necessary, a network is produced that has hidden layers correctly sized, saving both time

and memory when actually putting the network to use.

Using a neural n-gram language model [7], we are able to show that our novel auto-

sizing method is able to learn models that are smaller than models trained without the

method, while maintaining nearly the same perplexity. The method has only a single hy-

perparameter to adjust (as opposed to adjusting the sizes of each of the hidden layers), and

we find that the same setting works consistently well across different training data sizes,

vocabulary sizes, and n-gram sizes. In addition, we show that incorporating these models

into a statistical machine translation decoder still results in large BLEU point improve-

ments. The result is that fewer experiments are needed to obtain models that perform well

and are correctly sized.

2.3 Method

Our method is focused on the challenge of choosing the number of units in the hidden

layers of a feedforward neural network. The networks used for different tasks require

different numbers of units, and the layers in a single network also require different numbers

of units. Choosing too few units can impair the performance of the network, and choosing

too many units can lead to overfitting and slow down computations with the network.

The method starts out with a large number of units in each layer and then jointly trains

the network while pruning out individual units when possible. The goal is to end up with a


trained network that also has the optimal number of units in each layer.

We do this by adding a regularizer to the objective function. For simplicity, consider

a single layer without bias, y = f (Wx). Let L(W) be the negative log-likelihood of the

model. Instead of minimizing L(W) alone, we want to minimize L(W) + λR(W), where

R(W) is a convex regularizer. The ℓ1 norm, R(W) = \|W\|_1 = \sum_{i,j} |W_{ij}|, is a common choice

for pushing parameters to zero, which can be useful for preventing overfitting and reducing

model size. However, we are interested not only in reducing the number of parameters but also

the number of units. To do this, we need a different regularizer.

We assume activation functions that satisfy f (0) = 0, such as the hyperbolic tangent or

rectified linear unit ( f (x) = max{0, x}). Then, if we push the incoming weights of a unit yi

to zero, that is, Wi j = 0 for all j (as well as the bias, if any: bi = 0), then yi = f (0) = 0

is independent of the previous layers and contributes nothing to subsequent layers. So the

unit can be removed without affecting the network at all. Therefore, we need a regularizer

that pushes all the incoming connection weights to a unit together towards zero.

Here, we experiment with two, the ℓ2,1 norm and the ℓ∞,1 norm.¹ The ℓ2,1 norm on a matrix W is

R(W) = \sum_i \|W_{i:}\|_2 = \sum_i \Bigl( \sum_j W_{ij}^2 \Bigr)^{1/2}. \qquad (2.1)

(If there are biases bi, they should be included as well.) This puts equal pressure on each

row, but within each row, the larger values contribute more, and therefore there is more

pressure on larger values towards zero. The ℓ∞,1 norm is

R(W) = \sum_i \|W_{i:}\|_\infty = \sum_i \max_j |W_{ij}|. \qquad (2.2)

Again, this puts equal pressure on each row, but within each row, only the maximum value

¹In the notation ℓp,q, the subscript p corresponds to the norm over each group of parameters, and q corresponds to the norm over the group norms. Contrary to more common usage, in this paper, the groups are rows, not columns.


Figure 2.1. The (unsquared) ℓ2 norm and ℓ∞ norm, each plotted over (x1, x2); both have sharp tips at the origin that encourage sparsity.

(or values) matter, and therefore the pressure towards zero is entirely on the maximum

value(s).

Figure 2.1 visualizes the sparsity-inducing behavior of the two regularizers on a single

row. Both have a sharp tip at the origin that encourages all the parameters in a row to

become exactly zero.
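Read as code, equations (2.1) and (2.2) are simple row-wise reductions over the weight matrix; the following is a small numpy sketch for illustration only.

```python
import numpy as np

def l21_norm(W):
    """Equation (2.1): sum over rows of each row's Euclidean (l2) norm."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def linf1_norm(W):
    """Equation (2.2): sum over rows of each row's maximum absolute value."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[0.0, 0.0, 0.0],     # a pruned unit: its row adds nothing to either norm
              [0.5, -2.0, 1.0],
              [3.0, 0.1, -0.1]])
print(l21_norm(W), linf1_norm(W))  # approximately 5.29 and exactly 5.0
```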

2.4 Optimization

However, this also means that sparsity-inducing regularizers are not differentiable at

zero, making gradient-based optimization methods trickier to apply. The methods we use

are discussed in detail elsewhere [32, 33]; in this section, we include a short description of

these methods for completeness.

2.4.1 Proximal gradient method

Most work on learning with regularizers, including this work, can be thought of as

instances of the proximal gradient method [85]. Our objective function can be split into

two parts, a non-convex and differentiable part (L) and a convex but non-differentiable part

(λR). In proximal gradient descent, we alternate between improving L alone and λR alone.

Let u be the parameter values from the previous iteration. We compute new parameter


values $w$ using:

$$v \leftarrow u - \eta \nabla L(u) \qquad (2.3)$$

$$w \leftarrow \arg\min_{w} \left( \frac{1}{2\eta} \|w - v\|^2 + \lambda R(w) \right) \qquad (2.4)$$

and repeat until convergence. The first update is just a standard gradient descent update on

L; the second is known as the proximal operator for λR and in many cases has a closed-

form solution. In the rest of this section, we provide some justification for this method,

and in Sections 2.4.2 and 2.4.2.1 we show how to compute the proximal operator for the

`2 and `∞ norms.
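To make the alternation concrete, a minimal sketch of one training step is shown below. This is our own illustration, assuming a hypothetical prox(row, strength) function implementing whichever closed-form proximal operator is derived in the next subsections.

```python
import torch

def proximal_gradient_step(weights, loss_fn, prox, eta, lam):
    """One iteration: gradient step on L (eq. 2.3), then proximal step on lambda*R (eq. 2.4)."""
    loss = loss_fn(weights)      # L(u), the differentiable part
    loss.backward()
    with torch.no_grad():
        for W in weights:        # each W is a parameter matrix with requires_grad=True
            W -= eta * W.grad    # v = u - eta * grad L(u)
            W.grad.zero_()
            # the group norms are separable over rows, so the proximal
            # operator can be applied to each row independently
            for i in range(W.size(0)):
                W[i] = prox(W[i], eta * lam)
```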

We can think of the gradient descent update (2.3) on $L$ as follows. Approximate $L$ around $u$ by the tangent plane,

$$L(v) = L(u) + \nabla L(u)(v - u) \qquad (2.5)$$

and move $v$ to minimize $L$, but don't move it too far from $u$; that is, minimize

$$F(v) = \frac{1}{2\eta} \|v - u\|^2 + L(v).$$

Setting partial derivatives to zero, we get

$$\frac{\partial F}{\partial v} = \frac{1}{\eta}(v - u) + \nabla L(u) = 0$$

$$v = u - \eta \nabla L(u).$$

By a similar strategy, we can derive the second step (2.4). Again we want to move w to

minimize the objective function, but don’t want to move it too far from u; that is, we want


to minimize:

$$G(w) = \frac{1}{2\eta} \|w - u\|^2 + L(w) + \lambda R(w).$$

Note that we have not approximated $R$ by a tangent plane. We can simplify this by substituting in (2.3). The first term becomes

$$\frac{1}{2\eta} \|w - u\|^2 = \frac{1}{2\eta} \|w - v - \eta \nabla L(u)\|^2 = \frac{1}{2\eta} \|w - v\|^2 - \nabla L(u)(w - v) + \frac{\eta}{2} \|\nabla L(u)\|^2$$

and the second term becomes

$$L(w) = L(u) + \nabla L(u)(w - u) = L(u) + \nabla L(u)(w - v - \eta \nabla L(u)).$$

The $\nabla L(u)(w - v)$ terms cancel out, and we can ignore terms not involving $w$, giving

$$G(w) = \frac{1}{2\eta} \|w - v\|^2 + \lambda R(w) + \text{const.}$$

which is minimized by the update (2.4). Thus, we have split the optimization step into two

easier steps: first, do the update for L (2.3), then do the update for λR (2.4). The latter can

often be done exactly (without approximating R by a tangent plane). We show next how to

do this for the `2 and `∞ norms.

2.4.2 $\ell_{2,1}$ regularization

Since the $\ell_{2,1}$ norm on matrices (2.1) is separable into the $\ell_2$ norm of each row, we can treat each row separately. Thus, for simplicity, assume that we have a single row and want to minimize

$$G(w) = \frac{1}{2\eta} \|w - v\|^2 + \lambda \|w\| + \text{const.}$$

Figure 2.2. Examples of the two possible cases for the $\ell_2$ gradient update ($\|w\| > 0$ and $\|w\| = 0$). Point $v$ is drawn with a hollow dot, and point $w$ is drawn with a solid dot.

The minimum is either at $w = 0$ (the tip of the cone) or where the partial derivatives are zero (Figure 2.2):

$$\frac{\partial G}{\partial w} = \frac{1}{\eta}(w - v) + \lambda \frac{w}{\|w\|} = 0.$$

Clearly, $w$ and $v$ must have the same direction and differ only in magnitude, that is, $w = \alpha \frac{v}{\|v\|}$. Substituting this into the above equation, we get the solution

$$\alpha = \|v\| - \eta\lambda.$$

Therefore the update is

$$w = \alpha \frac{v}{\|v\|}, \qquad \alpha = \max(0, \|v\| - \eta\lambda).$$
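In code, this update is a row-wise shrinkage. The following PyTorch sketch is our own illustration of the closed form above; the released toolkit's implementation may differ in detail.

```python
import torch

def prox_l2_rows(W, strength):
    """Proximal operator of strength * sum_i ||W_i:||_2, applied to every row of W."""
    norms = W.norm(p=2, dim=1, keepdim=True)          # ||v|| for each row
    alpha = torch.clamp(norms - strength, min=0.0)    # max(0, ||v|| - eta*lambda)
    return W * (alpha / norms.clamp(min=1e-12))       # w = alpha * v / ||v||

# usage after a gradient step: W.copy_(prox_l2_rows(W, eta * lam))
```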


Figure 2.3. The proximal operator for the $\ell_\infty$ norm (with strength $\eta\lambda$) decreases the maximal components until the total decrease sums to $\eta\lambda$. Projection onto the $\ell_1$-ball (of radius $\eta\lambda$) decreases each component by an equal amount until they sum to $\eta\lambda$. [Panels: before; $\ell_\infty$ prox. op.; $\ell_1$ projection.]

2.4.2.1 $\ell_{\infty,1}$ regularization

As above, since the $\ell_{\infty,1}$ norm on matrices (2.2) is separable into the $\ell_\infty$ norm of each row, we can treat each row separately; thus, we want to minimize

$$G(w) = \frac{1}{2\eta} \|w - v\|^2 + \lambda \max_j |x_j| + \text{const.}$$

Intuitively, the solution can be characterized as: decrease all of the maximal $|x_j|$ until the total decrease reaches $\eta\lambda$ or all the $x_j$ are zero. See Figure 2.3.

If we pre-sort the $|x_j|$ in nonincreasing order, it's easy to see how to compute this: for $\rho = 1, \ldots, n$, see if there is a value $\xi \le x_\rho$ such that decreasing all of $x_1, \ldots, x_\rho$ to $\xi$ amounts to a total decrease of $\eta\lambda$. The largest $\rho$ for which this is possible gives the correct solution.

But this situation seems similar to another optimization problem, projection onto the

`1-ball, which [33] solve in linear time without pre-sorting. In fact, the two problems can

be solved by nearly identical algorithms, because they are convex conjugates of each other

[3, 32]. Intuitively, the `1 projection of v is exactly what is cut out by the `∞ proximal

operator, and vice versa (Figure 2.3).

Duchi et al.’s algorithm modified for the present problem is shown as Algorithm 1. It


partitions the $x_j$ about a pivot element (line 6) and tests whether it and the elements to its left can be decreased to a value $\xi$ such that the total decrease is $\delta$ (line 8). If so, it recursively searches the right side; if not, the left side. At the conclusion of the algorithm, $\rho$ is set to the largest value that passes the test (line 13), and finally the new $x_j$ are computed (line 16) – the only difference from Duchi et al.'s algorithm.

This algorithm is asymptotically faster than that of [92]. They reformulate `∞,1 regu-

larization as a constrained optimization problem (in which the `∞,1 norm is bounded by µ)

and provide a solution in O(n log n) time. The method shown here is simpler and faster

because it can work on each row separately.

Algorithm 1 Linear-time algorithm for the proximal operator of the `∞ norm.
1: procedure update(w, δ)
2:   lo, hi ← 1, n
3:   s ← 0
4:   while lo ≤ hi do
5:     select md randomly from lo, . . . , hi
6:     ρ ← partition(w, lo, md, hi)
7:     ξ ← (1/ρ)(s + Σ_{i=lo}^{ρ} |x_i| − δ)
8:     if ξ ≤ |x_ρ| then
9:       s ← s + Σ_{i=lo}^{ρ} |x_i|
10:      lo ← ρ + 1
11:    else
12:      hi ← ρ − 1
13:  ρ ← hi
14:  ξ ← (1/ρ)(s − δ)
15:  for i ← 1, . . . , n do
16:    x_i ← min(max(x_i, −ξ), ξ)
17: procedure partition(w, lo, md, hi)
18:   swap x_lo and x_md
19:   i ← lo + 1
20:   for j ← lo + 1, . . . , hi do
21:     if x_j ≥ x_lo then
22:       swap x_i and x_j
23:       i ← i + 1
24:   swap x_lo and x_{i−1}
25:   return i − 1
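For reference, a simpler O(n log n) sort-based sketch of the same proximal operator (ours; not the linear-time procedure of Algorithm 1) makes the "decrease the maximal values" picture explicit. Applying it to each row of W with delta = ηλ gives the ℓ∞,1 proximal step.

```python
import torch

def prox_linf_sorted(x, delta):
    """Decrease the largest |x_j| until the total decrease is delta (= eta * lambda)."""
    if delta <= 0:
        return x.clone()
    a = x.abs()
    if a.sum() <= delta:
        return torch.zeros_like(x)              # everything gets cut off
    s, _ = torch.sort(a, descending=True)
    k = torch.arange(1, s.numel() + 1, dtype=s.dtype)
    xi = (torch.cumsum(s, dim=0) - delta) / k   # level reached if the top k values are lowered together
    rho = (s > xi).nonzero().max()              # largest k whose level still lies below s_k
    return torch.sign(x) * torch.minimum(a, xi[rho])
```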


               2-gram                 3-gram                 5-gram
λ        layer 1  layer 2  ppl   layer 1  layer 2  ppl   layer 1  layer 2  ppl
0          1,000       50  103     1,000       50   66     1,000       50   55
0.001      1,000       50  104     1,000       50   66     1,000       50   54
0.01       1,000       50  104     1,000       50   63     1,000       50   55
0.1          499       47  105       652       49   66       784       50   55
1.0           50       24  111       128       32   76       144       29   68

Table 2.1: Comparison of `∞,1 regularization on 2-gram, 3-gram, and 5-gram neural language models. The network initially started with 1,000 units in the first hidden layer and 50 in the second. A regularization strength of λ = 0.1 consistently is able to prune units while maintaining perplexity, even though the final number of units varies considerably across models. The vocabulary size is 100k.

2.5 Experiments

We evaluate our model using the open-source NPLM toolkit released by ISI, extending

it to use the additional regularizers as described in this paper [108].2 We use a vocabulary

size of 100k and word embeddings with 50 dimensions. We use two hidden layers of

rectified linear units [74].

We train neural language models (LMs) on two natural language corpora, Europarl v7

English and the AFP portion of English Gigaword 5. After tokenization, Europarl has 56M

tokens and Gigaword AFP has 870M tokens. For both corpora, we hold out a validation

set of 5,000 tokens. We train each model for 10 iterations over the training data.

Our experiments break down into three parts. First, we look at the impact of our pruning

method on perplexity of a held-out validation set, across a variety of settings. Second,

we take a closer look at how the model evolves through the training process. Finally,

we explore the downstream impact of our method on a statistical phrase-based machine

translation system.

2These extensions have been contributed to the NPLM project.


2.5.1 Evaluating perplexity and network size

We first look at the impact that the `∞,1 regularizer has on the perplexity of our vali-

dation set. The main results are shown in Table 2.1. For λ ≤ 0.01, the regularizer seems

to have little impact: no hidden units are pruned, and perplexity is also not affected. For

λ = 1, on the other hand, most hidden units are pruned – apparently too many, since per-

plexity is worse. But for λ = 0.1, we see that we are able to prune out many hidden units:

up to half of the first layer, with little impact on perplexity. We found this to be consistent

across all our experiments, varying n-gram size, initial hidden layer size, and vocabulary

size.

Table 2.2 shows the same information for 5-gram models trained on the larger Giga-

word AFP corpus. These numbers look very similar to those on Europarl: again λ = 0.1

works best, and, counter to expectation, even the final number of units is similar.

Table 2.3 shows the result of varying the vocabulary size: again λ = 0.1 works best,

and, although it is not shown in the table, we also found that the final number of units did

not depend strongly on the vocabulary size.

Table 2.4 shows results using the `2,1 norm (Europarl corpus, 5-grams, 100k vocabu-

lary). Since this is a different regularizer, there isn’t any reason to expect that λ behaves

the same way, and indeed, a smaller value of λ seems to work best.

2.5.2 A closer look at training

We also studied the evolution of the network over the training process to gain some

insights into how the method works. The first question we want to answer is whether the

method is simply removing units, or converging on an optimal number of units. Figure 2.4

suggests that it is a little of both: if we start with too many units (900 or 1000), the method

converges to the same number regardless of how many extra units there were initially. But

if we start with a smaller number of units, the method still prunes away about 50 units.

Next, we look at the behavior over time of different regularization strengths λ. We


λ        layer 1   layer 2   perplexity
0          1,000        50          100
0.001      1,000        50           99
0.01       1,000        50          101
0.1          742        50          107
1.0           24        17          173

TABLE 2.2
RESULTS FROM TRAINING A 5-GRAM NEURAL LM ON THE AFP PORTION OF THE GIGAWORD DATASET. AS WITH THE SMALLER EUROPARL CORPUS (TABLE 2.1), A REGULARIZATION STRENGTH OF λ = 0.1 IS ABLE TO PRUNE UNITS WHILE MAINTAINING PERPLEXITY.


           vocabulary size
λ        10k   25k   50k   100k
0         47    60    54     55
0.001     47    54    54     54
0.01      47    58    55     55
0.1       48    62    55     55
1.0       61    64    65     68

TABLE 2.3
A REGULARIZATION STRENGTH OF λ = 0.1 IS BEST ACROSS DIFFERENT VOCABULARY SIZES.

λ        layer 1   layer 2   perplexity
0          1,000        50          100
0.0001     1,000        50           54
0.001      1,000        50           55
0.01         616        50           57
0.1          199        32           65

TABLE 2.4
RESULTS USING `2,1 REGULARIZATION.


Figure 2.4. Number of units in the first hidden layer over time (y-axis: nonzero units in hidden layer 1; x-axis: epoch), with various starting sizes (λ = 0.1). If we start with too many units, we end up with the same number, although if we start with a smaller number of units, a few are still pruned away. [Figure: curves for starting sizes 1000, 900, 800, and 700.]


Figure 2.5. Above: number of units in the first hidden layer over time (x-axis: epoch), for various regularization strengths λ. A regularization strength of ≤ 0.01 does not zero out any rows, while a strength of 1 zeros out rows right away. Below: perplexity over time. The runs with λ ≤ 0.1 have very similar learning curves, whereas λ = 1 is worse from the beginning. [Figure: curves for λ ≤ 0.01, λ = 0.1, and λ = 1.]


Figure 2.6: Evolution of the first hidden layer weight matrix after 1, 5, and 10 iterations (with rows sorted by `∞ norm). A nonlinear color scale is used to show small values more clearly. The four vertical blocks correspond to the four context words. The light bar at the bottom is the rows that are close to zero, and the white bar is the rows that are exactly zero.

found that not only does λ = 1 prune out too many units, it does so at the very first

iteration (Figure 2.5, above), perhaps prematurely. By contrast, the λ = 0.1 run prunes out

units gradually. By plotting these curves together with perplexity (Figure 2.5, below), we

can see that the λ = 0.1 run is fitting the model and pruning it at the same time, which

seems preferable to fitting without any pruning (λ = 0.01) or pruning first and then fitting

(λ = 1).

We can also visualize the weight matrix itself over time (Figure 2.6), for λ = 0.1. It is

striking that although this setting fits the model and prunes it at the same time, as argued

above, by the first iteration it already seems to have decided roughly how many units it will

eventually prune.

2.5.3 Evaluating on machine translation

We also looked at the impact of our method on statistical machine translation systems.

We used the Moses toolkit [54] to build a phrase based machine translation system with a

traditional 5-gram LM trained on the target side of our bitext. We augmented this system

with neural LMs trained on the Europarl data and the Gigaword AFP data. Based on the

results from the perplexity experiments, we looked at models both built with a λ = 0.1


                      neural LM
λ           none    Europarl       Gigaword AFP
0 (none)    23.2    24.7 (+1.5)    25.2 (+2.0)
0.1           –     24.6 (+1.4)    24.9 (+1.7)

TABLE 2.5
THE IMPROVEMENTS IN TRANSLATION ACCURACY DUE TO THE NEURAL LM (SHOWN IN PARENTHESES) ARE AFFECTED ONLY SLIGHTLY BY `∞,1 REGULARIZATION. FOR THE EUROPARL LM, THERE IS NO STATISTICALLY SIGNIFICANT DIFFERENCE, AND FOR THE GIGAWORD AFP LM, A STATISTICALLY SIGNIFICANT BUT SMALL DECREASE OF −0.3.

regularizer, and without regularization (λ = 0).

We built our system using the newscommentary dataset v8. We tuned our model us-

ing newstest13 and evaluated using newstest14. After standard cleaning and tokenization,

there were 155k parallel sentences in the newscommentary dataset, and 3,000 sentences

each for the tuning and test sets.

Table 2.5 shows that the addition of a neural LM helps substantially over the base-

line, with improvements of up to 2 BLEU. Using the Europarl model, the BLEU scores

obtained without and with regularization were not significantly different (p ≥ 0.05), con-

sistent with the negligible perplexity difference between these models. On the Gigaword

AFP model, regularization did decrease the BLEU score by 0.3, consistent with the small

perplexity increase of the regularized model. The decrease is statistically significant, but

small compared with the overall benefit of adding a neural LM.


2.6 Discussion

In this section, we have demonstrated the ability of a group regularizer to prune out

nodes in a feedforward neural network during training of the network. The method is

sensitive to the strength of the regularizer, but the same weight can be used across multiple

layers. This regularizer is added to the standard loss function for which we are optimizing.

There is some added complexity in order to incorporate a non-differentiable regularizer,

but we are able to overcome this through proximal gradient methods.

In this chapter, we have shown that the method works on a feedforward neural proba-

bilistic language model. We have shown that this can be added into a traditional statistical

machine translation system. The basic idea behind this method can be used on a wider

range of tasks, and it is this underlying principle that will serve as the foundation for our

work in Chapter 3, in a broader neural machine translation system.


CHAPTER 3

AUTO-SIZING THE TRANSFORMER NETWORK

In the previous chapter, we presented a method to prune neurons in a feedforward neural network through regularization. We demonstrated that the learned, smaller networks can be integrated into a statistical machine translation system. In this

chapter, we extend this to a much more complicated, and state-of-the-art, neural architec-

ture called the Transformer.

Neural sequence-to-sequence models, particularly the Transformer, are the state of the art in machine translation. Yet these neural networks are very sensitive to architecture and hyperparameter settings. Optimizing these settings by grid or random search is computationally expensive because it requires many training runs. In this paper, we incorporate architecture search into a single training run through auto-sizing, which uses regularization to delete neurons in a network over the course of training. On very low-resource language pairs, we show that auto-sizing can improve BLEU scores by up to 3.9 points while removing one-third of the parameters from the model.

3.1 Introduction

Encoder-decoder based neural network models are the state-of-the-art in machine trans-

lation. However, these models are very dependent on selecting optimal hyperparameters

and architectures. This problem is exacerbated in very low-resource data settings where the

potential to overfit is high. Unfortunately, these searches are computationally expensive.

For instance, Britz et al. [16] used over 250,000 GPU hours to compare various recurrent

neural network based encoders and decoders for machine translation. Strubell et al. [105]

demonstrated that neural architecture search for a large NLP model emits over four times

the carbon dioxide relative to a car over its entire lifetime.


Unfortunately, optimal settings are highly dependent on both the model and the task,

which means that this process must be repeated often. As a case in point, the Transformer

architecture has become the best performing encoder-decoder model for machine transla-

tion [109], displacing RNN-based models [4] along with much conventional wisdom about

how to train such models. Vaswani et al. ran experiments varying numerous hyperparam-

eters of the Transformer, but only on high-resource datasets among linguistically similar

languages. Popel and Bojar [89] explored ways to train Transformer networks, but only on

a high-resource dataset in one language pair. Less work has been devoted to finding best

practices for smaller datasets and linguistically divergent language pairs.

In this chapter, we apply auto-sizing [69], which is a type of architecture search con-

ducted during training, to the Transformer. We show that it is effective on very low-

resource datasets and can reduce model size significantly, while being substantially faster

than other architecture search methods. We make three main contributions.

1. We demonstrate the effectiveness of auto-sizing on the Transformer network by sig-

nificantly reducing model size, even though the number of parameters in the Transformer

is orders of magnitude larger than previous natural language processing applications of

auto-sizing.

2. We demonstrate the effectiveness of auto-sizing on translation quality in very low-

resource settings. On four out of five language pairs, we obtain improvements in BLEU

over a recommended low-resource baseline architecture. Furthermore, we are able to do

so an order of magnitude faster than random search.

3. We release GPU-enabled implementations of proximal operators used for auto-sizing.

Previous authors [15, 34] have given efficient algorithms, but they don’t necessarily par-

allelize well on GPUs. Our variations are optimized for GPUs and are implemented as a

general toolkit and are released as open-source software.1

1https://github.com/KentonMurray/ProxGradPytorch


3.2 Hyperparameter Search

While the parameters of a neural network are optimized by gradient-based training

methods, hyperparameters are values that are typically fixed before training begins, such

as layer sizes and learning rates, and can strongly influence the outcome of training. Hy-

perparameter optimization is a search over the possible choices of hyperparameters for a

neural network, with the objective of minimizing some cost function (e.g., error, time to

convergence, etc.). Hyperparameters may be selected using a variety of methods, most of-

ten manual tuning, grid search [30], or random search [8]. Other methods, such as Bayesian

optimization [9, 103], genetic algorithms [6, 36, 114], and hypergradient updates [64], at-

tempt to direct the selection process based on the objective function. All of these methods

require training a large number of networks with different hyperparameter settings.

In the previous chapter and in Murray and Chiang [69], we focused on the narrow case

of two hidden layers in a feed-forward neural network with a rectified linear unit activation.

In this chapter, we look at the broader case of all of the non-embedding parameter matrices

in the encoder and decoder of the Transformer network.

3.3 GPU Optimized Proximal Gradient Descent

Murray and Chiang [69] train a neural network while using a regularizer to prune units

from the network, minimizing:

$$L = -\sum_{f, e \,\text{in data}} \log P(e \mid f; W) + \lambda R(W),$$

where W are the parameters of the model and R is a regularizer. For simplicity, assume

that the parameters form a single matrix W of weights. Murray and Chiang [69] try two


regularizers:

$$R(W) = \sum_i \Bigl( \sum_j W_{ij}^2 \Bigr)^{1/2} \qquad (\ell_{2,1})$$

$$R(W) = \sum_i \max_j |W_{ij}| \qquad (\ell_{\infty,1})$$

The optimization is done using proximal gradient descent [85], which alternates between stochastic gradient descent steps and proximal steps:

$$W \leftarrow W - \eta \nabla \log P(e \mid f; W)$$

$$W \leftarrow \arg\min_{W'} \left( \frac{1}{2\eta} \|W - W'\|^2 + R(W') \right)$$

Algorithm 2 Parallel `∞ proximal step
Require: Vector v with n elements
Ensure: Decrease the largest absolute value in v until the total decrease is ηλ
1: v_i ← |v_i|
2: sort v in decreasing order
3: δ_i ← v_i − v_{i+1};  δ_n ← v_n
4: c_i ← Σ_{i′=1}^{i} i′ δ_{i′}        ▷ prefix sum
5: b_i ← (1/i) (clip_{[c_{i−1}, c_i]}(ηλ) − c_{i−1})
6: p_i ← Σ_{i′=i}^{n} b_{i′}          ▷ suffix sum
7: v ← v − p
8: restore order and signs of v

To perform the proximal step for the `∞,1 norm, they rely on a quickselect-like algorithm

that runs in O(n) time [34]. However, this algorithm does not parallelize well. Instead, we

use Algorithm 2, which is similar to that of Quattoni et al. [93], on each row of W.

The algorithm starts by taking the absolute value of each entry and sorting the entries in decreasing order. Figure 3.1a shows a histogram of sorted absolute values of an example $v$. Intuitively, the goal of the algorithm is to cut a piece off the top with area $\eta\lambda$ (in the figure, shaded gray).

Figure 3.1. Illustration of Algorithm 2. The shaded area, here with value $\eta\lambda = 2$, represents how much the $\ell_\infty$ proximal step will remove from a sorted vector. [Figure: panels (a) and (b) show the sorted values $v_1, v_2, v_3$ with gaps $\delta_i$ and decrements $b_i$.]

We can also imagine the same shape as a stack of horizontal layers (Figure 3.1b), each $i$ wide and $\delta_i$ high, with area $i\delta_i$; then $c_i$ is the cumulative area of the top $i$ layers. This view makes it easier to compute where the cutoff should be. Let $k$ be the index such that $\eta\lambda$ lies between $c_{k-1}$ and $c_k$. Then $b_i = \delta_i$ for $i < k$; $b_k = \frac{1}{k}(\eta\lambda - c_{k-1})$; and $b_i = 0$ for $i > k$. In other words, $b_i$ is how much height of the $i$th layer should be cut off.

Finally, returning to Figure 3.1b, $p_i$ is the amount by which $v_i$ should be decreased (the height of the gray bars). (The vector $p$ also happens to be the projection of $v$ onto the $\ell_1$ ball of radius $\eta\lambda$.)

Although this algorithm is less efficient than the quickselect-like algorithm when run

in serial, the sort in line 2 and the cumulative sums in lines 4 and 6 [56] can be parallelized

to run in O(log n) passes each.
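A PyTorch rendering of Algorithm 2 for a single row is shown below. This is our own sketch; the released toolkit batches the same operations over all rows of a matrix at once.

```python
import torch

def prox_linf_parallel(v, step):
    """Algorithm 2: decrease the largest |v_i| until the total decrease is step = eta*lambda."""
    signs = v.sign()
    s, idx = torch.sort(v.abs(), descending=True)             # lines 1-2
    n = s.numel()
    delta = s - torch.cat([s[1:], s.new_zeros(1)])             # line 3: gaps delta_i
    i = torch.arange(1, n + 1, dtype=s.dtype, device=s.device)
    c = torch.cumsum(i * delta, dim=0)                         # line 4: prefix sum of layer areas
    c_prev = torch.cat([c.new_zeros(1), c[:-1]])
    clipped = torch.minimum(torch.maximum(torch.full_like(c, step), c_prev), c)
    b = (clipped - c_prev) / i                                 # line 5
    p = torch.flip(torch.cumsum(torch.flip(b, [0]), 0), [0])   # line 6: suffix sum
    out = torch.empty_like(s)
    out[idx] = s - p                                           # lines 7-8: subtract, restore order
    return signs * out                                         # ... and restore signs
```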

3.4 Transformer

The Transformer network, introduced by Vaswani et al. [109], is a sequence-to-sequence

model in which both the encoder and the decoder consist of stacked self-attention layers.

Each layer of the decoder can attend to the previous layer of the decoder and the output

of the encoder. The multi-head attention uses two affine transformations, followed by a


softmax. Additionally, each layer has a position-wise feed-forward neural network (FFN)

with a hidden layer of rectified linear units:

$$\text{FFN}(x) = W_2(\max(0, W_1 x + b_1)) + b_2.$$

The hidden layer size (the number of rows of W1) is typically four times the size of the

model dimension. Both the multi-head attention and the feed-forward neural network have

residual connections that allow information to bypass those layers.
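As a sketch (ours, not fairseq's exact module), the position-wise FFN is simply two linear layers around a ReLU; the rows of the first weight matrix are the hidden units that auto-sizing targets.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # each row of w1.weight feeds one hidden unit
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # FFN(x) = W2 max(0, W1 x + b1) + b2, applied independently at every position
        return self.w2(torch.relu(self.w1(x)))
```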

3.4.1 Auto-sizing Transformer

Though the Transformer has demonstrated remarkable success on a variety of datasets,

it is highly over-parameterized. For example, the English-German WMT ’14 Transformer-

base model proposed in Vaswani et al. [109] has more than 60M parameters. Whereas early

NMT models such as Sutskever et al. [106] have most of their parameters in the embed-

ding layers, the added complexity of the Transformer, plus parallel developments reducing

vocabulary size [100] and sharing embeddings [91], have shifted the balance. Nearly 31%

of the English-German Transformer’s parameters are in the attention layers and 41% in the

position-wise feed-forward layers.

Accordingly, we apply the auto-sizing method to the Transformer network, and in par-

ticular to the two largest components, the feed-forward layers and the multi-head attentions

(blue and orange rectangles in Figure 3.2). A difference from the work of Murray and Chi-

ang [69] is that there are residual connections that allow information to bypass the layers

we are auto-sizing. If the regularizer drives all the neurons in a layer to zero, information

can still pass through. Thus, auto-sizing can effectively prune out an entire layer.
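Schematically, auto-sizing just adds a proximal step over the targeted matrices after each optimizer update. The sketch below is ours; the parameter-name filter is hypothetical, and fairseq's actual parameter names differ.

```python
import torch

def auto_sizing_step(model, prox, eta, lam):
    """Apply a row-wise proximal operator to the FFN and attention projection matrices."""
    with torch.no_grad():
        for name, W in model.named_parameters():
            if W.dim() == 2 and ("fc" in name or "attn" in name):   # hypothetical filter
                for r in range(W.size(0)):
                    W[r] = prox(W[r], eta * lam)

# called once per mini-batch, right after optimizer.step()
```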


Figure 3.2. Architecture of the Transformer [109]. We apply the auto-sizing method to the feed-forward (blue rectangles) and multi-head attention (orange rectangles) in all n layers of the encoder and decoder. Note that there are residual connections that can allow information and gradients to bypass any layer we are auto-sizing.


3.4.2 Random Search

As an alternative to grid-based searches, random hyperparameter search has been demon-

strated to be a strong baseline for neural network architecture searches as it can search

between grid points to increase the size of the search space [8]. In fact, Li and Talwalkar

[60] recently demonstrated that many architecture search methods do not beat a random

baseline. In practice, randomly searching hyperparameter domains allows for an intuitive

mixture of continuous and categorical hyperparameters with no constraints on differentia-

bility [64] or need to cast hyperparameter values into a single high-dimensional space to

predict new values [9].

3.5 Experiments

All of our models are trained using the fairseq implementation of the Transformer

[37].2 Our GPU-optimized, proximal gradient algorithms are implemented in PyTorch

and are publicly available.3 For the random hyperparameter search experiments, we use

SHADHO,4 which defines the hyperparameter tree, generates from it, and manages dis-

tributed resources [49]. Our SHADHO driver file and modifications to fairseq are also

publicly available.5

3.5.1 Settings

We looked at four different low-resource language pairs, running experiments in five

directions: Arabic-English, English-Arabic, French-English, Hausa-English, and Tigrinya-

English. The Arabic and French data comes from the IWSLT 2017 Evaluation Campaign

2https://github.com/pytorch/fairseq

3https://github.com/KentonMurray/ProxGradPytorch

4https://github.com/jeffkinnison/shadho

5https://bitbucket.org/KentonMurray/fairseq_autosizing


Dataset Size

Ara–Eng 234k

Fra–Eng 235k

Hau–Eng 45k

Tir–Eng 15k

TABLE 3.1
NUMBER OF PARALLEL SENTENCES IN TRAINING BITEXTS. THE FRENCH-ENGLISH AND ARABIC-ENGLISH DATA IS FROM THE 2017 IWSLT CAMPAIGN [65]. THE MUCH SMALLER HAUSA-ENGLISH AND TIGRINYA-ENGLISH DATA IS FROM THE LORELEI PROJECT.

[65]. The Hausa and Tigrinya data were provided by the LORELEI project with custom

train/dev/test splits. For all languages, we tokenized and truecased the data using scripts

from Moses [54]. For the Arabic systems, we transliterated the data using the Buckwalter

transliteration scheme. All of our systems were run using subword units (BPE) with 16,000

merge operations on concatenated source and target training data [98]. We clip norms at

0.1, use label smoothed cross-entropy with value 0.1, and an early stopping criterion when

the learning rate is smaller than 10−5. All of our experiments were done using the Adam

optimizer [48], a learning rate of 10−4, and dropout of 0.1. At test time, we decoded using

a beam of 5 with length normalization [14] and evaluate using case-sensitive, detokenized

BLEU [84].


3.5.1.1 Baseline

The originally proposed Transformer model is too large for our data size – the model

will overfit the training data. Instead, we use the recommended settings in fairseq for

IWSLT German-English as a baseline since two out of our four language pairs are also

from IWSLT. This architecture has 6 layers in both the encoder and decoder, each with 4

attention heads. Our model dimension is dmodel = 512, and our FFN dimension is 1024.

3.5.1.2 Auto-sizing parameters

Auto-sizing is implemented as two different types of group regularizers, `2,1 and `∞,1.

We apply the regularizers to the feed-forward network and multi-head attention in each

layer of the encoder and decoder. We experiment across a range of regularization coeffi-

cient values, λ, that control how large the regularization proximal gradient step will be. We

note that different regularization coefficient values are suited for different types of regular-

izers. Additionally, all of our experiments use the same batch size, which is also related to

λ.

3.5.1.3 Random search parameters

As originally proposed, the Transformer network has 6 encoder layers, all identi-

cal, and 6 decoder layers, also identical. For our random search experiments, we sam-

ple the number of attention heads from {4, 8, 16} and the model dimension (dmodel) from

{128, 256, 512, 1024, 2048}. Diverging from most implementations of the Transformer, we

do not require the same number of encoder and decoder layers, but instead sample each

from {2, 4, 6, 8}. Within a layer, we also sample the size of the feed-forward network (FFN),

varying our samples over {512, 1024, 2048}. This too differs from most Transformer im-

plementations, which have identical layer hyperparameters.
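For illustration, the search space just described can be sampled with a few lines of Python; the dictionary layout below is ours, and SHADHO's actual API differs.

```python
import random

def sample_architecture():
    """Draw one random Transformer configuration from the search space described above."""
    enc_layers = random.choice([2, 4, 6, 8])
    dec_layers = random.choice([2, 4, 6, 8])
    return {
        "attention_heads": random.choice([4, 8, 16]),
        "d_model": random.choice([128, 256, 512, 1024, 2048]),
        "encoder_layers": enc_layers,
        "decoder_layers": dec_layers,
        # FFN size is sampled per layer, so layers need not be identical
        "encoder_ffn_dims": [random.choice([512, 1024, 2048]) for _ in range(enc_layers)],
        "decoder_ffn_dims": [random.choice([512, 1024, 2048]) for _ in range(dec_layers)],
    }
```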


Language Pair   Search Strategy      BLEU   Model Size   Training Time
Tir–Eng         Standard Baseline     3.6        39.6M            1.2k
                Random Search         6.7       240.4M           43.7k
                Auto-sizing `∞,1      7.5        27.1M            2.1k
                Auto-sizing `2,1      7.4        27.1M            1.2k
Hau–Eng         Standard Baseline    13.9        39.6M            4.2k
                Random Search        17.2        15.4M           87.0k
                Auto-sizing `∞,1     15.0        27.1M            7.6k
                Auto-sizing `2,1     14.8        27.1M            3.5k
Fra–Eng         Standard Baseline    35.0        39.6M           11.3k
                Random Search        35.3       121.2M          116.0k
                Auto-sizing `∞,1     34.3        27.1M           28.8k
                Auto-sizing `2,1     33.5        27.1M           11.5k

Table 3.2: Comparison of BLEU scores, model size, and training time on Tigrinya-English, Hausa-English, and French-English. Model size is the total number of parameters. Training time is measured in seconds. Baseline is the recommended low-resource architecture in fairseq. Random search represents the best model found from 72 (Tigrinya), 40 (Hausa), and 10 (French) different randomly generated architecture hyperparameters. Both auto-sizing methods, on both languages, start with the exact same initialization and number of parameters as the baseline, but converge to much smaller models across all language pairs. On the very low-resource languages of Hausa and Tigrinya, auto-sizing finds models with better BLEU scores. Random search is eventually able to find better models on French and Hausa, but is an order of magnitude slower.


3.5.2 Auto-sizing vs. Random Search

Table 3.2 compares the performance of random search with auto-sizing across BLEU

scores, model size, and training times. The baseline system, the recommended IWSLT

setting in fairseq, has almost 40 million parameters. Auto-sizing the feed-forward network

sub-components in each layer of this baseline model with `2,1 = 10.0 or `∞,1 = 100.0

removes almost one-third of the total parameters from the model. For Hausa-English and

Tigrinya-English, this also results in substantial BLEU score gains, while only slightly

hurting performance for French-English. The BLEU scores for random search beat the

baseline for all language pairs, but auto-sizing still performs best on Tigrinya-English –

even with 72 different, random hyperparameter configurations.

Auto-sizing trains in a similar amount of time to the baseline system, whereas the

cumulative training time for all of the models in random search is substantially slower.

Furthermore, for Tigrinya-English and French-English, random search found models that

were almost 10 and 5 times larger respectively than the auto-sized models.

3.5.3 Training times

One of the biggest downsides of searching over architectures using a random search process is that it is very expensive in both time and resources. In contrast, auto-sizing relies on training only one model.

Auto-sizing relies on a proximal gradient step after a standard gradient descent step.

However, the addition of these steps for our two group regularizers does not significantly

impact training times. Table 3.3 shows the total training time for both `2,1 = 0.1 and

`∞,1 = 0.5. Even with the extra proximal step, auto-sizing using `2,1 actually converges

faster on two of the five language pairs. Note that these times are for smaller regularization

coefficients. Larger coefficients will cause more values to go to zero, which will make the

model converge faster.


Language Pair Baseline `2,1 `∞,1

Fra–Eng 11.3k 11.5k 28.8k

Ara–Eng 15.1k 16.6k 40.8k

Eng–Ara 16.6k 11.0k 21.9k

Hau–Eng 4.2k 3.5k 7.6k

Tir–Eng 1.2k 1.2k 2.1k

TABLE 3.3
OVERALL TRAINING TIMES IN SECONDS ON A NVIDIA GEFORCE GTX 1080TI GPU FOR SMALL REGULARIZATION VALUES. NOTE THAT HIGH REGULARIZATION VALUES WILL DELETE TOO MANY VALUES AND CAUSE TRAINING TO END SOONER. IN GENERAL, `2,1 REGULARIZATION DOES NOT APPRECIABLY SLOW DOWN TRAINING, BUT `∞,1 CAN BE TWICE AS SLOW. PER EPOCH, ROUGHLY THE SAME RATIOS IN TRAINING TIMES HOLD.


3.5.4 Auto-sizing Sub-Components

As seen above, on very low-resource data, auto-sizing is able to quickly learn smaller,

yet better, models than the recommended low-resource transformer architecture. Here, we

look at the impact of applying auto-sizing to various sub-components of the Transformer

network. In section 3.3, following the work of Murray and Chiang [69], auto-sizing is de-

scribed as intelligently applying a group regularizer to our objective function. The relative

weight, or regularization coefficient, is a hyperparameter defined as λ. In this section, we

also look at the impact of varying the strength of this regularization coefficient.

Tables 3.4 and 3.5 demonstrate the impact that varying the regularization coefficient strength has on BLEU scores and model size across various model sub-components. Recall that each layer of the Transformer network has multi-head attention sub-components and a feed-forward network sub-component. We denote experiments applying auto-sizing only to the feed-forward network as “FFN”. We also experiment with auto-sizing the multi-head attention in conjunction with the FFN, which we denote “All”. A regularization coefficient of 0.0 refers to the baseline model without any auto-sizing. Columns that contain percentages refer to the number of rows, in the PyTorch parameters that auto-sizing was applied to, that were entirely driven to zero; in effect, these are neurons deleted from the model. Note that individual values in a row may be zero, but if even a single value remains, information can continue to flow through that row and it is not counted as deleted. Furthermore, percentages

refer only to the parameters that auto-sizing was applied to, not the entire model. As such,

with the prevalence of residual connections, a value of 100% does not mean the entire

model was deleted, but merely specific parameter matrices. More specific experimental

conditions are described below.

3.5.4.1 FFN matrices and multi-head attention

Rows corresponding to “All” in tables 3.4 and 3.5 look at the impact of varying the

strength of both the `∞,1 and `2,1 regularizers across all learned parameters in the encoder


                              `2,1 coefficient
          Model Portion   0.0     0.1        0.25       0.5        1.0         10.0
Hau–Eng   Encoder All     13.9    16.0       17.1       17.4       15.3  89%   16.4 100%
          Encoder FFN             15.4       15.1       16.3       15.9 100%   16.7 100%
          Decoder All             12.6       16.1       16.2       13.0   3%    0.0  63%
          Decoder FFN             11.8       14.7       14.4       11.7  79%   13.1 100%
          Enc+Dec All             15.8       17.4       17.8       12.5  61%    0.0 100%
          Enc+Dec FFN             14.7       15.3       14.2       12.8  86%   14.8 100%
Tir–Eng   Encoder All      3.6     3.3        4.7        5.3        7.2         8.4 100%
          Enc+Dec All              3.8        4.0        6.5        7.0         0.0 100%
          Enc+Dec FFN              4.0        4.2        3.3        5.1         7.4 100%
Fra–Eng   Encoder All     35.0    35.7       34.5       34.1       33.6  97%   32.8 100%
          Enc+Dec All             35.2       33.1       29.8  23%  24.2  73%    0.3 100%
          Enc+Dec FFN             35.6       35.0       34.2  15%  34.2  98%   33.5 100%
Ara–Eng   Enc+Dec All     27.9    28.0       24.7   1%  20.9  20%  14.3  72%    0.3 100%
          Enc+Dec FFN             26.9       26.7   1%  25.5  23%  25.9  97%   25.7 100%
Eng–Ara   Enc+Dec All      9.4     8.7        7.5        5.8  23%   3.7  73%    0.0 100%
          Enc+Dec FFN              8.6        8.3   3%   8.3  22%   7.9  93%    8.0 100%

Table 3.4: BLEU scores and percentage of parameter rows deleted by auto-sizing on various sub-components of the model, across varying strengths of `2,1 regularization. 0.0 refers to the baseline without any regularizer. Blank spaces mean less than 1% of parameters were deleted. In the two very low-resource language pairs (Hausa-English and Tigrinya-English), deleting large portions of the encoder can actually help performance. However, deleting the decoder hurts performance.


                            `∞,1 coefficient
          Model Portion   0.0    0.1    0.25   0.5    1.0    10.0        100.0
Hau–Eng   Enc+Dec All     13.9   15.5   14.7   16.0   16.7   14.9   4%    1.5 100%
          Enc+Dec FFN            13.4   14.3   14.1   12.9   15.3   0%   15.0 100%
Tir–Eng   Enc+Dec All      3.6    4.6    3.4    3.4    3.7    7.4   0%    2.4 100%
          Enc+Dec FFN             3.6    3.8    3.9    3.6    4.7   0%    7.5 100%
Fra–Eng   Enc+Dec All     35.0   35.2   35.4   34.9   35.3   26.3  13%    1.7 100%
          Enc+Dec FFN            34.8   35.5   35.4   35.0   34.1   0%   34.3 100%
Ara–Eng   Enc+Dec All     27.9   27.3   27.5   27.6   26.9   18.5  22%    0.6 100%
          Enc+Dec FFN            27.8   27.2   28.3   27.6   25.4   0%   25.4 100%
Eng–Ara   Enc+Dec All      9.4    9.1    8.3    8.4    8.7    5.2  25%    0.6 100%
          Enc+Dec FFN             8.8    9.2    9.0    8.9    8.2   0%    8.3 100%

Table 3.5: BLEU scores and percentage of model deleted using auto-sizing with various `∞,1 regularization strengths. On the very low-resource language pairs of Hausa-English and Tigrinya-English, auto-sizing the feed-forward networks of the encoder and decoder can improve BLEU scores.

and decoders (multi-head and feed-forward network parameters). Using `∞,1 regularization

(table 3.5), auto-sizing beats the baseline BLEU scores on three language pairs: Hau–Eng,

Tir–Eng, Fra–Eng. However, BLEU score improvements only occur on smaller regulariza-

tion coefficients that do not delete model portions.

Looking at `2,1 regularization across all learned parameters of both the encoder and

decoder (“Enc+Dec All” in table 3.4), auto-sizing beats the baseline on four of the five

language pairs (all except Eng–Ara). Again, BLEU gains are on smaller regularization

coefficients, and stronger regularizers that delete parts of the model hurt translation quality.

Multi-head attention is an integral portion of the Transformer model and auto-sizing this

generally leads to performance hits.

3.5.4.2 FFN matrices

As the multi-head attention is a key part of the Transformer, we also looked at auto-

sizing just the feed-forward sub-component in each layer of the encoder and decoder. Rows


denoted by “FFN” in tables 3.4 and 3.5 look at applying auto-sizing to all of the feed-

forward network sub-components of the Transformer, but not to the multi-head attention.

With `∞,1 regularization, we see BLEU improvements on four of the five language pairs.

For both Hausa-English and Tigrinya-English, we see improvements even after deleting

all of the feed-forward networks in all layers. Again, the residual connections allow in-

formation to flow around these sub-components. Using `2,1 regularization, we see BLEU

improvements on three of the language pairs. Hausa-English and Tigrinya-English main-

tain a BLEU gain even when deleting all of the feed-forward networks.

Auto-sizing only the feed-forward sub-component, and not the multi-head attention

part, results in better BLEU scores, even when deleting all of the feed-forward network

components. Impressively, this is with a model that has fully one-third fewer parameters

in the encoder and decoder layers. This is beneficial for faster inference times and smaller

disk space.

3.5.4.3 Encoder vs. Decoder

In table 3.4, experiments on Hau-Eng look at the impact of auto-sizing either the en-

coder or the decoder separately. Applying a strong enough regularizer to delete portions

of the model (`2,1 ≥ 1.0) only to the decoder (“Decoder All” and “Decoder FFN”) re-

sults in a BLEU score drop. However, applying auto-sizing to only the encoder (“Encoder

All” and “Encoder FFN”) yields a BLEU gain while creating a smaller model. Intuitively,

this makes sense as the decoder is closer to the output of the network and requires more

modeling expressivity.

In addition to Hau–Eng, table 3.4 also contains experiments looking at auto-sizing all

sub-components of all encoder layers of Tir–Eng and Fra–Eng. For all three language

pairs, a small regularization coefficient for the `2,1 regularizer applied to the encoder in-

creases BLEU scores. However, no rows are driven to zero and the model size remains the

same. Consistent with Hau–Eng, using a larger regularization coefficient drives all of the


encoder’s weights to all zeros. For the smaller Hau–Eng and Tir–Eng datasets, this actually

results in BLEU gains over the baseline system. Surprisingly, even on the Fra–Eng dataset,

which has more than 15x as much data as Tir–Eng, the performance hit of deleting the

entire encoder was only 2 BLEU points.

Recall from Figure 3.2 that there are residual connections that allow information and

gradients to flow around both the multi-head attention and feed-forward portions of the

model. Here, we have the case that all layers of the encoder have been completely deleted.

However, the decoder still attends over the source word and positional embeddings due to

the residual connections. We hypothesize that for these smaller datasets there are too

many parameters in the baseline model and over-fitting is an issue.

3.5.5 Random Search plus Auto-sizing

Above, we have demonstrated that auto-sizing is able to learn smaller models, faster

than random search, often with higher BLEU scores. To compare whether the two archi-

tecture search algorithms (random and auto-sizing) can be used in conjunction, we also

looked at applying both `2,1 and `∞,1 regularization techniques to the FFN networks in all

encoder and decoder layers during random search. In addition, this looks at how robust the

auto-sizing method is to different initial conditions.

For a given set of hyperparameters generated by the random search process, we ini-

tialize three identical models and train a baseline as well as one with each regularizer

(`2,1 = 1.0 and `∞,1 = 10.0). We trained 216 Tir–Eng models (3 · 72 hyperparameter con-

fig.), 120 Hau–Eng, 45 Ara–Eng, 45 Eng–Ara, and 30 Fra–Eng models. Using the model

with the best dev perplexity found during training, table 3.6 shows the test BLEU scores

for each of the five language pairs. For the very low-resource language pairs of Hau–Eng

and Tir–Eng, auto-sizing is able to find the best BLEU scores.


none `2,1 `∞,1

Hau–Eng 17.2 16.6 17.8

Tir–Eng 6.7 7.9 7.6

Fra–Eng 35.4 34.7 34.1

Ara–Eng 27.6 25.6 25.9

Eng–Ara 9.0 7.6 8.4

TABLE 3.6
TEST BLEU SCORES FOR THE MODELS WITH THE BEST DEV PERPLEXITY FOUND USING RANDOM SEARCH OVER NUMBER OF LAYERS AND SIZE OF LAYERS. REGULARIZATION VALUES OF `2,1 = 1.0 AND `∞,1 = 10.0 WERE CHOSEN BASED ON TABLES 3.4 AND 3.5 AS THEY ENCOURAGED NEURONS TO BE DELETED. FOR THE VERY LOW-RESOURCE LANGUAGE PAIRS, AUTO-SIZING HELPED IN CONJUNCTION WITH RANDOM SEARCH.


3.6 Conclusion

In this paper, we have demonstrated the effectiveness of auto-sizing on the Transformer

network. On very low-resource datasets, auto-sizing was able to improve BLEU scores by

up to 3.9 points while simultaneously deleting one-third of the parameters in the encoder

and decoder layers. This was accomplished while being significantly faster than other

search methods.

Additionally, we demonstrated how to apply proximal gradient methods efficiently us-

ing a GPU. Previous work on optimizing proximal gradient algorithms suffers serious impacts to speed when the computations are moved off of a CPU and parallelized. Lever-

aging sorting and prefix summation, we reformulated these methods to be GPU efficient.

Overall, this paper has demonstrated the efficacy of auto-sizing on a natural language

processing application with orders of magnitude more parameters than previous work.

With a focus on speedy architecture search and an emphasis on optimized GPU algorithms,

auto-sizing is able to improve machine translation on very low-resource language pairs

without being resource or time-consuming.


CHAPTER 4

EFFICIENCY THROUGH AUTO-SIZING

Previously, we introduced the Auto-Sizing method, which uses a group regularizer to

learn smaller neural networks while maintaining accuracy and performance on translation

tasks. Here we look at a different objective – can we reduce memory and storage require-

ments of a large neural network model by aggressively using group regularization?

This chapter describes the Notre Dame Natural Language Processing Group's (NDNLP) submission to the WNGT 2019 shared task [40]. We investigated the impact of applying auto-sizing [69, 72] to the Transformer network [109] with the goal of substantially reducing the number of parameters in the model. Our method was able to eliminate more than 25% of the model's parameters while suffering a decrease of only 1.1 BLEU.

4.1 Introduction

The Transformer network [109] is a neural sequence-to-sequence model that has achieved

state-of-the-art results in machine translation. However, Transformer models tend to be

very large, typically consisting of hundreds of millions of parameters. As the number

of parameters directly corresponds to secondary storage requirements and memory con-

sumption during inference, using Transformer networks may be prohibitively expensive

in scenarios with constrained resources. For the 2019 Workshop on Neural Generation of

Text (WNGT) Efficiency shared task [40], the Notre Dame Natural Language Processing

(NDNLP) group looked at a method of inducing sparsity in parameters called auto-sizing

in order to reduce the number of parameters in the Transformer at the cost of a relatively

minimal drop in performance.


Auto-sizing, first introduced by Murray and Chiang [69], uses group regularizers to

encourage parameter sparsity. When applied over neurons, it can delete neurons in a net-

work and shrink the total number of parameters. A nice advantage of auto-sizing is that it

is independent of model architecture; although we apply it to the Transformer network in

this task, it can easily be applied to any other neural architecture.

NDNLP’s submission to the 2019 WNGT Efficiency shared task uses a standard, rec-

ommended baseline Transformer network. Following Murray et al. [72], we investigate the

application of auto-sizing to various portions of the network. Differing from their work, the

shared task used a significantly larger training dataset from WMT 2014 [10], as well as the

goal of reducing model size even if it impacted translation performance. Our best system

was able to prune over 25% of the parameters, yet had a BLEU drop of only 1.1 points.

This translates to over 25 million parameters pruned and saves almost 100 megabytes of

disk space to store the model.

4.2 Auto-Sizing the Transformer

We note that there has been some work recently on shrinking networks through pruning.

However, these differ from auto-sizing as they frequently require an arbitrary threshold and

are not included during the training process. For instance, See et al. [97] prunes networks

based off a variety of thresholds and then retrains a model. Voita et al. [113] also look

at pruning, but of attention heads specifically. They do this through a relaxation of an `0

regularizer in order to make it differentiable. This allows them to not need to use a proximal

step. This method too starts with pre-trained model and then continues training. Michel

et al. [66] also look at pruning attention heads in the transformer. However, they too use

thresholding, but only apply it at test time. Auto-sizing does not require a thresholding

value, nor does it require a pre-trained model.

Of particular interest are the large, position-wise feed-forward networks in each en-

coder and decoder layer:


$$\text{FFN}(x) = W_2(\max(0, W_1 x + b_1)) + b_2.$$

W1 and W2 are two large affine transformations that take inputs from D dimensions to

4D, then project them back to D again. These layers make use of rectified linear unit

activations, which were the focus of auto-sizing in the work of Murray and Chiang [69].

No theory or intuition is given as to why this value of 4D should be used.

Following [72], we apply the auto-sizing method to the Transformer network, focusing

on the two largest components, the feed-forward layers and the multi-head attentions (blue

and orange rectangles in Figure 3.2). Remember that since there are residual connections

allowing information to bypass the layers we are auto-sizing, information can still flow

through the network even if the regularizer drives all the neurons in a layer to zero –

effectively pruning out an entire layer.

4.3 Experiments

All of our models are trained using the fairseq implementation of the Transformer [37].1

For the regularizers used in auto-sizing, we make use of an open-source, proximal gradient

toolkit implemented in PyTorch2 [72]. For each mini-batch update, the stochastic gradient

descent step is handled with a standard PyTorch forward-backward call. Then the proximal

step is applied to parameter matrices.
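Concretely, the per-update procedure can be sketched in PyTorch as follows. The ℓ2,1 proximal operator shown here is our own minimal version rather than the toolkit's interface, and the attribute used to reach the auto-sized matrices is a placeholder.

import torch

def l21_prox(W, strength):
    # Row-wise l2,1 proximal operator: shrink each row's l2 norm by `strength`,
    # setting the entire row to zero when its norm is below that threshold.
    with torch.no_grad():
        norms = W.norm(dim=1, keepdim=True)
        scale = torch.clamp(norms - strength, min=0.0) / (norms + 1e-12)
        W.mul_(scale)

def train_step(model, batch, loss_fn, optimizer, lam, lr):
    # 1) Standard stochastic gradient step on the unregularized loss.
    optimizer.zero_grad()
    loss = loss_fn(model(batch.src), batch.tgt)
    loss.backward()
    optimizer.step()
    # 2) Proximal step on the matrices being auto-sized
    #    (model.autosized_matrices is a hypothetical list of weight tensors).
    for W in model.autosized_matrices:
        l21_prox(W, strength=lam * lr)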

4.3.1 Settings

We used the originally proposed transformer architecture – with six encoder and six

decoder layers. Our model dimension was 512 and we used 8 attention heads. The feed-

forward network sub-components were of size 2048. All of our systems were run using

1https://github.com/pytorch/fairseq

2https://github.com/KentonMurray/ProxGradPytorch


Figure 4.1. Auto-sizing the FFN network. For a row in the parameter matrix W1 that has been driven completely to 0.0 (shown in white), the corresponding column in W2 (shown in blue) no longer has any impact on the model. Both the column and the row can be deleted, thereby shrinking the model.
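In PyTorch, this row and column deletion can be sketched as below. The module and the shrink_ffn helper are illustrative only, not the fairseq implementation, and we assume the bias of a pruned hidden unit has also been driven to zero.

import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    # FFN(x) = W2 max(0, W1 x + b1) + b2
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))

def shrink_ffn(ffn):
    # Keep only the hidden units whose incoming row of W1 is not all zeros.
    keep = ffn.w1.weight.abs().sum(dim=1) != 0
    smaller = PositionwiseFFN(ffn.w1.in_features, int(keep.sum()))
    smaller.w1.weight.data = ffn.w1.weight.data[keep].clone()     # delete rows of W1
    smaller.w1.bias.data = ffn.w1.bias.data[keep].clone()
    smaller.w2.weight.data = ffn.w2.weight.data[:, keep].clone()  # delete matching columns of W2
    smaller.w2.bias.data = ffn.w2.bias.data.clone()
    return smaller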

System               Disk Size   Number of Parameters   newstest2014   newstest2015
Baseline             375M        98.2M                  25.3           27.9
All ℓ2,1 = 0.1       345M        90.2M                  21.6           24.1
Encoder ℓ2,1 = 0.1   341M        89.4M                  23.2           25.5
Encoder ℓ2,1 = 1.0   327M        85.7M                  22.1           24.5
FFN ℓ2,1 = 0.1       326M        85.2M                  24.1           26.4
FFN ℓ2,1 = 1.0       279M        73.1M                  24.0           26.8
FFN ℓ2,1 = 10.0      279M        73.1M                  23.9           26.5
FFN ℓ∞,1 = 100.0     327M        73.1M                  23.8           26.0

Table 4.1: Comparison of BLEU scores and model sizes on newstest2014 and newstest2015. Applying auto-sizing to the feed-forward neural network sub-components of the transformer resulted in the most pruning while still maintaining good BLEU scores.


subword units (BPE) with 32,000 merge operations on concatenated source and target training data [98]. We clip norms at 0.1, use label-smoothed cross-entropy with a smoothing value of 0.1, and stop early when the learning rate falls below 10^−5. We used the Adam optimizer [48], a learning rate of 10^−4, and dropout of 0.1. Following recommen-

dations in the fairseq and tensor2tensor [110] code bases, we apply layer normalization

before a sub-component as opposed to after. At test time, we decoded using a beam of 5

with length normalization [14] and evaluate using case-sensitive, tokenized BLEU [84].

For the auto-sizing experiments, we looked at both ℓ2,1 and ℓ∞,1 regularizers. We ex-

perimented over a range of regularizer coefficient strengths, λ, that control how large the

proximal gradient step will be. Similar to Murray and Chiang [69], but differing from

Alvarez and Salzmann [1], we use one value of λ for all parameter matrices in the net-

work. We note that different regularization coefficient values are suited for different types of regularizers. Additionally, all of our experiments use the same batch size, which is also

related to λ.

4.3.2 Auto-sizing sub-components

We applied auto-sizing to the sub-components of the encoder and decoder layers, with-

out touching the word or positional embeddings. Recall from Figure 3.2, that each layer

has multi-head attention and feed-forward network sub-components. In turn, each multi-

head attention sub-component is comprised of two parameter matrices. Similarly, each

feed-forward network has two parameter matrices, W1 and W2. We looked at three main

experimental configurations:

• All: Auto-sizing is applied to every multi-head attention and feed-forward network sub-component in every layer of the encoder and decoder.

• Encoder: As with All, auto-sizing is applied to both multi-head attention and feed-forward network sub-components, but only in the encoder layers. The decoder remains the same.

• FFN: Auto-sizing is applied only to the feed-forward network sub-components W1 and


W2, but not to the multi-head portions. This too is applied to both the encoder and decoder.

4.3.3 Results

Our results are presented in Table 4.1. The baseline system has 98.2 million parameters

and a BLEU score of 27.9 on newstest2015. It takes up 375 megabytes on disk. Our

systems that applied auto-sizing only to the feed-forward network sub-components of the

transformer network maintained the best BLEU scores while also pruning out the most

parameters of the model. Overall, our best system used ℓ2,1 = 1.0 regularization for auto-

sizing and left 73.1 million parameters remaining. On disk, the model takes 279 megabytes

to store – roughly 100 megabytes less than the baseline. The performance drop compared

to the baseline is 1.1 BLEU points, but the model is over 25% smaller.

Applying auto-sizing to the multi-head attention and feed-forward network sub-components

of only the encoder also pruned a substantial amount of parameters. Though this too re-

sulted in a smaller model on disk, the BLEU scores were worse than auto-sizing just the

feed-forward sub-components. Auto-sizing the multi-head attention and feed-forward net-

work sub-components of both the encoder and decoder actually resulted in a larger model

than the encoder only, but with a lower BLEU score. Overall, our results suggest that the

attention portion of the transformer network is more important for model performance than

the feed-forward networks in each layer.

4.4 Conclusion

In this chapter, we have investigated the impact of using auto-sizing on the transformer

network of the 2019 WNGT efficiency task. We were able to delete more than 25% of the

parameters in the model while only suffering a modest BLEU drop. In particular, focusing

on the parameter matrices of the feed-forward networks in every layer of the encoder and

decoder yielded the smallest models that still performed well.


A nice aspect of our proposed method is that the proximal gradient step of auto-sizing

can be applied to a wide variety of parameter matrices. For the Transformer, the largest impact was on the feed-forward networks within each layer, but should a new architecture emerge in the future, auto-sizing can easily be adapted to its trainable parameters.

Overall, NDNLP’s submission has shown that auto-sizing is a flexible framework for

pruning parameters in a large NMT system. With an aggressive regularization scheme,

large portions of the model can be deleted with only a modest impact on BLEU scores.

This in turn yields a much smaller model on disk and at run-time.


CHAPTER 5

RIGGING THE LOTTERY: ON THE INTERSECTION OF LOTTERY TICKETS AND

STRUCTURED REGULARIZATION

Previously, we have shown how group regularizers can be used to prune neurons in a

neural network. We have demonstrated how this can be used to improve performance, both

in terms of translation quality as well as speed and memory efficiencies. We now show how

these improvements can be used in conjunction with another popular learning structure for

neural networks – the lottery ticket hypothesis.

Abstract: Work regarding the lottery ticket hypothesis [35] has shown that a simple iterative pruning process can drastically reduce the number of parameters in neural networks without compromising performance. However, the pattern in which these parameters are pruned is sparse and not easily exploitable on parallel hardware, limiting the benefits of reduced model size in implementation. In this paper, we add regularization and proximal gradient methods to the lottery ticket pruning process to encourage structured sparsity that is more suitable to parallel hardware, such that entire neurons and their incoming connections are pruned from a network. We demonstrate that this method can remove a substantial number of neurons while maintaining comparable performance on three different architectures and tasks: LeNet (image classification), BiLSTM-CRF (part-of-speech tagging), and Transformer (machine translation).

5.1 Introduction

Investigations of the lottery ticket hypothesis [35] have provided evidence that neural

networks are generally overparameterized and owe their efficacy to relatively small sub-

networks that have “won the initialization lottery.” A simple technique for discovering

these “winning tickets” is iterative pruning, which repeatedly trains a network, removes


weights whose magnitudes are within the bottom p percentile, and resets the remaining

parameters of the network to their original values. Iterative pruning can yield networks

that are only 10-20% of the size of the original yet maintain the accuracy of the full model,

offering substantial savings in computational cost in principle.
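As a reference point, the unstructured iterative pruning procedure can be sketched as follows; this is our own simplification, and train_fn stands in for a full training run that applies the masks.

import copy
import torch

def iterative_prune(model, train_fn, p=0.20, rounds=10):
    # Lottery-ticket iterative pruning: train, mask out the bottom-p fraction of
    # surviving weights by magnitude, rewind the rest to their initial values, repeat.
    init_state = copy.deepcopy(model.state_dict())
    masks = {name: torch.ones_like(w) for name, w in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                          # train with masks applied
        for name, w in model.named_parameters():
            alive = w[masks[name].bool()].abs()
            cutoff = torch.quantile(alive, p)
            masks[name] *= (w.abs() > cutoff).float()
        model.load_state_dict(init_state)               # rewind surviving weights
    return masks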

However, the resulting sparsity patterns are not necessarily amenable to optimization

by parallel hardware. Current GPU architectures and software libraries are optimized for

dense matrix operations, where deleted weights are represented with explicit zero entries.

Consider a matrix W of weights in a fully-connected feed-forward layer. In a given row

Wi, as long as there exists a non-deleted weight Wij ≠ 0, a dense representation of W must represent all deleted weights in Wi with zeroes. However, if row Wi consists entirely of deleted weights, then even in a dense representation, the entire row may be deleted from

W without affecting the model’s output, reducing the overall size of the matrix and its

output. Downstream parts of the network that depend on the deleted neurons may also be

pruned.

The standard lottery ticket pruning procedure generally does not produce these desir-

able row-wise sparsity patterns. As noted by Frankle and Carbin [35], iterative pruning

removes parameters in an unstructured fashion and results in uniformly distributed dele-

tions, which limits the benefits of decreased model size in practice. In this paper, we

propose “rigging” the lottery ticket pruning process with a structured method that deletes

a comparable number of parameters while resulting in the more easily exploitable row-

wise sparsity pattern described above. We do this by adding a regularization term to the

loss function during training that pushes parameters within the same row toward zero as a

group, then applying proximal gradient updates that cause parameters to clip exactly to 0

during training, effectively deleting them.

We demonstrate that structured pruning achieves the desired sparsity pattern on three

diverse neural network architectures and tasks. We apply our method to LeNet on the

MNIST image classification task, a BiLSTM-CRF on the Penn Treebank part-of-speech


tagging task, and a Transformer on a low-resource machine translation task. We will re-

lease our code publicly.

5.2 Background and Related Work

Regularization is a technique within statistics and machine learning that imposes a

penalty on a loss function during learning. Often this can be used to prevent overfitting

training data, but certain classes of regularizers can also be used to induce sparsity. As we

have demonstrated earlier in this thesis, this can be used to prune networks. Similar work

has been demonstrated by Alvarez and Salzmann [1] and Dodge et al. [29].

Lottery ticket pruning iteratively removes individual values from a neural network after training to convergence. Because it acts sequentially, it is necessarily many times slower than training a network once. The pruning method is related to Han et al. [39], which is essentially a single lottery ticket iteration.

5.3 Methods

For each model to which we apply structured pruning, we select groups of parameters

which should be encouraged to clip to zero together. Each group consists of a set of

parameters that contributes to the activation of a single neuron – namely, a set of connection

weights and a bias term. As long as the activation function of the neuron is a function f

such that f (0) = 0, and the setting of a neuron to 0 is equivalent to its removal from the

network, then applying structured pruning to that group of parameters can shrink the dense

representation of the layer.

Our modifications to the standard iterative pruning procedure are as follows. First,

during training, we add an ℓ2,1 regularization term to the loss function to encourage the

desired sparsity patterns. Second, after each step of SGD, we apply the corresponding

proximal gradient update, potentially clipping parameters to 0. Third, we apply pruning


after each round of training. We test two variations of pruning. The first simply prunes

parameters which the regularizer has pushed to zero, permanently masking them out for

future iterations (unlike percentile-based pruning, the fraction of parameters pruned at each

iteration can vary). The second variation applies the usual percentile-based pruning in

combination with zero-based pruning; in this case, pruning removes the bottom p% of

weights, and then removes any 0-valued weights which may remain afterwards.
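A minimal sketch of these two mask updates in PyTorch; the function names are ours, and mask is the 0/1 tensor that records which weights are still in the network.

import torch

def zero_based_prune(weight, mask):
    # Variant 1: permanently mask out any weight the regularizer drove exactly to zero.
    return mask * (weight != 0).float()

def percentile_plus_zero_prune(weight, mask, p=0.20):
    # Variant 2: mask the bottom-p fraction of surviving weights by magnitude,
    # then also mask any exact zeros that remain.
    surviving = weight[mask.bool()].abs()
    cutoff = torch.quantile(surviving, p)
    new_mask = mask * (weight.abs() > cutoff).float()
    return new_mask * (weight != 0).float()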

Following the discussion in Frankle and Carbin [35], we define a neural network to

be a function that maps data x to outputs using parameters θ. A mask m ∈ {0, 1}^|θ| describes which parameters are actually used in the model, so the pruned network computes

f(x; m ⊙ θ0),

where θ0 denotes the original initialization of the parameters.

The likelihood of a given network is L(θ | x). Learning attempts to find the best values of θ by minimizing the negative log-likelihood:

min_θ − log L(θ | x).

A regularizer modifies this by adding a penalty term R(θ):

min_θ − log L(θ | x) + R(θ).

Following the work of Murray and Chiang [68], we experiment with two regularizers, applied to each layer of the network independently. Let W be a parameter matrix representing a single layer within θ. First, we look at a sparsity-inducing ℓ1 regularizer:

R(θ) = ∑_i ‖Wi:‖_1 = ∑_{i,j} |Wij|.


This regularizer will not encourage structured sparsity.

We also look at a regularizer that does encourage structured sparsity, an ℓ2,1 regularizer which operates over rows of the parameter matrix:

R(θ) = ∑_i ‖Wi:‖_2 = ∑_i ( ∑_j Wij^2 )^(1/2).
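For concreteness, the two penalties for a single weight matrix W can be computed as follows. This is only a sketch; our experiments rely on the proximal-gradient implementations in ProxGradPytorch rather than this code.

import torch

def l1_penalty(W):
    # Unstructured sparsity: sum of absolute values of all entries.
    return W.abs().sum()

def l21_penalty(W):
    # Structured sparsity: sum over rows of each row's l2 norm,
    # R(W) = sum_i ( sum_j W_ij^2 )^(1/2).
    return W.norm(dim=1).sum()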

5.4 Structured Pruning on LeNet

Following the first set of experiments in Frankle and Carbin [35], we experiment with LeNet [59], a simple feed-forward neural network with two hidden layers (of size 300 and 100) using rectified linear units. This is a very similar setup to the feed-forward neural probabilistic language model from Chapter 2.

5.4.1 Structured Pruning on Transformer

For our machine translation experiments, we again use the transformer network. In

particular, we focus on the feed-forward hidden layers of the encoder and decoder, based on the results explained in Chapters 3 and 4.

5.5 Experiments

We focus on three different learning tasks: a simple handwritten digit recognition task,

a part-of-speech (POS) tagging task, and a neural machine translation task. For all tasks,

we compare the baseline lottery ticket pruning method which uses purely unstructured,

percentile-based pruning; pruning with ℓ2,1 regularization and zero-based pruning; and pruning with ℓ2,1 regularization and both percentile- and zero-based pruning. We also test both pruning types with ℓ1 regularization, because this encourages individual parameter

sparsity – similar to the goal of the lottery ticket paper.


For our sparsity-inducing regularizers, we make use of ProxGradPytorch1 [73], a GPU-optimized PyTorch implementation of algorithms from Murray and Chiang [68] and Parikh et al. [86].

5.5.1 MNIST

Following the work of Frankle and Carbin [35], we experiment with LeNet-300-100 [59] on MNIST. We randomly select 10% of the 60,000 training examples to form a validation set. Mimicking the experiments in the lottery ticket paper, we use Adam [48] with a learning rate of 0.0012. We prune the lowest 20% of the parameters in each layer, except for the output layer, where we prune 10%. We run the pruning for 10 iterations. In each pruning iteration, we use early stopping when the validation accuracy has not gone up for 5 epochs.

As can be seen in the table, using an ℓ2,1 structured regularizer matches the performance of both the baseline and the lottery ticket method while deleting more values and rows.

Interestingly, ℓ1 regularization, which encourages only unstructured sparsity, does not actually prune anything. With higher regularization values it does, but performance decreases significantly.

5.5.2 Machine Translation

For the machine translation task, we base our implementation on the fairseq toolkit2 implementation of the transformer network [83, 109]. We use a recently released Mapudungun-to-Spanish corpus [31]. We preprocess our data with the Moses tokenizer [54] and segment it using 8,000 BPE operations [101]. We follow the recommendations of Nguyen and Salazar [80], specifically using 6 encoder and 6 decoder layers, all with 8 attention heads. The model has 512 embedding dimensions, and the feedforward layers have 2048 dimensions. Layer normalization is done before, not after, adding residual connections.

1https://github.com/KentonMurray/ProxGradPytorch

2https://github.com/pytorch/fairseq


Figure 5.1. Best accuracy at each pruning iteration for MNIST experiments.


Figure 5.2. Rows remaining after pruning iterations for MNIST experiments.


Figure 5.3. Parameters remaining after pruning iterations for MNIST experiments.


Figure 5.4. Accuracy score compared to rows remaining for MNIST experiments.


Figure 5.5. Accuracy compared to remaining parameters for MNIST experiments.


Figure 5.6. Best validation BLEU score at each pruning iteration for Machine Translation with Arn-Spa MT experiments.

We use Adam [48] with an inverse square root learning rate reaching 0.001 after

8,000 warmup iterations. We clip norms to 1.0 and use dropout with probability 0.3.

For both the lottery ticket pruning and our regularizers, we look at the feed-forward network in each encoder and decoder layer.

5.6 Conclusion

In this chapter we have demonstrated that structured pruning can work in conjunction

with the lottery ticket hypothesis. As noted by Frankle and Carbin [35], modern computer


Figure 5.7. Rows remaining after pruning iterations for Machine Translation with Arn-Spa MT experiments.


Figure 5.8. Parameters remaining after pruning iterations for Machine Translation with Arn-Spa MT experiments.


Figure 5.9. BLEU score compared to rows remaining for Machine Translation with Arn-Spa MT experiments.


Figure 5.10. BLEU score compared to remaining parameters for Machine Translation with Arn-Spa MT experiments.


architectures, particularly GPUs, cannot make use of the gains from non-structured sparsity. Here we have demonstrated how group regularization can be used to enforce spar-

sity patterns during lottery ticket training. Though we leave efficiency measures to future

work, prior work such as Murray et al. [71] has demonstrated that removing parameters

after regularized training can speed up computation.


CHAPTER 6

CORRECTING LENGTH BIAS IN NEURAL MACHINE TRANSLATION

Work up until this point has covered modifications to the objective function through the

use of regularizers. This has focused on learning hyperparameters about the architecture

of the model. We now turn to what happens after a model is trained – the actual transla-

tion of languages at inference time. In this chapter, we look at a hyperparameter used in

beam search, demonstrate how it is related to beam size (another hyperparameter), show

performance improvements by tuning this hyperparameter, and fix a modeling error. This

method does not modify the objective function during training, but instead, we utilize the

perceptron algorithm.

Abstract: We study two problems in neural machine translation (NMT). First, in beam search, whereas a wider beam should in principle help translation, it often hurts NMT. Second, NMT has a tendency to produce translations that are too short. Here, we argue that these problems are closely related and both rooted in label bias. We show that correcting the brevity problem almost eliminates the beam problem; we compare some commonly-used methods for doing this, finding that a simple per-word reward works well; and we introduce a simple and quick way to tune this reward using the perceptron algorithm.

6.1 Introduction

Although highly successful, neural machine translation (NMT) systems continue to be

plagued by a number of problems. We focus on two here: the beam problem and the brevity

problem.

First, machine translation systems rely on heuristics to search through the intractably

large space of possible translations. Most commonly, beam search is used during the


decoding process. Traditional statistical machine translation systems often rely on large

beams to find good translations. However, in neural machine translation, increasing the

beam size has been shown to degrade performance. This is the last of the six challenges

identified by Koehn and Knowles [51].

The second problem, noted by several authors, is that NMT tends to generate transla-

tions that are too short. Jean et al. and Koehn and Knowles address this by dividing transla-

tion scores by their length, inspired by work on audio chords [14]. A similar method is also

used by Google’s production system [117]. A third simple method used by various authors

[41, 76, 82] is a tunable reward added for each output word. Huang et al. [43] and Yang

et al. [119] propose variations of this reward that enable better guarantees during search.

In this chapter, we argue that these two problems are related (as hinted at by Koehn

and Knowles) and that both stem from label bias, an undesirable property of models that

generate sentences word by word instead of all at once.

The typical solution is to introduce a sentence-level correction to the model. We show

that making such a correction almost completely eliminates the beam problem. We com-

pare two commonly-used corrections, length normalization and a word reward, and show

that the word reward is slightly better.

Finally, instead of tuning the word reward using grid search, we introduce a way to

learn it using a perceptron-like tuning method. We show that the optimal value is sensitive

both to task and beam size, implying that it is important to tune for every model trained.

Fortunately, tuning is a quick post-training step.

6.2 Problem

Current neural machine translation models are examples of locally normalized models,

which estimate the probability of generating an output sequence e = e1:m as

P(e1:m) = ∏_{i=1}^{m} P(ei | e1:i−1).


For any partial output sequence e1:i, let us call P(e′ | e1:i), where e′ ranges over all

possible completions of e1:i, the suffix distribution of e1:i. The suffix distribution must sum

to one, so if the model overestimates P(e1:i), there is no way for the suffix distribution to

downgrade it. This is known as label bias [13, 57].

6.2.1 Label bias in sequence labeling

Label bias was originally identified in the context of HMMs and MEMMs for sequence-

labeling tasks, where the input sequence f and output sequence e have the same length, and

P(e1:i) is conditioned only on the partial input sequence f1:i. In this case, since P(e1:i) has

no knowledge of future inputs, it’s much more likely to be incorrectly estimated. For exam-

ple, suppose we had to translate, word-by-word, un helicoptere to a helicopter (Figure 6.1).

Given just the partial input un, there is no way to know whether to translate it as a or an.

Therefore, the probability for the incorrect translation P(an) will turn out to be an overesti-

mate. As a result, the model will overweight translations beginning with an, regardless of

the next input word.

This effect is most noticeable when the suffix distribution has low entropy, because even

when new input (helicoptere) is revealed, the model will tend to ignore it. For example,

suppose that the available translations for helicoptere are helicopter, chopper, whirlybird,

and autogyro. The partial translation a must divide its probability mass among the three

translations that start with a consonant, while an gives all its probability mass to autogyro,

causing the incorrect translation an autogyro to end up with the highest probability.
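Concretely, multiplying the edge probabilities in Figure 6.1 gives

P(a helicopter) = 0.6 × 0.6 = 0.36
P(a chopper)    = 0.6 × 0.3 = 0.18
P(a whirlybird) = 0.6 × 0.1 = 0.06
P(an autogyro)  = 0.4 × 1.0 = 0.40,

so the single most probable complete translation is the incorrect an autogyro.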

In this example, P(an), even though overestimated, is still lower than P(a), and wins

only because its suffixes have higher probability. Greedy search would prune the incorrect

prefix an and yield the correct output. In general, then, we might expect greedy or beam

search to alleviate some symptoms of label bias. Namely, a prefix with a low-entropy suffix

distribution can be pruned if its probability is, even though overestimated, not among the

highest probabilities. Such an observation was made by Zhang and Nivre [120] in the


Figure 6.1. Label bias causes this toy word-by-word translation model to translate French un helicoptere incorrectly to an autogyro. (Edge probabilities: a/0.6 and an/0.4 from the start; helicopter/0.6, chopper/0.3, and whirlybird/0.1 following a; autogyro/1 following an.)

context of dependency parsing, and we will see next that precisely such a situation affects

output length in NMT.

6.2.2 Length bias in NMT

In NMT, unlike the word-by-word translation example in the previous section, each

output symbol is conditioned on the entire input sequence. Nevertheless, it’s still possible

to overestimate or underestimate p(e1:i), so the possibility of label bias still exists. We

expect that it will be more visible with weaker models, that is, with less training data.

Moreover, in NMT, the output sequence is of variable length, and generation of the out-

put sequence stops when </s> is generated. In effect, for any prefix ending with </s>, the

suffix distribution has zero entropy. This situation parallels the example of the previous section

closely: if the model overestimates the probability of outputting </s>, it may proceed to

ignore the rest of the input and generate a truncated translation.

Figure 6.2 illustrates how this can happen. Although the model can learn not to prefer

shorter translations by predicting a low probability for </s> early on, at each time step,

the score of </s> puts a limit on the total remaining score a translation can have; in the

figure, the empty translation has score −10.1, so that no translation can have score lower

than −10.1. This lays a heavy burden on the model to correctly guess the total score of the


whole translation at the outset.

As in our label-bias example, greedy search would prune the incorrect empty transla-

tion. More generally, consider beam search: at time step t, only the top k partial or complete

translations are retained while the rest are pruned. (Implementations of beam search vary

in the details, but this variant is simplest for the sake of argument.) Even if a translation

ending at time t scores higher than a longer translation, as long as it does not fall within

the top k when compared with partial translations of length t (or complete translations of

length at most t), it will be pruned and unable to block the longer translation. But if we widen the beam (k), the shorter translation survives and can win, and translation accuracy will suffer. We call this problem (which is

Koehn and Knowles’s sixth challenge) the beam problem. Our claim, hinted at by Koehn

and Knowles [51], is that the brevity problem and the beam problem are essentially the

same, and that solving one will solve the other.

6.3 Correcting Length

To address the brevity problem, many designers of NMT systems add corrections to the

model. These corrections are often presented as modifications to the search procedure. But,

in our view, the brevity problem is essentially a modeling problem, and these corrections

should be seen as modifications to the model (Section 6.3.1). Furthermore, since the root

of the problem is local normalization, our view is that these modifications should be trained

as globally-normalized models (Section 6.3.2).

6.3.1 Models

Without any length correction, the standard model score (higher is better) is:

s(e) = ∑_{i=1}^{m} log P(ei | e1:i−1).

To our knowledge, there are three methods in common use for adjusting the model to


Figure 6.2. A locally normalized model must determine, at each time step, a “budget” for the total remaining log-probability. In this example sentence, “The British women won Olymp ic gold in p airs row ing,” the empty translation has initial position 622 in the beam. Already by the third step of decoding, the correct translation has a lower score than the empty translation. However, using greedy search, a nonempty translation would be returned.


favor longer sentences.

Length normalization divides the score by m [14, 46, 51]:

s′(e) = s(e) / m.

Google’s NMT system [117] relies on a more complicated correction:

s′(e) = s(e) / ((5 + m)^α / (5 + 1)^α).

Finally, some systems add a constant word reward [41]:

s′(e) = s(e) + γm.

If γ = 0, this reduces to the baseline model. The advantage of this simple reward is that it

can be computed on partial translations, making it easier to integrate into beam search.
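As a sketch (ours, not the implementation in Lamtram), the three corrections applied to a finished hypothesis with base score s and length m are:

def length_normalized(s, m):
    # Divide the total log-probability by the output length.
    return s / m

def gnmt_corrected(s, m, alpha):
    # Google's NMT correction: divide by (5 + m)^alpha / (5 + 1)^alpha.
    return s / (((5 + m) ** alpha) / ((5 + 1) ** alpha))

def word_reward(s, m, gamma):
    # Add a constant reward per output word; gamma = 0 recovers the baseline.
    # Because this decomposes over words, it can also be added to partial
    # hypotheses inside beam search.
    return s + gamma * m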

6.3.2 Training

All of the above modifications can be viewed as modifications to the base model so that

it is no longer a locally-normalized probability model.

To train this model, in principle, we should use something like the globally-normalized

negative log-likelihood:

L = − log ( exp s′(e∗) / ∑_e exp s′(e) )

where e∗ is the reference translation. However, optimizing this is expensive, as it requires performing inference on every training example or resorting to heuristic approximations [2, 102].

Alternatively, we can adopt a two-tiered model, familiar from phrase-based translation

[82], first training s and then training s′ while keeping the parameters of s fixed, possibly


on a smaller dataset. A variety of methods, like minimum error rate training [41, 81], are

possible, but keeping with the globally-normalized negative log-likelihood, we obtain, for

the constant word reward, the gradient:

∂L/∂γ = −|e∗| + E[|e|].

If we approximate the expectation using the mode of the distribution, we get

∂L/∂γ ≈ −|e∗| + |e|

where e is the 1-best translation. Then the stochastic gradient descent update is just the

familiar perceptron rule:

γ ← γ + η (|e∗| − |e|),

although below, we update on a batch of sentences rather than a single sentence. Since

there is only one parameter to train, we can train it on a relatively small dataset.
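A sketch of the batch version of this update, using the settings reported in Section 6.4.1; decode is a hypothetical function returning 1-best translations under the current reward, and the data-access names are placeholders.

def tune_word_reward(dev_batches, decode, gamma=0.2, lr=0.2,
                     max_epochs=25, clip=0.5, tol=0.03):
    # Perceptron-style tuning: adjust gamma so that 1-best output lengths
    # match reference lengths on a small development set.
    for _ in range(max_epochs):
        diff = 0.0
        n = 0
        for batch in dev_batches:
            hyps = decode(batch.sources, gamma)            # 1-best translations
            diff += sum(len(ref) - len(hyp)
                        for ref, hyp in zip(batch.references, hyps))
            n += len(batch.sources)
        update = lr * diff / n                             # batch gradient step
        update = max(-clip, min(clip, update))             # clip the update
        gamma += update
        if abs(update) < tol:                              # stopping criterion
            break
    return gamma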

Length normalization does not have any additional parameters, with the result (in our

opinion, strange) that a change is made to the model without any corresponding change to

training. We could use gradient-based methods to tune the α in the GNMT correction, but

the perceptron approximation turns out to drive α to ∞, so a different method would be

needed.

6.4 Experiments

We compare the above methods in four settings: a high-resource German–English sys-

tem, a medium-resource Russian–English system, and two low-resource French–English

and English–French systems. For all settings, we show that larger beams lead to large

BLEU and METEOR drops if not corrected. We also show that the optimal parameters


can depend on the task, language pair, training data size, as well as the beam size. These

values can affect performance strongly.

6.4.1 Data and settings

Most of the experimental settings below follow the recommendations of Denkowski

and Neubig [26]. Our high-resource, German–English data is from the 2016 WMT shared

task [11]. We use a bidirectional encoder-decoder model with attention [5].1 Our word

representation layer has 512 hidden units, while other hidden layers have 1024 nodes. Our

model is trained using Adam with a learning rate of 0.0002. We use 32k byte-pair encoding

(BPE) operations learned on the combined source and target training data [99]. We train

on minibatches of size 2012 words and validate every 100k sentences, selecting the final

model based on development perplexity.

Our medium-resource, Russian–English system uses data from the 2017 WMT trans-

lation task, which consists of roughly 1 million training sentences [12]. We use the same

architecture as our German–English system, but only have 512 nodes in all layers. We use

16k BPE operations and dropout of 0.2. We train on minibatches of 512 words and validate

every 50k sentences.

Our low-resource systems use French and English data from the 2010 IWSLT TALK

shared task [87]. We build both French–English and English–French systems. These net-

works are the same as for the medium Russian-English task, but use only 6k BPE opera-

tions. We train on minibatches of 512 words and validate every 30k sentences, restarting

Adam when the development perplexity goes up.

To tune our correction parameters, we use 1000 sentences from the German–English

development dataset, 1000 sentences from the Russian–English development dataset, and

the entire development dataset for French–English (892 sentences)2. We initialize the pa-

1We use Lamtram [75] for all experiments and our modifications have been added to the project.

2We found through preliminary experiments that this size of dev subset was an adequate trade-off between


Russian–English (medium)
                    Beam Size
                    10      50      75      100     150     1000
baseline  BLEU      24.9    23.8    23.6    23.3    22.5    3.7
          METEOR    30.9    30.0    29.7    29.4    28.8    12.8
          length    0.90    0.86    0.85    0.84    0.81    0.31
reward    BLEU      26.5    26.6    26.5    26.5    26.5    25.7
          METEOR    32.0    32.0    31.9    31.9    31.9    31.2
          length    0.98    0.98    0.98    0.98    0.98    1.02
          γ         0.716   0.643   0.640   0.633   0.617   0.562
norm      BLEU      26.2    26.3    26.3    26.3    26.3    25.3
          METEOR    31.8    31.8    31.8    31.7    31.7    31.2
          length    0.96    0.96    0.96    0.96    0.97    1.02

Table 6.1: Results of the Russian–English translation system. We report BLEU and METEOR scores, as well as the ratio of the length of generated sentences compared to the correct translations (length). γ is the word reward score discovered during training. Here, we examine a much larger beam (1000). The beam problem is more pronounced at this scale, with the baseline system losing over 20 BLEU points when increasing the beam from size 10 to 1000. However, both our tuned length reward score and length normalization recover most of this loss.

rameter to γ = 0.2. We use batch gradient descent, which we found to be much more stable than stochastic gradient descent, with a learning rate of η = 0.2, clipping gradients for γ to 0.5. Training stops if all parameters have an update of less than 0.03 or a maximum of 25 epochs is reached.

6.4.2 Solving the length problem solves the beam problem

Here, we first show that the beam problem is indeed the brevity problem. We then

demonstrate that solving the length problem does solve the beam problem. Tables 6.1, 6.2, and 6.3 show the results of our Russian–English, German–English, and French–English

systems respectively. Each table looks at the impact on BLEU, METEOR, and the ratio

of the lengths of generated sentences compared to the gold lengths [25, 84]. The baseline

tuning speed and performance.


German–English (large)
                    Beam Size
                    10      50      75
baseline  BLEU      29.6    28.6    28.2
          METEOR    34.0    33.1    32.8
          length    0.95    0.90    0.89
reward    BLEU      30.3    30.6    30.6
          METEOR    34.9    34.8    34.9
          length    1.02    1.00    1.00
          γ         0.67    0.57    0.58
norm      BLEU      30.7    31.0    30.9
          METEOR    34.9    35.0    35.0
          length    1.00    1.00    1.00

Table 6.2: Results of the high-resource German–English system. Rows: BLEU, METEOR, length = ratio of output to reference length; γ = learned parameter value. While baseline performance decreases with beam size due to the brevity problem, other methods perform more consistently across beam sizes. Length normalization (norm) gets the best BLEU scores, but similar METEOR scores to the word reward.

method is a standard model without any length correction. The reward method is the tuned

constant word reward discussed in the previous section. Norm refers to the normalization

method, where a hypothesis’ score is divided by its length.

6.4.2.1 Baseline

The top sections of Tables 6.1, 6.2, 6.3 illustrate the brevity and beam problems in the

baseline models. As beam size increases, the BLEU and METEOR scores drop signifi-

cantly. This is due to the brevity problem, which is illustrated by the length ratio numbers

that also drop with increased beam size. For larger beam sizes, the length of the generated

output sentences are a fraction of the lengths of the correct translations. For the lower-

resource French–English task, the drop is more than 8 BLEU when increasing the beam

size from 10 to 150. The issue is even more evident in our Russian-English system where

we increase the beam to 1000 and BLEU scores drop by more than 20 points.


French–English (small)
                    Beam Size
                    10      50      100     150     200
baseline  BLEU      30.0    28.9    25.4    21.9    19.4
          METEOR    32.4    31.3    28.6    25.9    24.1
          length    0.94    0.89    0.80    0.71    0.64
reward    BLEU      29.4    29.7    29.7    29.8    29.8
          METEOR    32.8    32.9    32.9    32.9    32.9
          length    1.03    1.03    1.03    1.03    1.03
          γ         1.20    1.05    1.01    0.99    0.97
norm      BLEU      30.7    30.8    30.7    30.7    30.7
          METEOR    32.8    32.8    32.8    32.7    32.7
          length    0.97    0.97    0.97    0.96    0.96

English–French (small)
                    Beam Size
                    10      50      100     150     200
baseline  BLEU      25.8    26.1    26.1    25.5    24.3
          METEOR    47.8    47.5    47.2    46.3    44.2
          length    1.03    1.01    1.00    0.97    0.92
reward    BLEU      25.5    25.5    25.5    25.5    25.5
          METEOR    48.3    48.5    48.5    48.5    48.4
          length    1.05    1.05    1.05    1.05    1.05
          γ         0.353   0.444   0.465   0.474   0.475
norm      BLEU      25.4    25.5    25.5    25.5    25.5
          METEOR    48.4    48.4    48.4    48.4    48.4
          length    1.06    1.05    1.05    1.05    1.05

Table 6.3: Results of low-resource French–English and English–French systems. Rows: BLEU, METEOR, length = ratio of output to reference length; γ = learned parameter value. While baseline performance decreases with beam size due to the brevity problem, other methods perform more consistently across beam sizes. Word reward gets the best scores in both directions on METEOR. Length normalization (norm) gets the best BLEU scores in Fra-Eng due to the slight bias of BLEU towards shorter translations.


beam                       10      50      75      100     150     200
French–English (small)     6.9     27.2    52.4    71.1    105.9   176.6
English–French (small)     12.6    44.2    67.3    88.1    107.5   111.2
German–English (large)     6.8     132.6   1066

Table 6.4: Tuning time on top of baseline training time. Times are in minutes on 1000 dev examples (German–English) or 892 dev examples (French–English). Due to the much larger model size, we only looked at beam sizes up to 75 for German–English.

6.4.2.2 Word reward

The results of tuning the word reward, γ, as described in Section 6.3.2, are shown in

the second section of Tables 6.1, 6.2, and 6.3. In contrast to our baseline systems, our

tuned word reward always fixes the brevity problem (length ratios are approximately 1.0),

and generally fixes the beam problem. An optimized word reward score always leads to

improvements in METEOR scores over any of the best baselines. Across all language

pairs, reward and norm have close METEOR scores, though the reward method wins out

slightly. BLEU scores for reward and norm also increase over the baseline in most cases,

despite BLEU’s inherent bias towards shorter sentences. Most notably, whereas the base-

line Russian–English system lost more than 20 BLEU points when the beam was increased

to 1000, our tuned reward score resulted in a BLEU gain over any baseline beam size.

Whereas in our baseline systems, the length ratio decreases with larger beam sizes, our

tuned word reward results in length ratios of nearly 1.0 across all language pairs, mitigat-

ing many of the issues of the brevity problem.

6.4.2.3 Wider beam

We note that the beam problem in NMT exists for relatively small beam sizes – espe-

cially when compared to traditional beam sizes in SMT systems. On our medium-resource

Russian–English system, we investigate the full impact of this problem using a much larger

beam size of 1000. In Table 6.1, we can see that the beam problem is particularly pro-


nounced. The first row of the table shows the uncorrected, baseline score. From a beam

of 10 to a beam of 1000, the drop in BLEU scores is over 20 points. This is largely due to

the brevity problem discussed earlier. The second row of the table shows the length of the

translated outputs compared to the lengths of the correct translations. Though the problem

persists even at a beam size of 10, at a beam size of 1000, our baseline system generates

less than one third the number of words that are in the correct translations. Furthermore, 37.3% of our translated outputs are sentences of length 0. In other words, the most likely

translation is to immediately generate the stop symbol. This is the problem visualized in

Figure 6.2.

However, when we tune our word reward score with a beam of 1000, the problem

mostly goes away. Over the uncorrected baseline, we see a 22.0 BLEU point difference

for a beam of 1000. Over the uncorrected baseline with a beam of 10, the corrected beam

of 1000 gets a BLEU gain of 0.8 BLEU. However, the beam of 1000 still sees a drop of

less than 1.0 BLEU over the best corrected version. The word reward method beats the

uncorrected baseline and the length normalization correction in almost all cases.

6.4.2.4 Short sentences

Another way to demonstrate that the beam problem is the same as the brevity problem

is to look at the translations generated by baseline systems on shorter sentences. Figure 6.3

shows the BLEU scores of the Russian–English system for beams of size 10 and 1000 on

sentences of varying lengths, with and without correcting lengths. The x-axes of the figure

are cumulative: length 20 includes sentences of length 0–20, while length 10 includes 0–

10. It is worth noting that BLEU is a word-level metric, but the systems were built using

BPE; so the sequences actually generated are longer than the x-axes would suggest.

The baseline system on sentences with 10 words or less still has relatively high BLEU

scores—even for a beam of 1000. Though there is a slight drop in BLEU (less than 2),

it is not nearly as severe as when looking at the entire test set (more than 20). When


Figure 6.3: Impact of beam size on BLEU score when varying reference sentence lengths (in words) for Russian–English. The x-axis is cumulative moving right; length 20 includes sentences of length 0-20, while length 10 includes 0-10. As reference length increases, the BLEU scores of a baseline system with beam size of 10 remain nearly constant. However, a baseline system with beam 1000 has a high BLEU score for shorter sentences, but a very low score when the entire test set is used. Our tuned reward and normalized models do not suffer from this problem on the entire test set, but take a slight performance hit on the shortest sentences.


Figure 6.4: Histogram of the length ratio between generated sentences and gold, varied across methods and beam sizes for Russian–English. Note that the baseline method skews closer to 0 as the beam size increases, while our other methods remain peaked around 1.0. There are a few outliers to the right that have been cut off, as well as the peaks at 0.0 and 1.0.


correcting for length with normalization or word reward, the problem nearly disappears

when considering the entire test set, with reward doing slightly better. For comparison, the

rightmost points in each of the subplots correspond to the BLEU scores in columns 10 and

1000 of Table 6.1. This suggests that the beam problem is strongly related to the brevity

problem.

6.4.2.5 Length ratio

The interaction between the length problem and the beam problem can be visualized

in the histograms of Figure 6.4 on the Russian–English system. In the upper left plot, the

uncorrected model with beam 10 has the majority of the generated sentences with a length

ratio close to 1.0, the gold lengths. Going down the column, as the beam size increases,

the distribution of length ratios skews closer to 0. By a beam size of 1000, 37% of the

sentences have a length of 0. However, both the word reward and the normalized models

remain very peaked around a length ratio of 1.0 even as the beam size increases.

6.4.3 Tuning word reward

Above, we have shown that fixing the length problem with a word reward score fixes

the beam problem. However, these results are contingent upon choosing an adequate word

reward score, which we have done in our experiments by optimization using a perceptron

loss. Here, we show the sensitivity of systems to the value of this penalty, as well as the

fact that there is not one correct penalty for all tasks. It depends on a myriad of factors, including beam size, dataset, and language pair.

6.4.3.1 Sensitivity to word reward value

In order to investigate how sensitive a system is to the reward score, we varied values

of γ from 0 to 1.2 on both our German–English and Russian–English systems with a beam

size of 50. BLEU scores and length ratios on 1000 heldout development sentences are


shown in Figure 6.5. The length ratio is correlated with the word reward as expected, and

the BLEU score varies by more than 5 points for German–English and over 4.5 points for

Russian–English. On German–English, our method found a value of γ = 0.57, which is

slightly higher than optimal; this is because the heldout sentences have a slightly shorter

length ratio than the training sentences. Conversely, on Russian–English, our found value

of γ = 0.64 is slightly lower than optimal as these heldout sentences have a slightly higher

length ratio than the sentences used in training.

6.4.3.2 Optimized word reward values

Tuning the reward penalty using the method described in Section 6.3.2 resulted in con-

sistent improvements in METEOR scores and length ratios across all of our systems and

language pairs. Tables 6.1, 6.2, and 6.3 show the optimized value of γ for each beam size.

Within a language pair, the optimal value of γ is different for every beam size. Likewise, for

a given beam size, the optimal value is different for every system. Our French–English and

English–French systems in Table 6.3 have the exact same architecture, data, and training

criteria. Yet, even for the same beam size, the tuned word reward scores are very different.

Training dataset size. Low-resource neural machine translation performs significantly

worse than high-resource machine translation [51]. Table 6.5 looks at the impact of training

data size on BLEU scores and the beam problem by using 10% and 50% of the available

Russian–English data. Once again, the optimal value of γ is different across all systems

and beam sizes. Interestingly, as the amount of training data decreases, the gains in BLEU

using a tuned reward penalty increase with larger beam sizes. This suggests that the beam

problem is more prevalent in lower-resource settings, likely due to the fact that less training

data can increase the effects of label bias.


Figure 6.5. Effect of word penalty on BLEU and hypothesis length for Russian–English (top) and German–English (bottom) on 1000 unseen dev examples with beams of 50. Note that the vertical bars represent the word reward that was found during tuning.


Russian–English (medium)                              Beam Size
Dataset Size                 10       50       75       100      150
100%     baseline            24.9     23.8     23.6     23.3     22.5
         reward              26.5     26.6     26.5     26.5     26.5
         γ                   0.716    0.643    0.640    0.633    0.617
50%      baseline            22.8     21.4     20.8     20.4     19.2
         reward              24.7     25.0     24.9     24.9     25.0
         γ                   0.697    0.645    0.638    0.636    0.646
10%      baseline            17.0     16.2     15.8     15.6     15.1
         reward              17.6     18.0     18.0     18.0     18.1
         γ                   0.892    0.835    0.773    0.750    0.800

Table 6.5: Varying the size of the Russian–English training dataset results in different optimal word reward scores (γ). In all settings, the tuned score alleviates the beam problem. As the datasets get smaller, using a tuned larger beam improves the BLEU score over a smaller tuned beam. This suggests that lower-resource systems are more susceptible to the beam problem.

6.4.3.3 Tuning time

Fortunately, the tuning process is very inexpensive. Although it requires decoding on

a development dataset multiple times, we only need a small dataset. The time required

for tuning our French–English and German–English systems is shown in Table 6.4. These

experiments were run on an Nvidia GeForce GTX 1080Ti. The tuning usually takes a few minutes to a few hours, which is just a fraction of the overall training time. We note that there are numerous optimizations that could be applied to speed this up even further, such as storing

the decoding lattice for partial reuse. However, we leave this for future work.

6.4.4 Word reward vs. length normalization

Tuning the word reward score generally yielded higher METEOR scores than length normalization across all of our settings. With BLEU, length normalization beat the word reward on German–English and French–English, but tied on English–French and lost on Russian–English. For the largest beam of 1000, the tuned word reward had a higher BLEU


than length normalization. Overall, the two methods have relatively similar performance,

but the tuned word reward has the more theoretically justified, globally-normalized deriva-

tion – especially in the context of label bias’ influence on the brevity problem.
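For reference, the two corrections compared in this section can be written as the minimal sketch below. The length-normalization form shown here (dividing the log-probability by the hypothesis length, optionally raised to a power α) is one common variant; exact formulations differ across toolkits.

    def word_reward_score(log_prob, length, gamma):
        # Correction studied in this chapter: add a constant bonus per output word.
        return log_prob + gamma * length

    def length_normalized_score(log_prob, length, alpha=1.0):
        # Common heuristic alternative: divide the log-probability by length**alpha.
        return log_prob / (length ** alpha)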

6.5 Conclusion

We have explored simple and effective ways to alleviate or eliminate the beam problem.

We showed that the beam problem can largely be explained by the brevity problem, which

results from the locally-normalized structure of the model. We compared two corrections

to the model and introduced a method to learn the parameters of these corrections. Because

this method is both helpful and easy to apply, we hope to see it adopted to make stronger baseline NMT systems.

We have argued that the brevity problem is an example of label bias, and that the

solution is a very limited form of globally-normalized model. These can be seen as the

simplest case of the more general problem of label bias and the more general solution

of globally-normalized models for NMT [94, 102, 112, 116]. Some questions for future

research are:

• Solving the brevity problem leads to significant BLEU gains; how much, if any, improvement remains to be gained by solving label bias in general?

• Our solution to the brevity problem requires globally-normalized training on only a small dataset; can more general globally-normalized models be trained in a similarly inexpensive way?


CHAPTER 7

AN EXPECTATION-MAXIMIZATION APPROACH TO BPE SEGMENTATION

The previous portions of this dissertation have looked at methods for learning the struc-

ture of the model as well as ways to correct modeling errors at inference time. In this

chapter, we deal with the inputs and outputs of the model – namely the vocabulary of the

source and target languages. In particular, we investigate the impact that subword units

have on performance.

Abstract: We investigate the impact that varying coarseness of subword units has on neural machine translation. We introduce a method that is able to learn the optimal coarseness while only training one model, by treating the coarseness as a hidden variable within the model. We demonstrate our method on four different language pairs from different linguistic families and show that, compared to a grid search, which requires many training runs, our method achieves the same or slightly better accuracy in less time. The larger the dataset, the larger the speedup: on our largest dataset, the speedup is more than threefold.

7.1 Introduction

Whereas older statistical machine translation systems used pipelines of many process-

ing steps, neural machine translation (NMT) systems have successfully combined most of

the training and translation process into a single step, while advancing the state of the art.

However, some pre- and post-processing steps still remain. To improve handling of rare

and unknown words, the current standard practice is to perform subword segmentation be-

fore training, typically using byte pair encoding or BPE [101]. With this additional stage

comes an additional hyperparameter that must be tuned, namely, the number of BPE merge

operations (hereafter, coarseness).


Numerous papers have looked into the impact of the number of BPE operations on

machine translation performance. Generally, recommendations are to use fewer operations

for smaller datasets [27]. But the optimal number of operations could, in principle, depend

on properties of the language, dataset, or model; for instance, [20] found a dependency

between subword coarseness and number of layers. Unfortunately, this implies that find-

ing an optimal model requires searching over many subword coarseness levels, which is

computationally expensive.

In this chapter, we introduce a method that incorporates coarseness (the number of

BPE operations) as a latent variable in the model and uses expectation-maximization to

optimize coarseness along with the rest of the model’s parameters. Thus, it unifies BPE

and the translation model into a single model and efficiently solves the problem of choosing

the right coarseness level.

We make several findings:

• There is no one optimal number of BPE operations. In our language pairs, which are all considered low-resource, the optimal number ranges from 2k to 64k.

• Grid search does not always find the optimal coarseness, but our method does.

• Our method is always faster than grid search. The larger the data, the larger the speedup: on our largest language pair, the speedup is more than threefold.

7.2 Background and Related Work

In this section, we provide some background information and survey previous work on

segmentation in machine translation.

7.2.1 Neural machine translation

In most neural machine translation systems, the input to the network is a sequence of

source-language words, each of which is represented as a one-hot vector, a vector of all

zeros except for a solitary one corresponding to the word type. Likewise, the final output


is typically a sequence of softmaxes over the target vocabulary, which are compared with

one-hot vectors for the reference target words.

A consequence of this representation of words is that the size of the source and target

vocabulary must be known in advance. When dealing with human languages, which nat-

urally have a long-tailed distribution with a large number of word types, design decisions

must be made as to how to fit all possible words into a fixed-size vocabulary.

7.2.2 Handling unknown words

Various schemes have been used to handle unknown words in NMT. The easiest and

oldest solution is to choose a fixed vocabulary size, |V|, and replace all words not in the |V|

most frequent with a special token <unk>.
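As an illustration, here is a minimal Python sketch of this scheme (the helper name and toy corpus are hypothetical):

    from collections import Counter

    def replace_rare_with_unk(corpus, vocab_size):
        """Keep the vocab_size most frequent word types; map the rest to <unk>."""
        counts = Counter(word for sentence in corpus for word in sentence)
        keep = {word for word, _ in counts.most_common(vocab_size)}
        return [[word if word in keep else "<unk>" for word in sentence]
                for sentence in corpus]

    corpus = [["she", "sells", "seashells"], ["she", "sells", "shells"]]
    print(replace_rare_with_unk(corpus, vocab_size=2))
    # [['she', 'sells', '<unk>'], ['she', 'sells', '<unk>']]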

Currently, the most common approach is to split words into subwords using byte pair

encoding or BPE [101], which starts with individual characters and repeatedly merges the

most-frequent bigram (see Figure 7.1). The number of merge operations must be deter-

mined in advance, and the number of operations roughly determines the vocabulary size.

While the number of operations should, in principle, be chosen to maximize performance

(speed or accuracy or some combination thereof) on held-out data, it seems common to

rely on simple rules of thumb. For example, Denkowski and Neubig [27] recommend

as a best practice to use 16k operations for low-resource datasets and 32k operations for

high-resource datasets.

SentencePiece [55] is a more recent method. It treats subword segmentation as a latent

variable and uses EM to train a unigram language model over subwords. Since a given

training sentence has many possible segmentations, it randomly samples one segmenta-

tion. It too has a number of hyperparameters that need to be tuned, including a desired

vocabulary size.


rule          | text
(initial)     | s@ h@ e s@ e@ l@ l@ s s@ e@ a@ s@ h@ e@ l@ l@ s b@ y t@ h@ e s@ e@ a@ s@ h@ o@ r@ e
s@ h → sh     | sh@ e s@ e@ l@ l@ s s@ e@ a@ sh@ e@ l@ l@ s b@ y t@ h@ e s@ e@ a@ sh@ o@ r@ e
s@ e → se     | sh@ e se@ l@ l@ s se@ a@ sh@ e@ l@ l@ s b@ y t@ h@ e se@ a@ sh@ o@ r@ e
l@ l → ll     | sh@ e se@ ll@ s se@ a@ sh@ e@ ll@ s b@ y t@ h@ e se@ a@ sh@ o@ r@ e
sh@ e → she   | she se@ ll@ s se@ a@ she@ ll@ s b@ y t@ h@ e se@ a@ sh@ o@ r@ e
ll@ s → lls   | she se@ lls se@ a@ she@ lls b@ y t@ h@ e se@ a@ sh@ o@ r@ e
se@ a → sea   | she se@ lls sea@ she@ lls b@ y t@ h@ e sea@ sh@ o@ r@ e

Figure 7.1: Example application of BPE rules.
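The merge-learning procedure illustrated in Figure 7.1 can be sketched in a few lines of Python. This toy version (all names are mine) operates on a word-frequency dictionary, starts from characters, and repeatedly merges the most frequent adjacent symbol pair; it omits details of standard implementations, such as the word-internal continuation markers ("@" in the figure) and end-of-word handling.

    from collections import Counter

    def merge_pair(symbols, pair):
        """Merge every occurrence of `pair` in a list of symbols."""
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        return merged

    def learn_bpe(word_freqs, num_merges):
        """Learn BPE merge rules from {word: frequency}; each word starts as characters."""
        vocab = Counter({tuple(word): freq for word, freq in word_freqs.items()})
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            new_vocab = Counter()
            for symbols, freq in vocab.items():
                new_vocab[tuple(merge_pair(list(symbols), best))] += freq
            vocab = new_vocab
        return merges

    print(learn_bpe({"she": 3, "sells": 2, "seashells": 2}, num_merges=3))

Applying the learned merges, in order, to new text yields the segmentation at the corresponding coarseness level.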

7.2.3 Hyperparameter search

Hyperparameter search is a well-studied problem. Frequently, grid search is used,

which trains multiple models by varying each hyperparameter over a fixed set of values.

Alternatively, random search [8] can be used, where hyperparameters are generated from

a probability distribution.

Other approaches, such as [38, 44, 61], attempt to tune hyperparameters adaptively rather than by brute-force search. This too can be costly, as well as difficult to incorporate when the hyperparameters are part of a preprocessing step.

We are aware of one previous method for automatically tuning the number of BPE op-

erations, by Salesky et al. [96]. Their method starts with character-level segmentation and

progresses to coarser segmentations based on development perplexity. Our method, de-

scribed in the next section, does not use a predetermined schedule; it considers all coarse-

ness levels simultaneously, and in practice, we find that it progresses in the opposite direc-

tion, from coarser segmentations to finer segmentations.

7.3 Method

Rather than treating BPE as a separate preprocessing stage, our method views coarse-

ness (the number of BPE operations) as a latent variable that we optimize during training

using expectation-maximization (EM) [23]. An outline of our method is shown in Algo-


rithm 3. In this section, we describe our model, how to train it, and how to translate new

data with it.

7.3.1 Model

We model coarseness, b, as a latent variable, which ranges over a fixed finite set B.

In our experiments, we always use B = {2k, 4k, 8k, 16k, 32k, 64k}. It’s common to see

numbers of operations ranging from 4k to 32k, so we chose B to cover this range and

beyond by a factor of two at both ends.

Let D be the training data, a sequence of sentence pairs (x, y). Under our model, the

log-likelihood of the data is

    \log P(D) = \sum_{(x,y) \in D} \log \sum_{b \in B} \lambda_b \, P(y \mid x, b)    (7.1)

where λb is the probability of choosing coarseness b and P(y | x, b) is the probability of

translating x to y using coarseness b. We could implement the latter using |B| different

translation models, but for efficiency and parameter-sharing, we use a single translation

model whose vocabulary is the union of the vocabularies at all coarseness levels in B.
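A minimal sketch of how (7.1) is evaluated for a single sentence pair, assuming the per-level log-probabilities log P(y | x, b) have already been computed by the shared translation model (all names here are hypothetical):

    import math

    def mixture_log_likelihood(log_probs_per_b, lam):
        """log of sum_b lam[b] * P(y | x, b), computed stably in log space."""
        terms = [math.log(lam[b]) + log_probs_per_b[b] for b in lam]
        m = max(terms)
        return m + math.log(sum(math.exp(t - m) for t in terms))

    # Toy usage with two coarseness levels and uniform weights.
    print(mixture_log_likelihood({"8k": -10.0, "16k": -12.0},
                                 {"8k": 0.5, "16k": 0.5}))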

7.3.2 Vocabulary

Recall that BPE starts with characters and repeatedly merges the most common bigram.

For instance, in Figure 7.1, the initial vocabulary has 12 types (s@, h@, e, e@, l@, s, a@,

b@, y, t@, o@, r@). For each merge operation, the vocabulary size increases by one

type. However, it’s also possible that merging two types causes one of them to disappear

from the data. For example, after merging “l@ l@” into “ll@,” there is no token “l@” left

in the data. To guarantee that the vocabulary of our translation model is the union of all

the vocabularies at all coarseness levels in B, the simplest scheme is to learn the maximal

number of BPE operations, but not to discard a subword type even if it has zero count in


the data (because it might have nonzero count at a lower coarseness level).
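One way to picture the union vocabulary is the sketch below, which collects every subword type seen at any coarseness level; the simpler scheme described above (learn the maximal number of merges and keep zero-count types) produces the same set without materializing |B| segmented copies of the data. Names here are hypothetical.

    def union_vocabulary(segmented_corpora):
        """Union of the subword vocabularies over all coarseness levels.

        segmented_corpora[b] is the training corpus segmented with b merge
        operations, given as a list of token lists. Types that disappear at a
        coarse level (e.g. "l@" once "l@ l@" is merged) are still included,
        because they occur at finer levels.
        """
        vocab = set()
        for corpus in segmented_corpora.values():
            for sentence in corpus:
                vocab.update(sentence)
        return sorted(vocab)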

7.3.3 Training

To learn the parameters of P(y | x, b) together with the parameters λb, we use expectation-

maximization, which would in principle look like this:

• Initialization: Set the coarseness weights λb to uniform:

    \lambda_b = \frac{1}{|B|}    (7.2)

• E step: Compute the expected number of times that each coarseness level is used, according to the current model:

    E[c(b)] = \sum_{(x,y) \in D} \frac{\lambda_b \, P(y \mid x, b)}{\sum_{b' \in B} \lambda_{b'} \, P(y \mid x, b')}    (7.3)

• M step: Using the expected counts, re-estimate the coarseness weights λb by relative frequency estimation,

    \lambda_b = \frac{E[c(b)]}{\sum_{b' \in B} E[c(b')]}    (7.4)

and update the parameters of the translation model to improve (not maximize) the likelihood of the parallel data,

    L = \sum_{(x,y) \in D} \sum_{b \in B} E[c(b)] \, \log P(y \mid x, b)    (7.5)

However, we want to avoid processing the training data |B| times, because this would slow

down training unacceptably. Thus, in the E step, we collect expected counts over the

development data, not the training data. This may also help to reduce overfitting.
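Concretely, the E step of (7.3) over the development data amounts to summing per-sentence posteriors, as in this sketch (the per-level log-probabilities are assumed to come from the current translation model; names are mine):

    import math

    def expected_counts(dev_log_probs, lam):
        """E[c(b)] from (7.3), accumulated over development sentence pairs.

        dev_log_probs[i][b] holds log P(y_i | x_i, b) for dev pair i.
        """
        counts = {b: 0.0 for b in lam}
        for sent in dev_log_probs:
            scores = {b: math.log(lam[b]) + sent[b] for b in lam}
            m = max(scores.values())
            z = sum(math.exp(s - m) for s in scores.values())
            for b, s in scores.items():
                counts[b] += math.exp(s - m) / z
        return counts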

In the M step, we need to perform an epoch of stochastic gradient descent (SGD) on

(7.5). We have two problems. First, the term E[c(b)] should be computed individually

for each training sentence, but we did this for the development data, not the training data.

The second problem is that the summation over b would slow down training. To avoid

both problems, for each training minibatch, we randomly sample b with probability λb and perform one SGD update on that minibatch using only coarseness b.
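The sampled M step then reduces to the loop sketched below; the segment and sgd_update callables stand in for the BPE segmenter and the translation model's optimizer step, and are placeholders rather than actual toolkit APIs.

    import random

    def sampled_m_step(minibatches, lam, segment, sgd_update):
        """One pass over the training minibatches of the sampled M step.

        For each minibatch, draw a coarseness level b with probability lam[b],
        segment the batch with b merge operations, and take one SGD step.
        """
        levels = list(lam)
        weights = [lam[b] for b in levels]
        for batch in minibatches:
            b = random.choices(levels, weights=weights, k=1)[0]
            sgd_update(segment(batch, b))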


Burn-in and early stopping We run training for a maximum of Etotal = 100 epochs. Due

to the poor state of the model at the start of training, for the first Eburnin = 3 epochs, we

only update the translation model; we do not update λb.

We tried early stopping by measuring performance on the development set. Although

it’s common to use development perplexity for this purpose, perplexity is measured per-

token and therefore can’t be compared across different coarseness levels. Instead, we com-

pute the BLEU score, using SacreBLEU [90], of de-segmented translations against the

un-segmented references.
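A sketch of this model-selection BLEU, using the sacrebleu package; it assumes the common "@@ " continuation-marker convention for BPE output, which may differ from the exact marker used in a given toolkit.

    import sacrebleu

    def dev_bleu(hypotheses_bpe, references):
        """De-segment BPE output, then score it against un-segmented references."""
        hypotheses = [h.replace("@@ ", "") for h in hypotheses_bpe]
        return sacrebleu.corpus_bleu(hypotheses, [references]).score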

Label smoothing Label smoothing [88] is helpful in machine translation, but because it

works by interpolating the output word distribution with the uniform distribution over the

vocabulary, and our vocabulary size is constantly changing, we omitted label smoothing in

our experiments.

It should be possible to use label smoothing with our method, by using the uniform

distribution appropriate for the current vocabulary, but we leave this for future work.

7.3.4 Decoding

During training, since our model uses multiple segmentations simultaneously, we have

to either process a sentence multiple times with different segmentations or randomly sam-

ple a coarseness level. Both of these options would be undesirable when using the model

to translate new text. We can use just the b that maximizes λb, but if this λb is not close to

1, then, as we will see in Section 7.4, it is (unfortunately) necessary to re-train the model

using just b.1

1 We found that another variation of our method, which uses length normalization [14] in the E step, reliably converges to weights where one λb is nearly one, making re-training unnecessary. We leave experimentation with this variation for future work.


Figure 7.2. Value of λb for English-to-Vietnamese over the course of training. Early in training, more BPE operations are preferred as this results in shorter sequences. However, as training progresses and the model improves, it is able to deal with finer-grained segmentation.


Figure 7.3. Value of λb for Arn-Spa over the course of training (top) and corresponding BLEU score on the development set (bottom). We measure the performance of our model during training by comparing the BLEU score corresponding to the segmentation with the highest λb. Note that early in training, more BPE operations are preferred, but by the end of training, fewer operations result in higher BLEU scores. Even as the model switches between coarseness levels at later epochs, dev BLEU scores are very consistent.


Figure 7.4. Value of λb for Japanese-to-Vietnamese over the course of training. Consistent with the other language pairs, early in training, more BPE operations are preferred. As training progresses, the model converges to the optimal value of 4k also found by grid search.


Figure 7.5. Value of λb for Uighur-to-English over the course of training. As with the other language pairs, early on in training, the model prefers more BPE operations. However, consistent with the results from the grid search, 64k operations remains the preferred level throughout training.


Algorithm 3 Training algorithm.
    initialize λb (eq. 7.2)
    for e = 1, . . . , Etotal do
        for each training minibatch do
            sample b with probability λb
            segment minibatch with coarseness b
            perform SGD update on minibatch
        if e > Eburnin then
            calculate expected counts (eq. 7.3)
            re-estimate λb (eq. 7.4)

Language Pair     Train     Dev.     Test
Eng-Vie           133k      1.5k     1k
Arn-Spa           222k      8k       36k
Uig-Eng           641k      0.6k     0.3k
Jpn-Vie           340k      0.5k     1.2k

TABLE 7.1
NUMBER OF TRAINING, DEVELOPMENT, AND TEST SENTENCES IN THE FOUR PARALLEL DATASETS USED.

7.4 Experiments

We run experiments using the Transformer [109] on four language pairs: English-

to-Vietnamese (Eng-Vie), Mapudungun-to-Spanish (Arn-Spa), Uyghur-to-English (Uig-

Eng), and Japanese-to-Vietnamese (Jpn-Vie).


7.4.1 Datasets

For English-to-Vietnamese, we used the data from the 2015 IWSLT evaluation cam-

paign [19] preprocessed by Luong and Manning [62]. Our Mapudungun-to-Spanish data

comes from the recent release by Duan et al. [31]. The Uyghur-to-English dataset is

from the DARPA LORELEI program. We use train/dev/test splits following Hermjakob

et al. [42]. For Japanese-to-Vietnamese, we use the data from the Wit3 corpus and others

[18, 95, 107], extracted by Ngo et al. [78].

All of our data except for Japanese were preprocessed using the standard Moses tok-

enization and truecasing scripts [54]. For Japanese, we followed the recommendation of

Ngo et al. [79] and used KyTea for tokenization [77]. For all language pairs, we learned

joint source and target BPE operations. In all of our experiments, the set of candidate num-

bers of BPE operations was B = {2k, 4k, 8k, 16k, 32k, 64k}. As described in Section 7.3.2,

we did not discard subword types that came to have zero count due to merging.

7.4.2 Translation model

All of our experiments use the Transformer [109] implemented in fairseq [83]. We use

the recommended settings from Nguyen and Salazar [80]. Specifically, we use 6 encoder

and 6 decoder layers, all with 8 attention heads. The model has 512 embedding dimensions,

and the feedforward layers have 2048 dimensions. Layer normalization is done before, not

after, adding residual connections. We use Adam [48] with an inverse square root learning

rate reaching 0.001 after 8,000 warmup iterations. We clip norms to 1.0 and use dropout

with probability 0.3. At test time, we decode with a beam of 5 using length normalization

[14, 46, 52].

7.4.3 Results

Tables 7.2–7.5 show performance on the four language pairs. For each language pair,

we compare the BLEU scores of training multiple systems across coarseness levels with


our EM approach. In each table, grid search corresponds to the model with the highest

development BLEU score. In addition, we look at our EM approach, either with early

stopping or at the last (100th) epoch, and with and without retraining the model using only

the highest-weighted b.

We can make the following observations across all four language pairs:

• Grid search does not always find the optimal coarseness for the test set; sometimes it is a factor of two below.

• The translation models learned directly by EM have lower BLEU scores; re-training is needed to maximize BLEU.

• EM with early stopping sometimes finds the optimal coarseness and sometimes is a factor of two above.

• By the last epoch, the highest-weighted coarseness stays the same or decreases, and in all cases, it finds the coarseness optimal for the test set.

We discuss early stopping versus using the last epoch below in Section 7.4.4.

The tables also show how long it took (wall-clock time) to finish training under each

setting. All experiments were run on Nvidia GeForce GTX 1080Ti GPUs. The EM

method’s training time is comparable to that of a single training run (top six rows) when

using the same stopping condition (early stopping). Naturally, training for the full 100

epochs takes longer and re-training adds more time, but the total time always remains less

than grid search. The larger the training data, the larger the speedup; our method is more

than three times faster for our largest dataset, Uig-Eng.

7.4.4 Stopping condition

When training with EM, it’s common to use early stopping to avoid overfitting. When

the hidden variables are related to (sub)word segmentation, one might expect a common

form of overfitting in which the model favors the largest possible segments in order to

memorize the data. It’s surprising, therefore, that Figures 7.2 and 7.3 show the opposite


trend, where the model initially favors the largest segments (high b) and over time uses

smaller and smaller segments (low b).

An explanation is that models with lower b take longer to converge (cf. Table 7.3),

because they need more time to learn how to “assemble” smaller subwords into meaningful

units. In any case, it would appear that there is no danger of overfitting by choosing a too-

high b.

Above in Section 7.3.3, we proposed using development BLEU to decide when to stop

training. But if overfitting is not a major concern, early stopping becomes unnecessary.

And indeed, our experiments show that running EM for more epochs allows it to select

better granularities. Based on our experiments, then, our present recommendation is simply

to run EM for a fixed number of epochs, like 100. A better stopping criterion might be to

stop when one of the λb’s peaks and no other λb is increasing.

The curves in Figures 7.2–7.5 show that this alternative criterion would have found

the optimal coarseness somewhat earlier on three language pairs, and on Uighur-English,

much earlier. We leave further experiments with this stopping criterion to future work.
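One possible formalization of this alternative criterion is sketched below (the window size and tolerance are arbitrary illustrative choices): training stops once the currently dominant λb has stopped growing and no other λb is still increasing.

    def lambda_has_converged(history, window=3, tol=1e-3):
        """history[e] maps each coarseness level b to lambda_b after epoch e."""
        if len(history) < window + 1:
            return False
        current, previous = history[-1], history[-1 - window]
        leader = max(current, key=current.get)
        leader_flat = current[leader] <= previous[leader] + tol
        others_flat = all(current[b] <= previous[b] + tol
                          for b in current if b != leader)
        return leader_flat and others_flat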

7.5 Conclusion

We have found that there are different optimal coarseness levels for different datasets.

Although standard values used in the literature (e.g., 16k, 32k) would have resulted in

good BLEU scores, we found that the best coarseness levels often lay outside the typical

range (2k or 64k). Thus, optimizing segmentation coarseness is important for maximizing

translation accuracy.

As an alternative to a slow grid search, we introduced a method for learning the optimal

coarseness during training of a neural machine translation system. We model coarseness

as a latent variable and use expectation-maximization to optimize coarseness together with

the rest of the model. Our results show that we are able to learn the optimal coarseness on

four linguistically divergent language pairs while still being faster than a traditional grid


search approach.


                         BLEU
Method             Dev.      Test      Time     Ops.
2k                 27.72     30.61     23k
4k                 27.54     30.81     13k
8k                 27.28     30.06     9k
16k                26.61     30.04     6k
32k                26.27     29.36     5k
64k                25.38     27.63     5k
grid search        27.72     30.61     61k      2k
EM (early)         26.89     29.19     12k      8k
  + retrain        27.28     30.06     21k      8k
EM (last)          26.28     29.14     28k      4k
  + retrain        27.54     30.81     41k      4k

TABLE 7.2
ON ENGLISH-TO-VIETNAMESE, EM IS ABLE TO FIND A BETTER COARSENESS THAN GRID SEARCH IN LESS TIME. TIME IS IN WALL-CLOCK SECONDS. OPS. = THE NUMBER OF BPE OPERATIONS SELECTED.


                         BLEU
Method             Dev.      Test      Time     Ops.
2k                 17.79     18.30     2k
4k                 17.22     17.79     3k
8k                 16.74     16.74     1k
16k                16.15     16.53     1k
32k                15.65     15.98     1k
64k                15.15     15.35     1k
grid search        17.79     18.30     10k      2k
EM (early)         17.37     17.18     3k       4k
  + retrain        17.22     17.79     6k       4k
EM (last)          16.72     17.37     7k       2k
  + retrain        17.79     18.30     9k       2k

TABLE 7.3
ON MAPUDUNGUN-TO-SPANISH, EM IS ABLE TO FIND THE SAME COARSENESS AS GRID SEARCH IN SLIGHTLY LESS TIME. TIME IS IN WALL-CLOCK SECONDS. OPS. = THE NUMBER OF BPE OPERATIONS SELECTED.


                         BLEU
Method             Dev.      Test      Time     Ops.
2k                 27.55     24.74     76k
4k                 28.73     24.62     68k
8k                 31.51     23.05     61k
16k                33.85     27.09     52k
32k                35.71     26.80     57k
64k                35.87     27.63     58k
grid search        35.87     27.63     371k     64k
EM (early)         35.13     26.73     58k      64k
  + retrain        35.87     27.63     116k     64k
EM (last)          33.89     27.13     60k      64k
  + retrain        35.87     27.63     118k     64k

TABLE 7.4
ON UIGHUR-TO-ENGLISH, EM IS ABLE TO FIND THE SAME COARSENESS AS GRID SEARCH IN MUCH LESS TIME. TIME IS IN WALL-CLOCK SECONDS. OPS. = THE NUMBER OF BPE OPERATIONS SELECTED.


                         BLEU
Method             Dev.      Test      Time     Ops.
2k                 11.82     13.32     38k
4k                 11.64     13.83     57k
8k                 12.21     13.75     38k
16k                11.42     13.31     38k
32k                11.11     13.36     50k
64k                10.48     12.9      29k
grid search        12.21     13.75     249k     8k
EM (early)         11.06     12.90     42k      8k
  + retrain        12.21     13.75     80k      8k
EM (100 epochs)    10.90     12.18     53k      4k
  + retrain        11.64     13.83     110k     4k

TABLE 7.5
ON JAPANESE-TO-VIETNAMESE, EM IS ABLE TO FIND A BETTER COARSENESS THAN GRID SEARCH IN MUCH LESS TIME. TIME IS IN WALL-CLOCK SECONDS. OPS. = THE NUMBER OF BPE OPERATIONS SELECTED.


CHAPTER 8

CONCLUSION

This dissertation covers optimizing hyperparameters in a neural machine translation

system. We focus on three different classes of methods for learning: regularization, percep-

tron, and expectation-maximization. Through the use of these methods, we have demon-

strated ways to automatically learn these hyperparameters. In addition to showing that

these methods are able to learn optimal parameters in many settings, we have also shown the theoretical impacts stemming from this, as well as improved performance on a range of measures, from BLEU scores to disk space to experimentation speed.

The goal of this thesis is to take a basic neural machine translation architecture and learn all the parameters and hyperparameters, so as to reduce searching over architectures and model configurations during experimentation. This allows for quicker experimentation and faster idea validation. Neural machine translation models require numerous hyperparameters to define their architectures, so it is untenable to search over them manually. Furthermore, even defining a good search space or set of values to explore can be a difficult undertaking for many search methods.

Though we have demonstrated efficacy on neural machine translation, we note that this

is just a special case of what these methods can do. For instance, Section 2.1 looks at feedforward neural networks, and that method has already been extended by numerous other groups to a variety of tasks. Additionally, modeling errors, label bias, and beam search are

well-studied problems in other aspects of natural language processing, and more broadly,

learning theory. Overall, our methods should be able to generalize to a broader class of

tasks and applications.


BIBLIOGRAPHY

1. J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks.In Advances in Neural Information Processing Systems, pages 2270–2278, 2016.

2. D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, andM. Collins. Globally normalized transition-based neural networks. In Proc. ACL,pages 2442–2452, 2016. URL http://www.aclweb.org/anthology/P16-1231.

3. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.doi: 10.1561/2200000015.

4. D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learningto align and translate. arXiv preprint arXiv:1409.0473, 2014.

5. D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learningto align and translate. In Proc. ICLR, 2015.

6. P. G. Benardos and G.-C. Vosniakos. Optimizing feedforward artificial neural net-work architecture. Engineering Applications of Artificial Intelligence, 20:365–382,2007.

7. Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic languagemodel. Journal of machine learning research, 3(Feb):1137–1155, 2003.

8. J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

9. J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl. Algorithms for hyper-parameteroptimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.

10. O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz,P. Pecina, M. Post, H. Saint-Amand, et al. Findings of the 2014 workshop on statisti-cal machine translation. In Proceedings of the ninth workshop on statistical machinetranslation, pages 12–58, 2014.

11. O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. J. Yepes,P. Koehn, V. Logacheva, C. Monz, et al. Findings of the 2016 Conference on MachineTranslation. In Proceedings of the First Conference on Machine Translation: Volume2, Shared Task Papers, volume 2, pages 131–198, 2016.


12. O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, S. Huang, M. Huck,P. Koehn, Q. Liu, V. Logacheva, et al. Findings of the 2017 Conference on MachineTranslation (WMT17). In Proc. Conference on Machine Translation, pages 169–214,2017.

13. L. Bottou. Une Approche theorique de l’Apprentissage Connexioniste; Applicationsa la reconnaissance de la Parole. PhD thesis, Universite de Paris Sud, 1991.

14. N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proc. International Society for Music Information Retrieval, pages 335–340, 2013.

15. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization andstatistical learning via the alternating direction method of multipliers. Foundationsand Trends in Machine learning, 3(1):1–122, 2010.

16. D. Britz, A. Goldie, M.-T. Luong, and Q. Le. Massive exploration of neural machinetranslation architectures. In Proc. EMNLP, pages 1442–1451, 2017.

17. P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. The mathematics ofstatistical machine translation: Parameter estimation. Computational linguistics, 19(2):263–311, 1993.

18. M. Cettolo, C. Girardi, and M. Federico. Wit3: Web inventory of transcribed and translated talks. In Proc. EAMT, pages 261–268, 2012.

19. M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico. The IWSLT 2015 evaluation campaign. In Proc. IWSLT, 2015.

20. C. Cherry, G. Foster, A. Bapna, O. Firat, and W. Macherey. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4295–4305, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1461. URL https://www.aclweb.org/anthology/D18-1461.

21. D. Chiang. A hierarchical phrase-based model for statistical machine translation. InProceedings of the 43rd Annual Meeting on Association for Computational Linguis-tics, pages 263–270. Association for Computational Linguistics, 2005.

22. D. Chiang. Hierarchical phrase-based translation. computational linguistics, 33(2):201–228, 2007.

23. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–22, 1977.


24. M. Denkowski and A. Lavie. Meteor universal: Language specific translation eval-uation for any target language. In Proceedings of the ninth workshop on statisticalmachine translation, pages 376–380, 2014.

25. M. Denkowski and A. Lavie. Meteor universal: Language specific translation evalu-ation for any target language. In Proc. Workshop on Statistical Machine Translation,2014.

26. M. Denkowski and G. Neubig. Stronger baselines for trustable results in neural ma-chine translation. In Proceedings of the First Workshop on Neural Machine Transla-tion, pages 18–27, 2017. URL http://www.aclweb.org/anthology/W17-3203.

27. M. Denkowski and G. Neubig. Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 18–27, 2017.

28. J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. Fast and robustneural network joint models for statistical machine translation. In Proceedings of the52nd Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers), volume 1, pages 1370–1380, 2014.

29. J. Dodge, R. Schwartz, H. Peng, and N. A. Smith. Rnn architecture learning withsparse regularization. In Proceedings of the 2019 Conference on Empirical Meth-ods in Natural Language Processing and the 9th International Joint Conference onNatural Language Processing (EMNLP-IJCNLP), pages 1179–1184, 2019.

30. K.-B. Duan and S. S. Keerthi. Which is the best multiclass SVM method? An empir-ical study. In International Workshop on Multiple Classifier Systems, pages 278–285,2005.

31. M. Duan, C. Fasola, S. K. Rallabandi, R. M. Vega, A. Anastasopoulos, L. Levin, and A. W. Black. A resource for computational experiments on Mapudungun, 2019.

32. J. Duchi and Y. Singer. Efficient online and batch learning using forward backwardsplitting. J. Machine Learning Research, 10:2899–2934, 2009.

33. J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections ontothe `1-ball for learning in high dimensions. In Proc. ICML, pages 272–279, 2008.

34. J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections ontothe `1-ball for learning in high dimensions. In Proc. ICML, pages 272–279, 2008.

35. J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainableneural networks. arXiv preprint arXiv:1803.03635, 2018.

36. F. Friedrichs and C. Igel. Evolutionary tuning of multiple SVM parameters. Neuro-computing, 64:107–117, 2005.

118

Page 132: LEARNING HYPERPARAMETERS FOR NEURAL MACHINE …kentonmurray.com/Murray_Kenton_Dissertation.pdfFortunately, the innate structure of the problem allows for optimization of these hyper-parameters

37. J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. ConvolutionalSequence to Sequence Learning. In Proc. ICML, 2017.

38. D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM, 2017.

39. S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for ef-ficient neural network. In Advances in neural information processing systems, pages1135–1143, 2015.

40. H. Hayashi, Y. Oda, A. Birch, I. Constas, A. Finch, M.-T. Luong, G. Neubig, andK. Sudoh. Findings of the third workshop on neural generation and translation. InProceedings of the Third Workshop on Neural Generation and Translation, 2019.

41. W. He, Z. He, H. Wu, and H. Wang. Improved neural machine translation with SMTfeatures. In Proc. AAAI, 2016.

42. U. Hermjakob, Q. Li, D. Marcu, J. May, S. J. Mielke, N. Pourdamghani, M. Pust, X. Shi, K. Knight, T. Levinboim, et al. Incident-driven machine translation and name tagging for low-resource languages. Machine Translation, pages 1–31, 2017.

43. L. Huang, K. Zhao, and M. Ma. When to finish? optimal beam search for neural textgeneration (modulo beam size). In Proceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing, pages 2134–2139, 2017.

44. F. Hutter, J. Lücke, and L. Schmidt-Thieme. Beyond manual tuning of hyper-parameters. KI - Künstliche Intelligenz, 29(4):329–337, Nov 2015. ISSN 1610-1987. doi: 10.1007/s13218-015-0381-0. URL https://doi.org/10.1007/s13218-015-0381-0.

45. S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabularyfor neural machine translation. arXiv preprint arXiv:1412.2007, 2014.

46. S. Jean, O. Firat, K. Cho, R. Memisevic, and Y. Bengio. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, 2015.

47. M. Junczys-Dowmunt, T. Dwojak, and H. Hoang. Is neural machine translationready for deployment? a case study on 30 translation directions. arXiv preprintarXiv:1610.01108, 2016.

48. D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proc.ICLR, 2015.

49. J. Kinnison, N. Kremer-Herman, D. Thain, and W. Scheirer. Shadho: Massively scal-able hardware-aware distributed hyperparameter optimization. In Proc. IEEE WinterConference on Applications of Computer Vision (WACV), pages 738–747, 2018.


50. K. Knight. Decoding complexity in word-replacement translation models. Compu-tational linguistics, 25(4):607–615, 1999.

51. P. Koehn and R. Knowles. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, August 2017. URL http://www.aclweb.org/anthology/W17-3204.

52. P. Koehn and R. Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017.

53. P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proceed-ings of the 2003 Conference of the North American Chapter of the Association forComputational Linguistics on Human Language Technology-Volume 1, pages 48–54.Association for Computational Linguistics, 2003.

54. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.

55. T. Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, 2018.

56. R. E. Ladner and M. J. Fischer. Parallel prefix computation. J. ACM, 27(4):831–838,Oct. 1980. ISSN 0004-5411. doi: 10.1145/322217.322232.

57. J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Prob-abilistic models for segmenting and labeling sequence data. In Proceedings of theEighteenth International Conference on Machine Learning, pages 282–289, 2001.

58. A. Lavie and M. J. Denkowski. The meteor metric for automatic evaluation of ma-chine translation. Machine translation, 23(2-3):105–115, 2009.

59. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied todocument recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

60. L. Li and A. Talwalkar. Random search and reproducibility for neural architecturesearch. In Proc. UAI, 2019.

61. L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: a novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.

62. M.-T. Luong and C. D. Manning. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.


63. M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-basedneural machine translation. arXiv preprint arXiv:1508.04025, 2015.

64. D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter opti-mization through reversible learning. In Proc. ICML, pages 2113–2122, 2015.

65. C. Mauro, G. Christian, and F. Marcello. Wit3: Web inventory of transcribed andtranslated talks. In Proc. EAMT, pages 261–268, 2012.

66. P. Michel, O. Levy, and G. Neubig. Are sixteen heads really better than one? Ad-vances in Neural Information Processing Systems, 2019.

67. K. Murray and D. Chiang. Auto-sizing neural networks: With applications to n-gramlanguage models. In EMNLP, 2015.

68. K. Murray and D. Chiang. Auto-sizing neural networks: With applications to n-gramlanguage models. In Proceedings of the 2015 Conference on Empirical Methods inNatural Language Processing, pages 908–916, 2015.

69. K. Murray and D. Chiang. Auto-sizing neural networks: With applications to n-gramlanguage models. In Proc. EMNLP, 2015.

70. K. Murray and D. Chiang. Correcting length bias in neural machine translation. WMT2018, page 212, 2018.

71. K. Murray, B. DuSell, and D. Chiang. Efficiency through auto-sizing: Notre DameNLP’s submission to the WNGT 2019 efficiency task. In Proceedings of the 3rdWorkshop on Neural Generation and Translation, pages 297–301, Hong Kong, Nov.2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5634. URLhttps://www.aclweb.org/anthology/D19-5634.

72. K. Murray, J. Kinnison, T. Q. Nguyen, W. Scheirer, and D. Chiang. Auto-sizing thetransformer network: Improving speed, efficiency, and performance for low-resourcemachine translation. In Proceedings of the Third Workshop on Neural Generationand Translation, 2019.

73. K. Murray, J. Kinnison, T. Q. Nguyen, W. Scheirer, and D. Chiang. Auto-sizing thetransformer network: Improving speed, efficiency, and performance for low-resourcemachine translation. In Proceedings of the Third Workshop on Neural Generationand Translation, 2019.

74. V. Nair and G. E. Hinton. Rectified linear units improve Restricted Boltzmann Ma-chines. In Proc. ICML, pages 807–814, 2010.

75. G. Neubig. lamtram: A toolkit for language and translation modeling using neuralnetworks. http://www.github.com/neubig/lamtram, 2015.

76. G. Neubig. Lexicons and minimum risk training for neural machine translation:Naist-cmu at wat2016. In Proceedings of the 3rd Workshop on Asian Translation,pages 119–125, December 2016.


77. G. Neubig, Y. Nakata, and S. Mori. Pointwise prediction for robust, adaptable Japanese morphological analysis. In Proc. NAACL HLT (Short Papers), pages 529–533. Association for Computational Linguistics, 2011.

78. T. Ngo, T. Ha, P. Nguyen, and L. Nguyen. Combining advanced methods in Japanese-Vietnamese neural machine translation. In Proceedings of the 10th International Conference on Knowledge and Systems Engineering (KSE 2018), Hochiminh City, Vietnam, 2018. URL https://arxiv.org/pdf/1805.07133.pdf.

79. T. Ngo, T. Ha, P. Nguyen, and L. Nguyen. How Transformer Revitalizes Character-based Neural Machine Translation: An Investigation on Japanese-Vietnamese Translation Systems. In Proceedings of the 16th International Workshop on Spoken Language Translation 2019 (IWSLT 2019), Hong Kong, 2019. URL https://arxiv.org/pdf/1910.02238.pdf.

80. T. Q. Nguyen and J. Salazar. Transformers without tears: Improving the normaliza-tion of self-attention. In Proc. IWSLT, 2019.

81. F. J. Och. Minimum error rate training in statistical machine translation. In Proceed-ings of the 41st Annual Meeting of the Association for Computational Linguistics,pages 160–167, 2003.

82. F. J. Och and H. Ney. Discriminative training and maximum entropy models forstatistical machine translation. In Proceedings of the 40th annual meeting on associ-ation for computational linguistics, pages 295–302. Association for ComputationalLinguistics, 2002.

83. M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli.fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

84. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic eval-uation of machine translation. In Proceedings of the 40th annual meeting on associ-ation for computational linguistics, pages 311–318. Association for ComputationalLinguistics, 2002.

85. N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimiza-tion, 1(3):127–239, 2014. doi: 10.1561/2400000003.

86. N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.

87. M. Paul, M. Federico, and S. Stuker. Overview of the iwslt 2010 evaluation cam-paign. In International Workshop on Spoken Language Translation (IWSLT) 2010,2010.

88. G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.


89. M. Popel and O. Bojar. Training tips for the Transformer model. The Prague Bulletinof Mathematical Linguistics, 110(1):43–70, 2018.

90. M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, 2018.

91. O. Press and L. Wolf. Using the output embedding to improve language models. InProc. EACL: Volume 2, Short Papers, pages 157–163, 2017. URL http://www.aclweb.org/anthology/E17-2025.

92. A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In Proc. ICML, pages 857–864, 2009.

93. A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In Proc. ICML, pages 857–864, 2009.

94. M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In Proceedings of ICLR, 2015.

95. H. Riza, M. Purwoadi, T. Uliniansyah, A. A. Ti, S. M. Aljunied, L. C. Mai, V. T. Thang, N. P. Thai, V. Chea, S. Sam, et al. Introduction of the Asian Language Treebank. In 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), pages 1–6. IEEE, 2016.

96. E. Salesky, A. Runge, A. Coda, J. Niehues, and G. Neubig. Optimizing segmentation granularity for neural machine translation. arXiv:1810.08641, 2018.

97. A. See, M.-T. Luong, and C. D. Manning. Compression of neural machine translationmodels via pruning. In Proceedings of The 20th SIGNLL Conference on Computa-tional Natural Language Learning, pages 291–301, 2016.

98. R. Sennrich and B. Haddow. Linguistic input features improve neural machine trans-lation. In Proc. First Conference on Machine Translation: Volume 1, Research Pa-pers, volume 1, pages 83–91, 2016.

99. R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare wordswith subword units. arXiv preprint arXiv:1508.07909, 2015.

100. R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words withsubword units. In Proc. ACL, pages 1715–1725, 2016. URL http://www.aclweb.org/anthology/P16-1162.

101. R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1715–1725, 2016.


102. S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In Proc. of ACL, 2016.

103. J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of ma-chine learning algorithms. In Advances in Neural Information Processing Systems,pages 2951–2959, 2012.

104. M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. A study of translationedit rate with targeted human annotation. Proceedings of association for machinetranslation in the Americas, 200(6), 2006.

105. E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deeplearning in NLP. In Proc. ACL, 2019.

106. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neuralnetworks. In Advances in neural information processing systems, pages 3104–3112,2014.

107. J. Tiedemann. Parallel data, tools and interfaces in OPUS. In Proc. LREC, volume 2012, pages 2214–2218, 2012.

108. A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang. Decoding with large-scale neurallanguage models improves translation. In Proceedings of the 2013 Conference onEmpirical Methods in Natural Language Processing, pages 1387–1392, 2013.

109. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

110. A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones,L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit.Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URLhttp://arxiv.org/abs/1803.07416.

111. B. Vauquois. A survey of formal grammars and algorithms for recognition and trans-formation in mechanical translation. In Ifip congress (2), volume 68, pages 1114–1122, 1968.

112. A. Venkatraman, M. Hebert, and J. A. Bagnell. Improving multi-step prediction of learned time series models. In AAAI, pages 3024–3030, 2015.

113. E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-headself-attention: Specialized heads do the heavy lifting, the rest can be pruned. InProceedings of the 57th Annual Meeting of the Association for Computational Lin-guistics, pages 5797–5808, Florence, Italy, July 2019. Association for ComputationalLinguistics. doi: 10.18653/v1/P19-1580.


114. A. Vose, J. Balma, A. Heye, A. Rigazzi, C. Siegel, D. Moise, B. Robbins, andR. Sukumar. Recombination of artificial neural networks. In arXiv:1901.03900,2019.

115. W. Weaver. Translation. Machine translation of languages, 14:15–23, 1955.

116. S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search opti-mization. In Proceedings of EMNLP, 2016.

117. Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridgingthe gap between human and machine translation, 2016. arXiv:1609.08144.

118. Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridgingthe gap between human and machine translation. arXiv preprint arXiv:1609.08144,2016.

119. Y. Yang, L. Huang, and M. Ma. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceed-ings of the 2018 Conference on Empirical Methods in Natural Language Processing,2018.

120. Y. Zhang and J. Nivre. Analyzing the effect of global learning and beam-search ontransition-based dependency parsing. In Proceedings of COLING 2012: Posters,pages 1391–1400, 2012. URL http://www.aclweb.org/anthology/C12-2136.

This document was prepared & typeset with pdfLaTeX, and formatted with the nddiss2ε class file (v3.2017.2 [2017/05/09]).
