
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3208–3229, November 16–20, 2020. © 2020 Association for Computational Linguistics


When BERT Plays the Lottery, All Tickets Are Winning

Sai Prasanna* (Zoho Labs, Zoho Corporation, Chennai, India) [email protected]

Anna Rogers* (Center for Social Data Science, Copenhagen University, Copenhagen, Denmark) [email protected]

Anna Rumshisky (Dept. of Computer Science, Univ. of Massachusetts Lowell, Lowell, USA) [email protected]

* Equal contribution

Abstract

Large Transformer-based models were shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis, using both structured and magnitude pruning. For fine-tuned BERT, we show that (a) it is possible to find subnetworks achieving performance that is comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. Strikingly, with structured pruning even the worst possible subnetworks remain highly trainable, indicating that most pre-trained BERT weights are potentially useful. We also study the "good" subnetworks to see if their success can be attributed to superior linguistic knowledge, but find them unstable, and not explained by meaningful self-attention patterns.

1 Introduction

Much of the recent progress in NLP is due to the transfer learning paradigm in which Transformer-based models first try to learn task-independent linguistic knowledge from large corpora, and then get fine-tuned on small datasets for specific tasks. However, these models are overparametrized: we now know that most Transformer heads and even layers can be pruned without significant loss in performance (Voita et al., 2019; Kovaleva et al., 2019; Michel et al., 2019).

One of the most famous Transformer-based models is BERT (Devlin et al., 2019). It became a must-have baseline and inspired dozens of studies probing it for various kinds of linguistic information (Rogers et al., 2020b).

We conduct a systematic case study of fine-tuning BERT on GLUE tasks (Wang et al., 2018) from the perspective of the lottery ticket hypothesis (Frankle and Carbin, 2019). We experiment with and compare magnitude-based weight pruning and importance-based pruning of BERT self-attention heads (Michel et al., 2019), which we extend to multi-layer perceptrons (MLPs) in BERT.

With both techniques, we find the "good" subnetworks that achieve 90% of full model performance, and perform considerably better than similarly-sized subnetworks sampled from other parts of the model. However, in many cases even the "bad" subnetworks can be re-initialized to the pre-trained BERT weights and fine-tuned separately to achieve strong performance. We also find that the "good" networks are unstable across random initializations at fine-tuning, and their self-attention heads do not necessarily encode meaningful linguistic patterns.

2 Related Work

Multiple studies of BERT concluded that it is considerably overparametrized. In particular, it is possible to ablate elements of its architecture without loss in performance or even with slight gains (Kovaleva et al., 2019; Michel et al., 2019; Voita et al., 2019). This explains the success of multiple BERT compression studies (Sanh et al., 2019; Jiao et al., 2019; McCarley, 2019; Lan et al., 2020).

While NLP focused on building larger Transformers, the computer vision community was exploring the Lottery Ticket Hypothesis (LTH: Frankle and Carbin, 2019; Lee et al., 2018; Zhou et al., 2019). It is formulated as follows: "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that – when trained in isolation – reach test accuracy comparable to the original network in a similar number of iterations" (Frankle and Carbin, 2019). The "winning tickets" generalize across vision datasets (Morcos et al., 2019), and exist both in LSTM and Transformer models for NLP (Yu et al., 2020).


Task   Dataset                                                                  Train  Dev   Metric
CoLA   Corpus of Linguistic Acceptability Judgements (Warstadt et al., 2019)    10K    1K    Matthews
SST-2  The Stanford Sentiment Treebank (Socher et al., 2013)                    67K    872   accuracy
MRPC   Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005)          4K     n/a   accuracy
STS-B  Semantic Textual Similarity Benchmark (Cer et al., 2017)                 7K     1.5K  Pearson
QQP    Quora Question Pairs (Wang et al., 2018)                                 400K   n/a   accuracy
MNLI   The Multi-Genre NLI Corpus (matched) (Williams et al., 2017)             393K   20K   accuracy
QNLI   Question NLI (Rajpurkar et al., 2016; Wang et al., 2018)                 108K   11K   accuracy
RTE    Recognizing Textual Entailment (Dagan et al., 2005; Haim et al., 2006;
       Giampiccolo et al., 2007; Bentivogli et al., 2009)                        2.7K   n/a   accuracy
WNLI   Winograd NLI (Levesque et al., 2012)                                     706    n/a   accuracy

Table 1: GLUE tasks (Wang et al., 2018), dataset sizes and the metrics reported in this study.

However, so far LTH work focused on the "winning" random initializations. In case of BERT, there is a large pre-trained language model, used in conjunction with a randomly initialized task-specific classifier; this paper and concurrent work by Chen et al. (2020) are the first to explore LTH in this context. The two papers provide complementary results for magnitude pruning, but we also study structured pruning, posing the question of whether "good" subnetworks can be used as a tool to understand how BERT works. Another contemporaneous study by Gordon et al. (2020) also explores magnitude pruning, showing that BERT pruned before fine-tuning still reaches performance similar to the full model.

Ideally, the pre-trained weights would provide transferable linguistic knowledge, fine-tuned only to learn a given task. But we do not know what knowledge actually gets used for inference, except that BERT is as prone as other models to rely on dataset biases (McCoy et al., 2019b; Rogers et al., 2020a; Jin et al., 2020; Niven and Kao, 2019; Zellers et al., 2019). At the same time, there is vast literature on probing BERT architecture blocks for different linguistic properties (Rogers et al., 2020b). If there are "good" subnetworks, then studying their properties might explain how BERT works.

3 Methodology

All experiments in this study are done on the "BERT-base lowercase" model from the Transformers library (Wolf et al., 2020). It is fine-tuned² on 9 GLUE tasks, and evaluated with the metrics shown in Table 1. All evaluation is done on the dev sets, as the test sets are not publicly distributed. For each experiment we test 5 random seeds.

² All experiments were performed with 8 RTX 2080 Ti GPUs, 128 Gb of RAM, and 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. Code repository: https://github.com/sai-prasanna/bert-experiments.

3.1 BERT Architecture

BERT is fundamentally a stack of Transformer encoder layers (Vaswani et al., 2017). All layers have identical structure: a multi-head self-attention (MHAtt) block followed by an MLP, with residual connections around each.

MHAtt consists of $N_h$ independently parametrized heads. An attention head $h$ in layer $l$ is parametrized by $W_k^{(h,l)}, W_q^{(h,l)}, W_v^{(h,l)} \in \mathbb{R}^{d_h \times d}$ and $W_o^{(h,l)} \in \mathbb{R}^{d \times d_h}$, where $d_h$ is typically set to $d/N_h$. Given $n$ $d$-dimensional input vectors $x = x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, MHAtt is the sum of the output of each individual head applied to input $x$:

$$\mathrm{MHAtt}^{(l)}(x) = \sum_{h=1}^{N_h} \mathrm{Att}_{W_k^{(h,l)},\, W_q^{(h,l)},\, W_v^{(h,l)},\, W_o^{(h,l)}}(x)$$
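To make the per-head decomposition above concrete, here is a minimal PyTorch sketch of MHAtt as a sum of per-head outputs. It is illustrative only: dimensions follow BERT-base (d = 768, N_h = 12), the tensors are randomly initialized stand-ins rather than trained BERT weights, and biases and dropout are omitted.

```python
import torch

d, n_heads = 768, 12          # BERT-base dimensions
d_h = d // n_heads            # 64
x = torch.randn(5, d)         # n = 5 input vectors

# Randomly initialized stand-ins for W_k, W_q, W_v (d_h x d) and W_o (d x d_h) per head.
W_k, W_q, W_v = (torch.randn(n_heads, d_h, d) for _ in range(3))
W_o = torch.randn(n_heads, d, d_h)

def att_head(x, Wk, Wq, Wv, Wo):
    """One head: scaled dot-product attention followed by the output projection W_o."""
    q, k, v = x @ Wq.T, x @ Wk.T, x @ Wv.T                  # each (n, d_h)
    weights = torch.softmax(q @ k.T / d_h ** 0.5, dim=-1)   # (n, n) attention weights
    return (weights @ v) @ Wo.T                             # back to (n, d)

# MHAtt(x) is the sum of the individual head outputs, as in the equation above.
mhatt = sum(att_head(x, W_k[h], W_q[h], W_v[h], W_o[h]) for h in range(n_heads))
```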

The MLP in layer $l$ consists of two feed-forward layers. It is applied separately to $n$ $d$-dimensional vectors $z \in \mathbb{R}^d$ coming from the attention sublayer. Dropout (Srivastava et al., 2014) is used for regularization. Then inputs of the MLP are added to its outputs through a residual connection.

3.2 Magnitude Pruning

For magnitude pruning, we fine-tune BERT on each task and iteratively prune 10% of the lowest-magnitude weights across the entire model (excluding the embeddings, since this work focuses on BERT's body weights). We check the dev set score in each iteration and keep pruning for as long as the performance remains above 90% of the full fine-tuned model's performance. Our methodology and results are complementary to those by Chen et al. (2020), who perform iterative magnitude pruning while fine-tuning the model to find the mask.
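The following is a minimal sketch of one m-pruning iteration, assuming a PyTorch model: it zeroes out the 10% lowest-magnitude weights globally, skipping parameters whose names contain "embedding" (a heuristic of ours, not necessarily the authors' exact criterion). The real procedure additionally keeps cumulative masks across iterations and re-checks the dev metric after each step.

```python
import torch

def magnitude_prune_step(model: torch.nn.Module, fraction: float = 0.10) -> float:
    """Zero out the `fraction` lowest-magnitude weights across all non-embedding matrices."""
    prunable = [p for name, p in model.named_parameters()
                if "embedding" not in name and p.dim() > 1]
    magnitudes = torch.cat([p.detach().abs().flatten() for p in prunable])
    k = max(1, int(fraction * magnitudes.numel()))
    threshold = magnitudes.kthvalue(k).values          # global magnitude cutoff
    with torch.no_grad():
        for p in prunable:
            p.mul_((p.abs() > threshold).to(p.dtype))  # zero out weights below the cutoff
    return threshold.item()
```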

3.3 Structured Pruning

We study structured pruning of BERT architecture blocks, masking them under the constraint that at least 90% of full model performance is retained.

[Figure 1: The "good" subnetworks for QNLI: self-attention heads (top, 12x12 heatmaps) and MLPs (bottom, 1x12 heatmaps), pruned together. Earlier layers start at 0. (a) M-pruning: each cell gives the percentage of surviving weights, and std across 5 random seeds. (b) S-pruning: each cell gives the average number of random seeds in which a given head/MLP survived, and std.]

Combinatorial search to find such masks is impractical, and Michel et al. (2019) estimate the importance of attention heads as the expected sensitivity to the mask variable $\xi^{(h,l)}$:

$$I_h^{(h,l)} = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \xi^{(h,l)}} \right|$$

where $x$ is a sample from the data distribution $X$ and $\mathcal{L}(x)$ is the loss of the network outputs on that sample. We extend this approach to MLPs, with the mask variable $\nu^{(l)}$:

$$I_{mlp}^{(l)} = \mathbb{E}_{x \sim X} \left| \frac{\partial \mathcal{L}(x)}{\partial \nu^{(l)}} \right|$$

If $I_h^{(h,l)}$ and $I_{mlp}^{(l)}$ are high, they have a large effect on the model output. Absolute values are calculated to avoid highly positive contributions nullifying highly negative contributions.

In practice, calculating $I_h^{(h,l)}$ and $I_{mlp}^{(l)}$ would involve computing a backward pass on the loss over samples of the evaluation data.³ We follow Michel et al. in applying the recommendation of Molchanov et al. (2017) to normalize the importance scores of the attention heads layer-wise (with the $\ell_2$ norm) before pruning.

³ The GLUE dev sets are used as oracles to obtain the best heads and MLPs for the particular model and task.

To mask the heads, we use a binary mask variable $\xi^{(h,l)}$. If $\xi^{(h,l)} = 0$, the head $h$ in layer $l$ is masked:

$$\mathrm{MHAtt}^{(l)}(x) = \sum_{h=1}^{N_h} \xi^{(h,l)} \, \mathrm{Att}_{W_k^{(h,l)},\, W_q^{(h,l)},\, W_v^{(h,l)},\, W_o^{(h,l)}}(x)$$

Masking MLPs in layer $l$ is performed similarly with a masking variable $\nu^{(l)}$:

$$\mathrm{MLP}^{(l)}_{out}(z) = \nu^{(l)} \, \mathrm{MLP}^{(l)}(z) + z$$
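As a small illustration of the masked residual MLP above, one could wrap the feed-forward block of a layer as follows. This is a sketch, not the authors' implementation: the actual experiments operate on BERT's internal modules rather than a generic wrapper.

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """out = nu * MLP(z) + z, so setting nu = 0 removes the MLP but keeps the residual path."""
    def __init__(self, mlp: nn.Module):
        super().__init__()
        self.mlp = mlp
        self.register_buffer("nu", torch.ones(()))   # mask variable nu^(l)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.nu * self.mlp(z) + z
```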

We compute head and MLP importance scores in a single backward pass, pruning 10% of the heads and one MLP with the smallest scores until the performance on the dev set is within 90%. Then we continue pruning heads alone, and then MLPs alone. The process continues iteratively for as long as the pruned model retains over 90% of the performance of the full fine-tuned model.
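Below is a sketch of how the head importance scores could be computed, with the head_mask argument of HuggingFace BERT serving as the mask variable ξ. This is our reconstruction under stated assumptions, not the authors' code: dev_loader, the checkpoint name, and the 10% cut are placeholders for the setup described above.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")  # a fine-tuned checkpoint in practice
model.eval()
for p in model.parameters():
    p.requires_grad_(False)            # we only need gradients w.r.t. the head mask

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)   # xi^(h,l) = 1: all heads on
importance = torch.zeros(n_layers, n_heads)

for batch in dev_loader:               # hypothetical DataLoader over the task's dev set
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=batch["labels"],
                 head_mask=head_mask).loss
    loss.backward()                    # dL/d(xi) accumulates in head_mask.grad
    importance += head_mask.grad.abs() # absolute sensitivity, summed over dev samples
    head_mask.grad = None

importance = importance / importance.norm(p=2, dim=-1, keepdim=True)            # layer-wise l2 norm
least_important = importance.flatten().argsort()[: (n_layers * n_heads) // 10]  # ~10% of heads
```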

We refer to magnitude and structured pruning as m-pruning and s-pruning, respectively.

4 BERT Plays the Lottery

4.1 The "Good" Subnetworks

Figure 1 shows the heatmaps for the "good" subnetworks for QNLI, i.e. the ones that retain 90% of full model performance after pruning.

For s-pruning, we show the number of random initializations in which a given head/MLP survived the pruning. For m-pruning, we compute the percentage of surviving weights in BERT heads and MLPs in all GLUE tasks (excluding embeddings). We run each experiment with 5 random initializations of the task-specific layer (the same ones), and report averages and standard deviations. See Appendix A for other GLUE tasks.

Figure 1a shows that in m-pruning, all architecture blocks lose about half the weights (42-57% of weights), but the earlier layers get pruned more. With s-pruning (Figure 1b), the most important heads tend to be in the earlier and middle layers, while the important MLPs are more in the middle. Note that Liu et al. (2019) also find that the middle Transformer layers are the most transferable.

In Figure 1b, the heads and MLPs were pruned together. The overall pattern is similar when they are pruned separately. While fewer heads (or MLPs) remain when they are pruned separately (49% vs 22% for heads, 75% vs 50% for MLPs), pruning them together is more efficient overall (i.e., produces smaller subnetworks). Full data is available in Appendix B. This experiment hints at considerable interaction between BERT's self-attention heads and MLPs: with fewer MLPs available, the model is forced to rely more on the heads, raising their importance. This interaction was not explored in the previous studies (Michel et al., 2019; Voita et al., 2019; Kovaleva et al., 2019), and deserves more attention in future work.

4.2 Testing LTH for BERT Fine-tuning: The Good, the Bad and the Random

LTH predicts that the "good" subnetworks trained from scratch should match the full network performance. We experiment with the following settings (see the sketch after this list for how such masks could be sampled):

• "good" subnetworks: the elements selected from the full model by either technique;
• random subnetworks: the same size as the "good" subnetworks, but with elements randomly sampled from the full model;
• "bad" subnetworks: the elements sampled from those that did not survive the pruning, plus a random sample of the remaining elements so as to match the size of the "good" subnetworks.
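For illustration, here is a sketch of how "random" and "bad" head masks of matching size could be sampled from a "good" 12x12 survival mask. The array names and the seeding are ours, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_random(good: np.ndarray) -> np.ndarray:
    """Random subnetwork: same number of heads as `good`, sampled from the whole model."""
    mask = np.zeros(good.size, dtype=bool)
    mask[rng.choice(good.size, size=int(good.sum()), replace=False)] = True
    return mask.reshape(good.shape)

def sample_bad(good: np.ndarray) -> np.ndarray:
    """'Bad' subnetwork: heads that did not survive pruning, topped up with a random
    sample of surviving heads so that its size matches the 'good' subnetwork."""
    k = int(good.sum())
    pruned, kept = np.flatnonzero(~good), np.flatnonzero(good)
    if len(pruned) >= k:
        chosen = rng.choice(pruned, size=k, replace=False)
    else:
        chosen = np.concatenate([pruned, rng.choice(kept, size=k - len(pruned), replace=False)])
    mask = np.zeros(good.size, dtype=bool)
    mask[chosen] = True
    return mask.reshape(good.shape)

good = rng.random((12, 12)) > 0.5          # stand-in for a real "good" head mask
random_mask, bad_mask = sample_random(good), sample_bad(good)
```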

For both pruning methods, we evaluate the subnetworks (a) after pruning, (b) after retraining the same subnetwork. The model is re-initialized to pre-trained weights (except embeddings), and the task-specific layer is initialized with the same random seeds that were used to find the given mask.

As mentioned earlier, the evaluation is performed on the GLUE⁴ dev sets, which have also been used to identify the "good" subnetworks originally. These subnetworks were chosen to work well on this specific data, and the corresponding "bad" subnetworks were defined only in relation to the "good" ones. We therefore do not expect these subnetworks to generalize to other data, and believe that they would best illustrate what exactly BERT "learns" in fine-tuning.

Performance of each subnetwork type is shown in Figure 2. The main LTH prediction is validated: the "good" subnetworks can be successfully retrained alone. Our m-pruning results are consistent with contemporaneous work by Gordon et al. (2020) and Chen et al. (2020).

We observe the following differences between the two pruning techniques:

• For 7 out of 9 tasks m-pruning yields considerably higher compression (10-15% more weights pruned) than s-pruning.
• Although m-pruned subnetworks are smaller, they mostly reach⁵ the full network performance. For s-pruning, the "good" subnetworks are mostly slightly behind the full network performance.
• Randomly sampled subnetworks could be expected to perform better than the "bad", but worse than the "good" ones. That is the case for m-pruning, but for s-pruning they mostly perform on par with the "good" subnetworks, suggesting that the subset of "good" heads/MLPs in the random sample suffices to reach the full "good" subnetwork performance.

Note that our pruned subnetworks are relatively large with both pruning methods (mostly over 50% of the full model). For s-pruning, we also look at "super-survivors": much smaller subnetworks consisting only of the heads and MLPs that consistently survived across all seeds for a given task. For most tasks, these subnetworks contained only about 10-26% of the full model weights, but lost only about 10 performance points on average. See Appendix E for the details of this experiment.

⁴ The results for WNLI are unreliable: this dataset has similar sentences with opposite labels in train and dev data, and in s-pruning the whole model gets pruned away. See Appendix A for discussion of that.

⁵ For convenience, Figure 2 shows the performance of the full model minus one standard deviation – the success criterion for the subnetwork also used by Chen et al. (2020).

[Figure 2: The "good" and "bad" subnetworks in BERT fine-tuning: performance on GLUE tasks, compared against the full model, the full model performance minus one std, the majority baseline, and the BiLSTM + GloVe baseline. 'Pruned' subnetworks are only pruned, and 'retrained' subnetworks are restored to pretrained weights and fine-tuned. Subfigure titles indicate the task and percentage of surviving weights: (a) S-pruning: MNLI (55%, std 5%), QNLI (55%, std 2%), RTE (54%, std 9%), MRPC (57%, std 5%), QQP (53%, std 5%), SST-2 (47%, std 5%), CoLA (65%, std 5%), STS-B (47%, std 5%), WNLI (7%, std 12%); (b) M-pruning: MNLI (41%, std 0%), QNLI (42%, std 1%), RTE (43%, std 2%), MRPC (45%, std 4%), QQP (38%, std 1%), SST-2 (35%, std 1%), CoLA (51%, std 7%), STS-B (40%, std 2%), WNLI (38%, std 23%). STD values and error bars indicate standard deviation of surviving weights and performance respectively, across 5 fine-tuning runs. See Appendix C for numerical results, and subsection 4.3 for GLUE baseline discussion.]


Model                            CoLA  SST-2  MRPC  QQP   STS-B  MNLI  QNLI  RTE   WNLI  Average⁶
Majority class baseline          0.00  0.51   0.68  0.63  0.02   0.35  0.51  0.53  0.56  0.42
CBOW                             0.46  0.79   0.75  0.75  0.70   0.57  0.62  0.71  0.56  0.61
BiLSTM + GloVe                   0.17  0.87   0.77  0.85  0.71   0.66  0.77  0.58  0.56  0.66
BiLSTM + ELMO                    0.44  0.91   0.70  0.88  0.70   0.68  0.71  0.53  0.56  0.68
'Bad' subnetwork (s-pruning)     0.40  0.85   0.67  0.81  0.60   0.80  0.76  0.58  0.53  0.67
'Bad' subnetwork (m-pruning)     0.24  0.81   0.67  0.77  0.08   0.61  0.60  0.49  0.49  0.51
Random init + random s-pruning   0.00  0.78   0.67  0.78  0.14   0.63  0.59  0.53  0.50  0.52

Table 2: "Bad" BERT subnetworks (the best one is underlined) vs basic baselines (the best one is bolded). The randomly initialized BERT is randomly pruned by importance scores to match the size of the s-pruned 'bad' subnetwork.

4.3 How Bad are the “Bad” Subnetworks?

Our study – as well as work by Chen et al. (2020) and Gordon et al. (2020) – provides conclusive evidence for the existence of "winning tickets", but it is intriguing that for most GLUE tasks random masks in s-pruning perform nearly as well as the "good" masks, i.e. they could also be said to be "winning". In this section we look specifically at the "bad" subnetworks: since in our setup we use the dev set both to find the masks and to test the model, these parts of the model are the least useful for that specific data sample, and their trainability could yield important insights for model analysis.

Table 2 shows the results for the "bad" subnetworks pruned with both methods and re-fine-tuned, together with dev set results of three GLUE baselines by Wang et al. (2018). The m-pruned 'bad' subnetwork is at least 5 points behind the s-pruned one on 6/9 tasks, and is particularly bad on the correlation tasks (CoLA and STS-B). With respect to GLUE baselines, the s-pruned "bad" subnetwork is comparable to BiLSTM+ELMO and BiLSTM+GloVe. Note that there is a lot of variation between tasks: the 'bad' s-pruned subnetwork is competitive with BiLSTM+GloVe in 5/9 tasks, but it loses by a large margin in 2 more tasks, and wins in 2 more (see also Figure 2).

The last line of Table 2 presents a variation of the experiment with fine-tuning randomly initialized BERT by Kovaleva et al. (2019): we randomly initialize BERT and also apply a randomly s-pruned mask so as to keep it the same size as the s-pruned "bad" subnetwork. Clearly, even this model is in principle trainable (and still beats the majority class baseline), but on average⁶ it is over 15 points behind the "bad" mask over the pre-trained weights. This shows that even the worst possible selection of pre-trained BERT components for a given task still contains a lot of useful information. In other words, some lottery tickets are "winning" and yield the biggest gain, but all subnetworks have a non-trivial amount of useful information.

⁶ The GLUE leaderboard uses the macro average of metrics to rank the participating systems. We only consider the metrics in Table 1 to obtain this average.

Note that even the random s-pruning of a randomly initialized BERT is slightly better than the m-pruned "bad" subnetwork. It is not clear what plays a bigger role: the initialization or the architecture. Chen et al. (2020) report that pre-trained weights do not perform as well if shuffled, but they do perform better than randomly initialized weights. To test whether the "bad" s-pruned subnetworks might match the "good" ones with more training, we trained them for 6 epochs, but on most tasks the performance went down (see Appendix D).

Finally, BERT is known to sometimes have degenerate runs (i.e. with final performance much lower than expected) on smaller datasets (Devlin et al., 2019). Given the masks found with 5 random initializations, we find that the standard deviation of GLUE metrics for both "bad" and "random" s-pruned subnetworks is over 10 points not only for the smaller datasets (MRPC, CoLA, STS-B), but also for MNLI and SST-2 (although on the larger datasets the standard deviation goes down after re-fine-tuning). This illustrates the fundamental cause of degenerate runs: the poor match between the model and the final layer initialization. Since our "good" subnetworks are specifically selected to be the best possible match to the specific random seed, their performance is the most reliable. As for m-pruning, standard deviation remains low even for the "bad" and "random" subnetworks in most tasks except MRPC. See Appendix C for full results.

5 Interpreting BERT’s Subnetworks

In subsection 4.2 we showed that the subnetworks found by m- and s-pruning behave similarly in fine-tuning. However, s-pruning has an advantage in that the functions of BERT architecture blocks have been extensively studied (see the detailed overview by Rogers et al. (2020b)). If the better performance of the "good" subnetworks comes from linguistic knowledge, they could tell a lot about the reasoning BERT actually performs at inference time.

5.1 Stability of the "Good" Subnetworks

Random initializations in the task-specific classifier interact with the pre-trained weights, affecting the performance of fine-tuned BERT (McCoy et al., 2019a; Dodge et al., 2020). However, if better performance comes from linguistic knowledge, we would expect the "good" subnetworks to better encode this knowledge, and to be relatively stable across fine-tuning runs for the same task.

We found the opposite. For all tasks, Fleiss' kappa on head survival across 5 random seeds was in the range of 0.15-0.32, and the Cochran Q test did not show that the binary masks of head survival obtained with five random seeds for each task were significantly similar at α = 0.05 (although masks obtained with some pairs of seeds were). This means that the "good" subnetworks are unstable, and depend on the random initialization more than on the utility of a certain portion of the pre-trained weights for a particular task.
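As an illustration of this stability check, here is a self-contained sketch of Fleiss' kappa over binary head-survival masks from 5 seeds, treating each head as a "subject" and each seed as a "rater"; the random masks at the bottom are stand-ins for the real survival masks.

```python
import numpy as np

def fleiss_kappa(masks: np.ndarray) -> float:
    """Fleiss' kappa for binary masks of shape (n_raters, n_subjects)."""
    n_raters, n_subjects = masks.shape
    # counts[i, c] = how many raters assigned category c (0 = pruned, 1 = kept) to head i
    counts = np.stack([(~masks).sum(0), masks.sum(0)], axis=1).astype(float)
    p_cat = counts.sum(0) / counts.sum()                               # category proportions
    p_subj = ((counts ** 2).sum(1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_subj.mean(), (p_cat ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

masks = np.random.default_rng(0).random((5, 144)) > 0.5   # 5 seeds x 144 heads, stand-in data
print(round(fleiss_kappa(masks), 3))
```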

The distribution of importance scores, shown in Figure 3, explains why that is the case. At any given pruning iteration, most heads and MLPs have a low importance score, and could all be pruned with about equal success.

[Figure 3: Head importance scores distribution (this example shows CoLA, pruning iteration 1). X-axis: importance score; y-axis: frequency.]

5.2 How Linguistic are the "Good" Subnetworks?

A popular method of studying functions of BERT architecture blocks is to use probing classifiers for specific linguistic functions. However, "the fact that a linguistic pattern is not observed by our probing classifier does not guarantee that it is not there, and the observation of a pattern does not tell us how it is used" (Tenney et al., 2019).

In this study we use a cruder, but more reliable alternative: the types of self-attention patterns, which Kovaleva et al. (2019) classified as diagonal (attention to previous/next word), block (uniform attention over a sentence), vertical (attention to punctuation and special tokens), vertical+diagonal, and heterogeneous (everything else) (see Figure 4a). The fraction of heterogeneous attention can be used as an upper bound estimate on non-trivial linguistic information. In other words, these patterns do not guarantee that a given head has some interpretable function – only that it could have it.

This analysis is performed by image classification on generated attention maps from individual heads (100 for each GLUE task), for which we use a small CNN classifier with six layers. The classifier was trained on the dataset of 400 annotated attention maps by Kovaleva et al. (2019).
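For concreteness, below is a hypothetical 6-layer CNN of the kind that could be used for this 5-way pattern classification. The paper only specifies that the classifier has six layers, so the channel sizes, pooling, and input handling here are our assumptions.

```python
import torch.nn as nn

# Assumed architecture sketch: 4 convolutional + 2 fully connected layers, 5 output classes.
attention_pattern_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),   # layer 1
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # layer 2
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # layer 3
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),  # layer 4
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 128), nn.ReLU(),                   # layer 5
    nn.Linear(128, 5),                                        # layer 6: five pattern classes
)
```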

Note that attention heads can be seen as a weighted sum of linearly transformed input vectors. Kobayashi et al. (2020) recently showed that the input vector norms vary considerably, and the inputs to the self-attention mechanism can have a disproportionate impact relative to their self-attention weight. So we consider both the raw attention maps, and, to assess the true impact of the input in the weighted sum, the L2-norm of the transformed input multiplied by the attention weight (for which we annotated 600 more attention maps with the same pattern types as Kovaleva et al. (2019)). The weighted average of F1 scores of the classifier on annotated data was 0.81 for the raw attention maps, and 0.74 for the normed attention.

Our results suggest that the super-survivor heads do not preferentially encode non-trivial linguistic relations (the heterogeneous pattern), in either raw or normed self-attention (Figure 4b). As compared to all 144 heads (Figure 4c), the "raw" attention patterns of super-survivors encode considerably more block and vertical attention types. Since norming reduces attention to special tokens, the proportion of diagonal patterns (i.e. attention to previous/next tokens) is increased at the cost of the vertical+diagonal pattern. Interestingly, for 3 tasks the super-survivor subnetworks still heavily rely on the vertical pattern even after norming. The vertical pattern indicates a crucial role of the special tokens, and it is unclear why it seems to be less important for MNLI than for QNLI, MRPC or QQP.

[Figure 4: Attention pattern type distribution. (a) Reference: typical BERT self-attention patterns by Kovaleva et al. (2019): diagonal, heterogeneous, vertical, vertical+diagonal, block. (b) Super-survivor heads, fine-tuned: raw and weight-normed attention pattern distributions across MNLI, QNLI, RTE, MRPC, QQP, SST-2, CoLA, STS-B. (c) All heads, fine-tuned: same layout.]

The number of block patterns decreased, and we hypothesize that they are now classified as heterogeneous (as they would be unlikely to look diagonal). But even with the normed attention, the utility of super-survivor heads cannot be attributed only to their linguistic functions (especially given that the fraction of heterogeneous patterns is only a rough upper bound). The Pearson's correlation between heads being super-survivors and their having heterogeneous attention patterns is 0.015 for the raw, and 0.025 for the normed attention. Many "important" heads have diagonal attention patterns, which seems redundant.

We conducted the same analysis for the attention patterns in pre-trained vs. fine-tuned BERT for both super-survivors and all heads, and found them to not change considerably after fine-tuning, which is consistent with findings by Kovaleva et al. (2019). Full data is available in Appendix F.

Note that this result does not exclude the possibility that linguistic information is encoded in certain combinations of BERT elements. However, to date most BERT analysis studies focused on the functions of individual components (Voita et al., 2019; Htut et al., 2019; Clark et al., 2019; Lin et al., 2019; Vig and Belinkov, 2019; Hewitt and Manning, 2019; Tenney et al., 2019; see also the overview by Rogers et al. (2020b)), and this evidence points to the necessity of looking at their interactions. It also adds to the ongoing discussion of the interpretability of self-attention (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Brunner et al., 2020).

Once again, heterogeneous pattern counts are only a crude upper bound estimate on potentially interpretable patterns. More sophisticated alternatives should be explored in future work. For instance, the recent information-theoretic probing by minimum description length (Voita and Titov, 2020) avoids the problem of false positives with traditional probing classifiers.

5.3 Information Shared Between Tasks

While the "good" subnetworks are not stable, the overlaps between the "good" subnetworks may still be used to characterize the tasks themselves. We leave detailed exploration to future work, but as a brief illustration, Figure 5 shows pairwise overlaps in the "good" subnetworks for the GLUE tasks.

The overlaps are not particularly large, but still more than what we would expect if the heads were completely independent (e.g. MRPC and QNLI share over a half of their "good" subnetworks). Both heads and MLPs show a similar pattern. Full data for full and super-survivor "good" subnetworks is available in Appendix G.

[Figure 5: Overlaps in BERT's "good" subnetworks between GLUE tasks: self-attention heads. A 9x9 heatmap over CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, STS-B, WNLI.]

Given our results in subsection 5.2, the overlaps in the "good" subnetworks are not explainable by two tasks' relying on the same linguistic patterns in individual self-attention heads. They also do not seem to depend on the type of the task. For instance, consider the fact that two tasks targeting paraphrases (MRPC and QQP) have less in common than MRPC and MNLI. Alternatively, the overlaps may indicate shared heuristics, or patterns somehow encoded in combinations of BERT elements. This remains to be explored in future work.

6 Discussion

This study confirms the main prediction of LTH for pre-trained BERT weights for both m- and s-pruning. An unexpected finding is that with s-pruning, the "random" subnetworks are still almost as good as the "good" ones, and even the "worst" ones perform on par with a strong baseline. This suggests that the weights that do not survive pruning are not just "inactive" (Zhang et al., 2019).

An obvious, but very difficult question that arises from this finding is whether the "bad" subnetworks do well because even they contain some linguistic knowledge, or just because GLUE tasks are overall easy and could be learned even by a random BERT (Kovaleva et al., 2019), or even by any sufficiently large model. Given that we did not find even the "good" subnetworks to be stable, or preferentially containing the heads that could have interpretable linguistic functions, the latter seems more likely.

Furthermore, should we perhaps be asking the same question with respect to not only subnetworks, but also full models, such as BERT itself and all the follow-up Transformers? There is a trend to automatically credit any new state-of-the-art model with better knowledge of language. However, what if that is not the case, and the success of pre-training is rather due to the flatter and wider optima in the optimization surface (Hao et al., 2019)? Can similar loss landscapes be obtained from other, non-linguistic pre-training tasks? There are initial results pointing in that direction: Papadimitriou and Jurafsky (2020) report that even training on MIDI music is helpful for transfer learning for the LM task with LSTMs.

7 Conclusion

This study systematically tested the lottery ticket hypothesis in BERT fine-tuning with two pruning methods: magnitude-based weight pruning and importance-based pruning of BERT self-attention heads and MLPs. For both methods, we find that the pruned "good" subnetworks alone reach performance comparable with the full model, while the "bad" ones do not. However, for structured pruning, even the "bad" subnetworks can be fine-tuned separately to reach fairly strong performance. The "good" subnetworks are not stable across fine-tuning runs, and their success is not attributable exclusively to non-trivial linguistic patterns in individual self-attention heads. This suggests that most of pre-trained BERT is potentially useful in fine-tuning, and that its success could have more to do with optimization surfaces than with specific bits of linguistic knowledge.

Carbon Impact Statement. This work contributed 115.644 kg of CO2eq to the atmosphere and used 249.068 kWh of electricity, having a NLD-specific social cost of carbon of $-0.14 ($-0.24, $-0.04). The social cost of carbon uses models from (Ricke et al., 2018) and this statement and emissions information was generated with experiment-impact-tracker (Henderson et al., 2020).


Acknowledgments

The authors would like to thank Michael Carbin, Aleksandr Drozd, Jonathan Frankle, Naomi Saphra and Sri Ananda Seelan for their helpful comments, the Zoho corporation for providing access to their clusters for running our experiments, and the anonymous reviewers for their insightful reviews and suggestions. This work is funded in part by the NSF award number IIS-1844740 to Anna Rumshisky.

References

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge. In TAC.

Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2020. On Identifiability in Transformers. In International Conference on Learning Representations.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. 2020. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. arXiv:2007.12223 [cs, stat].

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv:2002.06305 [cs].

W.B. Dolan and C. Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asia Federation of Natural Language Processing.

Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In International Conference on Learning Representations.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The Third PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE '07, pages 1–9, Prague, Czech Republic. Association for Computational Linguistics.

Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the effects of weight pruning on transfer learning. ArXiv, abs/2002.08307.

R. Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The Second PASCAL Recognising Textual Entailment Challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and Understanding the Effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4143–4152, Hong Kong, China. Association for Computational Linguistics.

Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning. arXiv:2002.05651 [cs].

John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In AAAI 2020.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2020. Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms. arXiv:2004.10102 [cs].

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. In ICLR.

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. 2018. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, pages 552–561.

Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open Sesame: Getting inside BERT's Linguistic Knowledge. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 241–253.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.

JS McCarley. 2019. Pruning a BERT-based Question Answering Model. arXiv preprint arXiv:1910.06360.

R. Thomas McCoy, Junghyun Min, and Tal Linzen. 2019a. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. ArXiv, abs/1911.02969.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019b. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One? Advances in Neural Information Processing Systems 32 (NIPS 2019).

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. 2019. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems 32, pages 4932–4942. Curran Associates, Inc.

Timothy Niven and Hung-Yu Kao. 2019. Probing Neural Network Comprehension of Natural Language Arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models. arXiv:2004.14601 [cs].

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.

Katharine Ricke, Laurent Drouet, Ken Caldeira, and Massimo Tavoni. 2018. Country-level social cost of carbon. Nature Climate Change, 8(10):895.

Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. 2020a. Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. In AAAI, page 11.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020b. A Primer in BERTology: What we know about how BERT works. arXiv:2002.12327 [cs].

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019.

Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? arXiv:1906.03731 [cs].

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008, Long Beach, CA, USA.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the Structure of Attention in a Transformer Language Model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Elena Voita and Ivan Titov. 2020. Information-Theoretic Probing with Minimum Description Length. arXiv:2003.12298 [cs].

Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, and Jamie Brew. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs].

Haonan Yu, Sergey Edunov, Yuandong Tian, and Ari S. Morcos. 2020. Playing the lottery with rewards and multiple languages: Lottery tickets in RL and NLP. In ICLR.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In ACL 2019.

Chiyuan Zhang, Samy Bengio, and Yoram Singer. 2019. Are All Layers Created Equal? In ICML 2019 Workshop Deep Phenomena.

Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. In Advances in Neural Information Processing Systems 32, pages 3597–3607. Curran Associates, Inc.


A “Good” Subnetworks in BERT Fine-tuned on GLUE Tasks

Each figure in this section shows the "good" subnetwork of heads and layers that survived the pruning process described in section 3. Each task was run with 5 different random seeds. The top number in each cell indicates how likely a given head or MLP was to survive pruning, with 1.0 indicating that it survived on every run. The bottom number indicates the standard deviation across runs.

The figures in this appendix show that each task has a varying number of heads and layers that survive pruning on all fine-tuning runs, while some heads and layers were only "picked up" by some random seeds. Note also that in addition to the architecture elements that survive across many runs, there are also some that are useful for over half of the tasks, as shown in Figure 15, and some always survive the pruning.

Visualizing the "good" subnetwork illustrates the core problem with WNLI, the most difficult task of GLUE. Figure 14 shows that each run is completely different, indicating that BERT fails to find any consistent pattern between the task and the information in the available pre-trained weights. WNLI is described as "somewhat adversarial" by Wang et al. (2018) because it has similar sentences in train and dev sets with opposite labels.

(a) M-pruning. (b) S-pruning.
Figure 6: MNLI


(a) M-pruning. (b) S-pruning.
Figure 7: QNLI

(a) M-pruning. (b) S-pruning.
Figure 8: RTE


(a) M-pruning. (b) S-pruning.
Figure 9: MRPC

(a) M-pruning. (b) S-pruning.
Figure 10: QQP


(a) M-pruning. (b) S-pruning.
Figure 11: SST-2

(a) M-pruning. (b) S-pruning.
Figure 12: CoLA


(a) M-pruning. (b) S-pruning.
Figure 13: STS-B

(a) M-pruning. (b) S-pruning.
Figure 14: WNLI


B Iterative Pruning Modes

We conducted additional experiments with the following settings for iterative pruning based on importance scores:

• Heads only: in each iteration, we mask as many of the unmasked heads with the lowest importance scores as we can (144 heads in the full BERT-base model).

• MLPs only: we iteratively mask one of the remaining MLPs that has the smallest importance score (subsection 3.3).

• Heads and MLPs: we compute head (subsection 3.3) and MLP (subsection 3.3) importance scores in a single backward pass, pruning 10% of the heads and one MLP with the smallest scores until the performance on the dev set is within 90% of the full model. Then we continue pruning heads alone, and then MLPs alone. This strategy results in a larger number of total components pruned within our performance threshold (see the sketch after this list).
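The combined “Heads and MLPs” loop can be sketched roughly as follows. This is a minimal illustration, not the code used in our experiments: head_and_mlp_importance and evaluate are hypothetical placeholders for the importance computation of subsection 3.3 and for dev-set evaluation with the masks applied.

```python
import numpy as np

def head_and_mlp_importance(model, dev_data):
    """Placeholder: one backward pass over the dev set, returning importance
    scores for the 12x12 self-attention heads and the 12 MLPs."""
    rng = np.random.default_rng()
    return rng.random((12, 12)), rng.random(12)

def evaluate(model, dev_data, head_mask, mlp_mask):
    """Placeholder: dev-set metric with the given binary masks applied."""
    return float(np.random.default_rng().random())

def prune_heads_and_mlps(model, dev_data, full_score, threshold=0.9):
    head_mask = np.ones((12, 12), dtype=bool)   # True = head still active
    mlp_mask = np.ones(12, dtype=bool)          # True = MLP still active
    while head_mask.any() and mlp_mask.any():
        head_imp, mlp_imp = head_and_mlp_importance(model, dev_data)
        # Mask 10% of the remaining heads with the lowest importance scores...
        alive = np.argwhere(head_mask)
        n_prune = max(1, int(0.1 * len(alive)))
        scores = np.array([head_imp[i, j] for i, j in alive])
        for i, j in alive[np.argsort(scores)[:n_prune]]:
            head_mask[i, j] = False
        # ...and the single remaining MLP with the smallest score.
        alive_mlps = np.flatnonzero(mlp_mask)
        mlp_mask[alive_mlps[np.argmin(mlp_imp[alive_mlps])]] = False
        # Stop once dev performance drops below 90% of the full model
        # (in practice the last pruning step would be reverted; omitted here).
        if evaluate(model, dev_data, head_mask, mlp_mask) < threshold * full_score:
            break
    return head_mask, mlp_mask
```

After this loop exits, pruning continues with heads alone and then with MLPs alone, as described in the last setting above.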

(a) Surviving heads (masking heads only). (b) Surviving heads (masking heads and MLPs). (c) Surviving MLPs (masking MLPs only). (d) Surviving MLPs (masking heads and MLPs).
Figure 15: The “good” subnetworks: self-attention heads and MLPs that survive pruning. Each cell gives the average number of GLUE tasks in which a given head/MLP survived, and the standard deviation across 5 fine-tuning initializations.


C Evaluation on GLUE Tasks

Experiment             CoLA       MNLI       MRPC       QNLI       QQP        RTE        SST-2      STS-B      WNLI
Majority baseline      0.00±0.00  0.35±0.00  0.68±0.00  0.51±0.00  0.63±0.00  0.53±0.00  0.51±0.00  0.02±0.00  0.56±0.00
Full model             0.56±0.01  0.84±0.00  0.84±0.01  0.92±0.00  0.91±0.00  0.63±0.03  0.93±0.00  0.89±0.00  0.34±0.06

S-pruning Subnetworks
'good' (pruned)        0.51±0.01  0.76±0.00  0.77±0.01  0.83±0.00  0.83±0.01  0.57±0.03  0.84±0.01  0.80±0.01  0.54±0.06
'good' (retrained)     0.50±0.04  0.80±0.04  0.77±0.06  0.89±0.02  0.89±0.01  0.58±0.06  0.87±0.07  0.75±0.23  0.54±0.06
random (pruned)        0.43±0.11  0.64±0.19  0.48±0.23  0.69±0.12  0.74±0.07  0.51±0.04  0.69±0.13  0.53±0.25  0.54±0.06
random (retrained)     0.42±0.23  0.79±0.08  0.68±0.21  0.84±0.05  0.88±0.03  0.56±0.05  0.89±0.01  0.79±0.08  0.54±0.06
'bad' (pruned)         0.34±0.09  0.60±0.13  0.43±0.15  0.63±0.06  0.68±0.06  0.49±0.04  0.62±0.15  0.32±0.38  0.54±0.06
'bad' (retrained)      0.41±0.11  0.80±0.05  0.68±0.14  0.77±0.12  0.82±0.11  0.59±0.04  0.86±0.06  0.61±0.21  0.54±0.06

Importance Pruning - Super Subnetworks
'good' (pruned)        0.13±0.06  0.34±0.01  0.39±0.16  0.56±0.03  0.63±0.00  0.52±0.02  0.54±0.03  0.38±0.07  0.54±0.06
'good' (retrained)     0.49±0.02  0.76±0.00  0.71±0.00  0.83±0.00  0.87±0.00  0.50±0.03  0.84±0.00  0.80±0.00  0.56±0.00
random (pruned)        0.02±0.05  0.32±0.01  0.39±0.16  0.50±0.01  0.63±0.00  0.47±0.00  0.50±0.01  0.06±0.06  0.54±0.06
random (retrained)     0.15±0.15  0.75±0.01  0.69±0.01  0.76±0.08  0.83±0.04  0.50±0.03  0.85±0.00  0.12±0.03  0.56±0.00
'bad' (pruned)         0.01±0.01  0.32±0.00  0.39±0.16  0.49±0.02  0.60±0.07  0.52±0.02  0.51±0.02  0.06±0.02  0.54±0.06
'bad' (retrained)      0.13±0.03  0.77±0.00  0.69±0.01  0.57±0.03  0.87±0.00  0.49±0.03  0.83±0.01  0.13±0.01  0.56±0.00

M-pruning Subnetworks
'good' (pruned)        0.52±0.01  0.78±0.00  0.78±0.02  0.84±0.01  0.83±0.01  0.61±0.01  0.85±0.01  0.82±0.02  0.48±0.10
'good' (retrained)     0.54±0.02  0.84±0.00  0.84±0.01  0.91±0.00  0.91±0.00  0.61±0.02  0.92±0.00  0.88±0.00  0.41±0.11
random (pruned)        0.02±0.03  0.33±0.01  0.39±0.16  0.51±0.02  0.63±0.00  0.47±0.00  0.55±0.09  0.08±0.04  0.51±0.08
random (retrained)     0.16±0.06  0.76±0.00  0.70±0.01  0.81±0.01  0.86±0.00  0.56±0.02  0.83±0.01  0.24±0.03  0.47±0.12
'bad' (pruned)         0.00±0.00  0.32±0.00  0.39±0.16  0.51±0.02  0.63±0.00  0.49±0.03  0.49±0.01  0.05±0.07  0.53±0.06
'bad' (retrained)      0.02±0.03  0.62±0.00  0.68±0.01  0.61±0.00  0.78±0.00  0.49±0.03  0.82±0.00  0.08±0.00  0.49±0.11

Table 3: Mean and standard deviation of GLUE tasks metrics evaluated on five seeds.


D Longer Fine-tuning of “Bad” s-pruned Subnetworks

Epoch  CoLA   SST-2  MRPC   QQP    STS-B  MNLI   QNLI   RTE    WNLI   Avg
3      0.422  0.873  0.71   0.832  0.651  0.805  0.764  0.579  0.498  0.6815
4      0.423  0.859  0.663  0.828  0.652  0.804  0.762  0.587  0.554  0.6813
5      0.432  0.862  0.665  0.831  0.668  0.801  0.752  0.590  0.523  0.6804
6      0.425  0.867  0.655  0.830  0.667  0.800  0.753  0.594  0.521  0.6791

Figure 16: The mean of GLUE tasks metrics evaluated on five seeds at different epochs (the best one is bolded). *Slight divergence in metrics from the previously reported ones is due to this being a new fine-tuning run.

E Performance of the “Super Survivor” Subnetworks

In this experiment, we explore three settings:

• “good” subnetworks: the subnetworks consisting only of “super-survivors”: the self-attention heads and MLPs that survived in all random seeds, shown in Appendix A. These subnetworks are much smaller than the pruned subnetworks discussed in subsection 4.2 (10-30% vs 50-70% of the full model);

• “bad” subnetworks: subnetworks of the same size as the super-survivor subnetworks, but selected from the heads and MLPs least likely to survive importance pruning;

• random subnetworks: same size as super-survivor subnetworks, but selected from elements that were neither super-survivors nor the ones in the “bad” subnetworks.

The striking conclusion is that on 6 out of 9 tasks the “bad” and random subnetworks behaved nearly as well as the “good” ones, suggesting that the “super-survivor” self-attention heads and MLPs did not survive importance pruning because they encode some unique linguistic information necessary for solving the GLUE tasks.
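A minimal numpy sketch of how three same-size masks along these lines could be constructed from per-seed survival masks is shown below; the random input masks and the flattened 144-head layout are illustrative placeholders, not the code used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-seed binary head masks for one task (5 seeds, 12x12 heads);
# 1 = the head survived importance pruning on that run. MLPs work the same way.
seed_head_masks = rng.integers(0, 2, size=(5, 12, 12)).astype(bool)

survival_rate = seed_head_masks.mean(axis=0).ravel()   # per-head survival frequency

# "good": super-survivors, i.e. heads that survived in every random seed.
good = seed_head_masks.all(axis=0).ravel()
k = int(good.sum())

# "bad": the same number of heads, taken from those least likely to survive.
order = np.argsort(survival_rate)                      # lowest survival first
bad_idx = [i for i in order if not good[i]][:k]
bad = np.zeros(144, dtype=bool)
bad[bad_idx] = True

# random: the same number of heads, drawn from elements in neither set.
rest = np.flatnonzero(~good & ~bad)
random_sel = np.zeros(144, dtype=bool)
random_sel[rng.choice(rest, size=min(k, rest.size), replace=False)] = True

print(k, int(bad.sum()), int(random_sel.sum()))
```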

Panels (one per task, with the super-survivor subnetwork size as % of full model weights, std 0% in all cases): MNLI (26%), QNLI (28%), RTE (12%), MRPC (24%), QQP (18%), SST-2 (13%), CoLA (46%), STS-B (10%), WNLI (1%). Legend: full model; full model performance - 1 std; “good”/“bad”/random subnetworks, pruned and retrained; BiLSTM + GloVe baseline; majority baseline.
Figure 17: The performance of “super survivor” subnetworks in BERT fine-tuning: performance on GLUE tasks (error bars indicate standard deviation across 5 fine-tuning runs). The size of the super-survivor subnetwork as % of full model weights is shown next to the task names.


F Attention Pattern Type Distribution

We use two separately trained CNN classifiers to analyze BERT's self-attention maps, both “raw” head outputs and weight-normed attention, following Kobayashi et al. (2020). For the former, we use the 400 maps annotated by Kovaleva et al. (2019), and for the latter we additionally annotate 600 more maps.

We run the classifiers on pre-trained and fine-tuned BERT, both the full model and the model pruned by the “super-survivor” mask (only the heads and MLPs that survived across GLUE tasks). For each experiment, we report the fraction of attention patterns estimated from a hundred dev-set samples for each task across five random seeds. See Figure 4a for an illustration of the attention pattern types.
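The classifier architecture is not specified in this appendix; as a rough illustration of the setup, a small CNN over resized attention maps with the five pattern classes of Kovaleva et al. (2019) could look like the following hypothetical sketch (the layer sizes and the 64x64 input resolution are assumptions):

```python
import torch
import torch.nn as nn

class AttentionPatternCNN(nn.Module):
    """Hypothetical stand-in: maps a single-channel 64x64 attention map to one
    of five pattern classes (vertical, diagonal, vertical+diagonal, block,
    heterogeneous)."""
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, attn_map: torch.Tensor) -> torch.Tensor:
        # attn_map: (batch, 1, 64, 64) attention weights, resized per example
        return self.classifier(self.features(attn_map).flatten(1))

model = AttentionPatternCNN()
logits = model(torch.rand(8, 1, 64, 64))   # dummy batch of attention maps
print(logits.argmax(dim=-1))               # predicted pattern type per map
```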

(a) Super-survivor heads, fine-tuned. (b) Super-survivor heads, pre-trained. (c) All heads, fine-tuned. (d) All heads, pre-trained. Each panel shows attention patterns and weight-normed attention patterns for MNLI, QNLI, RTE, MRPC, QQP, SST-2, CoLA and STS-B; pattern types: vertical + diagonal, block, vertical, heterogeneous, diagonal.
Figure 18: Attention pattern distribution in all BERT self-attention heads and the “super-survivor” heads.


G How Task-independent are the “Good” Subnetworks?

The parts of the “good” subnetworks that are only relevant for some specific tasks, but consistently survive across fine-tuning runs for that task, should encode the information useful for that task, even if it is not deeply linguistic. Therefore, the degree to which the “good” subnetworks overlap across tasks may be a useful way to characterize the tasks themselves.

Figure 19 shows pairwise comparisons between all GLUE tasks with respect to the number of shared heads and MLPs in two conditions: the “good” subnetworks found by structured importance pruning that were described in subsection 4.1, and the super-survivors (the heads/MLPs that survived in all random seeds).
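The pairwise counts behind such a comparison reduce to intersecting per-task survival masks; a minimal numpy sketch, with randomly generated masks standing in for the real per-task masks, is:

```python
import numpy as np

rng = np.random.default_rng(0)
tasks = ["CoLA", "MNLI", "MRPC", "QNLI", "QQP", "RTE", "SST-2", "STS-B", "WNLI"]

# Hypothetical binary masks of surviving heads, one per task (144 heads each),
# e.g. the super-survivor masks from Appendix A.
masks = {t: rng.integers(0, 2, size=144).astype(bool) for t in tasks}

# Pairwise overlap: number of heads that survive for both tasks;
# the diagonal is the number of survivors for the task itself.
overlap = np.array([[int((masks[a] & masks[b]).sum()) for b in tasks] for a in tasks])

for task, row in zip(tasks, overlap):
    print(f"{task:>6}: " + " ".join(f"{v:3d}" for v in row))
```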

(a) Heads shared between tasks. (b) MLPs shared between tasks. (c) Super-survivor heads shared between tasks. (d) Super-survivor MLPs shared between tasks.
Figure 19: The “good” subnetwork: the diagonal represents the BERT architecture components that survive pruning for a given task, and the remaining elements represent the common surviving components across GLUE tasks. Each cell gives the average number of heads (out of 144) or layers (out of 12), together with the standard deviation across 5 random initializations (for (a) and (b)).