
The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester*  Rami Al-Rfou  Noah Constant
Google Research

{brianlester,rmyeid,nconstant}@google.com

Abstract

In this work, we explore "prompt tuning," a simple yet effective mechanism for learning "soft prompts" to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method "closes the gap" and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant because large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed "prefix tuning" of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer and enables efficient "prompt ensembling."

1 Introduction

With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or "fine-tuning"), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).

* Work done as a Google AI Resident.

[Figure 1 plot: SuperGLUE score (50–100) vs. model parameters (10^8–10^11) for Model Tuning, Model Tuning (Multi-task), Prompt Design, and Prompt Tuning.]

Figure 1: Standard model tuning of T5 achieves strong performance, but requires storing separate copies of the model for each end task. Our prompt tuning of T5 matches the quality of model tuning as size increases, while enabling the reuse of a single frozen model for all tasks. Our approach significantly outperforms few-shot prompt design using GPT-3. We show mean and standard deviation across 3 runs for tuning methods.

More recently, Brown et al. (2020) showed that prompt design (or "priming") is surprisingly effective at modulating a frozen GPT-3 model's behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to "freezing" pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks.

Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model's input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B few-shot performance on SuperGLUE is 17.5 points


[Figure 2 diagram: under model tuning, a separate 11B-parameter copy of the pre-trained model is stored per task (Task A/B/C) and each task is served in its own batch; under prompt tuning, a single frozen 11B-parameter pre-trained model serves a mixed-task batch, with a small task prompt (about 20K parameters each) per task.]

Figure 2: Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model. With a T5 "XXL" model, each copy of the tuned model requires 11 billion parameters. By contrast, our tuned prompts would only require 20,480 parameters per task (a reduction of over five orders of magnitude), assuming a prompt length of 5 tokens.

below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3), despite using 16 times more parameters.

Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning.

Li and Liang (2021) propose "prefix tuning" and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classification tasks.

In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This "soft prompt" is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2).

While we developed our method concurrently

with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in Sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale.

We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the "generalist" parameters needed for general language understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that "prompt ensembling," learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are:

1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models.

2. Ablating many design choices, and showing quality and robustness improve with scale.

3. Showing prompt tuning outperforms model tuning on domain shift problems.

4. Proposing "prompt ensembling" and showing its effectiveness.

2 Prompt Tuning

Following the "text-to-text" approach of T5 (Raffel et al., 2020), we cast all tasks as text generation. Instead of modeling classification as the probability of an output class given some input, Pr(y|X), where X is a series of tokens and y is a single class label, we now model it as conditional generation, where Y is a sequence of tokens that represent a class label. T5 models classification as Pr_θ(Y|X), parameterized by the weights θ of the transformers (Vaswani et al., 2017) that make up its encoder and decoder.
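To make this text-to-text casting concrete, the sketch below shows one way a binary classification example could be rendered as an input/target string pair. The field labels and target verbalizations here are illustrative assumptions, not the paper's released preprocessing.

```python
# Illustrative only: casting a classification example as conditional text
# generation, in the spirit of T5's text-to-text framing. The "premise:" /
# "hypothesis:" field labels and the target strings are assumptions.
def entailment_to_text(premise, hypothesis, label):
    source = f"premise: {premise} hypothesis: {hypothesis}"
    target = "entailment" if label == 0 else "not_entailment"
    return source, target

src, tgt = entailment_to_text("A soft prompt conditions the frozen model.",
                              "The frozen model is conditioned.", 0)
# The model is then trained to maximize Pr_θ(tgt | src).
```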

Prompting is the approach of adding extra information for the model to condition on during its generation of Y. Normally, prompting is done by prepending a series of tokens, P, to the input X, such that the model maximizes the likelihood of the correct Y, Pr_θ(Y|[P; X]), while keeping

the model parameters θ fixed. In GPT-3, the representations of the prompt tokens P = {p_1, p_2, ..., p_n} are part of the model's embedding table, parameterized by the frozen θ. Finding an optimal prompt thus requires the selection of prompt tokens, through either manual search or non-differentiable search methods (Jiang et al., 2020; Shin et al., 2020). Prompt tuning removes the restriction that the prompt P be parameterized by θ; instead, the prompt has its own dedicated parameters, θ_P, that can be updated. While prompt design involves selecting prompt tokens from a fixed vocabulary of frozen embeddings, prompt tuning can be thought of as using a fixed prompt of special tokens, where only the embeddings of these prompt tokens can be updated. Our new conditional generation is now Pr_{θ;θ_P}(Y|[P; X]) and can be trained by maximizing the likelihood of Y via backpropagation, while only applying gradient updates to θ_P.

Given a series of n tokens, {x_1, x_2, ..., x_n}, the first thing T5 does is embed the tokens, forming a matrix X_e ∈ R^{n×e}, where e is the dimension of the embedding space. Our soft-prompts are represented as a parameter P_e ∈ R^{p×e}, where p is the length of the prompt. Our prompt is then concatenated to the embedded input, forming a single matrix [P_e; X_e] ∈ R^{(p+n)×e}, which then flows through the encoder-decoder as normal. Our models are trained to maximize the probability of Y, but only the prompt parameters P_e are updated.
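The mechanism above can be summarized in a few lines of JAX. The following is a minimal sketch under stated assumptions: `embed(token_ids)` and `frozen_loss(input_embeds, targets)` are placeholder functions wrapping a frozen T5-style encoder-decoder (not a real T5/Flax API), and only the prompt matrix receives gradient updates.

```python
import jax
import jax.numpy as jnp

p, e = 100, 512                                  # prompt length, embedding dim (placeholder sizes)
key = jax.random.PRNGKey(0)
prompt = jax.random.uniform(key, (p, e), minval=-0.5, maxval=0.5)   # theta_P

def make_train_step(embed, frozen_loss, lr=0.3):
    def loss_fn(prompt, token_ids, targets):
        x_e = embed(token_ids)                            # X_e in R^{n x e}
        inputs = jnp.concatenate([prompt, x_e], axis=0)   # [P_e; X_e] in R^{(p+n) x e}
        return frozen_loss(inputs, targets)

    # Gradients flow only into the prompt (argnums=0); the frozen model's
    # parameters live inside `frozen_loss` and are never updated.
    grad_fn = jax.grad(loss_fn, argnums=0)

    def train_step(prompt, token_ids, targets):
        # Plain SGD for clarity; the paper trains with Adafactor.
        return prompt - lr * grad_fn(prompt, token_ids, targets)

    return train_step
```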

2.1 Design Decisions

There are many possible ways to initialize the prompt representations. The simplest is to train from scratch, using random initialization. A more sophisticated option is to initialize each prompt token to an embedding drawn from the model's vocabulary. Conceptually, our soft-prompt modulates the frozen network's behavior in the same way as text preceding the input, so it follows that a word-like representation might serve as a good initialization spot. For classification tasks, a third option is to initialize the prompt with embeddings that enumerate the output classes, similar to the "verbalizers" of Schick and Schütze (2021). Since we want the model to produce these tokens in the output, initializing the prompt with the embeddings of the valid target tokens should prime the model to restrict its output to the legal output classes.

Another design consideration is the length of the

prompt. The parameter cost of our method is E·P, where E is the token embedding dimension and P is the prompt length. The shorter the prompt, the fewer new parameters must be tuned, so we aim to find a minimal length that still performs well.
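As a quick sanity check on this cost, the snippet below computes E·P for two settings; the 4,096 embedding dimension assumed here for T5-XXL reproduces the roughly 20K parameters per 5-token prompt quoted in Figure 2.

```python
# Parameter cost of a soft prompt is E * P (embedding dim times prompt length).
# The 4096 embedding dimension for T5-XXL is an assumption used for illustration.
def prompt_param_count(embed_dim, prompt_len):
    return embed_dim * prompt_len

print(prompt_param_count(4096, 5))    # 20480  -> the "20K params" per task in Figure 2
print(prompt_param_count(4096, 100))  # 409600 -> our default 100-token prompt,
                                      #           still a tiny fraction of 11B weights
```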

2.2 Unlearning Span Corruption

Unlike autoregressive language models like GPT-3, the T5 models we experiment with use an encoder-decoder architecture and pre-train on a span corruption objective. Specifically, T5 is tasked with "reconstructing" masked spans in the input text, which are marked with unique sentinel tokens. The target output text consists of all the masked content, separated by sentinels, plus a final sentinel. For instance, from the text "Thank you for inviting me to your party last week", we might construct a pre-training example where the input is "Thank you 〈X〉 me to your party 〈Y〉 week" and the target output is "〈X〉 for inviting 〈Y〉 last 〈Z〉".
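The sketch below constructs exactly this example. It mirrors the input/target format described above, but it is only an illustration with hypothetical sentinel strings, not T5's actual preprocessing code.

```python
# Illustrative construction of a span-corruption example; "<X>", "<Y>", "<Z>"
# stand in for T5's unique sentinel tokens.
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) token span with a sentinel; targets list the spans."""
    inputs, targets, prev = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs += tokens[prev:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        prev = end
    inputs += tokens[prev:]
    targets += [sentinels[len(spans)]]          # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

text = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(text, [(2, 4), (8, 9)])
# inp -> "Thank you <X> me to your party <Y> week"
# tgt -> "<X> for inviting <Y> last <Z>"
```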

While Raffel et al. (2020) find this architecture and pre-training objective more effective than traditional language modeling, we hypothesize that this setup is not a good fit for producing a frozen model that can be readily controlled through prompt tuning. In particular, a T5 model pre-trained exclusively on span corruption, such as T5 1.1, has never seen truly natural input text (free of sentinel tokens), nor has it ever been asked to predict truly natural targets. In fact, due to the details of T5's span corruption preprocessing, every pre-training target will begin with a sentinel. While this "unnatural" tendency to output sentinels is easy to overcome through fine-tuning, we suspect that it would be much harder to override through a prompt alone, as the decoder priors cannot be adjusted.

Given these concerns, we experiment with T5 models in three settings. (1) "Span Corruption": We use pre-trained T5 off-the-shelf as our frozen model, and test its ability to output the expected text for downstream tasks. (2) "Span Corruption + Sentinel": We use the same model, but prepend all downstream targets with a sentinel, so as to more closely resemble the targets seen in pre-training. (3) "LM Adaptation": We continue T5's self-supervised training for a small number of additional steps, but using the "LM" objective discussed by Raffel et al. (2020): given a natural text prefix as input, the model must produce the natural text continuation as output. Crucially, this adaptation happens only once, producing a single frozen

model that we can reuse for prompt tuning across any number of downstream tasks.

Through LM adaptation, we hope to "quickly" transform T5 into a model more similar to GPT-3, which always outputs realistic text and is known to respond well to prompts as a "few-shot learner." It is not obvious how successful this late-stage transformation will be compared to pre-training from scratch, and it has not been investigated previously to our knowledge. As such, we experiment with various lengths of adaptation, up to 100K steps.

3 Results

Our frozen models are built on top of pre-trained T5 checkpoints of all sizes (Small, Base, Large, XL, XXL). We leverage the public T5 1.1 checkpoints, which include improvements over the original T5.¹

Our "default" configuration, plotted with a green '×' throughout, uses an LM-adapted version of T5 trained for an additional 100K steps, initializes using class labels (see Section 3.2), and uses a prompt length of 100 tokens. While this is longer than the default 10-token prefix used by Li and Liang (2021), our method still uses fewer task-specific parameters, as we only tune the input layer, as opposed to overwriting activations in all network layers. See Figure 4 for a detailed comparison. We will also see shortly that even much shorter prompts are viable as model size increases.

We measure performance on the SuperGLUE benchmark (Wang et al., 2019a), a collection of eight challenging English language understanding tasks.² We report metrics on the development set associated with each dataset.

Each of our prompts trains on a single SuperGLUE task; there was no multi-task setup or mixing of training data across tasks. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.

We train our prompts for 30,000 steps, using T5's standard cross-entropy loss with a constant

¹ These improvements are: (1) the removal of all supervised data from pre-training, (2) adjustments to hyperparameters d_model and d_ff, and (3) the use of GeGLU (Shazeer, 2020) over ReLU (Nair and Hinton, 2010) activations.

² The tasks are BoolQ (Clark et al., 2019), CB (De Marneff et al., 2019), COPA (Roemmele et al., 2011), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Pilehvar and Camacho-Collados, 2018), and WSC (Levesque et al., 2012).

learning rate of 0.3 and a batch size of 32. Checkpoints are selected via early stopping on the development set, where the stopping metric is the default metric for the dataset, or the average of metrics for datasets evaluated with multiple metrics. All experiments were run in JAX (Bradbury et al., 2018) using the Adafactor optimizer (Shazeer and Stern, 2018) with weight decay 1e-5, β2 decay 0.8, and parameter scaling off. The models were implemented in Flax (Heek et al., 2020). More details are available in Appendix A.
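For concreteness, the snippet below expresses the reported optimizer settings using Optax's Adafactor implementation; the paper states only the hyperparameter values, not this particular code, so the mapping of arguments is an assumption.

```python
import optax  # assumes the Optax library; a sketch, not the authors' training code

# Reported prompt-training configuration (Section 3): 30,000 steps, constant
# learning rate 0.3, batch size 32, Adafactor with weight decay 1e-5,
# beta2 decay 0.8, and parameter scaling turned off.
optimizer = optax.adafactor(
    learning_rate=0.3,
    decay_rate=0.8,                     # beta2 decay
    weight_decay_rate=1e-5,
    multiply_by_parameter_scale=False,  # "parameter scaling off"
)
NUM_STEPS = 30_000
BATCH_SIZE = 32
# Checkpoints are then selected by early stopping on the dev-set metric.
```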

3.1 Closing the Gap

To compare our method with standard model tuning, we tune the public T5 1.1 checkpoints on SuperGLUE using the default hyperparameters specified in the T5 library (learning rate 0.001, and Adafactor optimizer with pre-training parameter states restored). We consider two baselines. (1) "Model Tuning": For an apples-to-apples comparison, we tune on each task separately, as in our prompt tuning setup.³ (2) "Model Tuning (Multi-task)": We use T5's multi-task tuning setup to achieve a more competitive baseline.⁴ In this case, a single model is tuned on all tasks jointly, with a text prefix indicating the task name.

In Figure 1 (p. 1), we see that prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.

To compare with prompt design, we include GPT-3 few-shot performance on the SuperGLUE dev split, as reported by Brown et al. (2020).⁵

Figure 1 shows that prompt tuning beats GPT-3 prompt design by a large margin, with prompt-tuned T5-Small matching GPT-3 XL (over 16 times larger), and prompt-tuned T5-Large beating GPT-3 175B (over 220 times larger).

³ To improve this baseline, we performed a sweep over the batch size hyperparameter and selected 2^16 tokens per batch.

⁴ The T5 SuperGLUE submission used a more complex setup, first mixing multi-task supervised data into pre-training, and then performing single-task fine-tuning. Since we use T5 1.1 throughout, this setup is unavailable, as the pre-training phase is fully self-supervised. We follow Raffel et al. (2020) in using 2^20 tokens per batch and including DPR data in the multi-task mixture, which is known to boost WSC task performance (Kocijan et al., 2019).

⁵ We also experimented with using GPT-3's manual text prompts directly with our LM-adapted T5 checkpoints. However, performance was far below GPT-3 for comparable model sizes. This may be due to differences in pre-training data and model architecture, as well as T5's shorter sequence length.

[Figure 3 plots: SuperGLUE score vs. model parameters (10^8–10^10) for four ablations: (a) prompt length {1, 5, 20, 100, 150}; (b) prompt initialization {Random Uniform, Sampled Vocab, Class Label}; (c) pre-training method {Span Corruption, Span Corruption + Sentinel, LM Adaptation (100K)}; (d) LM adaptation steps {0K, 10K, 50K, 100K}.]

Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and stddev across 3 runs). In our "default" (×) configuration, quality improves stably with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: Increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: Random uniform initialization lags behind more "advanced" initializations using sampled vocabulary or class label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: Longer adaptation generally gives larger gains, but XXL is robust to even short adaptation.

3.2 Ablation Study

Prompt Length. We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that, for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.⁶

Prompt Initialization. We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values. For random initialization, we sample uniformly

⁶ Going past 100 tokens appears mildly detrimental for larger models. A similar pattern of diminishing performance past a certain prefix length is observed by Li and Liang (2021).

from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most "common" tokens in T5's SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For "class label" initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens. In this case we fall back to our sampled vocab strategy to fill in the prompt.⁷
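A sketch of these three initialization strategies is given below, assuming `embeddings` is the frozen model's (vocab_size × e) embedding table with rows ordered by token frequency; the helper names and arguments are illustrative, not the authors' code.

```python
import numpy as np

def init_random(p, e, rng):
    """Random uniform initialization in [-0.5, 0.5]."""
    return rng.uniform(-0.5, 0.5, size=(p, e))

def init_sampled_vocab(p, embeddings, rng, top_k=5000):
    """Initialize each prompt token from one of the top_k most common tokens."""
    ids = rng.choice(top_k, size=p, replace=False)
    return embeddings[ids]

def init_class_label(p, embeddings, label_token_ids, rng):
    """One prompt token per class label (averaging multi-token labels);
    remaining positions fall back to the sampled-vocab strategy."""
    rows = [embeddings[np.array(ids)].mean(axis=0) for ids in label_token_ids]
    rows = rows[:p]
    if len(rows) < p:
        rows.extend(init_sampled_vocab(p - len(rows), embeddings, rng))
    return np.stack(rows)

rng = np.random.default_rng(0)
# e.g. prompt = init_class_label(100, embeddings, label_token_ids, rng),
# where label_token_ids holds the tokenized class strings for the task.
```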

Figure 3(b) shows our ablation of initialization strategy across model sizes, where we find that

⁷ T5's handling of the ReCoRD and WSC tasks requires the model to generate short, free-form text. In these cases, we initialize the prompts with words related to the task: commonsense, reasoning, reading, and comprehension for ReCoRD; and commonsense, pronoun, and resolution for WSC.

the class-based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.

With "class label" initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.

Pre-training Objective. In Figures 3(c) and 3(d), we see pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5's default "span corruption" objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the "workaround" of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.

Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the "transition" from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations; at this size, the gains from adaptation are quite modest.

In the non-optimal "span corruption" setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0%. The two most common error modes are copying sub-spans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the "span corruption" objective can be unreliable, with only 2 out of 5 models working well, whereas

[Figure 4 plot: task-specific parameters (10^3–10^9, also shown as a percentage of model parameters from 0.001% to 100%) vs. model parameters for Model Tuning, Prefix Tuning (Train), Prefix Tuning (Infer), WARP, Prompt Tuning, and Prompt Design.]

Figure 4: Parameter usage of various adaptation techniques, fixing architecture to T5 1.1 and prompt/prefix length to 1–100 tokens (bands show mean and stddev). Model Tuning: All parameters are task-specific. Prefix Tuning: Activations are tuned in the prefix of each layer, requiring 0.1–1% task-specific parameters for inference, but more are used for training. WARP: Task parameters are reduced to under 0.1% by only tuning input and output layers. Prompt Tuning: Only prompt embeddings are tuned, reaching under 0.01% for most model sizes. Prompt Design: Only a sequence of prompt IDs (500–2000 tokens) is required.

the LM adapted versions work reliably across all model sizes.

We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps for all model sizes.⁸

4 Comparison to Similar Approaches

In this section, we review recent work on learning continuous prompts, and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.⁹

Li and Liang (2021) propose "prefix tuning": learning a sequence of prefixes that are prepended

⁸ https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

⁹ To compare with prompt design, we count each token ID in the prompt as a parameter, and assume a prompt of between 500–2000 tokens to match the GPT-3 setting. While this technique is by far the most parameter efficient, it comes at the cost of task quality.

at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.

Hambardzumyan et al. (2021) propose "WARP", where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.

Liu et al. (2021) propose "P-tuning", where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning; that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.¹⁰

Qin and Eisner (2021) use "soft words" to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned in relation to the input based on hand-designed prompt prototypes, and a learned ∆ℓ_i parameter is included for each layer, so parameter cost scales with model depth.

¹⁰ As another difference, P-tuning requires the addition of "anchor" tokens in the input (e.g., a question mark following the hypothesis in the RTE task) to achieve strong performance, while prompt tuning leaves inputs untouched.

Dataset      Domain   Model        Prompt       ∆

SQuAD        Wiki     94.9 ±0.2    94.8 ±0.1    −0.1
TextbookQA   Book     54.3 ±3.7    66.8 ±2.9    +12.5
BioASQ       Bio      77.9 ±0.4    79.1 ±0.3    +1.2
RACE         Exam     59.8 ±0.6    60.7 ±0.5    +0.9
RE           Wiki     88.4 ±0.1    88.8 ±0.2    +0.4
DuoRC        Movie    68.9 ±0.7    67.7 ±1.1    −1.2
DROP         Wiki     68.9 ±1.7    67.1 ±1.9    −1.8

Table 1: F1 mean and stddev for models trained on SQuAD and evaluated on out-of-domain datasets from the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.

Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.

More generally, work on task prompts is closely aligned with work on "adapters" (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.

5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing

Train   Eval    Tuning    Accuracy     F1

QQP     MRPC    Model     73.1 ±0.9    81.2 ±2.1
                Prompt    76.3 ±0.1    84.3 ±0.3
MRPC    QQP     Model     74.9 ±1.3    70.9 ±1.2
                Prompt    75.4 ±0.8    69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

specific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on "in-domain" datasets perform when evaluated on "out-of-domain" datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.¹¹

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g. to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are "duplicates". The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the "in-domain" task, select checkpoints using in-domain validation, and evaluate zero-shot on the "out-of-domain" task.

Table 2 shows that training a lightweight prompton the QQP data and evaluating on MRPC givesmuch better performance than tuning the entire

¹¹ We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org/), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset           Metric    Average        Best            Ensemble

BoolQ             acc       91.1           91.3            91.7
CB                acc/F1    99.3 / 99.0    100.0 / 100.0   100.0 / 100.0
COPA              acc       98.8           100.0           100.0
MultiRC           EM/F1a    65.7 / 88.7    66.3 / 89.0     67.1 / 89.4
ReCoRD            EM/F1     92.7 / 93.4    92.9 / 93.5     93.2 / 93.9
RTE               acc       92.6           93.5            93.5
WiC               acc       76.2           76.6            77.4
WSC               acc       95.8           96.2            96.2

SuperGLUE (dev)             90.5           91.0            91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g. 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate "models" for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
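A sketch of this batched ensemble inference, combined with the simple majority voting described next, might look as follows. Here `embed` and `frozen_generate` are placeholders for the frozen model's embedding lookup and decoding step, not a real T5 API.

```python
import collections
import numpy as np

# Prompt ensembling: one example is replicated N times in a single batch,
# each copy paired with a different learned prompt, so one forward pass of
# the shared frozen model yields N predictions.
def ensemble_predict(prompts, token_ids, embed, frozen_generate):
    x_e = embed(token_ids)                                   # (n, e)
    batch = np.stack([np.concatenate([p, x_e], axis=0)       # (N, p + n, e)
                      for p in prompts])
    predictions = frozen_generate(batch)                     # N decoded strings
    return collections.Counter(predictions).most_common(1)[0][0]   # majority vote
```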

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average, and beats or

matches the best individual prompt.

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
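A minimal version of this nearest-neighbor probe is sketched below, assuming `prompt` is the learned (p × e) prompt matrix, `embeddings` is the frozen vocabulary embedding table, and `vocab` maps row indices to token strings; the function itself is illustrative.

```python
import numpy as np

def nearest_neighbors(prompt, embeddings, vocab, k=5):
    """Top-k vocabulary tokens by cosine similarity for each prompt token."""
    e_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p_norm = prompt / np.linalg.norm(prompt, axis=1, keepdims=True)
    sims = p_norm @ e_norm.T                      # (p, vocab_size) cosine similarities
    top = np.argsort(-sims, axis=1)[:, :k]
    return [[vocab[j] for j in row] for row in top]
```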

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as { Technology / technology / Technologies / technological / technologies }, as well as more diverse but still strongly related clusters such as { entirely / completely / totally / altogether / 100% }. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the "class-label" strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the "Random Uniform" or "Sampled Vocab" methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to output classes makes this easier and more centralized.

When examining longer prompts (e.g. size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences,

show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, and approximately 20% of the questions are in the "Nature/Science" category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g. "scientific").

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6–4. Venice.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA. Omnipress.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.

Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.

A Reproducibility

A.1 Experimental Settings

We evaluate each GLUE and SuperGLUE dataset using the metric specified in the benchmark. We reuse the evaluation code from the publicly available T5 open-source release to compute metrics.¹²

For the SQuAD and MRQA datasets, we evaluate using F1, one of the metrics used by the SQuAD benchmark, where partial answer spans are considered. Again, we use the T5 open-source release for metric calculation.¹³ All of our models use T5 1.1 as the base frozen model; additional details and pre-trained checkpoints can be found on GitHub.¹⁴ ¹⁵
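For intuition, a simplified version of this span-level F1 is sketched below; the released T5 metric code additionally normalizes answers and takes a maximum over multiple gold answers, which is omitted here.

```python
import collections

# Simplified span-level F1 with partial credit for overlapping answer tokens.
def span_f1(prediction, gold):
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((collections.Counter(pred_tokens)
                   & collections.Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("last week", "your party last week"))  # ~0.67: partial credit
```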

All prompts for T5 Small and Base models were trained on 4 TPU v2 chips, while prompts for larger models were trained on 16 TPU v3 chips.

Parameter counts for each prompt can be found in Table 4. Average runtimes until convergence can be found in Table 5.

A.2 Hyperparameter Search

This work used 77 hyperparameter search trials (40 for prompt tuning and 37 for single-task model tuning), and 3 training runs (with validation evaluation) for each baseline configuration and ablation setting, for a total of 195 runs for our main result and ablations. There were an additional 18 runs for the domain shift experiments and 24 extra runs to create the ensemble. Hyperparameter bounds can be found in Table 6. Hyperparameter tuning was done via manual tuning, and settings were selected based on the SuperGLUE score. All experiments in this work, outside of the hyperparameter being ablated, use our default configuration of 100K steps of LM Adaptation, a prompt length of 100, and “class-label” initialization.

All graphs of our experimental results plot the mean and standard deviation over 3 runs, as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation

¹² https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py

¹³ https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151

¹⁴ https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511

¹⁵ https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

is hidden behind the line itself, such as “Model Tuning (Multi-task)” in Figure 1 and the Base, Large, and XL prompts trained on the “Span Corruption” pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The “Prefix Tuning (Train)” curves appear to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.
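As a hypothetical illustration of how such a plot can be produced with Seaborn, the snippet below aggregates three runs per configuration into a mean line with a standard-deviation band. The column names and values are invented for the example, and the `errorbar="sd"` argument assumes seaborn 0.12 or later.

```python
import pandas as pd
import seaborn as sns

# Three runs for each of two model sizes; seaborn draws mean +/- std per x value.
runs = pd.DataFrame({
    "model_params": [8e7, 8e7, 8e7, 8e8, 8e8, 8e8],
    "superglue_score": [62.1, 63.0, 61.5, 81.2, 80.8, 81.9],
    "method": ["Prompt Tuning"] * 6,
})
ax = sns.lineplot(data=runs, x="model_params", y="superglue_score",
                  hue="method", errorbar="sd")   # requires seaborn >= 0.12
ax.set_xscale("log")
```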

A.3 Datasets

All datasets used are in English. For the GLUE¹⁶,¹⁷ and SuperGLUE¹⁸ datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,¹⁹ we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.²⁰ Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).

The question answering datasets are extractive datasets with a variety of answers, so there isn't a label distribution to report. Similarly, the ReCoRD dataset is a multiple choice dataset where the model must predict the masked-out entity from a list of possible entities. Due to this formulation, there isn't a meaningful label distribution.

We followed the open-source T5 preprocessing procedure²¹ for each dataset, except that we omit the dataset prefix denoting which SuperGLUE dataset an example belongs to.

¹⁶ https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

¹⁷ https://www.tensorflow.org/datasets/catalog/glue#glueqqp

¹⁸ https://www.tensorflow.org/datasets/catalog/super_glue

¹⁹ https://www.tensorflow.org/datasets/catalog/squad#squadv11_default_config

²⁰ https://github.com/mrqa/MRQA-Shared-Task-2019#out-of-domain

²¹ https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

T5 Size   Prompt Length   Trainable Parameters   Total Parameters   Percent Trainable

Small     1               512                    76,961,664         0.00067
          5               2,560                  76,963,712         0.00333
          20              10,240                 76,971,392         0.01330
          50              25,600                 76,986,752         0.03325
          100             51,200                 77,012,352         0.06648
          150             76,800                 77,037,952         0.09969

Base      1               768                    247,578,624        0.00031
          5               3,840                  247,581,696        0.00155
          20              15,360                 247,593,216        0.00620
          50              38,400                 247,616,256        0.01551
          100             76,800                 247,654,656        0.03101
          150             115,200                247,693,056        0.04651

Large     1               1,024                  783,151,104        0.00013
          5               5,120                  783,155,200        0.00065
          20              20,480                 783,170,560        0.00262
          50              51,200                 783,201,280        0.00654
          100             102,400                783,252,480        0.01307
          150             153,600                783,303,680        0.01961

XL        1               2,048                  2,849,759,232      0.00007
          5               10,240                 2,849,767,424      0.00036
          20              40,960                 2,849,798,144      0.00143
          50              102,400                2,849,859,584      0.00359
          100             204,800                2,849,961,984      0.00718
          150             307,200                2,850,064,384      0.01078

XXL       1               4,096                  11,135,336,448     0.00004
          5               20,480                 11,135,352,832     0.00018
          20              81,920                 11,135,414,272     0.00074
          50              204,800                11,135,537,152     0.00184
          100             409,600                11,135,741,952     0.00368
          150             614,400                11,135,946,752     0.00552

Table 4: Number of parameters used for various prompt lengths and T5 model sizes. Trainable parameters is the number of parameters in the prompt itself, while total parameters includes the prompt plus the original T5 parameters. The T5 parameters are frozen and shared across all tasks, and include the SentencePiece lookup table parameters. The final column is the percentage of total parameters that are trainable.
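The arithmetic behind Table 4 is simple: the prompt contributes prompt_length × d_model trainable parameters on top of the frozen T5 weights. The helper below is our own illustration, not code from the paper; the d_model values are the published T5 1.1 embedding sizes, and the frozen parameter count in the example is derived from Table 4 itself (total minus prompt).

```python
# d_model (token embedding dimension) per T5 1.1 size.
D_MODEL = {"Small": 512, "Base": 768, "Large": 1024, "XL": 2048, "XXL": 4096}

def prompt_param_stats(size, prompt_length, frozen_params):
    """Trainable prompt parameters, total parameters, and percent trainable."""
    trainable = prompt_length * D_MODEL[size]
    total = frozen_params + trainable
    return trainable, total, 100.0 * trainable / total

# Example: T5 Small with a 5-token prompt; frozen count derived from Table 4.
print(prompt_param_stats("Small", 5, 76_961_152))  # -> (2560, 76963712, ~0.0033)
```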

For the SQuAD and MRQA datasets, we used the T5 SQuAD preprocessing code.²² By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.

No new data was collected for this work.

²² https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length   T5 Size   Time

1               Large     3.17   ±0.210
                XL        3.37   ±0.211
                XXL       21.23  ±0.154

20              XL        4.908  ±1.853
                XXL       5.303  ±1.625

50              Small     0.905  ±0.507
                Base      5.501  ±2.748
                Large     11.416 ±1.312
                XL        23.010 ±2.540
                XXL       31.313 ±2.308

100             Small     1.625  ±0.115
                Base      2.957  ±0.018
                Large     12.336 ±1.021
                XL        33.500 ±5.442
                XXL       35.115 ±4.553

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.

Hyperparameter      Search Space

Learning Rate       0.001–0.5
Parameter Scaling   True, False
Batch Size          32, 64, 128, 256, 512
Number of Steps     10,000, 20,000, 30,000
Warmup Steps        off, 2,000, 3,000
Decay Factor        off, 0.1, 0.5
Steps per Decay     off, 4,000, 6,000, 8,000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every “Steps per Decay” steps.
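To make the warmup/decay options in Table 6 concrete, here is a small sketch of that schedule family. The specific defaults below are example values drawn from the search space, not the selected setting (the selected default configuration uses a constant learning rate of 0.3 with warmup and decay off).

```python
def learning_rate(step, base_lr=0.3, warmup_steps=2000, decay_factor=0.5, steps_per_decay=4000):
    """Linear warmup from zero to base_lr, then multiply by decay_factor
    every steps_per_decay steps. Pass warmup_steps/decay_factor as 0 or None for 'off'."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    if decay_factor and steps_per_decay:
        return base_lr * decay_factor ** ((step - (warmup_steps or 0)) // steps_per_decay)
    return base_lr

print([round(learning_rate(s), 4) for s in (0, 1000, 2000, 6000, 10000)])
# [0.0, 0.15, 0.3, 0.15, 0.075]
```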

Dataset      Training   Validation   Testing

BoolQ        9,427      3,270        3,245
CB           250        56           250
COPA         400        100          500
MultiRC      27,243     4,848        9,693
ReCoRD       100,730    10,000       10,000
RTE          2,490      277          3,000
WiC          5,428      638          1,400
WSC          259*       104          146
MRPC         3,668      408          1,725
QQP          363,849    40,430       390,965
SQuAD        87,599     10,570       -
TextbookQA   -          1,504        -
RACE         -          1,503        -
BioASQ       -          1,501        -
RE           -          674          -
DuoRC        -          2,948        -
DROP         -          1,503        -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This makes our training dataset smaller than the normal WSC training dataset, which has 554 examples.

Split        False (%)   True (%)

Training     37.7        62.3
Validation   37.8        62.2

Table 8: Label distribution for the BoolQ dataset.

Split        contradiction (%)   entailment (%)   neutral (%)

Training     47.6                46.0             6.4
Validation   50.0                41.1             8.9

Table 9: Label distribution for the CB dataset.

Split        choice1 (%)   choice2 (%)

Training     48.8          51.2
Validation   55.0          45.0

Table 10: Label distribution for the COPA dataset.

Split        False (%)   True (%)

Training     55.9        44.1
Validation   57.2        42.8

Table 11: Label distribution for the MultiRC dataset.

Split        False (%)   True (%)

Training     50.0        50.0
Validation   50.0        50.0

Table 12: Label distribution for the WiC dataset.

Split        False (%)   True (%)

Training     0.0         100.0
Validation   63.5        36.5

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task where the model generates the referent to the highlighted span, instead of predicting whether the supplied entity is the correct referent of the highlighted span. Thus, we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split        entailment (%)   not_entailment (%)

Training     50.2             49.8
Validation   52.7             47.3

Table 14: Label distribution for the RTE dataset.

Split        equivalent (%)   not_equivalent (%)

Training     67.4             32.6
Validation   68.4             31.6

Table 15: Label distribution for the MRPC dataset.

Split        duplicate (%)   not_duplicate (%)

Training     36.9            63.1
Validation   36.8            63.2

Table 16: Label distribution for the QQP dataset.

[Figure 2 diagram: left, “Model Tuning”, where the pre-trained model (11B params) is copied into separate Task A, Task B, and Task C models (11B params each) and served with per-task batches; right, “Prompt Tuning”, where a single frozen pre-trained model (11B params) is combined with small task prompts (roughly 20K params each) and serves a mixed-task batch.]

Figure 2: Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt tuning only requires storing a small task-specific prompt for each task, and enables mixed-task inference using the original pre-trained model. With a T5 “XXL” model, each copy of the tuned model requires 11 billion parameters. By contrast, our tuned prompts would only require 20,480 parameters per task (a reduction of over five orders of magnitude), assuming a prompt length of 5 tokens.
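The following is an illustrative sketch (names and shapes are ours, not the paper's code) of the mixed-task batching idea in Figure 2: each task owns only a small prompt matrix, the frozen model is shared, and a single batch can mix examples from different tasks by attaching each example's own prompt before one shared forward pass.

```python
import numpy as np

d_model, prompt_len = 512, 20
# One small learned prompt per task; the big model itself is shared and frozen.
prompts = {task: np.random.randn(prompt_len, d_model) * 0.5 for task in "ABC"}

def embed(token_ids):                    # stand-in for the frozen embedding table
    return np.random.randn(len(token_ids), d_model)

def build_inputs(batch):                 # batch = [(task_id, token_ids), ...]
    return [np.concatenate([prompts[task], embed(tokens)], axis=0)
            for task, tokens in batch]

mixed_batch = [("A", [3, 7, 9]), ("C", [12, 4]), ("B", [8, 8, 1, 2])]
print([x.shape for x in build_inputs(mixed_batch)])  # one frozen model serves all three tasks
```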

low fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3), despite using 16 times more parameters.

Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning.

Li and Liang (2021) propose “prefix tuning” and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classification tasks.

In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This “soft prompt” is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2).

While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in Sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale.

We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the “generalist” parameters needed for general language understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that “prompt ensembling”, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are:

1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models.

2. Ablating many design choices, and showing that quality and robustness improve with scale.

3. Showing prompt tuning outperforms model tuning on domain shift problems.

4. Proposing “prompt ensembling” and showing its effectiveness.

2 Prompt Tuning

Following the “text-to-text” approach of T5 (Raffel et al., 2020), we cast all tasks as text generation. Instead of modeling classification as the probability of an output class given some input, Pr(y|X), where X is a series of tokens and y is a single class label, we now model it as conditional generation, where Y is a sequence of tokens that represent a class label. T5 models classification as Pr_θ(Y|X), parameterized by the weights θ of the transformers (Vaswani et al., 2017) that make up its encoder and decoder.

Prompting is the approach of adding extra information for the model to condition on during its generation of Y. Normally, prompting is done by prepending a series of tokens, P, to the input X, such that the model maximizes the likelihood of the correct Y, Pr_θ(Y|[P; X]), while keeping the model parameters θ fixed. In GPT-3, the representations of the prompt tokens, P = {p_1, p_2, ..., p_n}, are part of the model's embedding table, parameterized by the frozen θ. Finding an optimal prompt thus requires the selection of prompt tokens, through either manual search or non-differentiable search methods (Jiang et al., 2020; Shin et al., 2020). Prompt tuning removes the restriction that the prompt P be parameterized by θ; instead, the prompt has its own dedicated parameters, θ_P, that can be updated. While prompt design involves selecting prompt tokens from a fixed vocabulary of frozen embeddings, prompt tuning can be thought of as using a fixed prompt of special tokens, where only the embeddings of these prompt tokens can be updated. Our new conditional generation is now Pr_{θ,θ_P}(Y|[P; X]), and can be trained by maximizing the likelihood of Y via backpropagation, while only applying gradient updates to θ_P.

Given a series of n tokens, {x_1, x_2, ..., x_n}, the first thing T5 does is embed the tokens, forming a matrix X_e ∈ R^{n×e}, where e is the dimension of the embedding space. Our soft-prompts are represented as a parameter P_e ∈ R^{p×e}, where p is the length of the prompt. Our prompt is then concatenated to the embedded input, forming a single matrix [P_e; X_e] ∈ R^{(p+n)×e}, which then flows through the encoder-decoder as normal. Our models are trained to maximize the probability of Y, but only the prompt parameters P_e are updated.
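A minimal PyTorch-style sketch of this update rule is shown below; the paper's actual implementation is in JAX/Flax, and the loss and "encoder" here are placeholders. The point it illustrates is that only the prompt matrix P_e receives gradient updates, while the embedding table and the rest of the frozen model do not.

```python
import torch

vocab_size, e, p, n = 32000, 512, 100, 16
embedding = torch.nn.Embedding(vocab_size, e)          # frozen embedding table
frozen_encoder = torch.nn.Linear(e, e)                 # stand-in for the frozen encoder-decoder
for module in (embedding, frozen_encoder):
    module.requires_grad_(False)

prompt = torch.nn.Parameter(torch.empty(p, e).uniform_(-0.5, 0.5))  # P_e, the only trainable tensor
optimizer = torch.optim.Adam([prompt], lr=0.3)         # optimizer choice is illustrative

token_ids = torch.randint(0, vocab_size, (n,))
x_e = embedding(token_ids)                             # X_e: (n, e)
inputs = torch.cat([prompt, x_e], dim=0)               # [P_e; X_e]: (p + n, e)
loss = frozen_encoder(inputs).pow(2).mean()            # placeholder for -log Pr(Y | [P; X])
loss.backward()
optimizer.step()                                       # updates theta_P only; theta stays fixed
```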

2.1 Design Decisions

There are many possible ways to initialize the prompt representations. The simplest is to train from scratch, using random initialization. A more sophisticated option is to initialize each prompt token to an embedding drawn from the model's vocabulary. Conceptually, our soft-prompt modulates the frozen network's behavior in the same way as text preceding the input, so it follows that a word-like representation might serve as a good initialization spot. For classification tasks, a third option is to initialize the prompt with embeddings that enumerate the output classes, similar to the “verbalizers” of Schick and Schütze (2021). Since we want the model to produce these tokens in the output, initializing the prompt with the embeddings of the valid target tokens should prime the model to restrict its output to the legal output classes.

Another design consideration is the length of the prompt. The parameter cost of our method is E·P, where E is the token embedding dimension and P is the prompt length. The shorter the prompt, the fewer new parameters must be tuned, so we aim to find a minimal length that still performs well.

2.2 Unlearning Span Corruption

Unlike autoregressive language models like GPT-3, the T5 models we experiment with use an encoder-decoder architecture and pre-train on a span corruption objective. Specifically, T5 is tasked with “reconstructing” masked spans in the input text, which are marked with unique sentinel tokens. The target output text consists of all the masked content, separated by sentinels, plus a final sentinel. For instance, from the text “Thank you for inviting me to your party last week”, we might construct a pre-training example where the input is “Thank you 〈X〉 me to your party 〈Y〉 week” and the target output is “〈X〉 for inviting 〈Y〉 last 〈Z〉”.
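The toy function below builds exactly this kind of input/target pair from a text and a list of word spans to mask. It is only a sketch of the idea: `<X>`, `<Y>`, `<Z>` stand in for T5's real `<extra_id_*>` sentinel tokens, and real span corruption samples spans randomly over subword tokens.

```python
def span_corrupt(text, spans):
    """Construct a span-corruption example from word-index spans (up to two here)."""
    sentinels = ["<X>", "<Y>", "<Z>"]           # stand-ins for T5's sentinel tokens
    words = text.split()
    inp, tgt, prev = [], [], 0
    for sid, (start, end) in enumerate(spans):
        inp += words[prev:start] + [sentinels[sid]]   # replace the span with a sentinel
        tgt += [sentinels[sid]] + words[start:end]    # target lists the removed content
        prev = end
    inp += words[prev:]
    tgt += [sentinels[len(spans)]]                    # final sentinel closes the target
    return " ".join(inp), " ".join(tgt)

text = "Thank you for inviting me to your party last week"
print(span_corrupt(text, [(2, 4), (8, 9)]))
# ('Thank you <X> me to your party <Y> week', '<X> for inviting <Y> last <Z>')
```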

While Raffel et al. (2020) find this architecture and pre-training objective more effective than traditional language modeling, we hypothesize that this setup is not a good fit for producing a frozen model that can be readily controlled through prompt tuning. In particular, a T5 model pre-trained exclusively on span corruption, such as T5 1.1, has never seen truly natural input text (free of sentinel tokens), nor has it ever been asked to predict truly natural targets. In fact, due to the details of T5's span corruption preprocessing, every pre-training target will begin with a sentinel. While this “unnatural” tendency to output sentinels is easy to overcome through fine-tuning, we suspect that it would be much harder to override through a prompt alone, as the decoder priors cannot be adjusted.

Given these concerns, we experiment with T5 models in three settings. (1) “Span Corruption”: We use pre-trained T5 off-the-shelf as our frozen model, and test its ability to output the expected text for downstream tasks. (2) “Span Corruption + Sentinel”: We use the same model, but prepend all downstream targets with a sentinel, so as to more closely resemble the targets seen in pre-training. (3) “LM Adaptation”: We continue T5's self-supervised training for a small number of additional steps, but using the “LM” objective discussed by Raffel et al. (2020): given a natural text prefix as input, the model must produce the natural text continuation as output. Crucially, this adaptation happens only once, producing a single frozen model that we can reuse for prompt tuning across any number of downstream tasks.

Through LM adaptation, we hope to “quickly” transform T5 into a model more similar to GPT-3, which always outputs realistic text and is known to respond well to prompts as a “few-shot learner”. It is not obvious how successful this late-stage transformation will be compared to pre-training from scratch, and it has not been investigated previously to our knowledge. As such, we experiment with various lengths of adaptation, up to 100K steps.

3 Results

Our frozen models are built on top of pre-trained T5 checkpoints of all sizes (Small, Base, Large, XL, XXL). We leverage the public T5 1.1 checkpoints, which include improvements over the original T5.¹

Our “default” configuration, plotted with a green “×” throughout, uses an LM-adapted version of T5 trained for an additional 100K steps, initializes using class labels (see Section 3.2), and uses a prompt length of 100 tokens. While this is longer than the default 10-token prefix used by Li and Liang (2021), our method still uses fewer task-specific parameters, as we only tune the input layer, as opposed to overwriting activations in all network layers. See Figure 4 for a detailed comparison. We will also see shortly that even much shorter prompts are viable as model size increases.

We measure performance on the SuperGLUE benchmark (Wang et al., 2019a), a collection of eight challenging English language understanding tasks.² We report metrics on the development set associated with each dataset.

Each of our prompts trains on a single SuperGLUE task; there was no multi-task setup or mixing of training data across tasks. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.

We train our prompts for 30,000 steps using T5's standard cross-entropy loss.

¹These improvements are (1) the removal of all supervised data from pre-training, (2) adjustments to hyperparameters d_model and d_ff, and (3) the use of GeGLU (Shazeer, 2020) over ReLU (Nair and Hinton, 2010) activations.

²The tasks are BoolQ (Clark et al., 2019), CB (De Marneff et al., 2019), COPA (Roemmele et al., 2011), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Pilehvar and Camacho-Collados, 2018), and WSC (Levesque et al., 2012).

We use a constant learning rate of 0.3 and a batch size of 32. Checkpoints are selected via early stopping on the development set, where the stopping metric is the default metric for the dataset, or the average of metrics for datasets evaluated with multiple metrics. All experiments were run in JAX (Bradbury et al., 2018) using the Adafactor optimizer (Shazeer and Stern, 2018) with weight decay 1e−5, β2 decay 0.8, and parameter scaling off. The models were implemented in Flax (Heek et al., 2020). More details are available in Appendix A.
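As a hypothetical sketch of the checkpoint-selection rule just described (not code from the paper), one can evaluate periodically on the development set and keep the checkpoint with the best default metric, falling back to the average when a dataset reports several metrics:

```python
def select_checkpoint(eval_history):
    """eval_history: list of (step, {metric_name: value}) pairs from dev-set evals."""
    def score(metrics):
        return sum(metrics.values()) / len(metrics)   # single metric or average of several
    best_step, _ = max(eval_history, key=lambda item: score(item[1]))
    return best_step

history = [(1000, {"em": 60.1, "f1": 70.3}),
           (2000, {"em": 64.0, "f1": 74.2}),
           (3000, {"em": 63.5, "f1": 73.9})]
print(select_checkpoint(history))  # -> 2000
```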

3.1 Closing the Gap

To compare our method with standard model tuning, we tune the public T5 1.1 checkpoints on SuperGLUE using the default hyperparameters specified in the T5 library (learning rate 0.001, and Adafactor optimizer with pre-training parameter states restored). We consider two baselines. (1) “Model Tuning”: For an apples-to-apples comparison, we tune on each task separately, as in our prompt tuning setup.³ (2) “Model Tuning (Multi-task)”: We use T5's multi-task tuning setup to achieve a more competitive baseline.⁴ In this case, a single model is tuned on all tasks jointly, with a text prefix indicating the task name.

In Figure 1 (p. 1), we see that prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.

To compare with prompt design, we include GPT-3 few-shot performance on the SuperGLUE dev split, as reported by Brown et al. (2020).⁵ Figure 1 shows that prompt tuning beats GPT-3 prompt design by a large margin, with prompt-tuned T5-Small matching GPT-3 XL (over 16 times larger), and prompt-tuned T5-Large beating GPT-3 175B (over 220 times larger).

³To improve this baseline, we performed a sweep over the batch size hyperparameter and selected 2^16 tokens per batch.

⁴The T5 SuperGLUE submission used a more complex setup, first mixing multi-task supervised data into pre-training, and then performing single-task fine-tuning. Since we use T5 1.1 throughout, this setup is unavailable, as the pre-training phase is fully self-supervised. We follow Raffel et al. (2020) in using 2^20 tokens per batch and including DPR data in the multi-task mixture, which is known to boost WSC task performance (Kocijan et al., 2019).

⁵We also experimented with using GPT-3's manual text prompts directly with our LM-adapted T5 checkpoints. However, performance was far below GPT-3 for comparable model sizes. This may be due to differences in pre-training data and model architecture, as well as T5's shorter sequence length.

[Figure 3: four panels plotting SuperGLUE score (y-axis) against model parameters, 10^8 to 10^10 (x-axis): (a) Prompt length {1, 5, 20, 100, 150}; (b) Prompt initialization {Random Uniform, Sampled Vocab, Class Label}; (c) Pre-training method {Span Corruption, Span Corruption + Sentinel, LM Adaptation (100K)}; (d) LM adaptation steps {0K, 10K, 50K, 100K}.]

Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and stddev across 3 runs). In our “default” configuration (green ×), quality improves stably with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: Increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: Random uniform initialization lags behind the more “advanced” initializations using sampled vocabulary or class label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: Longer adaptation generally gives larger gains, but XXL is robust to even short adaptation.

3.2 Ablation Study

Prompt Length: We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that, for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.⁶

Prompt Initialization: We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values.

⁶Going past 100 tokens appears mildly detrimental for larger models. A similar pattern of diminishing performance past a certain prefix length is observed by Li and Liang (2021).

For random initialization, we sample uniformly from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most “common” tokens in T5's SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For “class label” initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens. In this case, we fall back to our sampled vocab strategy to fill in the prompt.⁷
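The three strategies can be sketched as follows. This is our own illustration under stated assumptions: the embedding table is random stand-in data, and the token ids passed as class labels are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
e = 512                                             # embedding dimension
vocab_embeddings = rng.normal(size=(32000, e))      # stand-in for the frozen embedding table

def init_prompt(p, strategy, class_label_ids=()):
    """Sketch of the initializations compared in Figure 3(b)."""
    if strategy == "random_uniform":
        return rng.uniform(-0.5, 0.5, size=(p, e))
    if strategy == "sampled_vocab":                 # draw from the 5,000 most common tokens
        ids = rng.choice(5000, size=p)
        return vocab_embeddings[ids].copy()
    if strategy == "class_label":                   # one (averaged) class-label embedding per token,
        rows = []                                   # falling back to sampled vocab when labels run out
        for i in range(p):
            if i < len(class_label_ids):
                rows.append(vocab_embeddings[class_label_ids[i]].mean(axis=0))
            else:
                rows.append(vocab_embeddings[rng.choice(5000)])
        return np.stack(rows)
    raise ValueError(strategy)

# Hypothetical token ids: a single-piece label and a two-piece label (averaged).
prompt = init_prompt(4, "class_label", class_label_ids=[[43], [1208, 67]])
print(prompt.shape)  # (4, 512)
```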

Figure 3(b) shows our ablation of initialization strategy across model sizes.

⁷T5's handling of the ReCoRD and WSC tasks requires the model to generate short, free-form text. In these cases, we initialize the prompts with words related to the task: commonsense, reasoning, reading, and comprehension for ReCoRD, and commonsense, pronoun, and resolution for WSC.

We find that the class-based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.

With “class label” initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.

Pre-training Objective: In Figures 3(c) and 3(d), we see pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5's default “span corruption” objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the “workaround” of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.

Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the “transition” from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations. At this size, the gains from adaptation are quite modest.

In the non-optimal “span corruption” setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0%. The two most common error modes are copying sub-spans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the “span corruption” objective can be unreliable, with only 2 out of 5 models working well, whereas the LM-adapted versions work reliably across all model sizes.

[Figure 4: task-specific parameters (absolute count, 10^3 to 10^9, and as a percentage of total parameters, 0.001% to 100%) versus model parameters (10^8 to 10^10) for Model Tuning, Prefix Tuning (Train), Prefix Tuning (Infer), WARP, Prompt Tuning, and Prompt Design.]

Figure 4: Parameter usage of various adaptation techniques, fixing architecture to T5 1.1 and prompt/prefix length to 1–100 tokens (bands show mean and stddev). Model Tuning: All parameters are task-specific. Prefix Tuning: Activations are tuned in the prefix of each layer, requiring 0.1–1% task-specific parameters for inference, but more are used for training. WARP: Task parameters are reduced to under 0.1% by only tuning input and output layers. Prompt Tuning: Only prompt embeddings are tuned, reaching under 0.01% for most model sizes. Prompt Design: Only a sequence of prompt IDs (500–2000 tokens) is required.

We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps for all model sizes.⁸

4 Comparison to Similar Approaches

In this section, we review recent work on learning continuous prompts, and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.⁹

Li and Liang (2021) propose “prefix tuning”: learning a sequence of prefixes that are prepended at every transformer layer.

⁸ https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

⁹To compare with prompt design, we count each token ID in the prompt as a parameter, and assume a prompt of between 500–2000 tokens to match the GPT-3 setting. While this technique is by far the most parameter efficient, it comes at the cost of task quality.

This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.
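A rough, back-of-the-envelope comparison of the two parameter budgets is sketched below. The prefix-tuning count is a simplification we introduce for illustration (one prefix of length p per layer, ignoring the training-time reparameterization network); it is not a figure taken from either paper.

```python
def prompt_tuning_params(p, d):
    """Prompt tuning stores a single p x d matrix at the input layer."""
    return p * d

def prefix_tuning_inference_params(p, d, num_layers):
    """Simplified estimate: one length-p prefix of d-dimensional activations per layer."""
    return num_layers * p * d

p, d, layers = 10, 1024, 24            # hypothetical 10-token prompt/prefix on a Large-ish model
print(prompt_tuning_params(p, d))                      # 10,240
print(prefix_tuning_inference_params(p, d, layers))    # 245,760
```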

Hambardzumyan et al. (2021) propose “WARP”, where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.

Liu et al. (2021) propose “P-tuning”, where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning; that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.¹⁰

Qin and Eisner (2021) use “soft words” to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned in relation to the input based on hand-designed prompt prototypes, and a learned ∆ℓᵢ parameter is included for each layer, so parameter cost scales with model depth.

¹⁰As another difference, P-tuning requires the addition of “anchor” tokens in the input (e.g., a question mark following the hypothesis in the RTE task) to achieve strong performance, while prompt tuning leaves inputs untouched.

Dataset      Domain   Model        Prompt       ∆

SQuAD        Wiki     94.9 ±0.2    94.8 ±0.1    −0.1

TextbookQA   Book     54.3 ±3.7    66.8 ±2.9    +12.5
BioASQ       Bio      77.9 ±0.4    79.1 ±0.3    +1.2
RACE         Exam     59.8 ±0.6    60.7 ±0.5    +0.9
RE           Wiki     88.4 ±0.1    88.8 ±0.2    +0.4
DuoRC        Movie    68.9 ±0.7    67.7 ±1.1    −1.2
DROP         Wiki     68.9 ±1.7    67.1 ±1.9    −1.8

Table 1: F1 mean and stddev for models trained on SQuAD and evaluated on out-of-domain datasets from the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.

Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.

More generally, work on task prompts is closely aligned with work on “adapters” (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.
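To make the contrast concrete, the following is a minimal sketch of a Houlsby-style adapter: a small bottleneck applied residually to a layer's activations, i.e., it changes the function itself. Dimensions and names are illustrative, and this is not code from any of the cited works; prompt tuning, by comparison, adds no such module and only prepends learned input embeddings.

```python
import torch

class Adapter(torch.nn.Module):
    """Bottleneck layer inserted between frozen transformer layers (adapter-style)."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = torch.nn.Linear(d_model, bottleneck)
        self.up = torch.nn.Linear(bottleneck, d_model)

    def forward(self, hidden):
        # Residual rewrite of the layer's activations.
        return hidden + self.up(torch.relu(self.down(hidden)))

hidden = torch.randn(4, 16, 768)       # (batch, sequence, d_model) from a frozen layer
print(Adapter()(hidden).shape)         # torch.Size([4, 16, 768])
```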

5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations.

Train   Eval    Tuning   Accuracy    F1

QQP     MRPC    Model    73.1 ±0.9   81.2 ±2.1
                Prompt   76.3 ±0.1   84.3 ±0.3

MRPC    QQP     Model    74.9 ±1.3   70.9 ±1.2
                Prompt   75.4 ±0.8   69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on “in-domain” datasets perform when evaluated on “out-of-domain” datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.¹¹

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g., to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are “duplicates”. The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the “in-domain” task, select checkpoints using in-domain validation, and evaluate zero-shot on the “out-of-domain” task.

Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 points accuracy and +3.1 points F1).

¹¹We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org/), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset    Metric    Average        Best           Ensemble

BoolQ      acc.      91.1           91.3           91.7
CB         acc./F1   99.3 / 99.0    100.0 / 100.0  100.0 / 100.0
COPA       acc.      98.8           100.0          100.0
MultiRC    EM/F1a    65.7 / 88.7    66.3 / 89.0    67.1 / 89.4
ReCoRD     EM/F1     92.7 / 93.4    92.9 / 93.5    93.2 / 93.9
RTE        acc.      92.6           93.5           93.5
WiC        acc.      76.2           76.6           77.4
WSC        acc.      95.8           96.2           96.2

SuperGLUE (dev)      90.5           91.0           91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g., 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate “models” for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average, and beats or matches the best individual prompt.
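The mechanics of this can be sketched as follows. The `frozen_model` and `embed` functions are toy stand-ins for the shared T5 encoder-decoder; the point is that one example is replicated N times with N different prompts, scored in a single batched forward pass, and resolved by majority vote.

```python
import collections
import numpy as np

N, prompt_len, d = 5, 100, 512
prompts = [np.random.randn(prompt_len, d) for _ in range(N)]   # five prompts, one task

def embed(token_ids):                  # stand-in for the frozen embedding table
    return np.random.randn(len(token_ids), d)

def frozen_model(batch):               # stand-in: one predicted label string per batch row
    return ["True" if x.mean() > 0 else "False" for x in batch]

def ensemble_predict(token_ids):
    # Replicate the example across a batch of size N, varying only the prompt.
    batch = [np.concatenate([p, embed(token_ids)], axis=0) for p in prompts]
    votes = frozen_model(batch)
    return collections.Counter(votes).most_common(1)[0][0]     # simple majority vote

print(ensemble_predict([17, 5, 203, 9]))
```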

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
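This nearest-neighbor probe amounts to a cosine-similarity lookup against the embedding table, as in the sketch below (the arrays are random stand-ins for the real frozen embeddings and tuned prompt).

```python
import numpy as np

def nearest_neighbors(prompt_token, vocab_embeddings, k=5):
    """Ids of the top-k vocabulary embeddings by cosine similarity to one prompt token."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=-1, keepdims=True)
    sims = normalize(vocab_embeddings) @ normalize(prompt_token)   # cosine similarities
    return np.argsort(-sims)[:k]

vocab = np.random.randn(32000, 512)     # stand-in for the frozen model's embedding table
learned_token = np.random.randn(512)    # one row of the tuned prompt P_e
print(nearest_neighbors(learned_token, vocab))
```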

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as {Technology, technology, Technologies, technological, technologies}, as well as more diverse but still strongly related clusters such as {entirely, completely, totally, altogether, 100%}. The nature of these clusters suggests that the prompts are in fact learning “word-like” representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the “class-label” strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the “Random Uniform” or “Sampled Vocab” methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to output classes makes this easier and more centralized.

When examining longer prompts (e.g., size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences, show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, where approximately 20% of the questions are in the “Nature/Science” category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g., “scientific”).

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6–4, Venice.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA. Omnipress.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.

Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.

A Reproducibility

A.1 Experimental Settings

We evaluate each GLUE and SuperGLUE dataset using the metric specified in the benchmark. We reuse the evaluation code from the publicly available T5 open-source release to compute metrics.12

For the SQuAD and MRQA datasets, we evaluate using F1, one of the metrics used by the SQuAD benchmark, where partial answer spans are considered. Again, we use the T5 open-source release for metric calculation.13 All of our models use T5 1.1 as the base frozen model; additional details and pre-trained checkpoints can be found on GitHub.14,15
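For readers who want the metric outside the T5 codebase, SQuAD-style F1 is a token-overlap score between the predicted and gold answers. The sketch below is a simplified stand-in (the official scripts also lower-case and strip punctuation and articles before comparing), not the evaluation code used here:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 for extractive QA: partial answer spans receive partial credit."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```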

All prompts for T5 Small and Base models were trained on 4 TPU v2 chips, while prompts for larger models were trained on 16 TPU v3 chips.

Parameter counts for each prompt can be found in Table 4. Average runtimes until convergence can be found in Table 5.

A.2 Hyperparameter Search

This work used 77 hyperparameter search trials (40 for prompt tuning and 37 for single-task model tuning), and 3 training runs (with validation evaluation) for each baseline configuration and ablation setting, for a total of 195 runs for our main result and ablations. There were an additional 18 runs for the domain shift experiments and 24 extra runs to create the ensemble. Hyperparameter bounds can be found in Table 6. Hyperparameter tuning was done via manual tuning, and settings were selected based on the SuperGLUE score. All experiments in this work, outside of the hyperparameter being ablated, use our default configuration of 100K steps of LM Adaptation, a prompt length of 100, and "class-label" initialization.

All graphs of our experimental results plot the mean and standard deviation over 3 runs as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation

12https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py

13https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151

14https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511

15https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

is hidden behind the line itself, such as "Model Tuning (Multi-task)" in Figure 1 and the Base, Large, and XL prompts trained on the "Span Corruption" pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The "Prefix Tuning (Train)" curve appears to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.

A.3 Datasets

All datasets used are in English. For the GLUE16,17 and SuperGLUE18 datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,19 we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.20 Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).

The question answering datasets are extractive datasets with a variety of answers, so there isn't a label distribution to report. Similarly, the ReCoRD dataset is a multiple-choice dataset where the model must predict the masked-out entity from a list of possible entities. Due to this formulation, there isn't a meaningful label distribution.

We followed the open-source T5 preprocessing procedure21 for each dataset, except that we omit the dataset prefix denoting which SuperGLUE dataset an example belongs to. For the SQuAD and

16https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

17https://www.tensorflow.org/datasets/catalog/glue#glueqqp

18https://www.tensorflow.org/datasets/catalog/super_glue

19https://www.tensorflow.org/datasets/catalog/squad#squadv11_default_config

20https://github.com/mrqa/MRQA-Shared-Task-2019#out-of-domain

21https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

T5 Size  Prompt Length  Trainable Parameters  Total Parameters  Percent Trainable

Small    1      512      76961664      0.00067
Small    5      2560     76963712      0.00333
Small    20     10240    76971392      0.01330
Small    50     25600    76986752      0.03325
Small    100    51200    77012352      0.06648
Small    150    76800    77037952      0.09969

Base     1      768      247578624     0.00031
Base     5      3840     247581696     0.00155
Base     20     15360    247593216     0.00620
Base     50     38400    247616256     0.01551
Base     100    76800    247654656     0.03101
Base     150    115200   247693056     0.04651

Large    1      1024     783151104     0.00013
Large    5      5120     783155200     0.00065
Large    20     20480    783170560     0.00262
Large    50     51200    783201280     0.00654
Large    100    102400   783252480     0.01307
Large    150    153600   783303680     0.01961

XL       1      2048     2849759232    0.00007
XL       5      10240    2849767424    0.00036
XL       20     40960    2849798144    0.00143
XL       50     102400   2849859584    0.00359
XL       100    204800   2849961984    0.00718
XL       150    307200   2850064384    0.01078

XXL      1      4096     11135336448   0.00004
XXL      5      20480    11135352832   0.00018
XXL      20     81920    11135414272   0.00074
XXL      50     204800   11135537152   0.00184
XXL      100    409600   11135741952   0.00368
XXL      150    614400   11135946752   0.00552

Table 4: Number of parameters used for various prompt lengths and T5 model sizes. Trainable parameters is the number of parameters in the prompt itself, while total parameters includes the prompt plus the original T5 parameters. The T5 parameters are frozen and shared across all tasks, and include the SentencePiece lookup table parameters. The final column is the percentage of total parameters that are trainable.

MRQA datasets, we used the T5 SQuAD preprocessing code.22 By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.

No new data was collected for this work.

22https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length  T5 Size  Time

1    Large   3:17 ±02:10
1    XL      3:37 ±02:11
1    XXL    21:23 ±01:54

20   XL     49:08 ±18:53
20   XXL    53:03 ±16:25

50   Small   9:05 ±05:07
50   Base   55:01 ±27:48
50   Large  1:14:16 ±13:12
50   XL     2:30:10 ±25:40
50   XXL    3:13:13 ±23:08

100  Small  16:25 ±01:15
100  Base   29:57 ±00:18
100  Large  1:23:36 ±10:21
100  XL     3:35:00 ±54:42
100  XXL    3:51:15 ±45:53

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.

Hyperparameter Search Space

Learning Rate      0.001–0.5
Parameter Scaling  True, False
Batch Size         32, 64, 126, 256, 512
Number of Steps    10000, 20000, 30000
Warmup Steps       off, 2000, 3000
Decay Factor       off, 0.1, 0.5
Steps per Decay    off, 4000, 6000, 8000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every "Steps per Decay" steps.

Dataset Training Validation Testing

BoolQ       9427    3270    3245
CB          250     56      250
COPA        400     100     500
MultiRC     27243   4848    9693
ReCoRD      100730  10000   10000
RTE         2490    277     3000
WiC         5428    638     1400
WSC         259*    104     146
MRPC        3668    408     1725
QQP         363849  40430   390965
SQuAD       87599   10570   -
TextbookQA  -       1504    -
RACE        -       1503    -
BioASQ      -       1501    -
RE          -       674     -
DuoRC       -       2948    -
DROP        -       1503    -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This means our training dataset is smaller than the normal WSC training dataset, which has 554 examples.

Split       False   True
Training    37.7%   62.3%
Validation  37.8%   62.2%

Table 8: Label distribution for the BoolQ dataset.

Split       contradiction  entailment  neutral
Training    47.6%          46.0%       6.4%
Validation  50.0%          41.1%       8.9%

Table 9: Label distribution for the CB dataset.

Split       choice1  choice2
Training    48.8%    51.2%
Validation  55.0%    45.0%

Table 10: Label distribution for the COPA dataset.

Split       False   True
Training    55.9%   44.1%
Validation  57.2%   42.8%

Table 11: Label distribution for the MultiRC dataset.

Split       False   True
Training    50.0%   50.0%
Validation  50.0%   50.0%

Table 12: Label distribution for the WiC dataset.

Split       False   True
Training    0.0%    100.0%
Validation  63.5%   36.5%

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task where the model generates the referent to the highlighted span, instead of predicting whether the supplied entity is the correct referent of the highlighted span. Thus, we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split       entailment  not_entailment
Training    51.2%       49.8%
Validation  52.7%       47.3%

Table 14: Label distribution for the RTE dataset.

Split       equivalent  not_equivalent
Training    67.4%       32.6%
Validation  68.4%       31.6%

Table 15: Label distribution for the MRPC dataset.

Split       duplicate  not_duplicate
Training    36.9%      63.1%
Validation  36.8%      63.2%

Table 16: Label distribution for the QQP dataset.

ing the model parameters θ fixed. In GPT-3, the representations of the prompt tokens P = {p1, p2, . . . , pn} are part of the model's embedding table, parameterized by the frozen θ. Finding an optimal prompt thus requires the selection of prompt tokens, through either manual search or non-differentiable search methods (Jiang et al., 2020; Shin et al., 2020). Prompt tuning removes the restriction that the prompt P be parameterized by θ; instead, the prompt has its own dedicated parameters, θ_P, that can be updated. While prompt design involves selecting prompt tokens from a fixed vocabulary of frozen embeddings, prompt tuning can be thought of as using a fixed prompt of special tokens, where only the embeddings of these prompt tokens can be updated. Our new conditional generation is now Pr_{θ;θ_P}(Y | [P; X]), and can be trained by maximizing the likelihood of Y via backpropagation, while only applying gradient updates to θ_P.

Given a series of n tokens, {x1, x2, . . . , xn}, the first thing T5 does is embed the tokens, forming a matrix X_e ∈ R^{n×e}, where e is the dimension of the embedding space. Our soft-prompts are represented as a parameter P_e ∈ R^{p×e}, where p is the length of the prompt. Our prompt is then concatenated to the embedded input, forming a single matrix [P_e; X_e] ∈ R^{(p+n)×e}, which then flows through the encoder-decoder as normal. Our models are trained to maximize the probability of Y, but only the prompt parameters P_e are updated.
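To make the mechanics concrete, the following is a minimal JAX-style sketch, written for this transcript rather than taken from the released code, of prepending a trainable prompt P_e to the frozen embedded input X_e and differentiating only with respect to the prompt:

```python
import jax
import jax.numpy as jnp

def make_loss_fn(frozen_embed, frozen_model):
    """frozen_embed: [vocab, e] embedding table; frozen_model: assumed callable on [p+n, e] inputs."""
    def loss_fn(prompt, token_ids, one_hot_targets):
        x_e = frozen_embed[token_ids]                    # X_e: [n, e], looked up in the frozen table
        inputs = jnp.concatenate([prompt, x_e], axis=0)  # [P_e; X_e]: [p + n, e]
        logits = frozen_model(inputs)                    # frozen encoder-decoder, closed over
        logp = jax.nn.log_softmax(logits, axis=-1)
        return -jnp.sum(one_hot_targets * logp)          # toy cross-entropy on one-hot targets
    return loss_fn

# Gradients flow only to `prompt` (argnums=0); the frozen parameters are never updated.
# grad_fn = jax.grad(make_loss_fn(embed_table, t5_apply), argnums=0)
```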

2.1 Design Decisions

There are many possible ways to initialize the prompt representations. The simplest is to train from scratch, using random initialization. A more sophisticated option is to initialize each prompt token to an embedding drawn from the model's vocabulary. Conceptually, our soft-prompt modulates the frozen network's behavior in the same way as text preceding the input, so it follows that a word-like representation might serve as a good initialization spot. For classification tasks, a third option is to initialize the prompt with embeddings that enumerate the output classes, similar to the "verbalizers" of Schick and Schütze (2021). Since we want the model to produce these tokens in the output, initializing the prompt with the embeddings of the valid target tokens should prime the model to restrict its output to the legal output classes.

Another design consideration is the length of the prompt. The parameter cost of our method is E · P, where E is the token embedding dimension and P is the prompt length. The shorter the prompt, the fewer new parameters must be tuned, so we aim to find a minimal length that still performs well.
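As a quick check of this cost, the trainable-parameter column of Table 4 is exactly E · P for each size, where E is the embedding dimension of the corresponding T5 1.1 checkpoint:

```python
# Parameter cost of a prompt is E * P: embedding dimension times prompt length.
embedding_dim = {"Small": 512, "Base": 768, "Large": 1024, "XL": 2048, "XXL": 4096}
prompt_lengths = (1, 5, 20, 50, 100, 150)
costs = {size: [e * p for p in prompt_lengths] for size, e in embedding_dim.items()}
print(costs["XXL"])  # [4096, 20480, 81920, 204800, 409600, 614400] -- matches Table 4
```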

2.2 Unlearning Span Corruption

Unlike autoregressive language models like GPT-3, the T5 models we experiment with use an encoder-decoder architecture and pre-train on a span corruption objective. Specifically, T5 is tasked with "reconstructing" masked spans in the input text, which are marked with unique sentinel tokens. The target output text consists of all the masked content, separated by sentinels, plus a final sentinel. For instance, from the text "Thank you for inviting me to your party last week" we might construct a pre-training example where the input is "Thank you 〈X〉 me to your party 〈Y〉 week" and the target output is "〈X〉 for inviting 〈Y〉 last 〈Z〉".
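As an illustration of how such a pair could be assembled, the sketch below masks two spans of the example sentence and emits the sentinel-delimited input and target; the '<X>', '<Y>', '<Z>' strings stand in for T5's sentinel tokens and the function is ours, not the T5 preprocessing code:

```python
def span_corruption_example(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Mask the given (start, end) spans and build (input, target) strings with sentinels."""
    inp, tgt, last = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inp.extend(tokens[last:start]); inp.append(sentinel)
        tgt.append(sentinel); tgt.extend(tokens[start:end])
        last = end
    inp.extend(tokens[last:])
    tgt.append(sentinels[len(spans)])   # the target always ends with a final sentinel
    return " ".join(inp), " ".join(tgt)

words = "Thank you for inviting me to your party last week".split()
# Masking "for inviting" and "last" reproduces the example from the text.
print(span_corruption_example(words, [(2, 4), (8, 9)]))
# -> ('Thank you <X> me to your party <Y> week', '<X> for inviting <Y> last <Z>')
```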

While Raffel et al. (2020) find this architecture and pre-training objective more effective than traditional language modeling, we hypothesize that this setup is not a good fit for producing a frozen model that can be readily controlled through prompt tuning. In particular, a T5 model pre-trained exclusively on span corruption, such as T5 1.1, has never seen truly natural input text (free of sentinel tokens), nor has it ever been asked to predict truly natural targets. In fact, due to the details of T5's span corruption preprocessing, every pre-training target will begin with a sentinel. While this "unnatural" tendency to output sentinels is easy to overcome through fine-tuning, we suspect that it would be much harder to override through a prompt alone, as the decoder priors cannot be adjusted.

Given these concerns, we experiment with T5 models in three settings. (1) "Span Corruption": We use pre-trained T5 off-the-shelf as our frozen model, and test its ability to output the expected text for downstream tasks. (2) "Span Corruption + Sentinel": We use the same model, but prepend all downstream targets with a sentinel, so as to more closely resemble the targets seen in pre-training. (3) "LM Adaptation": We continue T5's self-supervised training for a small number of additional steps, but using the "LM" objective discussed by Raffel et al. (2020): given a natural text prefix as input, the model must produce the natural text continuation as output. Crucially, this adaptation happens only once, producing a single frozen

model that we can reuse for prompt tuning across any number of downstream tasks.

Through LM adaptation, we hope to "quickly" transform T5 into a model more similar to GPT-3, which always outputs realistic text and is known to respond well to prompts as a "few-shot learner." It is not obvious how successful this late-stage transformation will be compared to pre-training from scratch, and it has not been investigated previously to our knowledge. As such, we experiment with various lengths of adaptation, up to 100K steps.

3 Results

Our frozen models are built on top of pre-trained T5 checkpoints of all sizes (Small, Base, Large, XL, XXL). We leverage the public T5 1.1 checkpoints, which include improvements over the original T5.1

Our "default" configuration, plotted with a green '×' throughout, uses an LM-adapted version of T5 trained for an additional 100K steps, initializes using class labels (see Section 3.2), and uses a prompt length of 100 tokens. While this is longer than the default 10-token prefix used by Li and Liang (2021), our method still uses fewer task-specific parameters, as we only tune the input layer, as opposed to overwriting activations in all network layers. See Figure 4 for a detailed comparison. We will also see shortly that even much shorter prompts are viable as model size increases.

We measure performance on the SuperGLUE benchmark (Wang et al., 2019a), a collection of eight challenging English language understanding tasks.2 We report metrics on the development set associated with each dataset.

Each of our prompts trains on a single SuperGLUE task; there was no multi-task setup or mixing of training data across tasks. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.

We train our prompts for 30,000 steps using T5's standard cross-entropy loss, with a constant learning rate of 0.3 and a batch size of 32. Checkpoints are selected via early stopping on the development set, where the stopping metric is the default metric for the dataset, or the average of metrics for datasets evaluated with multiple metrics. All experiments were run in JAX (Bradbury et al., 2018) using the Adafactor optimizer (Shazeer and Stern, 2018) with weight decay 1e−5, β2 decay 0.8, and parameter scaling off. The models were implemented in Flax (Heek et al., 2020). More details are available in Appendix A.

1These improvements are (1) the removal of all supervised data from pre-training, (2) adjustments to hyperparameters d_model and d_ff, and (3) the use of GeGLU (Shazeer, 2020) over ReLU (Nair and Hinton, 2010) activations.

2The tasks are BoolQ (Clark et al., 2019), CB (De Marneff et al., 2019), COPA (Roemmele et al., 2011), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Pilehvar and Camacho-Collados, 2018), and WSC (Levesque et al., 2012).
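For concreteness, the optimizer settings above map onto an Adafactor configuration roughly as follows. This is an illustrative sketch using Optax's Adafactor, not the authors' training code:

```python
import optax

# Hypothetical mapping of the stated settings onto Optax's Adafactor.
optimizer = optax.adafactor(
    learning_rate=0.3,                  # constant learning rate
    decay_rate=0.8,                     # beta2 decay
    weight_decay_rate=1e-5,             # weight decay
    multiply_by_parameter_scale=False,  # "parameter scaling off"
)
```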

3.1 Closing the Gap

To compare our method with standard model tuning, we tune the public T5 1.1 checkpoints on SuperGLUE using the default hyperparameters specified in the T5 library (learning rate 0.001, and Adafactor optimizer with pre-training parameter states restored). We consider two baselines. (1) "Model Tuning": For an apples-to-apples comparison, we tune on each task separately, as in our prompt tuning setup.3 (2) "Model Tuning (Multi-task)": We use T5's multi-task tuning setup to achieve a more competitive baseline.4 In this case, a single model is tuned on all tasks jointly, with a text prefix indicating the task name.

In Figure 1 (p. 1), we see that prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.
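The ratio follows directly from the Table 4 figures: a 100-token prompt at XXL size owns 409,600 parameters, versus roughly 11 billion for a full fine-tuned copy of the model:

```python
# Task-specific parameter footprint at XXL size (numbers from Table 4).
full_model_copy = 11_135_741_952 - 409_600   # frozen T5-XXL parameters, excluding the prompt
prompt_parameters = 100 * 4_096              # 100-token prompt x 4096-dim embeddings
print(full_model_copy / prompt_parameters)   # ~27,000x, i.e. "over 20,000 times fewer"
```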

To compare with prompt design, we include GPT-3 few-shot performance on the SuperGLUE dev split, as reported by Brown et al. (2020).5

Figure 1 shows that prompt tuning beats GPT-3 prompt design by a large margin, with prompt-tuned T5-Small matching GPT-3 XL (over 16 times larger), and prompt-tuned T5-Large beating GPT-3 175B (over 220 times larger).

3To improve this baseline, we performed a sweep over the batch size hyperparameter and selected 2^16 tokens per batch.

4The T5 SuperGLUE submission used a more complex setup, first mixing multi-task supervised data into pre-training, and then performing single-task fine-tuning. Since we use T5 1.1 throughout, this setup is unavailable, as the pre-training phase is fully self-supervised. We follow Raffel et al. (2020) in using 2^20 tokens per batch and including DPR data in the multi-task mixture, which is known to boost WSC task performance (Kocijan et al., 2019).

5We also experimented with using GPT-3's manual text prompts directly with our LM-adapted T5 checkpoints. However, performance was far below GPT-3 for comparable model sizes. This may be due to differences in pre-training data and model architecture, as well as T5's shorter sequence length.

[Figure 3: four panels plotting SuperGLUE score (y-axis) against model parameters, 10^8–10^10 (x-axis). (a) Prompt length: 1, 5, 20, 100, 150. (b) Prompt initialization: Random Uniform, Sampled Vocab, Class Label. (c) Pre-training method: Span Corruption, Span Corruption + Sentinel, LM Adaptation (100K). (d) LM adaptation steps: 0K, 10K, 50K, 100K.]

Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and stddev across 3 runs). In our "default" configuration, quality improves stably with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: Increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: Random uniform initialization lags behind more "advanced" initializations using sampled vocabulary or class label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: Longer adaptation generally gives larger gains, but XXL is robust to even short adaptation.

3.2 Ablation Study

Prompt Length: We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that, for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.6

Prompt Initialization: We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values. For random initialization, we sample uniformly from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most "common" tokens in T5's SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For "class label" initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens, in which case we fall back to our sampled vocab strategy to fill in the prompt.7

6Going past 100 tokens appears mildly detrimental for larger models. A similar pattern of diminishing performance past a certain prefix length is observed by Li and Liang (2021).
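A minimal sketch of the three initialization strategies described above; the function and argument names are ours, not the paper's code:

```python
import numpy as np

def init_prompt(prompt_len, embed_dim, embed_table=None, class_label_ids=None, common_ids=None):
    """Illustrative prompt initializers: random uniform, sampled vocab, and class label."""
    # Random uniform in [-0.5, 0.5].
    prompt = np.random.uniform(-0.5, 0.5, size=(prompt_len, embed_dim))
    if embed_table is not None and common_ids is not None:
        # "Sampled vocab": copy embeddings of tokens drawn from the 5,000 most common ids.
        sampled = np.random.choice(common_ids, size=prompt_len)
        prompt = embed_table[sampled].copy()
    if embed_table is not None and class_label_ids is not None:
        # "Class label": each (possibly multi-token) label's averaged embedding fills one slot;
        # remaining slots keep the sampled-vocab / random values above.
        for slot, token_ids in enumerate(class_label_ids[:prompt_len]):
            prompt[slot] = embed_table[token_ids].mean(axis=0)
    return prompt
```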

Figure 3(b) shows our ablation of initialization strategy across model sizes, where we find that the class-based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.

7T5's handling of the ReCoRD and WSC tasks requires the model to generate short, free-form text. In these cases, we initialize the prompts with words related to the task: commonsense, reasoning, reading, and comprehension for ReCoRD; and commonsense, pronoun, and resolution for WSC.

With "class label" initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.

Pre-training Objective: In Figures 3(c) and 3(d), we see pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5's default "span corruption" objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the "workaround" of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.

Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the "transition" from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations. At this size, the gains from adaptation are quite modest.

In the non-optimal "span corruption" setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0%. The two most common error modes are copying sub-spans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the "span corruption" objective can be unreliable, with only 2 out of 5 models working well, whereas

[Figure 4: log–log plot of task-specific parameters (left axis: absolute count, 10^3–10^9; right axis: percentage of model parameters, 0.001–100%) versus model parameters (10^8–10^10), comparing Model Tuning, Prefix Tuning (Train), Prefix Tuning (Infer), WARP, Prompt Tuning, and Prompt Design.]

Figure 4: Parameter usage of various adaptation techniques, fixing architecture to T5 1.1 and prompt/prefix length to 1–100 tokens (bands show mean and stddev). Model Tuning: All parameters are task-specific. Prefix Tuning: Activations are tuned in the prefix of each layer, requiring 0.1–1% task-specific parameters for inference, but more are used for training. WARP: Task parameters are reduced to under 0.1% by only tuning input and output layers. Prompt Tuning: Only prompt embeddings are tuned, reaching under 0.01% for most model sizes. Prompt Design: Only a sequence of prompt IDs (500–2000 tokens) is required.

the LM-adapted versions work reliably across all model sizes.

We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps for all model sizes.8

4 Comparison to Similar Approaches

In this section, we review recent work on learning continuous prompts and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.9

Li and Liang (2021) propose "prefix tuning": learning a sequence of prefixes that are prepended at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.

8https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

9To compare with prompt design, we count each token ID in the prompt as a parameter, and assume a prompt of between 500–2000 tokens to match the GPT-3 setting. While this technique is by far the most parameter efficient, it comes at the cost of task quality.

Hambardzumyan et al. (2021) propose "WARP", where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.

Liu et al. (2021) propose "P-tuning", where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning; that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.10

Qin and Eisner (2021) use "soft words" to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned in relation to the input based on hand-designed prompt prototypes, and a learned ∆_i^ℓ parameter is included for each layer, so parameter cost scales with model depth.

10As another difference, P-tuning requires the addition of "anchor" tokens in the input (e.g. a question mark following the hypothesis in the RTE task) to achieve strong performance, while prompt tuning leaves inputs untouched.

Dataset     Domain  Model       Prompt      ∆

SQuAD       Wiki    94.9 ±0.2   94.8 ±0.1   −0.1

TextbookQA  Book    54.3 ±3.7   66.8 ±2.9   +12.5
BioASQ      Bio     77.9 ±0.4   79.1 ±0.3   +1.2
RACE        Exam    59.8 ±0.6   60.7 ±0.5   +0.9
RE          Wiki    88.4 ±0.1   88.8 ±0.2   +0.4
DuoRC       Movie   68.9 ±0.7   67.7 ±1.1   −1.2
DROP        Wiki    68.9 ±1.7   67.1 ±1.9   −1.8

Table 1: F1 mean and stddev for models trained on SQuAD and evaluated on out-of-domain datasets from the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.

Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.

More generally, work on task prompts is closely aligned with work on "adapters" (Rebuffi et al., 2017; Houlsby et al., 2019): small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.
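To make the contrast concrete, a bottleneck adapter in the style of Houlsby et al. (2019) rewrites a layer's activations through a small down/up projection with a residual connection, whereas prompt tuning touches no layer internals. A rough sketch of the adapter side (shapes, names, and the GELU choice are ours):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here only for illustration
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def adapter(activations, w_down, w_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    return activations + gelu(activations @ w_down) @ w_up
```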

5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations.

Train  Eval   Tuning  Accuracy    F1

QQP    MRPC   Model   73.1 ±0.9   81.2 ±2.1
              Prompt  76.3 ±0.1   84.3 ±0.3

MRPC   QQP    Model   74.9 ±1.3   70.9 ±1.2
              Prompt  75.4 ±0.8   69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on "in-domain" datasets perform when evaluated on "out-of-domain" datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.11

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g. to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are "duplicates". The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the "in-domain" task, select checkpoints using in-domain validation, and evaluate zero-shot on the "out-of-domain" task.

Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire

11We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org/), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset    Metric   Average      Best         Ensemble

BoolQ      acc      91.1         91.3         91.7
CB         acc/F1   99.3/99.0    100.0/100.0  100.0/100.0
COPA       acc      98.8         100.0        100.0
MultiRC    EM/F1a   65.7/88.7    66.3/89.0    67.1/89.4
ReCoRD     EM/F1    92.7/93.4    92.9/93.5    93.2/93.9
RTE        acc      92.6         93.5         93.5
WiC        acc      76.2         76.6         77.4
WSC        acc      95.8         96.2         96.2

SuperGLUE (dev)     90.5         91.0         91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g. 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate "models" for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
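A sketch of this batched ensemble inference with majority voting; the frozen scoring function is an assumed stand-in for the T5 model, not the actual API:

```python
import numpy as np

def ensemble_predict(frozen_model, embed_table, prompts, token_ids, candidate_labels):
    """Score one example under N prompts in a single batched forward pass, then majority-vote."""
    x_e = embed_table[token_ids]                                            # [n, e]
    batch = np.stack([np.concatenate([p, x_e], axis=0) for p in prompts])   # [N, p + n, e]
    scores = frozen_model(batch)                # assumed to return per-label scores: [N, num_labels]
    votes = scores.argmax(axis=-1)
    counts = np.bincount(votes, minlength=len(candidate_labels))
    return candidate_labels[int(counts.argmax())]                           # simple majority vote
```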

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats, or matches, the best individual prompt.

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
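This probe amounts to a cosine-similarity lookup against the frozen vocabulary embeddings; a minimal sketch (our own, not the analysis code):

```python
import numpy as np

def top_k_neighbors(prompt_token, embed_table, k=5):
    """Indices of the k vocabulary embeddings most similar to a prompt token (cosine similarity)."""
    vocab = embed_table / np.linalg.norm(embed_table, axis=-1, keepdims=True)
    query = prompt_token / np.linalg.norm(prompt_token)
    return np.argsort(-(vocab @ query))[:k]
```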

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as {Technology, technology, Technologies, technological, technologies}, as well as more diverse but still strongly related clusters such as {entirely, completely, totally, altogether, 100%}. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the "class-label" strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the "Random Uniform" or "Sampled Vocab" methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to output classes makes this easier and more centralized.

When examining longer prompts (e.g. size 100), we often find several prompt tokens with the same nearest neighbors. This suggests either that there is excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences, show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, and approximately 20% of the questions are in the "Nature/Science" category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g. "scientific").

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 6, pages 6–4, Venice.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA. Omnipress.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.

Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.


No new data was collected for this work

22httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspyL264

Prompt Length T5 Size Time

1 Large 317 plusmn0210XL 337 plusmn0211XXL 2123 plusmn0154

20 XL 4908 plusmn1853XXL 5303 plusmn1625

50 Small 0905 plusmn0507Base 5501 plusmn2748Large 11416 plusmn1312XL 23010 plusmn2540XXL 31313 plusmn2308

100 Small 1625 plusmn0115Base 2957 plusmn0018Large 12336 plusmn1021XL 33500 plusmn5442XXL 35115 plusmn4553

Table 5 Mean and standard deviation of the runtimeuntil convergence for the BoolQ dataset and variousprompt lengths and model sizes Convergence is de-fined as reaching a performance within 1 of the meanvalue for that model configuration A few configura-tions have been omitted because their runtimes wereartificially extended due to preemption

Hyperparameter Search Space

Learning Rate 0001ndash05Parameter Scaling True FalseBatch Size 32 64 126 256 512Number of Steps 10000 20000 30000Warmup Steps off 2000 3000Decay Factor off 01 05Steps per Decay off 4000 6000 8000

Table 6 Search space for each hyperparameter consid-ered Parameter Scaling refers to the Adafactor settingwhere an update is scaled by the norm of the parameterit will be applied to Warmup Steps is the number ofsteps before a linearly increasing learning rate reachesthe Learning Rate value starting from zero Decay Fac-tor is the reduction in Learning Rate size that occursevery ldquoSteps per Decayrdquo steps

Dataset Training Validation Testing

BoolQ 9427 3270 3245CB 250 56 250COPA 400 100 500MultiRC 27243 4848 9693ReCoRD 100730 10000 10000RTE 2490 277 3000WiC 5428 638 1400WSC 259lowast 104 146MRPC 3668 408 1725QQP 363849 40430 390965SQuAD 87599 10570 -TextbookQA - 1504 -RACE - 1503 -BioASQ - 1501 -RE - 674 -DuoRC - 2948 -DROP - 1503 -

Table 7 Sizes for training validation and testing splitsof each dataset used lowastFollowing T5 our casting ofWSC as a text generation problems means we can onlytrain on examples where the supplied referent is correctThis means our training dataset is smaller than the nor-mal WSC training dataset which has 554 examples

Split False True

Training 377 623Validation 378 622

Table 8 Label distribution for the BoolQ dataset

Split contradiction entailment neutral

Training 476 460 64Validation 500 411 89

Table 9 Label distribution for the CB dataset

Split choice1 choice2

Training 488 512Validation 550 450

Table 10 Label distribution for the COPA dataset

Split False True

Training 559 441Validation 572 428

Table 11 Label distribution for the MultiRC dataset

Split False True

Training 500 500Validation 500 500

Table 12 Label distribution for the WiC dataset

Split False True

Training 00 1000Validation 635 365

Table 13 Label distribution for the WSC dataset Fol-lowing T5 we cast the WSC dataset to a free-form textgeneration task where the model generates the referentto the highlighted span instead predicting if the sup-plied entity is the correct referent of the highlightedspan Thus we only use training data where the sup-plied referent is correct making our training label dis-tribution focused entirely on True

Split entailment not_entailment

Training 512 498Validation 527 473

Table 14 Label distribution for the RTE dataset

Split equivalent not_equivalent

Training 674 326Validation 684 316

Table 15 Label distribution for the MRPC dataset

Split duplicate not_duplicate

Training 369 631Validation 368 632

Table 16 Label distribution for the QQP dataset

model that we can reuse for prompt tuning across any number of downstream tasks.

Through LM adaptation, we hope to "quickly" transform T5 into a model more similar to GPT-3, which always outputs realistic text and is known to respond well to prompts as a "few-shot learner". It is not obvious how successful this late-stage transformation will be compared to pre-training from scratch, and it has not been investigated previously to our knowledge. As such, we experiment with various lengths of adaptation, up to 100K steps.

3 Results

Our frozen models are built on top of pre-trained T5 checkpoints of all sizes (Small, Base, Large, XL, XXL). We leverage the public T5 1.1 checkpoints, which include improvements over the original T5.¹

Our "default" configuration, plotted with a green '×' (×) throughout, uses an LM-adapted version of T5 trained for an additional 100K steps, initializes using class labels (see Section 3.2), and uses a prompt length of 100 tokens. While this is longer than the default 10-token prefix used by Li and Liang (2021), our method still uses fewer task-specific parameters, as we only tune the input layer, as opposed to overwriting activations in all network layers. See Figure 4 for a detailed comparison. We will also see shortly that even much shorter prompts are viable as model size increases.
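To make the mechanism concrete, here is a minimal NumPy sketch of this input-layer setup. It is illustrative only: the embedding matrix, shapes, and token ids are placeholders rather than the paper's Flax implementation, but it shows the soft prompt being prepended to the embedded input while everything else stays frozen.

import numpy as np

# Minimal sketch of the prompt-tuning input layer (NumPy, illustrative only).
# d_model=512 corresponds to T5-Small; the embedding matrix is a random stand-in.
d_model, vocab_size, prompt_len = 512, 32128, 100

rng = np.random.default_rng(0)
frozen_embedding = rng.normal(size=(vocab_size, d_model))    # frozen, shared across tasks
prompt = rng.uniform(-0.5, 0.5, size=(prompt_len, d_model))  # the ONLY trainable tensor

def encoder_inputs(input_ids: np.ndarray) -> np.ndarray:
    """Embed the input ids and prepend the soft prompt."""
    embedded = frozen_embedding[input_ids]                   # [seq_len, d_model]
    return np.concatenate([prompt, embedded], axis=0)        # [prompt_len + seq_len, d_model]

# During training, gradients flow only into `prompt`; the embedding table and the
# rest of the frozen T5 encoder-decoder are shared by every downstream task.
example_ids = np.array([37, 1820, 19, 207, 1])               # arbitrary placeholder ids
print(encoder_inputs(example_ids).shape)                     # (105, 512)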

We measure performance on the SuperGLUE benchmark (Wang et al., 2019a), a collection of eight challenging English language understanding tasks.² We report metrics on the development set associated with each dataset.

Each of our prompts trains on a single SuperGLUE task; there was no multi-task setup or mixing of training data across tasks. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.
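As an illustration of this text-to-text casting, below is a hedged sketch for a single BoolQ example. The "passage:"/"question:" rendering is our assumption of the open-source T5 preprocessor's format with the dataset prefix removed as described above, and the example dict is fabricated for illustration.

# Hedged sketch of the text-to-text casting for one BoolQ example, with the
# SuperGLUE task-name prefix omitted as described above. The field rendering is
# assumed to mirror the open-source T5 preprocessors.
def boolq_to_text(example):
    inputs = f'passage: {example["passage"]} question: {example["question"]}'
    target = "True" if example["label"] == 1 else "False"
    return inputs, target

inputs, target = boolq_to_text({
    "passage": "Prompt tuning learns soft prompts for a frozen model.",
    "question": "is the underlying model updated during prompt tuning",
    "label": 0,
})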

We train our prompts for 30,000 steps using T5's standard cross-entropy loss, with a constant learning rate of 0.3 and a batch size of 32. Checkpoints are selected via early stopping on the development set, where the stopping metric is the default metric for the dataset, or the average of metrics for datasets evaluated with multiple metrics. All experiments were run in JAX (Bradbury et al., 2018) using the Adafactor optimizer (Shazeer and Stern, 2018) with weight decay 1e−5, β2 decay 0.8, and parameter scaling off. The models were implemented in Flax (Heek et al., 2020). More details are available in Appendix A.

¹ These improvements are (1) the removal of all supervised data from pre-training, (2) adjustments to hyperparameters d_model and d_ff, and (3) the use of GeGLU (Shazeer, 2020) over ReLU (Nair and Hinton, 2010) activations.

² The tasks are BoolQ (Clark et al., 2019), CB (De Marneff et al., 2019), COPA (Roemmele et al., 2011), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Pilehvar and Camacho-Collados, 2018), and WSC (Levesque et al., 2012).
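A sketch of how these optimizer settings might be expressed with Optax is shown below; the keyword names are assumed to match Optax's adafactor transformation, and this is not the paper's training code.

import optax  # assumption: Optax's adafactor exposes these keyword arguments

# Constant learning rate 0.3, weight decay 1e-5, beta2 decay 0.8, parameter scaling
# off, as described above; batch size 32 and the 30,000-step budget live in the
# surrounding training loop, which is omitted here.
optimizer = optax.adafactor(
    learning_rate=0.3,
    decay_rate=0.8,                     # beta2 decay
    multiply_by_parameter_scale=False,  # "parameter scaling off"
    weight_decay_rate=1e-5,
)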

3.1 Closing the Gap

To compare our method with standard model tuning, we tune the public T5 1.1 checkpoints on SuperGLUE using the default hyperparameters specified in the T5 library (learning rate 0.001, and Adafactor optimizer with pre-training parameter states restored). We consider two baselines: (1) "Model Tuning": For an apples-to-apples comparison, we tune on each task separately, as in our prompt tuning setup.³ (2) "Model Tuning (Multi-task)": We use T5's multi-task tuning setup to achieve a more competitive baseline.⁴ In this case, a single model is tuned on all tasks jointly, with a text prefix indicating the task name.

In Figure 1 (p. 1), we see that prompt tuning becomes more competitive with model tuning as scale increases. At the XXL size (11 billion parameters), prompt tuning matches even the stronger multi-task model tuning baseline, despite having over 20,000 times fewer task-specific parameters.

To compare with prompt design, we include GPT-3 few-shot performance on the SuperGLUE dev split, as reported by Brown et al. (2020).⁵

Figure 1 shows that prompt tuning beats GPT-3 prompt design by a large margin, with prompt-tuned T5-Small matching GPT-3 XL (over 16 times larger), and prompt-tuned T5-Large beating GPT-3 175B (over 220 times larger).

³ To improve this baseline, we performed a sweep over the batch size hyperparameter and selected 2¹⁶ tokens per batch.

⁴ The T5 SuperGLUE submission used a more complex setup, first mixing multi-task supervised data into pre-training, and then performing single-task fine-tuning. Since we use T5 1.1 throughout, this setup is unavailable, as the pre-training phase is fully self-supervised. We follow Raffel et al. (2020) in using 2²⁰ tokens per batch and including DPR data in the multi-task mixture, which is known to boost WSC task performance (Kocijan et al., 2019).

⁵ We also experimented with using GPT-3's manual text prompts directly with our LM-adapted T5 checkpoints. However, performance was far below GPT-3 for comparable model sizes. This may be due to differences in pre-training data and model architecture, as well as T5's shorter sequence length.

[Figure 3: four panels plotting SuperGLUE score against model parameters (10⁸–10¹⁰): (a) Prompt length {1, 5, 20, 100, 150}; (b) Prompt initialization {Random Uniform, Sampled Vocab, Class Label}; (c) Pre-training method {Span Corruption, Span Corruption + Sentinel, LM Adaptation (100K)}; (d) LM adaptation steps {0K, 10K, 50K, 100K}.]

Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and stddev across 3 runs). In our "default" (×) configuration, quality improves stably with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: Increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: Random uniform initialization lags behind more "advanced" initializations using sampled vocabulary or class label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: Longer adaptation generally gives larger gains, but XXL is robust to even short adaptation.

3.2 Ablation Study

Prompt Length: We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that, for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.⁶

Prompt Initialization: We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values. For random initialization, we sample uniformly from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most "common" tokens in T5's SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For "class label" initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens. In this case we fall back to our sampled vocab strategy to fill in the prompt.⁷

⁶ Going past 100 tokens appears mildly detrimental for larger models. A similar pattern of diminishing performance past a certain prefix length is observed by Li and Liang (2021).
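A small NumPy sketch of the three strategies follows; the shapes, the embedding matrix, and the example token ids are placeholders rather than the released checkpoints.

import numpy as np

# Sketch of the three prompt-initialization strategies described above.
rng = np.random.default_rng(0)
d_model, prompt_len = 512, 100
embedding = rng.normal(size=(32128, d_model))      # stand-in for frozen vocabulary embeddings

def init_random_uniform():
    return rng.uniform(-0.5, 0.5, size=(prompt_len, d_model))

def init_sampled_vocab(top_k=5000):
    # The vocabulary is assumed to be ordered by likelihood in the pre-training corpus,
    # so sampling ids < top_k picks "common" tokens.
    ids = rng.choice(top_k, size=prompt_len)
    return embedding[ids].copy()

def init_class_label(label_token_ids):
    # One prompt position per class; multi-token labels are averaged.
    # Remaining positions fall back to the sampled-vocab strategy.
    prompt = init_sampled_vocab()
    for i, token_ids in enumerate(label_token_ids[:prompt_len]):
        prompt[i] = embedding[token_ids].mean(axis=0)
    return prompt

# e.g. a binary task whose labels tokenize to these (purely illustrative) id lists.
prompt = init_class_label([[4273], [6136, 15]])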

Figure 3(b) shows our ablation of initialization strategy across model sizes, where we find that the class-based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.

⁷ T5's handling of the ReCoRD and WSC tasks requires the model to generate short free-form text. In these cases, we initialize the prompts with words related to the task: commonsense, reasoning, reading, and comprehension for ReCoRD; and commonsense, pronoun, and resolution for WSC.

With "class label" initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.

Pre-training Objective: In Figures 3(c) and 3(d), we see that the pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5's default "span corruption" objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the "workaround" of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.

Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the "transition" from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations. At this size, the gains from adaptation are quite modest.

In the non-optimal "span corruption" setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0. The two most common error modes are copying sub-spans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the "span corruption" objective can be unreliable, with only 2 out of 5 models working well, whereas the LM-adapted versions work reliably across all model sizes.

We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps for all model sizes.⁸

[Figure 4: task parameters (0.001%–100%, 10³–10⁹ absolute) plotted against model parameters (10⁸–10¹⁰) for Model Tuning, Prefix Tuning (Train), Prefix Tuning (Infer), WARP, Prompt Tuning, and Prompt Design.]

Figure 4: Parameter usage of various adaptation techniques, fixing architecture to T5 1.1 and prompt/prefix length to 1–100 tokens (bands show mean and stddev). Model Tuning: All parameters are task-specific. Prefix Tuning: Activations are tuned in the prefix of each layer, requiring 0.1–1% task-specific parameters for inference, but more are used for training. WARP: Task parameters are reduced to under 0.1% by only tuning input and output layers. Prompt Tuning: Only prompt embeddings are tuned, reaching under 0.01% for most model sizes. Prompt Design: Only a sequence of prompt IDs (500–2,000 tokens) is required.

4 Comparison to Similar Approaches

In this section, we review recent work on learning continuous prompts and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.⁹
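As a worked example of this accounting, the arithmetic below uses the Table 4 figures for T5-XXL with our default 100-token prompt; it is a back-of-the-envelope check, not a measurement.

# Worked example behind the parameter-efficiency claims, using Table 4's T5-XXL row.
d_model, prompt_len = 4096, 100
prompt_params = prompt_len * d_model           # 409,600 trainable parameters
total_params = 11_135_741_952                  # prompt + frozen T5-XXL (Table 4)
print(prompt_params)                           # 409600
print(100 * prompt_params / total_params)      # ~0.0037% of parameters are task-specific
print(total_params // prompt_params)           # ~27,000x fewer than tuning every weight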

Li and Liang (2021) propose "prefix tuning": learning a sequence of prefixes that are prepended at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.

⁸ https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

⁹ To compare with prompt design, we count each token ID in the prompt as a parameter, and assume a prompt of between 500–2,000 tokens to match the GPT-3 setting. While this technique is by far the most parameter efficient, it comes at the cost of task quality.

Hambardzumyan et al. (2021) propose "WARP", where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.

Liu et al. (2021) propose "P-tuning", where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning; that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.¹⁰

Qin and Eisner (2021) use "soft words" to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned in relation to the input based on hand-designed prompt prototypes, and a learned ∆ℓ_i parameter is included for each layer, so parameter cost scales with model depth.

¹⁰ As another difference, P-tuning requires the addition of "anchor" tokens in the input (e.g., a question mark following the hypothesis in the RTE task) to achieve strong performance, while prompt tuning leaves inputs untouched.

Dataset      Domain   Model        Prompt       ∆

SQuAD        Wiki     94.9 ±0.2    94.8 ±0.1    −0.1
TextbookQA   Book     54.3 ±3.7    66.8 ±2.9    +12.5
BioASQ       Bio      77.9 ±0.4    79.1 ±0.3    +1.2
RACE         Exam     59.8 ±0.6    60.7 ±0.5    +0.9
RE           Wiki     88.4 ±0.1    88.8 ±0.2    +0.4
DuoRC        Movie    68.9 ±0.7    67.7 ±1.1    −1.2
DROP         Wiki     68.9 ±1.7    67.1 ±1.9    −1.8

Table 1: F1 mean and stddev for models trained on SQuAD and evaluated on out-of-domain datasets from the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.

Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.

More generally, work on task prompts is closely aligned with work on "adapters" (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.

5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

Train   Eval    Tuning   Accuracy     F1

QQP     MRPC    Model    73.1 ±0.9    81.2 ±2.1
                Prompt   76.3 ±0.1    84.3 ±0.3
MRPC    QQP     Model    74.9 ±1.3    70.9 ±1.2
                Prompt   75.4 ±0.8    69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on "in-domain" datasets perform when evaluated on "out-of-domain" datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.¹¹

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g., to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are "duplicates". The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the "in-domain" task, select checkpoints using in-domain validation, and evaluate zero-shot on the "out-of-domain" task.

Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 points accuracy and +3.1 points F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

¹¹ We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org/), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset           Metric    Average        Best            Ensemble

BoolQ             acc       91.1           91.3            91.7
CB                acc/F1    99.3 / 99.0    100.0 / 100.0   100.0 / 100.0
COPA              acc       98.8           100.0           100.0
MultiRC           EM/F1a    65.7 / 88.7    66.3 / 89.0     67.1 / 89.4
ReCoRD            EM/F1     92.7 / 93.4    92.9 / 93.5     93.2 / 93.9
RTE               acc       92.6           93.5            93.5
WiC               acc       76.2           76.6            77.4
WSC               acc       95.8           96.2            96.2
SuperGLUE (dev)             90.5           91.0            91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g., 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate "models" for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
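A sketch of this batched ensemble inference is below; it uses NumPy, and frozen_model is a hypothetical stand-in that maps a batch of embedded inputs to one decoded string per row, so the function is illustrative rather than the paper's implementation.

import numpy as np
from collections import Counter

def ensemble_predict(frozen_model, prompts, input_embeddings):
    """prompts: [N, prompt_len, d_model]; input_embeddings: [seq_len, d_model]."""
    n = prompts.shape[0]
    # Replicate the single example across the batch and vary only the prompt.
    batch = np.concatenate(
        [prompts, np.repeat(input_embeddings[None, :, :], n, axis=0)], axis=1)
    predictions = frozen_model(batch)              # one forward pass, batch size N
    # Simple majority vote over the N per-prompt predictions.
    return Counter(predictions).most_common(1)[0][0]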

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats or matches the best individual prompt.

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
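A minimal version of this probe is shown below, with random placeholders standing in for the frozen vocabulary embeddings and a learned prompt.

import numpy as np

rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(32128, 512))          # stand-in for frozen vocabulary embeddings
prompt = rng.normal(size=(100, 512))               # stand-in for a learned prompt

def top_k_neighbors(prompt_token, k=5):
    # Cosine similarity between one prompt token and every vocabulary embedding.
    sims = (vocab_emb @ prompt_token) / (
        np.linalg.norm(vocab_emb, axis=1) * np.linalg.norm(prompt_token) + 1e-9)
    return np.argsort(-sims)[:k]                   # ids of the k most similar vocabulary tokens

neighbors = [top_k_neighbors(tok) for tok in prompt]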

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as {Technology, technology, Technologies, technological, technologies}, as well as more diverse but still strongly related clusters such as {entirely, completely, totally, altogether, 100%}. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the "class-label" strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the "Random Uniform" or "Sampled Vocab" methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to the output classes makes this easier and more centralized.

When examining longer prompts (e.g., size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences, show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, and approximately 20% of the questions are in the "Nature/Science" category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g., "scientific").

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6–4, Venice.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA. Omnipress.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.

Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.

A Reproducibility

A.1 Experimental Settings

We evaluate each GLUE and SuperGLUE dataset using the metric specified in the benchmark. We reuse the evaluation code from the publicly available T5 open-source release to compute metrics.¹²

For the SQuAD and MRQA datasets, we evaluate using F1, one of the metrics used by the SQuAD benchmark, where partial answer spans are considered. Again, we use the T5 open-source release for metric calculation.¹³ All of our models use T5 1.1 as the base frozen model; additional details and pre-trained checkpoints can be found on GitHub.¹⁴ ¹⁵

All prompts for T5 Small and Base models were trained on 4 TPU v2 chips, while prompts for larger models were trained on 16 TPU v3 chips.

Parameter counts for each prompt can be found in Table 4. Average runtimes until convergence can be found in Table 5.

A.2 Hyperparameter Search

This work used 77 hyperparameter search trials (40 for prompt tuning and 37 for single-task model tuning), and 3 training runs (with validation evaluation) for each baseline configuration and ablation setting, for a total of 195 runs for our main result and ablations. There were an additional 18 runs for the domain shift experiments and 24 extra runs to create the ensemble. Hyperparameter bounds can be found in Table 6. Hyperparameter tuning was done via manual tuning, and settings were selected based on the SuperGLUE score. All experiments in this work, outside of the hyperparameter being ablated, use our default configuration of 100K steps of LM Adaptation, a prompt length of 100, and "class-label" initialization.

All graphs of our experimental results plot the mean and standard deviation over 3 runs, as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation is hidden behind the line itself, such as "Model Tuning (Multi-task)" in Figure 1 and the Base, Large, and XL prompts trained on the "Span Corruption" pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The "Prefix Tuning (Train)" curve appears to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.

¹² https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py

¹³ https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151

¹⁴ https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511

¹⁵ https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

A.3 Datasets

All datasets used are in English. For the GLUE¹⁶ ¹⁷ and SuperGLUE¹⁸ datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,¹⁹ we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.²⁰ Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).
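For instance, the pinned versions above can be requested directly from TensorFlow Datasets; the name/config:version strings below are assumed to follow the standard TFDS catalog naming, and this is a usage sketch rather than the paper's data pipeline.

import tensorflow_datasets as tfds

# Load the splits with the versions pinned above (assumed TFDS naming).
boolq_train = tfds.load("super_glue/boolq:1.0.2", split="train")
mrpc_val = tfds.load("glue/mrpc:1.0.0", split="validation")
squad_val = tfds.load("squad/v1.1:3.0.0", split="validation")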

The question answering datasets are extractive datasets with a variety of answers, so there isn't a label distribution to report. Similarly, the ReCoRD dataset is a multiple choice dataset, where the model must predict the masked out entity from a list of possible entities. Due to this formulation, there isn't a meaningful label distribution.

We followed the open-source T5 preprocessing procedure²¹ for each dataset, except that we omit the dataset prefix denoting which SuperGLUE dataset an example belongs to.

¹⁶ https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

¹⁷ https://www.tensorflow.org/datasets/catalog/glue#glueqqp

¹⁸ https://www.tensorflow.org/datasets/catalog/super_glue

¹⁹ https://www.tensorflow.org/datasets/catalog/squad#squadv11_default_config

²⁰ https://github.com/mrqa/MRQA-Shared-Task-2019#out-of-domain

²¹ https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

T5 Size   Prompt Length   Trainable Parameters   Total Parameters   Percent Trainable

Small     1               512                    76,961,664         0.00067
          5               2,560                  76,963,712         0.00333
          20              10,240                 76,971,392         0.01330
          50              25,600                 76,986,752         0.03325
          100             51,200                 77,012,352         0.06648
          150             76,800                 77,037,952         0.09969

Base      1               768                    247,578,624        0.00031
          5               3,840                  247,581,696        0.00155
          20              15,360                 247,593,216        0.00620
          50              38,400                 247,616,256        0.01551
          100             76,800                 247,654,656        0.03101
          150             115,200                247,693,056        0.04651

Large     1               1,024                  783,151,104        0.00013
          5               5,120                  783,155,200        0.00065
          20              20,480                 783,170,560        0.00262
          50              51,200                 783,201,280        0.00654
          100             102,400                783,252,480        0.01307
          150             153,600                783,303,680        0.01961

XL        1               2,048                  2,849,759,232      0.00007
          5               10,240                 2,849,767,424      0.00036
          20              40,960                 2,849,798,144      0.00143
          50              102,400                2,849,859,584      0.00359
          100             204,800                2,849,961,984      0.00718
          150             307,200                2,850,064,384      0.01078

XXL       1               4,096                  11,135,336,448     0.00004
          5               20,480                 11,135,352,832     0.00018
          20              81,920                 11,135,414,272     0.00074
          50              204,800                11,135,537,152     0.00184
          100             409,600                11,135,741,952     0.00368
          150             614,400                11,135,946,752     0.00552

Table 4: Number of parameters used for various prompt lengths and T5 model sizes. Trainable parameters is the number of parameters in the prompt itself, while total parameters includes the prompt plus the original T5 parameters. The T5 parameters are frozen and shared across all tasks, and include the SentencePiece lookup table parameters. The final column is the percentage of total parameters that are trainable.
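The rows above follow directly from the prompt shape: trainable parameters are the prompt length times the embedding dimension, and the total adds the frozen T5 1.1 parameter count. The sketch below derives the frozen counts from Table 4 itself (total minus trainable for the length-1 rows), so it is a consistency check rather than an independent source of numbers.

# Reproducing Table 4 rows: trainable = prompt_len * d_model; total adds the frozen
# T5 1.1 parameters. Frozen counts are derived from Table 4 (total minus trainable).
d_model = {"Small": 512, "Base": 768, "Large": 1024, "XL": 2048, "XXL": 4096}
frozen = {"Small": 76_961_152, "Base": 247_577_856, "Large": 783_150_080,
          "XL": 2_849_757_184, "XXL": 11_135_332_352}

def table4_row(size, prompt_len):
    trainable = prompt_len * d_model[size]
    total = frozen[size] + trainable
    return trainable, total, 100 * trainable / total

print(table4_row("XXL", 100))   # (409600, 11135741952, ~0.00368)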

For the SQuAD and MRQA datasets, we used the T5 SQuAD preprocessing code.²² By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.
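A simplified sketch of this WSC recast is below. The field names are assumed to follow the TFDS super_glue/wsc schema (text, span1_text for the candidate referent, span2_text for the pronoun, label), and the exact span highlighting used by the T5 preprocessor is simplified away here.

# Simplified sketch of the WSC recast: keep only examples whose supplied referent is
# correct (label == 1) and train the model to generate the referent directly.
def wsc_to_text(example):
    if example["label"] != 1:          # incorrect referents cannot be used for training
        return None
    inputs = f'text: {example["text"]} pronoun: {example["span2_text"]}'
    target = example["span1_text"]     # the model generates the referent itself
    return inputs, target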

No new data was collected for this work.

²² https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length   T5 Size   Time

1               Large     3:17 ±2:10
                XL        3:37 ±2:11
                XXL       21:23 ±1:54
20              XL        49:08 ±18:53
                XXL       53:03 ±16:25
50              Small     9:05 ±5:07
                Base      55:01 ±27:48
                Large     1:14:16 ±13:12
                XL        2:30:10 ±25:40
                XXL       3:13:13 ±23:08
100             Small     16:25 ±1:15
                Base      29:57 ±0:18
                Large     1:23:36 ±10:21
                XL        3:35:00 ±54:42
                XXL       3:51:15 ±45:53

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.

Hyperparameter      Search Space

Learning Rate       0.001–0.5
Parameter Scaling   True, False
Batch Size          32, 64, 126, 256, 512
Number of Steps     10,000, 20,000, 30,000
Warmup Steps        off, 2,000, 3,000
Decay Factor        off, 0.1, 0.5
Steps per Decay     off, 4,000, 6,000, 8,000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every "Steps per Decay" steps.
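A sketch of the schedule family this table describes follows: linear warmup from zero to the base rate, then an optional stepwise decay. The exact interaction of the warmup and decay counters is our assumption; "off" is represented by zero.

# Sketch of the learning-rate schedule family searched in Table 6. "off" settings are
# passed as 0. The decay counter here starts from step 0 for simplicity (an assumption).
def learning_rate(step, base_lr=0.3, warmup_steps=2000, decay_factor=0.5,
                  steps_per_decay=4000):
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps          # linear warmup from zero
    if decay_factor and steps_per_decay:
        return base_lr * decay_factor ** (step // steps_per_decay)
    return base_lr                                    # constant schedule otherwise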

Dataset      Training   Validation   Testing

BoolQ        9,427      3,270        3,245
CB           250        56           250
COPA         400        100          500
MultiRC      27,243     4,848        9,693
ReCoRD       100,730    10,000       10,000
RTE          2,490      277          3,000
WiC          5,428      638          1,400
WSC          259*       104          146
MRPC         3,668      408          1,725
QQP          363,849    40,430       390,965
SQuAD        87,599     10,570       -
TextbookQA   -          1,504        -
RACE         -          1,503        -
BioASQ       -          1,501        -
RE           -          674          -
DuoRC        -          2,948        -
DROP         -          1,503        -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This means our training dataset is smaller than the normal WSC training dataset, which has 554 examples.

Split        False    True

Training     37.7%    62.3%
Validation   37.8%    62.2%

Table 8: Label distribution for the BoolQ dataset.

Split        contradiction   entailment   neutral

Training     47.6%           46.0%        6.4%
Validation   50.0%           41.1%        8.9%

Table 9: Label distribution for the CB dataset.

Split        choice1   choice2

Training     48.8%     51.2%
Validation   55.0%     45.0%

Table 10: Label distribution for the COPA dataset.

Split        False    True

Training     55.9%    44.1%
Validation   57.2%    42.8%

Table 11: Label distribution for the MultiRC dataset.

Split        False    True

Training     50.0%    50.0%
Validation   50.0%    50.0%

Table 12: Label distribution for the WiC dataset.

Split        False    True

Training     0.0%     100.0%
Validation   63.5%    36.5%

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task, where the model generates the referent to the highlighted span instead of predicting if the supplied entity is the correct referent of the highlighted span. Thus, we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split        entailment   not_entailment

Training     51.2%        49.8%
Validation   52.7%        47.3%

Table 14: Label distribution for the RTE dataset.

Split        equivalent   not_equivalent

Training     67.4%        32.6%
Validation   68.4%        31.6%

Table 15: Label distribution for the MRPC dataset.

Split        duplicate   not_duplicate

Training     36.9%       63.1%
Validation   36.8%       63.2%

Table 16: Label distribution for the QQP dataset.

[Figure 3: four panels (a)–(d) plotting SuperGLUE score against model parameters (10^8–10^10). Legends: (a) prompt lengths 1 / 5 / 20 / 100 / 150; (b) Random Uniform / Sampled Vocab / Class Label; (c) Span Corruption / Span Corruption + Sentinel / LM Adaptation (100K); (d) 0K / 10K / 50K / 100K LM adaptation steps.]

Figure 3: Ablations of various hyperparameters on prompt tuning performance (mean and stddev across 3 runs). In our "default" configuration, quality improves stably with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. (a) Prompt length: Increasing to 20+ tokens generally confers a large boost, but XXL performs well even with single-token prompts. (b) Prompt initialization: Random uniform initialization lags behind more "advanced" initializations using sampled vocabulary or class label embeddings, but the difference vanishes at XXL size. (c) Pre-training objective: LM adaptation outperforms span corruption, even when a sentinel is added to downstream task targets, but XXL works well with any method. (d) LM adaptation: Longer adaptation generally gives larger gains, but XXL is robust to even short adaptation.

3.2 Ablation Study

Prompt Length: We train prompts for each model size while varying the prompt length in {1, 5, 20, 100, 150} and fixing other settings to our default configuration. Figure 3(a) shows that, for most model sizes, increasing prompt length beyond a single token is critical to achieve good performance. Notably, the XXL model still gives strong results with a single-token prompt, suggesting that the larger the model, the less conditioning signal is needed to achieve a target behavior. Across all models, increasing beyond 20 tokens only yields marginal gains.6

6 Going past 100 tokens appears mildly detrimental for larger models. A similar pattern of diminishing performance past a certain prefix length is observed by Li and Liang (2021).

Prompt Initialization: We ablate the effect of prompt initialization by training models at all sizes while fixing other hyperparameters to their default values. For random initialization, we sample uniformly from the range [−0.5, 0.5]. When initializing from sampled vocabulary, we restrict to the 5,000 most "common" tokens in T5's SentencePiece vocabulary (Kudo and Richardson, 2018), which is ordered by likelihood in the pre-training corpus. For "class label" initialization, we take the embeddings for the string representations of each class in the downstream task and use them to initialize one of the tokens in the prompt. When a class label is multi-token, we average the token embeddings. At longer prompt lengths, we often run out of class labels before we have initialized all of the prompt tokens. In this case we fall back to our sampled vocab strategy to fill in the prompt.7
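The following is a minimal sketch of these three strategies. It is our own illustration, not the released implementation: the embedding table is a random stand-in for T5's frozen vocabulary embeddings, and all variable names are ours.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, prompt_len = 512, 100
    embedding_table = rng.normal(size=(32_000, d_model))  # stand-in for the frozen vocabulary embeddings

    def init_random_uniform():
        # Sample every prompt position uniformly from [-0.5, 0.5].
        return rng.uniform(-0.5, 0.5, size=(prompt_len, d_model))

    def init_sampled_vocab(num_common=5_000):
        # Copy embeddings of tokens drawn from the 5,000 most "common" entries
        # (the table is assumed to be ordered by likelihood in the pre-training corpus).
        ids = rng.choice(num_common, size=prompt_len, replace=False)
        return embedding_table[ids].copy()

    def init_class_label(label_token_ids):
        # One prompt position per class label; multi-token labels are averaged.
        # Leftover positions fall back to the sampled-vocab strategy.
        prompt = init_sampled_vocab()
        for i, ids in enumerate(label_token_ids[:prompt_len]):
            prompt[i] = embedding_table[ids].mean(axis=0)
        return prompt

    prompt = init_class_label([[31, 7], [1176]])  # e.g. a two-class task with toy token ids
    print(prompt.shape)  # (100, 512)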

Figure 3(b) shows our ablation of initialization strategy across model sizes, where we find that the class based initialization performs best. At smaller model sizes, there are large gaps between the different initializations, but once the model is scaled to XXL size, those differences disappear.

With "class label" initialization, we observe that the class labels typically persist in the learned prompts, such that the nearest token embeddings (in cosine distance) match the tokens used for initialization. Beyond this, we did not find our learned prompts to be interpretable, similar to those of Shin et al. (2020). See Section 7 for details.

7 T5's handling of the ReCoRD and WSC tasks requires the model to generate short, free-form text. In these cases, we initialize the prompts with words related to the task: commonsense, reasoning, reading, and comprehension for ReCoRD; and commonsense, pronoun, and resolution for WSC.

Pre-training Objective: In Figures 3(c) and 3(d), we see pre-training objective has a clear effect on prompt tuning quality. As hypothesized in Section 2.2, T5's default "span corruption" objective is not well-suited for training frozen models to be later conditioned by prompts. Intuitively, models pre-trained to read and write sentinel tokens are hard to apply directly to tasks of reading and writing text without sentinels. As seen in Figure 3(c), even the "workaround" of adding a sentinel to the downstream targets has little benefit. While LM adaptation adds value across all model sizes, we note our largest XXL model is the most forgiving and gives strong results even with span corruption.

Given the benefit of LM adaptation, we also explore how long of an adaptation is helpful. Figure 3(d) shows that longer adaptation provides additional gains, up to 100K steps. This suggests that the "transition" from span corruption to a language modeling objective is not a trivial change, and making an effective switch takes an investment of training resources (10% of the steps of the original T5 pre-training). At the same time, as in our other ablations, we observe that the XXL model is robust to even non-ideal configurations. At this size, the gains from adaptation are quite modest.

In the non-optimal "span corruption" setting, we observe instability across model sizes, with the Small model outperforming the larger Base, Large, and XL models. On inspection, we find that for many tasks, these mid-sized models never learn to output a legal class label and thus score 0%. The two most common error modes are copying sub-spans from the input and predicting an empty string. Furthermore, this poor performance is not due to random variance in prompt tuning, as we observe low variance across 3 runs for each size. These results indicate that using models pre-trained with the "span corruption" objective can be unreliable, with only 2 out of 5 models working well, whereas the LM-adapted versions work reliably across all model sizes.

[Figure 4: log-log plot of task-specific parameters against model parameters (10^8–10^10) for Model Tuning, Prefix Tuning (Train), Prefix Tuning (Infer), WARP, Prompt Tuning, and Prompt Design; the right axis shows task parameters as a percentage, from 0.001% to 100%.]

Figure 4: Parameter usage of various adaptation techniques, fixing architecture to T5 1.1 and prompt/prefix length to 1–100 tokens (bands show mean and stddev). Model Tuning: All parameters are task-specific. Prefix Tuning: Activations are tuned in the prefix of each layer, requiring 0.1–1% task-specific parameters for inference, but more are used for training. WARP: Task parameters are reduced to under 0.1% by only tuning input and output layers. Prompt Tuning: Only prompt embeddings are tuned, reaching under 0.01% for most model sizes. Prompt Design: Only a sequence of prompt IDs (500–2,000 tokens) is required.

We have released T5 1.1 checkpoints adapted using the LM objective for 100K steps, for all model sizes.8

4 Comparison to Similar Approaches

In this section, we review recent work on learning continuous prompts and draw comparisons with our method. One important axis of comparison is the number of task-specific parameters each method requires, as shown in Figure 4. Among methods with learnable parameters, prompt tuning is the most parameter efficient, requiring less than 0.01% task-specific parameters for models over a billion parameters.9
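As a rough illustration of this parameter accounting, the short sketch below (ours, not the paper's code) computes the task-specific parameter fraction when only a prompt is tuned; the XXL embedding width and total size used in the example are taken from Table 4 in the appendix.

    def prompt_tuning_fraction(prompt_len, d_model, total_params):
        # Only the prompt embeddings are trainable: one d_model-sized vector per prompt token.
        trainable = prompt_len * d_model
        return 100.0 * trainable / total_params  # percentage of all parameters

    # T5 1.1 XXL: d_model = 4096, ~11.1B total parameters (see Table 4).
    print(prompt_tuning_fraction(100, 4096, 11_135_741_952))  # ~0.0037 (percent)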

8 https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

9 To compare with prompt design, we count each token ID in the prompt as a parameter, and assume a prompt of between 500–2,000 tokens to match the GPT-3 setting. While this technique is by far the most parameter efficient, it comes at the cost of task quality.

Li and Liang (2021) propose "prefix tuning": learning a sequence of prefixes that are prepended at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.
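To make the contrast concrete, here is a minimal sketch (our own illustration, not the released implementation) of the prompt tuning input path: a single trainable prompt matrix is concatenated in front of the frozen embedding of each example, and nothing else in the model changes.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, prompt_len, vocab = 512, 20, 32_000
    frozen_embedding = rng.normal(size=(vocab, d_model))          # frozen model embedding table
    prompt = rng.uniform(-0.5, 0.5, size=(prompt_len, d_model))   # the only trainable weights

    def embed_with_prompt(token_ids):
        # Prepend the shared, learned prompt to the embedded input sequence.
        x = frozen_embedding[token_ids]             # [seq_len, d_model]
        return np.concatenate([prompt, x], axis=0)  # [prompt_len + seq_len, d_model]

    example = np.array([21, 8, 1303, 5])            # toy token ids
    print(embed_with_prompt(example).shape)         # (24, 512)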

Hambardzumyan et al. (2021) propose "WARP", where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.

Liu et al. (2021) propose "P-tuning", where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning; that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.10

Qin and Eisner (2021) use "soft words" to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned in relation to the input based on hand-designed prompt prototypes, and a learned ∆ℓ_i parameter is included for each layer, so parameter cost scales with model depth.

10 As another difference, P-tuning requires the addition of "anchor" tokens in the input (e.g., a question mark following the hypothesis in the RTE task) to achieve strong performance, while prompt tuning leaves inputs untouched.

Dataset      Domain   Model        Prompt       ∆

SQuAD        Wiki     94.9 ±0.2    94.8 ±0.1    −0.1
TextbookQA   Book     54.3 ±3.7    66.8 ±2.9    +12.5
BioASQ       Bio      77.9 ±0.4    79.1 ±0.3    +1.2
RACE         Exam     59.8 ±0.6    60.7 ±0.5    +0.9
RE           Wiki     88.4 ±0.1    88.8 ±0.2    +0.4
DuoRC        Movie    68.9 ±0.7    67.7 ±1.1    −1.2
DROP         Wiki     68.9 ±1.7    67.1 ±1.9    −1.8

Table 1: F1 mean and stddev for models trained on SQuAD and evaluated on out-of-domain datasets from the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.

Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.

More generally, work on task prompts is closely aligned with work on "adapters" (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.

5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

Train   Eval    Tuning   Accuracy     F1

QQP     MRPC    Model    73.1 ±0.9    81.2 ±2.1
                Prompt   76.3 ±0.1    84.3 ±0.3

MRPC    QQP     Model    74.9 ±1.3    70.9 ±1.2
                Prompt   75.4 ±0.8    69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on "in-domain" datasets perform when evaluated on "out-of-domain" datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.11

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g., to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are "duplicates". The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the "in-domain" task, select checkpoints using in-domain validation, and evaluate zero-shot on the "out-of-domain" task.

Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

11 We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org/), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset     Metric    Average          Best             Ensemble

BoolQ       acc       91.1             91.3             91.7
CB          acc/F1    99.3 / 99.0      100.0 / 100.0    100.0 / 100.0
COPA        acc       98.8             100.0            100.0
MultiRC     EM/F1a    65.7 / 88.7      66.3 / 89.0      67.1 / 89.4
ReCoRD      EM/F1     92.7 / 93.4      92.9 / 93.5      93.2 / 93.9
RTE         acc       92.6             93.5             93.5
WiC         acc       76.2             76.6             77.4
WSC         acc       95.8             96.2             96.2

SuperGLUE (dev)       90.5             91.0             91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g., 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate "models" for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
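A minimal sketch of this batched-inference trick, combined with simple majority voting over the N predictions (everything here, including the toy stand-in for the frozen model, is our own illustration rather than the serving code):

    from collections import Counter
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, prompt_len = 8, 4
    prompts = [rng.uniform(-0.5, 0.5, (prompt_len, d_model)) for _ in range(5)]  # 5 trained prompts

    def toy_frozen_model(batch):
        # Stand-in for the shared frozen model: maps each batch row to a class string.
        return ["True" if row.sum() > 0 else "False" for row in batch]

    def ensemble_predict(example_embedding):
        # One forward pass with batch size N: the example is replicated across the
        # batch and only the prepended prompt differs per row.
        batch = np.stack([np.concatenate([p, example_embedding]) for p in prompts])
        predictions = toy_frozen_model(batch.reshape(len(prompts), -1))
        return Counter(predictions).most_common(1)[0][0]  # majority vote

    example = rng.normal(size=(6, d_model))  # an already-embedded input sequence
    print(ensemble_predict(example))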

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats or matches the best individual prompt.

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
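The nearest-neighbor computation is simple enough to sketch directly; this is our own illustration, with a random embedding table standing in for the frozen model's vocabulary embeddings.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_emb = rng.normal(size=(32_000, 512))   # frozen vocabulary embeddings (stand-in)
    prompt = rng.normal(size=(100, 512))         # learned prompt token representations

    def top_k_neighbors(prompt_token, k=5):
        # Cosine similarity between one prompt token and every vocabulary embedding.
        sims = (vocab_emb @ prompt_token) / (
            np.linalg.norm(vocab_emb, axis=1) * np.linalg.norm(prompt_token))
        return np.argsort(-sims)[:k]             # ids of the k closest vocabulary items

    print(top_k_neighbors(prompt[0]))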

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as {Technology, technology, Technologies, technological, technologies}, as well as more diverse but still strongly related clusters such as {entirely, completely, totally, altogether, 100%}. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the "class label" strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the "Random Uniform" or "Sampled Vocab" methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to the output classes makes this easier and more centralized.

When examining longer prompts (e.g., size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences, show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, and approximately 20% of the questions are in the "Nature/Science" category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g., "scientific").

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6–4, Venice.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, page 807–814, Madison, WI, USA. Omnipress.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.

Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.

A Reproducibility

A.1 Experimental Settings

We evaluate each GLUE and SuperGLUE dataset using the metric specified in the benchmark. We reuse the evaluation code from the publicly available T5 open-source release to compute metrics.12

For the SQuAD and MRQA datasets, we evaluate using F1, one of the metrics used by the SQuAD benchmark, where partial answer spans are considered. Again, we use the T5 open-source release for metric calculation.13 All of our models use T5 1.1 as the base frozen model; additional details and pre-trained checkpoints can be found on GitHub.14,15
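For readers unfamiliar with span-overlap F1, the following is a simplified sketch of the token-level F1 between a predicted and a gold answer; the official SQuAD/T5 evaluation scripts additionally normalize text and take a maximum over multiple gold answers.

    from collections import Counter

    def token_f1(prediction, truth):
        # Token-level F1 gives partial credit for overlapping answer spans.
        pred_tokens, true_tokens = prediction.split(), truth.split()
        overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(true_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("the eiffel tower", "eiffel tower"))  # 0.8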

All prompts for T5 Small and Base models were trained on 4 TPU v2 chips, while prompts for larger models were trained on 16 TPU v3 chips.

Parameter counts for each prompt can be found in Table 4. Average runtimes until convergence can be found in Table 5.

A.2 Hyperparameter Search

This work used 77 hyperparameter search trials (40 for prompt tuning and 37 for single-task model tuning), and 3 training runs (with validation evaluation) for each baseline configuration and ablation setting, for a total of 195 runs for our main result and ablations. There were an additional 18 runs for the domain shift experiments and 24 extra runs to create the ensemble. Hyperparameter bounds can be found in Table 6. Hyperparameter tuning was done via manual tuning, and settings were selected based on the SuperGLUE score. All experiments in this work, outside of the hyperparameter being ablated, use our default configuration of 100K steps of LM Adaptation, a prompt length of 100, and "class-label" initialization.

All graphs of our experimental results plot the mean and standard deviation over 3 runs, as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation is hidden behind the line itself, such as "Model Tuning (Multi-task)" in Figure 1 and the Base, Large, and XL prompts trained on the "Span Corruption" pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The "Prefix Tuning (Train)" curve appears to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.

12 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py

13 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151

14 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511

15 https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

A.3 Datasets

All datasets used are in English. For the GLUE16,17 and SuperGLUE18 datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,19 we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.20 Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).

The question answering datasets are extractive datasets with a variety of answers, so there isn't a label distribution to report. Similarly, the ReCoRD dataset is a multiple choice dataset where the model must predict the masked out entity from a list of possible entities. Due to this formulation, there isn't a meaningful label distribution.

We followed the open-source T5 preprocessing procedure21 for each dataset, except that we omit the dataset prefix denoting which SuperGLUE dataset an example belongs to.

16 https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

17 https://www.tensorflow.org/datasets/catalog/glue#glueqqp

18 https://www.tensorflow.org/datasets/catalog/super_glue

19 https://www.tensorflow.org/datasets/catalog/squad#squadv11_default_config

20 https://github.com/mrqa/MRQA-Shared-Task-2019#out-of-domain

21 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

T5 Size   Prompt Length   Trainable Parameters   Total Parameters   Percent Trainable

Small     1               512                    76,961,664         0.00067
          5               2,560                  76,963,712         0.00333
          20              10,240                 76,971,392         0.01330
          50              25,600                 76,986,752         0.03325
          100             51,200                 77,012,352         0.06648
          150             76,800                 77,037,952         0.09969

Base      1               768                    247,578,624        0.00031
          5               3,840                  247,581,696        0.00155
          20              15,360                 247,593,216        0.00620
          50              38,400                 247,616,256        0.01551
          100             76,800                 247,654,656        0.03101
          150             115,200                247,693,056        0.04651

Large     1               1,024                  783,151,104        0.00013
          5               5,120                  783,155,200        0.00065
          20              20,480                 783,170,560        0.00262
          50              51,200                 783,201,280        0.00654
          100             102,400                783,252,480        0.01307
          150             153,600                783,303,680        0.01961

XL        1               2,048                  2,849,759,232      0.00007
          5               10,240                 2,849,767,424      0.00036
          20              40,960                 2,849,798,144      0.00143
          50              102,400                2,849,859,584      0.00359
          100             204,800                2,849,961,984      0.00718
          150             307,200                2,850,064,384      0.01078

XXL       1               4,096                  11,135,336,448     0.00004
          5               20,480                 11,135,352,832     0.00018
          20              81,920                 11,135,414,272     0.00074
          50              204,800                11,135,537,152     0.00184
          100             409,600                11,135,741,952     0.00368
          150             614,400                11,135,946,752     0.00552

Table 4: Number of parameters used for various prompt lengths and T5 model sizes. Trainable parameters is the number of parameters in the prompt itself, while total parameters includes the prompt plus the original T5 parameters. The T5 parameters are frozen and shared across all tasks, and include the SentencePiece lookup table parameters. The final column is the percentage of total parameters that are trainable.

For the SQuAD and MRQA datasets, we used the T5 SQuAD preprocessing code.22 By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.

No new data was collected for this work.

22 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length   T5 Size   Time

1               Large     3:17 ±02:10
                XL        3:37 ±02:11
                XXL       21:23 ±01:54

20              XL        49:08 ±18:53
                XXL       53:03 ±16:25

50              Small     09:05 ±05:07
                Base      55:01 ±27:48
                Large     1:14:16 ±13:12
                XL        2:30:10 ±25:40
                XXL       3:13:13 ±23:08

100             Small     16:25 ±01:15
                Base      29:57 ±00:18
                Large     1:23:36 ±10:21
                XL        3:35:00 ±54:42
                XXL       3:51:15 ±45:53

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.

Hyperparameter      Search Space

Learning Rate       0.001–0.5
Parameter Scaling   True, False
Batch Size          32, 64, 126, 256, 512
Number of Steps     10,000, 20,000, 30,000
Warmup Steps        off, 2,000, 3,000
Decay Factor        off, 0.1, 0.5
Steps per Decay     off, 4,000, 6,000, 8,000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every "Steps per Decay" steps.

Dataset      Training   Validation   Testing

BoolQ        9,427      3,270        3,245
CB           250        56           250
COPA         400        100          500
MultiRC      27,243     4,848        9,693
ReCoRD       100,730    10,000       10,000
RTE          2,490      277          3,000
WiC          5,428      638          1,400
WSC          259*       104          146
MRPC         3,668      408          1,725
QQP          363,849    40,430       390,965
SQuAD        87,599     10,570       -
TextbookQA   -          1,504        -
RACE         -          1,503        -
BioASQ       -          1,501        -
RE           -          674          -
DuoRC        -          2,948        -
DROP         -          1,503        -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This means our training dataset is smaller than the normal WSC training dataset, which has 554 examples.

Split        False    True

Training     37.7%    62.3%
Validation   37.8%    62.2%

Table 8: Label distribution for the BoolQ dataset.

Split        contradiction   entailment   neutral

Training     47.6%           46.0%        6.4%
Validation   50.0%           41.1%        8.9%

Table 9: Label distribution for the CB dataset.

Split        choice1   choice2

Training     48.8%     51.2%
Validation   55.0%     45.0%

Table 10: Label distribution for the COPA dataset.

Split        False    True

Training     55.9%    44.1%
Validation   57.2%    42.8%

Table 11: Label distribution for the MultiRC dataset.

Split        False    True

Training     50.0%    50.0%
Validation   50.0%    50.0%

Table 12: Label distribution for the WiC dataset.

Split        False    True

Training     0.0%     100.0%
Validation   63.5%    36.5%

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task, where the model generates the referent to the highlighted span instead of predicting if the supplied entity is the correct referent of the highlighted span. Thus we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split        entailment   not_entailment

Training     51.2%        49.8%
Validation   52.7%        47.3%

Table 14: Label distribution for the RTE dataset.

Split        equivalent   not_equivalent

Training     67.4%        32.6%
Validation   68.4%        31.6%

Table 15: Label distribution for the MRPC dataset.

Split        duplicate   not_duplicate

Training     36.9%       63.1%
Validation   36.8%       63.2%

Table 16: Label distribution for the QQP dataset.


Beyond task quality metrics we discussed theappeal of moving to frozen pre-trained models interms of storage and serving costs This moveenables both efficient multi-task serving as wellas efficient high-performing prompt ensemblingLooking forward we believe that factoring outtask-defining parameters as distinct from generallanguage-modeling parameters is an exciting stepthat opens up many avenues for new research

Acknowledgements

We thank Lucas Dixon Waleed Ammar SlavPetrov and Sebastian Ruder for comments on anearlier draft and the following people for helpfuldiscussion Colin Raffel Adam Roberts and NoamShazeer We thank Linting Xue for help with theLM adaptation training

ReferencesRoy Bar-Haim Ido Dagan Bill Dolan Lisa Ferro

Danilo Giampiccolo Bernardo Magnini and IdanSzpektor 2006 The second PASCAL recognisingtextual entailment challenge In Proceedings of thesecond PASCAL challenges workshop on recognis-ing textual entailment volume 6 pages 6ndash4 Venice

Luisa Bentivogli Peter Clark Ido Dagan and DaniloGiampiccolo 2009 The fifth PASCAL recognizingtextual entailment challenge In TAC

James Bradbury Roy Frostig Peter HawkinsMatthew James Johnson Chris Leary DougalMaclaurin George Necula Adam Paszke JakeVanderPlas Skye Wanderman-Milne and QiaoZhang 2018 JAX composable transformations ofPython+NumPy programs

Tom Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared D Kaplan Prafulla DhariwalArvind Neelakantan Pranav Shyam Girish SastryAmanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan RewonChild Aditya Ramesh Daniel Ziegler Jeffrey WuClemens Winter Chris Hesse Mark Chen EricSigler Mateusz Litwin Scott Gray Benjamin ChessJack Clark Christopher Berner Sam McCandlishAlec Radford Ilya Sutskever and Dario Amodei2020 Language models are few-shot learners InAdvances in Neural Information Processing Systemsvolume 33 pages 1877ndash1901 Curran AssociatesInc

Christopher Clark Kenton Lee Ming-Wei ChangTom Kwiatkowski Michael Collins and KristinaToutanova 2019 BoolQ Exploring the surprisingdifficulty of natural yesno questions In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics Human Language Technologies Volume 1(Long and Short Papers) pages 2924ndash2936 Min-neapolis Minnesota Association for ComputationalLinguistics

Ido Dagan Oren Glickman and Bernardo Magnini2005 The PASCAL recognising textual entailmentchallenge In Machine Learning Challenges Work-shop pages 177ndash190 Springer

Marie-Catherine De Marneff Mandy Simons and Ju-dith Tonhauser 2019 The CommitmentBank In-vestigating projection in naturally occurring dis-course Proceedings of Sinn und Bedeutung 23

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

William B Dolan and Chris Brockett 2005 Automati-cally constructing a corpus of sentential paraphrasesIn Proceedings of the Third International Workshopon Paraphrasing (IWP2005)

Dheeru Dua Yizhong Wang Pradeep Dasigi GabrielStanovsky Sameer Singh and Matt Gardner 2019DROP A reading comprehension benchmark requir-ing discrete reasoning over paragraphs In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics Human Language Technologies Volume 1

(Long and Short Papers) pages 2368ndash2378 Min-neapolis Minnesota Association for ComputationalLinguistics

Adam Fisch Alon Talmor Robin Jia Minjoon Seo Eu-nsol Choi and Danqi Chen 2019 MRQA 2019shared task Evaluating generalization in readingcomprehension In Proceedings of 2nd MachineReading for Reading Comprehension (MRQA) Work-shop at EMNLP

Danilo Giampiccolo Bernardo Magnini Ido Daganand Bill Dolan 2007 The third PASCAL recog-nizing textual entailment challenge In Proceedingsof the ACL-PASCAL workshop on textual entailmentand paraphrasing pages 1ndash9 Association for Com-putational Linguistics

Karen Hambardzumyan Hrant Khachatrian andJonathan May 2021 WARP Word-level Adversar-ial ReProgramming In Proceedings of the 59th An-nual Meeting of the Association for ComputationalLinguistics and the 11th International Joint Confer-ence on Natural Language Processing (Volume 1Long Papers) pages 4921ndash4933 Online Associa-tion for Computational Linguistics

L K Hansen and P Salamon 1990 Neural networkensembles IEEE Transactions on Pattern Analysisand Machine Intelligence 12(10)993ndash1001

Jonathan Heek Anselm Levskaya Avital Oliver Mar-vin Ritter Bertrand Rondepierre Andreas Steinerand Marc van Zee 2020 Flax A neural networklibrary and ecosystem for JAX

Neil Houlsby Andrei Giurgiu Stanislaw JastrzebskiBruna Morrone Quentin De Laroussilhe AndreaGesmundo Mona Attariyan and Sylvain Gelly2019 Parameter-efficient transfer learning for NLPIn Proceedings of the 36th International Conferenceon Machine Learning volume 97 of Proceedingsof Machine Learning Research pages 2790ndash2799PMLR

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Shankar Iyer Nikhil Dandekar and Kornel Csernai2017 First Quora dataset release Question pairs

Zhengbao Jiang Frank F Xu Jun Araki and GrahamNeubig 2020 How can we know what languagemodels know Transactions of the Association forComputational Linguistics 8423ndash438

A Kembhavi M Seo D Schwenk J Choi A Farhadiand H Hajishirzi 2017 Are you smarter than asixth grader textbook question answering for multi-modal machine comprehension In 2017 IEEE Con-ference on Computer Vision and Pattern Recognition(CVPR) pages 5376ndash5384

Daniel Khashabi Snigdha Chaturvedi Michael RothShyam Upadhyay and Dan Roth 2018 Looking be-yond the surface A challenge set for reading com-prehension over multiple sentences In Proceedingsof North American Chapter of the Association forComputational Linguistics (NAACL)

Vid Kocijan Ana-Maria Cretu Oana-Maria CamburuYordan Yordanov and Thomas Lukasiewicz 2019A surprisingly robust trick for the Winograd schemachallenge In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguisticspages 4837ndash4842 Florence Italy Association forComputational Linguistics

Taku Kudo and John Richardson 2018 SentencePieceA simple and language independent subword tok-enizer and detokenizer for neural text processing InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing SystemDemonstrations pages 66ndash71 Brussels BelgiumAssociation for Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Balaji Lakshminarayanan Alexander Pritzel andCharles Blundell 2017 Simple and scalable predic-tive uncertainty estimation using deep ensembles InAdvances in Neural Information Processing Systemsvolume 30 Curran Associates Inc

Hector Levesque Ernest Davis and Leora Morgen-stern 2012 The Winograd schema challenge InThirteenth International Conference on the Princi-ples of Knowledge Representation and Reasoning

Omer Levy Minjoon Seo Eunsol Choi and LukeZettlemoyer 2017 Zero-shot relation extraction viareading comprehension In Proceedings of the 21stConference on Computational Natural LanguageLearning (CoNLL 2017) pages 333ndash342 Vancou-ver Canada Association for Computational Linguis-tics

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Veselin Stoyanov and Luke Zettlemoyer2020 BART Denoising sequence-to-sequence pre-training for natural language generation translationand comprehension In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 7871ndash7880 Online Associationfor Computational Linguistics

Xiang Lisa Li and Percy Liang 2021 Prefix-tuningOptimizing continuous prompts for generation InProceedings of the 59th Annual Meeting of theAssociation for Computational Linguistics and the11th International Joint Conference on Natural Lan-guage Processing (Volume 1 Long Papers) pages

4582ndash4597 Online Association for ComputationalLinguistics

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021 GPTunderstands too CoRR abs210310385

Lajanugen Logeswaran Ann Lee Myle Ott HonglakLee MarcrsquoAurelio Ranzato and Arthur Szlam2020 Few-shot sequence learning with transform-ers CoRR abs201209543

Vinod Nair and Geoffrey E Hinton 2010 Rectifiedlinear units improve restricted Boltzmann machinesIn Proceedings of the 27th International Conferenceon International Conference on Machine LearningICMLrsquo10 page 807ndash814 Madison WI USA Om-nipress

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Jonas Pfeiffer Ivan Vulic Iryna Gurevych and Se-bastian Ruder 2020 MAD-X An Adapter-BasedFramework for Multi-Task Cross-Lingual TransferIn Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP)pages 7654ndash7673 Online Association for Computa-tional Linguistics

Mohammad Taher Pilehvar and Jose Camacho-Collados 2018 WiC 10000 example pairs forevaluating context-sensitive representations CoRRabs180809121

Guanghui Qin and Jason Eisner 2021 Learning howto ask Querying LMs with mixtures of soft promptsIn Proceedings of the 2021 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics Human Language Technologiespages 5203ndash5212 Online Association for Compu-tational Linguistics

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners OpenAIBlog

Colin Raffel Noam Shazeer Adam Roberts Kather-ine Lee Sharan Narang Michael Matena YanqiZhou Wei Li and Peter J Liu 2020 Exploringthe limits of transfer learning with a unified text-to-text transformer Journal of Machine Learning Re-search 21(140)1ndash67

Pranav Rajpurkar Jian Zhang Konstantin Lopyrev andPercy Liang 2016 SQuAD 100000+ questions formachine comprehension of text In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing pages 2383ndash2392 AustinTexas Association for Computational Linguistics

Sylvestre-Alvise Rebuffi Hakan Bilen and AndreaVedaldi 2017 Learning multiple visual domainswith residual adapters In Advances in Neural Infor-mation Processing Systems volume 30 Curran As-sociates Inc

Melissa Roemmele Cosmin Adrian Bejan and An-drew S Gordon 2011 Choice of plausible alterna-tives An evaluation of commonsense causal reason-ing In 2011 AAAI Spring Symposium Series

Amrita Saha Rahul Aralikatte Mitesh M Khapra andKarthik Sankaranarayanan 2018 DuoRC Towardscomplex language understanding with paraphrasedreading comprehension In Proceedings of the 56thAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1 Long Papers) pages1683ndash1693 Melbourne Australia Association forComputational Linguistics

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of the16th Conference of the European Chapter of the As-sociation for Computational Linguistics Main Vol-ume pages 255ndash269 Online Association for Com-putational Linguistics

Noam Shazeer 2020 GLU variants improve trans-former CoRR abs200205202

Noam Shazeer and Mitchell Stern 2018 AdafactorAdaptive learning rates with sublinear memory costIn Proceedings of the 35th International Conferenceon Machine Learning volume 80 of Proceedingsof Machine Learning Research pages 4596ndash4604PMLR

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutoPromptEliciting Knowledge from Language Models withAutomatically Generated Prompts In Proceed-ings of the 2020 Conference on Empirical Methodsin Natural Language Processing (EMNLP) pages4222ndash4235 Online Association for ComputationalLinguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008

Alex Wang Yada Pruksachatkun Nikita NangiaAmanpreet Singh Julian Michael Felix Hill OmerLevy and Samuel Bowman 2019a SuperGLUE Astickier benchmark for general-purpose language un-derstanding systems In Advances in Neural Infor-mation Processing Systems volume 32 Curran As-sociates Inc

Alex Wang Amanpreet Singh Julian Michael FelixHill Omer Levy and Samuel R Bowman 2019bGLUE A multi-task benchmark and analysis plat-form for natural language understanding In the Pro-ceedings of ICLR

Michael L Waskom 2021 seaborn statistical datavisualization Journal of Open Source Software6(60)3021

Sheng Zhang Xiaodong Liu Jingjing Liu JianfengGao Kevin Duh and Benjamin Van Durme 2018ReCoRD Bridging the gap between human and ma-chine commonsense reading comprehension CoRRabs181012885

A Reproducibility

A1 Experimental Settings

We evaluate each GLUE and SuperGLUE datasetusing the metric specified in the benchmark Wereuse the evaluation code from the publicly avail-able T5 open-source release to compute metrics12

For the SQuAD and MRQA datasets we evaluateusing F1 one of the metrics used by the SQuADbenchmark where partial answer spans are consid-ered Again we use the T5 open-source release formetric calculation13 All of our models use T5 11as the base frozen model additional details and pre-trained checkpoints can be found on GitHub1415

All prompts for T5 Small and Base models weretrained on 4 TPU v2 chips while prompts for largermodels were trained on 16 TPU v3 chips

Parameter counts for each prompt can be foundin Table 4 Average runtimes until convergence canbe found in Table 5

A2 Hyperparameter Search

This work used 77 hyperparameter search trials(40 for prompt tuning and 37 for single-task modeltuning) and 3 training runs (with validation evalu-ation) for each baseline configuration and ablationsetting for a total of 195 runs for our main resultand ablations There were an additional 18 runs forthe domain shift experiments and 24 extra runs tocreate the ensemble Hyperparameter bounds canbe found in Table 6 Hyperparameter tuning wasdone via manual tuning and settings were selectedbased on the SuperGLUE score All experiments inthis work outside of the hyperparameter being ab-lated use our default configuration of 100K stepsof LM Adapation a prompt length of 100 andldquoclass-labelrdquo initialization

All graphs of our experimental results plot themean and standard deviation over 3 runs as com-puted by Seaborn (Waskom 2021) Some settingshave such low variance that the standard deviation

12httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspy

13httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspyL151

14httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmasterreleased_checkpointsmdt511

15httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmainreleased_checkpointsmdlm-adapted-t511lm100k

is hidden behind the line itself such as ldquoModel Tun-ing (Multi-task)rdquo in Figure 1 and the Base Largeand XL prompts trained on the ldquoSpan Corruptionrdquopretraining objective in Figure 3(b) Figure 4 alsoshows mean and standard deviation for the num-ber of parameters each method uses as the promptlength varies from 1ndash100 The ldquoPrefix Tuning(Train)rdquo curves appears to have no standard de-viation because the parameter count is so stronglydominated by the cost of the reparameterizationparameters that the standard deviation bands areoccluded For our experiments on domain transferwe report mean and standard deviation over 3 runs

A3 Datasets

All datasets used are in English For the GLUE1617

and SuperGLUE18 datasets we used the trainingvalidation and test splits that ship with TensorFlowDatasets We used version 100 for GLUE and102 for SuperGLUE datasets For SQuAD19

we used v11300 from Tensorflow Datasetsand follow the provided training validation andtest splits For the out-of-domain datasets we usedthe development splits distributed as part of theMRQA shared task20 Dataset sizes can be foundin Table 7 The label distributions for each datasetcan be found in Table 8 (BoolQ) Table 9 (CB)Table 10 (COPA) Table 11 (MultiRC) Table 14(RTE) Table 12 (WiC) Table 13 (WSC) Table 15(MRPC) and Table 16 (QQP)

The question answering datasets are extractivedatasets with a variety of answers so there isnrsquot alabel distribution to report Similarly the ReCoRDdataset is a multiple choice dataset where the modelmust predict the masked out entity from a list ofpossible entities Due to this formulation there isnrsquota meaningful label distribution

We followed the open-source T5 preprocess-ing procedure21 for each dataset except that weomit the dataset prefix denoting which SuperGLUEdataset an example belongs to For the SQuAD and

16httpswwwtensorfloworgdatasetscataloggluegluemrpc

17httpswwwtensorfloworgdatasetscatalogglueglueqqp

18httpswwwtensorfloworgdatasetscatalogsuper_glue

19httpswwwtensorfloworgdatasetscatalogsquadsquadv11_default_config

20httpsgithubcommrqaMRQA-Shared-Task-2019out-of-domain

21httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspy

T5 Size Prompt Length Trainable Parameters Total Parameters Percent Trainable

Small 1 512 76961664 0000675 2560 76963712 000333

20 10420 76971572 00133050 25600 76986752 003325

100 51200 77012352 006648150 76800 77037952 009969

Base 1 768 247578624 0000315 3840 247581696 000155

20 15360 247593216 00062050 38400 247616256 001551

100 76800 247654656 003101150 115200 247693056 004651

Large 1 1024 783151104 0000135 5120 783155200 000065

20 20480 783170560 00026250 51200 783201280 000654

100 102400 783252480 001907150 153600 783303680 001961

XL 1 2048 2849759232 0000075 10240 2849767424 000036

20 40960 2849798144 00014350 102400 2849859584 000359

100 204800 2849961984 000718150 307200 2850064384 001078

XXL 1 4096 11135336448 0000045 20480 11135352832 000018

20 81920 11135414272 00007450 204800 11137380352 000184

100 409600 11135741952 000368150 614400 11135946752 000552

Table 4 Number of parameters used for various prompt lengths and T5 model sizes Trainable parameters isthe number of parameters in the prompt itself while total parameters includes the prompt plus the original T5parameters The T5 parameters are frozen and shared across all tasks and include the SentencePiece lookup tableparameters The final column is the percentage of total parameters that are trainable

MRQA datasets we used the T5 SQuAD prepro-cessing code22 By following the T5 preprocessingand text-to-text format we recast the WSC datasetas a text generation task Instead of predictingwhether a supplied referent is correct for a high-lighted span our model predicts the correct referentdirectly As such we can only learn from trainingexamples where the referent is correct so WSCtraining data where the supplied referent is incor-rect are omitted

No new data was collected for this work

22httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspyL264

Prompt Length T5 Size Time

1 Large 317 plusmn0210XL 337 plusmn0211XXL 2123 plusmn0154

20 XL 4908 plusmn1853XXL 5303 plusmn1625

50 Small 0905 plusmn0507Base 5501 plusmn2748Large 11416 plusmn1312XL 23010 plusmn2540XXL 31313 plusmn2308

100 Small 1625 plusmn0115Base 2957 plusmn0018Large 12336 plusmn1021XL 33500 plusmn5442XXL 35115 plusmn4553


at every transformer layer. This is akin to learning transformer activations that are fixed across examples at every network layer. In contrast, prompt tuning uses a single prompt representation that is prepended to the embedded input. Beyond requiring fewer parameters, our approach allows the transformer to update the intermediate-layer task representations, as contextualized by an input example. Their work builds on GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), while ours focuses on T5 and examines changes in performance and robustness to design choices as model size increases. When using BART, prefix tuning includes prefixes on both the encoder and decoder network, while prompt tuning only requires prompts on the encoder. Li and Liang (2021) also rely on a reparameterization of the prefix to stabilize learning, which adds a large number of parameters during training, whereas our configuration does not require this reparameterization and is robust across SuperGLUE tasks and model sizes.
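As a concrete illustration of this input-layer mechanism, the following minimal sketch (toy shapes and variable names are ours, not the released implementation) prepends a learned soft prompt to the embedded input of a frozen model:

```python
import numpy as np

# Toy sizes for illustration; a real T5 model would supply these.
vocab_size, d_model, prompt_length = 32000, 512, 20

embedding_table = np.random.randn(vocab_size, d_model)  # frozen, part of the pre-trained model
soft_prompt = np.random.randn(prompt_length, d_model)   # the only trainable parameters

def prepend_prompt(input_ids):
    """Embed the input tokens and prepend the soft prompt along the sequence axis."""
    embedded = embedding_table[input_ids]                    # [seq_len, d_model]
    return np.concatenate([soft_prompt, embedded], axis=0)   # [prompt_length + seq_len, d_model]

# The concatenated sequence is fed to the frozen encoder; during training,
# gradients update only `soft_prompt`.
print(prepend_prompt(np.array([42, 7, 1999])).shape)  # (23, 512)
```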

Hambardzumyan et al. (2021) propose "WARP", where prompt parameters are added to the input layer. This method works with masked language models, relying on a [MASK] token and a learnable output layer to project the mask to class logits. This formulation restricts the model to producing a single output, limiting it to classification. Prompt tuning does not require any changes to the input or a task-specific head. The performance of prompt tuning is also considerably closer to the strong performance of model tuning.

Liu et al. (2021) propose "P-tuning", where learnable continuous prompts are interleaved throughout the embedded input, using patterns based on human design. Our approach removes this complication by simply prepending the prompt to the input. To achieve strong SuperGLUE results, P-tuning has to be used in conjunction with model tuning; that is, models jointly update both the prompt and the main model parameters, whereas our approach keeps the original language model frozen.10

Qin and Eisner (2021) use "soft words" to learn prompts to extract knowledge from pre-trained LMs. Prompts are positioned in relation to the input based on hand-designed prompt prototypes, and a learned ∆ℓ_i parameter is included for each layer, so parameter cost scales with model depth.

10 As another difference, P-tuning requires the addition of "anchor" tokens in the input (e.g., a question mark following the hypothesis in the RTE task) to achieve strong performance, while prompt tuning leaves inputs untouched.

Dataset      Domain   Model        Prompt       ∆
SQuAD        Wiki     94.9 ±0.2    94.8 ±0.1    -0.1
TextbookQA   Book     54.3 ±3.7    66.8 ±2.9    +12.5
BioASQ       Bio      77.9 ±0.4    79.1 ±0.3    +1.2
RACE         Exam     59.8 ±0.6    60.7 ±0.5    +0.9
RE           Wiki     88.4 ±0.1    88.8 ±0.2    +0.4
DuoRC        Movie    68.9 ±0.7    67.7 ±1.1    -1.2
DROP         Wiki     68.9 ±1.7    67.1 ±1.9    -1.8

Table 1: F1 mean and stddev for models trained on SQuAD and evaluated on out-of-domain datasets from the MRQA 2019 shared task. Prompt tuning tends to give stronger zero-shot performance than model tuning, especially on datasets with large domain shifts like TextbookQA.

Logeswaran et al. (2020) use a learnable prepended token to adapt transformer models to various tasks, but focus on small synthetic datasets designed to accommodate a compositional task representation, as opposed to larger real-world datasets. Their base models are small transformers trained from scratch jointly with the task representations, whereas we keep the base model frozen and investigate scaling laws using larger transformers.

More generally, work on task prompts is closely aligned with work on "adapters" (Rebuffi et al., 2017; Houlsby et al., 2019), small bottleneck layers inserted between frozen pre-trained network layers. Adapters offer another means of reducing task-specific parameters, with Houlsby et al. (2019) achieving GLUE performance close to full model tuning when freezing BERT-Large and only adding 2–4% additional parameters. Pfeiffer et al. (2020) use multiple adapters in a multilingual context to explicitly separate language understanding from task specification, similar to our approach. A core difference between adapters and prompt tuning is how the approaches change model behavior. Adapters modify the actual function that acts on the input representation, parameterized by the neural network, by allowing the rewriting of activations at any given layer. Prompt tuning modifies behavior by leaving the function fixed and adding new input representations that can affect how subsequent input is processed.

5 Resilience to Domain Shift

By freezing the core language model parameters, prompt tuning prevents the model from modifying its general understanding of language. Instead, prompt representations indirectly modulate the representation of the input. This reduces the model's ability to overfit to a dataset by memorizing specific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

Train   Eval   Tuning   Accuracy    F1
QQP     MRPC   Model    73.1 ±0.9   81.2 ±2.1
QQP     MRPC   Prompt   76.3 ±0.1   84.3 ±0.3
MRPC    QQP    Model    74.9 ±1.3   70.9 ±1.2
MRPC    QQP    Prompt   75.4 ±0.8   69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on "in-domain" datasets perform when evaluated on "out-of-domain" datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.11

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g. to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are "duplicates". The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the "in-domain" task, select checkpoints using in-domain validation, and evaluate zero-shot on the "out-of-domain" task.

Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

11 We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org/), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset           Metric    Average        Best            Ensemble
BoolQ             acc       91.1           91.3            91.7
CB                acc/F1    99.3 / 99.0    100.0 / 100.0   100.0 / 100.0
COPA              acc       98.8           100.0           100.0
MultiRC           EM/F1a    65.7 / 88.7    66.3 / 89.0     67.1 / 89.4
ReCoRD            EM/F1     92.7 / 93.4    92.9 / 93.5     93.2 / 93.9
RTE               acc       92.6           93.5            93.5
WiC               acc       76.2           76.6            77.4
WSC               acc       95.8           96.2            96.2
SuperGLUE (dev)             90.5           91.0            91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g. 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate "models" for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
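To make the batched-inference trick concrete, here is a minimal sketch (NumPy, with a stand-in `frozen_model` callable; names and shapes are ours for illustration) that replicates one example across a batch of N prompts and combines the predictions by majority vote:

```python
import numpy as np
from collections import Counter

def ensemble_predict(frozen_model, prompts, input_embeds):
    """Run one example through N prompt 'models' in a single batched forward pass.

    frozen_model: callable mapping [N, prompt_len + seq_len, d_model] to N predicted label strings
    prompts:      [N, prompt_len, d_model] learned soft prompts for the same task
    input_embeds: [seq_len, d_model] embedded input example
    """
    n = prompts.shape[0]
    # Replicate the example across the batch and prepend a different prompt to each copy.
    batch = np.concatenate(
        [prompts, np.repeat(input_embeds[None, :, :], n, axis=0)], axis=1)
    predictions = frozen_model(batch)  # N predictions, one per prompt
    # Simple majority voting over the ensemble's predictions.
    return Counter(predictions).most_common(1)[0][0]
```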

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats or matches the best individual prompt.

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
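A minimal sketch of this nearest-neighbor probe (assuming a NumPy embedding table and prompt matrix; variable names are ours) could look like:

```python
import numpy as np

def top_k_neighbors(prompt_vectors, embedding_table, k=5):
    """For each prompt token, return the indices of the k closest vocabulary
    embeddings, ranked by cosine similarity."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = normalize(prompt_vectors) @ normalize(embedding_table).T  # [prompt_len, vocab_size]
    return np.argsort(-sims, axis=-1)[:, :k]
```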

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as { Technology / technology / Technologies / technological / technologies }, as well as more diverse but still strongly related clusters such as { entirely / completely / totally / altogether / 100% }. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the "class-label" strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the "Random Uniform" or "Sampled Vocab" methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as a reference, and initializing the prompt to output classes makes this easier and more centralized.

When examining longer prompts (e.g. size 100), we often find several prompt tokens with the same nearest neighbors. This suggests that there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences, show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, where approximately 20% of the questions are in the "Nature/Science" category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g. "scientific").

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6–4, Venice.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.
Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.
Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.
L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.
Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.
Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL).
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.
Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA. Omnipress.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.
Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.
Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.
Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.
Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.
Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.

A Reproducibility

A.1 Experimental Settings

We evaluate each GLUE and SuperGLUE dataset using the metric specified in the benchmark. We reuse the evaluation code from the publicly available T5 open-source release to compute metrics.12

For the SQuAD and MRQA datasets, we evaluate using F1, one of the metrics used by the SQuAD benchmark, where partial answer spans are considered. Again, we use the T5 open-source release for metric calculation.13 All of our models use T5 1.1 as the base frozen model; additional details and pre-trained checkpoints can be found on GitHub.14,15
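For reference, a minimal sketch of this span-overlap F1 (the standard SQuAD formulation over shared tokens; our illustration, not the T5 evaluation code, and text normalization is omitted):

```python
from collections import Counter

def span_f1(prediction, target):
    """Token-overlap F1 between a predicted and a gold answer span,
    giving partial credit when the spans only partially match."""
    pred_tokens, gold_tokens = prediction.split(), target.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```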

All prompts for T5 Small and Base models were trained on 4 TPU v2 chips, while prompts for larger models were trained on 16 TPU v3 chips.

Parameter counts for each prompt can be found in Table 4. Average runtimes until convergence can be found in Table 5.

A.2 Hyperparameter Search

This work used 77 hyperparameter search trials (40 for prompt tuning and 37 for single-task model tuning), and 3 training runs (with validation evaluation) for each baseline configuration and ablation setting, for a total of 195 runs for our main result and ablations. There were an additional 18 runs for the domain shift experiments and 24 extra runs to create the ensemble. Hyperparameter bounds can be found in Table 6. Hyperparameter tuning was done via manual tuning and settings were selected based on the SuperGLUE score. All experiments in this work, outside of the hyperparameter being ablated, use our default configuration of 100K steps of LM Adaptation, a prompt length of 100, and "class-label" initialization.

All graphs of our experimental results plot the mean and standard deviation over 3 runs as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation is hidden behind the line itself, such as "Model Tuning (Multi-task)" in Figure 1 and the Base, Large, and XL prompts trained on the "Span Corruption" pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The "Prefix Tuning (Train)" curve appears to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.

12 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py
13 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151
14 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511
15 https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

A.3 Datasets

All datasets used are in English. For the GLUE16,17 and SuperGLUE18 datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,19 we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.20 Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).

The question answering datasets are extractivedatasets with a variety of answers so there isnrsquot alabel distribution to report Similarly the ReCoRDdataset is a multiple choice dataset where the modelmust predict the masked out entity from a list ofpossible entities Due to this formulation there isnrsquota meaningful label distribution

We followed the open-source T5 preprocess-ing procedure21 for each dataset except that weomit the dataset prefix denoting which SuperGLUEdataset an example belongs to For the SQuAD and

16httpswwwtensorfloworgdatasetscataloggluegluemrpc

17httpswwwtensorfloworgdatasetscatalogglueglueqqp

18httpswwwtensorfloworgdatasetscatalogsuper_glue

19httpswwwtensorfloworgdatasetscatalogsquadsquadv11_default_config

20httpsgithubcommrqaMRQA-Shared-Task-2019out-of-domain

21httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspy

T5 Size Prompt Length Trainable Parameters Total Parameters Percent Trainable

Small 1 512 76961664 0000675 2560 76963712 000333

20 10420 76971572 00133050 25600 76986752 003325

100 51200 77012352 006648150 76800 77037952 009969

Base 1 768 247578624 0000315 3840 247581696 000155

20 15360 247593216 00062050 38400 247616256 001551

100 76800 247654656 003101150 115200 247693056 004651

Large 1 1024 783151104 0000135 5120 783155200 000065

20 20480 783170560 00026250 51200 783201280 000654

100 102400 783252480 001907150 153600 783303680 001961

XL 1 2048 2849759232 0000075 10240 2849767424 000036

20 40960 2849798144 00014350 102400 2849859584 000359

100 204800 2849961984 000718150 307200 2850064384 001078

XXL 1 4096 11135336448 0000045 20480 11135352832 000018

20 81920 11135414272 00007450 204800 11137380352 000184

100 409600 11135741952 000368150 614400 11135946752 000552

Table 4 Number of parameters used for various prompt lengths and T5 model sizes Trainable parameters isthe number of parameters in the prompt itself while total parameters includes the prompt plus the original T5parameters The T5 parameters are frozen and shared across all tasks and include the SentencePiece lookup tableparameters The final column is the percentage of total parameters that are trainable

MRQA datasets we used the T5 SQuAD prepro-cessing code22 By following the T5 preprocessingand text-to-text format we recast the WSC datasetas a text generation task Instead of predictingwhether a supplied referent is correct for a high-lighted span our model predicts the correct referentdirectly As such we can only learn from trainingexamples where the referent is correct so WSCtraining data where the supplied referent is incor-rect are omitted

No new data was collected for this work

22httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspyL264

Prompt Length T5 Size Time

1 Large 317 plusmn0210XL 337 plusmn0211XXL 2123 plusmn0154

20 XL 4908 plusmn1853XXL 5303 plusmn1625

50 Small 0905 plusmn0507Base 5501 plusmn2748Large 11416 plusmn1312XL 23010 plusmn2540XXL 31313 plusmn2308

100 Small 1625 plusmn0115Base 2957 plusmn0018Large 12336 plusmn1021XL 33500 plusmn5442XXL 35115 plusmn4553

Table 5 Mean and standard deviation of the runtimeuntil convergence for the BoolQ dataset and variousprompt lengths and model sizes Convergence is de-fined as reaching a performance within 1 of the meanvalue for that model configuration A few configura-tions have been omitted because their runtimes wereartificially extended due to preemption

Hyperparameter Search Space

Learning Rate 0001ndash05Parameter Scaling True FalseBatch Size 32 64 126 256 512Number of Steps 10000 20000 30000Warmup Steps off 2000 3000Decay Factor off 01 05Steps per Decay off 4000 6000 8000

Table 6 Search space for each hyperparameter consid-ered Parameter Scaling refers to the Adafactor settingwhere an update is scaled by the norm of the parameterit will be applied to Warmup Steps is the number ofsteps before a linearly increasing learning rate reachesthe Learning Rate value starting from zero Decay Fac-tor is the reduction in Learning Rate size that occursevery ldquoSteps per Decayrdquo steps

Dataset Training Validation Testing

BoolQ 9427 3270 3245CB 250 56 250COPA 400 100 500MultiRC 27243 4848 9693ReCoRD 100730 10000 10000RTE 2490 277 3000WiC 5428 638 1400WSC 259lowast 104 146MRPC 3668 408 1725QQP 363849 40430 390965SQuAD 87599 10570 -TextbookQA - 1504 -RACE - 1503 -BioASQ - 1501 -RE - 674 -DuoRC - 2948 -DROP - 1503 -

Table 7 Sizes for training validation and testing splitsof each dataset used lowastFollowing T5 our casting ofWSC as a text generation problems means we can onlytrain on examples where the supplied referent is correctThis means our training dataset is smaller than the nor-mal WSC training dataset which has 554 examples

Split False True

Training 377 623Validation 378 622

Table 8 Label distribution for the BoolQ dataset

Split contradiction entailment neutral

Training 476 460 64Validation 500 411 89

Table 9 Label distribution for the CB dataset

Split choice1 choice2

Training 488 512Validation 550 450

Table 10 Label distribution for the COPA dataset

Split False True

Training 559 441Validation 572 428

Table 11 Label distribution for the MultiRC dataset

Split False True

Training 500 500Validation 500 500

Table 12 Label distribution for the WiC dataset

Split False True

Training 00 1000Validation 635 365

Table 13 Label distribution for the WSC dataset Fol-lowing T5 we cast the WSC dataset to a free-form textgeneration task where the model generates the referentto the highlighted span instead predicting if the sup-plied entity is the correct referent of the highlightedspan Thus we only use training data where the sup-plied referent is correct making our training label dis-tribution focused entirely on True

Split entailment not_entailment

Training 512 498Validation 527 473

Table 14 Label distribution for the RTE dataset

Split equivalent not_equivalent

Training 674 326Validation 684 316

Table 15 Label distribution for the MRPC dataset

Split duplicate not_duplicate

Training 369 631Validation 368 632

Table 16 Label distribution for the QQP dataset

Train Eval Tuning Accuracy F1

QQP MRPC Model 73.1 ±0.9 81.2 ±2.1
QQP MRPC Prompt 76.3 ±0.1 84.3 ±0.3
MRPC QQP Model 74.9 ±1.3 70.9 ±1.2
MRPC QQP Prompt 75.4 ±0.8 69.7 ±0.3

Table 2: Mean and stddev of zero-shot domain transfer between two paraphrase detection tasks.

cific lexical cues and spurious correlations. This restriction suggests that prompt tuning may improve robustness to domain shifts, where the distribution of inputs differs between training and evaluation.

We investigate zero-shot domain transfer on two tasks: question answering (QA) and paraphrase detection. For question answering, we use the MRQA 2019 shared task on generalization (Fisch et al., 2019). This task collects extractive QA datasets in a unified format and tests how models trained on “in-domain” datasets perform when evaluated on “out-of-domain” datasets. For our experiments, we train on SQuAD (Rajpurkar et al., 2016) and evaluate on each of the out-of-domain datasets.11

Table 1 shows that prompt tuning outperforms model tuning on the majority of out-of-domain datasets, with a remarkable 12.5 point F1 gap between the two approaches on TextbookQA. We observe larger gains from prompt tuning in cases of larger domain shifts (e.g., to Biomedical in BioASQ or to Textbooks in TextbookQA). Of the datasets where model tuning is better, we see that DROP shares a domain (Wikipedia) with SQuAD and is thus one of the smallest domain transfers.

As a second test of robustness to domain shift, we explore transfer between two paraphrase detection tasks from GLUE (Wang et al., 2019b). The first task is QQP (Iyer et al., 2017), which asks if two questions from the community Q&A site Quora are “duplicates”. The second task is MRPC (Dolan and Brockett, 2005), which asks if two sentences drawn from news articles are paraphrases. We test transfer in both directions (QQP ⇔ MRPC). As before, we train on the “in-domain” task, select checkpoints using in-domain validation, and evaluate zero-shot on the “out-of-domain” task.
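To make the protocol concrete, here is a minimal sketch (ours, not the authors' tooling) of checkpoint selection on in-domain validation followed by zero-shot out-of-domain scoring; the evaluator, checkpoint names, and dataset handles are hypothetical placeholders.

```python
from typing import Callable, Dict, Sequence

def zero_shot_transfer(
    checkpoints: Sequence[str],
    evaluate_fn: Callable[[str, str], Dict[str, float]],
    in_domain_dev: str,
    out_of_domain_dev: str,
    metric: str = "f1",
) -> Dict[str, float]:
    """Pick the checkpoint that scores best on in-domain validation,
    then report its zero-shot score on the out-of-domain dev set."""
    best = max(checkpoints, key=lambda c: evaluate_fn(c, in_domain_dev)[metric])
    return evaluate_fn(best, out_of_domain_dev)

# Toy usage with a dummy evaluator (a real run would score the model itself).
if __name__ == "__main__":
    scores = {("ckpt_1", "qqp_dev"): 0.70, ("ckpt_2", "qqp_dev"): 0.75,
              ("ckpt_1", "mrpc_dev"): 0.60, ("ckpt_2", "mrpc_dev"): 0.66}
    dummy_eval = lambda ckpt, ds: {"f1": scores[(ckpt, ds)]}
    print(zero_shot_transfer(["ckpt_1", "ckpt_2"], dummy_eval, "qqp_dev", "mrpc_dev"))
```

The out-of-domain set is touched only at the very end, which is what makes the evaluation zero-shot.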

Table 2 shows that training a lightweight prompt on the QQP data and evaluating on MRPC gives much better performance than tuning the entire model (+3.2 accuracy and +3.1 F1). The results are much closer in the other direction, with prompt tuning showing a small improvement in accuracy and a small drop in F1. These results support the view that model tuning may be over-parameterized and more prone to overfit the training task, to the detriment of similar tasks in different domains.

11 We select checkpoints based on SQuAD validation F1. The out-of-domain datasets are TextbookQA (Kembhavi et al., 2017), RACE (Lai et al., 2017), BioASQ (http://bioasq.org), RE (Levy et al., 2017), DuoRC (Saha et al., 2018), and DROP (Dua et al., 2019).

Dataset Metric Average Best Ensemble

BoolQ acc 91.1 91.3 91.7
CB acc/F1 99.3 / 99.0 100.0 / 100.0 100.0 / 100.0
COPA acc 98.8 100.0 100.0
MultiRC EM/F1a 65.7 / 88.7 66.3 / 89.0 67.1 / 89.4
ReCoRD EM/F1 92.7 / 93.4 92.9 / 93.5 93.2 / 93.9
RTE acc 92.6 93.5 93.5
WiC acc 76.2 76.6 77.4
WSC acc 95.8 96.2 96.2

SuperGLUE (dev) 90.5 91.0 91.3

Table 3: Performance of a five-prompt ensemble built from a single frozen T5-XXL model exceeds both the average and the best among the five prompts.

6 Prompt Ensembling

Ensembles of neural models trained from different initializations on the same data are widely observed to improve task performance (Hansen and Salamon, 1990) and are useful for estimating model uncertainty (Lakshminarayanan et al., 2017). However, as model size increases, ensembling can become impractical. Beyond the space required to store N models (e.g., 42 GiB for each copy of T5-XXL), there is a substantial inference cost to running N distinct models, whether in parallel or in series.

Prompt tuning provides a more efficient way to ensemble multiple adaptations of a pre-trained language model. By training N prompts on the same task, we create N separate “models” for a task, while still sharing the core language modeling parameters throughout. Beyond drastically reducing storage costs, the prompt ensemble makes inference more efficient. To process one example, rather than computing forward passes of N different models, we can execute a single forward pass with a batch size of N, replicating the example across the batch and varying the prompt. These savings mirror those seen for multi-tasking in Figure 2.
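As an illustration of this batching trick (a sketch under our own assumptions, not the authors' implementation): the N learned prompts are stacked into one batch around a single replicated example, scored in one forward pass of the shared frozen model, and the per-member predictions are then combined by simple majority voting.

```python
import numpy as np

def ensemble_logits(frozen_model, prompts, example_embedding):
    """Score one example under N prompts with a single batched forward pass.

    frozen_model: callable mapping input embeddings [N, L, d] to logits (assumed interface).
    prompts: learned soft prompts, shape [N, prompt_len, d].
    example_embedding: embedded input tokens for one example, shape [seq_len, d].
    """
    n = prompts.shape[0]
    # Replicate the example across the batch dimension: [N, seq_len, d].
    batch = np.repeat(example_embedding[None, :, :], n, axis=0)
    # Prepend a different prompt to each copy: [N, prompt_len + seq_len, d].
    batch = np.concatenate([prompts, batch], axis=1)
    return frozen_model(batch)  # one forward pass instead of N

def majority_vote(predictions):
    """Combine the N ensemble predictions by simple majority voting."""
    values, counts = np.unique(np.asarray(predictions), return_counts=True)
    return values[np.argmax(counts)]
```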

To demonstrate the viability of prompt ensembling, we train five prompts for each SuperGLUE task, using a single frozen T5-XXL model with our default hyperparameters. We use simple majority voting to compute predictions from the ensemble. Table 3 shows that across all tasks, the ensemble beats the single-prompt average and beats or matches the best individual prompt.

7 Interpretability

An ideally interpretable prompt would consist of natural language that clearly describes the task at hand, explicitly asks the model for some result or action, and makes it easy to understand why the prompt elicited such behavior from the model.

As prompt tuning works in the continuous embedding space rather than the discrete token space, interpreting prompts becomes more difficult. To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary. We use cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric.
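A minimal sketch of this probe, assuming access to the learned prompt matrix and the frozen model's vocabulary embedding table:

```python
import numpy as np

def nearest_vocab_neighbors(prompt_embeddings, vocab_embeddings, vocab, k=5):
    """For each prompt token, return the k vocabulary items with the highest
    cosine similarity to its learned embedding.

    prompt_embeddings: [prompt_len, d] learned soft-prompt vectors.
    vocab_embeddings: [vocab_size, d] frozen model's token embeddings.
    vocab: list of vocab_size token strings.
    """
    # Normalize so the dot product equals cosine similarity.
    p = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=-1, keepdims=True)
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=-1, keepdims=True)
    sims = p @ v.T                          # [prompt_len, vocab_size]
    top_k = np.argsort(-sims, axis=-1)[:, :k]
    return [[vocab[i] for i in row] for row in top_k]
```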

We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as { Technology, technology, Technologies, technological, technologies }, as well as more diverse but still strongly related clusters such as { entirely, completely, totally, altogether, 100% }. The nature of these clusters suggests that the prompts are in fact learning “word-like” representations. We found that random vectors drawn from the embedding space do not show this sort of semantic clustering.

When initializing the prompts using the “class-label” strategy, we often find that the class labels persist through training. Specifically, if a prompt token is initialized to a given label, that label is often among the learned token's nearest neighbors after tuning. When initializing with the “Random Uniform” or “Sampled Vocab” methods, the class labels can also be found in the nearest neighbors of the prompts; however, they tend to appear as neighbors to multiple prompt tokens. This suggests that the model is learning to store the expected output classes in the prompts as reference, and initializing the prompt to the output classes makes this easier and more centralized.

When examining longer prompts (e.g., size 100), we often find several prompt tokens with the same nearest neighbors. This suggests there is either excess capacity in the prompt, or that the lack of sequential structure in the prompt representation makes it difficult for the model to localize information to a specific position.

While the learned prompts, taken as sequences, show little interpretability, we do observe a high frequency of words like science, technology, and engineering as the nearest neighbors for prompts trained on the BoolQ dataset, and approximately 20% of the questions are in the “Nature/Science” category. While more investigation is needed, this suggests that one role of the prompt may be to prime the model to interpret inputs in a specific domain or context (e.g., “scientific”).

8 Conclusion

In this paper, we showed that prompt tuning is a competitive technique for adapting frozen pre-trained language models to downstream tasks. On the popular SuperGLUE benchmark, its task performance rivals that of traditional model tuning, with the gap vanishing as model size increases. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. This plausibly indicates that freezing general-purpose language understanding parameters and restricting downstream learning to a lightweight parameter footprint can help to avoid overfitting to a specific domain.

Beyond task quality metrics, we discussed the appeal of moving to frozen pre-trained models in terms of storage and serving costs. This move enables both efficient multi-task serving, as well as efficient high-performing prompt ensembling. Looking forward, we believe that factoring out task-defining parameters as distinct from general language-modeling parameters is an exciting step that opens up many avenues for new research.

Acknowledgements

We thank Lucas Dixon, Waleed Ammar, Slav Petrov, and Sebastian Ruder for comments on an earlier draft, and the following people for helpful discussion: Colin Raffel, Adam Roberts, and Noam Shazeer. We thank Linting Xue for help with the LM adaptation training.

References

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment, volume 6, pages 6–4, Venice.

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: composable transformations of Python+NumPy programs.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Marie-Catherine De Marneff, Mandy Simons, and Judith Tonhauser. 2019. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In Proceedings of the 2nd Machine Reading for Reading Comprehension (MRQA) Workshop at EMNLP.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics.

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics.

L. K. Hansen and P. Salamon. 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001.

Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. 2020. Flax: A neural network library and ecosystem for JAX.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Shankar Iyer, Nikhil Dandekar, and Kornel Csernai. 2017. First Quora dataset release: Question pairs.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.

A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5376–5384.

Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4837–4842, Florence, Italy. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too. CoRR, abs/2103.10385.

Lajanugen Logeswaran, Ann Lee, Myle Ott, Honglak Lee, Marc'Aurelio Ranzato, and Arthur Szlam. 2020. Few-shot sequence learning with transformers. CoRR, abs/2012.09543.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, pages 807–814, Madison, WI, USA. Omnipress.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. 2018. DuoRC: Towards complex language understanding with paraphrased reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1693, Melbourne, Australia. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Noam Shazeer. 2020. GLU variants improve transformer. CoRR, abs/2002.05202.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.

Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885.

A Reproducibility

A.1 Experimental Settings

We evaluate each GLUE and SuperGLUE dataset using the metric specified in the benchmark. We reuse the evaluation code from the publicly available T5 open-source release to compute metrics.12

For the SQuAD and MRQA datasets, we evaluate using F1, one of the metrics used by the SQuAD benchmark, where partial answer spans are considered. Again, we use the T5 open-source release for metric calculation.13 All of our models use T5 1.1 as the base frozen model; additional details and pre-trained checkpoints can be found on GitHub.14,15
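For reference, the core of SQuAD-style F1 is token-level overlap between the predicted and gold spans, which is what gives partially correct spans credit. A simplified sketch (the released metric code additionally normalizes answers and takes the maximum over multiple gold spans):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, which awards partial credit for partially correct spans."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a prediction covering the gold span plus one extra token scores 0.8.
print(token_f1("the eiffel tower", "eiffel tower"))
```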

All prompts for T5 Small and Base models were trained on 4 TPU v2 chips, while prompts for larger models were trained on 16 TPU v3 chips.

Parameter counts for each prompt can be found in Table 4. Average runtimes until convergence can be found in Table 5.
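The prompt parameter counts in Table 4 follow directly from the prompt shape: a prompt of length L for a model with embedding dimension d adds L × d trainable parameters, and the final column is that count as a share of frozen-plus-prompt parameters. A small sketch of the arithmetic (the frozen counts below are read off Table 4, and the d_model values are those of the T5 1.1 sizes):

```python
# Embedding dimensions (d_model) for the T5 1.1 sizes used in this work.
D_MODEL = {"Small": 512, "Base": 768, "Large": 1024, "XL": 2048, "XXL": 4096}

def prompt_param_count(size: str, prompt_length: int, frozen_params: int):
    """Trainable prompt parameters, total parameters, and percent trainable."""
    trainable = prompt_length * D_MODEL[size]
    total = frozen_params + trainable
    return trainable, total, 100.0 * trainable / total

# Example: T5 Small with a 100-token prompt (frozen count inferred from Table 4).
print(prompt_param_count("Small", 100, 76_961_152))
# -> (51200, 77012352, ~0.06648)
```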

A.2 Hyperparameter Search

This work used 77 hyperparameter search trials (40 for prompt tuning and 37 for single-task model tuning), and 3 training runs (with validation evaluation) for each baseline configuration and ablation setting, for a total of 195 runs for our main result and ablations. There were an additional 18 runs for the domain shift experiments and 24 extra runs to create the ensemble. Hyperparameter bounds can be found in Table 6. Hyperparameter tuning was done via manual tuning, and settings were selected based on the SuperGLUE score. All experiments in this work, outside of the hyperparameter being ablated, use our default configuration of 100K steps of LM Adaptation, a prompt length of 100, and “class-label” initialization.

All graphs of our experimental results plot the mean and standard deviation over 3 runs, as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation is hidden behind the line itself, such as “Model Tuning (Multi-task)” in Figure 1 and the Base, Large, and XL prompts trained on the “Span Corruption” pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The “Prefix Tuning (Train)” curves appear to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.

12 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py

13 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151

14 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511

15 https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

A.3 Datasets

All datasets used are in English. For the GLUE16,17 and SuperGLUE18 datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,19 we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.20 Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).

The question answering datasets are extractive datasets with a variety of answers, so there isn't a label distribution to report. Similarly, the ReCoRD dataset is a multiple-choice dataset where the model must predict the masked-out entity from a list of possible entities. Due to this formulation, there isn't a meaningful label distribution.

We followed the open-source T5 preprocessing procedure21 for each dataset, except that we omit the dataset prefix denoting which SuperGLUE dataset an example belongs to.

16 https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

17 https://www.tensorflow.org/datasets/catalog/glue#glueqqp

18 https://www.tensorflow.org/datasets/catalog/super_glue

19 https://www.tensorflow.org/datasets/catalog/squad#squadv11_default_config

20 https://github.com/mrqa/MRQA-Shared-Task-2019#out-of-domain

21 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

T5 Size Prompt Length Trainable Parameters Total Parameters Percent Trainable

Small 1 512 76,961,664 0.00067
Small 5 2,560 76,963,712 0.00333
Small 20 10,240 76,971,392 0.01330
Small 50 25,600 76,986,752 0.03325
Small 100 51,200 77,012,352 0.06648
Small 150 76,800 77,037,952 0.09969

Base 1 768 247,578,624 0.00031
Base 5 3,840 247,581,696 0.00155
Base 20 15,360 247,593,216 0.00620
Base 50 38,400 247,616,256 0.01551
Base 100 76,800 247,654,656 0.03101
Base 150 115,200 247,693,056 0.04651

Large 1 1,024 783,151,104 0.00013
Large 5 5,120 783,155,200 0.00065
Large 20 20,480 783,170,560 0.00262
Large 50 51,200 783,201,280 0.00654
Large 100 102,400 783,252,480 0.01307
Large 150 153,600 783,303,680 0.01961

XL 1 2,048 2,849,759,232 0.00007
XL 5 10,240 2,849,767,424 0.00036
XL 20 40,960 2,849,798,144 0.00143
XL 50 102,400 2,849,859,584 0.00359
XL 100 204,800 2,849,961,984 0.00718
XL 150 307,200 2,850,064,384 0.01078

XXL 1 4,096 11,135,336,448 0.00004
XXL 5 20,480 11,135,352,832 0.00018
XXL 20 81,920 11,135,414,272 0.00074
XXL 50 204,800 11,135,537,152 0.00184
XXL 100 409,600 11,135,741,952 0.00368
XXL 150 614,400 11,135,946,752 0.00552

Table 4: Number of parameters used for various prompt lengths and T5 model sizes. Trainable parameters is the number of parameters in the prompt itself, while total parameters includes the prompt plus the original T5 parameters. The T5 parameters are frozen and shared across all tasks, and include the SentencePiece lookup table parameters. The final column is the percentage of total parameters that are trainable.

For the SQuAD and MRQA datasets, we used the T5 SQuAD preprocessing code.22 By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.
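A rough sketch of this recast (our illustration, assuming SuperGLUE-style WSC fields; the span highlighting is simplified relative to the actual T5 preprocessors):

```python
def recast_wsc(examples):
    """Convert WSC examples to text-to-text pairs, keeping only correct referents.

    Each example is assumed to be a dict like:
      {"text": ..., "span1_text": referent, "span2_text": pronoun, "label": 0 or 1}
    (field names follow the SuperGLUE WSC schema; highlighting is simplified here).
    """
    pairs = []
    for ex in examples:
        if ex["label"] != 1:
            # The supplied referent is incorrect, so it cannot serve as a generation target.
            continue
        # Mark the pronoun and ask the model to generate the referent directly.
        marked = ex["text"].replace(ex["span2_text"], f"*{ex['span2_text']}*", 1)
        pairs.append({"inputs": marked, "targets": ex["span1_text"]})
    return pairs
```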

No new data was collected for this work.

22 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length T5 Size Time

1 Large 3:17 ±2:10
1 XL 3:37 ±2:11
1 XXL 21:23 ±1:54

20 XL 49:08 ±18:53
20 XXL 53:03 ±16:25

50 Small 9:05 ±5:07
50 Base 55:01 ±27:48
50 Large 114:16 ±13:12
50 XL 230:10 ±25:40
50 XXL 313:13 ±23:08

100 Small 16:25 ±1:15
100 Base 29:57 ±0:18
100 Large 123:36 ±10:21
100 XL 335:00 ±54:42
100 XXL 351:15 ±45:53

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.
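The convergence times in Table 5 can be read off an evaluation curve with a check like the following sketch of the stated definition, assuming a list of (elapsed time, score) evaluation points and the configuration's mean final score:

```python
def time_to_convergence(eval_points, target_score, tolerance=0.01):
    """Return the first evaluation time whose score is within `tolerance`
    (1% by default) of the configuration's mean value, or None if never reached."""
    for elapsed, score in eval_points:
        if abs(score - target_score) <= tolerance * target_score:
            return elapsed
    return None

# Toy usage: scores approach a mean value of 91.0 (times here are made up).
print(time_to_convergence([(5, 80.0), (10, 88.5), (15, 90.3)], target_score=91.0))  # -> 15
```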

Hyperparameter Search Space

Learning Rate 0.001–0.5
Parameter Scaling True, False
Batch Size 32, 64, 126, 256, 512
Number of Steps 10,000, 20,000, 30,000
Warmup Steps off, 2,000, 3,000
Decay Factor off, 0.1, 0.5
Steps per Decay off, 4,000, 6,000, 8,000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every “Steps per Decay” steps.
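Read literally, the Table 6 caption implies a schedule with linear warmup followed by staircase decay. A sketch of that reading (the exact training code may differ, and the default values below are just points from the search space):

```python
def learning_rate(step, base_lr=0.3, warmup_steps=2000, decay_factor=0.5, steps_per_decay=4000):
    """Linear warmup from zero to `base_lr`, then multiply by `decay_factor`
    every `steps_per_decay` steps; `warmup_steps`, `decay_factor`, and
    `steps_per_decay` may be falsy to switch those pieces "off"."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    if not steps_per_decay or not decay_factor:
        return base_lr
    # Assumption: decay intervals are counted from the end of warmup.
    num_decays = (step - (warmup_steps or 0)) // steps_per_decay
    return base_lr * (decay_factor ** num_decays)

# Example: the rate halves every 4,000 post-warmup steps.
print([round(learning_rate(s), 4) for s in (1000, 2000, 6000, 10000)])
# -> [0.15, 0.3, 0.15, 0.075]
```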

Dataset Training Validation Testing

BoolQ 9,427 3,270 3,245
CB 250 56 250
COPA 400 100 500
MultiRC 27,243 4,848 9,693
ReCoRD 100,730 10,000 10,000
RTE 2,490 277 3,000
WiC 5,428 638 1,400
WSC 259* 104 146
MRPC 3,668 408 1,725
QQP 363,849 40,430 390,965
SQuAD 87,599 10,570 -
TextbookQA - 1,504 -
RACE - 1,503 -
BioASQ - 1,501 -
RE - 674 -
DuoRC - 2,948 -
DROP - 1,503 -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This means our training dataset is smaller than the normal WSC training dataset, which has 554 examples.

Split False True

Training 37.7 62.3
Validation 37.8 62.2

Table 8: Label distribution for the BoolQ dataset.

Split contradiction entailment neutral

Training 47.6 46.0 6.4
Validation 50.0 41.1 8.9

Table 9: Label distribution for the CB dataset.

Split choice1 choice2

Training 48.8 51.2
Validation 55.0 45.0

Table 10: Label distribution for the COPA dataset.

Split False True

Training 55.9 44.1
Validation 57.2 42.8

Table 11: Label distribution for the MultiRC dataset.

Split False True

Training 50.0 50.0
Validation 50.0 50.0

Table 12: Label distribution for the WiC dataset.

Split False True

Training 0.0 100.0
Validation 63.5 36.5

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task, where the model generates the referent to the highlighted span instead of predicting if the supplied entity is the correct referent of the highlighted span. Thus, we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split entailment not_entailment

Training 51.2 49.8
Validation 52.7 47.3

Table 14: Label distribution for the RTE dataset.

Split equivalent not_equivalent

Training 67.4 32.6
Validation 68.4 31.6

Table 15: Label distribution for the MRPC dataset.

Split duplicate not_duplicate

Training 36.9 63.1
Validation 36.8 63.2

Table 16: Label distribution for the QQP dataset.

matches the best individual prompt

7 Interpretability

An ideally interpretable prompt would consist ofnatural language that clearly describes the task athand explicitly asks the model for some result oraction and makes it easy to understand why theprompt elicited such behavior from the model

As prompt tuning works in the continuous em-bedding space rather than the discrete token spaceinterpreting prompts becomes more difficult Totest the interpretability of our learned soft promptswe compute the nearest neighbors to each prompttoken from the frozen modelrsquos vocabulary We usecosine distance between the vocabulary embeddingvector and the prompt token representation as thesimilarity metric

We observe that for a given learned prompt to-ken the top-5 nearest neighbors form tight seman-tic clusters For example we see lexically similarclusters such as Technology technology Tech-nologies technological technologies as wellas more diverse but still strongly related clusterssuch as entirely completely totally altogether 100 The nature of these clusters suggests thatthe prompts are in fact learning ldquoword-likerdquo repre-sentations We found that random vectors drawnfrom the embedding space do not show this sort ofsemantic clustering

When initializing the prompts using the ldquoclass-labelrdquo strategy we often find that the class labelspersist through training Specifically if a prompttoken is initialized to a given label that label isoften among the learned tokenrsquos nearest neighborsafter tuning When initializing with the ldquoRandomUniformrdquo or ldquoSampled Vocabrdquo methods the classlabels can also be found in the nearest neighborsof the prompts however they tend to appear asneighbors to multiple prompt tokens This suggeststhat the model is learning to store the expectedoutput classes in the prompts as reference andinitializing the prompt to outputs classes makesthis easier and more centralized

When examining longer prompts (eg size 100)we often find several prompt tokens with the samenearest neighbors This suggests there is eitherexcess capacity in the prompt or that the lack ofsequential structure in the prompt representationmakes it difficult for the model to localize informa-tion to a specific position

While the learned prompts taken as sequences

show little interpretability we do observe a highfrequency of words like science technology andengineering as the nearest neighbors for promptstrained on the BoolQ dataset and approximately20 of the questions are in the ldquoNatureSciencerdquocategory While more investigation is needed thissuggests that one role of the prompt may be toprime the model to interpret inputs in a specificdomain or context (eg ldquoscientificrdquo)

8 Conclusion

In this paper we showed that prompt tuning isa competitive technique for adapting frozen pre-trained language models to downstream tasks Onthe popular SuperGLUE benchmark its task perfor-mance rivals that of traditional model tuning withthe gap vanishing as model size increases On zero-shot domain transfer we found that prompt tuningleads to improved generalization This plausibly in-dicates that freezing general-purpose language un-derstanding parameters and restricting downstreamlearning to a lightweight parameter footprint canhelp to avoid overfitting to a specific domain

Beyond task quality metrics we discussed theappeal of moving to frozen pre-trained models interms of storage and serving costs This moveenables both efficient multi-task serving as wellas efficient high-performing prompt ensemblingLooking forward we believe that factoring outtask-defining parameters as distinct from generallanguage-modeling parameters is an exciting stepthat opens up many avenues for new research

Acknowledgements

We thank Lucas Dixon Waleed Ammar SlavPetrov and Sebastian Ruder for comments on anearlier draft and the following people for helpfuldiscussion Colin Raffel Adam Roberts and NoamShazeer We thank Linting Xue for help with theLM adaptation training

ReferencesRoy Bar-Haim Ido Dagan Bill Dolan Lisa Ferro

Danilo Giampiccolo Bernardo Magnini and IdanSzpektor 2006 The second PASCAL recognisingtextual entailment challenge In Proceedings of thesecond PASCAL challenges workshop on recognis-ing textual entailment volume 6 pages 6ndash4 Venice

Luisa Bentivogli Peter Clark Ido Dagan and DaniloGiampiccolo 2009 The fifth PASCAL recognizingtextual entailment challenge In TAC

James Bradbury Roy Frostig Peter HawkinsMatthew James Johnson Chris Leary DougalMaclaurin George Necula Adam Paszke JakeVanderPlas Skye Wanderman-Milne and QiaoZhang 2018 JAX composable transformations ofPython+NumPy programs

Tom Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared D Kaplan Prafulla DhariwalArvind Neelakantan Pranav Shyam Girish SastryAmanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan RewonChild Aditya Ramesh Daniel Ziegler Jeffrey WuClemens Winter Chris Hesse Mark Chen EricSigler Mateusz Litwin Scott Gray Benjamin ChessJack Clark Christopher Berner Sam McCandlishAlec Radford Ilya Sutskever and Dario Amodei2020 Language models are few-shot learners InAdvances in Neural Information Processing Systemsvolume 33 pages 1877ndash1901 Curran AssociatesInc

Christopher Clark Kenton Lee Ming-Wei ChangTom Kwiatkowski Michael Collins and KristinaToutanova 2019 BoolQ Exploring the surprisingdifficulty of natural yesno questions In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics Human Language Technologies Volume 1(Long and Short Papers) pages 2924ndash2936 Min-neapolis Minnesota Association for ComputationalLinguistics

Ido Dagan Oren Glickman and Bernardo Magnini2005 The PASCAL recognising textual entailmentchallenge In Machine Learning Challenges Work-shop pages 177ndash190 Springer

Marie-Catherine De Marneff Mandy Simons and Ju-dith Tonhauser 2019 The CommitmentBank In-vestigating projection in naturally occurring dis-course Proceedings of Sinn und Bedeutung 23

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

William B Dolan and Chris Brockett 2005 Automati-cally constructing a corpus of sentential paraphrasesIn Proceedings of the Third International Workshopon Paraphrasing (IWP2005)

Dheeru Dua Yizhong Wang Pradeep Dasigi GabrielStanovsky Sameer Singh and Matt Gardner 2019DROP A reading comprehension benchmark requir-ing discrete reasoning over paragraphs In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics Human Language Technologies Volume 1

(Long and Short Papers) pages 2368ndash2378 Min-neapolis Minnesota Association for ComputationalLinguistics

Adam Fisch Alon Talmor Robin Jia Minjoon Seo Eu-nsol Choi and Danqi Chen 2019 MRQA 2019shared task Evaluating generalization in readingcomprehension In Proceedings of 2nd MachineReading for Reading Comprehension (MRQA) Work-shop at EMNLP

Danilo Giampiccolo Bernardo Magnini Ido Daganand Bill Dolan 2007 The third PASCAL recog-nizing textual entailment challenge In Proceedingsof the ACL-PASCAL workshop on textual entailmentand paraphrasing pages 1ndash9 Association for Com-putational Linguistics

Karen Hambardzumyan Hrant Khachatrian andJonathan May 2021 WARP Word-level Adversar-ial ReProgramming In Proceedings of the 59th An-nual Meeting of the Association for ComputationalLinguistics and the 11th International Joint Confer-ence on Natural Language Processing (Volume 1Long Papers) pages 4921ndash4933 Online Associa-tion for Computational Linguistics

L K Hansen and P Salamon 1990 Neural networkensembles IEEE Transactions on Pattern Analysisand Machine Intelligence 12(10)993ndash1001

Jonathan Heek Anselm Levskaya Avital Oliver Mar-vin Ritter Bertrand Rondepierre Andreas Steinerand Marc van Zee 2020 Flax A neural networklibrary and ecosystem for JAX

Neil Houlsby Andrei Giurgiu Stanislaw JastrzebskiBruna Morrone Quentin De Laroussilhe AndreaGesmundo Mona Attariyan and Sylvain Gelly2019 Parameter-efficient transfer learning for NLPIn Proceedings of the 36th International Conferenceon Machine Learning volume 97 of Proceedingsof Machine Learning Research pages 2790ndash2799PMLR

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Shankar Iyer Nikhil Dandekar and Kornel Csernai2017 First Quora dataset release Question pairs

Zhengbao Jiang Frank F Xu Jun Araki and GrahamNeubig 2020 How can we know what languagemodels know Transactions of the Association forComputational Linguistics 8423ndash438

A Kembhavi M Seo D Schwenk J Choi A Farhadiand H Hajishirzi 2017 Are you smarter than asixth grader textbook question answering for multi-modal machine comprehension In 2017 IEEE Con-ference on Computer Vision and Pattern Recognition(CVPR) pages 5376ndash5384

Daniel Khashabi Snigdha Chaturvedi Michael RothShyam Upadhyay and Dan Roth 2018 Looking be-yond the surface A challenge set for reading com-prehension over multiple sentences In Proceedingsof North American Chapter of the Association forComputational Linguistics (NAACL)

Vid Kocijan Ana-Maria Cretu Oana-Maria CamburuYordan Yordanov and Thomas Lukasiewicz 2019A surprisingly robust trick for the Winograd schemachallenge In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguisticspages 4837ndash4842 Florence Italy Association forComputational Linguistics

Taku Kudo and John Richardson 2018 SentencePieceA simple and language independent subword tok-enizer and detokenizer for neural text processing InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing SystemDemonstrations pages 66ndash71 Brussels BelgiumAssociation for Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Balaji Lakshminarayanan Alexander Pritzel andCharles Blundell 2017 Simple and scalable predic-tive uncertainty estimation using deep ensembles InAdvances in Neural Information Processing Systemsvolume 30 Curran Associates Inc

Hector Levesque Ernest Davis and Leora Morgen-stern 2012 The Winograd schema challenge InThirteenth International Conference on the Princi-ples of Knowledge Representation and Reasoning

Omer Levy Minjoon Seo Eunsol Choi and LukeZettlemoyer 2017 Zero-shot relation extraction viareading comprehension In Proceedings of the 21stConference on Computational Natural LanguageLearning (CoNLL 2017) pages 333ndash342 Vancou-ver Canada Association for Computational Linguis-tics

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Veselin Stoyanov and Luke Zettlemoyer2020 BART Denoising sequence-to-sequence pre-training for natural language generation translationand comprehension In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 7871ndash7880 Online Associationfor Computational Linguistics

Xiang Lisa Li and Percy Liang 2021 Prefix-tuningOptimizing continuous prompts for generation InProceedings of the 59th Annual Meeting of theAssociation for Computational Linguistics and the11th International Joint Conference on Natural Lan-guage Processing (Volume 1 Long Papers) pages

4582ndash4597 Online Association for ComputationalLinguistics

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021 GPTunderstands too CoRR abs210310385

Lajanugen Logeswaran Ann Lee Myle Ott HonglakLee MarcrsquoAurelio Ranzato and Arthur Szlam2020 Few-shot sequence learning with transform-ers CoRR abs201209543

Vinod Nair and Geoffrey E Hinton 2010 Rectifiedlinear units improve restricted Boltzmann machinesIn Proceedings of the 27th International Conferenceon International Conference on Machine LearningICMLrsquo10 page 807ndash814 Madison WI USA Om-nipress

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Jonas Pfeiffer Ivan Vulic Iryna Gurevych and Se-bastian Ruder 2020 MAD-X An Adapter-BasedFramework for Multi-Task Cross-Lingual TransferIn Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP)pages 7654ndash7673 Online Association for Computa-tional Linguistics

Mohammad Taher Pilehvar and Jose Camacho-Collados 2018 WiC 10000 example pairs forevaluating context-sensitive representations CoRRabs180809121

Guanghui Qin and Jason Eisner 2021 Learning howto ask Querying LMs with mixtures of soft promptsIn Proceedings of the 2021 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics Human Language Technologiespages 5203ndash5212 Online Association for Compu-tational Linguistics

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners OpenAIBlog

Colin Raffel Noam Shazeer Adam Roberts Kather-ine Lee Sharan Narang Michael Matena YanqiZhou Wei Li and Peter J Liu 2020 Exploringthe limits of transfer learning with a unified text-to-text transformer Journal of Machine Learning Re-search 21(140)1ndash67

Pranav Rajpurkar Jian Zhang Konstantin Lopyrev andPercy Liang 2016 SQuAD 100000+ questions formachine comprehension of text In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing pages 2383ndash2392 AustinTexas Association for Computational Linguistics

Sylvestre-Alvise Rebuffi Hakan Bilen and AndreaVedaldi 2017 Learning multiple visual domainswith residual adapters In Advances in Neural Infor-mation Processing Systems volume 30 Curran As-sociates Inc

Melissa Roemmele Cosmin Adrian Bejan and An-drew S Gordon 2011 Choice of plausible alterna-tives An evaluation of commonsense causal reason-ing In 2011 AAAI Spring Symposium Series

Amrita Saha Rahul Aralikatte Mitesh M Khapra andKarthik Sankaranarayanan 2018 DuoRC Towardscomplex language understanding with paraphrasedreading comprehension In Proceedings of the 56thAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1 Long Papers) pages1683ndash1693 Melbourne Australia Association forComputational Linguistics

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of the16th Conference of the European Chapter of the As-sociation for Computational Linguistics Main Vol-ume pages 255ndash269 Online Association for Com-putational Linguistics

Noam Shazeer 2020 GLU variants improve trans-former CoRR abs200205202

Noam Shazeer and Mitchell Stern 2018 AdafactorAdaptive learning rates with sublinear memory costIn Proceedings of the 35th International Conferenceon Machine Learning volume 80 of Proceedingsof Machine Learning Research pages 4596ndash4604PMLR

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutoPromptEliciting Knowledge from Language Models withAutomatically Generated Prompts In Proceed-ings of the 2020 Conference on Empirical Methodsin Natural Language Processing (EMNLP) pages4222ndash4235 Online Association for ComputationalLinguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008

Alex Wang Yada Pruksachatkun Nikita NangiaAmanpreet Singh Julian Michael Felix Hill OmerLevy and Samuel Bowman 2019a SuperGLUE Astickier benchmark for general-purpose language un-derstanding systems In Advances in Neural Infor-mation Processing Systems volume 32 Curran As-sociates Inc

Alex Wang Amanpreet Singh Julian Michael FelixHill Omer Levy and Samuel R Bowman 2019bGLUE A multi-task benchmark and analysis plat-form for natural language understanding In the Pro-ceedings of ICLR

Michael L Waskom 2021 seaborn statistical datavisualization Journal of Open Source Software6(60)3021

Sheng Zhang Xiaodong Liu Jingjing Liu JianfengGao Kevin Duh and Benjamin Van Durme 2018ReCoRD Bridging the gap between human and ma-chine commonsense reading comprehension CoRRabs181012885

A Reproducibility

A1 Experimental Settings

We evaluate each GLUE and SuperGLUE datasetusing the metric specified in the benchmark Wereuse the evaluation code from the publicly avail-able T5 open-source release to compute metrics12

For the SQuAD and MRQA datasets we evaluateusing F1 one of the metrics used by the SQuADbenchmark where partial answer spans are consid-ered Again we use the T5 open-source release formetric calculation13 All of our models use T5 11as the base frozen model additional details and pre-trained checkpoints can be found on GitHub1415

All prompts for T5 Small and Base models weretrained on 4 TPU v2 chips while prompts for largermodels were trained on 16 TPU v3 chips

Parameter counts for each prompt can be foundin Table 4 Average runtimes until convergence canbe found in Table 5

A2 Hyperparameter Search

This work used 77 hyperparameter search trials(40 for prompt tuning and 37 for single-task modeltuning) and 3 training runs (with validation evalu-ation) for each baseline configuration and ablationsetting for a total of 195 runs for our main resultand ablations There were an additional 18 runs forthe domain shift experiments and 24 extra runs tocreate the ensemble Hyperparameter bounds canbe found in Table 6 Hyperparameter tuning wasdone via manual tuning and settings were selectedbased on the SuperGLUE score All experiments inthis work outside of the hyperparameter being ab-lated use our default configuration of 100K stepsof LM Adapation a prompt length of 100 andldquoclass-labelrdquo initialization

All graphs of our experimental results plot the mean and standard deviation over 3 runs, as computed by Seaborn (Waskom, 2021). Some settings have such low variance that the standard deviation

12 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py

13 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L151

14 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/released_checkpoints.md#t511

15 https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#lm-adapted-t511lm100k

is hidden behind the line itself, such as “Model Tuning (Multi-task)” in Figure 1 and the Base, Large, and XL prompts trained on the “Span Corruption” pretraining objective in Figure 3(b). Figure 4 also shows mean and standard deviation for the number of parameters each method uses as the prompt length varies from 1–100. The “Prefix Tuning (Train)” curve appears to have no standard deviation because the parameter count is so strongly dominated by the cost of the reparameterization parameters that the standard deviation bands are occluded. For our experiments on domain transfer, we report mean and standard deviation over 3 runs.
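As a concrete illustration of this plotting convention, the snippet below shows how Seaborn draws a mean line with a standard-deviation band over repeated runs. The dataframe holds placeholder numbers for illustration only, not results from our experiments, and the `errorbar` argument assumes seaborn 0.12 or newer (older releases use `ci="sd"`).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical long-form results: one row per (model size, run); scores are placeholders.
df = pd.DataFrame({
    "model_params": [0.06e9, 0.06e9, 0.06e9, 0.25e9, 0.25e9, 0.25e9],
    "score":        [61.2, 60.8, 61.5, 66.1, 65.7, 66.4],
    "method":       ["Prompt Tuning"] * 6,
    "run":          [1, 2, 3, 1, 2, 3],
})

# Seaborn aggregates the repeated runs into a mean line with a +/- 1 std band.
ax = sns.lineplot(
    data=df, x="model_params", y="score", hue="method",
    errorbar="sd",   # seaborn >= 0.12; older releases use ci="sd"
    marker="o",
)
ax.set(xscale="log", xlabel="Model parameters", ylabel="SuperGLUE score")
plt.show()
```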

A.3 Datasets

All datasets used are in English. For the GLUE16,17 and SuperGLUE18 datasets, we used the training, validation, and test splits that ship with TensorFlow Datasets. We used version 1.0.0 for GLUE and 1.0.2 for SuperGLUE datasets. For SQuAD,19 we used v1.1:3.0.0 from TensorFlow Datasets and follow the provided training, validation, and test splits. For the out-of-domain datasets, we used the development splits distributed as part of the MRQA shared task.20 Dataset sizes can be found in Table 7. The label distributions for each dataset can be found in Table 8 (BoolQ), Table 9 (CB), Table 10 (COPA), Table 11 (MultiRC), Table 14 (RTE), Table 12 (WiC), Table 13 (WSC), Table 15 (MRPC), and Table 16 (QQP).
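A minimal sketch of loading those pinned versions through TensorFlow Datasets follows; the config strings mirror the TFDS catalog pages in footnotes 16–19 and may vary across TFDS releases.

```python
import tensorflow_datasets as tfds

# GLUE MRPC at the pinned 1.0.0 version, with the splits that ship with TFDS.
mrpc_train, mrpc_val, mrpc_test = tfds.load(
    "glue/mrpc:1.0.0", split=["train", "validation", "test"])

# SuperGLUE BoolQ at 1.0.2.
boolq_train = tfds.load("super_glue/boolq:1.0.2", split="train")

# SQuAD v1.1 at dataset version 3.0.0.
squad_train = tfds.load("squad/v1.1:3.0.0", split="train")
```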

The question answering datasets are extractive datasets with a variety of answers, so there isn't a label distribution to report. Similarly, the ReCoRD dataset is a multiple-choice dataset where the model must predict the masked-out entity from a list of possible entities. Due to this formulation, there isn't a meaningful label distribution.

We followed the open-source T5 preprocessing procedure21 for each dataset, except that we omit the dataset prefix denoting which SuperGLUE dataset an example belongs to. For the SQuAD and

16 https://www.tensorflow.org/datasets/catalog/glue#gluemrpc

17 https://www.tensorflow.org/datasets/catalog/glue#glueqqp

18 https://www.tensorflow.org/datasets/catalog/super_glue

19 https://www.tensorflow.org/datasets/catalog/squad#squadv11_default_config

20 https://github.com/mrqa/MRQA-Shared-Task-2019#out-of-domain

21 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py

T5 Size  Prompt Length  Trainable Parameters  Total Parameters  Percent Trainable
Small    1              512                   76,961,664        0.00067
Small    5              2,560                 76,963,712        0.00333
Small    20             10,420                76,971,572        0.01330
Small    50             25,600                76,986,752        0.03325
Small    100            51,200                77,012,352        0.06648
Small    150            76,800                77,037,952        0.09969
Base     1              768                   247,578,624       0.00031
Base     5              3,840                 247,581,696       0.00155
Base     20             15,360                247,593,216       0.00620
Base     50             38,400                247,616,256       0.01551
Base     100            76,800                247,654,656       0.03101
Base     150            115,200               247,693,056       0.04651
Large    1              1,024                 783,151,104       0.00013
Large    5              5,120                 783,155,200       0.00065
Large    20             20,480                783,170,560       0.00262
Large    50             51,200                783,201,280       0.00654
Large    100            102,400               783,252,480       0.01907
Large    150            153,600               783,303,680       0.01961
XL       1              2,048                 2,849,759,232     0.00007
XL       5              10,240                2,849,767,424     0.00036
XL       20             40,960                2,849,798,144     0.00143
XL       50             102,400               2,849,859,584     0.00359
XL       100            204,800               2,849,961,984     0.00718
XL       150            307,200               2,850,064,384     0.01078
XXL      1              4,096                 11,135,336,448    0.00004
XXL      5              20,480                11,135,352,832    0.00018
XXL      20             81,920                11,135,414,272    0.00074
XXL      50             204,800               11,137,380,352    0.00184
XXL      100            409,600               11,135,741,952    0.00368
XXL      150            614,400               11,135,946,752    0.00552

Table 4: Number of parameters used for various prompt lengths and T5 model sizes. Trainable parameters is the number of parameters in the prompt itself, while total parameters includes the prompt plus the original T5 parameters. The T5 parameters are frozen and shared across all tasks, and include the SentencePiece lookup table parameters. The final column is the percentage of total parameters that are trainable.

MRQA datasets, we used the T5 SQuAD preprocessing code.22 By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.
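A rough sketch of this recast is shown below. The field names assume the TFDS SuperGLUE WSC schema; the preprocessing we actually ran is the open-source T5 preprocessor referenced in footnote 22, and per the note above, no dataset prefix is prepended.

```python
def wsc_as_generation(example):
    """Recast a WSC example as free-form generation: mark the pronoun in the passage
    and use the supplied referent as the generation target. Examples whose supplied
    referent is wrong (label == 0) have no usable target and are dropped.

    Field names assume the TFDS SuperGLUE WSC schema (text, span1_text, span2_index, label);
    the preprocessing used in our experiments is the open-source T5 preprocessor.
    """
    if example["label"] != 1:
        return None  # incorrect referent: nothing for the model to generate
    words = example["text"].split()
    pronoun_index = example["span2_index"]                    # word index of the pronoun
    words[pronoun_index] = "*" + words[pronoun_index] + "*"   # highlight the span
    # No "wsc:" dataset prefix, matching the modification described above.
    return {"inputs": " ".join(words), "targets": example["span1_text"]}
```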

No new data was collected for this work.

22 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length  T5 Size  Time
1              Large    3:17 ±02:10
1              XL       3:37 ±02:11
1              XXL      21:23 ±01:54
20             XL       49:08 ±18:53
20             XXL      53:03 ±16:25
50             Small    09:05 ±05:07
50             Base     55:01 ±27:48
50             Large    114:16 ±13:12
50             XL       230:10 ±25:40
50             XXL      313:13 ±23:08
100            Small    16:25 ±01:15
100            Base     29:57 ±00:18
100            Large    123:36 ±10:21
100            XL       335:00 ±54:42
100            XXL      351:15 ±45:53

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.
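One way to read that convergence criterion, as a sketch (assuming "within 1%" is a relative tolerance around the configuration's mean final score):

```python
def steps_to_convergence(eval_scores, final_mean, rel_tolerance=0.01):
    """Return the index of the first evaluation whose score is within
    `rel_tolerance` (relative) of the configuration's mean final score,
    or None if the run never gets that close."""
    for step, score in enumerate(eval_scores):
        if abs(score - final_mean) <= rel_tolerance * abs(final_mean):
            return step
    return None
```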

Hyperparameter     Search Space
Learning Rate      0.001–0.5
Parameter Scaling  True, False
Batch Size         32, 64, 126, 256, 512
Number of Steps    10,000, 20,000, 30,000
Warmup Steps       off, 2,000, 3,000
Decay Factor       off, 0.1, 0.5
Steps per Decay    off, 4,000, 6,000, 8,000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every “Steps per Decay” steps.
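Read literally, the warmup and decay knobs in Table 6 describe a schedule like the sketch below; in our runs these are configured through the optimizer settings rather than a hand-written function, and "off" corresponds to passing None here.

```python
def learning_rate(step, base_lr, warmup_steps=None, decay_factor=None, steps_per_decay=None):
    """Learning-rate schedule matching the Table 6 knobs: an optional linear warmup from 0
    to `base_lr`, followed by an optional step decay that multiplies the rate by
    `decay_factor` every `steps_per_decay` steps."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps        # linear ramp from zero
    lr = base_lr
    if decay_factor and steps_per_decay:
        decayed_steps = step - (warmup_steps or 0)
        lr = base_lr * decay_factor ** (decayed_steps // steps_per_decay)
    return lr


# Example: 0.3 peak rate, 2,000 warmup steps, halve the rate every 4,000 steps.
print(learning_rate(2500, base_lr=0.3, warmup_steps=2000,
                    decay_factor=0.5, steps_per_decay=4000))
```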

Dataset      Training  Validation  Testing
BoolQ        9,427     3,270       3,245
CB           250       56          250
COPA         400       100         500
MultiRC      27,243    4,848       9,693
ReCoRD       100,730   10,000      10,000
RTE          2,490     277         3,000
WiC          5,428     638         1,400
WSC          259*      104         146
MRPC         3,668     408         1,725
QQP          363,849   40,430      390,965
SQuAD        87,599    10,570      -
TextbookQA   -         1,504       -
RACE         -         1,503       -
BioASQ       -         1,501       -
RE           -         674         -
DuoRC        -         2,948       -
DROP         -         1,503       -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This means our training dataset is smaller than the normal WSC training dataset, which has 554 examples.

Split        False  True
Training     37.7   62.3
Validation   37.8   62.2

Table 8: Label distribution for the BoolQ dataset.

Split        contradiction  entailment  neutral
Training     47.6           46.0        6.4
Validation   50.0           41.1        8.9

Table 9: Label distribution for the CB dataset.

Split        choice1  choice2
Training     48.8     51.2
Validation   55.0     45.0

Table 10: Label distribution for the COPA dataset.

Split        False  True
Training     55.9   44.1
Validation   57.2   42.8

Table 11: Label distribution for the MultiRC dataset.

Split        False  True
Training     50.0   50.0
Validation   50.0   50.0

Table 12: Label distribution for the WiC dataset.

Split        False  True
Training     0.0    100.0
Validation   63.5   36.5

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task where the model generates the referent of the highlighted span instead of predicting whether the supplied entity is the correct referent of the highlighted span. Thus, we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split        entailment  not_entailment
Training     51.2        49.8
Validation   52.7        47.3

Table 14: Label distribution for the RTE dataset.

Split        equivalent  not_equivalent
Training     67.4        32.6
Validation   68.4        31.6

Table 15: Label distribution for the MRPC dataset.

Split        duplicate  not_duplicate
Training     36.9       63.1
Validation   36.8       63.2

Table 16: Label distribution for the QQP dataset.

James Bradbury Roy Frostig Peter HawkinsMatthew James Johnson Chris Leary DougalMaclaurin George Necula Adam Paszke JakeVanderPlas Skye Wanderman-Milne and QiaoZhang 2018 JAX composable transformations ofPython+NumPy programs

Tom Brown Benjamin Mann Nick Ryder MelanieSubbiah Jared D Kaplan Prafulla DhariwalArvind Neelakantan Pranav Shyam Girish SastryAmanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan RewonChild Aditya Ramesh Daniel Ziegler Jeffrey WuClemens Winter Chris Hesse Mark Chen EricSigler Mateusz Litwin Scott Gray Benjamin ChessJack Clark Christopher Berner Sam McCandlishAlec Radford Ilya Sutskever and Dario Amodei2020 Language models are few-shot learners InAdvances in Neural Information Processing Systemsvolume 33 pages 1877ndash1901 Curran AssociatesInc

Christopher Clark Kenton Lee Ming-Wei ChangTom Kwiatkowski Michael Collins and KristinaToutanova 2019 BoolQ Exploring the surprisingdifficulty of natural yesno questions In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics Human Language Technologies Volume 1(Long and Short Papers) pages 2924ndash2936 Min-neapolis Minnesota Association for ComputationalLinguistics

Ido Dagan Oren Glickman and Bernardo Magnini2005 The PASCAL recognising textual entailmentchallenge In Machine Learning Challenges Work-shop pages 177ndash190 Springer

Marie-Catherine De Marneff Mandy Simons and Ju-dith Tonhauser 2019 The CommitmentBank In-vestigating projection in naturally occurring dis-course Proceedings of Sinn und Bedeutung 23

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

William B Dolan and Chris Brockett 2005 Automati-cally constructing a corpus of sentential paraphrasesIn Proceedings of the Third International Workshopon Paraphrasing (IWP2005)

Dheeru Dua Yizhong Wang Pradeep Dasigi GabrielStanovsky Sameer Singh and Matt Gardner 2019DROP A reading comprehension benchmark requir-ing discrete reasoning over paragraphs In Proceed-ings of the 2019 Conference of the North AmericanChapter of the Association for Computational Lin-guistics Human Language Technologies Volume 1

(Long and Short Papers) pages 2368ndash2378 Min-neapolis Minnesota Association for ComputationalLinguistics

Adam Fisch Alon Talmor Robin Jia Minjoon Seo Eu-nsol Choi and Danqi Chen 2019 MRQA 2019shared task Evaluating generalization in readingcomprehension In Proceedings of 2nd MachineReading for Reading Comprehension (MRQA) Work-shop at EMNLP

Danilo Giampiccolo Bernardo Magnini Ido Daganand Bill Dolan 2007 The third PASCAL recog-nizing textual entailment challenge In Proceedingsof the ACL-PASCAL workshop on textual entailmentand paraphrasing pages 1ndash9 Association for Com-putational Linguistics

Karen Hambardzumyan Hrant Khachatrian andJonathan May 2021 WARP Word-level Adversar-ial ReProgramming In Proceedings of the 59th An-nual Meeting of the Association for ComputationalLinguistics and the 11th International Joint Confer-ence on Natural Language Processing (Volume 1Long Papers) pages 4921ndash4933 Online Associa-tion for Computational Linguistics

L K Hansen and P Salamon 1990 Neural networkensembles IEEE Transactions on Pattern Analysisand Machine Intelligence 12(10)993ndash1001

Jonathan Heek Anselm Levskaya Avital Oliver Mar-vin Ritter Bertrand Rondepierre Andreas Steinerand Marc van Zee 2020 Flax A neural networklibrary and ecosystem for JAX

Neil Houlsby Andrei Giurgiu Stanislaw JastrzebskiBruna Morrone Quentin De Laroussilhe AndreaGesmundo Mona Attariyan and Sylvain Gelly2019 Parameter-efficient transfer learning for NLPIn Proceedings of the 36th International Conferenceon Machine Learning volume 97 of Proceedingsof Machine Learning Research pages 2790ndash2799PMLR

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Shankar Iyer Nikhil Dandekar and Kornel Csernai2017 First Quora dataset release Question pairs

Zhengbao Jiang Frank F Xu Jun Araki and GrahamNeubig 2020 How can we know what languagemodels know Transactions of the Association forComputational Linguistics 8423ndash438

A Kembhavi M Seo D Schwenk J Choi A Farhadiand H Hajishirzi 2017 Are you smarter than asixth grader textbook question answering for multi-modal machine comprehension In 2017 IEEE Con-ference on Computer Vision and Pattern Recognition(CVPR) pages 5376ndash5384

Daniel Khashabi Snigdha Chaturvedi Michael RothShyam Upadhyay and Dan Roth 2018 Looking be-yond the surface A challenge set for reading com-prehension over multiple sentences In Proceedingsof North American Chapter of the Association forComputational Linguistics (NAACL)

Vid Kocijan Ana-Maria Cretu Oana-Maria CamburuYordan Yordanov and Thomas Lukasiewicz 2019A surprisingly robust trick for the Winograd schemachallenge In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguisticspages 4837ndash4842 Florence Italy Association forComputational Linguistics

Taku Kudo and John Richardson 2018 SentencePieceA simple and language independent subword tok-enizer and detokenizer for neural text processing InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing SystemDemonstrations pages 66ndash71 Brussels BelgiumAssociation for Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Balaji Lakshminarayanan Alexander Pritzel andCharles Blundell 2017 Simple and scalable predic-tive uncertainty estimation using deep ensembles InAdvances in Neural Information Processing Systemsvolume 30 Curran Associates Inc

Hector Levesque Ernest Davis and Leora Morgen-stern 2012 The Winograd schema challenge InThirteenth International Conference on the Princi-ples of Knowledge Representation and Reasoning

Omer Levy Minjoon Seo Eunsol Choi and LukeZettlemoyer 2017 Zero-shot relation extraction viareading comprehension In Proceedings of the 21stConference on Computational Natural LanguageLearning (CoNLL 2017) pages 333ndash342 Vancou-ver Canada Association for Computational Linguis-tics

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Veselin Stoyanov and Luke Zettlemoyer2020 BART Denoising sequence-to-sequence pre-training for natural language generation translationand comprehension In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 7871ndash7880 Online Associationfor Computational Linguistics

Xiang Lisa Li and Percy Liang 2021 Prefix-tuningOptimizing continuous prompts for generation InProceedings of the 59th Annual Meeting of theAssociation for Computational Linguistics and the11th International Joint Conference on Natural Lan-guage Processing (Volume 1 Long Papers) pages

4582ndash4597 Online Association for ComputationalLinguistics

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021 GPTunderstands too CoRR abs210310385

Lajanugen Logeswaran Ann Lee Myle Ott HonglakLee MarcrsquoAurelio Ranzato and Arthur Szlam2020 Few-shot sequence learning with transform-ers CoRR abs201209543

Vinod Nair and Geoffrey E Hinton 2010 Rectifiedlinear units improve restricted Boltzmann machinesIn Proceedings of the 27th International Conferenceon International Conference on Machine LearningICMLrsquo10 page 807ndash814 Madison WI USA Om-nipress

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Jonas Pfeiffer Ivan Vulic Iryna Gurevych and Se-bastian Ruder 2020 MAD-X An Adapter-BasedFramework for Multi-Task Cross-Lingual TransferIn Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP)pages 7654ndash7673 Online Association for Computa-tional Linguistics

Mohammad Taher Pilehvar and Jose Camacho-Collados 2018 WiC 10000 example pairs forevaluating context-sensitive representations CoRRabs180809121

Guanghui Qin and Jason Eisner 2021 Learning howto ask Querying LMs with mixtures of soft promptsIn Proceedings of the 2021 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics Human Language Technologiespages 5203ndash5212 Online Association for Compu-tational Linguistics

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners OpenAIBlog

Colin Raffel Noam Shazeer Adam Roberts Kather-ine Lee Sharan Narang Michael Matena YanqiZhou Wei Li and Peter J Liu 2020 Exploringthe limits of transfer learning with a unified text-to-text transformer Journal of Machine Learning Re-search 21(140)1ndash67

Pranav Rajpurkar Jian Zhang Konstantin Lopyrev andPercy Liang 2016 SQuAD 100000+ questions formachine comprehension of text In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing pages 2383ndash2392 AustinTexas Association for Computational Linguistics

Sylvestre-Alvise Rebuffi Hakan Bilen and AndreaVedaldi 2017 Learning multiple visual domainswith residual adapters In Advances in Neural Infor-mation Processing Systems volume 30 Curran As-sociates Inc

Melissa Roemmele Cosmin Adrian Bejan and An-drew S Gordon 2011 Choice of plausible alterna-tives An evaluation of commonsense causal reason-ing In 2011 AAAI Spring Symposium Series

Amrita Saha Rahul Aralikatte Mitesh M Khapra andKarthik Sankaranarayanan 2018 DuoRC Towardscomplex language understanding with paraphrasedreading comprehension In Proceedings of the 56thAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1 Long Papers) pages1683ndash1693 Melbourne Australia Association forComputational Linguistics

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of the16th Conference of the European Chapter of the As-sociation for Computational Linguistics Main Vol-ume pages 255ndash269 Online Association for Com-putational Linguistics

Noam Shazeer 2020 GLU variants improve trans-former CoRR abs200205202

Noam Shazeer and Mitchell Stern 2018 AdafactorAdaptive learning rates with sublinear memory costIn Proceedings of the 35th International Conferenceon Machine Learning volume 80 of Proceedingsof Machine Learning Research pages 4596ndash4604PMLR

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutoPromptEliciting Knowledge from Language Models withAutomatically Generated Prompts In Proceed-ings of the 2020 Conference on Empirical Methodsin Natural Language Processing (EMNLP) pages4222ndash4235 Online Association for ComputationalLinguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008

Alex Wang Yada Pruksachatkun Nikita NangiaAmanpreet Singh Julian Michael Felix Hill OmerLevy and Samuel Bowman 2019a SuperGLUE Astickier benchmark for general-purpose language un-derstanding systems In Advances in Neural Infor-mation Processing Systems volume 32 Curran As-sociates Inc

Alex Wang Amanpreet Singh Julian Michael FelixHill Omer Levy and Samuel R Bowman 2019bGLUE A multi-task benchmark and analysis plat-form for natural language understanding In the Pro-ceedings of ICLR

Michael L Waskom 2021 seaborn statistical datavisualization Journal of Open Source Software6(60)3021

Sheng Zhang Xiaodong Liu Jingjing Liu JianfengGao Kevin Duh and Benjamin Van Durme 2018ReCoRD Bridging the gap between human and ma-chine commonsense reading comprehension CoRRabs181012885

A Reproducibility

A1 Experimental Settings

We evaluate each GLUE and SuperGLUE datasetusing the metric specified in the benchmark Wereuse the evaluation code from the publicly avail-able T5 open-source release to compute metrics12

For the SQuAD and MRQA datasets we evaluateusing F1 one of the metrics used by the SQuADbenchmark where partial answer spans are consid-ered Again we use the T5 open-source release formetric calculation13 All of our models use T5 11as the base frozen model additional details and pre-trained checkpoints can be found on GitHub1415

All prompts for T5 Small and Base models weretrained on 4 TPU v2 chips while prompts for largermodels were trained on 16 TPU v3 chips

Parameter counts for each prompt can be foundin Table 4 Average runtimes until convergence canbe found in Table 5

A2 Hyperparameter Search

This work used 77 hyperparameter search trials(40 for prompt tuning and 37 for single-task modeltuning) and 3 training runs (with validation evalu-ation) for each baseline configuration and ablationsetting for a total of 195 runs for our main resultand ablations There were an additional 18 runs forthe domain shift experiments and 24 extra runs tocreate the ensemble Hyperparameter bounds canbe found in Table 6 Hyperparameter tuning wasdone via manual tuning and settings were selectedbased on the SuperGLUE score All experiments inthis work outside of the hyperparameter being ab-lated use our default configuration of 100K stepsof LM Adapation a prompt length of 100 andldquoclass-labelrdquo initialization

All graphs of our experimental results plot themean and standard deviation over 3 runs as com-puted by Seaborn (Waskom 2021) Some settingshave such low variance that the standard deviation

12httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspy

13httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspyL151

14httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmasterreleased_checkpointsmdt511

15httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmainreleased_checkpointsmdlm-adapted-t511lm100k

is hidden behind the line itself such as ldquoModel Tun-ing (Multi-task)rdquo in Figure 1 and the Base Largeand XL prompts trained on the ldquoSpan Corruptionrdquopretraining objective in Figure 3(b) Figure 4 alsoshows mean and standard deviation for the num-ber of parameters each method uses as the promptlength varies from 1ndash100 The ldquoPrefix Tuning(Train)rdquo curves appears to have no standard de-viation because the parameter count is so stronglydominated by the cost of the reparameterizationparameters that the standard deviation bands areoccluded For our experiments on domain transferwe report mean and standard deviation over 3 runs

A3 Datasets

All datasets used are in English For the GLUE1617

and SuperGLUE18 datasets we used the trainingvalidation and test splits that ship with TensorFlowDatasets We used version 100 for GLUE and102 for SuperGLUE datasets For SQuAD19

we used v11300 from Tensorflow Datasetsand follow the provided training validation andtest splits For the out-of-domain datasets we usedthe development splits distributed as part of theMRQA shared task20 Dataset sizes can be foundin Table 7 The label distributions for each datasetcan be found in Table 8 (BoolQ) Table 9 (CB)Table 10 (COPA) Table 11 (MultiRC) Table 14(RTE) Table 12 (WiC) Table 13 (WSC) Table 15(MRPC) and Table 16 (QQP)

The question answering datasets are extractivedatasets with a variety of answers so there isnrsquot alabel distribution to report Similarly the ReCoRDdataset is a multiple choice dataset where the modelmust predict the masked out entity from a list ofpossible entities Due to this formulation there isnrsquota meaningful label distribution

We followed the open-source T5 preprocess-ing procedure21 for each dataset except that weomit the dataset prefix denoting which SuperGLUEdataset an example belongs to For the SQuAD and

16httpswwwtensorfloworgdatasetscataloggluegluemrpc

17httpswwwtensorfloworgdatasetscatalogglueglueqqp

18httpswwwtensorfloworgdatasetscatalogsuper_glue

19httpswwwtensorfloworgdatasetscatalogsquadsquadv11_default_config

20httpsgithubcommrqaMRQA-Shared-Task-2019out-of-domain

21httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspy

T5 Size Prompt Length Trainable Parameters Total Parameters Percent Trainable

Small 1 512 76961664 0000675 2560 76963712 000333

20 10420 76971572 00133050 25600 76986752 003325

100 51200 77012352 006648150 76800 77037952 009969

Base 1 768 247578624 0000315 3840 247581696 000155

20 15360 247593216 00062050 38400 247616256 001551

100 76800 247654656 003101150 115200 247693056 004651

Large 1 1024 783151104 0000135 5120 783155200 000065

20 20480 783170560 00026250 51200 783201280 000654

100 102400 783252480 001907150 153600 783303680 001961

XL 1 2048 2849759232 0000075 10240 2849767424 000036

20 40960 2849798144 00014350 102400 2849859584 000359

100 204800 2849961984 000718150 307200 2850064384 001078

XXL 1 4096 11135336448 0000045 20480 11135352832 000018

20 81920 11135414272 00007450 204800 11137380352 000184

100 409600 11135741952 000368150 614400 11135946752 000552

Table 4 Number of parameters used for various prompt lengths and T5 model sizes Trainable parameters isthe number of parameters in the prompt itself while total parameters includes the prompt plus the original T5parameters The T5 parameters are frozen and shared across all tasks and include the SentencePiece lookup tableparameters The final column is the percentage of total parameters that are trainable

MRQA datasets we used the T5 SQuAD prepro-cessing code22 By following the T5 preprocessingand text-to-text format we recast the WSC datasetas a text generation task Instead of predictingwhether a supplied referent is correct for a high-lighted span our model predicts the correct referentdirectly As such we can only learn from trainingexamples where the referent is correct so WSCtraining data where the supplied referent is incor-rect are omitted

No new data was collected for this work

22httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspyL264

Prompt Length T5 Size Time

1 Large 317 plusmn0210XL 337 plusmn0211XXL 2123 plusmn0154

20 XL 4908 plusmn1853XXL 5303 plusmn1625

50 Small 0905 plusmn0507Base 5501 plusmn2748Large 11416 plusmn1312XL 23010 plusmn2540XXL 31313 plusmn2308

100 Small 1625 plusmn0115Base 2957 plusmn0018Large 12336 plusmn1021XL 33500 plusmn5442XXL 35115 plusmn4553

Table 5 Mean and standard deviation of the runtimeuntil convergence for the BoolQ dataset and variousprompt lengths and model sizes Convergence is de-fined as reaching a performance within 1 of the meanvalue for that model configuration A few configura-tions have been omitted because their runtimes wereartificially extended due to preemption

Hyperparameter Search Space

Learning Rate 0001ndash05Parameter Scaling True FalseBatch Size 32 64 126 256 512Number of Steps 10000 20000 30000Warmup Steps off 2000 3000Decay Factor off 01 05Steps per Decay off 4000 6000 8000

Table 6 Search space for each hyperparameter consid-ered Parameter Scaling refers to the Adafactor settingwhere an update is scaled by the norm of the parameterit will be applied to Warmup Steps is the number ofsteps before a linearly increasing learning rate reachesthe Learning Rate value starting from zero Decay Fac-tor is the reduction in Learning Rate size that occursevery ldquoSteps per Decayrdquo steps

Dataset Training Validation Testing

BoolQ 9427 3270 3245CB 250 56 250COPA 400 100 500MultiRC 27243 4848 9693ReCoRD 100730 10000 10000RTE 2490 277 3000WiC 5428 638 1400WSC 259lowast 104 146MRPC 3668 408 1725QQP 363849 40430 390965SQuAD 87599 10570 -TextbookQA - 1504 -RACE - 1503 -BioASQ - 1501 -RE - 674 -DuoRC - 2948 -DROP - 1503 -

Table 7 Sizes for training validation and testing splitsof each dataset used lowastFollowing T5 our casting ofWSC as a text generation problems means we can onlytrain on examples where the supplied referent is correctThis means our training dataset is smaller than the nor-mal WSC training dataset which has 554 examples

Split False True

Training 377 623Validation 378 622

Table 8 Label distribution for the BoolQ dataset

Split contradiction entailment neutral

Training 476 460 64Validation 500 411 89

Table 9 Label distribution for the CB dataset

Split choice1 choice2

Training 488 512Validation 550 450

Table 10 Label distribution for the COPA dataset

Split False True

Training 559 441Validation 572 428

Table 11 Label distribution for the MultiRC dataset

Split False True

Training 500 500Validation 500 500

Table 12 Label distribution for the WiC dataset

Split False True

Training 00 1000Validation 635 365

Table 13 Label distribution for the WSC dataset Fol-lowing T5 we cast the WSC dataset to a free-form textgeneration task where the model generates the referentto the highlighted span instead predicting if the sup-plied entity is the correct referent of the highlightedspan Thus we only use training data where the sup-plied referent is correct making our training label dis-tribution focused entirely on True

Split entailment not_entailment

Training 512 498Validation 527 473

Table 14 Label distribution for the RTE dataset

Split equivalent not_equivalent

Training 674 326Validation 684 316

Table 15 Label distribution for the MRPC dataset

Split duplicate not_duplicate

Training 369 631Validation 368 632

Table 16 Label distribution for the QQP dataset

Daniel Khashabi Snigdha Chaturvedi Michael RothShyam Upadhyay and Dan Roth 2018 Looking be-yond the surface A challenge set for reading com-prehension over multiple sentences In Proceedingsof North American Chapter of the Association forComputational Linguistics (NAACL)

Vid Kocijan Ana-Maria Cretu Oana-Maria CamburuYordan Yordanov and Thomas Lukasiewicz 2019A surprisingly robust trick for the Winograd schemachallenge In Proceedings of the 57th Annual Meet-ing of the Association for Computational Linguisticspages 4837ndash4842 Florence Italy Association forComputational Linguistics

Taku Kudo and John Richardson 2018 SentencePieceA simple and language independent subword tok-enizer and detokenizer for neural text processing InProceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing SystemDemonstrations pages 66ndash71 Brussels BelgiumAssociation for Computational Linguistics

Guokun Lai Qizhe Xie Hanxiao Liu Yiming Yangand Eduard Hovy 2017 RACE Large-scale ReAd-ing comprehension dataset from examinations InProceedings of the 2017 Conference on EmpiricalMethods in Natural Language Processing pages785ndash794 Copenhagen Denmark Association forComputational Linguistics

Balaji Lakshminarayanan Alexander Pritzel andCharles Blundell 2017 Simple and scalable predic-tive uncertainty estimation using deep ensembles InAdvances in Neural Information Processing Systemsvolume 30 Curran Associates Inc

Hector Levesque Ernest Davis and Leora Morgen-stern 2012 The Winograd schema challenge InThirteenth International Conference on the Princi-ples of Knowledge Representation and Reasoning

Omer Levy Minjoon Seo Eunsol Choi and LukeZettlemoyer 2017 Zero-shot relation extraction viareading comprehension In Proceedings of the 21stConference on Computational Natural LanguageLearning (CoNLL 2017) pages 333ndash342 Vancou-ver Canada Association for Computational Linguis-tics

Mike Lewis Yinhan Liu Naman Goyal Mar-jan Ghazvininejad Abdelrahman Mohamed OmerLevy Veselin Stoyanov and Luke Zettlemoyer2020 BART Denoising sequence-to-sequence pre-training for natural language generation translationand comprehension In Proceedings of the 58th An-nual Meeting of the Association for ComputationalLinguistics pages 7871ndash7880 Online Associationfor Computational Linguistics

Xiang Lisa Li and Percy Liang 2021 Prefix-tuningOptimizing continuous prompts for generation InProceedings of the 59th Annual Meeting of theAssociation for Computational Linguistics and the11th International Joint Conference on Natural Lan-guage Processing (Volume 1 Long Papers) pages

4582ndash4597 Online Association for ComputationalLinguistics

Xiao Liu Yanan Zheng Zhengxiao Du Ming DingYujie Qian Zhilin Yang and Jie Tang 2021 GPTunderstands too CoRR abs210310385

Lajanugen Logeswaran Ann Lee Myle Ott HonglakLee MarcrsquoAurelio Ranzato and Arthur Szlam2020 Few-shot sequence learning with transform-ers CoRR abs201209543

Vinod Nair and Geoffrey E Hinton 2010 Rectifiedlinear units improve restricted Boltzmann machinesIn Proceedings of the 27th International Conferenceon International Conference on Machine LearningICMLrsquo10 page 807ndash814 Madison WI USA Om-nipress

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Jonas Pfeiffer Ivan Vulic Iryna Gurevych and Se-bastian Ruder 2020 MAD-X An Adapter-BasedFramework for Multi-Task Cross-Lingual TransferIn Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP)pages 7654ndash7673 Online Association for Computa-tional Linguistics

Mohammad Taher Pilehvar and Jose Camacho-Collados 2018 WiC 10000 example pairs forevaluating context-sensitive representations CoRRabs180809121

Guanghui Qin and Jason Eisner 2021 Learning howto ask Querying LMs with mixtures of soft promptsIn Proceedings of the 2021 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics Human Language Technologiespages 5203ndash5212 Online Association for Compu-tational Linguistics

Alec Radford Karthik Narasimhan Tim Salimans andIlya Sutskever 2018 Improving language under-standing by generative pre-training

Alec Radford Jeff Wu Rewon Child David LuanDario Amodei and Ilya Sutskever 2019 Languagemodels are unsupervised multitask learners OpenAIBlog

Colin Raffel Noam Shazeer Adam Roberts Kather-ine Lee Sharan Narang Michael Matena YanqiZhou Wei Li and Peter J Liu 2020 Exploringthe limits of transfer learning with a unified text-to-text transformer Journal of Machine Learning Re-search 21(140)1ndash67

Pranav Rajpurkar Jian Zhang Konstantin Lopyrev andPercy Liang 2016 SQuAD 100000+ questions formachine comprehension of text In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing pages 2383ndash2392 AustinTexas Association for Computational Linguistics

Sylvestre-Alvise Rebuffi Hakan Bilen and AndreaVedaldi 2017 Learning multiple visual domainswith residual adapters In Advances in Neural Infor-mation Processing Systems volume 30 Curran As-sociates Inc

Melissa Roemmele Cosmin Adrian Bejan and An-drew S Gordon 2011 Choice of plausible alterna-tives An evaluation of commonsense causal reason-ing In 2011 AAAI Spring Symposium Series

Amrita Saha Rahul Aralikatte Mitesh M Khapra andKarthik Sankaranarayanan 2018 DuoRC Towardscomplex language understanding with paraphrasedreading comprehension In Proceedings of the 56thAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1 Long Papers) pages1683ndash1693 Melbourne Australia Association forComputational Linguistics

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of the16th Conference of the European Chapter of the As-sociation for Computational Linguistics Main Vol-ume pages 255ndash269 Online Association for Com-putational Linguistics

Noam Shazeer 2020 GLU variants improve trans-former CoRR abs200205202

Noam Shazeer and Mitchell Stern 2018 AdafactorAdaptive learning rates with sublinear memory costIn Proceedings of the 35th International Conferenceon Machine Learning volume 80 of Proceedingsof Machine Learning Research pages 4596ndash4604PMLR

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutoPromptEliciting Knowledge from Language Models withAutomatically Generated Prompts In Proceed-ings of the 2020 Conference on Empirical Methodsin Natural Language Processing (EMNLP) pages4222ndash4235 Online Association for ComputationalLinguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008

Alex Wang Yada Pruksachatkun Nikita NangiaAmanpreet Singh Julian Michael Felix Hill OmerLevy and Samuel Bowman 2019a SuperGLUE Astickier benchmark for general-purpose language un-derstanding systems In Advances in Neural Infor-mation Processing Systems volume 32 Curran As-sociates Inc

Alex Wang Amanpreet Singh Julian Michael FelixHill Omer Levy and Samuel R Bowman 2019bGLUE A multi-task benchmark and analysis plat-form for natural language understanding In the Pro-ceedings of ICLR

Michael L Waskom 2021 seaborn statistical datavisualization Journal of Open Source Software6(60)3021

Sheng Zhang Xiaodong Liu Jingjing Liu JianfengGao Kevin Duh and Benjamin Van Durme 2018ReCoRD Bridging the gap between human and ma-chine commonsense reading comprehension CoRRabs181012885

A Reproducibility

A1 Experimental Settings

We evaluate each GLUE and SuperGLUE datasetusing the metric specified in the benchmark Wereuse the evaluation code from the publicly avail-able T5 open-source release to compute metrics12

For the SQuAD and MRQA datasets we evaluateusing F1 one of the metrics used by the SQuADbenchmark where partial answer spans are consid-ered Again we use the T5 open-source release formetric calculation13 All of our models use T5 11as the base frozen model additional details and pre-trained checkpoints can be found on GitHub1415

All prompts for T5 Small and Base models weretrained on 4 TPU v2 chips while prompts for largermodels were trained on 16 TPU v3 chips

Parameter counts for each prompt can be foundin Table 4 Average runtimes until convergence canbe found in Table 5

A2 Hyperparameter Search

This work used 77 hyperparameter search trials(40 for prompt tuning and 37 for single-task modeltuning) and 3 training runs (with validation evalu-ation) for each baseline configuration and ablationsetting for a total of 195 runs for our main resultand ablations There were an additional 18 runs forthe domain shift experiments and 24 extra runs tocreate the ensemble Hyperparameter bounds canbe found in Table 6 Hyperparameter tuning wasdone via manual tuning and settings were selectedbased on the SuperGLUE score All experiments inthis work outside of the hyperparameter being ab-lated use our default configuration of 100K stepsof LM Adapation a prompt length of 100 andldquoclass-labelrdquo initialization

All graphs of our experimental results plot themean and standard deviation over 3 runs as com-puted by Seaborn (Waskom 2021) Some settingshave such low variance that the standard deviation

12httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspy

13httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspyL151

14httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmasterreleased_checkpointsmdt511

15httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmainreleased_checkpointsmdlm-adapted-t511lm100k

is hidden behind the line itself such as ldquoModel Tun-ing (Multi-task)rdquo in Figure 1 and the Base Largeand XL prompts trained on the ldquoSpan Corruptionrdquopretraining objective in Figure 3(b) Figure 4 alsoshows mean and standard deviation for the num-ber of parameters each method uses as the promptlength varies from 1ndash100 The ldquoPrefix Tuning(Train)rdquo curves appears to have no standard de-viation because the parameter count is so stronglydominated by the cost of the reparameterizationparameters that the standard deviation bands areoccluded For our experiments on domain transferwe report mean and standard deviation over 3 runs

A3 Datasets

All datasets used are in English For the GLUE1617

and SuperGLUE18 datasets we used the trainingvalidation and test splits that ship with TensorFlowDatasets We used version 100 for GLUE and102 for SuperGLUE datasets For SQuAD19

we used v11300 from Tensorflow Datasetsand follow the provided training validation andtest splits For the out-of-domain datasets we usedthe development splits distributed as part of theMRQA shared task20 Dataset sizes can be foundin Table 7 The label distributions for each datasetcan be found in Table 8 (BoolQ) Table 9 (CB)Table 10 (COPA) Table 11 (MultiRC) Table 14(RTE) Table 12 (WiC) Table 13 (WSC) Table 15(MRPC) and Table 16 (QQP)

The question answering datasets are extractivedatasets with a variety of answers so there isnrsquot alabel distribution to report Similarly the ReCoRDdataset is a multiple choice dataset where the modelmust predict the masked out entity from a list ofpossible entities Due to this formulation there isnrsquota meaningful label distribution

We followed the open-source T5 preprocess-ing procedure21 for each dataset except that weomit the dataset prefix denoting which SuperGLUEdataset an example belongs to For the SQuAD and

16httpswwwtensorfloworgdatasetscataloggluegluemrpc

17httpswwwtensorfloworgdatasetscatalogglueglueqqp

18httpswwwtensorfloworgdatasetscatalogsuper_glue

19httpswwwtensorfloworgdatasetscatalogsquadsquadv11_default_config

20httpsgithubcommrqaMRQA-Shared-Task-2019out-of-domain

21httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspy

T5 Size Prompt Length Trainable Parameters Total Parameters Percent Trainable

Small 1 512 76961664 0000675 2560 76963712 000333

20 10420 76971572 00133050 25600 76986752 003325

100 51200 77012352 006648150 76800 77037952 009969

Base 1 768 247578624 0000315 3840 247581696 000155

20 15360 247593216 00062050 38400 247616256 001551

100 76800 247654656 003101150 115200 247693056 004651

Large 1 1024 783151104 0000135 5120 783155200 000065

20 20480 783170560 00026250 51200 783201280 000654

100 102400 783252480 001907150 153600 783303680 001961

XL 1 2048 2849759232 0000075 10240 2849767424 000036

20 40960 2849798144 00014350 102400 2849859584 000359

100 204800 2849961984 000718150 307200 2850064384 001078

XXL 1 4096 11135336448 0000045 20480 11135352832 000018

20 81920 11135414272 00007450 204800 11137380352 000184

100 409600 11135741952 000368150 614400 11135946752 000552

Table 4 Number of parameters used for various prompt lengths and T5 model sizes Trainable parameters isthe number of parameters in the prompt itself while total parameters includes the prompt plus the original T5parameters The T5 parameters are frozen and shared across all tasks and include the SentencePiece lookup tableparameters The final column is the percentage of total parameters that are trainable

MRQA datasets we used the T5 SQuAD prepro-cessing code22 By following the T5 preprocessingand text-to-text format we recast the WSC datasetas a text generation task Instead of predictingwhether a supplied referent is correct for a high-lighted span our model predicts the correct referentdirectly As such we can only learn from trainingexamples where the referent is correct so WSCtraining data where the supplied referent is incor-rect are omitted

No new data was collected for this work

22httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspyL264

Prompt Length T5 Size Time

1 Large 317 plusmn0210XL 337 plusmn0211XXL 2123 plusmn0154

20 XL 4908 plusmn1853XXL 5303 plusmn1625

50 Small 0905 plusmn0507Base 5501 plusmn2748Large 11416 plusmn1312XL 23010 plusmn2540XXL 31313 plusmn2308

100 Small 1625 plusmn0115Base 2957 plusmn0018Large 12336 plusmn1021XL 33500 plusmn5442XXL 35115 plusmn4553

Table 5 Mean and standard deviation of the runtimeuntil convergence for the BoolQ dataset and variousprompt lengths and model sizes Convergence is de-fined as reaching a performance within 1 of the meanvalue for that model configuration A few configura-tions have been omitted because their runtimes wereartificially extended due to preemption

Hyperparameter Search Space

Learning Rate 0001ndash05Parameter Scaling True FalseBatch Size 32 64 126 256 512Number of Steps 10000 20000 30000Warmup Steps off 2000 3000Decay Factor off 01 05Steps per Decay off 4000 6000 8000

Table 6 Search space for each hyperparameter consid-ered Parameter Scaling refers to the Adafactor settingwhere an update is scaled by the norm of the parameterit will be applied to Warmup Steps is the number ofsteps before a linearly increasing learning rate reachesthe Learning Rate value starting from zero Decay Fac-tor is the reduction in Learning Rate size that occursevery ldquoSteps per Decayrdquo steps

Dataset Training Validation Testing

BoolQ 9427 3270 3245CB 250 56 250COPA 400 100 500MultiRC 27243 4848 9693ReCoRD 100730 10000 10000RTE 2490 277 3000WiC 5428 638 1400WSC 259lowast 104 146MRPC 3668 408 1725QQP 363849 40430 390965SQuAD 87599 10570 -TextbookQA - 1504 -RACE - 1503 -BioASQ - 1501 -RE - 674 -DuoRC - 2948 -DROP - 1503 -

Table 7 Sizes for training validation and testing splitsof each dataset used lowastFollowing T5 our casting ofWSC as a text generation problems means we can onlytrain on examples where the supplied referent is correctThis means our training dataset is smaller than the nor-mal WSC training dataset which has 554 examples

Split False True

Training 377 623Validation 378 622

Table 8 Label distribution for the BoolQ dataset

Split contradiction entailment neutral

Training 476 460 64Validation 500 411 89

Table 9 Label distribution for the CB dataset

Split choice1 choice2

Training 488 512Validation 550 450

Table 10 Label distribution for the COPA dataset

Split False True

Training 559 441Validation 572 428

Table 11 Label distribution for the MultiRC dataset

Split False True

Training 500 500Validation 500 500

Table 12 Label distribution for the WiC dataset

Split False True

Training 00 1000Validation 635 365

Table 13 Label distribution for the WSC dataset Fol-lowing T5 we cast the WSC dataset to a free-form textgeneration task where the model generates the referentto the highlighted span instead predicting if the sup-plied entity is the correct referent of the highlightedspan Thus we only use training data where the sup-plied referent is correct making our training label dis-tribution focused entirely on True

Split entailment not_entailment

Training 512 498Validation 527 473

Table 14 Label distribution for the RTE dataset

Split equivalent not_equivalent

Training 674 326Validation 684 316

Table 15 Label distribution for the MRPC dataset

Split duplicate not_duplicate

Training 369 631Validation 368 632

Table 16 Label distribution for the QQP dataset

Pranav Rajpurkar Jian Zhang Konstantin Lopyrev andPercy Liang 2016 SQuAD 100000+ questions formachine comprehension of text In Proceedings ofthe 2016 Conference on Empirical Methods in Natu-ral Language Processing pages 2383ndash2392 AustinTexas Association for Computational Linguistics

Sylvestre-Alvise Rebuffi Hakan Bilen and AndreaVedaldi 2017 Learning multiple visual domainswith residual adapters In Advances in Neural Infor-mation Processing Systems volume 30 Curran As-sociates Inc

Melissa Roemmele Cosmin Adrian Bejan and An-drew S Gordon 2011 Choice of plausible alterna-tives An evaluation of commonsense causal reason-ing In 2011 AAAI Spring Symposium Series

Amrita Saha Rahul Aralikatte Mitesh M Khapra andKarthik Sankaranarayanan 2018 DuoRC Towardscomplex language understanding with paraphrasedreading comprehension In Proceedings of the 56thAnnual Meeting of the Association for Computa-tional Linguistics (Volume 1 Long Papers) pages1683ndash1693 Melbourne Australia Association forComputational Linguistics

Timo Schick and Hinrich Schuumltze 2021 Exploitingcloze-questions for few-shot text classification andnatural language inference In Proceedings of the16th Conference of the European Chapter of the As-sociation for Computational Linguistics Main Vol-ume pages 255ndash269 Online Association for Com-putational Linguistics

Noam Shazeer 2020 GLU variants improve trans-former CoRR abs200205202

Noam Shazeer and Mitchell Stern 2018 AdafactorAdaptive learning rates with sublinear memory costIn Proceedings of the 35th International Conferenceon Machine Learning volume 80 of Proceedingsof Machine Learning Research pages 4596ndash4604PMLR

Taylor Shin Yasaman Razeghi Robert L Logan IVEric Wallace and Sameer Singh 2020 AutoPromptEliciting Knowledge from Language Models withAutomatically Generated Prompts In Proceed-ings of the 2020 Conference on Empirical Methodsin Natural Language Processing (EMNLP) pages4222ndash4235 Online Association for ComputationalLinguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez ŁukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008

Alex Wang Yada Pruksachatkun Nikita NangiaAmanpreet Singh Julian Michael Felix Hill OmerLevy and Samuel Bowman 2019a SuperGLUE Astickier benchmark for general-purpose language un-derstanding systems In Advances in Neural Infor-mation Processing Systems volume 32 Curran As-sociates Inc

Alex Wang Amanpreet Singh Julian Michael FelixHill Omer Levy and Samuel R Bowman 2019bGLUE A multi-task benchmark and analysis plat-form for natural language understanding In the Pro-ceedings of ICLR

Michael L Waskom 2021 seaborn statistical datavisualization Journal of Open Source Software6(60)3021

Sheng Zhang Xiaodong Liu Jingjing Liu JianfengGao Kevin Duh and Benjamin Van Durme 2018ReCoRD Bridging the gap between human and ma-chine commonsense reading comprehension CoRRabs181012885

A Reproducibility

A1 Experimental Settings

We evaluate each GLUE and SuperGLUE datasetusing the metric specified in the benchmark Wereuse the evaluation code from the publicly avail-able T5 open-source release to compute metrics12

For the SQuAD and MRQA datasets we evaluateusing F1 one of the metrics used by the SQuADbenchmark where partial answer spans are consid-ered Again we use the T5 open-source release formetric calculation13 All of our models use T5 11as the base frozen model additional details and pre-trained checkpoints can be found on GitHub1415

All prompts for T5 Small and Base models weretrained on 4 TPU v2 chips while prompts for largermodels were trained on 16 TPU v3 chips

Parameter counts for each prompt can be foundin Table 4 Average runtimes until convergence canbe found in Table 5

A2 Hyperparameter Search

This work used 77 hyperparameter search trials(40 for prompt tuning and 37 for single-task modeltuning) and 3 training runs (with validation evalu-ation) for each baseline configuration and ablationsetting for a total of 195 runs for our main resultand ablations There were an additional 18 runs forthe domain shift experiments and 24 extra runs tocreate the ensemble Hyperparameter bounds canbe found in Table 6 Hyperparameter tuning wasdone via manual tuning and settings were selectedbased on the SuperGLUE score All experiments inthis work outside of the hyperparameter being ab-lated use our default configuration of 100K stepsof LM Adapation a prompt length of 100 andldquoclass-labelrdquo initialization

All graphs of our experimental results plot themean and standard deviation over 3 runs as com-puted by Seaborn (Waskom 2021) Some settingshave such low variance that the standard deviation

12httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspy

13httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5evaluationmetricspyL151

14httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmasterreleased_checkpointsmdt511

15httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmainreleased_checkpointsmdlm-adapted-t511lm100k

is hidden behind the line itself such as ldquoModel Tun-ing (Multi-task)rdquo in Figure 1 and the Base Largeand XL prompts trained on the ldquoSpan Corruptionrdquopretraining objective in Figure 3(b) Figure 4 alsoshows mean and standard deviation for the num-ber of parameters each method uses as the promptlength varies from 1ndash100 The ldquoPrefix Tuning(Train)rdquo curves appears to have no standard de-viation because the parameter count is so stronglydominated by the cost of the reparameterizationparameters that the standard deviation bands areoccluded For our experiments on domain transferwe report mean and standard deviation over 3 runs

A3 Datasets

All datasets used are in English For the GLUE1617

and SuperGLUE18 datasets we used the trainingvalidation and test splits that ship with TensorFlowDatasets We used version 100 for GLUE and102 for SuperGLUE datasets For SQuAD19

we used v11300 from Tensorflow Datasetsand follow the provided training validation andtest splits For the out-of-domain datasets we usedthe development splits distributed as part of theMRQA shared task20 Dataset sizes can be foundin Table 7 The label distributions for each datasetcan be found in Table 8 (BoolQ) Table 9 (CB)Table 10 (COPA) Table 11 (MultiRC) Table 14(RTE) Table 12 (WiC) Table 13 (WSC) Table 15(MRPC) and Table 16 (QQP)

The question answering datasets are extractivedatasets with a variety of answers so there isnrsquot alabel distribution to report Similarly the ReCoRDdataset is a multiple choice dataset where the modelmust predict the masked out entity from a list ofpossible entities Due to this formulation there isnrsquota meaningful label distribution

We followed the open-source T5 preprocess-ing procedure21 for each dataset except that weomit the dataset prefix denoting which SuperGLUEdataset an example belongs to For the SQuAD and

16httpswwwtensorfloworgdatasetscataloggluegluemrpc

17httpswwwtensorfloworgdatasetscatalogglueglueqqp

18httpswwwtensorfloworgdatasetscatalogsuper_glue

19httpswwwtensorfloworgdatasetscatalogsquadsquadv11_default_config

20httpsgithubcommrqaMRQA-Shared-Task-2019out-of-domain

21httpsgithubcomgoogle-researchtext-to-text-transfer-transformerblobmastert5datapreprocessorspy

T5 Size Prompt Length Trainable Parameters Total Parameters Percent Trainable

Small 1 512 76961664 0000675 2560 76963712 000333

20 10420 76971572 00133050 25600 76986752 003325

100 51200 77012352 006648150 76800 77037952 009969

Base 1 768 247578624 0000315 3840 247581696 000155

20 15360 247593216 00062050 38400 247616256 001551

100 76800 247654656 003101150 115200 247693056 004651

Large 1 1024 783151104 0000135 5120 783155200 000065

20 20480 783170560 00026250 51200 783201280 000654

100 102400 783252480 001907150 153600 783303680 001961

XL 1 2048 2849759232 0000075 10240 2849767424 000036

20 40960 2849798144 00014350 102400 2849859584 000359

100 204800 2849961984 000718150 307200 2850064384 001078

XXL 1 4096 11135336448 0000045 20480 11135352832 000018

20 81920 11135414272 00007450 204800 11137380352 000184

100 409600 11135741952 000368150 614400 11135946752 000552

Table 4 Number of parameters used for various prompt lengths and T5 model sizes Trainable parameters isthe number of parameters in the prompt itself while total parameters includes the prompt plus the original T5parameters The T5 parameters are frozen and shared across all tasks and include the SentencePiece lookup tableparameters The final column is the percentage of total parameters that are trainable

For the SQuAD and MRQA datasets, we used the T5 SQuAD preprocessing code.22 By following the T5 preprocessing and text-to-text format, we recast the WSC dataset as a text generation task. Instead of predicting whether a supplied referent is correct for a highlighted span, our model predicts the correct referent directly. As such, we can only learn from training examples where the referent is correct, so WSC training data where the supplied referent is incorrect are omitted.

No new data was collected for this work.
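A minimal sketch of this recasting is shown below. The feature names follow the TFDS super_glue/wsc.fixed schema, while the highlighting scheme and the wsc_examples variable are illustrative assumptions, not the preprocessor referenced in footnote 22.

def wsc_to_text_generation(example):
    """Recast one WSC example as text generation, in the spirit described above."""
    # In the TFDS schema, label == 1 marks that the supplied referent is correct;
    # only those examples can be used for training under this formulation.
    if example["label"] != 1:
        return None
    words = example["text"].split()
    # Mark the pronoun (span2) so the model knows which mention to resolve
    # (the exact markup is an illustrative choice).
    idx = example["span2_index"]
    words[idx] = "*" + words[idx] + "*"
    return {
        "inputs": "wsc: " + " ".join(words),
        "targets": example["span1_text"],  # the model generates the referent directly
    }

# wsc_examples: decoded (Python-string) examples from the super_glue/wsc.fixed train split.
train = [wsc_to_text_generation(ex) for ex in wsc_examples]
train = [ex for ex in train if ex is not None]  # incorrect-referent examples are dropped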

22 https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L264

Prompt Length  T5 Size  Time
1              Large    3:17 ±2:10
1              XL       3:37 ±2:11
1              XXL      21:23 ±1:54
20             XL       49:08 ±18:53
20             XXL      53:03 ±16:25
50             Small    9:05 ±5:07
50             Base     55:01 ±27:48
50             Large    1:14:16 ±13:12
50             XL       2:30:10 ±25:40
50             XXL      3:13:13 ±23:08
100            Small    16:25 ±1:15
100            Base     29:57 ±0:18
100            Large    1:23:36 ±10:21
100            XL       3:35:00 ±54:42
100            XXL      3:51:15 ±45:53

Table 5: Mean and standard deviation of the runtime until convergence for the BoolQ dataset and various prompt lengths and model sizes. Convergence is defined as reaching a performance within 1% of the mean value for that model configuration. A few configurations have been omitted because their runtimes were artificially extended due to preemption.
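The convergence definition used for Table 5 can be made concrete with a small helper; the checkpoint log format below is an assumption, and "within 1%" is interpreted here as reaching 99% of the configuration mean.

from typing import List, Tuple, Optional

def time_to_convergence(
    checkpoints: List[Tuple[float, float]],  # (elapsed_seconds, eval_score) per evaluation
    mean_final_score: float,                 # mean score for this model/prompt configuration
) -> Optional[float]:
    """Elapsed time when the score first comes within 1% of the configuration mean."""
    threshold = 0.99 * mean_final_score
    for elapsed, score in checkpoints:
        if score >= threshold:
            return elapsed
    return None  # never converged under this definition

# Hypothetical evaluation trace: (seconds, accuracy).
trace = [(600, 62.0), (1200, 78.5), (1800, 84.9), (2400, 85.4)]
print(time_to_convergence(trace, mean_final_score=85.2))  # -> 1800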

Hyperparameter     Search Space
Learning Rate      0.001–0.5
Parameter Scaling  True, False
Batch Size         32, 64, 126, 256, 512
Number of Steps    10,000, 20,000, 30,000
Warmup Steps       off, 2,000, 3,000
Decay Factor       off, 0.1, 0.5
Steps per Decay    off, 4,000, 6,000, 8,000

Table 6: Search space for each hyperparameter considered. Parameter Scaling refers to the Adafactor setting where an update is scaled by the norm of the parameter it will be applied to. Warmup Steps is the number of steps before a linearly increasing learning rate reaches the Learning Rate value, starting from zero. Decay Factor is the reduction in Learning Rate size that occurs every "Steps per Decay" steps.
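One plausible reading of the Warmup Steps, Decay Factor, and Steps per Decay settings is sketched below: linear warmup from zero to the base learning rate, followed by a step decay. The default values are drawn from the Table 6 ranges, and whether decay is counted from step zero or from the end of warmup is an assumption, not the authors' training code.

def learning_rate(step: int,
                  base_lr: float = 0.3,       # a value from the 0.001-0.5 range in Table 6
                  warmup_steps: int = 2000,   # "off" corresponds to no warmup phase
                  decay_factor: float = 0.5,  # "off" corresponds to no decay
                  steps_per_decay: int = 4000) -> float:
    """Linear warmup to base_lr, then multiply by decay_factor every steps_per_decay steps."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    if decay_factor and steps_per_decay:
        num_decays = (step - (warmup_steps or 0)) // steps_per_decay
        return base_lr * (decay_factor ** num_decays)
    return base_lr

print([round(learning_rate(s), 4) for s in (0, 1000, 2000, 5999, 6000, 10000)])
# -> [0.0, 0.15, 0.3, 0.3, 0.15, 0.075]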

Dataset     Training  Validation  Testing
BoolQ       9,427     3,270       3,245
CB          250       56          250
COPA        400       100         500
MultiRC     27,243    4,848       9,693
ReCoRD      100,730   10,000      10,000
RTE         2,490     277         3,000
WiC         5,428     638         1,400
WSC         259*      104         146
MRPC        3,668     408         1,725
QQP         363,849   40,430      390,965
SQuAD       87,599    10,570      -
TextbookQA  -         1,504       -
RACE        -         1,503       -
BioASQ      -         1,501       -
RE          -         674         -
DuoRC       -         2,948       -
DROP        -         1,503       -

Table 7: Sizes for training, validation, and testing splits of each dataset used. *Following T5, our casting of WSC as a text generation problem means we can only train on examples where the supplied referent is correct. This means our training dataset is smaller than the normal WSC training dataset, which has 554 examples.

Split       False  True
Training    37.7%  62.3%
Validation  37.8%  62.2%

Table 8: Label distribution for the BoolQ dataset.

Split       contradiction  entailment  neutral
Training    47.6%          46.0%       6.4%
Validation  50.0%          41.1%       8.9%

Table 9: Label distribution for the CB dataset.

Split       choice1  choice2
Training    48.8%    51.2%
Validation  55.0%    45.0%

Table 10: Label distribution for the COPA dataset.

Split       False  True
Training    55.9%  44.1%
Validation  57.2%  42.8%

Table 11: Label distribution for the MultiRC dataset.

Split       False  True
Training    50.0%  50.0%
Validation  50.0%  50.0%

Table 12: Label distribution for the WiC dataset.

Split       False  True
Training    0.0%   100.0%
Validation  63.5%  36.5%

Table 13: Label distribution for the WSC dataset. Following T5, we cast the WSC dataset to a free-form text generation task, where the model generates the referent to the highlighted span instead of predicting if the supplied entity is the correct referent of the highlighted span. Thus, we only use training data where the supplied referent is correct, making our training label distribution focused entirely on True.

Split       entailment  not_entailment
Training    51.2%       49.8%
Validation  52.7%       47.3%

Table 14: Label distribution for the RTE dataset.

Split       equivalent  not_equivalent
Training    67.4%       32.6%
Validation  68.4%       31.6%

Table 15: Label distribution for the MRPC dataset.

Split       duplicate  not_duplicate
Training    36.9%      63.1%
Validation  36.8%      63.2%

Table 16: Label distribution for the QQP dataset.
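Distributions like those in Tables 8–16 can be recomputed from the pinned TFDS splits; the sketch below assumes an integer "label" feature and is not the script used to produce the tables.

import collections
import tensorflow_datasets as tfds

def label_distribution(name: str, split: str) -> dict:
    """Percentage of each label value in a TFDS split."""
    counts = collections.Counter(
        int(ex["label"]) for ex in tfds.as_numpy(tfds.load(name, split=split))
    )
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in sorted(counts.items())}

print(label_distribution("super_glue/boolq:1.0.2", "train"))       # ~{0: 37.7, 1: 62.3}
print(label_distribution("super_glue/boolq:1.0.2", "validation"))  # ~{0: 37.8, 1: 62.2}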

