A Primer on Pretrained Multilingual Language Models


Sumanth Doddapaneni 1,3∗  Gowtham Ramesh 1,3∗  Mitesh M. Khapra 1,2,3  Anoop Kunchukuttan 3,4  Pratyush Kumar 3,4
1 RBCDSAI, 2 IIT Madras, 3 AI4Bharat, 4 Microsoft
Abstract
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of lan- guages. Given their success in zero-shot trans- fer learning, there has emerged a large body of work in (i) building bigger MLLMs cov- ering a large number of languages (ii) creat- ing exhaustive benchmarks covering a wider variety of tasks and languages for evaluat- ing MLLMs (iii) analysing the performance of MLLMs on monolingual, zero-shot cross- lingual and bilingual tasks (iv) understand- ing the universal language patterns (if any) learnt by MLLMs and (v) augmenting the (of- ten) limited capacity of MLLMs to improve their performance on seen or even unseen lan- guages. In this survey, we review the existing literature covering the above broad areas of re- search pertaining to MLLMs. Based on our survey, we recommend some promising direc- tions of future research.
1 Introduction
The advent of BERT (Devlin et al., 2019) has revolutionised the field of NLP and has led to state of the art performance on a wide variety of tasks (Wang et al., 2018a). The recipe is to train a deep transformer based model (Vaswani et al., 2017) on large amounts of monolingual data and then fine-tune it on small amounts of task-specific data. The pretraining happens using a masked language modeling objective and essentially results in an encoder which learns good sentence representations. These pretrained sentence representations then lead to improved performance on downstream tasks when fine-tuned on even small amounts of task-specific training data (Devlin et al., 2019). Given its success in English NLP, this recipe has been
∗ The first two authors have contributed equally. † Corresponding author: [email protected]
replicated across languages leading to many lan- guage specific BERTs such as FlauBERT (French) (Le et al., 2020), CamemBERT (French) (Mar- tin et al., 2020), BERTje (Dutch) (de Vries et al., 2019), FinBERT (Finnish) (Rönnqvist et al., 2019), BERTeus (Basque) (Agerri et al., 2020), AfriBERT (Afrikaans) (Ralethe, 2020), IndicBERT (Indian languages) (Kakwani et al., 2020) etc. However, training such language-specific models is only fea- sible for a few languages which have the necessary data and computational resources.
The above situation has led to the undesired effect of limiting recent advances in NLP to English and a few high resource languages (Joshi et al., 2020a). The question then is: How do we bring the benefit of such pretrained BERT based models to a very long list of languages of interest? One alternative, which has become popular, is to train multilingual language models (MLLMs) such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020a), etc. A MLLM is pretrained using large amounts of unlabeled data from multiple languages with the hope that low resource languages may benefit from high resource languages due to shared vocabulary, genetic relatedness (Nguyen and Chiang, 2017) or contact relatedness (Goyal et al., 2020). Several such MLLMs have been proposed in the past 3 years and they differ in the architecture (e.g., number of layers, parameters, etc.), objective functions used for training (e.g., monolingual masked language modeling objective, translation language modeling objective, etc.), data used for pretraining (Wikipedia, CommonCrawl, etc.) and the number of languages involved (ranging from 12 to 100). To keep track of these rapid advances in MLLMs, as a first step, we present a survey of all existing MLLMs, clearly highlighting their similarities and differences.
While training an MLLM is more efficient and inclusive (covers more languages), is there a trade-off in performance compared to a monolingual model? More specifically, for a given language, is a language-specific BERT better than a MLLM? For example, if one is only interested in English NLP, should one use English BERT or a MLLM? The advantage of the former is that there is no capacity dilution (i.e., the entire capacity of the model is dedicated to a single language), whereas the advantage of the latter is that there is additional pretraining data from multiple (related) languages. In this work, we survey several existing studies (Conneau et al., 2020a; Wu and Dredze, 2020; Agerri et al., 2020; Virtanen et al., 2019; Rönnqvist et al., 2019; Ro et al., 2020; de Vargas Feijó and Moreira, 2020; Wang et al., 2020a) which show that the right choice depends on various factors such as model capacity, amount of pretraining data, fine-tuning mechanism and amount of task-specific training data.
One of the main motivations of training MLLMs is to enable transfer from high resource languages to low resource languages. Of particular interest is the ability of MLLMs to facilitate zero-shot cross-lingual transfer (K et al., 2020) from a resource rich language to a resource deprived language which does not have any task-specific training data. To evaluate such cross-lingual transfer, several benchmarks, such as XGLUE (Liang et al., 2020), XTREME (Hu et al., 2020) and XTREME-R (Ruder et al., 2021), have been proposed. We review these benchmarks which contain a wide variety of tasks such as classification, structure prediction, question answering, and cross-lingual retrieval. Using these benchmarks, several works (Pires et al., 2019; Wu and Dredze, 2019; K et al., 2020; Artetxe et al., 2020a; Dufter and Schütze, 2020; Liu et al., 2020a; Lauscher et al., 2020; Liu et al., 2020c; Conneau and Lample, 2019; Wang et al., 2019; Liu et al., 2019a; Cao et al., 2020; Wang et al., 2020d; Zhao et al., 2020; Wang et al., 2020b; Chi et al., 2020b) have studied the cross-lingual effectiveness of MLLMs and have shown that such transfer depends on various factors such as the amount of shared vocabulary, explicit alignment of representations across languages, size of pretraining corpora, etc. We collate the main findings of these studies in this survey.
While the above discussion has focused on trans- fer learning and facilitating NLP in low resource languages, MLLMs could also be used for bilin- gual tasks. For example, could the shared rep-
resentations learnt by MLLMs improve Machine Translation between two resource rich languages? We survey several works (Conneau and Lample, 2019; Kakwani et al., 2020; Huang et al., 2019; Conneau et al., 2020a; Eisenschlos et al., 2019; Zampieri et al., 2020; Libovický et al., 2020; Jalili Sabet et al., 2020; Chen et al., 2020; Zenkel et al., 2020; Dou and Neubig, 2021; Imamura and Sumita, 2019; Ma et al., 2020; Zhu et al., 2020; Liu et al., 2020b; Xue et al., 2021) which use MLLMs for downstream bilingual tasks such as unsuper- vised machine translation, cross-lingual word align- ment, cross-lingual QA, etc. We summarise the main findings of these studies which indicate that MLLMs are useful for bilingual tasks, particularly in low resource scenarios.
The surprisingly good performance of MLLMs in cross-lingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns. However, our survey of the studies in this space indicates that there is no consensus yet. While representations learnt by MLLMs share commonalities across languages, as identified by different correlation analyses, these commonalities are predominantly restricted to languages of the same family and to certain parts of the network (primarily the middle layers). Also, while probing tasks such as POS tagging are able to benefit from such commonalities, harder tasks such as evaluating MT quality remain out of reach as yet. Thus, though promising, MLLMs do not yet represent an interlingua.
Lastly, given the effort involved in training MLLMs, it is desirable that they can be easily extended to new languages which were not a part of the initial pretraining. We review existing studies which propose methods for (a) extending MLLMs to unseen languages, and (b) improving the capacity (and hence performance) of MLLMs for languages already seen during pretraining. These range from simple techniques, such as fine-tuning the MLLM for a few epochs on the target language, to using language and task specific adapters to augment the capacity of MLLMs.
1.1 Goals of the survey
Summarising the above discussion, the main goal of this survey is to review existing work with a focus on the following questions:
• How are different MLLMs built and how do they differ from each other? (Section 2)
• What are the benchmarks used for evaluating MLLMs? (Section 3)
• For a given language, are MLLMs better than monolingual LMs? (Section 4)
• Do MLLMs facilitate zero-shot cross-lingual transfer? (Section 5)
• Are MLLMs useful for bilingual tasks? (Sec- tion 6)
• Do MLLMs learn universal patterns? (Sec- tion 7)
• How to extend MLLMs to new languages? (Section 8)
• What are the recommendations based on this survey? (Section 9)
In this survey, we focus on the multilingual as- pects of language models for NLU. Hence, we do not discuss related topics like monolingual LMs, pretrained models for NLG, training of large mod- els, model compression, etc. Our survey is thus different from existing surveys on cross-lingual word embedding models (Ruder et al., 2019), mul- tilingual NMT models (Dabre et al., 2020) and pre- trained language models (Qiu et al., 2020; Kalyan et al., 2021) which do not focus on pretrained multi- lingual language models. To the best of our knowl- edge, our survey is the first work that presents a comprehensive review of multilingual aspects of pretrained language models for NLU.
2 How are MLLMs built?
The goal of MLLMs is to learn a model that can generate a multilingual representation of a given text. Loosely, the model should generate similar representations in a common vector space for similar sentences and words (or words in similar contexts) across languages. In this section we describe the neural network architecture, objective functions, data and languages used for building MLLMs. We also highlight the similarities and differences between existing MLLMs.
2.1 Architecture Multilingual Language models are typically based on the transformer architecture introduced by Vaswani et al. (2017) and then adapted for Natural Language Understanding (NLU) by Devlin et al.
(2019) (although there are a few exceptions like (Eisenschlos et al., 2019) which use RNN based models).
Input Layer The input to the MLLM is a sequence of tokens. The token input comes from a one-hot representation of a finite vocabulary, which is typically a subword vocabulary. This vocabulary is generally learnt from a concatenation of monolingual data from various languages using algorithms like BPE (Sennrich et al., 2016b), WordPiece (Wu et al., 2016) or SentencePiece (Kudo and Richardson, 2018a). To ensure reasonable representation in the vocabulary for different languages and scripts, data can be sampled using exponential weighted smoothing (discussed later) (Conneau et al., 2020a; Devlin et al., 2018) or separate vocabularies can be learnt for clusters of languages (Chung et al., 2020), partitioning the vocabulary size.
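For concreteness, the following sketch shows how such a joint subword vocabulary could be learnt with the SentencePiece library over a concatenation of monolingual corpora from several languages; the file paths, vocabulary size and character coverage are illustrative placeholders rather than the settings of any particular MLLM.

```python
import sentencepiece as spm

# Hypothetical monolingual corpora, one plain-text file per language.
corpora = ["data/en.txt", "data/hi.txt", "data/sw.txt"]

# Train a single subword vocabulary on the concatenation of all languages.
# vocab_size and character_coverage are illustrative values only.
spm.SentencePieceTrainer.train(
    input=",".join(corpora),      # comma-separated list of input files
    model_prefix="joint_subword",
    vocab_size=32000,
    model_type="unigram",         # BPE is also supported via model_type="bpe"
    character_coverage=0.9995,    # keep rare characters from low-resource scripts
)

sp = spm.SentencePieceProcessor(model_file="joint_subword.model")
print(sp.encode("This is a multilingual sentence.", out_type=str))
```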
Transformer Layers A typical MLLM com- prises the encoder of the transformer network and contains a stack of N layers with each layer con- taining k attention heads followed by a feedforward neural network. For every token in the input se- quence, an attention head computes an embedding using an attention weighted linear combination of the representations of all the other tokens in the sentence. The embeddings from all the attention heads are then concatenated and passed through a feedforward network to produce a d dimensional embedding for each input token. As shown in Table 1, existing MLLMs may differ in the choice of N , k and d. Further, the parameters in each layer may be shared as in (Kakwani et al., 2020).
Output Layer The outputs of the last transformer layer are typically used as contextual representations for each token, while the embedding corresponding to the [CLS] token is considered to be the embedding of the entire input text. Alternatively, the text embedding can also be computed via pooling operations on the token embeddings. The output layer contains a simple linear transformation followed by a softmax that takes as input a token embedding from the last transformer layer and outputs a probability distribution over the tokens in the vocabulary. Note that the output layer is required only during pretraining and can be discarded during fine-tuning and task inference - a fact that the RemBERT model uses to reduce model size (Chung et al., 2021b).
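The sketch below illustrates the two text-embedding choices mentioned above ([CLS] embedding v/s pooled token embeddings), using a publicly available mBERT checkpoint through the HuggingFace transformers API; the checkpoint name is only an example and the snippet is not tied to any specific MLLM discussed here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative only: any multilingual encoder checkpoint could be used here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

batch = tokenizer(["A multilingual sentence.", "Une phrase multilingue."],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (batch, seq_len, d)

# Option 1: use the [CLS] token embedding as the text embedding.
cls_embedding = hidden[:, 0, :]

# Option 2: mean-pool the token embeddings, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1)        # (batch, seq_len, 1)
mean_embedding = (hidden * mask).sum(1) / mask.sum(1)

print(cls_embedding.shape, mean_embedding.shape)    # both (2, 768)
```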
Model | Layers | Heads | Hidden size | Params | Objectives | Pretraining data | Parallel data | Task-specific data | #Langs | Vocab
IndicBERT (Kakwani et al., 2020) | 12 | 12 | 768 | 33M | MLM | IndicCorp | ✗ | ✗ | 12 | 200K
Unicoder (Huang et al., 2019) | 12 | 16 | 1024 | 250M | MLM, TLM, CLWR, CLPC, CLMLM | Wikipedia | ✓ | ✗ | 15 | 95K
XLM-15 (Conneau and Lample, 2019) | 12 | 8 | 1024 | 250M | MLM, TLM | Wikipedia | ✓ | ✗ | 15 | 95K
XLM-17 (Conneau and Lample, 2019) | 16 | 16 | 1280 | 570M | MLM | Wikipedia | ✓ | ✗ | 17 | 200K
MuRIL (Khanuja et al., 2021a) | 12 | 12 | 768 | 236M | MLM, TLM | CommonCrawl + Wikipedia | ✓ | ✗ | 17 | 197K
VECO-small (Luo et al., 2021) | 6 | 12 | 768 | 247M | MLM, CS-MLM† | CommonCrawl | ✓ | ✗ | 50 | 250K
VECO-Large (Luo et al., 2021) | 24 | 16 | 1024 | 662M | MLM, CS-MLM | CommonCrawl | ✓ | ✗ | 50 | 250K
XLM-align (Chi et al., 2021b) | 12 | 12 | 768 | 270M | MLM, TLM, DWA | CommonCrawl + Wikipedia | ✓ | ✗ | 94 | 250K
InfoXLM-base (Chi et al., 2021a) | 12 | 12 | 768 | 270M | MLM, TLM, XLCO | CommonCrawl | ✓ | ✗ | 94 | 250K
InfoXLM-Large (Chi et al., 2021a) | 24 | 16 | 1024 | 559M | MLM, TLM, XLCO | CommonCrawl | ✓ | ✗ | 94 | 250K
XLM-100 (Conneau and Lample, 2019) | 16 | 16 | 1280 | 570M | MLM | Wikipedia | ✗ | ✗ | 100 | 200K
XLM-R-base (Conneau et al., 2020a) | 12 | 12 | 768 | 270M | MLM | CommonCrawl | ✗ | ✗ | 100 | 250K
XLM-R-Large (Conneau et al., 2020a) | 24 | 16 | 1024 | 559M | MLM | CommonCrawl | ✗ | ✗ | 100 | 250K
X-STILTS (Phang et al., 2020) | 24 | 16 | 1024 | 559M | MLM | CommonCrawl | ✗ | ✓ | 100 | 250K
HiCTL-base (Wei et al., 2021) | 12 | 12 | 768 | 270M | MLM, TLM, HICTL | CommonCrawl | ✓ | ✗ | 100 | 250K
HiCTL-Large (Wei et al., 2021) | 24 | 16 | 1024 | 559M | MLM, TLM, HICTL | CommonCrawl | ✓ | ✗ | 100 | 250K
Ernie-M-base (Ouyang et al., 2021) | 12 | 12 | 768 | 270M | MLM, TLM, CAMLM, BTMLM | CommonCrawl | ✓ | ✗ | 100 | 250K
Ernie-M-Large (Ouyang et al., 2021) | 24 | 16 | 1024 | 559M | MLM, TLM, CAMLM, BTMLM | CommonCrawl | ✓ | ✗ | 100 | 250K
XLM-E (Chi et al., 2021c) | 12 | 12 | 768 | 279M | MLM, TLM, MRTD, TRTD | CommonCrawl | ✓ | ✗ | 100 | 250K
mBERT (Devlin et al., 2019) | 12 | 12 | 768 | 172M | MLM | Wikipedia | ✗ | ✗ | 104 | 110K
Amber (Hu et al., 2021) | 12 | 12 | 768 | 172M | MLM, TLM, CLWA, CLSA | Wikipedia | ✓ | ✗ | 104 | 120K
RemBERT (Chung et al., 2021a) | 32 | 18 | 1152 | 559M‡ | MLM | CommonCrawl + Wikipedia | ✗ | ✗ | 110 | 250K

Table 1: A comparison of existing Multilingual Language Models. † - Cross sequence MLM, which is useful for NLG tasks. ‡ - For pretraining, RemBERT uses 995M parameters.
2.2 Training Objective Functions A variety of objective functions have been proposed for training MLLMs. These can be broadly cate- gorized as monolingual or parallel objectives de- pending on the nature of training data required. We discuss and compare these objective functions in this section.
2.2.1 Monolingual Objectives The objective functions are defined on monolingual data alone. These are unsupervised/self-supervised objectives that train the model to generate multilin- gual representations by predicting missing tokens given the context tokens. Masked Language Model (MLM). This is the most standard training objective used for training most MLLMs. Typically, other pretraining objectives are used in conjunction with the MLM objective. It is a simple extension of the unsupervised MLM objective for a single language to multiple languages by pooling together monolingual data from multiple languages. Let x1, x2, . . . , xT be the sequence of words in a given training example. Of these, k tokens (≈ 15%) are randomly selected for masking. If the i-th token is selected for masking then it is replaced by (i) the [MASK] token (≈ 80% of the time), or (ii) a random token (≈ 10% of time), or (iii) kept as it is (≈ 10% of time). The goal is to then predict these k masked tokens using the remaining T − k
tokens. More formally, the model is trained to minimize the cross entropy loss for predicting the masked tokens. Specifically, if ui ∈ Rd is the representation for the i-th masked token computed by the last layer, then the cross-entropy loss for predicting this token is computed as:
L(i) = − log [ exp(W_{x_i} · u_i) / Σ_{j=1}^{V} exp(W_j · u_i) ]

where x_i is the identity of the i-th masked token, V is the size of the vocabulary, W_j denotes the j-th row of W, and W ∈ R^{V×d}
is a parameter to be learned. The total loss is then obtained by summing over the loss of all the masked tokens. While this objective is monolin- gual, it surprisingly helps learn multilingual models where the encoder representations across languages are aligned without the need for any parallel cor- pora. The potential reasons for this surprising ef- fectiveness of MLM for multilingual models are discussed later. Causal Language Model (CLM). This is the tra- ditional language modelling objective of predicting the next word given the previous words. Unlike MLM, CLM has access to just unidirectional con- text. Given the success of MLM based language models for NLU applications, CLM has fallen out of favour and is currently used for pretraining NLG models where only unidirectional context is avail- able for generation. We describe it for the sake of
completeness. Let x_1, x_2, . . . , x_T be the sequence of words in a given training batch. The goal then is to predict the i-th word given the previous i − 1 words. Specifically, the model is trained to minimize the cross entropy loss for predicting the i-th word given the previous i − 1 words. Multilingual Replaced Token Detection (MRTD). This objective function requires the model to detect real input tokens from a corrupted multilingual sentence. Let x_1, x_2, . . . , x_T be the sequence of words in a given training example. k tokens are masked and a generator G (usually a smaller transformer model trained with the MLM objective) is used to predict these masked tokens. A discriminator D is used on top of this, which takes the sentence predicted by G, X^corrupt, as input and, for each x_i^corrupt in X^corrupt, predicts whether it is the original token or a token generated by G.
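As an illustration of the MLM corruption procedure described above, the following minimal sketch applies the standard 15% selection rate with the 80/10/10 replacement split to a toy sequence of token ids; the vocabulary size and mask id are placeholders.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Return (input_ids, labels) following the standard MLM corruption:
    of the selected positions, 80% become [MASK], 10% a random token,
    and 10% are left unchanged. Unselected positions get label -100."""
    input_ids, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                      # predict the original token
            r = random.random()
            if r < 0.8:
                input_ids[i] = mask_id           # replace with [MASK]
            elif r < 0.9:
                input_ids[i] = random.randrange(vocab_size)  # random token
            # else: keep the token as it is
    return input_ids, labels

print(mask_tokens([12, 48, 7, 301, 5], vocab_size=1000, mask_id=999))
```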
2.2.2 Parallel-corpora Objectives These objectives require parallel corpora and are designed to explicitly force representations of similar text across languages to be close to each other in the multilingual encoder space. The objectives are either word-level (TLM, CAMLM, CLMLM, HICTL, CLSA) or sentence-level (XLCO, HICTL, CLSA). Since parallel corpora are generally much smaller than the monolingual data, the parallel objectives are used in conjunction with monolingual objectives. This can be done via joint optimization of parallel and monolingual objectives, with each objective weighted appropriately. Sometimes, the initial training period may involve only monolingual objectives (XLCO, HICTL, CLSA). Translation Language Model (TLM). In addition to monolingual data in each language, we may also have access to parallel data between some languages. Conneau and Lample (2019) introduced the translation language modeling objective to leverage such parallel data. Let x^A_1, x^A_2, . . . , x^A_T be the sequence of words in a language A and let x^B_1, x^B_2, . . . , x^B_T be the corresponding parallel sequence of words in a language B. Both the sequences are fed as input to the MLM with a [SEP] token in between. Similar to MLM, a total of k tokens are masked such that these tokens could either belong to the sequence in A or the sequence in B. To predict a masked word in A the model could rely on the surrounding words in A or the translation in B, thereby implicitly being forced to learn aligned representations. More specifically, if
the context in A is not sufficient (due to masking or otherwise), then the model can use the context in B to predict a masked token in A. The final objective function is the same as MLM, i.e., to minimise the cross entropy loss of the masked tokens. The only difference is that the masked tokens could belong to either language. Cross-attention Masked Language Modeling (CAMLM). CAMLM, introduced in Ouyang et al. (2021), learns cross-lingual representations by predicting masked tokens in a parallel sentence pair. While predicting the masked tokens in a source sentence, the model is restricted to use the semantics of the target sentence and vice versa. As opposed to TLM, where the model has access to both sentences of the input pair to predict the masked tokens, in CAMLM the model is restricted to only use the tokens in the corresponding parallel sentence to predict the masked tokens in the source sentence. Cross-lingual Masked Language Modeling (CLMLM). CLMLM (Huang et al., 2019) is very similar to the TLM objective as masked language modeling is performed with cross-lingual sentences as input. The main difference is that, unlike TLM, the input is constructed at the document level, where multiple sentences from a cross-lingual document are replaced by their translations in another language. Cross-lingual Contrastive Learning (XLCO). Chi et al. (2021a) propose that we can leverage parallel data for training MLLMs by maximizing the information content between parallel sentences. For example, let a^A_i be a sentence in language A and let b^B_i be its translation in language B. In addition, let {b^B_j} (j = 1, . . . , N; j ≠ i) be N − 1 sentences in B which are not translations of a^A_i. Chi et al. (2021a) show that the information content between a^A_i and b^B_i can be maximized by minimizing the following InfoNCE (van den Oord et al., 2019) based loss function:
L_XLCO = − log [ exp(f(a^A_i)^⊤ f(b^B_i)) / Σ_{j=1}^{N} exp(f(a^A_i)^⊤ f(b^B_j)) ]    (1)
where f(a) is the encoding of the sentence a as computed by the transformer. Instead of just explicitly sampling negative sentences from {b^B_j} (j ≠ i), the authors use momentum (He et al., 2020) and mixup contrast (Chi et al., 2021a) to construct harder negative samples. Hierarchical Contrastive Learning (HICTL).
HICTL, introduced in Wei et al. (2021), also uses an InfoNCE based contrastive loss (CTL) and extends it to learn both sentence level and word level cross-lingual representations. For sentence level CTL (same as Equation (1)), instead of directly sampling from {b^B_j} (j ≠ i), the authors use smoothed linear interpolation (Bowman et al., 2016; Zheng et al., 2019) between sentences in the embedding space to construct hard negative samples. For word level CTL, the similarity score used in the contrastive loss is calculated between the sentence representation q (the [CLS] token of a parallel sentence pair <a^A_i, b^B_i>) and other words. A bag of words W is maintained for each parallel sentence pair and each word in W is considered a positive sample, while the other words in the vocabulary are considered negative. Instead of sampling negative words from the large vocabulary V, they construct a subset S ⊂ V − W of negative words that are very similar to q in the embedding space. Cross-lingual Sentence alignment (CLSA). Hu et al. (2021) leverage parallel data by training the cross-lingual model to align sentence representations. Given a parallel sentence pair (X, Y), the model is trained to predict the corresponding translation Y for sentence X from negative samples in a mini-batch. Unlike mBERT, which encodes two sentences and uses [CLS] embeddings for the sentence representation, the corresponding sentence representation in AMBER (Hu et al., 2021) is computed by averaging the word embeddings in the final layer of the MLLM. Translation Replaced Token Detection (TRTD). Similar to the monolingual MRTD objective, Chi et al. (2021c) leverage parallel sentences in a discriminative setup. In particular, they concatenate parallel sentences to form a single input sentence. They then use a generator G to predict the masked tokens and pass the corrupted sentence X^corrupt to a discriminator D which does token level classification to discriminate between generated tokens and original tokens.
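For illustration, the sketch below implements an InfoNCE-style contrastive loss over sentence encodings in the spirit of XLCO; it uses simple in-batch negatives and a temperature hyperparameter as simplifying assumptions in place of the momentum and mixup based negatives described above, and f() is an arbitrary encoder.

```python
import torch
import torch.nn.functional as F

def xlco_style_loss(src_enc, tgt_enc, temperature=0.1):
    """InfoNCE-style loss: src_enc[i] and tgt_enc[i] are encodings f(a_i^A)
    and f(b_i^B) of a translation pair; the other sentences in the batch act
    as negatives (a simplification of the momentum/mixup negatives in XLCO)."""
    src = F.normalize(src_enc, dim=-1)
    tgt = F.normalize(tgt_enc, dim=-1)
    logits = src @ tgt.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(src.size(0))         # the diagonal holds the positives
    return F.cross_entropy(logits, targets)

# Toy usage with random "encodings" of 4 parallel sentence pairs.
src_enc, tgt_enc = torch.randn(4, 768), torch.randn(4, 768)
print(xlco_style_loss(src_enc, tgt_enc).item())
```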
2.2.3 Objectives based on other parallel resources
While parallel corpora are the most commonly used resource in the parallel objectives, other objectives use additional parallel resources like word alignments (CLWR, CLWA), cross-lingual paraphrases (CLPC), code-mixed data (ALM) and backtranslated data (BTMLM). Cross-lingual word recovery (CLWR). Similar to TLM, this task, introduced by Huang et al. (2019), aims to learn the word alignments between parallel sentences in two languages. A trainable attention matrix (Bahdanau et al., 2016) is used to represent source language word embeddings by the target language word embeddings. The cross-lingual model is then trained to reconstruct the source language word embeddings from this transformation. Cross-lingual paraphrase classification (CLPC). Huang et al. (2019) leverage parallel data by introducing a paraphrase classification objective where parallel sentences (X, Y) across languages are positive samples and non-parallel sentences (X, Z) are treated as negative samples. To make the task challenging, they train a lightweight paraphrase detection model and sample Z that is very close to X but is not equal to Y. Alternating Language Model (ALM). ALM (Yang et al., 2020) uses parallel sentences to construct code-mixed sentences and performs MLM on them. The code-mixed sentences are constructed by replacing aligned phrases between the source and target language. Denoising word alignment (DWA) and self-labeling. Chi et al. (2021b) leverage parallel data by learning a word alignment based objective. The objective follows two alternating steps which are optimized with expectation-maximization: (i) Self-labeling: given a parallel sentence pair (X, Y) of lengths n and m respectively, they first learn a doubly stochastic word alignment matrix A, where A_ij gives the alignment probability of word X_i with Y_j. This problem is framed as an optimal transport problem and its values are iteratively updated with Sinkhorn's algorithm (Peyré and Cuturi, 2019). (ii) Denoising word alignment: similar to TLM, some tokens in the parallel sentence pair are masked. The forward alignment probability of a masked token is calculated as follows:
a_i = softmax( q_i^⊤ K / √d_h )    (2)

K = linear([h^*_{n+1}, . . . , h^*_{n+m}])    (4)

where i is the position of the masked token in the source sentence, h^* denotes the hidden state representations from the encoder, q_i is the query vector obtained from h^*_i, [h^*_{n+1}, . . . , h^*_{n+m}] are the key vectors given by the hidden states of the target tokens, and d_h is the dimension of the hidden states. The backward alignment is calculated similarly, with the hidden states of the source tokens as key vectors and the query vectors coming from the target sentence tokens. Given the self-labeled word alignments from the previous step, the objective is to minimize the cross entropy loss between the alignment probability a_i and the self-labeled word alignment A_i. Cross-lingual Word alignment (CLWA). Hu et al. (2021) leverage the attention mechanism in transformers to learn a word alignment based objective with parallel data. The model is trained to produce two attention matrices: a source-to-target attention A_{x→y}, which measures the similarity of source words with target words, and similarly a target-to-source attention A_{y→x}. To encourage the model to align words similarly in both the source and target direction, they minimise the distance between A_{x→y}
and A_{y→x}. Back Translation Masked Language Modeling (BTMLM). BTMLM, introduced in Ouyang et al. (2021), attempts to overcome the unavailability of parallel corpora for learning cross-lingual representations by leveraging back translation (Sennrich et al., 2016a). It has two stages: in the first stage, pseudo-parallel data is generated from a given monolingual sentence. In ERNIE-M (Ouyang et al., 2021), the pseudo-parallel sentence is generated by pretraining the model first with CAMLM and adding placeholder masks at the end of the original monolingual sentence to indicate the position and language the model needs to generate. In the second stage, tokens in the original monolingual sentence are masked and the sentence is then concatenated with the generated pseudo-parallel sentence. The model then has to predict the masked tokens.
In summary, parallel data can be used to improve both word level and sentence level cross lingual rep- resentations. The word alignment based objectives help for zero-shot transfer on word level tasks like POS, NER, etc while the sentence level objectives are useful for tasks like cross-lingual sentence re- trieval. Table 1 summarises the objective functions used by existing MLLMs.
2.3 Pretraining Data
Pretraining data. During pretraining MLLMs use two different sources of data (a) large monolingual corpora in individual languages, and (b) parallel corpora between some languages. Existing MLLMs differ in the source of monolingual
corpora they use. For example, mBERT (Devlin et al., 2019) is trained using Wikipedia whereas XLM-R (Conneau et al., 2020a) is trained using the much larger common-crawl corpus. IndicBERT (Kakwani et al., 2020) on the other hand is trained on custom crawled data in Indian languages. These pretraining data-sets used by different MLLMs are summarised in Table 1.
Languages. Some MLLMs like XLM-R are massively multilingual as they support ∼100 languages whereas others like IndicBERT (Kakwani et al., 2020) and MuRIL (Khanuja et al., 2021a) support a smaller set of languages. When dealing with a large number of languages one needs to be careful about the imbalance between the amount of pretraining data available in different languages. For example, the number of English articles in Wikipedia and CommonCrawl is much larger than the number of Finnish or Odia articles. Similarly, there might be a difference between the amount of parallel data available between different languages. To ensure that low resource languages are not under-represented in the model, most MLLMs use exponentially smoothed weighting of the data while creating the pretraining data. In particular, if m% of the total pretraining data belongs to language i, then the probability of that language is p_i = m/100. Each p_i is exponentiated by a factor α and the resulting values are then normalised to give a probability distribution over the languages. The pretraining data is then sampled according to this distribution. If α < 1 then the net effect is that the high resource languages will be under-sampled and the low resource languages will be over-sampled. This also ensures that the low resource languages get a reasonable share of the total vocabulary used by the model (while training the WordPiece (Schuster and Nakajima, 2012) or SentencePiece model (Kudo and Richardson, 2018b)). Table 1 summarises the number of languages supported by different MLLMs and the total vocabulary used by them. Typically, MLLMs which support more languages have a larger vocabulary.
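The following short sketch computes such exponentially smoothed sampling probabilities; the language shares and the value α = 0.7 are illustrative only, as the exact smoothing factor differs across MLLMs.

```python
def smoothed_sampling_probs(shares_percent, alpha=0.7):
    """shares_percent: {language: percentage of the total pretraining data}.
    Returns exponentially smoothed sampling probabilities: p_i ∝ (m_i/100)^alpha."""
    p = {lang: (m / 100.0) ** alpha for lang, m in shares_percent.items()}
    z = sum(p.values())
    return {lang: v / z for lang, v in p.items()}

# Illustrative shares: one high-resource and two low-resource languages.
shares = {"en": 80.0, "fi": 15.0, "or": 5.0}
print(smoothed_sampling_probs(shares))
# With alpha < 1, "en" is under-sampled and "or" is over-sampled
# relative to their raw shares of the data.
```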
3 What are the benchmarks used for evaluating MLLMs?
The most common evaluation for MLLMs is cross-lingual performance on downstream tasks, i.e., fine-tune the model on task-specific data for a high-resource language like English and
evaluate it on other languages. Some common cross-lingual benchmarks are XGLUE (Liang et al., 2020), XTREME (Hu et al., 2020), XTREME-R (Ruder et al., 2021). These benchmarks contain training/evaluation data for a wide variety of tasks and languages as shown in Table 2. These tasks can be broadly classified into the following categories as discussed below (i) classification, (ii) structure prediction, (iii) question answering, and (iv) retrieval.
Classification. Given an input comprising a single sentence or a pair of sentences, the task here is to classify the input into one of k classes. For example, consider the task of Natural Language Inference (NLI) where the input is a pair of sentences and the output is one of 3 classes: entails, neutral, contradicts. Some of the popular text classification datasets used for evaluating MLLMs are XNLI (Conneau et al., 2018), PAWS-X (Yang et al., 2019), XCOPA (Ponti et al., 2020), NC (Liang et al., 2020), QADSM (Liang et al., 2020), WPR (Liang et al., 2020) and QAM (Liang et al., 2020).
Structure Prediction. Given an input sentence, the task here is to predict a label for every word in the sequence. Two popular tasks here are Parts-Of-Speech (POS) tagging and Named Entity Recognition (NER). For NER, the datasets from WikiANN-NER (Nivre et al., 2018b), CoNLL 2002 (Tjong Kim Sang, 2002) and CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) shared tasks are used whereas for POS tagging the Universal Dependencies dataset (Nivre et al., 2018a) is used.
Question Answering. Here the task is to extract an answer span given a context and a question. The training data is typically available only in En- glish while the evaluation sets are available in mul- tiple languages. The datasets used for this task include XQuAD (Artetxe et al., 2020b), MLQA (Lewis et al., 2020) and TyDiQA-GoldP (Clark et al., 2020).
Retrieval. Given a sentence in a source language, the task here is to retrieve a matching sentence in the target language from a collection of sentences. The following datasets are used for this task: BUCC (Zweigenbaum et al., 2017), Tatoeba (Artetxe and Schwenk, 2019), Mewsli-X (Ruder et al., 2021) and LAReQA XQuAD-R (Roy et al., 2020). Of these, Mewsli-X and LAReQA XQuAD-R are considered to be more challenging as they involve retrieving a matching sentence in the target
language from a multilingual pool of sentences.
4 Are MLLMs better than monolingual models?
As mentioned earlier, pretrained language models such as BERT (Devlin et al., 2019) and its variants have achieved state of the art results on many NLU tasks. The typical recipe is to first pretrain a BERT- like model on large amounts of unlabeled data and then fine-tune the model with training data for a specific task in a language L. Given this recipe, there are two choices for pretraining: (i) pretrain a monolingual model using monolingual data from L only, or (ii) pretrain a MLLM using data from multiple languages (including L).
The argument in favor of the former is that, since we are pretraining a model for a specific language there is no capacity dilution (i.e., all the model capacity is being used to cater to the language of interest). The argument in favor of the latter is that there might be some benefit of using the additional pretraining data from multiple (related) languages. Existing studies show that there is no clear winner and the right choice depends on a few factors as listed below:
Model capacity. Conneau et al. (2020a) argue that using a high capacity MLLM trained on much larger pretraining data is better than smaller capacity MLLMs. In particular, they compare the performance of mBERT (172M parameters), XLM-Rbase (270M parameters) and XLM-R (559M parameters) with state of the art monolingual models, i.e., BERT (335M parameters) and RoBERTa (355M parameters) (Liu et al., 2019b), on the following datasets: XNLI (Conneau et al., 2018), NER (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), QA (Lewis et al., 2020), MNLI (Williams et al., 2018), QNLI (Wang et al., 2018b), QQP (Iyer et al., 2017; Wang et al., 2018b), SST (Socher et al., 2013; Wang et al., 2018b), MRPC (Dolan and Brockett, 2005; Wang et al., 2018b) and STS-B (Cer et al., 2017; Wang et al., 2018b). They show that, in general, XLM-R performs better than XLM-Rbase, which in turn performs better than mBERT. While none of the MLLMs outperform a state of the art monolingual model, XLM-R matches its performance (within 1-2%) on most tasks. Amount of pretraining data. Conneau et al. (2020a) compare two similar capacity models
Category | Corpus | Train | Dev | Test | Test sets | #Langs | Task | Metric | Domain | Benchmark
Classification | XNLI | 392,702 | 2,490 | 5,010 | Translations | 15 | NLI | Acc. | Misc. | XT, XTR, XG
Classification | PAWS-X | 49,401 | 2,000 | 2,000 | Translations | 7 | Paraphrase | Acc. | Wiki / Quora | XT, XTR, XG
Classification | XCOPA | 33,410+400 | 100 | 500 | Translations | 11 | Reasoning | Acc. | Misc. | XTR
Classification | NC | 100k | 10k | 10k | - | 5 | Sent. Labelling | Acc. | News | XG
Classification | QADSM | 100k | 10k | 10k | - | 3 | Sent. Relevance | Acc. | Bing | XG
Classification | WPR | 100k | 10k | 10k | - | 7 | Sent. Relevance | nDCG | Bing | XG
Classification | QAM | 100k | 10k | 10k | - | 7 | Sent. Relevance | Acc. | Bing | XG
Struct. Pred. | UD-POS | 21,253 | 3,974 | 47-20,436 | Ind. annot. | 37(104) | POS | F1 | Misc. | XT, XTR, XG
Struct. Pred. | WikiANN-NER | 20,000 | 10,000 | 1,000-10,000 | Ind. annot. | 47(176) | NER | F1 | Wikipedia | XT, XTR
Struct. Pred. | NER | 15k | 2.8k | 3.4k | - | 4 | NER | F1 | News | XG
QA | XQuAD | 87,599 | 34,736 | 1,190 | Translations | 11 | Span Extraction | F1/EM | Wikipedia | XT, XTR
QA | MLQA | - | - | 4,517-11,590 | Translations | 7 | Span Extraction | F1/EM | Wikipedia | XT, XTR
QA | TyDiQA-GoldP | 3,696 | 634 | 323-2,719 | Ind. annot. | 9 | Span Extraction | F1/EM | Wikipedia | XT, XTR
Retrieval | BUCC | - | - | 1,896-14,330 | - | 5 | Sent. Retrieval | F1 | Wiki / News | XT
Retrieval | Tatoeba | - | - | 1,000 | - | 33(122) | Sent. Retrieval | Acc. | Misc. | XT
Retrieval | Mewsli-X | 116,903 | 10,252 | 428-1,482 | Ind. annot. | 11(50) | Lang. agn. retrieval | mAP@20 | News | XTR
Retrieval | LAReQA XQuAD-R | 87,599 | 10,570 | 1,190 | Translations | 11 | Lang. agn. retrieval | mAP@20 | Wikipedia | XTR

Table 2: Benchmarks for all the tasks used for evaluation of MLLMs. #Langs represents the number of languages considered from the entire pool of languages as part of the benchmark. Here XT refers to XTREME, XTR refers to XTREME-R and XG refers to XGLUE.
pretrained on different amounts of data (Wikipedia v/s CommonCrawl) and show that the model trained on larger data consistently performs better on 3 different tasks (XNLI, NER, MLQA). The model trained on larger data is also able to match the performance of state of the art monolingual models. Wu and Dredze (2020) perform an exhaustive study comparing the performance of state of the art monolingual models (trained using in-language training data) with mBERT based models (fine-tuned using in-language training data). Note that the state of the art monolingual model used in these experiments is not necessarily a pretrained BERT based model (as for many low resource languages pretraining with smaller amounts of corpora does not really help). They consider 3 tasks and a large number of languages: NER (99 languages), POS tagging (54 languages) and Dependency Parsing (54 languages). Their main finding is that for the bottom 30% of languages mBERT's performance drops significantly compared to a monolingual model. They attribute this poor performance to the inability of mBERT to learn good representations for these languages from the limited pretraining data. However, they also caution that this is not necessarily due to "multilingual" pretraining as a monolingual BERT trained on these low resource languages still performs poorly compared to mBERT. The performance is worse than a state of the art non-BERT based model simply because pretraining with smaller corpora in these languages
is not useful (neither in a multilingual nor in a monolingual setting).
The importance of the amount of pretraining is also emphasised in Agerri et al. (2020) where they show that a monolingual BERT based model pretrained with larger data (216M tokens v/s 35M tokens in mBERT) outperforms mBERT on 4 tasks: topic classification, NER, POS tagging and senti- ment classification (1.5 to 10 point better across these tasks). Similarly, Virtanen et al. (2019) show that pretraining a monolingual BERT from scratch for Finnish with larger data (13.5B tokens v/s 450M tokens in mBERT) outperforms mBERT on 4 tasks: NER, POS tagging, dependency parsing and news classification. Both these works partly attribute the better performance to better tokenization and vo- cabulary representation in the monolingual model. Similarly Rönnqvist et al. (2019) show that for four Nordic languages (Danish, Swedish, Norwegian, Finnish) the performance of mBERT is very poor as compared to its performance on high resource languages such as English and German. Further, even for English and German, the performance of mBERT is poor when compared to the correspond- ing monolingual models in these languages. Ro et al. (2020) report similar results for Open Informa- tion extraction where a BERT based English model outperforms mBERT. In contrast to the results pre- sented so far, de Vargas Feijó and Moreira (2020) show that across 7 different tasks and many differ- ent experimental setups, on average mBERT per- forms better than a monolingual Portuguese model
trained on much larger data (4.8GB/992M tokens v/s 2GB in mBERT).
In summary, based on existing studies, there is no clear answer to the following question: “At what size of monolingual corpora does the advan- tage of training a multilingual model disappear?”. Also note that most of these studies use mBERT and hence more experiments involving XLM-Rbase, XLM-Rlarge and other recent MLLMs are required to draw more concrete conclusions.
Tokenization Apart from the pretraining corpora size, Rust et al. (2021) show that monolingual models perform better than multilingual models due to their language specific tokenizers. To decouple the two factors, viz., the pretraining corpora size and the tokenizer, they perform two experiments across 9 typologically diverse languages and 5 tasks (NER, QA, Sentiment analysis, Dependency parsing, POS tagging). In the first experiment, they train two monolingual models using the same pretraining data but with two different tokenizers, one being a language specific tokenizer and the other being the mBERT tokenizer. In the second experiment, they retrain the embedding layer of mBERT (the other layer weights are frozen) while using the two tokenizers mentioned above (monolingual tokenizer and mBERT tokenizer). In both the experiments, in 38/48 combinations of model, task and language, they find that models which use monolingual tokenizers perform much better than models which use the mBERT tokenizer. They further show that the better performance of monolingual tokenizers over the mBERT tokenizer can be attributed to lower fertility and a lower proportion of continued words (i.e., words that are tokenized into at least two sub tokens).
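For illustration, the sketch below computes tokenizer fertility and the proportion of continued words for a given tokenizer on a toy sample; the sentences and the mBERT checkpoint are placeholders, and real comparisons are run over large corpora.

```python
from transformers import AutoTokenizer

def fertility_and_continuation(tokenizer, sentences):
    """Fertility: average number of subwords per whitespace-separated word.
    Continuation rate: fraction of words split into at least two subwords."""
    n_words = n_subwords = n_continued = 0
    for sent in sentences:
        for word in sent.split():
            pieces = tokenizer.tokenize(word)
            n_words += 1
            n_subwords += len(pieces)
            n_continued += len(pieces) > 1
    return n_subwords / n_words, n_continued / n_words

# Illustrative comparison on a toy sample; any tokenizer can be plugged in.
sample = ["Helsingissä sataa lunta tänään", "It is snowing in Helsinki today"]
mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(fertility_and_continuation(mbert, sample))
```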
Joint v/s individual fine-tuning. Once a model is pretrained, there are two choices for fine-tuning it (i) fine-tune an individual model for each language using the training data for that language or (ii) fine-tune a joint model by combining the training data available in all the languages. For example, Virtanen et al. (2019) consider a scenario where NER training data is available in 4 languages. They show that a joint model fine-tuned using the training data in all the 4 languages matches the performance of monolingual models individually fine-tuned for each of these languages. The performance drop (if any) is so small that it does
not offset the advantage of deploying/maintaining a single joint model as opposed to 4 individual models. Moon et al. (2019) report similar results for NER and strongly advocate a single joint model. Lastly, Wang et al. (2020a) report that for the task of identifying offensive language in social media posts, a jointly trained model outperforms individually trained models on 4 out of the 5 languages considered. However, more careful analysis involving a diverse set of languages and tasks is required to conclusively say if a jointly fine-tuned MLLM is preferable over individually fine-tuned language specific models.
Amount of task-specific training data. Typi- cally, there is some correlation between the amount of task-specific training data and the amount of pretraining data available in a language. For exam- ple, a low resource language would have smaller amounts of training data as well as pretraining data. Hence, intuitively, one would expect that in such cases of extreme scarcity, a multilingual model would perform better. However, Wu and Dredze (2020) show that this is not the case. In particular, they show that for the task of NER for languages with only 100 labeled sentences, a monolingual model outperforms an mBERT based model. The reason for this could be a mix of poor tokenization, lower vocabulary share and poor representation learning from limited pretraining data. We believe that more experiments which carefully ablate the amount of training data across different tasks and languages would help us better understand the utility of MLLMs.
Summary: Based on existing studies it is not clear whether MLLMs are always better than monolin- gual models. We recommend a more systematic study where the above parameters are carefully ab- lated for a wider range of tasks and languages.
5 Do MLLMs facilitate zero-shot cross-lingual transfer?
In the context of MLLMs, the standard zero-shot cross-lingual transfer from a source language to a target language involves the following steps: (i) pretrain a MLLM using training data from multiple languages (including the source and target language) (ii) fine-tune the source model on task-specific training data available in the source language (iii) evaluate the fine-tuned model on test
data from the target language. In this section, we summarise existing studies on using MLLMs for cross-lingual transfer and highlight some factors which could influence their performance in such a setup.
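The following sketch outlines this evaluation recipe for XNLI, assuming a hypothetical checkpoint that has already been fine-tuned on the English training data; the model path, the evaluation languages and the number of test examples are placeholders.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumes a checkpoint already fine-tuned on English XNLI; the path below
# is a placeholder, not a published model.
model_name = "path/to/xlmr-finetuned-on-english-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

def zero_shot_accuracy(lang, n_examples=500):
    """Evaluate the English-fine-tuned model directly on another language."""
    data = load_dataset("xnli", lang, split=f"test[:{n_examples}]")
    correct = 0
    for ex in data:
        inputs = tokenizer(ex["premise"], ex["hypothesis"],
                           truncation=True, return_tensors="pt")
        with torch.no_grad():
            pred = model(**inputs).logits.argmax(-1).item()
        correct += int(pred == ex["label"])
    return correct / len(data)

for lang in ["de", "sw", "ur"]:
    print(lang, zero_shot_accuracy(lang))
```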
Shared vocabulary. Before training a MLLM, the sentences from all languages are first tokenized by jointly training a WordPiece model (Schuster and Nakajima, 2012) or a SentencePiece model (Kudo and Richardson, 2018b). The basic idea is to tokenize each word into high frequency sub- words. The vocabulary used by the model is then a union of such subwords identified across all lan- guages. This joint tokenization ensures that the subwords are shared across many similar (or even distant) languages. For example, the token ‘es’ in the vocabulary would be shared by English, French, German, Spanish, etc. It is thus possible that if a subword which appears in the testset of the target language is also present in the training set of the source language then some model performance will be transferred through this shared vocabulary. Sev- eral studies examine the role of this shared vocabu- lary in enabling cross-lingual transfer as discussed below.
Pires et al. (2019) and Wu and Dredze (2019) show that there is strong positive correlation between cross-lingual zero-shot transfer perfor- mance and the amount of shared vocabulary between the source and target language. The results are consistent across 5 different tasks (MLDoc (Schwenk and Li, 2018), XNLI (Conneau et al., 2018), NER (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), POS (Nivre et al., 2018a), Dependency parsing (Nivre et al., 2018a) ) and 5-39 different languages (the number of languages varies across tasks). However, K et al. (2020) present contradictory results and show that the performance drop is negligible even when the word overlap is reduced to zero synthetically (by shifting the unicode of each English character by a large constant thereby replacing it by a completely different character). On similar lines Artetxe et al. (2020a) show that cross-lingual transfer can happen even when the vocabulary is not shared, by appropriately fine-tuning all layers of the transformer except for the input embedding layer.
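One rough way to quantify such shared vocabulary is the overlap between the subword types observed in the source and target language corpora, as in the sketch below; the tokenizer checkpoint and the toy sentences are illustrative assumptions.

```python
from transformers import AutoTokenizer

def subword_overlap(tokenizer, corpus_a, corpus_b):
    """Fraction of subword types in corpus_b that also occur in corpus_a,
    a rough measure of shared vocabulary between two languages."""
    types_a = {tok for sent in corpus_a for tok in tokenizer.tokenize(sent)}
    types_b = {tok for sent in corpus_b for tok in tokenizer.tokenize(sent)}
    return len(types_a & types_b) / len(types_b)

# Toy example; real studies compute this over the task's train/test corpora.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
english = ["The president met the delegation in Berlin."]
spanish = ["El presidente se reunió con la delegación en Berlín."]
print(subword_overlap(tok, english, spanish))
```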
Architecture of the MLLM. The model capacity of a MLLM depends on the number of layers,
number of attention heads per layer and the size of the hidden representations. K et al. (2020) show that the performance of cross-lingual transfer depends on the depth of the network and not so much on the number of attention heads. In particular, they show that decent transfer can be achieved even with a single headed attention. Similarly, the total number of parameters in the model are not as important as the number of layers in determining the performance of cross-lingual transfer. In fact, Dufter and Schütze (2020) argue that mBERT’s multilinguality is due to its limited number of parameters which forces it to exploit common structures to align representations across languages. Another important architectural choice is the context window for self attention, i.e., the number of tokens fed as input to the MLLM during training. Liu et al. (2020a) show that using smaller context windows is useful if the pretraining data is small (around 200K sentences) but when large amounts of pretraining data is available then it is beneficial to use longer context windows.
Size of pretraining corpora. Liu et al. (2020a) show that cross-lingual transfer is better when mBERT is pretrained on larger corpora (1000K sentences per language) as opposed to smaller corpora (200K sentences per language). Lauscher et al. (2020) show that for higher level tasks such as XNLI and XQuAD, the performance of zero-shot transfer has a strong correlation with the amount of data in the target language used for pretraining the MLLM. For lower level tasks such as POS, dependency parsing and NER also there is a positive correlation but not as high as that for XNLI and XQuAD.
Fine-tuning strategies. Liu et al. (2020c) argue that when a MLLM is fine-tuned its parameters change which weakens its cross-lingual ability as some of the alignments learnt during pretraining are lost. They motivate this by showing that the cross-lingual retrieval performance of a MLLM drops drastically when it is fine-tuned for the task of POS tagging. To avoid this, they suggest using a continual learning framework for fine-tuning so that the model does not forget the original task (masked language modeling) that it was trained on. Using such a fine-tuning strategy they report better results on cross-lingual POS tagging, NER and sentence retrieval.
Using bitext. While MLLMs show good transfer without explicitly being trained with any cross-lingual signals, it stands to reason that explicitly providing such signals during training should improve the performance. Indeed, XLM (Conneau and Lample, 2019) and InfoXLM (Chi et al., 2021a) show that using parallel corpora with the TLM objective gives improved performance. If a parallel corpus is not available, then Dufter and Schütze (2020) suggest that it is better to train MLLMs with comparable corpora (say Wikipedia or CommonCrawl) than using corpora from different sources across languages.
Explicit alignment of representations. Some works (Wang et al., 2019; Liu et al., 2019a) compare the performance of zero-shot cross- lingual transfer using (i) representations learned by MLLMs which are implicitly aligned and (ii) representations from monolingual models which are then explicitly aligned post-hoc using some bitext. They observe that the latter leads to better performance. Taking cue from this, Cao et al. (2020); Wang et al. (2020d) propose a method to explicitly align the representations of aligned word pairs across languages while training mBERT. This is achieved by adding a loss function which minimises the Euclidean distance between the embeddings of aligned words. Zhao et al. (2020) also report similar results by explicitly aligning the representations of aligned word pairs and further normalising the vector spaces (i.e., ensuring that the representations across all languages have zero mean and unit variance).
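A minimal sketch of such an explicit alignment objective, in the spirit of the above works, is shown below: it pulls the contextual embeddings of aligned word pairs together by minimising their squared Euclidean distance. The tensor shapes and the toy alignment are placeholders; in practice the loss is added to the pretraining or fine-tuning objective of the MLLM.

```python
import torch

def alignment_loss(src_states, tgt_states, aligned_pairs):
    """src_states, tgt_states: (seq_len, d) contextual embeddings of a parallel
    sentence pair; aligned_pairs: list of (i, j) word-alignment indices.
    The loss pulls embeddings of aligned words together (squared L2 distance)."""
    src_idx = torch.tensor([i for i, _ in aligned_pairs])
    tgt_idx = torch.tensor([j for _, j in aligned_pairs])
    diff = src_states[src_idx] - tgt_states[tgt_idx]
    return (diff ** 2).sum(dim=-1).mean()

# Toy usage with random states; in practice these come from the MLLM encoder.
src, tgt = torch.randn(7, 768), torch.randn(9, 768)
print(alignment_loss(src, tgt, [(0, 0), (1, 2), (3, 3)]).item())
```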
Knowledge Distillation. Both Wang et al. (2020b) and Chi et al. (2020b) argue that due to limited model capacity MLLMs cannot capture all the nu- ances of multiple languages as compared to a pre- trained monolingual model which caters to only one language. They show that the cross-lingual performance of a MLLM can be improved by dis- tilling knowledge from a monolingual model. The above works apply knowledge distillation on the task-specific setting, i.e., the teacher LMs are first fine-tuned on a specific task and then this knowl- edge is distilled to a student LM. Khanuja et al. (2021b) focus on knowledge distillation in a task- agnostic setting. They distill knowledge from mul- tiple multilingual teacher LMs (with overlapping
languages) and show positive transfer from strong teachers for low resourced languages not covered by the teacher models.
Source language used for fine-tuning. English is currently the most widely used source language for evaluating the cross-lingual transfer performance of MLLMs. This could be due to the easy availability of English datasets and also because popular multilingual benchmarks like XGLUE and XTREME use English as the default source language. To evaluate the effectiveness of this default choice and compare with other high resource alternatives, Lin et al. (2019) formulate the task of choosing the best source as a learning to rank problem. Using handcrafted features (dataset size, word/subword overlap, genetic distance between languages, etc.), they train a Gradient Boosting Decision Tree (Ke et al., 2017) to predict the source language which would be most suited for cross-lingual transfer. They show that the source languages predicted by their model outperform other baselines on Machine Translation and give comparable results for POS tagging. Similarly, Turc et al. (2021) try to find the best source language when the set of target languages is large and unknown beforehand. They study mBERT and mT5 on classification (XNLI, PAWS-X) and QA tasks (XQuAD, TyDi QA) and show that German and Russian are often more effective source languages than English. They also find that zero-shot transfer can often be improved by fine-tuning on English datasets which are machine translated to better source languages.
Complexity of the task. As outlined in Section 3, the tasks used for evaluating MLLMs are of varying complexity (ranging from binary classification to Question Answering). Ruder et al. (2021) show that much of the progress on zero-shot transfer on existing benchmarks has not been uniform, with most of the gains coming from cross-lingual retrieval tasks. Here again, the gains are mainly due to fine-tuning on other tasks and pretraining with parallel data. Progress on cross-lingual QA datasets (such as MLQA) is minimal, and the overall scores on the QA task are much lower than for monolingual QA in English. Similarly, on structured prediction tasks like NER and POS tagging, there is little improvement in performance going from some of the earlier models like XLM-R (Conneau et al., 2020a) to some of the more recent models like VECO (Luo et al., 2021). They recommend that (i) some of the easier tasks such as BUCC and PAWS-X should be dropped from these evaluation benchmarks and (ii) more complex tasks such as cross-lingual retrieval from a mixed multilingual pool should be added (e.g., LAReQA (Roy et al., 2020) and Mewsli-X (Ruder et al., 2021)).
Summary: While not conclusive, existing studies provide some evidence that the performance of MLLMs on zero-shot cross-lingual transfer is generally better when (i) the source and target languages share some vocabulary, (ii) there is some similarity between the source and target languages, (iii) the MLLM uses a deeper architecture, (iv) enough pretraining data is available in the target languages, (v) a continual learning (learning-without-forgetting) framework is used, (vi) the representations are explicitly aligned using bitext and appropriate loss functions, and (vii) the complexity of the task is low. Note that in all the above cases, cross-lingual transfer using MLLMs performs much worse than using in-language supervision (as expected). Figure 1 shows the zero-shot scores of models on the XNLI benchmark. Further, in most cases it performs worse than a translate-train1 or a translate-test2 baseline.
6 Are MLLMs useful for bilingual tasks?
This survey has so far looked at the utility of MLLMs for cross-lingual tasks, where the multilingual capabilities of the MLLMs help in transfer learning and building universal models. Recent work has also explored the utility of MLLMs for bilingual tasks like word alignment, sentence retrieval, etc. This section analyzes the role of MLLMs for such bilingual tasks.
6.1 Word Alignment
Recent work has shown that MLLMs, which are trained on monolingual corpora only, can be used to learn high-quality word alignments in parallel sentences (Libovický et al., 2020; Jalili Sabet et al.,
1 translate-train: The training data available in one language is translated to the language of interest using an MT system. This (noisy) data is then used for training a model in the target language.
2 translate-test: The training data available in one (high-resource) source language is used to train a model in that language. The test data from the language of interest is then translated to the source language using an MT system.
2020). This presents a promising unsupervised alternative to statistical aligners (Brown et al., 1993; Och and Ney, 2003; Östling and Tiedemann, 2016; Dyer et al., 2013) and neural MT based aligners (Chen et al., 2020; Zenkel et al., 2020), both of which are trained on parallel corpora. Finding the best word alignment can be framed as a maximum-weight maximal matching problem in the weighted bipartite graph induced by the distances between word embeddings. Greedy, iterative and optimal-transport based solutions to this problem have been proposed; the iterative solutions seem to perform best, with good alignment speed as well. Using contextual embeddings from some intermediate layers gives better word alignments than the top encoder layer. Unlike statistical aligners, these MLLM based aligners are inherently multilingual. They also significantly outperform aligners based on static word embeddings like FastText3. The MLLM based aligners can be significantly improved by further training on parallel corpora and/or word-aligned data; such fine-tuning also preserves their multilingual word-alignment ability (Dou and Neubig, 2021). Word alignment is useful for transfer in downstream tasks using the translate-test paradigm, which needs projection of spans between parallel texts (e.g., question answering, POS, NER, etc.).
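The sketch below (ours; the embeddings are assumed to come from some intermediate MLLM layer) illustrates the matching step: build a similarity matrix between source and target token embeddings and extract either a greedy "argmax both ways" alignment or an optimal assignment. The iterative and optimal-transport variants mentioned above are not shown.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(src_emb, tgt_emb, method="greedy"):
    """src_emb: [m, d], tgt_emb: [n, d] contextual token embeddings
    (e.g., from an intermediate MLLM layer). Returns (i, j) index pairs."""
    # Cosine similarity matrix between all source/target token pairs.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T
    if method == "greedy":
        # Keep a pair only if each token is the other's nearest neighbour.
        return [(i, j) for i in range(sim.shape[0])
                for j in [int(sim[i].argmax())]
                if int(sim[:, j].argmax()) == i]
    # Maximum-weight matching in the bipartite similarity graph
    # (negate similarities because linear_sum_assignment minimises cost).
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows.tolist(), cols.tolist()))
```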
6.2 Cross-lingual Sentence Retrieval
Since MLLMs represent sentences from different languages in a common embedding space, they can be used to find the translation of a sentence in another language using nearest-neighbour search. The sentence embedding can be obtained from the [CLS] token or by appropriate pooling operations over the token embeddings. However, the encoder representations across different languages are not aligned well enough for high-quality sentence retrieval (Libovický et al., 2020). Unlike word alignment, embeddings learnt from monolingual corpora alone are not sufficient for cross-lingual sentence retrieval, given the large search space. These shortcomings can be overcome by centering the sentence embeddings (Libovický et al., 2020) and/or fine-tuning the MLLMs on parallel corpora with contrastive/margin-based objectives (Feng et al., 2020; Ruder et al., 2021), which results in high-quality sentence retrieval.
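A minimal sketch of such retrieval follows (our illustration; mean pooling is one of several reasonable pooling choices, and the centering step corresponds to removing a per-language mean as suggested by Libovický et al. (2020)).

```python
import numpy as np

def center(embs):
    # Remove the per-language mean so that embeddings of different
    # languages become more directly comparable.
    return embs - embs.mean(axis=0, keepdims=True)

def retrieve(src_sent_emb, tgt_sent_embs):
    """src_sent_emb: [d] query embedding; tgt_sent_embs: [n, d] candidate
    embeddings, both mean-pooled from MLLM token states and centered
    beforehand. Returns the index of the nearest-neighbour candidate."""
    q = src_sent_emb / np.linalg.norm(src_sent_emb)
    c = tgt_sent_embs / np.linalg.norm(tgt_sent_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))
```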
3https://github.com/facebookresearch/fastText
Figure 1: Comparison of the performance of different MLLMs on the XNLI task. We use XNLI as this task is used by almost all the MLLMs. The size of each entry is proportional to the number of parameters in the model. Entries shown as circles use parallel data for the tasks described in Sections 2.2.2 and 2.2.3. Entries shown as diamonds use only monolingual data. MuRIL reports XNLI scores only for Indian languages.
6.3 Machine Translation
NMT models are typically encoder-decoder models with attention, trained using parallel corpora. It has been shown that using BERT for initializing the encoder/decoder (Conneau and Lample, 2019; Imamura and Sumita, 2019; Ma et al., 2020) or for extracting features (Zhu et al., 2020) can help MT by pretraining the model with linguistic information. In particular, Conneau and Lample (2019) show that initialization of the encoder/decoder with a pretrained model can help in data-scarce scenarios like unsupervised NMT and low-resource NMT, and can act as a substitute for backtranslation in supervised NMT scenarios. Subsequent research has drawn inspiration from the success of MLLMs and has shown the utility of denoising pretraining strategies specifically for sequence-to-sequence models (Liu et al., 2020b; Xue et al., 2021).
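As an illustration of this warm-starting strategy, the sketch below uses the Hugging Face transformers EncoderDecoderModel helper; the checkpoint name is a placeholder and this is not the exact configuration used in the cited works.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

# Warm-start both encoder and decoder of a seq2seq model from a
# pretrained multilingual encoder checkpoint (placeholder name).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
# The model can now be fine-tuned on parallel data with the usual
# cross-entropy objective; the pretrained weights serve as initialisation
# rather than being trained from scratch.
```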
To summarize, MLLMs are useful for some bilingual tasks, particularly in low-resource scenarios, and fine-tuning with parallel data provides added benefits.
7 Do MLLMs learn universal patterns?
The success of language models such as BERT has led to a large body of research on understanding how such language models work, what information they learn, and how that information is represented in the models. Many of the questions studied in 'BERTology' are also relevant to multilingual language models (MLLMs), given the similarity in the neural architectures of these networks. But one question relates specifically to the analysis of MLLMs: do these models learn and represent patterns which generalise across languages? Such an expectation is an old one, going back to the "universals of language" proposed in 1966 (Greenberg, 1966), and has been studied at different times. For instance, at the onset of word embeddings, it was shown that the embeddings learnt across languages can be aligned effectively (Mikolov et al., 2013). This expectation is renewed by MLLMs demonstrating surprisingly high cross-lingual transfer, as summarised in §5.
Inference on parallel text. Different works approach this question of universal patterns in different ways. One set of methods analyses the inference of MLLMs on parallel text in different languages. With such parallel text, the intermediate representations of the MLLM can be compared to identify alignment, quantified with mathematical techniques such as Canonical Correlation Analysis (CCA) and Centered Kernel Alignment (CKA). CCA analysis for mBERT showed that the model does not project the representations of different languages onto a shared space, a trend that is stronger towards the later layers of the network (Singh et al., 2019). Further, the correlation of the representations of the languages mirrored language evolution, in particular the phylogenetic trees discovered by linguists. On the other hand, there exist symmetries in the representations of multiple languages, as evidenced by isomorphic embedding spaces (Conneau et al., 2020b). The argument in support of the existence of such symmetries is that monolingual BERT models exhibit high degrees of CKA similarity (Conneau et al., 2020b). Another related technique for finding common representations across languages is translation retrieval: given a sentence in a source language and a few candidate sentences in a target language, can we find the correct translation by identifying the nearest neighbour in the representation space? Such retrieval is found to depend sensitively on the layer from which the representation is taken, peaking in the middle layers (between layers 5 and 8) with accuracy over 75% for related language pairs such as English-German and Hindi-Urdu (Pires et al., 2019). One may conclude that the very large neural capacity of MLLMs leads to multilingual representations that have language-neutral and language-specific components. The language-neutral components are rich enough to align word embeddings and also retrieve similar sentences, but are not rich enough to solve complex tasks such as MT quality evaluation (Libovicky et al., 2019).
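For reference, linear CKA between two sets of representations of the same parallel sentences can be computed as below; this is the standard formulation, not code from the cited papers.

```python
import numpy as np

def linear_cka(X, Y):
    """X: [n, d1], Y: [n, d2] -- layer representations of the same n
    (parallel) sentences in two languages or from two models."""
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```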
Probing tasks. Another approach to studying universality is via 'probing tasks' on the representations learnt at different layers. For instance, consistent dependency trees can be learnt from the representations of intermediate layers, indicating syntactic abstractions in mBERT (Limisiewicz et al., 2020; Chi et al., 2020a). However, the dependency trees were more accurate for Subject-Verb-Object (SVO) languages (such as English, French, and Indonesian) than for SOV languages (such as Turkish, Korean, and Japanese). This disparity between SOV and SVO languages is also observed for part-of-speech tagging (Pires et al., 2019). Each layer has different specialisations, and it is therefore useful to combine information from different layers for best results, instead of selecting a single layer based on the best overall performance, as demonstrated for Dutch on a range of NLU tasks (de Vries et al., 2020). In the same work, a comparison with a monolingual Dutch model revealed that a multilingual model has more informative representations for POS tagging in its earlier layers.
Controlled Ablations. Another set of results controls for several confounding factors which need to be ablated to check the hypothesis that MLLMs learn language-independent representations. The first such confounding factor is the shared script between many of the high-resource languages. This was found not to be a decisive factor by demonstrating that transfer also occurs between languages that do not share a script, such as Urdu written in the Arabic script and Hindi written in Devanagari (Pires et al., 2019). Another important component of the model is the input tokenization. There is a stronger bias towards learning language-independent representations when using sub-word tokenization rather than word-level or character-level tokenization (Singh et al., 2019). Another variable of interest is the pretraining objective. Models such as LASER and XLM, which are trained with cross-lingual objectives, retain language-neutral features in the higher layers better than mBERT and XLM-R, which are trained only with monolingual objectives (Choenni and Shutova, 2020).
In summary, there is no consensus yet on whether MLLMs learn universal patterns. There is clear evidence that MLLMs learn embeddings which have high overlap across languages, primarily between those of the same family. These common representations seem to be clearest in the middle layers, after which the network specialises for different languages as driven by the pretraining objectives. These common representations can be probed to accurately perform supervised NLU tasks such as POS tagging and dependency parsing, in some cases with zero-shot transfer. However, more complex tasks such as MT quality evaluation (Libovicky et al., 2019) or language generation (Rönnqvist et al., 2019) currently remain outside the reach of these models, keeping the debate on universal patterns open.
8 How to extend MLLMs to new languages
Despite their success in zero-shot cross-lingual transfer, MLLMs suffer from the curse of multilinguality, which leads to capacity dilution. This limited capacity is an issue for (i) high-resource languages, as the performance of MLLMs for such languages is typically lower than that of the corresponding monolingual models, (ii) low-resource languages, where the performance is even poorer, and finally (iii) languages which are unseen during training (the last point is obvious but needs to be stated nonetheless). Given this situation, an obvious question to ask is how to enhance the capacity of MLLMs such that it benefits languages already seen during training (high resource or low resource) as well as languages which were unseen during training. The solutions proposed in the literature to address this question can be broadly classified into four categories, as discussed below.
Fine-tuning on the target language. Here the assumption is that we only care about the performance on a single target language at test time. To ensure that this language gets an increased share of the model capacity, we can simply fine-tune the pretrained MLLM using monolingual data in this language. This is akin to fine-tuning a MLLM (or even a monolingual LM) for a downstream task. Pfeiffer et al. (2020b) show that such target-language adaptation prior to task-specific fine-tuning using source-language data leads to improved performance over the standard cross-lingual transfer setting. In particular, it does not result in catastrophic forgetting of the multilingual knowledge learned during pretraining, which enables cross-lingual transfer. The disadvantage, of course, is that the model is now specific to the given target language and may not be suitable for other languages. Further, this method does not address the fundamental limitation in model capacity and still hinders adaptation to low-resource and unseen languages.
Augmenting vocabulary. A simple but effective way of extending a MLLM to a new language which was unseen during training is to augment the vocabulary of the model with new tokens corresponding to the target language. This in turn leads to additional parameters being created in the input (embedding) layer and the decoder (output) layer. The pretrained MLLM can then be further trained using monolingual data from the target language so that the newly introduced parameters get trained. Wang et al. (2020c) show that such an approach is not only effective for unseen languages but also benefits languages that are already present in the MLLM. The increase in performance on languages already seen during training is surprising and can be attributed to (i) increased representation of the language in the vocabulary, (ii) focused fine-tuning on the target language, and (iii) increased monolingual data for the target language. Some studies have also considered unseen languages which share vocabulary with already seen languages. For example, Muller et al. (2020) show that for Narabizi (a North African Arabic dialect written in the Latin script with a lot of code-mixing with French), simply fine-tuning the model with very few sentences from Narabizi (around 50,000 sentences) leads to reasonable zero-shot transfer from French. For languages with an unseen script, Muller et al. (2020) suggest that such languages should be transliterated to a script used by a related language seen during pretraining.
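A hedged sketch of this vocabulary-augmentation recipe using the Hugging Face transformers API is given below; the new tokens and the checkpoint name are placeholders, and the cited works describe their own token-selection procedures.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Add target-language subword tokens that are missing from the vocabulary
# (placeholder tokens; in practice they come from a tokenizer trained on
# target-language monolingual data).
new_tokens = ["##placeholder_a", "##placeholder_b"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input embedding (and tied output) matrix to cover the new
# tokens; the newly created rows are randomly initialised and then trained
# with the MLM objective on target-language monolingual data.
model.resize_token_embeddings(len(tokenizer))
```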
Exploiting Latent Semantics in the embedding matrix. Pfeiffer et al. (2020c) argue that lexically overlapping tokens play an important role in cross-lingual transfer. However, for a new language with an unseen script such transfer is not possible. To leverage the multilingual knowledge learned in the embedding matrix, they factorise the embedding matrix (in R^{|V|×d}) into lower-dimensional token embeddings F ∈ R^{|V|×d1} and C shared up-projection matrices G1, ..., GC ∈ R^{d1×d}. The matrix F encodes token-specific information and the up-projection matrices encode general cross-lingual information, with each token associated with one of the C up-projection matrices. For an unseen language T with an unseen script, they then learn a new embedding matrix F′ ∈ R^{|VT|×d1} and an assignment of each token in that language to one of the C up-projection matrices. This allows the token to leverage the multilingual knowledge already encoded in the pretrained MLLM via the up-projection matrices.
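A minimal sketch of this factorisation follows (our notation; the dimensions mirror the description above, and the token-to-matrix assignment is assumed to be given here, whereas the cited work derives it from the pretrained embeddings).

```python
import torch

V, d, d1, C = 30000, 768, 128, 8               # vocab, hidden, reduced dim, #groups
F = torch.nn.Parameter(torch.randn(V, d1))      # token-specific low-dim embeddings
G = torch.nn.Parameter(torch.randn(C, d1, d))   # shared up-projection matrices
assignment = torch.randint(0, C, (V,))          # which G_c each token uses (given)

def embed(token_ids):
    # E[t] = F[t] @ G[assignment[t]]: compose the token-specific part with
    # the shared cross-lingual up-projection it is assigned to.
    f = F[token_ids]                     # [batch, d1]
    g = G[assignment[token_ids]]         # [batch, d1, d]
    return torch.bmm(f.unsqueeze(1), g).squeeze(1)   # [batch, d]
```

For a new language, only a new F′ and a new assignment need to be learned; the shared matrices G stay fixed.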
Using Adapters. Another popular way to increase model capacity is to use adapters (Rebuffi et al., 2017; Houlsby et al., 2019), which introduce a small number of additional parameters for every language and/or task, thereby augmenting the limited capacity of MLLMs. Several studies have shown the effectiveness of such adapters in MLLMs (Üstün et al., 2020; Vidoni et al., 2020; Pfeiffer et al., 2020b; Nguyen et al., 2021). We explain the overall idea by referring to the work of Pfeiffer et al. (2020b). An adapter can be added at every layer of the transformer. Let h be the hidden size of the Transformer and d be the dimension of the adapter. An adapter layer simply contains a down-projection (D : R^h → R^d) followed by a ReLU activation and an up-projection (U : R^d → R^h). Such an adapter layer thus introduces 2·d·h additional parameters for every language. During adaptation, the rest of the parameters of the MLLM are kept fixed and the adapter parameters are trained using unlabelled data from the target language with the MLM objective.
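A minimal PyTorch sketch of such an adapter layer is shown below (ours; the residual connection and the placement after each transformer sub-layer follow the usual adapter recipe rather than any one paper's exact configuration).

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project to d, apply ReLU, up-project back
    to h, and add a residual connection. Adds roughly 2*d*h parameters."""
    def __init__(self, hidden_size: int, adapter_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_dim)  # D: R^h -> R^d
        self.up = nn.Linear(adapter_dim, hidden_size)    # U: R^d -> R^h

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))
```

During language adaptation only these adapter parameters are updated while the rest of the MLLM stays frozen.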
Further, Pfeiffer et al. (2020b) propose that during task-specific fine-tuning the language adapter of the source language should be used, whereas during zero-shot cross-lingual transfer, at test time, the source language adapter should be replaced by the target language adapter. However, this requires that the underlying MLLM does not change during fine-tuning. Hence, they add a task-specific adapter layer on top of the language-specific adapter layer. During task-specific fine-tuning, only the parameters of this task adapter are trained, leaving the rest of the MLLM unchanged. Vidoni et al. (2020) use orthogonality constraints to ensure that the language and task specific adapters learn complementary information. Lastly, to account for unseen languages whose vocabulary is not seen during training, invertible adapters are added at the input layer. The function of these adapters is to learn token-level characteristics. This adapter is also trained with the MLM objective using unlabelled monolingual data in the target language.
In the context of the above discussion on adapters, we would like to point the readers to AdapterHub.ml4, which is a useful repository containing all recent adapter architectures (Pfeiffer et al., 2020a).
9 Recommendations
Based on our review of the literature on MLLMs we make the following recommendations:
Ablation Studies. The design of deep neural models involves various parameters which are often optimized only through exhaustive ablation studies. In the case of MLLMs, the axes of ablation fall into three sets: architectural parameters, pretraining objectives, and the subset of languages chosen. Given the number of options in each of these sets, an exhaustive ablation study would be prohibitively expensive. However, in the absence of such a study some questions remain open. For instance, what subset of languages should one choose for training a multilingual model? How should the architecture be shaped as we change the number of languages? One research direction is to design controlled and scaled-down ablation studies where a broader set of parameters can be evaluated and generalized guidelines can be derived.
4https://github.com/Adapter-Hub/adapter-transformers
Zero-Shot Evaluation. The primary promise of MLLMs remains cross-lingual performance, especially with zero-shot learning. However, published results on zero-shot learning show a wide variance across tasks and languages (Keung et al., 2020). A more systematic study, controlling for the design parameters discussed above as well as the training and test sets, is required. Specifically, there is a need for careful comparisons against translation-based baselines such as translate-test, translate-train, and translate-train-all (Conneau et al., 2019) across tasks and languages.
mBERTology. A large body of literature has studied what the monolingual BERT model learns, and how and where such information is stored in the model (Rogers et al., 2020). One example is the analysis of the role of attention heads in encoding syntactic, semantic, or positional information (Voita et al., 2019). Given the similarity in architecture, such analyses can be extended to MLLMs. This would not only help the interpretability of MLLMs but, crucially, also contrast multilingual models with monolingual models, providing clues on the emergence of cross-linguality.
Language inclusivity. MLLMs hold promise as an 'infrastructure' resource for the long list of languages in the world. Many of these languages are widely spoken but have not received sufficient attention in research and development (Joshi et al., 2020b). Towards this end, MLLMs must become more inclusive, scaling up to thousands of languages. This may require model innovations such as moving beyond language-specific adapters. Crucially, it also requires the availability of inclusive benchmarks in a variety of tasks and languages. Without such benchmark datasets, the research community has little incentive to train and evaluate MLLMs targeting the long list of the world's languages. We see this as an important next phase in the development of MLLMs.
Efficient models. MLLMs are among the largest models being trained today. However, running inference on such large models is often not possible on edge devices and is increasingly expensive on cloud infrastructure. One important research direction is to downsize these models without affecting accuracy. Several standard methods such as pruning, quantization, factorization, distillation, and architecture search have been applied to monolingual models (Tay et al., 2020). These methods need to be explored for MLLMs while ensuring that the generality of MLLMs across languages is retained.
Robust models. MLLMs supporting multiple languages need to be extensively evaluated for any encoded biases and for their ability to generalize. One direction of research is to build extensive diagnostic and evaluation suites such as the MultiChecklist proposed by Ruder et al. (2021). Evaluation frameworks such as Explainaboard (Liu et al., 2021; Fu et al., 2020) also need to be developed for a range of tasks and languages to identify the nature of errors made by multilingual models. It is also important to extend the analysis of bias in deep NLP models to multilingual systems (Blodgett et al., 2020).
10 Conclusion
We reviewed existing literature on MLLMs covering a wide range of sub-areas of research. In particular, we surveyed papers on building better and bigger MLLMs using different training objectives and different resources (monolingual data, parallel data, back-translated data, etc.). We also reviewed existing benchmarks for evaluating MLLMs and covered several studies which use these benchmarks to assess the factors which contribute to the performance of MLLMs in a (i) zero-shot cross-lingual transfer setup, (ii) monolingual setup, and (iii) bilingual setup. Given the surprisingly good performance of MLLMs in several zero-shot transfer learning setups, we also reviewed existing works which investigate whether these models learn any universal patterns across languages. Lastly, we reviewed studies on improving the limited capacity of MLLMs and extending them to new languages.
Based on our review, we recommend that future research on MLLMs should focus on (i) controlled ablation studies involving a broader set of parameters, (ii) comprehensive evaluation of the zero-shot performance of MLLMs across a wider set of tasks and languages, (iii) understanding the patterns learnt by attention heads and other components of MLLMs, (iv) including more languages in pretraining and evaluation, and (v) building efficient and robust MLLMs with the help of better evaluation frameworks (e.g., Explainaboard).
Acknowledgements
We would like to thank the Robert Bosch Center for Data Science and Artificial Intelligence for supporting Sumanth and Gowtham through their Post Baccalaureate Fellowship Program. We also thank the EkStep Foundation for their generous grant to support this work.
References
Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, and Eneko Agirre. 2020. Give your text representation models some love: the case for Basque. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4781–4788. European Language Resources Association.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020a. On the cross-lingual transferability of mono- lingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computa- tional Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4623–4637. Association for Computa- tional Linguistics.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020b. On the cross-lingual transferability of mono- lingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computa- tional Linguistics, pages 4623–4637, Online. Asso- ciation for Computational Linguistics.
Mikel Artetxe and Holger Schwenk. 2019. Mas- sively multilingual sentence embeddings for zero- shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguistics, 7:597–610.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- gio. 2016. Neural machine translation by jointly learning to align and translate.
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint arXiv:2005.14050.
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, An- drew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Con- ference on Computational Natural Language Learn- ing, pages 10–21, Berlin, Germany. Association for Computational Linguistics.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The math- ematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263– 311.
Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Mul- tilingual alignment of contextual word representa- tions. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez- Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
Yun Chen, Yang Liu, Guanhua Chen, Xin Jiang, and Qun Liu. 2020. Accurate word alignment induction from neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natu- ral Language Processing (EMNLP), pages 566–576, Online. Association for Computational Linguistics.
Ethan A. Chi, John Hewitt, and Christopher D. Man- ning. 2020a. Finding universal grammatical rela- tions in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Compu- tational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5564–5577. Association for Computa- tional Linguistics.
Zewen Chi, Li Dong, Furu Wei, Xianling Mao, and Heyan Huang. 2020b. Can monolingual pretrained models help cross-lingual classification? In Pro- ceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Lin- guistics and the 10th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2020, Suzhou, China, December 4-7, 2020, pages 12–17. Association for Computational Linguistics.
Zewen Chi, Li Dong, Furu Wei, Nan Yang, Sak- sham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021a. In- foXLM: An information-theoretic framework for cross-lingual language model pre-training. In Pro- ceedings of the 2021 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3576–3588, Online. Association for Computational Linguistics.
Zewen Chi, Li Dong, Bo Zheng, Shaohan Huang, Xian- Ling Mao, Heyan Huang, and Furu Wei. 2021b.
Improving pretrained cross-lingual language mod- els via self-labeled word alignment. In Proceed- ings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Interna- tional Joint Conference on Natural Language Pro- cessing, ACL/IJCNLP 2021, (Volume 1: Long Pa- pers), Virtual Event, August 1-6, 2021, pages 3418– 3430. Association for Computational Linguistics.
Zewen Chi, Shaohan Huang, Li Dong, Shuming Ma, Saksham Singhal, Payal Bajaj, Xia Song, and Furu Wei. 2021c. Xlm-e: Cross-lingual language model pre-training via electra.
Rochelle Choenni and Ekaterina Shutova. 2020. What does it mean to be language-agnostic? probing mul- tilingual sentence encoders for typological proper- ties. arXiv preprint arXiv:2009.12862.
Hyung Won Chung, Thibault Fevry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2021a. Re- thinking embedding coupling in pre-trained lan- guage models. In International Conference on Learning Representations.
Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2021b. Re- thinking embedding coupling in pre-trained lan- guage models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, and Jason Riesa. 2020. Improving multilingual models with language-clustered vocabularies. In Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, On- line, November 16-20, 2020, pages 4536–4546. As- sociation for Computational Linguistics.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. Tydi qa: A benchmark for information-seeking question answering in typo- logically diverse languages. Transactions of the As- sociation for Computational Linguistics.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettle- moyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettle- moyer, and Veselin Stoyanov. 2020a. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the As- sociation for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Associa- tion for Computational Linguistics.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad- ina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross- lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natu- ral Language Processing. Association for Computa- tional Linguistics.
Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettle- moyer, and Veselin Stoyanov. 2020b. Emerging cross-lingual structure in pretrained language mod- els. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6022–6034, Online. Association for Compu- tational Linguistics.
Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A survey of multilingual neural machine translation. ACM Comput. Surv., 53(5).
Diego de Vargas Feijó and Viviane Pereira Moreira. 2020. Mono vs multilingual transformer-based mod- els: a comparison across several language tasks. CoRR, abs/2007.09757.
Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. Bertje: A dutch bert model.
Wietse de Vries, Andreas van Cranenburgh, and Malv- ina Nissim. 2020. What’s so special about bert’s lay- ers? A closer look at the NLP pipeline in monolin- gual and multilingual models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, On- line Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4339–4350. Associ- ation for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understand- ing. arXiv preprint arXiv:1810.04805.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Associ- ation for Computational Linguistics.
William B Dolan and Chris Brockett. 2005. Automati- cally constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP200