
Source: ufal.mff.cuni.cz/~zabokrtsky/pgs/thesis_proposal/jakub...

Natural Language Correction - Thesis Proposal

Jakub Náplava
Charles University,

Faculty of Mathematics and Physics,
Institute of Formal and Applied Linguistics

[email protected]

Abstract

Tools that correct errors in human-written texts are an important part of many existing systems used by countless people. Although the rise of deep neural networks in recent years made it possible to handle more complex errors, including even grammatical and fluency ones, current models still suffer from several problems. In this thesis proposal, we describe the issues we find most crucial in the area of natural language correction and discuss what has been done so far to solve them. We further describe the work we have performed so far, and finally we provide an overview of our future research plans.

1 Introduction

Natural Language Correction (commonly referred to as Grammatical Error Correction, or simply GEC) is the task of building systems that correct all errors appearing in human-written texts. Given a noisy input text, the corrected text should preserve the semantic meaning while being well-formed and free of errors. For texts that do not contain any errors, no changes should be made.

The variety of error types ranges from simple local errors, such as misspelled words or missing punctuation, up to complex grammar errors, such as subject-verb agreement, or even fluency errors that require a whole segment of text to be rewritten. Two examples taken from an English GEC dataset are presented below.

Figure 1: Two examples from an English GEC dataset. (The annotated corrections were scattered in extraction; they include, e.g., "realocation" → "reallocation" and "siences and tecnologies ... did not developped" → "sciences and technologies ... did not develop".)

One of the reasons why natural language correction is difficult is that broad context is often needed to determine the correct replacement – for certain error types, such as wrong article or tense errors, not even a whole paragraph may be enough. Furthermore, as there are often multiple errors appearing in a text and the variety of error types is large, it is nearly impossible to manually write down correcting rules with sufficient coverage.

Natural language correction systems gradually evolved from simple rule-based systems, through combinations of single-error classifiers, to recent statistical and neural machine translation approaches. These handle the natural language correction task as a translation problem from a noisy input text into a well-formed text. As the recent Building Educational Applications Shared Task on Grammatical Error Correction (Bryant et al., 2019) showed, a slightly modified Transformer architecture is the current state of the art for English.

To evaluate system performance, multiple metrics have been proposed. Probably the most common ones are the MaxMatch (M2) scorer (Dahlmeier and Ng, 2012b) and the ERRANT scorer (Bryant et al., 2017). They evaluate system performance by means of correctly performed edits, where an edit consists of a part of the noisy text and its correction. These edits are extracted automatically. Because omitting a correction is usually not as bad as proposing a bad one, the F0.5 score, emphasizing precision over recall, is used. There are also other metrics, such as GLEU (Napoles et al., 2015) or I-measure (Felice and Briscoe, 2015), but they are used in rather specific scenarios.
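The F-beta computation itself is simple; the following sketch (with made-up edit counts, not taken from any cited system) shows how beta = 0.5 rewards precision over recall:

```python
# Edit-level F-beta from counts of correct edits (tp), spurious edits (fp)
# and missed edits (fn); beta = 0.5 weights precision twice as much as recall.
def f_beta(tp, fp, fn, beta=0.5):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A system proposing 10 edits (8 correct) while missing 8 gold edits:
# precision 0.8, recall 0.5.
print(round(f_beta(8, 2, 8), 3))  # → 0.714
```

With beta = 1 the same counts give only 0.615, so F0.5 indeed favours the precise but conservative system.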

In certain domains and languages, human-level performance has already been matched. Specifically, Ge et al. (2018) claim to reach human-level performance on correcting essays of East-Asian second learners of English. Although the results are


great, they have not been repeated on any other domain or language, as second-learner English is by a wide margin the largest dataset, attracting the majority of conducted research.

While there are numerous datasets for training and evaluating correction systems on English, there is only a limited number of datasets for other languages. Similarly, the reported performance of these systems is currently worse than the performance of the current state-of-the-art systems on English. Specifically, we are aware of only several attempts to perform GEC in Czech, none of them reaching performance comparable to systems for English.

Besides their most straightforward usage in all kinds of editing tools such as Microsoft Word or LibreOffice Writer, language correction systems may also be used in a pre-processing pipeline for other systems that work with user-generated data. These are nowadays frequently trained solely on clean data, and their performance may deteriorate substantially when processing noisy data.

This proposal is structured as follows. In Section 2, we present a short history of language correction systems. In Section 3, we describe the work we have done so far. Section 4 describes our future research plans, and finally Section 5 summarizes this proposal.

2 History of Natural Language Correction

Early natural language correction attempts focused on correcting isolated words. The work on computer techniques for automatic spelling correction began as early as the 1960s (Kukich, 1992). The research mainly focused on two issues: how to detect non-words in a text and how to correct them. One of the first commercial tools to detect spelling errors was the UNIX SPELL (McIlroy, 1982), which contains a list of 30 000 correct English words. As a module for correcting the detected errors was not part of the program, users either had to correct the words by themselves or could use tools for isolated-word spelling correction such as grope (Taylor, 1981).

The first systems that aimed to correct a larger variety of errors employed hand-coded rules. One of such tools is Writer's Workbench (Macdonald et al., 1982), included with Unix systems as far back as the 1970s. It was based on simple pattern matching and string replacement, and its style and diction tools could highlight common grammatical and stylistic errors and propose corrections. Other systems, such as GramCheck (Bustamante and León, 1996) or EPISTLE (Heidorn et al., 1982), employed syntactic analysis with manually designed grammar rules. The great advantage of hand-coded rules is their interpretability, and also the fact that for certain error types they can be implemented easily. On the other hand, it is nearly impossible to define rules that cover all errors in grammar (or fluency); therefore, not much current research is conducted in this direction.

In the 1990s, researchers in natural language processing started utilizing data-driven approaches and applied machine learning to NLP tasks. Because article and preposition errors are both difficult to capture with manual rules and have a small span, multiple machine learning models were proposed to tackle them (Knight and Chander; Minnen et al., 2000; Han et al., 2004; Nagata et al., 2005). Features encoding context, such as neighbouring words or their part-of-speech tags, are typically used as inputs to a machine learning classification model. For example, Han et al. (2004) trained a maximum-entropy classifier to detect article errors and achieved an accuracy of 88%.
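As an illustration of the classifier-based approach (a toy count-based stand-in, not the actual maximum-entropy setup of Han et al. (2004)), one can predict the article from a context feature such as the following word:

```python
from collections import Counter, defaultdict

# Toy count-based article "classifier": predicts "a", "the" or "" (no
# article) from a single context feature, the following word. Real systems
# feed many such features (neighbouring words, POS tags) into a
# maximum-entropy model instead of taking a per-feature majority vote.
def train(examples):
    counts = defaultdict(Counter)
    for next_word, article in examples:
        counts[next_word][article] += 1
    return counts

def predict(model, next_word, default=""):
    if next_word in model:
        return model[next_word].most_common(1)[0][0]
    return default

data = [("sun", "the"), ("sun", "the"), ("dog", "a"),
        ("dog", "the"), ("dog", "a"), ("water", "")]
model = train(data)
print(predict(model, "sun"), predict(model, "dog"))  # → the a
```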

As each trained classifier can only correct a single error type, multiple classifiers must be combined to allow more realistic usage. Dahlmeier et al. (2012) used a pipeline system comprising several sequential steps. Dahlmeier and Ng (2012a) employ specific classifiers together with a language model to score a beam of hypotheses. These are iteratively generated by so-called proposers, each allowed to propose only a small incremental change. Although their system worked quite well, it has several flaws, such as the beam size growing with the number of proposers, or the fact that designing a classifier for certain more complex error types might be complicated.

The Czech system for context-sensitive spelling correction of Richter et al. (2012) does not use any specific classifiers, but its approach can be seen as a generalisation of the approach of Dahlmeier and Ng (2012a). It uses a noisy-channel approach with a candidate model that proposes, for each word, its variants up to a predefined edit distance. As it would be intractable to keep a beam of all the hypotheses, the authors employ a Hidden Markov Model with vertices being the variants of words proposed by the candidate model. Instead of using separate error classifiers, the transition costs


utilize linguistic features. To find an optimal correction, the Viterbi algorithm (Forney, 1973) is used.
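The lattice search itself can be sketched as follows; the candidate sets, uniform emission scores and the bigram transition table below are toy stand-ins for the candidate model and the linguistic-feature costs of the actual system:

```python
# Viterbi search over a lattice of correction candidates, one candidate
# set per input token. Scores are log-probabilities (higher is better).
def viterbi(candidates, emission, transition):
    best = {c: (emission(0, c), [c]) for c in candidates[0]}
    for i in range(1, len(candidates)):
        step = {}
        for c in candidates[i]:
            # pick the predecessor maximizing accumulated score + transition
            prev, (score, path) = max(
                best.items(), key=lambda kv: kv[1][0] + transition(kv[0], c)
            )
            step[c] = (score + transition(prev, c) + emission(i, c), path + [c])
        best = step
    return max(best.values(), key=lambda v: v[0])[1]

# Toy bigram "language model" standing in for the real transition features.
BIGRAM = {("he", "is"): 0.0, ("he", "his"): -3.0,
          ("is", "happy"): 0.0, ("is", "hapy"): -5.0,
          ("his", "happy"): -1.0, ("his", "hapy"): -5.0}
transition = lambda p, c: BIGRAM.get((p, c), -10.0)
emission = lambda i, c: 0.0  # uniform candidate scores in this toy example

cands = [["he"], ["is", "his"], ["happy", "hapy"]]
print(viterbi(cands, emission, transition))  # → ['he', 'is', 'happy']
```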

Brockett et al. (2006) propose to consider natural language correction to be a machine translation problem of translating grammatically incorrect sentences into correct ones. Initially, statistical machine translation systems were employed, and two out of three top-performing systems of the CoNLL 2014 Shared Task on Grammatical Error Correction (Ng et al., 2014) were using statistical machine translation approaches.

Since 2018, the top-performing systems in natural language correction employ neural machine translation approaches (Ge et al., 2018; Chollampatt et al., 2019; Grundkiewicz et al., 2019). These will be described in later sections.

3 Work done so far

Our work in the area of natural language correction started with the development of systems for automatic diacritics generation. While it is very common that people writing in certain languages (e.g. Czech or Vietnamese) and on certain devices write without diacritics, we saw that existing systems based on traditional statistical approaches sometimes do not perform satisfactorily – especially for words that need large context to determine which of their variants with diacritics is correct, and also for out-of-vocabulary words. As recurrent neural networks started to show amazing results on many other natural language processing tasks, we decided to investigate how they would perform on this particular task. Given the relative simplicity of this task, it was also a reasonable first step to evaluate what neural networks are capable of.

After achieving state-of-the-art results on the diacritics generation task, we moved our focus to developing a system that could correct more errors. A natural choice was extending our system with spelling correction, as it is a common part of many current tools. After spending a non-trivial amount of time on developing the most appropriate neural model, we started asking ourselves what spelling correction actually is. Many papers define spelling error correction as locating misspelled words that are not in the vocabulary and replacing them with correct ones. Other papers extend this definition by also correcting words that, despite being in the vocabulary, are in fact spelling errors. We found the latter definition to be more accurate, as accidentally misclicking a computer key may result in a valid

word, which is not covered by the former definition. However, many grammatical errors, such as subject-verb agreement (he appear) or certain cases of verb-tense errors (he disliked vs. he dislikes), would fit into this category as well.

At the time we were at a loss about how to acquire a real-world dataset consisting of spelling errors only, the Building Educational Applications Workshop announced a new shared task on grammatical error correction in English. It was a great incentive to change our focus and, instead of spending time on the intermediate task of vaguely defined spelling error correction, start developing a general system capable of correcting any grammatical errors. The shared task came with 3 tracks: restricted, where only provided data could be used for training; unrestricted, where any data could be used to train a system; and low-resource, where no annotated data could be used for training. We participated in all three of them.

One of the shared task outcomes was that utilizing synthetic data for training helps a great deal even in English, for which numerous annotated datasets already exist. Inspired by this, we asked ourselves how much synthetic noise would help in the case of low-resource languages, i.e. languages that have only thousands of annotated sentences. When we automatically generated artificial synthetic data and combined them with authentic data, we reached new state-of-the-art results on all 3 tested low-resource languages: Czech, German and Russian.
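A minimal sketch of such synthetic noising is shown below; the operations and probabilities are illustrative only and do not reproduce our actual configuration:

```python
import random

# Illustrative synthetic-noise generator: perturbs a clean sentence to
# create a (noisy, clean) training pair. The operations (drop, swap
# adjacent characters, duplicate) and probability p are examples only.
def add_noise(sentence, rng, p=0.1):
    noisy = []
    for word in sentence.split():
        r = rng.random()
        if r < p / 3:
            continue                             # drop the word
        elif r < 2 * p / 3 and len(word) > 1:
            i = rng.randrange(len(word) - 1)     # swap two adjacent characters
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        elif r < p:
            noisy.append(word)                   # duplicate the word
        noisy.append(word)
    return " ".join(noisy)

rng = random.Random(42)
clean = "our ancestors developed sciences and technologies"
print(add_noise(clean, rng, p=0.5))
```

Pairing each noised sentence with its clean original yields synthetic training data in the same (source, target) format as authentic GEC corpora.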

3.1 Diacritics Generation

When writing emails, tweets or other texts in certain languages, people for various reasons sometimes write without diacritics. When using the Latin script, they replace characters with diacritics (e.g. c with acute or caron) by the underlying basic character without diacritics. Practically speaking, they write in ASCII. Diacritics Generation (also known as Automatic Accent Insertion, Diacritics Restoration or even Diacritization) is a subtask of general grammatical error correction which aims to restore all missing diacritics in a text.
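The inverse operation, stripping diacritics, is a one-liner with Unicode decomposition; this sketch shows the transformation the model has to undo:

```python
import unicodedata

# Strip diacritics by decomposing to NFD and dropping combining marks
# (Unicode category "Mn"); restoring them is the learning task.
def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics("Příliš žluťoučký kůň"))  # → Prilis zlutoucky kun
```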

One of the first papers to describe systems for automatic diacritics generation is the seminal work of Yarowsky (1999), who compares several statistical algorithms for the restoration of diacritics in French and Spanish. They used a word-based approach, and their final system combines decision


lists with morphological and collocational information. Word-level approaches were further explored by Crandall (2005); Šantic et al. (2009); Richter et al. (2012). The latter builds a Hidden Markov Model with vertices being word diacritics variants; the transition costs are composed of a morphological lemma feature, a morphological tag feature and a word forms feature. The emission probabilities are based on an error model, and the Viterbi algorithm is used to find an optimal diacritization. There are also approaches that operate on the character level (Mihalcea, 2002; Mihalcea and Nastase, 2002; Scannell, 2011) and claim to work better on low-resource languages. With the recent advent of neural networks, Belinkov and Glass (2015) employed recurrent neural networks and reached new state-of-the-art results on Arabic. Note that although the bidirectional LSTM (Hochreiter and Schmidhuber, 1997) backbone is the same as in our work, they do not utilize residual connections to allow building deeper models, and also do not incorporate a language model in decoding.

The majority of papers we have found evaluated their models on a restricted number of languages (usually one to three). Therefore, we started our work by analyzing the set of languages for which the task of diacritics generation may be relevant (difficult). For each language contained within UD 2.0 (Nivre et al., 2017), we measured the ratio of words with diacritics. As a high occurrence of words with diacritics does not naturally imply that generating diacritics is an ambiguous task for the language, we also evaluated the word error rate of a simple dictionary baseline: using a large raw text corpus, we constructed a dictionary of the most frequent variant with diacritics for a given word without diacritics, and used the dictionary to perform the diacritics restoration. The results of this experiment are presented in Table 1. For simplicity, only the top 12 languages sorted w.r.t. the dictionary baseline are presented.

Table 1 shows that for nine languages, the word error rate is larger than 2%, and the rest still have a word error rate above 1%. We concluded that even with a very large dictionary, diacritics restoration is a challenging task for these languages. Because there was no consistent approach to obtaining datasets for diacritics restoration, and also because there were no datasets covering the majority of languages from Table 1, we proposed

Language     Words with diacritics  WER of dictionary baseline
Vietnamese   88.4%                  40.53%
Romanian     31.0%                  29.71%
Latvian      47.7%                   8.45%
Czech        52.5%                   4.09%
Slovak       41.4%                   3.35%
Irish        29.5%                   3.15%
French       16.7%                   2.86%
Hungarian    50.7%                   2.80%
Polish       36.9%                   2.52%
Swedish      26.4%                   1.88%
Portuguese   13.3%                   1.83%
Galician     13.3%                   1.62%

Table 1: Analysis of the percentage of words with diacritics and the word error rate (WER) of a dictionary baseline, measured on UD 2.0 data, with the dictionary built from the CoNLL 2017 UD shared task raw data. Only words containing at least one alphabetical character are considered. The table displays the 12 languages with the worst dictionary-baseline results.

a new pipeline utilizing both clean data from Wikipedia and noisier data from the general web (utilizing the CommonCrawl corpus). The generated dataset, together with the scripts that were run to generate it, was made freely available (Náplava et al., 2018).

Further, we implemented a novel model for diacritics restoration. It consists of a character-level recurrent neural network which, for each input character, outputs its correct variant with diacritics. We combined the network with an external word-level language model, which is incorporated during beam-search decoding. As can be seen in Figure 2, the recurrent neural network is bidirectional; thus each character has information from both its preceding and following context. For simplicity, the visualisation does not show that there are actually multiple stacked RNNs with residual connections.
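A simplified sketch of the decoding step is shown below; the per-character scores and the word-level LM are stubs standing in for the trained network and the external language model:

```python
# Beam-search decoding that combines per-character model scores with a
# word-level LM bonus added whenever a word is completed. `char_score`
# and `word_score` are stubs (log-scores, higher is better).
def beam_search(char_options, char_score, word_score, beam_size=4, lm_weight=1.0):
    beams = [("", 0.0)]
    for i, options in enumerate(char_options):
        expanded = []
        for prefix, score in beams:
            for ch in options:
                s = score + char_score(i, ch)
                if ch == " ":  # word boundary: score the word just finished
                    s += lm_weight * word_score(prefix.rsplit(" ", 1)[-1])
                expanded.append((prefix + ch, s))
        beams = sorted(expanded, key=lambda b: -b[1])[:beam_size]
    # also score the final word of each hypothesis
    return max(
        beams, key=lambda b: b[1] + lm_weight * word_score(b[0].rsplit(" ", 1)[-1])
    )[0]

# Toy example: the character model slightly prefers the wrong "kuň",
# but the word-level LM knows only "kůň" is a real word.
options = [["k"], ["u", "ů"], ["n", "ň"]]
CHAR = [{"k": 0.0}, {"u": -0.1, "ů": -0.3}, {"n": -0.2, "ň": -0.1}]
char_score = lambda i, ch: CHAR[i][ch]
word_score = lambda w: 0.0 if w == "kůň" else -2.0
print(beam_search(options, char_score, word_score))  # → kůň
```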

Figure 2: Visualisation of the recurrent neural model we used for diacritics generation.

We evaluated the proposed model on two existing datasets and on our new dataset. The first existing dataset consists of Croatian, Serbian and Slovenian data; we reduce the error of the previous state-of-the-art system by more than 30% on the Wikipedia part of the dataset and by more than 20% on the Twitter part. The second dataset is for diacritics generation in Czech, and we reduce the error of the best previous state-of-the-art system, Korektor, by more than 60%. On the new dataset, we compare the performance of our model with the dictionary baseline described earlier and with the so-called corpus method, which extends the dictionary baseline via a log-linear model with context probability. The results, together with dataset statistics, are presented for 12 languages in Table 2.

Language     Wiki sentences  Web sentences  Words w/ diacritics  Lexicon  Corpus  Ours (no finetune)  Ours    Ours + LM  Error reduction
Vietnamese      819 918      25 932 077     73.63%               0.7164   0.8639  0.9622              0.9755  0.9773     83.33%
Romanian        837 647      16 560 534     24.33%               0.8533   0.9046  0.9018              0.9799  0.9837     82.96%
Latvian         315 807       3 827 443     39.39%               0.9101   0.9457  0.9608              0.9657  0.9749     53.81%
Czech           952 909      52 639 067     41.52%               0.9590   0.9814  0.9852              0.9871  0.9906     49.20%
Polish        1 069 841      36 449 109     27.09%               0.9708   0.9841  0.9891              0.9903  0.9955     71.64%
Slovak          613 727      12 687 699     35.60%               0.9734   0.9837  0.9868              0.9884  0.9909     44.21%
Irish            50 825         279 266     26.30%               0.9735   0.9800  0.9842              0.9846  0.9871     35.55%
Hungarian     1 294 605      46 399 979     40.33%               0.9749   0.9832  0.9888              0.9902  0.9929     58.04%
French        1 818 618      78 600 777     14.65%               0.9793   0.9931  0.9948              0.9954  0.9971     58.11%
Turkish         875 781      72 179 352     25.34%               0.9878   0.9905  0.9912              0.9918  0.9928     24.14%
Spanish       1 735 516      80 031 113     10.41%               0.9911   0.9953  0.9956              0.9958  0.9965     25.57%
Croatian        802 610       7 254 410     12.39%               0.9931   0.9947  0.9951              0.9951  0.9967     36.92%

Table 2: Results obtained on the new multilingual dataset for diacritics generation. The alpha-word accuracy presented in the table is measured only on words that have at least one alphabetical character. The last column presents the error reduction of our model combined with the language model, relative to the corpus method.

To conclude, we developed a new method for diacritics generation and showed that it outperforms existing methods on two existing datasets. Moreover, we proposed a new pipeline for obtaining consistent diacritics restoration datasets and made them freely available. The full paper, presented at the LREC 2018 conference, is available at https://www.aclweb.org/anthology/L18-1247.pdf, the code of our model is available at https://github.com/arahusky/diacritics_restoration, and the published dataset can be found at http://hdl.handle.net/11234/1-2607 (Náplava et al., 2018).

3.2 GEC

Grammatical error correction (GEC) is the Holy Grail of natural language correction. As it requires systems to correct all types of errors, it had been an unattainable goal for a long time. Because the majority of research has been conducted on English and we are not aware of any breakthrough paper on any other language, all papers described in this section are trained and tested on English.

Brockett et al. (2006) were the first to employ statistical machine translation (SMT) for GEC. Although they used it only for correcting mass-noun errors, such a model was already powerful enough to correct a variety of error types, as well as make stylistic changes, if trained on large enough data (Leacock et al., 2010). Since then, several other papers utilizing SMT were proposed (Mizumoto et al., 2011), but the real advent of GEC started with the Helping Our Own and CoNLL shared tasks between 2011 and 2014 (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2014). Two out of three top-performing teams in the CoNLL 2014 shared task (Felice et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2014) used machine translation approaches.

In the following years, the prevailing number of papers utilized the machine translation approach. Grundkiewicz and Junczys-Dowmunt (2014) trained an SMT system with filtered sentences from Wikipedia revisions that matched a set of rules derived from the NUCLE training data (a GEC corpus). The SMT system was further tuned and extended with a rich set of task-specific features and the incorporation of large language models


in Junczys-Dowmunt and Grundkiewicz (2016). An SVM ranking model re-ranking correction candidates proposed by the SMT output was then implemented by Yuan (2017).

With the continuing success of neural networks in machine translation (Cho et al., 2014; Sutskever et al., 2014; Heidorn et al., 1982; Bahdanau et al., 2014), and the inability of SMT to capture long-range dependencies and generalize beyond patterns seen during training, it was only a matter of time before neural models entered the area of GEC.

Yuan and Briscoe (2016) proposed the first GEC system based on neural machine translation. Its backbone was a classical word-level encoder-decoder model, and they use a two-step approach to address out-of-vocabulary words, which may occur quite frequently due to errors in spelling. The two-step approach starts by aligning the unknown words in the target sequence to their origins in the source sentence with an unsupervised aligner, and then translating these words with a word-level translation model. Xie et al. (2016) proposed to operate on the character level and implemented a neural sequence-to-sequence model comprising a character-level pyramidal encoder and a character decoder with an attention mechanism. Although the use of characters as basic units eliminated the problem with out-of-vocabulary words, and the pyramidal encoder reduced the size of potentially large attention matrices, we speculate that its inability to effectively leverage word-level information, together with longer training times, caused this model to be surpassed by Ji et al. (2017). Similarly to Yuan and Briscoe (2016), they utilized a word-level encoder-decoder model, but the decoder used two nested levels of attention to overcome the out-of-vocabulary problem: word level and character level. The word level was used in the classical manner, but whenever there was an out-of-vocabulary word in the target sequence, they used a hard attention mechanism and a character-level decoder to output the target word character by character. An important aspect of the model was its combined loss term, which allowed the character-level decoder to be trained jointly. The SMT approach once reappeared when Grundkiewicz and Junczys-Dowmunt (2018) achieved new state-of-the-art results with a neural machine translation model serving as a re-scoring component in their SMT system. However, since then, the backbone of most

GEC models has become either the Transformer architecture (Vaswani et al., 2017) or the convolutional encoder-decoder model proposed by Chollampatt and Ng (2018). Both models use subword units to mitigate the out-of-vocabulary issue and replace the slow-to-train recurrent units with either a self-attention mechanism or convolution operations.

Junczys-Dowmunt et al. (2018) used the Transformer architecture with several tweaks adapted from low-resource machine translation: using dropout on whole input words and assigning weights to target words based on their alignment to source words; they also propose to oversample sentences from the training set in order to match the error rate of the test set. To overcome the lack of annotated data, Lichtarge et al. (2018) trained the Transformer model on a large corpus of Wikipedia edits and, because edits concerning a single article may be split into multiple revisions, they used their model incrementally. In other words, the model could repeatedly translate its current output as long as the translation was more probable than keeping the sentence unchanged. A similar idea is also presented in Ge et al. (2018), where the translation system is trained with respect to the incremental inference. This is achieved by dynamically extending the training corpus with so-called "fluency sentence pairs". These are corrections of a partially trained model for which the correction has a higher fluency score than the original input. To compute the fluency score, the authors use an external language model.

Although this is only a short excerpt of the research in GEC, we hope that it covers the most influential papers and ideas before the start of another shared task on grammatical error correction in 2019, in which we participated.

In 2019, the Building Educational Applications Workshop announced another shared task on grammatical error correction. With the hope of replacing the most commonly used CoNLL-2014 test set, which is strongly biased towards Asian second learners of English and is also quite small, it came with a new dataset. This dataset is much more diverse, also contains a subset of native speakers, and has annotated English language levels. The shared task came with 3 tracks that aimed to test systems under different data restrictions. The Restricted track precisely defined what annotated training data could be used. In the Unrestricted track, participants could use any data to train their systems, and finally, in the Low Resource track, no annotated data could be used. The ERRANT F0.5 scorer was used as the official metric, mainly due to its ability to report statistics on individual error types.

Track          P      R      F0.5   Best   Rank
Restricted     67.33  40.37  59.39  69.47  10 / 21
Unrestricted   68.17  53.25  64.55  66.78   3 / 7
Low Resource   50.47  29.38  44.13  64.24   5 / 9

Table 3: Official results of the Building Educational Applications 2019 Shared Task on Grammatical Error Correction, measured on the test set. Best denotes the F0.5 of the winning system in each track.

We started our work in GEC by implementing the model proposed by Junczys-Dowmunt et al. (2018). Specifically, we extended the standard Transformer model by source word dropout, and its loss by a term that assigns higher weight to words that should change. To make regularization even more effective, we decided to also randomly drop out whole target word embeddings during training. We used this model in all three tracks of the shared task. Moreover, we implemented iterative decoding as proposed by Lichtarge et al. (2018) and use the trained system incrementally as long as the cost of the correction is less than the cost of the identity translation times a learned constant. We used the model log-likelihood as the cost function. Finally, to reduce variance during training and hopefully also achieve better results, we used checkpoint averaging as described by Popel and Bojar (2018).
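The two Transformer extensions can be illustrated outside any particular framework. The sketch below, in NumPy, assumes a trivial one-to-one source–target token alignment for the loss weighting (the actual system weights by a learned alignment); the dropout probability and boost factor are illustrative values, not the ones used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def word_dropout(embeddings, p=0.2):
    """Zero out whole word embeddings (rows) with probability p, so the
    model cannot rely on every input word being present."""
    mask = rng.random(embeddings.shape[0]) >= p
    return embeddings * mask[:, None]

def weighted_nll(token_losses, source_tokens, target_tokens, boost=3.0):
    """Up-weight the loss of target tokens that differ from the aligned
    source token, pushing the model to focus on actual corrections.
    Assumes a one-to-one alignment purely for illustration."""
    weights = np.array([boost if s != t else 1.0
                        for s, t in zip(source_tokens, target_tokens)])
    return float(np.sum(weights * token_losses) / np.sum(weights))
```

The same `word_dropout` can be applied to target embeddings during training, which is the additional regularization mentioned above.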

In the Restricted track, we used all allowed data for training the model. Because the majority of the data came from the noisiest Lang-8 corpus, we oversampled the other datasets to make the difference smaller. We further experimented with the values of source and target word dropout, the weight for non-identity words in the modified objective, the constant used in iterative decoding, and the differences between the lighter Transformer BASE and the heavier Transformer BIG architecture. The chosen hyperparameters improved the performance of the baseline Transformer model by almost 12 points in ERRANT F0.5 score.

Because no annotated data were allowed in the Low Resource track, we decided to incorporate Wikipedia revisions. We followed the approach of Lichtarge et al. (2018): we downloaded Wikipedia XML revision dumps, extracted individual pages, removed all non-text elements, downsampled the

snapshots, and with a small probability injected spelling noise. With the same low probability, a random text substring (up to 8 characters) was replaced with a marker, which should force the model to learn infilling. Finally, the texts from two consecutive snapshots were aligned and sequences between matching segments were extracted to form a training pair. Only 4% of identical samples were preserved. This resulted in over 190M segment pairs, which were used to train the system.
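The alignment-and-extraction step can be sketched with the standard library's `difflib`. This is a simplified illustration of the idea, not our extraction code: it aligns two tokenized snapshots of the same page and cuts out a small context window around each edit to form a (noisy, corrected) pair.

```python
import difflib

def revision_pairs(old_tokens, new_tokens, context=2):
    """Align two snapshots of the same page and extract the token
    sequences around each edit as (noisy, corrected) training pairs."""
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical spans carry no correction signal
        src = old_tokens[max(0, i1 - context):i2 + context]
        tgt = new_tokens[max(0, j1 - context):j2 + context]
        pairs.append((" ".join(src), " ".join(tgt)))
    return pairs
```

A real pipeline would additionally keep a small fraction of identical spans (the 4% mentioned above), so the model also learns to leave correct text unchanged.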

Finally, in the Unrestricted track, we used the best system from the Low Resource track and fine-tuned it on the oversampled training data from the Restricted track.

The results of our systems, together with the performance of the best performing systems and the number of participants, are summarized in Table 3. The major outcome for our future research was to learn to generate and use synthetic (artificial) data, which boosted the performance of the top performing models considerably. Combining multiple models into an ensemble was the other key factor that distinguished the top performing systems from the others. The system description paper can be found at https://www.aclweb.org/anthology/W19-4419.pdf.

3.3 Low-Resource GEC

GEC in English is a long-studied problem with many existing systems and datasets. However, there has been only limited research on error correction in other languages. Namely, Boyd (2018) created a dataset and presented a GEC system for German, Rozovskaya and Roth (2019) for Russian, and Náplava (2017)1 for Czech; efforts to create annotated learner corpora were also made for Chinese (Yu et al., 2014), Japanese (Mizumoto et al., 2011) and Arabic (Zaghouani et al., 2014). The reason why the majority of research has been conducted on English is the availability of data. While we are aware of at least 6 datasets for GEC

1This author’s diploma thesis.


System                      P      R      F0.5
Rozovskaya and Roth (2019)  38.0    7.5   21.0
Our work – pretrain         47.76  26.08  40.96
Our work – finetuned        63.26  27.50  50.20

Table 4: Comparison of our systems on the Russian GEC test set (RULEC-GEC).

System                P      R      F0.5
Boyd (2018)           51.99  29.73  45.22
Our work – pretrain   67.45  26.35  51.41
Our work – finetuned  78.21  59.94  73.71

Table 5: Comparison of our systems on the German GEC test set (Falko-Merlin Test Set).

in English with millions of sentences altogether, we know of only a single dataset in each of the remaining languages, each having at most tens of thousands of sentence pairs. This is naturally an issue, as the current machine translation approaches require large corpora to train properly.

The issue with the lack of annotated data might be mitigated by utilizing additional artificially generated synthetic data. In recent years, several approaches to creating such data have been proposed. The first of them, the so-called back-translation model, consists of training another machine translation model in the opposite direction, i.e. learning to translate from correct into incorrect sentences (Náplava, 2017; Rei et al., 2017; Kasewa et al., 2018). While this approach might generate high-quality synthetic data, it once again requires large volumes of training data, which is intractable in the case of low-resource languages.

The second approach is to extract a large volume of parallel data from Wikipedia revisions and pre-train the GEC system on them (Lichtarge et al., 2018). The problem with this approach is that even Wikipedia might not be big enough for certain languages.

The third approach is to create synthetic data by rule-based substitutions or by using a subset of the following operations: token replacement, token deletion, token insertion, multi-token swap and spelling noise introduction. Yuan and Felice (2013) extract edits from NUCLE and apply them to a clean text. Choe et al. (2019) apply edits from the W&I+Locness training set and also define manual noising scenarios for prepositions, nouns

and verbs. Zhao et al. (2019) use an unsupervised approach to synthesize noisy sentences and allow deleting a word, inserting a random word, replacing a word with a random word, and also shuffling (rather locally). Grundkiewicz et al. (2019) improve this approach and replace a token with one of its spell-checker suggestions. They also introduce additional spelling noise. This simple approach works surprisingly well on English, as Grundkiewicz et al. (2019) were one of the two winning teams of the BEA 2019 Shared Task on GEC.

The most prominent issue with the unsupervised approach of Grundkiewicz et al. (2019) might be the language-specific spell-checker. However, because the latest ASpell2 has dictionaries for 61 languages, we do not consider this an issue for now and decided to use this approach in our work. Specifically, given a clean sentence, we first sample the number of words to modify. For each chosen word, one of the following operations is performed with a predefined probability: substituting the word with one of its ASpell proposals, deleting it, swapping it with its right-adjacent neighbour, or inserting a random word from the dictionary after the current word. To make the system more robust to spelling errors, the same operations are also applied to individual characters. Because the model after training often failed to correct errors in casing and diacritics, we extended the word-level operations by changing word casing and the character-level operations by changing character diacritics.
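The word-level part of the noising procedure can be sketched as below. The operation probabilities are illustrative, not the ones we used, and `suggest` is a placeholder for a real ASpell suggestion lookup; the character-level, casing and diacritics operations described above are omitted for brevity.

```python
import random

rng = random.Random(42)

# Illustrative operation probabilities (assumed values, not the paper's).
OPS = [("substitute", 0.4), ("delete", 0.2), ("swap", 0.2),
       ("insert", 0.1), ("casing", 0.1)]

def suggest(word):
    """Placeholder for an ASpell lookup; a real system would pick one of
    the spell-checker's suggestions for the word."""
    return word[::-1]

def noise_sentence(tokens, vocab, word_error_rate=0.15):
    """Introduce roughly `word_error_rate` errors per word into a clean
    tokenized sentence."""
    tokens = list(tokens)
    n_errors = max(1, round(word_error_rate * len(tokens)))
    for _ in range(n_errors):
        i = rng.randrange(len(tokens))
        op = rng.choices([o for o, _ in OPS], weights=[p for _, p in OPS])[0]
        if op == "substitute":
            tokens[i] = suggest(tokens[i])
        elif op == "delete" and len(tokens) > 1:
            del tokens[i]
        elif op == "swap" and i + 1 < len(tokens):
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
        elif op == "insert":
            tokens.insert(i, rng.choice(vocab))
        elif op == "casing":
            tokens[i] = tokens[i].swapcase()
    return tokens
```

Running this over a large clean corpus yields the (noisy, clean) pairs used for pre-training.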

We employed the described pipeline to generate large synthetic data for English, Russian and German from the clean WMT News Crawl monolingual training data (Ondrej et al., 2017). We used these data to pre-train our modified Transformer model as described in Section 3.2 (one model for each language). Although these models never saw any authentic data, they were already better than the previous state-of-the-art systems for German (Boyd, 2018) and Russian (Rozovskaya and Roth, 2019). We further fine-tuned the models on a mixture of authentic and synthetic data, which increased the performance even further.

The final model on English was slightly worse than the current state-of-the-art system (69.00 vs. 69.47). However, given that Grundkiewicz et al. (2019) use an ensemble of multiple models, which is known to boost system performance considerably, we hypothesise that our results are at least

2http://aspell.net/


on par with theirs. As can be seen in Table 4 and Table 5, the main outcome of our work is that our model works well in low-resource scenarios, outperforming the previous Russian and German state-of-the-art systems by a large margin. Note that we report both the results of the pre-trained model only (Our work – pretrain) and of the final fine-tuned system (Our work – finetuned).

The Czech GEC dataset from Náplava (2017) does not allow systems to be evaluated using the standard ERRANT F0.5 scorer or the M2 scorer, because it is published only as aligned sentences. We therefore decided to improve it and created a new dataset, AKCES-GEC, with separated edits together with their type annotations in M2 format (Dahlmeier and Ng, 2012b). To create the dataset, we used the CzeSL-man corpus (Rosen and Matliare, 2016), consisting of manually annotated transcripts of essays of non-native speakers of Czech. Apart from the released CzeSL-man, AKCES-GEC further utilizes additional unreleased parts of CzeSL-man and also essays of Romani pupils with the Romani ethnolect of Czech as their first language. The edits were created according to the manual alignments. The newly generated AKCES-GEC dataset comes with an explicit train/development/test split, with each set divided into foreigner and Romani students; for the development and test sets, the foreigners are further split into Slavic and non-Slavic speakers. Furthermore, the development and test sets were annotated by two annotators, so we provide two references whenever the annotators used the same sentence segmentation and produced different annotations. The detailed statistics of the dataset are presented in Table 6.

With the new Czech AKCES-GEC dataset, we could reproduce our experiments from the other languages on it. Specifically, we generated a synthetic corpus on which we pre-trained our system. We then fine-tuned the system on a mixture of synthetic and authentic data from the corpus. The results of our system compared to the previous state of the art of Richter et al. (2012) are presented in Table 7.

To summarize our work in GEC with a low amount of annotated authentic data, we first showed that when utilizing a synthetic corpus, we can reach surprisingly good results. Although we did not explicitly mention it before, we also trained systems using authentic data only, and the results were disappointing. We also

showed that fine-tuning the model on a mixture of authentic and synthetic data boosts the performance even more. Moreover, we created a new dataset for GEC in Czech. We published the dataset at http://hdl.handle.net/11234/1-3057. The paper presented at WNUT 2019 can be found at https://www.aclweb.org/anthology/D19-55.pdf#page=366 and the code for synthesizing the data and training the models is available at https://github.com/ufal/low-resource-gec-wnut2019.

4 Future Work

4.1 GEC in Czech

One of the main focuses of our future research is grammatical error correction in Czech. We present the main points of our concrete research plans:

• Multi-domain dataset A big issue of many machine learning models is their domain specificity. Therefore, a new dataset for Czech GEC evaluation must cover a broad range of domains.

• Metrics Although the ERRANT F0.5 scorer and the MaxMatch M2 scorer work well for many English domains, there are certain domains where other metrics correlate better with human judgments.

• Models Compare domain-specific models to models that work well across multiple domains. Also, as some error types require larger context than a single sentence, analyze paragraph- or even document-level models.

• Deployability Although large architectures with millions of parameters work well, we would like to find the smallest possible architecture that still works well enough.

We have already assembled a team of annotators who are about to annotate texts from 5 different sources: essays of Czech pupils, essays of Czech pupils with Romani background, essays of foreigners learning Czech as a second language, user comments from several Czech Facebook pages, and comments from the discussion forum of the Czech online news site Novinky.cz. In comparison to our former dataset AKCES-GEC, this dataset also contains texts of Czech natives in both formal and informal language. Due to a limited budget, only the test and development sets will be annotated.


                  Train                           Dev                          Test
                  Doc    Sent    Word     Err.r.  Doc   Sent   Word    Err.r.  Doc   Sent   Word    Err.r.
Foreign. Slavic   1 816  27 242  289 439  22.2 %  70    1 161  14 243  21.8 %  69    1 255  14 984  18.8 %
Foreign. Other    –      –       –        –       45    804    8 331   23.8 %  45    879    9 624   20.5 %
Romani            1 937  14 968  157 342  20.4 %  80    520    5 481   21.0 %  74    542    5 831   17.8 %
Total             3 753  42 210  446 781  21.5 %  195   2 485  28 055  22.2 %  188   2 676  30 439  19.1 %

Table 6: Statistics of the AKCES-GEC dataset – number of documents, sentences, words and error rates.

System                 P      R      F0.5
Richter et al. (2012)  68.72  36.75  58.54
Our work – pretrain    80.32  39.55  66.59
Our work – finetuned   83.75  68.48  80.17

Table 7: Results on the AKCES-GEC test set (Czech).

Once the dataset is annotated, we plan to train multiple models and use them to correct noisy input sentences from the new dataset. Similarly to Napoles et al. (2019), we also want the annotators to assess the quality of the corrections and use these annotations together with a set of metrics to either select the single best metric or train a new one. Because there may be too large differences between certain domains (e.g. Czech pupils with Romani background tend to produce texts of poor quality that need large fluency edits), it is possible that different metrics will be chosen for different domains.

The sentence-level GEC models usually work quite well on English. However, some errors can only be corrected reliably using cross-sentence context. Examples of such errors are article or verb tense errors in English, and subject-verb agreement errors in Czech. Document-level machine translation already seems to work decently well when trained with large batches, as described in Junczys-Dowmunt (2019). There has also been some research in this direction in GEC. Specifically, Chollampatt et al. (2019) extend the convolutional translation model of Gehring et al. (2017) with an auxiliary encoder that encodes previous sentences and incorporate the encoding into the decoder via attention. They claim to reach better results with this cross-sentence model than with a model operating on single sentences.

It will be interesting to evaluate the performance of a single model working well across all 5 domains compared to models trained only on a single domain. Because some

domains are closer to each other (e.g. comments from Czech Facebook and comments from Czech online news), it is also possible that a model trained on a subset of all domains may show superior results.

Finally, it would be great to have a working GEC system deployed so that everyone could test and use it. Therefore, we plan to distill a small enough model that works well and deploy it to LINDAT.

4.2 ML-Noise

Nowadays, many models used in natural language processing are trained and tested on clean data. However, they are often used on noisy user-generated content. We thus plan to investigate how much natural noise in the data hurts model performance and also how to make systems robust against such noise.

The fact that data with natural noise deteriorate the performance of current systems is by no means a novel finding. Considering recent work on noisy data, Belinkov and Bisk (2017) found that natural noise such as misspellings and typos causes significant drops in BLEU scores of current state-of-the-art character-level machine translation models. They explored multiple strategies to increase the models' robustness and found that when a model is trained on a mixture of original and noisy input, it can learn to address a certain amount of errors.

Adversarial attacks on deep classification models were explored by Liang et al. (2017). They identified text items important for the classification and perturbed them to generate adversarial inputs that fool the classifier. Although their approach was able to successfully find small perturbations that make the classifier produce wrong predictions, it is not clear to what extent a human would make such errors. Adversarial black-box generation was further explored by Gao et al. (2018), who also tried to minimize the edit distance when generating the adversarial examples.


Rychalska et al. (2019) implemented a framework capable of introducing multiple noise types into text, such as removing or swapping articles, rewriting digit numbers into words, or introducing spelling errors. They tested their framework on 4 natural language processing tasks and found that even recent state-of-the-art systems based on BERT (Devlin et al., 2018) or ELMo (Peters et al., 2018) embeddings are not completely robust against such natural noise. They also re-trained the systems on adversarial noisy data and observed improvements for certain error types.

The main goal of our work is to evaluate the performance of current systems on naturally noisy data and to propose methods to mitigate the potential drops in their performance. Ideally, annotators would create data as realistic as possible. However, because this approach would be too costly, especially when considering multiple systems in multiple languages, we decided to propose an automated method for introducing errors into texts that corresponds to human errors as closely as possible.

To meet the requirement that the noised data be realistic, we decided to infer the error probabilities from the M2 files used in GEC datasets. These datasets contain, for each noisy sentence, a set of correcting edits and their type annotations. Given these annotations, we define the following set of error types and estimate their probabilities:

• Diacritics Strip diacritics either from a whole sentence or randomly from individual characters.

• Spelling Use ASpell to transform a word into another existing word (break → brake) or introduce the usual perturbations on individual characters (wrong → worng).

• Suffix/Prefix Change a common suffix or prefix (do → doing).

• Casing Change casing of a word.

• Punctuation Insert, remove or replace certain punctuation in the text.

• Whitespace Remove or insert spaces in text.

• Word Order Reorder several adjacent words.

• Common Other Insert, replace or substitute common errors as seen in the data (the → a, on → in).
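Estimating how often each error type occurs can be done directly from the M2 annotations. The sketch below parses the standard M2 format (S-lines carry the tokenized source sentence, A-lines the edits with their type in the second |||-separated field) and computes a per-token rate for each type; mapping these types onto the operations listed above is the additional step our profiles would require.

```python
from collections import Counter

def error_type_profile(m2_lines):
    """Estimate how often each annotated error type occurs per source
    token in a GEC dataset given in M2 format."""
    type_counts = Counter()
    n_tokens = 0
    for line in m2_lines:
        if line.startswith("S "):
            n_tokens += len(line[2:].split())
        elif line.startswith("A "):
            error_type = line.split("|||")[1]
            if error_type != "noop":  # "noop" marks error-free sentences
                type_counts[error_type] += 1
    return {t: c / n_tokens for t, c in type_counts.items()}
```

Running this separately on, e.g., the native and second-learner portions of a corpus yields the more detailed user profiles mentioned below.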

We plan to infer the operation probabilities for all 4 languages (English, German, Russian, Czech) for which M2 GEC files exist. Moreover, as English and Czech have M2 files divided into multiple categories, such as natives or second learners, we can create more detailed user profiles for them.

We will use these profiles to test multiple systems. At the moment, we have chosen the following tasks: morphosyntactic analysis on Universal Dependencies (Nivre et al., 2020), neural machine translation, reading comprehension on the SQuAD dataset (Rajpurkar et al., 2018), several text classification tasks from the GLUE benchmark (Wang et al., 2018), and dialog system slot filling on Czech Public Transport data3.

Once we evaluate all proposed tasks, we plan to implement two methods for handling input noise. First, we want to employ a trained GEC system to correct the noisy texts before passing them to the models themselves. Second, for all systems where possible, we will re-train the systems on a mixture of the original and the noisy data. Ideally, the second approach would not hurt a model's performance on clean data while improving its performance on noisy data.

5 Summary

In this thesis proposal, we described the task of natural language correction, together with its current state and a brief history. We presented our work on automatic diacritics restoration, and we also discussed our system for grammatical error correction that participated in the Building Educational Applications 2019 Shared Task on Grammatical Error Correction, as well as our follow-up work on training correction systems in low-resource scenarios.

The main focus of our future work will be on automatic language correction in Czech. To cover a broad range of its possible users, we plan to assemble a dataset from multiple domains. Apart from standard sentence-level systems, we also plan to try systems operating on multiple sentences. Besides this research, we also plan to test current systems for various natural language processing tasks on noisy user-generated data and explore

3http://gitlab.com/ufal/dsg/dialmonkey


methods to make the systems more robust to thenoise.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Yonatan Belinkov and James Glass. 2015. Arabic diacritization with recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2281–2285.

Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 79–84.

Chris Brockett, William B Dolan, and Michael Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 249–256. Association for Computational Linguistics.

Christopher Bryant, Mariano Felice, Øistein E Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 52–75.

Christopher Bryant, Mariano Felice, and Edward Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. Association for Computational Linguistics.

Flora Ramírez Bustamante and Fernando Sánchez León. 1996. GramCheck: A grammar and style checker. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 175–181. Association for Computational Linguistics.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Yo Joong Choe, Jiyeon Ham, Kyubyong Park, and Yeoil Yoon. 2019. A neural grammatical error correction system built on better pre-training and sequential transfer learning. arXiv preprint arXiv:1907.01256.

Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Thirty-Second AAAI Conference on Artificial Intelligence.

Shamil Chollampatt, Weiqi Wang, and Hwee Tou Ng. 2019. Cross-sentence grammatical error correction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 435–445.

David Crandall. 2005. Automatic accent restoration in Spanish text. Indiana University Bloomington.

Daniel Dahlmeier and Hwee Tou Ng. 2012a. A beam-search decoder for grammatical error correction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 568–578. Association for Computational Linguistics.

Daniel Dahlmeier and Hwee Tou Ng. 2012b. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572. Association for Computational Linguistics.

Daniel Dahlmeier, Hwee Tou Ng, and Eric Jun Feng Ng. 2012. NUS at the HOO 2012 shared task. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 216–224.

Robert Dale, Ilya Anisimoff, and George Narroway. 2012. HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 54–62. Association for Computational Linguistics.

Robert Dale and Adam Kilgarriff. 2011. Helping our own: The HOO 2011 pilot shared task. In Proceedings of the 13th European Workshop on Natural Language Generation, pages 242–249. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mariano Felice and Ted Briscoe. 2015. Towards a standard evaluation method for grammatical error detection and correction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 578–587.

Mariano Felice, Zheng Yuan, Øistein E Andersen, Helen Yannakoudakis, and Ekaterina Kochmar. 2014. Grammatical error correction using hybrid systems and type filtering. Association for Computational Linguistics.


G David Forney. 1973. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.

Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pages 50–56. IEEE.

Tao Ge, Furu Wei, and Ming Zhou. 2018. Reaching human-level performance in automatic grammatical error correction: An empirical study. arXiv preprint arXiv:1807.01270.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1243–1252. JMLR.org.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2014. The WikEd error corpus: A corpus of corrective Wikipedia edits and its application to grammatical error correction. In International Conference on Natural Language Processing, pages 478–490. Springer.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. arXiv preprint arXiv:1804.05945.

Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Kenneth Heafield. 2019. Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 252–263.

Na-Rae Han, Martin Chodorow, and Claudia Leacock. 2004. Detecting errors in English article usage with a maximum entropy classifier trained on a large, diverse corpus. In LREC.

George E. Heidorn, Karen Jensen, Lance A. Miller, Roy J. Byrd, and Martin S Chodorow. 1982. The EPISTLE text-critiquing system. IBM Systems Journal, 21(3):305–326.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, and Jianfeng Gao. 2017. A nested attention neural hybrid model for grammatical error correction. arXiv preprint arXiv:1707.02026.

Marcin Junczys-Dowmunt. 2019. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. arXiv preprint arXiv:1907.06170.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2014. The AMU system in the CoNLL-2014 shared task: Grammatical error correction by data-intensive and feature-rich statistical machine translation. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 25–33.

Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Phrase-based machine translation is state-of-the-art for automatic grammatical error correction. arXiv preprint arXiv:1605.06353.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. arXiv preprint arXiv:1804.05940.

Sudhanshu Kasewa, Pontus Stenetorp, and Sebastian Riedel. 2018. Wronging a right: Generating better errors to improve grammatical error detection. arXiv preprint arXiv:1810.00668.

Kevin Knight and Ishwar Chander. Automated postediting of documents.

Karen Kukich. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377–439.

Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated grammatical error detection for language learners. Synthesis Lectures on Human Language Technologies, 3(1):1–134.

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.

Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, and Niki Parmar. 2018. Weakly supervised grammatical error correction using iterative decoding. arXiv preprint arXiv:1811.01710.

Nina Macdonald, Lawrence Frase, P Gingrich, and Stacey Keenan. 1982. The Writer's Workbench: Computer aids for text analysis. IEEE Transactions on Communications, 30(1):105–110.

M McIlroy. 1982. Development of a spelling list. IEEE Transactions on Communications, 30(1):91–99.

Rada Mihalcea and Vivi Nastase. 2002. Letter level learning for language independent diacritics restoration. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, pages 1–7. Association for Computational Linguistics.

Rada F Mihalcea. 2002. Diacritics restoration: Learning from letters versus learning from words. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 339–348. Springer.


Guido Minnen, Francis Bond, and Ann Copestake.2000. Memory-based learning for article genera-tion. In Proceedings of the 2nd workshop on Learn-ing language in logic and the 4th conference onComputational natural language learning-Volume 7,pages 43–48. Association for Computational Lin-guistics.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Na-gata, and Yuji Matsumoto. 2011. Mining revisionlog of language learning sns for automated japaneseerror correction of second language learners. In Pro-ceedings of 5th International Joint Conference onNatural Language Processing, pages 147–155.

Ryo Nagata, Takahiro Wakana, Fumito Masui, AtsuoKawai, and Naoki Isu. 2005. Detecting article er-rors based on the mass count distinction. In Interna-tional Conference on Natural Language Processing,pages 815–826. Springer.

Jakub Náplava. 2017. Natural language correction.

Jakub Náplava, Milan Straka, Jan Hajic, and PavelStranák. 2018. Corpus for training and evaluatingdiacritics restoration systems. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Ap-plied Linguistics (ÚFAL), Faculty of Mathematicsand Physics, Charles University, LINDAT/CLARINPID http://hdl.handle.net/11234/1-2607.

Courtney Napoles, Maria Nadejde, and Joel Tetreault. 2019. Enabling robust grammatical error correction in new domains: Data sets, metrics, and analyses. Transactions of the Association for Computational Linguistics, 7:551–566.

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.

Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4027–4036, Marseille, France. European Language Resources Association.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, et al. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Second Conference on Machine Translation, pages 169–214. The Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Martin Popel and Ondřej Bojar. 2018. Training tips for the Transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Marek Rei, Mariano Felice, Zheng Yuan, and Ted Briscoe. 2017. Artificial error generation with machine translation and syntactic patterns. arXiv preprint arXiv:1707.05236.

Michal Richter, Pavel Straňák, and Alexandr Rosen. 2012. Korektor – a system for contextual spell-checking and diacritics completion. In Proceedings of COLING 2012: Posters, pages 1019–1028.

Alexandr Rosen. 2016. Building and using corpora of non-native Czech. In ITAT, pages 80–87, Tatranské Matliare.

Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17.

Barbara Rychalska, Dominika Basaj, Alicja Gosiewska, and Przemysław Biecek. 2019. Models in the wild: On corruption robustness of neural NLP systems. In International Conference on Neural Information Processing, pages 235–247. Springer.

Nikola Šantić, Jan Šnajder, and Bojana Dalbelo Bašić. 2009. Automatic diacritics restoration in Croatian texts. INFuture2009: Digital Resources and Knowledge Sharing, pages 309–318.

Kevin P. Scannell. 2011. Statistical unicodification of African languages. Language Resources and Evaluation, 45(3):375.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

W. D. Taylor. 1981. Grope – a spelling error correction tool. AT&T Bell Labs Tech. Mem.


Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y. Ng. 2016. Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727.

David Yarowsky. 1999. A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Natural Language Processing Using Very Large Corpora, pages 99–120. Springer.

Liang-Chih Yu, Lung-Hao Lee, and Li-Ping Chang. 2014. Overview of grammatical error diagnosis for learning Chinese as a foreign language. In Proceedings of the 1st Workshop on Natural Language Processing Techniques for Educational Applications, pages 42–47.

Zheng Yuan. 2017. Grammatical error correction in non-native English. Technical report, University of Cambridge, Computer Laboratory.

Zheng Yuan and Ted Briscoe. 2016. Grammatical error correction using neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 380–386.

Zheng Yuan and Mariano Felice. 2013. Constrained grammatical error correction using statistical machine translation. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 52–61, Sofia, Bulgaria. Association for Computational Linguistics.

Wajdi Zaghouani, Behrang Mohit, Nizar Habash, Ossama Obeid, Nadi Tomeh, Alla Rozovskaya, Noura Farra, Sarah Alkuhlani, and Kemal Oflazer. 2014. Large scale Arabic error annotation: Guidelines and framework.

Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. arXiv preprint arXiv:1903.00138.