capitalization and punctuation restoration: a survey
TRANSCRIPT
![Page 1: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/1.jpg)
1
Capitalization and Punctuation Restoration: a Survey
Vasile Păiș , Dan Tufiș
Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy
CASA ACADEMIEI, 13 “Calea 13 Septembrie”, Bucharest 050711, ROMANIA
{vasile, tufis}@racai.ro
ABSTRACT
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying complex natural
language processing algorithms. This is especially significant for textual sources where punctuation and
casing are missing, such as the raw output of automatic speech recognition systems. Additionally, short text
messages and micro-blogging platforms offer unreliable and often wrong punctuation and casing. This survey
offers an overview of both historical and state-of-the-art techniques for restoring punctuation and correcting
word casing. Furthermore, current challenges and research directions are highlighted.
KEYWORDS
Punctuation restoration, capitalization, truecasing, sentence segmentation
1 Introduction
1.1 Motivation
Text understanding by humans and artificial intelligence (AI) programs often require the proper use of capital
letters and punctuation marks. For simple sentences, composed of few words, it may be possible for both
humans and AI to read and process even if capital letters usage or punctuation is not present or is inconsistent.
For example, in the case of voice-based commands, processing is usually performed only on recognized
lower-case words. However, if the body of text to be analyzed increases, such as whole paragraphs or pages,
then even for humans it is a challenging task to quickly comprehend its meaning. This was studied by Jones
et al. (2003), who analyzed the impact of capitalization and punctuation on the readability of speech-to-text
transcripts.
Early work considered punctuation only as cues from the reader perspective to the possible prosodic and
pause characteristics of the text (Markwardt, 1942). Nunberg (1990) argues for a much greater role associated
with punctuation. Furthermore, punctuation marks are classified as delimiting, separating and
disambiguating. Some marks, like the comma, may belong to multiple categories since they are able to
perform multiple roles. Jones (1994) proves that “for longer sentences of real language a grammar which
makes use of punctuation massively outperforms an otherwise similar grammar that ignores it”. Based on
this and other similar findings, current state of the art language models consider punctuation as part of their
vocabulary. This includes recent models such as BERT (Devlin et al, 2018), ELMo (Peters et al, 2018),
OpenAI’s GPT-2 (Radford et al, 2019) and we can make an educated guess about GPT-3 (Brown et al, 2020)
(the model itself as well as the source code are not publicly available).
Natural language processing (NLP) algorithms such as named entity recognition (NER), part-of-speech
(POS) identification, dependency parsing, machine translation (MT) make use of capital letters as features
regarding the currently processed word while punctuation is used as features for adjacent words. For example,
the Stanford Named Entity Recognizer (Finkel et al., 2005) considers features based on word shape. This
means constructing a word representation based on the type of characters appearing in the word. Multiple
algorithms for word shape representation were proposed, but the general idea is to encode an uppercase letter
with a specific character, say “X”, a lowercase letter with “x” and a digit with “d”. In this case a word like
“McDonald” would become “XxXxxxx”. Any such algorithms or variants are possible only if the words are
properly represented in terms of upper and lowercase letters. Additionally, the work of Spitkovsky et al.
(2011) studies the impact of punctuation on unsupervised dependency parsing.
![Page 2: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/2.jpg)
2
Special consideration must be given to Automatic Speech Recognition (ASR) systems. Primary output of
such systems usually consists of raw text, using the same casing (either lowercase or uppercase) and without
any punctuation. In such situations, before applying further NLP algorithms, additional preprocessing is
required in order to restore proper letter case and punctuation. These are sometimes referred to as “rich
transcriptions”. One of the first initiatives concerning automatic rich transcriptions of spoken language started
in 2002 in the context of DARPA Effective, Affordable Reusable Speech-to-text (EARS)1 programme which
supported the goal of significantly advancing the state-of-the-art in the hardest speech recognition challenges,
including rich transcriptions of language. To this end, NIST released a series of rich transcription evaluation
data sets2, to aid in the evaluation of such systems.
Even though a large volume of data requiring capitalization and punctuation restoration comes from ASR
systems, other sources must be considered as well. Miller et al. (2000) identify other noisy sources in the
form of text obtained via optical character recognition (OCR) or in some newspaper articles. In these cases,
not all the text is affected by lack of proper letter or punctuation but parts of it. In the case of OCR, it is
possible that some punctuation is not recognized, while in the case of certain articles it is possible for the first
sentence or paragraph to be written using only capital letters. Additionally, in the case of short text messages
(SMS), chats, tweets or other micro-blogging activities it is also possible for people to ignore proper casing
and punctuation (Baldwin et al., 2013), (Nebhi et al., 2015).
Capitalization plays an important role in systems producing uncased output. For example, in Sadat et al.
(2005) is presented the PORTAGE machine translation system. The uncased output of the system can benefit
from capitalization restoration as described in Agbago et al. (2005).
In the construction of human-computer interfaces using natural language, sometimes referred to as
“chatbots”, one of the difficulties encountered is represented by inconsistent use of punctuation and
capitalization by the user (Coniam, 2008). In this context, many approaches try to hide the issue by removing
all punctuation and capitalization from both training and runtime data (Shawar and Atwell, 2007).
Furthermore, Coniam (2014) analyzed also the output text of chatbots from the perspective of a human using
these programs to learn English as a second language. He was able to identify issues with capitalization and
punctuation even in the produced text. Nevertheless, he argues that for short sentences produced by the
chatbots, “changes to English through the increasing adoption of texting makes it debatable whether these
issues can be considered important nowadays”.
1.2 Task description
Capitalization, also known as “truecasing”, focuses on identifying the correct word form, making a distinction
between the following four classes: all letters lowercase (regular words inside sentences), all letters uppercase
(usually the case for certain abbreviations), only the first letter uppercase (starting words of sentences, most
proper nouns) and mixed case comprised of several uppercase letters and several lowercase letters (this is the
case for certain proper nouns, like “McDonald”). Sometimes an additional fifth class may be used: “no case”,
considering strings for which there is no upper/lower case distinction, such as numbers or symbols.
Punctuation restoration is the task of inserting appropriate punctuation marks in the appropriate positions,
considering an input text without any punctuation. Usually the following punctuation marks are considered:
comma, period (or full stop), exclamation mark, question mark, colon, semicolon and quotation marks.
However, since commas and periods occur a lot more than the others, lots of studies focus only on these ones.
It must be noted that the first letter uppercase due to the word being the first in a sentence does imply a
sentence segmentation mechanism, which is usually making use of punctuation restoration. Therefore, the
two tasks are intertwined and even though most studies focus on only one of the tasks, there are studies
focusing on both using integrated approaches.
In certain studies, only the sentence segmentation sub-task was considered (Stolcke and Shriberg, 1997).
Furthermore, Matusov et al. (2006) suggest that it is much easier to predict segment boundaries first and then
1 http://mi.eng.cam.ac.uk/research/projects/EARS/ears_summary.html
2 https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
![Page 3: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/3.jpg)
3
recover punctuation on the segments instead of the initial unsegmented text. Another important observation
is in the case of machine translation software where the translation may be performed on the text segments
and punctuation is recovered in the target language where conventions for punctuation may differ compared
to the source language.
1.3 Outline
The present study will focus on general text processing for the purpose of case and punctuation restoration.
Nevertheless, since there are methods specific to ASR, which make use of audio information, such as pauses
in speech, these will be presented as well.
Section 2 presents methods for capitalization and punctuation restoration using only lexical features, starting
from rule-based approaches, n-gram language models, discrimination decision, hidden event language
models, boosting, conditional random fields and neural networks. Section 3 presents methods that incorporate
features specific to audio files. Section 4 considers evaluation resources and methods used for the considered
tasks. Finally, section 5 presents the conclusions associated with this work.
2 Methods using lexical features
2.1 Rule-based approaches
One of the early implementations for a rule-based approach to automatic capitalization is included in the
popular word processor Microsoft Word. In this case, a sentence capitalization state machine is used to
determine the start of a new sentence, hence indicating the starting word should be capitalized. The details
for this approach are described in U.S. patent 5761689 (Rayson et al., 1998). In this case, an analysis is
performed on the characters of the current word, defined as regular text characters (letters and digits) and any
adjacent punctuation. If during the analysis is identified a punctuation mark indicating end of a sentence, the
state machine changes its state accordingly thus being able to suggest a capital letter should follow.
Furthermore, if a word is determined as not being the first in a sentence and it is formed of only lowercase
letters, then the most frequent capitalization form of the word is assigned, based on dictionary lookup.
Apart from capitalization at the beginning of sentences, inside the sentence capitalization may be required
especially in the case of proper names. A possible approach for word capitalization inside sentences is based
on building a list of unambiguous capitalized words from the document and suggest based on this list
capitalization where it is not used. This approach is described in (Mikheev, 1999).
In (Kim and Woodland, 2002) is presented a system making use of a rule based named entity recognizer. The
assumption is that most capitalized words are either first words in sentences or part of a named entity. Most
of current named entity recognizers are usually used later in the processing pipeline, requiring additional
features such as part-of-speech tagging, like Stanford NER (Finkel et al., 2005) or NCRF++ (Yang and
Zhang, 2018), or at least sentence boundaries, like the system of Baevski et al. (2019). However, systems
such as (Kim and Woodland, 2000) are constructed to work only on raw text without capitalization or
punctuation. This particular system uses Brill inference rules (Brill, 1993).
Even though the purpose of this study is to explore capitalization and punctuation restoration techniques and
not rule-based named entity recognition systems, since these last ones proved to be useful, some other named
entities systems that ignore punctuation must also be considered, such as (Appelt et al., 1995), (Petasis et al.,
2001), (Todorovic et al., 2008), (Elsayed and Elghazaly, 2015) and (Tomita et al., 2005). This last one also
makes use of speech features, thus being useful if capitalization and punctuation restoration is applied on the
results of ASR.
Rule based systems are known to be hard to maintain as they require constant addition of new rules. However,
approaches like the one described in (Petasis et al., 2001) allow for a basic set of entities and rules to be
expanded by using large corpora in which it is possible to identify new rules that can be added to the rules
database. This is usually known as “bootstrapping”, while starting entities and rules are known as “seeds”.
In (Ehara et al., 2013), bootstrapping is defined as a method for harvesting “instances” similar to given
“seeds” by recursively harvesting “instances” and “patterns” by turns over corpora using the distributional
![Page 4: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/4.jpg)
4
hypothesis (Harris, 1954). In the same paper it is demonstrated that the performance of bootstrapping
algorithms depends on the selection of seeds.
2.2 N-gram based language models
A language model (LM) is a model assigning probabilities to sequences of words (Jurafsky and Martin,
2008). An n-gram is a sequence of “n” words. For certain values of “n” these are better known under specific
names such as: “unigram” for n=1, “bigram” for n=2, “trigram” for n=3. Such a LM usually implies
computing the probability of a word Wk given a context window of n previous words Wk-1, Wk-2...Wk-n. This
probability can be expressed by: P(Wk|Wk-1 ,Wk-2,...,Wk-n).
In (Beeferman et al.,1998) is proposed a LM based on trigrams, extended from a baseline model constructed
based only on words, to account also for punctuation. The paper focuses on restoring commas, but the method
presented can easily be extended to other punctuation marks as well. Given the sparsity of n-grams containing
commas, the increase of the LM model size is manageable. In the case of the trigram model of (Beeferman
et al., 1998), only 14.7% of the trigrams contained commas.
Multiple n-gram models, of different n values, can be combined into polygram models (Schukat-Talamazzini,
1995) by linear interpolation between the models. This allows a polygram model, of higher n value, to exploit
the information available in lower-order language models. Equation (1) describes a linear interpolation of
trigram, bigram, unigram and zero-gram (1/L) probabilities.
𝑃(𝑤|𝑢𝑣) = 𝜌0
1
𝐿+ 𝜌1𝑃1(𝑤) + 𝜌2𝑃2(𝑤|𝑢) + 𝜌3𝑃3(𝑤|𝑢𝑣) (1)
In this case 𝜌0, 𝜌1, 𝜌2 and 𝜌3 are the weights associated with the corresponding n-gram probabilities. Other
interpolation methods are described in (Schukat-Talamazzini, 1995) and (Schukat-Talamazzini et al., 1997).
An example of polygram model usage for sentence segmentation is presented in (Warnke et al., 1997).
A baseline capitalization model based on unigrams is presented in (Lita et al., 2003). In the same paper a
more complex model is presented based on a combination of trigrams, bigrams and unigrams as well as a
sentence level context.
Stolcke and Shriberg (1997) propose an n-gram based LM for text segmentation. In this case, no actual
punctuation is restored, but sentence boundaries without punctuation are detected. This could be seen as a
first step towards punctuation restoration. In their LM, the authors included a special token “<s>” indicating
the sentence boundary.
In (Niesler and Woodland, 1996) is proposed an n-gram LM based on categories. These models are able to
generalize and adapt to unseen word patterns. An example of using such a model is given in (Romero and
Sanchez, 2013) where a category-based model is employed to enhance a handwritten text recognition system
with emphasis on named entities such as person names, place of residence, geographical origin. As discussed
previously, in the “Rules based” section of this survey, named entity detection can play an important role in
restoring capitalization.
Agbago et al. (2005) propose enhancing the results of an n-gram model by using a scoring function and
incorporating casing probabilities based on word classes for unknown words. These are determined based on
word shapes, taking into account the presence of digits or symbols inside a word. Thus, they distinguish
between 4 classes: quantity words (starting or ending with numbers), acronyms (containing a sequence of
single letters followed by periods), hyphenated words (made up of at least two components joined by a
hyphen) and regular uniform words (consisting entirely of alphabetical characters).
Kaufmann (2010) proposed using a trigram model for capitalization of Twitter messages. The author
recognizes that original capitalization present in tweets is not a reliable indicator of true capitalization. Users
frequently use capitalization when trying to emphasize certain words or ideas. For example, considering an
example message (also known as “tweet”): “the movie was very GOOD”, we can easily see that the word
“GOOD” should not contain any uppercase letter. Most likely, this writing form was used as the user tried to
emphasize how much he enjoyed watching the movie. Additionally, the first word should have the first letter
uppercase since it is starting a sentence. The proposed LM was built using the Open American National
Corpus (OANC) (Ide and Macleod, 2001). In this approach, no actual tweets were used during training.
![Page 5: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/5.jpg)
5
Nebhi et al. (2015) propose performing truecasing on Twitter texts using a 3-gram language model built using
a combination of newswire articles and tweets that were truecased. This approach was taken because not
enough training data based solely on tweets was available. Therefore, the addition of newswire articles
improved the overall system performance.
Even with today’s computing resources, in terms of storage and processing power, models with high n
numbers are difficult to handle due to their storage requirements. Therefore, it is common to see relatively
low n values (such as 1 through 5). To allow for an easier usage of larger n-gram models, several pruning
and compression methods have been proposed, such as (Siivola and Pellom, 2005), (Pauls and Klein, 2011).
A large n-gram language model containing punctuation and casing was gathered by Google, available at the
Google Books website3 and described in (Michel et al., 2010). This model contains n-grams with n ranging
from 1 to 5 for multiple languages including English.
Another example of a LM with punctuation included is the Web 1T dataset (Brants and Franz, 2006), also
contributed by Google Inc., containing English n-grams (n ranging from 1 to 5) and associated frequency
counts calculated over 1 trillion words from web pages collected by Google in January 2006. In addition, the
Web 1T 10 European Languages dataset (Brants and Franz, 2009) contains similar data for other languages.
Gravano et al. (2009) studied the impact of increased n-gram order (varying n from 3 to 6) as well as training
set size (from 58 million to 55 billion tokens) on both capitalization and punctuation recovery. The focus was
on the following punctuation marks: comma, period, question mark and dash. All the other punctuation marks
were replaced with one of these marks. After training the LM, a weighted finite state automaton (FSA) is
constructed based on the input string considering all possible combinations of capitalization and
punctuations. A final decoding step computes the least cost path along the FSA. This is equivalent to the
maximum posterior probability sequence of punctuation marks and capitalized words according to the LM.
As expected, increasing training set size had a huge impact on performance. Training on the larger corpus
produced an increase of 6% in F1 score for capitalization. However, an interesting conclusion was that
increasing the n-gram order did not produce such high improvements. Increasing the n-gram order from n=3
to n=6 on both the small and larger corpora yielded and increase in F1 of less than 1% for capitalization.
Another conclusion, supporting previous findings, was that low-frequency symbols (like question marks and
dashes, that were considered during training) are much harder to model using n-gram based LM.
2.3 Capitalization as a discrimination decision
In (Gale et al., 1994) is mentioned that discrimination decisions arise in many natural language processing
tasks and the authors make a special mention to the task of deciding whether a word from some teletype text
should be capitalized if both cases have been used in the text. Excepting the trivial situation of a word needing
capitalization at the beginning of a sentence, it can be easily deduced that a word allowing both lowercase
and uppercase usage actually has at least two meanings: one corresponding to the lowercase version and one
for the capitalized version. Thus, the matter of deciding which word form to use actually becomes a decision
between which word sense to use, hence becoming a word sense disambiguation issue. The authors of (Gale
et al., 1994) propose to apply their disambiguation method (Gale et al., 1993) for the purposes of truecasing.
This is based on a Bayesian classifier combined with an interpolation procedure, seeking a trade-off between
measurement errors and relevance.
Word sense disambiguation has been recognized as a major problem in natural language processing research.
This can be encountered in scientific literature as early as Kaplan (1950) and Masterson (1967), which
considered that “the basic problem in machine translation is that of multiple meaning”. A survey on word
sense disambiguation methods is available in (Navigli, 2009).
2.4 Hidden event language models
Capitalization can be considered a sequence tagging problem. The sequence is the set of words and the tag
set is the capitalization classes as described in the “Introduction” (all letters lowercase, first letter uppercase,
3 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
![Page 6: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/6.jpg)
6
all letters uppercase, mixed case). Thus given a word sequence W=w0 w1 w2… wn, the model predicts a
sequence of capitalization features C=c0 c1 c2...cn, with ci from {AL, FU, AU, MC} (“all lower case”, “first
uppercase”, “all uppercase”, “mixed case”). Similarly, punctuation restoration can be also considered a
sequence tagging problem, predicting inter-word events E=e0 e1 e2 … en , where ei denotes either a
punctuation mark or the absence of any marks.
Since Maximum Entropy Markov Models (MEMM) were previously used in other sequence tagging
problems, such as part-of-speech tagging (Ratnaparkhi, 1996) they can be used for capitalization as well.
Considering the set of possible word and tag contexts H and the set of allowable tags T, the probability model
can be expressed as:
𝑃(ℎ, 𝑡) = 𝜋𝜇 ∏ 𝛼𝑗
𝑓𝑗(ℎ,𝑡)𝑘
𝑗=1
(2)
where h is the sequence history, t is a tag, π is a normalization constant, {𝜇, 𝛼1… 𝛼𝑘} are positive model
parameters, f1...fk are features with fj(h,t) = {0,1}. Under the Maximum Entropy formalism, the goal of the
model is to maximize the entropy of a distribution, described by:
𝐻(𝑝) = − ∑ 𝑝(ℎ, 𝑡) log 𝑝(ℎ, 𝑡)
ℎ𝜖Η,tϵΤ
(3)
In (Chelba and Acero, 2004) is presented a maximum a-posteriori (MAP) adaptation technique for maximum
entropy (MaxEnt) models. This helps reduce the capitalization error of a system trained on a certain type of
corpus and used on a different one. The authors give an example of a capitalization system trained on
newswire text and used on email and office documents.
Batista et al. (2009) make use of a Maximum Entropy model for capitalization restoration for Spanish and
Portuguese languages. The features used are word n-grams of the form: wi, 2wi−1, 2wi, 3wi−2, 3wi−1,
considering wi as the current word and nwk being an n-gram starting at position k. An interesting approach
here is the usage of n-grams which are not necessarily centered on the current word. Thus, considering an
example sentence “John went to school in England”, the n-grams used for predicting the case associated with
the word “school” are: (school), (to school), (school in), (went to school), (to school in). At runtime the
authors propose incorporating a lexicon which can provide forced capitalization for previously unseen words
during training.
One of the models investigated by Hasan et al., (2014) makes use of a Hidden Markov Model (HMM), similar
to the one used previously for disfluencies detection (Stolcke and Shriberg, 1996). This model is only used
as a baseline to which more advanced models are compared.
2.5 “Boosting” approaches
Boosting (Freund and Schapire, 1995) is a meta-learning approach that aims at combining an ensemble of
weak classifiers to form a strong classifier. Considering a black-box learning algorithm, on each iteration t
this base learner is used to generate a base classifier Ht. The boosting algorithm provides the classifier with
the training data as well as a set of weights Wt, non-negative, associated with the training example. These
weights are used to force the base learner to focus on previously misclassified examples, often considered
the “hardest” examples. The output of the final classifier can be considered as:
𝑓(𝑥) = ∑ 𝐻𝑡(𝑥)
𝑇
𝑡=1
(4)
Adaptive Boosting (Adaboost), as described in (Schapire and Singer, 2000) is a greedy search for a linear
combination of classifiers by overweighting the examples that are misclassified by each classifier.
Gupta and Bangalore (2002) propose using a boosting approach for building a sentence boundary detector.
For this purpose, they use an extension described in (Schapire and Singer, 1999) for multi-class classification
problems. The software package used is Boostexter, described thoroughly in (Schapire and Singer, 2000). In
this system, the base classifiers output a real number in the interval [-1,+1] whose sign can be interpreted as
a prediction regarding the probability of the result belonging to one of the defined classes associated with the
result (thus +1 means it belongs with high probability to the respective class).
![Page 7: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/7.jpg)
7
The real-valued predictions of the final classifier f were converted into probabilities by means of a logistic
function:
1
1 + 𝑒−𝑓(𝑥) (5)
This function estimates the probability of x representing a sentence boundary.
As features for the boosting based classifier, Gupta and Bangalore considered a window of three words to
the left and three words to the right of the target word. Thus, they used the 3-gram representation on the left
side, the right side, the associated part of speech tags on the left and right, as well as the number of common
words and part of speech tags considering 1-gram, 2-gram and 3-gram representation on the left and right.
This resulted into a total of 18 features, as described in the paper, used for the classifiers.
Furthermore, in (Hasan et al., 2014) is proposed a boosting approach for sentence-end detection, using the
ICSIBoost tool4, which uses an Adaboost algorithm implementation. The textual features used contain n-
grams (with n=2,3,4) for the previous words and up to two words following the target word. Additionally,
the paper investigates the addition of prosodic features as well as forward and backward distance features.
2.6 Conditional Random Fields probabilistic models
Conditional Random Fields (CRF) are probabilistic models used to segment and label sequence data (Lafferty
et al., 2001). CRFs present an advantage over maximum entropy Markov models (MEMMs) and other
discriminative Markov models based on directed graphical models, which can be biased towards states with
few successor states.
Even though CRFs can be used as the final layer of a neural network architecture, this section of the survey
will only focus on CRFs used in probabilistic models. Neural networks usage of CRFs is covered in a
separated section.
Stanford CoreNLP natural language processing toolkit, as described in (Manning et al., 2014), contains a
truecasing module implemented with a discriminative model using a CRF sequence tagger (Finkel et
al.,2005). Furthermore, (Wang et al., 2006) presents a bilingual capitalization model for capitalizing machine
translation (MT) outputs using CRFs. A characteristic of this model is that it uses both the input sentence and
the output sentence of the MT system. In this case the feature functions used include: monolingual language
model feature, capitalized translation model feature (given the assumption that the translated word has a high
probability of keeping the original word capitalization), capitalization tag translation feature, upper-case
translation feature (an all uppercase sequence of words gets translated into a similar all uppercase sequence).
Additionally, an initial capitalization feature combined with a punctuation feature template are used to ensure
that word is capitalized following a sentence-end punctuation mark.
While considering punctuation restoration for comma, period, question mark and exclamation mark, Lu and
Ng (2010) propose using a factorial CRF (F-CRF) model (Sutton et al., 2007). Features are computed using
unigrams, bigrams and trigrams. The resulting model is compared against simple n-gram LMs (n=3 and n=5)
and linear-chain CRF for both English and Chinese. First of all, these results confirm those of Gravano et al.
(2009), which demonstrated that higher order n-gram models do not have a large impact on F1 scores. For
both English and Chinese the impact of a larger n-gram model in terms of F1 is less than 1%, in accordance
with Gravano’s findings. The CRF models are performing better than simple n-gram models, with the F-CRF
model having the highest F1 score for both languages.
While continuing the investigation of dynamic CRF models for sentence boundary detection as well as
punctuation restoration, Wang et al. (2012) propose using a vocabulary pruning method. This involves
selecting the top words appearing in the training data and replacing the others with an unknown token. This
reduces the sparsity of the model and makes it more resilient when faced with rare words. Their research
suggests an increase in F-measure for punctuation prediction when using around 25% of the words in the
initial vocabulary. Ueffing et al. (2013) also tried applying the vocabulary pruning method, but reported “no
4 https://github.com/benob/icsiboost
![Page 8: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/8.jpg)
8
significant gain”. This would seem to indicate that the method’s benefits are closely related to the datasets
being used for training and testing, especially with regard to the distribution of infrequent words.
Hasan et al. (2015) propose training a CRF model on noisy ASR output for sentence end detection. The
implementation makes use of the CRF++ toolkit5 and is explored the impact of noisy textual data for models
with both textual only features and additional prosodic features. These models are compared against non-
noisy models, trained using official transcripts, and the noisiness seems to improve the model performance
on the sentence end detection task. The use of a noisy text-only model on the ASR output text compared to
the transcripts model on the same text provides an increase of 2% in terms of F1 evaluation.
2.7 Neural network architectures
Architectures based on neural networks for truecasing can be separated based on the prediction given. This
can be either at word level, indicating a word class (such as all lowercase, all uppercase, first letter uppercase,
mixed case), or at character level, indicating if the individual letter should be lowercase or uppercase. With
regard to the punctuation restoration problem, usually punctuation marks are placed between words.
However, from a neural network perspective, it is possible to present the network either with complete words
or with individual characters. In the first case, the output will be considered as the punctuation mark following
an input word, while in the second case the network will predict a punctuation mark when presented with the
space character.
Susanto et al. (2016) experimented with two character-level recurrent neural network (RNN) architectures
based on either Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent
Units (GRU) (Cho et al., 2014) cells. They constructed a small model with 2 layers and 300 hidden nodes
and a large model with 3 layers and 700 hidden nodes in each layer. Additionally, dropout (Srivastava et al.,
2014) is used to reduce the risk of overfitting. Their findings seem to backup the claims of Britz et al. (2017)
that LSTMs actually outperforms GRUs. Furthermore, their analysis of the small and large model suggests
that a larger RNN model usually improves performance. However, due to possible overfitting, there are
situations where a smaller model actually performs better.
In (Ramena et al., 2020) is presented another approach based on a character-level RNN. The network is
comprised of 4 layers: a Convolutional Neural Network (CNN) layer that generates character representations,
2 Bidirectional Long Short Term Memory (BiLSTM) layers followed by a final Conditional Random Field
(CRF) layer, which decides on the proper case for each of the characters. Furthermore, in the CNN layer,
Rectified Linear Unit (ReLU) (Hahnloser et al., 2000; Jarrett et al., 2009) is used for the activation function.
Even more, in the CNN and BiLSTM layers an additional dropout is used to improve performance.
Tilk and Alumae (2016) approach the punctuation restoration problem (considering comma, period and
question mark) by means of a bidirectional recurrent neural network, implemented using Gated Recurrent
Units (GRU) enhanced with an attention mechanism (Bahdanau et al., 2015). As described in the paper, the
attention mechanism in this case allows the model to focus on words that may indicate a question but are
relatively far apart from the current word. The attention mechanism is incorporated in the model using a late
fusion approach, similar to the technique described in (Wang and Cho, 2015). The artificial neural network
is comprised of 4 layers: an embedding layer (allowing for both pre-trained vectors or vectors to be trained
at runtime), a bidirectional GRU layer (comprised of two sub-layers, one for each direction: forward and
reverse), an unidirectional GRU layer with attention and a final SoftMax layer. Training was realized using
AdaGrad (Duchi et al., 2011). The results presented in the paper suggest that using a pre-trained vector
representation of words may have a beneficial effect on the F1 metric allowing for around 3 percent increase
in performance, in the case of using pre-trained GloVe vectors (Pennington et al., 2014) for the English
language.
Inspired by Tilk and Alumae’s work (2016), Salloum et al. (2017), while working with medical reports,
propose implementing a vocabulary reduction pre-processing step. This improves the modeling of rare words
and represents a method to deal with out-of-vocabulary words. For this purpose a suffix and prefix list is
5 http://crfpp.sourceforge.net/
![Page 9: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/9.jpg)
9
constructed and each word occurring less than 20 times in the training data is replaced by “pAAAA_n” or
“AAAA_ns”, where s denotes the suffix, p is the prefix, n is the length of the word without the suffix/prefix
rounded to a multiple of 5 letters. In the case of a word containing both a prefix and a suffix from the list, it
is actually split into two tokens, one with the prefix and one with the suffix. Finally, if a word is rare but does
not have a known prefix or suffix it is represented using the “RARE” token. All these operations take place
as a pre-processing step for both training and at runtime. The proposed suffix and prefix list is constructed
based on words from a medical terminology as well as some common proper names. This approach seems to
have a beneficial effect on the F1 score of each punctuation mark (comma, period, colon). Nevertheless, the
results are computed only on the author’s medical reports corpus and thus cannot be easily compared with
other works.
Klejch et al. (2016) investigate using a sequence-to-sequence mapper (Sutskever et al., 2014) for punctuation
restoration. This approach, encountered also in machine translation, is converting a sequence of input words,
without punctuation, to a sequence of punctuation marks (period, comma, exclamation mark, question mark,
no punctuation).
Gale and Parthasarathy (2017) investigate the usage of character-level neural networks for punctuation
prediction and sentence boundary detection. Thus, for each character a punctuation mark (comma, period,
question mark) or the lack of punctuation is predicted by the network. The recurrent neural network output
is delayed in order to allow the network to capture more context. This is realized by pre-padding the input
sequence with n-1 <PAD> tokens and also pre-padding the output sequence with n <PAD> tokens. Thus, a
delay of size n is introduced before the first useful tagging decision of the network. Evaluation of the proposed
model on two different datasets (which are not publicly available) indicate that the delay character level RNN
performance is on par with word-level CRF and RNN models, including word-level delay RNN models.
While investigating sentence segmentation of YouTube automatic subtitles, Song et al. (2019) propose a
method based on LSTM networks to add period marks in unsegmented text. For this purpose, the authors
used 27826 sentences from online lectures of Stanford University available on YouTube as training data.
These were first pre-processed to remove capitalization and any other punctuation except periods. The
network architecture used is based on an LSTM layer followed by a SoftMax layer. The input is in the form
of Word2Vec (Mikolov et al., 2013) encoded words, while the output is the decision of placing a period after
one of the words. The embeddings model used was constructed without capitalization, thus justifying the
authors choice for lowercasing the words.
Juin et al. (2017) tackle the problem of punctuation restoration considering a set of 11 output labels: period,
comma, question mark, exclamation mark, apostrophe, parentheses (left and right), quotes, semicolon, dash
and colon (actually the number of punctuation marks used vary with the datasets employed for
experimentation throughout the paper, but the mentioned labels form the maximal set). The neural network
model is based on an encoder-decoder architecture, accepting for input the context words together with the
associated part-of-speech tags. The actual implementation was realized using the OpenNMT system (Klein
et al.,2017), thus handling the problem as a sequence-to-sequence tagging. Given the input word + part of
speech tag, the system will output a sequence of words and punctuation marks. The authors employ a pre-
trained part-of-speech tagger, which was trained on punctuated text. This is applied on unpunctuated text and
unfortunately there is no analysis of the tagging errors due to this high difference between training data vs
the actual use case.
Tundik and Szaszak (2018) study the impact of combining a word-based model with a character-based one
for restoring punctuation (period, comma, question mark). The first model uses GloVe word embeddings to
represent words in a context window and a BILSTM layer followed by a SoftMax layer. The second model
makes use of fixed-length sequences of characters, provided as one-hot representations. These are passing
through a multi-layer neural network composed of a dense embedding layer, a 1D convolution layer, a max
pooling layer and a BILSTM layer followed by a final SoftMax layer. In order to combine the two models,
the last two layers (BILSTM and SoftMax) are common. Thus, the word-based model is reduced to the word
embeddings and the character-based model only keeps the dense, 1D convolutional and max pooling layers.
The authors experimented with individual models and the resulting ensemble model on both English and
Hungarian languages. Their findings suggest that performance of the two individual models for English
![Page 10: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/10.jpg)
10
language is within a 3% F1 increase in favor of the word level model, but with better performance of the
word-based model on Hungarian language, especially for detecting question marks. Nevertheless, the hybrid
model behaves better in both languages, the addition of the character embeddings leading to an increase in
F1 and a reduction of the error rate. Finally, the authors also consider the possibility of creating a weighted
ensemble of the two models, by using the individual outputs. By choosing appropriate weights, the values
can become similar to the hybrid model. However, the hybrid model still outperforms the weighted ensemble.
Yi et al. (2020) is using adversarial learning (Goodfellow et al., 2014) to train a neural network for multitask
learning (Caruana, 1997) in order to predict both punctuation and part of speech tagging, using BERT as the
base language model. The proposed neural model contains common BERT layers followed by a BiLSTM
layer and a CRF layer for each of the classification tasks. Furthermore, the network is completed by an
adversarial task discriminator composed of a max-pooling layer, followed by a gradient reversal layer, then
a fully connected layer and finally a SoftMax layer. The addition of the adversarial learning mechanism
improves the performance of the model, compared with a simple BERT fine tuning approach, with 2% in
terms of F1 score for both reference text and ASR output.
Nguyen et al (2019) propose using transformer models with overlapped chunk-merging to restore both
capitalization and punctuation in a single pass over the input data, consisting of ASR output. The proposed
implementation is split into three components: a chunk splitter which takes a long input sequence and splits
it into multiple chunks with overlap, the model itself which predicts simultaneously capitalization and
punctuation and a final component which merges the predicted output for the chunks. Chunks are created by
using a window of size K tokens sliding with K/2 at each step. The neural network model is implemented
based on a transformer architecture, using Tensor2Tensor (Vaswani et al., 2018) and OpenNMT (Klein et
al., 2017). The capitalization situations studied include only first letter uppercase and all lowercase, while
the punctuation marks considered are comma, period and question mark.
Courtland et al (2020) make use of the RoBERTa model (Liu et al., 2019) as input to a neural network
comprised of two additional linear layers. The authors propose aggregating predictions over multiple context
windows and show that this approach improves the overall accuracy. This involves predicting punctuation
marks (period, comma, question mark) for all the tokens in a window of 50 tokens and then advancing the
window with a step of 20 tokens (similar to the method described above of Nguyen et al (2019)). Thus, the
tokens will appear in multiple context windows presented to the network and the resulting predictions can be
aggregated. This aggregation is performed by a final layer of the network without an external mechanism.
The best results are obtained by also fine-tuning the RoBERTa representation on the task data. Nevertheless,
the authors also compare pre-trained language models without fine tuning and observe an overall
performance with F1 of 83.5 for pre-trained RoBERTa large model comparable to the fine-tuned RoBERTa
base model which obtained an F1 score of 83.9. Additionally, very close to these results is a pre-trained
XLNet base model which obtained an overall F1 of 82.9.
While working on detecting sentence boundaries in legal texts, Sanchez (2019) observes that legal texts are
more challenging than other language data, such as news articles, because most legal texts are structured into
sections and subsections, unlike the narrative structure of a news article. Furthermore, legal texts contain
additional elements such as header, footer, footnotes or lists and sometimes important sections are interleaved
with citations. Even more, specific elements are observed with regard to capitalization restoration, since in
legal texts certain named entities (or parts of named entities) are written with all capitals, such as the family
name or some institutions. Sanchez (2019) propose using a neural network model based on a BiLSTM hidden
layer with a final SoftMax layer to predict for each token if it represents either the beginning of a sentence,
an internal token, or the last token of a sentence. The input for the system is comprised of word embeddings
(Mikolov, 2013) associated to a three-token window, together with additional binary features indicating the
structure of the token (upper or lower case characters, digits, punctuation).
Similar to legal texts, medical texts create problems for general punctuation restoration systems due to their
domain specific language. Sunkara et al (2020) propose a combined punctuation and truecasing system for
medical texts, especially those obtained by ASR systems from both dictations and conversations. The
proposed system has two steps: first punctuation prediction and then the resulting sentence is fed into the
capitalization restoration component. The reason for this is the observation that a word following a sentence
![Page 11: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/11.jpg)
11
boundary (indicated by a period) has the first letter uppercase. Thus, the system employs a BERT layer
followed by a linear layer with SoftMax activation for punctuation prediction and a final layer linear also
with SoftMax activation for capitalization. The final layer makes use of BERT encodings concatenated with
the probabilities obtained from the punctuation prediction layer. The BERT model is fine-tuned for the
medical domain (Han and Eisenstein, 2019). Afterwards the model is further fine-tuned for task adaptation,
by selective masking the input tokens, while ensuring that 50% of the masked tokens are punctuation marks.
Finally, in order to account for ASR errors, the model is further augmented by using ASR output aligned
with reference text.
3 Methods incorporating audio features Acoustic information involves prosody which comprises features like pause duration between words, pitch-
intensity, per-word timing. This is usually used by speakers to impose structure in both spontaneous and read
speech (Kompe, 1996). Furthermore, prosodic cues are known to be relevant to discourse structure across
languages (Vaissiere, 1983). Even more, in the description of the Verbmobil speech-to-speech translation
system, Noth et al. (1999) recognize that “the major role of prosody in human–human communication is
segmentation and disambiguation”. Additionally, change of speaker can be seen as a helpful information for
sentence segmentation since usually one sentence ends and another begins when the speakers change. Based
on these observations, when speech data is available, for example in the case of ASR systems, it is desirable
to use acoustic information either by itself or together with textual information for punctuation insertion.
Stolcke and Shriberg (1997) proposed an enhanced n-gram model containing information on change of
speaker (turn-taking). Thus, they introduced a special token for turn boundaries in both training and testing.
They showed that this addition improves over text only n-gram models. Furthermore, in Stolcke et al. (1998)
a comparison is performed between a prosodic model, a text-only n-gram model and an n-gram model
enhanced with segmentation cues such as turn-taking and long pauses. For both n-gram models it was
considered n=4. It is shown that the n-gram model enhanced with segmentation cues derived from speech
behaves better for sentence segmentation purposes. The simple text-based n-gram model performs better than
the prosodic model on text that was correctly recognized. However, the prosodic model performs better than
the text-based n-gram model in case of errors in the ASR output. Nevertheless, in both cases the n-gram
model enhanced with speech cues has the highest accuracy. A further improvement is possible by using a
model combination of the prosodic model and the n-gram model with segmentation cues via model
interpolation technique. However, this only improves the results with less than 1% in terms of accuracy.
In (Warnke et al., 1997) is proposed a combination of a polygram language model and prosodic features used
as input for a multilayer perceptron (MLP) classifier for sentence boundary detection. The prosodic features
are modeled over a context of six syllables, taking into account duration, pause, F0 and energy.
Chen (1999) propose assigning acoustic baseforms (silence, double silence, disfluency) to punctuation marks
into an extended dictionary. Thus, an ASR language model is trained including punctuation marks. At
runtime, given a pause or silence detected in the speech, the process will look for the LM score. If this LM
score does not indicate a punctuation mark, then it will be interpreted as a silence, otherwise the appropriate
punctuation mark will be selected.
Considering the problem of sentence segmentation, Hakkani et al. (1999) explore a large number of prosodic
features in both a prosodic-only model and a combined model with word information. Prosodic modelling
was realized using CART-style decision trees (Breiman et al., 1984) and the initial features considered are
based on the work of Shriberg et al. (1997). By training and evaluating different models with subsets of the
initial features, 6 useful features for sentence segmentation were identified: pause duration, F0 (fundamental
frequency) differences across the boundary, speaker change, rhyme duration (length of the rhyme, nucleus
and coda, of the last syllable preceding the boundary). Out of these features, the most important is indicated
to be the pause duration.
While interested in punctuation annotation (comma, period, question mark) of automatically transcribed
broadcast news data, Christensen et al. (2001) analyze the impact of different prosody features on finite state
and MLP classifiers. Thus, they consider pause durations, phoneme durations (considering the average
![Page 12: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/12.jpg)
12
duration of the phonemes and the average duration of only vowels in the preceding word) and pitch
information. Their analysis supports previous research, indicating that pause duration has the biggest impact
on the model’s accuracy in predicting the correct punctuation mark.
A combination of the previous two approaches is used in Kolář et al. (2004) for building an automatic
punctuation annotation system (commas and periods) for Czech language. Thus a textual model is combined
with a prosodic model using the following features: F0 derived features (maximum, minimum and mean,
first and last value), first and last piecewise linear (PWL) slope, ratio and difference between the last value
in the current word and the first value in the following word, ratio and difference between the last PWL slope
in the current word and the first PWL slope in the following word, slope of linear regression from all values
in the current word), phoneme duration (average duration of vowels, duration of the first and last vowel,
duration of the longest and shortest vowel), pause duration, energy features ( maximum, minimum and mean
of frame level RMS energy). Two models are explored based on CART decision trees and MLP classification.
On the available testing data, it is reported that the CART based model performed better than the MLP model.
Unfortunately, the authors did not explore the impact of different prosodic features.
Huang and Zweig (2002) approach the punctuation restoration problem for period, comma and question
mark, by using a maximum entropy model with lexical features (a context window of two words before and
two words after the current word), the previously identified two punctuation marks, pause duration as a
prosody feature. In this experiment, pause duration is measured in 0.01 second intervals and is considered in
the 5-word window centered on the current word. They explore both reducing the window size (to only one
word before and after the current word) and individual text-only and prosodic-only models. Out of these
combinations, the best model is the mixed model (textual and prosody) with two words before and two words
after the current word. Nevertheless, the benefit of a larger window translates only in 1% improvement.
According to the authors this can be attributed either to the modelling technique used or to the dataset.
Usually, ASR systems are trained using only lowercase words. While trying to handle the capitalization issue,
Kim and Woodland (2004) propose training a system using an extended vocabulary, containing both
lowercase words and words with the first capital letter. Furthermore, a full stop is modelled based on the
silence in speech. This follows their previous work (Kim and Woodland, 2002) where the dictionary was
extended using also the all letters uppercase version of words.
Liu et al. (2005) approached the sentence boundary problem using CRFs combining lexical features (such as
n-grams and part-of-speech tags) with prosody features, including speaker turn change. The research shows
that CRFs have lower error rates compared to HMM and Maxent models trained using similar features.
Furthermore, the accuracy of the models can be further increased by using an ensemble mechanism based on
a simple voting between the three implemented models.
Batista et al. (2009) used a maximum entropy model for punctuation restoration in Spanish and Portuguese.
Their proposed features are wi, wi+1, 2wi−2, 2wi−1, 2wi, 2wi+1, 3wi−2, 3wi−1, pi, pi+1, 2pi−2, 2pi−1, 2pi, 2pi+1, 3pi−2,
3pi−1 (lexical), GenderChgs, SpeakerChgs, and TimeGap (acoustic), where wi is the current word, nwk is an
n-gram starting at position k, npk is the part of speech n-gram starting at position k, GenderChgs and
SpeakerChgs indicate changes in speaker and gender and TimeGap represents the pause between the current
and the next word.
In the context of machine translation of ASR documents, Matusov et al. (2006) propose a sentence
segmentation algorithm combining three language models. The first model is based on n-grams and estimates
the following probabilities: end of segment boundary, internal segment probability and probability of the first
words of the next segment. The second model is based on the pause duration between the assumed last word
of a segment and the first word in the next segment. The third model is based on the sentence length
probability. Finally, a log-linear combination of the models is constructed using manually tuned scaling
factors.
Peitz et al. (2011) further refine on this approach and consider punctuation prediction as a machine translation
task. They use a phrase-based MT system (Zens et al., 2008) to align the unpunctuated and punctuated
variants of the same text, thus enabling the use of additional features such as phrase translation probabilities.
Punctuation is modelled into classes, considering: a class for sentence-end marks (period, question mark,
![Page 13: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/13.jpg)
13
exclamation mark) with sub-classes for period and question mark; a class for commas; and a class for other
punctuation marks (quotes, apostrophe, parentheses and semicolon).
Following the evolution of textual only methods for punctuation restoration, as described in other sections of
this survey, mixed models using prosody and textual information were adapted to either incorporate the
improved language models or to use better classification methods for directly training mixed models. For
example, in Kolar and Lamel (2012) is described a system using prosodic features together with text features
trained using Adaptive Boosting. Previous findings still hold even with new classification models, since a
mixed model performs better than individual text-only or prosodic-only models.
Tilk and Alumae (2015) start by constructing a neural network model using only text features for predicting
commas and periods. Then, the last hidden layer of this model is used as input in conjunction with a feature
based on the pause duration for training a new model. Both models are using LSTM cells and the usage of
the last hidden layer as input for a new model is based on the work of Seide et al. (2011). The two-stage
model is completed by a SoftMax layer allowing the prediction of the punctuation mark. Thus, the model
learns in the first stage textual features and these are combined in the second stage with the prosodic feature
pause duration in order to make better predictions. This approach seems to be beneficial for period
restoration, as indicated by the authors. The work was continued in Tilk and Alumae (2016) by replacing the
LSTM cells with Bidirectional LSTMs and the addition of an attention mechanism. This led to an F1-score
improvement by 2.5% on reference text and 1.8% on ASR output.
Cho et al. (2016) propose using two parallel neural network models, one for text features and one for prosody
features, and only combine them in a final stage. The lexical model is trained using only GloVe word
embeddings and allows 4 outputs corresponding to a middle word in a context window indicating the
punctuation mark (period, comma, question mark) or no punctuation. The acoustic model only considers two
output classes: boundary or no boundary. For both models a 3-layer fully connected neural network is used.
The final decision for sentence boundary is taken in a 2-stage process. In the first stage both models are used.
In this case if an acoustic-based boundary decision is opposed by the lexical model then it will be denied. In
the second step, only the lexical model is used together with the output from the first stage. This can lead to
further segmentation. Models are evaluated individually and with the decision mechanism after each stage
and results prove that the overall model behaves best. The second decision stage adds an increase of 5% in
terms of F1 over the first stage decision.
Based on the observation that an ensemble model (Deng and Platt, 2014; Zhao et al., 2015), combining
multiple individual models, usually outperforms the individual models, Yi et al. (2017) envisage using such
an ensemble model as a teacher for a deep neural network based single model (called a student model). Thus,
the models of Tilk and Alumae (2016) and Lample et al. (2016) are combined with a simpler neural network
model in order to form the teacher system. Afterwards, the Kullback-Leibler (KL) divergence is used to guide
the training of a smaller neural network system. This approach is inspired by the work of Chan et al. (2015)
and Chebotar and Waters (2016) which proposed a similar approach for improving speech recognition.
Although the resulting punctuation prediction (comma, period, question mark) ensemble model outperforms
the student model, the student model is more suitable to deploy in real-life application than the ensemble,
due to its simplicity, reflected into smaller size and computing requirements. Furthermore, the student model
still outperforms individual models for all the investigated punctuation marks.
Klejch et al. (2017) continue their previous work (Klejch et al., 2016) on sequence-to-sequence models for
punctuation prediction by integrating prosody-based features. While most of the previously described models
predict a label corresponding to a middle word in a context window, a sequence-to-sequence model uses an
encoder-decoder architecture to predict sequences of labels, corresponding to the punctuation marks. These
models are usually specific to machine translation applications but in this case the punctuation restoration
problem can be considered as a translation from words to punctuation marks. The acoustic model uses a
hierarchical encoder. Thus, it transforms frame level acoustic representations into word level acoustic
embeddings. The system is trained to predict periods, commas, exclamation marks, question marks and in
some experiments also three dots. The inclusion of the acoustic features using this approach leads to an
increase in F1 of around 6% for reference text and 3% for ASR output.
![Page 14: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/14.jpg)
14
Continuing the investigation of encoder-decoder mechanisms for sentence prediction, Yi and Tao (2019)
incorporate a self-attention mechanism (Vaswani et al., 2017). Furthermore, in accordance with transfer
learning approaches, the authors use pre-trained Word2Vec (Mikolov et al., 2013), GloVe and Speech2Vec
(Yu-An and Glass, 2013) representations. The Speech2Vec representation aims to learn a fixed length
embedding of an audio segment that captures semantic information of the spoken word directly from audio.
The resulting model outperforms previous individual and ensemble-based models (Yi et al, 2017) for
restoring periods, commas and question marks.
Zelasko et al. (2018) compare two neural network models based on Bidirectional Long Short-Term Memory
(BiLSTM) and Convolutional Neural Networks (CNN) for punctuation prediction (comma, period, question
mark). Textual and prosody features were used together in the same model. The corpus used for training and
testing consists of telephone conversations, which lead the authors to use a feature for the side pronouncing
the words. Additionally, words were encoded using GloVe embeddings and the following audio features
were computed: difference between the start of the current word and start of the previous word, and duration
of the current word. Thus, contrary to previous research, Zelasko et al. (2018) do not use an explicit pause
duration feature. Instead this is encoded in the feature representing the difference between starting time of
adjacent words. The evaluation performed concludes that CNN based models yield a better precision while
BiLSTM models have better recall. Punctuation predicted by the CNN is more accurate - especially in the
case of question marks. Additionally, the authors mention experiments in which they tried to incorporate part
of speech tags which didn’t lead to any performance improvement.
Szaszak and Tundic (2019) propose combining three neural network models for restoring punctuation
(comma, period, question mark, exclamation mark). This is a continuation of their work (Tundik and Szaszak,
2018) where only two models, without prosody, were used. The first model (W) is word-based, using GloVe
word embeddings as inputs for a BiLSTM based neural network layer with a final SoftMax layer. The second
model (C) is character based (words are represented as a fixed-length sequence of one-hot encoded
characters) and is formed of four neural network layers: a character embedding dense layer, a 1D
convolutional layer, a BiLSTM layer and a final SoftMax layer. The third model (P) uses prosody features
(F0, energy, word duration, pause) as inputs to a BiLSTM layer followed by a SoftMax layer. In order to
combine the three models, the final SoftMax layer of each model is removed and the final BILSTM layers
are connected into a common LSTM layer which is finally followed by a new SoftMax layer. The authors
performed tests using each individual model, combinations of two models and all three of them. For English
language, the best results on reference texts are achieved by the combination of all three models (C+W+P),
while the best results on ASR output are achieved by the combination W+P. Interestingly, for Hungarian the
best results on reference texts are still produced by C+W+P, while on ASR output better results are achieved
by C+P. As the authors note, this is due to the agglutination characteristic of the Hungarian language.
Nanchen and Garner (2019) use a Gradient Boosting Machine (GBM) (Friedman, 2000) to improve an
ensemble of 4 neural network based algorithms for sentence boundary detection. The first two algorithms
make use of textual features, while the next employ pause duration and additional prosody data, obtained
using the Kaldi toolkit (Povey et al., 2011) and including probability of voice estimate, pitch and delta-pitch
features. By looking at the relative importance metric learned automatically by the GBM algorithm, the
authors confirm earlier findings that pause duration is a very important feature for sentence segmentation
when audio data is available.
While investigating the performance of punctuation restoration systems when faced with homonymy errors
in ASR output, Augustyniak et al. (2020) propose a method for retrofitting pre-trained word embedding
representations. This is based on the work of Dingwall and Potts (2018) who suggested updating general pre-
trained word embeddings with domain specific information. A comparison between a punctuation restoration
model using retrofitted word embeddings fed into a convolutional neural network model against the same
model using general word representations shows an improvement of 6% to 9% in terms of F1 score for
different punctuation marks.
![Page 15: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/15.jpg)
15
4 Evaluation Evaluating a capitalization and punctuation restoration system usually involves obtaining a correctly written
corpus, which is then pre-processed to remove capitalization and punctuation, then run through the evaluated
system and finally the results are compared with the original text. In some cases, when the text corpus is
comprised of written transcripts of audio recordings, the text run through the restoration system is the output
of an ASR system. From this approach, common evaluation metrics can be computed, such as precision (P),
recall (R), F1 measure, Slot Error Rate (SER). However, especially with regard to ASR text, in works such
as (Beeferman et al., 1998) is considered that user satisfaction is the most valuable qualitative metric of
success for a speech recognition system. Therefore, for the evaluation process, instead of comparing with the
original transcript used for performing the recordings, a number of human judges, with different backgrounds,
can assess the resulting output.
In an attempt to standardize the evaluation of ASR systems, NIST released a series of rich transcription
evaluation data sets6. These, or similar attempts at standardizing systems evaluations, could be used also for
evaluating capitalization and restoration systems either using ASR output or reference text. However, most
systems covered in this survey tend to use dedicated corpora. Thus, it is difficult to compare the performances
of different systems in terms of absolute values.
Earlier works, such as (Shriberg et al., 1998) and (Huang and Zweig, 2002) run their evaluation on the
Switchboard corpus, available in the Linguistic Data Consortium catalog (LDC97S62). Another set of
corpora, popular with earlier works (Hakkani-Tur et al, 1999), (Christensen et al, 2001), (Huang and Zweig,
2002), are Hub4 (LDC98S71) and Hub5 (LDC2002S10).
Recent works are usually relying on datasets from the International Workshop on Spoken Language
Translation (IWSLT). These are machine translation datasets that are focused on the automatic transcription
and translation of TED and TEDx talks (public speeches covering many different topics)7. One dataset that
was used in a number of recent works is from the IWSLT 2011 evaluation campaign (Federico et al., 2011).
It is comprised of monolingual and parallel corpora of TED talks that are copyright of the TED Conference
LLC and distributed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license8. The
first paper to use this dataset for punctuation prediction evaluation was Peitz et al. (2011). However, as it
also happens with later papers, only a fraction of the corpus was used for the punctuation restoration task. In
this case, only text documents belonging to the English-to-French translation track.
Peitz et al. (2011) did an analysis of the distribution of punctuation marks in the training and test parts of the
used corpus. Their analysis indicates that most of the marks are commas (approximately 50%), followed by
periods (39%). Question marks account for roughly 2%, while all the other analyzed punctuation marks
(quotes, apostrophe, parentheses and semicolon) represent 9% from the total marks. We can assume that
sampling from different parts of the IWSLT 2011 entire corpus will yield similar distributions, thus enabling
us to compare various works based on this dataset. Table 1 presents the results reported by different papers
using the IWSLT 2011 dataset for evaluation, based on reference text (not ASR output).
Table 1. Overall evaluation of different works using the IWSLT 2011 dataset
Paper Method Precision Recall F1
Peitz et al, 2011 Statistical machine translation 80.7 64.2 71.5
Ueffing et al, 2013 CRF 49.8 58.0 53.5
Tilk and Alumae, 2016 Bidirectional RNN using GRU 70.0 59.7 64.4
Tundik and Szaszak, 2018 Character + Word based BiLSTM - - 62.9
6 https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation
7 http://www.ted.com
8 http://creativecommons.org/licenses/by-nc-nd/3.0/
![Page 16: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/16.jpg)
16
Yi and Tao, 2019 Encoder-decoder with self-attention 76.7 69.6 72.9
Yi et al., 2020 Adversarial learning 80.9 75.0 77.8
In order to better understand the different works included in Table 1, a more detailed evaluation, for each of
the punctuation marks common to all papers (period, comma, question mark) is presented in Table 2.
Table 2. Evaluation on individual punctuation marks (period, comma, question
mark) using IWSLT 2011 dataset
Paper Period Comma Question mark
P R F1 P R F1 P R F1
Peitz et al, 2011 89.0 87.5 88.2 80.6 59.3 68.3 63.4 17.6 27.5
Ueffing et al, 2013 54.0 72.0 62.0 45.0 47.0 46.0 53.0 33.0 41.0
Tilk and Alumae, 2016 73.3 72.5 72.9 65.5 47.1 54.8 70.7 63.0 66.7
Tundik and Szaszak, 2018 68.4 72.6 70.4 62.4 53.7 57.7 64.1 54.3 58.8
Yi and Tao, 2019 82.5 77.4 79.9 67.4 61.1 64.1 80.1 70.2 74.8
Yi et al., 2020 87.3 81.1 84.1 76.2 71.2 73.6 79.1 72.7 75.8
By looking at the results reported in Tables 1 and 2, it can be seen that the current best overall approach is
represented by a neural network implementation using adversarial learning, as described in (Yi et al., 2020).
Furthermore, on earlier system (Peitz et al., 2011) even though had a good overall score, it had serious
troubles in predicting question marks (an F1 score of 27.5%). Newer systems seem to be able to handle better
the reduced amount of question marks present in the training set.
Another observation drawn from the two tables is that comma is the most difficult punctuation mark to
predict. Even though, as described earlier, commas account for nearly 50% of the total punctuation marks
present in the corpus, neural network approaches seem to have difficulties in properly predicting commas. In
neural network based approaches, the F1 score for commas is the lowest between the 3 considered marks.
Interestingly, earlier approaches based on statistics (Peitz et al., 2011) and CRF (Ueffing et al, 2013) are
worst at predicting question marks, while commas are on the second place. This is in a way expected since
question marks represent the least used signs in the training set. However, this also indicates that neural
networks are not currently able to grasp the fine details related to comma usage. As mentioned in the
introduction, commas may actually belong to multiple categories since they may be able to perform different
functions. Unfortunately, there is no analysis of the roles performed by commas in the IWSLT 2011 dataset.
Such an analysis may explain better the behavior of neural network approaches.
Even though evaluation metrics such as precision, recall and F1 make sense for the punctuation restoration
task, some authors choose to report their results using the slot error rate (SER) metric. In this case we use the
formula (6) from Makhoul et al. (1999) to convert the reported SER value into F1.
1 − 𝐹 ≅ 0.7 𝑆𝐸𝑅 (6)
Punctuation restoration for ASR output text is more challenging. Usually ASR systems introduce errors when
compared to reference text. The IWSLT 2011 dataset can be used also for evaluating punctuation restoration
on ASR output. Table 3 presents a comparison of recent papers overall F1 score for both reference and ASR
output.
Table 3. Comparison in terms of overall F1 score between punctuation restoration
on reference text and ASR output
Paper F1 reference F1 ASR
Tilk and Alumae, 2016 64.4 61.4
Tundik and Szaszak, 2018 62.9 52.2
Yi and Tao, 2019 72.9 68.8
![Page 17: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/17.jpg)
17
Yi et al., 2020 77.8 73.3
From table 3, it results that the system of Yi et al (2020) is the best system also for ASR output, based on the
IWSLT 2011 dataset. Furthermore, it can be noticed a roughly 4% decrease in F1 performance for ASR
output. A more in-depth analysis of system’s performance for ASR output is presented in table 4.
Table 4. Evaluation on individual punctuation marks (period, comma, question
mark) on ASR output, using the IWSLT 2011 dataset
Paper Period Comma Question mark
P R F1 P R F1 P R F1
Tilk and Alumae, 2016 70.7 72.0 71.4 59.6 42.9 49.9 60.7 48.6 54.0
Tundik and Szaszak, 2018 63.5 67.1 65.3 50.1 48.2 49.1 60.0 42.9 50.0
Yi and Tao, 2019 75.5 75.8 75.6 64.0 59.6 61.7 72.6 65.9 69.1
Yi et al., 2020 80.0 79.1 79.5 72.4 69.3 70.8 68.4 66.0 67.2
When comparing table 4 with table 2, the same drop in performance can be seen for each individual
punctuation mark. One slight difference can be observed with regard to comma vs question mark
performance. If in table 2 the comma was the hardest punctuation mark to predict for all systems, on ASR
output in table 4, the system of Yi et al (2020) seems to be actually getting better at predicting commas and
having more difficulties at predicting question marks.
In both reference text and ASR output, the performance associated with restoring periods is always the best
by a large margin of more than 10% in terms of F1 score compared to commas. In Figure 1 is presented the
evolution in terms of F1 score difference between periods and commas on reference text. Adversarial learning
employed by Yi et al (2020) is improving the performance of comma prediction and reduces the F1 difference
to period prediction from 20% (in the case of Peitz et al (2011)) to 10.5%.
Figure 1: Difference in restoration performance in terms of F1 score between periods and commas
Further considering the problem of comma prediction, it seems that all the systems have higher precision and
lower recall values (for both reference and ASR output). This indicates a higher number of false negatives
(lack of comma prediction). A similar phenomenon takes place in the case of question mark prediction.
Similar to punctuation restoration, works on capitalization made use of different datasets, which makes a
proper comparison of earlier systems to more recent ones to be quite difficult to realize. Earlier datasets used
for capitalization evaluation include NIST 1998 Hub-4 (Pallett et al, 1998), WSJ PTB3 (Marcus et al, 1999),
CSR HUB4 (MacIntyre, 1998). More recent works, such as Susanto et al (2016) and Ramena et al (2020)
make use of a Wikipedia based corpus (Coster and Kauchak, 2011). Furthermore, the authors of Ramena et
![Page 18: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/18.jpg)
18
al (2020) also evaluated on this Wikipedia corpus two earlier systems, that of Lita et al (2003) and Stanford
CoreNLP (Manning et al, 2014). The F1 scores are presented in Table 5.
Table 5. Comparison in terms of F1 score between capitalization restoration systems
System F1
Lita et al, 2003 86.8
Stanford CoreNLP (Manning et al, 2014) 90.9
Susanto et al, 2016 93.2
Ramena et al, 2020 94.0
The two best performing systems from Table 5, (Susanto et al, 2016) and (Ramena et al, 2020) make use of
character-level recurrent neural network architectures. In the first paper is employed a model based on 3
LSTM layers, while in the second paper a more complex network is built using a convolutional layer, two
BiLSTM layers and a final CRF layer.The added neural network complexity only accounts for 0.8% increase
in the F1 score. The reported precision (94.9) and recall (93.1) are quite similar and there is no clear indication
of where the network fails to correctly identify the correct case.
Apart from general narrative datasets, specialized datasets were used. Thus, Sanchez (2019) used the dataset
initially introduced by Savelka (2017) to implement and assess boundary detection systems for legal text.
The author reports an F1 score of 89 using a neural network approach.
Sunkara et al (2020) make use of two medical domain datasets: a dictation corpus consisting of 3.7M words
and a conversational corpus containing 51M words. Their transformer-based model with BERT embeddings
manage to obtain an F1 score of 79 for period and 69 for comma on the conversational corpus while on the
dictation corpus the scores are 85 for period and 72 for comma. A similar difference exists also for the
capitalization task, where the model achieves an F1 of 83 on the conversational corpus and 87 on the dictation
corpus for identifying words having their first letter uppercase. When comparing these results with those
presented in the tables above, it becomes clear that domain specific data poses significant challenges for both
punctuation restoration and truecasing architectures.
5 Conclusions This survey presented the various problems and techniques associated with the tasks of restoring proper
punctuation and capitalization. These are important first steps when annotating text from sources missing or
having incomplete punctuation or capitalization, such as automatic speech recognition systems, certain
micro-blogging activities, short text messages, some raw OCR output. We began by describing the tasks and
the motivation behind them, then described methods using only lexical features, followed by methods
incorporating audio features. Afterwards, we presented evaluation considerations with regard to datasets used
in scientific papers as well as a comparison of current systems on common datasets.
Already in 1997, Stolcke and Shriberg recognized that if “part-of-speech information is to be used for
segmentation, an automatic tagging step is required. This presents somewhat of a chicken-and-egg problem,
in that taggers typically rely on segmentations.” A possible solution to this problem would be to train both
segmentation and POS tagging as a single process. A step in this direction was taken by Yi et al. (2020) who
used adversarial learning for training a neural network to predict both punctuation and POS tagging, using
BERT as the base language model.
Long range dependencies are often required for punctuation restoration (Lu and Ng, 2010). This is a clear
disadvantage with models concerned only with local context, such as n-gram models or even neural networks
using small context windows. In this case recurrent neural network models (such as those based on BiLSTM
cells) seem to have a clear advantage, being able to make use of longer contexts.
With the evolution of classification algorithms and language modelling methods, better models were created
for punctuation restoration. As text-only models evolved so did mixed models, combining audio information
with word data. However, one thing remained constant: mixed models are performing better than individual
![Page 19: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/19.jpg)
19
ones (text-only or prosody-only). This is confirmed in numerous papers cited throughout this survey where
such individual models were compared with mixed models.
Current state-of-the-art models for both capitalization and punctuation restoration make use of neural
network architectures. Considering the papers of Yi et al (2020), for punctuation restoration, and Ramena et
al (2020), for truecasing, both make use of BiLSTM layers with a final CRF layer. Huang et al (2015)
investigated the usage of an architecture based on BiLSTM+CRF for different natural language processing
tasks. Their findings suggest that this architecture is able to achieve state-of-the-art results on part of speech
tagging, chunking and named entity recognition. Furthermore, the models are less dependent on word
embeddings compared to other neural network architectures.
A special case for punctuation restoration is represented by the task of detecting sentence boundaries. This
seems a somewhat simpler situation in which the system does not have to predict the exact punctuation at the
end of sentence but instead it has to recognized that a sentence ends and another one begins. Even though in
some domains this is considered a solved problem, there are still domains like legal text which present unique
challenges (Sanchez, 2019). This observation can then be extended to the more difficult tasks of complete
punctuation restoration and capitalization. In the case of legal texts, also capitalization is sometimes a
problem for regular punctuation restoration systems, since certain named entities (such as family names or
certain institutions) are written with all uppercase letters (Sanchez, 2019; Savelka et al., 2017). This creates
new research challenges for creating restoration systems for legal texts and also to identify other domains
where general purpose systems may not offer sufficiently good results. In this context, the work of Sunkara
et al (2020) investigates medical domain texts and argues for the necessity of domain adaptation techniques
that would allow improvement of punctuation restoration and truecasing systems.
ASR systems often produce recognition errors. In most of the papers presented in section 3 of this survey,
the impact of ASR errors can be seen on punctuation restoration systems in the form of algorithm
performance differences between execution against ASR output and reference text. As indicated by
Augustyniak et al. (2020) while investigating homonymic errors one possibility to improve punctuation
restoration systems is in the form of better word embeddings representations. However, since these are not
the only errors occurring in ASR systems, future research should focus also on different types of errors.Even
though the English language has 14 punctuation marks (period, question mark, exclamation, comma,
semicolon, colon, dash, hyphen, brackets, braces, parentheses, apostrophe, quotation marks, ellipsis) most of
the works on punctuation restoration are limited to only three of them (period, question mark and comma),
with a few exceptions such as Juin et al (2017) working with 8 punctuation marks, Peitz et al (2011) working
with 3 marks but also considering a special “others” class of punctuation, Gravano et al (2009) considering
the dash, Salloum et al (2017) considering the colon, Klejch et al (2017) considering exclamation mark and
ellipsis. This seems to indicate that more research is required for proper restoration of other punctuation
marks apart from period, question mark and comma.
REFERENCES
Akakpo Agbago, Roland Kuhn, George Foster. 2005. Truecasing for the Portage system. In Recent Advances
in Natural Language Processing (RANLP), Borovets, Bulgaria.
Douglas E. Appelt, Jerry R. Hobbs, John Bear, David Israel, Megumi Kameyama, Andy Kehler, David
Martin, Karen Myers, Mabry Tyson. 1995. SRI International FASTUS system MUC-6 Test Results and
Analysis. In Proceedings of the 6th Message Understanding Conference.
Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski,
Yishay Carmiel, Najim Dehak. Punctuation Prediction in Spontaneous Conversations: Can We Mitigate
ASR Errors with Retrofitted Word Embeddings? arXiv:2004.05985 [cs.CL].
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, Michael Auli. 2019. Cloze-driven
Pretraining of Self-attention Networks. arXiv:1903.07785.
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. 2015. Neural machine translation by jointly learning
to align and translate. arXiv:1409.0473.
![Page 20: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/20.jpg)
20
Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, Li Wang. 2013. How noisy social media text,
how diffrnt social media sources? In Proceedings of the 6th International Joint Conference on Natural
Language Processing, 356-364.
Fernando Batista, Isabel Trancoso, Nuno Mamede. 2009. Automatic Recovery of Punctuation Marks and
Capitalization Information for Iberian Languages. In Proceedings of the Joint SIG-IL/Microsoft
Workshop on Speech and Language Technologies for Iberian Languages, Porto Salvo, Portugal, 99-102.
Doug Beeferman, Adam Berger, John Lafferty. 1998. Cyberpunc: a lightweight punctuation annotation
system for speech. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and
Signal Processing, ICASSP '98 (Cat. No.98CH36181), Seattle, WA, USA, 689-692 vol.2, doi:
10.1109/ICASSP.1998.675358.
Thorsten Brants, Alex Franz. 2006. “Web 1T 5-gram corpus version 1.1.”. Technical report, Google
Research.
Thorsten Brants, Alex Franz. 2009. Web 1T 5-gram, 10 European Languages Version 1. LDC2009T25,
Linguistic Data Consortium.
Leo Breiman, Joseph H. Friedman, R. A. Olshen, C. J. Stone. 1983. Classification and Regression Trees.
Wadsworth and Brooks, Pacific Grove, CA.
Eric Brill. 1993. A Corpus-Based Approach to Language Learning. Ph.D. thesis, University of Pennsylvania.
Denny Britz, Anna Goldie, Minh-Tang Luong, Quoc Le. 2017. Massive Exploration of Neural Machine
Translation Architectures, CoRR, abs/1703.03906.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei.
2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
Yunqi Cai, Dong Wang. 2019. Question mark prediction by BERT. In Proceedings of Asia-Pacific Signal
and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou,
China, 363-367, doi: 10.1109/APSIPAASC47483.2019.9023090.
Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
William Chan, Nan R. Ke, Ian Lane. 2015. Transferring knowledge from a RNN to a DNN. In Proceedings
of the Annual Conference of the International Speech Communication Association (INTERSPEECH),
3264-3268.
Xiaoyin Che, Sheng Luo, Haojin Yang, Christoph Meinel. 2016. Sentence boundary detection based on
parallel lexical and acoustic models. In Proceedings of the Annual Conference of the International
Speech Communication Association (INTERSPEECH), 2528–2532.
Yevgen Chebotar, Austin Waters. 2016. Distilling knowledge from ensembles of neural networks for speech
recognition. In Proceedings of the Annual Conference of the International Speech Communication
Association (INTERSPEECH), 3439-3443.
Ciprian Chelba, Alex Acero. 2004. Adaptation of maximum entropy capitalizer: Little data can help a lot. In
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP),
Barcelona, Spain, 285-292.
C. Julian Chen. 1999. Speech recognition with automatic punctuation. In Proceedings of Eurospeech ’99,
Budapest, Hungary, 447-450.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for
statistical machine translation. In Proceedings of EMNLP, 1724–1734.
Heidi Christensen, Yoshihiko Gotoh, Steve Renals. 2001. Punctuation annotation using statistical prosody
models. In Proceedings of ISCA Workshop on Prosody in Speech Recognition and Understanding.
Yu-An Chung, James Glass. 2018. Speech2vec: A sequence-to-sequence framework for learning word
embeddings from speech. In Proceedings of the Annual Conference of the International Speech
Communication Association (INTERSPEECH), 811–815.
David Coniam. 2008. Evaluating the language resources of chatbots for their potential in English as a second
language. ReCALL: the Journal of EUROCALL, 20(1), 98.
David Coniam. 2014. The linguistic accuracy of chatbots: usability from an ESL perspective, Text & Talk,
34(5), 545-567.
![Page 21: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/21.jpg)
21
William Coster, David Kauchak. 2011. Simple English Wikipedia: a new text simplification task. In
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies: Short Papers-Volume 2, Association for Computational Linguistics, 665–669.
Maury Courtland, Adam Faulkner, Gayle McElvain. 2020. Efficient Automatic Punctuation Restoration
Using Bidirectional Transformers with Robust Inference. In Proceedings of the 17th International
Conference on Spoken Language Translation (IWSLT), 272–279.
Li Deng, John C. Platt. 2014. Ensemble deep learning for speech recognition. In Proceedings of the Annual
Conference of the International Speech Communication Association (INTERSPEECH), 1915–1919.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL].
Nicholas Dingwall, Christopher Potts. 2018. Mittens: an extension of GloVe for learning domain-specialized
representations. arXiv:1803.09901.
John Duchi, Elad Hazan, Yoram Singer. 2011. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, vol. 12, 2121–2159.
Yo Ehara, Issei Sato, Hidekazu Oiwa, Hiroshi Nakagawa. 2013. Understanding seed selection in
bootstrapping. In Proceedings of TextGraphs-8 Graph-based Methods for Natural Language
Processing, 44-52.
Hala Elsayed, Tarek Elghazaly. 2015. A Named Entities Recognition System for Modern Standard Arabic
using Rule-Based Approach. In Proceedings of the First International Conference on Arabic
Computational Linguistics (ACLing), 51-54, doi: 10.1109/ACLing.2015.14.
Marcello Federico, Luisa Bentivogli, Michael Paul, Sebastian Stueker. 2011. Overview of the iwslt 2011
evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation
(IWSLT), San Francisco, CA, USA, 11-27.
Jenny R. Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into
Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43nd Annual Meeting of the
Association for Computational Linguistics (ACL 2005), 363-370.
Yoav Freund, Robert Schapire. 1995. A decision-theoretic generalization of on-line learning and an
application to boosting. Computational learning theory, Springer, 23–37.
Jerome H. Friedman. 2000. Greedy function approximation: A gradient boosting machine. In Annals of
Statistics, vol. 29, 1189–1232.
William Gale, Kenneth Church, David Yarowsky. 1993. A Method for Disambiguating Word Senses in a
Large Corpus. Computers and the Humanities, 26, 415-439.
William Gale, Kenneth W. Church, David Yarowsky. 1994. Discrimination decisions for 100,000-
dimensional spaces. Current Issues in Computational Linguistics, 429–450.
William Gale, Sarangarajan Parthasarathy. 2017. Experiments in Character-Level Neural Network Models
for Punctuation. In Proceedings of the Annual Conference of the International Speech Communication
Association (INTERSPEECH), 2794-2798.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron
Courville, Yoshua Bengio. 2014. Generative adversarial networks. In Advances in Neural Information
Processing Systems, 3:2672–2680.
Agustin Gravano, Martin Jansche, Michiel Bacchiani. 2009. Restoring punctuation and capitalization in
transcribed speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 4741-4744.
Narendra K. Gupta, Srinivas Bangalore. 2002. Extracting clauses for spoken language understanding in
conversational systems. In Proceedings of the Conference on Empirical methods in natural language
processing, ACL, Volume 10, 273–280.
Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, H Sebastian Seung. 2000.
Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):
947.
Xiaochuang Han, Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings
for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), 4229–4239.
Zellig S. Harris. 1954. Distributional Structure. WORD, 10:2-3, 146-162, doi:
10.1080/00437956.1954.11659520.
![Page 22: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/22.jpg)
22
Madina Hasan, Rama Doddipatla, Thomas Hain. 2014. Multi-pass sentence-end detection of lecture speech.
In Proceedings of the Annual Conference of the International Speech Communication Association,
INTERSPEECH, 2902-2906.
Madina Hasan, Rama Doddipatla, Thomas Hain. 2015. Noise-matched training of CRF based sentence end
detection models. In Proceedings of the 16th Annual Conference of the International Speech
Communication Association (INTERSPEECH), 349-353.
Sepp Hochreiter, Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–
1780.
Jing Huang, Geoffrey Zweig. 2002. Maximum entropy model for punctuation annotation from speech. In
Proceedings of the Annual Conference of the International Speech Communication Association
(INTERSPEECH), 917-920.
Zhiheng Huang, Wei Xu, Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging.
arXiv:1508.01991 [cs.CL].
Nancy Ide, Catherine Macleod. 2001. The American national corpus: A standardized resource for American
English. In Proceedings of Corpus Linguistics 2001, 831–836.
Kevin Jarrett, Koray Kavukcuoglu, Marc Ranzato, Yann LeCun. 2009. What is the best multi-stage
architecture for object recognition? In Proceedings of the 12th IEEE International Conference on
Computer Vision, 2146-2153, doi: 10.1109/ICCV.2009.5459469.
Bernard E. M. Jones. 1994. Exploring the role of punctuation in parsing natural text. In Proceedings of the
15th conference on Computational linguistics - Volume 1 (COLING ’94). Association for Computational
Linguistics, USA, 421–425.
Douglas A. Jones, Florian Wolf, Edward Gibson, Eliott Williams, Evelina Fedorenko, Douglas A. Reynolds,
Marc Zissman. 2003. Measuring the readability of automatic speech-to-text transcripts. In Proceedings
of the 8th European Conference on Speech Communication and Technology, (EUROSPEECH), 1585–
1588.
Chin C. Juin, Richard X. J. Wei, Luis F. D'Haro, Rafael E. Banchs. 2017. Punctuation prediction using a
bidirectional recurrent neural network with part-of-speech tagging. In Proceedings of the IEEE Region
10 Conference TENCON 2017, 1806-1811, doi: 10.1109/TENCON.2017.8228151.
Daniel Jurafsky, James Martin. 2008. Speech and Language Processing. Second Edition. Prentice Hall.
Abraham Kaplan. 1950. An Experimental Study of Ambiguity in Context. Mechanical Translation, v. 1, nos.
1-3.
Max Kaufmann, Jugal Kalita. 2010. Syntactic normalization of twitter messages. In Proceedings of the
International conference on natural language processing, Kharagpur, India.
Ji-Hwan Kim, Philip C. Woodland. 2000. A Rule-based Named Entity Recognition System for Speech Input.
In Proceedings lCSLP, S21-524.
Ji-Hwan Kim, Philip C. Woodland. 2002. Implementation of automatic capitalisation generation systems for
speech input. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal
Processing, I-857-I-860, doi: 10.1109/ICASSP.2002.5743874.
Ji-Hwan Kim, Philip C. Woodland. 2004. Automatic capitalisation generation for speech input. Computer
Speech & Language, vol. 18, no. 1, 67–90.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, Alexander M. Rush. 2017. OpenNMT: Open-
Source Toolkit for Neural Machine Translation. arXiv:1701.02810.
Ondrej Klejch, Peter Bell, Steve Renals. 2016. Punctuated transcription of multi-genre broadcasts using
acoustic and lexical approaches. In Spoken Language Technology Workshop, 433–440.
Ondrej Klejch, Peter Bell, Steve Renals. 2017. Sequence-to-sequence models for punctuated transcription
combining lexical and acoustic features. In Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 5700–5704.
Jáchym Kolář, Jan Švec, Josef Psutka. 2004. Automatic punctuation annotation in Czech broadcast news
speech. In Proceedings SPECOM ́, 319–325.
Jáchym Kolář, Lori Lamel. 2012. Development and Evaluation of Automatic Punctuation for French and
English Speech-toText. In Proceedings of the Annual Conference of the International Speech
Communication Association (INTERSPEECH), 1376-1379.
Ralf Kompe. 1996. Prosody in Speech Understanding Systems. Springer-Verlag.
John Lafferty, Andrew McCallum, Fernando Pereira. 2001. Conditional random fields: Probabilistic models
for segmentation and labeling sequence data. In Proceedings of the Eighteenth International Conference
![Page 23: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/23.jpg)
23
on Machine Learning (ICML ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–
289.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. Neural
architectures for named entity recognition. In Proceedings of the Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, 260-270.
Lucian V. Lita, Abraham Ittycheriah, Salim Roukos, Nanda Kambhatla. 2003. Truecasing. In Proceedings
of the 41st Annual Meeting on Association for Computational Linguistics, ACM, Vol. 1, 152-159.
Yang Liu, Andreas Stolcke, Elizabeth Shriberg, Mary Harper. 2005. Using conditional random fields for
sentence boundary detection in speech. In Proceedings of ACL’05, DOI: 10.3115/1219840.1219896.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Jiawei Han.
2020. On the Variance of the Adaptive Learning Rate and Beyond. ArXiv: 1908.03265.
Wei Lu, Hwee Tou Ng. 2010. Better Punctuation Prediction with Dynamic Conditional Random Fields. In
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 177-186.
Robert MacIntyre. 1998. 1996 CSR HUB4 Language Model. LDC98T31, Linguistic Data Consortium.
John Makhoul, Francis Kubala , Richard Schwartz , Ralph Weischede. 1999. Performance Measures for
Information Extraction. In Proceedings of DARPA Broadcast News Workshop, 249-252.
Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, David McClosky. 2014.
The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting
of the Association for Computational Linguistics: System Demonstrations, 55-60.
Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor. 1999. Treebank-3.
LDC99T42, Linguistic Data Consortium.
Albert H. Markwardt. 1942. Introduction to the English Language. Oxford University Press, New York.
Margaret Masterson. 1967. Mechanical Pidgin Translation. Machine Translation, Donald Booth, ed., Wiley.
Evgeny Matusov, Arne Mauser, Hermann Ney. 2006. Automatic Sentence Segmentation and Punctuation
Prediction for Spoken Language Translation. In Proceedings of IWSLT, 158-165.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William
Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon
Orwant, Steven Pinker, Martin A. Nowak, Erez Lieberman Aiden. 2010. Quantitative Analysis of
Culture Using Millions of Digitized Books. Science (Published online ahead of print: 12/16/2010).
Andrei Mikheev. 1999. A Knowledge-free Method for Capitalized Word Disambiguation. In Proceedings of
the Annual Meeting of the Association for Computational Linguistics, 159–166.
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. Efficient estimation of word representations
in vector space. arXiv:1301.3781 [cs.CL].
David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone, Ralph Weischedel. 2000. Named Entity
Extraction from Noisy Input: Speech and OCR. In Proceedings of the Sixth Conference on Applied
Natural Language Processing, 316-324.
Alexandre Nanchen, Philip N. Garner. 2019. Empirical evaluation and combination of punctuation prediction
models applied to broadcast news. In Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 7275–7279.
Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys, 41, 2, Article 10.
doi:10.1145/1459352.1459355.
Kamel Nebhi, Kalina Bontcheva, Genevieve Gorrell. 2015. Restoring capitalization in #tweets. In
Proceedings of WWW Companion, 1111–1115.Thomas Niesler, Philip Woodland. 1996. A variable-
length category-based ngram language model. In Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing ICASSP-96, vol. 1, 164 –167.
Binh Nguyen, Vu Bao Hung Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong
Do, and Luong Chi Mai. 2019. Fast and accurate capitalization and punctuation for automatic speech
recognition using transformer and chunk merging. arXiv:1908.02404.
Elmar Nöth, Anton Batliner, Andreas Kießling, Ralf Kompe, Heinrich Niemann. 1999. Suprasegmental
Modelling. In: Ponting K. (eds) Computational Models of Speech Pattern Processing, NATO ASI Series
(Series F: Computer and Systems Sciences), Springer, vol 169, 181-198.
Geoffrey Nunberg. 1990. The Linguistics of Punctuation. CSLI Lecture Notes 18, Stanford, CA.
David Pallett, Jonathan Fiscus, John Garofolo, A. Martin, Mark Przybocki. 2000. 1998 Broadcast News
Benchmark Test Results: English and Non-English Word Error Rate Performance Measures. In DARPA
Broadcast News Transcription and Understanding Workshop.
Michael Paul. 2009. Overview of the IWSLT 2009 evaluation campaign. In Proceedings of IWSLT’09, 1-18.
![Page 24: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/24.jpg)
24
Adam Pauls, Dan Klein. 2011. Faster and smaller N-gram language models. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies -
Volume 1 (HLT ’11). Association for Computational Linguistics, USA, 258–267.
Stephan Peitz, Markus Freitag, Arne Mauser, Hermann Ney. 2011. Modeling Punctuation Prediction as
Machine Translation. In Proceedings of the International Workshop on Spoken Language Translation
(IWSLT), 238-245.
Jeffrey Pennington, Richard Socher, Christopher D. Manning. 2014. GloVe: Global Vectors for Word
Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), ACL, 1532-1543.
Georgios Petasis, Frantz Vichot, Francis Wolinski, Georgios Paliouras, Vangelis Karkaletsis, Constantine D.
Spyropoulos. 2001. Using machine learning to maintain rule-based named-entity recognition and
classification systems. In Proceedings of the 39th Annual Meeting on Association for Computational
Linguistics. Association for Computational Linguistics, 426–433.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke
Zettlemoyer. 2018. Deep contextualized word representations, In Proceedings of NAACL 2018.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko
Hannemann, Petr Motlıcek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely.
2011. The Kaldi speech recognition toolkit. In Proceedings of the IEEE Workshop on Automatic Speech
Recognition and Understanding (ASRU), IEEE Signal Processing Society, 1-4.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. 2019. Language
Models are Unsupervised Multitask Learners.
Gopi Ramena, Divija Nagaraju, Sukumar Moharana, Debi P. Mohanty, Naresh Purre. 2020. An Efficient
Architecture for Predicting the Case of Characters using Sequence Models. ArXiv abs/2002.00738.
Adwait Ratnaparkhi. 1996. A Maximum Entropy Model for Part-of-Speech Tagging. In Eric Brill and
Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language
Processing, ACM, Somerset, New Jersey, 133–142.
Steven J. Rayson, Dean J. Hachamovitch, Andrew L. Kwatinetz, Stephen M. Hirsch. 1998. Autocorrecting
Text Typed into a Word Processing Document. U.S. patent 5761689.
Veronica Romero, Joan A. Sánchez. 2013. Category-Based Language Models for Handwriting Recognition
of Marriage License Books. In Proceedings of the 12th International Conference on Document Analysis
and Recognition, Washington, DC, 788-792, doi: 10.1109/ICDAR.2013.161.
Fatiha Sadat, Howard Johnson, Akakpo Agbago, George Foster, Roland Kuhn, Joel Martin, Aaron Tikuisis.
2005. PORTAGE: A Phrase-Based Machine Translation System. In Proceedings of the ACL Workshop
on Building and Using Parallel Texts, 129–132.
Wael Salloum, Greg Finley, Erik Edwards, Mark Miller, David Suendermann-Oeft. 2017. Deep Learning for
Punctuation Restoration in Medical Reports. In Proceedings of BioNLP 2017, 159-164.
George Sanchez. 2019. Sentence boundary detection in legal text. In Proceedings of the Natural Legal
Language Processing Workshop, Association for Computational Linguistics, 31-38.
Jaromir Savelka, Vern R. Walker, Matthias Grabmair, Kevin D. Ashley. 2017. Sentence boundary detection
in adjudicatory decisions in the United States. Traitement Automatique des langues, 58(2), 21-45.
Robert E. Schapire, Yoram Singer. 1999. Improved boosting algorithms using confidence-rated predictions.
Machine Learning, vol. 37, no. 3, 297–336.
Robert E. Schapire, Yoram Singer. 2000. BoosTexter: A Boosting-based System for Text Categorization.
Machine Learning, vol. 39, 135–168.
Ernst Gunter Schukat-Talamazzini. 1995. Stochastic Language Models. In Electrotechnical and Computer
Science Conference, Portoroz, Slovenia.
Ernst G. Schukat-Talamazzini, Florian Gallwitz, Stefan Harbeck, Volker Warnke. 1997. Rational
Interpolation of Maximum Likelihood Predictors in Stochastic Language Modeling. In Proceedings of
the Fifth European Conference on Speech Communication and Technology (EUROSPEECH), Rhodes,
Greece, 5, S. 2731-2734.
Frank Seide, Gang Li, Xie Chen, Dong Yu. 2011. Feature engineering in context-dependent deep neural
networks for conversational speech transcription. In Proceedings of IEEE Workshop on Automatic
Speech Recognition & Understanding (ASRU), 24–29.
Vesa Siivola, Bryan L. Pellom. 2005. Growing an n-gram language model. In Proceedings of
INTERSPEECH-2005, 1309-1312.
Bayan Abu Shawar, Eric Atwell. 2007. Chatbots: are they really useful? LDV Forum, vol. 22, no. 1, 29-49.
![Page 25: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/25.jpg)
25
Elizabeth Shriberg, Rebecca Bates, Andreas Stolcke. 1997. A prosody-only decision-tree model for
disfluency detection. In Proceedings of the European Conference on Speech Communication and
Technology, (EUROSPEECH), vol. 5, 2383–2386.
Hye-Jeong Song, Hong-Ki Kim, Jong-Dae Kim, Chan-Young Park, Yu-Seop Kim. 2019. Inter-Sentence
Segmentation of YouTube Subtitles Using Long-Short Term Memory (LSTM). Applied Sciences, 9(7),
1504, doi:10.3390/app9071504.
Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky. 2011. Punctuation: making a point in unsupervised
dependency parsing. In Proceedings of the Fifteenth Conference on Computational Natural Language
Learning (CoNLL), 19–28.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. 2014. Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research,
15(1):1929–1958.
Andreas Stolcke, Elizabeth Shriberg. 1996. Statistical language modeling for speech disfluencies. In
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), vol. 1, 405–408.
Andreas Stolcke, Elizabeth Shriberg. 1997. Automatic linguistic segmentation of conversational speech. In
Proceedings of EuroSpeech 97, 2:1005-1008.
Andreas Stolcke, Elizabeth Shriberg, Rebecca Bates, Mari Ostendorf, Dilek Hakkani, Madelaine Plauche,
Gokhan Tur, Yu Lu. 1998. Automatic detection of sentence boundaries and disfluencies based on
recognized words. In Proceedings of the International Conference on Spoken Language Processing,
Sidney, Australia, 2247-2250.
Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, Katrin Kirchhoff. 2020. Robust prediction
of punctuation and Truecasing for medical ASR. In Proceedings of the First Workshop on Natural
Language Processing for Medical Conversations, Association for Computational Linguistics, 53-62.
Raymond H. Susanto, Hai L. Chieu, Wei Lu. 2016. Learning to capitalize with character-level recurrent
neural networks: an empirical study. In Proceedings of the 2016 Conference on Empirical Methods in
Natural Language Processing, 2090–2095.Branimir T. Todorovic, Svetozar R. Rancic, Ivica M.
Markovic, Edin H. Mulalic, Velimir M. Ilic. 2008. Named entity recognition and classification using
context Hidden Markov Model. In Proceedings of the 9th Symposium on Neural Network Applications
in Electrical Engineering, Belgrade, 43-46, doi: 10.1109/NEUREL.2008.4685557.
Ilya Sutskever, Oriol Vinyals, Quoc Le. 2014. Sequence to sequence learning with neural networks. In
Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS),
Volume 2, MIT Press, Cambridge, MA, USA, 3104–3112.
Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh. 2007. Dynamic conditional random fields:
Factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine
Learning Research, 8, 693-723.
György Szaszak, Máté Tundik. 2019. Leveraging a character, word and prosody triplet for an asr error robust
and agglutination friendly punctuation approach. In Proceedings of the Annual Conference of the
International Speech Communication Association (INTERSPEECH), 2988–2992.
Ottokar Tilk, Tanel Alumae. 2015. LSTM for punctuation restoration in speech transcripts. In Proceedings
of the Sixteenth annual conference of the international speech communication association
(INTERSPEECH), 683-687.
Ottokar Tilk, Tanel Alumae. 2016. Bidirectional Recurrent Neural Network with Attention Mechanism for
Punctuation Restoration. In Proceedings of the Annual Conference of the International Speech
Communication Association (INTERSPEECH), 3047 - 3051.
Tatsuhiko Tomita, Yoshiyuki Okimoto, Hirofumi Yamamoto, Yoshinori Sagisaka. 2005. Speech recognition
of a named entity. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP '05), Philadelphia, PA, I/1057-I/1060 Vol. 1, doi: 10.1109/ICASSP.2005.1415299.
Máté Tundik, György Szaszak. 2018. Joint Word-and Character-level Embedding CNN-RNN Models for
Punctuation Restoration. In Proceedings of the 9th IEEE International Conference on Cognitive
Infocommunications (CogInfoCom), doi:10.1109/CogInfoCom.2018.8639876.
Dilek Hakkani Tur, Gokhan Tur, Andreas Stolcke, Elizabeth Shriberg. 1999. Combining words and prosody
for information extraction from speech. In Proceedings of the European Conference on Speech
Communication and Technology, (EUROSPEECH), Budapest, Hungary.
![Page 26: Capitalization and Punctuation Restoration: a Survey](https://reader038.vdocuments.mx/reader038/viewer/2022112803/61ffb999af5ded7c33563a53/html5/thumbnails/26.jpg)
26
Nicola Ueffing, Maximilian Bisani, Paul Vozila. 2013. Improved models for automatic punctuation
prediction for spoken and written text. In Proceedings of the 13th Annual Conference of the International
Speech Communication Association (INTERSPEECH), 3097–3101.
Jacqueline Vaissiere. 1983. Language-independent prosodic features. In A. Cutler and D. R. Ladd (Eds.),
Prosody: Models and Measurements, Springer Verlag, 53–66.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
Illia Polosukhin, 2017. Attention is all you need. In Proceedings of the 31st International Conference on
Neural Information Processing Systems (NIPS), 5998-6008.
Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion
Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, Jakob Uszkoreit.
2018. Tensor2tensor for neural machine translation. CoRR, vol. abs/1803.07416.
Wei Wang, Kevin Knight, Daniel Marcu. 2006. Capitalizing machine translation. In Proceedings of the main
conference on Human Language Technology Conference of the North American Chapter of the
Association of Computational Linguistics, ACM, 1–8.
Xuancong Wang, Hwee Tou Ng, Khe Chai Sim. 2012. Dynamic Conditional Random Fields for Joint
Sentence Boundary and Punctuation Prediction. In Proceedings of the 13th Annual Conference of the
International Speech Communication Association (INTERSPEECH),1382-1385.
Tian Wang, Kyunghyun Cho. 2015. Larger-context language modelling. arXiv:1511.03729.
Volker Warnke, Ralf Kompe, Heinrich Niemann, Elmar Noth. 1997. Integrated dialog act segmentation and
classification using prosodic features and language models. In Proceedings of the Fifth European
Conference on Speech Communication and Technology (EUROSPEECH), Rhodes, Greece, 207-210.
Jie Yang, Yue Zhang. 2018. NCRF++: An Open-source Neural Sequence Labeling Toolkit. In Proceedings
of ACL 2018, System Demonstrations, 74-79.
Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Ya Li. 2017. Distilling knowledge from an ensemble of models for
punctuation prediction. In Proceedings of the Annual Conference of the International Speech
Communication Association (INTERSPEECH), 2779–2783.
Jiangyan Yi, Jianhua Tao. 2019. Self-attention based model for punctuation prediction using word and speech
embeddings. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 7270–7274.
Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Cunhang Fan. 2020. Adversarial Transfer Learning for
Punctuation Restoration. arXiv:2004.00248 [cs.CL].
Piotr Żelasko, Piotr Szymański, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, Najim Dehak. 2018.
Punctuation prediction model for conversational speech. arXiv:1807.00543.
Richard Zens, Hermann Ney. 2008. Improvements in Dynamic Programming Beam Search for Phrase-based
Statistical Machine Translation. In Proceedings of the International Workshop on Spoken Language
Translation, Honolulu, Hawaii, 195–205.
Yunxin Zhao, Jian Xue, Xin Chen. 2015. Ensemble learning approaches in speech recognition. In Speech
and Audio Processing for Coding, Enhancement and Recognition, Springer New York, 113–152.