Impact of training corpus size on the quality of different types of language models for Serbian

Stevan Ostrogonac, Milan Sečujski, Dragiša Mišković

20th Telecommunications Forum TELFOR 2012, Belgrade, Serbia, November 20-22, 2012


Abstract — This paper describes a study on the correspondence between language model quality and the size of the textual corpus used in the training process. Three types of n-gram models developed for the Serbian language were included in the study: a word-based, a lemma-based and a class-based model. They were created in order to deal with the data sparsity problem, which is particularly pronounced because of the high degree of inflection of the Serbian language. The three model types were trained on corpora of different sizes and evaluated by perplexity on authentic text and on text with random word order, in order to obtain the discrimination coefficient values. These values show different degrees of robustness of the three model types to the data sparsity problem and indicate a way of combining these models in order to achieve the best language representation for a given training corpus.

Key Words — Language model, evaluation, perplexity, discrimination coefficient

I. INTRODUCTION

For the purpose of increasing the accuracy of the large vocabulary continuous speech recognition (LVCSR) system that is being developed for the Serbian language, different types of language models have been created in order to obtain a good language representation. All of these models are based on the n-gram concept and have been trained using the SRILM toolkit [1]. Three types of models have been considered. The first type is a model trained on a textual corpus containing regular words of the Serbian language. The second type was trained on the corresponding corpus of lemmas. The third type was trained on the corpus of word classes, which have been defined mostly according to the complex morphology of the Serbian language. Serbian is a highly inflective language, and a very large textual corpus is therefore needed in order to create a good language representation [2]. Data sparsity is one of the greatest problems that need to be dealt with when it comes to creating language models for different purposes. As the sizes of the available textual corpora are usually not adequate, there is a need for language models which can be trained efficiently on small corpora.

This work was supported by the Ministry of Education, Science and Technological Development of Serbia within the Project "Development of Dialogue Systems in Serbian and other South Slavic Languages" (TR-32035).

Stevan Ostrogonac, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia (telephone: 381-63-8279550, e-mail: [email protected])

Milan Sečujski, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia (telephone: 381-64-3966422, e-mail: [email protected])

Dragiša Mišković, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21000 Novi Sad, Serbia (telephone: 381-66-6310230, e-mail: [email protected])

Reducing the vocabulary size by replacing words with lemmas can help attenuate the effect of data sparsity on the language model quality. Of course, a certain amount of information is lost in the process and, even though better estimates are made, these estimates refer to groups of surface forms sharing the same lemmas. Probability estimates for sequences of lemmas are still always better than the default values which are assigned to the word sequences that are unseen in the training corpus.
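As a sketch of this idea, the toy word-to-lemma dictionary below is invented for illustration (in practice the mapping comes from a morphological dictionary of Serbian); replacing surface forms with their lemmas collapses several vocabulary entries into one:

```python
# Sketch: converting a word corpus to a lemma corpus to shrink the vocabulary.
# The tiny lemma dictionary below is invented for illustration.
LEMMA_OF = {
    "knjige": "knjiga", "knjigu": "knjiga", "knjigom": "knjiga",
    "citam": "citati", "citas": "citati",
}

def to_lemmas(tokens):
    """Replace each surface form with its lemma (unknown words kept as-is)."""
    return [LEMMA_OF.get(t, t) for t in tokens]

corpus = ["citam", "knjigu", "citas", "knjige"]
print(len(set(corpus)))             # 4 distinct surface forms
print(len(set(to_lemmas(corpus))))  # 2 distinct lemmas
```

N-gram counts collected over the lemma corpus are then less sparse, at the cost of no longer distinguishing the surface forms within each group.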

An even greater reduction of the vocabulary can be achieved by replacing the words with classes defined according to the language morphology (or some other criteria) [3]. The number of these classes can be arbitrary, but it is meant to be significantly smaller than the number of words in a corpus. In this study, 1124 word classes were defined, which, of course, represents a great vocabulary reduction compared to the number of different words in the corpus, which exceeds 350,000. The language model trained on word classes is a much less accurate language representation than the word-based model, but it can be helpful in situations when a word-based model cannot be well trained.

It should be noted here that for words that do not appear in the training corpus it is still possible to determine the corresponding lemmas and classes, as long as these words appear in the morphological dictionary of the Serbian language [4].

The initial idea for the language model of the LVCSR system was to combine the three mentioned types of models log-linearly and assign them equal weights. This combined model introduced a certain degree of improvement over the word-based n-gram model, but left room for further improvement [5].

In order to create a better combination of the word-based, lemma-based and class-based models, their behavior with regard to the training corpus size needs to be established. This would help determine their weighting coefficients. They could also be evaluated separately by the word error rate (WER) reduction of the LVCSR system, in order to obtain information on how to construct different combination concepts. Unfortunately, this is not yet possible, as the LVCSR system for Serbian is still under development.

The following section of this paper describes in more detail the corpora and the techniques used to train the models which were evaluated in the experiment described in Section III. In Section IV, the results of the experiment are presented and discussed. Section V summarizes the work done so far and gives an indication of further research on the topic.

II. TRAINING THE LANGUAGE MODELS

As mentioned before, the models in this study were trained using the SRILM toolkit. Training the models implies the estimation of word probabilities in different contexts. The probability estimates are calculated by counting the instances of word sequences in the training corpus. The probability of a word w_n appearing in a context consisting of the word sequence w_{n-N+1} … w_{n-1} is given by equation (1).

\[
P(w_n \mid w_{n-N+1} \dots w_{n-1}) = \frac{C(w_{n-N+1} \dots w_n)}{C(w_{n-N+1} \dots w_{n-1})}
\tag{1}
\]
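Equation (1) amounts to a ratio of n-gram counts, which can be sketched as follows (a minimal illustration of the maximum-likelihood estimate, not the SRILM implementation; smoothing and back-off are ignored here):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_prob(tokens, context, word):
    """P(word | context) = C(context + word) / C(context), as in equation (1)."""
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator else 0.0

corpus = "a b a b a c".split()
print(mle_prob(corpus, ["a"], "b"))  # 2/3: "a" occurs 3 times and is followed by "b" twice
```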

The operator C represents a count. The models acquired through this training are textual files written in the ARPA format [1]. They are called Katz back-off models because of the way they are used to calculate the probability of a given word sequence (the Katz back-off algorithm [6]). An entry of a model in the ARPA format looks as follows:

-3.2395582 bila dva -0.8413376

The numerical value on the left is the log-probability of the bigram "bila dva". On the right, the back-off coefficient for this bigram is given. Back-off coefficients are used in the way described by equations (2) and (3).

\[
P_{Katz}(z \mid x, y) =
\begin{cases}
P^{*}_{Katz}(z \mid x, y), & \text{if } C(x, y, z) > 0 \\
\alpha(x, y)\, P_{Katz}(z \mid y), & \text{else if } C(x, y) > 0 \\
P_{Katz}(z), & \text{otherwise}
\end{cases}
\tag{2}
\]

\[
P_{Katz}(z \mid y) =
\begin{cases}
P^{*}_{Katz}(z \mid y), & \text{if } C(y, z) > 0 \\
\alpha(y)\, P_{Katz}(z), & \text{otherwise}
\end{cases}
\tag{3}
\]

Here, the words in a sequence are denoted by x, y and z, the operator C has the same meaning as in equation (1), and α represents a Katz back-off coefficient. The probabilities P*_Katz are the probability estimates contained in the model's entries, and P_Katz are the probabilities returned by the model for a given word sequence.
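The back-off lookup of equations (2) and (3) can be sketched with the model stored as dictionaries of log10 probabilities and back-off weights, the way they appear in an ARPA file. The toy entries below are invented for illustration:

```python
# Toy model entries (invented): log10 probabilities P* and back-off weights alpha,
# as they would be read from an ARPA-format file.
logprob = {("bila",): -2.1, ("dva",): -2.4, ("bila", "dva"): -3.24}
backoff = {("bila",): -0.84}  # log10 of the back-off coefficient alpha("bila")

def katz_logprob(ngram):
    """log10 P_Katz(w | context): stored estimate if seen, otherwise back off."""
    if ngram in logprob:              # C(...) > 0: use the stored estimate P*
        return logprob[ngram]
    if len(ngram) == 1:               # unseen unigram: a floor value
        return -99.0
    context, shorter = ngram[:-1], ngram[1:]
    alpha = backoff.get(context, 0.0)     # log10 alpha; 0.0 if the context has no weight
    return alpha + katz_logprob(shorter)  # alpha * P_Katz(shorter n-gram), in log space

print(katz_logprob(("bila", "dva")))   # -3.24 (stored bigram estimate)
print(katz_logprob(("bila", "bila")))  # -0.84 + (-2.1) = -2.94 (backed off via alpha)
```

In a real SRILM model the same recursion runs from the highest order down to unigrams; the sketch only shows the shape of the computation.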

The training corpus used for this study consisted of newspaper articles on a great variety of topics. Corpora of different magnitudes were used to create 17 different models. These corpora, consisting of words, were converted to corpora of lemmas and classes, and the corresponding models were created. Therefore, the study included a total of 51 models. The details about the corpora sizes (for the 17 models of each type) are given in Table 1.

The first column shows the number of word instances contained in the different corpora. The second column shows the vocabulary sizes. The third column refers to the lemma-based corpora and shows the corresponding vocabulary sizes. The fourth column gives the vocabulary sizes for the class-based corpora.

The models were created in two groups. The first group, consisting of models M01-M08 (24 models overall), was created in order to track in more detail the model quality when small training corpora are used. The second group, consisting of nine models of each type (M1-M9), was created in order to determine the possible saturation of the model quality function as the corpus size increases.

TABLE 1: CORPORA SIZE STATISTICS.

Model | Words (x10^6) | Vocabulary (x10^3) | Lemmas (x10^3) | Classes
M01   |  0.09 |  20.205 |  11.37 | 506
M02   |  0.2  |  32.356 |  17    | 552
M03   |  0.4  |  47.094 |  23.58 | 586
M04   |  0.58 |  57.293 |  27.97 | 605
M05   |  0.91 |  73.695 |  35.24 | 628
M06   |  1.14 |  84.618 |  40.7  | 642
M07   |  1.37 |  93.856 |  44.18 | 646
M08   |  1.6  | 101.710 |  47.56 | 651
M1    |  1.75 | 106.78  |  49.94 | 652
M2    |  3.52 | 149.82  |  69.79 | 681
M3    |  5.28 | 181.08  |  84.6  | 700
M4    |  6.95 | 205.78  |  96.61 | 710
M5    |  8.59 | 226.66  | 107.05 | 716
M6    | 10.26 | 245.8   | 116.65 | 720
M7    | 11.91 | 262.9   | 125.33 | 727
M8    | 13.56 | 278.97  | 133.47 | 731
M9    | 15.24 | 293.65  | 140.86 | 733

The following section describes how the quality of the models was assessed.

III. EVALUATION EXPERIMENT

The most commonly used measure of an n-gram language model's quality is the perplexity value obtained on a test data set. The perplexity PPL of a word sequence of length K is defined by equation (4).

\[
PPL = \left( \prod_{i=1}^{K} P(w_i \mid w_1 \dots w_{i-1}) \right)^{-\frac{1}{K}}
\tag{4}
\]

This value represents the average confusion of the language model caused by a word from the test data set. The perplexity depends on the model (the vocabulary of the training corpus) as well as on the test data set. This makes perplexity inconvenient for comparing models trained on different types of corpora. For example, a word-based model can be evaluated by perplexity on a test data set. A corresponding class-based model can be trained and evaluated on the same corpora after a simple conversion from words to classes. The two perplexity values cannot be compared directly because they do not refer to the same vocabularies.
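Equation (4) is the inverse geometric mean of the per-word model probabilities; a minimal sketch, assuming a toy table of conditional probabilities (the values are invented for illustration):

```python
import math

# Invented bigram probabilities P(w | previous word); None marks the sentence start.
P = {(None, "dva"): 0.5, ("dva", "grada"): 0.25, ("grada", "su"): 0.5}

def perplexity(sentence):
    """PPL of equation (4): the inverse geometric mean of the word probabilities."""
    prev, log_sum = None, 0.0
    for w in sentence:
        log_sum += math.log(P[(prev, w)])
        prev = w
    return math.exp(-log_sum / len(sentence))

print(perplexity(["dva", "grada", "su"]))  # (0.5 * 0.25 * 0.5) ** (-1/3) ≈ 2.52
```

Working in log space, as here, avoids numerical underflow on long test sets.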

A better quality measure can be obtained by evaluating a model's perplexity both on text with word order randomized on a sentence basis and on the authentic text. The ratio of these two perplexity values can be thought of as a discrimination coefficient, which shows how well a language model differentiates authentic text from a meaningless word sequence [7]. Comparing different model types by the discrimination coefficient is possible, and it can give valuable information on how to use and combine them to get the best language representation.
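The procedure can be sketched as follows; `model_perplexity` is a placeholder for any routine that returns a model's perplexity on a list of sentences:

```python
import random

def shuffle_within_sentences(sentences, seed=0):
    """Randomize the word order independently inside each sentence."""
    rng = random.Random(seed)
    shuffled = []
    for sentence in sentences:
        words = list(sentence)
        rng.shuffle(words)
        shuffled.append(words)
    return shuffled

def discrimination_coefficient(model_perplexity, sentences):
    """Ratio of the perplexity on randomized text to the perplexity on authentic text."""
    ppl_authentic = model_perplexity(sentences)
    ppl_random = model_perplexity(shuffle_within_sentences(sentences))
    return ppl_random / ppl_authentic
```

A model whose perplexity is insensitive to word order scores exactly 1; higher values mean better discrimination between authentic and meaningless text.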

IV. RESULTS AND DISCUSSION

The evaluation of all the models considered in this study was initially done on two data sets which differed significantly in size. Since the results on both data sets were very similar, it could be concluded that the smaller data set, consisting of 190,408 words, was large enough to give accurate evaluation information. Because of space limitations, only the results referring to the smaller test corpus will be presented. Table 2 presents the perplexity values of all the models on the authentic text. Table 3 gives the same information as Table 2, only for the text with random word order on the sentence level. Table 4 gives the discrimination coefficients for all the models. For the purpose of clarity, the values of the discrimination coefficients for both groups of models, which were discussed in Section II, are given in graphic form in Figures 1 and 2.

From Tables 2 and 3 it can be seen how the perplexity calculated on the authentic text (for any of the model types) decreases and the perplexity on the text with random word order increases as the training corpus gets larger. This effect is more pronounced in the second group of models, as the corpus increase step is greater. An interesting detail can be observed in the first group of word-based models. Namely, as the corpus size increases, the perplexity calculated on authentic text decreases slightly, but the perplexity on the text with random word order increases much more abruptly. This means that it is easier (in terms of the training corpus size) to train a language model to discriminate invalid word sequences than to affirm valid ones.

TABLE 2: PERPLEXITY VALUES ON AUTHENTIC TEXT.

Model | Word-based | Lemma-based | Class-based
M01   | 599.2  | 434.57 | 31.26
M02   | 634.27 | 409.65 | 28.82
M03   | 642.43 | 392.58 | 27.25
M04   | 639.03 | 378.45 | 26.58
M05   | 615.31 | 352.23 | 25.74
M06   | 599.16 | 338.65 | 25.35
M07   | 548.29 | 310.22 | 25.01
M08   | 532.46 | 298.91 | 24.8
M1    | 484.72 | 275.47 | 24.59
M2    | 389.86 | 227.64 | 23.82
M3    | 337    | 202.23 | 23.42
M4    | 305.91 | 186.52 | 23.21
M5    | 288.76 | 177.94 | 23.08
M6    | 276.02 | 171.51 | 22.97
M7    | 267.99 | 167.42 | 22.91
M8    | 259.7  | 163.78 | 22.85
M9    | 239.7  | 153.29 | 22.75

TABLE 3: PERPLEXITY VALUES ON TEXT WITH RANDOM WORD ORDER ON THE SENTENCE LEVEL.

Model | Word-based | Lemma-based | Class-based
M01   |  2097.56 | 1475.91 | 265.3
M02   |  3076.31 | 1877.92 | 299.68
M03   |  4154.21 | 2310.44 | 328.54
M04   |  4791.33 | 2579.93 | 343.65
M05   |  5583.1  | 2898.68 | 354.752
M06   |  5939.57 | 3012.17 | 362.48
M07   |  6265.49 | 3123.36 | 365.04
M08   |  6598.95 | 3223.02 | 372.54
M1    |  6815.13 | 3290.24 | 375.23
M2    |  8238.23 | 3844.57 | 394.63
M3    |  9149.97 | 4150.64 | 407.19
M4    |  9850.25 | 4393.96 | 413.28
M5    | 10412.4  | 4598.64 | 417.74
M6    | 10884.6  | 4761.98 | 421.97
M7    | 11317.8  | 4913.7  | 424.37
M8    | 11670.8  | 5053.69 | 427.63
M9    | 12023.3  | 5182.77 | 430.03

TABLE 4: DISCRIMINATION COEFFICIENT VALUES.

Model | Word-based | Lemma-based | Class-based
M01   |  3.5   |  3.4   |  8.49
M02   |  4.85  |  4.58  | 10.4
M03   |  6.47  |  5.89  | 12.06
M04   |  7.5   |  6.86  | 12.93
M05   |  9.07  |  8.23  | 13.78
M06   |  9.91  |  8.92  | 14.3
M07   | 11.42  | 10.07  | 14.6
M08   | 12.39  | 10.78  | 15.02
M1    | 14.06  | 11.94  | 15.26
M2    | 21.13  | 16.89  | 16.57
M3    | 27.15  | 20.52  | 17.38
M4    | 32.2   | 23.65  | 17.8
M5    | 36.06  | 25.84  | 18.1
M6    | 39.43  | 27.76  | 18.37
M7    | 42.23  | 29.35  | 18.52
M8    | 44.94  | 30.86  | 18.71
M9    | 50.16  | 33.81  | 18.9

The values of the discrimination coefficients give more objective information about the quality of the language models. As can be seen from Table 4, the word-based model can give the best results when sufficient training data is available. On the other hand, the class-based model shows better results for smaller training data sets. This is mainly because the class-based model has a much smaller vocabulary and can achieve good estimates of word sequence probabilities even on a small training set.

On the other hand, the quality of the class-based language representation is limited by the amount of information contained in the word classes.

This is, of course, true for all model types, but the models with smaller vocabularies can reach their potential on smaller training sets. Figures 1 and 2 show more clearly how the language models gain discrimination power as the training sets get larger. The first group of models shows exactly how the class-based model outperforms the others when training corpora of less than 2 million words are used. This is useful information for situations when a language model has to be trained for a special purpose and not a lot of training data is available. In that case, it would be better to train the model on word classes than on words.

Figure 1. Discrimination coefficients for the first group of models.

If a language model is meant for general purposes and a lot of training data is available, training the class-based n-gram model would be a bad choice, because it would be trained well on a small part of the corpus, after which its quality function would go into saturation. As Figure 2 shows, the quality function of the word-based model does not indicate saturation, which means that the currently existing corpus is not large enough to reach the full potential of this language model. This is also true for the lemma-based model.

Figure 2. Discrimination coefficients for the second group of models.

It should be noted that even though a word-based model shows better results when trained on a very large corpus, it can still use some information given by a corresponding class-based model to improve the accuracy of word sequence probability estimation. Combining the two models is not only useful in situations when out-of-vocabulary words are encountered, but also to smooth the probability distribution and correct the values of word sequence probabilities that have been mistakenly evaluated as extremely high or low. The lemma-based model could also be of some help in these cases.

Since the basic idea for improving the language model for Serbian includes a weighted combination of the word-based, lemma-based and class-based models, the main problem is finding the right set of weighting coefficients. The discrimination coefficient seems to be a good candidate for defining these weights, because it corresponds well to the qualities of the component models. Therefore, for every application, the weights could easily be determined from the discrimination coefficients based on the training corpus.
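As a sketch of this weighting idea (an assumption about how the weights might be derived, not the paper's final scheme), the discrimination coefficients of the M1 models from Table 4 can be normalized into log-linear combination weights:

```python
# Discrimination coefficients of the M1 models (Table 4), normalized into weights.
disc = {"word": 14.06, "lemma": 11.94, "class": 15.26}
total = sum(disc.values())
weights = {name: value / total for name, value in disc.items()}

def combined_logprob(component_logprobs):
    """Log-linear combination: a weighted sum of component log-probabilities."""
    return sum(weights[name] * lp for name, lp in component_logprobs.items())

# Hypothetical component log10 probabilities for one word in its context:
print(combined_logprob({"word": -3.1, "lemma": -2.6, "class": -1.4}))
```

Note that the combined scores would still have to be renormalized over the vocabulary to form a proper probability distribution; the sketch only shows how the weights enter the combination.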

However, the ultimate test of a language model's quality is still the WER test in the context of speech recognition. When this becomes possible, it will show how well the discrimination coefficients correspond to the weights of the component language models.

V. CONCLUSIONS AND FURTHER RESEARCH

This paper presented research done in order to determine the language model quality function depending on the training corpus size. The word-based, lemma-based and class-based models created for the Serbian language were evaluated by their discrimination coefficients. The results show saturation in the quality function of the class-based model for the existing training corpus. The other two model types require more training data in order to reach their full potential. Nevertheless, the discrimination coefficients represent promising estimates of the component models' weights in the combined language model.

Further research will be oriented toward the WER evaluation of the models, in order to obtain information on the validity of the conclusions drawn from the previous work.

REFERENCES

[1] A. Stolcke, "SRILM – an extensible language modeling toolkit," Proceedings of ICSLP, vol. 2, pp. 901-904, Denver, 2002.
[2] D. Mišković, N. Jakovljević, D. Pekar, M. Sečujski, "N-gram Application in Language Modeling for Serbian in Large-Vocabulary Speech Recognition," in Serbian ("Primena n-grama za modelovanje srpskog jezika u prepoznavanju govora na velikim rečnicima"), DOGS, pp. 61-63, Novi Sad, 2010.
[3] P. F. Brown, P. V. de Souza, R. L. Mercer, V. J. Della Pietra, J. C. Lai, "Class-Based N-gram Models of Natural Language," Computational Linguistics, vol. 18, December 1992.
[4] M. Sečujski, "Accentuation dictionary of Serbian intended for text-to-speech synthesis," in Serbian ("Akcenatski rečnik srpskog jezika namenjen sintezi govora na osnovu teksta"), DOGS, Bečej, 2002.
[5] S. Ostrogonac, D. Mišković, M. Sečujski, D. Pekar, V. Delić, "A Language Model for Highly Inflective Non-Agglutinative Languages," SISY, Subotica, 2012.
[6] C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, May 1999.
[7] S. Ostrogonac, D. Mišković, M. Sečujski, D. Pekar, "Discriminative Potential of a Language Model Based on the Class n-gram Concept," in Serbian ("Diskriminativne mogućnosti modela jezika zasnovanog na konceptu klasnog n-grama"), DOGS, Kovačica, 2012.
