Learning a token classification from a large corpus (A case study in abbreviations)

Petya Osenova & Kiril Simov

BulTreeBank Project (www.BulTreeBank.org)

Linguistic Modeling Laboratory, Bulgarian Academy of Sciences

ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics

August 5 - 9, 2002

Plan of the talk

• BulTreeBank Project

• Text Archive

• Token Processing Problem

• Global Token Classification

• Application to Abbreviations

BulTreeBank project

• It is a joint project between the Linguistic Modeling Laboratory (LML), Bulgarian Academy of Sciences, and the Seminar fuer Sprachwissenschaft (SfS), Tuebingen. It is funded by the Volkswagen Foundation, Germany.

• Its main goal is the creation of a high-quality, HPSG-oriented syntactic treebank of Bulgarian.

• It also aims at producing a parser and a partial grammar of Bulgarian.

• Within the project an XML-based system for corpora development is being created.

BulTreeBank team

Principal researcher:

Kiril Simov

Researchers:

Petya Osenova, Milena Slavcheva, Sia Kolkovska

PhD student:

Elisaveta Balabanova

Students:

Alexander Simov, Milen Kouylekov, Krasimira Ivanova, Dimitar Dojkov

BulTreeBank text archive

• A collection of linguistically interpreted texts from different genres (target size 100 million words)

• A linguistically interpreted text is a text in which all meaningful tokens (including numbers, special signs and others) are marked up with linguistic descriptions

The current state of the text archive

• Nearly 90 000 000 running words: 15% fiction, 78% newspapers and 7% legal texts, government bulletins and other genres

• About 70 million running words have been converted into XML format in accordance with the TEI guidelines

• 10 million running words are morphologically tagged

• 500 000 running words are manually disambiguated

Pre-processing steps (1)

• Morphosyntactic tagger: assigning all appropriate morphosyntactic features to each potential word

• Part-of-speech disambiguator: choosing the right morphosyntactic features for each potential word in the context

• Partial grammar for non-word tokens

Pre-processing steps (2)

Partial grammars

• Sentence boundaries grammar

• Named Entity Recognition

– Names of people, places, organizations etc.

– Dates, currencies, numerical expressions

– Abbreviations

– Foreign tokens

• Chunk grammar (Abney 1991, 1996)

– Non-recursive constituents

Token processing problem

A token in a text receives its linguistic interpretation on the basis of two sources of information: (1) the language and (2) the context of use

Two problems:

• For less-studied languages there are not enough language resources (low level of linguistic interpretation)

• Erroneous use in the context (wrong prediction)

Token classification

• Symbol-based classification: the tokens are defined by their immanent graphical characteristics

• General token classification: the tokens fall into several categories: common word, proper name, abbreviation, symbols, punctuation, error

• Grammatical and semantic classification: the tokens are presented in several lexicons, in which their grammatical and semantic features are listed

General token classification

Our goal is to learn a corpus-based classification of the tokens with respect to the general token classification

We use this classification in two ways:

– For an initial classification of the tokens in the texts before consulting the dictionary, and

– For the linguistic processing of the tokens from the different classes

Learning general token categories (1)

Token classes:

• Common words: typical - lowercased, or first capital letter in sentence-initial position; non-typical - all caps

• Proper names: typical - first capital letter; non-typical - all caps; wrong - lowercased

• Abbreviations: typical - all caps, mixed, lowercased (with period, hyphen or a single letter)


Some problems:

• Some tokens can belong to more than one class according to their graphical properties.

• Spelling errors in a large set of texts could cause misclassification.


Our classification is not Boolean but gradual: a ranking of tokens with respect to each of the above categories.

Our initial procedure included the following steps:

– We used some graphical criteria for assigning potential categories to the unknown tokens.

– We used statistical methods to make a distinction within each category between the most frequent tokens of this category and tokens not in the category or rare tokens.


Graphical criterion

It takes into account the graphical specificity of the tokens.

For each category a list of tokens potentially belonging to it was constructed

Well-known problems remain, such as:

– Common words written in capital letters

– Abbreviations written incorrectly

The graphical criterion is not sufficient
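
To make the graphical criterion concrete, here is a minimal sketch that assigns candidate categories to a token from its shape alone. The function name `graphical_candidates`, the category labels and the exact rules are illustrative assumptions based on the token classes listed earlier; they are not the project's actual implementation.

```python
def graphical_candidates(token, sentence_initial=False):
    """Return the categories a token may belong to, judged by its shape only.

    Hypothetical helper: the rules mirror the "typical / non-typical" shapes
    from the slides and are a simplification.
    """
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return set()  # symbols and punctuation are outside this sketch

    all_upper = all(ch.isupper() for ch in letters)
    all_lower = all(ch.islower() for ch in letters)
    capitalized = letters[0].isupper() and all(ch.islower() for ch in letters[1:])
    # Period, hyphen or a single letter hint at an abbreviation.
    marked = token.endswith(".") or "-" in token or len(letters) == 1

    candidates = set()
    if all_upper and len(letters) > 1:
        # All caps: typical for abbreviations, non-typical for the other two.
        candidates |= {"abbreviation", "common word", "proper name"}
    elif capitalized:
        candidates.add("proper name")
        if sentence_initial:
            candidates.add("common word")
    elif all_lower:
        # Lowercased proper names ("wrong" spelling) are not modeled here.
        candidates.add("common word")
    else:
        candidates.add("abbreviation")  # mixed case
    if marked:
        candidates.add("abbreviation")
    return candidates
```

A token such as "ФБР" ends up in all three candidate lists, which illustrates why the graphical criterion alone cannot decide the category.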


Statistical criterion

For each category, every candidate token is ranked in order to maximize the number of correct predictions

In fact we classify normalized tokens

A normalized token is an abstraction over tokens that share the same sequence of letters from a given alphabet
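
One possible reading of this definition is sketched below: tokens are grouped under a key built from their letters only, ignoring case and non-letter characters. Both the folding rule and the names `normalize` and `group_by_normal_form` are assumptions made for illustration; the slides do not specify the exact normalization.

```python
from collections import defaultdict


def normalize(token):
    """Keep only the letters of the token and fold them to lowercase (assumed rule)."""
    return "".join(ch for ch in token if ch.isalpha()).lower()


def group_by_normal_form(tokens):
    """Map each normalized token to the set of surface forms seen for it."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[normalize(tok)].add(tok)
    return dict(groups)
```

Under this assumption, "ЗУНК", "Зунк" and "зунк" are counted as occurrences of the same normalized token.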


Ranking with a category (1)

The ranking formula is

Rank = TokPar * DocPar

where the two parameters are

TokPar = True/All

The number of true appearances of the token divided by the number of all appearances of the token

DocPar

The number of documents in which the correctly written token was found if this number is less than 50, otherwise this value is 50


Ranking with a category (2)

The first parameter makes no distinction between one occurrence and a hundred occurrences. Thus, the real scope of the distribution is lost

The impact that the token has over the text archive is represented by the second parameter. The upper bound of 50 is used as a normalization parameter.

Thus the tokens that are rare or do not belong to the category receive a very small rank.
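
The ranking can be sketched in a few lines. The container `TokenStats` and its field names are hypothetical, and "true appearances" is read here as occurrences written in the form expected for the category.

```python
from dataclasses import dataclass

DOC_CAP = 50  # upper bound on DocPar, as stated on the slide


@dataclass
class TokenStats:
    """Counts for one normalized token within one category (hypothetical container)."""
    true_count: int  # occurrences written as expected for the category
    all_count: int   # all occurrences of the normalized token
    doc_count: int   # documents containing the correctly written form


def rank(stats):
    """Rank = TokPar * DocPar, with TokPar = True/All and DocPar capped at 50."""
    if stats.all_count == 0:
        return 0.0  # unseen token: no evidence at all
    tok_par = stats.true_count / stats.all_count
    doc_par = min(stats.doc_count, DOC_CAP)
    return tok_par * doc_par
```

A token that is always written correctly but occurs in a single document gets rank 1.0, while one that is written correctly half of the time across 40 documents gets 20.0, so the ranks fall between 0 and 50.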


Usefulness

• The method favors the tokens with greater impact over the whole corpus

• The tokens appearing in a small number of documents are processed by local-based approaches (document-centered)

General token categories and local-based approaches

• The local-based approaches can filter general classification with respect to ambiguous or unusual usage of tokens

• When the local-based approach is inapplicable, the information is taken from the general token classification

• The result of such a ranking is very useful for the other task mentioned above - the linguistic treatment of the unknown tokens

Abbreviations in the pre-processing

• Abbreviations are special tokens in the text

• They contribute to robust:

– tagging

– disambiguation

– shallow parsing

Extraction criteria

Three criteria

• Graphical criterion (as above)

• Statistical criterion (as above)

• Context criterion - we tried to extract some abbreviations together with their extensions, which are usually written in brackets. Thus the ambiguity is reduced.
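
As an illustration of the context criterion, the sketch below looks for an all-caps token followed by a bracketed phrase and accepts the pair only if the letters of the abbreviation match the initials of the phrase. The order "abbreviation (extension)", the regular expression, the `max_skip` tolerance for function words, and the function names are all assumptions; the reverse order would need a mirrored pattern.

```python
import re

# A run of letters followed by a bracketed phrase,
# e.g. "АЧП (Агенция за чуждестранна помощ)".
ABBR_WITH_BRACKETS = re.compile(r"\b([^\W\d_]{2,})\s*\(([^()]+)\)")


def initials_match(abbr, phrase, max_skip=2):
    """Check that the letters of abbr occur, in order, as word initials in phrase.

    Up to max_skip consecutive non-matching words (e.g. prepositions) are allowed.
    """
    letters = abbr.lower()
    pos = 0
    skipped = 0
    for word in phrase.split():
        if pos < len(letters) and word[0].lower() == letters[pos]:
            pos += 1
            skipped = 0
        else:
            skipped += 1
            if skipped > max_skip:
                return False
    return pos == len(letters)


def extract_pairs(text):
    """Return (abbreviation, extension) pairs found in text."""
    pairs = []
    for match in ABBR_WITH_BRACKETS.finditer(text):
        abbr, phrase = match.group(1), match.group(2).strip()
        if abbr.isupper() and initials_match(abbr, phrase):
            pairs.append((abbr, phrase))
    return pairs
```

The initials check is what reduces ambiguity here: a bracketed remark that does not spell out the abbreviation is rejected.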

Dealing with abbreviations

Our approach includes three steps:

• Typological classification - the existing classifications were refined with respect to the electronic treatment of abbreviations

• Extraction - different criteria were proposed for the extraction of the most frequent abbreviations in the archive

• Linguistic treatment - the abbreviations were extended and the relevant grammatical information was added

Typological classification

Linguistic treatment (1)

• Encoding the linguistic information shared by all abbreviations:

– the head element presents the abbreviation itself

– every abbreviation has a generalized type: acronym or word

– every abbreviation has at least one extension

– every extension element consists of a phrase

Linguistic treatment (2)

• Encoding the linguistic information shared by some types of abbreviations:

– the non-lexicalized abbreviations were assigned grammatical information according to their syntactic head. Thus the element 'class' was introduced.

– the partly lexicalized abbreviations were additionally assigned grammatical information according to their inflection. Thus the element 'flex' was introduced.

– the abbreviations of foreign origin usually have an additional head element, called headforeign (headf).

Examples (1)

type ACRONYM

<abbr><head>АЧП</head><acronym/><expan><phrase>Агенция за чуждестранна помощ</phrase><class>Сжед</class></expan></abbr>

<abbr><head>ДП</head><acronym/>

<expan><phrase>Държавно предприятие</phrase> <class>Ссред</class> </expan>

<expan><phrase>Демократическа партия</phrase> <class>Сжед</class></expan></abbr>

<abbr><head>ЗУНК</head><acronym/><expan><phrase>Закон за уреждане на необслужваните кредити</phrase><class>Смед</class>

<flex>ЗУНК-а,ЗУНК-ът,ЗУНК-ове</flex></expan></abbr>

<abbr><head>ФБР</head><headf>FBI</headf><acronym/>

<expan><phrase>Федерално бюро за разследване</phrase>

<class>Ссред</class></expan></abbr>

Examples (2)

type WORD

<abbr><head>г-ца</head> <word/>

<expan><phrase>госпожица</phrase></expan></abbr>

<abbr><head>гр.</head><word/>

<expan><phrase>град</phrase></expan></abbr>

<abbr><head>в.</head><head>в-к</head><word/>

<expan><phrase>вестник</phrase></expan></abbr>

<abbr><head>ез.</head><word/>

<expan><phrase>езеро</phrase></expan>

<expan><phrase>език</phrase></expan></abbr>
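
Entries encoded this way are straightforward to load into a lookup table. The sketch below uses Python's standard `xml.etree.ElementTree`; the wrapping `<lexicon>` root element and the function name `load_lexicon` are assumptions added only so that several entries form one well-formed document.

```python
import xml.etree.ElementTree as ET

# Two entries from the examples, wrapped in a hypothetical <lexicon> root.
SAMPLE = """<lexicon>
<abbr><head>ДП</head><acronym/>
  <expan><phrase>Държавно предприятие</phrase><class>Ссред</class></expan>
  <expan><phrase>Демократическа партия</phrase><class>Сжед</class></expan>
</abbr>
<abbr><head>в.</head><head>в-к</head><word/>
  <expan><phrase>вестник</phrase></expan>
</abbr>
</lexicon>"""


def load_lexicon(xml_text):
    """Map every head form to its generalized type and list of extensions."""
    lexicon = {}
    for abbr in ET.fromstring(xml_text).iter("abbr"):
        kind = "acronym" if abbr.find("acronym") is not None else "word"
        # Each extension is (phrase, class); class is None when absent.
        expansions = [
            (expan.findtext("phrase"), expan.findtext("class"))
            for expan in abbr.findall("expan")
        ]
        for head in abbr.findall("head"):
            lexicon[head.text] = {"type": kind, "expansions": expansions}
    return lexicon
```

With this layout an ambiguous head such as "ДП" keeps both extensions, and alternative spellings such as "в." and "в-к" point to the same entry.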

Evaluation

The method is hard to evaluate in absolute terms with respect to only one class of tokens

We apply only a relative evaluation with respect to a given rank

Only the precision measure is really applicable

The recall is practically equal to 100%

Precision = 98.7% for the first 557 candidates (Rank >= 25)
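
A relative evaluation of this kind can be expressed as precision over the candidates at or above a rank threshold. The sketch below assumes a list of manually judged candidates given as (rank, is_correct) pairs; the function name and the toy data in the usage note are illustrative and are not the project's figures.

```python
def precision_at_rank(candidates, threshold):
    """Share of correct candidates among those with rank >= threshold.

    candidates: iterable of (rank, is_correct) pairs from manual inspection.
    Returns None when no candidate reaches the threshold.
    """
    selected = [is_correct for rank, is_correct in candidates if rank >= threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)
```

For example, `precision_at_rank([(50.0, True), (30.0, True), (25.0, False), (10.0, False)], 25)` considers the first three candidates only, two of which are correct.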

Other applications

• Classification and linguistic treatment of other classes of tokens: names, sentence boundary markers (similar to abbreviations)

• Determination of the vocabulary of a dictionary for human use

The lexemes with great impact on present-day texts will be chosen

Similar treatment of new words

Future work

• Dealing with different ambiguities

• Combination with other methods, such as document-centered approaches and morphological guessers

• Using other stochastic methods
