Linguistic annotation of learner corpora A. Díaz-Negrillo, D. Meurers & H. Wunsch University of Jaén, University of Tübingen Spain Germany

Post on 23-Dec-2015

1. Introduction

A study on linguistic annotation of learner corpora, in particular Part-Of-Speech (POS) annotation, which aims to discuss where native POS tagsets fail to accurately describe learner language, by:

• Describing POS annotation practice in learner corpora, and

• Characterizing the areas where properties of learner language differ from those assumed by native POS annotation schemes.

• Learner corpora can play a role in identifying areas of relevance in, for example, FLT, SLA, materials design, etc.

• The terminology used to single out learner language aspects needs to be mapped to instances in the corpus, i.e. annotation.

• Linguistic annotation of learner corpora, in particular POS tagging, is becoming a common practice because:

– By using generally agreed linguistic categories, it allows units of interest to be identified objectively.

– Other annotation specific to learner corpora (error tagging) mostly allows research into deviances; it is also costly and involves a degree of subjectivity.

– In SLA research there is an interest in the developmental stages of the acquisition process.

– POS tagging can be done automatically.

Recent initiatives:

• International Corpus of Learner English (ICLE)

• Cambridge Learner Corpus (CLC)

• Japanese EFL Learner Corpus (JEFLL)

• Polish Learner Corpus of English

Automatic POS-tagging consists of 2 parts:

– Tag look-up: all possible tags for the given token are determined based on lexical database reference or morphological analysis.

– Tag disambiguation: all possible tags are reduced to the correct tag based on distribution.

Fallback strategies: weaker versions of the 3 sources of evidence above (lexical database, morphological analysis, distribution) and, as a last resort, use of the most frequent tag.
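
The two parts and the fallback can be illustrated with a minimal Python sketch. It is not the implementation of any particular tagger: the lexicon, the suffix rules, the tag-bigram counts and the choice of NN as the most frequent tag are all invented for illustration, and real taggers use far richer statistical models.

```python
from collections import Counter

# Hypothetical toy lexicon: token -> possible Penn-style tags.
LEXICON = {
    "the": {"DT"},
    "demonstrations": {"NNS"},
    "ended": {"VBD", "VBN"},
    "without": {"IN"},
    "confrontation": {"NN"},
}

# Hypothetical suffix rules standing in for morphological analysis.
SUFFIX_RULES = [
    ("ing", {"VBG", "NN"}),
    ("ed", {"VBD", "VBN"}),
    ("s", {"NNS", "VBZ"}),
]

# Hypothetical (previous tag, tag) counts standing in for distributional evidence.
BIGRAM_COUNTS = Counter({
    ("<s>", "DT"): 10, ("DT", "NNS"): 8, ("DT", "NN"): 9,
    ("NNS", "VBD"): 6, ("NNS", "VBN"): 1,
    ("VBD", "IN"): 5, ("IN", "NN"): 7,
})

MOST_FREQUENT_TAG = "NN"  # assumed last-resort fallback


def look_up(token):
    """Part 1: collect candidate tags from the lexicon, else from suffix rules."""
    token = token.lower()
    if token in LEXICON:
        return set(LEXICON[token])
    for suffix, tags in SUFFIX_RULES:
        if token.endswith(suffix):
            return set(tags)
    return {MOST_FREQUENT_TAG}


def disambiguate(candidates, previous_tag):
    """Part 2: keep the candidate seen most often after the previous tag."""
    # sorted() makes the choice deterministic when counts are tied.
    return max(sorted(candidates), key=lambda tag: BIGRAM_COUNTS[(previous_tag, tag)])


def tag(tokens):
    """Tag a token list left to right, feeding each chosen tag to the next step."""
    tagged, previous_tag = [], "<s>"
    for token in tokens:
        best = disambiguate(look_up(token), previous_tag)
        tagged.append((token, best))
        previous_tag = best
    return tagged
```

With this toy data, `tag("The demonstrations ended without confrontation".split())` resolves the ambiguous "ended" to VBD because VBD follows NNS more often than VBN in the invented counts. An unknown form such as "runned" still receives the candidates VBD/VBN through the "-ed" suffix rule.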

• POS-tagging learner language is essentially perceived as an instance of domain transfer (van Rooy & Schäfer 2003; Thouësny 2009):

– Automatic POS-taggers trained on native data are run on learner data.

– Due to differences in genre and data type, the annotations are less accurate.

– To make up for this degradation of performance, post-correction is often added.

• De Haan (2000) and Van Rooy & Schäfer (2002) investigated POS tagging error types. Spelling errors seem to be a major source of problems, but they can be handled rather straightforwardly, especially if they result in non-words.

• De Haan (2000) proposes a fine-grained classification of learner errors that become relevant to the POS tagging process. He suggests adapting the TOSCA-ICLE POS tagset to cater for these learner-specific features.

Native taggers:

– Map the linguistic categories of the native language onto POS tags, based on the combinatory possibilities of stem-morphology-distribution.

The demonstrations ended without confrontation (demonstrations: NNS)

But learner language:

– Does not always present the same POS categories, because the combinatory possibilities of stem-morphology-distribution are different.

[…] If he want to know this […] (want: VB/VBP?)

Do native taggers always provide the categories needed to describe learner language?

2. Method

• This paper is based on a sample of the NOn-native Corpus of English (NOCE, Díaz Negrillo, 2007), containing around 40,000 words.

• The NOCE corpus is a written corpus of EFL:

– Over 300,000 words of written English by Spanish undergraduates.

– 1,054 samples of an average of 250 words each.

• The samples were collected:

– From 2003 to 2009, primarily among first-year students doing the English degree programme at the Universities of Granada and Jaén (Spain),

– At 3 stages in the academic year (beginning, mid-term and end),

– By the students’ lecturers, assisted by corpus compilers, in 1-hour teaching sessions,

– As a timed classroom task: essay writing, and

– On a voluntary basis and under appropriate conditions of anonymity.

• The corpus contains 3 types of annotation:

– Editorial annotation: the corpus is annotated for students’ edits of their own writing (e.g. strikeouts, late insertions, reordering of units and missing/unreadable text).

– Error annotation: a section of the corpus of around 40,000 words is error-tagged with the tagset EARS (Error-Annotation and Retrieval System, Díaz Negrillo, 2009).

– POS annotation: the corpus is annotated with 3 automatic POS taggers: TnT, Stanford and Treebank.

• General observations of the corpus’ POS annotations by the 3 POS taggers suggest:

– There are areas where the taggers do not provide the same tag for a given token,

– Certain cases are easy to disambiguate manually, but

– In other cases disambiguation is difficult because the tagsets do not fully map the categories present in the learner corpus.
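
One way to locate the areas where taggers disagree is to align their outputs token by token and keep the positions where the tags differ. The sketch below assumes each tagger returns one tag per token for the same tokenisation. The function name `find_disagreements` and the three tag sequences are hypothetical and are not actual output from TnT, Stanford or any other tagger.

```python
def find_disagreements(tokens, *tag_sequences):
    """Return (index, token, tags) for every token the taggers tag differently."""
    disagreements = []
    for index, (token, *tags) in enumerate(zip(tokens, *tag_sequences)):
        if len(set(tags)) > 1:  # more than one distinct tag for this token
            disagreements.append((index, token, tuple(tags)))
    return disagreements


tokens = ["If", "he", "want", "to", "know", "this"]

# Invented outputs of three hypothetical taggers, for illustration only.
tagger_a = ["IN", "PRP", "VBP", "TO", "VB", "DT"]
tagger_b = ["IN", "PRP", "VB", "TO", "VB", "DT"]
tagger_c = ["IN", "PRP", "VBP", "TO", "VB", "DT"]

print(find_disagreements(tokens, tagger_a, tagger_b, tagger_c))
```

With the invented sequences above, the only position reported is index 2, "want", with the tags VBP, VB and VBP. The positions found this way are the candidates for manual disambiguation.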

• A preliminary examination of the mismatches between the native and learner POS categories suggests 4 main types of mismatch.

• The mismatches are discussed on the basis of the 3 sources of information handled by automatic POS taggers in the selection of tags for tokens:

– Lexical look-up: token’s stem,

– Morphology: token’s derivational and inflectional markings, and

– Distribution: token’s syntactic context.

3. Mismatches in POS classification variables

Case 1. Stem-Distribution mismatch

(1) You can find a big vary of beautiful beaches […]
Stem: Verb; Distribution: Noun

(2) They are very kind and friendship […]
Stem: Noun; Distribution: Adjective; Morphology: Noun

Case 2. Stem-Distribution and Stem-Morphology mismatch

(3) […] one of the favourite places to visit for foreigns.
Stem: Adjective; Distribution: Noun; Morphology: Noun

(4) […] to be choiced for a job […]
Stem: Noun; Distribution: Verb; Morphology: Verb

Case 3. Stem-Morphology mismatch

(5) […] this film is one of the bests ever.
Stem: Adjective; Distribution: Adjective; Morphology: Noun

(6) […] television, radio are very subjectives […]
Stem: Adjective; Distribution: Adjective; Morphology: Noun

Case 4. Distribution-Morphology mismatch

(7) […] for almost every jobs nowadays.
Stem: Noun; Distribution: Noun Sing; Morphology: Noun Pl

(8) […] it has grew up a lot especially since 1996 […]
Stem: Verb; Distribution: Verb PP; Morphology: Verb PT
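
The four cases can be read as comparisons between the categories that the three sources of evidence suggest for a token. The following Python sketch encodes one possible reading and is not the authors' procedure. It assumes that the labels under examples (1) to (8) are given in the order stem, distribution, morphology, that the stem is the point of comparison for the coarse category (the first word of a label), and that distribution and morphology are compared on the full label only when both agree with the stem. The function names are hypothetical.

```python
def coarse(label):
    """Coarse category of a label, e.g. 'Noun' for 'Noun Pl'."""
    return label.split()[0]


def classify(stem, distribution, morphology=None):
    """Return the mismatch types found among the three sources of evidence.

    morphology is None when no category is recorded for the token's marking.
    """
    found = []
    if coarse(stem) != coarse(distribution):
        found.append("stem-distribution")
    if morphology is not None and coarse(stem) != coarse(morphology):
        found.append("stem-morphology")
    # Only when the stem agrees with both: compare the finer-grained labels.
    if not found and morphology is not None and distribution != morphology:
        found.append("distribution-morphology")
    return found


examples = {
    "vary": ("Verb", "Noun", None),                   # (1)
    "foreigns": ("Adjective", "Noun", "Noun"),        # (3)
    "bests": ("Adjective", "Adjective", "Noun"),      # (5)
    "jobs": ("Noun", "Noun Sing", "Noun Pl"),         # (7)
}

for token, evidence in examples.items():
    print(token, classify(*evidence))
```

Under these assumptions the sketch assigns "vary" to Case 1, "foreigns" to Case 2, "bests" to Case 3 and "jobs" to Case 4, and returns an empty list when all three sources agree.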

4. POS tagging learner data and deviances

Not all learner errors demand special attention in POS-tagging:

(9) […] Internet can modificate […]

(10) He runned to by one […]

(11) […] The 11th March cames to out minds.

(12) Childrens spend so much time […]

(13) […] people shouldn’t be menospreciated […]

5. Conclusions

• Linguistic annotation of learner data is a powerful means to gain access to learner properties with a view to conducting theoretical and applied research.

• Application of native automatic POS-taggers is a sensible point of departure.

• However, for linguistic annotations to be fully relevant in learner corpus research, annotation should capture the properties of learner language systematically.

• Adaptation of existing native POS-tagsets to learner data specifications seems necessary.

References

de Haan, P. 2000. Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus Linguistics and Linguistic Theory (pp. 69-79). Amsterdam: Rodopi.

Díaz Negrillo, A. 2007. A Fine-Grained Error Tagger for Learner Corpora. Unpublished Ph.D. thesis, University of Jaén, Jaén.

Díaz Negrillo, A. 2009. EARS: A User’s Manual. Munich: LINCOM.

Thouësny, S. 2009. Increasing the reliability of a part-of-speech tagging tool for use with learner language. Paper presented at the Automatic Analysis of Learner Language (AALL’09) Workshop, Tempe, AZ.

van Rooy, B. & Schäfer, L. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20, 325-335.

van Rooy, B. & Schäfer, L. 2003. An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (Eds.), Proceedings of the Corpus Linguistics 2003 Conference Lancaster University (UK), 28-31 March 2003. Vol. 16 (pp. 835-844). Lancaster: UCREL, Lancaster University.
