annotation of corpora


Upload: bryar-powers

Post on 03-Jan-2016


Page 1: Annotation of corpora

Annotation of corpora

• A. Part-of-speech tagging

• B. Syntactic annotation

• C. Semantic annotation

• D. Discourse annotation

• E. Pragmatic annotation

Page 2: Annotation of corpora

Annotation of corpora

• perfectly plain: produced by scanning; no information about text (usually, not even edition)

• marked up for formatting attributes: e.g. page breaks, paragraphs, font sizes, italics, etc.

• annotated with identifying information, e.g. edition date, author, genre, register, etc.

• annotated for part of speech, syntactic structure, discourse information, etc.

Page 3: Annotation of corpora

A. Part-of-speech tagging

LOB sample with POS tagging

A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.

A01 3 ^ by_IN Trevor_NP Williams_NP ._.

A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN

A01 4 nominating_VBG any_DTI more_AP labour_NN

A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN

A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
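A minimal sketch of how a line in this horizontal word_TAG format could be read into (word, tag) pairs. The function name `parse_lob_line` is an assumption made for illustration; it is not part of any LOB distribution, and the sketch only covers the layout visible in the sample above.

```python
def parse_lob_line(line):
    """Split one LOB-style line into its reference and (word, tag) pairs."""
    fields = line.split()
    text_id, line_no = fields[0], int(fields[1])
    pairs = []
    for field in fields[2:]:
        if "_" not in field:
            # Markers such as the sentence-start sign "^" carry no tag.
            continue
        # Split at the LAST underscore so punctuation like "._." works.
        word, tag = field.rsplit("_", 1)
        pairs.append((word, tag))
    return text_id, line_no, pairs


sample = "A01 3 ^ by_IN Trevor_NP Williams_NP ._."
print(parse_lob_line(sample))
```

Splitting at the last underscore is a design choice that keeps the tag intact even when the word itself is a punctuation mark.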

Page 4: Annotation of corpora

A. Part-of-speech tagging

• Main steps:

– Divide the text into word tokens (tokenization)

– Select a set of tags

– Apply the tag set to the tokens

• Tokenization:

– orthographic word - morpho-syntactic unit?

– multiwords, e.g., in spite of: label as in_PREP31 spite_PREP32 of_PREP33

– mergers, e.g., clitics as in hasn’t, je t’aime, vendetelo: label as vendete_VERB lo_PRON

– compounds, e.g., tag set: label as tagset_NOUN or tag_NOUN set_NOUN?
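The multiword labels above (PREP31, PREP32, PREP33) can be read as base tag + number of words in the unit + position within the unit. The sketch below applies that reading to a token list. The `MULTIWORDS` table and the function name are hypothetical, and tokens outside a multiword unit are simply left untagged (`None`).

```python
# Hypothetical mini-lexicon of multiword units and their base tags.
MULTIWORDS = {("in", "spite", "of"): "PREP"}


def tag_multiwords(tokens, multiwords=MULTIWORDS):
    """Assign tags like PREP31/PREP32/PREP33 to known multiword units."""
    out = []
    i = 0
    while i < len(tokens):
        for unit, tag in multiwords.items():
            n = len(unit)
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window == unit:
                # Base tag + unit length + 1-based position in the unit.
                for k, tok in enumerate(tokens[i:i + n], start=1):
                    out.append((tok, f"{tag}{n}{k}"))
                i += n
                break
        else:
            # No multiword unit starts here; leave the token untagged.
            out.append((tokens[i], None))
            i += 1
    return out


print(tag_multiwords("He left in spite of her".split()))
```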

Page 5: Annotation of corpora

A. Part-of-speech tagging

• Choice of tag set

• sophisticated, linguistically well grounded set of tags…

• BUT: not automatically applicable without loss of accuracy

• example: come (present plural indicative, imperative, subjunctive): the Lancaster corpus distinguishes these forms from the to-infinitive; LOB and the Brown corpus do not

Page 6: Annotation of corpora

A. Part-of-speech tagging

• tag = word class

• label = alphanumeric characters

• examples:

– preposition: preposition, prep, IN

– singular proper noun: NOUN:prop:sing, N-p-sg, NP1

• logically organized (taxonomy), e.g., in Lancaster, BNC, C7

• presentation: horizontal or vertical

Page 7: Annotation of corpora

A. Part-of-speech tagging

• encoding of tags

• TEI (SGML), e.g., BNC:

<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>, <w AV0>just <w CJS>as <w PNP>they <w VVB>’re <w VVG>passing <w PNP>you <c PUN>. (Garside et al., 1997)
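A small sketch of how this SGML-style encoding could be read back into (element, tag, text) triples with a regular expression. The pattern only covers the `<w TAG>` and `<c TAG>` shapes shown on this slide; it is an illustration, not a general SGML or TEI parser.

```python
import re

# Matches "<w TAG>text" or "<c TAG>text" up to the next "<".
TOKEN = re.compile(r"<([wc]) ([A-Z0-9]+)>([^<]*)")


def parse_bnc(fragment):
    """Return (element, tag, text) triples from a BNC-style fragment."""
    return [(kind, tag, text.strip()) for kind, tag, text in TOKEN.findall(fragment)]


fragment = "<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <c PUN>."
for triple in parse_bnc(fragment):
    print(triple)
```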

Page 8: Annotation of corpora

A. Part-of-speech tagging

• Applying tags to words

• tagging scheme should include a procedure of how to assign tags to words (both for humans and machines)

• need a lexicon: it will say which tags are assignable to which words

• again: ambiguity is a problem
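To make the lexicon and ambiguity points concrete, here is a minimal sketch of a lexicon lookup. The `LEXICON` entries are a hypothetical toy lexicon using LOB-style tags; a real tagger would add a disambiguation step (rules or statistics) after this lookup.

```python
# Hypothetical toy lexicon: word -> tags it may carry.
LEXICON = {
    "a": ["AT"],
    "move": ["NN", "VB"],
    "to": ["TO", "IN"],
    "stop": ["VB", "NN"],
}


def candidate_tags(word, lexicon=LEXICON, default=("NN",)):
    """Look up the assignable tags; unknown words get a default guess."""
    return list(lexicon.get(word.lower(), default))


def ambiguous_tokens(tokens):
    """Return the tokens for which the lexicon offers more than one tag."""
    return [t for t in tokens if len(candidate_tags(t)) > 1]


tokens = ["a", "move", "to", "stop"]
print(ambiguous_tokens(tokens))
```

Three of the four tokens come back as ambiguous, which is why the lookup alone does not amount to tagging.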

Page 9: Annotation of corpora

B. Syntactic annotation

• syntactic annotation = parsed corpora

• purposes:

– training automatic parsers (computational linguistics, e.g. probabilistic parsers - inductive training through extraction of frequency counts)

– extracting information (linguistics, e.g., building a lexicon, investigating subcategorization frames, collocations or other linguistic things, describing sublanguages)
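The "inductive training through extraction of frequency counts" idea can be sketched as relative-frequency estimation of phrase-structure rules. The rule list below is invented toy data, not drawn from any treebank, and the function name is an assumption.

```python
from collections import Counter


def rule_probabilities(rules):
    """Estimate P(rhs | lhs) as count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}


# Toy observations: (left-hand side, right-hand side) pairs.
observed = [
    ("NP", ("AT", "NN")),
    ("NP", ("AT", "NN")),
    ("NP", ("NP", "PP")),
    ("VP", ("VB", "NP")),
]
probs = rule_probabilities(observed)
for (lhs, rhs), p in probs.items():
    print(lhs, "->", " ".join(rhs), round(p, 3))
```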

Page 10: Annotation of corpora

B. Syntactic annotation

• a parsing scheme needs (cf. POS tagging):

– a list of symbols

– definitions of symbols

– description of how to apply symbols to text

• syntactically annotated corpora: tree banks

• examples of tree banks: Penn Treebank, Nijmegen Treebank, Susanne Corpus , Helsinki Constraint Grammar (ENGCG), Lancaster/IBM SEC treebank

Page 11: Annotation of corpora

B. Syntactic annotation

• Parsing

• the (automatic) analysis of texts (sentences) in terms of syntactic categories

[Figure: phrase-structure tree for "Pierre Vinken 61 years old will join the board as an executive director Nov 29"; node labels: S, NP, VP, ADJP, PP]

Page 12: Annotation of corpora

B. Syntactic annotation

• Penn Treebank

• skeleton parsing: partial parse, leaving out the “hard” things (such as PP-attachment)

• phrase structure model (Garside et al., 1997, p.42)

((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will (VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov 29))) .)
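A bracketed skeleton parse like this one can be read into nested lists with a short recursive reader. This is a sketch under the assumption that every labelled bracket starts with its category label (the outermost bracket is unlabelled); the function names are made up for illustration.

```python
def parse_brackets(text):
    """Read a bracketed parse into nested lists; returns top-level trees."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        node = []
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == "(":
                node.append(read())   # nested constituent
            elif tok == ")":
                return node           # constituent complete
            else:
                node.append(tok)      # label or word
        return node

    return read()


def phrase_labels(node):
    """Collect category labels (first string item of each bracket)."""
    found = []
    if node and isinstance(node[0], str):
        found.append(node[0])
    for child in node:
        if isinstance(child, list):
            found.extend(phrase_labels(child))
    return found


parse = ("((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will "
         "(VP join (NP the board) (PP as (NP a nonexecutive director)) "
         "(NP Nov 29))) .)")
trees = parse_brackets(parse)
print(phrase_labels(trees[0]))
```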

Page 13: Annotation of corpora

B. Syntactic annotation

• Penn Treebank

• available through LDC

• size: 3,300,000 words (Feb 97)

• Brown corpus, Wall Street Journal

• in the current phase:

– add function labels (Subj, Obj etc.)

– add null constituents or traces (e.g., It’s easy [t] to eat)

– add indices for coreferences (e.g., Mary[i] saw herself[i] in the mirror)

– discontinuous constituents

– add semantic roles (Agent, Goal etc)

• may get too complex for large-scale reliable analysis…

Page 14: Annotation of corpora

B. Syntactic annotation

• Susanne Corpus

• part of the Brown corpus, 128,000 words

• result of manual analysis

• parsing scheme specified in great detail

• available from Oxford Text Archive:

– sable.ox.ac.uk/ota (http)

– ota.ox.ac.uk/pub/ota/public (ftp)

Page 15: Annotation of corpora

A./B. Demo

• TIGER

• NEGRA

Page 16: Annotation of corpora

C. Semantic annotation

• problem (1): more than one way of referring to a concept, e.g.,

– text analysis: choice of expression may reflect ideologies in the text or relationships between participants in conversation, for example, in doctor-patient interaction: abdomen --- tummy

– information retrieval: a fashion historian seeks information about trousers: trousers --- slacks, shorts, leggings, breeches

--> cf. RECALL in IR

Page 17: Annotation of corpora

C. Semantic annotation

• problem (2): one single word can refer to different concepts, e.g.,

– information retrieval: a fashion historian wants to know about boots: boot --- may refer to shoe, computer, kick, car

--> cf. PRECISION in IR

• so:

– need to identify related words (problem 1)

– need to identify the different senses of a word (problem 2)
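Recall and precision, as referred to in the two problems, can be worked through on a toy example. The document IDs below are invented: a plain search for "boot" returns four documents, only two of which are about footwear, and it misses a third footwear document that uses a different word.

```python
def precision_recall(retrieved, relevant):
    """precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented IDs: d1, d4 = footwear; d2 = computer; d3 = kick;
# d5 = footwear, but described with a synonym, so not retrieved.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d4", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Ambiguity (problem 2) pulls precision down to 2/4, while the unmatched synonym (problem 1) pulls recall down to 2/3.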

Page 18: Annotation of corpora

C. Semantic annotation

• labeling words according to semantic field (word senses) so that you can

• … extract all the related words by querying on the semantic field

• … extract only those instances of ambiguous words with the specific senses you want by querying on the combination of word and semantic field
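The two query types can be sketched over a toy corpus of (word, semantic field) pairs. The field labels (CLOTHING, COMPUTING, MOTION) and the corpus itself are hypothetical and do not come from any real tagging scheme.

```python
# Hypothetical semantically tagged corpus: (word, semantic field).
CORPUS = [
    ("boot", "CLOTHING"),
    ("trousers", "CLOTHING"),
    ("boot", "COMPUTING"),
    ("slacks", "CLOTHING"),
    ("boot", "MOTION"),
    ("breeches", "CLOTHING"),
]


def words_in_field(corpus, field):
    """All distinct words annotated with the given semantic field."""
    return sorted({word for word, f in corpus if f == field})


def instances(corpus, word, field):
    """Positions where the word occurs with the wanted sense."""
    return [i for i, (w, f) in enumerate(corpus) if w == word and f == field]


print(words_in_field(CORPUS, "CLOTHING"))
print(instances(CORPUS, "boot", "CLOTHING"))
```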

Page 19: Annotation of corpora

C. Semantic annotation

• semantic fields: sense relations and other kinds of relations (e.g., part-of, related-to etc.)

• annotation (cf. PoS tagging):

– definition of the tagging scheme (labels and their meanings)

– guidelines for applying the tagging scheme

– in semantics: this is not as easy and straightforward as for PoS tagging!

– requirements:

  • should make linguistic/psycholinguistic sense

  • should be able to account for the vocabulary in the corpus exhaustively

  • should be suitable for texts from different periods and registers (comprehensiveness)

  • should preferably have a hierarchical structure

Page 20: Annotation of corpora

C. Semantic annotation

• multiple membership, e.g., deepened: color and change/remain

• multiword units, e.g., stubbed out: encoded as two separate words, but belonging together

• one recent ambitious attempt at a taxonomy of such semantic relations (sense relations, thesaurus-type relations, semantic fields etc.): WORDNET at www.cogsci.princeton.edu/~wn/

• you can try it online: www.cogsci.princeton.edu/~wn/online/

Page 21: Annotation of corpora
Page 22: Annotation of corpora

C. Semantic annotation

• How to do it?

– manually

– computer-assisted (need at least a computer-readable lexicon and a disambiguation process, similar to PoS tagging)

– fully automatic (not really feasible):

  • semantic analysis is even harder than syntactic parsing

  • no integrated ‘parse’ of meaning possible at the present time

Page 23: Annotation of corpora

D. Discourse annotation

• discourse features: what are they?

• typically: cohesion and coherence

• coherence: what makes a text hang together in terms of content

• cohesion: the means of making a text hang together

– reference, substitution, ellipsis, conjunctive relations (cause, result, effect etc.), thematic development

• Halliday & Hasan, 1976

Page 24: Annotation of corpora

D. Discourse annotation

• example: anaphoric relations in the IBM/Lancaster corpus (UCREL)

• try to build up something like an ‘anaphoric treebank’

• what are anaphoric relations?

– links between a proform and an antecedent

– example: The married couple said that they were happy with their lot. (they and their are proforms; The married couple is the antecedent)

Page 25: Annotation of corpora

D. Discourse annotation

• anaphoric annotation in UCREL: categories used are based on Halliday & Hasan, 1976

• example of annotation: (1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the U.S Supreme Court to allow <REF=1 him to retain <REF=1 his American citizenship. (2 The Hartford Courant 2) said…

• symbols: (1), (2)… = antecedent; < = anaphoric (> = cataphoric); REF = central pronoun
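A rough sketch of how links could be pulled out of text annotated in this style. The two regular expressions only cover the patterns visible in the example, numbered antecedent brackets and anaphoric `<REF=n` marks; cataphoric `>` links and the other categories of the scheme are not handled.

```python
import re

# "(1 Feodor Baumenk 1)" -> id "1", phrase "Feodor Baumenk"
ANTECEDENT = re.compile(r"\((\d+) (.+?) \1\)")
# "<REF=1 him" -> id "1", proform "him"
ANAPHOR = re.compile(r"<REF=(\d+) (\w+)")


def anaphoric_links(text):
    """Pair each marked proform with the phrase of its antecedent."""
    antecedents = {num: phrase for num, phrase in ANTECEDENT.findall(text)}
    return [(word, antecedents.get(num)) for num, word in ANAPHOR.findall(text)]


text = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked "
        "the U.S Supreme Court to allow <REF=1 him to retain <REF=1 his "
        "American citizenship. (2 The Hartford Courant 2) said")
print(anaphoric_links(text))
```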

Page 26: Annotation of corpora

D. Discourse annotation

• few corpora annotated for discourse features…

• how to do it?

– manually

– computer-assisted: either interactive hand annotation, using some kind of specialized editor or automatic annotation with the possibility of hand correction or disambiguation

– a tool supporting annotation of anaphora: XANADU in Lancaster

Page 27: Annotation of corpora

E. Pragmatic annotation

• anything beyond sentences and discourse: contexts of situation and culture

• examples of things people look at in pragmatics

– carry-on signals in conversation (e.g., Stenstroem 87): which functions do carry-on signals such as “well”, “you know” etc. have in conversation?

– speech acts (e.g., Stiles 92): speech act types in conversation, e.g., in doctor-patient interactions

PATIENT: I have the headaches to the point that I have to vomit (D)
DOCTOR: Mm-hm (K)
PATIENT: Then I have to go to bed and I sleep for a while (E)

D = Disclosure, K = Acknowledgment, E = Edification
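Once turns carry speech-act codes like these, simple counts per speaker become possible. The sketch assumes one turn per line in the form `SPEAKER: utterance (CODE)`; that line layout and the function name are assumptions for illustration, and only the three codes given on the slide are spelled out.

```python
import re
from collections import Counter

CODES = {"D": "Disclosure", "K": "Acknowledgment", "E": "Edification"}
# "SPEAKER: utterance (X)" with a single-letter code at the end.
TURN = re.compile(r"^(\w+): (.*) \(([A-Z])\)$")


def tally_speech_acts(lines):
    """Count (speaker, speech-act type) pairs in a coded transcript."""
    counts = Counter()
    for line in lines:
        match = TURN.match(line.strip())
        if match:
            speaker, _utterance, code = match.groups()
            counts[(speaker, CODES.get(code, code))] += 1
    return counts


transcript = [
    "PATIENT: I have the headaches to the point that I have to vomit (D)",
    "DOCTOR: Mm-hm (K)",
    "PATIENT: Then I have to go to bed and I sleep for a while (E)",
]
counts = tally_speech_acts(transcript)
for key, n in counts.items():
    print(key, n)
```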

Page 28: Annotation of corpora

E. Pragmatic annotation

• how to do it?

– manually

– computer-assisted: ?

– fully-automatic: -

• You have to use your imagination!

• Stenstroem example: Can be done with a concordance program because it’s essentially word-based

• Stiles example: would probably have to be done manually (then use a concordance program on the annotated texts?)
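For the word-based case, a concordance amounts to a keyword-in-context (KWIC) search. Below is a minimal sketch; the sample utterance is invented, and a real concordance program would offer sorting and much larger contexts.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit."""
    key = keyword.lower().split()
    n = len(key)
    hits = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            hits.append((left, " ".join(tokens[i:i + n]), right))
    return hits


# Invented sample utterance.
tokens = "well I think you know it was fine".split()
print(kwic(tokens, "you know", width=2))
print(kwic(tokens, "well"))
```

Multiword signals such as "you know" are matched as token sequences, which is why the keyword is split before comparison.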

Page 29: Annotation of corpora

Higher-level annotation: tools

• Tools that support specialized analysis, such as

– specialized editors, e.g., Xanadu for anaphoric relations

– specialized in terms of linguistic models,

  • e.g., Sys-Tools for Systemic Functional Grammar (minerva.ling.mq.edu.au/) (http://cirrus.dai.ed.ac.uk:8000/Coder/index.html)

  • e.g., RSTTools for rhetorical relations analysis (www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html)

• Tools that support various kinds of analysis (but not quite everything you might want to do):

– TATOE (www.darmstadt.gmd.de/~rostek/tatoe.htm)

Page 30: Annotation of corpora

References

• Garside R., G. Leech & A. McEnery (eds.), 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London, Longman.

• Fellbaum C. (ed.), 1998. WordNet: An Electronic Lexical Database. MIT Press.

• Halliday M.A.K. & R. Hasan, 1976. Cohesion in English. London, Longman.

• Mindt, 1991. Syntactic evidence for semantic distinctions in English. In Aijmer & Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London, Longman.

• Stenstroem, 1987. Carry-on signals in English conversation. In Meijs (ed.), Corpus Linguistics and Beyond. Amsterdam, Rodopi.

• Stiles, 1992. Describing talk: a taxonomy of verbal response modes. Beverly Hills, Sage.