Post on 30-Dec-2015
Annotation of Grammatemes in the Prague Dependency Treebank 2.0
Magda Razímová
Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic
{razimova,zabokrtsky}@ufal.mff.cuni.cz
LREC 2006, Annotation Science razimova@ufal.mff.cuni.cz

Outline of the talk
• Introduction
• Prague Dependency Treebank 2.0
• Annotation of grammatemes
    • Motivation
    • Grammateme attributes
    • Two-level node hierarchy
    • Examples of grammateme value assignment
• Final remarks

Introduction
• grammatemes in the PDT 2.0
    • one type of attribute of the nodes of a deep-syntactic tree
    • capture morphological meanings that are semantically indispensable
        • number for nouns, degree of comparison for adjectives, tense for verbs, etc.
• annotation of grammatemes
    • the last task in the PDT 2.0 annotation procedure
    • possible to assign automatically, profiting from the already available annotation:
        • annotation of the same sentence at the lower layers
        • already available components of the t-tree (tree structure, types of dependency relations, co-reference, etc.)

Historical background and development of the PDT project
• mid-1960s: Praguian Functional Generative Description (Petr Sgall et al.)
• 1994: Czech National Corpus
• 1995: PDT started
• 1998: PDT 0.5 pre-release
• 2001: PDT 1.0 released by LDC
    • manual annotation of morphology and surface syntax
• 2006: PDT 2.0 to be released by LDC
    • interlinked morphological, surface-syntactic and complex deep-syntactic annotation
        • including annotation of grammatemes

Layers of annotation
• tectogrammatical layer
    • deep-syntactic dependency tree
• analytical layer
    • surface-syntactic dependency tree
• morphological layer
    • m-lemma and m-tag associated with each token
• word layer
    • original text, segmented on word boundaries
Example: lit.: He-was would went toforest.
He would have gone to the forest.

Interlinking the layers
• any unit at any layer has an ID that is unique within the PDT
• neighboring layers are connected by top-down pointers
Example: lit.: He-was would went toforest.
He would have gone to the forest.
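The interlinking can be pictured with a small data structure. The following is only a sketch under stated assumptions: the class `Unit`, the helper names and the IDs (`t1`, `a1`, ...) are hypothetical and do not reproduce the PDT file format; the labels are the literal English glosses from the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    uid: str    # hypothetical ID, unique across all layers
    layer: str  # "w", "m", "a" or "t"
    label: str
    lower: List[str] = field(default_factory=list)  # top-down pointers (IDs one layer down)


def build_index(units: List[Unit]) -> Dict[str, Unit]:
    """Index units by ID and reject duplicates, since IDs must be unique."""
    index: Dict[str, Unit] = {}
    for unit in units:
        if unit.uid in index:
            raise ValueError(f"duplicate ID: {unit.uid}")
        index[unit.uid] = unit
    return index


def descend(index: Dict[str, Unit], uid: str, target_layer: str) -> List[Unit]:
    """Follow top-down pointers from one unit to all linked units at target_layer."""
    unit = index[uid]
    if unit.layer == target_layer:
        return [unit]
    found: List[Unit] = []
    for lower_uid in unit.lower:
        found.extend(descend(index, lower_uid, target_layer))
    return found


# One t-node for the complex verb form, linked down to three word tokens.
INDEX = build_index([
    Unit("w1", "w", "He-was"), Unit("w2", "w", "would"), Unit("w3", "w", "went"),
    Unit("m1", "m", "He-was", ["w1"]), Unit("m2", "m", "would", ["w2"]),
    Unit("m3", "m", "went", ["w3"]),
    Unit("a1", "a", "He-was", ["m1"]), Unit("a2", "a", "would", ["m2"]),
    Unit("a3", "a", "went", ["m3"]),
    Unit("t1", "t", "go", ["a1", "a2", "a3"]),
])
print([u.label for u in descend(INDEX, "t1", "w")])
```

Because pointers only run top-down, any information from a lower layer can be reached from a t-node by descending, which is what the grammateme assignment relies on.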

Size of the PDT 2.0 data (i)
• 7,129 manually annotated textual documents
• all documents annotated at the m-layer
    • 16,065 sentences with 1,960,657 tokens
• 75 % of the m-layer data annotated at the a-layer
    • 5,338 documents, 87,980 sentences, 1,504,847 tokens
• 44 % of the m-layer data annotated also at the t-layer
    • 3,168 documents, 49,442 sentences, 833,357 tokens
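The coverage percentages can be recomputed from the counts above. This is a small arithmetic sketch; the variable and function names are illustrative, and only the document and token counts quoted on the slide are used. It suggests that the rounded 75 % and 44 % figures correspond to document counts, while token-based shares differ slightly.

```python
# Counts taken from the slide; names are illustrative.
M_LAYER = {"documents": 7129, "tokens": 1960657}
A_LAYER = {"documents": 5338, "tokens": 1504847}
T_LAYER = {"documents": 3168, "tokens": 833357}


def share(part: dict, whole: dict, unit: str) -> float:
    """Percentage of `whole` covered by `part`, rounded to one decimal place."""
    return round(100 * part[unit] / whole[unit], 1)


for name, layer in (("a-layer", A_LAYER), ("t-layer", T_LAYER)):
    for unit in ("documents", "tokens"):
        print(f"{name}, {unit}: {share(layer, M_LAYER, unit)} % of the m-layer data")
```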

Size of the PDT 2.0 data (ii)
• training data (80 %)
• development test data (10 %)
• evaluation test data (10 %)

M-layer
• sentence represented as a sequence of tokens
• each token lemmatized and tagged (attributes m-lemma and m-tag)
• positional m-tag: 15 characters
    1. (main) POS
    2. detailed POS
    3. gender
    4. number
    5. case
    ...
Example: lit.: Some contours problem(gen) reflexive_pronoun though after resurgence(instr) Havel's speech(instr) they-seem to-be clearer.
Some contours of the problem seem to be clearer after the resurgence by Havel's speech.
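A positional tag can be read by slicing it at fixed positions. The sketch below names only the five positions listed on the slide and keeps the remaining ten characters unparsed; the function name `split_mtag` is hypothetical, and the sample tag is the one quoted later in the talk for les (forest).

```python
POSITION_NAMES = ["pos", "detailed_pos", "gender", "number", "case"]
TAG_LENGTH = 15


def split_mtag(tag: str) -> dict:
    """Split a 15-character positional m-tag into its named leading positions."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}")
    fields = dict(zip(POSITION_NAMES, tag))  # one character per named position
    fields["rest"] = tag[len(POSITION_NAMES):]  # positions 6-15, not named on the slide
    return fields


print(split_mtag("NNIS2-----A----"))
```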

A-layer
• rooted ordered tree with labeled nodes and edges
• a-nodes
    • one token of the m-layer is represented by exactly one a-node
    • labeled with a-lemmas (identical with word forms)
• a-edges
    • represent dependency relations (Sb, Obj, Adv, Atr)
    • represent non-dependency relations (Coord)
    • the analytical function attribute appears as an a-node attribute
Example: Some contours of the problem seem to be clearer after the resurgence by Havel's speech.

T-layer
• rooted ordered tree with labeled nodes and edges
• t-nodes
    • complex typed feature structures
    • represent auto-semantic words
    • function words do not have nodes of their own
    • artificially added nodes
• t-edges
    • dependency relations (functor)
    • non-dependency relations (coordination constructions)
    • the functor attribute appears as a t-node attribute
Example: Some contours of the problem seem to be clearer after the resurgence by Havel's speech.

Areas of annotation at the t-layer
• tree structure
• t-lemma attribute
• dependency relation (functor and subfunctor)
• topic-focus attributes
• co-reference attributes
• node typing attributes (nodetype and sempos)
• grammateme attributes
Example: Všem bylo předáno osvědčení o úspěšném absolvování kurzu.
lit.: [To] all was handed over a certificate of successful graduation from the course.
They all received a certificate of successful graduation from this course.

Grammatemes: Motivation
• grammatemes: t-node attributes representing inflectional information that is semantically indispensable (morphological meanings such as number for nouns, tense for verbs, degree of comparison for adjectives, etc.)
• semantically irrelevant morphological meanings are not part of the t-layer (e.g. case for nouns)
Example trees (t-lemma, functor, grammatemes):
Peter met her youngest brother.
    meet PRED tense=ant
        Peter ACT
        brother PAT number=sg
            #PersPron APP
            young RSTR degree=sup
Peter will meet her young brothers.
    meet PRED tense=post
        Peter ACT
        brother PAT number=pl
            #PersPron APP
            young RSTR degree=pos
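The point of the two example trees can be checked mechanically: they share t-lemmas, functors and tree shape and differ only in grammateme values. The following is a sketch; the dictionary encoding and the parent links (Peter and brother under meet, #PersPron and young under brother) are an assumed reading of the figure, and the helper names are hypothetical.

```python
def make_tree(tense: str, number: str, degree: str) -> dict:
    """Encode one example tree; parent links are an assumed reading of the figure."""
    return {
        "meet": {"functor": "PRED", "parent": None, "gram": {"tense": tense}},
        "Peter": {"functor": "ACT", "parent": "meet", "gram": {}},
        "brother": {"functor": "PAT", "parent": "meet", "gram": {"number": number}},
        "#PersPron": {"functor": "APP", "parent": "brother", "gram": {}},
        "young": {"functor": "RSTR", "parent": "brother", "gram": {"degree": degree}},
    }


TREE_1 = make_tree("ant", "sg", "sup")   # Peter met her youngest brother.
TREE_2 = make_tree("post", "pl", "pos")  # Peter will meet her young brothers.


def skeleton(tree: dict) -> dict:
    """Everything except grammatemes: functor and parent of each node."""
    return {lemma: (node["functor"], node["parent"]) for lemma, node in tree.items()}


def grammateme_diff(first: dict, second: dict) -> dict:
    """Per node, the grammatemes whose values differ between the two trees."""
    diff = {}
    for lemma in first:
        a, b = first[lemma]["gram"], second[lemma]["gram"]
        changed = {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}
        if changed:
            diff[lemma] = changed
    return diff


print(skeleton(TREE_1) == skeleton(TREE_2))
print(grammateme_diff(TREE_1, TREE_2))
```

Without grammatemes the two sentences would receive identical t-layer representations, which is the motivation for annotating them.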

Grammateme attributes
• 15 grammatemes:
    • indeftype, numertype, negation, degcmp
    • tense, aspect, verbmod, deontmod, dispmod, resultative, iterativeness
    • number, gender, person, politeness

Conditioned presence/absence of grammatemes
• obviously, not all grammatemes are relevant for all nodes
    • no tense for dog, no degree of comparison for (he) waits, etc.
• how to formally declare the presence/absence of a given grammateme attribute in a given node?
    • the need for node typing
• chosen solution: two-level typing
    • 1st level: 8 more general types of nodes
        • grammatemes are relevant for only one of them
    • 2nd level: 19 more specific subtypes, corresponding to detailed semantic parts of speech

Presence/absence of grammateme values: Two-level t-node hierarchy
• 1st level: attribute nodetype
• 2nd level: attribute sempos
[Figure: t-nodes divide into the nodetype values complex, atom, qcomplex, list, coap, dphr, fphr and root; complex nodes divide into semantic verbs, semantic nouns, semantic adverbs and semantic adjectives; the subtree of semantic adjectives is shown with its sempos values, relevant grammatemes and example lemmas.]

First level of the hierarchy: attribute nodetype
• 8 attribute values: root | qcomplex | list | atom | coap | dphr | fphr | complex
• fully automatic annotation, using
    • the tree structure → root
    • t-attributes
        • t-lemma → qcomplex | list
        • functor → atom | coap | dphr | fphr
    • else → complex
Example: Levnější benzín na Východě, dražší na Západě
Cheaper gasoline in the East, more expensive one in the West
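The decision order for nodetype (tree structure, then t-lemma, then functor, else complex) can be sketched as a cascade. The slide does not list which t-lemmas and functors trigger which value, so the sets below are illustrative assumptions only, and `assign_nodetype` is a hypothetical name rather than the procedure actually used.

```python
# Illustrative assumptions; the slide does not enumerate these sets.
LIST_LEMMAS = {"#Idph", "#Forn"}
QCOMPLEX_LEMMAS = {"#Gen", "#Cor"}
FUNCTOR_TO_NODETYPE = {
    "DPHR": "dphr",
    "FPHR": "fphr",
    "CONJ": "coap",
    "APPS": "coap",
    "RHEM": "atom",
    "PREC": "atom",
}


def assign_nodetype(is_root: bool, t_lemma: str, functor: str) -> str:
    """Cascade: tree structure first, then t-lemma, then functor, else complex."""
    if is_root:
        return "root"
    if t_lemma in LIST_LEMMAS:
        return "list"
    if t_lemma in QCOMPLEX_LEMMAS:
        return "qcomplex"
    return FUNCTOR_TO_NODETYPE.get(functor, "complex")


print(assign_nodetype(False, "gasoline", "ACT"))
```

Making complex the fallback means that only nodes with no special status reach the second level of typing.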

Second level of the hierarchy: attribute sempos
• only complex nodes
• grouped into semantic parts of speech
• 19 values of the attribute sempos: n. ... | adj. ... | adv. ... | v. ...
• fully automatic annotation, using
    • m-tag
    • t-lemma
    • other t-attributes
• the sempos value delimits the set of relevant grammatemes
Example: semantic adjectives (sempos value, relevant grammatemes, example lemmas)
• denotative: adj.denot (degcmp, negation); hezký 'nice', psí 'dog's', čokoládový 'chocolate'
• pronominal
    • definite demonstrative: adj.pron.def.demon (Ø); ten (učitel) 'this (teacher)', takový 'such'
    • indefinite: adj.pron.indef (indeftype); jaký 'what kind of', který 'which'
• quantificative
    • definite: adj.quant.def (numertype); tři (děti) 'three (children)', tolik 'so many'
    • indefinite: adj.quant.indef (numertype, indeftype); kolik 'how many'
    • gradable: adj.quant.grad (numertype, degcmp); hodně 'much', málo 'little'
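The way a sempos value delimits the relevant grammatemes can be expressed as a lookup table. The sketch below covers only the six adjectival subtypes shown on the slide; the function `check_node` and the idea of reporting unexpected and missing attributes are illustrative assumptions, not part of the PDT tooling.

```python
# sempos value -> relevant grammatemes, as shown for semantic adjectives.
RELEVANT_GRAMMATEMES = {
    "adj.denot": {"degcmp", "negation"},
    "adj.pron.def.demon": set(),
    "adj.pron.indef": {"indeftype"},
    "adj.quant.def": {"numertype"},
    "adj.quant.indef": {"numertype", "indeftype"},
    "adj.quant.grad": {"numertype", "degcmp"},
}


def check_node(nodetype, sempos, grammatemes):
    """Return (unexpected, missing) grammateme names for one t-node.

    Only complex nodes carry grammatemes; other nodetypes allow none.
    Raises KeyError for sempos values outside the adjectival sample above.
    """
    allowed = RELEVANT_GRAMMATEMES[sempos] if nodetype == "complex" else set()
    present = set(grammatemes)
    return sorted(present - allowed), sorted(allowed - present)


print(check_node("complex", "adj.denot", {"degcmp": "sup", "tense": "ant"}))
```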

Values of nodetype and sempos in the PDT 2.0: an overview
[Table: nodetype values and sempos values]

Grammateme value assignment
• n-tred environment for processing the PDT data (http://ufal.mff.cuni.cz/~pajas)
• automatic annotation
    • 2000 lines of Perl code
        • crucial importance of inter-layer links: use of t-attributes, a-attributes, m-attributes
    • rules using a special economical notation
        • 2000 lines written in a text file
    • lexical resources
        • special-purpose lists of adverbs / verbs
• manual annotation of special problems
    • two annotators working in parallel
    • simplified annotation environment: treebank positions extracted into simple HTML forms

Simple HTML-based environment for manual annotation
Example: lit.: The difference [you] would have to pay yourself.

Automatic vs. manual assignment
• at the t-layer of the PDT 2.0: 1,594,333 grammateme values assigned at 550,947 complex nodes
• manually assigned:
    • 17,520 grammateme values
    • inter-annotator agreement: 70-85 %
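Two derived figures put these counts in proportion: the share of manually assigned values and the average number of grammateme values per complex node. This is plain arithmetic on the numbers above; the variable names are illustrative.

```python
TOTAL_VALUES = 1_594_333   # grammateme values assigned at the t-layer
COMPLEX_NODES = 550_947    # complex nodes carrying them
MANUAL_VALUES = 17_520     # values assigned manually

manual_share = 100 * MANUAL_VALUES / TOTAL_VALUES  # in percent
values_per_node = TOTAL_VALUES / COMPLEX_NODES

print(f"manually assigned: {manual_share:.1f} % of all values")
print(f"average: {values_per_node:.2f} grammateme values per complex node")
```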

Grammateme assignment and m-tag
• number grammateme: values sg | pl
• assigned automatically using the m-tag
    • e.g. les (forest)
        • m-layer: tag NNIS2-----A----
        • t-layer: number=sg
• manual assignment
    • nouns with only plural forms (identified by a list extracted from the machine-readable dictionary of standard Czech)
    • e.g. dveře (door/doors)
        • m-layer: always plural
        • t-layer: annotator decision sg | pl
Example: node with sempos n.denot and number=sg
lit.: He-was would went toforest.
He would have gone to the forest.
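The rule for the number grammateme can be sketched as follows. The mapping of the tag character P to pl, the one-word list standing in for the dictionary-derived list of plural-only nouns, and the sample tag for dveře are assumptions made for illustration; only the tag for les and its value number=sg come from the slide.

```python
NUMBER_INDEX = 3  # zero-based index of the 4th tag position (number)
MTAG_TO_NUMBER = {"S": "sg", "P": "pl"}  # "P" -> pl is an assumption
PLURAL_ONLY_NOUNS = {"dveře"}  # stand-in for the dictionary-derived list


def assign_number(m_lemma: str, m_tag: str):
    """Return sg/pl, or None when the value is not decided automatically."""
    if m_lemma in PLURAL_ONLY_NOUNS:
        return None  # m-tag is always plural here; an annotator decides sg | pl
    return MTAG_TO_NUMBER.get(m_tag[NUMBER_INDEX])


print(assign_number("les", "NNIS2-----A----"))
print(assign_number("dveře", "NNFP1-----A----"))  # hypothetical tag
```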

Grammateme assignment and tree structure
• mood grammateme verbmod: values ind | imp | cdn
• assigned automatically
    • one-word verbal forms
        • e.g. jde (goes)
        • m-tag information
    • verbal forms consisting of several word forms (represented by a single node at the t-layer)
        • e.g. byl by šel (would have gone)
        • the corresponding a-layer subtree involves the node by
        • m-tag of the node by
Example: node with sempos v and verbmod=cdn
lit.: He-was would went toforest.
He would have gone to the forest.
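The verbmod rule can be sketched by inspecting the m-tags of all word forms linked to one verbal t-node. The sketch assumes that the detailed POS character c marks the conditional auxiliary by and i marks an imperative form; these codes and the sample tags are assumptions, not taken from the slides, and `assign_verbmod` is a hypothetical name.

```python
DETAILED_POS_INDEX = 1  # zero-based index of the 2nd tag position


def assign_verbmod(tokens) -> str:
    """tokens: (word form, m-tag) pairs of all word forms behind one verbal t-node."""
    detailed = {tag[DETAILED_POS_INDEX] for _, tag in tokens}
    if "c" in detailed:  # assumed code of the conditional auxiliary "by"
        return "cdn"
    if "i" in detailed:  # assumed code of imperative forms
        return "imp"
    return "ind"


# Sample tags are illustrative assumptions.
print(assign_verbmod([("jde", "VB-S---3P-AA---")]))
print(assign_verbmod([
    ("byl", "VpYS---XR-AA---"),
    ("by", "Vc-X---3-------"),
    ("šel", "VpYS---XR-AA---"),
]))
```

For a one-word form the token list has a single element, so the same function covers both cases described on the slide.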

Grammateme assignment and co-reference
• the grammatemes gender, number and person in relative pronouns are left underspecified (value inher), since they are imposed only by grammatical agreement (and thus can be "inherited" from the antecedents)
Example: Ze zbytku suroviny mlékárna vyrábí sušené mléko, které vyváží do Asie a Jižní Ameriky.
lit.: From remainder of raw material the dairy produces dried milk, which [it] exports to Asia and South America.
From the rest of the material, the dairy produces dried milk, which is exported [by it] to Asia and South America.

Final remarks
• achievements:
    • two-level typing of t-layer nodes, which makes it possible to formally capture the presence/absence of individual grammatemes in a given node
    • automatic procedure for capturing the node classification and the grammateme attributes
    • verification of the procedure on large-scale data
• experience:
    • it was the existence of the lower annotation layers and of the inter-layer links that made it possible to make the procedure of grammateme assignment more or less automatic
http://ufal.mff.cuni.cz/pdt2.0/