Prague Dependency Treebank(s)
Workshop at LSA2011, Part II
Jan Hajič, Zdeňka Urešová
Institute of Formal and Applied Linguistics
School of Computer Science
Faculty of Mathematics and Physics
Charles University, Prague
Czech Republic
July 30, 2011 LSA 2011 Prague Dependency Treebanks II
Part II - Syntax and Semantics
- Tectogrammatical representation
  - Valency lexicon
- Languages
  - Czech, Arabic and English
- Technical issues
  - Annotation scheme and format
  - Tools for annotation
  - Applications
- Summary, pointers, conclusion
PDT Annotation Layers
- L0 (w) Words (tokens)
  - automatic segmentation and markup only
- L1 (m) Morphology
  - Tag (full morphology, 13 categories), lemma
- L2 (a) Analytical layer (surface syntax)
  - Dependency, analytical dependency function
- L3 (t) Tectogrammatical layer (“deep” syntax)
  - Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
Layer 3 (t-layer): Tectogrammatical
- Underlying (deep) syntax
- 4 sublayers (integrated):
  - dependency structure, (detailed) functors
    - valency annotation
  - topic/focus and deep word order
  - coreference (mostly grammatical only)
  - all the rest (grammatemes):
    - detailed functors
    - underlying gender, number, ...
- Total 39 attributes (vs. 5 at m-layer, 2 at a-layer)
Analytical vs. Tectogrammatical
Underlying verb + tense
Deep function
Elided Actor in
Prepositions out
Another ellipsis...
(TR: sublayer 1 only shown)
Tectogrammatical Functors
- “Actants”: ACT, PAT, EFF, ADDR, ORIG
  - modify: verbs, nouns, adjectives
  - cannot repeat in a clause, usually obligatory
- Free modifications (~ 50), semantically defined
  - can repeat; optional, sometimes obligatory
  - Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ...
- Special
  - Coordination, Rhematizers, Foreign phrases, ...
syntactic semantic
Tectogrammatical Example
- Analytical verb form:
  - (he) allowed would-be to-be enrolled
  - směl by být zapsán
- Additional attributes (grammatemes): conditional + “allow”
- Collapsed
Tectogrammatical Example
- Passive construction (action):
  - (The) book has-been translated [by Mr. X]
  - Kniha byla přeložena
- Disappeared, Added
Tectogrammatical Example
- Object:
  - (he) gave him a-book
  - dal mu knihu
- Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame
Tectogrammatical Example
- Incomplete phrases:
  - Peter works well, but Paul badly
  - Petr pracuje dobře, ale Pavel špatně
- Added
Deep Word Order, Topic/Focus
- Example: Baker bakes rolls. vs. Baker(IC) bakes rolls.
- Analytical dep. tree:
Deep Word Order, Topic/Focus
- Deep word order:
  - from “old” information to the “new” one (left-to-right) at every level (head included)
  - projectivity by definition (almost...)
    - i.e., partial level-based order -> total d.w.o.
- Topic/focus/contrastive topic
  - attribute of every node (t, f, c)
  - restricted by d.w.o. and other constraints
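The step from a partial, level-based order to a total deep word order can be illustrated with a small sketch. Everything named here is hypothetical (the `Node` class, its `left`/`right` fields, and `linearize` are not part of PDT or its tools), and the t/f values in the example are assumed for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical t-layer node: lemma, tfa value (t, f or c), and
    dependents already ordered to the left and right of the head."""
    lemma: str
    tfa: str
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)


def linearize(node: Node) -> List[Node]:
    """Combine the per-level orders (head included) into one total order.
    Every subtree stays contiguous, so the result is projective."""
    order: List[Node] = []
    for child in node.left:
        order.extend(linearize(child))
    order.append(node)
    for child in node.right:
        order.extend(linearize(child))
    return order


# "Baker bakes rolls." with an assumed topic/focus assignment.
tree = Node("bake", "f", left=[Node("baker", "t")], right=[Node("roll", "f")])
deep_order = [(n.lemma, n.tfa) for n in linearize(tree)]
print(deep_order)
```

Because each level only orders a head among its own dependents, the recursion is enough to obtain a total order; the "(almost...)" caveat on the slide is not modeled here.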
Coreference
- Grammatical
  - relative clauses
    - which, who
    - Peter and Paul, who ...
  - control
    - infinitival constructions
    - John promised to go ...
  - reflexive pronouns
    - {him,her,them}self(-ves)
    - Mary saw herself in ...
- [Tree diagram: promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]
Coreference
- Textual
  - Ex.: Peter moved to Iowa after he finished his PhD.
- [Tree diagram: move (PRED) governs Peter (ACT), Iowa (DIR1) and finish (TWHEN); finish governs he (ACT) and PhD (PAT); PhD governs he (APP)]
Grammatemes
- Detailed functors (subfunctors)
  - only for some functors:
    - TWHEN: before/after
    - LOC: next-to, behind, in-front-of, ...
    - also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
- Lexical (underlying)
  - number (SG/PL), tense, modality, degree of comparison, ...
  - strictly only where necessary (agreement!)
Example - simplified view
Se zuby jsem měl v minulosti jen problémy.
With teeth I-have had in the-past only problems.
Fully Annotated Sentence
The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.
Arabic Example: Tectogrammatics
In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
English PDT-style Annotation
- Morphology and Syntax
  - By conversion
- Tectogrammatical annotation
  - Guidelines (English TR: by S. Cinková)
  - Pre-annotation
    - Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al.
- Valency
  - From Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)
Example - English TR
- Words
- Dependencies
- Sem. function
- Valency (predicates)
- Coref (BBN)
- Named Entities (BBN)
Valency in the PDT
Valency: specific ability of a word to combine itself with other units of meaning
- dát (give)
  - Specific behavior: Eva (ACT), matka (mother) (ADDR), dar (gift) (PAT)
  - Modifies anything: neděle (Sunday) (TWHEN)
- pršet (rain)
  - Specific behavior: ---
  - Modifies anything: zítra (tomorrow) (TWHEN)
- plakat (cry)
  - Specific behavior: Adam (ACT)
  - Modifies anything: noc (night) (TWHEN)
Valency - Basic Principles
inner participants vs. free modifications (arguments vs. adjuncts)
obligatory vs. optional modifications (the dialogue test)
Inner Participant ... Free Modification
- Inner participants: ACT(or), PAT(ient), ADDR(essee), EFF(ect), ORIG(in) (5)
  - each occurs just with particular verbs
  - each modifies the verb only once (in a clause)
- Free modifications: Location (LOC, DIR1, ...), Time (TWHEN, TTILL, ...), Manner, Intention, ... (70)
  - can modify in principle any verb
  - can be repeated (within the same clause)
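The repetition rule above lends itself to a simple consistency check. The following is a minimal sketch, assuming the dependents of one verb are available as a flat list of functor labels; the function name is hypothetical, and coordination (where several members may share a functor) is deliberately not handled.

```python
from collections import Counter
from typing import List

# The five inner participants named on the slide.
INNER_PARTICIPANTS = {"ACT", "PAT", "ADDR", "EFF", "ORIG"}


def repeated_inner_participants(functors: List[str]) -> List[str]:
    """Return inner-participant functors that occur more than once among
    the dependents of a single verb; free modifications may repeat."""
    counts = Counter(functors)
    return sorted(f for f, n in counts.items()
                  if f in INNER_PARTICIPANTS and n > 1)


ok_clause = ["ACT", "PAT", "TWHEN", "TWHEN", "LOC"]   # repeated TWHEN is fine
bad_clause = ["ACT", "ACT", "PAT", "LOC"]             # ACT appears twice
print(repeated_inner_participants(ok_clause))    # []
print(repeated_inner_participants(bad_clause))   # ['ACT']
```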
Inner Participants
- syntactic criteria - Actor and Patient
- semantic criteria for other inner participants (if a verb has more than two arguments)
- Argument shifting
  - Actor, Patient, Addressee, Origin, Effect
  - Petr has dug a hole.
    - Semantic Effect (as a cognitive role) shifted to the position of Patient.
  - The teacher asked a pupil.
    - Semantic Addressee shifted to the position of Patient.
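The shifting in the two examples can be written down as a small rule. This is only a sketch under stated assumptions: the function and the role names used as dictionary keys are hypothetical, the first listed argument is assumed to be the Actor, and verbs with more than two arguments are simply given functors by their cognitive role.

```python
from typing import List

# Assumed mapping from cognitive roles to functors for the semantic criteria.
SEMANTIC_FUNCTOR = {
    "Actor": "ACT", "Patient": "PAT", "Addressee": "ADDR",
    "Origin": "ORIG", "Effect": "EFF",
}


def assign_functors(cognitive_roles: List[str]) -> List[str]:
    """Assign functors to a verb's arguments, first argument first.
    With exactly two arguments the second becomes PAT whatever its
    cognitive role (argument shifting); otherwise roles decide."""
    if not cognitive_roles:
        return []
    functors = ["ACT"]                      # syntactic criterion
    rest = cognitive_roles[1:]
    if len(rest) == 1:
        functors.append("PAT")              # shifted to Patient
    else:
        functors.extend(SEMANTIC_FUNCTOR[r] for r in rest)
    return functors


dig = assign_functors(["Actor", "Effect"])                 # Petr has dug a hole.
ask = assign_functors(["Actor", "Addressee"])              # The teacher asked a pupil.
give = assign_functors(["Actor", "Addressee", "Patient"])  # more than two arguments
print(dig, ask, give)
```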
Obligatory ... Optional
The Dialogue Test
- A: John left. B: From where? A: *I don't know.
  - "from where": obligatory modification
- A: John left. B: To where? A: I don't know.
  - "to where": optional modification
- Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know.
Valency frame
- Dimensions: obligatory vs. optional; argument vs. adjunct
- Structure:
  - one meaning of the word: one valency frame
- Contents:
  - functor
  - obligatoriness
  - surface form
- word: leave
  - meaning 1: sb left sth
    - frame1: ACT PAT
  - meaning 2: sb left from somewhere
    - frame2: ACT DIR1
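One possible in-memory shape for such frames is sketched below. The class and field names are hypothetical, not the PDT-VALLEX format; marking every slot as obligatory is an assumption (the slide only lists the functors), and surface forms are left unset because the slide gives none for "leave".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Slot:
    functor: str                        # e.g. ACT, PAT, DIR1
    obligatory: bool                    # obligatoriness
    surface_form: Optional[str] = None  # left unset: not given on the slide


@dataclass(frozen=True)
class ValencyFrame:
    word: str
    meaning: str                        # one meaning, one frame
    slots: Tuple[Slot, ...]


LEAVE = [
    ValencyFrame("leave", "sb left sth",
                 (Slot("ACT", True), Slot("PAT", True))),
    ValencyFrame("leave", "sb left from somewhere",
                 (Slot("ACT", True), Slot("DIR1", True))),
]


def functors(frame: ValencyFrame) -> List[str]:
    """List the functors of a frame in slot order."""
    return [slot.functor for slot in frame.slots]


for frame in LEAVE:
    print(frame.meaning, "->", " ".join(functors(frame)))
```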
Valency lexicon: PDT-VALLEX
- 8500 verb senses / valency frames
- 9000 noun senses / valency frames
- some adjectives and adverbs
PDT-VALLEX Entry
- verb: dosáhnout
  - meaning 1: to reach sth
  - meaning 2: to get sb to do sth
  - meaning 3: …
  - meaning 4: …
The PDT-VALLEX editor
[Screenshot labels: senses: ‘lay down’, resign, win, ask]
Valency Lexicon and TrEd
to write sth (about sth)
Corpus <-> Valency Lexicon
- Lexicon:
  - ENTRY: uzavřít
    - vf1: ACT(.1) CPHR({smlouva}.4)
      - ex.: u. dohodu (close a contract)
    - vf2: ACT(.1) PAT(.4)
      - ex.: u. pokoj (close a room, house)
- Corpus - occurrences of "uzavřít" (to close):
  - Sentence 2035, Sentence 15345, Sentence 51042
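To make the corpus-to-lexicon link concrete, the frame strings shown in the entry can be split into (functor, surface form) pairs and compared with the functors observed at an occurrence. This is a sketch only: the regular expression covers just the notation visible on the slide, and `parse_frame` and `candidate_frames` are hypothetical helpers, not part of any PDT tool.

```python
import re
from typing import Dict, List, Tuple

# One slot looks like FUNCTOR(form), e.g. ACT(.1) or CPHR({smlouva}.4).
SLOT_RE = re.compile(r"(?P<functor>[A-Z0-9]+)\((?P<form>[^)]*)\)")

Frame = List[Tuple[str, str]]


def parse_frame(text: str) -> Frame:
    """Split a frame string into (functor, surface form) pairs."""
    return [(m.group("functor"), m.group("form"))
            for m in SLOT_RE.finditer(text)]


ENTRY: Dict[str, Frame] = {
    "vf1": parse_frame("ACT(.1) CPHR({smlouva}.4)"),
    "vf2": parse_frame("ACT(.1) PAT(.4)"),
}


def candidate_frames(entry: Dict[str, Frame], observed: List[str]) -> List[str]:
    """Frames whose functors all appear among the observed dependents."""
    seen = set(observed)
    return [name for name, frame in entry.items()
            if {functor for functor, _ in frame} <= seen]


print(ENTRY["vf1"])
print(candidate_frames(ENTRY, ["ACT", "PAT", "TWHEN"]))
```

In the annotated data the link runs the other way as well: each verb occurrence points to the frame chosen by the annotator, so a check like this would at most flag mismatches.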
Valency and Text Generation
- Tectogrammatical Representation has all the information to (re)generate the surface form of the sentence:
  - in a “generalized” form
  - non-redundant (almost... but for generation, it is o.k.)
- ...except the links to a-layer, however
  - links used only for training [statistical models for] parsing/generation modules
  - not present when e.g. doing text planning, translation, ...
- valency dictionary: form of “learned” knowledge
Valency and Text Generation
- Using valency for...
  - ...getting the correct (lemma, tag) of verb arguments
- Example:
  - t-tree: starat_se (PRED, “to take care of”), Martin (ACT), tygr (PAT, “tiger”)
  - VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
  - (lemma, tag) pairs:
    - Martin ....1..........
    - se ...............
    - starat V..............
    - o ...............
    - tygr ....4..........
  - Martin se stará o tygry.
  - “Martin takes care of tigers.”
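The example can be replayed in a few lines: the valency frame supplies the case of each argument and the preposition, and the output is a list of (lemma, tag pattern) pairs with unspecified positions left as dots. This is a sketch under assumptions: the frame is hand-coded from the entry shown above, `tag_with` and `realize` are hypothetical names, and the fixed word order (ACT, reflexive, verb, PAT) is chosen only to match this one sentence.

```python
from typing import Dict, List, Optional, Tuple

TAG_LENGTH = 15   # length of the tag patterns shown in the example
CASE_INDEX = 4    # zero-based index of the case position (5th character)

# Hand-coded from: starat (se) ACT(.1) PAT(o.[.4])
# functor -> (preposition or None, case digit)
FRAME: Dict[str, Tuple[Optional[str], str]] = {
    "ACT": (None, "1"),
    "PAT": ("o", "4"),
}


def tag_with(pos: Optional[str] = None, case: Optional[str] = None) -> str:
    """Build a tag pattern with only the known positions filled in."""
    tag = ["."] * TAG_LENGTH
    if pos is not None:
        tag[0] = pos
    if case is not None:
        tag[CASE_INDEX] = case
    return "".join(tag)


def realize(verb: str, reflexive: Optional[str],
            args: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (lemma, tag pattern) pairs for ACT, reflexive, verb, PAT."""
    out: List[Tuple[str, str]] = []
    _, act_case = FRAME["ACT"]
    out.append((args["ACT"], tag_with(case=act_case)))
    if reflexive:
        out.append((reflexive, tag_with()))
    out.append((verb, tag_with(pos="V")))
    prep, pat_case = FRAME["PAT"]
    if prep:
        out.append((prep, tag_with()))
    out.append((args["PAT"], tag_with(case=pat_case)))
    return out


pairs = realize("starat", "se", {"ACT": "Martin", "PAT": "tygr"})
for lemma, tag in pairs:
    print(lemma, tag)
```

A morphological generator would then turn each pair into a word form; that step is outside this sketch.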
The Annotation Process
- 4 sublayers
  - work on structure first, rest in parallel
- Structure
  - automatic preprocessing - programmed conversion from analytical layer annotation
- Grammatemes
  - mostly automatically (based on lower layers’ annotation), manual checking, corrections
- Cross-sublayer/cross-layer checking
  - partly automatic, then manual
The Annotation Process: Scheme
Tectogrammatical Annotation Tools
- Manual annotation
  - 4 groups of annotators ~ 4 sublayers
  - Special graphical tool (TrEd)
    - Customizable graphical tree editor
- Preprocessing
  - Data from analytical layer, preprocessed
  - Online dependency function preassignment
The Annotation Scheme
- XML + principles of linear- and tree-based standoff annotation
- PML (Prague Markup Language)
- Layer schemes (Relax NG)
  - PDT/PADT: t(ecto), a(nalytic), m(orphology), …
  - English: + phrase-based (p-layer)
PML/XML Annotation Layers
- Strictly top-down links
- w+m+a can be easily “knitted”
- API for cross-layer access (programming)
- PML Schema / Relax NG
- [z and audio layers: used for spoken data (audio as layer “-1”)]
- LFG analogy: f-struct, Φ, c-struct
- [Figure labels: z-layer, audio; BYL BYS ČELO LESA …]
The Prague Markup Language Example
m-layer data, linked to w-layer:
<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>
<m id="m-tr/_12941_01_00013.fs-s1w5"> ...
Pointer to w-layer: the <dest.rf> element
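As a rough illustration of how such a record could be read, the sketch below parses the `<m>` element from the slide with Python's standard library, splits the w-layer pointer at `#`, and lists the filled positions of the tag. Several things are assumptions: the helper names are hypothetical, real PML files carry an XML namespace and a header that this fragment lacks, and the position names follow the PDT positional tagset as generally documented rather than anything stated on the slide.

```python
import xml.etree.ElementTree as ET

M_XML = """<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>"""

# Assumed names for the 15 tag positions (two of them are reserved).
POSITIONS = ["POS", "SUBPOS", "GENDER", "NUMBER", "CASE", "POSSGENDER",
             "POSSNUMBER", "PERSON", "TENSE", "GRADE", "NEGATION", "VOICE",
             "RESERVE1", "RESERVE2", "VAR"]


def child(elem, tag):
    """Return the first direct child with the given tag name."""
    for c in elem:
        if c.tag == tag:
            return c
    raise KeyError(tag)


def read_m(xml_text):
    """Extract id, form, lemma, tag and the w-layer pointer of one <m>."""
    m = ET.fromstring(xml_text)
    ref = child(child(m, "w"), "dest.rf").text
    file_alias, _, w_id = ref.partition("#")   # "w" + "#" + id in the w-file
    return {
        "id": m.get("id"),
        "form": child(m, "form").text,
        "lemma": child(m, "lemma").text,
        "tag": child(m, "tag").text,
        "w_ref": (file_alias, w_id),
    }


def decode_tag(tag):
    """Map each non-empty tag position to its value."""
    return {name: ch for name, ch in zip(POSITIONS, tag) if ch != "-"}


record = read_m(M_XML)
decoded = decode_tag(record["tag"])
print(record["form"], record["w_ref"])
print(decoded)
```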
Searching the Treebanks
- TrEd extension: PML-TQ
  - Backend: database server
  - Frontend: TrEd or Web browser
- Web access
  - http://euler.ms.mff.cuni.cz:8111
  - Sample data (Czech, English [soon]): anonymous / anonymous
  - Full access (LSA 2011 participants only, 2011): LSA2011 / UC.Boulder
  - Full access: licence needed for the corpora
    - Available later this year at http://www.lindat.cz
Using the Results: Parsing
- Several parsers of Czech
  - Analytical layer dependency syntax
  - Trained on PDT 1.0 data, 1.2 mil. words
  - Collins(98), Charniak(00), Žabokrtský(02), Ribarov(04), Nivre(05), Zeman(05), McDonald(05), CoNLL’06 (19 parsers)
- Best results
  - accuracy: percent of correct dependencies: 84-85% for a single parser, > 86% for a combination
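The accuracy measure quoted here, the percentage of tokens whose governor is predicted correctly, is easy to state in code. The sketch below assumes each sentence is given as a list of head indices (0 for the root); the function name and the toy data are hypothetical and unrelated to the reported results.

```python
from typing import Sequence


def attachment_accuracy(gold_heads: Sequence[int],
                        predicted_heads: Sequence[int]) -> float:
    """Percent of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted must have the same length")
    if not gold_heads:
        raise ValueError("no tokens to evaluate")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Toy example for "Baker bakes rolls": heads as 1-based token indices.
gold = [2, 0, 2]
predicted = [2, 0, 1]          # "rolls" attached to the wrong head
accuracy = attachment_accuracy(gold, predicted)
print(round(accuracy, 1))
```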
Tectogrammatical Parsing
- Newest results:
  - 4 phases
  - Transformation-based learning (FnTBL)
  - Largely language independent
  - Coreference: >90%
- Results by attribute (m- and a-layer: manual vs. auto):
  Attribute      manual   auto
  structure      89.3 %   76.4 %
  functor        85.5 %   77.4 %
  val_frame.rf   92.3 %   90.9 %
  t_lemma        93.5 %   90.9 %
  nodetype       94.5 %   92.6 %
  gram/sempos    93.8 %   91.5 %
  a/lex.rf       96.5 %   95.1 %
  a/aux.rf       94.3 %   90.3 %
  is_member      94.3 %   89.5 %
  is_generated   96.6 %   95.2 %
  deepord        68.0 %   66.7 %
Tectogrammatical Layer in Machine Translation
- The Translation (“Vauquois”) triangle
- [Figure labels: source, target; Cz, En; Morphology, Surface Syntax, Tectogrammatical Representation; transfer; Generation]
Dependency trees in MT
According to his opinion UAL's executives were misinformed about the financing of the original transaction.
Podle jeho názoru bylo vedení UAL o financování původní transakce nesprávně informováno.
Transfer:
- structure (~0)
- lexical
- functions
- grammatical
Analytical Layer Correspondence
Tectogrammatical Correspondence
The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River.
‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.
Valency and Translation
- leave:
  - leave-1: to leave [from] somewhere
  - leave-2: to leave sth for sb
- Translating (from English into Czech):
  - which equivalent to choose?
    - nechat vs. odjet/opustit
  - which prepositions, cases, ... to use?
    - accusative vs. “z” (“from”) with genitive vs. ...?
Valency and Translation
- leave-1 -> nechat-3
  - leave-1: ACT() PAT() LOC()
  - nechat-3: ACT(.1) PAT(.4) LOC()
- leave-2 -> odjet-1
  - leave-2: ACT() DIR1(from.)
  - odjet-1: ACT(.1) DIR1(z.[.2])
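The frame pairs above can be read as a small transfer table: once the English frame is known, the Czech equivalent and the surface forms of its arguments follow. The sketch below hand-codes just these two pairs; the table layout and the `transfer` function are hypothetical, and choosing the English frame in the first place is not addressed.

```python
from typing import Dict, List, Optional, Tuple

# English frame -> (Czech frame, functor -> Czech surface form), as on the slide.
FRAME_PAIRS: Dict[str, Tuple[str, Dict[str, str]]] = {
    "leave-1": ("nechat-3", {"ACT": ".1", "PAT": ".4", "LOC": ""}),
    "leave-2": ("odjet-1", {"ACT": ".1", "DIR1": "z.[.2]"}),
}


def transfer(english_frame: str,
             functors: List[str]) -> Tuple[str, List[Tuple[str, Optional[str]]]]:
    """Return the Czech frame and the surface form for each functor.
    Functors outside the frame (free modifications) get None."""
    czech_frame, forms = FRAME_PAIRS[english_frame]
    return czech_frame, [(f, forms.get(f)) for f in functors]


czech, realized = transfer("leave-2", ["ACT", "DIR1"])
print(czech, realized)
```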
To summarize…
PDT is/has (a)…
- Dependency-based treebanking project
  - Czech (other languages: Eng, Ar)
  - Ongoing projects (other inst.): Italian, Old Greek, Latin, …
- ~ 1 mil. words
  - sufficient size for ML experiments
- 4 layers of annotation
  - (token, morphology, syntax, deep syntax/semantics++)
  - independent and full information at all levels, but...
  - interlinked (for the development of parsers/generators)
- Valency dictionary
  - integrated (links from data)
Some pointers
- Current version of PDT: v2.0, LDC2006T01
  - all three levels, 1.9/1.5/0.8 Mwords
  - http://ufal.mff.cuni.cz/pdt2.0
- http://ufal.mff.cuni.cz
  - Research -> Corpora (Treebank(s))
- http://www.ldc.upenn.edu
  - LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0), LDC2006T01 (PDT 2.0)
- http://www.clsp.jhu.edu: Workshop 2002
  - Using TL for MT Generation
- http://ufal.mff.cuni.cz/pedt
  - 1st version of English dep. Treebank
- http://ufal.mff.cuni.cz/~hajic/lsa2011.html
  - This workshop page, many links to resources, tools