Prague Dependency Treebank(s) Workshop at LSA2011, Part II Jan Hajič, Zdeňka Urešová Institute of Formal and Applied Linguistics School of Computer Science Faculty of Mathematics and Physics Charles University, Prague Czech Republic

Page 1: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Prague Dependency Treebank(s)

Workshop at LSA2011, Part II Jan Hajič, Zdeňka Urešová

Institute of Formal and Applied Linguistics

School of Computer Science
Faculty of Mathematics and Physics

Charles University, Prague
Czech Republic

Page 2: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

July 30, 2011 LSA 2011 Prague Dependency Treebanks II 2

Part II - Syntax and Semantics

Tectogrammatical representation
  Valency lexicon
Languages
  Czech, Arabic and English
Technical issues
  Annotation scheme and format
  Tools for annotation
  Applications

Summary, pointers, conclusion

Page 3: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

PDT Annotation Layers

L0 (w) Words (tokens)
  automatic segmentation and markup only
L1 (m) Morphology
  Tag (full morphology, 13 categories), lemma
L2 (a) Analytical layer (surface syntax)
  Dependency, analytical dependency function
L3 (t) Tectogrammatical layer (“deep” syntax)
  Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

Page 4: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Layer 3 (t-layer): Tectogrammatical

Underlying (deep) syntax
4 sublayers (integrated):
  dependency structure, (detailed) functors
    valency annotation
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes):
    detailed functors
    underlying gender, number, ...
Total 39 attributes (vs. 5 at m-layer, 2 at a-layer)

Page 5: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Analytical vs. Tectogrammatical

[Figure: analytical vs. tectogrammatical tree of the same sentence (TR: sublayer 1 only shown). Callouts: underlying verb + tense; deep function; elided Actor in; prepositions out; another ellipsis...]

Page 6: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Layer 3: Tectogrammatical

Underlying (deep) syntax
4 sublayers:
  dependency structure, (detailed) functors
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes):
    detailed functors
    underlying gender, number, ...

Page 7: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Functors

“Actants”: ACT, PAT, EFF, ADDR, ORIG
  modify: verbs, nouns, adjectives
  cannot repeat in a clause, usually obligatory
Free modifications (~ 50), semantically defined
  can repeat; optional, sometimes obligatory
  Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, ...
Special
  Coordination, Rhematizers, Foreign phrases, ...

[Side label on the slide: syntactic ... semantic]
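The repetition constraint on actants can be sketched as a small check. Only the five actant labels and the rule itself come from the slide; the function name is a hypothetical helper, and the sketch deliberately ignores coordination, where several conjuncts may legitimately share one functor.

```python
from collections import Counter

# The five actants ("inner participants") listed on the slide.
ACTANTS = {"ACT", "PAT", "EFF", "ADDR", "ORIG"}


def repeated_actants(child_functors):
    """Return the actant functors that occur more than once under one head.

    Free modifications (LOC, TWHEN, ...) may repeat, so they are never reported.
    """
    counts = Counter(child_functors)
    return sorted(f for f, n in counts.items() if f in ACTANTS and n > 1)


# Two TWHEN modifiers are fine; two PAT modifiers under the same verb are not.
print(repeated_actants(["ACT", "PAT", "TWHEN", "TWHEN"]))  # []
print(repeated_actants(["ACT", "PAT", "PAT", "LOC"]))      # ['PAT']
```

A check of this kind would belong to the automatic consistency checks rather than to annotation itself, since it can only flag suspicious nodes.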

Page 8: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Example

Analytical verb form:
  (he) allowed would-be to-be enrolled
  směl by být zapsán

Additional attributes (grammatemes): conditional + “allow”

[Figure callout: Collapsed]

Page 9: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Example

Passive construction (action)
  (The) book has-been translated [by Mr. X]
  Kniha byla přeložena

[Figure callouts: Disappeared; Added]

Page 10: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Example

Object
  (he) gave him a-book
  dal mu knihu

Obj goes into ACT, PAT, ADDR, EFF or ORIG based on governor’s valency frame

Page 11: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Example

Incomplete phrases
  Peter works well, but Paul badly
  Petr pracuje dobře, ale Pavel špatně

[Figure callout: Added]

Page 12: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Layer 3: Tectogrammatical

Underlying (deep) syntax
4 sublayers:
  dependency structure, (detailed) functors
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes):
    detailed functors
    underlying gender, number, ...

Page 13: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Deep Word Order, Topic/Focus

Example:

Baker bakes rolls. vs. Baker(IC) bakes rolls. (IC = intonation center)

[Figure: analytical dependency tree]

Page 14: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Deep Word Order, Topic/Focus

Deep word order:
  from “old” information to the “new” one (left-to-right)
  at every level (head included)
  projectivity by definition (almost...)
    i.e., partial level-based order -> total d.w.o.
Topic/focus/contrastive topic
  attribute of every node (t, f, c)
  restricted by d.w.o. and other constraints
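The step from a partial, level-based order to a total deep word order can be sketched as an in-order traversal. The `Node` class and its `left`/`right` lists are hypothetical; they only encode, for each head, which dependents precede it and which follow it. The sketch assumes a fully projective tree, which the slide says holds "almost" always.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: dependents are split by their local position
    relative to the head (older information left, newer information right)."""
    lemma: str
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)


def total_order(node: Node) -> List[str]:
    """In-order traversal: in a projective tree every subtree occupies a
    contiguous span, so the per-level orders fix one total order."""
    out: List[str] = []
    for child in node.left:
        out.extend(total_order(child))
    out.append(node.lemma)
    for child in node.right:
        out.extend(total_order(child))
    return out


# "Baker bakes rolls." with the baker as old and the rolls as new information.
tree = Node("bake", left=[Node("baker")], right=[Node("roll")])
print(total_order(tree))  # ['baker', 'bake', 'roll']
```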

Page 15: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Layer 3: Tectogrammatical

Underlying (deep) syntax
4 sublayers:
  dependency structure, (detailed) functors
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes):
    detailed functors
    underlying gender, number, ...

Page 16: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Coreference

Grammatical
  relative clauses
    which, who
    Peter and Paul, who ...
  control
    infinitival constructions
    John promised to go ...
  reflexive pronouns
    {him,her,them}self(-ves)
    Mary saw herself in ...

[Figure: tectogrammatical tree: promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]

Page 17: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Coreference

Textual
  Ex.: Peter moved to Iowa after he finished his PhD.

[Figure: tectogrammatical tree: move (PRED) governs Peter (ACT), Iowa (DIR1) and finish (TWHEN); finish governs he (ACT) and PhD (PAT); PhD governs he (APP)]

Page 18: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Layer 3: Tectogrammatical

Underlying (deep) syntax
4 sublayers:
  dependency structure, (detailed) functors
  topic/focus and deep word order
  coreference (mostly grammatical only)
  all the rest (grammatemes):
    detailed functors
    underlying gender, number, ...

Page 19: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Grammatemes

Detailed functors (subfunctors)
  only for some functors:
    TWHEN: before/after
    LOC: next-to, behind, in-front-of, ...
    also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
Lexical (underlying)
  number (SG/PL), tense, modality, degree of comparison, ...
  strictly only where necessary (agreement!)

Page 20: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Example - simplified view

Se zuby jsem měl v minulosti jen problémy.
With teeth I-have had in the-past only problems.

Page 21: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Fully Annotated Sentence

The boundaries of some problems seem to be clearer after they were revived by Havel’s speech.

Page 22: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Arabic Example: Tectogrammatics

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

Page 23: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

English PDT-style Annotation

Morphology and Syntax
  By conversion
Tectogrammatical annotation
  Guidelines (English TR: by S. Cinková)
  Pre-annotation
    Transformation from Penn Treebank & Propbank (Palmer, Kingsbury) by Z. Žabokrtský et al.
Valency
  From Propbank Frame Files (Cinková, Šindlerová, Nedolužko, Semecký)

Page 24: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Example - English TR

Words
Dependencies
Sem. function
Valency (predicates)
Coref (BBN)
Named Entities (BBN)

Page 25: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency in the PDT

Valency: specific ability of a word to combine itself with other units of meaning

[Figure: three verbs with their dependents, grouped by kind of modification]

Verb           Specific behavior                               Modifies anything
dát (give)     Eva (ACT), matka (mother, ADDR), dar (gift, PAT)   neděle (Sunday, TWHEN)
pršet (rain)   ---                                             zítra (tomorrow, TWHEN)
plakat (cry)   Adam (ACT)                                      noc (night, TWHEN)

Page 26: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency - Basic Principles

inner participants vs. free modifications (arguments vs. adjuncts)

obligatory vs. optional modifications (the dialogue test)

Page 27: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Inner Participant … Free Modification

Inner Participant
  ACT(or), PAT(ient), ADDR(essee), EFF(ect), ORIG(in) (5)
  each occurs just with particular verbs
  each modifies the verb only once (in a clause)

Free Modification
  Location (LOC, DIR1, …), Time (TWHEN, TTILL, …), Manner, Intention, … (70)
  can modify in principle any verb
  can be repeated (within the same clause)

Page 28: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Inner Participants

syntactic criteria - Actor and Patient
semantic criteria for other inner participants (if a verb has more than two arguments)

Argument shifting

[Diagram labels: Actor, Patient, Addressee, Origin, Effect]

Petr has dug a hole.
  Semantic Effect (as a cognitive role) shifted to the position of Patient.

The teacher asked a pupil.
  Semantic Addressee shifted to the position of Patient.

Page 29: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Obligatory … Optional

A: John left.
B: From where?
A: *I don't know.
  „from where“: obligatory modification

A: John left.
B: To where?
A: I don't know.
  „to where“: optional modification

The Dialogue Test

Answering a question about a semantically obligatory modification, the speaker cannot say: I don't know.

Page 30: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency frame

[Diagram: argument / adjunct crossed with obligatory / optional]

Structure:
  one meaning of the word <-> one valency frame

Contents:
  functor
  obligatoriness
  surface form

word: leave
  meaning 1: sb left sth               frame1: ACT PAT
  meaning 2: sb left from somewhere    frame2: ACT DIR1

Page 31: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency lexicon: PDT-VALLEX

8500 verb senses / valency frames
9000 noun senses / valency frames
some adjectives and adverbs

PDT-VALLEX Entry
verb: dosáhnout
  meaning 1: to reach sth
  meaning 2: to get sb to do sth
  meaning 3: …
  meaning 4: …

Page 32: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

The PDT-VALLEX editor

[Screenshot of the editor. Callout "senses:" with the glosses ‘lay down’, resign, win, ask]

Page 33: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency Lexicon and TrEd

to write sth (about sth)

Page 34: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Corpus <-> Valency Lexicon

Corpus – occurrences of „uzavřít“ (to close):
  Sentence 2035: Sentence 15345: Sentence 51042:

Lexicon:
  ENTRY: uzavřít
    vf1: ACT(.1) CPHR({smlouva}.4)
      ex.: u. dohodu (close a contract)
    vf2: ACT(.1) PAT(.4)
      ex.: u. pokoj (close a room, house)

Page 35: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency and Text Generation

Tectogrammatical Representation has all the information to (re)generate the surface form of the sentence:
  in a “generalized” form
  non-redundant (almost... but for generation, it is o.k.)
...except the links to a-layer, however
  links used only for training [statistical models for] parsing/generation modules
  not present when e.g. doing text planning, translation, ...
valency dictionary: form of “learned” knowledge

Page 36: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency and Text Generation

Using valency for...
  ...getting the correct (lemma, tag) of verb arguments

Example:

Tectogrammatical tree:
  starat_se (PRED) “to take care of”
    Martin (ACT)
    tygr (PAT) “tiger”

VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])

Resulting (lemma, tag) pairs:
  Martin   ....1..........
  se       ...............
  starat   V..............
  o        ...............
  tygr     ....4..........

Martin se stará o tygry.
“Martin takes care of tigers.”
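How a valency frame constrains the surface form of each argument can be sketched in a few lines. The reading of the notation (".1" as case 1, "o.[.4]" as the preposition o followed by a word in case 4) is an assumption inferred from this example, and `parse_frame` and `surface_requirements` are hypothetical helpers, not part of PDT-VALLEX or TrEd. Other notations seen in the lexicon, such as a lemma in curly braces, are not handled.

```python
import re

# Assumed reading of the notation: FUNCTOR(form), where ".4" means "case 4"
# and "o.[.4]" means "preposition o governing a word in case 4".
SLOT = re.compile(r"(?P<functor>[A-Z0-9]+)\((?P<form>[^)]*)\)")
CASE = re.compile(r"\.(\d)")


def parse_frame(frame):
    """Map each functor in a frame string to (preposition or None, case or None)."""
    slots = {}
    for m in SLOT.finditer(frame):
        form = m.group("form")
        prep = form.split(".", 1)[0] or None  # text before the first dot, if any
        case = CASE.search(form)
        slots[m.group("functor")] = (prep, int(case.group(1)) if case else None)
    return slots


def surface_requirements(frame, arguments):
    """List (preposition, lemma, case) for each filled slot, in frame order."""
    slots = parse_frame(frame)
    return [(slots[f][0], arguments[f], slots[f][1]) for f in slots if f in arguments]


frame = "ACT(.1) PAT(o.[.4])"
print(parse_frame(frame))
# {'ACT': (None, 1), 'PAT': ('o', 4)}
print(surface_requirements(frame, {"ACT": "Martin", "PAT": "tygr"}))
# [(None, 'Martin', 1), ('o', 'tygr', 4)]
```

The sketch stops at (preposition, lemma, case) triples. Producing the inflected forms themselves (tygry, stará) would additionally need a morphological generator.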

Page 37: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

The Annotation Process

4 sublayers
  work on structure first, rest in parallel
Structure
  automatic preprocessing - programmed conversion from analytical layer annotation
Grammatemes
  mostly automatically (based on lower layers’ annotation), manual checking, corrections
Cross-sublayer/cross-layer checking
  partly automatic, then manual

Page 38: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

The Annotation Process: Scheme

[Figure: scheme of the annotation process]

Page 39: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Annotation Tools

Manual annotation
  4 groups of annotators ~ 4 sublayers
  Special graphical tool (TrEd)
    Customizable graphical tree editor
Preprocessing
  Data from analytical layer, preprocessed
  Online dependency function preassignment

Page 40: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

The Annotation Scheme

XML + principles of linear- and tree-based standoff annotation
PML (Prague Markup Language)
Layer schemes (Relax NG)
  PDT/PADT: t(ecto), a(nalytic), m(orphology), …
  English: + phrase-based (p-layer)

Page 41: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

PML/XML Annotation Layers

Strictly top-down links
w+m+a can be easily “knitted”
API for cross-layer access (programming)
PML Schema / Relax NG

[z and audio layers: used for spoken data (audio as layer “-1”)]

LFG analogy: f-struct, Φ, c-struct

[Figure: layer diagram with side labels z-layer and audio; sample tokens: BYL BYS ČELO LESA …]

Page 42: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

The Prague Markup Language: Example

m-layer data, linked to w-layer:

<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>
<m id="m-tr/_12941_01_00013.fs-s1w5"> ...

[Callout on dest.rf: Pointer to w-layer]
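Reading such an m-layer node with a generic XML parser could look like the sketch below. It uses only Python's standard `xml.etree.ElementTree` and the single `<m>` element from the example; real PML files carry a header, a schema reference and namespaces that are not handled here. The helper names and the 15 position names used to unpack the tag are assumptions based on the usual description of the Czech positional tagset, not something stated on the slide.

```python
import xml.etree.ElementTree as ET

# The <m> element from the example, as one well-formed fragment.
M_NODE = """
<m id="m-tr/_12941_01_00013.fs-s1w4">
  <src.rf>manual</src.rf>
  <w>
    <dest.rf>w#w-tr/_12941_01_00013.fs-s1w4</dest.rf>
    <trans>basic</trans>
  </w>
  <form>pocházela</form>
  <lemma>pocházet_:T</lemma>
  <tag>VpQW---XR-AA---</tag>
</m>
"""

# Assumed names of the 15 tag positions (13 categories plus two reserved slots).
POSITIONS = ["POS", "SubPOS", "Gender", "Number", "Case", "PossGender",
             "PossNumber", "Person", "Tense", "Grade", "Negation", "Voice",
             "Reserve1", "Reserve2", "Var"]


def read_m_node(xml_text):
    """Extract form, lemma, tag and the w-layer pointer from one <m> element."""
    m = ET.fromstring(xml_text.strip())
    # The pointer has the shape "<file alias>#<node id>".
    alias, _, w_id = m.findtext("w/dest.rf").partition("#")
    return {"id": m.get("id"), "form": m.findtext("form"),
            "lemma": m.findtext("lemma"), "tag": m.findtext("tag"),
            "w_file_alias": alias, "w_id": w_id}


def decode_tag(tag):
    """Keep only the positions that carry a value ('-' means not applicable)."""
    return {name: ch for name, ch in zip(POSITIONS, tag) if ch != "-"}


node = read_m_node(M_NODE)
print(node["form"], node["lemma"], node["w_id"])
print(decode_tag(node["tag"]))
```

The standoff design shows up in `w_id`: the m-layer node does not repeat the token, it only points at it, which is why cross-layer access needs either "knitting" or an API.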

Page 43: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Searching the Treebanks

TrEd extension: PML-TQ
  Backend: database server
  Frontend: TrEd or Web browser
Web access
  http://euler.ms.mff.cuni.cz:8111
  Sample data (Czech, English [soon]):
    anonymous / anonymous
  Full access (LSA 2011 participants only, 2011):
    LSA2011 / UC.Boulder
  Full access: licence needed for the corpora
    Available later this year at http://www.lindat.cz

Page 44: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Using the Results: Parsing

Several parsers of Czech
  Analytical layer dependency syntax
  Trained on PDT 1.0 data, 1.2 mil. words
    Collins(98), Charniak(00), Žabokrtský(02), Ribarov(04), Nivre(05), Zeman(05), McDonald(05), CoNLL’06 (19 parsers)
Best results
  accuracy: percent of correct dependencies:
    84-85% for a single parser, > 86% for a combination
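The accuracy measure named here, the percentage of correct dependencies, can be computed as below. This is a minimal sketch: the encoding of a parse as a list of head indices (1-based, 0 for the root) is an assumption, and the function scores head attachment only, without dependency labels.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Percent of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("sequences must have the same length")
    if not gold_heads:
        raise ValueError("empty input")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Toy 5-token sentence; heads are 1-based token indices, 0 marks the root.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
print(attachment_accuracy(gold, pred))  # 80.0
```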

Page 45: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Parsing

Newest results:
  4 phases
  Transformation-based learning
    FnTBL
  Largely language independent
  Coreference: >90%

Accuracy by attribute, by source of the m- and a-layer annotation:

Attribute       manual    auto
structure       89.3 %    76.4 %
functor         85.5 %    77.4 %
val_frame.rf    92.3 %    90.9 %
t_lemma         93.5 %    90.9 %
nodetype        94.5 %    92.6 %
gram/sempos     93.8 %    91.5 %
a/lex.rf        96.5 %    95.1 %
a/aux.rf        94.3 %    90.3 %
is_member       94.3 %    89.5 %
is_generated    96.6 %    95.2 %
deepord         68.0 %    66.7 %

Page 46: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Layer in Machine Translation

The Translation (“Vauquois”) triangle

[Figure: translation triangle. Labels: transfer; source, target; Tectogrammatical Representation; Surface Syntax; Morphology; Generation; Cz, En]

Page 47: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Dependency trees in MT

According to his opinion UAL's executives were misinformed about the financing of the original transaction.

Transfer:
- structure (~0)
- lexical
- functions
- grammatical

Podle jeho názoru bylo vedení UAL o financování původní transakce nesprávně informováno.

Page 48: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Analytical Layer Correspondence

Page 49: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Tectogrammatical Correspondence

The [Homestead’s] only remaining baker bakes the most famous rolls to the north of Long River.

‘al-xabaaz ‘al-’axiir ‘al-baaqii [fii Homestead] yaśmacu ‘ashhar ‘al-kruasaanaat ilaa shimaal min Long River.

Page 50: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency and Translation

leave:
  leave-1 to leave [from] somewhere
  leave-2 to leave sth for sb

Translating (from English into Czech):
  which equivalent to choose?
    nechat vs. odjet/opustit
  which prepositions, cases, ... to use?
    accusative vs. “z” (“from”) with genitive vs. ...?

Page 51: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Valency and Translation

English frame                      Czech frame
leave-1  ACT() PAT() LOC()         nechat-3  ACT(.1) PAT(.4) LOC()
leave-2  ACT() DIR1(from.)         odjet-1   ACT(.1) DIR1(z.[.2])
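The way linked frames answer both translation questions (which verb, which surface forms) can be sketched as a lookup. The dictionary layout and the `transfer` function are hypothetical; only the frame pairs and their slot notations are taken from the slide, and the example arguments (Petr, Praha) are invented for illustration.

```python
# Frame pairs as listed on the slide; the dictionary layout is illustrative.
FRAME_PAIRS = {
    "leave-1": {"target": "nechat-3",
                "slots": {"ACT": ".1", "PAT": ".4", "LOC": ""}},
    "leave-2": {"target": "odjet-1",
                "slots": {"ACT": ".1", "DIR1": "z.[.2]"}},
}


def transfer(source_frame, arguments):
    """Pick the Czech frame and attach its surface form to each argument.

    `arguments` maps functors to already translated lemmas.
    """
    entry = FRAME_PAIRS[source_frame]
    realised = {f: (lemma, entry["slots"][f]) for f, lemma in arguments.items()}
    return entry["target"], realised


# Hypothetical input: "Petr left Prague" analysed with the leave-2 frame.
print(transfer("leave-2", {"ACT": "Petr", "DIR1": "Praha"}))
# ('odjet-1', {'ACT': ('Petr', '.1'), 'DIR1': ('Praha', 'z.[.2]')})
```

Because the functors are shared across the two languages, the sense choice on the English side fixes the Czech verb, and the Czech frame then supplies preposition and case.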

Page 52: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

To summarize…

PDT is/has (a)…
  Dependency-based treebanking project
    Czech (other languages: Eng, Ar)
    Ongoing projects (other inst.): Italian, Old Greek, Latin, …
  ~ 1 mil. words
    sufficient size for ML experiments
  4 layers of annotation
    (token, morphology, syntax, deep syntax/semantics++)
    independent and full information at all levels, but...
    interlinked (for the development of parsers/generators)
  Valency dictionary
    integrated (links from data)

Page 53: Prague Dependency Treebank(s) Workshop at LSA2011, Part II

Some pointers

Current version of PDT: v2.0, LDC2006T01
  all three levels, 1.9/1.5/0.8 Mwords
  http://ufal.mff.cuni.cz/pdt2.0
http://ufal.mff.cuni.cz
  Research -> Corpora (Treebank(s))
http://www.ldc.upenn.edu
  LDC2001T10 (PDT v1.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PCEDT 1.0), LDC2006T01 (PDT 2.0)
http://www.clsp.jhu.edu: Workshop 2002
  Using TL for MT Generation
http://ufal.mff.cuni.cz/pedt
  1st version of English dep. Treebank
http://ufal.mff.cuni.cz/~hajic/lsa2011.html
  This workshop page, many links to resources, tools