Page 1: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Prague Dependency Treebank(s)

Workshop at LSA2011, Part I

Jan Hajič, Zdeňka Urešová

Institute of Formal and Applied Linguistics

School of Computer Science

Faculty of Mathematics and Physics

Charles University, Prague

Czech Republic

Page 2: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

July 30, 2011 LSA2011 Prague Dependency Treebanks I 2

Part I - Text to Syntax

The Prague Dependency Treebank projects
The theory behind it
The Corpora
Morphology
  Dictionaries, tools (incl. POS tagger)
Dependency surface syntax
  Czech, Arabic, English
Parallel annotated corpus

Page 3: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

The Prague Dependency Treebank(s)

The idea
  Apply the “old” Prague theory to real-world texts
  Provide enough data for ML experiments
“Old” Prague theory
  Prague structuralism (1930s)
  Stratificational approach
  Centered on “deep syntax”
    Separated from “surface form”
  Dependency based (how else?)

Page 4: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

PDT: The Methodology

Manual annotation is PRIMARY
  Some help from existing tools possible
“No information loss, no redundancy”
  Much formalization, but…
  … original form always retrievable
Dictionaries
  In theory: “secondary”, side effect of annotation (generalization)
  In reality: help consistency
  Links: data → dictionary(-ies)
Result: used for Machine Learning
Ergonomics of annotation
  Graphical (“linguistic”) presentation & editing

Page 5: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

The Prague Dependency Treebank Project: Czech Treebank

1995 (Dublin)
1996-2006-...
1998 PDT v. 0.5 released (JHU workshop)
  400k words manually annotated, unchecked
2001 PDT 1.0 released (LDC):
  1.3MW annotated, morphology & surface syntax
2006 PDT 2.0 release
  0.8MW annotated (50k sentences) + PDT 1.0 corrected
  the “tectogrammatical layer”
    underlying (deep) syntax

Page 6: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Related Projects (Treebanks)

Prague Czech-English Dependency Treebank
  WSJ portion of PTB, translated to Czech (1.2 mil. words)
    Annotated for basic set of attributes
  Penn Treebank / WSJ
    Pre-converted, manually annotated for basic sets of attributes
    Named entity annotation, co-reference (BBN) merged in
    Detailed breakdown of NP
Prague Arabic Dependency Treebank
  apply same representation to annotation of Arabic
  surface syntax so far
Both published (in first version) 2004 (by LDC)
  PCEDT/PEDT version 2.0 being prepared (2011?)
  Preliminary version available for browsing – see the workshop website

Page 7: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

PDT (Czech) Data

4 sources:
  Lidové noviny (daily newspaper, incl. extra sections)
  DNES (Mladá fronta Dnes) (daily newspaper)
  Vesmír (popular science magazine, monthly)
  Českomoravský Profit (economic journal, weekly)
Full articles selected
  article ~ DOCUMENT (basic corpus unit)
Time period: 1990-1995
1.8 million tokens (~110,000 sentences)

Page 8: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

PDT Annotation Layers

L0 (w) Words (tokens)
  automatic segmentation and markup only
L1 (m) Morphology
  Tag (full morphology, 13 categories), lemma
L2 (a) Analytical layer (surface syntax)
  Dependency, analytical dependency function
L3 (t) Tectogrammatical layer (“deep” syntax)
  Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon

[Side labels on the slide: PDT 1.0 (2001), PDT 2.0 (2006)]

Page 10: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Morphological Attributes

Tag: 13 categories
Example: AAFP3----3N----
Ex.: nejnezajímavějším “(to) the most uninteresting”

  Position  Value  Meaning
  1         A      Adjective
  2         A      Regular
  3         F      Feminine
  4         P      Plural
  5         3      Dative
  6         -      no poss. Gender
  7         -      no poss. Number
  8         -      no person
  9         -      no tense
  10        3      superlative
  11        N      negated
  12        -      no voice
  13        -      reserve1
  14        -      reserve2
  15        -      base var.

Lemma: POS-unique identifier
  Books/verb -> book-1, went -> go, to/prep. -> to-1
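The positional tag above can be unpacked mechanically. The following is a minimal Python sketch, not part of the PDT tools: the function names are hypothetical, the position names are taken from the tagset table in this deck, and RESERVE1/RESERVE2 stand for the two reserved slots shown in the example.

```python
# Position names in tag order; RESERVE1/RESERVE2 are the two reserved slots.
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map each of the 15 tag positions to its one-character value."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


def specified_values(tag):
    """Keep only positions that carry a value ('-' means not applicable)."""
    return {name: value for name, value in decode_tag(tag).items() if value != "-"}


if __name__ == "__main__":
    # Tag of "nejnezajímavějším" from the slide.
    for name, value in specified_values("AAFP3----3N----").items():
        print(name, value)
```

Keeping the tag as a fixed-width string and decoding on demand mirrors the slide's presentation; a real tool would also map each value to its label.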

Page 11: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Morphological Tagset

13 categories, 4452 plausible tags (combinations):

  Category    # of values  Example(s)
  POS         10           N (noun), Z (punctuation)
  SUBPOS      75           P (personal pron.), U (possessive adj.)
  GENDER      8            I (masc. inanimate), X (any), - (N.A.)
  NUMBER      4            P (plural), D (dual)
  CASE        9            1 (nominative), 6 (locative)
  POSSGENDER  4            M (masc. animate), F (feminine)
  POSSNUMBER  3            S (singular), P (plural)
  PERSON      5            1 (first), ...
  TENSE       4            P (present), M (past)
  GRADE       5            3 (superlative)
  NEGATION    3            A (affirmative), N (negative)
  VOICE       3            A (active), P (passive)
  VAR         11           1 (1st variant), 6 (colloq. style), 8 (abbrev.)

Page 12: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Morphological Analysis

Formally: MA: A+ → Pow(L × T)
  MA(f) = { [l,t] }; f ∈ A+ (the token), l ∈ L (lemma), t ∈ T (tag)
tokens taken in isolation
no attempt to solve e.g. auxiliaries vs. full verbs
Ex.: “má” (lit. “has”, “my”)
  MA(“má”) = {
    [mít,VB-S---3P-AA---],   lit. “to have”
    [můj,PSFS1-S1------1],   lit. “my”
    [můj,PSFS5-S1------1],
    [můj,PSNP1-S1------1],
    [můj,PSNP4-S1------1],
    [můj,PSNP5-S1------1] }
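The definition MA: A+ → Pow(L × T) says that analysis maps a token, taken in isolation, to a set of (lemma, tag) pairs. A minimal Python sketch of that signature follows; the one-entry lexicon is a toy stand-in filled with the slide's example, and the names `TOY_LEXICON` and `ma` are hypothetical, not the actual PDT analyzer.

```python
# Toy lexicon: token form -> set of (lemma, tag) pairs, copied from the slide.
TOY_LEXICON = {
    "má": {
        ("mít", "VB-S---3P-AA---"),
        ("můj", "PSFS1-S1------1"),
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
        ("můj", "PSNP4-S1------1"),
        ("můj", "PSNP5-S1------1"),
    },
}


def ma(form):
    """Return all (lemma, tag) analyses of a token, ignoring its context."""
    return TOY_LEXICON.get(form, set())


def lemmas(form):
    """Distinct lemmas among the analyses of a token."""
    return {lemma for lemma, _tag in ma(form)}


if __name__ == "__main__":
    print(len(ma("má")), sorted(lemmas("má")))
```

Returning a set rather than a single analysis reflects the point that ambiguity is left for the disambiguation (tagging) step.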

Page 13: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Morphological Disambiguation

Full morphological disambiguation
  more complex than (e.g. English) POS tagging
Several full morphological taggers:
  (Pure) HMM
  Feature-based (MaxEnt, NB)
    used in the PDT distribution
  Voted Perceptron (M. Collins, EMNLP’02)
All: ~ 94-96% accuracy (perceptron is best)
  rule & statistic combination: tiny improvement
    (Hajič et al., ACL 2001, Spoustova et al., 2007: > 96%)

Page 14: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

The Segmentation Problem: Arabic

Tokenization / segmentation not always trivial
  Arabic, German, Chinese, Japanese

Page 15: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

The Segmentation Problem: Solution for Arabic

Find max. no. of segments per token, concatenate the segment tags: up to max. 4 segments (x 10 tag positions) for Arabic

F---------VIIA-3MS--S----3MP4----------- sa-+yu-hbir-u+-hum+0

F---------VIIA-3MS--S----3MP4-----------

P---------SD----MS----------------------

P---------------------------------------

N-------2R------------------------------

N-------2D------------------------------

A-----FS2D------------------------------
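The concatenated tags above can be cut back into per-segment tags. The sketch below assumes a fixed width of 10 positions per segment, which is read off the "4 (x10)" note and the 40-character example rather than from any official specification; the function names are hypothetical.

```python
SEGMENT_TAG_WIDTH = 10  # assumption: 4 segments x 10 positions = 40 characters

# The slide's tag for "sa-+yu-hbir-u+-hum+0", written as four adjacent
# literals for readability; Python joins them into one 40-character string.
TAG = "F---------" "VIIA-3MS--" "S----3MP4-" "----------"


def split_segment_tags(tag, width=SEGMENT_TAG_WIDTH):
    """Cut a concatenated tag into fixed-width per-segment tags."""
    if len(tag) % width != 0:
        raise ValueError("tag length is not a multiple of the segment width")
    return [tag[i:i + width] for i in range(0, len(tag), width)]


def used_segments(tag, width=SEGMENT_TAG_WIDTH):
    """Drop all-dash chunks, i.e. segment slots the token does not use."""
    return [chunk for chunk in split_segment_tags(tag, width) if set(chunk) != {"-"}]


if __name__ == "__main__":
    print(split_segment_tags(TAG))
    print(used_segments(TAG))
```

With a fixed number of slots, every token receives a tag of the same length, so a tagger can treat segmentation and tagging as one classification problem.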

Resulting annotation:

Page 16: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Arabic Tagging Results

Maximum entropy, features ~ categories
Experiments on Penn Arabic Treebanks
  POS: 95.25-97.37%
  Full morph.: 88.17-89.31%
  Segmentation: 98.60-99.37%
Prague Arabic Dependency Treebank
  POS: 96.02%
  Full morph.: 89.24%
  Segmentation: 99.25%

Page 18: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Layer 2 (a-layer): Analytical Syntax

Dependency + Analytical Function

[Figure labels: dependent, governor]

Example sentence: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.

Page 19: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Analytical Syntax: Functions

Main (for [main] semantic lexemes):
  Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
  “Double” dependency: AtrAdv, AtrObj, AtrAtr
Special (function words, punctuation, ...):
  Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
  Prepositions/Conjunctions: AuxP, AuxC
  Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
Structural
  Ellipsis: ExD
  Coordination etc.: Coord, Apos

Page 20: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Surface Syntax Example

Complete sentence: Sb, Pred, Obj
  The-baker bakes rolls.
  Pekař peče housky.
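One way to picture the analytical tree for this sentence is as a list of nodes, each with a governor and an analytical function. This is a small illustrative sketch, not the PDT data format: the `Node` class and field names are hypothetical, and attaching Sb and Obj to the predicate (with index 0 as a technical root) is the usual dependency reading, stated here as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Node:
    index: int   # position of the word in the sentence, starting at 1
    form: str    # word form
    afun: str    # analytical function
    head: int    # index of the governor; 0 stands for the technical root


# "Pekař peče housky." (The-baker bakes rolls.), punctuation omitted.
SENTENCE = [
    Node(1, "Pekař", "Sb", 2),
    Node(2, "peče", "Pred", 0),
    Node(3, "housky", "Obj", 2),
]


def dependents(nodes, head_index):
    """Word forms of all nodes governed by the node at head_index."""
    return [node.form for node in nodes if node.head == head_index]


if __name__ == "__main__":
    print(dependents(SENTENCE, 2))
```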

Page 21: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Surface Syntax Example

Incomplete phrases
  Peter works well, but Paul badly
  Petr pracuje dobře, ale Pavel špatně

Page 22: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Surface Syntax Example

Variants (equal meaning)
  (he) bought shoes for boy
  koupil boty pro kluka

Page 23: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

PDT-style Arabic Surface Syntax

Only a few differences
  (Sometimes) Separate nodes for individual segments (cf. tagging/segmentation)
  Copula treatment (Czech: rare, treated as ellipsis; Arabic: systematic solution needed): Pred
  (Added) analytic functions:
    AuxM (did-not)
    Ante (what)
Work by Faculty of Arts, Charles University Arabic language students

Page 24: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Arabic Surface Syntax Example

In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.

Page 25: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

English Analytic Layer

By conversion from PTB
Extended analytic functions
Head rules
  Jason Eisner’s, added more for full conversion
    Coordination, traces, etc.
Coordination handling
  Same as in Czech/Arabic PDT

Page 26: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Penn Treebank

University of Pennsylvania, 1993
  Linguistic Data Consortium
Wall Street Journal texts, ca. 50,000 sentences
  1989-1991
  Financial (most), news, arts, sports
  2499 (2312) documents in 25 sections
Annotation
  POS (Part-of-speech tags)
  Syntactic “bracketing” + bracket (syntactic) labels
  (Syntactic) Function tags, traces, co-indexing
  + Propbanking

Page 27: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Penn Treebank Example

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Callouts on the slide:
  POS tag (NNS): noun, plural (“preterminal”)
  Phrase label (NP): Noun Phrase
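The bracketed notation above is easy to read programmatically. The following is a minimal sketch in plain Python, with no treebank library assumed; `parse_ptb` and `leaves` are hypothetical helper names. It turns the bracketing into nested (label, children) pairs and recovers the words of the sentence.

```python
import re

PTB_EXAMPLE = """
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
             (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
     (VP (MD will)
         (VP (VB join) (NP (DT the) (NN board) )
             (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
             (NP-TMP (NNP Nov.) (CD 29) )))
     (. .) ))
"""


def parse_ptb(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either sub-trees or plain strings (the words).
    The outermost bracket of a PTB tree has an empty label.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Words of the tree, left to right."""
    words = []
    for child in tree[1]:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


if __name__ == "__main__":
    print(" ".join(leaves(parse_ptb(PTB_EXAMPLE))))
```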

Page 28: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Penn Treebank Example: Sentence Tree

Phrase-based tree representation: [tree figure]

Page 29: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Parallel Czech-English Annotation

English text -> Czech text (human translation)
Czech side (goal): all layers manual annotation
English side (goal):
  Morphology and surface syntax: technical conversion
    Penn Treebank style -> PDT Analytic layer
  Tectogrammatical annotation: manual annotation
    (Slightly) different rules needed for English
Alignment
  Natural, sentence level only (now)

Page 30: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Human Translation of WSJ Texts

Hired translators / FCE level
Specific rules for translation
  Sentence per sentence only
    …to get simple 1:1 alignment
  Fluent Czech at the target side
  If a choice, prefer “literal” translation
The numbers:
  English tokens: 1,173,766
  Translated to Czech:
    Revised/PCEDT 1.0: 487,929
    Now finished (all 2312 documents)

Page 31: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

English Annotation: POS and Syntax

Automatic conversion from Penn Treebank
PDT morphological layer
  From POS tags
PDT analytic layer
  From:
    Penn Treebank Syntactic Structure
    Non-terminal labels
    Function tags (non-terminal “suffixes”)
  2-step process
    Head determination rules
    Conversion to dependency + analytic function

Page 32: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Head Determination Rules

Exhaustive set of rules
  By J. Eisner + M. Cmejrek/J. Curin
  4000 rules (non-terminal based)
    Ex.: (S (NP-SBJ VP .)) → VP
Additional rules
  Coordination, Apposition
  Punctuation (end-of-sentence, internal)
Original idea (possibility of conversion)
  J. Robinson (1960s)

Page 33: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Example: Head Determination Rules (J.E.)

Rules:
  (NP (DT NN)) → NN
  (VP (VB NP)) → VB
  (VP (MD VP)) → VP
  (S (… VP …)) → VP

[Figure: phrase-structure tree with a head word marked on each node; node labels (the), (board), (will), (join)]
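The two-step idea (pick a head child per phrase, then attach the other children's heads to it) can be sketched in a few lines of Python. This is an illustration only, using just the four rules listed on this slide and the fragment "will join the board"; the data layout and function names are hypothetical and do not reproduce the actual rule set.

```python
# Exact-pattern rules from the slide: (parent, child labels) -> head child label.
HEAD_RULES = {
    ("NP", ("DT", "NN")): "NN",
    ("VP", ("VB", "NP")): "VB",
    ("VP", ("MD", "VP")): "VP",
}


def head_child_label(label, child_labels):
    """Label of the head child; raises KeyError for patterns not covered here."""
    if label == "S" and "VP" in child_labels:  # (S (... VP ...)) -> VP
        return "VP"
    return HEAD_RULES[(label, tuple(child_labels))]


def to_dependencies(tree, deps):
    """Return the head word of `tree`; append (dependent, governor) pairs to `deps`.

    A tree is (label, children); a preterminal is (tag, word).
    """
    label, children = tree
    if isinstance(children, str):
        return children
    heads = [to_dependencies(child, deps) for child in children]
    labels = [child[0] for child in children]
    h = labels.index(head_child_label(label, labels))
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))
    return heads[h]


# (VP (MD will) (VP (VB join) (NP (DT the) (NN board))))
FRAGMENT = ("VP", [("MD", "will"),
                   ("VP", [("VB", "join"),
                           ("NP", [("DT", "the"), ("NN", "board")])])])

if __name__ == "__main__":
    pairs = []
    print(to_dependencies(FRAGMENT, pairs), pairs)
```

Note how the rule (VP (MD VP)) → VP makes the auxiliary "will" a dependent of the main verb "join".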

Page 34: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Conversion: Analytic Structure, Functions

Analytic Function assignment (conversion)
  Rules based on functional tags:

    Functional tag                                        Analytic function
    -SBJ                                                  Sb
    -PRD                                                  Pnom
    -BNF, -DTV, -LGS                                      Obj
    -ADV, -DIR, -EXT, -LOC, -MNR, -PRP, -PUT, -TMP        Adv

  Ad-hoc rules (if functional tags missing)
Lemmatization (years → year)
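The tag-based part of this assignment amounts to a lookup. Below is a minimal Python sketch containing only the mappings listed on this slide; the function name is hypothetical, and returning None stands in for "fall back to the ad-hoc rules", which are not specified here.

```python
# Penn Treebank functional tag -> PDT analytic function (from the slide).
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb",
    "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_from_label(label):
    """Analytic function for a non-terminal label such as 'NP-SBJ'.

    Returns None when no listed functional tag is present.
    """
    for suffix in label.split("-")[1:]:
        if suffix in FUNCTION_TAG_TO_AFUN:
            return FUNCTION_TAG_TO_AFUN[suffix]
    return None


if __name__ == "__main__":
    for label in ("NP-SBJ", "NP-TMP", "PP-CLR", "NP"):
        print(label, afun_from_label(label))
```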

Page 35: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Example: Analytical Structure, Functions

Penn Treebank structure (with heads added) → PDT-like Analytic Representation

[Figure: node labels (the), (board), (will), (join)]

Page 36: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Annotation tools

TrEd
  Visualization, annotation, processing, search
  Perl, Perl-Tk
  Customizable
  Multiple data formats, native: Prague ML (XML)
Demonstration I
  Surface-syntax annotation

Page 37: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Speech reconstruction

Spoken input
  ASR does not do capitalization, punctuation
  Disfluencies
    Repetitions
    Corrections
    Fillers
    Deviations from syntax
    ...
ungrammatical / dissimilar to text

Page 38: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Objective

Create gold-standard data for
  (Statistical) training
  Testing
Use in machine learning of
  automatic speech reconstruction
  (eventually) language understanding
Go beyond state-of-the-art
  ASR Post-Correction / disfluency removal (cf. e.g. Fitzgerald, 2008, or Lopez/Cozar & Callejas, 2008)

Page 39: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

What is Speech Reconstruction

original transcript → edited transcript resembling an interview editing for print

... apart from disfluency removal?

[Figure: transcript fragments “we 're sitting on the step of”, “I think it 's”, “my aunt Molly 's house”, “and”, “I think”, “????”; panel labels “Disfluency removal” and “Speech reconstruction”]

We’re sitting on the step of my aunt Molly’s house, ..

Page 40: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Word-for-word transcription

using Transcriber 1.5.1
  audio synchronization
  spoken text
  non-speech events (UH-HUH, laughter, etc.)
  rough segmentation (defined acoustically)
  speaker/turn identification

Page 41: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Annotation rules

• sentence segmentation
• orthography
• capitalization
• punctuation
• morphosyntax
• word order
• partial ellipsis restoration
• no discourse-irrelevant non-speech events
• filler and fragment deletion
• but: ...meaning preservation

~ written-text standards

Page 42: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Segment splitting

(reconstruction) segment = sentence

(transcript) segment ~ non-silence span

Page 43: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Word order, deletions, insertions, ...

punctuation, capitalization, ...

Page 44: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Function word insertion

Page 45: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Annotation layers

Multilayered standoff annotation

Page 46: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Current status

English dialogues
  151,000 words (14.5 h)
  16,000 words double annotated
  Audio / manual transcript / reconstruction
  Auto tagging/parsing: syntax / semantics
Annotation manual
Baseline automatic SR systems

Page 47: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Conclusion

Language resources
  Beyond post-ASR corrections: “Speech Reconstruction”
  Integrated annotation of
    Audio, ASR, manual transcription, edited speech reconstruction [morphology, syntax, semantics]
The next step: machine learning

Page 48: Prague Dependency Treebank(s) Workshop at LSA2011, Part I

Czech Example

Original transcription:
  ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání

Recovering punctuation and capitalization:
  Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k ni k nim jako do zaměstnání

Translation:
  Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání

Reference reconstructions:
  (a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání .
  (b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání .

Approximate English gloss of (a): “But the reason was also that the ‘Škoda people’ would have been awfully glad if I could have come to work for them there.”