
ANNOTATION FOR ALL seeking optimal solutions in syntactic annotation

Catarina Magro (CLUL – Dialectology & Diachrony)

ABC 2016 Faculdade de Letras da UL

CLUL’s syntactic

annotation policy

standardize

open up

Portuguese parsed corpora

1999 •  CORDIAL-SIN. Syntax-oriented Corpus of Portuguese Dialects

(Martins, Coord., [2000- ] 2010)

2012 •  P.S. Post Scriptum. A Digital Archive of Ordinary Writing in Early Modern Portugal and Spain (CLUL, Ed., 2014)

•  WOChWEL's POS-tagged and Parsed Old Portuguese texts (Martins, Pereira & Cardoso, 2013-15)

1998 •  Tycho Brahe Parsed Corpus of Historical Portuguese

(Galves & Faria, 2010)

UL

UNICAMP


WOChWEL •  Literary and historiographical texts (XIII-XIV)

P.S. Post Scriptum •  Private letters (XVI-XIX)

Tycho Brahe •  Literary and technical texts (XIV-XIX) •  Newspaper texts and private letters (XIX-XX)

Cordial-Sin •  Dialectal speech (XX)

The Penn treebank family

Penn Corpora of Historical English

Modéliser le changement : les voies du français (MCVF)

NINJAL Parsed

Corpus of Modern

Japanese (NPCMJ)


Audio-Aligned and Parsed Corpus of

Appalachian English

(AAPCAppE)

Icelandic Parsed

Historical Corpus

(IcePaHC)

Tycho Brahe

Cordial

Wochwel

P. S.

The Penn treebank annotation system

•  the annotation system adopts a version of constituency grammar that assumes:
   •  one level of representation;
   •  empty categories (in antecedent-gap chains and in situ).

•  the primary goal of the annotation is the facilitation of automated search, not the adoption of a linguistically-accurate encoding.

(Santorini, 2010)


•  produces quite flat and sometimes linguistically unmotivated syntactic representations
   •  multiple branching nodes
   •  some word level nodes (e.g. verbs, negation, sentence focus particles)
   •  omission of undecidable information (e.g. VP boundaries)
   •  omission of subtle distinctions (e.g. argument vs adjunct PPs)
   •  use of default rules (w.r.t. location of wh-traces and structural ambiguity, among others)


•  provides the encoding of
   •  constituent boundaries
   •  phrase and clause dependencies
   •  categorial information (e.g. NP, PP, ADVP)
   •  grammatical functions (e.g. SBJ, ACC, DAT)
   •  some discourse functions (e.g. LFD, PRG)
   •  sentence and clause type (e.g. EXL, CMP, QUE)
   •  some null constituents
   •  certain transformational relations


•  syntactic annotation is represented as labelled bracketing over morphologically tagged texts
   •  word tags – POS tags
   •  phrase and clause main labels – category labels
   •  phrase and clause extended labels – subcategory, grammatical relation or discourse function labels

•  in the labelled bracketing representation, the level of indentation corresponds to the depth of structural embedding


Yesterday Mary told Jane that she studied too much during the weekend.

(IP-MAT (NP-TMP (N Yesterday))                             ← adjunct NP
        (NP-SBJ (NPR Mary))                                ← subject
        (VBD told)                                         ← verb
        (NP-OB2 (NPR Jane))                                ← second object
        (CP-THT (C that)                                   ← that clause
                (IP-SUB (NP-SBJ (PRO she))                 ← subject
                        (VBD studied)                      ← verb
                        (NP-MSR (QP (ADVR too) (Q much)))  ← adjunct NP
                        (PP (P during)                     ← adjunct PP
                            (NP (D the) (N+N weekend))))))
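The correspondence between bracketing and depth of embedding can be checked mechanically. The following is a minimal sketch, not part of any corpus toolkit: the function names `parse` and `constituents` are illustrative, and the parser assumes a single tree that begins with a labelled node such as `(IP-MAT ...`.

```python
import re

def parse(text):
    """Parse one labelled-bracketing tree into nested (label, children) pairs.

    Children are either nested pairs (constituents) or plain strings (words).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # embedded constituent
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return node()

def constituents(tree, depth=1):
    """Yield (depth, label) for every node, top-down and left to right."""
    label, children = tree
    yield depth, label
    for child in children:
        if not isinstance(child, str):
            yield from constituents(child, depth + 1)

def words(tree):
    """Return the terminal strings of the tree in surface order."""
    out = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

TREE = """
(IP-MAT (NP-TMP (N Yesterday))
        (NP-SBJ (NPR Mary))
        (VBD told)
        (NP-OB2 (NPR Jane))
        (CP-THT (C that)
                (IP-SUB (NP-SBJ (PRO she))
                        (VBD studied)
                        (NP-MSR (QP (ADVR too) (Q much)))
                        (PP (P during)
                            (NP (D the) (N+N weekend))))))
"""

tree = parse(TREE)
# Indentation is recomputed from the bracketing alone.
for depth, label in constituents(tree):
    print("  " * (depth - 1) + label)
```

Printing each label indented by its depth reproduces the layout of the tree, which is the sense in which indentation mirrors structural embedding.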

Adapting the Penn system

•  to adapt a system originally designed for the annotation of Middle English to annotate a typologically distinct language such as Portuguese
   •  label set
   •  annotation schemes


(NODE (IP-INF (TO to)
              (VB for+gyue)
              (NP-OB2 (D a) (ADJ synful) (N man))
              (NP-OB1 (PRO$ his) (NS synnes))))
(PPCME2; SEC. XV; ID CMAELR3,43.513)

•  double object → oblique dative



( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .))
  (P.S.; SEC. XVIII; ID CARDS0036,.3))


( (IP-MAT (CONJ and)
          (PP (P for)
              (NP (D +dare) (ADJ euele) (N +gewune)))
          (NEG ne)
          (VBP +dinc+d)
          (NP-SBJ (PRO hit))
          (NP-OB2 (PRO hem))
          (NP-OB1 (Q no) (N misdade))
          (. ,))
  (ID CMVICES1,79.910))

( (IP-MAT (CONJ ac)
          (NP-SBJ *pro*)             ← subject coreferential with NP-OB2 of previous clause
          (BEP bie+d)                (scribal error)
          (VAN ihealden)
          (PP (PP (P for)
                  (NP (ADJ wise)
                      (NS menn)))
              (CONJP (CONJ and)
                     (PP (P for)
                         (NP (ADJ +geape)))))
          (. .))
  (PPCME2; SEC. XIII; ID CMVICES1,79.911))

•  non-pro drop → pro drop


( (IP-MAT (NP-SBJ *pro*)             ← referential null subject in a non-dependent clause
          (VB-D Declamei)
          (PP (P contra)
              (NP (D-F a) (N vaidade)))
          (. ,))
  (TYCHO BRAHE; SEC XVIII; ID A_001_PSD,03.3))


preserve expand adapt create


•  but what are in fact the “needs” of Portuguese corpora?

5 000 000 words

different historical varieties

different spatial

varieties

different situational contexts

different speakers’

social status

synchronic and diachronic variation



the annotator does not know the precise range of the 4 sets of data

the annotator cannot anticipate optimal annotation solutions

the annotator feels that annotating is always more urgent than writing down guidelines

4 teams, c. 50 annotators

15 years of work

INCONSISTENCY

Converging Portuguese corpora

•  4 corpora •  the same set of parsing guidelines

Portuguese Syntactic Annotation Manual


For corpus users
•  ensures easy-to-use data access;
•  makes comparative surveys of data, designed to answer specific questions, highly productive;
•  makes it possible to replicate quantitative studies on new datasets.

For corpus creators
•  speeds up parser training;
•  improves automatic parsing of new data.


The big challenge
•  to design a unified encoding system that makes it possible to search across diachronic and dialectal varieties for properties that are either shared or exclusive.


Cleft constructions
(1) o joão leu o poema
    John read the poem

In English corpora:
It-cleft
(2) it was the poem that john read
Wh-cleft (pseudocleft)
(3) what john read was the poem
Reverse Wh-cleft
(4) the poem was what john read

wh-clefts
canonic                 foi o poema o que o joão leu
reverse                 o poema foi o que o joão leu
pseudo                  o que o joão leu foi o poema
double copula           é o que o joão leu é o poema
reduced                 o que o joão quer que a ana leia o poema
semi pseudo             o joão leu foi o poema

that-clefts
canonic                 foi o poema que o joão leu
non-agreeing copula     é os poemas que o joão leu
reduced                 o poema que o joão leu
reverse                 o poema é que o joão leu
double copula           é o poema é que o joão leu
double ‘é que’          o poema é que é que o joão leu


[Figure: the inventory of wh-clefts and that-clefts above, shown once per variety: XIII-XV centuries (Middle Ages); XVI century; Classical Portuguese (XVII-XVIII); Contemporary standard Portuguese (XIX-XX); Dialectal European Portuguese (XX); Brazilian Portuguese (XX)]


Cleft constructions
•  complementizer (+ -)
•  wh-phrase (+ - 0)
•  copula (+ - 0 x)
•  focus position (L R M)


                        wh-element   complementizer   copula   focus
canonic wh-cleft            +              -             +       M
reverse wh-cleft            +              -             +       L
pseudocleft                 +              -             +       R
double copula               +              -             x       R
reduced pseudocleft         +              -             0       R
semi pseudocleft            0              -             +       R
canonic that-cleft          -              +             +       M
reduced that-cleft          -              +             0       M
reverse that-cleft          -              +             +       L
double copula               -              +             x       L





Alternative hypothesis: Cleft A, B, C, D, E, …
•  prevents finding all clefts with shared features;
•  prevents finding other constructions that display the same features:

EP (south central dialects) (recursive ‘é que’)
O que é que é que o joão leu?

BP (null copula)
O que que o João leu?
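The advantage of feature-based encoding over atomic labels (Cleft A, B, C, …) can be illustrated with a small sketch. It copies the feature values of the cleft table as they stand, without interpreting the symbols. The names `Cleft`, `CLEFTS` and `select` are illustrative only, and the suffixes "(wh)" and "(that)" are added here merely to keep the two "double copula" rows apart.

```python
from typing import NamedTuple

class Cleft(NamedTuple):
    wh: str       # wh-element column
    comp: str     # complementizer column
    copula: str   # copula column
    focus: str    # focus position column

# Rows copied from the feature table; key suffixes are an assumption of this sketch.
CLEFTS = {
    "canonic wh-cleft":     Cleft("+", "-", "+", "M"),
    "reverse wh-cleft":     Cleft("+", "-", "+", "L"),
    "pseudocleft":          Cleft("+", "-", "+", "R"),
    "double copula (wh)":   Cleft("+", "-", "x", "R"),
    "reduced pseudocleft":  Cleft("+", "-", "0", "R"),
    "semi pseudocleft":     Cleft("0", "-", "+", "R"),
    "canonic that-cleft":   Cleft("-", "+", "+", "M"),
    "reduced that-cleft":   Cleft("-", "+", "0", "M"),
    "reverse that-cleft":   Cleft("-", "+", "+", "L"),
    "double copula (that)": Cleft("-", "+", "x", "L"),
}

def select(**features):
    """Return the names of all cleft types whose features match the given values."""
    return [name for name, cleft in CLEFTS.items()
            if all(getattr(cleft, key) == value for key, value in features.items())]

# One query retrieves every construction sharing a property,
# regardless of which named cleft type it belongs to.
print(select(copula="0"))
print(select(wh="+", focus="R"))
```

With atomic labels, each of these queries would have to enumerate the relevant cleft types by hand; with features, the shared property is searched for directly.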


[Diagram: minimal units of annotation. Labels: copula, doubled copula, null copula, non-agreeing copula, complementizer, wh-element, é que, recursive é que; clefts, relatives, questions, presentational ‘ser’, extraposed subject clauses, that clauses]


Opening up the use of parsed corpora

building parsed corpora

using parsed corpora


Using Penn-style parsed corpora requires:
•  knowing the annotation system;
•  installing the search software (CorpusSearch2; Randall, 2010) and Java;
•  being acquainted with the search language;
•  running the search queries from the command line.

we’re doomed! we’ll never make it!


The P. S. project designed two strategies to circumvent these issues (cf. also Pereira, 2015):
•  enabling online access to the parsed corpus;
•  implementing a search interface.

TEITOK web-based platform (Janssen, 2014)
•  viewing, creating and editing corpora with both rich textual mark-up and linguistic annotation;
↓
data stored in full-fledged XML files in the standards defined by the Text Encoding Initiative.

The online access

P.S. Post Scriptum in TEITOK (until 2016):
•  facsimile images of the letter manuscripts;
•  metadata (e.g. biographies, social status, events, places, time);
•  philological mark-up (e.g. support, hands, deletions, spellings);
•  linguistic annotation (e.g. lemmatization, POS annotation, phonological and morphological phenomena, syntactic annotation).

In 2016:
•  parsed corpus is online!



The online visualization of the P.S. parsed corpus requires:
•  the conversion of the original Penn-treebank labeled-bracket format (psd) into the xml format (psdx) (Ecay & Bacovcin 2014; van Gompel 2014);
•  the alignment of psdx and the source xml (stand-off annotation) (cf. Grover/Matthews/Tobin 2006; Marquilhas & Hendrickx 2016; McEnery/Wilson 2001; Schmidt 2010).


from labeled-bracket

( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .)))

to xml

<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ">
      <eLeaf Notext="*pro*"/>
    </eTree>
    <eTree Label="NP-ACC">
      <eTree Label="DEM">
        <eLeaf Text="Isto"/>
      </eTree>
    </eTree>
    <eTree Label="VB-P">
      <eLeaf Text="prometo"/>
    </eTree>
    <eTree Label="PP">
      <eTree Label="P">
        <eLeaf Text="a"/>
      </eTree>
      <eTree Label="NP">
        <eTree Label="NPR">
          <eLeaf Text="VM"/>
        </eTree>
      </eTree>
    </eTree>
    <eTree Label="ADVP">
      <eTree Label="ADV">
        <eLeaf Text="fixemente"/>
      </eTree>
    </eTree>
    <eTree Label=".">
      <eLeaf Text="."/>
    </eTree>
  </eTree>
</forest>
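The conversion from labeled-bracket to xml can be sketched in a few lines. This is a minimal illustration and not the converter actually used by the project: it only reproduces the `forest`, `eTree` and `eLeaf` elements and the `Label`, `Text` and `Notext` attributes seen in the sample, and it assumes that any terminal starting with `*` is an empty category. The function name `bracket_to_forest` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def bracket_to_forest(text):
    """Convert one wrapped tree "( (LABEL ...))" into a <forest> element."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def build(parent):
        nonlocal pos
        pos += 1                                   # consume "("
        label = tokens[pos]
        pos += 1
        node = ET.SubElement(parent, "eTree", Label=label)
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                build(node)                        # embedded constituent
            else:
                word = tokens[pos]
                pos += 1
                if word.startswith("*"):           # assumption: empty category
                    ET.SubElement(node, "eLeaf", Notext=word)
                else:
                    ET.SubElement(node, "eLeaf", Text=word)
        pos += 1                                   # consume ")"

    forest = ET.Element("forest")
    pos += 1                                       # consume the unlabelled wrapper "("
    while tokens[pos] != ")":
        build(forest)
    return forest

PSD = """
( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .)))
"""

forest = bracket_to_forest(PSD)
print(ET.tostring(forest, encoding="unicode"))
```

The sketch does not handle the `(ID ...)` node that closes trees in the distributed psd files; a fuller converter would map it to attributes of `forest`.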


psdx and source xml alignment

<forest forestId="3" File="CARDS0036" Location=".3" sentid="s-4" id="tree-3">
  <eTree Label="IP-MAT" id="node-309">
    <eTree Label="NP-SBJ" id="node-310">
      <eLeaf Notext="*pro*" id="node-311"/>
    </eTree>
    <eTree Label="NP-ACC" id="node-312">
      <eTree Label="DEM" id="node-313">
        <eLeaf Text="Isto" tokid="w-99" id="node-314"/>
      </eTree>
    </eTree>
    <eTree Label="VB-P" id="node-315">
      <eLeaf Text="prometo" tokid="w-100" id="node-316"/>
    </eTree>
    <eTree Label="PP" id="node-317">
      <eTree Label="P" id="node-318">
        <eLeaf Text="a" tokid="w-101" id="node-319"/>
      </eTree>
      <eTree Label="NP" id="node-320">
        <eTree Label="NPR" id="node-321">
          <eLeaf Text="VM" tokid="w-102" id="node-322"/>
        </eTree>
      </eTree>
    </eTree>
    <eTree Label="ADVP" id="node-323">
      <eTree Label="ADV" id="node-324">
        <eLeaf Text="fixemente" tokid="w-103" id="node-325"/>
      </eTree>
    </eTree>
    <eTree Label="." id="node-326">
      <eLeaf Text="." tokid="w-104" id="node-327"/>
    </eTree>
  </eTree>
</forest>
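The alignment step can be pictured as copying token identifiers from the source xml onto the overt leaves of the tree, while empty categories such as `*pro*` stay without a `tokid`. The sketch below is an assumption-laden illustration: the `<s>`/`<tok id="...">` source fragment is a hypothetical stand-in for the transcription file, and the function name `align` is not taken from TEITOK.

```python
import xml.etree.ElementTree as ET

PSDX = """
<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="NP-ACC"><eTree Label="DEM"><eLeaf Text="Isto"/></eTree></eTree>
    <eTree Label="VB-P"><eLeaf Text="prometo"/></eTree>
    <eTree Label="PP">
      <eTree Label="P"><eLeaf Text="a"/></eTree>
      <eTree Label="NP"><eTree Label="NPR"><eLeaf Text="VM"/></eTree></eTree>
    </eTree>
    <eTree Label="ADVP"><eTree Label="ADV"><eLeaf Text="fixemente"/></eTree></eTree>
    <eTree Label="."><eLeaf Text="."/></eTree>
  </eTree>
</forest>
"""

# Hypothetical source fragment: one <tok> per transcribed token.
SOURCE = """
<s id="s-4">
  <tok id="w-99">Isto</tok> <tok id="w-100">prometo</tok> <tok id="w-101">a</tok>
  <tok id="w-102">VM</tok> <tok id="w-103">fixemente</tok><tok id="w-104">.</tok>
</s>
"""

def align(forest, sentence):
    """Attach the id of each source token to the matching overt leaf."""
    tokens = list(sentence.iter("tok"))
    # Leaves with Notext are empty categories and have no source token.
    overt = [leaf for leaf in forest.iter("eLeaf") if leaf.get("Text") is not None]
    if len(tokens) != len(overt):
        raise ValueError("tree and source sentence have different token counts")
    for leaf, tok in zip(overt, tokens):
        if leaf.get("Text") != tok.text:
            raise ValueError(f"mismatch: {leaf.get('Text')!r} vs {tok.text!r}")
        leaf.set("tokid", tok.get("id"))
    forest.set("sentid", sentence.get("id"))
    return forest

aligned = align(ET.fromstring(PSDX), ET.fromstring(SOURCE))
print(ET.tostring(aligned, encoding="unicode"))
```

Because the link is made through identifiers rather than by embedding the tree in the transcription, the philological mark-up and the syntactic annotation remain in separate files, which is the point of stand-off annotation.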


The parsed P.S. in TEITOK:
•  corpus users: visualize parsed sentences (several formats) aligned with the transcribed form;
•  corpus creators: edit syntactic trees at both label and structural level.



The search interface

•  the storage of the parsed corpus in TEITOK, under a standard file format such as xml, allows the development of new online search engines.

•  TEITOK offers a tree search interface in which the user can search through PSDX files by:
   •  writing query expressions in the XPath language (the common syntax for addressing nodes in an xml tree);
   •  choosing predefined search queries.
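To give an idea of what an XPath query over a PSDX tree looks like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. It does not reproduce the TEITOK interface; the two query expressions are illustrative and rely solely on the `eTree`/`eLeaf` elements and `Label`/`Text`/`Notext` attributes shown in the sample tree.

```python
import xml.etree.ElementTree as ET

PSDX = """
<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="NP-ACC"><eTree Label="DEM"><eLeaf Text="Isto"/></eTree></eTree>
    <eTree Label="VB-P"><eLeaf Text="prometo"/></eTree>
    <eTree Label="PP">
      <eTree Label="P"><eLeaf Text="a"/></eTree>
      <eTree Label="NP"><eTree Label="NPR"><eLeaf Text="VM"/></eTree></eTree>
    </eTree>
    <eTree Label="ADVP"><eTree Label="ADV"><eLeaf Text="fixemente"/></eTree></eTree>
    <eTree Label="."><eLeaf Text="."/></eTree>
  </eTree>
</forest>
"""

forest = ET.fromstring(PSDX)

# Query 1: null subjects, i.e. an NP-SBJ node immediately dominating *pro*.
null_subjects = forest.findall(".//eTree[@Label='NP-SBJ']/eLeaf[@Notext='*pro*']")

# Query 2: PPs that are immediate daughters of a matrix clause.
matrix_pps = forest.findall(".//eTree[@Label='IP-MAT']/eTree[@Label='PP']")

print(len(null_subjects), "null subject(s)")
for pp in matrix_pps:
    # Collect the overt words dominated by the PP, in surface order.
    print("PP:", " ".join(leaf.get("Text") for leaf in pp.iter("eLeaf")))
```

A step such as `eTree[@Label='IP-MAT']/eTree[@Label='PP']` expresses immediate dominance, while `.//` expresses dominance at any depth.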

search results can be matched with extra-linguistic variables

interactive atlases of syntactic patterns (historical and dialectal)

parsed results can automatically be matched with extra-linguistic variables

all Portuguese parsed corpora obey a unified syntactic annotation system




[Timeline, XIII-XVII centuries: extraposed subject clauses, pseudoclefts, reverse that-clefts]

o que quero é que o joão leia / o que quero que o joão leia

bom é que o joão leia / é bom que o joão leia

Tools of this kind form the basis of a new generation of parsed corpora which, we hope, will reconcile (Portuguese) researchers with syntactic annotation.


References

•  CLUL (Ed.). (2014). P.S. Post Scriptum. Arquivo Digital de Escrita Quotidiana em Portugal e Espanha na Época Moderna. http://ps.clul.ul.pt.

•  Ecay, A. & Bacovcin, A. (2014). An implementation of a morphologically-aware corpus annotation format. Presented at the workshop Converging Corpora: How to standardize historical corpora of typologically and genetically different languages. 16th Diachronic Generative Syntax Conference. Budapest. July, 2014.

•  Galves, Charlotte, and Pablo Faria. (2010). Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html.

•  Grover, Claire/Matthews, Michael/Tobin, Richard (2006). Tools to Address the Interdependence between Tokenisation and Standoff Annotation, in: Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing, Stroudsburg PA, Association for Computational Linguistics, 19–26, http://dl.acm.org/citation.cfm?id=1621034.1621038 (30.09.2015).

•  Janssen, M. (2014). TEITOK – The tokenized TEI environment. Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/teitok/site/index.php?action=home

•  Marquilhas, R., & Hendrickx, I. (2016). Avanços nas humanidades digitais. In: A. M. Martins & E. Carrilho (Eds.). Manual de linguística portuguesa (pp. 252-277). Berlin: De Gruyter.

•  Martins, A. M. (coord.) (2000-2010). CORDIAL-SIN: Corpus Dialectal para o Estudo da Sintaxe / Syntax-oriented Corpus of Portuguese Dialects. Lisboa, Centro de Linguística da Universidade de Lisboa. URL: http://www.clul.ul.pt/en/resources/411-cordial-corpus


•  Martins, A. M., Sandra Pereira & Adriana Cardoso. (2013-2015). Parsed José de Arimateia. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa

•  Martins, A. M., Sandra Pereira & Adriana Cardoso. (2014-2015). Parsed Demanda do Santo Graal. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa

•  Martins, A. M., Sandra Pereira & Adriana Cardoso. (2015). Parsed Legal Documents. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa

•  McEnery, T. & Wilson, A. (2001). Corpus Linguistics, Edinburgh, Edinburgh University Press.

•  Pereira, S. (2015). "Arquídia: um recurso a construir", presented at V SIMELP – Simpósio Mundial de Estudos de Língua Portuguesa (Simpósio 41: Dicionarística Portuguesa: Investigação e Projetos em Curso), October 8-11, Lecce.

•  Randall, B. (2010). CorpusSearch 2. University of Pennsylvania. http://corpussearch.sourceforge.net

•  Santorini, B. (2010). Annotation manual for the Penn Historical Corpora and the PCEEC. http://www.ling.upenn.edu/hist-corpora/annotation/index.htm

•  Schmidt, Desmond (2010). The Inadequacy of Embedded Markup for Cultural Heritage Texts, Literary and Linguistic Computing 25:3, 337–356.

•  van Gompel, M. (2014). FoLiA: Format for Linguistic Annotation. Language and Speech Technology. Technical Report Series. Centre for Language Studies, Radboud University Nijmegen.