ANNOTATION FOR ALL seeking optimal solutions in syntactic annotation

Catarina Magro (CLUL – Dialectology & Diachrony)

ABC 2016 Faculdade de Letras da UL

CLUL’s syntactic annotation policy

standardize

open up

Portuguese parsed corpora

UL
1999 •  CORDIAL-SIN. Syntax-oriented Corpus of Portuguese Dialects (Martins, Coord., [2000- ] 2010)
2012 •  P.S. Post Scriptum. A Digital Archive of Ordinary Writing in Early Modern Portugal and Spain (CLUL, Ed., 2014)
     •  WOChWEL's POS-tagged and Parsed Old Portuguese texts (Martins, Pereira & Cardoso, 2013-15)

UNICAMP
1998 •  Tycho Brahe Parsed Corpus of Historical Portuguese (Galves & Faria, 2010)

Portuguese parsed corpora

WOChWEL
•  Literary and historiographical texts (XIII-XIV)

P.S. Post Scriptum
•  Private letters (XVI-XIX)

Tycho Brahe
•  Literary and technical texts (XIV-XIX)
•  Newspaper texts and private letters (XIX-XX)

Cordial-Sin
•  Dialectal speech (XX)

The Penn treebank family

•  Penn Corpora of Historical English
•  Modéliser le changement: les voies du français (MCVF)
•  NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
•  Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE)
•  Icelandic Parsed Historical Corpus (IcePaHC)
•  Portuguese parsed corpora: Tycho Brahe, Cordial, Wochwel, P. S.

The Penn treebank annotation system

•  the annotation system adopts a version of the constituency grammar that assumes:
   •  one level of representation;
   •  empty categories (in antecedent-gap chains and in situ).

•  the primary goal of the annotation is the facilitation of automated search, not the adoption of a linguistically-accurate encoding.

(Santorini, 2010)

The Penn treebank annotation system

•  produces quite flat and sometimes linguistically unmotivated syntactic representations
   •  multiple branching nodes
   •  some word level nodes (e.g. verbs, negation, sentence focus particles)
   •  omission of undecidable information (e.g. VP boundaries)
   •  omission of subtle distinctions (e.g. argument vs adjunct PPs)
   •  use of default rules (w.r.t. location of wh-traces and structural ambiguity, among others)

The Penn treebank annotation system

•  provides the encoding of
   •  constituent boundaries
   •  phrase and clause dependencies
   •  categorial information (e.g. NP, PP, ADVP)
   •  grammatical functions (e.g. SBJ, ACC, DAT)
   •  some discourse functions (e.g. LFD, PRG)
   •  sentence and clause type (e.g. EXL, CMP, QUE)
   •  some null constituents
   •  certain transformational relations

The Penn treebank annotation system

•  syntactic annotation is represented as labelled bracketing over morphologically tagged texts
   •  word tags – POS tags
   •  phrase and clause main labels – category labels
   •  phrase and clause extended labels – subcategory, grammatical relation or discourse function labels
•  in the labeled bracketing representation, level of indenting corresponds to depth of structural embedding

The Penn treebank annotation system

Yesterday Mary told Jane that she studied too much during the weekend.

(IP-MAT (NP-TMP (N Yesterday))
        (NP-SBJ (NPR Mary))
        (VBD told)
        (NP-OB2 (NPR Jane))
        (CP-THT (C that)
                (IP-SUB (NP-SBJ (PRO she))
                        (VBD studied)
                        (NP-MSR (QP (ADVR too) (Q much)))
                        (PP (P during)
                            (NP (D the) (N+N weekend))))))
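As a minimal sketch of how a labelled bracketing like the one above can be read by a program, the Python below turns the example sentence into nested lists and prints one node per line, indented by depth of embedding. It is not part of the Penn tools: `parse_psd` and `show` are hypothetical names chosen for illustration, and the parser assumes a single well-formed tree.

```python
import re

EXAMPLE = (
    "(IP-MAT (NP-TMP (N Yesterday)) (NP-SBJ (NPR Mary)) (VBD told) "
    "(NP-OB2 (NPR Jane)) (CP-THT (C that) (IP-SUB (NP-SBJ (PRO she)) "
    "(VBD studied) (NP-MSR (QP (ADVR too) (Q much))) "
    "(PP (P during) (NP (D the) (N+N weekend))))))"
)

def parse_psd(text):
    """Parse one labelled-bracket tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # embedded constituent
            else:
                children.append(tokens[pos])  # word (or empty category)
                pos += 1
        pos += 1                          # consume ")"
        return [label] + children

    return node()

def show(tree, depth=0):
    """Return the tree as lines, two spaces of indent per level of embedding."""
    label, *children = tree
    pad = "  " * depth
    if all(isinstance(c, str) for c in children):
        return [f"{pad}({label} {' '.join(children)})"]
    lines = [f"{pad}({label}"]
    for child in children:
        if isinstance(child, str):
            lines.append("  " * (depth + 1) + child)
        else:
            lines.extend(show(child, depth + 1))
    lines[-1] += ")"                      # close this node on its last line
    return lines

tree = parse_psd(EXAMPLE)
print("\n".join(show(tree)))
```

Nested lists keep the sketch short; a real tool would more likely use a node class that also stores a pointer to the parent.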

The Penn treebank annotation system

(IP-MAT (NP-TMP (N Yesterday))                              ← adjunct NP
        (NP-SBJ (NPR Mary))                                 ← subject
        (VBD told)                                          ← verb
        (NP-OB2 (NPR Jane))                                 ← second object
        (CP-THT (C that)                                    ← that clause
                (IP-SUB (NP-SBJ (PRO she))                  ← subject
                        (VBD studied)                       ← verb
                        (NP-MSR (QP (ADVR too) (Q much)))   ← adjunct NP
                        (PP (P during)                      ← adjunct PP
                            (NP (D the) (N+N weekend))))))
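The extended labels are what make such a tree searchable by function. The sketch below illustrates this under stated assumptions: it splits each label at its hyphens into a main category and its extensions, and returns the words covered by every node carrying a given extension. `nodes_with_yield` and `by_function` are hypothetical helpers, not CorpusSearch functions.

```python
import re

EXAMPLE = (
    "(IP-MAT (NP-TMP (N Yesterday)) (NP-SBJ (NPR Mary)) (VBD told) "
    "(NP-OB2 (NPR Jane)) (CP-THT (C that) (IP-SUB (NP-SBJ (PRO she)) "
    "(VBD studied) (NP-MSR (QP (ADVR too) (Q much))) "
    "(PP (P during) (NP (D the) (N+N weekend))))))"
)

def nodes_with_yield(psd):
    """Return (label, words) for every node, in the order the nodes close."""
    stack, out = [], []
    for tok in re.findall(r"\(|\)|[^\s()]+", psd):
        if tok == "(":
            stack.append([None, []])           # [label, words covered]
        elif tok == ")":
            label, words = stack.pop()
            out.append((label, words))
            if stack:
                stack[-1][1].extend(words)      # pass the words up to the parent
        elif stack[-1][0] is None:
            stack[-1][0] = tok                 # first token after "(" is the label
        else:
            stack[-1][1].append(tok)           # later tokens are words
    return out

def by_function(psd, extension):
    """Main category and text of every node whose label carries `extension`."""
    hits = []
    for label, words in nodes_with_yield(psd):
        if not label:                          # unlabelled wrapper node
            continue
        main, *extensions = label.split("-")
        if extension in extensions:
            hits.append((main, " ".join(words)))
    return hits

print(by_function(EXAMPLE, "SBJ"))
print(by_function(EXAMPLE, "THT"))
```

Because the helper matches on the extension alone, the same call finds subjects in matrix and embedded clauses alike.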

Adapting the Penn system

•  to adapt a system originally designed for the annotation of Middle English to annotate a typologically distinct language such as Portuguese
   •  label set
   •  annotation schemes

Adapting the Penn system

(NODE (IP-INF (TO to)
              (VB for+gyue)
              (NP-OB2 (D a) (ADJ synful) (N man))
              (NP-OB1 (PRO$ his) (NS synnes)))
      (PPCME2; SEC. XV; ID CMAELR3,43.513))

•  double object → oblique dative

Adapting the Penn system

•  double object → oblique dative

( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .))
  (P.S.; SEC. XVIII; ID CARDS0036,.3))

Adapting the Penn system

( (IP-MAT (CONJ and)
          (PP (P for)
              (NP (D +dare) (ADJ euele) (N +gewune)))
          (NEG ne)
          (VBP +dinc+d)
          (NP-SBJ (PRO hit))
          (NP-OB2 (PRO hem))
          (NP-OB1 (Q no) (N misdade))
          (. ,))
  (ID CMVICES1,79.910))

( (IP-MAT (CONJ ac)
          (NP-SBJ *pro*)        ← subject coreferential with NP-OB2 of previous clause
          (BEP bie+d)             (scribal error)
          (VAN ihealden)
          (PP (PP (P for)
                  (NP (ADJ wise)
                      (NS menn)))
              (CONJP (CONJ and)
                     (PP (P for)
                         (NP (ADJ +geape)))))
          (. .))
  (PPCME2; SEC. XIII; ID CMVICES1,79.911))

•  non-pro drop → pro drop

Adapting the Penn system

•  non-pro drop → pro drop

( (IP-MAT (NP-SBJ *pro*)        ← referential null subject in a non dependent clause
          (VB-D Declamei)
          (PP (P contra)
              (NP (D-F a) (N vaidade)))
          (. ,))
  (TYCHO BRAHE; SEC XVIII; ID A_001_PSD,03.3))
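Once null subjects are encoded as `(NP-SBJ *pro*)`, the contrast between a pro-drop and a non-pro-drop variety becomes countable. The sketch below is a rough illustration rather than a real corpus query: it counts null and total subjects with two regular expressions, and `null_subject_counts` is a hypothetical name. The corpus reference lines are left out of the two sample trees.

```python
import re

TYCHO = (
    "( (IP-MAT (NP-SBJ *pro*) (VB-D Declamei) "
    "(PP (P contra) (NP (D-F a) (N vaidade))) (. ,)))"
)
ENGLISH = (
    "(IP-MAT (NP-TMP (N Yesterday)) (NP-SBJ (NPR Mary)) (VBD told) "
    "(NP-OB2 (NPR Jane)) (CP-THT (C that) (IP-SUB (NP-SBJ (PRO she)) "
    "(VBD studied) (NP-MSR (QP (ADVR too) (Q much))) "
    "(PP (P during) (NP (D the) (N+N weekend))))))"
)

# A subject node whose only content is the empty category *pro*.
NULL_SUBJECT = re.compile(r"\(NP-SBJ\s+\*pro\*\s*\)")
# Any subject node, null or overt.
ANY_SUBJECT = re.compile(r"\(NP-SBJ\b")

def null_subject_counts(psd):
    """Return (number of null subjects, number of subjects) in a psd string."""
    return len(NULL_SUBJECT.findall(psd)), len(ANY_SUBJECT.findall(psd))

print(null_subject_counts(TYCHO))
print(null_subject_counts(ENGLISH))
```

A regular expression is enough here only because the pattern is local to one node; conditions on dominance or precedence need a tree-aware search tool.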

Adapting the Penn system

preserve expand adapt create

Adapting the Penn system

•  but what are in fact the “needs” of Portuguese corpora?

5 000 000 words

different historical varieties

different spatial varieties

different situational contexts

different speakers’ social status

synchronic and diachronic variation

Adapting the Penn system

•  but what are in fact the “needs” of Portuguese corpora?

synchronic and diachronic variation

the annotator does not know the precise range of the 4 sets of data

the annotator cannot anticipate optimal annotation solutions

the annotator feels that annotating is always more urgent than writing down guidelines

4 teams, c. 50 annotators

15 years of work

INCONSISTENCY

Converging Portuguese corpora

•  4 corpora
•  the same set of parsing guidelines

Portuguese Syntactic Annotation Manual

Converging Portuguese corpora

For corpus users
•  ensures easy-to-use data access;
•  makes a comparative survey of data conceived to answer specific questions highly productive;
•  makes it possible to replicate quantitative studies on new datasets.

For corpus creators
•  speeds up the parser training;
•  improves automatic parsing of new data.

Converging Portuguese corpora

The big challenge
•  to design a unified encoding system that allows searching across diachronic and dialectal varieties for properties that are either shared or exclusive.

Converging Portuguese corpora

Cleft constructions

(1) o joão leu o poema
    John read the poem

In English corpora:

It-cleft
(2) it was the poem that john read

Wh-cleft (pseudocleft)
(3) what john read was the poem

Reverse Wh-cleft
(4) the poem was what john read

Converging Portuguese corpora

wh-clefts
•  canonic: foi o poema o que o joão leu
•  reverse: o poema foi o que o joão leu
•  pseudo: o que o joão leu foi o poema
•  double copula: é o que o joão leu é o poema
•  reduced: o que o joão quer que a ana leia o poema
•  semi pseudo: o joão leu foi o poema

that-clefts
•  canonic: foi o poema que o joão leu
•  non-agreeing copula: é os poemas que o joão leu
•  reduced: o poema que o joão leu
•  reverse: o poema é que o joão leu
•  double copula: é o poema é que o joão leu
•  double ‘é que’: o poema é que é que o joão leu

The same inventory of cleft types is repeated on one slide per variety:
•  XIII-XV centuries (Middle Ages)
•  XVI century
•  Classical Portuguese (XVII-XVIII)
•  Contemporary standard Portuguese (XIX-XX)
•  Dialectal European Portuguese (XX)
•  Brazilian Portuguese (XX)

Converging Portuguese corpora

Cleft constructions
•  complementizer (+ -)
•  wh-phrase (+ - 0)
•  copula (+ - 0 x)
•  focus position (L R M)

Converging Portuguese corpora

                       wh-element   complementizer   copula   focus
canonic wh-cleft           +              -             +       M
reverse wh-cleft           +              -             +       L
pseudocleft                +              -             +       R
double copula              +              -             x       R
reduced pseudocleft        +              -             0       R
semi pseudocleft           0              -             +       R
canonic that-cleft         -              +             +       M
reduced that-cleft         -              +             0       M
reverse that-cleft         -              +             +       L
double copula              -              +             x       L
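The point of decomposing clefts into features is that one query can then retrieve every type sharing a value. A minimal sketch of the idea follows, with the table transcribed as data. The `Cleft` record, the `find` helper and the "(wh)" / "(that)" suffixes that tell the two "double copula" rows apart are assumptions made for illustration, not part of the annotation manual.

```python
from typing import NamedTuple

class Cleft(NamedTuple):
    name: str
    wh: str        # wh-element:     "+", "-" or "0"
    comp: str      # complementizer: "+" or "-"
    copula: str    # copula:         "+", "-", "0" or "x"
    focus: str     # focus position: "L", "R" or "M"

# Rows copied from the feature table; the two "double copula" rows get a suffix.
CLEFTS = [
    Cleft("canonic wh-cleft",     "+", "-", "+", "M"),
    Cleft("reverse wh-cleft",     "+", "-", "+", "L"),
    Cleft("pseudocleft",          "+", "-", "+", "R"),
    Cleft("double copula (wh)",   "+", "-", "x", "R"),
    Cleft("reduced pseudocleft",  "+", "-", "0", "R"),
    Cleft("semi pseudocleft",     "0", "-", "+", "R"),
    Cleft("canonic that-cleft",   "-", "+", "+", "M"),
    Cleft("reduced that-cleft",   "-", "+", "0", "M"),
    Cleft("reverse that-cleft",   "-", "+", "+", "L"),
    Cleft("double copula (that)", "-", "+", "x", "L"),
]

def find(**features):
    """Names of all cleft types matching every given feature value."""
    return [
        c.name
        for c in CLEFTS
        if all(getattr(c, key) == value for key, value in features.items())
    ]

print(find(copula="x"))            # all doubled-copula clefts
print(find(wh="+", focus="R"))     # wh-clefts with the focus on the right
```

With one opaque label per construction (Cleft A, B, C, ...) neither query could be expressed without listing the members by hand.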

Converging Portuguese corpora

Alternative hypothesis: Cleft A, B, C, D, E, …
•  prevents finding all clefts with shared features;
•  prevents finding other constructions that display the same features:

EP (south central dialects) (recursive ‘é que’)
    O que é que é que o joão leu?

BP (null copula)
    O que que o João leu?

Converging Portuguese corpora

minimal units of annotation

doubled copula
complementizer
é que
null copula
non agreeing copula
copula
clefts
recursive é que
wh-element
relatives
questions
presentational ‘ser’
extraposed subject clauses
that clauses

CLUL’s syntactic annotation policy

standardize

open up

Opening up the use of parsed corpora

building parsed corpora

using parsed corpora

Opening up the use of parsed corpora

Using Penn-style parsed corpora requires:
•  knowing the annotation system;
•  installing search software (CorpusSearch2; Randall, 2010) and Java;
•  being acquainted with the search language;
•  running the search queries from the command line.

we’re doomed! we’ll never make it!

Opening up the use of parsed corpora

The P. S. project designed two strategies to circumvent these issues (cf. also Pereira, 2015):
•  enabling online access to the parsed corpus;
•  implementing a search interface.

Opening up the use of parsed corpora
The online access

TEITOK web-based platform (Janssen, 2014)
•  viewing, creating and editing corpora with both rich textual mark-up and linguistic annotation;
   ↓
   data stored in full-fledged XML files in the standards defined by the Text-Encoding Initiative.

Opening up the use of parsed corpora
The online access

P.S. Post Scriptum in TEITOK (until 2016):
•  facsimile images of letter manuscripts;
•  metadata (e.g. biographies, social status, events, places, time);
•  philological mark-up (e.g. support, hands, deletions, spellings);
•  linguistic annotation (e.g. lemmatization, POS annotation, phonology and morphology phenomena, syntactic annotation).

In 2016:
•  parsed corpus is online!

Opening up the use of parsed corpora
The online access

The online visualization of the P.S. parsed corpus requires:
•  the conversion of the original Penn-treebank labeled-bracket format (psd) into the xml format (psdx) (Ecay & Bacovcin 2014; van Gompel 2014);
•  the alignment of psdx and the source xml (stand-off annotation) (cf. Grover/Matthews/Tobin 2006; Marquilhas & Hendrickx 2016; McEnery/Wilson 2001; Schmidt 2010).

from labeled-bracket

( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .)))

to xml

<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ">
      <eLeaf Notext="*pro*"/>
    </eTree>
    <eTree Label="NP-ACC">
      <eTree Label="DEM">
        <eLeaf Text="Isto"/>
      </eTree>
    </eTree>
    <eTree Label="VB-P">
      <eLeaf Text="prometo"/>
    </eTree>
    <eTree Label="PP">
      <eTree Label="P">
        <eLeaf Text="a"/>
      </eTree>
      <eTree Label="NP">
        <eTree Label="NPR">
          <eLeaf Text="VM"/>
        </eTree>
      </eTree>
    </eTree>
    <eTree Label="ADVP">
      <eTree Label="ADV">
        <eLeaf Text="fixemente"/>
      </eTree>
    </eTree>
    <eTree Label=".">
      <eLeaf Text="."/>
    </eTree>
  </eTree>
</forest>
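The mapping from brackets to XML shown above is mechanical, as the following sketch suggests. It is not the converter used by the project: `psd_to_xml` is a hypothetical function built on Python's standard `xml.etree.ElementTree`, and it assumes that any leaf beginning with `*` is an empty category to be written as `Notext` rather than `Text`.

```python
import re
import xml.etree.ElementTree as ET

PSD = (
    "( (IP-MAT (NP-SBJ *pro*) (NP-ACC (DEM Isto)) (VB-P prometo) "
    "(PP (P a) (NP (NPR VM))) (ADVP (ADV fixemente)) (. .)))"
)

def psd_to_xml(psd):
    """Convert one wrapped labelled-bracket tree into a <forest> element."""
    tokens = re.findall(r"\(|\)|[^\s()]+", psd)
    forest = ET.Element("forest")
    stack = [forest]
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if tokens[i + 1] == "(":
                stack.append(forest)           # unlabelled wrapper = the forest
                i += 1
            else:                              # labelled node becomes an eTree
                node = ET.SubElement(stack[-1], "eTree", Label=tokens[i + 1])
                stack.append(node)
                i += 2
        elif tok == ")":
            stack.pop()
            i += 1
        else:                                  # a word or an empty category
            key = "Notext" if tok.startswith("*") else "Text"   # assumption
            ET.SubElement(stack[-1], "eLeaf", {key: tok})
            i += 1
    return forest

forest = psd_to_xml(PSD)
print(ET.tostring(forest, encoding="unicode"))
```

The output is a single unindented line; on Python 3.9 or later, `ET.indent(forest)` can be called before serializing to get a readable layout.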

psdx and source xml alignment

<forest forestId="3" File="CARDS0036" Location=".3" sentid="s-4" id="tree-3">
  <eTree Label="IP-MAT" id="node-309">
    <eTree Label="NP-SBJ" id="node-310">
      <eLeaf Notext="*pro*" id="node-311"/>
    </eTree>
    <eTree Label="NP-ACC" id="node-312">
      <eTree Label="DEM" id="node-313">
        <eLeaf Text="Isto" tokid="w-99" id="node-314"/>
      </eTree>
    </eTree>
    <eTree Label="VB-P" id="node-315">
      <eLeaf Text="prometo" tokid="w-100" id="node-316"/>
    </eTree>
    <eTree Label="PP" id="node-317">
      <eTree Label="P" id="node-318">
        <eLeaf Text="a" tokid="w-101" id="node-319"/>
      </eTree>
      <eTree Label="NP" id="node-320">
        <eTree Label="NPR" id="node-321">
          <eLeaf Text="VM" tokid="w-102" id="node-322"/>
        </eTree>
      </eTree>
    </eTree>
    <eTree Label="ADVP" id="node-323">
      <eTree Label="ADV" id="node-324">
        <eLeaf Text="fixemente" tokid="w-103" id="node-325"/>
      </eTree>
    </eTree>
    <eTree Label="." id="node-326">
      <eLeaf Text="." tokid="w-104" id="node-327"/>
    </eTree>
  </eTree>
</forest>
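The `tokid` attribute on each `eLeaf` is what ties a parsed word back to the source XML. A minimal sketch of how such an alignment could be resolved is given below. The PSDX tree is the one from the slide (node ids omitted for brevity); the source fragment is hypothetical: it assumes the transcription marks each token as a `<tok>` element whose `id` equals the `tokid`, and it is not copied from the actual CARDS0036 file. The function name `align` is likewise illustrative.

```python
import xml.etree.ElementTree as ET

# PSDX tree from the slide (node ids omitted for brevity).
PSDX = """
<forest File="CARDS0036" sentid="s-4">
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="NP-ACC"><eTree Label="DEM"><eLeaf Text="Isto" tokid="w-99"/></eTree></eTree>
    <eTree Label="VB-P"><eLeaf Text="prometo" tokid="w-100"/></eTree>
    <eTree Label="PP">
      <eTree Label="P"><eLeaf Text="a" tokid="w-101"/></eTree>
      <eTree Label="NP"><eTree Label="NPR"><eLeaf Text="VM" tokid="w-102"/></eTree></eTree>
    </eTree>
    <eTree Label="ADVP"><eTree Label="ADV"><eLeaf Text="fixemente" tokid="w-103"/></eTree></eTree>
    <eTree Label="."><eLeaf Text="." tokid="w-104"/></eTree>
  </eTree>
</forest>
""".strip()

# Hypothetical source fragment: element and attribute names are assumptions.
SOURCE = """
<s id="s-4">
  <tok id="w-99">Isto</tok> <tok id="w-100">prometo</tok> <tok id="w-101">a</tok>
  <tok id="w-102">VM</tok> <tok id="w-103">fixemente</tok><tok id="w-104">.</tok>
</s>
""".strip()


def align(psdx_root, source_root):
    """Return (tokid, POS label, source token text) for every overt leaf."""
    tokens = {tok.get("id"): tok for tok in source_root.iter("tok")}
    aligned = []
    for tree in psdx_root.iter("eTree"):
        for leaf in tree.findall("eLeaf"):
            tokid = leaf.get("tokid")
            if tokid is None:
                continue  # empty categories such as *pro* have no source token
            tok = tokens.get(tokid)
            aligned.append((tokid, tree.get("Label"),
                            tok.text if tok is not None else None))
    return aligned


aligned = align(ET.fromstring(PSDX), ET.fromstring(SOURCE))
for row in aligned:
    print(row)
```

Because the link is by identifier rather than by position, the tree can contain nodes with no counterpart in the transcription (here the null subject) without disturbing the alignment.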


Opening up the use of parsed corpora

The online access

The parsed P.S. in TEITOK:
•  corpus users: visualize parsed sentences (in several formats) aligned with the transcribed form;
•  corpus creators: edit syntactic trees at both the label and the structural level.


Opening up the use of parsed corpora

The search interface

•  Storing the parsed corpus in TEITOK in a standard file format such as XML allows the development of new online search engines.

•  TEITOK offers a tree search interface in which the user can search through PSDX files by:
   •  writing query expressions in XPath (the common syntax for addressing nodes in an XML tree);
   •  choosing predefined search queries.

Search results can be matched with extra-linguistic variables.
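To give an idea of what an XPath query over a PSDX tree looks like, here is a small sketch that runs two queries locally with Python's standard library. It does not use TEITOK itself: the query strings are illustrative examples written for this sentence, not TEITOK's predefined queries, and `xml.etree.ElementTree` supports only a subset of XPath. The helper `words` is hypothetical.

```python
import xml.etree.ElementTree as ET

# PSDX tree from the earlier slide (node ids omitted for brevity).
PSDX = """
<forest File="CARDS0036" sentid="s-4">
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="NP-ACC"><eTree Label="DEM"><eLeaf Text="Isto" tokid="w-99"/></eTree></eTree>
    <eTree Label="VB-P"><eLeaf Text="prometo" tokid="w-100"/></eTree>
    <eTree Label="PP">
      <eTree Label="P"><eLeaf Text="a" tokid="w-101"/></eTree>
      <eTree Label="NP"><eTree Label="NPR"><eLeaf Text="VM" tokid="w-102"/></eTree></eTree>
    </eTree>
    <eTree Label="ADVP"><eTree Label="ADV"><eLeaf Text="fixemente" tokid="w-103"/></eTree></eTree>
    <eTree Label="."><eLeaf Text="." tokid="w-104"/></eTree>
  </eTree>
</forest>
""".strip()

forest = ET.fromstring(PSDX)


def words(node):
    """Overt words dominated by a node, in sentence order."""
    return [leaf.get("Text") for leaf in node.iter("eLeaf") if leaf.get("Text")]


# Query 1: every accusative noun phrase, at any depth.
accusatives = forest.findall(".//eTree[@Label='NP-ACC']")

# Query 2: matrix clauses whose subject is a null pronoun.
null_subjects = forest.findall(
    ".//eTree[@Label='IP-MAT']/eTree[@Label='NP-SBJ']/eLeaf[@Notext='*pro*']"
)

print("sentence:", " ".join(words(forest)))
print("accusative NPs:", [words(np) for np in accusatives])
print("null-subject matrix clauses:", len(null_subjects))
```

Attributes on the `forest` element, such as `File` and `sentid`, are what would let a hit be traced back to its document and, through it, to extra-linguistic metadata.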


Opening up the use of parsed corpora

interactive atlases of syntactic patterns (historical and dialectal)

parsed results can automatically be matched with extra-linguistic variables

all Portuguese parsed corpora obey a unified syntactic annotation system


Opening up the use of parsed corpora

[Figure: timeline of three constructions across centuries XIII, XIV, XV, XVI, XVII]

reverse that-clefts: o poema é que o joão leu
pseudoclefts: o que quero é que o joão leia / o que quero que o joão leia
extraposed subject clauses: bom é que o joão leia / é bom que o joão leia


Opening up the use of parsed corpora

Tools of this kind form the basis of a new generation of parsed corpora, which, we hope, will reconcile (Portuguese) researchers with syntactic annotation.


References

•  CLUL (Ed.). (2014). P.S. Post Scriptum. Arquivo Digital de Escrita Quotidiana em Portugal e Espanha na Época Moderna. http://ps.clul.ul.pt

•  Ecay, A. & Bacovcin, A. (2014). An implementation of a morphologically-aware corpus annotation format. Presented at the workshop Converging Corpora: How to standardize historical corpora of typologically and genetically different languages, 16th Diachronic Generative Syntax Conference, Budapest, July 2014.

•  Galves, C. & Faria, P. (2010). Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html

•  Grover, C., Matthews, M. & Tobin, R. (2006). Tools to Address the Interdependence between Tokenisation and Standoff Annotation. In: Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing (pp. 19–26). Stroudsburg, PA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1621034.1621038 (accessed 30.09.2015)

•  Janssen, M. (2014). TEITOK – The tokenized TEI environment. Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/teitok/site/index.php?action=home

•  Marquilhas, R. & Hendrickx, I. (2016). Avanços nas humanidades digitais. In: A. M. Martins & E. Carrilho (Eds.), Manual de linguística portuguesa (pp. 252-277). Berlin: De Gruyter.

•  Martins, A. M. (Coord.) (2000-2010). CORDIAL-SIN: Corpus Dialectal para o Estudo da Sintaxe / Syntax-oriented Corpus of Portuguese Dialects. Lisboa: Centro de Linguística da Universidade de Lisboa. http://www.clul.ul.pt/en/resources/411-cordial-corpus

•  Martins, A. M., Pereira, S. & Cardoso, A. (2013-2015). Parsed José de Arimateia. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa.

•  Martins, A. M., Pereira, S. & Cardoso, A. (2014-2015). Parsed Demanda do Santo Graal. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa.

•  Martins, A. M., Pereira, S. & Cardoso, A. (2015). Parsed Legal Documents. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa.

•  McEnery, T. & Wilson, A. (2001). Corpus Linguistics. Edinburgh: Edinburgh University Press.

•  Pereira, S. (2015). "Arquídia: um recurso a construir". Presented at V SIMELP – Simpósio Mundial de Estudos de Língua Portuguesa (Simpósio 41: Dicionarística Portuguesa: Investigação e Projetos em Curso), October 8-11, Lecce.

•  Randall, B. (2010). CorpusSearch 2. University of Pennsylvania. http://corpussearch.sourceforge.net

•  Santorini, B. (2010). Annotation manual for the Penn Historical Corpora and the PCEEC. http://www.ling.upenn.edu/hist-corpora/annotation/index.htm

•  Schmidt, D. (2010). The Inadequacy of Embedded Markup for Cultural Heritage Texts. Literary and Linguistic Computing 25:3, 337–356.

•  van Gompel, M. (2014). FoLiA: Format for Linguistic Annotation. Language and Speech Technology Technical Report Series. Centre for Language Studies, Radboud University Nijmegen.