ANNOTATION FOR ALL: seeking optimal solutions in syntactic annotation
Catarina Magro (CLUL – Dialectology & Diachrony)
ABC 2016 Faculdade de Letras da UL
CLUL’s syntactic annotation policy
standardize
open up
Portuguese parsed corpora
1999 • CORDIAL-SIN. Syntax-oriented Corpus of Portuguese Dialects
(Martins, Coord., [2000- ] 2010)
2012 • P.S. Post Scriptum. A Digital Archive of Ordinary Writing in Early Modern Portugal and Spain (CLUL, Ed., 2014)
• WOChWEL's POS-tagged and Parsed Old Portuguese texts (Martins, Pereira & Cardoso, 2013-15)
1998 • Tycho Brahe Parsed Corpus of Historical Portuguese
(Galves & Faria, 2010)
UL
UNICAMP
Portuguese parsed corpora
WOChWEL • Literary and historiographical texts (XIII-XIV)
P.S. Post Scriptum • Private letters (XVI-XIX)
Tycho Brahe • Literary and technical texts (XIV-XIX) • Newspaper texts and private letters (XIX-XX)
Cordial-Sin • Dialectal speech (XX)
The Penn treebank family
• Penn Corpora of Historical English
• Modéliser le changement: les voies du français (MCVF)
• NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
• Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE)
• Icelandic Parsed Historical Corpus (IcePaHC)
• Portuguese parsed corpora: Tycho Brahe, Cordial, Wochwel, P. S.
The Penn treebank annotation system
• the annotation system adopts a version of constituency grammar that assumes:
  • one level of representation;
  • empty categories (in antecedent-gap chains and in situ).
• the primary goal of the annotation is the facilitation of automated search, not the adoption of a linguistically accurate encoding.
(Santorini, 2010)
The Penn treebank annotation system
• produces quite flat and sometimes linguistically unmotivated syntactic representations
  • multiple branching nodes
  • some word level nodes (e.g. verbs, negation, sentence focus particles)
  • omission of undecidable information (e.g. VP boundaries)
  • omission of subtle distinctions (e.g. argument vs adjunct PPs)
  • use of default rules (w.r.t. location of wh-traces and structural ambiguity, among others)
The Penn treebank annotation system
• provides the encoding of
  • constituent boundaries
  • phrase and clause dependencies
  • categorial information (e.g. NP, PP, ADVP)
  • grammatical functions (e.g. SBJ, ACC, DAT)
  • some discourse functions (e.g. LFD, PRG)
  • sentence and clause type (e.g. EXL, CMP, QUE)
  • some null constituents
  • certain transformational relations
The Penn treebank annotation system
• syntactic annotation is represented as labelled bracketing over morphologically tagged texts
  • word tags – POS tags
  • phrase and clause main labels – category labels
  • phrase and clause extended labels – subcategory, grammatical relation or discourse function labels
• in the labelled bracketing representation, level of indenting corresponds to depth of structural embedding
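The split between main and extended labels can be illustrated with a small Python sketch. The function name `split_label` is hypothetical and not part of any Penn tool, and the sketch assumes that the first hyphen-separated field is always the category label. That holds for phrase and clause labels such as NP-SBJ or IP-MAT, but word tags in the Portuguese corpora (e.g. VB-P, D-F) also use hyphens for morphological information, so it should only be applied to phrasal nodes.

```python
def split_label(label):
    """Split a phrase/clause label into its main (category) label and
    the list of extended labels (function, subcategory, ...).

    Assumption: fields are separated by '-' and the first field is the
    category. This is an illustration, not the corpora's own tooling.
    """
    main, *extended = label.split("-")
    return main, extended


if __name__ == "__main__":
    for label in ["IP-MAT", "NP-SBJ", "CP-THT", "PP"]:
        print(label, "->", split_label(label))
```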
The Penn treebank annotation system
Yesterday Mary told Jane that she studied too much during the weekend.

(IP-MAT (NP-TMP (N Yesterday))
        (NP-SBJ (NPR Mary))
        (VBD told)
        (NP-OB2 (NPR Jane))
        (CP-THT (C that)
                (IP-SUB (NP-SBJ (PRO she))
                        (VBD studied)
                        (NP-MSR (QP (ADVR too) (Q much)))
                        (PP (P during)
                            (NP (D the) (N+N weekend))))))
The Penn treebank annotation system
(IP-MAT (NP-TMP (N Yesterday))                              ← adjunct NP
        (NP-SBJ (NPR Mary))                                 ← subject
        (VBD told)                                          ← verb
        (NP-OB2 (NPR Jane))                                 ← second object
        (CP-THT (C that)                                    ← that clause
                (IP-SUB (NP-SBJ (PRO she))                  ← subject
                        (VBD studied)                       ← verb
                        (NP-MSR (QP (ADVR too) (Q much)))   ← adjunct NP
                        (PP (P during)                      ← adjunct PP
                            (NP (D the) (N+N weekend))))))
Adapting the Penn system
• to adapt a system originally designed for the annotation of Middle English to annotate a typologically distinct language such as Portuguese
  • label set
  • annotation schemes
Adapting the Penn system
(NODE (IP-INF (TO to)
              (VB for+gyue)
              (NP-OB2 (D a) (ADJ synful) (N man))
              (NP-OB1 (PRO$ his) (NS synnes)))
      (PPCME2; SEC. XV; ID CMAELR3,43.513))
• double object → oblique dative
Adapting the Penn system
• double object → oblique dative
( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .))
  (P.S.; SEC. XVIII; ID CARDS0036,.3))
Adapting the Penn system
( (IP-MAT (CONJ and)
          (PP (P for)
              (NP (D +dare) (ADJ euele) (N +gewune)))
          (NEG ne)
          (VBP +dinc+d)
          (NP-SBJ (PRO hit))
          (NP-OB2 (PRO hem))
          (NP-OB1 (Q no) (N misdade))
          (. ,))
  (ID CMVICES1,79.910))

( (IP-MAT (CONJ ac)
          (NP-SBJ *pro*)          ← subject coreferential with NP-OB2 of previous clause
          (BEP bie+d)             (scribal error)
          (VAN ihealden)
          (PP (PP (P for)
                  (NP (ADJ wise)
                      (NS menn)))
              (CONJP (CONJ and)
                     (PP (P for)
                         (NP (ADJ +geape)))))
          (. .))
  (PPCME2; SEC. XIII; ID CMVICES1,79.911))
• non-pro drop → pro drop
Adapting the Penn system
• non-pro drop → pro drop
( (IP-MAT (NP-SBJ *pro*)          ← referential null subject in a non-dependent clause
          (VB-D Declamei)
          (PP (P contra)
              (NP (D-F a) (N vaidade)))
          (. ,))
  (TYCHO BRAHE; SEC XVIII; ID A_001_PSD,03.3))
Adapting the Penn system
preserve expand adapt create
Adapting the Penn system
• but what are in fact the “needs” of Portuguese corpora?
5 000 000 words
different historical varieties
different spatial varieties
different situational contexts
different speakers’ social status
synchronic and diachronic variation
Adapting the Penn system
• but what are in fact the “needs” of Portuguese corpora?
synchronic and diachronic variation
• the annotator does not know the precise range of the 4 sets of data
• the annotator cannot anticipate optimal annotation solutions
• the annotator feels that annotating is always more urgent than writing down guidelines
4 teams, c. 50 annotators
15 years of work
INCONSISTENCY
Converging Portuguese corpora
• 4 corpora • the same set of parsing guidelines
Portuguese Syntactic Annotation Manual
Converging Portuguese corpora
For corpus users
• ensures easy-to-use data access;
• makes comparative surveys of data designed to answer specific questions highly productive;
• makes it possible to replicate quantitative studies on new datasets.
For corpus creators
• speeds up parser training;
• improves automatic parsing of new data.
Converging Portuguese corpora
The big challenge
• to design a unified encoding system that allows searching across diachronic and dialectal varieties for properties that are either shared or exclusive.
Converging Portuguese corpora
Cleft constructions
(1) o joão leu o poema
    John read the poem
In English corpora:
It-cleft
(2) it was the poem that john read
Wh-cleft (pseudocleft)
(3) what john read was the poem
Reverse Wh-cleft
(4) the poem was what john read
wh-clefts
• canonic: foi o poema o que o joão leu
• reverse: o poema foi o que o joão leu
• pseudo: o que o joão leu foi o poema
• double copula: é o que o joão leu é o poema
• reduced: o que o joão quer que a ana leia o poema
• semi pseudo: o joão leu foi o poema
that-clefts
• canonic: foi o poema que o joão leu
• non-agreeing copula: é os poemas que o joão leu
• reduced: o poema que o joão leu
• reverse: o poema é que o joão leu
• double copula: é o poema é que o joão leu
• double ‘é que’: o poema é que é que o joão leu
Converging Portuguese corpora
• XIII-XV centuries (Middle Ages)
• XVI century
• Classical Portuguese (XVII-XVIII)
• Contemporary standard Portuguese (XIX-XX)
• Dialectal European Portuguese (XX)
• Brazilian Portuguese (XX)
Converging Portuguese corpora
Cleft constructions
• complementizer (+ -)
• wh-phrase (+ - 0)
• copula (+ - 0 x)
• focus position (L R M)
Converging Portuguese corpora
construction          wh-element  complementizer  copula  focus
canonic wh-cleft      +           -               +       M
reverse wh-cleft      +           -               +       L
pseudocleft           +           -               +       R
double copula         +           -               x       R
reduced pseudocleft   +           -               0       R
semi pseudocleft      0           -               +       R
canonic that-cleft    -           +               +       M
reduced that-cleft    -           +               0       M
reverse that-cleft    -           +               +       L
double copula         -           +               x       L
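One way to see why a feature-based encoding helps search is to treat the table above as data and query it by feature. The following Python sketch is only an illustration and not the annotation scheme itself: the names `Cleft`, `CLEFTS` and `find` are hypothetical, the two "double copula" rows are disambiguated here as "(wh)" and "(that)", and the feature values are copied from the table without interpreting what "x" and "0" stand for.

```python
from typing import NamedTuple


class Cleft(NamedTuple):
    wh: str      # wh-element: '+', '-' or '0'
    comp: str    # complementizer: '+' or '-'
    copula: str  # copula: '+', '0' or 'x'
    focus: str   # focus position: 'L', 'M' or 'R'


# Feature values copied row by row from the table; the "(wh)"/"(that)"
# suffixes are added here only to keep dictionary keys unique.
CLEFTS = {
    "canonic wh-cleft":     Cleft("+", "-", "+", "M"),
    "reverse wh-cleft":     Cleft("+", "-", "+", "L"),
    "pseudocleft":          Cleft("+", "-", "+", "R"),
    "double copula (wh)":   Cleft("+", "-", "x", "R"),
    "reduced pseudocleft":  Cleft("+", "-", "0", "R"),
    "semi pseudocleft":     Cleft("0", "-", "+", "R"),
    "canonic that-cleft":   Cleft("-", "+", "+", "M"),
    "reduced that-cleft":   Cleft("-", "+", "0", "M"),
    "reverse that-cleft":   Cleft("-", "+", "+", "L"),
    "double copula (that)": Cleft("-", "+", "x", "L"),
}


def find(**features):
    """Return the names of all cleft types matching every given feature."""
    return sorted(
        name
        for name, cleft in CLEFTS.items()
        if all(getattr(cleft, key) == value for key, value in features.items())
    )


if __name__ == "__main__":
    print(find(copula="0"))          # clefts sharing the copula value '0'
    print(find(focus="L"))           # clefts with the focus on the left
    print(find(comp="+", focus="M")) # that-clefts with medial focus
```

Because each construction is a bundle of independent features, a single query such as `find(focus="L")` cuts across the wh-cleft and that-cleft families, which a flat list of unrelated cleft labels would not allow.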
Converging Portuguese corpora
Alternative hypothesis: Cleft A, B, C, D, E, …
• prevents finding all clefts with shared features;
• prevents finding other constructions that display the same features:
EP (south central dialects) (recursive ‘é que’)
O que é que é que o joão leu?
BP (null copula)
O que que o João leu?
Converging Portuguese corpora
doubled copula
complementizer
é que
null copula
non agreeing copula
copula
clefts
recursive é que
wh-element
relatives
questions
minimal units of annotation
presentational ‘ser’
extraposed subject clauses
that clauses
CLUL’s syntactic annotation policy
standardize
open up
Opening up the use of parsed corpora
building parsed corpora
using parsed corpora
Opening up the use of parsed corpora
Using Penn-style parsed corpora requires:
• knowing the annotation system;
• installing search software (CorpusSearch 2; Randall, 2010) and Java;
• being acquainted with the search language;
• running the search queries from the command line.
we’re doomed! we’ll never make it!
Opening up the use of parsed corpora
The P. S. project designed two strategies to circumvent these issues (cf. also Pereira, 2015):
• enabling online access to the parsed corpus;
• implementing a search interface.
TEITOK web-based platform (Janssen, 2014)
• viewing, creating and editing corpora with both rich textual mark-up and linguistic annotation;
↓
data stored in full-fledged XML files in the standards defined by the Text Encoding Initiative.
Opening up the use of parsed corpora
The online access
Opening up the use of parsed corpus
P.S. Post Scriptum in TEITOK (until 2016):
• facsimile images of the letter manuscripts;
• metadata (e.g. biographies, social status, events, places, time);
• philological mark-up (e.g. support, hands, deletions, spellings);
• linguistic annotation (e.g. lemmatization, POS annotation, phonological and morphological phenomena, syntactic annotation).
In 2016:
• parsed corpus is online!
The online access
Opening up the use of parsed corpus
The online visualization of the P.S. parsed corpus requires:
• the conversion of the original Penn-treebank labeled-bracket format (psd) into the xml format (psdx) (Ecay & Bacovcin 2014; van Gompel 2014);
• the alignment of psdx and the source xml (stand-off annotation) (cf. Grover/Matthews/Tobin 2006; Marquilhas & Hendrickx 2016; McEnery/Wilson 2001; Schmidt 2010).
The online access
from labeled-bracket

( (IP-MAT (NP-SBJ *pro*)
          (NP-ACC (DEM Isto))
          (VB-P prometo)
          (PP (P a)
              (NP (NPR VM)))
          (ADVP (ADV fixemente))
          (. .)))

to xml

<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="NP-ACC">
      <eTree Label="DEM"><eLeaf Text="Isto"/></eTree>
    </eTree>
    <eTree Label="VB-P"><eLeaf Text="prometo"/></eTree>
    <eTree Label="PP">
      <eTree Label="P"><eLeaf Text="a"/></eTree>
      <eTree Label="NP">
        <eTree Label="NPR"><eLeaf Text="VM"/></eTree>
      </eTree>
    </eTree>
    <eTree Label="ADVP">
      <eTree Label="ADV"><eLeaf Text="fixemente"/></eTree>
    </eTree>
    <eTree Label="."><eLeaf Text="."/></eTree>
  </eTree>
</forest>
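The conversion illustrated above can be sketched in a few lines of Python. This is not the converter by Ecay & Bacovcin used in the project, only a minimal illustration under stated assumptions: the element and attribute names (`forest`, `eTree`, `eLeaf`, `Label`, `Text`, `Notext`) are taken from the slide, the function names are hypothetical, any terminal starting with `*` is assumed to be an empty category, and ID nodes and metadata are not handled.

```python
import re
import xml.etree.ElementTree as ET


def tokenize(psd):
    """Split a labeled-bracket string into '(', ')' and word/label tokens."""
    return re.findall(r"\(|\)|[^\s()]+", psd)


def parse(tokens, pos=0):
    """Parse the bracket opening at tokens[pos]; return ((label, children), next_pos)."""
    assert tokens[pos] == "("
    pos += 1
    label = None
    if tokens[pos] not in ("(", ")"):  # the outer wrapper "( (IP-MAT ...))" has no label
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # terminal: a word or an empty category
            pos += 1
    return (label, children), pos + 1


def to_element(node):
    """Turn a parsed (label, children) node into an eTree element."""
    label, children = node
    elem = ET.Element("eTree", Label=label)
    for child in children:
        if isinstance(child, tuple):
            elem.append(to_element(child))
        elif child.startswith("*"):  # assumption: '*...' marks an empty category
            ET.SubElement(elem, "eLeaf", Notext=child)
        else:
            ET.SubElement(elem, "eLeaf", Text=child)
    return elem


def psd_to_forest(psd):
    """Convert one bracketed tree (with or without the unlabeled wrapper) to a forest."""
    (label, children), _ = parse(tokenize(psd))
    forest = ET.Element("forest")
    roots = children if label is None else [(label, children)]
    for root in roots:
        if isinstance(root, tuple):
            forest.append(to_element(root))
    return forest


if __name__ == "__main__":
    psd = ("( (IP-MAT (NP-SBJ *pro*) (NP-ACC (DEM Isto)) (VB-P prometo) "
           "(PP (P a) (NP (NPR VM))) (ADVP (ADV fixemente)) (. .)))")
    print(ET.tostring(psd_to_forest(psd), encoding="unicode"))
```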
psdx and source xml alignment

<forest forestId="3" File="CARDS0036" Location=".3" sentid="s-4" id="tree-3">
  <eTree Label="IP-MAT" id="node-309">
    <eTree Label="NP-SBJ" id="node-310"><eLeaf Notext="*pro*" id="node-311"/></eTree>
    <eTree Label="NP-ACC" id="node-312">
      <eTree Label="DEM" id="node-313"><eLeaf Text="Isto" tokid="w-99" id="node-314"/></eTree>
    </eTree>
    <eTree Label="VB-P" id="node-315"><eLeaf Text="prometo" tokid="w-100" id="node-316"/></eTree>
    <eTree Label="PP" id="node-317">
      <eTree Label="P" id="node-318"><eLeaf Text="a" tokid="w-101" id="node-319"/></eTree>
      <eTree Label="NP" id="node-320">
        <eTree Label="NPR" id="node-321"><eLeaf Text="VM" tokid="w-102" id="node-322"/></eTree>
      </eTree>
    </eTree>
    <eTree Label="ADVP" id="node-323">
      <eTree Label="ADV" id="node-324"><eLeaf Text="fixemente" tokid="w-103" id="node-325"/></eTree>
    </eTree>
    <eTree Label="." id="node-326"><eLeaf Text="." tokid="w-104" id="node-327"/></eTree>
  </eTree>
</forest>
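The alignment step can be pictured as walking the overt leaves of a psdx tree in order and attaching the identifier of the corresponding token in the source XML. The Python sketch below is only a guess at the general idea, not the procedure actually used for P.S.: the function name `align` is hypothetical, the source tokens are assumed to be available as a list of `(tokid, form)` pairs in text order, and the parsed form is assumed to be identical to the source form (in practice spelling normalization may break this).

```python
import xml.etree.ElementTree as ET


def align(forest, tokens):
    """Attach a tokid attribute to every overt leaf of a psdx forest.

    forest: an ElementTree element using eTree/eLeaf as on the slide.
    tokens: list of (tokid, form) pairs from the source XML, in text order
            (an assumed input format, chosen for this illustration).
    Empty categories (leaves with Notext) have no source token and are skipped.
    """
    remaining = iter(tokens)
    for leaf in forest.iter("eLeaf"):
        if leaf.get("Text") is None:
            continue  # e.g. *pro*: nothing to align with
        try:
            tokid, form = next(remaining)
        except StopIteration:
            raise ValueError("source text has fewer tokens than the tree")
        if form != leaf.get("Text"):
            raise ValueError(f"mismatch: tree has {leaf.get('Text')!r}, source has {form!r}")
        leaf.set("tokid", tokid)
    return forest


if __name__ == "__main__":
    psdx = ET.fromstring(
        '<forest><eTree Label="IP-MAT">'
        '<eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>'
        '<eTree Label="NP-ACC"><eTree Label="DEM"><eLeaf Text="Isto"/></eTree></eTree>'
        '<eTree Label="VB-P"><eLeaf Text="prometo"/></eTree>'
        '</eTree></forest>'
    )
    align(psdx, [("w-99", "Isto"), ("w-100", "prometo")])
    print(ET.tostring(psdx, encoding="unicode"))
```

Keeping only a pointer (`tokid`) in the tree, rather than copying the token's philological mark-up into it, is what makes the syntactic layer a stand-off annotation over the source XML.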
Opening up the use of parsed corpus
The parsed P.S. in TEITOK:
• corpus users: visualize parsed sentences (several formats) aligned with the transcribed form;
• corpus creators: edit syntactic trees at both label and structural level.
The online access
Opening up the use of parsed corpus
• storing the parsed corpus in TEITOK in a standard file format such as xml allows the development of new online search engines.
• TEITOK offers a tree search interface in which the user can search through PSDX files by:
  • writing query expressions in the XPath language (the common syntax for addressing nodes in an xml tree)…
  • choosing predefined search queries.
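To give an idea of what an XPath-style query over a PSDX file looks like, here is a small Python sketch that finds clauses with a null subject. It does not reproduce TEITOK's own search interface, whose query form is not shown on the slides. It uses the limited XPath subset supported by Python's standard `xml.etree.ElementTree`, and the function names and the sample document are illustrative assumptions built from the element names shown earlier.

```python
import xml.etree.ElementTree as ET

# Sample PSDX fragment modelled on the slide's example sentence.
PSDX = """
<forest>
  <eTree Label="IP-MAT">
    <eTree Label="NP-SBJ"><eLeaf Notext="*pro*"/></eTree>
    <eTree Label="NP-ACC"><eTree Label="DEM"><eLeaf Text="Isto"/></eTree></eTree>
    <eTree Label="VB-P"><eLeaf Text="prometo"/></eTree>
    <eTree Label="PP">
      <eTree Label="P"><eLeaf Text="a"/></eTree>
      <eTree Label="NP"><eTree Label="NPR"><eLeaf Text="VM"/></eTree></eTree>
    </eTree>
    <eTree Label="ADVP"><eTree Label="ADV"><eLeaf Text="fixemente"/></eTree></eTree>
    <eTree Label="."><eLeaf Text="."/></eTree>
  </eTree>
</forest>
"""

# XPath (ElementTree subset): an NP-SBJ child that directly contains *pro*.
NULL_SUBJECT = "eTree[@Label='NP-SBJ']/eLeaf[@Notext='*pro*']"


def null_subject_clauses(root):
    """Return every IP node that immediately dominates a *pro* subject."""
    return [
        clause
        for clause in root.iter("eTree")
        if clause.get("Label", "").startswith("IP")
        and clause.find(NULL_SUBJECT) is not None
    ]


def words(node):
    """Return the overt words dominated by a node, in text order."""
    return [leaf.get("Text") for leaf in node.iter("eLeaf") if leaf.get("Text")]


if __name__ == "__main__":
    root = ET.fromstring(PSDX)
    for clause in null_subject_clauses(root):
        print(clause.get("Label"), " ".join(words(clause)))
```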
The search interface
• search results can be matched with extra-linguistic variables
• interactive atlases of syntactic patterns (historical and dialectal)
• parsed results can automatically be matched with extra-linguistic variables
• all Portuguese parsed corpora obey a unified syntactic annotation system
Opening up the use of parsed corpus
XIII XIV XV XVI XVII
• reverse that-clefts: o poema é que o joão leu
• pseudoclefts: o que quero é que o joão leia / o que quero que o joão leia
• extraposed subject clauses: bom é que o joão leia / é bom que o joão leia
Tools of this kind form the basis of a new generation of parsed corpora which, we hope, will reconcile (Portuguese) researchers with syntactic annotation.
Opening up the use of parsed corpus
References
• CLUL (Ed.). (2014). P.S. Post Scriptum. Arquivo Digital de Escrita Quotidiana em Portugal e Espanha na Época Moderna. http://ps.clul.ul.pt.
• Ecay, A. & Bacovcin, A. (2014). An implementation of a morphologically-aware corpus annotation format. Presented at the workshop Converging Corpora: How to standardize historical corpora of typologically and genetically different languages. 16th Diachronic Generative Syntax Conference. Budapest, July 2014.
• Galves, Charlotte, and Pablo Faria. (2010). Tycho Brahe Parsed Corpus of Historical Portuguese. http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html.
• Grover, Claire/Matthews, Michael/Tobin, Richard (2006). Tools to Address the Interdependence between Tokenisation and Standoff Annotation, in: Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing, Stroudsburg PA, Association for Computational Linguistics, 19–26, http://dl.acm.org/citation.cfm?id=1621034.1621038 (30.09.2015).
• Janssen, M. (2014). TEITOK – The tokenized TEI environment. Centro de Linguística da Universidade de Lisboa. http://alfclul.clul.ul.pt/teitok/site/index.php?action=home
• Marquilhas, R., & Hendrickx, I. (2016). Avanços nas humanidades digitais. In: A. M. Martins & E. Carrilho (Eds.). Manual de linguística portuguesa (pp. 252-277). Berlin: De Gruyter.
• Martins, A. M. (Coord.) (2000-2010). CORDIAL-SIN: Corpus Dialectal para o Estudo da Sintaxe / Syntax-oriented Corpus of Portuguese Dialects. Lisboa, Centro de Linguística da Universidade de Lisboa. URL: http://www.clul.ul.pt/en/resources/411-cordial-corpus
• Martins, A. M., Sandra Pereira & Adriana Cardoso. (2013-2015). Parsed José de Arimateia. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa
• Martins, A. M., Sandra Pereira & Adriana Cardoso. (2014-2015). Parsed Demanda do Santo Graal. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa
• Martins, A. M., Sandra Pereira & Adriana Cardoso. (2015). Parsed Legal Documents. CC licensed: WOChWEL by Centro de Linguística da Universidade de Lisboa
• McEnery, T. & Wilson, A. (2001). Corpus Linguistics. Edinburgh: Edinburgh University Press.
• Pereira, S. (2015). "Arquídia: um recurso a construir", presented at V SIMELP – Simpósio Mundial de Estudos de Língua Portuguesa (Simpósio 41: Dicionarística Portuguesa: Investigação e Projetos em Curso), October 8-11, Lecce.
• Randall, B. (2010). CorpusSearch 2. University of Pennsylvania. http://corpussearch.sourceforge.net
• Santorini, B. (2010). Annotation manual for the Penn Historical Corpora and the PCEEC. http://www.ling.upenn.edu/hist-corpora/annotation/index.htm
• Schmidt, Desmond (2010). The Inadequacy of Embedded Markup for Cultural Heritage Texts, Literary and Linguistic Computing 25:3, 337–356.
• van Gompel, M. (2014). FoLiA: Format for Linguistic Annotation. Language and Speech Technology Technical Report Series. Centre for Language Studies, Radboud University Nijmegen.