Introduction
Parallel Corpora
Corpus Structure
Corpus Study
Conclusion
References
“A mas novas vos torn / Now I take you back to my tale”
The Romance of Flamenca
Olga Scrivner, E.D. Blodgett*, Sandra Kubler, Michael McGuire
Indiana University, *University of Alberta
June 2013
Introduction
In the past, historical documents and manuscripts were studied exclusively with a manual, paper-based approach.
Recent achievements of corpus linguistics have introduced state-of-the-art methods and tools for digitization, semi-automatic annotation, and visualization of such resources.
Linguistic Annotation
“By accessing linguistic annotation, we can extend the range of phenomena that can be found with high precision”
(Kubler and Zinsmeister, 2014)
1. Morphological annotation - collocations, spelling variation
2. Syntactic annotation - sentence structure in narratives vs. dialogues, prose vs. verse
3. Discourse annotation - analysis of scenes and characters (female vs. male speaker, King vs. servants)
Medieval Romance Languages
In recent years, a number of annotated corpora have been developed for Medieval Romance languages:
Corpora of Old Spanish (Davies, 2002)
Old Portuguese (Davies and Ferreira, 2006)
Old French (Stein, 2008; Martineau et al., 2010)
For Old Occitan, there exist (to our knowledge) only two electronic databases:
1. “The Concordance of Medieval Occitan” (Ricketts and Reed, 2005)
2. “Provencal poetry” (ARTFL Project, 1998)
Users of these two databases are limited to lexical search.
Le Roman de Flamenca - 13th century
Le Roman de Flamenca, a “universally acknowledged masterpiece of Old Occitan narrative” (Fleischmann, 1995).
“Flamenca est la creation d’un homme d’esprit qui a voulu faire une oeuvre agreable ou fut representee dans ce qu’elle avait de plus brillant la vie des cours au XII siecle. C’etait un roman de moeurs contemporaines” (Meyer, 1865)
“Flamenca is the creation of a man of talent who wished to write an agreeable work representing the most brilliant aspects of courtly life in the twelfth century. It is a novel of manners” (Bradley, 1922)
Le Roman de Flamenca
This romance presents an intriguing love story between the beautiful Flamenca, who is imprisoned in a tower by her jealous husband Archambaut, and the sharp-witted knight Guillem.
The photo of the tapestry is used by permission of FreeLargePhotos.com
Le Roman de Flamenca
The anonymous manuscript of Le Roman de Flamenca was accidentally discovered in Carcassonne (France) by Raynouard and was first fully edited by P. Meyer in 1865.
This romance is unique in its genre (“the first modern novel”) and in its use of setting, adventures, and character portrayal (Blodgett, 1995; Bradley, 1922; Meyer, 1865).
The potential value of this historical resource, however, is limited by the lack of an accessible digital format and linguistic annotation.
Goals
Our corpus is intended not only as material for linguistic research, but also to aid in broader studies:
Interactive online database with access to a glossary, to translations of verses, and to comments (Meyer, 1901): http://nlp.indiana.edu/~obscrivn/Introduction.html
Multiple-level annotation - morphological, syntactic, and pragmatic (Scrivner et al., 2013)
Parallel English-Occitan corpus (Blodgett, 1995)
What Is a Parallel Corpus?
A parallel corpus is “an association between two texts (written or spoken) in different languages that represent translations of each other” (Tufis, 2006).
Parallel alignment identifies reciprocal translation units, which encode valuable lexical and syntactic knowledge.
Alignment types
One-to-one: one word from a source language corresponds to only one word in a target language
One-to-many: one word from a source language corresponds to several words in a target language
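The alignment types above can be pictured as links between word positions. The following is a minimal Python sketch, not the corpus's actual data format: the link list for the verse “A mas novas vos torn” is a hypothetical illustration, and `classify_source_words` is a name introduced here.

```python
from collections import defaultdict

# Hypothetical example: each link is a (source_index, target_index) pair.
# The links below are illustrative guesses, not the corpus annotation.
source = ["A", "mas", "novas", "vos", "torn"]
target = ["Now", "I", "take", "you", "back", "to", "my", "tale"]
links = [(0, 5), (1, 6), (2, 7), (3, 3), (4, 2), (4, 4)]


def classify_source_words(source, links):
    """Label each source word by the number of target words it links to."""
    targets_per_source = defaultdict(set)
    for source_index, target_index in links:
        targets_per_source[source_index].add(target_index)

    labels = {}
    for index, word in enumerate(source):
        linked = len(targets_per_source[index])
        if linked == 0:
            labels[word] = "unaligned"
        elif linked == 1:
            labels[word] = "one-to-one"
        else:
            labels[word] = "one-to-many"
    return labels


# Words are used as dictionary keys, which is fine here because no
# source word repeats in this verse.
alignment_types = classify_source_words(source, links)
print(alignment_types)
```

Under these assumed links, `torn` is linked to both "take" and "back" and is therefore labeled one-to-many, while the other source words are one-to-one.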
Historical Parallel Corpora
Given that parallel words have the same content, we can identify forms that have not been studied (Koolen et al., 2006; Enrique-Arias, 2012):
Spelling and lexical variation
Morphosyntactic variation
Null occurrences
Null Occurrences
A mas novas Ø vos torn
‘Now I take you back to my tale’
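One way to surface such null occurrences from a word-aligned verse is to look for English subject pronouns that no Occitan word is linked to. The sketch below is only an illustration under stated assumptions: the link list is hypothetical (not taken from the corpus), and `null_subject_candidates` and `SUBJECT_PRONOUNS` are names introduced here.

```python
# Assumed, simplified list of English subject pronouns.
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}


def null_subject_candidates(target, links):
    """Return English subject pronouns that no source word is linked to."""
    linked_targets = {target_index for _, target_index in links}
    return [
        word
        for index, word in enumerate(target)
        if index not in linked_targets and word.lower() in SUBJECT_PRONOUNS
    ]


# Hypothetical links (source_index, target_index) for
# "A mas novas vos torn" / "Now I take you back to my tale".
target = ["Now", "I", "take", "you", "back", "to", "my", "tale"]
links = [(0, 5), (1, 6), (2, 7), (3, 3), (4, 2), (4, 4)]

candidates = null_subject_candidates(target, links)
print(candidates)
```

With these assumed links, the pronoun "I" has no Occitan counterpart, matching the Ø in the verse, whereas "you" is linked to "vos" and is not reported.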
Automatic Alignment
English and Occitan texts are aligned by verse line
A bilingual lexicon is generated by NATools (http://linguateca.di.uminho.pt/natools/)
(matrix of word-to-word probabilities)
Automatic word alignment via the Berkeley aligner (Liang et al., 2006)
Manual correction of the alignment
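To give a rough idea of what a matrix of word-to-word probabilities looks like, the sketch below estimates it from co-occurrence counts over verse-aligned lines. This is a deliberately simplified stand-in, not the algorithm used by NATools or the Berkeley aligner; the function name `cooccurrence_lexicon` is introduced here, and the three verse pairs are the ones quoted in these slides.

```python
from collections import Counter, defaultdict


def cooccurrence_lexicon(pairs):
    """Estimate P(target word | source word) from line-level co-occurrence."""
    counts = defaultdict(Counter)
    for source_line, target_line in pairs:
        source_words = set(source_line.lower().split())
        target_words = set(target_line.lower().split())
        # Every source word co-occurs once with every target word of its line.
        for source_word in source_words:
            for target_word in target_words:
                counts[source_word][target_word] += 1

    lexicon = {}
    for source_word, row in counts.items():
        total = sum(row.values())
        # Normalize each row so that its probabilities sum to 1.
        lexicon[source_word] = {t: c / total for t, c in row.items()}
    return lexicon


pairs = [
    ("a mas novas vos torn", "now i take you back to my tale"),
    ("mais aisso -m par causa tro brava", "but it seems to me hard"),
    ("mais so fon sos meillors thesaurs", "and it was her greatest treasure"),
]
lexicon = cooccurrence_lexicon(pairs)
for target_word, probability in sorted(lexicon["mais"].items()):
    print(f"mais -> {target_word}: {probability:.3f}")
```

With only three lines, co-occurrence is a weak signal; the point is the shape of the data (one probability row per source word), which is then refined by automatic alignment and manual correction.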
Morphological Annotation - TnT Tagger
Syntactic Annotation - Berkeley Parser
Syntactic Annotation
“...nor did he want to omit Flamenca”
Discourse Annotation - Speakers
The labels correspond to the main characters' names, namely Flamenca, Archambaut, Guillem, Father, King, Queen.
Less important characters are marked as FemaleSpeakers and MaleSpeakers.
Parallel Alignment Annotation
Corpus Design
Since we are targeting two different types of users, linguists and non-linguists, with different needs, the corpus is made available in two different modes:
Web Interface: Users can browse the text and look up translations, glosses, and comments
Query Search: Users interested in the linguistic annotation can query the corpus online
1. Web Database
Glossary definitions, comments, and footnotes are linked to tokens and are made visible when the user hovers over a marked word.
2. Search Tool (ANNIS)
Our web search, based on ANNIS, supports basic queries for a word or phrase as well as more complex queries over the syntactic, morphosyntactic, discourse, and alignment annotation.
Null Subject
Modern Occitan varieties are null subject languages (Hinzelin and Kaiser, 2012)
(1) Ø Era pertot,      dintrava pertot
    Ø was everywhere,  entered  everywhere
    ‘the light was everywhere and it was coming from everywhere’ (Lo bon de la nuoch, Max Rouquette)
Previous Findings - Overt Subject
Overt subjects - disambiguation or emphasis (“mise en relief”)
(2) Femna que ieu ame illuminada de non ren
    ‘Woman who I love illuminated from nothingness’ (Saume dins lo vent, Serge Bec)
Person - more frequent with 1st person (Vance, 2009)
Genre - more frequent in prose
No difference by clause types (Sitaridou, 2005)
Subjunctive clause - preference for null subject (Vance, 2009)
Search by Genre
Discourse annotation: Flamenca, King, etc.
ex. speaker="Flamenca"
Narrative vs. Dialogues
Male vs. Female
High social rank vs. low social rank
Search by Person
Token annotation: I, you, it, etc.
ex. tok="it"
Personal pronouns
Impersonal pronouns
Search by Clause Type
Syntactic annotation:
ex. matrix: cat="IP" >[func="MAT"]
    embedded: cat="IP" >[func="SUB"]
Main clause
Embedded clause
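Search terms such as the `speaker` and `tok` examples can be combined into a single ANNIS query. The sketch below only builds query strings in Python; it does not contact an ANNIS server. It assumes that speaker labels are span annotations covering tokens, so that the AQL inclusion operator `_i_` applies, and the helper names `term` and `within_speaker` are introduced here for illustration.

```python
def term(name, value):
    """Format a single AQL search term, e.g. tok="it"."""
    return f'{name}="{value}"'


def within_speaker(speaker, token):
    """Build a query for a token inside a span carrying a speaker label.

    Assumption: speaker labels are span annotations that cover tokens.
    """
    # #1 and #2 refer to the first and second search term;
    # _i_ is the AQL inclusion operator (#1 includes #2).
    parts = [term("speaker", speaker), term("tok", token), "#1 _i_ #2"]
    return " & ".join(parts)


query = within_speaker("Flamenca", "it")
print(query)
```

The resulting string can be pasted into the ANNIS query box; whether it returns hits depends on how the speaker annotation is actually stored in the corpus.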
Overt vs. Null Subjects
Searching for null subjects:
Overt vs. Null Subjects
Searching for overt subjects:
Overt vs. Null Subjects - 1000 lines
Factor                 Null (%)    Overt (%)
Total                  308 (87)    45 (13)
Matrix Clause          34 (87)     5 (13)
Embedded Clause        107 (84)    21 (16)
Impersonal pronouns    23 (92)     2 (8)
1st person             32 (70)     14 (30)
2nd person             28 (88)     4 (12)
3rd person             200 (91)    19 (9)
Narration              187 (91)    19 (9)
Discourse              121 (88)    26 (12)
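The percentages in such a table follow directly from the raw counts: the null rate of a category is null / (null + overt). The short Python sketch below recomputes the rates from the counts listed above; the names `counts` and `null_rate` are introduced here for illustration.

```python
# (null, overt) counts copied from the table above.
counts = {
    "Total": (308, 45),
    "Matrix Clause": (34, 5),
    "Embedded Clause": (107, 21),
    "Impersonal pronouns": (23, 2),
    "1st person": (32, 14),
    "2nd person": (28, 4),
    "3rd person": (200, 19),
    "Narration": (187, 19),
    "Discourse": (121, 26),
}


def null_rate(null, overt):
    """Percentage of null subjects among all subjects in a category."""
    return 100 * null / (null + overt)


rates = {factor: round(null_rate(n, o)) for factor, (n, o) in counts.items()}
for factor, rate in rates.items():
    print(f"{factor:<20} {rate:>3}% null")
```

Recomputing rates this way is also a quick check on rounded figures when the counts are updated after further manual correction.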
Explicit Impersonal Pronouns
Only tonic pronouns
1. mais aisso -m par causa tro brava
   ‘but it seems to me hard’
2. mais so fon sos meillors thesaurs
   ‘and it was her greatest treasure’
Social Variation
Factor                 Null (%)    Overt (%)
Male speakers          46 (87)     7 (13)
Female speakers        29 (78)     8 (22)
High social status     54 (86)     9 (14)
Low social status      29 (83)     6 (17)
Author                 11 (92)     1 (8)
Conclusion
In contrast to traditional corpora, this corpus is structured to fulfill two objectives:
The web design facilitates the reading and understanding of The Romance of Flamenca. Words are interactively linked to the glossary, comments, and translations.
The corpus search design, via its ANNIS interface, allows for visualization and complex queries of the (morpho-)syntactic, discourse, and parallel word-aligned annotations.
Future Directions
Completion of manual correction
Preservation and annotation of other Old Occitan manuscripts
Building a collaborative effort to continue with this project
Bibliography I
ARTFL Project. Provencal Poetry database (American and French Research on the Treasury of the French Language), Robert Morrissey, director, with F.R. Akehurst, 1998. URL http://artfl-project.uchicago.edu/content/proven%5Cc%7Bc%7Dal.
E.D. Blodgett. The Romance of Flamenca. Garland, New York, 1995.
W.A. Bradley. The Story of Flamenca. Harcourt Brace, New York, 1922.
Mark Davies. Corpus del Espanol: 100 million words, 1200s-1900s. Available online at http://www.corpusdelespanol.org, 2002.
Mark Davies and Michael Ferreira. Corpus do Portugues: 45 million words, 1300s-1900s. Available online at http://www.corpusdoportugues.org, 2006.
Andres Enrique-Arias. Parallel texts in diachronic investigations: insights from a parallel corpus of Spanish medieval Bible translations. In Exploring Ancient Languages through Corpora EALC, 2012.
Suzanne Fleischmann. The non-lyric texts. In F.R.P. Akehurst and Judith M. Davis, editors, A Handbook of the Troubadours, pages 176–184. University of California Press, 1995.
Bibliography II
Marc-Olivier Hinzelin and Georg A. Kaiser. Etudes de linguistique gallo-romane, chapter Le parametre du sujet nul dans les varietes dialectales de l’occitan et du francoprovencal, pages 247–260. Presses Universitaires de Vincennes, Saint-Denis, 2012.
Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. A cross-language approach to historic document retrieval. In M. Lalmas et al., editors, ECIR 2006, LNCS 3936, pages 407–419. Springer-Verlag, 2006.
Sandra Kubler and Heike Zinsmeister. Corpus Linguistics and Linguistically Annotated Corpora. Bloomsbury, 2014.
Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, HLT-NAACL ’06, pages 104–111, New York, NY, 2006.
France Martineau, Paul Hirschbuhler, Anthony Kroch, and Yves Charles Morin. Corpus MCVF (parsed corpus), modeliser le changement: les voies du francais, Department de Francais, University of Ottawa. CD-ROM, first edition, http://www.arts.uottawa.ca/voies/voies_fr.html, 2010.
Paul Meyer. Le Roman de Flamenca. Beziers, 1865.
Paul Meyer. Le Roman de Flamenca. Librairie Emile Bouillon, 2nd edition, 1901.
Peter T. Ricketts and Alan Reed. Concordance de l’Occitan Medieval. COM 2: Les Troubadours, Les Textes Narratifs en vers. Brepols, Turnhout, 2005.
Bibliography III
Olga Scrivner, Sandra Kubler, Barbara Vance, and Eric Beuerlein. Le Roman de Flamenca: An annotated corpus of Old Occitan. In Francesco Mambrini, Marco Passarotti, and Caroline Sporleder, editors, Proceedings of the Third Workshop on Annotation of Corpora for Research in Humanities, pages 85–96, 2013.
Ioanna Sitaridou. Corpora and Diachronic Linguistics, chapter A corpus-based study of null subjects in Old French and Old Occitan, pages 359–374. Narr, Tubingen, 2005.
Achim Stein. Syntactic annotation of Old French text corpora. Corpus, 7:157–161, 2008.
Dan Tufis. From word alignment to word senses, via multilingual wordnets. In Computer Science Journal of Moldova, volume 14, pages 3–33, 2006.
Barbara Vance. The evolution of subject pronoun systems in Medieval Occitan. Manuscript. Indiana University, 2009.