A mas novas vos torn / Now I take you back to my tale...

Post on 21-May-2018

Introduction

Parallel Corpora

Corpus Structure

Corpus Study

Conclusion

References

“A mas novas vos torn / Now I take you back to my tale”

The Romance of Flamenca

Olga Scrivner, E.D. Blodgett*, Sandra Kübler, Michael McGuire

Indiana University, *University of Alberta

June 2013

Introduction

In the past, historical documents and manuscripts were studied exclusively by using a manual paper-based approach.

Recent achievements of corpus linguistics have introduced state-of-the-art methods and tools for digitization, semi-automatic annotation, and visualization of such resources.

Linguistic Annotation

“By accessing linguistic annotation, we can extend the range of phenomena that can be found with high precision”

(Kübler and Zinsmeister, 2014)

1 Morphological annotation - collocations, spelling variation

2 Syntactic annotation - sentence structure in narratives vs. dialogues, prose vs. verse

3 Discourse annotation - analysis of scenes and characters (female vs. male speaker, King vs. servants)

Medieval Romance Languages

In recent years, a number of annotated corpora have been developed for Medieval Romance languages:

Corpora of Old Spanish (Davies, 2002)

Old Portuguese (Davies and Ferreira, 2006)

Old French (Stein, 2008; Martineau et al., 2010)

For Old Occitan, there exist (to our knowledge) two electronic databases:

1 “The Concordance of Medieval Occitan” (Ricketts and Reed, 2005)

2 “Provençal poetry” (ARTFL Project, 1998)

Users of these databases are limited to lexical search.

Le Roman de Flamenca - 13th century

Le Roman de Flamenca, a “universally acknowledged masterpiece of Old Occitan narrative” (Fleischmann, 1995).

“Flamenca est la création d’un homme d’esprit qui a voulu faire une œuvre agréable où fût représentée dans ce qu’elle avait de plus brillant la vie des cours au XIIe siècle. C’était un roman de mœurs contemporaines” (Meyer, 1865)

“Flamenca is the creation of a man of talent who wished to write an agreeable work representing the most brilliant aspects of courtly life in the twelfth century. It is a novel of manners” (Bradley, 1922)

Le Roman de Flamenca

This romance presents an intriguing love story between the beautiful Flamenca, who is imprisoned in a tower by her jealous husband Archambaut, and the sharp-witted knight Guillem.

The photo of the tapestry is used by permission of FreeLargePhotos.com

Le Roman de Flamenca

The anonymous manuscript of Le Roman de Flamenca was accidentally discovered in Carcassonne (France) by Raynouard and was first fully edited by P. Meyer in 1865.

This romance is unique in its genre (“the first modern novel”) and in its use of setting, adventures, and character portrayal (Blodgett, 1995; Bradley, 1922; Meyer, 1865).

The potential value of this historical resource, however, is limited by the lack of an accessible digital format and linguistic annotation.

Goals

Our corpus is intended not only as material for linguistic research, but also to aid in broader studies:

Interactive online database with access to a glossary, to translations of verses, and to comments (Meyer, 1901): http://nlp.indiana.edu/~obscrivn/Introduction.html

Multiple-level annotation - morphological, syntactic and pragmatic (Scrivner et al., 2013)

Parallel English-Occitan corpus (Blodgett, 1995)

What is a Parallel Corpus?

A parallel corpus is “an association between two texts (written or spoken) in different languages that represent translations of each other” (Tufiș, 2006).

Parallel alignment links reciprocal translation units, which encode valuable lexical and syntactic knowledge.

Alignment types

One-to-one: one word from a source language corresponds to only one word in a target language

One-to-many:

Historical Parallel Corpora

Given that parallel words have the same content, we can identify forms that have not been studied (Koolen et al., 2006; Enrique-Arias, 2012):

Spelling and lexical variation

Morphosyntactic variation

Null occurrences

Null Occurrences

A mas novas Ø vos torn
‘Now I take you back to my tale’
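
The verse above can be pictured as a set of index links between the Occitan and English token sequences, which makes the alignment types and the null occurrence concrete. This is a minimal sketch, not the corpus's actual storage format: the individual links (for example `torn` to `take ... back`) are illustrative assumptions, and `alignment_type` and `null_occurrences` are hypothetical helper names.

```python
# Illustrative only: these links are assumed, not read from the corpus.
occitan = ["A", "mas", "novas", "Ø", "vos", "torn"]
english = ["Now", "I", "take", "you", "back", "to", "my", "tale"]

# Each Occitan token index maps to the English token indices it is linked to.
links = {
    0: [5],     # A     -> to
    1: [6],     # mas   -> my
    2: [7],     # novas -> tale
    3: [1],     # Ø     -> I   (null subject, overt in the translation)
    4: [3],     # vos   -> you
    5: [2, 4],  # torn  -> take ... back
}


def alignment_type(targets):
    """Classify a link set by how many target words it covers."""
    if not targets:
        return "unaligned"
    return "one-to-one" if len(targets) == 1 else "one-to-many"


def null_occurrences(source, target, alignment):
    """Return target words that are linked to a null (Ø) source token."""
    return [
        target[j]
        for i, tok in enumerate(source)
        if tok == "Ø"
        for j in alignment.get(i, [])
    ]


for i, tok in enumerate(occitan):
    words = [english[j] for j in links[i]]
    print(tok, "->", words, alignment_type(links[i]))
print("null occurrences:", null_occurrences(occitan, english, links))
```

In this sketch the English word "Now" is left unlinked; a real alignment might treat it differently.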

Automatic Alignment

English and Occitan texts are aligned by verse line

Bilingual lexicon is generated by NATools¹

(Matrix of word-to-word probabilities)

Automatic alignment via the Berkeley Aligner (Liang et al., 2006)

Manual correction of alignment

¹ http://linguateca.di.uminho.pt/natools/
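
A matrix of word-to-word probabilities can be estimated from nothing more than line-aligned text. The following is a minimal sketch using IBM Model 1 expectation-maximization as a stand-in; it is not the NATools or Berkeley Aligner implementation, and the three toy verse lines are invented for illustration rather than taken from Flamenca.

```python
from collections import defaultdict


def train_lexicon(pairs, iterations=20):
    """Return {(english, occitan): P(english | occitan)} via IBM Model 1 EM."""
    # Start with equal weight for every word pair that co-occurs in a line.
    t = {(e, o): 1.0 for src, tgt in pairs for o in src for e in tgt}
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (e, o) links
        total = defaultdict(float)  # expected number of links per Occitan word
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, o)] for o in src)
                for o in src:
                    share = t[(e, o)] / norm
                    count[(e, o)] += share
                    total[o] += share
        # Re-estimate the probabilities from the expected counts.
        t = {(e, o): c / total[o] for (e, o), c in count.items()}
    return t


def best_translation(t, occitan_word):
    """Pick the English word with the highest probability for one Occitan word."""
    candidates = {e: p for (e, o), p in t.items() if o == occitan_word}
    return max(candidates, key=candidates.get)


# Invented toy lines (Occitan tokens, English tokens), aligned line by line.
toy_pairs = [
    (["la", "dona"], ["the", "lady"]),
    (["la", "tor"], ["the", "tower"]),
    (["dona"], ["lady"]),
]

lexicon = train_lexicon(toy_pairs)
for word in ["la", "dona", "tor"]:
    print(word, "->", best_translation(lexicon, word))
```

On real verse the resulting lexicon is noisy, which is one reason a manual correction pass follows the automatic alignment.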

Morphological Annotation - TnT Tagger

Syntactic Annotation - Berkeley Parser

Syntactic Annotation

“...nor did he want to omit Flamenca”

Discourse Annotation - Speakers

The labels correspond to the main characters’ names, namely Flamenca, Archambaut, Guillem, Father, King, Queen.

Less important characters are marked as FemaleSpeakers and MaleSpeakers.

Parallel Alignment Annotation

Corpus Design

Since we are targeting two types of users with different needs, linguists and non-linguists, the corpus is made available in two modes:

Web Interface: Users can browse the text and look up translations, glosses, and comments

Query Search: Users interested in the linguistic annotation can query the corpus online

1. Web Database

Glossary definitions, comments, and footnotes are linked to tokens and become visible when the user hovers over a marked word.
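
One simple way to obtain this hover behaviour is to wrap each glossed token in an HTML element whose `title` attribute carries the gloss. The sketch below is an assumption about how such markup could be generated, not a description of the actual site; `render_line`, the `gloss` class name, and the glossary entry are hypothetical.

```python
from html import escape


def render_line(tokens, glossary):
    """Wrap glossed tokens in <span title="..."> so browsers show the gloss on hover."""
    parts = []
    for tok in tokens:
        gloss = glossary.get(tok.lower())
        if gloss is None:
            # No glossary entry: emit the token as plain (escaped) text.
            parts.append(escape(tok))
        else:
            parts.append(
                '<span class="gloss" title="{}">{}</span>'.format(
                    escape(gloss, quote=True), escape(tok)
                )
            )
    return " ".join(parts)


# Hypothetical glossary entry, based on the translation "my tale" for "mas novas".
glossary = {"novas": "tale"}
print(render_line(["A", "mas", "novas", "vos", "torn"], glossary))
```

Escaping both the token and the gloss keeps quotation marks or angle brackets in editorial comments from breaking the page markup.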

2. Search Tool (ANNIS)

Our web search, based on ANNIS, supports basic queries (searching for a word or phrase) and more complex queries over the syntactic, morphosyntactic, discourse, and alignment annotation.

Corpus Study

Null Subjects

Corpus Search

Results

Null Subject

Modern Occitan varieties are null subject languages (Hinzelin and Kaiser, 2012)

(1) Ø Era pertot,     dintrava pertot
      was everywhere, entered  everywhere
    ‘the light was everywhere and it was coming from everywhere’ (Lo bon de la nuoch, Max Rouquette)

Previous Findings - Overt Subject

Overt subjects - disambiguation or “mise en relief” (emphasis)

(2) Femna que ieu ame illuminada de non ren
    ‘Woman who I love illuminated from nothingness’
    (Saume dins lo vent, Serge Bec)

Person - more frequent with 1st person (Vance, 2009)

Genre - more frequent in prose

No difference by clause types (Sitaridou, 2005)

Subjunctive clause - preference for null subject (Vance, 2009)

Search by Genre

Discourse annotation: Flamenca, King, etc.

ex. speaker="Flamenca"

Narrative vs. Dialogues

Male vs. Female

High social rank vs. low social rank
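
The groupings above follow directly from the speaker labels. As a minimal sketch outside ANNIS, the snippet below groups verse lines by mode and by speaker gender. The label names come from the discourse annotation, but the `(line number, speaker)` records are invented, and treating a missing speaker as narration is an assumption.

```python
from collections import Counter

# Speaker labels from the discourse annotation, grouped by gender.
FEMALE = {"Flamenca", "Queen", "FemaleSpeakers"}
MALE = {"Archambaut", "Guillem", "Father", "King", "MaleSpeakers"}


def mode(speaker):
    """Assumption: lines without a speaker label belong to the narrative."""
    return "narrative" if speaker is None else "dialogue"


def gender(speaker):
    """Map a speaker label to a gender group, or None for narration."""
    if speaker in FEMALE:
        return "female"
    if speaker in MALE:
        return "male"
    return None


# Invented records: (verse line number, speaker label or None).
lines = [
    (1, None),
    (2, "Flamenca"),
    (3, "Guillem"),
    (4, "King"),
    (5, None),
    (6, "FemaleSpeakers"),
]

by_mode = Counter(mode(s) for _, s in lines)
by_gender = Counter(gender(s) for _, s in lines if gender(s) is not None)
print(dict(by_mode))
print(dict(by_gender))
```

A social-rank grouping would work the same way, given a mapping from labels to ranks.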

Search by Person

Token annotation: I, you, it, etc.

ex. tok="it"

Personal pronouns

Impersonal pronouns

Search by Clause Type

Syntactic annotation:

ex. matrix: cat="IP" >[func="MAT"]
    embedded: cat="IP" >[func="SUB"]

Main clause

Embedded clause
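
To show what a clause-type search combined with a subject check amounts to, here is a minimal sketch that walks a toy constituency tree. The `IP` category and the `MAT`/`SUB` function labels follow the queries above; storing `func` directly on the node, the `NP`/`SBJ` subject label, and the toy tree itself are assumptions made for illustration, not the corpus's actual encoding.

```python
from collections import Counter


def clauses(node):
    """Yield every IP node in the tree, depth first."""
    if node["cat"] == "IP":
        yield node
    for child in node.get("children", []):
        yield from clauses(child)


def has_overt_subject(ip):
    """Assumption: an overt subject is an NP child of the IP labelled SBJ."""
    return any(
        child["cat"] == "NP" and child.get("func") == "SBJ"
        for child in ip.get("children", [])
    )


def tally(tree):
    """Count clauses by (clause function, null/overt subject)."""
    counts = Counter()
    for ip in clauses(tree):
        kind = "overt" if has_overt_subject(ip) else "null"
        counts[(ip.get("func"), kind)] += 1
    return counts


# Invented toy tree: a matrix clause without a subject NP that embeds
# a subordinate clause with an overt pronoun subject.
tree = {
    "cat": "IP", "func": "MAT", "children": [
        {"cat": "VB", "tok": "torn"},
        {"cat": "CP", "children": [
            {"cat": "IP", "func": "SUB", "children": [
                {"cat": "NP", "func": "SBJ", "tok": "ieu"},
                {"cat": "VB", "tok": "ame"},
            ]},
        ]},
    ],
}

print(dict(tally(tree)))
```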

Overt vs. Null Subjects

Searching for null subjects:

Overt vs. Null Subjects

Searching for overt subjects:

Overt vs. Null Subjects - 1000 lines

Factor                 Null (%)    Overt (%)

Total                  308 (87)    45 (13)

Matrix clause          34 (87)     5 (13)
Embedded clause        107 (84)    21 (16)

Impersonal pronouns    23 (92)     2 (8)
1st person             32 (70)     14 (30)
2nd person             28 (88)     4 (12)
3rd person             200 (91)    19 (9)

Narration              187 (91)    19 (9)
Discourse              121 (82)    26 (18)
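
The percentages in this table are row shares: null and overt counts divided by their sum for each factor. A minimal sketch of that arithmetic follows, with the counts copied from the table; the function name `shares` is ours, and the overt share is taken as the complement of the rounded null share so that each row adds up to 100.

```python
# (null count, overt count) per factor, copied from the table.
counts = {
    "Total": (308, 45),
    "Matrix clause": (34, 5),
    "Embedded clause": (107, 21),
    "Impersonal pronouns": (23, 2),
    "1st person": (32, 14),
    "2nd person": (28, 4),
    "3rd person": (200, 19),
    "Narration": (187, 19),
    "Discourse": (121, 26),
}


def shares(null, overt):
    """Return (null %, overt %) as whole numbers that sum to 100."""
    null_pct = round(100 * null / (null + overt))
    return null_pct, 100 - null_pct


for factor, (null, overt) in counts.items():
    null_pct, overt_pct = shares(null, overt)
    print(f"{factor:<20} {null:>4} ({null_pct})  {overt:>3} ({overt_pct})")
```

Recomputing the shares from the raw counts is a quick consistency check on any row of the table.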

Explicit Impersonal Pronouns

Only tonic pronouns

1 mais aisso -m par causa tro brava
  ‘but it seems to me hard’

2 mais so fon sos meillors thesaurs
  ‘and it was her greatest treasure’

Social Variation

Factor                 Null (%)    Overt (%)

Male speakers          46 (87)     7 (13)
Female speakers        29 (78)     8 (22)

High social status     54 (86)     9 (14)
Low social status      29 (83)     6 (17)

Author                 11 (92)     1 (8)

Conclusion

In contrast to traditional corpora, this corpus is structured to fulfill two objectives:

The web design facilitates the reading and understanding of The Romance of Flamenca. Words are interactively linked to the glossary, comments, and translations.

The corpus search design, via its ANNIS interface, allows for visualization and for complex queries over the (morpho-)syntactic, discourse, and parallel word-aligned annotations.

Future Directions

Completion of manual correction

Preservation and annotation of other Old Occitan manuscripts

Building a collaborative effort to continue with this project

Bibliography I

ARTFL Project. Provençal Poetry database (American and French Research on the Treasury of the French Language), Robert Morrissey, director, with F.R. Akehurst, 1998. URL http://artfl-project.uchicago.edu/content/proven%5Cc%7Bc%7Dal.

E.D. Blodgett. The Romance of Flamenca. Garland, New York, 1995.

W.A. Bradley. The Story of Flamenca. Harcourt Brace, New York, 1922.

Mark Davies. Corpus del Español: 100 million words, 1200s-1900s. Available online at http://www.corpusdelespanol.org, 2002.

Mark Davies and Michael Ferreira. Corpus do Português: 45 million words, 1300s-1900s. Available online at http://www.corpusdoportugues.org, 2006.

Andrés Enrique-Arias. Parallel texts in diachronic investigations: insights from a parallel corpus of Spanish medieval Bible translations. In Exploring Ancient Languages through Corpora (EALC), 2012.

Suzanne Fleischmann. The non-lyric texts. In F.R.P. Akehurst and Judith M. Davis, editors, A Handbook of the Troubadours, pages 176–184. University of California Press, 1995.

Bibliography II

Marc-Olivier Hinzelin and Georg A. Kaiser. Études de linguistique gallo-romane, chapter Le paramètre du sujet nul dans les variétés dialectales de l’occitan et du francoprovençal, pages 247–260. Presses Universitaires de Vincennes, Saint-Denis, 2012.

Marijn Koolen, Frans Adriaans, Jaap Kamps, and Maarten de Rijke. A cross-language approach to historic document retrieval. In M. Lalmas et al., editors, ECIR 2006, LNCS 3936, pages 407–419. Springer-Verlag, 2006.

Sandra Kübler and Heike Zinsmeister. Corpus Linguistics and Linguistically Annotated Corpora. Bloomsbury, 2014.

Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, HLT-NAACL ’06, pages 104–111, New York, NY, 2006.

France Martineau, Paul Hirschbühler, Anthony Kroch, and Yves Charles Morin. Corpus MCVF (parsed corpus), Modéliser le changement: les voies du français, Département de français, University of Ottawa. CD-ROM, first edition, http://www.arts.uottawa.ca/voies/voies_fr.html, 2010.

Paul Meyer. Le Roman de Flamenca. Béziers, 1865.

Paul Meyer. Le Roman de Flamenca. Librairie Émile Bouillon, 2nd edition, 1901.

Peter T. Ricketts and Alan Reed. Concordance de l’Occitan Médiéval. COM 2: Les Troubadours, Les Textes Narratifs en vers. Brepols, Turnhout, 2005.

Bibliography III

Olga Scrivner, Sandra Kübler, Barbara Vance, and Eric Beuerlein. Le Roman de Flamenca: An annotated corpus of Old Occitan. In Francesco Mambrini, Marco Passarotti, and Caroline Sporleder, editors, Proceedings of the Third Workshop on Annotation of Corpora for Research in Humanities, pages 85–96, 2013.

Ioanna Sitaridou. Corpora and Diachronic Linguistics, chapter A corpus-based study of null subjects in Old French and Old Occitan, pages 359–374. Narr, Tübingen, 2005.

Achim Stein. Syntactic annotation of Old French text corpora. Corpus, 7:157–161, 2008.

Dan Tufiș. From word alignment to word senses, via multilingual wordnets. In Computer Science Journal of Moldova, volume 14, pages 3–33, 2006.

Barbara Vance. The evolution of subject pronoun systems in Medieval Occitan. Manuscript. Indiana University, 2009.
