

Page 1: Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse (lrec-conf.org/workshops/lrec2018/W33/pdf/8_W33.pdf)

Presenting the Nénufar Project: a Diachronic Digital Edition of the Petit Larousse Illustré

Hervé Bohbot, Francesca Frontini, Giancarlo Luxardo
PRAXILING UMR 5267 CNRS & Univ Paul Valéry Montpellier 3 - Montpellier, France

[email protected]

Mohamed Khemakhem 1,2,3, Laurent Romary 1,2,4
1 Inria – ALMAnaCH, Paris
2 Centre Marc Bloch, Berlin
3 Université Paris Diderot, Paris
4 Berlin-Brandenburgische Akademie der Wissenschaften, Berlin

[email protected]

Abstract

This paper presents the Nénufar project, which aims to make several successive editions of the French Petit Larousse Illustré dictionary (those free of copyright, i.e. up to 1948) available in a digitised format. The corpus of digital editions will be made publicly available via a web-based querying interface, and will also be distributed in a machine-readable format, TEI-Lex0.

Keywords: TEI, Petit Larousse, dictionaries

1. Introduction

The digitisation of historical dictionaries has recently gained strong momentum, moving past the mere publication of scanned texts to the conversion of paper dictionaries into easily exploitable lexical databases encoded using well-established digital standards. At the same time, a number of the main historical French dictionaries (16th to 19th century) are currently being digitised and made available online. Two main initiatives in this regard are the Grand Corpus des dictionnaires Garnier [1] and the ARTFL project [2], which provide access to the content by means of search interfaces (though access is partly restricted and the sources are not downloadable) [3]. By contrast, similar initiatives are lacking for 20th-century French dictionaries.

The Nénufar [4] project aims to make several successive editions of the Petit Larousse Illustré (PLI) available in a digitised format. The PLI is an especially good candidate for such a project since it is the only French dictionary that has been updated every year since it was first published, in 1905 [5]. Under French copyright law, collective works such as the PLI enter the public domain 70 years after publication, which means that we can at present take into account all editions up to 1948.

Each new edition of the PLI differs from the previous one in terms of lexical entries (with a number of words entering or exiting); changes are also found in updated definitions and, at times, in the orthographic and grammatical norms that are referred to. All of this provides lexicographers, linguists and historians with an invaluable source of information on the evolution of French language and culture during the first half of the 20th century. At the same time, the evolution of language notwithstanding, the PLI is also an important source of linguistic information on contemporary French, and its digitisation will feed into the existing ecosystem of French language technologies (see Mariani et al., 2012, for an overview).

[1] http://www.classiques-garnier.com/
[2] http://artfl-project.uchicago.edu
[3] Gallica also provides access to OCRed scans of old dictionaries, http://gallica.bnf.fr/.
[4] Nouvelle édition numérique de fac-similés de référence (new digital edition of reference facsimiles).
[5] The PLI is still published today and is the best-selling dictionary for the French language.

2. The Project

Nénufar is a project headed by the Praxiling laboratory at the Paul Valéry University of Montpellier in collaboration with INRIA, and is supported by funding from the Délégation Générale à la Langue Française et aux Langues de France (DGLFLF) and the Huma-Num consortia CORLI [6] and CAHIER [7]. It continues a previous project, initiated in the early 2000s, which saw the publication of a first digital version of the 1905 edition in 2005 [8].

That original digital edition was publicly searchable through a web interface, but this is no longer the case; moreover, the XML encoding used was not fully TEI compliant. The first goal of the Nénufar project is thus to re-encode the 1905 edition, transforming the existing version into TEI-compliant XML, as well as correcting remaining OCR errors and improving the detection and annotation of the main lexicographic elements of each entry.

The availability of an existing digitised version of the first edition makes the digitisation of later editions much easier: by comparing the OCRed versions of two subsequent editions it is possible to identify changes in the more recent edition, but also undetected OCR errors in the previous one.

[6] https://corli.huma-num.fr/
[7] http://cahier.hypotheses.org/
[8] This first initiative was headed by the laboratory Lexique, Dictionnaires et Informatique, under the lead of Jean Pruvost, who is now an advisor to Nénufar.
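The comparison of two OCRed editions described in Section 2 can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names, the similarity threshold and the two sample strings are invented for illustration, and only Python's standard difflib module is used.

```python
import difflib


def diff_words(old_text, new_text):
    """Return (operation, old_fragment, new_fragment) for each word-level difference."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


def looks_like_ocr_noise(old_fragment, new_fragment, threshold=0.8):
    """Heuristic (an assumption): near-identical fragments suggest OCR noise, not a revision."""
    if not old_fragment or not new_fragment:
        return False  # pure insertions or deletions are treated as editorial changes
    ratio = difflib.SequenceMatcher(None, old_fragment, new_fragment).ratio()
    return ratio >= threshold


# Invented sample strings, not actual PLI text.
older = "plante aquatique à fleurs blanches"
newer = "plante aquatique à fIeurs blanches ou jaunes"

for op, old_fragment, new_fragment in diff_words(older, newer):
    kind = "possible OCR error" if looks_like_ocr_noise(old_fragment, new_fragment) else "revision"
    print(op, repr(old_fragment), repr(new_fragment), kind)
```

A heuristic of this kind cannot tell which of the two editions carries the OCR error, so flagged pairs would still need to be checked against the facsimile.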


Although the PLI has been published every year since 1905, the project will prioritise the digitisation of a selected set of editions that correspond to major re-editions of the dictionary, namely those of 1924, 1936 and 1948.

The 1924 edition is currently being digitised; we calculated that one third of its entries were modified with respect to the 1905 edition.

A first release of the Nénufar corpus, including the 1905 and the 1924 editions, will take place by the end of 2018. Further editions will subsequently be made available. Alongside the lexicographic part, the corpus will also contain additional onomastic information (from the encyclopaedic section of the PLI, which lists proper names of people, places, etc.) and a digitised version of all figures with their captions.
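A figure such as the share of entries modified between two editions could be computed along the following lines. This is a sketch under stated assumptions rather than the method actually used: entries are assumed to be aligned by identical headword, each edition is reduced to a hypothetical headword-to-text mapping, and the toy data are placeholders, not PLI content.

```python
def normalise(text):
    """Collapse whitespace so that layout differences do not count as changes."""
    return " ".join(text.split())


def compare_editions(old, new):
    """Classify headwords of two editions (dicts mapping headword to entry text)."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(h for h in shared if normalise(old[h]) != normalise(new[h])),
    }


def modified_share(old, new):
    """Share of the newer edition's entries whose text differs from the older one."""
    if not new:
        return 0.0
    return len(compare_editions(old, new)["modified"]) / len(new)


# Placeholder data: hypothetical headwords and texts.
edition_old = {"mot_a": "definition a", "mot_b": "definition b", "mot_c": "definition c"}
edition_new = {"mot_a": "definition a, revised", "mot_b": "definition  b", "mot_d": "definition d"}

print(compare_editions(edition_old, edition_new))
print(modified_share(edition_old, edition_new))
```

Aligning by headword alone is a simplification: homographs that share a headword would need an additional key before such a count could be trusted.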

3. The Formats

The question of publication formats is crucial for a project such as this one, which caters to different research communities. On the one hand, in order to meet the requirements of the general public as well as of traditional historical lexicographers, we need to provide a browsable web interface, which enables users to search for entries and see their evolution over time in a user-friendly way. On the other hand, the needs of digital lexicographers and language technologists can only really be met by making the sources of each edition available in a standardised format, something that would not only allow for more specialised querying, but would also be best suited for long-term preservation.

Currently two formats are under discussion for the publication of retrodigitised dictionaries such as the PLI, namely the TEI dictionaries module [9] and the Ontolex-Lemon model (RDF) (McCrae et al., 2017). These two formats serve different purposes: TEI represents the dictionary as a digital edition, and is better suited to the needs of lexicographers and linguists, while Ontolex-Lemon is the reference format for the publication of dictionaries as Linked Open Data, and is thus more relevant to Language and Semantic Web technologists.

As to the encoding of the PLI in TEI, the first step was to transform the 2005 mark-up into a TEI-compliant format, which is the one presented in Appendix B. This first encoding remains very close to the structure of the typographic entry, as can be seen in Appendix A, and thus uses the entryFree TEI tag, which allows for maximum freedom in the representation and encoding of the different parts of a lexical entry. For this reason it is the one that will be used internally in the Nénufar database to derive the HTML displayed on the browsable web interface.

However, excessive freedom in entry modelling can become a hindrance to interoperability with other projects. For this reason a recent joint ENeL [10] / DARIAH [11] / PARTHENOS [12] initiative has proposed a stricter TEI representation for dictionaries, called TEI-Lex0 (Bański et al., 2017). TEI-Lex0 derives from the lexicographic module of TEI and is fully TEI compliant, but aims to provide clearer guidelines for the encoding of retrodigitised dictionaries.

Compared with the more general TEI guidelines for dictionaries, TEI-Lex0 aims to provide a schema that will allow most modern dictionaries to be represented in a way that enables interoperability, comparability and easier exploitation. To that end, the internal structure and information of lexical entries have been revised and optimised to be more explicit and uniform.

We believe that the PLI can constitute an excellent test case for this new format, which we intend as the distribution format for the downloadable resource. Appendix C shows the same entry transformed into the TEI-Lex0 format. Going from the current format to the new one requires some changes; some of them (such as the insertion of the type attribute in the form tag) are straightforward, but others are more complex to implement.

First of all, the entryFree tag is replaced by entry, which allows for less freedom as to the tags it may contain. As a consequence, the original structure cannot be left as it is. In particular, the sense tag needs to be inserted, to group a definition with its related examples and citations. This implies adding information which, in the original entry, is not explicitly marked by visible typographic features (such as numbering, symbols or formatting, as is the case in other dictionaries). After close analysis of the PLI entries, we consider that every new definition instantiates a new sense, and that no sense hierarchy is inferable.

Another issue is that free text is not allowed within the sense tag. Thus pc tags need to be used to wrap punctuation elements such as colons, as they can be considered neither part of the definition nor part of the citation.

Despite the work required to transform the current format into TEI-Lex0, the advantages are clear: TEI-Lex0 will allow different dictionaries to be queried using the same strategy and will also facilitate the development of common tools.

One of the current applications of this format is in the GROBID-Dictionaries infrastructure, which uses machine learning to infer the TEI-Lex0 structure of a dictionary entry from OCRed dictionary pages (Khemakhem et al., 2017). Within the Nénufar project, experiments are ongoing to digitise new editions with GROBID-Dictionaries.

As to the Ontolex-Lemon version, at the time of writing this paper (March 2018) a working group is drafting the specifications for a dictionary module, which will make it possible to represent retro-digitised dictionaries using the Ontolex-Lemon core with additional properties. The specifications are not yet finalised, and the final modelling of the PLI in this new format will be the object of further research; it is worth underlining, however, that PLI entries from the 1905 edition are currently being used as examples to discuss the issues raised by the new module [13].

As to the availability of the two versions, the TEI edition will be downloadable from the Ortolang [14] platform, and the Ontolex-Lemon version will be queryable via a SPARQL endpoint.

[9] (Budin et al., 2012), see also http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DI.html
[10] http://www.elexicography.eu/
[11] https://www.dariah.eu/
[12] http://www.parthenos-project.eu/
[13] https://www.w3.org/community/ontolex/wiki/Lexicography
[14] http://www.ortolang.fr
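The three changes discussed in Section 3 (entryFree replaced by entry, one sense per definition, punctuation wrapped in pc) can be sketched as a small transformation. This is an illustration under assumptions, not the project's converter: the input fragment is invented and much simpler than the encoding of Appendix B, the TEI namespace is omitted, the "lemma" value for the type attribute is an assumption, and the output is not validated against the TEI-Lex0 schema.

```python
import copy
import xml.etree.ElementTree as ET


def to_lex0_entry(entry_free):
    """Turn a flat <entryFree> into an <entry> with one <sense> per <def>."""
    entry = ET.Element("entry", dict(entry_free.attrib))
    sense = None  # the <sense> currently being filled, if any
    for child in entry_free:
        node = copy.deepcopy(child)
        # Free text following an element (its "tail") is treated as punctuation.
        punctuation = (node.tail or "").strip()
        node.tail = None
        if node.tag == "form":
            # Assumption: the headword form is typed as "lemma".
            node.set("type", "lemma")
        if node.tag == "def":
            # Following the paper: every new definition opens a new, flat sense.
            sense = ET.SubElement(entry, "sense")
        in_sense = sense is not None and node.tag in ("def", "cit")
        parent = sense if in_sense else entry
        parent.append(node)
        if punctuation:
            # No free text is allowed inside <sense>, so punctuation goes into <pc>.
            pc = ET.SubElement(parent, "pc")
            pc.text = punctuation
    return entry


# Invented input for illustration; it is not the encoding shown in Appendix B.
SAMPLE = (
    "<entryFree>"
    "<form><orth>verre</orth></form>"
    "<gramGrp><gram>n. m.</gram></gramGrp>"
    "<def>Definition one</def>: "
    '<cit type="example"><quote>Example one</quote></cit>. '
    "<def>Definition two</def>."
    "</entryFree>"
)

converted = to_lex0_entry(ET.fromstring(SAMPLE))
print(ET.tostring(converted, encoding="unicode"))
```

The sketch only handles direct children of the entry; the harder part noted in the text, deciding where a sense begins when the typography gives no explicit cue, is reduced here to the rule that each def element starts a new sense.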


Finally, two modelling issues are of a more generic nature and will affect both formats. On the one hand, homographs are generally but not systematically treated as separate entries in the PLI; this may be a problem for the encoding of grammatical properties at the entry level and may require adjustments. On the other hand, a normalisation of data categories for grammatical features is required and currently ongoing: the grammatical labels (gender, number, language, etc.), represented in the original by (often unsystematic) French abbreviations, will be normalised using existing controlled vocabularies; in this respect, the CLARIN Concept Registry [15] may constitute a valid solution.
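The normalisation of grammatical labels mentioned above could take the shape of a lookup from abbreviations to feature-value pairs. The sketch below is illustrative only: the abbreviation table and the target values are assumptions made for this example, and they are neither the project's list nor identifiers from the CLARIN Concept Registry.

```python
import re

# Hypothetical table: French abbreviation -> (feature, normalised value).
LABELS = {
    "n.": ("pos", "noun"),
    "adj.": ("pos", "adjective"),
    "adv.": ("pos", "adverb"),
    "m.": ("gender", "masculine"),
    "f.": ("gender", "feminine"),
    "pl.": ("number", "plural"),
}


def normalise_gram(raw):
    """Split a label such as 'n. m.' and map each abbreviation to a feature.

    Returns (features, unknown): a dict of normalised features and the list of
    abbreviations that are not in the table and need manual review.
    """
    features = {}
    unknown = []
    for token in re.findall(r"[^\s.]+\.?", raw.lower()):
        # Tolerate unsystematic spelling: 'pl' and 'pl.' are treated alike.
        key = token if token.endswith(".") else token + "."
        if key in LABELS:
            feature, value = LABELS[key]
            features[feature] = value
        else:
            unknown.append(key)
    return features, unknown


print(normalise_gram("n. m."))
print(normalise_gram("N.f. pl"))
print(normalise_gram("v. a."))
```

Returning the unrecognised abbreviations separately, rather than guessing, keeps the unsystematic cases visible for manual review.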

4. The Content

Dictionaries are the “tools of a language and a culture” (Pruvost, 2006) and the PLI, whose millions of copies over more than 110 years have found a place in the majority of French households, has played and still plays a great role in the democratisation of linguistic knowledge (Cormier et al., 2006); for this reason the diachronic investigation of its successive editions sheds new light on the evolution of French language and society.

First and foremost, the Nénufar corpus will constitute a privileged source of information on the evolution of orthography. The name of the project itself is inspired by a surprising controversy sparked in 2016 by the proposed change in the spelling of the French word for waterlily, from nénuphar to nénufar. Although the new spelling was strongly rejected by the public and by the media, an inspection of early editions of the PLI shows that the nénufar spelling was already present in the 1905 edition and remained the preferred orthography for the word for the whole of the first half of the 20th century. Other orthographies attested in the earlier versions of the PLI would be considered almost shocking today, such as à priori (with an accent), fiord instead of fjord, or ognon as an alternate spelling for oignon (the French for onion).

Apart from the evolution of orthography, the older editions of the PLI are rich in information about phonetics ([distrik] and [lo-kouass] for district and loquace in 1906), neologisms (antimilitarisme in 1911, boche, a pejorative word for German, in 1917, etc.) and changes in the definitions. Some of these changes are rather amusing, such as the one for aviation, which in 1905 reads “on a fait de nombreuses tentatives à ce sujet mais le problème n’est pas encore résolu” (several attempts have been made but the problem has not been solved yet) and in 1911 becomes “les aéroplanes ont victorieusement résolu le problème du plus lourd que l’air” (aeroplanes have victoriously solved the heavier-than-air problem). In other cases (as in the older entries for juiverie or nègre, négresse) definitions bear witness to the evolution of society, of which the PLI is the mirror.

5. Conclusion

In this paper we presented Nénufar, an ongoing project aimed at the digitisation of selected editions of the Petit Larousse Illustré from the first half of the 20th century. A first TEI and web release of the Nénufar corpus will be available in 2018 under an open license, thus enabling research communities in linguistics, history and language technologies to study and use it.

To ensure interoperability, the project is carried out in close contact with ongoing international initiatives aimed at promoting standards and best practices in the retro-digitisation of legacy dictionaries [16]. Moreover, it is currently used as a test bed for GROBID-Dictionaries, a technology which will considerably speed up the encoding of OCRed resources. The current project specifically targets the PLI, but the best practices developed within Nénufar will be applicable to other legacy dictionaries.

[15] https://concepts.clarin.eu/ccr/browser/

6. Bibliographical References

Bański, P., Bowers, J., and Erjavec, T. (2017). TEI-Lex0 Guidelines for the Encoding of Dictionary Information on Written and Spoken Forms. In eLex 2017.

Budin, G., Majewski, S., and Mörth, K. (2012). Creating Lexical Resources in TEI P5. Journal of the Text Encoding Initiative, (Issue 3), November.

Cormier, M.-C., Pruvost, J., Mitterand, H., Garnier, Y., and Collectif. (2006). Les dictionnaires Larousse : Genèse et évolution. PU Montréal, Montréal, March.

Khemakhem, M., Foppiano, L., and Romary, L. (2017). Automatic Extraction of TEI Structures in Digitized Lexical Resources using Conditional Random Fields. In electronic lexicography, eLex 2017, Leiden, Netherlands, September.

Mariani, J., et al., editors. (2012). La langue française à l'ère du numérique – The French Language in the Digital Age. White Paper Series. Springer-Verlag, Berlin Heidelberg.

McCrae, J. P., Bosque-Gil, J., Gracia, J., Buitelaar, P., and Cimiano, P. (2017). The OntoLex-Lemon Model: Development and Applications. In eLex 2017.

Pruvost, J. (2006). Les dictionnaires français : Outils d'une langue et d'une culture. Ophrys, Paris.

[16] In addition to what was mentioned in this paper, Nénufar plans to collaborate with the ELEXIS project, which recently kicked off and aims at building a European infrastructure for e-lexicography (http://www.elex.is/).


Appendices

A The dictionary entry verre (glass) in the PLI


B The first TEI-XML encoding


C The TEI-Lex0 encoding