datech2014 - cataloguing for a billion word library of greek and latin

How do you catalog a billion word library? Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2), Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2). (1) Perseus Digital Library, Tufts University; (2) Open Philology Project, University of Leipzig


Post on 22-Nov-2014


DESCRIPTION

Presentation of the paper Cataloguing for a Billion Word Library of Greek and Latin by Gregory Crane, Bridget Almas, Alison Babeu, Lisa Cerrato, Anna Krohn, Frederik Baumgardt, Monica Berti, Greta Franzini and Simona Stoyanova in DATeCH 2014. #digidays

TRANSCRIPT

Page 1: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

How do you catalog a billion word library?

Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2), Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2)

1. Perseus Digital Library, Tufts University

2. Open Philology Project, University of Leipzig

Page 2: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Major points

1. We are interested in the logical structures within/across physical books:

Text Groups, e.g., Author Y, Papyri from X

Works, e.g., Vergil’s Aeneid

Individual words, e.g., Arma virumque cano

Page 3: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Major points

2. From a pragmatic perspective, we only need one version of a logical work (e.g., Tacitus’ Annales). We can use that marked-up version as a query that we match against very large and very error-filled corpora.
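A minimal sketch of this idea, not the authors' actual matching pipeline: one clean reference line is slid across a noisy page as a fuzzy query. The OCR string below is invented for illustration, and the function name `best_match` is hypothetical; only Python's standard `difflib` is used.

```python
from difflib import SequenceMatcher


def best_match(query: str, noisy_text: str) -> tuple[int, float]:
    """Return (word offset, similarity) of the window most similar to query."""
    query_words = query.lower().split()
    words = noisy_text.lower().split()
    size = len(query_words)
    normalized_query = " ".join(query_words)
    best_offset, best_score = -1, 0.0
    # Slide a window of the query's length over the noisy text.
    for start in range(max(1, len(words) - size + 1)):
        window = " ".join(words[start:start + size])
        score = SequenceMatcher(None, normalized_query, window).ratio()
        if score > best_score:
            best_offset, best_score = start, score
    return best_offset, best_score


reference = "Arma virumque cano, Troiae qui primus ab oris"
# Invented OCR output with typical character confusions (m -> rn, e -> c).
ocr_page = ("P. VERGILI MARONIS AENEIDOS LIBER I Arrna uirumque cano, "
            "Troiac qui prirnus ab oris Italiam")

offset, score = best_match(reference, ocr_page)
print(offset, round(score, 2))
```

The reference line still finds its place on the page despite the OCR errors, which is the property the slide relies on. A production system would need an indexed or hashed search rather than this quadratic scan.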

Page 4: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Major points

3. A text collection can serve as a catalog, with all other versions of the texts in that collection (including translations, shorter quotations, and alternate editions) represented as annotations on that collection.
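One way to picture "other versions as annotations" is a small data structure in which every translation, quotation, or alternate edition points back at a citation in the reference text. This is only a sketch: the class names, field names, and source labels are hypothetical and are not taken from the Perseus Catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    target: str  # citation in the reference text, e.g. "1.1"
    kind: str    # "edition", "translation", or "quotation"
    source: str  # label of the other version (hypothetical)
    text: str


@dataclass
class ReferenceWork:
    title: str
    passages: dict[str, str]  # citation -> reference text
    annotations: list[Annotation] = field(default_factory=list)

    def versions_of(self, citation: str) -> list[Annotation]:
        """All other versions attached to one passage of the reference text."""
        return [a for a in self.annotations if a.target == citation]


aeneid = ReferenceWork(
    title="Aeneid",
    passages={"1.1": "Arma virumque cano, Troiae qui primus ab oris"},
)
aeneid.annotations.append(
    Annotation("1.1", "translation", "example-english-translation",
               "Arms and the man I sing"))
aeneid.annotations.append(
    Annotation("1.1", "quotation", "example-later-author",
               "arma virumque"))

for note in aeneid.versions_of("1.1"):
    print(note.kind, "->", note.text)
```

Under this model the catalog is simply the set of reference works plus everything that has been attached to them.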

Page 5: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 6: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 7: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 8: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Adding markup for a citation scheme

<div1 type="Book" n="1">

<milestone ed="p" n="1" unit="card"/>

<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>

<l n="2">Italiam, fato profugus, Laviniaque venit</l>

<l n="3">litora, multum ille et terris iactatus et alto</l>
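Once the citation scheme is in the markup, a "book.line" index can be derived mechanically. The sketch below assumes the fragment is closed with `</div1>` (the slide shows only the opening lines) and uses Python's standard `xml.etree.ElementTree`; the function name `citation_index` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed here so that it parses as well-formed XML.
TEI_FRAGMENT = """
<div1 type="Book" n="1">
  <milestone ed="p" n="1" unit="card"/>
  <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
  <l n="2">Italiam, fato profugus, Laviniaque venit</l>
  <l n="3">litora, multum ille et terris iactatus et alto</l>
</div1>
"""


def citation_index(xml_text: str) -> dict[str, str]:
    """Map "book.line" citations to the text of each line."""
    root = ET.fromstring(xml_text)
    book = root.get("n")
    return {
        f"{book}.{line.get('n')}": (line.text or "").strip()
        for line in root.iter("l")
    }


index = citation_index(TEI_FRAGMENT)
print(index["1.2"])
```

The same citations ("1.1", "1.2", ...) are what the passage component of a CTS URN would carry.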

Page 9: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 10: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 11: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 12: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 13: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin
Page 14: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Our ability to align texts is what makes our approach possible: a single version of Goethe’s Faust allows us to organize thousands of editions.

Page 15: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

urn:cts:greekLit:tlg0284.tlg052.perseus-grc1

urn:cts:latinLit:phi0474.phi052.opp-lat1

urn:cts:latinLit:stoa0255.stoa004

Page 16: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

These URNs allow us to represent any particular word in any version of any text. They allow us to represent our textual data (including annotations) as a very large RDF graph.

It's not a million book library but a billion word data set.
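To make "a billion word data set" concrete, the sketch below gives every word its own URN, using the CTS subreference form `@word[n]` (n counts repeated words within the passage), and emits plain subject-predicate-object tuples. The version URN is assumed for illustration, and the `ex:` predicates are hypothetical placeholders rather than a published vocabulary.

```python
import re
from collections import Counter

# Assumed version URN, used only for illustration.
VERSION = "urn:cts:latinLit:phi0690.phi003.perseus-lat1"


def word_urns(version: str, passage: str, text: str) -> list[str]:
    """One URN per word; repeated words get increasing [n] indices."""
    seen = Counter()
    urns = []
    for token in re.findall(r"\w+", text):
        seen[token] += 1
        urns.append(f"{version}:{passage}@{token}[{seen[token]}]")
    return urns


def triples(version: str, passage: str, text: str) -> list[tuple[str, str, str]]:
    """Subject-predicate-object tuples linking words, passage, and version."""
    passage_urn = f"{version}:{passage}"
    graph = [(passage_urn, "ex:partOf", version)]
    for position, urn in enumerate(word_urns(version, passage, text), start=1):
        graph.append((urn, "ex:partOf", passage_urn))
        graph.append((urn, "ex:position", str(position)))
    return graph


for triple in triples(VERSION, "1.1", "Arma virumque cano"):
    print(triple)
```

Annotations (translations, variant readings, quotations) can then be further triples whose subject or object is one of these word-level URNs.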

Page 17: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

urn:cts:greekLit:tlg0284.tlg052.perseus-grc1

urn:cts:latinLit:phi0474.phi052.opp-lat1

urn:cts:latinLit:stoa0255.stoa004

urn:cts = Canonical Text Services namespace

Page 18: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

urn:cts:greekLit:tlg0284.tlg052.perseus-grc1

urn:cts:latinLit:phi0474.phi052.opp-lat1

urn:cts:latinLit:stoa0255.stoa004

greekLit = Greek literature

latinLit = Latin literature

Page 19: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

urn:cts:greekLit:tlg0284.tlg052.perseus-grc1

urn:cts:latinLit:phi0474.phi052.opp-lat1

urn:cts:latinLit:stoa0255.stoa004

TextGroup = tlg0284

Following the Thesaurus Linguae Graecae, we assign 284 to Aelius Aristides

Page 20: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

urn:cts:greekLit:tlg0284.tlg052.perseus-grc1

urn:cts:latinLit:phi0474.phi052.opp-lat1

urn:cts:latinLit:stoa0255.stoa004

A TextGroup can define any useful collection:

* inscriptions from Ephesus

* the Homeric Hymns

Page 21: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Canonical Text Services URNs

urn:cts:greekLit:tlg0284.tlg052.perseus-grc1

urn:cts:latinLit:phi0474.phi052.opp-lat1

urn:cts:latinLit:stoa0255.stoa004

FRBR (Functional Requirements for Bibliographic Records) Works

tlg0284.tlg052 designates the Embassy of Achilles by Aelius Aristides
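The components walked through on the preceding slides (namespace, TextGroup, Work, version) can be pulled apart mechanically. This is a minimal sketch, not a full implementation of the CTS URN specification: the names `CtsUrn` and `parse_cts_urn` are hypothetical, and passage ranges and subreferences are not validated.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str            # e.g. "greekLit"
    text_group: str           # e.g. "tlg0284" (Aelius Aristides)
    work: Optional[str]       # e.g. "tlg052"
    version: Optional[str]    # e.g. "perseus-grc1"
    passage: Optional[str]    # e.g. "1.1", if present


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its colon- and dot-separated components."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    work_parts = parts[3].split(".")
    return CtsUrn(
        namespace=parts[2],
        text_group=work_parts[0],
        work=work_parts[1] if len(work_parts) > 1 else None,
        version=work_parts[2] if len(work_parts) > 2 else None,
        passage=parts[4] if len(parts) > 4 and parts[4] else None,
    )


for example in (
    "urn:cts:greekLit:tlg0284.tlg052.perseus-grc1",
    "urn:cts:latinLit:phi0474.phi052.opp-lat1",
    "urn:cts:latinLit:stoa0255.stoa004",
):
    print(parse_cts_urn(example))
```

The third URN has no version component: it names the work itself (the FRBR Work level), not any particular edition.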

Page 22: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Representing different versions

        OCT                   Loeb

1.41    confidere 1   same    confidere 1
1.41    propediem 1   sub     prope 1
1.41                  insert  diem 1
1.41    ipsum 1       same    ipsum 1
1.41    eos 1         same    eos 1

Page 23: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Representing different versions


We can pragmatically represent the differences between our reference text and all other versions
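The same / sub / insert rows of the OCT and Loeb comparison can be produced by a word-level diff. The sketch below uses Python's standard `difflib` as a stand-in; the slides do not say which alignment method the authors use. It assumes the five words shown are consecutive in passage 1.41, and it adds a "delete" operation for the symmetric case.

```python
from difflib import SequenceMatcher
from typing import Optional

Row = tuple[Optional[str], str, Optional[str]]


def word_diff(reference: list[str], other: list[str]) -> list[Row]:
    """Rows of (reference word, operation, other word)."""
    rows: list[Row] = []
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_words, other_words = reference[i1:i2], other[j1:j2]
        if tag == "equal":
            rows += [(w, "same", w) for w in ref_words]
            continue
        # Pair words up as substitutions; leftovers become deletes or inserts.
        paired = min(len(ref_words), len(other_words))
        rows += [(r, "sub", o) for r, o in zip(ref_words, other_words)]
        rows += [(r, "delete", None) for r in ref_words[paired:]]
        rows += [(None, "insert", o) for o in other_words[paired:]]
    return rows


oct_words = ["confidere", "propediem", "ipsum", "eos"]
loeb_words = ["confidere", "prope", "diem", "ipsum", "eos"]

for row in word_diff(oct_words, loeb_words):
    print(row)
```

Storing only the "sub", "insert", and "delete" rows is enough to regenerate the other version from the reference text.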

Page 24: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Representing different versions


The reference text does not have to be the best text. It does not even have to be perfect. It organizes all other texts, even with noise.

Page 25: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Conclusions

We are developing the Perseus Corpus of Greek Texts (c. 20 million words of Greek and Latin)

* Based on texts in Perseus

* FRBR metadata from the Perseus Catalog

* Revised XML brought in line with CTS and with the EpiDoc subset of TEI XML

* Offers an extended “TEI by example”

Page 26: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin

Conclusions

We are preparing for a Leipzig Corpus

* This would be a superset of the Perseus Corpus

* Ideally much larger

* Initial work will include an additional 20 million words of primarily later Greek and Latin