Linking Etymological Database: A case study in Germanic Christian Chiarcos, Maria Sukhareva Goethe University Frankfurt am Main LDL – 2014, LREC Reykjavik, Iceland 27th May 2014


Page 1: Linking Etymological Database : A case  study  in  Germanic

Linking Etymological Database: A case study in Germanic

Christian Chiarcos, Maria Sukhareva
Goethe University Frankfurt am Main

LDL – 2014, LREC
Reykjavik, Iceland

27th May 2014

Page 2: Linking Etymological Database : A case  study  in  Germanic

Overview

1. Background
2. Linked Etymological Dictionaries
3. Enriching Linked Etymological Dictionaries
4. Application
5. Conclusion

Page 3: Linking Etymological Database : A case  study  in  Germanic

Background

Page 4: Linking Etymological Database : A case  study  in  Germanic

Background

Processing of Old Germanic languages at Goethe University Frankfurt, in collaboration between:

1. Empirical Linguistics, Thesaurus of Indo-European Text and Language Materials (TITUS)
2. ACoLi Lab (Applied Computational Linguistics)
3. LOEWE Cluster “Digital Humanities”
4. DFG-funded Old German Reference Corpus (DDD)

Page 5: Linking Etymological Database : A case  study  in  Germanic

Linked Etymological Data

Page 6: Linking Etymological Database : A case  study  in  Germanic

Linked Etymological Data

Page 7: Linking Etymological Database : A case  study  in  Germanic

Linked Etymological Data

Conversion of etymological dictionaries to RDF

• Linkability: representation of relations within and beyond lexicons
• Interoperability: (meta)data representation through community-maintained vocabularies (lexvo, Glottolog, OLiA, lemon)
• Inference: filling the logical gaps of the original XML representation
  – Symmetric closure of cross-references

Page 8: Linking Etymological Database : A case  study  in  Germanic

Linked Etymological Data

• lemonet:translates: a relation between lemon:LexicalEntry resources
• lemonet:etym: etymological links between languages; transitive and symmetric; a subproperty of lemon:lexicalVariant
• All language identifiers were mapped from the original abbreviations to ISO 639-3 codes wherever possible.
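
The identifier mapping mentioned above could look roughly like the following sketch. The abbreviations on the left are assumptions chosen for illustration (the slides do not list the dictionaries' actual abbreviations); the ISO 639-3 codes on the right are the registered codes for these languages. `to_iso` is a hypothetical helper name.

```python
# Illustrative mapping only: the abbreviation keys are assumed, not taken
# from the source dictionaries. The values are registered ISO 639-3 codes.
ISO_639_3 = {
    "as.": "osx",   # Old Saxon
    "ahd.": "goh",  # Old High German
    "got.": "got",  # Gothic
    "ae.": "ang",   # Old English
    "an.": "non",   # Old Norse
}


def to_iso(abbreviation):
    """Return the ISO 639-3 code for an abbreviation, or None if unmapped."""
    return ISO_639_3.get(abbreviation.strip().lower())


print(to_iso("ahd."))   # a mapped abbreviation
print(to_iso("xyz."))   # an unmapped abbreviation yields None
```

Returning `None` for unmapped abbreviations mirrors the "wherever possible" caveat: identifiers without a code are left for manual treatment rather than guessed.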

Page 9: Linking Etymological Database : A case  study  in  Germanic

Linked Etymological Data

• Original XML (lemma)
• RDF Triples
• Symmetric closure of etymological relations generated by SPARQL pattern
• Links to external resources
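
As a rough illustration of the symmetric closure step, the sketch below works on plain Python tuples instead of an RDF store. The entry identifiers (`ex:os_dag`, `ex:ohg_tag`, `ex:de_Tag`) are hypothetical, and the function is not the authors' SPARQL pattern; in SPARQL 1.1 Update, a comparable effect could presumably be obtained with something like `INSERT { ?b lemonet:etym ?a } WHERE { ?a lemonet:etym ?b }`.

```python
ETYM = "lemonet:etym"


def symmetric_closure(triples, predicate=ETYM):
    """Return the input triples plus the inverse of each `predicate` triple."""
    closed = set(triples)
    for subj, pred, obj in triples:
        if pred == predicate:
            closed.add((obj, pred, subj))
    return closed


# Hypothetical toy data: one etymological link and one translation link.
triples = {
    ("ex:os_dag", ETYM, "ex:ohg_tag"),
    ("ex:os_dag", "lemonet:translates", "ex:de_Tag"),
}
closed = symmetric_closure(triples)
for triple in sorted(closed):
    print(triple)
```

Only `lemonet:etym` is closed here, since the slides describe that property as symmetric; the translation link is left directional.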

Page 10: Linking Etymological Database : A case  study  in  Germanic

Enriching Etymological Dictionaries

Page 11: Linking Etymological Database : A case  study  in  Germanic

Enriching Etymological Dictionaries

Germanic parallel Bible corpus

(Parentheses indicate marginal fragments with fewer than 50,000 tokens.)

Page 12: Linking Etymological Database : A case  study  in  Germanic

Enriching Etymological Dictionaries

1. Statistical word alignment of parallel texts (GIZA++)
2. Lexical translation tables as basis for the extracted word lists:
   • Unidirectional: maximum of P(w_t | w_s)
   • Bidirectional: maximum of P(w_t | w_s) P(w_s | w_t)
3. Pruning by frequency
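
The selection steps above can be sketched as follows, assuming the lexical translation tables have already been loaded into nested dictionaries. The probabilities and counts are invented toy values, the function names are hypothetical, and the pruning rule (a minimum source-word count) is an assumption, since the slides do not state the exact criterion.

```python
from collections import Counter


def best_unidirectional(p_t_given_s):
    """For each source word, pick the target maximizing P(w_t | w_s)."""
    return {s: max(c, key=c.get) for s, c in p_t_given_s.items() if c}


def best_bidirectional(p_t_given_s, p_s_given_t):
    """For each source word, pick the target maximizing P(w_t|w_s) * P(w_s|w_t)."""
    pairs = {}
    for s, cands in p_t_given_s.items():
        scored = {t: p * p_s_given_t.get(t, {}).get(s, 0.0)
                  for t, p in cands.items()}
        if scored:
            pairs[s] = max(scored, key=scored.get)
    return pairs


def prune_by_frequency(pairs, source_counts, min_count=2):
    """Keep only pairs whose source word occurs at least `min_count` times."""
    return {s: t for s, t in pairs.items() if source_counts[s] >= min_count}


# Toy tables (invented numbers): source = Old Saxon, target = Old High German.
p_t_given_s = {"dag": {"ther": 0.5, "tag": 0.4}, "god": {"got": 0.9}}
p_s_given_t = {"ther": {"dag": 0.05}, "tag": {"dag": 0.9}, "got": {"god": 0.95}}
counts = Counter({"dag": 12, "god": 1})

uni = best_unidirectional(p_t_given_s)
bi = best_bidirectional(p_t_given_s, p_s_given_t)
pruned = prune_by_frequency(bi, counts)
print(uni, bi, pruned)
```

In the toy data, the unidirectional criterion pairs "dag" with the frequent function word "ther", while the bidirectional product prefers "tag"; this is the kind of noise the second criterion is meant to reduce.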

Page 13: Linking Etymological Database : A case  study  in  Germanic

Application

Page 14: Linking Etymological Database : A case  study  in  Germanic

Application

Thematic alignment of Bible paraphrases

– E.g., cross-references within the Bible and between the Bible and gospel harmonies

• An interlinked index of thematically similar sections in the gospels and the Old Saxon (OS) and Old High German (OHG) gospel harmonies
  – The section-level alignment of the OS Heliand and the OHG Tatian (Sievers, 1872) has been digitized
  – 4560 inter-text groups based on the Eusebian canon
• Basis for a more fine-grained level of alignment

Page 15: Linking Etymological Database : A case  study  in  Germanic

Application

Similarity metrics δ(w_OS; w_OHG) for every OS word w_OS and its potential OHG cognate w_OHG

Character-based similarity measures:

– GEOMETRY: δ = difference between the relative positions of w_OS and w_OHG
– IDENTITY: δ(w_OS; w_OHG) = 1 iff w_OHG = w_OS (0 otherwise)
– ORTHOGRAPHY: relative Levenshtein distance and statistical character replacement probability (Neubig et al., 2012)
– NORMALIZATION: norm(w_OS; w_OHG) = δ(w'_OS; w_OHG), with w'_OS being the OHG ‘normalization’ of w_OS (Bollmann et al., 2011)
– COOCCURRENCES: δ(w_OS; w_OHG) = P(w_OS | w_OHG) P(w_OHG | w_OS)

Lexicon-based similarity measures:

δ_lex(w_OS; w_OHG) = 1 iff w_OHG ∈ W (0 otherwise), where W is a set of possible OHG translations for w_OS suggested by a lexicon, i.e., either:

- ETYM: etymological link in (the symmetric closure of) the etymological dictionaries,
- ETYM-INDIRECT: shared German gloss in the etymological dictionaries,
- TRANSLATIONAL DIRECT: link in the translational dictionaries,
- TRANSLATIONAL INDIRECT: indirectly linked in the translational dictionaries through a third language.
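
A minimal sketch of three of these measures (IDENTITY, the Levenshtein part of ORTHOGRAPHY, and a lexicon-based δ_lex) is given below. Two points are assumptions rather than details from the slides: the "relative" distance is taken to mean edit distance divided by the length of the longer word, and the lexicon is represented as a dictionary from OS words to sets of OHG candidates. The word pairs are toy examples.

```python
def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def identity(w_os, w_ohg):
    """IDENTITY: 1 iff both word forms are the same string, else 0."""
    return 1 if w_os == w_ohg else 0


def relative_levenshtein(w_os, w_ohg):
    """Edit distance normalized by the longer word (assumed normalization)."""
    longest = max(len(w_os), len(w_ohg))
    return levenshtein(w_os, w_ohg) / longest if longest else 0.0


def lexicon_match(w_os, w_ohg, lexicon):
    """delta_lex: 1 iff w_ohg is in the candidate set W the lexicon gives for w_os."""
    return 1 if w_ohg in lexicon.get(w_os, set()) else 0


# Toy lexicon, standing in for e.g. the symmetric closure of etymological links.
etym = {"dag": {"tag"}}
print(identity("god", "god"), relative_levenshtein("dag", "tag"),
      lexicon_match("dag", "tag", etym))
```

The same `lexicon_match` function would cover all four lexicon-based variants; only the way the candidate sets are built (direct or indirect, etymological or translational) differs.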


Page 17: Linking Etymological Database : A case  study  in  Germanic

Conclusion & Discussion

Page 18: Linking Etymological Database : A case  study  in  Germanic

Conclusion

1. Application of the Linked Data paradigm to the modeling of etymological dictionaries
2. Adoption of the lemon core model
3. Representation of Köbler’s dictionary in a machine-readable format
4. Enrichment of etymological dictionaries with automatically obtained translation pairs
5. Initial experiment on the use of dictionaries for quasi-parallel alignment

Page 19: Linking Etymological Database : A case  study  in  Germanic

lemon & etymology: A square peg for a round hole?

lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD.

Page 20: Linking Etymological Database : A case  study  in  Germanic

lemon & etymology: A square peg for a round hole?

lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD.

… but many of these resources are created by (or for) linguists rather than ontologists.

The original motivation for lemon was to lexicalize ontologies. Quite a different problem from the interoperability issues that linguists are trying to solve by using it.

Page 21: Linking Etymological Database : A case  study  in  Germanic

lemon & etymology: A square peg for a round hole?

lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD.

But obviously, our usage of lemon is slightly abusive.

1. Etymological and translational links between word forms?
2. No external ontology to ground senses?
3. No word senses at all?

But that is symptomatic of linguistic resources in a strict sense.

• Similar problems were observed by Cysouw & Moran on multilingual dictionaries for South American indigenous languages.

Page 22: Linking Etymological Database : A case  study  in  Germanic

lemon & etymology: A square peg for a round hole?

lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD.

But obviously, our usage of lemon is slightly abusive.

1. Etymological and translational links between word forms?
2. No external ontology to ground senses?
3. No word senses at all?

But that is symptomatic of linguistic resources in a strict sense.

What can we do about this state of affairs?

• Would there have been alternative ways to model our data?
• Shall we extend/abandon/replace/adjust lemon?

Page 23: Linking Etymological Database : A case  study  in  Germanic

Takk fyrir! (Thank you!)