
metaDictionary – Towards a Generic e-Infrastructure for Detecting Variance in Language by Exploiting Dictionary Information

Dietmar Seipel and Werner Wegstein

University of Würzburg, Computer Science / Digital Humanities

ISGC 2011 – Taipei, 23.03.2011

Dietmar Seipel and Werner Wegstein metaDictionary – Variance in Language


1 Variance in Language and Genome
    The metaDictionary
    Network Analysis of Morpheme Decompositions

2 Annotating Digitized Print Dictionaries
    Annotation in TEI
    Grammar-Based Parsing

3 Annotating Morpheme Decompositions
    Annotation Rules
    The Morpheme Annotation Tool


Variance in Language and Genome

Project goals:
    development of a metaDictionary
    analysis of morpheme decomposition networks
    comparison with structural properties of genomes

The project is funded in a BMBF framework focussing on interdependencies.


Variance in Space and Time

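[Figure: dictionaries and word forms arranged along a time axis]

Time levels: ahd (750–1050), mhd (1050–1350), frnhd (1350–1650), nhd (1650–)

Dictionaries: ahdwb, lexer, dwb, lothrwb, luxemb, wdg

Word forms: gabala, gabel(e), gabel, Gawel, Gafel, Gabel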
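As a minimal sketch, the time axis of this slide can be read as a lookup from a year to a time level. The boundaries and labels are the ones shown on the slide; the function name `time_level` and the convention that each period includes its start year are assumptions made for illustration.

```python
from bisect import bisect_right

# Period boundaries and labels as shown on the time axis of the slide.
BOUNDARIES = [750, 1050, 1350, 1650]
LEVELS = ["ahd", "mhd", "frnhd", "nhd"]


def time_level(year):
    """Return the time level for a year, or None for years before 750.

    Assumption: each period includes its start year and excludes its end year.
    """
    index = bisect_right(BOUNDARIES, year) - 1
    return LEVELS[index] if index >= 0 else None


print(time_level(800))   # ahd
print(time_level(1200))  # mhd
print(time_level(1650))  # nhd
```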


The metaLemma "Gabel" (Fork)


Network Analysis of Morpheme Decompositions


Network of Digitized Print Dictionaries

German dictionaries (from old to present-day language, including varieties such as regional dialects) are annotated in TEI P5

the fine-grained annotation makes detailed additional analyses possible

data sources:
    Lexer
    Grimm
    Adelung
    Campe
    Luxemb., Lothr.
    WDG


Network of Digitized Print Dictionaries – Trier


Entry of the Adelung Dictionary


Fine Grain Structuring of the Entry

Der Aal, des –es, Mz. die –e,

Verkleinerungswort, das Älchen,

des –s, b. Mz. w. b. Ez.

1) Ein langer, runder ... Fisch ...

2) Ein Backwerk aus Butterteig ...

3) Die fal=schen Brüche, ...


Annotation in TEI P5 (Text Encoding Initiative)

Der Aal, ...

<entry xml:id="cwds1_00005_aal">
  <form type="lemma">
    <gramGrp>
      <pos value="noun"/>
      <gen value="m"/>
    </gramGrp>
    <form type="determiner">Der</form>
    <form type="headword">Aal</form>
    <pc>,</pc>
  </form>
  ...
  <sense> ... </sense>
</entry>
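To illustrate what the fine-grained annotation makes possible, the following sketch reads such an entry with Python's standard `xml.etree.ElementTree` module and pulls out the headword and its grammatical information. The XML string is a simplified, complete version of the slide's fragment: the elided parts are left out and the sense content is a placeholder. The function name `summarize_entry` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified version of the entry on the slide. The parts elided there ("...")
# are omitted, and the sense text is a placeholder, not dictionary content.
TEI_ENTRY = """
<entry xml:id="cwds1_00005_aal">
  <form type="lemma">
    <gramGrp><pos value="noun"/><gen value="m"/></gramGrp>
    <form type="determiner">Der</form>
    <form type="headword">Aal</form>
    <pc>,</pc>
  </form>
  <sense>placeholder</sense>
</entry>
"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize_entry(xml_text):
    """Extract id, headword, part of speech and gender from one entry."""
    entry = ET.fromstring(xml_text.strip())
    lemma = entry.find("form[@type='lemma']")
    return {
        "id": entry.get(XML_ID),
        "headword": lemma.find("form[@type='headword']").text,
        "pos": lemma.find("gramGrp/pos").get("value"),
        "gen": lemma.find("gramGrp/gen").get("value"),
    }


print(summarize_entry(TEI_ENTRY))
```

The same path expressions would apply to every entry that follows this structure, which is the practical benefit of annotating at this level of detail.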


Extended Definite Clause Grammars

entry ===>
    form:[type:lemma], ..., sense.

form:[type:lemma] ===>
    sequence(*, form:[type:determiner]),
    form:[type:headword].

sense ===> ...

The call sequence(*, form:[type:determiner]) generates a sequence of zero or more form elements.
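The idea behind the lemma rule can be sketched in Python: parse zero or more determiners, then a headword, and build the XML elements directly while parsing. This is not the authors' DCG extension, only an analogy. The determiner word list, the function names, and the greedy strategy without backtracking (a Prolog DCG would backtrack) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption for illustration: a token counts as a determiner if it is one
# of these words. The real grammar's token classification is not shown.
DETERMINERS = {"Der", "Die", "Das"}


def sequence_star(tokens, pos, parse_one):
    """Apply parse_one zero or more times; return (elements, new position)."""
    elements = []
    while True:
        result = parse_one(tokens, pos)
        if result is None:
            return elements, pos
        element, pos = result
        elements.append(element)


def determiner(tokens, pos):
    """Parse one determiner token into a form element, or return None."""
    if pos < len(tokens) and tokens[pos] in DETERMINERS:
        element = ET.Element("form", type="determiner")
        element.text = tokens[pos]
        return element, pos + 1
    return None


def headword(tokens, pos):
    """Parse the next token as the headword, or return None at end of input."""
    if pos < len(tokens):
        element = ET.Element("form", type="headword")
        element.text = tokens[pos]
        return element, pos + 1
    return None


def lemma(tokens, pos=0):
    """Mirror of: form:[type:lemma] ===> sequence(*, determiner), headword."""
    determiners, pos = sequence_star(tokens, pos, determiner)
    result = headword(tokens, pos)
    if result is None:
        return None  # greedy: no backtracking into the determiner sequence
    head, pos = result
    element = ET.Element("form", type="lemma")
    element.extend(determiners + [head])
    return element, pos


element, _ = lemma(["Der", "Aal"])
print(ET.tostring(element, encoding="unicode"))
```

As in the grammar on the slide, the structure of the parser and the structure of the generated XML coincide, so no separate tree-building step is needed.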


Techniques from Computer Science

Grammars
    higher precision compared to regular expressions and statistical parsers
    we use a DCG (definite clause grammar) extension, which is even more compact and directly generates XML

XML is a common data format for modelling, managing, and exchanging semi-structured data.

There exist powerful query, transformation, and update languages for XML.


Declarative Languages

Examples
    SQL (relational databases)
    XQUERY, XSLT (XML processing)
    PROLOG (programming)
    rules (decision support systems, grammars)

Advantages
    compact, rapidly programmable
    clear, less error-prone
    flexibly extensible


Annotating Morpheme Decompositions

... based on Whole Word Morphology
extension by alignment methods
morpheme decomposition:

morpheme term: ((craft + s) + man) + ship
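A morpheme term of this shape is a binary tree, which can be sketched with nested Python tuples. The representation and the helper names `surface` and `show` are assumptions for illustration, not the project's data model.

```python
# A morpheme term is either a string (one morpheme) or a pair (left, right),
# mirroring the binary "+" in the slide's notation.
TERM = ((("craft", "s"), "man"), "ship")


def surface(term):
    """Concatenate the morphemes of a term into the word form."""
    if isinstance(term, str):
        return term
    left, right = term
    return surface(left) + surface(right)


def show(term, top=True):
    """Render a term in the slide's notation, without outer parentheses."""
    if isinstance(term, str):
        return term
    left, right = term
    text = f"{show(left, False)} + {show(right, False)}"
    return text if top else f"({text})"


print(surface(TERM))  # craftsmanship
print(show(TERM))     # ((craft + s) + man) + ship
```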


System Architecture

For decomposing and annotating the large number of entries of a dictionary (which can exceed 100,000), one needs

    linguistic knowledge and

    suitable tools from computer science:
        morpheme decomposer,
        suitable, compact knowledge representation,
        inference methods,
        graphical user interface.

Fine-grained annotated dictionaries are the basis for the decomposition.


System Architecture

[Architecture diagram. Recoverable labels: OWL Term Notation, Annotation Rules, Morpheme Analyses Visualisation, Protégé, Morfessor]


Annotation Rules

With the annotation rule (in logic)

    has_word_class(X, noun) :-
        mc(X, A, B),
        has_word_class(A, noun),
        has_text_form(B, [ship, ...]).

the partially annotated term

    ((craft*bm + s*ge) + man)*noun + ship

can be further annotated to

    (((craft*bm + s*ge) + man)*noun + ship)*noun
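The effect of this rule can be sketched in Python on a small tree structure. The classes `Morpheme` and `Compound`, the function `apply_noun_rule`, and the reading of `has_text_form(B, [ship, ...])` as "the text form of B is one of the listed suffixes" are assumptions for illustration. The slide elides the rest of the suffix list, so only `ship` appears here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Morpheme:
    text: str
    tag: Optional[str] = None  # annotation such as "bm" or "ge" on the slide


@dataclass
class Compound:  # plays the role of mc(X, A, B): X consists of A and B
    left: Union["Compound", Morpheme]
    right: Union["Compound", Morpheme]
    tag: Optional[str] = None


# The slide's list is elided ("[ship, ...]"); only "ship" is known here.
NOUN_SUFFIXES = {"ship"}


def apply_noun_rule(term):
    """Tag term as noun if it is A + B, A is a noun, and B is a listed suffix."""
    if (
        isinstance(term, Compound)
        and term.tag is None
        and term.left.tag == "noun"
        and isinstance(term.right, Morpheme)
        and term.right.text in NOUN_SUFFIXES
    ):
        term.tag = "noun"
        return True
    return False


# ((craft*bm + s*ge) + man)*noun + ship
craftsman = Compound(
    Compound(Morpheme("craft", "bm"), Morpheme("s", "ge")),
    Morpheme("man"),
    tag="noun",
)
craftsmanship = Compound(craftsman, Morpheme("ship"))

print(apply_noun_rule(craftsmanship), craftsmanship.tag)
```

A rule engine would apply such rules repeatedly until no further annotation can be derived; the sketch shows a single application only.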


The Morpheme Annotation Tool


Conclusions

The metaDictionary forms the core part of a generic e-infrastructure:
    derived from analysis of a network of dictionaries
    annotated morpheme decompositions yield a more precise alignment for the metaDictionary

The next step will be to test the data using text corpora:
    basic morphemes
    combinations of basic morphemes

Culturomics (Michel et al., Science 2011): 52% of the English lexicon, the majority of the words used in English books, consists of lexical dark matter undocumented in standard references.
