Variance in Language and Genome
Annotating Digitized Print Dictionaries
Annotating Morpheme Decompositions
metaDictionary – Towards a Generic e–Infrastructure for Detecting Variance in Language by Exploiting Dictionary Information
Dietmar Seipel and Werner Wegstein
University of Würzburg, Computer Science / Digital Humanities
ISGC 2011 – Taipei, 23.03.2011
Dietmar Seipel and Werner Wegstein metaDictionary – Variance in Language
1 Variance in Language and Genome
    The metaDictionary
    Network Analysis of Morpheme Decompositions
2 Annotating Digitized Print Dictionaries
    Annotation in TEI
    Grammar–Based Parsing
3 Annotating Morpheme Decompositions
    Annotation Rules
    The Morpheme Annotation Tool
Variance in Language and Genome
Project goals:
- development of a metaDictionary
- analysis of morpheme decomposition networks
- comparison with structural properties of genomes
The project is funded within a BMBF framework focusing on interdependencies.
Variance in Space and Time
[Figure: time levels, dictionaries, and attested word forms]
Time levels: ahd (750–), mhd (1050–), frnhd (1350–), nhd (1650–)
Dictionaries: ahdwb, lexer, dwb, lothrwb, luxemb, wdg
Word forms: gabala, gabel(e), gabel, Gawel, Gafel, Gabel
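The time levels above can be read as year ranges that each start at the listed year. The following is a minimal Python sketch under that reading. The function name `time_level` is hypothetical, and the choice that a boundary year belongs to the later period is an assumption, not something the slide states.

```python
from bisect import bisect_right

# Start years and labels of the time levels, as listed on the slide.
STARTS = [750, 1050, 1350, 1650]
LABELS = ["ahd", "mhd", "frnhd", "nhd"]


def time_level(year):
    """Return the time level for a year, or None for years before 750.

    Assumption: a boundary year (e.g. 1050) counts as the later period.
    """
    idx = bisect_right(STARTS, year) - 1
    return LABELS[idx] if idx >= 0 else None


if __name__ == "__main__":
    for y in (800, 1200, 1500, 2011):
        print(y, time_level(y))
```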
The metaLemma "Gabel" (Fork)
Network Analysis of Morpheme Decompositions
Network of Digitized Print Dictionaries
- German dictionaries (from old to present-day language, including varieties such as regional dialects) are annotated in TEI P5
- the fine-grained annotation makes detailed additional analyses possible
- data sources: Lexer, Grimm, Adelung, Campe, Luxemb., Lothr., WDG
Network of Digitized Print Dictionaries – Trier
Entry of the Adelung Dictionary
Fine Grain Structuring of the Entry
Der Aal, des –es, Mz. die –e,
Verkleinerungswort, das Älchen,
des –s, b. Mz. w. b. Ez.
1) Ein langer, runder ... Fisch ...
2) Ein Backwerk aus Butterteig ...
3) Die falschen Brüche, ...
Annotation in TEI P5 (Text Encoding Initiative)
Der Aal, ...
<entry xml:id="cwds1_00005_aal">
  <form type="lemma">
    <gramGrp>
      <pos value="noun"/>
      <gen value="m"/>
    </gramGrp>
    <form type="determiner">Der</form>
    <form type="headword">Aal</form>
    <pc>,</pc>
  </form>
  ...
  <sense> ... </sense>
</entry>
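To illustrate what the fine-grained annotation makes available, here is a minimal Python sketch that reads the entry above with the standard library and extracts the annotated fields. It assumes the fragment carries no TEI namespace (as on the slide), omits the elided parts, and keeps `<sense>` empty as a placeholder. The function name `summarize_entry` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The entry from the slide. Elided parts ("...") are left out and the
# <sense> element is an empty placeholder.
TEI_ENTRY = """
<entry xml:id="cwds1_00005_aal">
  <form type="lemma">
    <gramGrp><pos value="noun"/><gen value="m"/></gramGrp>
    <form type="determiner">Der</form>
    <form type="headword">Aal</form>
    <pc>,</pc>
  </form>
  <sense/>
</entry>
"""

# ElementTree expands the predefined "xml" prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize_entry(xml_text):
    """Collect id, headword, determiner, part of speech and gender."""
    entry = ET.fromstring(xml_text)
    lemma = entry.find("form[@type='lemma']")
    return {
        "id": entry.get(XML_ID),
        "headword": lemma.findtext("form[@type='headword']"),
        "determiner": lemma.findtext("form[@type='determiner']"),
        "pos": lemma.find("gramGrp/pos").get("value"),
        "gender": lemma.find("gramGrp/gen").get("value"),
    }


if __name__ == "__main__":
    print(summarize_entry(TEI_ENTRY))
```

A real TEI P5 document would normally declare the TEI namespace, in which case the element paths would need namespace-qualified names.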
Extended Definite Clause Grammars
entry ===>
    form:[type:lemma], ..., sense.
form:[type:lemma] ===>
    sequence(*, form:[type:determiner]),
    form:[type:headword].
sense ===> ...
The call sequence(*, form:[type:determiner]) generates a sequence of zero or more form elements.
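The idea behind the second rule, a grammar rule that consumes tokens and emits XML at the same time, can be imitated outside Prolog. The following Python sketch is not the authors' extended DCG; it hand-codes only the rule for form:[type:lemma], with a loop standing in for sequence(*, ...). The determiner list and the function name `parse_lemma_form` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed determiner list, for illustration only.
DETERMINERS = {"Der", "Die", "Das"}


def parse_lemma_form(tokens):
    """Parse tokens as form:[type:lemma]; return (XML element, rest)."""
    lemma = ET.Element("form", {"type": "lemma"})
    pos = 0
    # sequence(*, form:[type:determiner]): zero or more determiners.
    while pos < len(tokens) and tokens[pos] in DETERMINERS:
        det = ET.SubElement(lemma, "form", {"type": "determiner"})
        det.text = tokens[pos]
        pos += 1
    # form:[type:headword]: exactly one headword must follow.
    if pos >= len(tokens):
        raise ValueError("headword expected")
    head = ET.SubElement(lemma, "form", {"type": "headword"})
    head.text = tokens[pos]
    return lemma, tokens[pos + 1:]


if __name__ == "__main__":
    element, rest = parse_lemma_form(["Der", "Aal", ","])
    print(ET.tostring(element, encoding="unicode"))
    print(rest)
```

Unlike a Prolog grammar, this sketch does not backtrack, so it only suits rules where the next token decides which alternative applies.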
Techniques from Computer Science
Grammars
- higher precision compared to regular expressions and statistical parsers
- we use a DCG (definite clause grammar) extension, which is even more compact and directly generates XML
XML is a common data format for modelling, managing, and exchanging semi–structured data.
There exist powerful query, transformation and update languages for XML.
Declarative Languages
Examples
- SQL (relational databases)
- XQuery, XSLT (XML processing)
- Prolog (programming)
- rules (decision support systems, grammars)
Advantages
- compact, rapidly programmable
- clear, less error–prone
- flexibly extensible
Annotating Morpheme Decompositions
... based on Whole Word Morphology
- extension by alignment methods
- morpheme decomposition:
morpheme term: ((craft + s) + man) + ship
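A morpheme term of this shape is a binary tree. As a minimal sketch, assuming nested Python tuples as the representation (the slide does not specify one), the term can be flattened to its surface form or to its list of morphemes. All function names here are hypothetical.

```python
# A morpheme term is either a string (a single morpheme) or a pair
# (left, right) standing for "left + right".
TERM = ((("craft", "s"), "man"), "ship")


def surface_form(term):
    """Concatenate all morphemes of the term from left to right."""
    if isinstance(term, str):
        return term
    left, right = term
    return surface_form(left) + surface_form(right)


def morphemes(term):
    """List the morphemes (the leaves of the tree) from left to right."""
    if isinstance(term, str):
        return [term]
    left, right = term
    return morphemes(left) + morphemes(right)


def to_text(term, top=True):
    """Render the term with "+"; the outermost pair gets no parentheses."""
    if isinstance(term, str):
        return term
    left, right = term
    text = f"{to_text(left, False)} + {to_text(right, False)}"
    return text if top else f"({text})"


if __name__ == "__main__":
    print(to_text(TERM))
    print(surface_form(TERM))
    print(morphemes(TERM))
```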
System Architecture
For decomposing and annotating the large number of entries of a dictionary (which can exceed 100,000), one needs
- linguistic knowledge and
- suitable tools from computer science:
  - morpheme decomposer,
  - suitable, compact knowledge representation,
  - inference methods,
  - graphical user interface.
Fine-grained annotated dictionaries are the basis for the decomposition.
System Architecture
[Figure: architecture diagram with the labels OWL Term Notation, Annotation Rules, Morpheme Analyses, Visualisation, Protege, Morfessor]
Annotation Rules
With the annotation rule (in logic)
    has_word_class(X, noun) :-
        mc(X, A, B),
        has_word_class(A, noun),
        has_text_form(B, [ship, ...]).
the partially annotated term
    ((craft*bm + s*ge) + man)*noun + ship
can be further annotated to
    (((craft*bm + s*ge) + man)*noun + ship)*noun
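The effect of this rule can be traced with a small Python sketch. It rests on several assumptions: mc(X, A, B) is read as "X is composed of A and B", has_text_form(B, List) is read as "the text of B is in List", and the suffix list contains only "ship" because the slide elides the rest. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Morph:
    """A single morpheme with an optional annotation (e.g. bm, ge)."""
    text: str
    tag: Optional[str] = None


@dataclass
class Comp:
    """A composition "left + right" with an optional word class."""
    left: "Term"
    right: "Term"
    tag: Optional[str] = None


Term = Union[Morph, Comp]

# The slide gives "[ship, ...]"; only "ship" is known, the rest is elided.
NOUN_SUFFIXES = {"ship"}


def show(term, top=True):
    """Render a term in the slide's notation, e.g. (a*bm + b)*noun."""
    if isinstance(term, Morph):
        body = term.text
    else:
        body = f"{show(term.left, False)} + {show(term.right, False)}"
        if term.tag or not top:
            body = f"({body})"
    return f"{body}*{term.tag}" if term.tag else body


def apply_noun_rule(term):
    """Tag X as noun if X = A + B, A is a noun, and B is a listed suffix."""
    if (isinstance(term, Comp) and term.tag is None
            and term.left.tag == "noun"
            and isinstance(term.right, Morph)
            and term.right.text in NOUN_SUFFIXES):
        term.tag = "noun"
    return term


if __name__ == "__main__":
    craftsman = Comp(Comp(Morph("craft", "bm"), Morph("s", "ge")),
                     Morph("man"), "noun")
    word = Comp(craftsman, Morph("ship"))
    print(show(word))
    print(show(apply_noun_rule(word)))
```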
The Morpheme Annotation Tool
Conclusions
The metaDictionary forms the core part of a generic e–infrastructure:
- derived from analysis of a network of dictionaries
- annotated morpheme decompositions yield a more precise alignment for the metaDictionary
The next step will be to test the data using text corpora:
- basic morphemes
- combinations of basic morphemes
Culturomics (Michel et al., Science 2011): 52% of the English lexicon – the majority of the words used in English books – consists of lexical dark matter undocumented in standard references.