Types and Tokens Distribution in TITUS
Distribution of Word Forms in the TITUS Corpus

Dr. Svetlana Ahlborn
Institut für Empirische Sprachwissenschaft
Universität Frankfurt am Main
E-Mail: [email protected]

26.06.2013

Tokens and Types Distribution in TITUS

Outline

• TITUS Resource Data
• Peculiarities of TITUS texts
• Tokens and Types calculation in TITUS Resources
• Metadata for Tokens and Types distribution

Corpus Linguistics 2013

TITUS Resource Data

• TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien): http://titus.uni-frankfurt.de
• A token is a concrete occurrence of a linguistic unit; a type bundles the tokens that are associated with each other.
• TITUS currently includes 660 texts in 55 languages, with more than 30 million tokens.
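As a rough illustration of the token/type distinction defined above, the following Python sketch counts both for a short string. Splitting on whitespace and lowercasing are simplifying assumptions made here, not the TITUS procedure, and the function name and sample sentence are invented for the example.

```python
from collections import Counter

def count_tokens_and_types(text: str) -> tuple:
    """Return (tokens, types) for whitespace-separated, lowercased word forms."""
    words = text.lower().split()
    frequencies = Counter(words)  # one key per distinct word form (type)
    return len(words), len(frequencies)

# Invented sample sentence, used only to show the difference between the counts.
sample = "in the beginning was the word and the word was with god"
tokens, types = count_tokens_and_types(sample)
print(f"{tokens} tokens, {types} types")
```

Every running word adds one token, while a repeated word form such as "the" adds only one type, so the type count can never exceed the token count.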

TITUS Data

http://www.clarin.eu/node/1512 (added by J. Gippert, R. Mittmann)

TITUS Search Engine

• The TITUS Search Engine does not determine the number of tokens in a given text; it determines the number of quotations of the queried word.

Peculiarities of TITUS texts: Gothic

• The Biblia Gothica contains additional parallel passages in Latin and Greek.

Biblia Gothica (http://titus.uni-frankfurt.de/texte/etcs/germ/got/gotnt/gotnt.htm).

Peculiarities of TITUS texts: Old Church Slavonic

• Old Church Slavonic texts are presented in two scripts: the Glagolitic alphabet (the original form of the text) and Cyrillic.

Codex Marianus (http://titus.uni-frankfurt.de/texte/etcs/slav/aksl/marianus/maria.htm).
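When the same text is stored in two scripts, token and type counts only make sense per script. One possible way to keep the two renderings apart is sketched below in Python. It assumes the text is available as Unicode and classifies tokens by Unicode character names; the function name and the sample strings are hypothetical, and the slides do not say how TITUS itself handles this.

```python
import unicodedata

def script_of(token: str) -> str:
    """Classify a token as 'glagolitic', 'cyrillic', 'mixed' or 'other',
    based on the Unicode names of its alphabetic characters."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("GLAGOLITIC"):
            scripts.add("glagolitic")
        elif name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        else:
            scripts.add("other")
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed" if scripts else "other"

# Hypothetical sample tokens: two Glagolitic letters, a Cyrillic word, a Latin word.
for token in ["\u2C00\u2C01", "слово", "word"]:
    print(repr(token), script_of(token))
```

Tokens classified this way could then be counted separately, so that the Glagolitic and Cyrillic versions of a passage are not added into one inflated total.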

Peculiarities of TITUS texts: Old Polish

• Old Polish texts display, side by side, editions that were produced at different times.

Kazania Świętokrzyskie (http://titus.uni-frankfurt.de/texte/etcs/slav/apoln/kazania/kazan.htm).

Peculiarities of TITUS texts: Ossetian

• The Ossetian Nart epic is presented in Latin script and in extended Cyrillic.

Ossetian: Nart epic (http://titus.uni-frankfurt.de/texte/etcs/iran/niran/oss/nart/nart.htm).

Peculiarities of TITUS texts: Russian-Low German

• Tönnies Fenne's Manual (17th century) contains at least 9 different languages or language varieties.

Peculiarities of TITUS texts: Old Prussian

• The Old Prussian corpus comprises at least 21 different languages or language variants (including Old Prussian, Old Lithuanian, Latin, Gothic, Old Low German, and Old High German).

Creation

• A digitized source consists not only of words in the source language; it also contains various information that does not belong to the original document: numbers, tags, punctuation marks, edition information, etc.

# Remove optional digits, whitespace, and the marker <†‡„>
$zeile =~ s/\d*\s+\x{003C}\x86\x87\x84\x{003E}//gi;

# Remove optional digits, whitespace, and the tag <?ConvertCheck: LevelNameTooLong>
$zeile =~ s/\d*\s+<\W<?ConvertCheck:\s+LevelNameTooLong>//g;
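The kind of clean-up described on this slide can be sketched in Python as follows. This is not a port of the Perl rules above: the three patterns (anything in angle brackets, runs of digits, remaining punctuation) are assumptions chosen for illustration, and the function name and the sample line are invented.

```python
import re

TAG = re.compile(r"<[^>]*>")    # anything in angle brackets, e.g. <?ConvertCheck: ...>
NUMBER = re.compile(r"\d+")     # line, verse or page numbers
PUNCT = re.compile(r"[^\w\s]")  # remaining punctuation marks

def clean_line(line: str) -> list:
    """Strip tags, numbers and punctuation, then split into word forms."""
    for pattern in (TAG, NUMBER, PUNCT):
        line = pattern.sub(" ", line)
    return line.split()

# Invented sample line mixing a number, a converter tag and punctuation.
print(clean_line("12 <?ConvertCheck: LevelNameTooLong> jah qaþ, du im: 3"))
```

Tags are removed first so that their contents are never mistaken for words; only what survives all three passes would be counted as tokens.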

Examples: Gothic

Gothic Bible. Old Testament Fragments. Total: 1629 tokens and 893 types

Language   Tokens   Types
Gothic        420     240
Latin         572     325
Greek         627     319

Examples: Gothic

Gothic Bible. New Testament Books. Total: 170215 tokens and 28876 types

Language   Tokens   Types
Gothic      61167    9121
Latin       52648    9036
Greek       56400   10719
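The New Testament figures above can be checked with a few lines of Python: both totals equal the sums of the per-language counts, which suggests that types are counted separately for each language and then added. The type-token ratio computed at the end is a common derived measure that the slides themselves do not report; it is included here only as an illustration.

```python
# Per-language counts from the Gothic New Testament slide: (tokens, types).
counts = {
    "Gothic": (61167, 9121),
    "Latin": (52648, 9036),
    "Greek": (56400, 10719),
}

total_tokens = sum(tokens for tokens, _ in counts.values())
total_types = sum(types for _, types in counts.values())
print(f"Total: {total_tokens} tokens, {total_types} types")

def type_token_ratio(tokens: int, types: int) -> float:
    """Types divided by tokens; not a figure given on the slides."""
    return types / tokens

for language, (tokens, types) in counts.items():
    print(f"{language}: TTR = {type_token_ratio(tokens, types):.3f}")
```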

Examples: Tönnies Fenne's Manual (17th century)

The language of this textbook of spoken Russian consists mainly of Russian in Latin transcription and Low German.

Examples: further application

Metadata

• DC – Dublin Core
• TEI – Text Encoding Initiative
• CEI – Corpus Encoding Initiative
• IMDI – ISLE Meta Data Initiative
• OLAC – Open Language Archives Community
• CMDI – Component MetaData Infrastructure

CMDI – Component MetaData Infrastructure

http://www.clarin.eu/cmdi

TITUS Metadata: HTML Format

<HEAD>
  <TITLE>TITUS Texts: Biblia gothica: Frame</TITLE>
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
  <META NAME="Author" CONTENT="Jost Gippert">
  <META NAME="Description" CONTENT="TITUS: Texts: Biblia gothica: Frame">
  <META NAME="KeyWords" CONTENT="TITUS Texte Texts Biblia gothica">
</HEAD>
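To see what this HTML header actually provides, the META elements can be read with Python's standard html.parser module. The sketch below is only one way to do it; the class name MetaCollector is made up for this example. It shows that the header carries author, description and keywords, but no token or type counts.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect META name/content pairs from an HTML header."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":  # the parser lowercases tag and attribute names
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        if key:
            self.meta[key] = attributes.get("content", "")

head = """<HEAD>
<TITLE>TITUS Texts: Biblia gothica: Frame</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="Author" CONTENT="Jost Gippert">
<META NAME="Description" CONTENT="TITUS: Texts: Biblia gothica: Frame">
<META NAME="KeyWords" CONTENT="TITUS Texte Texts Biblia gothica">
</HEAD>"""

collector = MetaCollector()
collector.feed(head)
for key, value in collector.meta.items():
    print(f"{key}: {value}")
```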

New Metadata Set for TITUS

Field                                            Status
Name                                             existing
Author                                           new
ProjectContactName                               existing
ProjectContactAddress                            existing
ProjectContactEmail                              existing
ProjectContactOranisation                        existing
ProjectDescription                               existing
Resource.Language                                new
Resource.ResourceLink                            existing
Resource.Access.Availability                     existing
Resource.Access.Date                             existing
Resource.Access.Owner                            existing
Resource.Access.Publisher                        existing
Resource.Publication.Time.Original.Manuscript    new
Resource.Publication.Time.Original.Facsimile     new
Resource.Publication.Time.Original.Published     new
Resource.Publication.Time.Electronic             existing
Resource.Wordcount.General.Tokens                new (CLARIN)
Resource.Wordcount.General.Types                 new
Resource.Wordcount.Language.Tokens               new
Resource.Wordcount.Language.Types                new
Resource.Metadata.Encoding                       new
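A minimal Python sketch of how the new general word-count fields might be written out as XML is given below. The element names and the "1629 Tokens" value format are taken from the XML example on the slide titled "Metadata Example for TITUS – XML CMDI"; the function name is hypothetical, and nothing here is meant to reproduce an official CMDI profile.

```python
import xml.etree.ElementTree as ET

def wordcount_general(tokens: int, types: int) -> ET.Element:
    """Build a <ResourceWordcountGeneral> element from two counts."""
    general = ET.Element("ResourceWordcountGeneral")
    ET.SubElement(general, "Tokens").text = f"{tokens} Tokens"
    ET.SubElement(general, "Types").text = f"{types} Types"
    return general

# Counts for the Gothic Old Testament fragments, as given on the slides.
element = wordcount_general(1629, 893)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```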

Metadata Example for TITUS – XML CMDI

<ResourcePublicationTimeElectronic>16.6.2002</ResourcePublicationTimeElectronic>
<ResourceWordcountGeneral>
  <Tokens>1629 Tokens</Tokens>
  <Types>893 Types</Types>
</ResourceWordcountGeneral>
<ResourceWordcountTT>
  <Language></Language>
  <LanguageTokensTypes> Tokens | Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 1_General</Language>
  <LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 2_Gothic</Language>
  <LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 4_Latin</Language>
  <LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes>
</ResourceWordcountTT>
<ResourceWordcountTT>
  <Language>Language 5_Greek</Language>
  <LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes>
</ResourceWordcountTT>
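The word-count entries in this example can be read back and cross-checked with Python's xml.etree.ElementTree. In the sketch below, the wrapping root element is an assumption added only because the fragment on the slide has no single root, and the regular expression simply mirrors the "N Tokens | M Types" strings shown above. The check confirms that the four filled-in language entries, including "Language 1_General", add up to the general totals.

```python
import re
import xml.etree.ElementTree as ET

fragment = """
<ResourceWordcountGeneral><Tokens>1629 Tokens</Tokens><Types>893 Types</Types></ResourceWordcountGeneral>
<ResourceWordcountTT><Language></Language><LanguageTokensTypes> Tokens | Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 1_General</Language><LanguageTokensTypes>10 Tokens | 9 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 2_Gothic</Language><LanguageTokensTypes>420 Tokens | 240 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 4_Latin</Language><LanguageTokensTypes>572 Tokens | 325 Types</LanguageTokensTypes></ResourceWordcountTT>
<ResourceWordcountTT><Language>Language 5_Greek</Language><LanguageTokensTypes>627 Tokens | 319 Types</LanguageTokensTypes></ResourceWordcountTT>
"""

# Assumption: wrap the fragment in a dummy root so that it parses as one document.
root = ET.fromstring(f"<root>{fragment}</root>")

COUNTS = re.compile(r"(\d+) Tokens \| (\d+) Types")
per_language = {}
for entry in root.iter("ResourceWordcountTT"):
    match = COUNTS.search(entry.findtext("LanguageTokensTypes", ""))
    if match:  # the empty template entry has no numbers and is skipped
        language = entry.findtext("Language")
        per_language[language] = (int(match.group(1)), int(match.group(2)))

general_tokens = int(root.findtext("ResourceWordcountGeneral/Tokens").split()[0])
general_types = int(root.findtext("ResourceWordcountGeneral/Types").split()[0])

sum_tokens = sum(tokens for tokens, _ in per_language.values())
sum_types = sum(types for _, types in per_language.values())
print(general_tokens == sum_tokens, general_types == sum_types)
```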

Metadata for TITUS – Browser

Thank you for your attention!

Links

• ARBIL (metadata editor): http://tla.mpi.nl/tools/tla-tools/arbil/
• CLARIN: http://www.clarin.eu
• CMDI: http://www.clarin.eu/cmdi
• Dublin Core: http://dublincore.org/documents/dcmi-terms/
• IMDI: http://www.mpi.nl/IMDI/
• OLAC: http://www.language-archives.org/
• TEI: http://www.tei-c.org/index.xml
• TITUS: http://titus.uni-frankfurt.de

Old Prussian Corpus

Tokens General: 17662 tokens
Types General: 8390 types