Lecture 4. Russian Corpora: Comparison and Usage. Part 1
Russian Corpora: Comparison and Usage
Victor [email protected]
St.Petersburg State University
Department of Mathematical Linguistics
The 16th International Conference TSD 2013
See:
Захаров В.П. Корпуса русского языка [Corpora of the Russian Language] // Труды Института русского языка им. В.В. Виноградова. Вып. 6. 2015. С. 20-64.
Outline
• Prehistory (1965-1977)
• Middle Ages (1978-2000)
• New Time (2000-2013)
• Conferences 2002-2013
• Russian National Corpus
• Spoken Russian Corpora
• Special Corpora
• Research projects
Lectures 1-3. Corpora of the Russian Language: History
"Prehistory"
The beginning of Russian corpus linguistics is usually associated with
• the Uppsala Corpus of Russian Texts
But I want to begin with the prehistory:
• the Frequency Dictionary of Russian (L.N. Zasorina, 1960s-1970s)
• printed version: Chastotnyi slovar’ russkogo yazyka. Zasorina, L.N. (ed.). Moskva (1977).
"Prehistory" (2):
The text database for the dictionary contained about 1 million tokens.
During its compilation, many of the well-known problems of corpus linguistics were discussed:
• representativeness,
• tokenization,
• normalization,
• lemmatization
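The steps listed above can be illustrated with a minimal sketch of how a frequency list is built from raw text. This is not the procedure used for the Zasorina dictionary; the regular expression, the normalization rules (lowercasing, replacing ё with е) and the tiny lemma table `TOY_LEMMAS` are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy lemma table; a real project would rely on a full
# morphological dictionary or analyzer instead.
TOY_LEMMAS = {
    "корпуса": "корпус",
    "корпусов": "корпус",
    "тексты": "текст",
    "текстов": "текст",
}

# A token is a run of Cyrillic letters, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[а-яё]+(?:-[а-яё]+)*")


def normalize(text):
    """Normalization (an assumed policy): lowercase and replace ё with е."""
    return text.lower().replace("ё", "е")


def tokenize(text):
    """Tokenization: split normalized text into word tokens."""
    return TOKEN_RE.findall(normalize(text))


def lemmatize(token):
    """Lemmatization: map a word form to its lemma via the toy table."""
    return TOY_LEMMAS.get(token, token)


def frequency_list(text):
    """Count lemma frequencies in the text."""
    return Counter(lemmatize(token) for token in tokenize(text))


sample = "Корпуса текстов. Тексты корпуса и корпусов!"
freq = frequency_list(sample)
for lemma, count in freq.most_common():
    print(lemma, count)
```

Each design choice here (what counts as a token, whether ё and е are merged, how unknown forms are lemmatized) changes the resulting counts, which is why these questions had to be settled before any frequency dictionary could be compiled.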
"Prehistory" (3):
So it was the earliest computerized corpus of Russian, although it no longer exists.
"Middle Ages": Computer Fund of the Russian Language
Idea: Acad. Andrey Jershov (1931-1988)
Jershov A.P. "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)
The idea was formulated as follows:
"Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of Russian language is solved. It is to be hoped that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimise labour costs and simultaneously would protect the 'tissues' of the Russian language from arbitrary and incompetent intervention."
"Computer Fund of the Russian Language"
The Fund was to include the following databases:
• General Lexicon of Russian,
• Databases of various dictionaries,
• Terminology database,
• Information system for the Russian grammar,
• Other subsystems (phonetics, dialectology, diachronic lexis, etc.).
And last but not least, a collection of texts, i.e. a corpus.
Unfortunately, the bulk of the accumulated results was either abandoned or lost.
Uppsala Russian Corpus
For many years the most renowned Russian corpus was the Uppsala Corpus of Russian Texts, created in the 1980s.
• 1 million tokens; 600 texts.
• Literature: 40 authors (1960-1988).
• Newspapers: various topics (1985-1988).
• No annotation originally. Only later was it annotated, as a part of the Tübingen Corpus (http://www.sfb441.uni-tuebingen.de/b1/rus/korpora.html#uppsalakorpus).
• By now its linguistic material is dated, its volume (one million word occurrences) is small by current standards, and it does not comply with modern conceptions of a national corpus.
Tübingen Corpus
The Uppsala Corpus belongs to the so-called Tübingen Russian corpora.
Tübingen Universität, 1999-2004, Tilman Berger
Research program «Linguistische Datenstrukturen: Theoretische und empirische Grundlagen der Grammatikforschung»: Slavonic languages
Morphological tagging (TnT tagger)
Corpus manager CQP (Stuttgart)
The first corpus of Russian freely accessible via the Internet.
First corpora in Russia
Russian newspaper corpus (Department of Philology of the Moscow University, Laboratory of General and Computational Lexicology and Lexicography, 2000–2002, A. Polikarpov).
1 million tokens in total; the online version is limited to 200 thousand tokens.
Texts and text items are automatically or semi-automatically marked with a number of tags:
• the source, text volume, genre, date of publication, etc. (for texts);
• grammatical, lexical, morphemic or other categories (for words).
Flexible search.
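The two levels of markup described above (tags for whole texts and tags for individual words) can be sketched as a small data structure with a search function that combines both. This is only an illustration and not the actual format or search interface of the Moscow University newspaper corpus; the class names, field names, tag labels and sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """Word-level annotation (hypothetical tag set)."""
    form: str
    lemma: str
    pos: str


@dataclass
class Text:
    """Text-level annotation: source, genre and publication date."""
    source: str
    genre: str
    date: str
    words: list = field(default_factory=list)

    @property
    def volume(self):
        """Text volume, measured here as the number of word tokens."""
        return len(self.words)


def search(corpus, lemma=None, pos=None, genre=None):
    """Return (source, word form) pairs that match all given constraints."""
    hits = []
    for text in corpus:
        if genre is not None and text.genre != genre:
            continue
        for word in text.words:
            if lemma is not None and word.lemma != lemma:
                continue
            if pos is not None and word.pos != pos:
                continue
            hits.append((text.source, word.form))
    return hits


# Invented sample data, for illustration only.
corpus = [
    Text("newspaper_A", "news", "2001-05-12",
         [Word("корпуса", "корпус", "NOUN"), Word("растут", "расти", "VERB")]),
    Text("newspaper_B", "interview", "2002-01-30",
         [Word("корпус", "корпус", "NOUN")]),
]

print(search(corpus, lemma="корпус"))
print(search(corpus, lemma="корпус", genre="news"))
```

Keeping text-level and word-level tags separate is what lets a query restrict the subcorpus first (for example by genre) and only then match word properties.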
First corpora in Russia (2)
Several small corpora at the Moscow Centre of Linguistic Documentation (V. Plungian, M. Daniel).
First of all: Russian Standard (2001-2002, V. Plungian, E. Rakhilina)
The BOCR Corpus (Big Corpus of Russian) (2002-2003, Serge Sharoff)