Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Russian Corpora: Comparison and Usage

Victor [email protected]

St.Petersburg State University

Department of Mathematical Linguistics

The 16th International Conference TSD 2013


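See:

Zakharov V.P. Korpusa russkogo yazyka [Corpora of the Russian Language] // Trudy Instituta russkogo yazyka im. V.V. Vinogradova [Proceedings of the V.V. Vinogradov Russian Language Institute]. Issue 6. 2015. Pp. 20-64.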

Outline

• Prehistory (1965-1977)

• Middle Ages (1978-2000)

• New Time (2000-2013)

• Conferences 2002-2013

• Russian National Corpus

• Spoken Russian Corpora

• Special Corpora

• Research projects

Lectures 1-3. Corpora of the Russian Language: History

"Prehistory"

The beginning of Russian corpus linguistics is usually connected with

• the Uppsala Corpus of Russian Texts

But I want to begin with the prehistory:

• the Frequency Dictionary of Russian (L.N. Zasorina, 1960s-1970s)

• printed version: Chastotnyi slovar’ russkogo yazyka. Zasorina, L.N. (ed.). Moskva (1977).


"Prehistory" (2):

The text database for the dictionary contained about 1 mln tokens.

During its compilation, many well-known issues of corpus linguistics were discussed:

• representativeness,

• tokenization,

• normalization,

• lemmatization
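To make these notions concrete, here is a minimal Python sketch of how tokenization, normalization and lemmatization feed into a frequency list. It is only an illustration and not the procedure used for Zasorina's dictionary: the function names, the sample sentence and the toy lemma table `TOY_LEMMAS` are all hypothetical, and a real project would rely on a proper morphological analyser instead of a lookup table.

```python
import re
from collections import Counter

# Hypothetical toy lemma table (assumption): stands in for a real
# morphological analyser, which this sketch does not include.
TOY_LEMMAS = {
    "корпуса": "корпус",
    "текстов": "текст",
    "тексты": "текст",
}

def tokenize(text):
    """Split text into word tokens: runs of letters, with internal hyphens allowed."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)

def normalize(token):
    """Lowercase the token and unify the letters 'ё' and 'е'."""
    return token.lower().replace("ё", "е")

def lemmatize(token):
    """Map a normalized word form to its lemma; unknown forms are kept unchanged."""
    return TOY_LEMMAS.get(token, token)

def frequency_list(text):
    """Count lemma frequencies over the whole text."""
    return Counter(lemmatize(normalize(t)) for t in tokenize(text))

sample = "Корпуса текстов. Тексты корпуса, ещё тексты!"
freq = frequency_list(sample)
print(freq.most_common(2))
```

Each step changes the counts: without normalization "Тексты" and "тексты" would be separate entries, and without lemmatization "текстов" and "тексты" would not be merged under one headword.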


"Prehistory" (3):

It was thus the earliest computerized corpus of Russian, although it no longer exists.


"Middle Ages": Computer Fund of the Russian Language

Idea: acad. Andrey Jershov (1931-1988)

Jershov A.P. "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)

The idea was formulated as follows:

"Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of Russian language is solved. It is to be hoped that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimise labour costs and simultaneously would protect the 'tissues' of the Russian language from arbitrary and incompetent intervention".


"Computer Fund of the Russian Language"

The Fund was to include the following databases:

• General Lexicon of Russian,

• Databases of various dictionaries,

• Terminology database,

• Information system for the Russian grammar,

• Other subsystems (phonetics, dialectology, diachronic lexis, etc.).

And last but not least, a collection of texts, i.e. a corpus.

Unfortunately, the bulk of the accumulated results was either abandoned or lost.


Uppsala Russian Corpus

For many years the most renowned Russian corpus was the Uppsala Corpus of Russian Texts, created in the 1980s.

• 1 million tokens; 600 texts.

• Literature: 40 authors (1960-1988).

• Newspapers: various topics (1985-1988).

• No annotation. Only later was it annotated as part of the Tübingen Corpus (http://www.sfb441.uni-tuebingen.de/b1/rus/korpora.html#uppsalakorpus).

• By now its linguistic material is neither adequate in volume (one million word occurrences) nor compliant with modern conceptions of a national corpus.


http://www.sfb441.uni-tuebingen.de/b1/korpora.html

Tübingen Corpus

The Uppsala corpus belongs to the so-called Tübingen Russian corpora.

Tübingen Universität, 1999-2004, Tilman Berger

Research program «Linguistische Datenstrukturen: Theoretische und empirische Grundlagen der Grammatikforschung»: Slavonic languages

Morphological tagging (TnT tagger)

Corpus manager CQP (Stuttgart)

The first corpus of Russian freely accessible via the Internet.


First corpora in Russia

Russian newspaper corpus (Department of Philology of the Moscow University, Laboratory of General and Computational Lexicology and Lexicography, 2000–2002, A. Polikarpov).

1 mln tokens in total; the online version is limited to 200 thousand tokens.

Texts and text items are automatically or semi-automatically marked with a number of tags:

• the source, text volume, genre, date of publication, etc. (for texts);

• grammatical, lexical, morphemic or other categories (for words).

Flexible search.
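The two levels of tagging described above (metadata for texts, categories for words) can be pictured as a simple data structure with a filter on top. The Python sketch below is an assumption-laden illustration, not the actual design of the newspaper corpus: the class names, field names, tag values and the sample entries "Newspaper A" and "Newspaper B" are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str   # word form as it appears in the text
    lemma: str  # lexical tag: dictionary headword
    pos: str    # grammatical tag, e.g. "NOUN" (hypothetical tag set)

@dataclass
class Text:
    source: str  # text-level tags: source, genre, date of publication
    genre: str
    date: str
    tokens: list = field(default_factory=list)

    @property
    def volume(self):
        """Text volume, measured here in tokens."""
        return len(self.tokens)

def search(texts, lemma=None, pos=None, genre=None):
    """Yield (text, token) pairs that satisfy every constraint that is given."""
    for text in texts:
        if genre is not None and text.genre != genre:
            continue
        for tok in text.tokens:
            if lemma is not None and tok.lemma != lemma:
                continue
            if pos is not None and tok.pos != pos:
                continue
            yield text, tok

# Hypothetical sample data, invented for illustration only.
corpus = [
    Text("Newspaper A", "news", "2001-03-15",
         [Token("газеты", "газета", "NOUN"), Token("пишут", "писать", "VERB")]),
    Text("Newspaper B", "editorial", "2002-01-10",
         [Token("газета", "газета", "NOUN")]),
]

hits = [(t.source, tok.form) for t, tok in search(corpus, lemma="газета", genre="news")]
print(hits)
```

Combining text-level and word-level constraints in one query is what makes such a search flexible: the same function can restrict by genre alone, by lemma alone, or by both at once.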


First corpora in Russia (2)

Some small corpora at the Moscow Centre of Linguistic Documentation (V. Plungian, M. Daniel).

First of all: Russian Standard (2001-2002, V. Plungian, E. Rakhilina)

The BOCR Corpus (Big Corpus of Russian) (2002-2003, Serge Sharoff)

