Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Russian Corpora: Comparison and Usage. Victor Zakharov, St. Petersburg State University, Department of Mathematical Linguistics. The 16th International Conference TSD 2013.


Page 1: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Russian Corpora: Comparison and Usage

Victor Zakharov

St. Petersburg State University

Department of Mathematical Linguistics

The 16th International Conference TSD 2013

Page 2: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

See:

Zakharov V.P. Korpusa russkogo yazyka [Corpora of the Russian language] // Trudy Instituta russkogo yazyka im. V.V. Vinogradova. Issue 6. 2015. pp. 20-64.

Page 3: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Outline

• Prehistory (1965-1977)

• Middle Ages (1978-2000)

• New Time (2000-2013)

• Conferences 2002-2013

• Russian National Corpus

• Spoken Russian Corpora

• Special Corpora

• Research projects

Lectures 1-3. Corpora of the Russian Language: History

Page 4: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

"Prehistory"

Usually the beginning of Russian corpus linguistics is connected with

• the Uppsala Corpus of Russian Texts

But I want to begin with the prehistory:

• the Frequency Dictionary of Russian (L.N. Zasorina, 1960s-1970s)

• printed version: Chastotnyi slovar’ russkogo yazyka. Zasorina, L.N. (ed.). Moskva (1977).

Page 5: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

"Prehistory" (2):

The text database for the dictionary contained about 1 million tokens.

During its compilation, many of the well-known issues of corpus linguistics were discussed:

• representativeness

• tokenization

• normalization

• lemmatization
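
The last three issues can be illustrated with a minimal Python sketch of how a frequency list might be built from raw text. This is not the procedure used for Zasorina's dictionary. The function names, the sample sentence, and the toy `LEMMAS` table are all assumptions made for illustration; a real lemmatizer would need a full morphological lexicon.

```python
import re
from collections import Counter

# Hypothetical toy lemma table (assumption): maps word forms to lemmas.
LEMMAS = {
    "корпуса": "корпус",
    "корпусов": "корпус",
    "тексты": "текст",
    "текстов": "текст",
}

def tokenize(text):
    """Split text into word tokens: runs of letters, optionally hyphenated."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)

def normalize(token):
    """Lowercase the token and treat 'ё' as 'е' (one possible convention)."""
    return token.lower().replace("ё", "е")

def lemmatize(token):
    """Look the normalized form up in the toy table; unknown forms stay as-is."""
    return LEMMAS.get(token, token)

def frequency_list(text):
    """Count lemma occurrences after tokenization and normalization."""
    return Counter(lemmatize(normalize(t)) for t in tokenize(text))

sample = "Корпуса и тексты. Тексты корпусов: ещё и Ещё!"
freq = frequency_list(sample)
for lemma, count in freq.most_common():
    print(lemma, count)
```

Each step embodies a decision that the compilers of a frequency dictionary have to make explicitly: what counts as a token (here hyphenated words stay whole), which spelling variants are merged, and which forms are grouped under one lemma.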

Page 6: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

"Prehistory" (3):

So it was the earliest computerized corpus of Russian, although it no longer exists today.

Page 7: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

"Middle Ages": Computer Fund of the Russian Language

Idea: Acad. Andrey Jershov (1931-1988)

Page 8: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Jershov A.P. "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)

The idea was formulated as follows:

"Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of Russian language is solved. It is to be hoped that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimise labour costs and simultaneously would protect the 'tissues' of the Russian language from arbitrary and incompetent intervention".

Page 9: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

"Computer Fund of the Russian Language"

The Fund was to include the following databases:

• General Lexicon of Russian,

• Databases of various dictionaries,

• Terminology database,

• Information system for the Russian grammar,

• Other subsystems (phonetics, dialectology, diachronic lexis, etc.).

And last but not least, a collection of texts, i.e. a corpus.

Unfortunately, the bulk of the accumulated results was either abandoned or lost.

Page 10: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Uppsala Russian Corpus

The most renowned Russian corpus for many years was the Uppsala Corpus of Russian Texts, created in the 1980s.

• 1 million tokens; 600 texts.

• Literature: 40 authors (1960-1988).

• Newspapers: various topics (1985-1988).

• No annotation. Only later was it annotated as a part of the Tübingen Corpus (http://www.sfb441.uni-tuebingen.de/b1/rus/korpora.html#uppsalakorpus).

• By now its linguistic material is out of date in terms of volume (one million word occurrences) and does not comply with modern conceptions of a national corpus.

Page 11: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

Tübingen Corpus

The Uppsala corpus belongs to the so-called Tübingen Russian corpora.

University of Tübingen, 1999-2004, Tilman Berger

Research program «Linguistische Datenstrukturen: Theoretische und empirische Grundlagen der Grammatikforschung»: Slavonic languages

Morphological tagging (TnT tagger)

Corpus manager CQP (Stuttgart)

The first corpus of Russian freely accessible via the Internet.

Page 12: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

First corpora in Russia

Russian newspaper corpus (Department of Philology of the Moscow University, Laboratory of General and Computational Lexicology and Lexicography, 2000–2002, A. Polikarpov).

1 million tokens in total; the online version is limited to 200 thousand tokens.

Texts and text items are automatically or semi-automatically marked with a number of tags:

• the source, text volume, genre, date of publication, etc. (for texts);

• grammatical, lexical, morphemic or other categories (for words).

A flexible search
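
The two levels of tagging described above (metadata for texts, categories for words) can be pictured with a small Python sketch. It does not reproduce the actual format or search interface of the newspaper corpus: the class names, field names, part-of-speech labels, and sample data are all hypothetical and serve only to show how a search can combine text-level and word-level constraints.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str   # word as it appears in the text
    lemma: str  # dictionary form
    pos: str    # hypothetical part-of-speech label

@dataclass
class Text:
    source: str  # text-level metadata
    genre: str
    date: str    # ISO date string
    tokens: list = field(default_factory=list)

def search(texts, lemma=None, pos=None, genre=None):
    """Yield (text, token) pairs that satisfy every constraint given."""
    for text in texts:
        if genre is not None and text.genre != genre:
            continue
        for tok in text.tokens:
            if lemma is not None and tok.lemma != lemma:
                continue
            if pos is not None and tok.pos != pos:
                continue
            yield text, tok

# Invented sample data, for illustration only.
corpus = [
    Text("Gazeta A", "news", "2001-03-15",
         [Token("корпуса", "корпус", "NOUN"), Token("растут", "расти", "VERB")]),
    Text("Gazeta B", "sport", "2002-07-01",
         [Token("корпус", "корпус", "NOUN")]),
]

hits = list(search(corpus, lemma="корпус", genre="news"))
for text, tok in hits:
    print(text.source, text.date, tok.form)
```

Keeping text metadata and word annotation in separate layers is what allows a query such as "all forms of a given lemma in news texts only".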

Page 13: Lecture 4. Russian Corpora: Comparison and Usage. Part 1

First corpora in Russia (2)

Some small corpora at the Moscow Centre of Linguistic Documentation (V. Plungian, M. Daniel).

First of all: Russian Standard (2001-2002, V. Plungian, E. Rakhilina)

The BOCR Corpus (Big Corpus of Russian) (2002-2003, Serge Sharoff)
