
Page 1: Russian Corpora: Comparison and Usage Victor Zakharov vz1311@yandex.ru St.Petersburg State University Department of Mathematical Linguistics The 16th International

Russian Corpora: Comparison and Usage

Victor Zakharov
vz1311@yandex.ru

St.Petersburg State University

Department of Mathematical Linguistics

The 16th International Conference TSD 2013


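See:

Zakharov, V. P. Korpusa russkogo yazyka [Corpora of the Russian Language]. Trudy Instituta russkogo yazyka im. V. V. Vinogradova [Proceedings of the V. V. Vinogradov Russian Language Institute], issue 6, 2015, pp. 20-64.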


"New Times"

• Conferences on corpus linguistics:
– Dialogue (every year)
– Corpora-20XX (our department), http://corpora.phil.spbu.ru, 2002 – 2013 (every 2 years)

• Publications

• Russian National Corpus (http://ruscorpora.ru/en/index.html): started in 2003 and accessible via the Internet since April 2004

Lecture 1-3. Corpora of Russian: the present day


The Russian National Corpus

The Russian National Corpus is the most popular corpus among linguists, both because it is the best known and because of the opportunities it offers.

A deeper analysis is beyond the scope of this presentation, so we will focus on its general characteristics and its most distinctive features. To show the state of the art in modern Russian corpus linguistics, we will also look in greater detail at other corpora that are less well known but worth mentioning.


The Russian National Corpus (2)

The corpus allows us to study the variability and volatility of the frequencies of linguistic phenomena, as well as to obtain reliable results in the following areas:

1) the study of morphological variants of words and their evolution;

2) the study of word-formation variants and related issues;

3) the study of changes in syntactic relations;

4) research into changes in the Russian stress (accentuation) system;

5) the study of lexical variation, in particular changes in synonym series and lexical groups, as well as the semantic relations within them.


The Russian National Corpus (3)

Over 500 million words (March 2013)

The RNC includes the following subcorpora:
• 1) The main corpus
• 2) Deeply annotated corpus (treebank)
• 3) Spoken corpus
• 4) Parallel text corpus
• 5) Dialectal corpus
• 6) Poetic corpus
• 7) Educational corpus
• 8) Newspaper corpus
• 10) Multimodal/multimedia corpus


The main corpus

The main corpus is subdivided into 2 parts:

• modern written texts (from the 1950s to the present day) (230 mln tokens);

• early texts (from the middle of the 18th to the middle of the 20th centuries).

The modern part is the largest of the subcorpora. Texts are represented in proportion to their share in real-life usage. For example, the share of fiction does not exceed 40%.


The main corpus (2)

Every text included in the main corpus is subject to metatagging and morphological tagging.

Morphological tagging is carried out automatically. In a small part of the main corpus (around 6 mln tokens), grammatical homonyms are disambiguated by hand and the results of automated morphological analysis are corrected.

This part is the model morphological corpus; it serves as a testing ground for various search algorithms and for programs of morphological analysis and automated processing.

Disambiguated texts are automatically supplied with stress marks. Stress annotation can be turned off when printing or saving search results.


Searching

Based on the Yandex search engine:
• lemma search;
• word-form search;
• set phrases;
• additional features:
– before or after punctuation marks
– at the beginning or at the end of a sentence
– capitalization, etc.

Additional options: grammeme search; semantic search; metadata search.

For lexico-grammatical search, we can enter a sequence of lexemes and/or word forms with specified grammatical and/or semantic features and combine them in any way.


Searching (2)

• For compound searches, parentheses are used. For example, the query S & (nom|acc) yields nouns in the nominative or accusative.

• A search term can be truncated on the left or on the right.

• The distance between words can be set as a range from a minimum to a maximum. Adjacent words are at distance 1; a distance of 0 means that the word forms coincide.
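The compound-query and distance rules above can be illustrated with a small Python sketch. It is not the RNC implementation: the function names (`tag_query`, `within_distance`) are hypothetical, `&` is assumed to bind tighter than `|`, and distance is assumed to be the difference between token positions.

```python
import re

def tokenize(query):
    # Split a query into tag names, operators and parentheses.
    return re.findall(r"[A-Za-z0-9_:]+|[&|()]", query)

def parse(tokens):
    # Assumed grammar: expr := term ('|' term)* ; term := factor ('&' factor)* ;
    # factor := TAG | '(' expr ')'
    pos = 0

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            node = ("or", node, term())
        return node

    def term():
        nonlocal pos
        node = factor()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            node = ("and", node, factor())
        return node

    def factor():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return ("tag", tok)

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens in query")
    return tree

def matches(tree, tags):
    # Evaluate the parsed query against the set of tags of one word form.
    if tree[0] == "tag":
        return tree[1] in tags
    left, right = matches(tree[1], tags), matches(tree[2], tags)
    return (left and right) if tree[0] == "and" else (left or right)

def tag_query(query, tags):
    return matches(parse(tokenize(query)), set(tags))

def within_distance(i, j, lo, hi):
    # Positions i and j: adjacent words are 1 apart, 0 means the same word form.
    return lo <= j - i <= hi
```

For instance, `tag_query("S & (nom|acc)", {"S", "acc", "sg"})` holds for an accusative noun, while a genitive noun or a nominative verb is rejected.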


Russian National Corpus: Search Interface


Grammatical search

A simpler way to search for certain grammatical features is to use a selection window.

The selection window contains a list of the relevant features, grouped by category: e.g., for morphology, these are part of speech, case, gender, voice, number, etc.


Grammatical search (2)


Russian National Corpus: Metadata tagset

The interface groups the metadata parameters into 2 blocks:

I. Passport
• Author: name, gender, year of birth or approximate age
• Text title
• Date of creation (can be given as an exact or an approximate date, or as after or before a certain date)

II. Two subgroups: non-fiction and fiction. The two subgroups have different parameter structures.


RNC: Semantic annotation

The RNC texts are semantically tagged.

Semantic annotation in the main corpus is a unique feature of RNC that makes it distinct from other national corpora.

Semantic and derivational parameters: person, substance, space, movement, diminutive, verbal noun, etc.

The annotation uses the Semantic Dictionary of the Corpus, based on the classification system developed since 1992 for the Lexicograph database under the leadership of E. V. Paducheva and E. V. Rakhilina.


The structure of semantic and lexical information

There are three groups of tags assigned to words to reflect lexical and semantic information:

• Class (a name, a reflexive pronoun, etc.)

• Lexical and semantic features (a lexeme's thematic class, indications of causality or assessment, etc.)

• Derivational features (a diminutive, an adjectival adverb, etc.)

The set of semantic and lexical parameters is different for different parts of speech. Moreover, nouns are divided into three subclasses (concrete nouns, abstract nouns, and proper names), each with its own hierarchy of tags.


Lexical and semantic tags

• Taxonomy (a lexeme's thematic class) – for nouns, adjectives and adverbs
• Mereology (“part – whole” and “element – aggregate” relationships) – for concrete and abstract nouns
• Topology – for concrete nouns
• Causation – for verbs
• Auxiliary status – for verbs
• Evaluation – for abstract and concrete nouns, adjectives and adverbs
• Etc.


Lexical and semantic tags: Fragment (1)

• Taxonomy: t:hum – person (человек (human), учитель (teacher)), t:hum:etn – ethnonyms (эфиоп (Ethiopian), итальянка (Italian woman)), t:hum:kin – kinship terms (брат (brother), бабушка (grandmother)), t:animal – animals (корова (cow), сорока (magpie)), etc.

• Mereology: pt:part – parts (верхушка (top)), pt:part & pc:plant – parts of plants (ветка (branch), корень (root)), pt:part & pc:constr – parts of buildings and constructions (комната (room), дверь (door)), etc.

• Topology: top:contain – containers (комната (room), озеро (lake)), top:horiz – horizontal surfaces (пол (floor), площадка (ground, area)), etc.

• Evaluation: ev – evaluation (neither positive nor negative) (озорник (mischief-maker)), ev:posit – positive evaluation (умница (clever person)), ev:neg – negative evaluation (негодяй (scoundrel)).
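The colon-separated tags above form a hierarchy (t:hum:etn is a subclass of t:hum). A minimal Python sketch of how such tags could be matched follows. It assumes that a query for a parent tag should also return its subclasses, which the slide does not state. The tiny lexicon contains only tag assignments quoted on this slide, and `tag_matches` and `find_words` are hypothetical names.

```python
# Toy lexicon: word -> set of semantic tags, taken from the examples on the slide.
LEXICON = {
    "человек": {"t:hum"},
    "эфиоп": {"t:hum:etn"},
    "брат": {"t:hum:kin"},
    "корова": {"t:animal"},
    "комната": {"pt:part", "pc:constr", "top:contain"},
    "умница": {"ev:posit"},
    "негодяй": {"ev:neg"},
}

def tag_matches(query, tag):
    # A tag matches if it equals the query or is a descendant of it
    # (the query followed by ':' and further levels). This is an assumption.
    return tag == query or tag.startswith(query + ":")

def find_words(query, lexicon=LEXICON):
    # Return all words that carry at least one tag matching the query.
    return sorted(
        word
        for word, tags in lexicon.items()
        if any(tag_matches(query, tag) for tag in tags)
    )
```

Under this assumption, `find_words("t:hum")` returns человек, эфиоп and брат, while `find_words("t:hum:etn")` returns only эфиоп.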


Lexical and semantic tags: Fragment (2)

Some tags for verbs:
• t:move – movement (бежать (run), бросить (throw))
• t:put – placement (положить (put), спрятать (hide))
• t:impact – physical impact (бить (beat), колоть (prick))
• t:be:exist – existence (жить (live), происходить (happen))
• t:be:appear – start of existence (возникнуть (arise), создать (create))
• t:be:disapp – end of existence (убить (kill), улетучиться (disappear))
• t:loc – location (лежать (lie), стоять (stand)).


Search Interface in English


Semantics: search


Semantics: search (2)


RNC: Search results

Search results can be presented in two ways:

• a horizontal text (a broader context)
• a concordance (next slide).

In both cases the grammatical and semantic features of any word can be inspected: the slide shows the features for the word лук (onion (t:food) OR bow (t:tool:weapon)).
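The concordance view can be illustrated with a short keyword-in-context (KWIC) sketch in Python. The two sample sentences are invented for illustration and do not come from the corpus. The `kwic` function and its `width` parameter are hypothetical, and matching is done on exact word forms rather than lemmas.

```python
def kwic(tokens, keyword, width=3):
    # Collect (left context, keyword, right context) for every occurrence.
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample sentences containing the ambiguous word "лук".
SENTENCES = [
    "Мы купили лук и морковь на рынке",
    "Охотник взял лук и стрелы",
]

def concordance(sentences, keyword, width=2):
    # Build concordance lines over several sentences, one sentence at a time.
    result = []
    for sentence in sentences:
        result.extend(kwic(sentence.split(), keyword, width))
    return result
```

Each returned triple can be printed with the keyword aligned in a column, which is what makes the two senses of лук easy to compare by their neighbours.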


RNC: Search results (2)


RNC: Search results (3)


Deeply Annotated Corpus (RNC)

• 25 000 sentences, more than 350 000 tokens.
• Various topics.
• Underlying framework: the multipurpose ETAP system (machine translation; Laboratory for Computational Linguistics, Institute for Information Transmission Problems, RAS).
• Sentence structure: a dependency tree (the syntactic representation goes back to Igor Mel'čuk's Meaning-Text Theory).
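A dependency tree of the kind mentioned above can be represented as a list of word forms, each pointing to its head. The Python sketch below is an illustration only: the example sentence is invented, the relation labels are placeholders rather than the corpus's actual tagset, and `Node`, `is_tree` and `render` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Node:
    form: str   # word form as it appears in the sentence
    head: int   # index of the governing word, -1 for the root
    rel: str    # label of the relation to the head (placeholder labels)

def children(sentence, head):
    # Indices of all words governed by the word at index `head`.
    return [i for i, node in enumerate(sentence) if node.head == head]

def is_tree(sentence):
    # Well-formed: exactly one root, and every word is reachable from it.
    roots = children(sentence, -1)
    if len(roots) != 1:
        return False
    seen = set()
    stack = [roots[0]]
    while stack:
        i = stack.pop()
        seen.add(i)
        stack.extend(children(sentence, i))
    return len(seen) == len(sentence)

def render(sentence, head=-1, depth=0):
    # Indented listing of the tree, dependents below their heads.
    lines = []
    for i in children(sentence, head):
        node = sentence[i]
        lines.append("  " * depth + f"{node.form} ({node.rel})")
        lines.extend(render(sentence, i, depth + 1))
    return lines

# Invented example: "Маленький мальчик читает книгу" (A little boy reads a book).
SENTENCE = [
    Node("Маленький", 1, "attributive"),
    Node("мальчик", 2, "predicative"),
    Node("читает", -1, "root"),
    Node("книгу", 2, "completive"),
]
```

Calling `render(SENTENCE)` after `is_tree(SENTENCE)` has confirmed the structure gives one line per word, with the verb at the top and its dependents indented beneath it.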


Deeply Annotated Corpus (RNC): An example


RNC: The Corpus of Spoken Russian

More than 10 mln tokens.

The corpus represents real-life Russian speech and includes recordings of public and spontaneous spoken Russian as well as transcripts of Russian movies. The spoken samples are transcribed in standard orthography. The corpus contains samples of different genres/types and from different geographic regions. It covers the period from 1930 to 2007.

In addition, the corpus has its own annotation: accentological and sociological.


ORD corpus (Odin Rechevoj Den’ or One Day of Speech)

The main aim of the ORD corpus is to collect recordings of the actual speech we use in everyday communication.

A balanced group of 30 persons represents various social and age strata of the population of St. Petersburg.

Each of these individuals spent one day with a recorder hanging around the neck, recording all of that day's communication.

More than 240 hours of recordings were obtained, 170 hours of which contain speech data suitable for further linguistic analysis.

2202 communication episodes.

At present, the orthographic transcription of the corpus numbers more than 50000 word forms.


RNC: the Multimedia Russian corpus (MURCO)

Fragments of movies from the 1930s through the 2000s and some other materials.

The total volume of the movie transcripts is around 3.5 million tokens.

The text transcripts are aligned with the parallel sound and video tracks.

The types of annotation in the MURCO are as follows:
• orthoepic annotation: combinations of sounds are marked;
• annotation of accentological structure;
• speech act annotation: the types of speech acts;
• gesture annotation: the type of gesticulation in a clip.

A user obtains not only a written text, annotated from different points of view, but also the corresponding sound and video material.


Parallel corpora of Russian

1. Russian National Corpus (English-Russian, Russian-English, German-Russian, Ukrainian-Russian, Russian-Ukrainian, Belorussian-Russian, Russian-Belorussian)

2. PARUS (SNC, Bratislava) – PAralelní RUsko-Slovenský korpus (rus-slov)

3. PARRUS (Tampere) (rus-fin)

4. InterCorp (ČNK, Praha)

5. ParaSol (The Regensburg Parallel Corpus of Slavonic)


Other text corpora of Russian

HANCO

Helsinki Annotated Corpus (2001-2004, A. Mustajoki, M. Kopotev, the Department of Slavonic and Baltic Languages and Literatures at the University of Helsinki).

The corpus includes morphological and syntactic information about approximately 100,000 running words, extracted from a modern Russian magazine and representing the modern Russian language.


HANCO

The main principles of creation

• Orientation to a wider audience. Potential users are not only a narrow circle of experts, but also students and teachers of Russian. The search parameters are chosen so as to minimize the amount of specialized knowledge required.

• Orientation to the accuracy of the grammatical description, not to the amount of annotated material.

• Orientation to multilevel grammatical information. The HANCO corpus contains multilevel grammatical information, including morphological, syntactic, and functional characteristics. They can be combined in the process of searching.

• Possibility of alternative interpretations. The HANCO creators decided to accept alternative interpretations of linguistic facts. This seeming lack of selectivity demands a lot of manual work, but it makes it easier for the potential user to find the necessary information.


Moshkov's Library corpus


AOT-DDC: results


Russian corpora in Sketch Engine

• http://sketchengine.co.uk/
• Lexical Computing Ltd. (A. Kilgarriff)
• More than 150 corpora of different languages
• Among them are corpora of Russian, first of all the ruTenTen corpus of 20 billion tokens (WaCky technology)
• Corpus manager Sketch Engine (Masaryk University)
• Different tools: Concordance, Word sketches, Thesaurus, Differences, Clustering, etc.


Russian corpora by Serge Sharoff, Leeds Univ. (corpus manager CQP)

• Russian Reference Corpus (a part of the RNC)
• Russian Reference Corpus, another version
• Russian Fiction (disambiguated)
• Russian Newspapers
• Russian Business Corpus
• Russian Internet Corpus
• Russian corpora together


Russian corpora by Serge Sharoff
