ELRA - ELRA-U-W0311 : Russian National Corpus


Catalog Reference : ELRA-U-W0311

Russian National Corpus

The Russian National Corpus (RNC) is a collection of written, spoken and multimodal corpora, representing about 300 million tokens.

It covers a wide range of sources from the 18th to the early 21st century: original works of fiction (prose, drama and poetry) and other sources of written and spoken language (memoirs, essays, journalistic texts, scientific and popular science literature, public speeches, private conversations, movie dialogue, letters, diaries, etc.).

It contains the following subcorpora:
- the Main corpus, which contains prose texts and represents about 160 million tokens,
- the Deeply Annotated Syntactic corpus, which is annotated for morphological and syntactic structure,
- the Poetry corpus, which provides poetry-specific tags (about 5 million tokens),
- the Newspaper corpus (or Corpus of the contemporary Russian press), which includes around 100 million tokens and consists of newspaper texts and news reports from the Web (2000-2008),
- the Disambiguated corpus, which contains texts in which grammatical homonymy has been resolved,
- the Accentological corpus, which contains poetry and prose texts from the 18th to the 21st century, annotated for actual (rather than normative) Russian stress,
- the Parallel Russo-English and Russo-German corpora, which are sentence-aligned (9 million tokens),
- the Corpus of Spoken Russian, which contains transcripts of public and spontaneous spoken Russian as well as transcripts of Russian movies from 1930 to 2007,
- the Multimodal Russian Corpus (MURCO), a collection of movie clips aligned with the corresponding transcripts (still in progress).

Production

Creation date : 2004

Applications


Application Area : Research

Contents


Written corpus #18626
Number of languages : Monolingual
Language(s) : Russian
Annotation Coverage : Partial
Annotation Granularity : Word
Annotation level : Morphological
Lexical Unit Information : Single word lemma

Written corpus #28626
Number of languages : Bilingual
Language(s) : Russian >>>> English ; Russian >>>> German
Alignment : Sentence

Video1 #28626
Number of languages :
Language(s) :