ELRA - ELRA-U-W 0101 : Multext-East 1984 Parallel Corpus

Catalog Reference : ELRA-U-W 0101

Multext-East 1984 Parallel Corpus

This multilingual parallel corpus consists of the novel "1984" (G. Orwell) and contains approximately 100,000 words per language (English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Latvian, Lithuanian, Serbian, Russian).
The texts are tokenised and annotated with part-of-speech (POS) and morpho-syntactic information. The corpus is marked up in SGML.
The translations have been automatically aligned and then hand-validated. The corpus conforms to the CES standard.

It is a part of a multilingual dataset containing multiple resources for Central and Eastern European languages:
- MULTEXT-East morphosyntactic specifications,
- MULTEXT-East morphosyntactic lexicons,
- MULTEXT-East morphosyntactically annotated "1984" corpus,
- MULTEXT-East comparable corpus,
- MULTEXT-East parallel speech corpus (from EUROM-1 speech corpus),
- and associated documentation.
The central component of the MULTEXT-East corpus is the novel "1984" by G. Orwell.

The dataset is compliant with the EAGLES and TEI P4 recommendations.
It is a valuable resource for language engineering research and development on Central and Eastern European languages.

Identification

Period of coverage :

Version : v3, 2004
Version history : v1: 1998 ('East meets West' CD-ROM) ; v2: 2002

Production

Project : TELRI, CONCEDE, Multext-East Projects

Creation date : 2004

Applications


Application area : Research

Contents

written corpus
Number of languages : Multilingual
Language(s) : English (United Kingdom) - Romanian ; English - Slovene ; English - Czech ; English - Bulgarian ; English - Estonian ; English - Hungarian ; English - Latvian ; English - Lithuanian ; English - Serbian ; English - Russian
Alignment : Sentence
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Morphological
Annotation Scheme : TEI
Annotation language : SGML