Universal Catalogue  
Catalog Reference : ELRA-U-W 0101
Multext-East 1984 Parallel Corpus
This multilingual parallel corpus consists of the novel "1984" (G. Orwell) and contains approximately 100,000 words per language (English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Latvian, Lithuanian, Serbian, Russian).
Texts are tokenised and annotated with POS and morpho-syntactic information. The corpus is marked up in SGML.
The different translations have been automatically aligned and then hand-validated. The corpus conforms to the CES (Corpus Encoding Standard).

It is a part of a multilingual dataset containing multiple resources for Central and Eastern European languages:
- MULTEXT-East morphosyntactic specifications,
- MULTEXT-East morphosyntactic lexicons,
- MULTEXT-East morphosyntactically annotated "1984" corpus,
- MULTEXT-East comparable corpus,
- MULTEXT-East parallel speech corpus (from EUROM-1 speech corpus),
- and associated documentation.
The central component of the MULTEXT-East corpus is the novel "1984" by G. Orwell.

The dataset is compliant with the EAGLES and TEI P4 recommendations.
It is a valuable resource for language engineering research and development on Central and Eastern European languages.
Identification
Period of coverage :
Version : v3, 2004
Version history : v1: 1998 ('East meets West' CD-ROM); v2: 2002
Production
Project : TELRI, CONCEDE, Multext-East Projects
Creation date : 2004
Applications
Application area : Research
Contents
 written corpus 
 

Joint Copyright © 2008 ELRA & ELDA