ELRA - ELRA-U-S 0010 : Multext-East Speech Corpus

Catalog Reference : ELRA-U-S 0010

Multext-East Speech Corpus

This is a small parallel corpus of spoken texts taken from the EUROM-1 speech corpus. Forty short passages have been translated from English into Romanian, Slovene, Estonian, Hungarian, Czech and Bulgarian.
For four of these languages (Romanian, Slovene, Estonian and Hungarian), recordings of the texts are also provided, with links between the texts and the spoken passages.

It is part of a multilingual dataset containing several resources for Central and Eastern European languages:
- MULTEXT-East morphosyntactic specifications,
- MULTEXT-East morphosyntactic lexicons,
- MULTEXT-East morphosyntactically annotated "1984" corpus,
- MULTEXT-East comparable corpus,
- MULTEXT-East "1984" parallel corpus,
- and associated documentation.
The central component of the MULTEXT-East corpus is the novel "1984" by G. Orwell.

The dataset is compliant with the EAGLES and TEI P4 recommendations.
It is a valuable resource for language engineering research and development on Central and Eastern European languages.

Identification

Period of coverage :

Version : v3, 2004
Version history : v1: 1998 ('East meets West' CDROM) ; v2: 2002

Production

Project : TELRI, CONCEDE, Multext-East Projects

Creation date : 2004

Applications


Application Area : Research

Contents

written corpus
Number of languages : Multilingual
Language(s) : English ; Romanian ; Slovene ; Bulgarian ; Czech ; Estonian ; Hungarian
Alignment : Sentence
Annotation Scheme : TEI
Annotation language : XML
speech corpus
Language(s) : Romanian (Romania) ; Slovene (Slovenia) ; Estonian (Estonia) ; Hungarian (Hungary)
Source Channel : Microphone
Speech Acquisition Mode : Acoustic
Transcription Entries : Orthographic