ELRA - ELRA-U-W0373 : VoiceTRAN application-specific corpus

Catalog Reference : ELRA-U-W0373

VoiceTRAN application-specific corpus

This is a restricted-domain corpus of Slovene-English parallel texts from the Slovenian Ministry of Defense.

The corpus is sentence-aligned, part-of-speech (POS) tagged at the word level, and lemmatized. It is annotated in XML, following the TEI P4 guidelines.
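As a rough illustration of how such annotation might be consumed, the sketch below reads a small TEI-style fragment and pairs up aligned sentences. The fragment itself is hypothetical: the `<s>` and `<w>` elements follow general TEI usage, but the attribute names (`id`, `lang`, `corresp`, `lemma`, `ana`), the tag values, and the example sentence are assumptions made for this sketch and are not taken from the actual corpus files. The helper names `read_sentences` and `aligned_pairs` are likewise invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. Element names follow general TEI usage; the
# attribute names and the coarse POS values ("N", "V", "D") are
# placeholders, not the corpus's real tagset or schema.
SAMPLE = """
<text>
  <s id="sl.1" lang="sl" corresp="en.1">
    <w lemma="vojak" ana="N">Vojak</w>
    <w lemma="brati" ana="V">bere</w>
    <w lemma="ukaz" ana="N">ukaz</w>
  </s>
  <s id="en.1" lang="en" corresp="sl.1">
    <w lemma="the" ana="D">The</w>
    <w lemma="soldier" ana="N">soldier</w>
    <w lemma="read" ana="V">reads</w>
    <w lemma="the" ana="D">the</w>
    <w lemma="order" ana="N">order</w>
  </s>
</text>
"""


def read_sentences(xml_text):
    """Map each sentence id to its language, alignment target and tokens.

    Each token is a (word form, lemma, POS tag) tuple.
    """
    root = ET.fromstring(xml_text.strip())
    sentences = {}
    for s in root.iter("s"):
        tokens = [(w.text, w.get("lemma"), w.get("ana")) for w in s.iter("w")]
        sentences[s.get("id")] = {
            "lang": s.get("lang"),
            "corresp": s.get("corresp"),
            "tokens": tokens,
        }
    return sentences


def aligned_pairs(sentences, source_lang="sl"):
    """Return (source tokens, target tokens) for every aligned sentence pair."""
    pairs = []
    for sentence in sentences.values():
        target_id = sentence["corresp"]
        if sentence["lang"] == source_lang and target_id in sentences:
            pairs.append((sentence["tokens"], sentences[target_id]["tokens"]))
    return pairs


if __name__ == "__main__":
    for source, target in aligned_pairs(read_sentences(SAMPLE)):
        print([lemma for _, lemma, _ in source])
        print([lemma for _, lemma, _ in target])
```

The real files may encode alignment and morphosyntactic tags differently (for example, in separate link elements), so the attribute lookups would need adjusting to the distributed schema.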

The corpus was built to induce a lexicon of words and phrases for machine translation (MT) and to train a language model for a speech recognizer.

Production

Project : VoiceTRAN

Contents

written corpus
Number of languages : Bilingual
Language(s) : Slovenian, English
Alignment : Sentence
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Syntactic
Lexical Unit Information : Single word lemma
Annotation Scheme : TEI
Annotation language : XML