ELRA - ELRA-U-W 0056 : Czech-English Parallel Corpus

Catalog Reference : ELRA-U-W 0056

Czech-English Parallel Corpus

This Czech-English parallel corpus (CzEng) contains approximately 90 million words per language. It was compiled between 2005 and 2009 from documents in various fields: European law, information technology and fiction. Many documents were taken from existing corpora, such as the Acquis corpus. In the latest version (0.9), texts from parallel web pages, electronically available books and subtitles were added.

The corpus was cleaned and structured, then segmented into sentences, tokenised and converted to a common XML format. Sentences were aligned automatically with Hunalign, a freely available tool.

The purpose of the corpus is to support Czech-English and English-Czech machine translation research.

Identification

Period of coverage :

Version : v0.9
Version history : v0.7, v0.5

Production

Creation date : 2007

Applications


Application area : Research

Technical Information

Bytesize : 4.0 GB
Compression : Zip

Contents

Written corpus
Number of languages : Bilingual
Language(s) : Czech (Czech Republic), English (United Kingdom)
Character set : UTF-8
Alignment : Sentence