Catalog Reference: ELRA-U-W 0056
Czech-English Parallel Corpus
This Czech-English parallel corpus (CzEng) contains approximately 90 million words per language. It was compiled between 2005 and 2009 from documents in several domains: European law, information technology and fiction. Many documents were taken from existing corpora, such as the Acquis corpus. The latest version (0.9) adds texts from parallel web pages, electronically available books and subtitles.
The corpus was cleaned, structured, segmented into sentences, tokenised and converted to a common XML format. The sentences were aligned automatically with Hunalign, a freely available tool.
The purpose of the corpus is to support Czech-English and English-Czech machine translation research.
Identification
Period of coverage:
Version: v0.9
Version history: v0.7, v0.5
Production
Creation date: 2007
Applications
Application area: Research
Technical Information
Size: 4.0 GB
Compression: Zip
Contents
Written corpus
Number of languages: Bilingual
Language(s): Czech (Czech Republic), English (United Kingdom)
Character set: UTF-8
Alignment: Sentence
Joint Copyright © 2008 ELRA & ELDA