ELRA - ELRA-U-W 0039 : The Europarl Corpus

Catalog Reference : ELRA-U-W 0039

The Europarl Corpus

The Europarl Corpus is a multilingual collection of texts extracted from the proceedings of the European Parliament. It covers 11 languages, with the following word counts:

Danish: 47,305,502 words
German: 44,688,020 words
Greek: 26,306,875 words (in 2007)
English: 50,978,295 words
Spanish: 52,503,808 words
Finnish: 34,106,317 words
French: 55,088,177 words
Italian: 50,161,729 words
Dutch: 50,926,645 words
Portuguese: 51,294,994 words
Swedish: 43,291,692 words
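
As a quick check on the relative size of each language, the figures above can be tallied as in the following sketch. The dictionary simply transcribes the listed counts; the variable names are illustrative and not part of the catalogue. Note that the Greek figure is dated 2007 in the entry, so the total may mix counts from different releases.

```python
# Word counts transcribed from the list above.
# Assumption: all figures are comparable, although the entry dates the Greek one to 2007.
WORD_COUNTS = {
    "Danish": 47_305_502,
    "German": 44_688_020,
    "Greek": 26_306_875,
    "English": 50_978_295,
    "Spanish": 52_503_808,
    "Finnish": 34_106_317,
    "French": 55_088_177,
    "Italian": 50_161_729,
    "Dutch": 50_926_645,
    "Portuguese": 51_294_994,
    "Swedish": 43_291_692,
}

total = sum(WORD_COUNTS.values())
largest = max(WORD_COUNTS, key=WORD_COUNTS.get)
smallest = min(WORD_COUNTS, key=WORD_COUNTS.get)

# Print languages from largest to smallest with their share of the total.
for language, words in sorted(WORD_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{language:<11}{words:>12,}  {words / total:6.1%}")
print(f"{'Total':<11}{total:>12,}")
```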

The mark-up encodes document, speaker and paragraph information.
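
The entry does not specify the tag names, so the following is only a minimal sketch of how such mark-up might be read. It assumes line-level tags of the form `<CHAPTER ID=...>`, `<SPEAKER ID=... NAME="...">` and `<P>`, each on its own line, with one line of text between tags; the function name `parse_europarl` and the output layout are hypothetical.

```python
import re

# Assumed tag shapes (NOT stated in the catalogue entry):
#   <CHAPTER ID=1>   <SPEAKER ID=1 NAME="President">   <P>
TAG = re.compile(r"^<(CHAPTER|SPEAKER|P)\b([^>]*)>\s*$")
ATTR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_europarl(lines):
    """Group text lines into speaker turns, each holding a list of paragraphs."""
    turns = []
    chapter = None
    turn = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TAG.match(line)
        if match:
            kind = match.group(1)
            attrs = {name: quoted or bare
                     for name, quoted, bare in ATTR.findall(match.group(2))}
            if kind == "CHAPTER":
                # A new document section: remember its ID and close the current turn.
                chapter = attrs.get("ID")
                turn = None
            elif kind == "SPEAKER":
                turn = {"chapter": chapter, "speaker": attrs.get("NAME"),
                        "paragraphs": [[]]}
                turns.append(turn)
            elif turn is not None and turn["paragraphs"][-1]:
                # <P>: start a new paragraph inside the current turn.
                turn["paragraphs"].append([])
            continue
        if turn is None:
            # Text before any <SPEAKER> tag (e.g. a section title) has no speaker.
            turn = {"chapter": chapter, "speaker": None, "paragraphs": [[]]}
            turns.append(turn)
        turn["paragraphs"][-1].append(line)
    return turns
```

Called on an open file, for example `parse_europarl(open(path, encoding="utf-8"))` with a hypothetical `path`, this returns one dictionary per speaker turn. Actual tag names and attributes should be checked against the distributed files before relying on it.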

This corpus was compiled for machine translation and is distributed with a sentence aligner.

See also the parallel texts that have already been constructed (each language aligned with English). Data for other language pairs can also be requested.

Identification

Period of coverage :

Version : v5, 2010
Version history : v3, 2007

Production

Creation date : 2007

Applications


Application area : Research

Contents

written corpus
Number of languages : Multilingual
Language(s) : Danish (Denmark) ; German (Germany) ; Greek (Greece) ; English (United Kingdom) ; Spanish (Spain) ; Finnish (Finland) ; French (France) ; Italian (Italy) ; Dutch (Netherlands) ; Swedish (Sweden) ; Portuguese (Portugal)
Character set : UTF-8