ELRA - ELRA-WC345 : Acquis Communautaire Corpus

Catalog Reference : ELRA-WC345

Acquis Communautaire Corpus

This is a large aligned parallel corpus containing 1 billion words in 22 official EU languages (231 language-pair combinations). It contains EU legislation, declarations, resolutions, acts, international agreements, and documents on the contents, principles and political objectives of the EU Treaties.
Texts are in XML conforming to TEI P4 and are UTF-8 encoded.
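
The figure of 231 language pairs follows from choosing 2 of the 22 languages without regard to order. The short Python sketch below checks that arithmetic against the language list given in this entry; the names LANGUAGES and language_pairs are illustrative only and are not part of the corpus distribution.

```python
from itertools import combinations
from math import comb

# The 22 languages listed in this catalogue entry.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "Finnish", "French", "German", "Greek", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages (hypothetical helper)."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)

# Enumerating the pairs and applying the closed form n * (n - 1) / 2 must agree.
print(len(LANGUAGES), len(pairs), comb(len(LANGUAGES), 2))
```

Each pair is counted once, so (Czech, English) and (English, Czech) are the same combination.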

Two alignments have been performed, one at the paragraph level (with Vanilla) and the other at the sentence level (with HunAlign). The corpus is also manually classified according to EUROVOC subject domains.

This combination of multilinguality and subject-domain coding makes the corpus a valuable resource for NLP research. It can be used for cross-language research as well as for developing classification algorithms and keyword-assignment software.

Identification

Period of coverage :

Version : 3.0
Version history :

Applications

Application Area : Research

Contents

written corpus
Number of languages : Multilingual
Language(s) : Czech ; Danish ; Dutch ; English ; Estonian ; German ; Greek ; Finnish ; French ; Hungarian ; Italian ; Latvian ; Lithuanian ; Maltese ; Polish ; Portuguese ; Romanian ; Slovak ; Slovene ; Spanish ; Swedish ; Bulgarian
Alignment : Multilingual