Catalog Reference : ELRA-WC345
Acquis Communautaire Corpus
This is a large aligned parallel corpus containing 1 billion words in 22 official EU languages (231 language pair combinations). It contains EU legislation, declarations, resolutions, acts, international agreements and documents on contents, principles and political objectives of the EU Treaties.
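The figure of 231 language pairs follows from choosing 2 languages out of 22 without regard to order. A minimal sketch of that arithmetic, using the language names listed in this record (the variable names are illustrative only and not part of the corpus distribution):

```python
from itertools import combinations

# The 22 languages listed in this catalogue record.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish", "Bulgarian",
]

# Unordered pairs: n * (n - 1) / 2 combinations for n languages.
pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages,", len(pairs), "language pairs")
```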
Texts are in XML format conforming to TEI P4 and are UTF-8 encoded.
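As a rough illustration of reading such a file, the sketch below parses a small UTF-8 encoded, TEI-style document with Python's standard library. The sample document, the `n` attribute on paragraphs, and the `read_paragraphs` helper are assumptions made for this example; the actual element layout of the corpus files may differ and should be checked against the distribution.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI P4-style document (not taken from the corpus).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2>
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Zweiter Absatz mit Umlauten: äöü.</p>
    </body>
  </text>
</TEI.2>""".encode("utf-8")


def read_paragraphs(xml_bytes):
    """Return (n, text) tuples for every <p> element in the document."""
    root = ET.fromstring(xml_bytes)
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


paragraphs = read_paragraphs(SAMPLE)
for number, text in paragraphs:
    print(number, text)
```

Passing bytes rather than a decoded string lets the parser honour the encoding declared in the XML prolog.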
Two different alignments have been performed: one at the paragraph level (with Vanilla) and one at the sentence level (with HunAlign). The corpus is also manually classified according to EUROVOC subject domains.
This combination of multilinguality and subject domain coding makes the corpus a valuable resource for NLP research. It can be used for cross-language research, as well as for developing classification algorithms and keyword-assignment software.
Identification
Period of coverage :
Version : 3.0
Version history :
Applications
Application area : Research
Contents
written corpus
Number of languages : Multilingual
Language(s) : Czech ; Danish ; Dutch ; English ; Estonian ; German ; Greek ; Finnish ; French ; Hungarian ; Italian ; Latvian ; Lithuanian ; Maltese ; Polish ; Portuguese ; Romanian ; Slovak ; Slovene ; Spanish ; Swedish ; Bulgarian
Alignment : Multilingual
Saturday 23 November, 2024
Joint Copyright © 2008 ELRA & ELDA