Universal Catalogue  
Catalog Reference : ELRA-U-W 0030
The Leipzig Corpora Collection
The Leipzig Corpora Collection contains monolingual corpora for 15 languages. The corpora are comparable in genre (newspapers, web documents). The data are available as plain text or as MySQL database tables, and a Corpus Browser is also available. The collection is useful for comparative language studies and for developing and testing NLP applications.

Languages : Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Sorbian, Swedish, Turkish. Other languages are in preparation.

The number of words varies from corpus to corpus (from 1 million to 30 million).

Note that many documents are taken from the web (Wikipedia, for example).
Applications
Application Area : Research
Technical Information
File format : Plain text
Contents : Written corpus

Joint Copyright © 2008 ELRA & ELDA
Universal Catalogue 1.0.4