Universal Catalogue


Written Corpora

Displaying 61 to 80 (of 730 products)


ELRA-U-W 0052

Lexesp Corpus

Lexesp is a balanced Spanish corpus of 6,000,000 words, published in 2000. It covers several written categories: different literary genres, newspaper articles, and scientific texts.
Language(s) : Spanish (Spain)


ELRA-U-W 0053

Pour la science SMS Corpus

This French corpus contains 30,000 SMS (Short Message Service) messages collected in Belgium as part of the project 'Give your SMS to science' ('Faites don de vos SMS à la science').
Language(s) : French (Belgium)


ELRA-U-W 0054

Carmel Corpus

This is a multilingual aligned corpus of literary texts in four languages: English, French, Italian and Spanish. It contains 10,000,000 words from 36 classic travel narratives dating from the 19th to the early 20th century.
Language(s) : French (France) / English (United Kingdom) - French (France) / Italian (Italy) - French (France) / Spanish (Spain)


ELRA-U-W 0055

IJS-ELAN Slovene-English Parallel Corpus (IJS-ELAN)

This Slovene-English parallel corpus is composed of 15 texts and contains 500,000 words per language. It is tokenised, sentence-segmented and aligned (encoding: XML, TEI P4).
Language(s) : Slovenian (Slovenia) / English (United Kingdom)


ELRA-U-W 0056

Czech-English Parallel Corpus (CzEng)

This Czech-English parallel corpus contains approximately 90 million words per language. It was compiled between 2005 and 2009 from documents in various fields: European law, information technology and fiction. In the latest version (0.9), texts from parallel web pages, electronically available books and subtitles have been added.
Language(s) : Czech (Czech Republic) / English (United Kingdom)


ELRA-U-W 0057

The Croco Corpus (German-English Parallel Corpus)

This is a German-English parallel corpus of one million words.
The texts are comparable across eight registers; both translation directions are represented for each register.
Language(s) : German (Germany) / English (United Kingdom)


ELRA-U-W 0058

The IPI-PAN Corpus

The IPI-PAN corpus is a Polish written corpus of more than 250 million segments. Various genres are represented (in unbalanced proportions): contemporary prose, older prose, science, newspapers, parliamentary proceedings, law. The corpus is morphosyntactically annotated.
Language(s) : Polish (Poland)


ELRA-U-W 0059

Enron Email Corpus

This American English database contains 500,000 emails, organised in folders, from 158 users in senior management at Enron.
Language(s) : English (USA)


ELRA-U-W 0060

Corpus of Spoken Professional American English (CSPA)

This American English corpus contains transcripts of professional conversations recorded between 1994 and 1998. It comprises 2 million words from 400 speakers.
Language(s) : English (USA)


ELRA-U-W 0061

AnCora-DEP-CAT Catalan Treebank

AnCora-DEP-CAT is a Catalan corpus of 478,876 words (still under development). The 16,633 sentences of the corpus have been annotated with dependencies.
Language(s) : Catalan (Spain)


ELRA-U-W 0062

AnCora-DEP-ESP Spanish Treebank

AnCora-DEP-ESP is a Spanish corpus of 95,028 words (still under development). The 3,512 sentences of the corpus have been annotated with dependencies.
Language(s) : Spanish (Spain)


ELRA-U-W 0063

The Triptic Corpus (English, French and Dutch)

The Triptic corpus is a trilingual parallel corpus for English, French and Dutch. It contains 2,000,000 words and is aligned at paragraph level. It can be divided into two parts: fiction and non-fiction.
Language(s) : English (United Kingdom) / Dutch - Dutch / French - English (United Kingdom) / French


ELRA-U-W 0064

Chinese-English Parallel Corpus

This Chinese-English parallel corpus is still under construction and will amount to 17 million words in each language when completed. It gathers texts from various genres: newspapers, technical articles, literature, film transcripts, etc.
Language(s) : Chinese (China) / English (United Kingdom)


ELRA-U-W 0065

The UCLA Chinese corpus

This is a written corpus of modern Chinese containing one million tokens from texts collected between 2000 and 2005. It is segmented and POS-tagged.

It can be considered a recent update of the Lancaster Corpus of Mandarin Chinese (LCMC), available from the ELRA catalogue under the reference ELRA-W0039.
Language(s) : Chinese (China)


ELRA-U-W 0066

Tagged Corpus of Spoken Professional American English

This is the tagged version of the Corpus of Spoken Professional American English, which contains transcripts of professional conversations recorded between 1994 and 1998 (2 million words from 400 speakers).
Language(s) : English (USA)


ELRA-U-W 0067

PWN Corpus of the Polish Language

This Polish corpus consists of texts from books, periodicals, web sites, ephemera and transcripts of spoken texts. It is a balanced corpus of 70 million words. For copyright reasons, the only data available is a sampler of 7.5 million words (the demonstration version of the online corpus).
Language(s) : Polish (Poland)


ELRA-U-W 0068

Christine Corpus

This is a treebank of 100,000 words that covers mostly spontaneous, informal spoken English of the 1990s. It offers structural analyses of a cross-section of speech from all British regions, social classes, etc.
Language(s) : English (United Kingdom)


ELRA-U-W 0069

Lucy Corpus

It represents written English in modern Britain (from published prose to the less-skilled writing of young adults and nine-to-twelve-year-old children). It is a 165,000 'word' treebank (compound words do not count as a single word) that was compiled between 2000 and 2003.
Language(s) : English (United Kingdom)


ELRA-U-W 0070

The International Corpus of Learner English (ICLE)

This corpus of argumentative essay writing contains over 3.7 million words written by advanced learners of English from 19 different mother tongue backgrounds. It is divided according to the mother tongue (the target size for each sub-corpus is 200,000 words).
Language(s) : English


ELRA-U-W 0071

Louvain Corpus of Native English Essays (LOCNESS)

This corpus contains 324,304 words from essays by native English speakers. It is divided into three parts: British pupils' A-level essays, British university students' essays, and American university students' essays.
Language(s) : English (United Kingdom) - English (USA)

