Displaying 61 to 80 (of 730 products)
Lexesp is a balanced Spanish corpus of 6,000,000 words, published in 2000. It covers several written categories: various literary genres, newspaper articles and scientific texts.
Language(s) : Spanish (Spain)
This French corpus contains 30,000 SMS (Short Message Service) messages collected in Belgium as part of the project 'Give your SMS to science' ('Faites don de vos SMS à la science').
Language(s) : French (Belgium)
This is a multilingual aligned corpus of literary texts in four languages: English, French, Italian and Spanish. It contains 10,000,000 words from 36 classic travel narratives dating from the 19th to the early 20th century.
Language(s) : French (France) / English (United Kingdom) - French (France) / Italian (Italy) - French (France) / Spanish (Spain)
This Slovene-English parallel corpus is composed of 15 texts and contains 500,000 words per language. It is tokenised, sentence-segmented and aligned, and is encoded in XML (TEI/P4).
Language(s) : Slovenian (Slovenia) / English (United Kingdom)
This Czech-English parallel corpus contains approximately 90 million words per language. It was compiled between 2005 and 2009 from documents in various fields: European law, information technology and fiction. In the latest version (0.9), texts from parallel web pages, electronically available books and subtitles were added.
Language(s) : Czech (Czech Republic) / English (United Kingdom)
This is a German-English parallel corpus of one million words.
The texts are comparable across eight registers; both translation directions are represented for each register.
Language(s) : German (Germany) / English (United Kingdom)
The IPI-PAN corpus is a Polish written corpus of more than 250 million segments. Various genres are represented (in unbalanced proportions): contemporary prose, older prose, science, newspapers, parliamentary proceedings, law. The corpus is morphosyntactically annotated.
Language(s) : Polish (Poland)
This American English database contains 500,000 emails, organised in folders, from 158 users in senior management at Enron.
Language(s) : English (USA)
This American English corpus contains transcripts of professional conversations recorded between 1994 and 1998. It comprises 2 million words from 400 speakers.
Language(s) : English (USA)
AnCora-DEP-CAT is a Catalan corpus of 478,876 words (still under development). The 16,633 sentences of the corpus have been annotated with dependencies.
Language(s) : Catalan (Spain)
AnCora-DEP-ESP is a Spanish corpus of 95,028 words (still under development). The 3,512 sentences of the corpus have been annotated with dependencies.
Language(s) : Spanish (Spain)
The Triptic corpus is a trilingual parallel corpus for English, French and Dutch. It contains 2,000,000 words and is aligned at paragraph level. It is divided into two parts: fiction and non-fiction.
Language(s) : English (United Kingdom) / Dutch - Dutch / French - English (United Kingdom) / French
This corpus, which is still under construction, is a Chinese-English parallel corpus that will amount to 17 million words in each language when completed. It gathers texts from various genres: newspapers, technical articles, literature, film transcripts, etc.
Language(s) : Chinese (China) / English (United Kingdom)
This is a modern written Chinese corpus of one million tokens, drawn from texts collected between 2000 and 2005. It is segmented and POS-tagged.
It can be considered as a recent update of the Lancaster Corpus of Mandarin Chinese (LCMC), available from the ELRA catalogue under the reference ELRA-W0039.
Language(s) : Chinese (China)
This is the tagged version of the Corpus of Spoken Professional American English, which contains transcripts of professional conversations recorded between 1994 and 1998 (2 million words from 400 speakers).
Language(s) : English (USA)
This Polish corpus consists of texts from books, periodicals, web sites, ephemera and transcripts of spoken texts. It is a balanced corpus of 70 million words. For copyright reasons, the only data available is a sampler of 7.5 million words (the demonstration version of the online corpus).
Language(s) : Polish (Poland)
This is a treebank of 100,000 words that covers mostly spontaneous, informal spoken English of the 1990s. It offers structural analyses of a cross-section of speech from all British regions, social classes, etc.
Language(s) : English (United Kingdom)
This treebank represents written English in modern Britain, from published prose to the less-skilled writing of young adults and nine-to-twelve-year-old children. It contains 165,000 'words' (compound words do not count as a single word) and was compiled between 2000 and 2003.
Language(s) : English (United Kingdom)
This corpus of argumentative essay writing contains over 3.7 million words written by advanced learners of English from 19 different mother-tongue backgrounds. It is divided into sub-corpora by mother tongue (the target size for each sub-corpus is 200,000 words).
Language(s) : English
This corpus contains 324,304 words from essays by native English speakers. It is divided into three parts: British pupils' A-level essays, British university students' essays and American university students' essays.
Language(s) : English (United Kingdom) - English (USA)