|
Language Resources |
Displaying 361 to 380 (of 730 products) |
Result Pages: 19 |
NoWaC is a 700-million-word Norwegian corpus built from web pages in the .no domain.
Language(s) : Norwegian
|
|
|
|
This is an extensive collection of texts published between 1990 and 2006, which represents a balanced sample of texts in Slovenian. The FidaPLUS corpus extends the FIDA corpus to 600 million words.
Language(s) : Slovenian (Slovenia)
|
|
|
|
This is a parallel corpus of 3,110 English sentences from the Penn Treebank Corpus manually translated into Lao.
Language(s) : English >>>> Lao
|
|
|
|
This is a parallel corpus of 1 million words in English and Indonesian (Bahasa Indonesia).
Language(s) : English <<< >>> Indonesian
|
|
|
|
This corpus contains 250,000 sentences aligned in English and Indonesian (about 2.5 million words) from articles published between 2000 and 2007 through the ANTARA News Agency.
Language(s) : English <<< >>> Indonesian
|
|
|
|
This is a corpus of news texts in Bangla (Bengali). It is also called the Prothom-Alo corpus.
Language(s) : Bengali
|
|
|
|
This is a corpus of 100,000 words of Sinhala text translated into English. It is annotated with POS tags and aligned at the sentence level.
Language(s) : Sinhalese >>>> English
|
|
|
|
This is a written corpus that covers both official and everyday colloquial language.
Language(s) : Khmer
|
|
|
|
This corpus consists of sentences translated into Indonesian from the English part of the BTEC Corpus. Annotation includes POS tagging, syllabification and word-stress tagging, in XML format.
Language(s) : English >>>> Indonesian
|
|
|
|
This corpus contains sampled paragraphs of the Slovene reference corpus FidaPLUS (see U-W0355), morphosyntactically tagged with context-disambiguated MSDs and lemmas. The selection has been converted from SGML to XML following the TEI P4 guidelines, and the earlier annotation tagset has been updated.
It consists of two corpora: the jos100k corpus and the jos1M.
Language(s) : Slovenian
|
|
|
|
This is a wide collection of 4,158 Slovenian texts from various categories: newspapers, magazines, formal speech, fiction, non-fiction, scientific and technical texts. It contains about 162 million words, marked at the sentence level.
Language(s) : Slovenian
|
|
|
|
This is a corpus of e-mail messages on business or private matters. The messages were collected through simulation from men and women aged 10 to 39, using specified mobile phones or PCs.
Language(s) : Japanese
|
|
|
|
This corpus contains 40,000 sentences. It consists of triplets of the English original, draft Japanese translation and final Japanese translation of books and online articles.
Language(s) : English - Japanese
|
|
|
|
This is a parallel corpus of more than 3 million words containing Dutch texts and their French translation aligned at sentence level.
It is under construction.
Language(s) : Dutch >>>> French
|
|
|
|
This is a corpus of 1 million words containing the complete works of the Spanish poet Federico García Lorca. The corpus is tokenized, POS-tagged and lemmatized.
Language(s) : Spanish
|
|
|
|
This is a subtitle corpus for Dutch. It contains 44 million words from 8,443 films and television series.
Language(s) : Dutch
|
|
|
|
This is a subtitle corpus for Chinese. It contains 33.5 million words (46.8 million characters) from 6,243 different contexts (7,148 files).
Language(s) : Chinese
|
|
|
|
This is a large corpus of discharge letters collected from a medical records system used in a hospital in Sweden.
Language(s) : Swedish
|
|
|
|
The FDV-IJS is a Slovene monolingual corpus which contains over 5.5 million words.
Language(s) : Slovenian (Slovenia)
|
|
|
|
This is a restricted-domain corpus of Slovene-English parallel texts from the Slovenian Ministry of Defense.
Language(s) : Slovenian <<< >>> English
|
|
|
|