Language Resources
Displaying 121 to 140 (of 730 products)
Result Pages: 7

The Helsinki Corpus of Swahili comprises 12.5 million words from news texts. It has been annotated with SALAMA (Swahili Language Manager) for lemma, part-of-speech and morphology.
Language(s) : Swahili

The Finnish Language Text Collection (Suomen kielen tekstikokoelma) contains 180 million running tokens of written Finnish from the 1990s. About half of the corpus is annotated with morpho-syntactic information.
Language(s) : Finnish (Finland)

The Finnish Swedish Text Collection contains 34,412,586 words of written Finland Swedish from the 1990s. Part of the corpus is annotated with morpho-syntactic information.
Language(s) : Swedish (Finland)

The morphology archive for Finnish dialects has existed since 1967. It includes 2 million phrasal samples illustrating the morphological system of Finnish dialects and was recently converted to electronic form in SGML.
Language(s) : Finnish (Finland)

This corpus contains transcripts of continuous spontaneous speech collected in Britain during the 1970s and the late 1980s. The recordings have been transcribed orthographically. Approximately 860,000 words have been edited so far; the target size is one million words.
Language(s) : English (United Kingdom)

The corpus contains approximately 100,000 running words of modern Russian, drawn from extracts of a modern Russian magazine.
The HANCO project, which has been running since 2001, aims to build a corpus annotated with morphological, syntactic and functional information.
Language(s) : Russian (Russia)

The SIGANN shared open corpus consists of texts drawn from the 2nd release of the American National Corpus. The aim of the project is to gather as many annotations of the corpus as possible.
Language(s) : English (USA)

MultiSemCor is composed of 116 English texts and their 116 Italian translations, for a total of about 500,000 tokens. The texts are aligned at the word level and contain semantic annotations.
Language(s) : English - Italian

This treebank is composed of a 220,000-word balanced corpus and a 90,000-word specialized corpus for the financial domain. Most of the data is annotated with morphosyntactic, syntactic and lexico-semantic information.
Language(s) : Italian (Italy)

The BOLC is a collection of comparable legal texts in English and in Italian.
Number of words in Italian: 33.5 million
Number of words in English: 21 million
Language(s) : English (United Kingdom) - Italian (Italy)

The TILT corpus is a French-English bilingual collection of standards in XML, provided by AFNOR (the French standards organisation). It comprises 1,000 standards, representing approximately 35,000 pages.
It has been annotated at three levels: structural, morpho-syntactic and semantic.
Language(s) : French (France) - English (United Kingdom)

This French corpus contains transcriptions of 14 interviews of 30 to 60 minutes each, for a total of approximately 90,000 words. Conversations between tenants, lessors and estate agency staff were recorded in 2004 and then transcribed orthographically.
Language(s) : French (France)

This French corpus contains 10,000 words. It is lemmatised, POS-tagged and morphologically annotated in accordance with the TEI P4 standard.
Language(s) : French (France)

This is a Bulgarian corpus containing more than 35 million words. It is composed of original and translated texts from various genres.
Language(s) : Bulgarian (Bulgaria)

The Structured Corpus of Bulgarian printed editions contains electronic versions of printed documents covering the period from 1945 to 2010.
Language(s) : Bulgarian (Bulgaria)

The Tagged Corpus of Bulgarian is composed of 300-word extracts from the Brown Corpus of Bulgarian, for a total of 200,000 words. These extracts have been POS-annotated by language experts.
Language(s) : Bulgarian

The Semantic Corpus of Bulgarian consists of excerpts from the Brown Corpus of Bulgarian, for a total of 75,000 sense-annotated words.
Language(s) : Bulgarian (Bulgaria)

These corpora are representative of various specific domains:
- Parliament Proceedings: 1,735,718 words
- Law and Court: 848,986 words
- Politics: 489,071 words
- Economics: 753,734 words
- Medicine: 728,482 words
- Sports: 441,017 words
- Army: 202,143 words
- Lifestyle: 89,891 words
- Law and Economics: 369,699 words
- Law and Medicine: 219,016 words
- Law and Sports: 46,693 words
- Bulgarian Laws: 1,147,133 words
- “24 chasa” newspaper: 7,368,711 words
- Fiction: 20,000,000 words
Language(s) : Bulgarian (Bulgaria)

This Greek-English parallel corpus is composed of the English novel 1984 (G. Orwell) aligned at the sentence level with its Greek translation. Lexical units have also been annotated with lemma and morpho-lexical information.
Language(s) : Greek (Greece) - English (United Kingdom)

This corpus of written Latvian contains approximately 20 million running words of various genres.
Language(s) : Latvian