Universal Catalogue  
Written Corpora
Displaying 121 to 140 (of 730 products)

ELRA-U-W 0112
The Helsinki Corpus of Swahili (HCS)


The Helsinki Corpus of Swahili comprises 12.5 million words from news texts. It has been annotated with SALAMA (Swahili Language Manager) for lemma, part-of-speech and morphology.
Language(s) : Swahili


ELRA-U-W 0113
Finnish Language Text Collection 


The Finnish Language Text Collection (Suomen kielen tekstikokoelma) contains 180 million running tokens of written Finnish from the 1990s. About half of the corpus is annotated with morpho-syntactic information.
Language(s) : Finnish (Finland)


ELRA-U-W 0114
Finnish Swedish Text Collection 


The Finnish Swedish Text Collection contains 34,412,586 words of written Finland Swedish from the 1990s. Part of the corpus is annotated with morpho-syntactic information.
Language(s) : Swedish (Finland)


ELRA-U-W 0115
Digital Morphology Archives for Finnish Dialects (DMA)


The morphology archive for Finnish dialects has existed since 1967; it includes 2 million phrasal samples illustrating the morphological system of Finnish dialects. It was recently converted to electronic form in SGML.
Language(s) : Finnish (Finland)


ELRA-U-W 0116
The Helsinki Corpus of British English Dialects (HD)


This corpus contains transcripts of continuous spontaneous speech collected in Britain during the 1970s and the late 1980s. The recordings have been transcribed orthographically. Approximately 860,000 words have been edited so far, and the target size is one million.
Language(s) : English (United Kingdom)


ELRA-U-W 0117
The Helsinki Annotated Corpus of Russian (HANCO)


The corpus contains approximately 100,000 running words, representing modern Russian. It is composed of extracts from a modern Russian magazine.
The HANCO project, which has been running since 2001, aims at the construction of a corpus annotated with morphological, syntactic and functional information.
Language(s) : Russian (Russia)


ELRA-U-W 0118
SIGANN Shared Open Corpus 


The SIGANN shared open corpus consists of texts drawn from the 2nd release of the American National Corpus. The aim of the project is to gather as many annotations of the corpus as possible.
Language(s) : English (USA)


ELRA-U-W 0120
The MultiSemCor Corpus 


MultiSemCor is composed of 116 English texts and their 116 Italian translations, for a total of about 500,000 tokens. The texts are aligned at word level and contain semantic annotations.
Language(s) : English - Italian


ELRA-U-W 0121
The Italian SI-TAL Treebank 


This treebank is composed of a 220,000-word balanced corpus and a 90,000-word specialized corpus for the financial domain. Most of the data is annotated with morphosyntactic, syntactic and lexico-semantic information.
Language(s) : Italian (Italy)


ELRA-U-W 0122
BOnonia Legal Corpus (BOLC)


The BOLC is a collection of comparable legal texts in English and in Italian.
Number of words in Italian: 33.5 million
Number of words in English: 21 million
Language(s) : English (United Kingdom) - Italian (Italy)


ELRA-U-W 0123
TILT Corpus 


The TILT corpus is a French-English bilingual collection of standards in XML, provided by AFNOR (the French standards organisation). It comprises 1,000 standards, representing approximately 35,000 pages.
It has been annotated at three levels: structural, morpho-syntactic and semantic.
Language(s) : French (France) - English (United Kingdom)


ELRA-U-W 0124
Real Estate French Corpus 


This French corpus contains transcriptions of 14 interviews of 30 to 60 minutes each, for a total of approximately 90,000 words. Conversations between tenants, lessors and estate agency staff were recorded in 2004 and then transcribed orthographically.
Language(s) : French (France)


ELRA-U-W 0125
OuRAL Written Corpus 


This French corpus contains 10,000 words. It is lemmatised, POS tagged and morphologically annotated in accordance with the TEI P4 standard.
Language(s) : French (France)


ELRA-U-W 0126
Bulgarian Written Corpus 


This is a Bulgarian corpus containing more than 35 million words. It is composed of original and translated texts from various genres.
Language(s) : Bulgarian (Bulgaria)


ELRA-U-W 0127
Structured Corpus of Bulgarian printed editions 


The Structured Corpus of Bulgarian printed editions contains electronic versions of printed documents covering the period from 1945 to 2010.
Language(s) : Bulgarian (Bulgaria)


ELRA-U-W 0128
Tagged Corpus of Bulgarian (BulPosCor)


The Tagged Corpus of Bulgarian is composed of 300-word extracts from the Brown Corpus of Bulgarian, for a total of 200,000 words. The extracts have been POS-annotated by language experts.
Language(s) : Bulgarian


ELRA-U-W 0129
Semantic Corpus of Bulgarian (BulSemCor)


The Semantic Corpus of Bulgarian consists of excerpts from the Brown Corpus of Bulgarian for a total of 75,000 sense-annotated words.
Language(s) : Bulgarian (Bulgaria)


ELRA-U-W 0130
Domain-specific Corpora of Bulgarian 


These corpora are representative of various specific domains:

- Parliament Proceedings: 1,735,718 words
- Law and Court: 848,986 words
- Politics: 489,071 words
- Economics: 753,734 words
- Medicine: 728,482 words
- Sports: 441,017 words
- Army: 202,143 words
- Lifestyle: 89,891 words
- Law and Economics: 369,699 words
- Law and Medicine: 219,016 words
- Law and Sports: 46,693 words
- Bulgarian Laws: 1,147,133 words
- “24 chasa” newspaper: 7,368,711 words
- Fiction: 20,000,000 words
Language(s) : Bulgarian (Bulgaria)


ELRA-U-W 0131
1984 Greek-English Corpus 


This Greek-English parallel corpus is composed of the English novel 1984 (G. Orwell) aligned at sentence level with its Greek translation. Lexical units have also been annotated with lemma and morpho-lexical information.
Language(s) : Greek (Greece) - English (United Kingdom)


ELRA-U-W 0132
Latvian Corpus of Written Texts 


This corpus of written Latvian contains approximately 20 million running words of various genres.
Language(s) : Latvian



Joint Copyright © 2008 ELRA & ELDA
Universal Catalogue 1.0.4