Universal Catalogue

Written Corpora

Displaying 461 to 480 (of 730 products)

ELRA-WC0162

NEGRA treebank of German

The NEGRA corpus version 2 contains 355,096 tokens (20,602 sentences) from a German newspaper (the Frankfurter Rundschau).
Language(s) : German (Germany)

ELRA-WC0163

Reuters Corpus

The basis of this corpus is the Reuters Financial News Service comprising 9,063 XML tagged texts, 3.63 million tokens, published during Jan-Dec 2002.
Language(s) : English

ELRA-WC0165

Stockholm-Umeå Corpus (SUC)

The Stockholm-Umeå corpus (SUC) is a Swedish corpus of 1 million words, annotated with part-of-speech, inflectional form and lemma.
Language(s) : Swedish

ELRA-WC0166

The Spartacus Database

It consists of offline handwritten Spanish sentences from four different subtasks. A total of around 100,000 word instances out of a vocabulary of around 3,300 words occur in the collection.
Language(s) : Spanish

ELRA-WC0167

The Basque corpus

It consists of 2,706,809 words and includes all the articles published by the Elhuyar Foundation on the zientzia.net site until 2003.
Language(s) : Basque

ELRA-WC0168

Arabic full-form lexicon

This corpus has been converted into a two-level Finite State Transducer (FST) for morphology analysis and generation.
Language(s) : Arabic

ELRA-WC0169

Hand-annotated part of BNC

It contains about 5 million words.
Language(s) : English

ELRA-WC0170

Bulgarian and Croatian Comparable Corpus

This is a Bulgarian-Croatian comparable corpus. It is based on two newspaper subcorpora drawn from larger reference corpora of Bulgarian and Croatian (The Bulgarian Corpus and The Croatian National Corpus - HNK).
Language(s) : Bulgarian - Croatian

ELRA-WC0171

Geographical Gazetteer Lists

Gazetteer lists containing geographical references.
Language(s) : English - Chinese - Arabic - Hindi

ELRA-WC0172

Touring information leaflet corpus

The corpus is a set of 1,100 touring information leaflets, with about 333,000 words and a vocabulary size of 6,300 words.
Language(s) : French

ELRA-WC0173

German Multi-word Expression DB

This one-billion-word corpus is a database of multi-word expressions in which each entry is associated with a sequence of POS-tag/token pairs.
Language(s) : German

ELRA-WC0174

Dependency Annotated Japanese Corpus

This is a phrase-based dependency-annotated corpus of about 38,000 sentences from Mainichi Newspaper articles published in 1995.
Language(s) : Japanese

ELRA-WC0175

Chinese Corpus of People’s Daily Newspaper

This corpus consists of about 20k sentences, annotated with word segmentation, part-of-speech tags and three named-entity tags.
Language(s) : Chinese

ELRA-WC0176

Ungrammatical Sentence Corpus & Grammatical Sentence Corpus

This is a parallel corpus pairing an ungrammatical English sentence corpus with its grammatically corrected counterpart. Each contains 20,000 words drawn from a variety of sources: newspapers, emails, academic papers, websites, etc.
Language(s) : English

ELRA-WC0177

The OPUS Corpus

This is a growing collection of translated documents collected from the internet (tens of millions of words, in 60 languages).
Language(s) : Danish - German - Greek - English - Spanish - Finnish - French - Italian - Dutch - Portuguese - Swedish - Czech - Estonian - Hungarian - Lithuanian - Latvian - Polish - Slovak - Maltese - Slovenian - Afrikaans - Arabic - Azerbaijani - Belarusian - Bulgarian - Breton - Catalan - Welsh - Esperanto - Basque - Hebrew - Croatian - Indonesian - Icelandic - Japanese - Korean - Kurdish - Maori - Macedonian - Occitan - Portuguese (Brazil) - English (United Kingdom) - Romanian - Russian - Tamil - Thai - Turkish - Venda - Vietnamese - Xhosa - Chinese - Chinese (Taiwan) - Zulu - Serbian - Ukrainian - Twi - Irish

ELRA-WC0178

SZAK Corpus

This is an English-Hungarian parallel corpus of technical texts containing 1.2 million words per language.
Language(s) : Hungarian - English

ELRA-WC0179

Computer-domain Corpus

This is an aligned computer-domain corpus containing 74K sentences in five languages.
Language(s) : Spanish - Japanese - French - German - English

ELRA-WC0180

Galician corpus

This corpus of contemporary written Galician is morphosyntactically tagged and contains syntactic and prosodic data: 400,000 words drawn from journalistic texts.
Language(s) : Galician

ELRA-WC0181

German Medical Corpus

This corpus of medical documents in German contains more than 1 million running word forms.
Language(s) : German

ELRA-WC0182

English-German Europarl corpus

This data contains some 20 million words in 63,973 aligned documents in each language.
Language(s) : English - German
