Universal Catalogue

Written Corpora

Displaying 301 to 320 (of 730 products)

ELRA-U-W0293

SYN2000 corpus

The SYN2000 corpus is a synchronic, representative corpus of contemporary written Czech (up to 2000). It contains 100 million words (tokens), lemmatised and part-of-speech tagged.
Language(s) : Czech

ELRA-U-W0294

SYN2006PUB

This is a synchronic written corpus of 300 million words. It contains exclusively journalistic texts in Czech from 1989 to 2004.
Language(s) : Czech

ELRA-U-W0295

Nepali Written Corpus

The Nepali Written Corpus is a part of the Nepali National Corpus (NNC).

This is a monolingual corpus of 15 million words containing texts from various books, magazines, newspapers and websites. It is segmented and POS-tagged.

It is available in the ELRA catalogue http://catalog.elra.info under the reference ELRA-W0076.
Language(s) : Nepali

ELRA-U-W0296

Nepali-English Parallel Corpus

The Nepali-English Parallel Corpus is a part of the Nepali National Corpus (NNC).

This is a parallel corpus of about 4 million words from two different genres: computing and national development.

A part of it is available in the ELRA catalogue http://catalog.elra.info under the reference ELRA-W0077.
Language(s) : Nepali <<< >>> English

ELRA-U-W0297

PukWaC English Web Corpus

PukWaC is the same corpus as ukWaC (a 2-billion-word English corpus constructed from the Web), but its annotation also includes a full dependency parse.
Language(s) : English (United Kingdom)

ELRA-U-W0298

WaCkypedia English corpus

This is a copy of the full content of the English Wikipedia as of 2009. It represents about 800 million tokens and is POS-tagged, lemmatized, and fully parsed with a dependency parser.
Language(s) : English (United Kingdom)

ELRA-U-W0299

CORpus of tagged Political Speeches (CORPS)

This corpus contains more than 3600 written speeches from native English speakers. It represents about 7.9 million words and more than 67 thousand tags marking audience reactions.
Language(s) : English (USA)

ELRA-U-W0300

Personae Corpus

This Dutch-language corpus contains about 200,000 words from essays written by 145 students from Belgium. It is syntactically annotated and provides metadata about the students' personalities.
Language(s) : Dutch, Flemish (Belgium)

ELRA-U-W0301

Urdu-Nepali-English Parallel Corpus

This is a parallel corpus containing 100,720 words (4325 sentences) in common English, translated into Urdu and Nepali.
Language(s) : English >>>> Urdu - English >>>> Nepali

ELRA-U-W0302

Real-word Error Corpus

This is a corpus of 675 sentences (11839 words) containing 833 tagged real-word errors made by dyslexic writers.
Language(s) : English

ELRA-U-W0303

Birkbeck spelling error corpus

This is a corpus of misspellings collected from native and non-native English speakers.
Language(s) : English

ELRA-U-W0304

Diachronic Corpus of Present-Day Spoken English (DCPSE)

This is a corpus of spoken British English covering the period between 1960 and 2000. It contains 885,436 words, fully parsed and annotated (87,000 trees).
Language(s) : English (United Kingdom)

ELRA-U-W0305

English - Luganda Parallel Corpus

This is a small word-aligned parallel corpus of Luganda and English. Luganda is a major language of Uganda and is spoken by 6 million people as a first language.
Language(s) : English <<< >>> other

ELRA-U-W0306

SAWA Corpus

This is a parallel corpus of English and Swahili which contains about a million words for each language.
Language(s) : English <<< >>> Swahili

ELRA-U-W0307

British Columbia Conversation Corpus (BC3)

The British Columbia Conversation Corpus (BC3) contains 40 email threads (3222 sentences) annotated with linguistic features for email summarization.
Language(s) : English

ELRA-U-W0308

Mannheim German Reference Corpus (DeReKo)

The Mannheim German Reference Corpus is a collection of German corpora covering the period from 1956 to 2001. It contains more than 3.9 billion tokens and is part-of-speech tagged.
Language(s) : German

ELRA-U-W0309

Quranic Arabic Corpus

The Quranic Arabic Corpus is a version of the Quran annotated for part-of-speech and associated with a syntactic treebank.
Language(s) : Arabic

ELRA-U-W0310

Turku Dependency Treebank (TDT)

The Turku Dependency Treebank (TDT) is a dependency-annotated treebank of Finnish. It contains articles of fewer than 75 sentences from the Finnish Wikipedia.
Language(s) : Finnish

ELRA-U-W0311

Russian National Corpus (RNC)

This is a collection of written, spoken and multimodal corpora, which represents about 300 million tokens.
Language(s) : Russian - Russian >>>> English - Russian >>>> German

ELRA-U-W0312

Corpus of Greek Texts (CGT)

The Corpus of Greek Texts (CGT) includes spoken and written texts produced between 1990 and 2010. It contains 30 million words.
Language(s) : Greek
