Universal Catalogue

Written Corpora

Displaying 341 to 360 (of 730 products)

ELRA-U-W0333

PW CALO Corpus

This is a corpus of 222 email messages, generated during a four-day exercise.
Language(s) : English

ELRA-U-W0334

The W3C Corpus

The W3C corpus contains data collected from a crawl of the World Wide Web Consortium’s sites (w3c.org). It includes mailing lists, public web pages (HTML), and text derived from other file types such as PDF.

The W3C data has been annotated for question answering (QA) topic relevance for use in the TREC Enterprise track in 2005 and 2006.
Language(s) : English

ELRA-U-W0335

The CSIRO Corpus

This corpus contains 370,715 documents collected from a crawl of the Australian CSIRO organization's websites (*.csiro.au).

The CSIRO Corpus has been annotated for question answering (QA) topic relevance for use in the TREC Enterprise track in 2007.
Language(s) : English (Australia)

ELRA-U-W0336

Corpus of Contemporary Sinhala

This is a corpus of 10,000,000 words, which presents the modern usage of Sinhala (or Sinhalese), a language spoken in Sri Lanka.
Language(s) : Sinhalese

ELRA-U-W0337

Galician Technical Corpus (CTG)

This is a monolingual corpus of contemporary specialized Galician. It contains about 12 million words.
Language(s) : Galician

ELRA-U-W0338

Essex Arabic Summaries Corpus (EASC)

This is an Arabic corpus which contains 153 Arabic articles and 765 human-generated extractive summaries of these articles.
Language(s) : Arabic

ELRA-U-W0339

Arabic Propbank (APB)

The Arabic Propbank contains 560 predicates annotated with their relevant arguments in running texts. It is based on 200,000 words from the Arabic Treebank (version 2).
Language(s) : Arabic

ELRA-U-W0340

The Corpus of Academic Lithuanian (CorALit)

This is a specialised synchronic corpus of about 9 million words, including academic texts published between 1999 and 2009 in various areas.
Language(s) : Lithuanian

ELRA-U-W0341

Sejong Korean-Japanese Bilingual Corpus (SKJBC)

This corpus consists of 50 documents in Korean (4,030 sentences) with their translations into Japanese (4,080 sentences). It is aligned at the sentence and paragraph levels and is annotated in XML format.
Language(s) : Korean <<< >>> Japanese

ELRA-U-W0342

The comparable corpus of English and Russian news texts

This is a comparable corpus of English and Russian news texts. The English part contains newswire texts from 1996 to 1997 (83,491,119 words). The Russian part contains articles from 2000 to 2001 (14,564,884 words) and other texts from various genres (50,512,584 words).
Language(s) : English - Russian

ELRA-U-W0344

GIVE-2 corpus

This is a corpus of written human instructions collected in a virtual game built on the GIVE-2 software infrastructure. It consists of 45 German and 63 American English written discourses in which one subject guided another through a treasure-hunt-style task in virtual worlds.
Language(s) : English (USA) - German

ELRA-U-W0345

Prague Dependency Treebank (PDT)

The Prague Dependency Treebank is a multi-level corpus of Czech in the form of dependency analytical trees. It consists of 7,110 annotated articles from newspapers and journals, containing 115,844 sentences with 1,957,247 tokens.
Language(s) : Czech

ELRA-U-W0346

New Testament corpus

This is a morphologically tagged and syntactically parsed corpus of the Ancient Greek text of the Gospels.
Language(s) : Greek

ELRA-U-W0347

The Michigan Corpus of Upper-level Student Papers (MICUSP)

This is a corpus of student academic writing samples. It comprises around 830 A-grade papers (2.6 million words) covering various disciplines.
Language(s) : English

ELRA-U-W0348

MULINCO corpus

This is a multilingual corpus which contains both parallel and comparable texts, fully annotated.
Language(s) : Danish - English - French - German - Italian - Spanish

ELRA-U-W0349

The Helsinki Corpus of Somali

The Helsinki Corpus of Somali comprises 6,430 tagged words from running text, in SGML format.
Language(s) : Somali

ELRA-U-W0350

OKMA Uspanteko corpus

This is a corpus in the Mayan language Uspanteko. It contains 284,000 words of transcribed text, of which 74,000 words are glossed. It also includes translations into Spanish and English.
Language(s) : other - Spanish - English

ELRA-U-W0351

Russian-Finnish parallel corpus of literary texts (ParRus)

This is a corpus of Russian classical and 20th-century literature with translations into Finnish.
Language(s) : Russian <<< >>> Finnish

ELRA-U-W0352

Comparable Russian-Finnish corpus of juridical texts (FinRusLex)

This is a comparable corpus of juridical texts in Russian and Finnish.
Language(s) : Russian <<< >>> Finnish

ELRA-U-W0353

Multilingual corpus of juridical texts (MulJur)

This is a multilingual corpus of juridical texts in English, German, Russian and Swedish.
Language(s) : English - German - Russian - Swedish
