Universal Catalogue

Written Corpora

Displaying 441 to 460 (of 730 products)

ELRA-WC0140

CROSSMARC Corpus

The purpose of the project was the development of a corpus, cross-lingual name matching, and multilingual named entity recognition.
Language(s) : English - Greek - French - Italian

ELRA-WC0141

PDF Corpus

It consists of 101 PDF documents (181,748 words) with a great variety in their content, appearance, style, and structure.
Language(s) : English

ELRA-WC0142

WCL Generic Corpus

It consists of 5,500 words of newspaper articles, paragraphs of literature, and sentences constructed and annotated by a professional linguist.
Language(s) : Greek

ELRA-WC0143

Swedish-English Corpus

The parallel corpus, of approximately 50,000 tokens, comprises agricultural reports, specifications and circulars produced within the European Union.
Language(s) : Swedish - English

ELRA-WC0144

La Repubblica corpus

It contains newspaper texts, amounting to 175 million words.
Language(s) : Italian

ELRA-WC0145

Diachronic Italian corpus of nuclear physics texts

This is a diachronic Italian corpus, divided into three subcorpora, of nuclear physics texts belonging to different genres.
Language(s) : Italian

ELRA-WC0146

Diachronic English corpus of nuclear physics articles

This corpus contains the abstracts of newspapers and is structured into diachronic stages of around ten years.
Language(s) : English

ELRA-WC0147

Prague Czech-English Dependency Treebank (PCEDT)

This is a parallel treebank and a Czech-English syntactically annotated resource.
Language(s) : Czech - English

ELRA-WC0148

Czech National Corpus (CNC)

This corpus is composed of computer-based texts of over 400 million words.
Language(s) : Czech

ELRA-WC0149

English corpus of biotechnology business information

It contains 840 newspaper documents (452,000 words in total) related to biotechnology business information.
Language(s) : English

ELRA-WC0150

The DELOS Corpus

The Delos corpus is a collection of economic domain texts of approximately five million words and of varying genre (press reportage, news, articles, interviews and scientific studies). It has been automatically annotated.
Language(s) : Greek

ELRA-WC0151

AAC - Austrian Academy Corpus

The AAC is a very large and complex electronic text collection.
Language(s) : German - English - Russian

ELRA-WC0154

Penn Discourse Treebank (PDTB)

This treebank aims to produce a large-scale corpus in which approximately 30,000 discourse connectives are annotated.
Language(s) : English

ELRA-WC0155

TIGER Treebank

It consists of approximately 700,000 tokens (40,000 sentences) of semi-automatically tagged German newspaper text.
Language(s) : German

ELRA-WC0156

Tübingen Treebank of Written German (TüBa-D/Z)

The TüBa-D/Z treebank contains 45,200 sentences (794,079 tokens) taken from a German newspaper corpus (data based on 'die tageszeitung' from taz). The syntactic annotation was performed manually.
Language(s) : German

ELRA-WC0157

The GNOME Corpus

The GNOME Corpus includes texts from three genres - museum labels, pharmaceutical leaflets, and tutorial dialogues - in which different types of discourse and semantic information have been annotated.
Language(s) : English

ELRA-WC0158

The Reuters Corpus

This corpus includes over 800,000 English-language news stories.
Language(s) : French - English

ELRA-WC0159

The MuchMore bilingual medical corpus

It includes around 9000 scientific abstracts in various domains with around 1 million tokens for each language.
Language(s) : English - German

ELRA-WC0160

Two Variant Corpora

The corpora consist of 324,616 Korean sentences: half translated from Japanese, and the other half translated from the English sentences that match the original Japanese.
Language(s) : Korean - Japanese - English

ELRA-WC0161

Reuters-21578

It consists of 21,578 news stories that appeared on the Reuters newswire in 1987.
Language(s) : English
