Universal Catalogue

Written Corpora

Displaying 321 to 340 (of 730 products)

ELRA-U-W0313

Hebrew Dependency Treebank

The Dependency Treebank of Hebrew consists of 6,220 fully dependency-parsed sentences.
Language(s) : Hebrew

ELRA-U-W0314

Quechua-Spanish Parallel Treebank

This is a corpus-based parallel treebank of 200 sentences in Quechua and Spanish.
Language(s) : Spanish (Peru) <<< >>> Quechua (Peru)

ELRA-U-W0315

Romanian Dependency Treebank (RDT)

The data contained in this treebank is representative of modern written standard Romanian. This resource is morpho-syntactically tagged.

It consists of 36,150 tokens.
Language(s) : Romanian (Romania)

ELRA-U-W0316

Berlin central station Corpus

This is an English corpus of 1,068 web pages related to the "Berlin central station". It contains 55,255 sentences annotated for Named Entities (NE).
Language(s) : English

ELRA-U-W0317

NP4E corpus

This is a corpus of newswire texts annotated for noun phrase (NP) coreference on 55,000 words and for event coreference on 12,500 words.
Language(s) : English

ELRA-U-W0318

Persian Linguistic Database (PLDB)

It contains various corpora in Modern Persian (Farsi), annotated for Part-of-Speech and/or pronunciation.
Language(s) : Persian

ELRA-U-W0319

UN Corpora

It contains paragraph-aligned parallel corpora in the six official languages of the United Nations, with around 3 million tokens per language.
Language(s) : English - French - Arabic - Chinese - Russian - Spanish

ELRA-U-W0320

PolyU Business Corpus (PUBC)

The PolyU Business Corpus contains 3 comparable corpora of business texts in English, Chinese and Japanese.

It consists of news and reports from the business and finance sections of newspapers, annual reports and press releases from companies, online versions of company brochures and leaflets, ...
Language(s) : Japanese (Japan) - Chinese (Hong Kong) - English (Hong Kong)

ELRA-U-W0321

SFU Review Corpus

This is a collection of review documents labeled with respect to their overall sentiment polarity (positive or negative). It contains 400 reviews in English and 400 in Spanish.
Language(s) : English - Spanish

ELRA-U-W0322

CLiPA corpus

This corpus contains 5 original texts in English, with plagiarised versions of them in English, Spanish and Italian.
Language(s) : English - Spanish - Italian

ELRA-U-W0323

BioInfer Corpus

This is an annotated corpus of biomedical English containing 1,100 sentences. It consists of abstracts of biomedical research articles annotated for relationships, named entities, and syntactic dependencies.
Language(s) : English

ELRA-U-W0324

Greek biomedical corpus

The Greek biomedical corpus contains 11.5 million word-forms from periodical articles and conference papers in modern Greek.

Annotation includes structural data, morphosyntactic and semantic tagging, and identification of biomedical words and multi-word terms. The corpus is annotated in XML format, following TEI guidelines.
Language(s) : Greek (Greece)

ELRA-U-W0325

Death Penalty Corpus (DP Corpus)

It contains 1,152 documents in English collected from pro-death penalty and anti-death penalty websites. Documents are annotated for sentiment and for the document's general tone.
Language(s) : English

ELRA-U-W0326

Corpus annotated for multiword nouns

This is a French corpus annotated for multiword nouns. It contains 166,000 words (8,600 sentences), in which 5,057 occurrences of multiword nouns have been annotated.
Language(s) : French

ELRA-U-W0327

Tagged corpus for Galician language

This is a POS-tagged corpus in Galician which contains 309,505 grammatical elements extracted from newspapers and journals.
Language(s) : Galician

ELRA-U-W0328

PsyCoL Maltese Lexical Corpus (PMLC)

This is a text database of 3,323,325 tokens (53,396 unique tokens) in Maltese collected from on-line newspapers.
Language(s) : Maltese

ELRA-U-W0329

PsyCoL Hebrew Lexical Corpus (PHLC)

This is a text database of 60,052,261 tokens (396,469 unique tokens) in Hebrew collected from on-line newspapers, TV transcripts and medical forums.
Language(s) : Hebrew

ELRA-U-W0330

KFOR Text Corpus

This corpus contains 800 military reports in English (886,000 tokens) from the KFOR activities of the German Federal Army. Separate annotation layers cover parts of speech, Named Entities (NE), structural parts of the document (topic, source, ...) and verbal groups.
Language(s) : English (Germany)

ELRA-U-W0331

TR-CoNLL

The TR-CoNLL corpus contains 946 news articles (204,566 tokens) from the CoNLL shared task, in which 6,980 toponym instances have been annotated.

Toponym Resolution (TR) is the task of mapping a set of potentially ambiguous place names to the latitude/longitude coordinates of the places they are intended to refer to, taking the textual context into account.
Language(s) : English

ELRA-U-W0332

CSpace Email corpus

This is a corpus of around 15,000 email messages.
Language(s) : English
