Universal Catalogue

Written Corpora

Displaying 361 to 380 (of 730 products)

ELRA-U-W0354

NoWaC Norwegian web corpus

NoWaC is a 700-million-word Norwegian corpus built from web pages in the .no domain.
Language(s) : Norwegian

ELRA-U-W0355

The FidaPLUS corpus

This is an extensive collection of texts published between 1990 and 2006, which represents a balanced sample of texts in Slovenian. The FidaPLUS corpus extends the FIDA corpus to 600 million words.
Language(s) : Slovenian (Slovenia)

ELRA-U-W0356

English-Lao Parallel corpus

This is a parallel corpus of 3,110 English sentences from the Penn Treebank Corpus manually translated into Lao.
Language(s) : English >>>> Lao

ELRA-U-W0357

Indonesian - English Parallel Corpus (PANL-BPPT)

This is a parallel corpus of 1 million words in English and Indonesian (Bahasa Indonesia).
Language(s) : English <<< >>> Indonesian

ELRA-U-W0358

ANTARA Corpus

This corpus contains 250,000 sentences aligned in English and Indonesian (about 2.5 million words) from articles published between 2000 and 2007 through the ANTARA News Agency.
Language(s) : English <<< >>> Indonesian

ELRA-U-W0359

Bangla News Corpus

This is a corpus of news in Bangla (Bengali). It is also called the Prothom-Alo corpus.
Language(s) : Bengali

ELRA-U-W0360

English-Sinhala Parallel and Aligned Tagged Corpus

This is a corpus of 100,000 words of Sinhala text translated into English. It is annotated with POS tags and aligned at the sentence level.
Language(s) : Sinhalese >>>> English

ELRA-U-W0361

Khmer Tagged Corpus

This is a written corpus that contains both official and everyday spoken-style language.
Language(s) : Khmer

ELRA-U-W0362

BTEC-ATR Parallel Corpus English - Indonesian

This corpus consists of sentences translated into Indonesian from the English part of the BTEC Corpus. Annotation, in XML format, includes POS tagging, syllabification and word-stress tagging.
Language(s) : English >>>> Indonesian

ELRA-U-W0363

JOS Morphosyntactically Tagged Corpora of Slovene

It contains paragraphs sampled from the Slovene reference corpus FidaPLUS (see U-W0355), morphosyntactically tagged with context-disambiguated MSDs and lemmas. The selection has been converted from SGML to XML following the TEI P4 guidelines, and the former annotation tagset has been updated.

It consists of two corpora: jos100k and jos1M.
Language(s) : Slovenian

ELRA-U-W0364

Nova beseda

This is a broad collection of 4,158 Slovenian texts from various categories: newspapers, magazines, formal speech, fiction, non-fiction, and scientific and technical texts. It contains about 162 million words, marked at the sentence level.
Language(s) : Slovenian

ELRA-U-W0365

Power Shift text corpus

This is a corpus of e-mail messages on business or private matters. The messages were collected through simulation from men and women aged 10 to 39 using specified mobile phones or PCs.
Language(s) : Japanese

ELRA-U-W0366

POMDAF Corpus

This corpus contains 40,000 sentences. It consists of triplets of the English original, draft Japanese translation and final Japanese translation of books and online articles.
Language(s) : English - Japanese

ELRA-U-W0367

REBECA

This is a parallel corpus of more than 3 million words containing Dutch texts and their French translations, aligned at the sentence level.

The corpus is under construction.
Language(s) : Dutch >>>> French

ELRA-U-W0368

LORCA Corpus

This is a corpus of 1 million words containing the complete works of the Spanish poet Federico García Lorca. The corpus is tokenized, POS-tagged and lemmatized.
Language(s) : Spanish

ELRA-U-W0369

SUBTLEX-NL

This is a subtitle corpus for Dutch. It contains 44 million words from 8,443 films and television series.
Language(s) : Dutch

ELRA-U-W0370

SUBTLEX-CH

This is a subtitle corpus for Chinese. It contains 33.5 million words (46.8 million characters) from 6,243 different contexts (7,148 files).
Language(s) : Chinese

ELRA-U-W0371

Corpus of Clinical Data

This is a large corpus of discharge letters collected from a medical system used in a hospital in Sweden.
Language(s) : Swedish

ELRA-U-W0372

FDV-IJS monolingual corpus

FDV-IJS is a monolingual Slovene corpus containing over 5.5 million words.
Language(s) : Slovenian (Slovenia)

ELRA-U-W0373

VoiceTRAN application-specific corpus

This is a restricted-domain corpus of Slovene-English parallel texts from the Slovenian Ministry of Defense.
Language(s) : Slovenian <<< >>> English
