Universal Catalogue

Written Corpora

Displaying 241 to 260 (of 730 products)

ELRA-U-W 0233

The Swedish Immigrant Newspaper Corpus

The Swedish Immigrant Newspaper Corpus is available in nine different languages: Swedish, Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian and Spanish.
Language(s) : Swedish - Albanian - Arabic - English - Finnish - Persian - Polish - Serbo-Croatian - Spanish

ELRA-U-W 0234

Swedish-Turkish Parallel Corpus

This Swedish-Turkish parallel corpus is a balanced corpus composed of fiction and technical documents.

Source language: Swedish (150,000 words)
Target language: Turkish (100,000 words)

Texts are annotated with POS and morphological features. They are automatically aligned at sentence and word levels.
Language(s) : Swedish - Turkish

ELRA-U-W 0235

Swedish Political Texts

The Swedish Political Texts corpus consists of texts from the Swedish government. It is a parallel corpus in five languages: German, English, Spanish, French and Swedish. It contains 11,000 words per language.
Language(s) : Swedish-German - Swedish-English - Swedish-Spanish - Swedish-French

ELRA-U-W 0236

Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN)

CORDIAL-SIN is a 500,000-word corpus of European Portuguese. It consists of transcriptions of spontaneous and semi-directed oral data collected all over the country in various projects. The aim was to gather a representative corpus of the dialects spoken in Portugal.
Data are available in four forms: verbatim transcription, normalized transcription, POS-annotated text and syntactically annotated text.
Language(s) : Portuguese (Portugal)

ELRA-U-W 0237

CORLEX

CORLEX is a Portuguese corpus designed with the aim of compiling a lexicon for European Portuguese. CORLEX was extracted from the CRPC; it contains 6,210,438 words and gathers texts of different types and topics.
Language(s) : Portuguese (Portugal)

ELRA-U-W 0238

MiniCors

MiniCors is a semantically tagged Spanish corpus with 13,477 sentences and 565,782 words. It is partially tagged, with words selected according to frequency and degree of polysemy.
Language(s) : Spanish

ELRA-U-W 0239

MiniCors-Cat

MiniCors-Cat is a semantically tagged Catalan corpus with 6,722 tagged examples, covering 45,509 sentences and 1,451,778 words. The tagging was done using the MiniDir-Cat dictionary.
Language(s) : Catalan

ELRA-U-W 0240

Sejong Corpus

The Sejong corpus is a raw Korean corpus composed of written and spoken texts. It contains 57 million words, plus an additional 75 million from already existing electronic texts.
Language(s) : Korean

ELRA-U-W 0241

DGT Multilingual Translation Memory of the Acquis Communautaire (DGT-TM)

The DGT-TM is a translation memory created from the text collection of the Acquis Communautaire. A translation memory is a collection of small text segments (sentences or parts of sentences) and their translations.
Language(s) : Bulgarian - Czech - Danish - Dutch - English - Estonian - German - Greek - Finnish - French - Italian - Hungarian - Latvian - Lithuanian - Maltese - Polish - Portuguese - Romanian - Slovak - Spanish - Swedish - Slovene

ELRA-U-W 0242

Reference Corpus of Present-day Galician Language (CORGA)

CORGA is a corpus of contemporary Galician (from 1975 to the present). It includes 23 million words across different genres.
Language(s) : Galician (Spain)

ELRA-U-W 0243

ESF Database

The ESF Database is a collection of data from five European countries: France, Germany, Great Britain, The Netherlands and Sweden.
It contains transcriptions of second language data from adult immigrant workers living in Western Europe.
Language(s) : English - German - French - Dutch - Swedish

ELRA-U-W 0244

ANDES Corpus

The ANDES corpus is a collection of recorded and transcribed language materials from the Andes.
Language(s) : Quechua - Spanish

ELRA-U-W 0245

INTERA Multilingual Corpus

The INTERA corpus contains 12 million written words in various domains: law, health, education, tourism, environment, politics, finance.
It is a comparable corpus in which texts are aligned at sentence level (TMX standard), annotated at sentence level, morphologically tagged and lemmatized (XCES).

Language pairs: Bulgarian - English, Greek - English, Serbian - English and Slovene - English.
Language(s) : Bulgarian-English - Greek-English - Serbian-English - Slovene-English

ELRA-U-W 0246

CINTIL Corpus

CINTIL is a linguistically interpreted corpus of Portuguese. It contains 1 million annotated tokens and has been manually verified by linguistic experts.
Language(s) : Portuguese

ELRA-U-W 0247

Treebank of Old Indo-European Languages

This is a parallel treebank of old Indo-European versions of the New Testament. It covers Greek, Latin, Gothic, Armenian and Old Church Slavonic.
Currently 10,037 sentences have been annotated.
Language(s) : Greek - Latin - Gothic - Armenian - Old Church Slavonic

ELRA-U-W 0248

Swiss Text Corpus

This is a corpus of the German language. It contains German texts of different types from 20th-century Switzerland.
Language(s) : German (Switzerland)

ELRA-U-W 0249

TUNA Reference Corpus

The TUNA corpus contains descriptions of objects and people in English. It is annotated at the semantic level with a domain representation.
Language(s) : English

ELRA-U-W 0250

GREC Corpus

The GREC corpus contains 2,000 short introductory texts from Wikipedia entries, including about 18,000 annotated referring expressions.
Language(s) : English

ELRA-U-W 0251

Bijankhan corpus

The Bijankhan corpus is a Persian (Farsi) corpus containing daily news and common texts, for a total of 2.6 million words. It has been manually tagged.
Language(s) : Farsi

ELRA-U-W 0252

Hamshahri Corpus

The Hamshahri corpus is a Persian (Farsi) text collection comprising news texts from the Hamshahri daily newspaper from 1996 to 2002. It contains more than 160,000 news articles on various subjects.
Language(s) : Persian
