Language Resources
Displaying 241 to 260 (of 730 products) |
Result Pages: 13 |
The Swedish Immigrant Newspaper Corpus is available in nine different languages: Swedish, Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian and Spanish.
Language(s) : Swedish - Albanian - Arabic - English - Finnish - Persian - Polish - Serbo-Croatian - Spanish
|
|
|
|
This Swedish-Turkish parallel corpus is a balanced corpus composed of fiction and technical documents.
Source language: Swedish (150,000 words)
Target language: Turkish (100,000 words)
Texts are annotated with POS and morphological features. They are automatically aligned at sentence and word levels.
Language(s) : Swedish-Turkish
|
|
|
|
The Swedish Political Texts corpus consists of texts from the Swedish government. It is a parallel corpus in five languages: German, English, Spanish, French and Swedish. It contains 11,000 words per language.
Language(s) : Swedish-German, Swedish-English, Swedish-Spanish, Swedish-French
|
|
|
|
CORDIAL-SIN is a 500,000-word corpus of European Portuguese. It consists of transcriptions of spontaneous and semi-directed oral data collected all over the country in various projects. The aim was to gather a representative corpus of the dialects spoken in Portugal.
The data are available in four forms: verbatim transcription, normalized transcription, POS-annotated text and syntactically annotated text.
Language(s) : Portuguese (Portugal)
|
|
|
|
CORLEX is a Portuguese corpus designed with the aim of compiling a lexicon for European Portuguese. CORLEX was extracted from the CRPC; it contains 6,210,438 words and gathers texts of different types and topics.
Language(s) : Portuguese (Portugal)
|
|
|
|
MiniCors is a semantically tagged Spanish corpus with 13,477 sentences and 565,782 words. It is partially tagged, with words selected according to frequency and degree of polysemy.
Language(s) : Spanish
|
|
|
|
MiniCors-Cat is a semantically tagged Catalan corpus with 6,722 tagged examples, covering 45,509 sentences and 1,451,778 words. The tagging was done with the MiniDir-Cat dictionary.
Language(s) : Catalan
|
|
|
|
The Sejong corpus is a Korean raw corpus composed of written and spoken texts. It contains 57 million words, plus an additional 75 million from pre-existing electronic texts.
Language(s) : Korean
|
|
|
|
The DGT-TM is a translation memory created from the text collection of the Acquis Communautaire. A translation memory is a collection of small text segments (sentences or parts of sentences) and their translations.
Language(s) : Bulgarian - Czech - Danish - Dutch - English - Estonian - German - Greek - Finnish - French - Italian - Hungarian - Latvian - Lithuanian - Maltese - Polish - Portuguese - Romanian - Slovak - Spanish - Swedish - Slovene
|
|
|
|
CORGA is a corpus of contemporary Galician (from 1975 to the present). It includes 23 million words from different genres.
Language(s) : Galician (Spain)
|
|
|
|
The ESF Database is a collection of data from five European countries: France, Germany, Great Britain, The Netherlands and Sweden.
It contains transcriptions of second language data from adult immigrant workers living in Western Europe.
Language(s) : English - German - French - Dutch - Swedish
|
|
|
|
The ANDES corpus is a collection of recorded and transcribed language materials from the Andes.
Language(s) : Quechua - Spanish
|
|
|
|
The INTERA corpus contains 12 million written words in various domains: law, health, education, tourism, environment, politics, finance.
It is a comparable corpus in which texts are aligned at sentence level (TMX standard), annotated at sentence level, morphologically tagged and lemmatized (XCES).
Language pairs: Bulgarian - English, Greek - English, Serbian - English and Slovene - English.
Language(s) : Bulgarian-English, Serbian-English, Slovene-English, Greek-English
|
|
|
|
CINTIL is a linguistically interpreted corpus of Portuguese. It contains 1 million annotated tokens and has been manually verified by linguistic experts.
Language(s) : Portuguese
|
|
|
|
This is a parallel treebank of old Indo-European versions of the New Testament. It covers the Greek, Latin, Gothic, Armenian and Old Church Slavonic versions.
Currently 10,037 sentences have been annotated.
Language(s) :
|
|
|
|
This is a corpus of the German language. It contains German texts of different types from 20th-century Switzerland.
Language(s) : German (Switzerland)
|
|
|
|
The TUNA corpus contains descriptions of objects and people in English. It is annotated at the semantic level with a domain representation.
Language(s) : English
|
|
|
|
The GREC corpus contains 2,000 short introductory texts from Wikipedia entries, including about 18,000 annotated referring expressions.
Language(s) : English
|
|
|
|
The Bijankhan corpus is a Persian (Farsi) corpus containing daily news and common texts, for a total of 2.6 million words. It has been manually tagged.
Language(s) : Farsi
|
|
|
|
The Hamshahri corpus is a Persian (Farsi) text collection comprising news texts from the Hamshahri daily newspaper from 1996 to 2002. It contains more than 160,000 news articles on various subjects.
Language(s) : Persian
|
|
|
|