Language Resources
Displaying 1 to 20 (of 730 products)
This is a collection of 50,000 specially written biographies of the men and women who have shaped all aspects of the British past.
Language(s) : English (United Kingdom)
The Euradic project aimed to create the following linguistic resources: monolingual dictionaries, bilingual dictionaries, specialized dictionaries (bilingual, trilingual and multilingual databases) and a parallel corpus.
Language(s) : English - Spanish - Italian - German - Arabic - Portuguese - Greek - French
It contains two TV interviews, with video, audio and transcribed data. Each interview lasted 51 minutes and covered a general topic, mainly the profession, past life and work of the interviewee. In each session, a native Slovenian male journalist interviewed a female non-native speaker. The transcriptions consist of 12.5k words, of which 2,516 are distinct.
Language(s) : Slovenian
It contains audio meetings with a significant textual component. The meeting scenarios consist of oral discussions and written text documents reflecting the results of these discussions. It also comprises 4 types of metadata encoded in XML: segmentation elements to establish text and speech units, time stamps to keep track of actions on text documents, detailed action descriptions and keywords. The entire corpus contains 29 meetings, which together last more than 17 hours and comprise 14,665 words, 5,015 text actions and 1,125 gesturing actions. Manual annotation, including orthographic transcription of contents and tagging of dialogue acts, is still in progress.
Language(s) : English
AILLA is a database of audio and textual materials from the indigenous languages of Latin America.
Language(s) :
This is a multifaceted corpus for Dutch. It contains material from different sources: newspapers, television subtitles, teleprompter files and broadcast news transcripts with the corresponding audio files. It consists of 530 million words and about 800 files of broadcast news audio.
Language(s) : Dutch
This is a small parallel corpus of spoken texts taken from the EUROM-1 speech corpus. 40 short passages have been translated from English into Romanian, Slovene, Estonian, Hungarian, Czech and Bulgarian.
For four languages (Romanian, Slovene, Estonian and Hungarian) recordings of the texts are also provided (with links between texts and spoken passages).
Language(s) : English - Romanian - Slovene - Bulgarian - Czech - Estonian - Hungarian - Romanian (Romania) - Slovene (Slovenia) - Estonian (Estonia) - Hungarian (Hungary)
These Mongolian corpora are under development. They will include spoken and written materials from many Mongolian-speaking regions (Mongolia, Russia, Inner Mongolia and other Mongolian-speaking Chinese regions).
Language(s) : Mongolian
This multilingual collection contains spontaneous speech data that were collected and transcribed for the training and testing of Verbmobil systems. The different languages concerned are German, English and Japanese.
3,200 dialogs were collected from 1,658 speakers.
Language(s) : German - Japanese - English
This 2.7-million-word corpus is a collection of English letters from 1410 to 1681. It was designed for research in historical sociolinguistics.
Today the corpus name refers to a family of corpora derived from this first collection: sampler, parsed version, extended version (published or not).
Language(s) : English (United Kingdom)
The CEECS is a 450,000-word corpus of English letters from 1418 to 1680. It was published in 1998.
It is a subcorpus of the Corpus of Early English Correspondence, which was compiled for research in historical sociolinguistics.
Language(s) : English (United Kingdom)
The PCEEC is a parsed version of the Corpus of Early English Correspondence, which was compiled for research in sociolinguistics. It contains 2,200,000 words and is composed of English letters from 1410 to 1681.
It was published in 2006.
Language(s) : English (United Kingdom)
It is a 300,000 word corpus of local English letters on practical subjects (from 1761 to 1790). It is currently available in two formats: plain text with COCOA-style annotations (like the Helsinki corpus) and HTML.
Language(s) : English (United Kingdom)
This is an 830,000-word corpus composed of historical Scottish English texts from 1450 to 1700, representing fifteen different prose genres.
Language(s) : English (Scotland)
It is a one-million-word corpus of texts written between 1986 and 1990. The texts are 2,000-word excerpts from various categories that closely match those of the Brown and LOB corpora.
Language(s) : English (New Zealand)
The Bosque Treebank is a subset of Floresta; it comprises 215,003 tokens from CETEMPúblico and CETENFolha (corresponding to 9,431 trees). It was fully revised by linguists.
Language(s) : Portuguese (Portugal) - Portuguese (Brazil)
Hungarian Named Entity Corpus of Business Newswire Texts comprises 200,000 words of short business news articles (segments of the Szeged corpus). The corpus has been POS tagged and standard annotation has been added to entities.
Language(s) : Hungarian (Hungary)
It is a 180-million-word corpus of Portuguese which was built during the Computational Processing of Portuguese project.
Texts were extracted from editions of PÚBLICO, a daily Portuguese newspaper, published between 1991 and 1998.
Language(s) : Portuguese (Portugal)
It is a 24-million-word corpus of Brazilian Portuguese which was created during the Computational Processing of Portuguese project. Texts were extracted from the daily newspaper Folha de S. Paulo (year 1994).
Language(s) : Portuguese (Brazil)
This is a small dependency treebank for Chinese; it contains 711 sentences and 20,034 tokens.
Language(s) : Chinese (China)