Universal Catalogue  
Written Corpora
Displaying 1 to 20 (of 730 products)

ELRA-LMON7
Dictionary of National Biography (DNB)


This is a collection of 50,000 specially written biographies of the men and women who have shaped all aspects of the British past.
Language(s) : English (United Kingdom)



ELRA-LMULT12
Euradic 


The Euradic project aimed to create the following linguistic resources: monolingual dictionaries, bilingual dictionaries, specialized dictionaries (bilingual, trilingual and multilingual databases), and a parallel corpus.
Language(s) : English - Spanish - Italian - German - Arabic - Portuguese - Greek - French



ELRA-MULT16
SINOD: Slovenian Non-native Speech Database 


It contains two TV interviews, with video, audio and transcribed data. Each interview lasted 51 minutes and covered a general topic, mainly the profession, past life and work of the interviewee. In each session, a native Slovenian male journalist interviewed a female non-native speaker. The transcriptions consist of 12.5k words, of which 2,516 are distinct.
Language(s) : Slovenian



ELRA-SD187
COWRAT Corpus 


It contains audio meetings with a significant textual component. The meeting scenarios consist of oral discussions and written text documents reflecting the results of these discussions. It also comprises 4 types of metadata encoded in XML: segmentation elements to establish text and speech units, time stamps to keep track of actions on text documents, detailed action descriptions and keywords. The entire corpus contains 29 meetings lasting more than 17 hours in total, with 14,665 words, 5,015 text actions and 1,125 gesturing actions. Manual annotation, which includes orthographic transcription of contents and tagging of dialogue acts, is still in progress.
Language(s) : English



ELRA-U-MM0002
Archive of the Indigenous Languages of Latin America (AILLA)


AILLA is a database of audio and textual materials from the indigenous languages of Latin America.
Language(s) :



ELRA-U-MM0005
Twente News Corpus (TwNC)


This is a multifaceted corpus for Dutch. It contains material from different sources: newspapers, television subtitles, teleprompter files and broadcast news transcripts with the audio file. It consists of 530 million words and about 800 files of broadcast news audio.
Language(s) : Dutch



ELRA-U-S 0010
Multext-East Speech Corpus 


This is a small parallel corpus of spoken texts taken from the EUROM-1 speech corpus. 40 short passages have been translated from English into Romanian, Slovene, Estonian, Hungarian, Czech and Bulgarian.
For four languages (Romanian, Slovene, Estonian and Hungarian) recordings of the texts are also provided (with links between texts and spoken passages).
Language(s) : English - Romanian - Slovene - Bulgarian - Czech - Estonian - Hungarian - Romanian (Romania) - Slovene (Slovenia) - Estonian (Estonia) - Hungarian (Hungary)



ELRA-U-S 0060
Mongolian Corpora 


These Mongolian corpora are under development. They will include spoken and written materials from many Mongolian-speaking regions (Mongolia, Russia, Inner Mongolia and other Mongolian-speaking Chinese regions).
Language(s) : Mongolian



ELRA-U-S 0172
Verbmobil Data 


This multilingual collection contains spontaneous speech data that were collected and transcribed for the training and testing of Verbmobil systems. The languages covered are German, English and Japanese.

3,200 dialogs were collected from 1,658 speakers.
Language(s) : German - Japanese - English



ELRA-U-W 0001
The Corpus of Early English Correspondence (CEEC)


This 2.7 million word corpus is a collection of English letters from 1410 to 1681. It was designed for research in historical sociolinguistics.
Today the corpus name refers to a family of corpora derived from this first collection: a sampler, a parsed version and an extended version (whether published or not).
Language(s) : English (United Kingdom)



ELRA-U-W 0002
The Corpus of Early English Correspondence Sampler (CEECS)


The CEECS is a 450,000 word corpus of English letters from 1418 to 1680. It was published in 1998.

It is a subcorpus of the Corpus of Early English Correspondence, which was compiled for research in historical sociolinguistics.
Language(s) : English (United Kingdom)



ELRA-U-W 0003
The Parsed Corpus of Early English Correspondence (PCEEC)


The PCEEC is a parsed version of the Corpus of Early English Correspondence, which was compiled for research in sociolinguistics. It contains 2,200,000 words and is composed of English letters from 1410 to 1681.
It was published in 2006.
Language(s) : English (United Kingdom)



ELRA-U-W 0004
A Corpus of the Late Eighteenth century Prose 


It is a 300,000 word corpus of local English letters on practical subjects (from 1761 to 1790). It is currently available in two formats: plain text with COCOA-style annotations (like the Helsinki corpus) and HTML.
Language(s) : English (United Kingdom)



ELRA-U-W 0005
The Helsinki Corpus of Older Scots 


This is an 830,000 word corpus composed of historical Scottish English texts from 1450 to 1700, representing fifteen different prose genres.
Language(s) : English (Scotland)



ELRA-U-W 0006
The Wellington Corpus of Written New Zealand English (WWC)


It is a one million word corpus of texts written between 1986 and 1990. The texts are 2,000-word excerpts from categories very similar to those of the Brown and LOB corpora.
Language(s) : English (New Zealand)



ELRA-U-W 0007
The Bosque Treebank 


The Bosque Treebank is a subset of Floresta; it comprises 215,003 tokens from CETEMPúblico and CETENFolha (corresponding to 9,431 trees). It was fully revised by linguists.
Language(s) : Portuguese (Portugal) - Portuguese (Brazil)



ELRA-U-W 0008
Hungarian Named Entity Corpus of Business Newswire Texts 


Hungarian Named Entity Corpus of Business Newswire Texts comprises 200,000 words of short business news articles (segments of the Szeged corpus). The corpus has been POS tagged and standard annotation has been added to entities.
Language(s) : Hungarian (Hungary)



ELRA-U-W 0009
CETEMPúblico corpus 


It is a 180 million word corpus of Portuguese which was built during the Computational Processing of Portuguese project.
Texts were extracted from editions of PÚBLICO, a daily Portuguese newspaper, published between 1991 and 1998.
Language(s) : Portuguese (Portugal)



ELRA-U-W 0010
CETENFolha corpus 


It is a 24 million word corpus of Brazilian Portuguese which was created during the Computational Processing of Portuguese project. Texts were extracted from the daily newspaper Folha de S. Paulo (year 1994).
Language(s) : Portuguese (Brazil)



ELRA-U-W 0011
A tentative Chinese Dependency Treebank 


This is a small dependency treebank for Chinese; it contains 711 sentences and 20,034 tokens.
Language(s) : Chinese (China)




Joint Copyright © 2008 ELRA & ELDA
Universal Catalogue 1.0.4