Universal Catalogue  
Written Corpora
Displaying 21 to 40 (of 730 products)

ELRA-U-W 0012
BLIPP'99 Corpus 


The BLIPP'99 corpus is an automatically annotated corpus that comprises 30 million words from the Wall Street Journal (three years: 1987-89).
Language(s) : English (USA)



ELRA-U-W 0013
LinGO Redwoods Treebank 


This treebank is a collection of corpora analysed with the LinGO English Resource Grammar. It contains VerbMobil data sets (speech transcripts) and extracts from a corpus of e-commerce customer emails.
Language(s) : English (USA)



ELRA-U-W 0014
AMALGAM Multi-Tagged Corpus 


This multi-tagged corpus contains 180 sentences which were tagged with the AMALGAM tagger using the Brown, ICE, LLC, LOB, UNIX Parts, POW, SEC and UPenn tagging schemes.
Language(s) : English



ELRA-U-W 0015
AMALGAM Multi-Treebank 


This corpus is based on the IPSM raw text (60 sentences). In the framework of the AMALGAM project, sentences were parsed according to several rival parsing schemes.
Language(s) : English



ELRA-U-W 0016
Estonian Constraint Grammar Corpus (Estonian CG Corpus)


The Estonian Constraint Grammar Corpus consists of 200,000 running words (15,000 sentences) from fiction, newspapers and legal texts. Shallow syntactic annotation has been added using Constraint Grammar.
Language(s) : Estonian (Estonia)



ELRA-U-W 0017
Estonian Dialogue Corpus (EDiC)


This corpus of Estonian is composed of different types of data:
- transcripts of spoken dialogues (233,000 running words),
- written dialogues (2,500 running words collected in 2001 and 10,000 collected in 2009),
- human-computer interactions.
Language(s) : Estonian (Estonia)



ELRA-U-W 0018
Dependency Treebank for Russian (SynTagRus)


The data contained in this treebank is representative of modern written Russian. The resource is morpho-syntactically tagged, and syntactic annotation using dependencies is also planned.
Language(s) : Russian (Russia)



ELRA-U-W 0019
Latin Dependency Treebank 


This is a 13,683-word collection of syntactically parsed Latin sentences.
Language(s) : Latin



ELRA-U-W 0020
NAiST Thai Treebank 


This is a syntactically parsed corpus of Thai.
Language(s) : Thai



ELRA-U-W 0021
The Wellington Corpus of Spoken New Zealand English (WSC)


It is a one-million-word corpus of spoken New Zealand English collected from 1988 to 1994. It contains 2,000-word excerpts representing the various categories and characteristics of speech (formal to informal, monologue/dialogue, broadcast, etc.).
Language(s) : English (New Zealand)



ELRA-U-W 0022
The Uppsala Russian Corpus 


It is a one-million-word corpus of Russian informative and literary prose (from 1985 to 1989 and from 1960 to 1988 respectively), designed to be as representative and varied as possible.
Language(s) : Russian (Russia)



ELRA-U-W 0023
The American National Corpus (ANC)


In its second release, the ANC contains 22 million words of written American English, collected from 1990 onwards. It covers many different genres and also contains transcripts of spoken data.

Annotation of the corpus concerns lemmas, parts of speech, noun chunks and verb chunks.
Language(s) : English (USA)



ELRA-U-W 0024
The TOEFL 2000 Spoken and Written Academic Language Corpus (T2K-SWAL)


It is a 2.7-million-word corpus of the spoken and written registers that students encounter in their academic activities: classroom teaching, office hours, textbooks, institutional written materials, etc.
All the texts are grammatically annotated.
Language(s) : English (USA)



ELRA-U-W 0025
The Corpus of Early English Medical Writing 


It is a corpus of medical texts from 1375 to 1800, divided into three historical periods: Middle English Medical Texts (500,000 words), Early Modern Medical Texts (1.8 million words expected), and Late Modern English Medical Texts (under construction).

This resource is designed for studies on the evolution of medical writing.
Language(s) : English (United Kingdom)



ELRA-U-W 0026
The Freiburg-Brown Corpus of American English (FROWN)


It is a one-million-word American English corpus that is meant to be a counterpart of the Brown corpus for the language of the early 1990s. It represents fifteen genres.
Language(s) : English (USA)



ELRA-U-W 0027
The Freiburg-LOB Corpus of British English (FLOB)


It is a one-million-word British English corpus that is meant to be a counterpart of the LOB corpus for the language of the early 1990s. It represents fifteen genres.
Language(s) : English (United Kingdom)



ELRA-U-W 0028
The Australian Corpus of English (ACE)


It is a one-million-word corpus gathering 500 samples of written Australian English, each containing 2,000 words. The data was collected in 1986.
Seventeen genres are represented, from newspapers to popular lore.
Language(s) : English (Australia)



ELRA-U-W 0029
A Corpus of Nineteenth Century English (CONCE)


It is a one-million-word corpus of English texts written from 1800 to 1900. It is divided into three periods (1800-1830, 1850-1870, 1870-1900) and is designed to enable comparisons with materials contained in the Helsinki Corpus of English Texts.
It covers seven genres: correspondence, scientific writing, history writing, fiction, trial proceedings, parliamentary debates, and drama comedy.
Language(s) : English (United Kingdom)



ELRA-U-W 0030
The Leipzig Corpora Collection (LCC)


The Leipzig Corpora Collection contains monolingual corpora for 15 languages. They are comparable in genre (newspapers, web documents). The data come as plain text or as MySQL database tables.

The number of words varies according to the language (from 1 million to 30 million).
Language(s) : English (United Kingdom) - French (France) - German (Germany) - Catalan (Spain) - Dutch (Netherlands) - Danish (Denmark) - Estonian (Estonia) - Finnish (Finland) - Italian (Italy) - Japanese (Japan) - Korean (Korea) - Norwegian (Norway) - Swedish (Sweden) - Turkish (Turkey)



ELRA-U-W 0031
Mixed Corpus of Estonian (MCE)


This is an 80-million-word corpus of Estonian (still under construction). It gathers texts of various genres and has been designed to support the Estonian language and culture. It can be used in computational linguistics as well as in theoretical linguistics.
Language(s) : Estonian (Estonia)




Joint Copyright © 2008 ELRA & ELDA
Universal Catalogue 1.0.4