Universal Catalogue

Written Corpora

Displaying 21 to 40 (of 730 products)

ELRA-U-W 0012

BLLIP'99 Corpus

The BLLIP'99 corpus is an automatically annotated corpus of 30 million words from the Wall Street Journal, covering three years (1987-89).
Language(s) : English (USA)

ELRA-U-W 0013

LinGO Redwoods Treebank

This treebank is a collection of corpora analysed with the LinGO English Resource Grammar. It contains VerbMobil data sets (speech transcripts) and extracts from a corpus of e-commerce customer emails.
Language(s) : English (USA)

ELRA-U-W 0014

AMALGAM Multi-Tagged Corpus

This multi-tagged corpus contains 180 sentences which were tagged with the AMALGAM tagger using the Brown, ICE, LLC, LOB, UNIX Parts, POW, SEC and UPenn tagging schemes.
Language(s) : English

ELRA-U-W 0015

AMALGAM Multi-Treebank

This corpus is based on the IPSM raw text (60 sentences). In the framework of the AMALGAM project, sentences were parsed according to several rival parsing schemes.
Language(s) : English

ELRA-U-W 0016

Estonian Constraint Grammar Corpus (Estonian CG Corpus)

The Estonian Constraint Grammar Corpus consists of 200,000 running words (15,000 sentences) from fiction, newspapers and legal texts. Shallow syntactic annotation has been added using Constraint Grammar.
Language(s) : Estonian (Estonia)

ELRA-U-W 0017

Estonian Dialogue Corpus (EDiC)

This corpus of Estonian is composed of different types of data:
- transcripts of spoken dialogues (233,000 running words),
- written dialogues (2,500 running words collected in 2001 and 10,000 collected in 2009),
- human-computer interactions.
Language(s) : Estonian (Estonia)

ELRA-U-W 0018

Dependency Treebank for Russian (SynTagRus)

The data in this treebank are representative of modern written Russian. The resource is morphosyntactically tagged, and dependency-based syntactic annotation is also planned.
Language(s) : Russian (Russia)

ELRA-U-W 0019

Latin Dependency Treebank

This is a 13,683-word collection of syntactically parsed Latin sentences.
Language(s) : Latin

ELRA-U-W 0020

NAiST Thai Treebank

This is a syntactically parsed corpus of Thai.
Language(s) : Thai

ELRA-U-W 0021

The Wellington Corpus of Spoken New Zealand English (WSC)

It is a one-million-word corpus of spoken New Zealand English collected from 1988 to 1994. It consists of 2,000-word excerpts representing the various categories and characteristics of speech (formal to informal, monologue and dialogue, broadcast, etc.).
Language(s) : English (New Zealand)

ELRA-U-W 0022

The Uppsala Russian Corpus

It is a one-million-word corpus of Russian informative prose (from 1985 to 1989) and literary prose (from 1960 to 1988), designed to be as representative and varied as possible.
Language(s) : Russian (Russia)

ELRA-U-W 0023

The American National Corpus (ANC)

In its second release, the ANC contains 22 million words of written American English, collected from 1990 onwards. It covers many different genres and also contains transcripts of spoken data.

Annotation of the corpus concerns lemmas, parts of speech, noun chunks and verb chunks.
Language(s) : English (USA)

ELRA-U-W 0024

The TOEFL 2000 Spoken and Written Academic Language Corpus (T2K-SWAL)

It is a 2.7-million-word corpus of the spoken and written registers that students encounter in their academic activities: classroom teaching, office hours, textbooks, institutional written materials, etc.
All the texts are grammatically annotated.
Language(s) : English (USA)

ELRA-U-W 0025

The Corpus of Early English Medical Writing

It is a corpus of medical texts from 1375 to 1800, divided into three historical periods: Middle English Medical Texts (500,000 words), Early Modern Medical Texts (1.8 million words expected) and Late Modern English Medical Texts (under construction).

This resource is designed for studies on the evolution of medical writing.
Language(s) : English (United Kingdom)

ELRA-U-W 0026

The Freiburg-Brown Corpus of American English (FROWN)

It is a one-million-word American English corpus intended as a counterpart to the Brown corpus for the language of the early 1990s. It represents fifteen genres.
Language(s) : English (USA)

ELRA-U-W 0027

The Freiburg-LOB Corpus of British English (FLOB)

It is a one-million-word British English corpus intended as a counterpart to the LOB corpus for the language of the early 1990s. It represents fifteen genres.
Language(s) : English (United Kingdom)

ELRA-U-W 0028

The Australian Corpus of English (ACE)

It is a one-million-word corpus comprising 500 samples of written Australian English of 2,000 words each. The data was collected in 1986.
Seventeen genres are represented, from newspapers to popular lore.
Language(s) : English (Australia)

ELRA-U-W 0029

A Corpus of Nineteenth Century English (CONCE)

It is a one-million-word corpus of English texts written from 1800 to 1900. It is divided into three periods (1800-1830, 1850-1870, 1870-1900) and is designed to enable comparisons with the materials in the Helsinki Corpus of English Texts.
It covers seven genres: correspondence, scientific writing, history writing, fiction, trial proceedings, parliamentary debates, and drama comedy.
Language(s) : English (United Kingdom)

ELRA-U-W 0030

The Leipzig Corpora Collection (LCC)

The Leipzig Corpora Collection contains monolingual corpora for 15 languages. They are comparable in genre (newspapers, web documents). The data come as plain text or as MySQL database tables.

The number of words varies by language (from 1 million to 30 million).
Language(s) : English (United Kingdom) - French (France) - German (Germany) - Catalan (Spain) - Dutch (Netherlands) - Danish (Denmark) - Estonian (Estonia) - Finnish (Finland) - Italian (Italy) - Japanese (Japan) - Korean (Korea) - Norwegian (Norway) - Swedish (Sweden) - Turkish (Turkey)

ELRA-U-W 0031

Mixed Corpus of Estonian (MCE)

This is an 80-million-word corpus of Estonian (still under construction). It gathers texts of various genres and has been designed to support the Estonian language and culture. It can be used in computational linguistics as well as in theoretical linguistics.
Language(s) : Estonian (Estonia)
