Language Resources
Displaying 21 to 40 (of 730 products)
The BLLIP'99 corpus is an automatically annotated corpus of 30 million words from the Wall Street Journal (three years: 1987-89).
Language(s) : English (USA)
This treebank is a collection of corpora analysed with the LinGO English Resource Grammar. It contains VerbMobil data sets (speech transcripts) and extracts from a corpus of e-commerce customer email.
Language(s) : English (USA)
This multi-tagged corpus contains 180 sentences which were tagged with the AMALGAM tagger using the Brown, ICE, LLC, LOB, UNIX Parts, POW, SEC and UPenn tagging schemes.
Language(s) : English
This corpus is based on the IPSM raw text (60 sentences). In the framework of the AMALGAM project, sentences were parsed according to several rival parsing schemes.
Language(s) : English
The Estonian Constraint Grammar Corpus consists of 200,000 running words (15,000 sentences) from fiction, newspapers and legal texts. Shallow syntactic annotation has been added using Constraint Grammar.
Language(s) : Estonian (Estonia)
This corpus of Estonian is composed of different types of data:
- transcripts of spoken dialogues (233,000 running words),
- written dialogues (2,500 running words collected in 2001 and 10,000 collected in 2009),
- human-computer interactions.
Language(s) : Estonian (Estonia)
The data in this treebank are representative of modern written Russian. The resource is morpho-syntactically tagged, and dependency-based syntactic annotation is also planned.
Language(s) : Russian (Russia)
This is a 13,683-word collection of syntactically parsed Latin sentences.
Language(s) : Latin
This is a syntactically parsed corpus of Thai.
Language(s) : Thai
This is a one-million-word corpus of spoken New Zealand English collected from 1988 to 1994. It contains 2,000-word excerpts representing the various categories and characteristics of speech (formal to informal, monologue/dialogue, broadcast, etc.).
Language(s) : English (New Zealand)
This is a one-million-word corpus of Russian informative and literary prose (from 1985 to 1989 and from 1960 to 1988 respectively), designed to be as representative and varied as possible.
Language(s) : Russian (Russia)
In its second release, the ANC contains 22 million words of written American English, collected from 1990 onwards. It covers many different genres and also contains transcripts of spoken data.
The corpus is annotated for lemmas, parts of speech, noun chunks and verb chunks.
Language(s) : English (USA)
This is a 2.7-million-word corpus of the spoken and written registers that students encounter in their academic activities: classroom teaching, office hours, textbooks, institutional written materials, etc.
All the texts are grammatically annotated.
Language(s) : English (USA)
This is a corpus of medical texts from 1375 to 1800, divided into three historical periods: Middle English Medical Texts (500,000 words), Early Modern Medical Texts (1.8 million words expected) and Late Modern English Medical Texts (under construction).
This resource is designed for studies on the evolution of medical writing.
Language(s) : English (United Kingdom)
This is a one-million-word American English corpus intended as a counterpart of the Brown corpus for the language of the early 1990s. It represents fifteen genres.
Language(s) : English (USA)
This is a one-million-word British English corpus intended as a counterpart of the LOB corpus for the language of the early 1990s. It represents fifteen genres.
Language(s) : English (United Kingdom)
This is a one-million-word corpus of 500 samples of written Australian English, each containing 2,000 words. The data was collected in 1986.
Seventeen genres are represented, from newspapers to popular lore.
Language(s) : English (Australia)
This is a one-million-word corpus of English texts written from 1800 to 1900. It is divided into three periods (1800-1830, 1850-1870, 1870-1900) and is designed to enable comparisons with materials contained in the Helsinki Corpus of English Texts.
It covers seven genres: correspondence, scientific writing, history writing, fiction, trial proceedings, parliamentary debates, and drama comedy.
Language(s) : English (United Kingdom)
The Leipzig Corpora Collection contains monolingual corpora for 15 languages. They are comparable in genre (newspapers, web documents). The data come as plain text or as MySQL database tables.
The number of words varies by language (from 1 million to 30 million).
Language(s) : English (United Kingdom) - French (France) - German (Germany) - Catalan (Spain) - Dutch (Netherlands) - Danish (Denmark) - Estonian (Estonia) - Finnish (Finland) - Italian (Italy) - Japanese (Japan) - Korean (Korea) - Norwegian (Norway) - Swedish (Sweden) - Turkish (Turkey)
This is an 80-million-word corpus of Estonian (still under construction). It gathers texts of various genres and has been designed to support the Estonian language and culture. It can be used in computational linguistics as well as in theoretical linguistics.
Language(s) : Estonian (Estonia)