Universal Catalogue

Written Corpora

Displaying 281 to 300 (of 730 products)

ELRA-U-W 0273

A Representative Corpus of Historical English Registers (ARCHER)

ARCHER is a socio-historical corpus made up of texts representing eleven written and spoken registers in British and American English. It is divided into ten 50-year periods from 1650-1990 and contains approximately 1.7 million words.
Language(s) : English (USA) - English (United Kingdom)

ELRA-U-W 0274

Romanian Corpus of Newspaper Texts

This Romanian newspaper corpus contains 56 million words with diacritics (Unicode .txt format). It has been parsed with GojolParser (two formats: dependency maps and trees) with good accuracy.
Language(s) : Romanian

ELRA-U-W 0275

Melbourne-Surrey Corpus

The Melbourne-Surrey corpus contains 100,000 words of Australian newspaper texts.
Language(s) : English (Australia)

ELRA-U-W0276

NPS Chat Corpus

The NPS Chat Corpus contains 10,567 posts collected in 2006 from various online chat services. Posts have been privacy-masked by hand, part-of-speech tagged, and dialogue-act tagged.
Language(s) : English (USA)

ELRA-U-W0277

Problem Report Corpus

The Problem Report Corpus contains problem report summaries from various open source projects (Apache, Eclipse, Firefox, Linux, Openoffice).
Language(s) : English

ELRA-U-W0278

The Patient Information Leaflet Corpus (PIL)

This is a collection of 471 documents giving instructions to patients about their medication.
Language(s) : English

ELRA-U-W0279

Sentiment polarity datasets

This is a collection of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative). It contains 1,000 positive and 1,000 negative full-text movie reviews.
Language(s) : English

ELRA-U-W0280

Subjectivity datasets

This is a collection of sentences labeled with respect to their subjectivity status (subjective or objective). It contains 5,000 subjective and 5,000 objective processed sentences.
Language(s) : English

ELRA-U-W0281

Corpus del Español

It contains more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s.
Language(s) : Spanish

ELRA-U-W0282

Italian TimeBank (ITB)

The Italian TimeBank (ITB) contains 171 newspaper articles, totalling 62,000 words, which have been manually annotated for events.
Language(s) : Italian

ELRA-U-W0283

SUBTLEXus Corpus

This is a subtitle corpus for American English. It contains 51 million words from 8,388 US films and sitcoms (from 1900 to 2007).
Language(s) : English (USA)

ELRA-U-W0284

KRYS I Corpus

This is a corpus containing about 6,300 PDF documents classified into genres. Documents have been labelled independently by two kinds of people according to 70 assigned genres.
Language(s) : English

ELRA-U-W0285

New Amsterdam Corpus of Old French Literary Texts (NCA)

This is a lemmatized, XML-formatted corpus of Old French literary texts containing more than three million words from 200 different texts.
Language(s) : French

ELRA-U-W0286

The Dialogue Diversity Corpus (DDC)

This is a written corpus containing 54 dialogue transcripts collected from different corpora.
Language(s) : English

ELRA-U-W0287

Parallel Italian-Danish Corpus annotated for anaphora

The Parallel Italian-Danish Corpus annotated for anaphora contains EU texts, literary texts, newspaper articles and dialogue transcriptions.
Language(s) : Italian - Danish

ELRA-U-W0288

NICT Japanese-Chinese parallel corpus

This is a parallel corpus containing 38,383 sentence pairs collected from Japanese newspapers and translated into Chinese. The corpus is aligned at word and phrase levels and has been annotated with morphological and syntactic tags.
Language(s) : Japanese - Chinese

ELRA-U-W0289

HMIHY corpus

The "How May I Help You (SM)?" corpus (or HMIHY corpus) contains 5,690 human-computer dialogues. Each caller turn is annotated with emotional labels.
Language(s) : English

ELRA-U-W0290

BLOGS08

BLOGS08 is a TREC test collection containing samples of the blogosphere collected once a week over one year. It comprises 28,488,767 blog posts from 1,303,520 blog feeds.
Language(s) : English

ELRA-U-W0291

Help-desk emails dialogues Corpus

This is a corpus containing 30,000 request-response email dialogues between customers and operators.
Language(s) : English

ELRA-U-W0292

SYN2005 corpus

The SYN2005 corpus is a synchronic representative corpus of contemporary written Czech collected between 2000 and 2004. It contains 100 million words (tokens), lemmatised and part-of-speech tagged.
Language(s) : Czech
