Language Resources
Displaying 281 to 300 (of 730 products) |
Result Pages: 15 |
ARCHER is a socio-historical corpus made up of texts representing eleven written and spoken registers in British and American English. It is divided into ten 50-year periods from 1650 to 1990 and contains approximately 1.7 million words.
Language(s) : English (USA) - English (United Kingdom)
|
|
|
|
This Romanian newspaper corpus contains 56 million words with diacritics (Unicode .txt format). It has been parsed with GojolParser (two formats: dependency maps and trees) with good accuracy.
Language(s) : Romanian
|
|
|
|
The Melbourne-Surrey corpus contains 100,000 words of Australian newspaper texts.
Language(s) : English (Australia)
|
|
|
|
The NPS Chat Corpus contains 10,567 posts collected in 2006 from various online chat services. Posts have been privacy-masked by hand, part-of-speech tagged, and dialogue-act tagged.
Language(s) : English (USA)
|
|
|
|
The Problem Report Corpus contains problem report summaries from various open source projects (Apache, Eclipse, Firefox, Linux, OpenOffice).
Language(s) : English
|
|
|
|
This is a collection of 471 documents giving instructions to patients about their medication.
Language(s) : English
|
|
|
|
This is a collection of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative). It contains 1,000 positive and 1,000 negative full-text movie reviews.
Language(s) : English
|
|
|
|
This is a collection of sentences labeled with respect to their subjectivity status (subjective or objective). It contains 5,000 subjective and 5,000 objective processed sentences.
Language(s) : English
|
|
|
|
It contains more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s.
Language(s) : Spanish
|
|
|
|
The Italian TimeBank (ITB) contains 171 newspaper articles that have been manually annotated for events, for a total of 62,000 words.
Language(s) : Italian
|
|
|
|
This is a subtitle corpus for American English. It contains 51 million words from 8,388 US films and sitcoms (from 1900 to 2007).
Language(s) : English (USA)
|
|
|
|
This is a corpus containing about 6,300 PDF documents classified into genres. Documents have been labelled independently by two kinds of people according to 70 assigned genres.
Language(s) : English
|
|
|
|
This is a lemmatized and XML-formatted corpus of Old French literary texts containing more than three million words from 200 different texts.
Language(s) : French
|
|
|
|
This is a written corpus containing 54 dialogue transcripts collected from different corpora.
Language(s) : English
|
|
|
|
The Parallel Italian-Danish Corpus annotated for anaphora contains EU texts, literary texts, newspaper articles and dialogue transcriptions.
Language(s) : Italian - Danish
|
|
|
|
This is a parallel corpus containing 38,383 sentence pairs collected from Japanese newspapers and translated into Chinese. The corpus is aligned at the word and phrase levels and has been annotated with morphological and syntactic tags.
Language(s) : Japanese - Chinese
|
|
|
|
The "How May I Help You (SM)?" corpus (or HMIHY corpus) contains 5,690 human-computer dialogues. Each caller turn is annotated with emotional labels.
Language(s) : English
|
|
|
|
BLOGS08 is a TREC test collection containing samples of the blogosphere collected once a week over one year. It comprises 28,488,767 blog posts from 1,303,520 blog feeds.
Language(s) : English
|
|
|
|
This is a large corpus containing 30,000 request–response email dialogues between customers and operators.
Language(s) : English
|
|
|
|
The SYN2005 corpus is a synchronic representative corpus of contemporary written Czech collected between 2000 and 2004. It contains 100 million words (tokens), lemmatised and part-of-speech tagged.
Language(s) : Czech
|
|
|
|