Universal Catalogue

Written Corpora

Displaying 261 to 280 (of 730 products)

ELRA-U-W 0253

Persian Today Corpus

This is a 1,000,000-word corpus of modern Persian, mostly written between 1994 and 2004.
Language(s) : Persian

ELRA-U-W 0254

Persian Text Corpus

The Persian Text Corpus contains 10 million words. It has been hand-annotated.
Language(s) : Persian

ELRA-U-W 0255

Corpus of Early Ontario English (CONTE)

The Corpus of Early Ontario English covers the period from the earliest Ontarian English texts to the end of the 19th century. It contains approximately 225,000 words from diaries, newspapers, official letters, etc. (informal register as well as formal writing).

It is divided by period (25-year spans) and by social criteria.
Language(s) : English

ELRA-U-W 0256

Corpus of American English

The Corpus of American English contains more than 360 million words, divided equally among spoken texts, fiction, popular magazines, newspapers, and academic texts.
It is POS-tagged with CLAWS.
Language(s) : English (USA)

ELRA-U-W 0257

Welsh-English Aligned Corpus

This aligned corpus of Welsh-English contains 510,813 aligned sentence pairs. Texts are taken from the proceedings of the National Assembly for Wales.
Language(s) : Welsh, English

ELRA-U-W 0258

deWaC German Web Corpus

deWaC is a 1.7-billion-word German corpus constructed from the Web (.de domain).
Language(s) : German

ELRA-U-W 0259

itWaC Italian Web Corpus

itWaC is a 2-billion-word Italian corpus constructed from the Web (.it domain).
Language(s) : Italian

ELRA-U-W 0260

ukWaC English Web Corpus

ukWaC is a 2-billion-word English corpus constructed from the Web (.uk domain).
Language(s) : English

ELRA-U-W 0261

frWaC French Web Corpus

frWaC is a 1.6-billion-word French corpus constructed from the Web (.fr domain).
Language(s) : French

ELRA-U-W 0262

Spanish Web Corpus

This is a Spanish corpus constructed from the Web (.es domain).
Language(s) : Spanish

ELRA-U-W 0263

NILC Corpus

The NILC corpus is a 40-million-word Brazilian Portuguese corpus. It is available in two forms: plain text and a POS-tagged version.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0264

Brazilian CorpusDT

The corpusDT is a corpus of scientific texts in Brazilian Portuguese. It consists of authentic theses and dissertations on Computer Science.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0265

Brazilian Portuguese-English Parallel Corpora

This is a bilingual Brazilian Portuguese-English corpus of parallel texts from three domains: scientific, legal and journalistic.
It contains approximately 75,000 words.
Language(s) : Portuguese (Brazil), English

ELRA-U-W 0266

Brazilian CorpusGIS

This is a corpus of grammatically inadequate sentences in Brazilian Portuguese.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0267

RHETALHO

RHETALHO is a corpus rhetorically annotated according to RST (Rhetorical Structure Theory; Mann and Thompson, 1987). It is composed of 40 scientific and news texts.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0268

Corpus of Verbal Response Mode Annotated Utterances

This corpus is a pragmatically annotated set of utterances. It contains 1,368 annotated utterances from 14 dialogues and several sets of isolated utterances, all transcribed from spoken dialogues in various domains. Each utterance is annotated with two Verbal Response Mode (VRM) categories, one classifying its literal meaning and one its pragmatic meaning.
Language(s) : English

ELRA-U-W 0269

MPQA Opinion Corpus

This corpus contains 535 news articles and a total of 11,114 sentences. They have been manually annotated for opinions and sentiments (beliefs, emotions, sentiments, speculations, etc.).
Language(s) : English (USA)

ELRA-U-W 0270

USENET Corpus

The USENET corpus is a collection of public USENET postings. It currently contains over 25 billion words and covers 47,860 English-language, non-binary-file newsgroups from October 2005 to January 2010. It is untagged and has been cleaned and anonymized.
Language(s) : English

ELRA-U-W 0271

Stockholm Multilingual Treebank (SMULTRON)

SMULTRON is a parallel treebank in English, German and Swedish. It contains around 1500 sentences that have been PoS-tagged and annotated with phrase structure trees.
It has been aligned at sentence, phrase and word levels.
Language(s) : German-English, English-Swedish, German-Swedish

ELRA-U-W 0272

Parallel Corpus of Swedish, Danish and Norwegian Subtitles

This parallel corpus consists of TV subtitles from soap operas, detective series, animation series, comedies, documentaries, feature films, etc.
This amounts to more than 14,000 subtitle files in each language, corresponding to more than 5 million subtitles (more than 50 million words).
Language(s) : Swedish-Danish, Swedish-Norwegian
