Universal Catalogue

Written Corpora

Displaying 201 to 220 (of 730 products)

ELRA-U-W 0193

Repentino

Repentino is a collection of textual named entity instances: proper nouns that denote a specific entity, each classified by the kind of entity it denotes (company, book title, place name, etc.). Repentino currently gathers more than 450,000 instances, in XML.
Language(s) : Portuguese

ELRA-U-W 0194

ELAN corpus of European Portuguese

The ELAN corpus is a subcorpus of the CRPC (Corpus de Referência do Português Contemporâneo). It contains 2,840,552 words.
Language(s) : Portuguese

ELRA-U-W 0195

RL corpus of European Portuguese

The RL corpus is a subcorpus of the CRPC (Corpus de Referência do Português Contemporâneo). It contains 8,670,438 words.
Language(s) : Portuguese

ELRA-U-W 0196

AFRICA Corpus of Portuguese

This resource is a subcorpus of the CRPC (Corpus de Referência do Português Contemporâneo). It contains 3,070,879 words of written data and 129,245 words of spoken data. It is representative of African Portuguese (Angola, São Tomé and Príncipe, Mozambique, Guinea-Bissau, Cape Verde).
Language(s) : Portuguese

ELRA-U-W 0197

TeMário Corpus of Portuguese

This corpus contains 100 news texts extracted from Folha de São Paulo and Jornal do Brasil, for a total of 61,412 words. They come with manually written summaries and ideal extracts.
Language(s) : Portuguese

ELRA-U-W 0198

Lacio-Ref Corpus

The Lacio-Ref is a reference corpus of newspaper articles in Brazilian Portuguese. It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0199

Mac-Morpho

Mac-Morpho is a 1.1-million-word gold-standard corpus (a portion of the Lacio-Ref) that is morpho-syntactically annotated (PALAVRAS, E. Bick) and manually validated. It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0200

Annotated Part of the Lacio-Ref

This resource is a portion of the Lacio-Ref that is automatically annotated with lemmas, POS tags and syntactic tags (Curupira parser). It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0201

Lacio-Dev Corpus

The Lacio-Dev is a deviation corpus composed of non-revised texts (516,840 tokens). It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)

ELRA-U-W 0202

Portuguese English Parallel Corpus (Par-C)

Par-C is a Portuguese-English parallel corpus. It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil) - English

ELRA-U-W 0203

Portuguese English Comparable Corpus (Comp_C)

Comp_C is a Portuguese-English comparable corpus (300,000 words in each language). It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil) - English

ELRA-U-W 0204

Nexing Corpus

The Nexing corpus is a collection of written transcriptions of verbal data (around 30 hours of audio recordings) elicited during a psycholinguistic experiment on syllogistic reasoning.
Language(s) : Portuguese

ELRA-U-W 0205

COMET Multilingual Corpora

This is a multilingual corpus that comprises three subparts:

- CorTec: Technical and Scientific corpus (Brazilian Portuguese, English).
- CoMAprend: Multilingual Learner corpus (Brazilian Portuguese, English, German, French, Spanish, Italian).
- CorTrad: Translation corpus (Brazilian Portuguese, English).

It is designed for contrastive linguistic studies (translation, terminology, teaching, etc.).
Language(s) : Portuguese (Brazil) - English - German - Spanish - French - Italian

ELRA-U-W 0206

LIVAC Synchronous Corpus

The LIVAC contains texts from Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. The materials from the diverse communities have been synchronized. In 2005 the corpus was composed of over 150 million Chinese characters and over 720,000 word types. It is still expanding.
The analysis concerns various linguistic units (characters, words, sentences).
Language(s) : Chinese

ELRA-U-W 0207

PH Corpus of Chinese

Guo Jin's Mandarin Chinese PH corpus is a collection of Chinese newswire texts containing around two million words. The texts were published by the Xinhua News Agency during 1990-1991.
Language(s) : Chinese

ELRA-U-W 0208

PFR Corpus of Chinese

The PFR corpus consists of one month's newspaper material published by the People's Daily (January 1998).
Language(s) : Chinese

ELRA-U-W 0209

Chinese Internet Corpus

The Chinese Internet Corpus contains 280 million words (tokens). It was compiled automatically from the Internet in February 2005, along with other Internet corpora (for English, German and Russian).
Language(s) : Chinese

ELRA-U-W 0210

Reader's Digest Corpus (Czech/English)

The Reader's Digest corpus is a parallel corpus of articles from Reader's Digest (1993-1996). The Czech part is a translation of the English one. It contains 53,117 parallel sentences.
Language(s) : English (United Kingdom) - Czech

ELRA-U-W 0211

Czech Academic Corpus (CAC)

The CAC is a Czech corpus with a manual annotation of morphology, consisting of approximately 650,000 words (it was originally called Corpus of the Pragmatic Style). It is composed of articles from a wide range of media (newspapers, magazines, and transcripts of the spoken language from radio and TV programs).
Language(s) : Czech

ELRA-U-W 0212

Mannheimer Corpus

The Mannheimer Corpus contains 2.53 million words. It is divided into two subcorpora: the Mannheimer Korpus 1 (293 texts from 1950 to 1967) and the Mannheimer Korpus 2 (52 texts from 1949 to 1974). It covers a wide variety of sources.
Language(s) : German
