Language Resources
Displaying 201 to 220 (of 730 products)
Result Pages: 11
Repentino is composed of textual named entity instances: proper nouns, each classified by the kind of entity it denotes (company, book title, place name, etc.). Repentino currently gathers more than 450,000 instances, stored in XML.
Language(s) : Portuguese
The ELAN corpus is a subcorpus of the CRPC (Corpus de Referencia do Portugues Contemporaneo). It contains 2,840,552 words.
Language(s) : Portuguese
The RL corpus is a subcorpus of the CRPC (Corpus de Referencia do Portugues Contemporaneo). It contains 8,670,438 words.
Language(s) : Portuguese
This resource is a subcorpus of the CRPC (Corpus de Referencia do Portugues Contemporaneo). It contains 3,070,879 words of written data and 129,245 words of spoken data. It is representative of African Portuguese (Angola, São Tomé and Príncipe, Mozambique, Guinea, Cape Verde).
Language(s) : Portuguese
This corpus contains 100 news texts extracted from Folha de São Paulo and Jornal do Brasil, for a total of 61,412 words. They come with manually written summaries and ideal extracts.
Language(s) : Portuguese
The Lacio-Ref is a reference corpus of newspaper articles in Brazilian Portuguese. It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)
Mac-Morpho is a 1.1-million-word gold-standard corpus (a portion of the Lacio-Ref) that is morpho-syntactically annotated (PALAVRAS, E. Bick) and manually validated. It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)
This resource is a portion of the Lacio-Ref which is automatically annotated with lemmas, POS and syntactic tags (Curupira parser). It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)
The Lacio-Dev is a deviation corpus composed of non-revised texts (516,840 tokens). It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)
Par-C is a Portuguese-English parallel corpus. It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil) - English
Comp_C is a Portuguese-English comparable corpus (300,000 words in each language). It was developed in the Lacio-Web Project.
Language(s) : Portuguese (Brazil)
The Nexing corpus is a collection of written transcriptions of verbal data (around 30 hours of audio recordings) elicited during a psycholinguistic experiment on syllogistic reasoning.
Language(s) : Portuguese
This is a multilingual corpus that comprises three subparts:
- CorTec: Technical and Scientific corpus (Brazilian Portuguese, English).
- CoMAprend: Multilingual Learner corpus (Brazilian Portuguese, English, German, French, Spanish, Italian).
- CorTrad: Translation corpus (Brazilian Portuguese, English).
It is designed for contrastive linguistic studies (translation, terminology, teaching, etc.).
Language(s) : Portuguese (Brazil) - English - German - Spanish - French - Italian
The LIVAC corpus contains texts from Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. The materials from these communities have been synchronized. As of 2005, the corpus comprised over 150 million Chinese characters and over 720,000 word types. It is still expanding.
The analysis concerns various linguistic units (characters, words, sentences).
Language(s) : Chinese
Guo Jin’s Mandarin Chinese PH corpus is a collection of Chinese newswire texts containing around two million words. The texts were published by the Xinhua News Agency during 1990-1991.
Language(s) : Chinese
The PFR corpus consists of one month's newspaper material published by the People's Daily (January 1998).
Language(s) : Chinese
The Chinese Internet Corpus contains 280 million words (tokens). It was compiled automatically from the Internet in February 2005, along with other Internet corpora (for English, German and Russian).
Language(s) : Chinese
The Reader's Digest corpus is a parallel corpus of articles from Reader's Digest (1993-1996). The Czech part is a translation of the English one.
It contains 53,117 parallel sentences.
Language(s) : English (United Kingdom) - Czech
The CAC is a Czech corpus with a manual annotation of morphology, consisting of approximately 650,000 words (it was originally called Corpus of the Pragmatic Style). It is composed of articles from a wide range of media (newspapers, magazines, and transcripts of the spoken language from radio and TV programs).
Language(s) : Czech
The Mannheimer Corpus contains 2.53 million words. It is divided into two subcorpora: the Mannheimer Korpus 1 (293 texts from 1950 to 1967) and the Mannheimer Korpus 2 (52 texts from 1949 to 1974). It covers a wide variety of sources.
Language(s) : German