Universal Catalogue

Written Corpora

Displaying 101 to 120 (of 730 products)

ELRA-U-W 0092

HPSG-Annotated Test Suite for Polish

This is a test suite of written Polish sentences created as part of the European Union CRIT-2 project. The annotation is based on HPSG.
Language(s) : Polish (Poland)

ELRA-U-W 0093

Szeged Treebank for Hungarian

The treebank contains 82,000 sentences (1.2 million words) that have been syntactically analysed according to the theory of Hungarian generative syntax.
Language(s) : Hungarian

ELRA-U-W 0094

Floresta Virgem

This Portuguese treebank comprises the first million words of the CETEMPúblico corpus (41,406 sentences, 41,382 trees). It also includes the contents of the Bosque Treebank without manual revision (from CETEMPúblico and CETENFolha). It thus covers Portuguese spoken in Portugal and in Brazil.
Language(s) : Portuguese (Portugal) - Portuguese (Brazil)

ELRA-U-W 0095

Szeged Corpus for Hungarian

The Szeged Corpus is a morpho-syntactically annotated and POS-tagged Hungarian natural language database. It contains 1.2 million words from texts of various genres.
Language(s) : Hungarian (Hungary)

ELRA-U-W 0096

Louvain International Database of Spoken English Interlanguage (LINDSEI)

LINDSEI is a corpus of spoken English produced by learners whose mother tongue is French. It contains transcripts totalling 100,000 words from 50 interviews (30 female subjects, 20 male subjects).
Language(s) : English (France)

ELRA-U-W 0097

The Salsa Corpus

The SALSA corpus is based on the TIGER corpus, a syntactically annotated German newspaper corpus of 1.5 million words. Word senses and semantic roles were added to TIGER using the frames of FrameNet 1.2.
Language(s) : German (Germany)

ELRA-U-W 0098

Croatian Dependency Treebank

The Croatian Dependency Treebank is a part of the Croatian National Corpus (weekly newspaper Croatia Weekly, CW2000) that is lemmatized and morphosyntactically tagged in accordance with Multext-East recommendations. The plan is to gather at least 100,000 tokens.
Language(s) : Croatian

ELRA-U-W 0099

Hinoki Treebank / Sensebank

The Hinoki treebank is built from 38,900 dictionary definition sentences. It contains syntactic annotation based on HPSG and word sense tagging.
Language(s) : Japanese (Japan)

ELRA-U-W 0100

German-English Parallel Corpus de-news

This German-English parallel corpus is adapted from the de-news web site. It includes 9,756 news items, 66,317 German sentences (1,017,064 tokens), 62,475 English sentences (1,175,526 tokens) and 59,014 aligned sentences.
Language(s) : German - English

ELRA-U-W 0101

Multext-East 1984 Parallel Corpus

This multilingual parallel corpus consists of the novel "1984" (G. Orwell) and contains approximately 100,000 words per language (English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Latvian, Lithuanian, Serbian, Russian).
Language(s) : English (United Kingdom) - Romanian - Slovene - Czech - Bulgarian - Estonian - Hungarian - Latvian - Lithuanian - Serbian - Russian

ELRA-U-W 0102

Multext-East Comparable Corpus

This multilingual comparable corpus contains a fiction part and a news part. The data are comparable across the languages in terms of number and size of texts. It is divided into twelve parts of 100,000 words each.
Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.
Language(s) : Romanian (Romania) - Slovene (Slovenia) - Czech (Czech Republic) - Bulgarian (Bulgaria) - Estonian (Estonia) - Hungarian (Hungary)

ELRA-U-W 0103

Multext-East POS tagged 1984

This multilingual resource is a morpho-syntactically annotated version of the novel "1984" (G. Orwell) in 8 languages. Context-disambiguated lemmas and morpho-syntactic descriptions are marked up for each word.
Language(s) : English (United Kingdom) - Romanian (Romania) - Slovene (Slovenia) - Czech (Czech Republic) - Bulgarian (Bulgaria) - Estonian (Estonia) - Hungarian (Hungary) - Serbian

ELRA-U-W 0104

The Basque Semcor Corpus (EuSemcor)

The Basque Semcor is a manually sense-annotated corpus for Basque. It contains 300,000 words (note that Basque is an agglutinative language).
Language(s) : Basque (Spain)

ELRA-U-W 0105

Sfnet Corpus of Finnish

The Sfnet corpus is a discussion group corpus collected from a Finnish newsgroup area. It contains discussions in Finnish written from October 2002 to April 2003, for a total of over 100 million words.
Language(s) : Finnish (Finland)

ELRA-U-W 0106

Finnish Parole Corpus (PAROLE-FI)

The Finnish Parole Corpus is annotated morphologically and syntactically.
Language(s) : Finnish (Finland)

ELRA-U-W 0107

Swedish Parole Corpus (PAROLE-SV)

The PAROLE corpus of Swedish contains 19 million running-text words. It was collected in the EU project LE-PAROLE, which ended in 1997.
Language(s) : Swedish (Sweden)

ELRA-U-W 0108

German Parole Corpus (PAROLE-DE)

The German Parole Corpus contains approximately 20 million words and is in TEI/SGML format. Part of it is tagged.
Language(s) : German (Germany)

ELRA-U-W 0109

Oulu Corpus of Finnish

The Oulu corpus contains 429,058 words (29,000 sentences from 5,800 short samples) representative of the standard Finnish language of the 1960s (fiction literature, radio talks, newspapers and journals, non-fiction literature).
Language(s) : Finnish (Finland)

ELRA-U-W 0110

KOTUS Swedish-Finnish Parallel Corpus

This resource is a Swedish-Finnish parallel corpus.
Language(s) : Swedish (Sweden) - Finnish (Finland)

ELRA-U-W 0111

Jyväskylä Corpus of Middle-French

This corpus contains 430,000 words of Middle French from 14 documents (prose, novels, plays and lyrical poetry).
Language(s) : French (France)
