Universal Catalogue  
Written Corpora
Displaying 101 to 120 (of 730 products)

ELRA-U-W 0092
HPSG-Annotated Test Suite for Polish 


This is a test suite of written Polish sentences created as part of the European Union CRIT-2 project. The annotation is based on HPSG (Head-driven Phrase Structure Grammar).
Language(s) : Polish (Poland)



ELRA-U-W 0093
Szeged Treebank for Hungarian 


The treebank contains 82,000 sentences (1.2 million words) that have been syntactically analysed within the framework of Hungarian generative syntax.
Language(s) : Hungarian



ELRA-U-W 0094
Floresta Virgem 


This Portuguese treebank comprises the first million words of the CETEMPúblico corpus (41,406 sentences, 41,382 trees). It also includes the contents of the Bosque Treebank without manual revision (from CETEMPúblico and CETENFolha). It thus covers Portuguese spoken in Portugal and in Brazil.
Language(s) : Portuguese (Portugal) - Portuguese (Brazil)



ELRA-U-W 0095
Szeged Corpus for Hungarian 


The Szeged Corpus is a morpho-syntactically annotated and POS-tagged Hungarian natural language database. It contains 1.2 million words from texts of various genres.
Language(s) : Hungarian (Hungary)



ELRA-U-W 0096
Louvain International Database of Spoken English Interlanguage (LINDSEI)


LINDSEI is a corpus of spoken English produced by learners whose mother tongue is French. It contains transcripts totalling 100,000 words from 50 interviews (30 female subjects, 20 male subjects).
Language(s) : English (France)



ELRA-U-W 0097
The SALSA Corpus


The SALSA corpus is based on the TIGER corpus, a syntactically annotated German newspaper corpus of 1.5 million words. Word senses and semantic roles were added to TIGER using the frames of FrameNet 1.2.
Language(s) : German (Germany)



ELRA-U-W 0098
Croatian Dependency Treebank 


The Croatian Dependency Treebank is a part of the Croatian National Corpus (weekly newspaper Croatia Weekly, CW2000) which is lemmatized and morphosyntactically tagged in accordance with Multext-East recommendations. The plan is to gather at least 100,000 tokens.
Language(s) : Croatian



ELRA-U-W 0099
Hinoki Treebank / Sensebank 


The Hinoki treebank is built from 38,900 dictionary definition sentences. It contains syntactic annotation based on HPSG and word sense tagging.
Language(s) : Japanese (Japan)



ELRA-U-W 0100
German-English Parallel Corpus de-news 


This German-English parallel corpus is adapted from the de-news web site. It includes 9,756 news items, 66,317 German sentences (1,017,064 tokens), 62,475 English sentences (1,175,526 tokens) and 59,014 aligned sentences.
Language(s) : German - English



ELRA-U-W 0101
Multext-East 1984 Parallel Corpus 


This multilingual parallel corpus consists of the novel "1984" (G. Orwell) and contains approximately 100,000 words per language (English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Latvian, Lithuanian, Serbian, Russian).
Language(s) : English (United Kingdom) - Romanian; English - Slovene; English - Czech; English - Bulgarian; English - Estonian; English - Hungarian; English - Latvian; English - Lithuanian; English - Serbian; English - Russian



ELRA-U-W 0102
Multext-East Comparable Corpus 


This multilingual comparable corpus contains a fiction part and a news part. Data is comparable across the languages in terms of number and size of texts. It is divided into twelve parts of 100,000 words each.
Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.
Language(s) : Romanian (Romania) - Slovene (Slovenia) - Czech (Czech Republic) - Bulgarian (Bulgaria) - Estonian (Estonia) - Hungarian (Hungary)



ELRA-U-W 0103
Multext-East POS tagged 1984 


This multilingual resource is a morpho-syntactically annotated version of the novel "1984" (G. Orwell) in 8 languages. Context-disambiguated lemmas and morpho-syntactic descriptions are marked up for each word.
Language(s) : English (United Kingdom) - Romanian (Romania) - Slovene (Slovenia) - Czech (Czech Republic) - Bulgarian (Bulgaria) - Estonian (Estonia) - Hungarian (Hungary) - Serbian



ELRA-U-W 0104
The Basque Semcor Corpus (EuSemcor)


The Basque Semcor is a manually sense-annotated corpus for Basque. It contains 300,000 words (note that Basque is an agglutinative language).
Language(s) : Basque (Spain)



ELRA-U-W 0105
Sfnet Corpus of Finnish 


The Sfnet corpus is a discussion group corpus collected from a Finnish newsgroup area. It contains discussions in Finnish written from October 2002 to April 2003, for a total of over 100 million words.
Language(s) : Finnish (Finland)



ELRA-U-W 0106
Finnish Parole Corpus (PAROLE-FI)


The Finnish Parole Corpus is annotated morphologically and syntactically.
Language(s) : Finnish (Finland)



ELRA-U-W 0107
Swedish Parole Corpus (PAROLE-SV)


The PAROLE corpus of Swedish contains 19 million running-text words. It was collected in the EU project LE-PAROLE, which ended in 1997.
Language(s) : Swedish (Sweden)



ELRA-U-W 0108
German Parole Corpus (PAROLE-DE)


The German Parole Corpus contains approximately 20 million words and is in TEI/SGML format. Part of it is tagged.
Language(s) : German (Germany)



ELRA-U-W 0109
Oulu Corpus of Finnish 


The Oulu corpus contains 429,058 words (29,000 sentences from 5,800 short samples) representative of the standard Finnish language of the 1960s (fiction literature, radio talks, newspapers and journals, non-fiction literature).
Language(s) : Finnish (Finland)



ELRA-U-W 0110
KOTUS Swedish-Finnish Parallel Corpus 


This resource is a Swedish-Finnish parallel corpus.
Language(s) : Swedish (Sweden) - Finnish (Finland)



ELRA-U-W 0111
Jyväskylä Corpus of Middle-French 


This corpus contains 430,000 words of Middle French from 14 documents (prose, novels, plays and lyrical poetry).
Language(s) : French (France)




Joint Copyright © 2008 ELRA & ELDA
Universal Catalogue 1.0.4