Language Resources |
Displaying 101 to 120 (of 730 products) |
Result Pages: 6 |
This is a test suite of written Polish sentences, created as part of the European Union CRIT-2 project. The annotation is based on HPSG.
Language(s) : Polish (Poland)
|
|
|
|
The treebank contains 82,000 sentences (1.2 million words), which have been syntactically analysed within the framework of Hungarian generative syntax.
Language(s) : Hungarian
|
|
|
|
This Portuguese treebank comprises the first million words of the CETEMPúblico corpus (41,406 sentences, 41,382 trees). It also includes the contents of the Bosque Treebank without manual revision (from CETEMPúblico and CETENFolha). It thus covers Portuguese spoken in Portugal and in Brazil.
Language(s) : Portuguese (Portugal) - Portuguese (Brazil)
|
|
|
|
The Szeged Corpus is a morpho-syntactically annotated and POS-tagged Hungarian natural language database. It contains 1.2 million words from texts of various genres.
Language(s) : Hungarian (Hungary)
|
|
|
|
LINDSEI is a corpus of spoken English produced by learners whose mother tongue is French. It contains transcripts totalling 100,000 words from 50 interviews (30 female subjects, 20 male subjects).
Language(s) : English (France)
|
|
|
|
The SALSA corpus is based on the TIGER corpus, a syntactically annotated German newspaper corpus of 1.5 million words. Word senses and semantic roles were added to TIGER using the frames of FrameNet 1.2.
Language(s) : German (Germany)
|
|
|
|
The Croatian Dependency Treebank is a part of the Croatian National Corpus (weekly newspaper Croatia Weekly, CW2000) that is lemmatized and morphosyntactically tagged in accordance with the MULTEXT-East recommendations. The plan is to collect at least 100,000 tokens.
Language(s) : Croatian
|
|
|
|
Hinoki is built from 38,900 dictionary definition sentences. It contains syntactic annotation based on HPSG and word sense tagging.
Language(s) : Japanese (Japan)
|
|
|
|
This German-English parallel corpus is adapted from the de-news web site. It includes 9,756 news items, 66,317 German sentences (1,017,064 tokens), 62,475 English sentences (1,175,526 tokens) and 59,014 aligned sentences.
Language(s) : German - English
|
|
|
|
This multilingual parallel corpus consists of the novel "1984" (G. Orwell) and contains approximately 100,000 words per language (English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Latvian, Lithuanian, Serbian, Russian).
Language(s) : English (United Kingdom)/Romanian - English/Slovene - English/Czech - English/Bulgarian - English/Estonian - English/Hungarian - English/Latvian - English/Lithuanian - English/Serbian - English/Russian
|
|
|
|
This multilingual comparable corpus contains a fiction part and a news part. The data are comparable across the languages in terms of number and size of texts. It is divided into twelve parts of 100,000 words each.
Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.
Language(s) : Romanian (Romania) - Slovene (Slovenia) - Czech (Czech Republic) - Bulgarian (Bulgaria) - Estonian (Estonia) - Hungarian (Hungary)
|
|
|
|
This multilingual resource is a morpho-syntactically annotated version of the novel "1984" (G. Orwell) in 8 languages. Context-disambiguated lemmas and morpho-syntactic descriptions are marked up for each word.
Language(s) : English (United Kingdom) - Romanian (Romania) - Slovene (Slovenia) - Czech (Czech Republic) - Bulgarian (Bulgaria) - Estonian (Estonia) - Hungarian (Hungary) - Serbian
|
|
|
|
The Basque Semcor is a manually sense-annotated corpus for Basque. It contains 300,000 words (note that Basque is an agglutinative language).
Language(s) : Basque (Spain)
|
|
|
|
The Sfnet corpus is a discussion group corpus collected from a Finnish newsgroup area. It contains discussions in Finnish written from October 2002 to April 2003, for a total of over 100 million words.
Language(s) : Finnish (Finland)
|
|
|
|
The Finnish Parole Corpus is annotated morphologically and syntactically.
Language(s) : Finnish (Finland)
|
|
|
|
The PAROLE corpus of Swedish contains 19 million running-text words. It was collected in the EU project LE-PAROLE, which ended in 1997.
Language(s) : Swedish (Sweden)
|
|
|
|
The German Parole Corpus contains approximately 20 million words and is in TEI/SGML format. Part of it is tagged.
Language(s) : German (Germany)
|
|
|
|
The Oulu corpus contains 429,058 words (29,000 sentences from 5,800 short samples) representative of the standard Finnish language of the 1960s (fiction literature, radio talks, newspapers and journals, non-fiction literature).
Language(s) : Finnish (Finland)
|
|
|
|
This resource is a Swedish-Finnish parallel corpus.
Language(s) : Swedish (Sweden) - Finnish (Finland)
|
|
|
|
This corpus contains 430,000 words of Middle French from 14 documents (prose, novels, plays and lyrical poetry).
Language(s) : French (France)
|
|
|
|