Universal Catalogue

Written Corpora

Displaying 641 to 660 (of 730 products)

ELRA-WC319

Why-questions Corpus

This data collection comprises 395 why-questions, their source documents, and one or two answers for each question. Paraphrases were also created for a subset of the questions.
Language(s) : English

ELRA-WC320

FAQ (Frequently Asked Questions) Corpus

This is a collection of 2,824,179 Q/A pairs downloaded from the web.
Language(s) : English

ELRA-WC321

Cyclone Corpus

This encyclopedic corpus was built by extracting and organizing data from 30 million web pages.
Language(s) : Japanese

ELRA-WC322

PropBank Database

The database consists of a verb lexicon of about 3,600 verbs and a semantically annotated corpus, the Penn Wall Street Journal Treebank II, in which more than 110,000 PropBank instances are annotated.
Language(s) : English

ELRA-WC323

Slovene Dependency Treebank

This treebank consists of 1,984 sentences (30,000 words) that were manually annotated.
Language(s) : Slovenian

ELRA-WC324

Talbanken05 Swedish Treebank

Talbanken05 is a modernized version of Talbanken76, a syntactically annotated Swedish corpus.
Language(s) : Swedish

ELRA-WC325

Talbanken76 Swedish Treebank

This is a Swedish POS-tagged and syntactically annotated corpus containing a written part (professional prose and high school students’ essays) and a spoken part (interviews, conversations, and debates). The whole corpus consists of 300,000 tokens.
Language(s) : Swedish

ELRA-WC326

NIL Corpus

This is an annotated Chinese chat language corpus, an informal language corpus built for informal language processing research. It covers 12,112 pieces of chat language text containing 92,314 words and 12,983 chat terms.
Language(s) : Chinese

ELRA-WC327

Japanese Associative Concept Dictionary

It consists of 33,018 words and 240,093 word pairs produced by the associations of 10 participants.
Language(s) : Japanese

ELRA-WC328

Annotated KNACK-2002 Corpus of Dutch Written Text

This corpus contains 267 documents published in a news magazine during the first ten weeks of 2002, covering five domains: economic, political, scientific, cultural and social news. The documents were annotated for coreference.
Language(s) : Dutch

ELRA-WC329

TagShare Corpus

This is a Portuguese corpus of one million tokens. One third of the corpus consists of transcribed spoken materials.
Language(s) : Portuguese

ELRA-WC330

Nobel Prize Winners in Physics and Economics Corpus

It consists of Nobel Lectures since the inception of the prize in Physics (1902-2004, 969,515 tokens in 157 texts) and in Economics (1969-2004, 727,658 tokens in 55 texts).
Language(s) : English

ELRA-WC331

Spanish TextCeram Tagged Domain Corpus

This is a 12.6 MB corpus (2.8 million words) of specialised texts from books and works in the field of ceramics.
Language(s) : Spanish

ELRA-WC332

English TextCeram Tagged Domain Corpus

This is a 1.16 MB corpus (250,000 words) of specialised texts from books and works in the field of ceramics.
Language(s) : English

ELRA-WC333

Pilot version of Russian Reference Corpus

This balanced collection of written modern Russian, a representative sample of various genres, consists of 50 million words.
Language(s) : Russian

ELRA-WC334

Corpus of Russian Newspapers

It contains 78 million words drawn from several major Russian newspapers published from 2001 to 2004.
Language(s) : Russian

ELRA-WC335

Russian Internet Corpus

It contains 160 million words. This is a snapshot of the modern Russian language as used on the Internet; it is still a work in progress.
Language(s) : Russian

ELRA-WC336

Corpus of Russian Fiction

It contains 1.5 million words; its morphosyntactic features have been manually disambiguated.
Language(s) : Russian

ELRA-WC337

Computer Corpus of Russian Newspapers Texts of the End of the XX-th Century

These data include full issues of 13 newspapers published in 1994-1997. The newspapers are daily and weekly, central and regional, and rightist, centrist and leftist. In total, the corpus contains 11,401,479 running words and 15.004 different lexemes in 23,109 different texts of varying length.
Language(s) : Russian

ELRA-WC338

Brown Corpus of Bulgarian (BCB)

The corpus follows the structure of the Brown University Corpus and comprises 1,000,805 words extracted mainly from electronic texts. Only original Bulgarian texts were included. The corpus consists of 500 text units belonging to 15 categories, each unit being approximately 2,000 words long.
Language(s) : Bulgarian
