Universal Catalogue

Written Corpora

Displaying 661 to 680 (of 730 products)

ELRA-WC339

KIAP Corpus 1

This is the first part of the KIAP Corpus, containing 180 different research articles from journals in the domains of Economics, Linguistics and Medicine.
Language(s) : English - French - Norwegian

ELRA-WC340

KIAP Corpus 2

This is the second part of the KIAP Corpus, containing 270 different research articles from journals in the domains of Economics, Linguistics and Medicine.
Language(s) : English - French - Norwegian

ELRA-WC341

Light-verb construction (LVC) Corpus

This is a small corpus of human annotations for sentences containing possible light verb constructions (also known as support verb constructions). It consists of 741 sentences.
Language(s) : English

ELRA-WC342

Croatian National Corpus (HNK)

The Croatian National Corpus (HNK) is a collection of selected texts mainly written in contemporary Croatian covering different media, genres, styles, fields and topics.
The HNK currently contains 101.3 million tokens.
Language(s) : Croatian

ELRA-WC343

Scottish Corpus of Texts and Speech (SCOTS)

The SCOTS Corpus contains documents in Scottish Standard English and in several varieties of Scots. While Scottish Standard English has a standard written form, Scots does not, so the corpus contains a wide range of spelling variation. An Advanced Search System exploits the corpus's extensive sociolinguistic metadata by allowing the user to build up a search profile specifying sociolinguistic or textual criteria. The latest version of the corpus includes 936 documents and a total of 2,524,431 words.
Language(s) : English (Scotland)

ELRA-WC344

The Bulgarian Corpus

The corpus is intended to reach 100 million running words, which are collected from different sources in HTML and RTF formats. It covers different genres: 15% fiction, 78% newspapers and 7% legal texts, government bulletins and others.
Language(s) : Bulgarian

ELRA-WC345

Acquis Communautaire Corpus (Acquis)

This is a large aligned parallel corpus containing 1 billion words in 22 official EU languages (231 language pair combinations). It contains EU legislation, declarations, resolutions, acts, international agreements and documents on contents, principles and political objectives of the EU Treaties.
It is also manually classified according to EUROVOC subject domains.
Language(s) : Czech - Danish - Dutch - English - Estonian - German - Greek - Finnish - French - Hungarian - Italian - Latvian - Lithuanian - Maltese - Polish - Portuguese - Romanian - Slovak - Slovene - Spanish - Swedish - Bulgarian

ELRA-WC346

SEMiSUSANNE Corpus

This is a structurally and syntactically annotated corpus formed from the union of the SUSANNE and SemCor corpora. It contains 33 documents common to both corpora. It is part-of-speech tagged.
Language(s) : English

ELRA-WC347

Phishing Email Corpus

It contains 210 phishing emails produced in 2004 and 2005. Phishing is the fraudulent attempt to acquire information such as passwords or credit card details.
Language(s) : English

ELRA-WC348

Asu-wo-yumo Monologue Corpus

This corpus contains transcriptions of 327 programs of "Asu-wo-yumo", a TV commentary program in which a commentator speaks for 10 minutes on a social issue. The corpus is segmented into speeches and has been syntactically annotated.
Language(s) : Japanese

ELRA-WC349

Kyoto Text Corpus

This is a morphologically and syntactically annotated corpus of 40,000 sentences from a newspaper. Of these, 5,000 sentences are also annotated with case, anaphora and coreference information.
Language(s) : Japanese

ELRA-WC350

NUS SMS Corpus

This is an English SMS (Short Message Service) corpus containing about 10,000 messages collected by university students. It is distributed in XML format.
Language(s) : English

ELRA-WC352

Electronic Corpus of El Periodico de Catalunya

This journalistic corpus consists of 13 million words.
Language(s) : Spanish (Spain)

ELRA-WC353

Sensem Corpus

This is a lexical database consisting of sentences extracted from the electronic version of the newspaper El Periodico de Catalunya. It illustrates the semantic and syntactic behavior of the 250 most frequent Spanish verbs. The corpus comprises one million words, with 100 examples of each verb. 25,000 sentences (800,000 words) have been semantically and syntactically annotated, and about 400,000 words have been manually checked. It is distributed in XML format.
Language(s) : Spanish (Spain)

ELRA-WC354

Corpus do Português

It contains 45 million words in more than 50,000 Portuguese texts from the 1300s to the 1900s.
Language(s) : Portuguese

ELRA-WC355

Alpino Treebank

The treebank contains syntactically annotated Dutch sentences totalling more than 150,000 words. It includes newspaper text (a part of the Eindhoven corpus).
Language(s) : Dutch

ELRA-WC356

ARTFL Textual Database

The database contains nearly 2000 texts, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing. The 18th, 19th and 20th centuries are equally represented, with a smaller selection of 17th century texts as well as some medieval and Renaissance texts. It also includes a Provençal database consisting of 38 texts in their original spellings.
Language(s) : French

ELRA-WC357

Syntactical Database of Current Spanish (Base de Datos Sintácticos del español actual)

It contains about 160,000 clauses (1.5 million words) of Spanish with manually added syntactic analysis, drawn from the ARTHUS corpus (Archivo de Textos Hispánicos de la Universidad de Santiago). Composition: 66.5% written (narratives, essays and journalistic texts), 14.7% drama and 18.9% oral transcriptions.
Language(s) : Spanish

ELRA-WC358

ARTHUS (Archivo de Textos Hispánicos de la Universidad de Santiago)

The corpus contains contemporary texts in Spanish from Spain and from South America, totalling 1,449,005 words across several text types: essays, oral transcriptions, narratives and theatre.
Language(s) : Spanish

ELRA-WC359

CEXI (English Italian Translational Corpus)

This is a bi-directional, parallel, translation-driven corpus, which will consist of about 4.6 million words, or 368 text samples of 10 to 15 thousand words each. It contains translations from English into Italian and translations from Italian into English, published between 1975 and 2000.
Language(s) : Italian - English
