ELRA - ELRA-WC338 : Brown Corpus of Bulgarian


Catalog Reference : ELRA-WC338

Brown Corpus of Bulgarian

The corpus follows the design of the Brown University Corpus and comprises 1,000,805 words, drawn mainly from electronic texts. As a rule, only original Bulgarian texts were included. Exceptions were made for the romance and western fiction excerpts, which were taken from foreign-language sources translated into Bulgarian, because original Bulgarian texts in these genres were lacking.

The BCB corpus is divided into 500 text units of approximately 2,000 words each. Most texts contain more than 2,000 words; only a small number contain fewer. The texts were sampled from 15 text categories, and the number of texts varies by category:
Press – reportage: 44;
Press – editorial: 27;
Press – reviews: 17;
Religion: 17;
Skills and hobbies: 36;
Popular lore: 48;
Belles-lettres: 75;
Miscellaneous – government and house organs: 30;
Learned: 80;
Fiction – general: 29;
Fiction – mystery: 24;
Fiction – science: 6;
Fiction – adventure: 29;
Fiction – romance: 29;
Humor: 9.
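
As a quick consistency check on the figures above, the following minimal Python sketch sums the per-category counts and derives the mean text length. The variable names are illustrative only and are not part of the resource; the counts and the word total are copied from this description.

```python
# Texts per category, copied from the list in this catalogue entry.
CATEGORY_COUNTS = {
    "Press – reportage": 44,
    "Press – editorial": 27,
    "Press – reviews": 17,
    "Religion": 17,
    "Skills and hobbies": 36,
    "Popular lore": 48,
    "Belles-lettres": 75,
    "Miscellaneous – government and house organs": 30,
    "Learned": 80,
    "Fiction – general": 29,
    "Fiction – mystery": 24,
    "Fiction – science": 6,
    "Fiction – adventure": 29,
    "Fiction – romance": 29,
    "Humor": 9,
}

# Total corpus size in words, as stated in the description.
TOTAL_WORDS = 1_000_805

total_texts = sum(CATEGORY_COUNTS.values())
mean_words_per_text = TOTAL_WORDS / total_texts

print(len(CATEGORY_COUNTS))           # 15 categories
print(total_texts)                    # 500 text units
print(round(mean_words_per_text, 2))  # 2001.61 words per text on average
```

The counts add up to the stated 500 text units, and the mean of about 2,002 words per text agrees with the "approximately 2,000 words each" figure.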

Extracts from the BCB have been tagged to form the Tagged Corpus of Bulgarian (U-W 0128) and semantically disambiguated to form the Semantic Corpus of Bulgarian (U-W 0129). A version of the Bulgarian Brown Corpus containing the full-length texts (nearly 5 million words) is also available.

Identification

Period of coverage : 1990–2005


Contents


written corpus
Number of languages : Monolingual
Language(s) : Bulgarian
Number of tokens : approximately one million words (1,000,805)