ELRA - ELRA-U-W0295 : Nepali Written Corpus

Catalog Reference : ELRA-U-W0295

Nepali Written Corpus

The Nepali Written Corpus is a part of the Nepali National Corpus (NNC).

This is a monolingual corpus of 15 million words containing texts from various books, magazines, newspapers and websites. It is segmented and POS-tagged, using a tagset of 112 part-of-speech tags developed empirically to annotate the first part of the corpus, the Core Sample.

It is divided into two parts :
- the Core Sample : includes 398 texts (804,574 words) collected from books, journals, magazines and newspapers between 1990 and 1992. It covers a wide variety of genres : press reportage, press editorial, press review, religion, skills, trades and hobbies, biographies, essays, science, fiction. This part was developed following the FLOB and FROWN framework for compiling text corpora.
- the General Corpus : includes digitized written texts (14 million words) collected mainly from Internet websites, newspapers, books, publishers and authors between 2005 and 2006.

It is available in the ELRA catalogue http://catalog.elra.info under the reference ELRA-W0076.

In addition to this corpus, the NNC contains :
- a parallel corpus (see Nepali-English Parallel Corpus, ref. U-W0296).
- a spoken corpus (see Nepali Spoken Corpus, ref. U-S0205).
- a text-to-speech corpus (see Nepali Text-to-Speech Corpus, ref. U-S0206).

Production

Project : NeLRaLEC

Applications


Application Area : Research

Contents

written corpus
Number of languages : Monolingual
Language(s) : Nepali
Character set : Unicode
Annotation language : XML
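
Since the corpus is POS-tagged and annotated in XML, tag frequencies can in principle be counted with a standard XML parser. The following is a minimal sketch only: the element name `w`, the attribute `pos`, the tag labels and the sample sentence are all hypothetical placeholders, not the corpus's documented schema or its 112-tag tagset, and should be replaced with the names used in the distributed files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample. The element name "w", the attribute "pos" and the
# tag labels below are assumptions for illustration, not the real schema.
SAMPLE = """<text>
  <s><w pos="NN">नेपाल</w><w pos="JJ">सुन्दर</w><w pos="NN">देश</w><w pos="VV">हो</w></s>
</text>"""


def count_pos_tags(xml_string, word_tag="w", pos_attr="pos"):
    """Return a Counter mapping each POS tag to its number of occurrences."""
    root = ET.fromstring(xml_string)
    return Counter(
        el.get(pos_attr)
        for el in root.iter(word_tag)
        if el.get(pos_attr) is not None  # skip tokens without a tag
    )


tag_counts = count_pos_tags(SAMPLE)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

The element and attribute names are passed as parameters so that the same function can be pointed at whatever markup the actual files use.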