Catalog Reference : ELRA-U-W0295
Nepali Written Corpus
The Nepali Written Corpus is a part of the Nepali National Corpus (NNC).
This is a monolingual corpus of 15 million words containing texts from books, magazines, newspapers and websites. It is segmented and POS-tagged using a tagset of 112 part-of-speech tags, developed empirically to annotate the first part of the corpus, the Core Sample.
It is divided into two parts :
- the Core Sample : 398 texts (804,574 words) collected from books, journals, magazines and newspapers between 1990 and 1992. It covers a wide variety of genres : press reportage, press editorials, press reviews, religion, skills, trades and hobbies, biographies, essays, science, fiction. This part was compiled following the text collection framework used for the FLOB and FROWN corpora.
- the General Corpus : digitized written texts (14 million words) collected mainly from websites, newspapers, books, publishers and authors between 2005 and 2006.
It is available in the ELRA catalogue (http://catalog.elra.info) under the reference ELRA-W0076.
In addition to this corpus, the NNC contains :
- a parallel corpus (see Nepali-English Parallel Corpus, ref. U-W0296).
- a spoken corpus (see Nepali Spoken Corpus, ref. U-S0205).
- a text-to-speech corpus (see Nepali Text-to-Speech Corpus, ref. U-S0206).
Production
Project : NeLRaLEC
Applications
Application area : Research
Contents
written corpus
Number of languages : Monolingual
Language(s) : Nepali
Character set : Unicode
Annotation language : XML
Saturday 23 November, 2024
Joint Copyright © 2008 ELRA & ELDA