ELRA - ELRA-U-W0292 : SYN2005 corpus

Catalog Reference : ELRA-U-W0292

SYN2005 corpus

The SYN2005 corpus is a synchronic, representative corpus of contemporary written Czech. It contains 100 million words (tokens) and is lemmatised and part-of-speech tagged.

This corpus has the same size and annotation as its predecessor, the SYN2000 corpus (see U-W0293), but contains only recent texts (from 2000 to 2004) and has a different genre distribution: 40% fiction, 27% technical literature, 33% journalism.

The SYN2000 corpus is a synchronic, representative corpus of contemporary written Czech. It contains 100 million words (tokens) and is lemmatised and part-of-speech tagged.

SYN2000 contains texts written between 1990 and 1999, along with some important works of 20th-century Czech literature. It was intended to cover many different genres. Genre distribution: 15% fiction, 25% technical literature, 60% journalism.

SYN2005 is part of the Czech National Corpus (CNC).

Contents

written corpus
Number of languages : Monolingual
Language(s) : Czech
Number of tokens : 100 million words
Annotation Granularity : Word
Annotation level : Morphological
Lexical Unit Information : Single word lemma