ELRA - ELRA-U-W 0087 : Estonian Treebank Arborest

Catalog Reference : ELRA-U-W 0087

Estonian Treebank Arborest

Arborest is a 2,500-sentence treebank of Estonian, built in a two-stage process using both Constraint Grammar (CG) and Phrase Structure Grammar (PSG).

The 200,000-word Estonian CG corpus* is a proofread corpus with shallow syntactic annotation, covering various genres (fiction, newspapers and legal texts). Arborest is the result of converting 15% of that corpus to PSG in a semi-automatic way. Structural information was added during the conversion, yielding a treebank in the VISL style.
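As a rough sanity check on the figures above, the size of the converted portion can be estimated with a few lines of Python. This is only a sketch: it assumes the 15% refers to the corpus word count (the entry does not say whether it is measured in words or sentences), and the function and constant names are illustrative, not part of the resource.

```python
# Figures quoted in the catalogue entry.
CG_CORPUS_WORDS = 200_000    # size of the Estonian CG corpus, in words
CONVERTED_PERCENT = 15       # share of the corpus converted to PSG
TREEBANK_SENTENCES = 2_500   # number of sentences in Arborest


def estimate_treebank_size(corpus_words, converted_percent, sentences):
    """Return (approximate word count, approximate mean words per sentence).

    Assumption: the percentage is applied to the word count.
    """
    words = corpus_words * converted_percent // 100
    return words, words / sentences


words, mean_len = estimate_treebank_size(
    CG_CORPUS_WORDS, CONVERTED_PERCENT, TREEBANK_SENTENCES
)
print(f"about {words} words, about {mean_len:.0f} words per sentence")
# prints: about 30000 words, about 12 words per sentence
```

Under that assumption, the treebank would cover roughly 30,000 words, or about 12 words per sentence on average; the actual counts in the distributed files may differ.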

A larger treebank is under construction (2004-2008).

This is a useful resource for language description (Estonian syntax) and for evaluating language technology software (Information Retrieval, Information Extraction, Machine Translation, etc.).

* Estonian Constraint Grammar Corpus

Applications

Application Area : Research

Contents

Written corpus
Number of languages : Monolingual
Language(s) : Estonian
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Syntactic
Annotation Mode : Semi-automatic