ELRA - ELRA-U-W 0088 : Penn Arabic Treebank

Catalog Reference : ELRA-U-W 0088

Penn Arabic Treebank

The Penn Arabic Treebank (ATB) is a one-million-word corpus that was syntactically annotated within the DARPA TIDES project.

It contains written Modern Standard Arabic newswire from the Agence France Presse corpus (July to November 2000) and uses the same annotation scheme as the Penn Treebank (constituent structure).

The latest version (ATB3) contains a total of 401,122 words/tokens after clitics are separated for treebank annotation.

It is designed to support language research and the development of language technology for Modern Standard Arabic (automatic content extraction, information retrieval, information extraction, natural language processing).

Identification

Period of coverage :

Version : v.3 (2008)
Version history : v2.0 (2003)

Production

Project : DARPA TIDES project

Creation date : 2003

Applications


Application Area : Research

Contents

Written corpus
Number of languages : Monolingual
Language(s) : Modern Standard Arabic
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Syntactic
Annotation language : XML