ELRA - ELRA-U-W 0079 : Penn Treebank


Catalog Reference : ELRA-U-W 0079

Penn Treebank

The Penn Treebank is a bank of linguistic trees for English. The data come from several well-known corpora, together more than one million words: the Wall Street Journal, the Brown Corpus, Switchboard and ATIS. The corpus is annotated with rough syntactic and semantic information. The analysis of sentences is based on constituent structure theory.

Texts are POS-tagged, and transcripts of spoken data (Switchboard) are annotated for disfluency, tagged and parsed.

The data in Release III are annotated in Treebank II style, and the release includes a manual for Treebank II bracketing and the part-of-speech tagging guidelines. Tools for processing Treebank data are made available, as well as the contents of the previous version (Version 0.5). The new material in Release III is the parsed Brown text. The Penn Treebank III is available through the LDC: http://www.ldc.upenn.edu/Catalog/
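To illustrate what a bracketed, POS-tagged parse looks like in practice, the following is a minimal Python sketch that reads one tree in Treebank-style bracket notation and lists its (word, tag) pairs. The sample sentence is invented for illustration and is not taken from the corpus; the function names `parse_bracketed` and `tagged_words` are hypothetical and are not part of the tools shipped with the release. The sketch assumes well-formed input and tolerates an unlabeled outer bracket around the tree.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Pad parentheses with spaces so that a plain split() tokenizes the tree.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        # An unlabeled wrapper bracket gets the empty string as its label.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_words(node):
    """Yield (word, tag) pairs from the leaves, left to right."""
    label, children = node
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


# Invented sample sentence in bracket notation (not from the corpus).
SAMPLE = "(S (NP (DT The) (NN market)) (VP (VBD fell)) (. .))"

tree = parse_bracketed(SAMPLE)
pairs = list(tagged_words(tree))
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Representing each node as a plain `(label, children)` tuple keeps the sketch free of dependencies; a real workflow would more likely rely on the processing tools distributed with the release or on an established treebank reader.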

Identification

Period of coverage :

Version : release III 1999
Version history : release II 1995

Production

Creation date : 1995

Applications


Application area : Research

Contents


written corpus
Number of languages : Monolingual
Language(s) : English (USA)
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Syntactic