ELRA - ELRA-U-W 0023 : The American National Corpus

Catalog Reference : ELRA-U-W 0023

The American National Corpus

In its second release, the ANC contains 22 million words of written American English (10 million more than the first release), collected from 1990 onwards. It covers many genres, from fiction to medical articles, and also includes transcripts of spoken data such as telephone conversations.

The corpus is still growing and is planned to reach 100 million words when completed, which would make it comparable to the British National Corpus (BNC) in size and variety of genres. This resource is of great value for education, for linguistic and lexicographic research, and for NLP applications (machine translation, information retrieval, etc.).

The corpus is annotated for lemmas, parts of speech, noun chunks and verb chunks. POS annotation uses the Penn tagset; many documents have also been POS-tagged with the Biber tagset.

Tools for processing files with stand-off annotations have also been developed.
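
As an illustration of the stand-off principle only, here is a minimal Python sketch. It does not reproduce the ANC file format or the ANC tools: the element name `tok`, the attributes `start`, `end`, `pos` and `lemma`, the sample sentence and the function `resolve_annotations` are all hypothetical. The idea shown is that the primary text stays untouched while a separate XML document points into it by character offsets, here carrying Penn-style POS tags and lemmas.

```python
import xml.etree.ElementTree as ET

# Primary text is stored on its own and is never modified by annotation.
PRIMARY_TEXT = "The corpus is growing."

# Hypothetical stand-off annotation document (NOT the actual ANC schema):
# each <tok> refers to a span of the primary text by character offsets.
STANDOFF_XML = """
<annotations>
  <tok start="0" end="3" pos="DT" lemma="the"/>
  <tok start="4" end="10" pos="NN" lemma="corpus"/>
  <tok start="11" end="13" pos="VBZ" lemma="be"/>
  <tok start="14" end="21" pos="VBG" lemma="grow"/>
  <tok start="21" end="22" pos="." lemma="."/>
</annotations>
"""


def resolve_annotations(text, standoff_xml):
    """Return (word, pos, lemma) tuples by slicing the text at each offset pair."""
    root = ET.fromstring(standoff_xml)
    tokens = []
    for tok in root.iter("tok"):
        start, end = int(tok.get("start")), int(tok.get("end"))
        tokens.append((text[start:end], tok.get("pos"), tok.get("lemma")))
    return tokens


tokens = resolve_annotations(PRIMARY_TEXT, STANDOFF_XML)
for word, pos, lemma in tokens:
    print(f"{word}\t{pos}\t{lemma}")
```

Because the annotations live apart from the text, several annotation layers (for example, a second POS tagset) could in principle reference the same primary file without conflicting with each other.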

Identification

Period of coverage : from 1990 onwards

Version : v2, 2005
Version history :

Production

Project : The American National Corpus project

Creation date : 2005

Applications

Applications possible : Discourse analysis, Information retrieval
Application area : Education, Research

Contents

written corpus
Number of languages : Monolingual
Language(s) : English (USA)
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Syntactic
Annotation Mode : Automatic
Annotation language : XML