Catalog Reference : ELRA-U-W 0023
The American National Corpus
In its second release, the ANC contains 22 million words of written American English (10 million more than the first release), collected from 1990 onwards. It covers many genres, from fiction to medical articles, and also contains transcripts of spoken data such as telephone conversations.
The corpus is still growing and is planned to reach 100 million words when completed, making it comparable to the British National Corpus (BNC) in size and variety of genres. The resource is valuable for education, linguistic and lexicographic research, and NLP applications such as machine translation and information retrieval.
The corpus is annotated for lemmas, parts of speech, noun chunks and verb chunks. POS annotation uses the Penn tagset; many documents have also been POS-tagged with the Biber tagset.
Tools for processing files with stand-off annotations have also been developed.
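As a rough illustration of what "stand-off" means here: the primary text is left untouched, and the annotations live in a separate XML document that points back into the text by character offsets. The sketch below is not the ANC's actual file format or tooling; the element name `tok`, the attributes `start`, `end`, `pos` and `lemma`, and the helper `resolve` are all hypothetical, chosen only to show the idea with Penn-style POS tags.

```python
import xml.etree.ElementTree as ET

# Primary text, stored unmodified (stand-off annotation never edits it).
TEXT = "The corpus is growing."

# Hypothetical stand-off annotation document. Element and attribute names
# are illustrative assumptions, not the ANC's real schema.
ANNOTATIONS = """
<annotations>
  <tok start="0" end="3" pos="DT" lemma="the"/>
  <tok start="4" end="10" pos="NN" lemma="corpus"/>
  <tok start="11" end="13" pos="VBZ" lemma="be"/>
  <tok start="14" end="21" pos="VBG" lemma="grow"/>
  <tok start="21" end="22" pos="." lemma="."/>
</annotations>
"""


def resolve(text, xml_source):
    """Join each annotation with the text span its offsets point to."""
    root = ET.fromstring(xml_source.strip())
    tokens = []
    for tok in root.iter("tok"):
        start, end = int(tok.get("start")), int(tok.get("end"))
        tokens.append((text[start:end], tok.get("pos"), tok.get("lemma")))
    return tokens


if __name__ == "__main__":
    # Print one token per line: surface form, POS tag, lemma.
    for surface, pos, lemma in resolve(TEXT, ANNOTATIONS):
        print(f"{surface}\t{pos}\t{lemma}")
```

Because the annotations only reference offsets, several annotation layers (for example Penn and Biber POS tags) can in principle coexist for the same text without conflicting with one another.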
Identification
Period of coverage :
from 1990 onwards
Version :
v2, 2005
Version history :
Production
Project :
The American National Corpus project
Creation date :
2005
Applications
Applications possible :
Discourse analysis, Information retrieval
Application area :
Education, Research
Contents
written corpus
Number of languages :
Monolingual
Language(s) :
English (USA)
Annotation Coverage : Full
Annotation Granularity : Word
Annotation level : Syntactic
Annotation Mode : Automatic
Annotation language : XML
Joint Copyright © 2008 ELRA & ELDA