Catalog Reference : ELRA-U-W0334
The W3C Corpus
The W3C corpus contains data collected from a crawl of the World Wide Web Consortium’s sites (w3c.org). It includes mailing lists, public web pages (HTML), and text derived from some other file types, such as PDF.
The W3C data has been annotated for question answering (QA) topic relevance, for use in TREC Enterprise 2005 and 2006.
The mailing list subset contains nearly 200,000 documents, organized into more than 50,000 email threads based on reply-to relations and subject overlap.
Contents
written corpus
Number of languages : Monolingual
Language(s) : English
Document source : Internet
Saturday 23 November, 2024
Joint Copyright © 2008 ELRA & ELDA
Universal Catalogue 1.0.4