ELRA - ELRA-U-MM0005 : Twente News Corpus

Catalog Reference : ELRA-U-MM0005

Twente News Corpus

The Twente News Corpus (TwNC) is a multifaceted corpus for Dutch. It contains material from different sources:
- Dutch national newspapers,
- television subtitles,
- teleprompter files,
- both manually and automatically generated broadcast news transcripts along with the broadcast news audio.

It consists of 530 million words and about 800 files of broadcast news audio.

It is in XML format and is encoded in UTF-8.
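Since the corpus is distributed as UTF-8 encoded XML, a first sanity check after obtaining it might be a rough word count over the files. The sketch below is only an illustration: this record does not describe the element names or the directory layout of TwNC, so the code assumes neither. It walks every text node of each file, and the `*.xml` file pattern and the function names `count_words` and `count_corpus_words` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_words(xml_path):
    """Count whitespace-separated tokens in all text nodes of one XML file."""
    # ElementTree reads the encoding from the XML declaration (UTF-8 is the default).
    root = ET.parse(xml_path).getroot()
    # itertext() visits every text node, so no element names are assumed.
    return sum(len(text.split()) for text in root.itertext())


def count_corpus_words(corpus_dir):
    """Sum token counts over every .xml file below corpus_dir (assumed layout)."""
    return sum(count_words(path) for path in sorted(Path(corpus_dir).rglob("*.xml")))
```

Whitespace tokenisation will not reproduce the published word count exactly, because any metadata stored as element text is counted too. It is meant as a quick plausibility check rather than a replacement for a schema-aware reader.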

It can be used for speech recognition, cross-media indexing, cross-language information retrieval, etc.

Production

Project : MultimediaN, CATCH, STEVIN

Applications

Applications possible : Speech recognition, Automatic speech recognition, Information retrieval
Application area : Research

Contents

Written corpus
Number of languages : Monolingual
Language(s) : Dutch
Annotation language : XML

Speech corpus
Language(s) : Dutch
Source Channel : Television
Transcription Entries : Orthographic
Annotation language : XML