ELRA - ELRA-U-S 0035 : Mandarin Chinese Broadcast News Corpus

Catalog Reference : ELRA-U-S 0035

Mandarin Chinese Broadcast News Corpus

The MATBN corpus contains 198 one-hour news shows (for a total of approximately 2.3 million Chinese characters). They were collected between 2001 and 2003 from the Public Television Service Foundation (Taiwan).

The shows were first recorded in stereo at a 44.1 kHz sampling rate and 16-bit resolution with a DAT recorder in the TV broadcasting studio. The DAT recordings were then converted into a single Microsoft Windows wave file, and the signal was down-sampled to 16 kHz at 16-bit resolution.
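As a rough illustration of what this conversion implies, the sketch below computes the resampling ratio between the two sampling rates and the approximate amount of raw PCM audio before and after conversion. It assumes the full 198 hours of audio, a mono down-sampled signal (the entry does not state the channel count after conversion), and ignores wave file header overhead; the function names are hypothetical and not part of any corpus tooling.

```python
from fractions import Fraction

SRC_RATE_HZ = 44_100      # DAT recording rate stated in the entry
DST_RATE_HZ = 16_000      # rate after down-sampling
BYTES_PER_SAMPLE = 2      # 16-bit resolution
TOTAL_HOURS = 198         # 198 one-hour news shows


def resample_ratio(src_hz: int, dst_hz: int) -> Fraction:
    """Return the output/input sample ratio in lowest terms."""
    return Fraction(dst_hz, src_hz)


def pcm_bytes(hours: int, rate_hz: int, channels: int) -> int:
    """Raw PCM size in bytes, ignoring file header overhead."""
    return hours * 3600 * rate_hz * channels * BYTES_PER_SAMPLE


# 16000/44100 reduces to 160/441: 160 output samples per 441 input samples.
ratio = resample_ratio(SRC_RATE_HZ, DST_RATE_HZ)

# Stereo source recordings versus the down-sampled signal (mono is an assumption).
original_bytes = pcm_bytes(TOTAL_HOURS, SRC_RATE_HZ, channels=2)
converted_bytes = pcm_bytes(TOTAL_HOURS, DST_RATE_HZ, channels=1)

print(f"resampling ratio: {ratio}")
print(f"original:  {original_bytes / 1e9:.1f} GB")
print(f"converted: {converted_bytes / 1e9:.1f} GB")
```

Under these assumptions the down-sampled audio comes to roughly 23 GB of raw samples; if the distributed files are stereo, that figure doubles.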

The corpus has been manually segmented, labeled, and transcribed with the DGA & LDC Transcriber tool. The transcription is aligned to the speech signal.

SGML tagging annotation: acoustic conditions, background conditions, story boundaries, speaker turn boundaries, audible acoustic events (hesitations, repetitions, vocal non-speech events, external noises).

The MATBN is the result of a joint project sponsored by the National Science Council of Taiwan.

Applications

Possible Applications : Speech recognition, Speech synthesis
Application Area : Research, Other

Contents


speech corpus
Language(s) : Mandarin Chinese
Duration : 198 hours
Quantisation : 16 bit
Source Channel : Television
Speech Acquisition Mode : Acoustic
Sound Type Annotation : Adverts, Articulatory noise, Background noise, Mispronunciation, Music, Speaker noise
Transcription Entries : Orthographic
Transcription Segmentation : Episode, Speaker turn
Annotation language : SGML