ELRA - ELRA-U-S0269 : Mass

Catalog Reference : ELRA-U-S0269

Mass

This is a Malay speech corpus. It contains 70 hours of read speech recorded by 90 speakers and 10 hours of broadcast news from local TV stations in Malaysia.

Read speech was recorded by Malay, Indian and Chinese speakers (female and male) through a headset microphone at a sampling rate of 22 kHz. The speakers read sentences extracted from a text corpus collected from local news websites. Each speaker read about 5,000 words. The target is to record a total of 140 hours of speech.

The broadcast news part was recorded daily in 30-minute slots. Audio files were then split into 5-minute segments and stored as 16 kHz, 16-bit PCM WAV files. This part is manually transcribed and segmented into speech utterances. The aim is to collect a total of 15 hours of broadcast news.
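
The splitting step described above can be illustrated with a minimal sketch. The catalogue does not say which tools were used to produce the corpus, so the function name `split_wav`, the output naming scheme and the default segment length are assumptions made for illustration only. The sketch uses Python's standard `wave` module and expects an uncompressed PCM WAV file as input.

```python
import wave


def split_wav(src_path, out_prefix, segment_seconds=300):
    """Split a PCM WAV file into fixed-length segments.

    Hypothetical helper, not part of the corpus distribution.
    Returns the list of segment paths that were written.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Number of audio frames in one segment (300 s = 5 minutes by default).
        frames_per_segment = segment_seconds * src.getframerate()
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(path, "wb") as dst:
                # Copy channel count, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

With a 30-minute recording and the default of 300 seconds, a call such as `split_wav("news.wav", "news_seg")` would write six segment files; the last segment is shorter when the recording length is not an exact multiple of the segment length.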

Applications

Existing applications : Automatic speech recognition

Contents

speech corpus #18824
Language(s) : Malay
Duration : 90 hours
Source Channel : Microphone
Recording Environment : Sound proof room
Speech Content : Continuous sentences
speech corpus #28824
Language(s) : Malay
Duration : 10 hours
Quantisation : 16-bit
Source Channel : Television
Transcription Entries : Orthographic
Annotation language : XML