ELRA - ELRA-U-S 0071 : LVCSR Speech corpus for Indonesian

Catalog Reference : ELRA-U-S 0071

LVCSR Speech corpus for Indonesian

This is a speech database containing 84,000 read utterances. Each speaker read 210 sentences drawn from a text corpus covering two domains:

- 3186 sentences from the news domain (Kompas and Tempo).
- 2500 sentences for application domains (telecommunication service, tele-home security, billing information services, reservation services, etc.).

Each speaker was asked to read a set of sentences. Nearly 500 speakers, drawn from three age groups, balanced by gender (50% male, 50% female) and covering four major Western Indonesian accents (Javanese, Sundanese, Batak, Standard Indonesian), were recorded by telephone or microphone in two sets:
- Daily News Task: 110 sentences/speaker, 44000 utterances, 43 hours of speech,
- Telephone Applications: 100 sentences/speaker, 40000 utterances, 36 hours of speech.
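
The per-task figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this entry; the names `TASKS` and `summarize` are illustrative and are not part of the corpus distribution. It derives the number of speakers implied by each utterance count and a rough mean utterance duration, and makes no attempt to reconcile the result with the "nearly 500 speakers" figure.

```python
# Figures quoted in this catalogue entry (no other data is assumed).
TASKS = {
    "Daily News Task": {"sentences_per_speaker": 110, "utterances": 44_000, "hours": 43},
    "Telephone Applications": {"sentences_per_speaker": 100, "utterances": 40_000, "hours": 36},
}


def summarize(tasks):
    """Return derived figures per task: implied speakers and mean utterance length."""
    summary = {}
    for name, t in tasks.items():
        summary[name] = {
            # Speakers implied by the utterance count, if every speaker read the full set.
            "implied_speakers": t["utterances"] / t["sentences_per_speaker"],
            # Average duration of one utterance, in seconds.
            "mean_utterance_seconds": t["hours"] * 3600 / t["utterances"],
        }
    return summary


summary = summarize(TASKS)
total_utterances = sum(t["utterances"] for t in TASKS.values())
sentences_per_speaker = sum(t["sentences_per_speaker"] for t in TASKS.values())

for name, figures in summary.items():
    print(name, figures)
print("total utterances:", total_utterances)
print("sentences per speaker:", sentences_per_speaker)
```

The two totals printed at the end correspond to the 84,000 utterances and 210 sentences per speaker stated in the description.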

All recordings were carried out in a sound-proof room on 2 channels: clean speech (16 kHz) and telephone speech (8 kHz).

This LVCSR database will support the development of automatic speech recognition systems for Indonesian.

The database also includes a pronunciation dictionary derived from the Daily News and Telephone Application Tasks. The dictionary contains 40k words: 30k Indonesian words, 8k place and person names, and 2k foreign words.

Applications

Possible applications : Speech recognition, Automatic speech recognition
Application area : Research

Contents

speech corpus
Language(s) : Indonesian
Source Channel : Microphone, Telephone
Speech Acquisition Mode : Acoustic