ELRA - ELRA-U-S0216 : The Lancaster Los Angeles Spoken Chinese Corpus

Catalog Reference : ELRA-U-S0216

The Lancaster Los Angeles Spoken Chinese Corpus

The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese. It consists of dialogues (55%) and monologues (45%), and includes both spontaneous (57%) and scripted (43%) speech.

The transcription contains 1,002,151 words, corresponding to 73,976 sentences and 49,670 utterance units (paragraphs).

The corpus covers the following spoken registers:
- face-to-face conversation (60,806 words),
- telephone conversations between overseas Chinese and their families in China (295,026 words),
- play/movie scripts (80,446 words),
- TV talk show transcripts (118,588 words),
- formal debates between university students recorded between 1993 and 2002 (77,909 words),
- spontaneous oral narratives of native Beijing residents (102,262 words),
- edited oral narratives (267,114 words).
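
As a quick consistency check, the register sizes listed above can be tallied and compared with the stated total of 1,002,151 words. The following is a minimal sketch: the dictionary keys are shortened labels and the function name `register_shares` is hypothetical, chosen for illustration; only the word counts come from this entry.

```python
# Word counts per register, copied from the list above.
# The labels are shortened for readability; they are not official field names.
REGISTER_WORDS = {
    "face-to-face conversation": 60_806,
    "telephone conversation": 295_026,
    "play/movie scripts": 80_446,
    "TV talk show transcripts": 118_588,
    "formal debates": 77_909,
    "spontaneous oral narratives": 102_262,
    "edited oral narratives": 267_114,
}


def register_shares(counts):
    """Return the total word count and each register's share in percent."""
    total = sum(counts.values())
    shares = {name: 100 * n / total for name, n in counts.items()}
    return total, shares


total, shares = register_shares(REGISTER_WORDS)
print(total)  # 1002151, matching the stated transcription size

# List registers from largest to smallest share of the corpus.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {pct:5.1f}%")
```

The seven registers add up exactly to the stated transcription size, with telephone conversation and edited oral narratives together making up more than half of the corpus.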

The corpus is encoded in Unicode, structured in XML, and part-of-speech tagged.

Contents

speech corpus
Language(s) : Mandarin Chinese
Source Channel : Microphone
Transcription Entries : Orthographic
Annotation Coverage : Full
Annotation Granularity : Word, Sentence, Paragraph
Annotation level : Morphological