Catalog Reference : ELRA-U-S0216
The Lancaster Los Angeles Spoken Chinese Corpus
The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese. It consists of dialogues (55%) and monologues (45%) including both spontaneous (57%) and scripted (43%) speech.
The transcription contains 1,002,151 words, corresponding to 73,976 sentences and 49,670 utterance units (paragraphs).
This corpus includes various spoken registers :
- face-to-face conversation (60,806 words),
- telephone conversation between overseas Chinese and their family in China (295,026 words),
- play/movie scripts (80,446 words),
- TV talk show transcripts (118,588 words),
- formal debates between university students recorded between 1993 and 2002 (77,909 words),
- spontaneous oral narratives of native Beijing residents (102,262 words),
- edited oral narratives (267,114 words).
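As a quick consistency check, the per-register word counts listed above can be summed and compared with the stated transcription total of 1,002,151 words. The following is only a minimal sketch: the dictionary keys are shortened labels and the function name `register_shares` is a hypothetical helper, neither of which comes from the catalogue.

```python
# Word counts per register, copied from the list above.
# Keys are shortened labels chosen for this sketch.
REGISTER_WORDS = {
    "face-to-face conversation": 60_806,
    "telephone conversation": 295_026,
    "play/movie scripts": 80_446,
    "TV talk show transcripts": 118_588,
    "formal debates": 77_909,
    "spontaneous oral narratives": 102_262,
    "edited oral narratives": 267_114,
}

# Total word count as stated in the corpus description.
STATED_TOTAL = 1_002_151


def register_shares(counts):
    """Return each register's share of the summed word count, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


total = sum(REGISTER_WORDS.values())
shares = register_shares(REGISTER_WORDS)

if __name__ == "__main__":
    print(f"sum of registers: {total:,} (stated: {STATED_TOTAL:,})")
    # List registers from largest to smallest share.
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {pct:5.1f}%")
```

The seven counts add up exactly to the stated total, and telephone conversation is the largest register at roughly 29% of the words.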
The corpus is encoded in Unicode, structured in XML, and part-of-speech tagged.
speech corpus
Language(s) : Mandarin Chinese
Source Channel : Microphone
Transcription Entries : Orthographic
Annotation Coverage : Full
Annotation Granularity : Word, Sentence, Paragraph
Annotation level : Morphological
Joint Copyright © 2008 ELRA & ELDA