PHONEME LABELING

SIX POINT ONE (3) Annotation of Spoken Chinese

Thomas Fang Zheng

1 INTRODUCTION

Spontaneous speech [1, 2] and dialectal speech [3, 4] are two major issues that affect the performance of Chinese automatic speech recognition (ASR).

In China, the official spoken language is standard Chinese, called Putonghua. For simplicity, the term “Chinese” here stands for Putonghua, or standard Chinese, unless noted otherwise. Chinese is a syllabic language, and Pinyin is the Romanized written form for syllables. In most cases, a Chinese syllable has two parts, an Initial followed by a Final, corresponding to a consonant and a vowel, respectively. If a syllable has no consonant, its Initial is called a null or zero Initial. The term “IF” means either an Initial or a Final.
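To illustrate the Initial/Final structure, the following minimal sketch splits a toneless Pinyin syllable into its Initial and Final. The function name `split_syllable` is hypothetical and not part of any toolkit mentioned in this chapter. The Initials are the 21 Putonghua Initials in Pinyin spelling. As a simplifying assumption, spellings that begin with y or w are not normalized; they are returned as zero-Initial syllables with the written form as the Final.

```python
from typing import Tuple

# The 21 Putonghua Initials in Pinyin spelling.
INITIALS = [
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s",
]

# Try two-letter Initials first so that "zh" is not misread as "z".
_BY_LENGTH = sorted(INITIALS, key=len, reverse=True)


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Return (Initial, Final); the Initial is "" for a zero Initial."""
    s = syllable.lower()
    for initial in _BY_LENGTH:
        # Require a non-empty Final after the Initial.
        if s.startswith(initial) and len(s) > len(initial):
            return initial, s[len(initial):]
    # No Initial matched: treat the whole syllable as the Final.
    return "", s
```

For example, `split_syllable("zhong")` separates the Initial `zh` from the Final `ong`, while `split_syllable("an")` yields an empty (zero) Initial.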

On the one hand, current ASR systems can usually reach quite high accuracy for carefully read standard speech, but accuracy remains quite low for spontaneous speech [1]. Spontaneous speech has high pronunciation variability because speakers tend to articulate less carefully. Compared with read speech, spontaneous speech contains many more phonetic shifts, instances of reduction and assimilation, duration changes, tone shifts, and so forth. Therefore, we need to model allophonic variations resulting from speakers’ native languages.

On the other hand, Putonghua is actually based on a notable Chinese dialect used widely in Northern China, Mandarin, yet most speakers speak a non-standard version of Putonghua influenced by their dialectal background, called dialectal Chinese. In addition to Mandarin, there are eight major dialectal regions: Wu, Yue, Min, Hakka, Xiang, Gan, Hui, and Jin. These dialects can be further divided into more than 40 sub-categories. Although the Chinese dialects share a written language and standard Chinese is widely spoken in most regions, speech is still strongly influenced by native dialects. A syllable (or IF) in one Chinese dialect or dialectal Chinese is often pronounced as a different one in another Chinese dialect or dialectal Chinese; this is called syllable (or IF) mapping. For example, in southern China the Initial /zh/ is often pronounced as /z/, while in Sichuan Province the syllable /guo/ in the word “国 (country)” is pronounced as /gui/.
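The two mapping examples above can be sketched as simple lookup tables. This is only an illustration under stated assumptions: the names `SOUTHERN_INITIAL_MAP`, `SICHUAN_SYLLABLE_MAP`, and `map_syllable` are hypothetical, each table contains only the single example given in the text, and real mappings would have to be learned from annotated data.

```python
from typing import Dict, Optional

# Putonghua Initials, two-letter ones first so "zh" is matched before "z".
_INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
             "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

# Example tables holding only the cases mentioned in the text.
SOUTHERN_INITIAL_MAP = {"zh": "z"}      # Initial mapping (southern China)
SICHUAN_SYLLABLE_MAP = {"guo": "gui"}   # syllable mapping (Sichuan)


def map_syllable(syllable: str,
                 syllable_map: Optional[Dict[str, str]] = None,
                 initial_map: Optional[Dict[str, str]] = None) -> str:
    """Map a canonical toneless Pinyin syllable to a dialectal form."""
    # A whole-syllable mapping takes priority over an Initial mapping.
    if syllable_map and syllable in syllable_map:
        return syllable_map[syllable]
    if initial_map:
        for initial in _INITIALS:
            if syllable.startswith(initial):
                # Replace the Initial if it is mapped; keep the Final as is.
                mapped = initial_map.get(initial, initial)
                return mapped + syllable[len(initial):]
    return syllable
```

With these tables, `map_syllable("zhong", initial_map=SOUTHERN_INITIAL_MAP)` gives `zong`, and `map_syllable("guo", syllable_map=SICHUAN_SYLLABLE_MAP)` gives `gui`; syllables without a matching entry are returned unchanged.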

To develop high-performance ASR systems, databases used for acoustic modeling should cover as many varied phenomena as possible, so that knowledge required for better speech recognition can be learned. Annotations and transcriptions should be as rich as possible, including the following information.

1. The language: standard Chinese, a certain Chinese dialect, or a certain form of dialectal Chinese; simplified Chinese, or traditional Chinese.

2. Speaking style: read speech or spontaneous speech.

Shuichi Itahashi and Chiu-yu Tseng Eds. "Computer Processing of Asian Spoken Languages," 1st Edition, Consideration Books, Los Angeles, USA, March 2010, pp. 267-270
3. Recording channel: close-talk microphone or telephone.

4. Sampling rate: 8 kHz and 16 kHz are frequently used.

5. Sampling precision: 8-bit or 16-bit.

6. Signal-to-noise ratio (SNR) level.

7. Number of speakers and speaker diversity.

8. Corpus size in hours.

In this chapter, we focus on database annotation related to sound changes and phone changes occurring in spontaneous/dialectal Chinese speech.
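The corpus-level information in items 1 to 8 above could be kept as a small metadata record alongside the recordings. The sketch below is one possible layout, not a format defined by this chapter: the class name `CorpusInfo`, its field names, and the validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusInfo:
    """Hypothetical corpus-level metadata record (items 1 to 8)."""
    language: str          # standard Chinese, a dialect, or a dialectal Chinese variety
    script: str            # "simplified" or "traditional"
    speaking_style: str    # "read" or "spontaneous"
    channel: str           # e.g. "close-talk microphone" or "telephone"
    sampling_rate_hz: int  # 8000 and 16000 are frequently used
    bits_per_sample: int   # 8 or 16
    snr_db: float          # signal-to-noise ratio level
    num_speakers: int
    hours: float           # corpus size

    def __post_init__(self) -> None:
        # Light sanity checks only; the accepted values are assumptions.
        if self.script not in ("simplified", "traditional"):
            raise ValueError(f"unknown script: {self.script!r}")
        if self.speaking_style not in ("read", "spontaneous"):
            raise ValueError(f"unknown speaking style: {self.speaking_style!r}")
        if self.bits_per_sample not in (8, 16):
            raise ValueError("sampling precision should be 8 or 16 bits")
        if self.sampling_rate_hz <= 0 or self.num_speakers <= 0 or self.hours <= 0:
            raise ValueError("rate, speaker count, and hours must be positive")
```

A record is then created with keyword arguments, for example `CorpusInfo(language="standard Chinese", script="simplified", speaking_style="spontaneous", channel="telephone", sampling_rate_hz=8000, bits_per_sample=16, snr_db=20.0, num_speakers=100, hours=10.0)`; the values in this call are invented purely for the example.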

2 DATABASE ANNOTATION

2.1 Symbols for Putonghua and dialectal Chinese

Roughly speaking, two major types of information require transcription: phonetic information and linguistic information. The former involves two forms, the base form and the surface form. The base form gives the canonical pronunciation of an utterance, while the surface form gives the observed (actual) pronunciation. For any language, there is an alphabet set for its basic units, which makes the base form transcription easy to give in a machine-readable style. For the surface form, however, this is more difficult. The International Phonetic Alphabet (IPA) can be used to accurately represent the actual pronunciation of an utterance, but it is not machine-readable. Taking this into consideration, researchers have defined a new kind of alphabet called the Speech Assessment Methods Phonetic Alphabet (SAMPA). Accordingly, for Chinese, we now have a labeling set of machine-readable IPA symbols adapted for Chinese languages from SAMPA, called SAMPA-C [5, 6], which consists of 23 phonologic consonants, 9 phonologic vowels, and 10 kinds of sound-change marks, as listed in Table 1. With these symbols, the 21 Initials, 38 Finals, and 38 retroflexed Finals, as well as their corresponding sound variability forms, can be represented. Tones after tone sandhi, or tonal variations, can be attached to Finals. Please refer to section “SAMPA-SC for Standard Chinese (Putonghua)” for the Chinese consonants (Initials), vowels, Finals, and tones represented in SAMPA-C.
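The relation between base form and surface form can be made concrete by aligning the two label sequences and listing where they differ. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; this is a swapped-in generic technique, not the procedure used by the cited works. The function name `pronunciation_changes` is hypothetical, and the labels are plain Pinyin IF spellings rather than SAMPA-C symbols.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def pronunciation_changes(
    base: List[str], surface: List[str]
) -> List[Tuple[str, List[str], List[str]]]:
    """List the differences between a base form and a surface form.

    Each result is (operation, base units, surface units), where the
    operation is "replace", "delete", or "insert".
    """
    matcher = SequenceMatcher(a=base, b=surface, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, base[i1:i2], surface[j1:j2]))
    return changes


# Illustrative input: base form "zh ong g uo" with an invented surface form
# that shows a /zh/ -> /z/ substitution and /guo/ realized as /gui/.
base_form = ["zh", "ong", "g", "uo"]
surface_form = ["z", "ong", "g", "ui"]
changes = pronunciation_changes(base_form, surface_form)
```

Counting such differences over a whole corpus is one way the statistics behind a multi-pronunciation lexicon might be gathered, although the alignment method used in practice may differ.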

TABLE 1 SOUND CHANGE REPRESENTATION IN SAMPA-C


In dialectal Chinese, the IF and syllable sets defined for Putonghua ASR are not sufficient to represent all possible pronunciations [4], since richer pronunciation variations exist. In addition to the pronunciation variations described here, dialect-related multi-pronunciation and syllable mapping phenomena are also very common [3, 4]. To model such pronunciation variations, the standard IF and syllable sets should be extended to dialectal Chinese IF and syllable sets based on SAMPA-C. Annotation using these extended sets would be helpful in learning Chinese syllable mapping and developing a multi-pronunciation lexicon, as well as providing other useful knowledge for developing dialectal Chinese ASR systems.

2.2 Transcription layers

A Chinese speech corpus could include some or all of the following transcription layers.

(a) Chinese character layer. The transcription includes sentences in terms of Chinese characters or Chinese words. The Chinese characters could be either traditional or simplified. Non-Chinese words are enclosed in braces (as with “{hi}” in the examples under layer (b)), and paralinguistic and non-linguistic phenomena could also be transcribed.

(b) Canonical Chinese Pinyin layer. The pronunciation of the corresponding sentence is given in canonical Chinese Pinyin (syllables) or IF. The Pinyin (or Final) here can be either toned or toneless. In Chinese, there are four tones (denoted by 1-4) and a neutral tone (0). The following examples illustrate toned Pinyin and IF strings: “{hi} ni3 zen3 me0 ye2 LA< ye2 lai2 la0 LA>” and “{hi} n i3 z en3 m e0 ie2 LA< ie2 l ai2 l a0 LA>.”

(c) Surface-form Chinese IF layer. This layer roughly provides the surface-form pronunciation of the sentence in Chinese IF. To make the surface form more accurate, a superset of the canonical Chinese IF set should be defined, like the generalized IF (GIF) set defined for this purpose [2]. Tones often change, in a phenomenon known as tone sandhi, according to certain rules. Tone sandhi differs from ordinary tone changes because the changed tone can be regarded as a quasi-canonical one. Pinyin syllables with tone sandhi are followed by a two-digit tone string, where the first digit is the original canonical tone and the second indicates the tone sandhi, e.g., /yi12 ge4/.

(d) Surface-form SAMPA-C layer. SAMPA-C symbols can be used to provide a more accurate surface-form transcription than in the surface-form IF layer, even if a GIF set is defined. The observed tone is generally attached to the SAMPA-C sequence of each Final.

(e) Miscellaneous layer. Non-speech information, spontaneous phonetic phenomena, and spoken language phenomena (paralinguistic or non-linguistic) can also be given. Paralinguistic phenomena include lengthening (LE), breathing (BR), laughing (LA), crying (CR), coughing (CO), disfluency (DS), error (ER), silence (SI), murmur or uncertain segment (UC), modal or exclamation (MO), smack (SM), non-Chinese (NC), sniffle (SN), yawn (YA), overlap (OV), interjection (IN), deglutition (DE), hawking (HA), sneeze (SE), filled pause (FP), trill (TR), and whisper (WH). Non-paralinguistic phenomena include noise (NS), steady noise (TN), and beep (BP). A two-letter abbreviation followed by “<” or “>” marks the starting or ending point of the phenomenon, respectively.

3 TRANSCRIPTION TOOLS

Many tools can be used for transcription. Most of them offer features beyond transcription itself, such as analysis of pitch, formants, spectrum, and cepstrum. Three popular transcription tools are Praat (http://www.fon.hum.uva.nl/praat/), X-Waves+, and SFS (http://www.phon.ucl.ac.uk/resource/sfs/); the first two support multi-layer transcription.

REFERENCES

1. Fung, P., Byrne, W., Zheng, T. F., Kamm, T., Liu, Y., Song, Z., Venkataramani, V. and Ruhi, U. Pronunciation modeling of Mandarin casual speech, Final Report for Johns Hopkins Summer Workshop 2000, http://www.clsp.jhu.edu/ws2000/final_reports/mpm/. 2000

2. Zheng, T. F., Song, Z., Fung, P. and Byrne, W. Mandarin pronunciation modeling based on CASS corpus, J. Computer Science and Technology, 17, 3, pp. 249-263. 2002

3. Sproat, R., Zheng, T. F., Gu, L., Li, J., Zheng, Y., Su, Y., Zhou, H., Bramsen, P., Kirsch, D., Shafran, I., Tsakalidis, S., Starr, R. and Jurafsky, D. Dialectal Chinese Speech Recognition, Final Report for Johns Hopkins Summer Workshop 2004, http://www.clsp.jhu.edu/ws2004/groups/ws04casr/. 2004

4. Li, J., Zheng, T. F., Byrne, W. and Jurafsky, D. A dialectal Chinese speech recognition framework, J. Computer Science and Technology, 21, 1, pp. 106-115. 2006

5. Chen, X., Li, A., Sun, G., Hua, W. and Yu, Z. An application of SAMPA-C for standard Chinese, Proc. ICSLP, 4, pp. 652-655. 2000

6. Li, A., Chen, X., Sun, G., Hua, W., Yin, Z., Zu, Y., Zheng, T. F. and Song, Z. The phonetic labeling on read and spontaneous discourse corpora, Proc. ICSLP, 4, pp. 724-727. 2000
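As a rough illustration of how the transcription conventions of Section 2.2 might be read by a program, the sketch below tokenizes one line of a toned Pinyin transcription. It recognizes non-Chinese words in braces, two-letter tags followed by “<” or “>”, and syllables carrying either a single tone digit or the two-digit tone sandhi string. The names `Token` and `parse_pinyin_layer` are hypothetical, and the assumption that tokens are separated by whitespace is ours rather than a rule stated in this chapter.

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    kind: str   # "foreign", "tag_start", "tag_end", "syllable", or "unknown"
    text: str   # word, tag abbreviation, or toneless syllable
    canonical_tone: Optional[int] = None
    surface_tone: Optional[int] = None


_FOREIGN = re.compile(r"\{(.+)\}")                   # e.g. {hi}
_TAG = re.compile(r"([A-Z]{2})([<>])")               # e.g. LA< or LA>
_SYLLABLE = re.compile(r"([a-z]+)([0-4])([0-4])?")   # e.g. ni3 or yi12


def parse_pinyin_layer(line: str) -> List[Token]:
    """Tokenize one whitespace-separated toned Pinyin transcription line."""
    tokens: List[Token] = []
    for raw in line.split():
        match = _FOREIGN.fullmatch(raw)
        if match:
            tokens.append(Token("foreign", match.group(1)))
            continue
        match = _TAG.fullmatch(raw)
        if match:
            kind = "tag_start" if match.group(2) == "<" else "tag_end"
            tokens.append(Token(kind, match.group(1)))
            continue
        match = _SYLLABLE.fullmatch(raw)
        if match:
            canonical = int(match.group(2))
            # A second digit, if present, is the tone after tone sandhi.
            surface = int(match.group(3)) if match.group(3) else canonical
            tokens.append(Token("syllable", match.group(1), canonical, surface))
            continue
        tokens.append(Token("unknown", raw))
    return tokens


tokens = parse_pinyin_layer("{hi} ni3 zen3 me0 ye2 LA< ye2 lai2 la0 LA>")
```

Here the example string from layer (b) yields one non-Chinese token, seven syllables, and a matching pair of laughing (LA) boundary tags, while a syllable such as `yi12` is returned with canonical tone 1 and surface tone 2.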