Unsupervised and Semi-Supervised Learning of Tone and Pitch Accent
Gina-Anne Levow
University of Chicago
June 6, 2006
Roadmap
• Challenges for Tone and Pitch Accent
  – Variation and Learning
• Data collections & processing
• Learning with less
  – Semi-supervised learning
  – Unsupervised clustering
• Approaches, structure, and context
• Conclusion
Challenges: Tone and Variation
• Tone and Pitch Accent Recognition
  – Key component of language understanding
    • Lexical tone carries word meaning
    • Pitch accent carries semantic, pragmatic, discourse meaning
  – Non-canonical form (Shen 90, Shih 00, Xu 01)
    • Tonal coarticulation modifies surface realization
      – In extreme cases, a fall becomes a rise
  – Tone is relative
    • To speaker range
      – High for a male voice may be low for a female voice
    • To phrase range, other tones
      – E.g. downstep
Challenges: Training Demands
• Tone and pitch accent recognition
  – Exploits data-intensive machine learning
    • SVMs (Thubthong 01, Levow 05, SLX 05)
    • Boosted and bagged decision trees (X. Sun, 02)
    • HMMs (Wang & Seneff 00, Zhou et al 04, Hasegawa-Johnson et al 04, …)
  – Can achieve good results with large sample sets
    • ~10K lab syllable samples -> >90% accuracy
  – Training data is expensive to acquire
    • Time – pitch accent labeling takes tens of times real-time
    • Money – requires skilled labelers
    • Limits investigation across domains, styles, etc.
  – Human language acquisition doesn’t use labels
Strategy: Training
• Challenge:
  – Can we use the underlying acoustic structure of the language – through unlabeled examples – to reduce the need for expensive labeled training data?
• Exploit semi-supervised and unsupervised learning
  – Semi-supervised Laplacian SVM
  – K-means and asymmetric k-lines clustering
  – Substantially outperform baselines
    • Can approach supervised levels
Data Collections I: English
• English: (Ostendorf et al, 95)
  – Boston University Radio News Corpus, f2b
  – Manually ToBI annotated, aligned, syllabified
  – Pitch accent aligned to syllables
    • 4-way: Unaccented, High, Downstepped High, Low (Sun 02, Ross & Ostendorf 95)
    • Binary: Unaccented vs Accented
Data Collections II: Mandarin
• Mandarin:
  – Lexical tones:
    • High, Mid-rising, Low, High-falling, Neutral
Data Collections III: Mandarin
• Mandarin Chinese:
  – Lab speech data: (Xu, 1999)
    • 5-syllable utterances: vary tone, focus position
      – In-focus, pre-focus, post-focus
  – TDT2 Voice of America Mandarin Broadcast News
    • Automatically force-aligned to anchor scripts
      – Automatically segmented, pinyin pronunciation lexicon
      – Manually constructed pinyin-ARPABET mapping
      – CU Sonic – language porting
  – 4-way: High, Mid-rising, Low, High-falling
Local Feature Extraction
• Motivated by the Pitch Target Approximation Model
  – Tone/pitch accent target is approached exponentially
    • Linear target: height, slope (Xu et al, 99)
• Scalar features:
  – Pitch, intensity max and mean (Praat, speaker normalized)
  – Pitch at 5 points across the voiced region
  – Duration
  – Initial, final in phrase
• Slope:
  – Linear fit to the last half of the pitch contour
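The feature recipe above can be sketched in a few lines; the function and argument names are hypothetical, and the exact normalization and sampling choices are assumptions, not the talk's settings:

```python
import numpy as np

def local_features(f0, duration, spk_mean, spk_std):
    """Hedged sketch of the scalar features listed above. f0 is the pitch
    track over one syllable's voiced region; spk_mean/spk_std give the
    speaker's pitch statistics for normalization."""
    f0 = (np.asarray(f0, dtype=float) - spk_mean) / spk_std  # speaker z-score normalization
    # pitch sampled at 5 evenly spaced points across the voiced region
    idx = np.linspace(0, len(f0) - 1, 5).round().astype(int)
    # slope: linear fit to the last half of the pitch contour
    half = f0[len(f0) // 2:]
    t = np.linspace(0.0, 1.0, len(half))
    slope = np.polyfit(t, half, 1)[0]
    return {"pitch_max": float(f0.max()), "pitch_mean": float(f0.mean()),
            "pitch_points": f0[idx], "duration": duration, "slope": float(slope)}
```

A rising contour yields a positive slope feature, a falling one a negative slope, which is what lets a linear-target model separate rising from falling tones.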
Context Features
• Local context:
  – Extended features
    • Pitch max, mean, adjacent points of the adjacent syllable
  – Difference features w.r.t. the adjacent syllable
    • Difference between
      – Pitch max, mean, mid, slope
      – Intensity max, mean
• Phrasal context:
  – Compute the collection-average phrase slope
  – Compute scalar pitch values, adjusted for slope
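The two context mechanisms above can be sketched as follows; all names are hypothetical, and treating the phrase contour as a single average slope subtracted by relative position is an assumption about the "simple strategy" the talk describes:

```python
def phrase_adjusted_pitch(pitch, position, phrase_slope):
    """Subtract the collection-average phrase contour. position is the
    syllable's relative place in the phrase (0 = start, 1 = end);
    phrase_slope is the average pitch change over a full phrase,
    estimated across the whole collection."""
    return pitch - phrase_slope * position

def difference_features(cur, prev):
    """Difference features w.r.t. the adjacent (here: preceding) syllable."""
    keys = ("pitch_max", "pitch_mean", "pitch_mid", "slope",
            "intensity_max", "intensity_mean")
    return {"d_" + k: cur[k] - prev[k] for k in keys}
```

With a negative (declining) phrase slope, late syllables get their pitch adjusted upward, so a phrase-final high tone is not mistaken for a low one.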
Experimental Configuration
• English Pitch Accent:
  – Proportionally sampled: 1000 examples
    • 4-way and binary classification
  – Contextualized representation, preceding syllables
• Mandarin Tone:
  – Balanced tone sets: 400 examples
    • Vary data set difficulty: clean lab -> broadcast
    • 4-tone classification
  – Simple local pitch-only features
    » Prior lab speech experiments were effective with local features
Semi-supervised Learning
• Approach:
  – Employ a small amount of labeled data
  – Exploit information from additional – presumably more available – unlabeled data
    • Few prior examples: EM, co- & self-training: Ostendorf ’05
• Classifier:
  – Laplacian SVM (Sindhwani, Belkin & Niyogi ’05)
  – Semi-supervised variant of the SVM
    • Exploits unlabeled examples
  – RBF kernel, typically 6 nearest neighbors
Experiments
• Pitch accent recognition:
  – Binary classification: Unaccented/Accented
  – 1000 instances, proportionally sampled
    • Labeled training: 200 unaccented, 100 accented
  – >80% accuracy (cf. 84% w/ supervised SVM on 15x the labeled data)
• Mandarin tone recognition:
  – 4-way classification: n(n-1)/2 binary classifiers
  – 400 instances: balanced; 160 labeled
    • Clean lab speech, in-focus: 94%
      – cf. 99% w/ SVM, 1000s of training samples; 85% w/ SVM, 160 training samples
    • Broadcast news: 70%
      – cf. <50% w/ supervised SVM, 160 training samples; 74% w/ 4x training
Unsupervised Learning
• Question:
  – Can we identify the tone structure of a language from the acoustic space without training?
    • Analogous to language acquisition
• Significant recent research in unsupervised clustering
  – Established approaches: k-means
  – Spectral clustering: eigenvector decomposition of the affinity matrix
    • (Shi & Malik 2000, Fischer & Poland 2004, BNS 2004)
  – Little research on tone
    • Self-organizing maps (Gauthier et al, 2005)
      – Tones identified in lab speech using f0 velocities
Unsupervised Pitch Accent
• Pitch accent clustering:
  – 4-way distinction: 1000 samples, proportional
    • 2-16 clusters constructed
      – Assign the most frequent class label to each cluster
• Learner:
  – Asymmetric k-lines clustering (Fischer & Poland ’05)
    » Context-dependent kernel radii, non-spherical clusters
  – >78% accuracy
  – Context effects:
    • Vectors with and without context perform comparably
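The scoring scheme above (assign each cluster its most frequent class label, then measure accuracy) is easy to make concrete; the function name is hypothetical:

```python
from collections import Counter

def majority_label_accuracy(cluster_ids, labels):
    """Each cluster is assigned its majority class label; accuracy is the
    fraction of members that carry their cluster's majority label."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return correct / len(labels)
```

Note that this measure can only increase as the number of clusters grows, which is one reason the slides report results across 2-16 clusters rather than at a single setting.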
Contrasting Clustering
• Approaches
  – 3 spectral approaches:
    • Asymmetric k-lines (Fischer & Poland 2004)
    • Symmetric k-lines (Fischer & Poland 2004)
    • Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004)
      – Binary weights, k-lines clustering
  – K-means: standard Euclidean distance
  – # of clusters: 2-16
• Best results: >78%
  – 2 clusters: asymmetric k-lines; >2 clusters: k-means
    • With a larger # of clusters, results are more similar
Contrasting Learners
Tone Clustering
• Mandarin four tones:
  – 400 samples: balanced
  – 2-phase clustering: 2-3 clusters each
  – Asymmetric k-lines
  – Clean read speech:
    • In-focus syllables: 87% (cf. 99% supervised)
    • In-focus and pre-focus: 77% (cf. 93% supervised)
  – Broadcast news: 57% (cf. 74% supervised)
• Contrast:
  – K-means: in-focus syllables: 74.75%
    • Requires more clusters to reach the asymmetric k-lines level
Tone Structure
• First phase of clustering splits high/rising from low/falling by slope
• Second phase splits by pitch height or slope
Conclusions
• Exploiting unlabeled examples for tone and pitch accent
  – Semi- and unsupervised approaches
    • Best cases approach supervised levels with less training data
  – Leveraging both labeled & unlabeled examples works best
  – Both spectral approaches and k-means are effective
    » Contextual information is less well exploited than in the supervised case
• Exploit the acoustic structure of the tone and accent space
Future Work
• Additional languages, tone inventories
  – Cantonese – 6 tones
  – Bantu family languages – truly rare data
• Language acquisition
  – Use of child-directed speech as input
  – Determination of the number of clusters
Thanks
• V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C.-J. Lin
• Dinoj Surendran, Siwei Wang, Yi Xu
• This work supported by NSF Grant #0414919
• http://people.cs.uchicago.edu/~levow/tai
Spectral Clustering in a Nutshell
• Basic spectral clustering
  – Build an affinity matrix
  – Determine the dominant eigenvectors and eigenvalues of the affinity matrix
  – Compute a clustering based on them
• Approaches differ in:
  – Affinity matrix construction
    • Binary weights, conductivity, heat weights
  – Clustering: cut, k-means, k-lines
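The basic recipe above can be sketched end to end. This is a minimal, hedged illustration, not the talk's exact configuration: the heat-kernel width sigma, the symmetric normalization, and the deterministic farthest-point k-means initialization are all assumptions.

```python
import numpy as np

def spectral_cluster(X, k, sigma=1.0, iters=50):
    """Heat-kernel affinity, dominant eigenvectors of the normalized
    affinity, then plain k-means on the embedded rows."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))            # affinity matrix (heat weights)
    D = W.sum(1)
    A = W / np.sqrt(np.outer(D, D))                 # normalized affinity D^-1/2 W D^-1/2
    _, vecs = np.linalg.eigh(A)                     # eigh: eigenvalues in ascending order
    E = vecs[:, -k:]                                # k dominant eigenvectors as embedding
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    cent = [E[0]]                                   # farthest-point initialization
    for _ in range(1, k):
        dists = np.min([((E - c) ** 2).sum(1) for c in cent], axis=0)
        cent.append(E[dists.argmax()])
    cent = np.array(cent)
    for _ in range(iters):                          # Lloyd's k-means on the embedding
        assign = ((E[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                cent[j] = E[assign == j].mean(0)
    return assign
```

The approaches contrasted in the talk differ exactly at the two marked points: how W is built (binary vs conductivity vs heat weights) and what clustering runs on the embedding (cut, k-means, or k-lines).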
K-Lines Clustering Algorithm
• Due to Fischer & Poland 2005
• 1. Initialize vectors m1...mK (e.g. randomly, or as the first K eigenvectors of the spectral data yi)
• 2. For j = 1...K:
  – Define Pj as the set of indices of all points yi that are closest to the line defined by mj, and create the matrix Mj = [yi], i in Pj, whose columns are the corresponding vectors yi
• 3. Compute the new value of every mj as the first eigenvector of Mj Mj^T
• 4. Repeat from 2 until the mj's do not change
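Steps 1-4 above translate directly into code. One assumption is flagged: for determinism this sketch initializes the directions by a farthest-residual heuristic rather than the random or eigenvector initialization the slide suggests.

```python
import numpy as np

def k_lines(Y, K, iters=100):
    """Points are rows of Y; each cluster j is a line through the origin
    with unit direction m_j, refit as the top eigenvector of M_j M_j^T."""
    Y = np.asarray(Y, dtype=float)
    m = [Y[0] / np.linalg.norm(Y[0])]               # step 1: initialize directions
    for _ in range(1, K):
        resid = (Y ** 2).sum(1) - np.max([(Y @ v) ** 2 for v in m], axis=0)
        far = Y[resid.argmax()]                     # point worst-explained so far
        m.append(far / np.linalg.norm(far))
    m = np.array(m)
    assign = None
    for _ in range(iters):
        # step 2: squared distance of y to line span(m_j) is ||y||^2 - (m_j . y)^2
        d2 = (Y ** 2).sum(1, keepdims=True) - (Y @ m.T) ** 2
        new_assign = d2.argmin(1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                   # step 4: assignments stable
        assign = new_assign
        for j in range(K):
            Pj = Y[assign == j]                     # step 3: refit m_j as the first
            if len(Pj):                             # eigenvector of M_j M_j^T
                _, vecs = np.linalg.eigh(Pj.T @ Pj)
                m[j] = vecs[:, -1]
    return assign
```

Because clusters are lines rather than centroids, points along one direction group together regardless of magnitude, which suits the normalized rows produced by spectral embeddings.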
Asymmetric Clustering
• Replace the Gaussian kernel of fixed width with context-dependent kernel radii
  – (Fischer & Poland TR-IDSIA-12-04, p. 12)
  – where tau = 2d + 1 or 10; results are largely insensitive to tau
Laplacian SVM
• Manifold regularization framework
  – Hypothesize that the intrinsic (true) data lies on a low-dimensional manifold
    • Ambient (observed) data lies in a possibly high-dimensional space
  – Preserves locality:
    • Points close in ambient space should be close in intrinsic space
  – Use labeled and unlabeled data to warp the function space
  – Run an SVM on the warped space
Laplacian SVM (Sindhwani)
• Input: l labeled and u unlabeled examples
• Output:
• Algorithm:
  – Construct the adjacency graph. Compute the Laplacian.
  – Choose kernel K(x,y). Compute the Gram matrix K.
  – Compute …
  – And …
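The first algorithm step can be made concrete. This is a hedged sketch of one common construction, not necessarily the talk's exact one: a symmetrized k-nearest-neighbor graph (the Laplacian SVM slides mention typically 6 neighbors) with binary edge weights, yielding the unnormalized Laplacian L = D - W; heat-kernel weights are another standard choice.

```python
import numpy as np

def knn_laplacian(X, k=6):
    """Symmetrized k-NN adjacency graph over all (labeled + unlabeled)
    examples and its unnormalized graph Laplacian L = D - W."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                    # exclude self-edges
    W = np.zeros((n, n))
    nbrs = np.argsort(d2, axis=1)[:, :k]            # k nearest neighbors per node
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)                          # symmetrize the graph
    return np.diag(W.sum(1)) - W
```

L is symmetric positive semi-definite with zero row sums; the quadratic form f'Lf penalizes functions that differ across graph edges, which is what warps the SVM's function space toward the data manifold.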
Current and Future Work
• Interactions of tone and intonation
  – Recognition of topic and turn boundaries
  – Effects of topic and turn cues on tone realization
• Child-directed speech & tone learning
• Support for computer-assisted tone learning
• Structured sequence models for tone
  – Sub-syllable segmentation & modeling
• Feature assessment
  – Band energy and intensity in tone recognition
Related Work
• Tonal coarticulation:
  – Xu & Sun 02; Xu 97; Shih & Kochanski 00
• English pitch accent:
  – X. Sun 02; Hasegawa-Johnson et al 04; Ross & Ostendorf 95
• Lexical tone recognition:
  – SVM recognition of Thai tone: Thubthong 01
  – Context-dependent tone models
    • Wang & Seneff 00, Zhou et al 04
Pitch Target Approximation Model
• Pitch target:
  – Linear model: T(t) = a·t + b
  – Exponentially approximated: y(t) = β·exp(−λ·t) + a·t + b
  – In practice, assume the target is well approximated by its mid-point (Sun, 02)
Classification Experiments
• Classifier: Support Vector Machine
  – Linear kernel
  – Multiclass formulation
    • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  – 4:1 training / test splits
• Experiments: effects of
  – Context position: preceding, following, none, both
  – Context encoding: extended/difference
  – Context type: local, phrasal
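The multiclass formulation here (and the n(n-1)/2 binary classifiers mentioned for tone in the semi-supervised experiments) is the standard one-vs-one voting scheme, sketched below. The wrapper is generic; the nearest-centroid learner is only a toy stand-in for the SVM so the sketch runs, and all names are hypothetical.

```python
import numpy as np
from itertools import combinations

def one_vs_one(train_X, train_y, test_X, fit_binary):
    """Train one binary classifier per class pair; each votes on every
    test example, and the class with the most votes wins."""
    classes = sorted(set(train_y))
    votes = np.zeros((len(test_X), len(classes)), dtype=int)
    for i, j in combinations(range(len(classes)), 2):
        mask = np.array([y in (classes[i], classes[j]) for y in train_y])
        is_j = np.array([y == classes[j] for y in train_y])[mask]
        predict = fit_binary(train_X[mask], is_j)   # returns a test-time predictor
        p = predict(test_X)                         # boolean: True -> class j
        votes[:, i] += ~p
        votes[:, j] += p
    return [classes[v] for v in votes.argmax(1)]

def centroid_stand_in(X, y):
    """Toy binary learner (nearest centroid), standing in for the SVM."""
    c0, c1 = X[~y].mean(0), X[y].mean(0)
    return lambda T: ((T - c1) ** 2).sum(1) < ((T - c0) ** 2).sum(1)
```

For the 4-way tone task this trains 4·3/2 = 6 pairwise classifiers, matching the n(n-1)/2 count in the slides.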
Results: Local Context
Context          Mandarin Tone   English Pitch Accent
Full             74.5%           81.3%
Extend PrePost   74.0%           80.7%
Extend Pre       74.0%           79.9%
Extend Post      70.5%           76.7%
Diffs PrePost    75.5%           80.7%
Diffs Pre        76.5%           79.5%
Diffs Post       69.0%           77.3%
Both Pre         76.5%           79.7%
Both Post        71.5%           77.6%
No context       68.5%           75.9%
Discussion: Local Context
• Any context information improves over none
  – Preceding context information consistently improves over none or following context
    • English: generally, more context features are better
    • Mandarin: following context can degrade performance
  – Little difference in encoding (Extend vs Diffs)
• Consistent with the phonological analysis (Xu) that carryover coarticulation is greater than anticipatory coarticulation
Results & Discussion: Phrasal Context
Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72.0%           79.9%

• Phrase contour compensation enhances recognition
• Simple strategy
• Use of non-linear slope compensation may improve results
Context: Summary
• Employ a common acoustic representation
  – Tone (Mandarin), pitch accent (English)
• SVM classifiers, linear kernel: 76%, 81%
• Local context effects:
  – Up to >20% relative reduction in error
  – Preceding context makes the greatest contribution
    • Carryover vs anticipatory coarticulation
• Phrasal context effects:
  – Compensation for the phrasal contour improves recognition
Aside: More Tones
• Cantonese:
  – CUSENT corpus of read broadcast news text
  – Same feature extraction & representation
  – 6 tones:
    • High level, high rise, mid level, low fall, low rise, low level
  – SVM classification:
    • Linear kernel: 64%; Gaussian kernel: 68%
  – Tones 3 & 6: 50% – mutually indistinguishable (50% pairwise)
    » Human levels: no context: 50%; with context: 68%
• Augment with the syllable phone sequence
  – 86% accuracy: for 90% of syllables with tone 3 or 6, one tone dominates
Aside: Voice Quality & Energy
• By Dinoj Surendran
• Assess local voice quality and energy features for tone
  – Not typically associated with Mandarin
• Considered:
  – VQ: NAQ, AQ, etc.; spectral balance; spectral tilt; band energy
• Useful: band energy significantly improves recognition
  – Especially for the neutral tone
  – Supports identification of unstressed syllables
    • Spectral balance predicts stress in Dutch
Roadmap
• Challenges for Tone and Pitch Accent
  – Contextual effects
  – Training demands
• Modeling Context for Tone and Pitch Accent
  – Data collections & processing
  – Integrating context
  – Context in recognition
• Reducing Training Demands
  – Data collections & structure
  – Semi-supervised learning
  – Unsupervised clustering
• Conclusion
Strategy: Context
• Exploit contextual information
  – Features from adjacent syllables
    • Height, shape: direct, relative
  – Compensate for the phrase contour
  – Analyze the impact of
    • Context position, context encoding, context type
      » >20% relative improvement over no context