USC Linguistics
Resonance tuning in soprano singing and vocal tract shaping: Comparison
of sung and spoken vowels
2pSC29
Shrikanth Narayanan*^, Erik Bresch*, Stephen Tobin^, Dani Byrd^, Krishna Nayak*, Jon Nielsen*
*USC Viterbi School of Engineering ^USC Department of Linguistics
Supported by NIH. Our thanks to the USC Imaging Science Center.
USC Speech Articulation and kNowledge (SPAN) Group sail.usc.edu/span
Background: Singing Acoustics
• Signal characteristics of the singing voice
  • J. Sundberg, “The Acoustics of the Singing Voice,” Scientific American 236, 1977.
• Measure vocal tract resonances with external excitation
• Show tuning of F1 to F0 for softly sung vowels
  • E. Joliveau, J. Smith, and J. Wolfe, “Vocal tract resonances in singing: The soprano voice,” JASA 116, Oct. 2004.
• Estimating formants at high pitch from the audio waveform is problematic
  • H. Traunmüller, A. Eriksson, “A method of measuring formant frequencies at high fundamental frequencies,” Proc. EuroSpeech’97, Vol. 1:477-480.
Problem statement
• Long-term project goal: Investigate the relation between vocal tract shaping and source control in sung and spoken productions
• Specific focus: Soprano challenge
  • investigate vocal tract shaping for different vowels with increasing pitch
Data collection
• Subject in MRI scanner in supine position for approx. 60 min
• Soprano, trained Western opera singer
  • sang various 30 s pieces
  • spoke utterances “la”, “le”, “li”, “lo”, “lu” (3 realizations each)
  • sang two-octave B-flat major scales on “la”, “le”, “li”, “lo”, “lu” (one realization each)
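The F0 values quoted on later slides for notes 1, 5, 11, and 15 of the scale (233, 349, 622, and 932 Hz) can be reproduced from the scale itself. The sketch below assumes 12-tone equal temperament with A4 = 440 Hz and a scale starting on B-flat 3; neither assumption is stated in the slides, and the function name is illustrative.

```python
# Assumptions (not stated in the slides): equal temperament, A4 = 440 Hz,
# and a two-octave B-flat major scale whose first note is B-flat 3.
A4_HZ = 440.0
MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of a major scale in one octave

def scale_note_hz(note_number, tonic_hz):
    """Frequency of the 1-based scale note `note_number` above the tonic."""
    octave, degree = divmod(note_number - 1, len(MAJOR_STEPS))
    semitones = 12 * octave + MAJOR_STEPS[degree]
    return tonic_hz * 2 ** (semitones / 12)

BFLAT3_HZ = A4_HZ * 2 ** (-11 / 12)  # B-flat 3 lies 11 semitones below A4

for note in (1, 5, 11, 15):
    print(f"note {note:2d}: {scale_note_hz(note, BFLAT3_HZ):.1f} Hz")
```

Under these assumptions the four notes round to the nominal values used in the rest of the slides.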
Real-time MR imaging
• GE 1.5T scanner
• Custom head/neck receiver coil
• RTHawk software
  • Santos et al., Proc. IEEE EMBS, 26th Annual Meeting
• 13-interleaf spiral pulse sequence
  • TR = 6.5 ms
  • true frame rate 11 fps
  • sliding window reconstruction 22 fps
  • slice thickness approx. 5 mm, mid-sagittal plane
  • resolution approx. 3 mm/pixel
  • resulting image size 68x68 pixels
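The timing and geometry figures above are related by simple arithmetic, sketched here. The constant names are illustrative, and the computation assumes that one full image needs all 13 interleaves and that the pixel size is uniform across the image; the slide's own numbers are approximate.

```python
# Values taken from the slide; names are illustrative.
TR_S = 6.5e-3        # repetition time per spiral interleaf, in seconds
INTERLEAVES = 13     # interleaves needed for one fully sampled image
PIXEL_MM = 3.0       # approximate in-plane resolution
IMAGE_PIXELS = 68    # image width and height in pixels
RECON_FPS = 22.0     # sliding-window reconstruction rate

full_frame_s = INTERLEAVES * TR_S             # time to acquire one full image
true_fps = 1.0 / full_frame_s                 # about 11.8 frames per second
field_of_view_mm = PIXEL_MM * IMAGE_PIXELS    # about 204 mm on each side
trs_per_recon_frame = (1.0 / RECON_FPS) / TR_S  # about 7 new interleaves per output frame

print(f"true frame rate: {true_fps:.1f} fps")
print(f"field of view: {field_of_view_mm:.0f} mm")
print(f"interleaves between reconstructed frames: {trs_per_recon_frame:.1f}")
```

The computed rate of roughly 11.8 fps is in line with the quoted true frame rate, and a 22 fps sliding window corresponds to reconstructing a new image after about every 7 interleaves.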
Synchronized audio acquisition
• Phone-OR optical microphone
• Laptop with National Instruments 16-bit DAQ card
• Sampling rate 100 kHz (5x oversampling)
• Custom FPGA-based sync hardware
Synchronized audio acquisition
• Offline gradient noise cancellation
  • employs adaptive FIR filter and normalized LMS algorithm
  • achieves approx. 30 dB SNR improvement
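A minimal sketch of noise cancellation with an adaptive FIR filter and the normalized LMS update is given below. It is a generic textbook formulation, not the group's implementation: it assumes a separate noise reference signal is available (the slides do not say how the reference was obtained), and the filter length, step size, and synthetic signals are all hypothetical.

```python
import math
import random

def nlms_cancel(primary, reference, taps=8, mu=0.1, eps=1e-8):
    """Return `primary` minus an adaptive FIR estimate of the noise it contains."""
    weights = [0.0] * taps
    history = [0.0] * taps           # most recent reference samples, newest first
    cleaned = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate   # the error is also the cleaned output sample
        power = sum(h * h for h in history) + eps
        # Normalized LMS: the step is scaled by the reference power in the window
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned

# Synthetic demo (hypothetical signals, not data from the study)
random.seed(0)
n = 4000
voice = [0.5 * math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
noise = [random.gauss(0.0, 1.0) for _ in range(n)]
# The noise reaches the microphone through a 2-tap path unknown to the filter
leak = [0.8 * noise[k] - 0.3 * (noise[k - 1] if k else 0.0) for k in range(n)]
primary = [v + l for v, l in zip(voice, leak)]
cleaned = nlms_cancel(primary, noise)

def residual_power(signal, start=3000):
    """Mean squared deviation from the clean voice after the filter has adapted."""
    return sum((s - v) ** 2 for s, v in zip(signal[start:], voice[start:])) / (n - start)

improvement_db = 10 * math.log10(residual_power(primary) / residual_power(cleaned))
print(f"SNR improvement on synthetic data: {improvement_db:.1f} dB")
```

The improvement printed here depends entirely on the synthetic setup and says nothing about the approx. 30 dB reported for the actual recordings.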
Image analysis
• Manual tracking of MR images
  • vocal tract outline for each individual frame
  • from larynx to lips
• Computation of midline
  • finding start and end point at larynx and lips
  • repeated recursive bisection
  • smoothing spline fit
Image analysis
• Find aperture cross sections perpendicular to smooth midline
• Computation of final midline
  • along midpoints of cross sections
  • coordinate system based on midline
  • coordinate origin to be anchored to an anatomical landmark in the future; currently above the epiglottis
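One plausible reading of a midline-based coordinate system is distance travelled along the midline. The sketch below computes that arc-length coordinate for a midline given as a list of points; the function name and the sample points are hypothetical, and the slides do not specify how the coordinate was actually defined.

```python
import math

def arc_length_coordinates(midline):
    """Cumulative distance along a polyline of (x, y) points, starting at 0."""
    coords = [0.0]
    for (x0, y0), (x1, y1) in zip(midline, midline[1:]):
        coords.append(coords[-1] + math.hypot(x1 - x0, y1 - y0))
    return coords

# Hypothetical midline in pixel units, ordered from the larynx end to the lips
midline_px = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
coords_px = arc_length_coordinates(midline_px)

PIXEL_MM = 3.0  # approximate resolution quoted on the imaging slide
tract_length_mm = coords_px[-1] * PIXEL_MM

print(coords_px)        # coordinate of each midline point
print(tract_length_mm)  # total midline length in millimetres
```

With this definition, the last coordinate is the total midline length, which is one way to obtain a vocal tract length per frame.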
Image analysis
• Final result: Aperture function with midline-based coordinate system
[Figure: aperture function; horizontal axis labeled front and back, vertical axis constriction degree (aperture)]
Acoustic analysis
• Using real-time noise-cancelled audio
• Pitch estimation using PRAAT
• Formant analysis using PRAAT for spoken utterances and for low-pitch notes
• Formant analysis difficult for high-pitch utterances (example /i/ on next slide)
Acoustic analysis
• Formant analysis from audio is difficult at high pitch.
• “li” note 1 (F0 = 233Hz), note 5 (F0 = 349Hz): clear formant structure
• “li” note 11 (F0 = 622Hz), note 15 (F0 = 932Hz): formants are harder to identify
[Figure: sung vowel /i/ in four panels, F0 = 233 Hz, 349 Hz, 622 Hz, and 932 Hz; frequency axis to 5 kHz in each panel]
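One reason formants become hard to read at high pitch is that a voiced sound carries energy only at multiples of F0, so the vocal tract response is sampled more sparsely as F0 rises. The short sketch below counts the harmonics that fall within the 5 kHz range for the four example notes; the function name is illustrative.

```python
def harmonics_below(f0_hz, limit_hz=5000):
    """Number of harmonics k * f0 (k = 1, 2, ...) at or below `limit_hz`."""
    return int(limit_hz // f0_hz)

for f0 in (233, 349, 622, 932):
    print(f"F0 = {f0} Hz: {harmonics_below(f0)} harmonics up to 5 kHz, "
          f"spaced {f0} Hz apart")
```

At F0 = 932 Hz only a handful of harmonics cover the whole range, so any resonance peak lying between two harmonics leaves little trace in the spectrum.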
Aperture function analysis
At high pitches, the acoustic identity of the vowels is “sacrificed,” i.e. they converge acoustically. However, in their articulation...
• the front half of the aperture function converges for all vowels at high pitch, while
• the back half of the aperture function maintains a vowel-dependent shape.
Aperture functions for vowels sung at different pitches
[Figure: four panels, F0 = 233 Hz, 349 Hz, 622 Hz, and 932 Hz]
Larynx position analysis results
Larynx raising with higher pitch for /e/, /i/, /o/, /u/
[Figure: larynx position by vowel; pitch increases from left to right]
Vocal tract length analysis results
Vocal tract length decreases with pitch for /e/, /i/, /o/
[Figure: vocal tract length by vowel, with spoken vowels marked; pitch increases from left to right]
Minimum aperture analysis results
Minimum aperture value increases with pitch for all vowels
Minimum aperture location varies with pitch for /a/, /o/
[Figure: two panels, minimum aperture value and minimum aperture location, each with spoken vowels marked; pitch increases from left to right]
Image analysis examples /a/
• F0 = 932 Hz, note 15
• F0 = 622 Hz, note 11
• F0 = 349 Hz, note 5
• F0 = 233 Hz, note 1
• spoken
Image analysis examples /i/
• F0 = 932 Hz, note 15
• F0 = 622 Hz, note 11
• F0 = 349 Hz, note 5
• F0 = 233 Hz, note 1
• spoken
Sung vowels
Resonance tuning can be shown for vowels with low F1.
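A simplified way to see why vowels with low F1 are the ones that show tuning: under an idealized rule in which the first resonance keeps its speech value until F0 rises above it and then follows F0, a low-F1 vowel is overtaken early in the scale. The rule and the F1 values below are illustrative assumptions, not measurements or a model from this study.

```python
# Rough illustrative speech F1 values (placeholders, NOT data from this study)
ASSUMED_SPEECH_F1_HZ = {"i": 300.0, "a": 800.0}

def tuned_first_resonance(speech_f1_hz, f0_hz):
    """Idealized tuning rule: keep the speech value until F0 exceeds it, then track F0."""
    return max(speech_f1_hz, f0_hz)

for vowel, f1 in ASSUMED_SPEECH_F1_HZ.items():
    for f0 in (233, 349, 622, 932):
        r1 = tuned_first_resonance(f1, f0)
        status = "tuned to F0" if r1 > f1 else "speech value kept"
        print(f"/{vowel}/ at F0 = {f0} Hz: first resonance {r1:.0f} Hz ({status})")
```

With these placeholder values, the low-F1 vowel needs tuning from the second example note onward, while the high-F1 vowel is affected only at the top of the scale.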
Vocal tract shapes in comparison
[Figure: vocal tract shapes, arranged by vowel (/a/, /e/, /i/, /o/, /u/) and by condition (spoken, F0 = 233 Hz, F0 = 932 Hz)]
Aperture functions in comparison
[Figure: aperture functions, arranged by vowel (/a/, /e/, /i/, /o/, /u/) and by condition (spoken, F0 = 233 Hz, F0 = 932 Hz)]
Discussion
• Several challenges in analysis:
  • Vocal tract resonances are difficult to estimate from the acoustic output at high pitch.
  • We plan in the future to estimate vocal tract resonances from MR-derived area function data.
    cf. Joliveau et al., who estimated resonances directly by acoustic methods
• 3 jointly controlled goals:
  • pitch: critical goal; not compromised as evidenced in audio
    • one component of implementation: raised larynx (except low vowel)
  • intensity: another important goal; increases with pitch
    • one component of implementation: open front cavity/cone effect
  • vowel identity: acoustic identity lost at high pitches
    • front cavity shaping compromised, but back cavity distinction still maintained; effect depends on vowel (low vs. high for example)
Discussion
• Strategy for joint control and relative weighting of the goals is unknown.
• It appears that vowel identity is compromised but not completely ignored at high pitch.
  • Joliveau et al. data acquired at soft intensity: opening of front cavity for cone effect may have been minimized
• Generalizability of results is limited
  • Need: data from more subjects, and direct acoustic modeling for estimating vocal tract resonances
• Ongoing work: We have collected data from 5 more sopranos.
Image analysis examples /a/
• Some /a/ images:
• F0 = 932Hz
• F0 = 622Hz
• F0 = 349Hz
• F0 = 233Hz
• spoken
Image analysis examples /i/
• Some /i/ images:
• spoken
• F0 = 932Hz
• F0 = 622Hz
• F0 = 349Hz
• F0 = 233Hz
Pitch and power estimation
Average power increases with pitch.
Pitch follows the nominal values very closely.