
USC Linguistics

Resonance tuning in soprano singing and vocal tract shaping: Comparison of sung and spoken vowels

2pSC29

Shrikanth Narayanan*^, Erik Bresch*, Stephen Tobin^, Dani Byrd^, Krishna Nayak*, Jon Nielsen*

*USC Viterbi School of Engineering ^USC Department of Linguistics

Supported by NIH. Our thanks to the USC Imaging Science Center.

USC Speech Articulation and kNowledge (SPAN) Group sail.usc.edu/span

Background: Singing Acoustics

• Signal characteristics of the singing voice
  • J. Sundberg, “The Acoustics of the Singing Voice,” Scientific American 236, 1977.
• Measure vocal tract resonances with external excitation
• Show tuning of F1 to F0 for softly sung vowels
  • E. Joliveau, J. Smith, and J. Wolfe, “Vocal tract resonances in singing: The soprano voice,” JASA 116, Oct. 2004.
• Estimating formants at high pitch from audio waveform is problematic
  • H. Traunmueller, A. Eriksson, “A method of measuring formant frequencies at high fundamental frequencies,” Proc. EuroSpeech’97, Vol. 1:477-480.


Problem statement

• Long-term project goal: Investigate the relation between vocal tract shaping and source control in sung and spoken productions
• Specific focus: Soprano challenge
  • Investigate vocal tract shaping for different vowels with increasing pitch


Data collection

• Subject in MRI scanner in supine position for approx. 60 min
• Soprano, trained Western opera singer
  • Sang various 30 s pieces
  • Spoke utterances “la”, “le”, “li”, “lo”, “lu” (3 realizations each)
  • Sang two-octave B-flat major scales on “la”, “le”, “li”, “lo”, “lu” (one realization each)
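The nominal pitch of each scale note can be sketched as follows. This is a minimal illustration, not part of the study's processing: it assumes twelve-tone equal temperament with A4 = 440 Hz and a scale starting on B-flat 3 (MIDI note 58). Neither assumption is stated on the poster; they are chosen because they reproduce the F0 values of 233, 349, 622, and 932 Hz reported for notes 1, 5, 11, and 15. The function names are illustrative.

```python
# Nominal F0 of each note in a two-octave B-flat major scale.
# Assumptions (not stated on the poster): equal temperament, A4 = 440 Hz,
# scale starting on B-flat 3 (MIDI note 58).

MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # semitone steps of one major-scale octave


def midi_to_hz(midi_note, a4_hz=440.0):
    """Convert a MIDI note number to frequency in Hz (equal temperament)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def two_octave_major_scale_hz(start_midi=58):
    """Return the 15 note frequencies of a two-octave major scale."""
    midi_notes = [start_midi]
    for step in MAJOR_STEPS * 2:
        midi_notes.append(midi_notes[-1] + step)
    return [midi_to_hz(m) for m in midi_notes]


scale_hz = two_octave_major_scale_hz()
for note_number in (1, 5, 11, 15):
    print(note_number, round(scale_hz[note_number - 1]))
```

Under these assumptions the scale has 15 notes, and notes 1, 5, 11, and 15 round to 233, 349, 622, and 932 Hz.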


Real-time MR imaging

• GE 1.5 T scanner
• Custom head/neck receiver coil
• RTHawk software
  • Santos et al., Proc. IEEE EMBS, 26th Annual Meeting
• 13-interleaf spiral pulse sequence
• TR = 6.5 ms
• True frame rate 11 fps
• Sliding-window reconstruction 22 fps
• Slice thickness approx. 5 mm, mid-sagittal plane
• Resolution approx. 3 mm/pixel
• Resulting image size 68x68 pixels
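The timing figures above can be checked with a few lines of arithmetic. The sketch below assumes one complete image needs all 13 interleaves, so a new independent frame arrives every 13 x 6.5 ms. The sliding-window step of 7 interleaves is an assumption, picked only because it yields roughly the quoted 22 fps; the poster does not state the step. The field-of-view estimate uses the approximate 3 mm/pixel figure.

```python
# Frame-rate arithmetic for an interleaved spiral acquisition.
N_INTERLEAVES = 13   # from the poster
TR_S = 6.5e-3        # repetition time in seconds, from the poster


def frame_rate(interleaves_per_new_frame, tr_s=TR_S):
    """Frames per second when a new frame is formed every N interleaves."""
    return 1.0 / (interleaves_per_new_frame * tr_s)


# One fully independent image per 13 interleaves.
true_fps = frame_rate(N_INTERLEAVES)

# Assumption: the reconstruction window advances by 7 interleaves per frame.
sliding_fps = frame_rate(7)

# Approximate field of view from image size and approximate pixel size.
field_of_view_mm = 68 * 3.0

print(f"true frame rate    ~{true_fps:.1f} fps")
print(f"sliding window     ~{sliding_fps:.1f} fps")
print(f"field of view      ~{field_of_view_mm:.0f} mm")
```

A sliding window raises the displayed frame rate by reusing interleaves between consecutive frames; it does not shorten the 84.5 ms over which each image is acquired.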


Synchronized audio acquisition

• Phone-OR optical microphone
• Laptop with National Instruments 16-bit DAQ card
• Sampling rate 100 kHz (5x oversampling)
• Custom FPGA-based sync hardware


Synchronized audio acquisition

• Offline gradient noise cancellation
  • Employs adaptive FIR filter and normalized LMS algorithm
  • Achieves approx. 30 dB SNR improvement
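A generic normalized LMS (NLMS) noise canceller can be sketched as below. This is a textbook form under stated assumptions, not the group's implementation: the poster does not say how the noise reference was obtained, so the demo uses a synthetic white-noise reference, a made-up 3-tap noise path, and a sine wave standing in for the voice. The filter length, step size `mu`, and all names are illustrative.

```python
import numpy as np


def nlms_cancel(primary, reference, n_taps=8, mu=0.5, eps=1e-8):
    """Subtract the part of `primary` that an adaptive FIR filter can
    predict from `reference`, using the normalized LMS update."""
    weights = np.zeros(n_taps)
    taps = np.zeros(n_taps)            # recent reference samples, newest first
    cleaned = np.zeros(len(primary))
    for n in range(len(primary)):
        taps = np.roll(taps, 1)
        taps[0] = reference[n]
        noise_estimate = weights @ taps
        error = primary[n] - noise_estimate   # error is also the cleaned output
        # Step size is normalized by the reference power in the tap buffer.
        weights += mu * error * taps / (taps @ taps + eps)
        cleaned[n] = error
    return cleaned


# Synthetic demo (all signals are hypothetical stand-ins).
rng = np.random.default_rng(0)
n = np.arange(4000)
voice = 0.1 * np.sin(2 * np.pi * 0.01 * n)
reference = rng.standard_normal(len(n))
noise = np.convolve(reference, [0.6, -0.3, 0.1])[: len(n)]  # made-up noise path
primary = voice + noise
cleaned = nlms_cancel(primary, reference)

# Compare noise power before and after, once the filter has converged.
tail = slice(-1000, None)
noise_power_before = np.mean(noise[tail] ** 2)
noise_power_after = np.mean((cleaned[tail] - voice[tail]) ** 2)
improvement_db = 10 * np.log10(noise_power_before / noise_power_after)
print(f"noise reduction in demo: {improvement_db:.1f} dB")
```

The improvement in this toy demo says nothing about the approx. 30 dB reported for the scanner recordings; it only shows the mechanism.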


Image analysis

• Manual tracking of MR images
  • Vocal tract outline for each individual frame
  • From larynx to lips
• Computation of midline
  • Finding start and end points at larynx and lips
  • Repeated recursive bisection
  • Smoothing spline fit


Image analysis

• Find aperture cross sections perpendicular to smooth midline
• Computation of final midline
  • Along midpoints of cross sections
  • Coordinate system based on midline
  • Coordinate origin to be anchored in the future to anatomical landmark, currently above epiglottis
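The last step, going from cross sections to an aperture function in a midline-based coordinate, can be sketched as follows. The sketch assumes the cross sections have already been found and are given as pairs of end points on the two tracked outlines; it does not cover the perpendicular search, the recursive bisection, or the spline fit. The coordinates in the example are hypothetical, in millimetres, and the function name is illustrative.

```python
import math


def aperture_function(lower_points, upper_points):
    """Return (positions, apertures) for a list of cross sections.

    Each cross section is defined by one (x, y) point on each outline.
    The aperture is the cross-section length; the position is the arc
    length along the polyline through the cross-section midpoints.
    """
    midpoints = [
        ((xl + xu) / 2.0, (yl + yu) / 2.0)
        for (xl, yl), (xu, yu) in zip(lower_points, upper_points)
    ]
    apertures = [
        math.hypot(xu - xl, yu - yl)
        for (xl, yl), (xu, yu) in zip(lower_points, upper_points)
    ]
    positions = [0.0]
    for (x0, y0), (x1, y1) in zip(midpoints, midpoints[1:]):
        positions.append(positions[-1] + math.hypot(x1 - x0, y1 - y0))
    return positions, apertures


# Hypothetical example: three cross sections with a narrowing in the middle.
lower = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
upper = [(0.0, 4.0), (3.0, 2.0), (6.0, 4.0)]
positions, apertures = aperture_function(lower, upper)

tract_length = positions[-1]
min_aperture, min_position = min(zip(apertures, positions))
print(tract_length, min_aperture, min_position)
```

The same two lists give the quantities analysed on later slides: vocal tract length is the last position value, and the minimum aperture and its location come from the smallest aperture entry.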


Image analysis

• Final result: Aperture function with midline-based coordinate system

Figure: aperture function plot. One axis runs between front and back along the midline; the other shows constriction degree (aperture).


Acoustic analysis

• Using real-time noise-cancelled audio
• Pitch estimation using PRAAT
• Formant analysis using PRAAT for spoken utterances and for low-pitch notes
• Formant analysis difficult for high-pitch utterances (example /i/ on next slide)


Acoustic analysis

• Formant analysis from audio is difficult at high pitch.

• “li” note 1 (F0 = 233Hz), note 5 (F0 = 349Hz): clear formant structure

• “li” note 11 (F0 = 622Hz), note 15 (F0 = 932Hz): formants are harder to identify
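One way to see why formants become harder to identify: a voiced sound carries energy only at multiples of F0, so the harmonics sample the vocal tract's spectral envelope at a spacing equal to F0. The sketch below counts how many harmonics fall below 5 kHz for the four F0 values above. The 5 kHz limit is taken from the frequency range shown for the spectra; the function name is illustrative.

```python
# Number of harmonics available to sample the spectral envelope below 5 kHz.
F0_VALUES_HZ = [233, 349, 622, 932]
MAX_FREQUENCY_HZ = 5000


def harmonics_below(f0_hz, max_hz=MAX_FREQUENCY_HZ):
    """Count the harmonics k * f0 (k = 1, 2, ...) that do not exceed max_hz."""
    return int(max_hz // f0_hz)


harmonic_counts = {f0: harmonics_below(f0) for f0 in F0_VALUES_HZ}
for f0, count in harmonic_counts.items():
    print(f"F0 = {f0} Hz: {count} harmonics below {MAX_FREQUENCY_HZ} Hz")
```

This gives 21, 14, 8, and 5 harmonics respectively. At F0 = 932 Hz the envelope is sampled only about once per kilohertz, so a resonance peak between two harmonics leaves little trace in the audio.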

Figure: four panels for the sung vowel /i/ at F0 = 233 Hz, 349 Hz, 622 Hz, and 932 Hz, each with a frequency axis up to 5 kHz.


Aperture function analysis

At high pitches, the acoustic identity of the vowels is “sacrificed,” i.e. they converge acoustically. However, in their articulation:

• the front half of the aperture function converges for all vowels at high pitch, while
• the back half of the aperture function maintains a vowel-dependent shape.


Aperture functions for vowels sung at different pitches

Figure: four panels, one each for F0 = 233 Hz, 349 Hz, 622 Hz, and 932 Hz.


Larynx position analysis results

The larynx rises with increasing pitch for /e/, /i/, /o/, /u/.

Figure: larynx position plotted against increasing pitch.


Vocal tract length analysis results

Vocal tract length decreases with pitch for /e/, /i/, /o/.

Figure: vocal tract length for spoken vowels and for sung vowels ordered by increasing pitch.


Minimum aperture analysis results

Minimum aperture value increases with pitch for all vowels.

Minimum aperture location varies with pitch for /a/, /o/.

Figure: two panels (minimum aperture value and minimum aperture location), each showing spoken vowels and sung vowels ordered by increasing pitch.


Image analysis examples /a/

• F0 = 932 Hz, note 15
• F0 = 622 Hz, note 11
• F0 = 349 Hz, note 5
• F0 = 233 Hz, note 1
• spoken


Image analysis examples /i/

• F0 = 932 Hz, note 15
• F0 = 622 Hz, note 11
• F0 = 349 Hz, note 5
• F0 = 233 Hz, note 1
• spoken


Sung vowels

Resonance tuning can be shown for vowels with low F1.


Vocal tract shapes in comparison

Figure: image grid with columns /a/, /e/, /i/, /o/, /u/ and rows spoken, F0 = 233 Hz, F0 = 932 Hz.


Aperture functions in comparison

Figure: plot grid with columns /a/, /e/, /i/, /o/, /u/ and rows spoken, F0 = 233 Hz, F0 = 932 Hz.


Discussion

• Several challenges in analysis:
  • Vocal tract resonances are difficult to estimate from the acoustic output at high pitch.
  • We plan in the future to estimate vocal tract resonances from MR-derived area function data (cf. Joliveau et al., who estimated resonances directly by acoustic methods).
• 3 jointly controlled goals:
  • Pitch: critical goal; not compromised, as evidenced in audio
    • One component of implementation: raised larynx (except low vowel)
  • Intensity: another important goal; increases with pitch
    • One component of implementation: open front cavity/cone effect
  • Vowel identity: acoustic identity lost at high pitches
    • Front cavity shaping compromised, but back cavity distinction still maintained; effect depends on vowel (low vs. high, for example)


Discussion

• Strategy for joint control and relative weighting of the goals is unknown.
• It appears that vowel identity is compromised but not completely ignored at high pitch.
  • Joliveau et al. data acquired at soft intensity: opening of front cavity for cone effect may have been minimized
• Generalizability of results limited
  • Need: data from more subjects and direct acoustic modeling for estimating vocal tract resonances
• Ongoing work: We have collected data from 5 more sopranos.


Image analysis examples /a/

• Some /a/ images:

• F0 = 932Hz

• F0 = 622Hz

• F0 = 349Hz

• F0 = 233Hz

• spoken


Image analysis examples /i/

• Some /i/ images:

• spoken

• F0 = 932Hz

• F0 = 622Hz

• F0 = 349Hz

• F0 = 233Hz


Pitch and power estimation

Average power increases with pitch.

Pitch follows the nominal values very closely.
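Both quantities can be expressed with short formulas. The sketch below is an assumption about how one might compute them, not the procedure used for the poster: average power is taken as the mean square of the samples in dB relative to a mean square of 1, and pitch accuracy as the deviation from the nominal note in cents. The test tone and the example frequencies are hypothetical.

```python
import math


def mean_power_db(samples):
    """Average power of a signal in dB (0 dB = mean square of 1)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def cents_from_nominal(measured_hz, nominal_hz):
    """Pitch deviation in cents; 100 cents is one equal-tempered semitone."""
    return 1200.0 * math.log2(measured_hz / nominal_hz)


# Hypothetical check signals: a unit-amplitude sine and an exact octave.
tone = [math.sin(2 * math.pi * 10 * k / 1000) for k in range(1000)]
tone_db = mean_power_db(tone)
octave_cents = cents_from_nominal(466.0, 233.0)
print(round(tone_db, 2), octave_cents)
```

A unit-amplitude sine has a mean square of 0.5, about -3 dB on this scale, and a frequency ratio of 2 corresponds to 1200 cents.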
