physics of human voice a new theory with applications research conference, november 16, 2012 c....

Physics of Human Voice A New Theory with Applications

Research Conference, November 16, 2012

C. Julian Chen

Department of Applied Physics and Applied Mathematics

Columbia University

Outline

• The source-filter theory
– Speech processing based on source-filter theory
• A new theory of human voice production
– Timing correlation of voice and EGG signals
– The concept of timbrons
• Speech processing based on timbrons
– Kramers-Kronig relations
– Fourier analysis
– Laguerre-function expansion
– Timbre vectors
• Applications: voice transformation, speech synthesis and speech recognition

Scientific Basis of Speech Technology

• Understand the physics of speech production
• Understand the physics of hearing
• Discover accurate and efficient parametric representations of speech signal
• Design methods to convert speech signal into a parametric representation
• Design methods to accurately recover speech from a parametric representation
• Develop methods to modify and manipulate speech through the parametric representation

Source-Filter Theory of Voice Production

The first edition of Gunnar Fant’s book was published in 1960.

Source-Filter Theory of Voice Production (1)

A quasi-periodic pulsating airflow, generated by the opening and closing of the glottis, creates a buzzing source, which is then filtered by the upper vocal tract to form the voice (Fant 1960).

Source-Filter Theory of Voice Production (2)

Strong airflow occurs during the opening of the glottis. No air flows when there is a glottal stop.

Traditional Speech Processing: Windowing

Because processing windows are asynchronous to pitch periods, windowing always compromises the speech signal.

Traditional Speech Processing: Phone alignment errors due to improper windowing

Because a substantial number of processing windows cross phone boundaries, automatic phone alignment inevitably produces a large percentage of errors.

The Electroglottograph (EGG)

A non-invasive instrument that detects the change of electric conductance between the two vocal cords, and thus monitors the opening and closing of the glottis (circa 1960).

Simultaneously recorded voice and EGG signals (1)

(1) Glottal closures (A) are very fast. (2) The strongest voice signal (C) immediately follows the glottal closure. (3) The voice signal in the glottal open phase (B) is much weaker.

Simultaneously recorded voice and EGG signals (2)

Two individual glottal closures are shown. Each triggers a timbron, which is determined solely by the geometry of the vocal tract.

The Timbron Theory (1)

When the glottis is open, there is a continuous airflow in the vocal tract. A glottal closure abruptly stops the airflow supply and excites a d’Alembert wavefront, which resonates in the vocal tract. The resulting waveform represents the instantaneous timbre.

The Timbron Theory (2)

Energetics:

Velocity of airflow 0.2 m/sec.

Volume of the vocal tract 2×10⁻⁵ m³

Density of air 1.25 kg/m³

Kinetic energy = ½ × 2×10⁻⁵ × 1.25 × 0.2² = 0.5 μJ.

Frequency 100 Hz, power is 50 μW

Typical speech power: 10 – 100 μW

The estimated 50 μW falls within the typical range of measured speech power.
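The energetics estimate above can be reproduced in a few lines. This is a minimal sketch using only the values on the slide; the function and parameter names are illustrative, and it assumes that the full kinetic energy of the moving air is released once per glottal closure.

```python
def timbron_energetics(velocity=0.2, volume=2e-5, density=1.25,
                       closures_per_second=100.0):
    """Return (kinetic energy in joules, average power in watts).

    Defaults are the slide's values: airflow velocity in m/s, vocal tract
    volume in m^3, air density in kg/m^3, and closures per second.
    Assumption: each closure releases the full kinetic energy once.
    """
    mass = density * volume                 # mass of air in the vocal tract, kg
    energy = 0.5 * mass * velocity ** 2     # kinetic energy, J
    power = energy * closures_per_second    # average power, W
    return energy, power


energy_j, power_w = timbron_energetics()
print(f"kinetic energy: {energy_j * 1e6:.2f} uJ")
print(f"power at 100 closures per second: {power_w * 1e6:.1f} uW")
```

Changing `closures_per_second` shows how the estimate scales with pitch: doubling the pitch doubles the estimated power under the same assumption.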

Simplified Cartoons on Timbrons

In the following, we present two simplified cartoons about the formation of timbrons. The first set is about vowel [u:], a typical back vowel. The second set is about [ɑ:].

Using typical geometrical values of the vocal tract, the cartoons explain the first formants of [u:] and [ɑ:].

Nevertheless, the cartoons are designed only for intuitive understanding of the concept of timbrons. In order to explain the timbrons accurately, numerical solutions of the wave equations are necessary.

Timbron [u:] - preparation

Before a glottal closure, t<0, there is a continuous airflow with a typical velocity of 0.2 m/sec.

The distance between the glottis and lips is about 25 cm. Beyond the lips, the cross section greatly expands, thus the airflow velocity is very small.

Timbron [u:] – phase 1

A glottal closure abruptly stops the supply of airflow and excites a zero-velocity d’Alembert wavefront, which propagates at the speed of sound. The air behind the wavefront is rarefied.

It takes about 0.8 msec for the wavefront to reach the lips. Then the air beyond the lips rushes in to fill the partial vacuum.


Timbron [u:] – phase 2

A d’Alembert wavefront of velocity towards the glottis continues to propagate at the speed of sound. Due to radiation loss, the velocity is slightly reduced.

It takes 0.8 msec for the wavefront to reach the glottis. The acoustic wave is reflected and propagates towards the lips.


Timbron [u:] – phase 3

The d’Alembert wavefront of velocity towards the glottis propagates towards the lips at the speed of sound. The air behind the wavefront is densified, which stores energy.

It takes 0.8 msec for the wavefront to reach the lips. The wavefront starts to propagate towards the glottis.


Timbron [u:] – phase 4

A new wavefront of velocity towards the lips propagates towards the glottis at the speed of sound. It reaches the glottis and the entire cycle starts over.

The cycle takes 3.2 msec, corresponding to a frequency of about 310 Hz. This is the first formant frequency of the vowel [u:].

Timbron [ɑ:] - preparation

Before a glottal closure, t<0, there is a continuous airflow with a typical velocity of 0.2 m/sec.

The distance between the glottis and oropharynx is about 12 cm. Beyond the oropharynx, with a widely open mouth, the airflow velocity is very small.

Timbron [ɑ:] – phase 1

A glottal closure abruptly stops the supply of airflow and excites a zero-velocity d’Alembert wavefront, which propagates at the speed of sound. The air behind the wavefront is rarefied.

It takes 0.4 msec for the wavefront to reach the oropharynx. Air in the mouth rushes in to fill the partial vacuum.


Timbron [ɑ:] – phase 2

A d’Alembert wavefront of velocity towards the glottis continues to propagate at the speed of sound. Due to radiation loss, the velocity is slightly reduced.

It takes 0.4 msec for the wavefront to reach the glottis. The acoustic wave is reflected and propagates towards the oropharynx.


Timbron [ɑ:] – phase 3

A new wavefront of velocity towards the glottis propagates towards the oropharynx at the speed of sound. The air behind the wavefront is densified, which stores energy.

It takes 0.4 msec for the wavefront to reach the oropharynx. A new wavefront starts to propagate towards the glottis.


Timbron [ɑ:] – phase 4

Finally, a new wavefront of velocity towards the oropharynx propagates towards the glottis at the speed of sound. It reaches the glottis and the cycle starts over.

The entire cycle takes 1.6 msec, corresponding to a frequency of 625 Hz. This is the first formant frequency of the vowel [ɑ:].
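The arithmetic behind both cartoons is the same: one cycle consists of four traversals of the resonating section, and the first formant is the reciprocal of the cycle time. A minimal sketch using the traversal times quoted on the slides (the function name is illustrative):

```python
def first_formant_from_traversal(traversal_ms, traversals_per_cycle=4):
    """Return (cycle time in msec, frequency in Hz).

    The cartoons describe four one-way traversals per cycle:
    glottis -> open end -> glottis -> open end -> glottis.
    """
    cycle_ms = traversals_per_cycle * traversal_ms
    return cycle_ms, 1000.0 / cycle_ms


# One-way traversal times as quoted on the slides.
for vowel, traversal_ms in (("u:", 0.8), ("ɑ:", 0.4)):
    cycle_ms, formant_hz = first_formant_from_traversal(traversal_ms)
    print(f"[{vowel}] cycle {cycle_ms:.1f} msec, first formant {formant_hz:.1f} Hz")
```

For [u:] the result is 312.5 Hz, which the slide rounds to 310 Hz. As the slides note, the cartoons are only intuitive, so an accurate treatment would need numerical solutions of the wave equations.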

Two Mathematical Theorems

Theorem 1: The phase spectrum of a timbron is uniquely determined by its amplitude spectrum.

Proof: Because the value of a timbron is zero before a glottal closure, the theory of functions of a complex variable allows the phase spectrum to be calculated from the amplitude spectrum by an improper integral, similar to the Kramers-Kronig relations.
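A discrete illustration of recovering phase from amplitude is sketched below. It is not taken from the slides: it swaps in the standard homomorphic (cepstral) minimum-phase reconstruction, a common discrete counterpart of a Kramers-Kronig-type integral. Whether this matches the author's integral is an assumption, as are the function name and the toy damped-cosine "timbron".

```python
import numpy as np


def causal_signal_from_amplitude(amplitude):
    """Rebuild a causal, minimum-phase signal from an N-point DFT amplitude.

    `amplitude` must be the symmetric, strictly positive magnitude spectrum
    of a real signal. Returns (signal, phase spectrum in radians).
    """
    n = len(amplitude)
    # Real cepstrum: inverse DFT of the log-amplitude spectrum.
    cepstrum = np.fft.ifft(np.log(amplitude)).real
    # Fold the cepstrum onto non-negative quefrencies (causality constraint).
    fold = np.zeros(n)
    fold[0] = 1.0
    fold[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        fold[n // 2] = 1.0
    spectrum = np.exp(np.fft.fft(fold * cepstrum))
    return np.fft.ifft(spectrum).real, np.angle(spectrum)


# Toy causal waveform (a damped cosine), standing in for a timbron.
n = np.arange(512)
toy = 0.9 ** n * np.cos(0.3 * n)

# Discard the phase, keep only the amplitude, then rebuild the waveform.
rebuilt, phase = causal_signal_from_amplitude(np.abs(np.fft.fft(toy)))
print("max reconstruction error:", np.max(np.abs(rebuilt - toy)))
```

The toy waveform is chosen to be minimum-phase, so the reconstruction is essentially exact. For other causal signals this particular method returns the minimum-phase signal with the same amplitude spectrum.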

Theorem 2: For a voice generated by a periodic sequence of glottal closures, the waveform in a complete period contains full information about the underlying timbron.

Proof: Again, using the fact that the value of a timbron is zero before a glottal closure, the theorem can be proved using basic properties of Fourier transform.
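One Fourier property relevant to Theorem 2 can be checked numerically: when a timbron is repeated periodically, the DFT of a single steady-state period equals the timbron's spectrum sampled at the harmonics of the pitch. The sketch below shows only this step. The toy timbron, the period, and the function names are illustrative assumptions, and the remaining steps of the author's proof are not shown on the slide.

```python
import numpy as np


def one_period_of_voice(timbron, period):
    """Steady-state period of a voice that restarts the timbron every `period` samples."""
    folded = np.zeros(period)
    for start in range(0, len(timbron), period):
        chunk = timbron[start:start + period]
        folded[:len(chunk)] += chunk   # tails of earlier timbrons overlap the current period
    return folded


def spectrum_at_harmonics(timbron, period):
    """Fourier transform of the timbron evaluated at harmonics k / period."""
    n = np.arange(len(timbron))
    k = np.arange(period)[:, None]
    return (timbron * np.exp(-2j * np.pi * k * n / period)).sum(axis=1)


# Toy timbron: a damped sinusoid that is zero before the closure (n < 0).
n = np.arange(300)
timbron = np.exp(-n / 40.0) * np.sin(2 * np.pi * n / 13.0)
period = 80  # samples between glottal closures (hypothetical)

period_dft = np.fft.fft(one_period_of_voice(timbron, period))
harmonic_samples = spectrum_at_harmonics(timbron, period)
print("max difference:", np.max(np.abs(period_dft - harmonic_samples)))
```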

The New Parameterization

Use both the voice signal and the electroglottograph signal to segment the voice into natural frames.

Segment the entire speech signal
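The slides do not spell out the segmentation procedure. As a rough sketch only, one common approach is to take glottal closure instants from sharp rises in the EGG signal and cut the voice at those instants. The threshold rule, the EGG polarity (closure raises the signal), the function names, and the synthetic signals below are all assumptions.

```python
import numpy as np


def closure_instants(egg, threshold=0.5):
    """Sample indices where the EGG signal jumps sharply (assumed closure polarity)."""
    slope = np.diff(egg)
    above = slope > threshold * slope.max()
    # Start of each above-threshold run, as an index into `slope`.
    rising = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    # slope[i] = egg[i + 1] - egg[i], so the jump lands on sample i + 1.
    return rising + 1


def natural_frames(voice, closures):
    """Frames running from one closure instant to the next."""
    return [voice[a:b] for a, b in zip(closures[:-1], closures[1:])]


# Synthetic test signals: the EGG jumps at each closure and then decays;
# the voice is a damped oscillation restarted at each closure.
period, offset, length = 80, 30, 400
n = np.arange(length)
phase = (n - offset) % period
egg = 1.0 - phase / period
voice = np.sin(2 * np.pi * n / 16.0) * np.exp(-phase / 25.0)

closures = closure_instants(egg)
frames = natural_frames(voice, closures)
print("closure instants:", closures.tolist())
print("frame lengths:", [len(f) for f in frames])
```

Each frame then spans exactly one pitch period, unlike fixed-length windows, which are asynchronous to the pitch periods.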

Parameterization and Regeneration

Convert Spectrum into Timbre Vectors

The timbre vector has some similarity to the state vector in quantum mechanics.

Accuracy of Timbre-Vector Representation

Because Laguerre functions are complete and orthonormal, the timbre vector can be as accurate as needed, in stark contrast to the inaccurate and incomplete LPC coefficients.
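The mechanics of a Laguerre-function expansion can be sketched as follows. The slides do not state how the spectrum is mapped onto the expansion variable or how it is scaled, so the variable `x`, the grid, the test curve, and the function names here are illustrative assumptions. The sketch only shows that projecting onto orthonormal Laguerre functions gives coefficients from which the curve can be rebuilt.

```python
import numpy as np
from numpy.polynomial.laguerre import lagval


def laguerre_function(order, x):
    """Orthonormal Laguerre function exp(-x/2) * L_order(x) on [0, infinity)."""
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0
    return np.exp(-x / 2.0) * lagval(x, coeffs)


def trapezoid(y, dx):
    """Trapezoidal rule on a uniform grid."""
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))


def timbre_vector(curve, x, size):
    """Project a sampled curve onto the first `size` Laguerre functions."""
    dx = x[1] - x[0]
    return np.array([trapezoid(curve * laguerre_function(k, x), dx)
                     for k in range(size)])


def reconstruct(vector, x):
    """Rebuild the curve from its expansion coefficients."""
    return sum(c * laguerre_function(k, x) for k, c in enumerate(vector))


x = np.linspace(0.0, 60.0, 6001)
curve = x * np.exp(-x / 2.0)      # test curve, equal to phi_0 - phi_1
vec = timbre_vector(curve, x, size=6)
print(np.round(vec, 4))
print("max reconstruction error:", np.max(np.abs(reconstruct(vec, x) - curve)))
```

Because the test curve is exactly the difference of the first two Laguerre functions, only the first two coefficients are non-zero. A general curve would need more terms, with accuracy improving as the vector grows.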

Examples of Timbrons

Obtained from the ARCTIC databases, speaker bdl, sentence a0008. The sentence was converted into a sequence of timbre vectors, and the phase was then recovered using the Kramers-Kronig relations. The timbrons were then generated by FFT.

Each timbron is 15 msec long. The first 2.5 msec is the pre-excitation waveform, which theoretically should be zero.

A timbron is a complete representation of the instantaneous timbre. Different vowels show very different waveforms.

The starting frame of plosive [K] can also be represented by a timbron, with a phase spectrum determined by its amplitude spectrum.

The subsequent frames of [K] do not have well-defined phase.

Timbron of Vowel [AE]. 15 msec. (bdl a0008. Frame 56)

Timbron of Consonant [k]. First frame. 15 msec. (bdl a0008. Frame 155)

Timbron of Consonant [k]. 2nd frame. 15 msec. (bdl a0008. Frame 156)

Timbron of Vowel [EH]. 15 msec. (bdl a0008. Frame 168)

Timbron of Consonant [M]. 15 msec. (bdl a0008. Frame 185)

Timbron of Vowel [IH]. 15 msec. (bdl a0008. Frame 240)

Timbron of Vowel [UH]. 15 msec. (bdl a0008. Frame 206)

Timbron of Consonant [N]. 15 msec. (bdl a0008. Frame 252)

Timbron of Part of Vowel [AY]. 15 msec. (bdl a0008. Frame 291)

Timbre vectors can be fused to eliminate seams

Speech segment 1 Speech segment 2

After the fusing process, the entire speech section sounds natural.

Speech synthesis

Speech recognition

Voice Demo 1: Speech Regeneration

The original recorded speech was converted into timbre vector form and regenerated. There is very little quality degradation.

Speakers: bdl, jmk, slt (original and regenerated)

Voice Demo 2: Voice Transformation

Even deep into the falsetto and vocal-fry registers, the voice remains clear and human-like.

child contralto soprano mezzo-soprano

The pitch and head-size can be changed dramatically. By raising the pitch 6 halftones each time, the original voice, a tenor, can be changed to female and child voices.

By lowering the pitch 6 halftones each time, a tenor voice can be changed to very deep male voices.

baritone bass contra-bass giant
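The pitch steps described above can be put in numbers. This is a minimal sketch assuming equal-tempered halftones (a factor of 2^(1/12) each) and a hypothetical tenor fundamental of 130 Hz. The slides give neither the actual fundamental nor the mapping of each step to a voice label, and the head-size change is not modeled here.

```python
def pitch_ratio(halftones):
    """Frequency ratio for a shift of `halftones` equal-tempered halftones."""
    return 2.0 ** (halftones / 12.0)


def pitch_ladder(base_hz, step_halftones, steps):
    """Fundamental frequencies after 0, 1, ..., `steps` successive shifts."""
    return [base_hz * pitch_ratio(step_halftones * k) for k in range(steps + 1)]


tenor_hz = 130.0  # hypothetical fundamental, not taken from the slides

raised = pitch_ladder(tenor_hz, 6, 4)     # four steps up, 6 halftones each
lowered = pitch_ladder(tenor_hz, -6, 4)   # four steps down, 6 halftones each
print("raised: ", [round(f, 1) for f in raised])
print("lowered:", [round(f, 1) for f in lowered])
```

Each 6-halftone step multiplies the fundamental by the square root of 2, so two steps make one octave and four steps span two octaves in either direction.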

Voice Demo 3: Speed Variation

100 wpm 150 wpm 200 wpm 300 wpm

The speed can be changed from 100 words per minute to 1000 words per minute, and the voice is still clear. The low speed can be used for foreign language education. The high speed is a great advantage for visually impaired people.

400 wpm 500 wpm 600 wpm 700 wpm

800 wpm 900 wpm 1000 wpm

Voice Demo 4: Prosody Modification

Voice Affirmation Question

Mezzo-soprano Tenor Baritone Bass Contrabass Soprano Child

-- change an affirmation into a question --

Summary

• A new theory of human voice production
– Based on simultaneous voice and EGG signals
– The concept of timbrons
• Speech processing based on timbrons
– Kramers-Kronig relations
– Fourier analysis
– Laguerre-function expansion
– Timbre vectors
• Applications: voice transformation, speech synthesis and speech recognition
