Physics of Human Voice A New Theory with Applications Research Conference, November 16, 2012 C. Julian Chen Department of Applied Physics and Applied Mathematics Columbia University

Physics of Human Voice A New Theory with Applications

Research Conference, November 16, 2012

C. Julian Chen

Department of Applied Physics and Applied Mathematics

Columbia University

Outline

• The source-filter theory
  – Speech processing based on source-filter theory
• A new theory of human voice production
  – Timing correlation of voice and EGG signals
  – The concept of timbrons
• Speech processing based on timbrons
  – Kramers-Kronig relations
  – Fourier analysis
  – Laguerre-function expansion
  – Timbre vectors
• Applications: voice transformation, speech synthesis and speech recognition

Scientific Basis of Speech Technology

• Understand the physics of speech production
• Understand the physics of hearing
• Discover accurate and efficient parametric representations of the speech signal
• Design methods to convert the speech signal into a parametric representation
• Design methods to accurately recover speech from a parametric representation
• Develop methods to modify and manipulate speech through the parametric representation

Source-Filter Theory of Voice Production

The first edition of Gunnar Fant’s book was published in 1960.

Source-Filter Theory of Voice Production (1)

A quasi-periodic pulsating airflow generated by the opening and closing of the glottis creates a buzzing source, which is then filtered by the spectrum of the upper vocal tract to form the voice (Fant 1960).

Source-Filter Theory of Voice Production (2)

Strong airflow occurs during the opening of the glottis. No air flows when there is a glottal stop.

Traditional Speech Processing: Windowing

Because processing windows are asynchronous with pitch periods, windowing always compromises the speech signal.

Traditional Speech Processing: Phone alignment errors due to improper windowing

Because a substantial number of processing windows cross phone boundaries, automatic phone alignment inevitably produces a large percentage of errors.

The Electroglottograph (EGG)

A non-invasive instrument that detects changes in the electrical conductance between the two vocal cords, and thereby monitors the opening and closing of the glottis (circa 1960).

Simultaneously recorded voice and EGG signals (1)

(1) Glottal closures (A) are very fast. (2) The strongest voice signal (C) immediately follows the glottal closure. (3) The voice signal in the glottal open phase (B) is much weaker.

Simultaneously recorded voice and EGG signals (2)

Two individual glottal closures are shown. Each triggers a timbron, which is determined solely by the geometry of the vocal tract.

The Timbron Theory (1)

When the glottis is open, there is a continuous airflow in the vocal tract. A glottal closure abruptly stops the airflow supply and excites a d’Alembert wavefront, which resonates in the vocal tract. The resulting waveform represents the instantaneous timbre.

The Timbron Theory (2)

Energetics:

Velocity of airflow: 0.2 m/sec

Volume of the vocal tract: 2×10⁻⁵ m³

Density of air: 1.25 kg/m³

Kinetic energy = ½ × 2×10⁻⁵ × 1.25 × 0.2² = 0.5 μJ

At a frequency of 100 Hz, the power is 50 μW

Typical speech power: 10 – 100 μW

The estimate falls within the typical range of measured speech power.
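The energy estimate above can be reproduced in a few lines. This is only a sketch of the slide's arithmetic: the numeric values are the ones quoted on the slide, the variable names are illustrative, and the power figure assumes that the full kinetic energy is released once per glottal closure.

```python
# Values quoted on the slide.
airflow_velocity = 0.2   # m/s
tract_volume = 2e-5      # m^3
air_density = 1.25       # kg/m^3
closure_rate = 100.0     # glottal closures per second (Hz)

# Kinetic energy of the moving air column: (1/2) * mass * velocity^2.
kinetic_energy = 0.5 * air_density * tract_volume * airflow_velocity ** 2  # joules

# Assumption: all of that energy is released once per closure.
power = kinetic_energy * closure_rate  # watts

print(f"energy per closure: {kinetic_energy * 1e6:.2f} uJ")
print(f"power at {closure_rate:.0f} Hz: {power * 1e6:.0f} uW")
```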

Simplified Cartoons on Timbrons

In the following, we present two sets of simplified cartoons about the formation of timbrons. The first set is about the vowel [u:], a typical back vowel. The second set is about [ɑ:].

Using typical geometrical values of the vocal tract, the cartoons explain the first formants of [u:] and [ɑ:].

Nevertheless, the cartoons are designed only for intuitive understanding of the concept of timbrons. In order to explain the timbrons accurately, numerical solutions of the wave equations are necessary.

Timbron [u:] - preparation

Before a glottal closure, t<0, there is a continuous airflow with a typical velocity of 0.2 m/sec.

The distance between the glottis and lips is about 25 cm. Beyond the lips, the cross section greatly expands, thus the airflow velocity is very small.

Timbron [u:] – phase 1

A glottal closure abruptly stops the supply of airflow and excites a zero-velocity d’Alembert wavefront, which propagates at the speed of sound. The air behind the wavefront is rarefied.

It takes about 0.8 msec for the wavefront to reach the lips. Then the air beyond the lips rushes in to fill the partial vacuum.

Timbron [u:] – phase 2

The d’Alembert wavefront, with air velocity directed towards the glottis, continues to propagate at the speed of sound. Due to radiation loss, the velocity is slightly reduced.

It takes 0.8 msec for the wavefront to reach the glottis. The acoustic wave is reflected and propagates towards the lips.

Timbron [u:] – phase 3

The d’Alembert wavefront, with air velocity directed towards the glottis, propagates towards the lips at the speed of sound. The air behind the wavefront is compressed, which stores energy.

It takes 0.8 msec for the wavefront to reach the lips. The wavefront starts to propagate towards the glottis.

Timbron [u:] – phase 4

A new wavefront, with air velocity directed towards the lips, propagates towards the glottis at the speed of sound. It reaches the glottis and the entire cycle starts over.

The cycle takes 3.2 msec, corresponding to a frequency of about 310 Hz. This is the first formant frequency of the vowel [u:].
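The cycle time follows directly from the four traversals described in phases 1 to 4. The sketch below only restates that arithmetic; the function name is hypothetical and the 0.8 msec traversal time is the rounded value given on the slides.

```python
def timbron_cycle(traversal_time_s, traversals=4):
    """Return (cycle duration in seconds, frequency in Hz) for a wavefront
    that crosses the resonating section `traversals` times per cycle."""
    cycle = traversals * traversal_time_s
    return cycle, 1.0 / cycle

cycle_u, f1_u = timbron_cycle(0.8e-3)  # glottis to lips, vowel [u:]
print(f"[u:] cycle {cycle_u * 1e3:.1f} msec, first formant about {f1_u:.1f} Hz")
```

The exact quotient is 312.5 Hz, which the slide rounds to 310 Hz.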

Timbron [ɑ:] - preparation

Before a glottal closure, t<0, there is a continuous airflow with a typical velocity of 0.2 m/sec.

The distance between the glottis and oropharynx is about 12 cm. Beyond the oropharynx, with a widely open mouth, the airflow velocity is very small.

Timbron [ɑ:] – phase 1

A glottal closure abruptly stops the supply of airflow and excites a zero-velocity d’Alembert wavefront, which propagates at the speed of sound. The air behind the wavefront is rarefied.

It takes 0.4 msec for the wavefront to reach the oropharynx. Air in the mouth rushes in to fill the partial vacuum.

Timbron [ɑ:] – phase 2

The d’Alembert wavefront, with air velocity directed towards the glottis, continues to propagate at the speed of sound. Due to radiation loss, the velocity is slightly reduced.

It takes 0.4 msec for the wavefront to reach the glottis. The acoustic wave is reflected and propagates towards the oropharynx.

Timbron [ɑ:] – phase 3

A new wavefront, with air velocity directed towards the glottis, propagates towards the oropharynx at the speed of sound. The air behind the wavefront is compressed, which stores energy.

It takes 0.4 msec for the wavefront to reach the oropharynx. A new wavefront starts to propagate towards the glottis.

Timbron [ɑ:] – phase 4

Finally, a new wavefront, with air velocity directed towards the oropharynx, propagates towards the glottis at the speed of sound. It reaches the glottis and the cycle starts over.

The entire cycle takes 1.6 msec, corresponding to a frequency of 625 Hz. This is the first formant frequency of the vowel [ɑ:].
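Four traversals per cycle is the same as saying that the resonating section is a quarter of a wavelength long, so f = c / (4 L). The sketch below checks this for [ɑ:] using only the slide's numbers (12 cm and 0.4 msec). The function name is hypothetical, and the propagation speed is the one implied by those rounded values rather than a measured speed of sound.

```python
def quarter_wave_estimate(length_m, traversal_time_s):
    """Return (implied propagation speed in m/s, first-formant estimate in Hz)."""
    speed = length_m / traversal_time_s      # speed implied by the slide's numbers
    frequency = speed / (4.0 * length_m)     # quarter-wavelength resonance
    return speed, frequency

speed_a, f1_a = quarter_wave_estimate(0.12, 0.4e-3)  # glottis to oropharynx, [ɑ:]
print(f"implied speed {speed_a:.0f} m/s, first formant about {f1_a:.0f} Hz")
```

Because the frequency scales as 1/L, halving the effective length from [u:] to [ɑ:] roughly doubles the first formant.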

Two Mathematical Theorems

Theorem 1: The phase spectrum of a timbron is uniquely determined by its amplitude spectrum.

Proof: Because the value of a timbron is zero before a glottal closure, the theory of functions of a complex variable allows the phase spectrum to be calculated from the amplitude spectrum through an improper integral, similar to the Kramers-Kronig relations.

Theorem 2: For a voice generated by a periodic sequence of glottal closures, the waveform in one complete period contains full information about the underlying timbron.

Proof: Again using the fact that the value of a timbron is zero before a glottal closure, the theorem can be proved from basic properties of the Fourier transform.
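A discrete analogue of Theorem 1 can be sketched with NumPy. The code below is not the author's implementation: it uses the standard cepstral method, which is a discrete counterpart of the Kramers-Kronig (Hilbert-transform) integral, and it assumes the waveform is minimum-phase, a stronger condition than being zero before the closure. The function name and the damped-resonance test waveform are hypothetical.

```python
import numpy as np

def minimum_phase(amplitude):
    """Phase spectrum (radians) of the minimum-phase signal whose full-length
    DFT amplitude spectrum is `amplitude` (even number of bins)."""
    n = len(amplitude)
    if n % 2:
        raise ValueError("expected an even number of frequency bins")
    log_amplitude = np.log(np.maximum(amplitude, 1e-12))
    cepstrum = np.fft.ifft(log_amplitude).real
    # Keep only the causal part of the cepstrum; its DFT is
    # log-amplitude + j * phase for a minimum-phase signal.
    fold = np.zeros(n)
    fold[0] = 1.0
    fold[1:n // 2] = 2.0
    fold[n // 2] = 1.0
    return np.fft.fft(cepstrum * fold).imag

# Hypothetical test waveform: a damped resonance that is zero before t = 0.
n = 1024
k = np.arange(n)
waveform = 0.95 ** k * np.sin(0.3 * (k + 1))

amplitude = np.abs(np.fft.fft(waveform))   # discard the phase
phase = minimum_phase(amplitude)           # recover it from the amplitude alone
recovered = np.fft.ifft(amplitude * np.exp(1j * phase)).real

print("max reconstruction error:", np.max(np.abs(recovered - waveform)))
```

For this test waveform the amplitude spectrum alone is enough to rebuild the time-domain signal to within numerical precision.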

The New Parameterization

Both the voice signal and the electroglottograph signal are used to segment the voice into natural frames.
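One possible way to obtain such frames is to locate the abrupt changes in the EGG signal and cut the voice signal at those instants. The sketch below is an illustration under stated assumptions, not the procedure used in the talk: the function names, the threshold, the EGG polarity (closure shown as a sharp rise) and the synthetic signals are all hypothetical.

```python
import numpy as np

def closure_instants(egg, threshold_ratio=0.5):
    """Indices of the first sample after each abrupt rise in the EGG signal."""
    slope = np.diff(egg)
    threshold = threshold_ratio * slope.max()
    inner = slope[1:-1]
    is_peak = (inner > threshold) & (inner >= slope[:-2]) & (inner > slope[2:])
    # +1 maps back to the slope index, +1 more to the sample after the jump.
    return np.flatnonzero(is_peak) + 2

def natural_frames(voice, instants):
    """Cut the voice signal into frames bounded by consecutive closure instants."""
    return [voice[a:b] for a, b in zip(instants[:-1], instants[1:])]

# Synthetic stand-ins: 100 Hz closures at a 16 kHz sample rate.
sample_rate = 16000
period = sample_rate // 100
t = np.arange(5 * period)
egg = 1.0 - (t % period) / period            # sharp rise at every closure
voice = np.sin(2 * np.pi * 5 * t / period)   # placeholder voice signal

instants = closure_instants(egg)
frames = natural_frames(voice, instants)
print(instants.tolist(), [len(f) for f in frames])
```

Each frame then spans exactly one pitch period, so no frame straddles a glottal closure.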

Segment the entire speech signal

Parameterization and Regeneration

Convert Spectrum into Timbre Vectors

The timbre vector has some similarity to the state vector in quantum mechanics.

Accuracy of Timbre-Vector Representation

Because Laguerre functions are complete and orthonormal, the timbre vector can be made as accurate as needed, in stark contrast to the inaccurate and incomplete LPC coefficients.
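The completeness argument can be illustrated numerically: expanding a smooth curve in Laguerre functions φ_n(x) = exp(−x/2) L_n(x) and truncating after more terms gives a smaller reconstruction error. The sketch below is only an illustration. The target curve, the grid, the frequency scaling and all function names are hypothetical, and the slides do not specify how the spectrum is mapped onto the expansion variable.

```python
import numpy as np
from numpy.polynomial.laguerre import lagvander

def laguerre_basis(x, order):
    """Columns are phi_n(x) = exp(-x/2) * L_n(x) for n = 0 .. order-1."""
    return lagvander(x, order - 1) * np.exp(-x / 2.0)[:, None]

def integrate(y, x):
    """Trapezoidal rule along the first axis of y."""
    return np.tensordot(np.diff(x), (y[1:] + y[:-1]) / 2.0, axes=(0, 0))

def expand(values, x, order):
    """Coefficient vector c_n = integral of values(x) * phi_n(x) dx."""
    return integrate(values[:, None] * laguerre_basis(x, order), x)

def reconstruct(coefficients, x):
    """Rebuild the curve from its coefficient vector."""
    return laguerre_basis(x, len(coefficients)) @ coefficients

x = np.linspace(0.0, 80.0, 8001)
target = x * np.exp(-x)   # hypothetical smooth curve standing in for a spectrum

errors = {}
for order in (4, 8, 12):
    vector = expand(target, x, order)
    errors[order] = np.max(np.abs(reconstruct(vector, x) - target))
    print(f"{order:2d} coefficients: max error {errors[order]:.2e}")
```

The coefficient vector plays the role of a timbre vector here: a short list of numbers from which the curve can be rebuilt to a chosen accuracy.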

Examples of Timbrons

Obtained from the ARCTIC databases, speaker bdl, sentence a0008. The sentence was converted into a sequence of timbre vectors, and the Kramers-Kronig relations were then used to recover the phase. The timbrons were then generated by FFT.

Each timbron is 15 msec long. The first 2.5 msec is the pre-excitation waveform, which in theory should be zero.

A timbron is a complete representation of the instantaneous timbre. Different vowels show very different waveforms.

The starting frame of plosive [K] can also be represented by a timbron, with a phase spectrum determined by its amplitude spectrum.

The subsequent frames of [K] do not have well-defined phase.

Timbron of Vowel [AE], 15 msec (bdl a0008 frame 56)

Timbron of Consonant [k]. First frame. 15 msec. (bdl a0008. Frame 155)

Timbron of Consonant [k]. 2nd frame. 15 msec. (bdl a0008. Frame 156)

Timbron of Vowel [EH]. 15 msec. (bdl a0008. Frame 168)

Timbron of Consonant [M]. 15 msec. (bdl a0008. Frame 185)

Timbron of Vowel [IH]. 15 msec. (bdl a0008. Frame 240)

Timbron of Vowel [UH]. 15 msec. (bdl a0008. Frame 206)

Timbron of Consonant [N]. 15 msec. (bdl a0008. Frame 252)

Timbron of Part of Vowel [AY]. 15 msec. (bdl a0008. Frame 291)

Timbre vectors can be fused to eliminate seams

Speech segment 1 Speech segment 2

With the fusing process, the entire speech section sounds natural.

Speech synthesis

Speech recognition

Voice Demo 1: Speech Regeneration

The original recorded speech was converted into timbre vector form and regenerated. There is very little quality degradation.

Speakers: bdl, jmk, slt. Versions: original, regenerated.

Voice Demo 2: Voice Transformation

Even deep into the falsetto and vocal-fry registers, the voice remains clear and human-like.

child, contralto, soprano, mezzo-soprano

The pitch and head-size can be changed dramatically. By raising the pitch 6 halftones each time, the original voice, a tenor, can be changed to female and child voices.

By lowering the pitch 6 halftones each time, a tenor voice can be changed to very deep male voices.

baritone bass contra-bass giant
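The size of each pitch step can be made concrete. Assuming equal-tempered halftones, a shift of n halftones multiplies the fundamental frequency by 2^(n/12), so each 6-halftone step is a factor of √2. The sketch below covers pitch only and does not model the head-size change. The function name and the 130 Hz starting pitch are hypothetical and not taken from the slides.

```python
def shift_ratio(halftones):
    """Frequency ratio for a shift of `halftones` equal-tempered halftones."""
    return 2.0 ** (halftones / 12.0)

base_f0 = 130.0  # Hz, hypothetical tenor pitch used only for illustration

for steps in range(-4, 5):
    halftones = 6 * steps
    ratio = shift_ratio(halftones)
    print(f"{halftones:+3d} halftones: x{ratio:.3f} -> {base_f0 * ratio:6.1f} Hz")
```

Four steps in either direction therefore span two octaves above or below the original voice.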

Voice Demo 3: Speed Variation

The speed can be changed from 100 words per minute to 1000 words per minute, and the voice is still clear. The low speed can be used for foreign language education. The high speed is a great advantage for visually impaired people.

Demo speeds: 100, 150, 200, 300, 400, 500, 600, 700, 800, 900 and 1000 wpm.

Voice Demo 4: Prosody Modification

Voice Affirmation Question

Mezzo-soprano, Tenor, Baritone, Bass, Contrabass, Soprano, Child

-- change an affirmation into a question --

Summary

• A new theory of human voice production
  – Based on simultaneous voice and EGG signals
  – The concept of timbrons
• Speech processing based on timbrons
  – Kramers-Kronig relations
  – Fourier analysis
  – Laguerre-function expansion
  – Timbre vectors
• Applications: voice transformation, speech synthesis and speech recognition