speech in nips 2019/2020 - tsinghua university

Speech in NIPS 2019/2020

Lantian Li

2020-12-21

Untangling in Invariant Speech Recognition

• How information is untangled within DNNs trained to recognize speech.

• Defines several metrics (e.g., manifold capacity) that connect the geometric properties of network representations to the separability of classes.

• A theory-driven geometric analysis of representation untangling in tasks.

• CNN, Deep Speech 2

• WSJ, Librispeech

Anchor points (support vectors)

Manifold capacity measures

• Mean-Field Theoretic (MFT) Manifold Capacity

FastSpeech: Fast, Robust and Controllable Text to Speech

• Neural TTS suffers from slow inference speed, lack of robustness (word skipping or repeating), and lack of controllability (speed or prosody).

• Using a feed-forward Transformer (instead of the conventional encoder-attention-decoder framework) to generate Mel-spectrograms in parallel.

• Using a length regulator to expand the phoneme sequence to match the length of the target Mel-spectrogram sequence.
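The length-regulator idea above can be sketched in a few lines: each phoneme hidden vector is repeated according to its duration so that the expanded sequence matches the Mel-spectrogram length. This is a minimal illustration, not the FastSpeech implementation; the function name `length_regulate`, the list-of-lists representation, and the `alpha` speed factor applied to the durations are assumptions made for this example.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Repeat each phoneme vector `durations[i]` times (scaled by alpha).

    hidden:    one feature vector per phoneme (hypothetical toy representation).
    durations: number of Mel frames assigned to each phoneme.
    alpha:     assumed speed factor; >1 lengthens (slower speech), <1 shortens.
    """
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")

    expanded: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        # Scale and round the duration; never go below zero frames.
        n_frames = max(0, int(round(dur * alpha)))
        # Copy the vector for every frame so rows are independent objects.
        expanded.extend(list(vec) for _ in range(n_frames))
    return expanded


if __name__ == "__main__":
    phonemes = [[1.0], [2.0], [3.0]]
    print(length_regulate(phonemes, [2, 0, 3]))
```

Scaling all durations by a single factor is one simple way to expose speed control; a phoneme with duration 0 simply contributes no frames.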

FastSpeech

FastSpeech 2

Length regulator

Controllability

Robustness

https://speechresearch.github.io/fastspeech/

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Listening to Sounds of Silence for Speech Denoising

• A silent interval reveals noise characteristics.

• Several silent intervals assemble a time-varying noise distribution.

• Silent Interval Detection, Noise Estimation, Noise Removal
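To make the first two stages concrete, the sketch below marks low-energy frames as silent and averages their power as a noise estimate. Note that this is a classical energy-threshold stand-in chosen for illustration, not the learned detector or estimator described in the paper; the function names, the non-overlapping framing, and the fixed threshold are all assumptions. Noise removal is not covered here.

```python
from typing import List, Sequence


def frame_energies(signal: Sequence[float], frame_len: int) -> List[float]:
    """Mean power of each non-overlapping frame (trailing partial frame dropped)."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def detect_silent_frames(energies: Sequence[float], threshold: float) -> List[bool]:
    """Flag frames whose power falls below a hand-picked (assumed) threshold."""
    return [e < threshold for e in energies]


def estimate_noise_power(energies: Sequence[float], silent: Sequence[bool]) -> float:
    """Average power over the frames flagged as silent."""
    silent_energies = [e for e, is_silent in zip(energies, silent) if is_silent]
    if not silent_energies:
        raise ValueError("no silent frames found; cannot estimate noise")
    return sum(silent_energies) / len(silent_energies)


if __name__ == "__main__":
    # Toy signal: quiet noise floor, a loud "speech" burst, quiet again.
    signal = [0.01] * 4 + [1.0] * 4 + [0.01] * 4
    energies = frame_energies(signal, frame_len=4)
    silent = detect_silent_frames(energies, threshold=0.01)
    print(silent, estimate_noise_power(energies, silent))
```

Averaging over several silent frames spread through the recording is what lets the estimate follow a noise level that changes over time, which is the intuition behind the second bullet.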

Loss functions and training

Silent interval supervision

Data construction

• AVSPEECH: audio-video speech

• 2214 videos for training and 234 videos for testing

• DEMAND and Google’s AudioSet

• SNRs range from -10 dB to 10 dB
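Noisy training data of this kind is commonly built by scaling a noise clip so that the mixture reaches a target SNR. The sketch below shows the standard power-ratio calculation; it is an assumed recipe for illustration, not the authors' data pipeline, and the name `mix_at_snr` and the equal-length requirement are choices made here.

```python
import math
from typing import List, Sequence


def mix_at_snr(
    speech: Sequence[float], noise: Sequence[float], snr_db: float
) -> List[float]:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_speech / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    speech = [1.0, -1.0, 1.0, -1.0]
    noise = [0.5, 0.5, -0.5, -0.5]
    print(mix_at_snr(speech, noise, snr_db=-10.0))
```

At -10 dB the scaled noise carries ten times the power of the speech, while at 10 dB it carries one tenth.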

Performance of SID

Ablation studies

Comparison with SOTA

Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

• Exploring wav2vec 2.0 on speaker verification and language identification

https://arxiv.org/abs/2012.06185

The Cone of Silence: Speech Separation by Localization

https://grail.cs.washington.edu/projects/cone-of-silence/

