Speech in NIPS 2019/2020 - Tsinghua University

Speech in NIPS 2019/2020 Lantian Li 2020-12-21

Speech in NIPS 2019/2020

Lantian Li

2020-12-21

Untangling in Invariant Speech Recognition

• How information is untangled within DNNs trained to recognize speech.

• Defines several metrics (such as manifold capacity) that connect geometric properties of network representations to the separability of classes.

• A theory-driven geometric analysis of representation untangling in tasks.

• Models: CNN, Deep Speech 2

• Datasets: WSJ, LibriSpeech

Anchor points (support vectors)

Manifold capacity measures

• Mean-Field Theoretic (MFT) Manifold Capacity
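As a reminder of what "capacity" measures here, a minimal statement of the usual definition follows. The symbols P (number of class manifolds), N (feature dimension) and P_max are notation assumed for this note, not taken from the slides.

```latex
% Load: number of manifolds per feature dimension
\alpha = \frac{P}{N}
% Capacity: the largest load at which random binary labelings of the
% P manifolds are still linearly separable with high probability
\alpha_c = \frac{P_{\max}}{N}
% Reference case: for single points in general position, \alpha_c = 2
% (Cover's theorem); extended manifolds have lower capacity.
```

A higher capacity at a given layer means more classes can be linearly separated per feature dimension, which is the sense in which the representation is "untangled".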

FastSpeech: Fast, Robust and Controllable Text to Speech

• Neural TTS suffers from slow inference speed, lack of robustness (word skipping or repeating), and lack of controllability (speed or prosody).

• Uses a feed-forward Transformer (instead of the conventional encoder-attention-decoder framework) to generate the Mel-spectrogram in parallel.

• Uses a length regulator to expand the phoneme sequence to match the length of the target Mel-spectrogram sequence.
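The length regulator described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the FastSpeech implementation: the function name `length_regulate`, the `alpha` speed factor, and the round-half-up rule are assumptions made for this example, and the durations would normally come from a trained duration predictor.

```python
import numpy as np

def length_regulate(hidden, durations, alpha=1.0):
    """Repeat each phoneme's hidden state for its (scaled) duration in frames.

    hidden:    array of shape (num_phonemes, hidden_dim)
    durations: number of Mel frames per phoneme, shape (num_phonemes,)
    alpha:     hypothetical speed factor; >1 slows speech down, <1 speeds it up
    """
    hidden = np.asarray(hidden)
    scaled = np.asarray(durations, dtype=float) * alpha
    # Round half up (an assumption of this sketch), never below zero frames.
    frames = np.maximum(np.floor(scaled + 0.5), 0).astype(int)
    return np.repeat(hidden, frames, axis=0)

# Four phonemes with 3-dimensional hidden states (toy values).
hidden = np.arange(12, dtype=float).reshape(4, 3)
durations = [2, 2, 3, 1]

normal = length_regulate(hidden, durations)             # 2+2+3+1 = 8 frames
slower = length_regulate(hidden, durations, alpha=1.3)  # 3+3+4+1 = 11 frames
faster = length_regulate(hidden, durations, alpha=0.5)  # 1+1+2+1 = 5 frames
```

Because the expansion is a plain repeat, the output length is fixed before any Mel frame is generated, which is what allows parallel generation; scaling the durations is one way to control speaking rate.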

FastSpeech

FastSpeech 2

Length regulator

Controllability

Robustness

https://speechresearch.github.io/fastspeech/

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Listening to Sounds of Silence for Speech Denoising

• A silent interval reveals noise characteristics.

• Taken together, several silent intervals reveal a time-varying noise distribution.

• Three stages: silent interval detection, noise estimation, noise removal
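The three stages can be illustrated with a classical stand-in. The sketch below is not the paper's method: it swaps the learned components for an energy threshold (detection), an averaged magnitude spectrum (estimation), and spectral subtraction (removal). All function names, the `ratio` threshold, and the toy signal are assumptions of this example, and the noise is treated as stationary for simplicity.

```python
import numpy as np

def detect_silent_frames(frames, ratio=0.1):
    # Hypothetical rule: a frame is "silent" if its energy is below
    # `ratio` times the energy of the loudest frame.
    energy = np.mean(frames ** 2, axis=1)
    return energy < ratio * energy.max()

def estimate_noise_magnitude(frames, silent):
    # Average magnitude spectrum over the frames flagged as silent.
    return np.abs(np.fft.rfft(frames[silent], axis=1)).mean(axis=0)

def remove_noise(frames, noise_mag):
    # Spectral subtraction: shrink each bin's magnitude, keep the noisy phase.
    spec = np.fft.rfft(frames, axis=1)
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    cleaned = mag * np.exp(1j * np.angle(spec))
    return np.fft.irfft(cleaned, n=frames.shape[1], axis=1)

# Toy input: 12 frames of white noise, with a tone ("speech") in frames 4..7.
rng = np.random.default_rng(0)
frame_len = 256
frames = 0.1 * rng.standard_normal((12, frame_len))
t = np.arange(frame_len)
frames[4:8] += np.sin(2 * np.pi * 8 * t / frame_len)

silent = detect_silent_frames(frames)                   # stage 1: detection
noise_mag = estimate_noise_magnitude(frames, silent)    # stage 2: estimation
denoised = remove_noise(frames, noise_mag)              # stage 3: removal
```

The point of the sketch is the data flow: the noise estimate is built only from frames judged to contain no speech, then applied to every frame.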

Loss functions and training

Silent interval supervision

Data construction

• Clean speech: AVSPEECH (audio-visual speech dataset)

• 2214 videos for training and 234 videos for testing

• Noise: DEMAND and Google’s AudioSet

• SNRs range over [-10 dB, 10 dB]
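Constructing noisy inputs at a chosen SNR amounts to rescaling the noise before adding it to the clean speech. The helper below is a minimal sketch of that arithmetic, assuming the noise clip is at least as long as the speech clip; the name `mix_at_snr` and the toy signals are hypothetical and not taken from the paper's data pipeline.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`.

    Returns the noisy mixture and the scaled noise that was added.
    Assumes len(noise) >= len(speech) and that neither signal is all zeros.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power is p_speech / 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

def measured_snr_db(speech, noise):
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

# Toy signals standing in for a clean utterance and a noise recording.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * np.arange(16000) * 220 / 16000)
noise = rng.standard_normal(20000)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-10)
```

At -10 dB the noise power is ten times the speech power, which is the hardest end of the range listed above.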

Performance of silent interval detection (SID)

Ablation studies

Comparison with SOTA

Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

• Exploring wav2vec 2.0 on speaker verification and language identification

• https://arxiv.org/abs/2012.06185

The Cone of Silence: Speech Separation by Localization

https://grail.cs.washington.edu/projects/cone-of-silence/