
Digital Signal Processing Group
School of Electronics
Dept. of Computer Science, Electronics and Telecommunications
AGH University of Science and Technology, Kraków, Poland
dsp.agh.edu.pl

Topics of our research

• Automatic speech recognition
• Speaker verification, identification and profiling
• Natural language processing
• 3D sound simulation
• Speech enhancement

Automatic speech recognition

Psychoacoustic wavelet speech analysis

[M. Ziółko, J. Gałka, B. Ziółko and T. Drwięga, "Perceptual Wavelet Decomposition for Speech Segmentation", Proceedings of INTERSPEECH 2010, Makuhari, Japan]

Fourier-wavelet transform

\[
\hat{\tilde{s}}(a, f) = \frac{1}{\sqrt{a}} \iint s(t)\, \psi\!\left(\frac{t - b}{a}\right) e^{-j 2 \pi f b}\, \mathrm{d}t\, \mathrm{d}b
\]

where:

• a - scale (inversely related to frequency)
• b - time translation
• ψ - an arbitrarily chosen wavelet

The result is a speech spectrum.
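The transform above can be sketched numerically. A minimal numpy illustration, assuming a complex Morlet wavelet, a toy choice of scales and a synthetic tone standing in for a speech frame (none of these choices come from the paper):

```python
import numpy as np

def cwt_morlet(x, scales, fs, w0=6.0):
    """Wavelet transform: correlate x with a scaled complex Morlet
    wavelet (normalised by 1/sqrt(a)) at each scale a."""
    n = len(x)
    t = (np.arange(n) - n // 2) / fs          # time axis centred on zero
    out = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        u = t / a
        psi = np.exp(1j * w0 * u - u**2 / 2) / np.sqrt(a)
        out[i] = np.convolve(x, np.conj(psi)[::-1], mode="same")
    return out

fs = 8000
t = np.arange(0, 0.25, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)               # toy "speech" frame

# scale a tuning the Morlet wavelet to frequency f: a = w0 / (2*pi*f)
scales = np.array([6.0 / (2 * np.pi * f) for f in (220.0, 440.0, 880.0)])
W = cwt_morlet(x, scales, fs)                 # wavelet transform over (a, b)
S = np.fft.fft(W, axis=1)                     # Fourier transform over translation b

energy = np.abs(W).mean(axis=1)
assert int(np.argmax(energy)) == 1            # the 440 Hz scale dominates
```

The second FFT, taken along the translation axis b, is what turns the plain wavelet transform into the Fourier-wavelet representation sketched by the formula.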

An example of wavelet transform and Fourier-wavelet transform

[Figure: wavelet and Fourier-wavelet spectra of the phrase "Litwo ojczyzno", with phone labels along the time axis]

Boundaries detection

Example utterance with two detected sentence ends:
"(…) of course, again, well, it is a matter of expenditure. There is also a lot of money for this, mainly because of war veterans. Recently in the States this research (…)"

[B. Ziółko, P. Żelasko, D. Skurzok, "Statistics of diphones and triphones presence on the word boundaries in the Polish language. Applications to ASR", XXII Pacific Voice Conference, 2014]

Phoneme segmentation

[B. Ziółko, S. Manandhar, R. C. Wilson, M. Ziółko, "Phoneme Segmentation Based on Wavelet Spectra Analysis", Archives of Acoustics, 2011, vol. 36, No. 1]

Phoneme segmentation

[Figure: importance matrix (subband × time) and the derived event function]

[M. Ziółko, J. Gałka, B. Ziółko, T. Jadczyk, D. Skurzok, M. Mąsior, "Automatic Speech Recognition System Dedicated for Polish", Interspeech 2011, Florence]

[B. Ziółko, T. Pędzimąż, "A system and a method for providing a dialog with a user" (patent application in USA, Canada, Japan & EPO)]

Audio and image segmentation evaluation

[B. Ziółko, "Fuzzy Precision and Recall Measures for Audio Signals Segmentation", Fuzzy Sets and Systems, 2015, early access]
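Soft boundary scoring of this kind can be sketched as follows. Note that the triangular-membership scheme below is a generic illustration of the idea, not the exact fuzzy measures defined in the paper:

```python
import numpy as np

def soft_precision_recall(detected, reference, tol=0.05):
    """Score each boundary (timestamp in seconds) by a triangular
    membership max(0, 1 - |d - r| / tol) against its nearest
    counterpart, then average: a soft precision/recall pair."""
    def best_match(b, others):
        d = np.min(np.abs(np.asarray(others) - b))
        return max(0.0, 1.0 - d / tol)
    precision = sum(best_match(d, reference) for d in detected) / len(detected)
    recall = sum(best_match(r, detected) for r in reference) / len(reference)
    return precision, recall

# one exact hit, one near miss (20 ms off), one false alarm
p, r = soft_precision_recall([0.10, 0.52, 0.90], [0.10, 0.50, 0.75])
assert abs(p - 1.6 / 3) < 1e-9 and abs(r - 1.6 / 3) < 1e-9
```

Unlike hard precision/recall, a boundary 20 ms off still earns partial credit (0.6 here) instead of counting as a full miss.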

Applications

Automatic dialect and language recognition

Phoneme analysis for a genealogical tree of world languages

http://speechsamples.agh.edu.pl

Speaker verification, identification and profiling

Voice biometrics

[Chart: voice biometrics compared with other biometric methods on convenience vs. price]

Speaker verification

[Diagram: recording → matching against a voiceprint database → YES/NO decision]
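The pipeline (recording, matching against an enrolled voiceprint, YES/NO decision) can be sketched minimally. The toy spectral voiceprint, threshold and synthetic "speakers" below are illustrative stand-ins, not the features or models of a real biometric system:

```python
import numpy as np

def voiceprint(signal, n_fft=256):
    """Toy voiceprint: average magnitude spectrum over frames.
    Real systems use far richer features and statistical models."""
    frames = signal[: len(signal) // n_fft * n_fft].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)).mean(axis=0)

def verify(probe, enrolled, threshold=0.8):
    """Matching step: cosine similarity of voiceprints -> YES/NO."""
    score = probe @ enrolled / (np.linalg.norm(probe) * np.linalg.norm(enrolled))
    return bool(score >= threshold)

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
# two synthetic "speakers" with different fundamental frequencies
speaker_a = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
speaker_b = np.sin(2 * np.pi * 210 * t) + 0.5 * np.sin(2 * np.pi * 630 * t)

enrolled = voiceprint(speaker_a + 0.05 * rng.standard_normal(fs))
accepted = verify(voiceprint(speaker_a + 0.05 * rng.standard_normal(fs)), enrolled)
rejected = verify(voiceprint(speaker_b + 0.05 * rng.standard_normal(fs)), enrolled)
assert accepted and not rejected
```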

System supporting speaker identification in an emergency call center

Expression of emotions - models

Model dedicated to a situational context: emergency call

Playback detection

[J. Gałka, M. Grzywacz, R. Samborski, "Playback attack detection for text-dependent speaker verification over telephone channels", Speech Communication, vol. 67, pp. 143-153]

Playback attack detection

SafeLock

• Cortex-M4
• STM32F407VGT6
• PN/EN standards for access control systems

[J. Gałka, M. Mąsior, M. Salasa, "Voice authentication embedded solution for secured access control", IEEE Transactions on Consumer Electronics, vol. 60, issue 4, pp. 653-661]

Natural language processing

Application of POS taggers in ASR

[A. Pohl, B. Ziółko, "Using Part of Speech N-grams for Improving Automatic Speech Recognition of Polish", 9th International Conference on Machine Learning and Data Mining MLDM 2013, New York]
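The idea of POS n-gram rescoring can be sketched as follows: each ASR hypothesis gets its acoustic score combined with the log-probability of its POS tag sequence under a bigram model. The tag set, probabilities and hypotheses below are invented for illustration:

```python
import math

# Toy POS-bigram language model P(tag_i | tag_{i-1});
# values are invented for illustration only.
pos_bigram = {
    ("DET", "NOUN"): 0.6, ("DET", "ADJ"): 0.3,
    ("ADJ", "NOUN"): 0.7, ("NOUN", "VERB"): 0.5,
    ("VERB", "DET"): 0.4, ("NOUN", "NOUN"): 0.1,
}

def pos_score(tags, floor=1e-4):
    """Log-probability of a POS tag sequence under the bigram model;
    unseen bigrams fall back to a small floor probability."""
    return sum(math.log(pos_bigram.get(bg, floor))
               for bg in zip(tags, tags[1:]))

def rescore(hypotheses, weight=1.0):
    """Combine each hypothesis's acoustic log-score with the weighted
    POS-bigram log-score and pick the best hypothesis."""
    return max(hypotheses,
               key=lambda h: h["acoustic"] + weight * pos_score(h["tags"]))

hyps = [
    {"text": "the the cat", "tags": ["DET", "DET", "NOUN"], "acoustic": -5.0},
    {"text": "the fat cat", "tags": ["DET", "ADJ", "NOUN"], "acoustic": -5.2},
]
best = rescore(hyps)
assert best["text"] == "the fat cat"
```

The acoustically slightly worse hypothesis wins because the implausible DET-DET bigram heavily penalises the first one.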

Results – taggers performance

[A. Pohl, B. Ziółko, "A Comparison of Polish Taggers in the Application for Automatic Speech Recognition", LTC, Poznań, 2013]

Classification of Wikipedia articles

Sign language recognition

http://witkom.info/?lang=en

WITKOM: Virtual sign language translator

Sign language acquisition, motion capture

Sensor glove

Sensor glove signals

[Plot: acceleration vs. sample number for the sign "is fine", before filtration]

Corpora/databases

Acted emotional speech:
• professional and amateur actors, 12 speakers
• read speech: commands, sentences, continuous speech
• ~4 hours of recordings, 44.1 kHz, 16-bit, stereo
• neutral, joy, sadness, anger, fear, surprise + irony

Spontaneous emotional speech (from the Kraków Emergency Call Center):
• ~45 hours of recordings, 3307 recordings
• 8 kHz, 16-bit, mono

[Bar chart: percentage of recordings per emotion - anxiety, sadness, stress, weari…, anger, fury, surprise]

AGH speech corpus

• 55 hours of annotated recordings,
• about 600 speakers,
• various conditions,
• various hardware,
• 16-bit, 16 kHz.

Polish text corpus - over 1.4 billion words

Audiovisual Polish speech corpus - over 3 h

[P. Żelasko, B. Ziółko, T. Jadczyk, D. Skurzok, "AGH Corpus of Polish Speech", Language Resources and Evaluation (IF = 0.922), early access, 2015]

3D sound simulation

Beam tracing algorithm

Diffraction

[B. Ziółko, T. Pędzimąż, Sz. Pałka, I. Gawlik, B. Miga, P. Bugiel, "Real-time 3D Audio Simulation in Video Games with RAYAV", Making Games, vol. 1, 2015]

[B. Miga, B. Ziółko, "Real-time acoustic phenomena modelling for computer games audio engine", Archives of Acoustics, 2015, vol. 2]

Port to Quake

Wave-based Room Acoustic Simulations

• Applications:
  - acoustic prediction for architectural design
  - auralization of virtual rooms
• Among the most accurate sound-field modelling methods (they outperform geometrical methods)
• Computational cost is very high, but real-time operation is possible on GPGPUs

www.agh.edu.pl

Wave-based Room Acoustic Simulations

• All wave-related phenomena (diffraction, occlusion) are included
• Various FDTD schemes (modelling methods) exhibit different dispersion error (an artefact at high frequencies)

[K. Kowalczyk and M. van Walstijn, "Wideband and isotropic 2D room acoustics simulations with interpolated FDTD schemes", IEEE Trans. Audio, Speech, Lang. Process., Vol. 18, No. 1, pp. 78-89, Jan. 2010]
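A minimal example of the kind of update such FDTD schemes perform: the standard leapfrog (SLF) scheme on a 2D grid. Grid size, step, boundary treatment (periodic, for brevity) and source are illustrative choices, not those of the cited interpolated schemes:

```python
import numpy as np

nx = ny = 60
c, dx = 343.0, 0.05                  # speed of sound [m/s], grid step [m]
dt = dx / (c * np.sqrt(2.0))         # 2D Courant stability limit
lam2 = (c * dt / dx) ** 2            # squared Courant number (0.5 here)

p_prev = np.zeros((nx, ny))
p = np.zeros((nx, ny))
p[nx // 2, ny // 2] = 1.0            # impulse source at the centre

for _ in range(40):
    # discrete Laplacian via shifted copies (periodic boundaries)
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
           np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
    # leapfrog update of the pressure field
    p_prev, p = p, 2.0 * p - p_prev + lam2 * lap

# the impulse has radiated away from the source point
assert abs(p[nx // 2, ny // 2]) < 0.5 and np.abs(p).max() > 1e-6
```

The dispersion error mentioned above shows up in exactly this update: high-frequency components travel at slightly wrong speeds depending on direction, which the interpolated schemes of the paper are designed to reduce.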


Speech enhancement


Microphone array

[Diagram: four-microphone array, m1-m4]

Microphone arrays

The filter coefficients minimise the mean-square error

\[
Q(\mathbf{h}) = E\!\left\{\big(\tilde{s}_m(n)\big)^2\right\} - 2\,\mathbf{h}^{T}\mathbf{r}_{corr} + \mathbf{h}^{T}\mathbf{R}_{obs}\,\mathbf{h}
\]

\[
\mathbf{h}_{opt} = \mathbf{R}_{obs}^{-1}\,\mathbf{r}_{corr}
\]

where h_opt is a Wiener filter, R_obs is the autocorrelation matrix of the observed signals and r_corr is the cross-correlation vector.
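The Wiener solution can be verified numerically. A minimal sketch, assuming white observations and a known FIR channel (all signals and names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, order = 20000, 4
h_true = np.array([0.9, -0.4, 0.2, 0.1])          # channel to recover

x = rng.standard_normal(n)                         # observed reference signal
d = np.convolve(x, h_true)[:n] + 0.01 * rng.standard_normal(n)  # desired signal

# Delay-line matrix of observations: column k holds x delayed by k samples.
X = np.column_stack([np.roll(x, k) for k in range(order)])
X[:order] = 0.0                                    # discard wrapped-around samples

R_obs = X.T @ X / n                                # autocorrelation matrix
r_corr = X.T @ d / n                               # cross-correlation vector

h_opt = np.linalg.solve(R_obs, r_corr)             # Wiener filter
assert np.allclose(h_opt, h_true, atol=0.05)
```

Solving the normal equations recovers the channel to within estimation noise, which is exactly what the closed-form expression above promises.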

[Block diagram: band-pass filtered microphone signals s_m1(n) and s_m2(n) feed an LMS adaptive filter; the error e(n) drives the FIR filter coefficients, and low-pass filtering yields the extracted voice signal s_voice(n)]

Dual-Microphone Speech Extraction from Signals with Audio Background

[R. Samborski, M. Ziółko, B. Ziółko, J. Gałka, "Wiener Filtration for Speech Extraction from the Intentionally Corrupted Signals", Proceedings of The IEEE International Symposium on Industrial Electronics (ISIE-2010) , Bari, 2010]
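An LMS-style variant of such dual-microphone interference cancellation can be sketched as follows. Mic 2 observes the audio background (reference) and mic 1 observes speech plus the background filtered by the room; the channel, signals and step size are invented for illustration and this is not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
n, order, mu = 20000, 8, 0.05
fs = 8000

background = rng.standard_normal(n)                 # reference at mic 2
channel = np.array([0.7, -0.3, 0.2, 0.1, 0.05, 0.0, 0.0, 0.0])
speech = 0.5 * np.sin(2 * np.pi * 300.0 * np.arange(n) / fs)
mic1 = speech + np.convolve(background, channel)[:n]

w = np.zeros(order)                                 # adaptive FIR weights
buf = np.zeros(order)                               # reference delay line
out = np.zeros(n)
for t in range(n):
    buf = np.roll(buf, 1)
    buf[0] = background[t]
    e = mic1[t] - w @ buf                           # residual = speech estimate
    w += mu * e * buf / (buf @ buf + 1e-8)          # normalised LMS update
    out[t] = e

# after convergence, the residual tracks the speech far better than mic 1 does
mse_before = np.mean((mic1[-4000:] - speech[-4000:]) ** 2)
mse_after = np.mean((out[-4000:] - speech[-4000:]) ** 2)
assert mse_after < 0.1 * mse_before
```

The adaptive filter converges toward the background channel, so subtracting its output from mic 1 leaves mostly speech, mirroring the "process of adaptation" shown on the next slide.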

85 samples ↔ 63 cm

Process of adaptation
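The sample-to-distance correspondence follows directly from the sampling rate and the speed of sound. Assuming fs = 44.1 kHz and c ≈ 330 m/s (both assumptions, chosen to match the quoted figure):

```python
fs = 44100.0            # assumed sampling rate [Hz]
c = 330.0               # assumed (rounded) speed of sound [m/s]
delay_samples = 85
distance_m = delay_samples / fs * c      # delay converted to path length
assert abs(distance_m - 0.63) < 0.01     # ≈ 63 cm
```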

Virtual Microphones for Speech Enhancement

• Applications: modern hands-free communication (e.g. teleconference systems)
• Aim: capture the signal of a desired (distant) speaker while suppressing interfering speaker signals and noise
• When: positioning the physical microphones near the desired speaker is challenging

Virtual Microphones for Speech Enhancement

• Goal: synthesise a Virtual Microphone (VM) signal at an arbitrary position which sounds perceptually similar to the signal that would be recorded by a real microphone at the same position
• Applied methods: parametric signal processing based on the signals recorded by 2 distant microphone arrays

[K. Kowalczyk et al., "Parametric Spatial Sound Processing", IEEE Signal Process. Magazine, Special Issue on Assisted Listening, pp. 31-42, Vol. 32, No. 2, Mar. 2015]

Virtual Microphones for Speech Enhancement

• Scenario and results:
  - 2-speaker scenario (desired speaker on top)
  - rotating a cardioid/omnidirectional microphone located centrally
  - Signal-to-Interference-plus-Noise Ratio (SINR) as expected from real microphones placed at the same position

[K. Kowalczyk, A. Craciun, E. Habets, "Generating virtual microphone signals in noisy environments", Proc. European Sign. Proc. Conf., Marrakech, Morocco, Sep. 2013]

Speech vs. music discrimination

Speech signals exhibit energy modulation at a frequency of about 4 Hz, a consequence of the average syllable length of about 250 ms.

[S. Kacprzak, M. Ziółko, "Speech/Music Discrimination via Energy Density Analysis", SLSP, Tarragona, 2013]
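The 4 Hz cue can be demonstrated on a synthetic signal. A sketch in which an amplitude-modulated tone (envelope pulsing every ~250 ms) stands in for speech; all parameters are illustrative, not the paper's method:

```python
import numpy as np

fs, dur, frame = 8000, 4.0, 0.025
t = np.arange(int(fs * dur)) / fs
# tone with a syllable-like envelope pulsing at 4 Hz (~250 ms per "syllable")
speechlike = (0.5 * (1.0 + np.sin(2 * np.pi * 4.0 * t))) * np.sin(2 * np.pi * 200.0 * t)

def modulation_peak_hz(x, fs, frame):
    """Dominant frequency of the zero-mean frame-energy sequence."""
    step = int(fs * frame)
    energies = np.array([np.sum(x[i:i + step] ** 2)
                         for i in range(0, len(x) - step + 1, step)])
    energies = energies - energies.mean()           # drop the DC component
    spec = np.abs(np.fft.rfft(energies))
    freqs = np.fft.rfftfreq(len(energies), d=frame)
    return freqs[int(np.argmax(spec))]

peak = modulation_peak_hz(speechlike, fs, frame)
assert abs(peak - 4.0) < 0.5                        # modulation peak near 4 Hz
```

A steady music-like tone would show no comparable peak in its frame-energy spectrum, which is what makes the modulation frequency usable as a discrimination feature.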

Results of speech/music classification

[Scatter plot over log(minimum energy density): the average modulation frequency is lower for speech than for music]

XXIII PVC will be held in Santa Clara, California

Medals and Prizes


http://www.dsp.agh.edu.pl

DSP AGH Group


Our partners