![Page 1: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/1.jpg)
Audio & Speech Technology for Consumer Electronics: Basics and Technical Challenges
ICCE Consumer Electronics Society Webinar
Reinhard MOELLER University of Wuppertal
![Page 2: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/2.jpg)
IEEE Consumer Electronics Society
21.09.17
● Introduction
● Historical Facts
● Mathematical Elements of Speech Technology
● Speech Processing
![Page 3: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/3.jpg)
● Introduction
![Page 4: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/4.jpg)
Introduction
● Humans differentiate between sound and noise
● Sound and noise are the evolutionary basis of communication between humans and their environment
● Humans can feel and hear acoustic information
![Page 5: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/5.jpg)
Principles of Sound
Sound
• travels in waves, produced when an object pushes on the air around it, causing small changes in air pressure
• Properties: frequency, wavelength, period, amplitude, phase and speed
• Can be a single tone or a mixture of several tones with equal or different properties. Examples:
– Music: a mixture of different frequencies and amplitudes
– White noise: mix of frequencies with equal power per unit of bandwidth over a given frequency range; "unwanted" sound, harsh/crisp-sounding noise
– Pink noise: mix of frequencies with equal power per octave, i.e. equal power distribution over a logarithmic frequency scale; "naturally" sounding environmental noise
– Speech
![Page 6: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/6.jpg)
Human Audio „Sensors”: The Ears
The principle of hearing (after H. v. Helmholtz, 1873)
The inner ear is an active sound analyzer
![Page 7: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/7.jpg)
Measurement of Sound
• The sound level heard by human ears is commonly measured in decibels (dB)
• Referring to sound, a decibel value expresses the ratio of two sound powers (intensities) P1 and P2 on a logarithmic scale: 10 log10(P2/P1) dB. For sound pressure amplitudes p1 and p2 the equivalent form is 20 log10(p2/p1) dB
• The decibel is useful because it can represent the wide range of sound levels the human ear can hear on a more manageable scale
• On the decibel scale, the softest sound that can be heard is 0 dB (P2 = P1, with P1 the reference at the threshold of hearing). Every increase of 10 dB represents an approximate doubling of the perceived loudness of the sound
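The decibel formula can be sketched in a few lines of Python. The function names are illustrative, and the 20 µPa reference pressure and the 20·log10 form for pressure amplitudes are standard conventions assumed here rather than taken from the slide.

```python
import math

# Assumed reference: 20 micropascals, the conventional threshold of hearing.
P_REF_PA = 20e-6

def power_ratio_db(p2: float, p1: float) -> float:
    """Level difference in dB between two sound powers or intensities."""
    return 10.0 * math.log10(p2 / p1)

def pressure_ratio_db(a2: float, a1: float) -> float:
    """Level difference in dB between two pressure amplitudes.

    Power is proportional to amplitude squared, hence the factor 20.
    """
    return 20.0 * math.log10(a2 / a1)

def sound_pressure_level_db(pressure_pa: float) -> float:
    """Sound pressure level relative to the assumed 20 uPa reference."""
    return pressure_ratio_db(pressure_pa, P_REF_PA)

# Equal powers give 0 dB; a pressure ratio of 1,000,000:1 gives about 120 dB.
equal_powers_db = power_ratio_db(1.0, 1.0)
wide_range_db = pressure_ratio_db(1_000_000.0, 1.0)
```

Read as a pressure ratio, a dynamic range of 1,000,000:1 therefore spans roughly 120 dB.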
![Page 8: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/8.jpg)
Dynamics of Human Hearing
[Scale: from very soft to extremely loud]
Dynamic range of a bicycle: 7:1
Dynamic range of a human ear: 1,000,000:1
Issues:
● How loud is too loud?
● What about hearing impairment?
![Page 9: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/9.jpg)
Human Audio “Actuator”: Speech and Tone
● Speech production model: source-filter interaction
– Anatomical structure (vocal tract/glottis) conveyed in speech spectrum
Glottal pulses → Vocal tract → Speech signal
![Page 10: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/10.jpg)
● Historical Facts
![Page 11: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/11.jpg)
Pre-History of Audio and Speech Technology
● 1653: Cyrano de Bergerac, "Sonderbare Geschichten der Staaten und Reiche des Mondes" ("Strange Stories of the States and Empires of the Moon")
– ... books are little mechanical boxes like wristwatches ... the reader fits its nerves and listens to the sound ...
● 1786: Baron Münchhausen, "Der Ritt auf der Kanonenkugel und andere Abenteuer" ("The Ride on the Cannonball and Other Adventures")
– ... frozen sound carried in a post horn, melted behind a warm oven, and resounding ...
![Page 12: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/12.jpg)
Pre-History of Audio and Speech Technology
● 1634: Kepler
– "One day we will produce speaking machines, but they will have a snarling tone"
● 1761: Euler
– "It would be one of our most important inventions if we could build a machine able to imitate all sounds of our words with all articulations ... The thing does not seem to be impossible to me"
● 1773: Ch. G. Kratzenstein
– Single vowels using resonance tubes connected to organ pipes
![Page 13: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/13.jpg)
Pre-History of Audio and Speech Technology
● 1791: Wolfgang von Kempelen
– "Mechanism of the human speech and description of a speaking machine"
– The Chess Turk
– detailed construction plans, basis for later reconstructions and improvements
– called "the first phonetician"
● 1824: Johann Maelzel
– speaking doll ("Mama", "Papa")
[Figure: Kempelen's Speaking Machine. Source: Deutsches Museum]
![Page 14: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/14.jpg)
History of consumer audio recording
● 1877: Edison's Phonograph
– Information carrier is a cylinder
– Intended applications:
  ● dictaphone, voice recorder
  ● archive of voices of famous people
– First recorded and replayed word: HELLO
![Page 15: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/15.jpg)
History of consumer audio recording
● 1887: Berliner's Gramophone started the success story of mass music reproduction
– wax-coated zinc plate
● 1892: pressed rubber disc
● 1895: shellac disc
● 1896: Edison's spring-motor-enhanced phonograph
● 1908: double-sided disc
● 1948: PVC
![Page 16: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/16.jpg)
History of consumer audio recording
1898: Piano Roll in mass production
![Page 17: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/17.jpg)
History of consumer audio electronics recording: Music media
● 1930s: magnetizable tapes
● 1983: Digital Audio Tape (DAT)
– originally for consumer use
– professional 8-channel S-VHS since 1993
● 1980: Red Book Standard (Audio CD)
– 44.1 kHz, 16 bit, 74 minutes
● 1990+: DVD-Audio, MiniDisc, iPod, solid-state disc & more
![Page 18: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/18.jpg)
Consumer audio electronics: Development towards spatial audio
● 2-channel stereo: one-dimensional (width of stage)
● 2-channel surround: two dimensions (added depth of room)
● N-channel 3D: added audio tracks for upper frequency bands
● N-channel object-based VR: binaural technology, outside head
● Future: audio AR, e.g. for gaming and navigation
[Figure: immersion increasing from Stereo (2-3 speakers) via Surround (5 to 7 speakers) and 3D (7+ speakers) to Audio VR (7+ speakers); timeline labels: 60's, 70's, ~2016]
![Page 19: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/19.jpg)
Mathematical Elements of Speech Technology
![Page 20: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/20.jpg)
HMI: Dialog and Speech Understanding
“A symbolic description should be calculated from a speech signal, that allows a usable reaction of a system to a verbally expressed user demand in context of a human-machine dialog.”
according to: Sagerer, Automatisches Verstehen gesprochener Sprache, BI-Wiss.-Verl., 1990
![Page 21: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/21.jpg)
Mathematical Elements
● Elements
– Signal, System, Frequency, Amplitude, Phase, Spectrum
– Sampling, Quantisation
● Acoustic Models of Speech Production
– Tube Model
– Source-Filter Model
– Perturbation Model (Formant Shifting)
● Spectral Attributes of Sound Classes
● Spectral Analysis
– Basics
– Windowing
![Page 22: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/22.jpg)
Basics and Terminology
● Signal
– analog (continuous in time and value)
  • modulated signals: amplitude-modulated, frequency-modulated
– digital (discrete in time and value)
● Signal parameters
– Frequency
– Amplitude
– Phase
● Spectrum
![Page 23: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/23.jpg)
Frequency, Amplitude, Phase
● Frequency = 1 / cycle time [Hz]
● Phase = displacement of a wave with respect to a fixed point in time
[Figure: sine wave over time t with cycle time and amplitude marked; a second panel contrasts waves with the same phase and waves with different phase]
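As a minimal illustration of the three parameters, the sketch below generates sampled sine waves that differ only in phase. The function name, the 100 Hz tone, and the 8 kHz sample rate are arbitrary choices for the example.

```python
import math

def sine_wave(freq_hz, amplitude, phase_rad, sample_rate_hz, n_samples):
    """Return n_samples of amplitude * sin(2*pi*f*t + phase), sampled at sample_rate_hz."""
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz + phase_rad)
        for n in range(n_samples)
    ]

freq_hz = 100.0
cycle_time_s = 1.0 / freq_hz  # frequency = 1 / cycle time

# One full cycle (80 samples at 8 kHz) of two waves with different phase.
wave_a = sine_wave(freq_hz, 1.0, 0.0, 8000.0, 80)
wave_b = sine_wave(freq_hz, 1.0, math.pi / 2.0, 8000.0, 80)  # shifted by a quarter cycle
```

`wave_a` starts at zero and reaches its peak a quarter cycle later, while `wave_b` starts at its peak: same frequency and amplitude, different phase.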
![Page 24: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/24.jpg)
Analog to Digital Signal Conversion
● Analog Signal
● Sampling
– Time becomes discrete
● Quantization
– Values become discrete
![Page 25: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/25.jpg)
Sampling
● Nyquist/Shannon sampling theorem
– A band-limited signal is fully reconstructable if f_sample > 2 · f_max
– Otherwise we get aliasing
● Example, speech analysis:
– f_max ≈ 7 kHz
– f_sample = 16 kHz
● Sampling rate:
– Number of samples per second
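Aliasing can be made concrete with the 16 kHz speech example. In the sketch below (function names are illustrative), a 9 kHz tone violates the sampling condition and produces exactly the same samples, up to sign, as a 7 kHz tone.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude sine of the given frequency."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency in [0, fs/2] from which a sampled tone cannot be distinguished."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

FS_HZ = 16000.0
in_band = sample_sine(7000.0, FS_HZ, 32)     # 7 kHz < fs/2: represented correctly
out_of_band = sample_sine(9000.0, FS_HZ, 32)  # 9 kHz > fs/2: folds back to 7 kHz

# The two sample sequences are identical except for an inverted sign.
same_up_to_sign = all(abs(x + y) < 1e-9 for x, y in zip(in_band, out_of_band))
```

This is why an analog low-pass (anti-aliasing) filter is normally placed before the sampler: once sampled, the 9 kHz component can no longer be told apart from a 7 kHz one.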
![Page 26: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/26.jpg)
Quantization
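[Figure: each sampling value is replaced by the mean value of its quantization interval; the difference is the quantization error, and the maximum quantization error is half the interval width]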
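A uniform quantizer of this kind can be sketched as follows. The bit depth, the full-scale range of ±1.0, and the function names are assumptions for the example; each input is mapped to the midpoint of its interval, so the error never exceeds half a step.

```python
import math

def quantize(x, n_bits, full_scale=1.0):
    """Map x in [-full_scale, full_scale) to the midpoint of its quantization interval."""
    levels = 2 ** n_bits
    step = 2.0 * full_scale / levels
    index = math.floor(x / step)
    # Clip to the available interval indices so out-of-range inputs saturate.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step

def max_quantization_error(n_bits, full_scale=1.0):
    """Half the interval width: step / 2 = full_scale / 2**n_bits."""
    return full_scale / (2 ** n_bits)

# With 3 bits there are 8 intervals of width 0.25 and a maximum error of 0.125.
sample_value = 0.3
quantized_value = quantize(sample_value, 3)
quantization_error = abs(quantized_value - sample_value)
```

Every additional bit halves the maximum error, which is why 16-bit audio has a far lower quantization noise floor than, say, 8-bit audio.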
![Page 27: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/27.jpg)
Topics of Speech Acoustics
● Concerned with signal processing and speech communication
● Topics:
– Speech production, vocal tract models
– Speech signal analysis
– Speech perception, speech intelligibility and quality
– Speech and sound coding
– Speech synthesis
– Noise suppression, robust speech-signal processing
– Speech recognition
– Speaker recognition
![Page 28: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/28.jpg)
Speech signal in time and frequency domain
The word "aua" (German for "ouch") in time domain
The word "aua" in frequency domain
![Page 29: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/29.jpg)
Signal Spectrogram vs. Cascade Spectrogram
![Page 30: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/30.jpg)
Spectrogram II
● Wide-band spectrogram
– Shows formants (resonance functions of the vocal tract) = characteristics of the filter
● Narrow-band spectrogram
– Shows harmonics = characteristics of the source
● Synonym: sonagram
![Page 31: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/31.jpg)
„flat“ Spectrogram (Sonagram)
[Figure axes: time and frequency; amplitude shown by density]
![Page 32: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/32.jpg)
Acoustic Models of Speech Production
● Source/Filter Model
● Tube Model
● Perturbation Model (formant shifting)
![Page 33: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/33.jpg)
1) Source/Filter Model
Source (stimulation) → Filter (sound forming) → Speech signal
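The source-filter idea can be sketched with a pulse train as the source and a cascade of two-pole resonators as the filter. Everything numeric here (100 Hz pitch, resonances at 700 Hz and 1200 Hz, 100 Hz bandwidths, 16 kHz sample rate) is an illustrative assumption, not a value from the slides.

```python
import math

def glottal_pulse_train(f0_hz, sample_rate_hz, n_samples):
    """Source: idealized glottal excitation, one unit impulse per pitch period."""
    period = round(sample_rate_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate_hz):
    """Filter: two-pole resonator, a simple stand-in for one vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)  # pole radius < 1, so stable
    theta = 2.0 * math.pi * center_hz / sample_rate_hz       # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

FS_HZ = 16000.0
source = glottal_pulse_train(100.0, FS_HZ, 1600)  # 0.1 s of excitation
speech = resonator(resonator(source, 700.0, 100.0, FS_HZ), 1200.0, 100.0, FS_HZ)
```

Changing the pitch alters only the source, while changing the resonator frequencies alters only the filter, which is the separation the model is meant to express.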
![Page 34: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/34.jpg)
2) Tube Model
● Vocal tract modelled with tube elements of different diameters
[Figure: approximation of the changing cross-section with piecewise homogeneous tubes (tube model), from glottis to lips]
![Page 35: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/35.jpg)
Simplified tube model
● Assumptions:
– The whole vocal tract is a homogeneous tube
– Diameter is much less than length
– Equal diameter over length
– Glottis = total reflector (closed end)
– Lips = open end
● Result:
– resonant wave
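For a uniform tube that is closed at one end and open at the other, the resonances are the odd quarter-wavelength modes, F_k = (2k − 1) · c / (4 · L). The sketch below evaluates this; the tube length of 17 cm and the speed of sound of 340 m/s are typical textbook values assumed here, not figures from the slides.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound_m_s=340.0):
    """Resonance frequencies in Hz of a uniform tube closed at the glottis, open at the lips."""
    return [
        (2 * k - 1) * speed_of_sound_m_s / (4.0 * length_m)
        for k in range(1, n_resonances + 1)
    ]

# Assumed vocal tract length of 17 cm: resonances near 500, 1500 and 2500 Hz.
neutral_vowel_resonances = tube_resonances(0.17)
```

These evenly spaced resonances correspond to a neutral, schwa-like vowel; real vowels arise when the tube is no longer uniform.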
![Page 36: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/36.jpg)
3) Formant shifting model
● Formants are defined by local energy maxima in the spectrum
● The center frequency of such a maximum is defined as the formant frequency
● Independent of the fundamental frequency
● Based on resonance characteristics (size and form) of the articulation tract
● 1st and 2nd formant define vowels
![Page 37: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/37.jpg)
Formant-Shifting (Perturbation Model)
● Raising (+) or lowering (−) of the first three formants by shifting the local constriction of the articulation tract
![Page 38: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/38.jpg)
Sonagrams i, u, a
![Page 39: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/39.jpg)
Speech Recognition
![Page 40: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/40.jpg)
Interdisciplinarity of Speech Technology
● Disciplines: Engineering / Computer Science, Computer Linguistics, Phonetics
● Combined in systems for, e.g.: natural dialog, speech understanding, text-to-speech
● Application field: Consumer Electronics
![Page 41: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/41.jpg)
Typical Tasks in Speech Recognition
Speech signal as input to:
● Speech recognition → words, e.g. "How are you?"
● Language recognition → language name, e.g. English
● Speaker recognition → speaker name, e.g. Glenn Miller
Goal: Automatically extract information transmitted in speech signal
![Page 42: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/42.jpg)
Three Steps of Speech Processing
Reduction of uncertainty, starting from the spoken input (acc. to W. Wahlster, DFKI):
● Speech Recognition: What did the speaker say?
– Knowledge used: acoustic speech analysis, word lists
– Result: 100 alternatives
● Speech Analysis: What does the speaker mean?
– Knowledge used: grammar, word definitions
– Result: 10 alternatives
● Speech Understanding: What is the intent of the speaker?
– Knowledge used: knowledge about topic, dialog partner and context
– Result: unambiguous understanding within the dialog
![Page 43: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/43.jpg)
Speech Recognition: Dependencies
● Environment: noise, acoustics, S/N ratio
● Speaker's state: health, stress, gender ...
● Speaker's literacy: language, amount of words
● Software: system, dynamics, algorithm, error handling
● Use case: translation, user-device dialog, robotics
● Hardware: microphones, speakers
● Dialog architecture: software design
● Training
![Page 44: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/44.jpg)
Noise contamination of speech
Noise
● Environmental
– Continuous, e.g.: air conditioner, motors, fans, continuous conversation
– Transient, e.g.: phone, vocal/conversational, alarms
● Personal
– Related to breathing, e.g.: body motion, respiratory infections, distorted respiration
– Not related to breathing, e.g.: indoor/outdoor movement, clothes, joint crackles
![Page 45: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/45.jpg)
Process Chain in Speech Processing
● Speech Recognition: acoustic wave → possible phonemes → possible words → possible sentences
● Speech Analysis: possible sentences → grammar structure → word meaning → phrase/sentence meaning
● Speech Understanding and Translation: sentence meaning → discourse meaning in source language → phrase choice in target language
● Generation and Synthesis: discourse meaning in target language → phrase choice in target language → sentence production → speech synthesis, prosody generation
![Page 46: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/46.jpg)
Remember: Technical Evaluation of a Speech Signal
● Speech is a continuous evolution of the vocal tract
– Need to extract time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift
[Figure: each windowed frame → Fourier transform → magnitude]
• Produces time-frequency evolution of the spectrum
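The sliding-window analysis can be sketched in plain Python as below. The 20 ms window and 10 ms shift follow the slide; the Hamming window, the direct (slow) DFT, the function names, and the 1 kHz test tone are choices made for this example only.

```python
import cmath
import math

def frame_signal(signal, sample_rate_hz, window_s=0.020, shift_s=0.010):
    """Cut the signal into overlapping frames (20 ms window, 10 ms shift by default)."""
    win = int(round(window_s * sample_rate_hz))
    hop = int(round(shift_s * sample_rate_hz))
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]

def hamming(n):
    """Hamming window of length n, used to taper each frame."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitude(frame):
    """Magnitude of the DFT for bins 0 .. n/2 (direct O(n^2) evaluation, fine for a demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, sample_rate_hz):
    """Time series of magnitude spectra: one spectrum per frame."""
    frames = frame_signal(signal, sample_rate_hz)
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [dft_magnitude([x * w for x, w in zip(f, window)]) for f in frames]

# Example: 0.1 s of a 1 kHz tone sampled at 8 kHz.
FS_HZ = 8000.0
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS_HZ) for n in range(800)]
spec = spectrogram(tone, FS_HZ)
```

With these settings each frame holds 160 samples, the bin spacing is 50 Hz, and the tone shows up as a peak in bin 20 of every frame. A production system would use an FFT routine instead of the direct sum.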
![Page 47: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/47.jpg)
Sonagram
[Figure: narrow-band sonagram and broad-band sonagram of the same utterance; axes: time (s) and frequency; voiced segments and formants are marked]
![Page 48: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/48.jpg)
Segmentability of Sonagrams: Phonemes
![Page 49: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/49.jpg)
Speech Recognition: Problems
acc. to W. Wahlster, DFKI
● "Calligraphy"
● Spontaneous speech
● Nonlinear time distortion
● Channel distortion
● "Cocktail party effect"
● Co-articulation
● Variation in speech (slang)
● No break between words
● Punctuation? Capitalization?
Example utterances:
– A very good morning Mrs. Lennard. How is the state of your actual workplan?
– Hi Jane, what's up with your plans?
– Hi Jane what's up with your plans
– HiJanewhatsupwithyourplans
– Uh Jaine, whatss up with ya plan
![Page 50: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/50.jpg)
Speech Recognition: Variety of Signals
"Ich habe einen Termin um 17 Uhr 30" ("I have an appointment at 5:30 pm")
![Page 51: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/51.jpg)
Speech Recognition: Word Hypothesis Graph
"It's hard to recognize speech"
U Washington, CS
![Page 52: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/52.jpg)
Application to Consumer Electronics Dialog Systems
[Figure: dialog systems plotted by system complexity (low to very high) against size of vocabulary (small to very large); entries: Standard IVR Systems, Command & Control, Telephone Dialogs, Dictation, "Star Trek Dialogs"]
![Page 53: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/53.jpg)
Characteristics of speech processing systems
Training efforts
● Speaker-dependent:
– high training efforts
– limited group of users
– highly individual and sensitive to small changes
● Speaker-independent:
– no training, robust
– small word capacity
● Speaker-adaptive:
– learning system
– instant improvement of recognition
Input types
• Single-word recognition:
– recognition of isolated spoken words
• Discrete recognition:
– short breaks between words
• Continuous recognition:
– no break between words
• Spontaneous recognition:
– speech with or without delays
– interrupted words
![Page 54: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/54.jpg)
Questions?
![Page 55: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/55.jpg)
Speaker Recognition
![Page 56: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/56.jpg)
Features for Speaker Recognition
• Speech is a continuous evolution of the vocal tract
– Need to extract time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift
[Figure: ... Fourier Transform, Magnitude]
• Produces time-frequency evolution of the spectrum
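The sliding-window analysis can be sketched as follows. This is a minimal illustration, not the method used in the slides: the function names, the Hamming window, the 8 kHz sample rate, and the 1 kHz test tone are all assumptions, and a naive DFT stands in for the FFT a real system would use so that the example needs only the standard library.

```python
import cmath
import math

def frame_signal(samples, sample_rate, window_ms=20, shift_ms=10):
    """Split samples into overlapping frames (20 ms window, 10 ms shift)."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * shift_ms / 1000)
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, hop)]

def magnitude_spectrum(frame):
    """Apply a Hamming window, then return |DFT| for bins 0..N/2 (naive O(N^2))."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, sample_rate):
    """Time series of magnitude spectra: one row per 10 ms shift."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, sample_rate)]

# Assumed test input: 100 ms of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
spec = spectrogram(tone, sample_rate)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
peak_hz = peak_bin * sample_rate / 160  # 160 samples per 20 ms frame
```

Each row of `spec` is the spectrum of one 20 ms frame, so the rows read in order give the time-frequency evolution. For the test tone, `peak_hz` lands on the tone frequency in every frame.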
![Page 57: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/57.jpg)
General Theory - Speaker Models
● Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form
[Figure labels: h-a-d]
• Modern speaker verification systems use Hidden Markov Models (HMMs)
– HMMs are statistical models of how a speaker produces sounds
– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.
– Fast training algorithms (EM) exist for HMMs, with guaranteed convergence to a local maximum of the likelihood.
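To make the "states plus transitions" picture concrete, the sketch below scores an observation sequence against a toy discrete-emission HMM using the forward algorithm. The forward algorithm is not named in the slides, and the two-state left-to-right model and all its probabilities are hypothetical values chosen for illustration only.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete-emission HMM (forward algorithm).

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits symbol o
    """
    n_states = len(start)
    # alpha[j] = P(observations so far, current state = j)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [emit[j][o] * sum(alpha[i] * trans[i][j] for i in range(n_states))
                 for j in range(n_states)]
    return sum(alpha)

# Hypothetical two-state left-to-right model with two observation symbols.
start = [1.0, 0.0]
trans = [[0.5, 0.5],
         [0.0, 1.0]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]

likelihood = forward_likelihood([0, 1], start, trans, emit)
```

In a verification setting, a score of this kind computed under the claimed speaker's model could be compared with a threshold or with the score under a background model. That decision rule is an assumption here, not something the slides specify.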
![Page 58: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/58.jpg)
Neural network-based speech recognition
Another approach to acoustic modeling is the use of neural networks. They can solve much more complicated recognition tasks, but do not scale as well as HMMs to large vocabularies. Rather than serving general-purpose speech recognition applications, they are used where low-quality, noisy data and speaker independence must be handled. Such systems can achieve greater accuracy than HMM-based systems, as long as training data is available and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, and the reported results are generally better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.
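As a rough illustration of the phoneme-recognition step, the sketch below maps one feature vector to phoneme probabilities with a single-layer softmax network (no hidden layer). The phoneme set, the two-dimensional features, and the hand-set weights are hypothetical; a real system would learn its weights from training data and use a much larger network.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phoneme_posteriors(features, weights, biases):
    """One linear layer followed by softmax: P(phoneme | feature frame)."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical phoneme set and hand-set parameters (not trained).
PHONEMES = ["h", "a", "d"]
weights = [[2.0, 0.0],
           [0.0, 2.0],
           [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

posteriors = phoneme_posteriors([1.0, 0.0], weights, biases)
best = PHONEMES[posteriors.index(max(posteriors))]
```

In a hybrid system, per-frame probabilities of this kind would be handed to the HMM part, which models how the frames are sequenced.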
![Page 59: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/59.jpg)
Following: Part II
Applications
![Page 60: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/60.jpg)
Psychoacoustics
University of Surrey, UK
![Page 61: Audio & Speech Technology for Consumer ElectronicsIEEE Consumer Electronics Society 21.09.17 11 Pre-History of Audio and Speech Technology 1653: Cyrano de Bergerac „Sonderbare Geschichten](https://reader036.vdocuments.mx/reader036/viewer/2022071419/61185b55b9d04f4063790d5e/html5/thumbnails/61.jpg)
Voiceprint as a Biometric
• Biometric: a human-generated signal or attribute for authenticating a person’s identity
• Voice is a popular biometric:
– natural signal to produce
– ubiquitous: telephones, microphone-equipped PCs
• Voice biometric combined with other forms of security
– Something we have - e.g., badge
– Something we are - e.g., voice
– Something we know - e.g., password
[Figure: Have, Know, Are combined: strongest security]