TRANSCRIPT
Speech Processing Overview & Supporting Concepts
Professor Marie Roch
Readings are listed in the schedule
What is speech processing?
• Specialized branch of machine learning
• Interdisciplinary
– signal processing
– statistics
– linear algebra
– linguistics
– optimization
– perception & production
2
[Figure: surface plot of a joint probability density P(f1, f2) over two features f1 and f2]
Basic ideas
• Acquire an audio signal
• Extract features
• Classify features to labels depending on goals, e.g.
– speech recognition: speech to text
– speaker recognition: identify a person
– command and control: recognize directives or keywords
3
Rosie the robot, The Jetsons © Hanna-Barbera
Probability
• A measure of uncertainty.
• Sources of uncertainty:
– Process is stochastic (nondeterministic).
– We cannot observe all aspects of process.
– Incomplete modeling of process.
4
Probability
Two ways to think about probability:
• frequentist – related to the proportion of events occurring
• Bayesian – related to degree of belief
Both types of probability are treated using the same rules.
5
Random variables
• Variables that can be bound to values non-deterministically.
• Common to use notation to distinguish between*
– a random variable X (capitalized)
– and an observed value x (lower case, e.g. x=5)
6
* Goodfellow et al. use bold in place of capitalization.
Discrete random variables
• Values from a discrete set of labels, e.g.
$X \in \{\text{red, orange, yellow, blue, indigo, violet}\}, \quad Y \in \{0, 1, 2\}$
• Probability density (mass) function (pdf/pmf) is the probability of something occurring, P(X=x)*, e.g.
$P(X=\text{red}) = 0.8, \quad P(Y=2) = 1/3$
• A pmf satisfies
$\forall x \in \operatorname{domain}(X),\; 0 \le P(X=x) \le 1 \quad \text{and} \quad \sum_{x \in \operatorname{domain}(X)} P(X=x) = 1$
7
* Goodfellow et al. use PMF for discrete distributions. P(X=x) is commonly abbreviated to P(x).
Distribution
• Set of probabilities associated with the values of a random variable.
• Uniform distribution – Special distribution where all probabilities are equal.
8
Example
• Let X be a die roll
– P(X=3) denotes the probability of rolling a 3
– Fair die: P(X=3) = 1/6
• If we don’t know if the die is fair, how would we approximate P(X=x)?
• Estimates of probability are denoted with a hat: $\hat{P}(X=x)$
9
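To make this concrete, here is a brief sketch (not part of the slides) of approximating $\hat{P}(X=x)$ for a possibly unfair die by counting outcomes; the bias weights below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10,000 rolls of a possibly unfair die (weights invented for illustration).
faces = np.arange(1, 7)
true_probs = np.array([0.10, 0.15, 0.15, 0.15, 0.20, 0.25])
rolls = rng.choice(faces, size=10_000, p=true_probs)

# Estimate P(X = x) as the proportion of rolls showing each face.
counts = np.bincount(rolls, minlength=7)[1:]   # counts for faces 1..6
p_hat = counts / counts.sum()

for face, p in zip(faces, p_hat):
    print(f"P_hat(X={face}) = {p:.3f}")
```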
Continuous random variables
• Values from a continuous domain
• Satisfies the following:
$\forall x \in \operatorname{domain}(X),\; 0 \le P(X=x)$
$P(a \le X \le b) = \int_a^b P(X=x)\,dx$
$\int_x P(X=x)\,dx = 1$
• Worth noting:
– no max on P(X=x)
– What is $P(a \le X \le a) = \int_a^a P(X=x)\,dx$?
10
Continuous uniform distribution
Notation: ~ means “has a distribution of.”
U(a,b) is a uniform distribution between values a and b
11
$X \sim U(a,b)$ implies
$P(X=x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & x < a \text{ or } x > b \end{cases}$
Expectation
discrete: $E[u(X)] = \sum_x u(x)\,P(x)$
continuous: $E[u(X)] = \int_x u(x)\,P(x)\,dx$
When u(X) = X, we call this the mean, or average value: $\mu = E[X]$
12
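A tiny illustration (mine, not from the slides) of the discrete expectation: for a fair die with u(X) = X this is the familiar mean of 3.5.

```python
import numpy as np

x = np.arange(1, 7)               # die faces
p = np.full(6, 1 / 6)             # uniform pmf for a fair die

mean = np.sum(x * p)              # E[X], i.e. u(X) = X
second_moment = np.sum(x**2 * p)  # E[X^2], i.e. u(X) = X^2
print(mean, second_moment)        # 3.5 and about 15.17
```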
Sample mean
• What if we don’t know P(X=x)?
• If we roll a die many times, we can estimate P(X=x) by counting the number of times each x occurs and dividing by the number of rolls.
• To think about: How does this relate to our traditional idea of an average?
13
The sample mean estimate is denoted $\hat{\mu}$.
Variance
• Answers the question: What is the expected squared deviation from the mean?
• Can be defined with expectation operator
• Sample variance (unbiased)
14
$E[(X-\mu)^2] = \sum_x (x-\mu)^2\, P(X=x)$
$\hat{E}[(X-\mu)^2] = \dfrac{1}{N-1} \sum_x (x - \hat{\mu})^2$
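A short sketch (not from the slides) of computing the sample mean and unbiased sample variance; note that numpy's default variance uses 1/N unless ddof=1 is passed.

```python
import numpy as np

x = np.array([4.2, 3.9, 5.1, 4.7, 4.4, 5.0])    # made-up observations

n = len(x)
mu_hat = x.sum() / n                             # sample mean
var_hat = ((x - mu_hat) ** 2).sum() / (n - 1)    # unbiased sample variance

# numpy equivalents: ddof=1 selects the 1/(N-1) normalization
assert np.isclose(mu_hat, np.mean(x))
assert np.isclose(var_hat, np.var(x, ddof=1))
print(mu_hat, var_hat)
```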
Normal distribution
• Normal or Gaussian distributions are the so-called “bell curve” distributions
• Parameters: mean & variance:
15
$P(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
$X \sim n(\mu, \sigma^2)$
Normal distribution
• Mean controls the center, variance the breadth of the distribution.
• A normal distribution can be fit with sample mean & variance
16
[Figure: pdfs of $X \sim n(0,1)$ (the standard normal), $X \sim n(2,1)$, and $X \sim n(0, 0.04)$]
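A brief sketch (illustrative only) of evaluating the normal pdf above and fitting one with the sample mean and variance; scipy.stats.norm takes the standard deviation, not the variance.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-3, 3, 7)
print(normal_pdf(xs, 0, 1))                       # standard normal n(0, 1)
print(norm.pdf(xs, loc=2, scale=1))               # n(2, 1) via scipy (scale = std dev)
print(norm.pdf(xs, loc=0, scale=np.sqrt(0.04)))   # n(0, 0.04)

# Fit a normal to data using the sample mean and unbiased sample variance.
data = np.array([-0.2, 0.1, 0.4, -0.5, 0.3, 0.0])
mu_hat, var_hat = data.mean(), data.var(ddof=1)
print(normal_pdf(0.0, mu_hat, var_hat))
```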
Application: Speech/noise segmentation
How can we determine if someone is talking?
1. Record: Acquire pressure signal
2. Frame: Break into pieces
3. Compute features: root-mean-square intensity
4. Train: Fit speech/noise distribution models
5. Classify: See which distribution matches new samples
17
Acquisition: pressure sensors
Basic ideas for microphones:
• Deformable membrane
– pushed by compression
– pulled by rarefaction
• Deformation is converted to voltage
• Samples: voltage measured N times/second and discretized
– called sample rate (Fs)
– measured in Hertz (Hz)
18
Pressure and Intensity
• Pressure
– Amount of force per area
– Typically measured in Pascals (Newtons per square meter) if the microphone is calibrated, otherwise in counts
• Intensity
– Product of sound pressure and particle velocity per area
19
$\text{Intensity} \propto \text{Pressure}^2$
$\text{Pa} = \dfrac{\text{N}}{\text{m}^2}$
Framing
• Most audio signals vary over time
• Analyze small segments that are roughly static
20
chimpanzee pant hoot – courtesy Cat Hobaiter
and so on…
21
decibel (dB) scale
• Human auditory sensitivity to pressure: 20 µPa – 200 Pa (a $10^7$ range!)
• The decibel scale allows us to compare the intensity between two sounds in a compressed range:
– Intensity: $10 \log_{10} \dfrac{I}{I_0}$
– or, equivalently, with pressure: $10 \log_{10} \dfrac{P^2}{P_0^2} = 20 \log_{10} \dfrac{P}{P_0}$
$I_0, P_0$ are reference intensity/pressure
The bel is named after Alexander Graham Bell; 10 decibels = 1 bel.
image credit: biography.com
22
What value of P0?
Rabiner/Juang 1993
calibrated terrestrial systems: $P_0 = 20\,\mu$Pa (threshold of hearing), denoted dB re 20 µPa
uncalibrated systems: $P_0 = 1$ (for convenience), denoted dB rel.
Computing intensity
• Computed for each frame
• Root-mean-square (RMS) pressure
– squared
– averaged
– square root
• Pressure converted to intensity in dB:
23
$p_{RMS} = \sqrt{\dfrac{1}{|\text{frame}|} \sum_{i \in \text{frame}} x_i^2}$
$20 \log_{10}(p_{RMS}) = 10 \log_{10} \dfrac{1}{|\text{frame}|} \sum_{i \in \text{frame}} x_i^2 \;\; \text{dB}$
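Here is a rough sketch (my own, not from the slides) of steps 2–3 of the application: slice a signal into frames and compute each frame's RMS intensity in dB rel. ($P_0 = 1$). The 20 ms frame length and 10 ms hop are arbitrary choices for the example.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice signal x into overlapping frames of frame_len samples, advancing by hop."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def rms_db(frames, eps=1e-12):
    """Per-frame intensity in dB rel. (reference pressure P0 = 1)."""
    mean_square = np.mean(frames.astype(float) ** 2, axis=1)
    return 10.0 * np.log10(mean_square + eps)    # equals 20*log10(p_RMS)

# Example with synthetic "audio": 1 s of low-level noise at Fs = 16 kHz.
Fs = 16_000
x = 0.01 * np.random.default_rng(0).standard_normal(Fs)
frames = frame_signal(x, frame_len=int(0.020 * Fs), hop=int(0.010 * Fs))
intensity_db = rms_db(frames)
print(intensity_db[:5])
```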
Our first speech problem
• Given a recording, separate speech from constant background noise
24
[Figure: waveform of the recording; x-axis Time (s), y-axis Amplitude (counts)]
Relative Intensity
25
[Figure: relative intensity over time; x-axis Time (s), y-axis dB rel.]
Intensity histogram
26
[Figure: histogram of frame intensities; x-axis dB rel., y-axis Counts]
So how can we characterize speech and silence?
27
[Figure: intensity histogram repeated; x-axis dB rel., y-axis Counts]
Silence estimator
• Beginning of recording vs.
• Entire recording
28
[Figure: intensity histograms computed from the beginning of the recording vs. the entire recording; x-axis dB rel., y-axis Counts]
Normal fit of silence
29
[Figure: intensity histogram with the fitted silence pdf P(X|sil) overlaid; x-axis dB rel.]
$P(X=x \mid sil) = \dfrac{1}{\sqrt{2\pi\sigma_{sil}^2}}\, e^{-\frac{(x-\mu_{sil})^2}{2\sigma_{sil}^2}}$
Similar for speech 1.3 – 2.8 s
30
[Figure: intensity histogram for the speech segment with the fitted speech pdf P(X|speech) overlaid; x-axis dB rel.]
Intensity distribution with pdfs
31
[Figure: intensity histogram with both fitted pdfs, P(X|sil) and P(X|speech), overlaid; x-axis dB rel.]
Where should we draw the boundary?
Conditional probability
The probability of something occurring can change when you know something…
32
$P(X = \text{“nice to meet you”})$
vs.
$P(X = \text{“nice to meet you”} \mid \text{meeting new person})$
Conditional probability
The probability of something occurring can change when you know something…
33
$P(X = \text{“nice to meet you”})$
vs.
$P(X = \text{“nice to meet you”} \mid \text{meeting new person})$
Formally, P(X|Y) is defined as: $P(X \mid Y) \triangleq \dfrac{P(X,Y)}{P(Y)}$
A problem…
Our intensity pdfs showed $P(X \mid speech)$ and $P(X \mid noise)$,
but we would like to know $P(speech \mid X)$ and $P(noise \mid X)$,
which is known as the posterior distribution.
34
[Figure: intensity histogram with the two class-conditional pdfs overlaid; x-axis dB rel.]
Bayes’ decision rule
The optimal classification rule is the one that maximizes the posterior probability:
$\arg\max_{\omega \in \{speech,\, noise\}} P(\omega \mid X)$
note: arg max returns the argument (ω) rather than the value
Regrettably, we do not know $P(\omega \mid X)$, but we do know $P(X \mid \omega)$.
35
Reverend Thomas Bayes, 1702–1761
Bayes’ rule (of conditional probability)
36
$P(A \mid B) = \dfrac{P(A,B)}{P(B)} \;\triangleq\;$ conditional probability
$P(A,B)$: probability that both A and B occur
Note that swapping the names A and B gives
$P(B \mid A) = \dfrac{P(B,A)}{P(A)} \;\rightarrow\; P(B,A) = P(B \mid A)\,P(A)$
and $P(A,B) = P(B,A)$
$\therefore\; P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
(also known as Bayes’ Theorem / Bayes’ Law)
Bayes’ Rule ≠ Bayes’ Decision Rule
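A quick numeric sanity check of the identity (the joint and marginal probabilities here are invented):

```python
# Invented probabilities for two events A and B.
P_A_and_B = 0.12    # P(A, B)
P_A = 0.30
P_B = 0.40

P_A_given_B = P_A_and_B / P_B    # definition of conditional probability
P_B_given_A = P_A_and_B / P_A

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
assert abs(P_A_given_B - P_B_given_A * P_A / P_B) < 1e-12
print(P_A_given_B)               # 0.3
```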
Bayes’ decision rule
Bayes’ rule to the rescue!
37
$\arg\max_{\omega} P(\omega \mid X) = \arg\max_{\omega} \dfrac{P(X \mid \omega)\, P(\omega)}{P(X)}$
posterior probability: $P(\omega \mid X)$
class-conditional probability: $P(X \mid \omega)$
prior probability: $P(\omega)$
observation probability: $P(X)$
almost there…
We know P(X|ω). We don’t need to know P(X) as
38
$P(\omega \mid X) = \dfrac{P(X \mid \omega)\, P(\omega)}{P(X)}$
$\max\!\left(\dfrac{P(X \mid speech)\,P(speech)}{P(X)},\; \dfrac{P(X \mid noise)\,P(noise)}{P(X)}\right)$ selects the same class as
$\max\!\left(P(X \mid speech)\,P(speech),\; P(X \mid noise)\,P(noise)\right)$
Prior probability
• Probability of a class before any observation is considered.
• When we don’t know this, we frequently assume a uniform (equally likely) distribution:
39
$\max\!\left(\tfrac{1}{2} P(X \mid speech),\; \tfrac{1}{2} P(X \mid noise)\right)$
Bayes decision rule revisited
Assuming a uniform prior, we can now make decisions about speech or noise:
How can we solve for the boundary threshold?
40
$\text{decision}(x) = \arg\max_{\omega \in \{speech,\, noise\}} \dfrac{1}{2} \cdot \dfrac{1}{\sqrt{2\pi\sigma_\omega^2}}\, e^{-\frac{(x-\mu_\omega)^2}{2\sigma_\omega^2}}$
Threshold for our minimal speech detector
Solve for x (2 roots):
41
$\dfrac{1}{\sqrt{2\pi\sigma_{noise}^2}}\, e^{-\frac{(x-\mu_{noise})^2}{2\sigma_{noise}^2}} = \dfrac{1}{\sqrt{2\pi\sigma_{speech}^2}}\, e^{-\frac{(x-\mu_{speech})^2}{2\sigma_{speech}^2}}$
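A sketch (mine, under the uniform-prior assumption above) of finding the two roots numerically: taking logs of both sides turns the equality into a quadratic in x. The means and variances below are made up for illustration.

```python
import numpy as np

def gaussian_boundary(mu_n, var_n, mu_s, var_s):
    """Return the x values where two Gaussian pdfs are equal (up to 2 roots).

    Equating log densities gives a quadratic a*x^2 + b*x + c = 0
    (assumes var_n != var_s; otherwise there is a single root).
    """
    a = 1.0 / (2 * var_s) - 1.0 / (2 * var_n)
    b = mu_n / var_n - mu_s / var_s
    c = (mu_s ** 2) / (2 * var_s) - (mu_n ** 2) / (2 * var_n) + 0.5 * np.log(var_s / var_n)
    return np.roots([a, b, c])

# Made-up dB rel. statistics: noise near -65 dB, speech near -45 dB.
roots = gaussian_boundary(mu_n=-65.0, var_n=4.0, mu_s=-45.0, var_s=25.0)
threshold = roots[(roots > -65.0) & (roots < -45.0)]   # keep the root between the means
print(roots, threshold)
```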
Result
42
[Figure: frame intensity (dB rel.) vs. Time (s), with each frame labeled as speech or noise]
Caveat: this example is for pedagogical purposes; we can do better than this…
A better way
• This required us to have labels for the speech and noise
• Remember that the intensity histogram had two modes:
• Can we learn these automatically?
43
[Figure: bimodal intensity histogram; x-axis dB rel., y-axis Counts]
A better way
• Unsupervised model with unlabeled data
• Mixture models
– Learn to fit complicated distributions with simpler parametric ones
– We do not require labels
44
45
Multidimensional Gaussians
Maximum likelihood estimators: sample mean & variance
Note: We do not need d>1 for this problem, but let’s start thinking about higher dimensions
$f_i(x) = \dfrac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}$
d → # dimensions, T → transpose, |·| → determinant, $\mu_i \to E[X]$, $\Sigma_i$ → variance-covariance matrix
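A small sketch (not from the slides) of the maximum likelihood estimates and the density above for d = 2, with scipy.stats.multivariate_normal as a cross-check.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

# Maximum likelihood estimates: sample mean and (1/N) sample covariance.
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

x0 = np.array([0.5, -1.5])
print(mvn_pdf(x0, mu_hat, Sigma_hat))
print(multivariate_normal(mean=mu_hat, cov=Sigma_hat).pdf(x0))   # should agree
```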
Gaussian Intuitions: Size of Σ
• Three examples: µ = [0 0] in each case, with Σ = I, Σ = 0.6I, and Σ = 2I
• As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed
Text and figures from Andrew Ng’s lecture notes for CS229 (via Jurafsky & Martin)
46
Speech and Language Processing, Jurafsky and Martin
47
GMM: Parameters
• Parameters for each of K mixtures k = 1…K:
– μk: mean
– Σk: variance-covariance matrix
– ck: mixture weight (importance), where $\sum_{k=1}^{K} c_k = 1$
– Use Φk to denote (ck, μk, Σk) and Φ to denote the entire set of parameters
48
Sample 16-mixture model based upon $\mathbb{R}^2$ (two-dimensional) cepstral speech data
[Figure: left, surface plot of the mixture pdf; right, equal-likelihood contour lines]
Mixtures have different weights, means, & covariance matrices
49
GMM: Evaluation probability
• Probability of an observation:
$P(\vec{x} \mid \Phi) = \sum_{k=1}^{K} c_k\, N_k(\vec{x} \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} c_k\, \dfrac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\mu_k)^t \Sigma_k^{-1} (\vec{x}-\mu_k)}$
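A sketch (mine) of evaluating the mixture likelihood above by summing weighted component densities; the weights, means, and covariances are invented for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """P(x | Phi) = sum_k c_k N_k(x | mu_k, Sigma_k)."""
    return sum(c * multivariate_normal(mean=m, cov=S).pdf(x)
               for c, m, S in zip(weights, means, covs))

# Invented 2-component, 2-dimensional mixture (weights must sum to 1).
weights = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]

print(gmm_pdf(np.array([1.0, 0.5]), weights, means, covs))
```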
Learning GMM parameters (overview only)
• Given training data, and a set of parameters, we would like to maximize likelihood of the data.
• If the data are independent, select parameters to maximize:
$P(x_1, x_2, \ldots, x_T \mid \Phi) = \prod_{i=1}^{T} P(x_i \mid \Phi) = \prod_{i=1}^{T} \sum_{k=1}^{K} c_k\, \dfrac{1}{(2\pi)^{d/2}\, |\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)}$
50
Learning GMM parameters
• If we knew which mixture generated each observation, we could use the derivative to select model parameters that maximize the likelihood:
$P(x_1, x_2, \ldots, x_T \mid \Phi)$
• We do not know this…
51
EM algorithm
• Suppose we have an estimate, model Φ
• Use the E[·] operator to determine how likely each mixture is to have generated an observation.
• Now we can maximize the likelihood.
• This is called the expectation-maximization algorithm.
52
EM algorithm
Basic ideas
• Use expectation and an extant model to estimate what you do not know
• Use maximization to come up with a new model
• Repeat
Convergence guaranteed under minimal assumptions… but no guarantee on rate
Old Faithful Data
54
Wikipedia, EM algorithm
slide courtesy Kaitlin Palmer
Mixture models
• We do not have the time for the details of implementation
• For those interested, see:
– Moon, T. K. (1996). “The expectation-maximization algorithm,” IEEE Signal Proc. Mag. 13(6), 47–60.
or
– Huang, X., Acero, A., and Hon, H. W. (2001). Spoken Language Processing (Prentice Hall PTR, Upper Saddle River, NJ)
• For our purposes, we will use an implementation from scikit-learn.
55
GMM estimation
• Our training data are the RMS intensity values
• We use two mixtures
• When predicting, we look at the likelihood of each mixture and select the higher one.
• How do we know which mixture represents noise and which mixture represents speech?
56
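Since the slides say a scikit-learn implementation will be used, a minimal sketch of the idea might look like the following. The per-frame intensities are synthesized here, and labeling the lower-mean component as noise is an assumption that answers the question above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for per-frame RMS intensities in dB rel.:
# a quiet (noise) mode near -65 dB and a louder (speech) mode near -45 dB.
rng = np.random.default_rng(0)
intensity_db = np.concatenate([rng.normal(-65, 2, 300), rng.normal(-45, 5, 200)])
X = intensity_db.reshape(-1, 1)        # sklearn expects shape (n_samples, n_features)

# Unsupervised fit of a 2-component mixture via EM.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Assumption: the component with the lower mean intensity is noise, the other speech.
noise_idx = int(np.argmin(gmm.means_.ravel()))
labels = np.where(gmm.predict(X) == noise_idx, "noise", "speech")
print(gmm.means_.ravel())
print(labels[:10])
```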