Recognizer Issues


Upload: perdy

Post on 23-Feb-2016


TRANSCRIPT

Page 1: Recognizer Issues

Recognizer Issues

• Problem: Our recognizer translates the audio to a possible string of text. How do we know the translation is correct?

• Problem: How do we handle a string of text containing words that are not in the dictionary?

• Problem: How do we handle strings with valid words, but which do not form sentences with semantics that make sense?

Page 2: Recognizer Issues

Correcting Recognizer Ambiguities

• Problem: Resolving words not in the dictionary
• Question: How different is a recognized word from those that are in the dictionary?
• Solution: Count the single-step transformations necessary to convert one word into another.
• Example: caat → cat with removal of one letter
• Example: flpc → fireplace requires adding the letters ire after f, a before c, and e at the end

Page 3: Recognizer Issues

Spelling Error Types

• Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another
• Examples: differences from the word "cat"
  – Insertion: catt
  – Deletion: ct
  – Substitution: car
  – Transposition: cta

Note: There are similar well-defined rules for pronunciation variations

Page 4: Recognizer Issues

Spell Check Algorithm

FOR unrecognized words or those out of context
    Generate a list of candidates
      – Those that differ by a one-step transformation
      – Those that exist in the lexicon
    Order possibilities using language-based statistics
RETURN most likely word

Note: We could extend the algorithm to consider multiple transformations when we cannot find a single step solution
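The loop above can be sketched as runnable code. This is a minimal illustration, assuming a tiny hypothetical lexicon with made-up frequency counts standing in for real language-based statistics:

```java
import java.util.*;

class SpellCandidates {
    // Hypothetical lexicon; the counts are illustrative, not corpus statistics.
    static final Map<String, Integer> LEXICON =
            Map.of("cat", 90, "car", 50, "cart", 10, "hat", 40);

    // Generate every string one insertion, deletion, or substitution away from w.
    static Set<String> oneStep(String w) {
        Set<String> out = new HashSet<>();
        String letters = "abcdefghijklmnopqrstuvwxyz";
        for (int i = 0; i < w.length(); i++)                 // deletions
            out.add(w.substring(0, i) + w.substring(i + 1));
        for (int i = 0; i <= w.length(); i++)                // insertions
            for (char c : letters.toCharArray())
                out.add(w.substring(0, i) + c + w.substring(i));
        for (int i = 0; i < w.length(); i++)                 // substitutions
            for (char c : letters.toCharArray())
                out.add(w.substring(0, i) + c + w.substring(i + 1));
        return out;
    }

    // Keep candidates that exist in the lexicon; return the most frequent one.
    static String correct(String w) {
        return oneStep(w).stream()
                .filter(LEXICON::containsKey)
                .max(Comparator.comparingInt(LEXICON::get))
                .orElse(w);
    }
}
```

With this lexicon, correct("caat") yields "cat"; a word with no one-step candidate is returned unchanged.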

Page 5: Recognizer Issues

Example

Misspelled word: acress

Candidates – with probabilities of use (P(c)) and use within context

[Table: candidate corrections with their context counts and context × P(c) scores; the values are not preserved in this transcript]

Page 6: Recognizer Issues

Which correction is most likely?

• Word frequency percentage is not enough
  – We need p(typo|candidate) * p(candidate)
• How likely is the particular error?
  – Deletion of a t after a c and before an r
  – Insertion of an a at the beginning
  – Transpose a c and an a
  – Substitute a c for an r
  – Substitute an o for an e
  – Insert an s before the last s, or after the last s
  – Context of the word within a sentence or paragraph

Misspelled word: acress
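The scoring rule above can be written directly. The candidate array and both probability arrays in the usage below are hypothetical placeholders, not values from a real corpus or error model:

```java
class NoisyChannel {
    // Return the candidate maximizing p(typo|candidate) * p(candidate).
    static String best(String[] cands, double[] pCand, double[] pTypoGivenCand) {
        int argmax = 0;
        for (int i = 1; i < cands.length; i++)
            if (pTypoGivenCand[i] * pCand[i] > pTypoGivenCand[argmax] * pCand[argmax])
                argmax = i;   // this candidate has a higher joint score
        return cands[argmax];
    }
}
```

For "acress", the error types listed above imply candidates such as actress, cress, caress, access, across, and acres; each would be scored this way.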

Page 7: Recognizer Issues

Additional Issues to Consider

• What if there is more than one error per word?
  – Possible Solution: Compute minimum edit distance and allow more than one transformation
• Some words in the lexicon may not appear in the frequency corpus that we use
  – Possible Solution: Smoothing algorithms
• Is P(typo|candidate) accurate?
  – Possible Solution: Train the algorithm
• Choice often depends on context
  – Possible Solution: Context-sensitive algorithms
• Names and places likely do not exist in the lexicon
  – This is a difficult issue

Page 8: Recognizer Issues


Dynamic Programming

Definition: Nesting small decision problems inside larger decisions (Richard Bellman)

Two approaches:
1. Top Down: Start out with a method call to compute a final solution. The method proceeds to solve sub-problems of the same kind. Eventually the algorithm reaches and solves the "base case." The algorithm can avoid repetitive calculations by storing the answers to the sub-problems for later access.
2. Bottom Up: Start with the base cases, and work upwards, using previous results to reach the final answer.

The examples on the slides that follow use the bottom-up approach.

Page 9: Recognizer Issues

Dynamic Programming Algorithms

• Recursion is popular for solving 'divide and conquer' problems
  – Divide the problem into smaller problems of the same kind
  – Define the base case
  – Works from the original problem down to the base case
• Dynamic programming is another approach that uses arrays
  – Initialize a variable or table
  – Use loops to fill in entries in the table. The algorithm always fills in entries in the table before they are needed
  – Works top-down or bottom-up towards the final solution
  – Avoids repetitive calculations because the technique stores results in a table for later use
  – Eliminates recursion activation record overhead

Implement recursive algorithms using arrays

Dynamic programming is a widely used algorithmic technique

Page 10: Recognizer Issues

Fibonacci Sequence

Bottom Up

int fibo(int n) {
    if (n <= 1) return n;
    int fMinusTwo = 0;
    int fMinusOne = 1;
    int f = 1;
    for (int i = 2; i <= n; i++) {
        f = fMinusTwo + fMinusOne;
        fMinusTwo = fMinusOne;
        fMinusOne = f;
    }
    return f;
}

Top Down (memoized)

int[] fibs = new int[MAX];
fibs[0] = 0;
fibs[1] = 1;
int max = 1;   // highest index computed so far

int fibo(int n) {
    if (n <= max) return fibs[n];
    fibs[n] = fibo(n - 1) + fibo(n - 2);
    max = n;
    return fibs[n];
}

{0 1 1 2 3 5 8 13 21 …} where F(n)=F(n-1)+F(n-2)

Page 11: Recognizer Issues

Example: Knapsack Problem

• Values[v] contains the values of the N items
• Weights[v] contains the weights of the N items
• K = knapsack capacity
• dynTable[v][w] contains the dynamic algorithm table

A thief enters a house with a knapsack of a particular size. Which items of different values does he choose to fill his knapsack?

Page 12: Recognizer Issues

KnapSack Algorithm

knapSack(int[] weights, int[] values, int K) {   // arrays are 1-indexed; index 0 unused
    int n = values.length - 1;
    int[][] dynTable = new int[n + 1][K + 1];
    for (int w = 0; w <= K; w++) dynTable[0][w] = 0;
    for (int v = 1; v <= n; v++) {
        for (int w = 0; w <= K; w++) {
            dynTable[v][w] = dynTable[v - 1][w];     // Copy up
            if (weights[v] <= w)                     // Try for better choice
                dynTable[v][w] = Math.max(dynTable[v][w],
                    values[v] + dynTable[v - 1][w - weights[v]]);
        }
    }
    return dynTable[n][K];
}

Note: Row for each item to consider; column for integer weights

Page 13: Recognizer Issues

Knapsack Example

ITEM \ Weight   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
  0             0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1             0  0  3  3  3  3  3  3  3  3  3  3  3  3  3  3
  2             0  0  3  3  3  5  5  5  5  5  5  5  5  5  5  5
  3             0  0  3  3  6  6  9  9  9 11 11 11 11 11 11 11
  4             0  0  3  3  6 10 10 13 13 16 16 19 19 19 21 21
  5             0  0  3  3  6 10 10 13 13 16 16 19 19 19 21 22

ITEM      0  1  2  3  4  5
Weights   ?  2  3  4  5  8
Values    ?  3  2  6 10  9

Goal: Maximum profit with a 15kg knapsack?

Array cells represent value stuffed into knapsack of that weight
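The table can be reproduced by running the algorithm directly. This sketch keeps the slide's 1-indexed item arrays (index 0 is unused), an assumption made to match the Weights/Values rows above:

```java
class Knapsack {
    // Bottom-up 0/1 knapsack; weights/values are 1-indexed, index 0 unused.
    static int solve(int[] weights, int[] values, int K) {
        int n = values.length - 1;
        int[][] dynTable = new int[n + 1][K + 1];        // row 0 stays all zero
        for (int v = 1; v <= n; v++) {
            for (int w = 0; w <= K; w++) {
                dynTable[v][w] = dynTable[v - 1][w];     // copy up: skip item v
                if (weights[v] <= w)                     // try taking item v
                    dynTable[v][w] = Math.max(dynTable[v][w],
                            values[v] + dynTable[v - 1][w - weights[v]]);
            }
        }
        return dynTable[n][K];
    }
}
```

With weights {2, 3, 4, 5, 8}, values {3, 2, 6, 10, 9}, and K = 15, it returns 22, matching the bottom-right cell of the table.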

Page 14: Recognizer Issues

Example: Minimum Edit Distance

• Problem: How can we measure how different one word is from another word (i.e., a spell checker)?
  – How many operations will transform one word into another?
  – Examples: caat --> cat, flpc --> fireplace
• Definition:
  – Levenshtein distance: smallest number of insertion, deletion, or substitution operations to transform one string into another
  – Each insertion, deletion, or substitution is one operation, with a cost of 1.
• Requires a two-dimensional array
  – Rows: source word positions; Columns: spelled word positions
  – Cells: distance[r][c] is the distance (cost) up to that point

A useful dynamic programming algorithm

Page 15: Recognizer Issues

Pseudo Code (minDistance(target, source))

n = characters in source
m = characters in target
Create array, distance, with dimensions n+1, m+1
FOR r = 0 TO n: distance[r,0] = r
FOR c = 0 TO m: distance[0,c] = c
FOR each row r
    FOR each column c
        IF source[r] = target[c] cost = 0 ELSE cost = 1
        distance[r,c] = minimum of
            distance[r-1,c] + 1,       // insertion
            distance[r,c-1] + 1,       // deletion
            distance[r-1,c-1] + cost   // substitution
Result is in distance[n,m]
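The pseudocode translates almost line for line into runnable code; this sketch assumes unit cost for all three operations, as stated on the previous slide:

```java
class EditDistance {
    // Minimum edit distance (Levenshtein) between source and target.
    static int minDistance(String source, String target) {
        int n = source.length(), m = target.length();
        int[][] distance = new int[n + 1][m + 1];
        for (int r = 0; r <= n; r++) distance[r][0] = r;   // first column
        for (int c = 0; c <= m; c++) distance[0][c] = c;   // first row
        for (int r = 1; r <= n; r++) {
            for (int c = 1; c <= m; c++) {
                int cost = (source.charAt(r - 1) == target.charAt(c - 1)) ? 0 : 1;
                distance[r][c] = Math.min(
                        Math.min(distance[r - 1][c] + 1,     // insertion
                                 distance[r][c - 1] + 1),    // deletion
                        distance[r - 1][c - 1] + cost);      // substitution
            }
        }
        return distance[n][m];
    }
}
```

minDistance("GAMBOL", "GUMBO") returns 2, matching the worked example on the slides that follow.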

Page 16: Recognizer Issues

Example

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Initialization

         G  U  M  B  O
      0  1  2  3  4  5
   G  1
   A  2
   M  3
   B  4
   O  5
   L  6

Page 17: Recognizer Issues

Example

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 1

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0
   A  2  1
   M  3  2
   B  4  3
   O  5  4
   L  6  5

Page 18: Recognizer Issues

Example

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 2

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0  1
   A  2  1  1
   M  3  2  2
   B  4  3  3
   O  5  4  4
   L  6  5  5

Page 19: Recognizer Issues

Example

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 3

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0  1  2
   A  2  1  1  2
   M  3  2  2  1
   B  4  3  3  2
   O  5  4  4  3
   L  6  5  5  4

Page 20: Recognizer Issues

Example

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 4

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0  1  2  3
   A  2  1  1  2  3
   M  3  2  2  1  2
   B  4  3  3  2  1
   O  5  4  4  3  2
   L  6  5  5  4  3

Page 21: Recognizer Issues

Example

• Source: GAMBOL, Target: GUMBO
• Algorithm Step: Column 5
• Result: Distance equals 2

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0  1  2  3  4
   A  2  1  1  2  3  4
   M  3  2  2  1  2  3
   B  4  3  3  2  1  2
   O  5  4  4  3  2  1
   L  6  5  5  4  3  2

Page 22: Recognizer Issues

Another Example

Source: INTENTION, Target: EXECUTION

         E  X  E  C  U  T  I  O  N
      0  1  2  3  4  5  6  7  8  9
   I  1  1  2  3  4  5  6  6  7  8
   N  2  2  2  3  4  5  6  7  7  7
   T  3  3  3  3  4  5  5  6  7  8
   E  4  3  4  3  4  5  6  6  7  8
   N  5  4  4  4  4  5  6  7  7  7
   T  6  5  5  5  5  5  5  6  7  8
   I  7  6  6  6  6  6  6  5  6  7
   O  8  7  7  7  7  7  7  6  5  6
   N  9  8  8  8  8  8  8  7  6  5

Page 23: Recognizer Issues

Comparing Audio Frames

• Patterns: Database of audio samples. Match to the ones that are closest

• Templates: Database of features extracted from audio samples

• Training: Use a training set to create a vector of patterns or templates

• Distortion Measure: an algorithm to measure how far apart a pair of templates or patterns are

Page 24: Recognizer Issues

Will Minimum Edit Distance Work?

• Maybe: Distances to higher array indices are functions of the costs of smaller indices; a dynamic programming approach may work
• Issues
  – The algorithm may be too slow
  – A binary equal-or-not-equal comparison does not work
  – A distance metric is needed
  – Speaking rates change, even with a single speaker
  – Do we compare the raw data or frame-based features?
  – Do we assign cost to adjacent cells or to those further away?
  – Other issues: phase, energy, pitch misalignment, presence of noise, length of vowels, phoneme pronunciation variances, etc.

Incorrect comparisons occur when the algorithm isn’t carefully designed

Page 25: Recognizer Issues

Dynamic Time WarpingGoal: Find “best” alignment between pairs of audio frames

[Figure: alignment grid; x-axis: time (frame) of (A), y-axis: time (frame) of (B)]

The matrix to the right shows the optimal alignment path (warping) between frames from utterance A with those of utterance B

Page 26: Recognizer Issues


Dynamic Time Warping (DTW) Overview

• Computes the "distance" between 2 frames of speech
  o Measures frame-by-frame distances to compute dissimilarities between speech
  o Allows the comparison to warp to account for differences in speaking rates
  o Requires a cost function to account for different paths through frames of speech
  o Uses a dynamic programming algorithm to find the best warping
  o Computes a total "distortion score" for the best warped path
• Assumptions
  o Constrain begin and end times to be (1,1) and (TA,TB)
  o Allow only monotonically increasing time
  o Don't allow too many frames to be skipped
  o Can express results in terms of "paths" with "slope weights"

Page 27: Recognizer Issues


Assumptions

• Does not require that both patterns have the same length
• One speech pattern is the "input" and the other speech pattern is the "template" to compare against
• Divide the speech signal into equally-spaced frames (10-30 ms) with approximately 50% overlap and compute a frame-based feature set
• The local distance measure (d) is the distance between features at a pair of frames (one from A, one from B)
• The global distortion from the beginning of the utterance until the current pair of frames is called G

Page 28: Recognizer Issues

Algorithm Efficiency

• The algorithm complexity is O(m*n), where m and n are the respective numbers of frames in the two utterances. If m=n, the algorithm is O(n²). Why? Count the number of cells that need to be filled in.
• O(n²) may be too slow. Alternate solutions have been devised.
  – Don't fill in all of the cells.
  – Use a multi-level approach

Page 29: Recognizer Issues

Don’t Fill in all of the Cells

Disadvantage: The algorithm may miss the optimal path

Page 30: Recognizer Issues

The Multilevel Approach

Concept
1. Coarsen the array of features
2. Run the algorithm
3. Refine the array
4. Adjust the solution
5. Repeat steps 3-4 till the original array of features is restored

Notes
• The multilevel approach is a common technique for reducing many algorithms' complexity from O(n²) to O(n lg n)
• Example: partitioning a graph to balance work loads among threads or processors

Page 31: Recognizer Issues

Which Audio Features?

• Cepstrals: They are statistically independent and phase differences are removed
• ΔCepstrals, or ΔΔCepstrals: Reflect how the signal is changing from one frame to the next
• Energy: Distinguishes the frames that are voiced versus those that are unvoiced
• Normalized LPC Coefficients: Represent the shape of the vocal tract, normalized by vocal tract length for different speakers

These are some of the popular speech recognition features

Page 32: Recognizer Issues


Distance Metric Requirements

Definition: Measures the similarity of two frames of speech.

The vectors xt, yt contain the features from frames of two signals.

A distance measure should have the following properties:
  0 ≤ d(xt,yt); d(xt,yt) = 0 iff xt = yt    (positive definiteness)
  d(xt,yt) = d(yt,xt)                       (symmetry)
  d(xt,yt) ≤ d(xt,zt) + d(zt,yt)            (triangle inequality)

A speech distance metric should correlate with perceived distance.

Perceptually-warped spectral features work well in practice.

Page 33: Recognizer Issues

Which Distance Metric?

• General Formula:
  array[i,j] = distance(i,j) + min{array[i-1,j], array[i-1,j-1], array[i,j-1]}
• Assumption: There is no cost assessed for duplicate or eliminated frames.
• Distance Formulas:
  – Euclidean: sum of the squared differences between corresponding features
  – Linear: sum of the absolute values of the differences between features
• Example of a distance metric using linear distance:
  ∑ w[i] |fa[i] – fb[i]|, where f[i] is a particular audio feature for signals a and b, and w[i] is that feature's weight
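The linear formula can be written as a one-line loop; the feature vectors and weights in the test below are arbitrary illustrations:

```java
class FrameDistance {
    // Weighted linear (L1) distance between feature vectors fa and fb,
    // with per-feature weights w.
    static double linear(double[] fa, double[] fb, double[] w) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++)
            sum += w[i] * Math.abs(fa[i] - fb[i]);   // weighted absolute difference
        return sum;
    }
}
```

Setting every w[i] to 1 recovers the unweighted linear distance.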

Page 34: Recognizer Issues

Which Local Distance Measure?

• Euclidean distance:
  d(xt, yt) = ∑ f=0..F-1 (xt(f) – yt(f))²
  where xt(f) and yt(f) are frame frequency values at time t; f is a feature index
• Mahalanobis distance: the squared differences are weighted by the inverse covariance of the features
• Other distance measures: Itakura-Saito distortion, COSH, likelihood ratio, etc.

Note: we can weight the features by multiplying differences by weighting factors to emphasize/deemphasize certain features

Page 35: Recognizer Issues


Dynamic Time Warping Termination Step

Divide the total computed cost by a normalizing factor

The normalizing factor is necessary to compare results between input speech and various templates to which it is compared

1. One quick and effective normalizing method divides by the number of frames in the template.

2. Another method is to divide the result by the length of the path taken, where we adjust the length by the slope weights at each transition. This requires backtracking to sum the slope values, but can sometimes be more accurate.

Page 36: Recognizer Issues


Frame transition cost heuristics

• Path P and slope weight m are determined heuristically
• Paths are considered backward from the target frame
• Larger weight values for less preferable paths
• Optimal paths always go up and to the right (monotonically forward in time)
• Only evaluate P if all frames have meaningful values (e.g., don't evaluate a path that reaches back before time 1, because there is no data there)

[Figure: path diagrams for two heuristics]
Heuristic 1: P1=(1,0), P2=(1,1), P3=(1,2)
Heuristic 2: P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1), with slope weights of 1 and ½ as shown in the original diagram

Page 37: Recognizer Issues


Dynamic Time Warping (DTW) Algorithm

1. Initialization (time 1 is the first time frame)

   D(1,1) = d(1,1)

2. Recursion

   D(x,y) = min over valid paths from (x',y') to (x,y) of [ D(x',y') + ζ((x',y'), (x,y)) ]

   where ζ((x',y'), (x,y)) = ∑ l=1..L d(xl, yl) · m(l), with the index l running along the valid path from (x',y') to (x,y); ζ (zeta) is a function of the previous distances and the slope weights m(l)

3. Termination

   result = D(xT, yT) / M, where M = ∑ k=1..T m(k)

M is sometimes defined as Tx, or Tx + Ty, or (Tx² + Ty²)½

A convenient value for M is the length of the template.
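A compact sketch of the three steps, under simplifying assumptions the slide leaves open: frames reduced to single feature values with an absolute-difference local distance, the path set P1=(1,0), P2=(1,1), P3=(0,1), all slope weights equal to 1, and normalization by the template length:

```java
class DTW {
    // DTW distortion between input a and template b (1-D feature sequences).
    // Paths considered: (1,0), (1,1), (0,1); all slope weights = 1.
    static double distortion(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] D = new double[n][m];
        D[0][0] = Math.abs(a[0] - b[0]);                       // initialization
        for (int i = 1; i < n; i++)                            // first column
            D[i][0] = D[i - 1][0] + Math.abs(a[i] - b[0]);
        for (int j = 1; j < m; j++)                            // first row
            D[0][j] = D[0][j - 1] + Math.abs(a[0] - b[j]);
        for (int i = 1; i < n; i++)                            // recursion
            for (int j = 1; j < m; j++)
                D[i][j] = Math.abs(a[i] - b[j])
                        + Math.min(D[i - 1][j - 1],
                          Math.min(D[i - 1][j], D[i][j - 1]));
        return D[n - 1][m - 1] / m;    // termination: normalize by template length
    }
}
```

An input that merely repeats a template frame (a slower speaking rate) still warps onto the template with zero distortion, which is the point of the technique.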

Page 38: Recognizer Issues


Dynamic Time Warping (DTW) Example

[Figure: worked example. A 7×6 grid of local distances d(x,y) (values 1-3) is filled in using the heuristic paths P1=(1,0), P2=(1,1), P3=(1,2), beginning at (1,1) and ending at (7,6). The cumulative values D(1,1) through D(7,6) are computed cell by cell; the grids are not fully preserved in this transcript.]

normalized distortion = 8/6 = 1.33 (normalizing by the 6 template frames)

normalized distortion = 8/7 = 1.14 (normalizing by the 7-step path length)

Page 39: Recognizer Issues


Dynamic Time Warping (DTW) Example

[Figure: worked example on a 6×6 grid of local distances, using the heuristic paths P1=(1,0), P2=(1,1), P3=(0,1), beginning at (1,1) and ending at (6,6).]

D(1,1) = 1   D(2,1) = 3    D(3,1) = 6    D(4,1) = 9   …
D(1,2) = 3   D(2,2) = 2    D(3,2) = 10   D(4,2) = 7   …
D(1,3) = 5   D(2,3) = 10   D(3,3) = 11   D(4,3) = 9   …
D(1,4) = 7   D(2,4) = 7    D(3,4) = 9    D(4,4) = 10  …
D(1,5) = 10  D(2,5) = 9    D(3,5) = 10   D(4,5) = 10  …
D(1,6) = 13  D(2,6) = 11   D(3,6) = 12   D(4,6) = 12  …

normalized distortion = 13/6 = 2.17

Page 40: Recognizer Issues


Dynamic Time Warping (DTW) Example

[Figure: worked example on a 6×6 grid of local distances, using the heuristic paths P1=(1,1)(1,0), P2=(1,1), P3=(1,1)(0,1) with slope weights of ½ and 1, beginning at (1,1) and ending at (6,6). The D(x,y) entries are left blank for the reader to compute.]

Page 41: Recognizer Issues

Singularities

• Assumptions
  – The minimum distance comparing two signals depends only on the previous adjacent entries
  – The cost function accounts for the varied length of a particular phoneme, which causes the cost in particular array indices to no longer be well-defined
• Problem: The algorithm can compute incorrectly due to mismatched alignments
• Possible solutions:
  – Compare based on the change of feature values between windows instead of the values themselves
  – Pre-process to eliminate the causes of the mismatches

Page 42: Recognizer Issues

Possible Preprocessing

• Normalize the energy of voiced audio:
  – Compute the energy of both signals
  – Multiply the larger by the percentage difference
• Brick-wall normalize the peaks and valleys:
  – Find the average peak and valley value
  – Set values larger than the average equal to the average
• Normalize the pitch: Use PSOLA to align the pitch of the two signals
• Remove duplicate frames: Autocorrelate frames at pitch points
• Implement a noise removal algorithm
• Normalize the speaking rate
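The first preprocessing step can be sketched as follows. Scaling one signal by the square root of the energy ratio is one reading of "multiply the larger by the percentage difference", so treat the gain formula here as an assumption:

```java
class EnergyNormalize {
    // Sum of squared samples.
    static double energy(double[] s) {
        double e = 0;
        for (double x : s) e += x * x;
        return e;
    }

    // Scale signal b so its energy matches signal a's energy.
    static double[] matchEnergy(double[] a, double[] b) {
        double gain = Math.sqrt(energy(a) / energy(b));   // assumed gain formula
        double[] out = new double[b.length];
        for (int i = 0; i < b.length; i++) out[i] = gain * b[i];
        return out;
    }
}
```

After matchEnergy, both signals have equal energy, so the frame distances no longer reflect a loudness mismatch.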

Page 43: Recognizer Issues


Dynamic Time Warping / Hidden Markov Models

• Dynamic Time Warping (DTW)
  o Compares speech with a number of templates
  o The algorithm selects the template with the lowest normalized distortion to determine the recognized word
• Hidden Markov Models (HMMs)
  o Refine the DTW technology
  o HMMs compare speech against "probabilistic templates"
  o HMMs compute the most likely paths using probabilities

Page 44: Recognizer Issues

Phoneme Marking

• Goal: Mark the start and end of phoneme boundaries
• Research
  – Unsupervised, text (language) independent algorithms have been proposed
  – Accuracy: 75% to 80%, which is 5-10% lower than supervised algorithms that make assumptions about the language
• If successful, a database of phonemes can be used in conjunction with dynamic time warping to simplify the speech recognition problem