Human vs. Machine

Page 1: Human vs. Machine

• Human beings are much better at resolving signal ambiguities than are computers.
  – Computers are improving (e.g., Watson and Jeopardy)
  – Turing Test: Are we communicating with a human or machine?

• Case in point: Speech. Sloppy speech is okay, as long as the hearer still understands.
  – “haya dun”
  – “ay d ih s h er d s ah m th in ng ah b aw m uh v ih ng r ih s en l ih”

There are difficulties, even when a computer recognizes phonemes in a speech signal.

Page 2: Human vs. Machine

One sentence, eight possible meanings: “I made her duck”

  – I cooked waterfowl for her.
  – I stole her waterfowl and cooked it.
  – I used my abilities to create a living waterfowl for her.
  – I caused her to bid low in the game of bridge.
  – I created the plastic duck that she owns.
  – I caused her to quickly lower her head or body.
  – I waved my magic wand and turned her into waterfowl.
  – I caused her to avoid the test.

Page 3: Human vs. Machine

Robot-human dialog

Robot: “Hi, my name is Robo. I am looking for work to raise funds for Natural Language Processing research.”
Person: “Do you know how to paint?”
Robo: “I have successfully completed training in this skill.”
Person: “Great! The porch needs painting. Here are the brushes and paint.”

Robot rolls away efficiently. An hour later he returns.

Robo: “The task is complete.”
Person: “That was fast; here is your salary. Good job, and come back again.”

Robo speaks while rolling away with the payment.

Robo: “The car was not a Porsche; it was a Mercedes.”

Moral: You need a sense of humor to work in this field.

99% accuracy

Page 4: Human vs. Machine

Difficulties

• Challenges
  – Speaker variability
  – Slurring and running words together
  – Co-articulation
  – Handling words not in the vocabulary
  – Grammar complexities
  – Speech semantics
  – Recognizing idioms
  – Background noise
  – Signal transmission distortion

• Approaches
  – Use large pre-recorded data samples
  – Train for particular users
  – Require artificial pauses between words
  – Limit vocabulary size
  – Limit the grammar
  – Use high-quality microphones
  – Require low-noise environments

Today's best systems cannot match human perception

Page 5: Human vs. Machine

ASR Difficulties

• Realizations are points in continuous space, not discrete
• Sounds take on characteristics of adjacent sounds (assimilation)
• Sounds that are combinations of two (co-articulation)
• Articulator targets are often not reached
• Diphthongs combine different phonemes
• Adding (epenthesis) or deleting (elision) sounds
• Missing word and phrase boundaries, endings
• Many tonal variations during speech
• Varied vowel durations
• Common knowledge and familiar background lead to sloppier speech with additional non-linearities

Page 6: Human vs. Machine

Possible Applications

• Compare two audio signals to match a speaker’s utterance against a database of recordings
• Convert audio into a text document
• Visually represent the vocal tract of the speaker in real time
• Recognize a particular speaker for enhanced security
• Transform an audio signal to enhance its speech qualities
• Perform tasks based on user commands
• Recognize the language and respond appropriately

Page 7: Human vs. Machine

A sample of issues to consider

• Can we assume the target language, or is the application to be language independent?
• Is there access to databases describing grammatical, morphological, and phonological rules?
• Are there digital dictionaries available? Does the application require a large dictionary?
• Are there corpora available to scientifically measure performance against other implementations?
• How does the system perform when the SNR is low? What are typical SNR characteristics when the application is in use?
• What are the accuracy requirements for the application?
• Are statistical training procedures practical for the application?

Page 8: Human vs. Machine

Phonological Grammars

• Sound patterns
  – English: 13 features yield 8192 (2^13) combinations
  – Complete descriptive grammar
  – Rule based, meaning a formal grammar can represent the valid sound combinations in a language
  – Unfortunately, these rules are language-specific

• Recent research
  – Trend towards context-sensitive descriptions
  – Little thought concerning computational feasibility
  – Listeners likely don’t perceive speech using thousands of rules

Phonology: the study of sound combinations

Page 9: Human vs. Machine

Formal Grammars (Chomsky 1950)

• Formal grammar definition: G = (N, T, s0, P, F)
  – N is a set of non-terminal symbols (or states)
  – T is the set of terminal symbols (N ∩ T = {})
  – s0 ∈ N is the start symbol
  – P is a set of production rules
  – F (a subset of N) is a set of final symbols

• Right regular grammar productions have the forms
  B → a, B → aC, or B → ε, where B, C ∈ N and a ∈ T

• Context-free (programming language) productions have the form
  B → w, where B ∈ N and w is a possibly empty string over N ∪ T

• Context-sensitive (natural language) productions have the form
  αAβ → αγβ, where A ∈ N, α, β, γ ∈ (N ∪ T)*, and |αAβ| ≤ |αγβ|

Page 10: Human vs. Machine

Chomsky Language Hierarchy

Page 11: Human vs. Machine

Classifying the Chomsky Grammars

Notes
• Regular: left-hand side contains one non-terminal; right-hand side has at most one non-terminal (plus terminals)
• Context free: left-hand side contains one non-terminal; right-hand side mixes terminals and non-terminals
• Context sensitive: left-hand side may have both terminals and non-terminals
• Turing equivalent: all rules are fair game (the computational power of a computer)

Page 12: Human vs. Machine

Context Free Grammars

• Capture constituents and ordering
  – Regular grammars are too limited to represent natural language grammars

• Context-free grammars consist of
  – A set of non-terminal symbols N
  – A finite alphabet of terminals Σ
  – A set of productions A → α such that A ∈ N and α is a string in (N ∪ Σ)*
  – A designated start symbol

• Used for programming language syntax; too restrictive for natural languages

Chomsky (1956), Backus (1959)

Page 13: Human vs. Machine

Example Grammar (L0)

Page 14: Human vs. Machine

Context Free Grammars for Natural Language

• Context-free grammars work well for basic grammar syntax

• Disadvantages
  – Some complex syntactic rules require clumsy constructions
  – Agreement: “He ate many meal”
  – Movement of grammatical components:
    o Which flight do you want me to have the travel agent book?
    o The object is far from its matching verb

Page 15: Human vs. Machine

Morphology

• How morphemes combine to make words, and how the combinations are pronounced
• Important for speech recognition and synthesis
• Example: singular to plural
  – Run to runs: z sound (voiced)
  – Hit to hits: s sound (unvoiced)
• One approach: devise language-specific sets of pronunciation rules (a minimal sketch follows below)
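The voicing rule above fits in a few lines of code. Below is a minimal Python sketch (not from the original slides) of the regular-plural pronunciation rule, assuming ARPAbet-like phoneme symbols; the phoneme sets are illustrative and incomplete.

```python
# Sketch of the singular-to-plural pronunciation rule described above.
# Assumes ARPAbet-like phoneme symbols; these sets are illustrative, not a
# complete inventory for English.

VOICED = {"B", "D", "G", "V", "DH", "M", "N", "NG", "L", "R", "W", "Y",
          "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}

def plural_suffix(phonemes):
    """Return the phonemes of the regular plural ending for a word."""
    last = phonemes[-1]
    if last in SIBILANTS:          # bus -> buses: insert a vowel, then /z/
        return ["IH", "Z"]
    if last in VOICED:             # run -> runs: voiced final sound takes /z/
        return ["Z"]
    return ["S"]                   # hit -> hits: unvoiced final sound takes /s/

print(plural_suffix(["R", "AH", "N"]))   # ['Z']
print(plural_suffix(["HH", "IH", "T"]))  # ['S']
```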

Page 16: Human vs. Machine

Syllables

• Organizational phonological unit
  – A vowel between two consonants
  – Ambiguous positioning of consonants into syllables
  – Tree-structured representation

• Basic unit of prosody
  – Lexical stress: inherent property of a word
  – Sentential stress: the speaker’s choice to emphasize or clarify

Page 17: Human vs. Machine

Finite State Automata

• Definition: (N, T, s0, δ, F) where
• N is a finite, non-empty set of non-terminal states
• T is a finite, non-empty set of terminal symbols
• s0 ∈ N is the initial state
• δ is the state-transition function
  – Deterministic transition function: δ: N × T → N
  – Nondeterministic transition function: δ: N × T → P(N), the power set of N
  – Transducers: add Γ, a set of output symbols, and an output function ω
• F ⊆ N is the (possibly empty) set of final states

Page 18: Human vs. Machine

Finite-state Automata

These formalisms are equivalent in expressive power:
• Finite-state automata (FSA)
• Regular languages
• Regular expressions

Page 19: Human vs. Machine

Finite-state Automata (Machines)

The sheep language /baa+!/:

States q0 through q4, with q4 the final state. Transitions: q0 -b-> q1, q1 -a-> q2, q2 -a-> q3, q3 -a-> q3 (self-loop), q3 -!-> q4.

Accepted strings: baa!  baaa!  baaaa!  baaaaa!  ...

Page 20: Human vs. Machine

Input Tape

Input tape: a b a ! b

Starting in q0, there is no transition on ‘a’ from q0, so the machine halts.

REJECT

Page 21: Human vs. Machine

Input Tape

Input tape: b a a a !

State sequence: q0 → q1 → q2 → q3 → q3 → q4 (final)

ACCEPT

Page 22: Human vs. Machine

State-transition Tables

           Input
State    b   a   !
  0      1   0   0
  1      0   2   0
  2      0   3   0
  3      0   3   4
  4:     0   0   0

(A 0 entry means no transition; the colon marks state 4 as the final state. See the sketch after this table.)
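A small Python sketch (not part of the slides) of a table-driven recognizer for /baa+!/, reading the 0 entries above as “no transition”:

```python
# Deterministic FSA for the sheep language /baa+!/, driven by a transition
# table. Missing entries (the 0 cells in the slide's table) mean "no
# transition", which causes a reject.

TRANSITIONS = {
    (0, 'b'): 1,
    (1, 'a'): 2,
    (2, 'a'): 3,
    (3, 'a'): 3,
    (3, '!'): 4,
}
FINAL_STATES = {4}

def accepts(tape: str) -> bool:
    """Run the FSA over the input tape, one symbol at a time."""
    state = 0
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))
        if state is None:              # no transition defined: reject
            return False
    return state in FINAL_STATES

print(accepts("baa!"))      # True
print(accepts("baaaaa!"))   # True
print(accepts("aba!b"))     # False (rejected, as in the input-tape example)
```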

Page 23: Human vs. Machine

Finite State Machine Examples

Deterministic vs. nondeterministic

[Figure: two state diagrams over inputs a and b, one deterministic and one nondeterministic]

Page 24: Human vs. Machine

Finite State Transducer

A finite state automaton that produces an output string: O is a set of output states, ω: S → O

Input: features from a sequence of frames
Processing: find the most likely path through the sequence using hidden Markov models or neural networks
Output: the most likely word, phoneme, or syllable

Page 25: Human vs. Machine

Back End Processing

• Rule based: insufficient to represent the differences in how words are constructed

• Statistics based: most other areas of natural language processing are also trending toward statistical methods

• Procedure
  – Supervised training: an algorithm “learns” the parameters using a training set of data. The “trained” algorithm is then ready to run in an actual environment.
  – Unsupervised training: an algorithm trains itself by computing categories from the training data

Page 26: Human vs. Machine

Representing Stress

• There have been unsuccessful attempts to automatically assign stress to phonemes

• Notations for representing stress
  – IPA (International Phonetic Alphabet) has a diacritic symbol for stress
  – Numeric representation
    • 0: reduced, 1: normal, 2: stressed
  – Relative
    • Reduced (R) or Stressed (S)
    • No notation means undistinguished

Page 27: Human vs. Machine

Random Variables

• A random variable, X, is a quantity that assigns a numerical value to each possible event

• Reason: assigning numbers to events makes it possible to analyze the results mathematically

• Example: pick a ball out of a bag. Suppose the balls are red, blue, and green. We could assign X = 0 if red, X = 1 if blue, and X = 2 if green.

• A discrete random variable has a finite number of possible values, with ∑i=1,n p(xi) = ∑i=1,n P(X = xi) = 1.

Page 28: Human vs. Machine

Probability Chain Rule

• Conditional probability: P(A1, A2) = P(A1) * P(A2 | A1)
• The chain rule generalizes to multiple events
  – P(A1, …, An) = P(A1) P(A2 | A1) P(A3 | A1, A2) … P(An | A1 … An-1)
• Examples:
  – P(the dog) = P(the) P(dog | the)
  – P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
• Conditional probabilities tell us more than individual relative word frequencies because they take context into account
  – “Dog” may be a relatively rare word in a corpus
  – But if we see “barking”, P(dog | barking) is much more likely

• In general, the probability of a complete string of words w1 … wn is:

  P(w1 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 … wn-1) = ∏k=1,n P(wk | w1 … wk-1)

Note: A large n requires a lot of data; chains of two or three words work well (see the bigram sketch below).
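As a concrete illustration (not in the original slides), here is a minimal bigram (n = 2) version of the chain rule in Python; the toy corpus and its counts are invented purely for the example.

```python
# Bigram approximation of the chain rule: P(w1..wn) ~= P(w1) * prod P(wk | wk-1).
# The corpus below is a toy example; a real model would use a large corpus
# and smoothing for unseen bigrams.

from collections import Counter

corpus = "the dog bites the dog barks the cat sleeps".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, prev):
    """P(word | prev) estimated by relative frequency."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Chain rule with a bigram approximation."""
    p = unigrams[words[0]] / len(corpus)        # P(w1)
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(word, prev)               # P(wk | wk-1)
    return p

print(p_sentence(["the", "dog", "bites"]))      # 3/9 * 2/3 * 1/2 ~= 0.11
```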

Page 29: Human vs. Machine

Probability Density Function

• f(x) is a continuous probability density function if ∫-∞,∞ f(x) dx = 1

[Figure: a density curve with the interval from a to b shaded]

Note: The shaded area is the probability that a ≤ x ≤ b

Page 30: Human vs. Machine

Mean, Variance, Standard Deviation

• The mean or expected value
  – Discrete: µ = E(x) = ∑ x p(x) over all x values
  – Continuous: µ = E(x) = ∫ x f(x) dx from -∞ to ∞

• Variance
  – Discrete: σ² = ∑ (x - µ)² p(x) = ∑ x² p(x) – (∑ x p(x))²
  – Continuous: σ² = ∫ (x - µ)² f(x) dx = ∫ x² f(x) dx – (∫ x f(x) dx)²

• Standard deviation: σ = square root of the variance

• Intuition
  – Mean: center of the distribution (1st moment)
  – Variance: spread of the distribution (2nd moment)
  – Standard deviation: spread in the same units as the data (e.g., the percentage of values within a given distance of the mean)
  – Skew: asymmetry of the distribution (3rd moment)
  – Kurtosis: how peaked the distribution is (4th moment)

Page 31: Human vs. Machine

Note: Same mean, different variances

Page 32: Human vs. Machine

Example

• A bag of numbered balls, with the counts below:

  x   Quantity   P(x)
  1       8      8/30
  2       5      5/30
  3       3      3/30
  4      10     10/30
  5       4      4/30

• Pick a single ball from the bag

• Mean: µ = ∑ x p(x)
  1*8/30 + 2*5/30 + 3*3/30 + 4*10/30 + 5*4/30 = 87/30 = 2.9

• Variance, method 1: σ² = ∑ (x - µ)² p(x)
  σ² = 8/30*(-1.9)² + 5/30*(-0.9)² + 3/30*(0.1)² + 10/30*(1.1)² + 4/30*(2.1)² = 2.09

• Variance, method 2 (without the mean): σ² = ∑ x² p(x) – (∑ x p(x))²
  σ² = 1*8/30 + 4*5/30 + 9*3/30 + 16*10/30 + 25*4/30 – (1*8/30 + 2*5/30 + 3*3/30 + 4*10/30 + 5*4/30)² = 10.5 – 2.9² = 2.09

• Standard deviation = √2.09 ≈ 1.45
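A quick Python check of the numbers above (not part of the slides):

```python
# Verify the bag-of-balls example: values, counts, and N = 30.

xs     = [1, 2, 3, 4, 5]
counts = [8, 5, 3, 10, 4]
N = sum(counts)                                   # 30 balls

p = [c / N for c in counts]                       # P(x) for each value
mean = sum(x * px for x, px in zip(xs, p))        # 2.9

# Method 1: sum of (x - mean)^2 * p(x)
var1 = sum((x - mean) ** 2 * px for x, px in zip(xs, p))
# Method 2: E[x^2] - (E[x])^2
var2 = sum(x * x * px for x, px in zip(xs, p)) - mean ** 2

print(round(mean, 2), round(var1, 2), round(var2, 2), round(var1 ** 0.5, 2))
# 2.9 2.09 2.09 1.45
```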

Page 33: Human vs. Machine

Covariance

• A positive covariance occurs if two random variables tend to be above or below their means together
• A negative covariance occurs when the random variables tend to be on opposite sides of their means
• If there is no correlation, the covariance will be close to zero

• Covariance formula
  – Discrete: Cov(X,Y) = ∑x ∑y (x - µx)(y - µy) p(x,y)
  – Continuous: Cov(X,Y) = ∫x ∫y (x - µx)(y - µy) f(x,y) dy dx

• Correlation coefficient: ρxy = Cov(X,Y) / (σx σy); a sample covariance estimate divides the summed products by N - 1
• Values of |ρxy| close to 1 imply the variables are strongly related

Covariance determines how two random variables relate

Page 34: Human vs. Machine

Covariance (Dispersion) Matrix

• Given random variables X1, …, Xn with means µ1, …, µn
• The covariance matrix ∑ has entries ∑i,j = E[(Xi - µi)(Xj - µj)] = cov(Xi, Xj)
• Equivalent matrix definition: ∑ = E[(X – E[X])(X – E[X])T]
• Note: The T means transpose; some authors use a single quote instead

Page 35: Human vs. Machine

Covariance Example

Three random variables (x0, x1, x2), five observations each (N = 5). Means: µ0 = 4.1, µ1 = 2.08, µ2 = 0.604. The deviations from the means:

  X0 – µ0           X1 – µ1             X2 – µ2
  -0.1 = 4.0-4.1    -0.08 = 2.0-2.08    -0.040 = 0.60-0.604
   0.1 = 4.2-4.1     0.02 = 2.1-2.08    -0.014 = 0.59-0.604
  -0.2 = 3.9-4.1    -0.08 = 2.0-2.08    -0.024 = 0.58-0.604
   0.2 = 4.3-4.1     0.02 = 2.1-2.08     0.016 = 0.62-0.604
   0.0 = 4.1-4.1     0.12 = 2.2-2.08     0.026 = 0.63-0.604

Note: ∑ results from multiplying the 3x5 transpose of this deviation matrix by the 5x3 deviation matrix (scaled by 1/(N-1), or 1/N for the maximum-likelihood estimate), giving a 3x3 matrix. A sketch follows below.
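A NumPy sketch of this computation (not from the slides); note that np.cov divides by N - 1 by default, so the sketch does the same.

```python
# Covariance matrix for the three variables above (rows = variables,
# columns = the five observations). Division by N-1 is a convention;
# pass bias=True to np.cov for the 1/N (maximum-likelihood) estimate.

import numpy as np

X = np.array([
    [4.0, 4.2, 3.9, 4.3, 4.1],      # x0
    [2.0, 2.1, 2.0, 2.1, 2.2],      # x1
    [0.60, 0.59, 0.58, 0.62, 0.63]  # x2
])

means = X.mean(axis=1, keepdims=True)   # [4.1, 2.08, 0.604]
D = X - means                           # the deviation matrix from the slide
cov = D @ D.T / (X.shape[1] - 1)        # 3x3 covariance matrix

print(means.ravel())
print(cov)
print(np.allclose(cov, np.cov(X)))      # True
```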

Page 36: Human vs. Machine

Uniform Distribution

• pdf: f(x) = 1 / (b - a) for a ≤ x ≤ b; 0 otherwise
• µ = (a + b)/2, variance: (b - a)²/12

• Initial training data for acoustic information can be set up as a uniform distribution

The probability of every value is equal

Page 37: Human vs. Machine

Binomial Distribution

• pdf: P(X = k) = C(n, k) p^k (1 - p)^(n-k), where C(n, k) = n! / (k! (n-k)!),
  n = # of experiments and p = success probability
• µ = np
• σ² = np(1 - p)

Repeated experiments, each with two possible outcomes

Page 38: Human vs. Machine

Multinomial Distribution

Number of successes in n independent experiments

• pdf: P(x1, …, xk) = n! / (x1! … xk!) * p1^x1 … pk^xk, where ∑ xi = n and ∑ pi = 1
• µi = n pi
• σi² = n pi (1 - pi)
• Cov(xi, xj) = -n pi pj
• Extends the binomial distribution to multiple random variables

Page 39: Human vs. Machine

Gaussian Distribution

• When we analyze probability involving many random processes, the distribution is almost always Gaussian.
• Central Limit Theorem: as the number of random variables being summed approaches ∞, the distribution of their sum (or mean) approaches Gaussian

• Probability density:

  f(x | µ, σ²) = 1/√(2πσ²) * e^z,  where z = -(x - µ)² / (2σ²)
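A direct transcription into Python (not part of the slides):

```python
# Univariate Gaussian density, exactly as written above.
import math

def gaussian_pdf(x, mu, sigma2):
    """f(x | mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))"""
    z = -(x - mu) ** 2 / (2 * sigma2)
    return math.exp(z) / math.sqrt(2 * math.pi * sigma2)

print(gaussian_pdf(0.0, 0.0, 1.0))   # ~0.3989, the peak of the standard normal
```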

Page 40: Human vs. Machine

Multivariate Mixture Gaussian Distribution

• Multiple independent random variables
• Each variable can have its own mean and variance

[Figure: a surface plot of two independent random variables X and Y]

Page 41: Human vs. Machine

Multivariate Normal Distribution

Page 42: Human vs. Machine

Determinant of a 3x3 Matrix

Example: compute the determinant of

  | 5 3 4 |
  | 2 1 5 |
  | 3 6 2 |
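A worked cofactor expansion (not on the slide), checked against NumPy:

```python
# Expand along the first row: a(ei - fh) - b(di - fg) + c(dh - eg).
import numpy as np

A = [[5, 3, 4],
     [2, 1, 5],
     [3, 6, 2]]

det = (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
     - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
     + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

print(det)                                 # 5*(2-30) - 3*(4-15) + 4*(12-3) = -71
print(round(np.linalg.det(np.array(A))))   # -71
```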

Page 43: Human vs. Machine

Bayes' Rule

• P(A | B) = P(B | A) * P(A) / P(B)

• P(B) = ∑k=1,n P(B | Ak) * P(Ak)

• P(Ai | B) = P(B | Ai) P(Ai) / ∑k=1,n P(B | Ak) * P(Ak)

[Figure: event B overlapping a partition of the sample space into A1 … A5]

Fundamental to many speech recognition algorithms

Max [P(word | sound)] = Max [P(sound | word) * P(word) / P(sound)] = Max [P(sound | word) * P(word)] because the denominator is a constant
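A toy sketch of that argmax (not from the slides); the two candidate words and all probabilities are invented for illustration, where a real system would get P(sound | word) from an acoustic model and P(word) from a language model.

```python
# Pick the word that maximizes P(sound | word) * P(word); P(sound) is dropped
# because it is the same constant for every candidate word.

p_word = {"duck": 0.004, "dock": 0.001}              # prior P(word), invented
p_sound_given_word = {"duck": 0.30, "dock": 0.45}    # acoustic likelihood, invented

best = max(p_word, key=lambda w: p_sound_given_word[w] * p_word[w])
print(best)   # 'duck': 0.30*0.004 = 0.0012 beats 0.45*0.001 = 0.00045
```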

Page 44: Human vs. Machine

Bayes Example

Probability that a car will be late to its destination

Page 45: Human vs. Machine

Noisy Channel Decoding

• Assume input = word w, feature vector = f, V = vocabulary
  – We want to find the word ŵ = argmax w∈V P(w | f)
  – Using Bayes’ rule: ŵ = argmax w∈V P(f | w) P(w) / P(f)

• Why use Bayes’ rule?
  – P(w | f) is difficult to compute
  – P(f | w) is relatively easy to compute: just add probabilities to reflect spelling or pronunciation variation rules
  – P(w) is how often w occurs in a large corpus (the prior probability)
  – Ignore P(f): f doesn’t change as we search the lexicon

Source → Noisy Channel → Decoder

Page 46: Human vs. Machine

Bayesian Inference

• Randomly choose a student from a population of ten: 4 vegetarians, 3 CS majors, 2 who are both. Find the probabilities:
  – p(vegetarian) = .4, p(cs major) = .3
  – A vegetarian student is a CS major? p(c|v) = .5 = p(c) p(v|c) / p(v)
  – A student is both a vegetarian and a CS major? p(c,v) = .2 = p(v) p(c|v) = p(c) p(v|c)
  – A CS-major student is a vegetarian? p(v|c) = .66 = p(v) p(c|v) / p(c)

Page 47: Human vs. Machine

Definitions

• Stochastic process: a process of change of one or more random variables {Xi} over time, based on a well-defined set of probabilities

• Markov model: a Markov model consists of a list of the possible states of a system, the possible transitions from one state to another, and the rates that govern those transitions. Transitions can depend on the current state and some number of previous states.

• Markov chain
  – A Markov model with a finite number of states in which the probability of the next state depends only on the current state

• Examples
  – The next phoneme’s probability depends solely on the preceding one in the sequence
  – A model of word or phoneme prediction that uses the previous N-1 words or phonemes to predict the next (N-gram model)
  – Hidden Markov model: predicting the hidden cause after observing the output (predicting the words, when observing the features)

Page 48: Human vs. Machine

Vector Quantization

• Partition the data into cells (Ci)
• Cell centroids are the quantized values zi
• Compute the distance between received data and the centroids
• Received data are quantized into one of the cells
  – q(x) = zi if x falls in cell Ci
• Distortion (distance) formulas (a sketch follows below)
  – Euclidean: d(x, z) = ∑i=1,D (xi – zi)²
  – Mahalanobis: d(x, z) = (x - z)T ∑^-1 (x - z), i.e., Euclidean distance scaled by the (co)variance
  – D is the dimension of the feature vector
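A minimal Python sketch of quantization with the Euclidean distortion measure (not from the slides); the codebook centroids are illustrative values, not a trained codebook.

```python
# Replace a feature vector by the index of the nearest codebook centroid.

def euclidean(x, z):
    """Squared Euclidean distortion d(x, z) = sum_i (x_i - z_i)^2."""
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z))

def quantize(x, centroids):
    """q(x) = index of the cell whose centroid is closest to x."""
    return min(range(len(centroids)), key=lambda i: euclidean(x, centroids[i]))

codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]       # illustrative centroids
print(quantize((0.9, 1.2), codebook))                  # 1, nearest to (1.0, 1.0)
```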

Page 49: Human vs. Machine

K-Means Algorithm

• Input:
  – F = {f1, …, fk} is a list of feature vectors
  – N = desired number of categories (phoneme types)

• Output:
  – C = {c1, …, cN}, the center of each category
  – m: F → C maps each feature vector to one of the categories

• Pseudocode (a runnable sketch follows below)
  Randomly assign the members of F to an initial C
  WHILE true
      FOR EACH fj ∈ F, assign fj to the closest ck
      IF no reassignments have taken place THEN BREAK
      Recompute the center of each member of C

• Issues
  – What metric do we use to compute distances?
  – A poor initial selection will lead to incorrect results or poor performance
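A runnable version of the pseudocode (a sketch, assuming a Euclidean metric and random initial assignment; small and unoptimized):

```python
import random

def kmeans(F, N, iterations=100):
    """F: list of feature vectors (tuples); N: desired number of categories."""
    assign = [random.randrange(N) for _ in F]            # random initial C
    centers = []
    for _ in range(iterations):
        # Recompute the center of each category (re-seed empty categories).
        centers = []
        for k in range(N):
            members = [f for f, a in zip(F, assign) if a == k]
            if members:
                centers.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                centers.append(random.choice(F))
        # Assign each feature vector to its closest center (Euclidean metric).
        new_assign = [min(range(N),
                          key=lambda k: sum((x - c) ** 2
                                            for x, c in zip(f, centers[k])))
                      for f in F]
        if new_assign == assign:                         # no reassignments: done
            break
        assign = new_assign
    return centers, assign

data = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers, labels = kmeans(data, 2)
print(labels)    # e.g. [0, 0, 1, 1]; cluster numbering may vary
```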

Page 50: Human vs. Machine

LBG Extension of K Means

1. Let M = 1 to form a single partition
2. Find the centroid of all training data: (1/T) ∑i=1,T xi
3. While M < the desired number of partitions:
   i.   Compute each centroid position
   ii.  Replace each old centroid with the new one
   iii. Split each partition in two
   iv.  Estimate a centroid in each half
   v.   Use the k-means algorithm to optimize the centroid positions
   vi.  M = 2 * M

Linde, Buzo, and Gray

Page 51: Human vs. Machine

Maximum Likelihood Formulation

• Given a vector of observed outcomes, choose the parameter vector ϴ = (ϴ1, …, ϴn) that maximizes the probability of those outcomes.

[Figure: the likelihood expression, written out for each ϴi]

Page 52: Human vs. Machine

Context Free Grammar Example

Goal: estimate the frequency of a particular grammatical construction

Parameters
  – X = the set of all possible parse trees
  – T = {t1, …, tO}, where ti ∈ X are the observed parse trees
  – Let ϴp = the probability that a parse applies production p ∈ P
  – Parameter space Ω = the set of ϴ ∈ [0,1]^|P| where, for each non-terminal, the ϴp of its productions sum to 1
  – C(ti, p) = the number of times production p appears in tree ti

Estimate of a parse tree’s probability: P(t | ϴ) = ∏p∈P ϴp^C(t,p)
Easier to deal with logs: log P(t | ϴ) = ∑p∈P C(t,p) log ϴp
Estimate over all trees: L(ϴ) = ∑t log P(t | ϴ) = ∑t ∑p∈P C(t,p) log ϴp
ϴMostLikely = argmax ϴ∈Ω L(ϴ) (see the counting sketch below)

G = (N, T, s0, P, F)
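A counting sketch of the maximum-likelihood estimate (not from the slides): count how often each production occurs in the observed trees and normalize, here per left-hand-side non-terminal as in the usual PCFG estimate. The productions and counts are invented for illustration.

```python
# Maximum-likelihood production probabilities from observed parse-tree counts.
from collections import Counter, defaultdict

# Sum of C(t, p) over the observed trees, keyed by production (lhs, rhs).
counts = Counter({
    ("S",  ("NP", "VP")): 40,
    ("NP", ("Det", "N")): 30,
    ("NP", ("N",)):       10,
    ("VP", ("V", "NP")):  25,
    ("VP", ("V",)):       15,
})

totals = defaultdict(int)
for (lhs, _rhs), c in counts.items():
    totals[lhs] += c

theta = {p: c / totals[p[0]] for p, c in counts.items()}
print(theta[("NP", ("Det", "N"))])   # 30 / (30 + 10) = 0.75
```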

Page 53: Human vs. Machine

EM Algorithm

1. Perform an initial maximum likelihood (ML) estimation
2. Expectation step: compute the expected value of the maximum likelihood function with respect to the observed distribution
3. Maximization step: use the adjusted values computed in step 2 to refine the parameter estimates
4. Repeat steps 2 and 3 until the algorithm converges

Note: The Baum-Welch hidden Markov model algorithm, which we will discuss later, is a special case of the EM algorithm

EM = Expectation-Maximization

Page 54: Human vs. Machine

Decision Trees

[Figure: two scatter plots of data points, showing a reasonably good partition vs. a poor partition]

Partition the data with a series of questions, each with a discrete set of answers

Page 55: Human vs. Machine

CART Algorithm

1. Create a set of questions that can distinguish between the measured variables
   a. Singleton questions: Boolean (yes/no or true/false) answers
   b. Complex questions: many possible answers
2. Initialize the tree with one root node
3. Compute the entropy for the node to be split
4. Pick the question with the greatest entropy gain
5. Split the tree based on step 4
6. Return to step 3 as long as nodes remain to split
7. Prune the tree to the optimal size by removing leaf nodes with minimal improvement

Classification and regression trees

Note: We build the tree from top down. We prune the tree from bottom up.

Page 56: Human vs. Machine

Example: Play or not play?

  Outlook    Temperature   Humidity   Windy   Play?
  sunny      hot           high       false   No
  sunny      hot           high       true    No
  overcast   hot           high       false   Yes
  rain       mild          high       false   Yes
  rain       cool          normal     false   Yes
  rain       cool          normal     true    No
  overcast   cool          normal     true    Yes
  sunny      mild          high       false   No
  sunny      cool          normal     false   Yes
  rain       mild          normal     false   Yes
  sunny      mild          normal     true    Yes
  overcast   mild          high       true    Yes
  overcast   hot           normal     false   Yes
  rain       mild          high       true    No

Questions
  1) What is the outlook?
  2) What is the temperature?
  3) What is the humidity?
  4) Is it windy?

Goal: order the questions in the most efficient way

Page 57: Human vs. Machine

Example tree for “Do we play?”

  Outlook
    sunny → Humidity
      high → No
      normal → Yes
    overcast → Yes
    rain → Windy
      true → No
      false → Yes

Goal: Find the optimal tree

Page 58: Human vs. Machine

Which question to select?

witten&eibe

Page 59: Human vs. Machine

Computing Entropy

• Entropy: the number of bits needed to store the possible question answers
• Formula for the entropy of a question:

  Entropy(p1, p2, …, pn) = -p1 log2 p1 – p2 log2 p2 – … – pn log2 pn

  where pi is the probability of the i-th answer to the question and log2 x is the logarithm base 2 of x

• Examples (see the sketch below):
  – A coin toss requires one bit (head = 1, tail = 0)
  – A question with 30 equally likely answers requires ∑i=1,30 -(1/30) log2(1/30) = -log2(1/30) = 4.907 bits
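The formula in a few lines of Python (not part of the slides), with the two examples as checks:

```python
import math

def entropy(probs):
    """Entropy(p1..pn) = -sum(pi * log2(pi)); zero probabilities contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # 1.0 bit for a fair coin toss
print(entropy([1 / 30] * 30))    # ~4.907 bits for 30 equally likely answers
```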

Page 60: Human vs. Machine

Example: the question “Outlook”

Compute the entropy for the question: What is the outlook?

Entropy(“Outlook” = “Sunny”) = Entropy(0.4, 0.6) = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.971
  Five outcomes: 2 for play (P = 0.4), 3 for not play (P = 0.6)

Entropy(“Outlook” = “Overcast”) = Entropy(1.0, 0.0) = -1 log2(1.0) - 0 log2(0.0) = 0.0
  Four outcomes, all for play: P = 1.0 for play and P = 0.0 for no play

Entropy(“Outlook” = “Rainy”) = Entropy(0.6, 0.4) = -0.6 log2(0.6) - 0.4 log2(0.4) = 0.971
  Five outcomes: 3 for play (P = 0.6), 2 for not play (P = 0.4)

Entropy(Outlook) = Entropy(Sunny, Overcast, Rainy) = 5/14*0.971 + 4/14*0 + 5/14*0.971 = 0.693

Page 61: Human vs. Machine

Computing the Entropy Gain

• Original entropy: do we play?

  Entropy(“Play”) = Entropy(9/14, 5/14) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940
  14 outcomes: 9 for play (P = 9/14), 5 for not play (P = 5/14)

• Information gain = (information before) – (information after)

  gain(“Outlook”) = 0.940 – 0.693 = 0.247

• Information gain for the other weather questions (verified in the sketch below)
  – gain(“Temperature”) = 0.029
  – gain(“Humidity”) = 0.152
  – gain(“Windy”) = 0.048

• Conclusion: ask “What is the outlook?” first
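The gains above can be checked with a short script (not from the slides); it redefines the entropy helper so the snippet is self-contained and uses the 14-row table from the earlier slide.

```python
import math
from collections import Counter, defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

rows = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", "false", "No"), ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"), ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"), ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"), ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"), ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"), ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"), ("rain", "mild", "high", "true", "No"),
]

def label_entropy(subset):
    counts = Counter(r[-1] for r in subset)            # Yes/No counts
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def gain(column):
    groups = defaultdict(list)
    for r in rows:
        groups[r[column]].append(r)                    # split on the question
    after = sum(len(g) / len(rows) * label_entropy(g) for g in groups.values())
    return label_entropy(rows) - after                 # before minus after

for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Windy", 3)]:
    print(name, round(gain(col), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```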

Page 62: Human vs. Machine

Continuing to split

gain(“Temperature”) = 0.571 bits
gain(“Humidity”) = 0.971 bits
gain(“Windy”) = 0.020 bits

For each child question, do the same thing to form the complete decision tree

Example: after the “outlook = sunny” node, we can still ask about temperature, humidity, and windiness

Page 63: Human vs. Machine

The final decision tree

Note: The splitting stops when further splits don't reduce entropy more than some threshold value

Page 64: Human vs. Machine

Senone Model

Definition: a senone is a cluster of similar Markov states

• Goal: reduce the number of trainable units that the recognizer needs to process

• Approach:
  – HMMs represent sub-phonetic units
  – A tree structure combines the sub-phonetic units
  – The phoneme recognizer searches the tree to find HMMs
  – Nodes are partitioned with questions about neighboring phones, for example:
    “Is the left phone sonorant or nasal?”  “Is the left phone s, z, sh, or zh?”
    “Is the right phone a back R?”  “Is the left phone a back L?”  “Is the right phone voiced?”

• Performance:
  – Triphones reduce the error rate by 15%
  – Senones reduce the error rate by 24%

Page 65: Human vs. Machine

Scoring Acoustic Features

• Choose the model: discrete, continuous, or semi-continuous

• Continuous: insufficient training data
  – Consider discrete ranges of values
  – Problem: it is difficult to determine the boundaries between ranges

• Discrete or semi-continuous
  – Consider multiple codebooks
  – Multiple codebooks require adjustments to the HMM formulas
  – For example: αt,j = (∑i=0,N-1 αt-1,i ai,j) ∏k bj,k(xt)

• Decide whether to use a word or sub-word model
  – Word model: collect training data for each word
  – Sub-word model: share the subunits across the vocabulary