
Slide 1: LSA 352, Speech Recognition and Synthesis

Dan Jurafsky

Lecture 7: Baum-Welch/Learning and Disfluencies

IP Notice: Some of these slides were derived from Andrew Ng's CS 229 notes, as well as lecture notes from Chen, Picheny et al., and Bryan Pellom. I'll try to accurately give credit on each slide.

Slide 2: Outline for Today

Learning: EM for HMMs (the "Baum-Welch Algorithm")
Embedded training
Training mixture Gaussians
Disfluencies

Slide 3: The Learning Problem

Baum-Welch = Forward-Backward Algorithm (Baum 1972)
A special case of the EM or Expectation-Maximization algorithm (Dempster, Laird, Rubin)
The algorithm will let us train the transition probabilities A = {aij} and the emission probabilities B = {bi(ot)} of the HMM

Slide 4: Input to Baum-Welch

O: unlabeled sequence of observations
Q: vocabulary of hidden states

For the ice-cream task:
O = {1, 3, 2, ...}
Q = {H, C}

Slide 5: Starting out with Observable Markov Models

How to train? Run the model on the observation sequence O. Since it's not hidden, we know which states we went through, hence which transitions and observations were used.

Given that information, training:
B = {bk(ot)}: since every state can only generate one observation symbol, the observation likelihoods B are all 1.0
A = {aij}:

    aij = C(i → j) / Σ_{q ∈ Q} C(i → q)
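As a small sketch of that count-based estimate (my own illustration, not from the slides), aij for a fully observable Markov model could be computed like this:

```python
from collections import Counter, defaultdict

def estimate_transitions(state_sequence):
    """MLE of a_ij = C(i -> j) / sum_q C(i -> q) from a fully observed state path."""
    counts = defaultdict(Counter)
    for i, j in zip(state_sequence, state_sequence[1:]):
        counts[i][j] += 1
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}

# Toy usage on a hot/cold state path
print(estimate_transitions(["H", "H", "C", "H", "C", "C", "H"]))
```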

Slide 6: Extending the Intuition to HMMs

For an HMM, we cannot compute these counts directly from observed sequences.
Baum-Welch intuitions:
Iteratively estimate the counts.
– Start with an estimate for aij and bk, iteratively improve the estimates
Get estimated probabilities by:
– computing the forward probability for an observation
– dividing that probability mass among all the different paths that contributed to this forward probability

Slide 7: The Backward Algorithm

We define the backward probability as follows:

    βt(i) = P(ot+1, ot+2, ..., oT | qt = i, Φ)

This is the probability of generating the partial observations from time t+1 to the end (ot+1 ... oT), given that the HMM is in state i at time t and, of course, given Φ.

Slide 8: The Backward Algorithm

We compute the backward probability by induction, working backward from time T:

    βt(i) = Σ_{j=1..N} aij bj(ot+1) βt+1(j)

Slide 9: Inductive step of the backward algorithm (figure inspired by Rabiner and Juang)

[Figure: βt(i) computed as a weighted sum of all successor values βt+1(j)]
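A minimal NumPy sketch of both lattice passes (my own illustration, assuming a simple HMM with no explicit final state; A is the N x N transition matrix, B the N x V emission matrix, pi the initial distribution):

```python
import numpy as np

def forward(A, B, pi, observations):
    """alpha[t, i] = P(o_1..o_t, q_t = i); the forward pass from the previous lecture,
    included here so the later re-estimation sketches have something to call."""
    N, T = A.shape[0], len(observations)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, observations[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, observations[t]]
    return alpha

def backward(A, B, observations):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i), via the induction
    beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)."""
    N, T = A.shape[0], len(observations)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                              # base case
    for t in range(T - 2, -1, -1):                 # induction, right to left
        beta[t] = A @ (B[:, observations[t + 1]] * beta[t + 1])
    return beta
```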

Slide 10: Intuition for re-estimation of aij

We will estimate âij via this intuition:

    âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

Numerator intuition: assume we had some estimate of the probability that a given transition i → j was taken at time t in the observation sequence. If we knew this probability for each time t, we could sum over all t to get the expected value (count) for i → j.

Slide 11: Re-estimation of aij

Let ξt(i, j) be the probability of being in state i at time t and state j at time t+1, given O1..T and the model Φ:

    ξt(i, j) = P(qt = i, qt+1 = j | O, Φ)

We can compute ξ from not-quite-ξ, which is:

    not_quite_ξt(i, j) = P(qt = i, qt+1 = j, O | Φ)

Slide 12: Computing not-quite-ξ

The four components of P(qt = i, qt+1 = j, O | Φ): αt(i), βt+1(j), aij, and bj(ot+1)

Slide 13: From not-quite-ξ to ξ

We want:

    ξt(i, j) = P(qt = i, qt+1 = j | O, Φ)

We've got:

    not_quite_ξt(i, j) = P(qt = i, qt+1 = j, O | Φ)

Which we compute, from the four components above, as:

    not_quite_ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j)

Slide 14: From not-quite-ξ to ξ

We want:

    ξt(i, j) = P(qt = i, qt+1 = j | O, Φ)

We've got:

    not_quite_ξt(i, j) = P(qt = i, qt+1 = j, O | Φ)

Since P(X | O, Φ) = P(X, O | Φ) / P(O | Φ), we need P(O | Φ):

    ξt(i, j) = not_quite_ξt(i, j) / P(O | Φ)

Slide 15: From not-quite-ξ to ξ

    ξt(i, j) = not_quite_ξt(i, j) / P(O | Φ)

Slide 16: From ξ to aij

The expected number of transitions from state i to state j is the sum over all t of ξt(i, j).
The total expected number of transitions out of state i is the sum over all transitions out of state i.
Final formula for the re-estimated aij:

    âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
        = Σt ξt(i, j) / Σt Σk ξt(i, k)
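As an illustrative sketch (not from the slides), ξ and the re-estimated transition matrix can be computed from the forward and backward lattices of the earlier sketch; P(O | Φ) is taken as the sum of the final forward probabilities since there is no explicit final state:

```python
import numpy as np

def reestimate_transitions(A, B, observations, alpha, beta):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O); returns a_hat_ij = sum_t xi / sum_t sum_k xi."""
    T, N = len(observations), A.shape[0]
    prob_O = alpha[T - 1].sum()                    # P(O | model)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        not_quite_xi = (alpha[t][:, None] * A *
                        B[:, observations[t + 1]] * beta[t + 1])
        xi[t] = not_quite_xi / prob_O              # divide out P(O | model)
    numer = xi.sum(axis=0)                         # expected i -> j transition counts
    denom = numer.sum(axis=1, keepdims=True)       # expected transitions out of i
    return numer / denom
```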

Slide 17: Re-estimating the observation likelihood b

    b̂j(vk) = (expected number of times in state j and observing symbol vk) / (expected number of times in state j)

We'll need to know the probability of being in state j at time t: γt(j).

Slide 18: Computing gamma

    γt(j) = P(qt = j | O, Φ) = αt(j) βt(j) / P(O | Φ)
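Continuing the sketch above (again my own illustration, for a discrete-output HMM), γ and the re-estimated emission matrix follow directly:

```python
import numpy as np

def reestimate_emissions(alpha, beta, observations, num_symbols):
    """gamma[t, j] = alpha[t, j] * beta[t, j] / P(O);
    b_hat_j(v_k) = expected count of being in j while seeing v_k / expected count of being in j."""
    T, N = alpha.shape
    prob_O = alpha[T - 1].sum()
    gamma = alpha * beta / prob_O
    B_hat = np.zeros((N, num_symbols))
    for t, symbol in enumerate(observations):
        B_hat[:, symbol] += gamma[t]               # numerator: time steps where v_k was seen
    B_hat /= gamma.sum(axis=0)[:, None]            # denominator: expected time spent in each state
    return gamma, B_hat
```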

Slide 19: Summary

âij: the ratio between the expected number of transitions from state i to j and the expected number of all transitions from state i.

b̂j(vk): the ratio between the expected number of times the observation emitted from state j is vk, and the expected number of times any observation is emitted from state j.

Slide 20: Summary: Forward-Backward Algorithm

1) Initialize Φ = (A, B)
2) Compute α, β, ξ
3) Estimate new Φ' = (A, B)
4) Replace Φ with Φ'
5) If not converged, go to 2

Slide 21: The Learning Problem: Caveats

Network structure of the HMM is always created by hand: no algorithm for double induction of optimal structure and probabilities has been able to beat simple hand-built structures.
Always a Bakis network = links go forward in time.
Subcase of a Bakis net: the beads-on-a-string net.
Baum-Welch is only guaranteed to return a local max, rather than the global optimum.

Slide 22: Complete Embedded Training

Setting all the parameters in an ASR system.
Given:
– training set: wavefiles and word transcripts for each sentence
– hand-built HMM lexicon
Uses: the Baum-Welch algorithm
We'll return to this after we've introduced GMMs.

Slide 23: Baum-Welch for Mixture Models

By analogy with ξ earlier, let's define the probability of being in state j at time t with the m-th mixture component accounting for ot:

    ξtm(j) = [ Σ_{i=1..N} αt−1(i) aij cjm bjm(ot) βt(j) ] / αF(T)

Now:

    μjm = Σ_{t=1..T} ξtm(j) ot / Σ_{t=1..T} Σ_{k=1..M} ξtk(j)

    cjm = Σ_{t=1..T} ξtm(j) / Σ_{t=1..T} Σ_{k=1..M} ξtk(j)

    Σjm = Σ_{t=1..T} ξtm(j) (ot − μjm)(ot − μjm)^T / Σ_{t=1..T} Σ_{k=1..M} ξtk(j)

Slide 24: How to train mixtures?

• Choose M (often 16; or M can be tuned depending on the amount of training observations)
• Then can do various splitting or clustering algorithms
• One simple method for "splitting" (see the sketch below):
  1) Compute the global mean µ and global variance
  2) Split into two Gaussians, with means µ ± ε (sometimes ε is 0.2σ)
  3) Run Forward-Backward to retrain
  4) Go to 2 until we have 16 mixtures
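A rough sketch of the splitting step (step 2), assuming diagonal-covariance Gaussians; the Forward-Backward retraining between splits is not shown:

```python
import numpy as np

def split_gaussians(means, variances, weights, epsilon=0.2):
    """Split every component in two, perturbing means by +/- epsilon * sigma
    and halving the mixture weights. means/variances: (M, dim), weights: (M,)."""
    offset = epsilon * np.sqrt(variances)
    new_means = np.concatenate([means + offset, means - offset])
    new_variances = np.concatenate([variances, variances])
    new_weights = np.concatenate([weights, weights]) / 2.0
    return new_means, new_variances, new_weights

# Start from one Gaussian at the global mean/variance, then split 1 -> 2 -> 4 -> ... -> 16
```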

Slide 25: Embedded Training

Components of a speech recognizer:
Feature extraction: not statistical
Language model: word transition probabilities, trained on some other corpus
Acoustic model:
– Pronunciation lexicon: the HMM structure for each word, built by hand
– Observation likelihoods bj(ot)
– Transition probabilities aij

Slide 26: Embedded training of the acoustic model

If we had hand-segmented and hand-labeled training data, with word and phone boundaries, we could just compute:
– B: means and variances of all our triphone Gaussians
– A: transition probabilities
And we'd be done!
But we don't have word and phone boundaries, nor phone labeling.

Slide 27: Embedded training

Instead:
We'll train each phone HMM embedded in an entire sentence.
We'll do word/phone segmentation and alignment automatically as part of the training process.

Slide 28: Embedded Training

Slide 29: Initialization: "Flat start"

Transition probabilities: set to zero any that you want to be "structurally zero"
– The γ probability computation includes the previous value of aij, so if it's zero it will never change
Set the rest to identical values
Likelihoods: initialize µ and σ of each state to the global mean and variance of all the training data
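A minimal flat-start sketch (my own illustration) for a 3-state left-to-right (Bakis) phone HMM with continuous features:

```python
import numpy as np

def flat_start(features, n_states=3):
    """Uniform transition probabilities where structurally allowed (self-loop or
    forward link), zero elsewhere; every state's Gaussian gets the global
    mean/variance of the training data. features: (n_frames, dim)."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        allowed = [j for j in (i, i + 1) if j < n_states]   # Bakis: self-loop + next state
        A[i, allowed] = 1.0 / len(allowed)                  # identical values for the rest
    means = np.tile(features.mean(axis=0), (n_states, 1))
    variances = np.tile(features.var(axis=0), (n_states, 1))
    return A, means, variances
```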

Slide 30: Embedded Training

Now we have estimates for A and B.
So we just run the EM algorithm.
During each iteration, we compute the forward and backward probabilities and use them to re-estimate A and B.
Run EM till convergence.

Slide 31: Viterbi training

Baum-Welch training says:
We need to know what state we were in, to accumulate counts of a given output symbol ot.
We'll compute ξi(t), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output ot.

Viterbi training says:
Instead of summing over all possible paths, just take the single most likely path.
Use the Viterbi algorithm to compute this "Viterbi" path, via "forced alignment".

Slide 32: Forced Alignment

Computing the "Viterbi path" over the training data is called "forced alignment",
because we know which word string to assign to each observation sequence;
we just don't know the state sequence.
So we use aij to constrain the path to go through the correct words,
and otherwise do normal Viterbi.
Result: a state sequence!

Slide 33: Viterbi training equations

Viterbi (compare the Baum-Welch formulas above):

    b̂j(vk) = nj(s.t. ot = vk) / nj

    âij = nij / ni

For all pairs of emitting states, 1 <= i, j <= N,
where nij is the number of frames with a transition from i to j in the best path,
and nj is the number of frames where state j is occupied.
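As a small sketch (my own), the Viterbi-training counts over a forced-alignment state path reduce to:

```python
from collections import Counter

def viterbi_counts(best_path, observations):
    """a_hat_ij = n_ij / n_i and b_hat_j(v_k) = n_j(o_t = v_k) / n_j,
    counted over the single best (forced-alignment) state path."""
    n_ij = Counter(zip(best_path, best_path[1:]))    # transitions i -> j in the best path
    n_i = Counter(best_path[:-1])                    # transitions leaving state i
    n_j = Counter(best_path)                         # frames where state j is occupied
    n_jk = Counter(zip(best_path, observations))     # frames in j emitting symbol v_k
    a_hat = {(i, j): c / n_i[i] for (i, j), c in n_ij.items()}
    b_hat = {(j, v): c / n_j[j] for (j, v), c in n_jk.items()}
    return a_hat, b_hat

# Toy usage: a two-state path over six frames
a_hat, b_hat = viterbi_counts(["s1", "s1", "s2", "s2", "s2", "s1"], [3, 1, 2, 2, 3, 1])
```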

Slide 34: Viterbi Training

Much faster than Baum-Welch,
but doesn't work quite as well.
Still, the tradeoff is often worth it.

Slide 35: Viterbi training (II)

Equations for non-mixture Gaussians:

    μi = (1 / Ni) Σ_{t=1..T} ot      s.t. qt = i

    σi² = (1 / Ni) Σ_{t=1..T} (ot − μi)²      s.t. qt = i

Viterbi training for mixture Gaussians is more complex; generally we just assign each observation to one mixture.

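A brief sketch of those per-state estimates from a forced alignment (my own illustration; features is a (T, dim) array and best_path the aligned state ids):

```python
import numpy as np

def viterbi_gaussian_stats(features, best_path):
    """mu_i = mean of the frames aligned to state i; sigma_i^2 = their variance."""
    states = np.asarray(best_path)
    params = {}
    for i in np.unique(states):
        frames = features[states == i]      # the N_i frames with q_t = i
        params[i] = (frames.mean(axis=0), frames.var(axis=0))
    return params
```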
Slide 36: Log domain

In practice, do all computation in the log domain.
This avoids underflow: instead of multiplying lots of very small probabilities, we add numbers that are not so small.
For a single multivariate Gaussian (diagonal Σ) we compute:

    bj(ot) = Π_{d=1..D} (1 / sqrt(2π σjd²)) exp( −(otd − μjd)² / (2σjd²) )

In log space:

    log bj(ot) = −(1/2) Σ_{d=1..D} [ log(2π σjd²) + (otd − μjd)² / σjd² ]
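A short sketch of the log-domain computations (my own illustration), including the standard log-add trick for summing probabilities whose logs we have:

```python
import numpy as np

def log_gaussian(o, mean, var):
    """log N(o; mean, diag(var)): a sum of logs instead of a product of tiny numbers."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mean) ** 2 / var)

def log_add(log_a, log_b):
    """log(a + b) given log a and log b, without leaving the log domain."""
    m = np.maximum(log_a, log_b)
    return m + np.log1p(np.exp(-np.abs(log_a - log_b)))
```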

Slide 37: Log domain

Repeating the log-space form: with some rearrangement of terms (and the constant terms precomputed), this looks like a weighted Mahalanobis distance!
That also may justify why these aren't really probabilities (they are point estimates); they are really just distances.

Slide 38: State Tying: Young, Odell, Woodland 1994

The steps in creating CD phones:
Start with monophones, do EM training.
Then clone Gaussians into triphones.
Then build a decision tree and cluster Gaussians.
Then clone and train mixtures (GMMs).

Slide 39: Summary: Acoustic Modeling for LVCSR

Increasingly sophisticated models.
For each state:
– Gaussians
– Multivariate Gaussians
– Mixtures of multivariate Gaussians
Where a state is progressively:
– CI phone
– CI subphone (3-ish per phone)
– CD phone (= triphones)
– State-tying of CD phones
Forward-Backward training
Viterbi training

Slide 40: Outline

Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments

Slide 41: Disfluencies

Slide 42: Disfluencies: standard terminology (Levelt)

Reparandum: the thing repaired
Interruption point (IP): where the speaker breaks off
Editing phase (edit terms): uh, I mean, you know
Repair: the fluent continuation

Slide 43: Why disfluencies?

Need to clean them up to get understanding:
– Does American Airlines offer any one-way flights [uh] one-way fares for 160 dollars?
– Delta leaving Boston seventeen twenty one arriving Fort Worth twenty two twenty one forty
Might help in language modeling: disfluencies might occur at particular positions (boundaries of clauses, phrases).
Annotating them helps readability.
Disfluencies cause errors in nearby words.

Slide 44: Counts (from Shriberg, Heeman)

Sentence disfluency rate:
– ATIS: 6% of sentences disfluent (10% of long sentences)
– Levelt human dialogs: 34% of sentences disfluent
– Swbd: ~50% of multiword sentences disfluent
– TRAINS: 10% of words are in a reparandum or editing phrase
Word disfluency rate:
– SWBD: 6%
– ATIS: 0.4%
– AMEX (human-human air travel): 13%

Slide 45: Prosodic characteristics of disfluencies

Nakatani and Hirschberg 1994
Fragments are good cues to disfluencies.
Prosody:
– Pause duration is shorter in disfluent silence than in fluent silence.
– F0 increases from the end of the reparandum to the beginning of the repair, but only a minor change.
– Repair interval offsets have a minor prosodic phrase boundary, even in the middle of an NP:
  "Show me all n- | round-trip flights | from Pittsburgh | to Atlanta"

Slide 46: Syntactic characteristics of disfluencies

Hindle (1983)
The repair often has the same structure as the reparandum;
both are Noun Phrases (NPs) in this example.
So if we could automatically find the IP, we could find and correct the reparandum!

Slide 47: Disfluencies and LM

Clark and Fox Tree
Looked at "um" and "uh"
– "uh" includes "er" ("er" is just the British non-rhotic dialect spelling for "uh")
Different meanings:
– Uh: used to announce minor delays; preceded and followed by shorter pauses
– Um: used to announce major delays; preceded and followed by longer pauses

Slide 48: Um versus uh: delays (Clark and Fox Tree)

Slide 49: Utterance Planning

The more difficulty speakers have in planning, the more delays.
Consider 3 locations:
– I: before the intonation phrase: hardest
– II: after the first word of the intonation phrase: easier
– III: later: easiest
"And then uh somebody said, . [I] but um -- [II] don't you think there's evidence of this, in the twelfth - [III] and thirteenth centuries?"

Slide 50: Delays at different points in the phrase

Slide 51: Disfluencies in language modeling

Should we "clean up" disfluencies before training the LM (i.e., skip over disfluencies)?
– Filled pauses: "Does United offer any [uh] one-way fares?"
– Repetitions: "What what are the fares?"
– Deletions: "Fly to Boston from Boston"
– Fragments (we'll come back to these): "I want fl- flights to Boston."

Slide 52: Does deleting disfluencies improve LM perplexity?

Stolcke and Shriberg (1996)
Build two LMs:
– Raw
– With disfluencies removed
See which has a higher perplexity on the data, for:
– Filled pauses
– Repetitions
– Deletions

Slide 53: Change in perplexity when filled pauses (FP) are removed

LM perplexity goes up at the following word.
Removing filled pauses makes the LM worse!
I.e., filled pauses seem to help to predict the next word. Why would that be?
(Stolcke and Shriberg 1996)

Slide 54: Filled pauses tend to occur at clause boundaries

The word before an FP is the end of the previous clause; the word after is the start of a new clause.
Best not to delete the FP.
"Some of the different things we're doing [uh] there's not time to do it all"
"there's" is very likely to start a sentence,
so P(there's | uh) is a better estimate than P(there's | doing).

Slide 55: Suppose we just delete medial FPs

Experiment 2:
Parts of SWBD were hand-annotated for clauses.
Build an FP-model by deleting only medial FPs.
Now prediction of the post-FP word (perplexity) improves greatly!
Siu and Ostendorf found the same with "you know".

Slide 56: What about REP and DEL?

S+S built a model with "cleaned-up" REP and DEL.
Slightly lower perplexity, but exactly the same word error rate (49.5%). Why?
– Rare: only 2 words per 100
– Doesn't help because adjacent words are misrecognized anyhow!

Slide 57: Stolcke and Shriberg conclusions wrt LM and disfluencies

Disfluency modeling purely in the LM probably won't vastly improve WER.
But:
– Disfluencies should be considered jointly with the sentence segmentation task.
– Need to do better at recognizing disfluent words themselves.
– Need acoustic and prosodic features.

Slide 58: WER reductions from modeling disfluencies + background events

Rose and Riccardi (1999)

Slide 59: HMIHY Background events

Out of 16,159 utterances:

Event               Count
Echoed Prompt       5353
Operator Utt.       5112
Background Speech   3585
Non-Speech Noise    8834
Breath              8048
Lipsmack            2171
Laughter            163
Hesitations         792
Word Fragments      1265
Filled Pauses       7189

Slide 60: Phrase-based LM

"I would like to make a collect call"
"a [wfrag]"
"<dig3> [brth] <dig3>"
"[brth] and I"

Slide 61: Rose and Riccardi (1999): modeling "LBEs" does help in WER

ASR word accuracy:

System Configuration      Test Corpora
HMM        LM             Greeting (HMIHY)   Card Number
Baseline   Baseline       58.7               87.7
LBE        Baseline       60.8               88.1
LBE        LBE            60.8               89.8

Slide 62: More on the location of FPs

Peters: medical dictation task
– Monologue rather than dialogue
– In this data, FPs occurred INSIDE clauses
– Trigram PP after an FP: 367; trigram PP after a word: 51
Stolcke and Shriberg (1996b)
– wk FP wk+1: looked at P(wk+1 | wk)
– Transition probabilities are lower for these transitions than for normal ones
Conclusion: people use FPs when they are planning difficult things, so the following words are likely to be unexpected/rare/difficult.

Slide 63: Detection of disfluencies

Nakatani and Hirschberg
Decision tree at the wi-wj boundary, with features:
– pause duration
– word fragments
– filled pauses
– energy peak within wi
– amplitude difference between wi and wj
– F0 of wi
– F0 differences
– whether wi is accented
Results: 78% recall / 89.2% precision

Slide 64: Detection/Correction

Bear, Dowding, Shriberg (1992)
System 1: hand-written pattern matching rules to find repairs
– Look for identical sequences of words
– Look for syntactic anomalies ("a the", "to from")
– 62% precision, 76% recall
– Rate of accurately correcting: 57%

Slide 65: Using Natural Language Constraints

Gemini natural language system
Based on the Core Language Engine
Full syntax and semantics for ATIS
Coverage of the whole corpus:
– 70% syntax
– 50% semantics

Slide 66: Using Natural Language Constraints

Gemini natural language system
Run the pattern matcher; for each sentence it returned:
– Remove fragment sentences, leaving 179 repairs and 176 false positives
– Parse each sentence
  – If it succeeds: mark as a false positive
  – If it fails: run the pattern matcher, make corrections, and parse again; if that succeeds, mark as a repair; if it fails, mark no opinion

Slide 67: NL Constraints

Syntax only: precision 96%
Syntax and semantics: correction 70%

Slide 68: Recent work: EARS Metadata Evaluation (MDE)

A recent multiyear DARPA bakeoff.
Sentence-like Unit (SU) detection:
– find the end points of SUs
– detect the subtype (question, statement, backchannel)
Edit word detection:
– find all words in the reparandum (words that will be removed)
Filler word detection:
– filled pauses (uh, um)
– discourse markers (you know, like, so)
– editing terms (I mean)
Interruption point detection
(Liu et al. 2003)

Slide 69: Kinds of disfluencies

Repetitions: "I * I like it"
Revisions: "We * I like it"
Restarts (false starts): "It's also * I like it"

Slide 70: MDE transcription

Conventions:
– ./ for statement SU boundaries
– <> for fillers
– [] for edit words
– * for the IP (interruption point) inside edits

"And <uh> <you know> wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./"
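As a small illustration (my own sketch, not part of the slides), those conventions can be pulled apart with a couple of regular expressions:

```python
import re

MDE = ("And <uh> <you know> wash your clothes wherever you are ./ "
       "and [ you ] * you really get used to the outdoors ./")

fillers = re.findall(r"<([^>]+)>", MDE)                    # <> marks fillers
edits = re.findall(r"\[([^\]]+)\]", MDE)                   # [] marks edit words
sus = [s.strip() for s in MDE.split("./") if s.strip()]    # ./ closes a statement SU
cleaned = re.sub(r"<[^>]+>|\[[^\]]+\]|\*|\s*\./", " ", MDE)
cleaned = " ".join(cleaned.split())                        # fluent text, markup removed

print(fillers)    # ['uh', 'you know']
print(edits)      # [' you ']
print(len(sus))   # 2 SUs
print(cleaned)
```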

Slide 71: MDE Labeled Corpora

                       CTS     BN
Training set (words)   484K    182K
Test set (words)       35K     45K
STT WER (%)            14.9    11.7
SU %                   13.6    8.1
Edit word %            7.4     1.8
Filler word %          6.8     1.8

Slide 72: MDE Algorithms

Use both text and prosodic features.
At each interword boundary:
– extract prosodic features (pause length, durations, pitch contours, energy contours)
– use an N-gram language model
– combine via an HMM, Maxent, CRF, or other classifier

Slide 73: State of the art: SU detection

2-stage:
– decision tree plus N-gram LM to decide the boundary
– second Maxent classifier to decide the subtype
Current error rates for finding boundaries:
– 40-60% using ASR
– 26-47% using transcripts

Slide 74: State of the art: Edit word detection

Multi-stage model:
– HMM combining LM and decision tree finds the IP
– Heuristic rules find the onset of the reparandum
– Separate repetition detector for repeated words
One-stage model:
– CRF jointly finds the edit region and the IP
– BIO tagging: each word is tagged as beginning of an edit, inside an edit, or outside any edit (see the small example after this slide)
Error rates:
– 43-50% using transcripts
– 80-90% using ASR

Slide 75: Using only lexical cues

3-way classification for each word: edit, filler, fluent.
Using TBL (transformation-based learning).
Templates: change
– word X from L1 to L2
– word sequence X Y to L1
– the left side of a simple repeat to L1
– word with POS X from L1 to L2 if followed by a word with POS Y

Slide 76: Rules learned

Label all fluent filled pauses as fillers.
Label the left side of a simple repeat as an edit.
Label "you know" as a filler.
Label fluent "well"s as fillers.
Label fluent fragments as edits.
Label "I mean" as a filler.

Slide 77: Error rates using only lexical cues

CTS, using transcripts:
– Edits: 68%
– Fillers: 18.1%
Broadcast News, using transcripts:
– Edits: 45%
– Fillers: 6.5%
Using speech:
– Broadcast News filler detection goes from 6.5% error to 57.2%
Other systems (using prosody) do better on CTS, but not on Broadcast News.

Slide 78: Conclusions: Lexical Cues Only

Can do pretty well with only words (as long as the words are correct).
Much harder to do fillers and fragments from ASR output, since recognition of these is bad.

Slide 79: Fragments

Incomplete or cut-off words:
– "Leaving at seven fif- eight thirty"
– "uh, I, I d-, don't feel comfortable"
– "You know the fam-, well, the families"
– "I need to know, uh, how- how do you feel…"
– "Uh yeah, yeah, well, it- it- that's right. And it-"
SWBD: around 0.7% of words are fragments (Liu 2003)
ATIS: 60.2% of repairs contain fragments; 6% of corpus sentences had at least 1 repair (Bear et al. 1992)
Another ATIS corpus: 74% of all reparanda end in word fragments (Nakatani and Hirschberg 1994)

Slide 80: Fragment glottalization

"Uh yeah, yeah, well, it- it- that's right. And it-"

Slide 81: Why fragments are important

Frequent enough to be a problem:
– Only 1% of words / 3% of sentences
– But if we miss a fragment, we're likely to get the surrounding words wrong (word segmentation error)
– So it could be 3 or more times worse (3% of words, 9% of sentences): a problem!
Useful for finding other repairs:
– In 40% of SRI-ATIS sentences containing fragments, the fragment occurred at the right edge of a long repair
– 74% of ATT-ATIS reparanda ended in fragments
Sometimes they are the only cue to a repair:
– "leaving at <seven> <fif-> eight thirty"

Slide 82: How fragments are dealt with in current ASR systems

In training, throw out any sentences with fragments.
In test, get them wrong.
Probably get neighboring words wrong too!

Slide 83: Cues for fragment detection

49/50 cases examined ended in a silence > 60 msec; average 282 ms (Bear et al.)
24 of 25 vowel-final fragments were glottalized (Bear et al.)
– Glottalization: increased time between glottal pulses
75% don't even finish the vowel in the first syllable, i.e., the speaker stopped after the first consonant (O'Shaughnessy)

Slide 84: Cues for fragment detection

Nakatani and Hirschberg (1994)
Word fragments tend to be content words:

Lexical Class    Tokens   Percent
Content          121      42%
Function         12       4%
Untranscribed    155      54%

Slide 85: Cues for fragment detection

Nakatani and Hirschberg (1994)
91% are one syllable or less:

Syllables   Tokens   Percent
0           113      39%
1           149      52%
2           25       9%
3           1        0.3%

Slide 86: Cues for fragment detection

Nakatani and Hirschberg (1994)
Fricative-initial fragments are common; vowel-initial ones are not:

Class   % words   % frags   % 1-C frags
Fric    33%       45%       73%
Vowel   25%       13%       0%
Stop    23%       23%       11%

Slide 87: Liu (2003): Acoustic-prosodic detection of fragments

Prosodic features:
– Duration (from alignments): of the word, the pause, the last rhyme in the word; normalized in various ways
– F0 (from a pitch tracker): modified to compute stylized speaker-specific contours
– Energy: frame-level, modified in various ways

Slide 88: Liu (2003): Acoustic-prosodic detection of fragments

Voice quality features:
– Jitter: a measure of perturbation in the pitch period (Praat computes this)
– Spectral tilt: overall slope of the spectrum; speakers modify this when they stress a word
– Open quotient: the ratio of the time the vocal folds are open to the total length of the glottal cycle; can be estimated from the first and second harmonics; in creaky voice (laryngealization) the vocal folds are held together, so the open quotient is short

Slide 89: Liu (2003)

Used Switchboard with an 80%/20% split.
Downsampled to 50% fragments, 50% words.
Generated forced alignments with gold transcripts.
Extracted prosodic and voice quality features.
Trained a decision tree.

Slide 90: Liu (2003) results

Precision 74.3%, recall 70.1%

                      hypothesis: fragment   hypothesis: complete
reference: fragment   101                    43
reference: complete   35                     109

Slide 91: Liu (2003) features

Features most queried by the decision tree:

Feature                                                        %
Jitter                                                         .272
Energy slope difference between current and following word     .241
Ratio between F0 before and after boundary                     .238
Average OQ                                                     .147
Position of current turn                                       .084
Pause duration                                                 .018

Slide 92: Liu (2003) conclusion

Very preliminary work.
Fragment detection is a good problem that is understudied!

Slide 93: Fragments in other languages

Mandarin (Chu, Sung, Zhao, Jurafsky 2006)
Fragments cause similar errors as in English:
– 但我﹣我問的是 ("I- I was asking…")
– 他是﹣卻很﹣活的很好 ("He very- lived very well")

Slide 94: Fragments in Mandarin

Mandarin fragments are unlike English ones: no glottalization.
Instead: mostly (unglottalized) repetitions.
So the best features are lexical, rather than voice quality.

Slide 95: Summary

Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments