speech recognition

- Ajay Iyer

Outline

What is a Spectrogram?
Types of Spectrograms
Linguistic/Acoustic Categories
Prosodic Analysis
Pitch Estimation

What is a Spectrogram?

A spectrogram is a visual representation of an acoustic signal.

It displays the amplitude, frequency and temporal content of the signal.

Depending on the size of the Fourier analysis window, different resolutions in frequency and time are achieved.

A long analysis window resolves frequency at the expense of time, giving a “narrowband spectrogram”.

A short analysis window, on the other hand, resolves time at the expense of frequency, giving a “wideband spectrogram”.
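The trade-off between the two window lengths can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the original slides: the function name `window_resolution`, the 16 kHz sampling rate and the 5 ms / 30 ms window lengths are all assumed for illustration, and the frequency resolution is approximated by the DFT bin spacing Fs/N (the effective resolution also depends on the window shape).

```python
def window_resolution(window_ms, fs):
    """Approximate time/frequency resolution for one analysis window.

    window_ms: window length in milliseconds (assumed value)
    fs: sampling rate in Hz (assumed value)
    """
    n = round(window_ms * fs / 1000)    # window length in samples
    return {
        "samples": n,
        "time_resolution_s": n / fs,    # stretch of signal averaged into one column
        "freq_resolution_hz": fs / n,   # spacing between DFT bins
    }

# Illustrative values only: a short and a long window at 16 kHz.
wide = window_resolution(5, 16000)      # short window: wideband spectrogram
narrow = window_resolution(30, 16000)   # long window: narrowband spectrogram
print("wideband:  ", wide)
print("narrowband:", narrow)
```

With these assumed values the long window has finer bin spacing but covers a longer stretch of signal, which is the trade-off described above.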

Types of Spectrograms

[Figures: narrowband spectrogram; wideband spectrogram]

Linguistic/Acoustic Categories

Labeling the linguistic and/or acoustic categories speeds up the search and decoding algorithms by discarding impossible and highly unlikely phoneme combinations.

Implementation: the given phoneme is compared against the categories defined in the TIMIT lexicon.

The category thus obtained is displayed along with the phoneme, as shown in the following slide.

Linguistic/Acoustic Categories
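The category lookup described under Implementation can be sketched as a simple table lookup. This is an illustrative sketch only, not the code behind the slides: the names `TIMIT_CATEGORIES`, `category_of` and `label` are hypothetical, and the table is an abbreviated version of the broad phone classes used in the TIMIT documentation (closures and silence symbols are left out).

```python
# Broad phone classes, abbreviated from the TIMIT phone-code groupings.
# Assumption: the slides may have used a different or finer-grained table.
TIMIT_CATEGORIES = {
    "stop": ["b", "d", "g", "p", "t", "k", "dx", "q"],
    "affricate": ["jh", "ch"],
    "fricative": ["s", "sh", "z", "zh", "f", "th", "v", "dh"],
    "nasal": ["m", "n", "ng", "em", "en", "eng", "nx"],
    "semivowel/glide": ["l", "r", "w", "y", "hh", "hv", "el"],
    "vowel": ["iy", "ih", "eh", "ey", "ae", "aa", "aw", "ay", "ah", "ao",
              "oy", "ow", "uh", "uw", "ux", "er", "ax", "ix", "axr", "ax-h"],
}

# Invert the table once so each phoneme maps directly to its category.
PHONEME_TO_CATEGORY = {
    phone: category
    for category, phones in TIMIT_CATEGORIES.items()
    for phone in phones
}

def category_of(phoneme):
    """Return the broad category of a phoneme, or "unknown" if it is not listed."""
    return PHONEME_TO_CATEGORY.get(phoneme.lower(), "unknown")

def label(phonemes):
    """Pair each phoneme with its category, as it would be displayed alongside it."""
    return [(ph, category_of(ph)) for ph in phonemes]

print(label(["n", "ow"]))
```

A decoder could use such labels to skip category sequences it considers impossible, but how the pruning was done is not stated in the slides.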

Prosodic Analysis

Acoustically speaking, prosody refers to variations in syllable duration, loudness, pitch and the formant frequencies of the speech signal.

Prosodic features are suprasegmental, i.e. they are not restricted to any one segment of speech; they occur at a higher level of an utterance.

For example: “No!”, “Don’t!”

Pitch

Of the various prosodic features, the most important one is pitch.

Knowing the pitch makes it possible to differentiate between the contexts in which a word is spoken, viz. alerting or referential contexts.

Thus, incorporating pitch information increases the accuracy of the recognizer.

Implementation

The pitch.m file uses cepstral analysis to extract pitch information.

pitch.m performs the analysis on a single analysis frame segment.

Frame-based analysis has been coded for pitch estimation of the entire speech signal.

The estimated fundamental frequency (pitch) corresponds to the time instant

tpitch = tinterval(frameNum - 1) + fo/Fs;
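Cepstral pitch estimation on a single frame can be sketched as follows. This is not the pitch.m code, which is not shown in the slides; it is a minimal Python illustration under stated assumptions. The function names `real_cepstrum` and `estimate_pitch`, the 50 to 400 Hz search range, the Hamming window and the 8 kHz sampling rate are all assumed, and the test frame is a synthetic pulse train with a 40-sample period (200 Hz at 8 kHz) standing in for voiced speech.

```python
import cmath
import math

def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    # Hamming window to reduce spectral leakage at the frame edges.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # Naive O(n^2) DFT keeps the sketch dependency-free; an FFT would normally be used.
    spectrum = [sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
                for k in range(n)]
    # Small floor avoids log(0) for empty spectral bins.
    log_mag = [math.log(abs(v) + 1e-12) for v in spectrum]
    # log_mag is real and even, so its inverse DFT reduces to a cosine sum.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Pick the strongest cepstral peak in the lag range of plausible pitch periods."""
    cep = real_cepstrum(frame)
    lag_min = int(fs / fmax)                         # shortest period considered
    lag_max = min(int(fs / fmin), len(frame) // 2)   # longest period considered
    peak_lag = max(range(lag_min, lag_max + 1), key=lambda q: cep[q])
    return fs / peak_lag                             # period in samples -> Hz

# Synthetic voiced frame: one pulse every 40 samples at an assumed 8 kHz rate.
fs = 8000
frame = [1.0 if i % 40 == 0 else 0.0 for i in range(256)]
f0 = estimate_pitch(frame, fs)
print(f"estimated pitch: {f0:.1f} Hz")
```

Running the same function over successive frames of a recording would give a pitch track for the whole signal, with each estimate assigned to its frame's time instant.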

Pitch Estimation


Thank You
