An Introduction to Speech Recognition

Advanced Electronic Devices, EC-410

By: Mayank Awasthi (2006033)

Instructor: Dr. M Ravibabu

Topics to be covered

• Overview
• Speech Production
• SR System
• Why Speech Recognition Is Difficult
• Current Software Options for PC
• Applications
• References

Overview

Speech is the vocalized form of human communication. Each spoken word is created out of the phonetic combination of a limited set of vowel and consonant speech sound units.

Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format.

Speech recognition has evolved considerably over the past few years. Early systems worked in discrete dictation mode, where the speaker had to pause between words. Today's systems support continuous dictation. They have also become smarter, applying their own grammar rules to work out the meaning of what is being said.

Speech Production

Normal human speech is produced with pulmonary pressure provided by the lungs, which creates phonation in the glottis of the larynx. The vocal tract then modifies this sound into different vowels and consonants.

Knowing how the various speech sounds are generated helps us understand their properties.

In short, sound is generated when the vocal tract is excited.

The mode of excitation can be of three types:

1) Periodic: in the case of vowels
2) Aperiodic: in the case of consonants
3) Mixed
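The three excitation modes listed above can be sketched as simple test signals. This is only an illustration under stated assumptions: an impulse train stands in for periodic glottal pulses, uniform white noise stands in for aperiodic excitation, and the function names, the 80-sample period and the noise gain are all hypothetical choices that do not come from the slides.

```python
import random

def periodic_excitation(n_samples, period):
    """Impulse train: one pulse every `period` samples (periodic mode)."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def aperiodic_excitation(n_samples, seed=0):
    """Uniform white noise in [-1, 1]: no repeating pattern (aperiodic mode)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def mixed_excitation(n_samples, period, noise_gain=0.3, seed=0):
    """Impulse train plus scaled noise (mixed mode)."""
    pulses = periodic_excitation(n_samples, period)
    noise = aperiodic_excitation(n_samples, seed)
    return [p + noise_gain * x for p, x in zip(pulses, noise)]

# Assumed example: 400 samples, one pulse every 80 samples
# (a 100 Hz pulse rate if the sampling rate were 8 kHz).
voiced = periodic_excitation(400, period=80)
unvoiced = aperiodic_excitation(400)
mixed = mixed_excitation(400, period=80)
print(sum(voiced))  # number of pulses in the frame: 5.0
```

The impulse train is the simplest possible stand-in for glottal pulses; a real glottal waveform is smoother, but the periodic versus noise-like distinction is the same.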

In the case of voiced sounds such as vowels, the excitation is periodic. The periodic opening and closing of the glottis produces puffs of air that excite the vocal tract.

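If we take 340 m/s as the speed of sound in air and 17 cm as the length of the vocal tract from glottis to lips, and model the tract as a uniform tube closed at the glottis and open at the lips, the lowest resonance occurs when the tube is a quarter of a wavelength long (λ = 4L). The lowest resonant frequency is then

f = c / λ = c / (4L) = 34000 / (4 × 17) = 500 Hz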

The higher resonances fall at odd multiples of this frequency: 1500 Hz, 2500 Hz, etc. Thus we should expect peaks in the frequency spectrum of a vowel at these frequencies.

These peaks in the spectrum, caused by resonance in the vocal tract, are called formants.

Different speech sounds are produced by changing the shape of the resonant cavity, which changes the frequency, amplitude and bandwidth of the formants.
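The resonance arithmetic above can be checked with a few lines of code. This is a minimal sketch that assumes the vocal tract behaves like a uniform tube closed at one end, so the resonances are f_n = (2n - 1) c / (4L); the function name `tube_resonances` is hypothetical.

```python
def tube_resonances(length_cm, count=3, speed_cm_per_s=34000.0):
    """Resonances of a uniform tube closed at one end: f_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * speed_cm_per_s / (4.0 * length_cm)
            for n in range(1, count + 1)]

print(tube_resonances(17.0))  # [500.0, 1500.0, 2500.0]
```

Calling the same function with a shorter length gives proportionally higher values, which is one simple way to see how changing the cavity moves the formants. A real vocal tract is not a uniform tube, so measured formants differ from these idealized numbers.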

Source-Filter Model of Speech Production

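• We know that s(n) = e(n) * h(n), where e(n) is the source excitation, h(n) is the impulse response of the time-varying filter representing the vocal tract, s(n) is the output speech wave, and * denotes convolution.

• The figure shows typical spectra of two speech sounds of the Hindi word “ki” on a log scale: red for /i/ and black for /k/.

[Figure: source excitation → time-varying filter representing the vocal tract → output speech wave]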
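The relation s(n) = e(n) * h(n) can be sketched as a direct convolution. The example below is illustrative only: the excitation is an assumed impulse train, and the filter is a single damped sinusoid at 500 Hz standing in for one vocal tract resonance. The helper names, the 8 kHz sampling rate and the decay factor are hypothetical.

```python
import math

def convolve(e, h):
    """Direct linear convolution: s(n) = sum over k of e(k) * h(n - k)."""
    s = [0.0] * (len(e) + len(h) - 1)
    for i, e_i in enumerate(e):
        if e_i == 0.0:
            continue  # skip zero samples of the excitation
        for j, h_j in enumerate(h):
            s[i + j] += e_i * h_j
    return s

def resonator_impulse_response(freq_hz, sample_rate, n_samples, decay=0.97):
    """Damped sinusoid: a crude stand-in for one vocal tract resonance."""
    return [(decay ** n) * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

sample_rate = 8000                                        # assumed sampling rate
e = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]     # source excitation
h = resonator_impulse_response(500.0, sample_rate, 100)   # filter
s = convolve(e, h)                                        # output speech wave
print(len(s))  # 499
```

Each pulse in e(n) launches one copy of h(n), so the output is a train of overlapping damped oscillations, which is the basic picture behind the source-filter model.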

Speech sounds are characterized by the size and shape of the filter (the vocal cavity), which is represented by the filter spectrum H(k). Source characteristics such as fundamental frequency and signal amplitude can therefore be ignored in speech recognition.

The log power spectrum of the speech signal is the sum of the log power spectra of the source and the filter. The source spectrum varies rapidly with frequency, whereas the filter spectrum varies slowly. Therefore, if we pass the composite log power spectrum through a low-pass filter, only the characteristics of the filter remain.

This process is called liftering. It can be achieved by taking the inverse Fourier transform of the log power spectrum and retaining only the first few components. The resulting spectrum is called the cepstrum, and its coefficients are called cepstral coefficients.

cep(q) = IFFT{ log(|S(k)|^2) },  q = 0, 1, 2, …, N-1

Most SR systems use cepstral coefficients and their time derivatives as features for representing speech sounds.
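The cepstrum formula above can be sketched as follows. To keep the example dependency-free, a naive O(N^2) DFT replaces the FFT; in practice an FFT library would be used. The names `dft`, `cepstrum` and `lifter`, the small `floor` added before the logarithm, and the test frame are assumptions made for illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def cepstrum(frame, floor=1e-12):
    """cep(q) = IDFT{ log |S(k)|^2 }, q = 0 .. N-1 (real part kept)."""
    spectrum = dft(frame)
    # `floor` avoids log(0) for empty spectral bins (an assumed safeguard).
    log_power = [math.log(abs(s_k) ** 2 + floor) for s_k in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]

def lifter(cep, n_keep):
    """Low-time liftering: retain only the first n_keep cepstral coefficients."""
    return cep[:n_keep]

# Assumed test frame: a 64-sample damped sinusoid.
frame = [(0.9 ** n) * math.sin(2 * math.pi * n / 8) for n in range(64)]
cep = cepstrum(frame)
features = lifter(cep, 13)  # 13 is a common but arbitrary choice here

# Doubling the amplitude shifts only cep(0), by log(4).
cep_loud = cepstrum([2.0 * x for x in frame])
print(round(cep_loud[0] - cep[0], 6), round(math.log(4), 6))
```

The last two lines illustrate why signal amplitude can be set aside: a gain change adds a constant to the log power spectrum, which lands entirely in the zeroth coefficient and leaves the others unchanged.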

Speech Recognition System

First, the user gives a voice command over the microphone, which is passed to the sound card in the system. This analog signal is sampled and converted into digital form using a technique called Pulse Code Modulation (PCM). The resulting digital waveform is a stream of amplitudes that looks like a wavy line when plotted.
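The sampling and quantization step can be sketched as below. This is a simplified illustration, assuming linear 16-bit PCM and a pure 1 kHz tone in place of the microphone signal; the function names and the 8 kHz sampling rate are hypothetical, and a real sound card does this in hardware.

```python
import math

def sample_tone(freq_hz, sample_rate, n_samples):
    """Stand-in for the analog signal, evaluated at the sampling instants."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

def pcm_encode(samples, bits=16):
    """Quantize amplitudes in [-1.0, 1.0] to signed integer codes (linear PCM)."""
    max_code = 2 ** (bits - 1) - 1
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))        # clip out-of-range amplitudes
        codes.append(int(round(x * max_code)))
    return codes

analog = sample_tone(1000.0, 8000, 8)     # one period of a 1 kHz tone
digital = pcm_encode(analog)
print(digital)
```

The printed list is the "stream of amplitudes" the text refers to: one integer per sampling instant.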

The audio signal is then divided into short segments, and each segment is converted into the frequency domain. The incoming stream is now a set of discrete frequency bands, in a form that can be used by the speech recognizer.
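The conversion to frequency bands can be sketched as framing followed by a Fourier transform of each frame. The example below is an assumption-laden illustration: the frame length, hop size, Hamming window and 8 kHz sampling rate are choices made here, not taken from the slides, and a naive DFT is used only to avoid external dependencies.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping fixed-length frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Hamming-windowed DFT magnitudes for bins 0 .. N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

sample_rate = 8000  # assumed sampling rate
signal = [math.sin(2 * math.pi * 1000.0 * t / sample_rate) for t in range(256)]
spectra = [magnitude_spectrum(f) for f in frames(signal, frame_len=64, hop=32)]

# Strongest band in the first frame, converted from bin index to Hz.
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
print(peak_bin * sample_rate / 64)
```

Each entry of `spectra` is one row of frequency bands for one short segment, which is the kind of representation the later matching stage works on.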

The next stage involves recognizing these bands of frequencies. For this, the speech recognition software has a database containing thousands of reference frequency patterns for speech sounds, or "phonemes", as they are called.

A phoneme is the smallest unit of speech in a language. The utterance (vocalization) of one phoneme is different from another, such that if one phoneme replaces another in a word, the word would have a different meaning. For example, if the "b" in "bat" were replaced by the phoneme "r", the meaning would change to "rat".

Ex.: “kit” vs. “skill”. The /k/ is aspirated in the first word and not in the second, yet both are the same phoneme in English.

The phoneme database is used to match the audio frequency bands that were sampled. For example, if the incoming signal sounds like a "t", the software will try to match it to the corresponding phoneme in the database. Each phoneme is tagged with a feature number, which is then assigned to the incoming signal.
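The database lookup described above can be sketched as a nearest-template search. The slides do not say how the match is scored, so Euclidean distance is assumed here purely for illustration. The database contents, the three-element feature vectors and the function names are all made up for this sketch.

```python
import math

# Hypothetical phoneme database: feature number -> (label, reference vector).
# The vectors are invented illustrative values, not measured data.
PHONEME_DB = {
    1: ("t", [0.1, 0.9, 0.3]),
    2: ("b", [0.8, 0.2, 0.1]),
    3: ("r", [0.4, 0.4, 0.7]),
}

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_phoneme(features, database=PHONEME_DB):
    """Return (feature_number, label) of the closest reference vector."""
    best = min(database, key=lambda num: distance(features, database[num][1]))
    return best, database[best][0]

incoming = [0.15, 0.85, 0.25]   # assumed features of the incoming sound
print(match_phoneme(incoming))  # (1, 't')
```

The returned feature number is what gets assigned to the incoming signal; the sequence of such numbers is then what later stages would turn into words.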

Why Is SR Difficult?

A given word is spoken differently by different people, whose voices have different spectral properties. Ex.: women generally have a shorter vocal tract than men, so the formant frequencies of a female speaker are higher than those of a male speaker.

The properties of a sound depend not only on the identity of the corresponding phoneme but also on the neighbouring sounds. Speakers also mispronounce words. Ex.: a speaker may mispronounce the long word “Thiruvananthapuram” as “tiruvanthpuram”. Human beings have no problem mapping it to the correct word. However, such cases pose a problem for a machine.

Current Software Options for PC

• Dragon Systems – Naturally Speaking
• Philips – FreeSpeech
• IBM – ViaVoice
• Lernout & Hauspie – Voice Xpress

Applications

Military: Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA) and the program in France on installing speech recognition systems on Mirage aircraft. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.

People with disabilities

Telephony and other domains

References

• www.google.com
• www.wikipedia.org
• www.esnips.com