77TH MEETING ACOUSTICAL SOCIETY OF AMERICA
TUESDAY, 8 APRIL 1969, SIOUX CAMEO ROOM, 2:00 P.M.
Session J. Speech Processing II
WILLARD MEEKER, Chairman
Contributed Papers (20 minutes)
J1. Use of Syntax in the Analysis of Connected Speech Utterances. P. J. VICENS (nonmember), Computer Department, Stanford University, Stanford, California 94305.--A method to analyze connected speech utterances using a finite-state grammar acting on a restricted vocabulary is proposed. The sentences are commands to a computer-controlled HAND-EYE system able to find blocks and stack them. Audio input to the system is provided through an analog speech preprocessor producing six parameters every 10 msec. Using these parameters, a computer procedure segments the whole utterance and generates a sound description. The decoding of the commands is done by scanning the speech-utterance description forward and backward, looking for "known" (previously learned) words. The vocabulary is chosen so that the word boundaries always fall at an amplitude minimum or at a fricative-type segment. Word recognition is done with a best-match comparison scheme acting on prelearned word representations. At any step, feedback from the grammar is used to eliminate syntactically incorrect word representations from the matching process. The program "understands" the key words of an utterance like "PICK UP EVERY MEDIUM BLOCK STARTING AT THE BOTTOM RIGHT CORNER" in about 10 sec.
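The grammar-feedback idea in J1 can be sketched as follows. This is an illustrative reconstruction, not the paper's actual program: the grammar `FSG`, the vocabulary, the word templates, and the `distance` function are all invented here to show how syntactic state can prune the matching candidates at each step.

```python
# Finite-state grammar: state -> {legal word: next state}.
# A toy fragment of a block-command grammar, invented for illustration.
FSG = {
    "S":   {"PICK": "V"},
    "V":   {"UP": "Q"},
    "Q":   {"EVERY": "ADJ"},
    "ADJ": {"MEDIUM": "N", "BIG": "N", "SMALL": "N"},
    "N":   {"BLOCK": "END"},
}

def distance(segment_description, word_template):
    """Stand-in for the best-match comparison of a segment description
    against a prelearned word representation (here: symbol mismatches)."""
    return sum(a != b for a, b in zip(segment_description, word_template))

def decode(segments, templates, state="S"):
    """Scan the utterance description left to right; at each step the
    grammar supplies the only legal words, so syntactically incorrect
    word representations never enter the matching process."""
    words = []
    for seg in segments:
        candidates = FSG.get(state, {})
        if not candidates:
            break  # grammar reached a final state
        # Best match among the grammatically legal words only.
        best = min(candidates, key=lambda w: distance(seg, templates[w]))
        words.append(best)
        state = candidates[best]
    return words
```

Restricting the candidate set to the grammar's current state both speeds up matching and rules out syntactically impossible decodings before any acoustic comparison is made.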
J2. Segment-Synchronization Problem in Speech Recognition. D. RAJ REDDY, Computer Science Department, Stanford University, Stanford, California 94305.--Procedures for the segmentation of the acoustic continuum of speech cannot usually guarantee that two utterances of the same phrase by the same speaker will always result in the same number of segments, even under ideal conditions. To match the segmental parameters of an utterance against known parameters of the same phrase, one must determine correspondences between the segments of the two utterances. A solution to this synchronization problem that requires no time normalization can be based on the possibility of labeling the segments at least in terms of phoneme groups. Segments of unvoiced fricatives and vowels can usually be detected more reliably than others. Therefore, the synchronization procedure first maps vowel to vowel and fricative to fricative. The few unmapped segments between any two pairs can then be mapped using segment labels such as nasal or stop, or on the basis of similarity of segmental parameters. This scheme is much faster than correlation-based schemes and has been used successfully in a real-time recognition program. It took 4 (15) sec/recognition for a 50- (500-) word vocabulary to achieve 98% correct recognition.
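The anchor-based synchronization described above can be sketched in a few lines. This is a minimal reconstruction under assumptions not in the abstract: segments are represented as label strings, anchors are paired strictly in temporal order, and the in-between segments are matched by identical label only (the paper also allows matching by similarity of segmental parameters).

```python
def synchronize(segs_a, segs_b, anchors=("VOWEL", "FRIC")):
    """Map segments of utterance A onto utterance B with no time
    normalization: first pair the reliably detected anchor segments
    (vowels, unvoiced fricatives) in order, then map the leftover
    segments between consecutive anchor pairs by label."""
    a_anchors = [i for i, s in enumerate(segs_a) if s in anchors]
    b_anchors = [i for i, s in enumerate(segs_b) if s in anchors]
    # Pair anchors one-for-one; a fuller version would also verify that
    # the labels agree (vowel-to-vowel, fricative-to-fricative).
    mapping = list(zip(a_anchors, b_anchors))
    # Fill in unmapped segments between each adjacent anchor pair,
    # e.g. NASAL to NASAL or STOP to STOP.
    for (a0, b0), (a1, b1) in zip(mapping[:], mapping[1:]):
        for ia in range(a0 + 1, a1):
            for ib in range(b0 + 1, b1):
                if segs_b[ib] == segs_a[ia] and all(ib != m[1] for m in mapping):
                    mapping.append((ia, ib))
                    break
    return sorted(mapping)
```

Because the anchors fix the global alignment, the search for correspondences is confined to the short gaps between anchors, which is what makes the scheme much cheaper than correlating whole utterances.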
J3. Experimental Conversational Computer System. RICHARD B. NEELY (nonmember), Computer Science Department, Stanford University, Stanford, California 94305.--This paper describes an experimental system with speech input and output, providing a problem domain for speech analysis and synthesis research. Speech input to the computer is provided using a system developed at Stanford (Reddy and Vicens, "Spoken Message Recognition by a Computer," to be published). Time-domain compressed speech [D. R. Reddy, J. Acoust. Soc. Amer. 44, 391(A) (1968)] is used for voice response from the computer. If a query is not one of the 25 or so messages the system can recognize, it responds with "Please repeat," or some such. If the input is recognized, then the response is a random but relevant message chosen from about 100 messages. The system was developed not so much to talk to machines but rather to study some aspects of man-machine voice communication, such as the effect of (a) a delayed reply on a human speaker, (b) intonationless speech that can result from concatenating isolated words to form a sentence, (c) an increase of vocabulary on computer speed and storage requirements, and (d) improvements to the analysis or synthesis algorithms on the over-all performance of the system.
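The query-response policy in J3 reduces to a small dispatch loop. The sketch below is a toy stand-in, not the Stanford system: the message inventory and the example queries are invented, and the recognizer itself is assumed to have already produced a text transcription of the query.

```python
import random

# Invented inventory: each recognized query maps to a pool of relevant
# replies; the real system had ~25 queries and ~100 stored messages.
RESPONSES = {
    "WHAT TIME IS IT": ["It is two o'clock.", "About two."],
    "STACK THE BLOCKS": ["Stacking now.", "Working on it."],
}

def respond(recognized_query):
    """If the query is one of the known messages, return a random but
    relevant reply; otherwise ask the speaker to repeat."""
    pool = RESPONSES.get(recognized_query)
    if pool is None:
        return "Please repeat."
    return random.choice(pool)
```

Randomizing over a pool of relevant replies is what lets the experimenters vary reply content and delay independently of recognition, which is exactly the kind of man-machine effect the system was built to study.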
J4. Computer Recognition of Single-Syllable English Words. MARK MEDRESS, Department of Electrical Engineering and Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139.--Recognition of single-syllable words is attempted by estimating element values of the distinctive-feature matrices defining their phoneme strings. These values are determined by making measurements on a filter-bank representation of the short-time spectrum of the unknown utterance. The first properties estimated are those least influenced by the phonetic environment, such as the presence and location of stops and fricatives, and whether the vowel is front or back. Other, more environment-dependent features are determined by phonological rules and additional measurements that depend on the estimates of the first set of properties. These include place of articulation for the stops and fricatives, other vowel features, and detection and identification of nasals, liquids, and glides. For three speakers' recordings of 60 words containing five vowels and initial and final liquids, this procedure correctly identified all but three of the vowels, and identified without error the liquids in all words with front vowels. Preliminary work indicates that similar success can be expected for a more heterogeneous word list. [This work was supported in part by grants from the National Institutes of Health.]
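The staged decision structure in J4 can be sketched abstractly. The feature names and the `measure` interface below are assumptions made for illustration; the point is only the control flow: context-independent features are estimated first, and later measurements are conditioned on those estimates.

```python
def recognize_features(measure):
    """`measure(feature)` stands in for a spectral measurement on the
    filter-bank representation, returning +1 or -1 for the feature.
    Returns an estimated (partial) distinctive-feature assignment."""
    feats = {}
    # Stage 1: properties least influenced by phonetic environment,
    # e.g. presence of stops/fricatives and vowel frontness.
    for f in ("stop_present", "fricative_present", "vowel_front"):
        feats[f] = measure(f)
    # Stage 2: environment-dependent features, decided by phonological
    # rules and measurements conditioned on the stage-1 estimates.
    if feats["stop_present"] > 0:
        feats["stop_place"] = measure("stop_place")
    if feats["fricative_present"] > 0:
        feats["fricative_place"] = measure("fricative_place")
    return feats
```

Conditioning the later measurements on the earlier estimates is what keeps the place-of-articulation decisions from being attempted when the relevant segment type is absent.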
J5. Acoustic Measurements for Speaker Recognition. JARED J. WOLF, Department of Electrical Engineering and Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139.--In a scheme for mechanical recognition of speakers, it is desirable to use acoustic measures that are related in as direct a manner as possible to the voice characteristics of the unknown speaker and that are minimally affected by irrelevant factors. Acoustic attributes that are dependent on anatomical properties of the speaker's vocal mechanism should be particularly effective ones. Certain phonemes or phoneme features are well suited for displaying speaker-dependent characteristics. For example, aspects of the spectra of // (high-frequency shape), /i/ (shape of the F2-F3-F4 peak), and /m/ (pole-zero interplay and nasal formants) have been found to be effective.
The Journal of the Acoustical Society of America 89