
Studies in Computational Intelligence 550

Mousmita Sarma
Kandarpa Kumar Sarma

Phoneme-Based Speech Segmentation Using Hybrid Soft Computing Framework

Studies in Computational Intelligence

Volume 550

Series editor
J. Kacprzyk, Warsaw, Poland

For further volumes: http://www.springer.com/series/7092

About this Series

The series Studies in Computational Intelligence (SCI) publishes new developments and advances in the various areas of computational intelligence quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.

Mousmita Sarma
Kandarpa Kumar Sarma

Phoneme-Based Speech Segmentation Using Hybrid Soft Computing Framework

Mousmita Sarma
Department of Electronics and Communication Engineering
Gauhati University
Guwahati, Assam, India

Kandarpa Kumar Sarma
Department of Electronics and Communication Technology
Gauhati University
Guwahati, Assam, India

ISSN 1860-949X        ISSN 1860-9503 (electronic)
ISBN 978-81-322-1861-6        ISBN 978-81-322-1862-3 (eBook)
DOI 10.1007/978-81-322-1862-3
Springer New Delhi Heidelberg New York Dordrecht London

    Library of Congress Control Number: 2014933541

© Springer India 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

    Printed on acid-free paper

    Springer is part of Springer Science+Business Media (www.springer.com)

This work is dedicated to all the researchers of Speech Processing and related technology

Preface

Speech is a naturally occurring nonstationary signal essential not only for person-to-person communication but also for Human-Computer Interaction (HCI). Some of the issues related to the analysis and design of speech-based applications for HCI have received widespread attention. With the continuous upgradation of processing techniques, the treatment of speech signals and related analysis from varied angles has become a critical research domain. It is more so in cases where there are regional linguistic orientations with cultural and dialectal elements. It has enabled technologists to visualize innovative applications. This work is an attempt to treat speech recognition with soft computing tools oriented toward a language like Assamese, spoken mostly in the northeastern part of India, with rich linguistic and phonetic diversity. The regional and phonetic variety observed in Assamese makes it a sound area for research involving ever-changing approaches in speech and speaker recognition. The contents included in this compilation are outcomes of the research carried out over the last few years, with emphasis on the design of a soft computing framework for phoneme segmentation used for speech recognition. Though the work uses Assamese as an application language, the concepts outlined, systems formulated, and the results reported are equally relevant to any other language. This makes the proposed framework a universal system suitable for application to soft computing-based speech segmentation algorithm design and implementation.

Chapter 1 provides basic notions related to speech and its generation. This treatment is general in nature and is expected to provide the background necessary for such a work. The contents included in this chapter also provide the necessary motivation, certain aspects of phoneme segmentation, a review of the reported literature, the application of Artificial Neural Network (ANN) as a speech processing tool, and certain related issues. This content should give the reader a rudimentary familiarity with the issues highlighted subsequently in the work.

Speech recognition research is interdisciplinary in nature, drawing upon work in fields as diverse as biology, computer science, electrical engineering, linguistics, mathematics, physics, and psychology. Some of the basic issues related to speech processing are summarized in Chap. 2. The related theories on speech perception and spoken word recognition models are covered in this chapter. As ANN is the most critical element of the book, certain essential features necessary for the subsequent portion of the work constitute Chap. 3.



The primary topologies covered include the Multi Layer Perceptron (MLP), Recurrent Neural Network (RNN), Probabilistic Neural Network (PNN), Learning Vector Quantization (LVQ), and Self-Organizing Map (SOM). The descriptions included are designed in such a manner that they serve as supporting material for the subsequent content.

Chapter 4 primarily discusses the Assamese language and its phonemical characteristics. Assamese is an Indo-Aryan language that originated from the Vedic dialects and has strong links to Sanskrit, the ancient language of the Indian subcontinent. However, its vocabulary, phonology, and grammar have been substantially influenced by the original inhabitants of Assam such as the Bodos and Kacharis. While retaining certain features of its parent Indo-European family, it has many unique phonological characteristics. There is a host of phonological uniqueness in Assamese pronunciation, which shows variations when spoken by people of different regions of the state. This makes Assamese speech unique and hence requires a dedicated study to develop a language-specific speech recognition/speaker identification system.

In Chap. 5, a brief overview derived from a detailed survey of speech recognition works reported by different groups all over the globe in the last two decades is given. Robustness of speech recognition systems toward language variation is a recent trend of research in speech recognition technology. A system that can communicate with human beings in any language, as any other human being would, is the foremost requirement of speech recognition research. The related efforts in this direction are summarized in this chapter.

Chapter 6 includes certain experimental work carried out. The chapter provides a description of a SOM-based segmentation technique and explains how it can be used to segment the initial phoneme from a Consonant-Vowel-Consonant (CVC) type Assamese word. The work provides a comparison of the proposed SOM-based technique with the conventional Discrete Wavelet Transform (DWT)-based speech segmentation technique. The contents include a description of an ANN approach to speech segmentation that extracts the weight vectors obtained from a SOM trained with the LP coefficients of digitized samples of the speech to be segmented. The results obtained are better than those reported earlier.

Chapter 7 provides a description of the proposed spoken word recognition model, where a set of word candidates is activated at first on the basis of the phoneme family to which its initial phoneme belongs. The phonemical structure of every natural language provides some phonemical groups for both vowel and consonant phonemes, each having distinctive features. This work provides an approach to CVC-type Assamese spoken word recognition that takes advantage of such phonemical groups of the Assamese language, where all words of the recognition vocabulary are initially classified into six distinct phoneme families and then the constituent vowel and consonant phonemes are classified within the group. A hybrid framework, using four different ANN structures, is constituted for this word recognition model, to recognize the phoneme family and phonemes, and thus the word, at various levels of the algorithm.

A technique to remove the CVC-type word limitation observed in the spoken word recognition model described in Chap. 7 is proposed in Chap. 8.



This technique is based on a phoneme count determination block built on K-means Clustering (KMC) of speech data. The KMC algorithm-based technique provides prior knowledge about the possible number of phonemes in a word. The KMC-based approach enables proper counting of phonemes, which extends the system to words with multiple phonemes.

Chapter 9 presents a neural model for speaker identification using speaker-specific information extracted from vowel sounds. The vowel sound is segmented out from words spoken by the speaker to be identified. Vowel sounds occur in speech more frequently and with higher energy. Therefore, in situations where acoustic information is corrupted by noise, vowel sounds can be used to extract different amounts of speaker discriminative information. The model explained here uses a neural framework formed with PNN and LVQ, where the proposed SOM-based vowel segmentation technique is used. The speaker-specific glottal source information is initially extracted using the LP residual. Later, Empirical Mode Decomposition (EMD) of the speech signal is performed to extract the residual. The work shows a comparison of effectiveness between these two residual features.

The key features of the work are summarized in Chap. 10. It also includes certain future directions that can be considered as part of follow-up research to make the proposed system a fail-proof framework.

The authors are thankful to the acquisition, editorial, and production team of the publishers. The authors are thankful to the students, research scholars, and faculty members of Gauhati University and IIT Guwahati for being connected in respective ways to the work. The authors are also thankful to their respective family members for their support and encouragement.

    Finally, the authors are thankful to the Almighty.

Guwahati, Assam, India        Mousmita Sarma
January 2014        Kandarpa Kumar Sarma



Acknowledgments

    The authors acknowledge the contribution of the following:

Mr. Prasanta Kumar Sarma of Swadeshy Academy, Guwahati, and Mr. Manoranjan Kalita, sub-editor of the Assamese daily Amar Axom, Guwahati, for their exemplary help in developing rudimentary know-how of linguistics and phonetics.

Krishna Dutta, Surajit Deka, Arup, Sagarika Bhuyan, Pallabi Talukdar, Amlan J. Das, Mridusmita Sarma, Chayashree Patgiri, Munmi Dutta, Banti Das, Manas J. Bhuyan, Parismita Gogoi, Ashok Mahato, Hemashree Bordoloi, Samarjyoti Saikia, and all other students of the Department of Electronics and Communication Technology, Gauhati University, who provided their valuable time during the recording of the raw speech samples.


Contents

    Part I Background

1 Introduction
  1.1 Background
  1.2 Phoneme Boundary Segmentation: Present Technique
  1.3 ANN as a Speech Processing and Recognition Tool
    1.3.1 Speech Recognition Using RNN
    1.3.2 Speech Recognition Using SOM
  1.4 Motivation
  1.5 Contribution
  1.6 Organization
  References

2 Speech Processing Technology: Basic Consideration
  2.1 Fundamentals of Speech Recognition
  2.2 Speech Communication Chain
  2.3 Mechanism of Speech Perception
    2.3.1 Physical Mechanism of Perception
    2.3.2 Perception of Sound
    2.3.3 Basic Unit of Speech Perception
  2.4 Theories and Models of Spoken Word Recognition
    2.4.1 Motor Theory of Speech Perception
    2.4.2 Analysis-by-Synthesis Model
    2.4.3 Direct Realist Theory of Speech Perception
    2.4.4 Cohort Theory
    2.4.5 Trace Model
    2.4.6 Shortlist Model
    2.4.7 Neighborhood Activation Model
  2.5 Conclusion
  References

3 Fundamental Considerations of ANN
  3.1 Introduction
  3.2 Learning Strategy
  3.3 Prediction and Classification Using ANN
  3.4 Multi Layer Perceptron
  3.5 Recurrent Neural Network
  3.6 Probabilistic Neural Network
    3.6.1 Architecture of a PNN Network
  3.7 Self-Organizing Map
    3.7.1 Competitive Learning and Self-Organizing Map (SOM)
  3.8 Learning Vector Quantization
    3.8.1 LVQ Approach
    3.8.2 LVQ Algorithm
  3.9 Conclusion
  References

4 Sounds of Assamese Language
  4.1 Introduction
  4.2 Formation of Assamese Language
  4.3 Phonemes of Assamese Language
    4.3.1 Vowels
    4.3.2 Diphthongs
    4.3.3 Stop Consonant
    4.3.4 Nasals
    4.3.5 Fricatives
    4.3.6 Affricates
    4.3.7 Semi Vowels
  4.4 Some Specific Phonemical Features of Assamese Language
  4.5 Dialects of Assamese Language
  4.6 Conclusion
  References

5 State of Research of Speech Recognition
  5.1 Introduction
  5.2 A Brief Overview of Speech Recognition Technology
  5.3 Review of Speech Recognition During the Last Two Decades
  5.4 Research of Speech Recognition in Indian Languages
    5.4.1 Statistical Approach
    5.4.2 ANN-Based Approach
  5.5 Conclusion
  References



Part II Design Aspects

6 Phoneme Segmentation Technique Using Self-Organizing Map (SOM)
  6.1 Introduction
  6.2 Linear Prediction Coefficient (LPC) as Speech Feature
  6.3 Application of SOM and PNN for Phoneme Boundary Segmentation
  6.4 DWT-Based Speech Segmentation
  6.5 Proposed SOM- and PNN-Based Segmentation Algorithm
    6.5.1 PNN Learning Algorithm
    6.5.2 SOM Weight Vector Extraction Algorithm
    6.5.3 PNN-Based Decision Algorithm
  6.6 Experimental Details and Result
    6.6.1 Experimental Speech Signals
    6.6.2 Preprocessing
    6.6.3 Role of PNN Smoothing Parameter
    6.6.4 Comparison of SOM- and DWT-Based Segmentation
  6.7 Conclusion
  References

7 Application of Phoneme Segmentation Technique in Spoken Word Recognition
  7.1 Introduction
  7.2 Linear Prediction Model for Estimation of Formant Frequency
    7.2.1 Human Vocal Tract and Linear Prediction Model of Speech
    7.2.2 Pole or Formant Location Determination
  7.3 LVQ and Its Application to Codebook Design
  7.4 Phoneme Segmentation for Spoken Word Recognition
    7.4.1 RNN-Based Local Classification
    7.4.2 SOM-Based Segmentation Algorithm
    7.4.3 PNN- and F1-Based Vowel Phoneme and Initial Phoneme Recognition
    7.4.4 LVQ Codebook Assisted Last Phoneme Recognition
  7.5 Experimental Details and Results
    7.5.1 Experimental Speech Signals
    7.5.2 RNN Training Consideration
    7.5.3 Phoneme Segmentation and Classification Results
  7.6 Conclusion
  References



8 Application of Clustering Techniques to Generate a Priori Knowledge for Spoken Word Recognition
  8.1 Introduction
  8.2 K-Means Clustering (KMC)
  8.3 KMC Applied to Speech Data
  8.4 Experimental Work
    8.4.1 Experimental Speech Samples
    8.4.2 Role of RNN in Decision Making of the Proposed Technique
    8.4.3 Result and Limitation
  8.5 Conclusion
  References

9 Application of Proposed Phoneme Segmentation Technique for Speaker Identification
  9.1 Introduction
  9.2 Certain Previous Work Done
  9.3 Linear Prediction Residual Feature
  9.4 EMD Residual-Based Source Extraction
  9.5 LVQ Codebook and Speaker Identification
  9.6 Speaker Database
  9.7 Experimental Details and Results
  9.8 System Description
    9.8.1 Vowel Segmentation Results
    9.8.2 Speaker Identification Results Using LP Residual
    9.8.3 Speaker Identification Results Using EMD Residual
  9.9 Conclusion
  References

10 Conclusion
  10.1 Conclusion
  10.2 Limitation
  10.3 Future Scope

Index



Acronyms

AANN Autoassociative Neural Networks are feedforward nets trained to produce an approximation of the identity mapping between network inputs and outputs using backpropagation or similar learning procedures

ANN Artificial Neural Networks are nonparametric models inspired by biological central nervous systems. Modeled loosely on the working of the human brain, they are capable of machine learning and pattern recognition. Usually, they are represented as systems of interconnected neurons that can compute values from inputs by feeding information through the network

ARPA Advanced Research Projects Agency is an agency of the United States Department of Defense responsible for the development of new technologies for use by the military

ASR Automatic Speech Recognition is a technology by which a computer or a machine is made to recognize the speech of a human being

AVQ Adaptive Vector Quantization is a form of vector quantization where the parameters of a quantizer are updated during real-time operation based on observed information regarding the statistics of the signal being quantized

BP Back Propagation Algorithm is a common supervised method of training an ANN, where the network learns from many inputs with reference to a desired output

BPTT Back Propagation Through Time is a gradient-based technique for training certain types of Recurrent Neural Networks (RNNs)

CFRNN Fully Connected RNN is the basic RNN where each neuron unit has a direct connection to every other neuron unit. The Complex Fully Recurrent Neural Network has both global and local feedback paths and processes real and imaginary segments separately, which helps in better learning

CRF Conditional Random Field is a class of statistical modeling used for structured prediction that takes context (neighboring samples) into account, often applied in pattern recognition and machine learning

CSR Continuous Speech Recognition, ASR technology implemented for continuous speech signals


CV A coda-less open syllable where speech sounds are organized in the sequence of consonant and vowel

CVC A coda-closed syllable where speech sounds are organized in the sequence of consonant, vowel, and consonant

DEKF Decoupled Extended Kalman Filter is a technique used to train a Recurrent Neural Network (RNN) with separated blocks of Kalman filters. It is a class of RNN training methods where modified forms of Kalman filters are employed in decoupled form

DNN Deep Neural Network is a type of ANN where learning is layer distributed and retained at neuron levels

DRT Direct Realist Theory is a theory of speech perception proposed by Carol Fowler which claims that the objects of speech perception are articulatory rather than acoustic events

DTW Dynamic Time Warping is an algorithm to measure similarity between two temporally varying time sequences

DWFB Discrete Wavelet Filter Bank is a wavelet filter bank made up of successive high-pass and low-pass orthogonal filters derived from the wavelet decomposition tree, where the time-scale representation of the signal to be analyzed is passed through filters with different cutoff frequencies at different scales

DWT Discrete Wavelet Transform is a kind of wavelet transform where the wavelets are discretely sampled

EKF Extended Kalman Filter is a nonlinear version of the Kalman filter used in estimation theory which linearizes about an estimate of the current mean and covariance

EMD Empirical Mode Decomposition is a signal decomposition technique for nonstationary signals derived by N. E. Huang, where the signal is decomposed into a set of high-frequency oscillating components and a low-frequency residue

F1 The first spectral peak observed in the vocal tract filter response

FF Feed Forward is an ANN structure where the processing flow is in the forward direction

FFT Fast Fourier Transform is a computationally efficient algorithm used for Discrete Fourier Transform (DFT) analysis

FVQ Fuzzy Vector Quantization is a technique where fuzzy concepts are used for vector quantization and clustering

GMM Gaussian Mixture Model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set identify the sub-population to which an individual observation belongs, where the mixture distribution is Gaussian

HCI Human Computer Interface involves the study, planning, and design of the interaction between people (users) and computers


HMM Hidden Markov Model is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved or hidden states. It can be considered the simplest dynamic Bayesian network

IMF Intrinsic Mode Function is a high-frequency component obtained from a nonstationary signal after decomposition using the EMD technique

ISR Isolated Speech Recognition is ASR technology developed for isolated speech, where a distinct pause is observed between spoken words

KMC K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining

KNN K Nearest Neighbor is a nonparametric method for classification and regression used in pattern recognition that predicts class memberships based on the most common class amongst the k closest training examples in the feature space

LD Log Determinant is a function on the set of symmetric matrices which provides a precise measure of the volume of an ellipsoid

LDA Linear Discriminant Analysis is a method used in statistics, pattern recognition, and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects

LPC Linear Prediction Coding is a tool used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model

LPCC Linear Prediction Cepstral Coefficients are the coefficients found by converting the Linear Prediction coefficients into cepstral coefficients

LVQ Learning Vector Quantization is a prototype-based supervised classification algorithm and is the supervised counterpart of vector quantization systems

MBE Minimum Boundary Error is a criterion related to HMM-based ASR which tries to minimize the expected boundary errors over a set of possible phonetic alignments

MFCC Mel-Frequency Cepstral Coefficients are coefficients that collectively make up a Mel-Frequency Cepstrum, which is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency

MLP Multi Layer Perceptron is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs

MMI Maximum Mutual Information is a discriminative training criterion used in the HMM modeling involved in ASR technology, which considers all the classes simultaneously during training: parameters of the correct model are updated to enhance its contribution to the observations, while parameters of the alternative models are updated to reduce their contributions


MPE Minimum Phone Error is a discriminative criterion, a smoothed measure of phone transcription error used while training HMM parameters. It was introduced by Povey in his doctoral thesis submitted to the University of Cambridge in 2003

MSE Mean Square Error is one of many ways to quantify the difference between values implied by an estimator and the true values of the quantity being estimated

NARX Nonlinear Autoregressive with Exogenous Input is an architectural approach for RNNs whose embedded memory provides the ability to track nonlinear time-varying signals

NWHF Normalized Wavelet Hybrid Feature is a hybrid feature extraction method which uses the combination of Classical Wavelet Decomposition (CWD) and Wavelet Packet Decomposition (WPD) along with the z-score normalization technique

PCA Principal Component Analysis is a mathematical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

PDP Parallel Distributed Processing is a class of neurally inspired information processing models that attempt to model information processing the way it actually takes place in the brain, where the representation of information is distributed and learning can occur with gradual changes in connection strength through experience

PLP Perceptual Linear Predictive analysis is a technique of speech analysis which uses the concepts of critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law of the psychophysics of hearing to approximate the auditory spectrum by an autoregressive all-pole model

PNN Probabilistic Neural Network is a feedforward neural network derived from the Bayesian network and a statistical algorithm called Kernel Fisher discriminant analysis

PSO Particle Swarm Optimization is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality

RBF Radial Basis Function is a hybrid supervised-unsupervised ANN topology having a static Gaussian function as the nonlinearity for the hidden layer processing elements

RBPNN Radial Basis Probabilistic Neural Network is a feedforward ANN model that avoids the huge number of hidden units of PNNs and reduces the training time of RBFs

RD Rate Distortion is a theory related to lossy data compression which addresses the problem of determining the minimal number of bits per symbol, as measured by the rate R, that should be communicated over a channel so that the source can be approximately reconstructed at the receiver without exceeding a given distortion D


RNN Recurrent Neural Network is a class of ANN where connections between units form a directed cycle, which creates an internal state of the network and allows it to exhibit dynamic temporal behavior. It is an ANN with feedforward and feedback paths

RTRL Real Time Recurrent Learning is a training method associated with RNNs where adjustments are made to the synaptic weights of a fully connected recurrent network in real time

SGMM Subspace Gaussian Mixture Model is an acoustic model for speech recognition in which all phonetic states share a common Gaussian Mixture Model structure, and the means and mixture weights vary in a subspace of the total parameter space

SOM Self-Organizing Map is a type of ANN that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map

SSE Sum of Squares of Error is the sum of the squared differences between each observation and its group's mean

SUR Speech Understanding Research is a program funded by the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense during the 1970s

SVM Support Vector Machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis

TDNN Time Delay Neural Network is an ANN architecture whose primary purpose is to work on continuous data

VE Voting Experts is an algorithm for chunking sequences of symbols which greedily searches for sequences that match an information-theoretic signature: low entropy internally and high entropy at the boundaries

VOT Voice Onset Time is defined as the length of time that passes between the release of a stop consonant and the onset of voicing, i.e., the vibration of the vocal folds or periodicity

VQ Vector Quantization is a classical quantization technique, originally used for data compression, which allows the modeling of probability density functions by the distribution of prototype vectors

WER Word Error Rate is a performance measure of a speech recognition system derived from the Levenshtein distance, working at the word level

WPD Wavelet Packet Decomposition is a wavelet transform where the discrete-time (sampled) signal is passed through more filters than in the discrete wavelet transform (DWT)


Part I
Background

The book is constituted by two parts. In Part I, we include the background of the work and the motivation behind it. We also discuss some of the relevant literature in circulation, which provides an insight into the development of technology related to speech recognition. We include a discussion on the phoneme and its importance in speech recognition, with certain stress on the Assamese language, which is widely spoken in the northeastern part of India. We further discuss ANN as a tool for speech recognition and highlight its importance for such an application. Finally, we summarize the contribution of the work and provide a brief organization of the chapters that constitute it.

Chapter 1
Introduction

Abstract Speech is a naturally occurring non-stationary signal essential not only for person-to-person communication but also as an important aspect of human-computer interaction (HCI). Some of the issues related to the analysis and design of speech-based applications for HCI have received widespread attention. Some of these issues are covered in this chapter, which serves as background and motivation for the work included in the subsequent portion of the book.

Keywords Speech · Artificial neural network · Phoneme · Segmentation · RNN · SOM

    1.1 Background

Speech is the most common medium of communication from person to person. Speech is a naturally occurring non-stationary signal produced by a time-varying vocal tract system. It results from time-varying excitation. The speech production mechanism starts at the cortex of the human brain, which sends neural signals through the nerves and the muscles that control breathing to enable humans to articulate the word or words needed to communicate their thoughts. This production of the speech signal occurs at the physiological level, and it involves rapid, coordinated, sequential movements of the vocal apparatus. The purpose of speech is information exchange between human beings. In terms of information theory as introduced by Shannon, speech can be represented in terms of its message content or information. Speech in modern times has become an important element of human-computer interaction (HCI). As such, all speech processing applications are oriented toward establishing some connection with HCI-based designs. While doing so, speech needs detailed analysis and processing with proper mechanisms to develop applications suitable for HCI designs.



An alternative way of characterizing speech is in terms of the signal carrying the message information, i.e., the acoustic waveform [1]. A speech signal contains three distinct events: silence, voiced, and unvoiced. Silence is that part of the signal when no speech is produced. Voiced is the periodic part of the speech signal, produced by the periodic vibration of the vocal folds driven by the flow of air from the lungs, whereas the unvoiced part of the signal is produced with non-vibrating vocal cords, and hence the resulting signal is aperiodic and random in nature. These three parts of the signal occupy distinct spectral regions in the spectrogram.
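To make this three-way distinction concrete, the following is a minimal illustrative sketch (not the algorithm used in this book) that labels frames by short-time energy and zero-crossing rate; the frame sizes and thresholds are arbitrary assumptions that would need tuning for real recordings.

```python
# Illustrative sketch (not the book's algorithm): label speech frames as
# silence, voiced, or unvoiced using short-time energy and zero-crossing
# rate (ZCR). Frame sizes and thresholds are arbitrary assumptions.
import numpy as np

def label_frames(x, frame_len=400, hop=160, energy_floor=1e-4, zcr_split=0.25):
    """Return a 'silence' / 'voiced' / 'unvoiced' label per frame."""
    labels = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2)
        # ZCR: fraction of adjacent sample pairs whose sign changes.
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        if energy < energy_floor:
            labels.append('silence')        # negligible energy
        elif zcr < zcr_split:
            labels.append('voiced')         # periodic: high energy, low ZCR
        else:
            labels.append('unvoiced')       # noise-like: high ZCR
    return labels

# Toy check at 16 kHz: near-silence followed by a 120 Hz "voiced" tone.
fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([1e-3 * np.random.randn(fs), np.sin(2 * np.pi * 120 * t)])
labels = label_frames(x)
print(labels[0], labels[-1])                # silence ... voiced
```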

However, the information that is communicated through speech is intrinsically of a discrete nature. Hence, it can be represented by a concatenation of a number of linguistically varying sound units called phonemes, possibly the smallest units of the speech signal. Each language has its own distinctive set of phonemes, typically numbering between 30 and 50. Segmentation of phonemes, or identification of phoneme boundaries, is a fundamental and crucial task, since it has many important applications in speech and audio processing. The work described throughout this book presents a new approach to phoneme-level speech segmentation based on a hybrid soft computational framework. A Self-Organizing Map (SOM) trained with various iteration numbers is used to extract a number of abstract internal structures, in terms of weight vectors, from any Assamese spoken word in an unsupervised manner. The SOM is an Artificial Neural Network (ANN) trained by following a competitive learning algorithm [2]. In other words, the SOM provides some phoneme boundaries, from which a supervised Probabilistic Neural Network (PNN) [2] identifies the constituent phonemes of the word. The PNNs are trained to learn the patterns of all Assamese phonemes using a database of clean Assamese phonemes recorded from five male and five female speakers in noise-free and noisy environments. An important aspect of the segmentation and classification algorithm is that it uses a priori knowledge of the first formant frequency (F1) of the Assamese phonemes while taking decisions, since phonemes are distinguished by their own unique patterns as well as by their formant frequencies. The work uses the concept of pole or formant location determination from the linear prediction (LP) model of the vocal tract while estimating F1 [3]. Assamese words containing all the phonemes are recorded from a set of female and male speakers, so that the classification algorithm can remove the speaker-dependence limitation.
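The pole/formant idea can be sketched briefly. The snippet below is an illustration rather than the authors' implementation: it fits an LP model to a windowed frame with the autocorrelation method and reads formant candidates from the angles of the LP polynomial roots, the lowest candidate serving as the F1 estimate. The LP order, the 400 Hz bandwidth cut, and the 90 Hz floor are conventional assumptions.

```python
# Sketch of F1 estimation from the poles of an LP model (illustrative; the
# LP order, bandwidth cut, and 90 Hz floor are conventional assumptions).
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """LP coefficients a (a[0] = 1) via the autocorrelation method."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):           # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def formant_candidates(frame, fs, order=12):
    """Formant frequencies from the angles of the LP polynomial roots."""
    a = lpc(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]               # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angle -> Hz
    bws = -fs / np.pi * np.log(np.abs(roots))       # pole radius -> bandwidth
    # Keep sharp resonances away from DC; the smallest one estimates F1.
    return sorted(f for f, b in zip(freqs, bws) if f > 90 and b < 400)

# Toy check: excite a single 700 Hz resonator with an impulse train.
fs = 16000
pole = 0.97 * np.exp(2j * np.pi * 700 / fs)
den = np.poly([pole, np.conj(pole)]).real
exc = np.zeros(480)
exc[::160] = 1.0
frame = lfilter([1.0], den, exc)
print(formant_candidates(frame, fs))        # lowest candidate near 700 Hz
```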

The proposed SOM- and PNN-based algorithm is able to segment out all the vowel and consonant phonemes from any consonant-vowel-consonant (CVC)-type Assamese word. Thus, the algorithm reflects two different application possibilities. The recognized consonant and vowel phonemes can be used to recognize CVC-type Assamese words, whereas the segmented vowel phonemes can be used for Assamese speaker identification. Some experiments have been performed as part of this ongoing work to explore such future application possibilities of the said algorithm in the fields of spoken word recognition and speaker identification.
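As a rough illustration of the unsupervised step, the sketch below uses the third-party minisom package as a stand-in for the SOM described here; the map size, iteration count, and the random placeholder "LP coefficient" features are assumptions, not the authors' settings. Frames mapping to the same best-matching unit suggest a spectrally stable, phoneme-like region, and winner changes suggest candidate boundaries.

```python
# Rough illustration using the third-party `minisom` package as a stand-in
# for the SOM used in the book; the map size, iteration count, and the
# random placeholder "LP coefficient" features are assumptions.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 12))   # placeholder: 200 frames, LP order 12

som = MiniSom(1, 8, input_len=12, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(features, num_iteration=500)
prototypes = som.get_weights().reshape(-1, 12)   # 8 learned weight vectors

# Map each frame to its best-matching unit (BMU): runs of one winner mark a
# stable (phoneme-like) region; winner changes mark candidate boundaries.
winners = [som.winner(f)[1] for f in features]
boundaries = [i for i in range(1, len(winners)) if winners[i] != winners[i - 1]]
print(prototypes.shape, len(boundaries))
```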

The phonemical structure of every language provides various phonemical groups for both vowel and consonant phonemes, each having distinctive features. A spoken word recognition model can be developed in the Assamese language by taking advantage of such phonemical groups. The Assamese language has six distinct phoneme families.


Another important fact is that the initial phoneme of a word is used to activate words starting with that phoneme in spoken word recognition models. Therefore, by investigating the initial phoneme, one can classify words into a phonetic group, and then classify them within the group. The second part of this work provides a prototype model for Assamese CVC-type word recognition, where all words of the recognition vocabulary are initially classified into six distinct phoneme families, and then the constituent vowel and consonant phonemes are segmented and recognized by means of the proposed SOM- and PNN-based segmentation and recognition algorithm. Before using the global decision taken by the PNN, a recurrent neural network (RNN) takes some local decisions about the incoming word and classifies it into the six phoneme families of the Assamese language on the basis of the first formant frequency (F1) of its neighboring vowel. Then, the segmentation and recognition are performed separately within the RNN-decided family. While taking the decision about the last phoneme, the algorithm is assisted by a learning vector quantization (LVQ) codebook which contains a distinct entry for every word of the recognition vocabulary. This helps the algorithm to take the most likely decision.
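Since the LVQ codebook plays a central role here, a compact LVQ1 training sketch in plain NumPy follows (scikit-learn ships no LVQ implementation). The two-dimensional synthetic data and the prototype initialization are hypothetical; in the book's setting each codebook entry would correspond to a vocabulary word.

```python
# Compact LVQ1 sketch in plain NumPy: labeled prototypes are pulled toward
# inputs of their own class and pushed away from inputs of other classes.
# Data and prototype initialization here are hypothetical.
import numpy as np

def train_lvq1(X, y, protos, proto_labels, lr=0.1, epochs=30):
    P = protos.astype(float).copy()
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)                  # decaying rate
        for xi, yi in zip(X, y):
            j = np.argmin(np.linalg.norm(P - xi, axis=1))   # winning prototype
            step = rate * (xi - P[j])
            P[j] += step if proto_labels[j] == yi else -step
    return P

def nearest_label(xi, P, proto_labels):
    return proto_labels[np.argmin(np.linalg.norm(P - xi, axis=1))]

# Toy check with two 2-D Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(2.0, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
P = train_lvq1(X, y, protos=np.array([[0.5, 0.5], [1.5, 1.5]]),
               proto_labels=np.array([0, 1]))
print(nearest_label(np.array([0.1, 0.0]), P, np.array([0, 1])))   # -> 0
```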

The spoken word recognition model explained in this work has the severe limitation of recognizing only CVC-type words. Therefore, in order to cover other phoneme combinations with the present method, a logic needs to be developed so that the algorithm can initially take a decision about the number of phonemes in the word to be recognized. As a part of this work, we have proposed a method to take this prior decision on the number of phonemes by using the k-means clustering (KMC) technique; a minimal sketch of the idea follows.
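Since the details of the KMC-based decision are developed later in the book, the following sketch is only illustrative: it clusters the per-frame feature vectors of a word for increasing k using SciPy's kmeans2 and stops when an added cluster no longer reduces the mean distortion appreciably, an assumed "elbow" rule rather than the book's criterion.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def estimate_num_phonemes(frames, k_max=6):
        # frames: (T, D) array of per-frame feature vectors for one word
        distortions = []
        for k in range(1, k_max + 1):
            centroids, labels = kmeans2(frames, k, minit="++", seed=0)
            distortions.append(
                np.mean(np.linalg.norm(frames - centroids[labels], axis=1)))
        gains = -np.diff(distortions)            # improvement per added cluster
        for k, gain in enumerate(gains, start=2):
            if gain < 0.1 * distortions[0]:      # assumed "elbow" threshold
                return k - 1
        return k_max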

Vowel phonemes form a part of any acoustic speech signal. Vowel sounds occur in speech frequently and with higher energy. Therefore, vowel phonemes can be used to extract speaker-discriminative information in situations where the acoustic information is corrupted by noise. The use of vowel sounds as a basis for speaker identification was initiated long ago by the Speech Processing Group, University of Auckland, New Zealand [4]. Since then, phoneme recognition algorithms and related techniques have received considerable attention in the domain of speaker recognition and have even been extended to the linguistic arena. The role of the vowel phoneme is still an open issue in the field of speaker verification and identification, because vowel phoneme-based pronunciation varies with regional and linguistic diversity. Hence, segmented vowel speech slices can be used to track regional variation in the way a speaker speaks the language. This is all the more true for a language like Assamese, spoken by over three million people in the northeast (NE) state of Assam, whose huge linguistic and cultural diversity has influenced the way people speak the language. Therefore, an efficient vowel segmentation technique should be effective in a speaker identification system. The third part of this work experiments with a vowel phoneme segmentation-based speaker identification technique. A separate database of vowel phonemes is created from samples obtained from Assamese speakers. This clean vowel database is used to design an LVQ-based codebook by means of the LP residue and the empirical mode decomposition (EMD) residue. The LP error sequence carries the speaker's source information, since the vocal tract effect has been subtracted out, and therefore it can be used as an effective feature for speaker recognition. EMD, a newer method, is used to extract the residual content from the speech signal. EMD is an adaptive tool for analyzing nonlinear or non-stationary signals, which segregates the constituent parts of the signal based on its local behavior. Using EMD, a signal can be decomposed into a number of frequency modes called intrinsic mode functions (IMFs) and a residue [5]. Thus, the LVQ codebook contains a unique entry for each speaker in terms of vowel sound source patterns. Speaker identification is carried out by first segmenting the vowel sound from the speaker's speech signal with the proposed SOM-based segmentation technique and then matching the vowel pattern against the LVQ codebook.
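As a rough illustration of the first of these residues, the sketch below inverse-filters a speech segment with its own LP coefficients so that the vocal tract contribution is removed and the source component remains; the LP order of 12 is an assumption. The EMD residue could be obtained analogously with an EMD implementation such as the PyEMD package.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lp_residual(signal, order=12):
        # Inverse filtering with A(z) = 1 - sum_k a_k z^(-k) removes the
        # vocal tract contribution and leaves the source (residual) signal.
        s = signal - np.mean(signal)
        r = np.correlate(s, s, mode="full")[len(s) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        return lfilter(np.concatenate(([1.0], -a)), [1.0], s)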

A major part of the work is related to phoneme boundary segmentation. In the next section, we survey techniques for phoneme boundary segmentation that are closely related to the present work.

    1.2 Phoneme Boundary Segmentation: Present Technique

Automatic segmentation of speech signals into phonemes is necessary for recognition and synthesis, since it removes the manpower and time required by manual segmentation. Two types of segmentation technique are mostly found in the literature: implicit segmentation and explicit segmentation [6]. Implicit segmentation methods split up the utterance into segments without explicit information such as a phonetic transcription; a segment is defined implicitly as a spectrally stable part of the signal (a minimal sketch of this idea follows). Explicit segmentation methods split up the incoming utterance into segments that are explicitly defined by the phonetic transcription. Both methods have their respective advantages and disadvantages. In explicit segmentation, the segment boundaries may be inaccurate due to a possible poor resemblance between the reference and test spectra, since the segments are labeled in accordance with the phonetic transcription. Segmentation can further be performed at the phone, syllable, or word level. Here we present a brief literature review of various speech segmentation methods.
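The sketch below marks a candidate boundary wherever the spectra of neighboring frames differ strongly; the cosine-distance measure and its threshold are illustrative choices, not a reconstruction of any particular published method.

    import numpy as np

    def implicit_boundaries(frames, threshold=0.3):
        # frames: (T, D) array of per-frame log-magnitude spectra (or MFCCs)
        boundaries = []
        for t in range(1, len(frames)):
            a, b = frames[t - 1], frames[t]
            cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            if 1.0 - cos > threshold:        # spectrally unstable frame pair
                boundaries.append(t)
        return boundaries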

1. In 1991, Hemert reported a work [6] on automatic segmentation of speech, where the author combined an implicit and an explicit method to improve the final segmentation results. First, an implicit segmentation algorithm splits up the utterance into segments on the basis of the degree of similarity between the frequency spectra of neighboring frames. Secondly, an explicit algorithm does the same, but this time on the basis of the degree of similarity between the frequency spectra of the frames in the utterance and reference spectra. A combination algorithm compares the two segmentation results and produces the final segmentation.

2. Deng et al. reported another work on using wavelet probing for compression-based segmentation in 1993, where the authors described how wavelets can be used for data segmentation. The basic idea is to split the data into smooth segments that can be compressed separately. A fast algorithm that uses wavelets on closed sets and wavelet probing is described in that work [7].

3. In 1994, in a work reported by Tang et al. [8], the design of a hearing aid device based on the wavelet transform is explained. The fast wavelet transform is used in the work to decompose speech into different frequency components.

4. Wendt et al. reported a work on pitch determination and speech segmentation using the discrete wavelet transform (DWT) in 1996. They proposed a time-based event detection method for finding the pitch period of a speech signal. Based on the DWT, it detects voiced speech, which is local in frequency, and determines the pitch period. This method is computationally inexpensive, and through simulations and real-speech experiments the authors show that it is both accurate and robust to noise [9].

5. Another work is reported by Suh and Lee in [10], where the authors proposed a new method of phoneme segmentation using the multilayer perceptron (MLP), which is a feedforward (FF) ANN. The structure of the proposed segmenter consists of three parts: preprocessor, MLP-based phoneme segmenter, and postprocessor. The preprocessor utilizes a sequence of 44-order feature parameters for each frame of speech, based on acoustic–phonetic knowledge.

6. An automatic method for delineating the temporal boundaries of syllabic units in continuous speech using a temporal flow model (TFM) and modulation-filtered spectral features is described by Shastri et al. [11]. The TFM is an ANN architecture that supports arbitrary connectivity across layers, provides feedforward (FF) as well as recurrent links, and allows variable propagation delays along links. They developed two TFM configurations, global and tonotopic, and trained them on a phonetically transcribed corpus of telephone and address numbers spoken over the telephone by several hundred individuals of variable dialect, age, and gender. The networks reliably detected the boundaries of syllabic entities with an accuracy of 84 %.

7. In 2002, Gomez and Castro proposed a work on automatic segmentation of speech at the phonetic level [12]. Here, the phonetic boundaries are established using a dynamic time warping (DTW) algorithm that uses the a posteriori probabilities of each phonetic unit given the acoustic frame. These a posteriori probabilities are calculated by combining the probabilities of acoustic classes, which are obtained from a clustering procedure on the feature space, with the conditional probabilities of each acoustic class with respect to each phonetic unit. The usefulness of the approach presented in the work is that manually segmented data are not needed in order to train acoustic models.

8. Nagarajan et al. [13] reported a minimum phase group delay-based approach to segment spontaneous speech into syllable-like units. Here, three different minimum phase signals are derived from the short-term energy functions of three subbands of the speech signal, treating each energy function as if it were a magnitude spectrum. The experiments are carried out on the Switchboard and OGI Multi-language Telephone Speech corpora, and the segmentation error is found to be at most 40 ms for 85 % of the syllable segments.


9. Ziolko et al. reported a work where they applied the DWT to speech signals and analyzed the resulting power spectrum and its derivatives to locate candidates for the boundaries of phonemes in continuous speech (a generic sketch of this idea appears after this list). They compared the results with hand segmentation and constant segmentation over a number of words. The method proved to be effective for finding most phoneme boundaries. The work was published in 2006 [14].

10. In 2006, Awais et al. reported another work [15] where the authors described a phoneme segmentation algorithm that uses the fast Fourier transform (FFT) spectrogram. The algorithm has been implemented and tested on utterances of continuous Arabic speech from 10 male speakers, containing almost 2,346 phonemes in total. The recognition system determines the phoneme boundaries and identifies them as pauses, vowels, and consonants. The system uses intensity and phoneme duration to separate pauses from consonants. Intensity in particular is used to detect two specific consonants (/r/, /h/) when they are not detected through the spectrographic information.

11. In 2006, Huggins-Daines et al. reported another work [16], where the authors described an extension to the Baum–Welch algorithm for training Hidden Markov Models (HMMs) that uses explicit phoneme segmentation to constrain the forward and backward lattice. The HMMs trained with this algorithm can be shown to improve the accuracy of automatic phoneme segmentation.

12. Zibert et al. reported a work [17] in 2006, where they proposed a new, high-level representation of audio signals based on phoneme recognition features suitable for speech and non-speech discrimination tasks. The authors developed a representation where just one model per class is used in the segmentation process. For this purpose, four measures based on consonant–vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation–classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features are better than standard mel-frequency cepstral coefficients and posterior probability-based features (entropy and dynamism). The proposed features proved to be more robust and less sensitive to different training and unforeseen conditions. They claimed that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic and high-level recognition features, which they derived by performing additional experiments with fusion models based on cepstral and the proposed phoneme recognition features.

13. An improved HMM/support vector machine (SVM) method for a two-stage phoneme segmentation framework, which attempts to imitate the human phoneme segmentation process, is described by Kuo et al. [18]. The first stage of the method performs HMM forced alignment according to the minimum boundary error (MBE) criterion. The objective of the work was to align a phoneme sequence of a speech utterance with its acoustic signal counterpart based on MBE-trained HMMs and explicit phoneme duration models. The second stage uses the SVM method to refine the hypothesized phoneme boundaries derived by HMM-based forced alignment [18].


14. Another work on phone-level speech segmentation is reported by Almpanidis and Kotropoulos [19], where the authors employed the Bayesian information criterion corrected for small samples and modeled speech samples with the generalized Gamma distribution, which offers a more efficient parametric characterization of speech in the frequency domain than the Gaussian distribution. A computationally inexpensive maximum likelihood approach is used for parameter estimation. The proposed adjustments yield significant performance improvement in noisy environments.

15. Qiao [20] reported a work on unsupervised optimal phoneme segmentation. The work formulates optimal segmentation in a probabilistic framework. Using statistics and information theory analysis, the author developed three optimal objective functions, namely mean square error (MSE), log determinant (LD), and rate distortion (RD). In particular, the RD objective function is defined based on information RD theory and can be related to human speech perception mechanisms. To optimize these objective functions, the author used a time-constrained agglomerative clustering algorithm. The author also proposed an efficient method to implement the algorithm by using integration functions.

16. In 2008, Qiao et al. reported a work [21] on unsupervised optimal phoneme segmentation which assumes no knowledge of linguistic content or acoustic models. The work formulated the optimal segmentation problem in a probabilistic framework. Using statistics and information theory analysis, they developed three different objective functions, namely summation of square error (SSE), LD, and RD. The RD function is derived from information RD theory and can be related to human signal perception mechanisms. They introduced a time-constrained agglomerative clustering algorithm to find the optimal segmentations and proposed an efficient method to implement the algorithm by using integration functions. The experiments were carried out on the TIMIT database to compare the above three objective functions, and RD provides the best performance.

17. A text-independent automatic phone segmentation algorithm based on the Bayesian information criterion was reported in 2008 by Almpanidis and Kotropoulos [22]. In order to detect the phone boundaries accurately, the authors employ an information criterion corrected for small samples while modeling speech samples with the generalized Gamma distribution, which offers a more efficient parametric characterization of speech in the frequency domain than the Gaussian distribution. Using a computationally inexpensive maximum likelihood approach for parameter estimation, they evaluated the efficiency of the proposed algorithm on the M2VTS and NTIMIT datasets and demonstrated that the proposed adjustments yield significant performance improvements in noisy environments.

18. Miller and Stoytchev proposed an algorithm for the unsupervised segmentation of audio speech in 2008, based on the voting experts (VE) algorithm, which was originally designed to segment sequences of discrete tokens into categorical episodes. They demonstrated that the procedure is capable of inducing breaks with an accuracy substantially greater than chance and suggested possible avenues of exploration to further increase the segmentation quality [23].


19. Jurado et al. described a work [24] in 2009 on text-independent speech segmentation using an improved method for the identification of phoneme boundaries. The modification is based on the distance calculation and the selection of boundary candidates. From the distances computed among the MFCC features of frames, prominent values that mark transitions between phonemes, and thus identify phoneme boundaries, are generated. The modification improves the segmentation process on English and Spanish corpora.

20. In 2009, Patil et al. reported a work [25] on improving the robustness of phonetic segmentation to accent and style variation with a two-stage approach combining HMM broad-class recognition with acoustic–phonetic knowledge-based refinement. The system is evaluated for phonetic segmentation accuracy in the context of accent and style mismatches with the training data.

21. In 2010, Bharathi et al. proposed a novel approach to automatic phoneme segmentation that hybridizes the best phoneme segmentation algorithms: HMM, the Gaussian Mixture Model (GMM), and Brandt's Generalized Likelihood Ratio (GLR) [26].

22. Ziolko et al. [27] reported a non-uniform speech segmentation method using perceptual wavelet decomposition in 2010, which is used for the localization of phoneme boundaries. They chose eleven subbands by applying the mean best basis algorithm. A perceptual scale is used for the decomposition of speech via the Meyer wavelet in the wavelet packet structure. A real-valued vector representing the digital speech signal is decomposed into phone-like units by placing segment borders according to the result of the multiresolution analysis. The final decision on the localization of the boundaries is made by analyzing the energy flows among the decomposition levels.

23. Kalinli reported a work [28] on automatic phoneme segmentation using auditory attention features in 2012. The auditory attention model can successfully detect salient sounds in an acoustic scene by capturing the changes that make such salient events perceptually different from their neighbors. Therefore, in this work, the author uses it as an effective solution for detecting phoneme boundaries from the acoustic signal. The proposed phoneme segmentation method requires neither transcriptions nor acoustic models of phonemes.

24. By mimicking human auditory processing, King et al. reported a work [29] on speech segmentation where phone boundaries are located without prior knowledge of the text of an utterance. A biomimetic model of human auditory processing is used in this work to calculate the neural features of frequency synchrony and average signal level. Frequency synchrony and average signal level are used as input to a two-layered SVM-based system to detect phone boundaries. Phone boundaries are detected with 87.0 % precision and 84.8 % recall when the automatic segmentation system has no prior knowledge of the phone sequence in the utterance. The work was published in 2013.
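To make the wavelet-based methods in this list concrete (items 9 and 22), the following generic sketch computes per-frame subband powers from a DWT using the PyWavelets package and flags frames where the power distribution changes sharply. The wavelet, decomposition depth, frame size, and thresholding rule are all assumptions and do not reproduce any cited system.

    import numpy as np
    import pywt

    def dwt_boundary_candidates(signal, wavelet="db4", levels=4, frame=160):
        coeffs = pywt.wavedec(signal, wavelet, level=levels)
        # per-frame power in each subband, resampled to a common time axis
        n_frames = len(signal) // frame
        power = np.zeros((len(coeffs), n_frames))
        for i, c in enumerate(coeffs):
            hop = max(1, len(c) // n_frames)
            for t in range(n_frames):
                power[i, t] = np.sum(c[t * hop:(t + 1) * hop] ** 2)
        # a large change of (log) subband power between successive frames
        # in any subband suggests a phoneme transition
        d = np.abs(np.diff(np.log1p(power), axis=1)).sum(axis=0)
        threshold = d.mean() + d.std()
        return [t * frame for t in range(1, n_frames) if d[t - 1] > threshold]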


    1.3 ANN as a Speech Processing and Recognition Tool

Speech is a naturally acquired and commonly used means of interpersonal communication. The ability to adopt speech as a mode of interpersonal communication develops in stages from a person's birth and is exercised so smoothly that it hides the complexities associated with its generation. The generation of speech and its use for communication involve the articulation, within the human vocal tract, of different biological organs under conscious control, affected by factors ranging from gender to upbringing to emotional state. As a result, vocalizations vary with respect to accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during transmission, the irregular speech patterns can be further distorted by background noise and echoes, as well as by electrical characteristics (if telephones or other electronic equipment are used). All these sources of variability make speech recognition, even more than speech generation, a very complex problem. Yet people recognize speech far more effectively than conventional computing systems do. This is because, while conventional computers are capable of number-crunching operations, the human brain excels at pattern recognition, which is a cognitive process. The brain adopts a connectionist approach to computation, relating past knowledge to the present stimulus to arrive at a decision. The cognitive properties demonstrated by the human brain are based on parallel and distributed processing carried out by a network of biological neurons connected by neurochemical synaptic links (synapses), which undergo modification with learning and experience, directly support the integration of multiple constraints, and provide a distributed form of associative memory. These attributes have motivated the adoption of learning-based systems like the ANN for speech recognition applications.

The initial attempts to use ANNs in speech recognition were a conscious effort to treat the entire issue as a pattern recognition problem. Since speech is a pattern and ANNs are efficient at recognizing patterns, the challenge of dealing with speech recognition was simplified to the recognition of speech samples by an ANN trained for the purpose. The earliest attempts could be summarized as tasks like classifying speech segments as voiced/unvoiced, or nasal/fricative/plosive. Next, ANNs became popular for phoneme classification as part of larger speech recognition and related applications. Traditionally, ANNs are adopted for speech applications in two broad forms. In the static form, the ANN receives the entire speech sample as a single pattern, and the decision is just a pattern association problem. A simple but elegant experiment was performed by Huang and Lippmann in 1988 to show the effectiveness of the ANN, in a rudimentary form, at classifying speech samples. This attempt used an MLP with only 2 inputs, 50 hidden units, and 10 outputs, trained with vowels uttered by men, women, and children, with training sessions sustained up to 50,000 iterations. Later, Elman and Zipser in 1987 trained a network to classify the vowels using a more complex setup. Such an approach is satisfactory for phoneme classification but fails with continuous speech. In such a case, a dynamic approach is taken, where a sliding window rides over a speech sample and feeds a portion of the extract to the ANN, which is provided with the ability to capture temporal variations in the speech inputs (a minimal sketch of this windowing follows). The ANN provides a series of local decisions which combine to generate a global output. Using the dynamic approach, Waibel et al. in 1987–1989 demonstrated the effectiveness of the time delay neural network (TDNN), a variation of the MLP, in dealing with phoneme classification.
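A minimal sketch of the windowing step, assuming per-frame feature vectors are already available; the window width of five frames is an illustrative choice:

    import numpy as np

    def sliding_windows(features, width=5):
        # features: (T, D) array of per-frame feature vectors; each window
        # becomes one input pattern whose local decision contributes to the
        # global output of the dynamic-mode ANN
        for t in range(len(features) - width + 1):
            yield features[t:t + width].reshape(-1)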

In other attempts, Kohonen's electronic typewriter [30] used the clustering and classification characteristics of the SOM to obtain an ordered map from a sequence of feature vectors. In both the TDNN [31] and Kohonen's electronic typewriter, time warping was a major issue. To integrate large time spans in ANN-based speech recognition, hybrid approaches have been adopted. Traditionally, in such hybrid forms, either HMM models [32] or time delay warping (TDW) [33] measure procedures are used immediately after the phoneme recognition block, which provides better results. Lately, Deep Neural Networks (DNNs) with many hidden layers, trained using new methods, have been shown to produce better results than GMMs in speech recognition applications [34]. These are being adopted for acoustic modeling in speech recognition.

    1.3.1 Speech Recognition Using RNN

ANNs are learning-based prediction tools which have been considered for a host of pattern classification applications, including speech recognition. As shall be discussed later, ANNs learn the patterns applied to them, retain the learning, and use it subsequently. One of the most commonly used ANN structures for pattern classification is the MLP, which has found application in speech synthesis and recognition [35–37] as well. Initial attempts at using the MLP for speech applications revolved around the assumption that speech, as a sample, should be learnt and discriminated by the MLP. MLPs work well as effective classifiers for vowel sounds with stationary spectra, while their phoneme-discriminating power deteriorates for consonants characterized by variations of short-time spectra [38]. A simple but elegant experiment was performed by Huang and Lippmann in 1988 to establish the effectiveness of the ANN in a rudimentary feedforward form for classifying speech samples. In this case, the MLP was designed with only 2 inputs, 50 hidden units, and 10 outputs, trained with vowels uttered by men, women, and children. The learning involved training sustained up to 50,000 iterations. Later, Elman and Zipser in 1987 configured and trained a network to classify the vowels using a more complex setup. Such an approach is satisfactory for phoneme classification but fails with continuous speech. As an optimal solution to these situations, a dynamic approach is formulated where a sliding window rides over a speech sample and feeds a portion of the extract to the ANN. The ANN is provided with the ability to capture temporal variations in the speech inputs. It derives a series of local decisions that combine to generate a global output. Certain variations of the basic MLP-based approach were executed to make it suitable for dynamic situations. Using such a dynamic approach, Waibel et al. in 1987–1989 demonstrated the effectiveness of the TDNN, a variation of the MLP, intended to deal with phoneme classification.

The TDNN has memory or delay blocks either at the input or at the output or both. These add to the temporal processing ability of the TDNN, but the computational requirements rise significantly. Programming the TDNN is a time-consuming process, and hence it is not considered for real-time applications. Since the TDNN is computationally much more demanding than the MLP, TDNN-based real-time applications slow down considerably, so the TDNN has been discarded as a viable option. In such a situation, the RNN emerges as a viable alternative. Its design enables it to deal with dynamic variations in the input. It has feedforward and feedback paths which contribute to its temporal processing ability. The feedforward paths make it like the MLP; hence, it is able to make nonlinear discriminations between boundaries using gradient descent-based learning. The feedback paths, in turn, enable the RNN to perform contextual processing. The work of the RNN therefore involves generating a combined output from the feedforward and feedback paths and from the information content fed by a state vector representing the current or contextual portion of the sample for which the response is being generated (a minimal sketch follows below). The key difference compared to the MLP is the contextual processing, which circulates the most relevant portion of the information among the different layers of the network and the constituent neurons. Further, in many situations, owing to inversions in the applied patterns during contextual processing, differential-mode learning at the local level of neurons enables the RNN to consider only the most relevant portion of the data. Using different types of activation functions at different layers strengthens the contextual and differential processing. For example, in a three-hidden-layer RNN, giving the first layer a tan-sigmoidal activation function, the second a log-sigmoidal one, and the third a tan-sigmoidal one enables better learning. The least correlated portion of the patterns is retained and circulated, and the portions with similarity are discarded. As a result, the RNN becomes a fast learner and tracks time-dependent variations. The RNN uses its feedforward and feedback paths to track finer variations in the input. The feedback paths are sometimes passed through memory blocks, which allow delayed versions of the processed output to be considered for processing. These variations can be due to the time-dependent behavior of the input patterns. So while the MLP is only able to discriminate between applied samples, the RNN is able to distinguish classes that show time variations.

For the above-mentioned attributes, RNNs are found to be suitable for applications like speech recognition [37]. RNNs were first applied to speech recognition in [39]. Other important works include [40–47]. In [40], a fully connected RNN is used for speech modeling. By using discriminative training, each RNN speech model is adjusted to reduce its distance from the designated speech unit while increasing its distances from the others. In [41] and [42], the RNN is used for phone recognition. In [43], an Elman network is used for phoneme recognition with HMM-based postprocessing, whereas in [44] and [45], HMM–RNN hybrid systems are explained. In [46], an RNN is used explicitly to model the Markovian dynamics of a set of observations through a nonlinear function with a much larger hidden state space than traditional sequence models such as an HMM. Here, pretraining principles used for DNNs and second-order optimization techniques are used to train the RNN model. In [47], a contextual real-valued input vector is provided in association with each RNN-based word model.
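The following minimal Elman-style forward pass illustrates the interplay of feedforward and feedback (context) paths described above. Layer sizes, initialization, and the tanh activation are illustrative assumptions, not the configuration used in this book:

    import numpy as np

    class ElmanRNN:
        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))       # input -> hidden
            self.W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
            self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
            self.n_hidden = n_hidden

        def forward(self, sequence):
            h = np.zeros(self.n_hidden)         # state vector (contextual portion)
            outputs = []
            for x in sequence:                  # one feature vector per frame
                # feedforward path plus feedback path through the context h
                h = np.tanh(self.W_in @ x + self.W_rec @ h)
                outputs.append(self.W_out @ h)  # local decision per time step
            return outputs                      # combined into a global decision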

    1.3.2 Speech Recognition Using SOM

Among the various tools developed during the last few decades for pattern recognition purposes, ANNs have received attention primarily because they not only can learn from examples but can also retain and use the knowledge subsequently. One such approach requires the continuous feeding of examples so that the ANN executes the learning in a supervised mode. A concurrent approach avoids the requirement of continuously feeding the training examples. This is possible through the use of unsupervised learning, as demonstrated by the SOM. Proposed by Kohonen, the SOM has a feedforward structure with a single computational layer of neurons arranged in rows and columns. Each neuron in the computational layer is fully connected to all the source units by connectionist weights and follows a philosophy of self-organization. The objective is to achieve a dimensionality reduction. The unsupervised learning process groups the features from the continuous input space into discrete units based on certain selection criteria. The SOM converts a large set of input vectors by finding a smaller set of prototypes so as to provide a proper approximation to the original input space. This was the basis of using the SOM for the design of the phonetic typewriter by Kohonen (a minimal training sketch follows). The typewriter was designed to work in real time by recognizing phonemes from continuous speech so that a text equivalent could be generated. The feature set was generated using a combination of filtering and Fourier transforming of data sampled every 9.83 ms from spoken words, which produced a set of 16-dimensional spectral vectors. These constituted the input segment to the SOM, while the output was formed by a grid of nodes arranged in an 8-by-12 structure. Time-sliced waveforms of the speech were used to train the SOM to constitute certain clusters of phonemes. This enabled the SOM to reorganize and form nodes linked to specific phonemes in an ideal mode. The training was carried out in two stages. During the first stage, speech feature vectors were applied to the SOM and a convergence to an ideal state was generated. During the second stage, the nodes of the SOM used certain labeling linked to specific phonemes; the phoneme features were mapped to a certain label assigned to a node of the SOM. The nodes reorganized during training and provided the optimum topology using a combination of clustering and classification methods. At the end of the training, continuous speech samples were applied, which produced a string of labels. These label strings were next compared with the ideal case using a Euclidean distance measure to establish the effectiveness of the training. The recognition segment was performed by a combination of the HMM and the dynamically expanding context (DEC) approach. Sequences of speech with phonemes of typically 40–400 ms length
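As referenced above, a minimal SOM training sketch, assuming Euclidean best-matching and exponentially decaying learning rate and neighborhood schedules; the grid mirrors the 8-by-12 typewriter layout, while all other parameters are illustrative:

    import numpy as np

    def train_som(data, rows=8, cols=12, iters=5000, lr0=0.5, sigma0=3.0):
        # data: (N, D) array of spectral feature vectors
        rng = np.random.default_rng(0)
        weights = rng.normal(0, 0.1, (rows, cols, data.shape[1]))
        grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                    indexing="ij"), axis=-1)
        for t in range(iters):
            x = data[rng.integers(len(data))]
            # best matching unit: node whose weight vector is closest to x
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # shrink the learning rate and neighborhood radius over time
            lr = lr0 * np.exp(-t / iters)
            sigma = sigma0 * np.exp(-t / iters)
            dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]  # neighborhood weights
            weights += lr * h * (x - weights)                 # pull nodes toward x
        return weights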